In this article we will look at the options we have for
splitting small and large XML files using Mule.
Since we are talking about dealing with large XML files, we need
to understand a bit about DOM vs. SAX parsing. Both DOM and SAX parsers are
extensively used to read and parse XML files.
DOM stands for Document Object Model. It represents an XML document as a tree,
with each element forming a branch. A DOM parser builds an in-memory tree
representation of the whole XML file before any of it can be processed, so it
needs more memory, and it is advisable to increase the heap size to avoid an
OutOfMemoryError (Java heap space). Parsing with a DOM parser is quite fast if
the XML file is small, but a large XML file is likely to take a long time, or
may fail to load at all, simply because of the memory needed to build the DOM
tree.
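To make the memory cost concrete, here is a minimal sketch of DOM parsing with the JDK's built-in parser. The class name DomExample, the method countSchools and the root element Schools are assumptions made for illustration; only the School_Name and Rating tags come from the example used later in this article.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomExample {
    // Loads the ENTIRE document into memory as a tree, then queries the tree.
    static int countSchools(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // The whole tree already exists at this point, whatever we ask for next.
            NodeList schools = doc.getElementsByTagName("School_Name");
            return schools.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample data; the <Schools> root is an assumption.
        String xml = "<Schools>"
                + "<School_Name><Rating>5</Rating></School_Name>"
                + "<School_Name><Rating>4</Rating></School_Name>"
                + "</Schools>";
        System.out.println(countSchools(xml));
    }
}
```

Note that the tree is built before the first query runs, which is why memory use grows with the size of the file rather than with the amount of data we actually read.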
SAX stands for Simple API for XML. It is an event-based XML parser that reads
the XML file sequentially, which makes it much more suitable for large XML
files. A SAX parser fires an event when it encounters the start of an element
(along with its attributes), character data or the end of an element, and the
application handles each event as it arrives. It's recommended to use a SAX
parser for large XML files because it doesn't need to load the whole file into
memory and can read a big XML file in small parts. One disadvantage of SAX is
that it requires more custom Java code than DOM.
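The extra coding that SAX asks for looks roughly like the sketch below, which collects the text of every Rating element as the events arrive. The class name SaxExample and the method readRatings are assumptions made for illustration, as is the idea of collecting ratings into a list.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxExample {
    // Streams through the document and keeps only the text of <Rating> elements.
    static List<String> readRatings(String xml) {
        List<String> ratings = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inRating;
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("Rating".equals(qName)) {
                    inRating = true;
                    text.setLength(0); // start collecting a fresh value
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one text node, so we append.
                if (inRating) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("Rating".equals(qName)) {
                    ratings.add(text.toString().trim());
                    inRating = false;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return ratings;
    }
}
```

Nothing but the current rating is held in memory at any time. The price is that we have to track state (inRating) ourselves, which is the custom coding mentioned above.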
Having learned about DOM vs. SAX, let's get started with the
options we have in Mule for splitting XML files. Mule provides an out-of-the-box
flow control, the Splitter, which takes an XML file as input and outputs DOM
objects. The Splitter is simple, effective and quick to implement, provided the
source XML is small. In this example, we have a bunch of schools that we need
to split on the School_Name tag. An XPath 3 (xpath3) expression inside the
Splitter does the splitting. Because the Splitter outputs DOM, each part is
converted back to XML with the out-of-the-box DOM to XML transformer so that we
can work with the split XML. Finally, xpath3 expressions inside an Expression
component retrieve details such as Address, Rating and Contact_Info for each
school, and the info is emailed to the users (parents) who asked for it.
Expression Component:
flowVars.School_Address = xpath3('/School_Name/School_Address', payload, 'STRING');
flowVars.Rating = xpath3('/School_Name/Rating', payload, 'STRING');
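For readers who want to see the same split, serialize and extract sequence outside of Mule, here is a rough plain-Java analogue. It is not how the Splitter is implemented internally. It uses the JDK's XPath 1.0 engine rather than xpath3, and the class name XPathSplitSketch and the methods split and field are assumptions made for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSplitSketch {
    // Step 1 and 2: select every <School_Name> node (DOM), then serialize each back to XML.
    static List<String> split(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate("//School_Name",
                    new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            List<String> parts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                StringWriter out = new StringWriter();
                serializer.transform(new DOMSource(nodes.item(i)), new StreamResult(out));
                parts.add(out.toString());
            }
            return parts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not split XML", e);
        }
    }

    // Step 3: read one value from a split fragment, e.g. "/School_Name/Rating".
    static String field(String fragment, String path) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(path, new InputSource(new StringReader(fragment)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + path, e);
        }
    }
}
```

Calling split on the schools document and then field(part, "/School_Name/Rating") on each part mirrors the Splitter, DOM to XML transformer and Expression component steps. As with the Splitter, the whole document is loaded as DOM first, so the same memory caveats apply.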
The Splitter approach becomes expensive as the XML file gets
bigger. If we are using the Splitter, we need to pay attention to the Mule
infrastructure side: the number of cores and the amount of RAM allocated are key.
If resources are limited and the file size is medium to large, we can consider
the DataWeave approach instead.
Here we use DataWeave, which takes the XML file as input and
outputs a collection of Map (key/value) objects. This approach performs better
because memory consumption is significantly lower than with the DOM approach. We
iterate over the collection with a For-Each scope, one Map at a time, to
retrieve the required data and email the info.
Expression Component:
flowVars.School_Address = payload.get("School_Address");
flowVars.Rating = payload.get("Rating");
If we are dealing with a really large file, the best approach
is to implement a SAX or StAX parser in Java: invoke a Java component, pass
the payload to the Java layer and let the StAX parser handle the large file piece
by piece. MuleSoft has an excellent article on how to split a large XML file
using a StAX parser, and DZone has another great article on the differences
between DOM, SAX and StAX.
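As a rough idea of what such a Java component might do, the sketch below uses the JDK's StAX event API to cut a stream into one XML fragment per record without ever holding the whole file in memory. The class name StaxSplitSketch, the method split and the sink callback are assumptions made for illustration, not the code from the MuleSoft article. It also assumes that the record element is never nested inside itself.

```java
import java.io.Reader;
import java.io.StringWriter;
import java.util.function.Consumer;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

class StaxSplitSketch {
    // Hands each <elementName>...</elementName> fragment to sink and returns the fragment count.
    static int split(Reader source, String elementName, Consumer<String> sink) {
        try {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(source);
            XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
            StringWriter buffer = null;
            XMLEventWriter writer = null; // non-null only while we are inside a record
            int count = 0;
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (writer == null && event.isStartElement()
                        && event.asStartElement().getName().getLocalPart().equals(elementName)) {
                    buffer = new StringWriter();
                    writer = outFactory.createXMLEventWriter(buffer);
                }
                if (writer != null) {
                    writer.add(event); // copy the event into the current fragment
                    if (event.isEndElement()
                            && event.asEndElement().getName().getLocalPart().equals(elementName)) {
                        writer.close();
                        sink.accept(buffer.toString());
                        writer = null; // the fragment can now be garbage collected
                        count++;
                    }
                }
            }
            reader.close();
            return count;
        } catch (Exception e) {
            throw new IllegalStateException("Could not split XML stream", e);
        }
    }
}
```

In a real flow the source would be a reader over the file or payload stream, and the sink would pass each fragment on for processing (for example the email step), so only one school record is in memory at a time.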