Monday, April 29, 2019

Splitting large XML files using Mule


In this article we will learn all the options we have to split small and large XML files using Mule.

Since we are dealing with large XML, we first need to understand a bit about DOM versus SAX parsing. Both DOM and SAX parsers are widely used to read and parse XML files.

DOM stands for Document Object Model. It represents an XML document as a tree, with each element forming a branch. A DOM parser builds an in-memory tree of the entire XML file before you can work with it, so it needs more memory, and it is advisable to increase the heap size to avoid an OutOfMemoryError (Java heap space). Parsing with DOM is quite fast when the XML file is small, but with a large file it is likely to take a long time, or fail to load the file at all, simply because building the DOM tree requires a lot of memory.
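To make the memory cost concrete, here is a minimal sketch of DOM parsing in plain Java using the JDK's built-in parser. The class name DomSchoolReader and the Schools root element are assumptions made for illustration; only School_Name and Rating come from the example used later in this article.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads every Rating with a DOM parser.
class DomSchoolReader {

    // Assumes input shaped like <Schools><School_Name><Rating>..</Rating></School_Name>...</Schools>
    static List<String> ratings(String xml) {
        try {
            // parse() builds the complete tree in memory before returning
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            NodeList schools = doc.getElementsByTagName("School_Name");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < schools.getLength(); i++) {
                Element school = (Element) schools.item(i);
                // First Rating child of this school
                result.add(school.getElementsByTagName("Rating").item(0).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

The whole document is held in the Document object for as long as it is referenced, which is why heap usage grows with file size.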

SAX stands for Simple API for XML. It is an event-based parser that reads the XML file sequentially, which makes it well suited to large files. A SAX parser fires an event whenever it encounters the start of an element, the end of an element, or character data (attributes are delivered with the start-element event), and your code handles each event as it arrives. SAX is recommended for large XML files because it does not need to load the whole file into memory and can read a big file in small parts. One disadvantage is that reading XML with SAX requires more custom Java code than DOM does.
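The event-driven style can be sketched in plain Java as follows. This is only an illustration under the same assumptions as before: SaxSchoolReader is a hypothetical name, and the handler collects the text of every Rating element without ever building a tree.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reads every Rating with a SAX parser.
class SaxSchoolReader {

    static List<String> ratings(String xml) {
        final List<String> result = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside <Rating>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("Rating".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one text node, so append
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("Rating".equals(qName)) {
                    result.add(text.toString());
                    text = null;
                }
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return result;
    }
}
```

Compared with DOM, the extra code is the state tracking inside the handler; in exchange, memory use stays roughly constant regardless of file size.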

Having learned about DOM versus SAX, let's get started with the options Mule offers for splitting XML files. Mule provides an out-of-the-box flow control, the Splitter, which takes an XML file as input and outputs DOM objects. The Splitter is simple, effective, and quick to implement, provided the source XML is small.

In this example, we have a set of schools that we need to split on the School_Name tag. An XPath3 expression inside the Splitter does the splitting. The Splitter outputs DOM, which we convert back to XML with the out-of-the-box DOM to XML transformer so that we can access each split piece. Finally, xpath3 expressions inside an Expression component retrieve details such as Address, Rating, and Contact_Info for each school, and we email that information to the users (parents) who requested it.







Expression Component:
flowVars.School_Address = xpath3('/School_Name/School_Address', payload, 'STRING');
flowVars.Rating = xpath3('/School_Name/Rating', payload, 'STRING');



This approach becomes expensive as the XML file gets bigger. If we use the Splitter, we need to pay attention to the Mule infrastructure: the number of cores and the RAM allocated are key. If resources are limited and the file size is medium to large, we can consider the following approach instead.




Here we use DataWeave, which takes the XML file as input and outputs a collection of Map (key/value) objects. This approach performs better because memory consumption is significantly lower than with the DOM approach. We iterate over the collection with a For-Each scope, one element at a time, to retrieve the required data and email the info.

 



Expression Component:
flowVars.School_Address = payload.get("School_Address");
flowVars.Rating = payload.get("Rating");


If we are dealing with a really large file, the best approach is to implement a SAX or StAX parser in Java. Invoke a Java component, send the payload to the Java layer, and let the StAX parser handle the large file piece by piece. Here is an excellent article from MuleSoft on how to split a large XML file using a StAX parser. Here is another great article from DZone on the differences between DOM, SAX, and StAX.
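As a rough idea of what such a Java layer might look like, here is a minimal sketch of a StAX-based splitter using only the JDK's javax.xml.stream API. It is not the code from the linked articles; the class name StaxSchoolSplitter and the callback-based design are assumptions made for illustration, and wiring it into a Mule Java component is left out.

```java
import java.io.Reader;
import java.io.StringWriter;
import java.util.function.Consumer;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: streams through the XML and hands each matching
// element (for example School_Name) to a callback as its own XML string.
class StaxSchoolSplitter {

    static void split(Reader source, String elementName, Consumer<String> onFragment) {
        try {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(source);
            XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();

            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                boolean isTarget = event.isStartElement()
                        && event.asStartElement().getName().getLocalPart().equals(elementName);
                if (!isTarget) {
                    continue; // skip everything outside the elements we split on
                }

                // Copy events for this one element into a small buffer
                StringWriter buffer = new StringWriter();
                XMLEventWriter writer = outputFactory.createXMLEventWriter(buffer);
                writer.add(event);
                int depth = 1; // 1 = inside the target element
                while (depth > 0) {
                    XMLEvent inner = reader.nextEvent();
                    if (inner.isStartElement()) {
                        depth++;
                    } else if (inner.isEndElement()) {
                        depth--;
                    }
                    writer.add(inner);
                }
                writer.flush();
                writer.close();

                // Only one fragment is held in memory at a time
                onFragment.accept(buffer.toString());
            }
            reader.close();
        } catch (Exception e) {
            throw new IllegalStateException("Could not split XML", e);
        }
    }
}
```

A caller would pass a Reader over the large file (for example a java.io.FileReader) together with a callback that processes or emails each school, so the full document never has to fit in memory.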
