Efficient processing of large and complex XML documents in Hadoop
DESCRIPTION
Many systems capture XML data in Hadoop for analytical processing. When XML documents are large and have complex nested structures, processing them repeatedly is inefficient: parsing XML is CPU intensive, and storing XML in its native form wastes space. The problem is compounded in the Big Data space, when millions of such documents have to be processed and analyzed within a reasonable time. In this talk an efficient method is proposed by leveraging the Avro storage and communication format, which is flexible, compact, and specifically built for Hadoop environments to model complex data structures. XML documents may be parsed and converted into Avro format on load, and then accessed via Hive using a SQL-like interface, Java MapReduce, or Pig. A concrete use-case is provided that validates this approach, along with variations of the same and their relative trade-offs.

TRANSCRIPT
Efficient processing of large and complex XML documents in Hadoop
Sujoe Bose, Senior Principal, Sabre Holdings. June 2013
Presentation Outline
§ Motivation
§ ETL vs. ELT
§ Avro Format
§ Mapping from XML to Avro
§ Interfaces to access Avro
§ Performance and Storage considerations
§ Other types of storage/processing formats
confidential 2
You will learn about …
§ A method to store and process complex XML data in Hadoop as Avro files
§ Interfaces to access and analyze data in Avro from Hive, Java and Pig
§ Variations of the method and their relative trade-offs in storage and processing
Motivation
§ Prevalence of XML and its derivatives
– Spurred by Web Services and SOA
– Preferred communication format until newer formats entered
– Data and logs represented in XML
§ XML – metadata combined with data
– Flexibility vs. Complexity
§ Could be arbitrarily nested and large
§ Volumes of documents – Big Data
Challenges
§ Parsing XML is CPU intensive
§ Certain parsers/parsing methods result in higher memory consumption
§ Repeated parsing for each query
§ Large and deeply nested XML makes the problem worse
§ Presence of tags in data results in high I/O due to storage size
§ Special handling of optional fields
ETL vs. ELT
§ Hadoop generally built for EL–T, aka Schema-on-Read
– Load as-is
– Transform on Access/Query
§ Compare with Data Warehouse ETL, aka Schema-on-Write
– Transform and Load
– Queries are a lot simpler
– Transformation and cleansing done a priori
Mix of ETL and ELT
ELT
§ Generally better in flexibility
§ More suitable for simpler and well-defined formats
§ More applicable for experimentation
§ XML data parsed on demand for every query

ETL
§ Generally better in performance
§ More suitable when substantial cleansing and reformatting is needed
§ Repetitive queries and production workloads
§ XML data pre-parsed to minimize resource usage
Approaches
[Diagram: two approaches to processing XML data. ETL pre-parsing: XML files are converted to Avro files, driven by an Avro schema, and the Avro data is then accessed via a Pig UDF, Hive SerDe, or MapReduce. ELT on-demand parsing: XML files are parsed at query time through a Pig UDF, Hive SerDe, or MapReduce.]
ELT
[Diagram: the same diagram with the ELT on-demand parsing path highlighted: XML files parsed at query time via Pig UDF, Hive SerDe, or MapReduce.]
ETL
[Diagram: the same diagram with the ETL pre-parsing path highlighted: XML files converted to Avro files using an Avro schema, then accessed via Pig UDF, Hive SerDe, or MapReduce.]
XML Pre-parsing
§ Nested Elements and Attributes
§ Representation of parsed XML structure
§ Enter Avro!
Avro
§ Data serialization system
§ Specifically designed for Hadoop, but also used in other environments
§ Rich data structures: Arrays, Records, Maps, etc.
§ Compact, fast, binary data format
§ Metadata stored at file level, not record level
§ Splittable – ideal for MapReduce
Avro APIs
§ Generic Objects and Pre-generated Objects
– Easy API including simple gets and puts
§ APIs in several languages
– Java
– C#
– C/C++
– Python
– Ruby
Use-case
§ FIXML – Financial Information eXchange
– http://www.fixprotocol.org/specifications/
§ XML Database Benchmark
– http://tpox.sourceforge.net/
§ Provides sample data for benchmarking
§ Data Generator for generating large and predictable datasets
FIXML
§ XML Data Generator
– http://tpox.sourceforge.net/tpoxdata.htm
§ Order: Buy and sell orders of securities
Simple mapping
| XML | Avro | Pig |
|---|---|---|
| Elements with repeated nested elements | Array | Bag |
| Elements with attributes and text elements | Record | Tuple |
| Attributes and text elements | Field | Field |
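The mapping above can be sketched in plain Python (a hypothetical helper, not the talk's actual converter): repeated child elements become lists (Avro arrays / Pig bags), elements with attributes become dicts (records / tuples), and attributes and text become plain fields.

```python
import xml.etree.ElementTree as ET

def element_to_record(elem):
    """Map an XML element to an Avro-style record (dict).

    Attributes and text become fields; repeated child tags become
    arrays (lists); a single child becomes a nested record.
    """
    record = {name: value for name, value in elem.attrib.items()}
    if elem.text and elem.text.strip():
        record["_text"] = elem.text.strip()
    children = {}
    for child in elem:
        children.setdefault(child.tag, []).append(element_to_record(child))
    for tag, items in children.items():
        # Repeated elements -> array (bag); single element -> record (tuple)
        record[tag] = items if len(items) > 1 else items[0]
    return record

doc = ET.fromstring(
    '<FIXML v="5.0"><Order ID="1" Acct="A7">'
    '<OrdQty Qty="100"/></Order></FIXML>'
)
parsed = element_to_record(doc)
# e.g. parsed["Order"]["OrdQty"]["Qty"] == "100"
```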
Avro Schema
{
  "type": "record",
  "name": "FIXOrder",
  "namespace": "com.sabre.fixml",
  "doc": "Definition and mapping for FIX Orders",
  "mapping": "/FIXML",
  "fields": [
    { "name": "v", "type": "string", "mapping": "@v" },
    { "name": "r", "type": "string", "mapping": "@r" },
    { "name": "s", "type": "string", "mapping": "@s" },
    { "name": "Order", "mapping": "Order", "type": {
        "name": "OrderRecord", "mapping": "Order", "type": "record",
        "fields": [
          { "name": "ID", "type": "string", "mapping": "@ID" },
          { "name": "ID2", "type": "string", "mapping": "@ID2" },
          { "name": "OrignDt", "type": "string", "mapping": "@OrignDt" },
          { "name": "TrdDt", "type": "string", "mapping": "@TrdDt" },
          { "name": "Acct", "type": "string", "mapping": "@Acct" },
          { "name": "AcctTyp", "type": "string", "mapping": "@AcctTyp" },
          { "name": "DayBkngInst", "type": "string", "mapping": "@DayBkngInst" },
          { "name": "BkngUnit", "type": "string", "mapping": "@BkngUnit" },
          { "name": "PreallocMeth", "type": "string", "mapping": "@PreallocMeth" },
          { "name": "AllocID", "type": "string", "mapping": "@AllocID" },
          { "name": "CshMgn", "type": "string", "mapping": "@CshMgn" },
          { "name": "ClrFeeInd", "type": "string", "mapping": "@ClrFeeInd" },
...
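The custom "mapping" entries pair each Avro field with an XPath-like location in the XML ("@name" for an attribute, a plain name for a child element). A minimal interpreter for that convention might look like this (an illustrative sketch with made-up sample data, not the talk's actual converter):

```python
import xml.etree.ElementTree as ET

def apply_mapping(elem, fields):
    """Extract a record from an XML element using per-field mappings.

    A mapping of "@name" reads an attribute; a plain name descends
    into the first child element with that tag.
    """
    record = {}
    for field in fields:
        mapping = field["mapping"]
        if mapping.startswith("@"):
            record[field["name"]] = elem.get(mapping[1:])
        else:
            child = elem.find(mapping)
            record[field["name"]] = apply_mapping(child, field["type"]["fields"])
    return record

# Tiny subset of the FIXOrder schema shown above
schema_fields = [
    {"name": "v", "type": "string", "mapping": "@v"},
    {"name": "Order", "mapping": "Order",
     "type": {"type": "record", "fields": [
         {"name": "Acct", "type": "string", "mapping": "@Acct"},
     ]}},
]
order = ET.fromstring('<FIXML v="5.0"><Order Acct="A7"/></FIXML>')
record = apply_mapping(order, schema_fields)
# record == {"v": "5.0", "Order": {"Acct": "A7"}}
```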
Pig Schema
FIXOrder: tuple (
  v: chararray,
  r: chararray,
  s: chararray,
  Order: tuple (
    ID: chararray,
    ID2: chararray,
    OrignDt: chararray,
    TrdDt: chararray,
    Acct: chararray,
    AcctTyp: chararray,
    DayBkngInst: chararray,
    BkngUnit: chararray,
    PreallocMeth: chararray,
    AllocID: chararray,
    CshMgn: chararray,
    ClrFeeInd: chararray,
Avro – Access Methods
§ Direct support for access from Hive (using SerDe)

CREATE EXTERNAL TABLE <TableName>
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'location-of-avro-files'
TBLPROPERTIES ('avro.schema.url'='location-of-schema-file.avsc')

§ Access via Pig – AvroStorage
§ Avro API – Java MapReduce
Test Data
§ Base Securities Order file: 500,000 records
§ Replicated for volume
– 15x – 7.5 million records
– 30x – 15 million records
– 45x – 22.5 million records
– 60x – 30 million records
– 75x – 37.5 million records
Comparison
[Diagram: the same ETL vs. ELT processing diagram, framing the comparison between the pre-parsing and on-demand parsing paths.]
File sizes: Orders
§ Base Data
– XML file size as-is: 749,337,916 bytes (750MB)
– Gzip compressed: 182,687,654 bytes (183MB)
§ Applied Avro conversion
– Avro + Snappy: 151,647,926 bytes (152MB)
– Avro + Gzip: 107,898,177 bytes (108MB)
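From the byte counts on this slide, the relative savings can be checked directly:

```python
# Reported file sizes in bytes (taken from the slide)
xml_raw = 749_337_916
xml_gzip = 182_687_654
avro_snappy = 151_647_926
avro_gzip = 107_898_177

# Avro+Gzip shrinks the raw XML by roughly 7x, and is still
# about 40% smaller than simply gzipping the XML in place.
print(f"raw/avro_gzip:  {xml_raw / avro_gzip:.1f}x")
print(f"gzip/avro_gzip: {xml_gzip / avro_gzip:.2f}x")
```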
Storage Size Comparison
Test Environment
§ 18 nodes
§ Node configuration:
– 12 cores per node
– 48GB memory
– 36TB with 12 disks of 3TB each
§ CDH 4.1.2
Sample Query
§ Security Orders per Account
order_records = LOAD '$AVRO_INPUT' USING AVRO_LOAD AS (
    ------- Pig Schema goes here -------
);
order_projection = FOREACH order_records GENERATE Order.Acct AS Account, Order.OrdQty.Qty AS Quantity;
order_group = GROUP order_projection BY Account;
order_count = FOREACH order_group GENERATE group, SUM(order_projection.Quantity);
STORE order_count INTO '$PIG_OUTPUT' USING PigStorage(',');
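The same per-account rollup, sketched in plain Python over parsed records (illustrative only; the field names follow the FIXML mapping above, and the sample values are made up):

```python
from collections import defaultdict

# Parsed order records, as they might look after XML-to-Avro conversion
orders = [
    {"Order": {"Acct": "A7", "OrdQty": {"Qty": 100}}},
    {"Order": {"Acct": "A7", "OrdQty": {"Qty": 50}}},
    {"Order": {"Acct": "B2", "OrdQty": {"Qty": 25}}},
]

# GROUP ... BY Account then SUM(Quantity): the Pig query's logic
qty_by_account = defaultdict(int)
for rec in orders:
    qty_by_account[rec["Order"]["Acct"]] += rec["Order"]["OrdQty"]["Qty"]

print(dict(qty_by_account))  # {'A7': 150, 'B2': 25}
```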
Run Types
§ Pre-parsed approach:
– XML to Avro materialization: xml-to-avro
• XML to Avro is run only once on the data
– Avro to Pig via UDF: avro-to-pig
§ Parse on demand:
– XML parsing using Pig UDF: xml-to-pig
Run time in Seconds

[Chart: run times for the three run types: analysis on raw XML (XML to Pig), pre-parsing XML (XML to Avro), and analysis on parsed XML (Avro to Pig).]
CPU Usage Comparison

[Chart: CPU usage for the three run types: analysis on raw XML (XML to Pig), pre-parsing XML (XML to Avro), and analysis on parsed XML (Avro to Pig).]
Memory Usage Comparison: Total Memory Used (GB)

[Chart: memory usage for the three run types: analysis on raw XML (XML to Pig), pre-parsing XML (XML to Avro), and analysis on parsed XML (Avro to Pig).]
Results
§ Analysis on pre-parsed data compared to raw XML:
– Runtime reduced by more than 50%
– Memory and CPU consumption reduced by about 50%
§ Pre-parsing stage takes more resources and time than on-demand parsing
§ Repetitive queries will benefit from one-time pre-parsing
Caveats
§ Not all fields were extracted from the XML input (optional elements)
§ Challenge in keeping up with versions/changes of the XML
§ Performance numbers depend on the type of data and the mapping used
Alternatives
§ Formats other than Avro may be more suitable
§ Record Columnar formats (RC Files and ORC Files)
§ Trevni: a column file format supporting Avro
§ Parquet: another columnar storage format for Hadoop
Motivation for Columnar Format
§ MapReduce capability
§ Column projections reduce I/O
§ Column compression, due to similarity of data, further reduces I/O
Summary
§ Materialized version well-suited for repeated queries
§ For ad-hoc/experimental queries, parse-on-demand is better
§ Mapping from XML to Avro can be automated
§ Hive, Pig, and MapReduce interfaces to access Avro files
§ Relative trade-offs between flexibility and performance/storage