Efficient processing of large and complex XML documents in Hadoop
DESCRIPTION
Many systems capture XML data in Hadoop for analytical processing. When XML documents are large and have complex nested structures, processing them repeatedly is inefficient: parsing XML is CPU intensive, and storing XML in its native form wastes space. The problem is compounded in the Big Data space, when millions of such documents have to be processed and analyzed within a reasonable time. In this talk an efficient method is proposed by leveraging the Avro storage and communication format, which is flexible, compact, and specifically built for Hadoop environments to model complex data structures. XML documents may be parsed and converted into Avro format on load, and then accessed via Hive using a SQL-like interface, Java MapReduce, or Pig. A concrete use-case is provided that validates this approach, along with variations of the same and their relative trade-offs.

TRANSCRIPT
Efficient processing of large and complex XML documents in Hadoop
Sujoe Bose, Senior Principal, Sabre Holdings. June 2013
Presentation Outline
§ Motivation
§ ETL vs. ELT
§ Avro Format
§ Mapping from XML to Avro
§ Interfaces to access Avro
§ Performance and Storage considerations
§ Other types of storage/processing formats
confidential 2
You will learn about …
§ A method to store and process complex XML data in Hadoop as Avro files
§ Interfaces to access and analyze data in Avro from Hive, Java and Pig
§ Variations of the method and their relative trade-offs in storage and processing
Motivation
§ Prevalence of XML and its derivatives
– Spurred by Web Services and SOA
– Preferred communication format until newer formats entered
– Data and logs represented in XML
§ XML – metadata combined with data
– Flexibility vs. Complexity
§ Could be arbitrarily nested and large
§ Volumes of documents – Big Data
Challenges
§ Parsing XML is CPU intensive
§ Certain parsers/parsing methods result in higher memory consumption
§ Repeated parsing for each query
§ Large and deeply nested XML makes the problem worse
§ Presence of tags in data results in high I/O due to storage size
§ Special handling of optional fields
ETL vs. ELT
§ Hadoop generally built for EL–T, aka Schema-on-Read
– Load as-is
– Transform on Access/Query
§ Compare with Data Warehouse ETL, aka Schema-on-Write
– Transform and Load
– Queries are a lot simpler
– Transformation and cleansing done a priori
Mix of ETL and ELT
ELT
§ Generally better in flexibility
§ More suitable for simpler and well-defined formats
§ More applicable for experimentation
§ XML data parsed on demand for every query

ETL
§ Generally better in performance
§ More suitable when substantial cleansing and reformatting is needed
§ Repetitive queries and production workloads
§ XML data pre-parsed to minimize resource usage
Approaches
[Diagram: two approaches to processing XML data. ETL pre-parsing: XML files are converted to Avro files, driven by an Avro schema, and the Avro data is then accessed via a Pig UDF, Hive SerDe, or MapReduce. ELT on-demand parsing: XML files are parsed at query time through a Pig UDF, Hive SerDe, or MapReduce.]
ELT
[Diagram: the same diagram with the ELT on-demand parsing path highlighted: XML files parsed at query time via Pig UDF, Hive SerDe, or MapReduce.]
ETL
[Diagram: the same diagram with the ETL pre-parsing path highlighted: XML files converted to Avro files using an Avro schema, then accessed via Pig UDF, Hive SerDe, or MapReduce.]
XML Pre-parsing
§ Nested Elements and Attributes
§ Representation of parsed XML structure
§ Enter Avro!
Avro
§ Data serialization system
§ Specifically designed for Hadoop, but also used in other environments
§ Rich data structures: Arrays, Records, Maps, etc.
§ Compact, fast, binary data format
§ Metadata stored at file level, not record level
§ Splittable – ideal for MapReduce
Avro APIs
§ Generic Objects and Pre-generated Objects
– Easy API including simple gets and puts
§ APIs in several languages
– Java
– C#
– C/C++
– Python
– Ruby
Use-case
§ FIXML – Financial Information eXchange
– http://www.fixprotocol.org/specifications/
§ XML Database Benchmark
– http://tpox.sourceforge.net/
§ Provides sample data for benchmarking
§ Data Generator for generating large and predictable datasets
FIXML
§ XML Data Generator
– http://tpox.sourceforge.net/tpoxdata.htm
§ Order: Buy and sell orders of securities
Simple mapping
| XML | Avro | Pig |
|---|---|---|
| Elements with repeated nested elements | Array | Bag |
| Elements with attributes and text elements | Record | Tuple |
| Attributes and text elements | Field | Field |
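The mapping above can be sketched in plain Python (a hypothetical helper, not the talk's actual converter): repeated child elements become lists (Avro arrays / Pig bags), elements with attributes become dicts (records / tuples), and attributes and text become plain fields.

```python
import xml.etree.ElementTree as ET

def element_to_record(elem):
    """Map an XML element to an Avro-style record (dict).

    Attributes and text become fields; repeated child tags become
    arrays (lists); a single child becomes a nested record.
    """
    record = {name: value for name, value in elem.attrib.items()}
    if elem.text and elem.text.strip():
        record["_text"] = elem.text.strip()
    children = {}
    for child in elem:
        children.setdefault(child.tag, []).append(element_to_record(child))
    for tag, items in children.items():
        # Repeated elements -> array (bag); single element -> record (tuple)
        record[tag] = items if len(items) > 1 else items[0]
    return record

doc = ET.fromstring(
    '<FIXML v="5.0"><Order ID="1" Acct="A7">'
    '<OrdQty Qty="100"/></Order></FIXML>'
)
parsed = element_to_record(doc)
# e.g. parsed["Order"]["OrdQty"]["Qty"] == "100"
```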
Avro Schema
{
  "type": "record",
  "name": "FIXOrder",
  "namespace": "com.sabre.fixml",
  "doc": "Definition and mapping for FIX Orders",
  "mapping": "/FIXML",
  "fields": [
    { "name": "v", "type": "string", "mapping": "@v" },
    { "name": "r", "type": "string", "mapping": "@r" },
    { "name": "s", "type": "string", "mapping": "@s" },
    { "name": "Order", "mapping": "Order", "type": {
        "name": "OrderRecord", "mapping": "Order", "type": "record",
        "fields": [
          { "name": "ID", "type": "string", "mapping": "@ID" },
          { "name": "ID2", "type": "string", "mapping": "@ID2" },
          { "name": "OrignDt", "type": "string", "mapping": "@OrignDt" },
          { "name": "TrdDt", "type": "string", "mapping": "@TrdDt" },
          { "name": "Acct", "type": "string", "mapping": "@Acct" },
          { "name": "AcctTyp", "type": "string", "mapping": "@AcctTyp" },
          { "name": "DayBkngInst", "type": "string", "mapping": "@DayBkngInst" },
          { "name": "BkngUnit", "type": "string", "mapping": "@BkngUnit" },
          { "name": "PreallocMeth", "type": "string", "mapping": "@PreallocMeth" },
          { "name": "AllocID", "type": "string", "mapping": "@AllocID" },
          { "name": "CshMgn", "type": "string", "mapping": "@CshMgn" },
          { "name": "ClrFeeInd", "type": "string", "mapping": "@ClrFeeInd" },
...
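The custom "mapping" entries pair each Avro field with an XPath-like location in the XML ("@name" for an attribute, a plain name for a child element). A minimal interpreter for that convention might look like this (an illustrative sketch with made-up sample data, not the talk's actual converter):

```python
import xml.etree.ElementTree as ET

def apply_mapping(elem, fields):
    """Extract a record from an XML element using per-field mappings.

    A mapping of "@name" reads an attribute; a plain name descends
    into the first child element with that tag.
    """
    record = {}
    for field in fields:
        mapping = field["mapping"]
        if mapping.startswith("@"):
            record[field["name"]] = elem.get(mapping[1:])
        else:
            child = elem.find(mapping)
            record[field["name"]] = apply_mapping(child, field["type"]["fields"])
    return record

# Tiny subset of the FIXOrder schema shown above
schema_fields = [
    {"name": "v", "type": "string", "mapping": "@v"},
    {"name": "Order", "mapping": "Order",
     "type": {"type": "record", "fields": [
         {"name": "Acct", "type": "string", "mapping": "@Acct"},
     ]}},
]
order = ET.fromstring('<FIXML v="5.0"><Order Acct="A7"/></FIXML>')
record = apply_mapping(order, schema_fields)
# record == {"v": "5.0", "Order": {"Acct": "A7"}}
```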
Pig Schema
FIXOrder: tuple (
  v: chararray,
  r: chararray,
  s: chararray,
  Order: tuple (
    ID: chararray,
    ID2: chararray,
    OrignDt: chararray,
    TrdDt: chararray,
    Acct: chararray,
    AcctTyp: chararray,
    DayBkngInst: chararray,
    BkngUnit: chararray,
    PreallocMeth: chararray,
    AllocID: chararray,
    CshMgn: chararray,
    ClrFeeInd: chararray,
Avro – Access Methods
§ Direct support for access from Hive (using SerDe)

CREATE EXTERNAL TABLE <TableName>
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'location-of-avro-files'
TBLPROPERTIES ('avro.schema.url'='location-of-schema-file.avsc')

§ Access via Pig – AvroStorage
§ Avro API – Java MapReduce
Test Data
§ Base Securities Order file: 500,000 records
§ Replicated for volume
– 15x – 7.5 million records
– 30x – 15 million records
– 45x – 22.5 million records
– 60x – 30 million records
– 75x – 37.5 million records
Comparison
[Diagram: the same ETL vs. ELT processing diagram, framing the comparison between the pre-parsing and on-demand parsing paths.]
File sizes: Orders
§ Base Data
– XML file size as-is: 749,337,916 bytes (750MB)
– Gzip compressed: 182,687,654 bytes (183MB)
§ Applied Avro conversion
– Avro + Snappy: 151,647,926 bytes (152MB)
– Avro + Gzip: 107,898,177 bytes (108MB)
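From the byte counts on this slide, the relative savings can be checked directly:

```python
# Reported file sizes in bytes (taken from the slide)
xml_raw = 749_337_916
xml_gzip = 182_687_654
avro_snappy = 151_647_926
avro_gzip = 107_898_177

# Avro+Gzip shrinks the raw XML by roughly 7x, and is still
# about 40% smaller than simply gzipping the XML in place.
print(f"raw/avro_gzip:  {xml_raw / avro_gzip:.1f}x")
print(f"gzip/avro_gzip: {xml_gzip / avro_gzip:.2f}x")
```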
Storage Size Comparison
Test Environment
§ 18 nodes
§ Node configuration:
– 12 cores per node
– 48GB memory
– 36TB with 12 disks of 3TB each
§ CDH 4.1.2
Sample Query
§ Security Orders per Account
order_records = LOAD '$AVRO_INPUT' USING AVRO_LOAD AS (
    ------- Pig Schema goes here -------
);
order_projection = FOREACH order_records GENERATE Order.Acct AS Account, Order.OrdQty.Qty AS Quantity;
order_group = GROUP order_projection BY Account;
order_count = FOREACH order_group GENERATE group, SUM(order_projection.Quantity);
STORE order_count INTO '$PIG_OUTPUT' USING PigStorage(',');
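The same per-account rollup, sketched in plain Python over parsed records (illustrative only; the field names follow the FIXML mapping above, and the sample values are made up):

```python
from collections import defaultdict

# Parsed order records, as they might look after XML-to-Avro conversion
orders = [
    {"Order": {"Acct": "A7", "OrdQty": {"Qty": 100}}},
    {"Order": {"Acct": "A7", "OrdQty": {"Qty": 50}}},
    {"Order": {"Acct": "B2", "OrdQty": {"Qty": 25}}},
]

# GROUP ... BY Account then SUM(Quantity): the Pig query's logic
qty_by_account = defaultdict(int)
for rec in orders:
    qty_by_account[rec["Order"]["Acct"]] += rec["Order"]["OrdQty"]["Qty"]

print(dict(qty_by_account))  # {'A7': 150, 'B2': 25}
```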
Run Types
§ Pre-parsed approach:
– XML to Avro materialization: xml-to-avro
• XML to Avro is run only once on the data
– Avro to Pig via UDF: avro-to-pig
§ Parse on demand:
– XML parsing using Pig UDF: xml-to-pig
Run time in Seconds

[Chart: run times for the three run types: analysis on raw XML (XML to Pig), pre-parsing XML (XML to Avro), and analysis on parsed XML (Avro to Pig).]
CPU Usage Comparison

[Chart: CPU usage for the three run types: analysis on raw XML (XML to Pig), pre-parsing XML (XML to Avro), and analysis on parsed XML (Avro to Pig).]
Memory Usage Comparison: Total Memory Used (GB)

[Chart: memory usage for the three run types: analysis on raw XML (XML to Pig), pre-parsing XML (XML to Avro), and analysis on parsed XML (Avro to Pig).]
Results
§ Analysis on pre-parsed data compared to raw XML:
– Runtime reduced by more than 50%
– Memory and CPU consumption reduced by about 50%
§ Pre-parsing stage takes more resources and time than on-demand parsing
§ Repetitive queries will benefit from one-time pre-parsing
Caveats
§ Not all fields were extracted from the XML input (optional elements)
§ Challenge in keeping up with versions/changes of the XML
§ Performance numbers depend on the type of data and the mapping used
Alternatives
§ Formats other than Avro may be more suitable
§ Record Columnar formats (RC Files and ORC Files)
§ Trevni: a column file format supporting Avro
§ Parquet: another columnar storage format for Hadoop
Motivation for Columnar Format
§ MapReduce capability
§ Column projections reduce I/O
§ Column compression, due to similarity of data, further reduces I/O
Summary
§ Materialized version well-suited for repeated queries
§ For ad-hoc/experimental queries, parse-on-demand is better
§ Mapping from XML to Avro can be automated
§ Hive, Pig, and MapReduce interfaces to access Avro files
§ Relative trade-offs between flexibility and performance/storage