Using Apache Spark as an ETL Engine: Pros and Cons

Maksym Doroshenko, Big Data Software Engineer

LeadGenius, Provectus

Agenda

1. What is Spark

2. Spark components

3. Spark pillars

4. What is an ETL pipeline

5. Using Spark SQL for ETL

6. Customer use case

7. Demo

Spark, who are you?

I am a fast and general engine for large-scale data processing.

Prove it

                Hadoop MR               Spark (100 TB)      Spark (1000 TB)
Data Size       102.5 TB                100 TB              1000 TB
Elapsed Time    72 mins                 23 mins             234 mins
# Nodes         2100                    206                 190
# Cores         50400                   6592                6080
# Reducers      10,000                  29,000              250,000
Rate            1.42 TB/min             4.27 TB/min         4.27 TB/min
Rate/node       0.67 GB/min             20.7 GB/min         22.5 GB/min
Environment     dedicated data center   EC2 (i2.8xlarge)    EC2 (i2.8xlarge)

Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.

Spark use cases

- Simplify the challenging and compute-intensive task of processing high volumes of data

- Process data in real time

- Seamlessly integrate complex capabilities such as machine learning and graph algorithms

- Spark brings Big Data processing to the masses

Survey: why do companies use Spark?

91% use Apache Spark because of its performance gains

77% use Apache Spark as it is easy to use

71% use Apache Spark due to the ease of deployment

64% use Apache Spark to leverage advanced analytics

52% use Apache Spark for real-time streaming.

Spark components

What is an RDD?

Resilient Distributed Dataset: a large collection of data with the following properties:

- Immutable

- Distributed

- Lazily evaluated

- Fault tolerant

Operations

Narrow transformations

- map

- flatMap

- filter

- etc.

Wide transformations

- reduceByKey

- groupByKey

- sortByKey

- etc.
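The difference between the two groups can be shown without a cluster. What follows is a plain-Python sketch, not Spark code: `partitions`, `narrow_map`, and `shuffle_reduce_by_key` are hypothetical names standing in for an RDD's partitions, a narrow `map`, and a wide `reduceByKey`. A narrow transformation works on each partition independently, while a wide one first has to regroup records by key across partitions (a shuffle).

```python
from collections import defaultdict

# Hypothetical stand-in for an RDD split into two partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
]


def narrow_map(parts, fn):
    """Narrow: each output partition depends on exactly one input partition."""
    return [[fn(record) for record in part] for part in parts]


def shuffle_reduce_by_key(parts, fn, num_partitions=2):
    """Wide: records with the same key must be brought together first."""
    shuffled = [defaultdict(list) for _ in range(num_partitions)]
    for part in parts:
        for key, value in part:
            # Deterministic toy partitioner: equal keys always land together.
            target = sum(key.encode()) % num_partitions
            shuffled[target][key].append(value)

    result = []
    for bucket in shuffled:
        reduced = []
        for key, values in bucket.items():
            acc = values[0]
            for value in values[1:]:
                acc = fn(acc, value)
            reduced.append((key, acc))
        result.append(sorted(reduced))
    return result


doubled = narrow_map(partitions, lambda kv: (kv[0], kv[1] * 2))
totals = shuffle_reduce_by_key(doubled, lambda x, y: x + y)
```

In this sketch `doubled` keeps the original partition layout, whereas `totals` has a new layout decided by the keys. That data movement is what makes wide transformations the expensive ones.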

Spark application tree

Spark DataFrames

A DataFrame is a distributed collection of data organized into named columns (an RDD with a schema), with more efficient storage options, an advanced optimizer, and direct operations on serialized data. These components are essential for getting the best performance out of Spark.

What is ETL?

1. A sequence of transformations on data

2. Source data is typically semi-structured or unstructured (text, JSON, CSV, etc.) or structured (JDBC, Parquet, ORC, Avro, etc.)

3. Output data is clean, structured, integrated and ready for further data processing, analysis and reporting.

ETL query in Spark

Why is ETL hard?

1. Various sources/formats

2. Schema mismatch

3. Different representations

4. Corrupted files and data

5. Scalability

6. Schema evolution

This is why ETL is important

Consumers of this data do not want to deal with this messiness and complexity

Spark SQL's flexible APIs, support for a wide variety of data sources, built-in support for Structured Streaming, state-of-the-art Catalyst optimizer, and Tungsten execution engine make it a great framework for building end-to-end ETL pipelines.

Spark SQL

Data sources

https://spark-packages.org/

Schema inference: semi-structured data

User-specified schema

- Faster: no scan to infer the schema

- More flexible: easily handles schema evolution

- More robust: surfaces type errors as early as possible
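The trade-off can be sketched in plain Python. This is not the Spark API: the sample `records`, `infer_schema`, `apply_schema`, and `USER_SCHEMA` are all assumptions made for illustration. Inference has to read every record before any work starts, while a declared schema is known up front, ignores fields it does not list, and rejects a value of the wrong type as soon as it is read.

```python
# Hypothetical semi-structured input: the second record has an extra field.
records = [
    {"name": "Ann", "age": "34"},
    {"name": "Bob", "age": "41", "email": "bob@example.com"},
]


def infer_schema(rows):
    """Inference: a full pass over the data just to learn fields and types."""
    fields = {}
    for row in rows:
        for key, value in row.items():
            fields.setdefault(key, type(value).__name__)
    return fields


# Declared schema: field name -> converter. No scan is needed to obtain it.
USER_SCHEMA = {"name": str, "age": int}


def apply_schema(row, schema):
    """Convert one record; missing fields become None, bad values raise at once."""
    typed = {}
    for field, convert in schema.items():
        raw = row.get(field)
        typed[field] = None if raw is None else convert(raw)
    return typed


inferred = infer_schema(records)
typed_rows = [apply_schema(row, USER_SCHEMA) for row in records]
```

Here `inferred` reports every field as a string because that is how the values arrived, while `typed_rows` carries `age` as an integer. A record such as `{"name": "Eve", "age": "unknown"}` would raise a `ValueError` immediately.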

Deal with bad data

java.io.IOException: org.apache.hadoop.io.compress.DecompressorStream.decompress

java.io.EOFException: Unexpected end of input stream

java.lang.RuntimeException: file:/temp/path/c000.json is not a Parquet file (too small)

[SPARK-17850] If true, Spark jobs will continue to run even when they encounter corrupt files. The contents that have already been read will still be returned.

spark.sql.files.ignoreCorruptFiles = true

Deal with bad data

[SPARK-12833] [SPARK-13764] Text file formats (JSON and CSV) support three parse modes while reading data:

- PERMISSIVE

- DROPMALFORMED

- FAILFAST
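The behaviour of the three modes can be sketched in plain Python. This illustrates the semantics only and is not Spark's reader: `parse_json_lines` is a hypothetical helper. The `_corrupt_record` field mirrors the default column name Spark uses to keep the raw text of a malformed record in PERMISSIVE mode.

```python
import json

VALID_MODES = ("PERMISSIVE", "DROPMALFORMED", "FAILFAST")


def parse_json_lines(lines, mode="PERMISSIVE"):
    """Parse one JSON document per line, handling bad lines according to mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown mode: {mode}")
    rows = []
    for line in lines:
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            if mode == "PERMISSIVE":
                # Keep the row and preserve the raw text for later inspection.
                rows.append({"_corrupt_record": line})
            elif mode == "DROPMALFORMED":
                # Silently skip the bad record.
                continue
            else:
                # FAILFAST: abort on the first bad record.
                raise ValueError(f"Malformed record: {line!r}")
    return rows


lines = ['{"id": 1}', '{"id": 2', '{"id": 3}']
permissive = parse_json_lines(lines, "PERMISSIVE")
dropped = parse_json_lines(lines, "DROPMALFORMED")
```

With this input, `permissive` keeps three rows (the middle one holding only the raw text), `dropped` keeps two, and FAILFAST raises on the second line.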

Better JSON and CSV support

[SPARK-18352] [SPARK-19610] Multiline JSON and CSV support

Spark SQL reads JSON/CSV one line at a time. Before Spark 2.2, multiline files required custom ETL.
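A small plain-Python sketch shows why the line-at-a-time assumption breaks on pretty-printed JSON. It is not Spark code, and `parse_line_by_line` and the sample document are hypothetical. In Spark 2.2 and later the corresponding reader option is `multiLine`.

```python
import json

# One JSON document spread over four lines, as a pretty-printer would write it.
pretty = '{\n  "id": 1,\n  "name": "Ann"\n}'


def parse_line_by_line(text):
    """Treat every line as a separate JSON document."""
    parsed, failed = [], 0
    for line in text.splitlines():
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            failed += 1
    return parsed, failed


line_records, line_failures = parse_line_by_line(pretty)

# Parsing the text as a whole recovers the single record.
whole_record = json.loads(pretty)
```

Read line by line, none of the four lines is valid JSON on its own, so no record is recovered. Read as one document, the record parses cleanly.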

Transformations: higher-order functions in SQL

Transformations on complex objects such as arrays, maps, and structs inside columns.

1. Check for element existence
    SELECT EXISTS(values, e -> e > 30) AS v FROM tbl_nested;

2. Transform an array
    SELECT TRANSFORM(values, e -> e * e) AS v FROM tbl_nested;

3. Filter an array
    SELECT FILTER(values, e -> e > 30) AS v FROM tbl_nested;

4. Aggregate an array
    SELECT AGGREGATE(values, 0, (acc, value) -> acc + value) AS v FROM tbl_nested;
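For readers more familiar with general-purpose languages, the four operations above map onto familiar list idioms. The following plain-Python sketch applies them to a single array. The sample `values` list is an assumption, and this is an analogy for one row, not how Spark evaluates the SQL.

```python
from functools import reduce

# Hypothetical contents of the array column for a single row.
values = [10, 20, 40]

# 1. Element existence: is any element greater than 30?
has_large = any(e > 30 for e in values)

# 2. Transform: square every element.
squared = [e * e for e in values]

# 3. Filter: keep only elements greater than 30.
large_only = [e for e in values if e > 30]

# 4. Aggregate: fold the array into one value, starting from 0.
total = reduce(lambda acc, value: acc + value, values, 0)
```

The SQL versions do the same work per row without exploding the array into separate rows and regrouping it afterwards.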

Load

Different modes:

Error, Append, Overwrite, Ignore

Wide functionality:

df.write
  .partitionBy("favorite_color")
  .bucketBy(42, "name")
  .sortBy("age")
  .saveAsTable("people_partitioned_bucketed")

Customer Use Case

- Data sources in different formats

- Map data to a golden customer schema

- Normalize all data (e.g. email, phone)

- Link and merge records that describe the same entity
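The steps above can be sketched on two toy records in plain Python. Everything here is an assumption made for illustration: the source field names, the mappings, the normalization rules, and the choice to link records by exact match on the normalized email. The actual customer pipeline is not described in the slides.

```python
import re


def normalize_email(email):
    """Trim whitespace and lowercase."""
    return email.strip().lower()


def normalize_phone(phone):
    """Keep digits only."""
    return re.sub(r"\D", "", phone)


NORMALIZERS = {"email": normalize_email, "phone": normalize_phone}


def to_golden(record, mapping):
    """Rename source fields to the golden schema, then normalize known fields."""
    golden = {}
    for source_field, golden_field in mapping.items():
        if source_field in record:
            value = record[source_field]
            normalizer = NORMALIZERS.get(golden_field)
            golden[golden_field] = normalizer(value) if normalizer else value
    return golden


def link_and_merge(records):
    """Link records sharing an email; the first value seen for a field wins."""
    merged = {}
    for record in records:
        entity = merged.setdefault(record["email"], {})
        for field, value in record.items():
            entity.setdefault(field, value)
    return list(merged.values())


# Two hypothetical sources describing the same customer in different shapes.
crm_row = {"E-mail": " Ann@Example.com ", "Full Name": "Ann Lee"}
web_row = {"mail": "ann@example.com", "tel": "+1 (555) 010-9999"}

golden_rows = [
    to_golden(crm_row, {"E-mail": "email", "Full Name": "name"}),
    to_golden(web_row, {"mail": "email", "tel": "phone"}),
]
customers = link_and_merge(golden_rows)
```

After normalization both rows carry the same email, so they collapse into one customer holding the name from the first source and the phone from the second.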

Spark ETL Pros & Cons

Pros:

- Open source

- Great community

- Easy to scale

- Strong transformation engine

- Supports different languages

- Unified API for different components

Cons:

- No file management system

- Resource-consuming

- Manual configuration tuning

- No ETL UI

Q&A and demo time

skype: maxdor3

