Using Apache Spark as ETL engine. Pros and cons

Posted on 22-Jan-2018

Page 1: Using Apache Spark as ETL engine. Pros and Cons

Using Apache Spark as ETL engine. Pros and cons

Maksym Doroshenko Big Data Software Engineer

LeadGenius, Provectus

Agenda

1. What is Spark

2. Spark components

3. Spark pillars

4. What is an ETL pipeline

5. Using Spark SQL for ETL

6. Customer use case

7. Demo

Spark, who are you?

I am a fast and general engine for large-scale data processing.

Prove it

                Hadoop MR               Spark               Spark
Data Size       102.5 TB                100 TB              1000 TB
Elapsed Time    72 mins                 23 mins             234 mins
# Nodes         2100                    206                 190
# Cores         50400                   6592                6080
# Reducers      10,000                  29,000              250,000
Rate            1.42 TB/min             4.27 TB/min         4.27 TB/min
Rate/node       0.67 GB/min             20.7 GB/min         22.5 GB/min
Environment     dedicated data center   EC2 (i2.8xlarge)    EC2 (i2.8xlarge)

Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.
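
As a quick sanity check on the table above, the Rate/node row can be approximately reproduced from the Rate and # Nodes rows. This is only a sketch: it assumes 1 TB = 1000 GB, and the helper name `rate_per_node_gb` is made up for illustration.

```python
# Figures copied from the benchmark table: (label, rate in TB/min, node count).
RUNS = [
    ("Hadoop MR, 102.5 TB", 1.42, 2100),
    ("Spark, 100 TB", 4.27, 206),
    ("Spark, 1000 TB", 4.27, 190),
]


def rate_per_node_gb(rate_tb_per_min, nodes):
    """Convert a cluster-wide rate in TB/min to GB/min per node (1 TB = 1000 GB assumed)."""
    return rate_tb_per_min * 1000 / nodes


for label, rate, nodes in RUNS:
    print(f"{label}: {rate_per_node_gb(rate, nodes):.2f} GB/min per node")
```

The results land within rounding of the reported per-node figures, which is where the roughly 30x per-node gap between the Hadoop MR and Spark runs comes from.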

Spark use cases

- Simplify the challenging and compute-intensive task of processing high volumes of data

- Real-time data processing

- Seamless integration of complex capabilities such as machine learning and graph algorithms

- Spark brings Big Data processing to the masses

Survey: Why companies use Spark?

91% use Apache Spark because of its performance gains

77% use Apache Spark as it is easy to use

71% use Apache Spark due to the ease of deployment

64% use Apache Spark to leverage advanced analytics

52% use Apache Spark for real-time streaming.

Spark components

What is RDD?

Resilient Distributed Dataset: a big collection of data with the following properties:

- Immutable

- Distributed

- Lazily evaluated

- Fault tolerant
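
To make three of these properties concrete, here is a toy, single-process sketch in plain Python. It is not Spark code: `LazyDataset` is a hypothetical class invented for illustration, and distribution across nodes is not modeled at all. It only shows that transformations return new objects, that nothing runs until an action is called, and that results can always be recomputed from the recorded lineage.

```python
class LazyDataset:
    """Toy stand-in for an RDD: immutable, lazily evaluated, rebuilt from lineage."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # never mutated after creation
        self._lineage = tuple(lineage)  # recorded transformations, not results

    def map(self, fn):
        # Transformation: returns a NEW dataset, nothing is computed yet.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._lineage + (("filter", predicate),))

    def collect(self):
        """Action: replay the whole lineage over the source data."""
        data = iter(self._source)
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []


def traced_double(x):
    calls.append(x)  # records when the function actually runs
    return x * 2


base = LazyDataset([1, 2, 3, 4])
result = base.map(traced_double).filter(lambda x: x > 4)
print(calls)             # still empty: no action has been called
print(result.collect())  # the action triggers the computation
print(base.collect())    # the original dataset is unchanged
```

Because only the source and the lineage are stored, a lost result can be rebuilt by calling `collect()` again, which is the idea behind lineage-based fault tolerance.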

Operations

Narrow transformations

- map

- flatMap

- filter

- etc.

Wide transformations

- reduceByKey

- groupByKey

- sortByKey

- etc.
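
The difference between the two kinds of transformation can be sketched with plain Python lists standing in for partitions. This is an illustration of the idea, not Spark's implementation: the function names, the sample data, and the byte-sum partitioner are all assumptions made for the example.

```python
import functools
from collections import defaultdict

# Two hypothetical partitions of (key, value) pairs.
PARTITIONS = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]


def narrow_map(partitions, fn):
    """Narrow: each output partition depends on exactly one input partition."""
    return [[fn(item) for item in part] for part in partitions]


def wide_reduce_by_key(partitions, fn, num_partitions=2):
    """Wide: equal keys must first be shuffled into the same partition."""
    shuffled = [defaultdict(list) for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            # Deterministic toy partitioner (assumption, not Spark's hash partitioner).
            target = sum(key.encode()) % num_partitions
            shuffled[target][key].append(value)
    return [
        [(key, functools.reduce(fn, values)) for key, values in sorted(bucket.items())]
        for bucket in shuffled
    ]


print(narrow_map(PARTITIONS, lambda kv: (kv[0], kv[1] * 2)))
print(wide_reduce_by_key(PARTITIONS, lambda x, y: x + y))
```

In the narrow case every partition can be processed independently; in the wide case the values for key "a" start in different partitions and have to be moved together before they can be reduced, which is why wide transformations introduce a shuffle.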

Spark application tree

Spark DataFrames

A DataFrame is a distributed collection of data organized into named columns (an RDD with a schema), with more efficient storage options, an advanced optimizer, and direct operations on serialized data. These components are essential for getting the best performance out of Spark.

What is ETL?

1. A sequence of transformations on data

2. Source data is typically semi-structured or unstructured (text, JSON, CSV, etc.) or structured (JDBC, Parquet, ORC, Avro, etc.)

3. Output data is clean, structured, integrated and ready for further data processing, analysis and reporting.

ETL query in Spark

Why is ETL Hard?

1. Various sources/formats

2. Schema mismatch

3. Different representation

4. Corrupted files and data

5. Scalability

6. Schema evolution

This is why ETL is important

Consumers of this data do not want to deal with this messiness and complexity

Spark SQL

Spark SQL's flexible APIs, support for a wide variety of data sources, built-in support for Structured Streaming, state-of-the-art Catalyst optimizer, and Tungsten execution engine make it a great framework for building end-to-end ETL pipelines.

Data sources

https://spark-packages.org/

Schema inference: semi-structured data

User-specified schema

- Faster: no scan to infer the schema

- More flexible: easily handles schema evolution

- More robust: surfaces type errors as early as possible
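
The trade-off between inferring a schema and supplying one can be sketched in plain Python. This is not Spark's inference algorithm: the field names, the `SCHEMA` mapping, and both helper functions are hypothetical, and only a tiny int/float/str type ladder is modeled.

```python
import json

# Hypothetical user-specified schema: field name -> Python type used as a cast.
SCHEMA = {"id": int, "price": float}

# Widening order used by the toy inference below.
TYPE_ORDER = [int, float, str]


def infer_schema(lines):
    """Inference has to scan every record before any row can be typed."""
    inferred = {}
    for line in lines:
        for name, value in json.loads(line).items():
            seen = type(value) if type(value) in TYPE_ORDER else str
            current = inferred.get(name, int)
            inferred[name] = max(current, seen, key=TYPE_ORDER.index)
    return inferred


def apply_schema(line, schema):
    """With a known schema each record is cast on its own, and bad values fail at once."""
    record = json.loads(line)
    return {name: cast(record[name]) for name, cast in schema.items()}


lines = ['{"id": 1, "price": 10}', '{"id": 2, "price": 12.5}']
print(infer_schema(lines))            # needs both records to settle on float for price
print(apply_schema(lines[0], SCHEMA)) # typed immediately, without looking at other rows
```

A record such as `{"id": "x", "price": 1}` makes `apply_schema` raise a `ValueError` on that row, which is the "handle type errors early" behavior the slide refers to.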

Deal with bad data

java.io.IOException: org.apache.hadoop.io.compress.DecompressorStream.decompress

java.io.EOFException: Unexpected end of input stream

java.lang.RuntimeException: file:/temp/path/c000.json is not a Parquet file (too small)

[SPARK-17850] If this setting is true, Spark jobs continue to run when they encounter corrupt files, and the contents that have already been read are still returned.

spark.sql.files.ignoreCorruptFiles = true

Deal with bad data

[SPARK-12833] [SPARK-13764] Text file formats (JSON and CSV) support three parse modes when reading data:

- PERMISSIVE

- DROPMALFORMED

- FAILFAST
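
The behavior of the three modes can be emulated on JSON lines with the standard library. This is a plain-Python sketch of the semantics, not Spark's reader: `read_json_lines` is a hypothetical helper, and keeping the raw text under `_corrupt_record` mirrors the default corrupt-record column name in Spark as far as this example assumes.

```python
import json

MODES = ("PERMISSIVE", "DROPMALFORMED", "FAILFAST")


def read_json_lines(lines, mode="PERMISSIVE"):
    """Parse one JSON document per line, handling malformed lines according to mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    rows = []
    for line in lines:
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            if mode == "FAILFAST":
                raise  # abort the whole read on the first bad record
            if mode == "PERMISSIVE":
                rows.append({"_corrupt_record": line})  # keep the raw text for inspection
            # DROPMALFORMED: silently skip the bad record
    return rows


data = ['{"a": 1}', '{"a": ', '{"a": 3}']  # the second line is truncated
print(read_json_lines(data, "PERMISSIVE"))
print(read_json_lines(data, "DROPMALFORMED"))
```

PERMISSIVE keeps the job running and preserves the bad input for later review, DROPMALFORMED trades completeness for clean output, and FAILFAST is the choice when any corruption should stop the pipeline.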

Better JSON and CSV support

[SPARK-18352] [SPARK-19610] Multiline JSON and CSV support

- By default, Spark SQL reads JSON/CSV one line at a time

- Before Spark 2.2, multiline records required custom ETL
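
Why line-at-a-time reading breaks on multiline records can be seen with a few lines of plain Python. This is only an illustration of the problem, not Spark's reader; the sample document and both helper functions are made up for the example.

```python
import json

# A single JSON record pretty-printed across several lines (hypothetical sample).
MULTILINE_DOC = '{\n  "name": "Ann",\n  "age": 34\n}'


def parse_per_line(text):
    """Treat every physical line as a complete JSON record."""
    parsed, rejected = [], []
    for line in text.splitlines():
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            rejected.append(line)
    return parsed, rejected


def parse_whole(text):
    """Treat the whole input as one JSON record."""
    return json.loads(text)


parsed, rejected = parse_per_line(MULTILINE_DOC)
print(len(parsed), len(rejected))  # every fragment is rejected
print(parse_whole(MULTILINE_DOC))  # the record parses once read as a unit
```

No single line of the record is valid JSON by itself, so a per-line reader sees only corrupt input, while reading the record as one unit succeeds.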

Transformations: Higher order functions in SQL

Transformations on complex objects like arrays, maps and structures inside of columns.

1. Check for element existence

SELECT EXISTS(values, e -> e > 30) AS v FROM tbl_nested;

2. Transform an array

SELECT TRANSFORM(values, e -> e * e) AS v FROM tbl_nested;

3. Filter an array

SELECT FILTER(values, e -> e > 30) AS v FROM tbl_nested;

4. Aggregate an array

SELECT REDUCE(values, 0, (value, acc) -> acc + value) AS v FROM tbl_nested;
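
For readers more familiar with Python than SQL lambdas, the four operations above correspond roughly to the following plain-Python expressions applied to one row's array. The sample `values` list is an assumption; this shows the semantics only, not how Spark evaluates the queries.

```python
from functools import reduce

# Hypothetical contents of the `values` array column for a single row.
values = [10, 25, 40]

# 1. Check for element existence: is any element greater than 30?
has_large = any(e > 30 for e in values)

# 2. Transform an array: square every element.
squared = [e * e for e in values]

# 3. Filter an array: keep only elements greater than 30.
large_only = [e for e in values if e > 30]

# 4. Aggregate an array: sum the elements, starting from 0.
total = reduce(lambda acc, value: acc + value, values, 0)

print(has_large, squared, large_only, total)
```

The point of the SQL versions is that the same logic runs inside the engine on the nested column, without exploding the array into rows or writing a user-defined function.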

Load

Different save modes:

- Error

- Append

- Overwrite

- Ignore

Wide functionality:

df.write
  .partitionBy("favorite_color")
  .bucketBy(42, "name")
  .sortBy("age")
  .saveAsTable("people_partitioned_bucketed")
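
What the four save modes do when the target already exists can be sketched with a dictionary standing in for the table catalog. This is a toy model in plain Python, not the DataFrameWriter API: the `save` function, the `store` dictionary, and the table name are hypothetical.

```python
def save(store, table, rows, mode="error"):
    """Write rows to store[table], following the semantics of the given save mode."""
    exists = table in store
    if mode == "error":
        if exists:
            raise FileExistsError(f"table {table!r} already exists")
        store[table] = list(rows)
    elif mode == "append":
        store.setdefault(table, []).extend(rows)  # add to existing data
    elif mode == "overwrite":
        store[table] = list(rows)                 # replace existing data
    elif mode == "ignore":
        if not exists:
            store[table] = list(rows)             # silently do nothing if data exists
    else:
        raise ValueError(f"unknown mode: {mode}")


store = {}
save(store, "people", ["ann"])                     # first write succeeds in any mode
save(store, "people", ["bob"], mode="append")
save(store, "people", ["carol"], mode="ignore")    # no effect, table exists
print(store)
save(store, "people", ["dave"], mode="overwrite")
print(store)
```

Error is the safe choice for one-off loads, Append suits incremental pipelines, Overwrite suits full refreshes, and Ignore makes a load step idempotent at the cost of never updating existing data.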

Customer Use Case

- Data sources in different formats

- Map data to a golden customer schema

- Normalize all data (e.g. email, phone)

- Link and merge records that refer to the same entity
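
The normalize and link-and-merge steps can be illustrated with a minimal plain-Python sketch. The slides do not describe the customer's actual rules, so everything here is an assumption: the record fields, the normalization rules, and the choice to link records purely on the normalized email address.

```python
import re
from collections import defaultdict


def normalize_email(email):
    """Assumed rule: trim whitespace and lowercase."""
    return email.strip().lower()


def normalize_phone(phone):
    """Assumed rule: keep digits only."""
    return re.sub(r"\D", "", phone)


def link_and_merge(records):
    """Group records that share a normalized email and merge their attributes."""
    merged = defaultdict(lambda: {"names": set(), "phones": set()})
    for record in records:
        entity = merged[normalize_email(record["email"])]
        entity["names"].add(record["name"].strip())
        entity["phones"].add(normalize_phone(record["phone"]))
    return {
        email: {field: sorted(values) for field, values in entity.items()}
        for email, entity in merged.items()
    }


# Hypothetical records arriving from two different sources.
records = [
    {"name": "Ann Lee", "email": " Ann@Example.com ", "phone": "+1 (555) 010-2000"},
    {"name": "A. Lee", "email": "ann@example.com", "phone": "1-555-010-2000"},
    {"name": "Bob Ray", "email": "bob@example.com", "phone": "555 010 3000"},
]
print(link_and_merge(records))
```

Normalizing first is what makes the link step work: the two Ann records only match once their emails and phone numbers are brought to the same form. Real entity resolution usually needs fuzzier matching than a single exact key.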

Spark ETL Pros & Cons

Pros:

- Open source

- Great community

- Easy to scale

- Strong transformation engine

- Supports different languages

- Unified API for different components

Cons:

- No file management system

- Resource-consuming

- Manual configuration tuning

- No ETL UI

Q&A & DEMO TIME

[email protected]

skype: maxdor3