data reliability for data lakes · 2020-04-11 · evolution of a cutting-edge data lake events ai...

41
Data Reliability for Data Lakes Michael Armbrust @michaelarmbrust

Upload: others

Post on 28-May-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Data Reliability for Data Lakes

Michael Armbrust@michaelarmbrust

Page 2: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

1. Collect Everything

• Recommendation Engines• Risk, Fraud Detection• IoT & Predictive Maintenance• Genomics & DNA Sequencing

3. Data Science & Machine Learning

2. Store it all in the Data Lake

The Promise of the Data Lake

Garbage In Garbage Stored Garbage Out

!

!

!

!!

!

!

Page 3: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

What does a typical data lake project look like?

Page 4: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Evolution of a Cutting-Edge Data Lake

Events

?AI & Reporting

StreamingAnalytics

Data Lake

Page 5: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Evolution of a Cutting-Edge Data Lake

Events

AI & Reporting

StreamingAnalytics

Data Lake

Page 6: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Challenge #1: Historical Queries?

Data Lake

λ-arch

λ-arch

StreamingAnalytics

AI & Reporting

Eventsλ-arch1

1

1

Page 7: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Challenge #2: Messy Data?

Data Lake

λ-arch

λ-arch

StreamingAnalytics

AI & Reporting

Events

Validation

λ-archValidation

1

21

1

2

Page 8: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Reprocessing

Challenge #3: Mistakes and Failures?

Data Lake

λ-arch

λ-arch

StreamingAnalytics

AI & Reporting

Events

Validation

λ-archValidation

Reprocessing

Partitioned

1

2

3

1

1

3

2

Page 9: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Reprocessing

Challenge #4: Updates?

Data Lake

λ-arch

λ-arch

StreamingAnalytics

AI & Reporting

Events

Validation

λ-archValidation

Reprocessing

Updates

Partitioned

UPDATE &MERGE

Scheduled to Avoid Modifications

1

2

3

1

1

3

4

4

4

2

Page 10: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Wasting Time & Money

Solving Systems Problems

Instead of Extracting Value From Data

Page 11: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Data Lake Distractions

No atomicity means failed production jobs leave data in corrupt state requiring tedious recovery

No quality enforcement creates inconsistent and unusable data

No consistency / isolation makes it almost impossible to mix appends and reads, batch and streaming

Page 12: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Let’s try it instead with

Page 13: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Reprocessing

Challenges of the Data Lake

Data Lake

λ-arch

λ-arch

StreamingAnalytics

AI & Reporting

Events

Validation

λ-archValidation

Reprocessing

Updates

Partitioned

UPDATE &MERGE

Scheduled to Avoid Modifications

1

2

3

1

1

3

4

4

4

2

Page 14: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

AI & Reporting

StreamingAnalytics

The Architecture

Data Lake

CSV,JSON, TXT…

Kinesis

Page 15: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

AI & Reporting

StreamingAnalytics

The Architecture

Data Lake

CSV,JSON, TXT…

Kinesis

Full ACID Transaction

Focus on your data flow, instead of worrying about failures.

Page 16: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

AI & Reporting

StreamingAnalytics

The Architecture

Data Lake

CSV,JSON, TXT…

Kinesis

Open Standards, Open Source

Store petabytes of data without worries of lock-in. Growing community including Presto, Spark and more.

Page 17: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

AI & Reporting

StreamingAnalytics

The Architecture

Data Lake

CSV,JSON, TXT…

Kinesis

Powered by

Unifies Streaming / Batch. Convert existing jobs with minimal modifications.

Page 18: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Data Lake

AI & Reporting

StreamingAnalytics

Business-level Aggregates

Filtered, CleanedAugmented

Raw Ingestion

The

Bronze Silver Gold

CSV,JSON, TXT…

Kinesis

Quality

Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption.

*Data Quality Levels *

Page 19: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Data Lake

AI & Reporting

StreamingAnalytics

Business-level Aggregates

Filtered, CleanedAugmented

Raw Ingestion

The

Bronze Silver Gold

CSV,JSON, TXT…

Kinesis

• Dumping ground for raw data• Often with long retention (years)• Avoid error-prone parsing

!

Page 20: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Data Lake

AI & Reporting

StreamingAnalytics

Business-level Aggregates

Filtered, CleanedAugmented

Raw Ingestion

The

Bronze Silver Gold

CSV,JSON, TXT…

Kinesis

Intermediate data with some cleanup applied.Queryable for easy debugging!

Page 21: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Data Lake

AI & Reporting

StreamingAnalytics

Business-level Aggregates

Filtered, CleanedAugmented

Raw Ingestion

The

Bronze Silver Gold

CSV,JSON, TXT…

Kinesis

Clean data, ready for consumption.Read with Spark or Presto*

*Coming Soon

Page 22: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Data Lake

AI & Reporting

StreamingAnalytics

Business-level Aggregates

Filtered, CleanedAugmented

Raw Ingestion

The

Bronze Silver Gold

CSV,JSON, TXT…

Kinesis

Streams move data through the Delta Lake• Low-latency or manually triggered• Eliminates management of schedules and jobs

Page 23: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Data Lake

AI & Reporting

StreamingAnalytics

Business-level Aggregates

Filtered, CleanedAugmented

Raw Ingestion

The

Bronze Silver Gold

CSV,JSON, TXT…

Kinesis

Delta Lake also supports batch jobs and standard DML

UPDATE

DELETE MERGE

OVERWRITE

• Retention• Corrections• GDPR

• UPSERTS

INSERT

*DML released in 0.3.0

Page 24: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Data Lake

AI & Reporting

StreamingAnalytics

Business-level Aggregates

Filtered, CleanedAugmented

Raw Ingestion

The

Bronze Silver Gold

CSV,JSON, TXT…

Kinesis

Easy to recompute when business logic changes:• Clear tables• Restart streams

DELETE DELETE

Page 25: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Who is using ?

Page 26: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Used by 1000s of organizations world wide

> 1 exabyte processed last month alone

Page 27: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

27

Improved reliability: Petabyte-scale jobs

10x lower compute: 640 instances to 64!

Simpler, faster ETL: 84 jobs → 3 jobshalved data latency

Page 28: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

How do I use ?

Page 29: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

dataframe.write.format("delta").save("/data")

Get Started with Delta using Spark APIs

dataframe.write.format("parquet").save("/data")

Instead of parquet... … simply say delta

Add Spark Packagepyspark --packages io.delta:delta-core_2.12:0.1.0

bin/spark-shell --packages io.delta:delta-core_2.12:0.1.0

<dependency><groupId>io.delta</groupId> <artifactId>delta-core_2.12</artifactId> <version>0.1.0</version>

</dependency>

Maven

Page 30: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

In Progress: Declarative Pipelines

Enforce metadata, storage, and quality declaratively.

dataset("warehouse")

.query(input("kafka").select(…).join(…)) // Query to materialize

.location(…) // Storage Location

.schema(…) // Optional strict schema checking

.metastoreName(…) // Hive Metastore

.description(…) // Human readable description

.expect("validTimestamp", // Expectations on data quality"timestamp > 2012-01-01 AND …",

"fail / alert / quarantine")

*Coming Soon

Page 31: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

In Progress: Declarative Pipelines

Enforce metadata, storage, and quality declaratively.

dataset("warehouse").query(input("kafka").withColumn(…).join(…)) // Query to materialize.location(…) // Storage Location.schema(…) // Optional strict schema checking.metastoreName(…) // Hive Metastore.description(…) // Human readable description.expect("validTimestamp", // Expectations on data quality"timestamp > 2012-01-01 AND …","fail / alert / quarantine")

*Coming Soon

Page 32: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

In Progress: Declarative Pipelines

Enforce metadata, storage, and quality declaratively.

dataset("warehouse").query(input("kafka").withColumn(…).join(…)) // Query to materialize.location(…) // Storage Location.schema(…) // Optional strict schema checking.metastoreName(…) // Hive Metastore.description(…) // Human readable description.expect("validTimestamp", // Expectations on data quality"timestamp > 2012-01-01 AND …","fail / alert / quarantine")

*Coming Soon

Page 33: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

How does work?

Page 34: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Delta On Disk

my_table/_delta_log/

00000.json00001.json

date=2019-01-01/file-1.parquet

Transaction LogTable Versions

(Optional) Partition DirectoriesData Files

Page 35: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Version N = Version N-1 + Actions

Change Metadata – name, schema, partitioning, etcAdd File – adds a file (with optional statistics)Remove File – removes a file

Result: Current Metadata, List of Files

Page 36: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Implementing Atomicity

Changes to the table are stored as ordered, atomic units called commits

Add 1.parquet

Add 2.parquet

Remove 1.parquet

Remove 2.parquet

Add 3.parquet

000000.json

000001.json

Page 37: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Ensuring Serializablity

Need to agree on the order of changes, even when there are multiple writers. 000000.json

000001.json

000002.json

User 1 User 2

Page 38: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Solving Conflicts Optimistically

1. Record start version2. Record reads/writes3. Attempt commit4. If someone else wins,

check if anything you read has changed.

5. Try again.

000000.json

000001.json

000002.json

User 1 User 2Write: AppendRead: Schema

Write: AppendRead: Schema

Page 39: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Handling Massive Metadata

Large tables can have millions of files in them! How do we scale the metadata? Use Spark for scaling!

Add 1.parquet

Add 2.parquetRemove 1.parquet

Remove 2.parquet

Add 3.parquet

Checkpoint

Page 40: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Road Map

• 0.2.0 – Released!• S3 Support• Azure Blob Store and ADLS Support

• 0.3.0 - Released!• UPDATE (Scala)• DELETE (Scala)• MERGE (Scala)• VACUUM (Scala)

• 0.4.0 – This week!• CONVERT TO DELTA• DESCRIBE HISTORY• Python DML (Update/Delete/Merge)

• Spark 3.0• DDL Support / Hive Metastore• SQL DML Support

Page 41: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics

Build your own Delta Lakeat https://delta.io