Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Lessons Learned Designing Data Ingest Systems Abraham Elmahrek ([email protected])

Upload: aaamase

Post on 16-Aug-2015


Page 1: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Lessons Learned Designing Data Ingest Systems

Abraham Elmahrek ([email protected])

Page 2: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Agenda

1. Overview of Big Data Ingest
2. Real world examples with lessons interleaved
3. A summary of lessons learned and extra ideas

Page 3: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Big Data Ingest

Data sources: Ingesting from different data sources is the goal.

Schema: Data sources have different structures, but it is mostly the schemas that vary.

Speed: Batch and real-time ingest both have their places.

Page 4: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Data sources

Structured: Relational databases, spreadsheets, object databases

Semi-structured: XML, JSON, EDI, etc.

Eventually structured: Audio, video, email, etc.

Page 5: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Schema

Number of schemas: One schema with a relatively flat structure or many schemas with nested structures.

Mutability: Immutable schemas can't be changed. Mutable schemas can evolve. Nested schemas can also have mutability properties.

Inference: Schema inference upon writing, reading, or offline.
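
As a rough illustration of schema inference upon reading, the sketch below scans already-parsed records and collects the value types seen for each field. The function name `infer_schema` and the sample records are assumptions made for this example; they do not come from the talk.

```python
def infer_schema(records):
    """Collect, per field name, the set of Python type names observed."""
    schema = {}
    for record in records:
        for key, value in record.items():
            # A field seen with more than one type ends up with several entries.
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema


# Hypothetical sample: the second record lacks "name" and adds "tags".
sample = [
    {"id": 1, "name": "a"},
    {"id": 2, "tags": ["x"]},
]
inferred = infer_schema(sample)
```

A field that maps to more than one type name is a hint that the source schema is mutable or inconsistent.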

Page 6: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Real Time vs Batch

Batch: Transfer data from A -> B on demand.

Push model: Push data from A -> B consistently. Poll on data sources or act upon reception.

Pull model: Clients pull data from A to write to B. Often an intermediate storage system like Kafka is used to achieve this.

Page 7: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Real world scenario: Form generator

• GOAL: Generate different forms for websites
• Store user information
• Forms cannot change over time

Page 8: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Lesson #1: Structure endpoint wisely

Form Definition: id, form name, form metadata
Form 1: id, <field 1>, <field 2>, <field 3>
Form 2: id, <field 1>, <field 2>, <field 3>

Form Definition: id, form name
Field Definition: id, form id, field name, type
Field Values: id, field id, value
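
The second layout on this slide (form definition, field definition, field values) can be sketched with SQLite as below. This is a minimal sketch, not the speaker's implementation: the snake_case table and column names are adapted from the slide, and the helper functions `create_form_store`, `add_form`, and `values_for_form` are hypothetical.

```python
import sqlite3


def create_form_store(conn):
    """Create the three generic tables named on the slide."""
    conn.executescript("""
        CREATE TABLE form_definition (
            id INTEGER PRIMARY KEY,
            form_name TEXT NOT NULL
        );
        CREATE TABLE field_definition (
            id INTEGER PRIMARY KEY,
            form_id INTEGER NOT NULL REFERENCES form_definition(id),
            field_name TEXT NOT NULL,
            type TEXT NOT NULL
        );
        CREATE TABLE field_values (
            id INTEGER PRIMARY KEY,
            field_id INTEGER NOT NULL REFERENCES field_definition(id),
            value TEXT
        );
    """)


def add_form(conn, form_name, fields):
    """Register a form and its (field_name, type) pairs; return the new ids."""
    cur = conn.execute(
        "INSERT INTO form_definition (form_name) VALUES (?)", (form_name,))
    form_id = cur.lastrowid
    field_ids = {}
    for field_name, field_type in fields:
        cur = conn.execute(
            "INSERT INTO field_definition (form_id, field_name, type) "
            "VALUES (?, ?, ?)",
            (form_id, field_name, field_type))
        field_ids[field_name] = cur.lastrowid
    return form_id, field_ids


def values_for_form(conn, form_id):
    """Return (field_name, value) pairs stored for one form, in insert order."""
    rows = conn.execute(
        "SELECT fd.field_name, fv.value "
        "FROM field_values fv "
        "JOIN field_definition fd ON fd.id = fv.field_id "
        "WHERE fd.form_id = ? ORDER BY fv.id",
        (form_id,))
    return rows.fetchall()
```

With this layout a new form is only new rows, not a new table, so the ingest endpoint keeps a single fixed shape regardless of how many forms exist.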

Page 9: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Real world scenario: Scrape GitHub

• GOAL: Generate list of active contributors on a repository and general stats about a repository relative to all other repositories.
• Scheduled batch Change Data Capture (CDC).

Page 10: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

My implementation (naive)

Page 11: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Lesson #2: At least once is acceptable

• Ingesting data twice doesn't matter in a lot of cases.
• The cost of re-processing or re-ingesting a few records is normally pretty low.
• It's easy to manage and implement.
• Exactly once semantics, in contrast, is not feasible
  – Usually requires some de-duping
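
One way to see why at-least-once delivery is usually cheap to live with: if the sink writes by record key, a re-delivered record overwrites itself instead of duplicating. The sketch below is an assumption-laden illustration, not the speaker's system; `IdempotentSink`, `deliver_at_least_once`, and the simulated lost acknowledgement are all hypothetical.

```python
class IdempotentSink:
    """Stand-in for a keyed store: writing the same id twice keeps one row."""

    def __init__(self):
        self.rows = {}
        self.writes = 0

    def write(self, record):
        self.writes += 1
        self.rows[record["id"]] = record


def deliver_at_least_once(records, write, ack_lost_for=(), max_attempts=3):
    """Retry each record until a write is acknowledged.

    ack_lost_for lists record ids whose first acknowledgement is "lost",
    which forces a second delivery of a record that was already written.
    """
    lose_ack_once = set(ack_lost_for)
    for record in records:
        for _attempt in range(max_attempts):
            write(record)
            if record["id"] in lose_ack_once:
                lose_ack_once.discard(record["id"])
                continue  # sender cannot tell the write landed, so it retries
            break
        else:
            raise RuntimeError("gave up on record %r" % record["id"])


sink = IdempotentSink()
deliver_at_least_once(
    [{"id": "a", "n": 1}, {"id": "b", "n": 2}],
    sink.write,
    ack_lost_for={"a"},
)
```

Here record "a" is written twice but stored once, which is the de-duping the slide refers to, done implicitly by the key.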

Page 12: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

A better implementation

Page 13: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

My favorite implementation

Page 14: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Lesson #3: CDC is hard

• Change Data Capture (CDC) without a change log or an easy way to calculate differences is hard.
• Almost always requires some customized effort.
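
When a source offers no change log, one common fallback is to diff two full snapshots keyed by a stable identifier. The sketch below shows that idea only; it is not the implementation pictured in the talk, and `diff_snapshots` plus the sample contributor data are hypothetical.

```python
def diff_snapshots(previous, current):
    """Compare two {key: row} snapshots and classify the changes."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    deletes = sorted(k for k in previous if k not in current)
    return inserts, updates, deletes


# Hypothetical contributor stats from two scheduled scrapes.
yesterday = {
    1: {"login": "alice", "commits": 3},
    2: {"login": "bob", "commits": 1},
}
today = {
    1: {"login": "alice", "commits": 5},
    3: {"login": "carol", "commits": 2},
}
inserts, updates, deletes = diff_snapshots(yesterday, today)
```

The cost is that both snapshots must be fetched and held in full, which is part of why the slide calls this customized effort.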

Page 15: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Real world scenario: Ad attribution system

• GOAL: Gather impressions and click information. Attribute to different vendors based on impressions and clicks.
• Expose a view for customers to understand their usage.
• Near real time (NRT) with batch error checking.

Page 16: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Lesson #4: Know thy SLA

• What is the incidence of errors?
• How frequently should errors be checked?
• Is data loss acceptable?
• Is duplication acceptable?

Page 17: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Push version

[Diagram components: Click Logs, Impression Logs, VIP, ScribeMaster (x3), HBase, MySQL]

Page 18: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Push version analysis

• Negatives
  – Scribe would lose data in some edge cases. That's not good for attribution systems (money involved).
  – Amount of messages being written to HBase would cause major compactions on a weekly basis, halting the pipeline.
• Positives
  – Latency was super low
  – Relatively easy to maintain given Scribe configuration

* Flume would have been a better choice! It has better reliability guarantees!

Page 19: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Pull version

[Diagram components: Click Logs, Impression Logs, VIP, Producer (x2), RabbitMQ, Consumer (x2), HBase, MySQL]

Page 20: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Pull version analysis

• Negatives
  – Requires more management and configuration.
• Positives
  – Can choose the data-loss trade-off: at most once or at least once semantics.
  – Intermediate storage relieves HBase.

* Kafka would have been a cool choice! It has better data retention and scalability!
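
With a queue between producers and consumers, the choice between at-most-once and at-least-once typically comes down to when the consumer acknowledges a message relative to writing it. The sketch below models that with an in-memory stand-in; it does not use RabbitMQ or any real client library, and `InMemoryQueue`, `consume`, and the simulated crash are hypothetical.

```python
from collections import deque


class InMemoryQueue:
    """Toy broker: a delivered message stays 'unacked' until acknowledged."""

    def __init__(self, messages):
        self._ready = deque(messages)
        self._unacked = {}
        self._next_tag = 0

    def get(self):
        if not self._ready:
            return None
        tag = self._next_tag
        self._next_tag += 1
        message = self._ready.popleft()
        self._unacked[tag] = message
        return tag, message

    def ack(self, tag):
        del self._unacked[tag]

    def requeue_unacked(self):
        """Mimic redelivery after a consumer dies with messages unacked."""
        for tag in sorted(self._unacked):
            self._ready.append(self._unacked[tag])
        self._unacked.clear()


def consume(queue, store, ack_first, crash_on=None):
    """Drain the queue into store, crashing once between the two steps."""
    while True:
        item = queue.get()
        if item is None:
            return
        tag, message = item
        if ack_first:
            queue.ack(tag)          # at most once: acknowledge, then write
        else:
            store.append(message)   # at least once: write, then acknowledge
        if message == crash_on:
            crash_on = None         # fail only once
            queue.requeue_unacked()
            continue
        if ack_first:
            store.append(message)
        else:
            queue.ack(tag)


at_most_once, at_least_once = [], []
consume(InMemoryQueue([1, 2, 3]), at_most_once, ack_first=True, crash_on=2)
consume(InMemoryQueue([1, 2, 3]), at_least_once, ack_first=False, crash_on=2)
```

In this toy run, acknowledging first loses message 2, while acknowledging last stores it twice; which outcome is tolerable is the SLA question from Lesson #4.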

Page 21: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Lesson #5: Record format and schema

1. Structureless (or simple structure) and schemaless
   a. Log file (e.g. uuid|val1|val2|val3|...)
2. Structured without explicit schema
   a. JSON (e.g. {"key1": "val1", ...})
3. Structured with explicit schema
   a. Avro (e.g. {"key1": "val1", ...}, but with schema)
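
The three record styles can be compared side by side in a short sketch. The third case only imitates the idea of an explicit schema: the dict below follows the general shape of an Avro record schema, but the check is a hand-written, hypothetical `conforms` function rather than a real Avro library. Field names and sample values are likewise assumptions.

```python
import json

# 1. Delimited log line: field meaning lives only in the reader's code.
LOG_FIELDS = ["uuid", "val1", "val2"]


def parse_log_line(line):
    return dict(zip(LOG_FIELDS, line.split("|")))


# 2. JSON: structure travels with the record, but no schema is enforced.
def parse_json_record(text):
    return json.loads(text)


# 3. Explicit schema, written in the style of an Avro record schema.
EVENT_SCHEMA = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "uuid", "type": "string"},
        {"name": "val1", "type": "string"},
        {"name": "count", "type": "int"},
    ],
}

_PYTHON_TYPES = {"string": str, "int": int}


def conforms(record, schema):
    """Hypothetical check: same field names, and each value has the declared type."""
    fields = schema["fields"]
    if sorted(record) != sorted(f["name"] for f in fields):
        return False
    return all(type(record[f["name"]]) is _PYTHON_TYPES[f["type"]]
               for f in fields)
```

Only the third style lets the pipeline reject a malformed record before it reaches storage.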

Page 22: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Issues with structure

• Verbosity directly related to human readability
• Verbosity impacts performance of systems
• A verbose and readable RPC: XML, YAML, JSON, etc.
• A not-so-verbose and not-so-readable RPC: MessagePack, Protobuf, Avro binary, Parquet, etc.
• Sufficient tooling can make human readability less necessary.
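
The verbosity trade-off can be made concrete with the standard library alone. The sketch below encodes one record as JSON text and as a fixed binary layout using `struct`; the binary layout stands in for the compact formats on the slide and is not MessagePack, Protobuf, or Avro. The record contents and function names are assumptions.

```python
import json
import struct

RECORD = {"user_id": 123456789, "score": 0.75}

# Little-endian: one signed 64-bit integer followed by one 64-bit float.
_LAYOUT = struct.Struct("<qd")


def encode_json(record):
    """Readable and self-describing, but field names are repeated per record."""
    return json.dumps(record).encode("utf-8")


def encode_binary(record):
    """Compact, but unreadable without knowing the layout."""
    return _LAYOUT.pack(record["user_id"], record["score"])


def decode_binary(payload):
    user_id, score = _LAYOUT.unpack(payload)
    return {"user_id": user_id, "score": score}


json_size = len(encode_json(RECORD))
binary_size = len(encode_binary(RECORD))
```

The binary form is a fixed 16 bytes per record, while the JSON form grows with the length of every field name.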

Page 23: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Issues with schema

• Flexibility and structure are inversely related.
• A flexible schema
  – Doesn't require an upfront definition
  – Allows you to make and validate assumptions about the data.
  – Easy to extend, but difficult to track changes
  – May have nested structures
    ■ e.g. uuid|val1|val2|{"field1": "value1", ...}|...
• A structured schema
  – Easier for everyone (human and computer) to understand
  – Saves time when serializing/deserializing

Page 24: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Lesson #6: Record lineage

• Where is the data coming from?
• How has it changed as it enters the system?
• Snapshots?
• Who touched the data?
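
One lightweight way to answer these questions is to carry lineage alongside each record as it moves through the pipeline. The sketch below is illustrative only: the `TrackedRecord` class, its field names, and the sample step are hypothetical, and the "before" snapshot is a shallow copy.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TrackedRecord:
    payload: dict
    source: str                                   # where the data came from
    history: list = field(default_factory=list)   # what changed it, and who

    def apply(self, step_name, actor, transform):
        """Run transform on the payload and log the step with a snapshot."""
        before = dict(self.payload)
        self.payload = transform(self.payload)
        self.history.append({
            "step": step_name,
            "actor": actor,
            "at": datetime.now(timezone.utc).isoformat(),
            "before": before,
        })
        return self


record = TrackedRecord({"clicks": "3"}, source="click_logs")
record.apply(
    "cast_clicks",
    "ingest-job",
    lambda payload: {**payload, "clicks": int(payload["clicks"])},
)
```

Each entry in `history` records who touched the record, when, and what it looked like beforehand.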

Page 25: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Summary of lessons

1. Structure endpoints wisely
2. At least once semantics is easy and acceptable
3. CDC is hard
4. Know thy SLA
5. Record format and schema should be thought through
6. Record lineage (provenance)

Page 26: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Extra ideas

1. Keep track of erroneous records
   a. Anomalies lead to more knowledge about data source
   b. Improves debugging
2. Keep transformations to a minimum
   a. Schema inference makes sense
   b. Massive computations can slow down the ingest process and cause back pressure in the pipeline
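
Keeping track of erroneous records can be as simple as routing anything that fails to parse into a side list along with the reason, instead of dropping it or halting the job. The sketch below assumes a pipe-delimited click record; `ingest`, `parse_click`, and the sample lines are hypothetical.

```python
def parse_click(line):
    """Parse 'uuid|count'; raises ValueError on a malformed line."""
    uuid, count = line.split("|")
    return {"uuid": uuid, "count": int(count)}


def ingest(lines, parse):
    """Return (good records, rejected lines with their error and position)."""
    good, rejected = [], []
    for line_no, line in enumerate(lines, start=1):
        try:
            good.append(parse(line))
        except ValueError as exc:
            rejected.append(
                {"line_no": line_no, "raw": line, "error": str(exc)})
    return good, rejected


good, rejected = ingest(["u1|3", "u2|x", "broken"], parse_click)
```

The rejected list keeps the raw input, so anomalies can be inspected later to learn about the data source.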

Page 27: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Check out http://ingest.tips for general ingest

Page 28: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Thank you

Page 29: Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems

Licensing

Public Domain
1. https://commons.wikimedia.org/wiki/File:West_Texas_Pumpjack.JPG
2. https://commons.wikimedia.org/wiki/File%3ABulls_Ishikawa%2C_Okinawa_2007.jpg
3. https://commons.wikimedia.org/wiki/File:Hammer_Ace_SATCOM_Antenna.jpg
4. https://commons.wikimedia.org/wiki/File:Shanghai_Shimao_Plaza_Construction.jpg
5. https://pixabay.com/p-111058/?no_redirect
6. https://pixabay.com/p-70908/?no_redirect
7. https://commons.wikimedia.org/wiki/File:The_Sun_by_the_Atmospheric_Imaging_Assembly_of_NASA's_Solar_Dynamics_Observatory_-_20100819.jpg
8. http://www.freestockphotos.biz/stockphoto/16694
9. https://pixabay.com/en/github-logo-favicon-mascot-button-154769/

Creative Commons V3
10. https://commons.wikimedia.org/wiki/File:Star-schema.png

Creative Commons V2
11. https://www.flickr.com/photos/the_pink_princess/370896536/
12. https://www.flickr.com/photos/digitaljourney/5424241457