Spark Meetup at Uber
TRANSCRIPT
DATA
Spark & Hadoop @ Uber
Who We Are
Early Engineers On Hadoop team @ Uber
Kelvin Chu, Reza Shiftehfar, Vinoth Chandar
Agenda
● Intro to Data @ Uber
● Trips Pipeline Into Warehouse
● Paricon
● INotify DStream
● Future
Uber’s Mission
“Transportation as reliable as running water,
everywhere, for everyone”
300+ Cities 60+ Countries
And growing...
Data @ Uber
● Impact of Data is Huge!
○ 2000+ unique users operating a massive transportation system
● Running critical business operations
○ Payments, Fraud, Marketing Spend, Background Checks …
● Unique & Interesting Problems
○ Supply vs Demand - Growth
○ Geo-Temporal Analytics
● Latency Is King
○ Enormous business value in making data available ASAP
Data Architecture: Circa 2014
[Architecture diagram: Kafka logs, Schemaless databases, and RDBMS tables flow through a bulk uploader into Amazon S3; EMR reads from S3; Celery/Python ETL loads the OLAP warehouse, which serves applications and ad-hoc SQL]
Challenges
● Scaling to high-volume Kafka streams
○ e.g. event data coming from phones
● Merged views of DB changelogs across DCs
○ Some of the most important data - trips (duh!)
● Fragile ingestion model
○ Projections/transformations in pipelines
○ Data Lake philosophy - raw data on HDFS, transform later using Spark
● Free-form JSON data → data breakages
● First order of business - reliable data
New World Order: Hadoop & Spark
[Architecture diagram: Kafka logs, Schemaless databases, RDBMS tables, and Amazon S3 are landed into HDFS by data delivery services (Spark/Spark Streaming); Paricon converts raw data into cooked data; Spark SQL, Spark/Hive, and Spark jobs scheduled via Oozie feed the OLAP warehouse, applications, ad-hoc SQL, and machine learning]
Trips Pipeline : Problem
● Most valuable dataset at Uber (100% accuracy)
● Trips stored in Uber’s ‘Schemaless’ datastores (sharded MySQL), across DCs, cross-replicated
● Need a consolidated view across DCs, quickly (~1-2 hr end-to-end)
[Diagram: Trip Store (DC1) and Trip Store (DC2) each take local writes and stay in sync via multi-master XDC replication]
Trips Pipeline : Architecture
Trips Pipeline : ETL via SparkSQL
● Decouples raw ingestion from the relational warehouse table model
○ Ability to provision multiple tables off the same data set
● Picks the latest changelog entry in the files
○ Applies them in order
● Applies projections & row-level transformations
○ Produces ingestible data for the warehouse
● Uses HiveContext to gain access to UDFs
○ explode() etc. to flatten JSON arrays
● Scheduled Spark job via Oozie (see sketch below)
○ Runs every hour (tunable)
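To make the slide concrete, here is a minimal sketch of such an hourly SparkSQL job in the Spark 1.x style of the talk; the paths, table name (trips_raw) and array column (fare_breakdown) are hypothetical illustrations, not Uber's actual schema:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object TripsEtlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("trips-etl"))
    val hc = new HiveContext(sc) // HiveContext unlocks Hive UDFs such as explode()

    // Raw changelog data landed on HDFS by the delivery service
    hc.read.json("hdfs:///data/raw/trips/").registerTempTable("trips_raw")

    // explode() flattens a JSON array column into one row per element
    val flattened = hc.sql(
      """SELECT trip_uuid, fare_item
        |FROM trips_raw
        |LATERAL VIEW explode(fare_breakdown) t AS fare_item
      """.stripMargin)

    flattened.write.parquet("hdfs:///warehouse/trips/")
  }
}
```

An Oozie coordinator would then launch a jar like this on the hourly schedule mentioned above.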
Paricon : PARquet Inference and CONversion
● Running in production since February 2015
○ First Spark application at Uber
Motivation 1: Data Breakage & Evolution
[Diagram: upstream data producers write JSON to S3; the data evolves over time … and one day, downstream data consumers break]
Motivation 1: Why Schema
● Contract
○ Multiple teams
○ Producers
○ Consumers
● Avoid data breakage
○ Because we have schema evolution systems
● Data to persist in a typed manner
○ Analytics
● Serve as documentation
○ Understand data faster
● Unit testable
Paricon : Workflow
Transfer
Convert
Infer
Validate
JSON / Gzip / S3
Avro schema
Parquet /In-house HDFSSchema
Repository and
Management Systems
reviewed / consumed
Motivation 2: Why Parquet
● Supports schema
● 2 to 4 times FASTER than JSON/gzip
○ Column pruning
■ Wide tables at Uber
○ Filter predicate push-down
○ Compression
● Strong Spark support
○ SparkSQL
○ Schema evolution (see sketch below)
■ Schema merging in Spark v1.3
■ Merge old and new compatible schema versions
■ No “ALTER TABLE ...”
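As a minimal illustration of the schema-merging bullet (paths hypothetical, assuming an existing sqlContext; Spark 1.3 merged Parquet schemas by default, later versions opt in per read):

```scala
// Read two generations of the same topic together; "mergeSchema" unions
// old and new compatible schema versions - no "ALTER TABLE ..." required
val merged = sqlContext.read
  .option("mergeSchema", "true")
  .parquet("hdfs:///cooked/topic/v1/", "hdfs:///cooked/topic/v2/")

merged.printSchema() // superset of the old and new fields
```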
Paricon : Transfer
● distcp on Spark
○ Only a subset of the command-line options currently
● Approach (see sketch below)
○ Compute the file list and assign the files to RDD partitions
○ Avoid stragglers by randomly grouping different dates
● Extras
○ Uber-specific logic
■ Filename conventions
■ Backup policies
○ Internal Spark ecosystem
○ Faster homegrown delta computation
○ Gets around the s3a problem in Hadoop 2.6
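A sketch of the approach above, assuming a flat source directory and eliding the Uber-specific extras; this illustrates the idea, not Uber's actual code:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object SparkDistCpSketch {
  def main(args: Array[String]): Unit = {
    val Array(src, dst) = args
    val sc = new SparkContext(new SparkConf().setAppName("spark-distcp"))

    // Driver side: compute the file list once
    val files = FileSystem.get(new URI(src), new Configuration())
      .listStatus(new Path(src))
      .map(_.getPath.toString)

    // Randomize so a single hot date doesn't pile into one partition (stragglers)
    sc.parallelize(Random.shuffle(files.toSeq), 64).foreach { file =>
      val conf = new Configuration() // rebuilt per task; Configuration isn't serializable
      val from = new Path(file)
      val to = new Path(dst, from.getName)
      FileUtil.copy(from.getFileSystem(conf), from,
                    to.getFileSystem(conf), to,
                    false /* deleteSource */, conf)
    }
  }
}
```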
Paricon : Infer
● Infer by JsonRDD
○ But not directly
● Challenge: the data is dirty
○ Garbage in, garbage out
● Two-pass approach
○ First pass: data cleaning
○ Second pass: JsonRDD inference
Paricon : Infer
● Data cleaning (see sketch below)
○ Structured as a rules-based engine
○ Each rule is an expectation
○ All rules are heuristics
■ Based on business domain knowledge
○ The rules are pluggable, based on topics
● Struct@JsonRDD vs Avro:
○ Illegal characters in field names
○ Repeating group names
○ More
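A minimal sketch of the two-pass idea, with two toy cleaning rules standing in for the pluggable, per-topic rules engine:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.StructType

// Pass 1: rules-based cleaning (each rule encodes a heuristic expectation);
// Pass 2: let Spark's JSON reader infer the schema from the cleaned records.
def inferTopicSchema(sqlContext: SQLContext, raw: RDD[String]): StructType = {
  val cleaned = raw
    .filter(_.trim.startsWith("{"))          // expectation: a record is a JSON object
    .map(_.replaceAll("[^\\x20-\\x7E]", "")) // expectation: printable characters only

  sqlContext.read.json(cleaned).schema       // JsonRDD-style inference
}
```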
Paricon : Convert
● Incremental conversion (see sketch below)
○ Assign days to RDD partitions
○ Computation and checkpoint unit: the day
○ A new job, or one recovering from failure, works on those partial days only
● The most code among the four tasks
○ Multiple source formats (encoded vs non-encoded)
○ Data cleaning based on the inferred schema
○ Homegrown JSON decoder for Avro
○ File stitching
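A sketch of the incremental, day-at-a-time loop; the checkpoint helpers are hypothetical (e.g. _SUCCESS-style marker files on HDFS):

```scala
import org.apache.spark.sql.SQLContext

// The day is both the computation and the checkpoint unit, so a new job or a
// restart after failure only reprocesses the days that never completed.
def convertIncrementally(sqlContext: SQLContext,
                         days: Seq[String],          // e.g. "2015-09-01", ...
                         isDone: String => Boolean,  // was this day checkpointed?
                         markDone: String => Unit): Unit = {
  for (day <- days if !isDone(day)) {
    val df = sqlContext.read.json(s"hdfs:///raw/topic/dt=$day/")
    df.write.parquet(s"hdfs:///cooked/topic/dt=$day/")
    markDone(day) // checkpoint: this day is complete
  }
}
```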
Stitching : Motivation
[Diagram: distribution of file sizes vs number of files, with the HDFS block size marked]
● Inefficient for HDFS
● Many large files
○ Break them
● But a lot more small files
○ Stitch them
Stitching : Goal
[Diagram: Parquet blocks straddling HDFS block boundaries (before) vs one Parquet file aligned per HDFS block (after)]
● One Parquet block per file
● Parquet file slightly smaller than an HDFS block
Stitching : Algorithms
● Algo 1: Estimate a constant before conversion
○ Pros: easy to do
○ Cons: does not work well with temporal variation
● Algo 2: Estimate during conversion, per RDD partition (see sketch below)
○ Each day has its own estimate
○ May even self-tune during the day
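The estimation step of Algo 2 might look like this sketch: sample a partition to estimate bytes per record, then size files just under an HDFS block (names and margins are illustrative):

```scala
// Pick a record count per output file that lands slightly below one HDFS block,
// per the "one Parquet block per file" goal above.
def targetRecordsPerFile(sampleBytes: Long,
                         sampleRecords: Long,
                         hdfsBlockSize: Long = 128L * 1024 * 1024,
                         safetyMargin: Double = 0.95): Long = {
  val bytesPerRecord = sampleBytes.toDouble / sampleRecords
  ((hdfsBlockSize * safetyMargin) / bytesPerRecord).toLong
}
```

Because each RDD partition is a day, each day gets its own estimate and can re-estimate as it converts.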
Stitching : Experiments
● Cost model for a converted dataset (formula on slide):
○ N: number of Parquet files
○ Si: size of the i-th Parquet file
○ B: HDFS block size
○ First term: local I/O - files slightly smaller than an HDFS block
○ Second term: network I/O - penalty for files spilling over a block boundary
● Benchmark queries
Paricon : Validate
● Modeled as a “source and converted tables join” (see sketch below)
○ Equi-join on primary key
○ Compare the counts
○ Compare the column contents
● SparkSQL
○ Easy to implement
○ Hard to performance-tune
● Debugging tools
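A minimal sketch of the join-based validation, assuming an existing sqlContext with both tables registered; table, key and column names are hypothetical:

```scala
// Compare total counts of the two tables
val srcCount = sqlContext.sql("SELECT COUNT(*) FROM source_table").first().getLong(0)
val dstCount = sqlContext.sql("SELECT COUNT(*) FROM converted_table").first().getLong(0)

// Equi-join on the primary key, then compare column contents
val mismatches = sqlContext.sql(
  """SELECT s.pk, s.fare AS src_fare, c.fare AS dst_fare
    |FROM source_table s
    |JOIN converted_table c ON s.pk = c.pk
    |WHERE s.fare <> c.fare
  """.stripMargin)

println(s"counts: $srcCount vs $dstCount, mismatched rows: ${mismatches.count()}")
```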
Some Production Numbers
● Inferred: >120 topics
● Converted: >40 topics
● Largest single job so far
○ Processed 15 TB compressed (140 TB uncompressed) data
○ One single topic
○ Recovered from multiple failures via checkpoints
● Numbers are increasing ...
Lessons
● Implement custom, finer-grained checkpointing
○ S3 data transfer fees
○ Job/task failures → downloading all the data repeatedly
○ Checkpointing saves money and time
● There is no perfect data cleaning
○ 100% clean is often not needed
● Schema parsing implementation
○ Tricky, and takes significant time to test
Komondor: Problem Statement
● The current Kafka→HDFS ingestion service does too much work:
○ Consume from Kafka → write sequence files → convert to Parquet → upload to HDFS in a Hive-compatible way
○ Parquet generation needs a lot of memory
○ Local writing and uploading is slow
● Need to decouple raw ingestion from consumable data
○ Move the heavy lifting into Spark → keep the raw-data delivery service lean
● A streaming job keeps converting raw data into Parquet as it lands!
Komondor: Kafka Ingestion Service
[Diagram: Kafka → Komondor (streaming raw data delivery) → HDFS; streaming ingestion lands raw data, then batch verification & file stitching produces consumable data]
Komondor: Goals
● Get raw data into permanent storage fast
● Spark Streaming ingestor to ‘cook’ raw data
○ For now, Parquet generation
○ But opens up a polyglot world: ORC, RCFile, ...
● De-duplicate raw data before consumption
○ Shields downstream consumers from the pipelines’ at-least-once delivery
○ Simply replay events for an entire day in the event of a pipeline outage
● Improved wellness of HDFS
○ Avoid too many small files in HDFS
○ File stitcher job to combine small files from past days
INotify DStream: Komondor De-Duplication
INotify DStream: Motivation
● Streaming job picks up raw data files
○ Keeps end-to-end latency low vs a batch job
● Spark Streaming’s FileDStream is not sufficient
○ Only works one directory deep
■ We have at least two levels, for <topic>/<dc>/
○ Provides the file contents directly
■ Loses valuable information in the file name, e.g. the partition number
○ Checkpoint contains the entire file list
■ Will not scale to millions of files
○ Too much overhead to run one job per topic
INotify DStream: HDFS INotify
● Similar to Linux inotify for watching file system changes
● Exposes the HDFS edit log as an event stream
○ CREATE, CLOSE, APPEND, RENAME, METADATA, UNLINK events
○ Introduced at Hadoop Summit 2015
● Provides a transaction id
○ Clients can use it to resume from a given position
● The event log is purged every time the FSImage is uploaded
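For reference, tailing this event stream looks roughly like the following sketch (Hadoop 2.7-style API; the NameNode URI is a placeholder):

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hdfs.client.HdfsAdmin
import org.apache.hadoop.hdfs.inotify.Event

val admin = new HdfsAdmin(new URI("hdfs://namenode:8020"), new Configuration())
val stream = admin.getInotifyEventStream() // or pass a saved lastReadTxid to resume

while (true) {
  val batch = stream.take() // blocks until the NameNode has new edit-log events
  batch.getEvents.foreach {
    case close: Event.CloseEvent => // file fully written - safe to ingest
      println(s"closed: ${close.getPath} (txid ${batch.getTxid})")
    case _ => // CREATE / APPEND / RENAME / METADATA / UNLINK ignored in this sketch
  }
  // Persist batch.getTxid so a restart can resume from this position
}
```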
INotify DStream: Implementation
● Provides the HDFS INotify events as a Spark DStream
○ Implementation very similar to KafkaDirectDStream
● Checkpointing is straightforward:
○ Transactions have unique IDs
○ Just save the transaction ID to permanent storage
● Filed SPARK-10555 - vote it up if you think it is useful :)
INotify DStream: Early Results
● Pretty stable when running on YARN
● HDFS INotify reads ALL events from the NameNode
● Have to add filtering
○ To catch only events of interest (paths/extensions)
○ Performed at the Spark level
● Memory usage increases on the NN while INotify is running
INotify DStream: Future Uses, XDC Replication
● An open possibility, provided INotify proves to be a charm in production
● Uber is thinking about an all-active data architecture
○ This means n HDFS clusters that need to be kept in sync
● Typical batch-based distcp creates bursty network utilization
○ Or you go through scheduling trouble to smooth it out
○ INotify DStream provides a way to keep shipping files as they land
○ Plus the power of Spark for any heavy lifting, such as filtering out sensitive data
Future/Ongoing Work
Our engines are revved up
Forecast: Sunny & Awesome with lots of Spark!
● Spark SQL based ETL platform
○ Powers all tables into the warehouse
● Open up SQL-on-Hadoop via Spark SQL/Hive
○ The Spark shell is already so nifty!
● Machine learning platform using Spark
○ MLlib/GraphX possibilities
● Standardized support for Spark jobs
● Apollo: next-gen real-time analytics using Spark Streaming
○ Watch for our next talk! ;)
We Are Hiring!!! :)
Thank You
(Special kudos to Uber Facilities & Security)
Questions?
Extra Slides
Trips Pipeline : Consolidating Changelogs
● Data model, very similar to BigTable/HBase
○ row_key : uuid for the trip
○ col_key : one column in the trip record
○ version & body : version of the column & a JSON blob
○ cell : unique tuple of {row_key, col_key, version}
● Provides a REST endpoint to tail the cell changelog for every shard
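In plain Scala, the cell model described above amounts to something like this (field types assumed):

```scala
// A cell is uniquely identified by the tuple {rowKey, colKey, version}
case class Cell(rowKey: String, // uuid for the trip
                colKey: String, // one column in the trip record
                version: Long,  // version of the column
                body: String)   // JSON blob with the column value
```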
Trips Pipeline : Challenge
● Existing ingestion turned cell changes into warehouse upserts,
○ Losing the version information
○ Unable to reject older (& duplicate) cell changes in logs coming from XDC replication
{ trip-xxx :
  { “FARE” => { f1: {body: 12.35, version: 11}, f2: {body: val2, version: 10} },
    “ETA”  => { f3: {body: val3, version: 13}, f4: {body: val4, version: 10} } }
}

trip-uuid    FARE_f1    ETA_f3    ETA_f4
trip-xxx     12.35      4         5
trip-xyz     14.50      2         1
Spark At Uber
● Today
○ Paricon: turn the historical JSON into a Parquet gold mine
○ Streamio/Spark SQL: deliver a global view of the trip database into the warehouse in near real time
● Tomorrow
○ INotify DStream:
■ Komondor - the ‘Uber’ data ingestor
■ XDC data replicator
○ Ad-hoc SQL access to data: Hive on Spark/Spark SQL
○ Spark apps: directly accessing data on HDFS
Trips Pipeline : Raw Row Images in HDFS
● Streamio : generic connector of partitioned streams
○ Pluggable in & out stream implementations
● Tails cell changes from both DCs into a Kafka topic
● Uses HBase to construct the full row image (latest value for each column of a trip)
○ Logs the ‘row changelog’ to HDFS
● Preserves the version of the latest cell for each column/row (see sketch below)
○ Can efficiently de-duplicate/reconcile
● Extensible to all Schemaless datastores
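A minimal sketch of the version-based reconciliation, using plain tuples for the cell shape described above (an illustration, not Streamio's actual code):

```scala
import org.apache.spark.rdd.RDD

// cells: (row_key, col_key, version, body) tuples tailed from the changelog.
// Keeping only the highest version per (row, column) de-duplicates and rejects
// older cell changes arriving via XDC replication.
def latestRowImage(cells: RDD[(String, String, Long, String)])
    : RDD[((String, String), (Long, String))] =
  cells
    .map { case (row, col, version, body) => ((row, col), (version, body)) }
    .reduceByKey { (a, b) => if (a._1 >= b._1) a else b }
```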