Big Data Platform at Pinterest
TRANSCRIPT
![Page 1: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/1.jpg)
Confidential
Mao Ye
Big Data Platform at Pinterest
![Page 2: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/2.jpg)
Data Architecture
Design Choices for Hadoop Platform
Pinball for Workflow Management
![Page 3: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/3.jpg)
Data Architecture
![Page 4: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/4.jpg)
Data at Pinterest
• 60 Billion Pins
• 1 Billion boards
• 100M MAU
• 60 PB of data on S3
• 3 PB processed every day
• 2000 node Hadoop cluster
• 250 engineers
![Page 5: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/5.jpg)
Pinterest Data Architecture
App
![Page 6: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/6.jpg)
Pinterest Data Architecture
App
events
Kafka
Secor
Singer
![Page 7: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/7.jpg)
Pinterest Data Architecture
App
events
Kafka
Secor
Singer
![Page 8: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/8.jpg)
Pinterest Data Architecture
App
events
Kafka
Secor
Skyline
Pinball
Redshift
Pinalytics
Features
Qubole (Hadoop)
Singer
![Page 9: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/9.jpg)
Design Choices for Hadoop Platform
![Page 10: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/10.jpg)
Hadoop Platform Requirements
•Isolated multi-tenancy
•Elasticity
•Support multiple clusters
•Ephemeral clusters
•Access control layer
•Shared data store
•Easy deployment
![Page 11: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/11.jpg)
Decoupling compute & storage
Hadoop Cluster 1
Transient HDFS
Hadoop Cluster 2
Transient HDFS
S3 Persistent Store
![Page 12: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/12.jpg)
Centralized Hive Metastore
Hive Metastore
Pig
Cascading
Hive
HDFS/S3
Data
Metadata
![Page 13: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/13.jpg)
Multi-layered Packaging
Runtime Staging (on S3)
•Mapreduce Jobs
•Hadoop Jars/Libs
•Job/User level Configs
Automated Configuration (Masterless Puppet)
•Software Packages/Libs
•Configs (OS/Hadoop)
•Misc Sys Admin
Baked AMI
•OS
•Bootstrap Script
•Core SW
![Page 14: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/14.jpg)
Executor Abstraction Layer
Hive Metastore
HDFS/S3
Qubole
Managed Hadoop
EMR
Executor
Pinball
Dev Server
![Page 15: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/15.jpg)
Why Qubole?
•Hadoop & Spark as managed services
•Tight integration with Hive
•Graceful cluster scaling
•API for simplified executor abstraction
•Advanced support for spot instances
•Baked AMI customization
![Page 16: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/16.jpg)
Pinball for Workflow Management
![Page 17: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/17.jpg)
Scale of Processing
● Scale:
  o 60 Billion Pins
  o Hundreds of workflows
  o Thousands of jobs
  o 500+ jobs in a workflow
  o 3 petabytes processed daily
● Support:
  o Hadoop, Cascading, Hive, Spark …
job
workflow
![Page 18: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/18.jpg)
Why Pinball?
● Requirements
  o Simple abstractions
  o Extensible in future
  o Reliable stateless computing
  o Easy to debug
  o Scales horizontally
  o Can be upgraded w/o aborting workflows
  o Rich features like auto-retries, per-job emails, overrun policies…
● Options
  o Apache Oozie, Azkaban, Luigi
![Page 19: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/19.jpg)
Pinball Design
Master
Worker
Scheduler
Command Line Clients
UI
![Page 20: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/20.jpg)
Workflow Model
● Workflow
  o A directed graph of nodes called jobs
● Edge
  o Run-after dependency
● Node
  o Job is a node
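A minimal sketch of the workflow model above, assuming a workflow can be written as a mapping from each job to the set of jobs it must run after. The job names and the `run_order` helper are hypothetical, and the standard-library `graphlib` module stands in for whatever Pinball does internally.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a job (node); each value is the set of
# jobs it must run after (its incoming "run after" edges).
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "report": {"transform"},
    "load": {"transform"},
}


def run_order(deps):
    """Return one execution order that respects every run-after edge."""
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(deps).static_order())


order = run_order(workflow)
print(order)
```

Any order returned places `extract` before `transform`, and `transform` before both `report` and `load`; the relative order of `report` and `load` is not constrained.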
![Page 21: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/21.jpg)
Job State
● Job state is captured in a token
● Tokens are named hierarchically
Master
Job Token
version: 123
name: /workflow/w1/job
owner: worker_0
expiration: 1234567
data: JobTemplate(....)
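The token fields on this slide can be pictured as a small record. The sketch below is illustrative only: the `Token` class, the `tokens_under` helper, and the `job_a`/`job_b` names are assumptions, not Pinball's actual data structures. It shows one thing hierarchical names make easy, namely selecting every token that belongs to a workflow by prefix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    """Hypothetical record mirroring the fields shown on the slide."""
    name: str                          # hierarchical, e.g. "/workflow/w1/job"
    version: int = 0
    owner: Optional[str] = None        # worker currently holding the token
    expiration: Optional[int] = None   # when the owner's claim lapses
    data: Optional[str] = None         # opaque payload (a job template on the slide)


def tokens_under(tokens, prefix):
    """Return the tokens whose hierarchical name starts with prefix."""
    return [t for t in tokens if t.name.startswith(prefix)]


tokens = [
    Token(name="/workflow/w1/job_a", version=123, owner="worker_0", expiration=1234567),
    Token(name="/workflow/w1/job_b"),
    Token(name="/workflow/w2/job_a"),
]
w1 = tokens_under(tokens, "/workflow/w1/")
print([t.name for t in w1])
```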
![Page 22: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/22.jpg)
Job State Machine
RUNNABLE
RUNNING
WAITING
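The slide names three states but the transcript does not preserve the arrows between them. The sketch below therefore assumes a simple cycle (WAITING to RUNNABLE once dependencies are met, RUNNABLE to RUNNING when a worker picks the job up, RUNNING back to WAITING for the next run). The `ALLOWED` table and `advance` helper are hypothetical.

```python
from enum import Enum


class JobState(Enum):
    WAITING = "WAITING"
    RUNNABLE = "RUNNABLE"
    RUNNING = "RUNNING"


# Assumed transitions; the actual edges in the slide diagram are not
# recoverable from the transcript.
ALLOWED = {
    JobState.WAITING: {JobState.RUNNABLE},
    JobState.RUNNABLE: {JobState.RUNNING},
    JobState.RUNNING: {JobState.WAITING},
}


def advance(current, target):
    """Return target if the transition is allowed, else raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


state = JobState.WAITING
state = advance(state, JobState.RUNNABLE)
state = advance(state, JobState.RUNNING)
state = advance(state, JobState.WAITING)
print(state.name)
```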
![Page 23: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/23.jpg)
Master Worker Interaction
● Master keeps the state
● Workers claim and execute tasks
● Horizontally scalable
Worker Master Persistent Store
1: request
2: update
3: ack
![Page 24: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/24.jpg)
Master
● Entire state is kept in memory
● Each state update is synchronously persisted before master replies to client
● Master runs on a single thread – no concurrency issues
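A toy sketch of the request, update, ack sequence together with the master properties listed here: state lives in an in-memory dict, and every change is written to a persistent store before the caller gets a reply. The `Master` class, its method names, the append-only log file, and the lease interpretation of `owner`/`expiration` are all assumptions for illustration, not Pinball's implementation.

```python
import json
import os
import tempfile
import time


class Master:
    """Toy single-threaded master: in-memory state, persisted before replying."""

    def __init__(self, log_path):
        self.tokens = {}          # token name -> token dict (the in-memory state)
        self.log_path = log_path  # append-only file standing in for the store

    def _persist(self, token):
        # Step 2 (update): write the new token state and force it to disk.
        with open(self.log_path, "a") as f:
            f.write(json.dumps(token) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def add(self, name):
        token = {"name": name, "version": 1, "owner": None, "expiration": None}
        self._persist(token)
        self.tokens[name] = token

    def claim(self, name, worker, lease_sec, now=None):
        """Step 1 (request): a worker asks to own a token for lease_sec seconds."""
        now = time.time() if now is None else now
        current = self.tokens[name]
        if current["owner"] is not None and current["expiration"] > now:
            return None  # another worker holds an unexpired claim
        updated = dict(current, owner=worker, expiration=now + lease_sec,
                       version=current["version"] + 1)
        self._persist(updated)       # persisted before the reply is sent
        self.tokens[name] = updated
        return updated               # Step 3 (ack)


log_path = os.path.join(tempfile.mkdtemp(), "tokens.log")
master = Master(log_path)
master.add("/workflow/w1/job")
first = master.claim("/workflow/w1/job", "worker_0", lease_sec=60, now=1000)
second = master.claim("/workflow/w1/job", "worker_1", lease_sec=60, now=1010)
third = master.claim("/workflow/w1/job", "worker_1", lease_sec=60, now=2000)
```

In this sketch the second claim is refused because `worker_0` still holds the token, while the third succeeds because the assumed lease has expired. Since a single thread handles every call, no locking is needed around the dict.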
![Page 25: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/25.jpg)
Worker
![Page 26: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/26.jpg)
Open Source
Git repo: https://github.com/pinterest/pinball
Mailing list: https://groups.google.com/forum/#!forum/pinball-users
![Page 27: Big Data Platform at Pinterest](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f9a958760da3da068b6db0/html5/thumbnails/27.jpg)
Thank You