The Big Data Ecosystem at LinkedIn
Jay Kreps
Me
• Background in data not infrastructure
• LinkedIn’s SNA team
• Original co-author of some LinkedIn open source projects (Voldemort, Azkaban, Kafka)
This Talk
• We are in a renaissance of data infrastructure.
• How do all these pieces fit together?
Why the current obsession with “Big Data”?
The goal of modern data infrastructure is to make many small computers act like one big one.
The Old Picture
The New Picture
Polyglot persistence?
Infrastructure Icebergs
• 90k lines of tooling and monitoring, 30k lines of logic
• Dedicated engineers, operations
• Training
• First three nines come from operations
This is (still) a very immature space. Which systems should we have?
• Infrastructure is sculpted by applications and constraints
• Projects are defined by trade-offs
Constraints
• Hardware
  – Jeff Dean: Numbers everyone should know
  – David Patterson: Latency lags bandwidth
  – $$$
• Other
  – Path dependence
  – Complexity
  – Resources
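A back-of-the-envelope sketch of why "latency lags bandwidth" shapes system design. The figures below are rough, order-of-magnitude assumptions for a spinning disk (about 10 ms per seek, about 100 MB/s sequential throughput), not numbers taken from these slides, and the function names are hypothetical.

```python
# Assumed, order-of-magnitude hardware figures (illustrative only).
SEEK_SECONDS = 0.010               # one random disk seek
SEQUENTIAL_BYTES_PER_SEC = 100e6   # sequential disk read throughput


def random_read_seconds(num_records: int) -> float:
    """Time to fetch records one seek at a time (transfer time ignored)."""
    return num_records * SEEK_SECONDS


def sequential_read_seconds(num_records: int, record_bytes: int) -> float:
    """Time to stream the same records in one scan after a single seek."""
    total_bytes = num_records * record_bytes
    return SEEK_SECONDS + total_bytes / SEQUENTIAL_BYTES_PER_SEC


if __name__ == "__main__":
    n, size = 1_000_000, 100  # one million 100-byte records
    print(f"random access:   {random_read_seconds(n):,.0f} s")
    print(f"sequential scan: {sequential_read_seconds(n, size):,.2f} s")
```

Under these assumptions the seek-per-record approach is several orders of magnitude slower than a scan, which is one reason batch and log-oriented systems favor sequential I/O.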
Applications
Common categories of non-CRUD
• Recommendations & Matching
• Graphs
• Search
• Data Normalization
• News feed
• Analysis & Monitoring
Social Graph
Search
Recommendations: People
Recommendations: Jobs
Recommendations: Newsfeed
Data Normalization
Analytics
Infrastructure
• Search
  – Lucene
  – Bobo (facets), Zoie (real-time indexing), Sensei (distribution)
• Social Graph
• Storage
  – Oracle
  – Voldemort
  – Espresso
• Streams
  – Databus
  – Kafka
• Offline
  – Hadoop & friends (Pig, Hive, Azkaban, etc)
Three Major Paradigms
• Request/Response
  – Search
  – Social Graph
  – Storage
• Streams
  – Kafka
• Batch
  – Hadoop
Most features are multi-paradigm
Request/Response
• Search
• Social Graph
• Storage
  – Voldemort
  – Espresso
Request/Response Patterns
• Broker, scatter-gather
  – Storage systems: only
• Partitioning strategy
• Latency oriented
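A minimal sketch of the broker and scatter-gather pattern over hash-partitioned data. This is an illustrative in-memory toy, not the design of any LinkedIn system; the `Partition` and `Broker` classes and the CRC32-based partitioning are assumptions made for the example.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


class Partition:
    """One shard holding a subset of the documents (in-memory stand-in)."""

    def __init__(self):
        self.docs = {}

    def search(self, term):
        # Local scan of this shard only.
        return [key for key, text in self.docs.items() if term in text]


class Broker:
    """Routes keyed requests to one partition and fans searches out to all."""

    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def _partition_for(self, key):
        # Stable hash, so a given key always maps to the same partition.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        return self.partitions[index]

    def put(self, key, text):
        self._partition_for(key).docs[key] = text

    def get(self, key):
        # Keyed lookup touches exactly one partition.
        return self._partition_for(key).docs.get(key)

    def search(self, term):
        # Scatter the query to every partition in parallel, then gather
        # and merge the partial results.
        with ThreadPoolExecutor(max_workers=len(self.partitions)) as pool:
            partials = list(pool.map(lambda p: p.search(term), self.partitions))
        return sorted(key for partial in partials for key in partial)


if __name__ == "__main__":
    broker = Broker(num_partitions=4)
    broker.put("doc1", "kafka streams")
    broker.put("doc2", "hadoop batch")
    broker.put("doc3", "kafka and hadoop")
    print(broker.search("kafka"))
```

Keyed reads and writes hit a single partition, while a search pays for the slowest partition it fans out to, which is why the partitioning strategy matters so much for latency-oriented systems.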
Batch: Hadoop
• Uses
  – Ad hoc
  – Production batch
• Ecosystem
• Hive, Pig
• Azkaban (workflow)
• Avro data
• Data in: Kafka
• Data out: Voldemort, Kafka
Why do batch if you have real-time?
• Batch advantages
  – Safety
  – Easy
  – Throughput
  – Simplicity
  – Economics
• Tricky bit: engineering the data cycle
Why do streaming?
• You have to glue all these systems together
• Throughput as good as batch
• Latency much better
• Metaphor more natural for low latency than Hadoop
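One way to picture a stream as the glue between systems is an append-only log that each downstream system reads at its own pace. The sketch below is a single-process toy under that assumption; the `Log` and `Consumer` classes are hypothetical and are not Kafka's API.

```python
class Log:
    """Append-only sequence of records addressed by integer offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # Returns the offset assigned to the new record.
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=100):
        # Reading never removes data, so any number of consumers can
        # read the same records independently.
        return self._records[offset:offset + max_records]


class Consumer:
    """Tracks its own position in the log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=100):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


if __name__ == "__main__":
    log = Log()
    for event in ["view:1", "view:2", "click:1"]:
        log.append(event)

    search_indexer = Consumer(log)   # hypothetical low-latency consumer
    hadoop_loader = Consumer(log)    # hypothetical batch consumer

    print(search_indexer.poll(max_records=1))
    print(hadoop_loader.poll())
```

Because each consumer keeps its own offset, a low-latency consumer and a periodic bulk loader can share one feed without coordinating with each other.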
What makes successful infrastructure systems?
• Operability and Operations
• Monitoring
• Simplicity
• Documentation
• Broad adoption
• Lazy users
• Open source
Open Source
• Data > Infrastructure
• Open source creates better code, even with few outside contributors
• Commercial infrastructure not interesting
Open Source Projects
• We made
  – Voldemort: Key/Value storage
  – Sensei, Bobo, Zoie: Elastic, faceted, real-time search with Lucene
  – Kafka: Persistent, distributed data streams
  – Norbert: Cluster aware RPC, load balancing, and group membership
  – And others…
• We stole
  – Hadoop, Pig, Hive
  – Lucene
  – Netty, Jetty
  – Zookeeper
  – Avro
  – Apache Traffic Server
The End
[email protected]
http://www.linkedin.com/in/jaykreps
http://twitter.com/jaykreps
http://sna-projects.com