A Survey of Petabyte-Scale Databases and Storage Systems Deployed at Facebook
DESCRIPTION
At Facebook, we use several types of databases and storage systems to satisfy the needs of different applications. The solutions built around these data stores share a common set of requirements: they have to be highly scalable, maintenance costs must be low, and they have to perform efficiently. We use a sharded MySQL+memcache solution to support real-time access to tens of petabytes of data, and we use TAO to provide consistency for this web-scale database across geographical distances. We use the Haystack datastore to store the 3 billion new photos we host every week. We use Apache Hadoop to mine intelligence from 100 petabytes of click logs, and we combine it with the power of Apache HBase to store all Facebook Messages. This talk describes why each of these databases is appropriate for its workload, and the design decisions and tradeoffs that were made while implementing these solutions. We touch upon the consistency, availability, and partition tolerance of each of these solutions, and upon the reasons why some of these systems need ACID semantics while others do not. We also briefly discuss how we plan to do big-data deployments across geographical locations, and our requirements for a new breed of pure-memory and pure-SSD transactional databases.
TRANSCRIPT
Petabyte Scale Data at Facebook
Dhruba Borthakur, Engineer at Facebook. BigDataCloud, Santa Clara, Oct 18, 2012
Agenda
1 Types of Data
2 Data Model and API for Facebook Graph Data
3 SLTP (Semi-online Light Transaction Processing) data
4 Immutable data store for photos, videos, etc.
5 Hadoop/Hive analytics storage
Four major types of storage systems
▪ Online Transaction Processing Databases (OLTP)
▪ The Facebook Social Graph
▪ Semi-online Light Transaction Processing Databases (SLTP)
▪ Facebook Messages and Sensory Time Series Data
▪ Immutable DataStore
▪ Photos, videos, etc
▪ Analytics DataStore
▪ Data Warehouse, Logs storage
Size and Scale of Databases

| Database | Total Size | Technology | Bottlenecks |
| --- | --- | --- | --- |
| Facebook Graph | Single-digit petabytes | MySQL and TAO/memcache | Random read IOPS |
| Facebook Messages and Time Series Data | Tens of petabytes | Hadoop HDFS and HBase | Write IOPS and storage capacity |
| Facebook Photos | High tens of petabytes | Haystack | Storage capacity |
| Data Warehouse | Hundreds of petabytes | Hive, HDFS and Hadoop | Storage capacity |
Characteristics

| Database | Query Latency | Consistency | Durability |
| --- | --- | --- | --- |
| Facebook Graph | < few milliseconds | Quickly consistent across data centers | No data loss |
| Facebook Messages and Time Series Data | < 200 millisec | Consistent within a data center | No data loss |
| Facebook Photos | < 250 millisec | Immutable | No data loss |
| Data Warehouse | < 1 min | Not consistent across data centers | No silent data loss |
Facebook Graph: Objects and Associations
Objects & Associations
Data model (example from the slide):
▪ Objects: 6205972929 (story), 8636146 (user), 604191769 (user), 18429207554 (page)
▪ The page object carries typed fields: name: Barack Obama, birthday: 08/04/1961, website: http://…, verified: 1, …
▪ Typed, directed associations connect the objects: likes, liked by, fan, friend, admin
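The graph on this slide can be sketched as plain Python dictionaries. The ids and field values come from the slide; the helper functions are hypothetical (not the real TAO API), and which objects each edge connects is an illustrative guess based on the slide's layout.

```python
# Minimal sketch of the objects-and-associations data model.
# Objects are typed nodes with key/value fields; associations are
# typed, directed edges keyed by (id1, association_type).

obj = {}     # id -> {"type": ..., fields...}
assoc = {}   # (id1, atype) -> list of id2s

def add_object(oid, otype, **fields):
    obj[oid] = {"type": otype, **fields}

def add_assoc(id1, atype, id2):
    assoc.setdefault((id1, atype), []).append(id2)

add_object(18429207554, "page", name="Barack Obama",
           birthday="08/04/1961", verified=1)
add_object(8636146, "user")
add_object(604191769, "user")
add_object(6205972929, "story")

add_assoc(8636146, "fan", 18429207554)     # user is a fan of the page
add_assoc(8636146, "friend", 604191769)    # friendship is stored both ways
add_assoc(604191769, "friend", 8636146)
add_assoc(604191769, "likes", 6205972929)  # user likes the story
add_assoc(6205972929, "liked_by", 604191769)

print(assoc[(8636146, "friend")])  # [604191769]
```

Note that the inverse edge (likes / liked_by) is stored explicitly, so either endpoint can answer queries from its own shard.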
Facebook Social Graph: TAO and MySQL
An OLTP workload:
▪ Uneven read heavy workload
▪ Huge working set with creation-time locality
▪ Highly interconnected data
▪ Constantly evolving
▪ As consistent as possible
Cache & Storage Architecture
Web servers → TAO storage cache → MySQL storage
Sharding
▪ Object ids and Assoc id1s are mapped to shard ids
▪ Architecture (from the slide): web servers query a TAO cache tier holding shards s1-s8, backed by MySQL hosts db1-db8
Workload
▪ Read-heavy workload
▪ Significant range queries
▪ The LinkBench benchmark is being open-sourced
▪ http://www.github.com/facebook/linkbench
▪ Real data distribution of Assocs and their access patterns
Messages & Time Series Database
An SLTP workload
Facebook Messages
Emails Chats SMS Messages
Why we chose HBase and HDFS
▪ High write throughput
▪ Horizontal scalability
▪ Automatic Failover
▪ Strong consistency within a data center
▪ Benefits of HDFS: fault tolerant, scalable, MapReduce toolset
▪ Why is this SLTP?
▪ Semi-online: Queries run even if part of the database is offline
▪ Light Transactions: single row transactions
▪ Storage capacity bound rather than iops or cpu bound
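The "light transactions" point above can be illustrated with a toy sketch of single-row atomicity in the style of HBase's per-row guarantees. The class and method names are hypothetical, not the HBase API; the point is that all columns in one put commit together, while multi-row transactions are simply not offered.

```python
import threading

class Row:
    """Toy row with HBase-style single-row atomicity: one lock per
    row, so updates within a row are atomic, but there is no way to
    make two rows change together."""

    def __init__(self):
        self.cols = {}
        self._lock = threading.Lock()

    def put(self, updates):
        with self._lock:            # the whole-row update is atomic
            self.cols.update(updates)

    def check_and_put(self, col, expected, updates):
        # Compare-and-set on one column, then apply the update,
        # all under the same row lock.
        with self._lock:
            if self.cols.get(col) != expected:
                return False
            self.cols.update(updates)
            return True

row = Row()
row.put({"subject": "hi", "unread": 1})
ok = row.check_and_put("unread", 1, {"unread": 0})
print(ok, row.cols["unread"])   # True 0
```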
What we store in HBase
▪ Small messages
▪ Message metadata (thread/message indices)
▪ Search index
▪ Large attachments stored in Haystack (photo store)
Size and scale of Messages Database
▪ 6 Billion messages/day
▪ 74 Billion operations/day
▪ At peak: 1.5 million operations/sec
▪ 55% read, 45% write operations
▪ Average write operation inserts 16 records
▪ All data is LZO compressed
▪ Growing at 8 TB/day
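A back-of-the-envelope check of these numbers. All inputs come from the bullets above; the derived figures are simple arithmetic.

```python
# 74 billion operations/day implies this sustained average rate:
ops_per_day = 74e9
avg_ops_per_sec = ops_per_day / 86_400
print(f"{avg_ops_per_sec/1e6:.2f}M ops/sec on average")  # vs 1.5M at peak

# 45% of operations are writes, and an average write inserts
# 16 records, which implies the daily record-insert volume:
write_ops_per_day = 0.45 * ops_per_day
records_per_day = write_ops_per_day * 16
print(f"{records_per_day/1e9:.0f}B records inserted/day")
```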
Haystack: The Photo Store
Facebook Photo DataStore

| | 2009 | 2012 |
| --- | --- | --- |
| Total Size | 15 billion photos, 1.5 petabytes | High tens of petabytes |
| Upload Rate | 30 million photos/day (3 TB/day) | 300 million photos/day (30 TB/day) |
| Serving Rate | | 555K images/sec |
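The upload numbers in the table imply an average stored photo size, which is a quick sanity check on the figures:

```python
# 30 TB/day spread over 300 million photos/day:
photos_per_day = 300e6
bytes_per_day = 30e12               # 30 TB/day
avg_photo_kb = bytes_per_day / photos_per_day / 1e3
print(f"~{avg_photo_kb:.0f} KB stored per uploaded photo")
```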
Haystack based Design
Components (from the slide): Browser, Web Server, CDN, Haystack Directory, Haystack Cache, Haystack Store
Haystack Internals
▪ Log structured, append-only object store
▪ Built on commodity hardware
▪ Application-aware replication
▪ Images stored as needles in 100 GB XFS store files
▪ An in-memory index for each store file
▪ 32 bytes of index per photo
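A minimal sketch of the bullets above: a log-structured, append-only store with an in-memory index. Real Haystack needles carry headers, checksums, and deletion flags; this toy version keeps only the core idea, one append per write and one index lookup plus one disk read per photo fetch.

```python
import os
import tempfile

class HaystackSketch:
    """Toy append-only photo store: bytes go at the end of one big
    file, and an in-memory dict maps photo_id -> (offset, size)."""

    def __init__(self, path):
        self.f = open(path, "wb+")
        self.index = {}                       # photo_id -> (offset, size)

    def put(self, photo_id, data):
        self.f.seek(0, os.SEEK_END)
        self.index[photo_id] = (self.f.tell(), len(data))
        self.f.write(data)                    # append-only, never overwrite
        self.f.flush()

    def get(self, photo_id):
        offset, size = self.index[photo_id]   # avoids any metadata I/O
        self.f.seek(offset)
        return self.f.read(size)

path = os.path.join(tempfile.mkdtemp(), "store.dat")
store = HaystackSketch(path)
store.put(42, b"jpeg-bytes")
print(store.get(42))   # b'jpeg-bytes'
```

With roughly 32 bytes of index per photo, a billion-photo store file tier needs on the order of 32 GB of RAM for its indexes, which is what makes one-read fetches affordable.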
Hive Hadoop Analytics Warehouse
Life of a photo tag in Hadoop/Hive storage
▪ www.facebook.com: a user tags a photo; a log line <user_id, photo_id> is generated
▪ The log line reaches Scribeh, the Scribe log storage on HDFS (10 sec)
▪ A copier/loader moves the log line into the Hive warehouse (15 min)
▪ Scrapes of the MySQL DB bring user info into the warehouse (1 day)
▪ Periodic analysis (Hive): daily report on count of photo tags by country (1 day)
▪ Ad hoc analysis (Hive): count photos tagged by females age 20-25 yesterday
▪ Real-time analytics (HBase, Puma): count users tagging photos in the last hour (1 min)
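A toy version of the pipeline above: each tag produces a `<user_id>\t<photo_id>` log line, and a periodic job joins the lines with scraped user info to count tags by country. The log lines and the country lookup table here are made up for illustration; the lookup stands in for the daily MySQL scrape.

```python
from collections import Counter

# Two hypothetical log lines, using the ids from the graph slide.
log_lines = ["8636146\t6205972929", "604191769\t6205972929"]

# Stand-in for the scraped user dimension table in the warehouse.
user_country = {8636146: "US", 604191769: "US"}

# The periodic-analysis step: join log lines to user info, count.
tags_by_country = Counter(
    user_country[int(line.split("\t")[0])] for line in log_lines
)
print(dict(tags_by_country))   # {'US': 2}
```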
Analytics Data Growth (last 4 years)

| | Facebook Users | Queries/Day | Scribe Data/Day | Nodes in warehouse | Size (Total) |
| --- | --- | --- | --- | --- | --- |
| Growth | 14X | 60X | 250X | 260X | 2500X |
Why use Hadoop instead of a Parallel DBMS?
▪ Stonebraker/DeWitt from the DBMS community:
▪ Called MapReduce a "major step backwards"
▪ Published benchmark results which show that Hive is not as performant as a traditional DBMS
▪ http://database.cs.brown.edu/projects/mapreduce-vs-dbms/
Why Hadoop/Hive? Prospecting for Gold
▪ "Gold in the wild west"
▪ A platform for huge data-experiments
▪ A majority of queries are searching for a single gold nugget
▪ Great advantage in keeping all data in one queryable system
▪ No structure to data, specify structure at query time
Why Hadoop/Hive? Open to all users
▪ Open Warehouse
▪ No need for DBA approval to create a new table
▪ This is possible only because Hadoop is horizontally scalable
▪ Provisioning excess capacity is cheaper than traditional databases
▪ Lower $$/GB ($$/query-latency not relevant here)
Why Hadoop/Hive? Crowd Sourcing
▪ Crowd sourcing for data discovery
▪ There are 50K tables in a single warehouse
▪ Users are DBAs themselves
▪ Questions about a table are directed to users of that table
▪ Automatic query lineage tools help here
Why Hadoop/Hive? Open Data Format
▪ No lock-in
▪ Hive/Hadoop is open source
▪ Data is open-format, one can access data below the database layer
▪ HBase can query the same data that is in HDFS
▪ Want a new UDF? No problem: very easily extensible
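One concrete extensibility mechanism is Hive's TRANSFORM clause, which streams table rows through any external script. A sketch of such a script follows; the column layout (`user_id`, `age`) and the bucketing logic are hypothetical, chosen to echo the "age 20-25" query earlier in the deck.

```python
import sys

# Streaming "UDF" for Hive, used roughly as:
#   SELECT TRANSFORM (user_id, age) USING 'python age_bucket.py'
#          AS (user_id, age_bucket) FROM users;
# Hive feeds tab-separated rows on stdin and reads rows from stdout.

def bucket(age):
    """Map an age to a 5-year bucket label, e.g. 22 -> '20-24'."""
    lo = age // 5 * 5
    return f"{lo}-{lo + 4}"

if __name__ == "__main__" and not sys.stdin.isatty():
    for line in sys.stdin:
        user_id, age = line.rstrip("\n").split("\t")
        print(f"{user_id}\t{bucket(int(age))}")
```

Because the script is just a process reading stdin and writing stdout, any language works and no database-side plugin registration is needed.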
Why Hadoop/Hive? Benchmarks
▪ Shortcomings of existing DBMS benchmarks:
▪ Most DBMS benchmarks measure only latency
▪ Do not measure $$/Gigabyte-of-data
▪ Do not test fault tolerance – kill machines
▪ Do not measure elasticity – add and remove machines
▪ Do not measure throughput – concurrent queries in parallel
New trends in storage software
▪ Trends:
▪ SSDs cheaper, increasing number of CPUs per server
▪ How can BigData use SSDs?
▪ SATA disk capacities reaching 4 - 8 TB per disk, falling prices $/GB
▪ How can BigData become even bigger?