dissecting scalable database architectures
DESCRIPTION
Presentation by Doug Judd, co-founder of Hypertable Inc, at Groupon office in Palo Alto, CA on November 15th, 2012.TRANSCRIPT
![Page 1: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/1.jpg)
Dissecting Scalable Database ArchitecturesDoug JuddCEO, Hypertable Inc.
![Page 2: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/2.jpg)
Talk Outline• Scalable “NoSQL” Architectures• Next-generation Architectures• Future Evolution - Hardware Trends
![Page 3: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/3.jpg)
Scalable NoSQLArchitecture Categories• Auto-sharding• Dynamo• Bigtable
![Page 4: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/4.jpg)
Auto-Sharding
![Page 5: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/5.jpg)
Auto-Sharding
![Page 6: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/6.jpg)
Auto-sharding Systems• Oracle NoSQL Database• MongoDB
![Page 7: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/7.jpg)
Dynamo• “Dynamo: Amazon’s Highly Available Key-value Store”
– Amazon.com, 2007• Distributed Hash Table (DHT)• Handles inter-datacenter replication• Designed for High Write Availability
![Page 8: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/8.jpg)
Consistent Hashing
![Page 9: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/9.jpg)
Eventual Consistency
![Page 10: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/10.jpg)
Vector Clocks
![Page 11: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/11.jpg)
Dynamo-based Systems• Cassandra• DynamoDB• Riak• Voldemort
![Page 12: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/12.jpg)
Bigtable• “Bigtable: A Distributed Storage System for Structured Data”
- Google, Inc., OSDI ’06• Ordered• Consistent• Not designed to handle inter-datacenter replication
![Page 13: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/13.jpg)
Google Architecture
![Page 14: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/14.jpg)
Google File System
![Page 15: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/15.jpg)
Google File System
![Page 16: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/16.jpg)
Table: Growth Process
![Page 17: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/17.jpg)
Scaling (part 1)
![Page 18: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/18.jpg)
Scaling (part 2)
![Page 19: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/19.jpg)
Scaling (part 3)
![Page 20: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/20.jpg)
System overview
![Page 21: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/21.jpg)
Database Model
• Sparse, two-dimensional table with cell versions• Cells are identified by a 4-part key
• Row (string)• Column Family• Column Qualifier (string)• Timestamp
![Page 22: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/22.jpg)
Table: Visual Representation
![Page 23: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/23.jpg)
Table: Actual Representation
![Page 24: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/24.jpg)
Anatomy of a Key• Column Family is represented with 1 byte• Timestamp and revision are stored big-endian,
ones-compliment• Simple byte-wise comparison
![Page 25: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/25.jpg)
Log Structured Merge Tree
![Page 26: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/26.jpg)
Range Server: CellStore• Sequence of 65K blocks of
compressed key/value pairs
![Page 27: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/27.jpg)
Bloom Filter• Associated with each Cell Store• Dramatically reduces disk access• Tells you if key is definitively not present
![Page 28: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/28.jpg)
Request Routing
![Page 29: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/29.jpg)
Bigtable-based Systems• Accumulo• HBase• Hypertable
![Page 30: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/30.jpg)
Next-generation Architectures
• PNUTS (Yahoo, Inc.)• Spanner (Google, Inc.)• Dremel (Google, Inc.)
![Page 31: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/31.jpg)
PNUTS
• Geographically distributed database• Designed for low-latency access• Manages hashed or ordered tables of records
• Hashed tables implemented via proprietary disk-based hash• Ordered tables implemented with MySQL+InnoDB
• Not optimized for bulk storage (image, videos, …)• Runs as a hosted service inside Yahoo!
![Page 32: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/32.jpg)
PNUTS System Architecture
![Page 33: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/33.jpg)
Record-level Mastering
• Provides per-record timeline consistency• Master is adaptively changed to suit workload• Region names are two bytes associated with each record
![Page 34: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/34.jpg)
PNUTS API
• Read-any• Read-critical(required_version)• Read-latest• Write• Test-and-set-write(required_version)
![Page 35: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/35.jpg)
Spanner
• Globally distributed database (cross-datacenter replication)• Synchronously Replicated• Externally-consistent distributed transactions• Globally distributed transaction management• SQL-based query language
![Page 36: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/36.jpg)
Spanner Server Organization
![Page 37: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/37.jpg)
Spanserver
• Manages 100-1000 tablets• A tablet is similar to a Bigtable tablet and manages a bag of
mappings: (key:string, timestamp:int64) -> string
• Single Paxos state machine implemented on top of each tablet• Tablet may contain multiple directories
• Set of contiguous keys that share a common prefix• Unit of data placement• Can be moved between Tablets for performance reasons
![Page 38: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/38.jpg)
TrueTime
• Universal Clock• Set of time master servers per-datacenter
• GPL clock via GPS receivers with dedicated antennas• Atomic clock
• Time daemon runs on every machine• TrueTime API:
![Page 39: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/39.jpg)
Spanner Software Stack
![Page 40: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/40.jpg)
Externally-consistent Operations• Read-Write Transaction• Read-Only Transaction• Snapshot Read (client-provided timestamp)• Snapshot Read (client-provided bound)• Schema Change Transaction
![Page 41: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/41.jpg)
Dremel
• Scalable, interactive ad-hoc query system• Designed to operate on read-only data• Handles nested data (Protocol Buffers)• Can run aggregation queries over trillion-row tables in seconds
![Page 42: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/42.jpg)
Columnar Storage Format
• Novel format for storing lists of nested records (Protocol Buffers)
• Highly space-efficient• Algorithm for dissecting list of nested records into columns• Algorithm for reassembling columns into list of records
![Page 43: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/43.jpg)
Multi-level Execution Trees
• Execution model for one-pass aggregations returning small and medium-sized results (very common at Google)
• Query gets re-written as it passes down the execution tree.• On the way up, intermediate servers perform a parallel
aggregation of partial results.
![Page 44: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/44.jpg)
Performance
![Page 45: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/45.jpg)
Example Queries
• SELECT SUM(CountWords(txtField)) / COUNT(*) FROM T1
• SELECT country, SUM(item.amount) FROM T2GROUP BY country
• SELECT domain, SUM(item.amount) FROM T2WHERE domain CONTAINS ’.net’GROUP BY domain
• SELECT COUNT(DISTINCT a) FROM T5
![Page 46: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/46.jpg)
Future Evolution - Hardware Trends• SSD Drives• Disk Drives• Networking
![Page 47: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/47.jpg)
Flash Memory Rated Lifetime(P/E Cycles)
Source: Bleak Future of NAND Flash Memory, Grupp et al., FAST 2012
![Page 48: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/48.jpg)
Flash Memory Average BER at Rated Lifetime
Source: Bleak Future of NAND Flash Memory, Grupp et al., FAST 2012
![Page 49: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/49.jpg)
Disk: Areal Density Trend
Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011
![Page 50: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/50.jpg)
Disk: Maximum SustainedBandwidth Trend
Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011
![Page 51: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/51.jpg)
Time Required to Sequentially Fill a SATA Drive
![Page 52: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/52.jpg)
Average Seek Time
Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011
![Page 53: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/53.jpg)
Average Rotational Latency
Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011
![Page 54: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/54.jpg)
Time Required to Randomly Read a SATA Drive
![Page 55: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/55.jpg)
Ethernet• 10GbE
• Starting to replace 1GbE for server NICs• De facto network port for new servers in 2014
• 40GbE• Data center core & aggregation• Top-of-rack server aggregation
• 100GbE• Service Provider core and aggregation• Metro and large Campus core• Data center core & aggregation
• No technology currently exists to transport 40 Gbps or 100 Gbps as a single stream over existing copper or fiber
• 40GbE & 100GbE solved using either 4 or 10 parallel 10GbE “lanes”
![Page 56: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/56.jpg)
10GbE Adoption Curve (?)
Source: CREHAN RESEARCH Inc. © Copyright 2012
![Page 57: Dissecting Scalable Database Architectures](https://reader033.vdocuments.us/reader033/viewer/2022061303/5492333db47959640d8b574d/html5/thumbnails/57.jpg)
The EndThank you!