What does it take to make Google work at scale?
Malte Schwarzkopf [@ms705]
Cambridge Systems at Scale – http://camsas.org – @CamSysAtScale
What happens here?
What happens in those 139ms?
[Diagram: your computer sends a query across the Internet to a Google datacenter, where a front-end web server fans it out to back-end index servers (web idx, places idx, G+ idx).]
What we’ll chat about
1. Datacenter hardware
2. Scalable software stacks
   a. Google
   b. Facebook
3. Differences compared to HPC
4. Current research directions
Please ask questions - any time! :)
10,000+ machines
“Machine”: no chassis! DC battery, ~6 disk drives, mostly custom-made.
Network: ToR switch; multi-path core, e.g., 3-stage fat tree or Clos topology.
How’s this different to HPC?
Emphasis on commodity hardware: no expensive interconnect, mid-range machines; the energy/performance/cost trade-off is essential.
Massive automation: very small number of on-site staff, automated software bootstrap.
Fault-tolerant design: each component can fail; software must be aware and compensate.
The software side
[Diagram: many machines, each running a customized Linux kernel and containers; transparent distributed systems span the machines. We'll look at what goes here!]
The Google Stack
Figure from M. Schwarzkopf, “Operating system support for warehouse-scale computing”, PhD thesis, University of Cambridge, 2015 (to appear).
GFS/Colossus
Bulk data block storage system: optimized for large files (GB-size); small files are supported, but not the common case; read, write, and append modes.
Colossus = GFSv2, adds some improvements: e.g., Reed-Solomon-based erasure coding, better support for latency-sensitive applications, and a sharded meta-data layer rather than a single master.
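The exact Colossus encoding parameters aren't stated here, so the (10 data, 4 parity) Reed-Solomon layout in this sketch is an assumption; it only illustrates why erasure coding is attractive compared to plain replication.

```python
# Hypothetical illustration: raw storage needed per logical byte under
# 3-way replication vs. a (10 data, 4 parity) Reed-Solomon code.
def replication_overhead(copies=3):
    # Each logical byte is stored 'copies' times.
    return float(copies)

def reed_solomon_overhead(data_blocks=10, parity_blocks=4):
    # A stripe of 'data_blocks' blocks is extended with 'parity_blocks' parity blocks.
    return (data_blocks + parity_blocks) / data_blocks

print(replication_overhead())    # 3.0x raw storage, tolerates 2 lost copies
print(reed_solomon_overhead())   # 1.4x raw storage, tolerates 4 lost blocks
```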
GFS/Colossus: architecture
[Diagram: an application links against the GFS client library; the library asks the GFS master (which holds the namespace and metadata) where chunks live, then reads and writes data directly from/to the chunkservers.]
GFS/Colossus: writes
[Diagram: the application's GFS library obtains chunk locations from the GFS master (namespace, metadata), then pushes data to the first chunkserver, which forwards it along a chain of replica chunkservers: chain replication.]
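A minimal sketch of the chain-replication idea (illustrative only: the class and chunkserver names are made up, and the real GFS/Colossus write path involves leases, pipelining, and version checks that are omitted here): each replica applies the write and forwards it to the next one, and the acknowledgement only comes back once the tail has stored the data.

```python
# Minimal chain-replication sketch (not GFS/Colossus code).
class Replica:
    def __init__(self, name, next_replica=None):
        self.name = name
        self.next = next_replica
        self.chunks = {}  # chunk_id -> bytes

    def write(self, chunk_id, data):
        # Apply locally, then forward down the chain; the ack propagates
        # back only once the tail (last replica) has stored the data.
        self.chunks[chunk_id] = data
        if self.next is not None:
            return self.next.write(chunk_id, data)
        return f"ack from tail {self.name}"

# Three replicas chained head -> middle -> tail.
tail = Replica("chunkserver-3")
middle = Replica("chunkserver-2", tail)
head = Replica("chunkserver-1", middle)
print(head.write("chunk-42", b"hello"))  # ack from tail chunkserver-3
```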
The Google Stack
Figure from M. Schwarzkopf, “Operating system support for warehouse-scale computing”, PhD thesis, University of Cambridge, 2015 (to appear).
Details & Bibliography: http://malteschwarzkopf.de/research/assets/google-stack.pdf
BigTable (2006)
• ‘Three-dimensional’ key-value store:
  – <row key, column key, timestamp> → value
• Effectively a distributed, sorted, sparse map
Each cell is addressed by a row (key: string), a column (key: string), and a timestamp (key: int64), e.g.:
<“larry.page”, “websearch”, 133746428> → “cat pictures”
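A toy sketch of that map shape (a single in-process dictionary standing in for what is really a distributed, tablet-partitioned store; the helper names are made up):

```python
# Toy sketch of the BigTable data model: a sparse map from
# (row key, column key, timestamp) to a value. Purely illustrative.
table = {}

def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value

def read_latest(row, column):
    # Scan the cells for this (row, column) and return the newest version.
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

put("larry.page", "websearch", 133746428, "cat pictures")
put("larry.page", "websearch", 133746430, "dog pictures")
print(read_latest("larry.page", "websearch"))  # dog pictures
```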
BigTable (2006)
• Distributed tablets (~1 GB) hold subsets of map
• On top of Colossus, which handles replication and fault tolerance: only one (active) server per tablet!
• Reads & writes within a row are transactional
  – Independently of the number of columns touched
  – But: no cross-row transactions possible
  – Turns out users find this hard to deal with
BigTable (2006)
• META0 tablet is the “root” for name resolution
  – Like a coordinator/leader
  – Nested META1 tablets improve meta-data lookup
• Chubby is used to elect the master (META0 tablet server) and to maintain the list of tablet servers & schemas
  – 5-way replicated Paxos consensus on the data in Chubby
• Built atop GFS; building block for MegaStore
  – “Next gen” with transactions: Spanner
The Google Stack
Figure from M. Schwarzkopf, “Operating system support for warehouse-scale computing”, PhD thesis, University of Cambridge, 2015 (to appear).
Spanner (2012)
• BigTable insufficient for some consistency needs
• Often have transactions across >1 data centre
  – May buy an app on the Play Store while travelling in the U.S.
  – Hit a U.S. server, but the customer billing data is in the U.K.
  – Or may need to update several replicas for fault tolerance
• Wide-area consistency is hard
  – due to long delays and clock skew
  – no global, universal notion of time
  – NTP not accurate enough, PTP doesn't work (jittery links)
Spanner (2012)
• Spanner offers transactional consistency: full RDBMS power, ACID properties, at global scale!
• Secret sauce: hardware-assisted clock sync
  – Using GPS and atomic clocks in data centres
• Use global timestamps and Paxos to reach consensus
  – Still have a period of uncertainty for write TX: wait it out!
  – Each timestamp is an interval (sketched below):
[Diagram: a timestamp interval around the absolute time t_abs; tt.earliest is definitely in the past, tt.latest is definitely in the future.]
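A tiny sketch of the resulting “commit wait” rule (an assumed simplification: tt_now() and the 7 ms uncertainty bound below are stand-ins, not Spanner's real TrueTime API): pick a commit timestamp at the latest edge of the current uncertainty interval, then wait until that instant is definitely in the past before acknowledging the write.

```python
import time

# Illustrative commit-wait sketch (not Spanner's real TrueTime API).
# Assume the clock is synchronized to within EPSILON seconds of true time.
EPSILON = 0.007  # 7 ms uncertainty bound (assumed value)

def tt_now():
    """Return a TrueTime-style interval [earliest, latest] for 'now'."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit_write():
    # Pick a commit timestamp that is definitely not in the past...
    _, commit_ts = tt_now()
    # ...then wait until that timestamp is definitely in the past before
    # acknowledging, so no later transaction can observe an earlier time.
    while tt_now()[0] <= commit_ts:
        time.sleep(0.001)
    return commit_ts

print("committed at", commit_write())
```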
The Google Stack
Figure from M. Schwarzkopf, “Operating system support for warehouse-scale computing”, PhD thesis, University of Cambridge, 2015 (to appear).
Details & Bibliography: http://malteschwarzkopf.de/research/assets/google-stack.pdf
MapReduce (2004)
• Parallel programming framework for scale
  – Run a program on 100s to 10,000s of machines
• Framework takes care of:
  – Parallelization, distribution, load-balancing, scaling up (or down) & fault-tolerance
• Accessible: programmer provides two methods ;-)
  – map(key, value) → list of <key’, value’> pairs
  – reduce(key’, list of value’) → result
  – Inspired by functional programming
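The canonical word-count example, sketched here in Python (the real framework is a C++ library; the tiny driver below merely simulates the shuffle so the two user-supplied functions can be run end to end):

```python
# Word count expressed as map() and reduce(), the two functions the
# programmer supplies to MapReduce. Illustrative toy driver, not the framework.
from collections import defaultdict

def map_fn(doc_name, doc_text):
    for word in doc_text.split():
        yield word, 1                       # emit <word, 1> for every word

def reduce_fn(word, counts):
    return word, sum(counts)                # sum all partial counts per word

def run_mapreduce(inputs):
    intermediate = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):   # map phase
            intermediate[k2].append(v2)     # shuffle: group by intermediate key
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

docs = [("doc1", "the cat sat"), ("doc2", "the cat ran")]
print(run_mapreduce(docs))  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```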
MapReduce
[Diagram: Input → Map → Shuffle → Reduce → Output. Each worker performs Map() against the local data matching the input specification (emitting, e.g., X: 5, X: 3, Y: 1, Y: 7); the gathered results for each intermediate key are aggregated using Reduce() (X: 8, Y: 8); the end user can query results via a distributed key/value store.]
Slide originally due to S. Hand’s distributed systems lecture course at Cambridge: http://www.cl.cam.ac.uk/teaching/1112/ConcDisSys/DistributedSystems-1B-H4.pdf
MapReduce: Pros & Cons
• Extremely simple, and:
  – Can auto-parallelize (since operations on every element of the input are independent)
  – Can auto-distribute (since it relies on the underlying Colossus/BigTable distributed storage)
  – Gets fault tolerance (since tasks are idempotent, i.e. they can simply be re-executed if a machine crashes)
• Doesn't really use any sophisticated distributed systems algorithms (except storage replication)
• However, not a panacea:
  – Limited to batch jobs, and to computations expressible as a map() followed by a reduce()
The Google Stack
Figure from M. Schwarzkopf, “Operating system support for warehouse-scale computing”, PhD thesis, University of Cambridge, 2015 (to appear).
Details & Bibliography: http://malteschwarzkopf.de/research/assets/google-stack.pdf
Dremel (2010)
Column-oriented store, for quick, interactive queries.
[Diagram: the same records laid out in row-oriented storage (all fields of a record stored together) vs. column-oriented storage (all values of one field stored together).]
Dremel (2010)
Stores protocol buffers, Google's universal serialization format: nested messages → nested columns; repeated fields → repeated records.
Efficient encoding: many sparse records, so NULL fields are not stored.
Record re-assembly: query results must be put back together into records, using a Finite State Machine (FSM) defined by the protocol buffer structure.
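A small sketch of the row-vs-column layout (plain Python structures with made-up fields; Dremel's real format additionally encodes repetition and definition levels for nested, repeated fields):

```python
# Row- vs. column-oriented layout of the same records (illustration only).
records = [
    {"user": "alice", "query": "weather", "latency_ms": 120},
    {"user": "bob",   "query": "news",    "latency_ms": 85},
]

# Row-oriented: each record's fields are stored together.
row_store = [tuple(r.values()) for r in records]

# Column-oriented: all values of one field are stored together, so a query
# touching a single column only has to read that column's data.
column_store = {field: [r[field] for r in records]
                for field in records[0]}

# An interactive aggregate like "average latency" reads just one column:
print(sum(column_store["latency_ms"]) / len(records))  # 102.5
```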
The Google Stack
Figure from M. Schwarzkopf, “Operating system support for warehouse-scale computing”, PhD thesis, University of Cambridge, 2015 (to appear).
Details & Bibliography: http://malteschwarzkopf.de/research/assets/google-stack.pdf
Borg/Omega
Cluster manager and scheduler: tracks machine and task liveness; decides where to run what.
Consolidates workloads onto machines: efficiency gain, cost savings; fewer clusters are needed.
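To make “decides where to run what” concrete, here is a deliberately naive first-fit placement sketch; Borg's real scheduler scores feasible machines and supports priorities and preemption, none of which appears in this assumed example.

```python
# Naive first-fit task placement: put each task on the first machine with
# enough free CPU and RAM. Illustrative only, not Borg's algorithm.
machines = [
    {"name": "m1", "cpu": 8.0, "ram_gb": 32.0},
    {"name": "m2", "cpu": 16.0, "ram_gb": 64.0},
]

def place(task_cpu, task_ram_gb):
    for m in machines:
        if m["cpu"] >= task_cpu and m["ram_gb"] >= task_ram_gb:
            m["cpu"] -= task_cpu          # reserve the resources
            m["ram_gb"] -= task_ram_gb
            return m["name"]
    return None  # no machine fits: the task stays pending

print(place(4.0, 16.0))   # m1
print(place(12.0, 40.0))  # m2 (m1 no longer has enough CPU)
```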
Borg/Omega
[Diagram: a replicated BorgMaster, backed by a Paxos-replicated persistent store and a scheduler, talks via link shards to a Borglet agent on every machine.]
Figure reproduced after A. Verma et al., “Large-scale cluster management at Google with Borg”, Proceedings of EuroSys 2015.
Borg/Omega: workloads
Jobs/tasks: counts. CPU/RAM: resource-seconds [i.e. resource × job runtime in sec.]
Cluster A: medium size, medium utilization.
Cluster B: large size, medium utilization.
Cluster C: medium size (12k machines), high utilization; public trace.
Figure from M. Schwarzkopf et al., “Omega: flexible, scalable schedulers for large compute clusters”, Proceedings of EuroSys 2013.
Borg/Omega: workloads
[CDF plot: fraction of jobs running for less than X.]
Figure from M. Schwarzkopf et al., “Omega: flexible, scalable schedulers for large compute clusters”, Proceedings of EuroSys 2013.
Service jobs run for much longer than batch jobs: long-term user-facing services vs. one-off analytics.
Borg/Omega: workloads
[CDF plot: fraction of inter-arrival gaps less than X.]
Figure from M. Schwarzkopf et al., “Omega: flexible, scalable schedulers for large compute clusters”, Proceedings of EuroSys 2013.
Batch jobs arrive more frequently than service jobs: they are more numerous, shorter in duration, and fail more often.
Borg/Omega: workloads
Batch jobs have a longer-tailed CPI (cycles-per-instruction) distribution: they run at lower scheduling priority in the kernel scheduler.
Figures from M. Schwarzkopf, “Operating system support for warehouse-scale computing”, PhD thesis, University of Cambridge, 2015 (to appear).
Borg/Omega: workloads
Service workloads access memory more frequently: larger working sets, less I/O.
Figures from M. Schwarzkopf, “Operating system support for warehouse-scale computing”, PhD thesis, University of Cambridge, 2015 (to appear).
How’s this different to HPC?
Generally avoid tight synchronization: no barrier-syncs! Asynchronous replication, eventual consistency.
Much more data-intensive: lower compute-to-data ratio.
Aggressive resource sharing for utilization
The facebook Stack
Figure from M. Schwarzkopf, “Operating system support for warehouse-scale computing”, PhD thesis, University of Cambridge, 2015 (to appear).
Details & Bibliography: http://malteschwarzkopf.de/research/assets/facebook-stack.pdf
The facebook Stack
[Figure overlay: the Facebook stack's analogues of GFS, BigTable, and MapReduce are highlighted.]
Figure from M. Schwarzkopf, “Operating system support for warehouse-scale computing”, PhD thesis, University of Cambridge, 2015 (to appear).
Details & Bibliography: http://malteschwarzkopf.de/research/assets/facebook-stack.pdf
The facebook Stack
Figure from M. Schwarzkopf, “Operating system support for warehouse-scale computing”, PhD thesis, University of Cambridge, 2015 (to appear).
Details & Bibliography: http://malteschwarzkopf.de/research/assets/facebook-stack.pdf
Haystack & f4
Blob stores: hold photos and videos; not status updates, messages, or like counts.
Items have a level of hotness: how many users are currently accessing this? Baseline “cold” storage: MySQL.
Want to cache close to users: reduces network traffic, reduces latency. But cache capacity is limited! Replicate for performance, not resilience.
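A toy sketch of the hot/cold split (an assumed LRU policy and made-up blob names; Haystack/f4's real tiering is driven by measured request rates and replication choices, not by this simple cache):

```python
# Toy sketch: keep the hottest blobs in a small, size-limited cache close
# to users, falling back to cold storage on a miss. Illustrative only.
from collections import OrderedDict

COLD_STORE = {"photo-1": b"...", "photo-2": b"...", "photo-3": b"..."}

class BlobCache:
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.cache = OrderedDict()  # ordered coldest -> most recently used

    def get(self, blob_id):
        if blob_id in self.cache:
            self.cache.move_to_end(blob_id)    # hit: mark as hot
            return self.cache[blob_id]
        blob = COLD_STORE[blob_id]             # miss: fetch from the cold tier
        self.cache[blob_id] = blob
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict the coldest item
        return blob

cache = BlobCache()
cache.get("photo-1"); cache.get("photo-2"); cache.get("photo-1")
cache.get("photo-3")                           # evicts photo-2
print(list(cache.cache))                       # ['photo-1', 'photo-3']
```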
What about other companies’ stacks?
How about other companies?
Very similar stacks: Microsoft, Yahoo, Twitter are all similar in principle.
Typical set-up:
  Front-end serving systems and fast back-ends.
  Batch data processing systems.
  Multi-tier structured/unstructured storage hierarchy.
  Coordination system and cluster scheduler.
Minor differences owed to business focus, e.g., Amazon focused on inventory/shopping cart.
Open source software
Lots of open reimplementations!
  MapReduce → Hadoop, Spark, Metis
  GFS → HDFS
  BigTable → HBase, Cassandra
  Borg/Omega → Mesos, Firmament
But also some releases from companies…
  Presto (Facebook)
  Kubernetes (Google)
Current research
Many hot topics!
This distributed infrastructure is still new.
Make many-core machines go further: increase utilization, schedule better; deal with co-location interference.
Rapid insights on huge datasets: fast, incremental stream processing, e.g. Naiad; approximate analytics, e.g. BlinkDB.
Distributed algorithms: new consensus systems, e.g. Raft.
Conclusions
1. Running at huge (10k+ machines) scale requires different software stacks.
2. Pretty interesting systems
   a. Do read the papers!
   b. (Partial) Bibliography:
      http://malteschwarzkopf.de/research/assets/google-stack.pdf
      http://malteschwarzkopf.de/research/assets/facebook-stack.pdf
3. Lots of activity in this area!
   a. Open source software
   b. Research initiatives
Cambridge Systems at Scale
@CamSysAtScale – camsas.org