what does it take to make google work at scale

51
What does it take to make Google work at scale? Malte Schwarzkopf [@ms705 ] Cambridge Systems at Scale http://camsas.org @CamSysAtScale CamSaS

Upload: xlight

Post on 15-Apr-2017

1.033 views

Category:

Engineering


2 download

TRANSCRIPT

Page 1: What does it take to make google work at scale

What does it taketo make Googlework at scale?

Malte Schwarzkopf [@ms705]

Cambridge Systems at Scalehttp://camsas.org – @CamSysAtScale

CamSaS

Page 2: What does it take to make google work at scale

CamSaS

2http://camsas.org @CamSysAtScale

Page 3: What does it take to make google work at scale

CamSaS

3http://camsas.org @CamSysAtScale

Page 4: What does it take to make google work at scale

CamSaS

4http://camsas.org @CamSysAtScale

What happens here?

Page 5: What does it take to make google work at scale

CamSaS

5http://camsas.org @CamSysAtScale

What happens in those 139ms?

InternetGoogledatacenter

Your computer

Front-end web server

G+ idx

places idxweb idx

? ?

?

? ??

Page 6: What does it take to make google work at scale

CamSaS

6http://camsas.org @CamSysAtScale

What we’ll chat about

1.Datacenter hardware

2.Scalable software stacksa.Google

b.Facebook

3.Differences compared to HPC

4.Current research directions

Please ask questions - any time! :)

Page 7: What does it take to make google work at scale

CamSaS

7http://camsas.org @CamSysAtScale

Page 8: What does it take to make google work at scale

CamSaS

8http://camsas.org @CamSysAtScale

Page 9: What does it take to make google work at scale

CamSaS

9http://camsas.org @CamSysAtScale

Page 10: What does it take to make google work at scale

CamSaS

10http://camsas.org @CamSysAtScale

Page 11: What does it take to make google work at scale

CamSaS

11http://camsas.org @CamSysAtScale

10,000+ machines

“Machine”no chassis!DC battery~6 disk drivesmostly custom-made

NetworkToR switchmulti-path coree.g., 3-stage fat tree or

Clos topology

Page 12: What does it take to make google work at scale

CamSaS

12http://camsas.org @CamSysAtScale

How’s this different to HPC?

Emphasis on commodity hardwareNo expensive interconnectMid-range machinesEnergy/performance/cost trade-off essential

Massive automationVery small number of on-site staffAutomated software bootstrap

Fault tolerant designEach component can failSoftware must be aware and compensate

Page 13: What does it take to make google work at scale

CamSaS

13http://camsas.org @CamSysAtScale

The software side

Machine MachineMachine

Linux kernel (customized)

Linux kernel (customized)

Linux kernel (customized)

Transparent distributed systems

Con

tain

ers

Con

tain

ers

We’ll look at what goes here!

Page 14: What does it take to make google work at scale

CamSaS

14http://camsas.org @CamSysAtScale

The Google Stack

Figure from M. Schwarzkopf, “Operating system support for warehouse-scale computing”, PhD thesis, University of Cambridge, 2015 (to appear).

Page 15: What does it take to make google work at scale

CamSaS

15http://camsas.org @CamSysAtScale

The Google Stack

Figure from M. Schwarzkopf, “Operating system support for warehouse-scale computing”, PhD thesis, University of Cambridge, 2015 (to appear).

Page 16: What does it take to make google work at scale

CamSaS

16http://camsas.org @CamSysAtScale

GFS/Colossus

Bulk data block storage systemOptimized for large files (GB-size)Supports small files, but not common caseRead, write, append modes

Colossus = GFSv2, adds some improvementse.g., Reed-Solomon-based erasure codingbetter support for latency-sensitive applicationssharded meta-data layer, rather than single master

Page 17: What does it take to make google work at scale

CamSaS

17http://camsas.org @CamSysAtScale

Application

GFS/Colossus: architecture

Chunkserver

GFS library

Chunkserver

ChunkserverGFS Master

Namespace

Metadata

Page 18: What does it take to make google work at scale

CamSaS

18http://camsas.org @CamSysAtScale

GFS/Colossus: writesApplication

GFS Master

Chunkserver

GFS library

Chunkserver

Chunkserver

Namespace

Metadata

Chain replication

Page 19: What does it take to make google work at scale

CamSaS

19http://camsas.org @CamSysAtScale

The Google Stack

Figure from M. Schwarzkopf, “Operating system support for warehouse-scale computing”, PhD thesis, University of Cambridge, 2015 (to appear).

Details & Bibliography: http://malteschwarzkopf.de/research/assets/google-stack.pdf

Page 20: What does it take to make google work at scale

CamSaS

20http://camsas.org @CamSysAtScale

BigTable (2006)

• ‘Three-dimensional‘ key-value store:– <row key, column key, timestamp> → value

• Effectively a distributed, sorted, sparse map

row(key: string)

column(key: string)

timestamp(key: int64)

cell

<“larry.page“, “websearch“, 133746428> → “cat pictures“

Page 21: What does it take to make google work at scale

CamSaS

21http://camsas.org @CamSysAtScale

BigTable (2006)

• Distributed tablets (~1 GB) hold subsets of map

• On top of Colossus, which handles replication and fault tolerance: only one (active) server per tablet!

• Reads & writes within a row are transactional– Independently of the number of columns touched– But: no cross-row transactions possible– Turns out users find this hard to deal with

Page 22: What does it take to make google work at scale

CamSaS

22http://camsas.org @CamSysAtScale

BigTable (2006)

• META0 tablet is “root“ for name resolution– Like a coordinator/leader– Nested META1 tablets improve meta-data lookup

• Elect master (META0 tablet server) using Chubby, and to maintain list of tablet servers & schemas– 5-way replicated Paxos consensus on data in Chubby

• Built atop GFS; building block for MegaStore– “Next gen” with transactions: Spanner

Page 23: What does it take to make google work at scale

CamSaS

23http://camsas.org @CamSysAtScale

The Google Stack

Figure from M. Schwarzkopf, “Operating system support for warehouse-scale computing”, PhD thesis, University of Cambridge, 2015 (to appear).

Page 24: What does it take to make google work at scale

CamSaS

24http://camsas.org @CamSysAtScale

Spanner (2012)• BigTable insufficient for some consistency needs• Often have transactions across >1 data centres

– May buy app on Play Store while travelling in the U.S. – Hit U.S. server, but customer billing data is in U.K.– Or may need to update several replicas for fault tolerance

• Wide-area consistency is hard– due to long delays and clock skew– no global, universal notion of time– NTP not accurate enough, PTP doesn’t work (jittery links)

Page 25: What does it take to make google work at scale

CamSaS

25http://camsas.org @CamSysAtScale

Spanner (2012)• Spanner offers transactional consistency: full RDBMS

power, ACID properties, at global scale!

• Secret sauce: hardware-assisted clock sync– Using GPS and atomic clocks in data centres

• Use global timestamps and Paxos to reach consensus– Still have a period of uncertainty for write TX: wait it out!– Each timestamp is an interval:

Definitely in the future

tt.latest

Definitely in the past

tt.earliest

tabs

Page 26: What does it take to make google work at scale

CamSaS

26http://camsas.org @CamSysAtScale

The Google Stack

Figure from M. Schwarzkopf, “Operating system support for warehouse-scale computing”, PhD thesis, University of Cambridge, 2015 (to appear).

Details & Bibliography: http://malteschwarzkopf.de/research/assets/google-stack.pdf

Page 27: What does it take to make google work at scale

CamSaS

27http://camsas.org @CamSysAtScale

MapReduce (2004)• Parallel programming framework for scale

– Run a program on 100’s to 10,000’s machines

• Framework takes care of:– Parallelization, distribution, load-balancing, scaling up

(or down) & fault-tolerance

• Accessible: programmer provides two methods ;-)– map(key, value) → list of <key’, value’> pairs– reduce(key’, value’) → result– Inspired by functional programming

Page 28: What does it take to make google work at scale

CamSaS

28http://camsas.org @CamSysAtScale

MapReduceInput

Map

Reduce

Output

Shuffle

X: 5 X: 3 Y: 1 Y: 7

Y: 8X: 8

Results: X: 8, Y: 8

Perform Map() query against local data matching input specification

Aggregate gathered results for each intermediate key using Reduce()

End user can query results via distributed key/value store

Slide originally due to S. Hand’s distributed systems lecture course at Cambridge: http://www.cl.cam.ac.uk/teaching/1112/ConcDisSys/DistributedSystems-1B-H4.pdf

Page 29: What does it take to make google work at scale

CamSaS

29http://camsas.org @CamSysAtScale

MapReduce: Pros & Cons• Extremely simple, and:

– Can auto-parallelize (since operations on every element in input are independent)

– Can auto-distribute (since rely on underlying Colossus/BigTable distributed storage)

– Gets fault-tolerance (since tasks are idempotent, i.e. can just re-execute if a machine crashes)

• Doesn’t really use any sophisticated distributed systems algorithms (except storage replication)

• However, not a panacea: – Limited to batch jobs, and computations which are

expressible as a map() followed by a reduce()

Page 30: What does it take to make google work at scale

CamSaS

30http://camsas.org @CamSysAtScale

The Google Stack

Figure from M. Schwarzkopf, “Operating system support for warehouse-scale computing”, PhD thesis, University of Cambridge, 2015 (to appear).

Details & Bibliography: http://malteschwarzkopf.de/research/assets/google-stack.pdf

Page 31: What does it take to make google work at scale

CamSaS

31http://camsas.org @CamSysAtScale

Dremel (2010)

Column-oriented storeFor quick, interactive queries

...

Row-oriented storage

Column-oriented storage

Page 32: What does it take to make google work at scale

CamSaS

32http://camsas.org @CamSysAtScale

Dremel (2010)

Stores protocol buffersGoogle’s universal serialization formatNested messages → nested columnsRepeated fields → repeated records

Efficient encodingMany sparse records: don’t store NULL fields

Record re-assemblyNeed to put results back together into recordsUse a Finite State Machine (FSM) defined by

protocol buffer structure

Page 33: What does it take to make google work at scale

CamSaS

33http://camsas.org @CamSysAtScale

The Google Stack

Figure from M. Schwarzkopf, “Operating system support for warehouse-scale computing”, PhD thesis, University of Cambridge, 2015 (to appear).

Details & Bibliography: http://malteschwarzkopf.de/research/assets/google-stack.pdf

Page 34: What does it take to make google work at scale

CamSaS

34http://camsas.org @CamSysAtScale

Borg/Omega

Cluster manager and schedulerTracks machine and task livenessDecides where to run what

Consolidates workloads onto machinesEfficiency gain, cost savingsNeed fewer clusters

Page 35: What does it take to make google work at scale

CamSaS

35http://camsas.org @CamSysAtScale

BorgMaster

Borg/Omega

BorgMasterBorgMaster

Link shard

Borglet Borglet Borglet

Paxos-replicated persistent store

Scheduler

Figure reproduced after A. Verma et al., “Large-scale cluster management at Google with Borg”, Proceedings of EuroSys 2015.

Page 36: What does it take to make google work at scale

CamSaS

36http://camsas.org @CamSysAtScale

Borg/Omega: workloads

Jobs/tasks: countsCPU/RAM: resource seconds [i.e. resource * job runtime in sec.]

Cluster AMedium sizeMedium utilization

Cluster BLarge sizeMedium utilization

Cluster CMedium (12k mach.)High utilizationPublic trace

Figure from M. Schwarzkopf et al., “Omega: flexible, scalable schedulers for large compute clusters”, Proceedings of EuroSys 2013.

Page 37: What does it take to make google work at scale

CamSaS

37http://camsas.org @CamSysAtScale

Borg/Omega: workloadsFr

actio

n of

jobs

runn

ing

for l

ess

than

X

Figure from M. Schwarzkopf et al., “Omega: flexible, scalable schedulers for large compute clusters”, Proceedings of EuroSys 2013.

Service jobs run for much longer than batch jobs:long-term user-facing services vs. one-off analytics.

Page 38: What does it take to make google work at scale

CamSaS

38http://camsas.org @CamSysAtScale

Borg/Omega: workloadsFr

actio

n of

inte

r-ar

rival

gap

s le

ss th

an X

Figure from M. Schwarzkopf et al., “Omega: flexible, scalable schedulers for large compute clusters”, Proceedings of EuroSys 2013.

Batch jobs arrive more frequently than service jobs:more numerous, shorter duration, fail more.

Page 39: What does it take to make google work at scale

CamSaS

39http://camsas.org @CamSysAtScale

Borg/Omega: workloads

Batch jobs have a longer-tailed CPI distribution:lower scheduling priority in kernel scheduler.

Figures from M. Schwarzkopf, “Operating system support for warehouse-scale computing”, PhD thesis, University of Cambridge, 2015 (to appear).

Page 40: What does it take to make google work at scale

CamSaS

40http://camsas.org @CamSysAtScale

Borg/Omega: workloads

Service workloads access memory more frequently: larger working sets, less I/O.

Figures from M. Schwarzkopf, “Operating system support for warehouse-scale computing”, PhD thesis, University of Cambridge, 2015 (to appear).

Page 41: What does it take to make google work at scale

CamSaS

41http://camsas.org @CamSysAtScale

How’s this different to HPC?

Generally avoid tight synchronizationNo barrier-syncs!Asynchronous replication, eventual consistency

Much more data-intensiveLower compute-to-data ratio

Aggressive resource sharing for utilization

Page 42: What does it take to make google work at scale

CamSaS

42http://camsas.org @CamSysAtScale

The facebook Stack

Figure from M. Schwarzkopf, “Operating system support for warehouse-scale computing”, PhD thesis, University of Cambridge, 2015 (to appear).

Details & Bibliography: http://malteschwarzkopf.de/research/assets/facebook-stack.pdf

Page 43: What does it take to make google work at scale

CamSaS

43http://camsas.org @CamSysAtScale

The facebook Stack

GFSBigTab

le

MapReduce

Figure from M. Schwarzkopf, “Operating system support for warehouse-scale computing”, PhD thesis, University of Cambridge, 2015 (to appear).

Details & Bibliography: http://malteschwarzkopf.de/research/assets/facebook-stack.pdf

Page 44: What does it take to make google work at scale

CamSaS

44http://camsas.org @CamSysAtScale

The facebook Stack

Figure from M. Schwarzkopf, “Operating system support for warehouse-scale computing”, PhD thesis, University of Cambridge, 2015 (to appear).

Details & Bibliography: http://malteschwarzkopf.de/research/assets/facebook-stack.pdf

Page 45: What does it take to make google work at scale

CamSaS

45http://camsas.org @CamSysAtScale

Haystack & f4

Blob stores, hold photos, videosnot: status updates, messages, like counts

Items have a level of hotnessHow many users are currently accessing this?Baseline “cold” storage: MySQL

Want to cache close to usersReduces network trafficReduces latencyBut cache capacity is limited!Replicate for performance, not resilience

Page 46: What does it take to make google work at scale

CamSaS

46http://camsas.org @CamSysAtScale

What about other companies’ stacks?

Page 47: What does it take to make google work at scale

CamSaS

47http://camsas.org @CamSysAtScale

How about other companies?

Very similar stacks.Microsoft, Yahoo, Twitter all similar in principle.

Typical set-up:Front-end serving systems and fast back-ends.Batch data processing systems.Multi-tier structured/unstructured storage hierarchy.Coordination system and cluster scheduler.

Minor differences owed to business focuse.g., Amazon focused on inventory/shopping cart.

Page 48: What does it take to make google work at scale

CamSaS

48http://camsas.org @CamSysAtScale

Open source software

Lots of open reimplementations!MapReduce → Hadoop, Spark, MetisGFS → HDFSBigTable → HBase, CassandraBorg/Omega → Mesos, Firmament

But also some releases from companies…Presto (Facebook)Kubernetes (Google)

Page 49: What does it take to make google work at scale

CamSaS

49http://camsas.org @CamSysAtScale

Current researchMany hot topics!

This distributed infrastructure is still new.

Make many-core machines go furtherIncrease utilization, schedule better.Deal with co-location interference.

Rapid insights on huge datasetsFast, incremental stream processing, e.g. Naiad.Approximate analytics, e.g. BlinkDB.

Distributed algorithmsNew consensus systems, e.g. RAFT.

Page 50: What does it take to make google work at scale

CamSaS

50http://camsas.org @CamSysAtScale

Conclusions1.Running at huge (10k+ machines) scale

requires different software stacks.

2.Pretty interesting systemsa.Do read the papers!b.(Partial) Bibliography:

http://malteschwarzkopf.de/research/assets/google-stack.pdfhttp://malteschwarzkopf.de/research/assets/facebook-stack.pdf

3.Lots of activity in this area!a.Open source softwareb.Research initiatives

Page 51: What does it take to make google work at scale

Cambridge Systems at Scale

@CamSysAtScalecamsas.org