CS 152 Computer Architecture and Engineering
Lecture 20: Datacenters
Dr. George Michelogiannakis
EECS, University of California at Berkeley
CRD, Lawrence Berkeley National Laboratory
http://inst.eecs.berkeley.edu/~cs152
Thanks to Christina Delimitrou, Christos Kozyrakis
Administrivia
§ PS5 due NOW
§ Quiz 5 on Wednesday next week
§ Please show up on Monday April 21st (last lecture)
  – Neuromorphic, quantum
  – Parting thoughts that have nothing to do with architecture
  – Class evaluation
What Is A Datacenter?
§ The compute infrastructure for internet-scale services & cloud computing
  – 10,000s of servers, 100,000s of hard disks
  – Examples: Google, Facebook, Microsoft, Amazon (+ Amazon Web Services), Twitter, Yahoo, …
§ Both consumer and enterprise services
  – Windows Live, Gmail, Hotmail, Dropbox, Bing, Google, AdCenter, Google Apps, web apps, Exchange Online, salesforce.com, Azure, …
Other Definitions
§ Centralized repository for the storage, management, and dissemination of data and information pertaining to a particular business or service
§ Datacenters involve large quantities of data and their processing
§ Largely made up of commodity components
Components
§ Apart from computers & network switches, you need:
  – Power infrastructure: voltage converters and regulators, generators and UPSs, …
  – Cooling infrastructure: A/C, cooling towers, heat exchangers, air impellers, …
§ Everything is co-designed!
Example: MS Quincy Datacenter
§ 470k sq feet (10 football fields)
§ Next to a hydro-electric generation plant
  – At up to 40 MegaWatts, $0.02/kWh is better than $0.15/kWh :-)
  – That's equal to the power consumption of 30,000 homes
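To see why the cheap hydro power matters, here is a rough back-of-the-envelope calculation (a sketch; the 40 MW draw and the two rates are from the slide, and running at full draw around the clock is an assumption):

```python
# Rough annual electricity cost at the two rates (assuming the
# datacenter draws its full 40 MW continuously).
power_mw = 40
hours_per_year = 24 * 365                        # 8760 hours
kwh_per_year = power_mw * 1000 * hours_per_year  # 350.4 million kWh

for rate in (0.02, 0.15):                        # $/kWh
    print(f"${rate}/kWh -> ${rate * kwh_per_year / 1e6:.1f}M per year")

# $0.02/kWh ->  $7.0M per year
# $0.15/kWh -> $52.6M per year   (siting saves roughly $45M/year)
```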
Example: MS Chicago Datacenter
[Figure: Microsoft's Chicago Data Center]
[K. Vaid, Microsoft Global Foundation Services, HotPower'10, Oct 3, 2010]
Applications That Use Datacenters
§ Storage
  – Large and small files (e.g., phone contacts)
§ Search engines
§ Compute time rental & web hosting
  – Amazon EC2 – virtual server hosting
§ Cloud gaming
§ File or video streaming
Commodity Hardware
[Figure: 2-socket server, low-latency 10GbE switch, 10GbE NIC, flash storage, JBOD disk array]
Basic Unit: 2-socket Server
§ 1-2 multi-core chips
§ 8-16 DRAM DIMMs
§ 1-2 ethernet ports
  – 10 Gbps or higher
§ Storage
  – Internal SATA/SAS disks (2-6)
  – External storage expansion
§ Configuration/size vary
  – Depending on tier role
  – 1U-2U (1U = 1.75 inches)
[Open Compute Project motherboard spec excerpt]
5 Efficient Performance Motherboard Features
5.1 Block Diagram — Figure 5 illustrates the functional block diagram of the efficient performance motherboard.
5.2 Placement and Form Factor — The motherboard's form factor is 6.5x20 inches. Figure 6 illustrates board placement. The placement shows the relative positions of key components, while exact dimension and position information is available in the mechanical DXF file. Form factor, PCIe slot position, front IO port …
Example: FB 2-socket Server
§ Characteristics
  – Upgradable CPUs & memory, boot on LAN, external PCIe link, feature-reduced
  – Similar design for AMD servers (why?)
[Open Compute Project motherboard spec excerpt]
Figure 6: Efficient Performance Board Placement
Figure 25: Open Compute Project Server Chassis for Intel Motherboards
11.1 Fixed Locations — Refer to the mechanical DXF file for fixed locations of the mounting hole, PCIe x16 slot, and power connector.
11.2 PCB Thickness — To ensure proper alignment of the FCI power connector and mounting within the mechanical enclosure, the boards should follow the PCB stackups described in sections 4.5 and 5.5 respectively and have 85mil (2.16mm) thickness. The mid-plane PCB thickness is also 85mil (2.16mm). The mezzanine card PCB thickness is 62mil (≈1.6mm).
Application Mapping (FB Example)
[Figure: Facebook cluster architecture — a front-end cluster (Web: 250 racks, Ads: 30 racks, Cache: ~144TB, Multifeed: 9 racks, other small services), a back-end cluster (UDB, ADS-DB, Tao Leader), and service clusters (Search, Photos, Msg, Others)]
Servers Used for a FB Request
[Figure: life of a "hit" — between request start and completion, the request fans out from front-end Web and memcache (MC) servers to back-end Ads, Database, leader (L), and Feed aggregator servers]
What Server Should We Use in a Datacenter?
§ Many options
  – 1-socket server
  – 2-socket server
  – 4-socket server
  – …
  – 64-socket server
  – …
§ What are the issues to consider?
2 vs 4 vs 8 Sockets per Server
§ What is great about 2 vs 1 socket?
§ Why not 4 or 8 sockets then?
Performance Scaling of Internet-Scale Applications
§ Scaling analysis for Search & MapReduce at Microsoft
§ Any observations?
[IEEE Micro'11]
Performance Metrics
§ Completion time (e.g., how fast)
  – Of certain operations
§ Availability
§ Power/energy
§ Total cost of ownership (TCO)
Total Cost of Ownership (TCO)
§ Capital expenses
  – Land, building, generators, air conditioning, computing equipment
§ Operating expenses
  – Electricity, repairs
§ Cost of unavailability
TCO Breakdown
§ Observations
  – >50% of cost in buying the hardware
  – ~30% of costs related to power
  – Networking ~10% of overall costs (including cost for servers)
§ Breakdown: Servers 61%, Energy 16%, Cooling 14%, Networking 6%, Other 3%
Cost Analysis
§ Cost model: powerful tool for design tradeoffs
  – Evaluate "what-if" scenarios
§ E.g., can we reduce power cost with a different disk?
§ A 1TB disk uses 10W of power and costs $90. An alternate disk consumes only 5W, but costs $150. If you were the datacenter architect, what would you do?
Answer
§ A 1TB disk uses 10W of power and costs $90. An alternate disk consumes only 5W, but costs $150. If you were the datacenter architect, what would you do?
§ @ $2/Watt – even if we saved the entire 10W of power for the disk, we would save $20 per year. We are paying $60 more for the disk – probably not worth it.
  – What is this analysis missing?
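A minimal sketch of that trade-off (the $2/Watt-year rate and disk figures are from the slide; the 3-year service life below is an assumption for illustration):

```python
# Compare total cost of the two disks over an assumed service life.
DOLLARS_PER_WATT_YEAR = 2.0   # rough power + cooling cost from the slide
LIFETIME_YEARS = 3            # assumed; not from the slide

def total_cost(price, watts, years=LIFETIME_YEARS):
    """Purchase price plus power cost over the disk's service life."""
    return price + watts * DOLLARS_PER_WATT_YEAR * years

print(total_cost(90, 10))   # baseline disk:  90 + 10*2*3 = 150.0
print(total_cost(150, 5))   # low-power disk: 150 + 5*2*3 = 180.0
```

Even over three years the low-power disk never breaks even, matching the slide's conclusion.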
Reliability & Availability
§ Common goal for services: 99.99% availability
  – 1 hour of down-time per year
§ But with thousands of nodes, things will crash
  – Example: with 10K servers rated at 30 years of MTBF, you should expect to have 1 failure per day
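The failure-rate arithmetic behind that example, as a quick sketch (numbers from the slide):

```python
# With N independent servers each with MTBF of 30 years,
# the fleet sees roughly N / MTBF failures per unit time.
servers = 10_000
mtbf_days = 30 * 365            # per-server MTBF ~ 10,950 days

failures_per_day = servers / mtbf_days
print(f"{failures_per_day:.2f} failures/day")   # ~0.91, i.e. about 1 per day
```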
Robustness to Failures
§ Failover to other replicas/datacenters
§ Bad backend detection
  – Stop using for live requests until behavior gets better
§ More aggressive load balancing when imbalance is more severe
§ Make your apps do something reasonable even if not all is right
  – Better to give users limited functionality than an error page
Consistency
§ Multiple datacenters implies dealing with consistency issues
  – Disconnected/partitioned operation relatively common, e.g., datacenter down for maintenance
  – Insisting on strong consistency likely undesirable
    • "We have your data but can't show it to you because one of the replicas is unavailable"
  – Most products with mutable state gravitating towards an "eventual consistency" model
  – A bit harder to think about, but better from an availability standpoint
Performance/Availability Techniques in DCs

Technique                  Performance   Availability
Replication                    ✔             ✔
Partitioning (sharding)        ✔             ✔
Load-balancing                 ✔
Watchdog timers                              ✔
Integrity checks                             ✔
App-specific compression       ✔
Eventual consistency           ✔             ✔
Characteristics of Internet-Scale Services
§ Huge data sets, user sets, …
§ High request-level parallelism
  – Without much read-write sharing
§ High workload churn
  – New releases of code on a weekly basis
§ Require fault-free operation
Performance Metrics
§ Throughput
  – User requests per second (RPS)
  – Scale-out addresses this (more servers)
§ Quality of Service (QoS)
  – Latency of individual requests (90th, 95th, or 99th percentile)
  – Scale-out does not necessarily help
§ Interesting notes
  – The distribution matters, not just the averages (see the percentile sketch below)
  – Optimizing throughput often hurts latency
    • And optimizing latency often hurts power consumption
  – At the end, it is RPS/$ within some QoS constraints
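Since the distribution matters, here is a sketch of how one might compute tail-latency percentiles from request measurements (the latency data is synthetic and hypothetical):

```python
import numpy as np

# Hypothetical per-request latencies in milliseconds; real data would
# come from request logs.
latencies_ms = np.random.lognormal(mean=3.0, sigma=0.5, size=100_000)

for p in (50, 90, 95, 99):
    print(f"p{p}: {np.percentile(latencies_ms, p):6.1f} ms")
# The mean alone hides the tail: p99 can be several times the median.
```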
Tail at Scale
§ Larger clusters → more prone to high tail latency
[1] The Tail at Scale. Jeffrey Dean, Luiz André Barroso. CACM, Vol. 56, No. 2, pp. 74-80, 2013.
Resource Assignment Options
§ How do we assign resources to apps?
§ Two major options: private vs shared assignment
Private Resource Assignment
§ Each app receives a private, static set of resources
§ Also known as static partitioning
[Figure: separate static partitions each running mysql, cassandra, Rails/nginx, hadoop, or memcache]
Shared Resource Assignment
§ Shared resources: flexibility → high utilization
  – Common case: user-facing services + analytics on same servers
  – Also helps with failures, maintenance, and provisioning
[Figure: Rails/nginx, hadoop, and memcache sharing the same servers, raising overall utilization]
Shared Cluster Management
§ The manager schedules apps on shared resources
  – Apps request resource reservations (cores, DRAM, …) — a sketch follows below
  – Manager allocates and assigns specific resources
    • Considering performance, utilization, fault tolerance, priorities, …
    • Potentially, multiple apps on each server
  – Multiple manager architectures (see Borg paper for example)
[Figure: cluster-manager status quo — an application or human submits a specification to the cluster manager, which returns a result; the specification includes as much information as possible to assist the cluster manager in scheduling and execution. Cluster-manager architectures use a master and slaves. Mesos implements weighted DRF: masters can be configured with weights per role, and resource-allocation decisions incorporate the weights to determine dominant fair shares]
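As a sketch of what such a resource reservation might look like (all names and quantities are hypothetical; real managers like Borg or Mesos each have their own specification format):

```python
from dataclasses import dataclass

@dataclass
class Reservation:
    """A hypothetical resource request an app submits to the manager."""
    app: str
    cores: int
    dram_gb: int
    replicas: int
    priority: int          # higher-priority tasks can evict lower ones

# The manager packs reservations onto servers, trading off
# utilization, fault tolerance, and priorities.
requests = [
    Reservation("webserver", cores=4, dram_gb=16, replicas=200, priority=9),
    Reservation("hadoop",    cores=8, dram_gb=32, replicas=50,  priority=2),
]
```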
Autoscaling
§ Monitor app performance or server load
  – [Chase'01, AWS AutoScale, Lim'10, Shen'11, Gandhi'12, …]
§ Adjust resources given to app
  – Add or remove to meet performance goal
  – Feedback-based control loop (see the sketch below)
[Figure: anatomy of a distributed system* — a load balancer/coordinator in front of a pool of workers, with workers added or removed as load changes (*overlooking peer-to-peer distributed systems)]
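A minimal sketch of that feedback loop (all names, thresholds, and measurements are hypothetical): each tick, compare measured tail latency to the target and add or remove workers to track the performance goal.

```python
TARGET_P99_MS = 100
STEP = 1                      # workers added/removed per adjustment

def autoscale_tick(measured_p99_ms: float, workers: int,
                   min_workers: int = 2, max_workers: int = 1000) -> int:
    """Return the new worker count for this control-loop iteration."""
    if measured_p99_ms > TARGET_P99_MS * 1.1:      # too slow: scale out
        return min(workers + STEP, max_workers)
    if measured_p99_ms < TARGET_P99_MS * 0.7:      # overprovisioned: scale in
        return max(workers - STEP, min_workers)
    return workers                                  # within the deadband

workers = 10
for p99 in (95, 130, 180, 90, 60):                  # hypothetical measurements
    workers = autoscale_tick(p99, workers)
    print(p99, "->", workers, "workers")
```

The deadband (scale out above 110% of target, scale in below 70%) keeps the loop from oscillating on small fluctuations.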
Map+Reduce
§ Map:
  – Accepts input key/value pair
  – Emits intermediate key/value pair
§ Reduce:
  – Accepts intermediate key/value* pair
  – Emits output key/value pair
[Figure: very big data flows through MAP tasks, a partitioning function, and REDUCE tasks to produce the result]
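A minimal in-process sketch of the map/reduce contract above, using the classic word-count example (a toy illustration, not a distributed implementation):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: input (doc_id, text) -> intermediate (word, 1) pairs."""
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: intermediate (word, [counts]) -> output (word, total)."""
    return word, sum(counts)

docs = {"d1": "very big data", "d2": "big data big results"}

# Shuffle: group intermediate pairs by key (the partitioning step).
groups = defaultdict(list)
for doc_id, text in docs.items():
    for word, count in map_fn(doc_id, text):
        groups[word].append(count)

print(dict(reduce_fn(w, c) for w, c in groups.items()))
# {'very': 1, 'big': 3, 'data': 2, 'results': 1}
```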
Analytics Example: MapReduce
§ Single-tier architecture
  – Distributed FS, worker servers, coordinator
  – Disk-based or in-memory
§ Metric: throughput
[Figure credit: Paul Krzyzanowski]
Example 3-Tier App: WebMail
§ May include thousands of machines, petabytes of data, and billions of users
§ 1st tier: protocol processing
  – Typically stateless
  – Use a load balancer
§ 2nd tier: application logic
  – Often caches state from 3rd tier
§ 3rd tier: data storage
  – Heavily stateful
  – Often includes bulk of machines
[Figure: load balancer → front-end tier (HTTP, POP, IMAP, …) → middle tier (mail delivery, user info, stats, …) → back-end tier (mail metadata & files)]
Example: Social Networking
§ 3-tier system
  – Web server, fast user data storage, persistent storage
  – 2nd tier: latency-critical, large number of servers
§ User data storage
  – Using memcached for distributed caching
  – 10s of TBytes in memory (Facebook: 150TB)
  – Sharded and replicated across many servers (see the sketch below)
  – Read/write (unlike search), bulk is read-dominated
  – From in-memory caching to in-memory FS
    • RAMCloud @ Stanford, Sinfonia @ HP, …
[Figure: front-end tier (HTTP, presentation, …) → middle tier (memcached or in-memory FS) → back-end tier (persistent DB)]
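A sketch of how a cache client might shard keys across servers (the server list and key are hypothetical; production clients typically use consistent hashing so that adding or removing a server remaps only a fraction of the keys):

```python
import hashlib

# Hypothetical memcache fleet; a real deployment has thousands of servers.
servers = ["mc01", "mc02", "mc03", "mc04"]

def shard_for(key: str) -> str:
    """Map a key to a server by hashing (simple modulo sharding)."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return servers[h % len(servers)]

print(shard_for("user:31459:profile"))   # always the same server for a key
```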
Acknowledgements
§ Christos Kozyrakis, Christina Delimitrou
  – Based on EE282 @ Stanford
§ "Designs, Lessons, and Advice from Building Large Distributed Systems" by Jeff Dean, Google Fellow
§ "A thousand chickens or two oxen? Part 1: TCO" by Isabel Martin
§ "UPS as a service could be an economic breakthrough" by Mark Monroe
§ "Take a close look at MapReduce" by Xuanhua Shi
Reducing Tail Latency
§ Reduce queuing (reduce head-of-line blocking)
§ Separate different types of requests
§ Coordinate background activities
§ Hedged requests to replicas (see the sketch below)
§ Tied requests to replicas
§ Micro-sharding & selective replication
§ Latency-induced probation, canary requests
§ See The Tail at Scale paper for details
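A sketch of the hedged-request idea from the Tail at Scale paper: send the request to one replica, and if it has not answered within a small delay (e.g., the observed p95 latency), send a backup copy to a second replica and take whichever answer arrives first. The replica names, latencies, and delay below are hypothetical:

```python
import concurrent.futures as cf
import random, time

def query_replica(replica: str, key: str) -> str:
    """Stand-in for an RPC to one replica (simulated latency)."""
    time.sleep(random.expovariate(1 / 0.02))   # ~20 ms average
    return f"{key}@{replica}"

def hedged_get(key: str, replicas, hedge_after_s=0.05):
    """Issue to the first replica; hedge to a second after a delay."""
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(query_replica, replicas[0], key)
        done, _ = cf.wait([first], timeout=hedge_after_s)
        if done:                                   # fast path: no hedge sent
            return first.result()
        backup = pool.submit(query_replica, replicas[1], key)
        done, _ = cf.wait([first, backup], return_when=cf.FIRST_COMPLETED)
        # Note: the pool waits for the straggler on exit; a real client
        # would cancel the losing request (that is what tied requests do).
        return done.pop().result()

print(hedged_get("user:42", ["replica-a", "replica-b"]))
```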