Computer Architecture Trends
DESCRIPTION
Current and previous trends in computer architecture.
TRANSCRIPT
ECE/CS 757: Advanced Computer Architecture II
Instructor: Mikko H. Lipasti
Spring 2013, University of Wisconsin-Madison
Lecture notes based on slides created by John Shen, Mark Hill, David Wood, Guri Sohi, Jim Smith, Natalie Enright Jerger, Michel Dubois, Murali Annavaram, Per Stenström, and probably others
Computer Architecture
Instruction Set Architecture (IBM 360): "the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation." - Amdahl, Blaauw, & Brooks, 1964
Machine Organization (microarchitecture): ALUs, buses, caches, memories, etc.
Machine Implementation (realization): gates, cells, transistors, wires
757 In Context
Prior courses
  352: gates up to multiplexors and adders
  354: high-level language down to the machine language interface, or instruction set architecture (ISA)
  552: implement logic that provides the ISA interface
  CS 537: provides OS background (co-req. OK)
This course
  757 covers parallel machines
    Multiprocessor systems
    Data-parallel systems
    Memory systems that exploit MLP
    Etc.
Additional courses
  ECE 752 covers advanced uniprocessor design (not a prereq); will review key topics in next lecture
  ECE 755 covers VLSI design
  ME/ECE 759 covers parallel programming
  CS 758 covers special topics (recently parallel programming)
Why Take 757?
To become a computer designer
  Alumni of this class helped design your computer
To learn what is under the hood of a computer
  Innate curiosity
  To better understand when things break
  To write better code/applications
  To write better system software (O/S, compiler, etc.)
Because it is intellectually fascinating!
Because multicore/parallel systems are ubiquitous
Computer Architecture
Exercise in engineering tradeoff analysis
  Find the fastest/cheapest/most power-efficient/etc. solution
  Optimization problem with hundreds of variables
All the variables are changing
  At non-uniform rates
  With inflection points
  Only one guarantee: today's right answer will be wrong tomorrow
Two high-level effects:
  Technology push
  Application pull
Trends
Moore's Law for device integration
Chip power consumption
Single-thread performance trend
[source: Intel]
Dynamic Power
Static CMOS: current flows when active
  Combinational logic evaluates new inputs
  Flip-flop, latch captures new value (clock edge)
Terms
  C: capacitance of circuit (wire length, number and size of transistors)
  V: supply voltage
  A: activity factor
  f: frequency
Future: fundamentally power constrained
P_{dyn} = k \sum_{i \in \text{units}} C_i V_i^2 A_i f_i
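A minimal sketch of the dynamic power formula above in Python; the function name, the constant k, and every per-unit value are illustrative assumptions, not measurements from the slides or from any real chip.

```python
# Toy evaluation of P_dyn = k * sum_i(C_i * V_i^2 * A_i * f_i) over units i.
def dynamic_power(units, k=1.0):
    return k * sum(c * v**2 * a * f for (c, v, a, f) in units)

# Hypothetical units: (capacitance [F], supply voltage [V], activity factor, frequency [Hz]).
units = [
    (1e-9, 1.0, 0.10, 2e9),   # core logic: higher activity
    (2e-9, 1.0, 0.02, 2e9),   # caches: more capacitance, lower activity
]
print(f"P_dyn = {dynamic_power(units):.2f} W")   # 0.28 W with these made-up numbers
```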
Multicore Mania
First, servers
  IBM Power4, 2001
Then desktops
  AMD Athlon X2, 2005
Then laptops
  Intel Core Duo, 2006
Now, cell phones & tablets
  Qualcomm, Nvidia Tegra, Apple A6, etc.
Why Multicore
                    Single Core   Dual Core   Quad Core
Core area           A             ~A/2        ~A/4
Core power          W             ~W/2        ~W/4
Chip power          W + O         W + O       W + O
Core performance    P             0.9P        0.8P
Chip performance    P             1.8P        3.2P
[Figure: chip floorplans for the single-core, dual-core, and quad-core configurations above]
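A small Python sketch of the arithmetic behind the table above: per-core performance degrades as cores shrink (the 0.9 and 0.8 factors are the table's assumed values), but chip performance is the product of core count and per-core performance.

```python
# Reproduce the "chip performance" row of the table from its other rows.
configs = {
    "single core": (1, 1.0),   # (cores, per-core performance relative to P)
    "dual core":   (2, 0.9),
    "quad core":   (4, 0.8),
}
for name, (n, per_core) in configs.items():
    print(f"{name}: chip performance = {n} x {per_core}P = {n * per_core:.1f}P")
```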
Amdahl's Law
f: fraction that can run in parallel
1 - f: fraction that must run serially
[Figure: execution time vs. number of CPUs; the serial fraction (1 - f) occupies one CPU, while the parallel fraction f is spread across n CPUs]

Speedup = \frac{1}{(1 - f) + \frac{f}{n}}

\lim_{n \to \infty} Speedup = \frac{1}{1 - f}
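A minimal Python sketch of the speedup expression above; the values of f and the CPU count are illustrative only.

```python
# Amdahl's Law: speedup on n CPUs when fraction f of the work is parallel.
def amdahl_speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

for f in (0.5, 0.9, 0.99):
    print(f"f = {f}: 16 CPUs -> {amdahl_speedup(f, 16):.2f}x, "
          f"limit as n -> infinity -> {1.0 / (1.0 - f):.0f}x")
```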
Fixed Chip Power Budget
Amdahl's Law ignores the (power) cost of n cores
Revised Amdahl's Law: with more cores under a fixed power budget, each core is slower, so the parallel speedup must outweigh the per-core slowdown (see the sketch below)
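A hedged sketch of that tradeoff. The slides do not give a specific model; the code below assumes, purely for illustration, that each of n equal cores receives 1/n of the chip power and that per-core performance scales as the square root of per-core power (a rough Pollack's-rule-style approximation).

```python
# ASSUMED model (not from the slides): per-core perf ~ sqrt(per-core power).
def revised_amdahl(f, n):
    per_core_perf = (1.0 / n) ** 0.5          # each core gets 1/n of the power budget
    serial_time = (1.0 - f) / per_core_perf   # serial phase runs on one slower core
    parallel_time = f / (n * per_core_perf)   # parallel phase uses all n slower cores
    return 1.0 / (serial_time + parallel_time)

for n in (1, 2, 4, 8, 16):
    print(f"n = {n:2d}, f = 0.9 -> {revised_amdahl(0.9, n):.2f}x speedup")
```

Under these assumptions the speedup peaks at a modest core count and then flattens, which is the point of the slide: adding cores is not free once chip power is fixed.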
Challenges
Parallel scaling limits manycore
  > 4 cores only for well-behaved programs
  Optimistic about new applications
Interconnect overhead
Single-thread performance
  Will degrade unless we innovate
Parallel programming
  Express/extract parallelism in new ways
  Retrain programming workforce
Finding Parallelism
1. Functional parallelism
   Car: {engine, brakes, entertainment, nav, ...}
   Game: {physics, logic, UI, render, ...}
2. Automatic extraction
   Decompose serial programs
3. Data parallelism
   Vector, matrix, DB table, pixels, ...
4. Request parallelism
   Web, shared database, telephony, ...
Balancing Work
Amdahl's parallel phase f: all cores busy
If not perfectly balanced
  (1 - f) term grows (f is not fully parallel)
  Performance scaling suffers
Manageable for data- and request-parallel apps
Very difficult problem for the other two:
  Functional parallelism
  Automatically extracted parallelism
Coordinating Work
Synchronization
  Some data somewhere is shared
  Coordinate/order updates and reads
  Otherwise: chaos
Traditionally: locks and mutual exclusion
  Hard to get right, even harder to tune for performance
Research to reality: Transactional Memory
  Programmer: declare potential conflict
  Hardware and/or software: speculate & check
  Commit, or roll back and retry
  IBM and Intel announced support (soon)
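A minimal Python sketch of the traditional lock-based approach described above; the shared counter, thread count, and iteration count are illustrative only (Python has no transactional memory, so the TM alternative is not shown).

```python
import threading

counter = 0
lock = threading.Lock()

def worker(iters):
    global counter
    for _ in range(iters):
        with lock:           # mutual exclusion orders updates to shared data
            counter += 1     # without the lock, concurrent updates may be lost

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 400000 with the lock; possibly less without it
```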
Single-thread Performance
Still the most attractive source of performance
  Speeds up parallel and serial phases
  Can use it to buy back power
Must focus on power consumption
  Performance benefit vs. power cost
Focus of 752; brief review coming up
Focus of this Course
How to minimize these overheads
  Interconnect
  Synchronization
  Cache coherence
  Memory systems
Also
  How to write parallel programs (a little)
  Non-cache-coherent systems (clusters, MPP)
  Data-parallel systems
Expected Background
ECE/CS 552 or equivalent
  Design a simple uniprocessor
  Simple instruction sets
  Organization
  Datapath design
  Hardwired/microprogrammed control
  Simple pipelining
  Basic caches
  Some 752 content (optional review)
High-level programming experience
  C/UNIX skills to modify simulators
About This Course
Readings
  Posted on website later this week
  Make sure you keep up with these! Often discussed in depth in lecture, with required participation
  Subset of papers must be reviewed in writing, submitted through learn@uw
Lecture
  Attendance required, pop quizzes
Homeworks
  Not collected, for your benefit only
  Develop deeper understanding, prepare for midterms
About This Course
Exams
  Midterm 1: Friday 3/1 in class
  Midterm 2: Monday 4/8 in class
  Keep up with the reading list!
Textbook
  Dubois, Annavaram, Stenström, Parallel Computer Organization and Design, Cambridge Univ. Press, 2012.
  For reference: 4 beta chapters from Jim Smith, posted on course website
About This Course
Course Project
  Research project
    Replicate results from a paper
    Or attempt something novel
    Parallelize/characterize a new application
  Proposal due 3/22, status report due 4/22
  Final project includes a written report and an oral presentation
    Written reports due 5/14
    Presentations during class time 5/6, 5/8, 5/10
About This Course
Grading
  Quizzes and paper reviews  20%
  Midterm 1                  25%
  Midterm 2                  25%
  Project                    30%
Web page (check regularly): http://ece757.ece.wisc.edu
About This Course
Office hours
  Prof. Lipasti: EH 4613, M 9-11, or by appt.
Communication channels
  Email to instructor, class email list: [email protected]
  Web page: http://ece757.ece.wisc.edu
  Office hours
About This Course
Other resources
  Computer Architecture Colloquium: Tuesday 4-5 PM, 1325 CSS
  Computer Engineering Seminar: Friday 12-1 PM, EH 4610
  Architecture mailing list: http://lists.cs.wisc.edu/mailman/listinfo/architecture
  WWW Computer Architecture Page: http://www.cs.wisc.edu/~arch/www
About This Course
Lecture schedule
  MWF 1:00-2:15
  Cancel 1 of 3 lectures (on average)
  Free up several weeks near the end for project work
Tentative Schedule
  Week 1: Introduction, 752 review
  Week 2: 752 review, Multithreading & Multicore
  Week 3: MP Software, Memory Systems
  Week 4: MP Memory Systems
  Week 5: Coherence & Consistency
  Week 6: Lecture cancelled, Midterm 1 on 3/1
  Week 7: Simulation methodology, Transactional memory
  Week 8: Interconnection networks
  Week 9: SIMD, MPP
  Week 10: Dataflow, Clusters, GPGPUs
  Week 11: Midterm 2
  Week 12: No lecture
  Week 13: No lecture
  Week 14: No lecture
  Week 15: Project talks, Course evaluation
  Finals week: Project reports due 5/14
Brief Introduction to Parallel Computing
  Thread-level parallelism
  Multiprocessor systems
  Cache coherence
    Snoopy
    Scalable
  Flynn taxonomy
  UMA vs. NUMA
Thread-level Parallelism
Instruction-level parallelism (752 focus)
  Reaps performance by finding independent work in a single thread
Thread-level parallelism
  Reaps performance by finding independent work across multiple threads
Historically, requires explicitly parallel workloads
  Originates from mainframe time-sharing workloads
  Even then, CPU speed >> I/O speed
  Had to overlap I/O latency with something else for the CPU to do
  Hence, the operating system would schedule other tasks/processes/threads that were time-sharing the CPU
Thread-level Parallelism
Time-sharing reduces the effectiveness of temporal and spatial locality
[Figure: execution timelines for a single user vs. a time-shared system, interleaving CPU bursts (CPU1, CPU2, CPU3) with disk accesses and user think time]
  An increase in the number of active threads reduces the effectiveness of spatial locality by increasing the working set.
  Time dilation of each thread reduces the effectiveness of temporal locality.
Thread-level Parallelism
Initially motivated by time-sharing of a single CPU
  OS, applications written to be multithreaded
  Quickly led to adoption of multiple CPUs in a single system
  Enabled scalable product lines from entry-level single-CPU systems to high-end multiple-CPU systems
  Same applications, OS, run seamlessly
  Adding CPUs increases throughput (performance)
More recently:
  Multiple threads per processor core
    Coarse-grained multithreading (aka switch-on-event)
    Fine-grained multithreading
    Simultaneous multithreading
  Multiple processor cores per die
    Chip multiprocessors (CMP)
Multiprocessor Systems
Primary focus on shared-memory symmetric multiprocessors
  Many other types of parallel processor systems have been proposed and built
  Key attributes are:
    Shared memory: all physical memory is accessible to all CPUs
    Symmetric processors: all CPUs are alike
  Other parallel processors may:
    Share some memory, share disks, share nothing
    Have asymmetric processing units
Shared-memory idealisms
  Fully shared memory
  Unit latency
  Lack of contention
  Instantaneous propagation of writes
Motivation
So far: one processor in a system
Why not use N processors?
  Higher throughput via parallel jobs
  Cost-effective
    Adding 3 CPUs may get 4x throughput at only 2x cost
  Lower latency from multithreaded applications
    Software vendor has done the work for you
    E.g. database, web server
  Lower latency through parallelized applications
    Much harder than it sounds
Where to Connect Processors?
At processor?
  Single instruction, multiple data (SIMD)
At I/O system?
  Clusters or multicomputers
At memory system?
  Shared-memory multiprocessors
  Focus on Symmetric Multiprocessors (SMP)
Connect at Processor (SIMD)
[Figure: SIMD organization; a control processor with instruction memory drives many datapaths, each with its own registers, ALU, and data memory, joined by an interconnection network]
Connect at Processor
SIMD Assessment
  Amortizes cost of control unit over many datapaths
  Enables efficient, wide datapaths
  Programming model has limited flexibility
    Regular control flow, data access patterns
  SIMD widely employed today
    MMX, SSE, 3DNow! vector extensions
    Data elements are 8b multimedia operands
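An illustrative Python/NumPy sketch of the SIMD programming model just described: one logical operation applied across many small data elements with regular access patterns. This only models the idea at the language level; it is not an MMX/SSE example, though NumPy's inner loops are typically compiled to such vector instructions on real hardware.

```python
import numpy as np

a = np.arange(16, dtype=np.uint8)      # 16 8-bit "multimedia operands"
b = np.full(16, 3, dtype=np.uint8)
c = a + b                              # one logical operation, many data elements
print(c)
```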
Connect at I/O
Connect with a standard network (e.g. Ethernet)
  Called a cluster
  Adequate bandwidth (Gb Ethernet, going to 10 Gb)
  Latency very high
  Cheap, but you get what you pay for
Connect with a custom network (e.g. IBM SP1, SP2, SP3)
  Sometimes called a multicomputer
  Higher cost than a cluster
  Poorer communication than a multiprocessor
Internet data centers built this way
Connect at Memory: Multiprocessors
Shared-Memory Multiprocessors
  All processors can address all physical memory
  Demands evolutionary operating system changes
  Higher throughput with no application changes
  Low latency, but requires parallelization with proper synchronization
Most successful: Symmetric MP or SMP
  2-64 microprocessors on a bus
  Still use cache memories
Cache Coherence Problem
[Figure: processors P0 and P1 each Load A and cache the value 0; one processor then executes Store A, leaving the other processor's cached copy stale]
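A minimal Python sketch of the problem in the figure above, simulating two private caches with plain dictionaries; it is purely illustrative and involves no real cache hardware.

```python
memory = {"A": 0}
cache_p0, cache_p1 = {}, {}

cache_p0["A"] = memory["A"]   # P0: Load A  -> caches 0
cache_p1["A"] = memory["A"]   # P1: Load A  -> caches 0

cache_p1["A"] = 1             # P1: Store A -> only P1's cached copy is updated
memory["A"] = 1               # ... eventually written back to memory

print("P0 sees A =", cache_p0["A"])   # still 0: stale copy, caches are incoherent
print("P1 sees A =", cache_p1["A"])   # 1
```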
Snoopy Cache Coherence
All requests broadcast on the bus
All processors and memory snoop and respond
Cache blocks writeable at one processor or read-only at several
  Single-writer protocol
Snoops that hit dirty lines?
  Flush modified data out of the cache
  Either write back to memory, then satisfy the remote miss from memory, or
  Provide dirty data directly to the requestor
  Big problem in MP systems
    Dirty/coherence/sharing misses
Scalable Cache Coherence
Eschew the physical bus but still snoop
  Point-to-point tree structure
  Root of tree provides the ordering point
Or, use a level of indirection through a directory
  Directory at memory remembers:
    Which processor is the single writer
      Forwards requests to it
    Which processors are shared readers
      Forwards write-permission requests to them
  Level of indirection has a price
    Dirty misses require 3 hops instead of two
      Snoop: Requestor -> Owner -> Requestor
      Directory: Requestor -> Directory -> Owner -> Requestor
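A toy Python sketch of the directory bookkeeping described above: per-block state recording either a single writer (owner) or a set of shared readers. It is a data-structure illustration only, not an implementation of any real directory protocol.

```python
directory = {}   # block address -> {"owner": processor id or None, "sharers": set()}

def read_request(block, requestor):
    entry = directory.setdefault(block, {"owner": None, "sharers": set()})
    if entry["owner"] is not None:
        print(f"forward read of {block} to owner P{entry['owner']}")  # 3-hop dirty miss
        entry["sharers"] = {entry["owner"], requestor}
        entry["owner"] = None
    else:
        entry["sharers"].add(requestor)

def write_request(block, requestor):
    entry = directory.setdefault(block, {"owner": None, "sharers": set()})
    for p in entry["sharers"] - {requestor}:
        print(f"revoke read permission for {block} at sharer P{p}")
    entry["sharers"], entry["owner"] = set(), requestor

read_request("A", 0); read_request("A", 1); write_request("A", 1)
```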
Flynn Taxonomy (Flynn, 1966)
                        Single Data   Multiple Data
  Single Instruction    SISD          SIMD
  Multiple Instruction  MISD          MIMD
MISD
  Fault tolerance
  Pipeline processing/streaming, or systolic arrays
Now extended to SPMD: single program, multiple data
Memory Organization: UMA vs. NUMA
Uniform Memory Access ("dancehall")
[Figure: processors with caches on one side of the interconnection network, memories on the other; uniform memory latency]
Non-uniform Memory Access
[Figure: each processor/cache pair has its own local memory, joined by the interconnection network; short local latency, long remote memory latency]
Memory Taxonomy (for shared memory)
                        Uniform Memory   Non-uniform Memory
  Cache Coherence       CC-UMA           CC-NUMA
  No Cache Coherence    NCC-UMA          NCC-NUMA
NUMA wins out for practical implementation
Cache coherence favors the programmer
  Common in general-purpose systems
NCC widespread in scalable systems
  CC overhead is too high, not always necessary
Example Commercial Systems
CC-UMA (SMP)
  Sun E10000: http://doi.ieeecomputersociety.org/10.1109/40.653032
CC-NUMA
  SGI Origin 2000: "The SGI Origin: A ccNUMA Highly Scalable Server"
NCC-NUMA
  Cray T3E: http://www.cs.wisc.edu/~markhill/Misc/asplos96_t3e_comm.pdf
Clusters
  ASCI: https://www.llnl.gov/str/Seager.html
Weak Scaling and Gustafson's Law
Gustafson redefines speedup
  Workloads grow as more cores become available
  Assume that a larger workload (e.g. a bigger dataset) provides more robust utilization of the parallel machine

T_P = s + p, \qquad T_1 = s + pP

Let F = p / (s + p). Then S_P = \frac{s + pP}{s + p} = 1 - F + FP = 1 + F(P - 1)
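A minimal Python sketch of Gustafson's speedup as reconstructed above, printed next to Amdahl's Law for the same parallel fraction F; the chosen F and processor counts are illustrative only.

```python
def gustafson_speedup(F, P):
    return 1 + F * (P - 1)          # workload grows with P (weak scaling)

def amdahl_speedup(F, P):
    return 1.0 / ((1 - F) + F / P)  # workload fixed (strong scaling)

for P in (4, 16, 64):
    print(f"P = {P:3d}, F = 0.9: Gustafson {gustafson_speedup(0.9, P):6.1f}x, "
          f"Amdahl {amdahl_speedup(0.9, P):5.2f}x")
```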
Summary
  Thread-level parallelism
  Multiprocessor systems
  Cache coherence
    Snoopy
    Scalable
  Flynn taxonomy
  UMA vs. NUMA
  Gustafson's Law vs. Amdahl's Law