Computer Architecture Trends
DESCRIPTION
Current and previous trends in computer architecture.
TRANSCRIPT
ECE/CS 757: Advanced Computer Architecture II
Instructor: Mikko H. Lipasti
Spring 2013, University of Wisconsin-Madison
Lecture notes based on slides created by John Shen, Mark Hill, David Wood, Guri Sohi, Jim Smith, Natalie Enright Jerger, Michel Dubois, Murali Annavaram, Per Stenström, and probably others
Computer Architecture
Instruction Set Architecture (IBM 360): "the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation." - Amdahl, Blaauw, & Brooks, 1964
Machine Organization (microarchitecture): ALUs, buses, caches, memories, etc.
Machine Implementation (realization): gates, cells, transistors, wires
757 In Context
Prior courses
  352: gates up to multiplexors and adders
  354: high-level language down to the machine language interface, or instruction set architecture (ISA)
  552: implement logic that provides the ISA interface
  CS 537: provides OS background (co-req. OK)
This course
  757 covers parallel machines
    Multiprocessor systems
    Data-parallel systems
    Memory systems that exploit MLP
    Etc.
Additional courses
  ECE 752 covers advanced uniprocessor design (not a prereq); will review key topics in next lecture
  ECE 755 covers VLSI design
  ME/ECE 759 covers parallel programming
  CS 758 covers special topics (recently parallel programming)
Why Take 757?
To become a computer designer
  Alumni of this class helped design your computer
To learn what is under the hood of a computer
  Innate curiosity
  To better understand when things break
  To write better code/applications
  To write better system software (O/S, compiler, etc.)
Because it is intellectually fascinating!
Because multicore/parallel systems are ubiquitous
Computer Architecture
Exercise in engineering tradeoff analysis
  Find the fastest/cheapest/most power-efficient/etc. solution
  Optimization problem with hundreds of variables
All the variables are changing
  At non-uniform rates
  With inflection points
  Only one guarantee: today's right answer will be wrong tomorrow
Two high-level effects:
  Technology push
  Application pull
Trends
Moore's Law for device integration
Chip power consumption
Single-thread performance trend
[source: Intel]
Dynamic Power
Static CMOS: current flows when active
  Combinational logic evaluates new inputs
  Flip-flop, latch captures new value (clock edge)
Terms
  C: capacitance of circuit (wire length, number and size of transistors)
  V: supply voltage
  A: activity factor
  f: frequency
Future: fundamentally power constrained
P_{dyn} = k \sum_{i \in \text{units}} C_i V_i^2 A_i f_i
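A minimal sketch of the dynamic power formula above in Python; the function name, the constant k, and every per-unit value are illustrative assumptions, not measurements from the slides or from any real chip.

```python
# Toy evaluation of P_dyn = k * sum_i(C_i * V_i^2 * A_i * f_i) over units i.
def dynamic_power(units, k=1.0):
    return k * sum(c * v**2 * a * f for (c, v, a, f) in units)

# Hypothetical units: (capacitance [F], supply voltage [V], activity factor, frequency [Hz]).
units = [
    (1e-9, 1.0, 0.10, 2e9),   # core logic: higher activity
    (2e-9, 1.0, 0.02, 2e9),   # caches: more capacitance, lower activity
]
print(f"P_dyn = {dynamic_power(units):.2f} W")   # 0.28 W with these made-up numbers
```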
Multicore Mania
First, servers
  IBM Power4, 2001
Then desktops
  AMD Athlon X2, 2005
Then laptops
  Intel Core Duo, 2006
Now, cell phones & tablets
  Qualcomm, Nvidia Tegra, Apple A6, etc.
Why Multicore
                    Single Core   Dual Core   Quad Core
Core area           A             ~A/2        ~A/4
Core power          W             ~W/2        ~W/4
Chip power          W + O         W + O       W + O
Core performance    P             0.9P        0.8P
Chip performance    P             1.8P        3.2P
[Figure: chip floorplans for the single-core, dual-core, and quad-core configurations above]
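A small Python sketch of the arithmetic behind the table above: per-core performance degrades as cores shrink (the 0.9 and 0.8 factors are the table's assumed values), but chip performance is the product of core count and per-core performance.

```python
# Reproduce the "chip performance" row of the table from its other rows.
configs = {
    "single core": (1, 1.0),   # (cores, per-core performance relative to P)
    "dual core":   (2, 0.9),
    "quad core":   (4, 0.8),
}
for name, (n, per_core) in configs.items():
    print(f"{name}: chip performance = {n} x {per_core}P = {n * per_core:.1f}P")
```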
Amdahl's Law
f: fraction that can run in parallel
1 - f: fraction that must run serially
[Figure: execution time vs. number of CPUs; the serial fraction (1 - f) occupies one CPU, while the parallel fraction f is spread across n CPUs]

Speedup = \frac{1}{(1 - f) + \frac{f}{n}}

\lim_{n \to \infty} Speedup = \frac{1}{1 - f}
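A minimal Python sketch of the speedup expression above; the values of f and the CPU count are illustrative only.

```python
# Amdahl's Law: speedup on n CPUs when fraction f of the work is parallel.
def amdahl_speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

for f in (0.5, 0.9, 0.99):
    print(f"f = {f}: 16 CPUs -> {amdahl_speedup(f, 16):.2f}x, "
          f"limit as n -> infinity -> {1.0 / (1.0 - f):.0f}x")
```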
Fixed Chip Power Budget
Amdahl's Law ignores the (power) cost of n cores
Revised Amdahl's Law: with more cores under a fixed power budget, each core is slower, so the parallel speedup must outweigh the per-core slowdown (see the sketch below)
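A hedged sketch of that tradeoff. The slides do not give a specific model; the code below assumes, purely for illustration, that each of n equal cores receives 1/n of the chip power and that per-core performance scales as the square root of per-core power (a rough Pollack's-rule-style approximation).

```python
# ASSUMED model (not from the slides): per-core perf ~ sqrt(per-core power).
def revised_amdahl(f, n):
    per_core_perf = (1.0 / n) ** 0.5          # each core gets 1/n of the power budget
    serial_time = (1.0 - f) / per_core_perf   # serial phase runs on one slower core
    parallel_time = f / (n * per_core_perf)   # parallel phase uses all n slower cores
    return 1.0 / (serial_time + parallel_time)

for n in (1, 2, 4, 8, 16):
    print(f"n = {n:2d}, f = 0.9 -> {revised_amdahl(0.9, n):.2f}x speedup")
```

Under these assumptions the speedup peaks at a modest core count and then flattens, which is the point of the slide: adding cores is not free once chip power is fixed.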
Challenges
Parallel scaling limits manycore
  > 4 cores only for well-behaved programs
  Optimistic about new applications
Interconnect overhead
Single-thread performance
  Will degrade unless we innovate
Parallel programming
  Express/extract parallelism in new ways
  Retrain programming workforce
Finding Parallelism
1. Functional parallelism
   Car: {engine, brakes, entertainment, nav, ...}
   Game: {physics, logic, UI, render, ...}
2. Automatic extraction
   Decompose serial programs
3. Data parallelism
   Vector, matrix, DB table, pixels, ...
4. Request parallelism
   Web, shared database, telephony, ...
Balancing Work
Amdahl's parallel phase f: all cores busy
If not perfectly balanced
  (1 - f) term grows (f is not fully parallel)
  Performance scaling suffers
Manageable for data- and request-parallel apps
Very difficult problem for the other two:
  Functional parallelism
  Automatically extracted parallelism
Coordinating Work
Synchronization
  Some data somewhere is shared
  Coordinate/order updates and reads
  Otherwise: chaos
Traditionally: locks and mutual exclusion
  Hard to get right, even harder to tune for performance
Research to reality: Transactional Memory
  Programmer: declare potential conflict
  Hardware and/or software: speculate & check
  Commit, or roll back and retry
  IBM and Intel announced support (soon)
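A minimal Python sketch of the traditional lock-based approach described above; the shared counter, thread count, and iteration count are illustrative only (Python has no transactional memory, so the TM alternative is not shown).

```python
import threading

counter = 0
lock = threading.Lock()

def worker(iters):
    global counter
    for _ in range(iters):
        with lock:           # mutual exclusion orders updates to shared data
            counter += 1     # without the lock, concurrent updates may be lost

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 400000 with the lock; possibly less without it
```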
Single-thread Performance
Still the most attractive source of performance
  Speeds up parallel and serial phases
  Can use it to buy back power
Must focus on power consumption
  Performance benefit vs. power cost
Focus of 752; brief review coming up
Focus of this Course
How to minimize these overheads
  Interconnect
  Synchronization
  Cache coherence
  Memory systems
Also
  How to write parallel programs (a little)
  Non-cache-coherent systems (clusters, MPP)
  Data-parallel systems
Expected Background
ECE/CS 552 or equivalent
  Design a simple uniprocessor
  Simple instruction sets
  Organization
  Datapath design
  Hardwired/microprogrammed control
  Simple pipelining
  Basic caches
  Some 752 content (optional review)
High-level programming experience
  C/UNIX skills to modify simulators
About This Course
Readings
  Posted on website later this week
  Make sure you keep up with these! Often discussed in depth in lecture, with required participation
  Subset of papers must be reviewed in writing, submitted through learn@uw
Lecture
  Attendance required, pop quizzes
Homeworks
  Not collected, for your benefit only
  Develop deeper understanding, prepare for midterms
About This Course
Exams
  Midterm 1: Friday 3/1 in class
  Midterm 2: Monday 4/8 in class
  Keep up with the reading list!
Textbook
  Dubois, Annavaram, Stenström, Parallel Computer Organization and Design, Cambridge Univ. Press, 2012.
  For reference: 4 beta chapters from Jim Smith, posted on course website
About This Course
Course Project
  Research project
    Replicate results from a paper
    Or attempt something novel
    Parallelize/characterize a new application
  Proposal due 3/22, status report due 4/22
  Final project includes a written report and an oral presentation
    Written reports due 5/14
    Presentations during class time 5/6, 5/8, 5/10
About This Course
Grading
  Quizzes and paper reviews  20%
  Midterm 1                  25%
  Midterm 2                  25%
  Project                    30%
Web page (check regularly): http://ece757.ece.wisc.edu
About This Course
Office hours
  Prof. Lipasti: EH 4613, M 9-11, or by appt.
Communication channels
  Email to instructor, class email list: [email protected]
  Web page: http://ece757.ece.wisc.edu
  Office hours
About This Course
Other resources
  Computer Architecture Colloquium: Tuesday 4-5 PM, 1325 CSS
  Computer Engineering Seminar: Friday 12-1 PM, EH 4610
  Architecture mailing list: http://lists.cs.wisc.edu/mailman/listinfo/architecture
  WWW Computer Architecture Page: http://www.cs.wisc.edu/~arch/www
About This Course
Lecture schedule
  MWF 1:00-2:15
  Cancel 1 of 3 lectures (on average)
  Free up several weeks near the end for project work
Tentative Schedule
  Week 1: Introduction, 752 review
  Week 2: 752 review, Multithreading & Multicore
  Week 3: MP Software, Memory Systems
  Week 4: MP Memory Systems
  Week 5: Coherence & Consistency
  Week 6: Lecture cancelled, Midterm 1 on 3/1
  Week 7: Simulation methodology, Transactional memory
  Week 8: Interconnection networks
  Week 9: SIMD, MPP
  Week 10: Dataflow, Clusters, GPGPUs
  Week 11: Midterm 2
  Week 12: No lecture
  Week 13: No lecture
  Week 14: No lecture
  Week 15: Project talks, Course evaluation
  Finals week: Project reports due 5/14
Brief Introduction to Parallel Computing
  Thread-level parallelism
  Multiprocessor systems
  Cache coherence
    Snoopy
    Scalable
  Flynn taxonomy
  UMA vs. NUMA
Thread-level Parallelism
Instruction-level parallelism (752 focus)
  Reaps performance by finding independent work in a single thread
Thread-level parallelism
  Reaps performance by finding independent work across multiple threads
Historically, requires explicitly parallel workloads
  Originates from mainframe time-sharing workloads
  Even then, CPU speed >> I/O speed
  Had to overlap I/O latency with something else for the CPU to do
  Hence, the operating system would schedule other tasks/processes/threads that were time-sharing the CPU
Thread-level Parallelism
Time-sharing reduces the effectiveness of temporal and spatial locality
[Figure: execution timelines for a single user vs. a time-shared system, interleaving CPU bursts (CPU1, CPU2, CPU3) with disk accesses and user think time]
  An increase in the number of active threads reduces the effectiveness of spatial locality by increasing the working set.
  Time dilation of each thread reduces the effectiveness of temporal locality.
Thread-level Parallelism
Initially motivated by time-sharing of a single CPU
  OS, applications written to be multithreaded
  Quickly led to adoption of multiple CPUs in a single system
  Enabled scalable product lines from entry-level single-CPU systems to high-end multiple-CPU systems
  Same applications, OS, run seamlessly
  Adding CPUs increases throughput (performance)
More recently:
  Multiple threads per processor core
    Coarse-grained multithreading (aka switch-on-event)
    Fine-grained multithreading
    Simultaneous multithreading
  Multiple processor cores per die
    Chip multiprocessors (CMP)
Multiprocessor Systems
Primary focus on shared-memory symmetric multiprocessors
  Many other types of parallel processor systems have been proposed and built
  Key attributes are:
    Shared memory: all physical memory is accessible to all CPUs
    Symmetric processors: all CPUs are alike
  Other parallel processors may:
    Share some memory, share disks, share nothing
    Have asymmetric processing units
Shared-memory idealisms
  Fully shared memory
  Unit latency
  Lack of contention
  Instantaneous propagation of writes
Motivation
So far: one processor in a system
Why not use N processors?
  Higher throughput via parallel jobs
  Cost-effective
    Adding 3 CPUs may get 4x throughput at only 2x cost
  Lower latency from multithreaded applications
    Software vendor has done the work for you
    E.g. database, web server
  Lower latency through parallelized applications
    Much harder than it sounds
Where to Connect Processors?
At processor?
  Single instruction, multiple data (SIMD)
At I/O system?
  Clusters or multicomputers
At memory system?
  Shared-memory multiprocessors
  Focus on Symmetric Multiprocessors (SMP)
Connect at Processor (SIMD)
[Figure: SIMD organization; a control processor with instruction memory drives many datapaths, each with its own registers, ALU, and data memory, joined by an interconnection network]
Connect at Processor
SIMD Assessment
  Amortizes cost of control unit over many datapaths
  Enables efficient, wide datapaths
  Programming model has limited flexibility
    Regular control flow, data access patterns
  SIMD widely employed today
    MMX, SSE, 3DNow! vector extensions
    Data elements are 8b multimedia operands
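An illustrative Python/NumPy sketch of the SIMD programming model just described: one logical operation applied across many small data elements with regular access patterns. This only models the idea at the language level; it is not an MMX/SSE example, though NumPy's inner loops are typically compiled to such vector instructions on real hardware.

```python
import numpy as np

a = np.arange(16, dtype=np.uint8)      # 16 8-bit "multimedia operands"
b = np.full(16, 3, dtype=np.uint8)
c = a + b                              # one logical operation, many data elements
print(c)
```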
Connect at I/O
Connect with a standard network (e.g. Ethernet)
  Called a cluster
  Adequate bandwidth (Gb Ethernet, going to 10 Gb)
  Latency very high
  Cheap, but you get what you pay for
Connect with a custom network (e.g. IBM SP1, SP2, SP3)
  Sometimes called a multicomputer
  Higher cost than a cluster
  Poorer communication than a multiprocessor
Internet data centers built this way
Connect at Memory: Multiprocessors
Shared-Memory Multiprocessors
  All processors can address all physical memory
  Demands evolutionary operating system changes
  Higher throughput with no application changes
  Low latency, but requires parallelization with proper synchronization
Most successful: Symmetric MP or SMP
  2-64 microprocessors on a bus
  Still use cache memories
Cache Coherence Problem
[Figure: processors P0 and P1 each Load A and cache the value 0; one processor then executes Store A, leaving the other processor's cached copy stale]
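A minimal Python sketch of the problem in the figure above, simulating two private caches with plain dictionaries; it is purely illustrative and involves no real cache hardware.

```python
memory = {"A": 0}
cache_p0, cache_p1 = {}, {}

cache_p0["A"] = memory["A"]   # P0: Load A  -> caches 0
cache_p1["A"] = memory["A"]   # P1: Load A  -> caches 0

cache_p1["A"] = 1             # P1: Store A -> only P1's cached copy is updated
memory["A"] = 1               # ... eventually written back to memory

print("P0 sees A =", cache_p0["A"])   # still 0: stale copy, caches are incoherent
print("P1 sees A =", cache_p1["A"])   # 1
```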
Snoopy Cache Coherence
All requests broadcast on the bus
All processors and memory snoop and respond
Cache blocks writeable at one processor or read-only at several
  Single-writer protocol
Snoops that hit dirty lines?
  Flush modified data out of the cache
  Either write back to memory, then satisfy the remote miss from memory, or
  Provide dirty data directly to the requestor
  Big problem in MP systems
    Dirty/coherence/sharing misses
Scalable Cache Coherence
Eschew the physical bus but still snoop
  Point-to-point tree structure
  Root of tree provides the ordering point
Or, use a level of indirection through a directory
  Directory at memory remembers:
    Which processor is the single writer
      Forwards requests to it
    Which processors are shared readers
      Forwards write-permission requests to them
  Level of indirection has a price
    Dirty misses require 3 hops instead of two
      Snoop: Requestor -> Owner -> Requestor
      Directory: Requestor -> Directory -> Owner -> Requestor
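A toy Python sketch of the directory bookkeeping described above: per-block state recording either a single writer (owner) or a set of shared readers. It is a data-structure illustration only, not an implementation of any real directory protocol.

```python
directory = {}   # block address -> {"owner": processor id or None, "sharers": set()}

def read_request(block, requestor):
    entry = directory.setdefault(block, {"owner": None, "sharers": set()})
    if entry["owner"] is not None:
        print(f"forward read of {block} to owner P{entry['owner']}")  # 3-hop dirty miss
        entry["sharers"] = {entry["owner"], requestor}
        entry["owner"] = None
    else:
        entry["sharers"].add(requestor)

def write_request(block, requestor):
    entry = directory.setdefault(block, {"owner": None, "sharers": set()})
    for p in entry["sharers"] - {requestor}:
        print(f"revoke read permission for {block} at sharer P{p}")
    entry["sharers"], entry["owner"] = set(), requestor

read_request("A", 0); read_request("A", 1); write_request("A", 1)
```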
Flynn Taxonomy (Flynn, 1966)
                        Single Data   Multiple Data
  Single Instruction    SISD          SIMD
  Multiple Instruction  MISD          MIMD
MISD
  Fault tolerance
  Pipeline processing/streaming, or systolic arrays
Now extended to SPMD: single program, multiple data
Memory Organization: UMA vs. NUMA
Uniform Memory Access ("dancehall")
[Figure: processors with caches on one side of the interconnection network, memories on the other; uniform memory latency]
Non-uniform Memory Access
[Figure: each processor/cache pair has its own local memory, joined by the interconnection network; short local latency, long remote memory latency]
Memory Taxonomy (for shared memory)
                        Uniform Memory   Non-uniform Memory
  Cache Coherence       CC-UMA           CC-NUMA
  No Cache Coherence    NCC-UMA          NCC-NUMA
NUMA wins out for practical implementation
Cache coherence favors the programmer
  Common in general-purpose systems
NCC widespread in scalable systems
  CC overhead is too high, not always necessary
Example Commercial Systems
CC-UMA (SMP)
  Sun E10000: http://doi.ieeecomputersociety.org/10.1109/40.653032
CC-NUMA
  SGI Origin 2000: "The SGI Origin: A ccNUMA Highly Scalable Server"
NCC-NUMA
  Cray T3E: http://www.cs.wisc.edu/~markhill/Misc/asplos96_t3e_comm.pdf
Clusters
  ASCI: https://www.llnl.gov/str/Seager.html
Weak Scaling and Gustafson's Law
Gustafson redefines speedup
  Workloads grow as more cores become available
  Assume that a larger workload (e.g. a bigger dataset) provides more robust utilization of the parallel machine

T_P = s + p, \qquad T_1 = s + pP

Let F = p / (s + p). Then S_P = \frac{s + pP}{s + p} = 1 - F + FP = 1 + F(P - 1)
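A minimal Python sketch of Gustafson's speedup as reconstructed above, printed next to Amdahl's Law for the same parallel fraction F; the chosen F and processor counts are illustrative only.

```python
def gustafson_speedup(F, P):
    return 1 + F * (P - 1)          # workload grows with P (weak scaling)

def amdahl_speedup(F, P):
    return 1.0 / ((1 - F) + F / P)  # workload fixed (strong scaling)

for P in (4, 16, 64):
    print(f"P = {P:3d}, F = 0.9: Gustafson {gustafson_speedup(0.9, P):6.1f}x, "
          f"Amdahl {amdahl_speedup(0.9, P):5.2f}x")
```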
Summary
  Thread-level parallelism
  Multiprocessor systems
  Cache coherence
    Snoopy
    Scalable
  Flynn taxonomy
  UMA vs. NUMA
  Gustafson's Law vs. Amdahl's Law