Computer Architecture Trends


DESCRIPTION

Current and previous trends in computer architecture.

TRANSCRIPT


ECE/CS 757: Advanced Computer Architecture II

Instructor: Mikko H. Lipasti

Spring 2013, University of Wisconsin-Madison

Lecture notes based on slides created by John Shen, Mark Hill, David Wood, Guri Sohi, Jim Smith, Natalie Enright Jerger, Michel Dubois, Murali Annavaram, Per Stenström, and probably others

Computer Architecture

- Instruction Set Architecture (IBM 360): "the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation." (Amdahl, Blaauw, & Brooks, 1964)

- Machine Organization (microarchitecture): ALUs, buses, caches, memories, etc.

- Machine Implementation (realization): gates, cells, transistors, wires

757 In Context

Prior courses:
- 352: gates up to multiplexors and adders
- 354: high-level language down to the machine language interface, i.e. instruction set architecture (ISA)
- 552: implement logic that provides the ISA interface
- CS 537: provides OS background (co-req. OK)

This course (757) covers parallel machines:
- Multiprocessor systems
- Data-parallel systems
- Memory systems that exploit MLP
- Etc.

Additional courses:
- ECE 752 covers advanced uniprocessor design (not a prereq); will review key topics in the next lecture
- ECE 755 covers VLSI design
- ME/ECE 759 covers parallel programming
- CS 758 covers special topics (recently parallel programming)

Why Take 757?

- To become a computer designer: alumni of this class helped design your computer
- To learn what is "under the hood" of a computer: innate curiosity; to better understand when things break; to write better code/applications; to write better system software (O/S, compiler, etc.)
- Because it is intellectually fascinating!
- Because multicore/parallel systems are ubiquitous

Computer Architecture

- An exercise in engineering tradeoff analysis: find the fastest/cheapest/most power-efficient/etc. solution
- An optimization problem with hundreds of variables, all changing, at non-uniform rates, with inflection points
- Only one guarantee: today's right answer will be wrong tomorrow
- Two high-level effects: technology push and application pull

    Trends

- Moore's Law for device integration
- Chip power consumption
- Single-thread performance trend

[source: Intel]


Dynamic Power

- Static CMOS: current flows when active
- Combinational logic evaluates new inputs
- Flip-flop or latch captures new value (clock edge)

Terms:
- C: capacitance of circuit (wire length, number and size of transistors)
- V: supply voltage
- A: activity factor
- f: frequency

Summing over the units i of the chip:

$$P_{dyn} = k \sum_{i \in \text{units}} C_i V_i^2 A_i f_i$$

Future: fundamentally power constrained
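A minimal sketch of evaluating this equation; the unit names and parameter values below are illustrative assumptions, not from the slides:

```python
# Dynamic power: P_dyn = k * sum_i(C_i * V_i^2 * A_i * f_i).
# All unit names and parameter values are made-up illustrations.

def dynamic_power(units, k=1.0):
    """Sum C * V^2 * A * f over every unit on the chip."""
    return k * sum(c * v**2 * a * f for (c, v, a, f) in units)

# (capacitance in F, supply voltage in V, activity factor, frequency in Hz)
units = [
    (1e-9, 1.0, 0.10, 3e9),  # hypothetical core logic
    (4e-9, 1.0, 0.02, 3e9),  # hypothetical cache arrays
]
print(f"P_dyn = {dynamic_power(units):.2f} W")  # 0.54 W for these numbers
```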

Multicore Mania

- First, servers: IBM Power4, 2001
- Then desktops: AMD Athlon X2, 2005
- Then laptops: Intel Core Duo, 2006
- Now, cell phones & tablets: Qualcomm, Nvidia Tegra, Apple A6, etc.

Why Multicore?

                    Single Core   Dual Core   Quad Core
Core area           A             ~A/2        ~A/4
Core power          W             ~W/2        ~W/4
Chip power          W + O         W + O       W + O
Core performance    P             0.9P        0.8P
Chip performance    P             1.8P        3.2P


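A minimal sketch of the arithmetic behind the table's last row, assuming (my framing, not the slides') perfectly parallel work so that chip performance is cores times per-core performance:

```python
# Chip performance = number of cores x per-core performance.
configs = {
    "single": (1, 1.0),  # (cores, per-core performance relative to P)
    "dual":   (2, 0.9),
    "quad":   (4, 0.8),
}
for name, (cores, perf) in configs.items():
    print(f"{name:>6}: {cores} x {perf}P = {cores * perf:.1f}P")
# single: 1 x 1.0P = 1.0P, dual: 2 x 0.9P = 1.8P, quad: 4 x 0.8P = 3.2P
```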

Amdahl's Law

- f: fraction that can run in parallel
- 1 - f: fraction that must run serially


[Figure: execution time vs. number of CPUs; the serial fraction (1 - f) runs on a single CPU, while the parallel fraction f is spread across n CPUs.]

$$\text{Speedup} = \frac{1}{(1 - f) + \frac{f}{n}}$$

$$\lim_{n \to \infty} \text{Speedup} = \frac{1}{1 - f}$$
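A minimal sketch of the formula and its limit (the parameter choices are mine):

```python
# Amdahl's Law: speedup = 1 / ((1 - f) + f / n); as n grows, the parallel
# term vanishes and speedup approaches 1 / (1 - f).

def amdahl_speedup(f: float, n: int) -> float:
    """Speedup on n CPUs when fraction f of the work is parallelizable."""
    return 1.0 / ((1.0 - f) + f / n)

for f in (0.5, 0.9, 0.99):
    cols = ", ".join(f"n={n}: {amdahl_speedup(f, n):6.2f}" for n in (2, 8, 64))
    print(f"f={f}: {cols}, limit={1 / (1 - f):.0f}")
```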

Fixed Chip Power Budget

- Amdahl's Law ignores the (power) cost of n cores
- Revised Amdahl's Law: with more cores, each core must be slower, so the parallel speedup must outweigh the per-core slowdown (a sketch follows)
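One way to write the revision down, assuming (my assumption, not the slides') that staying within the power budget forces every core to run at relative per-core performance $s(n) \le 1$:

$$\text{Speedup}(n) = \frac{1}{\frac{1-f}{s(n)} + \frac{f}{n\,s(n)}} = \frac{s(n)}{(1-f) + \frac{f}{n}}$$

Under this model, adding cores only helps when the parallel gain grows faster than $s(n)$ shrinks.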


Challenges

- Parallel scaling limits many-core: >4 cores only for well-behaved programs; optimistic about new applications; interconnect overhead
- Single-thread performance: will degrade unless we innovate
- Parallel programming: express/extract parallelism in new ways; retrain the programming workforce


Finding Parallelism

1. Functional parallelism: Car: {engine, brakes, entertainment, nav, ...}; Game: {physics, logic, UI, render, ...}
2. Automatic extraction: decompose serial programs
3. Data parallelism: vector, matrix, DB table, pixels, ... (sketched after this list)
4. Request parallelism: web, shared database, telephony, ...
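A minimal sketch of data parallelism (item 3), assuming a hypothetical per-pixel kernel; every element is processed independently, so the work spreads across cores with no coordination:

```python
from multiprocessing import Pool

def brighten(pixel: int) -> int:
    """Hypothetical element-wise kernel: stand-in for any per-pixel op."""
    return min(pixel + 32, 255)

if __name__ == "__main__":
    pixels = list(range(256))
    with Pool() as pool:                     # one worker per core by default
        result = pool.map(brighten, pixels)  # elements processed in parallel
    print(result[:8])
```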


Balancing Work

- Amdahl's parallel phase f: all cores busy
- If not perfectly balanced, the (1 - f) term grows (f is not fully parallel) and performance scaling suffers
- Manageable for data- and request-parallel apps
- Very difficult problem for the other two: functional parallelism and automatically extracted parallelism


Coordinating Work

- Synchronization: some data somewhere is shared; coordinate/order updates and reads, otherwise chaos
- Traditionally: locks and mutual exclusion; hard to get right, even harder to tune for performance (a lock sketch follows)
- Research to reality: Transactional Memory. The programmer declares a potential conflict; hardware and/or software speculate & check, then commit or roll back and retry. IBM and Intel announced support (soon)
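A minimal sketch of lock-based mutual exclusion (my example, not the slides'): four threads increment one shared counter, and the lock serializes each read-modify-write:

```python
import threading

counter = 0
lock = threading.Lock()

def deposit(times: int) -> None:
    global counter
    for _ in range(times):
        with lock:  # without this, concurrent += can lose updates
            counter += 1

threads = [threading.Thread(target=deposit, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # always 400000 with the lock
```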


Single-thread Performance

- Still the most attractive source of performance: speeds up both parallel and serial phases; can use it to buy back power
- Must focus on power consumption: the performance benefit must outweigh the power cost
- Focus of 752; brief review coming up

Focus of this Course

- How to minimize these overheads: interconnect, synchronization, cache coherence, memory systems
- Also: how to write parallel programs (a little); non-cache-coherent systems (clusters, MPP); data-parallel systems


Expected Background

- ECE/CS 552 or equivalent: design a simple uniprocessor (simple instruction sets, organization, datapath design, hardwired/microprogrammed control, simple pipelining, basic caches); some 752 content (optional review)
- High-level programming experience
- C/UNIX skills to modify simulators

About This Course

Readings:
- Posted on the website later this week
- Make sure you keep up with these! Often discussed in depth in lecture, with required participation
- A subset of the papers must be reviewed in writing and submitted through learn@uw

Lecture:
- Attendance required; pop quizzes

Homeworks:
- Not collected, for your benefit only
- Develop deeper understanding, prepare for the midterms

About This Course

Exams:
- Midterm 1: Friday 3/1 in class
- Midterm 2: Monday 4/8 in class
- Keep up with the reading list!

Textbook:
- Dubois, Annavaram, Stenström, Parallel Computer Organization and Design, Cambridge Univ. Press, 2012.
- For reference: 4 beta chapters from Jim Smith, posted on the course website

About This Course

Course Project:
- Research project: replicate results from a paper, attempt something novel, or parallelize/characterize a new application
- Proposal due 3/22, status report due 4/22
- Final project includes a written report and an oral presentation: written reports due 5/14; presentations during class time 5/6, 5/8, 5/10

About This Course

Grading:
- Quizzes and paper reviews: 20%
- Midterm 1: 25%
- Midterm 2: 25%
- Project: 30%

Web page (check regularly): http://ece757.ece.wisc.edu

About This Course

Office hours: Prof. Lipasti, EH 4613, M 9-11, or by appointment

Communication channels:
- Email to instructor, class email list: [email protected]
- Web page: http://ece757.ece.wisc.edu
- Office hours


About This Course

Other resources:
- Computer Architecture Colloquium: Tuesday 4-5 PM, 1325 CSS
- Computer Engineering Seminar: Friday 12-1 PM, EH 4610
- Architecture mailing list: http://lists.cs.wisc.edu/mailman/listinfo/architecture
- WWW Computer Architecture Page: http://www.cs.wisc.edu/~arch/www

About This Course

Lecture schedule:
- MWF 1:00-2:15
- Cancel 1 of 3 lectures (on average)
- Free up several weeks near the end for project work

Tentative Schedule:
Week 1: Introduction, 752 review
Week 2: 752 review, Multithreading & Multicore
Week 3: MP Software, Memory Systems
Week 4: MP Memory Systems
Week 5: Coherence & Consistency
Week 6: Lecture cancelled; Midterm 1 on 3/1
Week 7: Simulation methodology, Transactional memory
Week 8: Interconnection networks
Week 9: SIMD, MPP
Week 10: Dataflow, Clusters, GPGPUs
Week 11: Midterm 2
Weeks 12-14: No lecture
Week 15: Project talks, course evaluation
Finals week: Project reports due 5/14

Brief Introduction to Parallel Computing

- Thread-level parallelism
- Multiprocessor systems
- Cache coherence: snoopy, scalable
- Flynn taxonomy
- UMA vs. NUMA


Thread-level Parallelism

- Instruction-level parallelism (752 focus): reaps performance by finding independent work in a single thread
- Thread-level parallelism: reaps performance by finding independent work across multiple threads
- Historically, requires explicitly parallel workloads; originates from mainframe time-sharing workloads
- Even then, CPU speed >> I/O speed, so I/O latency had to be overlapped with something else for the CPU to do
- Hence, the operating system would schedule other tasks/processes/threads that were time-sharing the CPU


Thread-level Parallelism

Time-sharing reduces the effectiveness of temporal and spatial locality.

[Figure: execution timelines. Single user: CPU1, disk access, think time; the CPU idles during disk access and think time. Time-shared: CPU1, CPU2, and CPU3 interleave, filling each thread's disk-access and think-time gaps with other threads' work.]

- An increase in the number of active threads reduces the effectiveness of spatial locality by increasing the working set.
- Time dilation of each thread reduces the effectiveness of temporal locality.



Thread-level Parallelism

- Initially motivated by time-sharing of a single CPU: OS and applications written to be multithreaded
- Quickly led to the adoption of multiple CPUs in a single system: enabled a scalable product line from entry-level single-CPU systems to high-end multiple-CPU systems; the same applications and OS run seamlessly; adding CPUs increases throughput (performance)
- More recently, multiple threads per processor core: coarse-grained multithreading (aka switch-on-event), fine-grained multithreading, simultaneous multithreading
- Multiple processor cores per die: chip multiprocessors (CMP)


Multiprocessor Systems

- Primary focus on shared-memory symmetric multiprocessors; many other types of parallel processor systems have been proposed and built
- Key attributes: shared memory (all physical memory is accessible to all CPUs); symmetric processors (all CPUs are alike)
- Other parallel processors may share some memory, share disks, or share nothing, and may have asymmetric processing units
- Shared-memory idealisms: fully shared memory, unit latency, lack of contention, instantaneous propagation of writes


Motivation

- So far: one processor in a system. Why not use N processors?
- Higher throughput via parallel jobs
- Cost-effective: adding 3 CPUs may get 4x throughput at only 2x cost
- Lower latency from multithreaded applications: the software vendor has done the work for you (e.g. database, web server)
- Lower latency through parallelized applications: much harder than it sounds


Where to Connect Processors?

- At the processor? Single instruction, multiple data (SIMD)
- At the I/O system? Clusters or multicomputers
- At the memory system? Shared-memory multiprocessors; focus on symmetric multiprocessors (SMP)


Connect at Processor (SIMD)

[Figure: SIMD organization. A control processor fetches from instruction memory and broadcasts each instruction to many identical datapaths, each with its own data memory, registers, and ALU, tied together by an interconnection network.]


Connect at Processor: SIMD Assessment

- Amortizes the cost of the control unit over many datapaths
- Enables efficient, wide datapaths
- Programming model has limited flexibility: regular control flow and data access patterns
- SIMD widely employed today: MMX, SSE, 3DNow! vector extensions; data elements are 8-bit multimedia operands
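A minimal sketch of the SIMD model (mine): one logical instruction applied across many data elements at once. NumPy's element-wise operations typically execute using exactly these kinds of wide vector instructions:

```python
import numpy as np

a = np.arange(16, dtype=np.uint8)    # 8-bit multimedia-style operands
b = np.full(16, 10, dtype=np.uint8)

c = a + b   # one logical "add" across all 16 lanes, no per-element loop
print(c)
```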



Connect at I/O

- Connect with a standard network (e.g. Ethernet): called a cluster; adequate bandwidth (GB Ethernet, going to 10 GB); latency very high; cheap, but you get what you pay for
- Connect with a custom network (e.g. IBM SP1, SP2, SP3): sometimes called a multicomputer; higher cost than a cluster; poorer communication than a multiprocessor
- Internet data centers are built this way

Connect at Memory: Multiprocessors

- Shared-memory multiprocessors: all processors can address all physical memory; demands evolutionary operating system changes; higher throughput with no application changes; low latency, but requires parallelization with proper synchronization
- Most successful: symmetric MP, or SMP; 2-64 microprocessors on a bus; still use cache memories


Cache Coherence Problem

[Figure: P0 and P1 each load A (value 0) into their private caches; one processor then stores to A, leaving the other's cached copy stale.]



Snoopy Cache Coherence

- All requests broadcast on the bus; all processors and memory snoop and respond
- Cache blocks are writeable at one processor or read-only at several: a single-writer protocol (sketched below)
- Snoops that hit dirty lines? Flush the modified data out of the cache: either write back to memory and then satisfy the remote miss from memory, or provide the dirty data directly to the requestor
- A big problem in MP systems: dirty/coherence/sharing misses
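A minimal sketch (mine, loosely MSI-shaped) of the single-writer/many-reader invariant: a write invalidates every other copy, and a read that snoop-hits a dirty line downgrades the owner:

```python
caches = {"P0": {}, "P1": {}, "P2": {}}  # per-CPU cache: block -> "M" or "S"

def snoop_read(cpu, block):
    for other, c in caches.items():
        if other != cpu and c.get(block) == "M":
            c[block] = "S"            # dirty hit: owner writes back, downgrades
    caches[cpu][block] = "S"          # read-only copy, shareable by several

def snoop_write(cpu, block):
    for other, c in caches.items():
        if other != cpu:
            c.pop(block, None)        # invalidate every other copy
    caches[cpu][block] = "M"          # exactly one writeable copy

snoop_read("P0", "A"); snoop_read("P1", "A")
snoop_write("P1", "A")
print(caches)  # {'P0': {}, 'P1': {'A': 'M'}, 'P2': {}}
```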


Scalable Cache Coherence

- Eschew the physical bus but still snoop: a point-to-point tree structure, where the root of the tree provides the ordering point
- Or, use a level of indirection through a directory. The directory at memory remembers which processor is the single writer (and forwards requests to it) and which processors are shared readers (and forwards write-permission requests to them)
- The level of indirection has a price: dirty misses require 3 hops instead of two
  - Snoop: requestor -> owner -> requestor
  - Directory: requestor -> directory -> owner -> requestor

Flynn Taxonomy (Flynn, 1966)

                      Single Data   Multiple Data
Single Instruction    SISD          SIMD
Multiple Instruction  MISD          MIMD

- MISD: fault tolerance; pipeline processing/streaming or systolic arrays
- Now extended to SPMD: single program, multiple data

Memory Organization: UMA vs. NUMA

[Figure: Uniform Memory Access ("dancehall"): processors with caches sit on one side of the interconnection network and all memories on the other, giving uniform memory latency. Non-uniform Memory Access: each processor has its own local memory attached, giving a short local latency but a long remote memory latency.]

Memory Taxonomy

For Shared Memory     Uniform Memory   Non-uniform Memory
Cache Coherence       CC-UMA           CC-NUMA
No Cache Coherence    NCC-UMA          NCC-NUMA

- NUMA wins out for practical implementation
- Cache coherence favors the programmer: common in general-purpose systems
- NCC is widespread in scalable systems: CC overhead is too high and not always necessary

Example Commercial Systems

- CC-UMA (SMP): Sun E10000, http://doi.ieeecomputersociety.org/10.1109/40.653032
- CC-NUMA: SGI Origin 2000, "The SGI Origin: A ccNUMA Highly Scalable Server"
- NCC-NUMA: Cray T3E, http://www.cs.wisc.edu/~markhill/Misc/asplos96_t3e_comm.pdf
- Clusters: ASCI, https://www.llnl.gov/str/Seager.html


Weak Scaling and Gustafson's Law

- Gustafson redefines speedup: workloads grow as more cores become available
- Assume that a larger workload (e.g. a bigger dataset) provides more robust utilization of the parallel machine
- With serial time $s$ and parallel time $p$ measured on the $P$-processor machine: $T_P = s + p$ and $T_1 = s + pP$. Let $F = p/(s+p)$. Then

$$S_P = \frac{T_1}{T_P} = \frac{s + pP}{s + p} = 1 - F + FP = 1 + F(P - 1)$$

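A minimal sketch (mine) contrasting the two models: Amdahl fixes the problem size, while Gustafson grows the parallel work with the machine (here F is the parallel fraction measured on the parallel machine):

```python
def amdahl(f: float, p: int) -> float:
    return 1.0 / ((1.0 - f) + f / p)   # strong scaling, fixed workload

def gustafson(F: float, p: int) -> float:
    return 1.0 + F * (p - 1)           # weak scaling, workload grows with p

F = 0.9
for p in (4, 16, 64):
    print(f"P={p:3d}: Amdahl {amdahl(F, p):6.2f}, Gustafson {gustafson(F, p):6.2f}")
```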

    Summary

- Thread-level parallelism
- Multiprocessor systems
- Cache coherence: snoopy, scalable
- Flynn taxonomy
- UMA vs. NUMA
- Gustafson's Law vs. Amdahl's Law