TRANSCRIPT
Lecture 01: Introduction
CSCE 513 Computer Architecture, Fall 2018
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
https://cse.sc.edu/~yanyh
1
Copyright and Acknowledgement
• Lots of the slides were adapted from lecture notes of the two textbooks with copyright of the publisher or the original authors, including Elsevier Inc., Morgan Kaufmann, David A. Patterson and John L. Hennessy.
• Some slides were adapted from the following courses:
  – UC Berkeley course "Computer Science 252: Graduate Computer Architecture" of David E. Culler, Copyright 2005 UCB
    • http://people.eecs.berkeley.edu/~culler/courses/cs252-s05/
  – Great Ideas in Computer Architecture (Machine Structures) by Randy Katz and Bernhard Boser
    • http://inst.eecs.berkeley.edu/~cs61c/fa16/
• I also refer to the following courses and lecture notes when preparing materials for this course:
  – Computer Science 152: Computer Architecture and Engineering, Spring 2016 by Dr. George Michelogiannakis from UC Berkeley
    • http://www-inst.eecs.berkeley.edu/~cs152/sp16/
  – Computer Science 252: Graduate Computer Architecture, Fall 2015 by Prof. Krste Asanović from UC Berkeley
    • http://www-inst.eecs.berkeley.edu/~cs252/fa15/
  – Computer Science 250: VLSI Systems Design, Spring 2016 by Prof. John Wawrzynek from UC Berkeley
    • http://www-inst.eecs.berkeley.edu/~cs250/sp16/
  – Computer System Architecture, Fall 2005 by Dr. Joel Emer and Prof. Arvind from MIT
    • http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-823-computer-system-architecture-fall-2005/
  – Synthesis Lectures on Computer Architecture
    • http://www.morganclaypool.com/toc/cac/1/1
• The uses of the materials (source code, slides, documents and videos) of this course are for educational purposes only and should be used only in conjunction with the textbook. Derivatives of the materials must acknowledge the copyright notices of this and the originals. Permission for commercial purposes should be obtained from the original copyright holder and the successive copyright holders, including myself.
2
Contents
• Computer components
• Computer architectures and great ideas in computer architectures
• Performance
3
Generation Of Computers
https://solarrenovate.com/the-evolution-of-computers/
4
New School Computer (#1)
Personal Mobile Devices
5
New School "Computer" (#2)
6
Classes of Computers
• Personal Mobile Device (PMD)
  – e.g. smart phones, tablet computers
  – Emphasis on energy efficiency and real-time
• Desktop Computing
  – Emphasis on price-performance
• Servers
  – Emphasis on availability, scalability, throughput
• Clusters / Warehouse Scale Computers
  – Used for "Software as a Service (SaaS)"
  – Emphasis on availability and price-performance
  – Sub-class: Supercomputers; emphasis: floating-point performance and fast internal networks
• Internet of Things / Embedded Computers
  – Emphasis: price
8
Notes by the Pioneers
• "I think there is a world market for maybe five computers."
  – Thomas Watson, chairman of IBM, 1943.
• "There is no reason for any individual to have a computer in their home."
  – Ken Olsen, president and founder of Digital Equipment Corporation, 1977.
• "640K [of memory] ought to be enough for anybody."
  – Bill Gates, chairman of Microsoft, 1981.
Components of a Computer
• Same components for all kinds of computer
  – Desktop, server, embedded
• Two core parts
  – Processor and memory
• Input/output includes
  – User-interface devices
    • Display, keyboard, mouse
  – Storage devices
    • Hard disk, CD/DVD, flash
  – Network adapters
    • For communicating with other computers
Inside the Processor (CPU)
• Functional units: perform computations
• Datapath: wires for moving data
• Control logic: sequences datapath, memory, and operations
• Cache memory
  – Small fast SRAM memory for immediate access to data
Apple A5
10
A Safe Place for Data
• Volatile main memory
  – Loses instructions and data when powered off
• Non-volatile secondary memory
  – Magnetic disk
  – Flash memory
  – Optical disk (CD-ROM, DVD)
Contents
• Computer components
• Computer architectures and great ideas in computer architectures
• Performance
12
What is "Computer Architecture"?
13
[Layer diagram, software above, hardware below:
  Applications
  Compiler / Operating System / Firmware
  Instruction Set Architecture (the software/hardware boundary)
  Instr. Set Proc., I/O system
  Datapath & Control
  Digital Design / Circuit Design
  Layout & fab / Semiconductor Materials]
14
The Instruction Set: a Critical Interface
[Diagram: the instruction set sits between software (above) and hardware (below)]
• Properties of a good abstraction
  – Lasts through many generations (portability)
  – Used in many different ways (generality)
  – Provides convenient functionality to higher levels
  – Permits an efficient implementation at lower levels
15
Elements of an ISA
• Set of machine-recognized data types
  – bytes, words, integers, floating point, strings, ...
• Operations performed on those data types
  – Add, sub, mul, div, xor, move, ...
• Programmable storage
  – regs, PC, memory
• Methods of identifying and obtaining data referenced by instructions (addressing modes)
  – Literal, reg., absolute, relative, reg + offset, ...
• Format (encoding) of the instructions
  – Opcode, operand fields, ...
Computer Architecture
How things are put together in design and implementation
• Capabilities & performance characteristics of principal functional units
  – (e.g., registers, ALU, shifters, logic units, ...)
• Ways in which these components are interconnected
• Information flows between components
• Logic and means by which such information flow is controlled
• Choreography of FUs to realize the ISA
16
Great Ideas in Computer Architectures
1. Design for Moore’s Law
2. Use abstraction to simplify design
3. Make the common case fast
4. Performance via parallelism
5. Performance via pipelining
6. Performance via prediction
7. Hierarchy of memories
8. Dependability via redundancy
17
Great Idea: "Moore's Law"
Gordon Moore, founder of Intel
• 1965: since the integrated circuit was invented, the number of transistors/inch² in these circuits roughly doubled every year; this trend would continue for the foreseeable future
• 1975: revised: circuit complexity doubles every two years
18
Image credit: Intel
Moore's Law trends
• More transistors = ↑ opportunities for exploiting parallelism at the instruction level (ILP)
  – Pipeline, superscalar, VLIW (Very Long Instruction Word), SIMD (Single Instruction Multiple Data) or vector, speculation, branch prediction
• General path of scaling
  – Wider instruction issue, longer pipeline
  – More speculation
  – More and larger registers and cache
• Increasing circuit density ~= increasing frequency ~= increasing performance
• Transparent to users
  – An easy job of getting better performance: buying faster processors (higher frequency)
• We have enjoyed this free lunch for several decades, however (TBC)...
19
Great Idea: Pipeline
Fundamental Execution Cycle
20
• Instruction Fetch: obtain instruction from program storage
• Instruction Decode: determine required actions and instruction size
• Operand Fetch: locate and obtain operand data
• Execute: compute result value or status
• Result Store: deposit results in storage for later use
• Next Instruction: determine successor instruction
[Diagram: Processor (regs, F.U.s) connected to Memory (program, data); the connection between them is the von Neumann bottleneck]
Pipelined Instruction Execution
21
[Diagram: four instructions, in program order, each passing through the five stages Ifetch, Reg, ALU, DMem, Reg; each instruction starts one clock cycle after the previous one, so the stages of successive instructions overlap across cycles 1-7]
Great Idea: Abstraction (Levels of Representation/Interpretation)
• High Level Language Program (e.g., C):
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
  ↓ Compiler
• Assembly Language Program (e.g., MIPS):
    lw  $t0, 0($2)
    lw  $t1, 4($2)
    sw  $t1, 0($2)
    sw  $t0, 4($2)
  ↓ Assembler
• Machine Language Program (MIPS):
    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111
  ↓ Machine Interpretation
• Hardware Architecture Description (e.g., block diagrams)
  ↓ Architecture Implementation
• Logic Circuit Description (Circuit Schematic Diagrams)
Anything can be represented as a number, i.e., data or instructions
22
The Memory Abstraction
• Association of <name, value> pairs
  – typically named as byte addresses
  – often values aligned on multiples of size
• Sequence of reads and writes
• Write binds a value to an address
  – Left value (lvalue)
• Read of addr returns the most recently written value bound to that address
  – Right value (rvalue)
[Interface: address (name), command (R/W), data (W), data (R), done]
    int a = b;
23
Great Idea: Memory Hierarchy
Levels of the Memory Hierarchy
24
Level (Capacity / Access Time / Cost) and staging transfer unit:
• CPU Registers: 100s bytes, << 1 ns — Instr. operands (prog./compiler, 1-8 bytes)
• Cache: 10s-100s KBytes, ~1 ns, $1s/MByte — Blocks (cache cntl, 8-128 bytes)
• Main Memory: MBytes, 100-300 ns, < $1/MByte — Pages (OS, 512-4K bytes)
• Disk: 10s GBytes, 10 ms (10,000,000 ns), $0.001/MByte — Files (user/operator, MBytes)
• Tape: infinite, sec-min, $0.0014/MByte
Upper levels are faster; lower levels are larger.
Processor-DRAM Memory Gap (latency)
25
[Chart: relative performance vs. time, 1980-2000, log scale ("Moore's Law"):
  µProc: 60%/yr. (2X/1.5 yr)
  DRAM: 9%/yr. (2X/10 yrs)
  Processor-Memory Performance Gap: grows 50% / year]
Jim Gray's Storage Latency Analogy: How Far Away is the Data?
26
[Chart mapping access latency (ns) to travel time and distance:
  Registers (~1 ns): my head, ~1 min
  On-chip / on-board cache (~2-10 ns): this room / this campus, ~10 min
  Main memory (~100 ns): Charleston, ~2 hr
  Disk (~10^6 ns): Pluto, ~2 years
  Tape / optical robot (~10^9 ns): Andromeda, ~2,000 years]
Jim Gray, Turing Award; B.S. Cal 1966, Ph.D. Cal 1969
The Principle of Locality
• The Principle of Locality:
  – Programs access a relatively small portion of the address space at any instant of time.
• Two different types of locality:
  – Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  – Spatial Locality (locality in space): if an item is referenced, close-by items tend to be referenced soon (e.g., straight-line code, array access)
• For the last 30 years, HW has relied on locality for speed
27
[Diagram: Processor – $ (cache) – MEM]
Great Idea: Parallelism
28
Parallelism
• Classes of parallelism in applications:
  – Data-Level Parallelism (DLP)
  – Task-Level Parallelism (TLP)
• Classes of architectural parallelism:
  – Instruction-Level Parallelism (ILP)
  – Vector architectures / Graphic Processor Units (GPUs)
  – Thread-Level Parallelism
  – Heterogeneity
29
Computer Architecture Topics
30
[Diagram of topic areas:
  Instruction Set Architecture: addressing, protection, exception handling
  Pipelining and Instruction Level Parallelism: pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, dynamic compilation
  Memory Hierarchy: L1 cache, L2/L3 cache, DRAM; coherence, bandwidth, latency; emerging technologies, interleaving, bus protocols
  Input/Output and Storage: disks, WORM, tape; RAID
  Network Communication with Other Processors; VLSI]
Why is Architecture Exciting Today?
31
[Chart: single-processor performance over time; rapid RISC-era growth, then CPU speed flattens and the industry moves to multi-processor]
32
Problems of Traditional ILP Scaling
• Fundamental circuit limitations¹
  – delays ⇑ as issue queues ⇑ and multi-port register files ⇑
  – increasing delays limit performance returns from wider issue
• Limited amount of instruction-level parallelism¹
  – inefficient for codes with difficult-to-predict branches
• Power and heat stall clock frequencies
33
[1] The case for a single-chip multiprocessor, K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, ASPLOS-VII, 1996.
ILP impacts
34
Simulations of 8-issue Superscalar
35
Power/Heat Density Limits Frequency
36
• Some fundamental physical limits are being reached
Recent Multicore Processors
37
Recent Manycore GPU processors
38
An Overview of the GK110 Kepler Architecture
Kepler GK110 was built first and foremost for Tesla, and its goal was to be the highest performing parallel computing microprocessor in the world. GK110 not only greatly exceeds the raw compute horsepower delivered by Fermi, but it does so efficiently, consuming significantly less power and generating much less heat output.
A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers. Different products will use different configurations of GK110. For example, some products may deploy 13 or 14 SMXs.
Key features of the architecture that will be discussed below in more depth include:
• The new SMX processor architecture
• An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation
• Hardware support throughout the design to enable new programming model capabilities
[Figure: Kepler GK110 full-chip block diagram]
Streaming Multiprocessor (SMX) Architecture
Kepler GK110's new SMX introduces several architectural innovations that make it not only the most powerful multiprocessor we've built, but also the most programmable and power efficient.
[Figure: SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST)]
Kepler Memory Subsystem / L1, L2, ECC
Kepler's memory hierarchy is organized similarly to Fermi. The Kepler architecture supports a unified memory request path for loads and stores, with an L1 cache per SMX multiprocessor. Kepler GK110 also enables compiler-directed use of an additional new cache for read-only data, as described below.
64 KB Configurable Shared Memory and L1 Cache
In the Kepler GK110 architecture, as in the previous-generation Fermi architecture, each SMX has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. Kepler now allows for additional flexibility in configuring the allocation of shared memory and L1 cache by permitting a 32 KB / 32 KB split between shared memory and L1 cache. To support the increased throughput of each SMX unit, the shared memory bandwidth for 64b and larger load operations is also doubled compared to the Fermi SM, to 256B per core clock.
48 KB Read-Only Data Cache
In addition to the L1 cache, Kepler introduces a 48 KB cache for data that is known to be read-only for the duration of the function. In the Fermi generation, this cache was accessible only by the Texture unit. Expert programmers often found it advantageous to load data through this path explicitly by mapping their data as textures, but this approach had many limitations.
• ~5K cores
Current Trends in Architecture
• Leveraging Instruction-Level Parallelism (ILP) is near an end
  – Single-processor performance improvement ended in 2003
• New models for performance:
  – Data-level parallelism (DLP)
  – Thread-level parallelism (TLP)
• Exciting topics and challenges
  – Heterogeneity
  – Domain-specific architectures
  – Software and hardware co-design
  – Agile development
• DARPA Picks Its First Set of Winners in Electronics Resurgence Initiative, July 2018
  – https://spectrum.ieee.org/tech-talk/semiconductors/design/darpa-picks-its-first-set-of-winners-in-electronics-resurgence-initiative.amp.html
39
• Video: https://www.acm.org/hennessy-patterson-turing-lecture
• Short summary
  – https://www.hpcwire.com/2018/04/17/hennessy-patterson-a-new-golden-age-for-computer-architecture/
40
Exercise: Inspect ISA for sum
• Sum example
  – https://passlab.github.io/CSCE513/exercises/sum
• Check
  – sum_full.s
  – sum_riscv.s
  – sum_x86.s
• Generate and execute
  – gcc -save-temps sum.c -o sum
  – ./sum 102400
• For how to compile and run a Linux program
  – https://passlab.github.io/CSCE513/notes/lecture01_LinuxCProgramming.pdf
• Other system commands:
  – cat /proc/cpuinfo to show the CPU and # cores
  – top command to show system usage and memory
41
Machine for Development and Experiment
• Linux machines in Swearingen 1D43 and 3D22
  – All CSCE students by default have access to these machines using their standard login credentials
    • Let me know if you, CSCE or not, cannot access them
  – Remote access is also available via SSH over port 222. The naming schema is as follows:
    • l-1d43-01.cse.sc.edu through l-1d43-26.cse.sc.edu
    • l-3d22-01.cse.sc.edu through l-3d22-20.cse.sc.edu
• Accounts are restricted to 2 GB of data in their home folder (~/).
  – For more space, create a directory in /scratch on the login machine; however, that data is not shared and will only be available on that specific machine.
42
Putty SSH Connection on Windows
43
[Screenshot: Putty configured with host l-1d43-08.cse.sc.edu, port 222]
SSH Connection from Linux/Mac OS X Terminal
44
-X enables X-window forwarding so you can use the graphics display on your computer. For Mac OS X, you need to have X server software installed; e.g. XQuartz (https://www.xquartz.org/) is the one I use.