TRANSCRIPT
Lecture 01: Introduction
CSCE 513 Computer Architecture, Fall 2018
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
https://cse.sc.edu/~yanyh
1
Copyright and Acknowledgement
• Lots of the slides were adapted from lecture notes of the two textbooks with copyright of the publisher or the original authors, including Elsevier Inc., Morgan Kaufmann, David A. Patterson and John L. Hennessy.
• Some slides were adapted from the following courses:
  – UC Berkeley course "Computer Science 252: Graduate Computer Architecture" of David E. Culler, Copyright 2005 UCB
    • http://people.eecs.berkeley.edu/~culler/courses/cs252-s05/
  – Great Ideas in Computer Architecture (Machine Structures) by Randy Katz and Bernhard Boser
    • http://inst.eecs.berkeley.edu/~cs61c/fa16/
• I also refer to the following courses and lecture notes when preparing materials for this course:
  – Computer Science 152: Computer Architecture and Engineering, Spring 2016 by Dr. George Michelogiannakis from UC Berkeley
    • http://www-inst.eecs.berkeley.edu/~cs152/sp16/
  – Computer Science 252: Graduate Computer Architecture, Fall 2015 by Prof. Krste Asanović from UC Berkeley
    • http://www-inst.eecs.berkeley.edu/~cs252/fa15/
  – Computer Science 250: VLSI Systems Design, Spring 2016 by Prof. John Wawrzynek from UC Berkeley
    • http://www-inst.eecs.berkeley.edu/~cs250/sp16/
  – Computer System Architecture, Fall 2005 by Dr. Joel Emer and Prof. Arvind from MIT
    • http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-823-computer-system-architecture-fall-2005/
  – Synthesis Lectures on Computer Architecture
    • http://www.morganclaypool.com/toc/cac/1/1
• The uses of the materials (source code, slides, documents and videos) of this course are for educational purposes only and should be used only in conjunction with the textbook. Derivatives of the materials must acknowledge the copyright notices of this and the originals. Permission for commercial purposes should be obtained from the original copyright holder and the successive copyright holders, including myself.
2
Contents
• Computer components
• Computer architectures and great ideas in computer architectures
• Performance
3
Generation Of Computers
https://solarrenovate.com/the-evolution-of-computers/
4
New School Computer (#1)
Personal Mobile Devices
5
New School "Computer" (#2)
6
Classes of Computers
• Personal Mobile Device (PMD)
  – e.g. smart phones, tablet computers
  – Emphasis on energy efficiency and real-time
• Desktop Computing
  – Emphasis on price-performance
• Servers
  – Emphasis on availability, scalability, throughput
• Clusters / Warehouse Scale Computers
  – Used for "Software as a Service (SaaS)"
  – Emphasis on availability and price-performance
  – Sub-class: Supercomputers; emphasis: floating-point performance and fast internal networks
• Internet of Things / Embedded Computers
  – Emphasis: price
8
Notes by the Pioneers
• "I think there is a world market for maybe five computers."
  – Thomas Watson, chairman of IBM, 1943.
• "There is no reason for any individual to have a computer in their home."
  – Ken Olsen, president and founder of Digital Equipment Corporation, 1977.
• "640K [of memory] ought to be enough for anybody."
  – Bill Gates, chairman of Microsoft, 1981.
Components of a Computer
• Same components for all kinds of computer
  – Desktop, server, embedded
• Two core parts
  – Processor and memory
• Input/output includes
  – User-interface devices
    • Display, keyboard, mouse
  – Storage devices
    • Hard disk, CD/DVD, flash
  – Network adapters
    • For communicating with other computers
Inside the Processor (CPU)
• Functional units: perform computations
• Datapath: wires for moving data
• Control logic: sequences datapath, memory, and operations
• Cache memory
  – Small fast SRAM memory for immediate access to data
Apple A5
10
A Safe Place for Data
• Volatile main memory
  – Loses instructions and data when powered off
• Non-volatile secondary memory
  – Magnetic disk
  – Flash memory
  – Optical disk (CD-ROM, DVD)
Contents
• Computer components
• Computer architectures and great ideas in computer architectures
• Performance
12
What is "Computer Architecture"?
13
[Layer diagram, software above, hardware below:
  Applications
  Compiler / Operating System / Firmware
  Instruction Set Architecture (the software/hardware boundary)
  Instr. Set Proc., I/O system
  Datapath & Control
  Digital Design / Circuit Design
  Layout & fab / Semiconductor Materials]
14
The Instruction Set: a Critical Interface
[Diagram: the instruction set sits between software (above) and hardware (below)]
• Properties of a good abstraction
  – Lasts through many generations (portability)
  – Used in many different ways (generality)
  – Provides convenient functionality to higher levels
  – Permits an efficient implementation at lower levels
15
Elements of an ISA
• Set of machine-recognized data types
  – bytes, words, integers, floating point, strings, ...
• Operations performed on those data types
  – Add, sub, mul, div, xor, move, ...
• Programmable storage
  – regs, PC, memory
• Methods of identifying and obtaining data referenced by instructions (addressing modes)
  – Literal, reg., absolute, relative, reg + offset, ...
• Format (encoding) of the instructions
  – Opcode, operand fields, ...
Computer Architecture
How things are put together in design and implementation
• Capabilities & performance characteristics of principal functional units
  – (e.g., registers, ALU, shifters, logic units, ...)
• Ways in which these components are interconnected
• Information flows between components
• Logic and means by which such information flow is controlled
• Choreography of FUs to realize the ISA
16
Great Ideas in Computer Architectures
1. Design for Moore’s Law
2. Use abstraction to simplify design
3. Make the common case fast
4. Performance via parallelism
5. Performance via pipelining
6. Performance via prediction
7. Hierarchy of memories
8. Dependability via redundancy
17
Great Idea: "Moore's Law"
Gordon Moore, founder of Intel
• 1965: since the integrated circuit was invented, the number of transistors/inch² in these circuits roughly doubled every year; this trend would continue for the foreseeable future
• 1975: revised: circuit complexity doubles every two years
18
Image credit: Intel
Moore's Law trends
• More transistors = ↑ opportunities for exploiting parallelism at the instruction level (ILP)
  – Pipeline, superscalar, VLIW (Very Long Instruction Word), SIMD (Single Instruction Multiple Data) or vector, speculation, branch prediction
• General path of scaling
  – Wider instruction issue, longer pipeline
  – More speculation
  – More and larger registers and cache
• Increasing circuit density ~= increasing frequency ~= increasing performance
• Transparent to users
  – An easy job of getting better performance: buying faster processors (higher frequency)
• We have enjoyed this free lunch for several decades, however (TBC)...
19
Great Idea: Pipeline
Fundamental Execution Cycle
20
• Instruction Fetch: obtain instruction from program storage
• Instruction Decode: determine required actions and instruction size
• Operand Fetch: locate and obtain operand data
• Execute: compute result value or status
• Result Store: deposit results in storage for later use
• Next Instruction: determine successor instruction
[Diagram: Processor (regs, F.U.s) connected to Memory (program, data); the connection between them is the von Neumann bottleneck]
Pipelined Instruction Execution
21
[Diagram: four instructions, in program order, each passing through the five stages Ifetch, Reg, ALU, DMem, Reg; each instruction starts one clock cycle after the previous one, so the stages of successive instructions overlap across cycles 1-7]
Great Idea: Abstraction (Levels of Representation/Interpretation)
• High Level Language Program (e.g., C):
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
  ↓ Compiler
• Assembly Language Program (e.g., MIPS):
    lw  $t0, 0($2)
    lw  $t1, 4($2)
    sw  $t1, 0($2)
    sw  $t0, 4($2)
  ↓ Assembler
• Machine Language Program (MIPS):
    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111
  ↓ Machine Interpretation
• Hardware Architecture Description (e.g., block diagrams)
  ↓ Architecture Implementation
• Logic Circuit Description (Circuit Schematic Diagrams)
Anything can be represented as a number, i.e., data or instructions
22
The Memory Abstraction
• Association of <name, value> pairs
  – typically named as byte addresses
  – often values aligned on multiples of size
• Sequence of reads and writes
• Write binds a value to an address
  – Left value (lvalue)
• Read of addr returns the most recently written value bound to that address
  – Right value (rvalue)
[Interface: address (name), command (R/W), data (W), data (R), done]
    int a = b;
23
Great Idea: Memory Hierarchy
Levels of the Memory Hierarchy
24
Level (Capacity / Access Time / Cost) and staging transfer unit:
• CPU Registers: 100s bytes, << 1 ns — Instr. operands (prog./compiler, 1-8 bytes)
• Cache: 10s-100s KBytes, ~1 ns, $1s/MByte — Blocks (cache cntl, 8-128 bytes)
• Main Memory: MBytes, 100-300 ns, < $1/MByte — Pages (OS, 512-4K bytes)
• Disk: 10s GBytes, 10 ms (10,000,000 ns), $0.001/MByte — Files (user/operator, MBytes)
• Tape: infinite, sec-min, $0.0014/MByte
Upper levels are faster; lower levels are larger.
Processor-DRAM Memory Gap (latency)
25
[Chart: relative performance vs. time, 1980-2000, log scale ("Moore's Law"):
  µProc: 60%/yr. (2X/1.5 yr)
  DRAM: 9%/yr. (2X/10 yrs)
  Processor-Memory Performance Gap: grows 50% / year]
Jim Gray's Storage Latency Analogy: How Far Away is the Data?
26
[Chart mapping access latency (ns) to travel time and distance:
  Registers (~1 ns): my head, ~1 min
  On-chip / on-board cache (~2-10 ns): this room / this campus, ~10 min
  Main memory (~100 ns): Charleston, ~2 hr
  Disk (~10^6 ns): Pluto, ~2 years
  Tape / optical robot (~10^9 ns): Andromeda, ~2,000 years]
Jim Gray, Turing Award; B.S. Cal 1966, Ph.D. Cal 1969
The Principle of Locality
• The Principle of Locality:
  – Programs access a relatively small portion of the address space at any instant of time.
• Two different types of locality:
  – Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  – Spatial Locality (locality in space): if an item is referenced, close-by items tend to be referenced soon (e.g., straight-line code, array access)
• For the last 30 years, HW has relied on locality for speed
27
[Diagram: Processor – $ (cache) – MEM]
Great Idea: Parallelism
28
Parallelism
• Classes of parallelism in applications:
  – Data-Level Parallelism (DLP)
  – Task-Level Parallelism (TLP)
• Classes of architectural parallelism:
  – Instruction-Level Parallelism (ILP)
  – Vector architectures / Graphic Processor Units (GPUs)
  – Thread-Level Parallelism
  – Heterogeneity
29
Computer Architecture Topics
30
[Diagram of topic areas:
  Instruction Set Architecture: addressing, protection, exception handling
  Pipelining and Instruction Level Parallelism: pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, dynamic compilation
  Memory Hierarchy: L1 cache, L2/L3 cache, DRAM; coherence, bandwidth, latency; emerging technologies, interleaving, bus protocols
  Input/Output and Storage: disks, WORM, tape; RAID
  Network Communication with Other Processors; VLSI]
Why is Architecture Exciting Today?
31
[Chart: single-processor performance over time; rapid RISC-era growth, then CPU speed flattens and the industry moves to multi-processor]
32
Problems of Traditional ILP Scaling
• Fundamental circuit limitations¹
  – delays ⇑ as issue queues ⇑ and multi-port register files ⇑
  – increasing delays limit performance returns from wider issue
• Limited amount of instruction-level parallelism¹
  – inefficient for codes with difficult-to-predict branches
• Power and heat stall clock frequencies
33
[1] The case for a single-chip multiprocessor, K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, ASPLOS-VII, 1996.
ILP impacts
34
Simulations of 8-issue Superscalar
35
Power/Heat Density Limits Frequency
36
• Some fundamental physical limits are being reached
Recent Multicore Processors
37
Recent Manycore GPU processors
38
An Overview of the GK110 Kepler Architecture
Kepler GK110 was built first and foremost for Tesla, and its goal was to be the highest performing parallel computing microprocessor in the world. GK110 not only greatly exceeds the raw compute horsepower delivered by Fermi, but it does so efficiently, consuming significantly less power and generating much less heat output.
A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers. Different products will use different configurations of GK110. For example, some products may deploy 13 or 14 SMXs.
Key features of the architecture that will be discussed below in more depth include:
• The new SMX processor architecture
• An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation
• Hardware support throughout the design to enable new programming model capabilities
[Figure: Kepler GK110 full-chip block diagram]
Streaming Multiprocessor (SMX) Architecture
Kepler GK110's new SMX introduces several architectural innovations that make it not only the most powerful multiprocessor we've built, but also the most programmable and power efficient.
[Figure: SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST)]
Kepler Memory Subsystem / L1, L2, ECC
Kepler's memory hierarchy is organized similarly to Fermi. The Kepler architecture supports a unified memory request path for loads and stores, with an L1 cache per SMX multiprocessor. Kepler GK110 also enables compiler-directed use of an additional new cache for read-only data, as described below.
64 KB Configurable Shared Memory and L1 Cache
In the Kepler GK110 architecture, as in the previous-generation Fermi architecture, each SMX has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. Kepler now allows for additional flexibility in configuring the allocation of shared memory and L1 cache by permitting a 32 KB / 32 KB split between shared memory and L1 cache. To support the increased throughput of each SMX unit, the shared memory bandwidth for 64b and larger load operations is also doubled compared to the Fermi SM, to 256B per core clock.
48 KB Read-Only Data Cache
In addition to the L1 cache, Kepler introduces a 48 KB cache for data that is known to be read-only for the duration of the function. In the Fermi generation, this cache was accessible only by the Texture unit. Expert programmers often found it advantageous to load data through this path explicitly by mapping their data as textures, but this approach had many limitations.
• ~5K cores
Current Trends in Architecture
• Leveraging Instruction-Level Parallelism (ILP) is near an end
  – Single-processor performance improvement ended in 2003
• New models for performance:
  – Data-level parallelism (DLP)
  – Thread-level parallelism (TLP)
• Exciting topics and challenges
  – Heterogeneity
  – Domain-specific architectures
  – Software and hardware co-design
  – Agile development
• DARPA Picks Its First Set of Winners in Electronics Resurgence Initiative, July 2018
  – https://spectrum.ieee.org/tech-talk/semiconductors/design/darpa-picks-its-first-set-of-winners-in-electronics-resurgence-initiative.amp.html
39
• Video: https://www.acm.org/hennessy-patterson-turing-lecture
• Short summary
  – https://www.hpcwire.com/2018/04/17/hennessy-patterson-a-new-golden-age-for-computer-architecture/
40
Exercise: Inspect ISA for sum
• Sum example
  – https://passlab.github.io/CSCE513/exercises/sum
• Check
  – sum_full.s
  – sum_riscv.s
  – sum_x86.s
• Generate and execute
  – gcc -save-temps sum.c -o sum
  – ./sum 102400
• For how to compile and run a Linux program
  – https://passlab.github.io/CSCE513/notes/lecture01_LinuxCProgramming.pdf
• Other system commands:
  – cat /proc/cpuinfo to show the CPU and # cores
  – top command to show system usage and memory
41
Machine for Development and Experiment
• Linux machines in Swearingen 1D43 and 3D22
  – All CSCE students by default have access to these machines using their standard login credentials
    • Let me know if you, CSCE or not, cannot access them
  – Remote access is also available via SSH over port 222. The naming schema is as follows:
    • l-1d43-01.cse.sc.edu through l-1d43-26.cse.sc.edu
    • l-3d22-01.cse.sc.edu through l-3d22-20.cse.sc.edu
• Accounts are restricted to 2 GB of data in their home folder (~/).
  – For more space, create a directory in /scratch on the login machine; however, that data is not shared and will only be available on that specific machine.
42
Putty SSH Connection on Windows
43
[Screenshot: Putty configured with host l-1d43-08.cse.sc.edu, port 222]
SSH Connection from Linux/Mac OS X Terminal
44
-X enables X-window forwarding so you can use the graphics display on your computer. For Mac OS X, you need to have X server software installed; e.g. XQuartz (https://www.xquartz.org/) is the one I use.