
Lecture 01: Introduction

CSCE 513 Computer Architecture, Fall 2018

Department of Computer Science and Engineering
Yonghong Yan
yanyh@cse.sc.edu
http://cse.sc.edu/~yanyh

Copyright and Acknowledgement

• Lots of the slides were adapted from lecture notes of the two textbooks, with copyright of the publisher or the original authors, including Elsevier Inc., Morgan Kaufmann, David A. Patterson and John L. Hennessy.
• Some slides were adapted from the following courses:
– UC Berkeley course "Computer Science 252: Graduate Computer Architecture" by David E. Culler, Copyright 2005 UCB
• http://people.eecs.berkeley.edu/~culler/courses/cs252-s05/
– Great Ideas in Computer Architecture (Machine Structures) by Randy Katz and Bernhard Boser
• http://inst.eecs.berkeley.edu/~cs61c/fa16/
• I also refer to the following courses and lecture notes when preparing materials for this course:
– Computer Science 152: Computer Architecture and Engineering, Spring 2016 by Dr. George Michelogiannakis from UC Berkeley
• http://www-inst.eecs.berkeley.edu/~cs152/sp16/
– Computer Science 252: Graduate Computer Architecture, Fall 2015 by Prof. Krste Asanović from UC Berkeley
• http://www-inst.eecs.berkeley.edu/~cs252/fa15/
– Computer Science 250: VLSI Systems Design, Spring 2016 by Prof. John Wawrzynek from UC Berkeley
• http://www-inst.eecs.berkeley.edu/~cs250/sp16/
– Computer System Architecture, Fall 2005 by Dr. Joel Emer and Prof. Arvind from MIT
• http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-823-computer-system-architecture-fall-2005/
– Synthesis Lectures on Computer Architecture
• http://www.morganclaypool.com/toc/cac/1/1
• The uses of the materials (source code, slides, documents and videos) of this course are for educational purposes only and should be used only in conjunction with the textbook. Derivatives of the materials must acknowledge the copyright notices of this and the originals. Permission for commercial purposes should be obtained from the original copyright holder and the successive copyright holders, including myself.

Contents

• Computer components
• Computer architectures and great ideas in computer architectures
• Performance

Generation of Computers

https://solarrenovate.com/the-evolution-of-computers/

New School Computer (#1): Personal Mobile Devices

New School "Computer" (#2)

Classes of Computers

• Personal Mobile Device (PMD)
– e.g. smartphones, tablet computers
– Emphasis on energy efficiency and real-time
• Desktop Computing
– Emphasis on price-performance
• Servers
– Emphasis on availability, scalability, throughput
• Clusters / Warehouse-Scale Computers
– Used for "Software as a Service (SaaS)"
– Emphasis on availability and price-performance
– Sub-class: supercomputers; emphasis on floating-point performance and fast internal networks
• Internet of Things / Embedded Computers
– Emphasis: price

Notes by the Pioneers

• "I think there is a world market for maybe five computers."
– Thomas Watson, chairman of IBM, 1943.
• "There is no reason for any individual to have a computer in their home."
– Ken Olson, president and founder of Digital Equipment Corporation, 1977.
• "640K [of memory] ought to be enough for anybody."
– Bill Gates, chairman of Microsoft, 1981.

Components of a Computer

• Same components for all kinds of computer
– desktop, server, embedded
• Two core parts
– processor and memory
• Input/output includes:
– User-interface devices: display, keyboard, mouse
– Storage devices: hard disk, CD/DVD, flash
– Network adapters: for communicating with other computers

Inside the Processor (CPU)

• Functional units: perform computations
• Datapath: wires for moving data
• Control logic: sequences the datapath, memory, and operations
• Cache memory
– small, fast SRAM memory for immediate access to data

(Example die photo: Apple A5)

A Safe Place for Data

• Volatile main memory
– loses instructions and data when power is off
• Non-volatile secondary memory
– Magnetic disk
– Flash memory
– Optical disk (CD-ROM, DVD)

Contents

• Computer components
• Computer architectures and great ideas in computer architectures
• Performance

What is "Computer Architecture"?

(Layered view, from applications at the top down to materials at the bottom; software sits above the instruction set, hardware below it:)

• Applications
• Compiler / Operating System / Firmware (software)
• Instruction Set Architecture
• Instr. Set Proc. / I/O system (hardware)
• Datapath & Control
• Digital Design / Circuit Design
• Layout & fab / Semiconductor Materials

The Instruction Set: a Critical Interface

The instruction set is the boundary between software and hardware.

• Properties of a good abstraction:
– Lasts through many generations (portability)
– Used in many different ways (generality)
– Provides convenient functionality to higher levels
– Permits an efficient implementation at lower levels

Elements of an ISA

• Set of machine-recognized data types
– bytes, words, integers, floating point, strings, ...
• Operations performed on those data types
– add, sub, mul, div, xor, move, ...
• Programmable storage
– registers, PC, memory
• Methods of identifying and obtaining data referenced by instructions (addressing modes)
– literal, register, absolute, relative, register + offset, ...
• Format (encoding) of the instructions
– opcode, operand fields, ...

Computer Architecture: How Things Are Put Together in Design and Implementation

• Capabilities & performance characteristics of principal functional units
– (e.g., registers, ALU, shifters, logic units, ...)
• Ways in which these components are interconnected
• Information flows between components
• Logic and means by which such information flow is controlled
• Choreography of functional units to realize the ISA

Great Ideas in Computer Architectures

1. Design for Moore’s Law

2. Use abstraction to simplify design

3. Make the common case fast

4. Performance via parallelism

5. Performance via pipelining

6. Performance via prediction

7. Hierarchy of memories

8. Dependability via redundancy


Great Idea: "Moore's Law"

Gordon Moore, founder of Intel:
• 1965: since the integrated circuit was invented, the number of transistors per square inch in these circuits roughly doubled every year; this trend would continue for the foreseeable future
• 1975: revised – circuit complexity doubles every two years

Image credit: Intel

Moore's Law Trends

• More transistors = more opportunities for exploiting parallelism at the instruction level (ILP)
– Pipeline, superscalar, VLIW (Very Long Instruction Word), SIMD (Single Instruction Multiple Data) or vector, speculation, branch prediction
• General path of scaling:
– Wider instruction issue, longer pipeline
– More speculation
– More and larger registers and caches
• Increasing circuit density ~= increasing frequency ~= increasing performance
• Transparent to users
– An easy job of getting better performance: buying faster processors (higher frequency)
• We have enjoyed this free lunch for several decades; however (to be continued)...

Great Idea: Pipeline – Fundamental Execution Cycle

• Instruction Fetch – obtain instruction from program storage
• Instruction Decode – determine required actions and instruction size
• Operand Fetch – locate and obtain operand data
• Execute – compute result value or status
• Result Store – deposit results in storage for later use
• Next Instruction – determine successor instruction

(Figure: processor with registers and functional units, connected to a memory holding program and data; this processor-memory connection is the von Neumann bottleneck.)

Pipelined Instruction Execution

(Figure: four instructions in program order, overlapped in time across clock cycles 1-7; each instruction flows through the Ifetch, Reg, ALU, DMem, Reg stages, advancing one stage per cycle.)

Great Idea: Abstraction (Levels of Representation/Interpretation)

High-Level Language Program (e.g., C):
temp = v[k];
v[k] = v[k+1];
v[k+1] = temp;

↓ Compiler

Assembly Language Program (e.g., MIPS):
lw $t0, 0($2)
lw $t1, 4($2)
sw $t1, 0($2)
sw $t0, 4($2)

↓ Assembler

Machine Language Program (MIPS):
0000 1001 1100 0110 1010 1111 0101 1000
1010 1111 0101 1000 0000 1001 1100 0110
1100 0110 1010 1111 0101 1000 0000 1001
0101 1000 0000 1001 1100 0110 1010 1111

↓ Machine Interpretation

Hardware Architecture Description (e.g., block diagrams)

↓ Architecture Implementation

Logic Circuit Description (Circuit Schematic Diagrams)

Anything can be represented as a number, i.e., data or instructions.

The Memory Abstraction

• Association of <name, value> pairs
– typically named as byte addresses
– often values aligned on multiples of their size
• A sequence of reads and writes
• A write binds a value to an address
– left value (lvalue)
• A read of an address returns the most recently written value bound to that address
– right value (rvalue)

(Interface: address (name), command (R/W), data (W), data (R), done. Example: int a = b;)

Great Idea: Memory Hierarchy – Levels of the Memory Hierarchy

From the upper level (smaller, faster) to the lower level (larger, slower), with capacity, access time, cost, and the staging/transfer unit and who manages it:

• CPU registers – 100s of bytes, << 1 ns; instructions/operands staged by program/compiler, 1-8 bytes
• Cache – 10s-100s of KB, ~1 ns, $1s/MB; blocks staged by cache controller, 8-128 bytes
• Main memory – MB, 100-300 ns, < $1/MB; pages staged by OS, 512 B-4 KB
• Disk – 10s of GB, ~10 ms (10,000,000 ns), $0.001/MB; files staged by user/operator, MB
• Tape – infinite capacity, seconds to minutes, $0.0014/MB

Processor-DRAM Memory Gap (Latency)

(Figure: performance versus time, 1980-2000, log scale from 1 to 1000. The µProc curve ("Moore's Law") improves 60%/yr (2X/1.5 yr); the DRAM curve improves 9%/yr (2X/10 yrs). The processor-memory performance gap grows about 50% per year.)

Jim Gray's Storage Latency Analogy: How Far Away is the Data?

(Figure: storage levels from registers, on-chip cache, and on-board cache through main memory and disk to tape/optical robot, with access times spanning roughly 1-2 ns up to 10^6 and 10^9 ns. Scaled to human travel, the data sits anywhere from my head and this room, to this campus (~10 min), to Charleston (~2 hr), to Pluto (~2 years), to Andromeda (~2,000 years).)

Jim Gray, Turing Award; B.S. Cal 1966, Ph.D. Cal 1969.

The Principle of Locality

• The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
• Two different types of locality:
– Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
– Spatial locality (locality in space): if an item is referenced, close-by items tend to be referenced soon (e.g., straight-line code, array access)
• For the last 30 years, hardware has relied on locality for speed.

(Figure: processor – cache ($) – memory)

Great Idea: Parallelism

• Classes of parallelism in applications:
– Data-Level Parallelism (DLP)
– Task-Level Parallelism (TLP)
• Classes of architectural parallelism:
– Instruction-Level Parallelism (ILP)
– Vector architectures / Graphics Processing Units (GPUs)
– Thread-Level Parallelism
– Heterogeneity

Computer Architecture Topics

• Instruction Set Architecture – addressing, protection, exception handling
• Pipelining and Instruction-Level Parallelism – pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, dynamic compilation
• Memory Hierarchy – L1 cache, L2/L3 cache, DRAM; coherence, bandwidth, latency; interleaving, bus protocols, emerging technologies
• Input/Output and Storage – disks, WORM, tape; RAID
• Network Communication – with other processors
• VLSI

Why is Architecture Exciting Today?

(Figure: single-processor performance over time. CPU speed has gone flat; after the rapid gains of the RISC era, the industry moved to multi-processors.)

Problems of Traditional ILP Scaling

• Fundamental circuit limitations [1]
– delays grow as issue queues and multi-port register files grow
– increasing delays limit performance returns from wider issue
• Limited amount of instruction-level parallelism [1]
– inefficient for codes with difficult-to-predict branches
• Power and heat stall clock frequencies

[1] The case for a single-chip multiprocessor, K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, ASPLOS-VII, 1996.

ILP Impacts

Simulations of 8-issue Superscalar

Power/Heat Density Limits Frequency

• Some fundamental physical limits are being reached

Recent Multicore Processors

Recent Manycore GPU Processors

An Overview of the GK110 Kepler Architecture

Kepler GK110 was built first and foremost for Tesla, and its goal was to be the highest performing parallel computing microprocessor in the world. GK110 not only greatly exceeds the raw compute horsepower delivered by Fermi, but it does so efficiently, consuming significantly less power and generating much less heat output.

A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers. Different products will use different configurations of GK110. For example, some products may deploy 13 or 14 SMXs.

Key features of the architecture that will be discussed below in more depth include:

• The new SMX processor architecture
• An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation
• Hardware support throughout the design to enable new programming model capabilities

(Kepler GK110 full-chip block diagram)

Streaming Multiprocessor (SMX) Architecture

Kepler GK110's new SMX introduces several architectural innovations that make it not only the most powerful multiprocessor we've built, but also the most programmable and power efficient.

SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).

Kepler Memory Subsystem – L1, L2, ECC

Kepler's memory hierarchy is organized similarly to Fermi. The Kepler architecture supports a unified memory request path for loads and stores, with an L1 cache per SMX multiprocessor. Kepler GK110 also enables compiler-directed use of an additional new cache for read-only data, as described below.

64 KB Configurable Shared Memory and L1 Cache

In the Kepler GK110 architecture, as in the previous-generation Fermi architecture, each SMX has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. Kepler now allows for additional flexibility in configuring the allocation of shared memory and L1 cache by permitting a 32 KB / 32 KB split between shared memory and L1 cache. To support the increased throughput of each SMX unit, the shared memory bandwidth for 64b and larger load operations is also doubled compared to the Fermi SM, to 256 B per core clock.

48 KB Read-Only Data Cache

In addition to the L1 cache, Kepler introduces a 48 KB cache for data that is known to be read-only for the duration of the function. In the Fermi generation, this cache was accessible only by the texture unit. Expert programmers often found it advantageous to load data through this path explicitly by mapping their data as textures, but this approach had many limitations.

• ~5K cores

Current Trends in Architecture

• Leveraging instruction-level parallelism (ILP) is near an end
– Single-processor performance improvement ended in 2003
• New models for performance:
– Data-level parallelism (DLP)
– Thread-level parallelism (TLP)
• Exciting topics and challenges:
– Heterogeneity
– Domain-specific architectures
– Software and hardware co-design
– Agile development
• DARPA Picks Its First Set of Winners in Electronics Resurgence Initiative, July 2018
– https://spectrum.ieee.org/tech-talk/semiconductors/design/darpa-picks-its-first-set-of-winners-in-electronics-resurgence-initiative.amp.html

• Video: https://www.acm.org/hennessy-patterson-turing-lecture
• Short summary: https://www.hpcwire.com/2018/04/17/hennessy-patterson-a-new-golden-age-for-computer-architecture/

Exercise: Inspect the ISA for sum

• Sum example
– https://passlab.github.io/CSCE513/exercises/sum
• Check:
– sum_full.s
– sum_riscv.s
– sum_x86.s
• Generate and execute:
– gcc -save-temps sum.c -o sum
– ./sum 102400
• For how to compile and run a Linux program:
– https://passlab.github.io/CSCE513/notes/lecture01_LinuxCProgramming.pdf
• Other system commands:
– cat /proc/cpuinfo to show the CPU and number of cores
– top command to show system usage and memory

Machines for Development and Experiment

• Linux machines in Swearingen 1D43 and 3D22
– All CSCE students by default have access to these machines using their standard login credentials
• Let me know if you, CSCE or not, cannot access them
– Remote access is also available via SSH over port 222. The naming schema is as follows:
• l-1d43-01.cse.sc.edu through l-1d43-26.cse.sc.edu
• l-3d22-01.cse.sc.edu through l-3d22-20.cse.sc.edu
• Accounts are restricted to 2 GB of data in the home folder (~/).
– For more space, create a directory in /scratch on the login machine; however, that data is not shared and will only be available on that specific machine.

PuTTY SSH Connection on Windows

(Screenshot: host l-1d43-08.cse.sc.edu, port 222)

SSH Connection from a Linux / Mac OS X Terminal

Use -X to enable X-Window forwarding so you can use the graphics display on your computer, e.g. ssh -X -p 222 yourusername@l-1d43-08.cse.sc.edu. For Mac OS X, you need to have X server software installed; Xquartz (https://www.xquartz.org/) is the one I use.
