11/3/2016 CS152, Fall 2016
CS 152 Computer Architecture and Engineering
Lecture 18: Multi-Processors - Snoopy Caches
John Wawrzynek
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~johnw
http://inst.cs.berkeley.edu/~cs152

Uniprocessor Performance (SPECint)
[Figure: performance relative to the VAX-11/780 (log scale, 1 to 10,000) versus year (1978 to 2006), with trend lines labeled 25%/year, 52%/year, and ??%/year and a "3X" annotation.]
• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006

Parallel Processing: Déjà vu all over again?
§ “… today’s processors … are nearing an impasse as technologies approach the speed of light..” – David Mitchell, The Transputer: The Time Is Now (1989)
§ Transputer had bad timing (Uniprocessor performance ↑) ⇒ Procrastination rewarded: 2X seq. perf. / 1.5 years
§ “We are dedicating all of our future product development to multicore designs. … This is a sea change in computing” – Paul Otellini, President, Intel (2005)
§ All microprocessor companies switch to MP (2+ CPUs / 2 yrs) ⇒ Procrastination penalized: 2X sequential perf. / 5 yrs
§ Even handheld systems moved to multicore – the Nintendo 3DS and iPhone 6 have two cores each (plus additional specialized cores), the Android Qualcomm Snapdragon 805 has four cores, and the Playstation Portable Vita has four cores.

Symmetric Multiprocessors (SMPs)
Symmetric:
• All memory is equally far away from all processors
• Any processor can do any I/O (set up a DMA transfer)
[Figure: block diagram with two Processors, Memory, a CPU-Memory bus, a bridge, an I/O bus, I/O controllers, Graphics output, and Networks.]
Local caches at the processors make it practical!

Synchronization
The need for synchronization arises whenever there are concurrent processes in a system cooperating on some task (even in a uniprocessor system).
Two classes of synchronization:
Producer-Consumer: A consumer process must wait until the producer process has produced data
Mutual Exclusion: Ensure that only one process uses a resource at a given time
[Figure: a producer feeding a consumer; processes P1 and P2 accessing a Shared Resource.]
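
The two classes of synchronization can be sketched in software with Python's standard `threading` and `queue` modules. This is only an illustration under assumed names (`run_producer_consumer`, `run_mutual_exclusion`, and the `None` sentinel are hypothetical, not from the lecture): a blocking queue gives producer-consumer ordering, and a lock gives mutual exclusion.

```python
import queue
import threading


def run_producer_consumer(items):
    """The consumer blocks on the queue until the producer has produced data."""
    channel = queue.Queue()
    received = []

    def producer():
        for item in items:
            channel.put(item)
        channel.put(None)  # sentinel (an assumption): nothing more will be produced

    def consumer():
        while True:
            item = channel.get()  # waits here until data is available
            if item is None:
                break
            received.append(item)

    threads = [threading.Thread(target=consumer), threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


def run_mutual_exclusion(n_threads, increments):
    """Only one thread at a time is allowed to update the shared counter."""
    lock = threading.Lock()
    shared = {"count": 0}

    def worker():
        for _ in range(increments):
            with lock:  # critical section
                shared["count"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]
```

The consumer thread is started first on purpose: it simply waits until the producer has put something on the queue.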

Need for Mutual Exclusion
§ Example (Wikipedia): shared linked list management
§ Two nodes, i and i+1, being removed simultaneously results in node i+1 not being removed.
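
The lost removal can be reproduced deterministically, without real threads, by splitting each removal into a "read the pointers" step and a "write the pointer" step and interleaving the two removals by hand. The sketch below assumes a singly linked list; all names (`Node`, `plan_removal`, `commit_removal`) are hypothetical.

```python
class Node:
    def __init__(self, value, nxt=None):
        self.value = value
        self.next = nxt


def build(values):
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head


def to_list(head):
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out


def plan_removal(head, value):
    """Read phase: find the predecessor and the node that should follow it."""
    prev = head
    while prev.next.value != value:
        prev = prev.next
    return prev, prev.next.next


def commit_removal(plan):
    """Write phase: unlink the node by redirecting the predecessor's pointer."""
    prev, new_next = plan
    prev.next = new_next


def interleaved_removal():
    head = build(["head", "i", "i+1", "tail"])
    plan_a = plan_removal(head, "i")    # will set head.next = i+1
    plan_b = plan_removal(head, "i+1")  # will set i.next = tail
    commit_removal(plan_a)
    commit_removal(plan_b)              # updates node i, which is already unlinked
    return to_list(head)


def serialized_removal():
    head = build(["head", "i", "i+1", "tail"])
    commit_removal(plan_removal(head, "i"))
    commit_removal(plan_removal(head, "i+1"))
    return to_list(head)
```

With the interleaving, node i+1 survives because the second write lands on a node that is no longer reachable; running the two removals one after the other (as a lock would force) removes both.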

Memory Coherence in SMPs
[Figure: CPU-1 with cache-1 (A = 100) and CPU-2 with cache-2 (A = 100) on a CPU-Memory bus with memory (A = 100).]
Suppose CPU-1 updates A to 200.
  write-back: memory and cache-2 have stale values
  write-through: cache-2 has a stale value
Do these stale values matter?
What is the view of shared memory for programming?
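
The scenario above can be written out as a tiny model with no coherence protocol at all. The dictionaries and the function name `stale_copies` are assumptions made for illustration; the point is only to show which copies of A are left stale under each write policy.

```python
def stale_copies(policy):
    """CPU-1 writes A = 200 with no coherence mechanism; return the stale holders."""
    memory = {"A": 100}
    cache1 = {"A": 100}
    cache2 = {"A": 100}

    cache1["A"] = 200  # CPU-1 updates its cached copy
    if policy == "write-through":
        memory["A"] = cache1["A"]  # the write is also sent to memory
    elif policy != "write-back":
        raise ValueError(f"unknown policy: {policy}")
    # Under either policy, nothing informs cache-2 of the write.

    copies = {"cache-1": cache1["A"], "cache-2": cache2["A"], "memory": memory["A"]}
    return sorted(name for name, value in copies.items() if value != 200)
```

For example, `stale_copies("write-back")` reports both cache-2 and memory, while `stale_copies("write-through")` reports only cache-2.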

Maintaining Cache Coherence
§ A cache coherence protocol ensures that all writes by one processor are eventually visible to other processors, for one memory address
  – i.e., updates are not lost
§ Hardware support is required such that
  – only one processor at a time has write permission for a location
  – no processor can load a stale copy of the location after a write
⇒ cache coherence protocols

Warmup: Parallel I/O
[Figure: Proc. connected to the Cache by Address (A), Data (D), and R/W lines; the Cache and the DMA engine for the DISK both attach to the Memory Bus (A, D, R/W), along with Physical Memory.]
Either Cache or DMA can be the Bus Master and effect transfers.
Page transfers occur while the Processor is running.
(DMA stands for “Direct Memory Access”: the I/O device can read and write memory independently of the CPU.)

Problems with Parallel I/O
Memory → Disk: Physical memory may be stale if the cache copy is dirty
Disk → Memory: Cache may hold stale data and not see memory writes
[Figure: Proc. and Cache, Physical Memory, and DMA with DISK on the Memory Bus; DMA transfers move a page of which portions are cached.]

Snoopy Cache, Goodman 1983
§ Idea: Have the cache watch (or snoop upon) DMA transfers, and then “do the right thing”
§ Snoopy cache tags are dual-ported
[Figure: Proc. and Cache, with the Cache split into Tags and State and Data (lines). Labels: A, D, and R/W lines "used to drive Memory Bus when Cache is Bus Master"; a "snoopy read port attached to Memory Bus" with A and R/W lines.]

Snoopy Cache Actions for DMA
Observed Bus Cycle          Cache State           Cache Action
DMA Read (Memory → Disk)    Address not cached    No action
                            Cached, unmodified    No action
                            Cached, modified      Cache intervenes
DMA Write (Disk → Memory)   Address not cached    No action
                            Cached, unmodified    Cache purges its copy
                            Cached, modified      ???
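
The snoop table can be read as a lookup from (observed bus cycle, cache state) to action. The sketch below assumes the rows pair up as DMA read: no action / no action / cache intervenes, and DMA write: no action / cache purges its copy / ???. The names `SNOOP_ACTIONS` and `snoop_action` are hypothetical, and the "???" entry is deliberately left unanswered.

```python
# (observed bus cycle, state of the address in this cache) -> cache action
SNOOP_ACTIONS = {
    ("DMA read", "not cached"): "no action",
    ("DMA read", "unmodified"): "no action",
    ("DMA read", "modified"): "cache intervenes",
    ("DMA write", "not cached"): "no action",
    ("DMA write", "unmodified"): "cache purges its copy",
    ("DMA write", "modified"): None,  # "???" in the table: left open here
}


def snoop_action(bus_cycle, line_state):
    """Return the action the snooping cache takes for an observed DMA cycle."""
    action = SNOOP_ACTIONS[(bus_cycle, line_state)]
    if action is None:
        raise NotImplementedError(
            "the table leaves a DMA write to a modified line as an open question"
        )
    return action
```

For instance, `snoop_action("DMA read", "modified")` returns `"cache intervenes"`, since the cache holds the only up-to-date copy.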

Shared Memory Multiprocessor
[Figure: processors M1, M2, and M3, each with a Snoopy Cache, on a Memory Bus with Physical Memory and DMA to DISKS.]
Use the snoopy mechanism to keep all processors’ view of memory coherent

Snoopy Cache Coherence Protocols
write miss: the address is invalidated in all other caches before the write is performed
read miss: if a dirty copy is found in some cache, a write-back is performed before the memory is read

The MSI protocol
Each cache line has state bits (stored alongside the address tag):
M: Modified
S: Shared
I: Invalid
§ Modified: The block has been modified in the cache. The data in the cache is then inconsistent with the backing store (e.g. memory). A cache with a block in the "M" state has the responsibility to write the block to the backing store when it is evicted. A block in the Modified state is exclusive (it can’t be in any other cache).
§ Shared: This block is unmodified and exists in read-only state in at least one cache. The cache can evict the data without writing it to the backing store.
§ Invalid: This block is either not present in the current cache or has been invalidated by a bus request, and must be fetched from memory if the block is to be stored in this cache.

The MSI protocol
§ A read miss to a block in a cache, C1, generates a bus transaction
  – if another cache, C2, has the block in M state (“exclusively”), it has to write back the block before memory supplies it. C1 gets the data from the bus and the block becomes “shared” in both caches.
§ A write hit to a shared block in C1 forces a bus transaction announcing the intent to write
  – all other caches that have the block should invalidate that block
  – the block becomes “exclusive” in C1.
§ A write hit to a modified (exclusive) block does not generate a bus transaction or change of state.
§ A write miss (to an invalid block) in C1 generates a bus transaction
  – If a cache, C2, has the block as “shared”, C2 invalidates its copy
  – If a cache, C2, has the block in “modified (exclusive)”, it writes back the block and changes its state in C2 to “invalid”.
  – If no cache supplies the block, the memory will supply it.
  – When C1 gets the block, it sets its state to “modified (exclusive)”
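
The rules above can be exercised with a small software model of one block shared by several caches on a bus. This is a sketch under simplifying assumptions, not the lecture's hardware: a single tracked block, a bus that serializes transactions, and hypothetical names (`Cache`, `Bus`, `snoop`, `demo`).

```python
class Bus:
    def __init__(self, initial_value):
        self.memory = initial_value
        self.caches = []
        self.transactions = 0
        self.write_backs = 0

    def transaction(self, requester, intent_to_write):
        """Broadcast a request; every other cache snoops it before memory replies."""
        self.transactions += 1
        for cache in self.caches:
            if cache is not requester:
                cache.snoop(intent_to_write)
        return self.memory  # memory supplies the block after any write back


class Cache:
    def __init__(self, name, bus):
        self.name = name
        self.bus = bus
        self.state = "I"  # state of the single tracked block
        self.value = None
        bus.caches.append(self)

    def read(self):
        if self.state == "I":  # read miss
            self.value = self.bus.transaction(self, intent_to_write=False)
            self.state = "S"
        return self.value

    def write(self, value):
        if self.state != "M":  # write miss, or write hit to a shared block
            self.bus.transaction(self, intent_to_write=True)
            self.state = "M"
        self.value = value  # write hit in M: no bus traffic, no state change

    def snoop(self, intent_to_write):
        if self.state == "M":
            self.bus.memory = self.value  # write the dirty block back
            self.bus.write_backs += 1
        if intent_to_write:
            self.state = "I"
        elif self.state == "M":
            self.state = "S"


def demo():
    bus = Bus(initial_value=100)
    c1, c2 = Cache("C1", bus), Cache("C2", bus)
    c1.read()
    c2.read()         # both caches now hold the block as shared
    c1.write(200)     # C1 becomes modified, C2 is invalidated
    seen = c2.read()  # C1 writes back; the block is shared again
    return seen, c1.state, c2.state, bus.memory
```

In `demo`, C2's final read observes 200 rather than the stale 100, which is the property the protocol is meant to provide.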

Cache State Transition Diagram
The MSI protocol
[Figure: state diagram over M, S, and I showing the cache state in processor P1. Labeled transitions:
  I → M: Write miss (P1 gets line from memory)
  I → S: Read miss (P1 gets line from memory)
  S → S: Read by any processor
  S → I: Other processor intent to write
  M → M: P1 reads or writes
  M → S: Other processor reads (P1 writes back)
  M → I: Other processor intent to write (P1 writes back)]

Two Processor Example (Reading and writing the same cache line)
[Figure: two copies of the MSI state diagram, one for P1's cache and one for P2's. In each, a read by the other processor forces a write back from M, and the other processor's intent to write invalidates the line.]
Accesses listed on the slide: P1 reads, P1 writes, P2 reads, P2 writes, P1 writes, P2 writes, P1 reads, P1 writes
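
One way to follow the example is to track only the MSI state of the line in each cache. The sketch below assumes both caches start in I and that the accesses happen in the order P1 reads, P1 writes, P2 reads, P2 writes, P1 writes, P2 writes, P1 reads, P1 writes; the function names and the `SEQUENCE` constant are hypothetical.

```python
def access(states, who, op):
    """Return the new {processor: MSI state} map after `who` performs `op`."""
    new = dict(states)
    for proc, state in states.items():
        if proc == who:
            if op == "write":
                new[proc] = "M"  # write miss or intent to write
            elif state == "I":
                new[proc] = "S"  # read miss
            # a read hit in S or M leaves the state unchanged
        elif op == "write":
            new[proc] = "I"      # other processor's intent to write
        elif state == "M":
            new[proc] = "S"      # other processor reads: write back, keep a shared copy
    return new


SEQUENCE = [
    ("P1", "read"), ("P1", "write"), ("P2", "read"), ("P2", "write"),
    ("P1", "write"), ("P2", "write"), ("P1", "read"), ("P1", "write"),
]


def trace(sequence):
    """Return the (P1 state, P2 state) pair after each access."""
    states = {"P1": "I", "P2": "I"}
    history = []
    for who, op in sequence:
        states = access(states, who, op)
        history.append((states["P1"], states["P2"]))
    return history
```

Under these assumptions `trace(SEQUENCE)` never reports M in one cache alongside a valid copy in the other, and the line ping-pongs between the two caches whenever the processors alternate writes.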

Observations
§ If a line is in the M state then no other cache can have a copy of the line!
§ Memory stays coherent; multiple differing copies cannot exist
§ A write to a line in the S state causes a bus transaction (even if no other cache has a copy!)
[Figure: the MSI state transition diagram, repeated.]

MESI: An Enhanced MSI protocol
increased performance for private data
Each cache line has a tag (address tag plus state bits):
M: Modified Exclusive
E: Exclusive but unmodified
S: Shared
I: Invalid
[Figure: state diagram over M, E, S, and I showing the cache state in processor P1. Labeled transitions:
  I → M: Write miss
  I → S: Read miss, shared
  I → E: Read miss, not shared
  E → E: P1 read
  E → M: P1 write
  E → S: Other processor reads
  E → I: Other processor intent to write
  S → S: Read by any processor
  S → M: P1 intent to write
  S → I: Other processor intent to write
  M → M: P1 write or read
  M → S: Other processor reads, P1 writes back
  M → I: Other processor intent to write, P1 writes back]
Write to an Exclusive line doesn’t cause a bus transaction.
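
The benefit for private data can be seen by counting bus transactions when one processor reads and then writes a line that no other cache holds. This is a minimal sketch assuming one bus transaction per miss and one per intent-to-write announcement; the function name is hypothetical.

```python
def private_read_then_write(protocol):
    """Return (final state, bus transactions) for a read followed by a write
    to a line that no other cache holds."""
    bus_transactions = 0

    # Read miss: a bus transaction under either protocol.
    bus_transactions += 1
    if protocol == "MESI":
        state = "E"  # read miss, not shared
    elif protocol == "MSI":
        state = "S"  # MSI has no way to record that the line is private
    else:
        raise ValueError(f"unknown protocol: {protocol}")

    # Write hit.
    if state == "S":
        bus_transactions += 1  # must announce the intent to write on the bus
    state = "M"                # from E, the upgrade to M is silent

    return state, bus_transactions
```

Under these assumptions MSI needs two bus transactions for the read-then-write pattern and MESI needs one.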

Optimized Snoop with Level-2 Caches
[Figure: four CPUs, each with its own L1$ and L2$, and a Snooper attached to each L2$.]
• Processors often have two-level caches
  • small L1, large L2 (usually both on chip now)
• Inclusion property: entries in L1 must be in L2
  invalidation in L2 ⇒ invalidation in L1
• Snooping on L2 does not affect CPU-L1 bandwidth
What problem could occur?
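
The inclusion property can be sketched as a two-level structure in which the snooper consults only L2 and touches L1 only when L2 holds the line. The model below is an illustration under assumptions not in the slides: FIFO replacement, addresses standing in for whole lines, and hypothetical names throughout.

```python
class TwoLevelCache:
    def __init__(self, l1_capacity, l2_capacity):
        self.l1 = []  # oldest first; FIFO replacement is an assumption
        self.l2 = []
        self.l1_capacity = l1_capacity
        self.l2_capacity = l2_capacity

    def load(self, addr):
        if addr not in self.l2:
            if len(self.l2) == self.l2_capacity:
                victim = self.l2.pop(0)
                if victim in self.l1:
                    self.l1.remove(victim)  # an L2 eviction must also evict from L1
            self.l2.append(addr)
        if addr not in self.l1:
            if len(self.l1) == self.l1_capacity:
                self.l1.pop(0)  # an L1 eviction needs no action in L2
            self.l1.append(addr)

    def snoop_invalidate(self, addr):
        """Bus invalidation seen by the L2 snooper; returns True if L1 was disturbed."""
        if addr not in self.l2:
            return False  # inclusion: if L2 lacks the line, so does L1
        self.l2.remove(addr)
        if addr in self.l1:
            self.l1.remove(addr)  # invalidation in L2 implies invalidation in L1
        return True

    def inclusive(self):
        return set(self.l1) <= set(self.l2)
```

Because most snooped addresses miss in L2, `snoop_invalidate` usually returns without touching L1, which is why snooping on L2 leaves CPU-L1 bandwidth alone.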