Lecture 17: Memory Hierarchy and Cache Coherence · 2018. 2. 12.
Lecture 17: Memory Hierarchy and Cache Coherence
Concurrent and Multicore Programming
Department of Computer Science and Engineering
Yonghong Yan
[email protected]/~yan

Parallelism in Hardware
• Instruction-Level Parallelism
  – Pipeline
  – Out-of-order execution, and
  – Superscalar
• Thread-Level Parallelism
  – Chip multithreading, multicore
  – Coarse-grained and fine-grained multithreading
  – SMT
• Data-Level Parallelism
  – SIMD/Vector
  – GPU/SIMT
Computer Architecture: A Quantitative Approach, 5th Edition, Morgan Kaufmann, September 30, 2011, by John L. Hennessy and David A. Patterson

Topics (Part 2)
• Parallel architectures and hardware
  – Parallel computer architectures
  – Memory hierarchy and cache coherency
• Manycore GPU architectures and programming
  – GPU architectures
  – CUDA programming
  – Introduction to offloading model in OpenMP and OpenACC
• Programming on large scale systems (Chapter 6)
  – MPI (point to point and collectives)
  – Introduction to PGAS languages, UPC and Chapel
• Parallel algorithms (Chapter 8, 9 & 10)
  – Dense matrix, and sorting

Outline
• Memory, locality of reference and caching
• Cache coherence in shared memory system

Memory until now…
• We’ve relied on a very simple model of memory for most of this class
  – Main memory is a linear array of bytes that can be accessed given a memory address
  – Also used registers to store values
• Reality is more complex. There is an entire memory system.
  – Different memories exist at different levels of the computer
  – Each varies in speed, size, and cost

Random-Access Memory (RAM)
• Key features
  – RAM is packaged as a chip.
  – Basic storage unit is a cell (one bit per cell).
  – Multiple RAM chips form a memory.
• Static RAM (SRAM)
  – Each cell stores a bit with a six-transistor circuit.
  – Retains value indefinitely, as long as it is kept powered.
  – Relatively insensitive to disturbances such as electrical noise.
  – Faster and more expensive than DRAM.

Random-Access Memory (RAM)
• Dynamic RAM (DRAM)
  – Each cell stores a bit with a capacitor and transistor.
  – Value must be refreshed every 10-100 ms.
  – Sensitive to disturbances.
  – Slower and cheaper than SRAM.

Memory Modules… real life DRAM
• In reality,
  – Several DRAM chips are bundled into memory modules
    • SIMM: Single In-line Memory Module
    • DIMM: Dual In-line Memory Module
    • DDR: Double Data Rate
      – Transfers data twice every clock cycle
    • Quad Pump: Simultaneous R/W
Source for pictures: http://en.kioskea.net/contents/pc/ram.php3

SDR, DDR, Quad Pump

Memory Speeds
• Processor speeds: a 1 GHz processor has a 1 nsec cycle time.
• Memory speeds (50 nsec)
• Access speed gap
  – Instructions that store to or load from memory
DIMM Module   Chip Type   Clock Speed (MHz)   Bus Speed (MHz)   Transfer Rate (MB/s)
PC1600        DDR200      100                 200               1600
PC2100        DDR266      133                 266               2133
PC2400        DDR300      150                 300               2400

Memory Hierarchy (Review)
L0: registers. CPU registers hold words retrieved from L1 cache.
L1: on-chip L1 cache (SRAM). L1 cache holds cache lines retrieved from the L2 cache memory.
L2: off-chip L2 cache (SRAM). L2 cache holds cache lines retrieved from main memory.
L3: main memory (DRAM). Main memory holds disk blocks retrieved from local disks.
L4: local secondary storage (local disks). Local disks hold files retrieved from disks on remote network servers.
L5: remote secondary storage (distributed file systems, Web servers).
Toward L0: smaller, faster, and costlier (per byte) storage devices.
Toward L5: larger, slower, and cheaper (per byte) storage devices.

Cache Memories (SRAM)
• Cache memories are small, fast SRAM-based memories managed automatically in hardware.
  – Hold frequently accessed blocks of main memory
• CPU looks first for data in L1, then in L2, then in main memory.
• Typical bus structure:
[Figure labels: register file, ALU, L1 cache, cache bus, L2 cache, bus interface, system bus, I/O bridge, memory bus, main memory]

How to Exploit Memory Hierarchy
• Availability of memory
  – Cost, size, speed
• Principle of locality
  – Memory references are bunched together
  – A small portion of address space is accessed at any given time
• Keep this space in high-speed memory
  – Problem: not all of it may fit

Types of locality
• Temporal locality
  – Tendency to access locations recently referenced
• Spatial locality
  – Tendency to reference locations around those recently referenced
  – If location x is referenced, the next references will likely be to x-k or x+k
[Figure: repeated references to location X along a time axis t]

Sources of locality
• Temporal locality
  – Code within a loop
  – Same instructions fetched repeatedly
• Spatial locality
  – Data arrays
  – Local variables in stack
  – Data allocated in chunks (contiguous bytes)
for (i = 0; i < N; i++) {
    A[i] = B[i] + C[i] * a;
}

What does locality buy?
• Address the gap between CPU speed and RAM speed
• Spatial and temporal locality imply that a subset of instructions can fit in high-speed memory from time to time
• CPU can access instructions and data from this high-speed memory
• Small high-speed memory can make the computer faster and cheaper
• Speed of 1-20 nsec at cost of $50 to $100 per Mbyte
• This is caching!

Inserting an L1 Cache Between CPU and Main Memory
• The tiny, very fast CPU register file has room for four 4-byte words.
• The transfer unit between the CPU register file and the cache is a 4-byte block.
• The small fast L1 cache has room for two 4-word blocks (line 0 and line 1).
• The transfer unit between the cache and main memory is a 4-word block (16 bytes).
• The big slow main memory has room for many 4-word blocks, for example:
  block 10: a b c d
  block 21: p q r s
  block 30: w x y z

What info. does a cache need?
• Cache: a smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
• You essentially allow a smaller region of memory to hold data from a larger region. Not a 1-1 mapping.
• What kind of information do we need to keep:
  – The actual data
  – Where the data actually comes from
  – Whether the data is even considered valid

Cache Organization
• Map each region of memory to a smaller region of cache
• Discard address bits
  – Discard lower order bits (a)
  – Discard higher order bits (b)
• Cache address size is 4 bits
• Memory address size is 8 bits
• In case of (a)
  – 0000xxxx is mapped to 0000 in cache
• In case of (b)
  – xxxx0001 is mapped to 0001 in cache
[Figure: memory-to-cache mapping for schemes (a) and (b)]

Finding data in cache
• Part of memory address applied to cache
• Remaining is stored as tag in cache
• Lower order bits discarded
• Need to check whether address 00010011 is cached
  – Cache index is 0001
  – Tag is 0011
• If tag matches: hit, use data
• No match: miss, fetch data from memory
[Figure: address and tag]
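
The lookup on this slide can be sketched in C. This is a minimal illustration, assuming the slide's 8-bit address and the split implied by its example (upper four bits used as the cache index, lower four bits kept as the tag). The names `AddrParts` and `split_address` are hypothetical and do not come from the lecture.

```c
#include <stdint.h>

/* Hypothetical helper type: the two fields of an 8-bit address. */
typedef struct {
    uint8_t index; /* selects the cache entry */
    uint8_t tag;   /* stored in the cache to identify the block */
} AddrParts;

/* Split an address as in the slide's example 00010011. */
static AddrParts split_address(uint8_t addr) {
    AddrParts p;
    p.index = (uint8_t)(addr >> 4);   /* upper 4 bits: 0001 */
    p.tag   = (uint8_t)(addr & 0x0F); /* lower 4 bits: 0011 */
    return p;
}
```

For the address 00010011 (0x13), `split_address` yields index 1 and tag 3, matching the slide. The set-associative layout on the following slides instead takes the set index from the middle of the address and the tag from the high-order bits.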

General Org of a Cache Memory
• Cache is an array of sets.
• Each set contains one or more lines.
• Each line holds a block of data.
• Parameters:
  – S = 2^s sets
  – E lines per set
  – B = 2^b bytes per cache block
  – t tag bits per line
  – 1 valid bit per line
• Cache size: C = B x E x S data bytes
[Figure: sets 0 through S-1; each line in a set has a valid bit, a tag, and a block of bytes 0, 1, ..., B-1]

Addressing Caches
• Address A has m bits (bit m-1 down to bit 0), split into <tag> (t bits), <set index> (s bits), and <block offset> (b bits).
• The word at address A is in the cache if the tag bits in one of the <valid> lines in set <set index> match <tag>.
• The word contents begin at offset <block offset> bytes from the beginning of the block.
[Figure: sets 0 through S-1; each line has a valid bit v, a tag, and a block of bytes 0, 1, ..., B-1]

Direct-Mapped Cache
• Simplest kind of cache
• Characterized by exactly one line per set (E = 1).
[Figure: sets 0 through S-1, each holding a single line with a valid bit, a tag, and a cache block]

Accessing Direct-Mapped Caches
• Set selection
  – Use the set index bits to determine the set of interest.
[Figure: the m-bit address is split into tag (t bits), set index (s bits), and block offset (b bits); set index bits 00001 pick the selected set among sets 0 through S-1, each with a valid bit, a tag, and a cache block]

Accessing Direct-Mapped Caches
• Line matching and word selection
  – Line matching: find a valid line in the selected set with a matching tag
  – Word selection: then extract the word
• Steps:
  (1) The valid bit must be set.
  (2) The tag bits in the cache line must match the tag bits in the address.
  (3) If (1) and (2), then cache hit, and block offset selects starting byte.
[Figure: selected set (i) holds a line with valid bit 1, tag 0110, and an 8-byte block (bytes 0-7) with w0 w1 w2 w3 in bytes 4-7; the address has tag 0110, set index i, and block offset 100]
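
The three steps above can be sketched as a small C lookup routine. This is an illustrative model only, assuming a toy geometry of 4 sets and 8-byte blocks (s = 2, b = 3) chosen to resemble the figure; the names `Line` and `dm_lookup` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed toy geometry: b = 3 offset bits, s = 2 set-index bits. */
enum { B_BITS = 3, S_BITS = 2, BLOCK = 1 << B_BITS, SETS = 1 << S_BITS };

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[BLOCK];
} Line;

/* Returns true on a hit and stores the selected byte in *out. */
static bool dm_lookup(const Line cache[SETS], uint32_t addr, uint8_t *out) {
    assert(out != NULL);
    uint32_t offset = addr & (BLOCK - 1);            /* low b bits */
    uint32_t set    = (addr >> B_BITS) & (SETS - 1); /* next s bits */
    uint32_t tag    = addr >> (B_BITS + S_BITS);     /* remaining bits */
    const Line *line = &cache[set];                  /* set selection */
    if (line->valid && line->tag == tag) {           /* steps (1) and (2) */
        *out = line->data[offset];                   /* step (3) */
        return true;
    }
    return false;                                    /* miss */
}
```

With tag 0110, set index 01, and block offset 100, the address is 0xCC; the lookup hits only if set 1 holds a valid line whose tag is 6.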

Example: Direct mapped cache
• 32-bit address, 64 KB cache, 32-byte block
• How many sets, how many bits for the tag, how many bits for the offset?
[Figure: sets 0 through n-1, each with a valid bit, a tag, and a cache block]
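
One way to work the example is to compute the field widths directly. The sketch below assumes power-of-two sizes and one line per set (direct-mapped); `CacheGeometry`, `log2_exact`, and `direct_mapped_geometry` are hypothetical names introduced for illustration.

```c
#include <assert.h>

typedef struct {
    unsigned sets;        /* S */
    unsigned offset_bits; /* b */
    unsigned index_bits;  /* s */
    unsigned tag_bits;    /* t */
} CacheGeometry;

/* log2 of a power of two; other inputs are rejected. */
static unsigned log2_exact(unsigned x) {
    assert(x != 0 && (x & (x - 1)) == 0);
    unsigned n = 0;
    while (x > 1) {
        x >>= 1;
        n++;
    }
    return n;
}

static CacheGeometry direct_mapped_geometry(unsigned addr_bits,
                                            unsigned cache_bytes,
                                            unsigned block_bytes) {
    CacheGeometry g;
    g.sets        = cache_bytes / block_bytes; /* E = 1, so S = C / B */
    g.offset_bits = log2_exact(block_bytes);
    g.index_bits  = log2_exact(g.sets);
    g.tag_bits    = addr_bits - g.index_bits - g.offset_bits;
    return g;
}
```

For a 32-bit address, a 64 KB cache, and 32-byte blocks this gives 2048 sets, 5 offset bits, 11 set-index bits, and 16 tag bits.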

Write-through vs write-back
• What to do when an update occurs?
• Write-through: write immediately
  – Simple to implement, synchronous write
  – Uniform latency on misses
• Write-back: write when block is replaced
  – Requires additional dirty bit or modified bit
  – Asynchronous writes
  – Non-uniform miss latency
  – Clean miss: read from lower level
  – Dirty miss: write to lower level and read (fill)

Writes and Cache
• Reading information from a cache is straightforward.
• What about writing?
  – What if you’re writing data that is already cached (write-hit)?
  – What if the data is not in the cache (write-miss)?
• Dealing with a write-hit.
  – Write-through: immediately write data back to memory
  – Write-back: defer the write to memory for as long as possible
• Dealing with a write-miss.
  – write-allocate: load the block into the cache and update it
  – no-write-allocate: write directly to memory
• Benefits? Disadvantages?
• Write-through caches are typically no-write-allocate.
• Write-back caches are typically write-allocate.
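
The difference between the two pairings can be made concrete with a toy model. The sketch below assumes a cache with a single line and counts write transactions sent to memory; `ToyCache`, `toy_init`, `toy_flush`, and `toy_write` are hypothetical names, and real caches are considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { BLOCK_WORDS = 4, MEM_BLOCKS = 8 };

typedef struct {
    int  mem[MEM_BLOCKS][BLOCK_WORDS]; /* backing "main memory" */
    int  line[BLOCK_WORDS];            /* the single cache line */
    int  cached_block;                 /* -1 while the line is invalid */
    bool dirty;                        /* used only by write-back */
    bool write_back;                   /* true: write-back + write-allocate */
    int  mem_writes;                   /* write transactions sent to memory */
} ToyCache;

static void toy_init(ToyCache *c, bool write_back) {
    memset(c, 0, sizeof *c);
    c->cached_block = -1;
    c->write_back = write_back;
}

/* Write the dirty line (if any) back to memory. */
static void toy_flush(ToyCache *c) {
    if (c->cached_block >= 0 && c->dirty) {
        memcpy(c->mem[c->cached_block], c->line, sizeof c->line);
        c->mem_writes++;
        c->dirty = false;
    }
}

static void toy_write(ToyCache *c, int block, int word, int value) {
    assert(block >= 0 && block < MEM_BLOCKS);
    assert(word >= 0 && word < BLOCK_WORDS);
    if (c->write_back) {
        if (c->cached_block != block) {   /* write miss: allocate */
            toy_flush(c);                 /* dirty miss: evict first */
            memcpy(c->line, c->mem[block], sizeof c->line);
            c->cached_block = block;
        }
        c->line[word] = value;            /* defer the memory update */
        c->dirty = true;
    } else {
        c->mem[block][word] = value;      /* write-through: update now */
        c->mem_writes++;
        if (c->cached_block == block)     /* keep a cached copy consistent */
            c->line[word] = value;
    }
}
```

Ten writes to the same word cost ten memory writes under write-through, but only one (at eviction or flush) under write-back.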

Multi-Level Caches
• Options: separate data and instruction caches, or a unified cache
[Figure: the processor contains the registers, the L1 d-cache, and the L1 i-cache; then come the unified L2 cache, memory, and disk, each level larger, slower, and cheaper]
Level                   Size           Speed   $/Mbyte    Line size
Regs                    200 B          3 ns               8 B
L1 d-cache / i-cache    8-64 KB        3 ns               32 B
Unified L2 cache        1-4 MB SRAM    6 ns    $100/MB    32 B
Memory                  128 MB DRAM    60 ns   $1.50/MB   8 KB
Disk                    30 GB          8 ms    $0.05/MB

Cache Performance Metrics
• Miss rate
  – Fraction of memory references not found in cache (misses/references)
  – Typical numbers:
    • 3-10% for L1
    • can be quite small (e.g., <1%) for L2, depending on size, etc.
• Hit time
  – Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache)
  – Typical numbers:
    • 1 clock cycle for L1
    • 3-8 clock cycles for L2
• Miss penalty
  – Additional time required because of a miss
    • Typically 25-100 cycles for main memory
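
The three metrics combine into an average access time: hit time plus miss rate times miss penalty. The slide does not state this formula; it is the standard way to combine these quantities, sketched here with hypothetical function names and sample numbers picked from the ranges above.

```c
#include <assert.h>

/* Average access time (in cycles) for a single cache level. */
static double avg_access_time(double hit_time, double miss_rate,
                              double miss_penalty) {
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}

/* Two levels: an L1 miss pays the average access time of L2. */
static double avg_access_time_2level(double l1_hit, double l1_miss_rate,
                                     double l2_hit, double l2_miss_rate,
                                     double mem_penalty) {
    double l2_time = avg_access_time(l2_hit, l2_miss_rate, mem_penalty);
    return avg_access_time(l1_hit, l1_miss_rate, l2_time);
}
```

For example, a 1-cycle L1 hit time, a 5% miss rate, and a 50-cycle miss penalty give 1 + 0.05 x 50 = 3.5 cycles per access on average.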

Writing Cache Friendly Code
• Repeated references to variables are good (temporal locality)
• Stride-1 reference patterns are good (spatial locality)
• Examples:
  – cold cache, 4-byte words, 4-word cache blocks
int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
Miss rate = 1/4 = 25%
int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
Miss rate = 100%
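
The two miss rates can be reproduced with a tiny simulation. This sketch assumes a 16 x 16 array of 4-byte ints and a direct-mapped cache of 8 lines with 16-byte (4-word) blocks, so the array is much larger than the cache; all names and sizes here are illustrative choices, not values from the lecture.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { M = 16, N = 16, WORD_BYTES = 4, BLOCK_BYTES = 16, LINES = 8 };

typedef struct {
    bool   valid[LINES];
    size_t block[LINES]; /* block number currently held by each line */
    long   misses;
} MissCounter;

/* Model one access to byte address byte_addr. */
static void touch(MissCounter *c, size_t byte_addr) {
    size_t blk  = byte_addr / BLOCK_BYTES;
    size_t line = blk % LINES;              /* direct-mapped placement */
    if (!c->valid[line] || c->block[line] != blk) {
        c->misses++;                        /* cold or conflict miss */
        c->valid[line] = true;
        c->block[line] = blk;
    }
}

/* Row-wise traversal, as in sumarrayrows. */
static long misses_rows(void) {
    MissCounter c;
    memset(&c, 0, sizeof c);                /* cold cache */
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++)
            touch(&c, (i * N + j) * WORD_BYTES);
    return c.misses;
}

/* Column-wise traversal, as in sumarraycols. */
static long misses_cols(void) {
    MissCounter c;
    memset(&c, 0, sizeof c);
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < M; i++)
            touch(&c, (i * N + j) * WORD_BYTES);
    return c.misses;
}
```

Under these assumptions the row-wise loop misses on 64 of 256 accesses (25%), and the column-wise loop misses on all 256 (100%), because each block is evicted before its neighboring words are used.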

Matrix Multiplication Example
• Major cache effects to consider
  – Total cache size
    • Exploit temporal locality and blocking
  – Block size
    • Exploit spatial locality
• Description:
  – Multiply N x N matrices
  – O(N^3) total operations
  – Accesses
    • N reads per source element
    • N values summed per destination
      – but may be able to hold in register
/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;    /* variable sum held in register */
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Miss Rate Analysis for Matrix Multiply
• Assume:
  – Line size = 32 bytes (big enough for four 64-bit words)
  – Matrix dimension (N) is very large
    • Approximate 1/N as 0.0
  – Cache is not even big enough to hold multiple rows
• Analysis method:
  – Look at access pattern of inner loop

Layout of C Arrays in Memory (review)
• C arrays allocated in row-major order
  – each row in contiguous memory locations
• Stepping through columns in one row:
  – for (i = 0; i < N; i++) sum += a[0][i];
  – accesses successive elements
  – if block size (B) > 4 bytes, exploit spatial locality
    • compulsory miss rate = 4 bytes / B
• Stepping through rows in one column:
  – for (i = 0; i < n; i++) sum += a[i][0];
  – accesses distant elements
  – no spatial locality!
    • compulsory miss rate = 1 (i.e. 100%)

Matrix Multiplication (ijk)

Matrix Multiplication (jik)

Matrix Multiplication (kij)

Matrix Multiplication (ikj)

Matrix Multiplication (jki)

Matrix Multiplication (kji)

Summary of Matrix Multiplication
ijk (& jik):
• 2 loads, 0 stores
• misses/iter = 1.25
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
kij (& ikj):
• 2 loads, 1 store
• misses/iter = 0.5
for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}
jki (& kji):
• 2 loads, 1 store
• misses/iter = 2.0
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}
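
Reordering the loops changes only the memory access pattern, not the product. The sketch below makes that concrete by running the three orderings on a small fixed-size problem and comparing results. It assumes N = 8 and integer-valued inputs so that floating-point sums are exact; the function names are hypothetical. Note that the kij and jki variants accumulate into c, so c must be zeroed first.

```c
enum { N = 8 };

static void zero(double c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = 0.0;
}

static void mm_ijk(double a[N][N], double b[N][N], double c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j]; /* a: row-wise, b: column-wise */
            c[i][j] = sum;
        }
}

static void mm_kij(double a[N][N], double b[N][N], double c[N][N]) {
    zero(c);
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++) {
            double r = a[i][k];
            for (int j = 0; j < N; j++)
                c[i][j] += r * b[k][j];   /* b and c: row-wise */
        }
}

static void mm_jki(double a[N][N], double b[N][N], double c[N][N]) {
    zero(c);
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++) {
            double r = b[k][j];
            for (int i = 0; i < N; i++)
                c[i][j] += a[i][k] * r;   /* a and c: column-wise */
        }
}

/* Returns 1 when all three orderings produce the same matrix. */
static int orders_agree(void) {
    double a[N][N], b[N][N], c1[N][N], c2[N][N], c3[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = i + j;
            b[i][j] = i - j;
        }
    mm_ijk(a, b, c1);
    mm_kij(a, b, c2);
    mm_jki(a, b, c3);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (c1[i][j] != c2[i][j] || c1[i][j] != c3[i][j])
                return 0;
    return 1;
}
```

The comments on the inner loops indicate which arrays are traversed with stride 1, which is what the misses/iter figures above reflect.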

Outline
• Memory, locality of reference and caching
• Cache coherence in shared memory system

Shared memory systems
• All processes have access to the same address space
  – E.g. PC with more than one processor
• Data exchange between processes by writing/reading shared variables
  – Shared memory systems are easy to program
  – Current standard in scientific programming: OpenMP
• Two versions of shared memory systems available today
  – Centralized Shared Memory Architectures
  – Distributed Shared Memory Architectures

Centralized Shared Memory Architecture
• Also referred to as Symmetric Multi-Processors (SMP)
• All processors share the same physical main memory
• Memory bandwidth per processor is the limiting factor for this type of architecture
• Typical size: 2-32 processors
[Figure: four CPUs attached to a single shared memory]

Centralized shared memory system (I)
• Intel X7350 quad-core (Tigerton)
  – Private L1 cache: 32 KB instruction, 32 KB data
  – Shared L2 cache: 4 MB unified cache
[Figure: four cores, each with its own L1; each pair of cores shares an L2; both L2 caches attach to a 1066 MHz FSB]

Centralized shared memory systems (II)
• Intel X7350 quad-core (Tigerton) multi-processor configuration
[Figure: four sockets attached to a Memory Controller Hub (MCH) that serves four memory banks, with four links labeled 8 GB/s; within a socket, each pair of cores shares an L2 cache]
  Socket 0: C0, C1 (shared L2) and C8, C9 (shared L2)
  Socket 1: C2, C3 (shared L2) and C10, C11 (shared L2)
  Socket 2: C4, C5 (shared L2) and C12, C13 (shared L2)
  Socket 3: C6, C7 (shared L2) and C14, C15 (shared L2)

Distributed Shared Memory Architectures
• Also referred to as Non-Uniform Memory Architectures (NUMA)
• Some memory is closer to a certain processor than other memory
  – The whole memory is still addressable from all processors
  – Depending on what data item a processor retrieves, the access time might vary strongly
[Figure: four nodes, each with two CPUs and a local memory]

NUMA architectures (II)
• Reduces the memory bottleneck compared to SMPs
• More difficult to program efficiently
  – E.g. first touch policy: a data item will be located in the memory of the processor which uses the data item first
• To reduce effects of non-uniform memory access, caches are often used
  – ccNUMA: cache-coherent non-uniform memory access architectures
• Largest example as of today: SGI Origin with 512 processors

Distributed Shared Memory Systems

Cache Coherence
• Real-world shared memory systems have caches between memory and CPU
• Copies of a single data item can exist in multiple caches
• Modification of a shared data item by one CPU leads to outdated copies in the cache of another CPU
[Figure: the original data item in memory, with a copy in the cache of CPU 0 and a copy in the cache of CPU 1]

Cache coherence (II)
• Typical solution:
  – Caches keep track of whether a data item is shared between multiple processes
  – Upon modification of a shared data item, ‘notification’ of other caches has to occur
  – Other caches will have to reload the shared data item into their cache on the next access
• Cache coherence is only an issue in case multiple tasks access the same item
  – Multiple threads
  – Multiple processes have a joint shared memory segment
  – Process is being migrated from one CPU to another
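
The notify-then-reload behavior described above can be sketched as a toy write-invalidate model. The slides do not name a specific protocol; this sketch assumes a simplified three-state scheme (invalid, shared, modified) for a single data item and two CPUs, and every name in it is hypothetical.

```c
#include <assert.h>

enum { NCPU = 2 };

typedef enum { INVALID, SHARED, MODIFIED } State;

typedef struct {
    State state;
    int   value;
} CacheLine;

typedef struct {
    int       memory;        /* the original data item */
    CacheLine cache[NCPU];   /* one copy per CPU */
    int       invalidations; /* notifications sent to other caches */
    int       reloads;       /* fills into a cache */
} System;

static void sys_init(System *s, int initial) {
    s->memory = initial;
    s->invalidations = 0;
    s->reloads = 0;
    for (int i = 0; i < NCPU; i++) {
        s->cache[i].state = INVALID;
        s->cache[i].value = 0;
    }
}

static int cpu_read(System *s, int cpu) {
    assert(cpu >= 0 && cpu < NCPU);
    CacheLine *me = &s->cache[cpu];
    if (me->state == INVALID) {
        /* A modified copy elsewhere is written back before the reload. */
        for (int o = 0; o < NCPU; o++)
            if (o != cpu && s->cache[o].state == MODIFIED) {
                s->memory = s->cache[o].value;
                s->cache[o].state = SHARED;
            }
        me->value = s->memory;
        me->state = SHARED;
        s->reloads++;
    }
    return me->value;
}

static void cpu_write(System *s, int cpu, int value) {
    assert(cpu >= 0 && cpu < NCPU);
    for (int o = 0; o < NCPU; o++)
        if (o != cpu && s->cache[o].state != INVALID) {
            if (s->cache[o].state == MODIFIED)
                s->memory = s->cache[o].value; /* write back first */
            s->cache[o].state = INVALID;       /* notify: copy is stale */
            s->invalidations++;
        }
    s->cache[cpu].value = value;
    s->cache[cpu].state = MODIFIED;
}
```

After both CPUs read the item, a write by CPU 0 invalidates the copy held by CPU 1, and the next read by CPU 1 reloads the new value.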

Cache Coherence Protocols
• Snooping Protocols
  – Send all requests for data to all processors
  – Processors snoop a bus to see if they have a copy and respond accordingly
  – Requires broadcast, since caching information is at processors
  – Works well with bus (natural broadcast medium)
  – Dominates for centralized shared memory machines
• Directory-Based Protocols
  – Keep track of what is being shared in centralized location
  – Distributed memory => distributed directory for scalability (avoids bottlenecks)
  – Send point-to-point requests to processors via network
  – Scales better than Snooping
  – Commonly used for distributed shared memory machines

Categories of cache misses
• Up to now:
  – Compulsory Misses: first access to a block cannot be in the cache (cold start misses)
  – Capacity Misses: cache cannot contain all blocks required for the execution
  – Conflict Misses: cache block has to be discarded because of block replacement strategy
• In multi-processor systems:
  – Coherence Misses: cache block has to be discarded because another processor modified the content
    • true sharing miss: another processor modified the content of the requested element
    • false sharing miss: another processor invalidated the block, although the actual item of interest is unchanged.

Bus Snooping Topology

Larger Shared Memory Systems
• Typically Distributed Shared Memory Systems
• Local or remote memory access via memory controller
• Directory per cache that tracks state of every block in every cache
  – Which caches have a copy of block, dirty vs. clean, ...
• Info per memory block vs. per cache block?
  – PLUS: In memory => simpler protocol (centralized/one location)
  – MINUS: In memory => directory is ƒ(memory size) vs. ƒ(cache size)
• Prevent directory as bottleneck? Distribute directory entries with memory, each keeping track of which processors have copies of their blocks
![Page 56: Lecture 17: Memory Hierarchy and Cache Coherence · 2018. 2. 12. · Lecture 17: Memory Hierarchy and Cache Coherence ... Web servers) Local disks hold files retrieved from disks](https://reader035.vdocuments.us/reader035/viewer/2022071416/6111f457fc6ef478fc55bb9b/html5/thumbnails/56.jpg)
Distributed Directory MPs
![Page 57: Lecture 17: Memory Hierarchy and Cache Coherence · 2018. 2. 12. · Lecture 17: Memory Hierarchy and Cache Coherence ... Web servers) Local disks hold files retrieved from disks](https://reader035.vdocuments.us/reader035/viewer/2022071416/6111f457fc6ef478fc55bb9b/html5/thumbnails/57.jpg)
False Sharing in OpenMP
• False sharing
– When at least one thread writes to a cache line while others access it
  • Thread 0: = A[1] (read)
  • Thread 1: A[0] = … (write)
• Solution: use array padding

/* Without padding: neighboring elements of a[] share a cache line */
int a[max_threads];
#pragma omp parallel for schedule(static,1)
for (int i = 0; i < max_threads; i++)
    a[i] += i;

/* With padding: each thread's element a[i][0] sits on its own cache line */
int a[max_threads][cache_line_size];
#pragma omp parallel for schedule(static,1)
for (int i = 0; i < max_threads; i++)
    a[i][0] += i;
Getting OpenMP Up To Speed
RvdP/V1 Tutorial IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010
False Sharing
[Figure: CPUs, caches, and memory, with threads T0 and T1 and data item A]
A store into a shared cache line invalidates the other copies of that line: the system is not able to distinguish between changes within one individual line.
![Page 58: Lecture 17: Memory Hierarchy and Cache Coherence · 2018. 2. 12. · Lecture 17: Memory Hierarchy and Cache Coherence ... Web servers) Local disks hold files retrieved from disks](https://reader035.vdocuments.us/reader035/viewer/2022071416/6111f457fc6ef478fc55bb9b/html5/thumbnails/58.jpg)
NUMA and First Touch Policy
• Data placement policy on NUMA architectures
• First Touch Policy
– The process that first touches a page of memory causes that page to be allocated in the node on which the process is running
A generic cc-NUMA architecture
![Page 59: Lecture 17: Memory Hierarchy and Cache Coherence · 2018. 2. 12. · Lecture 17: Memory Hierarchy and Cache Coherence ... Web servers) Local disks hold files retrieved from disks](https://reader035.vdocuments.us/reader035/viewer/2022071416/6111f457fc6ef478fc55bb9b/html5/thumbnails/59.jpg)
NUMA First-touch placement/1
About “First Touch” placement/1
int a[100];   /* only reserves the VM address */
for (i=0; i<100; i++) a[i] = 0;
First Touch: all array elements (a[0] : a[99]) are in the memory of the processor executing this thread
![Page 60: Lecture 17: Memory Hierarchy and Cache Coherence · 2018. 2. 12. · Lecture 17: Memory Hierarchy and Cache Coherence ... Web servers) Local disks hold files retrieved from disks](https://reader035.vdocuments.us/reader035/viewer/2022071416/6111f457fc6ef478fc55bb9b/html5/thumbnails/60.jpg)
NUMA First-touch placement/2
About “First Touch” placement/2
#pragma omp parallel for num_threads(2)
for (i=0; i<100; i++) a[i] = 0;
First Touch: both memories each have “their half” of the array (a[0] : a[49] and a[50] : a[99])
![Page 61: Lecture 17: Memory Hierarchy and Cache Coherence · 2018. 2. 12. · Lecture 17: Memory Hierarchy and Cache Coherence ... Web servers) Local disks hold files retrieved from disks](https://reader035.vdocuments.us/reader035/viewer/2022071416/6111f457fc6ef478fc55bb9b/html5/thumbnails/61.jpg)
Work with First-Touch in OpenMP
• First-touch in practice
– Initialize data consistently with the computations

#pragma omp parallel for
for (i=0; i<N; i++) {
    a[i] = 0.0; b[i] = 0.0; c[i] = 0.0;
}
readfile(a,b,c);
#pragma omp parallel for
for (i=0; i<N; i++) {
    a[i] = b[i] + c[i];
}
![Page 62: Lecture 17: Memory Hierarchy and Cache Coherence · 2018. 2. 12. · Lecture 17: Memory Hierarchy and Cache Coherence ... Web servers) Local disks hold files retrieved from disks](https://reader035.vdocuments.us/reader035/viewer/2022071416/6111f457fc6ef478fc55bb9b/html5/thumbnails/62.jpg)
Concluding Observations
• Programmer can optimize for cache performance
– How data structures are organized
– How data are accessed
  • Nested loop structure
  • Blocking is a general technique
• All systems favor “cache friendly code”
– Getting absolute optimum performance is very platform specific
  • Cache sizes, line sizes, associativities, etc.
– Can get most of the advantage with generic code
  • Keep working set reasonably small (temporal locality)
  • Use small strides (spatial locality)
– Work with cache coherence protocol and NUMA first touch policy
![Page 63: Lecture 17: Memory Hierarchy and Cache Coherence · 2018. 2. 12. · Lecture 17: Memory Hierarchy and Cache Coherence ... Web servers) Local disks hold files retrieved from disks](https://reader035.vdocuments.us/reader035/viewer/2022071416/6111f457fc6ef478fc55bb9b/html5/thumbnails/63.jpg)
References
• Computer Architecture: A Quantitative Approach, 5th Edition, John L. Hennessy and David A. Patterson, Morgan Kaufmann, September 30, 2011
• A Primer on Memory Consistency and Cache Coherence, Daniel J. Sorin, Mark D. Hill, and David A. Wood, Synthesis Lectures on Computer Architecture (Mark D. Hill, Series Editor), 2011