CPU Caches - Jamie Allen
DESCRIPTION
Caches are used in many layers of applications that we develop today, holding data inside or outside of your runtime environment, or even distributed across multiple platforms in data fabrics. However, considerable performance gains can often be realized by configuring the deployment platform/environment and coding your application to take advantage of the properties of CPU caches. In this talk, we will explore what CPU caches are, how they work and how to measure your JVM-based application data usage to utilize them for maximum efficiency. We will discuss the future of CPU caches in a many-core world, as well as advancements that will soon arrive such as HP's Memristor.
TRANSCRIPT
CPU Caches
Jamie Allen
Director of Consulting
@jamie_allen
http://github.com/jamie-allen
Agenda
• Goal
• Definitions
• Architectures
• Development Tips
• The Future
Goal
Provide you with the information you need about CPU caches so that you can improve the performance of your applications
Why?
• Increased virtualization
  – Runtime (JVM, RVM)
  – Platforms/Environments (cloud)
• Disruptor, 2011
Definitions
SMP
• Symmetric Multiprocessor (SMP) Architecture
• Shared main memory controlled by single OS
• No more Northbridge
NUMA
• Non-Uniform Memory Access
• The organization of processors reflects the time to access data in RAM, called the “NUMA factor”
• Shared memory space (as opposed to multiple commodity machines)
Data Locality
• The most critical factor in performance; Google argues otherwise
• Not guaranteed by a JVM
• Spatial - reused over and over in a loop, data accessed in small regions
• Temporal - high probability it will be reused before long
Memory Controller
• Manages communication of reads/writes between the CPU and RAM
• Integrated Memory Controller on die
Cache Lines
• 32-256 contiguous bytes, most commonly 64
• Beware “false sharing”
• Use padding to ensure unshared lines
• Transferred in 64-bit blocks (8x for 64 byte lines), arriving every ~4 cycles
• Position in the line of the “critical word” matters, but not if pre-fetched
• @Contended annotation coming in Java 8
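
The padding advice above can be sketched in plain Java. This is a minimal illustration, not the speaker's code: the class name `PaddedCounters` is hypothetical, and the 64-byte line size is an assumption taken from the "most commonly 64" bullet. Each counter lives eight `long` slots away from its neighbour, so two counters can never fall on the same 64-byte line, whatever the array's alignment.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Sketch: keep per-thread counters on separate (assumed 64-byte) cache lines. */
final class PaddedCounters {
    // Assumption: 64-byte cache lines, i.e. 8 longs per line.
    static final int LONGS_PER_LINE = 8;

    private final AtomicLongArray slots;

    PaddedCounters(int counters) {
        // One extra line of slots so the first counter sits away from the array header.
        slots = new AtomicLongArray((counters + 1) * LONGS_PER_LINE);
    }

    /** Array index used by a logical counter; neighbours are 64 bytes apart. */
    static int slotIndex(int counter) {
        return (counter + 1) * LONGS_PER_LINE;
    }

    void increment(int counter) {
        slots.incrementAndGet(slotIndex(counter));
    }

    long get(int counter) {
        return slots.get(slotIndex(counter));
    }

    public static void main(String[] args) throws InterruptedException {
        PaddedCounters counters = new PaddedCounters(2);
        Thread[] threads = new Thread[2];
        for (int t = 0; t < threads.length; t++) {
            final int id = t;
            // Each thread writes only its own counter, so no line is shared for writing.
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 1_000_000; i++) {
                    counters.increment(id);
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        System.out.println(counters.get(0) + " " + counters.get(1));
    }
}
```

Setting `LONGS_PER_LINE` to 1 packs the counters next to each other, which is the false-sharing layout; comparing the two under a proper benchmark harness is the honest way to see whether padding helps on a given machine.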
Cache Associativity
• Fully Associative: Put it anywhere
• Somewhere in the middle: n-way set-associative, 2-way skewed-associative
• Direct Mapped: Each entry can only go in one specific place
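
The three placement policies differ only in how many sets the cache is divided into. The sketch below shows the usual address-to-set arithmetic; the class `CacheGeometry` and the 32 KB / 64-byte / 8-way figures in `main` are illustrative assumptions, not values from the slides.

```java
/** Sketch: which set a memory address maps to for a given cache geometry. */
final class CacheGeometry {
    final int lineSize; // bytes per cache line
    final int ways;     // lines per set
    final int sets;     // number of sets

    CacheGeometry(int cacheBytes, int lineSize, int ways) {
        this.lineSize = lineSize;
        this.ways = ways;
        this.sets = cacheBytes / (lineSize * ways);
    }

    /** Index of the set that may hold the line containing this address. */
    int setIndex(long address) {
        return (int) ((address / lineSize) % sets);
    }

    /** Remaining address bits, stored alongside the line to identify it. */
    long tag(long address) {
        return address / lineSize / sets;
    }

    public static void main(String[] args) {
        // Hypothetical 32 KB cache with 64-byte lines in three configurations.
        CacheGeometry direct = new CacheGeometry(32 * 1024, 64, 1);   // 512 sets of 1 line
        CacheGeometry eightWay = new CacheGeometry(32 * 1024, 64, 8); // 64 sets of 8 lines
        CacheGeometry full = new CacheGeometry(32 * 1024, 64, 512);   // 1 set of 512 lines
        long address = 4096;
        System.out.println("direct mapped set:     " + direct.setIndex(address));
        System.out.println("8-way set:             " + eightWay.setIndex(address));
        System.out.println("fully associative set: " + full.setIndex(address));
    }
}
```

In the 8-way case, addresses that are a multiple of 4096 bytes apart all compete for the same set, which is one way conflict misses arise even when the cache is far from full.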
Cache Eviction Strategies
• Least Recently Used (LRU)
• Pseudo-LRU (PLRU): for large associativity caches
• 2-Way Set Associative
• Direct Mapped
• Others
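
As a software analogy for LRU eviction within one cache set, the sketch below uses `java.util.LinkedHashMap` in access order. Hardware does not work this way internally; the class name `LruSet` and the string payloads are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: one cache set with `ways` slots and least-recently-used eviction. */
final class LruSet extends LinkedHashMap<Long, String> {
    private final int ways;

    LruSet(int ways) {
        // accessOrder = true: iteration order runs from least to most recently used.
        super(16, 0.75f, true);
        this.ways = ways;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Long, String> eldest) {
        // Called after every insertion; evict once the set is over capacity.
        return size() > ways;
    }

    public static void main(String[] args) {
        LruSet set = new LruSet(2);
        set.put(0x1000L, "line A");
        set.put(0x2000L, "line B");
        set.get(0x1000L);           // touch A, so B becomes least recently used
        set.put(0x3000L, "line C"); // set is full: B is evicted
        System.out.println(set.keySet());
    }
}
```

Tracking exact recency for every line gets expensive as associativity grows, which is why the slide lists pseudo-LRU for large associativity caches.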
Cache Write Strategies
• Write through: changed cache line immediately goes back to main memory
• Write back: cache line is marked when dirty, eviction sends back to main memory
• Write combining: grouped writes of cache lines back to main memory
• Uncacheable: dynamic values that can change without warning
Exclusive versus Inclusive
• Only relevant below L3
• AMD is exclusive
  – Progressively more costly due to eviction
  – Can hold more data
  – Bulldozer uses “write through” from L1d back to L2
• Intel is inclusive
  – Can be better for inter-processor memory sharing
  – More expensive as lines in L1 are also in L2 & L3
  – If evicted in a higher level cache, must be evicted below as well
Inter-Socket Communication
• GT/s – gigatransfers per second
• Quick Path Interconnect (QPI, Intel) – 8GT/s
• HyperTransport (HTX, AMD) – 6.4GT/s (?)
• Both transfer 16 bits per transmission in practice, but Sandy Bridge is really 32
MESI+F Cache Coherency Protocol
• Specific to data cache lines
• Request for Ownership (RFO), when a processor tries to write to a cache line
• Modified, the local processor has changed the cache line, implies only one who has it
• Exclusive, one processor is using the cache line, not modified
• Shared, multiple processors are using the cache line, not modified
• Invalid, the cache line is invalid, must be re-fetched
• Forward, designate to respond to requests for a cache line
• All processors MUST acknowledge a message for it to be valid
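
The states above can be read as a small state machine kept per cache line. The sketch below is a deliberately simplified model of how one core's copy of a line changes state; real implementations have more messages and corner cases. The names `LineState` and `MesifSketch`, and the four event methods, are hypothetical.

```java
/** The five line states from the slide. */
enum LineState { MODIFIED, EXCLUSIVE, SHARED, INVALID, FORWARD }

/** Simplified sketch of MESI+F transitions for one core's copy of a line. */
final class MesifSketch {
    /** This core reads the line. */
    static LineState onLocalRead(LineState state, boolean othersHoldLine) {
        if (state != LineState.INVALID) {
            return state; // already have a valid copy
        }
        // Simplification: the newest reader becomes the designated forwarder.
        return othersHoldLine ? LineState.FORWARD : LineState.EXCLUSIVE;
    }

    /** A write needs a Request for Ownership unless this core is the only holder. */
    static boolean needsRfo(LineState state) {
        return state != LineState.MODIFIED && state != LineState.EXCLUSIVE;
    }

    /** This core writes the line (after any RFO has been acknowledged). */
    static LineState onLocalWrite(LineState state) {
        return LineState.MODIFIED;
    }

    /** Another core reads the line: a valid copy here degrades to SHARED. */
    static LineState onRemoteRead(LineState state) {
        return state == LineState.INVALID ? LineState.INVALID : LineState.SHARED;
    }

    /** Another core's RFO invalidates the copy held here. */
    static LineState onRemoteRfo(LineState state) {
        return LineState.INVALID;
    }

    public static void main(String[] args) {
        LineState s = LineState.INVALID;
        s = onLocalRead(s, false);
        System.out.println("after first read:   " + s);
        System.out.println("write needs RFO?    " + needsRfo(s));
        s = onLocalWrite(s);
        System.out.println("after local write:  " + s);
        s = onRemoteRead(s);
        System.out.println("after remote read:  " + s);
        s = onRemoteRfo(s);
        System.out.println("after remote RFO:   " + s);
    }
}
```

The point of the model is the cost it exposes: a write to a line in the Shared, Forward or Invalid state has to wait for an RFO round trip, which is what makes false sharing expensive.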
Static RAM (SRAM)
• Requires 6-8 pieces of circuitry per datum
• Cycle rate access, not quite measurable in time
• Uses a relatively large amount of power for what it does
• Data does not fade or leak, does not need to be refreshed/recharged
Dynamic RAM (DRAM)
• Requires 2 pieces of circuitry per datum
• “Leaks” charge, but not sooner than 64ms
• Reads deplete the charge, requiring subsequent recharge
• Takes 240 cycles (~100ns) to access
• Intel's Nehalem architecture - each CPU socket controls a portion of RAM, no other socket has direct access to it
Architectures
Current Processors
• Intel
  – Nehalem (Tock) / Westmere (Tick, 32nm)
  – Sandy Bridge (Tock)
  – Ivy Bridge (Tick, 22nm)
  – Haswell (Tock)
• AMD
  – Bulldozer
• Oracle
  – UltraSPARC isn't dead
“Latency Numbers Everyone Should Know”
L1 cache reference                          0.5 ns
Branch mispredict                             5 ns
L2 cache reference                            7 ns
Mutex lock/unlock                            25 ns
Main memory reference                       100 ns
Compress 1K bytes with Zippy              3,000 ns  =   3 µs
Send 2K bytes over 1 Gbps network        20,000 ns  =  20 µs
SSD random read                         150,000 ns  = 150 µs
Read 1 MB sequentially from memory      250,000 ns  = 250 µs
Round trip within same datacenter       500,000 ns  = 0.5 ms
Read 1 MB sequentially from SSD       1,000,000 ns  =   1 ms
Disk seek                            10,000,000 ns  =  10 ms
Read 1 MB sequentially from disk     20,000,000 ns  =  20 ms
Send packet CA->Netherlands->CA     150,000,000 ns  = 150 ms
• Shamelessly cribbed from this gist: https://gist.github.com/2843375, originally by Peter Norvig and amended by Jeff Dean
Measured Cache Latencies
Sandy Bridge-E        L1d     L2      L3      Main
===================================================
Sequential Access     3 clk   11 clk  14 clk  6 ns
Full Random Access    3 clk   11 clk  38 clk  65.8 ns
SI Software's benchmarks: http://www.sisoftware.net/?d=qa&f=ben_mem_latency
Registers
• On-core for instructions being executed and their operands
• Can be accessed in a single cycle
• There are many different types
• A 64-bit Intel Nehalem CPU had 128 Integer & 128 floating point registers
Store Buffers
• Hold data for Out of Order (OoO) execution
• Fully associative
• Prevent “stalls” in execution on a thread when the cache line is not local to a core on a write
• ~1 cycle
Level Zero (L0)
• Added in Sandy Bridge
• A cache of the last 1536 uops decoded
• Well-suited for hot loops
• Not the same as the older “trace” cache
Level One (L1)
• Divided into data and instructions
• 32K data (L1d), 32K instructions (L1i) per core on a Sandy Bridge, Ivy Bridge and Haswell
• Sandy Bridge loads data at 256 bits per cycle, double that of Nehalem
• 3-4 cycles to access L1d
Level Two (L2)
• 256K per core on a Sandy Bridge, Ivy Bridge & Haswell
• 2MB per “module” on AMD's Bulldozer architecture
• ~11 cycles to access
• Unified data and instruction caches from here up
• If the working set size is larger than L2, misses grow
Level Three (L3)
• Was a “unified” cache up until Sandy Bridge, shared between cores
• Varies in size with different processors and versions of an architecture. Laptops might have 6-8MB, but server-class might have 30MB
• 14-38 cycles to access
Level Four (L4)
• Some versions of Haswell will have a 128 MB L4 cache
• No latency benchmarks for this yet
Programming Tips
Striding & Pre-fetching
• Predictable memory access is really important
• Hardware pre-fetcher on the core looks for patterns of memory access
• Can be counter-productive if the access pattern is not predictable
• Martin Thompson blog post: “Memory Access Patterns are Important”
• Shows the importance of locality and striding
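
To make striding concrete, here is a rough sketch (not taken from the blog post mentioned above) that sums the same array while walking it with different strides. The class name `StrideDemo`, the array size and the chosen strides are assumptions; the timing loop is a crude illustration, not a benchmark, and a harness such as JMH would be needed for trustworthy numbers.

```java
import java.util.Arrays;

/** Sketch: same work, different memory access patterns. */
final class StrideDemo {
    /** Visits every element exactly once, jumping `stride` elements at a time. */
    static long sumWithStride(int[] data, int stride) {
        long sum = 0;
        for (int start = 0; start < stride; start++) {
            for (int i = start; i < data.length; i += stride) {
                sum += data[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // 16M ints = 64 MB, chosen to be larger than the L3 sizes quoted in the slides.
        int[] data = new int[1 << 24];
        Arrays.fill(data, 1);
        // Stride 1 is sequential; stride 16 touches one int per assumed 64-byte line;
        // stride 4096 jumps 16 KB between consecutive accesses.
        for (int stride : new int[] {1, 16, 4096}) {
            long start = System.nanoTime();
            long sum = sumWithStride(data, stride);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println("stride " + stride + ": sum=" + sum + " in " + millis + " ms");
        }
    }
}
```

Every run adds up the same elements, so any difference in elapsed time comes from how the access pattern interacts with cache lines and the pre-fetcher rather than from the amount of arithmetic.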
Cache Misses
• Cost hundreds of cycles
• Keep your code simple
• Instruction read misses are most expensive
• Data read misses are less so, but still hurt performance
• Write misses are okay unless using Write Through
• Miss types:
  – Compulsory
  – Capacity
  – Conflict
Programming Optimizations
• Stack allocated data is cheap
• Pointer interaction - you have to retrieve data being pointed to, even in registers
• Avoid locking and resultant kernel arbitration
• CAS is better and occurs on-thread, but algorithms become more complex
• Match workload to the size of the last level cache (LLC, L3/L4)
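
The CAS bullet can be illustrated with a small lock-free structure. The sketch below tracks a running maximum with `AtomicLong.compareAndSet` instead of a lock; the class name `CasMax` is hypothetical. It also shows the added complexity the slide warns about: the update has to be written as a retry loop.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch: lock-free running maximum using compare-and-set. */
final class CasMax {
    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

    /** Records the value if it is larger than the current maximum. */
    void offer(long value) {
        long current;
        do {
            current = max.get();
            if (value <= current) {
                return; // nothing to update, no write issued
            }
            // Retry if another thread changed the maximum between get() and here.
        } while (!max.compareAndSet(current, value));
    }

    long get() {
        return max.get();
    }

    public static void main(String[] args) throws InterruptedException {
        CasMax tracker = new CasMax();
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            final long base = t * 1_000L;
            threads[t] = new Thread(() -> {
                for (long i = 0; i < 1_000; i++) {
                    tracker.offer(base + i);
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        System.out.println("max = " + tracker.get());
    }
}
```

No thread ever blocks or enters the kernel here, but under heavy contention the retry loop still bounces the cache line holding `max` between cores, so CAS reduces the cost of coordination rather than removing it.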
What about Functional Programming?
• Have to allocate more and more space for your data structures, leads to eviction
• When you cycle back around, you get cache misses
• Choose immutability by default, profile to find poor performance
• Use mutable data in targeted locations
Hyperthreading
• Great for I/O-bound applications
• If you have lots of cache misses
• Doesn't do much for CPU-bound applications
• You have half of the cache resources per core
• NOTE - Haswell only has Hyperthreading on i7
Data Structures
• BAD: Linked list structures and tree structures
• BAD: Java's HashMap uses chained buckets
• BAD: Standard Java collections generate lots of garbage
• GOOD: Array-based and contiguous in memory is much faster
• GOOD: Write your own that are lock-free and contiguous
• GOOD: Fastutil library, but note that it's additive
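
As a sketch of the "array-based and contiguous" advice, the class below stores `int` values in a single primitive array rather than in boxed `Integer` nodes. The name `ContiguousIntList` is hypothetical and the class is intentionally minimal (single-threaded, no removal); it is not a substitute for a library such as Fastutil.

```java
import java.util.Arrays;

/** Sketch: growable list of primitive ints backed by one contiguous array. */
final class ContiguousIntList {
    private int[] items = new int[16];
    private int size;

    void add(int value) {
        if (size == items.length) {
            // Grow by doubling; the elements stay contiguous in the new array.
            items = Arrays.copyOf(items, size * 2);
        }
        items[size++] = value;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return items[index];
    }

    int size() {
        return size;
    }

    /** Sequential scan: neighbouring elements share cache lines and no boxes are allocated. */
    long sum() {
        long total = 0;
        for (int i = 0; i < size; i++) {
            total += items[i];
        }
        return total;
    }

    public static void main(String[] args) {
        ContiguousIntList list = new ContiguousIntList();
        for (int i = 0; i < 1_000_000; i++) {
            list.add(i);
        }
        System.out.println("size=" + list.size() + " sum=" + list.sum());
    }
}
```

A `LinkedList<Integer>` holding the same values would allocate a node and a boxed `Integer` per element, scattered across the heap, so a scan chases pointers instead of streaming through adjacent cache lines.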
Application Memory Wall & GC
• Tremendous amounts of RAM at low cost
• GC will kill you with compaction
• Use pauseless GC
  – IBM's Metronome, very predictable
  – Azul's C4, very performant
Using GPUs
• Remember, locality matters
• Need to be able to export a task with data that does not need to update
• AMD has the new HSA platform which communicates between GPUs and CPUs via shared L3
The Future
ManyCore
• David Ungar says > 24 cores, generally many 10s of cores
• Really gets interesting above 1000 cores
• Cache coherency won't be possible
• Non-deterministic
Memristor
• Non-volatile, static RAM, same write endurance as Flash
• 200-300 MB on chip
• Sub-nanosecond writes
• Able to perform processing? (Probably not)
• Multistate, not binary
Phase Change Memory (PRAM)
• Higher performance than today's DRAM
• Intel seems more fascinated by this, released its “neuromorphic” chip design last Fall
• Not able to perform processing
• Write degradation is supposedly much slower
• Was considered susceptible to unintentional change, maybe fixed?
Thanks
Credits
• What Every Programmer Should Know About Memory, Ulrich Drepper of RedHat, 2007
• Java Performance, Charlie Hunt
• Wikipedia/Wikimedia Commons
• AnandTech
• The Microarchitecture of AMD, Intel and VIA CPUs
• Everything You Need to Know about the Quick Path Interconnect, Gabriel Torres/Hardware Secrets
• Inside the Sandy Bridge Architecture, Gabriel Torres/Hardware Secrets
• Martin Thompson's Mechanical Sympathy blog and Disruptor presentations
• The Application Memory Wall, Gil Tene, CTO of Azul Systems
• AMD Bulldozer/Intel Sandy Bridge Comparison, Gionatan Danti
• SI Software's Memory Latency Benchmarks
• Martin Thompson and Cliff Click provided feedback & additional content
Goal
Provide13 you13 with13 the13 informa3on13 you13 need13 about13 CPU13 caches13 so13 that13 you13 can13 improve13 the13
performance13 of13 your13 applica3ons
Why
bull Increased13 virtualiza3on13 ndash Run3me13 (JVM13 RVM)ndash PlaRormsEnvironments13 (cloud)
bull Disruptor13 2011
Defini7ons
SMP
bull Symmetric13 Mul3processor13 (SMP)13 Architecturebull Shared13 main13 memory13 controlled13 by13 single13 OSbull No13 more13 Northbridge
NUMA
bull Non-shy‐Uniform13 Memory13 Accessbull The13 organiza3on13 of13 processors13 reflect13 the13 3me13 to13 access13 data13 in13 RAM13 called13 the13 ldquoNUMA13 factorrdquo
bull Shared13 memory13 space13 (as13 opposed13 to13 mul3ple13 commodity13 machines)
Data13 Locality
bull The13 most13 cri3cal13 factor13 in13 performance13 13 Google13 argues13 otherwise
bull Not13 guaranteed13 by13 a13 JVMbull Spa7al13 -shy‐13 reused13 over13 and13 over13 in13 a13 loop13 data13 accessed13 in13 small13 regions
bull Temporal13 -shy‐13 high13 probability13 it13 will13 be13 reused13 before13 long
Memory13 Controller
bull Manages13 communica3on13 of13 readswrites13 between13 the13 CPU13 and13 RAM
bull Integrated13 Memory13 Controller13 on13 die
Cache13 Lines
bull 32-shy‐25613 con3guous13 bytes13 most13 commonly13 64bull Beware13 ldquofalse13 sharingrdquobull Use13 padding13 to13 ensure13 unshared13 linesbull Transferred13 in13 64-shy‐bit13 blocks13 (8x13 for13 6413 byte13 lines)13 arriving13 every13 ~413 cycles
bull Posi3on13 in13 the13 line13 of13 the13 ldquocri3cal13 wordrdquo13 ma9ers13 but13 not13 if13 pre-shy‐fetched
bull Contended13 annota3on13 coming13 in13 Java13 8
Cache13 Associa7vity
bull Fully13 Associa7ve13 Put13 it13 anywherebull Somewhere13 in13 the13 middle13 n-shy‐way13 set-shy‐associa3ve13 2-shy‐way13 skewed-shy‐associa3ve
bull Direct13 Mapped13 Each13 entry13 can13 only13 go13 in13 one13 specific13 place13
Cache13 Evic7on13 Strategies
bull Least13 Recently13 Used13 (LRU)bull Pseudo-shy‐LRU13 (PLRU)13 for13 large13 associa3vity13 caches
bull 2-shy‐Way13 Set13 Associa7vebull Direct13 Mappedbull Others
Cache13 Write13 Strategies
bull Write13 through13 changed13 cache13 line13 immediately13 goes13 back13 to13 main13 memory
bull Write13 back13 cache13 line13 is13 marked13 when13 dirty13 evic3on13 sends13 back13 to13 main13 memory
bull Write13 combining13 grouped13 writes13 of13 cache13 lines13 back13 to13 main13 memory
bull Uncacheable13 dynamic13 values13 that13 can13 change13 without13 warning
Exclusive13 versus13 Inclusive
bull Only13 relevant13 below13 L3bull AMD13 is13 exclusivendash Progressively13 more13 costly13 due13 to13 evic3onndash Can13 hold13 more13 datandash Bulldozer13 uses13 write13 through13 from13 L1d13 back13 to13 L2
bull Intel13 is13 inclusivendash Can13 be13 be9er13 for13 inter-shy‐processor13 memory13 sharingndash More13 expensive13 as13 lines13 in13 L113 are13 also13 in13 L213 amp13 L3ndash If13 evicted13 in13 a13 higher13 level13 cache13 must13 be13 evicted13 below13 as13 well
Inter-shy‐Socket13 Communica7on
bull GTs13 ndash13 gigatransfers13 per13 secondbull Quick13 Path13 Interconnect13 (QPI13 Intel)13 ndash13 8GTsbull HyperTransport13 (HTX13 AMD)13 ndash13 64GTs13 ()bull Both13 transfer13 1613 bits13 per13 transmission13 in13 prac3ce13 but13 Sandy13 Bridge13 is13 really13 32
MESI+F13 Cache13 Coherency13 Protocol
bull Specific13 to13 data13 cache13 linesbull Request13 for13 Ownership13 (RFO)13 when13 a13 processor13 tries13 to13 write13 to13
a13 cache13 linebull Modified13 the13 local13 processor13 has13 changed13 the13 cache13 line13 implies13
only13 one13 who13 has13 itbull Exclusive13 one13 processor13 is13 using13 the13 cache13 line13 not13 modifiedbull Shared13 mul3ple13 processors13 are13 using13 the13 cache13 line13 not13
modifiedbull Invalid13 the13 cache13 line13 is13 invalid13 must13 be13 re-shy‐fetchedbull Forward13 designate13 to13 respond13 to13 requests13 for13 a13 cache13 linebull All13 processors13 MUST13 acknowledge13 a13 message13 for13 it13 to13 be13 valid
Sta7c13 RAM13 (SRAM)
bull Requires13 6-shy‐813 pieces13 of13 circuitry13 per13 datumbull Cycle13 rate13 access13 not13 quite13 measurable13 in13 3mebull Uses13 a13 rela3vely13 large13 amount13 of13 power13 for13 what13 it13 does
bull Data13 does13 not13 fade13 or13 leak13 does13 not13 need13 to13 be13 refreshedrecharged
Dynamic13 RAM13 (DRAM)
bull Requires13 213 pieces13 of13 circuitry13 per13 datumbull ldquoLeaksrdquo13 charge13 but13 not13 sooner13 than13 64msbull Reads13 deplete13 the13 charge13 requiring13 subsequent13 recharge
bull Takes13 24013 cycles13 (~100ns)13 to13 accessbull Intels13 Nehalem13 architecture13 -shy‐13 each13 CPU13 socket13 controls13 a13 por3on13 of13 RAM13 no13 other13 socket13 has13 direct13 access13 to13 it
Architectures
Current13 Processors
bull Intelndash Nehalem13 (Tock)13 Westmere13 (Tick13 32nm)ndash Sandy13 Bridge13 (Tock)ndash Ivy13 Bridge13 (Tick13 22nm)ndash Haswell13 (Tock)
bull AMDndash Bulldozer
bull Oraclendash UltraSPARC13 isnt13 dead
ldquoLatency13 Numbers13 Everyone13 Should13 KnowrdquoL1 cache reference 05 ns
Branch mispredict 5 ns
L2 cache reference 7 ns
Mutex lockunlock 25 ns
Main memory reference 100 ns
Compress 1K bytes with Zippy 3000 ns = 3 micros
Send 2K bytes over 1 Gbps network 20000 ns = 20 micros
SSD random read 150000 ns = 150 micros
Read 1 MB sequentially from memory 250000 ns = 250 micros
Round trip within same datacenter 500000 ns = 05 ms
Read 1 MB sequentially from SSD 1000000 ns = 1 ms
Disk seek 10000000 ns = 10 ms
Read 1 MB sequentially from disk 20000000 ns = 20 ms
Send packet CA-gtNetherlands-gtCA 150000000 ns = 150 ms
bull Shamelessly13 cribbed13 from13 this13 gist13 h9psgistgithubcom284337513 originally13 by13 Peter13 Norvig13 and13 amended13 by13 Jeff13 Dean
Measured13 Cache13 Latencies
Sandy Bridge-E L1d L2 L3 Main=======================================================================Sequential Access 3 clk 11 clk 14 clk 6nsFull Random Access 3 clk 11 clk 38 clk 658ns
SI13 Sotwares13 benchmarks13 h9pwwwsisotwarenetd=qaampf=ben_mem_latency
Registers
• On-core for instructions being executed and their operands
• Can be accessed in a single cycle
• There are many different types
• A 64-bit Intel Nehalem CPU had 128 integer & 128 floating point registers
Store Buffers
• Hold data for Out of Order (OoO) execution
• Fully associative
• Prevent "stalls" in execution on a thread when the cache line is not local to a core on a write
• ~1 cycle
Level Zero (L0)
• Added in Sandy Bridge
• A cache of the last 1536 uops decoded
• Well-suited for hot loops
• Not the same as the older "trace" cache
Level One (L1)
• Divided into data and instructions
• 32K data (L1d), 32K instructions (L1i) per core on Sandy Bridge, Ivy Bridge and Haswell
• Sandy Bridge loads data at 256 bits per cycle, double that of Nehalem
• 3-4 cycles to access L1d
Level Two (L2)
• 256K per core on Sandy Bridge, Ivy Bridge & Haswell
• 2MB per "module" on AMD's Bulldozer architecture
• ~11 cycles to access
• Unified data and instruction caches from here up
• If the working set size is larger than L2, misses grow
Level Three (L3)
• Was a "unified" cache up until Sandy Bridge, shared between cores
• Varies in size with different processors and versions of an architecture. Laptops might have 6-8MB, but server-class might have 30MB
• 14-38 cycles to access
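
To make "working set size" concrete, here is a minimal sketch that checks which cache level a block of data could fit into. It uses the per-core L1d and L2 sizes quoted on the slides. The L3 size is passed in because it varies by processor. `WorkingSet` and its method names are hypothetical, and the byte count ignores JVM object headers, so treat the result as an estimate.

```java
final class WorkingSet {
    // Per-core sizes quoted on the L1 and L2 slides (Sandy Bridge, Ivy Bridge, Haswell).
    static final long L1D_BYTES = 32L * 1024;
    static final long L2_BYTES = 256L * 1024;

    private WorkingSet() {}

    /** Returns the name of the smallest level whose capacity covers the working set. */
    static String smallestLevelThatFits(long workingSetBytes, long l3Bytes) {
        if (workingSetBytes <= L1D_BYTES) return "L1d";
        if (workingSetBytes <= L2_BYTES) return "L2";
        if (workingSetBytes <= l3Bytes) return "L3";
        return "RAM";
    }

    /** Payload bytes of a long[]; ignores the JVM array header (an approximation). */
    static long longArrayBytes(int length) {
        return (long) length * Long.BYTES;
    }
}
```

A `long[1_000_000]` is about 8 MB of payload, so it would spill out of a 6 MB L3 but could just fit in an 8 MB one. In practice L3 is shared between cores and with other data, so fitting on paper is an upper bound rather than a guarantee.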
Level Four (L4)
• Some versions of Haswell will have a 128 MB L4 cache
• No latency benchmarks for this yet
Programming Tips
Striding & Pre-fetching
• Predictable memory access is really important
• Hardware pre-fetcher on the core looks for patterns of memory access
• Can be counter-productive if the access pattern is not predictable
• Martin Thompson blog post: "Memory Access Patterns are Important"
• Shows the importance of locality and striding
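
A small sketch of what striding means in code: the two methods below add up the same flat array, one in storage order and one jumping `cols` elements on every step. The class name, the array size and the timing loop are illustrative assumptions, not taken from the blog post cited above, and a `System.nanoTime` loop like this is only a crude indicator compared with a proper benchmark harness.

```java
import java.util.Arrays;

final class StrideDemo {
    private StrideDemo() {}

    /** Walks the flat array in storage order: consecutive addresses, easy to pre-fetch. */
    static long sumRowMajor(int[] data, int rows, int cols) {
        long sum = 0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                sum += data[r * cols + c];
            }
        }
        return sum;
    }

    /** Same elements, but every step jumps `cols` ints ahead: a large stride. */
    static long sumColumnMajor(int[] data, int rows, int cols) {
        long sum = 0;
        for (int c = 0; c < cols; c++) {
            for (int r = 0; r < rows; r++) {
                sum += data[r * cols + c];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int rows = 2048, cols = 2048;          // 16 MB of ints, larger than a typical laptop L3
        int[] data = new int[rows * cols];
        Arrays.fill(data, 1);
        for (int run = 0; run < 5; run++) {    // repeat so later runs are JIT-compiled
            long t0 = System.nanoTime();
            long a = sumRowMajor(data, rows, cols);
            long t1 = System.nanoTime();
            long b = sumColumnMajor(data, rows, cols);
            long t2 = System.nanoTime();
            System.out.printf("row-major %d us, column-major %d us (sums %d, %d)%n",
                    (t1 - t0) / 1000, (t2 - t1) / 1000, a, b);
        }
    }
}
```

Both methods return the same sum; only the order of memory accesses differs, which is exactly the variable the slide is about.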
Cache Misses
• Cost hundreds of cycles
• Keep your code simple
• Instruction read misses are most expensive
• Data read misses are less so, but still hurt performance
• Write misses are okay unless using Write Through
• Miss types
  - Compulsory
  - Capacity
  - Conflict
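
The miss types can be illustrated with a toy model. This is a minimal sketch of a direct-mapped cache simulator, not a model of any real CPU: it counts a miss as compulsory the first time a line is ever touched and lumps every later miss into "other" (capacity or conflict). `DirectMappedCache` and its fields are hypothetical names, and addresses are assumed to be non-negative.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class DirectMappedCache {
    private final int lineBytes;
    private final long[] tags;                        // line number held by each slot, -1 = empty
    private final Set<Long> seenLines = new HashSet<>();
    int hits;
    int compulsoryMisses;
    int otherMisses;                                  // capacity or conflict misses

    DirectMappedCache(int numLines, int lineBytes) {
        this.lineBytes = lineBytes;
        this.tags = new long[numLines];
        Arrays.fill(tags, -1L);
    }

    /** Simulates one access to a non-negative byte address. */
    void access(long address) {
        long line = address / lineBytes;
        int slot = (int) (line % tags.length);        // direct mapped: exactly one possible slot
        if (tags[slot] == line) {
            hits++;
            return;
        }
        if (seenLines.add(line)) {
            compulsoryMisses++;                       // first touch of this line ever
        } else {
            otherMisses++;                            // line was cached before and got evicted
        }
        tags[slot] = line;
    }
}
```

With 4 lines of 64 bytes, alternating between addresses 0 and 256 misses every time, because both map to slot 0 while three other slots sit empty. Those repeat misses are conflict misses: a cache of the same size with higher associativity would have kept both lines.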
Programming Optimizations
• Stack allocated data is cheap
• Pointer interaction - you have to retrieve data being pointed to, even in registers
• Avoid locking and resultant kernel arbitration
• CAS is better and occurs on-thread, but algorithms become more complex
• Match workload to the size of the last level cache (LLC, L3/L4)
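
As a small illustration of the CAS bullet, the sketch below tracks a running maximum with a compare-and-set retry loop instead of a lock. `CasMax` is a hypothetical class name; `AtomicLong.get` and `AtomicLong.compareAndSet` are standard `java.util.concurrent.atomic` calls. The retry loop is the added complexity the slide refers to.

```java
import java.util.concurrent.atomic.AtomicLong;

final class CasMax {
    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

    /** Records the candidate if it exceeds the current maximum; never blocks. */
    void offer(long candidate) {
        long current;
        do {
            current = max.get();
            if (candidate <= current) {
                return;                       // nothing to update, no write issued
            }
            // compareAndSet fails if another thread changed the value in between;
            // in that case re-read and try again.
        } while (!max.compareAndSet(current, candidate));
    }

    long get() {
        return max.get();
    }
}
```

A thread that loses the race simply retries on its own core rather than parking in the kernel, which is the trade the slide describes: no lock arbitration, at the price of reasoning about retries.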
What about Functional Programming?
• Have to allocate more and more space for your data structures, leads to eviction
• When you cycle back around, you get cache misses
• Choose immutability by default, profile to find poor performance
• Use mutable data in targeted locations
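
One way to read "mutable data in targeted locations" is to keep mutation inside a single method and expose only an immutable result. The sketch below is an illustration under that reading, not the speaker's code: `Histogram` is a hypothetical class, and it assumes every input value lies in the range 0 to `buckets - 1`.

```java
final class Histogram {
    private final int[] counts;   // never exposed and never written after construction

    private Histogram(int[] counts) {
        this.counts = counts;
    }

    /** Builds the histogram with one local, mutable primitive array in the hot loop. */
    static Histogram of(int[] values, int buckets) {
        int[] local = new int[buckets];
        for (int v : values) {
            local[v]++;           // assumes 0 <= v < buckets
        }
        return new Histogram(local);
    }

    int count(int bucket) {
        return counts[bucket];
    }

    int buckets() {
        return counts.length;
    }
}
```

Callers see a value that cannot change, while the counting loop updates one contiguous array in place instead of allocating a new structure per element.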
Hyperthreading
• Great for I/O-bound applications
• If you have lots of cache misses
• Doesn't do much for CPU-bound applications
• You have half of the cache resources per core
• NOTE - Haswell only has Hyperthreading on i7
Data Structures
• BAD: Linked list structures and tree structures
• BAD: Java's HashMap uses chained buckets
• BAD: Standard Java collections generate lots of garbage
• GOOD: Array-based and contiguous in memory is much faster
• GOOD: Write your own that are lock-free and contiguous
• GOOD: Fastutil library, but note that it's additive
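
To show what "array-based and contiguous" can look like next to a chained `HashMap`, here is a minimal sketch of an open-addressing int-to-int map backed by primitive arrays. `IntIntMap` is a hypothetical, deliberately limited class (fixed capacity, no removal, not thread-safe) and is not how Fastutil is implemented; it only illustrates the layout idea.

```java
final class IntIntMap {
    private final int[] keys;
    private final int[] values;
    private final boolean[] used;
    private final int mask;
    private int size;

    IntIntMap(int capacity) {
        if (capacity < 2 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two >= 2");
        }
        keys = new int[capacity];
        values = new int[capacity];
        used = new boolean[capacity];
        mask = capacity - 1;
    }

    /** Inserts or overwrites; collisions probe the next slot in the same array. */
    void put(int key, int value) {
        int i = slotFor(key);
        while (used[i] && keys[i] != key) {
            i = (i + 1) & mask;               // linear probing: neighbouring memory
        }
        if (!used[i]) {
            // Always keep one slot empty so probe loops are guaranteed to terminate.
            if (size == mask) throw new IllegalStateException("map is full");
            used[i] = true;
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    /** Returns the stored value, or `missing` if the key is absent. */
    int get(int key, int missing) {
        int i = slotFor(key);
        while (used[i]) {
            if (keys[i] == key) return values[i];
            i = (i + 1) & mask;
        }
        return missing;
    }

    private int slotFor(int key) {
        int h = key * 0x9E3779B9;             // multiplicative mixing of the key bits
        return (h ^ (h >>> 16)) & mask;
    }
}
```

There are no node objects and no boxed `Integer` keys, so a lookup touches a few adjacent array slots and creates no garbage, which addresses the three BAD bullets above at the cost of writing and maintaining the structure yourself.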
Application Memory Wall & GC
• Tremendous amounts of RAM at low cost
• GC will kill you with compaction
• Use pauseless GC
  - IBM's Metronome, very predictable
  - Azul's C4, very performant
Using GPUs
• Remember, locality matters
• Need to be able to export a task with data that does not need to update
• AMD has the new HSA platform which communicates between GPUs and CPUs via shared L3
The Future
ManyCore
• David Ungar says > 24 cores, generally many 10s of cores
• Really gets interesting above 1000 cores
• Cache coherency won't be possible
• Non-deterministic
Memristor
• Non-volatile, static RAM, same write endurance as Flash
• 200-300 MB on chip
• Sub-nanosecond writes
• Able to perform processing? (Probably not)
• Multistate, not binary
Phase Change Memory (PRAM)
• Higher performance than today's DRAM
• Intel seems more fascinated by this, released its "neuromorphic" chip design last Fall
• Not able to perform processing
• Write degradation is supposedly much slower
• Was considered susceptible to unintentional change, maybe fixed
Thanks
Credits
• What Every Programmer Should Know About Memory, Ulrich Drepper of RedHat, 2007
• Java Performance, Charlie Hunt
• Wikipedia/Wikimedia Commons
• AnandTech
• The Microarchitecture of AMD, Intel and VIA CPUs
• Everything You Need to Know about the Quick Path Interconnect, Gabriel Torres/Hardware Secrets
• Inside the Sandy Bridge Architecture, Gabriel Torres/Hardware Secrets
• Martin Thompson's Mechanical Sympathy blog and Disruptor presentations
• The Application Memory Wall, Gil Tene, CTO of Azul Systems
• AMD Bulldozer/Intel Sandy Bridge Comparison, Gionatan Danti
• SI Software's Memory Latency Benchmarks
• Martin Thompson and Cliff Click provided feedback & additional content
SMP
bull Symmetric13 Mul3processor13 (SMP)13 Architecturebull Shared13 main13 memory13 controlled13 by13 single13 OSbull No13 more13 Northbridge
NUMA
bull Non-shy‐Uniform13 Memory13 Accessbull The13 organiza3on13 of13 processors13 reflect13 the13 3me13 to13 access13 data13 in13 RAM13 called13 the13 ldquoNUMA13 factorrdquo
bull Shared13 memory13 space13 (as13 opposed13 to13 mul3ple13 commodity13 machines)
Data13 Locality
bull The13 most13 cri3cal13 factor13 in13 performance13 13 Google13 argues13 otherwise
bull Not13 guaranteed13 by13 a13 JVMbull Spa7al13 -shy‐13 reused13 over13 and13 over13 in13 a13 loop13 data13 accessed13 in13 small13 regions
bull Temporal13 -shy‐13 high13 probability13 it13 will13 be13 reused13 before13 long
Memory Controller
• Manages communication of reads/writes between the CPU and RAM
• Integrated Memory Controller on die
Cache Lines
• 32-256 contiguous bytes, most commonly 64
• Beware "false sharing"
• Use padding to ensure unshared lines
• Transferred in 64-bit blocks (8x for 64 byte lines), arriving every ~4 cycles
• Position in the line of the "critical word" matters, but not if pre-fetched
• @Contended annotation coming in Java 8
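One way to picture "use padding to ensure unshared lines" is to give every writer thread its own slot in an array and space the slots a full cache line apart. The sketch below assumes 64-byte lines (8 longs); the class name, method name and stride constant are illustrative, not taken from the talk.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical demo: one counter per thread, spaced one cache line apart.
final class PaddedCounters {
    // Assumption: 64-byte cache lines, so 8 longs separate two hot slots.
    static final int LONGS_PER_LINE = 8;

    static long[] countInParallel(int threads, long incrementsPerThread) {
        AtomicLongArray slots = new AtomicLongArray(threads * LONGS_PER_LINE);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int slot = t * LONGS_PER_LINE; // slots are 64 bytes apart
            workers[t] = new Thread(() -> {
                for (long i = 0; i < incrementsPerThread; i++) {
                    slots.incrementAndGet(slot);
                }
            });
            workers[t].start();
        }
        long[] totals = new long[threads];
        for (int t = 0; t < threads; t++) {
            try {
                workers[t].join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            totals[t] = slots.get(t * LONGS_PER_LINE);
        }
        return totals;
    }
}
```

Setting LONGS_PER_LINE to 1 packs the counters next to each other, which is the false-sharing layout to compare against. The results are identical either way; only the amount of cache-line traffic between cores should differ.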
Cache Associativity
• Fully Associative: Put it anywhere
• Somewhere in the middle: n-way set-associative, 2-way skewed-associative
• Direct Mapped: Each entry can only go in one specific place
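The three schemes differ only in how many places a line may occupy. A small arithmetic sketch, assuming the common textbook mapping (line number modulo number of sets) and an illustrative 32 KB cache with 64-byte lines; the 8-way figure is an assumption for the example, and real hardware may hash addresses differently.

```java
// Hypothetical helper: which set does an address map to?
final class SetIndexDemo {
    // Sets = total size / (ways * line size). ways == 1 is direct mapped;
    // ways == total number of lines gives a single set (fully associative).
    static int numberOfSets(int cacheBytes, int ways, int lineBytes) {
        return cacheBytes / (ways * lineBytes);
    }

    // Textbook mapping: drop the offset within the line, then wrap around the sets.
    static long setIndex(long address, int lineBytes, int sets) {
        return (address / lineBytes) % sets;
    }

    public static void main(String[] args) {
        int sets = numberOfSets(32 * 1024, 8, 64);
        System.out.println("sets: " + sets);
        System.out.println("address 4096 maps to set " + setIndex(4096, 64, sets));
    }
}
```

With these assumed parameters, addresses 4096 bytes apart land in the same set, which is why some power-of-two strides can cause conflict misses even when the cache is far from full.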
Cache Eviction Strategies
• Least Recently Used (LRU)
• Pseudo-LRU (PLRU): for large associativity caches
• 2-Way Set Associative
• Direct Mapped
• Others
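LRU is easiest to see in software. The sketch below is an application-level analogy built on java.util.LinkedHashMap in access order, not a model of how a CPU implements LRU or PLRU in hardware; the class name is made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical software LRU cache: the least recently accessed entry is evicted first.
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // true = iterate in access order, oldest access first
        this.capacity = capacity;
    }

    // Called by LinkedHashMap after each insertion; returning true drops the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```

With a capacity of 2, inserting "a" and "b", reading "a", then inserting "c" evicts "b", because "a" was touched more recently.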
Cache Write Strategies
• Write through: changed cache line immediately goes back to main memory
• Write back: cache line is marked when dirty, eviction sends back to main memory
• Write combining: grouped writes of cache lines back to main memory
• Uncacheable: dynamic values that can change without warning
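The difference between write through and write back comes down to when main memory sees a store. The toy model below is a deliberately tiny sketch, a cache holding a single line, with made-up names; it only counts writes to main memory and ignores write combining and everything else a real cache does.

```java
// Hypothetical one-line cache that counts how often main memory is written.
final class OneLineCache {
    private final boolean writeThrough;
    private long cachedLine = -1; // -1 means the cache is empty
    private boolean dirty;
    private int memoryWrites;

    OneLineCache(boolean writeThrough) {
        this.writeThrough = writeThrough;
    }

    // A store to the given line number; a different line forces an eviction first.
    void store(long line) {
        if (line != cachedLine) {
            evict();
            cachedLine = line;
        }
        if (writeThrough) {
            memoryWrites++;   // every store goes straight to main memory
        } else {
            dirty = true;     // write back: just mark the line as dirty
        }
    }

    // Write back sends a dirty line to main memory only when it is evicted.
    void evict() {
        if (dirty) {
            memoryWrites++;
            dirty = false;
        }
        cachedLine = -1;
    }

    int memoryWrites() {
        return memoryWrites;
    }
}
```

Three stores to one line followed by one store to another cost four memory writes under write through, but only two under write back once both lines have been evicted.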
Exclusive versus Inclusive
• Only relevant below L3
• AMD is exclusive
  – Progressively more costly due to eviction
  – Can hold more data
  – Bulldozer uses "write through" from L1d back to L2
• Intel is inclusive
  – Can be better for inter-processor memory sharing
  – More expensive as lines in L1 are also in L2 & L3
  – If evicted in a higher level cache, must be evicted below as well
Inter-Socket Communication
• GT/s – gigatransfers per second
• Quick Path Interconnect (QPI, Intel) – 8 GT/s
• HyperTransport (HTX, AMD) – 6.4 GT/s (?)
• Both transfer 16 bits per transmission in practice, but Sandy Bridge is really 32
MESI+F Cache Coherency Protocol
• Specific to data cache lines
• Request for Ownership (RFO): when a processor tries to write to a cache line
• Modified: the local processor has changed the cache line, implies only one who has it
• Exclusive: one processor is using the cache line, not modified
• Shared: multiple processors are using the cache line, not modified
• Invalid: the cache line is invalid, must be re-fetched
• Forward: designate to respond to requests for a cache line
• All processors MUST acknowledge a message for it to be valid
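The four basic states can be read as a small state machine for one cache line in one core's cache. The sketch below follows the textbook MESI transitions only: it leaves out the Forward state, the acknowledgement messages and write-back traffic, and all names in it are illustrative.

```java
// Hypothetical, simplified MESI state machine for a single cache line in one core.
final class Mesi {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    enum Event { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE }

    // otherCopiesExist only matters when an INVALID line is read locally.
    static State next(State current, Event event, boolean otherCopiesExist) {
        switch (event) {
            case LOCAL_WRITE:
                // A write needs ownership (RFO if the line is not already owned), then the line is dirty.
                return State.MODIFIED;
            case REMOTE_WRITE:
                // Another core's RFO invalidates our copy.
                return State.INVALID;
            case REMOTE_READ:
                // Any valid copy drops to SHARED once someone else reads the line.
                return current == State.INVALID ? State.INVALID : State.SHARED;
            case LOCAL_READ:
                if (current != State.INVALID) {
                    return current; // read hit, no state change
                }
                return otherCopiesExist ? State.SHARED : State.EXCLUSIVE;
            default:
                throw new IllegalArgumentException("unknown event " + event);
        }
    }
}
```

Following one line through a read, a write and then a read from another core gives INVALID, EXCLUSIVE, MODIFIED, SHARED, which is the sequence behind the cost of false sharing.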
Static RAM (SRAM)
• Requires 6-8 pieces of circuitry per datum
• Cycle rate access, not quite measurable in time
• Uses a relatively large amount of power for what it does
• Data does not fade or leak, does not need to be refreshed/recharged
Dynamic RAM (DRAM)
• Requires 2 pieces of circuitry per datum
• "Leaks" charge, but not sooner than 64ms
• Reads deplete the charge, requiring subsequent recharge
• Takes 240 cycles (~100ns) to access
• Intel's Nehalem architecture - each CPU socket controls a portion of RAM, no other socket has direct access to it
Architectures
Current Processors
• Intel
  – Nehalem (Tock) / Westmere (Tick, 32nm)
  – Sandy Bridge (Tock)
  – Ivy Bridge (Tick, 22nm)
  – Haswell (Tock)
• AMD
  – Bulldozer
• Oracle
  – UltraSPARC isn't dead
"Latency Numbers Everyone Should Know"
L1 cache reference                           0.5 ns
Branch mispredict                              5 ns
L2 cache reference                             7 ns
Mutex lock/unlock                             25 ns
Main memory reference                        100 ns
Compress 1K bytes with Zippy               3,000 ns  =   3 µs
Send 2K bytes over 1 Gbps network         20,000 ns  =  20 µs
SSD random read                          150,000 ns  = 150 µs
Read 1 MB sequentially from memory       250,000 ns  = 250 µs
Round trip within same datacenter        500,000 ns  = 0.5 ms
Read 1 MB sequentially from SSD        1,000,000 ns  =   1 ms
Disk seek                             10,000,000 ns  =  10 ms
Read 1 MB sequentially from disk      20,000,000 ns  =  20 ms
Send packet CA->Netherlands->CA      150,000,000 ns  = 150 ms
• Shamelessly cribbed from this gist: https://gist.github.com/2843375, originally by Peter Norvig and amended by Jeff Dean
Measured Cache Latencies
Sandy Bridge-E         L1d      L2       L3       Main
=========================================================
Sequential Access      3 clk    11 clk   14 clk   6 ns
Full Random Access     3 clk    11 clk   38 clk   65.8 ns
SI Software's benchmarks: http://www.sisoftware.net/?d=qa&f=ben_mem_latency
Registers
• On-core for instructions being executed and their operands
• Can be accessed in a single cycle
• There are many different types
• A 64-bit Intel Nehalem CPU had 128 Integer & 128 floating point registers
Store Buffers
• Hold data for Out of Order (OoO) execution
• Fully associative
• Prevent "stalls" in execution on a thread when the cache line is not local to a core on a write
• ~1 cycle
Level Zero (L0)
• Added in Sandy Bridge
• A cache of the last 1536 uops decoded
• Well-suited for hot loops
• Not the same as the older "trace" cache
Level One (L1)
• Divided into data and instructions
• 32K data (L1d), 32K instructions (L1i) per core on a Sandy Bridge, Ivy Bridge and Haswell
• Sandy Bridge loads data at 256 bits per cycle, double that of Nehalem
• 3-4 cycles to access L1d
Level Two (L2)
• 256K per core on a Sandy Bridge, Ivy Bridge & Haswell
• 2MB per "module" on AMD's Bulldozer architecture
• ~11 cycles to access
• Unified data and instruction caches from here up
• If the working set size is larger than L2, misses grow
Level Three (L3)
• Was a "unified" cache up until Sandy Bridge, shared between cores
• Varies in size with different processors and versions of an architecture. Laptops might have 6-8MB, but server-class might have 30MB.
• 14-38 cycles to access
Level Four (L4)
• Some versions of Haswell will have a 128 MB L4 cache
• No latency benchmarks for this yet
Programming Tips
Striding & Pre-fetching
• Predictable memory access is really important
• Hardware pre-fetcher on the core looks for patterns of memory access
• Can be counter-productive if the access pattern is not predictable
• Martin Thompson blog post: "Memory Access Patterns are Important"
• Shows the importance of locality and striding
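A minimal way to experiment with striding is to visit every element of a large array exactly once while varying the step between consecutive accesses. This is a sketch with made-up names, not a reproduction of the benchmark in Martin Thompson's post; which strides hurt, and by how much, depends on the hardware.

```java
// Hypothetical stride experiment: same elements, same sum, different access order.
final class StrideDemo {
    // Visits every element exactly once, stepping `stride` slots at a time.
    static long sumWithStride(long[] data, int stride) {
        long sum = 0;
        for (int start = 0; start < stride; start++) {
            for (int i = start; i < data.length; i += stride) {
                sum += data[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] data = new long[1 << 22]; // 4M longs = 32 MB, larger than a typical L3
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        for (int stride : new int[] {1, 8, 512, 4096}) {
            long start = System.nanoTime();
            long sum = sumWithStride(data, stride);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("stride %d: %d us (sum %d)%n", stride, micros, sum);
        }
    }
}
```

A stride of 1 uses all eight longs of each 64-byte line before moving on, while a stride of 8 or more touches one long per line on every pass, so the same work pulls far more lines through the cache.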
Cache Misses
• Cost hundreds of cycles
• Keep your code simple
• Instruction read misses are most expensive
• Data read misses are less so, but still hurt performance
• Write misses are okay unless using Write Through
• Miss types:
  – Compulsory
  – Capacity
  – Conflict
Programming Optimizations
• Stack allocated data is cheap
• Pointer interaction - you have to retrieve data being pointed to, even in registers
• Avoid locking and resultant kernel arbitration
• CAS is better and occurs on-thread, but algorithms become more complex
• Match workload to the size of the last level cache (LLC, L3/L4)
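To make the CAS bullet concrete, here is a sketch of a lock-free "largest value seen so far" tracker built on java.util.concurrent.atomic.AtomicLong. The class name is hypothetical; the retry loop is the extra complexity the slide warns about.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical lock-free maximum tracker using compare-and-set instead of a lock.
final class CasMax {
    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

    // Retries only while the candidate is still larger and another thread won the race.
    void offer(long candidate) {
        long current = max.get();
        while (candidate > current && !max.compareAndSet(current, candidate)) {
            current = max.get(); // value changed underneath us; re-read and try again
        }
    }

    long get() {
        return max.get();
    }
}
```

No thread ever blocks or enters the kernel here; a losing thread simply re-reads and retries, or gives up once its candidate is no longer the largest.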
What about Functional Programming?
• Have to allocate more and more space for your data structures, leads to eviction
• When you cycle back around, you get cache misses
• Choose immutability by default, profile to find poor performance
• Use mutable data in targeted locations
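One reading of "mutable data in targeted locations" is to keep mutation inside a single construction loop over a contiguous array and expose only an immutable view. The sketch below uses a hypothetical prefix-sum class to show the shape; it is an illustration, not a pattern prescribed by the talk.

```java
// Hypothetical value class: mutated only while being built, read-only afterwards.
final class PrefixSums {
    private final long[] sums; // never exposed, so instances are effectively immutable

    private PrefixSums(long[] sums) {
        this.sums = sums;
    }

    // The only place where mutation happens: one pass over one contiguous array.
    static PrefixSums of(int[] values) {
        long[] acc = new long[values.length];
        long running = 0;
        for (int i = 0; i < values.length; i++) {
            running += values[i];
            acc[i] = running;
        }
        return new PrefixSums(acc);
    }

    // Sum of values[0..index], inclusive.
    long upTo(int index) {
        return sums[index];
    }

    int size() {
        return sums.length;
    }
}
```

Callers get the safety of an immutable value, while the hot loop allocates once and writes sequentially instead of creating a new structure per step.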
Hyperthreading
• Great for I/O-bound applications
• If you have lots of cache misses
• Doesn't do much for CPU-bound applications
• You have half of the cache resources per core
• NOTE - Haswell only has Hyperthreading on i7
Data Structures
• BAD: Linked list structures and tree structures
• BAD: Java's HashMap uses chained buckets
• BAD: Standard Java collections generate lots of garbage
• GOOD: Array-based and contiguous in memory is much faster
• GOOD: Write your own that are lock-free and contiguous
• GOOD: Fastutil library, but note that it's additive
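As an illustration of "array-based and contiguous", the sketch below is a bare-bones growable list of primitive ints. The class is hypothetical and far simpler than what a library such as Fastutil provides; it is single-threaded and makes no attempt at being lock-free.

```java
import java.util.Arrays;

// Hypothetical primitive-int list: one contiguous int[] instead of boxed, linked nodes.
final class IntVector {
    private int[] elements = new int[16];
    private int size;

    // Appends a value, doubling the backing array when it is full.
    void add(int value) {
        if (size == elements.length) {
            elements = Arrays.copyOf(elements, size * 2);
        }
        elements[size++] = value;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return elements[index];
    }

    int size() {
        return size;
    }

    // A sequential scan over contiguous memory, which is friendly to the pre-fetcher.
    long sum() {
        long total = 0;
        for (int i = 0; i < size; i++) {
            total += elements[i];
        }
        return total;
    }
}
```

Compared with a LinkedList of Integer, there is no node object and no boxed Integer per element, so iteration touches far fewer cache lines and produces no garbage.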
Application Memory Wall & GC
• Tremendous amounts of RAM at low cost
• GC will kill you with compaction
• Use pauseless GC
  – IBM's Metronome, very predictable
  – Azul's C4, very performant
Using GPUs
• Remember, locality matters
• Need to be able to export a task with data that does not need to update
• AMD has the new HSA platform, which communicates between GPUs and CPUs via shared L3
The Future
ManyCore
• David Ungar says > 24 cores, generally many 10s of cores
• Really gets interesting above 1000 cores
• Cache coherency won't be possible
• Non-deterministic
Memristor
• Non-volatile static RAM, same write endurance as Flash
• 200-300 MB on chip
• Sub-nanosecond writes
• Able to perform processing? (Probably not)
• Multistate, not binary
Phase Change Memory (PRAM)
• Higher performance than today's DRAM
• Intel seems more fascinated by this, released its "neuromorphic" chip design last Fall
• Not able to perform processing
• Write degradation is supposedly much slower
• Was considered susceptible to unintentional change, maybe fixed
Data13 Locality
bull The13 most13 cri3cal13 factor13 in13 performance13 13 Google13 argues13 otherwise
bull Not13 guaranteed13 by13 a13 JVMbull Spa7al13 -shy‐13 reused13 over13 and13 over13 in13 a13 loop13 data13 accessed13 in13 small13 regions
bull Temporal13 -shy‐13 high13 probability13 it13 will13 be13 reused13 before13 long
Memory13 Controller
bull Manages13 communica3on13 of13 readswrites13 between13 the13 CPU13 and13 RAM
bull Integrated13 Memory13 Controller13 on13 die
Cache13 Lines
bull 32-shy‐25613 con3guous13 bytes13 most13 commonly13 64bull Beware13 ldquofalse13 sharingrdquobull Use13 padding13 to13 ensure13 unshared13 linesbull Transferred13 in13 64-shy‐bit13 blocks13 (8x13 for13 6413 byte13 lines)13 arriving13 every13 ~413 cycles
bull Posi3on13 in13 the13 line13 of13 the13 ldquocri3cal13 wordrdquo13 ma9ers13 but13 not13 if13 pre-shy‐fetched
bull Contended13 annota3on13 coming13 in13 Java13 8
Cache13 Associa7vity
bull Fully13 Associa7ve13 Put13 it13 anywherebull Somewhere13 in13 the13 middle13 n-shy‐way13 set-shy‐associa3ve13 2-shy‐way13 skewed-shy‐associa3ve
bull Direct13 Mapped13 Each13 entry13 can13 only13 go13 in13 one13 specific13 place13
Cache13 Evic7on13 Strategies
bull Least13 Recently13 Used13 (LRU)bull Pseudo-shy‐LRU13 (PLRU)13 for13 large13 associa3vity13 caches
bull 2-shy‐Way13 Set13 Associa7vebull Direct13 Mappedbull Others
Cache13 Write13 Strategies
bull Write13 through13 changed13 cache13 line13 immediately13 goes13 back13 to13 main13 memory
bull Write13 back13 cache13 line13 is13 marked13 when13 dirty13 evic3on13 sends13 back13 to13 main13 memory
bull Write13 combining13 grouped13 writes13 of13 cache13 lines13 back13 to13 main13 memory
bull Uncacheable13 dynamic13 values13 that13 can13 change13 without13 warning
Exclusive13 versus13 Inclusive
bull Only13 relevant13 below13 L3bull AMD13 is13 exclusivendash Progressively13 more13 costly13 due13 to13 evic3onndash Can13 hold13 more13 datandash Bulldozer13 uses13 write13 through13 from13 L1d13 back13 to13 L2
bull Intel13 is13 inclusivendash Can13 be13 be9er13 for13 inter-shy‐processor13 memory13 sharingndash More13 expensive13 as13 lines13 in13 L113 are13 also13 in13 L213 amp13 L3ndash If13 evicted13 in13 a13 higher13 level13 cache13 must13 be13 evicted13 below13 as13 well
Inter-shy‐Socket13 Communica7on
bull GTs13 ndash13 gigatransfers13 per13 secondbull Quick13 Path13 Interconnect13 (QPI13 Intel)13 ndash13 8GTsbull HyperTransport13 (HTX13 AMD)13 ndash13 64GTs13 ()bull Both13 transfer13 1613 bits13 per13 transmission13 in13 prac3ce13 but13 Sandy13 Bridge13 is13 really13 32
MESI+F13 Cache13 Coherency13 Protocol
bull Specific13 to13 data13 cache13 linesbull Request13 for13 Ownership13 (RFO)13 when13 a13 processor13 tries13 to13 write13 to13
a13 cache13 linebull Modified13 the13 local13 processor13 has13 changed13 the13 cache13 line13 implies13
only13 one13 who13 has13 itbull Exclusive13 one13 processor13 is13 using13 the13 cache13 line13 not13 modifiedbull Shared13 mul3ple13 processors13 are13 using13 the13 cache13 line13 not13
modifiedbull Invalid13 the13 cache13 line13 is13 invalid13 must13 be13 re-shy‐fetchedbull Forward13 designate13 to13 respond13 to13 requests13 for13 a13 cache13 linebull All13 processors13 MUST13 acknowledge13 a13 message13 for13 it13 to13 be13 valid
Sta7c13 RAM13 (SRAM)
bull Requires13 6-shy‐813 pieces13 of13 circuitry13 per13 datumbull Cycle13 rate13 access13 not13 quite13 measurable13 in13 3mebull Uses13 a13 rela3vely13 large13 amount13 of13 power13 for13 what13 it13 does
bull Data13 does13 not13 fade13 or13 leak13 does13 not13 need13 to13 be13 refreshedrecharged
Dynamic13 RAM13 (DRAM)
bull Requires13 213 pieces13 of13 circuitry13 per13 datumbull ldquoLeaksrdquo13 charge13 but13 not13 sooner13 than13 64msbull Reads13 deplete13 the13 charge13 requiring13 subsequent13 recharge
bull Takes13 24013 cycles13 (~100ns)13 to13 accessbull Intels13 Nehalem13 architecture13 -shy‐13 each13 CPU13 socket13 controls13 a13 por3on13 of13 RAM13 no13 other13 socket13 has13 direct13 access13 to13 it
Architectures
Current13 Processors
bull Intelndash Nehalem13 (Tock)13 Westmere13 (Tick13 32nm)ndash Sandy13 Bridge13 (Tock)ndash Ivy13 Bridge13 (Tick13 22nm)ndash Haswell13 (Tock)
bull AMDndash Bulldozer
bull Oraclendash UltraSPARC13 isnt13 dead
ldquoLatency13 Numbers13 Everyone13 Should13 KnowrdquoL1 cache reference 05 ns
Branch mispredict 5 ns
L2 cache reference 7 ns
Mutex lockunlock 25 ns
Main memory reference 100 ns
Compress 1K bytes with Zippy 3000 ns = 3 micros
Send 2K bytes over 1 Gbps network 20000 ns = 20 micros
SSD random read 150000 ns = 150 micros
Read 1 MB sequentially from memory 250000 ns = 250 micros
Round trip within same datacenter 500000 ns = 05 ms
Read 1 MB sequentially from SSD 1000000 ns = 1 ms
Disk seek 10000000 ns = 10 ms
Read 1 MB sequentially from disk 20000000 ns = 20 ms
Send packet CA-gtNetherlands-gtCA 150000000 ns = 150 ms
bull Shamelessly13 cribbed13 from13 this13 gist13 h9psgistgithubcom284337513 originally13 by13 Peter13 Norvig13 and13 amended13 by13 Jeff13 Dean
Measured13 Cache13 Latencies
Sandy Bridge-E L1d L2 L3 Main=======================================================================Sequential Access 3 clk 11 clk 14 clk 6nsFull Random Access 3 clk 11 clk 38 clk 658ns
SI13 Sotwares13 benchmarks13 h9pwwwsisotwarenetd=qaampf=ben_mem_latency
Registers
bull On-shy‐core13 for13 instruc3ons13 being13 executed13 and13 their13 operands
bull Can13 be13 accessed13 in13 a13 single13 cyclebull There13 are13 many13 different13 typesbull A13 64-shy‐bit13 Intel13 Nehalem13 CPU13 had13 12813 Integer13 amp13 12813 floa3ng13 point13 registers
Store13 Buffers
bull Hold13 data13 for13 Out13 of13 Order13 (OoO)13 execu3onbull Fully13 associa3vebull Prevent13 ldquostallsrdquo13 in13 execu3on13 on13 a13 thread13 when13 the13 cache13 line13 is13 not13 local13 to13 a13 core13 on13 a13 write
bull ~113 cycle
Level13 Zero13 (L0)
bull Added13 in13 Sandy13 Bridgebull A13 cache13 of13 the13 last13 153613 uops13 decodedbull Well-shy‐suited13 for13 hot13 loopsbull Not13 the13 same13 as13 the13 older13 trace13 cache
Level13 One13 (L1)
bull Divided13 into13 data13 and13 instruc3onsbull 32K13 data13 (L1d)13 32K13 instruc3ons13 (L1i)13 per13 core13 on13 a13 Sandy13 Bridge13 Ivy13 Bridge13 and13 Haswell
bull Sandy13 Bridge13 loads13 data13 at13 25613 bits13 per13 cycle13 double13 that13 of13 Nehalem
bull 3-shy‐413 cycles13 to13 access13 L1d
Level13 Two13 (L2)
bull 256K13 per13 core13 on13 a13 Sandy13 Bridge13 Ivy13 Bridge13 amp13 Haswell
bull 2MB13 per13 ldquomodulerdquo13 on13 AMDs13 Bulldozer13 architecture
bull ~1113 cycles13 to13 accessbull Unified13 data13 and13 instruc3on13 caches13 from13 here13 up
bull If13 the13 working13 set13 size13 is13 larger13 than13 L213 misses13 grow
Level13 Three13 (L3)
bull Was13 a13 ldquounifiedrdquo13 cache13 up13 un3l13 Sandy13 Bridge13 shared13 between13 cores
bull Varies13 in13 size13 with13 different13 processors13 and13 versions13 of13 an13 architecture13 13 Laptops13 might13 have13 6-shy‐8MB13 but13 server-shy‐class13 might13 have13 30MB
bull 14-shy‐3813 cycles13 to13 access
Level13 Four13 (L4)
bull Some13 versions13 of13 Haswell13 will13 have13 a13 12813 MB13 L413 cache
bull No13 latency13 benchmarks13 for13 this13 yet
Programming13 Tips
Striding13 amp13 Pre-shy‐fetching
bull Predictable13 memory13 access13 is13 really13 importantbull Hardware13 pre-shy‐fetcher13 on13 the13 core13 looks13 for13 pa9erns13 of13 memory13 access
bull Can13 be13 counter-shy‐produc3ve13 if13 the13 access13 pa9ern13 is13 not13 predictable
bull Mar3n13 Thompson13 blog13 post13 ldquoMemory13 Access13 Pa9erns13 are13 Importantrdquo
bull Shows13 the13 importance13 of13 locality13 and13 striding
Cache13 Misses
bull Cost13 hundreds13 of13 cyclesbull Keep13 your13 code13 simplebull Instruc3on13 read13 misses13 are13 most13 expensivebull Data13 read13 miss13 are13 less13 so13 but13 s3ll13 hurt13 performancebull Write13 misses13 are13 okay13 unless13 using13 Write13 Throughbull Miss13 types
ndash Compulsoryndash Capacityndash Conflict
Programming13 Op7miza7ons
bull Stack13 allocated13 data13 is13 cheapbull Pointer13 interac3on13 -shy‐13 you13 have13 to13 retrieve13 data13 being13 pointed13 to13 even13 in13 registers
bull Avoid13 locking13 and13 resultant13 kernel13 arbitra3onbull CAS13 is13 be9er13 and13 occurs13 on-shy‐thread13 but13 algorithms13 become13 more13 complex
bull Match13 workload13 to13 the13 size13 of13 the13 last13 level13 cache13 (LLC13 L3L4)
What13 about13 Func7onal13 Programming
bull Have13 to13 allocate13 more13 and13 more13 space13 for13 your13 data13 structures13 leads13 to13 evic3on
bull When13 you13 cycle13 back13 around13 you13 get13 cache13 misses
bull Choose13 immutability13 by13 default13 profile13 to13 find13 poor13 performance
bull Use13 mutable13 data13 in13 targeted13 loca3ons
Hyperthreading
bull Great13 for13 IO-shy‐bound13 applica3onsbull If13 you13 have13 lots13 of13 cache13 missesbull Doesnt13 do13 much13 for13 CPU-shy‐bound13 applica3onsbull You13 have13 half13 of13 the13 cache13 resources13 per13 corebull NOTE13 -shy‐13 Haswell13 only13 has13 Hyperthreading13 on13 i7
Data Structures
• BAD: Linked list structures and tree structures
• BAD: Java's HashMap uses chained buckets
• BAD: Standard Java collections generate lots of garbage
• GOOD: Array-based and contiguous in memory is much faster
• GOOD: Write your own that are lock-free and contiguous
• GOOD: Fastutil library, but note that it's additive
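As a rough illustration of "array-based and contiguous", here is a sketch of an open-addressing `int` to `int` map that keeps keys and values in two primitive arrays instead of chained bucket nodes. `IntIntMap` is a hypothetical class written for this example. It is not the Fastutil API, it is not lock-free, and it never resizes.

```java
// Hypothetical sketch: fixed capacity, single-threaded, linear probing.
final class IntIntMap {
    private static final int EMPTY = Integer.MIN_VALUE; // reserved marker, not a legal key
    private final int[] keys;
    private final int[] values;
    private final int mask;
    private int size;

    IntIntMap(int capacityPowerOfTwo) {
        if (Integer.bitCount(capacityPowerOfTwo) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        keys = new int[capacityPowerOfTwo];
        values = new int[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
        java.util.Arrays.fill(keys, EMPTY);
    }

    void put(int key, int value) {
        if (key == EMPTY) {
            throw new IllegalArgumentException("reserved key");
        }
        int i = slot(key);
        // Probe neighbouring slots; they tend to sit on the same cache line.
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) & mask;
        }
        if (keys[i] == EMPTY) {
            if (size == mask) { // always keep one slot empty so probing terminates
                throw new IllegalStateException("map is full");
            }
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    int get(int key, int missing) {
        int i = slot(key);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) {
                return values[i];
            }
            i = (i + 1) & mask;
        }
        return missing;
    }

    private int slot(int key) {
        int h = key * 0x9E3779B9; // multiplicative hash to spread nearby keys
        return (h ^ (h >>> 16)) & mask;
    }
}
```

No per-entry objects are allocated, so lookups do not chase pointers and `put` creates no garbage.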
Application Memory Wall & GC
• Tremendous amounts of RAM at low cost
• GC will kill you with compaction
• Use pauseless GC
  - IBM's Metronome, very predictable
  - Azul's C4, very performant
Using GPUs
• Remember, locality matters!
• Need to be able to export a task with data that does not need to update
• AMD has the new HSA platform, which communicates between GPUs and CPUs via shared L3
The Future
ManyCore
• David Ungar says > 24 cores, generally many 10s of cores
• Really gets interesting above 1000 cores
• Cache coherency won't be possible
• Non-deterministic
Memristor
• Non-volatile, static RAM, same write endurance as Flash
• 200-300 MB on chip
• Sub-nanosecond writes
• Able to perform processing? (Probably not)
• Multistate, not binary
Phase Change Memory (PRAM)
• Higher performance than today's DRAM
• Intel seems more fascinated by this, released its "neuromorphic" chip design last Fall
• Not able to perform processing
• Write degradation is supposedly much slower
• Was considered susceptible to unintentional change, maybe fixed?
Thanks
Credits
• What Every Programmer Should Know About Memory, Ulrich Drepper of RedHat, 2007
• Java Performance, Charlie Hunt
• Wikipedia/Wikimedia Commons
• AnandTech
• The Microarchitecture of AMD, Intel and VIA CPUs
• Everything You Need to Know about the Quick Path Interconnect, Gabriel Torres/Hardware Secrets
• Inside the Sandy Bridge Architecture, Gabriel Torres/Hardware Secrets
• Martin Thompson's Mechanical Sympathy blog and Disruptor presentations
• The Application Memory Wall, Gil Tene, CTO of Azul Systems
• AMD Bulldozer/Intel Sandy Bridge Comparison, Gionatan Danti
• SI Software's Memory Latency Benchmarks
• Martin Thompson and Cliff Click provided feedback & additional content
Memory Controller
• Manages communication of reads/writes between the CPU and RAM
• Integrated Memory Controller on die
Cache Lines
• 32-256 contiguous bytes, most commonly 64
• Beware "false sharing"
• Use padding to ensure unshared lines
• Transferred in 64-bit blocks (8x for 64 byte lines), arriving every ~4 cycles
• Position in the line of the "critical word" matters, but not if pre-fetched
• @Contended annotation coming in Java 8!
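One common way to apply the padding advice on the JVM is to surround the hot field with unused `long` fields through a small class hierarchy. The sketch below assumes 64-byte lines. All class names are hypothetical, and the JVM does not guarantee field layout, so this shows a convention and not a guaranteed fix.

```java
// Hypothetical sketch: assumes 64-byte cache lines; JVM field layout is not guaranteed.
final class FalseSharingSketch {
    // Two threads each bump their own counter; the padding aims to keep the two
    // hot fields on different cache lines so the writers do not invalidate each other.
    static long[] run(int iterations) {
        PaddedCounter a = new PaddedCounter();
        PaddedCounter b = new PaddedCounter();
        Thread ta = new Thread(() -> { for (int i = 0; i < iterations; i++) a.value++; });
        Thread tb = new Thread(() -> { for (int i = 0; i < iterations; i++) b.value++; });
        ta.start();
        tb.start();
        try {
            ta.join();
            tb.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new long[] { a.value, b.value };
    }

    public static void main(String[] args) {
        long[] totals = run(10_000_000);
        System.out.println(totals[0] + " " + totals[1]);
    }
}

class LeftPad { long p1, p2, p3, p4, p5, p6, p7; }           // 56 bytes ahead of the value
class PaddedValue extends LeftPad { volatile long value; }   // the contended field
final class PaddedCounter extends PaddedValue { long q1, q2, q3, q4, q5, q6, q7; } // 56 bytes behind
```

Each counter has a single writer, so `value++` on the volatile field is safe here. With several writers per counter it would need an atomic update instead.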
Cache Associativity
• Fully Associative: Put it anywhere
• Somewhere in the middle: n-way set-associative, 2-way skewed-associative
• Direct Mapped: Each entry can only go in one specific place
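To make the three placements concrete, the sketch below computes which set an address lands in for a given cache size, line size and number of ways. `CacheGeometry` is a hypothetical helper. Real hardware may hash address bits differently, so treat the modulo mapping as the textbook model.

```java
// Hypothetical helper using the textbook address-to-set mapping.
final class CacheGeometry {
    final int lineSize;
    final int ways;
    final int sets;

    CacheGeometry(int totalBytes, int lineSize, int ways) {
        this.lineSize = lineSize;
        this.ways = ways;
        this.sets = totalBytes / (lineSize * ways); // lines are grouped into sets of `ways`
    }

    // The set an address must go into; within the set it may occupy any way.
    long setIndex(long address) {
        return (address / lineSize) % sets;
    }

    // The tag distinguishes the different lines that map to the same set.
    long tag(long address) {
        return (address / lineSize) / sets;
    }
}
```

With 32 KB, 64-byte lines and 8 ways there are 64 sets, so addresses 4096 bytes apart compete for the same set. With `ways = 1` the cache is direct mapped, and with `ways` equal to the total number of lines there is a single set, which is the fully associative case.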
Cache Eviction Strategies
• Least Recently Used (LRU)
• Pseudo-LRU (PLRU): for large associativity caches
• 2-Way Set Associative
• Direct Mapped
• Others
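A software model of true LRU for a single cache set, assuming `java.util.LinkedHashMap` in access-order mode as the bookkeeping structure. `LruSet` is a hypothetical name. Hardware does not work this way internally; the sketch only shows which line gets evicted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of one cache set with a fixed number of ways and LRU eviction.
final class LruSet {
    private final LinkedHashMap<Long, Boolean> lines;

    LruSet(final int wayCount) {
        // accessOrder = true: iteration order runs from least to most recently used.
        this.lines = new LinkedHashMap<Long, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Boolean> eldest) {
                return size() > wayCount; // evict the least recently used line
            }
        };
    }

    // Returns true on a hit; on a miss the line is loaded, possibly evicting another.
    boolean access(long tag) {
        if (lines.get(tag) != null) {
            return true; // get() also marks the line as most recently used
        }
        lines.put(tag, Boolean.TRUE);
        return false;
    }
}
```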
Cache Write Strategies
• Write through: changed cache line immediately goes back to main memory
• Write back: cache line is marked when dirty, eviction sends back to main memory
• Write combining: grouped writes of cache lines back to main memory
• Uncacheable: dynamic values that can change without warning
Exclusive versus Inclusive
• Only relevant below L3
• AMD is exclusive
  - Progressively more costly due to eviction
  - Can hold more data
  - Bulldozer uses "write through" from L1d back to L2
• Intel is inclusive
  - Can be better for inter-processor memory sharing
  - More expensive as lines in L1 are also in L2 & L3
  - If evicted in a higher level cache, must be evicted below as well
Inter-Socket Communication
• GT/s: gigatransfers per second
• Quick Path Interconnect (QPI, Intel): 8 GT/s
• HyperTransport (HTX, AMD): 6.4 GT/s (?)
• Both transfer 16 bits per transmission in practice, but Sandy Bridge is really 32
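The transfer rates above turn into raw bandwidth by multiplying by the payload width. A small sketch, assuming the figure is per direction and ignoring protocol overhead; the class and method names are hypothetical.

```java
// Hypothetical helper: raw link bandwidth, ignoring protocol overhead.
final class LinkBandwidth {
    // GT/s times bits per transfer, divided by 8 bits per byte, gives GB/s.
    static double gigabytesPerSecond(double gigatransfersPerSecond, int bitsPerTransfer) {
        return gigatransfersPerSecond * bitsPerTransfer / 8.0;
    }

    public static void main(String[] args) {
        System.out.println("8 GT/s at 16 bits:   " + gigabytesPerSecond(8.0, 16) + " GB/s");
        System.out.println("6.4 GT/s at 16 bits: " + gigabytesPerSecond(6.4, 16) + " GB/s");
    }
}
```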
MESI+F Cache Coherency Protocol
• Specific to data cache lines
• Request for Ownership (RFO), when a processor tries to write to a cache line
• Modified, the local processor has changed the cache line, implies only one who has it
• Exclusive, one processor is using the cache line, not modified
• Shared, multiple processors are using the cache line, not modified
• Invalid, the cache line is invalid, must be re-fetched
• Forward, designate to respond to requests for a cache line
• All processors MUST acknowledge a message for it to be valid
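The state changes can be sketched as a tiny state machine for one cache line as seen by one core. This is a simplification: it models only the four MESI states, leaves out the Forward state and the acknowledgement messages, and uses hypothetical names throughout.

```java
// Hypothetical, simplified MESI model for a single line in a single core's cache.
final class MesiSketch {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    // This core reads the line.
    static State onLocalRead(State s, boolean otherCoresHoldLine) {
        if (s == State.INVALID) {
            // The line must be fetched; it is Shared if anyone else holds a copy.
            return otherCoresHoldLine ? State.SHARED : State.EXCLUSIVE;
        }
        return s; // Modified, Exclusive and Shared all satisfy a read locally
    }

    // This core writes the line; the RFO invalidates every other copy first.
    static State onLocalWrite(State s) {
        return State.MODIFIED;
    }

    // Another core reads the line; a Modified copy is written back, then shared.
    static State onRemoteRead(State s) {
        return s == State.INVALID ? State.INVALID : State.SHARED;
    }

    // Another core's RFO arrives: our copy is no longer valid.
    static State onRemoteWrite(State s) {
        return State.INVALID;
    }
}
```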
Static RAM (SRAM)
• Requires 6-8 pieces of circuitry per datum
• Cycle rate access, not quite measurable in time
• Uses a relatively large amount of power for what it does
• Data does not fade or leak, does not need to be refreshed/recharged
Dynamic RAM (DRAM)
• Requires 2 pieces of circuitry per datum
• "Leaks" charge, but not sooner than 64ms
• Reads deplete the charge, requiring subsequent recharge
• Takes 240 cycles (~100ns) to access
• Intel's Nehalem architecture - each CPU socket controls a portion of RAM, no other socket has direct access to it
Architectures
Current Processors
• Intel
  - Nehalem (Tock) / Westmere (Tick, 32nm)
  - Sandy Bridge (Tock)
  - Ivy Bridge (Tick, 22nm)
  - Haswell (Tock)
• AMD
  - Bulldozer
• Oracle
  - UltraSPARC isn't dead
"Latency Numbers Everyone Should Know"
L1 cache reference                            0.5 ns
Branch mispredict                               5 ns
L2 cache reference                              7 ns
Mutex lock/unlock                              25 ns
Main memory reference                         100 ns
Compress 1K bytes with Zippy                3,000 ns  =   3 µs
Send 2K bytes over 1 Gbps network          20,000 ns  =  20 µs
SSD random read                           150,000 ns  = 150 µs
Read 1 MB sequentially from memory        250,000 ns  = 250 µs
Round trip within same datacenter         500,000 ns  = 0.5 ms
Read 1 MB sequentially from SSD         1,000,000 ns  =   1 ms
Disk seek                              10,000,000 ns  =  10 ms
Read 1 MB sequentially from disk       20,000,000 ns  =  20 ms
Send packet CA->Netherlands->CA       150,000,000 ns  = 150 ms
• Shamelessly cribbed from this gist: https://gist.github.com/2843375, originally by Peter Norvig and amended by Jeff Dean
Measured Cache Latencies
Sandy Bridge-E        L1d     L2       L3       Main
====================================================
Sequential Access     3 clk   11 clk   14 clk   6ns
Full Random Access    3 clk   11 clk   38 clk   65.8ns
SI Software's benchmarks: http://www.sisoftware.net/?d=qa&f=ben_mem_latency
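The table mixes clock cycles (caches) with nanoseconds (main memory). To compare the two, divide cycles by the clock frequency in GHz. The sketch below assumes a 2.4 GHz clock, a figure inferred from the "240 cycles (~100ns)" bullet on the DRAM slide; it is not stated for this benchmark. The names are hypothetical.

```java
// Hypothetical helper; the 2.4 GHz clock is an assumption, not a measured value.
final class CycleMath {
    // A clock of N GHz completes N cycles per nanosecond.
    static double cyclesToNanos(double cycles, double clockGhz) {
        return cycles / clockGhz;
    }

    public static void main(String[] args) {
        double ghz = 2.4; // assumed clock
        System.out.printf("L1d  3 clk = %.2f ns%n", cyclesToNanos(3, ghz));
        System.out.printf("L2  11 clk = %.2f ns%n", cyclesToNanos(11, ghz));
        System.out.printf("L3  38 clk = %.2f ns%n", cyclesToNanos(38, ghz));
    }
}
```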
Registers
• On-core for instructions being executed and their operands
• Can be accessed in a single cycle
• There are many different types
• A 64-bit Intel Nehalem CPU had 128 Integer & 128 floating point registers
Store Buffers
• Hold data for Out of Order (OoO) execution
• Fully associative
• Prevent "stalls" in execution on a thread when the cache line is not local to a core on a write
• ~1 cycle
Level Zero (L0)
• Added in Sandy Bridge
• A cache of the last 1536 uops decoded
• Well-suited for hot loops
• Not the same as the older "trace" cache
Level One (L1)
• Divided into data and instructions
• 32K data (L1d), 32K instructions (L1i) per core on a Sandy Bridge, Ivy Bridge and Haswell
• Sandy Bridge loads data at 256 bits per cycle, double that of Nehalem
• 3-4 cycles to access L1d
Level Two (L2)
• 256K per core on a Sandy Bridge, Ivy Bridge & Haswell
• 2MB per "module" on AMD's Bulldozer architecture
• ~11 cycles to access
• Unified data and instruction caches from here up
• If the working set size is larger than L2, misses grow
Level Three (L3)
• Was a "unified" cache up until Sandy Bridge, shared between cores
• Varies in size with different processors and versions of an architecture. Laptops might have 6-8MB, but server-class might have 30MB.
• 14-38 cycles to access
Cache13 Associa7vity
bull Fully13 Associa7ve13 Put13 it13 anywherebull Somewhere13 in13 the13 middle13 n-shy‐way13 set-shy‐associa3ve13 2-shy‐way13 skewed-shy‐associa3ve
bull Direct13 Mapped13 Each13 entry13 can13 only13 go13 in13 one13 specific13 place13
Cache13 Evic7on13 Strategies
bull Least13 Recently13 Used13 (LRU)bull Pseudo-shy‐LRU13 (PLRU)13 for13 large13 associa3vity13 caches
bull 2-shy‐Way13 Set13 Associa7vebull Direct13 Mappedbull Others
Cache13 Write13 Strategies
bull Write13 through13 changed13 cache13 line13 immediately13 goes13 back13 to13 main13 memory
bull Write13 back13 cache13 line13 is13 marked13 when13 dirty13 evic3on13 sends13 back13 to13 main13 memory
bull Write13 combining13 grouped13 writes13 of13 cache13 lines13 back13 to13 main13 memory
bull Uncacheable13 dynamic13 values13 that13 can13 change13 without13 warning
Exclusive13 versus13 Inclusive
bull Only13 relevant13 below13 L3bull AMD13 is13 exclusivendash Progressively13 more13 costly13 due13 to13 evic3onndash Can13 hold13 more13 datandash Bulldozer13 uses13 write13 through13 from13 L1d13 back13 to13 L2
bull Intel13 is13 inclusivendash Can13 be13 be9er13 for13 inter-shy‐processor13 memory13 sharingndash More13 expensive13 as13 lines13 in13 L113 are13 also13 in13 L213 amp13 L3ndash If13 evicted13 in13 a13 higher13 level13 cache13 must13 be13 evicted13 below13 as13 well
Inter-shy‐Socket13 Communica7on
bull GTs13 ndash13 gigatransfers13 per13 secondbull Quick13 Path13 Interconnect13 (QPI13 Intel)13 ndash13 8GTsbull HyperTransport13 (HTX13 AMD)13 ndash13 64GTs13 ()bull Both13 transfer13 1613 bits13 per13 transmission13 in13 prac3ce13 but13 Sandy13 Bridge13 is13 really13 32
MESI+F13 Cache13 Coherency13 Protocol
bull Specific13 to13 data13 cache13 linesbull Request13 for13 Ownership13 (RFO)13 when13 a13 processor13 tries13 to13 write13 to13
a13 cache13 linebull Modified13 the13 local13 processor13 has13 changed13 the13 cache13 line13 implies13
only13 one13 who13 has13 itbull Exclusive13 one13 processor13 is13 using13 the13 cache13 line13 not13 modifiedbull Shared13 mul3ple13 processors13 are13 using13 the13 cache13 line13 not13
modifiedbull Invalid13 the13 cache13 line13 is13 invalid13 must13 be13 re-shy‐fetchedbull Forward13 designate13 to13 respond13 to13 requests13 for13 a13 cache13 linebull All13 processors13 MUST13 acknowledge13 a13 message13 for13 it13 to13 be13 valid
Sta7c13 RAM13 (SRAM)
bull Requires13 6-shy‐813 pieces13 of13 circuitry13 per13 datumbull Cycle13 rate13 access13 not13 quite13 measurable13 in13 3mebull Uses13 a13 rela3vely13 large13 amount13 of13 power13 for13 what13 it13 does
bull Data13 does13 not13 fade13 or13 leak13 does13 not13 need13 to13 be13 refreshedrecharged
Dynamic13 RAM13 (DRAM)
bull Requires13 213 pieces13 of13 circuitry13 per13 datumbull ldquoLeaksrdquo13 charge13 but13 not13 sooner13 than13 64msbull Reads13 deplete13 the13 charge13 requiring13 subsequent13 recharge
bull Takes13 24013 cycles13 (~100ns)13 to13 accessbull Intels13 Nehalem13 architecture13 -shy‐13 each13 CPU13 socket13 controls13 a13 por3on13 of13 RAM13 no13 other13 socket13 has13 direct13 access13 to13 it
Architectures
Current13 Processors
bull Intelndash Nehalem13 (Tock)13 Westmere13 (Tick13 32nm)ndash Sandy13 Bridge13 (Tock)ndash Ivy13 Bridge13 (Tick13 22nm)ndash Haswell13 (Tock)
bull AMDndash Bulldozer
bull Oraclendash UltraSPARC13 isnt13 dead
ldquoLatency13 Numbers13 Everyone13 Should13 KnowrdquoL1 cache reference 05 ns
Branch mispredict 5 ns
L2 cache reference 7 ns
Mutex lockunlock 25 ns
Main memory reference 100 ns
Compress 1K bytes with Zippy 3000 ns = 3 micros
Send 2K bytes over 1 Gbps network 20000 ns = 20 micros
SSD random read 150000 ns = 150 micros
Read 1 MB sequentially from memory 250000 ns = 250 micros
Round trip within same datacenter 500000 ns = 05 ms
Read 1 MB sequentially from SSD 1000000 ns = 1 ms
Disk seek 10000000 ns = 10 ms
Read 1 MB sequentially from disk 20000000 ns = 20 ms
Send packet CA-gtNetherlands-gtCA 150000000 ns = 150 ms
bull Shamelessly13 cribbed13 from13 this13 gist13 h9psgistgithubcom284337513 originally13 by13 Peter13 Norvig13 and13 amended13 by13 Jeff13 Dean
Measured13 Cache13 Latencies
Sandy Bridge-E L1d L2 L3 Main=======================================================================Sequential Access 3 clk 11 clk 14 clk 6nsFull Random Access 3 clk 11 clk 38 clk 658ns
SI13 Sotwares13 benchmarks13 h9pwwwsisotwarenetd=qaampf=ben_mem_latency
Registers
bull On-shy‐core13 for13 instruc3ons13 being13 executed13 and13 their13 operands
bull Can13 be13 accessed13 in13 a13 single13 cyclebull There13 are13 many13 different13 typesbull A13 64-shy‐bit13 Intel13 Nehalem13 CPU13 had13 12813 Integer13 amp13 12813 floa3ng13 point13 registers
Store13 Buffers
bull Hold13 data13 for13 Out13 of13 Order13 (OoO)13 execu3onbull Fully13 associa3vebull Prevent13 ldquostallsrdquo13 in13 execu3on13 on13 a13 thread13 when13 the13 cache13 line13 is13 not13 local13 to13 a13 core13 on13 a13 write
bull ~113 cycle
Level13 Zero13 (L0)
bull Added13 in13 Sandy13 Bridgebull A13 cache13 of13 the13 last13 153613 uops13 decodedbull Well-shy‐suited13 for13 hot13 loopsbull Not13 the13 same13 as13 the13 older13 trace13 cache
Level13 One13 (L1)
bull Divided13 into13 data13 and13 instruc3onsbull 32K13 data13 (L1d)13 32K13 instruc3ons13 (L1i)13 per13 core13 on13 a13 Sandy13 Bridge13 Ivy13 Bridge13 and13 Haswell
bull Sandy13 Bridge13 loads13 data13 at13 25613 bits13 per13 cycle13 double13 that13 of13 Nehalem
bull 3-shy‐413 cycles13 to13 access13 L1d
Level13 Two13 (L2)
bull 256K13 per13 core13 on13 a13 Sandy13 Bridge13 Ivy13 Bridge13 amp13 Haswell
bull 2MB13 per13 ldquomodulerdquo13 on13 AMDs13 Bulldozer13 architecture
bull ~1113 cycles13 to13 accessbull Unified13 data13 and13 instruc3on13 caches13 from13 here13 up
bull If13 the13 working13 set13 size13 is13 larger13 than13 L213 misses13 grow
Level13 Three13 (L3)
bull Was13 a13 ldquounifiedrdquo13 cache13 up13 un3l13 Sandy13 Bridge13 shared13 between13 cores
bull Varies13 in13 size13 with13 different13 processors13 and13 versions13 of13 an13 architecture13 13 Laptops13 might13 have13 6-shy‐8MB13 but13 server-shy‐class13 might13 have13 30MB
bull 14-shy‐3813 cycles13 to13 access
Level13 Four13 (L4)
bull Some13 versions13 of13 Haswell13 will13 have13 a13 12813 MB13 L413 cache
bull No13 latency13 benchmarks13 for13 this13 yet
Programming Tips
Striding & Pre-fetching
• Predictable memory access is really important
• Hardware pre-fetcher on the core looks for patterns of memory access
• Can be counter-productive if the access pattern is not predictable
• Martin Thompson blog post: "Memory Access Patterns are Important"
• Shows the importance of locality and striding
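A minimal sketch of the striding point, assuming a 32 MB array and a stride of 4096 elements (both numbers are illustrative choices, not taken from the talk or from Martin Thompson's post). Both methods read every element exactly once and return the same sum; only the order of access differs. The class name `StrideDemo` is hypothetical.

```java
import java.util.Arrays;

final class StrideDemo {
    // Sums every element by walking the array in order (stride of 1).
    static long sumSequential(long[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Sums every element exactly once, but jumps `stride` slots between reads.
    static long sumStrided(long[] data, int stride) {
        long sum = 0;
        for (int start = 0; start < stride; start++) {
            for (int i = start; i < data.length; i += stride) {
                sum += data[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] data = new long[1 << 22]; // 4M longs = 32 MB (assumed to exceed the LLC)
        Arrays.fill(data, 1L);
        long t0 = System.nanoTime();
        long a = sumSequential(data);
        long t1 = System.nanoTime();
        long b = sumStrided(data, 4096);
        long t2 = System.nanoTime();
        System.out.println("sequential: " + a + " in " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("strided:    " + b + " in " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

The timing in `main` is deliberately naive (no JIT warm-up, a single run), so treat any numbers it prints as a rough hint only; a harness such as JMH is the better tool for a real measurement.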
Cache Misses
• Cost hundreds of cycles
• Keep your code simple
• Instruction read misses are most expensive
• Data read misses are less so, but still hurt performance
• Write misses are okay unless using Write Through
• Miss types:
  – Compulsory
  – Capacity
  – Conflict
Programming Optimizations
• Stack allocated data is cheap
• Pointer interaction - you have to retrieve data being pointed to, even in registers
• Avoid locking and resultant kernel arbitration
• CAS is better and occurs on-thread, but algorithms become more complex
• Match workload to the size of the last level cache (LLC, L3/L4)
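To make the CAS bullet concrete, here is a small sketch of a lock-free counter built on `java.util.concurrent.atomic.AtomicLong.compareAndSet`. The class name `CasCounter` is made up for illustration; the retry loop is the part that shows why CAS-based algorithms get more complex than a simple lock.

```java
import java.util.concurrent.atomic.AtomicLong;

final class CasCounter {
    private final AtomicLong value = new AtomicLong();

    // Lock-free increment: read, compute, and publish only if nobody raced us.
    long increment() {
        while (true) {
            long current = value.get();
            long next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next;
            }
            // Another thread changed the value first; loop and retry with a fresh read.
        }
    }

    long get() {
        return value.get();
    }
}
```

In everyday code `AtomicLong.incrementAndGet()` already does this for you; the explicit loop is spelled out here only to show the shape of the technique.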
What about Functional Programming?
• Have to allocate more and more space for your data structures, leads to eviction
• When you cycle back around, you get cache misses
• Choose immutability by default, profile to find poor performance
• Use mutable data in targeted locations
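One way to read "mutable data in targeted locations" is sketched below, under the assumption that the hot loop is a prefix-sum computation (an invented example, not one from the talk). The mutable scratch array lives only inside the factory method; callers see an object whose contents never change. `PrefixSums` is a hypothetical name.

```java
final class PrefixSums {
    private final long[] sums; // never exposed, so instances are effectively immutable

    private PrefixSums(long[] sums) {
        this.sums = sums;
    }

    // The only place mutation happens: one contiguous array, filled in a single pass.
    static PrefixSums of(int[] input) {
        long[] scratch = new long[input.length];
        long running = 0;
        for (int i = 0; i < input.length; i++) {
            running += input[i];
            scratch[i] = running;
        }
        return new PrefixSums(scratch);
    }

    long get(int index) {
        return sums[index];
    }

    int size() {
        return sums.length;
    }
}
```

The design choice is to keep the immutable interface for callers while avoiding a fresh allocation per step inside the loop.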
Hyperthreading
• Great for I/O-bound applications
• If you have lots of cache misses
• Doesn't do much for CPU-bound applications
• You have half of the cache resources per core
• NOTE - Haswell only has Hyperthreading on i7
Data Structures
• BAD: Linked list structures and tree structures
• BAD: Java's HashMap uses chained buckets
• BAD: Standard Java collections generate lots of garbage
• GOOD: Array-based and contiguous in memory is much faster
• GOOD: Write your own that are lock-free and contiguous
• GOOD: Fastutil library, but note that it's additive
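As a rough illustration of "array-based and contiguous", here is a minimal growable list of primitive `int` values. It is a hand-written sketch with a hypothetical name (`PrimitiveIntList`), not the Fastutil API, and it is not thread-safe or lock-free. The point is only that the values sit next to each other in one `int[]` instead of being boxed `Integer` objects reached through node pointers.

```java
import java.util.Arrays;

final class PrimitiveIntList {
    private int[] elements = new int[16]; // contiguous backing store, no boxing
    private int size;

    void add(int value) {
        if (size == elements.length) {
            elements = Arrays.copyOf(elements, size * 2); // grow by doubling
        }
        elements[size++] = value;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return elements[index];
    }

    int size() {
        return size;
    }

    // A linear scan over the backing array: a predictable, prefetch-friendly pattern.
    long sum() {
        long total = 0;
        for (int i = 0; i < size; i++) {
            total += elements[i];
        }
        return total;
    }
}
```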
Application Memory Wall & GC
• Tremendous amounts of RAM at low cost
• GC will kill you with compaction
• Use pauseless GC
  – IBM's Metronome, very predictable
  – Azul's C4, very performant
Using GPUs
• Remember, locality matters
• Need to be able to export a task with data that does not need to update
• AMD has the new HSA platform which communicates between GPUs and CPUs via shared L3
The Future
ManyCore
• David Ungar says > 24 cores, generally many 10s of cores
• Really gets interesting above 1000 cores
• Cache coherency won't be possible
• Non-deterministic
Memristor
• Non-volatile, static RAM, same write endurance as Flash
• 200-300 MB on chip
• Sub-nanosecond writes
• Able to perform processing? (Probably not)
• Multistate, not binary
Phase Change Memory (PRAM)
• Higher performance than today's DRAM
• Intel seems more fascinated by this, released its neuromorphic chip design last Fall
• Not able to perform processing
• Write degradation is supposedly much slower
• Was considered susceptible to unintentional change, maybe fixed
Thanks
Credits
• What Every Programmer Should Know About Memory, Ulrich Drepper of RedHat, 2007
• Java Performance, Charlie Hunt
• Wikipedia/Wikimedia Commons
• AnandTech
• The Microarchitecture of AMD, Intel and VIA CPUs
• Everything You Need to Know about the Quick Path Interconnect, Gabriel Torres/Hardware Secrets
• Inside the Sandy Bridge Architecture, Gabriel Torres/Hardware Secrets
• Martin Thompson's Mechanical Sympathy blog and Disruptor presentations
• The Application Memory Wall, Gil Tene, CTO of Azul Systems
• AMD Bulldozer/Intel Sandy Bridge Comparison, Gionatan Danti
• SI Software's Memory Latency Benchmarks
• Martin Thompson and Cliff Click provided feedback & additional content
Cache Eviction Strategies
• Least Recently Used (LRU)
• Pseudo-LRU (PLRU): for large associativity caches
• 2-Way Set Associative
• Direct Mapped
• Others
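The difference between direct-mapped and 2-way set-associative placement with LRU eviction can be sketched with a toy model. Everything here is an assumption made for illustration: the class name `SetAssociativeCache`, the use of plain non-negative line numbers instead of real addresses, and a logical clock standing in for hardware LRU bits. With `ways = 1` the model behaves as a direct-mapped cache.

```java
import java.util.Arrays;

final class SetAssociativeCache {
    private final long[][] tags;    // tags[set][way]; -1 marks an empty way
    private final long[][] lastUse; // logical timestamp of the most recent access
    private final int sets;
    private long clock;
    private long misses;

    SetAssociativeCache(int sets, int ways) {
        this.sets = sets;
        this.tags = new long[sets][ways];
        this.lastUse = new long[sets][ways];
        for (long[] row : tags) {
            Arrays.fill(row, -1L);
        }
    }

    // Touches one (non-negative) cache-line number; returns true on a hit.
    boolean access(long line) {
        int set = (int) (line % sets);
        long[] t = tags[set];
        long[] used = lastUse[set];
        clock++;
        int victim = 0;
        for (int way = 0; way < t.length; way++) {
            if (t[way] == line) {
                used[way] = clock;
                return true;
            }
            if (used[way] < used[victim]) {
                victim = way; // remember the least recently used way so far
            }
        }
        misses++;
        t[victim] = line; // evict the LRU (or an empty) way
        used[victim] = clock;
        return false;
    }

    long misses() {
        return misses;
    }
}
```

For example, alternating between lines 0 and 4 in a cache with 4 sets and 1 way misses on every access, because both lines map to set 0 and keep evicting each other. The same trace against 2 sets of 2 ways (the same total capacity) misses only on the first touch of each line.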
Cache Write Strategies
• Write through: changed cache line immediately goes back to main memory
• Write back: cache line is marked when dirty, eviction sends back to main memory
• Write combining: grouped writes of cache lines back to main memory
• Uncacheable: dynamic values that can change without warning
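A toy comparison of write-through and write-back, assuming a cache that holds a single line and a trace of stores given as line numbers (both simplifications are mine, and `WritePolicyDemo` is a hypothetical name). It only counts how many writes reach main memory under each policy.

```java
final class WritePolicyDemo {
    // Write-through: every store is forwarded to main memory immediately.
    static int writeThrough(long[] storedLines) {
        return storedLines.length;
    }

    // Write-back: a store only marks the line dirty; memory is written on eviction.
    static int writeBack(long[] storedLines) {
        int memoryWrites = 0;
        long cached = -1;      // line currently held; -1 means nothing cached yet
        boolean dirty = false;
        for (long line : storedLines) {
            if (line != cached) {
                if (dirty) {
                    memoryWrites++; // evicting a dirty line flushes it to memory
                }
                cached = line;
            }
            dirty = true;
        }
        if (dirty) {
            memoryWrites++; // final flush so memory ends up consistent
        }
        return memoryWrites;
    }
}
```

For the trace `{7, 7, 7, 9, 9}` the write-through count is 5 and the write-back count is 2, which is the kind of gap that makes repeated stores to the same line cheap under write-back.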
Exclusive versus Inclusive
• Only relevant below L3
• AMD is exclusive
  – Progressively more costly due to eviction
  – Can hold more data
  – Bulldozer uses write through from L1d back to L2
• Intel is inclusive
  – Can be better for inter-processor memory sharing
  – More expensive as lines in L1 are also in L2 & L3
  – If evicted in a higher level cache, must be evicted below as well
Inter-Socket Communication
• GT/s – gigatransfers per second
• Quick Path Interconnect (QPI, Intel) – 8GT/s
• HyperTransport (HTX, AMD) – 6.4GT/s (?)
• Both transfer 16 bits per transmission in practice, but Sandy Bridge is really 32
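As a rough worked example using the slide's own figures (8 GT/s and 16 bits per transfer), the raw one-directional payload rate of a QPI link comes out as follows. This ignores protocol overhead and is only the back-of-the-envelope arithmetic, not a measured number.

```latex
8 \times 10^{9}\ \frac{\text{transfers}}{\text{s}} \times 16\ \frac{\text{bits}}{\text{transfer}}
  = 128 \times 10^{9}\ \frac{\text{bits}}{\text{s}}
  = 16\ \text{GB/s per direction}
```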
MESI+F Cache Coherency Protocol
• Specific to data cache lines
• Request for Ownership (RFO): when a processor tries to write to a cache line
• Modified: the local processor has changed the cache line, implies only one who has it
• Exclusive: one processor is using the cache line, not modified
• Shared: multiple processors are using the cache line, not modified
• Invalid: the cache line is invalid, must be re-fetched
• Forward: designate to respond to requests for a cache line
• All processors MUST acknowledge a message for it to be valid
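The state changes above can be sketched as a small state machine. This is a textbook-style simplification that I am assuming for illustration: it models only the four MESI states for one cache line as seen from one core, leaves out the Forward state, and does not model the acknowledgement messages or write-backs. `LineState` and its method names are hypothetical.

```java
enum LineState {
    MODIFIED, EXCLUSIVE, SHARED, INVALID;

    // This core reads the line. A miss loads it as SHARED or EXCLUSIVE.
    LineState onLocalRead(boolean otherCachesHoldLine) {
        if (this == INVALID) {
            return otherCachesHoldLine ? SHARED : EXCLUSIVE;
        }
        return this;
    }

    // This core writes the line (preceded by an RFO if it was SHARED or INVALID).
    LineState onLocalWrite() {
        return MODIFIED;
    }

    // Another core reads the line: any valid copy here drops to SHARED.
    LineState onRemoteRead() {
        return this == INVALID ? INVALID : SHARED;
    }

    // Another core issues an RFO in order to write: our copy is invalidated.
    LineState onRemoteWrite() {
        return INVALID;
    }
}
```

Reading the transitions this way shows why writes to a line that other cores keep reading are expensive: each write invalidates the other copies, and each subsequent remote read forces the line back to a shared state.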
Static RAM (SRAM)
• Requires 6-8 pieces of circuitry per datum
• Cycle rate access, not quite measurable in time
• Uses a relatively large amount of power for what it does
• Data does not fade or leak, does not need to be refreshed/recharged
Dynamic RAM (DRAM)
• Requires 2 pieces of circuitry per datum
• "Leaks" charge, but not sooner than 64ms
• Reads deplete the charge, requiring subsequent recharge
• Takes 240 cycles (~100ns) to access
• Intel's Nehalem architecture - each CPU socket controls a portion of RAM, no other socket has direct access to it
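The "240 cycles (~100ns)" figure lines up if you assume a core clock of about 2.4 GHz. That clock rate is my assumption; the slide does not state one.

```latex
t_{\text{access}} = \frac{240\ \text{cycles}}{2.4 \times 10^{9}\ \text{cycles/s}}
  = 1.0 \times 10^{-7}\ \text{s} = 100\ \text{ns}
```

On a faster clock the same 100 ns costs proportionally more cycles, which is why cycle counts for main memory vary between machines while the nanosecond figure stays roughly put.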
Architectures
Current Processors
• Intel
  – Nehalem (Tock) / Westmere (Tick, 32nm)
  – Sandy Bridge (Tock)
  – Ivy Bridge (Tick, 22nm)
  – Haswell (Tock)
• AMD
  – Bulldozer
• Oracle
  – UltraSPARC isn't dead
"Latency Numbers Everyone Should Know"
L1 cache reference                          0.5 ns
Branch mispredict                             5 ns
L2 cache reference                            7 ns
Mutex lock/unlock                            25 ns
Main memory reference                       100 ns
Compress 1K bytes with Zippy              3,000 ns = 3 µs
Send 2K bytes over 1 Gbps network        20,000 ns = 20 µs
SSD random read                         150,000 ns = 150 µs
Read 1 MB sequentially from memory      250,000 ns = 250 µs
Round trip within same datacenter       500,000 ns = 0.5 ms
Read 1 MB sequentially from SSD       1,000,000 ns = 1 ms
Disk seek                            10,000,000 ns = 10 ms
Read 1 MB sequentially from disk     20,000,000 ns = 20 ms
Send packet CA->Netherlands->CA     150,000,000 ns = 150 ms
• Shamelessly cribbed from this gist: https://gist.github.com/2843375, originally by Peter Norvig and amended by Jeff Dean
Measured Cache Latencies
Sandy Bridge-E        L1d      L2       L3       Main
=====================================================
Sequential Access     3 clk    11 clk   14 clk   6ns
Full Random Access    3 clk    11 clk   38 clk   65.8ns
SI Software's benchmarks: http://www.sisoftware.net/?d=qa&f=ben_mem_latency
Exclusive13 versus13 Inclusive
bull Only13 relevant13 below13 L3bull AMD13 is13 exclusivendash Progressively13 more13 costly13 due13 to13 evic3onndash Can13 hold13 more13 datandash Bulldozer13 uses13 write13 through13 from13 L1d13 back13 to13 L2
bull Intel13 is13 inclusivendash Can13 be13 be9er13 for13 inter-shy‐processor13 memory13 sharingndash More13 expensive13 as13 lines13 in13 L113 are13 also13 in13 L213 amp13 L3ndash If13 evicted13 in13 a13 higher13 level13 cache13 must13 be13 evicted13 below13 as13 well
Inter-shy‐Socket13 Communica7on
bull GTs13 ndash13 gigatransfers13 per13 secondbull Quick13 Path13 Interconnect13 (QPI13 Intel)13 ndash13 8GTsbull HyperTransport13 (HTX13 AMD)13 ndash13 64GTs13 ()bull Both13 transfer13 1613 bits13 per13 transmission13 in13 prac3ce13 but13 Sandy13 Bridge13 is13 really13 32
MESI+F13 Cache13 Coherency13 Protocol
bull Specific13 to13 data13 cache13 linesbull Request13 for13 Ownership13 (RFO)13 when13 a13 processor13 tries13 to13 write13 to13
a13 cache13 linebull Modified13 the13 local13 processor13 has13 changed13 the13 cache13 line13 implies13
only13 one13 who13 has13 itbull Exclusive13 one13 processor13 is13 using13 the13 cache13 line13 not13 modifiedbull Shared13 mul3ple13 processors13 are13 using13 the13 cache13 line13 not13
modifiedbull Invalid13 the13 cache13 line13 is13 invalid13 must13 be13 re-shy‐fetchedbull Forward13 designate13 to13 respond13 to13 requests13 for13 a13 cache13 linebull All13 processors13 MUST13 acknowledge13 a13 message13 for13 it13 to13 be13 valid
Sta7c13 RAM13 (SRAM)
bull Requires13 6-shy‐813 pieces13 of13 circuitry13 per13 datumbull Cycle13 rate13 access13 not13 quite13 measurable13 in13 3mebull Uses13 a13 rela3vely13 large13 amount13 of13 power13 for13 what13 it13 does
bull Data13 does13 not13 fade13 or13 leak13 does13 not13 need13 to13 be13 refreshedrecharged
Dynamic13 RAM13 (DRAM)
bull Requires13 213 pieces13 of13 circuitry13 per13 datumbull ldquoLeaksrdquo13 charge13 but13 not13 sooner13 than13 64msbull Reads13 deplete13 the13 charge13 requiring13 subsequent13 recharge
bull Takes13 24013 cycles13 (~100ns)13 to13 accessbull Intels13 Nehalem13 architecture13 -shy‐13 each13 CPU13 socket13 controls13 a13 por3on13 of13 RAM13 no13 other13 socket13 has13 direct13 access13 to13 it
Architectures
Current13 Processors
bull Intelndash Nehalem13 (Tock)13 Westmere13 (Tick13 32nm)ndash Sandy13 Bridge13 (Tock)ndash Ivy13 Bridge13 (Tick13 22nm)ndash Haswell13 (Tock)
bull AMDndash Bulldozer
bull Oraclendash UltraSPARC13 isnt13 dead
ldquoLatency13 Numbers13 Everyone13 Should13 KnowrdquoL1 cache reference 05 ns
Branch mispredict 5 ns
L2 cache reference 7 ns
Mutex lockunlock 25 ns
Main memory reference 100 ns
Compress 1K bytes with Zippy 3000 ns = 3 micros
Send 2K bytes over 1 Gbps network 20000 ns = 20 micros
SSD random read 150000 ns = 150 micros
Read 1 MB sequentially from memory 250000 ns = 250 micros
Round trip within same datacenter 500000 ns = 05 ms
Read 1 MB sequentially from SSD 1000000 ns = 1 ms
Disk seek 10000000 ns = 10 ms
Read 1 MB sequentially from disk 20000000 ns = 20 ms
Send packet CA-gtNetherlands-gtCA 150000000 ns = 150 ms
bull Shamelessly13 cribbed13 from13 this13 gist13 h9psgistgithubcom284337513 originally13 by13 Peter13 Norvig13 and13 amended13 by13 Jeff13 Dean
Measured13 Cache13 Latencies
Sandy Bridge-E L1d L2 L3 Main=======================================================================Sequential Access 3 clk 11 clk 14 clk 6nsFull Random Access 3 clk 11 clk 38 clk 658ns
SI13 Sotwares13 benchmarks13 h9pwwwsisotwarenetd=qaampf=ben_mem_latency
Registers
bull On-shy‐core13 for13 instruc3ons13 being13 executed13 and13 their13 operands
bull Can13 be13 accessed13 in13 a13 single13 cyclebull There13 are13 many13 different13 typesbull A13 64-shy‐bit13 Intel13 Nehalem13 CPU13 had13 12813 Integer13 amp13 12813 floa3ng13 point13 registers
Store13 Buffers
bull Hold13 data13 for13 Out13 of13 Order13 (OoO)13 execu3onbull Fully13 associa3vebull Prevent13 ldquostallsrdquo13 in13 execu3on13 on13 a13 thread13 when13 the13 cache13 line13 is13 not13 local13 to13 a13 core13 on13 a13 write
bull ~113 cycle
Level13 Zero13 (L0)
bull Added13 in13 Sandy13 Bridgebull A13 cache13 of13 the13 last13 153613 uops13 decodedbull Well-shy‐suited13 for13 hot13 loopsbull Not13 the13 same13 as13 the13 older13 trace13 cache
Level13 One13 (L1)
bull Divided13 into13 data13 and13 instruc3onsbull 32K13 data13 (L1d)13 32K13 instruc3ons13 (L1i)13 per13 core13 on13 a13 Sandy13 Bridge13 Ivy13 Bridge13 and13 Haswell
bull Sandy13 Bridge13 loads13 data13 at13 25613 bits13 per13 cycle13 double13 that13 of13 Nehalem
bull 3-shy‐413 cycles13 to13 access13 L1d
Level13 Two13 (L2)
bull 256K13 per13 core13 on13 a13 Sandy13 Bridge13 Ivy13 Bridge13 amp13 Haswell
bull 2MB13 per13 ldquomodulerdquo13 on13 AMDs13 Bulldozer13 architecture
bull ~1113 cycles13 to13 accessbull Unified13 data13 and13 instruc3on13 caches13 from13 here13 up
bull If13 the13 working13 set13 size13 is13 larger13 than13 L213 misses13 grow
Level13 Three13 (L3)
bull Was13 a13 ldquounifiedrdquo13 cache13 up13 un3l13 Sandy13 Bridge13 shared13 between13 cores
bull Varies13 in13 size13 with13 different13 processors13 and13 versions13 of13 an13 architecture13 13 Laptops13 might13 have13 6-shy‐8MB13 but13 server-shy‐class13 might13 have13 30MB
bull 14-shy‐3813 cycles13 to13 access
Level13 Four13 (L4)
bull Some13 versions13 of13 Haswell13 will13 have13 a13 12813 MB13 L413 cache
bull No13 latency13 benchmarks13 for13 this13 yet
Programming13 Tips
Striding13 amp13 Pre-shy‐fetching
bull Predictable13 memory13 access13 is13 really13 importantbull Hardware13 pre-shy‐fetcher13 on13 the13 core13 looks13 for13 pa9erns13 of13 memory13 access
bull Can13 be13 counter-shy‐produc3ve13 if13 the13 access13 pa9ern13 is13 not13 predictable
bull Mar3n13 Thompson13 blog13 post13 ldquoMemory13 Access13 Pa9erns13 are13 Importantrdquo
bull Shows13 the13 importance13 of13 locality13 and13 striding
Cache13 Misses
bull Cost13 hundreds13 of13 cyclesbull Keep13 your13 code13 simplebull Instruc3on13 read13 misses13 are13 most13 expensivebull Data13 read13 miss13 are13 less13 so13 but13 s3ll13 hurt13 performancebull Write13 misses13 are13 okay13 unless13 using13 Write13 Throughbull Miss13 types
ndash Compulsoryndash Capacityndash Conflict

Programming Optimizations
• Stack allocated data is cheap
• Pointer interaction - you have to retrieve data being pointed to, even in registers
• Avoid locking and resultant kernel arbitration
• CAS is better and occurs on-thread, but algorithms become more complex
• Match workload to the size of the last level cache (LLC, L3/L4)
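
The CAS bullet can be sketched with `java.util.concurrent.atomic.AtomicLong`. The example below tracks the largest value seen across threads without taking a lock; the class `CasMax` and its methods are hypothetical names invented for this illustration. It also shows the extra complexity the slide warns about: the update has to be written as a retry loop.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical example class, not part of the talk.
final class CasMax {
    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

    // Lock-free "remember the largest value": retry the compare-and-set until
    // it succeeds or another thread has already published a larger value.
    void offer(long candidate) {
        long current = max.get();
        while (candidate > current && !max.compareAndSet(current, candidate)) {
            current = max.get(); // lost the race; re-read and try again
        }
    }

    long get() {
        return max.get();
    }

    public static void main(String[] args) throws InterruptedException {
        CasMax m = new CasMax();
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            final int base = t * 1000;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < 1000; i++) {
                    m.offer(base + i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        System.out.println(m.get()); // prints 3999, the largest value offered
    }
}
```

No thread ever blocks in the kernel here; a thread that loses a race simply re-reads and retries on its own core.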

What about Functional Programming?
• Have to allocate more and more space for your data structures, leads to eviction
• When you cycle back around, you get cache misses
• Choose immutability by default, profile to find poor performance
• Use mutable data in targeted locations
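
One way to read "mutable data in targeted locations" is to keep mutation local to a single method and hand callers an immutable result. The sketch below assumes that interpretation; `RunningTotals` is a hypothetical name, and the running-total computation is only a stand-in for a hot loop that a profiler has flagged.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical example class, not part of the talk.
final class RunningTotals {

    // Callers get an unmodifiable list; the mutation stays inside this method.
    static List<Long> of(int[] values) {
        long[] totals = new long[values.length]; // local, contiguous, mutable scratch space
        long acc = 0;
        for (int i = 0; i < values.length; i++) {
            acc += values[i];
            totals[i] = acc;
        }
        List<Long> result = new ArrayList<>(totals.length);
        for (long t : totals) {
            result.add(t);
        }
        return Collections.unmodifiableList(result);
    }

    public static void main(String[] args) {
        System.out.println(of(new int[] {1, 2, 3, 4})); // prints [1, 3, 6, 10]
    }
}
```

The hot loop works on one primitive array instead of allocating a new structure per step, while the rest of the program still sees immutable data.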

Hyperthreading
• Great for I/O-bound applications
• If you have lots of cache misses
• Doesn't do much for CPU-bound applications
• You have half of the cache resources per core
• NOTE - Haswell only has Hyperthreading on i7

Data Structures
• BAD: Linked list structures and tree structures
• BAD: Java's HashMap uses chained buckets
• BAD: Standard Java collections generate lots of garbage
• GOOD: Array-based and contiguous in memory is much faster
• GOOD: Write your own that are lock-free and contiguous
• GOOD: Fastutil library, but note that it's additive
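
As a rough illustration of "array-based and contiguous", here is a minimal growable list of primitive ints. It is a sketch written for this example, not Fastutil's implementation and not lock-free; the class name `IntArrayList` and the initial capacity of 16 are assumptions.

```java
import java.util.Arrays;

// Hypothetical minimal primitive list; single-threaded use only.
final class IntArrayList {
    private int[] items = new int[16]; // one contiguous block, no boxed Integer objects
    private int size;

    void add(int value) {
        if (size == items.length) {
            items = Arrays.copyOf(items, size * 2); // grow by doubling
        }
        items[size++] = value;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return items[index];
    }

    int size() {
        return size;
    }

    // Sequential scan over contiguous memory, friendly to the pre-fetcher.
    long sum() {
        long total = 0;
        for (int i = 0; i < size; i++) {
            total += items[i];
        }
        return total;
    }

    public static void main(String[] args) {
        IntArrayList list = new IntArrayList();
        for (int i = 0; i < 100; i++) {
            list.add(i);
        }
        System.out.println(list.size() + " elements, sum " + list.sum()); // prints 100 elements, sum 4950
    }
}
```

Compared with a `LinkedList<Integer>`, there is no node object and no boxed value per element, so a scan touches far fewer cache lines and produces no garbage.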

Application Memory Wall & GC
• Tremendous amounts of RAM at low cost
• GC will kill you with compaction
• Use pauseless GC
  – IBM's Metronome, very predictable
  – Azul's C4, very performant

Using GPUs
• Remember, locality matters
• Need to be able to export a task with data that does not need to update
• AMD has the new HSA platform, which communicates between GPUs and CPUs via shared L3

The Future

ManyCore
• David Ungar says > 24 cores, generally many 10s of cores
• Really gets interesting above 1000 cores
• Cache coherency won't be possible
• Non-deterministic

Memristor
• Non-volatile, static RAM, same write endurance as Flash
• 200-300 MB on chip
• Sub-nanosecond writes
• Able to perform processing? (Probably not)
• Multistate, not binary

Phase Change Memory (PRAM)
• Higher performance than today's DRAM
• Intel seems more fascinated by this, released its neuromorphic chip design last Fall
• Not able to perform processing
• Write degradation is supposedly much slower
• Was considered susceptible to unintentional change, maybe fixed?

Thanks

Credits
• What Every Programmer Should Know About Memory, Ulrich Drepper of RedHat, 2007
• Java Performance, Charlie Hunt
• Wikipedia/Wikimedia Commons
• AnandTech
• The Microarchitecture of AMD, Intel and VIA CPUs
• Everything You Need to Know about the Quick Path Interconnect, Gabriel Torres/Hardware Secrets
• Inside the Sandy Bridge Architecture, Gabriel Torres/Hardware Secrets
• Martin Thompson's Mechanical Sympathy blog and Disruptor presentations
• The Application Memory Wall, Gil Tene, CTO of Azul Systems
• AMD Bulldozer/Intel Sandy Bridge Comparison, Gionatan Danti
• SI Software's Memory Latency Benchmarks
• Martin Thompson and Cliff Click provided feedback & additional content

Inter-Socket Communication
• GT/s – gigatransfers per second
• Quick Path Interconnect (QPI, Intel) – 8GT/s
• HyperTransport (HTX, AMD) – 6.4GT/s (?)
• Both transfer 16 bits per transmission in practice, but Sandy Bridge is really 32

MESI+F Cache Coherency Protocol
• Specific to data cache lines
• Request for Ownership (RFO), when a processor tries to write to a cache line
• Modified, the local processor has changed the cache line, implies only one who has it
• Exclusive, one processor is using the cache line, not modified
• Shared, multiple processors are using the cache line, not modified
• Invalid, the cache line is invalid, must be re-fetched
• Forward, designate to respond to requests for a cache line
• All processors MUST acknowledge a message for it to be valid
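
The states above can be modelled as a small state machine. What follows is a simplified textbook-style sketch of how one core's copy of a line changes state, not Intel's actual implementation: it ignores write-backs, acknowledgements, and which core takes over the Forward role. The method names are assumptions made up for this example.

```java
// Simplified model of the MESI+F states for ONE core's copy of a cache line.
// Hypothetical method names; real hardware has more transitions and messages.
enum LineState {
    MODIFIED, EXCLUSIVE, SHARED, INVALID, FORWARD;

    // A write needs a Request for Ownership unless this core is already the
    // only holder of the line (Modified or Exclusive).
    boolean needsRequestForOwnership() {
        return this != MODIFIED && this != EXCLUSIVE;
    }

    // After the local core writes (and any RFO has completed), the line is Modified.
    LineState afterLocalWrite() {
        return MODIFIED;
    }

    // Another core reads the line: a sole or forwarding copy is demoted to Shared.
    LineState afterRemoteRead() {
        switch (this) {
            case MODIFIED:
            case EXCLUSIVE:
            case FORWARD:
                return SHARED;
            default:
                return this; // Shared stays Shared, Invalid stays Invalid
        }
    }

    // Another core takes ownership to write: this copy must be re-fetched.
    LineState afterRemoteWrite() {
        return INVALID;
    }

    public static void main(String[] args) {
        LineState s = EXCLUSIVE;
        s = s.afterLocalWrite();   // MODIFIED, no RFO needed
        s = s.afterRemoteRead();   // SHARED
        s = s.afterRemoteWrite();  // INVALID
        System.out.println(s);     // prints INVALID
    }
}
```

The cost the slide hints at is visible in `needsRequestForOwnership`: a write to a line that other cores may hold has to wait for a round of messages before it can proceed.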

Static RAM (SRAM)
• Requires 6-8 pieces of circuitry per datum
• Cycle rate access, not quite measurable in time
• Uses a relatively large amount of power for what it does
• Data does not fade or leak, does not need to be refreshed/recharged

Dynamic RAM (DRAM)
• Requires 2 pieces of circuitry per datum
• “Leaks” charge, but not sooner than 64ms
• Reads deplete the charge, requiring subsequent recharge
• Takes 240 cycles (~100ns) to access
• Intel's Nehalem architecture - each CPU socket controls a portion of RAM, no other socket has direct access to it
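
The "240 cycles (~100ns)" figure is a cycles-to-time conversion: time in nanoseconds equals cycles divided by clock frequency in GHz. The slide does not state a clock speed; 2.4 GHz is the value implied by its own numbers and is used here as an assumption. `CycleMath` is a hypothetical helper written for this example.

```java
// Hypothetical helper for converting between cycle counts and wall-clock time.
final class CycleMath {

    // One GHz is one cycle per nanosecond, so ns = cycles / GHz.
    static double cyclesToNanos(double cycles, double clockGhz) {
        return cycles / clockGhz;
    }

    static double nanosToCycles(double nanos, double clockGhz) {
        return nanos * clockGhz;
    }

    public static void main(String[] args) {
        double clockGhz = 2.4; // assumed clock speed, implied by 240 cycles ~ 100ns
        System.out.printf("DRAM access: %.1f ns%n", cyclesToNanos(240, clockGhz));
        System.out.printf("L2 access:   %.1f ns%n", cyclesToNanos(11, clockGhz));
    }
}
```

The same conversion lets you compare the cycle counts quoted for each cache level with latencies quoted in nanoseconds, as long as you know the clock speed of the machine being measured.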

Architectures

Current Processors
• Intel
  – Nehalem (Tock) / Westmere (Tick, 32nm)
  – Sandy Bridge (Tock)
  – Ivy Bridge (Tick, 22nm)
  – Haswell (Tock)
• AMD
  – Bulldozer
• Oracle
  – UltraSPARC isn't dead

“Latency Numbers Everyone Should Know”
L1 cache reference                            0.5 ns
Branch mispredict                               5 ns
L2 cache reference                              7 ns
Mutex lock/unlock                              25 ns
Main memory reference                         100 ns
Compress 1K bytes with Zippy                3,000 ns  =   3 µs
Send 2K bytes over 1 Gbps network          20,000 ns  =  20 µs
SSD random read                           150,000 ns  = 150 µs
Read 1 MB sequentially from memory        250,000 ns  = 250 µs
Round trip within same datacenter         500,000 ns  = 0.5 ms
Read 1 MB sequentially from SSD         1,000,000 ns  =   1 ms
Disk seek                              10,000,000 ns  =  10 ms
Read 1 MB sequentially from disk       20,000,000 ns  =  20 ms
Send packet CA->Netherlands->CA       150,000,000 ns  = 150 ms
• Shamelessly cribbed from this gist: https://gist.github.com/2843375, originally by Peter Norvig and amended by Jeff Dean

Measured Cache Latencies
Sandy Bridge-E        L1d     L2      L3      Main
=====================================================
Sequential Access     3 clk   11 clk  14 clk  6ns
Full Random Access    3 clk   11 clk  38 clk  65.8ns
SI Software's benchmarks: http://www.sisoftware.net/?d=qa&f=ben_mem_latency

Registers
• On-core for instructions being executed and their operands
• Can be accessed in a single cycle
• There are many different types
• A 64-bit Intel Nehalem CPU had 128 Integer & 128 floating point registers

Store Buffers
• Hold data for Out of Order (OoO) execution
• Fully associative
• Prevent “stalls” in execution on a thread when the cache line is not local to a core on a write
• ~1 cycle

Level Zero (L0)
• Added in Sandy Bridge
• A cache of the last 1536 uops decoded
• Well-suited for hot loops
• Not the same as the older trace cache

Level One (L1)
• Divided into data and instructions
• 32K data (L1d), 32K instructions (L1i) per core on a Sandy Bridge, Ivy Bridge and Haswell
• Sandy Bridge loads data at 256 bits per cycle, double that of Nehalem
• 3-4 cycles to access L1d
Dynamic13 RAM13 (DRAM)
bull Requires13 213 pieces13 of13 circuitry13 per13 datumbull ldquoLeaksrdquo13 charge13 but13 not13 sooner13 than13 64msbull Reads13 deplete13 the13 charge13 requiring13 subsequent13 recharge
bull Takes13 24013 cycles13 (~100ns)13 to13 accessbull Intels13 Nehalem13 architecture13 -shy‐13 each13 CPU13 socket13 controls13 a13 por3on13 of13 RAM13 no13 other13 socket13 has13 direct13 access13 to13 it
Architectures
Current13 Processors
bull Intelndash Nehalem13 (Tock)13 Westmere13 (Tick13 32nm)ndash Sandy13 Bridge13 (Tock)ndash Ivy13 Bridge13 (Tick13 22nm)ndash Haswell13 (Tock)
bull AMDndash Bulldozer
bull Oraclendash UltraSPARC13 isnt13 dead
ldquoLatency13 Numbers13 Everyone13 Should13 KnowrdquoL1 cache reference 05 ns
Branch mispredict 5 ns
L2 cache reference 7 ns
Mutex lockunlock 25 ns
Main memory reference 100 ns
Compress 1K bytes with Zippy 3000 ns = 3 micros
Send 2K bytes over 1 Gbps network 20000 ns = 20 micros
SSD random read 150000 ns = 150 micros
Read 1 MB sequentially from memory 250000 ns = 250 micros
Round trip within same datacenter 500000 ns = 05 ms
Read 1 MB sequentially from SSD 1000000 ns = 1 ms
Disk seek 10000000 ns = 10 ms
Read 1 MB sequentially from disk 20000000 ns = 20 ms
Send packet CA-gtNetherlands-gtCA 150000000 ns = 150 ms
bull Shamelessly13 cribbed13 from13 this13 gist13 h9psgistgithubcom284337513 originally13 by13 Peter13 Norvig13 and13 amended13 by13 Jeff13 Dean
Measured Cache Latencies
Sandy Bridge-E        L1d     L2      L3      Main
====================================================
Sequential Access     3 clk   11 clk  14 clk  6 ns
Full Random Access    3 clk   11 clk  38 clk  65.8 ns
SI Software's benchmarks: http://www.sisoftware.net/?d=qa&f=ben_mem_latency
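The cache figures above are in clock cycles while main memory is in nanoseconds, so a small conversion helps when comparing them. The sketch below assumes a 2.4 GHz clock, which is only what the DRAM slide's "240 cycles (~100ns)" implies; the class and method names are hypothetical, and the real clock rate of the benchmarked part may differ.

```java
final class CycleMath {
    // Converts a latency in clock cycles to nanoseconds for a clock rate given in GHz.
    // One cycle at f GHz lasts 1/f nanoseconds.
    static double cyclesToNanos(long cycles, double clockGhz) {
        return cycles / clockGhz;
    }

    public static void main(String[] args) {
        double assumedGhz = 2.4; // assumption: derived from "240 cycles (~100ns)"
        long[] latenciesInCycles = {3, 11, 14, 38, 240};
        for (long cycles : latenciesInCycles) {
            System.out.printf("%d cycles = %.2f ns%n", cycles, cyclesToNanos(cycles, assumedGhz));
        }
    }
}
```

Under that assumed clock, a random L3 access (38 cycles) comes out at roughly 16 ns, still several times cheaper than the measured random access to main memory.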
Registers
• On-core for instructions being executed and their operands
• Can be accessed in a single cycle
• There are many different types
• A 64-bit Intel Nehalem CPU had 128 Integer & 128 floating point registers
Store Buffers
• Hold data for Out of Order (OoO) execution
• Fully associative
• Prevent "stalls" in execution on a thread when the cache line is not local to a core on a write
• ~1 cycle
Level Zero (L0)
• Added in Sandy Bridge
• A cache of the last 1536 uops decoded
• Well-suited for hot loops
• Not the same as the older trace cache
Level One (L1)
• Divided into data and instructions
• 32K data (L1d), 32K instructions (L1i) per core on a Sandy Bridge, Ivy Bridge and Haswell
• Sandy Bridge loads data at 256 bits per cycle, double that of Nehalem
• 3-4 cycles to access L1d
Level Two (L2)
• 256K per core on a Sandy Bridge, Ivy Bridge & Haswell
• 2MB per "module" on AMD's Bulldozer architecture
• ~11 cycles to access
• Unified data and instruction caches from here up
• If the working set size is larger than L2, misses grow
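One common way to keep a working set inside a cache level is to process data in blocks. The sketch below is an illustration, not something from the talk: a tiled matrix transpose where each 64 x 64 tile of doubles is 32 KB, so the source and destination tiles together (64 KB) stay well under the 256K L2 quoted above. The class name and tile size are hypothetical and would need tuning on real hardware.

```java
final class TiledTranspose {
    // Hypothetical tile edge: 64 * 64 doubles = 32 KB per tile.
    static final int TILE = 64;

    // Transposes an n x n row-major matrix from src into dst, one tile at a time,
    // so the lines touched in both arrays are reused before they can be evicted.
    static void transpose(double[] src, double[] dst, int n) {
        for (int rowBlock = 0; rowBlock < n; rowBlock += TILE) {
            for (int colBlock = 0; colBlock < n; colBlock += TILE) {
                int rowEnd = Math.min(rowBlock + TILE, n);
                int colEnd = Math.min(colBlock + TILE, n);
                for (int i = rowBlock; i < rowEnd; i++) {
                    for (int j = colBlock; j < colEnd; j++) {
                        dst[j * n + i] = src[i * n + j];
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        int n = 4;
        double[] src = new double[n * n];
        for (int i = 0; i < src.length; i++) src[i] = i;
        double[] dst = new double[n * n];
        transpose(src, dst, n);
        System.out.println(java.util.Arrays.toString(dst));
    }
}
```

A naive transpose walks one of the two arrays column by column, touching a new cache line on almost every access once the matrix is large; tiling bounds how much memory is live at any moment.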
Level Three (L3)
• Was a "unified" cache up until Sandy Bridge, shared between cores
• Varies in size with different processors and versions of an architecture. Laptops might have 6-8MB, but server-class might have 30MB
• 14-38 cycles to access
Level Four (L4)
• Some versions of Haswell will have a 128 MB L4 cache
• No latency benchmarks for this yet
Programming Tips
Striding & Pre-fetching
• Predictable memory access is really important
• Hardware pre-fetcher on the core looks for patterns of memory access
• Can be counter-productive if the access pattern is not predictable
• Martin Thompson blog post: "Memory Access Patterns are Important"
• Shows the importance of locality and striding
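To make the striding point concrete, here is a minimal sketch (not the code from the blog post) that sums the same array twice: once in index order and once with a large jump between reads. The class name, array size and stride are arbitrary assumptions, and the single-shot timing is crude; a harness such as JMH would be needed for trustworthy numbers.

```java
final class StrideWalk {
    // Visits every element once, in index order, so consecutive reads share cache lines.
    static long sumSequential(long[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Visits every element exactly once, but jumps `stride` elements between reads,
    // so each read lands on a different cache line when the stride is large.
    static long sumStrided(long[] data, int stride) {
        long sum = 0;
        for (int start = 0; start < stride; start++) {
            for (int i = start; i < data.length; i += stride) {
                sum += data[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] data = new long[1 << 22]; // 32 MB of longs; the size is an arbitrary choice
        for (int i = 0; i < data.length; i++) data[i] = i;

        long t0 = System.nanoTime();
        long sequential = sumSequential(data);
        long t1 = System.nanoTime();
        long strided = sumStrided(data, 4096); // hypothetical stride: 32 KB between reads
        long t2 = System.nanoTime();

        System.out.printf("sequential: %d ms, strided: %d ms, equal sums: %b%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sequential == strided);
    }
}
```

Both methods do the same arithmetic on the same data; any difference in elapsed time comes from the order in which memory is touched.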
Cache Misses
• Cost hundreds of cycles
• Keep your code simple
• Instruction read misses are most expensive
• Data read misses are less so, but still hurt performance
• Write misses are okay unless using Write Through
• Miss types:
  – Compulsory
  – Capacity
  – Conflict
Programming Optimizations
• Stack allocated data is cheap
• Pointer interaction - you have to retrieve data being pointed to, even in registers
• Avoid locking and resultant kernel arbitration
• CAS is better and occurs on-thread, but algorithms become more complex
• Match workload to the size of the last level cache (LLC, L3/L4)
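As an illustration of the CAS bullet, the sketch below tracks a running maximum across threads with `AtomicLong.compareAndSet` instead of a lock. The `CasMax` class is hypothetical; it also shows the added complexity the slide mentions, since the update has to be written as a retry loop.

```java
import java.util.concurrent.atomic.AtomicLong;

final class CasMax {
    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

    // Lock-free update: only write when the candidate is larger, and retry
    // if another thread changed the value between our read and our CAS.
    void offer(long candidate) {
        long current = max.get();
        while (candidate > current && !max.compareAndSet(current, candidate)) {
            current = max.get(); // lost the race; re-read and try again
        }
    }

    long get() {
        return max.get();
    }

    public static void main(String[] args) throws InterruptedException {
        CasMax highest = new CasMax();
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            final int base = t * 1000;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < 1000; i++) {
                    highest.offer(base + i);
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        System.out.println(highest.get()); // prints 3999
    }
}
```

No thread ever blocks here, so there is no kernel arbitration; under heavy contention the retry loop can still spin, which is the trade-off against a lock.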
What about Functional Programming?
• Have to allocate more and more space for your data structures, leads to eviction
• When you cycle back around, you get cache misses
• Choose immutability by default, profile to find poor performance
• Use mutable data in targeted locations
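One way to read the last two bullets is to keep the public surface immutable while confining mutation to a local buffer inside a hot method. The sketch below is a hypothetical example of that shape (`ImmutableSeries` is not from the talk): callers never see a mutable array, yet the running total is built in a single contiguous `long[]` rather than through a chain of intermediate objects.

```java
final class ImmutableSeries {
    private final long[] values; // never exposed and never modified after construction

    private ImmutableSeries(long[] values) {
        this.values = values;
    }

    // Defensive copy so later changes to the caller's array cannot leak in.
    static ImmutableSeries of(long... values) {
        return new ImmutableSeries(values.clone());
    }

    int size() {
        return values.length;
    }

    long get(int index) {
        return values[index];
    }

    // Returns a new series of running totals. The only mutation is on the
    // local `totals` array, which is handed over once and never touched again.
    ImmutableSeries runningTotals() {
        long[] totals = new long[values.length];
        long accumulated = 0;
        for (int i = 0; i < values.length; i++) {
            accumulated += values[i];
            totals[i] = accumulated;
        }
        return new ImmutableSeries(totals);
    }

    public static void main(String[] args) {
        ImmutableSeries totals = ImmutableSeries.of(1, 2, 3, 4).runningTotals();
        for (int i = 0; i < totals.size(); i++) {
            System.out.println(totals.get(i));
        }
    }
}
```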
Hyperthreading
• Great for I/O-bound applications
• If you have lots of cache misses
• Doesn't do much for CPU-bound applications
• You have half of the cache resources per core
• NOTE - Haswell only has Hyperthreading on i7
Data Structures
• BAD: Linked list structures and tree structures
• BAD: Java's HashMap uses chained buckets
• BAD: Standard Java collections generate lots of garbage
• GOOD: Array-based and contiguous in memory is much faster
• GOOD: Write your own that are lock-free and contiguous
• GOOD: Fastutil library, but note that it's additive
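A minimal sketch of the "array-based and contiguous" option: a growable list of primitive longs backed by a single `long[]`. `PackedLongList` is a hypothetical class written for illustration (it is not fastutil's API and it is not thread-safe); the point is that elements sit next to each other in memory with no per-element node or boxed `Long`.

```java
import java.util.Arrays;

final class PackedLongList {
    private long[] items = new long[16];
    private int size;

    // Appends a value, doubling the backing array when it is full.
    void add(long value) {
        if (size == items.length) {
            items = Arrays.copyOf(items, size * 2);
        }
        items[size++] = value;
    }

    long get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return items[index];
    }

    int size() {
        return size;
    }

    // A linear scan over contiguous memory: the access pattern is fully predictable.
    long sum() {
        long total = 0;
        for (int i = 0; i < size; i++) {
            total += items[i];
        }
        return total;
    }

    public static void main(String[] args) {
        PackedLongList list = new PackedLongList();
        for (long v = 0; v < 100; v++) {
            list.add(v);
        }
        System.out.println(list.sum()); // prints 4950
    }
}
```

Compared with a `LinkedList<Long>`, traversal here follows no pointers and allocates nothing per element, which also reduces the garbage the slide warns about.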
Application Memory Wall & GC
• Tremendous amounts of RAM at low cost
• GC will kill you with compaction
• Use pauseless GC
  – IBM's Metronome, very predictable
  – Azul's C4, very performant
Using GPUs
• Remember, locality matters
• Need to be able to export a task with data that does not need to update
• AMD has the new HSA platform which communicates between GPUs and CPUs via shared L3
The Future
ManyCore
• David Ungar says > 24 cores, generally many 10s of cores
• Really gets interesting above 1000 cores
• Cache coherency won't be possible
• Non-deterministic
Memristor
• Non-volatile, static RAM, same write endurance as Flash
• 200-300 MB on chip
• Sub-nanosecond writes
• Able to perform processing? (Probably not)
• Multistate, not binary
Phase Change Memory (PRAM)
• Higher performance than today's DRAM
• Intel seems more fascinated by this, released its neuromorphic chip design last Fall
• Not able to perform processing
• Write degradation is supposedly much slower
• Was considered susceptible to unintentional change, maybe fixed
Thanks
Credits
• What Every Programmer Should Know About Memory, Ulrich Drepper of RedHat, 2007
• Java Performance, Charlie Hunt
• Wikipedia/Wikimedia Commons
• AnandTech
• The Microarchitecture of AMD, Intel and VIA CPUs
• Everything You Need to Know about the Quick Path Interconnect, Gabriel Torres/Hardware Secrets
• Inside the Sandy Bridge Architecture, Gabriel Torres/Hardware Secrets
• Martin Thompson's Mechanical Sympathy blog and Disruptor presentations
• The Application Memory Wall, Gil Tene, CTO of Azul Systems
• AMD Bulldozer/Intel Sandy Bridge Comparison, Gionatan Danti
• SI Software's Memory Latency Benchmarks
• Martin Thompson and Cliff Click provided feedback & additional content
Measured13 Cache13 Latencies
Sandy Bridge-E L1d L2 L3 Main=======================================================================Sequential Access 3 clk 11 clk 14 clk 6nsFull Random Access 3 clk 11 clk 38 clk 658ns
SI13 Sotwares13 benchmarks13 h9pwwwsisotwarenetd=qaampf=ben_mem_latency
Registers
bull On-shy‐core13 for13 instruc3ons13 being13 executed13 and13 their13 operands
bull Can13 be13 accessed13 in13 a13 single13 cyclebull There13 are13 many13 different13 typesbull A13 64-shy‐bit13 Intel13 Nehalem13 CPU13 had13 12813 Integer13 amp13 12813 floa3ng13 point13 registers
Store13 Buffers
bull Hold13 data13 for13 Out13 of13 Order13 (OoO)13 execu3onbull Fully13 associa3vebull Prevent13 ldquostallsrdquo13 in13 execu3on13 on13 a13 thread13 when13 the13 cache13 line13 is13 not13 local13 to13 a13 core13 on13 a13 write
bull ~113 cycle
Level13 Zero13 (L0)
bull Added13 in13 Sandy13 Bridgebull A13 cache13 of13 the13 last13 153613 uops13 decodedbull Well-shy‐suited13 for13 hot13 loopsbull Not13 the13 same13 as13 the13 older13 trace13 cache
Level13 One13 (L1)
bull Divided13 into13 data13 and13 instruc3onsbull 32K13 data13 (L1d)13 32K13 instruc3ons13 (L1i)13 per13 core13 on13 a13 Sandy13 Bridge13 Ivy13 Bridge13 and13 Haswell
bull Sandy13 Bridge13 loads13 data13 at13 25613 bits13 per13 cycle13 double13 that13 of13 Nehalem
bull 3-shy‐413 cycles13 to13 access13 L1d
Level13 Two13 (L2)
bull 256K13 per13 core13 on13 a13 Sandy13 Bridge13 Ivy13 Bridge13 amp13 Haswell
bull 2MB13 per13 ldquomodulerdquo13 on13 AMDs13 Bulldozer13 architecture
bull ~1113 cycles13 to13 accessbull Unified13 data13 and13 instruc3on13 caches13 from13 here13 up
bull If13 the13 working13 set13 size13 is13 larger13 than13 L213 misses13 grow
Level13 Three13 (L3)
bull Was13 a13 ldquounifiedrdquo13 cache13 up13 un3l13 Sandy13 Bridge13 shared13 between13 cores
bull Varies13 in13 size13 with13 different13 processors13 and13 versions13 of13 an13 architecture13 13 Laptops13 might13 have13 6-shy‐8MB13 but13 server-shy‐class13 might13 have13 30MB
bull 14-shy‐3813 cycles13 to13 access
Level13 Four13 (L4)
bull Some13 versions13 of13 Haswell13 will13 have13 a13 12813 MB13 L413 cache
bull No13 latency13 benchmarks13 for13 this13 yet
Programming Tips
Striding & Pre-fetching
• Predictable memory access is really important
• Hardware pre-fetcher on the core looks for patterns of memory access
• Can be counter-productive if the access pattern is not predictable
• Martin Thompson blog post: “Memory Access Patterns are Important”
• Shows the importance of locality and striding
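The effect of striding can be sketched with a small Java comparison. This is not the code from the blog post mentioned above; the class name `StrideDemo`, the 32 MB array and the stride of 1024 elements are illustrative assumptions. Timings depend on the machine and on JIT warm-up, so the program prints them instead of asserting any particular ratio.

```java
final class StrideDemo {
    // Walks the array front to back: consecutive elements share cache lines,
    // so the hardware pre-fetcher can stay ahead of the loop.
    static long sumSequential(long[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Visits every element exactly once, but jumps `stride` elements at a time.
    // With 8-byte longs and a stride of 8 or more (assuming 64-byte lines),
    // each access lands on a different cache line from the previous one.
    static long sumStrided(long[] data, int stride) {
        long sum = 0;
        for (int start = 0; start < stride; start++) {
            for (int i = start; i < data.length; i += stride) {
                sum += data[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] data = new long[1 << 22]; // 32 MB of longs, assumed larger than the LLC
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        for (int round = 0; round < 5; round++) { // repeat so the JIT can warm up
            long t0 = System.nanoTime();
            long a = sumSequential(data);
            long t1 = System.nanoTime();
            long b = sumStrided(data, 1024);
            long t2 = System.nanoTime();
            System.out.printf("sequential %d ms, strided %d ms, equal=%b%n",
                    (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a == b);
        }
    }
}
```

Both methods do the same arithmetic on the same data, so any difference in the printed times comes from the order in which memory is touched.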
Cache Misses
• Cost hundreds of cycles
• Keep your code simple
• Instruction read misses are most expensive
• Data read misses are less so, but still hurt performance
• Write misses are okay unless using Write Through
• Miss types:
  – Compulsory
  – Capacity
  – Conflict
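A back-of-the-envelope way to see why misses dominate is the standard average memory access time (AMAT) relation. The numbers below are illustrative assumptions, not measurements: a 4-cycle L1d hit, a 200-cycle penalty for going all the way to main memory, and a 5% miss rate.

```latex
\mathrm{AMAT} = t_{\mathrm{hit}} + r_{\mathrm{miss}} \times t_{\mathrm{penalty}}
              = 4 + 0.05 \times 200
              = 14 \ \text{cycles}
```

Under those assumptions, one miss in twenty more than triples the average cost of a memory access.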
Programming Optimizations
• Stack allocated data is cheap
• Pointer interaction - you have to retrieve data being pointed to, even in registers
• Avoid locking and resultant kernel arbitration
• CAS is better and occurs on-thread, but algorithms become more complex
• Match workload to the size of the last level cache (LLC, L3/L4)
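The CAS point can be illustrated with a minimal lock-free counter built on `java.util.concurrent.atomic.AtomicLong`. The class `CasCounter` and its `runWorkers` helper are hypothetical names used only for this sketch. In real code `AtomicLong.incrementAndGet()` does the same job; the retry loop is spelled out here to show where the extra algorithmic complexity comes from.

```java
import java.util.concurrent.atomic.AtomicLong;

final class CasCounter {
    private final AtomicLong value = new AtomicLong();

    // Lock-free increment: read the current value, compute the next one, and
    // retry if another thread changed the value in between. No thread ever
    // blocks or asks the kernel to arbitrate.
    long increment() {
        while (true) {
            long current = value.get();
            long next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    long get() {
        return value.get();
    }

    // Starts `threads` workers that each increment `perThread` times,
    // waits for all of them, and returns the final count.
    static long runWorkers(int threads, int perThread) {
        CasCounter counter = new CasCounter();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.increment();
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(runWorkers(4, 100_000)); // prints 400000
    }
}
```

Under heavy contention the retry loop can spin, and the cache line holding the counter still bounces between cores, so CAS removes the lock but not the coherency traffic.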
What about Functional Programming?
• Have to allocate more and more space for your data structures, leads to eviction
• When you cycle back around, you get cache misses
• Choose immutability by default, profile to find poor performance
• Use mutable data in targeted locations
Hyperthreading
• Great for I/O-bound applications
• If you have lots of cache misses
• Doesn't do much for CPU-bound applications
• You have half of the cache resources per core
• NOTE - Haswell only has Hyperthreading on i7
Data Structures
• BAD: Linked list structures and tree structures
• BAD: Java's HashMap uses chained buckets
• BAD: Standard Java collections generate lots of garbage
• GOOD: Array-based and contiguous in memory is much faster
• GOOD: Write your own that are lock-free and contiguous
• GOOD: Fastutil library, but note that it's additive
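As a rough sketch of what “array-based and contiguous” can look like in place of chained buckets, here is a minimal open-addressing map from `int` to `int` that keeps keys and values in two primitive arrays. The class name `IntIntMap`, the sentinel key and the hash mix are assumptions made for this example; it is not Fastutil's API, it is not thread-safe or lock-free, and it omits removal.

```java
import java.util.Arrays;

final class IntIntMap {
    private static final int EMPTY = Integer.MIN_VALUE; // sentinel: this key cannot be stored
    private int[] keys;
    private int[] values;
    private int size;

    IntIntMap(int expected) {
        int capacity = 16;
        while (capacity < expected * 2) {
            capacity <<= 1; // power of two, so masking replaces modulo
        }
        keys = new int[capacity];
        values = new int[capacity];
        Arrays.fill(keys, EMPTY);
    }

    private int slot(int key) {
        int h = key * 0x9E3779B9; // cheap integer mix
        return (h ^ (h >>> 16)) & (keys.length - 1);
    }

    void put(int key, int value) {
        if (key == EMPTY) {
            throw new IllegalArgumentException("reserved key");
        }
        if ((size + 1) * 2 > keys.length) {
            resize(); // keep the table at most half full
        }
        int i = slot(key);
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) & (keys.length - 1); // linear probe: the next slot is adjacent in memory
        }
        if (keys[i] == EMPTY) {
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    int get(int key, int missing) {
        int i = slot(key);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) {
                return values[i];
            }
            i = (i + 1) & (keys.length - 1);
        }
        return missing;
    }

    int size() {
        return size;
    }

    private void resize() {
        int[] oldKeys = keys;
        int[] oldValues = values;
        keys = new int[oldKeys.length * 2];
        values = new int[oldValues.length * 2];
        Arrays.fill(keys, EMPTY);
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != EMPTY) {
                put(oldKeys[i], oldValues[i]);
            }
        }
    }

    public static void main(String[] args) {
        IntIntMap m = new IntIntMap(4);
        for (int k = 0; k < 1000; k++) {
            m.put(k, k * k);
        }
        System.out.println(m.get(31, -1) + " " + m.get(5000, -1) + " " + m.size());
        // prints "961 -1 1000"
    }
}
```

A lookup touches one or two adjacent array slots and allocates nothing, whereas a chained-bucket map follows a pointer to a separately allocated node per entry and boxes primitive keys and values.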
Application Memory Wall & GC
• Tremendous amounts of RAM at low cost
• GC will kill you with compaction
• Use pauseless GC
  – IBM's Metronome, very predictable
  – Azul's C4, very performant
Using GPUs
• Remember, locality matters
• Need to be able to export a task with data that does not need to update
• AMD has the new HSA platform, which communicates between GPUs and CPUs via shared L3
The Future
ManyCore
• David Ungar says > 24 cores, generally many 10s of cores
• Really gets interesting above 1000 cores
• Cache coherency won't be possible
• Non-deterministic
Memristor
• Non-volatile, static RAM, same write endurance as Flash
• 200-300 MB on chip
• Sub-nanosecond writes
• Able to perform processing? (Probably not)
• Multistate, not binary
Phase Change Memory (PRAM)
• Higher performance than today's DRAM
• Intel seems more fascinated by this, released its “neuromorphic” chip design last Fall
• Not able to perform processing
• Write degradation is supposedly much slower
• Was considered susceptible to unintentional change, maybe fixed?
Thanks
Credits
• What Every Programmer Should Know About Memory, Ulrich Drepper of RedHat, 2007
• Java Performance, Charlie Hunt
• Wikipedia/Wikimedia Commons
• AnandTech
• The Microarchitecture of AMD, Intel and VIA CPUs
• Everything You Need to Know about the Quick Path Interconnect, Gabriel Torres/Hardware Secrets
• Inside the Sandy Bridge Architecture, Gabriel Torres/Hardware Secrets
• Martin Thompson's Mechanical Sympathy blog and Disruptor presentations
• The Application Memory Wall, Gil Tene, CTO of Azul Systems
• AMD Bulldozer/Intel Sandy Bridge Comparison, Gionatan Danti
• SI Software's Memory Latency Benchmarks
• Martin Thompson and Cliff Click provided feedback & additional content
Level13 Two13 (L2)
bull 256K13 per13 core13 on13 a13 Sandy13 Bridge13 Ivy13 Bridge13 amp13 Haswell
bull 2MB13 per13 ldquomodulerdquo13 on13 AMDs13 Bulldozer13 architecture
bull ~1113 cycles13 to13 accessbull Unified13 data13 and13 instruc3on13 caches13 from13 here13 up
bull If13 the13 working13 set13 size13 is13 larger13 than13 L213 misses13 grow
Level13 Three13 (L3)
bull Was13 a13 ldquounifiedrdquo13 cache13 up13 un3l13 Sandy13 Bridge13 shared13 between13 cores
bull Varies13 in13 size13 with13 different13 processors13 and13 versions13 of13 an13 architecture13 13 Laptops13 might13 have13 6-shy‐8MB13 but13 server-shy‐class13 might13 have13 30MB
bull 14-shy‐3813 cycles13 to13 access
Level13 Four13 (L4)
bull Some13 versions13 of13 Haswell13 will13 have13 a13 12813 MB13 L413 cache
bull No13 latency13 benchmarks13 for13 this13 yet
Programming13 Tips
Striding13 amp13 Pre-shy‐fetching
bull Predictable13 memory13 access13 is13 really13 importantbull Hardware13 pre-shy‐fetcher13 on13 the13 core13 looks13 for13 pa9erns13 of13 memory13 access
bull Can13 be13 counter-shy‐produc3ve13 if13 the13 access13 pa9ern13 is13 not13 predictable
bull Mar3n13 Thompson13 blog13 post13 ldquoMemory13 Access13 Pa9erns13 are13 Importantrdquo
bull Shows13 the13 importance13 of13 locality13 and13 striding
Cache13 Misses
bull Cost13 hundreds13 of13 cyclesbull Keep13 your13 code13 simplebull Instruc3on13 read13 misses13 are13 most13 expensivebull Data13 read13 miss13 are13 less13 so13 but13 s3ll13 hurt13 performancebull Write13 misses13 are13 okay13 unless13 using13 Write13 Throughbull Miss13 types
ndash Compulsoryndash Capacityndash Conflict
Programming13 Op7miza7ons
bull Stack13 allocated13 data13 is13 cheapbull Pointer13 interac3on13 -shy‐13 you13 have13 to13 retrieve13 data13 being13 pointed13 to13 even13 in13 registers
bull Avoid13 locking13 and13 resultant13 kernel13 arbitra3onbull CAS13 is13 be9er13 and13 occurs13 on-shy‐thread13 but13 algorithms13 become13 more13 complex
bull Match13 workload13 to13 the13 size13 of13 the13 last13 level13 cache13 (LLC13 L3L4)
What13 about13 Func7onal13 Programming
bull Have13 to13 allocate13 more13 and13 more13 space13 for13 your13 data13 structures13 leads13 to13 evic3on
bull When13 you13 cycle13 back13 around13 you13 get13 cache13 misses
bull Choose13 immutability13 by13 default13 profile13 to13 find13 poor13 performance
bull Use13 mutable13 data13 in13 targeted13 loca3ons
Hyperthreading
bull Great13 for13 IO-shy‐bound13 applica3onsbull If13 you13 have13 lots13 of13 cache13 missesbull Doesnt13 do13 much13 for13 CPU-shy‐bound13 applica3onsbull You13 have13 half13 of13 the13 cache13 resources13 per13 corebull NOTE13 -shy‐13 Haswell13 only13 has13 Hyperthreading13 on13 i7
Data13 Structures
bull BAD13 Linked13 list13 structures13 and13 tree13 structuresbull BAD13 Javas13 HashMap13 uses13 chained13 bucketsbull BAD13 Standard13 Java13 collec3ons13 generate13 lots13 of13 garbage
bull GOOD13 Array-shy‐based13 and13 con3guous13 in13 memory13 is13 much13 faster
bull GOOD13 Write13 your13 own13 that13 are13 lock-shy‐free13 and13 con3guous
bull GOOD13 Fastu3l13 library13 but13 note13 that13 its13 addi3ve
Applica7on13 Memory13 Wall13 amp13 GC
bull Tremendous13 amounts13 of13 RAM13 at13 low13 costbull GC13 will13 kill13 you13 with13 compac3onbull Use13 pauseless13 GCndash IBMs13 Metronome13 very13 predictablendash Azuls13 C413 very13 performant
Using GPUs
• Remember, locality matters
• Need to be able to export a task with data that does not need to update
• AMD has the new HSA platform which communicates between GPUs and CPUs via shared L3
The Future
ManyCore
• David Ungar says > 24 cores, generally many 10s of cores
• Really gets interesting above 1000 cores
• Cache coherency won't be possible
• Non-deterministic
Memristor
• Non-volatile, static RAM, same write endurance as Flash
• 200-300 MB on chip
• Sub-nanosecond writes
• Able to perform processing? (Probably not)
• Multistate, not binary
Phase Change Memory (PRAM)
• Higher performance than today's DRAM
• Intel seems more fascinated by this, released its neuromorphic chip design last Fall
• Not able to perform processing
• Write degradation is supposedly much slower
• Was considered susceptible to unintentional change, maybe fixed
Thanks
Credits
• What Every Programmer Should Know About Memory, Ulrich Drepper of RedHat, 2007
• Java Performance, Charlie Hunt
• Wikipedia/Wikimedia Commons
• AnandTech
• The Microarchitecture of AMD, Intel and VIA CPUs
• Everything You Need to Know about the Quick Path Interconnect, Gabriel Torres/Hardware Secrets
• Inside the Sandy Bridge Architecture, Gabriel Torres/Hardware Secrets
• Martin Thompson's Mechanical Sympathy blog and Disruptor presentations
• The Application Memory Wall, Gil Tene, CTO of Azul Systems
• AMD Bulldozer/Intel Sandy Bridge Comparison, Gionatan Danti
• SI Software's Memory Latency Benchmarks
• Martin Thompson and Cliff Click provided feedback & additional content
Level Three (L3)
• Was a “unified” cache up until Sandy Bridge, shared between cores
• Varies in size with different processors and versions of an architecture - Laptops might have 6-8MB, but server-class might have 30MB
• 14-38 cycles to access
Level Four (L4)
• Some versions of Haswell will have a 128 MB L4 cache
• No latency benchmarks for this yet
Programming Tips
Striding & Pre-fetching
• Predictable memory access is really important
• Hardware pre-fetcher on the core looks for patterns of memory access
• Can be counter-productive if the access pattern is not predictable
• Martin Thompson blog post: “Memory Access Patterns are Important”
• Shows the importance of locality and striding
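The effect of striding can be explored with a small experiment along the following lines. This is a hypothetical sketch, not the code from the blog post mentioned above: AccessPatterns, the array size and the stride value are all assumptions chosen for illustration. Both methods read every element exactly once; only the order differs.

```java
// Hypothetical experiment: same work, different memory access order.
final class AccessPatterns {
    // Visits every index once, in order: a simple pattern for a hardware pre-fetcher.
    static long sumSequential(long[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Visits every index once, but jumps `stride` elements per step, wrapping around.
    // A power-of-two length combined with an odd stride guarantees full coverage.
    static long sumStrided(long[] data, int stride) {
        int mask = data.length - 1;
        if (data.length == 0 || (data.length & mask) != 0 || (stride & 1) == 0) {
            throw new IllegalArgumentException("need power-of-two length and odd stride");
        }
        long sum = 0;
        int pos = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[pos];
            pos = (pos + stride) & mask;
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] data = new long[1 << 22]; // 32 MB, chosen to exceed a typical L3
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        long t0 = System.nanoTime();
        long a = sumSequential(data);
        long t1 = System.nanoTime();
        long b = sumStrided(data, 4099); // each step lands several 4 KB pages further on
        long t2 = System.nanoTime();
        System.out.printf("sequential: %d ms, strided: %d ms (sums %d, %d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a, b);
    }
}
```

A single timed run like this is only a rough indicator because of JIT warm-up; the point of the sketch is that the two loops are identical in arithmetic and differ only in locality.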
Cache13 Misses
bull Cost13 hundreds13 of13 cyclesbull Keep13 your13 code13 simplebull Instruc3on13 read13 misses13 are13 most13 expensivebull Data13 read13 miss13 are13 less13 so13 but13 s3ll13 hurt13 performancebull Write13 misses13 are13 okay13 unless13 using13 Write13 Throughbull Miss13 types
ndash Compulsoryndash Capacityndash Conflict
Programming13 Op7miza7ons
bull Stack13 allocated13 data13 is13 cheapbull Pointer13 interac3on13 -shy‐13 you13 have13 to13 retrieve13 data13 being13 pointed13 to13 even13 in13 registers
bull Avoid13 locking13 and13 resultant13 kernel13 arbitra3onbull CAS13 is13 be9er13 and13 occurs13 on-shy‐thread13 but13 algorithms13 become13 more13 complex
bull Match13 workload13 to13 the13 size13 of13 the13 last13 level13 cache13 (LLC13 L3L4)
What13 about13 Func7onal13 Programming
bull Have13 to13 allocate13 more13 and13 more13 space13 for13 your13 data13 structures13 leads13 to13 evic3on
bull When13 you13 cycle13 back13 around13 you13 get13 cache13 misses
bull Choose13 immutability13 by13 default13 profile13 to13 find13 poor13 performance
bull Use13 mutable13 data13 in13 targeted13 loca3ons
Hyperthreading
bull Great13 for13 IO-shy‐bound13 applica3onsbull If13 you13 have13 lots13 of13 cache13 missesbull Doesnt13 do13 much13 for13 CPU-shy‐bound13 applica3onsbull You13 have13 half13 of13 the13 cache13 resources13 per13 corebull NOTE13 -shy‐13 Haswell13 only13 has13 Hyperthreading13 on13 i7
Data13 Structures
bull BAD13 Linked13 list13 structures13 and13 tree13 structuresbull BAD13 Javas13 HashMap13 uses13 chained13 bucketsbull BAD13 Standard13 Java13 collec3ons13 generate13 lots13 of13 garbage
bull GOOD13 Array-shy‐based13 and13 con3guous13 in13 memory13 is13 much13 faster
bull GOOD13 Write13 your13 own13 that13 are13 lock-shy‐free13 and13 con3guous
bull GOOD13 Fastu3l13 library13 but13 note13 that13 its13 addi3ve
Applica7on13 Memory13 Wall13 amp13 GC
bull Tremendous13 amounts13 of13 RAM13 at13 low13 costbull GC13 will13 kill13 you13 with13 compac3onbull Use13 pauseless13 GCndash IBMs13 Metronome13 very13 predictablendash Azuls13 C413 very13 performant
Using13 GPUs
bull Remember13 locality13 ma9ersbull Need13 to13 be13 able13 to13 export13 a13 task13 with13 data13 that13 does13 not13 need13 to13 update
bull AMD13 has13 the13 new13 HSA13 plaRorm13 which13 communicates13 between13 GPUs13 and13 CPUs13 via13 shared13 L3
The13 Future
ManyCore
bull David13 Ungar13 says13 gt13 2413 cores13 generally13 many13 10s13 of13 cores
bull Really13 gets13 interes3ng13 above13 100013 coresbull Cache13 coherency13 wont13 be13 possiblebull Non-shy‐determinis3c
Memristor
bull Non-shy‐vola3le13 sta3c13 RAM13 same13 write13 endurance13 as13 Flash
bull 200-shy‐30013 MB13 on13 chipbull Sub-shy‐nanosecond13 writesbull Able13 to13 perform13 processing13 13 (Probably13 not)bull Mul3state13 not13 binary
Phase13 Change13 Memory13 (PRAM)
bull Higher13 performance13 than13 todays13 DRAMbull Intel13 seems13 more13 fascinated13 by13 this13 released13 its13 neuromorphic13 chip13 design13 last13 Fall
bull Not13 able13 to13 perform13 processingbull Write13 degrada3on13 is13 supposedly13 much13 slowerbull Was13 considered13 suscep3ble13 to13 uninten3onal13 change13 maybe13 fixed
Thanks
Credits
bull What13 Every13 Programmer13 Should13 Know13 About13 Memory13 Ulrich13 Drepper13 of13 RedHat13 2007
bull Java13 Performance13 Charlie13 Huntbull WikipediaWikimedia13 Commonsbull AnandTechbull The13 Microarchitecture13 of13 AMD13 Intel13 and13 VIA13 CPUsbull Everything13 You13 Need13 to13 Know13 about13 the13 Quick13 Path13 Interconnect13 Gabriel13
TorresHardware13 Secretsbull Inside13 the13 Sandy13 Bridge13 Architecture13 Gabriel13 TorresHardware13 Secretsbull Mar3n13 Thompsons13 Mechanical13 Sympathy13 blog13 and13 Disruptor13 presenta3onsbull The13 Applica3on13 Memory13 Wall13 Gil13 Tene13 CTO13 of13 Azul13 Systemsbull AMD13 BulldozerIntel13 Sandy13 Bridge13 Comparison13 Gionatan13 Dan3bull SI13 Sotwares13 Memory13 Latency13 Benchmarksbull Mar3n13 Thompson13 and13 Cliff13 Click13 provided13 feedback13 ampaddi3onal13 content
Level13 Four13 (L4)
bull Some13 versions13 of13 Haswell13 will13 have13 a13 12813 MB13 L413 cache
bull No13 latency13 benchmarks13 for13 this13 yet
Programming13 Tips
Striding13 amp13 Pre-shy‐fetching
bull Predictable13 memory13 access13 is13 really13 importantbull Hardware13 pre-shy‐fetcher13 on13 the13 core13 looks13 for13 pa9erns13 of13 memory13 access
bull Can13 be13 counter-shy‐produc3ve13 if13 the13 access13 pa9ern13 is13 not13 predictable
bull Mar3n13 Thompson13 blog13 post13 ldquoMemory13 Access13 Pa9erns13 are13 Importantrdquo
bull Shows13 the13 importance13 of13 locality13 and13 striding
Cache13 Misses
bull Cost13 hundreds13 of13 cyclesbull Keep13 your13 code13 simplebull Instruc3on13 read13 misses13 are13 most13 expensivebull Data13 read13 miss13 are13 less13 so13 but13 s3ll13 hurt13 performancebull Write13 misses13 are13 okay13 unless13 using13 Write13 Throughbull Miss13 types
ndash Compulsoryndash Capacityndash Conflict
Programming13 Op7miza7ons
bull Stack13 allocated13 data13 is13 cheapbull Pointer13 interac3on13 -shy‐13 you13 have13 to13 retrieve13 data13 being13 pointed13 to13 even13 in13 registers
bull Avoid13 locking13 and13 resultant13 kernel13 arbitra3onbull CAS13 is13 be9er13 and13 occurs13 on-shy‐thread13 but13 algorithms13 become13 more13 complex
bull Match13 workload13 to13 the13 size13 of13 the13 last13 level13 cache13 (LLC13 L3L4)
What13 about13 Func7onal13 Programming
bull Have13 to13 allocate13 more13 and13 more13 space13 for13 your13 data13 structures13 leads13 to13 evic3on
bull When13 you13 cycle13 back13 around13 you13 get13 cache13 misses
bull Choose13 immutability13 by13 default13 profile13 to13 find13 poor13 performance
bull Use13 mutable13 data13 in13 targeted13 loca3ons
Hyperthreading
bull Great13 for13 IO-shy‐bound13 applica3onsbull If13 you13 have13 lots13 of13 cache13 missesbull Doesnt13 do13 much13 for13 CPU-shy‐bound13 applica3onsbull You13 have13 half13 of13 the13 cache13 resources13 per13 corebull NOTE13 -shy‐13 Haswell13 only13 has13 Hyperthreading13 on13 i7
Data13 Structures
bull BAD13 Linked13 list13 structures13 and13 tree13 structuresbull BAD13 Javas13 HashMap13 uses13 chained13 bucketsbull BAD13 Standard13 Java13 collec3ons13 generate13 lots13 of13 garbage
bull GOOD13 Array-shy‐based13 and13 con3guous13 in13 memory13 is13 much13 faster
bull GOOD13 Write13 your13 own13 that13 are13 lock-shy‐free13 and13 con3guous
bull GOOD13 Fastu3l13 library13 but13 note13 that13 its13 addi3ve
Applica7on13 Memory13 Wall13 amp13 GC
bull Tremendous13 amounts13 of13 RAM13 at13 low13 costbull GC13 will13 kill13 you13 with13 compac3onbull Use13 pauseless13 GCndash IBMs13 Metronome13 very13 predictablendash Azuls13 C413 very13 performant
Using13 GPUs
bull Remember13 locality13 ma9ersbull Need13 to13 be13 able13 to13 export13 a13 task13 with13 data13 that13 does13 not13 need13 to13 update
bull AMD13 has13 the13 new13 HSA13 plaRorm13 which13 communicates13 between13 GPUs13 and13 CPUs13 via13 shared13 L3
The13 Future
ManyCore
bull David13 Ungar13 says13 gt13 2413 cores13 generally13 many13 10s13 of13 cores
bull Really13 gets13 interes3ng13 above13 100013 coresbull Cache13 coherency13 wont13 be13 possiblebull Non-shy‐determinis3c
Memristor
bull Non-shy‐vola3le13 sta3c13 RAM13 same13 write13 endurance13 as13 Flash
bull 200-shy‐30013 MB13 on13 chipbull Sub-shy‐nanosecond13 writesbull Able13 to13 perform13 processing13 13 (Probably13 not)bull Mul3state13 not13 binary
Phase13 Change13 Memory13 (PRAM)
bull Higher13 performance13 than13 todays13 DRAMbull Intel13 seems13 more13 fascinated13 by13 this13 released13 its13 neuromorphic13 chip13 design13 last13 Fall
bull Not13 able13 to13 perform13 processingbull Write13 degrada3on13 is13 supposedly13 much13 slowerbull Was13 considered13 suscep3ble13 to13 uninten3onal13 change13 maybe13 fixed
Thanks
Credits
bull What13 Every13 Programmer13 Should13 Know13 About13 Memory13 Ulrich13 Drepper13 of13 RedHat13 2007
bull Java13 Performance13 Charlie13 Huntbull WikipediaWikimedia13 Commonsbull AnandTechbull The13 Microarchitecture13 of13 AMD13 Intel13 and13 VIA13 CPUsbull Everything13 You13 Need13 to13 Know13 about13 the13 Quick13 Path13 Interconnect13 Gabriel13
TorresHardware13 Secretsbull Inside13 the13 Sandy13 Bridge13 Architecture13 Gabriel13 TorresHardware13 Secretsbull Mar3n13 Thompsons13 Mechanical13 Sympathy13 blog13 and13 Disruptor13 presenta3onsbull The13 Applica3on13 Memory13 Wall13 Gil13 Tene13 CTO13 of13 Azul13 Systemsbull AMD13 BulldozerIntel13 Sandy13 Bridge13 Comparison13 Gionatan13 Dan3bull SI13 Sotwares13 Memory13 Latency13 Benchmarksbull Mar3n13 Thompson13 and13 Cliff13 Click13 provided13 feedback13 ampaddi3onal13 content
Programming13 Tips
Striding13 amp13 Pre-shy‐fetching
bull Predictable13 memory13 access13 is13 really13 importantbull Hardware13 pre-shy‐fetcher13 on13 the13 core13 looks13 for13 pa9erns13 of13 memory13 access
bull Can13 be13 counter-shy‐produc3ve13 if13 the13 access13 pa9ern13 is13 not13 predictable
bull Mar3n13 Thompson13 blog13 post13 ldquoMemory13 Access13 Pa9erns13 are13 Importantrdquo
bull Shows13 the13 importance13 of13 locality13 and13 striding
Cache13 Misses
bull Cost13 hundreds13 of13 cyclesbull Keep13 your13 code13 simplebull Instruc3on13 read13 misses13 are13 most13 expensivebull Data13 read13 miss13 are13 less13 so13 but13 s3ll13 hurt13 performancebull Write13 misses13 are13 okay13 unless13 using13 Write13 Throughbull Miss13 types
ndash Compulsoryndash Capacityndash Conflict
Programming13 Op7miza7ons
bull Stack13 allocated13 data13 is13 cheapbull Pointer13 interac3on13 -shy‐13 you13 have13 to13 retrieve13 data13 being13 pointed13 to13 even13 in13 registers
bull Avoid13 locking13 and13 resultant13 kernel13 arbitra3onbull CAS13 is13 be9er13 and13 occurs13 on-shy‐thread13 but13 algorithms13 become13 more13 complex
bull Match13 workload13 to13 the13 size13 of13 the13 last13 level13 cache13 (LLC13 L3L4)
What13 about13 Func7onal13 Programming
bull Have13 to13 allocate13 more13 and13 more13 space13 for13 your13 data13 structures13 leads13 to13 evic3on
bull When13 you13 cycle13 back13 around13 you13 get13 cache13 misses
bull Choose13 immutability13 by13 default13 profile13 to13 find13 poor13 performance
bull Use13 mutable13 data13 in13 targeted13 loca3ons
Hyperthreading
bull Great13 for13 IO-shy‐bound13 applica3onsbull If13 you13 have13 lots13 of13 cache13 missesbull Doesnt13 do13 much13 for13 CPU-shy‐bound13 applica3onsbull You13 have13 half13 of13 the13 cache13 resources13 per13 corebull NOTE13 -shy‐13 Haswell13 only13 has13 Hyperthreading13 on13 i7
Data13 Structures
bull BAD13 Linked13 list13 structures13 and13 tree13 structuresbull BAD13 Javas13 HashMap13 uses13 chained13 bucketsbull BAD13 Standard13 Java13 collec3ons13 generate13 lots13 of13 garbage
bull GOOD13 Array-shy‐based13 and13 con3guous13 in13 memory13 is13 much13 faster
bull GOOD13 Write13 your13 own13 that13 are13 lock-shy‐free13 and13 con3guous
bull GOOD13 Fastu3l13 library13 but13 note13 that13 its13 addi3ve
Applica7on13 Memory13 Wall13 amp13 GC
bull Tremendous13 amounts13 of13 RAM13 at13 low13 costbull GC13 will13 kill13 you13 with13 compac3onbull Use13 pauseless13 GCndash IBMs13 Metronome13 very13 predictablendash Azuls13 C413 very13 performant
Using13 GPUs
bull Remember13 locality13 ma9ersbull Need13 to13 be13 able13 to13 export13 a13 task13 with13 data13 that13 does13 not13 need13 to13 update
bull AMD13 has13 the13 new13 HSA13 plaRorm13 which13 communicates13 between13 GPUs13 and13 CPUs13 via13 shared13 L3
The13 Future
ManyCore
bull David13 Ungar13 says13 gt13 2413 cores13 generally13 many13 10s13 of13 cores
bull Really13 gets13 interes3ng13 above13 100013 coresbull Cache13 coherency13 wont13 be13 possiblebull Non-shy‐determinis3c
Memristor
bull Non-shy‐vola3le13 sta3c13 RAM13 same13 write13 endurance13 as13 Flash
bull 200-shy‐30013 MB13 on13 chipbull Sub-shy‐nanosecond13 writesbull Able13 to13 perform13 processing13 13 (Probably13 not)bull Mul3state13 not13 binary
Phase13 Change13 Memory13 (PRAM)
bull Higher13 performance13 than13 todays13 DRAMbull Intel13 seems13 more13 fascinated13 by13 this13 released13 its13 neuromorphic13 chip13 design13 last13 Fall
bull Not13 able13 to13 perform13 processingbull Write13 degrada3on13 is13 supposedly13 much13 slowerbull Was13 considered13 suscep3ble13 to13 uninten3onal13 change13 maybe13 fixed
Thanks
Credits
bull What13 Every13 Programmer13 Should13 Know13 About13 Memory13 Ulrich13 Drepper13 of13 RedHat13 2007
bull Java13 Performance13 Charlie13 Huntbull WikipediaWikimedia13 Commonsbull AnandTechbull The13 Microarchitecture13 of13 AMD13 Intel13 and13 VIA13 CPUsbull Everything13 You13 Need13 to13 Know13 about13 the13 Quick13 Path13 Interconnect13 Gabriel13
TorresHardware13 Secretsbull Inside13 the13 Sandy13 Bridge13 Architecture13 Gabriel13 TorresHardware13 Secretsbull Mar3n13 Thompsons13 Mechanical13 Sympathy13 blog13 and13 Disruptor13 presenta3onsbull The13 Applica3on13 Memory13 Wall13 Gil13 Tene13 CTO13 of13 Azul13 Systemsbull AMD13 BulldozerIntel13 Sandy13 Bridge13 Comparison13 Gionatan13 Dan3bull SI13 Sotwares13 Memory13 Latency13 Benchmarksbull Mar3n13 Thompson13 and13 Cliff13 Click13 provided13 feedback13 ampaddi3onal13 content
Striding13 amp13 Pre-shy‐fetching
bull Predictable13 memory13 access13 is13 really13 importantbull Hardware13 pre-shy‐fetcher13 on13 the13 core13 looks13 for13 pa9erns13 of13 memory13 access
bull Can13 be13 counter-shy‐produc3ve13 if13 the13 access13 pa9ern13 is13 not13 predictable
bull Mar3n13 Thompson13 blog13 post13 ldquoMemory13 Access13 Pa9erns13 are13 Importantrdquo
bull Shows13 the13 importance13 of13 locality13 and13 striding
Cache13 Misses
bull Cost13 hundreds13 of13 cyclesbull Keep13 your13 code13 simplebull Instruc3on13 read13 misses13 are13 most13 expensivebull Data13 read13 miss13 are13 less13 so13 but13 s3ll13 hurt13 performancebull Write13 misses13 are13 okay13 unless13 using13 Write13 Throughbull Miss13 types
ndash Compulsoryndash Capacityndash Conflict
Programming13 Op7miza7ons
bull Stack13 allocated13 data13 is13 cheapbull Pointer13 interac3on13 -shy‐13 you13 have13 to13 retrieve13 data13 being13 pointed13 to13 even13 in13 registers
bull Avoid13 locking13 and13 resultant13 kernel13 arbitra3onbull CAS13 is13 be9er13 and13 occurs13 on-shy‐thread13 but13 algorithms13 become13 more13 complex
bull Match13 workload13 to13 the13 size13 of13 the13 last13 level13 cache13 (LLC13 L3L4)
What13 about13 Func7onal13 Programming
bull Have13 to13 allocate13 more13 and13 more13 space13 for13 your13 data13 structures13 leads13 to13 evic3on
bull When13 you13 cycle13 back13 around13 you13 get13 cache13 misses
bull Choose13 immutability13 by13 default13 profile13 to13 find13 poor13 performance
bull Use13 mutable13 data13 in13 targeted13 loca3ons
Hyperthreading
bull Great13 for13 IO-shy‐bound13 applica3onsbull If13 you13 have13 lots13 of13 cache13 missesbull Doesnt13 do13 much13 for13 CPU-shy‐bound13 applica3onsbull You13 have13 half13 of13 the13 cache13 resources13 per13 corebull NOTE13 -shy‐13 Haswell13 only13 has13 Hyperthreading13 on13 i7
Data13 Structures
bull BAD13 Linked13 list13 structures13 and13 tree13 structuresbull BAD13 Javas13 HashMap13 uses13 chained13 bucketsbull BAD13 Standard13 Java13 collec3ons13 generate13 lots13 of13 garbage
bull GOOD13 Array-shy‐based13 and13 con3guous13 in13 memory13 is13 much13 faster
bull GOOD13 Write13 your13 own13 that13 are13 lock-shy‐free13 and13 con3guous
bull GOOD13 Fastu3l13 library13 but13 note13 that13 its13 addi3ve
Applica7on13 Memory13 Wall13 amp13 GC
bull Tremendous13 amounts13 of13 RAM13 at13 low13 costbull GC13 will13 kill13 you13 with13 compac3onbull Use13 pauseless13 GCndash IBMs13 Metronome13 very13 predictablendash Azuls13 C413 very13 performant
Using13 GPUs
bull Remember13 locality13 ma9ersbull Need13 to13 be13 able13 to13 export13 a13 task13 with13 data13 that13 does13 not13 need13 to13 update
bull AMD13 has13 the13 new13 HSA13 plaRorm13 which13 communicates13 between13 GPUs13 and13 CPUs13 via13 shared13 L3
The13 Future
ManyCore
bull David13 Ungar13 says13 gt13 2413 cores13 generally13 many13 10s13 of13 cores
bull Really13 gets13 interes3ng13 above13 100013 coresbull Cache13 coherency13 wont13 be13 possiblebull Non-shy‐determinis3c
Memristor
bull Non-shy‐vola3le13 sta3c13 RAM13 same13 write13 endurance13 as13 Flash
bull 200-shy‐30013 MB13 on13 chipbull Sub-shy‐nanosecond13 writesbull Able13 to13 perform13 processing13 13 (Probably13 not)bull Mul3state13 not13 binary
Phase13 Change13 Memory13 (PRAM)
bull Higher13 performance13 than13 todays13 DRAMbull Intel13 seems13 more13 fascinated13 by13 this13 released13 its13 neuromorphic13 chip13 design13 last13 Fall
bull Not13 able13 to13 perform13 processingbull Write13 degrada3on13 is13 supposedly13 much13 slowerbull Was13 considered13 suscep3ble13 to13 uninten3onal13 change13 maybe13 fixed
Thanks
Credits
bull What13 Every13 Programmer13 Should13 Know13 About13 Memory13 Ulrich13 Drepper13 of13 RedHat13 2007
bull Java13 Performance13 Charlie13 Huntbull WikipediaWikimedia13 Commonsbull AnandTechbull The13 Microarchitecture13 of13 AMD13 Intel13 and13 VIA13 CPUsbull Everything13 You13 Need13 to13 Know13 about13 the13 Quick13 Path13 Interconnect13 Gabriel13
TorresHardware13 Secretsbull Inside13 the13 Sandy13 Bridge13 Architecture13 Gabriel13 TorresHardware13 Secretsbull Mar3n13 Thompsons13 Mechanical13 Sympathy13 blog13 and13 Disruptor13 presenta3onsbull The13 Applica3on13 Memory13 Wall13 Gil13 Tene13 CTO13 of13 Azul13 Systemsbull AMD13 BulldozerIntel13 Sandy13 Bridge13 Comparison13 Gionatan13 Dan3bull SI13 Sotwares13 Memory13 Latency13 Benchmarksbull Mar3n13 Thompson13 and13 Cliff13 Click13 provided13 feedback13 ampaddi3onal13 content
Cache13 Misses
bull Cost13 hundreds13 of13 cyclesbull Keep13 your13 code13 simplebull Instruc3on13 read13 misses13 are13 most13 expensivebull Data13 read13 miss13 are13 less13 so13 but13 s3ll13 hurt13 performancebull Write13 misses13 are13 okay13 unless13 using13 Write13 Throughbull Miss13 types
ndash Compulsoryndash Capacityndash Conflict
Programming13 Op7miza7ons
bull Stack13 allocated13 data13 is13 cheapbull Pointer13 interac3on13 -shy‐13 you13 have13 to13 retrieve13 data13 being13 pointed13 to13 even13 in13 registers
bull Avoid13 locking13 and13 resultant13 kernel13 arbitra3onbull CAS13 is13 be9er13 and13 occurs13 on-shy‐thread13 but13 algorithms13 become13 more13 complex
bull Match13 workload13 to13 the13 size13 of13 the13 last13 level13 cache13 (LLC13 L3L4)
What13 about13 Func7onal13 Programming
bull Have13 to13 allocate13 more13 and13 more13 space13 for13 your13 data13 structures13 leads13 to13 evic3on
bull When13 you13 cycle13 back13 around13 you13 get13 cache13 misses
bull Choose13 immutability13 by13 default13 profile13 to13 find13 poor13 performance
bull Use13 mutable13 data13 in13 targeted13 loca3ons
Hyperthreading
bull Great13 for13 IO-shy‐bound13 applica3onsbull If13 you13 have13 lots13 of13 cache13 missesbull Doesnt13 do13 much13 for13 CPU-shy‐bound13 applica3onsbull You13 have13 half13 of13 the13 cache13 resources13 per13 corebull NOTE13 -shy‐13 Haswell13 only13 has13 Hyperthreading13 on13 i7
Data13 Structures
bull BAD13 Linked13 list13 structures13 and13 tree13 structuresbull BAD13 Javas13 HashMap13 uses13 chained13 bucketsbull BAD13 Standard13 Java13 collec3ons13 generate13 lots13 of13 garbage
bull GOOD13 Array-shy‐based13 and13 con3guous13 in13 memory13 is13 much13 faster
bull GOOD13 Write13 your13 own13 that13 are13 lock-shy‐free13 and13 con3guous
bull GOOD13 Fastu3l13 library13 but13 note13 that13 its13 addi3ve
Applica7on13 Memory13 Wall13 amp13 GC
bull Tremendous13 amounts13 of13 RAM13 at13 low13 costbull GC13 will13 kill13 you13 with13 compac3onbull Use13 pauseless13 GCndash IBMs13 Metronome13 very13 predictablendash Azuls13 C413 very13 performant
Using13 GPUs
bull Remember13 locality13 ma9ersbull Need13 to13 be13 able13 to13 export13 a13 task13 with13 data13 that13 does13 not13 need13 to13 update
bull AMD13 has13 the13 new13 HSA13 plaRorm13 which13 communicates13 between13 GPUs13 and13 CPUs13 via13 shared13 L3
The13 Future
ManyCore
bull David13 Ungar13 says13 gt13 2413 cores13 generally13 many13 10s13 of13 cores
bull Really13 gets13 interes3ng13 above13 100013 coresbull Cache13 coherency13 wont13 be13 possiblebull Non-shy‐determinis3c
Memristor
bull Non-shy‐vola3le13 sta3c13 RAM13 same13 write13 endurance13 as13 Flash
bull 200-shy‐30013 MB13 on13 chipbull Sub-shy‐nanosecond13 writesbull Able13 to13 perform13 processing13 13 (Probably13 not)bull Mul3state13 not13 binary
Phase13 Change13 Memory13 (PRAM)
bull Higher13 performance13 than13 todays13 DRAMbull Intel13 seems13 more13 fascinated13 by13 this13 released13 its13 neuromorphic13 chip13 design13 last13 Fall
bull Not13 able13 to13 perform13 processingbull Write13 degrada3on13 is13 supposedly13 much13 slowerbull Was13 considered13 suscep3ble13 to13 uninten3onal13 change13 maybe13 fixed
Thanks
Credits
bull What13 Every13 Programmer13 Should13 Know13 About13 Memory13 Ulrich13 Drepper13 of13 RedHat13 2007
bull Java13 Performance13 Charlie13 Huntbull WikipediaWikimedia13 Commonsbull AnandTechbull The13 Microarchitecture13 of13 AMD13 Intel13 and13 VIA13 CPUsbull Everything13 You13 Need13 to13 Know13 about13 the13 Quick13 Path13 Interconnect13 Gabriel13
TorresHardware13 Secretsbull Inside13 the13 Sandy13 Bridge13 Architecture13 Gabriel13 TorresHardware13 Secretsbull Mar3n13 Thompsons13 Mechanical13 Sympathy13 blog13 and13 Disruptor13 presenta3onsbull The13 Applica3on13 Memory13 Wall13 Gil13 Tene13 CTO13 of13 Azul13 Systemsbull AMD13 BulldozerIntel13 Sandy13 Bridge13 Comparison13 Gionatan13 Dan3bull SI13 Sotwares13 Memory13 Latency13 Benchmarksbull Mar3n13 Thompson13 and13 Cliff13 Click13 provided13 feedback13 ampaddi3onal13 content
Programming13 Op7miza7ons
bull Stack13 allocated13 data13 is13 cheapbull Pointer13 interac3on13 -shy‐13 you13 have13 to13 retrieve13 data13 being13 pointed13 to13 even13 in13 registers
bull Avoid13 locking13 and13 resultant13 kernel13 arbitra3onbull CAS13 is13 be9er13 and13 occurs13 on-shy‐thread13 but13 algorithms13 become13 more13 complex
bull Match13 workload13 to13 the13 size13 of13 the13 last13 level13 cache13 (LLC13 L3L4)
What13 about13 Func7onal13 Programming
bull Have13 to13 allocate13 more13 and13 more13 space13 for13 your13 data13 structures13 leads13 to13 evic3on
bull When13 you13 cycle13 back13 around13 you13 get13 cache13 misses
bull Choose13 immutability13 by13 default13 profile13 to13 find13 poor13 performance
bull Use13 mutable13 data13 in13 targeted13 loca3ons
Hyperthreading
bull Great13 for13 IO-shy‐bound13 applica3onsbull If13 you13 have13 lots13 of13 cache13 missesbull Doesnt13 do13 much13 for13 CPU-shy‐bound13 applica3onsbull You13 have13 half13 of13 the13 cache13 resources13 per13 corebull NOTE13 -shy‐13 Haswell13 only13 has13 Hyperthreading13 on13 i7
Data13 Structures
bull BAD13 Linked13 list13 structures13 and13 tree13 structuresbull BAD13 Javas13 HashMap13 uses13 chained13 bucketsbull BAD13 Standard13 Java13 collec3ons13 generate13 lots13 of13 garbage
bull GOOD13 Array-shy‐based13 and13 con3guous13 in13 memory13 is13 much13 faster
bull GOOD13 Write13 your13 own13 that13 are13 lock-shy‐free13 and13 con3guous
bull GOOD13 Fastu3l13 library13 but13 note13 that13 its13 addi3ve
Applica7on13 Memory13 Wall13 amp13 GC
bull Tremendous13 amounts13 of13 RAM13 at13 low13 costbull GC13 will13 kill13 you13 with13 compac3onbull Use13 pauseless13 GCndash IBMs13 Metronome13 very13 predictablendash Azuls13 C413 very13 performant
Using13 GPUs
bull Remember13 locality13 ma9ersbull Need13 to13 be13 able13 to13 export13 a13 task13 with13 data13 that13 does13 not13 need13 to13 update
bull AMD13 has13 the13 new13 HSA13 plaRorm13 which13 communicates13 between13 GPUs13 and13 CPUs13 via13 shared13 L3
The13 Future
ManyCore
bull David13 Ungar13 says13 gt13 2413 cores13 generally13 many13 10s13 of13 cores
bull Really13 gets13 interes3ng13 above13 100013 coresbull Cache13 coherency13 wont13 be13 possiblebull Non-shy‐determinis3c
Memristor
bull Non-shy‐vola3le13 sta3c13 RAM13 same13 write13 endurance13 as13 Flash
bull 200-shy‐30013 MB13 on13 chipbull Sub-shy‐nanosecond13 writesbull Able13 to13 perform13 processing13 13 (Probably13 not)bull Mul3state13 not13 binary
Phase13 Change13 Memory13 (PRAM)
bull Higher13 performance13 than13 todays13 DRAMbull Intel13 seems13 more13 fascinated13 by13 this13 released13 its13 neuromorphic13 chip13 design13 last13 Fall
bull Not13 able13 to13 perform13 processingbull Write13 degrada3on13 is13 supposedly13 much13 slowerbull Was13 considered13 suscep3ble13 to13 uninten3onal13 change13 maybe13 fixed
Thanks
Credits
bull What13 Every13 Programmer13 Should13 Know13 About13 Memory13 Ulrich13 Drepper13 of13 RedHat13 2007
bull Java13 Performance13 Charlie13 Huntbull WikipediaWikimedia13 Commonsbull AnandTechbull The13 Microarchitecture13 of13 AMD13 Intel13 and13 VIA13 CPUsbull Everything13 You13 Need13 to13 Know13 about13 the13 Quick13 Path13 Interconnect13 Gabriel13
TorresHardware13 Secretsbull Inside13 the13 Sandy13 Bridge13 Architecture13 Gabriel13 TorresHardware13 Secretsbull Mar3n13 Thompsons13 Mechanical13 Sympathy13 blog13 and13 Disruptor13 presenta3onsbull The13 Applica3on13 Memory13 Wall13 Gil13 Tene13 CTO13 of13 Azul13 Systemsbull AMD13 BulldozerIntel13 Sandy13 Bridge13 Comparison13 Gionatan13 Dan3bull SI13 Sotwares13 Memory13 Latency13 Benchmarksbull Mar3n13 Thompson13 and13 Cliff13 Click13 provided13 feedback13 ampaddi3onal13 content
What13 about13 Func7onal13 Programming
bull Have13 to13 allocate13 more13 and13 more13 space13 for13 your13 data13 structures13 leads13 to13 evic3on
bull When13 you13 cycle13 back13 around13 you13 get13 cache13 misses
bull Choose13 immutability13 by13 default13 profile13 to13 find13 poor13 performance
bull Use13 mutable13 data13 in13 targeted13 loca3ons
Hyperthreading
bull Great13 for13 IO-shy‐bound13 applica3onsbull If13 you13 have13 lots13 of13 cache13 missesbull Doesnt13 do13 much13 for13 CPU-shy‐bound13 applica3onsbull You13 have13 half13 of13 the13 cache13 resources13 per13 corebull NOTE13 -shy‐13 Haswell13 only13 has13 Hyperthreading13 on13 i7
Data13 Structures
bull BAD13 Linked13 list13 structures13 and13 tree13 structuresbull BAD13 Javas13 HashMap13 uses13 chained13 bucketsbull BAD13 Standard13 Java13 collec3ons13 generate13 lots13 of13 garbage
bull GOOD13 Array-shy‐based13 and13 con3guous13 in13 memory13 is13 much13 faster
bull GOOD13 Write13 your13 own13 that13 are13 lock-shy‐free13 and13 con3guous
bull GOOD13 Fastu3l13 library13 but13 note13 that13 its13 addi3ve
Applica7on13 Memory13 Wall13 amp13 GC
bull Tremendous13 amounts13 of13 RAM13 at13 low13 costbull GC13 will13 kill13 you13 with13 compac3onbull Use13 pauseless13 GCndash IBMs13 Metronome13 very13 predictablendash Azuls13 C413 very13 performant
Using13 GPUs
bull Remember13 locality13 ma9ersbull Need13 to13 be13 able13 to13 export13 a13 task13 with13 data13 that13 does13 not13 need13 to13 update
bull AMD13 has13 the13 new13 HSA13 plaRorm13 which13 communicates13 between13 GPUs13 and13 CPUs13 via13 shared13 L3
The13 Future
ManyCore
bull David13 Ungar13 says13 gt13 2413 cores13 generally13 many13 10s13 of13 cores
bull Really13 gets13 interes3ng13 above13 100013 coresbull Cache13 coherency13 wont13 be13 possiblebull Non-shy‐determinis3c
Memristor
bull Non-shy‐vola3le13 sta3c13 RAM13 same13 write13 endurance13 as13 Flash
bull 200-shy‐30013 MB13 on13 chipbull Sub-shy‐nanosecond13 writesbull Able13 to13 perform13 processing13 13 (Probably13 not)bull Mul3state13 not13 binary
Phase13 Change13 Memory13 (PRAM)
bull Higher13 performance13 than13 todays13 DRAMbull Intel13 seems13 more13 fascinated13 by13 this13 released13 its13 neuromorphic13 chip13 design13 last13 Fall
bull Not13 able13 to13 perform13 processingbull Write13 degrada3on13 is13 supposedly13 much13 slowerbull Was13 considered13 suscep3ble13 to13 uninten3onal13 change13 maybe13 fixed
Thanks
Credits
bull What13 Every13 Programmer13 Should13 Know13 About13 Memory13 Ulrich13 Drepper13 of13 RedHat13 2007
bull Java13 Performance13 Charlie13 Huntbull WikipediaWikimedia13 Commonsbull AnandTechbull The13 Microarchitecture13 of13 AMD13 Intel13 and13 VIA13 CPUsbull Everything13 You13 Need13 to13 Know13 about13 the13 Quick13 Path13 Interconnect13 Gabriel13
TorresHardware13 Secretsbull Inside13 the13 Sandy13 Bridge13 Architecture13 Gabriel13 TorresHardware13 Secretsbull Mar3n13 Thompsons13 Mechanical13 Sympathy13 blog13 and13 Disruptor13 presenta3onsbull The13 Applica3on13 Memory13 Wall13 Gil13 Tene13 CTO13 of13 Azul13 Systemsbull AMD13 BulldozerIntel13 Sandy13 Bridge13 Comparison13 Gionatan13 Dan3bull SI13 Sotwares13 Memory13 Latency13 Benchmarksbull Mar3n13 Thompson13 and13 Cliff13 Click13 provided13 feedback13 ampaddi3onal13 content
Hyperthreading
bull Great13 for13 IO-shy‐bound13 applica3onsbull If13 you13 have13 lots13 of13 cache13 missesbull Doesnt13 do13 much13 for13 CPU-shy‐bound13 applica3onsbull You13 have13 half13 of13 the13 cache13 resources13 per13 corebull NOTE13 -shy‐13 Haswell13 only13 has13 Hyperthreading13 on13 i7
Data Structures
• BAD: Linked list structures and tree structures
• BAD: Java's HashMap uses chained buckets
• BAD: Standard Java collections generate lots of garbage
• GOOD: Array-based and contiguous in memory is much faster
• GOOD: Write your own that are lock-free and contiguous
• GOOD: Fastutil library, but note that it's additive
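The contrast between contiguous and pointer-chasing structures can be sketched as follows. The class `LocalityDemo`, its method names, and the element count are hypothetical and chosen only for illustration. The timing is deliberately naive (no JIT warm-up), so treat the printed numbers as a rough indication and use a proper harness such as JMH for real measurements.

```java
import java.util.LinkedList;
import java.util.List;

final class LocalityDemo {
    // Primitive array: elements are adjacent in memory, so a sequential walk
    // uses every value in each cache line it loads.
    static long sumArray(int[] values) {
        long total = 0;
        for (int i = 0; i < values.length; i++) {
            total += values[i];
        }
        return total;
    }

    // Linked list of boxed Integers: each step follows a node reference and
    // then an Integer reference, and either one may miss the cache.
    static long sumLinked(List<Integer> values) {
        long total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        int n = 1_000_000; // hypothetical size, for illustration only
        int[] array = new int[n];
        List<Integer> linked = new LinkedList<>();
        for (int i = 0; i < n; i++) {
            array[i] = i;
            linked.add(i); // boxes the int and allocates a list node
        }

        long t0 = System.nanoTime();
        long arraySum = sumArray(array);
        long t1 = System.nanoTime();
        long linkedSum = sumLinked(linked);
        long t2 = System.nanoTime();

        System.out.println("array sum  = " + arraySum + " in " + (t1 - t0) / 1_000 + " us");
        System.out.println("linked sum = " + linkedSum + " in " + (t2 - t1) / 1_000 + " us");
    }
}
```

Both methods compute the same result. The difference is in how memory is touched, and in the per-element node and `Integer` objects the linked version leaves for the garbage collector.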
Application Memory Wall & GC
• Tremendous amounts of RAM at low cost
• GC will kill you with compaction
• Use pauseless GC
– IBM's Metronome, very predictable
– Azul's C4, very performant
Using GPUs
• Remember, locality matters
• Need to be able to export a task with data that does not need to update
• AMD has the new HSA platform, which communicates between GPUs and CPUs via shared L3
The Future
ManyCore
• David Ungar says > 24 cores, generally many 10s of cores
• Really gets interesting above 1000 cores
• Cache coherency won't be possible
• Non-deterministic
Memristor
• Non-volatile, static RAM, same write endurance as Flash
• 200-300 MB on chip
• Sub-nanosecond writes
• Able to perform processing? (Probably not)
• Multistate, not binary
Phase Change Memory (PRAM)
• Higher performance than today's DRAM
• Intel seems more fascinated by this, released its neuromorphic chip design last Fall
• Not able to perform processing
• Write degradation is supposedly much slower
• Was considered susceptible to unintentional change, maybe fixed
Thanks
Credits
• What Every Programmer Should Know About Memory, Ulrich Drepper of RedHat, 2007
• Java Performance, Charlie Hunt
• Wikipedia/Wikimedia Commons
• AnandTech
• The Microarchitecture of AMD, Intel and VIA CPUs
• Everything You Need to Know about the Quick Path Interconnect, Gabriel Torres/Hardware Secrets
• Inside the Sandy Bridge Architecture, Gabriel Torres/Hardware Secrets
• Martin Thompson's Mechanical Sympathy blog and Disruptor presentations
• The Application Memory Wall, Gil Tene, CTO of Azul Systems
• AMD Bulldozer/Intel Sandy Bridge Comparison, Gionatan Danti
• SI Software's Memory Latency Benchmarks
• Martin Thompson and Cliff Click provided feedback & additional content