Memory Optimizations Research at UNT
Krishna Kavi
Professor; Director of NSF Industry/University Cooperative Center
for Net-Centric Software and Systems (Net-Centric IUCRC)
Computer Science and Engineering
The University of North Texas
Denton, Texas 76203, USA
[email protected]
http://csrl.unt.edu/~kavi
Memory Optimizations at UNT 2
Motivation
Memory subsystem plays a key role in achieving performance on multi-core processors
Memory subsystem contributes to significant portions of energy consumed
Pin limitations limit bandwidth to off-chip memories
Shared caches may have non-uniform access behaviors
Shared caches may encounter inter-core conflicts and coherency misses
Different data types exhibit different locality and reuse behaviors
Different applications need different memory optimizations
Our Research Focus
Cache Memory optimizations
software and hardware solutions
primarily at L-1
some ideas at L-2
Memory Management
Intelligent allocation and user defined layouts
Hardware supported allocation and garbage collection
Non-Uniformity of Cache Accesses
Non-uniform access to cache sets
Some sets are accessed 100,000 times more often than other sets
This causes more misses while some sets are not used
Non-Uniform Cache Accesses For Parser
Non-Uniformity of Cache Accesses
But, not all applications exhibit “bad” access behavior
Non-Uniform Cache Accesses for Selected Benchmarks
Need different solutions for different applications
Improving Uniformity of Cache Accesses
Possible solutions
• Using fully associative caches with perfect replacement policies
• Selecting optimal addressing schemes
• Dynamically re-mapping addresses to new cache lines
• Partitioning caches into smaller portions, each partition used by a different data object
• Using multiple address decoders
• Static or dynamic data mapping and relocation
Associative Caches Improve Uniformity
Direct-Mapped Cache vs. 16-Way Associative Cache
Data Memory Characteristics
• Different object types exhibit different access behaviors
- Arrays exhibit spatial locality
- Linked lists and pointer data types are difficult to prefetch
- Statics and scalars may exhibit temporal locality
• Custom memory allocators and custom run-time support can be used to improve the locality of dynamically allocated objects
- Pool allocators (U of Illinois)
- Regular expressions to improve on pool allocators (Korea)
- Profiling and reallocating objects (UNT)
- Hardware support for intelligent memory management (UNT and Iowa State)
ABC’s of Cache Memories
Multiple levels of memory – memory hierarchy
CPU and registers
L1 instruction cache / L1 data cache
L2 cache (combined data and instruction)
DRAM (main memory)
Disk
ABC’s of Cache Memories
Consider a direct mapped Cache
An address can only be in a fixed cache line as specified by the 6-bit line number of the address
ABC’s of Cache Memories
Consider a 2-way set associative cache
An address is located in a fixed set of the cache, but the address can occupy either of the 2 lines of a set.
We extend this idea to 4-way, 8-way,.. fully associative caches
ABC’s of Cache Memories
Consider a fully associative cache
An address is located in any line
Or, there is only one set in the cache.
Very expensive since we need to compare the address tag with each line tag.
Also need a good replacement strategy.
Can lead to more uniform access to cache lines
Programmable Associativity
Can we provide higher associativity only when we need it?
Consider a simple idea:
Heavily accessed cache lines will be provided with alternate locations, as indicated by a “partner index”
Programmable Associativity
Peir's adaptive cache uses two tables:
Set-reference History Table (SHT) – tracks heavily used cache lines
Out-of-position directory (OUT) – tracks alternate locations
[Peir 98] J. Peir, Y. Lee, and W. Hsu, “Capturing Dynamic Memory Reference Behavior with Adaptive Cache Topology,” in Proc. of the 8th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, 1998, pp. 240–250.
[Zhang 06] C. Zhang, “Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches,” ISCA, pp. 155–166, June 2006.
Zhang's programmable associativity (B-Cache):
The cache index is divided into programmable and non-programmable indexes
The non-programmable index (NPI) allows for varying associativities
Programmable Associativity
[Chart: % reduction in miss rate on MiBench benchmarks (adpcm, basicmath, bitcount, crc, dijkstra, fft, patricia, qsort, rijndael, sha, susan; average) for Adaptive_Cache, B_Cache, and Column_associative.]
Programmable Associativity
[Chart: % reduction in AMAT on MiBench benchmarks (adpcm, basicmath, bitcount, crc, dijkstra, fft, patricia, qsort, rijndael, sha, susan; average) for Adaptive_Cache, B_Cache, and Column_associative.]
Multiple Decoders
[Diagram: multiple address decoders over a shared tag/data array, each decoder splitting the address into tag, set index, and byte offset differently.]
Different decoders may use different associativities
Multiple Decoders
But how to select index bits?
Index Selection Techniques
Different approaches have been studied:
Givargis quality bits
XOR some tag bits with the index bits
Add a multiple of the tag to the index
Use a prime modulo
[Givargis 03] T. Givargis, “Improved Indexing for Cache Miss Reduction in Embedded Systems,” in Proc. of the Design Automation Conference, 2003.
[Kharbutli 04] M. Kharbutli, K. Irwin, Y. Solihin, and J. Lee, “Using Prime Numbers for Cache Indexing to Eliminate Conflict Misses,” in Proc. of the Int’l Symp. on High Performance Computer Architecture, 2004.
Multiple Decoders
Odd multiplier method:
Different multipliers for each thread
[Chart: % reduction in miss rate for multi-threaded benchmarks (bitcount_adpcm, bzip2_libquantum, fft_susan, gromacs_namd, milc_namd, qsort_basicmath, qsort_patricia, fft_basicmath_patricia_susan, susan_bitcount_adpcm_patricia; average).]
Multiple Decoders
[Chart: % improvement in AMAT for multi-threaded applications (bitcount_adpcm, fft_susan, qsort_basicmath, qsort_fft, qsort_patricia, libquantum_milc, milc_namd, gromacs_namd, bzip2_libquantum, fft_basicmath_patricia_susan, susan_bitcount_adpcm_patricia; average).]

Here we split the cache into segments, one per thread.
But we used adaptive-cache techniques to “donate” underutilized sets to other threads.
Other Cache Memory Research at UNT
Use of a single data cache can lead to unnecessary cache misses
Arrays exhibit higher spatial locality while scalars may exhibit higher temporal locality
These may benefit from different cache organizations (associativity, block size)
If using separate instruction and data caches, why not different data caches -- either statically or dynamically partitioned
And if separate array and scalar caches are included, how to further improve their performance?
Optimize the sizes of the array and scalar caches for each application
Reconfigurable Caches
[Diagram: the CPU connects to separate array and scalar L1 caches, backed by a secondary cache and main memory.]
Percentage reduction of power, area and cycles for data cache
[Chart: percentage reduction in power, area, and cycles for the data cache across benchmarks (bc, qs, dj, bf, sh, ri, ss, ad, cr, ff; avg).]

Conventional cache configuration: 8 KB direct-mapped data cache; 32 KB 4-way unified level-2 cache
Scalar cache configuration: variable size, direct-mapped with a 2-line victim cache
Array cache configuration: variable size, direct-mapped
Summarizing
For the instruction cache:
85% (average 62%) reduction in cache size
72% (average 37%) reduction in cache access time
75% (average 47%) reduction in energy consumption

For the data cache:
78% (average 49%) reduction in cache size
36% (average 21%) reduction in cache access time
67% (average 52%) reduction in energy consumption
when compared with an 8KB L-1 instruction cache and an 8KB L-1 unified data cache with a 32KB level-2 cache
Generalization
Why not extend Array/Scalar split caches to more than 2 partitions?
Each partition customized to a specific object type
Partitioning can be achieved using multiple decoders with a single cache resource (virtual partitioning)
Reconfigurable partitions are possible with programmable decoders
Each decoder accesses a portion of the cache:
either physically restricted to a segment of the cache
or virtually limited in the number of lines it can access
Scratchpad memories can be viewed as cache partitions
Dedicate a segment of the cache to the scratchpad
Scratch Pad Memories
They are viewed as compiler-controlled memories:
as fast as L-1 caches, but not managed as caches
Compiler decides which data will reside in scratch pad memory
A new paper from Maryland proposes a way of compiling programs for unknown-sized scratchpad memories
Only stack data (static and global variables) are placed in the SPM
The compiler views the stack as two stacks:
a potential SPM data stack and a DRAM data stack
Current and Future Research
Extensive study of using Multiple Decoders
Separate decoders for different data structures
partitioning of L-1 caches

Separate decoders for different threads and cores
at L-2 or last-level caches
minimize conflicts
minimize coherency-related misses
minimize loss due to non-uniform memory access delays

Investigate additional indexing or programmable-associativity ideas
Cooperative L-2 caches using adaptive caches
Program Analysis Tool
We need tools to profile and analyze
• Data layout at various levels of the memory hierarchy
• Data access patterns
• Existing tools (Valgrind, Pin) do not provide fine-grained information
• We want to relate each memory access back to a source-level construct:
the source variable name and the function/thread that caused the access
Gleipnir
Our tool is built on top of Valgrind
It can be used with any architecture supported by Valgrind:
x86, PPC, MIPS, and ARM
Gleipnir
How can we use Gleipnir?
Explore different data layouts and their impact on cache accesses
Gleipnir
Standard layout
Gleipnir
Tiled matrices
Gleipnir
Matrices A and C combined
Further Research
• Restructuring memory allocation – currently in progress
- Analyze cache set conflicts and relate them to data objects
- Modify data placement of these objects
- Reorder variables, include dummy variables, …
• Restructure code to improve data access patterns (SLO tool)
- Loop fusion – combine loops that use the same data
- Loop tiling – split loops into smaller loops to limit the data accessed
- Similar techniques to assure “common” data resides in L-2 (shared caches)
- Similar techniques so that data is transferred to GPUs infrequently

Loop tiling idea: too much data is accessed in the loop
Code Refactoring
double sum(…) {
  …
  for (int i = 0; i < len; i++)
    result += X[i];
  …
}
all cache misses occur here.
Code Refactoring
Loop Fusion Idea
double inproduct(…) {
  …
  for (int i = 0; i < len; i++)
    result += X[i]*Y[i];
  …
}

double sum(…) {
  …
  for (int i = 0; i < len; i++)
    result += X[i];
  …
}
previous use occur here.
all cache misses occur here.
SLO Tool
double inproduct(…) {
  …
  for (int i = 0; i < len; i++)
    result += X[i]*Y[i];
  …
}

double sum(…) {
  …
  for (int i = 0; i < len; i++)
    result += X[i];
  …
}
Extensions Planned
Key Factors Influencing Code and Data Refactoring
Reuse distance – reducing distance improves data utilization
Can be used with CPU–GPU configurations
Fuse loops so that all computations using the “same” data are grouped

Conflict sets and conflict distances
The set of variables that fall into the same cache line (or group of lines)
Conflict between pairs of conflicting variables
Increase conflict distance
Further Research
We are currently investigating several of these ideas
Using architectural simulators like Simics:
explore multiple decoders with multiple threads, multiple cores, or different data types

Further extend Gleipnir:
explore using Gleipnir with compilers
and Gleipnir with other tools like SLO; evaluate the effectiveness of custom allocators
Some hardware implementations of memory management using FPGAs
And we welcome collaborations
The End
Questions?
More information and papers at http://csrl.cse.unt.edu/~kavi
Custom Memory Allocators
Consider a typical pointer-chasing program

node {
  int key;
  … data;    /* complex data part */
  node *next;
}

We will explore two possibilities:
pool allocation
split structures
Custom Memory Allocators
• Pool Allocator (Illinois)
[Diagram: a single heap with interleaved objects of data types A and B, versus pool allocation with a separate heap pool for each data type.]
Custom Memory Allocators
Further Optimization
Consider a typical pointer-chasing program
node {
  int key;
  … data;    /* complex data part */
  node *next;
}
The data part is accessed only if key matches
while (…) {
  if (h->key == k) return h->data;
  h = h->next;
}
Consider a different definition of the data
node { int key; node *next; data_node *data_ptr; }
[Diagram: a list of small nodes, each holding key, *next, and *data_ptr, with each *data_ptr pointing to a separate data_node that holds the cold payload.]
Custom Memory Allocators
Profiling (UNT): using data profiling, “flatten” dynamic data into consecutive blocks
Make linked lists look like arrays!
Cache Based Side-Channel Attacks
Encryption algorithms use keys (or blocks of the key) as index into tables containing constants used in the algorithm
By observing which table entries caused cache misses, an attacker can find the address of the table entry,
and then find the value of the key that was used
Z. Wang and R. Lee. “New cache designs for thwarting software cache based side channel attacks”, ISCA 2007, pp 494-505
Two solutions: 1. Lock cache lines (cannot be displaced) when using encryption
2. Use a random replacement policy in selecting which line of a set is replaced
Offloading Memory Management Functions
1. Dynamic memory management is the management of main memory for use by programs during runtime
2. Dynamic memory management accounts for a significant amount of execution time – 42% for 197.parser (from the SPEC 2000 benchmarks)
3. If CPU is performing memory management, CPU cache will perform poorly due to switching between user functions and memory management functions
4. If we have a separate hardware and separate cache for memory management, CPU cache performance can be improved dramatically
Offloading Memory Management Functions
[Diagram: the CPU, with its instruction and data caches, connects through a bus interface unit (BIU) to the system bus; a separate memory processor (MP) with its own instruction cache, data cache, and second-level cache exchanges “Allocation Ready” and “De-All Completion” signals with the CPU.]
Improved Performance
• Object-oriented and linked-data-structure applications exhibit poor locality
Cache pollution caused by memory management functions
• Memory management functions do not use user data caches
On average, about 40% of cache misses eliminated
• The memory manager does not need large data caches
Improved Execution Performance
Benchmark   | % of cycles spent on malloc | Instructions (conventional architecture) | Instructions (separate hardware) | % performance increase (separate hardware) | % performance increase (fastest separate hardware)
255.vortex  | 0.59  | 13,020,462,240 | 12,983,022,203 | 2.81  | 2.90
164.gzip    | 0.04  | 4,540,660      | 4,539,765      | 0.031 | 0.0346
197.parser  | 17.37 | 2,070,861,403  | 1,616,890,742  | 3.19  | 18.8
espresso    |       |                |                |       |
Cfrac       | 31.17 | 599,365        | 364,679        | 19.03 | 39.99
bisort      | 2.08  | 620,560,644    | 607,122,284    | 10.03 | 12.76
Other Uses of Hardware Memory Manager
Dynamic relocation of objects to improve locality
The hardware manager can track object usage and relocate objects without the CPU's knowledge

New and innovative allocation/garbage collection methods:
Estranged Buddy allocator
Contaminated garbage collector
Predictive allocation to achieve “one-cycle” allocation
Allocator bookkeeping data kept separate from objects