Efficient Execution of Compressed Programs · 2011-03-08
TRANSCRIPT
Efficient Execution of Compressed Programs
Charles Lefurgy
http://www.eecs.umich.edu/compress
Advanced Computer Architecture Laboratory, Electrical Engineering and Computer Science Dept.
The University of Michigan, Ann Arbor
2
The problem
[Diagram: embedded microprocessor die with CPU, I/O, RAM, and program ROM]
• Microprocessor die cost
– Low cost is critical for high-volume, low-margin embedded systems
– Control cost by reducing area and increasing yield
• Increasing amount of on-chip memory
– Memory is 40-80% of die area [ARM, MCore]
– In control-oriented embedded systems, much of this is program memory
• How can program memory be reduced without sacrificing performance?
3
Solution
[Diagram: an embedded system before and after compression; the compressed program occupies a smaller ROM alongside the same CPU, RAM, and I/O]
• Code compression
– Reduce compiled code size
– Compress at compile-time
– Decompress at run-time
• Implementation
– Hardware or software?
– Code size?
– Execution speed?
4
Research contributions
• HW decompression [MICRO-32, MICRO-30, CASES-98]
– Dictionary compression method
– Analysis of IBM CodePack algorithm
– HW decompression increases performance
• SW decompression [HPCA-6, CASES-99]
– Near native code performance for media applications
– Software-managed cache
– Hybrid program optimization (new profiling method)
– Memoization optimization
5
Outline
Compression overview and metrics
HW decompression overview
SW decompression
• Compression algorithms
– Dictionary
– CodePack
• Hardware support
• Performance study
• Optimizations
– Hybrid programs
– Memoization
6
Why are programs compressible?
• Ijpeg benchmark (MIPS gcc 2.7.2.3 -O2)
– 49,566 static instructions
– 13,491 unique instructions
– 1% of unique instructions cover 29% of static instructions
[Chart: frequency of each of the 13,491 unique instruction bit patterns, log scale; the most frequent pattern, jalr $31,$2, appears 810 times]
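The skew can be checked on any binary by counting repeated 32-bit words. A minimal Python sketch; the instruction stream below is made up for illustration (the real statistics above come from ijpeg compiled with gcc):

```python
from collections import Counter

def pattern_stats(words):
    """Count how often each 32-bit instruction bit pattern repeats."""
    counts = Counter(words)
    total = len(words)
    unique = len(counts)
    # Coverage of the most frequent 1% of unique patterns (at least one).
    top = max(1, unique // 100)
    covered = sum(n for _, n in counts.most_common(top))
    return unique, covered / total

# Toy instruction stream: a few patterns repeated many times.
stream = [0x03E00008] * 50 + [0x8C430000] * 30 + list(range(20))
unique, coverage = pattern_stats(stream)
print(unique, coverage)
```

A handful of patterns covering much of the static code is exactly what makes a dictionary-based encoding pay off.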
7
Evaluation metrics
• Size
• Decode efficiency
– Measure program execution time

compression ratio = compressed size / original size

[Chart: compression ratio shown as compressed program size relative to the original program, 0%-100%]
8
Previous results
• Many compression methods
– Compression unit: instruction, cache line, or procedure
• Difficult to compare previous results
– Studies use different instruction sets and benchmarks
• Many studies do not measure execution time
• Wire code
– Small size: a goal for embedded systems
– Problem: no random access, must decompress entire program at once

Who       | Instruction set | Compression ratio | Comment
Thumb     | ARM             | 70%               | 16-bit instruction subset of 32-bit ISA
MIPS-16   | MIPS            | 60%               | 16-bit instruction subset of 32-bit ISA
CodePack  | PowerPC         | 60%               | Cache line compression
Wolfe     | MIPS            | 73%               | Cache line, Huffman
Lekatsas  | MIPS, x86       | 50%, 80%          | Cache line, stream division, Huffman
Araujo    | MIPS            | 43%               | Cache line, operand factorization, Huffman
Liao      | TMS320C25       | 82%               | Procedure abstraction
Kirovski  | SPARC           | 60%               | Procedure compression
Ernst     | SPARC           | 20%               | Interpreted wire code
Hardware decompression
10
CodePack
• Overview
– IBM
– PowerPC instruction set
– 60% compression ratio, ±10% performance [IBM]
• Performance gain due to prefetching
• Implementation
– Binary executables are compressed after compilation
– Decompression during instruction cache miss
• Instruction cache holds native code
• Decompress two cache lines at a time (16 insns)
– PowerPC core is unaware of compression
11
CodePack encoding
[Diagram: codeword formats for the upper and lower 16 bits of a 32-bit PowerPC instruction word. Each half is encoded as a 2- or 3-bit tag followed by an index into a dictionary; the tag selects the index width (dictionaries of 32-256 entries for the upper half, 8-256 for the lower half). The escape tag (1 1 1) is followed by the 16 raw bits, and the most common lower-half value, zero, gets its own short codeword.]
12
CodePack system
• CodePack is part of the memory system
– After the L1 instruction cache
• Dictionaries
– Contain 16-bit upper and lower halves of instructions
• Index table
– Maps instruction addresses to compressed code addresses

[Diagram: instruction memory hierarchy. The processor and instruction cache sit above the CodePack unit (decompressor, dictionaries, index table), which sits above main memory holding the compressed code]
13
CodePack decompression
[Diagram: decompression flow. The L1 I-cache miss address selects an entry in the index table (in main memory), which gives the byte-aligned address of the compression block (16 instructions). The compressed bytes are fetched from main memory; each codeword's high and low tags and indices select 16-bit halves from the high and low dictionaries, which are concatenated into one native instruction.]
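A simplified sketch of this flow, assuming fixed (hi, lo) index pairs instead of CodePack's variable-length codewords; the dictionaries, index table, and block layout below are invented for illustration:

```python
# Sketch of the CodePack decompression flow on an I-cache miss.
# Real CodePack decodes variable-length tag+index codewords; here each
# compressed instruction is a fixed (hi_index, lo_index) pair, and the
# index table holds element offsets rather than byte addresses.

BLOCK = 16  # instructions per compression block

def decompress_block(miss_addr, index_table, compressed, hi_dict, lo_dict):
    """Map a miss address to its compression block and rebuild native words."""
    block_no = miss_addr // (BLOCK * 4)          # 4 bytes per instruction
    start = index_table[block_no]                # start of compressed block
    native = []
    for hi_idx, lo_idx in compressed[start:start + BLOCK]:
        # Concatenate the 16-bit halves from the two dictionaries.
        native.append((hi_dict[hi_idx] << 16) | lo_dict[lo_idx])
    return native

hi_dict = [0x3C04, 0x8C82]
lo_dict = [0x0000, 0x1234]
compressed = [(0, 1), (1, 0)] * 8                # one 16-instruction block
index_table = [0]
out = decompress_block(0, index_table, compressed, hi_dict, lo_dict)
print(hex(out[0]))
```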
14
Instruction cache miss latency
• Native code uses critical word first
• Compressed code must be fetched sequentially

[Timeline diagram of an L1 cache miss, starting at t=0: (a) native code fetches instructions from main memory, critical word first; (b) compressed code fetches the index, then the codes from main memory, then spends decompression cycles before the critical instruction word is ready; (c) with optimizations, the index comes from an index cache and two decompressors overlap decompression of the remaining lines]
15
Comparison of optimizations
• Index cache provides largest benefit
• Optimizations
– index cache: 64 lines, 4 indices/line, fully assoc.
– 2nd decoder
• Speedup over native code: 0.97 to 1.05
• Speedup over CodePack: 1.17 to 1.25
[Chart: speedup over native code for cc1, go, perl, and vortex with plain CodePack, the index cache, the 2nd decoder, and both optimizations]
16
Hardware decompression conclusions
• Performance can be improved at modest cost
– Remove decompression overhead
• index lookup
• dictionary lookup
– Better memory bus utilization
• Compression can speed up execution
– Compressed code requires fewer main memory accesses
– CodePack includes simple prefetching
• Systems that benefit most from compression
– Narrow memory bus
– Slow memory
Software decompression
18
Software decompression
• Previous work
– Whole-program compression [Taunton91]
• Saved disk space
• No memory savings
– Procedure compression [Kirovski97]
• Requires large decompression memory
• Fragmentation of decompression memory
• Slow
• My work
– Decompression unit: 1 or 2 cache lines
– High-performance focus
19
How does code compression work?
• What is compressed?
– Individual instructions
• When is decompression performed?
– During I-cache miss
• How is decompression implemented?
– I-cache miss invokes exception handler
• What is decompressed?
– 1 or 2 cache lines
• Where are decompressed instructions stored?
– I-cache is the decompression buffer
[Diagram: system-on-a-chip. Main memory (eDRAM) holds the compressed program, index table, dictionary, decompressor program, and program data; the D-cache (SRAM) holds data, and decompressed instructions exist only in the I-cache (SRAM)]
20
Dictionary compression algorithm
• Goal: fast decompression
• Dictionary contains unique instructions
• Replace each program instruction with a short index
[Diagram: the original program's .text segment holds 32-bit instructions (lw r2,r3; lw r2,r3; lw r15,r3; ...); the compressed program's .text segment holds 16-bit indices (5, 5, 30, 30, 30) into a .dictionary segment that stores each unique 32-bit instruction once]
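The compressor can be sketched in a few lines; the instruction strings below are illustrative, and the indices are simple dictionary positions rather than the byte offsets shown in the diagram:

```python
def dict_compress(text):
    """Replace each 32-bit instruction with a 16-bit index into a
    dictionary holding the program's unique instructions."""
    dictionary, index_of, indices = [], {}, []
    for insn in text:
        if insn not in index_of:
            index_of[insn] = len(dictionary)
            dictionary.append(insn)
        indices.append(index_of[insn])
    # 16-bit codewords limit the dictionary to 65,536 unique instructions.
    assert len(dictionary) <= 1 << 16, "too many unique instructions"
    return indices, dictionary

text = ["lw r2,r3", "lw r2,r3", "lw r15,r3", "lw r15,r3", "lw r15,r3"]
indices, dictionary = dict_compress(text)
print(indices, dictionary)
```

The .text segment shrinks from five 32-bit words to five 16-bit indices, at the cost of a two-entry dictionary.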
21
Decompression
• Algorithm
1. I-cache miss invokes decompressor (exception handler)
2. Fetch index
3. Fetch dictionary word
4. Place instruction in I-cache (special instruction)
• Write directly into I-cache
• Decompressed instructions only exist in I-cache
[Diagram: on an I-cache miss, the processor traps to the decompressor, which fetches the index (e.g. 5) from memory, looks up the native instruction (e.g. add r1,r2,r3) in the dictionary, and writes it directly into the I-cache]
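A sketch of the exception handler, modeling the I-cache as a plain mapping of line numbers to lines; the line size is invented and the special store-instruction mechanism is abstracted into an ordinary assignment:

```python
LINE = 8  # instructions per I-cache line (illustrative)

def miss_handler(miss_addr, indices, dictionary, icache):
    """Software decompressor invoked by the I-cache miss exception:
    fetch the indices for one line, look each up in the dictionary,
    and write the native instructions directly into the I-cache."""
    line = miss_addr // LINE
    start = line * LINE
    icache[line] = [dictionary[i] for i in indices[start:start + LINE]]
    return icache[line]

dictionary = ["add r1,r2,r3", "lw r2,r3"]
indices = [1, 0, 1, 0, 0, 0, 1, 1]
icache = {}
print(miss_handler(0, indices, dictionary, icache)[0])
```

Because the decompressed line lives only in the I-cache, evicting it simply means the handler runs again on the next miss.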
22
Hardware support
• Decompression exception
– Raise exception on I-cache miss
– Exception not raised on native code sections (allows hybrid programs)
– Similar to Informing Memory [Horowitz98]
• Store-instruction instruction – MAJC, MIPS R10K
23
Two software decompressors
• Dictionary
– Faster
– Less compression
• CodePack
– A software version of IBM's CodePack hardware
– Slower
– More compression
                         Dictionary      CodePack
Codewords (indices)      Fixed-length    Variable-length
Decompress granularity   1 cache line    2 cache lines
Static instructions      43              174
Dynamic instructions     43              1042-1062
Decompression overhead   73-105 cycles   1235-1266 cycles
24
Compression ratio
• Compression ratio = compressed size / original size
– CodePack: 55% - 63%
– Dictionary: 65% - 82%
[Chart: compression ratio (0%-100%) of Dictionary and CodePack for cc1, ghostscript, go, ijpeg, mpeg2enc, pegwit, perl, and vortex]
25
Simulation environment
• SimpleScalar
• Pipeline: 5 stage, in-order
• I-cache: 4KB, 32B lines, 2-way
• D-cache: 8KB, 16B lines, 2-way
• Memory: embedded DRAM
– 10 cycle latency
– Bus width = 1 cache line
– 10x denser than SRAM caches
• Performance: slowdown = 1 / speedup (1 = native code)
• Area results include:
– Main memory to hold program (compressed bytes, tables)
– I-cache
– Memory for decompressor optimizations (memoization, native code)
26
Performance
• CodePack: very high overhead
• Reduce overhead by reducing cache misses
[Chart: Ghostscript slowdown vs. I-cache size for CodePack, Dictionary, and native code]

Slowdown (Ghostscript)   4KB     16KB    64KB
CodePack                 19.19   3.17    1.26
Dictionary               2.66    1.43    1.04
Native                   1.00    1.00    1.00
27
Cache miss
• Control slowdown by optimizing I-cache miss ratio
• A small change in miss ratio has a large performance impact
[Chart: slowdown relative to native code vs. I-cache miss ratio (0%-8%) for CodePack and Dictionary at 4KB, 16KB, and 64KB I-caches]
28
Two optimizations
• Hybrid programs (static)
– Both compressed and native code
• Memoization (dynamic)– Cache recent decompressions in main memory
• Both can be applied to any compression algorithm
[Diagram: program memory layouts. Original program: all native. Compressed: all compressed. Hybrid: native + compressed. Memoization: compressed + memo table. Combined: native + compressed + memo table]
29
Hybrid programs
• Selective compression
– Only compress some procedures
– Trade size for speed
– Avoid decompression overhead
• Profile methods
– Count dynamic instructions
• Example: ARM/Thumb
• Use when compressed code has more instructions
• Reduces number of executed instructions
– Count cache misses
• Example: CodePack
• Use when compressed code has longer cache miss latency
• Reduces cache miss latency
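One way to realize selective compression is a greedy pass over the profile: keep the procedures with the most cache misses in native form until the extra size budget runs out, and compress the rest. The procedure names, profile counts, and budget below are hypothetical:

```python
def select_native(procs, budget):
    """Greedy selective compression: leave the procedures with the most
    profiled cache misses in native form until the extra size budget
    (native size minus compressed size) is spent; compress the rest."""
    native = set()
    spent = 0
    for name, misses, size_native, size_comp in sorted(
            procs, key=lambda p: p[1], reverse=True):
        extra = size_native - size_comp
        if spent + extra <= budget:
            native.add(name)
            spent += extra
    return native

procs = [
    # (name, profiled cache misses, native bytes, compressed bytes)
    ("inner_loop", 9000, 400, 260),
    ("init",         10, 600, 350),
    ("parse",       500, 300, 200),
]
print(select_native(procs, budget=250))
```

Swapping the sort key from cache misses to dynamic instruction counts gives the other profile method from the slide.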
30
Cache miss profiling
• Cache miss profile reduces overhead by 50%
• Loop-oriented benchmarks benefit most
– Profiles differ most in loop regions
[Chart: Pegwit (encryption) slowdown relative to native code vs. compression ratio (60%-100%) for CodePack with dynamic-instruction profiling vs. cache-miss profiling]
31
Code placement
[Diagram: code placement in memory. Under whole-program compression, all code is compressed and decompressed regions exist only in the L1 cache; blocks may be decompressed in a different order than their original placement. Under selective compression, native regions are placed alongside the compressed code, and decompression preserves the original order]
32
CodePack vs. Dictionary
• More compression may give better performance
– CodePack produces smaller code than Dictionary compression
– Even with some native code, CodePack is smaller
– CodePack is faster because it can afford more native code
[Chart: Ghostscript slowdown relative to native code vs. compression ratio (60%-100%) for CodePack and Dictionary, both with cache-miss profiling]
33
Memoization
• Reserve main memory for caching decompressed insns.
– Use high-density DRAM to store more than the I-cache in less area
– Manage as a cache: data and tags
• Algorithm
– Decompressor checks memo table before decompressing
– On hit, copy instructions into I-cache (no decompression)
– On miss, decompress into I-cache and update memo table
[Diagram: without memoization, the decompressor maps an address directly to a decompressed cache line; with memoization, a tag comparison against the memo table can return the native cache line without decompressing]
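The algorithm above can be sketched as a direct-mapped memo table sitting in front of the decompressor; the table size, line contents, and mapping function here are simplifications:

```python
class MemoTable:
    """Memo table held in reserved main memory (DRAM): caches recently
    decompressed cache lines, managed with tags like a direct-mapped cache."""
    def __init__(self, lines):
        self.lines = lines
        self.tags = [None] * lines
        self.data = [None] * lines

    def lookup(self, line_addr):
        slot = line_addr % self.lines
        if self.tags[slot] == line_addr:
            return self.data[slot]        # hit: skip decompression
        return None

    def fill(self, line_addr, native_line):
        slot = line_addr % self.lines
        self.tags[slot] = line_addr
        self.data[slot] = native_line

def fetch_line(line_addr, memo, decompress):
    """Check the memo table before invoking the slow decompressor."""
    cached = memo.lookup(line_addr)
    if cached is not None:
        return cached                     # copy straight into the I-cache
    native = decompress(line_addr)        # slow path
    memo.fill(line_addr, native)
    return native

calls = []
def decompress(addr):
    calls.append(addr)
    return ["insn"] * 8

memo = MemoTable(4)
fetch_line(5, memo, decompress)
fetch_line(5, memo, decompress)           # second access hits the memo table
print(len(calls))
```

The second access avoids decompression entirely, which is where the large CodePack improvement comes from.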
34
Memoization results
• 16KB memoization table
• CodePack: large improvement
• Dictionary is already fast: small improvement
[Charts: slowdown per benchmark (cc1, ghostscript, go, ijpeg, mpeg2enc, pegwit, perl, vortex) with and without the memo table, for Dictionary (left) and CodePack (right)]
35
Combined
• Use memoization on hybrid programs
– Keep area constant. Partition memory for best solution.
• Combined solution is often the best

[Chart: slowdown per benchmark (cc1, ghostscript, go, ijpeg, mpeg2enc, pegwit, perl, vortex) for CodePack with a 32KB decompression buffer and memo tables of 0KB, 4KB, 8KB, 16KB, and 32KB; the 32KB memo case is not hybrid]
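The "partition memory for best solution" step can be sketched as a search over splits of a fixed area budget, using the slide deck's 10x DRAM density assumption; the slowdown model below is a made-up stand-in for real simulation results:

```python
def best_partition(area_kb, slowdown_model, density=10):
    """Enumerate splits of a fixed silicon area between SRAM I-cache and
    DRAM memo table (DRAM assumed `density`x denser than SRAM) and keep
    the split whose modeled slowdown is lowest."""
    best = None
    for cache_kb in range(1, area_kb):
        memo_kb = (area_kb - cache_kb) * density
        s = slowdown_model(cache_kb, memo_kb)
        if best is None or s < best[0]:
            best = (s, cache_kb, memo_kb)
    return best

# Illustrative model: bigger cache and memo both help, with diminishing returns.
model = lambda c, m: 1.0 + 8.0 / c + 4.0 / (1 + m)
print(best_partition(8, model))
```

In practice the model would be a table of simulated slowdowns per configuration, but the search structure is the same.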
36
Combined
• Adding DRAM is better than larger SRAM cache
[Chart: Ghostscript using CodePack, slowdown vs. compression ratio. At equal total area (9KB to 288KB), configurations that add a DRAM memo table (4.5KB-288KB) beat configurations that enlarge the SRAM I-cache (4KB-64KB)]
37
Conclusions
• High-performance SW decompression possible
– Dictionary faster than CodePack, but 5-25% compression ratio difference
– Hardware support
• I-cache miss exception
• Store-instruction instruction
• Tune performance– Cache size– Hybrid programs, memoization
• Hybrid programs
– Use cache miss profile for loop-oriented benchmarks
• Memoization
– No profile required, but has start-up latency
– Effective on large working sets
38
Future work
• Compiler optimization for compression
– Improve compression ratio without affecting performance
• Unify selective compression and code placement
– Reduce I-cache misses to improve performance
• Energy consumption
– Increased run-time may use too much energy
– Improving bus utilization saves energy
• Dynamic code generation: Crusoe, Dynamo– Memory management, code stitching
• Compression for high-performance– Lower L2 miss rate
39
Web page
http://www.eecs.umich.edu/compress