Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Greg Stitt, Dept. of ECE, University of Florida
Frank Vahid, Dept. of CS&E, University of California, Riverside
Also with the Center for Embedded Computer Systems, UC Irvine

This research was supported in part by the National Science Foundation and the Semiconductor Research Corporation
2/30
Binary Translation

Background: motivated by commercial dynamic binary translation of the early 2000s
e.g., Transmeta Crusoe "code morphing": x86 binary → binary "translation" → VLIW µP, for performance
Warp processing (Lysecky/Stitt/Vahid 2003-2007): dynamically translate binary to circuits on FPGAs
Binary → on-chip CAD → µP + FPGA, for performance
3/30
Warp Processing Background

Architecture: µP (with I Mem and D$), profiler, on-chip CAD, FPGA

1. Initially, software binary loaded into instruction memory

Software Binary:
Mov reg3, 0
Mov reg4, 0
loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4
4/30
Warp Processing Background

2. Microprocessor executes instructions in the software binary, spending time and energy on the µP
5/30
Warp Processing Background

3. Profiler monitors instructions and detects critical regions in the binary (e.g., the frequently executed Add/Beq loop above)

Critical Loop Detected
6/30
Warp Processing Background

4. On-chip CAD reads in the critical region
7/30
Warp Processing Background

5. On-chip CAD decompiles the critical region into a control/data flow graph (CDFG)

Decompiled CDFG:
reg3 := 0
reg4 := 0
loop:
reg4 := reg4 + mem[reg2 + (reg3 << 1)]
reg3 := reg3 + 1
if (reg3 < 10) goto loop
ret reg4
Decompilation is surprisingly effective at recovering high-level program structures (Stitt et al., ICCAD'02, DAC'03, CODES/ISSS'05, ICCAD'05, FPGA'05, TODAES'06, TODAES'07)
Recovers loops, arrays, subroutines, etc. – needed to synthesize good circuits
8/30
Warp Processing Background

6. On-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit – e.g., an adder tree
9/30
Warp Processing Background

7. On-chip CAD maps the circuit onto the FPGA (CLBs connected through switch matrices)

Lean place & route and a simplified FPGA fabric yield 10x faster CAD (Lysecky et al., DAC'03, ISSS/CODES'03, DATE'04, DAC'04, DATE'05, FCCM'05, TODAES'06)
On multi-core chips, one powerful core can run the CAD
10/30
Warp Processing Background

8. On-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to "warp" by an order of magnitude or more

Updated Software Binary:
Mov reg3, 0
Mov reg4, 0
loop:
// instructions that interact with FPGA
Ret reg4

>10x speedups for some apps versus software-only execution
11/30
Warp Scenarios

Warping takes time – when is it useful?

Long-running applications (scientific computing, etc.): the µP runs while the on-chip CAD works; once the FPGA is configured, µP + FPGA execution yields a speedup within a single execution
Recurring applications (save FPGA configurations): common in embedded systems; the first execution runs on the µP while the on-chip CAD works, and later executions reuse the saved configuration – might view warping as a (long) boot phase

Possible platforms: Xilinx Virtex II Pro, Altera Excalibur, Cray XD1, SGI Altix, Intel QuickAssist, ...
12/30
Thread Warping - Overview

Multi-core platforms run multi-threaded apps; the compiler produces a standard binary containing loops such as:

for (i = 0; i < 10; i++) {
   thread_create( f, i );
}

OS schedules threads onto available µPs
Remaining threads added to queue
OS invokes on-chip CAD tools to create accelerators for f() (saved in an accelerator library)
OS schedules threads onto accelerators (possibly dozens), in addition to µPs

Thread warping: use one core to create accelerators for waiting threads
Very large speedups possible – parallelism at the bit and arithmetic levels, and now the thread level too (x86 → VLIW → thread warping)
13/30
Thread Warping Tools

Invoked by the OS; uses the pthread library (POSIX), with mutexes/semaphores for synchronization
Defined methods/algorithms of a thread warping framework:

Thread Queue → Queue Analysis → Thread Functions and Thread Counts
If a thread function is not in the Accelerator Library: Accelerator Synthesis (Decompilation → Memory Access Synchronization → High-level Synthesis → Netlist → Place & Route → Bitfile), plus Hw/Sw Partitioning (Hw/Sw split → Binary Updater → Updated Binary), producing a Thread Group Table and a Schedulable Resource List
Once accelerators are synthesized: Accelerator Instantiation onto the FPGA; done
14/30
Memory Access Synchronization (MAS)

Must deal with the widely known memory bottleneck problem: FPGAs are great, but often can't get data to them fast enough – DMA traffic for dozens of threads can create a bottleneck at the RAM

for (i = 0; i < 10; i++) {
   thread_create( f, a, i );
}

void f( int a[], int val ) {
   int result;
   for (i = 0; i < 10; i++) {
      result += a[i] * val;
   }
   . . . .
}

Threaded programs exhibit a unique feature: multiple threads often access the same data (here, the same array a)
Solution: fetch data once, broadcast to multiple threads (MAS)
15/30
Memory Access Synchronization (MAS)

1) Identify thread groups – loops that create threads:

for (i = 0; i < 100; i++) {
   thread_create( f, a, i );
}

void f( int a[], int val ) {
   int result;
   for (i = 0; i < 10; i++) {
      result += a[i] * val;
   }
   . . . .
}

2) Identify constant memory addresses in the thread function: def-use analysis of the parameters shows a is constant for all threads, so the addresses of a[0-9] are constant for the thread group

3) Synthesis creates a "combined" memory access, with execution synchronized (enabled) by the OS: data is fetched once over DMA and delivered to the entire group
Before MAS: 1000 memory accesses; after MAS: 100 memory accesses
16/30
Memory Access Synchronization (MAS)

MAS also detects overlapping memory regions – "windows": each thread accesses different addresses, but the addresses may overlap

for (i = 0; i < 100; i++) {
   thread_create( f, a, i );
}

void f( int a[], int i ) {
   int result;
   result += a[i]+a[i+1]+a[i+2]+a[i+3];
   . . . .
}

Synthesis creates an extended "smart buffer" [Guo/Najjar FPGA'04] that caches reused data and delivers windows to threads: a[] is streamed from RAM into the buffer, which delivers a window (a[0-3], a[1-4], ..., a[6-9], ...) to each thread in the group, enabled by the OS
Without smart buffer: 400 memory accesses; with smart buffer: 104 memory accesses
17/30
Framework

Also developed initial algorithms for: queue analysis, accelerator instantiation, and OS scheduling of threads to accelerators and cores
18/30
Thread Warping Example

int main( ) {
   . . . .
   for (i = 0; i < 50; i++) {
      thread_create( filter, a, b, i );
   }
   . . . .
}

void filter( int a[53], int b[50], int i ) {
   b[i] = avg( a[i], a[i+1], a[i+2], a[i+3] );
}

Thread functions in the queue: filter()
filter() threads execute on available cores; remaining threads added to the queue
OS invokes CAD (due to queue size, or periodically); queue analysis identifies filter() for synthesis
19/30
Example

CAD reads the filter() binary
Decompilation produces a CDFG
MAS detects the thread group and the overlapping windows
20/30
Example

High-level synthesis creates a pipelined accelerator for the filter() group: two adds feed a third add, which feeds a >>2
8 accelerator copies are instantiated and loaded into the FPGA, fed from RAM through a smart buffer
The accelerator is stored in the Accelerator Library for future use
21/30
Example

The smart buffer streams a[0-52] from RAM
After the buffer fills, it delivers a window (a[2-5] through a[9-12]) to each of the eight accelerators
OS schedules threads to the accelerators and enables the group
22/30
Example

Each cycle, the smart buffer delivers eight more windows (a[10-13] through a[17-20]) – the pipeline remains full
23/30
Example

After the pipeline latency passes, the accelerators produce 8 outputs per cycle (b[2-9])
24/30
Example

An additional 8 outputs arrive each cycle (b[10-17])
Thread warping: 8 pixel outputs per cycle; software: 1 pixel output every ~9 cycles – a 72x cycle-count improvement (8 pixels/cycle × ~9 cycles/pixel)
25/30
Experiments to Determine Thread Warping Performance: Simulator Setup

Parallel Execution Graph (PEG) – represents thread-level parallelism
Nodes: sequential execution blocks (SEBs); edges: pthread calls

1) Generate the PEG using pthread wrappers
2) Determine SEB performances – Sw: SimpleScalar; Hw: synthesis/simulation (Xilinx)
3) Event-driven simulation – use the defined algorithms to change the architecture dynamically
4) Complete when all SEBs are simulated; observe total cycles

Simulation summary: optimistic for Sw execution (no memory contention); pessimistic for warped execution (accelerators/microprocessors execute exclusively)
26/30
Experiments

Benchmarks: image processing, DSP, scientific computing – highly parallel examples chosen to illustrate thread warping potential; we created multithreaded versions
Base architecture: 4 ARM11 cores at 400 MHz; focus on recurring applications (embedded)
Thread warping (TW) architecture: 4 ARM11 cores at 400 MHz + on-chip CAD + FPGA running at whatever frequency synthesis determines
27/30
Speedup from Thread Warping

Benchmarks: Fir, Prewitt, Linear, Moravec, Wavelet, Maxfilter, 3DTrans, N-body (plus average and geometric mean)
Compared against 4-µP, 8-µP, 16-µP, 32-µP, and 64-µP systems
Average 130x speedup over the 4-µP base; per-benchmark speedups vary widely (e.g., 38x to 502x)
11x faster than the 64-core system; the simulation is pessimistic, so actual results are likely better
But the FPGA uses additional area, so we also compare to systems with 8 to 64 ARM11 µPs – the FPGA's size is roughly 36 ARM11s
28/30
Limitations

Dependent on coding practices; assumes a boss/worker thread model
Not all apps are amenable to FPGA speedup
Commercial CAD is slow – warping takes time; but in the worst case, the FPGA is simply not used by the application
29/30
Why not Partition Statically?

Static is good, but hiding the FPGA opens the technique to all sw platforms: standard languages/tools/binaries
Static thread synthesis: specialized language → specialized compiler → binary + netlist → µP + FPGA
Dynamic thread warping: any language → any compiler → binary → µP + on-chip CAD + FPGA
Can add an FPGA without changing binaries – like expanding memory, or adding processors to a multiprocessor
Can adapt to changing workloads: smaller & more accelerators, fewer & larger accelerators, ...
Memory-access synchronization is also applicable to the static approach
30/30
Conclusions

The thread warping framework dynamically synthesizes accelerators for thread functions
Memory access synchronization helps reduce the memory bottleneck problem
130x speedups for the chosen examples
Future work: handle a wider variety of coding constructs, improve support for different thread models, and numerous open problems – e.g., dynamic reallocation of FPGA resources