Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Greg Stitt, Dept. of ECE, University of Florida
Frank Vahid, Dept. of CS&E, University of California, Riverside
Also with the Center for Embedded Computer Systems, UC Irvine

This research was supported in part by the National Science Foundation and the Semiconductor Research Corporation
2/30
Binary Translation

Background: motivated by commercial dynamic binary translation of the early 2000s
e.g., Transmeta Crusoe "code morphing": x86 binary → binary "translation" → VLIW µP, for performance
Warp processing (Lysecky/Stitt/Vahid 2003-2007): dynamically translate binary to circuits on FPGAs
Binary → on-chip CAD → µP + FPGA, for performance
3/30
Warp Processing Background

Architecture: µP (with I Mem and D$), profiler, on-chip CAD, FPGA

1. Initially, software binary loaded into instruction memory

Software Binary:
Mov reg3, 0
Mov reg4, 0
loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4
4/30
Warp Processing Background

2. Microprocessor executes instructions in the software binary, spending time and energy on the µP
5/30
Warp Processing Background

3. Profiler monitors instructions and detects critical regions in the binary (e.g., the frequently executed Add/Beq loop above)

Critical Loop Detected
6/30
Warp Processing Background

4. On-chip CAD reads in the critical region
7/30
Warp Processing Background

5. On-chip CAD decompiles the critical region into a control/data flow graph (CDFG)

Decompiled CDFG:
reg3 := 0
reg4 := 0
loop:
reg4 := reg4 + mem[reg2 + (reg3 << 1)]
reg3 := reg3 + 1
if (reg3 < 10) goto loop
ret reg4
Decompilation is surprisingly effective at recovering high-level program structures (Stitt et al., ICCAD'02, DAC'03, CODES/ISSS'05, ICCAD'05, FPGA'05, TODAES'06, TODAES'07)
Recovers loops, arrays, subroutines, etc. – needed to synthesize good circuits
8/30
Warp Processing Background

6. On-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit – e.g., an adder tree
9/30
Warp Processing Background

7. On-chip CAD maps the circuit onto the FPGA (CLBs connected through switch matrices)

Lean place & route and a simplified FPGA fabric yield 10x faster CAD (Lysecky et al., DAC'03, ISSS/CODES'03, DATE'04, DAC'04, DATE'05, FCCM'05, TODAES'06)
On multi-core chips, one powerful core can run the CAD
10/30
Warp Processing Background

8. On-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to "warp" by an order of magnitude or more

Updated Software Binary:
Mov reg3, 0
Mov reg4, 0
loop:
// instructions that interact with FPGA
Ret reg4

>10x speedups for some apps versus software-only execution
11/30
Warp Scenarios

Warping takes time – when is it useful?

Long-running applications (scientific computing, etc.): the µP runs while the on-chip CAD works; once the FPGA is configured, µP + FPGA execution yields a speedup within a single execution
Recurring applications (save FPGA configurations): common in embedded systems; the first execution runs on the µP while the on-chip CAD works, and later executions reuse the saved configuration – might view warping as a (long) boot phase

Possible platforms: Xilinx Virtex II Pro, Altera Excalibur, Cray XD1, SGI Altix, Intel QuickAssist, ...
12/30
Thread Warping - Overview

Multi-core platforms run multi-threaded apps; the compiler produces a standard binary containing loops such as:

for (i = 0; i < 10; i++) {
   thread_create( f, i );
}

OS schedules threads onto available µPs
Remaining threads added to queue
OS invokes on-chip CAD tools to create accelerators for f() (saved in an accelerator library)
OS schedules threads onto accelerators (possibly dozens), in addition to µPs

Thread warping: use one core to create accelerators for waiting threads
Very large speedups possible – parallelism at the bit and arithmetic levels, and now the thread level too (x86 → VLIW → thread warping)
13/30
Thread Warping Tools

Invoked by the OS; uses the pthread library (POSIX), with mutexes/semaphores for synchronization
Defined methods/algorithms of a thread warping framework:

Thread Queue → Queue Analysis → Thread Functions and Thread Counts
If a thread function is not in the Accelerator Library: Accelerator Synthesis (Decompilation → Memory Access Synchronization → High-level Synthesis → Netlist → Place & Route → Bitfile), plus Hw/Sw Partitioning (Hw/Sw split → Binary Updater → Updated Binary), producing a Thread Group Table and a Schedulable Resource List
Once accelerators are synthesized: Accelerator Instantiation onto the FPGA; done
14/30
Memory Access Synchronization (MAS)

Must deal with the widely known memory bottleneck problem: FPGAs are great, but often can't get data to them fast enough – DMA traffic for dozens of threads can create a bottleneck at the RAM

for (i = 0; i < 10; i++) {
   thread_create( f, a, i );
}

void f( int a[], int val ) {
   int result;
   for (i = 0; i < 10; i++) {
      result += a[i] * val;
   }
   . . . .
}

Threaded programs exhibit a unique feature: multiple threads often access the same data (here, the same array a)
Solution: fetch data once, broadcast to multiple threads (MAS)
15/30
Memory Access Synchronization (MAS)

1) Identify thread groups – loops that create threads:

for (i = 0; i < 100; i++) {
   thread_create( f, a, i );
}

void f( int a[], int val ) {
   int result;
   for (i = 0; i < 10; i++) {
      result += a[i] * val;
   }
   . . . .
}

2) Identify constant memory addresses in the thread function: def-use analysis of the parameters shows a is constant for all threads, so the addresses of a[0-9] are constant for the thread group

3) Synthesis creates a "combined" memory access, with execution synchronized (enabled) by the OS: data is fetched once over DMA and delivered to the entire group
Before MAS: 1000 memory accesses; after MAS: 100 memory accesses
16/30
Memory Access Synchronization (MAS)

MAS also detects overlapping memory regions – "windows": each thread accesses different addresses, but the addresses may overlap

for (i = 0; i < 100; i++) {
   thread_create( f, a, i );
}

void f( int a[], int i ) {
   int result;
   result += a[i]+a[i+1]+a[i+2]+a[i+3];
   . . . .
}

Synthesis creates an extended "smart buffer" [Guo/Najjar FPGA'04] that caches reused data and delivers windows to threads: a[] is streamed from RAM into the buffer, which delivers a window (a[0-3], a[1-4], ..., a[6-9], ...) to each thread in the group, enabled by the OS
Without smart buffer: 400 memory accesses; with smart buffer: 104 memory accesses
17/30
Framework

Also developed initial algorithms for: queue analysis, accelerator instantiation, and OS scheduling of threads to accelerators and cores
18/30
Thread Warping Example

int main( ) {
   . . . .
   for (i = 0; i < 50; i++) {
      thread_create( filter, a, b, i );
   }
   . . . .
}

void filter( int a[53], int b[50], int i ) {
   b[i] = avg( a[i], a[i+1], a[i+2], a[i+3] );
}

Thread functions in the queue: filter()
filter() threads execute on available cores; remaining threads added to the queue
OS invokes CAD (due to queue size, or periodically); queue analysis identifies filter() for synthesis
19/30
Example

CAD reads the filter() binary
Decompilation produces a CDFG
MAS detects the thread group and the overlapping windows
20/30
Example

High-level synthesis creates a pipelined accelerator for the filter() group: two adds feed a third add, which feeds a >>2
8 accelerator copies are instantiated and loaded into the FPGA, fed from RAM through a smart buffer
The accelerator is stored in the Accelerator Library for future use
21/30
Example

The smart buffer streams a[0-52] from RAM
After the buffer fills, it delivers a window (a[2-5] through a[9-12]) to each of the eight accelerators
OS schedules threads to the accelerators and enables the group
22/30
Example

Each cycle, the smart buffer delivers eight more windows (a[10-13] through a[17-20]) – the pipeline remains full
23/30
Example

After the pipeline latency passes, the accelerators produce 8 outputs per cycle (b[2-9])
24/30
Example

An additional 8 outputs arrive each cycle (b[10-17])
Thread warping: 8 pixel outputs per cycle; software: 1 pixel output every ~9 cycles – a 72x cycle-count improvement (8 pixels/cycle × ~9 cycles/pixel)
25/30
Experiments to Determine Thread Warping Performance: Simulator Setup

Parallel Execution Graph (PEG) – represents thread-level parallelism
Nodes: sequential execution blocks (SEBs); edges: pthread calls

1) Generate the PEG using pthread wrappers
2) Determine SEB performances – Sw: SimpleScalar; Hw: synthesis/simulation (Xilinx)
3) Event-driven simulation – use the defined algorithms to change the architecture dynamically
4) Complete when all SEBs are simulated; observe total cycles

Simulation summary: optimistic for Sw execution (no memory contention); pessimistic for warped execution (accelerators/microprocessors execute exclusively)
26/30
Experiments

Benchmarks: image processing, DSP, scientific computing – highly parallel examples chosen to illustrate thread warping potential; we created multithreaded versions
Base architecture: 4 ARM11 cores at 400 MHz; focus on recurring applications (embedded)
Thread warping (TW) architecture: 4 ARM11 cores at 400 MHz + on-chip CAD + FPGA running at whatever frequency synthesis determines
27/30
Speedup from Thread Warping

Benchmarks: Fir, Prewitt, Linear, Moravec, Wavelet, Maxfilter, 3DTrans, N-body (plus average and geometric mean)
Compared against 4-µP, 8-µP, 16-µP, 32-µP, and 64-µP systems
Average 130x speedup over the 4-µP base; per-benchmark speedups vary widely (e.g., 38x to 502x)
11x faster than the 64-core system; the simulation is pessimistic, so actual results are likely better
But the FPGA uses additional area, so we also compare to systems with 8 to 64 ARM11 µPs – the FPGA's size is roughly 36 ARM11s
28/30
Limitations

Dependent on coding practices; assumes a boss/worker thread model
Not all apps are amenable to FPGA speedup
Commercial CAD is slow – warping takes time; but in the worst case, the FPGA is simply not used by the application
29/30
Why not Partition Statically?

Static is good, but hiding the FPGA opens the technique to all sw platforms: standard languages/tools/binaries
Static thread synthesis: specialized language → specialized compiler → binary + netlist → µP + FPGA
Dynamic thread warping: any language → any compiler → binary → µP + on-chip CAD + FPGA
Can add an FPGA without changing binaries – like expanding memory, or adding processors to a multiprocessor
Can adapt to changing workloads: smaller & more accelerators, fewer & larger accelerators, ...
Memory-access synchronization is also applicable to the static approach
30/30
Conclusions

The thread warping framework dynamically synthesizes accelerators for thread functions
Memory access synchronization helps reduce the memory bottleneck problem
130x speedups for the chosen examples
Future work: handle a wider variety of coding constructs, improve support for different thread models, and numerous open problems – e.g., dynamic reallocation of FPGA resources