Hardware/Software Partitioning
Greg Stitt
ECE Department, University of Florida
Post on 24-Dec-2015
Introduction
FPGAs are often much faster than sw
But most real designs with FPGAs still use microprocessors. Why?
  FPGAs typically implement “kernels” efficiently
  Difficult/inefficient to implement an entire application as a custom circuit in an FPGA
Common case
  Implement performance-critical code in the FPGA
  Implement everything else on microprocessors
    Certain regions can afford to be slow
Hw/Sw Architectures
Hybrids/ASIPs
  Tensilica Xtensa is a uP with custom instructions in hw
  Stretch is similar, with an FPGA
  PipeRench, Warp processors, Chameleon, etc.
FPGAs
  FPGAs more commonly have microprocessor cores in the fabric
    Virtex-II Pro, Virtex-4 FX have PowerPCs
  Even if there are no uP cores in the fabric, a uP can be implemented on the FPGA - soft core uPs
    MicroBlaze, PicoBlaze, Nios
    Slow, but sometimes not a problem
High-Performance Computing
  Cray XD1 - AMDs/FPGAs
  SGI Altix - Xeons/FPGAs
Hardware/Software Partitioning
Definition: Given an application, hw/sw partitioning maps each region of the application onto hardware (custom circuits) or software (microprocessors)
  A partition is a mapping of each region to either hw or sw
Possible Goals
  Meet design constraints (performance, power, size, cost, etc.)
  Maximize performance
  Minimize power for a given performance constraint
  Etc.
Challenges
  Huge number of partitions for an application
    # of partitions = $2^n$, where n is the number of regions
    5 regions = 32 partitions, 100 regions = about $1.27 \times 10^{30}$ partitions!
  Clearly, we need efficient heuristics
Hardware/Software Partitioning
Issues to consider
  Granularity
    What type of regions to consider?
  Partition evaluation
    How to determine goodness of partitions?
  Alternative region implementations
  Implementation models
  Exploration
    How to quickly find a good partition?
Granularity
Definition: Measure of functionality considered for hw/sw
  Coarse grained regions - tasks, functions, loops
  Fine grained regions - blocks, statements, operations
Tradeoffs exist for coarse grained/fine grained
Coarse grained regions
  Simplifies partitioning (fewer regions)
  Possibly more accurate estimations (don’t have to combine a bunch of small regions)
  Possibly less inter-partition communication
    Hw/Sw communication is usually expensive
    May outweigh benefits of putting regions in hardware
Fine grained regions
  May take longer to find a good partition (more partitions to choose from)
  Estimation possibly more difficult
  But may provide a better solution
Granularity: Example
    /* c is an 8x8 coefficient table defined elsewhere (not shown on the slide);
       floor() requires <math.h>. */
    void Reference_IDCT(block)
    short *block;
    {
      int i, j, k, v;
      double partial_product;
      double tmp[64];

      for (i=0; i<8; i++)
        for (j=0; j<8; j++)
        {
          partial_product = 0.0;

          for (k=0; k<8; k++)
            partial_product+= c[k][j]*block[8*i+k];

          tmp[8*i+j] = partial_product;
        }

      for (j=0; j<8; j++)
        for (i=0; i<8; i++)
        {
          partial_product = 0.0;

          for (k=0; k<8; k++)
            partial_product+= c[k][i]*tmp[8*k+j];

          v = (int) floor(partial_product+0.5);
          block[8*i+j] = (v<-256) ? -256 : ((v>255) ? 255 : v);
        }
    }
Coarse grained: Functions and loops
+Few regions
+Easier estimation (less hw/sw communication)
-May not provide optimal partition (explores fewer possibilities)
Granularity: Example (same Reference_IDCT code as in the first example)
Fine grained: Statements
+Explores more partitions (may find better partition)
-Explores more partitions (takes much longer)
Granularity: Example (same Reference_IDCT code as in the first example)
Very fine grained: Individual Operations
+Most flexible (allows exploration of all possibilities)
-Huge number of regions
Etc.
Partition Evaluation
Responsible for determining the “goodness” of a partition
  Evaluates multiple design metrics
    Performance, power, area, etc.
  May use some cost function for representing goodness
    e.g. weighted average of multiple metrics
[Figure: an example partition assigning Loop1, Loop2, Quantize(), DCT(), and Huffman() to HW or SW, with evaluated metrics: Performance - 28.5s, Area - 62000 gates, Power - 2 watts]
Partition Evaluation
Input: Partition
Output: Design Metrics
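The weighted-average cost function mentioned above could be sketched as follows. This is only one possible form: the `Metrics` and `Weights` types, the normalization against a reference design, and the choice that lower cost is better are all assumptions made for illustration, not something the slides specify.

```c
/* Design metrics produced by partition evaluation (units are assumptions). */
typedef struct {
    double time_s;     /* execution time in seconds */
    double area_gates; /* hardware area in gates */
    double power_w;    /* power in watts */
} Metrics;

/* Hypothetical relative importance of each metric; must not all be zero. */
typedef struct {
    double w_time;
    double w_area;
    double w_power;
} Weights;

/* Weighted average of the metrics, each normalized by a reference design so
   that seconds, gates, and watts become comparable. Lower cost is better. */
double partition_cost(Metrics m, Metrics ref, Weights w)
{
    double total_weight = w.w_time + w.w_area + w.w_power;

    return (w.w_time  * (m.time_s     / ref.time_s) +
            w.w_area  * (m.area_gates / ref.area_gates) +
            w.w_power * (m.power_w    / ref.power_w)) / total_weight;
}
```

With equal weights, a partition whose metrics match the reference design scores 1.0; anything below 1.0 is an improvement on the weighted average.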
Partition Evaluation
Complicated problem
  Regions are not independent
    e.g. adding more regions to hw may seem to improve performance, but may require more steering logic, may lengthen the clock period, etc.
  Must consider effects of regions on each other
  Must consider many architectural issues
    e.g. communication time for hw-hw, hw-sw, sw-sw
    May be different for each architectural component
      e.g. heterogeneous microprocessors
2 possibilities for evaluation
  Implementation - actually implement each partition, determine design metrics
    Accurate, but slow
  Estimation - less accurate, but faster
Partition Evaluation: Implementation/Estimation
Evaluation techniques - many others
  Pure implementation
    Possible only for a small number of regions
  Pure estimation
    Likely inaccurate
  Hybrid approach 1
    Implement hardware/software for individual regions (ignore possible combinations)
    Characterize regions with performance/area
    Estimate changes when combining regions
  Hybrid approach 2
    Iterate by estimating goodness of partitions, with occasional implementations to verify estimates
  Hybrid approach 3
    Estimate some good partitions to reduce exploration space, implement those few partitions, choose best one
  Hybrid approach 4
    Combine estimation and implementation
      e.g. use “rough” synthesis to get hardware performance
Alternative Region Implementations
[Figure: application regions FIR(), ACCUM(), and SEARCH(), with Sw Time: 50s, Sw Time: 30s, and Sw Time: 20s; different sized shapes represent different hw implementations, each labeled with its execution time]
Possible Solutions:
  Use fastest implementations
  Use smallest implementations
  Consider all “middle” implementations
Performance of the candidate partitions: 5+30+20=55s, 25+15+10=50s, 10+15+20=45s (best partition)
Alternative Region Implementations
Issue: Hw regions can be implemented in many ways
Challenge 1: How to choose an implementation for each region?
  Making one region fast may make the partition slow
    May use area needed by other regions
  May need to choose a slow implementation to save area for other regions
  Must consider the entire partition for each change to each region
Challenge 2: Exploration space explodes!
  For 8 regions w/ 1 hw implementation, possible partitions = $2^8$ = 256
  For 8 regions w/ 4 hw implementations, possible partitions = $5^8$ = 390625 partitions!
    5 possible implementations for each region = 1 sw + 4 hw
Good solution: unknown
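The growth of the exploration space can be checked with a small helper. The sketch below assumes every region has the same number of hw implementations plus exactly one sw implementation; the function name `partition_count` and the convention of returning 0 on overflow are illustrative choices.

```c
#include <stdint.h>

/* Number of candidate partitions for n regions when each region has one
   software implementation plus hw_impls hardware implementations:
   (hw_impls + 1)^n. Returns 0 if the count does not fit in 64 bits. */
uint64_t partition_count(unsigned n, unsigned hw_impls)
{
    uint64_t choices = (uint64_t)hw_impls + 1; /* options per region */
    uint64_t count = 1;

    for (unsigned i = 0; i < n; i++) {
        if (count > UINT64_MAX / choices)
            return 0; /* too large to count exactly */
        count *= choices;
    }
    return count;
}
```

For 8 regions this gives 256 with one hw implementation and 390625 with four, matching the figures above; for 100 regions the count no longer fits in 64 bits and the function returns 0.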
Implementation Models
Implementation models define how microprocessors interface with hardware
  More possibilities, better solutions, but larger solution space
  Estimation techniques more difficult for complex models
Example 1: Communication methods
  Direct communication, using shared memory, tightly-coupled, etc.
[Figure: block diagram with Microprocessor, Cache, DMA, Bridge, and Memory; labels: Tightly-coupled, Loosely-coupled, Fused, Direct communication, Dynamically reconfigurable]
Implementation Models
Example 2: Execution models
  Mutually exclusive
    FPGA and uP never execute simultaneously
    May be appropriate for sequential applications
    Advantage: easier estimations
    Disadvantage: decreased performance
  Parallel
    Advantage: improved performance
    Disadvantage: estimates much more difficult
      Must take into account memory contention, cache coherency, synchronization, etc.
Exploration
Exploration searches the partition space for an optimal partition - realistically, must settle for a good partition
  Main step: represents the majority of hw/sw partitioning work
Highly dependent on formulation of problems
  A formulation is a particular instance of the discussed issues
    e.g. direct communication, sequential regions, 1 implementation per region, etc.
[Figure: four candidate HW/SW partitions with their metrics: Performance - 28.5s, Area - 1452 gates; Performance - 28.5s, Area - 0 gates; Performance - 16.2s, Area - 3418 gates; Performance - 11.1s, Area - 12380 gates]
Exploration
Simple formulation: n regions, each region has Sw time, Hw time, and Hw area
Assumptions
  Adding hw regions together doesn’t change area/performance
    Obviously not true
    But may be good enough in some situations
  Communication time of regions is the same for Hw or Sw
    Often not true, but may be true if the uP and Hw have the same interface to memory
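Under the two assumptions above, evaluating a partition reduces to summing per-region numbers. The sketch below represents a partition as a bitmask and finds the fastest partition that fits in a given area by brute force. The `Region` type, the function names, and the sample numbers in the usage note are hypothetical, and exhaustive search is only practical for a small number of regions.

```c
#include <stdint.h>

/* One region in the simple formulation. */
typedef struct {
    double sw_time; /* execution time when run in software */
    double hw_time; /* execution time when run in hardware */
    double hw_area; /* area of the hardware implementation */
} Region;

/* Evaluate a partition given as a bitmask: bit i set means region i is in hw.
   Under the additive assumption, total time and area are plain sums. */
void evaluate(const Region *r, int n, uint32_t mask, double *time, double *area)
{
    *time = 0.0;
    *area = 0.0;
    for (int i = 0; i < n; i++) {
        if (mask & (1u << i)) {
            *time += r[i].hw_time;
            *area += r[i].hw_area;
        } else {
            *time += r[i].sw_time;
        }
    }
}

/* Exhaustive search over all 2^n partitions (requires n < 32): returns the
   fastest partition whose hardware area does not exceed max_area. */
uint32_t best_partition(const Region *r, int n, double max_area)
{
    uint32_t best = 0; /* all-software partition uses no area, so it always fits */
    double best_time, t, a;

    evaluate(r, n, best, &best_time, &a);
    for (uint32_t mask = 1; mask < (1u << n); mask++) {
        evaluate(r, n, mask, &t, &a);
        if (a <= max_area && t < best_time) {
            best_time = t;
            best = mask;
        }
    }
    return best;
}
```

For three made-up regions with (Sw time, Hw time, Hw area) of (50, 10, 100), (30, 5, 60), and (20, 8, 50), an area budget of 120 selects only the first region for hw (mask 0x1), while a budget of 160 selects the first two (mask 0x3).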
Exploration
A solution for the simple formulation: the problem is identical to the 0-1 knapsack problem
  NP-complete
0-1 knapsack problem
  Input: knapsack with a weight capacity, and a set of items with profit and weight
  Problem: Determine which items should be placed in the knapsack
  Goal: maximize profit without violating the weight capacity
Mapping to hw/sw partitioning
  Knapsack is hw (FPGA in our case)
  Weight capacity is hw area
  Items are program regions
  Profit is speedup from implementation in hw
  Weight is area of hw implementation
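To make the mapping concrete, here is a sketch of the textbook dynamic-programming solution to 0-1 knapsack, phrased in partitioning terms. It is not presented in the slides; it assumes areas are non-negative integers and profits are non-negative time savings (sw time minus hw time). Its running time is proportional to n times the area capacity, which is pseudo-polynomial and so does not contradict NP-completeness. The function name and parameter names are hypothetical.

```c
#include <stdlib.h>

/* 0-1 knapsack by dynamic programming over integer area.
   profit[i]: time saved by moving region i to hw (assumed >= 0)
   weight[i]: hw area of region i (assumed >= 0)
   capacity:  available hw area
   in_hw[i]:  set to 1 if region i is placed in hw, else 0
   Returns the total time saved, or -1 if memory allocation fails. */
long knapsack(const long *profit, const int *weight, int n, int capacity, int *in_hw)
{
    size_t cols = (size_t)capacity + 1;
    /* best[i * cols + c]: max profit using the first i regions within area c;
       calloc zeroes row 0 (no regions, no profit). */
    long *best = calloc(((size_t)n + 1) * cols, sizeof *best);
    if (best == NULL)
        return -1;

    for (int i = 1; i <= n; i++) {
        for (int c = 0; c <= capacity; c++) {
            long skip = best[(size_t)(i - 1) * cols + (size_t)c];
            long take = -1;
            if (weight[i - 1] <= c)
                take = best[(size_t)(i - 1) * cols + (size_t)(c - weight[i - 1])]
                       + profit[i - 1];
            best[(size_t)i * cols + (size_t)c] = take > skip ? take : skip;
        }
    }

    /* Walk back through the table to recover which regions were chosen. */
    int c = capacity;
    for (int i = n; i >= 1; i--) {
        if (best[(size_t)i * cols + (size_t)c] != best[(size_t)(i - 1) * cols + (size_t)c]) {
            in_hw[i - 1] = 1;
            c -= weight[i - 1];
        } else {
            in_hw[i - 1] = 0;
        }
    }

    long result = best[(size_t)n * cols + (size_t)capacity];
    free(best);
    return result;
}
```

With made-up profits {40, 25, 12} and areas {100, 60, 50}, a capacity of 120 places only the first region in hw (40 saved), and a capacity of 160 places the first two (65 saved).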
Exploration: Heuristics for simple formulation
Problem: 0-1 knapsack is NP-complete
  We likely need to use a heuristic
  Need a way of focusing on moving regions to hw that provide large speedup
How do we know if a region potentially provides large speedup?
Exploration: Heuristics for simple formulation
Amdahl’s Law
  Originally stated how much performance could be improved by parallelization
  Can be generalized to stating how much speedup is achieved based on the percentage of the application that is optimized
  Speedup = $1/(s + p/n)$
    p is the percentage of the app. that is optimized, s is the percentage unoptimized (1-p), n is the speedup of the region created by the optimization
  Ideal Speedup = $1/s = 1/(1-p)$
    Speedup assuming that hw runs infinitely fast
From these equations, we can see that heuristics should focus on regions consisting of a large % of execution time
  The larger p is for a region, the larger the potential speedup is
    p = 90%, ideal speedup = 1/(1-.9) = 10x
    p = 10%, ideal speedup = 1/(1-.1) = 1.1x
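The two speedup expressions can be written as small functions for experimenting with different values of p and n. The sketch uses the standard form of Amdahl's Law, speedup = 1/((1-p) + p/n), with p given as a fraction between 0 and 1; the function names are illustrative.

```c
/* Amdahl's Law: overall speedup when a fraction p of execution time
   (0 <= p <= 1) is sped up by a factor n (n > 0). */
double amdahl_speedup(double p, double n)
{
    double s = 1.0 - p; /* fraction left unoptimized */
    return 1.0 / (s + p / n);
}

/* Ideal speedup: the limit as n grows without bound, i.e. hw runs
   infinitely fast. Requires p < 1. */
double amdahl_ideal(double p)
{
    return 1.0 / (1.0 - p);
}
```

For p = 0.9 the ideal speedup is about 10x, but a hw implementation that is only 10x faster than sw (n = 10) yields roughly 5.3x overall, which shows how quickly the unoptimized fraction starts to dominate.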
Exploration: Heuristics for simple formulation
90-10 rule
  Observation that for many applications, 90% of execution time is spent in 10% of code
Good news for heuristic
  Suggests heuristic can achieve most of the potential speedup by focusing on moving this 10% of code to hardware
[Figure: cumulative application execution percentage (0% to 100%) versus the 1 to 10 most-frequent regions]
Exploration: Heuristics for simple formulation
Possible greedy heuristic
  1) Profile application to determine % of execution time for each region
    Part of input for simple formulation
  2) Create speedup/area ratio for regions with largest %
    Partition evaluation - may be estimate or implementation
    How many regions?
      Depends on how fast you want heuristic to be
  3) Sort regions based on this ratio
  4) Implement regions in sorted order until area exhausted
  O(n lg n) complexity
Mapping back to knapsack problem
  Basic idea: Place items in knapsack in order of profit/weight
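Steps 3 and 4 of the greedy heuristic could be sketched as below, taking the per-region times and areas from steps 1 and 2 as given. Several details are assumptions: the `Candidate` type, the use of time saved (sw time minus hw time) as the "speedup" in the ratio, and the choice to skip a region that does not fit and keep trying smaller ones rather than stopping at the first miss.

```c
#include <stdlib.h>

/* One candidate region; hw_area is assumed to be greater than zero. */
typedef struct {
    int id;         /* region index, used to report the result */
    double sw_time; /* execution time in software */
    double hw_time; /* execution time in hardware */
    double hw_area; /* area of the hardware implementation */
} Candidate;

/* Time saved per unit of area: the profit/weight ratio of the knapsack view. */
static double ratio(const Candidate *c)
{
    return (c->sw_time - c->hw_time) / c->hw_area;
}

/* qsort comparator: larger ratio first. */
static int by_ratio_desc(const void *a, const void *b)
{
    double ra = ratio(a), rb = ratio(b);
    return (ra < rb) - (ra > rb);
}

/* Sorts cand in place by ratio, then moves regions to hw in that order while
   they fit. Sets in_hw[id] to 1 or 0 for every region; returns the area used. */
double greedy_partition(Candidate *cand, int n, double max_area, int *in_hw)
{
    double used = 0.0;

    qsort(cand, (size_t)n, sizeof *cand, by_ratio_desc);
    for (int i = 0; i < n; i++) {
        if (used + cand[i].hw_area <= max_area) {
            in_hw[cand[i].id] = 1;
            used += cand[i].hw_area;
        } else {
            in_hw[cand[i].id] = 0; /* does not fit; try the next region */
        }
    }
    return used;
}
```

The sort dominates the cost, giving the O(n lg n) complexity noted above. With made-up regions (50, 10, 100), (30, 5, 60), and (20, 8, 50) and an area budget of 120, the heuristic picks the second and third regions and saves 37 time units, whereas putting only the first region in hw would save 40: a reminder that the greedy result is good but not guaranteed optimal.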
Exploration
More complicated formulations
  More complex implementation models
  Asymmetric communication
  Multiple processors
  Multiple FPGAs
  Tightly-coupled vs loosely-coupled
  Multiple implementations
  Etc.
Common exploration techniques:
  ILP
  Simulated annealing/genetic algorithms/hill climbing
  Group migration (Kernighan-Lin)
  Graph bipartitioning (read paper on website)
  Tabu search (read paper on website)
    Similar to simulated annealing, but maintains “Tabu” list to improve diversity of solutions
Exploration
There is no known efficient solution for considering all possible issues
  Ridiculously large exploration space
  Problem is becoming harder with more complex architectures
State of the art:
  Granularity
    Consider coarse and fine grained partitions
  Partition evaluation
    Estimation and “rough” implementation
  Alternative region implementations
    Typically only consider a single implementation of each region
    Area for future improvements - a lot of interesting problems
      How to decide how many implementations to consider?
      How to decide which implementations to consider?
  Implementation models
    Typically assume architectures with few options
      One type of communication, no dynamic reconfiguration, etc.
    Future architectures will increase options
      Should improve partition, but increase exploration space
Summary
Applications are often not efficient in pure hw
Hw/sw partitioning maps regions of an application onto sw (microprocessors) and hw (custom circuits)
  Goal: Maximize performance, meet design constraints, etc.
Issues
  Granularity of regions
  Partition evaluation
  Alternative region implementations
  Implementation models
  Exploration techniques
    Focus of most work