Transforming a Linear Algebra Core To An FFT Accelerator
Ardavan Pedram★, John McCalpin✪, Andreas Gerstlauer★ ★Electrical and Computer Engineering ✪Texas Advanced Computing Center
The University of Texas at Austin
6/6/13
The Era of Heterogeneous Computing
• Physical limits of technology scaling
  • Power/utilization/… walls and dark silicon
    – Only a fraction of a chip may be active at any given time
• Efficiency/optimality vs. flexibility/generality
  – GFLOPS/W (energy per operation)
➢ Opportunity and need for specialization
  ➢ Heterogeneous multi-core / asynchronous CMP
  ➢ On-chip accelerators
  ➢ GP-GPUs
  ➢ Programmable, reconfigurable, or hardcoded?
Nvidia Tegra 2 System on Chip
Pedram et. al. 2
Implementation Spectrum
[Chart: efficiency vs. flexibility spectrum of implementation styles; the Linear Algebra / LA-FFT processor targets the high-efficiency end]
Source: T. Noll, RWTH Aachen, via R. Leupers, "From ASIP to MPSoC", Computer Engineering Colloquium, TU Delft, 2006
Base Architecture: Linear Algebra Core (LAC)
• Scalable 2-D array of nr×nr processing elements (PEs) [ASAP'11]
• Up to 50 GFLOPS/W @ 45nm
• Specialized floating-point units w/ 1 MAC/cycle throughput
• Broadcast buses (no need to pipeline up to nr=16)
• Distributed memory architecture
• Distributed, PE-local control
• Level-3 BLAS [ASAP'12], matrix factorizations [ARITH21]
[Figure: 4×4 PE array with row/column broadcast buses (row/column bus read and write); each PE has a MAC with accumulator, register file, MEM A and MEM B SRAMs, address registers, a memory interface, and a local controller]
GEMM vs. FFT
• GEMM
  • High ratio of computation to communication: O(N^3)/O(N^2)
  • Demonstrates the maximum sustainable FLOPS
  • Balanced: # additions = # multiplications
• FFT
  • Modest ratio of computation to communication: O(N log N)/O(N)
  • Typically memory-bandwidth limited
  • Non-balanced: # additions > # multiplications
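The contrast can be made concrete with the textbook operation counts (2N^3 flops for an N×N GEMM, roughly 5N log2 N flops for an N-point complex FFT); the counts and the simple data-movement model below are conventions, not numbers from this deck:

```python
import math

# Arithmetic intensity = flops per element touched (a rough model, using
# the conventional counts: 2*N^3 flops for N x N GEMM, 5*N*log2(N) flops
# for an N-point complex FFT).
def gemm_intensity(n):
    flops = 2 * n**3
    elements = 3 * n * n          # the A, B, and C matrices
    return flops / elements       # grows linearly in N

def fft_intensity(n):
    flops = 5 * n * math.log2(n)
    elements = 2 * n              # read and write every point once
    return flops / elements       # grows only as log N

print(round(gemm_intensity(1024), 1))  # 682.7
print(round(fft_intensity(1024), 1))   # 25.0
```

For N = 1024, GEMM does roughly 683 flops per element moved while the FFT does only 25, which is why the FFT is typically bandwidth-bound.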
Outline
✓ Introduction
✓ Motivation and vision
• Related work
• FFT Algorithm and Mapping
• Architecture Tradeoffs
• Experimental Results
• Conclusions and future work
Related Work
• CPUs: poor utilization, ~40% of (effective) peak
  • Power-of-2 strides in FFT interact badly with:
    – set-associative caches
    – set-associative address-translation mechanisms
    – power-of-2-banked memory subsystems
• GPUs: even poorer utilization, ~20% of peak
  • More computation units
  • Weaker memory subsystem
• FPGAs and ASICs:
  • Automatic RTL generation [Spiral]
  • A complete comparison in [Chung2010]
Outline
✓ Introduction
✓ Motivation and vision
✓ Related work
• FFT Algorithm and Mapping
• Architecture Tradeoffs
• Experimental Results
• Conclusions and future work
FMA Optimized Radix-4
• Radix-4 butterfly operation
  • Three complex multiplications
  • Eight complex additions
  • 34 real floating-point operations
Radix-4 butterfly as a matrix operation:

$$
\begin{pmatrix} x(j) \\ x(j+L/4) \\ x(j+L/2) \\ x(j+3L/4) \end{pmatrix} :=
\begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & -i & -1 & i \\ 1 & -1 & 1 & -1 \\ 1 & i & -1 & -i \end{pmatrix}
\operatorname{diag}\!\left(1,\, \omega_L^{j},\, \omega_L^{2j},\, \omega_L^{3j}\right)
\begin{pmatrix} x(j) \\ x(j+L/4) \\ x(j+L/2) \\ x(j+3L/4) \end{pmatrix}
$$

FMA-optimized factorization, with $\omega = \omega_L^{j}$:

$$
\begin{pmatrix} x(j) \\ x(j+L/4) \\ x(j+L/2) \\ x(j+3L/4) \end{pmatrix} :=
\begin{pmatrix} 1 & 0 & \omega & 0 \\ 0 & 1 & 0 & -i\omega \\ 1 & 0 & -\omega & 0 \\ 0 & 1 & 0 & i\omega \end{pmatrix}
\begin{pmatrix} 1 & \omega^{2} & 0 & 0 \\ 1 & -\omega^{2} & 0 & 0 \\ 0 & 0 & 1 & \omega^{2} \\ 0 & 0 & 1 & -\omega^{2} \end{pmatrix}
\begin{pmatrix} x(j) \\ x(j+L/4) \\ x(j+L/2) \\ x(j+3L/4) \end{pmatrix}
$$
• FMA-optimized Radix-4 butterfly operation
  • 24 FMA operations
  • Reduces the loads for twiddle factors
• Non-optimized: 34 FLOPs; FMA-optimized: 24 FMA ops
[Figure: DAGs of the Radix-4 and Radix-2 butterflies on an FMA unit, with multiplication and accumulation dependencies and start-finish cycle numbers (0-23 for Radix-4, 0-5 for Radix-2)]
Fig. 1. DAG of the optimized Radix-4 butterfly using a fused multiply-add unit. Rectangles on top indicate the input data, solid nodes show complex computations with four FMA operations each, and nodes with dashed lines show complex computations with two FMA operations each. The nodes are executed in an order that avoids data dependency hazards due to pipeline latencies, as shown by the start-finish cycle numbers next to each node.
The pseudo-code for the conventional implementation is shown on the left and the pseudo-code for the FMA-optimized version is shown on the right:
for j = 0 : L/4 - 1                    for j = 0 : L/4 - 1
    a := x(j)                              a := x(j)
    b := ω_L^j  x(j + L/4)                 b := x(j + L/4)
    c := ω_L^2j x(j + L/2)                 c := x(j + L/2)
    d := ω_L^3j x(j + 3L/4)                d := x(j + 3L/4)
    τ0 := a + c                            b := a - ω_L^2j b
    τ1 := a - c                            a := 2a - b
    τ2 := b + d                            d := c - ω_L^2j d
    τ3 := b - d                            c := 2c - d
    x(j)        := τ0 + τ2                 x(j + L/2)  := c := a - ω_L^j c
    x(j + L/4)  := τ1 - i τ3               x(j)        := 2a - c
    x(j + L/2)  := τ0 - τ2                 x(j + L/4)  := d := b - i ω_L^j d
    x(j + 3L/4) := τ1 + i τ3               x(j + 3L/4) := 2b - d
end for                                end for
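The two columns can be cross-checked numerically. The sketch below transcribes both into Python (the function names and test values are mine); note that, written with the same argument names, the FMA form matches the conventional form with its two middle inputs exchanged, a reordering absorbed by the data layout of the surrounding FFT stages:

```python
import cmath
import random

def radix4_classic(a, b, c, d, w):
    """Conventional Radix-4 butterfly (left column): 3 complex multiplies,
    8 complex adds = 34 real flops."""
    b, c, d = w * b, w * w * c, w**3 * d
    t0, t1, t2, t3 = a + c, a - c, b + d, b - d
    return t0 + t2, t1 - 1j * t3, t0 - t2, t1 + 1j * t3

def radix4_fma(a, b, c, d, w):
    """FMA-optimized butterfly (right column): every line is an a - w*b or
    2*a - b pattern, 24 real FMA operations in total."""
    w2 = w * w
    b = a - w2 * b
    a = 2 * a - b
    d = c - w2 * d
    c = 2 * c - d
    c = a - w * c          # x(j + L/2)
    x0 = 2 * a - c         # x(j)
    d = b - 1j * w * d     # x(j + L/4)
    x3 = 2 * b - d         # x(j + 3L/4)
    return x0, d, c, x3

# Spot check: identical outputs once the two middle inputs are exchanged.
w = cmath.exp(-2j * cmath.pi * 3 / 64)   # omega_L^j for L = 64, j = 3
a, b, c, d = (complex(random.random(), random.random()) for _ in range(4))
assert all(abs(x - y) < 1e-12
           for x, y in zip(radix4_fma(a, b, c, d, w),
                           radix4_classic(a, c, b, d, w)))
```

Each `a - w*b` line is one complex FMA pattern (four real FMAs) and each `2*a - b` line costs two, giving the 24 real FMA operations counted above.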
IV. BASELINE LINEAR ALGEBRA ARCHITECTURE
The microarchitecture of the baseline linear algebra core (LAC) is illustrated in Figure 2. The architecture and implementation optimize the rank-1 update operation that is the innermost kernel of parallel matrix multiplication [14]. This allows the implementation to achieve orders of magnitude better efficiency in power and area consumption than conventional general-purpose architectures [3].
A. General Architecture

The LAC architecture consists of a 2D array of nr × nr Processing Elements (PEs), with nr = 4 in Figure 2. Each PE has a double-precision Floating-Point Multiply-ACcumulate (FPMAC) unit with a local accumulator, and local memory (SRAM) storage divided into a larger single-ported and a smaller dual-ported memory. PEs on the same row/column are connected by low-overhead horizontal/vertical broadcast buses. LAC control is distributed and each PE has a state machine that drives a predetermined, hard-coded sequence of communication, storage, and computation steps for each supported operation.
The FPMAC units perform the inner dot-product computations central to almost all level-3 BLAS operations. To achieve high performance and register-level locality, the LAC utilizes pipelined FPMAC units that can achieve a throughput of one dependent FPMAC operation per cycle [15]. This means that there is no data-dependency hazard for floating-point accumulations. Note that this is not the case in current general-purpose architectures [16], which require the use of multiple accumulators to avoid pipeline stalls.

Fig. 2. Linear Algebra Core optimized for rank-1 updates. PEs that own the current column of the 4 × kc matrix A and the current row of the kc × 4 matrix B write elements of A and B to the buses, and the other PEs read them [3].
V. FFT ALGORITHM MAPPING
In this section, we show the details of mapping an FFT on the LAC along with the required modifications that need to be made to the existing core architecture. We start by focusing on small problems that fit in the local core memory. Then, we present solutions for bigger problems that do not fit in the local store.
The broadcast bus topology allows a PE to communicate with other PEs in the same row and with other PEs in the same column simultaneously. To maximize locality, we consider only designs in which each butterfly operation is computed by a single PE, with communication taking place between the butterfly computational steps. We note that if the LAC dimensions are selected as powers of two, the communication across PEs between both Radix-2 and Radix-4 butterfly operations will be limited to the neighbors on the same row or column. The choice of nr = 2 provides little parallelism, while values of nr ≥ 8 provide inadequate bandwidth per PE due to the shared bus interconnect. Therefore, we choose nr = 4 as the standard configuration for the rest of the paper.
A. Radix-4 FFT Algorithms on the PEs

In Section V-C1 we gave a description of regular and FMA-optimized versions of the Radix-2 and Radix-4 butterfly operations. Here, we show the details of mapping such operations on the PEs. A Radix-2 operation takes six FMA operations. Performing Radix-2 operations in each PE, the LAC can perform 32-point FFTs, but can only hide the latency of the FMA pipeline for FFT transforms with 64 or more points. The Radix-4 butterfly on the PE is more complicated due to data dependencies within the butterfly operation. Figure 1 shows the DAG of the Radix-4 butterfly. Solid ellipse nodes take 4 FMA operations and dashed nodes take 2 FMA operations. A pipelined FPMAC unit has q pipeline stages, with q = 5 to 9. The nodes in the DAG should be scheduled in a way that avoids data dependency hazards due to these pipeline latencies.
Radix-4 Butterfly on a PE
[Figure: DAG of the Radix-4 butterfly on an FMA unit (as in Fig. 1), annotated with FLOP vs. FMA-op counts and start-finish cycle numbers]
• Dependencies
  • Multiplication
  • Addition
• FP-MAC unit
  • 5-9 pipeline stages
    – Possible hazards
  • Single-cycle accumulator
    – Only multiplication dependencies must be tracked
• 24 cycles
  • No dependence hazards
  • Careful scheduling of operations
Fast Fourier Transform
• LAC architecture
  • Broadcast buses
  • Floating-point MAC in PEs
• nr a power of 2
  • Limits the communication pattern
• Butterfly operations on PEs
  • Operations optimized for FP-MAC units
64-Point Radix-4 FFT on a 4x4 LAC
• Stage one
  • No communication
• Stage two (PE(0,0))
  • Neighbors at distances 1×4^0, 2×4^0, 3×4^0
  • nr(nr-1)×2 = 24 row bus transactions
• Stage three (PE(0,0))
  • Neighbors at distances 1×4^1, 2×4^1, 3×4^1
  • nr(nr-1)×2 = 24 column bus transactions
[Figure: data points 0000-1111 mapped onto the 4×4 PE array; stage 1 uses inner-PE accesses, stage 2 row-bus accesses, stage 3 column-bus accesses; larger transforms run four three-stage 64-point FFTs followed by a stage-4 intra-PE communication step; PE(0,0) communication patterns shown for stages 2 and 3]
FFT on LAC
• Stage 1: inside PEs
• Stage 2: only row buses
• Stage 3: only column buses
• Stage ≥ 4: only intra-PE accesses
• Cycle count: (6N/nr^2)·log4(N)
• Effective bandwidth: nr^2/(log4(N) − 1)
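The cycle count above can be sketched as a toy model built from the slides' constants (24 FMA cycles per Radix-4 butterfly, N/4 butterflies per stage spread over nr^2 PEs; the function name is my own):

```python
import math

def lac_fft_cycles(n_points, nr=4):
    """Computation cycles for an n_points FFT on an nr x nr LAC:
    (6N/nr^2) * log4(N)."""
    stages = round(math.log(n_points, 4))              # log4(N) radix-4 stages
    butterflies = n_points // 4                        # per stage
    cycles_per_stage = 24 * butterflies // (nr * nr)   # = 6N/nr^2
    return stages * cycles_per_stage

print(lac_fft_cycles(64))    # 72: three stages of 24 cycles on the 4x4 LAC
print(lac_fft_cycles(4096))  # 9216
```

For the 64-point FFT this gives three stages of 24 cycles each, i.e. 72 cycles across the 4x4 array.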
Larger Out-of-Core 1D and 2D FFTs
• 1D: four-step algorithm
  1. N1 DFTs of size N2
     – Write back in bit-reversed order
  2. Multiply the result by N1×N2 twiddle factors
  3. N2 DFTs of size N1
  4. Transpose the result
• 2D: can be done in either order
  • N1/N2 DFTs of size N2/N1
     – Write back in bit-reversed order
  • Same set of twiddle factors for all rows/columns
  • N2/N1 DFTs of size N1/N2
[Figure: 1D FFT: the FFT core reads, transforms, and writes back columns (stage 1), then rows (stage 2) of a 256×256 input; 2D FFT: the same two passes, in either order]
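The four-step decomposition can be sketched in plain Python, with a naive O(N^2) DFT standing in for the on-core transforms (function names are mine; the bit-reversed write-back and blocked transposes of the real design are elided):

```python
import cmath

def dft(x):
    """Naive DFT, standing in for the small on-core FFTs."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

def four_step_fft(x, n1, n2):
    """Four-step FFT of len(x) == n1*n2, columns then rows as in the slide."""
    # View x as a matrix a[j1][j2] = x[j1 + n1*j2].
    a = [[x[j1 + n1 * j2] for j2 in range(n2)] for j1 in range(n1)]
    # Step 1: n1 DFTs of size n2 (along j2).
    a = [dft(row) for row in a]
    # Step 2: multiply by the n1 x n2 twiddle factors w_N^(j1*k2).
    n = n1 * n2
    a = [[a[j1][k2] * cmath.exp(-2j * cmath.pi * j1 * k2 / n)
          for k2 in range(n2)] for j1 in range(n1)]
    # Step 3: n2 DFTs of size n1 (along j1, i.e. on the transposed matrix).
    cols = [dft([a[j1][k2] for j1 in range(n1)]) for k2 in range(n2)]
    # Step 4: transpose back; output index k = k2 + n2*k1.
    return [cols[k2][k1] for k1 in range(n1) for k2 in range(n2)]
```

Splitting a 4096-point transform as N1 = N2 = 64 this way keeps every inner DFT at the 64-point size that fits the core.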
Outline
✓ Introduction
✓ Motivation and vision
✓ Related work
✓ FFT Algorithm and Mapping
• Architecture Configurations
• Experimental Results
• Conclusions and future work
Architecture Configurations
• Communication vs. computation
  • No overlap
    – 12 KB/PE
    – 50% utilization
  • Full overlap
    – 16 KB/PE
    – 83% utilization
• Required bandwidth to the core
  • LAC is limited to 4 doubles/cycle
  • FFT core needs more bandwidth for small problems
[Charts: local store per PE (KB) and utilization vs. problem size (64 to 4096) for the overlap and no-overlap configurations; average and effective bandwidth and reads (doubles/cycle) vs. problem size]
Hybrid LA/FFT Design
• Core
  • Double the off-core bandwidth
  • Expand the memory interface to support rows and columns
    – Symmetric
    – Natively supports transpose
• PE
  • 8-word RF (temporary values)
  • 4-word RF (twiddle factors)
  • Divide MEM A into two halves
  • Symmetric datapath to both SRAMs
Double the width (not the size) of the main SRAM
[Figure: (a) 4×4 PE array with memory interfaces along rows and columns; (b) baseline PE datapath; (c) hybrid PE datapath with MEM A split into MEM A1 and MEM A2 plus an ω register file]
Outline
✓ Introduction
✓ Motivation and vision
✓ Related work
✓ FFT Algorithm and Mapping
✓ Architecture Configurations
• Experimental Results
• Conclusions and future work
Power and Area Analysis
• Power and area models
  • CACTI
  • [Galal'10]
  • @ 1 GHz (sweet spot)
• Power
  • Dominated by the FPMAC
• Efficiency
  • Higher is better
  • Up to 10% loss
[Charts: actual power (W) broken down into SRAMs, FP-MAC, registers, and broadcast buses for the LAC, hybrid GEMM, FFT, and hybrid FFT designs; normalized GFLOPS/W, GFLOPS/mm^2, and GFLOPS/max-W for the same designs]
Comparison to Other Designs
• Double-precision 1D FFT performance
• Scaled to 45nm technology
• Proposed FFT core:
  • Two orders of magnitude better power efficiency than CPUs and GPUs
  • High (83%) effective utilization
  • An order of magnitude better area efficiency
| Design                 | Problem fits in | GFLOPS | W    | GFLOPS/mm^2 | GFLOPS/W | Utilization |
|------------------------|-----------------|--------|------|-------------|----------|-------------|
| Xeon E3-1270 core      | L2 cache        | 12.0   | 28   | 0.33        | 0.43     | 44%         |
| ARM Cortex A9          | L1 cache        | 0.6    | 0.28 | 0.45        | 2.13     | 60%         |
| PowerXCell 8i SPE      | SPE local SRAM  | 12.0   | 64   | 0.12        | 0.19     | 12%         |
| Nvidia Tesla C2050     | L1+L2 cache     | 110.0  | 150  | 0.21        | 0.73     | 21.3%       |
| Hybrid core            | On-core SRAM    | 26.7   | 0.66 | 12.2        | 40.50    | 83+%        |
| Hybrid core + 2MB SRAM | Off-core SRAM   | 26.7   | 1.02 | 1.71        | 26.30    | 83+%        |
Outline
✓ Introduction
✓ Motivation and vision
✓ Related work
✓ Base Architecture
✓ FFT Algorithm and Mapping
✓ Architecture Configurations
✓ Experimental Results
• Conclusions and future work
Summary & Conclusions
• Hybrid FFT/Linear Algebra Core
  • Algorithm/architecture co-design
  • Power and efficiency estimation
• Results @ 1 GHz for the FFT core
  • DP: 26.7 GFLOPS, 40 GFLOPS/W
  • 0.7 Watts
  • 2.2 mm^2 in 45nm
  • 83% utilization
  • Orders of magnitude improvement in efficiency
➢ On-going and future work
  • Memory hierarchy study of a multi-core FFT design