
Page 1

Transforming a Linear Algebra Core To An FFT Accelerator

Ardavan Pedram★, John McCalpin✪, Andreas Gerstlauer★ ★Electrical and Computer Engineering ✪Texas Advanced Computing Center

The University of Texas at Austin

Page 2

The Era of Heterogeneous Computing

•  Physical limits of technology scaling
•  Power/utilization/… walls and dark silicon
   –  Only a fragment of a chip may be active at any given time
•  Efficiency/optimality vs. flexibility/generality
   –  GFLOPS/W (energy per operation)
Ø  Opportunity and need for specialization
   Ø  Heterogeneous multi-core / asynchronous CMP
   Ø  On-chip accelerators
   Ø  GP-GPUs
   Ø  Programmable, reconfigurable, or hardcoded?

[Figure: Nvidia Tegra 2 System on Chip]

Page 3

Implementation Spectrum

[Figure: efficiency vs. flexibility spectrum of implementation options, with the Linear Algebra Processor and the LA/FFT Processor marked]

Source: T. Noll, RWTH Aachen, via R. Leupers, "From ASIP to MPSoC", Computer Engineering Colloquium, TU Delft, 2006

Page 4

Base Architecture: Linear Algebra Core (LAC)

•  Scalable 2-D array of nr×nr processing elements (PEs) [ASAP'11]
•  Up to 50 GFLOPS/W @ 45nm
•  Specialized floating-point units w/ 1 MAC/cycle throughput
•  Broadcast busses (no need to pipeline up to nr=16)
•  Distributed memory architecture
•  Distributed, PE-local control
•  Level-3 BLAS [ASAP'12], Matrix Factorizations [ARITH21], all built around rank-1 updates (see the sketch below)
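The rank-1 update the LAC datapath is organized around can be made concrete with a short sketch. This is our own illustrative NumPy code, not from the slides; on the LAC, PE(i,j) would hold c[i,j] locally and apply one FMA to it per update step.

```python
import numpy as np

def gemm_rank1(a, b):
    """C = A*B computed as k rank-1 updates, the inner kernel the LAC targets.
    Each PE(i,j) of the core would perform one FMA per update on its c[i,j]."""
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=a.dtype)
    for p in range(k):                      # one rank-1 update per iteration
        c += np.outer(a[:, p], b[p, :])     # c[i,j] += a[i,p] * b[p,j]
    return c

a, b = np.random.rand(4, 8), np.random.rand(8, 4)
assert np.allclose(gemm_rank1(a, b), a @ b)
```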

[Figure: 4×4 LAC PE array; each PE contains MEM A, MEM B, a register file, a MAC unit with local accumulator (Cin), address registers, row/column bus read and write ports, a controller, and a memory interface]

Page 5

GEMM vs. FFT

•  GEMM
   •  High ratio of computation to communication: O(N³)/O(N²)
   •  Demonstrates the maximum sustainable FLOPS
   •  Balanced: # additions = # multiplications
•  FFT
   •  Modest ratio of computation to communication: O(N log N)/O(N) (compared numerically in the sketch below)
   •  Typically memory-bandwidth limited
   •  Non-balanced: # additions > # multiplications
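To make the contrast concrete, here is a small sketch of ours that compares flops per element moved for the two kernels. The constants (2N³ flops for GEMM, roughly 5N log₂N flops for an FFT, and the simple traffic models) are textbook assumptions, not figures from the slides.

```python
import math

def gemm_intensity(n):
    """~2*n^3 flops over ~3*n^2 matrix elements moved (read A and B, write C)."""
    return (2 * n ** 3) / (3 * n ** 2)

def fft_intensity(n):
    """~5*n*log2(n) flops over ~2*n complex points moved (read input, write output)."""
    return (5 * n * math.log2(n)) / (2 * n)

for n in (64, 256, 1024, 4096):
    print(f"n={n:5d}: GEMM {gemm_intensity(n):7.1f}  FFT {fft_intensity(n):5.1f} flops/element")
```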

Page 6

Outline

✓  Introduction
✓  Motivation and vision
•  Related work
•  FFT Algorithm and Mapping
•  Architecture Tradeoffs
•  Experimental Results
•  Conclusions and future work

Page 7

Related Work

•  CPUs: poor utilization, ~40% of peak (effective)
   •  Power-of-2 strides in FFT interact badly with:
      –  set-associative caches
      –  set-associative address-translation mechanisms
      –  power-of-2-banked memory subsystems
      (see the cache-set sketch below)
•  GPUs: even poorer utilization, ~20% of peak
   •  More computation units
   •  Weaker memory subsystem
•  FPGAs and ASICs:
   •  Automatic RTL generation [Spiral]
   •  A complete comparison in [Chung2010]
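A toy illustration of the stride problem: with a hypothetical 64-set, 64-byte-line cache (our assumption, not any specific CPU in the comparison), a power-of-two stride of doubles maps onto only a few cache sets, while other strides spread across all of them.

```python
LINE_BYTES, NUM_SETS = 64, 64   # hypothetical set-associative cache geometry

def sets_touched(stride_doubles, n_accesses=4096):
    """Count distinct cache sets used by a strided stream of 8-byte doubles."""
    return len({(i * stride_doubles * 8 // LINE_BYTES) % NUM_SETS
                for i in range(n_accesses)})

for s in (1, 3, 64, 512, 4096):
    print(f"stride {s:5d} doubles -> {sets_touched(s):2d}/{NUM_SETS} sets")
```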

Page 8

Outline

✓  Introduction
✓  Motivation and vision
✓  Related work
•  FFT Algorithm and Mapping
•  Architecture Tradeoffs
•  Experimental Results
•  Conclusions and future work


Page 10

FMA Optimized Radix-4

•  Radix-4 butterfly operation
   •  Three complex multiplications
   •  Eight complex additions
   •  34 real floating-point operations

Standard radix-4 butterfly (twiddles applied to the inputs):

$$
\begin{pmatrix} x(j) \\ x(j+L/4) \\ x(j+L/2) \\ x(j+3L/4) \end{pmatrix}
:=
\begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & -i & -1 & i \\
1 & -1 & 1 & -1 \\
1 & i & -1 & -i
\end{pmatrix}
\,\mathrm{diag}\!\left(1,\ \omega_L^{j},\ \omega_L^{2j},\ \omega_L^{3j}\right)
\begin{pmatrix} x(j) \\ x(j+L/4) \\ x(j+L/2) \\ x(j+3L/4) \end{pmatrix}
$$

FMA-optimized factorization:

$$
\begin{pmatrix} x(j) \\ x(j+L/4) \\ x(j+L/2) \\ x(j+3L/4) \end{pmatrix}
:=
\begin{pmatrix}
1 & 0 & \omega_L^{j} & 0 \\
0 & 1 & 0 & -i\,\omega_L^{j} \\
1 & 0 & -\omega_L^{j} & 0 \\
0 & 1 & 0 & i\,\omega_L^{j}
\end{pmatrix}
\begin{pmatrix}
1 & \omega_L^{2j} & 0 & 0 \\
1 & -\omega_L^{2j} & 0 & 0 \\
0 & 0 & 1 & \omega_L^{2j} \\
0 & 0 & 1 & -\omega_L^{2j}
\end{pmatrix}
\begin{pmatrix} x(j) \\ x(j+L/4) \\ x(j+L/2) \\ x(j+3L/4) \end{pmatrix}
$$

•  FMA-optimized Radix-4 butterfly operation
   •  24 FMA operations
   •  Reduces the loads for twiddle factors

Page 11

Radix-4 Butterfly

Operation counts per butterfly:
•  Non-optimized: three complex multiplications (6 real FLOPs each) plus eight complex additions (2 real FLOPs each), 6+6+6+2+2+2+2+2+2+2+2 = 34 real FLOPs
•  FMA-optimized: eight dependent nodes of 4+2+4+2+4+2+4+2 = 24 FMA operations

Fig. 1. DAG of the optimized Radix-4 butterfly using a fused multiply-add unit. Rectangles on top indicate the input data, solid nodes show complex computations with four FMA operations each, nodes with dashed lines show complex computations with two FMA operations each. The nodes are executed in an order that avoids data-dependency hazards due to pipeline latencies, as shown by the start-finish cycle numbers next to each node.

The pseudo-code for the regular implementation is shown on the left and the pseudo-code for the FMA-optimized version is shown on the right:

Regular radix-4 (left):

  for j = 0 : L/4 − 1
    a := x(j)
    b := ω_L^j  · x(j + L/4)
    c := ω_L^2j · x(j + L/2)
    d := ω_L^3j · x(j + 3L/4)
    τ0 := a + c;   τ1 := a − c
    τ2 := b + d;   τ3 := b − d
    x(j)        := τ0 + τ2
    x(j + L/4)  := τ1 − i·τ3
    x(j + L/2)  := τ0 − τ2
    x(j + 3L/4) := τ1 + i·τ3
  end for

FMA-optimized radix-4 (right):

  for j = 0 : L/4 − 1
    a := x(j)
    b := x(j + L/4)
    c := x(j + L/2)
    d := x(j + 3L/4)
    b := a − ω_L^2j·b;   a := 2a − b
    d := c − ω_L^2j·d;   c := 2c − d
    x(j + L/2)  := c := a − ω_L^j·c
    x(j)        := 2a − c
    x(j + L/4)  := d := b − i·ω_L^j·d
    x(j + 3L/4) := 2b − d
  end for
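As a sanity check of the factorization, the FMA-optimized sequence can be executed with explicit real multiply-adds and compared against the two-factor matrix product shown above. This is our own sketch, not code from the paper; the e^(-2πi/L) twiddle convention is an assumption. It uses exactly 24 real FMA operations.

```python
import numpy as np

def cfma(x, w, y, ops):
    """Complex x - w*y realized with 4 real fused multiply-adds."""
    rr = x.real - w.real * y.real + w.imag * y.imag   # 2 real FMAs
    ri = x.imag - w.real * y.imag - w.imag * y.real   # 2 real FMAs
    ops[0] += 4
    return complex(rr, ri)

def dbl_sub(x, y, ops):
    """Complex 2*x - y realized with 2 real fused multiply-adds."""
    ops[0] += 2
    return complex(2.0 * x.real - y.real, 2.0 * x.imag - y.imag)

def radix4_fma(u, w):
    """FMA-optimized radix-4 butterfly, following the right-hand pseudo-code above."""
    ops = [0]
    w2 = w * w
    b = cfma(u[0], w2, u[1], ops); a = dbl_sub(u[0], b, ops)
    d = cfma(u[2], w2, u[3], ops); c = dbl_sub(u[2], d, ops)
    c = cfma(a, w, c, ops);        y0 = dbl_sub(a, c, ops)
    d = cfma(b, 1j * w, d, ops);   y3 = dbl_sub(b, d, ops)
    return np.array([y0, d, c, y3]), ops[0]   # x(j), x(j+L/4), x(j+L/2), x(j+3L/4)

L, j = 64, 5
w = np.exp(-2j * np.pi * j / L)               # twiddle omega_L^j (assumed convention)
w2 = w * w
F1 = np.array([[1, 0, w, 0], [0, 1, 0, -1j*w], [1, 0, -w, 0], [0, 1, 0, 1j*w]])
F2 = np.array([[1, w2, 0, 0], [1, -w2, 0, 0], [0, 0, 1, w2], [0, 0, 1, -w2]])
u = np.random.rand(4) + 1j * np.random.rand(4)
y, nops = radix4_fma(u, w)
assert nops == 24 and np.allclose(y, F1 @ (F2 @ u))
```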

IV. BASELINE LINEAR ALGEBRA ARCHITECTURE

The microarchitecture of the baseline linear algebra core (LAC) is illustrated in Figure 2. The architecture and implementation optimize the rank-1 update operation that is the innermost kernel of parallel matrix multiplication [14]. This allows the implementation to achieve orders of magnitude better efficiency in power and area consumption than conventional general-purpose architectures [3].

A. General Architecture

The LAC architecture consists of a 2D array of nr × nr Processing Elements (PEs), with nr = 4 in Figure 2. Each PE has a double-precision Floating-Point Multiply-ACcumulate (FPMAC) unit with a local accumulator, and local memory (SRAM) storage divided into a larger single-ported and a smaller dual-ported memory. PEs on the same row/column are connected by low-overhead horizontal/vertical broadcast buses. LAC control is distributed and each PE has a state machine that drives a predetermined, hard-coded sequence of communication, storage, and computation steps for each supported operation.

The FPMAC units perform the inner dot-product computations central to almost all level-3 BLAS operations.

Fig. 2. Linear Algebra Core optimized for rank-1 updates. PEs that own the current column of the 4 × kc matrix A and the current row of the kc × 4 matrix B write elements of A and B to the buses and the other PEs read them [3].

To achieve high performance and register-level locality, the LAC utilizes pipelined FPMAC units that can achieve a throughput of one dependent FPMAC operation per cycle [15]. This means that there is no data-dependency hazard for floating-point accumulations. Note that this is not the case in current general-purpose architectures [16], which require the use of multiple accumulators to avoid pipeline stalls.

V. FFT ALGORITHM MAPPING

In this section, we show the details of mapping an FFT on the LAC along with the required modifications that need to be made to the existing core architecture. We start by focusing on small problems that fit in the local core memory. Then, we present solutions for bigger problems that do not fit in the local store.

The broadcast bus topology allows a PE to communicate with other PEs in the same row and with other PEs in the same column simultaneously. To maximize locality, we consider only designs in which each butterfly operation is computed by a single PE, with communication taking place between the butterfly computational steps. We note that if the LAC dimensions are selected as powers of two, the communication across PEs between both Radix-2 and Radix-4 butterfly operations will be limited to the neighbors on the same row or column. The choice of nr = 2 provides little parallelism, while values of nr >= 8 provide inadequate bandwidth per PE due to the shared bus interconnect. Therefore, we choose nr = 4 as the standard configuration for the rest of the paper.

A. Radix-4 FFT Algorithms on the PEs

In Section V-C1 we gave a description of regular and FMA-optimized versions of the Radix-2 and Radix-4 butterfly operations. Here, we show the details of mapping such operations on the PEs. A Radix-2 operation takes six FMA operations. Performing Radix-2 operations in each PE, the LAC can perform 32-point FFTs, but can only hide the latency of the FMA pipeline for FFT transforms with 64 or more points. The Radix-4 butterfly on the PE is more complicated due to data dependencies within the butterfly operation. Figure 1 shows the DAG of the Radix-4 butterfly. Solid ellipse nodes take 4 FMA operations and dashed nodes take 2 FMA operations. A pipelined FPMAC unit has q pipeline stages with q = 5 ∼ 9. The nodes in the DAG should be scheduled in a way that avoids data-dependency hazards due to the pipeline latency, as shown by the start/finish cycle numbers in Figure 1.
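The six-FMA Radix-2 step mentioned above, written out in the same FMA-shaped form as the Radix-4 sketch earlier (our own illustration):

```python
def radix2_fma(u0, u1, w):
    """Radix-2 butterfly as two FMA-shaped complex ops (4 + 2 = 6 real FMAs):
    x(j+L/2) := x(j) - w*x(j+L/2);  x(j) := 2*x(j) - x(j+L/2)."""
    lo = u0 - w * u1        # 4 real FMAs when expanded into real arithmetic
    hi = 2 * u0 - lo        # 2 real FMAs; equals u0 + w*u1
    return hi, lo           # new x(j), new x(j+L/2)
```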


Page 12

Radix-4 Butterfly on a PE

[Figure: DAG of the Radix-4 (and Radix-2) butterfly on an FMAC, as in Fig. 1, showing multiplication and accumulation dependencies and start/finish cycle numbers for each node]

•  Dependencies
   •  Multiplication
   •  Addition
•  FP-MAC unit
   •  5-9 stages of pipelining
      –  Possible hazards
   •  Single-cycle accumulator
      –  Only multiplication dependencies need to be tracked
•  24 cycles
   •  No dependence hazards
   •  Careful scheduling of operations

Page 13

Fast Fourier Transform

•  LAC architecture
   •  Broadcast buses
   •  Floating-point MAC in PEs
•  nr is a power of 2
   •  Limits the communication pattern
•  Butterfly operations on PEs
   •  Operations optimized for FP-MAC units

[Figure: 4×4 LAC PE array and PE datapath, repeated from Fig. 2]

Page 14

64-Point Radix-4 FFT on a 4×4 LAC

•  Stage one
   •  No communication
•  Stage two (PE(0,0))
   •  Neighbors at distances 1×4⁰, 2×4⁰, 3×4⁰
   •  nr(nr−1)×2 = 24 row-bus transactions

[Figure: layout of the 64 points across the 4×4 PE array and the PE(0,0) communication patterns: stage 1 uses inner-PE accesses, stage 2 row-bus accesses, stage 3 column-bus accesses, stage 4 intra-PE communication]

Page 15

64-Point Radix-4 FFT on a 4×4 LAC

•  Stage one
   •  No communication
•  Stage two (PE(0,0))
   •  Neighbors at distances 1×4⁰, 2×4⁰, 3×4⁰
   •  nr(nr−1)×2 = 24 row-bus transactions
•  Stage three (PE(0,0))
   •  Neighbors at distances 1×4¹, 2×4¹, 3×4¹
   •  nr(nr−1)×2 = 24 column-bus transactions

[Figure: same data layout and PE(0,0) communication patterns, highlighting stages 2 (row bus) and 3 (column bus)]

Page 16

FFT on LAC

•  Stage 1: inside PEs
•  Stage 2: only row buses
•  Stage 3: only column buses
•  Stages ≥ 4: only intra-PE accesses
•  Cycle count: 6N log₄N / nr²  (see the sketch below)
•  Effective bandwidth: nr² / (log₄N − 1)
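A quick sketch of ours spelling out the cycle-count model behind the formula: log₄N stages, N/4 radix-4 butterflies per stage, 24 FMA cycles per butterfly, spread over nr² PEs.

```python
import math

def fft_cycles(n, nr=4, butterfly_cycles=24):
    """Slide's cycle model: (n/4)*log4(n) butterflies over nr*nr PEs,
    24 cycles each, i.e. 6*n*log4(n)/nr^2."""
    stages = round(math.log(n, 4))
    return stages * (n // 4) * butterfly_cycles // (nr * nr)

for n in (64, 256, 1024, 4096):
    print(f"{n:5d} points: {fft_cycles(n):6d} cycles")
```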


[Figure: data layout and bus usage per stage (stage 2: row buses, stage 3: column buses, stages ≥ 4: intra-PE accesses)]

Page 17

Larger Out-of-Core 1D and 2D FFTs

•  1D: four-step algorithm (see the sketch below)
   1.  N1 DFTs of size N2
       –  Write back in bit-reversed order
   2.  Multiply the result by N1×N2 twiddle factors
   3.  N2 DFTs of size N1
   4.  Transpose the result
•  2D: the two passes can be done in either order
   •  N1 (or N2) DFTs of size N2 (or N1)
       –  Write back in bit-reversed order
   •  Same set of twiddle factors for all rows/columns
   •  N2 (or N1) DFTs of size N1 (or N2)
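A minimal NumPy sketch of the four-step idea (our own, using one of the two dual orderings, with np.fft.fft standing in for the on-core radix-4 transforms):

```python
import numpy as np

def four_step_fft(x, n1, n2):
    """1D FFT of length n1*n2 via column FFTs, twiddle multiply, row FFTs, transpose."""
    a = x.reshape(n1, n2)                             # view the vector as an n1 x n2 matrix
    b = np.fft.fft(a, axis=0)                         # n2 DFTs of size n1 (columns)
    k1 = np.arange(n1)[:, None]
    m2 = np.arange(n2)[None, :]
    b *= np.exp(-2j * np.pi * k1 * m2 / (n1 * n2))    # pointwise twiddle factors
    c = np.fft.fft(b, axis=1)                         # n1 DFTs of size n2 (rows)
    return c.T.reshape(-1)                            # transpose and flatten

x = np.random.rand(256 * 256) + 1j * np.random.rand(256 * 256)
assert np.allclose(four_step_fft(x, 256, 256), np.fft.fft(x))
```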

[Figure: a 256×256 FFT input processed by the FFT core in two passes: read, transform, and write back columns, then rows (for 2D, in either order)]

Page 18

Outline

✓  Introduction
✓  Motivation and vision
✓  Related work
✓  FFT Algorithm and Mapping
•  Architecture Configurations
•  Experimental Results
•  Conclusions and future work

Page 19

Architecture Configurations

•  Communication vs. computation
   •  No overlap
      –  12 KB/PE
      –  50% utilization
   •  Full overlap
      –  16 KB/PE
      –  83% utilization (see the sketch below)
•  Required bandwidth to the core
   •  LAC is limited to 4 doubles/cycle
   •  FFT core needs more bandwidth for small problems
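One way to see where the 83% ceiling comes from (our own back-of-the-envelope, assuming the conventional 5N log₂N FLOP count for an FFT): the FMA datapath issues 2 FLOPs per FMA over 6N log₄N FMA operations, so the useful fraction is 5/6 regardless of problem size.

```python
import math

def peak_utilization(n):
    """Useful FLOPs (5*n*log2 n) over FLOPs issued by the FMA datapath
    (2 per FMA, 6*n*log4 n FMAs total); equals 5/6 for any n."""
    useful = 5 * n * math.log2(n)
    issued = 2 * 6 * n * math.log(n, 4)
    return useful / issued

print(peak_utilization(4096))   # ~0.833, i.e. the 83% figure on this slide
```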


[Charts: local store per PE (KBytes) and utilization vs. problem size (64 to 4096 points) for overlapped and non-overlapped transfers; average and effective off-core bandwidth and reads (doubles/cycle) vs. problem size]

Page 20

Hybrid LA/FFT Design

•  Core
   •  Double the off-core bandwidth
   •  Expand the memory interface to support both rows and columns
      –  Symmetric
      –  Natively supports transpose
•  PE
   •  8-word RF (temporary values)
   •  4-word RF (twiddle factors)
   •  Divide MEM A into two halves
   •  Symmetric datapath to both SRAMs

[Figure: (a) LAC with memory interfaces on both rows and columns; (b)/(c) modified PE datapaths with an ω register file, MEM A split into MEM A1/A2, and the main SRAM doubled in width (and size)]

Page 21

Outline

✓  Introduction
✓  Motivation and vision
✓  Related work
✓  FFT Algorithm and Mapping
✓  Architecture Configurations
•  Experimental Results
•  Conclusions and future work

Page 22

Power and Area Analysis

•  Power and area models
   •  CACTI
   •  [Galal'10]
   •  @ 1 GHz (sweet spot)
•  Power
   •  Dominated by the FPMAC
•  Efficiency
   •  Higher is better
   •  Up to 10% loss

[Charts: power breakdown (SRAMs, FP-MAC, registers, broadcast buses) and efficiency (GFLOPS/W, GFLOPS/mm², GFLOPS/max W) for the LAC, Hybrid GEMM, FFT, and Hybrid FFT configurations]

Page 23

Comparison to Other Designs

•  Double-precision 1D FFT performance
•  Scaled to 45nm technology
•  Proposed FFT core:
   •  Two orders of magnitude better power efficiency than CPUs and GPUs
   •  High (83%) effective utilization
   •  An order of magnitude better area efficiency

Design                  | Problem fits in | GFLOPS | Power (W) | GFLOPS/mm² | GFLOPS/W | Utilization
Xeon E3-1270 core       | L2 cache        | 12.0   | 28        | 0.33       | 0.43     | 44%
ARM Cortex A9           | L1 cache        | 0.6    | 0.28      | 0.45       | 2.13     | 60%
PowerXCell 8i SPE       | SPE local SRAM  | 12.0   | 64        | 0.12       | 0.19     | 12%
Nvidia Tesla C2050      | L1+L2 cache     | 110.0  | 150       | 0.21       | 0.73     | 21.3%
Hybrid core             | On-core SRAM    | 26.7   | 0.66      | 12.2       | 40.50    | 83+%
Hybrid core + 2 MB SRAM | Off-core SRAM   | 26.7   | 1.02      | 1.71       | 26.30    | 83+%

Page 24

Outline

✓  Introduction
✓  Motivation and vision
✓  Related work
✓  Base Architecture
✓  FFT Algorithm and Mapping
✓  Architecture Configurations
✓  Experimental Results
•  Conclusions and future work

Page 25

Summary & Conclusions

•  Hybrid FFT/Linear Algebra Core
   •  Algorithm/architecture co-design
   •  Power and efficiency estimation
•  Results @ 1 GHz for the FFT core
   •  DP: 26.7 GFLOPS, 40 GFLOPS/W
   •  0.7 Watts
   •  2.2 mm² in 45nm
   •  83% utilization
   •  Orders of magnitude improvement in efficiency
Ø  On-going and future work
   •  Memory hierarchy study of a multi-core FFT design