Performance Evaluations of Finite Difference Applications Realized on a Single Flux Quantum Circuits-Based Reconfigurable Accelerator

Hiroaki Honda 1, Farhad Mehdipour 2, Hiroshi Kataoka 1, Koji Inoue 1, and Kazuaki J. Murakami 1

1 Department of Advanced Information Technology, Kyushu University, Fukuoka, Japan

2 Center for Japan-Egypt Cooperation in Science and Technology, Kyushu University, Fukuoka, Japan

Email: [email protected]

Agenda

Introduction
• Single-flux quantum (SFQ) circuit
• SFQ-reconfigurable data-path (RDP) processor

Objective

Implementing an Application on SFQ-RDP
• Tool chain
• Code modification
• DFG extraction and mapping

Performance Evaluation
• Comparison with GPU and GPP results

Conclusions

Top500 Supercomputer Ranking and Projection

• PetaFlop/s [= 10^6 GFlop/s] world from 2009: 1000 times speed-up in 10 years
• 1 ExaFlop/s [= 10^9 GFlop/s] can be attained in ~2019 and 10 ExaFlop/s in ~2022? (within only the next ten years)

[Figure: Top500 performance ranking and projection, marking 1 EFlop/s and 10 EFlop/s (2022). Source: http://www.top500.org/]
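The projection can be checked with a little arithmetic. The sketch below assumes steady exponential growth (an assumption for illustration, not a Top500 model) and derives the yearly factor implied by a 1000 times speed-up in 10 years. All variable names are illustrative.

```python
# Growth implied by a 1000x speed-up in 10 years
# (PetaFlop/s in 2009 -> ExaFlop/s in ~2019).
speedup = 1000.0
years = 10
annual_factor = speedup ** (1.0 / years)  # close to 2x per year

# Assumption: the same rate continues for three more years (~2019 -> ~2022).
exaflops_2022 = 1.0 * annual_factor ** 3  # close to 8 ExaFlop/s

print(f"annual growth factor: {annual_factor:.2f}x")
print(f"projected for ~2022: {exaflops_2022:.1f} ExaFlop/s")
```

Under this assumption the extrapolation lands at roughly 8 ExaFlop/s, the same order as the 10 ExaFlop/s quoted for ~2022.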

Energy Consumption Estimation for Floating Point Units (FPUs)

• Power per FPU (2 GHz) is larger than 10 mW (CMOS, ~8 nm in ~2019) 1)
• Power per 1 GFlop/s is larger than 5 mW
• Energy consumption of FPUs for a 10 ExaFlop/s system is larger than 5 mW × 10 × 10^9 = 50 MW (1 ExaFlop/s = 10^9 GFlop/s)
• Additional power consumption by memory, network, storage, …

It is extremely power consuming to construct a 10 ExaFlop/s supercomputer system with CMOS circuit processors.

1) http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf, p178
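The 50 MW figure follows directly from the per-FPU estimate. The sketch below reproduces the arithmetic; it assumes one floating-point operation per cycle for a 2 GHz FPU (which is what the step from 10 mW per FPU to 5 mW per GFlop/s implies), and the variable names are illustrative.

```python
# Lower-bound estimate of FPU power for a 10 ExaFlop/s CMOS system.
fpu_power_mw = 10.0      # power of one 2 GHz FPU, in mW (lower bound from the slide)
fpu_rate_gflops = 2.0    # assumption: 1 flop per cycle at 2 GHz

mw_per_gflops = fpu_power_mw / fpu_rate_gflops   # 5 mW per GFlop/s

GFLOPS_PER_EXAFLOPS = 1e9
target_gflops = 10 * GFLOPS_PER_EXAFLOPS         # 10 ExaFlop/s

total_watts = mw_per_gflops * 1e-3 * target_gflops
total_megawatts = total_watts / 1e6

print(f"{mw_per_gflops:.1f} mW per GFlop/s -> {total_megawatts:.0f} MW for the FPUs alone")
```

The result covers only the FPUs; memory, network, and storage come on top of it.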

Single-Flux Quantum (SFQ) Circuit

Pulse logic: bit-serial/bit-slice description for 32/64 bits

[Figure: superconductivity loop with a Josephson junction and an SFQ pulse (quantized magnetic flux); labels: ~1 mV, 2~3 ps]

Advantages
• Ultra-high-speed switching
• Ultra-low power
• No cost for latch
• Suitable for pipeline processing
• 10~100 times faster operation, ~1/10 energy consumption

Disadvantages
• Difficult to implement feedback loops and conditional branches
• No practical SFQ memory

Single-Flux Quantum-Reconfigurable Data Path (SFQ-RDP) Computer

[Figure: system overview; labels: CMOS CPU (one chip), Main Mem., SMAC (streaming memory access controller), SB, SFQ RDP chip at 4.2 K with FPUs, rows of PEs alternating with Operand Routing Networks (ORNs)]

SFQ-RDP chip
• 80 GHz, 2-bit slice (32 x 32 PEs, 2.5 GFLOPS/PE), ~2.5 TFLOPS/chip
• 10 TFLOPS per system (4 SFQ-RDP chips)
• SFQ 0.5 μm process
• Memory bandwidth per chip: 256 GB/s max. (= 16 GB/s × 16 channels)
• 2 ports / 1 port data accesses for input / output

Architecture
• Large-scale two-dimensional floating-point unit array, data-path architecture
• Reconfigurable Operand Routing Network (ORN)
• No on-chip memory
• Dynamically reconfigurable PEs and ORNs
• Data flow is unidirectional
• No feedback loop
• Minimal amount of control circuits
• PE: one FPU and data-through units
• ORN: network connecting PEs with PEs
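The headline numbers on this slide are linked by simple products. The sketch below recomputes the per-chip peak from the PE count and per-PE rate, the peak for four chips, and the peak memory bandwidth. The ratio of bandwidth to peak compute is a derived figure that does not appear on the slide, and all names are illustrative.

```python
# Peak figures for the target SFQ-RDP chip, from the numbers quoted on the slide.
pes_per_chip = 32 * 32        # 32 x 32 PE array
gflops_per_pe = 2.5
chips_per_system = 4

chip_gflops = pes_per_chip * gflops_per_pe        # 2560 GFLOPS, i.e. ~2.5 TFLOPS
system_gflops = chip_gflops * chips_per_system    # 10240 GFLOPS, i.e. ~10 TFLOPS

channels = 16
gb_per_s_per_channel = 16.0
chip_bandwidth_gb_s = channels * gb_per_s_per_channel   # 256 GB/s

# Derived (not on the slide): bytes of memory traffic available per peak flop.
bytes_per_peak_flop = chip_bandwidth_gb_s / chip_gflops  # 0.1

print(f"chip peak:   {chip_gflops / 1000:.2f} TFLOPS")
print(f"system peak: {system_gflops / 1000:.2f} TFLOPS")
print(f"bandwidth:   {chip_bandwidth_gb_s:.0f} GB/s ({bytes_per_peak_flop:.2f} byte/flop at peak)")
```

With no on-chip memory, a balance of about 0.1 byte per peak flop suggests that streaming applications will be limited by memory bandwidth rather than by the PE array.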

CREST-JST SFQ-RDP Project (2006~): A Low-Power, High-Performance Reconfigurable Processor Based on Single-Flux Quantum Circuits

Goals:
• Discovering appropriate computation-intensive scientific applications
• Developing compiler tools
• Developing performance evaluation tools
• Designing the SFQ-LSRDP architecture

Participants in the SFQ-RDP project:
• Kyushu Univ.: architecture, compiler, and applications
• Nagoya Univ.: SFQ-RDP chip, cell library, and wiring
• Nagoya Univ.: CAD for logic design and arithmetic circuits
• Yokohama National Univ.: SFQ-FPU chip, cell library
• Superconducting Research Lab. (SRL): SFQ process

Prototype 2x3 SFQ-RDP Processor and SFQ-MUL FPU

2x3 SFQ-RDP processor 1)
• 8-bit ALUs implementing ADD, SUB, AND, OR, XOR
• Frequency: 25 GHz
• Process: 2 μm
• Area: 6.84 x 6.72 mm²
• Power: 4.1 mW

SFQ floating-point multiplier 2)
• 16-bit FPUs: adder, multiplier
• Frequency: 32 GHz
• Performance: 2.6 GFLOPS
• Number of junctions: 11044 JJs
• Power consumption: 3.5 mW
• Circuit area: 6.22 × 3.78 mm²

1) Fujimaki, et al., “Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008.
2) H. Hara, et al., “Design and Implementation of SFQ Half-Precision Floating-Point Multipliers,” ASC08, 2008.

Objectives

• Performance evaluations by implementing practical applications, showing the possibility of efficient computation on the SFQ-RDP computer system
• Applications: 2D-diffusion, 2D-Finite-Difference Time-Domain (2D-FDTD)
• Comparisons of execution times with GPP and GPU

Points
• 2D-FPU array, data-flow architecture: Data Flow Graphs (DFGs) are extracted from applications and mapped onto the SFQ-RDP
• Compiler tools: compiler tools have to be developed
• No on-chip memory: DMA transfer from DRAM has to be fully used to avoid random accesses
• Dynamically reconfigurable PEs and ORNs: one-time reconfiguration is enough for both the Diffusion and FDTD applications

Tool Chain for Implementation of an Application on SFQ-RDP

[Figure: tool chain flow]
• Input: application (C/Fortran code), RDP library file (function definitions and declarations), RDP architecture description
• Code modification using SFQ-RDP API: application code → modified code
• Compiler developed for SFQ-RDP: modified code → object code (GPP)
• Data Flow Graph (DFG) extraction (semi-manual): modified code → extracted DFG
• Placement and routing tool: extracted DFG → RDP configuration file (SFQ-RDP)

The tool chain has been almost completed.

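Implementing an Application on SFQ-RDP: 2D Diffusion

• Basic Finite Difference Method (FDM) formula

$$f^{n+1}_{i,j} = C_0 \left( f^{n}_{i-1,j} + f^{n}_{i+1,j} \right) + C_1 \left( f^{n}_{i,j-1} + f^{n}_{i,j+1} \right) + C_2 f^{n}_{i,j}$$

• Time development calculation by FDM: the values at time step n+1 are computed from the points at time step n

[Figure: grid with x-axis (space, index i), y-axis (space, index j), and n-axis (time, index n)]

In/Out: 5 / 1, Ops: 7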
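As a reference point, the five-point update can be written as a plain double loop. The sketch below is an illustration in Python rather than the authors' C/Fortran code; the function name `diffusion_step`, the list-of-rows layout, and the choice to leave boundary values unchanged are all assumptions.

```python
def diffusion_step(f, c0, c1, c2):
    """Advance the field f (list of rows, indexed f[i][j]) by one time step."""
    ni, nj = len(f), len(f[0])
    g = [row[:] for row in f]  # assumption: boundary values are carried over unchanged
    for i in range(1, ni - 1):
        for j in range(1, nj - 1):
            # 5 inputs, 1 output, 7 operations (4 additions, 3 multiplications)
            g[i][j] = (c0 * (f[i - 1][j] + f[i + 1][j])
                       + c1 * (f[i][j - 1] + f[i][j + 1])
                       + c2 * f[i][j])
    return g


# Usage: a single hot point spreading on a 5x5 grid (coefficients chosen arbitrarily).
field = [[0.0] * 5 for _ in range(5)]
field[2][2] = 1.0
field = diffusion_step(field, c0=0.1, c1=0.1, c2=0.6)
print(field[2][2], field[1][2], field[2][1])
```

Each interior point reads five values and writes one, which is the 5 / 1 input/output count with 7 operations noted on the slide.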

Code Implementation and Modification for SFQ-RDP

Original Code for GPP ( n ⇒ n+1 )

loop n
  loop i, j
    f(n+1)[i,j] = C0 * ( f(n)[i-1,j] + f(n)[i+1,j] ) + C1 * ( f(n)[i,j-1] + f(n)[i,j+1] ) + C2 * f(n)[i,j]
  end
end

Extracted DFG: In/Out 5 / 1, Ops 7, Byte/Flop 3.4

Unrolled Loop Code for SFQ-RDP ( n ⇒ n+1 ): 9 formulas in loop body

loop n
  loop i, j, (+3, +3)
    f(n+1)[i,j] = C0 * ( f(n)[i-1,j] + f(n)[i+1,j] ) + C1 * ( f(n)[i,j-1] + f(n)[i,j+1] ) + C2 * f(n)[i,j]
    f(n+1)[i+1,j] = C0 * ( f(n)[i,j] + f(n)[i+2,j] ) + C1 * ( f(n)[i+1,j-1] + f(n)[i+1,j+1] ) + C2 * f(n)[i+1,j]
    f(n+1)[i+2,j] = …
    …
    f(n+1)[i+2,j+2] = …
  end
end

Extracted DFG (after DFG extraction): In/Out 21 / 9, Ops 7 * 9, Byte/Flop 1.9
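The input count of the unrolled body can be derived by taking the union of the five-point stencils over a 3x3 block of outputs. The sketch below does that and recomputes Byte/Flop for both variants, assuming 4-byte single-precision words; the helper names `footprint` and `byte_per_flop` are illustrative.

```python
# Offsets (di, dj) read by the five-point stencil for one output point.
STENCIL = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]


def footprint(block):
    """Return (distinct inputs, outputs) for a block x block unrolled loop body."""
    outputs = [(oi, oj) for oi in range(block) for oj in range(block)]
    inputs = {(oi + si, oj + sj) for (oi, oj) in outputs for (si, sj) in STENCIL}
    return len(inputs), len(outputs)


def byte_per_flop(n_in, n_out, ops_per_output=7, bytes_per_word=4):
    """Memory traffic per floating-point operation for one loop body."""
    return (n_in + n_out) * bytes_per_word / (ops_per_output * n_out)


for block in (1, 3):
    n_in, n_out = footprint(block)
    print(f"{block}x{block}: {n_in} inputs / {n_out} outputs, "
          f"Byte/Flop = {byte_per_flop(n_in, n_out):.2f}")
```

Neighbouring outputs share inputs, so unrolling to nine formulas needs 21 distinct inputs instead of 45, which is where the drop in Byte/Flop comes from.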

Mapping Extracted DFG onto SFQ-RDP

[Figure: the extracted DFG is passed through placement and routing, giving the DFG mapping result as RDP configuration data]

Improving Data Access Efficiency: Data Structure Conversion for DMA Transfer

• f[i,j]: the unrolled loop includes 21 inputs and 9 outputs for calculation, which leads to random memory accesses
• Data structure conversion (f[i,j] → A[i], B[i]): all two-dimensional f[i,j] values are divided and stored as two one-dimensional arrays, A[] and B[]
• A[i], B[i]: 15 (A) + 15 (B) input data are accessed via two input ports, and 9 output data are accessed
• Sequential memory accesses: possible to use DMA transfer
• Double buffering

[Figure: input points and output points on the (i, j) grid before conversion, and their layout in Array A and Array B after conversion]

Performance Evaluation

Estimation of execution times
• GPP: simulation by cycle-accurate processor simulator
• SFQ-RDP: performance evaluation modeling

System Configuration

GPP   Processor type             Out-of-order
      Frequency                  3.2 GHz
      Inst. issue width          4 inst./CC
      L1 data cache              64 KB
      L2 unified cache           4 MB
      Latency of main mem.       300 CC
RDP   Frequency (SFQ-RDP)        80 GHz
      Reconfiguration latency    30000 CC
      Main mem. bandwidth*       141.7, 157.0 GB/s
      No. of PEs in a row        22
      No. of PEs in a column     15

* BW numbers are based on ones for GPU calculation

System Architecture

[Figure: GPP (3.2 GHz), main memory (BW: 141.7, 157.0 GB/s), and SFQ-RDP (22x15 PEs, 80 GHz, 2 input / 1 output ports)]
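To put the cycle counts of the configuration in wall-clock terms, the sketch below converts them to nanoseconds. The table does not say which clock the reconfiguration latency is counted in; the 80 GHz SFQ-RDP clock is assumed here. The helper `cycles_to_ns` is illustrative.

```python
def cycles_to_ns(cycles, freq_ghz):
    """Convert a cycle count at the given clock (GHz) to nanoseconds."""
    return cycles / freq_ghz


pe_count = 22 * 15                                  # PEs in the evaluated array

# Assumption: reconfiguration latency is counted in 80 GHz SFQ-RDP cycles.
reconfig_ns = cycles_to_ns(30000, 80.0)

# GPP main memory latency at the 3.2 GHz GPP clock.
gpp_mem_latency_ns = cycles_to_ns(300, 3.2)

print(f"PEs: {pe_count}")
print(f"reconfiguration: {reconfig_ns:.0f} ns")
print(f"GPP main memory latency: {gpp_mem_latency_ns:.2f} ns")
```

Under that assumption a reconfiguration takes well under a microsecond, so a single reconfiguration per application is a negligible part of the run time.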

Results of Performance Evaluation

Application     SFQ-RDP (GFlop/s)   GPU (GFlop/s)   Ratio (to GPU)   Ratio (to GPP)
2D-Diffusion    50.6                63.0 1)         0.80             79.0
2D-FDTD         23.4                31.4 2)         0.75             26.2
1D-Diffusion    210.0 3)            -               -                -
1D-Vibration    104.9 3)            -               -                -

Comparable results to GPU

• The SFQ-RDP processor, which is implemented by superconductivity circuits and a simple 2D-array architecture, can be used as an efficient accelerator

1) T. Aoki, et al., “CUDA Programming Primer” (in Japanese), Kougakusya, ISBN-10: 4777514773, 2009.
2) N. Takada, et al., “Speeding up of FDTD finite difference calculations by efficient use of GPU and shared memory” (in Japanese), Proceedings of Forum of Information Science and Technology, 2009.
3) H. Kataoka, et al., “Reducing Preprocessing Overhead Times in a Reconfigurable Accelerator of Finite Difference Applications,” SAAHPC 10, Jul. 2010.
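The GPU ratio column is simply the SFQ-RDP throughput divided by the GPU throughput. The sketch below recomputes it and also derives the GPP throughput implied by the GPP ratio; that last figure is not reported on the slide and assumes the GPP ratio is a throughput ratio for the same problem. The dictionary layout and names are illustrative.

```python
# Reported results in GFlop/s, with the reported speed-up over the GPP.
results = {
    "2D-Diffusion": {"rdp": 50.6, "gpu": 63.0, "vs_gpp": 79.0},
    "2D-FDTD": {"rdp": 23.4, "gpu": 31.4, "vs_gpp": 26.2},
}

ratio_to_gpu = {app: r["rdp"] / r["gpu"] for app, r in results.items()}

# Derived, not reported: GPP throughput implied by the speed-up ratio.
implied_gpp_gflops = {app: r["rdp"] / r["vs_gpp"] for app, r in results.items()}

for app in results:
    print(f"{app}: ratio to GPU = {ratio_to_gpu[app]:.2f}, "
          f"implied GPP = {implied_gpp_gflops[app]:.2f} GFlop/s")
```

Both applications reach roughly three quarters or more of the reported GPU throughput, while the implied GPP throughput stays below 1 GFlop/s.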

Why Can We Achieve Comparable Results?

Calculation                                   # of operations   # of I/O 1)    Byte/Flop            Estimated GFlop/s 2) (max. BW 159.0 GB/s)
RDP calc.
  Original formula (1 output)                 7                 5+1 = 6        6*4/7 = 3.42         ~4.7 (random access: ~16 GB/s)
  Unrolled loop formula (9 outputs)           7 * 9 = 63        21+9 = 30      30 * 4 / 63 = 1.90   ~8.4 (random access: ~16 GB/s)
  Data structure conversion for DMA transfer  7 * 9 = 63        30+9 = 39      39 * 4 / 63 = 2.48   64.1 (DMA: ~159.0 GB/s)
  With GPP calc., comm. and other overheads   -                 -              -                    50.6 (DMA: ~159.0 GB/s)
GPU calc.
  Aoki et al. 3)                              -                 -              -                    63.0 4)

1) Based on the utilization of HW for rearrangement of input data
2) Single precision calculation, BW 159.0 GB/s, GeForce GTX 285
3) GeForce GTX 285, 1 proc. calculation (1024x1204 mesh)
4) T. Aoki, et al., “CUDA Programming Primer” (in Japanese), Kougakusya, ISBN-10: 4777514773, 2009.
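The estimated GFlop/s column is consistent with a bandwidth-bound model in which throughput equals memory bandwidth divided by Byte/Flop. The sketch below recomputes the three SFQ-RDP rows from the I/O and operation counts, assuming 4-byte single-precision words; the function names are illustrative, and the model is inferred from the numbers rather than stated on the slide.

```python
BYTES_PER_WORD = 4  # single precision


def byte_per_flop(n_io, n_ops):
    """Bytes moved per floating-point operation for one loop body."""
    return n_io * BYTES_PER_WORD / n_ops


def bandwidth_bound_gflops(bandwidth_gb_s, n_io, n_ops):
    """Throughput if memory bandwidth is the only limit."""
    return bandwidth_gb_s / byte_per_flop(n_io, n_ops)


# (effective bandwidth in GB/s, number of I/O words, number of operations)
cases = {
    "original formula (random access)": (16.0, 5 + 1, 7),
    "unrolled loop (random access)": (16.0, 21 + 9, 63),
    "data structure conversion (DMA)": (159.0, 30 + 9, 63),
}

estimates = {name: bandwidth_bound_gflops(*args) for name, args in cases.items()}
for name, gflops in estimates.items():
    print(f"{name}: {gflops:.1f} GFlop/s")

# Share of the DMA estimate lost to GPP calculation, communication and other overheads.
overhead_fraction = 1.0 - 50.6 / 64.1
print(f"overheads: {overhead_fraction:.0%}")
```

Computed without intermediate rounding, the DMA case comes out at about 64.2 GFlop/s; the slide's 64.1 corresponds to rounding Byte/Flop to 2.48 first. The comparison shows that the gain comes mainly from replacing ~16 GB/s random accesses with ~159 GB/s DMA transfers, even though the converted layout moves more words per loop body.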

Conclusions and Future Works

Conclusions
• A Single-Flux Quantum Reconfigurable Data-Path (SFQ-RDP) with a two-dimensional floating-point array architecture implemented by superconducting circuits was introduced.
• Two-dimensional Heat (2D-Heat, i.e. 2D-Diffusion) and Finite Difference Time Domain (2D-FDTD) applications were implemented on SFQ-RDP and performance evaluations were conducted.
• For 2D-Heat and 2D-FDTD, computation 79.0 and 26.2 times faster than on a general purpose processor was achievable, respectively, and the resulting 50.6 and 23.4 GFlop/s were comparable to reported results for the GPU.
• The SFQ-RDP accelerator can be used for practical scientific calculations, especially those based on finite difference methods.

Future Works
• Implementations and performance evaluations of other applications

Acknowledgement

This research was supported by Core Research for Evolutional Science and Technology (CREST) of Japan Science and Technology Corporation (JST).

Other SFQ-RDP research members
• CAD for logic design and arithmetic circuits: Prof. N. Takagi (Leader), Prof. K. Takagi (Kyoto Univ.)
• SFQ-RDP chip, cell library, and wiring: Prof. A. Fujimaki, Prof. H. Akaike, Prof. M. Tanaka (Nagoya Univ.)
• SFQ-FPU chip, cell library: Prof. N. Yoshikawa (Yokohama National Univ.)
• SFQ process: Dr. S. Nagasawa, Dr. M. Hidaka (SRL)
