DATE 2011 Workshop W2 FPGA-based Scientific Computing: A Bright Future? Dr. Greg Stitt Electrical and Computer Engineering University of Florida NSF Center for High-Performance Reconfigurable Computing (CHREC)



Introduction

- FPGAs widely shown to have performance advantages compared to other devices
  - However, trend for scientific computing is microprocessors+GPUs
- Problems with GPU trend
  - Microprocessor+GPU trend unsustainable due to power consumption
  - Top supercomputers nearing 10 megawatts
  - Result: Energy and cooling dominate total cost of ownership
- Are FPGAs a potential solution?
  - FPGAs use significantly less power than GPUs, sometimes with similar or better performance
  - Computational density per watt becoming an important metric for device efficiency for scientific computing

CD/W Results (selected HPC-related devices)

CD/W: Computational Density per Watt

[Three chart slides with this title; the charts are not preserved in the transcript.]

Novo-G RC Machine @ CHREC

- 192 Stratix-III E260 FPGAs
  - Each Altera FPGA with 768 18x18 multipliers, 254K logic elements, 204K registers, & max. power ~18 W
- 48 quad-FPGA boards
  - GiDEL PCIe x8 PROCStar-III
  - Embedded-style boards, for both HPEC- & HPC-oriented research
- 4¼ GB DDR2 attached to each FPGA
  - ~1 TB total RAM in Novo-G
- 24+1 Linux servers in cluster
  - 24 compute servers (2 boards/server)
  - 1 head-node server for management
  - 20 Gb/s non-blocking InfiniBand
  - 1 Gb/s Ethernet
  - 26 (24+2) quad-core Xeons
  - Max. system power of ~8 kW

"Novo" is Latin, "to make anew, refresh, revive, change, alter," essence of RC; "G" is for Genesis or Green.
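The per-device figures on this slide can be multiplied out to see what the whole machine offers. The sketch below does only that arithmetic; the constant and function names are illustrative and not from the slides, and the FPGA power total uses the slide's approximate per-device maximum, so it is an upper bound for the FPGAs alone rather than a measurement.

```python
# Per-device and per-board figures copied from the Novo-G slide.
NUM_FPGAS = 192
FPGAS_PER_BOARD = 4
BOARDS_PER_SERVER = 2
MULTIPLIERS_PER_FPGA = 768   # 18x18 multipliers per Stratix-III E260
MAX_POWER_PER_FPGA_W = 18    # approximate maximum, per the slide
DDR2_PER_FPGA_GB = 4.25


def novo_g_totals():
    """Return aggregate figures implied by the per-device numbers."""
    boards = NUM_FPGAS // FPGAS_PER_BOARD
    return {
        "boards": boards,
        "compute_servers": boards // BOARDS_PER_SERVER,
        "multipliers": NUM_FPGAS * MULTIPLIERS_PER_FPGA,
        "fpga_power_w": NUM_FPGAS * MAX_POWER_PER_FPGA_W,
        "ddr2_gb": NUM_FPGAS * DDR2_PER_FPGA_GB,
    }


if __name__ == "__main__":
    for name, value in novo_g_totals().items():
        print(f"{name}: {value}")
```

The board and server counts reproduce the 48 boards and 24 compute servers listed above. The FPGA-attached DDR2 comes to 816 GB, which the slide rounds to about 1 TB, and the FPGAs account for at most roughly 3.5 kW of the ~8 kW system maximum.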

NW/SW/ND Performance on Novo-G

NW
Baseline: 192·2^25 sequence comparisons, length 850
Software runtime: 11,026 CPU hours on 2.4 GHz Opteron

# FPGAs      Runtime (sec)   Speedup
1            47,616          833
4            12,014          3,304
96           503             78,914
128          391             101,518
192 (est.)   270             147,013

SW
Baseline: Human X chromosome vs. 19,200 sequences, length 650
Software runtime: 5,481 CPU hours on 2.4 GHz Opteron

# FPGAs      Runtime (sec)   Speedup
1            23,846          827
4            5,966           3,307
96           250             78,926
128          188             104,955
192 (est.)   127             155,366

ND
Baseline: 192·2^24 distance calculations, length 450
Software runtime: 11,673 CPU hours on 2.4 GHz Opteron

# FPGAs      Runtime (sec)   Speedup
1            13,522          3,108
4            3,429           12,255
96           144             291,825
128          118             356,125
192 (est.)   77              545,751

Results on Novo-G for NW (left), SW (center), and ND (right). Each chart illustrates performance of a single FPGA under varying input conditions. Each table shows scaling performance with varying number of FPGAs under optimal input conditions.

Estimated performance on Novo-G comparable to or better than the biggest supercomputers on www.Top500.org:

- Jaguar @ ORNL: 224,162 cores – 2.4 GHz Hexacore Opterons; 6.95 MW
- Roadrunner @ LANL: 122,400 cores – 3.2 GHz Cells + 1.8 GHz Opterons; 2.35 MW
- Novo-G: power 8 kW max.
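The speedup columns above are consistent with dividing the software runtime (converted from CPU hours to seconds) by the FPGA runtime. The sketch below reproduces that arithmetic for the NW table and adds a scaling-efficiency figure, which the slides do not report; the function names are illustrative, and the efficiency definition (single-FPGA runtime divided by N times the N-FPGA runtime) is the usual strong-scaling one, assumed here.

```python
def speedup(software_cpu_hours, fpga_runtime_sec):
    """Speedup of the FPGA run over the software baseline."""
    return software_cpu_hours * 3600.0 / fpga_runtime_sec


def scaling_efficiency(runtime_1_sec, runtime_n_sec, n_fpgas):
    """Fraction of ideal linear scaling achieved on n_fpgas devices."""
    return runtime_1_sec / (n_fpgas * runtime_n_sec)


# NW table from the slide: number of FPGAs -> runtime in seconds.
NW_CPU_HOURS = 11026
NW_RUNTIMES = {1: 47616, 4: 12014, 96: 503, 128: 391, 192: 270}

if __name__ == "__main__":
    for n, runtime in NW_RUNTIMES.items():
        s = speedup(NW_CPU_HOURS, runtime)
        eff = scaling_efficiency(NW_RUNTIMES[1], runtime, n)
        print(f"{n:>3} FPGAs: speedup {s:,.0f}, efficiency {eff:.1%}")
```

Recomputed speedups land within a fraction of a percent of the table entries; small differences are expected because the slide's runtimes are rounded to whole seconds. By this measure the NW runs stay above 90% of linear scaling out to the estimated 192-FPGA point.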

Information-Theoretic Adaptive Filtering

[Diagram: ITL application areas: system identification, feature extraction, blind source separation, clustering]

Information-Theoretic Learning (ITL)

- New way of data quantification based upon MEE (minimum error entropy) instead of MSE (mean square error)
- Superior results for nonlinear system identification
- However, prohibitive increase in computational complexity; Solution? RC
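To make the complexity gap between the two cost functions concrete, here is a minimal sketch. It assumes the commonly used Parzen-window estimator of Renyi's quadratic entropy with a Gaussian kernel; the slides do not say which estimator the hardware implements, and the function names and the kernel size `sigma` are illustrative. MSE needs one multiplication per error sample, while this entropy estimate evaluates a kernel for every pair of samples in the window.

```python
import math


def mse(errors):
    """Mean square error: one multiply-accumulate per sample."""
    return sum(e * e for e in errors) / len(errors)


def information_potential(errors, sigma=1.0):
    """Pairwise Gaussian-kernel estimate; O(N^2) kernel evaluations."""
    n = len(errors)
    var = 2.0 * sigma * sigma  # two Parzen kernels of variance sigma^2 convolved
    norm = 1.0 / math.sqrt(2.0 * math.pi * var)
    total = 0.0
    for ei in errors:
        for ej in errors:
            d = ei - ej
            total += norm * math.exp(-d * d / (2.0 * var))
    return total / (n * n)


def quadratic_renyi_entropy(errors, sigma=1.0):
    """Entropy estimate that an MEE criterion would minimize."""
    return -math.log(information_potential(errors, sigma))


if __name__ == "__main__":
    tight = [0.1, -0.1, 0.05, -0.05, 0.0]
    spread = [2.0, -2.0, 1.0, -1.0, 0.0]
    print("MSE:    ", mse(tight), mse(spread))
    print("Entropy:", quadratic_renyi_entropy(tight), quadratic_renyi_entropy(spread))
```

With a window of 100 errors, each entropy estimate in this sketch costs 10,000 kernel evaluations against 100 multiplications for MSE, which illustrates why a parallel hardware implementation is attractive. The estimate depends only on differences between errors, so it is unchanged when every error is shifted by the same constant.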

Baseline: 160 10th-order AFs with window size of 100
Software runtime: 3 min 34 sec CPU time on 2.4 GHz Opteron

# FPGAs        Runtime (ms)   Speedup
1              36.85          5,800
4 (1 board)    9.22           23,200
8 (1 server)   4.61           46,400

ITL Adaptive Filters (AFs)

- 20 AFs/FPGA @ 150 MHz
- 80 AFs/board (4 FPGAs), 160 AFs/server (8 FPGAs)
- Additional AFs not scientifically meaningful for 1D data, so app capped @ 160 AFs

Software: Fastest possible sampling frequency of ~1.5 kHz
Hardware: Fastest possible sampling frequency of 425 kHz
Impact: Able to employ superior MEE cost function for much broader spectrum of signals and problems

REF: S. Craciun, A. George, H. Lam, J. Principe, "A Parallel Hardware Architecture for Information-Theoretic Adaptive Filtering," Proc. of High-Performance Reconfigurable Computing Technology and Applications Workshop at SC'10, New Orleans, LA, Nov. 14, 2010.

Why aren't FPGAs more widely used?

- 5 main barriers preventing wider usage
  - Increased application design complexity
    - Significantly more complex than microprocessors or GPUs
  - Limited applicability
    - Not all applications benefit from FPGAs
  - Prohibitive compilation times
    - Placement and routing often take hours, days, even more than a week
  - Device cost
    - Newest devices cost more than $10,000
  - Lack of application/tool portability, standardization
    - Application and tool designers must start over for new systems