
Computing and networking with FPGAs and GPUs

Andreas Kugel

Dept. for Application Specific Computing (ASC) at the Institute for Computer Engineering (ZITI), Heidelberg


Contents

• Intro: ASC and CAG of ZITI
• FPGA technology
• ATLAS activities
• GRACE project / SPH
  – FPGA
  – GPU
• Networking (CAG)


ASC Activities

• Head: Prof. Dr. Reinhard Männer
  – Physics => Computer Engineering
• http://www.ziti.uni-heidelberg.de/ziti/
• Main areas
  – Trigger and data acquisition in high energy physics
    • ATLAS at LHC, CBM, XFEL
  – Accelerated scientific computing (G. Marcus)
    • Simulations, biocomputing (Haralick feature extraction)
  – Virtual reality in medicine
    • Software and training machinery (also a spin-off)
• Technologies: general purpose (CPU, GPU) + custom FPGA processors


CAG Activities

• Neighbouring dept. with focus on computer architecture and communication
  – Head: Prof. Dr. Ulrich Brüning
• http://ra.ziti.uni-heidelberg.de/index.php
• High-speed, low-latency interconnects
  – Optical communication infrastructure for physics experiments
  – Active optical cables (AOC)
  – Accelerated cluster interconnects (FPGA/ASIC)
    • EXTOLL low-latency protocol
    • HTX (2, 3) and PCIe (1, 2) implementations


Thanks to ...

• Collaboration with R. Spurzem/ARI HD since 1998
  – Provides the scientific use case for “application specific computing”
  – Astrophysical simulations on accelerated clusters
  – Thanks to Rainer ... and Peter Berczik ...
• Support by Volkswagenstiftung / Baden-Württemberg for the GRACE project


Technology


FPGA Technology

• Building blocks
  – Simple logic elements (bits): 10^4 .. 10^6
  – Programmable cross-bar
  – Flexible I/O (~1000 pins)
  – Special functions
    • Memory: BRAMs, MIG
    • DSP
    • Clocking
    • Serial I/O (2.5 – 28 Gbit/s)
• Configuration memory
• Vendors: Altera, Xilinx


FPGA Application Areas

• Algorithms
  – Application specific instruction set processors
  – Highly parallel dataflow processors (DSP style)
  – (Very) complex instruction pipelines (later ...)
  – Tailored precision
• Communication
  – High-bandwidth, low latency
  – Parallel, serial
  – HW protocol engines (PCIe, GE, 10GE, switches, ...)

[Figures: Xilinx radar signal processing; Xilinx terabit switch]


FPGA design flow

• Special coding styles
  – HDL
  – Blocks (MATLAB, IP-cores, ...)
  – Pipeline generators
  – C-based flows (Impulse-C, Catapult-C, FCUDA, OpenCL, LLVM) exist ...
• Special tool flows
  – Simulation – takes time
  – Compilation – takes more time, O(hours)
  – Timing closure – iterate over simulation + compilation ...

[Figure: ATLAS FPGA histogramming]


FPGA Co-Processors

• Co-processor building blocks
  – Main FPGA
  – Host IF
  – Local DRAM
  – Local SRAM
  – I/O expansion
  – Clk, Ctl, Cfg
• Activities
  – Boards: VME, cPCI, PCI, PCIe (HTX at the CA dept.)
  – Algorithms: stand-alone + hybrid
  – Tools: compilers, libraries, frameworks


Co-processors cont'd

• MPRACE-1 (equiv. ROBIN)
  – 3M gates Virtex-2, 2001
  – PCI, 256 MB/s
  – still in use ...
• MPRACE-2
  – 6M gates Virtex-4, 2007
  – PCIe 4x, 1 GB/s
• MPRACE-3
  – 13M gates Virtex-6
  – 2011 (planned)
  – PCIe 2.0 8x, 4 GB/s

Notes (resources relative to MPRACE-1):
• LUTs: 30k, DSP/BRAMs: 100
• SPH: 1 Flop = 200 LUTs, 0.5 DSP
• V4++: with MGTs; V6++: BRAM 2x, DSP 24(18), LUT 6(4); V7/K7: 10G MGTs
• Speed 60 .. 300 MHz

[Chart: LUT/BRAM/DSP resource growth of Xilinx mid-size FPGAs over the last 10 years: xc2v3000, xc4vfx60, xc6vlx130t, xc7k325t, xc7v450t]
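A quick cross-check of those numbers (my arithmetic, combining the per-Flop cost above with the 60-op SPH pipeline shown later):

60 Flops × 200 LUTs/Flop = 12k LUTs (of 30k); 60 × 0.5 = 30 DSPs (of 100)

so one SPH pipeline fits on MPRACE-1 with room left for I/O and control logic.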


MPRACE-2

[Image: MPRACE-2 board with 10GE mezzanine]


Software architecture

• Driver
  – Linux
    • Generic driver
    • Kernel/user mode
  – IRQ handler (device specific)
  – PCI + memory resources
    • SG-lists
  – C, C++ user API
  – Configurable device IDs
• Libraries
  – Buffer management
    • Buffer + translate
  – Multi-client server (dynamic device allocation)
  – Device library (next slide)


Device library

[Diagram: device library structure, board-specific parts]


PCI performance (Gen 1, 4x)

Measured: downstream (FPGA reading) and upstream (host writing).

• ~800 MB/s upstream, ~700 MB/s downstream
• Big platform-dependent differences, > factor 2x
• Best results with polling on an FPGA register (DMA done); with IRQs, rates are lower for packets < ~256 kB

PCIe2.0 8x tests in progress
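A minimal host-side sketch of that polling scheme; the device node, register offset, and mapping size are hypothetical placeholders, not the actual MPRACE driver API:

/* Sketch only: poll a memory-mapped "DMA done" flag instead of waiting
 * for an IRQ. Device node and register offset are hypothetical. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

enum { REG_DMA_DONE = 0x40 };  /* hypothetical register offset */

int main(void) {
    int fd = open("/dev/mprace0", O_RDWR);  /* hypothetical node */
    volatile uint32_t *regs = (volatile uint32_t *)
        mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* ... build the SG-list and start the DMA via other registers ... */

    /* Spin on the completion flag: lower latency than an IRQ for
     * small transfers, at the cost of one busy CPU core. */
    while ((regs[REG_DMA_DONE / 4] & 1u) == 0)
        ;

    munmap((void *)regs, 4096);
    close(fd);
    return 0;
}

Polling trades one spinning core for the IRQ round trip, which is why it wins for small transfers.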


Computing


FPGA computing

• Pre-2000
  – FPGAs too small for floating point
• 2000..2007
  – Big speedups with “massively” parallel implementations
  – Several accelerated machines: SGI, Cray, ...
• Since then
  – GPUs dominating
  – Niches still exist ...

[Figure: Altera, 2007: speedups of 10..370]


GPU computing

• Levels of processing
  – Multi-processor, multi-core, multi-thread ...
  – ~500 GFlop/s DP
• HW thread scheduling
• Levels of memory
  – Local, shared, global, host
• PCIe Gen 2.0 x16

[Figure: Nvidia GPU architecture. MOLE-8.5 @ IPE: 1 PFlop/s DP ... no need to say more ...]
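A minimal CUDA sketch of those levels (grid/block/thread hierarchy; registers, shared, and global memory), purely illustrative:

#include <cstdio>

/* Each block stages one tile of global memory in shared memory,
 * then each thread scales its element: grid/block/thread hierarchy
 * plus the local/shared/global memory levels from the slide. */
__global__ void scale(const float *in, float *out, float s, int n) {
    __shared__ float tile[256];                  /* shared: per block   */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  /* global -> shared    */
    __syncthreads();
    if (i < n)
        out[i] = s * tile[threadIdx.x];          /* registers ("local") */
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));          /* global device memory */
    cudaMalloc(&out, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(in, out, 2.0f, n);  /* grid, block */
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out);
    return 0;
}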


Accelerating applications

• Acceleration = parallelism
• CPU, GPU, FPGA, ASIC
  – different levels of parallelism, tools, W/Flop
• Heterogeneous multicores: speedup drops rapidly as f (the fraction of acceleratable time) falls
• Technology doesn't help

[Figure: E. S. Chung et al., 2010]
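The relation behind the heterogeneous-multicore bullet is Amdahl's law: with acceleratable fraction f and accelerator speedup s,

S(f, s) = 1 / ((1 - f) + f/s)

Even with s -> infinity, f = 0.9 caps the overall speedup at 1/0.1 = 10; hence the rapid drop as f falls.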


ATLAS

• Largest LHC particle detector
• Initial rate ~1 GHz at 10^8 channels (40 MHz BX)
• H -> 4µ rate: ~1000/a
• Trigger/DAQ: reduction to 100 Hz × 1 MB
  – Don't lose any Higgs!
• Offline: Tier 0/1/2 data centers (e.g. IHEP Beijing, Tier 2)

[Figure: H -> ZZ -> 4µ event topology (p p -> H -> Z Z -> µ+ µ- µ+ µ-); ATLAS detector 44 × 26 m, 7000 t]


ATLAS DAQ

• 1600 × 2 Gbit/s links from the detector
• 600 ROBINs in 160 PCs
  – Intelligent buffering
• Online processing on CPU farms, O(1k) nodes

[Diagram: detector data after L1 arrives at the ROBIN FPGA cards at a 100 kHz event rate; buffered data is served on 3 .. 20 kHz requests]
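The aggregate rates implied by these numbers (my arithmetic, using the ~1 MB event size from the previous slide):

1600 × 2 Gbit/s = 3.2 Tbit/s = 400 GB/s of link capacity into the ROBINs
100 kHz × 1 MB = 100 GB/s buffered after L1
100 Hz × 1 MB = 100 MB/s finally written out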


ATLAS Processing

• Tracking @ L2: find tracks from RoI data
  – 1) Find tracks: Hough transform, Zfinder
    • High luminosity: ~30 × #hits
  – 2) Fit tracks: fitting / Kalman filter to obtain track parameters
  – CPU solutions installed
  – FPGA solution for the Hough transform (dropped)
  – New GPU implementation for Zfinder/Kalman: speedup ~35/5 (Tesla C1060)
• Upgrade: RoIs not sufficient
  – FTK (FPGA + CAM) after L1
  – 10^9 patterns in parallel, 10^3 speedup

[Figures: FTK architecture (ATLAS FTK collaboration); vertex reconstruction. P. J. Clark et al., 2011; A. Khomich et al., 2006]


ATLAS Processing (2)

• Pixel detector calibration
  – Threshold scan: verify proper settings (+ tune)
• New IBL: 12M pixels
• DSP code replaced by FPGA histogramming + CPU/GPU fit (µ, σ)
• LUT-based fitting fastest, but not flexible
• Currently only pixel parallel
• GPU needs a parallel mask

[Figure: examples of bad and good fits]
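A sketch of the pixel-parallel idea in CUDA: one thread owns one pixel and accumulates that pixel's occupancy histogram across scan steps, so no atomics are needed. The array layout and names are assumptions, not the production code:

/* Sketch: threshold-scan histogramming, one thread per pixel.
 * hits[step][pixel] = hit count reported for that scan step. */
__global__ void histo_scan(const unsigned short *hits, /* [steps][npix] */
                           unsigned int *histo,        /* [npix][steps] */
                           int npix, int steps) {
    int pix = blockIdx.x * blockDim.x + threadIdx.x;
    if (pix >= npix) return;
    for (int s = 0; s < steps; ++s)
        histo[pix * steps + s] += hits[s * npix + pix];  /* no atomics */
}

The per-pixel µ/σ fit of the resulting occupancy curves then runs as a separate CPU/GPU step.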


The GRACE Project

• Galaxy evolution: treating diffuse interstellar matter requires SPH (hydrodynamic force)
• Goal: hybrid system for N-body + SPH
  – CPU: orbital integration + I/O, O(N)
  – GRAPE: gravitational force, O(N²)
    • Now on GPU
    • Big speedup
    • Once gravitation is accelerated, SPH is the most time-consuming part
  – FPGA: SPH, O(N_n · N) with N_n neighbours
    • Now on GPU
    • But also on FPGA


SPH

• Two steps: density, acceleration

• Loop over particles (i), with neighbours (j). O(j) ~ 50

• Store intermediate results on host
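For reference, step 1 (density) in its standard SPH form, with smoothing kernel W and smoothing length h (the specific kernel used in GRACE is not given here):

\rho_i = \sum_{j \in neigh(i)} m_j \, W(|r_i - r_j|, h)

Step 2 then evaluates the accelerations from these densities over the same ~50-neighbour loops.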


SPH on FPGA

• Step 2 pipeline
  – 60 floating-point ops
• Limited precision (16-bit significand)
• Algorithm not executed but cast into hardware
• Systolic processing


FPGA Programming

• Algorithm: building blocks + pipeline generator (PDL)
• Memory: external SRAM => custom interface
• I/O via PCIe => custom interface

PDL example (a distance pipeline computing s = v·t + (a/2)·t²):

entity distance;
clock clk;
# parameters
floPValDef fpDef(m=>24, e=>8, s=>1, z=>0);
# inputs
signal (suppress_v);
floPVal (v, a, t)(fpDef);
# calculate
t2 = <floPSquare> t;
s1 = v <floPMult> t;
ss1 = gated(s1, suppress_v);
half_a = <floPDiv2> a;
s2 = half_a <floPMult> t2;
s = ss1 <floPAdd> s2;

[Diagram: high-level description (HDL) + visual representation of the generated pipeline – floPVecDiff, floPVecCrossProd, floPVecSquare and calcPDivRho2 blocks wired through save/latch stages, producing r_ij, v_ij, r_ij² and p/ρ² terms]


SPH on GPU

• N(pipelines) ~ 500 .. 1k = x(processors) * y(threads)

• Local, shared and global memory

• Map neighbour lists on threads: must fit in shared memory
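A condensed CUDA sketch of that mapping for the density pass (one thread per particle walking its neighbour list); the data layout, names, and placeholder kernel W are assumptions, not the racesph implementation:

#define NB_MAX 64  /* ~50 neighbours, padded */

/* Placeholder smoothing kernel, unnormalized, just to be self-contained. */
__device__ float W(float r, float h) {
    float q = r / h;
    return (q < 1.0f) ? (1.0f - q) * (1.0f - q) : 0.0f;
}

/* Density pass: one thread per particle i. The per-particle neighbour
 * lists are exactly the data that must fit the shared-memory budget
 * mentioned above; the staging itself is omitted here for brevity. */
__global__ void sph_density(const float4 *pos_m,  /* xyz = pos, w = mass */
                            const int *nb,        /* [npart][NB_MAX], -1 = end */
                            float *rho, float h, int npart) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= npart) return;
    float4 pi = pos_m[i];
    float acc = 0.0f;
    for (int k = 0; k < NB_MAX; ++k) {
        int j = nb[i * NB_MAX + k];
        if (j < 0) break;                      /* end of neighbour list */
        float4 pj = pos_m[j];
        float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
        acc += pj.w * W(sqrtf(dx*dx + dy*dy + dz*dz), h);  /* m_j W(r,h) */
    }
    rho[i] = acc;
}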


Common GPU/FPGA Framework

• C, C++, Fortran interfaces

• Complete abstraction of SPH capabilities: racesph library

• Intelligent buffer manager, incl. re-formatting for FPGA

• Device specific libraries and drivers


SPH at ARI

• Two clusters in operation, InfiniBand + accelerators
  – HD (32): GRAPE + MPRACE-1 (the GRAPEs meanwhile replaced by GTX9800s)
  – MA (42): Tesla + MPRACE-2
• FPGA: ~7 GFlop/s @ 20 W
• GPU (GTX8800): 2× as fast @ 150 W
• Speedup ~10
• HW not up to date ...

[Photo: ARI Titan cluster, 4 TFlop/s. Spurzem et al. 2007, 2009]


SPH astrophysical results

• Collision of interstellar gas clouds
• TREE-GRAPE + MPRACE (4 nodes)
  – M = 2000 M_sun, R = 3 pc
  – Isothermal evolution
  – Initial density distribution ~1/r
  – T = 20 K (c_sound = 0.3 km/s)
  – V_merge = 5 km/s
  – Calculation time 3 × t_ff = 6 Myr
  – Resolution h_min = 1e-4 pc
  – SPH MPRACE/CPU speedup ~10
  – Total GRAPE+MPRACE/CPU speedup ~15

Run times:
  N = 2x4k     DT_CPU = 52 min
  N = 2x8k     1.74 hours
  N = 2x16k    3.5 hours
  N = 2x32k    6.9 hours
  N = 2x64k    14 hours *
  N = 2x128k   28 hours
  N = 2x256k   55 hours
  N = 2x512k   111 hours

[Figure: initial conditions. Berczik 2008]


SPH astrophysical results (2)

[Figure sequence: colliding ...]


Networking


Network = bandwidth & latency problems

• Parallelism is key to application acceleration
• Parallel applications are distributed applications
• Inter-node latency must be hidden/reduced for good scaling
  – Commodity networks don't perform so well ...
  – Some apps can tolerate latency
• Solution by the CAG group: an EXTOLL-based low-latency network
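A first-order model of the problem (not from the slides): with end-to-end latency L, link bandwidth B, and message size S,

T(S) = L + S/B,  B_eff(S) = S / (L + S/B)

For the small messages of fine-grained parallel codes, T(S) is dominated by L, so reducing latency helps scaling more than adding bandwidth.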


EXTOLL Network

• Sub-µs end-to-end latency
• Switch-less 3D torus
  – Hop latency O(100 ns)
• Reliable transport
  – CRC, retransmission
• Hardware barriers
• Hardware multicast
• Global interrupts


EXTOLL Network (2)

• Host interface
  – HyperTransport HT3
  – PCIe (planned)
• Network interface
  – VELO
  – RMA
  – ATU
• EXTOLL network
  – Deterministic routing
  – 3 virtual channels
  – 6 links
• Virtex-6 prototype


EXTOLL SW Architecture

• Separate application and management interfaces
• Middleware and low-level API access
  – MPI/OpenMPI
  – PGAS/GASNet (prototype)
  – Message based (libVELO)
  – Shared memory (libRMA)
• Set of Linux kernel drivers
• Management software
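Since MPI/OpenMPI is the main middleware path listed above, here is a generic ping-pong microbenchmark (plain MPI, nothing EXTOLL-specific) for the half-round-trip latency such networks are judged by:

/* Generic MPI ping-pong: half the round-trip time approximates the
 * end-to-end latency for small messages. Run with: mpirun -np 2 ... */
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 10000;
    char byte = 0;
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;
    if (rank == 0)
        printf("latency: %.3f us\n", dt / iters / 2 * 1e6);  /* half RTT */
    MPI_Finalize();
    return 0;
}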


Host interfaces

• HyperTransport is inherently lower latency than PCIe => 1st choice
  – Interfaces directly to the CPU: AMD only
  – Small overhead, no encoding
  – 8/16 bit DDR HT3 @ 400, 600 .. 2400 MHz
  – HT3 protocol engine
    • Sync, framing, buffers


Host interfaces (2)

• PCIe
  – Off-CPU (any vendor)
  – Bi-directional serial interface
    • 2.5, 5, 8 Gbit/s per lane
    • 1, 4, 8, 16 lanes; switched, packet based
    • 8B10B encoding, CDR required
    • Ubiquitous ...
    • Many slots/node
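Working the encoding overhead through for the Gen 1 x4 link used on MPRACE-2 (my arithmetic):

2.5 Gbit/s × 8/10 = 2 Gbit/s per lane; 4 lanes => 8 Gbit/s = 1 GB/s raw

Packet headers and flow control cut this further, consistent with the ~700-800 MB/s measured earlier.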


HTX/EXTOLL enabled systems

• Experimental HTX cluster (8 nodes) at ZITI
  – Application testing
  – No GPU (as no PCIe)
• HTX + PCIe ZITI demonstrator
  – Upgrade to Ventoux this spring
  – GPU + EXTOLL possible
• Commercial, HT2600 ASIC-based version ~2012


More custom interconnects

• FPGAs capable of running 40G and 100G links
• Xilinx XC7VH870T
  – 16 × 28 Gbit/s
  – 72 × 13 Gbit/s
• Xilinx XC7K325T
  – 16 × 11 Gbit/s
• Potential for custom computing/communication accelerators?
  – Plus embedded GPUs?


Summary

• HPC is dominated by GPU-accelerated clusters
• Scaling of some applications suffers from network latencies
  – Low-latency implementations needed (EXTOLL)
• FPGA-based accelerators remain attractive
  – where GPUs don't perform well (thread/memory mapping ...)
  – fast I/O on the accelerator
  – special applications (embedded, ATLAS ...)
  – power consumption


Thanks for your attention