TRANSCRIPT
Cultivating a community about FPGA-HPC platforms
9th JLESC workshop. Knoxville, TN, USA. April 16, 2019
Kazutomo Yoshii <[email protected]>
Mathematics and Computer Science
Argonne National Laboratory
JLESC project: Evaluating high-level programming models for FPGA platforms
Carlos Alvarez (BSC), Daniel Jimenez-Gonzalez (BSC), Xavier Martorell (BSC), Osman Unsal (BSC), Eric Rutten (INRIA), Kentaro Sano (R-CCS), Zheming Jin (ANL), Hal Finkel (ANL), Franck Cappello (ANL)
CMOS Scaling Is Coming to An End
● Requires significant investment
– e.g., Intel spent $5B on 14nm
– Rock’s law: cost of new plant doubles every four years
● Benefits are shrinking
– thermal, leakage, reliability, etc.
● The number of manufacturing companies has shrunk from 20 to 5 in the past 15 years!
● “The number of people predicting the death of Moore’s law doubles every two years.”
IEEE International Roadmap for Devices and Systems (2017)
How Can We Survive in The Post-Moore Era?
● Our demand for computational power keeps growing exponentially
– scientific discoveries depend on computational power
● New types of computers?
– quantum computers may not be ready in a timely manner
– different concept, the applicability of classical algorithms is questionable
● Still depend on general-purpose processors
– Performance is driven by transistor scaling
[Figure: beyond general-purpose processors, whose performance is driven by transistor scaling and new switching technologies, lie specialization (e.g., AI processors), reconfigurable co-design, and quantum or brain-inspired computers]
“Re-form” LDRD project was funded in 2015
Investigators: Kazutomo Yoshii, Franck Cappello, Hal Finkel, Fangfang Xia
Data and learning are new requirements for HPC
FPGAs for data analytics?
[Figure: genome analysis pipeline: FASTQ → mapping/aligning → position sorting → duplicate marking → variant calling → VCF]
● Complicated pipelines
– different implementations
– integer-heavy computation on some stages
● Scaling studies are still new
– end up in runtime system development
● Edico Genome currently holds the Guinness world record for fastest time
– 1,000 FPGA-equipped Amazon EC2 F1 instances for 1,000 human genomes
FPGAs for learning?
● Machine learning (ML) acceleration is becoming mandatory to future HPC!
● Workload characteristics are different from ordinary numerical computing
– ML algorithms and implementation techniques keep evolving
● mixed/reduced precision, stochastic rounding, zero-pruning, etc
● hard for ASICs to keep up with
– some applications are latency-sensitive
● Already many success stories using FPGAs
Edge and Near-Sensor Computing
[Figure: sensors feed an edge node, then a data acquisition node, then the cloud and HPC/disk. More sensors and higher resolution make the increase in data rate exponential, while the upstream links have limited bandwidth and higher latency. Opportunity here!]
Unfortunately, FPGAs are not adopted in HPC
● despite numerous successes in the cloud and data center space
● Individual FPGA research efforts are often impressive
– innovative architecture designs, dataflow studies, etc.
● Hard to translate one group's knowledge to others
– the nature of the platforms
● large, complicated heterogeneous architectures
– lack of abstraction
– lack of community
● lack of common platforms
Field Programmable Gate Array (FPGA)
● The first FPGA chip (1985)
– 64 flip-flops, 128 3-input lookup tables
● Practical reconfigurable architecture
● Lower non-recurring engineering cost compared to ASICs
– once an ASIC design gets fixed, no one touches it
● Applications
– prototyping ASICs
– signal processing
– data acquisition systems
[Figure: FPGA fabric of logic blocks connected through switch blocks]
Today’s FPGA technology
● Heterogeneous– not only logic elements but also DSP, BRAM, etc.
● Floating point capability– Intel Stratix 10’s theoretical peak is 10 Tflops (SP)
● Run faster– Technology like Hyperflex helps
– > 600 MHz is not a dream
● Large internal memory– up to ~40MB of SRAM (e.g., Xilinx VU37P)
– very high internal bandwidth
● Off-chip memory improvement– HBM2 integration
● High-speed transceivers– ~56 Gbps
– can be used for direct FPGA-FPGA communication
● The advent of FPGA-CPU hybrid platforms– ARM-FPGA, Xeon-FPGA, etc
● Embedded-class FPGAs
Intel Agilex
● Scalable– edge, networking, cloud
● Heterogeneous system in package (SiP)– possible eASIC integration
● bfloat16, HBM
● CXL (Compute Express Link)
● oneAPI
Lack of abstraction and portability
● So many FPGA chips, different boards/platforms
● No compatibility, even at the source level
– longer compilation times
– there are many different Xeon chips, too, but they offer binary compatibility and short compilation times
● HLS can abstract FPGA resources to some degree
● Need to abstract off-chip memory, I/O, debugging APIs, etc
Every FPGA chip has a big product table!
[Figure: deployment options range from standalone boards and PCIe-attached cards to FPGAs tightly coupled with beefy CPUs]
bash $ cc hello.c
bash $ ./a.out
hello
bash $ cc app.c -lm -l....
Abstraction layers for FPGAs are emerging
● OpenCL BSPs
– provide low-level software APIs
– limit board choices, less extensible
● or build a custom BSP
– OpenCL features may be overkill
● Intel Open Programmable Acceleration Engine (OPAE)
– abstracts accelerators such as FPGAs
– consists of kernel drivers, userspace libraries, and tools (e.g., discover, reconfigure)
– Supported platforms ?
● AWS EC2 FPGA hardware and software development kits
– FPGA shells and software APIs
– https://github.com/aws/aws-fpga.git
● Questions
– do they support our edge-to-HPC needs?
– do they support various programming models?
● offload, streaming, pure dataflow, hybrid dataflow, etc
Presented at Intel’s 2018 Architecture Day
Paradigm shift in cluster designs
[Figure: left, a conventional cluster where each node's CPU-FPGA pair communicates through the CPUs over the interconnect; right, FPGAs communicating directly with each other to form one larger FPGA]
● Sequential codes to FPGA designs: more memory references; dataflow models minimize memory references
● Communication via CPU vs. direct FPGA-FPGA communication to form a larger FPGA
● FPGAs have a rich set of I/O (transceivers, GPIO)
● Catapult (Microsoft), Novo-G# (Boston U.), PEACH (U. Tsukuba)
Common software stack is missing!
Efficient data movement
● Complicated memory hierarchy
– fast on-chip: register, BRAM
– off-chip: QDR, DDR, HMC, HBM2
– remote: over transceiver, Gen-Z, CXL, etc
– storage: SSD, 3D XPoint
● Minimize costly data movement
– exploit data locality (e.g., cache)
– FPGAs in data path
● reduced precision, compression
● stream processing
● Address space management (OS level)
– OpenCL host memory, hybrid dataflow models
– shared virtual memory is becoming the norm
● IOMMU, e.g., PCIe ATS
● efficient TLB management schemes
– dataflow address regions, like DMA regions
[Figure: an FPGA with a custom cache and processing unit sits in the data path; another FPGA sits next to storage]
HPC-FPGA community
● Historically, the FPGA electronic design automation (EDA) community had little interaction with software folks
● With the emergence of high-level synthesis tools, more software folks have started evaluating FPGAs
● Successful FPGA stories in the cloud space
– driven by specific workloads
– custom solutions
● written in HDL
● Still at an early stage, but an HPC-FPGA community has started emerging
– organizing events to cultivate the HPC-FPGA community
Past workshops and conference events (1)
● 2016 Jan: Workshop on FPGAs for scientific simulation and data analytics, Argonne, IL
– Highlights: 14 speakers. Universities, national labs, vendors (Xilinx, Altera); high-level synthesis, SYCL/SPIR-V
– https://collab.cels.anl.gov/display/REFORM/Workshop20160121
● 2016 Oct: Workshop on FPGAs for scientific simulation and data analytics, Urbana, IL
– Highlights: 19 speakers. Programming models, scalable results
– http://www.ncsa.illinois.edu/Conferences/FPGA16/agenda.html
Past workshops and conference events (2)
● 2017 Nov: SC17 birds-of-a-feather session, "Reconfigurable Computing in Exascale", Denver
– Highlights: 8 speakers. CERN, Maxeler, Micron, Intel, Cray, BSC, universities; no overlap with our previous workshops
– https://sc17.supercomputing.org/SC17%20Archive/bof/bof_pages/bof148.html
● 2018 Mar: 3rd International Workshop on FPGA for HPC (IWFH), Tokyo
– Highlights: 8 speakers. Microsoft, national labs, universities. Successful large-scale FPGA clusters, programming models, tightly-coupled FPGA clusters, runtime systems
– https://www.ccs.tsukuba.ac.jp/hpc-iwfh/
Past workshops and conference events (3)
● 2018 Nov: SC18 birds-of-a-feather session "Benchmarking Scientific Reconfigurable/FPGA Computing", Dallas
– Highlights: 8 speakers; focus on benchmarking
– https://sc18.supercomputing.org/proceedings/bof/bof_pages/bof190.html
● 2018 Nov: SC18 panel session "Reconfigurable computing for HPC: Will it make it this time?", Dallas
– Highlights: 7 speakers. Intel, Xilinx, national labs, universities.
– https://sc18.supercomputing.org/presentation/?id=pan112&sess=sess299
Past workshops and conference events (4)
● 2018 Dec: FPT18 workshop "Workshop on Integrating HPC and FPGAs", Okinawa, Japan
– Highlights: 4 speakers. RIKEN, INRIA, BSC, Boston University. Programming paradigms (dataflow, task), runtime, FPGA clusters
– https://collab.cels.anl.gov/display/HPCFPGA/HPC-FPGA
● 2018 Dec: FPT18 workshop "Workshop on Reconfigurable High-Performance Computing", Okinawa, Japan
– Highlights: 9 speakers from the FPGA community. Abstraction, virtualization, runtime, dataflow
– https://collab.cels.anl.gov/display/RECONFHPC/RECONF-HPC
Next workshop
● The 9th JLESC workshop helped initiate the conversation!
● Planning to submit a workshop proposal to FPL2019
– Barcelona, Spain
– Sep 12 or 13 (tentative)
● Format
– full day; one or two keynotes; short talks; panel; possibly a call for extended abstracts
● Possible topics
– low-level runtime and operating systems
– high-level front-ends
– programming models
● general or domain-specific
– virtualization, coarse-grained reconfigurable architectures
– heterogeneous clusters
● not only FPGAs, but also other accelerators like GPUs and AI chips
● efficient data movement
– common playground/platform for collaborative work