TRANSCRIPT
Cultivating a community about FPGA-HPC platforms
9th JLESC workshop. Knoxville, TN, USA. April 16, 2019
Kazutomo Yoshii <[email protected]>
Mathematics and Computer Science
Argonne National Laboratory
JLESC project: Evaluating high-level programming models for FPGA platforms
Carlos Alvarez (BSC), Daniel Jimenez-Gonzalez (BSC), Xavier Martorell (BSC), Osman Unsal (BSC), Eric Rutten (INRIA), Kentaro Sano (R-CCS), Zheming Jin (ANL), Hal Finkel (ANL), Franck Cappello (ANL)
CMOS Scaling Is Coming to An End
● Requires significant investment
– e.g., Intel spent $5B on 14nm
– Rock’s law: cost of new plant doubles every four years
● Benefits are shrinking
– thermal, leakage, reliability, etc.
● The number of manufacturing companies has shrunk from 20 to 5 in the past 15 years!
● “The number of people predicting the death of Moore’s law doubles every two years.”
IEEE International Roadmap for Devices and Systems (2017)
How Can We Survive in The Post-Moore Era?
● Our demand for computational power keeps growing exponentially
– scientific discoveries depend on computational power
● New types of computers?
– quantum computers may not be ready in a timely manner
– different concept, the applicability of classical algorithms is questionable
● Still depend on general-purpose processors
– Performance is driven by transistor scaling
[Figure: beyond general-purpose processors, whose performance is driven by transistor scaling and new switching technologies, lie specialization (e.g., AI processors), reconfigurable co-design, and quantum or brain-inspired computers]
“Re-form” LDRD project was funded in 2015
Investigators: Kazutomo Yoshii, Franck Cappello, Hal Finkel, Fangfang Xia
Data and learning are new requirements for HPC
FPGAs for data analytics?
[Figure: genome analysis pipeline: FASTQ → mapping/aligning → position sorting → duplicate marking → variant calling → VCF]
● Complicated pipelines
– different implementations
– integer-heavy computation on some stages
● Scaling studies are still new
– end up in runtime system development
● Edico Genome currently holds the Guinness world record for fastest time
– 1,000 FPGA-equipped Amazon EC2 F1 instances for 1,000 human genomes
FPGAs for learning?
● Machine learning (ML) acceleration is becoming mandatory to future HPC!
● Workload characteristics are different from ordinary numerical computing
– ML algorithms and implementation techniques keep evolving
● mixed/reduced precision, stochastic rounding, zero-pruning, etc
● hard for ASICs to keep up with
– some applications are latency-sensitive
● Already many success stories using FPGAs
Edge and Near-Sensor Computing
[Figure: sensors feed an edge node, then a data acquisition node, then the cloud and HPC/disk. More sensors and higher resolution make the increase in data rate exponential, while the upstream links have limited bandwidth and higher latency. Opportunity here!]
Unfortunately, FPGAs are not adopted in HPC
● despite numerous successes in the cloud and data center space
● Individual FPGA research efforts are often impressive
– innovative architecture designs, dataflow studies, etc.
● Hard to translate one group's knowledge to others
– the nature of the platforms
● large, complicated heterogeneous architectures
– lack of abstraction
– lack of community
● lack of common platforms
Field Programmable Gate Array (FPGA)
● The first FPGA chip (1985)
– 64 flip-flops, 128 3-input lookup tables
● Practical reconfigurable architecture
● Lower non-recurring engineering cost compared to ASICs
– once an ASIC design gets fixed, no one touches it
● Applications
– prototyping ASICs
– signal processing
– data acquisition systems
[Figure: FPGA fabric of logic blocks connected through switch blocks]
Today’s FPGA technology
● Heterogeneous– not only logic elements but also DSP, BRAM, etc.
● Floating point capability– Intel Stratix 10’s theoretical peak is 10 Tflops (SP)
● Run faster– Technology like Hyperflex helps
– > 600 MHz is not a dream
● Large internal memory– up to ~40MB of SRAM (e.g., Xilinx VU37P)
– very high internal bandwidth
● Off-chip memory improvement– HBM2 integration
● High-speed transceivers– ~56 Gbps
– can be used for direct FPGA-FPGA communication
● The advent of FPGA-CPU hybrid platforms– ARM-FPGA, Xeon-FPGA, etc
● Embedded-class FPGAs
Intel Agilex
● Scalable– edge, networking, cloud
● Heterogeneous system in package (SiP)– possible eASIC integration
● bfloat16, HBM
● CXL (Compute Express Link)
● oneAPI
Lack of abstraction and portability
● So many FPGA chips, different boards/platforms
● No compatibility, even at the source level
– longer compilation times
– there are many different Xeon chips, too, but they offer binary compatibility and short compilation times
● HLS can abstract FPGA resources to some degree
● Need to abstract off-chip memory, I/O, debugging APIs, etc
Every FPGA chip has a big product table!
[Figure: deployment options range from standalone boards and PCIe-attached cards to FPGAs tightly coupled with beefy CPUs]
bash $ cc hello.c
bash $ ./a.out
hello
bash $ cc app.c -lm -l....
Abstraction layers for FPGAs are emerging
● OpenCL BSPs
– provide low-level software APIs
– limit board choices, less extensible
● or build a custom BSP
– OpenCL features may be overkill
● Intel Open Programmable Acceleration Engine (OPAE)
– abstracts accelerators such as FPGAs
– consists of kernel drivers, userspace libraries, and tools (e.g., discover, reconfigure)
– Supported platforms ?
● AWS EC2 FPGA hardware and software development kits
– FPGA shells and software APIs
– https://github.com/aws/aws-fpga.git
● Questions
– do they support our edge-to-HPC needs?
– do they support various programming models?
● offload, streaming, pure dataflow, hybrid dataflow, etc
Presented at Intel’s 2018 Architecture Day
Paradigm shift in cluster designs
[Figure: left, a conventional cluster where each node's CPU-FPGA pair communicates through the CPUs over the interconnect; right, FPGAs communicating directly with each other to form one larger FPGA]
● Sequential codes to FPGA designs: more memory references; dataflow models minimize memory references
● Communication via CPU vs. direct FPGA-FPGA communication to form a larger FPGA
● FPGAs have a rich set of I/O (transceivers, GPIO)
● Catapult (Microsoft), Novo-G# (Boston U.), PEACH (U. Tsukuba)
Common software stack is missing!
Efficient data movement
● Complicated memory hierarchy
– fast on-chip: register, BRAM
– off-chip: QDR, DDR, HMC, HBM2
– remote: over transceiver, Gen-Z, CXL, etc
– storage: SSD, 3D XPoint
● Minimize costly data movement
– exploit data locality (e.g., cache)
– FPGAs in data path
● reduced precision, compression
● stream processing
● Address space management (OS level)
– OpenCL host memory, hybrid dataflow models
– shared virtual memory is becoming the norm
● IOMMU, e.g., PCIe ATS
● efficient TLB management schemes
– dataflow address regions, like DMA regions
[Figure: an FPGA with a custom cache and processing unit sits in the data path; another FPGA sits next to storage]
HPC-FPGA community
● Historically, the FPGA electronic design automation (EDA) community had little interaction with software folks
● With the emergence of high-level synthesis tools, more software folks have started evaluating FPGAs
● Successful FPGA stories in the cloud space
– driven by specific workloads
– custom solutions
● written in HDL
● Still at an early stage, but an HPC-FPGA community has started emerging
– organizing events to cultivate the HPC-FPGA community
Past workshops and conference events (1)
● 2016 Jan: Workshop on FPGAs for scientific simulation and data analytics, Argonne, IL
– Highlights: 14 speakers. Universities, national labs, vendors (Xilinx, Altera); high-level synthesis, SYCL/SPIR-V
– https://collab.cels.anl.gov/display/REFORM/Workshop20160121
● 2016 Oct: Workshop on FPGAs for scientific simulation and data analytics, Urbana, IL
– Highlights: 19 speakers. Programming models, scalable results
– http://www.ncsa.illinois.edu/Conferences/FPGA16/agenda.html
Past workshops and conference events (2)
● 2017 Nov: SC17 birds-of-a-feather session, "Reconfigurable Computing in Exascale", Denver
– Highlights: 8 speakers. CERN, Maxeler, Micron, Intel, Cray, BSC, universities; no overlap with our previous workshops
– https://sc17.supercomputing.org/SC17%20Archive/bof/bof_pages/bof148.html
● 2018 Mar: 3rd International Workshop on FPGA for HPC (IWFH), Tokyo
– Highlights: 8 speakers. Microsoft, national labs, universities. Successful large-scale FPGA clusters, programming models, tightly-coupled FPGA clusters, runtime systems
– https://www.ccs.tsukuba.ac.jp/hpc-iwfh/
Past workshops and conference events (3)
● 2018 Nov: SC18 birds-of-a-feather session "Benchmarking Scientific Reconfigurable/FPGA Computing", Dallas
– Highlights: 8 speakers; focus on benchmarking
– https://sc18.supercomputing.org/proceedings/bof/bof_pages/bof190.html
● 2018 Nov: SC18 panel session "Reconfigurable computing for HPC: Will it make it this time?", Dallas
– Highlights: 7 speakers. Intel, Xilinx, national labs, universities.
– https://sc18.supercomputing.org/presentation/?id=pan112&sess=sess299
Past workshops and conference events (4)
● 2018 Dec: FPT18 workshop "Workshop on Integrating HPC and FPGAs", Okinawa, Japan
– Highlights: 4 speakers. RIKEN, INRIA, BSC, Boston University. Programming paradigms (dataflow, task), runtime, FPGA clusters
– https://collab.cels.anl.gov/display/HPCFPGA/HPC-FPGA
● 2018 Dec: FPT18 workshop "Workshop on Reconfigurable High-Performance Computing", Okinawa, Japan
– Highlights: 9 speakers from the FPGA community. Abstraction, virtualization, runtime, dataflow
– https://collab.cels.anl.gov/display/RECONFHPC/RECONF-HPC
Next workshop
● The 9th JLESC workshop helped initiate the conversation!
● Planning to submit a workshop proposal to FPL2019
– Barcelona, Spain
– Sep 12 or 13 (tentative)
● Format
– full day; one or two keynotes; short talks; panel; possibly a call for extended abstracts
● Possible topics
– low-level runtime and operating systems
– high-level front-ends
– programming models
● general or domain-specific
– virtualization, coarse-grained reconfigurable architectures
– heterogeneous clusters
● not only FPGAs, but also other accelerators like GPUs and AI chips
● efficient data movement
– common playground/platform for collaborative work