embedded supercomputing at nvidia · tegra enabled arm multicore testing but the embedded gpu was...

Alex Ramirez, NVIDIA Research

Embedded Supercomputing at NVIDIA

The project Build HPC using embedded commodity technology

Tibidabo Tegra 2

KAYLA Tegra 3 +

Quadro GPU

Pedraforca Tegra 3 + Tesla K20

Mont-Blanc Exynos 5250

Tegra enabled ARM Multicore testing But the embedded GPU was not programmable …

Tegra 2 2x ARM Cortex-A9 VFP for 64-bit FP Low-power GPU …

Tegra 3 4x ARM Cortex-A9 VFP for 64-bit FP 3x faster GPU …

Tegra 4 4x ARM Cortex-A15

72-core embedded GPU …

Samsung Exynos 5250

2x ARM Cortex-A15 @ 1.7 GHz

4-core ARM Mali T604

OpenCL 1.1

Dual-channel DDR3

USB 3.0 to 1 GbE bridge

First embedded + programmable GPU

NVIDIA Tegra K1

4x ARM Cortex-A15 @ 2.3 GHz

32KB + 32KB L1 cache

2 MB L2 cache

192-core Kepler GPU

CUDA 6

The first embedded CUDA GPU

Jetson DevKit Clusters @ SC’14 Lead the way … and they will follow

NVIDIA Tegra K1-64 (Denver)

2x NVIDIA Denver CPU @ 2.5 GHz

64-bit ARM v8

2 MB L2 cache

192-core Kepler GPU

CUDA 6

Pin compatible with Tegra K1

64-bit ARM CPU

NVIDIA Denver CPU

7-wide superscalar

2x FP pipelines

Dynamic code optimization

OOO performance with in-order execution

Aggressive HW prefetcher

Highest performance ARMv8 Mobile CPU

NVIDIA Tegra X1

4x ARM Cortex A57

2 MB L2 cache

4x ARM Cortex A53

512KB L1 cache

256-core Maxwell GPU

Embedded GPU overhaul

Drive-PX

2x Tegra X1

4x ARM Cortex-A57

4x ARM Cortex-A53

256-core Maxwell GPU

2.3 TFLOPS (32-bit FP)

1 GbE cluster interconnect

Embedded supercomputer for autonomous driving

Drive-PX2

2x Tegra X2

4x ARM Cortex-A57

2x NVIDIA Denver2

2x 3840-core Pascal GPU

8 TFLOPS (64-bit FP)

24 TFLOPS (16-bit FP)

1 GbE cluster interconnect

Discrete GPU for embedded deep learning capabilities

This is what 8 TFLOPS looked like in 2001

ASCI White, LLNC, CA

This is what 8 TFLOPS look like in 2016 Drive-PX2, under the hood of your next self-driving car

This is what 40 TFLOPS looked like in 2004 MareNostrum, Barcelona Supercomputing Center, Spain

This is what 40 TFLOPS look like in 2016 DGX-1 Deep Learning System

NVlink in NVIDIA’s NUMA link

NVlink across Tegra SoCs NVlink across Tegra and GPU

For GPU’s only?

* Diagram does not imply any current or planned NVIDIA products

The self-hosted GPU accelerator ?

NVlink across Tegra and GPU Self-hosted GPU package with heterogeneous memory

DDR + HBM

Why integrate the CPU with the GPU?

* Diagram does not imply any current or planned NVIDIA products

Conlusions

Embedded SoCs are gaining HPC compute capabilities

The benefits of SoC customization have not yet been exploited

Scaling performance across multiple SoC is still challenging …

… but connecting many of the is getting easier

Heterogeneous IP blocks across heterogeneous SoCs

Heterogeneous memories

Homogeneous programming model !?

Unleash the genie …

Phenomenal cosmic powers … Itty bitty living space

embedded supercomputing at nvidia · tegra enabled arm multicore testing but the embedded gpu was...

Documents

arm mali-g78ae · 2020. 9. 25. · arm mali-g78ae datasheet...

whitepaper nvidia tegra multi-processor...

the architecture of graphic processor unit – gpu and cuda...

nvidia® tegra software license agreement linux for tegra

tegra 4 the world’s fastest mobile processor 72 gpu cores...

arm mali gpu opengl es application optimization guide · pdf...

gpu-centric communication for improved...

tvm at arm · 2019. 12. 17. · •interested in arm cpu...

imaging using arm t6xx gpu

open arm gpu drivers - fosdem...open arm gpu drivers: where...

mobile vs desktop gpus and gpu vs mimd multicore cpu...

mali gpu application optimization guide - arm - the...

integrating cpu and gpu, the arm methodology

the arm mali -t880 mobile gpu - semantic scholar

arm mali gpu opengl es application optimization...

colibri tegra datasheet - arm and x86 embedded computer...

arm mali gpu opengl es application optimization guide

the bifrost gpu architecture and the arm mali-g71 gpu ·...

take gpu power to the maximum for vision and depth sensor...

tegra x1/tegra linux driver package multimedia user guide