Page 1: Powering Real-time Radio Astronomy Signal Processing with …on-demand.gputechconf.com/gtc/2018/presentation/s8339... · 2018. 4. 26. · Ajith Kumar B., Back-end group co-ordinator,

Powering Real-time Radio Astronomy Signal Processing with latest GPU architectures

Harshavardhan Reddy Suda, NCRA, India

Vinay Deshpande, NVIDIA, India

Bharat Kumar, NVIDIA, India

What signals are we processing?

GMRT

▪ The Giant Metrewave Radio Telescope (GMRT) is a world-class instrument for studying astrophysical phenomena at low radio frequencies

▪ Located 80 km north of Pune, 160 km east of Mumbai

▪ Array telescope with 30 antennas of 45 m diameter, operating at meter wavelengths

▪ Digitized baseband signals from 30 dual polarized antennas of GMRT

GMRT

▪ Supports two modes of operation :

- Interferometry (correlator)

- Array mode (beamformer)

▪ Frequency bands :

- 130 to 260 MHz

- 250 to 500 MHz

- 550 to 900 MHz

- 1050 to 1600 MHz

▪ Maximum instantaneous bandwidth : 400 MHz (Legacy GMRT = 32 MHz)

▪ Effective collecting area (2-3% of SKA) :

- 30,000 sq m at lower frequencies

- 20,000 sq m at higher frequencies
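
The two modes of operation listed above differ in how the antenna voltages are combined. As a minimal sketch (not the GMRT code; `correlate`, `beamform`, and the toy `spectra` layout are names assumed here for illustration), a correlator forms time-averaged products of every signal pair, while a beamformer adds the signals with weights and detects the power of the sum:

```python
from itertools import combinations_with_replacement


def correlate(spectra):
    """Interferometry mode: time-averaged products for every signal pair i <= j.

    spectra[s][t] is the complex voltage of signal s at time sample t
    (a single spectral channel, to keep the sketch short).
    """
    n_time = len(spectra[0])
    visibilities = {}
    for i, j in combinations_with_replacement(range(len(spectra)), 2):
        # Multiply signal i by the conjugate of signal j and average over time.
        acc = sum(a * b.conjugate() for a, b in zip(spectra[i], spectra[j]))
        visibilities[(i, j)] = acc / n_time
    return visibilities


def beamform(spectra, weights):
    """Array mode: weighted sum across signals, then detected power per sample."""
    n_time = len(spectra[0])
    return [abs(sum(w * s[t] for w, s in zip(weights, spectra))) ** 2
            for t in range(n_time)]


# Toy input: two identical signals and one in quadrature, four time samples each.
spectra = [[1 + 0j] * 4, [1 + 0j] * 4, [0 + 1j] * 4]
vis = correlate(spectra)             # 3 signals -> 3 * 4 / 2 = 6 products
beam = beamform(spectra, [1, 1, 0])  # add only the first two signals
```

With M signals the correlator produces M(M+1)/2 products, so its cost grows quadratically with the number of signals, while the beamformer's cost grows linearly.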

The Giant Metrewave Radio Telescope

A Google eye view

GMRT receiver chain and signal processing in the digital back-end

[Figure: GMRT receiver chain; signal processing in digital back-end. Image courtesy : Ajith Kumar, NCRA]

Computation requirements

Antenna signals (M = 64) → Sampler → Fourier Transform, O(N log N) → Phase Correction → MAC, M(M+1)/2

Maximum bandwidth : 400 MHz, 16k-point spectral channels

Stage               Load
Fourier Transform   3 TFlops
Phase Correction    0.1 TFlops
MAC                 6.6 TFlops
Total               ~ 10 TFlops
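
The MAC figure can be checked with simple arithmetic. The sketch below assumes 8 real floating-point operations per complex multiply-accumulate and one complex spectral sample per signal per hertz of bandwidth per second; neither assumption is stated on the slide.

```python
# Figures taken from the slide.
n_signals = 64            # M, the number of antenna signals
bandwidth_hz = 400e6      # maximum instantaneous bandwidth

# Assumptions made for this sketch (not stated on the slide):
# one complex spectral sample per hertz of bandwidth per second per signal,
# and 8 real floating-point operations per complex multiply-accumulate.
FLOPS_PER_COMPLEX_MAC = 8

n_products = n_signals * (n_signals + 1) // 2      # M(M+1)/2 signal pairs
mac_flops = n_products * bandwidth_hz * FLOPS_PER_COMPLEX_MAC

print(n_products)            # 2080
print(mac_flops / 1e12)      # 6.656
```

Under these assumptions the MAC load comes to about 6.66 TFlops, in line with the 6.6 TFlops quoted above. The channel count does not enter, because the number of spectral samples per second is fixed by the bandwidth.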

Design : Time slicing model

A 4-node example

Ant 1, Ant 2, ..., Ant 16 : digitized baseband signals of the antennas
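
The slide shows the time slicing model only as a diagram. One common reading, assumed here, is that each compute node receives a different block of time from all antennas, so every node can run the full FFT and MAC chain on its own slice. The round-robin assignment and the helper names `node_for_slice` and `schedule` are hypothetical and not taken from the GMRT code:

```python
def node_for_slice(slice_index, n_nodes):
    """Round-robin (assumed): consecutive time slices go to consecutive nodes."""
    return slice_index % n_nodes


def schedule(n_slices, n_nodes):
    """Return, for each node, the list of time slices it would process."""
    plan = {node: [] for node in range(n_nodes)}
    for s in range(n_slices):
        plan[node_for_slice(s, n_nodes)].append(s)
    return plan


plan = schedule(n_slices=8, n_nodes=4)   # the 4-node case from the slide
print(plan)   # {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
```

In such a scheme each node needs data from every antenna for its slice, which is consistent with the use of a switch for data sharing between nodes.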

Implementation

▪ 16 Dell T630 machines as Compute Nodes

▪ 16 ROACH (FPGA) boards with Atmel/e2v based ADCs developed by CASPER group, Berkeley for digitization and packetization

▪ 32 Tesla K40c GPU cards for processing

▪ 36-port Mellanox InfiniBand switch for data sharing between Compute Nodes and Host Nodes

▪ Software : C/C++ and CUDA C programming with OpenMPI and OpenMP directives

▪ Developed in collaboration with Swinburne University, Australia

[Figure: the implemented system. Image courtesy : Irappa Halagalli, NCRA]

Sample result

Image of Coma cluster (Dharam Vir Lal and Ishwar Chandra, NCRA)

▪ Legacy GMRT, 325 MHz : 350 μJy

▪ Upgraded GMRT, 300 – 500 MHz : 28 μJy

Significantly lower noise RMS and better image quality with upgraded GMRT

Computation Performance : K40

Channels   FFT (Gflops)   MAC (Gflops)
2048       620            626
4096       626            620
8192       512            574
16384      498            537

No. of antennas : 32 (dual pol)

CUDA 7.5

Motivation for next generation GPUs

▪ Adding more compute intensive applications

- Multi-beamforming

- Processing on each beam (beam steering)

- Gated correlator

- FIR filtering with many taps for narrow-band mode implementation

▪ Working GMRT system and code provide an excellent testing ground for the features of next generation GPUs

▪ Performance measured and compared on GP100 and V100

Computation performance – K40 vs GP100

CUDA 7.5, ECC off

Performance follows cuFFT benchmarks for K40 and P100

Reference for K40 benchmark : CUDA 6.5 Performance Report, September 2014

Reference for P100 benchmark : CUDA 8 Performance Overview, November 2016

No. of antennas : 32 (dual pol)

Peak Global Memory Bandwidth :

K40 – 288 GB / sec

GP100 – 732 GB / sec

Peak Performance :

K40 – 4.3 TFlops

GP100 – 9.3 TFlops

Computation performance as % of Real-time

Bandwidth : 200 MHz

No. of antennas : 32 (dual pol)

Spectral Channels : 16384

Computation performance : GP100 vs V100

GP100 on CUDA 7.5

V100 on CUDA 9.1 (using PSG cluster)

No. of antennas : 32 (dual pol)

Peak Global Memory Bandwidth :

GP100 – 732 GB / sec

V100 – 900 GB / sec

Peak Performance :

GP100 – 9.3 TFlops

V100 – 14 TFlops

Reasons behind relatively low performance of MAC

▪ Non-contiguous Global Memory access at block level (figure: MAC input data format)

▪ Low Arithmetic Intensity
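
To see why low arithmetic intensity limits MAC, the peak figures quoted on the comparison slides can be turned into a rough roofline estimate. The sketch below assumes a worst case that is not taken from the GMRT kernel: every complex multiply-accumulate (counted as 8 flops) reads two single-precision complex samples from global memory with no reuse.

```python
# Peak figures quoted on the comparison slides: (TFlops, GB/s).
peaks = {"K40": (4.3, 288.0), "GP100": (9.3, 732.0), "V100": (14.0, 900.0)}

# Assumed worst case for this sketch: 8 flops per complex multiply-accumulate,
# two single-precision complex operands (2 x 8 bytes) fetched from global
# memory each time, and no reuse through cache, shared memory, or registers.
mac_intensity = 8 / 16          # flops per byte

balances = {}   # flops per byte a kernel needs to reach peak compute
ceilings = {}   # Gflops attainable by the no-reuse MAC when memory-bound
for gpu, (tflops, gb_per_s) in peaks.items():
    balances[gpu] = tflops * 1e12 / (gb_per_s * 1e9)
    ceilings[gpu] = mac_intensity * gb_per_s
    print(f"{gpu}: needs {balances[gpu]:.1f} flops/byte, "
          f"no-reuse MAC ceiling {ceilings[gpu]:.0f} Gflops")
```

All three GPUs need roughly 13 to 16 flops per byte to reach peak compute, far above the 0.5 flops per byte of the no-reuse case, so MAC throughput depends heavily on how much data is reused once loaded.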

GPU kernel improvements

▪ MAC :

- Simplified index arithmetic

- Improved the L2 hit ratio : from less than 5% to nearly 86%

- Vectorized loads (float4) – increased ILP

- Exposed more parallelism by increasing the occupancy

- Single precision to half precision floating point – no performance gain

▪ FFT :

- Single precision to half precision floating point

MAC : Performance gain with optimizations on V100

No. of antennas : 32 (dual pol)

V100 on CUDA 9.1 (using PSG cluster)

FFT : Performance gain with half precision on V100

V100 on CUDA 9.1 (using PSG cluster)

FFT : Error analysis with half precision in power spectrum

Spectral Channels : 2048

Batch size : 128

FFT : Error analysis with half precision in phase spectrum

Spectral Channels : 2048

Batch size : 128
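
The kind of comparison behind these error plots can be sketched on a CPU. The example below is an emulation under stated assumptions, not the measurement shown on the slides: it rounds only the FFT inputs and outputs to half precision (using the `struct` format `"e"`), keeps the butterflies in double precision, uses Gaussian noise as a stand-in signal, and runs a single transform rather than a batch of 128. The helper names `to_half` and `fft` are illustrative.

```python
import cmath
import math
import random
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


random.seed(0)
n = 4096                                   # real samples -> 2048 channels kept
signal = [random.gauss(0.0, 1.0) for _ in range(n)]

reference = fft(signal)[: n // 2]
# Emulation only: inputs and outputs are rounded to half precision, while the
# arithmetic in between stays in double precision. A true FP16 FFT also rounds
# every intermediate result, so its error is expected to be larger.
rounded = fft([to_half(v) for v in signal])[: n // 2]
rounded = [complex(to_half(z.real), to_half(z.imag)) for z in rounded]

# Relative error in the power spectrum and absolute error in the phase spectrum.
power_err = sorted(abs(abs(a) ** 2 - abs(b) ** 2) / abs(b) ** 2
                   for a, b in zip(rounded, reference))
phase_err = sorted(abs(cmath.phase(a * b.conjugate()))
                   for a, b in zip(rounded, reference))
median_power_err = power_err[len(power_err) // 2]
median_phase_err = phase_err[len(phase_err) // 2]
print(median_power_err, median_phase_err)
```

Medians are reported rather than maxima because channels that happen to carry very little power inflate the relative error without saying much about the precision of the transform.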

Going forward

▪ Improving MAC using Tensor cores – potential 2x improvement

▪ Implementing the MAC optimizations and half-precision floating point FFT in the GMRT code

▪ Optimized FIR filtering routines in CUDA for narrow-band mode implementation

▪ Implementing multi-beamforming, beam steering and gated correlator
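
The slides do not describe the narrow-band mode design. A minimal sketch of the FIR filtering it refers to, assuming the mode amounts to low-pass filtering followed by decimation, is given below. The textbook Hamming-windowed sinc design, the tap count, and the function names are illustrative choices, not the planned CUDA routines.

```python
import math


def lowpass_taps(n_taps, cutoff):
    """Hamming-windowed sinc low-pass; cutoff is in cycles per sample (< 0.5)."""
    mid = (n_taps - 1) / 2
    taps = []
    for n in range(n_taps):
        x = 2 * cutoff * (n - mid)
        sinc = 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (n_taps - 1))
        taps.append(2 * cutoff * sinc * window)
    total = sum(taps)
    return [t / total for t in taps]       # unit gain at zero frequency


def filter_and_decimate(samples, taps, factor):
    """Evaluate the FIR filter only at the outputs kept after decimation.

    The taps are symmetric, so this sliding dot product equals a convolution.
    """
    n_taps = len(taps)
    return [sum(t * samples[start + k] for k, t in enumerate(taps))
            for start in range(0, len(samples) - n_taps + 1, factor)]


def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


def tone(freq, n_samples=2048):
    """Unit-amplitude cosine at freq cycles per sample."""
    return [math.cos(2 * math.pi * freq * n) for n in range(n_samples)]


taps = lowpass_taps(n_taps=129, cutoff=1 / 16)      # keep 1/8 of the band
in_band = rms(filter_and_decimate(tone(0.02), taps, factor=8))
out_of_band = rms(filter_and_decimate(tone(0.20), taps, factor=8))
print(in_band, out_of_band)
```

A tone inside the retained band passes almost unchanged, while a tone well outside it is strongly attenuated. Narrower transition bands require more taps, which is what makes this stage compute intensive.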

Acknowledgements

▪ Prof. Yashwant Gupta, Centre Director, NCRA

▪ Ajith Kumar B., Back-end group co-ordinator, GMRT, NCRA

▪ Sanjay Kudale, GMRT, NCRA

▪ Shelton Gnanaraj, GMRT, NCRA

▪ Andrew Jameson, Swinburne University, Australia

▪ Benjamin Barsdell, Swinburne University, Australia (now at NVIDIA)

▪ CASPER Group, Berkeley

▪ Digital Back-end Group, GMRT, NCRA

▪ Computer Group, GMRT, NCRA

▪ Control Room, GMRT

Thank You