Powering Real-time Radio Astronomy Signal Processing with Latest GPU Architectures
Harshavardhan Reddy Suda
NCRA, India
Vinay Deshpande
NVIDIA, India
Bharat Kumar
NVIDIA, India
What signals are we processing?
GMRT
▪ The Giant Meter-wave Radio Telescope (GMRT) is a world-class instrument for studying astrophysical phenomena at low radio frequencies
▪ Located 80 km north of Pune, 160 km east of Mumbai
▪ Array telescope with 30 antennas of 45 m diameter, operating at meter wavelengths
▪ Digitized baseband signals from the 30 dual-polarized antennas of the GMRT
GMRT
▪ Supports two modes of operation :
- Interferometry (correlator)
- Array mode (beamformer)
▪ Frequency bands :
- 130 to 260 MHz
- 250 to 500 MHz
- 550 to 900 MHz
- 1050 to 1600 MHz
▪ Maximum instantaneous bandwidth : 400 MHz (Legacy GMRT = 32 MHz)
▪ Effective collecting area (2-3% of SKA) :
- 30,000 sq m at lower frequencies
- 20,000 sq m at higher frequencies
The Giant Meter-wave Radio Telescope
A Google eye view
GMRT receiver chain : signal processing in the digital back-end
Image courtesy : Ajith Kumar, NCRA
Computation requirements
Signal chain : Sampler → Fourier Transform → Phase Correction → MAC
▪ Antenna signals : M = 64
▪ Maximum bandwidth : 400 MHz
▪ Spectral channels : 16k point
▪ Fourier Transform, O(NlogN) : ~3 TFlops
▪ Phase correction : ~0.1 TFlops
▪ MAC over M(M+1)/2 antenna pairs : ~6.6 TFlops
▪ Total : ~10 TFlops
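As a rough cross-check of these figures (a back-of-envelope sketch, not from the original slides): with M = 64 signals there are M(M+1)/2 = 2080 antenna pairs, each complex multiply-accumulate costs about 8 floating-point operations per spectral channel, and a 400 MHz band sampled at 800 Msamples/s produces one 16k-channel spectrum per 32768 samples. The snippet below reproduces the ~6.6 TFlops MAC figure under those assumptions.

```cuda
#include <stdio.h>

/* Back-of-envelope estimate of the MAC compute load.
   Assumptions (not from the slides): 8 flops per complex
   multiply-accumulate, real sampling at twice the bandwidth. */
int main(void)
{
    const double M            = 64;      /* antenna signals         */
    const double channels     = 16384;   /* spectral channels (16k) */
    const double bandwidth_hz = 400e6;   /* maximum bandwidth       */

    const double sample_rate     = 2.0 * bandwidth_hz;           /* 800 Msps */
    const double spectra_per_sec = sample_rate / (2.0 * channels);
    const double baselines       = M * (M + 1.0) / 2.0;          /* 2080     */

    const double mac_flops = baselines * channels * 8.0 * spectra_per_sec;
    printf("MAC : %.1f TFlops\n", mac_flops / 1e12);             /* ~6.7     */
    return 0;
}
```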
Design : Time slicing model
A 4-node example
Ant 1, Ant 2, ... Ant 16 : digitized baseband signals of the antennas
Implementation
▪ 16 Dell T630 machines as Compute Nodes
▪ 16 ROACH (FPGA) boards (developed by the CASPER group, Berkeley) with Atmel/e2v-based ADCs, for digitization and packetization
▪ 32 Tesla K40c GPU cards for processing
▪ 36-port Mellanox InfiniBand switch for data sharing between Compute Nodes and Host Nodes
▪ Software : C/C++ and CUDA C, using Open MPI and OpenMP directives
▪ Developed in collaboration with Swinburne University, Australia
Implementation
Image courtesy : Irappa Halagalli, NCRA
Sample result : image of the Coma cluster
▪ Legacy GMRT, 325 MHz : 350 μJy noise RMS
▪ Upgraded GMRT, 300 – 500 MHz : 28 μJy noise RMS
▪ Significantly lower noise RMS and better image quality with the upgraded GMRT
Image courtesy : Dharam Vir Lal and Ishwar Chandra, NCRA
Computation Performance : K40
Channels   FFT (GFlops)   MAC (GFlops)
2048       620            626
4096       626            620
8192       512            574
16384      498            537
No. of antennas : 32 (dual pol)
CUDA 7.5
Motivation for next generation GPUs
▪ Adding more compute-intensive applications :
- Multi-beamforming
- Processing on each beam (beam steering)
- Gated correlator
- FIR filtering with many taps for narrow-band mode implementation
▪ The working GMRT system and code provide an excellent testing ground for the features of next-generation GPUs
▪ Performance measured and compared on GP100 and V100
Computation performance – K40 vs GP100
CUDA 7.5, ECC off
Performance follows cuFFT benchmarks for K40 and P100
Reference for K40 benchmark : CUDA 6.5 Performance Report, September 2014
Reference for P100 benchmark : CUDA 8 Performance Overview, November 2016
Computation performance : K40 vs GP100
CUDA 7.5, ECC off
No. of antennas : 32 (dual pol)
Computation performance : K40 vs GP100
CUDA 7.5, ECC off
Peak Global Memory Bandwidth :
K40 – 288 GB/sec
GP100 – 732 GB/sec
Peak Performance :
K40 – 4.3 TFlops
GP100 – 9.3 TFlops
Computation performance as % of Real-time
Bandwidth : 200 MHz
No. of antennas : 32 (dual pol)
Spectral Channels : 16384
Computation performance : GP100 vs V100
GP100 on CUDA 7.5
V100 on CUDA 9.1 (using PSG cluster)
Computation performance : GP100 vs V100
GP100 on CUDA 7.5
V100 on CUDA 9.1 (using PSG cluster)
No. of antennas : 32 (dual pol)
Computation performance : GP100 vs V100
GP100 on CUDA 7.5
V100 on CUDA 9.1 (using PSG cluster)
Peak Global Memory Bandwidth :
GP100 – 732 GB/sec
V100 – 900 GB/sec
Peak Performance :
GP100 – 9.3 TFlops
V100 – 14 TFlops
Reasons behind the relatively low performance of MAC
▪ Non-contiguous Global Memory access at block level
(Diagram : MAC input data format)
▪ Low Arithmetic Intensity
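To put a number on the arithmetic-intensity point (a rough estimate, not from the original slides): each cross-multiply-accumulate reads two complex single-precision samples (16 bytes) and performs about 8 floating-point operations, i.e. roughly 0.5 flop/byte before any data reuse, whereas these GPUs need on the order of 13-16 flop/byte to reach peak (e.g. 4.3 TFlops / 288 GB/sec ≈ 15 for the K40). Without reuse in shared memory or registers, the naive MAC kernel is therefore bound by global-memory bandwidth.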
GPU kernel improvements
▪ MAC :
Simplified Index Arithmetic
Improved the L2 hit ratio : from less than 5% to nearly 86%
Vectorized loads – Increased ILP (float4) (see the sketch after this list)
Exposing more parallelism by increasing the occupancy
Single Precision to Half Precision floating point – No performance gain
▪ FFT :
Single Precision to Half Precision floating point
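To illustrate the vectorized-load idea, here is a stripped-down cross-multiply-accumulate kernel. It is only a minimal sketch under assumed names and data layout (one time slice, antenna-major complex spectra stored as interleaved re/im floats), not the actual GMRT MAC kernel; it shows how a single float4 load brings in two complex samples per antenna, giving each thread two independent MACs and more instruction-level parallelism.

```cuda
#include <cuda_runtime.h>

// Hypothetical, simplified MAC kernel with float4 vectorized loads.
// Assumed layout (not the GMRT format): spectra[ant][chan] holds
// interleaved (re, im) floats for one time slice. Launch with
// grid.y = number of baselines and enough x-threads for nchan/2;
// call once per FFT block, accumulating into vis across calls.
__global__ void mac_float4(const float4 *spectra,  // [nant][nchan/2] complex pairs
                           float2 *vis,            // [nbaselines][nchan] accumulators
                           int nant, int nchan)
{
    int pair = blockIdx.x * blockDim.x + threadIdx.x;   // channel-pair index
    int bl   = blockIdx.y;                               // baseline index
    if (pair >= nchan / 2) return;

    // Map the flat baseline index to an antenna pair (i <= j).
    int i = 0, j = bl;
    while (j >= nant - i) { j -= nant - i; ++i; }
    j += i;

    // One float4 load fetches two complex samples per antenna.
    float4 a = spectra[(size_t)i * (nchan / 2) + pair];
    float4 b = spectra[(size_t)j * (nchan / 2) + pair];

    // Complex multiply a * conj(b) for both channels in the pair.
    float2 v0 = make_float2(a.x * b.x + a.y * b.y, a.y * b.x - a.x * b.y);
    float2 v1 = make_float2(a.z * b.z + a.w * b.w, a.w * b.z - a.z * b.w);

    size_t out = (size_t)bl * nchan + 2 * pair;
    vis[out].x     += v0.x;  vis[out].y     += v0.y;
    vis[out + 1].x += v1.x;  vis[out + 1].y += v1.y;
}
```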
MAC : Performance gain with optimizations on V100
No. of antennas : 32 (dual pol)
V100 on CUDA 9.1 (using PSG cluster)
FFT : Performance gain with half precision on V100
V100 on CUDA 9.1 (using PSG cluster)
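One way to obtain a half-precision FFT is through cuFFT's extended (cufftXt) plan API with the CUDA_C_16F data type; this requires power-of-two sizes and a GPU with native FP16 support. The sketch below mirrors the sizes used in the error-analysis slides (2048-point, batch of 128); the actual GMRT plan layout may differ.

```cuda
// Compile with e.g. : nvcc -arch=sm_70 fp16_fft.cu -lcufft
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cufftXt.h>
#include <cstdio>

int main()
{
    long long n     = 2048;   // FFT length (power of two required for FP16)
    long long batch = 128;    // number of FFTs per call

    half2 *data;              // interleaved FP16 complex samples, in-place
    cudaMalloc((void **)&data, sizeof(half2) * n * batch);

    cufftHandle plan;
    cufftCreate(&plan);

    size_t workSize = 0;
    cufftResult r = cufftXtMakePlanMany(plan, 1, &n,
                                        NULL, 1, n, CUDA_C_16F,   // input layout/type
                                        NULL, 1, n, CUDA_C_16F,   // output layout/type
                                        batch, &workSize, CUDA_C_16F);
    if (r != CUFFT_SUCCESS) { printf("plan creation failed : %d\n", r); return 1; }

    // In-place forward transform; FP16 halves the global-memory traffic
    // relative to the single-precision plan.
    cufftXtExec(plan, data, data, CUFFT_FORWARD);

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```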
FFT : Error analysis with half precision in power spectrum
Spectral Channels : 2048
Batch size : 128
FFT : Error analysis with half precision in phase spectrum
Spectral Channels : 2048
Batch size : 128
Going forward
▪ Improving MAC using Tensor cores – potential 2x improvement (see the sketch after this list)
▪ Implementing the MAC optimizations and half-precision floating point FFT in the GMRT code
▪ Optimized FIR filtering routines in CUDA for narrow-band mode implementation
▪ Implementing multi-beamforming, beam steering and gated correlator
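As a pointer to what the Tensor-Core path could look like (purely illustrative, not the planned GMRT implementation), the basic building block is a warp-level 16x16x16 matrix multiply-accumulate via the WMMA API, with FP16 inputs and FP32 accumulation; a correlator kernel would tile the antenna-pair/channel space onto many such fragments and handle complex arithmetic with separate real and imaginary tiles.

```cuda
// Requires a Volta-class GPU : nvcc -arch=sm_70 wmma_tile.cu
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One 16x16x16 Tensor-Core tile, executed by a single warp
// (launch as wmma_tile<<<1, 32>>>(a, b, c)).
__global__ void wmma_tile(const half *a, const half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);          // start from zero accumulators
    wmma::load_matrix_sync(a_frag, a, 16);      // load FP16 input tiles
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // Tensor-Core MAC, FP32 accumulate
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```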
Acknowledgements
▪ Prof. Yashwant Gupta, Centre Director, NCRA
▪ Ajith Kumar B., Back-end group co-ordinator, GMRT, NCRA
▪ Sanjay Kudale, GMRT, NCRA
▪ Shelton Gnanaraj, GMRT, NCRA
▪ Andrew Jameson, Swinburne University, Australia
▪ Benjamin Barsdell, Swinburne University, Australia (now at NVIDIA)
▪ CASPER Group, Berkeley
▪ Digital Back-end Group, GMRT, NCRA
▪ Computer Group, GMRT, NCRA
▪ Control Room, GMRT
Thank You