[ieee 2012 sc companion: high performance computing, networking, storage and analysis (scc) - salt...

1
Imaging through Cluttered Media Using Electromagnetic Interferometry on a Hardware-Accelerated High-Performance Cluster Esam El-Araby, Ozlem Kilic, and Vinh Dang Department of Electrical Engineering and Computer Sciences [email protected], [email protected], and [email protected] As a first effort, to the best of our knowledge, terahertz interferometric imaging through cluttered media has been efficiently implemented using a GPU-accelerated cluster. The GPU implementation with efficient load balancing outperforms that of CPU implementation in terms of speedup and scalability. The experimental results also show that our implementations have favorable scal- ability features for larger cluster configurations. Conclusions Image size for a single node as D 0 = 16,000x16,000 pixels All computations are performed using double precision Fixed-time model (Gustafson’s Law) [1] O. Kilic, A. Smith, E. El-Araby, and V. Dang, “Interferometric imaging through random media using GPU,” The Applied Computational Electromagnetics Society (ACES) , Williamsburg, VA, USA, April 2011. [2] O. Kilic and A. Smith, “Imaging Through Random Media,” European Conference on Antennas and Propagation (EuCAP) , Rome, Italy, April 2011. [3] J. F. Frederici, D. Gray, B. Schulkin, F. Huang, H. Altan, R. Barat, and D. Zimdars, “Terahertz Imaging Using an Interferometric Array,” Appl. Phys. Lett., vol. 83, p. 2477, 2004. [4] K. P. Walsh, B. Schulkin, D. Gary, J. F. Federici, R. Barat, and D. Zimdars, “Terahertz Near -Field Interferometric and Synthetic Aperture Imaging,” Proc. SPIE, vol. 5411, p. 9, 2003. References Execution Time of Cluster Implementations Efficiency Scalability Motivation Detecting concealed objects such as weapons behind cluttered media is essential for security applications. Terahertz frequencies are usually employed for imaging in security checkpoints due to their safe non-ionizing properties for humans and their ability to penetrate clothing while still reflecting off metallic objects [1],[2]. Interferometric images of target objects are constructed based on the complex cor- relation function of the received electric fields from the medium of interest at each pair combination of sensors in a detector array. This imaging technique requires intensive computations which make it impractical for real time security applications. Objectives and Approach Provide a first effort, to the best of our knowledge, to efficiently implement and achieve a high performance of terahertz interferometric imaging of targets behind cluttered media using HPC platforms. Exploit the capabilities of a 13-node GPU-accelerated cluster using CUDA and MVAPICH2 environments. Maximize system resource utilization through efficient load balancing. Introduction Interferometric images are constructed using complex correlations of electric field intensities obtained from all possible pair combinations in a detector array [3], [4] where E : the time average of the electric field intensity detected for each pixel. Target behind cluttered media: the electric field received at detectors have three contributions: direct scattering, indirect scattering, and volume scattering. Interferometric Imaging through Cluttered Media ( 1)/2 1 , cos NN E l l l l l A k GPU-Accelerated Cluster Implementations GPU without load bal- ancing implementa- tion: The entire image is equally divided into many partial images and distributed to the corresponding com- puting nodes GPUs do all computa- tions CPUs: data transfer and work distribution GPU with load balancing implementation: Utilize multi-core CPUs in addition to the GPUs by assigning to each CPU core a part of the total workload. Load balancing: The total workload is distributed among the computing resources proportional to their relative speeds / throughputs. where D 0 be the partial image assigned to the computing node. B CPU and B GPU are the processing through- put of a CPU core and GPU, respectively n threads is the total number of parallel threads. Execution model: The algo- rithm first sets up the parallel environment, then transfers data from CPU memory space to GPU memory space, proc- esses data using accelerator(s), transfers the resultant image back to the CPU and finally releases the allocated hardware resources. 0 0 0 1 , 1 , 1 CPU GPU CPU GPU CPU GPU threads CPU GPU CPU CPU GPU threads CPU GPU GPU GPU threads CPU t t D D B B D n D D B D D B n B B D D B n B System configuration: Application parameters: \ Spherical metallic target put behind the random media, located at the (0,0) point: Experimental Results GPU Cluster Hardware 13 computing nodes, each has 1xTesla Fermi M2090, 1.3GHz, 5.375GB/GPU, 512 cores/GPU, 1,331GFLOPs/GPU, Infiniband interconnection Software NVIDIA’s CUDA 4.0, MVAPICH2 Detectors Number of detectors 45 Detector geometry Circular Cluttered media (Scatterers) Particle density 10 6 per m 3 Particle type Point source Particle size 0.02 mm Target object Shape Sphere Size 1.9 mm (radius) Number of targets 1 Distance to detector array 1 m Signal Frequency 300GHz CPU-based result GPU-based result

Upload: vinh

Post on 11-Apr-2017

213 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: [IEEE 2012 SC Companion: High Performance Computing, Networking, Storage and Analysis (SCC) - Salt Lake City, UT (2012.11.10-2012.11.16)] 2012 SC Companion: High Performance Computing,

Imaging through Cluttered Media Using Electromagnetic Interferometry on a Hardware-Accelerated

High-Performance Cluster

Esam El-Araby, Ozlem Kilic, and Vinh Dang

Department of Electrical Engineering and Computer Sciences

[email protected], [email protected], and [email protected]

Speedup versus 12-thread CPU

As a first effort, to the best of our knowledge, terahertz interferometric imaging

through cluttered media has been efficiently implemented using a GPU-accelerated

cluster.

The GPU implementation with efficient load balancing outperforms that of CPU

implementation in terms of speedup and scalability.

The experimental results also show that our implementations have favorable scal-

ability features for larger cluster configurations.

Conclusions

Image size for a single node as D0 = 16,000x16,000 pixels

All computations are performed using double precision

Fixed-time model (Gustafson’s Law)

[1] O. Kilic, A. Smith, E. El-Araby, and V. Dang, “Interferometric imaging through random media using

GPU,” The Applied Computational Electromagnetics Society (ACES) , Williamsburg, VA, USA, April

2011.

[2] O. Kilic and A. Smith, “Imaging Through Random Media,” European Conference on Antennas and

Propagation (EuCAP) , Rome, Italy, April 2011.

[3] J. F. Frederici, D. Gray, B. Schulkin, F. Huang, H. Altan, R. Barat, and D. Zimdars, “Terahertz Imaging

Using an Interferometric Array,” Appl. Phys. Lett., vol. 83, p. 2477, 2004.

[4] K. P. Walsh, B. Schulkin, D. Gary, J. F. Federici, R. Barat, and D. Zimdars, “Terahertz Near-Field

Interferometric and Synthetic Aperture Imaging,” Proc. SPIE, vol. 5411, p. 9, 2003.

References

Execution Time of Cluster

Implementations

Efficiency Scalability

Simulation Setup Detector Geometry

Target Image

Motivation

Detecting concealed objects such as weapons behind cluttered media is essential

for security applications.

Terahertz frequencies are usually employed for imaging in security checkpoints

due to their safe non-ionizing properties for humans and their ability to penetrate

clothing while still reflecting off metallic objects [1],[2].

Interferometric images of target objects are constructed based on the complex cor-

relation function of the received electric fields from the medium of interest at each

pair combination of sensors in a detector array.

This imaging technique requires intensive computations which make it impractical

for real time security applications.

Objectives and Approach

Provide a first effort, to the best of our knowledge, to efficiently implement and

achieve a high performance of terahertz interferometric imaging of targets behind

cluttered media using HPC platforms.

Exploit the capabilities of a 13-node GPU-accelerated cluster using CUDA and

MVAPICH2 environments.

Maximize system resource utilization through efficient load balancing.

Introduction

Interferometric images are constructed using complex correlations of electric field

intensities obtained from all possible pair combinations in a detector array [3], [4]

where E : the time average of the electric field intensity detected for each pixel.

Target behind cluttered media: the electric field received at detectors have three

contributions: direct scattering, indirect scattering, and volume scattering.

Interferometric Imaging through Cluttered Media

( 1)/2

1

, cosN N

E l l l l

l

A k

GPU-Accelerated Cluster Implementations

GPU without load bal-

ancing implementa-

tion:

The entire image is

equally divided into

many partial images

and distributed to the

corresponding com-

puting nodes

GPUs do all computa-

tions

CPUs: data transfer

and work distribution

GPU with load balancing

implementation:

Utilize multi-core CPUs

in addition to the GPUs

by assigning to each

CPU core a part of the

total workload.

Load balancing: The to ta l work load i s distributed among the computing resources proportional to their r e l a t i v e s p e ed s / throughputs.

where

D0 be the partial image assigned to the

computing node.

BCPU and BGPU are the processing through-

put of a CPU core and GPU, respectively

nthreads is the total number of parallel

threads.

Execution model: The algo-

rithm first sets up the parallel

environment, then transfers

data from CPU memory space

to GPU memory space, proc-

esses data using accelerator(s),

transfers the resultant image

back to the CPU and finally

releases the allocated hardware

resources.

0

0

0

1

,1

,1

CPU GPU

CPU GPU

CPU GPU

threads CPU GPU

CPUCPU

GPU threads CPU

GPUGPU

GPU threads CPU

t t

D D

B B

D n D D

BD D

B n B

BD D

B n B

System configuration:

Application parameters:

\

Spherical metallic target put behind the random media, located at the (0,0) point:

Experimental Results

GPU Cluster

Hardware

13 computing nodes, each has 1xTesla Fermi M2090,

1.3GHz, 5.375GB/GPU, 512 cores/GPU,

1,331GFLOPs/GPU, Infiniband interconnection

Software NVIDIA’s CUDA 4.0, MVAPICH2

Detectors Number of detectors 45

Detector geometry Circular

Cluttered media

(Scatterers)

Particle density 106 per m3

Particle type Point source

Particle size 0.02 mm

Target object

Shape Sphere

Size 1.9 mm (radius)

Number of targets 1

Distance to detector array 1 m

Signal Frequency 300GHz

CPU-based result GPU-based result

where NPE : number of processing elements

D0 : the image size for a single PE

Metrics for Evaluation

0

0

00

0

1

1

(1 )( , )

( , )

total setup in process out release

CPUtotal PE

PE PEtotal PE PE

PEPE

thtotal

PE PE meastotal PE PE

T T T T T T

T ,N DS N

T N ,N D

S NN

S

T ,DE N N D (%)

T N N D

Total execution time:

Speedup: performance gained

when using multiple GPU-

accelerated computing nodes in

reference to a single CPU-based

node

Scalability: normalized speedup

Hardware efficiency: compares the

execution time measured through

experiments to the expected/

theoretical performance.