IEEE 2012 SC Companion: High Performance Computing, Networking, Storage and Analysis (SCC), Salt Lake City, UT (2012.11.10-2012.11.16)
Imaging through Cluttered Media Using Electromagnetic Interferometry on a Hardware-Accelerated
High-Performance Cluster
Esam El-Araby, Ozlem Kilic, and Vinh Dang
Department of Electrical Engineering and Computer Sciences
[email protected], [email protected], and [email protected]
[Figure: Speedup versus 12-thread CPU]
Image size for a single node: D0 = 16,000 x 16,000 pixels
All computations are performed using double precision
Fixed-time model (Gustafson’s Law)
Conclusions
As a first effort, to the best of our knowledge, terahertz interferometric imaging through cluttered media has been efficiently implemented using a GPU-accelerated cluster.
The GPU implementation with efficient load balancing outperforms the CPU implementation in terms of speedup and scalability.
The experimental results also show that our implementations have favorable scalability features for larger cluster configurations.
References
[1] O. Kilic, A. Smith, E. El-Araby, and V. Dang, “Interferometric imaging through random media using GPU,” The Applied Computational Electromagnetics Society (ACES), Williamsburg, VA, USA, April 2011.
[2] O. Kilic and A. Smith, “Imaging Through Random Media,” European Conference on Antennas and Propagation (EuCAP), Rome, Italy, April 2011.
[3] J. F. Federici, D. Gary, B. Schulkin, F. Huang, H. Altan, R. Barat, and D. Zimdars, “Terahertz Imaging Using an Interferometric Array,” Appl. Phys. Lett., vol. 83, p. 2477, 2004.
[4] K. P. Walsh, B. Schulkin, D. Gary, J. F. Federici, R. Barat, and D. Zimdars, “Terahertz Near-Field Interferometric and Synthetic Aperture Imaging,” Proc. SPIE, vol. 5411, p. 9, 2003.
[Figures: Execution Time of Cluster Implementations; Efficiency; Scalability; Simulation Setup; Detector Geometry; Target Image]
Motivation
Detecting concealed objects, such as weapons, behind cluttered media is essential for security applications.
Terahertz frequencies are commonly used for imaging at security checkpoints because they are non-ionizing, and therefore safe for humans, and because they penetrate clothing while still reflecting off metallic objects [1], [2].
Interferometric images of target objects are constructed from the complex correlation function of the electric fields received from the medium of interest at each pair of sensors in a detector array.
This imaging technique is computationally intensive, which makes it impractical for real-time security applications.
Objectives and Approach
Provide a first effort, to the best of our knowledge, to efficiently implement high-performance terahertz interferometric imaging of targets behind cluttered media on HPC platforms.
Exploit the capabilities of a 13-node GPU-accelerated cluster using the CUDA and MVAPICH2 environments.
Maximize system resource utilization through efficient load balancing.
Introduction
Interferometric images are constructed using complex correlations of electric field intensities obtained from all possible pair combinations in a detector array [3], [4]:
$$\sigma_E(\xi,\eta) = \sum_{l=1}^{N(N-1)/2} A_l \cos\big(k\,\ldots\big)$$
where $\sigma_E$ : the time average of the electric field intensity detected for each pixel.
Interferometric Imaging through Cluttered Media
Target behind cluttered media: the electric field received at the detectors has three contributions: direct scattering, indirect scattering, and volume scattering.
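The pairwise summation described in the Introduction can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the cosine phase `k * (u * xi + v * eta)` is borrowed from the standard far-field interferometric formulation and is not recoverable from this poster, the amplitudes are taken as real numbers, and the array radius and all function names are hypothetical.

```python
import math
from itertools import combinations

C = 299_792_458.0  # speed of light in m/s


def wavenumber(frequency_hz):
    """Free-space wavenumber k = 2*pi*f/c."""
    return 2.0 * math.pi * frequency_hz / C


def circular_array(n_detectors, radius_m):
    """Detector (x, y) positions evenly spaced on a circle (radius is hypothetical)."""
    return [
        (radius_m * math.cos(2.0 * math.pi * i / n_detectors),
         radius_m * math.sin(2.0 * math.pi * i / n_detectors))
        for i in range(n_detectors)
    ]


def baselines(positions):
    """Separation vector (u, v) for every unordered detector pair."""
    return [(xb - xa, yb - ya)
            for (xa, ya), (xb, yb) in combinations(positions, 2)]


def pixel_value(amplitudes, pair_baselines, k, xi, eta):
    """Sum A_l * cos(phase_l) over all detector pairs for one pixel.

    ASSUMPTION: phase_l = k * (u_l * xi + v_l * eta). The poster does not
    state the cosine argument; this form is used for illustration only.
    """
    return sum(
        a * math.cos(k * (u * xi + v * eta))
        for a, (u, v) in zip(amplitudes, pair_baselines)
    )


# Example with the poster's 45 detectors and 300 GHz signal.
detectors = circular_array(45, radius_m=0.05)  # radius is a made-up value
pairs = baselines(detectors)                   # N(N-1)/2 detector pairs
k = wavenumber(300e9)
center = pixel_value([1.0] * len(pairs), pairs, k, xi=0.0, eta=0.0)
```

Every pixel requires a sum over all N(N-1)/2 pairs, so the cost grows with both the image size and the square of the detector count, which is why the poster describes the technique as computationally intensive.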
GPU-Accelerated Cluster Implementations
GPU implementation without load balancing:
The entire image is equally divided into many partial images, which are distributed to the corresponding computing nodes.
GPUs do all computations.
CPUs: data transfer and work distribution.
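The equal division used by the implementation without load balancing can be sketched as follows. The poster does not say along which dimension the image is split, so partitioning by rows is an assumption here, and `split_rows` is a hypothetical helper name.

```python
def split_rows(total_rows, n_nodes):
    """Divide image rows as evenly as possible among computing nodes.

    Returns a list of (start, stop) half-open row ranges, one per node.
    ASSUMPTION: the image is partitioned by rows; the poster only says it
    is "equally divided".
    """
    base, extra = divmod(total_rows, n_nodes)
    ranges = []
    start = 0
    for rank in range(n_nodes):
        # The first `extra` nodes take one additional row each.
        count = base + (1 if rank < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


# Example: 10 rows over 3 nodes.
example = split_rows(10, 3)
```

Each node would then hand its row range to its GPU, while the CPU only moves data and distributes work.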
GPU implementation with load balancing:
Utilize the multi-core CPUs in addition to the GPUs by assigning a part of the total workload to each CPU core.
Load balancing: The total workload is distributed among the computing resources in proportion to their relative speeds/throughputs:
$$t_{CPU} = t_{GPU}$$
$$\frac{D_{CPU}}{B_{CPU}} = \frac{D_{GPU}}{B_{GPU}}$$
$$D_0 = (n_{threads} - 1)\,D_{CPU} + D_{GPU}$$
$$D_{CPU} = \frac{B_{CPU}}{B_{GPU} + (n_{threads} - 1)\,B_{CPU}}\,D_0, \qquad D_{GPU} = \frac{B_{GPU}}{B_{GPU} + (n_{threads} - 1)\,B_{CPU}}\,D_0$$
where
D0 is the partial image assigned to the computing node.
DCPU and DGPU are the portions of D0 processed by a CPU core and by the GPU, respectively, and tCPU and tGPU are the corresponding processing times.
BCPU and BGPU are the processing throughputs of a CPU core and the GPU, respectively.
nthreads is the total number of parallel threads.
Execution model: The algorithm first sets up the parallel environment, then transfers data from CPU memory space to GPU memory space, processes the data using the accelerator(s), transfers the resultant image back to the CPU, and finally releases the allocated hardware resources.
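The throughput-proportional split described under "Load balancing" can be checked numerically with a short sketch. It assumes that one of the nthreads threads drives the GPU while the remaining nthreads - 1 threads each process a CPU share, which is how the load-balancing relations are read here. The throughput values in the example are made up, and `balance_workload` is a hypothetical name.

```python
def balance_workload(d0, b_cpu, b_gpu, n_threads):
    """Split a node's workload d0 between CPU cores and the GPU.

    ASSUMPTION: (n_threads - 1) threads compute on CPU cores and one
    thread drives the GPU, so that
        d0 = (n_threads - 1) * d_cpu + d_gpu
    and every resource finishes at the same time:
        d_cpu / b_cpu == d_gpu / b_gpu.
    Returns (d_cpu, d_gpu): the share per CPU core and the GPU share.
    """
    total_throughput = b_gpu + (n_threads - 1) * b_cpu
    d_cpu = d0 * b_cpu / total_throughput
    d_gpu = d0 * b_gpu / total_throughput
    return d_cpu, d_gpu


# Example with made-up throughputs: the GPU is 9x faster than one CPU core.
d_cpu, d_gpu = balance_workload(d0=1000.0, b_cpu=1.0, b_gpu=9.0, n_threads=12)
```

With these illustrative numbers each of the 11 CPU worker threads receives 50 units and the GPU receives 450, so both kinds of resource finish after the same 50 time units.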
System configuration:
GPU cluster
    Hardware: 13 computing nodes, each with 1x Tesla Fermi M2090 (1.3 GHz, 5.375 GB/GPU, 512 cores/GPU, 1,331 GFLOPS/GPU); InfiniBand interconnection
    Software: NVIDIA’s CUDA 4.0, MVAPICH2
Application parameters:
Detectors
    Number of detectors: 45
    Detector geometry: Circular
Cluttered media (scatterers)
    Particle density: 10^6 per m^3
    Particle type: Point source
    Particle size: 0.02 mm
Target object
    Shape: Sphere
    Size: 1.9 mm (radius)
    Number of targets: 1
    Distance to detector array: 1 m
Signal
    Frequency: 300 GHz
Experimental Results
Spherical metallic target placed behind the random media, located at the (0,0) point:
[Figure: CPU-based result; GPU-based result]
Metrics for Evaluation
Total execution time:
$$T_{total} = T_{setup} + T_{in} + T_{process} + T_{out} + T_{release}$$
Speedup: performance gained when using multiple GPU-accelerated computing nodes in reference to a single CPU-based node:
$$S(N_{PE}) = \frac{T_{total}^{CPU}(1,\ N_{PE} D_0)}{T_{total}^{PE}(N_{PE},\ N_{PE} D_0)}$$
Scalability: normalized speedup:
$$\text{Scalability}(N_{PE}) = \frac{S(N_{PE})}{S(1)}$$
Hardware efficiency: compares the execution time measured through experiments to the expected/theoretical performance:
$$E(N_{PE},\ N_{PE} D_0) = \frac{T_{total}^{th}(1,\ D_0)}{T_{total}^{meas}(N_{PE},\ N_{PE} D_0)} \times 100\%$$
where NPE : number of processing elements
D0 : the image size for a single PE
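The evaluation metrics can be computed from measured timings along the following lines. This is a minimal sketch: all function names are hypothetical, the timing values are invented for illustration and are not results from the poster, and the efficiency is taken as the ratio of theoretical to measured execution time expressed as a percentage.

```python
def total_time(t_setup, t_in, t_process, t_out, t_release):
    """Total execution time as the sum of the five execution-model phases."""
    return t_setup + t_in + t_process + t_out + t_release


def speedup(t_single_cpu_node, t_cluster):
    """Speedup of the cluster over a single CPU-based node on the same
    (scaled) problem size."""
    return t_single_cpu_node / t_cluster


def scalability(speedup_n, speedup_1):
    """Speedup on N processing elements normalized by the speedup on one."""
    return speedup_n / speedup_1


def efficiency_percent(t_theoretical, t_measured):
    """Hardware efficiency: theoretical over measured execution time, in %."""
    return 100.0 * t_theoretical / t_measured


# Invented timings (seconds), for illustration only.
t_meas = total_time(t_setup=0.5, t_in=1.0, t_process=10.0, t_out=0.75, t_release=0.25)
s_n = speedup(t_single_cpu_node=1250.0, t_cluster=t_meas)
scal = scalability(s_n, speedup_1=10.0)
eff = efficiency_percent(t_theoretical=10.0, t_measured=t_meas)
```

Under the fixed-time (Gustafson) model the problem size grows with the number of processing elements, so an ideal cluster would keep the measured time equal to the single-element theoretical time and the efficiency at 100%.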