© University of North Carolina at Charlotte 1
Chapter 9: Green Computing Platforms for Biomedical Systems
Vinay Vijendra Kumar Lakshmi, Ashish Panday, Arindam Mukherjee, and Bharat S Joshi
University of North Carolina at Charlotte
HANDBOOK ON GREEN INFORMATION AND COMMUNICATION SYSTEMS
Overview
Green Computing in Biomedical Field
Survey of Green Computing Platforms
Analysis of popular Biomedical Applications
Design Framework for Biomedical Embedded Processors
Survey of Simulation tools for Design Space Exploration
Development and Characterization of Benchmark Suite
Design Space Exploration and Optimization Techniques of Embedded Microarchitectures
Conclusion
Future Research Areas
Green Computing in Biomedical Field
Computing in biomedical systems can be classified into three categories:
Implantable device level
Portable/Embedded platform level
Server level
Characteristics of Biomedical Systems
Power consumption
Renewable energy resource: energy harvesting
Heat dissipation
Minimizing area
Cost
Performance
Survey of Green Computing Platforms
Implantable devices monitor the physiological parameters of the human body.
Examples: pacemakers, cardioverter-defibrillators, cochlear implants.
Most implantable devices are inactive most of the time and activate on a stimulus from the body.
Figure: Configuration of a brain implant or brain-machine interface (BMI)
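Why staying inactive most of the time matters for energy can be seen from a time-averaged power estimate. The sketch below is illustrative only: the function name and all power figures are hypothetical and are not taken from any device in this survey.

```python
def average_power_uw(active_uw: float, sleep_uw: float, duty_cycle: float) -> float:
    """Time-averaged power of a device that is active for a fraction
    `duty_cycle` of the time and asleep for the rest."""
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be in [0, 1]")
    return duty_cycle * active_uw + (1.0 - duty_cycle) * sleep_uw


# Hypothetical figures, for illustration only (microwatts).
always_on = average_power_uw(active_uw=100.0, sleep_uw=1.0, duty_cycle=1.0)
mostly_idle = average_power_uw(active_uw=100.0, sleep_uw=1.0, duty_cycle=0.01)
print(always_on, mostly_idle)
```

With these assumed numbers, a device that wakes for 1% of the time draws roughly one fiftieth of the always-on power, which is why stimulus-driven activation is attractive for implants.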
Embedded Platforms
Physiological monitoring systems
Recognition systems
Figure: Wearable ultra-low-power biomedical signal processor, CoolBio™.
Power Management in Intel ATOM
ATOM includes a power management control block, a power management block, a clock synthesizer, and a few programmable registers. Together they reduce noise, achieve low quiescent current, and switch voltage and frequency dynamically in real time between multiple performance modes, varying the core operating voltage and processor speed to save power and improve performance.
Figure: Power management in Intel ATOM
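The benefit of switching voltage and frequency together follows from the standard CMOS dynamic power relation, P = a * C * V^2 * f. The sketch below applies that relation to two operating points. The capacitance, voltages, and frequencies are hypothetical placeholders, not ATOM specifications.

```python
def dynamic_power_w(c_eff_farads: float, voltage_v: float, freq_hz: float,
                    activity: float = 1.0) -> float:
    """Standard CMOS dynamic power estimate: P = activity * C * V^2 * f."""
    return activity * c_eff_farads * voltage_v ** 2 * freq_hz


# Hypothetical operating points (voltage in volts, frequency in hertz).
MODES = {"high": (1.1, 1.6e9), "low": (0.8, 0.8e9)}
C_EFF = 1e-9  # assumed effective switched capacitance, in farads

power = {name: dynamic_power_w(C_EFF, v, f) for name, (v, f) in MODES.items()}
ratio = power["low"] / power["high"]
print(f"low/high dynamic power ratio: {ratio:.3f}")
```

Because voltage enters quadratically, halving the frequency while also lowering the voltage cuts dynamic power by much more than half in this assumed setting.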
Servers
The Oracle WebLogic Server 11g software was used to demonstrate the performance of the Avitek Medical Records sample application. A configuration using SPARC T3-1B and SPARC Enterprise M5000 servers from Oracle showed excellent scaling across configurations and doubled the performance of the previous-generation SPARC blade.
Server           Processor                            Memory  Maximum TPS
SPARC T3-1B      1 x SPARC T3, 1.65 GHz, 16 cores     128 GB  28,156
SPARC T3-1B      1 x SPARC T3, 1.65 GHz, 8 cores      128 GB  14,030
Sun Blade T6320  1 x UltraSPARC T2, 1.4 GHz, 8 cores  64 GB   13,386
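The scaling and generation-over-generation claims can be checked directly from the table. This short calculation uses only the TPS figures listed above; the variable names are ours.

```python
# Maximum TPS from the table, keyed by (server, active cores).
tps = {
    ("SPARC T3-1B", 16): 28156,
    ("SPARC T3-1B", 8): 14030,
    ("Sun Blade T6320", 8): 13386,
}

# Throughput gain from doubling the active cores on the same blade.
core_scaling = tps[("SPARC T3-1B", 16)] / tps[("SPARC T3-1B", 8)]

# Throughput gain of the full T3-1B over the previous-generation blade.
generation_gain = tps[("SPARC T3-1B", 16)] / tps[("Sun Blade T6320", 8)]

print(f"{core_scaling:.2f} {generation_gain:.2f}")  # prints "2.01 2.10"
```

Doubling the cores from 8 to 16 gives about 2.01x the throughput (near-linear scaling), and the 16-core T3-1B delivers about 2.10x the throughput of the Sun Blade T6320.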
Cell Processor
Analysis of Biomedical Applications
Figure: Flowchart for choosing the algorithm-architecture combination best suited for an application. Nodes shown in the flowchart:
Start
1. Research standard definitions and processes involved in solving the desired kernel
2. Optimal time/space complexities? (decision)
3. Identify different solving techniques to optimize kernel
4. Choose parallel algorithm
5. Choose architecture
6. Is architecture-algorithm pair optimal? (decision)
7. Is algorithm acceptable? (decision)
8. Is architecture acceptable? (decision)
9. Change algorithm
10. Change architecture
Stop
Pairwise Correlation
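Pearson product-moment correlation coefficient (PPMCC):
$$r = \frac{\operatorname{cov}(X, Y)}{S_X S_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{S_X S_Y}$$
Another way to interpret PPMCC:
$$r = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} \, \sqrt{n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}$$
X: {x1, x2, x3, ..., xn}
Y: {y1, y2, y3, ..., yn}
r: coefficient of correlation
Cov(X,Y): covariance of X and Y
SX: standard deviation of X
SY: standard deviation of Y
µX: expectation of X
µY: expectation of Y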
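As a minimal sketch, the sum-based form of r can be evaluated from five running sums over the two series. Python is our choice for illustration and the function name `pearson_r` is hypothetical; the slides do not show the authors' code.

```python
import math


def pearson_r(x, y):
    """Correlation coefficient r from the sum-based (computational) formula."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_xx - sum_x ** 2)
                   * math.sqrt(n * sum_yy - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when a series is constant")
    return numerator / denominator


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson_r([1, 2, 3], [3, 2, 1]))
```

The first call uses perfectly linearly related series and the second a perfectly inverse pair, so the results should be close to +1 and -1 respectively (up to floating-point rounding).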
i, j: ith, jth channel, where 1≤i,j≤m
x(i,k), x(j,k): kth sample from the ith, jth channel, where 1≤i,j≤m, i≠j, and 1≤k≤n
r(i,j): correlation coefficient between the ith and jth channel, where 1≤i,j≤m
$$r_{(i,j)} = \frac{n \sum_{k=1}^{n} x_{(i,k)} x_{(j,k)} - \sum_{k=1}^{n} x_{(i,k)} \sum_{k=1}^{n} x_{(j,k)}}{\sqrt{n \sum_{k=1}^{n} x_{(i,k)}^2 - \left(\sum_{k=1}^{n} x_{(i,k)}\right)^2} \, \sqrt{n \sum_{k=1}^{n} x_{(j,k)}^2 - \left(\sum_{k=1}^{n} x_{(j,k)}\right)^2}}$$
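A serial version of the pairwise correlation (PWC) kernel can be sketched as follows. The per-channel sums and denominator terms are computed once and reused for every pair, so the inner work is one cross-product sum over n samples for each of the m(m-1)/2 channel pairs. The function name and data layout (a list of m equal-length channels) are assumptions for illustration, and channels are indexed from 0 here rather than from 1 as in the formula.

```python
import math
from itertools import combinations


def pairwise_correlation(channels):
    """Return {(i, j): r} for every channel pair i < j (0-based indices).

    Assumes every channel has the same number of samples and that no
    channel is constant (a constant channel gives a zero denominator).
    """
    m = len(channels)
    n = len(channels[0])
    if any(len(ch) != n for ch in channels):
        raise ValueError("all channels must have the same number of samples")
    # Per-channel terms, computed once: sum of samples and the
    # denominator factor sqrt(n * sum(x^2) - (sum x)^2).
    sums = [sum(ch) for ch in channels]
    norms = [math.sqrt(n * sum(v * v for v in ch) - s * s)
             for ch, s in zip(channels, sums)]
    r = {}
    for i, j in combinations(range(m), 2):
        cross = sum(a * b for a, b in zip(channels[i], channels[j]))
        r[(i, j)] = (n * cross - sums[i] * sums[j]) / (norms[i] * norms[j])
    return r


demo = pairwise_correlation([[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]])
for pair, value in sorted(demo.items()):
    print(pair, round(value, 6))
```

Each pair is independent of the others, which is what makes the kernel a natural candidate for parallelisation across cores.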
Choosing initial algorithm and architecture
Initially, the pairwise correlation (PWC) kernel is written in serial fashion for the Xeon Dual Core processor. Running VTune gives the following statistics:
               CPI   L1I_MISS %  L1D_MISS %  L2_MISS %
Serial Code    0.84  22.98       91.54       60.77
Table 1: Performance of serial code on Intel Xeon Dual Core processor
The code is then parallelised with OpenMP and analysed again, giving the improved values shown here:
                     CPI   L1I_MISS %  L1D_MISS %  L2_MISS %
Parallel Code (OMP)  0.67  27.84       89.23       25.67
Table 4.3: Performance of OpenMP code on Intel Xeon Dual Core processor
Implementation on Cell using the Ring Algorithm gives a speed-up of approx. 56 compared with the serial version on Intel Xeon.
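The slides do not spell out the ring algorithm, so the following is only a generic ring-style schedule for all-pairs work, not necessarily the authors' Cell implementation. It assumes the channels are split into p blocks, one per worker; at step s each worker pairs its own block with the block that started s positions further round the ring. The function name and the choice of p = 8 in the usage line are hypothetical.

```python
def ring_schedule(p):
    """For each step, list the (resident block, visiting block) pairs processed.

    Step 0 handles pairs inside each worker's own block. At step s > 0,
    worker w processes its block against block (w + s) mod p. Going
    p // 2 steps round the ring covers every unordered pair of blocks.
    """
    steps = []
    for s in range(p // 2 + 1):
        pairs = []
        for w in range(p):
            other = (w + s) % p
            if 2 * s == p and w >= other:
                # Halfway round an even-sized ring each pair shows up
                # twice (once from each side), so keep only one copy.
                continue
            pairs.append((w, other))
        steps.append(pairs)
    return steps


for step, pairs in enumerate(ring_schedule(8)):
    print(step, pairs)
```

Every worker is busy in every step except the last step of an even-sized ring, where half of them are idle; the total is p(p+1)/2 block pairs, each handled exactly once.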
Design Framework for Biomedical Embedded Processors
Figure: Design flow for biomedical embedded processors. Nodes shown in the flowchart:
Start
1. Development of Application Specific Benchmark Suite
2. Performance analysis of benchmarks on simulator tools
3. Explore the design space. Arrive at a suitable embedded architecture
4. Select simulator tool
5. Run the optimizer to arrive at better performance and power values
6. Are latency and throughput requirements met? (decision)
7. Is the architecture optimum? (decision)
Stop
Survey of Simulation tools for Design Space Exploration
Features                MV5           M5            CASPER        Sesc
Full-System Simulation  ✘             ✔             ✘             ✘
System-call Emulation   ✔             ✔             ✘             ✔
I/O Disk                ✔             ✔             ✘             ✔
ISA                     Alpha         Various       SPARC         MIPS
Emulated thread API     ✔             ✔             ✔             ✔
Category                Event Driven  Cycle Driven  Trace Driven  Event Driven
IO Core                 ✔             ✔             ✔             ✔
Multithreaded core      ✔             ✔             ✘             ✘
OOO Core                ✔             ✔             ✔             ✔
SIMD Core               ✔             ✔             ✘             ✔
Development and Characterization of Benchmark Suite
A good multicore benchmark will identify bottlenecks
in the multicore system design including memory
and I/O bottlenecks, computational bottlenecks,
and real-time bottlenecks*. In addition, a good
multicore benchmark will identify synchronization
problems where code and data blocks are split,
distributed to various compute engines for
processing, and then the results are reassembled.
*S. Gal-On, M. Levy, S. Leibson, "How to Survive the Quest for a Useful Multicore Benchmark", ECN Magazine, Dec 2009
Performance analysis of the benchmark
Analysis of PWC on various simulator tools
CASPER
Figure: CPI per core on CASPER (CPI vs. D$ size in bytes)
Figure: Average power per core on CASPER (average power in uW vs. D$ size in bytes)
Analysis of the parallel version of the code (per-CPU results) on MV5 with various configurations
Configuration  Frequency  Number of SIMD CPUs  Number of OOO CPUs  No. of HW+SW threads  Benchmark used  Host memory usage  Simulation time (seconds)
fractal_smp    1 GHz      4                    0                   64+2                  FILTER          1.217 MB           0.019065
fractal_smp    1 GHz      4                    0                   64+2                  PPPC            1.207 MB           0.001364
Config_hetero  1 GHz      2                    2                   32+2                  FILTER PPPC     2.234 MB           0.070888
Config_hetero  1 GHz      4                    4                   32+2                  FILTER PPPC     2.255 MB           1050.42

Run                          Total energy of CPU (mJ)  Total leakage energy of CPU (mJ)  Clock active energy (uJ)  Total cache energy (mJ)  D$ miss rate  I$ miss rate  Floating ALU active energy (mJ)  Integer ALU active energy (mJ)
fractal_smp on FILTER        26.358100                 1.713209                          0.000956                  2.186035                 0.257         0.162         1.0877987                        1.785665
fractal_smp on PPPC          0.010644                  0.002118                          0.000188                  0.003291                 2.195         0.078         0.0004001                        0.000844
Config_hetero FILTER + PPPC  29.918695                 1.543702                          0.000292                  12.097728                2.876         0.018         2.241064                         4.005570
Config_hetero FILTER + PPPC  32.2689733                4.1982526                         0.000182                  0.0081944                1.747         0.000001      1.0877992                        1.7854455
Table: MV5 simulation
Design Space Exploration and Optimization Techniques of Embedded Microarchitectures
Different approaches used for design space exploration of multicore processor architectures, and optimization algorithms:
Artificial Neural Networks (ANN)
Fast Genetic Algorithms (used in CASPER)
Genetically Programmed Response Surfaces (GPRS, used on MV5)
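To make the genetic-algorithm approach concrete, here is a toy search over a tiny design space of data-cache size and core count. It is a generic textbook-style genetic algorithm (selection, crossover, mutation), not the fast genetic algorithm used in CASPER. The design space, the cost model, and every constant in it are hypothetical stand-ins for the performance and power figures a simulator would report.

```python
import random

CACHE_KB = [4, 8, 16, 32, 64]   # hypothetical D$ size options
CORES = [1, 2, 4, 8]            # hypothetical core-count options


def cost(design):
    """Hypothetical delay-times-power cost; lower is better."""
    cache_kb, cores = design
    delay = 100.0 / cores + 200.0 / cache_kb   # more resources, less delay
    power = 1.0 * cores + 0.05 * cache_kb      # more resources, more power
    return delay * power


def genetic_search(generations=30, pop_size=8, seed=0):
    """Return the best (cache_kb, cores) design found."""
    rng = random.Random(seed)
    pop = [(rng.choice(CACHE_KB), rng.choice(CORES)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        parents = pop[: pop_size // 2]          # selection: keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = (a[0], b[1])                # crossover: cache from a, cores from b
            if rng.random() < 0.3:              # mutation of the cache gene
                child = (rng.choice(CACHE_KB), child[1])
            if rng.random() < 0.3:              # mutation of the core gene
                child = (child[0], rng.choice(CORES))
            children.append(child)
        pop = parents + children
    return min(pop, key=cost)


best = genetic_search()
print(best, round(cost(best), 3))
```

A space of 20 designs could simply be enumerated; the point of a genetic search is that the same loop still works when the space is far too large to simulate exhaustively, with `cost` replaced by simulator runs.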
Conclusion
Methodologies were presented for the characterization of biomedical applications, both for ultra-low-energy, low-heat embedded implantable devices and for low-power but high-performance embedded computing platforms.
For the PWC benchmark, the computational complexity is O(mn^2); the OpenMP version gave a CPI of 0.67 and an L2 cache miss percentage of 25.67 on the Intel Xeon Dual Core processor.
The procedure for design space exploration of processor microarchitectures using existing simulation tools and optimizers was outlined.
A heterogeneous configuration with two IO and two OOO cores consumes less energy per CPU (29.918 mJ) compared to a homogeneous configuration on MV5's Alpha architecture simulation.
Future Research Areas
Development of better, different instruction set architectures (ISAs)
Corresponding cross-compilers to generate optimized executables for the simulators
Upgrading existing simulation platforms to support full-system mode with real-time kernel libraries, to account for the latency and throughput of real-life applications
Development of advanced real-time operating systems and scheduling algorithms to schedule the various applications on different heterogeneous cores and meet hard real-time constraints
Thanks for your attention!