
© University of North Carolina at Charlotte

Chapter 9: Green Computing Platforms for Biomedical Systems

Vinay Vijendra Kumar Lakshmi, Ashish Panday, Arindam Mukherjee, and Bharat S Joshi

University of North Carolina at Charlotte

HANDBOOK ON GREEN INFORMATION AND COMMUNICATION SYSTEMS


Overview

Green Computing in Biomedical Field
Survey of Green Computing Platforms
Analysis of Popular Biomedical Applications
Design Framework for Biomedical Embedded Processors
Survey of Simulation Tools for Design Space Exploration
Development and Characterization of Benchmark Suite
Design Space Exploration and Optimization Techniques of Embedded Microarchitectures
Conclusion
Future Research Areas


Green Computing in Biomedical Field

Computing in biomedical systems can be classified into three categories:

Implantable device level
Portable/embedded platform level
Server level


Characteristics of Biomedical Systems

Power consumption
Renewable energy resource – energy harvesting
Heat dissipation
Minimizing area
Cost
Performance


Survey of Green Computing Platforms

Implantable devices monitor the physiological parameters of the human body. Examples include pacemakers, cardioverter-defibrillators, and cochlear implants. Most implantable devices are inactive most of the time and activate in response to a stimulus from the body.

Configuration of a brain implant or brain-machine interface (BMI)


Embedded Platforms

Physiological monitoring systems
Recognition systems

Wearable ultra-low power biomedical signal processor, CoolBio™.


Power Management in Intel ATOM

ATOM includes a power management control block, a power management block, a clock synthesizer, and a few programmable registers. Together these reduce noise, achieve low quiescent current, and switch voltage and frequency dynamically in real time between multiple performance modes. Varying the core operating voltage and processor speed in this way saves power and improves performance.

Figure: Power management in Intel ATOM
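To see why switching voltage and frequency together saves power, a minimal sketch can use the standard first-order CMOS relation for dynamic power, P = a·C·V²·f. The mode names, voltages, frequencies, and capacitance below are hypothetical illustration values, not ATOM specifications.

```python
def dynamic_power(c_eff, voltage, freq_hz, activity=1.0):
    """First-order CMOS dynamic power in watts: P = a * C * V^2 * f."""
    return activity * c_eff * voltage ** 2 * freq_hz

# Hypothetical performance modes as (core voltage in V, clock in Hz).
# These numbers are assumptions for illustration, not ATOM data.
MODES = {
    "high_performance": (1.10, 1.6e9),
    "low_power": (0.80, 0.8e9),
}

C_EFF = 1.0e-9  # assumed effective switched capacitance in farads

def mode_power(mode):
    voltage, freq_hz = MODES[mode]
    return dynamic_power(C_EFF, voltage, freq_hz)

if __name__ == "__main__":
    high = mode_power("high_performance")
    low = mode_power("low_power")
    print(f"high performance: {high:.3f} W")
    print(f"low power:        {low:.3f} W")
    print(f"low/high ratio:   {low / high:.2f}")
```

Because voltage enters quadratically, halving the clock while also lowering the voltage cuts dynamic power by well over half in this model.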


Servers

The Oracle WebLogic Server 11g software was used to demonstrate the performance of the Avitek Medical Records sample application. A configuration using SPARC T3-1B and SPARC Enterprise M5000 servers from Oracle scaled well across different configurations and doubled the performance of the previous-generation SPARC blade.

Server            Processor                              Memory   Maximum TPS
SPARC T3-1B       1 x SPARC T3, 1.65 GHz, 16 cores       128 GB   28,156
SPARC T3-1B       1 x SPARC T3, 1.65 GHz, 8 cores        128 GB   14,030
Sun Blade T6320   1 x UltraSPARC T2, 1.4 GHz, 8 cores    64 GB    13,386
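The scaling and generation-over-generation claims can be checked directly from the Maximum TPS column. The short sketch below only divides the figures from the table; the dictionary keys are labels chosen here for readability.

```python
# Maximum TPS figures copied from the server table.
max_tps = {
    "SPARC T3-1B, 16 cores": 28156,
    "SPARC T3-1B, 8 cores": 14030,
    "Sun Blade T6320, 8 cores": 13386,
}

def ratio(a, b):
    """Throughput of configuration a relative to configuration b."""
    return max_tps[a] / max_tps[b]

# Doubling the core count on the same blade.
core_scaling = ratio("SPARC T3-1B, 16 cores", "SPARC T3-1B, 8 cores")
# Full T3-1B blade against the previous-generation blade.
generation_gain = ratio("SPARC T3-1B, 16 cores", "Sun Blade T6320, 8 cores")

if __name__ == "__main__":
    print(f"8 -> 16 cores on SPARC T3-1B: {core_scaling:.2f}x")
    print(f"T3-1B (16 cores) vs T6320:    {generation_gain:.2f}x")
```

Both ratios come out slightly above 2, which matches the near-linear scaling and the doubling over the previous generation.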


Cell Processor


Analysis of Biomedical Applications

Flowchart for choosing the algorithm-architecture combination best suited for an application

Start
1. Research standard definitions and processes involved in solving the desired kernel
2. Optimal time/space complexities? (Yes/No)
3. Identify different solving techniques to optimize kernel
4. Choose parallel algorithm
5. Choose architecture
6. Is architecture-algorithm pair optimal? (Yes/No)
7. Is algorithm acceptable? (Yes/No)
8. Is architecture acceptable? (Yes/No)
9. Change algorithm
10. Change architecture
Stop


Pairwise Correlation

Another way to interpret PPMCC

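$$
r = \frac{\operatorname{cov}(X,Y)}{S_X S_Y} = \frac{E\left[(X-\mu_X)(Y-\mu_Y)\right]}{S_X S_Y}
$$

$$
r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\,\sqrt{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}
$$

X : {x1, x2, x3, ..., xn}
Y : {y1, y2, y3, ..., yn}
r : coefficient of correlation
Cov(X,Y) : covariance of X and Y
SX : standard deviation of X
SY : standard deviation of Y
µX : expectation of X
µY : expectation of Y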
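As a quick sanity check that the covariance form and the sum form of r describe the same quantity, the following sketch evaluates both on a small made-up data set. The function names and the sample values are illustrative assumptions; population (divide-by-n) covariance and standard deviation are used so that the two forms match term by term.

```python
import math

def r_definition(x, y):
    """r = cov(X, Y) / (S_X * S_Y), using population moments."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    s_x = math.sqrt(sum((a - mu_x) ** 2 for a in x) / n)
    s_y = math.sqrt(sum((b - mu_y) ** 2 for b in y) / n)
    return cov / (s_x * s_y)

def r_sums(x, y):
    """r from raw sums: no means are needed, only running totals."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
    return num / den

if __name__ == "__main__":
    x = [1.0, 2.0, 4.0, 7.0]   # made-up samples
    y = [2.0, 3.0, 3.5, 9.0]
    print(r_definition(x, y), r_sums(x, y))
```

The sum form is the one that suits a streaming or embedded implementation, since it needs only five accumulators per pair of signals.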


i, j : ith, jth channel, where 1 ≤ i, j ≤ m
x(i,k), x(j,k) : kth sample from ith, jth channel, where 1 ≤ i, j ≤ m, i ≠ j, and 1 ≤ k ≤ n
r(i,j) : correlation coefficient between ith, jth channel, where 1 ≤ i, j ≤ m

$$
r_{(i,j)} = \frac{n\sum_{k=1}^{n} x_{(i,k)} x_{(j,k)} - \sum_{k=1}^{n} x_{(i,k)} \sum_{k=1}^{n} x_{(j,k)}}{\sqrt{n\sum_{k=1}^{n} x_{(i,k)}^2 - \left(\sum_{k=1}^{n} x_{(i,k)}\right)^2}\,\sqrt{n\sum_{k=1}^{n} x_{(j,k)}^2 - \left(\sum_{k=1}^{n} x_{(j,k)}\right)^2}}
$$
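A minimal serial sketch of the multi-channel formula is given below. It is an illustration, not the authors' benchmark code: the function name is an assumption, and channels are indexed from 0 rather than 1. The per-channel sums and sums of squares are computed once and reused for every pair, so only the cross term is recomputed per (i, j).

```python
import math

def pairwise_correlation(channels):
    """Return {(i, j): r} for every channel pair i < j (0-based indices).

    channels is a list of m equal-length sample lists of length n.
    """
    m = len(channels)
    n = len(channels[0])
    # Per-channel totals, shared by all pairs that include the channel.
    sums = [sum(ch) for ch in channels]
    sq_sums = [sum(v * v for v in ch) for ch in channels]

    r = {}
    for i in range(m):
        for j in range(i + 1, m):
            cross = sum(a * b for a, b in zip(channels[i], channels[j]))
            num = n * cross - sums[i] * sums[j]
            den = (math.sqrt(n * sq_sums[i] - sums[i] ** 2)
                   * math.sqrt(n * sq_sums[j] - sums[j] ** 2))
            r[(i, j)] = num / den
    return r

if __name__ == "__main__":
    demo = [
        [1.0, 2.0, 3.0, 4.0],   # made-up channel data
        [2.0, 4.0, 6.0, 8.0],
        [4.0, 3.0, 2.0, 1.0],
    ]
    for pair, value in pairwise_correlation(demo).items():
        print(pair, round(value, 6))
```

Each pair is independent of every other pair, which is what makes the kernel a natural candidate for parallelisation.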


Choosing initial algorithm and architecture

Initially, the pairwise correlation (PWC) code is written in serial fashion for the Intel Xeon Dual Core processor. Profiling it with VTune gives the following statistics.

Table 1: Performance of serial code on Intel Xeon Dual Core processor

              CPI    L1I_MISS %   L1D_MISS %   L2_MISS %
Serial code   0.84   22.98        91.54        60.77

The code is then parallelised with OpenMP and analysed again. CPI and the L2 miss percentage both improve, as shown below.

Table 4.3: Performance of OpenMP code on Intel Xeon Dual Core processor

                      CPI    L1I_MISS %   L1D_MISS %   L2_MISS %
Parallel code (OMP)   0.67   27.84        89.23        25.67

Implementation on Cell using the Ring Algorithm gives a speed-up of approximately 56 compared with the serial version on Intel Xeon.
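The parallelisation step relies on the fact that each channel pair can be processed independently. The sketch below shows that work partitioning with a Python thread pool; it is an assumption-laden illustration of the idea only, not the OpenMP code or the Cell Ring Algorithm, and Python threads will not reproduce the reported speed-ups. The names `pearson_r` and `parallel_pwc` are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def pearson_r(x, y):
    """Correlation coefficient of two equal-length sample lists."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    den = math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
    return (n * sxy - sx * sy) / den

def parallel_pwc(channels, workers=2):
    """Distribute independent (i, j) channel pairs across worker threads."""
    pairs = list(combinations(range(len(channels)), 2))

    def work(pair):
        i, j = pair
        return pair, pearson_r(channels[i], channels[j])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves input order; each result is ((i, j), r).
        return dict(pool.map(work, pairs))

if __name__ == "__main__":
    demo = [
        [1.0, 2.0, 3.0, 4.0],   # made-up channel data
        [2.0, 4.0, 6.0, 8.0],
        [4.0, 3.0, 2.0, 1.0],
    ]
    print(parallel_pwc(demo))
```

No locking is needed because the workers only read the shared channel data and each writes a distinct result.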


Design Framework for Biomedical Embedded Processors

Design flow for Bio-medical Embedded Processors

Start
1. Development of Application Specific Benchmark Suite
2. Performance analysis of benchmarks on simulator tools
3. Explore the design space. Arrive at suitable embedded architecture
4. Select simulator tool
5. Run the optimizer to arrive at better performance and power values
6. Are latency and throughput requirements met? (Yes/No)
7. Is the architecture optimum? (Yes/No)
Stop


Survey of Simulation tools for Design Space Exploration

Features                 MV5            M5             CASPER         Sesc
Full-System Simulation   ✘              ✔              ✘              ✘
System-call Emulation    ✔              ✔              ✘              ✔
I/O Disk                 ✔              ✔              ✘              ✔
ISA                      Alpha          Various        SPARC          MIPS
Emulated thread API      ✔              ✔              ✔              ✔
Category                 Event Driven   Cycle Driven   Trace Driven   Event Driven
IO Core                  ✔              ✔              ✔              ✔
Multithreaded core       ✔              ✔              ✘              ✘
OOO Core                 ✔              ✔              ✔              ✔
SIMD Core                ✔              ✔              ✘              ✔


Development and Characterization of Benchmark Suite

A good multicore benchmark will identify bottlenecks in the multicore system design including memory and I/O bottlenecks, computational bottlenecks, and real-time bottlenecks*. In addition, a good multicore benchmark will identify synchronization problems where code and data blocks are split, distributed to various compute engines for processing, and then the results are reassembled.

*S. Gal-On, M. Levy, S. Leibson, "How to Survive the Quest for a Useful Multicore Benchmark", ECN Magazine, Dec 2009


Performance analysis of the benchmark

Analysis of PWC on various simulator tools

CASPER

Figure: CPI per core on CASPER (axes: CPI vs. D$ size in bytes)
Figure: Average power per core on CASPER (axes: average power in µW vs. D$ size in bytes)


Analysis of Parallel version of the code (per CPU results) on MV5 with various configurations

Configuration   Frequency   Number of SIMD CPUs   Number of OOO CPUs   No. of HW+SW threads   Benchmark used   Host memory usage   Simulation time (seconds)
fractal_smp     1 GHz       4                     0                    64+2                   FILTER           1.217 MB            0.019065
fractal_smp     1 GHz       4                     0                    64+2                   PPPC             1.207 MB            0.001364
Config_hetero   1 GHz       2                     2                    32+2                   FILTER PPPC      2.234 MB            0.070888
Config_hetero   1 GHz       4                     4                    32+2                   FILTER PPPC      2.255 MB            1050.42

Metric                             Fractal smp on FILTER   Fractal smp on PPPC   Config_hetero FILTER + PPPC   Config_hetero FILTER + PPPC
Total energy of CPU (mJ)           26.358100               0.010644              29.918695                     32.2689733
Total leakage energy of CPU (mJ)   1.713209                0.002118              1.543702                      4.1982526
Clock active energy (µJ)           0.000956                0.000188              0.000292                      0.000182
Total cache energy (mJ)            2.186035                0.003291              12.097728                     0.0081944
D$ miss rate                       0.257                   2.195                 2.876                         1.747
I$ miss rate                       0.162                   0.078                 0.018                         0.000001
Floating ALU active energy (mJ)    1.0877987               0.0004001             2.241064                      1.0877992
Integer ALU active energy (mJ)     1.785665                0.000844              4.005570                      1.7854455

MV5 Simulation
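One way to read the MV5 energy figures is as a breakdown: what fraction of each run's total CPU energy is leakage. The sketch below does only that arithmetic on the values reported above. The row labels are chosen here for illustration, and the two Config_hetero rows are distinguished by position only, on the assumption that they appear in the same order as in the configuration table.

```python
# (label, total CPU energy in mJ, total leakage energy in mJ), per-CPU results
# copied from the MV5 tables. Labels are illustrative.
runs = [
    ("fractal_smp, FILTER", 26.358100, 1.713209),
    ("fractal_smp, PPPC", 0.010644, 0.002118),
    ("Config_hetero (first row), FILTER + PPPC", 29.918695, 1.543702),
    ("Config_hetero (second row), FILTER + PPPC", 32.2689733, 4.1982526),
]

def leakage_share(total_mj, leakage_mj):
    """Fraction of total CPU energy attributed to leakage."""
    return leakage_mj / total_mj

shares = {label: leakage_share(total, leak) for label, total, leak in runs}

if __name__ == "__main__":
    for label, share in shares.items():
        print(f"{label}: {100 * share:.1f}% leakage")
```

On these figures the leakage fraction ranges from roughly 5% to 20% of total CPU energy, depending on the run.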


Design Space Exploration and Optimization Techniques of Embedded Microarchitectures

Different approaches used for design space exploration of multicore processor architectures, and the corresponding optimization algorithms:

Artificial Neural Networks (ANN)
Fast Genetic Algorithms (used in CASPER)
Genetically Programmed Response Surfaces (GPRS, used on MV5)
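To give a feel for how a genetic search walks a microarchitecture design space, here is a toy sketch. It is a plain textbook-style genetic algorithm, not the Fast Genetic Algorithm used in CASPER, and everything in it is an assumption made for illustration: the two design parameters, their allowed values, and the cost function, which in a real flow would come from simulator runs.

```python
import random

# Hypothetical design space: (number of cores, D-cache size in KB).
CORES = [1, 2, 4, 8]
DCACHE_KB = [4, 8, 16, 32, 64]

def cost(design):
    """Toy energy-delay style cost; NOT derived from CASPER or MV5 data."""
    cores, dcache = design
    delay = 100.0 / cores + 200.0 / dcache   # more cores/cache: faster
    energy = 5.0 * cores + 0.5 * dcache      # more cores/cache: more energy
    return delay * energy

def random_design(rng):
    return (rng.choice(CORES), rng.choice(DCACHE_KB))

def crossover(a, b, rng):
    """Child takes one parameter from each parent."""
    return (a[0], b[1]) if rng.random() < 0.5 else (b[0], a[1])

def mutate(design, rng, rate=0.2):
    cores, dcache = design
    if rng.random() < rate:
        cores = rng.choice(CORES)
    if rng.random() < rate:
        dcache = rng.choice(DCACHE_KB)
    return (cores, dcache)

def genetic_search(generations=20, pop_size=8, seed=0):
    """Return (best design in initial population, best design at the end)."""
    rng = random.Random(seed)
    population = [random_design(rng) for _ in range(pop_size)]
    initial_best = min(population, key=cost)
    for _ in range(generations):
        population.sort(key=cost)
        parents = population[: pop_size // 2]   # elitism: best half survives
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return initial_best, min(population, key=cost)

if __name__ == "__main__":
    start, best = genetic_search()
    print("initial best:", start, cost(start))
    print("final best:  ", best, cost(best))
```

Because the best half of each generation is carried over unchanged, the best cost never gets worse from one generation to the next; the search is not guaranteed to find the global optimum.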


Conclusion

Methodologies were presented for characterizing bio-medical applications, both for ultra-low-energy, low-heat embedded implantable devices and for low-power, high-performance embedded computing platforms. For the PWC benchmark, the computational complexity is O(mn²); the OpenMP version achieved a CPI of 0.67 and an L2 cache miss percentage of 25.67 on the Intel Xeon Dual Core processor.

The procedure for design space exploration of processor micro-architectures using existing simulation tools and optimizers was outlined. A heterogeneous configuration with two IO and two OOO cores consumes less energy per CPU (29.918 mJ) than a homogeneous configuration in MV5's Alpha architecture simulation.


Future Research Areas

Development of better, different instruction set architectures (ISAs)
Corresponding cross-compilers to generate optimized executables for the simulators
Upgrading existing simulation platforms to support full-system mode with real-time kernel libraries, to account for the latency and throughput of real-life applications
Development of advanced real-time operating systems and scheduling algorithms to schedule the various applications on different heterogeneous cores and meet hard real-time constraints


Thanks for your attention!
