

Mapping SMAC Algorithm onto GPU

Student: Zhengjie Lu

Supervisor: Dr. Ir. Bart Mesman

Ir. Yifan He

Prof. Dr. Ir. Richard Kleihorst


Contents

1. Background
1.1 SMAC algorithm
1.2 GPU programming
2. Implementation
2.1 General Structure
2.2 SMAC on CPU
2.3 SMAC on GPU
3. Experiment
3.1 Experiment Environment
3.2 Experiment Setup
3.3 Experiment Result
3.3.1 GPU improvement
3.3.2 Linear execution-time model
4. Roofline Model Analysis
4.1 Roofline Model
4.2 Application
5. Conclusion and Future Work
Acknowledgement
Appendix


1. Background

1.1 SMAC algorithm

SMAC is short for the "Simplified Method for Atmospheric Correction", which is used for computing the atmospheric correction of satellite measurements in the solar spectrum. It is popular in remote sensing applications because it is several hundred times faster than more detailed radiative transfer models like 5S [3]. Figure 1.1 shows a black-box model of SMAC: 9 input parameters are taken in and a single output is generated.

Fig. 1.1 SMAC black-box model. Inputs: the nine floats sza, sva, vza, vva, taup550, uh2o, uo3, airPressure and r_toa. Output: the float r_surfRecycle.
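For concreteness, the black box can be written down as a plain interface. The sketch below is ours, not part of the original program: the struct and function names are hypothetical, and only the parameter names come from figure 1.1.

    // Hypothetical interface for the SMAC black box of figure 1.1.
    struct SmacInput {
        float sza, sva, vza, vva;   // solar and viewing angles
        float taup550;              // aerosol optical depth at 550 nm
        float uh2o, uo3;            // water vapour and ozone contents
        float airPressure;
        float r_toa;                // top-of-atmosphere reflectance
    };

    // One input vector of 9 floats yields one output float.
    float smac(const SmacInput& in);   // returns r_surfRecycle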

SMAC is computationally fast among its peers, but it still takes a considerable amount of time to process the large data sizes that are common in remote sensing applications. Figure 1.2 gives the profile of SMAC when it processes a data size of 231240 bytes on the CPU. The file I/O operations dominate in this case (about 75%), while the CPU computation time is also significant (about 25%). Since the file I/O performance can be improved by introducing faster hard disks, the CPU computation will eventually become the bottleneck. This motivates us to map SMAC onto a commercial GPU (i.e. an NVIDIA graphics card) and see how much computational performance improvement we can achieve.

Fig. 1.2 Profiles of the original SMAC program


1.2 GPU programming

GPU programming was introduced into the field of scientific computation after its success in accelerating computer graphics processing. The hardware advantage of many parallel processing cores per GPU (normally at least 32) makes it capable of dealing with massive data. The disadvantages of GPU programming are that (1) programmers are required to understand the hardware (especially the memory access patterns) in order to use the GPU efficiently, and (2) the pipeline penalty is dramatic when branch predictions miss. Normally, an efficient program is organized in such a way that the GPU is responsible for the massive mathematical computation while the CPU takes charge of the logical operations and control.

NVIDIA develops a GPU programming technology named "Compute Unified Device Architecture" (CUDA) for its own graphics card products, and it is the most popular one in state-of-the-art GPU programming. The CUDA-supported GPUs are listed on NVIDIA's official website [1]; they cover NVIDIA's latest products, with more than 100 cores per GPU. In this report, we refer to CUDA programming as GPU programming for NVIDIA graphics cards.

It is essential to explain the NVIDIA GPU hardware architecture before we discuss CUDA programming, because CUDA programming is essentially a collection of rules for operating the hardware in the most efficient way. Figure 1.3 shows an overview of the NVIDIA GeForce 8800GT GPU. Every 8 stream processors (SPs) are organized together in a stream multiprocessor (SM), and 14 SMs constitute the main body of the GPU. Inside each SM there are 8192 registers and a shared memory of 16384 bytes; the shared memory is used for local communication among the 8 SPs. A global memory is connected to all SMs for global communication. It should be pointed out that access to the global memory is rather slow while access to the shared memory is fast. This indicates that we should use the shared memory much more than the global memory to achieve better performance.

Fig. 1.3 NVIDIA 8800GT architecture

The basic concept in CUDA programming is single-instruction-multiple-threads (SIMT), which means that all active threads perform an identical instruction in each execution step [2]. Each active thread is assigned to a unique SP so that physically parallel threading is achieved. Each thread also has its own registers to keep its state, and threads can communicate with each other through the shared memory.

The second concept in CUDA programming is the thread and the block. Each block consists of multiple threads, as shown in figure 1.4, and the number of threads per block is limited by the physically available number of registers and the shared memory size. Each block is assigned to an SM, inside which 8 SPs are integrated. A single block can only be assigned to a single SM, while a single SM can hold many blocks.

Fig. 1.4 Block and threads; Fig. 1.5 Stream execution

The third concept in CUDA programming is the warp, which is the basic thread scheduling unit. 32 threads in a block are organized as a warp and then simultaneously assigned to the block's corresponding SM by the scheduler. If the number of threads in a block is not a multiple of 32, dummy threads are appended to the block to round the thread count up to a multiple of 32.

The fourth concept in CUDA programming is concurrent copy and execution, the so-called "stream" execution. The input data is broken down into several segments of the same length. The data segments (so-called data streams) are then transferred from the CPU memory to the GPU memory one by one. A data stream can be processed by the GPU kernel as soon as it has been transferred completely, without waiting for the completion of the other data-stream transfers. An explanation of the stream execution is given in figure 1.5, and a minimal sketch follows below.
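A minimal sketch of this pattern, assuming a __global__ kernel named process, page-locked (pinned) host buffers h_in/h_out and device buffers d_in/d_out; none of these names come from the report:

    // Concurrent copy and execution: while the kernel works on stream i,
    // the transfer of stream i+1 can proceed (requires pinned host memory).
    void runStreamed(const float* h_in, float* h_out,
                     float* d_in, float* d_out, size_t n) {
        const int nStreams = 4;               // number of data streams
        const size_t seg = n / nStreams;      // equal-length segments
        cudaStream_t streams[nStreams];
        for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&streams[i]);

        for (int i = 0; i < nStreams; ++i) {
            size_t off = i * seg;
            cudaMemcpyAsync(d_in + off, h_in + off, seg * sizeof(float),
                            cudaMemcpyHostToDevice, streams[i]);
            process<<<28, 128, 0, streams[i]>>>(d_in + off, d_out + off, (int)seg);
            cudaMemcpyAsync(h_out + off, d_out + off, seg * sizeof(float),
                            cudaMemcpyDeviceToHost, streams[i]);
        }
        cudaDeviceSynchronize();              // wait for all streams to finish
        for (int i = 0; i < nStreams; ++i) cudaStreamDestroy(streams[i]);
    }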

The fifth concept in CUDA programming is memory access coalescence. This means that accesses to the global memory should be organized in terms of aligned segments of 16 or 32 words, and the addressing pattern must be aligned to those segment boundaries; an illustration follows below.
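As an illustration (our own, not from the report), the two kernels below differ only in their addressing pattern; the first coalesces, the second does not:

    // Adjacent threads touch adjacent words: the warp's accesses coalesce
    // into a few aligned 16/32-word segments.
    __global__ void copyCoalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Adjacent threads touch words `stride` apart: the same warp now needs
    // many separate memory transactions. (`in` must hold n*stride floats.)
    __global__ void copyStrided(const float* in, float* out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i * stride];
    }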

As a short conclusion, the program has to be mapped onto the NVIDIA GPU hardware following these CUDA rules.


2. Implementation

2.1 General Structure

The SMAC algorithm consists of 14 steps, as shown in figure 2.1. The first step, a "data filter", is a conditional branch through which only valid data passes to the later computations. Each computation relies only on the input of its predecessors, as shown in the data dependency graph in figure 2.2. All computations in SMAC are arithmetic calculations, including trigonometric functions, exponential functions, etc. Several computations contain if-else conditions, and these branches are replaced with equivalent logical expressions when they are mapped onto the GPU, as sketched below. The implementation of SMAC on CPU is programmed in ANSI C++, while the one on GPU is programmed in C++ and C.
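To illustrate the replacement of a branch with an equivalent logical expression, consider the sketch below. The condition and values are made up for the example; only the technique is the one applied to the SMAC steps.

    // if-else version: threads of a warp that take different sides serialize.
    __device__ float stepBranchy(float x) {
        if (x > 0.0f) return expf(-x);
        else          return 1.0f;
    }

    // Equivalent logical expression: both sides are computed and blended by
    // the predicate p (0.0f or 1.0f), so every thread follows the same path.
    __device__ float stepBranchFree(float x) {
        float p = (x > 0.0f) ? 1.0f : 0.0f;
        return p * expf(-x) + (1.0f - p) * 1.0f;
    }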

Fig. 2.1 Overview of the SMAC kernel: Start → parameters setup → "valid vector?" filter (Yes/No) → calculation steps 1 through 12.


Fig. 2.2 Data dependency graph of SMAC: the parameters setup feeds calculation step 1, and each of the calculation steps 1 through 12 depends only on the outputs of its predecessors.

2.2 SMAC on CPU

A single thread is employed as the execution model of SMAC on CPU, in which the SMAC kernel reads through all the input and then generates the final results. One input vector of 9 floating-point numbers produces exactly one output of 1 floating-point number. No data dependencies exist between different input vectors, nor between the outputs. The complete execution model is shown in figure 2.3 and sketched below.

The data flow in this case is quite simple, as shown in figure 2.4. Both the input data and the coefficients are read from files on the hard disk into the CPU memory. The CPU then takes them into its registers and writes the final results back into the CPU memory.
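A minimal sketch of this execution model, reusing the hypothetical smac function and input struct from section 1.1:

    // Single-threaded CPU model: n independent input vectors, n outputs.
    void runSmacOnCpu(const SmacInput* vectors, float* r_surf, int n) {
        for (int i = 0; i < n; ++i)
            r_surf[i] = smac(vectors[i]);   // iterations are independent
    }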


Fig. 2.3 Execution model of SMAC on CPU: the SMAC kernel consumes the input vectors vector[0] … vector[n-1], each holding the nine floats of figure 1.1, and produces the outputs r_surfRecycle[0] … r_surfRecycle[n-1].

Fig. 2.4 Data flow of SMAC on CPU

The original SMAC program employs 3 classes: (1) the SmacAlgorithm class, (2) the Coefficients class and (3) the CoefficientsFile class. The SmacAlgorithm class functions as the kernel in which the SMAC algorithm is fully implemented, while the other two manage the access to the coefficients file. An additional SimData class is now included to manage the input and output data. The relations among all the classes are shown in figure 2.5.

Because the validity of the input data can be determined as soon as it is read in, the "data filter" of the SMAC kernel can be moved into the SimData class instead. This both saves memory and reduces the processing time in the kernel, and it allows the GPU to get rid of the conditional branch when SMAC is mapped onto it.

The flow chart in figure 2.6 explains the implementation procedure. The coefficients and the input data are read first and then passed to the SmacAlgorithm instance for computation. The computational results are collected by the SimData instance, which is the output parameter of the SmacAlgorithm instance.


Fig. 2.5 Program structure of SMAC on CPU: the SmacAlgorithm class (algorithm execution), the Coefficients and CoefficientsFile classes (file operations on the satellite sensors' coefficients file) and the SimData class (I/O interface to the satellite data file and the earth surface reflectance file).

Fig. 2.6 Flow chart of SMAC on CPU: ENTRY → Coefficients::setCoefficients() → SimData::readData() → SmacAlgthm::SmacAlgthm() → SmacAlgthm::run() → EXIT

2.3 SMAC on GPU

The execution model of SMAC on GPU benefits from multiple threads. Each GPU thread is an instance of the SMAC kernel, so that multiple input vectors can be processed simultaneously, as shown in figure 2.7. This is the main benefit of employing the GPU.
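A minimal sketch of this model, reusing the hypothetical input struct from section 1.1 and assuming a __device__ version of the per-vector smac function:

    // One GPU thread = one SMAC kernel instance = one input vector.
    __global__ void smacKernel(const SmacInput* vectors, float* r_surf, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            r_surf[i] = smac(vectors[i]);   // independent per-thread work
    }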

Fig. 2.7 Execution model of SMAC on GPU: the input vectors Vector[0] … Vector[n-1] are distributed over parallel SMAC kernel instances 0, 1, 2, …, which together produce the outputs r_surfRecycle[0] … r_surfRecycle[n-1].

The data flow of SMAC on GPU is quite different from the one on CPU, as shown in figure 2.8. The input data needs to be transferred from the CPU memory to the GPU memory, and the results are copied back in the reverse direction. The constant memory of the GPU is employed as a cache for the frequently used coefficients, which are accessed by all SMs.
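A sketch of this coefficient caching is shown below; the array name and size are our assumptions, since the report does not give them:

    #include <cuda_runtime.h>

    #define NUM_COEFFS 64                     // assumed size, for illustration
    __constant__ float c_coeffs[NUM_COEFFS];  // cached and broadcast to all SMs

    void uploadCoefficients(const float* h_coeffs) {
        // One host-to-device copy before the kernel launches; all threads
        // then read the coefficients through the constant cache.
        cudaMemcpyToSymbol(c_coeffs, h_coeffs, NUM_COEFFS * sizeof(float));
    }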


Fig. 2.8 Data flow of SMAC on GPU: coefficients[] and input_data[] are read from the hard disk into the CPU memory; a second copy of input_data[] (and the resulting output_data[]) resides in the GPU global memory and a second copy of coefficients[] in the GPU constant memory, from where the data reaches the GPU registers.

The program structure of SMAC on GPU is based on the one on CPU, in which the SMAC kernel is directly mapped onto the GPU. Two GPU-related modules are added to the program structure, as shown in figure 2.9. The module "GPU_kernel.cu" implements the SMAC kernel that is executed on the GPU, while "GPU.cu" controls the GPU memory operations and the kernel execution.

Fig. 2.9 Program structure of SMAC on GPU

The flow chart of SMAC on GPU is also similar to the one on CPU, except for invoking the GPU-related functions, as shown in figure 2.10. An obvious change is that the input data is transferred from the CPU memory to the GPU memory before the SMAC kernel executes on the GPU, and the output data is copied back from the GPU memory to the CPU memory afterwards. Besides that, the input data has to be re-organized into a layout that allows coalesced access to the GPU memory, as sketched below.
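The re-organization amounts to transposing the array-of-structures input into a structure-of-arrays layout, so that adjacent threads read adjacent words. The following sketch shows what the report's reorganizeInput() might look like; the code itself is ours, only the 9-float vector layout comes from the report:

    // AoS -> SoA: field f of vector i moves from src[i*9 + f] to soa[f*n + i],
    // so thread i's read of any field is adjacent to thread i+1's.
    void reorganizeInput(const SmacInput* aos, float* soa, int n) {
        const float* src = reinterpret_cast<const float*>(aos);
        for (int i = 0; i < n; ++i)
            for (int f = 0; f < 9; ++f)
                soa[f * n + i] = src[i * 9 + f];
    }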


Fig. 2.10 Flow chart of SMAC on GPU: ENTRY → Coefficients::setCoefficients() → SimData::readData() → reorganizeInput() → cudaMemcpyAsync() → GPU_kernel<<<…>>>() → cudaMemcpyAsync() → reorganizeOutput() → EXIT

As introduced in section 1.2, we have to pay attention to some more aspects in GPU programming, namely the threads, blocks and streams. The profiling tool CUDAPROF reports that 59 registers are needed per SMAC kernel thread. Under this register requirement, the optimization tool CUDACALCULATOR indicates that a maximum of 192 threads per block can be achieved.

Once a block with 192 threads is executing on an SM, no other block can be assigned to the same SM before the working block finishes all its executions. Since there are 4 SMs inside the GPU in our experiment, 4 blocks are enough to fully utilize the GPU. Employing more blocks would probably introduce extra overhead from block switching, while employing fewer blocks would simply waste hardware resources.

The number of streams will be explored later in our experiment; a sketch of the resulting launch configuration follows below.
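Put together, a launch configuration consistent with the numbers above might look as follows. This is a sketch under the stated 59-register budget; with a fixed 4 x 192 launch, each thread strides over several input vectors:

    // 4 blocks (one per SM) x 192 threads (the register-limited maximum).
    __global__ void smacKernelStrided(const float* soa_in, float* r_surf, int n) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x) {
            // ... the 12 SMAC calculation steps for input vector i ...
        }
    }

    // Host-side launch: smacKernelStrided<<<4, 192>>>(d_in, d_out, n);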


3. Experiment

3.1 Experiment Environment

We run the SMAC program on a laptop workstation equipped with both an Intel dual-core CPU and an NVIDIA GPU. The GPU is mounted on the laptop motherboard through the PCI Express interface. The operating system on this machine is 32-bit Windows Vista Enterprise, with CUDA 2.2 support. The workloads on the CPU and the GPU are both low before our experiment. The detailed description of the experiment environment is listed below:

HARDWARE
CPU: Intel(R) Core(TM)2 Duo CPU T9300, 2.5 GHz x 2
GPU: NVIDIA Quadro FX570M, 0.95 GHz x 32
Main memory: 4 GB
Motherboard interface: PCI Express 1.0 x 16

SOFTWARE
Operating system: Windows Vista Enterprise 32-bit
CUDA version: CUDA 2.2
GPU maximum registers per thread: 60
GPU thread number: 192 x 4 (#threads per block x #blocks)
CPU thread number: 1

Table 3.1 Experiment environment

Fig. 3.1 Time profiling method: the start timer and the stop timer are attached around SmacAlgthm::run() in the flow of figure 2.6.


To profile the execution time of either SMAC on CPU or SMAC on GPU, timers are attached at the two ends of the algorithm instances, as shown in figure 3.1. In our experiment we are concerned with the SMAC kernel performance, not the application performance: the latter is dominated by the hard-disk I/O speed, as already shown in figure 1.2, and that can be overcome by employing a hard disk with a higher I/O speed.
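On the GPU side, such a measurement can be realized with CUDA events. The fragment below is one possible realization of the start/stop timers, not necessarily the report's:

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);          // "start timer"
    // ... reorganize, copy in, run the SMAC kernel, copy out ...
    cudaEventRecord(stop, 0);           // "stop timer"
    cudaEventSynchronize(stop);         // wait until the recorded work is done

    float gpuTimeMs = 0.0f;
    cudaEventElapsedTime(&gpuTimeMs, start, stop);   // elapsed time in ms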

3.2 Experiment Setup

As indicated in figure 3.1, the timing profile of the SMAC kernel is defined as the difference between the "start timer" and the "stop timer":

CPU time = CPU stop timer - CPU start timer
GPU time = GPU stop timer - GPU start timer

Now we define the performance improvement as:

Improvement = CPU time / GPU time

in which the linear execution-time model is employed:

CPU time = CPU overhead + Bytes x CPU speed

GPU time = GPU memory time + GPU kernel time
         = (GPU memory overhead + Bytes x GPU memory speed) + (GPU kernel overhead + Bytes x GPU kernel speed)
         = (GPU memory overhead + GPU kernel overhead) + Bytes x (GPU memory speed + GPU kernel speed)
         = GPU overhead + Bytes x GPU speed

The performance improvement can then be expressed as:

Improvement = (CPU overhead + Bytes x CPU speed) / (GPU overhead + Bytes x GPU speed)
            ≈ (Bytes x CPU speed) / (Bytes x GPU speed)
            = CPU speed / GPU speed

It should be pointed out that the last approximation only holds when the data size is dramatically large. From this formula it can be seen that the ultimate improvement depends only on the CPU and GPU speeds and has nothing to do with the data size. We will apply the linear execution-time model to predict the GPU performance later.

3.3 Experiment Result

3.3.1 GPU improvement

Table 3.2 and figure 3.2 record the performance improvement measured in our experiment. When the data size is small, the improvement curves in figure 3.2 fall as the number of streams increases, because the data is too small to amortize the per-stream overhead. As the data size grows, the slopes of the curves increase, since the larger data size amortizes the overhead. Eventually, once the data size exceeds a certain threshold, all curves behave similarly or even overlap, and increasing the data size or the stream number further helps little.

Table 3.2 Performance improvement: CPU time/GPU time


Fig.3.2 GPU performance improvement: CPU time/GPU time

3.3.2 Linear execution-time model

In our earlier tests, the overhead and the processing speed were obtained; they are recorded in table 3.3. The GPU overhead is relatively large, while the CPU overhead is too small to be measured. The reason for the significant GPU overhead is that the data needs to be re-organized and then transferred both before and after the GPU kernel execution. It is also obvious that the GPU speed is at least 10 times faster than the CPU speed.

CPU time (ms) = CPU overhead (ms) + data size (byte) x CPU speed (ms/byte)
CPU overhead: 0
CPU speed: 5.39 x 10^-5

GPU time (ms) = GPU overhead (ms) + data size (byte) x GPU speed (ms/byte)
GPU overhead, 1-stream: 1.67
GPU speed, 1-stream: 2.41 x 10^-6
GPU overhead, 8-stream: 4.45
GPU speed, 8-stream: 2.01 x 10^-6

Table 3.3 Parameters of linear execution-time model
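As a worked check of the model (our computation, using the 1-stream parameters of table 3.3 and the largest data size of table 3.2):

    // Predicted 1-stream improvement for 3732480 bytes.
    double bytes       = 3732480.0;
    double cpuTime     = 0.0  + bytes * 5.39e-5;  // = 201.2 ms
    double gpuTime     = 1.67 + bytes * 2.41e-6;  // = 10.7 ms
    double improvement = cpuTime / gpuTime;       // = 18.8, as in the table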

Both the predicted improvement and the experimental improvement of the 1-stream GPU performance are plotted in figure 3.3. Some recognizable variance occurs between the two curves when the data size is small; the curves are almost identical when the data size is large. Finally, the curves approach a "ceiling" regardless of the data size.


Fig. 3.3 Single-stream GPU performance improvement

A comparison of the 8-stream GPU performance improvement is shown in figure 3.4 and tells a similar story: the performance improvement ultimately settles at a constant, and no better improvement can be achieved even when we increase the data size.

Fig. 3.4 8-stream GPU performance improvement

Data behind figures 3.3 and 3.4:

1-stream GPU
Data size (byte)    Experimental improvement    Predicted improvement
116640              5.71                        3.21806093
233280              8.6                         5.626082853
466560              12.83                       8.989388397
933120              14.79                       12.82188881
1866240             19.28                       16.29558697
3732480             20.54                       18.84884576

8-stream GPU
Data size (byte)    Experimental improvement    Predicted improvement
116640              1.99                        1.342894063
233280              3.49                        2.557938816
466560              6.18                        4.671162775
933120              9.95                        7.958659158
1866240             14.75                       12.27984849
3732480             19.24                       16.85581952


4. Roofline Model Analysis

4.1 Roofline Model

The Roofline model gives an approximate insight into the performance bottleneck [4]. When the achieved performance at the measured instruction density (in Flops/byte) lies below the peak performance, as shown in figure 4.1, the bottleneck is the data transfer and we should bring in more data. Otherwise the bottleneck is likely the computation itself, and we should reconsider the computational approach.

Fig. 4.1 Example of a Roofline model

4.2 Application

To apply the Roofline model to our case, the GPU hardware specifications and the profile of the SMAC kernel (note: not the SMAC application) have to be obtained before the analysis. They are listed in table 4.1.

Hardware: NVIDIA Quadro FX570M
PCI Express bandwidth: 4 GB/sec
Peak performance: 91.2 GFlops/sec
Peak performance without FMAU: 30.4 GFlops/sec

Software: SMAC kernel on GPU
Data size: 59719680 bytes
Issued instruction number: 4189335552 Flops
Execution time: 79.2 ms
Instruction density: 70.15 Flops/byte
Instruction throughput: 52.8 GFlops/sec

Table 4.1 Parameters of Roofline model
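The roofline position can be checked directly from these numbers; the computation below is ours, using the standard min(peak, bandwidth x density) bound:

    #include <cmath>

    double peak      = 91.2;                  // GFlops/sec
    double bandwidth = 4.0;                   // GB/sec over PCI Express
    double density   = 4189335552.0 / 59719680.0;       // = 70.15 Flops/byte
    // Attainable = min(peak, bandwidth x density) = 91.2 GFlops/sec: the
    // kernel sits under the flat roof, i.e. it is compute-bound.
    double roof      = std::fmin(peak, bandwidth * density);
    double achieved  = 4189335552.0 / (79.2e-3 * 1e9);  // ~52.8 GFlops/sec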


Now the performance of SMAC on GPU is placed on the Roofline model, shown as the blue marker in figure 4.2. It can be seen that the performance bottleneck is the computation if we only consider the kernel execution on the GPU. More precisely, it results from the imbalance between floating-point multiplications and additions. That is partially because SMAC is directly mapped onto the GPU and the data dependencies inside SMAC still exist; these dependencies limit the full utilization of the FMAUs (floating-point multiply-add units) in the GPU.

The other reason is that only 192 threads are employed per block to satisfy the register budget. Introducing more threads would force the temporary operands to be stored in the local memory, which is quite slow to access. Keeping the thread count at 192 avoids using the local memory, but it also leaves some function units unused.

The situation is worse if the I/O operations are included. The navy-blue sloping line stands for the hard-disk I/O bandwidth bottleneck, and it lies far below the GPU memory bandwidth limit. The instruction throughput in this case is below 1 GFlops/sec and is therefore not plotted in figure 4.2. To summarize, the SMAC application is limited by the hard-disk I/O.

Fig.4.2 Roofline model of SMAC on GPU


5. Conclusion and Future Work

The popular remote-sensing algorithm SMAC has been successfully mapped onto a commercial programmable GPU with the help of the CUDA technology. Since SMAC operates on large data streams, it can employ the "stream execution" technique to achieve a performance improvement. Our experimental results show that a performance speedup of 25 times can be achieved by the GPU compared with the CPU. The linear execution-time model also proved useful in analyzing the GPU's stream execution.

Besides that, the Roofline model is employed to identify the bottleneck of SMAC on GPU. The SMAC kernel on GPU is dominated by the computational bottleneck, while the complete application is limited by the hard-disk I/O bandwidth; we are only interested in the former in this report. Two main reasons cause the computational bottleneck of the SMAC kernel on GPU: (1) the imbalance between the floating-point multiplications and additions due to the data dependencies, and (2) the register pressure caused by the per-thread register requirement. The first one can possibly be relieved by decoupling the data dependencies inside the SMAC algorithm kernel, while the second one can be addressed by turning to fine-grained threads that require fewer registers.

Fig. 5.1 Diagram of GPU power measurement; Fig. 5.2 Physical layout of GPU power measurement

The power consumption of SMAC on GPU also interests us as future work. Since no commercial power-measurement PCI-E cards are available on the market, a customized approach has to be carried out. Figure 5.1 shows the measurement principle, in which the 5 V and 12 V power supply lines in the 4-pin PCI-E interface are measured separately. A 0.03 Ω resistor with a 20 W capacity is connected in the 5 V power supply line, and the current through it is calculated as its measured voltage divided by its resistance. The power supplied through the 5 V line is then obtained as 5 V times the current through the resistor; the same measurement is made on the 12 V line. Finally the two contributions are summed to obtain the GPU power consumption. Figure 5.2 shows the physical connections of our planned future experiment.


Acknowledgement

This assignment was my 3-month traineeship at Technische Universiteit Eindhoven (TU/e), with a topic from VITO-TAP, the Flemish Institute for Technological Research NV. I have received lots of support from the people around me. Dr. Ir. Bart Mesman, my supervisor at TU/e, spent quite a lot of time on my weekly reports and on verifying my ideas. Ir. Yifan He, a PhD candidate at TU/e and also my supervisor in this traineeship, guided me in the research methodology. Prof. Dr. Ir. Richard Kleihorst, my supervisor at VITO-TAP, kindly arranged the daily issues and the working environment at VITO-TAP. Prof. Dr. Ir. Henk Corporaal, my academic mentor at TU/e, gave me strong support during the 3 months. I should also thank Ir. Zhengyu Ye and Ir. Gert Jam, who gave me valuable advice on GPU programming and the Roofline model.


Appendix

[1] NVIDIA CUDA official website, http://www.nvidia.com/object/cuda_home.html, retrieved on July 20, 2009.
[2] NVIDIA CUDA documentation, Chapter 4, "NVIDIA CUDA Programming Guide 2.2", February 4, 2009.
[3] H. Rahman, G. Dedieu, "SMAC: A Simplified Method for the Atmospheric Correction of Satellite Measurements in the Solar Spectrum", December 5, 1993.
[4] S. Williams, A. Waterman, D. Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures", April 2009.
[5] S. Williams, D. Patterson, "The Roofline Model: A pedagogical tool for program analysis and optimization", retrieved on September 10, 2009.
[6] Zhengyu Ye, "Design Space Exploration for GPU-Based Architecture", August 2009.
[7] Shane Ryoo, Christopher I. Rodrigues, Sam S. Stone, Sara S. Baghsorkhi, Sain-Zee Ueng, John A. Stratton, and Wen-mei W. Hwu, "Program Optimization Space Pruning for a Multithreaded GPU", ACM, 2008.
[8] NVIDIA CUDA documentation, "NVIDIA CUDA Best Practices Guide 2.3", retrieved on September 12, 2009.
[9] Rob Farber, "CUDA, Supercomputing for the Masses", September 19, 2008.
[10] X. Ma, M. Dong, L. Zhong, Z. Deng, "Statistical Power Consumption Analysis and Modeling for GPU-based Computing", retrieved on September 2009.
[11] S. Collange, D. Defour, A. Tisserand, "Power Consumption of GPUs from a Software Perspective", Proceedings of the 9th International Conference on Computational Science, 2009.
[12] Analog Devices documentation, "Measuring temperatures on computer chips with speed and accuracy", April 1999.
[13] Green Grid, "The Green Data Center: Energy-Efficient Computing in the 21st Century", retrieved on July 16, 2009.
[14] Google, "The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines", retrieved on July 16, 2009.
[15] Google, "The Case for Energy-Proportional Computing", retrieved on July 16, 2009.
[16] http://en.wikipedia.org/wiki/PowerNow!, retrieved on July 16, 2009.
[17] http://en.wikipedia.org/wiki/SpeedStep, retrieved on July 16, 2009.
[18] SUN, "Sun's Throughput Servers: Paradigm Shift in Processor Design Drives Improved Business Value", retrieved on July 16, 2009.
[19] IBM, "Storage Modeling for Power Estimation", retrieved on July 16, 2009.
[20] http://en.wikipedia.org/wiki/Green_computing, retrieved on July 16, 2009.
[21] Seagate, "2.5-Inch Enterprise Disc Drives: Key to Cutting Data Center Costs", retrieved on July 16, 2009.
[22] Google, "Power-Aware Micro-architecture: Design and Modeling Challenges for Next-Generation Microprocessors", retrieved on July 16, 2009.
[23] Google, "MapReduce: Simplified Data Processing on Large Clusters", retrieved on July 16, 2009.
[24] Google, "Power Provisioning for a Warehouse-sized Computer", retrieved on July 16, 2009.
[25] IBM, "IBM BladeCenter HS22 Technical Introduction", retrieved on July 16, 2009.
[26] IBM, "IBM BladeCenter Products and Technology", retrieved on July 16, 2009.
[27] http://www-03.ibm.com/systems/virtualization/, retrieved on July 16, 2009.
[28] Green Grid, "Five Ways to Reduce Data Center Server Power Consumption", retrieved on July 16, 2009.