
ANSCSE15 Bangkok University, Thailand

March 30-April 2, 2011

Improving Bayesian Computational Time and Scalability with GPGPU

T. Pechprasarn^C and N. Khiripet

Knowledge Elicitation and Archiving Laboratory, National Electronics and Computer Technology Center, Pathumthani, 12120, Thailand

^C E-mail: [email protected]; Fax: 02-5646772; Tel. 02-5646900 ext. 2220

ABSTRACT

In Bayesian inference, the posterior probability can rarely be obtained analytically because the integrals involved lack closed-form antiderivatives. Instead, an approximation method such as Monte Carlo integration (MCI) is used to evaluate such an integral. MCI relies on a random process to generate samples from the target distribution. In general, a larger number of samples yields a more accurate result, but it also requires more computational time. To obtain higher performance, NVidia CUDA can accelerate the computation through a parallel programming pattern called parallel reduction. Although the speed-up we have achieved so far is reasonable, it can be improved further. Besides running time, scalability is another issue that we address. In this paper, we further optimize our parallel programs with two optimization techniques, loop unrolling and an enhanced compacting kernel, and we also cope with the scalability problem. To improve scalability, we use the multidimensional feature of CUDA by using 2D blocks instead of 1D blocks. The results show that the computation time is substantially decreased and that the program can handle much larger problem sizes even when a small block size is used. We conclude by identifying proper block sizes for certain problem sizes.

Keywords: Bayesian probability, Monte Carlo integration, Parallel reduction, GPU computing, CUDA.

1. INTRODUCTION

In Bayesian probability, one is often interested in finding the posterior distribution in order to test a hypothesis given observed training data. However, solving for the posterior is challenging because it typically takes the form of integrals, and closed-form solutions for such integrals are usually not available [3]. Instead, an approximation method such as Monte Carlo integration (MCI) is used to find the integrated value. MCI relies on a random process to generate samples from the target distribution; the contribution of each sample to the final integrated value is then calculated. In general, a larger number of samples yields more accurate results, but the computation becomes much slower as the sample size grows. Therefore, we speed up the computation of MCI with GPUs. We implement a parallel program using Compute Unified Device Architecture (CUDA), a leading framework for programming GPUs. Given a set of samples, our work focuses on the core integration part, which sums the contributions of all samples. We employ a parallel pattern called parallel reduction to compute this sum. Parallel reduction suits our CUDA programs because it allows many parts of the calculation to be done in parallel [5]. In our previous work [7], the experimental results indicated that higher performance was gained, since the running time was substantially decreased. Although our previous work was successful to some extent, it can still be improved in several respects. For example, the computational time can be reduced further by optimization methods such as loop unrolling. There is also a scalability issue: we could not use smaller block sizes for larger problem sizes. This issue is important because it prevents us from determining the effect of smaller block sizes on performance. Solving it therefore opens up additional search space for finding the optimal running time. To solve the scalability problem, we use the multidimensional feature of CUDA and divide the samples into 2D blocks instead of 1D blocks. This lets us use smaller block sizes, such as 128, for larger problem sizes. In this paper, we present our work to reduce the running time with the chosen optimization techniques and to cope with the scalability problem. In addition, a real-world example of a Bayesian application is provided. According to our experiments, our parallel programs perform much better than the sequential implementation; the maximum speed-up obtained is 53.49 times the sequential code.

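2. THEORY AND RELATED WORKS

2.1 Bayesian Probability

According to [1], Bayes' rule is defined as

$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)} \tag{1}$$

where,

D = observed data
$\theta$ = the hypothesis defined by parameter $\theta$
$P(\theta \mid D)$ = posterior of $\theta$ given D
$P(D \mid \theta)$ = likelihood of $\theta$
$P(\theta)$ = prior probability of $\theta$
$P(D)$ = probability of D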

The posterior is often of interest because it is an inverse probability, in contrast to the direct probability of the classical theory of statistics. The posterior is used to infer the causes given the observed data. Given the data, a model, called the likelihood, can be constructed. The prior distribution expresses general knowledge about the data. Next, to decide whether the hypothesis is accepted or rejected, the expected value of the posterior has to be computed. The hypothesis is accepted if the posterior expectation falls within the 95% region of the prior distribution.

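According to [2], the expected value gives the long-run average outcome of a function and is defined as

$$E[f(x)] = \int f(x)\, P(x)\, dx \tag{2}$$

where,

P(x) = probability density function

The expectation of the posterior is $E[\theta \mid D]$. According to (2),

$$E[\theta \mid D] = \int \theta\, P(\theta \mid D)\, d\theta$$

According to (1),

$$E[\theta \mid D] = \int \theta\, \frac{P(D \mid \theta)\, P(\theta)}{P(D)}\, d\theta$$

According to (2),

$$E[\theta \mid D] = \frac{E[\theta\, P(D \mid \theta)]}{E[P(D \mid \theta)]} \tag{3}$$

where,

P(D) = a constant value of $\int P(D \mid \theta)\, P(\theta)\, d\theta = E[P(D \mid \theta)]$, and both expectations in (3) are taken over the prior $P(\theta)$

2.2 Monte Carlo Integration (MCI)

Involving a random process, MCI is an integration method for finding the value of a definite integral [4]. A general form of such an integral is

$$I = \int_a^b f(x)\, dx \tag{4}$$

We can divide f(x) by P(x), multiplying by P(x) to compensate, and (4) becomes

$$I = \int_a^b \frac{f(x)}{P(x)}\, P(x)\, dx \tag{5}$$

where,

P(x) = probability density function on interval [a,b]

According to (2), we have

$$I = E\!\left[\frac{f(x)}{P(x)}\right] \tag{6}$$

But the expected value of a function g(x) can be estimated from N samples $x_i$ drawn from P(x) as

$$E[g(x)] \approx \frac{1}{N} \sum_{i=1}^{N} g(x_i) \tag{7}$$

From (6) and (7) with g(x) = f(x)/P(x), MCI is defined as

$$I \approx \frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{P(x_i)} \tag{8}$$

where,

P(x) = a sampling distribution
N = the number of samples

There are two major steps in MCI. The first is to generate a set of samples from the sampling distribution. Then, the contributions from all samples are summed to find the integrated value.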
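As a minimal illustration of (8), the following Python sketch estimates the integral of x^2 over [0, 1], whose exact value is 1/3, using a uniform sampling distribution so that P(x) = 1. The integrand, the sample count, the seed and the name `mci_estimate` are assumptions chosen for illustration; they are not taken from the implementation described in this paper.

```python
import random

def mci_estimate(f, sample, pdf, n, seed=0):
    """Estimate an integral as in (8): the mean of f(x_i)/P(x_i) over n samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample(rng)          # step 1: draw x_i from the sampling distribution P
        total += f(x) / pdf(x)   # step 2: accumulate the contribution of x_i
    return total / n

# Illustrative integrand and sampling distribution (assumptions, not from the paper).
estimate = mci_estimate(
    f=lambda x: x * x,
    sample=lambda rng: rng.random(),   # uniform on [0, 1)
    pdf=lambda x: 1.0,                 # density of the uniform distribution on [0, 1]
    n=100_000,
)
print(estimate)  # close to 1/3; the error shrinks roughly as 1/sqrt(n)
```

The summation inside the loop is the part that the parallel reduction replaces on the GPU.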

2.3 Parallel Reduction

Parallel reduction is a common pattern for reducing a set of numbers to a single value [5]. The structure of parallel reduction is shown in Figure 1. With a tree-based structure, there are $\log_2 N$ tree levels. All operations at the same level can be done in parallel, but the next level has to wait until the operands from the previous level are ready. We employ the parallel reduction pattern in the second step of MCI, which sums the contributions from the samples.
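The level-by-level structure can be sketched sequentially in Python. This is only an emulation under the assumption that each level's pairwise additions are independent; the function name `tree_reduce` is hypothetical, and a GPU would perform the additions of one level in parallel rather than in a loop.

```python
def tree_reduce(values):
    """Sum a list level by level; returns (total, number_of_levels).

    Within one level every pair can be added independently, which is
    what a GPU does in parallel; here the pairs are simply looped over.
    """
    data = list(values)
    levels = 0
    while len(data) > 1:
        if len(data) % 2:            # pad odd-length levels with the identity for +
            data.append(0)
        data = [data[i] + data[i + 1] for i in range(0, len(data), 2)]
        levels += 1
    return data[0], levels

total, levels = tree_reduce(range(1, 9))   # N = 8 inputs
print(total, levels)                        # 36 3, i.e. log2(8) = 3 levels
```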





Figure 1. Structure of parallel reduction.

2.4 Compute Unified Device Architecture (CUDA)

According to [6], CUDA is a general-purpose parallel computing architecture. It comes with a programming model and a new instruction set architecture. A CUDA GPU is composed of streaming multiprocessors, each of which contains CUDA cores. CUDA exploits parallelism via blocks of threads. Blocks are executed independently of one another on the streaming multiprocessors. Therefore, more than one block can be executed in parallel, depending on the available CUDA core resources. This allows CUDA programs to scale up automatically by simply running more blocks. A kernel is a function to be executed on GPUs. To launch a kernel, the CPU caller has to specify both the number of blocks and the number of threads per block.

3. IMPLEMENTATION DETAILS

1. Bayesian Application

We extend our previous work by introducing a Bayesian application that calculates the expectation of the posterior, i.e., the result of (3). Using MCI, the first step is to generate a set of samples according to the prior distribution. This random number generation is done on the CPU. Next, using the generated samples, the expected values are calculated according to (7) with parallel reduction, which is accelerated on the GPU. After the expectation of the posterior is obtained, hypothesis testing is performed by checking whether the probability falls within the 95% region of the prior. An overview of the implementation is shown in Figure 2.

(* Calculate the expectation of the posterior using MCI *)
SET samples to Sampling(Normal(5,0.5), N)
SET numerator to Reduce(f, samples, N) using f(x) = x*lhd(x)
SET denominator to Reduce(f, samples, N) using f(x) = lhd(x)
SET expected_value to numerator/denominator

(* Hypothesis testing *)
SET pH0 to Test(expected_value, Normal(5,0.5))
RETURN pH0 < 0.95

Figure 2. Overview of the implementation.

2. Solving the Scalability

In order to solve the scalability issue found in our previous work [7], we employ the multidimensional feature of CUDA. Our CUDA programs divide an array of samples into smaller blocks, which are then reduced to find the integrated value. Figure 3 illustrates the transformation from 1D blocks to 2D blocks. With 1D blocks, the maximum number of blocks we can use is 65535 x 1 x 1 = 65535. With 2D blocks, the maximum number of blocks becomes 65535 x 65535 x 1 = 4294836225. This number is large enough to utilize the CUDA core resources and also allows us to scale to larger problem sizes. Theoretically, the maximum problem size for a given block size is the number of blocks multiplied by the block size. In practice, the physical limit of GPU memory may prevent very large problem sizes.
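The block-count arithmetic can be checked with a short Python sketch. The helper names (`ceil_div`, `grid_1d`, `grid_2d`, `linear_block_id`) and the default row size are assumptions made for illustration; only the per-dimension limit of 65535 and the problem size come from this paper.

```python
MAX_GRID_DIM = 65535   # per-dimension limit on the number of blocks quoted in the text

def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)

def grid_1d(num_samples, block_size):
    """Blocks needed in a 1D arrangement, or None if the limit is exceeded."""
    num_blocks = ceil_div(num_samples, block_size)
    return (num_blocks, 1) if num_blocks <= MAX_GRID_DIM else None

def grid_2d(num_samples, block_size, row_size=MAX_GRID_DIM):
    """Blocks arranged as rows of at most row_size blocks, or None if too many rows."""
    num_blocks = ceil_div(num_samples, block_size)
    num_rows = ceil_div(num_blocks, row_size)
    if row_size > MAX_GRID_DIM or num_rows > MAX_GRID_DIM:
        return None
    return (min(num_blocks, row_size), num_rows)

def linear_block_id(block_x, block_y, grid_x):
    """Map a 2D block coordinate back to its position in the 1D sample array."""
    return block_y * grid_x + block_x

n = 268_431_360                     # largest problem size in the experiments
print(grid_1d(n, 128))              # None: 2,097,120 blocks exceed 65535
print(grid_2d(n, 128))              # (65535, 32)
print(MAX_GRID_DIM ** 2)            # 4294836225
```

With a block size of 128, the largest problem size needs 2,097,120 blocks, which a 1D arrangement cannot express but 32 rows of 65535 blocks can.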

[Diagram: the 1D array of samples is cut into rows (row0, row1, ...) of a fixed row size.]

Figure 3. Transformation from 1D blocks to 2D blocks.

(* 1D block representation *)
SET num_blocks to num_samples/block_size
SET block.x to num_blocks
SET block.y to 1

(* 2D block representation *)
SET num_blocks to num_samples/block_size
SET num_rows to num_blocks/row_size
SET block.x to min(num_blocks, row_size)
SET block.y to num_rows

(* # CUDA blocks = min(block.x, 65535) x min(block.y, 65535) *)

Figure 4. Implementation of 2D blocks.

Figure 4 shows our implementation with a required parameter, the row size. This parameter can be tuned to fit certain problem sizes. If a large row size is used, a lot of computation might be wasted in the last row. If a smaller row size is used, there is much less opportunity for waste; however, as the number of rows grows, it might hit the limit of 65535. A detailed analysis of this trade-off is left to future work.

3. Performance

2.1) Loop unrolling

We employ loop unrolling in our parallel reduction code. The major advantage of loop unrolling is that there is no need to check the loop condition while iterating. We unroll the last six iterations because, at that point, the number of active threads is guaranteed to fit within one warp. This has an extra benefit: we can remove expensive synchronization instructions that are no longer necessary. Threads within the same warp do not require a synchronization point because they always execute the same instruction. The idea is shown in Figure 5.
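The effect of unrolling the last six steps can be emulated in Python. This is a sequential sketch, not GPU code: the helper `lockstep_add` is a hypothetical stand-in for threads of one warp executing the same instruction together, and the block size is assumed to be a power of two of at least 64.

```python
WARP_SIZE = 32

def lockstep_add(s_data, active_threads, s):
    """One reduction step: every active thread tid adds s_data[tid + s] to s_data[tid].

    All reads happen before any write, mimicking threads that run in lockstep.
    """
    reads = [s_data[tid + s] for tid in range(active_threads)]
    for tid in range(active_threads):
        s_data[tid] += reads[tid]

def block_reduce(values):
    """Reduce one block whose size is a power of two and at least 64."""
    s_data = list(values)              # stands in for the block's shared memory
    s = len(s_data) // 2
    while s >= 64:                     # looped part: a barrier is needed between steps
        lockstep_add(s_data, s, s)
        s //= 2
    # Unrolled part: the last six steps are written out, with no loop test and
    # no barrier, and are executed by the 32 threads of the first warp.
    lockstep_add(s_data, WARP_SIZE, 32)
    lockstep_add(s_data, WARP_SIZE, 16)
    lockstep_add(s_data, WARP_SIZE, 8)
    lockstep_add(s_data, WARP_SIZE, 4)
    lockstep_add(s_data, WARP_SIZE, 2)
    lockstep_add(s_data, WARP_SIZE, 1)
    return s_data[0]

block = list(range(256))
print(block_reduce(block))             # 32640
```

In the unrolled part, threads 16 to 31 also perform additions whose results are never used; this is harmless because only the first element holds the block sum at the end.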

(* parallel reduction in the reduce kernel *)
FOR s from num_samples/2 to 64 having s/=2
  Sync threads (* make sure that all threads are working on the same level of the tree *)
  IF threadId is less than s THEN
    Add s_data[threadId + s] to s_data[threadId]
  END IF
END FOR
(* loop unrolling *)
IF threadId is less than 32 THEN (* CUDA warp size is 32 *)
  Add s_data[threadId + 32] to s_data[threadId]
  Add s_data[threadId + 16] to s_data[threadId]
  Add s_data[threadId + 8] to s_data[threadId]
  Add s_data[threadId + 4] to s_data[threadId]
  Add s_data[threadId + 2] to s_data[threadId]
  Add s_data[threadId + 1] to s_data[threadId]
END IF

Figure 5. Loop unrolling in parallel reduction.

2.2) Enhancing the compact kernel

The compact kernel gathers the reduced values from all CUDA blocks into a new array, which is sent to the reduce kernel again; this repeats until only one block is left. In our previous work, to keep the programming simple, each block performed the compacting with a single thread, which may underutilize the CUDA resources. Although the compact kernel is not part of the core code, tuning it also improves performance. Figure 6 shows our modification.

(* Original version *)
kernel_reduce ()
(* Modified version *)
kernel_reduce ()

Figure 6. Enhancing the compact kernel.

This introduces another parameter: the number of threads for the compact kernel. We adjust it according to the problem size. For example, we use 128 threads if the sample size is less than 8388480, 512 threads if the sample size is larger than 16776960, and 256 threads if the size is in between.

4. EXPERIMENTS AND RESULTS

1. Platforms

We use an NVidia GeForce GTX 580 as our GPU platform. On the CPU side, we use an Intel Core i7. The detailed specifications are shown in Table 1.





Table 1. Specification of CPUs and GPUs.

Description              CPU              GPU
Model                    Intel Core i7    NVidia GeForce GTX 580
Clock frequency (GHz)    2.8              1.56
# processors             2                16
# cores per processor    4                32
# total cores            8                512

2. Datasets

Cavendish's data [8] are used in our experiments. The data represent the specific density of the earth. The values from 23 experiments are: 5.36, 5.29, 5.58, 5.65, 5.57, 5.53, 5.62, 5.29, 5.44, 5.34, 5.79, 5.10, 5.27, 5.39, 5.42, 5.47, 5.63, 5.34, 5.46, 5.30, 5.78, 5.68 and 5.85. According to [9], our corresponding model is a normal distribution $N(\cdot \mid \theta, 0.04)$. The prior is chosen to be normal with mean = 5 and variance = 0.5.

3. Results

Our computed posterior expectation is 5.483, which is similar to the result from [9]. We find that the computed probability falls within the 95% region (0.75 < 0.95). Thus, the hypothesis is accepted.
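The reported expectation can be cross-checked with a small CPU-only Python sketch that follows (3): samples are drawn from the prior, and the ratio of two sums is formed. Because both the prior and the likelihood are normal, the posterior mean also has a closed form, which is used here for comparison. The function names, sample count and seed are assumptions. The last step assumes that the test evaluates the prior cumulative distribution at the expectation; this reading reproduces the reported 0.75 but is not confirmed by the text.

```python
import math
import random

DATA = [5.36, 5.29, 5.58, 5.65, 5.57, 5.53, 5.62, 5.29, 5.44, 5.34,
        5.79, 5.10, 5.27, 5.39, 5.42, 5.47, 5.63, 5.34, 5.46, 5.30,
        5.78, 5.68, 5.85]
LIK_VAR = 0.04                  # variance of the normal likelihood
PRIOR_MEAN, PRIOR_VAR = 5.0, 0.5

def likelihood(theta):
    """P(D | theta) up to a constant factor, which cancels in the ratio (3)."""
    sq = sum((x - theta) ** 2 for x in DATA)
    return math.exp(-sq / (2.0 * LIK_VAR))

def posterior_mean_mci(n, seed=0):
    """Ratio of two sample sums over draws from the prior, as in (3)."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n):
        theta = rng.gauss(PRIOR_MEAN, math.sqrt(PRIOR_VAR))
        w = likelihood(theta)
        num += theta * w        # contribution to E[theta * P(D | theta)]
        den += w                # contribution to E[P(D | theta)]
    return num / den

def posterior_mean_exact():
    """Closed-form posterior mean for a normal prior and normal likelihood."""
    prior_prec = 1.0 / PRIOR_VAR
    data_prec = len(DATA) / LIK_VAR
    data_mean = sum(DATA) / len(DATA)
    return (prior_prec * PRIOR_MEAN + data_prec * data_mean) / (prior_prec + data_prec)

mci = posterior_mean_mci(50_000)
exact = posterior_mean_exact()
print(round(exact, 3))          # 5.483

# Assumption: the prior CDF at the expectation stands in for the Test step.
p_h0 = 0.5 * (1.0 + math.erf((exact - PRIOR_MEAN) / math.sqrt(2.0 * PRIOR_VAR)))
print(round(p_h0, 2))           # 0.75
```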

Figure 7. Results from our Bayesian application.

Figure 7 shows example results from our application. In addition to the answer, the running time is reported for both the CPU and GPU versions, broken down into the posterior expectation calculation and the hypothesis testing. In terms of performance, we expect an improvement in the posterior calculation, since this part is parallelized with CUDA. On the other hand, we expect similar running times for the hypothesis testing, since the GPU is not involved in this part.

3.1) Running Time

According to the experiment, our GPU program takes less time than the CPU implementation of our Bayesian application. The logarithmic chart in Figure 8 illustrates the comparison.





[Chart: running time (seconds, logarithmic scale) versus problem size (0 to 300,000,000); series: CPU, GPU.]

Figure 8. Running time of CPU and GPU (the whole application).

Because the application has two main parts, 1) the posterior expectation calculation and 2) the hypothesis test, we provide the running time of each part. The part that calculates the expectation is shown in Figure 9.

[Chart: running time (seconds, logarithmic scale) versus problem size (0 to 300,000,000); series: CPU, GPU.]

Figure 9. Running time of CPU and GPU (for posterior expectation).

The two charts, Figures 8 and 9, reveal a similar trend: the GPU implementation is faster than the CPU. Because the hypothesis-testing portion of the code has no GPU involvement, there is no difference in its timing between the CPU and GPU versions. However, it is still useful to see how this part scales with the problem size. According to Figure 10, the testing part scales linearly with the problem size.



[Chart: running time versus problem size (0 to 300,000,000) for the hypothesis-testing portion.]

Figure 10. Running time of the portion of testing the hypothesis.

Next, we return to the posterior expectation calculation. It is interesting to see how each optimization strategy performs on the GPU side. Therefore, Figure 11 provides the detailed running time of the GPU programs with different optimizations.

[Chart: running time versus problem size (0 to 300,000,000); series: 1) No extra optimization, 2) Enhance the compacting kernel, 3) Loop unrolling, 4) Optimization (2)+(3).]

Figure 11. Effect of optimization methods in GPU programs.

However, the chart shows little difference in running time among the methods. We suspect that this is caused by the evaluation of a complex function, the likelihood, on the GPU in the parallel reduction step of MCI. Although many threads evaluate the function in parallel, the time a single thread spends on this evaluation dominates the whole reduction. Because the optimization techniques, enhancing the compact kernel and loop unrolling, target the core reduction part, their improvement is very small compared with the time used by the function evaluation. Therefore, the optimization techniques differ little in running time.

3.2) Scalability

Table 2 shows the results after solving the scalability problem. Notice that every block size, even a block size of 128, can be used with all problem sizes, which was not possible in our previous work.

Table 2. Running time (seconds) of GPU programs with different block sizes.

                 Block size
Problem size     128       256       512       1024      2048      4096
65,535           0.011     0.011     0.011     0.011     0.011     0.011
131,070          0.021     0.021     0.021     0.021     0.021     0.021
262,140          0.041     0.041     0.040     0.047     0.040     0.040
524,280          0.080     0.080     0.081     0.080     0.080     0.080
1,048,560        0.159     0.159     0.165     0.166     0.159     0.158
2,097,120        0.317     0.316     0.316     0.316     0.316     0.316
4,194,240        0.631     0.638     0.652     0.631     0.638     0.631
8,388,480        1.261     1.261     1.264     1.261     1.261     1.262
16,776,960       2.523     2.529     2.522     2.522     2.525     2.524
33,553,920       5.076     5.042     5.117     5.042     5.041     5.045
67,107,840       10.368    10.087    10.082    10.082    10.085    10.084
134,215,680      20.516    20.502    20.786    20.161    20.155    20.516
268,431,360      40.332    40.311    41.300    40.313    40.316    40.329

Table 2 shows no significant difference in the running time of the GPU programs across block sizes. Again, this is likely because most of the time is spent outside the core parallel reduction code, so the effect of different block sizes cannot be seen.

3.3) Speed-up

We calculate the speed-up of the GPU programs for both the whole program and the

portion of posterior calculation. The speed-ups are shown in Table 3.

Table 3. Speed-ups of GPU programs.

Problem size     Whole application    Posterior expectation
65,535           49.97                84.25
131,070          50.17                88.58
262,140          52.27                91.59
524,280          52.83                93.48
1,048,560        53.22                94.61
2,097,120        53.29                94.95
4,194,240        53.37                95.21
8,388,480        53.45                95.41
16,776,960       53.44                95.52
33,553,920       53.48                95.54
67,107,840       53.49                95.60
134,215,680      53.49                95.56
268,431,360      52.21                95.58





The maximum speed-up obtained for the whole application is 53.49 times the sequential code. For the portion that calculates the posterior expectation alone, the maximum speed-up is 95.60.
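The gap between the two speed-ups can be related by a rough Amdahl-style estimate, sketched below in Python. It assumes that the application time consists only of the accelerated posterior part and a remainder that is not accelerated at all; the function name `parallel_fraction` is hypothetical, and the inputs are the Table 3 values for the problem size 67,107,840.

```python
def parallel_fraction(overall_speedup, part_speedup):
    """Solve 1/S = (1 - p) + p/s for p, the share of sequential time in the accelerated part."""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / part_speedup)

p = parallel_fraction(overall_speedup=53.49, part_speedup=95.60)
print(round(p, 4))          # 0.9917
```

Under these assumptions, roughly 99% of the sequential running time lies in the posterior expectation, and the small unaccelerated remainder is enough to lower the overall speed-up from 95.60 to 53.49.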

5. CONCLUSION

We illustrate a real-world application of Bayesian probability to hypothesis testing, which requires the posterior expectation. The implementation shows that our method can find this expectation accurately. We also enhance our previous work by further optimizing our CUDA programs and by handling the scalability issue. Our results show that our parallel programs perform better than the CPU program, taking much less time to execute. Our experiments also show that large problem sizes can be handled even with small block sizes, which is essential because it enlarges the solution space. The maximum speed-up observed in our experiments is 53.49 times the sequential code. Future work will focus on a full GPU implementation that also generates the random numbers on the GPU, and on the cost of evaluating the function in the parallel reduction step, so that the effect of the optimizations and the block size can be seen.

REFERENCES

1. Bayes, T., and Price, R., "An Essay towards solving a Problem in the Doctrine of Chances. By the late Rev. Mr. Bayes, communicated by Mr. Price, in a letter to John Canton, M. A. and F. R. S.", Philosophical Transactions of the Royal Society of London 53, 1763, 370-418.

2. Ross, S., "2.4 Expectation of a random variable", Introduction to probability models (9th ed.), Academic Press, 2007, p. 38.

3. Tierney, L., and Kadane, J., "Accurate Approximations for Posterior Moments and Marginal Densities," Journal of the American Statistical Association, 1986, 81, 82-86.

4. Caflisch, R., Monte Carlo and quasi-Monte Carlo methods, Acta Numerica vol. 7, Cambridge University Press, 1998, pp. 1-49.

5. Harris, M., Mapping computational concepts to GPUs, in: M. Pharr (ed.), GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation, chap. 31, Addison-Wesley, 2005, pp. 493-508.

6. NVIDIA CUDA C Programming Guide Version 3.2, 2010.

7. Pechprasarn, T. and Khiripet, N., Accelerating Bayesian Computation with Parallel Reduction using CUDA, The 4th Mahasarakham International Workshop on Artificial Intelligence (MIWAI), 2010, pp. 40-45.

8. Cavendish, H., "Experiments to Determine the Density of the Earth", in: MacKenzie, A. S., Scientific Memoirs Vol. 9: The Laws of Gravitation, American Book Co., 1900, pp. 59-105.

9. Piche R., Normal Data, in the note 2 of Bayesian statistics courses, Tampere University of Technology, 2009.
