ANSCSE15 Bangkok University, Thailand
March 30-April 2, 2011
Improving Bayesian Computational Time and Scalability with
GPGPU
T. Pechprasarn^C and N. Khiripet
Knowledge Elicitation and Archiving Laboratory, National Electronics and Computer Technology Center,
Pathumthani, 12120, Thailand
^C E-mail: [email protected]; Fax: 02-5646772; Tel. 02-5646900 ext. 2220
ABSTRACT
In Bayesian inference it is often impossible to obtain the posterior probability analytically because the integrals involved lack closed-form antiderivatives. Instead, an approximation method such as Monte Carlo integration (MCI) is used to evaluate such integrals. MCI uses a random process to generate samples from the target distribution. In general, a larger number of samples yields a more accurate result; however, it also requires more computational time. To obtain higher performance, NVIDIA CUDA can accelerate the computation by leveraging a parallel programming pattern called parallel reduction. Although the speed-up we have achieved so far is reasonable, it can be improved further. Besides running time, scalability is another issue that we address. In this paper, we further optimize our parallel programs with loop unrolling and an enhanced compact kernel. To improve scalability, we utilize the multidimensional feature of CUDA by using 2D blocks instead of 1D blocks. The results show that the computation time is substantially decreased and that the program can handle much larger problem sizes even when a small block size is used. We conclude our work by identifying proper block sizes for certain problem sizes.
Keywords: Bayesian probability, Monte Carlo integration, Parallel reduction, GPU
computing, CUDA.
1. INTRODUCTION
In Bayesian probability, one is often interested in finding the posterior distribution in order to test a hypothesis given observed training data. However, solving for the posterior is challenging because it typically takes the form of integrals, and closed-form solutions for such integrals are usually not available [3]. Instead, an approximation method such as Monte Carlo integration (MCI) is used to find the integrated value. MCI uses a random process to generate samples from the target distribution, and the contribution of each sample to the final integrated value is then calculated. In general, using a larger number of samples yields more accurate results. Nevertheless, when the sample size is large, the computation becomes much slower. Therefore, we speed up the computation of MCI with GPUs. We implement a parallel program using Compute Unified Device Architecture (CUDA), a leading framework for programming GPUs. Given a set of samples, our work focuses on the core integration part. The integration involves summing the contributions of all samples, and we employ a parallel pattern called parallel reduction to compute this sum. Parallel reduction is suitable for our CUDA programs because it allows many parts of the calculation to be done in parallel [5]. The experimental results of our previous work [7] indicate that higher performance is gained, as the running time is substantially decreased. Although our previous work is successful to some
extent, it can still be improved in several respects. For example, the computational time can be reduced further with optimization methods such as loop unrolling. There is also a scalability issue: we could not use smaller block sizes for larger problem sizes. This issue is important because it prevents us from determining the effect on performance of using smaller block sizes. Thus, solving the scalability issue opens up additional search space for finding the optimal running time. To solve the scalability problem, we utilize the multidimensional feature of CUDA and divide the samples into 2D blocks instead of 1D blocks. This lets us use smaller block sizes such as 128 for larger problem sizes. In this paper, we present our work to reduce the running time with the chosen optimization techniques and to cope with the problem of scalability. In addition, a real-world example of a Bayesian application is provided. According to our experiments, our parallel programs perform much better than the sequential implementation: the maximum speed-up obtained is 53.49 times the sequential code.
2. THEORY AND RELATED WORKS
2.1 Bayesian Probability
According to [1], Bayes' rule is defined as
$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)} \qquad (1)$$
where,
D = observed data
$\theta$ = the hypothesis defined by parameter $\theta$
$P(\theta \mid D)$ = posterior of $\theta$ given D
$P(D \mid \theta)$ = likelihood of $\theta$
$P(\theta)$ = prior probability of $\theta$
P(D) = probability of D
The posterior is often of interest because it is an inverse probability, in contrast to the direct probability of the classical theory of statistics. The posterior is used to infer the causes given observed data. Given data, the model, which is called the likelihood, can be constructed. The prior distribution expresses general knowledge about the data. Next, in order to decide whether the hypothesis is accepted or rejected, an expected value of the posterior has to be computed. The posterior expectation has to fall within the 95% region of the prior distribution.
According to [2], an expected value is the average outcome of a function in the long run and is defined as
$$E[f(x)] = \int f(x)\,P(x)\,dx \qquad (2)$$
where,
P(x) = probability density function
The expectation of the posterior is $E[\theta \mid D]$. According to (2),
$$E[\theta \mid D] = \int \theta\,P(\theta \mid D)\,d\theta$$
According to (1),
$$E[\theta \mid D] = \frac{\int \theta\,P(D \mid \theta)\,P(\theta)\,d\theta}{P(D)}$$
According to (2),
$$E[\theta \mid D] = \frac{E[\theta\,P(D \mid \theta)]}{E[P(D \mid \theta)]} \qquad (3)$$
where the expectations on the right-hand side are taken with respect to the prior $P(\theta)$, and
P(D) = a constant value of $\int P(D \mid \theta)\,P(\theta)\,d\theta = E[P(D \mid \theta)]$
2.2 Monte Carlo Integration (MCI)
MCI is an integration method that uses a random process to find the value of a definite
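integral [4]. A general form of such an integral is
$$I = \int_a^b f(x)\,dx \qquad (4)$$
We can divide and multiply f(x) by P(x), and (4) becomes
$$I = \int_a^b \frac{f(x)}{P(x)}\,P(x)\,dx \qquad (5)$$
where,
P(x) = probability density function on interval [a,b]
According to (2), we have
$$I = E\left[\frac{f(x)}{P(x)}\right] \qquad (6)$$
But the expected value can be estimated as
$$E[f(x)] \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i) \qquad (7)$$
From (6) and (7), MCI is defined as
$$I \approx \frac{1}{N}\sum_{i=1}^{N} \frac{f(x_i)}{P(x_i)} \qquad (8)$$
where,
P(x) = a sampling distribution
N = the number of samples
$x_i$ = the i-th sample drawn from P(x)
There are two major steps in MCI. The first is to generate a set of samples from the sampling distribution. Then, the contributions of the samples are summed to find the integrated value.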
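The two steps of MCI can be illustrated with a small sequential sketch. It is not the paper's CUDA code: the function name mc_integrate, the test integrand x^2 on [0, 2] and the uniform sampling distribution P(x) = 1/(b - a) are all assumptions chosen only to make equation (8) concrete.

```python
import random

def mc_integrate(f, a, b, n, seed=0):
    """Estimate the integral of f over [a, b] with equation (8), using the
    uniform density P(x) = 1 / (b - a) as the sampling distribution."""
    rng = random.Random(seed)
    density = 1.0 / (b - a)          # P(x) is constant for a uniform sampler
    total = 0.0
    for _ in range(n):
        x = rng.uniform(a, b)        # step 1: draw a sample from P(x)
        total += f(x) / density      # step 2: accumulate its contribution
    return total / n

estimate = mc_integrate(lambda x: x * x, 0.0, 2.0, 100_000)
print(estimate)  # should be close to the exact value 8/3
```

The accumulation in step 2 is a plain sum over independent terms, which is why it can be handed to a parallel reduction.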
2.3 Parallel Reduction
Parallel reduction is a common pattern for reducing a set of numbers into a single value [5]. The structure of parallel reduction is shown in Figure 1. With a tree-based structure, there are $\log_2 N$ tree levels. All operations at the same level can be done in parallel, but the next level has to wait until the operands from the previous level are ready. We employ the parallel reduction pattern in the second step of MCI, which is to sum the contributions of the samples.
Figure 1. Structure of parallel reduction.
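The tree structure can be mimicked sequentially to make the level count visible. The sketch below is illustrative only: the name tree_reduce and the zero-padding to a power of two are assumptions, and the inner loop stands in for work that a GPU would perform concurrently within one level.

```python
def tree_reduce(values):
    """Sum values with a tree-shaped reduction; return (total, levels)."""
    data = list(values)
    size = 1
    while size < len(data):          # round the length up to a power of two
        size *= 2
    data += [0] * (size - len(data)) # zeros do not change the sum
    levels = 0
    stride = size // 2
    while stride >= 1:
        # Every addition in this loop is independent of the others,
        # so one level corresponds to one parallel step.
        for i in range(stride):
            data[i] += data[i + stride]
        stride //= 2
        levels += 1
    return data[0], levels

print(tree_reduce([1, 2, 3, 4, 5, 6, 7, 8]))  # (36, 3): 8 values need log2(8) = 3 levels
```

Each pass of the outer loop halves the number of active elements, so N values are reduced in $\log_2 N$ dependent steps rather than N - 1.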
2.4 Compute Unified Device Architecture (CUDA)
According to [6], CUDA is a general-purpose parallel computing architecture. It comes with a programming model and a new instruction set architecture. A CUDA device is a GPU composed of streaming multiprocessors, each of which contains CUDA cores. CUDA exploits parallelism via blocks of threads. Blocks are executed independently by CUDA cores. Therefore, more than one block can be executed in parallel, depending on the available CUDA core resources. This allows CUDA programs to scale up automatically by simply running more blocks. A kernel is a function to be executed on the GPU. To launch a kernel, the CPU caller has to specify both the number of blocks and the number of threads per block.
3. IMPLEMENTATION DETAILS
1. Bayesian Application
We extend our previous work by introducing a Bayesian application that calculates the expectation of the posterior, that is, the result of (3). Using MCI, the first step is to generate a set of samples according to the prior distribution. This random number generation is done on the CPU. Next, using the generated samples and according to (7), the expected values are calculated with parallel reduction, which is accelerated on the GPU. After obtaining the expectation of the posterior, hypothesis testing is performed by checking whether the probability falls within the 95% region of the prior. An overview of the implementation is shown in Figure 2.
(* Calculate the expectation of the posterior using MCI *)
SET samples to Sampling(Normal(5,0.5), N)
SET numerator to Reduce(f, samples, N) using f(x) = x*lhd(x)
SET denominator to Reduce(f, samples, N) using f(x) = lhd(x)
SET expected_value to numerator/denominator
(* Hypothesis testing *)
SET pH0 to Test(expected_value, Normal(5,0.5))
RETURN pH0 < 0.95
Figure 2. Overview of the implementation.
2. Solving the Scalability
To solve the scalability issue found in our previous work [7], we employ the multidimensional feature of CUDA. Our CUDA programs divide an array of samples into smaller blocks, which are later reduced to find the integrated value. Figure 3 illustrates the transformation from 1D blocks to 2D blocks. With 1D blocks, the maximum number of blocks we can use is 65535 x 1 x 1 = 65535. On the other hand, if 2D blocks are used, the maximum number of blocks becomes 65535 x 65535 x 1 = 4294836225. This number is large enough to utilize the CUDA core resources and allows us to scale to larger problem sizes. Theoretically, the maximum problem size for a given block size is the number of blocks multiplied by the block size. In practice, the physical limit of GPU memory may prevent very large problem sizes.
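The arithmetic behind the 1D limit and the 2D layout can be checked with a short host-side sketch. The helper names grid_dims, ceil_div and linear_block_id are hypothetical, and rounding the block count up (rather than truncating) is an assumption made here so that leftover samples are still covered.

```python
MAX_GRID_DIM = 65535  # per-dimension limit on the number of blocks, as stated above

def ceil_div(a, b):
    """Integer division rounded up."""
    return (a + b - 1) // b

def grid_dims(num_samples, block_size, row_size=MAX_GRID_DIM):
    """Return (grid_x, grid_y) so that grid_x * grid_y blocks cover all samples."""
    num_blocks = ceil_div(num_samples, block_size)
    grid_x = min(num_blocks, row_size)       # blocks per row
    grid_y = ceil_div(num_blocks, row_size)  # number of rows
    if grid_x > MAX_GRID_DIM or grid_y > MAX_GRID_DIM:
        raise ValueError("problem size exceeds the 2D grid limit")
    return grid_x, grid_y

def linear_block_id(block_x, block_y, grid_x):
    """Flatten a 2D block coordinate back into a 1D block index."""
    return block_y * grid_x + block_x

print(MAX_GRID_DIM * 128)            # 8388480: largest 1D problem size at block size 128
print(grid_dims(268_431_360, 128))   # (65535, 32): needs 32 rows of blocks
```

With block size 128, a 1D layout tops out at 8,388,480 samples, while a 2D layout covers 268,431,360 samples using only 32 rows.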
[Diagram: the array of samples is split into rows (row0, row1, ...) of a fixed row size.]
Figure 3. Transformation from 1D blocks to 2D blocks.
(* 1D block representation *)
SET num_blocks to num_samples/block_size
SET block.x to num_blocks
SET block.y to 1
(* 2D block representation *)
SET num_blocks to num_samples/block_size
SET num_rows to num_blocks/row_size
SET block.x to min(num_blocks, row_size)
SET block.y to num_rows
(* # CUDA blocks = min(block.x, 65535) x min(block.y, 65535) *)
Figure 4. Implementation of 2D blocks.
Figure 4 shows our implementation, which requires one parameter, the size of a row. This parameter can be tuned to fit certain problem sizes. If a larger row size is used, a lot of computation might be wasted in the last row. On the other hand, if smaller rows are used, there is much less opportunity for wasted computation; however, as the number of rows grows, it might hit the limit of 65535. Future work can provide a more thorough analysis of this trade-off.
3. Performance
3.1) Loop unrolling
We apply loop unrolling to our parallel reduction code. The major advantage of loop unrolling is that there is no need to check the loop condition on each iteration. We unroll the last six iterations, since at that point the number of active threads is guaranteed to be no more than the warp size. By doing this, there is an extra benefit, as we can remove unnecessary and expensive synchronization instructions: threads within the same warp do not require a synchronization point because they always execute the same instruction. The idea is shown in Figure 5.
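The effect of unrolling the last six iterations can be simulated sequentially. The sketch below is not the kernel itself: block_reduce and warp_step are hypothetical names, and the snapshot copy is an assumption used to imitate a warp whose 32 lanes all read before any of them writes.

```python
WARP_SIZE = 32

def warp_step(s_data, stride):
    """One unrolled step: the 32 lanes of a warp read together, then write."""
    snapshot = s_data[:]
    for tid in range(WARP_SIZE):
        s_data[tid] = snapshot[tid] + snapshot[tid + stride]

def block_reduce(values):
    """Sum one block of values whose length is a power of two, at least 64."""
    s_data = list(values)
    n = len(s_data)
    if n < 2 * WARP_SIZE or n & (n - 1):
        raise ValueError("block size must be a power of two >= 64")
    # Looped part: strides n/2 down to 64; a GPU needs a barrier per level here.
    s = n // 2
    while s > WARP_SIZE:
        for tid in range(s):
            s_data[tid] += s_data[tid + s]
        s //= 2
    # Unrolled part: six explicit steps, no loop condition and no barrier.
    warp_step(s_data, 32)
    warp_step(s_data, 16)
    warp_step(s_data, 8)
    warp_step(s_data, 4)
    warp_step(s_data, 2)
    warp_step(s_data, 1)
    return s_data[0]

print(block_reduce(range(1, 129)))  # 8256, the sum of 1..128
```

Lanes 16 to 31 compute values that are never used after the stride-16 step; this costs nothing on a warp that runs in lockstep and keeps the six statements free of extra conditions.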
(* parallel reduction in the reduce kernel *)
FOR s from num_samples/2 to 64 having s/=2
    Sync threads (* make sure that all threads are working on the same level of the tree *)
    IF threadId is less than s THEN
        Add s_data[threadId + s] to s_data[threadId]
    END IF
END FOR
Sync threads (* results of the last loop iteration must be visible to the first warp *)
(* loop unrolling *)
IF threadId is less than 32 THEN (* CUDA warp size is 32 *)
    Add s_data[threadId + 32] to s_data[threadId]
    Add s_data[threadId + 16] to s_data[threadId]
    Add s_data[threadId + 8] to s_data[threadId]
    Add s_data[threadId + 4] to s_data[threadId]
    Add s_data[threadId + 2] to s_data[threadId]
    Add s_data[threadId + 1] to s_data[threadId]
END IF
Figure 5. Loop unrolling in parallel reduction.
3.2) Enhancing the compact kernel
The compact kernel gathers the reduced values from all CUDA blocks and forms a new array, which is sent to the reduce kernel again until only one block is left. In our previous work, to keep the programming simple, we used only a single thread per block and let each block do the compact job, which may not fully utilize the CUDA resources. Although this is not the core code, tuning this part also yields a performance improvement. Figure 6 shows our modification.
(* Original version *)
kernel_reduce ()
(* Modified version *)
kernel_reduce ()
Figure 6. Enhancing the compact kernel.
This introduces another parameter, the number of threads for the compact kernel. We adjust the number of threads for this kernel according to the problem size. For example, we use 128 threads if the sample size is less than 8388480, 512 threads if the sample size is larger than 16776960, and 256 threads if the size is in between.
4. EXPERIMENTS AND RESULTS
1. Platforms
We use an NVIDIA GeForce GTX 580 as our GPU platform. On the CPU side, we use an Intel Core i7. The detailed specifications are shown in Table 1.
Table 1. Specification of CPUs and GPUs.
Description              CPU             GPU
Model                    Intel Core i7   NVIDIA GeForce GTX 580
Clock frequency (GHz)    2.8             1.56
# processors             2               16
# cores per processor    4               32
# total cores            8               512
2. Datasets
Cavendish's data [8] are used in our experiments. The data represent the specific density of the earth. The 23 measurements are: 5.36, 5.29, 5.58, 5.65, 5.57, 5.53, 5.62, 5.29, 5.44, 5.34, 5.79, 5.10, 5.27, 5.39, 5.42, 5.47, 5.63, 5.34, 5.46, 5.30, 5.78, 5.68 and 5.85. According to [9], our corresponding model is a normal distribution $N(\cdot \mid \theta, 0.04)$. The prior is chosen to be normal with mean = 5 and variance = 0.5.
3. Results
Our computed posterior expectation is 5.483, which is similar to the result from [9]. We find that the computed probability falls within the 95% region (0.75 < 0.95). Thus, the hypothesis is accepted.
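The reported expectation can be cross-checked on the CPU with a small sketch of the ratio in (3). Several things here are assumptions rather than the paper's code: the function names, the reading of 0.04 and 0.5 as variances, the use of the standard closed form for a normal prior with a normal likelihood of known variance as a reference value, and the interpretation of the hypothesis test as the prior cumulative probability at the posterior expectation (one reading that is consistent with the reported 0.75).

```python
import math
import random

# Cavendish's 23 measurements, as listed in the Datasets section.
DATA = [5.36, 5.29, 5.58, 5.65, 5.57, 5.53, 5.62, 5.29, 5.44, 5.34,
        5.79, 5.10, 5.27, 5.39, 5.42, 5.47, 5.63, 5.34, 5.46, 5.30,
        5.78, 5.68, 5.85]
PRIOR_MEAN, PRIOR_VAR = 5.0, 0.5   # prior N(5, 0.5), second value read as a variance
LIK_VAR = 0.04                     # model N(. | theta, 0.04)

def log_likelihood(theta):
    """Log-likelihood of DATA up to an additive constant (it cancels in the ratio)."""
    return -sum((d - theta) ** 2 for d in DATA) / (2.0 * LIK_VAR)

def posterior_mean_mci(n, seed=0):
    """Estimate E[theta | D] as sum(theta * lhd) / sum(lhd) over prior samples."""
    rng = random.Random(seed)
    samples = [rng.gauss(PRIOR_MEAN, math.sqrt(PRIOR_VAR)) for _ in range(n)]
    logs = [log_likelihood(t) for t in samples]
    shift = max(logs)                              # avoids underflow; cancels in the ratio
    weights = [math.exp(v - shift) for v in logs]
    numerator = sum(t * w for t, w in zip(samples, weights))
    denominator = sum(weights)
    return numerator / denominator

def posterior_mean_exact():
    """Closed-form posterior mean for a normal prior and normal likelihood."""
    prior_precision = 1.0 / PRIOR_VAR
    data_precision = len(DATA) / LIK_VAR
    data_mean = sum(DATA) / len(DATA)
    return ((prior_precision * PRIOR_MEAN + data_precision * data_mean)
            / (prior_precision + data_precision))

def prior_cdf(x):
    """Cumulative probability of the prior at x."""
    z = (x - PRIOR_MEAN) / math.sqrt(PRIOR_VAR)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(posterior_mean_exact())               # close to 5.483
print(posterior_mean_mci(50_000))           # MCI estimate of the same quantity
print(prior_cdf(posterior_mean_exact()))    # close to 0.75
```

The two sums inside posterior_mean_mci correspond to the numerator and denominator reductions of Figure 2; they are the parts that the GPU version computes in parallel.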
Figure 7. Results from our Bayesian application.
Figure 7 shows example results from our application. In addition to the answer, the running time is provided for both the CPU and GPU versions, broken down into the part that calculates the posterior expectation and the part that performs hypothesis testing. In terms of performance, we expect an improvement for the posterior calculation, since this part is parallelized with CUDA. On the other hand, we expect similar running times for the hypothesis testing, since the GPU is not involved in this part.
3.1) Running Time
According to the experiment, our GPU program takes less time than the CPU implementation of our Bayesian application. The logarithmic chart below illustrates the comparison.
[Chart: running time in seconds (logarithmic axis) versus problem size (0 to 300,000,000); series: CPU, GPU.]
Figure 8. Running time of CPU and GPU (the whole application).
Because there are two main parts in the application, 1) posterior expectation calculation and 2) hypothesis testing, we proceed by giving the running time of each part. The part that calculates the expectation is shown in Figure 9.
[Chart: running time in seconds (logarithmic axis) versus problem size (0 to 300,000,000); series: CPU, GPU.]
Figure 9. Running time of CPU and GPU (for posterior expectation).
The two charts, Figures 8 and 9, reveal a similar trend: the GPU implementation is faster than the CPU. For the hypothesis testing, this portion of the code has no GPU involvement, so there is no difference in timing between the CPU and GPU versions. However, it is still useful to see how this part scales with problem size. According to Figure 10, the running time of the testing part scales linearly.
[Chart: running time in seconds versus problem size (0 to 300,000,000).]
Figure 10. Running time of the portion of testing the hypothesis.
Next, we return to the posterior expectation calculation. It is interesting to see how each optimization strategy performs on the GPU side. Figure 11 therefore details the running time of the GPU programs with different optimizations.
[Chart: running time in seconds versus problem size (0 to 300,000,000); series: 1) No extra optimization, 2) Enhance the compacting kernel, 3) Loop unrolling, 4) Optimization (2)+(3).]
Figure 11. Effect of optimization methods in GPU programs.
However, the chart shows that there is not much difference in running time between the methods. We believe this is caused by the evaluation of a complex function, the likelihood function, on the GPU side in the parallel reduction step of MCI. Although many threads evaluate the function in parallel, the time a single thread spends on this evaluation dominates the whole reduction. Because optimization techniques such as enhancing the compact kernel and loop unrolling focus on the core reduction part, their improvement is very small compared to the time used by the function evaluation. Therefore, there is not much difference between the optimization techniques.
3.2) Scalability
Table 2 shows the results after solving the scalability problem. Notice that every block size, even a block size of 128, can be used with all problem sizes, which was not possible in our previous work.
Table 2. Running time of GPU programs with different block sizes.
                 Running time (seconds) by block size
Problem Size        128      256      512     1024     2048     4096
65,535            0.011    0.011    0.011    0.011    0.011    0.011
131,070           0.021    0.021    0.021    0.021    0.021    0.021
262,140           0.041    0.041    0.040    0.047    0.040    0.040
524,280           0.080    0.080    0.081    0.080    0.080    0.080
1,048,560         0.159    0.159    0.165    0.166    0.159    0.158
2,097,120         0.317    0.316    0.316    0.316    0.316    0.316
4,194,240         0.631    0.638    0.652    0.631    0.638    0.631
8,388,480         1.261    1.261    1.264    1.261    1.261    1.262
16,776,960        2.523    2.529    2.522    2.522    2.525    2.524
33,553,920        5.076    5.042    5.117    5.042    5.041    5.045
67,107,840       10.368   10.087   10.082   10.082   10.085   10.084
134,215,680      20.516   20.502   20.786   20.161   20.155   20.516
268,431,360      40.332   40.311   41.300   40.313   40.316   40.329
Table 2 shows no significant difference in the running time of the GPU programs across block sizes. Again, this is because most of the time is spent outside the core parallel reduction code, so the effect of different block sizes cannot be seen.
3.3) Speed-up
We calculate the speed-up of the GPU programs for both the whole program and the
portion of posterior calculation. The speed-ups are shown in Table 3.
Table 3. Speed-ups of GPU programs.
Problem Size     Whole Application    Posterior Expectation
65,535           49.97                84.25
131,070          50.17                88.58
262,140          52.27                91.59
524,280          52.83                93.48
1,048,560        53.22                94.61
2,097,120        53.29                94.95
4,194,240        53.37                95.21
8,388,480        53.45                95.41
16,776,960       53.44                95.52
33,553,920       53.48                95.54
67,107,840       53.49                95.60
134,215,680      53.49                95.56
268,431,360      52.21                95.58
The maximum speed-up obtained for the whole application is 53.49 times the sequential code. For the portion that calculates the posterior expectation alone, the maximum speed-up is 95.60.
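The gap between the two speed-ups can be related through Amdahl's law. The sketch below is a rough consistency check, not a measurement: it assumes the application consists only of the posterior calculation and the hypothesis test, and that the test runs equally fast in both versions; the function name accelerated_fraction is hypothetical.

```python
def accelerated_fraction(total_speedup, part_speedup):
    """Fraction p of the sequential running time spent in the accelerated part,
    solved from Amdahl's law: 1 / S_total = (1 - p) + p / S_part."""
    return (1.0 - 1.0 / total_speedup) / (1.0 - 1.0 / part_speedup)

# Speed-ups reported for problem size 67,107,840.
p = accelerated_fraction(53.49, 95.60)
print(p)  # roughly 0.99
```

Under these assumptions, about 99% of the sequential running time lies in the posterior calculation, and the remaining hypothesis test, which is not accelerated, is what caps the overall speed-up below that of the posterior portion.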
5. CONCLUSION
We illustrate a real-world application of Bayesian probability for hypothesis testing. The expectation of the posterior is required for the hypothesis test, and the implementation shows that our method finds this expectation accurately. We also present an enhancement to our previous work by further optimizing our CUDA programs and handling the scalability issue. Our results show that our parallel programs perform better than the CPU program, as they take much less time to execute. In our experiments, we show that large problem sizes can still be handled with small block sizes, which is essential because it enlarges the solution space. The maximum speed-up identified in our experiments is 53.49 times the sequential code. Future work will focus on a full GPU implementation that generates random numbers on the GPU, and on the cost of evaluating the function in the parallel reduction step, so that the effect of the optimizations and of the block size can be seen.
REFERENCES
1. Bayes, T., and Price, R., "An Essay towards solving a Problem in the Doctrine of Chance. By the late Rev. Mr. Bayes, communicated by Mr. Price, in a letter to John Canton, M. A. and F. R. S.", Philosophical Transactions of the Royal Society of London 53, 1763, 370-418.
2. Ross, S., "2.4 Expectation of a random variable", Introduction to probability models (9th ed.), Academic Press, 2007, p. 38.
3. Tierney, L., and Kadane, J., "Accurate Approximations for Posterior Moments and Marginal Densities", Journal of the American Statistical Association, 1986, 81, 82-86.
4. Caflisch, R., "Monte Carlo and quasi-Monte Carlo methods", Acta Numerica vol. 7, Cambridge University Press, 1998, pp. 1-49.
5. Harris, M., "Mapping computational concepts to GPUs", in: M. Pharr (ed.), GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation, chap. 31, Addison-Wesley, 2005, pp. 493-508.
6. NVIDIA CUDA C Programming Guide Version 3.2, 2010.
7. Pechprasarn, T. and Khiripet, N., "Accelerating Bayesian Computation with Parallel Reduction using CUDA", The 4th Mahasarakham International Workshop on Artificial Intelligence (MIWAI), 2010, pp. 40-45.
8. Cavendish, H., "Experiments to Determine the Density of the Earth", in: MacKenzie, A. S., Scientific Memoirs Vol. 9: The Laws of Gravitation, American Book Co., 1900, pp. 59-105.
9. Piche, R., "Normal Data", note 2 of the Bayesian statistics course, Tampere University of Technology, 2009.