Exploiting the Graphics Hardware to solve two compute intensive problems
Sheetal Lahabar and P. J. Narayanan, Center for Visual Information Technology, IIIT - Hyderabad


Page 1: Exploiting the Graphics Hardware to solve two compute intensive problems

Exploiting the Graphics Hardware to solve two compute intensive problems

Sheetal Lahabar and P. J. Narayanan
Center for Visual Information Technology, IIIT - Hyderabad

Page 2: Exploiting the Graphics Hardware to solve two compute intensive problems

General-Purpose Computation on GPUs

Why GPGPU?
Computational power: Pentium 4: 12 GFLOPS, GTX 280: 1 TFLOPS
High performance growth, faster than Moore's law: CPU ~1.4x, GPU ~1.7x-2.3x per year
Disparity in performance: the CPU spends transistors on caches and branch prediction, the GPU on arithmetic intensity
Flexible and precise: programmability, high-level language support
Economics: driven by the gaming market

Page 3: Exploiting the Graphics Hardware to solve two compute intensive problems

The Problem: Difficult to use
GPUs are designed for and driven by graphics
  The programming model is unusual and tied to graphics
  The environment is tightly constrained
Underlying architectures are
  Inherently parallel
  Rapidly evolving
  Largely secret
Can't simply "port" code written for the CPU!

Page 4: Exploiting the Graphics Hardware to solve two compute intensive problems

Mapping Computations to the GPU

Data-parallel processing
GPU architecture is ALU-heavy
Performance depends on arithmetic intensity = computation / bandwidth ratio
Hide memory latency with more computation

Page 5: Exploiting the Graphics Hardware to solve two compute intensive problems

GPU architecture

Page 6: Exploiting the Graphics Hardware to solve two compute intensive problems

Singular Value Decomposition on the GPU using CUDA

Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS 09), 25-29 May 2009, Rome, Italy

Page 7: Exploiting the Graphics Hardware to solve two compute intensive problems

Problem Statement

SVD of a matrix A (m x n), m > n: A = U Σ V^T
U and V are orthogonal and Σ is a diagonal matrix

Page 8: Exploiting the Graphics Hardware to solve two compute intensive problems

Motivation

SVD has many applications in image processing, pattern recognition, etc.
High computational complexity
GPUs have high computing power: teraflop performance
Exploit the GPU for high performance

Page 9: Exploiting the Graphics Hardware to solve two compute intensive problems

Related Work
Ma et al. implemented two-sided rotation Jacobi on a 2-million-gate FPGA (2006)
Yamamoto et al. proposed a method on the CSX600 (2007); only for large rectangular matrices
Bobda et al. proposed an implementation on a distributed reconfigurable system (2001)
Zhang Shu et al. implemented one-sided Jacobi; works for small matrices
Bondhugula et al. proposed a hybrid implementation on the GPU using frame buffer objects

Page 10: Exploiting the Graphics Hardware to solve two compute intensive problems

Methods

SVD algorithms:
  Golub-Reinsch (bidiagonalization and diagonalization)
  Hestenes algorithm (Jacobi)
Golub-Reinsch method: simple and compact, maps well to the GPU, popular in numerical libraries

Page 11: Exploiting the Graphics Hardware to solve two compute intensive problems

Golub-Reinsch algorithm

Bidiagonalization: series of Householder transformations
Diagonalization: implicitly shifted QR iterations

Page 12: Exploiting the Graphics Hardware to solve two compute intensive problems

SVD

Overall algorithm:
  B ← Q^T A P      (bidiagonalization of A to B)
  Σ ← X^T B Y      (diagonalization of B to Σ)
  U ← QX, V^T ← (PY)^T   (compute the orthogonal matrices U and V^T)
Complexity: O(mn^2) for m > n

Page 13: Exploiting the Graphics Hardware to solve two compute intensive problems

Bidiagonalization

[Figure: simple bidiagonalization, showing Q^T, A and P; Q^T and P start as identity matrices]

Simple bidiagonalization, i-th update:
  A(i+1:m, i+1:n) = A(i+1:m, i+1:n) - u_i f(u_i, v_i) - f(v_i) v_i
  Q^T(i:m, 1:m) = Q^T(i:m, 1:m) - f(Q, u_i) u_i
  P(1:n, i:n) = P(1:n, i:n) - f(P, v_i) v_i

Page 14: Exploiting the Graphics Hardware to solve two compute intensive problems

Contd…
Many reads and writes
Use block updates: divide the matrix into n/L blocks
Eliminate L rows and columns at once; n/L block transformations

Page 15: Exploiting the Graphics Hardware to solve two compute intensive problems

Contd…
[Figure: a block transformation with L = 3, showing Q^T, A and P]

The i-th block transformation updates the trailing submatrices
  A(iL+1:m, iL+1:n), Q(1:m, iL+1:m) and P^T(iL+1:n, 1:n)
Update using BLAS operations
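
A minimal sketch of how one such trailing-matrix update could be issued as a single Level 3 BLAS call with CUBLAS (single precision, column-major), assuming the L accumulated vectors have been packed into device panels; the names and the particular rank-L form shown are illustrative, not the paper's code.

    #include <cublas_v2.h>

    // Sketch: one rank-L trailing-matrix update, A_trail <- A_trail - U_blk * W_blk,
    // issued as a single Level 3 BLAS call. All pointers are device pointers to the
    // top-left corner of the sub-blocks (column-major, as CUBLAS expects):
    // A_trail is rows x cols, U_blk is rows x L, W_blk is L x cols.
    void trailing_block_update(cublasHandle_t handle,
                               float *A_trail, int lda,
                               const float *U_blk, int ldu,
                               const float *W_blk, int ldw,
                               int rows, int cols, int L)
    {
        const float alpha = -1.0f;   // subtract the accumulated panel product
        const float beta  =  1.0f;   // keep the existing trailing block
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    rows, cols, L,
                    &alpha, U_blk, ldu,
                            W_blk, ldw,
                    &beta,  A_trail, lda);
    }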

Page 16: Exploiting the Graphics Hardware to solve two compute intensive problems

Contd…
Final bidiagonal matrix B = Q^T A P
Store the L u_i's and v_i's; additional space complexity O(mL)
Partial bidiagonalization only computes B

Page 17: Exploiting the Graphics Hardware to solve two compute intensive problems

Challenges

Iterative algorithm
Repeated data transfer
High precision requirements
Irregular data access
Matrix size affects performance

Page 18: Exploiting the Graphics Hardware to solve two compute intensive problems

Bidiagonalization on the GPU
Block updates require Level 3 BLAS
CUBLAS functions used, single precision; high performance even for smaller dimensions
Matrix dimensions are multiples of 32
Operations on data local to the GPU; expensive GPU-CPU transfers avoided

Page 19: Exploiting the Graphics Hardware to solve two compute intensive problems

Contd…

In-place bidiagonalization
Efficient GPU implementation
Bidiagonal matrix copied to the CPU

Page 20: Exploiting the Graphics Hardware to solve two compute intensive problems

Diagonalization

Implicitly shifted QR algorithm

[Figure: the bidiagonal matrix B with working indexes k1 and k2 across iterations 1 and 2, and the accumulating matrices X and Y^T, which start as identity matrices]

Page 21: Exploiting the Graphics Hardware to solve two compute intensive problems

Diagonalization
Apply the implicitly shifted QR algorithm
In every iteration, until convergence:
  Find matrix indexes k1 and k2
  Apply Givens rotations on B; store coefficient vectors (C1, S1) and (C2, S2) of length k2-k1
  Transform k2-k1+1 rows of Y^T using (C1, S1)
  Transform k2-k1+1 columns of X using (C2, S2)

Page 22: Exploiting the Graphics Hardware to solve two compute intensive problems

Contd…
Forward transformation on Y^T, using the coefficient vectors C1 and S1:

  for (j = k1; j < k2; j++)
      Y^T(j,   1:n) = f(Y^T(j, 1:n), Y^T(j+1, 1:n), C1(j-k1+1), S1(j-k1+1))
      Y^T(j+1, 1:n) = g(Y^T(j, 1:n), Y^T(j+1, 1:n), C1(j-k1+1), S1(j-k1+1))

[Figure: successive row pairs of Y^T updated for j = 0, 1, 2]

Page 23: Exploiting the Graphics Hardware to solve two compute intensive problems

Diagonalization on GPU

Hybrid algorithm
Givens rotations modify B on the CPU
Transfer coefficient vectors to the GPU
Row transformations: transform k2-k1+1 rows of Y^T and X^T on the GPU

Page 24: Exploiting the Graphics Hardware to solve two compute intensive problems

Contd…
A row element depends on the next or previous row element
A row is divided into blocks

[Figure: an m x n matrix with rows k1 to k2 split into blocks B1 … Bk … Bn of width blockDim.x; thread indexes tx, ty = 0]

Page 25: Exploiting the Graphics Hardware to solve two compute intensive problems

Contd…

Kernel modifies k2-k1+1 rows
Kernel loops over k2-k1 rows; two rows kept in shared memory
Requires k2-k1+1 coefficient vectors; coefficient vectors copied to shared memory
Efficient division of rows; each thread works independently
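
A minimal sketch of the row-transformation idea, assuming Y^T is stored row-major in Yt and (C1, S1) are the rotation coefficients already copied to the device; one thread owns one column and applies the chained rotations down the rows. The shared-memory staging of rows and coefficients described above is omitted, so this illustrates the data flow rather than reproducing the paper's kernel.

    // Sketch: apply the chain of rotations to rows k1..k2 of Yt (row-major, n columns).
    // Each thread owns one column; rotations are applied sequentially down the rows,
    // which corresponds to the f/g updates in the pseudocode above.
    __global__ void rotate_rows(float *Yt, int n,
                                const float *C1, const float *S1,
                                int k1, int k2)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (col >= n) return;

        float top = Yt[(size_t)k1 * n + col];                // element of row j
        for (int j = k1; j < k2; ++j) {
            float bot = Yt[(size_t)(j + 1) * n + col];       // element of row j+1
            float c = C1[j - k1];
            float s = S1[j - k1];
            Yt[(size_t)j * n + col] =  c * top + s * bot;    // f(...): new row j
            top = -s * top + c * bot;                        // g(...): carried to next pair
        }
        Yt[(size_t)k2 * n + col] = top;                      // last rotated row
    }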

Page 26: Exploiting the Graphics Hardware to solve two compute intensive problems

Orthogonal matrices

CUBLAS matrix multiplication for U and V^T
Good performance even for small matrices
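
For instance, forming U = QX is a single CUBLAS call (a sketch, assuming column-major device arrays d_Q of size m x k, d_X of size k x n and d_U of size m x n; names are illustrative):

    #include <cublas_v2.h>

    // Sketch: U = Q * X in one CUBLAS call (single precision, column-major device arrays),
    // assuming Q is m x k, X is k x n and U is m x n.
    void form_orthogonal_matrix(cublasHandle_t handle,
                                const float *d_Q, const float *d_X, float *d_U,
                                int m, int n, int k)
    {
        const float one = 1.0f, zero = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, k,
                    &one,  d_Q, m,
                           d_X, k,
                    &zero, d_U, m);
    }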

Page 27: Exploiting the Graphics Hardware to solve two compute intensive problems

Results
Intel 2.66 GHz Dual Core CPU used
Speedup on NVIDIA GTX 280:
  3-8 over MKL LAPACK
  3-60 over MATLAB

Page 28: Exploiting the Graphics Hardware to solve two compute intensive problems

Contd…
CPU outperforms for smaller matrices
Speedup increases with matrix size

Page 29: Exploiting the Graphics Hardware to solve two compute intensive problems

Contd…
SVD timing for rectangular matrices (m = 8K)
Speedup increases with the varying dimension

Page 30: Exploiting the Graphics Hardware to solve two compute intensive problems

Contd…
SVD of up to 14K x 14K on Tesla S1070 takes 76 minutes on the GPU
10K x 10K SVD takes 4.5 hours on the CPU, 25.6 minutes on the GPU

Page 31: Exploiting the Graphics Hardware to solve two compute intensive problems

Contd…

Yamamoto achieved a speedup of 4 on the CSX600, and only for very large matrices
Bobda et al. report the time for a 10^6 x 10^6 matrix, which takes 17 hours
Bondhugula et al. report only the partial bidiagonalization time

Page 32: Exploiting the Graphics Hardware to solve two compute intensive problems

Timing for Partial Bidiagonalization
Speedup: 1.5-16.5 over Intel MKL
CPU outperforms for small matrices
Timing comparable to Bondhugula et al., e.g. 11 secs on GTX 280 compared to 19 secs on a 7900

Time in secs:
SIZE        Bidiag. (GTX 280)   Partial Bidiag. (GTX 280)   Partial Bidiag. (Intel MKL)
512 x 512   0.57                0.37                        0.14
1K x 1K     2.40                1.06                        3.81
2K x 2K     14.40               4.60                        47.9
4K x 4K     92.70               21.8                        361.8

Page 33: Exploiting the Graphics Hardware to solve two compute intensive problems

Timing for Diagonalization
Speedup: 1.5-18 over Intel MKL
Maximum occupancy: 83%
Data coalescing achieved
Performance increases with matrix size; performs well even for small matrices

Time in secs:
SIZE        Diag. (GTX 280)   Diag. (Intel MKL)
512 x 512   0.38              0.54
2K x 2K     5.14              49.1
4K x 4K     20                354
8K x 2K     8.2               100

Page 34: Exploiting the Graphics Hardware to solve two compute intensive problems

Limitations
Limited double precision support; high performance penalty
Discrepancy due to reduced precision (example: m = 3K, n = 3K)

Page 35: Exploiting the Graphics Hardware to solve two compute intensive problems

Contd…

Max singular value discrepancy = 0.013%
Average discrepancy < 0.00005%
Average discrepancy < 0.001% for U and V^T
Limited by device memory

Page 36: Exploiting the Graphics Hardware to solve two compute intensive problems

SVD on GPU using CUDA - Summary
SVD algorithm on the GPU exploits the GPU's parallelism; high performance achieved
Bidiagonalization using CUBLAS
Hybrid algorithm for diagonalization
Error due to low precision < 0.001%
SVD of very large matrices

Page 37: Exploiting the Graphics Hardware to solve two compute intensive problems

Ray Tracing Parametric Patches on GPU

Page 38: Exploiting the Graphics Hardware to solve two compute intensive problems

Problem Statement

Directly ray trace parametric patches
Exact point of intersection
High visual quality images, fewer artifacts
Fast preprocessing
Low memory requirement
Better rendering

Page 39: Exploiting the Graphics Hardware to solve two compute intensive problems

Motivation

Parametric patches describe 3D geometrical figures and are the foundation of most CAD systems
Ray tracing them is a computationally expensive process
Graphics Processing Units (GPUs) offer high computational power, ~1 TFLOPS
Exploit the graphics hardware

Page 40: Exploiting the Graphics Hardware to solve two compute intensive problems

Bezier patch

16 control points
Better continuity properties, compact
Difficult to render directly; usually tessellated to polygons
Patch equation: Q(u, v) = [u^3 u^2 u 1] P [v^3 v^2 v 1]^T
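
A minimal sketch of evaluating the patch equation on the device for one coordinate, assuming P is the 4 x 4 power-basis coefficient matrix from the equation above (one such matrix per x, y, z); names are illustrative.

    // Sketch: evaluate Q(u, v) = [u^3 u^2 u 1] * P * [v^3 v^2 v 1]^T for one coordinate.
    // P is a 4 x 4 power-basis coefficient matrix; call once each for x, y and z.
    __device__ double eval_patch_coord(const double P[4][4], double u, double v)
    {
        double ub[4] = { u * u * u, u * u, u, 1.0 };   // [u^3 u^2 u 1]
        double vb[4] = { v * v * v, v * v, v, 1.0 };   // [v^3 v^2 v 1]
        double q = 0.0;
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j)
                q += ub[i] * P[i][j] * vb[j];
        return q;
    }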

Page 41: Exploiting the Graphics Hardware to solve two compute intensive problems

Methods

Uniformly refine on the fly
  Expensive tests to avoid recursion
  Approximates the patch with triangles; rendering artifacts
Find the exact hit point of a ray with a patch
  High computational complexity
  Prone to numerical errors

Page 42: Exploiting the Graphics Hardware to solve two compute intensive problems

Related Work
Toth’s algorithm (1985)
  Applies multivariate Newton iteration
  Dependent on calculation of the interval extension; numerical errors
Manocha’s and Krishnan’s method (1993)
  Algebraic pruning based approaches
  Eigenvalue formulation of the problem; does not map well to the GPU
Kajiya’s method (1982)
  Finds the roots of an 18-degree polynomial
  Maps well to parallel architectures

Page 43: Exploiting the Graphics Hardware to solve two compute intensive problems

Kajiya’s algorithm

[Figure: a ray R as the intersection of planes l0 and l1 hitting patch P; v is found by intersecting the bicubic polynomials a and b, and u from gcd(a, b)]

Page 44: Exploiting the Graphics Hardware to solve two compute intensive problems

Advantages

Finds the exact point of intersection
Uses a robust root finding procedure
No memory overhead required
Requires double precision arithmetic
Able to trace secondary rays
On the downside, computationally expensive
Suitable for parallel implementation; can be implemented on the GPU

Page 45: Exploiting the Graphics Hardware to solve two compute intensive problems

Overview of ray tracing algorithm

[Pipeline flowchart]
Preprocessing: Create BVH (CPU)
Every frame: Compute plane equations (GPU) -> Traverse BVH for all pixels/rays (GPU) -> Compute 18-degree polynomials (GPU) -> Find the roots of the polynomials (GPU) -> Compute the GCD of the bicubic polynomials (GPU) -> Compute point and normal (GPU)
For all intersections: Spawn secondary rays (GPU), then accumulate shading data recursively and render

Page 46: Exploiting the Graphics Hardware to solve two compute intensive problems

Compute Plane Equations

M+N planes represent M x N rays
A thread computes one plane equation, using the frustum corner information
Device occupancy: 100%

[Figure: eye and pixel grid defining the ray planes]
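
A sketch of what such a kernel might look like, assuming each plane passes through the eye and one row (or column) of pixel positions interpolated from the frustum corners; the parameterization and all names are assumptions, not the paper's code.

    // Sketch: one thread computes one plane (a, b, c, d), a*x + b*y + c*z + d = 0,
    // through the eye and two points of the same pixel row (or column) on the image
    // plane; the row/column positions are interpolated from the frustum corners.
    __global__ void compute_planes(double4 *planes, int num_planes,
                                   double3 eye, double3 corner,
                                   double3 step,   // offset between successive rows/columns
                                   double3 span)   // extent along one row/column
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= num_planes) return;

        double3 p0 = make_double3(corner.x + i * step.x,
                                  corner.y + i * step.y,
                                  corner.z + i * step.z);
        double3 p1 = make_double3(p0.x + span.x, p0.y + span.y, p0.z + span.z);

        // Plane normal = (p0 - eye) x (p1 - eye)
        double3 e0 = make_double3(p0.x - eye.x, p0.y - eye.y, p0.z - eye.z);
        double3 e1 = make_double3(p1.x - eye.x, p1.y - eye.y, p1.z - eye.z);
        double3 nrm = make_double3(e0.y * e1.z - e0.z * e1.y,
                                   e0.z * e1.x - e0.x * e1.z,
                                   e0.x * e1.y - e0.y * e1.x);
        double d = -(nrm.x * eye.x + nrm.y * eye.y + nrm.z * eye.z);
        planes[i] = make_double4(nrm.x, nrm.y, nrm.z, d);
    }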

Page 47: Exploiting the Graphics Hardware to solve two compute intensive problems

BVH traversal on the GPU
Create the BVH, traverse it depth first
Invoke traverse, scan and rearrange kernels
Store the Num_Intersect intersection records
Device occupancy: 100%

[Figure: per-pixel intersection counts from traverse, a prefix sum over them from scan, and the rearranged (pixel_x, pixel_y, patch_ID) intersection records]
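
A minimal sketch of the scan step, assuming the traverse kernel leaves a per-pixel intersection count in d_count; the exclusive prefix sum gives the offset at which the rearrange kernel writes each pixel's (pixel_x, pixel_y, patch_ID) records. Thrust is used here only as a convenient scan, and all names are illustrative.

    #include <cuda_runtime.h>
    #include <thrust/device_ptr.h>
    #include <thrust/scan.h>

    // d_count[i]  : intersections found for pixel i by the traverse kernel
    // d_offset[i] : exclusive prefix sum, i.e. where pixel i's records start
    void scan_intersections(int *d_count, int *d_offset, int num_pixels,
                            int *num_intersect)
    {
        thrust::device_ptr<int> cnt(d_count), off(d_offset);
        thrust::exclusive_scan(cnt, cnt + num_pixels, off);

        // Total number of intersection records = last offset + last count
        int last_off = 0, last_cnt = 0;
        cudaMemcpy(&last_off, d_offset + num_pixels - 1, sizeof(int), cudaMemcpyDeviceToHost);
        cudaMemcpy(&last_cnt, d_count + num_pixels - 1, sizeof(int), cudaMemcpyDeviceToHost);
        *num_intersect = last_off + last_cnt;
    }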

Page 48: Exploiting the Graphics Hardware to solve two compute intensive problems

Computing the 18-degree polynomial
Intersection of a and b
32 A and B coefficients
Evaluate R = [a b c; b d e; c e f] for v in the bezout kernel
  grid = Num_Intersect/16, threads = 21*16
6 degree-6, 6 degree-12 and 3 degree-18 polynomials

[Figure: per-block thread layout, 16 intersections x 21 threads; 21*16, 13*16 and 19*21 threads active in the successive stages]
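
A minimal, sequential sketch of the arithmetic the bezout kernel has to perform per intersection: expanding the determinant of R = [a b c; b d e; c e f], whose entries are degree-6 polynomials in v, into the degree-18 polynomial (6 degree-6 entries, 6 degree-12 products, 3 degree-18 products). The 21-threads-per-intersection work division of the actual kernel is omitted, and all names are illustrative.

    // Polynomial coefficients are stored lowest degree first; a degree-6 polynomial
    // has 7 coefficients. poly_mul writes a (np+nq-1)-coefficient product into r.
    __device__ void poly_mul(const double *p, int np, const double *q, int nq, double *r)
    {
        for (int i = 0; i < np + nq - 1; ++i) r[i] = 0.0;
        for (int i = 0; i < np; ++i)
            for (int j = 0; j < nq; ++j)
                r[i + j] += p[i] * q[j];
    }

    // det(R) = a*(d*f - e*e) - b*(b*f - c*e) + c*(b*e - c*d); the entries a..f are
    // degree-6 in v, the cofactors degree-12, and the result the degree-18 polynomial.
    __device__ void bezout_det(const double a[7], const double b[7], const double c[7],
                               const double d[7], const double e[7], const double f[7],
                               double out[19])
    {
        double t1[13], t2[13], cof[13], term[19];
        for (int i = 0; i < 19; ++i) out[i] = 0.0;

        poly_mul(d, 7, f, 7, t1);                       // a * (d*f - e*e)
        poly_mul(e, 7, e, 7, t2);
        for (int i = 0; i < 13; ++i) cof[i] = t1[i] - t2[i];
        poly_mul(a, 7, cof, 13, term);
        for (int i = 0; i < 19; ++i) out[i] += term[i];

        poly_mul(b, 7, f, 7, t1);                       // - b * (b*f - c*e)
        poly_mul(c, 7, e, 7, t2);
        for (int i = 0; i < 13; ++i) cof[i] = t1[i] - t2[i];
        poly_mul(b, 7, cof, 13, term);
        for (int i = 0; i < 19; ++i) out[i] -= term[i];

        poly_mul(b, 7, e, 7, t1);                       // + c * (b*e - c*d)
        poly_mul(c, 7, d, 7, t2);
        for (int i = 0; i < 13; ++i) cof[i] = t1[i] - t2[i];
        poly_mul(c, 7, cof, 13, term);
        for (int i = 0; i < 19; ++i) out[i] += term[i];
    }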

Page 49: Exploiting the Graphics Hardware to solve two compute intensive problems

Contd…

The configuration uses resources well
Avoids uncoalesced reads and writes: row-major layout
Reduced divergence
Device occupancy: 69%; performance limited by registers

Page 50: Exploiting the Graphics Hardware to solve two compute intensive problems

Finding the polynomial roots
18 roots found using Laguerre's method
  Guarantees convergence; iterative and cubically convergent
A thread evaluates one intersection
  grid = Num_Intersect/64, threads = 64
Kernels invoked from the CPU:

  while (i < 18)
      call <laguerre> kernel, finds the i-th root x_i
      call <deflate> kernel, deflates the polynomial by x_i
  end

Iteration update: x_i = x_i - g(p(x), p'(x))
Each invocation finds a root per intersection in the block
Store the real v count in d_countv
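
A minimal host-side sketch of that loop, with assumed kernel names and arguments derived from the slide (<laguerre> finds one root per intersection, <deflate> divides it out); only the launch shape grid = Num_Intersect/64, threads = 64 comes from the slide, everything else is illustrative.

    #include <cuda_runtime.h>

    // Assumed (illustrative) signatures for the two split kernels.
    __global__ void laguerre_kernel(double *poly, double *roots, int i, int num_intersect);
    __global__ void deflate_kernel(double *poly, double *roots, int i, int num_intersect);

    // Sketch: host loop driving the per-intersection root finding on the GPU.
    // d_poly holds the (progressively deflated) polynomial coefficients per intersection,
    // d_roots the roots found so far; the real v root counts per intersection would be
    // written by the kernels into a separate d_countv array.
    void find_polynomial_roots(double *d_poly, double *d_roots, int num_intersect)
    {
        dim3 grid((num_intersect + 63) / 64), block(64);   // grid = Num_Intersect/64, 64 threads
        for (int i = 0; i < 18; ++i) {
            // One Laguerre solve per intersection: finds the i-th root x_i
            laguerre_kernel<<<grid, block>>>(d_poly, d_roots, i, num_intersect);
            // Synthetic division by (x - x_i): the next iteration sees a smaller polynomial
            deflate_kernel<<<grid, block>>>(d_poly, d_roots, i, num_intersect);
        }
        cudaDeviceSynchronize();
    }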

Page 51: Exploiting the Graphics Hardware to solve two compute intensive problems

Contd…

Splitting the kernel reduces register usage
Avoids uncoalesced reads and writes: row-major data layout
Device occupancy: laguerre kernel 25%, deflate kernel 50%
Performance limited by
  Use of double registers
  Complex arithmetic
  Shared memory: repeated transfer of polynomial coefficients

Page 52: Exploiting the Graphics Hardware to solve two compute intensive problems

Compute the GCD of the bicubic polynomials: u = GCD(a, b)
Euclidean algorithm
Real v count read from d_countv
A thread evaluates one intersection
  grid = Num_Intersect/64, threads = 64

[Figure: one thread per intersection, 64 threads per block (tx), blocks bx covering the Num_Intersect intersections]
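
A minimal sketch of a Euclidean GCD of two cubic polynomials in u, as one thread might run it for a given real root v; the tolerance handling and normalization are simplified and all names are illustrative.

    // Sketch: GCD of two polynomials in u (coefficients lowest degree first) by the
    // Euclidean algorithm; assumes deg(a) >= deg(b) on entry (both are cubics here).
    // The common root u of a and b is a root of the returned GCD.
    __device__ int poly_gcd(double *a, int da, double *b, int db, double *g, double eps)
    {
        while (db >= 0) {
            // Remainder r = a mod b, computed in place in the low coefficients of a
            for (int k = da; k >= db; --k) {
                double q = a[k] / b[db];
                for (int j = 0; j <= db; ++j)
                    a[k - db + j] -= q * b[j];
            }
            // Drop (numerically) zero leading coefficients of the remainder
            int dr = db - 1;
            while (dr >= 0 && fabs(a[dr]) < eps) --dr;
            // (a, b) <- (b, remainder)
            for (int j = 0; j <= db; ++j) { double t = a[j]; a[j] = b[j]; b[j] = t; }
            da = db;
            db = dr;
        }
        for (int j = 0; j <= da; ++j) g[j] = a[j];   // a now holds the GCD
        return da;                                   // degree of the GCD
    }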

Page 53: Exploiting the Graphics Hardware to solve two compute intensive problems

Contd…

Update d_countu for each real (u, v) pair
Device occupancy: 25%
Performance limited by
  Double registers
  Shared memory: A and B coefficients read repeatedly

Page 54: Exploiting the Graphics Hardware to solve two compute intensive problems

Compute (x, y, z) and the normal n

Use the parametric patch equation
Real (u, v) count read from d_countu
A thread processes one intersection
  grid = Num_Intersect/64, threads = 64
Device occupancy: 25%
Performance limited by
  Double registers
  Shared memory: repeated patch data transfer
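
A minimal sketch of the per-intersection evaluation, reusing the patch form from the Bezier patch slide: the point is Q(u, v) and the normal is the cross product of the partial derivatives Q_u and Q_v. The per-coordinate matrices Px, Py, Pz and the other names are assumptions.

    // Sketch: evaluate the form b_u^T * P * b_v for one coordinate.
    __device__ double eval_form(const double P[4][4], const double bu[4], const double bv[4])
    {
        double q = 0.0;
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j)
                q += bu[i] * P[i][j] * bv[j];
        return q;
    }

    // Point = Q(u, v); normal = Q_u x Q_v, where Q_u and Q_v are the partial derivatives.
    __device__ void point_and_normal(const double Px[4][4], const double Py[4][4],
                                     const double Pz[4][4], double u, double v,
                                     double3 *point, double3 *normal)
    {
        double ub[4]  = { u * u * u, u * u, u, 1.0 };
        double vb[4]  = { v * v * v, v * v, v, 1.0 };
        double dub[4] = { 3.0 * u * u, 2.0 * u, 1.0, 0.0 };   // d/du of [u^3 u^2 u 1]
        double dvb[4] = { 3.0 * v * v, 2.0 * v, 1.0, 0.0 };   // d/dv of [v^3 v^2 v 1]

        *point = make_double3(eval_form(Px, ub, vb),
                              eval_form(Py, ub, vb),
                              eval_form(Pz, ub, vb));

        double3 qu = make_double3(eval_form(Px, dub, vb),
                                  eval_form(Py, dub, vb),
                                  eval_form(Pz, dub, vb));
        double3 qv = make_double3(eval_form(Px, ub, dvb),
                                  eval_form(Py, ub, dvb),
                                  eval_form(Pz, ub, dvb));

        *normal = make_double3(qu.y * qv.z - qu.z * qv.y,     // unnormalized cross product
                               qu.z * qv.x - qu.x * qv.z,
                               qu.x * qv.y - qu.y * qv.x);
    }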

Page 55: Exploiting the Graphics Hardware to solve two compute intensive problems

Challenges

High computational complexity
Requires higher precision
Repeated data transfer from device memory to the kernels
Irregular data access
Robust root finding algorithm needed
Complex arithmetic
High memory requirements

Page 56: Exploiting the Graphics Hardware to solve two compute intensive problems

Optimizations

Keep computations independent (one thread per pixel); disadvantage: no coherence
Avoid unnecessary computations by using SAH (surface area heuristics) in building the BVH
Arrange data to reduce workload

Page 57: Exploiting the Graphics Hardware to solve two compute intensive problems

Secondary rays

Secondary (shadow and reflection) rays are spawned
Two orthogonal planes selected per ray
Find the real point of intersection
A shadow ray shadows its point of origin
Compute the final color recursively using the standard illumination equation

Page 58: Exploiting the Graphics Hardware to solve two compute intensive problems

Memory Requirements and Bandwidth
Memory requirements
  64 doubles: patch coefficients
  Plane equations stored at screen resolution
  Per ray-patch intersection (double precision): 480 bytes
    32 x 8 bytes: bicubic polynomials
    19 x 8 bytes: polynomial roots
    3 x 4 bytes: patch ID and pixel location
    60 bytes: additional flags
    (256 + 152 + 12 + 60 = 480 bytes)
Memory bandwidth
  Patch coefficients read repeatedly in the laguerre kernel; incurs a performance penalty

Page 59: Exploiting the Graphics Hardware to solve two compute intensive problems

Strengths

Facilitates direct ray tracing of dynamic patches
Divides the work into independent tasks
Low branch divergence and high memory access coherence
Time taken is linear in the number of intersections
No additional overhead incurred for secondary rays

Page 60: Exploiting the Graphics Hardware to solve two compute intensive problems

Contd…

Performance can be predicted from scene complexity
Can be sped up with multiple GPUs
Reducing the number of intersections boosts performance

Page 61: Exploiting the Graphics Hardware to solve two compute intensive problems

Limitations

Ray tracing performance
Memory usage
  Limits the number of intersections processed; 480 bytes per ray-patch intersection
Double precision performance
  Fewer GFLOPS
Limited shared memory
  Repeated data transfer increases memory traffic and reduces performance

Page 62: Exploiting the Graphics Hardware to solve two compute intensive problems

Contd…

Batch processing solves the memory usage problem
Newer GPUs have improved double precision performance, up to 4x
Modern GPUs have more shared memory available

Page 63: Exploiting the Graphics Hardware to solve two compute intensive problems

Results: on GTX 280 (times in seconds)

Model       Intersections  Patch/Ray  BVH traversal  Polynomial formation  Solve polynomial  GCD, (x,y,z), n  Time per frame  Avg time per intersection (microsec)
Teapot-P    54389          2.01       0.004          0.019                 0.175             0.013            0.211           3.8
Teapot-S    29626          2.32       0.003          0.012                 0.111             0.010            0.136           4.5
Teapot-R    41096          3.21       0.004          0.031                 0.143             0.011            0.189           4.6
Bigguy-P    114048         3.23       0.007          0.043                 0.352             0.015            0.417           3.6
Bigguy-S    114112         3.47       0.007          0.048                 0.350             0.015            0.420           3.7
Bigguy-R    143040         4.34       0.008          0.104                 0.480             0.022            0.614           4.3
Killeroo-P  127040         1.43       0.010          0.050                 0.390             0.016            0.466           3.7
Killeroo-S  138240         1.72       0.011          0.061                 0.420             0.016            0.508           3.7
Killeroo-R  146432         1.82       0.013          0.105                 0.446             0.022            0.586           4

Page 64: Exploiting the Graphics Hardware to solve two compute intensive problems

Kernel split timing
Finding roots: on average 82% of the time
BVH traversal takes negligible time
Constant percentage for primary and secondary rays
Device occupancy: 25-100%

[Figure: kernel time split per (model, ray type) tuple on the Y axis; Teapot (T), Bigguy (B), Killeroo (K); Primary (P), Shadow (S), Reflected (R)]

Page 65: Exploiting the Graphics Hardware to solve two compute intensive problems

Preliminary results on Fermi

Model       Intersections  Time/frame Fermi 480 (s)  Time/frame GTX 280 (s)  Avg time/intersection Fermi 480 (microsec)  Avg time/intersection GTX 280 (microsec)  Speedup
Teapot-P    54389          0.071                     0.211                   1.3                                          3.8                                        2.94
Teapot-S    29626          0.041                     0.136                   1.38                                         4.5                                        3.28
Teapot-R    41096          0.057                     0.189                   1.38                                         4.6                                        3.30
Bigguy-P    114048         0.147                     0.417                   1.28                                         3.6                                        2.83
Bigguy-S    114112         0.148                     0.420                   1.29                                         3.7                                        2.83
Bigguy-R    143040         0.190                     0.614                   1.32                                         4.3                                        3.23
Killeroo-P  127040         0.164                     0.466                   1.29                                         3.7                                        2.83
Killeroo-S  138240         0.179                     0.508                   1.29                                         3.7                                        2.83
Killeroo-R  146432         0.195                     0.586                   1.33                                         4                                          2.99

Page 66: Exploiting the Graphics Hardware to solve two compute intensive problems

Average time per intersection
  3.7 μs on GTX 280
  1.4 μs on GTX 480
No overhead incurred for secondary rays
Helps predict performance

[Figure: average time per intersection per (model, ray type) tuple on the X axis; Teapot (T), Bigguy (B), Killeroo (K); Primary (P), Shadow (S), Reflection (R)]

Page 67: Exploiting the Graphics Hardware to solve two compute intensive problems

Comparison to CPU
First direct ray tracing implementation of parametric patches on the GPU
Scales linearly with the number of intersections
Near interactive rates
Outperforms the CPU:
  340x on GTX 280
  990x on GTX 480
Promises interactivity

[Figure: speedup of GTX 280 over a MATLAB implementation on an AMD dual core processor]

Page 68: Exploiting the Graphics Hardware to solve two compute intensive problems

Teapot (32 patches) with reflection rays

Teapot (32 patches) with shadow and reflection rays

Page 69: Exploiting the Graphics Hardware to solve two compute intensive problems

Bigguy(3570 patches) with shadow rays

Killeroo(11532 patches) with shadow rays

Page 70: Exploiting the Graphics Hardware to solve two compute intensive problems

Multiple objects with shadow and reflection rays

Page 71: Exploiting the Graphics Hardware to solve two compute intensive problems

Ray tracing parametric patches on the GPU – Summary
Finds exact points of intersection
Per pixel shading using the true normal
Renders highly accurate models; quality not affected by zooming
Able to trace secondary rays
Suitable for parallel and pipelined execution
Near interactive performance; speedup over the CPU

Page 72: Exploiting the Graphics Hardware to solve two compute intensive problems

Contd…

Alternative to subdivision approaches
Suitable for multi-GPU implementation
Easily extended to other parametric models

Page 73: Exploiting the Graphics Hardware to solve two compute intensive problems

Future Work

SVD and ray tracing on multiple GPUs
Addressing larger SVD problems
Use double precision for the SVD
Adapt ray tracing to new generation architectures (Fermi)
Extend ray tracing to dynamic models

Page 74: Exploiting the Graphics Hardware to solve two compute intensive problems

Thank you