discrete wavelet transform - fau€¦ · topic of this talk: forward discrete wavelet transform the...

27
JPEG 2000 Discrete Wavelet Transform by Christof Kobylko Seminar „Multi-Core Architectures and Programming“, 17.07.2013

Upload: trandung

Post on 03-May-2018

221 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

JPEG 2000 Discrete Wavelet Transform

by Christof Kobylko

Seminar „Multi-Core Architectures and Programming“, 17.07.2013

Page 2: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

Table of Contents

● JPEG-2000

● Overview of JPEG-2000

● Idea of Discrete Wavelet Transform (DWT)

● DWT-Example

● Optimizing DWT for CPU-based computation

● Definition of DWT

● Lifting Scheme

● Further optimizations

● Benchmarks

● Optimizing DWT for GPU-based computation

● GPU-specific adaption of CPU-implementation

● Benchmarks

● Summary

● Literature

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 2

Page 3: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

Table of Contents

● JPEG-2000

● Overview of JPEG-2000

● Idea of Discrete Wavelet Transform (DWT)

● DWT-Example

● Optimizing DWT for CPU-based computation

● Definition of DWT

● Lifting Scheme

● Further optimizations

● Benchmarks

● Optimizing DWT for GPU-based computation

● GPU-specific adaption of CPU-implementation

● Benchmarks

● Summary

● Literature

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 3

Page 4: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

Overview of JPEG-2000

● Image compression standard, successor to the popular JPEG

● Codec processing chain:

● Topic of this talk: Forward Discrete Wavelet Transform

● The results of this project are compared to the JPEG-2000 reference

implementation OpenJPEG

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 4

Processing chain of the JPEG-2000 codec: a) Encoder, b) Decoder (from [2])

Page 5: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

Idea of Discrete Wavelet Transform (DWT)

● Iterative procedure at each step:

● Filter the image by 2D-lowpass and highpass filter

● Sub-sample the results by factor 2

● Decompose the image into 4 sub-bands (LL, LH, HL, HH)

Huge decrease in entropy in high-frequency bands (LH, HL, HH)

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 5

Original Image

LL1

LH1

HL1

HH1

LL2

LH1

HH1

HL1

LH2

HL2

HH2

Page 6: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

DWT-Example

Original Image:

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 6

Page 7: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

DWT-Example (2)

DWT with 2 resolution levels:

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 7

Page 8: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

DWT-Example (3)

DWT with 3 resolution levels:

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 8

Page 9: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

DWT-Example (4)

DWT with 4 resolution levels:

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 9

Page 10: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

DWT-Example (5)

DWT with 5 resolution levels:

And everything is still lossless!

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 10

Page 11: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

Table of Contents

● JPEG-2000

● Overview of JPEG-2000

● Idea of Discrete Wavelet Transform (DWT)

● DWT-Example

● Optimizing DWT for CPU-based computation

● Definition of DWT

● Lifting Scheme

● Further optimizations

● Benchmarks

● Optimizing DWT for GPU-based computation

● GPU-specific adaption of CPU-implementation

● Benchmarks

● Summary

● Literature

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 11

Page 12: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

Definition of DWT

● 2D-convolution with a lowpass and a highpass filter

● Filters are separable

Decompose filtering into a vertical and a horizontal 1D-convolution

● Coefficients for the used Cohen-Daubechies-Feauveau 9/7 wavelets:

● Filters are symmetric

Efficient implementation of the convolution possible:

11 memory transactions, 14 additions and 9 multiplications per output

pixel pair

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 12

n 0 1 2 3 4

Lowpass hL[n] 0.602949 0.266864 -0.078223 -0.016864 0.026749

Highpass hH[n] 0.557543 0.295636 -0.028772 -0.045636 0

3

1

4

1

][])12[]12[(]0[]12[][

][])2[]2[(]0[]2[][

k

HHH

k

LLL

khkniknihnino

khkniknihnino

Page 13: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

Lifting Scheme

● Technique to reduce computational complexity of convolutions with

symmetric filters (very simplified formulation)

● Approach: Rewrite convolution operation in series of “lifting steps”

● Actual performance of this scheme depends heavily on the realization

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 13

][][:][

][][:][

]12[:][

]2[:][

])1[][(][][

])[]1[(][][

])1[][(][][

])[]1[(][][

21

20

0

0

22312

11212

11101

00001

ndndno

nsnsno

nind

ninswith

ndndnsns

nsnsndnd

ndndnsns

nsnsndnd

H

L

n 0 1 2 3

αn 1.586134 0.052980 0.882911 0.443506

βn 0.812893 0.615087 / /

Page 14: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

Lifting Scheme (2)

Naïve implementation: Re-compute all intermediate values every time

● 11 memory transactions, 20 additions and 12 multiplications per

output pixel pair Very inefficient

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 14

d2[n]

s2[n]

d[n]s[n]

d2[n-1]

d1[n-2]

s1[n]s

1[n-1] s

1[n+1]

d1[n-1] d

1[n] d

1[n+1]

s0[n-1] s

0[n] s

0[n+1]d

0[n-2] d

0[n-1] d

0[n] d

0[n+1] s

0[n+2]s

0[n-2]

:= Memory

:= Intermediate result

:= Output value

Page 15: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

Lifting Scheme (3)

OpenJPEG approach: Decompose computation graph into its layers

● 20 memory transactions, 8 additions and 6 multiplications per output

pixel pair

Computation effort traded against memory load

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 15

d1[n]

s0[n] s

0[n+1]d

0[n]

s1[n]

d1[n-1] d

1[n]s

0[n]

d2[n]

s1[n] s

1[n+1]d

1[n]

d2[n]

s2[n]

d2[n-1] s

1[n]

s[n]

s2[n]

d[n]

d2[n]

:= Memory (input)

:= Memory (output)

:= Output value

Page 16: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

Lifting Scheme (4)

Loop folding: Compute only right-most branch in each iteration

● 4 memory transactions, 8 additions and 6 multiplications per output

pixel pair Minimum possible effort

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 16

d2[n]

s2[n]

d[n]s[n]

d2[n-1]

s1[n] s

1[n+1]

d1[n] d

1[n+1]

s0[n+1] d

0[n+1] s

0[n+2]

:= Memory

:= Intermediate result

:= Output value

:= Cached result

Page 17: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

Further optimizations

● OpenJPEG is storing the sub-bands pixel-wise interleaved

Another 4 memory transactions per output pixel pair for de-interleaving

● Optimized version instantaneously de-interleaves sub-bands and

transposes the images

● No overhead for de-interleaving

● Only the vertical filter must be implemented (inherent transposing)

Easy column-wise parallelization (especially for GPUs!)

Column-wise vectorization possible (4 columns at once with SSE)

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 17

Input image Output image

Lowfrequencies

Highfrequencies

Transposed, de-interleaved writing scheme

Page 18: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

Benchmarks

● 5 Different test scenarios

● OpenJPEG reference implementation [OPJ]

● Optimized implementation (with loop folding) [OPT]

● Optimized implementation with column-wise parallelization [OPT_MT]

● Vectorized optimized implementation [SSE]

● Vectorized and parallelized optimized implementation [SSE_MT]

● 3 Different test environments

● Intel Core i7-2630QM @ 2.00 GHz (4 CPUs), NVidia Quadro 1000M

● Intel Xeon E5640 @ 2.67 GHz (4 CPUs), NVidia Tesla C2050

● Intel Core i7-3930K @ 3.20 GHz (6 CPUs), NVidia Quadro 2000

● All tests are performed with 6 DWT resolution-layers

● Evaluation of raw computation time and memory throughput

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 18

Page 19: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

Benchmarks (2)

● Test data size = 1920 * 1080 * 3 pixels

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 19

0

50

100

150

200

250

300

Computation Time [ms]

0

2

4

6

8

10

12

Memory load [GByte/s]

i7-2630QM

E5640

i7-3930K

Page 20: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

Table of Contents

● JPEG-2000

● Overview of JPEG-2000

● Idea of Discrete Wavelet Transform (DWT)

● DWT-Example

● Optimizing DWT for CPU-based computation

● Definition of DWT

● Lifting Scheme

● Further optimizations

● Benchmarks

● Optimizing DWT for GPU-based computation

● GPU-specific adaption of CPU-implementation

● Benchmarks

● Summary

● Literature

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 20

Page 21: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

GPU-specific adaption of CPU-implementation

● GPU is interpreted as

● massively parallel compute architecture with

● 32 element-wide vector units (Warp-Size)

● Plain porting of the vectorized code to CUDA results in a relatively

huge computation time !

Transposed scatter-write slows down GPU memory controllers extremely

● Optimization step 1:

● Write the output values of the DWT kernel aligned to the input values

● Transpose image with separate kernel via shared memory

Great performance improvement, but transpose kernel still needs up to

80 % of the total computation time !

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 21

Page 22: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

GPU-specific adaption of CPU-implementation (2)

● Optimization step 2:

● Create separate kernel for horizontal DWT transform

● Compute 32 output pixel pairs in parallel via shared memory

● Write output values aligned back to global memory

Transpose of images becomes unnecessary

● Example with 4 output pixel pairs per iteration:

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 22

Page 23: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

Benchmarks

● 5 Different test scenarios

● Reference CPU-implementation:

Vectorized and parallelized optimized implementation [SSE_MT]

● Naïve CUDA-implementation of vectorized code [C_NAIVE]

● CUDA-implementation with shared memory transpose [C_TRANS]

● Optimized CUDA-implementation (separate kernels) [C_OPT]

● Kernel execution time for opt. CUDA-implementation [C_KERNEL]

● 3 Different test environments

● Intel Core i7-2630QM @ 2.00 GHz (4 CPUs), NVidia Quadro 1000M

● Intel Xeon E5640 @ 2.67 GHz (4 CPUs), NVidia Tesla C2050

● Intel Core i7-3930K @ 3.20 GHz (6 CPUs), NVidia Quadro 2000

● All tests are performed with 6 DWT resolution-layers

● Evaluation of raw computation time and effective PCIe-throughput

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 23

Page 24: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

Benchmarks (2)

● Test data size = 1920 * 1080 * 3 pixels

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 24

0

0,5

1

1,5

2

2,5

3

Eff. PCIe throughput [GByte/s]

i7-2630QM / Quadro 1000M

E5640 / Tesla C2050

i7-3930K / Quadro 2000

0

10

20

30

40

50

60

70

80

90

100

Computation Time [ms]

Page 25: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

Summary

● Efficiency of OpenJPEG DWT has been improved drastically through

● smarter implementation of the lifting scheme

● elimination of unnecessary memory transactions

● multi-threaded processing of the columns of the image components

● simultaneous multi-column processing via SSE

● Memory access patterns must be changed for an efficient CUDA-

implementation

● Usage of GPUs must be thought through thoroughly

● Even low-end GPUs outperform the computation power of high-end

CPUs, but

● PCIe-transfer times eat up the performance gain

Porting the whole encoder chain to CUDA would be heavily beneficial,

porting only the DWT offers almost no gain with respect to the multi-

threaded SSE implementation

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 25

Page 26: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

Literature

● [1] M. Antonini, M. Barlaud, P. Mathieu and I. Daubechies.

„Image Coding Using Wavelet Transform“. IEEE Transactions on

Image Processing, 1(2): 205-220, April 1992

● [2] M. D. Adams and R. Ward. „Wavelet Transforms in the JPEG-2000

Standard“. Dept. of Elec. and Comp. Engineering, University of

British Columbia, Vancouver, BC, Canada, V6T 1Z4

● [3] J. Li. „Image Compression: The Mathematics of JPEG 2000“.

Modern Signal Processing, MSRI Publications, Vol. 46, 2003

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 26

Page 27: Discrete Wavelet Transform - FAU€¦ · Topic of this talk: Forward Discrete Wavelet Transform The results of this project are compared to the JPEG-2000 reference ... CUDA-implementation

Questions ?

Christof Kobylko - JPEG 2000 / Discrete Wavelet Transform 27