A Simple JPEG Encoder With CUDA Technology
Dongyue Mou and Zeng Xing
cujpeg
Outline
• JPEG Algorithm
• Traditional Encoder
• What's new in cujpeg
• Benchmark
• Conclusion
JPEG Algorithm
JPEG is a commonly used method for image compression.
The JPEG encoding algorithm consists of 7 steps:
1. Divide image into 8x8 blocks
2. [R,G,B] to [Y,Cb,Cr] conversion
3. Downsampling (optional)
4. FDCT (Forward Discrete Cosine Transform)
5. Quantization
6. Serialization in zig-zag order
7. Entropy encoding (Run Length Coding & Huffman coding)
This is an example
JPEG Algorithm -- Example
Divide into 8x8 blocks
RGB vs. YCC
The human eye is less sensitive to loss of color precision than to loss of contour precision, which is carried by luminance.
Color space conversion makes use of this.
Simple color space model: [R,G,B] per pixel
JPEG uses [Y, Cb, Cr] Model
Y = Brightness
Cb = Color blueness
Cr = Color redness
Convert RGB to YCC
8x8 pixels, 1 pixel = 3 components
MCU with sampling factor (1, 1, 1)
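As a rough illustration of the conversion, the host-side C++ sketch below converts one pixel with the standard JFIF coefficients. The struct and function names are hypothetical, and the +128 chroma offset follows JFIF rather than anything shown on the slide.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical container for one converted pixel.
struct YCbCr { float y, cb, cr; };

// Convert one 8-bit RGB pixel to YCbCr using the JFIF coefficients.
// Cb and Cr are offset by 128 so that any gray maps to (Y, 128, 128).
inline YCbCr rgbToYCbCr(int r, int g, int b) {
    YCbCr p;
    p.y  =  0.299f    * r + 0.587f    * g + 0.114f    * b;
    p.cb = -0.168736f * r - 0.331264f * g + 0.5f      * b + 128.0f;
    p.cr =  0.5f      * r - 0.418688f * g - 0.081312f * b + 128.0f;
    return p;
}
```

For a gray pixel the chroma terms cancel, so only Y carries information; this is the redundancy the later steps exploit.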
Downsampling
MCU (minimum coded unit): the smallest group of data units that is coded.
Y is taken for every pixel; Cb and Cr are taken once per block of 2x2 pixels.
MCU with sampling factor (2, 1, 1): 16x16 pixels, 4 Y blocks plus 1 Cb block and 1 Cr block
Data size is halved immediately: 6 blocks per 16x16 pixels instead of 12.
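A minimal sketch of the chroma reduction, assuming each 2x2 group is averaged (the slide does not say whether samples are averaged or simply picked) and that the plane dimensions are even. Both function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Average each 2x2 group of chroma samples into one sample.
// Assumes width and height are even; the plane is stored row-major.
std::vector<float> downsample2x2(const std::vector<float>& plane,
                                 int width, int height) {
    std::vector<float> out((width / 2) * (height / 2));
    for (int y = 0; y < height; y += 2) {
        for (int x = 0; x < width; x += 2) {
            float sum = plane[y * width + x] + plane[y * width + x + 1]
                      + plane[(y + 1) * width + x] + plane[(y + 1) * width + x + 1];
            out[(y / 2) * (width / 2) + x / 2] = sum / 4.0f;
        }
    }
    return out;
}

// 8x8 blocks in one MCU when Y uses sampling factor `factor` in both
// directions and Cb, Cr each contribute a single block.
constexpr int blocksPerMcu(int factor) { return factor * factor + 2; }
```

A 16x16 region holds four (1, 1, 1) MCUs, i.e. 4 * blocksPerMcu(1) = 12 blocks, against blocksPerMcu(2) = 6 blocks for one (2, 1, 1) MCU.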
Apply FDCT
2D IDCT:
1D IDCT:
2-D is equivalent to 1-D applied in each direction
Kernel uses 1-D transforms
Bottleneck: the direct 2-D transform has complexity O(n^4)
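For reference, the forward transforms in their common orthonormal form. The slide's own formulas did not survive extraction, so the notation here is an assumption.

```latex
% 1-D forward DCT (DCT-II) of N samples x_0 .. x_{N-1}
X_k = c_k \sqrt{\tfrac{2}{N}} \sum_{n=0}^{N-1} x_n \cos\frac{(2n+1)k\pi}{2N},
\qquad c_0 = \tfrac{1}{\sqrt{2}},\; c_k = 1 \text{ for } k > 0

% 2-D forward DCT of an N x N block f(x, y)
F(u,v) = \frac{2}{N}\, c_u c_v \sum_{x=0}^{N-1} \sum_{y=0}^{N-1}
         f(x,y) \cos\frac{(2x+1)u\pi}{2N} \cos\frac{(2y+1)v\pi}{2N}
```

The 2-D sum needs N^2 multiply-adds for each of the N^2 outputs, hence O(N^4). Because the cosine kernel factors into an x part and a y part, applying the 1-D transform to the N rows and then to the N columns gives the same result.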
Apply FDCT
Shift operation: from [0, 255] to [-128, 127]
DCT result
Meaning of each position in the DCT result matrix
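The shift and the row/column decomposition can be sketched on the host as follows. This is a naive cosine-sum version for checking results, not the kernel used in cujpeg, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Naive 8-point orthonormal DCT-II over v[0], v[step], ..., v[7 * step].
// step = 1 walks a row of a row-major 8x8 block, step = 8 walks a column.
void naiveDct8(float* v, int step) {
    const float kPi = 3.14159265358979f;
    float out[8];
    for (int k = 0; k < 8; ++k) {
        float sum = 0.0f;
        for (int n = 0; n < 8; ++n)
            sum += v[n * step] * std::cos((2 * n + 1) * k * kPi / 16.0f);
        out[k] = sum * (k == 0 ? std::sqrt(1.0f / 8.0f) : 0.5f);
    }
    for (int k = 0; k < 8; ++k) v[k * step] = out[k];
}

// Level-shift 8-bit samples to [-128, 127], then transform rows and columns.
void levelShiftAndDct(const unsigned char* pixels, float* block) {
    for (int i = 0; i < 64; ++i) block[i] = pixels[i] - 128.0f;
    for (int row = 0; row < 8; ++row) naiveDct8(block + row * 8, 1);
    for (int col = 0; col < 8; ++col) naiveDct8(block + col, 8);
}
```

Position 0 of the result (top left) is the DC coefficient, proportional to the block average; positions further right and down hold higher horizontal and vertical frequencies. A uniform block therefore has a non-zero DC value and zeros everywhere else.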
Quantization
The DCT result is divided element-wise by the quantization matrix (adjustable according to quality) and rounded, giving the quantization result.
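A minimal host-side sketch of the division, assuming round-to-nearest and a 64-entry table of step sizes; the function name and the sample values in the test are illustrative only.

```cpp
#include <cassert>
#include <cmath>

// Divide each DCT coefficient by its quantization step and round to nearest.
// `qtable` holds 64 step sizes; larger steps discard more detail.
void quantizeBlock(const float* dct, const int* qtable, int* out) {
    for (int i = 0; i < 64; ++i)
        out[i] = static_cast<int>(std::lround(dct[i] / qtable[i]));
}
```

Small high-frequency coefficients round to zero here, which is what makes the run-length step effective.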
Zigzag reordering / Run Length Coding
The quantization result is read in zig-zag order, and each nonzero value is coded as [number of zeros before me, my value].
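A simplified sketch of the reordering and run-length step. It treats all 64 values alike, as the slide's example does; a full JPEG encoder codes the DC value separately and limits runs to 15 zeros. The function name is hypothetical, and (0, 0) is used here to stand for EOB.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Row-major index of the coefficient found at each zig-zag position.
static const int kZigzag[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

// Walk a row-major 8x8 block in zig-zag order and emit
// (zeros before me, my value) pairs; a final (0, 0) stands for EOB.
std::vector<std::pair<int, int> > zigzagRunLength(const int* block) {
    std::vector<std::pair<int, int> > pairs;
    int run = 0;
    for (int i = 0; i < 64; ++i) {
        int value = block[kZigzag[i]];
        if (value == 0) { ++run; continue; }
        pairs.push_back(std::make_pair(run, value));
        run = 0;
    }
    pairs.push_back(std::make_pair(0, 0));  // EOB: only zeros remain
    return pairs;
}
```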
Huffman encoding
RLC result: [0, -3] [0, 12] [0, 3] ...... EOB
After group number added: [0, 2, 00b] [0, 4, 1100b] [0, 2, 11b] ...... EOB
First Huffman coding (i.e. for [0, 2, 00b]): [0, 2, 00b] => [100b, 00b] (look up e.g. the AC chrominance table)
Total input: 64 values x 8 bits = 512 bits; output: 113 bits
Values                          G     Real saved values
0                               0     (none)
-1, 1                           1     0, 1
-3, -2, 2, 3                    2     00, 01, 10, 11
-7, -6, -5, -4, 4, 5, 6, 7      3     000, 001, 010, 011, 100, 101, 110, 111
...                             ...   ...
-32767..-16384, 16384..32767    15    ...
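The table can be computed rather than stored. In this sketch (function names hypothetical) the group number is the bit length of the magnitude, and a negative value is stored as value + 2^G - 1 so that it fits in G bits.

```cpp
#include <cassert>
#include <cstdlib>

// Group number G: the number of bits needed for |value| (0 for value 0).
int groupNumber(int value) {
    int magnitude = std::abs(value);
    int bits = 0;
    while (magnitude != 0) { ++bits; magnitude >>= 1; }
    return bits;
}

// Bits saved after the Huffman code, as an integer holding the low G bits:
// the value itself when positive, value + (2^G - 1) when negative.
int savedBits(int value) {
    int g = groupNumber(value);
    return value >= 0 ? value : value + (1 << g) - 1;
}
```

For the example above, -3 gives G = 2 with bits 00b, 12 gives G = 4 with bits 1100b, and 3 gives G = 2 with bits 11b.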
Traditional Encoder
CPU: Image -> Load image -> Color conversion -> DCT -> Quantization -> Zigzag reorder -> Encoding -> .jpg
Algorithm Analysis
1x full 2-D DCT scan: O(N^4)
8x row 1-D DCT scan + 8x column 1-D DCT scan: O(N^3)
8 threads can work in parallel
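The saving can be put in numbers with a back-of-the-envelope count, assuming a naive 1-D transform that costs N multiply-adds per output (the function names are hypothetical):

```cpp
#include <cassert>

// Multiply-adds for a direct 2-D DCT of an n x n block:
// n^2 outputs, each a sum over n^2 inputs.
constexpr long direct2dOps(long n) { return n * n * n * n; }

// Multiply-adds for n row passes plus n column passes of a naive
// n-point 1-D DCT (n^2 multiply-adds per pass).
constexpr long separableOps(long n) { return 2 * n * (n * n); }
```

For N = 8 this is 4096 against 1024 multiply-adds. The 8 row transforms are independent of one another, as are the 8 column transforms, which is why 8 threads can share one block.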
DCT In Place
// In-place 8-point 1-D DCT over Vect0[0], Vect0[Step], ..., Vect0[7 * Step].
// C_a .. C_f and C_norm are constants defined outside this excerpt.
__device__ void vectorDCTInPlace(float *Vect0, int Step)
{
    // Pointers to the 8 strided elements
    float *Vect1 = Vect0 + Step, *Vect2 = Vect1 + Step;
    float *Vect3 = Vect2 + Step, *Vect4 = Vect3 + Step;
    float *Vect5 = Vect4 + Step, *Vect6 = Vect5 + Step;
    float *Vect7 = Vect6 + Step;

    // Sums and differences of mirrored pairs
    float X07P = (*Vect0) + (*Vect7);
    float X16P = (*Vect1) + (*Vect6);
    float X25P = (*Vect2) + (*Vect5);
    float X34P = (*Vect3) + (*Vect4);
    float X07M = (*Vect0) - (*Vect7);
    float X61M = (*Vect6) - (*Vect1);
    float X25M = (*Vect2) - (*Vect5);
    float X43M = (*Vect4) - (*Vect3);

    // Second butterfly stage, used by the even outputs
    float X07P34PP = X07P + X34P;
    float X07P34PM = X07P - X34P;
    float X16P25PP = X16P + X25P;
    float X16P25PM = X16P - X25P;

    // Even-indexed outputs
    (*Vect0) = C_norm * (X07P34PP + X16P25PP);
    (*Vect2) = C_norm * (C_b * X07P34PM + C_e * X16P25PM);
    (*Vect4) = C_norm * (X07P34PP - X16P25PP);
    (*Vect6) = C_norm * (C_e * X07P34PM - C_b * X16P25PM);

    // Odd-indexed outputs
    (*Vect1) = C_norm * (C_a * X07M - C_c * X61M + C_d * X25M - C_f * X43M);
    (*Vect3) = C_norm * (C_c * X07M + C_f * X61M - C_a * X25M + C_d * X43M);
    (*Vect5) = C_norm * (C_d * X07M + C_a * X61M + C_f * X25M - C_c * X43M);
    (*Vect7) = C_norm * (C_f * X07M + C_d * X61M + C_c * X25M + C_a * X43M);
}
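// Sequential version: 8 row passes, then 8 column passes.
__device__ void blockDCTInPlace(float *block)
{
    for (int row = 0; row < 64; row += 8)
        vectorDCTInPlace(block + row, 1);   // row elements are adjacent
    for (int col = 0; col < 8; col++)
        vectorDCTInPlace(block + col, 8);   // column elements are 8 floats apart
}

// Parallel version: each of 8 threads transforms one row, then one column.
__device__ void parallelDCTInPlace(float *block)
{
    int col = threadIdx.x % 8;
    int row = col * 8;
    __syncthreads();
    vectorDCTInPlace(block + row, 1);
    __syncthreads();                        // all rows done before any column starts
    vectorDCTInPlace(block + col, 8);
    __syncthreads();
}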
Allocation
Desktop PC
– CPU: 1 P4 core, 3.0 GHz
– RAM: 2 GB
Graphics card
– GPU: 16 cores, 575 MHz; 8 SPs per core, 1.35 GHz
– RAM: 768 MB
Binding
Huffman encoding
• many conditions/branches
• intensive bit manipulation
• little computation
Color conversion, DCT, quantization
• intensive computation
• few conditions/branches
Binding
Hardware: 16 KB shared memory
Problem: 1 MCU contains 768 bytes of data (3 blocks x 64 values x 4 bytes)
Result: at most 21 MCUs per CUDA block
Hardware: 512 threads
Problem: 1 MCU contains 3 blocks, 1 block needs 8 threads
Result: 1 MCU needs 24 threads
1 CUDA block = 21 MCUs x 24 threads = 504 threads
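The two limits can be checked with a few lines of arithmetic. The sketch assumes each coefficient is a 4-byte float and that sampling is 1:1:1 (3 blocks per MCU); the constant and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Resource limits quoted on the slide.
const int kSharedMemBytes = 16 * 1024;   // shared memory available to a CUDA block
const int kMaxThreads     = 512;         // threads per CUDA block
// Per-MCU cost, assuming three 8x8 blocks of 4-byte floats.
const int kBytesPerMcu    = 3 * 64 * 4;  // 768
const int kThreadsPerMcu  = 3 * 8;       // 24

// MCUs that fit in one CUDA block under both limits.
inline int mcusPerCudaBlock() {
    return std::min(kSharedMemBytes / kBytesPerMcu, kMaxThreads / kThreadsPerMcu);
}
```

Both limits give 21 MCUs, which corresponds to 21 x 24 = 504 threads and 21 x 768 = 16128 bytes of shared memory per CUDA block.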
cujpeg Encoder
CPU: Image -> Load image
GPU: Color conversion -> DCT -> Quantization -> Zigzag reorder
CPU: Encoding -> .jpg
cujpeg Encoder
Data flow: Image -> host memory (CPU) -> texture memory (GPU) -> color conversion -> shared memory -> in-place DCT -> quantize/reorder -> global memory (quantization/reorder result) -> host memory -> encoding (CPU) -> .jpg

Load image:
cudaMallocArray(&textureCache, &channel, scanlineSize, imgHeight);
cudaMemcpy2DToArray(textureCache, 0, 0, image, imageStride, imageWidth, imageHeight, cudaMemcpyHostToDevice);
cudaBindTextureToArray(TexSrc, textureCache, channel);
cudaMalloc((void **)(&ResultDevice), ResultSize);

Color conversion:
int b = tex2D(TexSrc, TexPosX++, TexPosY);
int g = tex2D(TexSrc, TexPosX++, TexPosY);
int r = tex2D(TexSrc, TexPosX+=6, TexPosY);
float y = 0.299*r + 0.587*g + 0.114*b - 128.0 + 0.5;
float cb = -0.168736*r - 0.331264*g + 0.500*b + 0.5;
float cr = 0.500*r - 0.418688f*g - 0.081312*b + 0.5;
myDCTLine[Offset + i] = y;
myDCTLine[Offset + 64 + i] = cb;
myDCTLine[Offset + 128 + i] = cr;

Quantize/Reorder:
for (int i = 0; i < BLOCK_WIDTH; i++)
    myDestBlock[myZLine[i]] = (int)(myDCTLine[i] * myDivQLine[i] + 0.5f);

Copy result back to host memory:
cudaMemcpy(ResultHost, ResultDevice, ResultSize, cudaMemcpyDeviceToHost);
Scheduling
For each MCU:
• 24 threads: convert 2 pixels each
• 8 threads: convert the remaining 2 pixels each (24 x 2 + 8 x 2 = 64 pixels)
• 24 threads: do 1x row vector DCT, do 1x column vector DCT, quantize 8 scalar values each
[Diagram: RGB data -> YCC block (Y, Cb, Cr) -> DCT block -> quantized/reordered data (Y, Cb, Cr), with 24 threads (x24) at each step]
GPU Occupancy
[Charts: multiprocessor warp occupancy (0 to 24 warps) for varying register count (registers per thread; my register count: 16), varying block size (threads per block; my block size: 504), and varying shared memory usage (0 to 16384 bytes; my shared memory: 16128)]
Threads Per Block                          504
Registers Per Thread                       16
Shared Memory Per Block (bytes)            16128
Active Threads per Multiprocessor          504
Active Warps per Multiprocessor            16
Active Thread Blocks per Multiprocessor    1
Occupancy of each Multiprocessor           67%
Maximum Simultaneous Blocks per GPU        16
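The 67% figure follows from the warp count. The sketch below assumes 32 threads per warp and a limit of 24 resident warps per multiprocessor (the top of the charts' occupancy axis); the names are hypothetical.

```cpp
#include <cassert>

const int kWarpSize      = 32;
// Assumed limit of resident warps per multiprocessor for this GPU.
const int kMaxWarpsPerSm = 24;

// Warps needed for one thread block, rounding a partial warp up.
inline int warpsForBlock(int threadsPerBlock) {
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}

// Fraction of the multiprocessor's warp slots that are in use.
inline double occupancy(int threadsPerBlock, int blocksPerSm) {
    return static_cast<double>(warpsForBlock(threadsPerBlock) * blocksPerSm)
           / kMaxWarpsPerSm;
}
```

504 threads occupy 16 warps, and with one active block per multiprocessor that is 16 of 24 warps, about 67%. A second block cannot be resident at the same time because one block already uses 16128 of the 16384 bytes of shared memory.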
Benchmark
Encoding time (Q = 80, sampling = 1:1:1)
Image size   512x512   1024x1024   2048x2048   4096x4096
cujpeg       0.321s    0.376s      0.560s      1.171s
libjpeg      0.121s    0.237s      0.804s      3.971s
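To make the crossover explicit, the ratio of the two rows can be computed from the measured times (array and function names are hypothetical):

```cpp
#include <cassert>

// Times in seconds from the benchmark table, for 512, 1024, 2048 and 4096 pixel squares.
const double kCujpeg[4]  = {0.321, 0.376, 0.560, 1.171};
const double kLibjpeg[4] = {0.121, 0.237, 0.804, 3.971};

// Speedup of cujpeg over libjpeg; values below 1 mean cujpeg is slower.
inline double speedup(int i) { return kLibjpeg[i] / kCujpeg[i]; }
```

cujpeg is slower than libjpeg at the two small sizes and faster from 2048x2048 upward, reaching roughly 3.4x at 4096x4096.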
Benchmark
Time Consumption (4096x4096)
                Load     Transfer   Compute   Encode   Total
Quality = 100   0.132s   0.348s     0.043s    0.837s   1.523s
Quality = 80    0.121s   0.324s     0.043s    0.480s   1.123s
Quality = 50    0.130s   0.353s     0.044s    0.468s   1.167s
[Pie chart: Time Consumption. load 10%, transfer 27%, compute 3%, encode 47%, others 13%]
Each thread performs 240 operations
24 threads process 1 MCU
A 4096x4096 image contains 262144 MCUs
Total ops: 262144 * 24 * 240 = 1509949440 flops
Speed: (total ops) / 0.043 s = 35.12 GFLOPS
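The estimate can be reproduced as follows. The 240 operations per thread is the authors' own count and is taken as given; the constant and function names are hypothetical.

```cpp
#include <cassert>

// Figures taken from the slide.
const long long kMcus           = (4096LL * 4096LL) / 64;  // one MCU per 8x8 pixels at 1:1:1
const long long kThreadsEachMcu = 24;
const long long kOpsPerThread   = 240;

// Total floating-point operations for one 4096x4096 image.
inline long long totalOps() { return kMcus * kThreadsEachMcu * kOpsPerThread; }

// Throughput in GFLOPS for a given compute time in seconds.
inline double gflops(double seconds) { return totalOps() / seconds / 1e9; }
```

With the measured compute time of 0.043 s this gives about 35.1 GFLOPS.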
Conclusion
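CUDA clearly accelerates JPEG compression for large images: at 4096x4096 cujpeg takes 1.171s against 3.971s for libjpeg, but it is slower at 512x512 and 1024x1024.
The overall performance depends on the system speed. Possible improvements:
• More bandwidth
• Better encoding routine
• Support for downsampling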