![Page 1: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/1.jpg)
Time Optimization of HEVC Encoder over X86 Processors using SIMD
Kushal Shah
1000857252
Advisor: Dr. K. R. Rao
Spring 2013
Multimedia Processing EE5359
![Page 2: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/2.jpg)
Objective
• With many enhanced coding tools, HEVC is expected to achieve about a 50% bit-rate reduction at a similar mean opinion score (MOS) compared with the previous standard, H.264/AVC [1].
• However, the computational complexity of HEVC has greatly increased, making encoding speed a serious problem in the implementation of HEVC [2].
![Page 3: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/3.jpg)
Overview of HEVC [1]
• High Efficiency Video Coding (HEVC) is the newest video coding standard of the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group.
• The main goal of the HEVC standardization effort is to enable significantly improved compression performance relative to existing standards—in the range of 50% bit-rate reduction for equal perceptual video quality.
![Page 4: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/4.jpg)
HEVC Encoder Block Diagram [1]
Fig.1: HEVC encoder block diagram [1]
![Page 5: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/5.jpg)
Block Partitioning in HEVC [5]
Fig. 2: Block partitioning in HEVC, where coding tree units (CTUs) take the place of H.264/AVC macroblocks [5]
![Page 6: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/6.jpg)
Time Analysis of HEVC Encoder [2][3]
Fig. 3: Time analysis of HEVC encoder [2][3]
![Page 7: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/7.jpg)
Time Analysis of HEVC Encoder [2][3]
• HEVC utilizes a quadtree structure [4] to support large and flexible block sizes.
• The size of a coding unit (CU) can be 64x64, 32x32, 16x16 or 8x8. Each CU is split into one or more prediction units (PU) and transform units (TU).
• The width and height of a PU vary from 4 to 64, so the blocks processed in motion compensation (MC) can be as large as 64x64.
![Page 8: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/8.jpg)
Time Analysis of HEVC Encoder [2][3]
• In motion estimation (ME), sum of absolute differences (SAD) and sum of absolute transformed differences (SATD) of different block sizes are calculated.
• Because of the flexible block structure, each 4x4 region is evaluated repeatedly as ME runs over block sizes from 4x4 up to 64x64, which is quite time-consuming.
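The SAD cost used in ME maps directly onto one SSE2 instruction, PSADBW. The following is a minimal sketch, assuming an x86 compiler with SSE2 and a fixed 16x16 block; the function names and the stride parameters are hypothetical and are not taken from the HM software or from [2].

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* SAD of one 16x16 block; each stride is the distance in bytes between rows. */
uint32_t sad_16x16_sse2(const uint8_t *cur, ptrdiff_t stride_cur,
                        const uint8_t *ref, ptrdiff_t stride_ref)
{
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < 16; ++y) {
        __m128i a = _mm_loadu_si128((const __m128i *)(cur + y * stride_cur));
        __m128i b = _mm_loadu_si128((const __m128i *)(ref + y * stride_ref));
        /* PSADBW: one partial sum per 8 bytes, held in the two 64-bit lanes */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(a, b));
    }
    /* Fold the high 64-bit lane onto the low one and extract the total */
    acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));
    return (uint32_t)_mm_cvtsi128_si32(acc);
}

/* Scalar reference, used only to cross-check the SIMD version. */
uint32_t sad_16x16_ref(const uint8_t *cur, ptrdiff_t stride_cur,
                       const uint8_t *ref, ptrdiff_t stride_ref)
{
    uint32_t sum = 0;
    for (int y = 0; y < 16; ++y)
        for (int x = 0; x < 16; ++x) {
            int d = cur[y * stride_cur + x] - ref[y * stride_ref + x];
            sum += (uint32_t)(d < 0 ? -d : d);
        }
    return sum;
}
```

Because SAD is additive over disjoint regions, the SAD of a larger block equals the sum of the SADs of its sub-blocks at the same displacement, which is one way to avoid recomputing small blocks.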
![Page 9: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/9.jpg)
8-Tap and 4-Tap Interpolation [7]
• 8-Tap Interpolation Filter:
Fig. 4: Interpolation filter for fractional pels in motion compensation [7]
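As a scalar baseline for the 8-tap filter, the sketch below interpolates one luma row at a quarter-sample position using the HEVC luma coefficients. It is a simplified illustration, assuming 8-bit samples and a single horizontal pass with immediate rounding (a real codec keeps higher intermediate precision); the names `luma_filter` and `interp_row_luma` are hypothetical.

```c
#include <stdint.h>

/* HEVC luma interpolation coefficients for fractional positions 1/4, 2/4 and
   3/4. Each row sums to 64, so the result is normalized by a shift of 6. */
const int8_t luma_filter[3][8] = {
    { -1, 4, -10, 58, 17,  -5, 1,  0 },
    { -1, 4, -11, 40, 40, -11, 4, -1 },
    {  0, 1,  -5, 17, 58, -10, 4, -1 },
};

/* Interpolate one row horizontally.
   src must have 3 valid samples before src[0] and 4 after src[width - 1].
   frac selects the fractional position: 1, 2 or 3 (in quarter samples). */
void interp_row_luma(const uint8_t *src, uint8_t *dst, int width, int frac)
{
    const int8_t *c = luma_filter[frac - 1];
    for (int x = 0; x < width; ++x) {
        int sum = 0;
        for (int k = 0; k < 8; ++k)
            sum += c[k] * src[x + k - 3];   /* taps cover src[x-3] .. src[x+4] */
        sum += 32;                          /* rounding offset */
        if (sum < 0) sum = 0;               /* clip low before shifting */
        sum >>= 6;
        dst[x] = (uint8_t)(sum > 255 ? 255 : sum);
    }
}
```

Each output sample is an 8-element dot product of unsigned samples with signed coefficients, which is the pattern that SIMD multiply-add instructions accelerate.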
![Page 10: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/10.jpg)
Intel SSE Instruction [6]
• Streaming SIMD Extensions (SSE) is a SIMD instruction set extension to the x86 architecture. It was subsequently extended by SSE2, SSE3, SSSE3 and SSE4.
• SSE originally provided eight 128-bit registers, known as XMM0 through XMM7. AMD64 extends the number of registers to sixteen.
• Each 128-bit register can hold two 64-bit integers, four 32-bit integers, eight 16-bit short integers or sixteen 8-bit bytes.
• A single SSE instruction operates on all the elements packed in an XMM register at once, which provides considerable data-level parallelism.
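To make the lane layout concrete, here is a minimal sketch, assuming SSE2 and an array length that is a multiple of 8; the function name is hypothetical. One PADDW instruction adds eight 16-bit integers at a time.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* out[i] = a[i] + b[i] for n 16-bit integers; n must be a multiple of 8. */
void add_i16_sse2(const int16_t *a, const int16_t *b, int16_t *out, int n)
{
    for (int i = 0; i < n; i += 8) {
        /* Each 128-bit load fills eight 16-bit lanes of an XMM register */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* PADDW: eight independent 16-bit additions in one instruction */
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi16(va, vb));
    }
}
```

The same register could instead be treated as sixteen bytes (`_mm_add_epi8`) or four 32-bit integers (`_mm_add_epi32`); the instruction, not the register, decides the lane width.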
![Page 11: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/11.jpg)
Intel SSE Instruction [6]
• The PMADDUBSW instruction takes two 128-bit SSE registers as operands, the first containing sixteen unsigned 8-bit integers and the second containing sixteen signed 8-bit integers. It multiplies corresponding bytes and adds each adjacent pair of products with signed saturation, leaving eight 16-bit results in the destination register. Only these values then need to be summed to obtain the final result.
Fig 5: SSE Instruction structure [6]
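The sketch below shows how PMADDUBSW can evaluate one 8-tap filter output: eight unsigned samples are multiplied by eight signed coefficients, and the partial sums in the destination register are then added up. It assumes an x86 CPU with SSSE3 and a GCC- or Clang-style `target` attribute (so no extra compiler flag is needed); the function names and the `SSSE3_FN` macro are hypothetical.

```c
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_maddubs_epi16 */

#if defined(__GNUC__)
#define SSSE3_FN __attribute__((target("ssse3")))
#else
#define SSSE3_FN
#endif

/* Dot product of 8 unsigned samples with 8 signed filter coefficients. */
SSSE3_FN
int dot8_u8_s8_ssse3(const uint8_t *samples, const int8_t *coef)
{
    /* Load 8 bytes into the low half of each register; the high half is zero */
    __m128i s = _mm_loadl_epi64((const __m128i *)samples);
    __m128i c = _mm_loadl_epi64((const __m128i *)coef);
    /* PMADDUBSW: unsigned * signed bytes, adjacent pairs added, 16-bit lanes */
    __m128i p = _mm_maddubs_epi16(s, c);
    /* Sum the four non-zero 16-bit lanes: PMADDWD with ones widens to 32 bits */
    __m128i q = _mm_madd_epi16(p, _mm_set1_epi16(1));
    q = _mm_add_epi32(q, _mm_srli_si128(q, 4));
    return _mm_cvtsi128_si32(q);
}

/* Scalar reference, used only to cross-check the SIMD version. */
int dot8_u8_s8_ref(const uint8_t *samples, const int8_t *coef)
{
    int sum = 0;
    for (int k = 0; k < 8; ++k)
        sum += samples[k] * coef[k];
    return sum;
}
```

The pair sums saturate at 16 bits, so this approach relies on the coefficients being small enough that two adjacent products cannot exceed the signed 16-bit range; that holds for the HEVC interpolation coefficients with 8-bit samples.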
![Page 12: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/12.jpg)
Intel SSE Instruction [6]
• The PMADDWD instruction takes two SSE registers as operands, each containing signed 16-bit integers (eight per 128-bit XMM register). It multiplies corresponding elements and adds each adjacent pair of 32-bit products, leaving 32-bit sums (four per XMM register) in the destination register.
Fig 6: SSE Instruction structure [6]
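A minimal sketch of a 16-bit multiply-add follows, assuming the instruction meant here is PMADDWD (intrinsic `_mm_madd_epi16`), SSE2 support, and an array length that is a multiple of 8. The function name is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Dot product of two arrays of signed 16-bit integers; n is a multiple of 8. */
int32_t dot_i16_sse2(const int16_t *a, const int16_t *b, int n)
{
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* PMADDWD: 8 products, adjacent pairs added into four 32-bit lanes */
        acc = _mm_add_epi32(acc, _mm_madd_epi16(va, vb));
    }
    /* Horizontal sum of the four 32-bit lanes */
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));
    return _mm_cvtsi128_si32(acc);
}
```

Multiplying by a vector of ones turns the same instruction into a widening pairwise add, which is a common way to reduce 16-bit partial sums without overflow.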
![Page 13: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/13.jpg)
Calculating Motion Vectors[7]
Fig. 7 : Luminance and chrominance row interpolation [7]
![Page 14: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/14.jpg)
Hadamard Transform [2]
Fig. 8 Hadamard transform algorithm
![Page 15: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/15.jpg)
Hadamard Transform [2]
Fig. 9 Instruction structure for hadamard transform calculation
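One possible SSE2 arrangement of a 4x4 Hadamard-based SATD is sketched below: vertical butterflies operate on whole rows, a transpose built from unpack instructions swaps rows and columns, and a second butterfly pass finishes the 2-D transform. This is an illustration under assumptions, not the instruction sequence of Fig. 9; the function names are hypothetical and the result is left unnormalized (encoders usually scale it).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Load four 8-bit samples and widen them into 16-bit lanes 0..3. */
static __m128i load4_u8_to_i16(const uint8_t *p)
{
    int32_t w;
    memcpy(&w, p, 4);
    return _mm_unpacklo_epi8(_mm_cvtsi32_si128(w), _mm_setzero_si128());
}

/* Unnormalized SATD of a 4x4 block (sum of |Hadamard(cur - ref)|). */
uint32_t satd_4x4_sse2(const uint8_t *cur, ptrdiff_t sc,
                       const uint8_t *ref, ptrdiff_t sr)
{
    __m128i r0 = _mm_sub_epi16(load4_u8_to_i16(cur),          load4_u8_to_i16(ref));
    __m128i r1 = _mm_sub_epi16(load4_u8_to_i16(cur + sc),     load4_u8_to_i16(ref + sr));
    __m128i r2 = _mm_sub_epi16(load4_u8_to_i16(cur + 2 * sc), load4_u8_to_i16(ref + 2 * sr));
    __m128i r3 = _mm_sub_epi16(load4_u8_to_i16(cur + 3 * sc), load4_u8_to_i16(ref + 3 * sr));
    /* Vertical butterflies: every lane (column) is processed in parallel */
    __m128i a0 = _mm_add_epi16(r0, r1), a1 = _mm_sub_epi16(r0, r1);
    __m128i a2 = _mm_add_epi16(r2, r3), a3 = _mm_sub_epi16(r2, r3);
    __m128i b0 = _mm_add_epi16(a0, a2), b1 = _mm_add_epi16(a1, a3);
    __m128i b2 = _mm_sub_epi16(a0, a2), b3 = _mm_sub_epi16(a1, a3);
    /* Transpose: u0 = [col0 | col1], u1 = [col2 | col3] */
    __m128i t0 = _mm_unpacklo_epi16(b0, b1);
    __m128i t1 = _mm_unpacklo_epi16(b2, b3);
    __m128i u0 = _mm_unpacklo_epi32(t0, t1);
    __m128i u1 = _mm_unpackhi_epi32(t0, t1);
    /* Horizontal butterflies on the transposed data */
    __m128i s = _mm_add_epi16(u0, u1), d = _mm_sub_epi16(u0, u1);
    __m128i lo = _mm_unpacklo_epi64(s, d), hi = _mm_unpackhi_epi64(s, d);
    __m128i e = _mm_add_epi16(lo, hi), f = _mm_sub_epi16(lo, hi);
    /* Absolute value without SSSE3: max(x, -x) */
    __m128i zero = _mm_setzero_si128();
    e = _mm_max_epi16(e, _mm_sub_epi16(zero, e));
    f = _mm_max_epi16(f, _mm_sub_epi16(zero, f));
    /* Widen to 32 bits and sum all 16 coefficients */
    __m128i ones = _mm_set1_epi16(1);
    __m128i sum = _mm_add_epi32(_mm_madd_epi16(e, ones), _mm_madd_epi16(f, ones));
    sum = _mm_add_epi32(sum, _mm_srli_si128(sum, 8));
    sum = _mm_add_epi32(sum, _mm_srli_si128(sum, 4));
    return (uint32_t)_mm_cvtsi128_si32(sum);
}

/* Scalar reference: coefficient(u, v) = sum H[u][y] * d[y][x] * H[v][x]. */
uint32_t satd_4x4_ref(const uint8_t *cur, ptrdiff_t sc,
                      const uint8_t *ref, ptrdiff_t sr)
{
    static const int H[4][4] = {
        {1, 1, 1, 1}, {1, -1, 1, -1}, {1, 1, -1, -1}, {1, -1, -1, 1}
    };
    uint32_t sum = 0;
    for (int u = 0; u < 4; ++u)
        for (int v = 0; v < 4; ++v) {
            int c = 0;
            for (int y = 0; y < 4; ++y)
                for (int x = 0; x < 4; ++x)
                    c += H[u][y] * (cur[y * sc + x] - ref[y * sr + x]) * H[v][x];
            sum += (uint32_t)abs(c);
        }
    return sum;
}
```

With 8-bit samples every coefficient stays within 16 * 255 = 4080 in magnitude, so 16-bit lanes are sufficient for the whole transform.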
![Page 16: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/16.jpg)
SAD/SSD Calculation [2]
Fig. 10 Instruction structure for SAD/SSD calculation
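For the SSD part, a common SSE2 pattern is to widen the bytes to 16 bits, subtract, and square-and-accumulate with PMADDWD. The sketch below assumes SSE2 and blocks that are 16 samples wide with at most 64 rows; the function names are hypothetical and this is not necessarily the structure shown in Fig. 10.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Sum of squared differences over a block 16 samples wide and `rows` high. */
uint32_t ssd_16xN_sse2(const uint8_t *cur, ptrdiff_t sc,
                       const uint8_t *ref, ptrdiff_t sr, int rows)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < rows; ++y) {
        __m128i a = _mm_loadu_si128((const __m128i *)(cur + y * sc));
        __m128i b = _mm_loadu_si128((const __m128i *)(ref + y * sr));
        /* Widen bytes to 16 bits so the differences keep their sign */
        __m128i dlo = _mm_sub_epi16(_mm_unpacklo_epi8(a, zero),
                                    _mm_unpacklo_epi8(b, zero));
        __m128i dhi = _mm_sub_epi16(_mm_unpackhi_epi8(a, zero),
                                    _mm_unpackhi_epi8(b, zero));
        /* PMADDWD of d with itself: d*d, adjacent pairs added, 32-bit lanes */
        acc = _mm_add_epi32(acc, _mm_madd_epi16(dlo, dlo));
        acc = _mm_add_epi32(acc, _mm_madd_epi16(dhi, dhi));
    }
    /* Horizontal sum of the four 32-bit lanes */
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));
    return (uint32_t)_mm_cvtsi128_si32(acc);
}

/* Scalar reference, used only to cross-check the SIMD version. */
uint32_t ssd_16xN_ref(const uint8_t *cur, ptrdiff_t sc,
                      const uint8_t *ref, ptrdiff_t sr, int rows)
{
    uint32_t sum = 0;
    for (int y = 0; y < rows; ++y)
        for (int x = 0; x < 16; ++x) {
            int d = cur[y * sc + x] - ref[y * sr + x];
            sum += (uint32_t)(d * d);
        }
    return sum;
}
```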
![Page 17: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/17.jpg)
Experimental Configuration
• IntraPeriod : 32 # Period of I-frame
• GOPSize : 8 # GOP size
• QP : 32 # Quantization parameter
• FramesToBeEncoded : 100 # Number of frames to be coded
• FrameRate : 60 # Frame rate per second
• Number of frames : 100 # Frames used per sequence
• Intel Core i5, Windows 8 and 8 GB RAM
![Page 18: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/18.jpg)
Test sequences [8]
BQSquare_416x240_60.yuv
BQMall_832x480_60.yuv
BQTerrace_1920x1080_60.yuv
Fig 11: Test sequences
![Page 19: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/19.jpg)
PSNR
Fig. 12: PSNR comparison. Bar chart of PSNR (dB) before and after optimization for BQSquare (416x240), BQMall (832x480) and BQTerrace (1920x1080); the vertical axis spans 31 to 40 dB.
![Page 20: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/20.jpg)
Bit Rate
Fig. 13: Bit rate comparison. Bar chart of bit rate (kbps) before and after optimization for BQSquare (416x240), BQMall (832x480) and BQTerrace (1920x1080); the vertical axis spans 0 to 3500 kbps.
![Page 21: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/21.jpg)
Time
Fig. 14: Time comparison. Bar chart of encoding time (sec) before and after optimization for BQSquare (416x240), BQMall (832x480) and BQTerrace (1920x1080); the vertical axis spans 0 to 25000 sec.
![Page 22: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/22.jpg)
Comparison using BD-PSNR
Fig. 15: BD-PSNR comparison. Bar chart of BD-PSNR (dB) for BQSquare (416x240), BQMall (832x480) and BQTerrace (1920x1080); the vertical axis spans -0.8 to 0 dB.
![Page 23: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/23.jpg)
Comparison using BD-Bitrate
Fig. 15: BD-Rate comparison. Bar chart of BD-Rate (%) for BQSquare (416x240), BQMall (832x480) and BQTerrace (1920x1080); the vertical axis spans 0 to 30%.
![Page 24: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/24.jpg)
R-D Plot
Fig. 16: R-D plot. PSNR (dB, 31 to 40) versus bit rate (kbps, 0 to 3500), with one curve for the optimized encoder and one for the standard encoder.
![Page 25: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/25.jpg)
Conclusion
• As proposed, implementing SIMD in several blocks of the HEVC encoder significantly reduces encoding time for the test sequences without significantly affecting video quality. The reduction comes from optimizing motion vector calculation, the Hadamard transform and SAD/SSD calculation.
• The PSNR comparison shows no significant loss of video quality: PSNR drops by about 0.5 dB, which is tolerable. The bit rate of the optimized encoder is also consistent with that of the original encoder.
• Encoding time differs substantially because motion vector calculation, the Hadamard transform and SAD/SSD calculation are the most time-consuming blocks of the HEVC encoder. Using SIMD instructions for all of these calculations reduces processing time considerably without significantly affecting the quality of the video sequences.
![Page 26: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/26.jpg)
Future Work
• SIMD optimization can be extended in future to the calculation of the integer transform and RDOQ. In addition, the HEVC code can be parallelized on a GPU.
![Page 27: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/27.jpg)
Acronyms
• AVC: Advanced Video Coding
• CABAC: Context-Adaptive Binary Arithmetic Coding
• CB: Coding Block
• CTB: Coding Tree Block
• CTU: Coding Tree Unit
• CU: Coding Unit
• GPU: Graphics Processing Unit
• HEVC: High Efficiency Video Coding
• JCT-VC: Joint Collaborative Team on Video Coding
• MC: Motion Compensation
• ME: Motion Estimation
• MOS: Mean Opinion Score
• PB: Prediction Block
![Page 28: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/28.jpg)
Acronyms
• PU: Prediction Unit
• RDOQ: Rate Distortion Optimized Quantization
• SAD: Sum of Absolute Differences
• SAO: Sample Adaptive Offset
• SATD: Sum of Absolute Transformed Differences
• SIMD: Single Instruction Multiple Data
• SSD: Sum of Squared Differences
• SSE: Streaming SIMD Extensions
• TB: Transform Block
• TU: Transform Unit
![Page 29: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/29.jpg)
References
[1] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the High Efficiency Video Coding (HEVC) standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1648–1667, Dec. 2012.
[2] K. Chen, Y. Duan, L. Yan, J. Sun, and Z. Guo, “Efficient SIMD optimization of HEVC encoder over X86 processors,” Institute of Computer Science and Technology, Peking University, Beijing 100871, China.
[3] JCT-VC, “HM6: High Efficiency Video Coding (HEVC) Test Model 6 Encoder Description,” JCTVC-H1002, Feb. 2012.
[4] D. Marpe et al., “Video compression using nested quadtree structures, leaf merging, and improved techniques for motion representation and entropy coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 12, pp. 1676–1687, Dec. 2010.
[5] Explanation of block partition: http://codesequoia.wordpress.com/2012/10/28/hevc-ctu-cu-ctb-cb-pb-and-tb/
![Page 30: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/30.jpg)
References
[6] Intel Corp., Intel® 64 and IA-32 Architectures Software Developer's Manual: http://download.intel.com/products/processor/manual/325383.pdf
[7] L. Yan, Y. Duan, J. Sun, and Z. Guo, “Implementation of HEVC decoder on x86 processors with SIMD optimization,” VCIP, pp. 1-6, Nov. 2012.
[8] Test sequences: ftp://ftp.tnt.uni-hannover.de/testsequences
[9] HM9.2 software: https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-9.2rc1/
[10] BD-Rate and BD-PSNR calculation: http://wftp3.itu.int/av-arch/video-site
[11] SIMD implementation sample: http://sci.tuomastonteri.fi/programming/sse/example1
![Page 31: Time Optimization of HEVC Encoder over X86 Processors using SIMD](https://reader036.vdocuments.us/reader036/viewer/2022062305/5681681d550346895dddacf0/html5/thumbnails/31.jpg)
THANK YOU