Analysis and Parallelism of HEVC Encoder
Kwangwoon University (KWU) Donggyu Sim ([email protected])
Contents
• Overview of HEVC
• Encoding issues for HEVC test model (HM)
• Complexity analysis of HEVC encoder
• Fast encoding algorithms and performances
• Issues of parallel processing
• Conclusion
OVERVIEW OF HEVC
Introduction of HEVC
• High efficiency video coding (HEVC) – JCT-VC
  • Joint collaborative team on video coding
  • ITU-T SG16 Q.6 (VCEG) and ISO/IEC JTC1/SC29/WG11 (MPEG)
– Requirement
  • Development of a standard for video coding technology more advanced than the current AVC standard
  • Development of the associated conformance testing and reference software specifications
  • Informal goal: 50% better coding efficiency than H.264/AVC
Timeline
• Delayed HEVC timeline
  – Test model selection process begins 2010/04 (1st JCTVC meeting at Dresden, DE)
  – Test model selection by 2010/10 (3rd JCTVC meeting at Guangzhou, CN)
  – Committee Draft (CD) approval by 2012/02 (8th JCTVC meeting at San Jose, US)
  – Final Committee Draft (FCD) – Draft International Standard (DIS) approval by 2012/07 (10th JCTVC meeting at Stockholm, SE)
  – Final Draft International Standard (FDIS) approval by 2013/01 (12th JCTVC meeting at Geneva, CH)
Block diagram of HEVC standard
• Typical block‐based hybrid codec structure + enhanced and additional tools
FIGURE. Block diagram of the HEVC codec. Tools attached to each block:
  – Block structure: CU 8x8 to 64x64, diverse PU types
  – Intra prediction: angular intra prediction
  – Inter prediction: AMVP, Merge, DCT-IF
  – Transform: TU 4x4 to 32x32, residual quad-tree transform
  – Quantization: delta QP, RDO-Q
  – Entropy coding: CABAC
  – In-loop filter: deblocking filter, SAO
  – Picture storage: 8-bit/sample
  – Other blocks: picture, inverse quantization, inverse transform, bitstream
Block structure in HEVC
• Three block structures are defined in HEVC – Coding unit (CU) – Prediction unit (PU) – Transform unit (TU)
FIGURE. Block structures in HEVC: a 64×64 CTU split by a quad-tree into 32×32, 16×16, and 8×8 CUs; PU partition types 2N×2N, N×2N, 2N×N, N×N, 2N×nU, 2N×nD, nL×2N, nR×2N; TU depth 0, 1, 2, …
Profiles for HEVC
• HEVC standard supports 3 profiles (Main, HE 10, Main Still Picture)
TABLE. Tools per profile: Main, High efficiency 10 (HE10), Main Still Picture (a trailing "-" is the original table's mark for a profile column without that tool)
High-level structure
  – High-level support for frame rate temporal nesting and random access
  – Clean random access (CRA) support -
  – Rectangular tile-structured scanning
  – Wavefront-structured processing dependencies for parallelism
  – Slices with spatial granularity equal to coding tree unit
Coding units, Prediction units, and Transform units
  – Coding unit quadtree structure (square coding unit block sizes 2Nx2N, for N=4, 8, 16, 32; i.e., up to 64x64 luma samples in size)
  – Prediction units (for coding unit size 2Nx2N: for Inter, 2Nx2N, 2NxN, Nx2N, and, for N>4, also 2Nx(N/2+3N/2) & (N/2+3N/2)x2N; for Intra, only 2Nx2N and, for N=4, also NxN)
  – Prediction units, Main Still Picture (2Nx2N and, for N=4, also NxN)
  – Transform unit tree structure within coding unit (maximum of 3 levels)
  – Transform block size of 4x4 to 32x32 samples (always square)
Spatial Signal Transformation and PCM Representation
  – DCT-like integer block transform; for Intra also a DST-based integer block transform (only for Luma 4x4)
  – Transforms can cross prediction unit boundaries for Inter; not for Intra -
  – PCM coding with worst-case bit usage limit
Intra-picture Prediction
  – Angular intra prediction (35 directions)
  – Planar intra prediction
Inter-picture prediction
  – Luma motion compensation interpolation: 1/4 sample precision, 8x8 separable with 6 bit tap values -
  – Chroma motion compensation interpolation: 1/8 sample precision, 4x4 separable with 6 bit tap values -
  – Advanced motion vector prediction with motion vector "competition" and "merging" -
Entropy Coding
  – Context adaptive binary arithmetic entropy coding
  – RDOQ on
Picture Storage and Output Precision
  – Main: 8 bit-per-sample storage and output
  – HE10: 10 bit-per-sample storage and output
  – Main Still Picture: 8 bit-per-sample storage and output
In-Loop Filtering
  – Deblocking filter
  – Sample-adaptive offset filter
Level
TABLE. Level limits (MaxCPB in 1000 bits, MaxBR in 1000 bits/s, each given for Main tier and High tier)
Level | Max luma picture size MaxLumaPS (samples) | MaxCPB Main tier | MaxCPB High tier | Max slice segments per picture MaxSliceSegmentsPerPicture | Max # of tile rows MaxTileRows | Max # of tile columns MaxTileCols | Max luma sample rate MaxLumaSR (samples/sec) | MaxBR Main tier | MaxBR High tier | Min compression ratio MinCR
1 | 36 864 | 350 | - | 16 | 1 | 1 | 552 960 | 128 | - | 2
2 | 122 880 | 1 500 | - | 16 | 1 | 1 | 3 686 400 | 1 500 | - | 2
2.1 | 245 760 | 3 000 | - | 20 | 1 | 1 | 7 372 800 | 3 000 | - | 2
3 | 552 960 | 6 000 | - | 30 | 2 | 2 | 16 588 800 | 6 000 | - | 2
3.1 | 983 040 | 10 000 | - | 40 | 3 | 3 | 33 177 600 | 10 000 | - | 2
4 | 2 228 224 | 12 000 | 30 000 | 75 | 5 | 5 | 66 846 720 | 12 000 | 30 000 | 4
4.1 | 2 228 224 | 20 000 | 50 000 | 75 | 5 | 5 | 133 693 440 | 20 000 | 50 000 | 4
5 | 8 912 896 | 25 000 | 100 000 | 200 | 11 | 10 | 267 386 880 | 25 000 | 100 000 | 6
5.1 | 8 912 896 | 40 000 | 160 000 | 200 | 11 | 10 | 534 773 760 | 40 000 | 160 000 | 8
5.2 | 8 912 896 | 60 000 | 240 000 | 200 | 11 | 10 | 1 069 547 520 | 60 000 | 240 000 | 8
6 | 35 651 584 | 60 000 | 240 000 | 600 | 22 | 20 | 1 069 547 520 | 60 000 | 240 000 | 8
6.1 | 35 651 584 | 120 000 | 480 000 | 600 | 22 | 20 | 2 139 095 040 | 120 000 | 480 000 | 8
6.2 | 35 651 584 | 240 000 | 800 000 | 600 | 22 | 20 | 4 278 190 080 | 240 000 | 800 000 | 6
HM ENCODER
Decision‐level for HEVC encoder
• Sequence-level
  – Coding structure (All intra, Low delay, Random access)
  – Profile, tier, level
  – Max/Min CTU size, CU depth
  – Max/Min TU size, TU depth
  – Tool on/off (SAO, deblocking, WPP, tile)
• Picture-level
  – #ref frame, rate control
  – Tile, slice
• Slice- or tile-level
  – Ref frames
  – Deblocking filter parameters
• CTU-level
  – CU partitioning
  – Sample adaptive offset parameters
• CU-level
  – PU and TU partitioning
• PU- & TU-level
  – Prediction modes, motion vectors
  – cbf, coefficients
FIGURE. Decision hierarchy: Sequence, Picture, Slice or Tile, CTU, CU, PU & TU
Coding structure (1/3)
• All intra (AI)
  – Every picture is coded as an instantaneous decoding refresh (IDR) picture
  – No temporal prediction is allowed
  – No QP variation is allowed
FIGURE. All-intra coding structure: pictures 0 to 7 along the time axis are all IDR pictures, each coded with QPI; coding order = POC
Coding structure (2/3)
• Low delay (LD)
  – The first picture shall be coded as an IDR picture
  – Generalized P and B (GPB) pictures shall be used for the other successive pictures
    • A GPB picture shall use only reference pictures whose POC is smaller than that of the current picture (all reference pictures in List_0 and List_1 shall be temporally previous in display order relative to the current picture)
  – The QP of each inter coded picture shall be derived by adding an offset to the QP of the intra coded picture, depending on the temporal layer
FIGURE. Low-delay coding structure: picture 0 is an IDR or intra picture (QPI) and pictures 1 to 8 are GPB (Generalized P and B) pictures; coding order = POC; QPBL1 = QPI+1, QPBL2 = QPI+2, QPBL3 = QPI+3, assigned by temporal depth (0, 1, 2)
Coding structure (3/3)
• Random access (RA)
  – Hierarchical B structure shall be used for coding
  – An IDR intra picture or clean random access (CRA) picture shall be inserted cyclically, about once per second, as a random access point
  – The QP of each inter coded picture shall be derived by adding an offset to the QP of the intra coded picture, depending on the temporal layer
FIGURE. Random-access coding structure: IDR or intra picture, GPB (Generalized P and B) picture, referenced B pictures, and non-referenced B pictures for POC 0 to 8; coding order ≠ POC; QPBL1 = QPI+1, QPBL2 = QPI+2, QPBL3 = QPI+3, QPBL4 = QPI+4, assigned by temporal depth (0, 1, 2, 3)
Picture of HEVC
• Picture : A picture contains an array of luma samples in monochrome format or an array of luma samples and two corresponding arrays of chroma samples in 4:2:0, 4:2:2, and 4:4:4 color format.
– Coding order of coding tree unit (CTU) is raster scan order
Example) Class A (2560×1600) – NebutaFestival CTU size : 64×64 40×25 CTU partition
* CTU & CTB : The CTU consists of a luma coding tree block (CTB) and the corresponding chroma CTBs and syntax elements
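The 40×25 CTU partition in the example follows directly from the picture and CTU sizes. A minimal sketch of that arithmetic, assuming partial CTUs at the right and bottom edges are counted as whole CTUs (the function names are illustrative, not taken from HM):

```python
import math

def ctu_grid(width, height, ctu_size=64):
    """Return (columns, rows) of CTUs needed to cover a picture."""
    cols = math.ceil(width / ctu_size)
    rows = math.ceil(height / ctu_size)
    return cols, rows

def ctu_address_to_xy(addr, cols, ctu_size=64):
    """Map a raster-scan CTU address to the luma position of its top-left sample."""
    return (addr % cols) * ctu_size, (addr // cols) * ctu_size

cols, rows = ctu_grid(2560, 1600)      # Class A example from the slide
print(cols, rows, cols * rows)         # 40 25 1000
```

For a 1920×1080 picture the same function gives 30×17 CTUs, because the last CTU row is only partially covered by the picture.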
Picture partitioning
• A slice is a sequence of coding tree units (CTUs)
• Unlike slices, tiles are always rectangular and always contain an integer number of coding tree units in coding tree unit raster scan
• At least one of the following conditions should be true for each slice and tile in a picture – All CTBs in a slice belong to the same tile, or all CTBs in a tile belong to the same slice
FIGURE. A picture with 40×25 coding tree units that is partitioned into two slices
FIGURE. A picture with 40×25 coding tree units that is partitioned into three tiles
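Overall HM encoding process
FIGURE. Overall HM encoding process (sequence, picture, CTU decisions in a slice or a tile, deblocking filter, SAO; the CTU decisions consist of the CU partitioning decision and the PU & TU partitioning decision, both made through the RDO process)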
Maximum CTU size & CU depth
• Size of CTU is specified in sequence parameter set (SPS)
TABLE. Syntax for size of CTU in SPS
seq_parameter_set_rbsp() { Descriptor
…
log2_min_coding_block_size_minus3 ue(v)
log2_diff_max_min_coding_block_size_minus2 ue(v)
…
}
CTU size: 16x16 ~ 64x64
FIGURE. Example of CU quad-tree structure (a 64×64 CTU split into 32×32, 16×16, and 8×8 CUs)
64 CU : split_coding_unit_flag(1)
32 CU : split_coding_unit_flag(0)
32 CU : split_coding_unit_flag(1)
16 CU : split_coding_unit_flag(0)
16 CU : split_coding_unit_flag(0)
16 CU : split_coding_unit_flag(1)
…
Coding unit (CU) decision
• Coding unit quad‐tree structure – Starting from CTU, each CU can be split into 4 smaller CUs
FIGURE. RD evaluation of the CU quad-tree (CU sizes 64×64, 32×32, 16×16, and 8×8, numbered in evaluation order): the best CU RD-cost is calculated at each CU level, and the best CU then competes with its sub-partitioned CUs
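The quad-tree competition between a CU and its four sub-CUs can be sketched as a short recursion. This is a simplified illustration, not the HM implementation: `rd_cost` is a hypothetical callback that returns the best RD cost of coding a block as one unsplit CU, and the bits of the split flag are ignored.

```python
def best_cu_cost(x, y, size, min_size, rd_cost):
    """Return (cost, tree) for the CU at (x, y).

    tree is either the CU size (leaf) or a list of four sub-trees.
    rd_cost(x, y, size) is a hypothetical callback (an assumption of this sketch).
    """
    unsplit = rd_cost(x, y, size)
    if size <= min_size:
        return unsplit, size
    half = size // 2
    split_cost, children = 0, []
    for dy in (0, half):
        for dx in (0, half):
            cost, tree = best_cu_cost(x + dx, y + dy, half, min_size, rd_cost)
            split_cost += cost
            children.append(tree)
    # Competition of the unsplit CU against its sub-partitioned CUs
    if split_cost < unsplit:
        return split_cost, children
    return unsplit, size

def toy_cost(x, y, size):
    """Toy cost: blocks larger than 8x8 that cover sample (5, 5) are 4x more expensive."""
    covers = x <= 5 < x + size and y <= 5 < y + size
    return size * size * (4 if covers and size > 8 else 1)

cost, tree = best_cu_cost(0, 0, 64, 8, toy_cost)
print(cost, tree)  # 4096 [[[8, 8, 8, 8], 16, 16, 16], 32, 32, 32]
```

With the toy cost, only the region around the "detailed" sample is split down to 8×8, which mirrors the kind of uneven partitioning shown in the CU quad-tree figures.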
Prediction unit (PU) types
• 2 PU types for Intra prediction
  – 2N×2N (smallest CU: additionally N×N)
• 8 PU types for Inter prediction
  – Smallest CU:
    • 8x8: 2N×2N, N×2N, 2N×N
    • Others: 2N×2N, N×2N, 2N×N, N×N
  – Others: 2N×2N, N×2N, 2N×N, nL×2N, nR×2N, 2N×nU, 2N×nD
FIGURE. PU partitions in HEVC (2N×2N, N×2N, 2N×N, N×N, 2N×nU, 2N×nD, nL×2N, nR×2N)
Prediction unit (PU) types
FIGURE. Candidate PU types selected from the current CU size, the SCU size, and the AMP enable flag:
  – Cur. CU size != SCU size, AMP enabled: Intra 2N×2N, Inter 2N×2N, Inter 2N×N, Inter N×2N, Inter AMP
  – Cur. CU size != SCU size, AMP disabled: Intra 2N×2N, Inter 2N×2N, Inter 2N×N, Inter N×2N
  – Cur. CU size == SCU size, CU size == 8×8: Intra 2N×2N, Inter 2N×2N, Inter 2N×N, Inter N×2N, Intra N×N
  – Cur. CU size == SCU size, CU size != 8×8: Intra 2N×2N, Inter 2N×2N, Inter 2N×N, Inter N×2N, Intra N×N, Inter N×N
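The flowchart can be read as a small decision function. The sketch below is one reading of the flowchart labels (AMP is only tried above the SCU size, N×N only at the SCU size, and Inter N×N is excluded for 8×8 CUs); the function name and the mode strings are illustrative.

```python
def candidate_pu_modes(cu_size, scu_size, amp_enabled):
    """List the PU types an encoder would evaluate for one CU (sketch of the flowchart)."""
    modes = ["Intra 2Nx2N", "Inter 2Nx2N", "Inter 2NxN", "Inter Nx2N"]
    if cu_size != scu_size:
        # Larger than the smallest CU: asymmetric partitions are the only extra option
        if amp_enabled:
            modes.append("Inter AMP")
        return modes
    # Smallest CU: NxN partitions become available
    modes.append("Intra NxN")
    if cu_size != 8:
        modes.append("Inter NxN")
    return modes

print(candidate_pu_modes(32, 8, True))
print(candidate_pu_modes(8, 8, True))
```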
Maximum TU size & TU depth
• The root of the TU quad-tree is the CU to which the TU belongs
• Available transform block sizes and the maximum transform hierarchy depth are specified in the SPS
FIGURE. TU quad‐tree structure in HEVC
TABLE. Syntax for size of TU in SPS
seq_parameter_set_rbsp() { Descriptor
…
log2_min_transform_block_size_minus_2 ue(v)
log2_diff_max_min_transform_block_size ue(v)
…
max_transform_hierarchy_depth_inter ue(v)
max_transform_hierarchy_depth_intra ue(v)
…
}
Transform block size: 4x4 ~ 32x32
Transform hierarchy depth: maximum 3 levels (0, 1, or 2)
RDO process to decide PU & TU
FIGURE. Compress CU flow: the modes 2N×2N merge skip, Inter 2N×2N, Inter N×2N, Inter 2N×N, Inter AMP, Intra 2N×2N, Intra N×N, and Intra PCM are evaluated; if CU size ≤ SCU the process finishes, otherwise Compress CU is called for each of the four sub-CUs
Intra prediction flow
• Prediction modes
  – Luma
    • DC, Planar, Angular prediction (33 directions)
  – Chroma
    • DC, planar, ver, hor, DM
• Filtering
  – MDIS (Mode dependent intra smoothing)
  – DC filtering, Ver/Hor filtering
• 3 MPM
FIGURE. Flow chart - Intra prediction (blocks: 2N×2N PU, reference sample padding, MDIS, intra prediction, loop while Mode < 35, best mode decision, output RD-cost and Intra_mode)
Cost functions used in the flow chart:
$$J_{mode} = SSE_{mode} + \lambda \cdot B_{mode}$$
$$J_{pred,SATD} = SATD_{pred} + \lambda \cdot B_{pred}$$
FIGURE. Directions and modes of HEVC intra prediction
Fast intra prediction & TU decision in HM
• Intra prediction step in HM
  1) Rough prediction mode decision
    • 35 predictions
    • Select N prediction modes
      – Distortion (SATD) + lambda * mode bits
    • # of candidate prediction modes : N modes + MPM (3)
  2) Best intra prediction mode decision with transform
    • Transform (RQT depth = 1)
    • 1 best intra mode decision
  3) Best RQT decision with RD costs
    • RQT depth = 3
FIGURE. Candidate reduction: 35 modes -> N modes + MPM -> 1 best mode -> best mode RD-cost
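Step 1 (rough mode decision) can be sketched as a ranking by SATD plus lambda times mode bits, followed by adding the most probable modes. The inputs here (per-mode SATD and bit estimates, the value of N, and the MPM list) are assumed to be given; in an encoder they come from actual prediction and entropy-coder estimates.

```python
def rough_mode_candidates(satd, mode_bits, lam, n, mpm):
    """Pick N modes by rough cost, then append any MPM not already selected.

    satd, mode_bits: dicts mapping each intra mode (0..34) to its SATD / bit estimate.
    """
    cost = {m: satd[m] + lam * mode_bits[m] for m in satd}
    # Ties are broken by the smaller mode index so the result is deterministic
    ranked = sorted(cost, key=lambda m: (cost[m], m))
    candidates = ranked[:n]
    for m in mpm:
        if m not in candidates:
            candidates.append(m)
    return candidates

# Toy statistics: the block is best predicted around mode 26
mpm = [0, 1, 26]
satd = {m: abs(m - 26) * 10 for m in range(35)}
bits = {m: 2 if m in mpm else 6 for m in range(35)}
print(rough_mode_candidates(satd, bits, lam=1.0, n=3, mpm=mpm))  # [26, 25, 27, 0, 1]
```

Only the returned candidates go on to step 2, where the transform is actually applied, which is where the time saving comes from.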
Inter prediction
• Skip: Merge skip
• Non-skip
  – Uni-directional prediction
  – Bi-directional prediction
  – Half-pel/Quarter-pel motion refinement
    • DCT-IF (8tap/4tap)
  – Merge
FIGURE. Flow chart - Inter prediction (for the current CU: merge skip; Inter 2N×2N, Inter 2N×N, Inter N×2N, and AMP (nL×2N, nR×2N, 2N×nU, 2N×nD), each evaluated with uni-directional prediction, bi-directional prediction, and merge with spatial, temporal, and additional candidate derivation; RD-cost calculation and best mode decision output the RD-cost and the best mode)
Inter prediction
• Inter coding mode
  – Merge skip mode (CU level)
    • skip_flag=1 and merge_idx
    • No reference index
    • No motion vector
    • No residual
  – Merge mode (PU level)
    • skip_flag=0, pred_mode_flag, and part_mode
    • merge_flag=1 and merge_idx
    • No reference index and motion vector
    • no_residual_syntax_flag: Residual is encoded or not
  – General PU modes
    • skip_flag=0, pred_mode_flag, and part_mode
    • merge_flag=0
    • ref_idx_lx and mvp_lx_flag based on AMVP (x=0 or 1)
    • MVD is encoded
    • no_residual_syntax_flag: Residual is encoded or not
Inter prediction flow
BEGIN
input : current PU part mode for a CU
FOR PU partition
    FOR List = 0 to 1 DO
        FOR 0 to refidx DO
            Motion estimation (diamond search, SR : 64)
            Decide best RD-cost for uni-prediction
        ENDFOR
    ENDFOR
    IF bi-directional prediction THEN
        FOR iteration = 0 to 3 DO
            FOR 0 to refidx DO
                Motion estimation (full search, SR : 4)
                Decide best RD-cost for bi-prediction
            ENDFOR
        ENDFOR
    ENDIF
ENDFOR
Merge
RD-cost competition among uni/bi-prediction and merge
END
output : inter prediction syntax
FIGURE. Pseudo code - Inter prediction flow
• Fast encoder decision (FEN)
  – Sub-sampled SAD for integer ME
    • Use sub-sampled SAD when rows > 8 for integer ME
  – Only 1 iteration for bi-predictive motion search
    • default number : 4
• Fast Decision for Merge RD cost (FDM)
  – After merge with merge idx X, if all cbf are zero, the merge process is terminated
FIGURE. Uni-prediction and bi-prediction of the current PU from LIST_0 and LIST_1 reference pictures along the time axis
Bi‐prediction
• Search P0 and P1 that produce the minimum error with respect to O
  – R = (O – P), where P = (P0 + P1) / 2
• Practical bi-predictive search
  1) Search P1 that produces the minimum 2R with (2O – P0)
     R = O – (P0 + P1) / 2 ⇒ 2R = (2O – P0) – P1
  2) Search P0 that produces the minimum 2R with (2O – P1)
     R = O – (P0 + P1) / 2 ⇒ 2R = (2O – P1) – P0
• BipredSearchRange : 4
• FEN : 1 (iteration : 1)
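The alternating search can be illustrated on 1-D signals. The sketch below is a toy model, not HM code: blocks are slices of a list, the cost is plain SAD without motion-vector bits, and each iteration refines one predictor while the other is fixed (starting with P1, which is an assumption).

```python
def sad(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def refine(target2, ref, center, n, search_range):
    """Find the block position in ref that best matches target2 = 2O - (other predictor)."""
    lo = max(0, center - search_range)
    hi = min(len(ref) - n, center + search_range)
    return min(range(lo, hi + 1), key=lambda p: (sad(target2, ref[p:p + n]), p))

def bipred_search(orig, ref0, ref1, pos0, pos1, search_range=4, iterations=4):
    """Alternately refine P1 (with P0 fixed) and P0 (with P1 fixed)."""
    n = len(orig)
    for it in range(iterations):
        if it % 2 == 0:
            p0 = ref0[pos0:pos0 + n]
            target2 = [2 * o - a for o, a in zip(orig, p0)]   # 2R = (2O - P0) - P1
            pos1 = refine(target2, ref1, pos1, n, search_range)
        else:
            p1 = ref1[pos1:pos1 + n]
            target2 = [2 * o - b for o, b in zip(orig, p1)]   # 2R = (2O - P1) - P0
            pos0 = refine(target2, ref0, pos0, n, search_range)
    return pos0, pos1

orig = [10, 20, 30, 40]
ref0 = [0, 0, 0, 8, 18, 28, 38, 0, 0, 0, 0, 0]     # best P0 sits at position 3
ref1 = [0, 0, 0, 0, 0, 12, 22, 32, 42, 0, 0, 0]    # best P1 sits at position 5
print(bipred_search(orig, ref0, ref1, pos0=2, pos1=4))  # (3, 5)
```

Working with 2O minus the fixed predictor lets each refinement reuse an ordinary single-reference block-matching search.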
Example) Bi-prediction
FIGURE. Example of the bi-predictive search between the List 0 reference, the current frame (O), and the List 1 reference: uni-directional prediction with search range 64 gives P0 and P1; bi-directional prediction then runs iterations 1 to 4 with BipredSearchRange : 4, updating one predictor at a time (P0, P1 -> P0, P1' -> P0', P1' -> P0', P1''), each search using 2O minus the other predictor as its target
Motion estimation (Integer‐pel)
• Practical motion estimation (diamond search)
  – First search & early termination
    • Max 3 (default) more rounds after a recent best match
  – Raster refinement search
    • If the integer-pel distance is bigger than 5, then conduct the raster refinement search
  – Star refinement search & early termination
    • Diamond search centered on the best match from the earlier two steps
    • Max 2 rounds after the best match
FIGURE. Raster refinement search
FIGURE. First search & star refinement (diamond pattern with search points at distances 0, 1, 2, 3, … from the center)
Motion estimation (Sub‐pel refinement)
• Integer-pel motion search
  – Cost function : SAD
• Sub-pel motion refinement
  – Cost function : SATD
  – Half-pel refinement
  – Quarter-pel refinement
FIGURE. Integer-pel motion search (within the search range)
FIGURE. Half-pel motion search
FIGURE. Quarter-pel motion search
Interpolation
• DCT-IF in HEVC
  – Fixed 8-tap (7-tap) and 4-tap interpolation filters based on DCT
  – 2D separable filter
    • 8*Horizontal 1D filter + 1*Vertical 1D filter
TABLE. Interpolation filter coefficients
Component | α | Filter(α)
Luma | 1/4 | {-1, 4, -10, 58, 17, -5, 1, 0}
Luma | 1/2 | {-1, 4, -11, 40, 40, -11, 4, -1}
Chroma | 1/8 | {-2, 58, 10, -2}
Chroma | 1/4 | {-4, 54, 16, -2}
Chroma | 3/8 | {-6, 46, 28, -4}
Chroma | 1/2 | {-4, 36, 36, -4}
FIGURE. Integer and fractional sample positions for luma and chroma interpolation
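Each filter in the table sums to 64, so a 1-D interpolation is a weighted sum followed by a rounding shift by 6. The sketch below applies the luma filters to one row of 8-bit samples. It is simplified on purpose: border samples are repeated for padding, and the result is rounded straight back to 8 bits, whereas a real codec keeps higher intermediate precision between the horizontal and vertical passes.

```python
LUMA_FILTERS = {
    1: [-1, 4, -10, 58, 17, -5, 1, 0],    # quarter-sample position
    2: [-1, 4, -11, 40, 40, -11, 4, -1],  # half-sample position
}

def interpolate_row(samples, frac):
    """Interpolate position i + frac/4 between samples[i] and samples[i+1] for every i."""
    taps = LUMA_FILTERS[frac]
    n = len(samples)
    out = []
    for i in range(n - 1):
        acc = 0
        for k, c in enumerate(taps):
            j = min(max(i - 3 + k, 0), n - 1)   # taps cover samples i-3 .. i+4
            acc += c * samples[j]
        out.append(min(max((acc + 32) >> 6, 0), 255))  # normalize by 64 with rounding
    return out

ramp = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
print(interpolate_row(ramp, 2)[4])  # 45, halfway between 40 and 50
print(interpolate_row(ramp, 1)[4])  # 42, a quarter of the way from 40 to 50
```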
Example of PU decision
FIGURE. Compress CU flow with an example of the PU decision
Example
  – Comparison without TU decision and without reconstruction (RD-cost = SAD/SATD + λ*Bmode)
    • Uni-prediction RD-cost = 12000
    • Bi-prediction RD-cost = 9000
    • Merge RD-cost = 11000
  – With TU decision and reconstruction (RD-cost = SSE + λ*Bmode)
    • Bi-prediction RD-cost = 8500
TU decision flow (Inter)
• Residual quad-tree
FIGURE. TU decision for inter PUs (2N×2N, N×2N, 2N×N, N×N, 2N×nU, 2N×nD, nL×2N, nR×2N): residual = original - predictor; for TU depth 0, 1, 2, … the residual goes through T/Q and IT/IQ (recon), and the RD-cost (SSE + λ*Bmode) is computed
TU decision flow (Intra)
• Example) intra_pred_mode = 26 (vertical mode)
FIGURE. TU decision for intra prediction: at TU depth N, intra prediction from the reference samples along the prediction direction, followed by residual, T/Q, IT/IQ, and RD-cost (SSE + λ*Bmode); at TU depth N+1, the reference samples of a lower block are taken after the block above is reconstructed
Transform
• Implementation of transform in HEVC
  – Matrix multiplication
    • Straightforward/Few code lines
    • Huge number of operations, but SIMD friendly
  – Partial butterfly implementation
    • Utilizes symmetry/anti-symmetry properties of basis vectors
    • Fewer multiplications/additions
    • Increased number of code lines
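The two implementation styles can be compared on a 4-point transform. The sketch below uses the 4-point integer core matrix with coefficients 64, 83, and 36 (stated here from memory of the HEVC core transform, so treat the values as an assumption) and omits the scaling shifts applied after each stage in a real codec.

```python
DCT4 = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]

def dct4_matrix(x):
    """Straightforward matrix multiplication: 16 multiplications."""
    return [sum(c * v for c, v in zip(row, x)) for row in DCT4]

def dct4_butterfly(x):
    """Partial butterfly: even rows are symmetric, odd rows anti-symmetric."""
    e0, e1 = x[0] + x[3], x[1] + x[2]   # even part (sums)
    o0, o1 = x[0] - x[3], x[1] - x[2]   # odd part (differences)
    return [
        64 * (e0 + e1),
        83 * o0 + 36 * o1,
        64 * (e0 - e1),
        36 * o0 - 83 * o1,
    ]

x = [12, -7, 30, 5]
print(dct4_matrix(x) == dct4_butterfly(x))  # True
```

The butterfly needs 6 multiplications instead of 16 for the same result, at the price of code that is specific to each transform size.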
Partitioning syntax for a CTU
Syntax
FIGURE. Example of CU quad-tree structure (a 64×64 CTU split into 32×32, 16×16, and 8×8 CUs)
64 CU : split_coding_unit_flag(1)
32 CU : split_coding_unit_flag(0)
32 CU : split_coding_unit_flag(1)
16 CU : split_coding_unit_flag(0)
16 CU : split_coding_unit_flag(0)
16 CU : split_coding_unit_flag(1)
…
Each coded CU carries "PU partition & Pred_mode info" followed by "TU split flags & Coefficients":
  – SKIP flag (merge idx)
  – Prediction mode flag (intra or inter)
  – PU part size (2Nx2N, 2NxN, Nx2N, NxN, AMP)
  – Prediction info. (Intra mode, or mv and ref. idx., merge idx, AMVP idx)
FIGURE. Example of TU quad-tree structure
32x32 TU : split flag (1)
16x16 TU : split flag (0) <16x16 cbf, coefficients>
16x16 TU : split flag (1)
8x8 TU : split flag(0) <8x8 cbf, coefficients>
8x8 TU : split flag(0) <8x8 cbf, coefficients>
8x8 TU : split flag(0) <8x8 cbf, coefficients>
8x8 TU : split flag(0) <8x8 cbf, coefficients>
16x16 TU : split flag (1)
8x8 TU : split flag(0) <8x8 cbf, coefficients>
8x8 TU : split flag (1)
4x4 TU : split flag(0) <4x4 cbf, coefficients>
4x4 TU : split flag(0) <4x4 cbf, coefficients>
4x4 TU : split flag(0) <4x4 cbf, coefficients>
4x4 TU : split flag(0) <4x4 cbf, coefficients>
…
In‐loop filter
• In HEVC, two processing steps, a deblocking filter (DBF) and a sample adaptive offset (SAO) operation, are applied
  – DBF: similar to the DBF of the H.264/AVC standard
  – SAO: applied adaptively to all samples satisfying certain conditions (while the DBF is only applied to the samples located at block boundaries)
• On/off syntaxes for in-loop filters
  1. slice_disable_deblocking_filter_flag : slice-level on/off
  2. sample_adaptive_offset_enabled_flag : slice-level on/off
Deblocking filter (DBF)
• Basically, the deblocking filter of HEVC is similar to that of H.264/AVC
  – In-loop filtering
    • Coding performance for inter frame
    • Frame-based filtering
    • On/off control is provided
  – Adaptive filtering
    • boundary strength
  – Filtering on the block boundaries
    • transform and prediction boundary
  – Sequential filtering for vertical and horizontal edges
    • Sample values modified during filtering of vertical edges are used as input for the filtering of the horizontal edges
Deblocking filter (DBF)
• Features of HEVC deblocking filter compared to H.264/AVC
  – For the TUs and PUs with edges less than 8 samples in either vertical or horizontal direction, only the edges lying on the 8×8 sample grid are filtered
FIGURE. Filtered edges of a 16x16 coding unit in H.264/AVC and in HEVC; in both, (1) vertical edges -> horizontal filtering, then (2) horizontal edges -> vertical filtering
(a) H.264/AVC (b) HEVC
FIGURE. Derivation process for the boundary filter strength in AVC and HEVC
Processing flow of DBF
• Boundary decision
  – Three kinds of boundaries are involved in the filtering
    • CU, TU, PU boundary
  – CU boundaries are always involved in the filtering
  – TU boundaries on the 8×8 block grid and PU boundaries between PUs inside a CU are involved in the filtering
    • [Exception] A PU boundary that lies inside a TU shall not be filtered
• Bs calculation
  – Bs is calculated on a 4×4 block basis -> re-mapped to the 8×8 grid
  – Two Bs values belong to the 8 pixels that form a line in the 4×4 grid; the maximum Bs is selected as the Bs for boundaries in the 8×8 grid
FIGURE. Overall processing flow of deblocking filter process (boundary decision -> Bs calculation (4×4 -> 8×8) -> β, tc decision -> filter on/off decision -> strong/weak filter selection -> strong filtering or weak filtering)
• Filter on/off decision for 4 lines
  – If dp0+dq0+dp3+dq3 < β, filtering for the first 4 lines is turned on
  – Derive values for the weak filtering process
  – For the second 4 lines, the decision is made in the same fashion as above
Processing flow of DBF
• β and tc decision
  – Threshold values β and tc are derived based on luma QPP and QPQ
  – Q = ((QPP + QPQ + 1)>>1)
  – β and tc are read from the table below with Q as input (if Bs > 1, tc is read from the table with Clip3(0, 55, Q+2) as input)
Q 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
β 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 7 8
tc 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
Q 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
β 9 10 11 12 13 14 15 16 17 18 20 22 24 26 28 30 32 34 36
tc 1 1 1 1 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4
Q 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
β 38 40 42 44 46 48 50 52 54 56 58 60 62 64 64 64 64 64
tc 5 5 6 6 7 8 9 9 10 10 11 11 12 12 13 13 14 14
FIGURE. Samples across an edge segment, p3,i p2,i p1,i p0,i | q0,i q1,i q2,i q3,i for lines i = 0 to 7 (lines 0 to 3: first 4 lines; lines 4 to 7: second 4 lines)
dEp1 = (dp0+dp3 < (β+(β>>1))>>3) ? 1 : 0
dEq1 = (dq0+dq3 < (β+(β>>1))>>3) ? 1 : 0
dp0 = | p2,0 – 2*p1,0 + p0,0 |
dp3 = | p2,3 – 2*p1,3 + p0,3 |
dp4 = | p2,4 – 2*p1,4 + p0,4 |
dp7 = | p2,7 – 2*p1,7 + p0,7 |
dq0 = | q2,0 – 2*q1,0 + q0,0 |
dq3 = | q2,3 – 2*q1,3 + q0,3 |
dq4 = | q2,4 – 2*q1,4 + q0,4 |
dq7 = | q2,7 – 2*q1,7 + q0,7 |
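The on/off decision and the dEp1/dEq1 flags only need the second differences of the first and fourth line of a 4-line segment. A minimal sketch, assuming each line is stored as [p3, p2, p1, p0, q0, q1, q2, q3] and that β has already been looked up (helper names are illustrative):

```python
def second_diff(line, side):
    """|x2 - 2*x1 + x0| on the P side or the Q side of the edge."""
    # line = [p3, p2, p1, p0, q0, q1, q2, q3]
    if side == "p":
        return abs(line[1] - 2 * line[2] + line[3])
    return abs(line[6] - 2 * line[5] + line[4])

def filter_decision(lines, beta):
    """lines: the four sample lines of one segment; returns (filter_on, dEp1, dEq1)."""
    dp0, dp3 = second_diff(lines[0], "p"), second_diff(lines[3], "p")
    dq0, dq3 = second_diff(lines[0], "q"), second_diff(lines[3], "q")
    filter_on = dp0 + dq0 + dp3 + dq3 < beta
    side_threshold = (beta + (beta >> 1)) >> 3
    dEp1 = 1 if dp0 + dp3 < side_threshold else 0
    dEq1 = 1 if dq0 + dq3 < side_threshold else 0
    return filter_on, dEp1, dEq1

smooth_step = [[10, 10, 10, 10, 20, 20, 20, 20]] * 4   # flat on both sides: blocking artifact
texture = [[10, 30, 10, 30, 20, 40, 20, 40]] * 4       # busy signal: likely real detail
print(filter_decision(smooth_step, beta=16))  # (True, 1, 1)
print(filter_decision(texture, beta=16))      # (False, 0, 0)
```

The second differences are small when both sides are flat, so the filter is turned on for a clean step and left off where the signal already varies strongly.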
Processing flow of DBF
• Strong/weak filter selection for 4 lines
  – If the following 2 conditions are both met, the strong filter is used
  – Otherwise, the weak filter is used
  – Conditions for first 4 lines
    1) 2*(dp0+dq0) < ( β >> 2 ), | p3,0 – p0,0 | + | q0,0 – q3,0 | < ( β >> 3 ) and | p0,0 – q0,0 | < ( 5*tC + 1 ) >> 1
    2) 2*(dp3+dq3) < ( β >> 2 ), | p3,3 – p0,3 | + | q0,3 – q3,3 | < ( β >> 3 ) and | p0,3 – q0,3 | < ( 5*tC + 1 ) >> 1
  – Conditions for second 4 lines
    1) 2*(dp4+dq4) < ( β >> 2 ), | p3,4 – p0,4 | + | q0,4 – q3,4 | < ( β >> 3 ) and | p0,4 – q0,4 | < ( 5*tC + 1 ) >> 1
    2) 2*(dp7+dq7) < ( β >> 2 ), | p3,7 – p0,7 | + | q0,7 – q3,7 | < ( β >> 3 ) and | p0,7 – q0,7 | < ( 5*tC + 1 ) >> 1
• Strong/weak filtering
  – Strong filtering
    p0' = ( p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4 ) >> 3
    q0' = ( p1 + 2*p0 + 2*q0 + 2*q1 + q2 + 4 ) >> 3
    p1' = ( p2 + p1 + p0 + q0 + 2 ) >> 2
    q1' = ( p0 + q0 + q1 + q2 + 2 ) >> 2
    p2' = ( 2*p3 + 3*p2 + p1 + p0 + q0 + 4 ) >> 3
    q2' = ( p0 + q0 + q1 + 3*q2 + 2*q3 + 4 ) >> 3
  – Weak filtering
    Δ = ( 9 * ( q0 – p0 ) – 3 * ( q1 – p1 ) + 8 ) >> 4
    When abs(Δ) is less than tC*10:
      Δ = Clip3( -tC, tC, Δ )
      p0' = Clip1Y( p0 + Δ )
      q0' = Clip1Y( q0 - Δ )
      If dEp1 is equal to 1:
        Δp = Clip3( -( tC >> 1 ), tC >> 1, ( ( ( p2 + p0 + 1 ) >> 1 ) – p1 + Δ ) >> 1 )
        p1' = Clip1Y( p1 + Δp )
      If dEq1 is equal to 1:
        Δq = Clip3( -( tC >> 1 ), tC >> 1, ( ( ( q2 + q0 + 1 ) >> 1 ) – q1 – Δ ) >> 1 )
        q1' = Clip1Y( q1 + Δq )
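The weak filter can be written out for one line of samples. The sketch below follows the weak-filtering equations of this slide, assuming 8-bit samples for Clip1Y and a line stored as [p3, p2, p1, p0, q0, q1, q2, q3]; tC, dEp1, and dEq1 are taken as inputs.

```python
def clip3(lo, hi, v):
    return max(lo, min(hi, v))

def clip1(v, bit_depth=8):
    return clip3(0, (1 << bit_depth) - 1, v)

def weak_filter(line, tc, dEp1, dEq1):
    """Apply the weak deblocking filter to one line across the edge."""
    p3, p2, p1, p0, q0, q1, q2, q3 = line
    delta = (9 * (q0 - p0) - 3 * (q1 - p1) + 8) >> 4
    if abs(delta) >= tc * 10:
        return list(line)              # step too large: treated as a real edge
    delta = clip3(-tc, tc, delta)
    new_p0, new_q0 = clip1(p0 + delta), clip1(q0 - delta)
    new_p1, new_q1 = p1, q1
    if dEp1:
        dp = clip3(-(tc >> 1), tc >> 1, (((p2 + p0 + 1) >> 1) - p1 + delta) >> 1)
        new_p1 = clip1(p1 + dp)
    if dEq1:
        dq = clip3(-(tc >> 1), tc >> 1, (((q2 + q0 + 1) >> 1) - q1 - delta) >> 1)
        new_q1 = clip1(q1 + dq)
    return [p3, p2, new_p1, new_p0, new_q0, new_q1, q2, q3]

print(weak_filter([100, 100, 100, 100, 108, 108, 108, 108], tc=4, dEp1=1, dEq1=1))
# [100, 100, 101, 103, 105, 106, 108, 108]
```

The step of 8 across the edge is softened into a ramp, while a step of 200 with the same tC would be left untouched.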
Overview of sample adaptive offset (1/2)
• Artifacts
  – Blocking artifacts, ringing artifacts, color biases, and blurring artifacts
  – A larger transform could introduce more artifacts
    • HEVC : 4x4 ~ 32x32 transform
    • Artifacts exist at medium and low bit rates
  – A large number of interpolation taps can also lead to more serious ringing artifacts
    • HEVC : 8-tap (luma), 4-tap (chroma)
• Sample adaptive offset
  – To reduce sample distortion (reconstructed pixels ↔ original pixels)
  – Average 3.5% BD-rate reduction (with 1% encoding time increase, 2.5% decoding time increase)
  – SAO is located after DF and also belongs to in-loop filtering
Overview of sample adaptive offset (2/2)
• SAO features
  – Each color component may have its own SAO parameters
  – Two SAO types
    • Edge offset (EO; 4 EO classes)
    • Band offset (BO; 1 BO class)
  – SAO merging (left CTU or above CTU)
    • SAO merge information is shared for three color components
• SAO objective and subjective results
FIGURE. Subjective comparison: SAO is enabled (QP=32) vs. SAO is disabled (QP=32)
TABLE. Y BD-rate (Anchor: Disabling SAO; Test: Enabling SAO; CTU size in Luma: 64x64; CTU Boundary: option 1)
Class | All intra (AI) | Random access (RA) | Low delay B (LB) | Low delay P (LP)
Class A | -0.6% | -2.3% | |
Class B | -0.5% | -2.1% | -2.0% | -11.1%
Class C | -0.5% | -1.1% | -1.8% | -7.1%
Class D | -0.4% | -0.3% | -0.7% | -4.4%
Class E | -0.6% | | -2.3% | -11.0%
Class F | -1.5% | -2.6% | -5.7% | -12.3%
All | -0.7% | -1.7% | -2.5% | -9.2%
Enc. Time(%) | 101% | 100% | 100% | 100%
Dec. Time(%) | 103% | 103% | 102% | 102%
Edge offset of SAO
• Four 1-D directional patterns
  – horizontal, vertical, 135° diagonal, 45° diagonal
• Only one EO class can be selected for each CTB for which EO is enabled
• Each sample inside the CTB is classified into one of five categories
  – One edge offset is encoded for each category (4 offsets are transmitted in the case of EO)
  – No information is signaled for the classification into five categories (encoder and decoder use the same rules)
FIGURE. Four 1-D directional patterns for EO sample classification (c: current sample; a, b: its two neighbors along the pattern)
TABLE. Sample classification rules for edge offset
Category | Condition
1 | c < a && c < b
2 | (c < a && c == b) || (c == a && c < b)
3 | (c > a && c == b) || (c == a && c > b)
4 | c > a && c > b
0 | None of the above (SAO is not applied)
FIGURE. Pixel levels at pixel indices x-1, x, x+1 for categories 1 to 4 (categories 1 and 2: positive edge offset; categories 3 and 4: negative edge offset)
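The classification table translates directly into code. The sketch below classifies each sample of one row with the horizontal pattern and adds the offset of its category; leaving the first and last sample untouched and clipping to 8 bits are simplifying assumptions of the example.

```python
def eo_category(a, c, b):
    """Edge-offset category of sample c with neighbors a and b (0 means no offset)."""
    if c < a and c < b:
        return 1
    if (c < a and c == b) or (c == a and c < b):
        return 2
    if (c > a and c == b) or (c == a and c > b):
        return 3
    if c > a and c > b:
        return 4
    return 0

def apply_edge_offset_row(row, offsets):
    """Horizontal EO class on one row; offsets maps category 1..4 to an offset."""
    out = list(row)
    for x in range(1, len(row) - 1):
        # Neighbors are read from the unmodified input row
        cat = eo_category(row[x - 1], row[x], row[x + 1])
        if cat:
            out[x] = min(255, max(0, row[x] + offsets[cat]))
    return out

offsets = {1: 2, 2: 1, 3: -1, 4: -2}   # valleys are raised, peaks are lowered
print(apply_edge_offset_row([10, 5, 10, 15, 10], offsets))  # [10, 7, 10, 13, 10]
```

Because the category is derived from the reconstructed samples themselves, the decoder can repeat the classification without any side information beyond the four offsets and the EO class.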
Band offset of SAO
• BO implies one offset is added to all samples of the same band
  – The sample value range is equally divided into 32 bands
  – For 8-bit samples ranging from 0 to 255, the width of a band is 8
• Only offsets of four consecutive bands and the starting band position are signaled to the decoder
  – The average difference between the original samples and reconstructed samples in a band is signaled to the decoder
  – Four offsets are transmitted in the case of BO
FIGURE. Sample value range from 0 to max divided into bands; offsets are transmitted for four consecutive bands, starting at the first band for which an offset is transmitted
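With 32 equal bands, the band index of a sample is simply its five most significant bits. A minimal sketch of applying band offsets on the decoder side, assuming the four signaled bands do not wrap around past band 31 (function and parameter names are illustrative):

```python
def apply_band_offset(samples, band_position, offsets, bit_depth=8):
    """Add offsets[k] to samples in band band_position + k, for k = 0..3."""
    shift = bit_depth - 5              # 32 bands: band index = top 5 bits
    max_val = (1 << bit_depth) - 1
    out = []
    for s in samples:
        k = (s >> shift) - band_position
        if 0 <= k < 4:
            s = min(max_val, max(0, s + offsets[k]))
        out.append(s)
    return out

# Bands 1..4 (sample values 8..39) receive offsets; everything else is unchanged
print(apply_band_offset([0, 7, 8, 15, 16, 40, 255], band_position=1, offsets=[3, -2, 0, 5]))
# [0, 7, 11, 18, 14, 40, 255]
```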
SAO syntaxes
• Merging information
  – sao_merge_left_flag
  – sao_merge_up_flag
  – No additional data is transferred in the merge case
• sao_type_idx_X: type of SAO
  – 0: Not applied
  – 1: Band offset
  – 2: Edge offset
• sao_offset_abs, sao_offset_sign
  – Sign is only for band offset
  – Signs of EO are implicitly derived
• sao_band_position (if BO)
  – Start band for SAO
• sao_eo_class (if EO)
  – Class of edge offset (1-D degree)
TABLE. Syntax for sample adaptive offset
sao (rx, ry) { Descriptor
  …
  sao_merge_left_flag ae(v)
  sao_merge_up_flag ae(v)
  if(!sao_merge_up_flag && !sao_merge_left_flag) {
    …
    sao_type_idx_luma ae(v)
    sao_type_idx_chroma ae(v)
    if( sao_type_idx != 0) {
      for(i=0; i<4; i++)
        sao_offset_abs[cIdx][rx][ry][i] ae(v)
      if(sao_type_idx == BO) {
        for(i=0; i<4; i++) {
          if( sao_offset_abs[cIdx][rx][ry][i] != 0)
            sao_offset_sign[cIdx][rx][ry][i] ae(v)
        }
        sao_band_position[cIdx][rx][ry] ae(v)
      } else {
        sao_eo_class_luma ae(v)
      }
    }
  …
}
A fast distortion estimation for SAO
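• Distortions have to be calculated many times
• Let k, s(k), and x(k) be sample positions, original samples, and pre-SAO samples, respectively, and let C be the set of sample positions that share one offset
  – Distortion between original samples and pre-SAO samples
    $$D_{pre} = \sum_{k \in C} \left( s(k) - x(k) \right)^2$$
  – Distortion between original samples and post-SAO samples
    $$D_{post} = \sum_{k \in C} \left( s(k) - (x(k) + h) \right)^2$$
  – h is the offset for the sample set and N is the number of samples in the set; the delta distortion is defined as follows (N and E need to be calculated only once)
    $$\Delta D = D_{post} - D_{pre} = \sum_{k \in C} \left( h^2 - 2h\,(s(k) - x(k)) \right) = N h^2 - 2hE$$
    $$E = \sum_{k \in C} \left( s(k) - x(k) \right)$$
  – Delta RD cost
    $$\Delta J = \Delta D + \lambda R$$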
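The closed form N·h² − 2hE can be checked against a direct computation of the two distortions. A small sketch on toy sample values (helper names are illustrative):

```python
def sse(orig, rec):
    """Sum of squared errors between original and reconstructed samples."""
    return sum((s - x) ** 2 for s, x in zip(orig, rec))

def stats(orig, rec):
    """N (sample count) and E (sum of differences), gathered once per sample set."""
    return len(orig), sum(s - x for s, x in zip(orig, rec))

def delta_distortion(n, e, h):
    """Fast estimate of D_post - D_pre for offset h."""
    return n * h * h - 2 * h * e

orig = [10, 12, 15, 9]
rec = [8, 11, 13, 9]          # pre-SAO samples
n, e = stats(orig, rec)       # N = 4, E = 5
for h in (-1, 1, 2):
    direct = sse(orig, [x + h for x in rec]) - sse(orig, rec)
    print(h, delta_distortion(n, e, h), direct)   # the two values match for every h
```

Since N and E do not depend on h, trying another offset costs two multiplications instead of another pass over the samples.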
Offset refinement
• The initial offset value h is E/N
  – All the integers between zero and the initial offset are tried in the offset refinement process
FIGURE. Offset refinement candidates: 6, 5, 4, 3, 2, 1, 0 for an initial offset of 6, and -6, -5, -4, -3, -2, -1, 0 for an initial offset of -6
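The refinement loop can be sketched by walking from the initial offset toward zero and keeping the candidate with the smallest delta RD cost. The rate model `rate_bits` is a hypothetical callback, and rounding E/N to the nearest integer is an assumption of this sketch.

```python
def refine_offset(n, e, lam, rate_bits):
    """Return the offset between 0 and round(E/N) with the lowest delta RD cost."""
    if n == 0:
        return 0
    h0 = int(e / n + (0.5 if e >= 0 else -0.5))   # initial offset, rounded half away from zero
    step = 1 if h0 > 0 else -1
    best_h, best_cost = 0, lam * rate_bits(0)     # baseline: offset 0 changes no distortion
    h = h0
    while h != 0:
        cost = (n * h * h - 2 * h * e) + lam * rate_bits(h)
        if cost < best_cost:
            best_h, best_cost = h, cost
        h -= step
    return best_h

def toy_rate(h):
    return abs(h) + 1    # toy rate model: larger offsets cost more bits

print(refine_offset(10, 62, 0, toy_rate))    # 6: distortion alone favors the initial offset
print(refine_offset(10, 62, 20, toy_rate))   # 5: a higher lambda pulls the offset toward zero
```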
Encoding flow of SAO in HM
FIGURE. Encoding flow of SAO in HM: encoding picture -> deblocking filtering -> SAO (frame-based) -> CTB-based processing
CTB-based processing:
  – Statistics collection
    • BO: compute the sum of differences (E) and the pixel count (N) for each of the 32 bands
    • EO class0: compute the sum of differences and the pixel count for each category
    • EO class1: compute the sum of differences and the pixel count for each category
    • EO class2: compute the sum of differences and the pixel count for each category
    • EO class3: compute the sum of differences and the pixel count for each category
  – RD cost calculation (using the fast distortion estimation and offset refinement)
    • EO class0: rdcost0 = distortion + λ × rate
    • EO class1: rdcost1 = distortion + λ × rate
    • EO class2: rdcost2 = distortion + λ × rate
    • EO class3: rdcost3 = distortion + λ × rate
    • BO: determine the starting band position, then rdcostBO = distortion + λ × rate
  – Select the type with the smallest RD cost (BO, EO class0, EO class1, EO class2, EO class3)
  – RD-cost competition against left merge and up merge
Slice‐level on/off control of SAO
• Hierarchical quantization parameter (QP) settings for each group of pictures
• A slice-level on/off decision algorithm
  – For a "depth=0" picture, SAO is always enabled in the slice header
  – Other depths
    • If the previous picture (the last picture of depth "N-1" in decoding order) disables SAO for more than 75% of CTBs, the current picture terminates the SAO encoding process early and disables SAO in all slice headers
FIGURE. Hierarchical structure of a group of pictures: pictures 8k (depth 0), 8k+4 (depth 1), 8k+2 and 8k+6 (depth 2), 8k+1, 8k+3, 8k+5, and 8k+7 (depth 3); a deeper picture uses a higher QP
CTB‐based encoding issues about SAO
• Since SAO is applied after the deblocking filter (DF), the SAO parameters cannot be precisely estimated until the deblocked samples are available
– In a CTU‐based encoder, the deblocked samples of the right columns and the bottom rows in the current CTB may be unavailable
• Two practical CTU‐based SAO decisions
– Case 1. Avoid using the bottom rows and right columns (current HM)
– Case 2. Use non‐deblock‐filtered pixels for the bottom rows and right columns (JCTVC‐J0139)
TABLE. Average BD‐rates of enabling SAO versus disabling SAO for different CTU sizes
Option 1: Skip right and bottom samples in the CTU during parameter estimation
Option 2: Use pre‐deblocked samples near right and bottom boundaries in the CTU during parameter estimation
(Figure legend: deblock‐filtered pixels, non‐deblock‐filtered pixels)

CTU size in luma   Option 1                   Option 2
                   Y       Cb      Cr         Y       Cb      Cr
64×64              -3.5%   -4.8%   -5.8%      -3.3%   -5.3%   -6.6%
32×32              -2.0%   -1.1%   -1.5%      -2.5%   -2.0%   -2.7%
16×16               0.0%   -0.3%    0.3%      -0.8%    0.4%    0.1%
COMPLEXITY ANALYSIS OF HEVC ENCODER
Complexity analysis of HM encoder
• Test sequences
– Sequence : Class B (1920×1080), Class C (832×480)
• Class B : Kimono, ParkScene, Cactus, BasketballDrive, BQTerrace
• Class C : BasketballDrill, BQMall, PartyScene, RaceHorses
– QP : 22, 27, 32, 37
– Main profile
– Random access, low delay
• Test environment
– HM 7.0 software
– Intel® Core™ i7 CPU 860 @ 2.8GHz
– 4GB memory
– Windows 7 (64‐bit)
– Analysis tool : Intel® VTune™ Amplifier XE
FIGURE. ClassB ‐ BasketballDrive
FIGURE. ClassC ‐ BQMall
Profiling result of HEVC encoder
TABLE. Complexity ratio of HM 7.0 encoder (RA), in %
Class   Module        QP 22   QP 27   QP 32   QP 37
B       Entropy         6.6     3.4     1.0     0.9
B       Intra           3.3     2.2     2.1     1.4
B       Inter          68.4    78.1    83.9    85.7
B       TR+Q           20.4    15.2    11.7    10.6
B       Loop filter     0.2     0.2     0.2     0.1
B       etc             1.2     1.1     1.3     1.5
C       Entropy         6.5     3.9     2.8     1.3
C       Intra           2.9     2.7     2.2     1.8
C       Inter          68.8    74.9    79.8    83.3
C       TR+Q           20.7    17.0    13.9    12.4
C       Loop filter     0.2     0.2     0.2     0.1
C       etc             1.0     1.5     1.4     1.2

TABLE. Complexity ratio of HM 7.0 encoder (LD), in %
Class   Module        QP 22   QP 27   QP 32   QP 37
B       Entropy         6.1     2.8     0.4     0.3
B       Intra           3.4     2.0     1.2     1.2
B       Inter          71.3    81.2    87.3    89.1
B       TR+Q           18.6    13.0     9.9     8.5
B       Loop filter     0.2     0.2     0.2     0.1
B       etc             0.8     1.2     0.8     0.9
C       Entropy         5.3     3.1     1.1     0.4
C       Intra           3.0     2.5     1.8     1.5
C       Inter          72.6    79.1    83.5    87.2
C       TR+Q           18.2    14.9    12.1    10.1
C       Loop filter     0.2     0.2     0.2     0.1
C       etc             1.1     0.6     1.6     1.0
Complexity portions of HM encoder
Loop filter : 0.1‐0.2%
Inter prediction : 77‐81%
Intra prediction : 1‐2%
Entropy coding : 2‐4%
Tr + Q : 14‐16%
FIGURE. HEVC encoder block diagram and profiling result (profiled modules: Inter prediction, Transform + Q, Intra prediction, Loop filter, Entropy coding, etc)
Complexity ratio of CU size and mode
FIGURE. Example of CU quad‐tree structure
TABLE. Complexity ratio of CU size and mode
Size     Mode    RA (%)   LD (%)   Average (%)
64x64    Intra     2.1      1.0       1.6
64x64    Inter    19.0     31.9      25.5
64x64    Skip      3.9      3.4       3.7
32x32    Intra     1.9      0.7       1.3
32x32    Inter    25.0     27.4      26.2
32x32    Skip      4.5      3.2       3.9
16x16    Intra     2.3      0.2       1.3
16x16    Inter    17.0     12.5      14.8
16x16    Skip      3.2      1.7       2.5
8x8      Intra     2.4      0.4       1.4
8x8      Inter     8.7      4.9       6.8
8x8      Skip      1.7      0.6       1.2
Selected ratio of CU, PU and TU
TABLE. Selected ratio of CU size and PU mode (%)
CU size  PU mode        Class B (QP)                  Class C (QP)
                        22     27     32     37       22     27     32     37
64x64    Merge skip     10.6   26.6   43.3   55.2     11.7   20.6   30.6   39.5
64x64    Inter 2Nx2N     4.5    7.1    7.2    6.0      5.8    7.5    6.7    5.5
64x64    Inter Nx2N      1.4    2.2    1.8    1.3      1.6    1.8    1.7    1.7
64x64    Inter 2NxN      1.5    1.9    1.3    0.9      1.2    1.0    0.8    0.7
64x64    Inter AMP       1.2    1.4    1.0    0.7      1.0    1.1    1.0    1.1
64x64    Intra 2Nx2N     0.3    0.4    0.6    1.0      0.0    0.0    0.0    0.1
32x32    Merge skip      9.9   12.4   19.9    8.4     12.2   13.5   15.2   16.8
32x32    Inter 2Nx2N     8.1    6.9    4.6    3.1      9.1    7.2    5.4    4.3
32x32    Inter Nx2N      1.8    1.4    0.9    0.4      2.2    1.9    1.9    1.7
32x32    Inter 2NxN      1.7    1.3    0.7    1.0      1.4    1.0    0.9    0.8
32x32    Inter AMP       4.4    2.9    1.6    0.6      4.2    3.5    3.1    2.6
32x32    Intra 2Nx2N     2.3    2.3    2.6    2.6      0.2    0.4    0.7    1.1
16x16    Merge skip      6.8    5.6    3.9    2.9      8.0    7.7    7.3    6.1
16x16    Inter 2Nx2N     9.1    3.7    1.7    0.8      6.9    4.8    3.1    2.0
16x16    Inter Nx2N      1.6    0.7    0.3    0.1      2.0    1.4    1.0    0.6
16x16    Inter 2NxN      1.7    0.6    0.2    0.1      1.2    0.8    0.5    0.3
16x16    Inter AMP       4.1    1.4    0.5    0.2      4.1    2.7    1.7    0.9
16x16    Intra 2Nx2N     2.6    2.1    1.7    1.4      1.2    1.6    1.8    1.7
8x8      Merge skip      2.8    1.9    1.2    0.9      3.9    3.3    2.3    1.4
8x8      Inter 2Nx2N     5.8    1.3    0.4    0.1      4.9    2.5    1.1    0.4
8x8      Inter Nx2N      0.3    0.2    0.1    0.0      1.2    0.7    0.3    0.1
8x8      Inter 2NxN      0.4    0.2    0.1    0.0      0.7    0.4    0.2    0.1
8x8      Intra 2Nx2N     2.9    1.2    0.1    0.5      2.1    1.7    1.2    0.8
8x8      Intra NxN       0.8    0.6    0.7    0.2      1.9    1.1    0.6    0.3

TABLE. Selected ratio of TU (%)
Class   Size     QP 22   QP 27   QP 32   QP 37
B       32x32    33.5    55.0    63.0    65.7
B       16x16    19.8    20.9    20.1    19.7
B       8x8      36.2    15.5    10.7    10.0
B       4x4      10.5     8.5     6.2     4.5
C       32x32    35.7    43.4    49.2    52.2
C       16x16    27.7    27.7    27.5    29.0
C       8x8      21.7    18.1    15.8    13.9
C       4x4      14.8    10.8     7.5     4.9
BD‐BR vs. Encoding time depending on CTU size
• CTU size : 32x32 (relative to the 64x64 reference)
– 3.3‐3.4 % BD‐bitrate increase
– 78‐79 % encoding time
• CTU size : 16x16 (relative to the 64x64 reference)
– 15.4‐17.5 % BD‐bitrate increase
– 50‐54 % encoding time
CTU size : 16x16 Enc T : 50.8 % BD‐bitrate : 17.53 %
CTU size : 32x32 Enc T : 79.22 % BD‐bitrate : 3.31 %
CTU size : 64x64 (Reference)
CTU size : 16x16 Enc T : 54.7 % BD‐bitrate : 15.43 %
CTU size : 32x32 Enc T : 78.92 % BD‐bitrate : 3.43 %
SW : HM 7.1 Seq : Class B cfg : Random access & low delay
BD‐BR vs. Encoding time depending on TU size
• Transform size
– When only 16×16 to 4×4 transforms are on
• 3.2‐3.5 % BD‐bitrate increase
• 96 % encoding time
– When only 8×8 to 4×4 transforms are on
• 10.2‐11.2 % BD‐bitrate increase
• 91‐92 % encoding time
Max TU size : 8x8 Quad‐tree max depth : 1 Enc T : 92.4 % BD‐bitrate : 11.2 %
Max TU size : 8x8 Quad‐tree max depth : 1 Enc T : 91.4 % BD‐bitrate : 10.24 %
Max TU size : 16x16 Quad‐tree max depth : 2 Enc T : 96.8 % BD‐bitrate : 3.2 %
Max TU size : 16x16 Quad‐tree max depth : 2 Enc T : 96.5 % BD‐bitrate : 3.5 %
Max TU size : 32x32 Quad‐tree max depth : 3 (Reference)
SW : HM 7.1 Seq : Class B cfg : Random access & low delay
Tool on/off test
Fast encoding algorithms in HM software
TABLE. Fast encoding algorithms in HM software
Contents (note)
Fast Encoding Setting : FEN, JCTVC‐A0124
‐ Early CU termination
‐ Sub‐sampled SAD operation
‐ Simple bi‐prediction (number of iterations 4 ‐> 1)
Fast Decision for Merge RD Cost : FDM, JCTVC‐H178 (PU level)
‐ Early termination using the CBF of the 2Nx2N merge
Rough Mode Decision (for Intra) : RMD, JCTVC‐C311/D283 (PU level)
‐ Select RD candidate modes among the 35 intra modes based on SATD
‐ Select the best mode among the RD candidate modes based on RD cost
‐ Apply the full RQT to the best mode
AMP Speed‐up : AMPS, JCTVC‐E316 (PU level)
‐ Selectively apply ME or merge for AMP under certain conditions
CBF Fast Mode Setting : CFM, JCTVC‐F045 (PU level)
‐ If the CBF of the current PU is 0, skip the ME process for the lower PUs
Early CU Setting : ECU, JCTVC‐F092 (CU level)
‐ If the best mode of the current CU is Skip, skip the split into lower CUs
Early Skip Detection Setting : ESD, JCTVC‐G543 (CU level)
‐ Conditional early skip detection after the Inter 2Nx2N computation
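As an illustration of how a CU‐level shortcut such as ECU prunes the quad‐tree search, the sketch below recursively decides a CU and stops splitting once the best mode of the current CU is Skip. It is only a sketch: `decide_cu` and the `evaluate_cu` callback (which returns an RD cost and a best mode for one CU) are hypothetical names, not HM functions.

```python
def decide_cu(x, y, size, evaluate_cu, min_size=8, use_ecu=True):
    """Return (rd_cost, tree) for the CU at (x, y).
    evaluate_cu(x, y, size) -> (rd_cost, best_mode) is a hypothetical callback."""
    cost, mode = evaluate_cu(x, y, size)
    # ECU: if the best mode is Skip, do not evaluate the four sub-CUs.
    if size == min_size or (use_ecu and mode == "skip"):
        return cost, (size, mode)
    half = size // 2
    split_cost, children = 0.0, []
    for dy in (0, half):
        for dx in (0, half):
            c, t = decide_cu(x + dx, y + dy, half, evaluate_cu, min_size, use_ecu)
            split_cost += c
            children.append(t)
    # Keep the split only when it is cheaper than the unsplit CU.
    if split_cost < cost:
        return split_cost, children
    return cost, (size, mode)
```

For a 64x64 CTU whose content is all Skip, the shortcut evaluates one CU instead of the 85 CUs (1 + 4 + 16 + 64) of the full search.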
PARALLEL TOOLS IN HEVC
Tile (JCTVC‐E408, JCTVC‐F335)
• Motivation and goal of tile – Picture partitioning introduces coding loss – Why partition a picture?
• High‐level parallel processing • Maximum transmission unit (MTU of the network) size matching • Motion estimation with constrained on‐chip memory
– Goal: reduce coding loss due to partitioning
• Vertical and horizontal boundaries partition the picture into tiles
• Boundary locations may be specified individually or uniformly spaced
• Tiles are always rectangular with an integer number of CTUs
FIGURE. Picture partitioned into four tiles, each processed by its own core (Tile#1 on Core 1, Tile#2 on Core 2, Tile#3 on Core 3, Tile#4 on Core 4)
Tile & slice
• Tiles are always rectangular and always contain an integer number of coding tree blocks in coding tree block raster scan
• A parallel implementation based on tiles achieves higher compression efficiency than one based on slices.
FIGURE. Tile partitioning (Tile#1 to Tile#4 on Core 1 to Core 4) versus slice partitioning (slice1 to slice4 on Core 1 to Core 4)

TABLE. Slice mode and tile mode vs. anchor (%)
Seq.       Slice mode                      Tile mode
           BDR     BDR‐low   BDR‐high      BDR     BDR‐low   BDR‐high
class D     3.9     6.1       2.4           2.7     3.6       2.0
class C     4.5     6.6       2.8           3.3     4.4       2.4
class B     5.3     7.6       3.5           4.0     5.1       3.2
class E    13.4    21.6       6.1           6.3     9.1       3.6
Avg.        6.2     9.6       3.6           3.9     5.3       2.8

TABLE. Slice mode vs. tile mode (%)
Seq.       BDR     BDR‐low   BDR‐high
class D    -1.2    -2.3      -0.4
class C    -1.1    -2.1      -0.4
class B    -1.2    -2.3      -0.4
class E    -6.3   -10.3      -2.4
Avg.       -2.1    -3.8      -0.8
Relationship between slices and tiles
FIGURE. Example 1 and Example 2 of pictures partitioned into slices (Slice#0, Slice#1) and tiles (Tile #0 to Tile #3)
• Constraints are set on the relationship between slices and tiles.
– All CTBs in a slice belong to the same tile, or
– All CTBs in a tile belong to the same slice.
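The constraint can be checked mechanically from two per‐CTB maps. The sketch below is one possible reading of the rule, labeled as an assumption: for every CTB, either its slice lies entirely inside one tile or its tile lies entirely inside one slice. The function name and the list‐based input format are hypothetical.

```python
from collections import defaultdict


def slices_and_tiles_compatible(ctb_slice, ctb_tile):
    """ctb_slice[i] / ctb_tile[i]: slice index / tile index of CTB i."""
    tiles_of_slice = defaultdict(set)
    slices_of_tile = defaultdict(set)
    for s, t in zip(ctb_slice, ctb_tile):
        tiles_of_slice[s].add(t)
        slices_of_tile[t].add(s)
    # Every CTB must sit in a slice confined to one tile,
    # or in a tile confined to one slice.
    return all(
        len(tiles_of_slice[s]) == 1 or len(slices_of_tile[t]) == 1
        for s, t in zip(ctb_slice, ctb_tile)
    )
```

A slice that covers one whole tile plus part of a second tile, with another slice covering the rest of that second tile, fails the check.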
Syntax for tile
• Syntax elements for tile
TABLE. Syntax for tile in PPS
pic_parameter_set_rbsp( ) {                                  Descriptor
  …
  tiles_enabled_flag                                         u(1)
  …
  if( tiles_enabled_flag ) {
    num_tile_columns_minus1                                  ue(v)
    num_tile_rows_minus1                                     ue(v)
    uniform_spacing_flag                                     u(1)
    if( !uniform_spacing_flag ) {
      for( i = 0; i < num_tile_columns_minus1; i++ )
        column_width_minus1[ i ]                             ue(v)
      for( i = 0; i < num_tile_rows_minus1; i++ )
        row_height_minus1[ i ]                               ue(v)
    }
    loop_filter_across_tiles_enabled_flag                    u(1)
  }
  …
}
FIGURE. Picture with six tiles (Tile#0 to Tile#5), annotated with num_tile_columns_minus1, num_tile_rows_minus1, column_width_minus1[0], and row_height_minus1[2]
• uniform_spacing_flag equal to 1 specifies that column boundaries and likewise row boundaries are distributed uniformly across the picture.
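How the tile sizes follow from these syntax elements can be sketched as below. The uniform case uses the integer‐division rule of the HEVC specification; the helper names are hypothetical, and sizes are counted in CTBs. In the explicit case the last column (or row) is not signalled and takes whatever remains.

```python
def uniform_tile_sizes(pic_size_in_ctbs, num_tiles):
    """Column widths (or row heights) in CTBs when uniform_spacing_flag = 1."""
    return [
        ((i + 1) * pic_size_in_ctbs) // num_tiles - (i * pic_size_in_ctbs) // num_tiles
        for i in range(num_tiles)
    ]


def explicit_tile_sizes(pic_size_in_ctbs, sizes_minus1):
    """Sizes when uniform_spacing_flag = 0; sizes_minus1 holds the
    signalled column_width_minus1[] or row_height_minus1[] values."""
    sizes = [s + 1 for s in sizes_minus1]
    last = pic_size_in_ctbs - sum(sizes)   # last column/row is inferred
    if last <= 0:
        raise ValueError("signalled tile sizes exceed the picture size")
    return sizes + [last]
```

For a 1920×1080 picture with 64×64 CTBs (30 × 17 CTBs), four uniform columns come out as 7, 8, 7, 8 CTBs wide, so "uniform" tiles may differ by one CTB.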
Wave‐front parallel processing (WPP)
• Method to perform high‐level parallel video encoding and decoding
• Advantages
– Coding loss from parallelization is relatively small compared to other parallelization methods (WPP vs. slice or tile : 0.8~1% gain)
– Fast decoding and very low delay with the dependent slice
• For large sequences (Classes A, B, E) the acceleration factor is 1.8x with two threads and 3x with four threads
• Disadvantages
– Because of the parallelization steps (prologue, kernel, epilogue), the number of active cores is limited
FIGURE. Number of active cores over time (prologue, kernel, epilogue) and the average number of active cores
FIGURE. Single‐thread processing of HEVC encoding/decoding
FIGURE. Parallel processing without breaking pixel/MV dependency (ideally)
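The prologue, kernel, and epilogue phases can be reproduced with a small scheduling model. This is a simplified sketch, assuming every CTU costs one time step, one core is available per CTU row, and each CTU waits for its left neighbour and for the upper‐right CTU of the row above; the function names are hypothetical.

```python
def wavefront_schedule(rows, cols):
    """Earliest start step of each CTU under the wavefront dependency."""
    start = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            deps = []
            if c > 0:
                deps.append(start[r][c - 1] + 1)          # left CTU
            if r > 0:
                up_right = min(c + 1, cols - 1)            # upper-right CTU
                deps.append(start[r - 1][up_right] + 1)
            start[r][c] = max(deps, default=0)
    return start


def active_cores_per_step(start):
    """Number of CTUs (cores) being processed at each time step."""
    total_steps = max(max(row) for row in start) + 1
    counts = [0] * total_steps
    for row in start:
        for s in row:
            counts[s] += 1
    return counts
```

For 4 rows of 8 CTUs the model gives 1, 1, 2, 2, 3, 3, 4, 4, 3, 3, 2, 2, 1, 1 active cores: all four cores are busy only in the two middle steps, and the 32 CTUs take 14 steps, an average of about 2.3 active cores.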
Wave‐front parallel processing (WPP)
• In HEVC, CABAC is the only entropy coder
– Probabilities are not available at the first CTU of each line
– Re‐initializing at the beginning of each CTU line degrades coding performance
• Synchronize CABAC probabilities with the upper‐right CTU (JCTVC‐E196, JCTVC‐F274)
– The upper‐right CTU is already available because of the spatial/MV dependency
– Efficiently carries over the quick vertical learning of probabilities
FIGURE. Wavefront processing of HEVC CTUs (problem): probabilities not available!
FIGURE. Wavefront processing of HEVC CTUs (WPP)
Bitstream structure for WPP & Tiles
• The bitstream contains multiple sub‐bitstreams (subsets)
FIGURE. SPS, PPS, slice header, and slice data; the slice layer RBSP is divided into Subset 0, Subset 1, Subset 2
– num_entry_point_offsets = 3
– offset_len_minus1 (0 to 31) = 8
– entry_point_offset[ i ] is represented by 9 bits
TABLE. Syntax for subsets in slice header
slice_header( ) {                                             Descriptor
  if( tiles_enabled_flag || entropy_coding_sync_enabled_flag ) {
    num_entry_point_offsets                                   ue(v)
    if( num_entry_point_offsets > 0 ) {
      offset_len_minus1                                       ue(v)
      for( i = 0; i < num_entry_point_offsets; i++ )
        entry_point_offset[ i ]                               u(v)
    }
  }
}
Only one of WPP or tile tools can be used for the main profile.
Relationship between slices and tiles
• Case 1 : multiple tiles in a slice
– The slice layer RBSP bitstream consists of sub‐bitstreams for decoding
– The CTU decoding order is determined by the tile partition
FIGURE. Example: Slice#0 contains Tile #0 and Tile #1; Slice#1 contains Tile #2 and Tile #3
FIGURE. Bitstream structure: SPS, PPS, slice header and slice data of Slice 0 (Tile 0, Tile 1), then slice header and slice data of Slice 1 (Tile 2, Tile 3)
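How the tile partition fixes the CTU decoding order can be sketched as follows: tiles are visited in raster order, and the CTUs inside each tile are visited in raster order within that tile. The function name is hypothetical; tile sizes are given in CTUs.

```python
def tile_scan_order(col_widths, row_heights):
    """Return the raster-scan addresses of the CTUs in decoding (tile scan) order."""
    pic_width = sum(col_widths)
    order = []
    y0 = 0
    for h in row_heights:              # tile rows, top to bottom
        x0 = 0
        for w in col_widths:           # tiles in this row, left to right
            for y in range(y0, y0 + h):
                for x in range(x0, x0 + w):
                    order.append(y * pic_width + x)
            x0 += w
        y0 += h
    return order
```

For a 4×4‐CTU picture split into 2×2 tiles, the decoding order is 0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15 in raster addresses.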
Relationship between slices and tiles
• Case 2 : multiple slices in a tile
– The CTU decoding order is determined by the tile partition
– Slice data includes start CTU address information
FIGURE. Example: Tile #0 contains Slice#0 and Slice#1; Tile #1 contains Slice#2
FIGURE. Bitstream structure: SPS, PPS, slice header and slice data (Tile 0) of Slice 0, slice header and slice data (Tile 0) of Slice 1, slice header and slice data (Tile 1) of Slice 2
Conventional Encoding procedure
• Conventional encoding process
– Pixel encoding can be processed as a 2D wave‐front
– But entropy encoding is a sequential process
• This requires a large amount of memory to buffer syntax elements
• In general, encoding proceeds in raster‐scan order because of the entropy coder
FIGURE. Pixel encoding produces the syntax elements of CTB0, CTB1, …, CTBk‐1, which entropy encoding converts sequentially into the bitstream
Tile encoding & decoding procedure
• Tile encoding/decoding process
– Tiles have no dependency on each other for pixel coding and entropy coding
– All tiles can be encoded/decoded at the same time
FIGURE. Tile encoding: pixel encoding and entropy encoding of Subset 0 to Subset 3 in parallel, tile boundary processing (DB, SAO), then bitstream transmission
FIGURE. Tile decoding: entropy decoding and pixel decoding of Subset 0 to Subset 3 in parallel, followed by tile boundary processing (DB, SAO)
WPP encoding & decoding procedure
• WPP encoding/decoding process
– Pixel coding can be a 2D wave‐front process
– Entropy coding can also be a 2D wave‐front process by using a context synchronization scheme
FIGURE. WPP encoding: pixel encoding and entropy encoding of Subset 0 to Subset 3, with a sync point between consecutive subsets, then bitstream transmission
FIGURE. WPP decoding: entropy decoding and pixel decoding of Subset 0 to Subset 3, with a sync point between consecutive subsets
Conclusion
• Overview of HEVC • Encoding parameters for HEVC test model (HM) • Complexity analysis of HEVC encoder • Fast encoding algorithms and performances • Issues of parallel processing