ee 5359 multimedia processing spring 2011 final report fpga

50
EE 5359 MULTIMEDIA PROCESSING SPRING 2011 Final Report FPGA IMPLEMENTATION OF H.264 VIDEO ENCODER Under guidance of DR K R RAO DEPARTMENT OF ELECTRICAL ENGINEERING UNIVERSITY OF TEXAS AT ARLINGTON SPRING 2011 Presented by KUSHAL KUNIGAL 1000662485 [email protected]

Upload: others

Post on 03-Feb-2022

3 views

Category:

Documents


0 download

TRANSCRIPT

EE 5359 MULTIMEDIA PROCESSING

SPRING 2011

Final Report

FPGA IMPLEMENTATION OF H.264 VIDEO ENCODER

Under guidance of

DR K R RAO

DEPARTMENT OF ELECTRICAL ENGINEERING

UNIVERSITY OF TEXAS AT ARLINGTON

SPRING 2011

Presented by

KUSHAL KUNIGAL

1000662485

[email protected]

Introduction:

This project presents a detailed study on H.264 video encoder and the algorithms for evaluating the

transform and quantization suitable for high speed implementation on FPGA/ASIC. Along with this,

detailed architectures of intra Prediction, integer transforms and quantization processors are presented.

Overview:

To achieve a real-time H.264 encoding solution, multiple FPGAs and programmable DSPs are often

used. The computational complexity alone does not determine if a functional module should be mapped

to hardware or remain in software. The architectural issues that influence the overall design decision are:

Data Locality: In a synchronous design, the ability to access memory in a particular order and

granularity while minimizing the number of clock cycles due to latency, bus contention, alignment,

DMA transfer rate and the types of memory used is very important. The data locality issue (Figure 1) is

primarily dictated by the physical interfaces between the data unit and the arithmetic unit (or the

processing engine) [2].

Fig 1: H.264 encoder block diagram [2].

Computational Complexity: Programmable DSPs are bounded in computational complexity, as

measured by the clock rate of the processor. Signal processing algorithms implemented in the FPGA

fabric are typically computationally-intensive. By mapping these modules onto the FPGA fabric, the

host processor or the programmable DSP has the extra cycles for other algorithms. Furthermore, FPGAs

can have multiple clock domains in the fabric, so selective hardware blocks can have separate clock

speeds based on their computational requirements [2].

Block diagram of H.264 advanced Video Encoder:

Fig.2 shows the block diagram of H.264 encoder. Therein the modules designed in this work are shown

in grey shades. An input frame or field Fn is processed in units of a macro block. A macro block consists

of 16x16 pixels. Each macro block is encoded in intra or inter mode and for each block in the macro

block, a prediction P is formed based on the reconstructed picture samples. In Intra mode, P is formed

from samples in the current slice that have been previously constructed. In inter mode, P is formed by

motion-compensated prediction from one or two reference picture(s) selected from the set of reference

pictures.

Fig 2: Modules in H.264 video encoder [3].

Prediction modes in H.264 standard: H.264 or MPEG-4 part 10 aims at coding video sequences at approximately half the bit rate compared to MPEG-2 at the same quality. It also aims at having significant improvements in coding efficiency using CABAC entropy coder, error robustness and network friendliness. Parameter set concept, arbitrary slice ordering, flexible macroblock structure, redundant pictures, switched predictive and switched intra pictures have contributed to error resilience / robustness of this standard. Adaptive (directional) intra prediction (Figs 3a-3i) is one of the factors which contributed to the high coding efficiency of this standard [16]. Different modes used for block prediction are shown in Fig.3

Fig 3a: Mode 0: vertical Fig 3b: Mode 1: horizontal

Fig 3c: Mode 2: DC

Fig 3d: Mode 3: diagonal down-left Fig 3e: Mode 4: diagonal down-right

Fig 3f: Mode 5: vertical-right Fig 3g: Mode 6: horizontal-down

Fig 3h: Mode 7: vertical-left Fig 3i: Mode 8: horizontal-up

Fig 3a-3h: Different prediction modes used for prediction H.264 [16].

The intra prediction reduces spatial redundancies by exploiting the spatial correlation between adjacent

blocks in a given picture. Each picture is divided into 16×16 pixel MB and each MB is formed with

luma components and chroma components. For luma components, the 16×16 pixel MB can be

partitioned into sixteen 4×4 blocks. The chroma components are predicted by 8×8 blocks with a similar

prediction technique as the 16×16 luma prediction. There are 9 prediction modes for 4×4 luma blocks

and 4 prediction modes for 16×16 luma blocks. For the chroma components, there are 4 prediction

modes that are applied to the two 8×8 chroma blocks (U and V).

A. 4×4 Luma intra prediction modes

The MB is divided into sixteen 4×4 luma blocks and a prediction for each 4×4 luma block is applied

individually. 4×4 Luma intra prediction modes are well suited for coding of parts of a picture with a

significant detail. The nine different prediction modes are supported as shown in Figs 3a-3j, where the

prediction values for pixels are calculated from the neighboring boundary pixel values. Each mode is

suitable to predict directional structures in the picture at different angles (Fig. 3c). Note that DC is a

special prediction mode, where the mean of the left handed and upper samples are used to predict the

entire block.For the 8×8 intra prediction, 9 prediction modes are used which are the same as that of 4×4

intra prediction. However, the computational complexity of the H.264 encoder is dramatically increased

according to this feature of the new extended profile.

B. 16×16 Luma Intra Prediction Modes

16×16 Luma Intra Prediction Modes are more suitable for coding very smooth areas of a picture by

prediction for the whole luma component of a MB. Four different prediction modes are supported:

Vertical, Horizontal, DC and Plane prediction. Plane prediction mode uses a linear function between the

neighboring samples to the left and to the top in

order to predict the current samples.

C. 8×8 Chroma Intra Prediction Modes

The chroma intra prediction of a MB is similar to the 16×16 luma intra prediction because the chroma

signals are very smooth in most cases. It is performed always on chroma blocks using vertical

prediction, horizontal prediction, DC-prediction or plane-prediction.

Concept: By understanding the ideas and importance behind video compression, it is possible to use the idea and

implement an efficient and high performance encoder, such that it consumers less power and take less

clock cycles to encode an image frame. The implementation is considered a lite version of the H.264

encoder, similar to the MPEG-4 digital video codec which is known to achieving high data compression.

The same building blocks implemented in the H.264 encoder will be used in the lite version with

exceptions of a few optimizing modifications.

For example the Motion Estimation algorithm, it was suggested to use a full search algorithm but after

some research, it was discovered that the motion estimation process consumes 66-94% of the cycles [3].

Therefore if there was any optimization to be made, here would be the place to start.

Therefore, instead of applying the full search algorithm, an alternative algorithm was used, which will

be discussed under the background section. The motion compensation would produce the predicted

frame from the motion vectors (from motion estimation) and reference frame. The residual frame would

be generated by the difference between the predicted frame and current frame. To compress the data

even further, Discrete Cosine Transform (DCT), a type of linear transform, will be performed on the

residual frame. In addition, quantization will be used as well to compress the data. This project will be

simulated and synthesized on the Xilinx 8.1 ISE (to determine chip size and power consumption),

ModelSim (to simulate and observe waveforms). The purpose of this implementation is to improve

behavioral VHDL modeling and FPGA design process. In addition, to learn about video compression

system and be able to incorporate other simulation tools with VHDL. Hopefully, with this

implementation of the H.264 encoder, it will result in a design that is efficient and achieves high

performance, the hardware design that runs faster, with less power consumption and smaller area.

Modified Encoder Hardware Design:

Fig 3: Modified Encoder Hardware Design (derived from Fig 2).

The encoder will be used to encode a frame from a video sequence:

30 frames/second.

Each frame is 176 x 144 pixels (QCIF resolution, very typical for low bit-rate video contents in

cell phones).

Fig 4: Reference frame 1 [4] . Fig 5: Current frame 2 [4].

Fig 6: Residual of current frame [4].

The following are the different stages in designing the video encoder:

Motion Estimation:

Motion estimation is the most computationally demanding task in image compression applications, and

can require as much as 66%-94% of the processor cycles spent in the video encoder [6]. Therefore an

efficient motion estimation algorithm is essential for creating an efficient implementation of the H.264

encoder. This section will briefly cover different motion estimation algorithms and the tradeoffs between

them, and the proposed algorithm that will be implemented on this project.

Fig 7: The principle of block matching motion estimation algorithm is finding the best matching block in the searching area of a reference frame for each macroblock in the original frame.

Motion estimation attempts to find a region in a previously encoded frame (called the "reference frame")

that closely matches each macro block in the current frame. In this implementation, each frame is

divided into macro blocks of 4x4 pixels. To find the best matching block, Minimum Absolute

Difference (MAD) is very often exploited as the matching criterion because of its simple operation and

efficiency. Distance vector between this best matching block and current original macroblock is so-

called the motion vector. The motion vector is comprised of the horizontal and vertical offsets from the

location of the macro block in the current frame to the location in the reference frame. The full search

algorithm, also referred to as exhaustive block matching algorithm (EBMA) was the suggested

algorithm to perform the motion estimation in the encoder. The full search algorithm, would find the

“best matched” macro block within a search range of 8 pixels (resulting in a 20x20 pixel search range)

in the reference frame. This algorithm will simply exhaustively test all possible motion representations.

Which means that within the 20x20 pixel search range, the algorithm will start at the top left corner of

the search range and move the 4x4 macro block one pixel at a time and scan that area to search for the

“best match”. Once the entire search area has been scanned, the location of the macro block that was the

„best match‟ will be used to calculate the motion vector. Then a new macro block in the current frame

will be selected, and the process will start all over again, until all the macro blocks in the current frame

has been processed.

• This means that for every search area (20x20 pixels) there are 289 possible macro block locations, that

will be scanned by the full search algorithm.

• In addition, given that the frame size is 176 x 144 pixels (44 x 36 macro blocks), which gives a total of

1584 macro blocks.

• To compute the motion vector for each macro block in the current frame, the algorithm will have to

scan 1584 x 289 = 457776 blocks.

Using the full search algorithm scans and processes many macro blocks are involved, which consumes

too many clock cycles. An alternative algorithm would be the four-step search (4SS) motion estimation

algorithm. The four-step search was developed from the popular three-step search algorithm in order to

improve efficiency. The 4SS algorithm performs similar to a hierarchal search algorithm, starting at a

very broad search, and performs narrower searches after. The 4SS algorithm uses a center-biased search

pattern, with nine checking points on a 5x5 window in the first step. The center of the search window is

then shifted to the point with the “best match” macro block. The search window of the next two steps

depends on the location of the “best match”.

The 4SS is summarized in 4 steps [5]:

• Step 1: A macro block that is the “best match” is found from a nine-checking-points pattern on a 5x5

window located at the center of the 15 x 15 search area (Fig. 8a).

• Decision: If the “best matched” macro block is found at the center of the search window, go to Step 4

otherwise go to Step 2.

• Step 2: The search-window size is maintained at 5 x 5. However, the search patter will depend on the

position of the previous “best matched” macro block location.

a. If the previous “best matched” macro block is located at the corner of the search window, five

additional checking points (Figure 8b) are used.

b. If the previous “best matched” macro block is located at the middle of the horizontal or vertical axis

of the previous search window, three additional checking points (Figure 8c) are used.

• Decision: If the “best matched” macro block is found at the center of the search window, go to Step 4;

otherwise go to Step 3.

• Step 3: The searching pattern strategy is the same as Step 2, but finally it will go to Step 4.

• Step 4: The search window is reduced to 3 x 3 (Figure 8d) and the direction of the overall motion

vector is considered as the “best matched” macro block is among these nine macro blocks.

Fig 8a: A 20x20 pixel search range divided up by 4x4 pixel search [5].

Fig 8b: A 20x20 pixel search range with 5 additional checkpoints during step 2 of the four-step search algorithm [5].

Fig 8c: A 20x20 pixel search range with 3 additional checkpoints during step 3 of the four-step search algorithm [5].

4x4 macroblock

Check points for motion estimation

A 20x20 pixel search range divided up by 4x4 pixel search range

Fig 8d: A 20x20 pixel search range with 5x5 search window, now reduced to 3x3 during step 4 of the four-step search algorithm [5].

Figs 8a-8b of the 4SS algorithm does not use the same search area dimensions as specified in the Full

Search Algorithm, but the overall implementation of the 4SS search algorithm can be easily seen,

especially how the algorithm is able to optimize and reduce the number of searches. The algorithm only

scans a fraction of the search area, starts off with a broad scan, and narrows each of its scan afterwards.

A modified version of the 4SS search algorithm will be used in this implementation of the motion

estimation process.

Motion vector description:

Fig 9. Motion vector description.

After the user get the best match on the picture, each macro-block will have their own motion vector.

After the motion vectors are obtained, a predicted frame can be formed with the reference frame (RF)

and the motion vectors (MV) with the following pseudo code:

Fig 10: Psuedo code for obtaining predicted frame [6].

Motion compensation

After the motion vectors are obtained, motion compensation uses the motion vectors encoded in the

video bit stream to predict the pixels in each macro block. A predicted frame can be formed with the

reference frame and the motion vectors (Figure 11).

Figure 11: Generating predicted frame using reference frame and motion vectors.

Like motion estimation, motion compensation requires the video decoder to keep one or two reference

frames in memory, often requiring external memory chips for this purpose. However, motion

compensation makes fewer accesses to reference frame buffers than motion estimation. Therefore,

memory bandwidth requirements are less stringent for motion compensation compared to motion

estimation. The predicted frame, shown in Figure 11, still fails to new elements that were absent in the

previous frame (i.e.illumination). To correct that a residual frame, which is the difference between the

predicted frame and the actual current frame, is also computed. The residual frame is shown in Figure

12.

Figure 12: The residual frame formed by subtracting the predicted frame from the actual current frame.

Discrete cosine transform

After the residual frame has been produced, it goes through a discrete cosine transform which transforms

each block (in residual frame) into a frequency-domain representation. This is a lossless process,

because it is just a direct transformation into frequency coefficients. In the H.264 standard, real DCT is

factored into the following form:

Figure 13: Real DCT factor.

Where:

X denotes the original macroblock

a = 0.5

b = sqrt(2/5)

denotes array multiplication

The DCT has strong energy compaction properties. Most of the signal information tends to be

concentrated in a few low frequency components of the DCT. However, the human eye is

more sensitive to the information contained in DCT coefficients that represent low frequencies

(corresponding to large features in the image) than to the information contained in DCT coefficients that

represent high frequencies (corresponding to small features). Therefore, the DCT helps

separate the more perceptually significant information.

As mentioned above, the DCT coefficients for each block are encoded using more bits for the more

perceptually significant low-frequency DCT coefficients and fewer bits for the less significant high-

frequency coefficients. This encoding of coefficients is accomplished in two steps: First, quantization

is used to discard perceptually insignificant information. Quantization rounds each DCT coefficient to

the nearest of a number of predetermined values. By reducing the number of discrete symbols in a given

stream, the stream becomes more compressible.

QP QStep QP Qstep

1 0.6875 7 1.3750

2 0.8125 8 1.6250

3 0.8750 9 1.7500

4 1.0000 10 2.0000

5 1.1250 11 2.2500

6 1.2500 12 2.5000

Table 1: Quantization step sizes in H.264 CODEC

The quality of reconstructed frame is determined by the QP used. The higher the QP, the lower the

decoded quality and vice versa. In the encoder, each pixel, yij, in the current macroblock is quantized by

the following equation:

Zij = round(Yij Eij / QStep)

Overall view of operations in encoder module:

Fig 14. Overall view of operations in encoder module [6].

Results:

Platform:

ModelSim: Version XE III 6.0d.

Quartus II: Version 10.1 -sp1 Web Edition (64-Bit).

The hardware description language used was VHDL (Very high speed integrated circuit Hardware Description language). Figure 15 shows the list of code segments used.

Figure 15: List of code segments used to design the H.264 encoder.

Figure 16: List of tasks performed.

Figure 17: Analysis and synthesis summary.

Figure 18: Resource usage summary.

Figure 19: Generated register transfer logic for the H.264 encoder.

Conclusion:

The H.264 encoding algorithms were discussed. A complete analysis of the motion estimation

algorithm, motion compensation, DCT and quantization was made. VHDL code was written to

implement the H.264 video encoder. The code was compiled and synthesized using ModelSim and

Quartus ii sofwares. The Register transfer schematic for the code was generated.

Appendix

VHDL code for the design of the H.264 encoder.

library IEEE;

use IEEE.std_logic_1164.all;

use IEEE.Numeric_STD.all;

use IEEE.std_logic_unsigned."+";

use IEEE.std_logic_unsigned."-";

use IEEE.std_logic_signed.">";

use IEEE.std_logic_signed."<";

use work.config.all;

entity me_top is

port ( clk : in std_logic;

clear : in std_logic;

reset : in std_logic;

register_file_address : in std_logic_vector(4 downto 0); -- 32 general purpose registers

register_file_write : in std_logic;

register_file_data_in : in std_logic_vector(31 downto 0);

register_file_data_out : out std_logic_vector(31 downto 0);

done_interrupt : out std_logic; -- high when macroblock processing has completed

best_sad_debug : out std_logic_vector(15 downto 0); --debugging ports

best_mv_debug : out std_logic_vector(15 downto 0);

best_eu_debug : out std_logic_vector(3 downto 0);

partition_mode_debug : out std_logic_vector(3 downto 0);

qp_on_debug : out std_logic; --running qp

dma_rm_re_debug : in std_logic; --set to one to enable reading the reference area

dma_rm_debug : out std_logic_vector(63 downto 0); -- reference area data out

dma_address : in std_logic_vector(10 downto 0); -- next reference memory address

dma_data_in : in std_logic_vector(63 downto 0); -- pixel in for reference memory or

macroblock memory

dma_rm_we : in std_logic; --enable writing to reference memory

dma_cm_we : in std_logic; --enable writing to current macroblock memory

dma_pom_we : in std_logic; -- enable writing to point memory

dma_prm_we : in std_logic; -- enable writing to program memory

dma_residue_out : out std_logic_vector(63 downto 0); -- get residue from winner mv

dma_re_re : in std_logic -- enable reading residue

);

end;

architecture struct of me_top is

component mv_cost --calculate mv cost using Lagrangian optimization

generic (pipelines : integer);

port (

clk : in std_logic;

clear : in std_logic;

reset : in std_logic;

load : in std_logic; -- start

mvp_x : in std_logic_vector(7 downto 0); --predicted mv x

mvp_y : in std_logic_vector(7 downto 0); -- predicted mv y

mvx_c : in std_logic_vector(7 downto 0); -- motion vector candidate x

mvy_c : in std_logic_vector(7 downto 0); -- motion vector candidate y

rest_mvx_c : in rest_type_points; -- rest of motion vector candidates x

rest_mvy_c : in rest_type_points; -- rest of motion vector candidates y

quant_parameter : in std_logic_vector(5 downto 0); -- quantization parameter

p_cost_mv : out std_logic_vector(15 downto 0);

rest_p_cost_mv : out rest_type_displacement

);

end component;

component phy_address

port (

clk : in std_logic;

clear : in std_logic;

reset : in std_logic;

partition_count : in std_logic_vector(3 downto 0); --identify the subpartition active

line_offset : in std_logic_vector(5 downto 0); -- read multiple lines

mvx : in std_logic_vector(7 downto 0);

mvy : in std_logic_vector(7 downto 0);

phy_address : out std_logic_vector(13 downto 0));

end component;

component sad_selector

generic (integer_pipeline_count : integer);

port (

clk : in std_logic;

reset : in std_logic;

clear : in std_logic;

calculate_sad_done : in std_logic;

active_pipelines : in std_logic_vector(CFG_PIPELINE_COUNT-1 downto 0);

update : in std_logic; -- completed program reset the stored sad

update_fp : in std_logic; -- end of iteraction

best_eu : out std_logic_vector(3 downto 0);

best_sad : in std_logic_vector(15 downto 0);

best_mv : in std_logic_vector(15 downto 0);

rest_best_sad : in rest_type_displacement;

rest_best_mv : in rest_type_displacement;

best_sad_out : out std_logic_vector(15 downto 0);

best_mv_out : out std_logic_vector(15 downto 0));

end component;

component sad_selector_qp

port (

clk : in std_logic;

reset : in std_logic;

clear : in std_logic;

calculate_sad_done : in std_logic;

active_pipelines : in std_logic_vector(CFG_PIPELINE_COUNT_QP-1 downto 0);

update : in std_logic; -- completed fp part set the stored sad

update_qp : in std_logic; -- end of iteraction

best_eu : out std_logic_vector(3 downto 0);

best_sad : in std_logic_vector(15 downto 0);

best_mv : in std_logic_vector(15 downto 0);

rest_best_sad : in rest_type_displacement_qp;

rest_best_mv : in rest_type_displacement_qp;

best_sad_out : out std_logic_vector(15 downto 0);

best_mv_out : out std_logic_vector(15 downto 0));

end component;

component distance_engine64

generic (qp_mode : std_logic);

port(

clk : in std_logic;

clear : in std_logic;

reset : in std_logic;

enable : in std_logic;

update : in std_logic;

load_mv : in std_logic;

mode_in : in mode_type;

mv_cost_on : in std_logic;

mv_cost_in : in std_logic_vector(15 downto 0);

candidate_mvx : in std_logic_vector(7 downto 0);

candidate_mvy : in std_logic_vector(7 downto 0);

reference_data_in : in std_logic_vector(63 downto 0);

current_data_in : in std_logic_vector(63 downto 0);

residue_out : out std_logic_vector(63 downto 0);

enable_fifo : out std_logic;

reset_fifo : out std_logic;

winner1 : out std_logic;

calculate_sad_done : out std_logic;

distance_engine_active : out std_logic;

best_sad : out std_logic_vector(15 downto 0);

best_mv : out std_logic_vector(15 downto 0));

end component;

component register_file

generic (integer_pipeline_count : integer);

port(

clk : in std_logic;

clear : in std_logic;

reset : in std_logic;

addr : in std_logic_vector(4 downto 0);

write : in std_logic;

data_in : in std_logic_vector(31 downto 0);

data_out : out std_logic_vector(31 downto 0);

start : out std_logic;

all_done_qp : in std_logic; -- program completes

all_done_fp : in std_logic; -- fp part completes

mvc_done : out std_logic; -- all motion vector candidates evaluated

mvc_to_do : out std_logic_vector;

instruction_zero : in std_logic;

partition_done_fp : in std_logic; -- fp partition terminates

partition_done_qp : in std_logic; -- qp partition terminates

done_interrupt : out std_logic;

start_row : out std_logic;

update_fp : in std_logic;

load_mv : in std_logic; -- force the mvc to move foward

mode_out : out mode_type;

mv_cost_on : out std_logic; -- activate the costing of mvs

best_sad_fp : in std_logic_vector(15 downto 0);

best_mv_fp : in std_logic_vector(15 downto 0);

first_mv_fp : out std_logic_vector(15 downto 0);

rest_first_mv_fp : out rest_type_displacement;

mbx_coordinate : out std_logic_vector(7 downto 0);

mby_coordinate : out std_logic_vector(7 downto 0);

mvp_x : out std_logic_vector(7 downto 0);

mvp_y : out std_logic_vector(7 downto 0);

quant_parameter : out std_logic_vector(5 downto 0);

frame_dimension_x : out std_logic_vector(7 downto 0);

frame_dimension_y : out std_logic_vector(7 downto 0);

update_qp : in std_logic;

partition_count : in std_logic_vector(3 downto 0); --identify the subpartition active

best_sad_qp : in std_logic_vector(15 downto 0);

best_mv_qp : in std_logic_vector(15 downto 0);

first_mv_qp : out std_logic_vector(15 downto 0));

end component;

component fp_pipeline

port ( clk : in std_logic;

clear : in std_logic;

reset : in std_logic;

next_point_displacement_fp : in std_logic_vector(15 downto 0); --next point to be

processed

first_mv_fp : in std_logic_vector(15 downto 0); -- first point to start the search from

start : in std_logic;

start_mb : in std_logic; -- once per macroblock

mode_in : in mode_type;

partition_count : in std_logic_vector(3 downto 0); --identify the subpartition active

frame_dimension_x : in std_logic_vector(7 downto 0); --in mb

frame_dimension_y : in std_logic_vector(7 downto 0);

mbx_coordinate : in std_logic_vector(7 downto 0); --in mb

mby_coordinate : in std_logic_vector(7 downto 0);

candidate_mvx : out std_logic_vector(7 downto 0); -- port for Lagrangian optimizaton

candidate_mvy : out std_logic_vector(7 downto 0);

mv_cost : in std_logic_vector(15 downto 0);

mv_cost_on : in std_logic; -- enable mv cost

all_done_fp : in std_logic; -- program completes

calculate_sad_done : out std_logic;

best_sad_fp : out std_logic_vector(15 downto 0);

best_mv_fp : out std_logic_vector(15 downto 0);

dma_address : in std_logic_vector(10 downto 0); -- next reference memory address

mb_data_in : in std_logic_vector(63 downto 0); -- pixel in for macroblock memory (shared

this memory among the pipelines)

dma_data_in : in std_logic_vector(63 downto 0); -- pixel in for reference memory

dma_rm_we : in std_logic; --enable writing to reference memory

--dma_cm_we : in std_logic; --enable writing to current macroblock memory

dma_residue_out : out std_logic_vector(63 downto 0); -- get residue from winner mv

dma_re_re : in std_logic -- enable reading residue

);

end component;

component program_memory

port (

addr: in std_logic_vector(7 downto 0);

clk: in std_logic;

din: in std_logic_vector(19 downto 0);

dout: out std_logic_vector(19 downto 0);

we: in std_logic);

end component;

component reference_memory64_remap -- This memory stores the 5x7 reference data (1120

words of 64 bit)

port ( -- It also remaps the addresses

addr_r: in std_logic_vector(10 downto 0);

addr_w: in std_logic_vector(10 downto 0);

enable_hp_inter : in std_logic; -- working in interpolation mode

clk: in std_logic;

start : in std_logic;

next_configuration : in std_logic; -- move to the next configuration

start_row : in std_logic;

reset : in std_logic;

clear : in std_logic;

din: in std_logic_vector(63 downto 0);

dout: out std_logic_vector(63 downto 0);

dout2: out std_logic_vector(63 downto 0);

we: in std_logic);

end component;

component reference_memory64_remap_compact -- This memory stores the 5x5 reference data

(800 words of 64 bit)

port ( -- It also remaps the addresses

addr: in std_logic_vector(9 downto 0);

enable_hp_inter : in std_logic; -- working in interpolation mode

clk: in std_logic;

next_configuration : in std_logic; -- move to the next configuration

start_row : in std_logic;

reset : in std_logic;

clear : in std_logic;

din: in std_logic_vector(63 downto 0);

dout: out std_logic_vector(63 downto 0);

dout2 : out std_logic_vector(63 downto 0); -- from the second read port

we: in std_logic);

end component;

component concatenate64_qp

port(

addr : in std_logic_vector(2 downto 0);

clk : in std_logic;

clear : in std_logic;

reset : in std_logic;

din : in std_logic_vector(63 downto 0);

din2 : in std_logic_vector(63 downto 0);

dout : out std_logic_vector(63 downto 0);

enable : in std_logic;

quick_valid : out std_logic; --as valid but one cycle earlier

valid : out std_logic); -- indicates when 64 valid bits are in the output

end component;

component concatenate64 -- this unit makes sure that 16 valid pixels are assemble depending on

byte address

port(

addr : in std_logic_vector(2 downto 0);

clk : in std_logic;

clear : in std_logic;

reset : in std_logic;

din : in std_logic_vector(63 downto 0);

din2 : in std_logic_vector(63 downto 0);

dout : out std_logic_vector(63 downto 0);

enable_hp_inter : in std_logic; -- working in interpolation mode

enable : in std_logic;

quick_valid : out std_logic; --as valid but one cycle earlier

valid : out std_logic); -- indicates when 16 valid bytes are in the output

end component;

component qp_interpolate_engine

port(

clk : in std_logic;

clear : in std_logic;

reset : in std_logic;

qp_mode : in std_logic; -- in qp mode two lines must be written to interpolate

enable_hp_inter : in std_logic;

write_interpolate_register : in std_logic;

interpolate_in_pixels_a : in std_logic_vector(63 downto 0);

interpolate_in_pixels_b : in std_logic_vector(63 downto 0);

write_block1 : out std_logic; -- control which of the two blocks is being read and written

(interpolate and dist engine)

rma_address : out std_logic_vector(4 downto 0); -- extracted reference pixels use this address

rma_we : out std_logic;

interpolate_out_pixels : out std_logic_vector(63 downto 0)

);

end component;

component forward_engine

port(

clk : in std_logic;

clear : in std_logic;

reset : in std_logic;

enable_hp_inter : in std_logic; -- when hp interpolation is being performed in the background

write_register : in std_logic;

mode_in : in mode_type;

partition_count_in : in std_logic_vector(3 downto 0);

in_pixels : in std_logic_vector(63 downto 0);

write_block1 : out std_logic; -- control which of the two blocks is being read and written

(interpolate and dist engine)

rma_address : out std_logic_vector(4 downto 0); -- extracted reference pixels use this address

rma_we : out std_logic;

out_pixels : out std_logic_vector(63 downto 0)

);

end component;

-- half pel interpolation engine

component systolic_array_top

port (

clear : in std_logic;

reset : in std_logic;

enable_interpolation : in std_logic;

enable_estimation : in std_logic;

clk : in std_logic;

data_in : in std_logic_vector(63 downto 0);

data_in_valid : in std_logic;

data_request : out std_logic;

all_done : out std_logic;

next_point : out std_logic; -- tell the main control unit that the next point is required

shift_concatenate_valid : in std_logic; -- need to know when 64 bits are valid

-- memory interface for o,h,v and d memories

candidate_mvx : in std_logic_vector(7 downto 0);

candidate_mvy : in std_logic_vector(7 downto 0);

hp_address_a : out std_logic_vector(2 downto 0);

hp_address_b : out std_logic_vector(2 downto 0);

data_out_a : out std_logic_vector(63 downto 0); --interpolated data out

data_out2_a : out std_logic_vector(63 downto 0); --interpolated data out

data_out_b : out std_logic_vector(63 downto 0); --interpolated data out

data_out2_b : out std_logic_vector(63 downto 0); --interpolated data out

data_out_valid : out std_logic --signal to indicate 8 bytes of data valid

);

end component;

component cm -- This memory stores the current macroblock(256 bytes))

port (

addr: in std_logic_vector(4 downto 0);

clk: in std_logic;

din: in std_logic_vector(63 downto 0);

dout: out std_logic_vector(63 downto 0);

we: in std_logic

);

end component;

component point_memory

port (

addr: IN std_logic_VECTOR(7 downto 0);

clk: IN std_logic;

din: IN std_logic_VECTOR(15 downto 0);

dout: OUT std_logic_VECTOR(15 downto 0);

we: IN std_logic);

end component;

--component point_memory

-- port (

-- addr: in std_logic_vector(7 downto 0);

-- clk: in std_logic;

-- dout: out std_logic_vector(15 downto 0));

--end component;

component me_control_unit

generic ( integer_pipeline_count : integer);

port ( clk : in std_logic;

clear : in std_logic;

reset : in std_logic;

start : in std_logic;

range_ok : in std_logic; --keep track of the mv range

best_sad_in : in std_logic_vector(15 downto 0); -- to make SAD-based decisions

mv_length_in : in std_logic_vector(15 downto 0); -- to make LENGTH-based decisions

mode_in : in mode_type;

mvc_done : in std_logic; -- all motion vector candidates evaluated

mvc_to_do : in std_logic_vector(3 downto 0);

qp_on : in std_logic; -- qp on

partition_count_out : out std_logic_vector(3 downto 0); --identify the subpartition active

start_pipelines : out std_logic_vector(CFG_PIPELINE_COUNT-1 downto 0);

active_pipelines : out std_logic_vector(CFG_PIPELINE_COUNT-1 downto 0); -- so

sad selector ignores the non active ones

shift_concatenate_valid : in std_logic; -- valid output from the concantenate unit (64 bit

ready)

instruction_address : out std_logic_vector(7 downto 0); -- address to fetch next instruction

instruction_opcode : in std_logic_vector(3 downto 0); -- opcode

point_count : in std_logic_vector(7 downto 0); -- how many points to test

point_address : in std_logic_vector(7 downto 0); -- which is the first point to test

best_eu : in std_logic_vector(3 downto 0);

calculate_sad_done : in std_logic;

distance_engine_active : in std_logic;

interpolation_done : in std_logic; -- interpolation completes

interpolate_data_request : in std_logic; -- interpolator requests data

next_point : out std_logic_vector(7 downto 0); -- next point address to ROM

line_offset : out std_logic_vector(5 downto 0); -- multiple line reading

enable_concatenate_unit : out std_logic;

--enable_dist_engine : out std_logic;

write_register : out std_logic;

load_mv : out std_logic;

update : out std_logic;

instruction_zero : out std_logic;

all_done : out std_logic; -- program completes or qp mode started (fp finished)

partition_done : out std_logic;

qpel_loc_x : in std_logic_vector(1 downto 0); -- detect qp mode

qpel_loc_y : in std_logic_vector(1 downto 0);

start_qp : out std_logic;

enable_hp_inter : out std_logic; -- start the interpolation core

-- write_block1 : out std_logic;

-- next_rm_address_ready : in std_logic;

next_rm_addresss : in std_logic_vector(13 downto 0); --physical address for reference

(macroblock upper left corner

rm_address : out std_logic_vector(13 downto 0) -- reference memory write from address

-- cm_address : out std_logic_vector(4 downto 0); -- address to extract 4x4 blocks from

current macroblock

-- rma_address : out std_logic_vector(4 downto 0); -- reference macroblock write to

address

-- rma_we : out std_logic

);

end component;

component me_control_unit_qp

port ( clk : in std_logic;

clear : in std_logic;

reset : in std_logic;

start : in std_logic; -- start qp refinement

next_point_inter : in std_logic; -- tell the main control unit that the next point is required by

the interpolation unit

shift_concatenate_valid : in std_logic; -- valid output from the concantenate unit

qp_starting_address : in std_logic_vector(7 downto 0); -- start fetching qp instructions from

this point

instruction_address : out std_logic_vector(7 downto 0); -- address to fetch next instruction

point_count : in std_logic_vector(7 downto 0); -- how many points to test

point_address : in std_logic_vector(7 downto 0); -- which is the first point to test

calculate_sad_done : in std_logic; -- signals when the distance engine has finished

instruction_opcode : in std_logic_vector(3 downto 0); -- opcode

best_eu : in std_logic_vector(3 downto 0); -- best execution unit

next_point : out std_logic_vector(7 downto 0); -- next point address to ROM

qp_mode : out std_logic; --enable qp estimation

qp_on : out std_logic; -- qp active

load_mv : out std_logic;

update : out std_logic;

all_done : out std_logic -- program completes

);

end component;

component range_checker --make sure that MVs are not out of range

port (

clk : in std_logic;

clear : in std_logic;

reset : in std_logic;

candidate_mvx : in std_logic_vector(7 downto 0);

candidate_mvy : in std_logic_vector(7 downto 0);

frame_dimension_x : in std_logic_vector(7 downto 0); --in mb

frame_dimension_y : in std_logic_vector(7 downto 0);

mbx_coordinate : in std_logic_vector(7 downto 0); --in mb

mby_coordinate : in std_logic_vector(7 downto 0);

range_ok : out std_logic

);

end component;

signal test : std_logic;

signal rest_next_point_fp : rest_type_points;

signal rest_point_memory_address : rest_type_points;

signal rest_next_point_displacement_fp : rest_type_displacement;

signal rest_start_pipeline,rest_calculate_sad_done :

std_logic_vector(CFG_PIPELINE_COUNT-1 downto 0);

signal rest_best_sad_fp,rest_best_mv_fp,rest_first_mv_fp: rest_type_displacement;

signal rest_next_point_qp : rest_type_points_qp;

signal rest_point_memory_address_qp : rest_type_points_qp;

signal rest_next_point_displacement_qqp : rest_type_displacement_qp;

signal rest_start_pipeline_qp,rest_calculate_sad_done_qp :

std_logic_vector(CFG_PIPELINE_COUNT_QP-1 downto 0);

signal rest_best_sad_qp,rest_best_mv_qp: rest_type_displacement_qp;

signal

mvp_x,mvp_y,frame_dimension_x,frame_dimension_y,mby_coordinate,mbx_coordinate,progra

m_memory_address,program_memory_address_qp,instruction_address_fp,instruction_address_

qp,point_count_fp,point_count_qp,point_address_fp,point_address_qp,candidate_mvx_fp,candi

date_mvy_fp,candidate_mvx_qp,candidate_mvy_qp : std_logic_vector(7 downto 0);

signal quant_parameter,line_offset : std_logic_vector(5 downto 0);

signal partition_count,instruction_fp,instruction_qp : std_logic_vector(3 downto 0); --op code

signal

next_point_fp,point_memory_address,next_point_qp,point_memory_address_qp,candidate_mv

x_int,candidate_mvy_int : std_logic_vector(7 downto 0);

signal

one_bit,zero_bit,distance_engine_active_fp,range_ok,quick_valid,quick_valid_qp,next_point_in

ter,enable_concatenate_unit,dma_cm_we_m1,dma_cm_we_m2,dma_cm_we_fp,dma_cm_we_q

p,done_interrupt_int,mv_cost_on : std_logic;

signal

distance_engine_address_m2,distance_engine_address_m1,distance_engine_address_qp,cm_ad

dress_m1,cm_address_m2,address1_fp,address2_fp,address1_qp,address2_qp:

std_logic_vector(4 downto 0);

signal

zero,best_sad_out_qp,best_mv_out_qp,next_point_displacement_qp,next_point_displacement_f

p,best_sad_fp,best_sad_qp,best_sad_qp_distance_engine,best_mv_fp,best_mv_qp,best_sad_out

_fp,best_mv_out_fp,p_cost_mv,p_cost_mv_qp : std_logic_vector(15 downto 0);

signal point_count_position_qp,point_count_position_fp : std_logic_vector(19 downto 0);

signal mv_length_in,first_mv_fp,first_mv_qp,mv_displacement_fp,mv_displacement_qp :

std_logic_vector(15 downto 0);

signal

mvc_done,instruction_zero,qp_on,partition_done_fp,partition_done_qp,mux_control_write,mux

_control_read,qp_mode,start_qp,qp_pixels_valid,hp_pixels_valid,hp_interpolation_done,data_re

quest_hp_inter,all_done_fp,all_done_qp,start_row,reset_fifo_fp,reset_fifo_qp,reset_fifo1_qp,res

et_fifo2_qp,reset_fifo1_fp,reset_fifo2_fp,fifo_enable_w1,fifo_enable_w2,fifo_enable_r1,fifo_e

nable_r2,enable_fifo_fp,enable_fifo_qp,winner1_fp,winner1_qp,start,write_register,write_interp

olate_register,next_rm_address_ready,shift_concatenate_valid_qp,shift_concatenate_valid_fp,r

ma_we_qp,rma_we_fp,rma_we1_qp,rma_we2_qp,rma_we1_fp,rma_we2_fp,write_block1_qp,l

oad_mv_fp,load_mv_qp,update_fp,update_qp,calculate_sad_done_qp,calculate_sad_done_fp,en

able_hp_inter : std_logic;

signal rm_address_r,rm_address_w : std_logic_vector(10 downto 0);

signal rm_address_c : std_logic_vector(9 downto 0);

signal best_eu,best_eu_qp,mvc_to_do : std_logic_vector(3 downto 0);

signal next_rm_address,int_rm_address : std_logic_vector(13 downto 0);

signal

current_pixels,current_pixels_fp,current_pixels_qp,current_pixels_m1,current_pixels_m2,refere

nce_data_in1_fp,reference_data_in2_fp,reference_data_in1_qp,reference_data_in2_qp,reference

_data_in_fp,reference_data_in_qp : std_logic_vector(63 downto 0);

signal

reference_pixels_in,reference_pixels_in2,residue_out_fp,residue_out_qp,residue_out_1_2,resid

ue_out_2_2,residue_out_1_1,residue_out_2_1 : std_logic_vector(63 downto 0);

signal

out_pixels_fp,out_pixels_qp,qp_pixels_out_a,qp_pixels_out_b,hp_pixels_out_a,hp_pixels_out2

_a,hp_pixels_out_b,hp_pixels_out2_b,reference_pixels_out_qp_a,reference_pixels_out_qp_b,re

ference_pixels_out_fp : std_logic_vector(63 downto 0); --for the interpolate unit

signal rma_address_qp,rma_address_fp : std_logic_vector(4 downto 0);

signal hp_address_a, hp_address_b : std_logic_vector(2 downto 0);

--signal sad : std_logic_vector(15 downto 0);

signal qpel_loc_x,qpel_loc_y : std_logic_vector(1 downto 0);

signal point_memory_data_in : std_logic_vector(15 downto 0);

signal program_memory_data_in : std_logic_vector(19 downto 0);

signal partition_mode : mode_type;

signal active_pipelines : std_logic_vector(CFG_PIPELINE_COUNT-1 downto 0);

signal active_pipelines_qp : std_logic_vector(CFG_PIPELINE_COUNT_QP-1 downto 0);

signal rest_mvx_c,rest_mvy_c : rest_type_points;

signal rest_p_cost_mv : rest_type_displacement;

begin

-- program memory for fp engine

program_memory_data_in <= dma_data_in(19 downto 0);

program_memory_address <= dma_address(7 downto 0) when dma_prm_we = '1' else

instruction_address_fp;

qp_on_debug <= qp_on;

zero_bit <= '0';

one_bit <= '1';

mode_process : process(partition_mode)

begin

case partition_mode is

when m16x16 => partition_mode_debug <= "0000";

when m8x8 => partition_mode_debug <= "0001";

when others => partition_mode_debug <= "0000";

end case;

end process;

program_memory1 : program_memory

port map(

addr =>program_memory_address,

clk =>clk,

din =>program_memory_data_in,

dout => point_count_position_fp,

we => dma_prm_we

);

-- program memory for qp engine

program_memory2_qp : if CFG_PIPELINE_COUNT_QP = 1 generate

program_memory_address_qp <= dma_address(7 downto 0) when dma_prm_we = '1' else

instruction_address_qp;

program_memory2 : program_memory

port map(

addr =>program_memory_address_qp,

clk =>clk,

din =>program_memory_data_in,

dout => point_count_position_qp,

we => dma_prm_we

);

end generate;

no_qpgen0 : if CFG_PIPELINE_COUNT_QP = 0 generate

point_count_position_qp <= (others => '0');

end generate;

range_checker1 : range_checker --make sure that MVs are not out of range

port map(

clk => clk,

clear => clear,

reset => reset,

candidate_mvx => candidate_mvx_fp,

candidate_mvy => candidate_mvy_fp,

frame_dimension_x =>frame_dimension_x,

frame_dimension_y =>frame_dimension_y,

mbx_coordinate => mbx_coordinate,

mby_coordinate => mby_coordinate,

range_ok => range_ok

);

phy_address1 : phy_address

port map(

clk => clk,

clear => clear,

reset => reset,

partition_count => partition_count, --identify the subpartition active

line_offset => line_offset,

mvx => candidate_mvx_int,

mvy => candidate_mvy_int,

phy_address => next_rm_address

);

instruction_fp <= point_count_position_fp(19 downto 16);

point_count_fp <= point_count_position_fp(15 downto 8);

point_address_fp <= point_count_position_fp(7 downto 0);

instruction_qp <= point_count_position_qp(19 downto 16);

point_count_qp <= point_count_position_qp(15 downto 8);

point_address_qp <= point_count_position_qp(7 downto 0);

candidate_mvx_fp <= first_mv_fp(15 downto 8)+mv_displacement_fp(15 downto 8);

candidate_mvy_fp <= first_mv_fp(7 downto 0)+mv_displacement_fp(7 downto 0);

-- check that MVX is in the reference area

in_range_x : process(candidate_mvx_fp)

begin

if candidate_mvx_fp > 47 then

candidate_mvx_int <= x"2f";

elsif candidate_mvx_fp < -48 then

candidate_mvx_int <= x"d0";

else

candidate_mvx_int <= candidate_mvx_fp;

end if;

end process;

-- check that MVY is in the reference area

in_range_y: process(candidate_mvy_fp)

begin

if candidate_mvy_fp > 31 then

candidate_mvy_int <= x"1f";

elsif candidate_mvy_fp < -32 then

candidate_mvy_int <= x"e0";

else

candidate_mvy_int <= candidate_mvy_fp;

end if;

end process;

--this has to change the first mv for qp should be the winner from fo

candidate_mvx_qp <= first_mv_qp(15 downto 8)+mv_displacement_qp(15 downto 8);

candidate_mvy_qp <= first_mv_qp(7 downto 0)+mv_displacement_qp(7 downto 0);

no_qpgen11 : if CFG_PIPELINE_COUNT_QP = 0 generate

qpel_loc_x <= (others => '0');

qpel_loc_y <= (others => '0');

end generate;

qpgen11 : if CFG_PIPELINE_COUNT_QP = 1 generate

qpel_loc_x <= candidate_mvx_fp(1 downto 0);

qpel_loc_y <= candidate_mvy_fp(1 downto 0);

end generate;

-- fp point memory

--point_memory_fp : point_memory

-- port map(

-- addr =>next_point_fp,

-- clk =>clk,

-- dout => next_point_displacement_fp

--);

point_memory_address <= dma_address(7 downto 0) when dma_pom_we = '1' else

next_point_fp;

point_memory_data_in <= dma_data_in(15 downto 0);

point_memory_fp : point_memory

port map(

addr =>point_memory_address,

clk =>clk,

din =>point_memory_data_in,

dout =>next_point_displacement_fp,

we => dma_pom_we

);

--generate enough memories to hold the point memories for each aditional pipeline

generate_pipelines1 : for i in 1 to (CFG_PIPELINE_COUNT-1) generate

begin

rest_next_point_fp(i) <= next_point_fp when mvc_done = '0' else next_point_fp + i;

rest_point_memory_address(i) <= dma_address(7 downto 0) when dma_pom_we = '1'

else rest_next_point_fp(i);

rest_point_memory_fp : point_memory

port map (

addr => rest_point_memory_address(i),

clk => clk,

din =>point_memory_data_in,

dout => rest_next_point_displacement_fp(i),

we => dma_pom_we

);

end generate;

--generate integer pipelines

generate_pipelines2 : for i in 1 to (CFG_PIPELINE_COUNT-1) generate

begin

fp_pipelines1 : fp_pipeline

port map( clk =>clk,

clear =>clear,

reset =>reset,

next_point_displacement_fp =>rest_next_point_displacement_fp(i), --next point to be

processed

first_mv_fp =>rest_first_mv_fp(i), -- first point to start the search from

start =>rest_start_pipeline(i), -- enable the pipeline by main me

start_mb => start, -- once per macroblock

mode_in => partition_mode,

partition_count => partition_count, --identify the subpartition active

frame_dimension_x =>frame_dimension_x, --in mb

frame_dimension_y =>frame_dimension_y,

mbx_coordinate =>mbx_coordinate, --in mb

mby_coordinate =>mby_coordinate,

candidate_mvx =>rest_mvx_c(i), -- port for Lagrangian optimizaton

candidate_mvy =>rest_mvy_c(i),

mv_cost =>rest_p_cost_mv(i),

mv_cost_on => mv_cost_on, -- enable mv cost

all_done_fp => all_done_fp,

calculate_sad_done => rest_calculate_sad_done(i),

best_sad_fp =>rest_best_sad_fp(i),

best_mv_fp =>rest_best_mv_fp(i),

dma_address =>dma_address, -- next reference memory address

mb_data_in => current_pixels_fp, -- shared mb memory

dma_data_in =>dma_data_in, -- pixel in for reference memory

dma_rm_we =>dma_rm_we, --enable writing to reference memory

--dma_cm_we =>dma_cm_we, --enable writing to current macroblock memory

dma_residue_out =>open, -- get residue from winner mv

dma_re_re => dma_re_re-- enable reading residue

);

end generate;

point_memory_qp_qp : if CFG_PIPELINE_COUNT_QP = 1 generate

-- qp point memory

point_memory_qp : point_memory

port map(

addr =>point_memory_address_qp,

din =>point_memory_data_in,

clk =>clk,

dout => next_point_displacement_qp,

we => dma_pom_we

);

point_memory_address_qp <= dma_address(7 downto 0) when dma_pom_we = '1' else

next_point_qp;

end generate;

no_qpgen1 : if CFG_PIPELINE_COUNT_QP= 0 generate

next_point_displacement_qp <= (others => '0');

end generate;

-- displace the mv by 2 pixels to define the interpolation area when interpolation active

mv_displacement_fp <= next_point_displacement_fp when enable_hp_inter = '0' else x"14FC";

--(20,-4)

mv_displacement_qp <= next_point_displacement_qp;

done_interrupt <= done_interrupt_int;

register_file1 : register_file

generic map(integer_pipeline_count => (CFG_PIPELINE_COUNT))

port map(

clk => clk,

clear => clear,

reset => reset,

addr => register_file_address,

write => register_file_write,

data_in => register_file_data_in,

data_out => register_file_data_out,

start => start,

mode_out => partition_mode,

mv_cost_on => mv_cost_on, -- activate the costing of mvs

all_done_fp => all_done_fp,

all_done_qp => all_done_qp,

mvc_to_do => mvc_to_do,

mvc_done => mvc_done, -- all motion vector candidates evaluated

instruction_zero => instruction_zero,

partition_done_fp => partition_done_fp,

partition_done_qp => partition_done_qp,

done_interrupt => done_interrupt_int,

start_row => start_row,

load_mv => load_mv_fp, -- force the mvc to move foward

update_fp => update_fp,

best_sad_fp => best_sad_out_fp,

best_mv_fp => best_mv_out_fp,

first_mv_fp => first_mv_fp,

rest_first_mv_fp => rest_first_mv_fp,

mbx_coordinate =>mbx_coordinate,

mby_coordinate =>mby_coordinate,

mvp_x => mvp_x,

mvp_y => mvp_y,

quant_parameter => quant_parameter,

frame_dimension_x =>frame_dimension_x,

frame_dimension_y =>frame_dimension_y,

partition_count => partition_count,

update_qp => update_qp,

best_sad_qp => best_sad_out_qp,

best_mv_qp => best_mv_out_qp,

first_mv_qp => first_mv_qp

);

compact_memory0 : if CFG_CM = 0 generate

reference_memory_large : reference_memory64_remap --This memory stores the 7x5

port map(

addr_r => rm_address_r,

addr_w => rm_address_w,

enable_hp_inter => enable_hp_inter, -- working in interpolation mode

clk => clk,

next_configuration => start, -- use the start signal to move between configurations

all_done_fp, -- move to the next configuration when programs completes

start => start,

start_row => start_row,

reset => reset,

clear => clear,

din =>dma_data_in,

dout =>reference_pixels_in,

dout2 => reference_pixels_in2,

we => dma_rm_we

);

end generate;

compact_memory1 : if CFG_CM = 1 generate

rm_address_c <= rm_address_w(9 downto 0) when dma_rm_we = '1' else rm_address_r(9

downto 0);

reference_memory_compact : reference_memory64_remap_compact -- This memory stores the

5x5 reference data (800 words of 64 bit)

port map( -- It also remaps the addresses

addr => rm_address_c,

enable_hp_inter => enable_hp_inter, -- working in interpolation mode

clk => clk,

next_configuration => start, -- move to the next configuration

start_row => start_row,

reset => reset,

clear => clear,

din => dma_data_in,

dout => reference_pixels_in,

dout2 => reference_pixels_in2, -- from the second read port

we => dma_rm_we

);

end generate;

--when qp mode it is the systolic array which decides when data is valid

concatenate_qp_qp : if CFG_PIPELINE_COUNT_QP = 1 generate

concatenate_qp_a : concatenate64_qp -- this unit makes sure that 8 valid pixels are assemble

depending on byte address

port map(

addr => hp_address_a,

clk => clk,

clear => clear,

reset => reset,

din => hp_pixels_out_a,

din2 => hp_pixels_out2_a,

dout => reference_pixels_out_qp_a,

enable => hp_pixels_valid,

quick_valid => quick_valid_qp, --as valid but one cycle earlier

valid => shift_concatenate_valid_qp -- indicates when 64 valid bits are in the output

);

concatenate_qp_b : concatenate64_qp -- this unit makes sure that 8 valid pixels are assemble

depending on byte address

port map(

addr => hp_address_b,

clk => clk,

clear => clear,

reset => reset,

din => hp_pixels_out_b,

din2 => hp_pixels_out2_b,

dout => reference_pixels_out_qp_b,

enable => hp_pixels_valid,

quick_valid => open, --as valid but one cycle earlier

valid => open -- indicates when 64 valid bits are in the output

);

end generate;

no_qpgen2 : if CFG_PIPELINE_COUNT_QP = 0 generate

reference_pixels_out_qp_a <= (others => '0');

reference_pixels_out_qp_b <= (others => '0');

shift_concatenate_valid_qp <= '0';

end generate;

-- when no qp mode different concatenate unit

concatenate_fp : concatenate64 -- this unit makes sure that 8 valid pixels are assemble

depending on byte address

port map(

addr => int_rm_address(2 downto 0),

clk => clk,

clear => clear,

reset => reset,

din => reference_pixels_in,

din2 => reference_pixels_in2,

dout => reference_pixels_out_fp,

enable => enable_concatenate_unit,

enable_hp_inter => enable_hp_inter, -- working in interpolation mode

quick_valid => quick_valid, --as valid but one cycle earlier

valid => shift_concatenate_valid_fp -- indicates when 64 valid bits are in the output

);

-- half pel interpolation engine

interpolate2_qp : if CFG_PIPELINE_COUNT_QP = 1 generate

interpolate2 : systolic_array_top

port map(

clear =>clear,

reset => reset,

enable_interpolation =>enable_hp_inter,

enable_estimation =>qp_mode,

clk =>clk,

data_in =>reference_pixels_out_fp,

data_in_valid =>shift_concatenate_valid_fp,

data_request =>data_request_hp_inter,

all_done =>hp_interpolation_done,

next_point => next_point_inter, -- tell the main control unit that the next point is required

shift_concatenate_valid =>shift_concatenate_valid_qp, -- need to know when 64 bits are valid

-- memory interface for o,h,v and d memories

candidate_mvx => candidate_mvx_qp,

candidate_mvy => candidate_mvy_qp,

hp_address_a => hp_address_a,

hp_address_b => hp_address_b,

data_out_a => hp_pixels_out_a, --interpolated data out port a

data_out2_a => hp_pixels_out2_a, --interpolated data out

data_out_b => hp_pixels_out_b, --interpolated data out port b

data_out2_b => hp_pixels_out2_b, --interpolated data out

data_out_valid => hp_pixels_valid --signal to indicate 8 bytes of data valid

);

end generate;

no_qpgen3 : if CFG_PIPELINE_COUNT_QP = 0 generate

data_request_hp_inter <= '0';

hp_interpolation_done <= '0';

next_point_inter <= '0';

hp_address_a <= (others => '0');

hp_address_b <= (others => '0');

hp_pixels_out_a <= (others => '0'); --interpolated data out

hp_pixels_out_b <= (others => '0');

hp_pixels_valid <= '0';

end generate;

-- data for qp engine always comes from the concatenate unit

qp_pixels_valid <= shift_concatenate_valid_qp;

qp_pixels_out_a <= reference_pixels_out_qp_a;

qp_pixels_out_b <= reference_pixels_out_qp_b;

forward1 : forward_engine

port map(

clk =>clk,

clear =>clear,

reset =>reset,

mode_in => partition_mode,

partition_count_in => partition_count,

enable_hp_inter =>enable_hp_inter, -- when hp interpolation is being performed in the

background

write_register =>shift_concatenate_valid_fp,

in_pixels =>reference_pixels_out_fp,

write_block1 =>open, -- control which of the two blocks is being read and written

(interpolate and dist engine)

rma_address =>rma_address_fp, -- extracted reference pixels use this address

rma_we => rma_we_fp,

out_pixels => out_pixels_fp

);

interpolate1_qp : if CFG_PIPELINE_COUNT_QP = 1 generate

interpolate1 : qp_interpolate_engine

port map(

clk=>clk,

clear=>clear,

reset=>reset,

qp_mode => qp_mode, -- in qp mode two lines must be written to interpolate

enable_hp_inter => enable_hp_inter,

write_interpolate_register=> qp_pixels_valid,

interpolate_in_pixels_a => qp_pixels_out_a,

interpolate_in_pixels_b => qp_pixels_out_b,

write_block1 => open, -- control which of the two blocks is being read and written

(interpolate and dist engine)

rma_address => rma_address_qp, -- extracted reference pixels use this address

rma_we => rma_we_qp,

interpolate_out_pixels => out_pixels_qp

);

end generate;

no_qpgen4 : if CFG_PIPELINE_COUNT_QP = 0 generate

write_block1_qp <= '0'; -- control which of the two blocks is being read and written

(interpolate and dist engine)

rma_address_qp <= (others => '0'); -- extracted reference pixels use this address

rma_we_qp <= '0';

out_pixels_qp <= (others => '0');

end generate;

reference_data_in_fp <= out_pixels_fp;

rm_address_r <= int_rm_address(13 downto 3) when (dma_rm_re_debug = '0') else

dma_address;

rm_address_w <= dma_address;

dma_rm_debug <= reference_pixels_in;

reference_data_in_qp <= out_pixels_qp;

--These two memories will alternate if they are fp or qp

dma_cm_we_m2 <= dma_cm_we when mux_control_write = '1' else '0';

dma_cm_we_m1 <= dma_cm_we when mux_control_write = '0' else '0';

current_pixels_fp <= current_pixels_m1 when mux_control_read = '0' else current_pixels_m2;

current_pixels_qp <= current_pixels_m1 when mux_control_read = '1' else current_pixels_m2;

distance_engine_address_m1 <= rma_address_fp when mux_control_read = '0' else

rma_address_qp;

distance_engine_address_m2 <= rma_address_fp when mux_control_read = '1' else

rma_address_qp;

--two so you can write one while you read the other. Dual port is not enough since you cannot

destroy the contents

cm_address_m1 <= distance_engine_address_m1 when dma_cm_we_m1 = '0' else

dma_address(4 downto 0);

cm_address_m2 <= distance_engine_address_m2 when dma_cm_we_m2 = '0' else

dma_address(4 downto 0);

me_control_unit_qp1_qp : if CFG_PIPELINE_COUNT_QP = 1 generate

me_control_unit_qp1 : me_control_unit_qp

port map( clk =>clk,

clear =>clear,

reset =>reset,

start =>start_qp,

next_point_inter => next_point_inter, -- next point address to ROM

shift_concatenate_valid =>quick_valid_qp,

qp_starting_address =>instruction_address_fp,

instruction_address =>instruction_address_qp,

point_count =>point_count_qp,

point_address =>point_address_qp,

calculate_sad_done =>calculate_sad_done_qp,

instruction_opcode => instruction_qp,

best_eu => best_eu_qp,

next_point =>next_point_qp,

qp_mode =>qp_mode, --enable qp estimation

qp_on => qp_on, -- qp active

load_mv => load_mv_qp,

update =>update_qp,

all_done =>all_done_qp -- program completes

);

end generate;

no_qpgen7 : if CFG_PIPELINE_COUNT_QP = 0 generate

instruction_address_qp <= (others => '0'); -- address to fetch next instruction

next_point_qp <= (others => '0');

qp_mode <= '0'; --enable qp estimation

load_mv_qp <= '0';

update_qp <= '0';

all_done_qp <= '0'; -- program completes

qp_on <= '0';

end generate;

best_eu_debug <= best_eu;

me_control_unit1 : me_control_unit

generic map(

integer_pipeline_count => (CFG_PIPELINE_COUNT)

)

port map( clk =>clk,

clear =>clear,

reset =>reset,

start => start,

range_ok => range_ok,

mode_in => partition_mode,

best_sad_in => best_sad_out_fp, -- to make SAD-based decisions

mv_length_in => mv_length_in, -- to make LENGTH-based decisions

qp_on => qp_on, -- qp on

mvc_done => mvc_done, -- all motion vector candidates evaluated

mvc_to_do => mvc_to_do,

partition_count_out => partition_count, --identify the subpartition active

start_pipelines => rest_start_pipeline,

active_pipelines => active_pipelines,

shift_concatenate_valid => quick_valid, -- valid output from the concantenate unit (64 bit

ready)

instruction_address => instruction_address_fp, -- address to fetch next instruction

instruction_opcode => instruction_fp, -- the opcode

point_count => point_count_fp, -- how many points to test

point_address => point_address_fp, -- which is the first point to test

calculate_sad_done => calculate_sad_done_fp,

distance_engine_active => distance_engine_active_fp,

interpolation_done => hp_interpolation_done,-- interpolation completes

interpolate_data_request => data_request_hp_inter,-- interpolator requests data

line_offset => line_offset, -- read the different lines of the reference macroblock

enable_concatenate_unit => enable_concatenate_unit,

-- enable_dist_engine => enable_dist_engine,

write_register => write_register,

load_mv => load_mv_fp,

best_eu => best_eu,

update => update_fp,

instruction_zero => instruction_zero,

all_done => all_done_fp,

partition_done => partition_done_fp,

qpel_loc_x =>qpel_loc_x, -- detect qp mode

qpel_loc_y =>qpel_loc_y,

next_point => next_point_fp,

start_qp => start_qp,

enable_hp_inter =>enable_hp_inter,

-- write_block1 => write_block1, -- control which of the two blocks is being read and

written (interpolate and dist engine)

-- next_rm_address_ready => next_rm_address_ready,

next_rm_addresss => next_rm_address, --physical address for reference (macroblock upper

left corner

rm_address => int_rm_address -- internal reference memory addresses

-- rma_address => rma_address, -- extracted reference pixels use this address

-- rma_we => rma_we

);

-- calculate the length of the motion vector

mv_length_in <= (best_mv_out_fp(15 downto 8) - mvp_x) & (best_mv_out_fp(7 downto 0)

- mvp_y);

gen_mv_cost_qp : if (CFG_MV_COST = 1 and CFG_PIPELINE_COUNT_QP = 1) generate

-- mv cost unit

mv_cost_qp : mv_cost

generic map(pipelines => CFG_PIPELINE_COUNT_QP)

port map(

clk => clk,

clear => clear,

reset => reset,

load => load_mv_qp, -- start calculation of mv costs for qp

mvp_x => mvp_x, -- predicted mv x

mvp_y => mvp_y, -- predicted mv y

mvx_c =>candidate_mvx_qp, -- motion vector candidate x

mvy_c =>candidate_mvy_qp, -- motion vector candidate y

rest_mvx_c =>rest_mvx_c, -- rest of motion vector candidates x

rest_mvy_c =>rest_mvy_c, -- rest of motion vector candidates y

quant_parameter => quant_parameter,

p_cost_mv => p_cost_mv_qp,

rest_p_cost_mv => open

);

end generate;

no_gen_mv_cost_qp : if (CFG_MV_COST = 0 or CFG_PIPELINE_COUNT_QP = 0) generate

p_cost_mv_qp <= (others => '0');

end generate;

gen_mv_cost_fp : if CFG_MV_COST = 1 generate

-- mv cost unit

mv_cost_fp : mv_cost

generic map(pipelines => CFG_PIPELINE_COUNT)

port map(

clk => clk,

clear => clear,

reset => reset,

load => rest_start_pipeline(0), -- start calculation of mv costs for fp

mvp_x => mvp_x, -- predicted mv x

mvp_y => mvp_y, -- predicted mv y

mvx_c =>candidate_mvx_int, -- motion vector candidate x

mvy_c =>candidate_mvy_int, -- motion vector candidate y

rest_mvx_c =>rest_mvx_c, -- rest of motion vector candidates x

rest_mvy_c =>rest_mvy_c, -- rest of motion vector candidates y

quant_parameter => quant_parameter,

p_cost_mv => p_cost_mv,

rest_p_cost_mv => rest_p_cost_mv

);

end generate;

no_gen_mv_cost_fp : if CFG_MV_COST = 0 generate

p_cost_mv <= (others => '0');

end generate;

-- fp distance engine

distance_engine_fp : distance_engine64

generic map (qp_mode => '0')

port map(

clk =>clk,

clear =>clear,

reset =>reset,

enable => rma_we_fp, -- calculate when new data available

update => update_fp, -- instruction completes set the best sad register to FFFF

load_mv => load_mv_fp,

mode_in => partition_mode,

mv_cost_on => mv_cost_on,

mv_cost_in => p_cost_mv,

candidate_mvx => candidate_mvx_fp,

candidate_mvy => candidate_mvy_fp,

reference_data_in => reference_data_in_fp,

current_data_in => current_pixels_fp,

residue_out => residue_out_fp,

enable_fifo => enable_fifo_fp,

reset_fifo => reset_fifo_fp,

winner1 => winner1_fp,

calculate_sad_done => calculate_sad_done_fp,

distance_engine_active => distance_engine_active_fp,

best_sad => best_sad_fp,

best_mv => best_mv_fp

);

-- mv/sad selector

sad_selector_fp : sad_selector

generic map(integer_pipeline_count => CFG_PIPELINE_COUNT)

port map(

clk =>clk,

reset =>reset,

clear =>clear,

calculate_sad_done =>calculate_sad_done_fp,

update =>partition_done_fp,

update_fp => update_fp, -- end of iteraction

best_eu => best_eu, --id of best execution unit

active_pipelines => active_pipelines,

best_sad => best_sad_fp,

best_mv => best_mv_fp,

rest_best_sad => rest_best_sad_fp,

rest_best_mv => rest_best_mv_fp,

best_sad_out => best_sad_out_fp,

best_mv_out => best_mv_out_fp

);

best_sad_debug <= best_sad_fp;

best_mv_debug <= best_mv_fp; --debugging port

--qp distance engine

pipeline_qp_qp : if CFG_PIPELINE_COUNT_QP = 1 generate

sad_selector_qp1 : sad_selector_qp

port map (

clk =>clk,

reset =>reset,

clear =>clear,

calculate_sad_done =>calculate_sad_done_qp,

active_pipelines => active_pipelines_qp,

update =>all_done_fp, -- complete fp part set the stored sad

update_qp =>update_qp,

best_eu => best_eu_qp,

best_sad =>best_sad_qp,

best_mv =>best_mv_qp,

rest_best_sad =>rest_best_sad_qp,

rest_best_mv =>rest_best_mv_qp,

best_sad_out =>best_sad_out_qp,

best_mv_out =>best_mv_out_qp

);

distance_engine_qp : distance_engine64

generic map (qp_mode => '0')

port map(

clk =>clk,

clear =>clear,

reset =>reset,

enable =>rma_we_qp,

update => update_qp, -- instruction completes set the best sad register to FFFF

load_mv => load_mv_qp,

mode_in => partition_mode,

mv_cost_on => mv_cost_on,

mv_cost_in => p_cost_mv_qp,

candidate_mvx => candidate_mvx_qp,

candidate_mvy => candidate_mvy_qp,

reference_data_in => reference_data_in_qp,

current_data_in => current_pixels_qp,

residue_out => residue_out_qp,

enable_fifo => enable_fifo_qp,

reset_fifo => reset_fifo_qp,

winner1 => winner1_qp,

calculate_sad_done => calculate_sad_done_qp,

distance_engine_active => open,

best_sad => best_sad_qp_distance_engine,

best_mv => best_mv_qp

);

best_sad_qp <= best_sad_qp_distance_engine when all_done_fp = '0' else best_sad_out_fp; --

stored best sad fp in qp part

end generate;

no_qpgen8 : if CFG_PIPELINE_COUNT_QP = 0 generate

distance_engine_address_qp <= (others => '0');

residue_out_qp <= (others => '0');

enable_fifo_qp <= '0';

reset_fifo_qp <= '0';

winner1_qp <= '0';

calculate_sad_done_qp <= '0';

best_sad_qp <= (others => '0');

best_mv_qp <= (others => '0');

end generate;

next_rm_address_ready <= '1';

-- control the wiring of the memories

regs : process(clk,clear)

begin

if (clear = '1') then

mux_control_write <= '0';

mux_control_read <= '0';

elsif rising_edge(clk) then

if (reset = '1') then

mux_control_write <= '0';

mux_control_read <= '0';

elsif (start = '1') then

mux_control_write <= not(mux_control_write);

elsif (all_done_fp = '1') then

mux_control_read <= not(mux_control_read);

end if;

end if;

end process regs;

end;

References:

[1] T. Wiegand, G. et al , “Overview of the H.264/AVC Video Coding Standard”, IEEE Trans. on Circuits and Systems for

Video Technology vol. 13, no. 7, pp.560–576, July 2003.

[2]Data locality description: http://www.eetimes.com/design/embedded/4007043/How-to-map-the-H-264-AVC-video-

standard-onto-an-FPGA-fabric.

[3] N. Keshaveni, S. Ramachandran and K. S. Gurumurthy “Design and FPGA Implementation of Integer Transform and

Quantization Processor and Their Inverses for H.264 Video Encoder”, Advances in Computing, Control, &

Telecommunication Technologies, 2009. ACT 2009. International Conference on, pp. 646-649, July 2009

[4] I. Richardson, “The H.264 advanced video compression standard”, Wiley, 2nd edition, 2010.

[5] L. Po and W. C. Ma, IEEE Transactions on “A novel four-step search algorithm for fast block motion estimation”,

Circuits and Systems for Video Technology, vol. 6, no. 3, pp. 313-317, August 1996.

[6] W. I. Choi, B. Jeon and J. Jeong, “Fast motion estimation with modified diamond search for variable motion block sizes”,

Image Processing, 2003. ICIP 2003. Proceedings. 2003 International Conference on, vol. 2, no. 2, pp. 11,

Sept 2003.

[7] How to map the H.264/AVC video standard onto an FPGA fabric -

http://www.eetimes.com/design/embedded/4007043/How-to-map-the-H-264-AVC-video-standard-onto-an-FPGA-fabric.

[8]CODEC Architecture for modules- http://www.drtonygeorge.com/Video/h264/comp_architecture.htm

[9] Q. Peng and J. Jing, “H.264 System on Chip Design and Verification”, The IEEE 2003 Workshop on Signal Processing

Systems(SIPS‟03), 2003.

[10] DSP-Enabled efficient motion estimation for Mobile MPEG-4 video encoding-

http://www.techonline.com/community/21066.

[11] T. Wiegand et al, "Overview of the H.264/AVC Video Coding Standard", IEEE Transactions on Circuits and Systems

for Video Technology, Vol. 13, No. 7, pp. 560-576, July 2003.

[12] T. Wedi and H. G. Musmann, "Motion- and aliasing-compensated prediction for hybrid video coding," IEEE

Transactions on Circuits and Systems for Video Technology, Vol. 13, No. 7, pp. 577- 586, July 2003.

[13] H. S. Malvar et al, "Low-complexity transform and quantization in H.264/AVC," IEEE Transactions on Circuits and

Systems for Video Technology, Vol. 13, No. 7, pp. 598- 603, July 2003.

[14] S. Yeping and S. Ting, "Fast Multiple Reference Frame Motion Estimation for H.264/AVC," IEEE Transactions on

Circuits and Systems for Video Technology, Vol. 16, no. 3, pp 447 – 452, June 2006.

[15] ModelSim simulation software: http://model.com/content/modelsim-pe-student-edition-hdl-simulation .

[16] AIC website: http://www.bilsen.com/aic/

[17] DCT: http://www.domagoj-babic.com/uploads/Pubs/MSC03/msc03.pdf