HIGH THROUGHPUT
MULTISTANDARD TRANSFORM CORE
REALISATION USING CSDA IN VERILOG HDL
A PROJECT REPORT
Submitted by
SELVARANI. K (951911106080)
SUJITHA. M (951911106093)
SWARNAMUKI. R (951911106096)
in partial fulfillment for the award of the degree of
BACHELOR OF ENGINEERING
in
ELECTRONICS AND COMMUNICATION ENGINEERING
P.S.R. ENGINEERING COLLEGE, SIVAKASI-626 140
ANNA UNIVERSITY: CHENNAI 600 025
APRIL 2015
ANNA UNIVERSITY: CHENNAI 600 025
BONAFIDE CERTIFICATE
Certified that this project report “HIGH THROUGHPUT MULTISTANDARD TRANSFORM CORE REALISATION USING CSDA IN VERILOG HDL” is the bonafide work of SELVARANI. K (951911106080), SUJITHA. M (951911106093) and SWARNAMUKI. R (951911106096), who carried out the project under my supervision.
SIGNATURE SIGNATURE
C.K.RAMAR, M.E., Mrs.J.MEENA, M.E.,
HEAD OF THE DEPARTMENT SUPERVISOR
Assistant Professor
Electronics and Communication Electronics and Communication
Engineering Engineering
P.S.R. Engineering College, P.S.R. Engineering College,
Sivakasi-626 140 Sivakasi-626 140
Submitted for the Viva Voce to be held on: _______________
INTERNAL EXAMINER EXTERNAL EXAMINER
ACKNOWLEDGEMENT
First and foremost, we wish to express our deep and unfathomable gratitude to our institution and our department for providing us a chance to fulfill our long-cherished dream of becoming Electronics and Communication Engineers.
We thank our beloved correspondent Mr. R. Solaisamy for his support, and every staff member of our college for their contribution to the growth of this project.
We wish to express our hearty thanks to the principal of our college, Dr. B. G. Vishnuram, M.E., Ph.D., FIE., for his constant motivation and continual encouragement regarding our project work.
We are greatly indebted to our Head of Department, Mr. C. K. Ramar, M.E., for his sincere help and the encouragement he has given towards the accomplishment of this project work.
We express our warm and sincere thanks to our guide, Mrs. J. Meena, M.E., Assistant Professor, Electronics and Communication Engineering, for her tireless and meticulous efforts in bringing this project to its logical conclusion.
We are committed to place our heartfelt thanks to all teaching and non-teaching staff members, lab technicians and friends, and all the noble hearts that gave us immense encouragement towards the completion of our project.
ABSTRACT
Compressing video signals with very small delay and small area in an efficient manner is a challenging task in VLSI design using Verilog HDL. This project proposes an architecture for compressing video signals that supports three different video codec standards, MPEG-1/2/4 (8×8), H.264 (8×8, 4×4) and VC-1 (8×8, 8×4, 4×8, 4×4), through a single core. The compression technique involves transformation, truncation and encoding. Combining factor sharing and distributed arithmetic results in a new scheme called the Common Sharing Distributed Arithmetic (CSDA) algorithm. By using adders instead of multipliers, CSDA efficiently reduces the hardware of the proposed Multi Standard Transform (MST) core. The design also involves an Error Compensated Adder Tree (ECAT), which efficiently reduces the truncation errors obtained from CSDA when compared with a direct DCT. The proposed system uses a buffer for storage instead of pipeline registers; with a transpose memory (TMEM), the transpose of the intermediate matrix can be constructed, which achieves the 2-D CSDA from the 1-D CSDA. The MST core can thus be constructed at low cost with low area, delay and power. Extensive simulations are conducted in the ModelSim simulator to evaluate the notable performances, and the entire schematic diagram is viewed using the RTL schematic in Xilinx software.
TABLE OF CONTENTS
CHAPTER
NO
TITLE PAGE
NO
ABSTRACT IV
LIST OF FIGURES X
LIST OF TABLES XI
LIST OF ABBREVIATIONS XII
1 INTRODUCTION 1
1.1 Video Compression 1
1.2 Lossy Compression 2
1.3 Advantage of video compression 3
1.4 Development in the field of VLSI 4
1.4.1 Reconfigurable computing 4
1.4.2 Takeover of Hardware design 5
1.4.3 The need for hardware compilers 6
1.5 Design Methodology 8
1.6 Objective And Scope 9
1.7 Applications 9
1.8 Recent Research In Video Compression 9
2 VIDEO CODECS 11
2.1 Video Codec Design 11
2.2 Different Standards 14
2.2.1 MPEG 1/2/4 14
2.2.1.1 MPEG-1 14
2.2.1.2 MPEG-2 15
2.2.1.3 MPEG-4 15
2.2.2 H.264 16
2.2.3 VC-1 17
3 SYSTEM ANALYSIS 18
3.1 Project Introduction 18
3.2 Existing system 19
3.2.1 Introduction 19
3.2.2 CSDA 20
3.2.3 Limitations of existing system 20
3.3 Proposed System 20
3.3.1 Introduction 20
3.3.2 Buffer as a memory 20
3.3.3 Advantages of proposed system 21
3.4 Derivation of CSDA Algorithm 21
3.4.1 Factor sharing derivation 21
3.4.2 Distributed arithmetic format 21
3.4.3 CSDA Algorithm 22
3.5 Flow Diagram 24
3.5.1 Description 24
3.6 Modules 25
3.6.1 1-D Common Sharing Distributed Arithmetic-MST 25
3.6.2 Even part common sharing distributed arithmetic circuit 26
3.6.3 Odd part common sharing distributed arithmetic circuit 27
3.6.4 ECAT 28
3.6.5 Permutation 29
3.7 2-D CSDA core design 30
3.7.1 Mathematical Derivation of Eight-Point and Four-Point Transforms 30
3.7.2 TMEM 32
4 SYSTEM IMPLEMENTATION 33
4.1 Xilinx ISE Overview 33
4.1.1 Design Entry 33
4.1.2 Synthesis 33
4.1.3 Implementation 33
4.1.4 Verification 34
4.1.5 Device Configuration 34
4.2 ModelSim Overview 34
4.3 Project flow 36
4.4 Multiple library flow 37
4.5 Debugging tools 37
4.6 VERILOG 38
5 RESULT ANALYSIS 40
5.1 Comparison With Existing Systems 40
5.2 MUX Selection Inputs 41
5.3 MPEG Simulation Result 42
5.4 H.264 Simulation Result 43
5.5 VC-1 Simulation Result 44
5.6 RTL Schematic View Of Entire Process 45
5.7 Synthesis Report For Output 48
5.8 Power Analyzer Output 49
5.9 Device Utilization Summary 55
6 CONCLUSION 51
7 REFERENCES 52
LIST OF FIGURES
FIGURE NO TITLE PAGE NO
3.1 Flow Diagram of CSDA 24
3.2 Architecture of the proposed 1-D CSDA-MST 26
3.3 Architecture of the even part CSDA circuit 27
3.4 Architecture of the odd part CSDA circuit 28
3.5 Architecture of ECAT 29
3.6 Permutation Concept 29
3.7 2D CSDA core with TMEM 30
3.8 TMEM 32
5.1 Simulation result for MPEG 42
5.2 Simulation result for H.264 43
5.3 Simulation result for VC1 44
5.4 RTL view of whole 2D-CSDA Architecture 45
5.5 RTL inner view of 2D-CSDA 46
5.6 RTL inner view of 1D-CSDA 47
5.7 Output for 2-D CSDA-MST delay 48
5.8 Output for 2-D CSDA-MST power 49
5.9 Output for 2-D CSDA-MST gate counts 50
LIST OF TABLES
TABLE NO TITLE PAGE NO
3.1 Corresponding Dimensions of different standards of Video Codecs 18
5.1 Measured Results 40
5.2 Selection inputs for different standards 41
LIST OF ABBREVIATIONS
ABBREVIATION EXPANSION
ASIC Application Specific Integrated Circuit
CAD Computer Aided Design
CMOS Complementary Metal Oxide Semiconductor
CSDA Common Sharing Distributed Arithmetic
DA Distributed Arithmetic
ECAT Error Compensated Adder Tree
FS Factor Sharing
IT Integer Transform
MPEG Moving Picture Experts Group
MST Multi Standard Transform
NEDA New Distributed Arithmetic
RTL Register Transfer Level
SOC System On Chip
TMEM Transpose Memory
VC Video Codec
VHDL Very High Speed Integrated Circuit
Hardware Description Language
CHAPTER 1
INTRODUCTION
Compression of video and image signals can mainly be done using several transforms and techniques, such as the Discrete Cosine Transform, integer transforms, distributed arithmetic and factor sharing. These transforms are mainly used as matrix decomposition methods to reduce the hardware cost as well as the implementation cost, but implementing such transforms can be tedious in some cases, especially when making a single architecture compatible with several different standards. In this project, a new technique that supports three video coding standards in one core is implemented.
1.1 VIDEO COMPRESSION
Video compression uses modern coding techniques to reduce
redundancy in video data. Video is essentially a continuous sequence of frames or images captured from a moving scene. Most video compression algorithms and codecs combine
spatial image compression and temporal motion compensation technique
with some encoding features to secure the data. Video compression is a
practical implementation of source coding in information theory. In
practice, most video codecs also use audio compression techniques in
parallel to compress the separate, but combined data streams as one
package. The majority of video compression algorithms use lossy compression, which is the best option for reducing the delay. Uncompressed
video requires a very high data rate. Although lossless video
compression codecs perform an average compression of over factor 3, a
typical MPEG-4 lossy compression video has a compression factor
between 20 and 200. As in all lossy compression, there is a trade-off between video quality, the cost of processing the compression and decompression, and system requirements. Highly compressed video may present visible or distracting artifacts.
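To make the figures above concrete, here is a small arithmetic sketch; the 720×480, 24-bit, 30 frame/s stream is an assumed example, not data from this report:

```python
# Illustrative arithmetic (assumed example figures, not measured data):
# raw data rate of a video stream versus its size after compression.

def raw_bitrate_bps(width, height, bits_per_pixel, fps):
    """Uncompressed data rate in bits per second."""
    return width * height * bits_per_pixel * fps

# A hypothetical 720x480 stream at 24 bits/pixel, 30 frames/s.
raw = raw_bitrate_bps(720, 480, 24, 30)   # ~249 Mbit/s uncompressed
lossless = raw / 3                        # factor-3 lossless, as cited above
lossy = raw / 100                         # a mid-range MPEG-4 factor (20-200)

print(round(raw / 1e6), round(lossless / 1e6), round(lossy / 1e6))  # -> 249 83 2
```

The three orders of magnitude between the raw and lossy figures are what make digital video distribution practical at all.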
Some video compression schemes typically operate on square-
shaped groups of neighboring pixels, often called macroblocks. These
pixel groups or blocks of pixels are compared from one frame to the next,
and the video compression codec sends only the differences within those
blocks. In areas of video with more motion, the compression must encode
more data to keep up with the larger number of pixels that are changing.
Commonly during explosions, flames, flocks of animals, and in some
panning shots, the high-frequency detail leads to quality decreases or to
increases in the variable bitrate.
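The block-differencing idea described above can be sketched as follows. This is a generic illustration, not the scheme of any particular codec; the 8×8 block size and zero threshold are assumptions for the example:

```python
# Sketch of inter-frame macroblock differencing: the encoder codes only
# those blocks that changed between consecutive frames.

BLOCK = 8  # assumed macroblock size for this sketch

def changed_blocks(prev, curr, threshold=0):
    """Return (block_row, block_col) of BLOCKxBLOCK blocks that differ."""
    h, w = len(prev), len(prev[0])
    out = []
    for r in range(0, h, BLOCK):
        for c in range(0, w, BLOCK):
            if any(abs(curr[y][x] - prev[y][x]) > threshold
                   for y in range(r, r + BLOCK)
                   for x in range(c, c + BLOCK)):
                out.append((r // BLOCK, c // BLOCK))
    return out

prev = [[0] * 16 for _ in range(16)]
curr = [row[:] for row in prev]
for y in range(8):
    for x in range(8, 16):
        curr[y][x] = 50            # motion confined to one 8x8 block
print(changed_blocks(prev, curr))  # -> [(0, 1)]
```

A static scene yields an empty list, which is exactly why low-motion video compresses so well and high-motion scenes inflate the bitrate.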
1.2 LOSSY COMPRESSION
In information technology, "lossy" compression is the class of data encoding methods that uses inexact approximations (or partial data discarding) to represent the content that has been encoded. Such compression techniques are used to reduce the amount of data that would otherwise be needed to store, handle, and/or transmit the represented content. As more details of the original data are removed, the approximation of an image becomes progressively coarser. The amount of data reduction possible using lossy compression can often be much more substantial than what is possible with lossless data compression techniques.
Using well-designed lossy compression technology, a substantial
amount of data reduction is often possible before the result is sufficiently
degraded to be noticed by the user. Even when the degree of degradation
becomes noticeable, further data reduction may often be desirable for
some applications (e.g., to make real-time communication possible
through a limited bit-rate channel, to reduce the time needed to transmit
the content, or to reduce the necessary storage capacity).
Lossy compression is most commonly used to compress multimedia data (audio, video, and still images), especially in applications such as streaming media and Internet telephony. By contrast,
lossless compression is typically required for text and data files, such as
bank records and text articles. In many cases it is advantageous to make a
master lossless file that can then be used to produce compressed files for
different purposes; for example, a multi-megabyte file can be used at full
size to produce a full-page advertisement in a glossy magazine, and a
10 kilobyte lossy copy can be made for a small image on a web page.
1.3 ADVANTAGES OF VIDEO COMPRESSION
The main advantage of compression is that it reduces the data
storage requirements. It also offers an attractive approach to reduce the
communication cost in transmitting high volumes of data over long-haul
links via higher effective utilization of the available bandwidth in the data
links. This significantly aids in reducing the cost of communication due
to the data rate reduction. Because of the data rate reduction, data
compression also increases the quality of multimedia presentation
through limited-bandwidth communication channels. Hence the audience
can experience rich-quality signals for audio-visual data representation.
For example, because of the sophisticated compression technologies we
can receive toll-quality audio at the other side of the globe through the
good old telecommunications channels at a much better price compared
to a decade ago. Because of the significant progress in image
compression techniques, a single 6 MHz broadcast television channel can
carry HDTV signals to provide better quality audio and video at much
higher rates and enhanced resolution without additional bandwidth
requirements. The rate of input-output operations in a computing device
can be greatly increased due to shorter representation of data.
In systems with levels of storage hierarchy, data compression in
principle makes it possible to store data at a higher and faster storage
level (usually with smaller capacity), thereby reducing the load on the
input-output channels. Data compression obviously reduces the cost of
backup and recovery of data in computer systems by storing the backup
of large database files in compressed form. The advantages of data
compression will enable more multimedia applications with reduced cost
and hence aid its usage by a larger population with newer applications in
the near future.
1.4 DEVELOPMENTS IN THE FIELD OF VLSI
There are a number of directions a person can take in VLSI, and
they are all closely related to each other. Together, these developments
are going to make possible the visions of embedded systems and
ubiquitous computing.
1.4.1 Reconfigurable computing
Reconfigurable computing is a very interesting and pretty recent
development in microelectronics. It involves fabricating circuits that can
be reprogrammed on the fly! And no, we are not talking about
microcontrollers running with EEPROM inside. Reconfigurable computing involves specially fabricated devices called FPGAs that, when programmed, act just like normal electronic circuits. They are so designed that by changing or "reprogramming" the connections between numerous sub-modules, the FPGAs can be made to behave like any circuit we wish.
This fantastic ability to create modifiable circuits again opens up
new possibilities in microelectronics. Consider for example,
microprocessors which are partly reconfigurable. We know that running complex programs can benefit greatly if support is built into the hardware itself. We could have a microprocessor that could optimise itself for every task that it tackled! Or consider a system that is too big to implement on hardware that may be limited by cost or other constraints. If we use a reconfigurable platform, we could design the system so that parts of it are mapped onto the same hardware at different times. One could think of many such applications, not the least of which is prototyping: using an FPGA to try out a new design before it is actually fabricated. This can drastically reduce development cycles, and also save some money that would have been spent in fabricating prototype ICs.
1.4.2 Takeover of Hardware design
ASICs provide the path to creating miniature devices that can perform a lot of diverse functions. But with the impending boom in this kind of technology, what we need is a large number of people who can design these ICs. This is where we cross the threshold between a chip designer and a systems designer working at a higher level. Does a person designing a chip really need to know every minute detail of the IC manufacturing process? Can there be tools that allow a designer to simply create design specifications that get translated into hardware specifications?
The solution to this is rather simple: hardware compilers, or silicon compilers as they are called. We know by now that there exist languages like VHDL which can be used to specify the design of a chip. What if we had a compiler that converts a high-level language into a VHDL specification? The potential of this technology is tremendous: in a simple manner, we can convert all the software programmers into hardware designers.
1.4.3 The need for hardware compilers
Before we go further let us look at why we need this kind of
technology that can convert high-level languages into hardware
definitions. We see a set of needs which actually lead from one to the
other in a series.
A. Rapid development cycles
The traditional method of designing hardware is a long and
winding process, going through many stages with special effort spent in
design verification at every stage. This means that the time from drawing
board to market is very long. This proves to be rather undesirable in the case of a large, expanding market with many competitors trying to grab a share. We need alternatives to cut down on this time so that new ideas reach the market faster, where the first to get in normally gains a large advantage.
B. Large number of designers
With embedded systems becoming more and more popular, there is
a need for a large number of chip designers, who can churn out chips
designed for specific applications. It is impractical to think of training so many people in the intricacies of VLSI design.
C. Specialized training
A person who wishes to design ASICs will require extensive training in the field of VLSI design. But we cannot possibly expect to find a large number of people who would wish to undergo such training. Also, the process of training these people will itself entail large investments in time and money. This means there has to be a system which can abstract out all the details of VLSI, and which allows the user to think in simple system-level terms.
There are quite a few tools available for using high-level languages
in circuit design. But this area has started showing fruits only recently.
For example, there is a language called Handel-C, that looks just like
good old C. But it has some special extensions that make it usable for
defining circuits. A program written in Handel-C, can be represented
block-by-block by hardware equivalents. And in doing all this, the
compiler takes care of all low-level issues like clock-frequency, layout,
etc. The biggest selling point is that the user does not really have to learn
anything new, except for the few extensions made to C, so that it may be
conveniently used for circuit design.
Another quite different language, that is still under development, is
Lava. This is based on an esoteric branch of computer science, called
"functional programming". FP itself is pretty old, and is radically
different from the normal way we write programs. This is because it
assumes parallel execution as a part of its structure - it’s not based on the
normal idea of "sequence of instructions". This parallel nature is something very suitable for hardware, since logic circuits are inherently parallel in nature. Preliminary studies have shown that Lava
can actually create better circuits than VHDL itself, since it affords a
high-level view of the system, without losing sight of low-level features.
1.5 DESIGN METHODOLOGY
A good VLSI design system should provide consistency across all three description domains (behavioral, structural, and physical) and at every level of abstraction (e.g. architecture, RTL/block, logic, circuit). The quality of a design is measured in terms that differ in importance based on the application. These parameters can be summarized in terms of
Performance: speed, power, flexibility
Size of die (cost of die)
Time to design (cost of engineering and schedule)
Ease of verification, test generation and testability (cost of engineering and schedule)
Design is a continuous trade-off to achieve adequate results for all of the above parameters, so the tools and methodologies used for a particular chip are chosen on the basis of these parameters. Other constraints depend on economics (e.g., size of die affecting yield) or are even subjective.
The process of designing a system on silicon is complicated, so the role of good VLSI design aids is to reduce this complexity, increase productivity, and assure the designer of a working product. A good design method simplifies the approach to a design through the use of constraints and abstraction. The design method is distinct from the design flow used to build a chip. The basic design methods can be arranged roughly in order of "increasing investment", which loosely relates to the time and cost it takes to design and implement the system. It is important to understand the cost, capabilities and limitations of a given implementation technology in order to select the right solution; it makes little sense to design a custom chip when an off-the-shelf solution that meets the system criteria is available at the same or lower cost.
1.6 OBJECTIVE AND SCOPE
This project deals with an MST core that supports the H.264 (8 × 8, 4 × 4), MPEG-1/2/4 (8 × 8), and VC-1 (8 × 8, 8 × 4, 4 × 8, 4 × 4) transforms. The proposed MST core employs distributed arithmetic and factor sharing schemes, combined as common sharing distributed arithmetic (CSDA), to reduce hardware cost.
Our new design of the multistandard transform video codec architecture achieves high throughput, low area and low delay.
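The core idea behind replacing multipliers with adders can be illustrated generically. The sketch below shows multiplierless constant multiplication via canonical signed digit (CSD) recoding, the principle underlying factor-sharing and distributed-arithmetic designs; it is an illustration of that principle only, not the report's actual CSDA derivation:

```python
# Multiplierless constant multiplication: a fixed coefficient is recoded in
# canonical signed digit (CSD) form, so x * coef becomes a few shifts plus
# additions/subtractions instead of a full hardware multiplier.

def to_csd(coef):
    """Recode a positive integer as CSD digits [(sign, shift), ...]."""
    digits, shift = [], 0
    while coef:
        if coef & 1:
            d = -1 if (coef & 3) == 3 else 1   # pick -1 to create runs of zeros
            digits.append((d, shift))
            coef -= d
        coef >>= 1
        shift += 1
    return digits

def mul_shift_add(x, coef):
    """Multiply x by coef using only shifts and adds/subtracts."""
    return sum(s * (x << k) for s, k in to_csd(coef))

# 23 recodes as 32 - 8 - 1: three shift-add terms replace a multiplier.
print(to_csd(23), mul_shift_add(7, 23))   # -> [(-1, 0), (-1, 3), (1, 5)] 161
```

In a transform core the coefficients are fixed, so each such recoding becomes a small fixed adder network, and common subexpressions can then be shared across coefficients.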
1.7 APPLICATIONS
Digital video codecs are found in DVD systems (players,
recorders), Video CD systems, in emerging satellite and digital terrestrial
broadcast systems, various digital devices and software products with
video recording or playing capability. Online video material is encoded by a variety of codecs, and this has led to the availability of codec packs: pre-assembled sets of commonly used codecs combined with an installer, available as software packages for PCs, such as K-Lite Codec Pack. Encoding of media by the public has seen an upsurge with the availability of CD and DVD recorders.
1.8 RECENT RESEARCH IN VIDEO COMPRESSION
Although the imminent death of research into video compression
has often been proclaimed, the growth in capacity of telecommunications
networks is being outpaced by the rapidly increasing demand for services.
The result is an ongoing need for better multimedia compression, and
particularly video and image compression.
At the same time, there is a need for these services to be carried on
networks of greatly varying capacities and qualities of service, and to be
decoded by devices ranging from small, low-power, handheld terminals
to much more capable fixed systems. Hence, the ideal video compression algorithm should have high compression efficiency, be scalable to accommodate variations in network performance, including capacity and quality of service, and be scalable to accommodate variations in decoder capability.
In this presentation, these issues will be examined, illustrated by recent
research at UNSW@ADFA in compression efficiency, scalability and
error resilience.
CHAPTER 2
VIDEO CODECS
2.1 VIDEO CODEC DESIGN
A video codec is a device or software that enables compression or
decompression of digital video; the format of the compressed data
adheres to a video compression specification. The compression is usually
lossy. Historically, video was stored as an analog signal on magnetic tape.
Around the time when the compact disc entered the market as a digital-
format replacement for analog audio, it became feasible to also begin
storing and using video in digital form, and a variety of such technologies
began to emerge. Audio and video call for customized methods of compression, which may lead to new trends in telecommunication and wireless systems. Engineers and mathematicians have tried a number of solutions for tackling this problem.
There is a complex relationship between the video quality, the
quantity of the data needed to represent it (also known as the bit rate), the
complexity of the encoding and decoding algorithms, robustness to data
losses and errors, ease of editing, random access, and end-to-end delay.
Video codecs seek to represent a fundamentally analog data set in a
digital format. Because of the design of analog video signals, which
represent luma and color information separately, a common first step in
image compression in codec design is to represent and store the image in
a Y,Cb,Cr color space. The conversion to Y,Cb,Cr provides two benefits:
first, it improves compressibility by providing decorrelation of the color
signals; and second, it separates the luma signal, which is perceptually
much more important, from the chroma signal, which is less perceptually
important and which can be represented at lower resolution to achieve
more efficient data compression. It is common to represent the ratios of information stored in these different channels as Y:Cb:Cr chroma subsampling ratios.
Different codecs will use different chroma subsampling ratios as
appropriate to their compression needs. Video compression schemes for
Web and DVD make use of a 4:2:0 color sampling pattern, and the DV
standard uses 4:1:1 sampling ratios. Professional video codecs designed
to function at much higher bitrates and to record a greater amount of
color information for post-production manipulation sample in 3:1:1
(uncommon), 4:2:2 and 4:4:4 ratios. Examples of these codecs include Panasonic's DVCPRO50 and DVCPROHD codecs (4:2:2), Sony's HDCAM-SR (4:4:4) and Panasonic's HDD5 (4:2:2). Apple's ProRes HQ 422 codec also samples in 4:2:2 color space. More codecs that sample in 4:4:4 patterns exist as well, but are less common, and tend to be used internally in post-production houses. It is also worth noting that video codecs can operate in RGB space as well. These codecs tend not to sample the red, green, and blue channels in different ratios, since there is less perceptual motivation for doing so; at most, the blue channel could be undersampled.
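The color conversion and subsampling steps described above can be sketched as follows. BT.601 full-range coefficients (as used in JPEG) are assumed here; real codecs differ in coefficients and value ranges:

```python
# Convert RGB to Y,Cb,Cr (BT.601 full-range coefficients), then 4:2:0-
# subsample a chroma plane by averaging each 2x2 neighbourhood.

def rgb_to_ycbcr(r, g, b):
    y  =       0.299    * r + 0.587    * g + 0.114    * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5      * b
    cr = 128 + 0.5      * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def subsample_420(plane):
    """Average each 2x2 block of a chroma plane (the 4:2:0 pattern)."""
    return [[(plane[y][x] + plane[y][x+1] + plane[y+1][x] + plane[y+1][x+1]) / 4
             for x in range(0, len(plane[0]), 2)]
            for y in range(0, len(plane), 2)]

y, cb, cr = rgb_to_ycbcr(0, 255, 0)    # pure green
print(round(y), round(cb), round(cr))  # -> 150 44 21
```

Because the eye is less sensitive to chroma detail, halving each chroma plane in both dimensions (4:2:0) discards half the total samples with little visible loss.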
Some amount of spatial and temporal downsampling may also be used to reduce the raw data rate before the basic encoding process, which applies a block transform. The most popular such transform is the 8x8 discrete cosine transform (DCT).
Codecs which make use of a wavelet transform are also entering the
market, especially in camera workflows which involve dealing with
RAW image formatting in motion sequences. The output of the transform
is first quantized, then entropy encoding is applied to the quantized
values. When a DCT has been used, the coefficients are typically scanned
using a zig-zag scan order, and the entropy coding typically combines a
number of consecutive zero-valued quantized coefficients with the value
of the next non-zero quantized coefficient into a single symbol, and also
has special ways of indicating when all of the remaining quantized
coefficient values are equal to zero. The entropy coding method typically
uses variable-length coding tables. Some encoders can compress the
video in a multiple step process called n-pass encoding (e.g. 2-pass),
which performs a slower but potentially better quality compression.
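The zig-zag scan and zero-run coding described above can be sketched as follows. This is a simplified illustration; real entropy coders go on to map the run/value symbols to variable-length codes:

```python
# Zig-zag scan an NxN block of quantized coefficients, then pair each
# nonzero value with the run of zeros before it, ending with an "EOB"
# (end-of-block) marker that stands for all remaining zeros.

def zigzag(block):
    n = len(block)
    order = sorted(((r, c) for r in range(n) for c in range(n)),
                   key=lambda rc: (rc[0] + rc[1],                    # anti-diagonal
                                   rc[0] if (rc[0] + rc[1]) % 2      # alternate
                                   else rc[1]))                      # direction
    return [block[r][c] for r, c in order]

def run_length(coeffs):
    symbols, run = [], 0
    for v in coeffs:
        if v == 0:
            run += 1
        else:
            symbols.append((run, v))   # (zeros before value, value)
            run = 0
    symbols.append("EOB")              # trailing zeros collapse to one marker
    return symbols

quantized = [[9, 2, 0, 0],
             [1, 0, 0, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]
print(run_length(zigzag(quantized)))   # -> [(0, 9), (0, 2), (0, 1), 'EOB']
```

Sixteen coefficients collapse to four symbols because quantization concentrates the energy in the low-frequency corner that the zig-zag visits first.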
The decoding process consists of performing, to the extent
possible, an inversion of each stage of the encoding process. The one
stage that cannot be exactly inverted is the quantization stage. There, a
best-effort approximation of inversion is performed. This part of the
process is often called "inverse quantization" or "dequantization",
although quantization is an inherently non-invertible process.
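The non-invertibility of quantization can be shown with a minimal numeric example; the step size of 10 is arbitrary:

```python
# Many input values map to the same quantized level, so "dequantization"
# can only return an approximation, never the original value.

STEP = 10  # arbitrary quantization step size for this illustration

def quantize(x):
    return round(x / STEP)

def dequantize(q):
    return q * STEP

for x in (17, 21, 23):
    print(x, "->", dequantize(quantize(x)))   # all three recover as 20
```

Since 17, 21 and 23 all land on the same level, the information distinguishing them is permanently lost; this is the step that makes the overall scheme lossy.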
This process involves representing the video image as a set of macroblocks, a critical facet of video codec design.
Video codec designs are often standardized, or will be in the future; i.e., they are specified precisely in a published document. However, only the
decoding process needs to be standardized to enable interoperability. The
encoding process is typically not specified at all in a standard, and
implementers are free to design their encoder however they want, as long
as the video can be decoded in the specified manner. For this reason, the
quality of the video produced by decoding the results of different
encoders that use the same video codec standard can vary dramatically
from one encoder implementation to another.
2.2 DIFFERENT STANDARDS
In this project three different standards are considered. They are:
MPEG 1/2/4
H.264
VC-1
2.2.1 MPEG 1/2/4
The MPEG standards consist of different Parts. Each part covers a
certain aspect of the whole specification. The standards also
specify Profiles and Levels. Profiles are intended to define a set of tools
that are available, and Levels define the range of appropriate values for
the properties associated with them. Some of the approved MPEG
standards were revised by later amendments and/or new editions. MPEG
has standardized the following compression formats and ancillary
standards:
2.2.1.1 MPEG-1
Coding of moving pictures and associated audio for digital storage
media at up to about 1.5 Mbit/s (ISO/IEC 11172). The first MPEG
compression standard for audio and video. It is commonly limited to
about 1.5 Mbit/s although the specification is capable of much higher bit
rates. It was basically designed to allow moving pictures and sound to be
encoded into the bitrate of a Compact Disc. It is used on Video CD and
can be used for low-quality video on DVD Video. It was used in digital
satellite/cable TV services before MPEG-2 became widespread. To meet
the low bit-rate requirement, MPEG-1 downsamples the images and
uses picture rates of only 24–30 Hz, resulting in moderate quality. It
includes the popular MPEG-1 Audio Layer III (MP3) audio compression
format.
2.2.1.2 MPEG-2
Generic coding of moving pictures and associated audio
information (ISO/IEC 13818). Transport, video and audio standards for
broadcast-quality television. MPEG-2 standard was considerably broader
in scope and of wider appeal supporting interlacing and high definition.
MPEG-2 is considered important because it has been chosen as the
compression scheme for over-the-air digital television (ATSC, DVB and
ISDB), digital satellite TV services like Dish Network, digital cable
television signals, SVCD and DVD Video. It is
also used on Blu-ray Discs, but these normally use MPEG-4 Part 10 or
SMPTE VC-1 for high-definition content.
2.2.1.3 MPEG-4
Coding of audio-visual objects (ISO/IEC 14496) MPEG-4 uses
further coding tools with additional complexity to achieve higher
compression factors than MPEG-2. In addition to more efficient coding
of video, MPEG-4 moves closer to computer graphics applications. In
more complex profiles, the MPEG-4 decoder effectively becomes a
rendering processor and the compressed bit stream describes three-
dimensional shapes and surface texture. MPEG-4 supports Intellectual
Property Management and Protection (IPMP), which provides the facility
to use proprietary technologies to manage and protect content like digital
rights management. It also supports MPEG-J, a fully programmatic
solution for creation of custom interactive multimedia applications (Java
application environment with a Java API) and many other features.
2.2.2 H.264
H.264 or MPEG-4 Part 10, Advanced Video Coding (MPEG-4
AVC) is a video compression format that is currently one of the most
commonly used formats for the recording, compression, and distribution
of video content. H.264/MPEG-4 AVC is a block-oriented motion-
compensation-based video compression standard developed by the ITU-
T Video Coding Experts Group (VCEG) together with the ISO/IEC
JTC1 Moving Picture Experts Group (MPEG).
H.264 is perhaps best known as being one of the video encoding
standards for Blu-ray Discs; all Blu-ray Disc players must be able to
decode H.264. It is also widely used by streaming internet sources, such
as videos from Vimeo, YouTube, and the iTunes Store, web software
such as the Adobe Flash Player and Microsoft Silverlight. H.264 is
typically used for lossy compression in the strict mathematical sense,
although the amount of loss may sometimes be imperceptible. It is also
possible to create truly lossless encodings with it, e.g., to have localized
lossless-coded regions within lossy-coded pictures or to support rare use
cases for which the entire encoding is lossless.
The intent of the H.264/AVC project was to create a standard
capable of providing good video quality at substantially lower bit rates
than previous standards (i.e., half or less the bit rate of MPEG-2, H.263,
or MPEG-4 Part 2), without increasing the complexity of design so much
that it would be impractical or excessively expensive to implement. An
additional goal was to provide enough flexibility to allow the standard to
be applied to a wide variety of applications on a wide variety of networks
and systems, including low and high bit rates, low and high resolution
video, broadcast, DVD storage, RTP/IP packet networks, and ITU-
T multimedia telephony systems.
The H.264 standard can be viewed as a "family of standards"
composed of the profiles described below. A specific decoder decodes at
least one, but not necessarily all profiles. The decoder specification
describes which profiles can be decoded. The H.264 name follows
the ITU-T naming convention, where the standard is a member of the
H.26x line of VCEG video coding standards.
2.2.3 VC-1
VC-1 is an evolution of the conventional DCT-based video codec
design also found in H.261, MPEG-1 Part 2, H.262/MPEG-2 Part
2, H.263, and MPEG-4 Part 2. It is widely characterized as an alternative
to the ITU-T and MPEG video codec standard known as H.264/MPEG-4
AVC. VC-1 contains coding tools for interlaced video sequences as well
as progressive encoding. The main goal of VC-1 Advanced Profile
development and standardization was to support the compression of
interlaced content without first converting it to progressive, making it
more attractive to broadcast and video industry professionals.
Both HD DVD and Blu-ray Disc have adopted VC-1 as a video
standard, meaning their video playback devices will be capable of
decoding and playing video-content compressed using VC-1. Windows
Vista partially supports HD DVD playback by including the VC-1
decoder and some related components needed for playback of VC-1
encoded HD DVD movies.
CHAPTER 3
SYSTEM ANALYSIS
3.1 PROJECT INTRODUCTION
Compression can mainly be done using several transforms, such as the
Discrete Cosine Transform, integer transforms, distributed arithmetic, and
factor sharing, applied to video and image signals. These transforms mainly
use matrix decomposition methods to reduce the hardware cost as
well as the implementation cost. Swartzlander and Yu presented an efficient
method for reducing ROM size by using recursive DCT algorithms.
ROMs scale poorly with shrinking technology nodes, however, and
numerous ROM-free DA architectures have emerged recently. A DA
sharing scheme called NEDA uses bit-level sharing to implement the
butterfly matrix based on adders. These techniques are used to support any
one of the application standards (Table 3.1).
Table 3.1 Corresponding Dimensions of Different Video Codecs

Video Codecs   Dimensions              Groups
MPEG 1/2/4     8×8                     ISO
H.264          8×8, 4×4                ITU-T
VC-1           8×8, 8×4, 4×8, 4×4     Microsoft

The DFT is the most important discrete transform, used to
perform Fourier analysis in many practical applications. In digital signal
processing, the function is any quantity or signal that varies over time,
such as the pressure of a sound wave, a radio signal, or
daily temperature readings, sampled over a finite time interval (often
defined by a window function). In image processing, the samples can be
the values of pixels along a row or column of a raster image. The DFT is
also used to efficiently solve partial differential equations, and to perform
other operations such as convolutions or multiplying large integers.
Likewise, the DCT and the other transforms offer advantages for
increasing the throughput rate.
3.2 EXISTING SYSTEM
3.2.1 INTRODUCTION
Numerous researchers have worked on transform core designs,
including discrete cosine transform (DCT) and integer transform, using
distributed arithmetic (DA), factor sharing (FS), and matrix
decomposition methods to reduce hardware cost. The inner product can
be implemented using ROMs and accumulators instead of multipliers to
reduce the area cost. To improve the throughput rate of the NEDA
method, high-throughput adder trees are introduced. The FS method derives
matrices for multiple standards as linear combinations of the same matrix
and a delta matrix, and shows that the coefficients in the same matrix can
share the same hardware resources. Matrices for VC-1 transformations
can be decomposed into several small matrices. Recently, reconfigurable
architectures have been presented as a solution to achieve a good
flexibility of processors in field-programmable gate array (FPGA)
platform or application-specific integrated circuit (ASIC). These
existing methods fully support the transform core for the H.264 standard,
including 8 × 8 and 4 × 4 transforms. However, the eight-point and
four-point transform cores for MPEG-1/2/4 and H.264 cannot support the
VC-1 compression standard. The proposed system overcomes this
limitation.
3.2.2 CSDA
CSDA stands for Common Sharing Distributed Arithmetic; it is a
technique that combines factor sharing and distributed arithmetic to
generate the CSDA coefficients. Factor sharing means sharing the same
factors among the coefficients, and distributed arithmetic means sharing
the same input combinations. In the existing system, a pipeline register is
used as the storage element.
3.2.3 LIMITATIONS OF EXISTING SYSTEM
Low throughput
High cost
High delay
More number of adders
3.3 PROPOSED SYSTEM
3.3.1 INTRODUCTION
The proposed CSDA combines the DA and FS methods by expanding
the coefficient matrix at the bit level. The factor sharing method first
shares the same factor in each coefficient; the distributed arithmetic
method is then applied to share the same combination of inputs among the
coefficient positions.
3.3.2 BUFFER AS A MEMORY
In the proposed system, a buffer is used as the memory element
instead of a pipeline register. The buffer is active only when the clock
input is high. Using a buffer lets the bit stream flow without halting in
memory, so the delay is considerably reduced. Because no data is held in
a register, the retrieval time is very small.
3.3.3 ADVANTAGES OF PROPOSED SYSTEM
High throughput
Low cost
Supports three different types of video codecs
Reduction in number of adders
3.4 DERIVATION OF CSDA ALGORITHM
The CSDA algorithm mainly combines Factor Sharing and
Distributed Arithmetic techniques. The methods for computing the
coefficients are given below.
3.4.1 Factor sharing derivation
In this technique, signals containing the same factor share that
factor. If the signals S1 and S2 can be written as

    S1 = C1 · X = Fs · (Fd1 · X)
    S2 = C2 · X = Fs · (Fd2 · X)                       (3.1)

where Fs (the shared factor) and Fd1, Fd2 (the remainder coefficients)
can be found in the coefficients C1 and C2, respectively, then the
hardware for the factor Fs can be shared between the two signals.
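A small numeric sketch of the idea (the coefficient values C1 = 12 = 4 · 3 and C2 = 20 = 4 · 5 are made up for illustration): the multiplication by the shared factor Fs is computed once and reused for both products.

```python
Fs, Fd1, Fd2 = 4, 3, 5   # hypothetical C1 = Fs*Fd1 = 12, C2 = Fs*Fd2 = 20

def shared_factor_products(x):
    t = Fs * x               # the shared-factor multiplication, done once
    return Fd1 * t, Fd2 * t  # S1 = C1*x and S2 = C2*x from the shared term
```

This is equivalent to computing 12·x and 20·x with two full multipliers, but the Fs stage is built only once in hardware.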
3.4.2 Distributed Arithmetic format
For matrix multiplication and accumulation, the inner product can
be written as

    Y = Σi Ai · Xi,  i = 1, …, M                       (3.2)

where Ai is an N-bit CSD coefficient and Xi is the input data. Expanding
every coefficient at the bit level gives

    Y = Σj 2^(−j) · Yj,  where Yj = Σi Ai,j · Xi       (3.3)

The product Y can be obtained by shifting and adding every nonzero Yj.
The inner product can therefore be implemented at low cost using shifters
and adders instead of multipliers.
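The shift-and-add evaluation of (3.2) and (3.3) can be sketched in software as follows (for simplicity the coefficients here are plain unsigned binary rather than CSD, and integer weights 2^j are used in place of the fractional 2^(−j)):

```python
def da_inner_product(coeffs, xs, nbits=8):
    """Evaluate sum(A_i * x_i) without multipliers: for each bit position j,
    Y_j = sum of the inputs whose coefficient has a 1 in bit j, and the
    result is the shifted-and-added sum of the Y_j terms."""
    y = 0
    for j in range(nbits):
        yj = sum(x for a, x in zip(coeffs, xs) if (a >> j) & 1)
        y += yj << j   # shift-and-add; only adders and shifters are used
    return y
```

For coefficients [3, 5] and inputs [10, 20] this returns 130, matching the direct inner product 3·10 + 5·20.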
3.4.3 CSDA Algorithm
The inner product can be written as a product of inputs and a
coefficient matrix

    [S1 S2 S3 S4]ᵀ = C · [X1 X2 X3 X4]ᵀ               (3.4)

Expanding each coefficient of C at the bit level in CSD form expresses
every Si as a sum of shifted shared terms.                (3.5)
This section provides a discussion of the hardware resources and
system accuracy of the proposed 2-D CSDA-MST core and also presents
a comparison with previous works. Finally, the characteristics of the chip
implementation are described.
The coefficients can be generated as in the above matrix; the values
are compared, the same factors, i.e., [1 −1], are shared, and the shared
factor Fs is calculated. From the shared factor, the distributed arithmetic
values are then derived with the help of the inputs Xi. The CSDA
combines the factor sharing and DA methods. The FS method is applied
first to identify the factors that achieve the greatest hardware resource
sharing. The shared factor Fs in the four coefficients is [1 −1], and it can
be used at the corresponding positions of C1 and C2 under the FS
method. Distributed arithmetic is then applied to share the same input
positions, giving the DA-shared coefficient DA1 = (X1 + X2) · Fs.
Finally, the matrix inner product in the above equation can be
implemented by shifting and adding every nonzero weight position.
To carry out the searching flow, software code performs iterative
searching loops under a constraint on the minimum number of nonzero
elements. Because the choice of shared coefficients is obtained under
these constraints, the coefficients are not a globally optimal solution with
the minimal number of nonzero bits.
3.5 FLOW DIAGRAM
3.5.1 DESCRIPTION:
Fig 3.1 CSDA flow diagram (input coefficient matrix → iteration
searching loop: FS finds a new shared factor in the coefficient matrix;
DA finds shared coefficients based on the FS results; calculate the
number of adders; compare with the previous adder counts and update
the smallest one for FS and DA → find the CSDA shared coefficient)

To obtain better resource sharing for the inner product operation, the
proposed CSDA combines the FS and DA methods. The FS method is
adopted first to identify the factors that can achieve higher capability in
hardware resource sharing, where the hardware resource in this work is
defined as the number of adders used. Next, the DA method is used to
find the shared coefficient based on the results of the FS method.
Adder-tree circuits then follow the proposed CSDA circuit, so the
CSDA method aims to reduce the nonzero elements to as few as
possible. The CSDA shared coefficient is used for estimating and
comparing the number of adders in each CSDA loop. The iteration
searching loop therefore requires a large number of iterations to
determine the smallest hardware resource, after which the CSDA shared
coefficient is established. Notice that the factor or coefficient that is
optimal under the FS or DA method alone does not necessarily yield the
smallest resource in the combined CSDA method; thus, a number of
iteration loops is needed to determine a better CSDA shared coefficient.
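A brute-force sketch of such a searching loop is given below. The adder-cost model (counting nonzero CSD digits of the shared factor plus those of each remainder coefficient) is a simplification of the actual cost function, and the coefficient values are made up for illustration.

```python
def nonzero_digits(n):
    """Adder-cost proxy: number of nonzero digits in the CSD (NAF) form
    of a positive integer n."""
    cost = 0
    while n:
        if n % 2:
            n -= 2 - (n % 4)   # consume a +1 or -1 signed digit
            cost += 1
        n //= 2
    return cost

def best_shared_factor(coeffs):
    """Exhaustively try candidate shared factors Fs dividing every
    coefficient; keep the one minimizing the total adder proxy
    (cost of Fs, counted once, plus cost of each remainder Fd_i)."""
    best = (sum(nonzero_digits(c) for c in coeffs), 1)   # baseline: no sharing
    for fs in range(2, min(coeffs) + 1):
        if all(c % fs == 0 for c in coeffs):
            cost = nonzero_digits(fs) + sum(nonzero_digits(c // fs) for c in coeffs)
            best = min(best, (cost, fs))
    return best   # (adder-cost proxy, chosen shared factor)
```

For the made-up coefficient set [7, 14, 28, 56], sharing the factor 7 cuts the adder proxy from 8 (no sharing) to 6, because the two-adder factor 7 is built once and reused.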
3.6 MODULES
3.6.1 1-D Common Sharing Distributed arithmetic-MST
Based on the proposed CSDA algorithm, the coefficients for the
MPEG-1/2/4, H.264, and VC-1 transforms are chosen to achieve high
sharing capability for arithmetic resources. To carry out the searching
flow, software code performs the iterative searching loop by setting a
constraint on the minimum number of nonzero elements. In this work,
the constraint on minimum nonzero elements is set to five. After
software searching, the coefficients are obtained in CSD expression,
where 1̄ indicates −1. Note that this choice of shared coefficient is
obtained under some constraints; thus, the chosen CSDA coefficient is
not a globally optimal solution, only a local or suboptimal one.
Moreover, the CSD codes are not an optimal expression with minimal
nonzero bits.
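The CSD expression itself can be computed with the standard non-adjacent-form recoding: for example, 7 = 1001̄ in CSD uses two nonzero digits instead of the three in binary 0111. A sketch:

```python
def to_csd(n):
    """Canonical signed-digit (non-adjacent form) digits of a positive
    integer n, least-significant digit first; each digit is in {-1, 0, 1}
    and no two adjacent digits are nonzero."""
    digits = []
    while n != 0:
        if n % 2:
            d = 2 - (n % 4)   # +1 if n mod 4 == 1, -1 if n mod 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits
```

Fewer nonzero digits means fewer adder inputs, which is exactly why the searching flow constrains the nonzero-element count.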
Fig 3.2. Architecture of the proposed 1-D CSDA-MST.
3.6.2 Even part common sharing distributed arithmetic circuit
The SBF module executes the butterfly for the eight-point transform
and bypasses the input data for the two four-point transforms. After the
SBF module, the CSDA_E and CSDA_O execute by feeding input data a
and b, respectively. The CSDA_E calculates the even part of the eight-
point transform, which is similar to the four-point transform for the
H.264 and VC-1 standards. Within the architecture of CSDA_E, two
pipeline stages exist (12-bit and 13-bit). The first stage executes as a
four-input butterfly matrix circuit, and the second stage of CSDA_E then
applies the proposed CSDA algorithm to share hardware resources
across the different standards.
Fig.3.3 Architecture of the even part CSDA circuit
3.6.3 Odd part common sharing distributed arithmetic circuit
Similar to the CSDA_E, the CSDA_O also has two pipeline stages.
Based on the proposed CSDA algorithm, the CSDA_O efficiently shares
the hardware resources among the odd part of the eight-point transform
and the four-point transform for the different standards. It contains
selection signals of multiplexers (MUXs) for the different standards.
Eight adder trees with error compensation (ECATs) follow the CSDA_E
and CSDA_O; they add the nonzero CSDA coefficients with
corresponding weights in a tree-like architecture. The ECAT circuits can
alleviate truncation error efficiently in a small-area design when
summing the nonzero data all together.
Fig.3.4. Architecture of the odd part CSDA circuit.
3.6.4 ECAT
Eight adder trees with error compensation (ECATs) follow the
CSDA_E and CSDA_O, adding the nonzero CSDA coefficients with
corresponding weights in a tree-like architecture. The ECAT circuits can
alleviate truncation error efficiently in a small-area design when
summing the nonzero data all together.
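The effect can be illustrated with a software model: truncating LSBs before an adder tree biases the sum downward, and adding back the expected truncation error (half an LSB per term) recovers most of the accuracy. This is only a stand-in for the actual ECAT circuit, whose compensation logic is not detailed here.

```python
def adder_tree_truncated(terms, drop):
    """Sum the terms after truncating `drop` LSBs from each (plain
    truncation, as a cheap adder tree would do)."""
    return sum(t >> drop for t in terms) << drop

def adder_tree_compensated(terms, drop):
    """Same truncated sum, plus a fixed compensation bias of half an LSB
    per term (the average error introduced by truncation)."""
    bias = len(terms) << (drop - 1)
    return (sum(t >> drop for t in terms) << drop) + bias
```

For the sample terms [1023, 513, 257, 129] with 4 bits dropped, the exact sum is 1922, plain truncation gives 1904, and the compensated sum lands closer at 1936.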
Fig.3.5. Architecture of ECAT
3.6.5 Permutation
The eight outputs from the ECATs are given directly to the
permutation stage. A permutation is the act of rearranging, or permuting,
all the members of a set into some sequence or order (unlike
combinations, which are selections of some members of the set where
order is disregarded). It is used to encode the output matrix.
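In software terms the stage is just an index rearrangement; the order used below is a hypothetical example, not the core's actual output wiring.

```python
def permute(outputs, order):
    """Rearrange the eight ECAT outputs into the transform's output
    order; `order` lists, for each output slot, which input to take."""
    return [outputs[i] for i in order]
```

For instance, permuting the outputs a..h with the (illustrative) order [0, 4, 2, 6, 1, 5, 3, 7] yields a, e, c, g, b, f, d, h.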
Fig.3.6 Permutation concept
3.7 2D CSDA CORE DESIGN
3.7.1 Mathematical Derivation of Eight-Point and Four-Point
Transforms
This section introduces the proposed 2-D CSDA-MST core
implementation (Fig 3.7 shows the 2-D CSDA core with TMEM).
Neglecting the scaling factor, the one-dimensional (1-D) eight-point
transform can be defined as

    Z = C8 · X                                         (3.6)

where C8 is the 8 × 8 coefficient matrix and X is the eight-point input
vector. Because the eight-point coefficient structures in the MPEG-1/2/4,
H.264, and VC-1 standards are the same, the eight-point transform for
these standards can use the same mathematical derivation. According to
the symmetry property, the 1-D eight-point transform can be divided into
even and odd four-point transforms, Ze and Zo:

    Ze = Ce · [x0 + x7, x1 + x6, x2 + x5, x3 + x4]ᵀ
    Zo = Co · [x0 − x7, x1 − x6, x2 − x5, x3 − x4]ᵀ    (3.7)

Fig.3.7 2D CSDA core with TMEM

The even part of the operation in (3.7) is the same as that of the
four-point H.264 and VC-1 transformations. Moreover, the even part Ze
can be further decomposed into even and odd parts, Zee and Zeo, in
(3.8) and (3.9), respectively.
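The even/odd split can be checked numerically with a software model of the scaling-free eight-point transform; the cosine coefficient matrix below stands in for the integer matrices of the actual standards.

```python
import math

def dct8_direct(x):
    """Direct 8-point transform: z[k] = sum_n cos((2n+1)k*pi/16) * x[n]."""
    return [sum(math.cos((2 * n + 1) * k * math.pi / 16) * x[n] for n in range(8))
            for k in range(8)]

def dct8_even_odd(x):
    """Same transform via the symmetry split: butterfly sums feed the even
    rows (symmetric coefficients), butterfly differences feed the odd rows
    (antisymmetric coefficients)."""
    a = [x[n] + x[7 - n] for n in range(4)]   # butterfly sums  -> even part
    b = [x[n] - x[7 - n] for n in range(4)]   # butterfly diffs -> odd part
    z = [0.0] * 8
    for k in range(0, 8, 2):                  # even-indexed rows
        z[k] = sum(math.cos((2 * n + 1) * k * math.pi / 16) * a[n] for n in range(4))
    for k in range(1, 8, 2):                  # odd-indexed rows
        z[k] = sum(math.cos((2 * n + 1) * k * math.pi / 16) * b[n] for n in range(4))
    return z
```

Both routes give identical outputs, confirming that two four-point stages plus one butterfly reproduce the full eight-point transform; this is exactly the structure the SBF module and the CSDA_E/CSDA_O circuits exploit.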
3.7.2 TMEM
The TMEM is implemented using a 64-word 12-bit dual-port buffer
and has a latency of 52 cycles. Based on the time scheduling strategy, the
1st-D and 2nd-D transforms can be computed simultaneously. The
transposition memory is an 8 × 8 buffer array with a data width of
16 bits, as shown in Fig 3.8.
Fig.3.8 TMEM
CHAPTER 4
SYSTEM IMPLEMENTATION
4.1 Xilinx ISE Overview
The Integrated Software Environment (ISE®) is the Xilinx®
design software suite that allows you to take your design from design
entry through Xilinx device programming. The ISE Project Navigator
manages and processes your design through the following steps in the
ISE design flow.
4.1.1 Design Entry
Design entry is the first step in the ISE design flow. During design
entry, you create your source files based on your design objectives. You
can create your top-level design file using a Hardware Description
Language (HDL), such as VHDL, Verilog, or ABEL, or using a
schematic. You can use multiple formats for the lower-level source files
in your design.
4.1.2 Synthesis
After design entry and optional simulation, you run synthesis.
During this step, VHDL, Verilog, or mixed-language designs become
netlist files that are accepted as input to the implementation step.
4.1.3 Implementation
After synthesis, you run design implementation, which converts the
logical design into a physical file format that can be downloaded to the
selected target device. From Project Navigator, you can run the
implementation process in one step, or you can run each of the
implementation processes separately. Implementation processes vary
depending on whether you are targeting a Field Programmable Gate
Array (FPGA) or a Complex Programmable Logic Device (CPLD).
4.1.4 Verification
You can verify the functionality of your design at several points in
the design flow. You can use simulator software to verify the
functionality and timing of your design or a portion of your design. The
simulator interprets VHDL or Verilog code into circuit functionality and
displays logical results of the described HDL to determine correct circuit
operation. Simulation allows you to create and verify complex functions
in a relatively small amount of time. You can also run in-circuit
verification after programming your device.
4.1.5 Device Configuration
After generating a programming file, you configure your device.
During configuration, you generate configuration files and download the
programming files from a host computer to a Xilinx device.
4.2 ModelSim Overview
ModelSim is a very powerful simulation environment, and as such
can be difficult to master. Thankfully with the advent of Xilinx Project
Navigator 6.2i, the Xilinx tools can take care of launching ModelSim to
simulate most projects. However, a rather large flaw in Xilinx Project
Navigator 6.2i is its inability to correctly handle test benches which
instantiate multiple modules. To correctly simulate a test bench which
instantiates multiple modules, you will need to create and use a
ModelSim project manually. The steps are fairly simple:
1. Create a directory for your project
2. Start ModelSim and create a new project
3. Add all your verilog to the project
4. Compile your verilog files
5. Start the simulation
6. Add signals to the wave window
7. Recompile changed verilog files
8. Restart/Run the simulation
ModelSim is a simulation and debugging tool for VHDL, Verilog,
and mixed-language designs.
4.2.1 Basic simulation flow
The following diagram shows the basic steps for simulating a
design in ModelSim.
4.2.2 Creating the working library
In ModelSim, all designs, be they VHDL, Verilog, or some
combination thereof, are compiled into a library. You typically start a
new simulation in ModelSim by creating a working library called "work".
"Work" is the library name used by the compiler as the default destination
for compiled design units.
4.2.3 Compiling your design
After creating the working library, you compile your design units
into it. The ModelSim library format is compatible across all supported
platforms. You can simulate your design on any platform without having
to recompile your design.
4.2.4 Running the simulation
With the design compiled, you invoke the simulator on a top-level
module (Verilog) or a configuration or entity/architecture pair (VHDL).
Assuming the design loads successfully, the simulation time is set to zero,
and you enter a run command to begin simulation.
4.2.5 Debugging your results
If you don’t get the results you expect, you can use ModelSim’s
robust debugging environment to track down the cause of the problem.
4.3 Project flow
A project is a collection mechanism for an HDL design under
specification or test. Even though you don’t have to use projects in
ModelSim, they may ease interaction with the tool and are useful for
organizing files and specifying simulation settings. The following
diagram shows the basic steps for simulating a design within a ModelSim
project. As you can see, the flow is similar to the basic simulation flow.
However, there are two important differences:
• You do not have to create a working library in the project flow; it is
done for you automatically.
• Projects are persistent. In other words, they will open every time you
invoke ModelSim unless you specifically close them.
4.4 Multiple library flow
ModelSim uses libraries in two ways:
1) As a local working library that contains the compiled version of your
design;
2) As a resource library.
The contents of your working library will change as you update
your design and recompile. A resource library is typically static and
serves as a parts source for your design. You can create your own
resource libraries, or they may be supplied by another design team or a
third party (e.g., a silicon vendor).
You specify which resource libraries will be used when the design
is compiled, and there are rules to specify in which order they are
searched. A common example of using both a working library and a
resource library is one where your gate-level design and test bench are
compiled into the working library, and the design references gate-level
models in a separate resource library.
The diagram below shows the basic steps for simulating with
multiple libraries.
You can also link to resource libraries from within a project. If you
are using a project, you would replace the first step above with these two
steps: create the project and add the test bench to the project.
4.5 Debugging tools
ModelSim offers numerous tools for debugging and analyzing your
design. Several of these tools are covered in subsequent lessons,
including:
• Setting breakpoints and stepping through the source code
• Viewing waveforms and measuring time
• Viewing and initializing memories
A project may also consist of,
• HDL source files or references to source files
• Other files such as READMEs or other project documentation
• Local libraries
• References to global libraries
4.6 VERILOG
Verilog, standardized as IEEE 1364, is a hardware description
language (HDL) used to model electronic systems. It is most commonly
used in the design and verification of digital circuits at the register-
transfer level of abstraction. It is also used in the verification of analog
circuits and mixed-signal circuits.
Verilog HDL is one of the two most common hardware description
languages (HDLs) used by integrated circuit (IC) designers; the other is
VHDL. HDLs allow a design to be simulated earlier in the design cycle
in order to correct errors or experiment with different architectures.
Designs described in HDL are technology-independent, easy to design
and debug, and usually more readable than schematics, particularly for
large circuits.
Verilog can be used to describe designs at four levels of
abstraction:
(i) Algorithmic level (much like c code with if, case and loop
statements).
(ii) Register transfer level (RTL uses registers connected by
Boolean equations).
(iii) Gate level (interconnected AND, NOR etc.).
(iv) Switch level (the switches are MOS transistors inside gates).
The language also defines constructs that can be used to control the
input and output of simulation. More recently, Verilog has been used as
an input for synthesis programs, which generate a gate-level description
(a netlist) for the circuit. Some Verilog constructs are not synthesizable.
Also, the way the code is written greatly affects the size and speed of
the synthesized circuit. Most readers will want to synthesize their
circuits, so non-synthesizable constructs should be used only for test
benches. These are program modules used to generate the I/O needed to
simulate the rest of the design. The words "not synthesizable" will be
used as needed for examples and constructs that do not synthesize.
There are two types of code in most HDLs:
Structural, which is essentially a textual wiring diagram without storage:
assign a = b & c | d; /* "|" is OR */
assign d = e & (~c);
Here the order of the statements does not matter. Changing e will change
a.
Procedural, which is used for circuits with storage, or as a
convenient way to write conditional logic:
always @(posedge clk) // Execute the next statement on every
rising clock edge.
count <= count+1;
Procedural code is written like C code and assumes every
assignment is stored in memory until overwritten. For synthesis with
flip-flop storage, this type of thinking generates too much storage.
However, people prefer procedural code because it is usually much
easier to write; for example, if and case statements are only allowed in
procedural code. As a result, synthesizers have been constructed that can
recognize certain styles of procedural code as actually combinational.
CHAPTER 5
RESULT ANALYSIS
5.1 COMPARISON WITH EXISTING SYSTEMS
Measured results        Huang   Lai     Lee     Chang   Lee     Existing  Proposed
                        et al.  et al.  et al.  et al.  et al.  CSDA      CSDA
Gate counts (NAND2)     39.8K   55.6K   36.6K   39.1K   36.8K   30K       27K
Supporting standards
  MPEG 1/2/4  8×8       О       О       О       О       О       О         О
  H.264       8×8       О       О       О       О       ×       О         О
              4×4(L)    О       О       О       О       О       О         О
              4×4(H)    О       О       О       О       О       О         О
  VC-1        8×8       ×       О       О       О       О       О         О
              8×4       ×       О       ×       О       ×       О         О
              4×8       ×       О       ×       О       ×       О         О
              4×4       ×       О       О       О       ×       О         О
Power consumption (mW)  38.7    3.4     N/A     N/A     N/A     46.3      26

Table 5.1 Measured Results
× represents non-supported standards
О represents supported standards
Comparing the proposed system with the existing system in Table 5.1, the use of a buffer instead of a pipeline register considerably reduces the delay, and the reduction in the number of adders increases the speed. The gate count is reduced to 27K, whereas in the existing system it is considerably higher at 30K. Because fewer adders are utilized, the power consumption measured for the proposed CSDA is reduced to 26 mW, while the existing system consumes 46.3 mW. The proposed system also supports multiple standards.
5.2 MUX SELECTION INPUTS

Table 5.2 Selection Inputs For Different Standards

  Video codec   Dimensions  MUX  MUX-1  MUX-2  MUX-3  MUX-4  MUX-5  MUX-6
  standard
  -----------------------------------------------------------------------
  MPEG          8           1    1      1      1      1      0      1
  H.264         8           1    0      0      1      0      0      0
                4(H)        0    0      0      0      0      1      1
                4(L)        0    0      0      0      0      1      1
  VC-1          8           1    0      1      0      0      1      1
                4           0    0      1      0      1      1      1
These are the selection inputs applied for the individual standards.
The desired standard can be obtained using the MUX selection.
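As an illustration only (the module and signal names are hypothetical, with the select patterns taken from Table 5.2), the 7-bit selection word for the eight-point transform could be generated as:

```verilog
// Hypothetical decoder producing the 7-bit select word
// {MUX, MUX-1, ..., MUX-6} of Table 5.2 for the 8-point transform.
module sel_decode (
    input  wire [1:0] standard,  // 00: MPEG, 01: H.264, 10: VC-1
    output reg  [6:0] sel8       // select word for the 8-point transform
);
    always @(*) begin
        case (standard)
            2'b00:   sel8 = 7'b1111101;  // MPEG, 8-point row of Table 5.2
            2'b01:   sel8 = 7'b1001000;  // H.264, 8-point row
            2'b10:   sel8 = 7'b1010011;  // VC-1, 8-point row
            default: sel8 = 7'b0000000;
        endcase
    end
endmodule
```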
5.3 MPEG SIMULATION RESULT
By giving the binary selection inputs (1111101) to the seven MUXes for the eight-point transform, the MPEG output simulation is obtained as shown in Fig.5.1.
Fig.5.1 simulation result for MPEG
5.4 H.264 SIMULATION RESULT
By giving the binary selection inputs (1001000) to the seven MUXes for the eight-point transform and (0000011) for the four-point transform, the H.264 output simulation is obtained as shown in Fig.5.2.
Fig.5.2 simulation result for H.264
5.5 VC-1 SIMULATION RESULT
By giving the binary selection inputs (1010011) to the seven MUXes for the eight-point transform and (0010111) for the four-point transform, per the VC-1 rows of Table 5.2, the VC-1 output simulation is obtained as shown in Fig.5.3.
Fig.5.3 simulation result for VC-1
5.6 RTL SCHEMATIC VIEW OF ENTIRE PROCESS
The synthesis results were generated using the Xilinx 13.2 software. The proposed MST core employs the distributed arithmetic and factor sharing schemes, combined as common sharing distributed arithmetic (CSDA), to reduce hardware cost and delay.
Fig.5.4 RTL view of whole 2D-CSDA architecture
Fig.5.5 RTL inner view of 2-D CSDA
Fig.5.6 RTL inner view of 1-D CSDA
5.7 SYNTHESIS REPORT FOR OUTPUT
Fig.5.7 Output for 2-D Common Sharing Distributed
arithmetic-MST delay
5.8 POWER ANALYZER OUTPUT
Fig.5.8 Output for 2-D Common Sharing Distributed
arithmetic-MST Power
5.9 DEVICE UTILIZATION SUMMARY
Fig.5.9 Output for 2-D Common Sharing Distributed
arithmetic-MST Gate count
CHAPTER 6
CONCLUSION
The CSDA-MST core achieves high performance, with a high
throughput rate and a low-cost VLSI design, supporting the MPEG-1/2/4,
H.264, and VC-1 MSTs. By using the proposed CSDA method, the
number of adders and MUXes in the MST core is reduced efficiently.
Synthesis and simulation results show that the CSDA-MST core requires
only 27K logic gates, consumes 26 mW, and achieves a throughput rate of
1.28 G-pixels/s, which can support the (4928 × 2048 @ 24 Hz) digital
cinema format. Because visual media technology has advanced rapidly,
this approach will help meet the rising high-resolution specifications
and future needs as well.
REFERENCES
1. Chang.H, Kim.S, Lee.S and Cho.K, ‘Design of
area-efficient unified transform circuit for multi-standard
video decoder,’ in Proc. IEEE Int. SoC Design Conf., Nov. 2009,
pp. 369–372.
2. Chen.Y.H, Chang.T.Y and Li.C.Y, ‘High
throughput DA-based DCT with high accuracy error-
compensated adder tree,’ IEEE Trans. Very Large Scale
Integr. (VLSI) Syst., vol. 19, no. 4, pp. 709–714, Apr. 2011.
3. Hoang.D.T and Vitter.J.S, ‘Efficient Algorithms for
MPEG Video Compression.’ New York, USA: Wiley, 2001.
4. Huang.C.Y, Chen.L.F and Lai.Y.K, ‘A high-
speed 2-D transform architecture with unique kernel for
multi-standard video applications,’ in Proc. IEEE Int. Symp.
Circuits Syst., May 2008, pp. 21–24.
5. Hwangbo.W and Kyung.C.M, ‘A multitransform
architecture for H.264/AVC high-profile coders,’ IEEE
Trans. Multimedia, vol. 12, no. 3, pp. 157–162, Apr. 2010.
6. Lai.Y.K and Lai.Y.F, ‘A reconfigurable IDCT
architecture for universal video decoders,’ IEEE Trans.
Consum. Electron., vol. 56, no. 3, pp. 1872–187, Aug. 2010.
7. Lee.S and Cho.K, ‘Architecture of transform
circuit for video decoder supporting multiple standards,’
Electron. Lett., vol. 44, no. 4, pp. 274–275, Feb. 2008.
8. Uramoto.S, Inoue.Y, Takabatake.A, Takeda.J, Yamashita.Y,
Terane.T and Yoshimoto.M, ‘A 100-MHz 2-D
discrete cosine transform core processor,’ IEEE J. Solid-State
Circuits, vol. 27, no. 4, pp. 492–499, Apr. 1992.