
ELYSIUM JOURNAL OF ENGINEERING RESEARCH AND MANAGEMENT

DECEMBER 2015 | VOLUME 02 | ISSUE 06


TABLE OF CONTENTS

1. An Efficient High Speed VCS Updation Based Mesh Topology NoC Router Architecture Design (p. 1)
   Yedla Harika, T. Vishnumurthy

2. An Efficient Activity Tracking and Recognition Using the Neural Network Classifier (p. 5)
   G. Karthic, B. Lalitha

3. Design of Low Power 32-Bit CSKA for High Speed Applications (p. 11)
   C. Yamini, M. Krishnamurthy

4. High Speed and Low Power CMOS Technology Based Ram-Cam Memory Design (p. 14)
   K. Vidhya Kamu, P. Karthikeyan

5. Design and Implementation of 32 Bit ALU Using Look Ahead Clock Gating Based On FPGA (p. 17)
   V. Prasanth, M. Sri Manikyamba

P-ISSN: 2347-4408

E-ISSN: 2347-4734

1| Page December 2015 Volume – 2, Issue - 6

AN EFFICIENT HIGH SPEED VCS UPDATION BASED MESH TOPOLOGY NOC ROUTER ARCHITECTURE DESIGN

Yedla Harika1, T. Vishnumurthy2

1Department of Electronics and Communication Engineering, Pragati Engineering College, affiliated to JNTU Kakinada, Surampalem, Andhra Pradesh-533437. Email id: [email protected]

2Assistant Professor, Department of Electronics and Communication, Pragati Engineering College, affiliated to JNTU Kakinada, Surampalem, Andhra Pradesh-533437. Email id: tamminenivishnu@gmail.com

Abstract- Network-on-Chip (NoC) architectures represent a promising design paradigm for coping with the rising communication requirements of digital systems. The NoC has emerged as a vital factor that determines the performance and power consumption of many-core systems. VLSI work in this area aims to modify the internal router organization of the NoC, the shortest-path allocation process, and neighbor-router estimation control. The existing system is a mesh-topology-based NoC architecture that implements packet switching and circuit switching for path allocation. It improves the path allocation time and the source-to-destination transmission time, but it raises circuit complexity and requires more time for circuit analysis. The proposed system is a mesh-topology-based router architecture whose path allocation uses a hybrid scheme consisting of virtual circuit switching (VCS), circuit switching (CS), and packet switching (PS). It implements single-router data transfer in master and slave router configurations and improves path selection. This technique reduces the data transmission time between source and destination, raises the system speed (clock frequency), and reduces delay and latency.

Index terms- NoC, latency, VCS technique.

1. INTRODUCTION

With the rapid growth of advanced nanometer IC technology, continuously shrinking transistor dimensions allow designers to integrate an increasing number of IP cores or processors into a single chip. Traditional bus-based communication is no longer suitable because of its poor scalability. Instead, the network-on-chip (NoC) has emerged as a scalable and promising solution for global communication within large multicore systems. The pipeline of a baseline PS router comprises the buffer write (BW), route computation (RC), virtual channel allocation (VA), switch allocation (SA), and switch traversal (ST) stages. This multi-stage router pipeline leads to high latency. Although lookahead routing and aggressive speculation shorten the critical path through the router stages, in a mesh-connected NoC the PS router still accounts for a large share of communication latency compared with the one-cycle link delay. The complex router pipeline also leads to high power consumption. Circuit switching, on the other hand, lacks flexibility: if numerous communications compete for a common physical channel, circuits must be set up in turn.
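A rough way to see why the router pipeline dominates latency is a zero-load model. The sketch below is a minimal illustration, assuming a 2-D mesh with minimal routing, a one-cycle link, three cycles per PS router (the three-stage baseline mentioned in the conclusion) and one cycle per router for a CS flit that goes straight to ST. The names SwitchingMode and zero_load_latency and all cycle counts are hypothetical, not taken from this paper's design.

```python
from dataclasses import dataclass
from typing import Tuple

Node = Tuple[int, int]  # (column, row) position in the mesh


@dataclass(frozen=True)
class SwitchingMode:
    name: str
    router_cycles: int  # cycles a head flit spends inside each router


# Illustrative cycle counts (assumptions, not results from this paper).
PS_THREE_STAGE = SwitchingMode("PS, three-stage router", 3)
CS_BYPASS = SwitchingMode("CS, bypass to ST", 1)


def hop_count(src: Node, dst: Node) -> int:
    """Minimal number of links between two nodes of a 2-D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def zero_load_latency(src: Node, dst: Node, mode: SwitchingMode, link_cycles: int = 1) -> int:
    """Head-flit latency without contention: hops + 1 routers and hops links."""
    hops = hop_count(src, dst)
    return (hops + 1) * mode.router_cycles + hops * link_cycles


if __name__ == "__main__":
    for mode in (PS_THREE_STAGE, CS_BYPASS):
        print(mode.name, zero_load_latency((0, 0), (3, 2), mode))
```

For the 5-hop path from (0, 0) to (3, 2) the model gives 23 cycles for PS and 11 cycles for the CS bypass, so under these assumptions most of the PS latency is spent inside routers rather than on links.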

In a NoC, this lengthy setup time reduces overall performance. Hybrid switching, which merges circuit and packet switching, has been proposed to address these difficulties. It not only serves communications with high flexibility but also creates CS connections between communication pairs to reduce latency. Under light congestion, most communications are served by circuit switching. Under heavy congestion, however, only a very small fraction of communications can obtain CS connections, which limits the power and latency benefits for NoCs.

In summary, the main contributions of this paper are as follows.

1) Virtual circuit switching is first introduced in this paper, and the modified router architecture and its corresponding switching mechanism are presented to support the proposed hybrid system.

2) Based on virtual circuits, this paper proposes a path allocation procedure to reduce power consumption and communication latency.

3) The efficiency of the proposed hybrid scheme is demonstrated by comparison with a baseline packet-switched NoC and the VIP design, using a set of synthetic and real traffic workloads.

2. ARCHITECTURE

The basic principle of the proposed hybrid system is that virtual channels (VCs) are exploited to form a number of VCS networks. These VCS networks share a common physical channel.

In this hybrid system, VCS connections cooperate with CS and PS networks to transmit packets, and physical channels are shared among the different communications. Here, (x, y) denotes the physical channel from node x to node y. In the conventional hybrid scheme, a CS connection is set up by recording at each router which input port should be connected to which output port; the connection is composed of physical channels and routers. On a PS connection, in contrast, flits must pass through the BW, RC, VA, and SA stages of every router they traverse. A physical channel can be shared by multiple PS connections and one CS connection. When flits of a CS connection arrive at a crossbar switch, the router is already configured, so the CS flits bypass directly to the ST stage. When there is no CS flit, the corresponding ports of the crossbar switch are released to PS connections.
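The port-sharing rule above (a pending CS flit takes the pre-configured crossbar port, otherwise the port goes to PS traffic) can be sketched as a per-output-port arbiter. This is a hypothetical software model rather than the paper's RTL: OutputPortArbiter and the round-robin policy among PS inputs are assumptions made for illustration.

```python
from typing import Optional, Sequence


class OutputPortArbiter:
    """Hypothetical model of one crossbar output port.

    A pending CS flit always wins: its route is pre-configured, so it goes
    straight to switch traversal (ST). Otherwise the port is released to the
    PS inputs, which are served round-robin (an assumed policy).
    """

    def __init__(self, num_ps_inputs: int) -> None:
        self.num_ps_inputs = num_ps_inputs
        self._next = 0  # round-robin pointer over PS inputs

    def grant(self, cs_flit_pending: bool, ps_requests: Sequence[bool]) -> Optional[str]:
        if cs_flit_pending:
            return "CS"
        for i in range(self.num_ps_inputs):
            idx = (self._next + i) % self.num_ps_inputs
            if ps_requests[idx]:
                self._next = (idx + 1) % self.num_ps_inputs  # next search starts after the winner
                return f"PS{idx}"
        return None  # port idle this cycle
```

Giving CS strict priority keeps the bypass path free of arbitration delay, while the rotating pointer keeps PS inputs from starving one another whenever the port is released.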



3. MODULES OF ARCHITECTURE

Figure 1. Block diagram

The master memory block retrieves the destination data for the router architecture and analyzes destination selection. The master process selects the output data for next-node selection. The single router connects to the router architecture through its ports and checks the master selection process. The single-router design contains three slave memory blocks within the overall router architecture. Each slave checks the results of the other slaves and analyzes the previous data held in the other slave memory blocks. The slave memory blocks are used to find the next-node data in the overall NoC architecture.

A common method to mitigate this problem is to share most of the network interface resources among a number of processor cores. The network interface architecture targeted here supports multiple outstanding write transactions but only one pending read transaction.
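The NI policy just described (many outstanding writes, at most one pending read) amounts to simple bookkeeping. The sketch below is hypothetical: NetworkInterfaceTracker, its method names and the write-table limit max_outstanding_writes are assumptions, since the text does not give a limit.

```python
from typing import Optional, Set


class NetworkInterfaceTracker:
    """Hypothetical NI bookkeeping: several outstanding writes, one pending read."""

    def __init__(self, max_outstanding_writes: int = 8) -> None:
        self.max_outstanding_writes = max_outstanding_writes  # assumed limit
        self.outstanding_writes: Set[int] = set()
        self.pending_read: Optional[int] = None

    def issue_write(self, txn_id: int) -> bool:
        """Accept a write unless the write table is full (caller must stall)."""
        if len(self.outstanding_writes) >= self.max_outstanding_writes:
            return False
        self.outstanding_writes.add(txn_id)
        return True

    def complete_write(self, txn_id: int) -> None:
        self.outstanding_writes.discard(txn_id)

    def issue_read(self, txn_id: int) -> bool:
        """Accept a read only if no other read is still pending."""
        if self.pending_read is not None:
            return False
        self.pending_read = txn_id
        return True

    def complete_read(self, txn_id: int) -> None:
        if self.pending_read == txn_id:
            self.pending_read = None
```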

In a circuit-switched network, packets are transferred over a reserved physical path, whereas in a packet-switched network packets are transferred without reserving the entire path. The NoC architecture is an m × n mesh of switches, with resources placed in the slots formed by the switches.
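The paper does not state its routing algorithm; as one common shortest-path choice for an m × n mesh, dimension-order (XY) routing is sketched below as an assumption. Nodes are (column, row) coordinates with m columns and n rows (an assumed convention), and the physical channels of a route are returned as (node, next node) pairs, matching the (x, y) channel notation of Section 2. The function names are hypothetical.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (column, row)


def in_mesh(node: Node, m: int, n: int) -> bool:
    """True if node lies inside an m-column by n-row mesh."""
    return 0 <= node[0] < m and 0 <= node[1] < n


def xy_route(src: Node, dst: Node, m: int, n: int) -> List[Node]:
    """Minimal XY route: correct the column first, then the row."""
    for node in (src, dst):
        if not in_mesh(node, m, n):
            raise ValueError(f"{node} is outside the {m} x {n} mesh")
    x, y = src
    path = [(x, y)]
    while x != dst[0]:  # X dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:  # then Y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def physical_channels(path: List[Node]) -> List[Tuple[Node, Node]]:
    """Channels (a, b) traversed from node a to node b along the path."""
    return list(zip(path, path[1:]))
```

XY routing always yields a shortest path and is deadlock-free on a mesh, which is why it is a frequent default; the paper's own shortest-path allocation may differ.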

Figure 2. Modules

4. EXPERIMENTAL RESULTS



RTL DIAGRAM:

SYNTHESIS REPORT:

5. CONCLUSION

In this paper, we present a novel hybrid system centered on virtual circuit switching to further reduce the power consumption and communication latency of NoCs. The basic principle of the proposed hybrid scheme is to intermingle virtual circuit switching with packet and circuit switching. Intermediate router pipelines are bypassed by establishing CS and VCS connections. A path allocation procedure is also presented to allocate CS and VCS connections for a given traffic pattern in mesh-connected NoCs so that both energy consumption and average packet latency are reduced. To evaluate the efficiency of the proposed hybrid scheme, a set of synthetic traffic workloads and real traffic loads is used. The experimental results show that, compared with the baseline PS NoC with three-stage routers and the hybrid NoC with VIP connections, the proposed hybrid scheme achieves considerable further reductions in power consumption and latency. Our future work will focus on extending the current work to support applications with unpredictable communication patterns. Other extensions include fault tolerance, quality-of-service (QoS) operation, multicast delivery, and the mapping and scheduling of applications based on virtual circuit switching. Owing to its small area overhead, the proposed hybrid system can have reliability and bit-error rate similar to those of the baseline NoC and the VIP design. In addition, fault-tolerance methods such as structural redundancy, packet retransmission, and error control codes can be used to raise the reliability of the proposed hybrid system. Moreover, the proposed hybrid system can be exploited to support QoS operation. For example, CS and VCS connections can be assigned to the class of communications with specific latency requirements, while packet switching serves best-effort traffic.
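The path allocation procedure itself is not listed in this paper. The following greedy sketch only illustrates the idea under stated assumptions and is not the authors' procedure: latency-critical flows first try a CS connection (at most one per physical channel), then a VCS connection (at most vcs_per_channel per channel), and everything else, including best-effort traffic, falls back to PS. Routes use XY routing; allocate_connections and its flow format are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Node = Tuple[int, int]
Flow = Tuple[str, Node, Node, bool]  # (name, src, dst, latency_critical)


def xy_path(src: Node, dst: Node) -> List[Node]:
    """Nodes on the dimension-order route (X first, then Y) from src to dst."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def allocate_connections(flows: Iterable[Flow], vcs_per_channel: int = 2) -> Dict[str, str]:
    """Greedy, hypothetical CS / VCS / PS assignment in arrival order."""
    cs_owner: Dict[Tuple[Node, Node], str] = {}  # channel -> flow holding its CS bypass
    vcs_in_use: Dict[Tuple[Node, Node], int] = defaultdict(int)  # channel -> VCS count
    mode: Dict[str, str] = {}
    for name, src, dst, latency_critical in flows:
        path = xy_path(src, dst)
        channels = list(zip(path, path[1:]))
        if not latency_critical or not channels:
            mode[name] = "PS"  # best-effort (or local) traffic
        elif all(ch not in cs_owner for ch in channels):
            for ch in channels:
                cs_owner[ch] = name
            mode[name] = "CS"
        elif all(vcs_in_use[ch] < vcs_per_channel for ch in channels):
            for ch in channels:
                vcs_in_use[ch] += 1
            mode[name] = "VCS"
        else:
            mode[name] = "PS"  # no free CS or VCS resources on the route
    return mode
```

With vcs_per_channel=1, two latency-critical flows on the same route receive CS and VCS respectively, a third critical flow that overlaps them falls back to PS, and best-effort traffic always uses PS, mirroring the QoS example in the conclusion.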






AN EFFICIENT ACTIVITY TRACKING AND RECOGNITION USING THE NEURAL NETWORK CLASSIFIER

G. Karthic1, B. Lalitha2

1PG Student, Department of Computer Science and Engineering, Sree Sowdambika College of Engineering, Virudhunagar, Tamil Nadu, India. Email: [email protected]

2Assistant Professor, Department of Computer Science and Engineering, Sree Sowdambika College of Engineering, Virudhunagar, Tamil Nadu, India.

Abstract: A continuous video contains two important components: the tracks of the persons in the video and the localization of the actions performed by the actors. Activity analysis is used to solve both the tracking and the recognition problems. In this paper, we develop an efficient activity analysis framework for determining human activity in video. First, the input video, obtained from the UCLA and VIRAT datasets, is converted into individual frames. Information about each frame is collected, and the frames are resized to prevent memory overflow. The noise in each frame is filtered using a Gaussian filter. The Hierarchical Markov Random Field-Sparce (HMRF-Sparce) technique is used to extract the shape of the object from the background, and the object is tracked through the video using the bounding box technique. Features are extracted from the resulting images using the Local Binary Pattern (LBP), and a pattern is generated from the features obtained across the frames. These feature values are grouped into activity segments using a Neural Network (NN) classifier. The proposed NN classifier is compared with the existing Support Vector Machine (SVM) classifier in terms of accuracy, precision, and recall. The experimental results show that the proposed NN classifier outperforms the existing SVM classifier.

Index Terms— Local Binary Pattern (LBP), Neural Network (NN),

Support Vector Machine (SVM), Hierarchical Markov Random Field-

Sparce (HMRF-Sparce) technique, Bounding box technique, Gaussian

filter.

1. INTRODUCTION

Object tracking is used in many applications such as robot control and video retrieval. Video tracking is the process of detecting an object in the image plane as it moves through the scene. It is used in applications such as automated surveillance, video indexing, human-computer interaction, meteorology, and traffic management. The key issues in video tracking are motion estimation and matching estimation. Motion estimation predicts the region of the next video frame in which the object is likely to be located. Because this motion information is very difficult to determine, an effective mechanism for determining a fixed-size search region is essential. In matching estimation, the tracked object is identified within the predicted region of the next frame: the motion estimation stage predicts the region, and the matching estimation stage estimates the location of the object of interest within it. Matching estimation algorithms incorporate a feature detection stage for operations such as image classification and segmentation. Object tracking algorithms use feature detection to match the pixels of the tracked object between two consecutive video frames and then estimate the exact location of the object in the next frame. Existing activity recognition techniques do not consider the tracks, locations, and labels when determining the movement of a human in the scene. To overcome this issue, we propose an efficient activity analysis framework. First, the input video is converted into multiple video frames, and the noise in all the frames is filtered using a Gaussian filter, which smooths noise with limited edge blurring and is computationally efficient. The filtered frames are provided as input to the Hierarchical Markov Random Field-Sparce (HMRF-Sparce) technique, which separates the object from the background by comparing pixel intensities. The object of interest is tracked using the bounding box technique, and its features are then extracted using the Local Binary Pattern (LBP). The main advantages of LBP are increased accuracy and stability. The extracted features are classified into activity segments using the NN classifier. To validate its performance, the proposed NN classifier is compared with the existing SVM classifier, and the comparison shows that the NN classifier provides higher accuracy. The precision and recall of the proposed activity detection framework are also evaluated, and the results show that the framework provides high precision and recall for the different input video files.

The remainder of the paper is organized as follows. Section 2 reviews existing human action recognition techniques. Section 3 describes the proposed human activity analysis framework, Section 4 presents the performance results of the proposed method, and Section 5 concludes the paper.




2. RELATED WORK

This section describes existing human action recognition techniques. Brendel et al. [1] proposed a volumetric approach to activity recognition and video parsing that extracts human activities based on sub-activities and their hierarchical temporal and spatial relations; it outperformed traditional approaches. Wang et al. [2] proposed a novel actionlet ensemble model for characterizing human actions. The model is robust to noise and successfully characterizes both human motion and human-object interactions. Three datasets were used for evaluation: a dataset captured with a Kinect device, a multiview action recognition dataset also captured with a Kinect device, and a dataset captured with a motion capture system. The experimental results showed that the method outperformed state-of-the-art algorithms. Chaaraoui et al. [3] proposed an evolutionary algorithm for determining the optimal subset of skeleton joints. Because the algorithm is based on the topological structure of the skeleton, the final success rate was high; compared with the traditional RGB action recognition approach, it provided an improved initial recognition rate and a better final success rate on the MSR-Action3D dataset. Ofli et al. [4] proposed the Sequence of the Most Informative Joints (SMIJ) representation for human actions, in which the skeletal joints are selected automatically and an action is represented as a sequence of the most informative joints. The SMIJ representation performed better than state-of-the-art algorithms. Xia et al. [5] proposed Histograms of 3D Joint locations (HOJ3D) for representing human postures. The HOJ3D features computed from action depth sequences were reprojected using Linear Discriminant Analysis (LDA) and then clustered into visual words, and discrete Hidden Markov Models (HMMs) were used to model the temporal evolution of the visual words. The representation performed well on a 3D action dataset.

Ji et al. [6] proposed a novel 3D Convolutional Neural Network (CNN) model for action recognition. The model extracts features from both the spatial and the temporal dimensions, producing multiple channels of information from the input frames, and combines the information from all channels into the final feature representation. Applied to a real-world environment, it achieved superior performance. Guha and Ward [7] analyzed the effectiveness of sparse representations for action recognition in videos. Human actions were modeled using three overcomplete dictionary learning frameworks, with the overcomplete dictionary constructed from spatio-temporal descriptors, and the approach produced state-of-the-art results on public datasets. Chen and Grauman [8] proposed an efficient approach that unifies activity categorization with space-time localization. The result was a fast method that evaluates a broader space of candidates, and it achieved higher speed and accuracy than existing search strategies. Morariu and Davis [9] proposed a framework for the automatic recognition of complex multi-agent events from video, integrating interval-based temporal reasoning with probabilistic logical inference to avoid combinatorial explosion. Hoai et al. [10] proposed joint segmentation and recognition of actions to overcome the limitations of traditional methods. Their model is based on a discriminative temporal extension of the spatial bag-of-words model, and classification was performed in a multi-class SVM framework. It outperformed traditional methods on the honeybee, Weizmann, and Hollywood datasets.

Le [11] addressed the problem of building high-level, class-specific feature detectors from unlabeled data. The resulting feature detector was robust to translation, scaling, and out-of-plane rotation, and a network trained to recognize 22,000 object categories achieved a 70% relative improvement over previous approaches. Oh et al. [12] proposed a large-scale video dataset for evaluating diverse visual event recognition algorithms. The dataset contains many outdoor scenes with actions performed by non-actors, and several evaluation modes were proposed for visual recognition tasks. Zhang et al. [13] proposed an approach that efficiently identifies local and long-range motion interactions, for example the combination of one person's hand movement with another person's foot response. The experimental results showed that the approach recognized a wider variety of activities than state-of-the-art methods. Yao et al. [14] exploited attributes and parts to recognize human actions in still images, describing action attributes as verbs. Compared with traditional classification methods, their method extracted meaningful higher-order interactions. Lara et al. [15] proposed the Centinela system for highly accurate activity recognition. The system identifies activities such as walking, running, sitting, ascending, and descending, and includes a portable, unobtrusive real-time data collection platform. Centinela achieved 100% accuracy for running and sitting and improved the classification accuracy for ascending.

3. PROPOSED METHOD

The overall flow of the proposed human activity analysis framework is depicted in Fig. 1. The key components of the framework are:

Frame conversion

Filtering

Segmentation

Video tracking

Feature extraction

Activity analysis



Fig.1. Overall flow of the proposed human activity analysis framework

3.1 FRAME CONVERSION

The input datasets are obtained from [16] and [17]. The input video is converted into individual frames using the frame conversion process, as illustrated in Fig. 2: Fig. 2(a) depicts the input video file, and Fig. 2(b) shows the converted frames.

Fig. 2 (a) Input video file (b) Frames

The details of each frame, such as the number of frames and the height and width of the frames, are collected after the frame conversion process. To prevent memory overflow, large frames are then resized to smaller ones. The resizing of a single frame is illustrated in Fig. 3; the same process is repeated for all frames.

Fig. 3 Resized Frames
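The frame bookkeeping and resizing step can be sketched in plain Python on frames stored as lists of grey-level rows. The paper does not name its resizing method, so nearest-neighbour sampling is an assumption here, as are the helper names frame_info and resize_nearest; in practice a library such as OpenCV (cv2.VideoCapture to read frames, cv2.resize to scale them) would normally do this work.

```python
from typing import List, Sequence, Tuple

Frame = List[List[int]]  # grey-level frame stored as a list of rows


def frame_info(frames: Sequence[Frame]) -> Tuple[int, int, int]:
    """Number of frames and the height and width of the first frame."""
    if not frames:
        return 0, 0, 0
    return len(frames), len(frames[0]), len(frames[0][0])


def resize_nearest(frame: Frame, out_h: int, out_w: int) -> Frame:
    """Nearest-neighbour resize: each output pixel copies one source pixel."""
    in_h, in_w = len(frame), len(frame[0])
    return [
        [frame[(r * in_h) // out_h][(c * in_w) // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


if __name__ == "__main__":
    frame = [[r * 4 + c for c in range(4)] for r in range(4)]
    print(frame_info([frame] * 3))
    print(resize_nearest(frame, 2, 2))
```

Halving a 4x4 frame this way keeps every second pixel in each direction, which cuts memory by a factor of four at the cost of some aliasing.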

3.2 FILTERING

The resized frames are then filtered using a Gaussian filter, whose weights are chosen from the Gaussian function. The Gaussian filter smooths the image and suppresses Gaussian noise. Mathematically, the Gaussian filter is represented as follows:

$$G(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{x^{2}}{2\sigma^{2}}\right] \qquad (1)$$

where x is the distance from the center of the filter and σ² denotes the variance of the Gaussian filter. Fig. 4 shows the resulting frames after the filtering process.

Fig. 4 Filtered frames
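Equation (1) can be turned into a discrete filter by sampling it at integer offsets and normalising the weights; because the 2-D Gaussian is separable, a frame can then be blurred by filtering its rows and columns in turn. This is a minimal sketch, assuming grey-level frames stored as lists of rows, a kernel truncated at 3σ and edge replication at the borders; none of these choices are specified in the paper, and the function names are hypothetical.

```python
import math
from typing import List, Optional

Image = List[List[float]]


def gaussian_kernel_1d(sigma: float, radius: Optional[int] = None) -> List[float]:
    """Sample Eq. (1) at offsets -radius..radius and normalise to sum 1."""
    if radius is None:
        radius = max(1, math.ceil(3 * sigma))  # assumed truncation at 3 sigma
    weights = [
        math.exp(-(x * x) / (2 * sigma * sigma)) / (math.sqrt(2 * math.pi) * sigma)
        for x in range(-radius, radius + 1)
    ]
    total = sum(weights)  # normalising cancels the 1/(sqrt(2*pi)*sigma) factor
    return [w / total for w in weights]


def convolve_rows(img: Image, kernel: List[float]) -> Image:
    """Filter every row with the 1-D kernel, replicating edge pixels."""
    r = len(kernel) // 2
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            acc = 0.0
            for k, wk in enumerate(kernel):
                xx = min(max(x + k - r, 0), w - 1)  # clamp index at the borders
                acc += wk * img[y][xx]
            row.append(acc)
        out.append(row)
    return out


def transpose(img: Image) -> Image:
    return [list(col) for col in zip(*img)]


def gaussian_blur(img: Image, sigma: float) -> Image:
    """Separable Gaussian blur: rows first, then columns."""
    kernel = gaussian_kernel_1d(sigma)
    tmp = convolve_rows(img, kernel)
    return transpose(convolve_rows(transpose(tmp), kernel))
```

A larger σ widens the kernel and removes more noise but also smooths edges more, so σ trades noise suppression against detail.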

3.3 SEGMENTATION

Segmentation is the process of extracting the shape of an object from the background. In this paper, the Hierarchical Markov Random Field-Sparce (HMRF-Sparce) technique is used to extract the human image from the background image, using the intensity of the pixels. Fig. 5 shows the extraction of the shape of the human body from the background.

Fig. 5 Extraction of human body shape from the background
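HMRF-Sparce itself is not described here in enough detail to reproduce. As a much simpler stand-in for the intensity-based separation (and for the background subtraction mentioned in the conclusion), the sketch below compares each pixel with a running-average background model. It is not the HMRF-Sparce technique; the function names and the values of threshold and alpha are assumptions.

```python
from typing import List

Image = List[List[float]]
Mask = List[List[int]]


def foreground_mask(frame: Image, background: Image, threshold: float = 30) -> Mask:
    """1 where the pixel differs from the background by more than threshold."""
    return [
        [1 if abs(p - b) > threshold else 0 for p, b in zip(f_row, b_row)]
        for f_row, b_row in zip(frame, background)
    ]


def update_background(background: Image, frame: Image, alpha: float = 0.05) -> Image:
    """Running-average background model: B <- (1 - alpha) * B + alpha * F."""
    return [
        [(1 - alpha) * b + alpha * f for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]
```

A small alpha lets the background absorb slow lighting changes while a moving person still stands out against it.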

3.4 VIDEO TRACKING

The features present in multiple frames are collected, and a pattern is generated from the feature values. These values are classified to detect the human activity. Video tracking begins with the generation of a set of match hypotheses for frame association and a set of tracks. Based on the features computed for each frame, an observation potential is computed for that frame. Classifiers such as the Support Vector Machine (SVM) and the Neural Network (NN) are used to group the frames into activity segments.
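The bounding box technique named in the abstract can be sketched on top of a binary foreground mask: take the smallest box that contains every foreground pixel, then follow the box centre from frame to frame. This is a minimal, hypothetical illustration (a single object per frame, no match hypotheses or observation potentials), not the paper's tracker; the names bounding_box, centre and track are assumptions.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive
Mask = Sequence[Sequence[int]]


def bounding_box(mask: Mask) -> Optional[Box]:
    """Smallest box enclosing all foreground (non-zero) pixels, or None."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    return (min(cols), min(rows), max(cols), max(rows))


def centre(box: Box) -> Tuple[float, float]:
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def track(masks: Sequence[Mask]) -> List[Tuple[Optional[Box], Optional[Tuple[float, float]]]]:
    """Per-frame box plus centre displacement from the previous frame."""
    result = []
    prev = None
    for mask in masks:
        box = bounding_box(mask)
        if box is None:  # object lost in this frame
            result.append((None, None))
            prev = None
            continue
        c = centre(box)
        step = None if prev is None else (c[0] - prev[0], c[1] - prev[1])
        result.append((box, step))
        prev = c
    return result
```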

3.5 FEATURE EXTRACTION

The node features and edge features of the potential functions are computed for the training data using the Local Binary Pattern (LBP). The LBP operator works on a 3x3 pixel block of an image. Each neighboring pixel in the block is thresholded against the center pixel value, and the resulting bits are multiplied by powers of two and summed to obtain the label of the center pixel. Because the neighborhood contains 8 pixels, a total of 2^8 = 256 different labels can be obtained, depending on the relative gray values of the center pixel and its neighbors. Fig. 6 illustrates the feature extraction process using the LBP.



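Example: 3x3 neighborhood with center value 5

Neighborhood      Thresholded
8 1 1             1 0 0
2 5 8             0 . 1
3 4 9             0 0 1

Binary (read clockwise from the top-left neighbor): 10011000, Decimal: 152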

Fig. 6 Feature extraction using the Local Binary Pattern
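The LBP operator described above can be written in a few lines. This sketch assumes one particular bit ordering (neighbours read clockwise from the top-left, first neighbour as the most significant bit) and the usual rule that a neighbour greater than or equal to the centre gives 1; other orderings give different but equally valid labels. With this ordering the 3x3 example of Fig. 6 (centre 5) has the thresholded pattern 10011000, that is, label 152. A per-frame feature vector can then be the 256-bin histogram of labels; the function names are hypothetical.

```python
from typing import List, Sequence

# Clockwise neighbour order starting at the top-left pixel (assumed convention);
# the first neighbour becomes the most significant bit.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img: Sequence[Sequence[int]], y: int, x: int) -> int:
    """8-bit LBP label of pixel (y, x); neighbours >= centre map to 1."""
    centre = img[y][x]
    code = 0
    for dy, dx in NEIGHBOUR_OFFSETS:
        code = (code << 1) | (1 if img[y + dy][x + dx] >= centre else 0)
    return code


def lbp_histogram(img: Sequence[Sequence[int]]) -> List[int]:
    """256-bin histogram of LBP labels over all interior pixels (borders skipped)."""
    hist = [0] * 256
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            hist[lbp_code(img, y, x)] += 1
    return hist


if __name__ == "__main__":
    block = [[8, 1, 1], [2, 5, 8], [3, 4, 9]]
    print(format(lbp_code(block, 1, 1), "08b"), lbp_code(block, 1, 1))
```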

3.6 ACTIVITY ANALYSIS

Based on the extracted features, the activity performed by the human is analyzed. Here, the activity performed by the person in Fig. 6 is detected as walking.

4. PERFORMANCE ANALYSIS

The performance of the proposed Neural Network (NN) classifier is validated against the existing SVM classifier in terms of the following metrics:

Accuracy

Precision

Recall

4.1 ACCURACY

Accuracy describes how close the measurement results are to the true values. The accuracy of the proposed NN classifier is computed using the following equation:

$$\text{Accuracy} = \frac{TP + TN}{n} \qquad (2)$$

where
TP is the number of true positives,
TN is the number of true negatives, and
n is the total population.

Fig. 7 Comparison of accuracy for the existing and the proposed method

Fig. 7 compares the accuracy of the proposed NN classifier and the SVM classifier. The figure shows that the proposed NN classifier provides higher accuracy than the existing SVM classifier.

4.2 PRECISION

Precision is defined as the ratio of the number of true positives to the sum of true positives and false positives (FP). It is computed using the following equation:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (3)$$

Fig. 8 compares the precision for the various video files; each iteration in the graph represents an individual video file. The graph shows that the precision value is high for all iterations.

Fig. 8 Comparison of precision for multiple iterations

4.3 RECALL



Recall is defined as the ratio of the number of true positives to the sum of true positives and false negatives (FN). It is computed using the following equation:

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (4)$$

Fig. 9 compares the recall for the various video files. The graph shows that the recall value is high for all iterations.

Fig. 9 Comparison of recall for multiple iterations
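The three metrics can be computed from predicted and true activity labels by treating one activity as the positive class. The sketch below uses the standard definitions Accuracy = (TP + TN)/n, Precision = TP/(TP + FP) and Recall = TP/(TP + FN); the function names and the one-vs-rest treatment of multi-class labels are assumptions, since the paper does not say how its multi-class counts were formed.

```python
from typing import Hashable, Sequence, Tuple


def confusion_counts(y_true: Sequence[Hashable], y_pred: Sequence[Hashable],
                     positive: Hashable) -> Tuple[int, int, int, int]:
    """One-vs-rest counts (TP, TN, FP, FN) for the given positive class."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1  # predicted positive, actually another activity
        elif t == positive:
            fn += 1  # missed a positive sample
        else:
            tn += 1
    return tp, tn, fp, fn


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    n = tp + tn + fp + fn
    return (tp + tn) / n if n else 0.0


def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0
```

For example, with true labels walk, walk, run, run, walk and predictions walk, run, run, walk, walk, the "walk" class gives TP = 2, TN = 1, FP = 1 and FN = 1, so accuracy is 0.6 and precision and recall are both 2/3.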

4.4 COMPARISON OF PRECISION VALUES

The performance of the proposed Hierarchical Markov Random Field-Sparce (HMRF-Sparce) technique is compared with existing segmentation techniques: Bag of Words (BOW), Hierarchical Markov Random Field-Dense (HMRF-Dense), the morphological technique, and the technique of Zhu. Fig. 10 shows that the precision of the proposed HMRF-Sparce technique is higher than that of the existing methods.

Fig. 10 Comparison of precision values for the existing and the proposed segmentation techniques

5. CONCLUSION

In this paper, an efficient NN-based classifier is used for tracking human activity. During frame conversion, the input video file is converted into multiple frames. The noise in the resized frames is removed using a Gaussian filter. The background subtraction technique is used to extract the shape of the human from the background, and the bounding box technique is used to track the object in the background-subtracted frames. Features are obtained from the frames using the LBP, and with these features the activity performed by the human is analyzed using the NN classifier. The performance of the NN classifier is validated against the SVM classifier. The experimental results show that the NN classifier outperforms the existing SVM classifier in terms of accuracy, precision, and recall. Furthermore, compared with existing segmentation techniques such as BOW, HMRF-Dense, morphological, and Zhu techniques, the proposed HMRF-Sparce technique provides a higher precision value.

REFERENCES

[1] W. Brendel and S. Todorovic, "Learning spatiotemporal

graphs of human activities," in IEEE International

Conference on Computer Vision (ICCV), 2011, pp. 778-

785.

[2] J. Wang, Z. Liu, Y. Wu, and J. Yuan, "Learning actionlet

ensemble for 3D human action recognition," IEEE

Transactions on Pattern Analysis and Machine

Intelligence, vol. 36, pp. 914-927, 2014.

[3] A. A. Chaaraoui, J. R. Padilla-López, P. Climent-Pérez,

and F. Flórez-Revuelta, "Evolutionary joint selection to

improve human action recognition with RGB-D

devices," Expert Systems with Applications, vol. 41, pp.

786-794, 2014.

[4] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R.

Bajcsy, "Sequence of the most informative joints

(SMIJ): A new representation for human skeletal action

recognition," Journal of Visual Communication and

Image Representation, vol. 25, pp. 24-38, 2014.

[5] L. Xia, C.-C. Chen, and J. Aggarwal, "View invariant

human action recognition using histograms of 3D joints,"

in IEEE Computer Society Conference on Computer

Vision and Pattern Recognition Workshops (CVPRW),

2012, pp. 20-27.

[6] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional

neural networks for human action recognition," IEEE

Transactions on Pattern Analysis and Machine

Intelligence, vol. 35, pp. 221-231, 2013.

[7] T. Guha and R. K. Ward, "Learning sparse representations for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1-14, 2011.

[8] C.-Y. Chen and K. Grauman, "Efficient activity

detection with max-subgraph search," in IEEE



Conference on Computer Vision and Pattern

Recognition (CVPR), 2012, pp. 1274-1281.

[9] V. Morariu and L. S. Davis, "Multi-agent event recognition in structured scenarios," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 3289-3296.

[10] M. Hoai, Z.-Z. Lan, and F. De la Torre, "Joint segmentation and classification of human actions in video," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 3265-3272.

[11] Q. V. Le, "Building high-level features using large scale unsupervised learning," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 8595-8598.

[12] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C.-C. Chen, J. T. Lee, et al., "A large-scale benchmark dataset for event recognition in surveillance video," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 3153-3160.

[13] Y. Zhang, X. Liu, M.-C. Chang, W. Ge, and T. Chen, "Spatio-temporal phrases for activity recognition," in Computer Vision–ECCV 2012. Springer, 2012, pp. 707-721.

[14] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, and L. Fei-Fei, "Human action recognition by learning bases of action attributes and parts," in IEEE International Conference on Computer Vision (ICCV), 2011, pp. 1331-1338.

[15] Ó. D. Lara, A. J. Pérez, M. A. Labrador, and J. D. Posada, "Centinela: A human activity recognition system based on acceleration and vital sign data," Pervasive and Mobile Computing, vol. 8, pp. 717-729, 2012.

[16] UCLA Department of Statistics. Available: http://statistics.ucla.edu/

[17] "VIRAT Video Dataset."



DESIGN OF LOW POWER 32-BIT CSKA FOR HIGH SPEED APPLICATIONS

C. Yamini1, M. Krishnamurthy2

1PG Scholar, PSNA College of Engineering and Technology, Dindigul.

2Assistant Professor, PSNA College of Engineering and Technology, Dindigul.

Abstract— Research on Very Large Scale Integration (VLSI) design of Integrated Circuits (ICs) addresses the power, area and time consumed by the components used, which limit the speed of operation of DSP processors. Improving speed requires an optimized design that uses fewer components. The Multiplier and Accumulator (MAC) unit of a DSP processor performs many of its operations with adders, so reducing the power, area and delay of the full adder is essential in low power applications. Modern DSP processors use a carry chain to optimize the carry forwarding path, which reduces the delay effectively. The Carry Select Adder (CSLA) is a prominent solution for improving the speed of parallel operation. However, it produces multiple candidate carries, so multiplexers are needed to select the required sum output and the associated carry. This paper optimizes the carry forwarding path by replacing the multiplexers with gates constructed from Boolean functions. Moreover, the carry-skip mechanism reduces the number of components required to design a 32-bit ripple carry adder. The Microwind-DSCH tool is used to create the layout of the 32-bit adder, and the DSCH tool visualizes the carry forwarding path and the time required to perform the operation. The optimized adder structure enhances the operational speed with minimum area occupation and power consumption.

Index Terms— AND-OR Inverter (AOI), Carry Select Adder (CSLA),

Carry Skip Adder (CSKA), Critical Path delay, OR-AND Inverter

(OAI), Ripple Carry Adder (RCA), Power consumption.

1. INTRODUCTION

The ultimate objective of algorithm development is the effective use of the available hardware. The efficiency of an algorithm depends on measures such as power, area and time consumption. The hardware performs a primitive set of Boolean and ALU operations according to the algorithm design. Deciding which functions are performed in hard logic and which in soft logic is an important step in Field Programmable Gate Array (FPGA) design. Hardened arithmetic structures, also called carry logic, have long been favored over soft-logic structures. The interaction of the adders with the LUTs and flip-flops used, and the optimal trade-off between area, power and speed, are the important issues in the design of multi-bit carry adders. Collective interaction between the computational units achieves a low error rate and high precision. However, hardware overhead is an important problem in the design of hard (carry) adders, and self-checking Ripple Carry Adders (RCA) reduce this overhead considerably.

A longer critical path reduces the speed of operation, and the gain of a transistor depends on its size. Transistor-based adders reduce the time consumption by shortening the critical path and reducing size. The RCA has a simple design, but its Carry Propagation Delay (CPD) is the major problem. Two strategies have been introduced to overcome this delay: the Carry Look-ahead Adder (CLA) and the Carry Select Adder (CSLA). In the CSLA, the final sum and carry are obtained by selecting one value from each precomputed pair, which reduces the CPD effectively. The longest propagation path still consumes the most delay. Hence, various strategies such as the SQRT-CSLA, the CCSLA, the Binary to Excess-1 Converter (BEC) based CSLA and the CSLA with Common Boolean Logic (CSLA-CBL) have been proposed to solve the delay problem and make the design more attractive.
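A bit-level Python sketch makes the RCA/CSLA contrast concrete. It is a behavioral illustration, not a gate-level netlist. The function names, the 4-bit block size and the integer encoding of operands are assumptions made for the example. In the RCA, one carry passes through every full adder. In the CSLA, each block is added twice (carry-in 0 and carry-in 1), and the incoming carry only selects one precomputed pair.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One-bit full adder; returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(a: int, b: int, width: int, cin: int = 0) -> tuple[int, int]:
    """Ripple carry adder: the carry passes through every full adder in turn,
    so the carry propagation delay grows linearly with `width`."""
    total, carry = 0, cin
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


def carry_select_add(a: int, b: int, width: int, block: int = 4) -> tuple[int, int]:
    """Carry select adder: each block is summed for both possible carry-ins
    ahead of time; the real carry only drives the selection (the multiplexer)."""
    total, carry = 0, 0
    for lo in range(0, width, block):
        w = min(block, width - lo)
        mask = (1 << w) - 1
        a_blk, b_blk = (a >> lo) & mask, (b >> lo) & mask
        pair0 = ripple_carry_add(a_blk, b_blk, w, cin=0)  # precomputed for carry-in 0
        pair1 = ripple_carry_add(a_blk, b_blk, w, cin=1)  # precomputed for carry-in 1
        s, carry = pair1 if carry else pair0               # multiplexer selection
        total |= s << lo
    return total, carry


# Both adders agree with ordinary integer addition on 16-bit operands.
x, y = 0xBEEF, 0x1234
assert ripple_carry_add(x, y, 16) == carry_select_add(x, y, 16) == ((x + y) & 0xFFFF, (x + y) >> 16)
```

In hardware, both candidate sums of every block are formed at the same time. Only the selection signal travels from block to block, which is where the CPD saving comes from.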

Analytical evaluation of the various CSLA strategies extends the application of the CSLA to binary and decimal adders and subtractors. Reversible, or information lossless, systems are an important requirement in low power CMOS applications. The carry look-ahead adder overcomes the limitations of RCA strategies by computing the carry values directly from the adder inputs. The reversible implementation of the CLA optimizes the number of gates, the quantum cost and the delay. A D-latch based CSLA further reduces the power. Scheduling the carry selection before the calculation of the final sum differs from existing carry selection approaches. The CSLA has also been extended to the Carry Skip Adder (CSKA) to analyze the power and performance of MAC designs.
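The direct computation of carry values from the adder inputs can be illustrated with a short Python sketch of carry look-ahead logic. Each carry is written as a sum of products of the generate (g = a AND b) and propagate (p = a XOR b) signals and the input carry. As a result, no carry depends on another computed carry. The function names and the bit-level integer encoding are assumptions for illustration. The sketch models only the logic equations, not a reversible or gate-level implementation.

```python
def cla_carries(a: int, b: int, width: int, cin: int = 0) -> list[int]:
    """Return [c0, c1, ..., c_width] using the expanded look-ahead equations
    c[i+1] = g[i] + p[i]g[i-1] + ... + p[i]...p[1]g[0] + p[i]...p[0]c0.
    Every carry is formed from g, p and c0 only, never from another carry."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    carries = [cin]
    for i in range(width):
        carry = g[i]          # carry generated at bit i itself
        chain = p[i]          # running product p[i]p[i-1]...
        for j in range(i - 1, -1, -1):
            carry |= chain & g[j]   # generated at bit j, propagated up to bit i
            chain &= p[j]
        carry |= chain & cin        # input carry propagated through bits 0..i
        carries.append(carry)
    return carries


def cla_add(a: int, b: int, width: int, cin: int = 0) -> tuple[int, int]:
    """Sum bit i is p[i] XOR c[i]; the final carry is c[width]."""
    c = cla_carries(a, b, width, cin)
    total = 0
    for i in range(width):
        total |= ((((a >> i) ^ (b >> i)) & 1) ^ c[i]) << i
    return total, c[width]


print(cla_add(0b1011, 0b0110, 4))  # (1, 1): 11 + 6 = 17 = 0b1_0001
```

Python evaluates the products one after another. In hardware, however, each expanded equation is a separate two-level AND-OR network, so all carries settle in parallel.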

Carry tree adders enhance the operational speed at bit widths of 128 and 256 bits. The fast carry chain in carry tree adders improves the operational speed by minimizing the delay of the RCA and the carry skip adder. Optimizing the carry path is the ultimate solution for delay. Heat dissipation in the components is one of the constraints in CSKA design, which motivates reversible logic implementation. A reversible logic gate called ‘Inventive0gate’ has been introduced to synthesize the adder modules with minimum gate count and outputs. The application of reversible logic extends to quantum computation, nanotechnology and low power digital circuits. Carry skip BCD adders have been designed with reversible logic to minimize the number of gates and outputs, and several reversible BCD adder strategies have been introduced for low power digital circuits. This study shows that gate count, power consumption and area occupation are the major constraints in designing optimal VLSI circuits. This paper proposes a modified CSKA in which the multiplexers are replaced by logic gates derived from Boolean function minimization. This modification reduces the critical path delay and improves the speed of operation.

The technical contributions of the proposed layout design of the 32-bit Carry Skip Adder (CSKA) are as follows:

An optimized implementation logic in the RCA that reduces the critical path delay.

Reduction of the gate count and power consumption by the proposed CSKA.

Extensive visualization of the layout framework for the 32-bit CSKA using the Microwind-DSCH tool.

The rest of the paper is organized as follows. Section II reviews previous research relevant to optimal adder designs. Section III describes the proposed layout framework for the optimized 32-bit Carry Skip Adder. Section IV compares the proposed and existing methods in terms of power, area and gate count. Section V concludes the paper.

2. RELATED WORK

This section describes related adder designs and the optimization methods used to improve operational speed. High speed and low power digital circuits are an attractive research area for DSP processors, and the Multiply Accumulator (MAC) unit is the basic element of low power circuits. The working of the hardware is governed by algorithms developed by the user. Gurjar et al constructed a high speed adder circuit using a Hardware Description Language (HDL) [1]. They presented a brief analysis of its synthesis and simulation and of the use of HDL in designing high speed circuits. Hardened (carry-based) adder circuits had been evaluated on various micro-benchmark circuits and small designs. Luu et al extended the use of hardened adder circuits with a carry chain mechanism to larger benchmark designs [2]. Interaction among analog computational units has also been used to achieve a low error rate and high precision. Woo et al showed how moderate interactions among analog units minimize error and achieve high precision [3]. However, power and delay still required optimization, and integration among the components was an important requirement for high speed arithmetic blocks. Francis et al introduced a bypassing technique and a modified adder design in the multiplier to optimize power and delay, and they developed adders in different logic styles for a high speed MAC unit [4]. The speed of operation depends on the critical path through which the carry ripples: the longer the critical path, the lower the speed. Jain et al presented area efficient transistor-based adders that shorten the critical path and thereby minimize delay [5]. Multi-standard wireless receivers and portable and mobile devices require designs optimized for area, delay and power.

An efficient adder design improves the performance of DSP processors. The Ripple Carry Adder (RCA) is the simplest adder design, but carry propagation takes more time; this is called the Carry Propagation Delay (CPD). Carry Select Adders (CSLA) reduce the CPD effectively. Mohanty et al eliminated the redundant logic operations of the traditional CSLA and reformulated the logic by scheduling the carry selection process. The logic optimization unit they introduced in the carry selection unit [6] offered lower delay and power, but its gate count was high and occupied a large area. Sreenivasulu et al applied a gate level modification to the traditional CSLA design that significantly reduced power and area. This modified CSLA is called the Square-Root CSLA (SQRT CSLA) [7]. Saranya extended the SQRT CSLA to 8-, 16-, 32- and 64-bit operations [8]. However, extending the CSLA to decimal addition was not suitable because incorrect carry bits occur. Saxena et al introduced a gate-level modification to the conventional CSLA that reduces delay and power consumption. The modifications were tested on several bit widths, and the modified CSLA [9] was compared with the conventional structures. Dorrigiv et al computed pairs of corrective carry-out bits for decimal operations [10]. Selecting the corrected pairs by the carry-out bits and including the carry-in bits in the addition achieved low area and delay. Studying data dependency and identifying redundant operations are important in implementing the CSLA.

Shirisha et al scheduled the carry selection process before the calculation of the final sum [11]. Using the bit patterns of the carry words and the fixed Cin bits provided logic optimization in the carry selection and generation units. Conventional CSLA designs dissipate considerable power through the loss of bit information. This motivated reversible logic, which achieves a unique (one-to-one) mapping between the input and output vectors. Jamal et al presented reversible implementations of the carry look-ahead adder that overcome the limitations of the conventional CSLA by computing the carry values directly from the adder inputs [12]. High speed and low power consumption are important factors in CSLA design. Patnayak et al proposed a D-latch based CSLA, an extension of the traditional CSLA, to further reduce power consumption [13]. In that design, scheduling the carry selection before the calculation of the final sum reduced the power consumption. Reversibility, together with a fault tolerant mechanism, prevents energy dissipation and bit errors. Mitra et al presented the detailed design of a Reversible Fault Tolerant Full Adder (RFT-FA) [14] with minimum quantum cost. They also combined the minimization of gates and garbage outputs in RFT based Carry Look-ahead Adders (CLA). Area efficient, low power and high speed MAC units are built with Carry Skip Adders (CSKA); hence, research has turned to CSKA design and optimization. Kalaiselvi et al further reduced the power consumption and improved the operational speed [15]. Their comparison of the power consumption and performance of the proposed and existing methods showed the suitability of the CSKA for future MAC designs.
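The one-to-one mapping that defines reversible logic can be checked exhaustively for small gates. The Python sketch below is offered only as an illustration. It tests whether a gate maps its 2^n input vectors onto 2^n distinct output vectors. The Feynman (CNOT), Toffoli and Peres gates are standard reversible gates, used here as examples only; they are not claimed to be the gates used in the works cited above. With its third input tied to 0, the Peres gate acts as a half adder whose first output is a garbage output.

```python
from itertools import product
from typing import Callable, Tuple

Bits = Tuple[int, ...]


def and_gate(a: int, b: int) -> Bits:
    return (a & b,)                    # 2 inputs, 1 output: information is lost


def feynman(a: int, b: int) -> Bits:
    return (a, a ^ b)                  # CNOT gate


def toffoli(a: int, b: int, c: int) -> Bits:
    return (a, b, (a & b) ^ c)         # controlled-controlled-NOT


def peres(a: int, b: int, c: int) -> Bits:
    return (a, a ^ b, (a & b) ^ c)


def is_reversible(gate: Callable[..., Bits], n_inputs: int) -> bool:
    """A gate is reversible iff it has as many outputs as inputs and no two
    input vectors share an output vector (a bijection on n-bit vectors)."""
    outputs = [gate(*bits) for bits in product((0, 1), repeat=n_inputs)]
    return all(len(o) == n_inputs for o in outputs) and len(set(outputs)) == len(outputs)


# Peres gate as a reversible half adder: (garbage A, sum A^B, carry AB) when C = 0.
for a, b in product((0, 1), repeat=2):
    garbage, s, carry = peres(a, b, 0)
    assert (s, carry) == (a ^ b, a & b)
```

The AND gate fails the test because four input vectors collapse onto two outputs. This is the loss of bit information that reversible designs avoid.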



Bhagyalakshmi et al extended reversible logic to Binary Coded Decimal (BCD) adders [16] to optimize the quantum computational cost. These computational structures lost considerable information when large bit sizes were used. Ali et al developed new reversible logic gates called FHNG gates [17], which reduced the loss of information for large bit sizes. Rajmohan et al improved the gate count, area and power by integrating a reversible mechanism with the CSKA [18]. Some problems arise when DSP processors are integrated with quantum computers. Shukla et al designed low power arithmetic and data path units using a reversible logic implementation of the Carry Look-ahead Adder (CLA) [19]. Investigating delay performance with carry tree adders was an important research area. Cury et al discussed the design of the fastest types of adders by applying a carry chain to the traditional RCA [20], supporting minimum delay for bit sizes from 128 to 256 bits. Heat dissipation in the components remained high. Hence, Misra et al introduced a reversible logic gate called Inventive0gate [21], an efficient and optimized design that reduces the gate count and thereby minimizes heat dissipation. In contrast, this paper modifies the traditional CSKA structure by replacing the multiplexers with gates obtained by Boolean function minimization, which further reduces the gate count, area and power consumption. Moreover, the design of the 32-bit CSKA structure is investigated with the Microwind-DSCH tool.

3. OPTIMAL 32-BIT CARRY SKIP ADDER VISUALIZATION

This section presents the visualization, in the Microwind-DSCH environment, of the optimal 32-bit Carry Skip Adder (CSKA) in which the multiplexers are replaced. Boolean function based gate minimization reduces the critical path delay. The design of the proposed optimal 32-bit adder is shown in fig. 1.

Fig. 1 Design of proposed optimal 32-bit adder

The proposed design comprises basic half adders and full adders, arranged in the DSCH2 environment as a sequence forming a 32-bit Carry Skip Adder (CSKA). The proposed CSKA is similar to the conventional CSKA except in the incrementation block, where the multiplexers are replaced by gates obtained from Boolean function based minimization. The proposed CSKA is shown in fig. 2.



Fig. 2 CSKA Design.

The proposed CSKA contains a sequence of 4-bit Ripple Carry Adders (RCA), AOI-OAI logic and the modified incrementation block. Table 1 describes the notations used in the proposed work.

Table 1 Notations and descriptions

Symbol Description

Full Adder (FA) stage

Ripple Carry Adder (RCA) stage

Intermediate sum

Final sum

Level of intermediate results

Propagation delay of carry output of FA

Propagation delay of skip logic

Propagation delay of AND gate

Propagation delay of OR gate

Propagation delay of AND-OR-Inverter logic

Propagation delay of OR-AND-Inverter logic

Delay of critical path

The detailed description of each block and the functions

performed by using these blocks are as follows:

3.1 CARRY SKIP ADDER

The proposed Carry Skip Adder (CSKA) consists of cascaded Ripple Carry Adders built from Full Adders (FA). The worst propagation delay occurs during the summation of two N-bit numbers. Depending on whether the whole RCA or only a group of RCAs is in propagation mode, the propagation delay falls into one of two cases:

Case i: All FAs are in propagation mode

Consider two N-bit operands. The propagation delay for this case is defined by,

(1)

This equation defines a linear relationship between the propagation delay and the number of bits, and it is called the worst case delay.

Case ii: Group of cascaded FAs in propagation mode

The carry output of one Full Adder (FA) chain is the carry input of the next FA chain. The critical path of the CSKA then contains three parts, and the associated delays are defined as follows:

The path through the first FA stage of the CSKA

The path through the intermediate carry skip stages

The path through the last stage of the FA chain.

Increasing the number of adder stages lengthens the critical path for carry propagation, which lowers the operational speed and raises the gate count. To reduce the critical path delay, the conventional CSKA design is modified with carry skip logic, AOI-OAI logic and a modified incrementation block.
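The conventional skip mechanism that this section modifies can be sketched behaviorally in Python. Each block is a ripple carry adder. When every bit of a block propagates (a_i XOR b_i = 1 for every bit in the block), a 2:1 multiplexer forwards the block's carry-in directly instead of waiting for the ripple. The sketch assumes equal 4-bit blocks and a 32-bit word, and its function names are hypothetical. It checks function, not delay. The skip never changes the result, because a fully propagating block would pass the same carry anyway; it only shortens the path the carry travels.

```python
def ripple_block(a: int, b: int, width: int, cin: int) -> tuple[int, int]:
    """Ripple carry addition of one block; returns (sum, carry_out)."""
    total, carry = 0, cin
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return total, carry


def carry_skip_add(a: int, b: int, width: int = 32, block: int = 4, cin: int = 0):
    """Conventional carry skip adder with multiplexer-based skip logic.
    Returns (sum, carry_out, list of per-block skip flags)."""
    assert width % block == 0, "this sketch assumes equal-size blocks"
    mask = (1 << block) - 1
    total, carry, skipped = 0, cin, []
    for lo in range(0, width, block):
        a_blk, b_blk = (a >> lo) & mask, (b >> lo) & mask
        s_blk, rca_cout = ripple_block(a_blk, b_blk, block, carry)
        block_propagate = (a_blk ^ b_blk) == mask  # AND of all p_i in the block
        # Skip multiplexer: select the block carry-in when the block propagates.
        carry = carry if block_propagate else rca_cout
        skipped.append(block_propagate)
        total |= s_blk << lo
    return total, carry, skipped


print(carry_skip_add(0xF0, 0x0F, width=8))  # (255, 0, [True, True])
```

In the worst case, the carry is generated in the first block, skips every intermediate block and ripples through the last one. This matches the three-part critical path listed above.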

3.2 AOI-OAI LOGIC

The multiplexers are replaced with AND-OR-Inverter (AOI) and OR-AND-Inverter (OAI) gates, which contain fewer transistors and have lower delay, power and area consumption. Because these gates invert, the carry is complemented as it propagates through the skip logics. Hence, the complement of the carry is generated at the even stages of the skip logic. The power consumption of the AOI-OAI logic is lower than that of the conventional design.

Because AOI and OAI gates are available as inverting functions in standard cell libraries, they are used instead of multiplexers to reduce power and area consumption. They are used alternately: if one skip logic uses AOI, the next skip logic uses OAI. On its own, however, this arrangement increases the critical path delay considerably, because a CSKA with AOI-OAI skip logic cannot bypass a zero carry input. To overcome this problem, each RCA is implemented with a zero carry input. The RCAs then need not wait for carry propagation from the previous RCA stage, and the parallel computation of the carries removes unnecessary delay.
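The polarity bookkeeping behind the alternating AOI-OAI skip logic can be checked with a small Python sketch. The sketch assumes the skip function has the AND-OR form c_out = c_rca + P·c_in, where P is the stage's skip (propagate) signal and c_rca is the carry out of its RCA. An AOI gate produces the complement of this carry. The next stage therefore uses an OAI gate fed with complemented signals, which by De Morgan's law restores the true carry. The AOI and OAI gate functions are standard; the stage encoding and function names are assumptions for illustration.

```python
from itertools import product


def aoi21(a: int, b: int, c: int) -> int:
    """AND-OR-Invert: NOT(a*b + c)."""
    return 1 - ((a & b) | c)


def oai21(a: int, b: int, c: int) -> int:
    """OR-AND-Invert: NOT((a + b) * c)."""
    return 1 - ((a | b) & c)


def reference_chain(stages, cin):
    """True-polarity skip chain: c_out = c_rca + P * c_in at every stage."""
    c = cin
    for p, c_rca in stages:
        c = c_rca | (p & c)
    return c


def alternating_chain(stages, cin):
    """Even-indexed stages use AOI and emit the complemented carry; odd-indexed
    stages use OAI on complemented inputs and emit the true carry again."""
    c = cin
    for k, (p, c_rca) in enumerate(stages):
        if k % 2 == 0:
            c = aoi21(p, c, c_rca)            # c now holds NOT(carry)
        else:
            # NOT((NOT P + NOT c) * NOT c_rca) = P*c + c_rca
            c = oai21(1 - p, c, 1 - c_rca)
    # After an odd number of stages the last output is still complemented.
    return 1 - c if len(stages) % 2 else c


def chains_agree(max_stages: int) -> bool:
    """Exhaustively compare both chains for every input combination."""
    for n in range(1, max_stages + 1):
        for bits in product((0, 1), repeat=2 * n + 1):
            stages = [(bits[2 * k], bits[2 * k + 1]) for k in range(n)]
            if alternating_chain(stages, bits[-1]) != reference_chain(stages, bits[-1]):
                return False
    return True


print(chains_agree(5))  # True
```

The exhaustive check also shows the zero-carry limitation. When c_in is 0, the output of every stage reduces to c_rca, so the stage must wait for its own RCA. This is why the RCAs are given a zero carry input and computed in parallel.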

3.3 MODIFIED INCREMENTATION BLOCK

The optimal CSKA adds to each RCA an additional block called the modified incrementation block. The carry input of every RCA block except the first is zero, which allows all the additions to be executed simultaneously. In the proposed structure, the first block computes the sum and carry, while the other blocks simultaneously compute intermediate results.

Stage 0 of the proposed design contains only an RCA, and stages 1 to Q each contain two modules, namely an RCA and a modified incrementation block. The modified incrementation block contains a chain of Half Adders (HA), as shown in fig. 3.

Fig. 3 Incrementation Structure

The incrementation block produces the number of intermediate

results up to the level defined by,

∑ (2)

The delay is reduced considerably because the carry output generated by the overall design, rather than the carry output of the incrementation stage, is used.

From fig. 2, the carry output of the Qth stage is obtained from the intermediate results, the carry output of the previous stage and the carry output of the RCA of that stage. If the RCA carry output is one, the stage carry output is also one. If the RCA carry output is zero, the product (AND) of the intermediate results is checked. If this product is one, the stage carry output is the same as the carry output of the previous stage.
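The stage behavior described above can be summarized in a Python behavioral sketch. The RCAs have a zero carry-in and are computed in parallel, and a half adder chain increments each intermediate sum. The stage carry is formed from the RCA carry, the AND of the intermediate sum bits and the previous stage carry. The eight 4-bit stages follow the 4-bit RCAs mentioned for fig. 2. However, the stage sizes, the function names and the bit encoding are assumptions. The sketch checks only that the scheme produces the correct sum, not its timing.

```python
def ripple_carry_add(a: int, b: int, width: int, cin: int = 0) -> tuple[int, int]:
    """RCA of one stage; returns (sum, carry_out)."""
    total, carry = 0, cin
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return total, carry


def increment_chain(x: int, width: int, inc: int) -> int:
    """Modified incrementation block: a chain of half adders adds the single
    bit `inc` to `x`. Its own carry out is not used (see stage carry below)."""
    out, carry = 0, inc
    for i in range(width):
        bit = (x >> i) & 1
        out |= (bit ^ carry) << i   # half adder sum
        carry = bit & carry         # half adder carry
    return out


def incrementation_cska_add(a: int, b: int, stage_sizes=(4,) * 8) -> tuple[int, int]:
    total, lo, carry = 0, 0, 0
    for q, w in enumerate(stage_sizes):
        mask = (1 << w) - 1
        # Every RCA has a zero carry-in, so all stages can add in parallel.
        inter_sum, rca_cout = ripple_carry_add((a >> lo) & mask, (b >> lo) & mask, w, 0)
        if q == 0:
            final_sum, carry = inter_sum, rca_cout      # stage 0: RCA only
        else:
            final_sum = increment_chain(inter_sum, w, carry)
            all_ones = int(inter_sum == mask)           # AND of intermediate sum bits
            # Stage carry: 1 if the RCA carried; otherwise the previous carry
            # passes through only when the intermediate sum is all ones.
            carry = rca_cout | (all_ones & carry)
        total |= final_sum << lo
        lo += w
    return total, carry


x, y = 0xDEADBEEF, 0x12345678
assert incrementation_cska_add(x, y) == ((x + y) & 0xFFFFFFFF, (x + y) >> 32)
```

The stage carry expression is exact. Incrementing an intermediate sum overflows only when that sum is all ones, and an RCA carry-out and an all-ones intermediate sum can never occur together in the same stage.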

The optimal 32-bit CSKA design was implemented with the DSCH2 software tool, and the simulated output is shown in fig. 4.

3.4 VISUALIZATION OF LAYOUT OF 32-BIT CSKA

The timing and critical path performance are evaluated at the physical description level. Microwind is the software used to design and simulate the integrated circuits at the physical level. It unifies schematic entry, schematic extraction, layout compilation and mixed circuit simulation. Its single-key simulation and command-based editor help extract the electrical circuit and perform analog simulation, showing the voltage and current values over time.

The tool provides command-based visualization of various nMOS and pMOS characteristics, and changing the size and associated parameters changes the voltage and current values. Two important tools are used to validate the design: the process simulator and the logic cell compiler. The process simulator shows the vertical view of the layout as it will appear once fabrication is completed. The logic cell compiler is a sophisticated tool that enables the automatic design of CMOS circuits. A Verilog description is provided by DSCH, a user-friendly schematic editor combined with a logic simulator. The rules required for design and fabrication are arranged in the cell. The 3D layout of the proposed optimal 32-bit design is shown in fig. 5.

Fig. 4 Simulated Design

Fig. 5 3D layout of proposed 32-bit CSKA Design.



4. PERFORMANCE ANALYSIS

Using the modified incrementation module in the Carry Skip Adder (CSKA) design reduces the gate count, power and delay because the critical path delay is reduced. To verify its effectiveness, the proposed optimal 32-bit CSKA design is compared with the conventional CSLA (Dual RCA), the modified CSLA (with BEC), the regular SQRT CSLA and the modified SQRT CSLA in terms of power consumption, gate count and area. In general, the critical path delay of the conventional CSKA structure depends on the delays of the carry and sum outputs and is expressed as

[ ] *(

) + [

] (3)

CRITICAL PATH DELAY

The critical path of the proposed design contains three parts: the path through the first stage of the FA chain, the path through the skip logics, and the incrementation block in the last stage. The total critical path delay depends on the delay of each individual part and is expressed as

[ ] [ ] [( ) ]

(4)

Since AOI and OAI gates alternate along the skip chain, the delay of the skip logic is computed as the average of the AOI and OAI delays:

$T_{SKIP} = \dfrac{T_{AOI} + T_{OAI}}{2}$ (5)

With this modification, equation (4) becomes

[ ] *

+ [( )

] (6)

From equations (3) and (6), the skip logic delay of the proposed design is smaller than that of the conventional design for the same number of stages, because the AOI and OAI delays are small compared with the corresponding delays of the conventional structure. Hence, reducing the skip logic delay reduces the delay of the overall structure. Table 2 compares the proposed design with the conventional structures in terms of power consumption, area occupation and gate count. The reduced path delay effectively optimizes these parameters.
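The symbols of equations (3), (4) and (6) did not survive extraction. The following Python sketch therefore uses a generic first-order carry skip delay model, which is an assumption and not the paper's exact expressions. It shows why a faster skip gate shortens the whole critical path: the carry ripples through the first block, passes one skip logic per remaining block, and ripples through the last block to its sum bit. Only the skip term differs between the two variants, and it uses the averaging of equation (5) for the AOI-OAI case. The per-gate delays below are placeholder unit values chosen to show the arithmetic; they are not measured results.

```python
from dataclasses import dataclass


@dataclass
class GateDelays:
    """Propagation delays in arbitrary time units (placeholders, not measurements)."""
    carry: float  # carry output of one full adder
    sum: float    # sum output of one full adder
    mux: float    # 2:1 multiplexer skip logic of the conventional CSKA
    aoi: float    # AND-OR-Invert gate
    oai: float    # OR-AND-Invert gate


def aoi_oai_skip_delay(d: GateDelays) -> float:
    """Equation (5): skip delay taken as the average of the AOI and OAI delays,
    since the two gate types alternate along the skip chain."""
    return (d.aoi + d.oai) / 2


def skip_adder_delay(n_bits: int, block: int, t_skip: float, d: GateDelays) -> float:
    """First-order model (assumed): ripple through the first block, one skip
    per remaining block, then ripple to the sum bit of the last block."""
    n_blocks = n_bits // block
    return block * d.carry + (n_blocks - 1) * t_skip + (block - 1) * d.carry + d.sum


d = GateDelays(carry=1.0, sum=1.0, mux=1.0, aoi=0.5, oai=0.5)  # placeholder units
conventional = skip_adder_delay(32, 4, d.mux, d)
with_aoi_oai = skip_adder_delay(32, 4, aoi_oai_skip_delay(d), d)
print(conventional, with_aoi_oai)  # 15.0 11.5
```

Because the skip term is multiplied by the number of blocks minus one, a faster skip gate saves proportionally more as the word length grows.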

Table 2 Comparative Analysis

Method                           Area    Power (mW)    Gate count
Conventional CSLA (Dual RCA)     192     95.01         1040
Modified CSLA (with BEC)         141     79.81         809
Regular SQRT CSLA (Dual RCA)     129     553           698
Modified SQRT CSLA (with BEC)    141     448           762
Optimal CSKA (Proposed)          25      53.426        136

Fig. 6 illustrates the comparison of the existing and proposed methods graphically. Compared with the existing methods, the proposed method provides the lowest gate count, area and power consumption, owing to the gate based modified incrementation block.
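The relative savings implied by Table 2 can be computed directly from its entries. The Python sketch below copies the values as printed; the area unit is not legible in the source, and the power entries of the two SQRT rows are reproduced as given. It reports the percentage reduction of the proposed CSKA with respect to each existing method and adds no new measurements.

```python
TABLE_2 = {
    # method: (area, power in mW, gate count), copied from Table 2
    "Conventional CSLA (Dual RCA)": (192, 95.01, 1040),
    "Modified CSLA (with BEC)": (141, 79.81, 809),
    "Regular SQRT CSLA (Dual RCA)": (129, 553, 698),
    "Modified SQRT CSLA (with BEC)": (141, 448, 762),
}
PROPOSED = (25, 53.426, 136)  # Optimal CSKA (Proposed)


def reduction_percent(existing: float, proposed: float) -> float:
    """Percentage by which `proposed` is lower than `existing`."""
    return 100.0 * (existing - proposed) / existing


for method, values in TABLE_2.items():
    area, power, gates = (reduction_percent(e, p) for e, p in zip(values, PROPOSED))
    print(f"{method:31s} area -{area:5.1f}%  power -{power:5.1f}%  gates -{gates:5.1f}%")
```

For example, against the conventional CSLA (Dual RCA), the gate count falls from 1040 to 136, a reduction of about 86.9%.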

Fig. 6 Comparative analysis.

5. CONCLUSION

This paper addressed the limiting factors of power, area and time consumed by the components used in processor design. DSP processors perform many operations and use many components. Hence, reducing the power, area and delay of the full adder is necessary in low power applications. Modern DSP processors use a carry chain to optimize the carry forwarding path, which reduces the delay effectively. However, the multiple candidate carries that are created require an additional adder structure. Replacing the multiplexers with gates constructed from Boolean functions optimized the carry forwarding path. Moreover, the carry-skip mechanism reduced the number of components required to design a 32-bit ripple carry adder. The Microwind-DSCH tool was used to create the layout of the 32-bit adder and to visualize the carry forwarding path and the time required to perform the operation. The optimization provided in this paper enhanced the operational speed with minimum area occupation and power consumption.

REFERENCES

[1] P. Gurjar, R. Solanki, P. Kansliwal, and M. Vucha, "VLSI Implementation of adders for high speed ALU," in Annual IEEE India Conference (INDICON), 2011, pp. 1-6.

[2] J. Luu, C. McCullough, S. Wang, S. Huda, B. Yan, C. Chiasson, et al., "On Hard Adders and Carry Chains in FPGAs," in IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2014, pp. 52-59.

[3] S. S. Woo and R. Sarpeshkar, "A spiking-neuron collective analog adder with scalable precision," in IEEE International Symposium on Circuits and Systems (ISCAS), 2013, pp. 1620-1623.

[4] T. Francis, T. Joseph, and J. K. Antony, "Modified MAC unit for low power high speed DSP application using multiplier with bypassing technique and optimized adders," in Fourth International Conference on Computing, Communications and Networking Technologies (ICCCNT), 2013, pp. 1-4.

[5] N. Jain, P. Gour, and B. Shrman, "A high speed low power adder in dynamic logic base on transmission gate," in International Conference on Circuit, Power and Computing Technologies (ICCPCT), 2015, pp. 1-5.



[6] B. K. Mohanty and S. K. Patel, "Area–Delay–Power Efficient Carry-Select Adder," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 61, pp. 418-422, 2014.

[7] P. Sreenivasulu, K. S. Rao, M. Reddy, and A. V. Babu, "Energy and Area efficient Carry Select Adder on a reconfigurable hardware," International Journal of Engineering Research and Applications, vol. 2, pp. 436-440, 2012.

[8] K. Saranya, "Low Power and Area-Efficient Carry Select Adder," International Journal of Soft Computing and Engineering (IJSCE), ISSN 2231-2307, 2013.

[9] P. Saxena, U. Purohit, and P. Joshi, "Analysis of Low Power, Area Efficient & High Speed Fast Adder," International Journal of Advanced Research in Computer and Communication Engineering, 2013.

[10] M. Dorrigiv and G. Jaberipur, "Low area/power decimal addition with carry-select correction and carry-select sum-digits," Integration, the VLSI Journal, vol. 47, pp. 443-451, 2014.

[11] K. Shirisha and D. S. Rao, "Area-Delay-Power Efficient Carry-Select Adder," 2015.

[12] L. Jamal, M. Shamsujjoha, and H. H. Babu, "Design of optimal reversible carry look-ahead adder with optimal garbage and quantum cost," International Journal of Engineering and Technology, vol. 2, pp. 44-50, 2012.

[13] S. K. Patnayak, M. Raja, and D. Sailaja, "Design Carry Select Adder With D-Latch," 2015.

[14] S. K. Mitra and A. R. Chowdhury, "Minimum Cost Fault Tolerant Adder Circuits in Reversible Logic Synthesis," in 25th International Conference on VLSI Design (VLSID), 2012, pp. 334-339.

[15] K. Kalaiselvi and H. Mangalam, "Area Efficient High Speed and Low Power MAC Unit," International Journal of Computer Applications, vol. 67, pp. 40-44, 2013.

[16] H. Bhagyalakshmi and M. Venkatesha, "Optimized design of BCD adder and Carry skip BCD adder using reversible logic gates," International Journal on Computer Science and Engineering, vol. 3, pp. 1439-1449, 2011.

[17] Md. Belayet Ali, M. S. R., and Tahmina Parvin, "Optimized Design of Carry Skip BCD adder using new FHNG reversible logic gates," International Journal of Computer Science, vol. 9, p. 3, 2012.

[18] M. Rajmohan and D. S. Lenin, "A Novel Design of Carry Skip BCD Adder using Reversible Gates," International Journal of Computer Applications, vol. 73, pp. 46-51, 2013.

[19] S. Shukla, T. Verma, and R. Jain, "Design of 16 Bit Carry Look Ahead Adder Using Reversible Logic," International Journal of Electrical, Electronics and Computer Engineering, vol. 3, pp. 83-89, 2014.

[20] C. Cury and M. Nisanth, "Design of Parallel Prefix Adders using FPGAs," IOSR Journal of VLSI and Signal Processing, vol. 4, pp. 45-51, 2014.

[21] N. K. Misra, M. K. Kushwaha, S. Wairya, and A. Kumar, "Cost efficient design of reversible adder circuits for low power applications," arXiv preprint arXiv:1509.04618, 2015.



HIGH SPEED AND LOW POWER CMOS TECHNOLOGY BASED RAM-CAM MEMORY DESIGN

K. Vidhya kamu1 and Mr. P. Karthikeyan2

1Applied Electronics, PSNA College of Engineering & Technology, Dindigul, Tamil Nadu, India.

2PSNA College of Engineering & Technology, Dindigul, Tamil Nadu, India.

Abstract— Content-Addressable Memory (CAM) is a special type of memory used in very high speed searching applications. CAM is also known as associative memory, associative storage or an associative array. It retrieves data at high speed by matching the search data against the stored contents rather than by address. We propose an XOR based CAM cell with 8 transistors. The energy consumed by the proposed CAM design is lower than that of the conventional low-power CAM design. The proposed method is used to design a Complementary Metal Oxide Semiconductor (CMOS) technology based RAM-CAM memory architecture. Euclidean distance based matching is used to compare and retrieve the contents of the CAM based memory. The proposed CMOS technology based design reduces the leakage power and the delay of the transistors. The speed of the proposed method is increased by using the CAM method to search the stored contents and return the matching address. The parameters considered in the design are energy, power and time.

Index Terms—Content-Addressable Memory (CAM), Random Access Memory (RAM), Complementary Metal Oxide Semiconductor (CMOS), Transistors.

1. INTRODUCTION

Very Large Scale Integration (VLSI) design is the process of creating an Integrated Circuit (IC) by combining thousands of transistors into a single chip. Before VLSI technology was introduced, ICs could perform only a limited set of functions. An electronic system consists of a CPU, RAM, ROM and other glue logic. CMOS technology is used to construct integrated circuits, including the following devices:

Microprocessors

Microcontrollers

Static RAM

Digital logic circuits

CMOS devices have two important characteristics: (1) high noise immunity and (2) low power consumption.

Content addressable memory (CAM) is a special type of solid-state memory that accesses data by its contents rather than by its physical location. The data to be searched is compared in parallel against all stored entries. Each entry is associated with a tag, and the comparison process uses this tag to find the matching entry. Random Access Memory (RAM) is computer memory in which any location can be accessed directly (randomly). RAM is the most common type of memory and is found in computers and in other devices such as printers. There are two types of RAM: (1) Dynamic Random Access Memory (DRAM) and (2) Static Random Access Memory (SRAM).
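A minimal behavioral sketch in Python illustrates the difference between the two access styles. It is not taken from the paper. A RAM read takes an address and returns the stored word. A CAM search takes a word and returns every address whose entry equals it. The function names and the 4-bit contents are hypothetical. The loop only models the result of the comparisons, which the hardware performs in parallel.

```python
from typing import List


def ram_read(memory: List[int], address: int) -> int:
    """RAM access: the address selects one location and its content is returned."""
    return memory[address]


def cam_search(memory: List[int], search_word: int) -> List[int]:
    """CAM access: compare the search word with every stored entry.

    Hardware performs all comparisons at once; this loop only models the
    outcome. Returns the addresses of all matching entries.
    """
    return [addr for addr, entry in enumerate(memory) if entry == search_word]


# Hypothetical 4-entry, 4-bit memory contents.
stored = [0b0101, 0b0011, 0b1100, 0b1010]
print(ram_read(stored, 2))         # 12 (0b1100), the word stored at address 2
print(cam_search(stored, 0b1100))  # [2], the address holding 0b1100
```

When several entries match, real hardware would typically pass the match lines to a priority encoder. The sketch simply returns all matching addresses.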

Fig 1. Typical Structure of the 4x4 CAM

Fig 1 shows the typical structure of a CAM. The CAM array is divided into several equal-sized sub-blocks that can be activated independently. Only the selected sub-blocks are activated, and the input tag is compared with their few entries while the remaining sub-blocks stay deactivated. In Fig. 1 the input tag is 1100; the sub-blocks are searched and the third sub-block is produced as the output.
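Selective activation of sub-blocks can be sketched as follows. The sketch rests on a hypothetical rule, because the text does not say how a sub-block is chosen. Here each entry is placed in a sub-block according to its two most significant bits, and only the sub-block addressed by the tag's top bits is activated. The comparison count stands in for the number of activated match lines.

```python
from typing import Dict, List, Tuple

WORD_BITS = 4
SELECT_BITS = 2  # hypothetical: the top 2 bits decide which sub-block is activated

SubBlocks = Dict[int, List[Tuple[int, int]]]


def build_sub_blocks(entries: List[int]) -> SubBlocks:
    """Place each (address, entry) pair in the sub-block named by its top bits."""
    blocks: SubBlocks = {}
    for addr, entry in enumerate(entries):
        key = entry >> (WORD_BITS - SELECT_BITS)
        blocks.setdefault(key, []).append((addr, entry))
    return blocks


def partitioned_search(blocks: SubBlocks, tag: int) -> Tuple[List[int], int]:
    """Activate only the sub-block selected by the tag; the others stay idle.

    Returns the matching addresses and the number of comparisons made,
    a stand-in for the number of match lines that were activated.
    """
    active = blocks.get(tag >> (WORD_BITS - SELECT_BITS), [])
    matches = [addr for addr, entry in active if entry == tag]
    return matches, len(active)


blocks = build_sub_blocks(list(range(16)))  # 16 entries -> 4 sub-blocks of 4
print(partitioned_search(blocks, 0b1100))   # ([12], 4): 4 comparisons instead of 16
```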

The existing system uses a 7-T NAND logic gate for the CAM cell design. The Match Line (ML) is analyzed and combined with the RAM. Existing systems follow several approaches, such as (1) NOR-based ML and (2) NAND-based ML. The disadvantages of the existing system are:

The CAM consumes more power for processing

It has a complex circuit design

It has a high delay time
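At the logic level, the NOR-based and NAND-based match lines listed above compute the same result. They differ in how the transistors are connected, and that difference drives the power and delay trade-off. The sketch below is a behavioral model, not a circuit simulation. It shows how each match line is evaluated, and the comments note the electrical behavior each one stands for.

```python
from typing import List


def cell_match(stored: List[int], search: List[int]) -> List[bool]:
    """Per-bit comparison done inside each CAM cell (True when the bits agree)."""
    return [s == q for s, q in zip(stored, search)]


def nor_matchline(stored: List[int], search: List[int]) -> bool:
    """NOR-type ML: precharged high, with one pull-down per cell in parallel.
    Any mismatching cell discharges the line, so it stays high (match)
    only if no cell mismatches."""
    return not any(not m for m in cell_match(stored, search))


def nand_matchline(stored: List[int], search: List[int]) -> bool:
    """NAND-type ML: the cells' pass transistors form one series chain.
    The line evaluates to a match only if every transistor in the chain
    conducts, i.e. every bit agrees."""
    return all(cell_match(stored, search))


stored, search = [1, 1, 0, 0], [1, 1, 0, 0]
print(nor_matchline(stored, search), nand_matchline(stored, search))  # True True
```

The NAND chain places one series transistor per bit, so the discharge path lengthens as the word widens. This is consistent with the larger delay the text attributes to the NAND type. The NOR type instead costs power, because every mismatching line discharges on each search.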

The bit comparison is performed using NAND-based logic as part of the match-line circuitry. The structure of the NAND type is identical to that of the NOR type except for the position of the ML transistors. The NAND-type structure introduces a significant delay, found to be around 12 ns for the matching process. TCAM is normally used in the design of high-speed, lookup-intensive applications such as packet forwarding and packet classification. The remainder of the paper is organized as follows. Section II reviews the literature related to low-power CMOS technology and CAM-based memory design. Section III illustrates the proposed architecture, and Section IV describes the performance analysis of the proposed method. Section V presents the conclusion and future work of this paper.

2. RELATED WORK

This section reviews the literature related to low-power CMOS technology and CAM-based memory design.

Jarollahi, et al [1] proposed a low-power content addressable memory (CAM) that exploits the associativity between the input tag and the address of the output data. The method is based on a sparse clustered network, which eliminates the parallel comparisons performed during the search. Yang, et al [2] proposed a Pai-Sigma matchline scheme to reduce the search power of a Ternary Content Addressable Memory (TCAM). The NAND-NOR matchline scheme was used to address the issues of charge sharing and the DC path, and the design was implemented in 0.18 µm CMOS technology to reduce the power consumed by the TCAM. Jarollahi, et al [3] proposed a sparse clustered neural network that achieves optimal efficiency and large diversity, together with a parallel hardware implementation of the Gripon-Berrou Neural Network (GBNN). The proposed architecture targets data-mining applications and can be embedded inside processor chips to communicate with the memory units. Do, et al [4] introduced a parity bit, which yields a 39% reduction in sensing delay at a cost of 1% power overhead; the peak and average power consumption were reduced by the proposed power-gated ML sensing technique. Onizawa, et al [5] introduced a reordered overlapped search mechanism to achieve a high-throughput, low-energy CAM, in which a word circuit is divided into two sections that are searched sequentially to lower the power dissipation.
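A simplified behavioral sketch illustrates splitting a word into two sequentially searched sections, as in [5]. It models only the early termination. The low half of each word is assumed to be searched first. The sketch does not model the reordering, overlapping or self-timed circuitry of the actual design. The comparison count is a rough proxy for match-line switching.

```python
from typing import List, Tuple


def two_section_search(entries: List[int], key: int, width: int = 8) -> Tuple[List[int], int]:
    """Search `entries` for `key` in two sequential sections.

    Section 1 (assumed here to be the low half of each word) is compared for
    every entry; section 2 is compared only for entries that passed section 1.
    Returns the matching addresses and the number of section comparisons.
    """
    half = width // 2
    mask = (1 << half) - 1
    comparisons = 0
    survivors = []
    for addr, entry in enumerate(entries):      # section 1: all entries
        comparisons += 1
        if (entry & mask) == (key & mask):
            survivors.append(addr)
    matches = []
    for addr in survivors:                      # section 2: survivors only
        comparisons += 1
        if (entries[addr] >> half) == (key >> half):
            matches.append(addr)
    return matches, comparisons


entries = [0x12, 0x34, 0x52, 0x78]
print(two_section_search(entries, 0x52))  # ([2], 6): 6 section compares instead of 8
```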

Gripon and Berrou [6] introduced two codes, the thrifty code and the clique code, which are sub-families of binary constant-weight codes; the two codes were introduced so that sparse neural networks achieve near-optimal performance. Jarollahi, et al [7] introduced a Clustered Neural Network (CNN) that stores many more messages than the traditional Hopfield Neural Network (HNN), and also proposed hardware architectures for such memories. Petit, et al [8] presented a new hybrid scheme called the RAM-CAM register renaming scheme, which combines the functionalities of RAM and CAM registers. The power dissipated by the CAM renaming scheme was between 17% and 26%, and the scheme consumed less energy than the RAM-based renaming scheme. Jarollahi, et al [9] proposed a new hardware architecture based on Sparse Clustered Networks (SCN), known as Selective Decoding SCN (SD-SCN). The proposed architecture is suitable for implementations with low retrieval latency but is limited to small networks. Do, et al [10] proposed a power-efficient ML sensing technique that achieves low power consumption. A fully parallel match line structure was introduced with an Automated Background Checking (ABC) scheme, which uses two dummy rows. Wong, et al [11] compared Field Programmable Gate Arrays (FPGA) with custom CMOS by implementing a comprehensive set of processor circuits on both and comparing their delay and area. Jarollahi, et al [12] introduced a low-power CAM that employs a new mechanism to provide the associativity between the input tags and the address of the output data, based on a clustered sparse network. The design was implemented in 0.13 µm CMOS technology to reduce power consumption, and a NAND-based architecture with a higher number of transistors was constructed. Ullah, et al [13] presented a novel memory architecture called the hybrid partitioned Static Random Access Memory (SRAM) based ternary CAM. Conventional TCAM has disadvantages such as low bit density and high cost per bit. Zhang, et al [14] presented a design of NOR-type CAM based on Domain Wall (DW) motion in magnetic tracks. The CMOS switching and sensing circuits were shared globally to optimize the cell area, and a 65 nm CMOS process was used to evaluate the performance. Liu, et al [15] proposed the first packet classification scheme that uses Binary CAM (BCAM). A BCAM is similar to a TCAM, but every bit has only two possible states, 0 or 1, whereas a TCAM bit can also store a "don't care" state. A number of optimization techniques were also proposed, including skip lists and free expansion.
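The difference between a BCAM bit (0 or 1) and a TCAM bit (0, 1 or don't care) can be sketched by storing each TCAM entry as a value plus a care mask. This is a common behavioral model, not the encoding used by any paper cited here. The table contents are hypothetical.

```python
from typing import List, Optional, Tuple

# A TCAM entry is (value, care_mask); bit positions where care_mask is 0 are "don't care".
TcamEntry = Tuple[int, int]


def tcam_match(entry: TcamEntry, key: int) -> bool:
    value, care = entry
    return ((key ^ value) & care) == 0   # only cared-for bits must agree


def bcam_match(stored: int, key: int) -> bool:
    return stored == key                 # every BCAM bit is 0 or 1, so all must agree


def first_tcam_hit(table: List[TcamEntry], key: int) -> Optional[int]:
    """Lowest matching address, the usual priority rule of a TCAM."""
    for addr, entry in enumerate(table):
        if tcam_match(entry, key):
            return addr
    return None


table = [(0b1100, 0b1111),   # exact entry 1100
         (0b1000, 0b1100),   # prefix 10XX
         (0b0000, 0b0000)]   # XXXX, matches everything
print(first_tcam_hit(table, 0b1011))  # 1: the 10XX entry is the first hit
print(bcam_match(0b1000, 0b1011))     # False: a BCAM needs every bit to agree
```

If the entries are ordered from most to least specific, the first hit is the longest-prefix match. This property is why TCAMs are used for packet forwarding.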

3. PROPOSED METHOD

This section elaborates the proposed CMOS-based Random Access Memory (RAM)-Content Addressable Memory (CAM) design. The overall flow of the proposed work is depicted in Fig 2. It has the following components:

SRAM CMOS Design

CAM CMOS Design

SCAN based enable Function

ML Sense Amplifier

The input bits to be searched are given as input to the SRAM data register. A 6T cell is used for the SRAM design of the data register. The RAM data is then compared with the CAM data. If they are equal, the match line is enabled using the clock, and the performance is analyzed. The CAM array consists of a group of cells and is based on the SCN-based match-line sense amplifier (MLSA) function. Each CAM array cell consists of the following:

Bit line inputs

CAM cell design based on the XOR gate

Write and read content data.



Fig 2. Proposed CMOS technology based RAM-CAM memory design

3.1 SRAM CMOS DESIGN

Static Random Access Memory (SRAM) is a type of Random Access Memory (RAM) that retains its data as long as it is supplied with power. SRAM is volatile, but it does not need to be refreshed periodically. Cache memory is designed using SRAM. The 6T SRAM CMOS cell is designed first. It stores a single bit based on the bit line and write line inputs. The design consists of two cross-coupled inverters, which hold the given input data. The write line is set to 1 to access the data held by the cross-coupled inverters.
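The sketch below gives a behavioral view of the 6T cell described above. It abstracts the transistors into two complementary storage nodes and an access condition. The `write_enable` flag is a hypothetical modelling convenience. A real 6T cell distinguishes reads from writes by how the bit lines are driven, not through a separate pin.

```python
from typing import Optional


class SixTransistorSramCell:
    """Behavioral model of a 6T SRAM cell: two cross-coupled inverters hold
    complementary nodes Q and QB, and two access transistors connect them to
    the bit lines only while the word (write) line is high."""

    def __init__(self) -> None:
        self.q = 0
        self.qb = 1  # the inverter pair keeps QB = NOT Q

    def access(self, word_line: int, write_enable: bool = False, bit_line: int = 0) -> Optional[int]:
        """One access: hold when word_line is 0, otherwise write or read."""
        if word_line == 0:
            return None              # access transistors off: the value is held
        if write_enable:
            self.q = bit_line & 1    # bit lines overpower the inverter pair
            self.qb = 1 - self.q
            return None
        return self.q                # read: the stored value appears on the bit line


cell = SixTransistorSramCell()
cell.access(word_line=1, write_enable=True, bit_line=1)  # write a 1
cell.access(word_line=0, write_enable=True, bit_line=0)  # ignored: cell not selected
print(cell.access(word_line=1))                          # 1
```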

3.2 CAM CMOS DESIGN

Complementary metal-oxide semiconductor (CMOS) is a semiconductor technology used to build transistor circuits, and many microchips developed for computers use it. CMOS technology uses two types of transistors: P-type and N-type transistors. The 8T XOR CAM CMOS cell is constructed as shown in Fig. 3. The design has two control lines, the selection line and the bit line, which control the CAM cell. They are used to find the address bit location, which in turn is used to check the address level: the logic function is evaluated and applied to check the address level. The CAM block operates using self-timed control and an input controller, so the CAM CMOS block operates in a timed and controlled manner.

3.3 SCAN BASED ENABLE FUNCTION

The SCAN based enable function is used to obtain the CAM cell array result and is applied to both the CAM and the RAM data. An XOR gate is used to match the SRAM data against the CAM data, and the XOR gate is also used to invert the activation results. The row values of the CAM data are used to register the input data bits, which are then used to activate the ML function.
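The sketch below shows one reading of this description. Each bit is compared with an XOR gate, and the XOR output is inverted so that 1 means agreement. A row's match line is activated only when every inverted result is 1. The text does not give the exact gate arrangement of Fig. 3, so the function names and the AND-style combination of bits are assumptions.

```python
from typing import List


def xor_bit_compare(sram_bit: int, cam_bit: int) -> int:
    """XOR output is 1 when the SRAM (search) bit and the stored CAM bit differ."""
    return sram_bit ^ cam_bit


def row_match(search: List[int], row: List[int]) -> int:
    """Invert each XOR result (1 = bits agree) and require all of them to be 1
    before the row's match line is activated."""
    activation = [1 - xor_bit_compare(s, c) for s, c in zip(search, row)]
    return int(all(activation))


def scan_rows(search: List[int], cam_array: List[List[int]]) -> List[int]:
    """Scan every CAM row and return its match-line value (1 = match)."""
    return [row_match(search, row) for row in cam_array]


cam_array = [[0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
print(scan_rows([1, 1, 0, 0], cam_array))  # [0, 1, 0]: only row 1 matches
```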

3.4 ML SENSE AMPLIFIER

The Match Line (ML) amplifier has a single energy storage element and is therefore considered a first-order system. The ML scheme saves current and uses active feedback. The ML amplifiers are used to check the CAM data register and RAM data register values. The clock function controls the output of the matched row and is used to obtain the address of the contents. The operation of the CAM cell array centers on the match line process.
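Because the match line behaves as a first-order system, a single time constant describes its discharge. The sketch below assumes a simple RC model. The ML capacitance `c_ml` is precharged to `v_pre`. It discharges through one pull-down path of on-resistance `r_on` per mismatching cell. So V(t) = v_pre * exp(-t/tau), with tau = r_on * c_ml / k for k mismatches. The sense amplifier trips when V falls to `v_th`. All parameter values are hypothetical, and the model ignores leakage and the amplifier's feedback.

```python
import math


def ml_voltage(t: float, v_pre: float, r_on: float, c_ml: float, mismatches: int) -> float:
    """ML voltage at time t under a first-order RC discharge model.

    Each mismatching cell adds a pull-down of resistance r_on in parallel, so
    tau = r_on * c_ml / mismatches. With no mismatch there is no discharge
    path (leakage ignored) and the line stays at v_pre.
    """
    if mismatches == 0:
        return v_pre
    tau = r_on * c_ml / mismatches
    return v_pre * math.exp(-t / tau)


def time_to_threshold(v_pre: float, v_th: float, r_on: float, c_ml: float, mismatches: int) -> float:
    """Time for the ML to fall from v_pre to the sense threshold v_th."""
    if mismatches == 0:
        return math.inf
    tau = r_on * c_ml / mismatches
    return tau * math.log(v_pre / v_th)


# Hypothetical values: 1.0 V precharge, 0.5 V threshold, 10 kOhm, 50 fF.
t1 = time_to_threshold(1.0, 0.5, 10e3, 50e-15, mismatches=1)
t2 = time_to_threshold(1.0, 0.5, 10e3, 50e-15, mismatches=2)
print(f"{t1 * 1e9:.3f} ns, {t2 * 1e9:.3f} ns")  # 0.347 ns, 0.173 ns
```

In this model the one-mismatch case is the slowest to sense, so it sets the timing margin of the sense amplifier.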

In the proposed system, the CAM cell is based on an XOR logic gate, and the ML sense amplifier is used for the data matching process. The proposed system processes and searches the data at high speed, and classifiers are constructed using the proposed architecture. The scan based enable process is used to obtain the CAM cell array result. The XOR gate function matches the SRAM data against the CAM data, and its results are inverted. The ML sense amplifiers check the CAM data registers, and the clock function controls the output of the matched row to retrieve the content address. The advantages of the proposed method are:

The CAM consumes less power.

The design of the circuit is simple.

The selection process is very fast.

It is more accurate than the existing methods.

The distance is computed efficiently using Euclidean distance based matching.
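Euclidean-distance matching can be sketched as a nearest-entry search. The query is compared with every stored word, and the entry at the smallest distance wins. A distance of 0 is an exact match. For 0/1 vectors the squared Euclidean distance equals the number of differing bits (the Hamming distance). The tie-breaking rule (lowest address) and the example contents are assumptions.

```python
import math
from typing import List, Tuple


def euclidean(a: List[int], b: List[int]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_match(query: List[int], entries: List[List[int]]) -> Tuple[int, float]:
    """Return (address, distance) of the stored entry nearest to the query.

    Distance 0 is an exact CAM match; ties go to the lowest address.
    """
    distances = [euclidean(query, e) for e in entries]
    addr = min(range(len(entries)), key=lambda i: distances[i])
    return addr, distances[addr]


entries = [[0, 1, 0, 1], [1, 1, 0, 0], [1, 1, 1, 1]]
print(best_match([1, 1, 0, 0], entries))  # (1, 0.0): exact match at address 1
print(best_match([1, 1, 1, 0], entries))  # (1, 1.0): tie between 1 and 2, lowest wins
```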

Fig 3. Proposed Circuit Design using the XOR gate.

4. PERFORMANCE ANALYSIS

The performance of the proposed RAM-CAM memory design is compared with the existing NAND- or NOR-type CAM cell using the following metrics:

Processing Area

Power Consumption

Delay Time

4.1 PROCESSING AREA

The proposed RAM-CAM design reduces the overall size of the IC chip. The IC chip contains many transistors, and if the transistors occupy more space, the overall circuit becomes larger and more complicated. Hence, the area of the proposed method should be reduced. The area is calculated using the formula:

(1)

where LS denotes the length of a single transistor and total denotes the total number of transistors used in the proposed RAM-CAM circuit. If the area of a single transistor is reduced, the total area of the circuit design is reduced. The area occupied by the transistors in the existing method is 4.83 mm², and it is reduced to 0.32 mm² in the proposed CMOS-based RAM-CAM design.

4.2 POWER CONSUMPTION

Power is defined as the rate at which work is done, i.e., the energy consumed per unit time; equivalently, the work performed is the integral of power over time. The proposed RAM-CAM design reduces the power consumption: the existing system consumes 94 mW, whereas the proposed design consumes 70 mW. Along with the area, the power consumption is thus reduced by the proposed method. Fig 4 shows the power consumption graph, in which power is plotted against time. It shows that the proposed method consumes less power than the existing methods.
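Energy is the integral of power over time, so the energy drawn during a simulation can be estimated from a sampled power trace such as the one in Fig 4. The sketch uses the trapezoidal rule. The sample values are hypothetical and are not taken from the figure.

```python
from typing import List


def energy_from_power(times_s: List[float], power_w: List[float]) -> float:
    """Approximate E = integral of P(t) dt with the trapezoidal rule (joules)."""
    if len(times_s) != len(power_w):
        raise ValueError("times_s and power_w must have the same length")
    energy = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        energy += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy


# Hypothetical trace: a constant 70 mW for 10 ns.
print(energy_from_power([0.0, 5e-9, 10e-9], [0.070, 0.070, 0.070]))  # about 7e-10 J (0.7 nJ)
```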



Fig 4. Power Consumption graph

4.3 DELAY TIME

The proposed RAM-CAM method reduces the delay time, which is high in the existing methods. The delay achieved by the proposed method is 2.2 ns, whereas the existing method has a delay of 7 ns, so the delay for matching the contents in the CAM design is reduced.

The values of the metrics discussed above are listed in Table 1 for both the existing and the proposed methods. The table shows that the proposed CMOS-based RAM-CAM method achieves lower power, smaller area and shorter delay than the existing system.

The graph for the proposed RAM-CAM memory design is shown in Fig. 5, which depicts the input and output values with voltage plotted against time. The signals take binary values: a binary 0 is represented by 0 V and a binary 1 by +5 V. Five graphs are shown in Fig. 5. The last graph shows the input values taken for processing, each of which is a binary 0 or 1, and the first graph shows the sample value taken for processing. The matching input is retrieved from the address efficiently and within the specified time. The second graph represents the match line for the specified address of the memory. The third graph shows the processing values inside the memory; only the binary value 1 (i.e., +5 V) is taken for processing the input. The ML sense amplifiers sense the values from the contents of the memory, and the contents are retrieved from the memory faster than with traditional CAM methods. The simulation results are depicted in Fig. 5, which is a screenshot of the simulation of the proposed RAM-CAM method.

Fig 5. Voltage vs. Time

Fig. 6. Comparison of the existing (SCN-CAM) and proposed (RAM-CAM) systems in terms of area, power and delay time

From Fig. 6 it is clear that the proposed system has a smaller area, lower power consumption and lower delay than the existing methods.

Table 1. Comparison of the existing and proposed systems using the above metrics




METRICS            EXISTING SYSTEM    PROPOSED SYSTEM
SUPPLY VOLTAGE     1.8 V              5 V
AREA               4.83 mm²           0.320 mm²
POWER              94 mW              70 mW
DELAY TIME         7 ns               2.2 ns
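A few lines of arithmetic check the relative improvements implied by Table 1. The power-delay product (PDP) is a derived figure that the paper does not report. It is computed here only as an illustration. Like the other comparisons, it should be read with the different supply voltages in the table in mind.

```python
# Values reported in Table 1: (existing, proposed).
table1 = {
    "area_mm2": (4.83, 0.320),
    "power_mW": (94.0, 70.0),
    "delay_ns": (7.0, 2.2),
}


def percent_reduction(existing: float, proposed: float) -> float:
    return 100.0 * (existing - proposed) / existing


for metric, (old, new) in table1.items():
    print(f"{metric}: {percent_reduction(old, new):.1f}% lower")
# area_mm2: 93.4% lower
# power_mW: 25.5% lower
# delay_ns: 68.6% lower

# Derived power-delay product (mW x ns = pJ); not reported in the paper.
pdp_existing = 94.0 * 7.0   # 658 pJ
pdp_proposed = 70.0 * 2.2   # about 154 pJ
```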

5. CONCLUSION

The RAM-CAM architecture was designed using CMOS technology, and its performance was compared with NOR- and NAND-type CAM memory designs. The XOR-based CAM is designed for high efficiency, and the CAM cell is designed for low power. The CAM is suitable for low-power applications and for applications that require parallel look-up operations. The actual data of interest is stored in the SRAM, and a tag is associated with each block for reference. Simple and fast updates can be achieved without retraining the entire network. Non-uniform inputs, on which the application depends, result in higher power consumption but do not affect the accuracy of the final result. The proposed method can be used in the following applications:

CPU processors

ATM machine

Flash memories

Real Time Servers, etc.

REFERENCES

[1] H. Jarollahi, V. Gripon, N. Onizawa, and W. J. Gross,

"Algorithm and architecture for a low-power content-

addressable memory based on sparse clustered

networks," Very Large Scale Integration (VLSI) Systems,

IEEE Transactions on, vol. 23, pp. 642-653, 2015.

[2] S.-H. Yang, Y.-J. Huang, and J.-F. Li, "A low-power

ternary content addressable memory with Pai-Sigma

matchlines," Very Large Scale Integration (VLSI)

Systems, IEEE Transactions on, vol. 20, pp. 1909-1913,

2012.

[3] H. Jarollahi, N. Onizawa, V. Gripon, and W. J. Gross,

"Architecture and implementation of an associative

memory using sparse clustered networks," in IEEE

International Symposium on Circuits and Systems

(ISCAS), 2012, pp. 2901-2904.

[4] A.-T. Do, S. Chen, Z.-H. Kong, and K. S. Yeo, "A high

speed low power CAM with a parity bit and power-gated

ML sensing," Very Large Scale Integration (VLSI)

Systems, IEEE Transactions on, 2013.

[5] N. Onizawa, S. Matsunaga, V. C. Gaudet, W. J. Gross,

and T. Hanyu, "High-Throughput Low-Energy Self-

Timed CAM Based on Reordered Overlapped Search

Mechanism," IEEE Transactions on Circuits and

Systems I: Regular Papers, vol. 61, pp. 865-876, 2014.

[6] V. Gripon and C. Berrou, "Nearly-optimal associative

memories based on distributed constant weight codes,"

in Information Theory and Applications Workshop

(ITA), 2012, pp. 269-273.

[7] H. Jarollahi, N. Onizawa, V. Gripon, and W. J. Gross,

"Reduced-complexity binary-weight-coded associative

memories," in IEEE International Conference on

Acoustics, Speech and Signal Processing (ICASSP),

2013, pp. 2523-2527.

[8] S. Petit, R. Ubal, J. Sahuquillo, and P. Lopez, "Efficient

Register Renaming and Recovery for High-Performance

Processors," Very Large Scale Integration (VLSI)

Systems, IEEE Transactions on, 2014.

[9] H. Jarollahi, N. Onizawa, and W. J. Gross, "Selective

decoding in associative memories based on sparse-

clustered networks," in Global Conference on Signal

and Information Processing (GlobalSIP), 2013 IEEE,

2013, pp. 1270-1273.

[10] A. T. Do, C. Yin, K. S. Yeo, and T. T.-H. Kim, "Design

of a power-efficient CAM using automated background

checking scheme for small match line swing," in

Proceedings of the ESSCIRC (ESSCIRC), 2013,

pp. 209-212.

[11] H. Wong, V. Betz, and J. Rose, "Comparing FPGA vs.

custom CMOS and the impact on processor

microarchitecture," in Proceedings of the 19th

ACM/SIGDA international symposium on Field

programmable gate arrays, 2011, pp. 5-14.

[12] H. Jarollahi, V. Gripon, N. Onizawa, and W. J. Gross,

"A low-power content-addressable memory based on

clustered-sparse networks," in International Conference

on Application-Specific Systems, Architectures and

Processors (ASAP), 2013 IEEE 24th, 2013, pp. 305-308.

[13] Z. Ullah, K. Ilgon, and S. Baeg, "Hybrid partitioned

SRAM-based ternary content addressable memory,"

IEEE Transactions on Circuits and Systems I: Regular

Papers, vol. 59, pp. 2969-2979, 2012.

[14] Y. Zhang, W. Zhao, J.-O. Klein, D. Ravelsona, and C.

Chappert, "Ultra-high density content addressable

memory based on current induced domain wall motion

in magnetic track," IEEE Transactions on Magnetics,

vol. 48, pp. 3219-3222, 2012.

[15] A. X. Liu, C. R. Meiners, and E. Torng, "Packet

classification using binary Content Addressable

Memory," in INFOCOM, 2014 Proceedings IEEE, 2014,

pp. 628-636.



DESIGN AND IMPLEMENTATION OF 32 BIT ALU USING LOOK AHEAD CLOCK

GATING BASED ON FPGA

V. Prasanth (Ph.D.)1, M. Sri Manikyamba2

1Head of the Department-ECE, Pragati Engineering College, Surampalem (AP), India.

2Pragati Engineering College, Surampalem (AP) India. Email id: [email protected]

Abstract: Any type of digital architecture can be modified using VLSI technology. Power consumption plays an important role in any integrated circuit, and in digital systems clock gating is the most effective method to reduce it. This methodology is used in all types of real-world applications, and the technique enhances the internal architecture level. There are three gating methods. The most popular one is synthesis-based gating. Unfortunately, synthesis-based gating leaves the majority of the clock pulses driving the flip-flops redundant. A data-driven method stops most of these clock pulses and yields higher power savings, but its implementation is complex and application dependent. Auto-Gated Flip-Flops (AGFF), the third method, yield relatively low power savings. This paper introduces a novel Look-Ahead Clock Gating (LACG) method that combines all three gating methods. It computes the clock enabling signal of each flip-flop one cycle ahead of time, based on the present-cycle data of the flip-flops on which it depends. In a CPU, the ALU is one of the most commonly used modules, since it is employed during most instruction executions. Therefore, power consumption is a major concern in the ALU. This paper aims to optimize the ALU architecture for many digital applications and to improve its internal process with the look-ahead clock gating approach, reducing the delay and power of the data path PE unit in a 32-bit ALU architecture.

Keywords: ALU, LACG, Clock gating, dynamic power.

1. INTRODUCTION

In the early days, VLSI designers were mainly interested in circuit area, performance, reliability and cost, and power consumption was only a minor consideration. Nowadays, power is given importance equal to area and speed. As technology scales down, dynamic power dissipation is becoming comparable with both short-circuit and leakage power. Identifying and modifying the various leakage and switching components is essential for estimating and reducing power consumption in high-speed and low-power applications. Clock gating is the most effective method to reduce power consumption, and it is applied at all levels: system architecture, logic design, block design and gates. The predefined enabling signals are ANDed with the clock to produce the gated clock. There are three gating methods. The first is synthesis-based gating, which derives the clock enabling signals from the logic of the underlying system. Unfortunately, it leaves the majority of the clock pulses driving the flip-flops redundant. The second is the data-driven method, which stops most of those clock pulses and achieves higher power savings, but it is application dependent and its implementation is highly complex. The third method is the auto-gated flip-flop, which is simple but achieves relatively small power savings. In synthesis-based clock gating, the functional blocks and modules that do not need to be clocked are identified, because the clock enabling signals are well understood at the system level and can be effectively defined and captured in the design. To address the redundancy left by synthesis-based gating, data-driven clock gating was proposed for flip-flops: if the state of a flip-flop will not change in the next clock cycle, the clock signal driving it is disabled. However, it suffers from a very short time window in which the gating circuitry works correctly, because the combined delay of the XOR, OR and AND gates and the latch must not exceed the setup time of the flip-flop. Another difficulty of data-driven clock gating is its design methodology. The low-power look-ahead clock gating method combines all three gating methods. The system clock signal is one of the main dynamic power consumers in computing and electronic products and is typically responsible for 30-70% of the total dynamic switching power consumption. Several techniques have been introduced to reduce dynamic power, of which clock gating is the primary one. Look-ahead clock gating computes the clock enabling signal of each flip-flop one cycle ahead of time, based on the present-cycle data of the flip-flops on which it depends. It allots a full clock cycle for computing and propagating the enabling signals, thereby avoiding the tight timing constraints of the auto-gated and data-driven methods. The look-ahead clock gating method introduced here is based on the auto-gated flip-flop. This paper compares look-ahead, power gating and data-driven clock gating, and the comparison shows that look-ahead gating consumes less power than data-driven gating.
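The data-driven gating rule described above can be sketched at the behavioral level. The rule disables the clock of a flip-flop whose next state equals its present state. The model checks only functionality and pulse count. It does not capture the tight timing window of the XOR-latch-AND path, which is exactly the difficulty that look-ahead gating addresses. The function name and the input stream are hypothetical.

```python
from typing import List, Tuple


def simulate_data_driven_ff(d_stream: List[int], q0: int = 0) -> Tuple[List[int], int]:
    """Simulate one flip-flop under data-driven clock gating.

    At each clock edge the gating logic XORs D with the current Q; the edge
    reaches the flip-flop only when they differ, since capturing an equal
    value would not change Q. Returns the Q sequence and the pulses used.
    """
    q = q0
    q_trace = []
    pulses = 0
    for d in d_stream:
        enable = d ^ q          # 1 only when the FF would change state
        if enable:
            pulses += 1
            q = d               # gated clock edge: FF captures D
        q_trace.append(q)
    return q_trace, pulses


q, pulses = simulate_data_driven_ff([0, 0, 1, 1, 1, 0, 0, 1])
print(q, pulses)  # [0, 0, 1, 1, 1, 0, 0, 1] 3: same outputs with 3 of 8 pulses
```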

2. ARCHITECTURE



The ALU architecture is mainly used to perform arithmetic and logical operations; the arithmetic part consists of the adder, subtractor and multiplier units. The mux input section changes the output based on the selection process, which is used to modify the output unit. The ALU block optimizes the internal architecture, which supports three arithmetic operations and five logical operations. The buffer stores the output of each ALU operation with its delay time and checks the required selection signal. The mux stores the mux inputs, selects the output according to the input selection signal, and displays the output result.

Figure 1. Block Diagram of ALU Architecture

3. MODULES

3.1 SELECTION MUX ARCHITECTURE

The input bits are applied to the ALU circuit. An 8:1 mux is used to select the output data bits of the arithmetic or logical operation. The mux receives the output data bits of the whole internal ALU architecture, and the look-ahead clock gating technique is used to optimize the selection processing time. The mux input section changes the output based on the selection process, which is used to modify the output unit.

3.2 ALU ARCHITECTURE

The ALU architecture is mainly used to perform arithmetic and logical operations. The arithmetic part consists of the adder, subtractor and multiplier units, and the logical part consists of the AND, OR, XOR and inverter gates. The architecture is designed using a structural methodology. The ALU architecture is used for all types of core processing, and its internal architecture is optimized.
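A behavioral sketch of the ALU organization described in this section is given below. Every unit produces a result, and an 8:1 mux selects one of them. The select encoding is hypothetical, and multiplication keeps only the low 32 bits. The text names four logic gates but five logical operations, so the eighth mux input is assumed to be a left shift by one (the conclusion mentions shifting).

```python
from typing import Callable, Dict

MASK32 = 0xFFFF_FFFF

# Hypothetical select encoding for the 8:1 output mux; the paper does not give one.
OPS: Dict[int, Callable[[int, int], int]] = {
    0b000: lambda a, b: (a + b) & MASK32,    # adder
    0b001: lambda a, b: (a - b) & MASK32,    # subtractor (two's complement wrap)
    0b010: lambda a, b: (a * b) & MASK32,    # multiplier, low 32 bits (assumption)
    0b011: lambda a, b: a & b,               # AND
    0b100: lambda a, b: a | b,               # OR
    0b101: lambda a, b: a ^ b,               # XOR
    0b110: lambda a, b: ~a & MASK32,         # inverter (operand A only)
    0b111: lambda a, b: (a << 1) & MASK32,   # assumed fifth logical op: shift left by 1
}


def alu32(a: int, b: int, select: int) -> int:
    """Compute every unit's result, then let the 8:1 mux pick one by `select`."""
    unit_outputs = [OPS[s](a, b) for s in range(8)]   # all units active (no gating)
    return unit_outputs[select & 0b111]


print(hex(alu32(3, 5, 0b001)))  # 0xfffffffe: 3 - 5 wraps in 32 bits
```

Clock gating targets exactly this cost of computing all eight results every cycle. The look-ahead technique in Section 3.4 clocks only the unit whose result the mux will select.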

3.3 ALU INTERNAL BLOCK

The ALU block optimizes the internal architecture, which supports three arithmetic operations and five logical operations. The arithmetic operations take more time because of their long path delay, which reduces the speed, so the modification mainly focuses on the carry propagation process. We therefore use a carry-select adder architecture in the addition operation and the multiplier operation.
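The carry-select principle can be sketched behaviorally. Each 4-bit block computes its sum twice, once assuming a carry-in of 0 and once assuming 1. The real carry from the previous block only drives the selection. The block size is an assumption, since the paper does not state one. The block adder itself is modeled with ordinary integer addition rather than gates, and the multiplier that the text says also uses this adder is not modeled.

```python
from typing import Tuple


def block_add(a: int, b: int, cin: int, width: int) -> Tuple[int, int]:
    """Add one `width`-bit block (behavioral); returns (sum, carry_out)."""
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a: int, b: int, width: int = 32, block: int = 4) -> Tuple[int, int]:
    """Carry-select adder: each block precomputes its sum for carry-in 0 and 1
    in parallel; the incoming carry only selects one of them (a 2:1 mux), so
    it does not ripple through the bits inside each block."""
    if width % block:
        raise ValueError("width must be a multiple of block")
    mask = (1 << block) - 1
    result, carry = 0, 0
    for i in range(0, width, block):
        a_blk, b_blk = (a >> i) & mask, (b >> i) & mask
        sum0, c0 = block_add(a_blk, b_blk, 0, block)   # speculative: carry-in 0
        sum1, c1 = block_add(a_blk, b_blk, 1, block)   # speculative: carry-in 1
        blk_sum, carry = (sum1, c1) if carry else (sum0, c0)
        result |= blk_sum << i
    return result, carry


print(carry_select_add(0xFFFF_FFFF, 1))  # (0, 1): the sum wraps with carry-out 1
```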

3.4 LOOK AHEAD CLOCK GATING

The look-ahead clock gating circuit takes an input clock signal and generates a gated clock based on a control signal. The gated clock signal is used to activate the arithmetic, logic or shift unit. It prevents unnecessary charging and discharging of the clock signal in inactive modules, which leads to lower dynamic power dissipation. Look-ahead clock gating is a power-down methodology that selectively clocks modules as and when required while keeping the other, inactive modules in sleep mode.
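At the module level described here, look-ahead gating can be sketched as follows. While one operation executes, the unit needed by the next operation is decoded, since that operation is already known one cycle early. Its enable is stored in a register, so in the following cycle only that unit's clock is passed. The operation-to-unit mapping and the instruction stream are hypothetical. This is a simplification of the flip-flop-level scheme in the introduction, where each enable is derived from the present-cycle data of the flip-flops it depends on.

```python
from typing import Dict, List

# Hypothetical mapping of operations to the functional unit that computes them.
UNIT_OF = {"ADD": "arith", "SUB": "arith", "MUL": "arith",
           "AND": "logic", "OR": "logic", "XOR": "logic", "NOT": "logic",
           "SHL": "shift"}
UNITS = ("arith", "logic", "shift")


def lookahead_gated_pulses(ops: List[str]) -> Dict[str, int]:
    """Count the clock pulses each unit receives with look-ahead gating.

    During cycle t the enable for cycle t+1 is decoded from the operation
    that will execute next and stored in a register, giving it a full cycle.
    In cycle t+1 each unit's clock is ANDed with its registered enable.
    """
    pulses = {u: 0 for u in UNITS}
    enable_reg = {u: u == UNIT_OF[ops[0]] for u in UNITS} if ops else {}
    for t in range(len(ops)):
        for u in UNITS:                  # gated clock = clock AND enable_reg
            if enable_reg[u]:
                pulses[u] += 1
        if t + 1 < len(ops):             # look ahead: prepare next cycle's enables
            nxt = UNIT_OF[ops[t + 1]]
            enable_reg = {u: u == nxt for u in UNITS}
    return pulses


ops = ["ADD", "AND", "AND", "MUL", "XOR", "SHL"]
print(lookahead_gated_pulses(ops))  # {'arith': 2, 'logic': 3, 'shift': 1}: 6 pulses vs 18 ungated
```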

Figure 2. Block Diagram of Look Ahead Clock Gating



4. EXPERIMENTAL RESULTS

Figure 3. Simulation Results

Figure 4. RTL Diagram of 32 Bit ALU

Figure 5. Technology Diagram of 32 Bit ALU

Figure 6. Power Analysis of 32 Bit ALU Using Clock



Figure 7. Power Analysis of 32 Bit ALU Using Look Ahead

Clock Gating Technique

Comparison Table:

METRIC               32-BIT ALU WITH CLOCK    32-BIT ALU WITH LOOK AHEAD CLOCK GATING
BIO COUNT            197                      131
DELAY TIME (ns)      1.050                    1.005
LATENCY TIME (ns)    0.672-1.149              0.619-1.253
TOTAL POWER (mW)     233                      184
DYNAMIC POWER (mW)   147                      98

Table. Comparison between Clock and Look Ahead Clock

Gating
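Simple arithmetic cross-checks the figures in the table. Assume that total power is the sum of dynamic and static power (the table does not state this). Then both designs have the same 86 mW of non-dynamic power, so the entire 49 mW reduction comes from dynamic power. That is the component clock gating is expected to affect.

```python
# Values from the comparison table: (ALU with clock, ALU with look-ahead clock gating).
total_mw = (233.0, 184.0)
dynamic_mw = (147.0, 98.0)
delay_ns = (1.050, 1.005)


def percent_reduction(before: float, after: float) -> float:
    return 100.0 * (before - after) / before


# Assuming total = dynamic + static, the static (non-dynamic) share is unchanged.
static_mw = tuple(t - d for t, d in zip(total_mw, dynamic_mw))   # (86.0, 86.0)

print(f"total power:   {percent_reduction(*total_mw):.1f}% lower")    # 21.0% lower
print(f"dynamic power: {percent_reduction(*dynamic_mw):.1f}% lower")  # 33.3% lower
print(f"delay:         {percent_reduction(*delay_ns):.1f}% lower")    # 4.3% lower
```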

5. CONCLUSION

In this paper, a low-power look-ahead clock gating technique is presented and compared with the previous clock gating technique, i.e., data-driven clock gating, and with the conventional (ungated) clock. It is very useful in reducing dynamic power. One of the major sources of power consumption in digital circuits is the system clock signal, which contributes a large share of the power consumption. This look-ahead clock gating technique is therefore very useful for reducing power consumption in digital systems. It computes the clock enabling signal of each flip-flop one cycle ahead of time, based on the present-cycle data of the flip-flops, and overcomes the drawbacks of the three gating methods. The results show lower power consumption than data-driven clock gating. In this paper, a 32-bit ALU is designed and implemented on a Xilinx FPGA using VHDL. The ALU is the part of a computer that performs all arithmetic operations, such as addition, subtraction, decrement, increment and shifting, and all kinds of basic logical operations.

