

Increasing Memory Performance Using Cache Optimizations in Chip Multiprocessors

Archana K.V
Dept. of Computer Science and Engineering
BTL Institute of Technology
Bangalore, India
[email protected]

Abstract: The processor-memory bandwidth in current-generation processors is a critical bottleneck, because a number of processor cores contend for the same bus and processor-memory interface. Caches also consume a significant amount of energy in current microprocessors, so designing an energy-efficient microprocessor requires optimizing cache energy consumption. Effective utilization of this resource is consequently an important aspect of memory hierarchy design for multi-core processors. This is currently an active field of research, and a large number of studies have suggested techniques to address the problem. The main contribution of this paper is an assessment of the effectiveness of some of the techniques that have been implemented in recent chip multiprocessors. Cache optimization techniques that were designed for single-core processors but have not been implemented in multi-core processors are also examined to forecast their effectiveness.

Keywords: On-chip cache hierarchy; cache optimizations

1. INTRODUCTION

The on-chip memory and its efficient usage in multi-core processors is the prime focus of this paper. With the increasing number of cores on a single chip, this subsystem determines the overall memory performance and therefore the performance of the applications running on these systems. The workload running on such systems is a mix of multiple programs. The overall performance must consequently be judged not only from the throughput of multiple independent programs but also from the performance of programs made up of multiple parallel processes running on multiple cores of the same chip. The on-chip cache hierarchy needs to be designed with the best feasible configuration and optimizations to achieve this goal. Portable computing applications have evolved from conventional low-performance products such as wristwatches and calculators to high-throughput, computation-intensive products such as notebook computers and cellular phones. These portable computing applications demand high speed but also low energy consumption, because for such products longer battery life translates to extended use and better marketability. This paper presents a case study of performance and power trade-offs in designing on-chip caches for the microprocessors used in portable computing applications. Early cache studies focused primarily on improving performance. Studies of cache access times and miss rates for different cache parameters (e.g. cache size, block size, and degree of set associativity) of single-level caches can be found in [5,8]. Corresponding studies focusing on multi-level cache organizations can be found in [6,7]. Studies of how instruction set design affects cache performance and power consumption can be found in [1,3]. This paper consists of five sections. Section 2 briefly describes the cache performance and energy models used in this study. Section 3 presents several experimental cache organizations designed either to improve performance or to save energy. Section 4 shows the experimental results of this study. Finally, concluding remarks are offered in Section 5.

Fig.1 Block diagram of on-chip memory hierarchy in CMPs


2. ANALYTICAL MODELS FOR ON-CHIP CACHES

A typical cache can be separated into three distinct components: the address decoding path, the cell arrays, and the I/O path. The address decoding path includes the address buses and the address decoding logic. The cell arrays include the read/write circuitry, the tag arrays, and the data arrays. The I/O path includes the I/O pads and the buses that link the address and data buses. The on-chip cache cycle time is computed using an analytical model described in [6,14] (which was based on the access time model of Wada et al. in [13]). This timing model, based on 0.8 μm CMOS technology, gives the cache cycle time (i.e. the minimum time required between the start of two accesses) and the cache access time (i.e. the minimum time between the start and end of a single access) in terms of cache size, block size, and associativity. A key characteristic of this timing model is that it uses SPICE parameters to predict the delays due to the address decoder, word-line driver, pre-charged bit lines, sense amplifiers, data bus driver, and data output drivers.

The average time for an off-chip cache access is computed from the average off-chip access and transfer times, rounded up to the next multiple of the on-chip cycle time.
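As a minimal illustration of how these timing quantities might be combined, the sketch below rounds a hypothetical off-chip latency up to a multiple of the on-chip cycle time and folds it into an average access time. All parameter values, names, and the averaging formula are illustrative assumptions, not figures taken from the model in [6,14].

#include <math.h>
#include <stdio.h>

/* Round an off-chip latency up to the next multiple of the on-chip cycle time,
 * as described in the text. All values below are illustrative only. */
static double round_to_cycle(double latency_ns, double cycle_ns) {
    return ceil(latency_ns / cycle_ns) * cycle_ns;
}

int main(void) {
    double on_chip_cycle_ns = 5.0;   /* hypothetical on-chip cache cycle time   */
    double off_chip_raw_ns  = 42.0;  /* hypothetical off-chip access + transfer */
    double miss_rate        = 0.05;  /* hypothetical on-chip miss rate          */

    double off_chip_ns = round_to_cycle(off_chip_raw_ns, on_chip_cycle_ns);

    /* Average memory access time = hit time + miss rate * miss penalty. */
    double avg_ns = on_chip_cycle_ns + miss_rate * off_chip_ns;

    printf("off-chip (rounded): %.1f ns, average access: %.2f ns\n",
           off_chip_ns, avg_ns);
    return 0;
}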

The on-chip cache energy expenditure is based on an abstract model that considers only those cache components that dominate overall cache power consumption. In the address decoding path, the capacitance of the decoding logic is generally less than that of the address bus, so the energy expenditure of the address buses dominates the total energy consumption of the address decoding path. In the cell arrays, the read/write circuitry generally does not consume much power; most of the energy consumed in the cell arrays is due to the tag and data arrays. The tag and data arrays in established cache designs can be implemented in dynamic or static logic. In a dynamic circuit design, the word/bit lines are generally pre-charged before they are accessed, and the energy consumed by the pre-charged word/bit lines normally dominates the overall energy consumption in the cell arrays. In a static circuit design, there are no pre-charges on the word/bit lines, and the energy expenditure of the tag and data arrays depends directly on the bit-switching activity of the bit lines. In the I/O path, most energy is consumed during bit switches of the I/O pads.
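The following sketch captures the spirit of this abstract energy model: only the dominant terms (address-bus switching, pre-charged bit lines in the arrays, and I/O pad switching) are counted. Every coefficient and name is a hypothetical placeholder rather than a value or interface from the literature.

#include <stdint.h>
#include <stdio.h>

/* Abstract per-access cache energy estimate keeping only the dominant terms
 * described in the text. Coefficients are arbitrary relative units. */

static int bit_transitions(uint32_t prev, uint32_t curr) {
    return __builtin_popcount(prev ^ curr);   /* bus bits that toggle */
}

typedef struct {
    double e_addr_bit;           /* energy per toggled address-bus bit */
    double e_bitline;            /* energy per pre-charged bit line    */
    double e_io_bit;             /* energy per toggled I/O pad bit     */
    int    bitlines_per_access;  /* tag + data bit lines driven        */
} energy_model;

static double access_energy(const energy_model *m,
                            uint32_t prev_addr, uint32_t addr,
                            uint32_t prev_data, uint32_t data) {
    double e  = m->e_addr_bit * bit_transitions(prev_addr, addr); /* decoding path */
    e        += m->e_bitline  * m->bitlines_per_access;           /* cell arrays   */
    e        += m->e_io_bit   * bit_transitions(prev_data, data); /* I/O path      */
    return e;
}

int main(void) {
    energy_model m = { 1.0, 0.5, 2.0, 256 };  /* hypothetical coefficients */
    double e = access_energy(&m, 0x1000, 0x1004, 0xDEADBEEF, 0xDEADBEF0);
    printf("estimated energy per access: %.1f units\n", e);
    return 0;
}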

2.1 OPTIMIZATIONS IMPLEMENTED SUCCESSFULLY

A number of cache optimization techniques that were implemented in single-core processors have been successfully carried over to multi-core processors. A multi-level cache, in its modern two-level form, has been implemented since the very first multi-core processors, as visualized in Fig. 1. In this form, the first-level cache is private to each core and coherence is maintained among the cores with the MESI or MOESI protocols (Villa, F.J., et al., 2005). The second-level cache has been implemented with different design choices in several architectures. In general, the second-level cache is shared among all cores, with a number of optimizations discussed in this section. One of the major innovations in the design of the second-level cache is the NUCA (Non-Uniform Cache Architecture) cache (Kim, C., et al., 2003). The motivation for the NUCA organization is that the second-level cache is made much larger than the first level to fulfill the design requirements of a multi-level cache, which results in a slower access time as the cache size grows. This problem is addressed by dividing the cache into banks: the context of a particular core is kept in a bank physically closer to it, improving access speed. A number of NUCA variants have been developed over the last few years, with many innovations implemented in current-generation processors.
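A minimal sketch of the NUCA idea follows, under the simplifying assumptions of a statically interleaved bank mapping and an access latency that grows linearly with the distance between the requesting core and the bank. Both choices, and all constants, are illustrative and are not the design of Kim et al.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* The shared L2 is split into banks; the latency seen by a core grows with
 * the distance to the bank holding the line. All numbers are hypothetical. */

#define NUM_BANKS    8
#define BASE_LATENCY 10   /* cycles for the closest bank        */
#define HOP_LATENCY   2   /* extra cycles per hop between banks */

/* Simple static mapping: interleave cache lines across banks. */
static int bank_of(uint64_t addr, int line_size) {
    return (int)((addr / line_size) % NUM_BANKS);
}

/* Latency depends on how far the bank is from the requesting core. */
static int access_latency(int core_bank, int target_bank) {
    return BASE_LATENCY + HOP_LATENCY * abs(core_bank - target_bank);
}

int main(void) {
    uint64_t addr = 0x4AB40;
    int line_size = 64;
    int core_position = 0;                 /* bank adjacent to this core */
    int bank = bank_of(addr, line_size);
    printf("address 0x%llx -> bank %d, latency %d cycles\n",
           (unsigned long long)addr, bank,
           access_latency(core_position, bank));
    return 0;
}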

3. EXPERIMENTAL CACHE ORGANIZATIONS

3.1 Conventional Designs

Conventional cache designs include direct-mapped and set-associative organizations. A set-associative cache generally has a better hit rate than a direct-mapped cache of the same size, although its access time is usually higher than that of the direct-mapped cache. The number of bit-line switches in a set-associative cache is normally larger than in a direct-mapped cache, but the energy consumption of each bit line in a set-associative cache is generally less than that in a direct-mapped cache of the same size.
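The lookup difference between the two conventional organizations can be illustrated with the sketch below: a direct-mapped cache compares one tag per access, while an N-way set-associative cache of the same capacity compares N tags in a smaller number of sets. The sizes, associativity, and structure names are assumptions chosen only for illustration.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE  64
#define NUM_LINES 512           /* total lines in either organization */
#define WAYS        4           /* associativity of the second design */

typedef struct { bool valid; uint64_t tag; } line_t;

static line_t dm_cache[NUM_LINES];              /* direct-mapped: 512 sets x 1 way */
static line_t sa_cache[NUM_LINES / WAYS][WAYS]; /* set-associative: 128 sets x 4 ways */

static bool dm_lookup(uint64_t addr) {
    uint64_t block = addr / LINE_SIZE;
    uint64_t set   = block % NUM_LINES;
    uint64_t tag   = block / NUM_LINES;
    return dm_cache[set].valid && dm_cache[set].tag == tag;   /* one comparison */
}

static bool sa_lookup(uint64_t addr) {
    uint64_t block = addr / LINE_SIZE;
    uint64_t set   = block % (NUM_LINES / WAYS);
    uint64_t tag   = block / (NUM_LINES / WAYS);
    for (int w = 0; w < WAYS; w++)                             /* WAYS comparisons */
        if (sa_cache[set][w].valid && sa_cache[set][w].tag == tag)
            return true;
    return false;
}

int main(void) {
    printf("DM hit: %d, SA hit: %d\n", dm_lookup(0x1234), sa_lookup(0x1234));
    return 0;
}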

3.2 Cache Designs for Low Power

This paper investigates three different cache design approaches to achieving low power: vertical cache partitioning, horizontal cache partitioning, and Gray code addressing.

Vertical Cache Partitioning

The fundamental idea of vertical cache partitioning is to reduce the capacitance driven by each cache access by deepening the on-chip cache hierarchy (e.g. two-level caches). Accessing a smaller cache consumes less power because a smaller cache has a lower load capacitance. We use block buffering as an example of this approach. A fundamental structure of a block buffered cache [1] is presented in Figure 1. The block buffer itself is, in effect, another cache that is closer to the processor than the on-chip caches. The processor first determines whether there is a block hit (i.e. the currently accessed data is located in the same block as the most recently accessed data). If it is a hit, the data is read directly from the block buffer and the cache is not accessed; the cache is operated only if there is a block miss. A block buffered cache therefore saves power by reducing the cost of each cache access. The effectiveness of block buffering strongly depends on the spatial locality of the application and on the block size. The higher the spatial locality of the access pattern (e.g. an instruction sequence), the larger the amount of energy that can be saved by block buffering. The block size is also very important: aside from its effect on the cache hit rate, a small block may limit the amount of energy saved by the block buffered cache, while a large block may increase unnecessary energy consumption due to unused data in the block.
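A minimal sketch of block buffering follows, assuming a single-entry buffer and an illustrative block size (neither taken from [1]); a sequential access stream mostly hits the buffer and leaves the cache arrays idle.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* The most recently accessed block is kept in a small buffer in front of the
 * cache; if the next access falls in the same block, the cache arrays are not
 * activated at all. Block size and statistics are illustrative. */

#define BLOCK_SIZE 32

static struct {
    bool     valid;
    uint64_t block_addr;          /* address of the buffered block */
} block_buffer = { false, 0 };

static unsigned cache_accesses, buffer_hits;

/* Returns true if the access was satisfied by the block buffer. */
static bool access_memory(uint64_t addr) {
    uint64_t block = addr / BLOCK_SIZE;
    if (block_buffer.valid && block_buffer.block_addr == block) {
        buffer_hits++;            /* block hit: cache not operated, energy saved */
        return true;
    }
    cache_accesses++;             /* block miss: access the cache, refill buffer */
    block_buffer.valid = true;
    block_buffer.block_addr = block;
    return false;
}

int main(void) {
    /* A sequential access pattern (high spatial locality) mostly hits the buffer. */
    for (uint64_t a = 0; a < 256; a += 4)
        access_memory(a);
    printf("buffer hits: %u, cache accesses: %u\n", buffer_hits, cache_accesses);
    return 0;
}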

Horizontal Cache Partitioning

The primary idea of the horizontal cache partitioning approach is to partition the cache data memory into several segments, each of which can be powered up individually. Cache sub-banking, proposed in [11], is one horizontal cache partitioning technique; it partitions the data array of a cache into several banks (called cache sub-banks), each of which can be accessed (powered up) separately. Only the cache sub-bank in which the requested data is located consumes power on each cache access. A basic structure for cache sub-banking is presented in the accompanying figure. Cache sub-banking saves power by eliminating unnecessary accesses, and the amount of power saved depends on the number of cache sub-banks: more sub-banks save more power. One advantage of cache sub-banking over block buffering is that the effective cache hit time of a sub-banked cache can be as low as that of a conventional performance-driven cache, since the sub-bank selection logic is generally very simple and can be hidden in the cache index decoding logic. Because it maintains cache performance, cache sub-banking is very attractive to computer architects designing energy-efficient, high-performance microprocessors.
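The sketch below illustrates cache sub-banking under the assumption that the sub-bank is selected from low-order word-address bits, so the selection folds into normal index decoding; the bank count and word size are illustrative, not values from [11].

#include <stdint.h>
#include <stdio.h>

/* The data array is split into sub-banks and only the sub-bank containing the
 * requested word is powered up on an access. Sizes are illustrative. */

#define NUM_SUBBANKS 4
#define WORD_SIZE    4            /* bytes */

static unsigned subbank_activations[NUM_SUBBANKS];

static int select_subbank(uint64_t addr) {
    /* Use word-granularity address bits to choose the sub-bank. */
    return (int)((addr / WORD_SIZE) & (NUM_SUBBANKS - 1));
}

static void cache_access(uint64_t addr) {
    int bank = select_subbank(addr);
    subbank_activations[bank]++;   /* only this sub-bank consumes dynamic power */
}

int main(void) {
    for (uint64_t a = 0; a < 64; a += WORD_SIZE)
        cache_access(a);
    for (int b = 0; b < NUM_SUBBANKS; b++)
        printf("sub-bank %d powered up %u times\n", b, subbank_activations[b]);
    return 0;
}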

Gray Code Addressing

Memory addressing in a traditional processor design usually uses a 2's complement (binary) representation, so the bit switching on the address buses when accessing consecutive memory locations is not optimal. Since a significant amount of energy is consumed on the address buses, and sequential memory accesses are frequent in applications with high spatial locality, it is essential to minimize the bit-switching activity of the address buses for low-power caches. Gray code addressing achieves this by ensuring that consecutive addresses differ in only one bit.
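A short sketch comparing address-bus toggles under binary and Gray code addressing for a sequential access stream is given below; the address range is arbitrary.

#include <stdint.h>
#include <stdio.h>

/* Consecutive addresses differ in exactly one bit under Gray encoding, so
 * sequential accesses toggle far fewer address lines than binary addressing. */

static uint32_t to_gray(uint32_t b) {
    return b ^ (b >> 1);            /* standard binary-to-Gray conversion */
}

static int transitions(uint32_t prev, uint32_t curr) {
    return __builtin_popcount(prev ^ curr);   /* address lines that switch */
}

int main(void) {
    unsigned binary_toggles = 0, gray_toggles = 0;

    /* Count bus transitions for a run of sequential addresses. */
    for (uint32_t a = 1; a < 256; a++) {
        binary_toggles += transitions(a - 1, a);
        gray_toggles   += transitions(to_gray(a - 1), to_gray(a));
    }
    printf("binary addressing: %u toggles, Gray addressing: %u toggles\n",
           binary_toggles, gray_toggles);
    return 0;
}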

4. RESULTS AND DISCUSSION

4.1 Proposed Cache Optimizations

A number of cache optimization techniques were successfully implemented in single-core processors or single-core multiprocessors but have not yet been attempted in multi-core processors. Some of these techniques are discussed in this section, with a prediction of their effectiveness in multi-core processors.

4.2 Ineffective Cache Optimizations

The optimization techniques introduced in Section 4.1 need to be implemented in order to determine their effectiveness. A few optimizations have been tried in multi-core processors and were found to be ineffective; as more optimizations are tested, more such techniques may turn out not to be effective for multi-core processors. The following paragraphs give a brief account of the tested techniques that were not successful in CMPs.

Cache affinity is a policy decision taken by the operating system when scheduling processes on particular cores. The decision is based on the observation that a process whose context is in a cache is expected to reuse that content as a result of temporal locality. After a context switch, when the process is rescheduled, it is allocated to the same processor, on the assumption that its context may still be present in the cache, reducing compulsory (cold-start) misses. This scheme has improved performance in conventional multiprocessors (SMPs). An investigation of this scheme in multi-core processors, summarized in (Kazempour, et al., 2008), observed that the performance improvement in multi-core uniprocessors (CMPs) is not significant, although the improvement is good in the case of multi-core multiprocessors (SMPs built from CMPs).
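A minimal sketch of an affinity-aware scheduling decision is shown below, with hypothetical process and core structures that greatly simplify what a real operating system tracks.

#include <stdio.h>

/* When a process is rescheduled after a context switch, prefer the core it
 * last ran on, in the hope that its working set is still in that core's
 * private cache. Structures and state are hypothetical simplifications. */

#define NUM_CORES 4

typedef struct {
    int pid;
    int last_core;      /* core the process last ran on, -1 if never scheduled */
} process;

static int core_is_idle[NUM_CORES] = { 1, 1, 0, 1 };   /* illustrative state */

static int pick_core(const process *p) {
    /* Affinity first: reuse the previous core if it is available. */
    if (p->last_core >= 0 && core_is_idle[p->last_core])
        return p->last_core;
    /* Otherwise fall back to any idle core (cold caches expected). */
    for (int c = 0; c < NUM_CORES; c++)
        if (core_is_idle[c])
            return c;
    return -1;          /* no idle core: caller must queue the process */
}

int main(void) {
    process p = { .pid = 42, .last_core = 3 };
    printf("process %d scheduled on core %d\n", p.pid, pick_core(&p));
    return 0;
}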

5. CONCLUSION AND FUTURE DIRECTIONS

This paper forms part of a guideline for future work by researchers interested in optimizing the memory hierarchy of scalable multi-core processors, as it presents a survey of such techniques proposed in recent publications, together with comments on their effectiveness. A summary of all the optimization techniques discussed in this paper is presented in Table 1. The effect of operating system mechanisms and policies on the memory hierarchy, especially the on-chip cache hierarchy, is another direction of research that can be explored. High coherence traffic gives rise to congestion at the first-level cache; directory-based coherence protocols may reduce the overall coherence traffic, but at the cost of maintaining the directory and keeping it updated. These and other research directions will be explored in future work.

6. REFERENCES

[1] Chang and Sohi (2006), "Cooperative Caching for Chip Multiprocessors", Proceedings of the 33rd Annual International Symposium on Computer Architecture, p. 264-276.

[2] Chen and Kandemir (2008), "Code Restructuring for Improving Cache Performance in MPSoCs", IEEE Transactions on Parallel and Distributed Systems, Vol. 19, No. 9, p. 1201-1214.

[3] Dybdahl, H. and Stenström, P. (2007), "An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors", Proceedings of the IEEE 13th International Symposium on High Performance Computer Architecture, p. 2-12.

[4] Dybdahl, H. and Stenström, P. (2006), "Enhancing Last-Level Cache Performance by Block Bypassing and Early Miss Determination", Asia-Pacific Computer Systems Architecture Conference (ACSAC), LNCS 4186, p. 52-66.

[5] Core Systems, Proceedings of the 42nd International Symposium on Microarchitecture (MICRO), p. 327-336.