Summary of "Simultaneous Multithreading: Maximizing On-Chip Parallelism"
TRANSCRIPT
COMPUTER ARCHITECTURE
BATCH 2012
Assignment title
“Summary of Paper”
BY
FARWA ABDUL HANNAN
(12-CS-13)
&
ZAINAB KHALID
(12-CS-33)
Date of Submission: Wednesday, 11 May, 2016
NFC – INSTITUTE OF ENGINEERING AND FERTILIZER
RESEARCH, FSD
Simultaneous Multithreading: Maximizing On-Chip Parallelism
Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98195
______________________________________________________________________________
1. Introduction
The paper examines simultaneous multithreading, a technique that allows several independent threads to issue instructions to multiple functional units in each cycle. The objective of simultaneous multithreading is to increase processor utilization in the face of both long memory latencies and limited available parallelism per thread.
This study evaluates the potential
improvement, relative to wide superscalar
architectures and conventional multithreaded
architectures, of various simultaneous
multithreading models.
The results show the limits of superscalar execution and traditional multithreading in increasing instruction throughput in future processors.
2. Methodology
The main goal is to evaluate several architectural alternatives in order to examine simultaneous multithreading. To this end, a simulation environment has been developed that defines the implementation of a simultaneous multithreaded architecture; that architecture is an extension of next-generation wide superscalar processors.
2.1 Simulation Environment
The simulator uses emulation-based instruction-level simulation that caches partially decoded instructions for fast emulated execution. It models the pipeline, memory hierarchy, and branch prediction logic of wide superscalar processors, and is based on the Alpha 21164; unlike the Alpha, this model supports increased single-stream parallelism. The simulated configuration consists of 10 functional units of four types (four integer, two floating-point, three load/store, and one branch), and the issue rate is at most 8 instructions per cycle. All functional units are assumed to be completely pipelined. The first- and second-level on-chip caches are assumed to be considerably larger than on the Alpha, for two reasons: first, multithreading puts a larger strain on the cache subsystem, and second, larger on-chip caches are expected to be common in the same time frame in which simultaneous multithreading becomes viable. Simulations with caches closer to those of current processors were also run and are discussed where appropriate, but their results are not shown.
An instruction cache access occurs whenever the program counter crosses a 32-byte boundary; otherwise, the instruction is fetched from the already-fetched buffer. Dependence-free instructions are issued in order to an eight-instruction-per-thread scheduling window. From there, instructions can be scheduled onto functional units, depending on functional unit availability. Instructions that are not scheduled due to functional unit unavailability have priority in the next cycle. This straightforward issue model is complemented with state-of-the-art static scheduling, using the Multiflow trace scheduling compiler. This reduces the benefits that might be gained by full dynamic execution, thus eliminating a great deal of complexity (e.g., there is no need for register renaming unless precise exceptions are required, and a simple 1-bit-per-register scoreboarding scheme can be used) in the replicated register sets and fetch/decode pipes.
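The issue model above can be sketched in a few lines of Python. This is our own minimal illustration, not the authors' simulator: the functional-unit mix and the 8-wide issue limit come from the paper, while the data structures and function names are ours.

```python
# Functional-unit configuration from the paper: 4 integer, 2 floating-point,
# 3 load/store, 1 branch unit; at most 8 instructions issue per cycle.
FU_COUNTS = {"int": 4, "fp": 2, "ldst": 3, "branch": 1}
MAX_ISSUE = 8

def issue_cycle(windows):
    """Issue one cycle's worth of instructions.

    windows: a list of per-thread scheduling windows, each a list of
    functional-unit types needed by dependence-free instructions (oldest
    first). Issued instructions are removed; instructions that could not
    issue stay at the front of their window, keeping priority next cycle.
    Returns the number of instructions issued this cycle.
    """
    free = dict(FU_COUNTS)          # units still available this cycle
    issued = 0
    for window in windows:
        remaining = []
        for fu_type in window[:8]:  # only the 8-entry window is visible
            if issued < MAX_ISSUE and free[fu_type] > 0:
                free[fu_type] -= 1  # claim a functional unit
                issued += 1
            else:
                remaining.append(fu_type)
        window[:] = remaining + window[8:]
    return issued
```

For example, two threads whose windows need an integer unit and a branch unit each can issue fully in one cycle, while two branches in one window collide on the single branch unit and one must wait.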
2.2 Workload
The workload is the SPEC92 benchmark suite: twenty public-domain, non-trivial programs that are widely used to measure the performance of computer systems, particularly those in the UNIX workstation market. These benchmarks were expressly chosen to represent real-world applications and were intended to be large enough to stress the computational and memory system resources of current-generation machines.
To gauge the raw instruction throughput achievable by multithreaded superscalar processors, uniprocessor applications are used, with a distinct program assigned to each thread. This models a parallel workload achieved by multiprogramming rather than by parallel processing; the throughput results are therefore not affected by synchronization delays, inefficient parallelization, etc.
Each program is compiled with the Multiflow trace scheduling compiler, modified to produce Alpha code scheduled for the target machine. The applications were each compiled with several different compiler options.
3. Superscalar Bottlenecks:
Where Have All the Cycles
Gone?
This section provides motivation for SM. Using the base single-hardware-context machine, issue utilization — i.e., the percentage of issue slots filled in each cycle — is measured for most of the SPEC benchmarks, and the cause of each empty issue slot is recorded. The results demonstrate that the functional units of the proposed wide superscalar processor are highly underutilized, and that there is no single dominant source of wasted issue bandwidth. Simultaneous multithreading has the potential to recover all issue slots lost to both horizontal and vertical waste. The next section provides details on how effectively it does so.
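The two waste categories can be made concrete with a short sketch. This is our illustration: the 8-slot issue width is the paper's, but the per-cycle issue-count trace is hypothetical.

```python
# Vertical waste: all slots of a cycle that issues nothing.
# Horizontal waste: unused slots in a cycle that issues at least one
# instruction. Together with used slots they account for every issue slot.
ISSUE_WIDTH = 8

def waste_breakdown(issued_per_cycle):
    """Given a list of instructions issued in each cycle, return the
    fractions of issue slots that were (used, vertical waste, horizontal
    waste)."""
    total_slots = ISSUE_WIDTH * len(issued_per_cycle)
    vertical = sum(ISSUE_WIDTH for n in issued_per_cycle if n == 0)
    horizontal = sum(ISSUE_WIDTH - n for n in issued_per_cycle if n > 0)
    used = total_slots - vertical - horizontal
    return used / total_slots, vertical / total_slots, horizontal / total_slots
```

For a hypothetical 4-cycle trace issuing [3, 0, 8, 1] instructions, 32 slots exist: the empty cycle contributes 8 slots of vertical waste, the partially filled cycles contribute 5 + 0 + 7 = 12 slots of horizontal waste, and 12 slots are used. Conventional (fine-grain) multithreading can attack only the vertical waste; SM can recover both kinds.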
4. Simultaneous
Multithreading
The performance results for simultaneous
multithreaded processors are discussed in this
section. Several machine models for
simultaneous multithreading are defined and
it is showed here that simultaneous
multithreading provides significant
performance improvement for both single
threaded superscalar and fine grain
multithreaded processors.
4.1 The Machine Models
The Fine-Grain Multithreading, SM:Full
Simultaneous Issue, SM:Single Issue,
SM:Dual Issue, and SM:Four Issue,
SM:Limited Connection models reflects
several possible design choices for a
combined multithreaded and superscalars
processors.
Fine-Grain Multithreading: only one thread may issue instructions in a given cycle.
SM:Full Simultaneous Issue: all hardware contexts compete for every issue slot in every cycle.
SM:Single Issue: each thread may issue at most one instruction per cycle.
SM:Dual Issue: each thread may issue at most two instructions per cycle.
SM:Four Issue: each thread may issue at most four instructions per cycle.
SM:Limited Connection: each hardware context is connected to exactly one functional unit of each type.
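The issue constraints of these models can be encoded compactly. The model names are the paper's; the dictionary structure and helper function below are our own sketch (SM:Limited Connection is omitted because it constrains functional-unit wiring rather than issue counts).

```python
# per_thread: max instructions one thread may issue per cycle;
# threads_per_cycle: how many threads may issue in the same cycle.
MODELS = {
    "Fine-Grain Multithreading":  {"per_thread": 8, "threads_per_cycle": 1},
    "SM:Full Simultaneous Issue": {"per_thread": 8, "threads_per_cycle": 8},
    "SM:Single Issue":            {"per_thread": 1, "threads_per_cycle": 8},
    "SM:Dual Issue":              {"per_thread": 2, "threads_per_cycle": 8},
    "SM:Four Issue":              {"per_thread": 4, "threads_per_cycle": 8},
}

def max_issue(model, active_threads, width=8):
    """Upper bound on instructions issued per cycle under a given model,
    ignoring functional-unit conflicts and dependences."""
    m = MODELS[model]
    usable = min(active_threads, m["threads_per_cycle"])
    return min(width, usable * m["per_thread"])
```

The bound shows why the limited-issue models close the gap as threads are added: with 8 threads even SM:Single Issue can, in principle, fill all 8 slots, whereas fine-grain multithreading always depends on one thread supplying all 8.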
4.2 The Performance of Simultaneous
Multithreading
The performance of the simultaneous multithreading models is also presented. The fine-grain multithreaded architecture achieves only a limited maximum speedup, since issuing from one thread per cycle hides vertical waste but not horizontal waste. The simultaneous multithreading models achieve much larger speedups over single-thread execution; the speedups are calculated relative to a single thread running on the full simultaneous issue machine.
With simultaneous multithreading, it is not necessary for any single thread to be able to use the entire resources of the processor to reach maximum performance. The four-issue model approaches the performance of full simultaneous issue as the ratio of threads to issue slots increases.
The experiments also show the possibility of trading the number of hardware contexts against complexity in other areas. The gains in processor utilization come from threads sharing processor resources that would otherwise sit idle much of the time; sharing those resources, however, also has negative effects, and the resources that remain unshared play an important role in performance. A single thread suffers when run alongside multiple others, chiefly through sharing of the caches; it was found that increasing the amount of shared data brings the wasted cycles down to 1%.
Large caches are not required to obtain these speedups: smaller caches affect the 1-thread and 8-thread results correspondingly, so the total speedups remain roughly constant across a wide range of cache sizes.
As a result, it is shown that simultaneous multithreading surpasses the performance possible through either single-thread execution or fine-grain multithreading when run on a wide superscalar. It is also noted that basic implementations of SM with limited per-thread abilities can still achieve high instruction throughput, and these require no change to the architecture.
5. Cache Design for a
Simultaneous Multithreaded
Processor
This section examines the cache problem. The focus is on the organization of the first-level (L1) caches, comparing private per-thread caches to shared caches for both instructions and data. The study uses the 4-issue model with up to 8 threads; when fewer than eight threads are running, not all of the private caches are used.
Among the trade-offs for multithreaded caches, a shared cache is best suited to a small number of threads, while private caches perform well with a large number of threads. The instruction and data caches, however, give opposite results, because their trade-offs are not the same.
A shared cache outperforms a private data cache across the whole range of thread counts, whereas the instruction caches can take advantage of private caches at 8 threads. The reason is the different access patterns of instructions and data.
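The capacity argument behind this trade-off can be sketched in a toy model. This is entirely our illustration (the 64 KB figure is hypothetical, not from the paper): with private per-thread caches, a fixed total capacity is statically split eight ways, so partitions belonging to idle contexts go unused, while a shared cache makes all capacity available to whichever threads are running.

```python
TOTAL_KB = 64  # hypothetical total L1 capacity

def usable_capacity(threads, private, partitions=8):
    """Cache capacity (KB) the running threads can actually use."""
    if private:
        # each running thread is confined to its own static 1/8 slice
        return TOTAL_KB // partitions * min(threads, partitions)
    # shared: all capacity is available regardless of thread count
    return TOTAL_KB
```

With two threads, the private organization strands three quarters of the capacity, while the shared one loses nothing; at eight threads the two are equal in capacity, and private caches win back ground by eliminating inter-thread conflicts, which this toy model deliberately ignores.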
6. Simultaneous
Multithreading versus
Single-Chip Multiprocessing
This section compares the performance of simultaneous multithreading to small-scale, single-chip multiprocessing (MP). The two approaches are similar in that both provide multiple register sets, multiple functional units, and high issue bandwidth on a single chip; the basic difference lies in how these resources are partitioned and organized. Scheduling is, of course, more complex for an SM processor.
The functional unit configuration is often optimized for the multiprocessor and represents an inefficient configuration for simultaneous multithreading. The MP is evaluated with 1, 2, and 4 issues per cycle on each processor, and the SM processors with 4 and 8 issues per cycle; the 4-issue model is used for all SM evaluations, which minimizes the complexity differences between the SM and MP architectures.
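The partitioning difference described above can be shown with a small sketch (our illustration, not the paper's methodology): the MP statically divides issue bandwidth among processors, so slots on a stalled or idle processor are lost, while SM shares one wide issue window among all threads.

```python
def mp_issue(ready_per_proc, issue_per_proc):
    """Single-chip MP: each processor issues only from its own thread,
    up to its private issue width."""
    return sum(min(r, issue_per_proc) for r in ready_per_proc)

def sm_issue(ready_per_thread, total_width=8):
    """SM: all threads share the full issue width of one processor."""
    return min(sum(ready_per_thread), total_width)
```

With four 2-issue processors and [4, 0, 1, 0] ready instructions across threads, the MP issues only 3 instructions because the spare slots sit on the wrong processors, while an 8-wide SM machine issues all 5: a ready thread absorbs slots its neighbors cannot use.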
Two cost factors deserve note in this comparison: the amount of time required to schedule instructions onto functional units, and shared cache access time. The distance between the data cache and the load/store units can have a large influence on cache access time: the multiprocessor, with private caches and private load/store units, can minimize the distances between them, while the SM processor cannot do so even with private caches, because its load/store units are shared. Only a different organization of these structures could remove this difference.
There are further advantages of SM over MP that these experiments do not show. The first is performance with fewer threads: the results display only the performance at maximum utilization, and the advantage of SM over MP grows as some of the processors become unutilized. The second is granularity and flexibility of design: the configuration options are richer with SM, since with a multiprocessor capacity must be added in units of whole processors. The evaluations did not take advantage of this flexibility.
As the performance and complexity results show, when component densities allow multiple hardware contexts and wide issue bandwidth on a single chip, simultaneous multithreading represents the most efficient organization of those resources.