Summary of "Simultaneous Multithreading: Maximizing On-Chip Parallelism"
TRANSCRIPT
COMPUTER ARCHITECTURE
BATCH 2012
Assignment title
“Summary of Paper”
BY
FARWA ABDUL HANNAN
(12-CS-13)
&
ZAINAB KHALID
(12-CS-33)
Date of Submission: Wednesday, 11 May, 2016
NFC – INSTITUTE OF ENGINEERING AND FERTILIZER
RESEARCH, FSD
Simultaneous Multithreading: Maximizing On-Chip Parallelism
Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98195
______________________________________________________________________________
1. Introduction
The paper examines simultaneous multithreading, a technique that allows several independent threads to issue instructions to multiple functional units in each cycle. The objective of simultaneous multithreading is to increase processor utilization in the face of both long memory latencies and limited available parallelism per thread.
This study evaluates the potential
improvement, relative to wide superscalar
architectures and conventional multithreaded
architectures, of various simultaneous
multithreading models.
The results show the limits of superscalar execution and traditional multithreading in increasing instruction throughput in future processors.
2. Methodology
The main goal is to evaluate several architectural alternatives in order to examine simultaneous multithreading. To this end, a simulation environment has been developed that defines the implementation of a simultaneous multithreaded architecture; that architecture is an extension of next-generation wide superscalar processors.
2.1 Simulation Environment
The simulator uses emulation-based instruction-level simulation that caches partially decoded instructions for fast emulated execution. It models the pipeline, memory hierarchy, and branch prediction logic of wide superscalar processors, and is based on the Alpha 21164; unlike the Alpha, this model supports increased single-stream parallelism. The simulated configuration consists of 10 functional units of four types (four integer, two floating-point, three load/store, and one branch), and the issue rate is at most 8 instructions per cycle. All functional units are assumed to be completely pipelined. The first- and second-level on-chip caches are assumed to be considerably larger than on the Alpha, for two reasons: first, multithreading puts a larger strain on the cache subsystem, and second, larger on-chip caches are expected to be common in the same time frame in which simultaneous multithreading becomes viable. Simulations with caches closer to those of current processors were also run and are discussed where appropriate, but their results are not shown.
An instruction cache access occurs whenever the program counter crosses a 32-byte boundary; otherwise, the instruction is fetched from the already-fetched buffer. Dependence-free instructions are issued in order to an eight-instruction-per-thread scheduling window. From there, instructions can be scheduled onto functional units, depending on functional unit availability. Instructions that are not scheduled due to functional unit unavailability have priority in the next cycle. This straightforward issue model is complemented with state-of-the-art static scheduling, using the Multiflow trace scheduling compiler. This reduces the benefits that might be gained by full dynamic execution, thus eliminating a great deal of complexity (e.g., there is no need for register renaming unless precise exceptions are required, and a simple 1-bit-per-register scoreboarding scheme can be used) in the replicated register sets and fetch/decode pipes.
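The issue model above can be sketched in a few lines of Python. This is our own minimal illustration, not the authors' simulator: the functional-unit mix and the 8-wide issue limit come from the paper, while the data structures and function names are ours.

```python
# Functional-unit configuration from the paper: 4 integer, 2 floating-point,
# 3 load/store, 1 branch unit; at most 8 instructions issue per cycle.
FU_COUNTS = {"int": 4, "fp": 2, "ldst": 3, "branch": 1}
MAX_ISSUE = 8

def issue_cycle(windows):
    """Issue one cycle's worth of instructions.

    windows: a list of per-thread scheduling windows, each a list of
    functional-unit types needed by dependence-free instructions (oldest
    first). Issued instructions are removed; instructions that could not
    issue stay at the front of their window, keeping priority next cycle.
    Returns the number of instructions issued this cycle.
    """
    free = dict(FU_COUNTS)          # units still available this cycle
    issued = 0
    for window in windows:
        remaining = []
        for fu_type in window[:8]:  # only the 8-entry window is visible
            if issued < MAX_ISSUE and free[fu_type] > 0:
                free[fu_type] -= 1  # claim a functional unit
                issued += 1
            else:
                remaining.append(fu_type)
        window[:] = remaining + window[8:]
    return issued
```

For example, two threads whose windows need an integer unit and a branch unit each can issue fully in one cycle, while two branches in one window collide on the single branch unit and one must wait.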
2.2 Workload
The workload is the SPEC92 benchmark suite: twenty public-domain, non-trivial programs that are widely used to measure the performance of computer systems, particularly those in the UNIX workstation market. These benchmarks were expressly chosen to represent real-world applications and were intended to be large enough to stress the computational and memory system resources of current-generation machines.
To gauge the raw instruction throughput achievable by multithreaded superscalar processors, uniprocessor applications are used, with a distinct program assigned to each thread. This models a parallel workload achieved by multiprogramming rather than by parallel processing; the throughput results are therefore not affected by synchronization delays, inefficient parallelization, etc.
Each program is compiled with the Multiflow trace scheduling compiler, modified to produce Alpha code scheduled for the target machine. The applications were each compiled with several different compiler options.
3. Superscalar Bottlenecks:
Where Have All the Cycles
Gone?
This section provides motivation for SM. Using the base single-hardware-context machine, issue utilization — i.e., the percentage of issue slots filled in each cycle — is measured for most of the SPEC benchmarks, and the cause of each empty issue slot is recorded. The results demonstrate that the functional units of the proposed wide superscalar processor are highly underutilized, and that there is no single dominant source of wasted issue bandwidth. Simultaneous multithreading has the potential to recover all issue slots lost to both horizontal and vertical waste. The next section provides details on how effectively it does so.
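The two waste categories can be made concrete with a short sketch. This is our illustration: the 8-slot issue width is the paper's, but the per-cycle issue-count trace is hypothetical.

```python
# Vertical waste: all slots of a cycle that issues nothing.
# Horizontal waste: unused slots in a cycle that issues at least one
# instruction. Together with used slots they account for every issue slot.
ISSUE_WIDTH = 8

def waste_breakdown(issued_per_cycle):
    """Given a list of instructions issued in each cycle, return the
    fractions of issue slots that were (used, vertical waste, horizontal
    waste)."""
    total_slots = ISSUE_WIDTH * len(issued_per_cycle)
    vertical = sum(ISSUE_WIDTH for n in issued_per_cycle if n == 0)
    horizontal = sum(ISSUE_WIDTH - n for n in issued_per_cycle if n > 0)
    used = total_slots - vertical - horizontal
    return used / total_slots, vertical / total_slots, horizontal / total_slots
```

For a hypothetical 4-cycle trace issuing [3, 0, 8, 1] instructions, 32 slots exist: the empty cycle contributes 8 slots of vertical waste, the partially filled cycles contribute 5 + 0 + 7 = 12 slots of horizontal waste, and 12 slots are used. Conventional (fine-grain) multithreading can attack only the vertical waste; SM can recover both kinds.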
4. Simultaneous
Multithreading
The performance results for simultaneous
multithreaded processors are discussed in this
section. Several machine models for
simultaneous multithreading are defined and
it is showed here that simultaneous
multithreading provides significant
performance improvement for both single
threaded superscalar and fine grain
multithreaded processors.
4.1 The Machine Models
The Fine-Grain Multithreading, SM:Full
Simultaneous Issue, SM:Single Issue,
SM:Dual Issue, and SM:Four Issue,
SM:Limited Connection models reflects
several possible design choices for a
combined multithreaded and superscalars
processors.
Fine-Grain Multithreading: only one thread may issue instructions in a given cycle.
SM:Full Simultaneous Issue: all hardware contexts compete for every issue slot in every cycle.
SM:Single Issue: each thread may issue at most one instruction per cycle.
SM:Dual Issue: each thread may issue at most two instructions per cycle.
SM:Four Issue: each thread may issue at most four instructions per cycle.
SM:Limited Connection: each hardware context is connected to exactly one functional unit of each type.
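The issue constraints of these models can be encoded compactly. The model names are the paper's; the dictionary structure and helper function below are our own sketch (SM:Limited Connection is omitted because it constrains functional-unit wiring rather than issue counts).

```python
# per_thread: max instructions one thread may issue per cycle;
# threads_per_cycle: how many threads may issue in the same cycle.
MODELS = {
    "Fine-Grain Multithreading":  {"per_thread": 8, "threads_per_cycle": 1},
    "SM:Full Simultaneous Issue": {"per_thread": 8, "threads_per_cycle": 8},
    "SM:Single Issue":            {"per_thread": 1, "threads_per_cycle": 8},
    "SM:Dual Issue":              {"per_thread": 2, "threads_per_cycle": 8},
    "SM:Four Issue":              {"per_thread": 4, "threads_per_cycle": 8},
}

def max_issue(model, active_threads, width=8):
    """Upper bound on instructions issued per cycle under a given model,
    ignoring functional-unit conflicts and dependences."""
    m = MODELS[model]
    usable = min(active_threads, m["threads_per_cycle"])
    return min(width, usable * m["per_thread"])
```

The bound shows why the limited-issue models close the gap as threads are added: with 8 threads even SM:Single Issue can, in principle, fill all 8 slots, whereas fine-grain multithreading always depends on one thread supplying all 8.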
4.2 The Performance of Simultaneous
Multithreading
The performance of the simultaneous multithreading models is also presented. The fine-grain multithreaded architecture achieves only a limited maximum speedup, since issuing from one thread per cycle hides vertical waste but not horizontal waste. The simultaneous multithreading models achieve much larger speedups over single-thread execution; the speedups are calculated relative to a single thread running on the full simultaneous issue machine.
With simultaneous multithreading, it is not necessary for any single thread to be able to use the entire resources of the processor to reach maximum performance. The four-issue model approaches the performance of full simultaneous issue as the ratio of threads to issue slots increases.
The experiments also show the possibility of trading the number of hardware contexts against complexity in other areas. The gains in processor utilization come from threads sharing processor resources that would otherwise sit idle much of the time; sharing those resources, however, also has negative effects, and the resources that remain unshared play an important role in performance. A single thread suffers when run alongside multiple others, chiefly through sharing of the caches; it was found that increasing the amount of shared data brings the wasted cycles down to 1%.
Large caches are not required to obtain these speedups: smaller caches affect the 1-thread and 8-thread results correspondingly, so the total speedups remain roughly constant across a wide range of cache sizes.
As a result, it is shown that simultaneous multithreading surpasses the performance possible through either single-thread execution or fine-grain multithreading when run on a wide superscalar. It is also noted that basic implementations of SM with limited per-thread abilities can still achieve high instruction throughput, and these require no change to the architecture.
5. Cache Design for a
Simultaneous Multithreaded
Processor
This section examines the cache problem. The focus is on the organization of the first-level (L1) caches, comparing private per-thread caches to shared caches for both instructions and data. The study uses the 4-issue model with up to 8 threads; when fewer than eight threads are running, not all of the private caches are used.
Among the trade-offs for multithreaded caches, a shared cache is best suited to a small number of threads, while private caches perform well with a large number of threads. The instruction and data caches, however, give opposite results, because their trade-offs are not the same.
A shared cache outperforms a private data cache across the whole range of thread counts, whereas the instruction caches can take advantage of private caches at 8 threads. The reason is the different access patterns of instructions and data.
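The capacity argument behind this trade-off can be sketched in a toy model. This is entirely our illustration (the 64 KB figure is hypothetical, not from the paper): with private per-thread caches, a fixed total capacity is statically split eight ways, so partitions belonging to idle contexts go unused, while a shared cache makes all capacity available to whichever threads are running.

```python
TOTAL_KB = 64  # hypothetical total L1 capacity

def usable_capacity(threads, private, partitions=8):
    """Cache capacity (KB) the running threads can actually use."""
    if private:
        # each running thread is confined to its own static 1/8 slice
        return TOTAL_KB // partitions * min(threads, partitions)
    # shared: all capacity is available regardless of thread count
    return TOTAL_KB
```

With two threads, the private organization strands three quarters of the capacity, while the shared one loses nothing; at eight threads the two are equal in capacity, and private caches win back ground by eliminating inter-thread conflicts, which this toy model deliberately ignores.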
6. Simultaneous
Multithreading versus
Single-Chip Multiprocessing
This section compares the performance of simultaneous multithreading to small-scale, single-chip multiprocessing (MP). The two approaches are similar in that both provide multiple register sets, multiple functional units, and high issue bandwidth on a single chip; the basic difference lies in how these resources are partitioned and organized. Scheduling is, of course, more complex for an SM processor.
The functional unit configuration is often optimized for the multiprocessor and represents an inefficient configuration for simultaneous multithreading. The MP is evaluated with 1, 2, and 4 issues per cycle on each processor, and the SM processors with 4 and 8 issues per cycle; the 4-issue model is used for all SM evaluations, which minimizes the complexity differences between the SM and MP architectures.
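The partitioning difference described above can be shown with a small sketch (our illustration, not the paper's methodology): the MP statically divides issue bandwidth among processors, so slots on a stalled or idle processor are lost, while SM shares one wide issue window among all threads.

```python
def mp_issue(ready_per_proc, issue_per_proc):
    """Single-chip MP: each processor issues only from its own thread,
    up to its private issue width."""
    return sum(min(r, issue_per_proc) for r in ready_per_proc)

def sm_issue(ready_per_thread, total_width=8):
    """SM: all threads share the full issue width of one processor."""
    return min(sum(ready_per_thread), total_width)
```

With four 2-issue processors and [4, 0, 1, 0] ready instructions across threads, the MP issues only 3 instructions because the spare slots sit on the wrong processors, while an 8-wide SM machine issues all 5: a ready thread absorbs slots its neighbors cannot use.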
Two cost factors deserve note in this comparison: the amount of time required to schedule instructions onto functional units, and shared cache access time. The distance between the data cache and the load/store units can have a large influence on cache access time: the multiprocessor, with private caches and private load/store units, can minimize the distances between them, while the SM processor cannot do so even with private caches, because its load/store units are shared. Only a different organization of these structures could remove this difference.
There are further advantages of SM over MP that these experiments do not show. The first is performance with fewer threads: the results display only the performance at maximum utilization, and the advantage of SM over MP grows as some of the processors become unutilized. The second is granularity and flexibility of design: the configuration options are richer with SM, since with a multiprocessor capacity must be added in units of whole processors. The evaluations did not take advantage of this flexibility.
As the performance and complexity results show, when component densities allow multiple hardware contexts and wide issue bandwidth on a single chip, simultaneous multithreading represents the most efficient organization of those resources.