Copyright © 2005-2011 Curt Hill
Parallelism in Processors
Several Approaches
Why Parallelism?
• The simple fact is that there is never enough processor speed
• Performance gains come from two areas:
– Better integration technology
– Better implementation of parallelism
• The next two graphics show this
Chip Performance
Gains From Parallelism
Summary
• The bulk of the gains have come from faster and smaller components
• A significant amount has come from parallelism
• Parallelism has also offset the greater complexity of the instruction set
Approaches
• Instruction-level parallelism
– Instructions operate in parallel
– Pipelining
• Data parallelism
– Vector processors
• Processor-level parallelism
– Multiple CPUs
First Attempt
• One bottleneck is that fetching instructions from memory is slow
• The processor is usually an order of magnitude faster than memory
• It is usually faster than the cache as well
• Therefore have a fetch engine that fetches instructions all the time
• This is the prefetch buffer
Prefetch Buffer
• Don’t wait for the current instruction to finish
– Fetch the next instruction as soon as the current instruction arrives
• This scheme can make a mistake, since a goto or branch makes the next instruction difficult to guess
• You may also fetch in both directions and discard the unused one
– These are stored in the prefetch buffer
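The prefetch idea can be modeled in a toy simulator. This is a sketch invented for illustration (real hardware works on instruction words, not Python tuples): a fetch unit keeps a small buffer full of sequential instructions while the execute unit drains it, and a taken jump invalidates the buffer, modeling the "mistake" a branch causes.

```python
from collections import deque

def run_with_prefetch(program, buffer_size=4):
    """Toy prefetch model: fetch fills a buffer ahead of execution.
    Instructions are tuples; ("jump", target) is a taken branch."""
    buffer = deque()
    pc = 0
    executed = []
    flushes = 0
    while pc < len(program) or buffer:
        # Fetch stage: keep the buffer full with sequential instructions.
        next_fetch = (buffer[-1][0] + 1) if buffer else pc
        while len(buffer) < buffer_size and next_fetch < len(program):
            buffer.append((next_fetch, program[next_fetch]))
            next_fetch += 1
        # Execute stage: consume one instruction from the buffer.
        addr, instr = buffer.popleft()
        executed.append(instr)
        if instr[0] == "jump":      # the sequential guess was wrong
            buffer.clear()          # discard the prefetched instructions
            flushes += 1
            pc = instr[1]
        else:
            pc = addr + 1
    return executed, flushes
```

Running a short forward-jumping program shows the skipped instruction never executes and one flush is charged for the wrong guess.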
Two Stages
• Now we have two independent pieces
• The instruction fetch mechanism
– Using the prefetch buffer
• The instruction execute mechanism
– This is where most of the work is done
• This generalizes into a pipeline of several stages
Pipelines
• Each of the following is a stage:
– Fetch the instruction
– Decode the instruction
– Locate and fetch operands
– Execute the operation
– Write the results back
• These may belong to separate hardware units that operate in parallel
Example
• All of this goes on in parallel:
– Fetch instruction 8
– Decode instruction 7
– Fetch operands for instruction 6
– Execute instruction 5
– Write back data for instruction 4
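The overlap on this slide can be tabulated with a small sketch. The stage names follow the slide; the function itself is invented for illustration, assuming an ideal pipeline with no stalls, where instruction i enters Fetch at cycle i.

```python
STAGES = ["Fetch", "Decode", "Operands", "Execute", "Write-back"]

def pipeline_schedule(n_instructions):
    """For each clock cycle, report which instruction (0-based) occupies
    each stage of an ideal five-stage pipeline with no stalls."""
    total_cycles = n_instructions + len(STAGES) - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for s, stage in enumerate(STAGES):
            instr = cycle - s           # stage s lags Fetch by s cycles
            if 0 <= instr < n_instructions:
                row[stage] = instr
        schedule.append(row)
    return schedule
```

Once the pipeline is full, every cycle shows five consecutive instructions in flight at once, just as in the slide's example.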
A Simulator
Superscalar Architectures
• Have a single fetcher drive two different pipelines, each of which consists of these stages
• Decode through write-back occurs in parallel on two or more separate pipelines
• This is the Pentium approach
• The main pipeline can handle anything
• The second pipeline can handle integer operations or simple floating point operations
– Simple such as load/store from the floating point processor
CDC 6600
• Only the execute stage is parallel
• This only works well if the execute step takes longer than the other steps
• This is particularly true for floating point and memory access instructions
• The 6600 had multiple I/O and floating point processors that could execute in parallel
– One of Seymour Cray’s machine designs of the 1960s
Problems?
• Pipelining needs some instruction independence to work optimally
• If instructions A, B, C are consecutive, B depends on the result of A, and C depends on the result of B, we may have a problem with either approach
• The operand fetch of B cannot complete until the write-back of A, stalling the whole pipeline
• However, the average mix of instructions tends not to have these hard dependencies in every instruction
• Compilers can also help by reordering the instructions they emit
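The cost of such dependencies can be counted with a toy model. This sketch is invented for illustration and assumes a five-stage in-order pipeline with no forwarding: an instruction's operand fetch (stage 2) must not run before the write-back (stage 4) of the instruction it depends on.

```python
def cycles_with_stalls(deps, n_stages=5, operand_stage=2, writeback_stage=4):
    """Total cycles for an in-order pipeline, stalling a consumer until
    its producer's write-back. deps[i] is the index of the instruction
    that instruction i depends on, or None if it is independent."""
    issue = []                              # cycle each instruction enters Fetch
    for i, dep in enumerate(deps):
        start = issue[i - 1] + 1 if i else 0
        if dep is not None:
            # Delay so that start + operand_stage >= producer's write-back.
            start = max(start, issue[dep] + writeback_stage - operand_stage)
        issue.append(start)
    return issue[-1] + n_stages             # cycle after the last write-back
```

Three independent instructions finish in 7 cycles, while the A-B-C dependency chain from the slide takes 9: each dependency costs two stall cycles in this model.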
Problem Example
Limits on Instruction Level Parallelism
• There is a limit on the gains
• The more stages, the less likely that the instruction sequence will be suitable
• The more stages, the more expensive the recovery from a mistake
• Dividing instruction processing into more than 10-20 stages leaves too little work for each stage to do
• The more complicated the processor, the more heat it generates
Chip Power Consumption
Operating System Parallelism
• Next we need the types of parallel processing enabled by the OS
• This usually involves multiple processes and threads
• Several flavors:
– Uniprocessing
– Hyperthreading
– Multiprocessing
UniProcessing
• Single CPU, but apparently multiple tasks
• Permissive (cooperative)
– Any system call allows the current task to be suspended and another started
– Windows 3
• Preemptive
– A task is suspended when it makes a system call that could require waiting
– A time slice occurs
Multiple Processors: MultiProcessing
• Real multiprocessing involves multiple CPUs
• Multiple CPUs can be executing different jobs
• They may also be in the same job, if the job allows it
• The CPUs are almost completely independent
– They may share memory, disk, or both
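The "same job on multiple CPUs" case can be sketched in Python, using the standard-library multiprocessing module. This is a minimal illustration, not the slide's method: one job is split into chunks, each handled by a separate OS process that the scheduler is free to place on a different processor.

```python
from multiprocessing import Pool

def partial_sum(bounds):
    """Work done by one worker: sum a private slice of the range."""
    lo, hi = bounds
    return sum(range(lo, hi))

if __name__ == "__main__":
    # One job split across several CPUs: each worker is a separate
    # process with its own memory, coordinated only through results.
    chunks = [(i * 250_000, (i + 1) * 250_000) for i in range(4)]
    with Pool(processes=4) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total == sum(range(1_000_000)))  # True
```

Because the workers share nothing while computing, no locking is needed until the partial results are combined.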
Multiprocessors
• Two or more CPUs with shared memory
• Multiprocessors generally need both hardware and OS support
• This technique has been used since the 60s
• The idea is that two CPUs can outperform one
• It will become even more important
Half Way: HyperThreading
• The HyperThreading CPUs are a transitional form
• There is one CPU with two register sets
• The CPU alternates between the register sets during execution, giving better concurrency than a uniprocessor
• Windows XP considers it two CPUs
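The alternation between two register sets can be modeled with a toy interpreter. This sketch is invented for illustration: one execution unit switches hardware threads each cycle, and when one thread has nothing ready the cycle goes to the other, which is where the concurrency gain comes from.

```python
def interleave(thread_a, thread_b):
    """Toy hyperthreading model: one execution unit, two register sets.
    Each cycle the CPU switches to the other hardware thread; an empty
    stream simply yields its turn."""
    streams = [list(thread_a), list(thread_b)]
    order = []
    turn = 0
    while streams[0] or streams[1]:
        if streams[turn]:
            order.append(streams[turn].pop(0))
        turn = 1 - turn                 # alternate register sets
    return order
```

With unequal streams the longer thread keeps the execution unit busy once the shorter one finishes, so the core never idles while work remains.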
Multi-Tasking Operating System
• There are multiple processes
• Each has its own memory
• In a single-CPU system a process executes until:
– It is waiting for I/O
– It has used its time slice
– Something with higher priority is now ready
• When a process is suspended, a queue of processes waiting to execute is examined; the first is chosen and executed
Multiple CPUs
• Updating this to multiple CPUs mostly requires that both CPUs cannot run the dispatcher at the same time
• This requires some type of exclusive instruction, and the dispatcher must use it
• Windows 95 and DOS cannot do this
• Windows NT, OS/2 and UNIX can
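The dispatcher exclusion can be sketched with threads standing in for CPUs. This is an invented software analogue, not the hardware mechanism itself: the lock plays the role of the exclusive instruction, so only one "CPU" ever runs the dispatch code at a time.

```python
import threading
from collections import deque

class Dispatcher:
    """Toy dispatcher: the ready queue is protected by a lock so that
    two CPUs (threads here) never execute the dispatch code at once."""
    def __init__(self, processes):
        self._lock = threading.Lock()
        self._ready = deque(processes)

    def next_process(self):
        with self._lock:                # only one CPU dispatches at a time
            return self._ready.popleft() if self._ready else None

def run_cpu(dispatcher, executed):
    """One CPU's loop: take the next ready process and 'run' it."""
    while (proc := dispatcher.next_process()) is not None:
        executed.append(proc)
```

Without the lock, two CPUs could pop the same process or corrupt the queue; with it, every process is dispatched exactly once no matter how the CPUs interleave.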
MPU Loss
• Because of the need to have one CPU lock out the other in certain instances, two CPUs never perform at the same level as one that is twice as fast
– 90% seems to be average
– Thus an MPU with two 1 GHz processors will perform similarly to a 1.8 GHz uniprocessor
• More than two CPUs yields more loss
• Most servers are duals or more
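The 90% figure works out as simple arithmetic. The function name and the linear model below are invented for illustration; the slide itself only states the two-CPU case, and notes that larger counts lose even more, which this simple model does not capture.

```python
def effective_ghz(n_cpus, ghz_each, efficiency=0.90):
    """Rule of thumb from the slide: lock contention costs about 10%,
    so total throughput is roughly n * clock * 0.90."""
    return n_cpus * ghz_each * efficiency
```

Two 1 GHz processors thus come out as a 1.8 GHz-equivalent uniprocessor, matching the slide's example.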
Multiprocessors Again
• Before the Pentium, a multiprocessor needed extra hardware to prevent the CPUs from causing a race error of some sort
• With the Pentium, sharing four pins was all the hardware support that was needed
• The next advance was multicore chips
Multicore Chips
• Instead of one very fast CPU on a chip, put two not-so-fast CPUs
• These are the multicore chips
• They actually remove some of the complexity of pipelining to make each core smaller, and also use a slower and cooler technology
Manufacturer’s Offerings
• Intel’s HyperThreading chips were a transitional form
• AMD and Intel dual-core processors became available in 2005
• Sun had a 4-core SPARC to be released in 2005-2006
• Microsoft changed its license to be per chip, so that a multi-core chip is considered one processor
Disadvantages
• The bus to the memory becomes the bottleneck
• Several things access the memory independently: two or more CPUs, plus Direct Memory Access controllers (disk controllers, video)
• One solution is dual-port memory
• Separate caches can also help
• Another solution is to give each processor its own local, private memory, but this diminishes the sharing that can go on
Chip MultiProcessors
Multicomputers
• When the number of connections gets large, sharing memory gets hard
• A multicomputer consists of many parallel processors, each with its own memory and disk
• Communication is then accomplished by messages sent from one to all, or from one to another
• Grid computing is one alternative
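Message passing between nodes with private memory can be sketched with processes standing in for the individual computers. This is an invented illustration: each node sees only its own data, and the only way results reach anyone else is an explicit message.

```python
from multiprocessing import Process, Queue

def node(node_id, outbox):
    """One computer in the multicomputer: it works only on its private
    data and communicates solely by sending a message."""
    local_data = range(node_id * 10, node_id * 10 + 10)
    outbox.put((node_id, sum(local_data)))   # one-to-one message

if __name__ == "__main__":
    results = Queue()
    workers = [Process(target=node, args=(i, results)) for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    total = sum(results.get()[1] for _ in workers)
    print(total)   # sum of 0..39 = 780
```

No memory is shared here at all; the queue plays the role of the interconnect, which is what distinguishes a multicomputer from the shared-memory multiprocessors above.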
Conclusion
• Moore’s Law has not been just about better integration techniques
• Parallelism in the single CPU and in multiple CPUs has also contributed
• Pipelining has been the major technique for single CPUs
• There are other presentations on multicomputer and multiprocessor systems