Copyright © 2005-2011 Curt Hill
Parallelism in Processors
Several Approaches
Why Parallelism?
• The simple fact is that there is never enough processor speed
• Performance gains come from two areas:
– Better integration technology
– Better implementation of parallelism
• The next two graphics show this
Chip Performance
Gains From Parallelism
Summary
• The bulk of the gains have come from faster and smaller components
• A significant amount has come from parallelism
• Parallelism has also offset the greater complexity of the instruction set
Approaches
• Instruction-level parallelism
– Instructions operate in parallel
– Pipelining
• Data parallelism
– Vector processors
• Processor-level parallelism
– Multiple CPUs
First Attempt
• One bottleneck is that fetching instructions from memory is slow
• The processor is usually an order of magnitude faster than memory
• It is usually faster than the cache as well
• Therefore have a fetch engine that fetches instructions all the time
• This is the prefetch buffer
Prefetch Buffer
• Don’t wait for the current instruction to finish
– Fetch the next instruction as soon as the current instruction arrives
• This scheme can make a mistake, since a goto or branch makes the next instruction difficult to guess
• You may also fetch in both directions and discard the unused one
– These are stored in the prefetch buffer
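The prefetch idea can be modeled in a toy simulator. This is a sketch invented for illustration (real hardware works on instruction words, not Python tuples): a fetch unit keeps a small buffer full of sequential instructions while the execute unit drains it, and a taken jump invalidates the buffer, modeling the "mistake" a branch causes.

```python
from collections import deque

def run_with_prefetch(program, buffer_size=4):
    """Toy prefetch model: fetch fills a buffer ahead of execution.
    Instructions are tuples; ("jump", target) is a taken branch."""
    buffer = deque()
    pc = 0
    executed = []
    flushes = 0
    while pc < len(program) or buffer:
        # Fetch stage: keep the buffer full with sequential instructions.
        next_fetch = (buffer[-1][0] + 1) if buffer else pc
        while len(buffer) < buffer_size and next_fetch < len(program):
            buffer.append((next_fetch, program[next_fetch]))
            next_fetch += 1
        # Execute stage: consume one instruction from the buffer.
        addr, instr = buffer.popleft()
        executed.append(instr)
        if instr[0] == "jump":      # the sequential guess was wrong
            buffer.clear()          # discard the prefetched instructions
            flushes += 1
            pc = instr[1]
        else:
            pc = addr + 1
    return executed, flushes
```

Running a short forward-jumping program shows the skipped instruction never executes and one flush is charged for the wrong guess.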
Two Stages
• Now we have two independent pieces
• The instruction fetch mechanism
– Using the prefetch buffer
• The instruction execute mechanism
– This is where most of the work is done
• This generalizes into a pipeline of several stages
Pipelines
• Each of the following is a stage:
– Fetch the instruction
– Decode the instruction
– Locate and fetch operands
– Execute the operation
– Write the results back
• These may belong to separate hardware units that operate in parallel
Example
• All of this goes on in parallel:
– Fetch instruction 8
– Decode instruction 7
– Fetch operands for instruction 6
– Execute instruction 5
– Write back data for instruction 4
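The overlap on this slide can be tabulated with a small sketch. The stage names follow the slide; the function itself is invented for illustration, assuming an ideal pipeline with no stalls, where instruction i enters Fetch at cycle i.

```python
STAGES = ["Fetch", "Decode", "Operands", "Execute", "Write-back"]

def pipeline_schedule(n_instructions):
    """For each clock cycle, report which instruction (0-based) occupies
    each stage of an ideal five-stage pipeline with no stalls."""
    total_cycles = n_instructions + len(STAGES) - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for s, stage in enumerate(STAGES):
            instr = cycle - s           # stage s lags Fetch by s cycles
            if 0 <= instr < n_instructions:
                row[stage] = instr
        schedule.append(row)
    return schedule
```

Once the pipeline is full, every cycle shows five consecutive instructions in flight at once, just as in the slide's example.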
A Simulator
Superscalar Architectures
• Have a single fetcher drive two different pipelines, each of which consists of these stages
• Decode through write-back occurs in parallel on two or more separate pipelines
• This is the Pentium approach
• The main pipeline can handle anything
• The second pipeline can handle integer operations or simple floating point operations
– Simple such as load/store from the floating point processor
CDC 6600
• Only the execute stage is parallel
• This only works well if the execute step takes longer than the other steps
• This is particularly true for floating point and memory access instructions
• The 6600 had multiple I/O and floating point processors that could execute in parallel
– One of Seymour Cray’s machine designs of the 1960s
Problems?
• Pipelining needs some instruction independence to work optimally
• If instructions A, B, C are consecutive, B depends on the result of A, and C depends on the result of B, we may have a problem with either approach
• The operand fetch of B cannot complete until the write-back of A, stalling the whole pipeline
• However, the average mix of instructions tends not to have these hard dependencies in every instruction
• Compilers can also help by reordering the instructions they emit
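The cost of such dependencies can be counted with a toy model. This sketch is invented for illustration and assumes a five-stage in-order pipeline with no forwarding: an instruction's operand fetch (stage 2) must not run before the write-back (stage 4) of the instruction it depends on.

```python
def cycles_with_stalls(deps, n_stages=5, operand_stage=2, writeback_stage=4):
    """Total cycles for an in-order pipeline, stalling a consumer until
    its producer's write-back. deps[i] is the index of the instruction
    that instruction i depends on, or None if it is independent."""
    issue = []                              # cycle each instruction enters Fetch
    for i, dep in enumerate(deps):
        start = issue[i - 1] + 1 if i else 0
        if dep is not None:
            # Delay so that start + operand_stage >= producer's write-back.
            start = max(start, issue[dep] + writeback_stage - operand_stage)
        issue.append(start)
    return issue[-1] + n_stages             # cycle after the last write-back
```

Three independent instructions finish in 7 cycles, while the A-B-C dependency chain from the slide takes 9: each dependency costs two stall cycles in this model.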
Problem Example
Limits on Instruction Level Parallelism
• There is a limit on the gains
• The more stages, the less likely that the instruction sequence will be suitable
• The more stages, the more expensive the recovery from a mistake
• Dividing instruction processing into more than 10-20 stages leaves too little work for each stage to do
• The more complicated the processor, the more heat it generates
Chip Power Consumption
Operating System Parallelism
• Next we need the types of parallel processing enabled by the OS
• This usually involves multiple processes and threads
• Several flavors:
– Uniprocessing
– Hyperthreading
– Multiprocessing
UniProcessing
• Single CPU, but apparently multiple tasks
• Permissive (cooperative)
– Any system call allows the current task to be suspended and another started
– Windows 3
• Preemptive
– A task is suspended when it makes a system call that could require waiting
– A time slice occurs
Multiple Processors: MultiProcessing
• Real multiprocessing involves multiple CPUs
• Multiple CPUs can be executing different jobs
• They may also be in the same job, if the job allows it
• The CPUs are almost completely independent
– They may share memory, disk, or both
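The "same job on multiple CPUs" case can be sketched in Python, using the standard-library multiprocessing module. This is a minimal illustration, not the slide's method: one job is split into chunks, each handled by a separate OS process that the scheduler is free to place on a different processor.

```python
from multiprocessing import Pool

def partial_sum(bounds):
    """Work done by one worker: sum a private slice of the range."""
    lo, hi = bounds
    return sum(range(lo, hi))

if __name__ == "__main__":
    # One job split across several CPUs: each worker is a separate
    # process with its own memory, coordinated only through results.
    chunks = [(i * 250_000, (i + 1) * 250_000) for i in range(4)]
    with Pool(processes=4) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total == sum(range(1_000_000)))  # True
```

Because the workers share nothing while computing, no locking is needed until the partial results are combined.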
Multiprocessors
• Two or more CPUs with shared memory
• Multiprocessors generally need both hardware and OS support
• This technique has been used since the 60s
• The idea is that two CPUs can outperform one
• It will become even more important
Half Way: HyperThreading
• The HyperThreading CPUs are a transitional form
• There is one CPU with two register sets
• The CPU alternates between the register sets during execution, giving better concurrency than a uniprocessor
• Windows XP considers it two CPUs
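The alternation between two register sets can be modeled with a toy interpreter. This sketch is invented for illustration: one execution unit switches hardware threads each cycle, and when one thread has nothing ready the cycle goes to the other, which is where the concurrency gain comes from.

```python
def interleave(thread_a, thread_b):
    """Toy hyperthreading model: one execution unit, two register sets.
    Each cycle the CPU switches to the other hardware thread; an empty
    stream simply yields its turn."""
    streams = [list(thread_a), list(thread_b)]
    order = []
    turn = 0
    while streams[0] or streams[1]:
        if streams[turn]:
            order.append(streams[turn].pop(0))
        turn = 1 - turn                 # alternate register sets
    return order
```

With unequal streams the longer thread keeps the execution unit busy once the shorter one finishes, so the core never idles while work remains.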
Multi-Tasking Operating System
• There are multiple processes
• Each has its own memory
• In a single-CPU system a process executes until:
– It is waiting for I/O
– It has used its time slice
– Something with higher priority is now ready
• When a process is suspended, a queue of processes waiting to execute is examined; the first is chosen and executed
Multiple CPUs
• Updating this to multiple CPUs mostly requires that both CPUs cannot run the dispatcher at the same time
• This requires some type of exclusive instruction, and the dispatcher must use it
• Windows 95 and DOS cannot do this
• Windows NT, OS/2 and UNIX can
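The dispatcher exclusion can be sketched with threads standing in for CPUs. This is an invented software analogue, not the hardware mechanism itself: the lock plays the role of the exclusive instruction, so only one "CPU" ever runs the dispatch code at a time.

```python
import threading
from collections import deque

class Dispatcher:
    """Toy dispatcher: the ready queue is protected by a lock so that
    two CPUs (threads here) never execute the dispatch code at once."""
    def __init__(self, processes):
        self._lock = threading.Lock()
        self._ready = deque(processes)

    def next_process(self):
        with self._lock:                # only one CPU dispatches at a time
            return self._ready.popleft() if self._ready else None

def run_cpu(dispatcher, executed):
    """One CPU's loop: take the next ready process and 'run' it."""
    while (proc := dispatcher.next_process()) is not None:
        executed.append(proc)
```

Without the lock, two CPUs could pop the same process or corrupt the queue; with it, every process is dispatched exactly once no matter how the CPUs interleave.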
MPU Loss
• Because of the need to have one CPU lock out the other in certain instances, two CPUs never perform at the same level as one that is twice as fast
– 90% seems to be average
– Thus an MPU with two 1 GHz processors will perform similarly to a 1.8 GHz uniprocessor
• More than two CPUs yields more loss
• Most servers are duals or more
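The 90% figure works out as simple arithmetic. The function name and the linear model below are invented for illustration; the slide itself only states the two-CPU case, and notes that larger counts lose even more, which this simple model does not capture.

```python
def effective_ghz(n_cpus, ghz_each, efficiency=0.90):
    """Rule of thumb from the slide: lock contention costs about 10%,
    so total throughput is roughly n * clock * 0.90."""
    return n_cpus * ghz_each * efficiency
```

Two 1 GHz processors thus come out as a 1.8 GHz-equivalent uniprocessor, matching the slide's example.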
Multiprocessors Again
• Before the Pentium, a multiprocessor needed extra hardware to prevent the CPUs from causing a race error of some sort
• With the Pentium, sharing four pins was all the hardware support that was needed
• The next advance was multicore chips
Multicore Chips
• Instead of one very fast CPU on a chip, put two not-so-fast CPUs
• These are the multicore chips
• They actually remove some of the complexity of pipelining to make each core smaller, and also use a slower and cooler technology
Manufacturer’s Offerings
• Intel’s HyperThreading chips were a transitional form
• AMD and Intel dual-core processors became available in 2005
• Sun had a 4-core SPARC to be released in 2005-2006
• Microsoft changed its license to be per chip, so that a multi-core chip is considered one processor
Disadvantages
• The bus to the memory becomes the bottleneck
• Several things access the memory independently: two or more CPUs, plus Direct Memory Access controllers (disk controllers, video)
• One solution is dual-port memory
• Separate caches can also help
• Another solution is to give each processor its own local, private memory, but this diminishes the sharing that can go on
Chip MultiProcessors
Multicomputers
• When the number of connections gets large, sharing memory gets hard
• A multicomputer consists of many parallel processors, each with its own memory and disk
• Communication is then accomplished by messages sent from one to all, or from one to another
• Grid computing is one alternative
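Message passing between nodes with private memory can be sketched with processes standing in for the individual computers. This is an invented illustration: each node sees only its own data, and the only way results reach anyone else is an explicit message.

```python
from multiprocessing import Process, Queue

def node(node_id, outbox):
    """One computer in the multicomputer: it works only on its private
    data and communicates solely by sending a message."""
    local_data = range(node_id * 10, node_id * 10 + 10)
    outbox.put((node_id, sum(local_data)))   # one-to-one message

if __name__ == "__main__":
    results = Queue()
    workers = [Process(target=node, args=(i, results)) for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    total = sum(results.get()[1] for _ in workers)
    print(total)   # sum of 0..39 = 780
```

No memory is shared here at all; the queue plays the role of the interconnect, which is what distinguishes a multicomputer from the shared-memory multiprocessors above.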
Conclusion
• Moore’s Law has not been just about better integration techniques
• Parallelism in the single CPU and in multiple CPUs has also contributed
• Pipelining has been the major technique for single CPUs
• There are other presentations on multicomputer and multiprocessor systems