
Advanced Computer Architecture

What is Parallel Processing?

Parallel processing is another method used to improve performance in a computer system. When a system processes two different instructions simultaneously, it is performing parallel processing.

Nowadays, commercial applications are the most common workloads on parallel computers. A computer that runs such an application has to be able to process large amounts of data in sophisticated ways. We can say with little doubt that commercial applications will shape future parallel computer architectures, but scientific applications will remain important users of parallel computing technology. Trends in commercial and scientific applications are merging, as commercial applications perform more sophisticated computations and scientific applications become more data intensive. Today, many parallel programming languages and compilers, based on dependencies detected in source code, are able to automatically split a program into multiple processes and/or threads to be executed concurrently on the available processors of a parallel system.

Parallel processing is a cost-effective means to improve system performance through concurrent activities in the computer. This presentation covers the basics of parallel computing: beginning with a brief overview and some of the concepts and terminology associated with parallel computing, it then explores parallel memory architectures, parallel computer architectures, and parallel programming models.

Introduction

Parallel computing is an efficient form of information processing which emphasizes the exploitation of concurrent events in the computing process. Concurrency implies parallelism, simultaneity and pipelining. Parallel events may occur in multiple resources during the same time interval; simultaneous events may occur at the same time instant; and pipelined events may occur in overlapped time spans. Parallel processing demands concurrent execution of many programs in the computer. The highest level of parallel processing is conducted among multiple jobs or programs through multiprogramming, time-sharing, and multiprocessing.

What is Parallel Computing?

Traditionally, software has been written for serial computation: it is executed by a single computer having a single Central Processing Unit (CPU), the problem is solved as a series of instructions executed one after the other by the CPU, and only one instruction may be executed at any moment in time.

Parallel computing, by contrast, is the simultaneous use of multiple compute resources to solve a computational problem. The compute resources can include a single computer with multiple processors, an arbitrary number of computers connected by a network, or a combination of both.

The computational problem usually demonstrates characteristics such as the ability to be (see the sketch after this list):

1) Broken apart into discrete pieces of work that can be solved simultaneously.

2) Executed as multiple program instructions at any moment in time.

3) Solved in less time with multiple compute resources than with a single compute resource.
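As a minimal sketch of these characteristics in ordinary code, the Python fragment below breaks one problem into discrete pieces and solves them simultaneously on several worker processes; the squared-sum problem, the chunk size, and the worker count are illustrative choices, not anything prescribed by this text.

# Minimal sketch: one problem broken into discrete pieces of work that are
# solved simultaneously on multiple compute resources (worker processes).
from multiprocessing import Pool

def partial_sum(bounds):
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

if __name__ == "__main__":
    n = 1_000_000
    step = 250_000
    chunks = [(i, min(i + step, n)) for i in range(0, n, step)]
    with Pool(processes=4) as pool:                  # four compute resources
        total = sum(pool.map(partial_sum, chunks))   # pieces solved in parallel
    print(total == sum(i * i for i in range(n)))     # same answer, less wall-clock time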

Why Use Parallel Computing?

There are two primary reasons for using parallel computing:

a) Save time - wall clock time

b) Solve larger problems

Other reasons might include:

A) Taking advantage of non-local resources - using available compute resources on a wide area network, or even the Internet when local compute resources are scarce.

B) Cost savings - using multiple "cheap" computing resources instead of paying for time on a supercomputer.

C) Overcoming memory constraints - single computers have very finite memory resources. For large problems, using the memories of multiple computers may overcome this obstacle.

D) Transmission speeds - the speed of a serial computer is directly dependent upon how fast data can move through hardware. Absolute limits are the speed of light (30 cm/nanosecond) and the transmission limit of copper wire (9 cm/nanosecond). Increasing speeds necessitate increasing proximity of processing elements.

1) Concepts of Parallel Computing: Parallelism in Uniprocessor Systems

Parallelism techniques can also be introduced in uniprocessor systems, which have a single processor. These techniques are:

A) Multiplicity of functional units: The functions of the ALU can be distributed to multiple, specialized functional units which can operate in parallel. For example, the CDC 6600 uniprocessor has 10 functional units built into its CPU. These 10 units are independent of each other and may operate simultaneously.

B) Parallelism and pipelining within the CPU: Parallel adders using techniques such as carry-lookahead and carry-save are now built into almost all ALUs. High-speed multiplier recoding and convergence division are techniques for exploiting parallelism.

The various phases of instruction execution are now pipelined, including instruction fetch, decode, operand fetch, arithmetic-logic execution, and storing the result. To facilitate overlapped instruction execution through the pipe, instruction prefetch and data buffering techniques have been developed.

C) Overlapped CPU and I/O operations: I/O operations can be performed simultaneously with CPU computations by using separate I/O controllers, channels, and I/O processors. The DMA channel can be used to provide direct information transfer between I/O devices and main memory.

4.2 Architectural Classification Schemes

4.2.1 Flynn's Classification

The most popular taxonomy of computer architecture was defined by Flynn in 1966. Flynn's classification scheme is based on the notion of a stream of information. Two types of information flow into a processor: instructions and data. The instruction stream is defined as the sequence of instructions performed by the processing unit. The data stream is defined as the data traffic exchanged between the memory and the processing unit. According to Flynn's classification, either of the instruction or data streams can be single or multiple. Computer architecture can thus be classified into the following four distinct categories:

single-instruction single-data streams (SISD);

single-instruction multiple-data streams (SIMD);

multiple-instruction single-data streams (MISD); and

multiple-instruction multiple-data streams (MIMD).

Conventional single-processor von Neumann computers are classified as SISD systems. Parallel computers are either SIMD or MIMD. When there is only one control unit and all processors execute the same instruction in a synchronized fashion, the parallel machine is classified as SIMD. In a MIMD machine, each processor has its own control unit and can execute different instructions on different data. In the MISD category, the same stream of data flows through a linear array of processors executing different instruction streams. In practice, there is no viable MISD machine; however, some authors have considered pipelined machines (and perhaps systolic-array computers) as examples of MISD. An extension of Flynn's taxonomy was introduced by D. J. Kuck in 1978. In his classification, Kuck extended the instruction stream further to single (scalar and array) and multiple (scalar and array) streams. The data stream in Kuck's classification is called the execution stream and is also extended to include single (scalar and array) and multiple (scalar and array) streams. The combination of these streams results in a total of 16 categories of architectures.

4.2.1.1 SISD Architecture

A serial (non-parallel) computer.

Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle.

Single data: only one data stream is being used as input during any one clock cycle.

Deterministic execution.

This is the oldest and, until recently, the most prevalent form of computer.

Examples: most PCs, single-CPU workstations and mainframes.

Figure 4.1 SISD COMPUTER

4.2.1.2 SIMD Architecture

A type of parallel computer.

Single instruction: all processing units execute the same instruction at any given clock cycle.

Multiple data: each processing unit can operate on a different data element.

This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity instruction units.

Best suited for specialized problems characterized by a high degree of regularity, such as image processing.

Synchronous (lockstep) and deterministic execution.

Two varieties: Processor Arrays and Vector Pipelines.

Examples:

o Processor Arrays: Connection Machine CM-2, MasPar MP-1, MP-2

o Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi S820

Figure 4.2 SIMD COMPUTER

CU - control unit
PU - processor unit
MM - memory module
SM - shared memory
IS - instruction stream
DS - data stream
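As an analogy only (ordinary NumPy on a conventional CPU, not a model of the machines listed above), the sketch below contrasts an explicit element-by-element loop with a single array operation applied across all data elements, which is the essence of the SIMD idea of one instruction acting on multiple data.

# Analogy: one "instruction" (an array operation) applied to many data
# elements at once, versus an explicit element-by-element loop.
import numpy as np

a = np.arange(100_000, dtype=np.float64)
b = np.arange(100_000, dtype=np.float64)

# SISD-style: one operation on one pair of data items per step.
c_loop = np.empty_like(a)
for i in range(a.size):
    c_loop[i] = a[i] + b[i]

# SIMD-style: a single vectorized add applied across all elements.
c_vec = a + b

print(np.array_equal(c_loop, c_vec))   # True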

4.2.1.3 MISD Architecture

There are n processor units, each receiving distinct instructions but operating over the same data stream and its derivatives. The output of one processor becomes the input of the next in the macro-pipe. No real embodiment of this class exists.

A single data stream is fed into multiple processing units.

Each processing unit operates on the data independently via independent instruction streams.

Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971).

Some conceivable uses might be:

o multiple frequency filters operating on a single signal stream

o multiple cryptography algorithms attempting to crack a single coded message.

Figure 4.3 MISD COMPUTER

4.2.1.4 MIMD Architecture

Multiple-instruction multiple-data streams (MIMD) parallel architectures are made of multiple processors and multiple memory modules connected together via some interconnection network. They fall into two broad categories: shared memory or message passing. Processors exchange information through their central shared memory in shared memory systems, and exchange information through their interconnection network in message passing systems.

Currently, this is the most common type of parallel computer; most modern computers fall into this category.

Multiple instruction: every processor may be executing a different instruction stream.

Multiple data: every processor may be working with a different data stream.

Execution can be synchronous or asynchronous, deterministic or non-deterministic.

Examples: most current supercomputers, networked parallel computer "grids", and multiprocessor SMP computers, including some types of PCs.

A shared memory system typically accomplishes interprocessor coordination through a global memory shared by all processors. These are typically server systems that communicate through a bus and cache memory controller.

A message passing system (also referred to as distributed memory) typically combines the local memory and processor at each node of the interconnection network. There is no global memory, so it is necessary to move data from one local memory to another by means of message passing.

Figure 4.4 MIMD COMPUTER
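The two coordination styles can be imitated in ordinary software. The sketch below uses Python's multiprocessing module purely as an analogy (a shared counter standing in for shared memory, a queue standing in for message passing over an interconnection network); it is not a model of any particular machine.

# Analogy: coordination through a shared variable versus coordination
# through explicit message passing.
from multiprocessing import Process, Queue, Value

def add_shared(counter, amount):
    with counter.get_lock():          # coordinate through shared memory
        counter.value += amount

def send_message(queue, amount):
    queue.put(amount)                 # coordinate by sending a message

if __name__ == "__main__":
    counter = Value("i", 0)
    workers = [Process(target=add_shared, args=(counter, 10)) for _ in range(4)]
    for w in workers: w.start()
    for w in workers: w.join()

    queue = Queue()
    senders = [Process(target=send_message, args=(queue, 10)) for _ in range(4)]
    for s in senders: s.start()
    total = sum(queue.get() for _ in range(4))
    for s in senders: s.join()

    print(counter.value, total)       # 40 40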

Computer Class / Computer System Models

1. SISD: IBM 701, IBM 1620, IBM 7090, PDP VAX11/780

2. SISD (with multiple functional units): IBM 360/91 (3); IBM 370/168 UP

3. SIMD (word-slice processing): Illiac IV; PEPE

4. SIMD (bit-slice processing): STARAN; MPP; DAP

5. MIMD (loosely coupled): IBM 370/168 MP; Univac 1100/80

6. MIMD (tightly coupled): Burroughs D-825

Table 4.1 Flynn's Computer System Classification

4.2.2 Feng's Classification

Tse-yun Feng suggested the use of the degree of parallelism to classify various computer architectures.

Serial Versus Parallel Processing

The maximum number of binary digits that can be processed within a unit time by a computer system is called the maximum parallelism degree P.

A bit slice is a string of bits, one from each of the words at the same vertical position.

There are four types of processing methods under this classification:

Word Serial and Bit Serial (WSBS)

Word Parallel and Bit Serial (WPBS)

Word Serial and Bit Parallel (WSBP)

Word Parallel and Bit Parallel (WPBP)

WSBS has been called bit-serial processing because one bit is processed at a time.

WPBS has been called bit-slice processing because an m-bit slice is processed at a time.

WSBP is found in most existing computers and has been called word-slice processing because one word of n bits is processed at a time.

WPBP is known as fully parallel processing, in which an array of n x m bits is processed at one time.

Mode / Computer Model / Degree of parallelism (n, m)

WSBS (n = 1, m = 1): The MINIMA (1, 1)

WPBS (n = 1, m > 1): STARAN (1, 256); MPP (1, 16384); DAP (1, 4096)

WSBP (n > 1, m = 1) (word-slice processing): IBM 370/168 UP (64, 1); CDC 6600 (60, 1); Burroughs 7700 (48, 1); VAX 11/780 (16/32, 1)

WPBP (n > 1, m > 1) (fully parallel processing): Illiac IV (64, 64)

Table 4.2 Feng's Computer Classification

4.2.3 Handler's Classification

Wolfgang Handler has proposed a classification scheme for identifying the parallelism degree and pipelining degree built into the hardware structure of a computer system. He considers three subsystem levels:

Processor Control Unit (PCU)

Arithmetic Logic Unit (ALU)

Bit Level Circuit (BLC)

Each PCU corresponds to one processor or one CPU. The ALU is equivalent to a processing element (PE). The BLC corresponds to the combinational logic circuitry needed to perform 1-bit operations in the ALU.

A computer system C can be characterized by a triple containing six independent entities:

T(C) = <K x K', D x D', W x W'>

where K is the number of processor control units (PCUs), K' the number of PCUs that can be pipelined, D the number of ALUs (PEs) controlled by each PCU, D' the number of ALUs that can be pipelined, W the word length of an ALU (or PE), and W' the number of pipeline stages in each ALU.

4-Stage Pipeline

[1] FI: Fetch an instruction from memory
[2] DA: Decode the instruction and calculate the effective address of the operand
[3] FO: Fetch the operand
[4] EX: Execute the operation

INSTRUCTION PIPELINE

Figure: Execution of three instructions (i, i+1, i+2) in a 4-stage pipeline. In conventional (sequential) execution, instruction i+1 does not begin FI until instruction i has finished EX; in pipelined execution, the FI, DA, FO, and EX stages of the three instructions overlap.

INSTRUCTION EXECUTION IN A 4-STAGE PIPELINE

Figure: Timing of the instruction pipeline, showing seven instructions passing through the FI, DA, FO, and EX stages over thirteen clock cycles; instruction 3 is a branch, so the instructions fetched behind it are held up until the branch is resolved.

Figure: Flowchart of a four-segment instruction pipeline. Segment 1 fetches the instruction from memory; Segment 2 decodes the instruction and calculates the effective address, then tests whether it is a branch; Segment 3 fetches the operand from memory; Segment 4 executes the instruction, tests for an interrupt (emptying the pipe and entering interrupt handling if one is pending), and updates the PC.
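The overlap shown in these diagrams can be reproduced with a short simulation. The sketch below is an idealized model (no branches, interrupts, or stalls; the stage names are the FI, DA, FO, and EX stages listed above): it prints the stage each instruction occupies in every clock cycle, and the total cycle count for n instructions in a k-stage pipeline comes out to k + n - 1.

# Idealized 4-stage instruction pipeline: instruction i enters FI in clock
# cycle i and advances one stage per cycle, with no stalls or branches.
STAGES = ["FI", "DA", "FO", "EX"]

def simulate(n_instructions):
    k = len(STAGES)
    total_cycles = k + n_instructions - 1            # pipelined completion time
    for cycle in range(1, total_cycles + 1):
        busy = []
        for instr in range(1, n_instructions + 1):
            stage_index = cycle - instr              # stage occupied this cycle
            if 0 <= stage_index < k:
                busy.append(f"I{instr}:{STAGES[stage_index]}")
        print(f"cycle {cycle:2d}: " + "  ".join(busy))
    return total_cycles

print("total cycles:", simulate(3))                  # 3 instructions -> 6 cycles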

Instruction buffers

To take full advantage of pipelining, the pipeline should be kept filled continuously.

The instruction fetch rate should be matched with the pipeline consumption rate. To do this, instruction buffers are used.

An instruction buffer is a high-speed memory in the CPU for storing instructions. Instructions are prefetched into the buffer from main memory.

Another alternative to the instruction buffer is a cache memory placed between the CPU and main memory.

The advantage of the cache memory is that it can be used for both instructions and data. But the cache requires more complex control logic than the instruction buffer.

Arithmetic Pipeline

Complex arithmetic operations, such as multiplication and floating-point operations, consume much of the ALU's time.

These operations can also be pipelined by segmenting the operations of the ALU and as a consequence, high speed performance may be achieved.

Thus, the pipelines used for arithmetic operations are known as arithmetic pipelines.

Arithmetic Pipelines

The technique of pipelining can be applied to various complex and slow arithmetic operations to speed up the processing time.

Arithmetic pipelines are built around arithmetic operations; they are constructed for simple fixed-point and complex floating-point arithmetic operations.

These arithmetic operations are well suited to pipelining as these operations can be efficiently partitioned into subtasks for the pipeline stages.

For implementing arithmetic pipelines we generally use the following two types of adders:

Carry propagation adder (CPA): It adds two numbers such that the carries generated in successive digits are propagated.

Carry save adder (CSA): It adds three numbers such that the carries generated are not propagated; instead, they are saved in a carry vector.

Fixed-Point Arithmetic Pipelines

Example: multiplication of two fixed-point numbers.

Two fixed-point numbers are multiplied by the ALU using add and shift operations. This sequential execution makes multiplication a slow process. Multiplication is the process of adding multiple copies of shifted multiplicands, and it can be organized into the following pipeline stages.

The first stage generates the partial products of the numbers, which form the six rows of shifted multiplicands.

In the second stage, the six numbers are given to two CSAs, which merge them into four numbers.

In the third stage, a single CSA merges the four numbers into three numbers.

In the fourth stage, a single CSA merges the three numbers into two numbers.

In the fifth stage, the last two numbers are added through a CPA to get the final product.
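A rough software model of this reduction is sketched below, assuming unsigned operands and treating each shifted multiplicand as an integer. The 3-to-2 carry-save step uses the standard bitwise form (sum = a XOR b XOR c, carry = majority of a, b, c shifted left one place), and the final ordinary addition stands in for the CPA; the stage numbering follows the description above.

# Sketch of a carry-save multiplication for unsigned fixed-point numbers.
# Each CSA compresses three numbers into two (a sum word and a carry word)
# without propagating carries; one carry-propagate addition finishes the job.

def csa(a, b, c):
    # 3-to-2 carry-save adder: a + b + c == s + carry
    s = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return s, carry

def multiply(x, y, bits=6):
    # Stage 1: generate the shifted multiplicands (partial products).
    pps = [(x << i) if (y >> i) & 1 else 0 for i in range(bits)]
    # Stage 2: two CSAs merge six numbers into four.
    s1, c1 = csa(pps[0], pps[1], pps[2])
    s2, c2 = csa(pps[3], pps[4], pps[5])
    # Stage 3: one CSA merges the four numbers into three.
    s3, c3 = csa(s1, c1, s2)
    # Stage 4: one CSA merges the three numbers into two.
    s4, c4 = csa(s3, c3, c2)
    # Stage 5: a carry-propagate addition forms the final product.
    return s4 + c4

print(multiply(45, 57) == 45 * 57)   # True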

Floating-Point Arithmetic Pipelines

Floating-point computations are among the best candidates for pipelining. Example: addition of two floating-point numbers. The following stages are identified for the addition of two floating-point numbers: the first stage compares the exponents of the two numbers; the second stage aligns the mantissas; in the third stage, the mantissas are added; and in the last stage, the result is normalized.

ARITHMETIC PIPELINE

Floating-point adder:

[1] Compare the exponents
[2] Align the mantissas
[3] Add/subtract the mantissas
[4] Normalize the result

X = A x 2^a, Y = B x 2^b

Figure: Pipeline for floating-point addition and subtraction. Segment 1 compares the exponents a and b by subtraction and chooses the larger exponent; Segment 2 aligns the mantissa of the smaller-exponent operand according to the exponent difference; Segment 3 adds or subtracts the mantissas; Segment 4 normalizes the result and adjusts the exponent. Intermediate results are held in registers (R) between the segments.
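The four segments can be mimicked on a toy representation in which a value is an integer mantissa scaled by a power of two. The sketch below models only the data flow through the segments; the mantissa width and the example operands are illustrative assumptions, and this is not IEEE 754 behaviour.

# Toy model of the 4-segment floating-point adder. A value is held as
# (mantissa, exponent) with value = mantissa * 2**exponent, and a normalized
# mantissa lies in the range [2**(p-1), 2**p).

def fp_add(a_mant, a_exp, b_mant, b_exp, p=8):
    # Segment 1: compare the exponents (by subtraction) and keep the larger.
    diff = a_exp - b_exp
    exp = max(a_exp, b_exp)
    # Segment 2: align the mantissa of the operand with the smaller exponent.
    if diff >= 0:
        b_mant >>= diff
    else:
        a_mant >>= -diff
    # Segment 3: add the aligned mantissas (subtraction would use the sign).
    mant = a_mant + b_mant
    # Segment 4: normalize the result and adjust the exponent.
    while mant >= (1 << p):                # mantissa overflowed its width
        mant >>= 1
        exp += 1
    while 0 < mant < (1 << (p - 1)):       # leading zeros after cancellation
        mant <<= 1
        exp -= 1
    return mant, exp

# Example: X = 160 * 2**3 (= 1280) and Y = 200 * 2**1 (= 400).
m, e = fp_add(160, 3, 200, 1)
print(m, e, m * 2**e)                      # 210 3 1680, i.e. X + Y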

Classification according to pipeline configuration

Unifunction Pipelines: When a fixed and dedicated function is performed through a pipeline, it is called a Unifunction pipeline.

Multifunction Pipelines: When different functions at different times are performed through the pipeline, this is known as Multifunction pipeline. Multifunction pipelines are reconfigurable at different times according to the operation being performed

Classification according to type of instruction and data

Scalar Pipelines: This type of pipeline processes scalar operands of repeated scalar instructions.

Vector Pipelines: This type of pipeline processes vector instructions over vector operands.

PIPELINE AND MULTIPLE FUNCTION UNITS

Multiple Functional Units

Example: a 4-stage pipeline with one suboperation in each stage, a stage delay tp = 20 ns, and 100 tasks to be executed.

One task in the non-pipelined system takes k * tp = 4 * 20 = 80 ns.

Pipelined system: (k + n - 1) * tp = (4 + 99) * 20 = 2060 ns.

Non-pipelined system: n * k * tp = 100 * 80 = 8000 ns.

Speedup: Sk = 8000 / 2060 = 3.88.

The 4-stage pipeline is thus basically comparable to a system with 4 identical functional units operating in parallel.

Pipelining


Efficiency: The efficiency of a pipeline can be measured as the ratio of the busy time span to the total time span, including idle time. Let m be the number of pipeline stages, n the number of tasks, and c the clock period of the pipeline. The efficiency E can then be written as:

E = (n * m * c) / (m * [m * c + (n - 1) * c]) = n / (m + (n - 1))

As n approaches infinity, E approaches 1.

Throughput: The throughput of a pipeline can be defined as the number of results achieved per unit time. It can be written as:

T = (n / [m + (n - 1)]) / c = E / c

Throughput denotes the computing power of the pipeline.

The maximum speedup, efficiency, and throughput are ideal (best-case) values.
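These relations can be checked numerically. The short sketch below evaluates the speedup, efficiency, and throughput for the earlier example of a 4-stage pipeline with a 20 ns clock processing 100 tasks; the function and variable names are introduced here only for illustration.

# Pipeline performance measures for m stages, n tasks, and clock period c.
def pipeline_metrics(m, n, c):
    t_nonpipe = n * m * c                  # every task passes serially through all m stages
    t_pipe = (m + n - 1) * c               # pipelined: the first task fills the pipe
    speedup = t_nonpipe / t_pipe
    efficiency = n / (m + n - 1)           # fraction of stage-time that is busy
    throughput = efficiency / c            # results completed per unit time
    return speedup, efficiency, throughput

s, e, t = pipeline_metrics(m=4, n=100, c=20e-9)
print(f"speedup    = {s:.2f}")             # about 3.88
print(f"efficiency = {e:.3f}")             # about 0.971
print(f"throughput = {t / 1e6:.1f} million results per second")   # about 48.5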

Limitations to speedup

Data dependency between successive tasks: There may be dependencies between the instructions of two tasks used in the pipeline.

For example, one instruction cannot be started until the previous instruction returns its results, because the two are interdependent.

Another instance of data dependency arises when both instructions try to modify the same data object. These are called data hazards.

Resource constraints: When resources are not available at the time of execution, delays are caused in the pipeline. For example: 1) If one common memory is used for both data and instructions and there is a need to read/write data and fetch an instruction at the same time, only one can be carried out and the other has to wait.

2) A limited resource, such as an execution unit, may be busy at the required time.

Branch instructions and interrupts in the program: A program is not a straight flow of sequential instructions. There may be branch instructions that alter the normal flow of the program, which delays pipelined execution and affects performance. Similarly, there are interrupts that postpone the execution of the next instruction until the interrupt has been serviced. Branches and interrupts both have damaging effects on pipelining.
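The cost of a data dependency can be seen by extending the earlier pipeline model: if an instruction may not fetch its operand until its predecessor has finished executing, it must be stalled, and every later instruction slips with it. The dependency pattern and the stall rule below are illustrative assumptions, not taken from this text.

# Effect of a read-after-write dependency on the 4-stage pipeline: a dependent
# instruction may not enter FO until its producer has completed EX.
STAGES = ["FI", "DA", "FO", "EX"]

def cycles_with_stalls(n, depends_on_previous):
    # start[i] = clock cycle (0-based) in which instruction i enters FI.
    start = [0] * n
    for i in range(1, n):
        start[i] = start[i - 1] + 1
        if depends_on_previous[i]:
            # Producer finishes EX in cycle start[i-1] + 3, so the consumer's
            # FO (cycle start[i] + 2) must come later; delay its start if needed.
            start[i] = max(start[i], start[i - 1] + 2)
    return start[-1] + len(STAGES)         # cycle count until the last EX completes

print(cycles_with_stalls(3, [False, False, False]))   # 6 = k + n - 1, no stalls
print(cycles_with_stalls(3, [False, True, False]))    # 7, one stall cycle inserted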

Five segments of the Instruction Pipeline

Fetch Instruction

Decode

Fetch Operands

Execute

Store Results

Figure: Overlapped execution of instructions I1 through I8 in the five-segment pipeline without branching, over clock cycles 0-22.

Figure: Effect of branching on the performance of the instruction pipeline (instructions I1 through I8, clock cycles 0-22).

Timing Diagram for Instruction Pipeline Operation


Pipeline Throughput

The average number of task initiations per clock cycle.

Dynamic Pipelines and Reconfigurability

A dynamic pipeline may initiate tasks from different reservation tables simultaneously, to allow multiple initiations of different functions in the same pipeline.

It is assumed that any computation step can be delayed by inserting non-compute stages.

Pipelines with a perfect initiation cycle can be better utilized than those with a non-perfect initiation cycle.