Lesson 02: Pipeline Performance

Chapter 06: Instruction Pipelining and Parallel Processing

Schaum’s Outline of Theory and Problems of Computer Architecture. Copyright © The McGraw-Hill Companies Inc. Indian Special Edition 2009.

Objective

• Learn how to compute the instruction pipeline performance and throughput

• Learn how uneven latencies in the datapath logic circuits affect the pipeline

• Understand the effect of pipeline latches (buffers)

• Understand the effect of a large number of stages on performance


Cycle Time


First impression about cycle time of pipelined processor

• First impression─ executing an instruction still requires five cycles, and the total clock period of the five stages is the same as the single cycle of the unpipelined processor

• Each instruction takes five clock cycles to complete in the five-stage pipeline

• So, at first glance, pipelining does not seem to increase the performance of the processor


First impression about Cycle Time of Pipelined Processor

• Pipelining a processor generally increases the number of clock cycles it takes to execute a program

• Some instructions get held up in the pipeline waiting for the instructions that generate their inputs to execute


Unpipelined processor


Pipelined five-stage processor


Facts about Cycle Time of Pipelined Processor

• The performance benefit of pipelining comes from the fact that fewer logic gates in the datapath have to be evaluated within a single cycle in a pipelined processor

• Unpipelined processor: a single clock cycle of period Cycle Time unpipelined = T

• Pipelined processor: because each cycle covers fewer logic gates in the datapath, an instruction now needs five clock cycles, but each has period Cycle Time pipelined = T', about T ÷ 5


Cycle Time of Pipelined Processor


Cycle Time of Pipelined Processor

• Pipelined processors can be clocked at a faster clock rate, and thus can have shorter cycle times (more cycles per second from the faster clock), than unpipelined implementations of the same processor


Throughput

• Since the pipelined processor has a throughput of one instruction per cycle, the total number of instructions executed per unit time is higher in the pipelined processor, giving better performance
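
As a quick illustration of the throughput gain, here is a minimal Python sketch; the function name and the 25 ns / 6 ns cycle times (taken from the worked example later in this lesson) are only illustrative:

```python
# Steady-state throughput: one instruction completes per cycle once the
# pipeline is full, so throughput is simply 1 / cycle time.
def throughput_ips(cycle_time_ns: float) -> float:
    """Instructions per second for a given cycle time in nanoseconds."""
    return 1.0 / (cycle_time_ns * 1e-9)

unpipelined = throughput_ips(25.0)   # 40 million instructions per second
pipelined_5 = throughput_ips(6.0)    # ~167 million instructions per second
print(f"speedup: {pipelined_5 / unpipelined:.2f}x")   # ~4.17x
```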


Factors for the cycle time of a pipelined processor

(i) The cycle time of the unpipelined version of the processor

(ii) The number of pipeline stages

(iii) How evenly the datapath logic is divided among the stages

(iv) The latency of the pipeline latches (buffers)


Clock Period

• Assume─ the logic is divided evenly among the pipeline stages

• Cycle Time pipelined = (Cycle Time unpipelined ÷ Number of Pipeline Stages) + Pipeline Latch Latency
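
As a minimal sketch of this formula (the function name and the use of nanoseconds are illustrative, not part of the slides):

```python
# Cycle time of an evenly divided pipeline:
# unpipelined cycle time / number of stages + latch latency.
def pipelined_cycle_time(unpipelined_ns: float, stages: int,
                         latch_ns: float) -> float:
    return unpipelined_ns / stages + latch_ns

# Values from the example later in the lesson: 25 ns, 5 stages, 1 ns latches.
print(pipelined_cycle_time(25.0, 5, 1.0))   # 6.0 ns
```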


A slightly higher clock rate

• Can be achieved by taking advantage of the fact that an n-stage pipeline requires only n − 1 pipeline latches

• Assign enough additional logic to the last stage of the pipeline to make its total latency equal to that of the other pipeline stages including their pipeline latches


Effect of Increasing Number of pipeline stages


Number of pipeline stages

• Each stage contains the same fraction of the original logic, plus one pipeline latch (buffer)


Increasing Number of pipeline stages

• The pipeline latch (buffer) latency becomes a greater and greater fraction of the cycle time

• This limits the benefit of dividing a processor into a very large number of pipeline stages


Example

• Assume─ an unpipelined version of the processor has a cycle time of 25 ns

• Assume─ the pipeline stages are evenly divided

• Assume─ each pipeline latch (buffer) has a latency of 1 ns


Cycle time of the 5 stage pipelined version of the processor

• Cycle time for the 5-stage pipeline = (25 ns/5)+ 1 ns = 6 ns


Cycle time of the 50 stage pipelined version of the processor

• For the 50-stage pipeline, cycle time = (25 ns/50) + 1 ns = 1.5 ns


Result Analysis

• In the 5-stage pipeline, the pipeline latch (buffer) latency is only 1/6 of the overall cycle time

• Pipeline latch (buffer) latency is 2/3 of the total cycle time in the 50-stage pipeline
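
Both cases can be checked with a short sketch (variable names are illustrative):

```python
# Latch overhead as a fraction of the cycle time for 5 and 50 stages.
unpipelined_ns, latch_ns = 25.0, 1.0

for stages in (5, 50):
    cycle = unpipelined_ns / stages + latch_ns   # 6.0 ns and 1.5 ns
    fraction = latch_ns / cycle                  # 1/6 and 2/3
    print(f"{stages:2d} stages: cycle = {cycle} ns, latch fraction = {fraction:.3f}")
```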


Another way of looking at the result

• 50-stage pipeline cycle time = one-fourth that of the 5-stage pipeline, at a cost of 10 times as many pipeline latches


Data Path Logic


Data Path Logic

• Often, the datapath logic cannot easily be divided into equal-latency pipeline stages


Example of Data Path Logic

• Accessing a register in the register file of a processor might take 3 ns

• Decoding an instruction might take 4 ns


Dividing a datapath into pipeline stages

• The designer balances the need for each stage to have the same latency against the difficulty of dividing the datapath into pipeline stages at different points


Dividing a datapath into pipeline stages

• The designer determines the amount of space that the latch (buffer) takes up on the chip

• The designer also balances the amount of data that has to be stored in the pipeline latch (buffer)


Instruction decode logic

• Often irregular, making it hard to split into stages

• The logic in other stages may generate a large number of intermediate data values that would have to be stored in the pipeline latch (buffer)


Efficient way to divide the data path logic into sections

• The designer places the pipeline latch (buffer) at a point where there are fewer intermediate results, and thus fewer bits that have to be stored in the latch (buffer)


Effect of Data Path Logic on Clock cycle time of the processor


Unequal-latency pipeline stages

• Clock cycle time of the processor = the latency of the longest pipeline stage + the pipeline latch (buffer) delay

• This is because the cycle time has to be long enough for the longest pipeline stage to complete and store its result in the pipeline latch (buffer) between it and the next stage
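
A minimal sketch of this rule, assuming stage latencies are given in nanoseconds (the helper name is illustrative):

```python
# Cycle time of a pipeline with unequal stage latencies:
# latency of the longest stage plus the pipeline latch delay.
def uneven_cycle_time(stage_latencies_ns, latch_ns: float) -> float:
    return max(stage_latencies_ns) + latch_ns

# Stage latencies from the example that follows: 5, 7, 3, 6, 4 ns, 1 ns latch.
print(uneven_cycle_time([5, 7, 3, 6, 4], 1.0))   # 8.0 ns
```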


Total latency of the longest pipeline stage


Example

• Assume─ an unpipelined processor with a 25-ns cycle time is divided into 5 pipeline stages with latencies of 5, 7, 3, 6, and 4 ns

• Assume─ the pipeline latch (buffer) latency = 1 ns


Solution

• The longest pipeline stage = 7 ns

• Adding a 1 ns pipeline latch (buffer) to this stage gives a total latency of 8 ns, which is the cycle time


Pipeline Latency of all stages


Pipeline Latency

• While pipelining reduces a processor's cycle time and thereby increases instruction throughput, it also increases the latency of the processor by at least the sum of all the pipeline latch latencies


Pipeline Latency

• The amount of time that a single instruction takes to pass through the pipeline

• Product of the number of pipeline stages and the clock cycle time
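
A minimal sketch of this definition (the function name is illustrative); the printed values match the examples that follow:

```python
# Pipeline latency: time for one instruction to pass through all stages.
def pipeline_latency(stages: int, cycle_time_ns: float) -> float:
    return stages * cycle_time_ns

print(pipeline_latency(5, 6.0))    # 30.0 ns for the 5-stage pipeline
print(pipeline_latency(50, 1.5))   # 75.0 ns for the 50-stage pipeline
```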


Example

• Assume─ an unpipelined processor with a cycle time of 25 ns

• Assume─ evenly divided into 5 pipeline stages using pipeline latches with 1-ns latency


Cycle time of the 5-stage pipelined version of the processor

• Cycle time = (25 ns / 5) + 1 ns = 6 ns

• Computing the latency of the pipeline─ multiply the cycle time by n

• Latency = 6 ns × 5 = 30 ns for n = 5


Cycle time of the 50-stage pipeline

• Cycle time = (25 ns / 50) + 1 ns = 1.5 ns

• Computing the latency of the pipeline─ multiply the cycle time by n

• Latency = 1.5 ns × 50 = 75 ns for n = 50


Result

• Shows the impact pipelining can have on latency, particularly as the number of stages grows.

• The n = 5 pipeline has a latency of 30 ns, 20 percent longer than the original 25-ns unpipelined processor

• The n = 50 pipeline has a latency of 75 ns, 3 times that of the original processor
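
These ratios follow directly from the numbers above; a one-line check (illustrative):

```python
# Latency increase relative to the 25 ns unpipelined processor.
for stages, latency_ns in ((5, 30.0), (50, 75.0)):
    print(stages, latency_ns / 25.0)   # 1.2 (i.e. 20% longer) and 3.0
```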


Formula for Pipelines with uneven pipeline stages


Formula for Pipelines with uneven pipeline stages

• The same formula applies, although such pipelines see an even greater increase in latency, because the cycle time must be long enough to accommodate the longest stage of the pipeline, even if the other stages are much shorter


Example

• Assume─ an unpipelined processor with a 25-ns cycle time divided into 5 pipeline stages with latencies of 5, 7, 3, 6, and 4 ns

• Assume─ the pipeline latch latency = 1 ns


Latency of the pipeline

• Longest stage latency = 7 ns

• Thus the cycle time = 7 ns + 1 ns = 8 ns

• Total latency of the pipeline for 5 stages = 8 ns × 5 = 40 ns
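
Putting the two formulas together, this result can be checked with a short sketch (variable names are illustrative):

```python
# Latency of the uneven 5-stage pipeline: cycle time is the longest stage
# plus the latch delay; latency is stages times the cycle time.
stage_latencies_ns = [5, 7, 3, 6, 4]
latch_ns = 1.0

cycle = max(stage_latencies_ns) + latch_ns   # 8.0 ns
latency = len(stage_latencies_ns) * cycle    # 40.0 ns
print(cycle, latency)
```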


Summary


We learnt

• Compute the instruction pipeline performance and throughput

• Datapath logic circuits and uneven latencies

• Effect of buffers

• Effect of a large number of stages on performance


End of Lesson 02: Pipeline Performance