Pipelining Performance
TRANSCRIPT
8/3/2019 Pipe Lining Performance
Understanding pipelining performance
The original Pentium 4 was a radical design for a number of reasons, but perhaps its most striking and controversial
feature was its extraordinarily deep pipeline. At over 20 stages, the Pentium 4's pipeline was almost twice as deep as the
pipelines of the P4's competitors. Recently, Prescott, the 90nm successor to the Pentium 4, took pipelining to the next
level by adding another 10 stages onto the Pentium 4's already unbelievably long pipeline.
Intel's strategy of deepening the Pentium 4's pipeline, a practice that Intel calls "hyperpipelining", has paid off in terms
of performance, but it is not without its drawbacks. In previous articles on the Pentium 4 and Prescott, I've referred to
the drawbacks associated with deep pipelines, and I've even tried to explain these drawbacks within the context of
larger technical articles on Netburst and other topics. In the present series of articles, I want to devote some serious
time to explaining pipelining, its effect on microprocessor performance, and its potential downsides. I'll take you
through a basic introduction to the concept of pipelining, and then I'll explain what's required to make pipelining
successful and what pitfalls face deeply pipelined designs like Prescott. By the end of the article, you should have a
clear grasp of exactly how pipeline depth is related to microprocessor performance on different types of code.
Pipelining Introduction
Let us break down our microprocessor into 5 distinct activities, which generally
correspond to 5 distinct pieces of hardware:
1. Instruction fetch (IF)
2. Instruction Decode (ID)
3. Execution (EX)
4. Memory Read/Write (MEM)
5. Result Writeback (WB)
Any given instruction will only require one of these modules at a time, generally in
this order. The following timing diagram of the multi-cycle processor will show this
in more detail:
This is all well and good, but at any given moment, 4 out of 5 units are idle and
could likely be put to other use.
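To make the idle-hardware problem concrete, here is a short Python sketch (a toy model for illustration only; the stage names follow the list above) tracing a multi-cycle machine that executes instructions one at a time. Exactly one of the five units is busy in any cycle, so the other four sit idle.

```python
# Toy model: multi-cycle execution, one instruction at a time.
# Each instruction occupies exactly one of the five units per cycle,
# so 4 of 5 units are idle in every cycle.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def multicycle_trace(n_instructions):
    """Return a list of (cycle, active_unit) pairs."""
    trace = []
    cycle = 0
    for _ in range(n_instructions):
        for stage in STAGES:
            trace.append((cycle, stage))  # only this unit is busy now
            cycle += 1
    return trace

trace = multicycle_trace(2)
# 2 instructions * 5 stages = 10 cycles total, one active unit per cycle.
print(len(trace))  # 10
```

Note that n instructions always cost 5n cycles here; pipelining, described next, attacks exactly this waste.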
Pipelining Philosophy
Pipelining is concerned with the following tasks:
- Use multi-cycle methodologies to reduce the amount of computation in a single cycle.
- Shorter computations per cycle allow for faster clock cycles.
- Overlapping instructions allows all components of a processor to operate on a different instruction.
- Throughput is increased by having instructions complete more frequently.
[Figure: Nopipeline.png]
We will talk about how to make these things happen in the remainder of the chapter.
Pipelining Hardware
Given our multicycle processor, what if we wanted to overlap our execution, so that up to 5 instructions could be
processed at the same time? Let's compress our timing diagram a little bit to show this idea:
As this diagram shows, each element in the processor is active in every cycle, and the instruction rate of the
processor has been increased by 5 times! The question now is, what additional hardware do we need in order to
perform this task? We need to add storage registers between each pipeline stage to store the partial results between
cycles, and we also need to reintroduce the redundant hardware from the single-cycle CPU. We can continue to use
a single memory module (for instructions and data), so long as we restrict memory read operations to the first half of
the cycle, and memory write operations to the second half of the cycle (or vice-versa). We can save time on the
memory access by calculating the memory addresses in the previous stage.
The registers would need to hold the data from the pipeline at that point, and also the necessary control codes to
operate the remainder of the pipeline.
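The shift-forward behavior of those pipeline registers can be sketched with a toy model (illustrative only, not a real datapath): treat the registers as slots in a shift register, so that at every clock edge each partial result moves one stage to the right and a new instruction enters IF.

```python
# Toy model: a 5-stage pipeline as a shift register of instructions.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_trace(instructions):
    """Return one {stage: instruction or None} snapshot per cycle."""
    n_cycles = len(instructions) + len(STAGES) - 1
    pipeline = [None] * len(STAGES)
    pending = list(instructions)
    snapshots = []
    for _ in range(n_cycles):
        # Clock edge: every partial result moves into the next stage's
        # pipeline register; a new instruction (if any) enters IF.
        pipeline = [pending.pop(0) if pending else None] + pipeline[:-1]
        snapshots.append(dict(zip(STAGES, pipeline)))
    return snapshots

snaps = pipeline_trace(["i1", "i2", "i3", "i4", "i5"])
# 5 instructions finish in 5 + (5 - 1) = 9 cycles instead of 25.
print(len(snaps))  # 9
```

Once the pipeline is full (cycle 5 here), every stage holds a different instruction, which is exactly the situation the timing diagram depicts.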
Our resultant processor design will look similar to this:
[Figures: Pipeline-base.png, Fivestagespipeline.png]
If we have 5 instructions, we can show them in our pipeline using different colors. In the diagram below, white
corresponds to a NOP, and the different colors correspond to other instructions in the pipeline. Each cycle, the
instructions shift forward through the pipeline.
Superpipeline
[Figures: Pipeline_3.png, Pipeline_MIPS.png]
Superpipelining is the technique of raising the pipeline depth in order to increase the clock speed and reduce the
latency of individual stages. If the ALU takes three times longer than any other module, we can divide the ALU into
three separate stages, which will reduce the amount of time wasted on shorter stages. The problem here is that we
need to find a way to subdivide our stages into shorter stages, and we also need to construct more complicated
control units to operate the pipeline and prevent all the possible hazards.
It is not uncommon for modern high-end processors to have more than 20 pipeline stages.
Example: Intel Pentium 4
The Intel Pentium 4 processor is a recent example of a super-pipelined processor. This diagram shows a Pentium 4
pipeline with 20 stages.
[Figure: Pentium4superpipeline.png]
Instruction pipeline
An instruction pipeline is a technique used in the design of computers and other digital electronic devices to
increase their instruction throughput (the number of instructions that can be executed in a unit of time).
The fundamental idea is to split the processing of a computer instruction into a series of independent steps, with
storage at the end of each step. This allows the computer's control circuitry to issue instructions at the processing
rate of the slowest step, which is much faster than the time needed to perform all steps at once. The term pipeline
refers to the fact that each step is carrying data at once (like water), and each step is connected to the next (like the
links of a pipe).
The origin of pipelining is thought to be either the ILLIAC II project or the IBM Stretch project, though a simple version
was used earlier in the Z1 in 1939 and the Z3 in 1941.[1]
The IBM Stretch project proposed the terms Fetch, Decode, and Execute, which became common usage.
Most modern CPUs are driven by a clock. The CPU consists internally of logic and memory (flip-flops). When the
clock signal arrives, the flip-flops take their new values and the logic then requires a period of time to decode the new
values. Then the next clock pulse arrives and the flip-flops again take their new values, and so on. By breaking the
logic into smaller pieces and inserting flip-flops between the pieces of logic, the delay before the logic gives valid
outputs is reduced. In this way the clock period can be reduced. For example, the classic RISC pipeline is broken into
five stages with a set of flip-flops between each stage.
1. Instruction fetch
2. Instruction decode and register fetch
3. Execute
4. Memory access
5. Register write back
When a programmer (or compiler) writes assembly code, they make the assumption that each instruction is executed
before execution of the subsequent instruction is begun. This assumption is invalidated by pipelining. When this
causes a program to behave incorrectly, the situation is known as a hazard. Various techniques for resolving hazards
such as forwarding and stalling exist.
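The cost of such a hazard can be sketched with a toy model (the stage indices and stall counts below are simplifying assumptions for illustration, not a description of any specific processor): count how many cycles a consumer must wait for a producer's result, with and without forwarding.

```python
# Toy model: stall cycles between two dependent instructions.
# `result_ready_after` is the stage index whose end produces the value,
# `needed_entering` is the stage index where the consumer reads it, and
# `gap` is how many cycles after the producer the consumer was issued.
# Assumed stage indices: IF=0, ID=1, EX=2, MEM=3, WB=4.

def stalls_between(result_ready_after, needed_entering, gap):
    return max(0, result_ready_after - needed_entering - gap)

# Back-to-back dependent instructions (gap = 1):
no_forwarding = stalls_between(4, 1, 1)    # wait for WB, read in ID -> stalls
with_forwarding = stalls_between(2, 2, 1)  # EX output forwarded into EX
print(no_forwarding, with_forwarding)  # 2 0
```

In this simplified model, forwarding the EX result straight into the next instruction's EX stage removes the stall entirely; without it, the consumer must idle until the producer writes back.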
A non-pipelined architecture is inefficient because some CPU components (modules) are idle while another module is
active during the instruction cycle. Pipelining does not completely eliminate idle time in a CPU, but making those
modules work in parallel improves program execution significantly.
Processors with pipelining are internally organized into stages which can work semi-independently on separate jobs.
Each stage is organized and linked into a 'chain' so each stage's output is fed to another stage until the job is done.
This organization of the processor allows overall processing time to be significantly reduced.
A deeper pipeline means that there are more stages in the pipeline, and therefore, fewer logic gates in each stage.
This generally means that the processor's frequency can be increased as the cycle time is lowered. This happens
because there are fewer components in each stage of the pipeline, so the propagation delay is decreased for the
overall stage.[2]
Unfortunately, not all instructions are independent. In a simple pipeline, completing an instruction may require 5
stages. To operate at full performance, this pipeline will need to run 4 subsequent independent instructions while the
first is completing. If 4 instructions that depend on the output of the first instruction are not available, the pipeline
control logic must insert a stall or wasted clock cycle into the pipeline until the dependency is resolved. Fortunately,
techniques such as forwarding can significantly reduce the cases where stalling is required. While pipelining can in
theory increase performance over an unpipelined core by a factor of the number of stages (assuming the clock
frequency also scales with the number of stages), in reality, most code does not allow for ideal execution.
Advantages and Disadvantages
Pipelining does not help in all cases. There are several possible disadvantages. An
instruction pipeline is said to be fully pipelined if it can accept a new instruction
every clock cycle. A pipeline that is not fully pipelined has wait cycles that delay the
progress of the pipeline.
Advantages of Pipelining:
1. The cycle time of the processor is reduced, thus increasing instruction issue-rate in
most cases.
2. Some combinational circuits such as adders or multipliers can be made faster by
adding more circuitry. If pipelining is used instead, it can save circuitry vs. a more
complex combinational circuit.
Disadvantages of Pipelining:
1. A non-pipelined processor executes only a single instruction at a time. This
prevents branch delays (in effect, every branch is delayed) and problems with
serial instructions being executed concurrently. Consequently, the design is
simpler and cheaper to manufacture.
2. The instruction latency in a non-pipelined processor is slightly lower than in a
pipelined equivalent. This is because extra flip-flops must be added to the data
path of a pipelined processor.
3. A non-pipelined processor will have a stable instruction bandwidth. The
performance of a pipelined processor is much harder to predict and may vary more
widely between different programs.
Examples
http://en.wikipedia.org/wiki/Instruction_pipeline#cite_note-Guardian-1http://en.wikipedia.org/wiki/Instruction_pipeline#cite_note-Guardian-1http://en.wikipedia.org/wiki/Instruction_pipeline#cite_note-Guardian-1http://en.wikipedia.org/wiki/Clock_cyclehttp://en.wikipedia.org/wiki/Clock_cyclehttp://en.wikipedia.org/wiki/Clock_cyclehttp://en.wikipedia.org/wiki/Flip-flop_(electronics)http://en.wikipedia.org/wiki/Flip-flop_(electronics)http://en.wikipedia.org/wiki/Flip-flop_(electronics)http://en.wikipedia.org/w/index.php?title=Instruction_pipeline&action=edit§ion=2http://en.wikipedia.org/w/index.php?title=Instruction_pipeline&action=edit§ion=2http://en.wikipedia.org/w/index.php?title=Instruction_pipeline&action=edit§ion=2http://en.wikipedia.org/w/index.php?title=Instruction_pipeline&action=edit§ion=2http://en.wikipedia.org/wiki/Flip-flop_(electronics)http://en.wikipedia.org/wiki/Clock_cyclehttp://en.wikipedia.org/wiki/Instruction_pipeline#cite_note-Guardian-1 -
Generic pipeline
Generic 4-stage pipeline; the colored boxes represent instructions independent of each other
The figure shows a generic pipeline with four stages:
1. Fetch
2. Decode
3. Execute
4. Write-back
(for lw and sw, memory is accessed after the execute stage)
The top gray box is the list of instructions waiting to be executed; the bottom gray box is
the list of instructions that have been completed; and the middle white box is the pipeline.
Execution is as follows:
Time Execution
0 Four instructions are awaiting execution
1 The green instruction is fetched from memory
2 The green instruction is decoded; the purple instruction is fetched from memory
3 The green instruction is executed (the actual operation is performed); the purple instruction is decoded; the blue instruction is fetched
4 The green instruction's results are written back to the register file or memory; the purple instruction is executed; the blue instruction is decoded; the red instruction is fetched
5 The green instruction is completed; the purple instruction is written back; the blue instruction is executed; the red instruction is decoded
6 The purple instruction is completed; the blue instruction is written back; the red instruction is executed
7 The blue instruction is completed; the red instruction is written back
8 The red instruction is completed
9 All instructions are executed
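The cycle-by-cycle schedule above can be reproduced with a short script (a sketch; the color names stand in for the four independent instructions):

```python
# Sketch: replay the 4-stage schedule from the table above.
STAGES = ["fetch", "decode", "execute", "write-back"]
INSTRUCTIONS = ["green", "purple", "blue", "red"]

def schedule(instructions, stages):
    """Map each cycle to {stage: instruction} for an ideal pipeline."""
    table = {}
    for cycle in range(1, len(instructions) + len(stages)):
        row = {}
        for i, instr in enumerate(instructions):
            stage_index = cycle - 1 - i  # instruction i enters at cycle i + 1
            if 0 <= stage_index < len(stages):
                row[stages[stage_index]] = instr
        table[cycle] = row
    return table

table = schedule(INSTRUCTIONS, STAGES)
print(table[4])  # all four stages busy: green writes back while red is fetched
```

Cycle 4 is the steady state: every stage holds a different instruction, matching the row where green, purple, blue, and red are all in flight at once.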
Mathematical pipelines
Mathematical or arithmetic pipelines are different from instruction pipelines in that,
when mathematically processing large arrays or vectors, a particular mathematical operation, such as a multiply, is
repeated many thousands of times. In this environment, an instruction need only kick off an event whereby the
arithmetic logic unit (which is pipelined) takes over and begins its series of calculations. Most of these circuits can be
found today in math processors and the math processing sections of CPUs like the Intel Pentium line.
History
Math processing (super-computing) began in earnest in the late 1970s with vector processors and array processors.
These were usually very large, bulky super-computing machines that needed special environments and super-cooling of
the cores. One of the early supercomputers was the Cyber series built by Control Data Corporation. Its main architect
was Seymour Cray, who later resigned from CDC to head up Cray Research. Cray Research developed the X-MP line of
supercomputers, using pipelining for both multiply and add/subtract functions. Later, Star Technologies took pipelining to
another level by adding parallelism (several pipelined functions working in parallel), developed by their engineer,
Roger Chen. In 1984, Star Technologies made another breakthrough with the pipelined divide circuit, developed by
James Bradley. By the mid-1980s, super-computing had taken off with offerings from many different companies
around the world.
Today, such circuits can be found embedded inside most microprocessors.
How is pipelining achieved in the 8086 microprocessor?
The execution unit (EU) tells the bus interface unit (BIU) from where to fetch instructions as well as where to read
data. The EU gets the opcode of an instruction from an instruction queue, then decodes and executes it. The BIU
and EU operate independently: while the EU is executing an instruction, the BIU fetches instruction codes from
memory and stores them in the queue. This type of overlapping operation of the BIU and EU functional units of a
microprocessor is called pipelining.
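That BIU/EU overlap can be sketched as a cycle-by-cycle toy model (a deliberately simplified picture: real 8086 bus cycles take several clocks, and the per-instruction execution time here is a made-up parameter). The 6-byte queue size matches the 8086's prefetch queue.

```python
# Toy model: overlap of the 8086 BIU and EU. While the EU executes,
# the BIU uses free cycles to top up the prefetch queue, so the EU
# rarely has to wait for an opcode fetch.
from collections import deque

QUEUE_SIZE = 6  # the 8086 prefetch queue holds up to 6 bytes

def run(program, exec_cycles=2):
    """Cycle-by-cycle model; returns (executed opcodes, EU wait cycles)."""
    queue = deque()
    fetched = 0
    eu_waits = 0
    executed = []
    busy = 0  # cycles left on the instruction the EU is executing
    while len(executed) < len(program):
        # BIU: fetch one byte per cycle whenever the queue has room.
        if fetched < len(program) and len(queue) < QUEUE_SIZE:
            queue.append(program[fetched])
            fetched += 1
        # EU: finish the current instruction, then pull the next opcode.
        if busy > 0:
            busy -= 1
        elif queue:
            executed.append(queue.popleft())
            busy = exec_cycles - 1
        else:
            eu_waits += 1
    return executed, eu_waits

done, waits = run(list(range(6)))
print(len(done), waits)  # 6 0 -- the queue hides the fetch latency
```

Because the EU spends two cycles per instruction while the BIU fetches one byte per cycle, the queue never runs dry and the EU never waits, which is exactly the overlap described above.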
Pipelining of Microcontrollers and Microprocessors
A few important characteristics and features of the pipeline concept:
- Processes more than one instruction at a time, and doesn't wait for one instruction to complete before
starting the next. Fetch, decode, execute, and write stages are executed in parallel
- As soon as one stage completes, it passes on the result to the next stage and then begins working on another instruction
- The performance of a pipelined system depends on the time it takes only for any one stage to be
completed, not on the total time for all stages as with non-pipelined designs
- Each instruction takes 1 clock cycle for each stage, so the processor can accept 1 new instruction per clock.
Pipelining doesn't improve the latency of instructions (each instruction still requires the same amount of time
to complete), but it does improve the overall throughput
- Sometimes pipelined instructions take more than one clock to complete a stage. When that happens, the
processor has to stall and not accept new instructions until the slow instruction has moved on to the next
stage
- A pipelined processor can stall for a variety of reasons, including delays in reading information from
memory, a poor instruction set design, or dependencies between instructions
- Memory speed issues are commonly solved using caches. A cache is a section of fast memory placed
between the processor and slower memory. When the processor wants to read a location in main memory, that
location is also copied into the cache. Subsequent references to that location can come from the cache,
which will return a result much more quickly than the main memory
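The cache behavior in that bullet can be sketched with a minimal direct-mapped cache (a toy model; the line count and address split are arbitrary choices for illustration): the first access to an address misses and copies it in, and repeats hit.

```python
# Toy model: a minimal direct-mapped cache.
class Cache:
    """Index selects a line; the stored tag detects whether it's a reuse."""

    def __init__(self, n_lines=8):
        self.n_lines = n_lines
        self.lines = {}  # index -> tag of the address currently cached

    def access(self, address):
        index = address % self.n_lines
        tag = address // self.n_lines
        if self.lines.get(index) == tag:
            return "hit"
        self.lines[index] = tag  # miss: copy the location into the cache
        return "miss"

cache = Cache()
print(cache.access(100), cache.access(100))  # miss hit
```

The second access returns from the cache rather than main memory, which is the speedup the bullet describes.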
- Dependencies. Since each instruction takes some amount of time to store its result, and several
instructions are being handled at the same time, later instructions may have to wait for the results of earlier
instructions to be stored. However, a simple rearrangement of the instructions in a program (called
Instruction Scheduling) can remove these performance limitations from RISC programs
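Instruction scheduling can be illustrated with a small sketch (the register names and the one-stall-per-load-use rule are illustrative assumptions modeled on a classic RISC load delay, not any particular ISA): moving an independent instruction between a load and its consumer hides the load's latency.

```python
# Sketch: instruction scheduling. A load's result is not ready for the
# instruction immediately behind it (a load-use hazard), so each
# dependent consumer directly after a load costs one stall cycle.

def count_load_use_stalls(program):
    """program: list of (op, dest, sources); one stall per load-use pair."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        op, dest, _ = prev
        if op == "load" and dest in cur[2]:
            stalls += 1
    return stalls

naive = [
    ("load", "r1", ["r10"]),
    ("add", "r2", ["r1", "r3"]),   # uses r1 right after the load: stall
    ("load", "r4", ["r11"]),
    ("add", "r5", ["r4", "r6"]),   # same pattern: another stall
]
scheduled = [
    ("load", "r1", ["r10"]),
    ("load", "r4", ["r11"]),       # independent load fills the gap
    ("add", "r2", ["r1", "r3"]),
    ("add", "r5", ["r4", "r6"]),
]
print(count_load_use_stalls(naive), count_load_use_stalls(scheduled))  # 2 0
```

The rearranged program computes exactly the same values but eliminates both stalls, which is the performance win the bullet attributes to instruction scheduling.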