Pipelining Performance
TRANSCRIPT
8/3/2019 Pipe Lining Performance
Understanding pipelining performance
The original Pentium 4 was a radical design for a number of reasons, but perhaps its most striking and controversial
feature was its extraordinarily deep pipeline. At over 20 stages, the Pentium 4's pipeline was almost twice as deep as the
pipelines of the P4's competitors. Recently, Prescott, the 90nm successor to the Pentium 4, took pipelining to the next
level by adding another 10 stages onto the Pentium 4's already unbelievably long pipeline.
Intel's strategy of deepening the Pentium 4's pipeline, a practice that Intel calls "hyperpipelining", has paid off in terms
of performance, but it is not without its drawbacks. In previous articles on the Pentium 4 and Prescott, I've referred to
the drawbacks associated with deep pipelines, and I've even tried to explain these drawbacks within the context of
larger technical articles on Netburst and other topics. In the present series of articles, I want to devote some serious
time to explaining pipelining, its effect on microprocessor performance, and its potential downsides. I'll take you
through a basic introduction to the concept of pipelining, and then I'll explain what's required to make pipelining
successful and what pitfalls face deeply pipelined designs like Prescott. By the end of the article, you should have a
clear grasp of exactly how pipeline depth is related to microprocessor performance on different types of code.
Pipelining Introduction
Let us break down our microprocessor into 5 distinct activities, which generally
correspond to 5 distinct pieces of hardware:
1. Instruction fetch (IF)
2. Instruction Decode (ID)
3. Execution (EX)
4. Memory Read/Write (MEM)
5. Result Writeback (WB)
Any given instruction will only require one of these modules at a time, generally in
this order. The following timing diagram of the multi-cycle processor will show this
in more detail:
This is all well and good, but at any given moment, 4 out of 5 units are idle and
could likely be put to other use.
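To make the idle-hardware problem concrete, here is a short Python sketch (a toy model for illustration only; the stage names follow the list above) tracing a multi-cycle machine that executes instructions one at a time. Exactly one of the five units is busy in any cycle, so the other four sit idle.

```python
# Toy model: multi-cycle execution, one instruction at a time.
# Each instruction occupies exactly one of the five units per cycle,
# so 4 of 5 units are idle in every cycle.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def multicycle_trace(n_instructions):
    """Return a list of (cycle, active_unit) pairs."""
    trace = []
    cycle = 0
    for _ in range(n_instructions):
        for stage in STAGES:
            trace.append((cycle, stage))  # only this unit is busy now
            cycle += 1
    return trace

trace = multicycle_trace(2)
# 2 instructions * 5 stages = 10 cycles total, one active unit per cycle.
print(len(trace))  # 10
```

Note that n instructions always cost 5n cycles here; pipelining, described next, attacks exactly this waste.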
Pipelining Philosophy
Pipelining is concerned with the following tasks:
- Use multi-cycle methodologies to reduce the amount of computation in a single cycle.
- Shorter computations per cycle allow for faster clock cycles.
- Overlapping instructions allows all components of a processor to operate on a different instruction.
- Throughput is increased by having instructions complete more frequently.
[Figure: Nopipeline.png]
We will talk about how to make these things happen in the remainder of the chapter.
Pipelining Hardware
Given our multicycle processor, what if we wanted to overlap our execution, so that up to 5 instructions could be
processed at the same time? Let's compress our timing diagram a little bit to show this idea:
As this diagram shows, each element in the processor is active in every cycle, and the instruction rate of the
processor has been increased by 5 times! The question now is, what additional hardware do we need in order to
perform this task? We need to add storage registers between each pipeline stage to store the partial results between
cycles, and we also need to reintroduce the redundant hardware from the single-cycle CPU. We can continue to use
a single memory module (for instructions and data), so long as we restrict memory read operations to the first half of
the cycle, and memory write operations to the second half of the cycle (or vice-versa). We can save time on the
memory access by calculating the memory addresses in the previous stage.
The registers would need to hold the data from the pipeline at that point, and also the necessary control codes to
operate the remainder of the pipeline.
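The shift-forward behavior of those pipeline registers can be sketched with a toy model (illustrative only, not a real datapath): treat the registers as slots in a shift register, so that at every clock edge each partial result moves one stage to the right and a new instruction enters IF.

```python
# Toy model: a 5-stage pipeline as a shift register of instructions.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_trace(instructions):
    """Return one {stage: instruction or None} snapshot per cycle."""
    n_cycles = len(instructions) + len(STAGES) - 1
    pipeline = [None] * len(STAGES)
    pending = list(instructions)
    snapshots = []
    for _ in range(n_cycles):
        # Clock edge: every partial result moves into the next stage's
        # pipeline register; a new instruction (if any) enters IF.
        pipeline = [pending.pop(0) if pending else None] + pipeline[:-1]
        snapshots.append(dict(zip(STAGES, pipeline)))
    return snapshots

snaps = pipeline_trace(["i1", "i2", "i3", "i4", "i5"])
# 5 instructions finish in 5 + (5 - 1) = 9 cycles instead of 25.
print(len(snaps))  # 9
```

Once the pipeline is full (cycle 5 here), every stage holds a different instruction, which is exactly the situation the timing diagram depicts.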
Our resultant processor design will look similar to this:
[Figures: Pipeline-base.png, Fivestagespipeline.png]
If we have 5 instructions, we can show them in our pipeline using different colors. In the diagram below, white
corresponds to a NOP, and the different colors correspond to other instructions in the pipeline. Each cycle, the
instructions shift forward through the pipeline.
Superpipeline
[Figures: Pipeline_3.png, Pipeline_MIPS.png]
Superpipelining is the technique of raising the pipeline depth in order to increase the clock speed and reduce the
latency of individual stages. If the ALU takes three times longer than any other module, we can divide the ALU into
three separate stages, which will reduce the amount of time wasted on shorter stages. The problem here is that we
need to find a way to subdivide our stages into shorter stages, and we also need to construct more complicated
control units to operate the pipeline and prevent all the possible hazards.
It is not uncommon for modern high-end processors to have more than 20 pipeline stages.
Example: Intel Pentium 4
The Intel Pentium 4 processor is a recent example of a super-pipelined processor. This diagram shows a Pentium 4
pipeline with 20 stages.
[Figure: Pentium4superpipeline.png]
Instruction pipeline
An instruction pipeline is a technique used in the design of computers and other digital electronic devices to
increase their instruction throughput (the number of instructions that can be executed in a unit of time).
The fundamental idea is to split the processing of a computer instruction into a series of independent steps, with
storage at the end of each step. This allows the computer's control circuitry to issue instructions at the processing
rate of the slowest step, which is much faster than the time needed to perform all steps at once. The term pipeline
refers to the fact that each step is carrying data at once (like water), and each step is connected to the next (like the
links of a pipe).
The origin of pipelining is thought to be either the ILLIAC II project or the IBM Stretch project, though a simple version
was used earlier in the Z1 in 1939 and the Z3 in 1941.[1]
The IBM Stretch project proposed the terms Fetch, Decode, and Execute, which became common usage.
Most modern CPUs are driven by a clock. The CPU consists internally of logic and memory (flip-flops). When the
clock signal arrives, the flip-flops take their new values and the logic then requires a period of time to decode the new
values. Then the next clock pulse arrives and the flip-flops again take their new values, and so on. By breaking the
logic into smaller pieces and inserting flip-flops between the pieces of logic, the delay before the logic gives valid
outputs is reduced. In this way the clock period can be reduced. For example, the classic RISC pipeline is broken into
five stages with a set of flip-flops between each stage.
1. Instruction fetch
2. Instruction decode and register fetch
3. Execute
4. Memory access
5. Register write back
When a programmer (or compiler) writes assembly code, they make the assumption that each instruction is executed
before execution of the subsequent instruction is begun. This assumption is invalidated by pipelining. When this
causes a program to behave incorrectly, the situation is known as a hazard. Various techniques for resolving hazards
such as forwarding and stalling exist.
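The cost of such a hazard can be sketched with a toy model (the stage indices and stall counts below are simplifying assumptions for illustration, not a description of any specific processor): count how many cycles a consumer must wait for a producer's result, with and without forwarding.

```python
# Toy model: stall cycles between two dependent instructions.
# `result_ready_after` is the stage index whose end produces the value,
# `needed_entering` is the stage index where the consumer reads it, and
# `gap` is how many cycles after the producer the consumer was issued.
# Assumed stage indices: IF=0, ID=1, EX=2, MEM=3, WB=4.

def stalls_between(result_ready_after, needed_entering, gap):
    return max(0, result_ready_after - needed_entering - gap)

# Back-to-back dependent instructions (gap = 1):
no_forwarding = stalls_between(4, 1, 1)    # wait for WB, read in ID -> stalls
with_forwarding = stalls_between(2, 2, 1)  # EX output forwarded into EX
print(no_forwarding, with_forwarding)  # 2 0
```

In this simplified model, forwarding the EX result straight into the next instruction's EX stage removes the stall entirely; without it, the consumer must idle until the producer writes back.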
A non-pipelined architecture is inefficient because some CPU components (modules) are idle while another module is
active during the instruction cycle. Pipelining does not completely eliminate idle time in a CPU, but making those
modules work in parallel improves program execution significantly.
Processors with pipelining are internally organized into stages which can work semi-independently on separate jobs.
Each stage is organized and linked into a 'chain' so each stage's output is fed to another stage until the job is done.
This organization of the processor allows overall processing time to be significantly reduced.
A deeper pipeline means that there are more stages in the pipeline, and therefore, fewer logic gates in each stage.
This generally means that the processor's frequency can be increased as the cycle time is lowered. This happens
because there are fewer components in each stage of the pipeline, so the propagation delay is decreased for the
overall stage.[2]
Unfortunately, not all instructions are independent. In a simple pipeline, completing an instruction may require 5
stages. To operate at full performance, this pipeline will need to run 4 subsequent independent instructions while the
first is completing. If 4 instructions that depend on the output of the first instruction are not available, the pipeline
control logic must insert a stall or wasted clock cycle into the pipeline until the dependency is resolved. Fortunately,
techniques such as forwarding can significantly reduce the cases where stalling is required. While pipelining can in
theory increase performance over an unpipelined core by a factor of the number of stages (assuming the clock
frequency also scales with the number of stages), in reality, most code does not allow for ideal execution.
Advantages and Disadvantages
Pipelining does not help in all cases. There are several possible disadvantages. An
instruction pipeline is said to be fully pipelined if it can accept a new instruction
every clock cycle. A pipeline that is not fully pipelined has wait cycles that delay the
progress of the pipeline.
Advantages of Pipelining:
1. The cycle time of the processor is reduced, thus increasing instruction issue-rate in
most cases.
2. Some combinational circuits such as adders or multipliers can be made faster by
adding more circuitry. If pipelining is used instead, it can save circuitry vs. a more
complex combinational circuit.
Disadvantages of Pipelining:
1. A non-pipelined processor executes only a single instruction at a time. This
prevents branch delays (in effect, every branch is delayed) and problems with
serial instructions being executed concurrently. Consequently, the design is
simpler and cheaper to manufacture.
2. The instruction latency in a non-pipelined processor is slightly lower than in a
pipelined equivalent. This is because extra flip-flops must be added to the data
path of a pipelined processor.
3. A non-pipelined processor will have a stable instruction bandwidth. The
performance of a pipelined processor is much harder to predict and may vary more
widely between different programs.
Examples
http://en.wikipedia.org/wiki/Instruction_pipeline#cite_note-Guardian-1http://en.wikipedia.org/wiki/Instruction_pipeline#cite_note-Guardian-1http://en.wikipedia.org/wiki/Instruction_pipeline#cite_note-Guardian-1http://en.wikipedia.org/wiki/Clock_cyclehttp://en.wikipedia.org/wiki/Clock_cyclehttp://en.wikipedia.org/wiki/Clock_cyclehttp://en.wikipedia.org/wiki/Flip-flop_(electronics)http://en.wikipedia.org/wiki/Flip-flop_(electronics)http://en.wikipedia.org/wiki/Flip-flop_(electronics)http://en.wikipedia.org/w/index.php?title=Instruction_pipeline&action=edit§ion=2http://en.wikipedia.org/w/index.php?title=Instruction_pipeline&action=edit§ion=2http://en.wikipedia.org/w/index.php?title=Instruction_pipeline&action=edit§ion=2http://en.wikipedia.org/w/index.php?title=Instruction_pipeline&action=edit§ion=2http://en.wikipedia.org/wiki/Flip-flop_(electronics)http://en.wikipedia.org/wiki/Clock_cyclehttp://en.wikipedia.org/wiki/Instruction_pipeline#cite_note-Guardian-1 -
Generic pipeline
Generic 4-stage pipeline; the colored boxes represent instructions independent of each other
The figure shows a generic pipeline with four stages:
1. Fetch
2. Decode
3. Execute
4. Write-back
(for lw and sw, memory is accessed after the execute stage)
The top gray box is the list of instructions waiting to be executed; the bottom gray box is
the list of instructions that have been completed; and the middle white box is the pipeline.
Execution is as follows:
Time Execution
0 Four instructions are awaiting execution
1 The green instruction is fetched from memory
2 The green instruction is decoded; the purple instruction is fetched from memory
3 The green instruction is executed (the actual operation is performed); the purple instruction is decoded; the blue instruction is fetched
4 The green instruction's results are written back to the register file or memory; the purple instruction is executed; the blue instruction is decoded; the red instruction is fetched
5 The green instruction is completed; the purple instruction is written back; the blue instruction is executed; the red instruction is decoded
6 The purple instruction is completed; the blue instruction is written back; the red instruction is executed
7 The blue instruction is completed; the red instruction is written back
8 The red instruction is completed
9 All instructions are executed
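The cycle-by-cycle schedule above can be reproduced with a short script (a sketch; the color names stand in for the four independent instructions):

```python
# Sketch: replay the 4-stage schedule from the table above.
STAGES = ["fetch", "decode", "execute", "write-back"]
INSTRUCTIONS = ["green", "purple", "blue", "red"]

def schedule(instructions, stages):
    """Map each cycle to {stage: instruction} for an ideal pipeline."""
    table = {}
    for cycle in range(1, len(instructions) + len(stages)):
        row = {}
        for i, instr in enumerate(instructions):
            stage_index = cycle - 1 - i  # instruction i enters at cycle i + 1
            if 0 <= stage_index < len(stages):
                row[stages[stage_index]] = instr
        table[cycle] = row
    return table

table = schedule(INSTRUCTIONS, STAGES)
print(table[4])  # all four stages busy: green writes back while red is fetched
```

Cycle 4 is the steady state: every stage holds a different instruction, matching the row where green, purple, blue, and red are all in flight at once.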
Mathematical pipelines
Mathematical or arithmetic pipelines are different from instruction pipelines in that,
when mathematically processing large arrays or vectors, a particular mathematical operation, such as a multiply, is
repeated many thousands of times. In this environment, an instruction need only kick off an event whereby the
arithmetic logic unit (which is pipelined) takes over and begins its series of calculations. Most of these circuits can be
found today in math processors and the math processing sections of CPUs like the Intel Pentium line.
History
Math processing (super-computing) began in earnest in the late 1970s with vector processors and array processors.
These were usually very large, bulky super-computing machines that needed special environments and super-cooling of
the cores. One of the early supercomputers was the Cyber series built by Control Data Corporation. Its main architect
was Seymour Cray, who later resigned from CDC to head up Cray Research. Cray Research developed the X-MP line of
supercomputers, using pipelining for both multiply and add/subtract functions. Later, Star Technologies took pipelining to
another level by adding parallelism (several pipelined functions working in parallel), developed by their engineer,
Roger Chen. In 1984, Star Technologies made another breakthrough with the pipelined divide circuit, developed by
James Bradley. By the mid-1980s, super-computing had taken off with offerings from many different companies
around the world.
Today, such circuits can be found embedded inside most microprocessors.
How is pipelining achieved in the 8086 microprocessor?
The execution unit (EU) tells the bus interface unit (BIU) from where to fetch instructions as well as where to read
data. The EU gets the opcode of an instruction from an instruction queue, then decodes and executes it. The BIU
and EU operate independently: while the EU is executing an instruction, the BIU fetches instruction codes from
memory and stores them in the queue. This type of overlapping operation of the BIU and EU functional units of a
microprocessor is called pipelining.
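That BIU/EU overlap can be sketched as a cycle-by-cycle toy model (a deliberately simplified picture: real 8086 bus cycles take several clocks, and the per-instruction execution time here is a made-up parameter). The 6-byte queue size matches the 8086's prefetch queue.

```python
# Toy model: overlap of the 8086 BIU and EU. While the EU executes,
# the BIU uses free cycles to top up the prefetch queue, so the EU
# rarely has to wait for an opcode fetch.
from collections import deque

QUEUE_SIZE = 6  # the 8086 prefetch queue holds up to 6 bytes

def run(program, exec_cycles=2):
    """Cycle-by-cycle model; returns (executed opcodes, EU wait cycles)."""
    queue = deque()
    fetched = 0
    eu_waits = 0
    executed = []
    busy = 0  # cycles left on the instruction the EU is executing
    while len(executed) < len(program):
        # BIU: fetch one byte per cycle whenever the queue has room.
        if fetched < len(program) and len(queue) < QUEUE_SIZE:
            queue.append(program[fetched])
            fetched += 1
        # EU: finish the current instruction, then pull the next opcode.
        if busy > 0:
            busy -= 1
        elif queue:
            executed.append(queue.popleft())
            busy = exec_cycles - 1
        else:
            eu_waits += 1
    return executed, eu_waits

done, waits = run(list(range(6)))
print(len(done), waits)  # 6 0 -- the queue hides the fetch latency
```

Because the EU spends two cycles per instruction while the BIU fetches one byte per cycle, the queue never runs dry and the EU never waits, which is exactly the overlap described above.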
Pipelining of Microcontrollers and Microprocessors
A few important characteristics and features of the pipeline concept:
- Processes more than one instruction at a time, and doesn't wait for one instruction to complete before
starting the next. Fetch, decode, execute, and write stages are executed in parallel
- As soon as one stage completes, it passes on the result to the next stage and then begins working on another instruction
- The performance of a pipelined system depends on the time it takes only for any one stage to be
completed, not on the total time for all stages as with non-pipelined designs
- Each instruction takes 1 clock cycle for each stage, so the processor can accept 1 new instruction per clock.
Pipelining doesn't improve the latency of instructions (each instruction still requires the same amount of time
to complete), but it does improve the overall throughput
- Sometimes pipelined instructions take more than one clock to complete a stage. When that happens, the
processor has to stall and not accept new instructions until the slow instruction has moved on to the next
stage
- A pipelined processor can stall for a variety of reasons, including delays in reading information from
memory, a poor instruction set design, or dependencies between instructions
- Memory speed issues are commonly solved using caches. A cache is a section of fast memory placed
between the processor and slower memory. When the processor wants to read a location in main memory, that
location is also copied into the cache. Subsequent references to that location can come from the cache,
which will return a result much more quickly than the main memory
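The cache behavior in that bullet can be sketched with a minimal direct-mapped cache (a toy model; the line count and address split are arbitrary choices for illustration): the first access to an address misses and copies it in, and repeats hit.

```python
# Toy model: a minimal direct-mapped cache.
class Cache:
    """Index selects a line; the stored tag detects whether it's a reuse."""

    def __init__(self, n_lines=8):
        self.n_lines = n_lines
        self.lines = {}  # index -> tag of the address currently cached

    def access(self, address):
        index = address % self.n_lines
        tag = address // self.n_lines
        if self.lines.get(index) == tag:
            return "hit"
        self.lines[index] = tag  # miss: copy the location into the cache
        return "miss"

cache = Cache()
print(cache.access(100), cache.access(100))  # miss hit
```

The second access returns from the cache rather than main memory, which is the speedup the bullet describes.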
- Dependencies. Since each instruction takes some amount of time to store its result, and several
instructions are being handled at the same time, later instructions may have to wait for the results of earlier
instructions to be stored. However, a simple rearrangement of the instructions in a program (called
Instruction Scheduling) can remove these performance limitations from RISC programs
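Instruction scheduling can be illustrated with a small sketch (the register names and the one-stall-per-load-use rule are illustrative assumptions modeled on a classic RISC load delay, not any particular ISA): moving an independent instruction between a load and its consumer hides the load's latency.

```python
# Sketch: instruction scheduling. A load's result is not ready for the
# instruction immediately behind it (a load-use hazard), so each
# dependent consumer directly after a load costs one stall cycle.

def count_load_use_stalls(program):
    """program: list of (op, dest, sources); one stall per load-use pair."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        op, dest, _ = prev
        if op == "load" and dest in cur[2]:
            stalls += 1
    return stalls

naive = [
    ("load", "r1", ["r10"]),
    ("add", "r2", ["r1", "r3"]),   # uses r1 right after the load: stall
    ("load", "r4", ["r11"]),
    ("add", "r5", ["r4", "r6"]),   # same pattern: another stall
]
scheduled = [
    ("load", "r1", ["r10"]),
    ("load", "r4", ["r11"]),       # independent load fills the gap
    ("add", "r2", ["r1", "r3"]),
    ("add", "r5", ["r4", "r6"]),
]
print(count_load_use_stalls(naive), count_load_use_stalls(scheduled))  # 2 0
```

The rearranged program computes exactly the same values but eliminates both stalls, which is the performance win the bullet attributes to instruction scheduling.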