
    Understanding pipelining performance

    The original Pentium 4 was a radical design for a number of reasons, but perhaps its most striking and controversial

feature was its extraordinarily deep pipeline. At over 20 stages, the Pentium 4's pipeline was almost twice as deep as the

    pipelines of the P4's competitors. Recently Prescott, the 90nm successor to the Pentium 4, took pipelining to the next

    level by adding another 10 stages onto the Pentium 4's already unbelievably long pipeline.

    Intel's strategy of deepening the Pentium 4's pipeline, a practice that Intel calls "hyperpipelining", has paid off in terms

    of performance, but it is not without its drawbacks. In previous articles on the Pentium 4 and Prescott, I've referred to

    the drawbacks associated with deep pipelines, and I've even tried to explain these drawbacks within the context of

    larger technical articles on Netburst and other topics. In the present series of articles, I want to devote some serious

    time to explaining pipelining, its effect on microprocessor performance, and its potential downsides. I'll take you

    through a basic introduction to the concept of pipelining, and then I'll explain what's required to make pipelining

    successful and what pitfalls face deeply pipelined designs like Prescott. By the end of the article, you should have a

clear grasp of exactly how pipeline depth is related to microprocessor performance on different types of code.

    Pipelining Introduction

    Let us break down our microprocessor into 5 distinct activities, which generally

    correspond to 5 distinct pieces of hardware:

    1. Instruction fetch (IF)

    2. Instruction Decode (ID)

    3. Execution (EX)

    4. Memory Read/Write (MEM)

    5. Result Writeback (WB)

    Any given instruction will only require one of these modules at a time, generally in

    this order. The following timing diagram of the multi-cycle processor will show this

    in more detail:

    This is all fine and good, but at any moment, 4 out of 5 units are not active, and

    could likely be used for other things.

    Pipelining Philosophy

    Pipelining is concerned with the following tasks:

    Use multi-cycle methodologies to reduce the amount of computation in a single cycle.

    Shorter computations per cycle allow for faster clock cycles.

    Overlapping instructions allows all components of a processor to be operating on a different instruction.

Throughput is increased by having instructions complete more frequently, as the short sketch below illustrates.
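To make the overlap and throughput claims concrete, here is a minimal sketch in Python. The stage names follow the five activities listed earlier; the cycle counts assume one cycle per stage and no hazards, which is an idealization rather than a description of any real processor.

```python
STAGES = ("IF", "ID", "EX", "MEM", "WB")        # the five activities listed above

def cycles_unpipelined(n):
    # Each instruction walks through all five stages before the next one may start.
    return n * len(STAGES)

def cycles_pipelined(n):
    # The first instruction needs five cycles to fill the pipe; after that,
    # one instruction completes every cycle.
    return len(STAGES) + (n - 1)

for n in (1, 5, 100, 1_000_000):
    u, p = cycles_unpipelined(n), cycles_pipelined(n)
    print(f"{n:>9} instructions: {u:>9} cycles unpipelined, {p:>9} pipelined, speedup {u / p:.2f}x")
```

For long instruction streams the speedup approaches the pipeline depth of 5, which is the theoretical best case discussed later in this document.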

[Figure: Nopipeline.png — timing diagram of the non-pipelined multi-cycle processor]

    We will talk about how to make these things happen in the remainder of the chapter.

Pipelining Hardware

    Given our multicycle processor, what if we wanted to overlap our execution, so that up to 5 instructions could be

    processed at the same time? Let's contract our timing diagram a little bit to show this idea:

    As this diagram shows, each element in the processor is active in every cycle, and the instruction rate of the

    processor has been increased by 5 times! The question now is, what additional hardware do we need in order to

perform this task? We need to add storage registers between each pipeline stage to store the partial results between

    cycles, and we also need to reintroduce the redundant hardware from the single-cycle CPU. We can continue to use

    a single memory module (for instructions and data), so long as we restrict memory read operations to the first half of

    the cycle, and memory write operations to the second half of the cycle (or vice-versa). We can save time on the

    memory access by calculating the memory addresses in the previous stage.

    The registers would need to hold the data from the pipeline at that point, and also the necessary control codes to

    operate the remainder of the pipeline.
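As a rough sketch of what those inter-stage registers carry, the following Python model holds both partial results and the control codes needed by the downstream stages, and shifts everything forward one stage per clock edge. The latch names follow the classic stage boundaries, but the field names and instruction strings are invented for illustration, not taken from any real design.

```python
from dataclasses import dataclass, field

# Hypothetical contents of one pipeline register (latch): partial results plus the
# control codes that the remaining stages will need. Field names are illustrative.
@dataclass
class PipelineRegister:
    instruction: str = "NOP"
    alu_result: object = None
    control: dict = field(default_factory=dict)   # e.g. {"mem_write": False, "reg_write": True}

# One latch between each pair of stages: IF/ID, ID/EX, EX/MEM, MEM/WB.
latches = {name: PipelineRegister() for name in ("IF/ID", "ID/EX", "EX/MEM", "MEM/WB")}

def clock_edge(fetched_instruction):
    """On each clock edge, every latch hands its contents to the next one."""
    latches["MEM/WB"] = latches["EX/MEM"]
    latches["EX/MEM"] = latches["ID/EX"]
    latches["ID/EX"] = latches["IF/ID"]
    latches["IF/ID"] = PipelineRegister(instruction=fetched_instruction)

for instr in ("lw r1, 0(r2)", "add r3, r1, r4", "sw r3, 4(r2)", "NOP", "NOP"):
    clock_edge(instr)
    print({name: reg.instruction for name, reg in latches.items()})
```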

    Our resultant processor design will look similar to this:

[Figures: Pipeline-base.png, Fivestagespipeline.png — the overlapped (pipelined) timing diagram and the resulting five-stage processor design]

    If we have 5 instructions, we can show them in our pipeline using different colors. In the diagram below, white

    corresponds to a NOP, and the different colors correspond to other instructions in the pipeline. Each stage, the

    instructions shift forward through the pipeline.

[Figures: Pipeline_3.png, Pipeline_MIPS.png — five color-coded instructions moving through the pipeline, with white representing a NOP]

Superpipeline

    Superpipelining is the technique of raising the pipeline depth in order to increase the clock speed and reduce the

latency of individual stages. If the ALU takes three times longer than any other module, we can divide the ALU into

    three separate stages, which will reduce the amount of time wasted on shorter stages. The problem here is that we

    need to find a way to subdivide our stages into shorter stages, and we also need to construct more complicated

    control units to operate the pipeline and prevent all the possible hazards.

    It is not uncommon for modern high-end processors to have more than 20 pipeline stages.
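A quick back-of-the-envelope sketch of the ALU example above shows why splitting the slow stage raises the attainable clock frequency: the clock period is set by the slowest stage. The delays below are made-up values in nanoseconds, used only to illustrate the arithmetic.

```python
# Hypothetical per-stage delays in nanoseconds; the ALU is three times slower
# than every other stage, as in the example above.
base = {"IF": 1.0, "ID": 1.0, "EX(ALU)": 3.0, "MEM": 1.0, "WB": 1.0}

# Superpipelined version: the ALU work is spread across three 1 ns stages.
split = {"IF": 1.0, "ID": 1.0, "EX1": 1.0, "EX2": 1.0, "EX3": 1.0, "MEM": 1.0, "WB": 1.0}

for name, stages in (("5-stage", base), ("7-stage, split ALU", split)):
    period = max(stages.values())               # the clock period is set by the slowest stage
    print(f"{name:>18}: {len(stages)} stages, clock period {period:.0f} ns, "
          f"frequency {1000 / period:.0f} MHz")
```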

    Example: Intel Pentium 4

    The Intel Pentium 4 processor is a recent example of a super-pipelined processor. This diagram shows a Pentium 4

    pipeline with 20 stages.

[Figure: Pentium4superpipeline.png — the 20-stage Pentium 4 pipeline]

    Instruction pipeline

An instruction pipeline is a technique used in the design of computers and other digital electronic devices to

    increase their instruction throughput (the number of instructions that can be executed in a unit of time).

    The fundamental idea is to split the processing of a computer instruction into a series of independent steps, with

    storage at the end of each step. This allows the computer's control circuitry to issue instructions at the processing

    rate of the slowest step, which is much faster than the time needed to perform all steps at once. The term pipeline

    refers to the fact that each step is carrying data at once (like water), and each step is connected to the next (like the

links of a pipe).

The origin of pipelining is thought to be either the ILLIAC II project or the IBM Stretch project, though a simple version

was used earlier in the Z1 in 1939 and the Z3 in 1941.[1]

The IBM Stretch project proposed the terms "Fetch, Decode, and Execute", which became common usage.

Most modern CPUs are driven by a clock. The CPU consists internally of logic and memory (flip flops). When the

    clock signal arrives, the flip flops take their new value and the logic then requires a period of time to decode the new

    values. Then the next clock pulse arrives and the flip flops again take their new values, and so on. By breaking the

    logic into smaller pieces and inserting flip flops between the pieces of logic, the delay before the logic gives valid

outputs is reduced. In this way the clock period can be reduced. For example, the classic RISC pipeline is broken into

    five stages with a set of flip flops between each stage.

    1. Instruction fetch

    2. Instruction decode and register fetch

    3. Execute

    4. Memory access

    5. Register write back

    When a programmer (or compiler) writes assembly code, they make the assumption that each instruction is executed

    before execution of the subsequent instruction is begun. This assumption is invalidated by pipelining. When this

    causes a program to behave incorrectly, the situation is known as a hazard. Various techniques for resolving hazards

    such as forwarding and stalling exist.
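The following sketch quantifies this for one common textbook model of the classic 5-stage pipeline. It is an assumption of the sketch, not a statement about any particular CPU, that the register file is written in the first half of a cycle and read in the second, and that forwarding paths deliver a result one cycle after the stage that produces it.

```python
STAGE = {"IF": 0, "ID": 1, "EX": 2, "MEM": 3, "WB": 4}   # cycle offsets within one instruction

def stall_cycles(forwarding, producer_is_load):
    """Stalls needed when an instruction reads a register written by the one just before it."""
    producer_issue, consumer_issue = 0, 1                 # back-to-back instructions
    if forwarding:
        # A result can be forwarded one cycle after the stage that produces it:
        # EX for ALU results, MEM for loaded values. The consumer needs it in its EX.
        produced_in = "MEM" if producer_is_load else "EX"
        earliest = producer_issue + STAGE[produced_in] + 1
        natural = consumer_issue + STAGE["EX"]
    else:
        # Without forwarding the value is only visible through the register file:
        # written in WB (first half of the cycle), read in ID (second half).
        earliest = producer_issue + STAGE["WB"]
        natural = consumer_issue + STAGE["ID"]
    return max(0, earliest - natural)

for fwd in (False, True):
    for load in (False, True):
        kind = "load -> use" if load else "ALU -> use"
        print(f"forwarding={str(fwd):5}  {kind}: {stall_cycles(fwd, load)} stall cycle(s)")
```

Under these assumptions a dependent pair costs two stall cycles without forwarding, none for an ALU result with forwarding, and one for the load-use case even with forwarding.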

    A non-pipeline architecture is inefficient because some CPU components (modules) are idle while another module is

active during the instruction cycle. Pipelining does not completely cancel out idle time in a CPU, but making those

    modules work in parallel improves program execution significantly.

Processors with pipelining are internally organized into stages which can work semi-independently on separate jobs.

    Each stage is organized and linked into a 'chain' so each stage's output is fed to another stage until the job is done.

    This organization of the processor allows overall processing time to be significantly reduced.


    A deeper pipeline means that there are more stages in the pipeline, and therefore, fewer logic gates in each stage.

    This generally means that the processor's frequency can be increased as the cycle time is lowered. This happens

    because there are fewer components in each stage of the pipeline, so the propagation delay is decreased for the

    overall stage.[2]

    Unfortunately, not all instructions are independent. In a simple pipeline, completing an instruction may require 5

    stages. To operate at full performance, this pipeline will need to run 4 subsequent independent instructions while the

first is completing. If 4 instructions that do not depend on the output of the first instruction are not available, the pipeline

    control logic must insert a stall or wasted clock cycle into the pipeline until the dependency is resolved. Fortunately,

    techniques such as forwarding can significantly reduce the cases where stalling is required. While pipelining can in

    theory increase performance over an unpipelined core by a factor of the number of stages (assuming the clock

    frequency also scales with the number of stages), in reality, most code does not allow for ideal execution.
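One way to express that gap is through cycles per instruction (CPI): an ideal pipeline retires one instruction per cycle, and every stall adds to the CPI. The sketch below assumes the clock frequency scales linearly with depth and that the stall rate is independent of depth, both optimistic simplifications, and shows how quickly stalls eat into the theoretical speedup.

```python
def pipelined_speedup(depth, stalls_per_instruction):
    """Very rough speedup over an unpipelined core made of the same logic.

    Assumes the clock runs `depth` times faster than the unpipelined clock and
    that the pipelined core achieves CPI = 1 + stalls. Real machines add latch
    overhead, and deeper pipelines usually pay more cycles per hazard, so this
    overstates the benefit of depth.
    """
    return depth / (1.0 + stalls_per_instruction)

for stalls in (0.0, 0.3, 1.0, 2.0):
    print(f"{stalls:.1f} stalls/instr:  5-stage {pipelined_speedup(5, stalls):5.2f}x   "
          f"20-stage {pipelined_speedup(20, stalls):5.2f}x")
```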

    Advantages and Disadvantages

    Pipelining does not help in all cases. There are several possible disadvantages. An

instruction pipeline is said to be fully pipelined if it can accept a new instruction

every clock cycle. A pipeline that is not fully pipelined has wait cycles that delay the

    progress of the pipeline.

    Advantages of Pipelining:

    1. The cycle time of the processor is reduced, thus increasing instruction issue-rate in

    most cases.

    2. Some combinational circuits such as adders or multipliers can be made faster by

    adding more circuitry. If pipelining is used instead, it can save circuitry vs. a more

    complex combinational circuit.

    Disadvantages of Pipelining:

    1. A non-pipelined processor executes only a single instruction at a time. This

    prevents branch delays (in effect, every branch is delayed) and problems with

    serial instructions being executed concurrently. Consequently the design is

    simpler and cheaper to manufacture.

    2. The instruction latency in a non-pipelined processor is slightly lower than in a

pipelined equivalent. This is because extra flip flops must be added to the data

path of a pipelined processor (see the sketch after this list).

    3. A non-pipelined processor will have a stable instruction bandwidth. The

    performance of a pipelined processor is much harder to predict and may vary more

    widely between different programs.
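Disadvantage 2 can be illustrated numerically. In the sketch below the per-stage logic delays and the flip-flop (latch) overhead are invented values; the point is only that per-instruction latency rises slightly while throughput rises a lot.

```python
# Invented logic delays (ns) for the five stages, plus a fixed flip-flop overhead
# charged at every pipeline stage boundary.
logic_delay = {"IF": 0.8, "ID": 0.7, "EX": 1.0, "MEM": 0.9, "WB": 0.6}
LATCH_OVERHEAD = 0.1

unpipelined_latency = sum(logic_delay.values())               # one long combinational pass
clock_period = max(logic_delay.values()) + LATCH_OVERHEAD     # set by the slowest stage
pipelined_latency = clock_period * len(logic_delay)           # one instruction, start to finish

print(f"unpipelined: latency {unpipelined_latency:.1f} ns, one instruction per {unpipelined_latency:.1f} ns")
print(f"pipelined  : latency {pipelined_latency:.1f} ns, but one instruction per {clock_period:.1f} ns")
```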

Examples


Generic pipeline

[Figure: Pipeline,_4_stage.svg — a generic 4-stage pipeline; the colored boxes represent instructions independent of each other]

The figure shows a generic pipeline with four stages:

    1. Fetch

    2. Decode

    3. Execute

    4. Write-back

(for lw and sw, memory is accessed after the execute stage)

    The top gray box is the list of instructions waiting to be executed; the bottom gray box is

    the list of instructions that have been completed; and the middle white box is the pipeline.

    Execution is as follows:

Time 0: Four instructions are waiting to be executed.
Time 1: The green instruction is fetched from memory.
Time 2: The green instruction is decoded; the purple instruction is fetched from memory.
Time 3: The green instruction is executed (the actual operation is performed); the purple instruction is decoded; the blue instruction is fetched.
Time 4: The green instruction's results are written back to the register file or memory; the purple instruction is executed; the blue instruction is decoded; the red instruction is fetched.
Time 5: The green instruction is completed; the purple instruction is written back; the blue instruction is executed; the red instruction is decoded.
Time 6: The purple instruction is completed; the blue instruction is written back; the red instruction is executed.
Time 7: The blue instruction is completed; the red instruction is written back.
Time 8: The red instruction is completed.
Time 9: All instructions are executed.
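The same schedule can be generated mechanically. This short sketch, in which the colors stand in for four independent instructions as in the table above, prints which instruction occupies each of the four stages in every cycle.

```python
STAGES = ("Fetch", "Decode", "Execute", "Write-back")
instructions = ("green", "purple", "blue", "red")        # four independent instructions

# Instruction i enters Fetch in cycle i+1 and moves one stage forward per cycle.
for cycle in range(1, len(instructions) + len(STAGES)):
    slots = []
    for s, stage in enumerate(STAGES):
        i = cycle - 1 - s                                # index of the instruction in this stage
        occupant = instructions[i] if 0 <= i < len(instructions) else "-"
        slots.append(f"{stage}: {occupant:6}")
    print(f"cycle {cycle}:  " + "  ".join(slots))
```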

Mathematical pipelines: Mathematical or arithmetic pipelines are different from instruction pipelines in that,

when mathematically processing large arrays or vectors, a particular mathematical process, such as a multiply, is

    repeated many thousands of times. In this environment, an instruction need only kick off an event whereby the

    arithmetic logic unit (which is pipelined) takes over, and begins its series of calculations. Most of these circuits can be

    found today in math processors and math processing sections of CPUs like the Intel Pentium line.
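As a toy illustration of an arithmetic pipeline, the sketch below streams a vector through a hypothetical 3-stage multiplier: once the pipe is full, one product completes per cycle, so N multiplies finish in roughly N + depth cycles instead of 3N. The depth of 3 and the data values are invented for the example and do not describe any particular math unit.

```python
# Toy arithmetic pipeline: a hypothetical 3-stage multiplier streaming a vector.
DEPTH = 3
a = [1, 2, 3, 4, 5]
b = [6, 7, 8, 9, 10]

in_flight = []                 # (cycles_remaining, x, y) for each multiply in the pipe
results, issued, cycle = [], 0, 0

while len(results) < len(a):
    cycle += 1
    # everything already in the pipe advances one stage
    in_flight = [(left - 1, x, y) for left, x, y in in_flight]
    results.extend(x * y for left, x, y in in_flight if left == 0)
    in_flight = [t for t in in_flight if t[0] > 0]
    # one new multiply enters the pipe per cycle while operands remain
    if issued < len(a):
        in_flight.append((DEPTH, a[issued], b[issued]))
        issued += 1

print(f"{len(results)} multiplies finished in {cycle} cycles "
      f"(vs {DEPTH * len(a)} cycles unpipelined):", results)
```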

History

    Math processing (super-computing) began in earnest in the late 1970s as Vector Processors and Array Processors.

These were usually very large, bulky supercomputing machines that needed special environments and super-cooling of the

cores. One of the early supercomputers was the Cyber series built by Control Data Corporation. Its main architect

    was Seymour Cray, who later resigned from CDC to head up Cray Research. Cray developed the XMP line of super

    computers, using pipelining for both multiply and add/subtract functions. Later, Star Technologies took pipelining to

    another level by adding parallelism (several pipelined functions working in parallel), developed by their engineer,

    Roger Chen. In 1984, Star Technologies made another breakthrough with the pipelined divide circuit, developed by

    James Bradley. By the mid 1980s, super-computing had taken off with offerings from many different companies

    around the world.

Today, most of these circuits can be found embedded inside most microprocessors.


How is pipelining achieved in the 8086 microprocessor?

The execution unit (EU) tells the bus interface unit (BIU) from where to fetch instructions as well as where to read data. The EU gets the opcode of an instruction from an instruction queue, and then decodes and executes it. The BIU and EU operate independently: while the EU is executing an instruction, the BIU fetches further instruction codes from memory and stores them in the queue. This type of overlapping operation of the BIU and EU functional units of a microprocessor is called pipelining.
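The BIU/EU overlap can be sketched as a producer/consumer queue. In the sketch below the 6-byte queue depth matches the real 8086, but the opcode names and the two-cycle execution time are invented purely so that the queue visibly fills while the EU is busy.

```python
from collections import deque

QUEUE_DEPTH = 6                                  # the real 8086 prefetch queue holds 6 bytes
EXECUTE_CYCLES = 2                               # invented: every instruction "executes" for 2 cycles
program = [f"opcode_{i}" for i in range(6)]      # invented one-byte opcodes

queue = deque()
fetched = completed = 0
eu_busy_until = 0

for cycle in range(1, 40):
    # BIU: while there is room in the queue, prefetch the next instruction byte.
    if fetched < len(program) and len(queue) < QUEUE_DEPTH:
        queue.append(program[fetched])
        fetched += 1
    # EU: when idle, take the next opcode from the queue and start executing it.
    if cycle >= eu_busy_until and queue:
        op = queue.popleft()
        eu_busy_until = cycle + EXECUTE_CYCLES
        completed += 1
        print(f"cycle {cycle:2}: EU starts {op:9}  prefetch queue = {list(queue)}")
    if completed == len(program):
        break
```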

Pipelining of Microcontrollers and Microprocessors

A few important characteristics and features of the pipeline concept:

- Processes more than one instruction at a time, and doesn't wait for one instruction to complete before

starting the next. Fetch, decode, execute, and write stages are executed in parallel.

- As soon as one stage completes, it passes on the result to the next stage and then begins working on another instruction

- The performance of a pipelined system depends only on the time it takes for any one stage to be

    completed, not on the total time for all stages as with non-pipelined designs

    - Each instruction takes 1 clock cycle for each stage, so the processor can accept 1 new instruction per clock.

Pipelining doesn't improve the latency of instructions (each instruction still requires the same amount of time

    to complete), but it does improve the overall throughput

    - Sometimes pipelined instructions take more than one clock to complete a stage. When that happens, the

    processor has to stall and not accept new instructions until the slow instruction has moved on to the next

    stage

    - A pipelined processor can stall for a variety of reasons, including delays in reading information from

    memory, a poor instruction set design, or dependencies between instructions

    - Memory speed issues are commonly solved using caches. A cache is a section of fast memory placed

between the processor and slower memory. When the processor wants to read a location in main memory, that location is also copied into the cache. Subsequent references to that location can come from the cache,

    which will return a result much more quickly than the main memory

    - Dependencies. Since each instruction takes some amount of time to store its result, and several

    instructions are being handled at the same time, later instructions may have to wait for the results of earlier

    instructions to be stored. However, a simple rearrangement of the instructions in a program (called

Instruction Scheduling) can remove these performance limitations from RISC programs, as the short sketch after this list illustrates.
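Here is the promised sketch of instruction scheduling. It assumes a single stall cycle whenever an instruction consumes a register loaded by the instruction immediately before it (the load-use case discussed earlier); the opcodes and register names are illustrative only.

```python
# Minimal sketch of instruction scheduling, assuming one stall cycle whenever an
# instruction uses a register loaded by the instruction immediately before it
# (the classic load-use hazard). Register names and opcodes are illustrative only.

def count_load_use_stalls(program):
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        op_prev, dest_prev, *_ = prev
        _, _, *sources = curr
        if op_prev == "lw" and dest_prev in sources:
            stalls += 1
    return stalls

original = [
    ("lw",  "r1", "0(r4)"),
    ("add", "r2", "r1", "r5"),     # uses r1 right after the load -> stall
    ("lw",  "r3", "4(r4)"),
    ("sub", "r6", "r3", "r7"),     # uses r3 right after the load -> stall
]

scheduled = [                      # same work, loads hoisted above the uses
    ("lw",  "r1", "0(r4)"),
    ("lw",  "r3", "4(r4)"),
    ("add", "r2", "r1", "r5"),
    ("sub", "r6", "r3", "r7"),
]

print("original :", count_load_use_stalls(original), "stall cycle(s)")
print("scheduled:", count_load_use_stalls(scheduled), "stall cycle(s)")
```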

Source: http://electronicsbus.com/microprocessors-microcontrollers/pipelining-of-microcontroller-microprocessor-computer/