Superscalar Pipeline Architectures

By: Matthew Osborne, Philip Ho, Xun Chen

April 19, 2004

Superscalar Architecture

• Relatively new, first appeared in early 1990s

• Builds on the concept of pipelining

• Superscalar architectures can process multiple instructions in one clock cycle (multiple instruction execution units)

• Allows for instruction execution rate to exceed the clock rate (CPI of less than 1)
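As a rough illustration of the last bullet, the sketch below computes cycles per instruction (CPI) from cycle and instruction counts. The function name `cpi` and the sample numbers are illustrative assumptions, not measurements from any processor discussed here.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction: total clock cycles divided by instructions completed."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


# Hypothetical run of 1000 instructions.
scalar_cpi = cpi(cycles=1000, instructions=1000)      # ideal scalar pipeline: one per cycle
superscalar_cpi = cpi(cycles=500, instructions=1000)  # ideal 2-wide machine: two per cycle

print(scalar_cpi)       # 1.0
print(superscalar_cpi)  # 0.5
```

A CPI of 0.5 is the same as completing two instructions per clock cycle on average.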

Overview of Selected Superscalar Architectures

• Intel

• MIPS

• PowerPC

• T1000 Architectures

• Hobbes: A Multi-threaded Superscalar Architecture

Intel Superscalar Architecture

According to Sara Sarimento, in her essay “Recent History of Intel Architecture – A Refresher”

- Intel’s first use of a superscalar architecture was its Pentium Processor

- “Instruction Level Parallelism” - instructions independent of the outcome of one another execute concurrently to utilize more of the available hardware resources and increase instruction throughput.

Intel P5 Microarchitecture

• Used in the initial Pentium processor

• Could execute up to 2 instructions simultaneously

• Instructions sent through the pipeline in order - if the next two instructions had a dependency issue, only one instruction (pipe) would be executed and the second execution unit (pipe) went unused for that clock cycle.
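The in-order pairing rule described above can be sketched as a small cycle counter. This is a deliberately simplified model: instructions are reduced to a destination register and a tuple of source registers, and the only pairing check is a register dependency between adjacent instructions. The real Pentium pairing rules are more detailed; the names `Instr` and `dual_issue_cycles` are hypothetical.

```python
from typing import List, Tuple

Instr = Tuple[str, Tuple[str, ...]]  # (destination register, source registers)


def dual_issue_cycles(program: List[Instr]) -> int:
    """Count issue cycles for an in-order, 2-wide machine.

    Two adjacent instructions pair only if the second neither reads nor
    writes the first one's destination register.
    """
    cycles = 0
    i = 0
    while i < len(program):
        cycles += 1
        if i + 1 < len(program):
            dest, _ = program[i]
            next_dest, next_srcs = program[i + 1]
            independent = dest not in next_srcs and dest != next_dest
            if independent:
                i += 2  # both pipes used this cycle
                continue
        i += 1  # second pipe idle this cycle
    return cycles


independent_pair = [("r1", ("r2", "r3")), ("r4", ("r5", "r6"))]
dependent_pair = [("r1", ("r2", "r3")), ("r4", ("r1", "r6"))]
print(dual_issue_cycles(independent_pair))  # 1
print(dual_issue_cycles(dependent_pair))    # 2
```

In the second program the later instruction reads `r1`, which the earlier one writes, so the second pipe sits idle for a cycle.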

Intel P6 Microarchitecture

- Used in the Pentium II, III and Pro processors

- 3 instruction decoders, which break each CISC instruction (macro-op) into equivalent micro-operations (µops) for the Out-of-Order Execution unit

- 10-stage instruction pipeline utilized in this architecture

Intel P6 Microarchitecture

• “Out of Order” instruction execution - executes instructions without data dependency issues out of order for a higher level of hardware utilization

• “Scheduler” unit resolves data dependency issues between individual instructions

• “Re-Order Buffer” puts instructions back in program order before their results are committed (retired)

• Up to 3 instructions can be retired per clock cycle
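The interaction between out-of-order completion and in-order retirement can be sketched as follows. This is a toy model under stated assumptions: each instruction is represented only by the cycle in which it finishes executing, an instruction may retire in the cycle it finishes, and the function name `retire_in_order` is hypothetical. It is not a description of the actual P6 hardware.

```python
from typing import Dict, List


def retire_in_order(finish_cycle: List[int], retire_width: int = 3) -> Dict[int, List[int]]:
    """Map each cycle to the instruction indices retired in that cycle.

    finish_cycle[i] is the cycle in which instruction i (in program order)
    finishes executing. An instruction retires only when it has finished and
    every older instruction has retired; at most retire_width retire per cycle.
    """
    retired: Dict[int, List[int]] = {}
    head = 0  # oldest instruction not yet retired
    cycle = 0
    while head < len(finish_cycle):
        cycle += 1
        batch: List[int] = []
        while (head < len(finish_cycle)
               and len(batch) < retire_width
               and finish_cycle[head] <= cycle):
            batch.append(head)
            head += 1
        if batch:
            retired[cycle] = batch
    return retired


# Instruction 0 is slow (finishes in cycle 4); 1..3 finish earlier, out of order.
print(retire_in_order([4, 2, 1, 3]))  # {4: [0, 1, 2], 5: [3]}
```

Instructions 1 and 2 finish early but wait behind instruction 0, and the retire width of 3 pushes instruction 3 to the following cycle.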

Intel NetBurst MicroArchitecture

- New architecture used for the Intel Pentium 4 and Xeon processors

Intel NetBurst Microarchitecture

Changes from P6 Architecture

• Only one instruction decoder present

• Decoder moved outside the Out-of-Order Execution Unit; an Execution Trace Cache was added in its place

• Increased number of pipeline stages to 20

• Improved branch prediction algorithms

• ALUs operate twice as quickly as their P6 counterparts

Intel NetBurst Microarchitecture

Execution Trace Cache

• Alleviates delays in fetching and translating CISC instructions to their appropriate µops

• Instructions are now decoded by a translation engine, with the resulting µops stored as traces (sequences of µops) in the Execution Trace Cache.

• Traces stored in path of predicted program execution flow, with results of branches in the code integrated into this path

• Delivers up to 3 µops to the core of the Execution Unit per clock cycle
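The idea of caching decoded µops can be sketched with a small lookup structure. Everything here is an illustrative assumption: the class `TraceCache`, the stand-in decoder `fake_decode`, and the choice to key traces by starting address. The sketch only shows that a hit bypasses the decoder and that µops are handed out in groups of at most three per cycle.

```python
from typing import Callable, Dict, List


class TraceCache:
    """Toy trace cache: maps a starting address to an already-decoded µop sequence."""

    def __init__(self, decode: Callable[[int], List[str]], uops_per_cycle: int = 3):
        self._decode = decode  # hypothetical slow decoder, used only on a miss
        self._traces: Dict[int, List[str]] = {}
        self._width = uops_per_cycle
        self.misses = 0

    def fetch(self, address: int) -> List[List[str]]:
        """Return the trace at `address`, split into per-cycle groups of µops."""
        if address not in self._traces:
            self.misses += 1
            self._traces[address] = self._decode(address)
        trace = self._traces[address]
        return [trace[i:i + self._width] for i in range(0, len(trace), self._width)]


def fake_decode(address: int) -> List[str]:
    # Stand-in for the translation engine: pretend every trace decodes to four µops.
    return [f"uop{n}@{address:#x}" for n in range(4)]


cache = TraceCache(fake_decode)
first = cache.fetch(0x1000)   # miss: decoded, then stored
second = cache.fetch(0x1000)  # hit: decoder is bypassed
print(len(first), cache.misses)  # 2 1
```

Four µops at three per cycle take two delivery cycles, and the second fetch of the same address does not invoke the decoder again.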

Intel NetBurst Microarchitecture

Branch Prediction

• Branch targets are predicted based on their linear address using branch prediction logic and fetched as soon as possible

• Targets are fetched from the Execution Trace Cache if cached there; otherwise they are fetched from the memory hierarchy

• Downside: despite the improved prediction algorithm, mispredicted branches are one of the biggest costs of this architecture, because its instruction pipeline is longer than in previous architectures.
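Why a longer pipeline makes mispredictions more expensive can be seen with a common back-of-the-envelope model: effective CPI equals the base CPI plus the average misprediction stall per instruction. The function name `effective_cpi` and all the numbers below (branch fraction, misprediction rate, and penalties of 10 and 20 cycles chosen to echo the two pipeline depths) are assumptions for illustration, not measured values.

```python
def effective_cpi(base_cpi: float, branch_fraction: float,
                  mispredict_rate: float, penalty_cycles: int) -> float:
    """Base CPI plus the average stall cycles per instruction from mispredicted branches."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Assumed workload: 20% branches, 5% of them mispredicted.
shorter_pipeline = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=10)
longer_pipeline = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=20)

print(round(shorter_pipeline, 2))  # 1.1
print(round(longer_pipeline, 2))   # 1.2
```

With the same predictor accuracy, doubling the penalty doubles the cycles lost to mispredictions, which is why better prediction was needed to offset the deeper pipeline.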

MIPS Superscalar Architecture

• MIPS is a RISC instruction platform, versus Intel’s CISC instruction platform (made design of Superscalar Architecture easier than for Intel’s CISC platform)

• The first MIPS processor with a Superscalar Architecture was the 64-bit MIPS R8000, released in 1994.

MIPS R8000 Processor

R8000 Chip Set Diagram

Courtesy of Silicon Graphics http://sgi.cartsys.net/i2sec7.html

MIPS R8000 Features

• Superscalar

• Can support/process 4 in-order instructions each cycle

• Multi-component chip set (Integer Unit, Floating Point Unit, Tag RAMs and Data Streaming Cache)

• Designed for peak performance with Floating Point Operations

MIPS R8000 Pitfalls

• Integer operation performance limited

• Very high cost

As a result of these two key factors:

• The R8000 was only in the marketplace for about a year.

• This processor was mainly used only in the scientific community

MIPS R10000 Processor

Superscalar Pipeline Architecture for the R10000 processor. Diagram courtesy of R10000 Microprocessor User’s Manual.

http://techpubs.sgi.com/library/dynaweb_docs/hdwr/SGI_Developer/books/R10K_UM/sgi_html/t5.Ver.2.0.book_12.html

R10000 Processor - Features

• Introduced in 1995

• Improved integer instruction performance

• Ability to create a multi-processor system (can attach up to 4 R10000 chips together)

• Fetches and decodes 4 instructions each clock cycle/pipeline stage

• “Out Of Order” Instruction Execution (first MIPS processor to support this feature)

R10000 Block Diagram

Each decoded instruction is sent to one of 3 instruction queues:

- Address Queue (Load/Store Instructions)

- Integer Queue (Integer ALU Operations)

- Floating Point Queue (Floating Point Arithmetic Operations)
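The routing of decoded instructions into the three queues can be sketched as a simple lookup. The opcode names and the table `QUEUE_FOR_CLASS` are hypothetical placeholders; real R10000 decoding distinguishes many more instruction types.

```python
from collections import deque
from typing import Deque, Dict, Iterable

# Hypothetical opcode classes; real R10000 decoding is far more detailed.
QUEUE_FOR_CLASS = {
    "load": "address", "store": "address",
    "add": "integer", "sub": "integer",
    "fadd": "floating_point", "fmul": "floating_point",
}


def dispatch(opcodes: Iterable[str]) -> Dict[str, Deque[str]]:
    """Place each decoded opcode in the queue that matches its class."""
    queues: Dict[str, Deque[str]] = {
        "address": deque(), "integer": deque(), "floating_point": deque(),
    }
    for op in opcodes:
        queues[QUEUE_FOR_CLASS[op]].append(op)
    return queues


queues = dispatch(["load", "add", "fmul", "store"])
print(list(queues["address"]))  # ['load', 'store']
```

Separate queues let each class of instruction wait for its own operands and execution unit without blocking the others.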

MIPS R10000 Processor

• 5 Execution Pipelines

- Load/Store Unit

- Two Integer ALUs

- Floating Point Adder

- Floating Point Multiplier

• Can process up to 4 out of order instructions simultaneously

• Base architecture core that all successor MIPS processors have been built from

PowerPC

• Direct descendant of IBM 801, RT PC and RS/6000

• All are RISC

• RS/6000 first superscalar

• PowerPC 601 superscalar design similar to RS/6000

• Later versions extend superscalar concept

PowerPC 601 Pipeline Structure

PowerPC 601 Pipeline

PowerPC 601 General View

PowerPC storage model

• Supports byte (8-bit), halfword (16-bit), word (32-bit) and doubleword (64-bit) data types.

• Handles string operations for multi-byte strings up to 128 bytes

• 32-bit PowerPC implementations support a 4-GB effective address space.

• 64-bit PowerPC implementations support a 16-exabyte effective address space.
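The two address-space figures follow directly from the address width, assuming byte addressing and binary units (1 GB = 2^30 bytes, 1 exabyte = 2^60 bytes). The helper name `address_space_bytes` is hypothetical.

```python
def address_space_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with address_bits-bit addresses."""
    return 2 ** address_bits


GIB = 2 ** 30  # bytes per gigabyte (binary)
EIB = 2 ** 60  # bytes per exabyte (binary)

print(address_space_bytes(32) // GIB)  # 4
print(address_space_bytes(64) // EIB)  # 16
```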

General-purpose registers (GPR)

• User Instruction Set architecture specifies all implementations have 32 GPRs

• GPRs are the source and destination of all integer operations

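• When GPR0 is specified as the base register in certain instructions (such as address calculations), the value zero is used and no lookup is done for GPR0’s contents.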

Floating-point registers (FPR)

• All implementations have 32 FPRs.

• FPRs are the source and destination operands of all floating-point operations.

• FPRs can contain 32-bit and 64-bit signed and unsigned integer values, as well as single-precision and double-precision floating-point values.

Special-purpose registers (SPR)

• Give status and control of resources within the processor core.

• SPRs that applications can read and write without support from a system service include the Count Register, the Link Register and the Integer Exception Register.

• SPRs that applications can only read with support from a system service include the Time Base and other timers.

T1000 Architectures

• The T1000 Architectures are reconfigurable computing architectures embedded into a superscalar processor.

• T1000 Architectures rely on the programmable functional unit (PFU), integrated into the datapath.

• T1000 is assumed to be a 4-issue out-of-order machine, which helps tolerate the latencies of some data-dependent instruction sequences.

• A T1000 extended instruction is encoded as a register-register operation with a specific opcode.

Hobbes

• A multi-threaded architecture that attempts to increase pipeline utilization by concurrently executing instructions from different threads.

• The architecture chosen was an aggressive, speculative, out-of-order superscalar processor based on the MIPS R2000 instruction set.

• The Hobbes architecture combines multi-threading with superscalar issue, with the supposition that strengths of one should offset the weaknesses of the other.

• By supporting superscalar issue from more than one thread, the architecture overcomes the lack of instruction-level parallelism that plagues other superscalar structures.

Background

• The Hobbes micro-architecture draws its inspiration from two widely differing architectures: Multi-threaded and superscalar.

• It is hoped that combining the fundamental concepts of these architectures will build upon their respective strengths and compensate for their corresponding weaknesses, allowing the hybrid to be greater than the sum of its parts.

Multi-threaded Architectures

• Multi-threaded processors can concurrently execute instructions from more than one thread.

• The contexts of multiple threads are stored on-board, which allows instructions to be issued from different threads.

• Traditional multi-threaded architectures have usually implemented a round-robin execution strategy that switches instruction execution to a new thread every cycle.
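The round-robin strategy in the last bullet can be sketched in a few lines. The function name `round_robin_schedule` and the thread identifiers are illustrative assumptions; the sketch only shows which thread gets to issue in each cycle.

```python
from itertools import cycle, islice
from typing import List


def round_robin_schedule(thread_ids: List[int], cycles: int) -> List[int]:
    """Return which thread issues in each clock cycle, switching threads every cycle."""
    return list(islice(cycle(thread_ids), cycles))


print(round_robin_schedule([0, 1, 2], 7))  # [0, 1, 2, 0, 1, 2, 0]
```

With three threads, each thread issues once every three cycles, which hides latencies of up to two cycles within a thread but also caps single-thread throughput.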

The Thread Unit of Hobbes

• The Thread unit contains all of the elements required to support a single thread.

• It consists of a fetch buffer, issue buffer, decode logic, branch adder and the thread state storage.

The Thread Unit

• Instruction fetch is performed by reading an entire cacheline of four words and storing it in the fetch buffer.

• Each thread decodes and issues its instructions in program order. After an instruction has been decoded, it is stalled until all of its operands are available.

• Once the operands are ready, the instruction is placed into the issue buffer and the issue unit is notified.

• The register file is very similar to that found on the R2000. The register file has two write ports and both of these may be from the same thread.

• Branches which do not affect the register file are executed in the thread unit and are not issued to the execution unit.
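The in-order decode and stall behaviour of a single thread unit can be sketched as below. This is a simplified model under stated assumptions: instructions are reduced to a destination register and source registers, the set `ready` lists registers whose values are currently available, results of newly issued instructions are not treated as available, and the names `Instr` and `issue_in_order` are hypothetical.

```python
from typing import List, Set, Tuple

Instr = Tuple[str, Tuple[str, ...]]  # (destination register, source registers)


def issue_in_order(fetch_buffer: List[Instr], ready: Set[str]) -> List[Instr]:
    """Move instructions to the issue buffer in program order.

    Stops at the first instruction with an unavailable operand, because a
    later instruction from the same thread may not pass a stalled one.
    """
    issue_buffer: List[Instr] = []
    for instr in fetch_buffer:
        _, sources = instr
        if not all(src in ready for src in sources):
            break  # stall: this and all younger instructions wait
        issue_buffer.append(instr)
    return issue_buffer


# One four-word cache line of decoded instructions (hypothetical contents).
line = [("r3", ("r1", "r2")), ("r5", ("r4",)), ("r6", ("r1",)), ("r7", ("r2",))]
print(len(issue_in_order(line, ready={"r1", "r2"})))  # 1
```

The third and fourth instructions have their operands ready but still wait behind the stalled second one; in Hobbes, issue slots left empty this way can be filled by other threads.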

The Execution Units of Hobbes

• The Hobbes architecture has a set of execution units almost identical to that of an out-of-order superscalar processor.

• The characteristics of the execution units approximately correspond to those of the R2000/R2010.

• Execution Units

- Integer: 2 ALUs, Shifter, Multiply / Divide, Load / Store, Data cache interface

- FP: FP Convert, FP Add, FP Multiply, FP Divide

Superscalar Architecture

• Superscalar processors improve performance by reducing the average number of cycles required to execute each instruction

• This is accomplished by issuing and executing more than one independent instruction per cycle, rather than limiting execution to just one instruction per cycle as traditional pipelined architectures do.

• For superscalar architectures to experience speed-up over traditional pipelined architectures they require the average level of available instruction-level parallelism to be greater than one.

References

• Hennessy, John L. and Patterson, David A. “Computer Organization and Design: The Hardware/Software Interface.” San Francisco: Morgan Kaufmann Publishers, 1998.

• Sarimento, Sara. “Recent History of Intel Architecture – A Refresher.” 17 April 2004. Intel Corporation www.intel.com 18 April 2004 http://www.intel.com/cd/ids/developer/asmo-na/eng/microprocessors/ia32/pentium4/optimization/44015.htm

• Zhou & Martonosi. “Augmenting Modern Superscalar Architectures with Configurable Extended Instructions.” 19 April 2004. http://ipdps.eece.unm.edu/2000/raw/18000943.pdf

• Kish & Preiss. “Hobbes: A Multi-Threaded Superscalar Architecture.” 19 April 2004 http://www.brpreiss.com/page75.html

• R10000 Processor User’s Manual. 9 Dec 1996. SGI Corporation. 22 April 2004 http://techpubs.sgi.com/library/dynaweb_docs/hdwr/SGI_Developer/books/R10K_UM/sgi_html/index.html#HEADING1

• “MIPS Architecture.” 17 April 2004. Wikipedia, The Free Encyclopedia http://en.wikipedia.org/wiki/Main_Page 23 April 2004 http://en.wikipedia.org/wiki/MIPS_architecture

• Mapleson, Ian. “Indigo 2 and Power Indigo 2 Technical Report.” SiliconGraphics. 23 April 2004 http://sgi.cartsys.net/i2sec7.html

• “Power PC Architecture.” 23 April 2004 http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/power/ppc_arch.html