The Microarchitecture of Pentium 4 Processor

Glenn Hinton, Desktop Platforms Group, Intel Corp.
Dave Sager, Desktop Platforms Group, Intel Corp.
Mike Upton, Desktop Platforms Group, Intel Corp.
Darrell Boggs, Desktop Platforms Group, Intel Corp.
Doug Carmean, Desktop Platforms Group, Intel Corp.
Alan Kyker, Desktop Platforms Group, Intel Corp.
Patrice Roussel, Desktop Platforms Group, Intel Corp.


Salient Features of the Pentium 4 processor

42 million transistors implemented on a 0.18 µm CMOS process.

Die size of 217 mm².

Consumes 55 watts of power at 1.5 GHz.

3.2 GB/s system bus.

144 new 128-bit SSE2 SIMD instructions for multimedia, content creation, scientific and engineering applications.


Salient Features of the Pentium 4 processor (contd.)

Much Better User Experience in:
Internet Audio and Streaming Video.
Image Processing.
Video Content Creation.
Speech Recognition.
3D applications and games.
Multi-tasking user environments.
Real-time MPEG2 video encoding and near real-time MPEG4 encoding.
Efficient video editing and video conferencing.


The Intel NetBurst™ Microarchitecture

The Execution Trace Cache.

Out-of-order Execution engine.

Extremely low latency double-pumped ALU.

Very low latency Level 1 Data Cache.

Outstanding floating-point and Multimedia performance.


NetBurst™ Microarchitecture (Block Diagram)


Main Sections of the Architecture

In-Order Front End.

Out-of-Order Execution Logic.

Integer and Floating-Point Execution Units.

Memory Subsystem.


Pentium 4 Processor Microarchitecture


Front End

The Execution Trace Cache.

The Microcode ROM.

Instruction TLB (ITLB).

Front-End Branch Predictor (Front-End BTB).

IA-32 Instruction Decoder.


The Execution Trace Cache.

Primary or Advanced form of L1 Instruction Cache.

Delivers 3 µops/clock to the out-of-order execution logic.

Most instructions are fetched and decoded from the Trace Cache.

It takes the already decoded µops from the IA-32 decoder and builds them into program-ordered sequences of µops called traces, each made up of many trace lines. It packs 6 µops per trace line.

The NetBurst architecture caches the µops of previously decoded instructions here, so it bypasses the instruction decoder most of the time, thereby reducing misprediction latency.

It has its own branch predictor, the Trace Cache BTB. Together with the front-end BTB, it uses a highly advanced branch prediction algorithm that reduces the misprediction rate by one third compared to the P6 microarchitecture.

Recovery time for a mispredicted branch is much shorter because the machine does not have to re-decode the IA-32 instructions needed to resume execution from the branch target location.
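
The packing of a decoded sequence into trace lines can be sketched as follows. This is only an illustration under simple assumptions: the function name and the list-of-strings representation are hypothetical, and the sketch ignores any other rule the hardware may use to end a trace line. Only the figure of 6 per line comes from the slide.

```python
from typing import List

UOPS_PER_TRACE_LINE = 6  # from the slide: 6 micro-ops per trace line


def pack_trace(uops: List[str], line_size: int = UOPS_PER_TRACE_LINE) -> List[List[str]]:
    """Split a program-ordered micro-op sequence into fixed-size trace lines.

    Hypothetical helper: the last line may be only partly filled.
    """
    return [uops[i:i + line_size] for i in range(0, len(uops), line_size)]


# 14 decoded micro-ops fill two full lines and leave 2 in a third line.
trace = pack_trace([f"uop{i}" for i in range(14)])
print([len(line) for line in trace])  # [6, 6, 2]
```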


The Microcode ROM

Used for complex IA-32 instructions, such as string move, and for fault and interrupt handling.

On a complex instruction, the Trace Cache jumps to the microcode ROM, which then issues the µops needed to complete the operation.

The µops from the Trace Cache and the microcode ROM are buffered in a simple in-order µop queue that smooths the flow of µops to the out-of-order execution engine.

Instruction TLB (ITLB)

Steers the front-end when the machine misses the Trace Cache.

Translates linear instruction pointer addresses into the physical addresses needed to access the L2 cache, and also performs page-level protection checking.

It uses the highly accurate front-end branch prediction logic to know what to fetch next.


IA-32 Instruction Decoder.

It receives the IA-32 instruction bytes from the L2 cache, 64 bits at a time, and decodes them into µops.

It can decode at a maximum rate of one IA-32 instruction per clock cycle.

Some instructions need a single µop, while others need many µops.

If a (complex) IA-32 instruction needs more than 4 µops, the decoder sends the machine to the microcode ROM to complete it.
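
The hand-off rule between the decoder and the microcode ROM can be written as a one-line check. A minimal sketch: the function name and the string labels are hypothetical, and only the threshold of 4 comes from the slide.

```python
MAX_DECODER_UOPS = 4  # from the slide: more than 4 micro-ops goes to the microcode ROM


def uop_source(uop_count: int) -> str:
    """Return which unit supplies the micro-ops for one IA-32 instruction."""
    if uop_count < 1:
        raise ValueError("an instruction needs at least one micro-op")
    return "decoder" if uop_count <= MAX_DECODER_UOPS else "microcode_rom"


print(uop_source(1))   # decoder
print(uop_source(4))   # decoder
print(uop_source(12))  # microcode_rom
```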

Retirement Logic:

It reorders instructions that were executed out of order back into program order.

It receives the completion status of executed instructions from the execution units and processes the results so that the proper architectural state is committed (retired) in program order.

The Pentium 4 can retire up to 3 µops per clock cycle.

It ensures that an exception is raised only if the operation causing it is the oldest non-retired operation in the machine.

It also reports branch history information to the branch predictors at the front end so that they stay up to date.
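
In-order retirement can be sketched with a queue that only ever releases entries from its head. This is a simplified model, not the actual hardware logic: the dictionary fields and function name are hypothetical, and only the width of 3 per clock is taken from the slide.

```python
from collections import deque

RETIRE_WIDTH = 3  # from the slide: up to 3 micro-ops retire per clock


def retire_cycle(rob: deque, width: int = RETIRE_WIDTH) -> list:
    """Retire up to `width` completed micro-ops from the head of the ROB.

    Retirement stops at the first entry that has not finished executing,
    so younger completed entries wait behind it (program order is kept).
    """
    retired = []
    while rob and len(retired) < width and rob[0]["done"]:
        retired.append(rob.popleft()["name"])
    return retired


rob = deque([
    {"name": "a", "done": True},
    {"name": "b", "done": True},
    {"name": "c", "done": False},  # still executing
    {"name": "d", "done": True},   # finished early, but must wait for c
])
print(retire_cycle(rob))  # ['a', 'b']
```

Here `d` completed out of order but cannot retire until `c` does, which is also what keeps exceptions tied to the oldest non-retired operation.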


Out-of-Order Execution Logic

Here instructions are prepared for execution and actually executed.

It has several buffers to smooth and reorder the instruction flow, optimizing performance as instructions go down the pipeline and get executed.

It aggressively reorders instructions so that they execute as soon as their input operands are ready.

Instructions that follow a delayed instruction are allowed to execute as long as they do not depend on it.

The retirement logic reorders the instructions, executed out of order, back to their original program order.

It has the following components:
The Allocator.
Register Renaming.
µop Scheduling.


The Allocator.

The allocator allocates many of the key machine buffers needed by each µop to execute.

If a required resource, such as a register file entry, is unavailable for any of the 3 µops coming to the allocator this clock cycle, it stalls this part of the machine. When the resources become available, it assigns them to the requesting µops.

It allocates a Reorder Buffer (ROB) entry, which tracks the completion status of one of the 126 µops that could be in flight simultaneously.

It allocates one of the 128 integer or floating-point register entries for the result data of the µop.

It allocates a load or store buffer entry to track one of the 48 loads or 24 stores in the machine pipeline.

It allocates an entry in one of the two µop queues in front of the instruction schedulers.


Register Renaming.

The register renaming logic renames the logical IA-32 registers, such as EAX and EBX, onto the 128-entry physical register file. This removes false conflicts caused by multiple instructions that each use their own unique version of a logical register like EAX, EBX, or ECX.

The renaming logic remembers the most current version of each architectural register in the RAT (Register Alias Table), so that a new instruction knows where to get the correct current instance of its input operands.

A Register File (RF) entry is allocated from a list of available registers in the 128-entry RF, not sequentially like the ROB entries.

A sequence number is assigned to each µop, indicating its relative age; it also points to the µop's entry in the ROB array.
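
A minimal sketch of renaming with a RAT and a free list is shown below. The class, its method names, and the four-register subset are hypothetical, and freeing physical registers at retirement is left out; only the 128-entry register file size comes from the slide.

```python
from collections import deque


class Renamer:
    """Toy register renamer: a RAT plus a free list of physical registers."""

    def __init__(self, num_physical=128, logical=("EAX", "EBX", "ECX", "EDX")):
        self.free = deque(range(num_physical))
        # Give every logical register an initial physical home.
        self.rat = {reg: self.free.popleft() for reg in logical}

    def rename(self, dest, srcs):
        """Map sources through the RAT, then give `dest` a fresh physical register."""
        phys_srcs = [self.rat[s] for s in srcs]  # read current versions first
        if not self.free:
            raise RuntimeError("no free physical register: allocator would stall")
        phys_dest = self.free.popleft()
        self.rat[dest] = phys_dest               # later readers see the new version
        return phys_dest, phys_srcs


r = Renamer()
print(r.rename("EAX", ["EBX"]))  # (4, [1])  EAX <- f(EBX)
print(r.rename("ECX", ["EAX"]))  # (5, [4])  true dependence on the first write
print(r.rename("EAX", ["EDX"]))  # (6, [3])  second write to EAX gets its own register
```

The two writes to EAX land in different physical registers (4 and 6), so the second one no longer conflicts with the first or with the read in between.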


Register Allocation

NetBurst Architecture:

It allocates the ROB and the result data Register File (RF) entries separately.

The ROB entries track the µop status, consist only of a status field, and are allocated and deallocated sequentially.

The current version of a register is in the RF.

Upon retirement, no result data values are actually moved from one physical structure to another.

Pentium III (P6) Architecture:

It allocates the ROB and result data register entries as a single, wide entity.

The ROB entries track the µop data result value and status, consist of both a data and a status field, and are allocated and deallocated sequentially.

The current version of a register could be in the ROB or in the RRF (Retirement Register File).

Upon retirement, result data values are physically moved from the ROB data result field into the separate RRF.


µop Scheduling.

Heart of the out-of-order execution engine.

The µop schedulers determine when a µop is ready to execute by tracking its input register operands. They allow instructions to be reordered to execute as soon as they are ready, while still maintaining the correct dependencies from the original program.

There are 2 µop queues: one for memory operations and one for non-memory operations.

Each queue stores µops in strict FIFO order, but each queue is allowed to be read out of order with respect to the other queue.

Two fast execution dispatch ports are shared by multiple schedulers.

Fast ALU schedulers can schedule on each half of the main clock cycle, while other schedulers can schedule only once per main processor clock cycle.

A load dispatch port and a store dispatch port can dispatch a ready load and a ready store each clock cycle.

Collectively, the dispatch ports can dispatch up to 6 µops per clock cycle, which exceeds the front-end and retirement bandwidth of 3 µops per cycle.
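
The peak figure of 6 can be checked with a little arithmetic. The breakdown used here (two fast ports, each dispatching twice per main clock, plus one load and one store) is a reading of the bullets above rather than something the slide spells out, and the function and parameter names are hypothetical.

```python
def peak_dispatch_per_clock(fast_ports=2, fast_slots_per_clock=2,
                            load_ports=1, store_ports=1) -> int:
    """Peak micro-ops dispatched per main clock under the assumed port breakdown."""
    return fast_ports * fast_slots_per_clock + load_ports + store_ports


FRONT_END_UOPS_PER_CLOCK = 3  # from the slide: front-end and retirement bandwidth

peak = peak_dispatch_per_clock()
print(peak, peak > FRONT_END_UOPS_PER_CLOCK)  # 6 True
```

The extra dispatch bandwidth lets the machine catch up after stalls, since the front end can refill the queues at only 3 per clock.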


Integer and Floating-Point Execution Units

Low Latency Integer ALU.

Low Latency Level 1 (L1) Data Cache.

Store-to-Load Forwarding.

FP/SSE Execution units.


Low Latency Integer ALU.

Operates at twice the main clock rate and hence speeds up program execution.

The ALU core is kept as small as possible to minimize metal length and loading; only essential hardware is included.

It uses staggered ALU addition to perform fast calculations.

Simple, very frequent ALU operations go to the high-speed integer ALU execution units, whereas complex operations (shift, rotate, multiply, divide) go to separate hardware for completion. Most integer shift or rotate operations go to the complex integer dispatch port.
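
The idea of a staggered add can be sketched in software: the low half of the sum is produced first, and its carry feeds the high half one step later. The slide does not give the split, so the 16-bit halves and the function name below are assumptions made for illustration.

```python
MASK16 = 0xFFFF


def staggered_add32(a: int, b: int) -> int:
    """Add two 32-bit values in two steps, low half first (assumed 16-bit halves)."""
    # Step 1 (first fast clock): low 16 bits, which a dependent add could already use.
    low = (a & MASK16) + (b & MASK16)
    carry = low >> 16
    # Step 2 (next fast clock): high 16 bits plus the carry out of step 1.
    high = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry
    # Result wraps at 32 bits, like the hardware register would.
    return ((high & MASK16) << 16) | (low & MASK16)


print(hex(staggered_add32(0x0000FFFF, 0x00000001)))  # 0x10000
print(hex(staggered_add32(0x12345678, 0x11111111)))  # 0x23456789
```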


Low Latency Level 1 (L1) Data Cache.

8 K-byte, low-latency, 4-way set-associative, write-through cache with 64 bytes/line, used for both integer and floating-point/SSE loads and stores.

Combined with a medium-latency L2 cache, it gives a lower net load-access latency and therefore higher performance.

It uses new access algorithms to achieve very low load-access latency by leveraging the fact that almost all accesses hit the first-level data cache and the data TLB.

The µop schedulers dispatch dependent operations before the parent load has finished executing.

It uses a form of data speculation and a mechanism called replay that tracks and re-executes dependent operations that use incorrect data.
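
The cache geometry quoted above fixes the number of sets, which can be worked out directly. The sketch below also splits an address into tag, index, and offset using conventional modulo indexing; that indexing scheme and the helper names are assumptions, not something the slide states.

```python
def cache_sets(size_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * line_bytes)


def split_address(addr: int, size_bytes: int, ways: int, line_bytes: int):
    """Split an address into (tag, set index, byte offset), assuming modulo indexing."""
    sets = cache_sets(size_bytes, ways, line_bytes)
    offset = addr % line_bytes
    index = (addr // line_bytes) % sets
    tag = addr // (line_bytes * sets)
    return tag, index, offset


# L1 data cache figures from the slide: 8 KB, 4-way, 64-byte lines.
L1 = dict(size_bytes=8 * 1024, ways=4, line_bytes=64)
print(cache_sets(**L1))              # 32
print(split_address(0x12345, **L1))  # (36, 13, 5)
```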


Store-to-Load Forwarding.

The machine can have up to 24 stores in the pipeline at a time.

Stores at retirement have to wait for previous stores to complete their data cache update. Stores can commit only after retiring.

To make use of pending stores, modern out-of-order processors have a pending store buffer that lets loads use pending store results before the stores have been written into the L1 data cache. On the Pentium 4 this buffer has 24 entries.

The pending store buffer is optimized to forward data quickly and efficiently from pending stores to dependent loads.

For forwarding to take place, the load must be the same size as or smaller than the pending store and have the same beginning physical address as the store. Loads that do not meet these conditions are bad store-to-load forwarding cases; the NetBurst microarchitecture removes most of them.
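
The forwarding condition can be expressed as a small check over a list of pending stores. This is a simplified model under stated assumptions: the `PendingStore` type and `forward` function are hypothetical, the buffer is ordered oldest to youngest, and a load that overlaps the youngest matching store without meeting the condition simply gets no forwarding.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PendingStore:
    addr: int    # beginning physical address of the store
    size: int    # store size in bytes
    data: bytes  # store data, in memory order


def forward(load_addr: int, load_size: int,
            store_buffer: List[PendingStore]) -> Optional[bytes]:
    """Return forwarded bytes, or None if the load must wait for the cache."""
    for st in reversed(store_buffer):  # youngest pending store first
        overlaps = st.addr < load_addr + load_size and load_addr < st.addr + st.size
        if not overlaps:
            continue
        # Condition from the slide: same start address, load no larger than the store.
        if st.addr == load_addr and load_size <= st.size:
            return st.data[:load_size]
        return None  # overlapping but not forwardable
    return None      # no pending store touches this load


buf = [PendingStore(addr=0x100, size=4, data=b"\x01\x02\x03\x04")]
print(forward(0x100, 2, buf))  # b'\x01\x02'  (same start, smaller load)
print(forward(0x102, 2, buf))  # None         (different start address)
```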


FP/SSE Execution units.

Floating-point, MMX, SSE, and SSE2 instructions, with operands from 64 to 128 bits wide, are executed here.

Two 128-bit ports: one for general execution and the other for register moves and memory stores. Each can begin a new operation every clock cycle.

The machine can keep busy interleaving a multiply and an add every two clock cycles, at much less cost than fully pipelining all the FP/SSE execution hardware.

The FP adder can execute one Extended-Precision, one Double-Precision, or two Single-Precision additions every clock cycle.

There are 3 execution units that run in parallel for integer SIMD instructions.


Memory Subsystem

Level 2 Instruction and Data Cache.

3.2 GB/s System Bus.


Level 2 (L2) Instruction and Data Cache.

The L2 cache stores data that cannot fit into the L1 cache.

It is a 256 K-byte, 8-way set-associative, write-back cache with 128 bytes/line that holds both instructions and data.

A hardware prefetcher associated with this cache monitors data access patterns and prefetches data automatically into the L2 cache.

It attempts to stay 256 bytes ahead of the current data access locations.

It remembers the cache miss history to prefetch concurrent, independent streams of data ahead of their use in the program. It also tries to minimize prefetching of unwanted data, which would over-utilize the system bus and delay the real accesses the program needs.


3.2 GB/s System Bus.

It is a key enabler for applications that stream data from memory, using a 64-bit wide bus capable of transferring data at an effective rate of 400 MHz.

It uses a source-synchronous protocol with a 64-byte access length that quad-pumps the 100 MHz bus to give 400 million data transfers per second.

It has a split-transaction, deeply pipelined protocol that lets the memory subsystem overlap memory requests and deliver high memory bandwidth in real systems.
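
The 3.2 GB/s headline figure follows from the bus width and the quad-pumped clock. A short check, using only the numbers on this slide (the function name is hypothetical):

```python
def bus_bandwidth_bytes_per_s(width_bits: int, base_clock_hz: int,
                              transfers_per_clock: int) -> int:
    """Peak bandwidth = bytes per transfer x transfers per second."""
    return (width_bits // 8) * base_clock_hz * transfers_per_clock


peak = bus_bandwidth_bytes_per_s(width_bits=64, base_clock_hz=100_000_000,
                                 transfers_per_clock=4)
print(peak)  # 3200000000, i.e. 3.2 GB/s

# A 64-byte access moves 8 bytes per transfer, so it takes 8 transfers.
print(64 // (64 // 8))  # 8
```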


Performance


Highlights

A new, state-of-the-art processor architecture design: the NetBurst architecture.

Deeply pipelined design.

Powerful out-of-order execution engine.

World-leading operating frequencies.

Novel microarchitecture ideas:
Trace Cache.
Double-clocked ALU.
Very low-latency L1 data cache algorithms.
Very high-bandwidth system bus.
Store-to-load forwarding.

World-class performance in media-rich environments, 3D and workstation applications, and content creation.
