sun ultrasparc-iii architecture cmpe 511 presentation prepared by:balkır kayaaltı

SUN ULTRASPARC-III ARCHITECTURE

CMPE 511 PRESENTATION

Prepared by: Balkır Kayaaltı

Introduction

SPARC stands for Scalable Processor ARChitecture. It is an open processor architecture (i.e., member companies of the SPARC community can freely produce the processor).

Sun's UltraSPARC (SPARC V9) is a robust RISC architecture with

-64-bit integer addresses and data

-Superscalar implementations

-Extremely fast trap handling and context switching.

This presentation looks in detail at Sun Microsystems' UltraSPARC-III (SPARC V9) architecture.

Major Architectural units

The processor’s micro-architecture design has six major functional units that perform relatively independently:

-Instruction issue unit (IIU)

-Floating point unit (FPU)

-Integer execution unit (IEU)

-Data cache unit (DCU)

-External memory unit (EMU)

-System interface unit (SIU)

The units communicate requests and results among themselves through well-defined interface protocols, as the next figure shows.

Communication paths between architectural units

Instruction issue unit

This unit feeds the execution pipelines with instructions. It independently predicts the control flow through a program and fetches the predicted path from the memory system.

Fetched instructions are staged in a queue before being forwarded to the two execution units (integer and floating point).

This unit includes:

-A 32-Kbyte, four-way associative instruction cache

-The instruction address translation buffer

-A 16K-entry branch predictor

Ultra SPARC-III pipeline and physical data

Pipeline feature        Parameter
Instruction issue       4 integer
                        2 floating point
                        2 graphics
Level-one (L1) caches   Data: 64-Kbyte, 4-way
                        Instructions: 32-Kbyte, 4-way
                        Prefetch: 2-Kbyte, 4-way
                        Write: 2-Kbyte, 4-way
Level-two (L2) cache    Unified (data and instructions)
                        4- and 8-Mbyte, 1-way
                        On-chip tags; off-chip data

Pipeline

Pipeline blocks

Stage   Function
A       Generate instruction fetch addresses; generate pre-decoded instruction bits on
P       Fetch first cycle of instructions from cache; access first cycle of branch prediction
F       Fetch second cycle of instructions from cache; access second cycle of branch prediction; translate virtual-to-physical address
B       Calculate branch target addresses; decode first cycle of instructions
I       Decode second cycle of instructions; enqueue instructions into the queue
J       Steer instructions to execution units
R       Read integer register file operands; check operand dependencies
E       Execute integer arithmetic, logical, and shift instructions; access first cycle of data cache; read, and check dependency of, floating-point register file
C       Access second cycle of data cache, and forward load data for word and doubleword loads; execute first cycle of floating-point instructions
M       Load data alignment for half-word and byte loads; execute second cycle of floating-point instructions
W       Write speculative integer register file; execute third cycle of floating-point instructions
X       Extend integer pipeline for precise floating-point traps; execute fourth cycle of floating-point instructions
T       Report traps
D       Write architectural register file

Pipeline

-The instruction issue unit: stages A-J

-The execution unit: stages R-D

-Data cache: stages E, C, M, and W of the pipe, in parallel with the integer execution unit stages

-Floating point unit: a side pipeline parallel to stages E through D of the integer pipeline

Pipeline

Instruction issue unit cont.

To increase performance, a high level of instruction-level parallelism is desired.

Ultra SPARC is a static speculation machine.

- Dynamic speculation machines require very high fetch bandwidths to fill an instruction window and find instruction-level parallelism.

- In a static speculation machine the compiler can make the speculated path sequential, resulting in fewer requirements on the instruction fetch unit.

Instruction issue unit:

Stage A: Address lines enter the instruction cache. All fetch address generation and selection occurs.

Stages P, F: Instruction cache access, branch prediction, and instruction address translation access.

By the time the instructions are available from the cache in the B stage, we also have the physical address from the translator and a prediction for any branch that was fetched.

The processor uses all this information in the B stage to determine whether to follow a sequential or taken-branch path.

Branch prediction

The processor also determines whether the instruction cache access was a hit or miss. If the processor predicts a taken branch in the B stage, the processor sends back the target address for the branch to the A stage to redirect the fetch stream.

Waiting until the B stage to redirect the fetch stream lets us use a large, accurate branch predictor.

The branch predictor uses a gshare algorithm with 16K 2-bit saturating up/down counters.

Because the predictor is large, its access is pipelined.
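The gshare scheme can be sketched as follows. This is a minimal illustration, not the UltraSPARC-III circuit: the table size of 16K 2-bit counters comes from the slide, while the history length (12 bits), the initial counter value, and the class and method names are assumptions made for the example.

```python
class GsharePredictor:
    """gshare: index a table of 2-bit counters with PC XOR global history."""

    def __init__(self, entries=16 * 1024, history_bits=12):
        # entries must be a power of two; history_bits is an assumed value.
        self.mask = entries - 1
        self.history_mask = (1 << history_bits) - 1
        self.counters = [1] * entries   # 0..3, start at "weakly not taken"
        self.history = 0                # global branch history register

    def _index(self, pc):
        # Drop the two low PC bits (4-byte instructions), then XOR in history.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        # Saturating up/down count, then shift the outcome into the history.
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask
```

A branch that is always taken trains the predictor once the history register has filled with ones, after which `predict` keeps returning `True` for that branch.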

Instruction buffer (queue)

There are two instruction queues: the instruction queue and the miss queue.

The 20-entry instruction queue decouples the fetch unit from the execution units, allowing each to proceed at its own rate.

When a branch is taken, two cycles must pass before the queue is filled with the right instructions; in the meantime, instructions held in the miss queue can be used immediately.

Integer execute unit

The execution pipelines can support the concurrent launch of up to six instructions, which can consist of:

-two integer operations, A0/A1 pipelines

-two FP operations, FP pipelines

-one memory operation (load/store), MS pipeline

-one special-purpose memory operation (prefetch cache load only)

-one control transfer instruction (CTI), BR pipeline

However, only four instructions per cycle (IPC) can be executed in a sustained manner.

Working and Architectural Register File (WARF)

Physically it is one block, but logically it can be seen as two separate register files (the working register file and the architectural register file).

SPARC architectures use register files and windowing techniques.

-The 8 global registers, g0 to g7, can be reached at any time.

-Global register g0 is always ‘0’.

-At any time, an instruction can access the 8 global registers and a 24-register window into the registers.

-A register window comprises the 8 ‘in’ and 8 ‘local’ registers of a particular register set, together with the 8 ‘in’ registers of an adjacent register set, which are addressable from the current window as ‘out’ registers.
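The overlap between adjacent windows can be sketched as a mapping from a register number to a physical slot. This is an illustration only: the numbering r0-r7 global, r8-r15 out, r16-r23 local, r24-r31 in follows the usual SPARC convention, while the function name, the slot layout, and the choice that the adjacent window is `cwp + 1` are assumptions made for the example.

```python
NWINDOWS = 8   # eight register windows, as in the register file described here


def physical_register(cwp, r):
    """Map register r (0..31) in window cwp to a (kind, slot) pair.

    Windowed registers live in a circular file of NWINDOWS * 16 slots:
    each window owns 8 'in' and 8 'local' slots, and its 'out' registers
    are the 'in' slots of the adjacent window.
    """
    if not 0 <= r < 32:
        raise ValueError("register number must be 0..31")
    if r < 8:                                    # g0-g7
        return ("global", r)
    if r < 16:                                   # o0-o7: 'in' slots of the next window
        return ("windowed", ((cwp + 1) % NWINDOWS) * 16 + (r - 8))
    if r < 24:                                   # l0-l7: private to this window
        return ("windowed", cwp * 16 + 8 + (r - 16))
    return ("windowed", cwp * 16 + (r - 24))     # i0-i7
```

With this layout, `physical_register(2, 8)` (o0 of window 2) and `physical_register(3, 24)` (i0 of window 3) name the same slot, which is how arguments are passed without copying.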

Register windows

WARF

The WRF consists of 32 64-bit registers with 3 write ports and 7 read ports, plus a broadside write port of 32*64 - 64 = 1984 bits that transports data from the architectural register file (ARF).

The ARF has 160 entries (8 register windows in total):

-8x8=64 registers for the local registers of the windows

-8x8=64 registers for the shared IN/OUT registers

-4x8=32 registers for the 4 sets of 8 global registers

The WRF holds a single window and is updated as results are computed.

The processor accesses the WRF in the pipeline’s R stage and supplies integer operands to the execution units.

Most integer operations complete in one cycle, so the result can be written immediately, in the C stage.

If an exceptional event occurs, the results already written must be undone, so the original values of the integer registers are restored with a broadside copy of all integer registers from the appropriate ARF window.

The architectural register file is written at the end of the pipeline, once all exceptions have been resolved.

After a window change, the ARF fills 16 WRF entries.

On an exception, the 31 nonzero registers of the WRF must be updated.
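The interplay between the two register files can be sketched as a speculative copy and a committed copy. This is a simplified model under stated assumptions: it covers a single window only, and the class name, method names, and the `pending` list are hypothetical, not part of the processor's design.

```python
class WorkingArchitecturalRegisters:
    """Toy model: the WRF holds speculative values, the ARF committed ones."""

    def __init__(self):
        self.arf = [0] * 32    # architectural values of the current window
        self.wrf = [0] * 32    # working (speculative) values
        self.pending = []      # results in the WRF that are not yet committed

    def write_speculative(self, reg, value):
        # Early write into the WRF; register 0 (g0) always reads as zero.
        if reg == 0:
            return
        self.wrf[reg] = value
        self.pending.append((reg, value))

    def commit_oldest(self):
        # End of the pipeline: the oldest result becomes architectural.
        reg, value = self.pending.pop(0)
        self.arf[reg] = value

    def trap(self):
        # Undo all speculative results with one broadside copy from the ARF.
        self.pending.clear()
        self.wrf[1:] = self.arf[1:]
        return len(self.wrf) - 1   # number of registers restored (31)
```

A trap discards every uncommitted result at once, so the model never needs to know which individual registers were overwritten speculatively.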

On chip memory system

Cache diagram used in the architecture

On chip memory system

Level-one (L1) caches

-Data: 64-Kbyte, 4-way

-Instructions: 32-Kbyte, 4-way

-Prefetch: 2-Kbyte, 4-way

-Write: 2-Kbyte, 4-way

Level-two (L2) cache

-Unified (data and instructions)

-4- and 8-Mbyte, 1-way

-On-chip tags; off-chip data

average latency = L1 hit time + L1 miss rate * L1 miss time + L2 miss rate * L2 miss time
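The formula can be evaluated directly. In this sketch only the 12-cycle L1 miss time is taken from the L2 latency quoted in this presentation; the hit time, both miss rates, and the L2 miss time are hypothetical values chosen for illustration.

```python
def average_latency(l1_hit_time, l1_miss_rate, l1_miss_time,
                    l2_miss_rate, l2_miss_time):
    """Average memory latency in cycles, term by term as in the formula above."""
    return (l1_hit_time
            + l1_miss_rate * l1_miss_time
            + l2_miss_rate * l2_miss_time)


# Hypothetical workload: 2-cycle L1 hit, 5% L1 misses served by the L2 in
# 12 cycles, 1% of accesses missing the L2 and paying 100 cycles.
example = average_latency(2, 0.05, 12, 0.01, 100)   # about 3.6 cycles
```

With these numbers the two miss terms add 0.6 and 1.0 cycles to the 2-cycle hit time, which shows why lowering either miss rate matters as much as shortening the hit path.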

Prefetch cache

Performance is greatly increased by using a ‘prefetch cache’ in parallel with the ‘L1 data cache’.

By issuing up to eight in-flight prefetches to main memory, the prefetch cache enables a program to utilize 100% of the available main memory bandwidth without incurring a slow-down due to main memory latency.

Prefetch cache

The prefetch cache is a 2-Kbyte SRAM organized as 32 entries of 64 bytes, using four-way associativity with an LRU replacement policy.

A multi-port SRAM design achieves a very high throughput.

Data can be streamed through the prefetch cache in a manner similar to stream buffers.

On every cycle, each of two independent read ports supplies 8 bytes of data to the pipeline while a third, write port fills the cache with 16 bytes.
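The geometry and port widths above imply a few derived numbers, worked out in the sketch below. The constants come from the slide; the `set_index` function is an assumption about how a line address might select a set, added only to make the four-way organization concrete.

```python
LINE_BYTES = 64     # bytes per entry
ENTRIES = 32        # entries in the prefetch cache
WAYS = 4            # four-way associative

capacity_bytes = ENTRIES * LINE_BYTES              # 2048 bytes = 2 Kbytes
sets = ENTRIES // WAYS                             # 8 sets of 4 entries

read_bytes_per_cycle = 2 * 8                       # two read ports, 8 bytes each
write_bytes_per_cycle = 16                         # one write port
cycles_to_fill_line = LINE_BYTES // write_bytes_per_cycle   # 4 cycles per line


def set_index(address):
    """Assumed mapping: the line number modulo the number of sets."""
    return (address // LINE_BYTES) % sets
```

The read and write ports move the same 16 bytes per cycle, so a stream can be consumed at the rate at which it is filled.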

Prefetch cache

Some earlier processors, such as the UltraSPARC-II, use prefetch instructions.

The UltraSPARC-III adds an autonomous stride prefetch engine that tracks the program counters of load instructions and detects when a load instruction is striding through memory.

When the prefetch engine detects a striding load, it issues a hardware prefetch independent of any software prefetch.

This allows the prefetch cache to be effective even on codes that do not include prefetch instructions.
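Stride detection keyed by the load's program counter can be sketched as follows. This is a generic illustration of the idea, not the actual engine: the table organization, the rule that a stride must repeat once before a prefetch is issued, and all names are assumptions.

```python
class StridePrefetcher:
    """Toy stride detector: one table entry per load program counter."""

    def __init__(self):
        self.table = {}   # load PC -> (last address, last stride)

    def observe_load(self, pc, address):
        """Record one executed load; return an address to prefetch, or None."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (address, 0)
            return None
        last_address, last_stride = entry
        stride = address - last_address
        self.table[pc] = (address, stride)
        # Prefetch only when the same nonzero stride is seen twice in a row.
        if stride != 0 and stride == last_stride:
            return address + stride
        return None
```

A load at one PC that touches addresses 1000, 1064, and 1128 triggers a prefetch of 1192 on the third access, with no prefetch instruction in the code.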

Write cache

Write caching is an excellent way to reduce the bandwidth consumed by store traffic.

The UltraSPARC-III uses a write cache to reduce the store traffic bandwidth to the off-chip L2 data cache.

Its size is 2 Kbytes, four-way associative.

An advantage of using it: being the sole source of on-chip dirty data, the write cache easily handles both multiprocessor and on-chip cache consistency.

Error recovery also becomes easier with the write cache, since the write cache keeps all other on-chip caches clean and simply invalidates them when an error is detected.

Write caching

A byte-validate policy is used on the write cache. Rather than reading the data from the L2 cache for the bytes within the line that are not being overwritten, we just keep an individual valid bit for each byte.

Not performing the read-on-allocate saves considerable L2 cache bandwidth by postponing a read-modify-write until the write cache evicts a line. Frequently, by eviction time the entire line has been written, so the write cache can eliminate the read.

The write cache is included in the L2 data cache, and write-cache data can supersede read data from the L2 data cache. We handle this with a byte-merging multiplexer on the incoming L2 cache data bus that can choose either write-cache data or L2 cache data for each byte.
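The byte-validate policy and the byte-merging multiplexer can be sketched together. This is an illustrative model only: the 64-byte line size and all class and method names are assumptions, not details given in the presentation.

```python
LINE_BYTES = 64   # assumed line size for the sketch


class WriteCacheLine:
    """One write-cache line with a valid bit per byte."""

    def __init__(self):
        self.data = bytearray(LINE_BYTES)
        self.valid = [False] * LINE_BYTES

    def store(self, offset, payload):
        # Allocate without reading the L2: only the written bytes become valid.
        for i, byte in enumerate(payload):
            self.data[offset + i] = byte
            self.valid[offset + i] = True

    def fully_written(self):
        # If every byte is valid, eviction needs no read-modify-write.
        return all(self.valid)

    def merge_with_l2(self, l2_line):
        # Byte-merging multiplexer: valid write-cache bytes supersede L2 bytes.
        return bytes(self.data[i] if self.valid[i] else l2_line[i]
                     for i in range(LINE_BYTES))
```

A two-byte store leaves the other 62 bytes invalid, so a read of that line takes them from the L2 data; once stores have covered the whole line, `fully_written` is true and the L2 read can be skipped at eviction.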

Floating point unit

This unit contains the data paths and control logic to execute floating-point and partitioned fixed-point data type instructions.

Three data paths concurrently execute floating-point or graphics instructions, one each per cycle, from the following classes:

-Divide/multiply (single or double precision or partitioned)

-Add/subtract/compare (single or double precision or partitioned)

-An independent division data path, which lets a non-pipelined divide proceed concurrently with the fully pipelined multiplier and adder paths.

To meet the cycle time, latency cycles must be added to the floating-point operations.

With advanced circuit techniques in the floating-point add and multiply units, one added latency cycle is enough.

External memory interface

External memory consists of a large L2 cache built off chip and a main memory built off chip using synchronous DRAMs.

-Size of the L2 cache: 4 or 8 Mbytes

-Latency: 12 clock cycles to supply a 32-byte line to L1

-Tags for the L2 cache are placed on chip to detect an L2 miss early (the L2 cache controller accesses the on-chip tags in parallel with the start of the off-chip SRAM access and provides a way-select signal to a late-select address pin on the off-chip SRAMs).

-The L2 caches are wave-pipelined and operate at 600 MHz.

-The main memory DRAM controller is on chip, which reduces memory latency and scales the memory bandwidth with the number of processors.

-The memory controller supports up to 4 Gbytes of SDRAM organized as four independent banks.

Trap stage in the pipeline

In this architecture the classical stall signal, which freezes the state of the pipeline, is eliminated for performance reasons.

Instead, a trap stage is placed at the end of the pipeline to restore the state when an unexpected event occurs.

Such an event is handled like a trap: the instructions that are in the pipeline are refetched from stage A.

Conclusion

The Sun Microsystems UltraSPARC is one of the advanced RISC microprocessors. It finds many applications in desktops, network systems, and scientific computing machines.

The internal architecture of the UltraSPARC-III has been presented. Various parts of the processor have been examined: instruction issue, execution, and on-chip and external memory.


THANK YOU
