
Page 1: IA-64 Application Architecture Tutorial


IA-64 Application Architecture Tutorial

Thomas E. Willis

IA-64 Architecture and Performance Group

Intel Corporation

[email protected]

Page 2: IA-64 Application Architecture Tutorial


Objectives for This Tutorial

Provide background for some of the architectural decisions

Provide a description of the major features of the IA-64 application architecture
– Provide introduction and overview
– Describe software and performance usage models
– Mention relevant design issues

Show an example of IA-64 feature usage (C → asm)

Page 3: IA-64 Application Architecture Tutorial


Agenda for this Tutorial

IA-64 History and Strategies

IA-64 Application Architecture Overview

C → IA-64 Example

Conclusions and Q & A

Page 4: IA-64 Application Architecture Tutorial

IA-64 Definition History

Two concurrent 64-bit architecture developments
– IAX at Intel from 1991
• Conventional 64-bit RISC
– Wideword at HP Labs from 1987
• Unconventional 64-bit VLIW derivative

IA-64 definition started in 1994
– Extensive participation of Intel and HP architects, compiler writers, micro-architects, logic/circuit designers
– Several customers also participated as definition partners

There are 3 generations of IA-64 microprocessors currently in different stages of design

Page 5: IA-64 Application Architecture Tutorial

IA-64 Strategies I

Extracting parallelism is difficult
– Existing architectures contain limitations that prevent sufficient parallelism on in-order implementations

Strategy
– Allow the compiler to exploit parallelism by removing static scheduling barriers (control and data speculation)
– Enable wider machines through large register files, static dependence specification, static resource allocation

Page 6: IA-64 Application Architecture Tutorial

IA-64 Strategies II

Branches interrupt control flow/scheduling
– Mispredictions limit performance
– Even with perfect branch prediction, small basic blocks of code cannot fully utilize wide machines

Strategies
– Allow compiler to eliminate branches (and increase basic block size) with predication
– Reduce the number and duration of branch mispredicts by using compiler-generated branch hints
– Allow compiler to schedule more than one branch per clock - multiway branch

Page 7: IA-64 Application Architecture Tutorial

IA-64 Strategies III

Memory latency is difficult to hide
– Increasing relative to processor speed, which implies larger cache miss penalties

Strategy
– Allow the compiler to schedule for longer latencies by using control and data speculation
– Explicit compiler control of data movement through an architecturally visible memory hierarchy

Page 8: IA-64 Application Architecture Tutorial

IA-64 Strategies IV

Procedure calls interrupt scheduling/control flow
– Call overhead from saving/restoring registers
– Software modularity is standard

Strategy
– Reduce procedure call/return overhead
• Register Stack
• Register Stack Engine (RSE)
– Provide special support for software modularity

Page 9: IA-64 Application Architecture Tutorial


IA-64 Strategies Summary

Move complexity of resource allocation, scheduling, and parallel execution to compiler

Provide features that enable the compiler to reschedule programs using advanced features (predication, speculation)

Enable wide execution by providing processor implementations that the compiler can take advantage of

Page 10: IA-64 Application Architecture Tutorial

Agenda for this Tutorial

IA-64 History and Strategies

IA-64 Application Architecture Overview

C → IA-64 Example

Conclusions and Q & A

Page 11: IA-64 Application Architecture Tutorial

IA-64 Application Architecture Overview

Application State
Instruction Format and Execution Semantics
Integer, Floating Point, and Multimedia Instructions
Memory
Control Speculation and Data Speculation
Predication
Parallel Compares
Branch Architecture
Register Stack

Page 12: IA-64 Application Architecture Tutorial

Application State

[Register-file diagram] The IA-64 application state comprises:
– 128 general registers gr0–gr127 (64 bits each, plus a 1-bit NaT); gr0 = 0; gr0–gr31 are static, gr32–gr127 are stacked
– 128 floating-point registers fr0–fr127 (82 bits); fr0 = +0.0, fr1 = +1.0
– 64 predicate registers (1 bit each); pr0 = 1
– 8 branch registers br0–br7 (64 bits)
– Instruction Pointer (IP, 64 bits), Current Frame Marker (CFM), User Mask (UM, 6 bits)
– Application registers ar0–ar127 (64 bits; only 27 populated)
– CPU identifiers cpuid0–cpuid<n> and performance monitors pmd0–pmd<m>

Page 13: IA-64 Application Architecture Tutorial

IA-64 Application Architecture Overview

Application State
Instruction Format and Execution Semantics
Integer, Floating Point, and Multimedia Instructions
Memory
Control Speculation and Data Speculation
Predication
Parallel Compares
Branch Architecture
Register Stack

Page 14: IA-64 Application Architecture Tutorial

Instruction Formats: Bundles

[Bundle diagram] A 128-bit bundle holds three 41-bit instructions (Instruction 0, 1, 2) plus a 5-bit template.

Instruction types
– M: Memory
– I: Shifts and multimedia
– A: ALU
– B: Branch
– F: Floating point
– L+X: Long

Template encodes types
– MII, MLX, MMI, MFI, MMF
– Branch: MIB, MMB, MFB, MBB, BBB

Template encodes parallelism
– All come in two flavors: with and without stop at end
– Also, stop in middle: MI_I, M_MI

Template identifies types of instructions in bundle and delineates independent operations (through "stops")
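A sketch (not from the original slides) of how one bundle might be written in a typical IA-64 assembler syntax; the template directive names the slot types, and ";;" marks a stop:

{ .mii
      ld8 r14 = [r32]          // M slot: memory operation
      add r15 = r33, r34       // I slot: ALU operations may occupy I slots
      shl r16 = r35, 3 ;;      // I slot: shift; the stop ends the instruction group
}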

Page 15: IA-64 Application Architecture Tutorial

Execution Semantics I

Traditional architectures have sequential semantics
– The machine must always behave as if the instructions were executed in an unpipelined, sequential fashion
– If a machine actually issues instructions in a different order or issues more than one instruction at a time, it must ensure sequential execution semantics are obeyed

Case 1 – Dependent
add r1 = r2, r3
sub r4 = r1, r2
shl r2 = r4, r8

Case 2 – Independent
add r1 = r2, r3
sub r4 = r11, r21
shl r12 = r14, r8

Page 16: IA-64 Application Architecture Tutorial

Execution Semantics II

IA-64 is an explicitly parallel (EPIC) architecture
– Compiler uses templates with stops (";;") to indicate groups of independent operations known as "instruction groups"
– Hardware need not check for dependent operations within instruction groups
• WAR register dependencies allowed
• Memory operations still require sequential semantics
– Predication can disable dependencies dynamically

Case 1 – Dependent (3 instruction groups)
add r1 = r2, r3 ;;
sub r4 = r1, r2 ;;
shl r12 = r1, r8

Case 2 – Independent (1 instruction group)
add r1 = r2, r3
sub r4 = r11, r21
shl r12 = r14, r8

Case 3 – Predication (2 instruction groups)
(p1) add r1 = r2, r3
(p2) sub r1 = r2, r3 ;;
shl r12 = r1, r8
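As an illustration (register choices are arbitrary, not from the slides), the stop rules allow a WAR dependence inside one instruction group, while a true dependence must follow a stop:

add r1 = r2, r3       // reads r3
mov r3 = r10          // WAR on r3: allowed within the same instruction group
;;
sub r4 = r1, r2       // RAW dependence on r1, so it is placed after the stop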

Page 17: IA-64 Application Architecture Tutorial

IA-64 Application Architecture Overview

Application State
Instruction Format and Execution Semantics
Integer, Floating Point, and Multimedia Instructions
Memory
Control Speculation and Data Speculation
Predication
Parallel Compares
Branch Architecture
Register Stack

Page 18: IA-64 Application Architecture Tutorial


Integer Instructions

Memory – load, store, semaphore, ...

Arithmetic – add, subtract, shladd, ...

Compare – lt, gt, eq, ne, ..., tbit, tnat

Logical – and, and complement, or, exclusive-or

Bit fields – deposit, extract

Shift pair

Character

Shifts – left, right

32-bit support – cmp4, shladdp4

Move – various register file moves

No-ops

Page 19: IA-64 Application Architecture Tutorial


Floating-point Architecture

IEEE 754 compliant

Single, double, double extended (80-bit)

Canonical representation in 82-bit FP registers

Fused multiply-add instruction

128 floating-point registers
– Rotating, not stacked

Load double/single pair

Multiple FP status registers for speculation
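For illustration (a sketch, not taken from the tutorial), the fused multiply-add computes a*b + c with a single rounding; plain multiply and add are pseudo-ops built on it using the constant registers f0 (+0.0) and f1 (+1.0):

fma.d  f6 = f7, f8, f9      // f6 = f7*f8 + f9, double precision, one rounding
fmpy.d f10 = f7, f8         // multiply: f7*f8 + f0
fadd.d f11 = f7, f9         // add: f7*f1 + f9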

Page 20: IA-64 Application Architecture Tutorial

Parallel FP Support

Enable cost-effective 3D graphics platforms

Exploit data parallelism in applications using 32-bit floating-point data
– Most applications' geometry calculations (transforms and lighting) are done with 32-bit floating-point numbers
– Provides 2X increase in computation resources for 32-bit data-parallel floating-point operations

Floating-point registers treated as 2x32-bit single-precision elements
– Full IEEE compliance
• Single, double, double-extended data types, packed-64
– Similar instructions as for scalar floating-point
– Availability of fast divide (non-IEEE)
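A hedged sketch of the packed form: fpma treats each register as a pair of single-precision elements and performs two multiply-adds at once (register choices are arbitrary):

fpma f6 = f7, f8, f9        // high and low 32-bit elements each compute a*b + c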

Page 21: IA-64 Application Architecture Tutorial

Multimedia Support

Audio and video functions typically perform the same operation on arrays of data values

IA-64 provides instructions that treat general registers as 8 x 8, 4 x 16, or 2 x 32 bit elements; three types
– Addition and subtraction (including special purpose forms)
– Left shift, signed right shift, and unsigned right shift
– Pack/unpack to convert between different element sizes

Semantically compatible with IA-32’s MMX Technology™
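An illustrative sketch (not from the slides) of the parallel integer forms, here treating a general register as four 16-bit elements:

padd2     r8 = r9, r10      // four independent 16-bit adds
padd2.sss r11 = r9, r10     // the same adds with signed saturation
psub2     r12 = r9, r10     // four independent 16-bit subtracts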

Page 22: IA-64 Application Architecture Tutorial

IA-64 Application Architecture Overview

Application State
Instruction Format
Execution Semantics
Integer, Floating Point, and Multimedia Instructions
Memory
Control Speculation and Data Speculation
Predication
Parallel Compares
Branch Architecture
Register Stack

Page 23: IA-64 Application Architecture Tutorial

Memory

Byte addressable, accessed with 64-bit pointers

Ordered and unordered memory access instructions

Access granularity and alignment
– 1, 2, 4, 8, 10, 16 bytes
– Instructions are always 16-byte aligned
– Supports big- or little-endian byte order for data accesses
– Alignment on natural boundaries is recommended for best performance
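A brief sketch (registers are placeholders) of ordered vs. unordered accesses: acquire and release completers impose ordering, and mf is a full memory fence:

ld8.acq r1 = [r2]           // ordered load: later accesses cannot be made visible before it
st8     [r3] = r4           // ordinary, unordered store
st8.rel [r5] = r6           // ordered store: made visible after all earlier accesses
mf                          // memory fence between everything before and after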

Page 24: IA-64 Application Architecture Tutorial

Memory Hierarchy Control

Explicit control of cache allocation and deallocation
– Specify levels of the memory hierarchy affected by the access
– Allocation and flush resolution is at least 32 bytes

Allocation
– Allocation hints indicate at which level allocation takes place
• But always implies bringing the data close to the CPU
– Used in load, store, and explicit prefetch instructions

Deallocation and flush
– Invalidates the addressed line in all levels of the cache hierarchy
– Write data back to memory if necessary
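An illustrative sketch of the hint mechanism, assuming a data pointer in r2: lfetch prefetches a line, and temporal-locality completers suggest which cache levels should keep it:

lfetch     [r2]             // prefetch, allocate normally
lfetch.nt1 [r2]             // non-temporal at level 1: bias allocation away from L1
lfetch.nta [r2]             // non-temporal at all levels: minimize cache pollution
ld8.nta r1 = [r2]           // the same hints can qualify ordinary loads and stores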

Page 25: IA-64 Application Architecture Tutorial

IA-64 Application Architecture Overview

Application State
Instruction Format
Execution Semantics
Integer, Floating Point, and Multimedia Instructions
Memory
Control Speculation and Data Speculation
Predication
Parallel Compares
Branch Architecture
Register Stack

Page 26: IA-64 Application Architecture Tutorial

Control and Data Speculation

Two kinds of instructions in IA-64 programs
– Non-speculative instructions – known to be useful/needed
• Would have been executed in the original program
– Speculative instructions – may or may not be used
• Schedule operations before results are known to be needed
• Usually boosts performance, but may degrade
• Heuristics can guide compiler in aggressiveness
• Need profile data for maximum benefit

Two kinds of speculation
– Control and Data

Moving loads up is a key to performance
– Hide increasing memory latency
– Computation chains frequently begin with loads

Page 27: IA-64 Application Architecture Tutorial


Speculation

Control Speculation

Original:

(p1) br.cond

ld8 r1 = [ r2 ]

Transformed:

ld8.s r1 = [ r2 ]

. . .

(p1) br.cond

. . .

chk.s r1, recover

Data Speculation

Original:

st4 [ r3 ] = r7

ld8 r1 = [ r2 ]

Transformed:

ld8.a r1 = [ r2 ]

. . .

st4 [ r3 ] = r7

. . .

chk.a r1, recover

Separates loads into 2 parts: speculative loading of data and conflicts/fault detection

Page 28: IA-64 Application Architecture Tutorial

Control Speculation

Example:
– Suppose br.cond checks if r2 contains a null pointer
– Suppose the code then loads r1 by dereferencing this pointer and then uses the loaded value
– Normally, the compiler cannot reschedule the load before the branch because of the potential fault

Control speculation is ...
– Moving loads (and possibly instructions that use the loaded values) above branches on which their execution is dependent

Three steps...

Traditional Architectures
instr1
instr2 ;;
…
br.cond null ;;
ld8 r1 = [r2] ;;
add r3 = r1, r4
null:
...

Page 29: IA-64 Application Architecture Tutorial

Control Speculation: Step 1

Separate load behavior from exception behavior
– The ld.s, which defers exceptions
– The chk.s, which checks for deferred exceptions

Exception token propagates from ld.s to chk.s
– NaT bits in General Registers, NaTVal (special NaN value) in FP Registers

Before:
instr1
instr2 ;;
…
br.cond null ;;
ld8 r1 = [r2] ;;
add r3 = r1, r4
null:
...

After Step 1:
instr1
instr2 ;;
…
br.cond null ;;
ld8.s r1 = [r2] ;;
chk.s r1, recover
add r3 = r1, r4
null:
...

Page 30: IA-64 Application Architecture Tutorial

Control Speculation: Step 2

Reschedule load and (optionally) use
– Now, ld8.s will defer a fault and set the NaT bit on r1
– The chk.s checks r1's NaT bit and branches/faults if needed

Allows faults to propagate
– NaT bits in General Registers, NaTVal (special NaN value) in FP Registers

Before (Step 1):
instr1
instr2 ;;
…
br.cond null ;;
ld8.s r1 = [r2] ;;
chk.s r1, recover
add r3 = r1, r4
null:
...

After Step 2:
ld8.s r1 = [r2]
instr1
instr2 ;;
…
add r3 = r1, r4
br.cond null ;;
chk.s r1, recover
null:
...

Page 31: IA-64 Application Architecture Tutorial

Control Speculation: Step 3

Add recovery code
– All instructions speculated are repeated in a recovery block
– Will only execute if the chk.s detects a NaT
– Use non-speculative versions of instructions

ld8.s r1 = [r2]
instr1
instr2 ;;
…
add r3 = r1, r4
br.cond null ;;
chk.s r1, recover
back:
...
("null" block elided)

recover:
ld8 r1 = [r2] ;;
add r3 = r1, r4
br back

Page 32: IA-64 Application Architecture Tutorial

Control Speculation Summary

[Summary figure] Original: a load feeding a use. Transform: the load becomes speculative and a chk.s is added after it. Reschedule: the load (and use) move up across the branch, while the chk.s remains at the original position and branches to recovery instruction(s) if needed.

Page 33: IA-64 Application Architecture Tutorial

NaT Propagation

All computation instructions propagate NaTs to reduce the number of checks required

ld8.s r3 = [r9]
ld8.s r4 = [r10] ;;
add r6 = r3, r4 ;;
ld8.s r5 = [r6]
cmp.eq p1, p2 = 6, r7
(p1) br.cond froboz ;;
chk.s r5, recover
back:
sub r7 = r5, r2

Only one check needed in this block!

recover:
ld8 r3 = [r9]
ld8 r4 = [r10] ;;
add r6 = r3, r4 ;;
ld8 r5 = [r6]
br back

froboz:
...

Page 34: IA-64 Application Architecture Tutorial


Exception Deferral

Speculation can create speculative loads whose results are never used – want to avoid taking faults on these instructions

Deferral allows the efficient delay of costly exceptions

NaTs and checks (chk) enable deferral with recovery

Exceptions are taken in the recovery code, when it is known that the result will be used for sure

Page 35: IA-64 Application Architecture Tutorial


IA-64 Support for Control Speculation

65th bit (NaT bit) on each GR indicates if an exception has occurred

Special speculative loads that set the NaT bit if a deferrable exception occurs

Special chk.s instruction that checks the NaT bit and, if it is set, branches to recovery code

Computational instructions propagate NaTs like IEEE NaNs

Compare operations propagate “false” when writing predicates

Page 36: IA-64 Application Architecture Tutorial

Data Speculation

Example:
– Suppose st1 writes into a memory location fully or partially accessed by the ld8 instruction
– If the compiler cannot guarantee that the store and the load addresses never overlap, it cannot reschedule the load before the store
– Such store-to-load dependences are ambiguous memory dependences

Data speculation is ...
– Moving loads (and possibly instructions that use the loaded values) above potentially overlapping stores

Three steps...

Traditional Architectures
instr1
instr2
. . .
st1 [r3] = r4
ld8 r1 = [r2] ;;
add r3 = r1,r4

Page 37: IA-64 Application Architecture Tutorial

Data Speculation: Step 1

Separate load behavior from overlap detection
– The ld8.a performs normal loads and book-keeping
– The chk.a checks "the books" to see if a conflict occurred

"The Books": Advanced Load Address Table (ALAT)
– The ld8.a puts information about an advanced load (address range accessed) into ALAT
– Stores and other memory writers "snoop" ALAT and delete overlapping entries
– The chk.a checks to see if a corresponding entry is in ALAT

Before:
instr1
instr2
. . .
st1 [r3] = r4
ld8 r1 = [r2] ;;
add r3 = r1, r4

After Step 1:
instr1
instr2
. . .
st1 [r3] = r4
ld8.a r1 = [r2] ;;
chk.a r1, recover
add r3 = r1, r4

Page 38: IA-64 Application Architecture Tutorial

Data Speculation: Step 2

Reschedule load and (optionally) use
– The ld8.a allocates an entry in the ALAT
– If the st1 address overlaps with the ld8.a address, then the ALAT entry is removed
– The chk.a checks for a matching entry in the ALAT – if found, speculation was OK; if not found, need to re-execute

Before (Step 1):
instr1
instr2
. . .
st1 [r3] = r4
ld8.a r1 = [r2] ;;
chk.a r1, recover
add r3 = r1, r4

After Step 2:
ld8.a r1 = [r2]
instr1
instr2
add r3 = r1, r4
. . .
st1 [r3] = r4 ;;
chk.a r1, recover

Page 39: IA-64 Application Architecture Tutorial

Data Speculation: Step 3

Add recovery code
– All instructions speculated are repeated in a recovery block
– Will only execute if the chk.a cannot find the ALAT entry
– Use non-speculative versions of instructions

Before (Step 2):
ld8.a r1 = [r2]
instr1
instr2
add r3 = r1, r4
. . .
st1 [r3] = r4 ;;
chk.a r1, recover

After Step 3:
ld8.a r1 = [r2] ;;
instr1
instr2
add r5 = r1, r4 ;;
. . .
st1 [r3] = r4
chk.a r1, recover
back:
...

recover:
ld8 r1 = [r2] ;;
add r5 = r1, r4
br back ;;

Page 40: IA-64 Application Architecture Tutorial

Data Speculation Summary

[Summary figure] Original: a store followed by a load and its use. Transform: the load becomes an advanced load and a chk.a is added after the store. Reschedule: the load (and use) move above the store, while the chk.a remains after the store and branches to recovery instruction(s) if the ALAT entry was invalidated.

Page 41: IA-64 Application Architecture Tutorial

IA-64 Support for Data Speculation

ALAT – hardware structure that contains information about outstanding advanced loads

Instructions
– Advanced loads: ld.a
– Check loads: ld.c
– Advanced load checks: chk.a

Speculative advanced loads
– ld.sa combines the semantics of ld.a and ld.s (i.e., it is a control-speculative advanced load with fault deferral)
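A sketch (using arbitrary registers) of the check-load form: when no other speculated work depends on the loaded value, ld.c can replace the chk.a/recovery pattern by simply reloading if the ALAT entry has been lost:

ld8.a r1 = [r2]             // advanced load; allocates an ALAT entry
// ... other work ...
st4 [r3] = r7               // a possibly conflicting store snoops the ALAT
// ... other work ...
ld8.c.clr r1 = [r2] ;;      // check load: if the ALAT entry is gone, r1 is reloaded here
add r4 = r1, r5             // uses scheduled after the check need no recovery code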

Page 42: IA-64 Application Architecture Tutorial

IA-64 Application Architecture Overview

Application State
Instruction Format
Execution Semantics
Integer, Floating Point, and Multimedia Instructions
Memory
Control Speculation and Data Speculation
Predication
Parallel Compares
Branch Architecture
Register Stack

Page 43: IA-64 Application Architecture Tutorial

Predication Concepts

Branching causes difficult to handle effects
– Instruction stream changes (reduces fetching efficiency)
– Requires branch prediction hardware
– Requires execution of branch instructions
– Potential branch mispredictions

IA-64 provides predication
– Allows some branches to be moved
– Allows some types of safe code motion beyond branches
– Basis for branch architecture and conditional execution

Page 44: IA-64 Application Architecture Tutorial

Predication

Allows instructions to be dynamically turned on or off using a predicate register value

Compare and test instructions write predicates with results of comparison/test
– Most compare/test instructions write the result and its complement
– Example: cmp.eq p1,p2 = r1,0 performs the operation p1 ← (r1 == 0), p2 ← !(r1 == 0)

Most instructions can have a qualifying predicate (qp)
– Example: (p1) add r1 = r2,r3
– If qp is true, the instruction executes normally; otherwise the instruction is squashed (it behaves like a no-op)

64 1-bit predicate registers (true/false), p0 always "true"
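An illustrative if-conversion in the style of the slides (rA, rB, rC are placeholder register names):

Example
if (a > b) c = a; else c = b;

IA-64 Assembly
cmp.gt p1, p2 = rA, rB ;;   // p1 = (a > b), p2 = !(a > b)
(p1) mov rC = rA            // executes only when p1 is true
(p2) mov rC = rB            // executes only when p2 is true; no branch is needed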

Page 45: IA-64 Application Architecture Tutorial

Control Flow Simplification

Predication
– Changes control flow dependences into data dependences
– Removes branches
• Reduce / eliminate mispredictions and branch bubbles
• Exposes parallelism
• Increases pressure on instruction fetch unit

[Figure] Original code: a branch selects between two blocks guarded by complementary conditions. Predicated code: the branch is removed and the instructions of both blocks are guarded by (p1) and (p2).

Page 46: IA-64 Application Architecture Tutorial

Multiple Selects

Original
cmp p1, p2 = ...
(p1) br.cond ...      // taken path assigns r8 = 5
cmp p3, p4 = ...
(p3) br.cond ...      // taken path assigns r8 = 7; fall-through assigns r8 = 10

Transform & Reschedule
cmp p1, p2 = ... ;;
(p2) cmp p3, p4 = … ;;
(p1) r8 = 5
(p3) r8 = 7
(p4) r8 = 10

Page 47: IA-64 Application Architecture Tutorial

Downward Code Motion

[Figure] Original: a store sits on the path guarded by p1, above a join with the !p1 path. Transform: p1 is set to false on the paths that must not store, and the store is predicated as (p1) store. Reschedule: the (p1) store can then be moved downward past the join, because p1 is true along all paths from the original block.

Page 48: IA-64 Application Architecture Tutorial

IA-64 Application Architecture Overview

Application State
Instruction Format
Execution Semantics
Integer, Floating Point, and Multimedia Instructions
Memory
Control Speculation and Data Speculation
Predication
Parallel Compares
Branch Architecture
Register Stack

Page 49: IA-64 Application Architecture Tutorial


Parallel Compares

Parallel compares allow compound conditionals to be executed in a single instruction group.

Example

if ( a && b && c ) { . . . }

IA-64 Assembly

cmp.eq p1 = r0,r0 ;; // initialize p1=1

cmp.ne.and p1 = rA,0

cmp.ne.and p1 = rB,0

cmp.ne.and p1 = rC,0

Page 50: IA-64 Application Architecture Tutorial

Parallel Compares: Height Reduction

Original
cmp.and pA = ...
(pA) br.cond x
cmp.and pB = ...
(pB) br.cond x
cmp.and pC = ...
(pC) br.cond x

Transform & Reschedule
cmp.eq pABC = r0,r0 ;;      // first compare initializes pABC
cmp.and pABC = ...
cmp.and pABC = ...          // cmp.and instructions execute in parallel
cmp.and pABC = ...
(pABC) br.cond x

Page 51: IA-64 Application Architecture Tutorial

Architectural Support

Compare
– Equality: eq, ne
– Relational (only against zero): lt, le, gt, ge, etc.
– Test instructions: tbit, tnat

Allows for both 'and' and 'or' compares
– One side: and
– One side: or
– Both sides of conditional: or.andcm, and.orcm
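For the "or" direction, a sketch analogous to the earlier "&&" example (written in the same informal notation the slides use): the predicate is first cleared, and each or-form compare sets it only if its own condition holds; zero_case is a placeholder label:

if ( a == 0 || b == 0 ) { . . . }

cmp.ne p1 = r0,r0 ;;        // initialize p1 = 0
cmp.eq.or p1 = rA,0         // p1 |= (a == 0)
cmp.eq.or p1 = rB,0 ;;      // p1 |= (b == 0)
(p1) br.cond zero_case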

Page 52: IA-64 Application Architecture Tutorial

IA-64 Application Architecture Overview

Application State
Instruction Format
Execution Semantics
Integer, Floating Point, and Multimedia Instructions
Memory
Control Speculation and Data Speculation
Predication
Parallel Compares
Branch Architecture
Register Stack

Page 53: IA-64 Application Architecture Tutorial

Branch Architecture

IP-relative branches use 21-bit displacement

Branch registers
– 8 registers for indirect jumps and procedure call/return link

Multi-way branches
– Group 1 to 3 branches in a bundle
– Allow multiple bundles to participate

Branches have static and dynamic prediction hints

Page 54: IA-64 Application Architecture Tutorial


Branch Execution

Examples

(p0) br target1 ;; // unconditional

cmp p1,p2 = <cond> // conditional

(p1) br.cond target2 ;;

Compare and branch can be in same instruction group

Compiler-directed static prediction with dynamic prediction
– Reduces false mispredicts due to aliasing
– Frees space in hardware predictor
– Gives hint for dynamic predictor

Page 55: IA-64 Application Architecture Tutorial


Multiway Branches

Allow multiple branch targets to be selected in one instruction group

Example

(p1) br.cond target_1

(p2) br.cond target_2

(p3) br.call b1

Four possible instructions executed next: fall through, target_1, target_2, or address in b1

Page 56: IA-64 Application Architecture Tutorial

Control Height Reduction

Without speculation (3 branch cycles)
ld8 r6 = [ra]
(p1) br exit1
ld8 r7 = [rb]
(p3) br exit2
ld8 r8 = [rc]
(p5) br exit3

IA-64, hoisting loads with speculation
ld8.s r6 = [ra]
ld8.s r7 = [rb]
ld8.s r8 = [rc] ;;
chk r6, rec1
(p1) br exit1
chk r7, rec2
(p3) br exit2
chk r8, rec3
(p5) br exit3

IA-64, with predicated checks and a multiway branch (2 total cycles, 1 branch cycle)
ld8.s r6 = [ra]
ld8.s r7 = [rb]
ld8.s r8 = [rc] ;;
chk r6, rec1
(p2) chk r7, rec2
(p4) chk r8, rec3
(p1) br exit1
(p3) br exit2
(p5) br exit3

Page 57: IA-64 Application Architecture Tutorial

Branch Architecture Conclusions

Odds & Ends
– Multiple branches per clock is a natural side-effect of speculation
– Allows fast selection of multiple branch targets
– Branch prediction for both single and multiple branches is important for good performance
– Compiler profiling can help facilitate the use of hints
– Hints may reduce needed size/functionality of hardware predictors
– Works in conjunction with control speculation, data speculation, predication, and parallel compares

Page 58: IA-64 Application Architecture Tutorial

IA-64 Application Architecture Overview

Application State
Instruction Format
Execution Semantics
Integer, Floating Point, and Multimedia Instructions
Memory
Control Speculation and Data Speculation
Predication
Parallel Compares
Branch Architecture
Register Stack

Page 59: IA-64 Application Architecture Tutorial

Register Stack

IA-64 provides hardware support for stack frames

Motivation
– Automatic save and restore of GRs on procedure call and return
– Cache traffic reduction
– Latency hiding of register spill and fill
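A sketch of how a non-leaf procedure might use the register stack; the frame sizes, register numbers, and the callee name other_routine are illustrative, not from the tutorial:

callee:
      alloc r34 = ar.pfs, 2, 2, 1, 0   // 2 inputs (r32-r33), 2 locals (r34-r35), 1 output (r36)
      mov r35 = b0                     // save the return link in a local
      add r36 = r32, r33               // set up the outgoing argument
      br.call b0 = other_routine ;;    // on the call, r36 is renamed to the callee's r32
      mov ar.pfs = r34                 // restore the caller's previous function state
      mov b0 = r35 ;;
      br.ret b0                        // the frame is cut back; the RSE refills any spilled locals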

Page 60: IA-64 Application Architecture Tutorial

General Registers

[Figure] gr0–gr31 are static; gr32–gr127 are stacked.

Page 61: IA-64 Application Architecture Tutorial

GR Stack Frame

[Figure] The stacked registers beginning at gr32 form the current frame: locals (which include the inputs) followed by outputs. The Current Frame Marker (CFM) records the size of frame (sof) and size of locals (sol); accesses beyond sof are illegal.

Page 62: IA-64 Application Architecture Tutorial

GR Stack Frame – Example

[Figure] Example frame (virtual register numbers): locals r32–r45 (sol = 14, includes the inputs) and outputs r46–r52, giving sof = 21. CFM holds sol = 14, sof = 21; PFM is unknown (??).

Page 63: IA-64 Application Architecture Tutorial

GR Stack Frame – Call

[Figure] Procedure A has sol = 14, sof = 21 (locals r32–r45, outputs r46–r52). On br.call, A's CFM is saved in PFM and Procedure B gets a new frame consisting only of A's output registers, renamed to start at r32: sol = 0, sof = 7 (r32–r38). Note the renaming!

Page 64: IA-64 Application Architecture Tutorial

GR Stack Frame – Allocate

[Figure] Procedure B executes an alloc to resize its frame: in this example sol = 16 and sof = 19, giving locals r32–r47 (including the inputs r32–r38) and outputs r48–r50. Procedure A's frame (sol = 14, sof = 21) remains recorded in PFM.

Page 65: IA-64 Application Architecture Tutorial

GR Stack Frame – Return

[Figure] On br.ret, Procedure B's frame is discarded and Procedure A's frame is restored from PFM: sol = 14, sof = 21, with locals r32–r45 and outputs r46–r52. Note the renaming back to A's register numbering!

Page 66: IA-64 Application Architecture Tutorial


Register Stack Engine / Backing Store

The register stack has finite physical depth (number of physical registers is 96) but infinite virtual depth

“Overflow” – When the processor runs out of physical registers (during an alloc), the Register Stack Engine (RSE) automatically spills registers to memory (backing store)

“Underflow” – When a procedure whose register stack was spilled to backing store returns (br.ret), the RSE restores the registers from the backing store

Thus, programmer does not need to explicitly save or restore registers due to procedure calls (saves code)

Page 67: IA-64 Application Architecture Tutorial

Register Stack Engine: Backing Store

[Figure] Procedure A calls Procedure B calls Procedure C. Only the active frame (C's, of size sof C) needs to live in the physical stacked registers; the locals of A and B (sol A, sol B) can be spilled by RSE loads/stores to the backing store in memory at increasing addresses. BSPSTORE points to the next location the RSE will store to; BSP points to the backing-store address of gr32 of the current frame.

RSE traffic need only occur on overflow & underflow

Page 68: IA-64 Application Architecture Tutorial

Agenda for this Tutorial

IA-64 History and Strategies

IA-64 Application Architecture Overview

C → IA-64 Example

Conclusions and Q & A

Page 69: IA-64 Application Architecture Tutorial

A C to IA-64 Example

Compile ChkGetChunk function from Spec95 vortex

Assumptions for this example
– Abstract machine model
– Unlimited instruction issue (execution) resources
– Loads have 2-cycle latency to first-level cache
– All other instructions have 1-cycle latency
– Elide speculation recovery code

Hold on...

Page 70: IA-64 Application Architecture Tutorial

Synthesis: ChkGetChunk Code

if (((Theory->Flags[ChunkNum] & 0x0008)) &&
    ((Theory->Flags[ChunkNum] & 0x0040)) &&
    (*(Theory->ChunkAddr[ChunkNum] - 28)) == SizeOfUnit) {
    StackPtr = (*(Theory->ChunkAddr[ChunkNum] - 20));
    if (Index >= StackPtr) {
        if (SetGetSwi) {
            *Status = -10009;
        } else {
            Mem_DumpChunkChunk (0, ChunkNum);
            *Status = 1005;
        }
    }
} else {
    if ((*(Theory->ChunkAddr[ChunkNum] - 28)) != SizeOfUnit) {
        *Status = 1003;
    } else {
        *Status = 1004;
    }
    Mem_DumpChunkChunk (0, ChunkNum);
}
return((Test = *Status==0 ? True : Ut_PrintErr (F,Z,*Status)));

Page 71: IA-64 Application Architecture Tutorial

Synthesis: Initial CFG

[CFG figure] The initial control-flow graph, with each block annotated with its cycle count (# = cycle count, C = call latency) and controlled by predicates p1–p15. Longest path: 25 cycles / 2C + 31.

Page 72: IA-64 Application Architecture Tutorial

Synthesis: Red Blocks

Original red blocks (B1 = 8 cycles, B2 = 2 cycles, B3 = 8 cycles):
ld8 rT = [ &rT ] 0:B1
ld8 rCN = [ &rCN ] ;; 0:B1
add rAdF = rT,8 2:B1
shladd rOff = rCN,4 ;; 2:B1
add rAdFs = rOff,rAdF ;; 3:B1
ld8 rFD = [ rAdFs ] ;; 4:B1
and rMask8 = rFD,0x8 ;; 6:B1
cmp.eq p1,p2 = rMask8,0 7:B1
(p1) br.cond GREEN 7:B1
and rMask4 = rFD,0x40 ;; 0:B2
cmp.eq p3,p4 = rMask4,0 1:B2
(p3) br.cond GREEN 1:B2
add rAdCA = 16,rT ;; 0:B3
add rAdCAs = rAdCA,rOff ;; 1:B3
ld8 rTmp = [ rAdCAs ] ;; 2:B3
sub rTmp2 = rTmp,28 ;; 4:B3
ld8 rDR = [ rTmp2 ] ;; 6:B3
cmp.eq p5,p6 = rDR,SOU 7:B3
(p5) br.cond GREEN 7:B3

Combined red block (10 cycles):
ld8 rT = [ &rT ] 0:B1
ld8 rCN = [ &rCN ] ;; 0:B1
add rAdCA = 16,rT 2:B3
add rAdF = rT,8 2:B1
shladd rOff = rCN,4 ;; 2:B1
add rAdCAs = rAdCA,rOff 3:B3
add rAdFs = rOff,rAdF ;; 3:B1
ld8 rFD = [ rAdFs ] 4:B1
ld8.s rTmp = [ rAdCAs ] ;; 4:B3
cmp.ne p1,p2 = r0,r0 6:??
and rMask4 = rFD,0x40 6:B2
and rMask8 = rFD,0x8 6:B1
sub rTmp2 = rTmp,28 ;; 6:B3
// cmp.eoa = cmp.eq.or.andcm
cmp.eoa p1,p2 = rMask4,0 7:B2
cmp.eoa p1,p2 = rMask8,0 7:B1
ld8.s rDR = [ rTmp2 ] ;; 7:B3
cmp.eoa p1,p2 = rDR,SOU 9:B3
(p1) br.cond GREEN

Steps:
1) Speculate the ld8 instructions, reschedule, use parallel compares

Page 73: IA-64 Application Architecture Tutorial

Synthesis: Updated CFG

[CFG figure] The three red blocks have been merged into a single 10-cycle block; the remaining blocks keep their cycle counts (C = call latency) and predicates p1–p15.

Page 74: IA-64 Application Architecture Tutorial

Synthesis: Green Blocks

Original green blocks:
cmp.eq p8,p9 = rDR,SOU 0:B1
if (p8) { 0:B1
    ld8 rStat = [&Status] 0:B2
    mov rVal = 1003 0:B2
    mov rValp8 = 1003 ;; 0:B2
    st [rStat] = rValp8 2:B2
} else {
    ld8 rStat = [&Status] 0:B3
    mov rVal = 1004 0:B3
    mov rValp9 = 1004 ;; 0:B3
    st [rStat] = rValp9 2:B3
}
br.call MDC(0,CN) 0:B4
br PURPLE 1:B4

Transformed green block:
(p8) st [ rStat ] = rValp8 0:B2
(p9) st [ rStat ] = rValp9 0:B3
(p8) mov rVal = 1003 0:B2
(p9) mov rVal = 1004 0:B3
br.call MDC(0,CN) 0:B4
br PURPLE 1:B4

Steps:
1) Speculate cmp.eq instruction into red block's cycle 9
2) Speculate ld8 instructions into red block's cycle 1
3) Copy and speculate mov instructions into red block's cycle 0
4) Predicate both sides of the conditional

Page 75: IA-64 Application Architecture Tutorial

Synthesis: Updated CFG

[CFG figure] The green blocks have been folded away; the red block now also computes p8 and p9. The remaining blocks keep their cycle counts (C = call latency) and predicates p10–p15.

Page 76: IA-64 Application Architecture Tutorial

Synthesis: Purple Blocks

Original purple blocks:
cmp.ne p10,p11 = rVal,0 0:B1
if (p10) { 0:B1
    mov rTest = 1 0:B2
} else {
    br.call PE(F,Z,rVal) 0:B3
    mov rTest = r8 1:B3
}
mov r8 = rTest 0:B4
br.ret 0:B4

Transformed purple block:
cmp.ne p10,p11 = rVal,0 0:B1
(p11) br.call PE(F,Z,rVal) ;; 0:B3
(p10) mov r8 = 1 1:B2
br.ret 1:B4

Steps:
1) Replace rTest with r8
2) Predicate both sides of conditional

Page 77: IA-64 Application Architecture Tutorial

Synthesis: Updated CFG

[CFG figure] The purple blocks have been merged; the red block computes p8 and p9, and the purple block computes p10 and p11. The remaining blocks keep their cycle counts (C = call latency) and predicates p12–p15.

Page 78: IA-64 Application Architecture Tutorial

Synthesis: Blue Blocks

Original blue blocks:
ld8 rSGS = [ &SGS ] ;; 0:B1
cmp.eq p12,p13 = rSGS,0 2:B1
if ( p12 ) { 2:B1
    ld8 rStat = [ &Status ] 0:B2
    mov rValp12 = -10009 ;; 0:B2
    st8 [ rStat ] = rValp12 2:B2
    mov rVal = -10009 2:B2
} else {
    br.call MDC(0,CN) 0:B3
    ld8 rStat = [ &Status ] 1:B3
    mov rValp13 = 1005 ;; 1:B3
    st8 [ rStat ] = rValp13 3:B3
    mov rVal = 1005 3:B3
}
br PURPLE 3:B3

Transformed blue block:
(p13) br.call MDC(0,CN) 0:B3
(p12) st [rStat] = rValp12 1:B2
(p13) st [rStat] = rValp13 1:B3
(p12) rVal = -10009 1:B2
(p13) rVal = 1005 1:B3
br PURPLE 1:B4

Steps:
1) Speculate all the ld8 instructions into red block's cycle 1
2) Speculate the cmp.eq instruction into red block's cycle 3
3) Copy and speculate the mov rVal instruction into red block's cycle 0
4) Predicate both sides of the conditional

Page 79: IA-64 Application Architecture Tutorial

Synthesis: Updated CFG

[CFG figure] The blue blocks have been folded away; the red block now computes p8, p9, p12, and p13, and the purple block computes p10 and p11. The remaining blocks keep their cycle counts (C = call latency) and predicates p14–p15.

Page 80: IA-64 Application Architecture Tutorial

Synthesis: Turquoise Block I

Original turquoise block:
sub rTmp3 = rTmp2,20 ;; 0:B1
ld8 rDR2 = [ rTmp3 ] ;; 1:B1
st8 [ &StkPtr ] = rDR2 3:B1
cmp.ge p14,p15 = rDR2,rIdx 3:B1
(p14) br.cond BLUE 3:B1
br PURPLE 3:B1

Transformed (turquoise concatenated with the blue block):
(p2) st8 [&StkPtr] = rDR2 0:B1
(p14) br.cond BLUE 0:B1
br PURPLE 0:B1
(p13) br.call MDC(0,CN) 0:B3
(p12) st [rStat] = rValp12 1:B2
(p13) st [rStat] = rValp13 1:B3
(p12) rVal = -10009 1:B2
(p13) rVal = 1005 1:B3
br PURPLE 1:B4

Steps:
1) Speculate the sub, ld8, and cmp.ge into the red block (in cycles 6, 7, and 9, respectively)
2) Predicate the st8 with p2
3) Concatenate with the blue block

Continues on next slide...

Page 81: IA-64 Application Architecture Tutorial

Synthesis: Turquoise Block II

Before:
(p2) st8 [&StkPtr] = rDR2 0:B1
(p14) br.cond BLUE 0:B1
br PURPLE 0:B1
(p13) br.call MDC(0,CN) 0:B3
(p12) st [rStat] = rValp12 1:B2
(p13) st [rStat] = rValp13 1:B3
(p12) rVal = -10009 1:B2
(p13) rVal = 1005 1:B3
br PURPLE 1:B4

After:
(p2) st8 [&StkPtr] = rDR2 0:B1
(p13) br.call MDC(0,CN) 0:B3
(p12) st [rStat] = rValp12 1:B2
(p13) st [rStat] = rValp13 1:B3
(p12) rVal = -10009 1:B2
(p13) rVal = 1005 1:B3
br PURPLE 1:B4

Steps:
4) Qualify p13 and p12 (now in red block) so they are only true if both p2 and p14 are true (use parallel and-compares) by either:
– Adding four parallel compares to red block
– Lengthening red block by one cycle
5) Now it is safe to remove the first two branches to purple and blue

Page 82: IA-64 Application Architecture Tutorial

Synthesis: Updated CFG

[CFG figure] The turquoise block has been merged in. The red block now computes p1, p2, p8, p9, p12 & p2 & p14, and p13 & p2 & p14; the purple block computes p10 and p11. (C = call latency.)

Page 83: IA-64 Application Architecture Tutorial

Synthesis: Final Steps

Blue/turquoise block:
(p2) st8 [&StkPtr] = rDR2 0:B1
(p13) br.call MDC(0,CN) 0:B3
(p12) st [rStat] = rValp12 1:B2
(p13) st [rStat] = rValp13 1:B3
(p12) rVal = -10009 1:B2
(p13) rVal = 1005 1:B3
br PURPLE 1:B4

Green block:
(p8) st [rStat] = rValp8 0:B2
(p9) st [rStat] = rValp9 0:B3
(p8) rVal = 1003 0:B2
(p9) rVal = 1004 0:B3
(p1) br.call MDC(0,CN) 0:B4
br PURPLE 1:B4

Steps:
1) Note that p2 is the branch predicate for the blue / turquoise block and that p12 and p13 are qualified with p2 already
2) Predicate br.call MDC(0,CN) with p1
3) If we further qualify p8 and p9 with p1 (the "green branch") in the red block, then the green and blue instructions are guaranteed to be independent! This can be done by either:
– Adding three parallel compares to red block
– Lengthening red block by one cycle

Page 84: IA-64 Application Architecture Tutorial

Synthesis: Final CFG

[CFG figure] Final control-flow graph after all transformations. The red block computes p1 & p8, p1 & p9, p12 & p2 & p14, and p13 & p2 & p14. The schedule improves from 25 cycles / 2C + 31 for the initial CFG to 12 cycles / 2C + 12 (C = call latency).

Page 85: IA-64 Application Architecture Tutorial

Agenda for this Tutorial

IA-64 History and Strategies

IA-64 Application Architecture Overview

C → IA-64 Example

Conclusions and Q & A

Page 86: IA-64 Application Architecture Tutorial

Conclusions

Application State
– Large register files

Instruction Format and Execution Semantics
– Explicitly parallel; compiler specifies independent operations

Integer, Floating Point, and Multimedia Instructions
– Variety of instructions

Control Speculation and Data Speculation
– Support aggressive code motion

Predication
– Converts control-flow into data dependences

Register Stack
– Supports procedure call and return

Page 87: IA-64 Application Architecture Tutorial

For More Information

Consult the IA-64 Application Developer's Guide at
http://developer.intel.com/design/ia64/downloads/adag.htm

Q & A

Page 88: IA-64 Application Architecture Tutorial


Backup Slides

Page 89: IA-64 Application Architecture Tutorial

Application State

Directly accessible CPU state
– 128 x 65-bit General registers (GR)
– 128 x 82-bit Floating-point registers (FR)
– 64 x 1-bit Predicate registers (PR)
– 8 x 64-bit Branch registers (BR)

Indirectly accessible CPU state
– Current Frame Marker (CFM)
– Instruction Pointer (IP)

Control and Status registers
– 19 Application registers (AR)
– User Mask (UM)
– CPU Identifiers (CPUID)
– Performance Monitors (PMC, PMD)

Memory

Page 90: IA-64 Application Architecture Tutorial

Instruction Formats: Instructions

Major opcode (4 bits)

Minor opcode

Immediate operands (8-22 bits)

Register result identifier(s) (6 or 7 bits)

Register operand identifiers (7 bits)

Qualifying predicates (6 bits)
– A few instructions do not have a QP

41-bit instruction layout: Major Op (4 bits) | Minor Op or Immediate (10 bits) | Register ID (7 bits) | Register ID (7 bits) | Register ID (7 bits) | Qualifying Pred (6 bits)

Page 91: IA-64 Application Architecture Tutorial

Instruction Interactions with RSE

On br.call...
– Set PFS.pfm to CFM
– Set sof to sof - sol
– Set sol, sor, and rrb to 0
– Create new frame with only output registers

On alloc...
– Resize current frame by setting sof to i + l + o, and sol to i + l
– Save PFS to a GR

On mov to PFS...
– Restore PFS from a GR

On br.ret...
– Restore CFM from PFS.pfm
– Restore local registers for previous frame

Page 92: IA-64 Application Architecture Tutorial

Register Stack: Explicit Management

flushrs
– Flushes all frames except for the active frame to Backing Store
– Stalls execution if necessary

loadrs
– Loads programmable amount of memory below BSP into physical registers
– Stalls execution if necessary
– Amount of memory to be loaded is set in RSC.loadrs

cover
– Creates a new frame with size zero (sof=sol=0) above current
– Saves CFM in IFM
– Return from interrupt, rfi, restores CFM from IFM

Page 93: IA-64 Application Architecture Tutorial

Register Stack Engine: RSE Control

RSC: Register Stack Configuration Register (AR16)

Mode hints
– Lazy or synchronous operation
– Load intensive or load asynchronous
– Store intensive or store asynchronous
– Eager or asynchronous

Privilege level, endian mode
– RSE operation depends on data in register stack
– Not on the mode in PSR

Tearpoint distance (loadrs)

Page 94: IA-64 Application Architecture Tutorial

Register Stack Engine: State

Registers
– 96 x 65-bit virtual registers
– at least 96 physical registers
– memory backing store

Backing Store Pointer, BSP
– Points to the Backing Store address of GR32 of the current frame

BSPStore
– Points to the memory location for the next RSE store

RNAT
– RSE NaT Collection register (64 bits)

RSC
– RSE Control Register

Page 95: IA-64 Application Architecture Tutorial

Register Stack Engine: NaT bits

RSE responsible for saving NaT bits
– Spilled/filled in groups of 63
– Every 63 register saves/restores, NaTs are collected in the RNaT register

When BSPStore{8:3} == 111111
– RNaT register is stored
– Bit 0 of RNAT corresponds to earliest stored register
– Bit 62 of RNAT corresponds to last stored register
– Bit 63 is always zero

[Figure] The backing store in memory alternates 63 stacked general registers (8 bytes each) with one 8-byte NaT-collection word, stored when BSPStore{8:3} reaches 111111.