NCI Report: Zephyr

PLDI NCI Tutorial

University of Virginia

Princeton University

6/16/2000

Zephyr Goals

• Goal
– Deliver high-quality, language-neutral tools for rapidly constructing compilers for experimental computing systems research

• How
– Provide specification languages and processors to automatically generate key compiler components
• Don’t write code, write specifications!


Zephyr Compilers

[Figure: the SUIF and Zephyr infrastructures. Front ends shown: EDG C++, Java, and lcc. SUIF and MachSUIF are connected to VPO by the SUIF-to-VPO bridge. Targets shown: Alpha, Sparc, MIPS, and X86. Optimizations shown: interprocedural analysis, parallelization and locality opts, object-oriented opts, scheduling, register allocation, instruction selection, code motion, memory access coalescing, induction variable elimination, CSE, loop unrolling, and inlining.]


Zephyr Building Blocks

• ASDL: Abstract Syntax Description Language

• VPO: Very Portable Optimizer

• CSDL: Computer System Description Language


ASDL: Abstract Syntax Description Language

[Figure: compiler pipeline. Lexer → tokens → Parser → AST → Semantic Analysis → AST → Translate → IR → OPT1 → IR → ... → OPTn → IR → CodeGen. A glue generator reads a glue description and produces the AST and IR glue.]


ASDL

• ASDL makes it easy to communicate complex recursive data structures

• ASDL and its tools provide
– Concise descriptions of tree-like structures, including ASTs and compiler intermediate representations (IRs)

– Automatic generation of data structure implementations and pickling functions for C, C++, Java, Standard ML, and Haskell.

– Graphical browsing and editing of data structures on disk.
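
To make the idea of generated data structures and pickling functions concrete, here is a minimal hand-written sketch in C. It is not ASDL output and does not use ASDL's pickle format; the type `exp`, the helpers `mk_num`, `mk_add`, `pickle_exp`, `unpickle_exp`, and `eval_exp`, and the byte layout (one tag byte, fixed-width little-endian integers) are all assumptions made for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical tree type: exp = Num(int value) | Add(exp lhs, exp rhs) */
typedef enum { EXP_NUM = 1, EXP_ADD = 2 } exp_tag;

typedef struct exp {
    exp_tag tag;
    int32_t value;            /* used when tag == EXP_NUM */
    struct exp *lhs, *rhs;    /* used when tag == EXP_ADD */
} exp;

exp *mk_num(int32_t v) {
    exp *e = calloc(1, sizeof *e);
    e->tag = EXP_NUM;
    e->value = v;
    return e;
}

exp *mk_add(exp *l, exp *r) {
    exp *e = calloc(1, sizeof *e);
    e->tag = EXP_ADD;
    e->lhs = l;
    e->rhs = r;
    return e;
}

/* Pickle in prefix order: one tag byte, then the fields.
   Returns the number of bytes written to buf (assumed large enough). */
size_t pickle_exp(const exp *e, uint8_t *buf) {
    size_t n = 0;
    buf[n++] = (uint8_t)e->tag;
    if (e->tag == EXP_NUM) {
        uint32_t v = (uint32_t)e->value;
        for (int i = 0; i < 4; i++)            /* little-endian, fixed width */
            buf[n++] = (uint8_t)(v >> (8 * i));
    } else {
        n += pickle_exp(e->lhs, buf + n);
        n += pickle_exp(e->rhs, buf + n);
    }
    return n;
}

/* Unpickle: rebuild the tree, advancing *pos past the bytes consumed. */
exp *unpickle_exp(const uint8_t *buf, size_t *pos) {
    exp_tag tag = (exp_tag)buf[(*pos)++];
    if (tag == EXP_NUM) {
        uint32_t v = 0;
        for (int i = 0; i < 4; i++)
            v |= (uint32_t)buf[(*pos)++] << (8 * i);
        return mk_num((int32_t)v);
    }
    exp *l = unpickle_exp(buf, pos);
    exp *r = unpickle_exp(buf, pos);
    return mk_add(l, r);
}

/* Small consumer, to show that a round-tripped tree is still usable. */
int eval_exp(const exp *e) {
    return e->tag == EXP_NUM ? e->value : eval_exp(e->lhs) + eval_exp(e->rhs);
}
```

The point of a tool like ASDL is that code of this shape is produced from a short description rather than written by hand for each language.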


ASDL

• For more information about ASDL see:
– Give reference here
– Give URL here


VPO: Very Portable Optimizer

• VPO is a retargetable optimizer that operates on a low-level, machine-independent representation called RTLs (register transfer lists)

• VPO is retargeted by providing a machine description (MD) of the target machine, and revising a few machine-dependent routines

• VPO is small, easily extended, and extremely effective


History Lesson

• PO developed in 1981
– Pioneered use of RTLs
– Demonstrated ability to do optimizations on low-level representation

• Development split in 1982
– gcc development
• Richard Stallman and Len Tower
– VPO development
• Many people at UVa and a few industrial labs

[Figure: PO splits into VPO and gcc]


Register Transfer Lists

• Based on Bell and Newell's ISP notation
• Machine-independent representation of a machine-dependent operation
• Algorithms that manipulate RTLs are machine-independent


Register Transfer Lists

• While assembly language notations may vary, RTLs are very similar across architectures

Example
Assembly            Machine
add %o1,%o2,%o2     SPARC
addu $10,$10,$9     MIPS
ar 10,9             IBM

In RTL, each operation would be represented as
r[10] = r[10] + r[9];


RTLs

• The form of RTLs is fixed
• dst = src ; dst = src ; dst = src …
– The individual register transfers are performed in parallel
– Example
• r[1] = r[1] + r[2] ; NZ = r[1] + r[2] ? 0
– VPO provides machine-independent primitives for operating on and manipulating RTLs
• Obtain the sources and destinations
• Obtain the memory locations read and written
• Obtain the type of instruction (arithmetic, branch, control transfer, etc.)
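
The parallel semantics of the transfers within one RTL can be sketched in C as follows. This is a toy model, not VPO code: the `transfer` struct, `exec_rtl`, and the restriction to additions of two registers are assumptions for illustration. All sources are read from the machine state before the RTL executes, and only then are the destinations written.

```c
#include <stddef.h>

#define NREGS 8
#define MAX_TRANSFERS 8

/* One register transfer: r[dst] = r[src1] + r[src2]. Hypothetical, add-only. */
typedef struct { int dst, src1, src2; } transfer;

/* Execute the n transfers of one RTL in parallel (n <= MAX_TRANSFERS):
   every source is read from the state before the RTL, then all
   destinations are written. */
void exec_rtl(int r[NREGS], const transfer *t, size_t n) {
    int result[MAX_TRANSFERS];
    for (size_t i = 0; i < n; i++)
        result[i] = r[t[i].src1] + r[t[i].src2];
    for (size_t i = 0; i < n; i++)
        r[t[i].dst] = result[i];
}
```

With r[1] = 5 and r[2] = 7, the two transfers r[1] = r[1] + r[2] ; r[3] = r[1] + r[2] both produce 12, because the second transfer sees the old value of r[1]. Executing them one after the other would give r[3] = 19 instead.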


RTLs

• Think of RTL as a machine-independent assembly language
– For a machine X, each RTLx describes an instruction in X’s instruction set (may be a synthetic instruction)
– RTLx should specify
• instruction’s inputs and outputs
• the transformation the instruction makes on the machine state
– VPO uses this information to compute a dataflow graph


Compilation with VPO

[Figure: Source Code → Front and Middle Ends → RTL → VPO Mach → Machine Code]

You supply the front end and a simple code generator, we supply an optimizing back end


Generating RTLx

• Translate IL ops to semantically equivalent sequences of instructions for the target machine
– Generate RTL representation of instructions, not assembly language
– Do not worry about code quality
• Perform naïve, straightforward translation
• Expose all computations (even effective address computations) to VPO
• Use virtual or pseudo registers for temporaries
• VPO handles activation record and data placement


Generating RTLx

The C code
K = I + 1;

IL tree
= <int,32>
  ADDR K <local,32>
  + <int,32>
    @ <int,32>
      ADDR I <local,32>
    CON 1 <int,32>

IL            SPARC RTL
ADDR int K    r[33]=r[14]+K.;
ADDR int I    r[34]=r[14]+I.;
@ int         r[35]=M[r[34]];      r[34]
CON int 1     r[36]=1;
+ int         r[37]=r[35]+r[36];   r[35]:r[36]
= int         M[r[33]]=r[37];      r[33]:r[37]


VPO design rationale

• All "traditional" optimizations performed at the machine-level on a single representation: RTL
– most optimizations are machine-dependent
– better code is produced
– instruction selection can be performed on demand
– avoids phase ordering problems
– simplifies implementation of optimizations
– easier to accommodate emerging architectures
– "plug and play" structure


RTLs in VPO

• VPO optimization algorithm
– repeat
    apply code-improving transformation
  until fixed-point reached or exhausted registers

• Maintaining two invariants
– Semantic invariant (S)
• Observable behavior of program unchanged (according to RTL semantics)
– Machine invariant (M)
• Every RTL equivalent to one machine instruction


VPO code improvements

• Each code-improving transformation is
– machine-level, but
– machine-independent

• Any semantics-preserving transformation is OK

• Preserve machine invariant (M) using machine description
– for each new RTL produced, ask MD if OK
– if any is not a target machine instruction, roll back transformation
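
The repeat-until-fixed-point loop with rollback can be sketched as below. This is a toy in C, not VPO's implementation: the `rtl` struct, `md_is_legal`, `try_combine`, and `improve` are hypothetical names, the only transformation is merging two adjacent constant additions to the same register, and the legality rule (the immediate must fit a signed 13-bit field) is an assumption chosen for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy RTL: r[reg] = r[reg] + imm. */
typedef struct { int reg; int imm; } rtl;

/* Stand-in for asking the machine description: legal if the immediate
   fits a signed 13-bit field (an assumption chosen for this example). */
bool md_is_legal(const rtl *x) { return x->imm >= -4096 && x->imm <= 4095; }

/* Try to combine code[i] and code[i+1]; keep the result only if legal. */
static bool try_combine(rtl *code, size_t *n, size_t i) {
    if (code[i].reg != code[i + 1].reg) return false;
    rtl merged = { code[i].reg, code[i].imm + code[i + 1].imm };
    if (!md_is_legal(&merged)) return false;   /* roll back: code untouched */
    code[i] = merged;
    for (size_t j = i + 1; j + 1 < *n; j++) code[j] = code[j + 1];
    (*n)--;
    return true;
}

/* Repeat until no transformation applies (a fixed point).
   Returns the new number of RTLs. */
size_t improve(rtl *code, size_t n) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (size_t i = 0; i + 1 < n; i++)
            if (try_combine(code, &n, i)) { changed = true; break; }
    }
    return n;
}
```

The transformation itself never inspects the target; only `md_is_legal` does, which mirrors the split between machine-independent transformations and the machine description.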


Code improvement catalog

• Register assignment and allocation

• Common subexpression elimination

• Induction variable elimination

• Code motion

• Constant propagation

• Copy propagation

• Memory access coalescing

• Recurrence detection

• Instruction scheduling

• Dead code elimination

• Constant folding

• Loop unrolling

• Branch minimization

• Evaluation order determination


VPO Optimizations

• Common subexpression elimination
– Davidson, J. W. and Fraser, C. W., ‘Eliminating Redundant Object Code’, in Conference Record of the Ninth Annual ACM Symposium on Principles of Programming Languages, January 1982, pp. 128–132.

• Evaluation order determination
– Davidson, J. W., ‘A Retargetable Instruction Reorganizer’, in Proceedings of the SIGPLAN '86 Symposium on Compiler Construction, 21(7), June 1986, pp. 234–241.


VPO Optimizations

• Link-time optimization
– Benitez, M. E. and Davidson, J. W., ‘A Portable Global Optimizer and Linker’, in Proceedings of the SIGPLAN '88 Symposium on Programming Language Design and Implementation, June 1988, pp. 329–338.

• Memory access coalescing
– Davidson, J. W. and Jinturkar, S., ‘Memory Access Coalescing: A Technique for Eliminating Redundant Memory Accesses’, in Proceedings of the SIGPLAN '94 Symposium on Programming Language Design and Implementation, Orlando, FL, June 1994, pp. 186–195.


VPO Optimizations

• Code motion
– Benitez, M. E. and Davidson, J. W., ‘The Advantages of Machine-Dependent Global Optimization’, in Proceedings of the 1994 Conference on Programming Languages and Systems Architectures, Zurich, Switzerland, March 1994, pp. 105–124.

• Loop unrolling
– Jinturkar, S. and Davidson, J. W., ‘Improving Instruction-level Parallelism by Loop Unrolling and Dynamic Memory Disambiguation’, in Proceedings of the 28th Annual IEEE/ACM International Symposium on Microarchitecture, Ann Arbor, MI, November 1995, pp. 125–132.


VPO Optimizations

• Branch minimization
– Mueller, F. and Whalley, D. B., ‘Avoiding Conditional Branches by Code Replication’, in Proceedings of the SIGPLAN '95 Conference on Programming Language Design and Implementation, June 1995, pp. 56–66.
– Yang, M., Uh, G., and Whalley, D., ‘Improving Performance by Branch Reordering’, in Proceedings of the SIGPLAN '98 Conference on Programming Language Design and Implementation, June 1998, pp. 130–141.


VPO Optimizations

• Recurrence detection and optimization

•Benitez, M. E. and Davidson, J. W., ‘Code Generation for Streaming: an Access/Execute Mechanism’, in Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991, pp. 132–141.


Building VPO

[Figure: building VPO. The VPO generator takes a CSDL specification (SPARC, MIPS, ALPHA, or i486) together with the analysis and transformation libraries (register allocation, access coalescing, common subexpression elimination, evaluation order determination, induction variable elimination, instruction scheduling, code motion, SSA computation) and any new transformation, and produces a target-specific optimizer such as VPO for MIPS.]


CSDL: Computing System Description Language

• Computing System Description Language
– Modular system of components
– Allows applications to customize a description
– Easily extensible for adding new details
– Reusable/application independent


CSDL

[Figure: the CSDL Core surrounded by its component descriptions: calling convention (CCL), memory system description (MSDL), pipeline description (PLUNGE), instruction representation (SLED), instruction semantics (λ-RTL), and object-file format]


Zephyr Compilers

• EDG SUIF-to-VPO Compiler
– Five targets (SPARC, Pentium, Alpha, MIPS, SimpleScalar)

[Figure: Source Code → EDG Front End → SUIF Pass 1 ... SUIF Passes → SUIF-to-LIRA → LIRA-to-SPARC → RTL (SPARC) → VPO (SPARC) → Target Machine Code]


Zephyr Compilers

• EDG-to-VPO C++ compiler
– Funded by Edison Design Group
– Targeted to SPARC only
– Compiles all benchmark suites (SPEC, PGI, lcc)
– Code generator (translator from EDG intermediate representation to RTLs) provided as a literate program


Zephyr Compilers

• lcc-to-VPO C compiler
– Targeted to SPARC, X86, MIPS, ALPHA, and SimpleScalar
– Code generators (translators from LIRA to target-machine RTLs) provided as literate programs
– Currently producing good code; some optimizations are not fully implemented/debugged


SPEC results for SPARC

Benchmark   Gcc –O   Lcc    vpolcc
go          13.4     6.45   11.0
M88ksim     5.70     4.98   6.2
li          8.98     5.93   7.48
Compress    11.6     9.0    9.28
Ijpeg       8.79     5.54   8.6
Perl        12.3     9.2    10.2
Vortex      10.7     8.27   11.2


Acknowledgements

• This work has been funded by:
– Defense Advanced Research Projects Agency
– National Science Foundation
– Panasonic AVC Labs
– Edison Design Group


Afternoon Schedule

Time Talk

1:30-2:00 ASDL: Dan Wang

2:00-2:55 Using Zephyr for PL Research: Kevin Scott
– The VPO Code Generation Interfaces
– LIRA: The lcc intermediate representation
– SUIF-to-LIRA

2:55-3:15 Using Zephyr for Architecture Research: Jason Hiser and Chris Milner
– Introduction
– Handling a target machine’s calling convention

3:15-3:30 Break


3:30-4:30 Using Zephyr for Architecture Research (continued): Jason Hiser and Chris Milner
– Writing a VPO machine description (md.y)
– Writing a VPO register specification (regs.rt)
– EASE: Environment for Architecture Study and Evaluation
– Case Study: Targeting SimpleScalar

4:30-5:20 Using Zephyr for Optimization Research: Jack Davidson
– Introduction to VPO’s optimization structure
– Adding a new optimization to VPO


5:20-5:40 Zephyr support tools: Raja Venkateswaran
– VET: Observing and debugging VPO
– VPOISO: Isolating optimization errors

5:40-6:00 Wrap up and Open Discussion

Using Zephyr for Programming Language Research

Kevin Scott
University of Virginia


Overview

• Zephyr organization and philosophy

• VPO code generation interfaces

• Adding a new front-end to Zephyr:
– Using the Lira intermediate representation
– With a custom code expander using the VPO code generation interfaces

• Language related issues in retargeting Zephyr

• Q & A


What is Zephyr?

• Set of tools for generating and optimizing RTL programs
– VPO (Very Portable Optimizer)
• SPARC, Alpha, x86, MIPS, SimpleScalar (PISA)
– Code Expanders
• Turn a front-end’s IR into RTLs
– Glue for hooking front-ends up to VPO
• VPO code generation interfaces
• Lira IR
– Debugging tools
• VET – interface for controlling and visualizing VPO transformations
• vpoiso – isolates optimizer bugs


National Compiler Infrastructure

[Figure: the National Compiler Infrastructure. Front ends shown: SML/NJ, EDG C++, Ada95, DEC FORTRAN, Java, lcc, and IBM C++ VisualAge. The SUIF infrastructure (SUIF, MachSUIF) is connected to the Zephyr infrastructure (VPO) by the SUIF-to-VPO bridge. Targets shown: Alpha, Sparc, MIPS, and X86. Optimizations shown: interprocedural analysis, parallelization and locality opts, object-oriented opts, scheduling, register allocation, instruction selection, code motion, memory access coalescing, induction variable elimination, CSE, loop unrolling, and inlining. Some items are marked as optional.]


Why use Zephyr?

• You’re a language researcher
– Easy to hook a front-end up to VPO
– Relatively little effort required to get multiple targets
– VPO is a very good optimizer
• Wide range of existing operations
• Leverage work of others contributing new optimizations to VPO
– Lets you concentrate on front-end issues
– Less work than writing a VPO-quality optimizer yourself


Zephyr Organization

[Figure: Zephyr organization. Front ends: lcc, EDG, SUIF, and VPCC. Lira code expanders (SPARC, MIPS, Alpha, x86) and EDG code expanders (SPARC) reach VPO through VPOi and VPOasm; CVM code expanders (SPARC, x86, MIPS) serve VPCC.]


Four Front Ends

• VPCC – A K&R C compiler
– IR is code for a C virtual machine (CVM)
– Deprecated in favor of lcc front-end

• EDG – Edison Design Group C/C++
– Very flexible IR

• Lcc – Retargetable C compiler
– Simple backend emits Lira, an IR based on lcc trees

• SUIF 2.1
– High level optimizations and analyses
– suif2lira pass transforms SUIF IR into Lira


Code Expanders

• CVM Code Expanders
– SPARC, x86, MIPS
– Generate encoded RTL files directly; don’t use VPOi or VPOasm

• EDG Code Expanders
– SPARC
– First expander to use VPOi and VPOasm interfaces


Lira Code Expanders

• Targets
– SPARC
– X86
– Alpha
– MIPS32
– MIPS64 and SimpleScalar (PISA)

• Input Lira code specialized for target

• Output encoded RTLs for VPO

• All use the VPOi and VPOasm interfaces


VPOi

• VPOi provides a C interface for:
– Creating RTLs
– Sending RTLs to VPO for optimization

• Abstracts away specifics of:
– RTL representation
– How RTLs are sent to VPO

• RTL creation routines can be semi-automatically generated from a machine specification


VPOasm

• VPOasm provides a C interface for sending assembly language statements to VPO.

• Allows a code expander to:
– Change segments
– Define symbols
– Initialize storage locations
– Specify alignments for code or data


More on VPOi and VPOasm

• Why use these interfaces?
– Simpler than writing out VPO encoded RTL files manually.
– Can get some of the implementation for free if doing a new target architecture.
– Allows us to change RTL and assembly language representations w/o fouling you up. Much.

• Reference manual for VPOi and VPOasm:
– http://www.cs.virginia.edu/zephyr/vpoi


VPOi and VPOasm caveats

• Interfaces are written in C.
– Bad if you’re writing a code expander in languages with no mechanism for calling C functions.

• Interfaces are relatively rigid.
– Suppose you want to communicate something to the optimizer that doesn’t look like an RTL or assembly language.

• Interfaces have only been tested on C/C++ front ends.
– Might have to change to accommodate new language features…


Lira

• Simple IR based on lcc trees

• Targets a stack-oriented virtual machine

• Two types of entities in a Lira file:
– Instructions
– Directives


Lira Instructions

• Instruction is composed of:
– Operator (33)
– Type
• F (float), I (signed integer), U (unsigned integer), P (pointer), V (void), B (aggregate)
– Size
• 1, 2, 4, 8, …
– Auxiliary info

Operators: CALL, GE, MOD, ADD, CVF, ARG, EQ, LSH, NEG, BCOM, NE, ASGN, DIV, INDIR, CNST, LABEL, LT, SUB, BXOR, CVU, ADDRL, JUMP, LE, RSH, BOR, CVP, ADDRG, RET, GT, MUL, BAND, CVI, ADDRF


Lira Instruction Example

• C Fragment

int a;

a = a + 10;

• Lira Translation

ADDRGP4 “a”

INDIRI4

CNSTI4 10

ADDI4

ADDRGP4 “a”

ASGNI4
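
Because Lira targets a stack-oriented virtual machine, the translation above can be read as a program for a small stack interpreter. The sketch below is a toy in C, not part of Zephyr: `lira_inst`, `run_lira`, and the use of an array index as the "address" of a global are assumptions for illustration, and the operand order of ASGNI4 (address on top, value beneath) simply follows the order in which the slide's example pushes them.

```c
#include <stddef.h>

typedef enum { ADDRGP4, INDIRI4, CNSTI4, ADDI4, ASGNI4 } lira_op;

/* arg is a global's index (for ADDRGP4) or a constant (for CNSTI4). */
typedef struct { lira_op op; int arg; } lira_inst;

/* Run a Lira-like sequence on a toy stack machine.
   globals[] stands in for memory; the stack is assumed deep enough. */
void run_lira(const lira_inst *code, size_t n, int *globals) {
    int stack[16];
    int sp = 0;
    for (size_t i = 0; i < n; i++) {
        switch (code[i].op) {
        case ADDRGP4: stack[sp++] = code[i].arg; break;   /* push address  */
        case CNSTI4:  stack[sp++] = code[i].arg; break;   /* push constant */
        case INDIRI4: stack[sp - 1] = globals[stack[sp - 1]]; break; /* load */
        case ADDI4:   sp--; stack[sp - 1] += stack[sp]; break;
        case ASGNI4: {                      /* pop address, then value; store */
            int addr = stack[--sp];
            int value = stack[--sp];
            globals[addr] = value;
            break;
        }
        }
    }
}
```

Running the six instructions of the example with the global "a" at index 0 and an initial value of 5 leaves 15 in "a".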


Lira Directives

• Change program segments with:
– code, data, bss, lit

• Specify alignment with:
– align

• Control symbol visibility with:
– import, export

• Initialize storage locations with:
– bytes, string, address, skip


Lira Directives (cont)

• Indicate procedure boundaries with:
– proc, endproc

• Describe procedure locals and parameters with:
– local, param

• Describe source coordinates with:
– file, line


Lira Directive Example

• Reserving storage for a global int “a”

-bss
-export a
-align 4
+LABELI4 “a”
-skip 4


The truth about Lira

• Lira can be emitted from lcc using a postorder walk of lcc trees. Almost.

• Typical case:

lcc tree
ADDI4
  INDIRI4
    ADDRGP4 “a”
  CNSTI4 10

Emitted Lira
ADDRGP4 “a”
INDIRI4
CNSTI4 10
ADDI4
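
The typical case is a plain postorder walk: emit both subtrees, then the node. A minimal sketch in C follows; the `node` struct and `emit_postorder` are hypothetical and do not reflect lcc's actual tree representation.

```c
#include <string.h>

typedef struct node {
    const char *text;             /* e.g. "ADDI4" or "CNSTI4 10" */
    const struct node *kids[2];   /* NULL when absent */
} node;

/* Postorder walk: emit both subtrees, then the node itself.
   Appends one line per node to out (assumed large enough). */
void emit_postorder(const node *n, char *out) {
    if (n == NULL) return;
    emit_postorder(n->kids[0], out);
    emit_postorder(n->kids[1], out);
    strcat(out, n->text);
    strcat(out, "\n");
}
```

Applied to the tree for a + 10, the walk visits the address node, the load, the constant, and finally the addition, which is the instruction order shown on the slide.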


The truth about Lira (cont)

• Sometimes, we don’t do a postorder traversal:

ADDI4

INDIRI4

ADDRGP4 “a”

CNSTI4 10

ADDRGP4 “a”

INDIRI4

CNSTI4 10

ADDI4

ADDRGP4 “a”

ASGNI4

ADDRGP4 “a”

INDIRI4


The truth about Lira (cont)

• A Lira program is specialized to the compilation target.
– Types, sizes and alignments are target specific
– Front-end must generate appropriate target dependent code for accessing the components of aggregates (arrays and structs)


Lira Code Expander

• Structured for simplicity.

• Code is generated by a big switch statement.

• Two passes made over the input.
– First gathers symbol information.
– Second generates code.

• SPARC expander is about 1800 lines of C. Close to ½ of the code is machine independent or easily reused on new targets.


Retargeting Lira code expander

• Three big tasks:
– Modify dumptree to map Lira ops onto RTLs for the new target. Easiest of the three since there is substantial opportunity for cut & paste coding.
– Modify sp_call to emit target dependent RTLs. On the SPARC we emit the following when the caller returns a struct:

VPOi_rtl(ST(tmp_loc, sp_plus(r[14], SP_OFS-4)),
         VPOi_locSetBuild(tmp_loc, 0));


Retargeting Lira code expander

• Modify setup_frame to:
– Use right offsets for parameters and locals.
– Emit RTLs to do target dependent frame setup on procedure entry. For procedures returning a struct on the SPARC, we emit:

VPOi_rtl(LD(sp_plus(r[30], SP_OFS-4), tmpreg), 0);
locaddr = sp_plus_ra(r[30], locals.t[0].sym, 0);
VPOi_rtl(ST(tmpreg, Rtl_fetch(locaddr, 32)),
         VPOi_locSetBuild(locaddr, tmpreg, 0));


Why use Lira?

• Lira is a pretty good intermediate language for C-like languages. (Thanks to Chris Fraser and Dave Hanson!)
– Abstracts away specifics of a target’s calling sequence! Left to code expander to implement.

• Separating Lira from lcc means that we can reuse the Lira code expanders for front-ends other than lcc. E.g., SUIF.

• Very easy to write a Lira code expander.


Lira References

• “A Retargetable C Compiler: Design and Implementation”

• Lcc version 4.1 code generation interfaces
– http://www.cs.princeton.edu/software/lcc/pkg/doc/4.html

• More on the way…


Adding a front-end to Zephyr

• Is your language C-like?
– If yes then consider writing code to map your IR onto Lira. This gets you all of Lira’s targets almost for free.
– If no then you might need to write a code expander for each target you want to support.


Adding a front-end to Zephyr

• Is my target already supported?
– If yes then you’re golden.
– If no then you may have to do one or more of the following:
• Create VPOi and VPOasm interfaces for your target. This can be partially automated.
• Write a Lira code expander for the new target, or
• Write a custom code expander for the new target.
• Port VPO to the new target.


Adding a front-end using Lira

• Difficulty depends on your IR.
– Trivial for lcc – almost same IR!
– Pretty easy for SUIF. E.g.

void Translator::trans(BinaryExpression *exp) {
    int lira_op;

    translate(exp->get_source1());
    translate(exp->get_source2());
    switch (op_map(exp->get_opcode())) {
    case SOP_add: lira_op = LIRA_ADD; break;
    ...
    }
    emitter->emit(lira_op, lira_map_ty(exp->get_result_type()));
}


Where can I find out more?

• Should be releasing suif2lira as a literate program around July 1.
– Good starting point for someone familiar with SUIF wanting to hook up a front-end with Lira.

• Literate source for SPARC and x86 Lira code expanders will be available immediately after PLDI.


Adding a front-end using a custom code expander

• Difficulty again depends on your IR.

• Refer to EDG SPARC code expander:
– http://www.cs.virginia.edu/zephyr/dist/edg-sparc-1.0.pdf


Language issues in retargeting Zephyr

• Calling convention
– In addition to emitting RTLs to properly handle language calling conventions on function calls and function entry, also need to consider fixentry in VPO.

– fixentry finalizes a procedure’s prologue after optimization is complete.

– More in next talk.

Using Zephyr for Architecture Research

Jason Hiser and Chris Milner


A Brief Introduction to Zephyr and Architectural Research

Jason Hiser



Roadmap

• Handling a machine’s calling convention
– Jason

• Break
– Coffee!

• Writing a VPO machine description and Writing a VPO register description
– Chris Milner

• Case Study: Targeting SimpleScalar
– Jason

Handling a Machine’s Calling Convention
fixentry fun (regs.c)

Jason Hiser
University of Virginia


Introduction To regs.c

• Fixentry: The main routine of regs.c
– Responsibilities of fixentry

• Parameters, external and global data used in fixentry

• Other functions: regarg, initmap, map, transfer, leaf


Responsibilities of Fixentry

• Calculate stack space needed
– outgoing parameters, spill locations, local variables, saved registers, and incoming parameters

• Emit function prologue
– Adjust stack pointer
– save return address and saved registers
– add RTLs for local equates


Fixentry Responsibilities (continued)

• Create and maintain a “mapping” from the registers used to the actual hardware registers

• Save/restore necessary registers and incoming parameters to stack

• Emit function epilogue (including code to restore saved registers)


Not the responsibility of Fixentry

• Perform any optimization

• Insert spill code

• Make decisions about register usability

• Emit assembly code for any instructions

• Set up registers/stack for making a function call

• Allocate global data


Extern Variables (Where fixentry gets its data)

• struct bblock *top: list of basic blocks in current function

• struct locuse *locs: local variables and parameters

• int isused[MAXREGS]: which registers are used and which aren’t

• int varargs: is this a variable argument function?


Parameters to Fixentry

• struct list *ptr: the RTLs in the current function

• struct blist *retb: the basic blocks that need epilogue code


Global Variables

• int gpregmap[]: the “mapping” of the general purpose registers

• int fpregmap[]: the “mapping” of the float registers

• int spilloff: information to the code emitter about where to place spill variables


Calculating Stack Space

• Loop through RTLs and find out how much space is needed for outgoing params

• Loop through temps and calculate spill space needed

• Loop through locals and calculate local space needed


Calculating Stack Space (cont.)

• Loop through registers and find out which ones need to be saved

• Determine space needed for incoming parameters (register params only)
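
The stack-space calculation amounts to summing the areas listed above and rounding to the target's stack alignment. The following C sketch shows only that arithmetic; the `frame_needs` struct, `round_up`, and `frame_size` are hypothetical names, and the real fixentry derives these quantities by walking VPO's RTLs, temporaries, locals, and register usage.

```c
#include <stddef.h>

/* Hypothetical summary of what the loops over RTLs, temps, locals,
   and registers would have collected. */
typedef struct {
    size_t outgoing_bytes;   /* largest outgoing parameter area of any call */
    size_t spill_bytes;      /* space for spilled temporaries */
    size_t local_bytes;      /* space for local variables */
    size_t saved_regs;       /* number of registers that must be saved */
    size_t reg_params;       /* incoming parameters that arrive in registers */
} frame_needs;

/* Round n up to a multiple of align (align must be a power of two). */
size_t round_up(size_t n, size_t align) {
    return (n + align - 1) & ~(align - 1);
}

/* Total frame size in bytes for a given word size and stack alignment. */
size_t frame_size(const frame_needs *f, size_t word, size_t align) {
    size_t total = f->outgoing_bytes + f->spill_bytes + f->local_bytes
                 + f->saved_regs * word + f->reg_params * word;
    return round_up(total, align);
}
```

For example, 16 bytes of outgoing parameters, 8 of spill, 21 of locals, 3 saved registers, and 2 register parameters on a 4-byte-word machine total 65 bytes, which an 8-byte alignment rounds up to 72.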


Emitting Prologue and Epilogue

• Prologue
– Emit code to adjust stack pointer
– Emit code to spill return address and saved regs

• Epilogue
– For each exit block
• Restore spilled registers
• Restore stack pointer
• Jump to return address


Register Map

• Register allocator determines what variables are in which register
– Fixentry needs to put these variables in the proper register.

• Fixentry attempts to map registers so no movements are necessary, overriding the allocator assignment policy
– If it can’t, register to register moves are necessary


Other Functions of regs.c

• regarg: Boolean function; returns true if a local variable is an argument and enters the function in a register

• initmap: initializes the gpregmap and fpregmap

• map: returns the mapping for a register


Other Functions of regs.c (continued)

• transfer: creates a transfer RTL from two machine locations (memory, register, or spill)

• leaf: Boolean function; determines if a function is a leaf


Summary

• Fixentry is the main portion of regs.c

• Fixentry is responsible for
– function prologue
– function epilogue
– register mapping to avoid register to register moves

• Regs.c also contains a few functions to let other areas know about the mapping.

Using Zephyr for Architecture Research (continued)

Jason Hiser and Chris Milner


Writing a VPO Machine Specification

Chris Milner
University of Virginia


Outline of talk

• Structure of VPO

• Machine descriptions

• How to construct the descriptions

• Getting machine dependent information for machine independent transformations
– combiner
– loop (and other) transformations
– scheduler

• EASE


Structure of VPO

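[Figure: structure of VPO. Machine independent source: C code for CSE, strength reduction, dead code elimination, and other transformations. Machine dependent source: C code in simp.c, rtl.c, and sched.c; the register description (reg.rt), processed by the register processor (regtool); the instruction description (md.y), processed by the instruction processor (yyfast); and the pipeline description (pipe.pg), processed by the pipeline processor (real soon now). Each processor produces C code, and the C compiler builds all of it into the VPO optimizer, in which machine independent routines such as combiner() and loop_strength() rely on machine dependent routines such as inst_is_legal() and is_basic().]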


VPO

• “Machine independent” transformations on low level “machine dependent” intermediate form (register transfer lists)

• Retargeted portion assists in:
– recognizing legal RTLs
– converting and inserting RTLs to assist transformations
– picking apart RTLs to get information


Role of Machine Descriptions

• md.y - legal instructions
– maintains VPO invariant
– YACC grammars

• regs.rt - register file
– register types
– alignment
– size
– ABI


md.y

• RTL recognizer
– Workhorse
– RTLs come from combiner (at compile time)
– ours are not the usual table driven ones but directly executable (yyfast)

• How do you do it?
– Work from existing ones (derive Alpha from MIPS); or,
– construct one anew


Sample machine

• Subset SIMPLESCALAR
– e.g. student project on FPGA
– load/store
– chars, half words and words
– constants must be loaded into registers
– add, and, not, sll, sra, srl
– branch on less than, branch on equal, jump, call, return


Constructing md.y (continued)

• Operands - registers

%token REG0 REG1 REG2
(scanner converts 'b''[''1'']' to REG0)
reg: REG0
   | REG1
   | REG2

• Operands - memory

%token BMEM WMEM RMEM
(scanner converts 'B''[' to BMEM)
mem: BMEM reg ']'
   | WMEM reg ']'
   | RMEM reg ']'

• Operands - misc

%token PC RT ST (used for call and return)
%token LOCAL GLOBAL CON LBL
expr: LOCAL
    | GLOBAL
    | CON
    | LBL

• Operations

%left '=' '+' '&' '"' '{' '}'
%nonassoc '~' ','
rhs: reg '+' reg
   | reg '&' reg
   | reg '{' reg
   | reg '}' reg
   | reg '"' reg

• Binary operations

binops: reg '=' rhs

• Unary operation

not: reg '=' '~' rhs

• Load, load immediate and store

l : reg '=' mem
li: reg '=' expr
s : mem '=' reg
si: expr '=' reg (FORTRAN)

• Branch

bb: PC '=' reg ':' reg
  | PC '=' reg '<' reg

• Jump, call and return

jmp: PC '=' reg
jal: ST '=' expr
ret: PC '=' RT

• All instructions

inst: bb | jmp | jal | ret
    | binops | not
    | l | li | s

• Now, we need some glue and some checking


Glue for parser

• Build up semantic records

• Found in isem.c
– addr() - record for addressing mode
reg: REG0 {$$=addr(BYTE,BREGISTER…)}
– memref() - record for memory access
– brecord() - record for binary op
– rrecord() - record for relational op
– same() - ensure records are same


Semantic routines

• inst.c
– each instruction or instruction class has a routine
– routine checks for legal operands
– is responsible for emitting legal asm
– e.g. bb()
• on MIPS, check the semantics for compare and branch
• if right hand operand is immediate, use immediate form of instruction
• records instruction type


Structure of VPO (again)

[Figure: structure of VPO, repeated from the earlier slide]


regs.rt

• TYPES
– basic types of registers on the machine
– byte, half, word, float, double
– BTREG, WTREG, RTREG, FTREG, DTREG

• CODES
– condition codes
– IC, FC, etc.


regs.rt (continued)

• CLASS
– general_purpose, float, spill
– number
– scratch
– reserve


regs.rt (continued)

• CLASS (continued)
– type
• alignment (even-odd register pairs)
• size - how many to allocate
• invariant - mark as invariant for loops
– e.g. fp and sp
• memchar, regchar - give it a different name
• stack, fifo - tells the allocator about them


regs.rt for MIPS

types BTREG, WTREG, RTREG, FTREG, DTREG

codes FC

class = general_purpose

number = 32

scratch = 2..15, 24, 25

reserve = 0, 1, 26, 27, 28, 29, 31

(notes: MIPS - reg 0 is zero, reg 1 is asm reg,reg 26,27 are used by os, reg 28 is gp,reg 29 is sp, reg 31 is return address)


regs.rt for MIPS (continued)

type = RTREG

alignment = 1

size = 1

invariant = 28, 29

endtype

type = BTREG, WTREG

alignment = 1

size = 1

endtype



class = floating_point

number = 16

scratch = 0..9

type = FTREG, DTREG

alignment = 1

size = 1

endtype

endclass



class = SPILL

number = 32

type = BTREG, WTREG, RTREG, FTREG

alignment = 1

size = 1

endtype

type = DTREG

alignment = 2

size = 2

endtype

endclass


Structure of VPO (again)

[Figure: structure of VPO, repeated from the earlier slide]


Other files

• simp.c - helps the combiner

• sched.c - machine specific portion of scheduling

• rtl.c - routines to find machine idioms in transformations


simp.c

• Combine RTLs in machine dependent way

• e.g. SPARC

1     r[35]=~r[35]
2 {1} r[33]=r[33]&r[35]

combines to

r[33]=r[33]&~r[35]

semantically ok, but not an instruction

comp() makes machine idiom substitution

r[33]=r[33] ANDNOT r[35]


simp.c (continued)

• e.g. SPARC constants: 4095 is biggest immediate

1     r[40]=4095
2 {1} r[41]=r[40]+13

combines and folds to

r[41]=4108

comp() converts to

r[41]=HI[4108]
r[41]=r[41]|LO[4108]
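
The HI/LO split can be checked with a few lines of C. The sketch assumes the standard SPARC encoding, in which arithmetic immediates are 13-bit signed values and sethi supplies the upper 22 bits of a register; the function names `fits_simm13`, `hi22`, and `lo10` are made up for illustration and are not VPO routines.

```c
#include <stdbool.h>
#include <stdint.h>

/* SPARC arithmetic immediates are 13-bit signed: -4096..4095. */
bool fits_simm13(int32_t c) { return c >= -4096 && c <= 4095; }

/* Value placed in the register by the HI part: low 10 bits cleared. */
uint32_t hi22(uint32_t c) { return c & ~UINT32_C(0x3ff); }

/* Low 10 bits, OR-ed in by the second instruction. */
uint32_t lo10(uint32_t c) { return c & UINT32_C(0x3ff); }
```

For 4108, which does not fit the immediate field, the HI part is 4096 and the LO part is 12, and OR-ing them gives back 4108.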


rtl.c

• Manipulate
– reverse() - reverse a branch
– don’t_bother_with() - tell cse to ignore

• Predicates
– is_call(), is_rjmp(), ismem(), writes_mem(), is_pc()

• Pick apart
– findlabel(), usetype()


rtl.c (continued)

• Insert code to help transformations
– store(), load()
– multconst()
• add series of shifts and adds
– locsub() - substitute reg for mem
• SPARC has sign extend on load
• no single sign extend move
• have to insert shifts to do sign extend
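
The "series of shifts and adds" behind multconst() can be illustrated with the usual binary decomposition of the constant. The C sketch below computes the product directly rather than emitting RTLs, so it shows the decomposition only; `mult_by_const` is a hypothetical name, and VPO's actual multconst() may choose a different sequence.

```c
#include <stdint.h>

/* Multiply x by the constant c using only shifts and adds:
   one shifted add for each set bit of c. */
uint32_t mult_by_const(uint32_t x, uint32_t c) {
    uint32_t result = 0;
    for (unsigned shift = 0; c != 0; shift++, c >>= 1)
        if (c & 1u)
            result += x << shift;
    return result;
}
```

Multiplying by 10 (binary 1010), for instance, becomes (x << 1) + (x << 3).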


rtl.c(continued)

r[1] = 0

r[9] = r[14] + a

L32:

r[8] = r[1]*4

R[r[8]+r[9]]=0

r[1]=r[1]+1

IC=r[1]?100

PC=IC<0,L32

• regular induction variable

• induced expression

• basic induction variable

• Assist loop strength reduction

• might be one instruction or several


sched.c

• SPARC - yes, MIPS - no

• Scheduler uses mostly machine independent list scheduling algorithm

• keeps machine specific dependencies straight

• helps avoid hazards


sched.c (continued)

• md_sets_uses
– what an instruction does
– what an instruction is blocked by
– reads can slide past reads, not past writes

rtl->does |= READS
rtl->blocks |= WRITES

– writes cannot slide past anything

rtl->does |= WRITES
rtl->blocks |= WRITES | READS
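
One way to read the does/blocks bit sets is as a pairwise test: an instruction may move past another only if neither does something the other is blocked by. The C sketch below encodes that reading; the `sched_info` struct and `can_slide_past` are hypothetical, and how the real scheduler consumes these fields is an assumption here.

```c
#include <stdbool.h>

enum { READS = 1, WRITES = 2 };

typedef struct { unsigned does; unsigned blocks; } sched_info;

/* Memory read: may pass other reads, but not writes. */
sched_info mem_read(void)  { sched_info s = { READS, WRITES }; return s; }

/* Memory write: may not pass reads or writes. */
sched_info mem_write(void) { sched_info s = { WRITES, WRITES | READS }; return s; }

/* Hypothetical check: `later` may be moved ahead of `earlier` only if
   neither instruction does something the other is blocked by. */
bool can_slide_past(sched_info later, sched_info earlier) {
    return (later.blocks & earlier.does) == 0 &&
           (earlier.blocks & later.does) == 0;
}
```

With these settings only the read-past-read case is allowed, which matches the two rules stated on the slide.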


sched.c (continued)

• sched_adv()
– relative advantage or disadvantage of scheduling this instruction next
– relative to last instruction scheduled
– e.g. SPARC
• space out float instructions
• avoid consecutive stores
• make consecutive instructions independent


EASE

• EASE: Environment for Architecture Study and Experimentation
– VPO includes a facility for obtaining
• Measurements of instruction usage
• Instruction cache traces
• Data cache traces
• precise timing
– VPO provides facilities for emulating architectures
• Can extend existing architectures


EASE (continued)

• Use control-flow graph to insert instrumentation code

• Low overhead (10 to 15%)

• Cache traces generated on the fly (no need to store)

[Figure: basic blocks, each instrumented with a "Bump Counter"]
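
The effect of bumping one counter per basic block can be sketched at the source level. This is an illustration under stated assumptions, not EASE itself: the instrumentation here is written by hand rather than inserted from the control-flow graph, and the per-block instruction counts in block_size are made-up numbers.

```c
#define NUM_BLOCKS 3

/* One execution counter per basic block. */
unsigned long block_count[NUM_BLOCKS];

/* Hypothetical number of instructions in each block. */
const unsigned block_size[NUM_BLOCKS] = { 2, 3, 1 };

#define BUMP(b) (block_count[(b)]++)

/* Instrumented version of: sum = 0; for (i = 0; i < n; i++) sum += i; */
int instrumented_sum(int n)
{
    BUMP(0);                 /* entry block */
    int sum = 0;
    for (int i = 0; i < n; i++) {
        BUMP(1);             /* loop body */
        sum += i;
    }
    BUMP(2);                 /* exit block */
    return sum;
}

/* Instructions executed = sum over blocks of (count * block size). */
unsigned long instructions_executed(void)
{
    unsigned long total = 0;
    for (int b = 0; b < NUM_BLOCKS; b++)
        total += block_count[b] * block_size[b];
    return total;
}
```

Because every instruction in a basic block executes whenever the block does, one counter per block is enough to recover instruction usage counts.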


EASE (continued)

• Emulation of new architecture features
  – Add new instructions to machine description
  – Generate code and optimize as if new features exist
  – In last step of VPO, emit code to emulate new features

[Figure: VPO Mach Last Step, two examples]
    r[3] = r[3] + r[2]             ->  add r2, r3, r3
    r[5] = r[5] + (r[3] * r[2])    ->  mul r3, r2, r1
                                       add r1, r5, r5

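
The last-step emulation in the figure can be sketched as a small emitter that expands one multiply-add RTL into two existing instructions. The function name, its signature, and the choice of scratch register are assumptions for illustration; the output syntax follows the figure.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical emitter: expand r[d] = r[d] + (r[a] * r[b]) into a
 * multiply into a scratch register followed by an add.
 * Returns the value returned by snprintf. */
int emit_multiply_add(char *buf, size_t len, int d, int a, int b, int scratch)
{
    return snprintf(buf, len,
                    "mul r%d, r%d, r%d\n"
                    "add r%d, r%d, r%d\n",
                    a, b, scratch,
                    scratch, d, d);
}
```

The optimizer treats the multiply-add as a single instruction throughout; only this final emission step knows that the target lacks it.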

Case Study: Targeting SimpleScalar

Jason Hiser
University of Virginia


Introduction

• What is SimpleScalar? Why use it?

• Why use VPO with SimpleScalar?
  – SimpleScalar comes with gcc, why not use that?

• Experiences in porting VPO to SimpleScalar

• Research with SimpleScalar and VPO


What is SimpleScalar?

• SimpleScalar is a functional simulator designed for use with architectural research
  – sim-safe -- a simple, fast simulator
  – sim-bpred -- measures branch predictor statistics
  – sim-cache -- measures cache statistics
  – sim-outorder -- models a multi-issue, out of order superscalar processor


Why Use SimpleScalar?

• Easy to model many common architectural features.
  – hybrid branch predictors, arbitrarily many functional units, much more

• Extendible instruction set -- PISA
  – Allows any instruction to be “annotated”
    • easy to create new instructions or add fields to old ones

• Comes with GNU tools for SimpleScalar
  – gcc, gas, gld, glibc, etc.


Why VPO and SimpleScalar? (Why not use gcc?)

• gcc does not generate instruction annotations

• difficult to write new optimizations to take advantage of new instructions

• just building gcc can be a challenge


Why VPO and SimpleScalar? (continued)

• Easily build VPO on any machine where you can build SimpleScalar

• Describe new instructions in machine description and optimizer will automatically use them when beneficial

• New optimizations can consult the machine description to see if architectural support is available
  – allows portability of optimizations


Experiences with Porting VPO to SimpleScalar

• PISA is basically MIPS
  – changes to some instruction formats
  – dmfc1 appears to be broken, negu not available, branch if (not) equal to zero instructions don’t exist

• Change instruction format in inst.c

• When compiling for SimpleScalar tell the machine description that negu, beqz, bneqz and dmfc1 are not available


Research with SimpleScalar and VPO at UVa

• Idea
  – Compiler managed on-chip memory can provide performance and power benefits

• Framework
  – Add instructions to move data to/from on-chip memory from/to registers
    • to VPO (in md.y, inst.c)
    • to SimpleScalar (machine.def)
  – Add optimization to promote variables from cache to on-chip memory


Summary

• SimpleScalar is a versatile functional simulator

• Porting VPO isn’t difficult
  – SimpleScalar target soon to be included with VPO

• VPO and SimpleScalar make a great vehicle for architectural research

Using Zephyr for Optimization Research

Jack Davidson
University of Virginia


VPO Logical Structure

[Figure: VPO logical structure. Labels: VPO Generator; ZIFLow Analysis & Transformation Libraries; New Transformation; CSDL SPARC Specification; CSDL MIPS Specification; CSDL ALPHA Specification; CSDL i486 Specification; VPO MIPS. Transformations listed: Register Allocation, Access Coalescing, Comm. Subexpr. Elim., Eval. Order Determ., Induction Var. Elim., Instruction Scheduling, Code Motion, SSA Computation.]


Actual Structure

VPO

lib SPARC MIPS X86 ALPHA


VPO Program Representation

[Figure: TOP, a chain of BASIC BLOCK nodes, and for each basic block a chain of LIST nodes (RTL struct)]

LIST fields: RTL, COST, INST TYPE, USES, SETS, DEF/USE

BASIC BLOCK fields: PREDS, IDOMS, DOM, NEST LVL, USES, DEFS, OUTS, PHI, REGSTATE


VPO Optimizations

• Review vpo.h


VPO Optimization Algorithm

    repeat
        apply code-improving transformation
    until fixed-point reached or exhausted registers

• Maintaining two invariants
  – Semantic invariant (S)
    • Observable behavior of program unchanged (according to RTL semantics)
  – Machine invariant (M)
    • Every RTL equivalent to one machine instruction


VPO code optimization

• Each code-improving transformation is
  – machine-level, but
  – machine-independent

• Any semantics-preserving transformation is OK

• Preserve machine invariant (M) using machine description;
  – for each new RTL produced, ask MD if OK
  – if any is not target machine instruction, roll back transformation
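
The apply, check, and roll back discipline can be sketched on a toy program. All names and types below are hypothetical and the "machine description" is reduced to a single immediate-range test; the sketch only illustrates how a transformation is tried, validated against the target, and undone if any resulting RTL is not an instruction.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_RTLS 8

/* Toy RTL meaning r[dst] = r[dst] + imm. */
struct toy_rtl { int dst; int imm; };
struct toy_prog { struct toy_rtl rtl[MAX_RTLS]; int n; };

/* Stand-in for asking the machine description whether an RTL is one
 * target instruction: here, the immediate must fit 13 signed bits. */
bool md_is_instruction(const struct toy_rtl *r)
{
    return r->imm >= -4096 && r->imm <= 4095;
}

/* Try to combine the adds at positions i and i+1. The change is kept
 * only if the new RTL is accepted; otherwise it is rolled back. */
bool try_combine_at(struct toy_prog *p, int i)
{
    if (p->rtl[i].dst != p->rtl[i + 1].dst)
        return false;
    struct toy_prog saved = *p;               /* snapshot for rollback */
    p->rtl[i].imm += p->rtl[i + 1].imm;
    memmove(&p->rtl[i + 1], &p->rtl[i + 2],
            (size_t)(p->n - i - 2) * sizeof p->rtl[0]);
    p->n--;
    if (!md_is_instruction(&p->rtl[i])) {
        *p = saved;                           /* roll back transformation */
        return false;
    }
    return true;
}

/* Repeat until no transformation applies (a fixed point).
 * Returns the number of transformations that were kept. */
int optimize_toy(struct toy_prog *p)
{
    int applied = 0;
    bool changed;
    do {
        changed = false;
        for (int i = 0; i + 1 < p->n; i++) {
            if (try_combine_at(p, i)) {
                changed = true;
                applied++;
                break;
            }
        }
    } while (changed);
    return applied;
}
```

Because the program is legal both before and after every kept transformation, it satisfies the machine invariant at each step of the loop.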


VPO Optimization Driver

• Review vpo.c


Adding a new optimization

• Determine where in optimize to insert the function
  – What analyses does the optimization need?
    • Control-flow optimizations usually come first as they need very little data-flow information
    • Data-flow optimizations follow: code motion, induction-variable elimination, common subexpression elimination
  – Does the optimization operate on a single basic block or does it operate across basic blocks?


Adding a new optimization

• Browse controlflow.c/fix_control_flow()

• Browse cdmotion.c/code_motion()


Semantic Safe Points

• A semantic safe point is a point in the optimization process where the code satisfies the M and S invariants
  – Code can be emitted at any semantic safe point and it should run correctly
  – Can insert new optimization at any semantic safe point


Debugging the compiler

[Figure: Source Code -> Front and Middle Ends -> RTL -> VPO Mach -> Machine Code; VPO Mach applies Trans 1, Trans 2, Trans 3, Trans 4, ..., Trans n]


VET - VPO Examination Tool

• Allows transformations to be observed
  – Observe data structure (control-flow graph)
  – Set a break point at a transformation
  – Set a break point at a phase
  – Replay a transformation

VET and VPOISO

Raja Venkateswaran
UVA


VET

• VET -> VPO Examination Tool
• GUI for viewing optimizations
• By Phase and By transformation
• Ability to revert to previous phases
• Wide range of user options


VPOISO

• Tool for isolating optimizer bugs

• Uses binary search to find the first transformation error

• Works by comparing against the correct output
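
The binary search can be sketched as follows. This is not VPOISO's code: the callback type, the function names, and the demo checker are hypothetical. It assumes a check that runs the optimizer with only the first `limit` transformations enabled and compares the result against the correct output, that zero transformations gives correct output, and that once the output goes wrong it stays wrong as more transformations are enabled.

```c
/* Hypothetical check: nonzero if the program behaves correctly when
 * only the first `limit` transformations are applied. */
typedef int (*check_fn)(int limit, void *ctx);

/* Returns the 1-based index of the first transformation whose
 * inclusion makes the output wrong, or 0 if all `total` are fine. */
int first_bad_transformation(int total, check_fn output_is_correct, void *ctx)
{
    if (output_is_correct(total, ctx))
        return 0;
    int good = 0;      /* output correct with this many transformations */
    int bad = total;   /* output wrong with this many transformations */
    while (bad - good > 1) {
        int mid = good + (bad - good) / 2;
        if (output_is_correct(mid, ctx))
            good = mid;
        else
            bad = mid;
    }
    return bad;
}

/* Demo checker for experimentation: ctx points to the index of a
 * pretend faulty transformation. */
int demo_check(int limit, void *ctx)
{
    return limit < *(int *)ctx;
}
```

With n transformations this needs about log2(n) compile-and-compare runs instead of n.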
