Workshop on Hardware Support for Objects And Microarchitectures for Java In conjunction with ICCD'99 Austin, Texas October 10, 1999


Source: apt.cs.manchester.ac.uk/ftp/pub/apt/papers/ICCD_Java_99.pdf

Workshop on

Hardware Support for Objects

And

Microarchitectures for Java

In conjunction with ICCD'99

Austin, Texas
October 10, 1999


MESSAGE FROM WORKSHOP CO-CHAIRS

Most modern programming languages and techniques include object-oriented methods. However, mainstream computer architectures have not acknowledged the presence of objects. With the widespread use of object-oriented programming languages and techniques, it is becoming important for computer architects to acknowledge the existence of these methods and their impacts on execution (including high object allocation rates, the impact of garbage collection, dynamic binding of calls to methods, and dynamic assembly of programs at run time from components obtained from disparate sources).

Java is an exciting new object-oriented technology. Hardware for supporting objects and other features of Java such as multithreading, dynamic linking and loading is the focus of this workshop. The impact of Java's features on micro-architectural resources and issues in the design of Java-specific architectures are interesting topics that require the immediate attention of the research community.

The purpose of this workshop is to draw together researchers and practitioners concerned with hardware support for objects and Java implementations for a stimulating exchange of views. To the organizers' best knowledge, this is the first event of its kind, and as such is an attempt to begin the task of building a community in this field. We thank all the program committee members, the authors and the invited panelists for helping us start this process. We would also like to thank the ICCD organizers and Prof. Craig Chase, in particular, for their support of this workshop. We hope you will enjoy this workshop as much as we enjoyed organizing it.

Mario Wolczko, Sun Microsystems
Vijaykrishnan Narayanan, Pennsylvania State University


Organizers

Workshop Co-Chairs

Vijaykrishnan Narayanan, Pennsylvania State Univ.
Mario Wolczko, Sun Microsystems, Inc.

Program Committee

Timothy Heil, Univ. of Wisconsin, Madison
Lizy John, Univ. of Texas at Austin

Vijaykrishnan Narayanan, Pennsylvania State Univ.
Nagarajan Ranganathan, Univ. of South Florida

Mario Wolczko, Sun Microsystems, Inc.


TABLE OF CONTENTS

8:15-9:45 a.m. Session 1. Innovations in Memory System Design

Chair: Timothy Heil, University of Wisconsin, Madison

• A Case for Using Active Memory to Support Garbage Collection, Sylvia Dieckmann and Urs Hoelzle

• Tolerating Latency by Prefetching Java Objects, Brendon Cahoon and Kathryn McKinley

• DMMX: Dynamic Memory Management Extension, J. Morris Chang, Witawas Srisa-an, and Chia-Tien Dan Lo

10:15-11:45 a.m. Session 2. Architectural Issues in Dynamic Translation

Chair: Mario Wolczko, Sun Microsystems, Inc.

• How can hardware support Just-In-Time compilation? A. Murthy, N. Vijaykrishnan, A. Sivasubramaniam

• Exploiting Hardware Resources: Register Assignment across Method Boundaries, Ian Rogers, Alasdair Rawsthorne, Jason Souloglou

• A Decoupled Translate Execute Architecture (DTEA) to Improve Performance of Java Execution, Ramesh Radhakrishnan and Lizy Kurian John

1:00-2:00 p.m. Session 3. Object-Oriented Architectural Support

Chair: Lizy John, University of Texas, Austin

• Applying Predication to Reduce the Direct Cost of Virtual Function Calls in Object-Oriented Programs, Sandeep K. S. Gupta, Chris Sadler

• Hardware Support for Profiling Java Programs, Nathan M. Hanish and William E. Cohen


2:15-3:45 p.m. Session 4. Microarchitectures for Java

Chair: Vijaykrishnan Narayanan, Pennsylvania State University

• VLSI Architecture Using Lightweight Threads (VAULT), Ian Watson, Greg Wright, Ahmed El-Mahdy

• A two step approach in the development of a Java Silicon Machine (JSM) for small embedded systems, Hagen Ploog, Ralf Kraudelt, Nicco Bannow, Tino Rachui, Frank Golatowski, Dirk Timmermann

• Quantitative Analysis for Java Microprocessor Architectural Requirements: Instruction Set Design, M. Watheq EL-Kharashi, Fayez ElGuibaly, Kin F. Li

4:00-5:30 p.m. Panel Session: Java Virtual Machines: What can hardware offer to support them?

Panelists: David Hardin, Ajile Systems, Inc.; Jim Smith, University of Wisconsin, Madison; Marc Tremblay, Sun Microsystems, Inc.
Moderator: Mario Wolczko, Sun Microsystems, Inc.


Session 1

Innovations in Memory System Design



A Case for Using Active Memory to Support Garbage Collection

Sylvia Dieckmann and Urs Hölzle
University of California, Santa Barbara

{sylvie,urs}@cs.ucsb.edu

Abstract

Most modern programming languages require efficient automatic memory management (garbage collection, GC) as part of the runtime system. Since GC is very memory intensive it can potentially suffer significantly from poor memory access times. Unfortunately, memory performance improves at a slower pace than processor speed, making memory accesses relatively more expensive in the future. Active Memory architectures aim to overcome this problem by placing additional computational power in memory, thus allowing the application to execute small but memory-intensive functions closer to the data and in parallel. The goal is to improve latency and bandwidth for programs that can otherwise suffer from slow memory accesses.

To date, Active Memory has been studied only with databases, image processing, arithmetic computations, and other very regular applications. In this paper, we propose to analyze its impact on garbage collection. We are convinced that garbage collection too will profit from this architecture since GC is simple, repetitive, easy to partition into offloadable functions, and its performance depends crucially on fast memory access. We describe a possible incarnation of an Active Memory architecture suitable for GC support and argue why GC should benefit from such an architecture.

1. Motivation

Efficient and reliable garbage collection (GC) is an essential part of most modern (especially object-oriented) programming languages. GC relieves the programmer from error-prone explicit deallocation, thus preventing memory leaks or early deallocation. But GC performance greatly depends on fast memory access, which can pose a challenge not only to the GC implementation but also to the design of the underlying machine. For example, a simple mark&sweep collector first identifies and marks all live objects and then reclaims (sweeps) the unmarked space. Both phases are very memory intensive since they require touching the entire heap (or at least all live objects) and thus can potentially suffer significantly from poor memory performance. Traditional techniques designed to improve memory performance do not always work for GC. For example, GC can have a negative effect on the cache hit rate

because it evicts all application-related entries from the cache. Even worse, GC itself has a poor hit rate because each collection dereferences all pointers at most once. Techniques such as prefetching that exploit access patterns often fail because memory access during GC follows pointer chains and can be very irregular. [10, pp. 284]

Alas, memory performance, namely access latency and bandwidth, is an issue of great concern for modern machines. Already, the cost of processor-memory communication has a significant impact on overall performance. Most researchers agree that it will become even more important in the future, since (1) processor speed is growing at a faster pace than memory performance, and (2) the interconnections used to deliver this data to the processors have limited bandwidth [15, 28, 53, 71]. For example, Hennessy and Patterson estimate that since 1985 CPU performance has grown by 50% every year whereas DRAM access times have improved by only around 7% per year [19]. With faster processors but similar memory latencies and bandwidths, it becomes harder to feed the processing unit with useful input data to run at peak speed and memory accesses become relatively more expensive. (In the literature this problem is typically referred to as the increasing Memory-Processor Performance Gap.) In addition, modern applications are putting stronger demands on the memory system as data sets grow larger [1, 36] and object-oriented and pointer-based programs with irregular access patterns and automatic memory control are becoming more and more important.

We can address the problem of increasing memory access costs in two ways. Processor-centric optimizations like prefetching, speculation, multilevel caches and out-of-order execution aim to better cope with the existing bandwidth and latency, for example, by exploiting locality in access patterns. They tend to make things worse, however, if the executed application does not express the expected regularities, which is often the case with modern object-oriented and pointer-based programs. Especially during GC, memory access follows pointer chains and can be very irregular. The cache hit rate can actually deteriorate in the presence of GC because every iteration evicts all application-related entries from the cache. Even worse, GC itself has a poor hit rate since most algorithms dereference every pointer only once during each GC phase, thus defeating the advantage of caching data.
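The mark&sweep behavior described above can be sketched in a few lines. The toy below is our own illustration (names, heap layout, and sizes are assumptions, not from the paper), but it exhibits the touch-everything, pointer-chasing access pattern the authors argue defeats caches:

```c
/* Minimal mark&sweep sketch (illustrative only). Marking chases pointer
   chains and dereferences each live object once; sweeping scans the
   whole heap -- both phases touch essentially all of memory. */
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 2
#define HEAP_OBJECTS 8

typedef struct Obj {
    int marked;
    struct Obj *child[MAX_CHILDREN];
} Obj;

static Obj heap[HEAP_OBJECTS];   /* toy heap of fixed-size objects */

/* Mark phase: depth-first traversal from a root along pointer chains. */
static void mark(Obj *o) {
    if (o == NULL || o->marked) return;
    o->marked = 1;               /* each object is visited at most once */
    for (int i = 0; i < MAX_CHILDREN; i++)
        mark(o->child[i]);
}

/* Sweep phase: linear scan of the entire heap, reclaiming unmarked cells. */
static int sweep(void) {
    int reclaimed = 0;
    for (int i = 0; i < HEAP_OBJECTS; i++) {
        if (!heap[i].marked)
            reclaimed++;         /* a real collector would free the cell */
        heap[i].marked = 0;      /* reset mark bits for the next cycle */
    }
    return reclaimed;
}

int gc_demo(void) {
    /* Build a live chain 0 -> 1 -> 2; objects 3..7 are garbage. */
    heap[0].child[0] = &heap[1];
    heap[1].child[0] = &heap[2];
    mark(&heap[0]);              /* heap[0] stands in for the root set */
    return sweep();              /* 5 of the 8 objects are unreachable */
}
```

Note that the inner loops of both phases are tiny compared with the amount of memory they traverse, which is exactly what makes them candidates for offloading.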



In contrast, memory-centric approaches attempt to move at least part of the computation closer to the data it processes, thus actively improving latency and bandwidth. This concept is represented by a new breed of hardware, Active Memory [6, 16, 17, 23] (or the closely related Active Disks [1, 22]), which aims to enhance the memory unit with computational logic so that it can take over some or all of the processor's work. Often, Active Memory is implemented by integrating additional logic into the DRAM chip, thus allowing the main processor to offload small functional units called memlets¹ to be computed directly in memory. For example, the Active Pages model proposed by Oskin et al. [16] suggests assigning a small embedded processor or an array of FPGAs to every 512 Kbytes of DRAM.

We believe that GC is very likely to benefit from Active Memory. Most GC algorithms are composed of highly repetitive and simple components which can easily be offloaded to an in-memory processor. This is especially true for non-incremental algorithms that stop the actual application (mutator) for the duration of the collection. By delegating GC to Active Memory, one could not only reduce the amount of data passed between processor and memory but also parallelize the inner loop of the collector.

We therefore suggest studying the suitability of Active Memory to support automatic memory management (i.e., garbage collection or simply GC). The ultimate goal of this study is to show that garbage collection—which we believe is crucial for runtime performance and thus deserves special effort—will benefit significantly from the existence of an Active Memory approach in the spirit of Active Pages.

We are currently working on a hardware simulator that would allow us to empirically evaluate the impact of ARAM (our Active Memory model) on the performance of garbage collection in a high-performance Java Virtual Machine. With the help of this simulator, we hope to demonstrate that, and how, garbage collection can profit from the presence of Active Memory. Moreover, we plan to design a suite of GC algorithms partitioned for ARAM and to analyze the impact of software decisions related to partitioning, page size, allocation strategy, etc.

To the best of our knowledge, none of the groups working on Active Memory has made an attempt yet to utilize this architecture for garbage collection or even for language support in general. The idea of Active Memory is still relatively new and few proposals have actually been implemented to date. Those that were evaluated—either

1 None of the current proposals actually uses the term memlet. However, Acharya et al. [1] refer to disklets as code that is offloaded and executed on disk. Accordingly, we refer to code that is executed in memory as memlets.

by simulation or with a prototype—were usually tested with relatively regular applications from the areas of databases, image processing and arithmetic computation only. Although these studies generally show promising speedups for the selected type of applications as well as technical feasibility of Active Memory hardware, further tests with more sophisticated programs are required to show general applicability.

2. Active Memory

Since the research community became interested in Active Memory architectures a few years ago, several models have been proposed. Most approaches differ significantly in terms of expected benefits, targeted applications, design complexity, software effort, and technical feasibility. Unfortunately, a thorough discussion of some of these proposals would exceed the scope of this paper. Instead, in this section we sketch an independent model that represents our understanding of Active Memory and serves as a baseline for our project.

Although this model—which we name ARAM—is derived from Active Pages, a model suggested by Oskin et al. [16], it attempts to be more general than that. Despite the ongoing work of several research teams, many aspects of active memory design are not fully understood yet. Part of our planned work is to investigate various hardware modifications and their impact on GC behavior. Therefore, the ARAM model is meant to eventually cover a large design space. Nevertheless, to provide a first intuitive notion of ARAM and its capabilities we describe a rather concrete incarnation in this section.

2.1 ARAM

The ARAM model described in this section is based on two fundamental principles: (1) traditional applications that do not define memlets must execute—with similar performance—on an ARAM machine (no harm for anybody) and (2) applications that were modified to use memlets must experience a noticeable speedup (better for some). Although the model leaves many issues to be resolved in later design phases, it aims to provide enough detail so that any ARAM implementation will follow these two guidelines.

As in most current computer architectures, our model consists of a main processor (sometimes also called host processor), a set of ARAM chips that are physically separated from the main processor, and a memory bus to connect both units. The model includes a (multilevel) memory hierarchy with caches and virtual address spaces to allow fast access for applications with good locality.



At first sight, an ARAM chip resembles conventional DRAM; but unlike in DRAM, the available storage space in ARAM is divided into one or more units (regions)¹ of equal size. Oskin et al. [16] found 512 Kbyte regions to be most practical for the available hardware. A small embedded RISC processor called In-Memory Processor (IMP) is assigned to each unit. To obtain optimal latency and bandwidth, the physical chip layout determines the association between regions and their associated processors. Most likely, the user will be unable to dynamically modify either region size or region-processor association.

We assume that the main processor can access an ARAM cell almost as fast as a cell in a comparable DRAM chip. (Synchronization between main processor and IMP might add a small overhead, though.) As in DRAM, the costs for accesses by the main processor can vary within a certain range depending on the accessed location. But for accesses initiated by memlet code on a certain IMP, the situation is more complex as it most likely depends much more on the accessed memory location relative to the IMP: here we assume that an IMP can access data residing in its own region (called intra-region access in the remainder of this document) with significantly shorter latency and larger bandwidth than the main processor. In fact, one of the motivations for transforming memory intensive functions into memlets is to replace main processor accesses with cheaper intra-region IMP accesses. However, accesses to locations that currently belong to the region of another IMP on the same chip (inter-region, intra-chip accesses) are likely to be more expensive, although probably still competitive with a conventional main processor access. Any access issued by

1 We use the term region rather than (active) page to describe the memory unit directly associated with a single IMP to emphasize the difference between a region/active page and an OS page (i.e. virtual memory page). Active memory is not meant to replace the virtual memory system; usually, regions cover several OS pages.

an IMP and directed to a location on another physical chip (inter-chip accesses) will be penalized most. We expect this type of access to be significantly more expensive than a direct access from the main processor.

Note that implementation details as well as actual and relative access costs remain open at this point. Most of these aspects depend on technical conditions and cannot yet be determined anyway. However, any actual ARAM implementation should be guided by two principles: (1) applications that do not use memlets should not suffer from the new architecture and (2) those that do, should experience noticeable speedups due to off-loading of memlets. Therefore, no matter how the final cost model will look, the main processor must be able to access ARAM almost as efficiently as DRAM and IMPs must access at least data in their own domain significantly more efficiently.

Finally, a modified operating system (OS) must provide the traditional OS functionality in combination with ARAM support. For example, it will be the responsibility of the OS to set up, invoke and manage the memlets (section 2.2 sketches an example API), to synchronize IMPs and main processor, to deliver messages between the processors, to maintain consistency between cache and ARAM, and to maintain a virtual memory system on top of ARAM. The last point is necessary because unlike traditional memory systems, which use only physical addresses, memlet code contains virtual addresses. Consequently, somebody—either the main processor or the IMPs themselves—must be able to resolve these virtual addresses within the memory system.

2.2 Programming Model

The user interface of ARAM provides the standard virtual memory interface extended by a set of functions to allocate regions, define memlets, bind them to a region or a group of regions and activate them. It is the responsibility of the user to partition an application, i.e. write memlets and invoke them from the code. We use a functional model for this proposal since it seems to be the easiest to integrate into an existing system.

Note that the purpose of this programming model is to help understand the requirements of memlets and application code and to design GC algorithms independent from actual decisions about the underlying hardware. It is meant as an abstraction that hides away hardware details such as region-IMP association and must be general enough to express all GC needs. However, by no means does it determine an actual ARAM implementation.

The user defines memlets as special functions together with the actual application. A memlet operates on a domain given as function parameters during invocation.

[Figure 1. ARAM Architecture: the CPU and its caches connect to a set of ARAM chips; each chip is divided into regions, with one IMP per region.]



For now, domains always correspond to one ARAM region; rather than providing a domain parameter with every memlet invocation, the user can bind it to a certain IMP up front. A memlet is invoked by a special instruction such as a store to a memory-mapped device. It can receive as many arguments as it needs. For example, the stored address could point to a memlet header with function pointer, arguments, and IMP identifier. Whenever an IMP detects a write to the magic location, it retrieves this array, determines function and arguments and invokes the memlet. On termination, the memlet sends a signal back to the main processor.

When executing memlet code, the IMP hardware resolves addresses and communicates with other IMPs on the same chip using some protocol. The IMP also provides instructions to indicate the physical location of an address (to determine the access costs). Any memlet can access the entire virtual address space, although accesses to remote locations might be disproportionately expensive. While intra-chip accesses may be resolved in hardware, off-chip accesses might involve software protocols and use the main CPU to relay data to another ARAM chip. While slower than hardware, a software solution would considerably reduce the complexity of ARAM-based systems by eliminating the need for inter-ARAM bus logic.
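The invocation convention sketched above (a store to a magic location, pointing at a header with function pointer, arguments, and IMP identifier) can be simulated in plain software. All names below are hypothetical stand-ins for the hardware mechanisms, not an actual ARAM API:

```c
/* Software simulation of the memlet invocation handshake. In real ARAM
   hardware, the IMP would trap the store to the magic location; here the
   "IMP" simply polls a mailbox variable. */
#include <assert.h>
#include <stddef.h>

typedef long (*memlet_fn)(void *arg);

typedef struct {
    memlet_fn fn;      /* function pointer the IMP will call */
    void     *arg;     /* memlet arguments */
    int       imp_id;  /* which in-memory processor should run it */
} MemletHeader;

/* The "magic" memory-mapped location the main processor stores to. */
static MemletHeader *volatile memlet_mailbox;

/* Main-processor side: invoke a memlet by storing its header's address. */
static void invoke_memlet(MemletHeader *h) {
    memlet_mailbox = h;
}

/* IMP side: detect the write, unpack the header, run the memlet,
   then "signal" completion by clearing the mailbox. */
static long imp_poll(void) {
    MemletHeader *h = memlet_mailbox;
    if (h == NULL) return -1;
    long result = h->fn(h->arg);   /* executed close to the data */
    memlet_mailbox = NULL;
    return result;
}

/* Example memlet: sum a region-resident array (a memory-intensive loop). */
static long sum_region(void *arg) {
    long *a = arg, s = 0;
    for (int i = 0; i < 4; i++) s += a[i];
    return s;
}

long memlet_demo(void) {
    static long region_data[4] = {1, 2, 3, 4};
    MemletHeader h = { sum_region, region_data, 0 };
    invoke_memlet(&h);
    return imp_poll();
}
```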

3. Why GC is Likely to Profit From Active Memory

All garbage collectors perform the same basic task: they determine the set of reachable (i.e., live) objects and reclaim the storage used by all unreachable (dead) objects. Most GC algorithms (with the exception of reference counting) do this by periodically analyzing a snapshot of the heap to detect and reclaim objects that are no longer reachable. Consequently, the collector needs to access all (live) objects, but on each object performs very little computation before it starts visiting the children.¹ This makes GC inherently memory-intensive. In addition, accessing objects by following a pointer chain leads to a very irregular access pattern where each object is visited only once in each phase. Therefore, caches often perform poorly for GC.

Preliminary results from a small study of the effect of garbage collection on cache performance of a Java VM indicate that garbage collection related activity has a significantly higher L1 miss rate than the actual application code (8-16% for GC vs. 6-9% for the application). They also show that the JDK1.1.5 spends up to 30% of its

1 In this respect GC resembles a pointer chase problem.

time in garbage collection, which makes good GC performance all the more important. Write misses, although generally not as critical as read misses, rise to about 28% for one application (javac).²

To benefit from Active Memory, an application must be partitionable so that enough computational work involving memory accesses can be offloaded and parallelized. We are convinced that GC algorithms generally fulfill this requirement since most memory activity occurs in a tight inner loop. In terms of computational complexity, this loop is simple enough to be offloaded to an IMP with limited power. Although several algorithms require scratch space, one can usually define an upper bound. The inner loop is also likely to benefit from parallelization; for example, Endo et al. [7] studied a parallel mark-sweep collector and reported a significant speedup for parallel marking with work stealing in a shared heap.

Parallelization can become a problem for ARAM if the application contains a high number of inter-region references. In the worst case, every single step could require inter-region communication (e.g., if a chain of pointers is spread over several regions). However, we believe that this risk can be reduced by (1) dividing the heap over regions in accordance with the access order imposed by the collection scheme (e.g., generational GC, Train Algorithm) or (2) by instrumenting a copying collector to rearrange objects in order to reduce region-crossing references.
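The region-crossing test that both mitigations try to minimize reduces to simple address arithmetic when regions have a fixed size. The helper below is our own hypothetical sketch, assuming the 512 Kbyte regions of the Active Pages model:

```c
/* Hypothetical helper for classifying references: with fixed-size
   regions, the region id of an address is just its offset from the heap
   base divided by the region size. */
#include <assert.h>
#include <stdint.h>

#define REGION_BYTES (512 * 1024)   /* region size from Oskin et al. [16] */

static int region_of(uintptr_t heap_base, uintptr_t addr) {
    return (int)((addr - heap_base) / REGION_BYTES);
}

/* A collector step could use this to decide whether following a pointer
   stays intra-region (cheap IMP access) or crosses into another region
   (more expensive inter-region or inter-chip communication). */
static int crosses_region(uintptr_t base, uintptr_t from, uintptr_t to) {
    return region_of(base, from) != region_of(base, to);
}

int region_demo(void) {
    uintptr_t base = 0;
    /* an object at offset 100 Kbytes pointing to one at 600 Kbytes
       crosses from region 0 into region 1 */
    return crosses_region(base, 100 * 1024, 600 * 1024);
}
```

A copying collector could count such crossings while evacuating objects and co-locate referents to reduce them, in the spirit of mitigation (2).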

Finally—at least at this point—partitioning an application to use ARAM is awkward and requires some internal knowledge. However, GC is part of the runtime system, written by a language implementor. Unlike the end user, these system experts can justify spending a great deal of time with low-level optimizations as it will pay off multiple times later during runtime.

To summarize our arguments, we believe that good GC performance is crucial for state-of-the-art OO systems and that memory latency and bandwidth is a significant factor in GC overhead. The overall structure of most GC algorithms is simple, highly repetitive, and memory intensive. Therefore, most algorithms can naturally be divided into memlets executed in ARAM, which would parallelize the collection, improve latency and bandwidth for offloaded memory accesses, and greatly reduce the amount of data

2 In most of these experiments we ran the optimized JDK1.1.5 executing memory intensive programs from the SPECjvm98 benchmark suite on a 147 MHz UltraSPARC-I with 16 Kbytes L1 and 512 Kbytes L2 caches. The UltraSPARC family provides hardware registers to count some basic events during execution at no additional cost. By polling these counters before and after each GC one can observe cache access and miss rates of a live application with virtually no impact on the application's performance. Since polling hardware counters requires modifying source code, we have not yet repeated the same experiments for a more competitive JVM.



transferred to the main processor. This is especially important for GC performance since conventional methods to improve memory performance such as caches do not always suffice for this type of algorithm.

4. References

[1] A. Acharya, M. Uysal, and J. Saltz. Active Disks: Programming model, algorithms and evaluation. In Proceedings of ASPLOS VIII, San Jose, CA, October 1998.

[2] R. Barua, W. Lee, S. Amarasinghe, and A. Agarwal. Maps: A compiler-managed memory system for RAW machines. In Proceedings of ISCA-26, Atlanta, GA, June 1999.

[3] N. Bowman, N. Cardwell, C. Kozyrakis, C. Romer, and H. Wang. Evaluation of existing architectures in IRAM systems. In Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, Denver, CO, June 1997.

[4] D. Burger, J. Goodman, and A. Kagi. Quantifying memory bandwidth limitations in future processors. In Proceedings of ISCA-23, Philadelphia, PA, May 1996.

[5] B. Calder, C. Krintz, S. John, and T. Austin. Cache-conscious data placement. In Proceedings of ASPLOS VIII, San Jose, CA, October 1998.

[6] J. Carter et al. Impulse: Building a smarter memory controller. In Proceedings of HPCA-5, Orlando, FL, January 1999. IEEE Computer Society.

[7] T. Endo, K. Taura, and A. Yonezawa. A scalable mark-sweep garbage collector on large-scale shared-memory machines. In Proceedings of SC97, November 1997.

[8] M. Gonçalves. Cache Performance of Programs with Intensive Heap Allocation and Generational Garbage Collection. Ph.D. thesis, Princeton University, May 1995.

[9] M. Gonçalves and A. Appel. Cache performance of fast-allocating programs. In Record of FPCA'95, June 1995.

[10] R. Jones and R. Lins. Garbage Collection: Algorithms for Automatic Dynamic Memory Management. John Wiley & Sons, 1996.

[11] T. Kamada, S. Matsuoka, and A. Yonezawa. Efficient parallel global garbage collection on massively parallel computers. In Proceedings of SC'94, pages 79-88, 1994.

[12] K. Keeton, D. Patterson, and J. Hellerstein. A case for Intelligent Disks (IDISKs). In SIGMOD Record, 27(3), August 1998.

[13] C. Kozyrakis and D. Patterson. A new direction for computer architecture research. IEEE Computer, 31(11):24-32, November 1998.

[14] C. Kozyrakis et al. Scalable processors in the billion-transistor era: IRAM. IEEE Computer, 30(9):75-78, September 1997.

[15] S. Nettles and J. O'Toole. Real-time replication garbage collection. In Proceedings of PLDI'93, volume 28(6) of ACM SIGPLAN Notices, Albuquerque, NM, June 1993.

[16] M. Oskin, F. Chong, and T. Sherwood. Active Pages: A computation model for intelligent memory. In Proceedings of ISCA'98, Barcelona, Spain, June 1998.

[17] D. Patterson et al. A case for intelligent RAM: IRAM. IEEE Micro, 17(2):34-44, March-April 1997.

[18] D. Patterson et al. Intelligent RAM (IRAM): The industrial setting, applications, and architectures. In Proceedings of ICCD'97, TX, October 1997.

[19] D. Patterson and J. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann, 1994.

[20] M. Reinhold. Cache performance of garbage-collected programs. In Proceedings of PLDI'93, volume 28(6) of ACM SIGPLAN Notices, Albuquerque, NM, June 1993.

[21] K. Taura and A. Yonezawa. An efficient garbage collection strategy for parallel programming languages on large-scale distributed-memory machines. In Proceedings of PPoPP-6, ACM SIGPLAN Notices, pp. 264-275, Las Vegas, NV, June 1997.

[22] M. Uysal, A. Acharya, and J. Saltz. An evaluation of architectural alternatives for rapidly growing datasets: Active disks, clusters, and SMPs. Technical Report TRCS98-27, Department of Computer Science, University of California, Santa Barbara, October 1998.

[23] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring it all to software: Raw Machines. IEEE Computer, 30(9):86-93, September 1997.

[24] W. Wulf and S. McKee. Hitting the memory wall: Implications of the obvious. Computer Architecture News, 23(1), March 1995.

[25] B. Zorn. The effect of garbage collection on cache performance. Technical Report CU-CS-528-91, Department of Computer Science, University of Colorado, Boulder, May 1991.


Tolerating Latency by Prefetching Java Objects

Brendon Cahoon and Kathryn S. McKinley
Department of Computer Science, University of Massachusetts, Amherst, MA 01003

{cahoon,mckinley}@cs.umass.edu

Abstract

In recent years, processor speed has become increasingly faster than memory speed. One technique for improving memory performance is data prefetching, which is successful in array-based codes but only now are researchers applying it to pointer-based codes. In this paper, we evaluate a data prefetching technique, called greedy prefetching, for tolerating latency in Java programs. In greedy prefetching, when a loop or recursive method updates an object o, we prefetch objects to which o refers. We describe inter- and intra-procedural algorithms for computing objects to prefetch and we present preliminary results showing its effectiveness on a few, small Java programs. Prefetching improves performance, but there is significant room for further improvement.

1. Introduction

Modern processor speeds continue to significantly outpace advances in memory speed. Even though modern processors use deep memory hierarchies, the disparity between processor and memory speeds results in an underutilization of resources due to memory bottlenecks.

Software-controlled data prefetching is a technique for improving memory performance by tolerating latency in the memory hierarchy. Compilers statically analyze programs and insert prefetch instructions to load data into the cache prior to use. Previous research shows the benefits of software prefetching techniques in array-based scientific programs [4, 14, 2, 13]. Prefetching in array-based codes is simpler than in pointer-based codes. Given an array, the size of each element, and a regular access pattern, the compiler can compute the address of any element in the array and schedule prefetches to elements in a loop that will be accessed in future iterations. Array-based codes are also amenable to analyses which allow compilers to restructure loops, using techniques such as loop tiling, to improve spatial and temporal locality. Recent work uses prefetching for programs with dynamically allocated data structures [11, 12, 16]. However, this work only considers C programs.

Compilers cannot use the same approach in pointer-based codes because separate dynamically allocated objects are disjoint and the access patterns are less regular and predictable. Given an object o, we know only the addresses of the objects that o references, and we cannot prefetch arbitrary objects without following pointer chains.

* This work is supported by NSF grant EIA-9726401, an NSF Infrastructure grant CDA-9502639, DARPA grant 5-21425, and Compaq. Kathryn S. McKinley is supported by an NSF CAREER Award, CCR-9624209. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.

In this paper, we evaluate one simple prefetching technique, called greedy prefetching, on Java programs. Luk and Mowry introduced and evaluated the greedy prefetching algorithm for recursive data structures in C programs [12]. We investigate the applicability and effectiveness of greedy prefetching for Java programs. Our specific contributions include a new intra-procedural data flow analysis for finding objects to prefetch, the use of an inter-procedural analysis to improve our analysis in the presence of recursion, and a preliminary evaluation on object-oriented programs.

Object-oriented programs pose analysis challenges because they mostly allocate data dynamically, contain frequent method invocations, and often implement loops with recursion. We use Vortex, a compiler containing advanced analyses specifically tailored for object-oriented languages [9]. Our preliminary results indicate that greedy prefetching is effective on a few small object-oriented programs. Also, class analysis and method inlining enable effective greedy prefetching. We plan to implement more sophisticated prefetching techniques in the future.

2. Related Work

In this section, we give a brief summary of related work for improving memory performance of pointer-based codes. Previous work investigating prefetching on pointer-based codes only uses C programs.

Lipasti et al. present one of the initial evaluations of prefetching pointer-based codes [11]. The technique, called SPAID, generates prefetch instructions for function arguments prior to calls. Results show cache miss rate improvements on several programs.

Luk and Mowry introduce and evaluate the greedy prefetching algorithm using C versions of the Olden benchmarks [12]. The main contribution of our work is to use data flow algorithms rather than a loop-based approach, and we extend the analysis for object-oriented features. Our preliminary results show similar performance results to Luk and Mowry on Java programs. Luk and Mowry also introduce history-based prefetching and data linearization, and present limited results on hand-optimized examples.


class SList {
    int data;
    SList next;
    int sum() {
        prefetch(next);
        if (next != null)
            return data + next.sum();
        return data;
    }
}

class DList {
    int data;
    DList next, prev;
    int sum() {
        prefetch(next);
        // prefetch(prev);
        if (next != null)
            return data + next.sum();
        return data;
    }
}

class Tree {
    int data;
    Tree left, right;
    int sum() {
        prefetch(left);
        prefetch(right);
        int s = data;
        if (left != null) s += left.sum();
        if (right != null) s += right.sum();
        return s;
    }
}

Figure 1. Prefetch examples for singly linked list, doubly linked list, and binary tree

Roth and Sohi evaluate a hardware/software prefetching approach for tolerating memory latency in pointer-based codes [16]. The technique uses jump-pointer prefetching, which is an extension of Luk and Mowry's history-pointer technique. Roth and Sohi present results using the C version of the Olden benchmark suite. We intend to extend these techniques for Java in the future.

Several researchers improve the memory performance of pointer-based programs by rearranging data at run time [3, 6, 7, 8, 18]. Rubin, Bernstein, and Rodeh combine data reorganization and a different type of greedy prefetching to improve performance on a small C kernel [17].

3. Greedy Prefetching

We extend Luk and Mowry's algorithm for prefetching to object-oriented languages such as Java. During the traversal of a linked data structure, greedy prefetching attempts to prefetch objects that will be accessed in the future. Its major limitation is that it can only schedule prefetch instructions for objects directly connected to the current object.

Figure 1 shows simple class definitions for a singly linked list, a doubly linked list, and a binary tree (we use the examples to illustrate greedy prefetching, not as good examples of object-oriented programming). Each class contains a sum method which adds the elements in the data structure. In the example, we insert a prefetch instruction for the next object in the linked list. We cannot prefetch objects further ahead because we do not know the addresses of future objects. Prefetching two objects ahead, prefetch(this.next.next), requires the address of this.next, which is unknown until the program dereferences this.

Achieving the full benefits of prefetching requires the computation time between the prefetch and the use of the object to be greater than or equal to the memory access time to completely hide the latency. However, even if the computation time is less than the memory access time, the prefetch can partially hide the latency. In the linked list example, we only partially hide the read latency of next if the cost of the addition and function call is less than the cost of a memory access. Similarly, we typically only partially hide the latency of the prefetch of left in the binary tree example. However, since we also prefetch right, we may completely hide its memory cost.
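This timing condition can be stated numerically: with memory latency L and computation time W between the prefetch and the use (both in cycles), the prefetch hides min(W, L) cycles and the remaining stall at the use is max(0, L - W). The sketch below is our own illustration, not from the paper; the 60-cycle figure is simply the memory hit time used in the simulator later in the paper.

```java
public class PrefetchTiming {
    // Stall cycles remaining at the use of a prefetched object,
    // given the work overlapped with the prefetch and the memory latency.
    public static int residualStall(int workCycles, int memLatency) {
        return Math.max(0, memLatency - workCycles);
    }

    public static void main(String[] args) {
        // With a 60-cycle memory latency:
        System.out.println(residualStall(60, 60)); // fully hidden: 0
        System.out.println(residualStall(15, 60)); // partially hidden: 45
    }
}
```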

The greedy prefetch algorithm consists of two parts: a phase which finds objects to prefetch, followed by a phase which schedules the prefetch instructions. The algorithm is greedy because we do not perform any analysis to determine if an object is already in the cache, and we try to prefetch as much as possible. Our algorithm uses both intra-procedural and inter-procedural data flow analysis to find object traversals in loops and recursive calls.

3.1. Detecting Recurrent Object Updates

A recurrent object update is a statement of the form o = o.next occurring in a loop or recursive method call. In Figure 1, next.sum() and left.sum() are examples of recurrent object updates occurring in recursive calls.

Detecting recurrent object updates is similar to finding loop induction variables. Previous algorithms for finding induction variables are either loop based [1] or use static single assignment (SSA) form [19]. We present a traditional data flow analysis approach to finding recurrent object updates.

The algorithm for detecting recurrent object updates requires both intra- and inter-procedural analysis. The intra-procedural analysis phase detects recurrent object updates in loops. The inter-procedural analysis detects recurrent object updates in recursive method calls. Luk and Mowry do not perform inter-procedural analysis and only identify self-recursive calls.

Our intra-procedural data flow analysis is a forward, iterative traversal that uses a three-stage lattice to capture recurrent object updates at each point in the program. We define a function to map each object to a lattice value at each point in the program.

Not recurrent. The top element indicates an object is not recurrent.

Possibly recurrent. The first time we process an object, it is potentially recurrent.

Recurrent. The bottom element indicates an object is recurrent.

At each store expression in the program, we define a transfer function which assigns objects to lattice values:

- If the store expression is a field assignment of the form o = p.next, where next is an object reference and p is not recurrent, then we mark o possibly recurrent. If p is possibly recurrent, then o is recurrent.


Table 1. Olden Benchmark Suite

Name       Main Data Structure(s)    LOC  Methods  Bytecode Len.  Inputs          Inst. Issued  Total Memory
mst        array of lists            206  39       1452           512 nodes       406M          10.4MB
perimeter  quadtree                  219  44       1717           4K x 4K image   239M          4.3MB
treeadd    binary tree               66   11       474            1M nodes        264M          24.3MB
tsp        binary tree, linked list  273  15       1711           60,000 cities   1106M         7.2MB
voronoi    binary tree               612  89       4138           20,000 points   1043M         19.5MB

- If the store expression is an object assignment, o = p, then assign the value of p to o.
- For all other store expressions, o = expr, we assign the value not recurrent to o.

At termination, objects at each program point belong to one of the three lattice values. The data flow merge function ensures the ordering not recurrent, then possibly recurrent, then recurrent: merging two values yields the lower (more recurrent) of the two.

We use the possibly recurrent value to detect looping structures. The first time we analyze a loop, an object, o, occurring on the LHS of a field reference becomes possibly recurrent (e.g., o = b.next). On the second iteration of the analysis, the object becomes recurrent if the base object of the field reference (i.e., b) is also possibly recurrent. If b is not recurrent, then o's value remains the same. The algorithm incorrectly marks objects in loop invariant expressions as recurrent (e.g., t in o=p.next; t=o.next). Moving loop invariant expressions out of loops eliminates this problem.
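The lattice and its transfer functions can be sketched in code. The following is our own illustrative model, not part of Vortex; the names RecurrenceLattice, transferFieldLoad, transferCopy, and merge are hypothetical:

```java
// Illustrative model of the three-value lattice for detecting recurrent
// object updates. Class and method names are our own, not the paper's.
public class RecurrenceLattice {
    public enum Value { NOT_RECURRENT, POSSIBLY_RECURRENT, RECURRENT }

    // Transfer function for a field assignment "o = p.next":
    // if p is not recurrent, o becomes possibly recurrent; if p is
    // possibly recurrent (second pass over the loop), o becomes recurrent.
    public static Value transferFieldLoad(Value p) {
        switch (p) {
            case NOT_RECURRENT:      return Value.POSSIBLY_RECURRENT;
            default:                 return Value.RECURRENT;
        }
    }

    // Transfer function for an object assignment "o = p": o takes p's value.
    public static Value transferCopy(Value p) { return p; }

    // Merge at control-flow joins: take the "more recurrent" of the two.
    public static Value merge(Value a, Value b) {
        return a.ordinal() >= b.ordinal() ? a : b;
    }

    public static void main(String[] args) {
        // "o = p.next" seen once: possibly recurrent; seen again: recurrent.
        Value first = transferFieldLoad(Value.NOT_RECURRENT);
        Value second = transferFieldLoad(first);
        System.out.println(first + " " + second);
    }
}
```

Iterating this transfer function over a loop body until the values stop changing is exactly the fixed-point computation the analysis performs.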

We also track the fields used in the recurrent object updates. In Figure 1, for example, we record that next is the only field involved in the recurrent object update for the traversal of the doubly linked list.

3.2. Scheduling Prefetch Instructions

We greedily schedule prefetch instructions for objects our algorithm finds recurrent during the analysis phase. We insert prefetch instructions at the earliest point when we know the base object is not null. The intra-procedural class analysis in Vortex indicates when objects are not null. In Figure 1, the this pointer is the base object, and we know it is not null upon entering sum.

The scheduling phase uses the field information the analysis phase computes to only schedule prefetches for fields involved in recurrent object updates. For the doubly linked list in Figure 1, we only generate a prefetch of the next field and not the prev field. Luk and Mowry's algorithm generates both prefetches since they do not track the fields used in the recurrent updates.

During scheduling we perform a simple alias analysis to ensure the scheduler only generates a single prefetch instruction for groups of aliased recurrent objects. Temporary objects cause aliases when used in sequences such as p=o.next; o=p;. In a loop, we mark both o and p as recurrent, but we only generate a prefetch for one of the objects.

3.3. Inter-procedural Algorithm

We use an inter-procedural algorithm to find recurrent object updates occurring in recursive method calls. Using an inter-procedural data flow analysis is an extension of Luk and Mowry's original algorithm, which is only able to detect self-recursive function calls.

The inter-procedural algorithm is a top-down, context-sensitive traversal of the call graph. A context-sensitive algorithm enables the analysis phase to determine the fields used in recurrent object updates. A context-insensitive algorithm cannot track the recurrent fields because distinct method calls are treated similarly. For example, in Figure 1, a context-sensitive analysis determines that this.left and this.right are both recurrent fields in the recursive method, sum. A context-insensitive analysis only analyzes sum once and will not determine that both left and right are recurrent fields.

The inter-procedural analysis uses our intra-procedural analysis to compute the recurrent objects within a procedure. At each call site, we map the recurrent lattice values from each actual to each formal. Then, we analyze the method using the intra-procedural analysis. Recursive calls cause the analysis to iterate until the recurrent status of the formals reaches a fixed point.

3.4. Summary of Extensions for Java

Object-oriented languages contain features which are potentially problematic for the greedy prefetch algorithm. We believe inter-procedural analysis improves the effectiveness of greedy prefetching because object-oriented programs often use recursion to express looping constructs. In Figure 1, the sum method uses recursion to sum the objects in the linked list. In C, programmers typically use a while statement for the same function.

To improve the effectiveness of greedy prefetching in Java, we run our algorithm after performing class hierarchy analysis and inlining. Class hierarchy analysis enables virtual method invocations to be transformed into direct function calls, which improves our inter-procedural analysis and also improves inlining [10]. We rely upon inlining to remove unnecessary method calls that encapsulate references to potential recurrent fields. The following is a typical Java code sequence for traversing a linked list.

Enumeration e = list.elements();
while (e.hasMoreElements()) {
    List l = (List)e.nextElement();
    // computation involving l
}


Figure 2. Performance of Greedy Prefetching (execution time normalized to no prefetching, broken into busy, instruction stall, load stall, and store stall cycles, for each benchmark with and without prefetching)

Figure 3. Prefetch Effectiveness (percentage of useful, late, early, and unnecessary prefetches at the L1 and L2 caches, per benchmark)

If list is a linked list with a next field, then the expression e.nextElement hides the access of l.next, and the expression e.hasMoreElements hides the test for null. Inlining eliminates the calls to hasMoreElements and nextElement.

In the absence of inlining, we can extend the inter-procedural data flow analysis to track methods returning fields and use the information to check for recurrent objects. We plan on extending our analysis in the future, but inlining appears to provide most of this benefit.

4. Experimental Results

We implement the greedy prefetching algorithm in the Vortex optimizing compiler [9]. We use Vortex to compile Java programs, perform object-oriented and traditional optimizations, and generate SPARC assembly code.

We present preliminary results using several programs from the Olden benchmark suite [5]. Researchers have used the Olden suite to evaluate optimizations for pointer-based programs [7, 12, 16]. Table 1 lists the Olden benchmarks we use in our experiments along with characteristics of each program. We translated the programs, originally written in C, to Java using an object-oriented style. We compile the programs using JDK 1.1.6. The lines of code (LOC) number excludes comments and blank lines. The bytecode length is the size of the code segments in bytes, not the number of instructions in the programs. We compute the total memory using totalMemory() and freeMemory() from the Runtime class. We disable garbage collection during all our experiments.

We use RSIM to perform a detailed cycle-by-cycle simulation of our programs [15]. RSIM models a modern out-of-order processor based upon the MIPS R10000. The default processor runs at 300 MHz, issues up to 4 instructions per cycle, and has a 64-entry instruction window. The functional units include 2 ALU, 2 FP, 1 branch, and 2 address units. We use the default values for most of the parameters except for the cache hierarchy. The following table lists the memory hierarchy RSIM parameters we use in our experiments. The default cache sizes are small for modern processors, but they match our data sizes and decrease simulation times.

L1 Cache                      16KB, direct-mapped, WT, split
L2 Cache                      64KB, 4-way, WB, unified
Request Ports                 2
Line Size                     32B
L1/L2/Mem hit time            1/12/60 cycles
Cache Miss Handlers (MSHR)    8, 8 (L1, L2)

Figure 2 shows preliminary performance results of greedy prefetching. We normalize the results to the execution time, in cycles, of the programs when we do not perform prefetching. We use the RSIM convention to account for busy and stall cycles. We mark a cycle busy if the processor retires 4 instructions (the maximum). Otherwise, the first instruction that cannot be retired by the cycle accounts for a stall. Figure 2 shows improvements of 3% (mst), 6% (perimeter), 12% (treeadd), and about 1% (tsp, voronoi). Improvements are due to fewer load stalls in the programs. Even after prefetching, the percentage of load stalls remains quite high.

Figure 3 provides insight into the effectiveness of prefetching by dividing the prefetches into various categories. A useful prefetch arrives on time and is accessed. The latency of a late prefetch is only partially hidden because a cache miss occurs while the memory system retrieves the datum. The cache replaces an early prefetch before the use of the datum. An unnecessary prefetch hits in the cache or is coalesced into an MSHR. Figure 3 categorizes the prefetches for both L1 and L2 prefetches. We scale the graph for the L2 prefetches to the percentage of requests to the L2 cache. Useful, late, and early prefetches require accesses to the next level in the memory hierarchy. For each program, prefetches to the L1 cache contain a small number of useful and late prefetches. However, many of the prefetches are unnecessary because they hit in the L1 cache. Most prefetches are early in the L2 cache because the cache is small, unified, and write-back, so much of the data is replaced.


Table 2. Cache Statistics with and without Prefetching

                 Reads  L1 Hit  L1 Miss (%)           L2 Hit  L2 Miss (%)           Prefetches
Program          (M)    (%)     conf.  cap.   coal.   (%)     conf.  cap.   coal.   static  dyn. L1 (M)  dyn. L2 (M)
mst              13.5   78.3    0.6    13.8   7.3     35.8    7.8    57.3   0
mst w/pf         14.2   81.5    0.6    8.9    9.0     43.8    10.5   45.7   0       8       2.235        .743
perimeter        30.4   95.9    0.9    0.9    2.3     53.4    4.5    42.0   0
perimeter w/pf   30.4   96.8    0.6    0.7    1.9     53.0    6.6    40.3   0       8       .32          .071
treeadd          11.5   81.6    0.1    7.1    11.2    3.1     0.9    96.0   0
treeadd w/pf     11.3   85.2    0.1    1.4    13.3    14.0    5.9    80.1   0       2       2.26         .659
tsp              106.3  97.4    0.6    1.0    1.0     50.1    14.4   35.5   0
tsp w/pf         126.4  96.5    1.0    0.7    1.8     77.2    3.2    19.6   0       31      25.8         1.081
voronoi          113.8  93.6    1.5    2.2    2.7     59.0    12.6   26.4   2.0
voronoi w/pf     113.3  93.8    1.4    2.1    2.7     56.9    14.4   26.2   2.5     18      .816         .150

Table 2 lists several important cache statistics for each program with and without prefetching. We divide the miss statistics into conflict, capacity, and coalesced misses (cold misses are insignificant). A coalesced reference misses in the cache but hits in an MSHR. The table also displays statistics on the number of static and dynamic prefetches. The static prefetch numbers do not include 23 prefetch instructions the compiler inserts into the Java library code. In general, prefetching improves the hit rates and reduces capacity misses.

5. Conclusion

Traditional compiler algorithms for improving cache performance are difficult to apply to languages that mostly allocate memory dynamically. Compiler-inserted data prefetching is an effective technique for tolerating latency, even in pointer-based programs. In this paper, we evaluate the usefulness of prefetching in Java programs. We present intra- and inter-procedural algorithms for a simple prefetching technique called greedy prefetching. Our preliminary results show improvements due to prefetching. However, our results indicate that there is room to improve prefetching effectiveness in Java programs because many prefetch instructions hit in the cache. We plan to continue investigating better prefetching algorithms for Java.

References

[1] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading, MA, 1986.

[2] D. Bernstein, D. Cohen, A. Freund, and D. E. Maydan. Compiler techniques for data prefetching on the PowerPC. In Proceedings of the 1995 International Conference on Parallel Architectures and Compilation Techniques, pages 19-26, Limassol, Cyprus, June 1995.

[3] B. Calder, C. Krintz, S. John, and T. Austin. Cache-conscious data placement. In ASPLOS-VIII: Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, Oct. 1998.

[4] D. Callahan, K. Kennedy, and A. Porterfield. Software prefetching. In ASPLOS-IV: Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 40-52, Santa Clara, CA, Apr. 1991.

[5] M. C. Carlisle and A. Rogers. Software caching and computation migration in Olden. In Proceedings of the 1995 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 29-38, Santa Barbara, CA, July 1995.

[6] T. M. Chilimbi, B. Davidson, and J. R. Larus. Cache-conscious structure definition. In Proceedings of the 1999 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 13-24, Atlanta, GA, May 1999.

[7] T. M. Chilimbi, M. D. Hill, and J. R. Larus. Cache-conscious structure layout. In Proceedings of the 1999 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 1-12, Atlanta, GA, May 1999.

[8] T. M. Chilimbi and J. R. Larus. Using generational garbage collection to implement cache-conscious data placement. In The 1998 International Symposium on Memory Management, Vancouver, BC, Oct. 1998.

[9] J. Dean, G. DeFouw, D. Grove, V. Litvinov, and C. Chambers. Vortex: An optimizing compiler for object-oriented languages. In Proceedings of the 1996 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages & Applications (OOPSLA '96), pages 83-100, San Jose, CA, Oct. 1996.

[10] J. Dean, D. Grove, and C. Chambers. Optimization of object-oriented programs using static class hierarchy analysis. In ECOOP '95 Conference Proceedings, Aarhus, Denmark, Aug. 1995.

[11] M. H. Lipasti, W. J. Schmidt, S. R. Kunkel, and R. R. Roediger. SPAID: Software prefetching in pointer- and call-intensive environments. In Proceedings of the 28th Annual IEEE/ACM International Symposium on Microarchitecture, 1995.

[12] C.-K. Luk and T. C. Mowry. Compiler-based prefetching for recursive data structures. In ASPLOS-VII: Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 222-233, Cambridge, MA, Oct. 1996.

[13] N. McIntosh. Compiler Support for Software Prefetching. PhD thesis, Rice University, May 1998.

[14] T. C. Mowry, M. S. Lam, and A. Gupta. Design and evaluation of a compiler algorithm for prefetching. In ASPLOS-V: Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 63-72, Oct. 1992.

[15] V. S. Pai, P. Ranganathan, and S. V. Adve. RSIM reference manual (version 1.0). Technical Report 9705, Rice University, Dept. of Electrical and Computer Engineering, Aug. 1997.

[16] A. Roth and G. Sohi. Effective jump-pointer prefetching for linked data structures. In Proceedings of the 26th Annual International Symposium on Computer Architecture, Atlanta, GA, May 1999.

[17] S. Rubin, D. Bernstein, and M. Rodeh. Virtual cache line: A new technique to improve cache exploitation for recursive data structures. In Compiler Construction, 8th International Conference, CC '99, pages 259-273. Springer, Mar. 1999.

[18] D. N. Truong, F. Bodin, and A. Seznec. Improving cache behavior of dynamically allocated data structures. In Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques, Paris, France, Oct. 1998.

[19] M. Wolfe. Beyond induction variables. In Proceedings of the 1992 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), San Francisco, CA, June 1992.


An Introduction to DMMX (Dynamic Memory Management eXtension)

J. Morris Chang, Witawas Srisa-an, and Chia-Tien Dan Lo

Department of Computer Science
Illinois Institute of Technology
Chicago, IL 60616-3793, USA

{chang | sriswit | lochiat}@charlie.iit.edu

Abstract

Automatic Dynamic Memory Management (ADMM) allows programmers to be more productive and increases system reliability and functionality. However, ADMM algorithms are known to be slow and non-deterministic, and object-oriented applications tend to be dynamic-memory intensive. Programmers must therefore decide whether the benefits of ADMM outweigh its shortcomings. In many object-oriented real-time and embedded systems, programmers judge the shortcomings too severe for ADMM to be used in their applications, and, while using Java or C++ as the development language, allocate memory statically instead of dynamically. In this paper, we present the design of an application-specific instruction extension called the Dynamic Memory Management eXtension (DMMX) that allows automatic dynamic memory management to be done in hardware. Our high-performance scheme allows both allocation and garbage collection to be done in a predictable fashion. Allocation is done through the modified buddy system, which allows constant-time object creation. The garbage collection algorithm is mark-sweep, where the sweep phase can be accomplished in constant time. This hardware scheme greatly improves the speed and predictability of ADMM. Additionally, our proposed scheme is an add-on approach, which allows easy integration into any CPU, hardware-implemented Java Virtual Machine (JVM), or Processor in Memory (PIM).

index terms: automatic dynamic memory management,real-time garbage collector, mark-sweep garbage collector,instruction extension, object-oriented programming

1. Introduction

By the early 2000s, many industry observers predict that VLSI technology will allow fabricators to pack 1 billion transistors into a single chip running at gigahertz clock speeds. Obviously, the challenge is no longer how to make billion-transistor chips, but rather what kinds of facilities should be incorporated into the design [5]. The current trend in CPU design is to include application-specific instruction sets, such as MMX and 3DNow!, as extensions to the basic functionality. The rationales behind such approaches are straightforward. First, space and cost limitations are no longer issues: high-density chips can be manufactured cheaply with current semiconductor technology. Second, these application-specific instruction sets alleviate performance bottlenecks in the most commonly used applications, such as 3-D graphics rendering and multimedia. These rationales closely follow the corollary of Amdahl's Law: make the common case fast. Amdahl's Law reminds us that the opportunity for improvement is limited by how much time the event consumes. Thus, making the common case fast tends to enhance performance more than optimizing rare cases [7]. Since the biggest merit of hardware is speed, significant speedups can be gained through hardware implementations of common cases.

As the popularity of object-oriented programming and graphical user interfaces increases, applications become more and more dynamic-memory intensive. It is well known among experienced programmers that automatic dynamic memory management functions (i.e., allocation and garbage collection) are slow and non-deterministic. Since object-oriented applications prolifically allocate memory in the heap, it is no coincidence that such applications can run up to 20 times slower than their procedural counterparts. A study has also shown that Java applications can spend 20% of their execution time dealing with dynamic memory management [1]. Unlike a stack or a queue, the heap is not a well-defined data structure, and allocating memory in the heap often requires some form of search routine. In software approaches to heap management, searching is done sequentially (i.e., linked-list search), so as the number of live objects grows, the search time grows linearly as well. Studies have shown that applications written in C++ can invoke up to ten times more dynamic memory management calls than comparable C applications [10]. Dynamic memory management is clearly a common case in object-oriented programming. With Amdahl's corollary in mind, the need for a high-performance dynamic memory manager is obvious.
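Taking the 20% figure above, Amdahl's Law bounds the best-case gain: even if dynamic memory management were made infinitely fast, overall speedup cannot exceed 1 / (1 - 0.20) = 1.25. A one-line check of this arithmetic (our own illustration, not from the paper):

```java
// Amdahl's Law: speedup = 1 / ((1 - f) + f / s), where fraction f of the
// execution time is accelerated by factor s.
public class Amdahl {
    public static double speedup(double f, double s) {
        return 1.0 / ((1.0 - f) + f / s);
    }

    public static void main(String[] args) {
        // f = 0.20 (time in dynamic memory management), s -> infinity:
        double bound = speedup(0.20, Double.POSITIVE_INFINITY);
        System.out.println(bound); // prints 1.25
    }
}
```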

Deterministic turnaround time is a very desirable trait for real-time applications. Presently, software approaches to automatic dynamic memory management often fail to yield predictable turnaround times. The most common software approaches to maintaining allocation status are sequential fit and segregated fit. These two approaches use linked lists to keep track of occupied or free chunks. With a linked list, the turnaround time is related to the length of the list: as the list grows longer, the sequential search time grows as well [9]. Similarly, software approaches to garbage collection also yield unpredictable turnaround times. The two most common approaches to garbage collection are mark-sweep and copying collectors. In both cases, the turnaround time is non-deterministic.

According to Nilsen and Schmidt, one way to achieve hard real-time performance for garbage collection is through hardware support [8]. In this paper, we introduce an application-specific instruction extension called the Dynamic Memory Management eXtension (DMMX), which adds h_malloc, mark, and sweep instructions at the user level. With h_malloc, our high-performance allocation scheme allows allocation to be completed in a few instruction cycles. Unlike software approaches, our scheme is fast and deterministic. To perform garbage collection, the mark instruction is invoked repeatedly until all live objects are marked in a bit-map. Once the marking phase is complete, the sweep instruction is called. Since dedicated hardware performs the sweeping, this phase can also be completed in a few instruction cycles.
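The marking loop that would drive the mark instruction can be modeled in software. The sketch below is our own toy model, not the paper's ISA definition: setting markBit[o] stands in for one mark instruction, and a single sweep would then reclaim everything left unmarked.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy software model of the marking phase: live objects reachable from
// the roots get their bit set in the X bit-map. The array-of-arrays `refs`
// is a stand-in for real object reference fields.
public class MarkModel {
    public static void markAll(boolean[] markBit, int[][] refs, int[] roots) {
        Deque<Integer> work = new ArrayDeque<>();
        for (int r : roots) work.push(r);
        while (!work.isEmpty()) {
            int o = work.pop();
            if (markBit[o]) continue;
            markBit[o] = true;              // models one `mark` instruction
            for (int child : refs[o]) work.push(child);
        }
        // a single `sweep` instruction would now reclaim unmarked blocks
    }

    public static void main(String[] args) {
        // objects: 0 -> 1 -> 2; object 3 is unreachable garbage
        int[][] refs = { {1}, {2}, {}, {} };
        boolean[] x = new boolean[4];
        markAll(x, refs, new int[] {0});
        System.out.println(x[3]); // prints false: object 3 is collectable
    }
}
```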

The remainder of this paper is organized as follows. Section 2 provides a top-level view of our instruction extension. Section 3 describes the internal structure of the Dynamic Memory Management Unit (DMMU). Section 4 addresses architectural support issues for the DMMU. Section 5 concludes this paper.

2. Overview of the DMMX

In our proposed Dynamic Memory Management eXtension (DMMX), there are three user-level instructions: h_malloc, mark, and sweep. These three instructions are the communication channels between the CPU and the Dynamic Memory Management Unit (DMMU). The DMMU can be packaged either inside or outside the CPU. The unit can also be included inside a hardware-implemented Java Virtual Machine (e.g., picoJava II from Sun Microsystems). The main purpose of the DMMU is to take responsibility for managing heap space for all processes in the hardware domain. The proposed DMMU uses the modified buddy system combined with a bit-map approach to perform constant-time allocation [4]. Usually, each process has a heap associated with it. In the proposed scheme, each heap requires three bit-maps: one for allocation status (the A bit-map), one for object size (the S bit-map), and one for marking during garbage collection (the X bit-map). These three bit-maps must be kept together at all times, since each garbage collection cycle searches and modifies all three. Figure 1 shows the top-level integration of the DMMU into a computer system.

Figure 1. The top-level description of a DMMU

Figure 1 illustrates the basic functionality of the DMMU. First, the DMMU provides services to the CPU by maintaining the memory allocation status inside the heap region of the running process. Thus, the DMMU must be able to access the A, S, and X bit-maps of the running process. Like a TLB, the DMMU is shared among all processes. The parameters that the CPU can pass to the DMMU are the h_malloc, mark, or sweep signal, the object_size (for an allocation request), and the object_pointer. The operations of the DMMU are very similar to function calls (e.g., malloc()) in the C language: the object_pointer is either returned from the DMMU on allocation or passed to the DMMU during the garbage collection process. A gc_ack is also returned at the completion of a garbage collection cycle. If an allocation fails, the DMMU makes a request to the operating system for additional memory using the system call sbrk() or brk().

Since the algorithms used in the DMMU are implemented in pure combinational logic, the time to perform a memory request or a memory sweep is constant. In contrast, the time for a software approach to perform an allocation or a sweeping cycle is non-deterministic. As stated earlier, Java applications spend about 20% of their execution time on automatic dynamic memory management. This execution time can be greatly reduced with the use of the DMMU.

3. Internal architecture of the DMMU

Inside the DMMU, three bit-vectors are used to keep all of the object-relevant information, such as the allocation status of the heap, the size information of occupied and free blocks, and the live object pointers. The allocation status is kept in the Allocation bit-vector (A bit-vector). When h_malloc is called, the size information is received by the




Complete Binary Tree (CBT). This dedicated hardware unit is responsible for locating, using the modified buddy system, the first free memory chunk that can satisfy the request. Besides locating the chunk, the CBT also sends out the address of the newly allocated memory and updates the status of that memory from free to allocated. It is worth noting that while the free-block lookup is done using a size index of 2^n, the system allocates only the requested size. For example, if 5 blocks of memory are requested, the system will find the first free chunk of size 8 (2^3). After a chunk is located, the system allocates only 5 blocks and relinquishes the remaining 3. Each time an object is created or reclaimed, the Size bit-vector (S bit-vector) is instantly updated by a dedicated hardware unit, the S-Unit. The auXiliary bit-vector (X bit-vector) is used only during the marking phase of the garbage collection cycle. Once the marking phase is completed, the sweep instruction is invoked; a dedicated hardware unit, the bit-sweeper, performs this task in constant time. The internal architecture of the DMMU is given in Figure 2.

Figure 2. Internal architecture of the DMMU.

Figure 2 depicts the sequence needed to complete an allocation or a garbage collection cycle. For example, if an allocation of size 5 is requested, the A1 labels indicate the first step needed to complete the allocation. According to Figure 2, the h_malloc input signal goes to logic '1' and the requested size is given to the CBT. Since the CBT is combinational hardware, the free-memory-chunk lookup, the return address pointer, and the new allocation status signals are produced at the same time (A2). Next, the new allocation status is latched into the A bit-vector (A3). Since the S-Unit is also combinational hardware, as soon as the A bit-vector is latched, the new size information is available to the S bit-vector. Lastly, the new size information is latched into the S bit-vector (A4) and the allocation is complete. The sequence of a garbage collection cycle can be traced in a similar fashion.
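The modified buddy lookup performed by the CBT can be illustrated with a short sketch. This is our own behavioral model, not the authors' logic: the CBT does this lookup combinationally in one step, while the model below uses loops, and the function name is ours.

```python
# Sketch of the CBT's modified buddy allocation on an A bit-vector
# (1 = block allocated): find the first free 2^k-aligned chunk large
# enough for the request, but mark only the requested blocks, leaving
# the surplus blocks of the chunk free (no splitting lists needed).
def cbt_alloc(a_bits, request):
    """Return the start index of the allocated run, or None on failure."""
    size = 1
    while size < request:                        # round request up to 2^k
        size *= 2
    for start in range(0, len(a_bits), size):    # scan 2^k-aligned chunks
        if not any(a_bits[start:start + size]):  # chunk entirely free?
            for i in range(start, start + request):
                a_bits[i] = 1                    # allocate only what was asked
            return start
    return None                                  # would trigger sbrk()/brk()
```

With a 16-block heap, a request for 5 blocks claims blocks 0-4 of the first free size-8 chunk and leaves blocks 5-7 free, matching the 5-out-of-8 example in the text.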

4. Architectural support for DMMU

This section summarizes the process of memory allocation and deallocation in the DMMU. Since the bit-maps of a given process may be too large to be handled in the hardware domain, a bit-vector, a small segment of a bit-map, is used in the proposed system. This idea is very similar to the use of a TLB (Translation Look-aside Buffer) in virtual memory. Due to the close tie between the S, A, and X bit-maps, the term bit-vector in this section represents one A bit-vector (of the A bit-map), one S bit-vector (of the S bit-map), and one X bit-vector (of the X bit-map). Figure 3 presents the operation of the proposed DMMU.

When a memory allocation request is received (step 1), the requested size is compared against the largest_available_size of each bit-vector in parallel. This operation is similar to the tag comparison in a fully associative cache; however, it is not an equality comparison. There is a hit in the DMMU if any largest_available_size is greater than or equal to the requested size. On a hit, the corresponding bit-vector is read out (step 2) and sent to the CBT [4]. The CBT is the hardware unit that performs allocation/deallocation on a bit-vector. For the purpose of illustration, we assume that one bit-vector represents one page of the heap.
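The hit test described above can be sketched as follows. This is our own illustration: the hardware checks all entries in parallel, which the sketch models as a linear scan, and the function name is ours.

```python
# Sketch of the DMMU hit test: unlike a cache tag match (equality),
# a hit needs any resident bit-vector whose largest_available_size
# is >= the requested size.
def alb_lookup(entries, request):
    """entries: list of (page_number, largest_available_size) pairs."""
    for page, largest in entries:
        if largest >= request:   # ">=", not "==" as in a tag compare
            return page          # hit: this page can serve the request
    return None                  # miss: fetch a bit-vector from memory
```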

After the CBT identifies the free memory chunk in the chosen page, it updates the bit-vector (step 3) and the largest_available_size field (step 3*). The object pointer (in terms of a page-offset address) of the newly created object is generated by the CBT (step 4). This page offset is combined with the page number (from step 2*) to form the resultant address.
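Forming the resultant address amounts to concatenating the page number with the CBT's page-relative offset. A sketch, under the assumption (ours, not stated in the paper) of a 4 KB page:

```python
PAGE_BITS = 12  # assumed 4 KB pages, for illustration only

def object_address(page_number, page_offset):
    # Step 4: combine the page number (step 2*) with the CBT's
    # page-relative offset to produce the returned object_pointer.
    return (page_number << PAGE_BITS) | page_offset
```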

Figure 3. The allocation and garbage collection processes of the DMMU

For garbage collection, when the DMMU receives a mark request, the page number of the object pointer (i.e., a virtual address) is used to select a bit-vector (step A). This




process is similar to the tag comparison in a cache operation. At the same time, the page offset is sent to the CBT as the address to be marked (step A*). The process is repeated until all memory references to live objects are marked. When the marking phase is completed, the sweeping phase (step C) begins by reading out the bit-vectors and sending them to the bit-sweeper. The bit-sweeper keeps all of the objects whose starting addresses were provided in step A*, then updates the bit-vector (step D) and the largest_available_size field (step D*). The page number, the bit-vectors, and the largest_available_size are placed in a buffer called the Allocation Look-aside Buffer (ALB).
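The sweep semantics described above can be sketched with a simplified model. This is our own illustration, not the bit-sweeper's combinational logic: objects are modeled as (start, size) pairs, standing in for the size information the hardware reads from the S bit-vector.

```python
# Simplified model of the sweep phase: every allocated object whose
# start bit was marked in the X bit-vector survives; all other blocks
# are reclaimed in one pass, and X is cleared for the next GC cycle.
def sweep(a_bits, x_bits, objects):
    """objects: list of (start, size) pairs derived from the S bit-map."""
    for start, size in objects:
        if not x_bits[start]:                # object never marked: garbage
            for i in range(start, start + size):
                a_bits[i] = 0                # reclaim its blocks
    for i in range(len(x_bits)):
        x_bits[i] = 0                        # reset marks for next cycle
```

Because only bit-vectors are read and written, the objects themselves never need to be brought into the cache for marking or size lookup.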

Since the DMMU is shared among all processes, the content of the ALB would otherwise have to be swapped during context switches; the same issue exists in the TLB. To solve this problem, we can add a process-id field to the ALB, which allows bit-vectors of different processes to coexist in the ALB. We expect the performance of the ALB to be very similar to that of the much-studied TLB; however, further research into ALB organization, hit ratio, and miss penalty is required.
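The process-id extension amounts to tagging each ALB entry, analogous to an ASID-tagged TLB. A sketch (our own; the entry layout and names are illustrative):

```python
# ALB lookup with a process-id field: entries from different processes
# coexist, and a lookup hits only on entries whose pid matches the
# running process and whose page can satisfy the request.
def alb_lookup_pid(entries, pid, request):
    """entries: list of (pid, page_number, largest_available_size)."""
    for entry_pid, page, largest in entries:
        if entry_pid == pid and largest >= request:
            return page
    return None  # miss, or entry belongs to another process
```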

5. Conclusion

Besides providing speed and predictability in automatic dynamic memory management, the DMMU can also reduce the number of cache misses and page faults. Since all dynamic memory management information is kept separately from the objects, we do not need to bring an object into the cache for marking or object-size lookup. The bit-map approach also requires no splitting and coalescing. Additionally, our scheme can improve the performance of multithreaded applications in a multiprocessor environment. While multithreaded programming in a multiprocessor environment promotes parallel execution of threads, the task of managing the heap is still done sequentially: other threads have to wait while one thread allocates inside the heap. Since the software approach to allocation is slow and non-deterministic, dynamic memory management can be a major bottleneck in memory-intensive multiprocessor multithreaded applications [3]. Our scheme allows allocation to be done quickly and thus reduces the wait time.

The adoption of object-oriented languages such as C++ and Java in embedded system development also increases the need for a high-performance automatic dynamic memory manager. Industry observers predict that by the year 2010 there will be 10 times more embedded-system programmers than general-purpose programmers [2]. This prediction is also supported by the surge of interest in web appliances, where each device is a small object-oriented embedded system. In this paper, we introduced hardware instruction extensions that allow automatic dynamic memory management to be fast and robust, and to meet hard real-time requirements.

6. References

[1] E. Armstrong, "HotSpot: A new breed of virtual machine", JavaWorld, March 1998.

[2] R. W. Atherton, "Moving Java to the factory", IEEE Spectrum, December 1998, pp. 18-23.

[3] D. Haggander and L. Lundberg, "Optimizing Dynamic Memory Management in a Multithreaded Application Executing on a Multiprocessor", Proc. 1998 Int'l Conference on Parallel Processing, pp. 262-269.

[4] M. Chang and E. F. Gehringer, "A High-Performance Memory Allocator for Object-Oriented Systems", IEEE Transactions on Computers, March 1996, pp. 357-366.

[5] K. Kavi, J. C. Browne, and A. Tripathi, "Computer Systems Research: The pressure is on", Computer, January 1999, pp. 30-39.

[6] R. Jones and R. Lins, Garbage Collection: Algorithms for Automatic Dynamic Memory Management, John Wiley and Sons, 1996, pp. 20-28, 87-95, 296.

[7] D. Patterson and J. Hennessy, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, Second Edition, 1996.

[8] K. Nilsen and W. Schmidt, "A High-Performance Hardware-Assisted Real-Time Garbage Collection System", Journal of Programming Languages, January 1994, pp. 1-40.

[9] P. Wilson, M. Johnstone, M. Neely, and D. Boles, "Dynamic Storage Allocation: A Survey and Critical Review", Proc. 1995 Int'l Workshop on Memory Management, Scotland, UK, Sept. 27-29, 1995.

[10] B. Zorn, "CustoMalloc: Efficient Synthesized Memory Allocators", Technical Report CU-CS-602-92, Computer Science Department, University of Colorado, July 1992.


Session 2

Architectural Issues in Dynamic Translation



[Figure: normalized execution times for the db, javac, jess, jack, compress, and hello benchmarks; each bar is split into execution time, translation time, and interpretation time]

[Figure 2: Dynamic Translator with Dynamic Configuration and Compilation Capability. A Java method with execution count E is dispatched to the interpreter (E < K), to dynamic compilation (E = K), to compiled code (K < E < L), or, at E = L, hardware macros configure the FPGA processor; K is the dynamic compilation threshold and L the dynamic configuration threshold.]

Figure 1 shows that there are applications, like compress and jack, in which a significant portion of the time is spent in executing the translated code. There is a common (and interesting) trait in these applications, where the execution time dominates and a significant portion of this time is spent in certain specific functions. (For instance, compress employs a standard set of functions to encode all the data.) If one optimizes the execution of such functions, then there is hope for much better performance. Hardware configuration of these methods using Field Programmable Gate Arrays (FPGAs) on the fly (similar to how a JIT compiler dynamically opts to compile and execute rather than interpret) is an interesting option. We are currently modifying the dynamic compiler to integrate the dynamic configuration mechanism as shown in Figure 2. We plan to use hardware (object) cores corresponding to selected Java methods, developed by hardware designers with a specific interface to the rest of the JVM, as opposed to dynamically translating bytecodes into an FPGA configuration file [12]. This alternative can avoid the hardware compilation cost. It also benefits from the better performance of a customized hardware core over a synthesized (hardware-compiled) core. When a method is executed more times than a predefined threshold, the existence of a hardware core for the requested method is checked. If a core is found, the dynamic translator initiates the loading of the configuration file of the corresponding hardware core into the FPGA. The execution of the method proceeds in a parallel thread, either in interpreted or JIT-compiled mode, during the configuration. When a subsequent call is made to the same method, the completion of the configuration process is checked. If the configuration is complete, the input data to the method is passed to the FPGA and execution begins in the configured hardware. When the method is executed on the FPGA, the JVM could either busy-poll until the results are returned or switch to the execution of another thread. The output data generated by the customized hardware is buffered and transferred to the processor once the execution is complete. We could also use the partial reconfiguration capability in the FPGAs to implement multiple cores for supporting different methods simultaneously.
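The figure's threshold scheme (E is a method's execution count, K the dynamic compilation threshold, L the dynamic configuration threshold) can be sketched as a dispatch routine. This is our own illustrative code, not the authors' implementation; the threshold values and helper structures are invented.

```python
K, L = 10, 100   # invented threshold values (K < L)

def dispatch(method, counts, hw_cores, fpga):
    """Pick an execution mode for one invocation of `method`."""
    counts[method] = counts.get(method, 0) + 1
    E = counts[method]
    if E == K:
        return "compile"            # E = K: dynamically compile the method
    if E == L and method in hw_cores:
        fpga.add(method)            # E = L: start configuring the FPGA core
        return "configure"
    if E > L and method in fpga:
        return "hardware"           # run on the configured FPGA
    if E > K:
        return "compiled"           # K < E: use the compiled code
    return "interpret"              # E < K: interpret
```

While the FPGA core is being configured, execution continues in a parallel software thread, which is why the configuring invocation itself still completes in software.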


Conclusion

The key to an efficient Java virtual machine implementation is the synergy between well-designed software, an optimizing compiler, a supportive architecture, and efficient runtime libraries. This paper has looked at only a small subset of issues with respect to supportive architectural features for Java, and there are a lot of issues that are ripe for future research.

References

[1] T. Lindholm and F. Yellin, The Java Virtual Machine Specification. Addison-Wesley, 1997.

[2] T. Cramer, R. Friedman, T. Miller, D. Seberger, R. Wilson, and M. Wolczko, "Compiling Java just in time," IEEE Micro, vol. 17, pp. 36-43, May/June 1997.

[3] H. McGhan and M. O'Connor, "PicoJava: A direct execution engine for Java bytecode," IEEE Computer, pp. 22-30, October 1998.

[4] N. Vijaykrishnan, Issues in the Design of a Java Processor Architecture. PhD thesis, College of Engineering, University of South Florida, Tampa, FL 33620, July 1998.

[5] U. Hölzle, "Java on Steroids: Sun's high-performance Java implementation," in Proceedings of Hot Chips IX, August 1997.

[6] "Overview of Java platform product family." http://www.javasoft.com/products/OV_jdkProduct.html.

[7] D. Griswold, "The Java HotSpot Virtual Machine Architecture," Sun Microsystems Whitepaper, March 1998.

[8] "Kaffe Virtual Machine." http://www.transvirtual.com.

[9] "SPEC JVM98 Benchmarks." http://www.spec.org/osg/jvm98/.

[10] R. F. Cmelik and D. Keppel, "Shade: A fast instruction-set simulator for execution profiling," Tech. Rep. SMLI TR-93-12, Sun Microsystems Inc., 1993.

[11] R. Radhakrishnan, J. Rubio, and L. John, "Characterization of Java applications at the bytecode level and at UltraSPARC-II machine code level," in Proceedings of the International Conference on Computer Design, October 1999. To appear.

[12] J. M. P. Cardoso and H. C. Neto, "Macro-based hardware compilation of Java bytecodes into a dynamic reconfigurable computing system," in Proceedings of the 7th IEEE Symposium on Field-Programmable Custom Computing Machines, April 1999.

[Figure 3 bar chart: Normalized Number of Misses (0 to 0.8) for compress, db, javac, jack, mtrt, and mpeg; bars show I-misses in translate, D-misses in translate, and the percentage of misses that are writes in translate.]

Figure 3: Cache Misses within Translate Portion. Cache configuration used: 4-way set-associative, 64KB D-cache with a line size of 32 bytes, and 2-way set-associative, 64KB I-cache with a line size of 32 bytes. The first bar shows the I-cache misses in translate relative to all I-cache misses, the second bar shows the D-cache misses in translate relative to all D-cache misses, and the third bar shows the D-cache write misses in translate relative to overall D-cache misses in translate.


Exploiting Hardware Resources: Register Assignment across Method Boundaries

Ian Rogers, Alasdair Rawsthorne, Jason Souloglou
The University of Manchester, England

{Ian.Rogers,Alasdair.Rawsthorne,Jason.Souloglou}@cs.man.ac.uk

Abstract

Current microprocessor families present dramatically different numbers of programmer-visible register resources. For example, the Intel IA32 Instruction Set provides 8 general-purpose visible registers, most of which have special-purpose restrictions, while the IA64 architecture provides 128 registers. It is a challenge for existing code generators, particularly operating within the constraints of a just-in-time dynamic compiler, to use these varying resources across a number of architectures with uniform algorithms. This paper describes an implementation of Java using Dynamite, an existing Dynamic Binary Translation tool. Since one design goal of Dynamite is to keep semantic knowledge of its subject machine localized to a front-end module, the Dynamite code generator ignores method boundaries when allocating registers, allowing it to fully exploit all hardware register resources across the hot spots of a Java program, regardless of the control graphs represented.

1. Introduction

Current microprocessor families present dramatically different numbers of programmer-visible register resources. For example, the Intel IA32 Instruction Set [1] provides 8 general-purpose visible registers, most of which have special-purpose restrictions, while the IA64 architecture [2] provides 128 registers. In the former case, register renaming and out-of-order issue is used in the microarchitecture to exploit a richer resource set than indicated by the instruction set, but the paucity of the visible instruction set remains a severe constraint on a just-in-time code generator. In particular, it is difficult to pass method parameters efficiently in registers in x86 implementations, since register pressure ensures that the lifetimes of values in registers are very short.

In the case of RISC, and particularly EPIC [3] architectures, the visible instruction set more closely models the register hardware resources provided in an implementation. Register assignment becomes viable using a conventional approach, such as defining a method calling sequence, involving caller-saved, callee-saved and parameter registers, and allocating local variable and temporary registers within individual methods.

In the distant future, it is possible that unconventional CPU architectures may provide extremely high levels of performance without any significant number of registers.

This disparity of hardware resources has been addressed by designers of current Java Virtual Machines by providing different register allocation algorithms for (e.g.) IA32 and RISC machines. In this paper, we describe a single register allocator appropriate to register-rich and register-poor architectures, and we explain how this allows us to set optimization boundaries independently of method boundaries, giving performance advantages.

Conventional static compilers, and existing JIT compilers, use method inlining, more or less aggressively, to uncover a number of optimization possibilities, with register assignment among them. Inlining may have its limitations, however, and we have found a number of cases where inlining (particularly leaf-method inlining) does not completely address hot regions of an application.

In this paper, we present some techniques we use to run Java byte-coded programs on Dynamite, our existing dynamic binary translation environment. As will be described below, our techniques rely on optimizing regions of code without reference to the semantic boundaries of methods. Broadly, we claim to gain the performance benefits of inlining, without its limitations.

We describe the Dynamite binary translation system, its interfaces and its approach to optimization. In section 3, we show how Java programs are implemented using the Dynamite facilities, and section 4 discusses our preliminary results. Previous work in register assignment for Java programs is introduced in section 5.

2. Dynamite

Dynamite is a reconfigurable Dynamic Binary Translation system, whose aims are to extend the “Write Once, Run Anywhere” paradigm to existing binary applications, written and compiled for any binary platform. To enable the quick configuration of a particular translator, Dynamite is constructed as a three-module system, as shown in figure 1.


The function of the major components is almost self-explanatory: the Front End transforms a binary input program into an intermediate representation (IR), which is optimized by the Kernel, and the Back End generates and executes a binary version on the target processor. The Front End interface supports a number of abstractions convenient for efficient Front End implementation, such as the “abstract register”, which holds intermediate representations of the effects of subject instructions. This interface is configured by an individual Front End module, since different subject architectures require different numbers of registers.

Two features of this front end interface are relevant to the current discussion: firstly, the interface resembles a RISC-like Register Transfer Language, containing no peculiarities (such as condition codes or side-effects) adapted for particular subject architectures. Secondly, procedure (method) calling is achieved at a primitive level, typically by having the front end create IR to compose a link value in subject state space, and then branching or jumping to the callee. Return from a procedure is similarly implemented by having the Front End create IR which causes a jump to be made to the return address. Parameter passing and stack-frame management is implemented by having the Front End create the appropriate IR to model the subject architecture’s requirements.

The Kernel contains about 80% of the complexity of the translator system. It creates and optimizes IR in response to Front End calls, and invokes the Back End to code generate and execute blocks of target code. Register assignment for the target machine code generator currently takes place within the Kernel, again using a Back End interface that is parameterized for a specific target architecture. Optimization is performed adaptively at a number of different levels, starting with initial translation as described below.

To achieve its performance goals, Dynamite operates in an entirely lazy manner. An instruction is never translated until that instruction must be executed, either because it is a control (jump, branch, exception or call) target, or because the immediately preceding branch has fallen through. As instructions are decoded by the Front End, their IR is combined until a control transfer is encountered. During this process, the kernel performs optimizations such as value forwarding and dead code elimination. When the block of IR is complete, it is code generated, executed by the back end, and cached for subsequent reuse. After a block of target code is executed, its successor location may either be found within the cache, or may need translation using the same actions.
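Lazy, block-at-a-time translation can be pictured with a small sketch. This is our simplification, not Dynamite's interface: the subject program is a table of (operation, control-target) pairs, decoding runs until a control transfer, and each translated block is cached under its entry address so re-execution skips translation.

```python
def run(program, entry):
    """Execute `program` from `entry`, translating blocks lazily."""
    cache, trace, pc = {}, [], entry
    while pc is not None:
        if pc not in cache:                    # translate only on first visit
            block, addr = [], pc
            while True:                        # decode until a control transfer
                op, target = program[addr]
                block.append(op)
                if op in ("jump", "branch", "call", "return"):
                    break
                addr += 1
            cache[pc] = (block, target)
        block, next_pc = cache[pc]             # cached blocks are reused
        trace.extend(block)
        pc = next_pc
    return trace, cache

prog = {0: ("add", None), 1: ("jump", 2),
        2: ("mul", None), 3: ("return", None)}
trace, cache = run(prog, 0)   # two blocks translated, at addresses 0 and 2
```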

In efficiency terms, target code blocks generated using this initial scheme leave something to be desired. The benefit is that the initial translation is quick, taking only a few thousand instructions per subject instruction. Register usage is determined by an individual Back End, largely as a result of the method calling sequence mandated by the static compiler used to compile Dynamite. Target registers are used to store temporary results within the block, but existing Back Ends preserve all subject register values in target memory at the boundary of basic blocks.

More optimization and higher quality code generation are triggered when an individual target block is executed more frequently than a dynamic execution threshold. This event causes the kernel to create a group containing this and related blocks in a hot region, and to optimize this Group Block as a single entity. Group Blocks may span arbitrary boundaries in the subject machine: indeed, in other applications, Dynamite optimizes across programs and their procedures, static and dynamically-linked libraries, OS kernels, and across different virtual machines.

Within a Group Block, the Dynamite kernel examines the existing control flow of the region, identifying certain blocks as entry and exit blocks, and performing value propagation and dead-code elimination across the entire group. The control flow is used to straighten the conditional branches and eliminate jumps within the group, so that frequent cases fall through, minimizing taken branches and maximizing I-cache utilization.

Code generation for a Group Block occurs next. To avoid the expense of an iterative algorithm, a very simple incremental register allocation algorithm is used. Starting with the target block, operands are allocated to registers (if the target machine architecture requires), and operation results are allocated to registers if they are to be reused. As the register set is exhausted, spill code is generated to relinquish previously allocated registers for new operations. Register allocations are carried across basic block boundaries, and the act of code generating from the most- to the least-frequently executed blocks within the group ensures that spills are minimized.
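A minimal model of such a one-pass incremental allocator is sketched below. It is our own toy version over a stream of value uses; the real allocator works on IR operands and carries allocations across basic-block boundaries.

```python
def allocate(values, num_regs):
    """Assign registers in encounter order; spill the oldest when full."""
    in_reg, order, spills = set(), [], []
    for v in values:
        if v in in_reg:
            continue                    # value already holds a register
        if len(order) == num_regs:      # register set exhausted:
            victim = order.pop(0)       # emit spill code for the oldest value
            in_reg.discard(victim)
            spills.append(victim)
        order.append(v)
        in_reg.add(v)
    return spills

# With two registers, the stream a, b, c, d forces a and then b to spill.
```

Because hot blocks are code generated first, their values claim registers before cooler blocks run the allocator, which is what keeps spills out of the frequent paths.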

We emphasize that during this code generation process, all abstract registers and target registers are treated symmetrically. We do not distinguish between

[Figure 1. Dynamite Structure: Subject Software enters a Front End module, the Kernel optimizes the IR, and a Back End module generates code for the Target CPU.]


registers used to pass parameters, those used to carry visible results, and those used to hold temporary values. In this way, we can code generate a region containing multiple procedure calls and returns as efficiently as one containing just a portion of a large procedure.

The final stage of code generation is to generate stubs for the entry and exit blocks, which need to load abstract register values into target registers and compute and store exit values from the group block.

This group-block creation phase can be invoked and re-invoked any number of times during program execution, creating larger and smaller groups of basic blocks, always independent of method boundaries, as the subject program proceeds through its execution.

3. Implementing Java

The critical design decisions when implementing Java using Dynamite are the mapping of JVM registers, local variables, and the stack to the relevant Dynamite objects, namely abstract registers.

To allow Dynamite to optimize across different Java methods, we need to map multiple stack frames simultaneously to different abstract registers. Two schemes were considered for doing this.

3.1 Sliding frame

On entering a method the front-end would create a frame within the abstract registers. The arguments to the method (stored at JVM local variable 0 upwards) become the base for the frame. After the local variables, the return address is held in the next available abstract register. The JVM stack is held at the end of the frame. However, studies [4] show that the stack is usually empty on basic block boundaries. The purpose of the stack in the frame is therefore to hold onto stack values that occasionally span basic blocks and to pass arguments to called methods. The arguments could become part of the next frame by overlapping the stack part of the caller’s frame and the local variable part of the callee frame.
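Using invented slot indices (the paper gives no concrete layout), the sliding frame and the caller/callee overlap can be sketched as:

```python
def frame_layout(base, n_locals, stack_depth):
    """Locals at the base, then the return address, then the JVM stack."""
    return {"locals": range(base, base + n_locals),
            "return_address": base + n_locals,
            "stack": range(base + n_locals + 1,
                           base + n_locals + 1 + stack_depth)}

def callee_base(caller, n_args):
    # The callee frame begins where the caller's outgoing arguments sit,
    # so the arguments become the callee's first locals without copying.
    return caller["stack"].stop - n_args

f = frame_layout(0, n_locals=3, stack_depth=2)   # abstract registers 0..5
```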

Unfortunately, the problem with this scheme is that the IR for an individual method needs to refer to specific abstract registers. This fixes its translation to a particular stack depth. If the same method is called at a different stack depth, we need to re-translate it for this new depth. This is particularly expensive for recursive methods. We could possibly generate special-case translations for recursive methods and fall back on a scheme that saves the frame to memory on a method call. Otherwise, for methods that are called from multiple stack locations we could avoid re-translation if the method’s frame is at a greater abstract register location than the current frame. We would, however, still have to copy the arguments to the method from the caller’s frame to that of the callee.

Our research has shown that around 90% of execution time is spent in methods called from more than 16 call sites. These methods are typically utility functions which are prime candidates for optimization. Expensive optimizations would be prohibitive for these methods as the optimization would need repeating many times.

We conclude that using a sliding frame is therefore undesirable.

3.2 Fixed frame

The drawback with the “sliding-frame” scheme is that it is necessary to recompile methods called with different frame base pointer values. If we fix the address where a method’s frame lives in abstract registers we remove this problem. To do this, we allocate a new, unique frame from a large pool of abstract registers the first time a particular method is invoked.

We do, however, still need to pass arguments to the called method from the stack of the calling method. On encountering a method call, the arguments to the method are held in intermediate representation ready to be written to registers. At this point, they can be written directly to the called method’s local variables, avoiding any copy operation.
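The fixed-frame allocation might look like the following hypothetical helper; the 8000-register pool size echoes the measurement reported later in this section.

```python
class FramePool:
    """Hand each method one permanent frame of abstract registers."""
    def __init__(self, size=8000):
        self.next, self.size, self.frames = 0, size, {}

    def frame_for(self, method, n_slots):
        if method not in self.frames:        # first invocation: allocate
            if self.next + n_slots > self.size:
                raise MemoryError("pool exhausted; frames must be re-used")
            self.frames[method] = range(self.next, self.next + n_slots)
            self.next += n_slots
        return self.frames[method]           # later calls: same fixed frame

pool = FramePool()
pool.frame_for("f", 4)   # registers 0..3, for every future call of f
pool.frame_for("g", 3)   # registers 4..6
```

Because a method's frame location never changes, its translation never has to be redone for a new stack depth.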

The first penalty for this scheme is that we need to retranslate these abstract register assignments for different methods called from the same call site. A study using Harissa [5] shows that at least 40% of method calls can be accurately statically predicted for 100% of the time, and dynamic statistics are even better than this.

As in the “sliding-frame” design, recursion needs to be handled differently, as two invocations of a method cannot share a single frame. A simple scheme to handle recursion is to save a frame to memory before using it, if it is active, and to restore it on exit. Alternatively, we may find some circumstances in which it is advantageous to generate special-cased versions of recursive methods, each of which uses a different frame of abstract registers. This special-casing will be triggered by a heuristic monitored by code planted in the initial translation of a (potentially recursive) method.

Finally, assigning unique abstract registers to every method presents a problem when the static pool is exhausted. For programs studied to date, fewer than 8000 abstract registers would be sufficient. If greater numbers were required, the front-end could start re-using frames. For example, all leaf methods can share the same frame, and more generally, methods that occur only on disjoint subtrees of the call graph can share frames. In the pathological case, frames of abstract registers can be reused by planting code that spills a number of frames to memory and refills them when necessary.

4. Discussion

To evaluate this scheme before its implementation, we carried out a number of experiments by instrumenting Kaffe [6] to log information about the dynamic behaviour of Java applications. We keep sufficient information to create the dynamic method call tree of the application. We create a call tree, in which methods appear once for each call site, to assess our implementation alternatives. For each method occurrence, we keep the number of byte codes executed in this invocation, and its local variable requirement, including parameters.

To estimate the number of target registers needed in optimization regions of different sizes, we identify hot spots on this call tree (methods with high contributions to overall instruction counts), and successively add them to optimization regions, counting the total number of local variables required at each step. This approximates to code generating by hottest method first, then by successively cooler regions. As each successive method is added to the optimization region, we require more local variables. This gives us the characteristics we show below. For these experiments, we monitored “javac”, the Java compiler in Java, since it is the largest Java application we can find.
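The region-growing measurement amounts to a greedy loop over per-method profiles: sort by heat, add methods one at a time, and record how many local variables (hence target registers) the region then needs. The code and data below are ours, purely illustrative.

```python
def coverage_curve(methods):
    """methods: (instruction_count, local_variables) pairs."""
    total = sum(count for count, _ in methods)
    regs, covered, curve = 0, 0, []
    for count, nlocals in sorted(methods, reverse=True):  # hottest first
        regs += nlocals
        covered += count
        curve.append((regs, covered / total))   # registers vs. coverage
    return curve

# Three invented methods: coverage grows as more registers are granted.
curve = coverage_curve([(50, 2), (30, 3), (20, 5)])
```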

[Figure 2 bar chart: percentage of total instruction count (0% to 100%) against number of local variables (1 to 1000, log scale).]

In figure 2, we show how a straightforward “hottest first” selection algorithm can use varying numbers of target registers to code generate “spill-free” regions of different sizes. As we intuitively expect, as we allocate more and more local variables into target registers, we can encompass larger and larger regions, contributing to ever increasing fractions of total instruction count. For example, with 5 target registers, we can code generate a region that contributes 22% to total execution count, and with 26 registers, 46% of execution count.

In Figure 3, we use a slightly different heuristic to select methods to include within our selection region. Here, we select methods based on their run-time contribution per local variable. That is, comparing methods with similar run-time contributions, we

[Figure 3 bar chart: percentage of total instruction count (0% to 100%) against number of local variables (1 to 1000, log scale), using the contribution-per-local-variable heuristic.]

preferentially select the method with the smaller requirement for local variables. This gives better results when there are fewer target registers available: for example with 8 registers, we can cover 30% of total instruction count, and with 25 registers, we can cover 54% of the instructions.
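This second heuristic is equivalent to ranking methods by instruction count per local variable rather than by raw instruction count. A sketch over invented profile data:

```python
def rank_by_density(profile):
    """profile: {method: (instruction_count, local_variables)}."""
    return sorted(profile,
                  key=lambda m: profile[m][0] / profile[m][1],
                  reverse=True)

# "c" delivers the most instructions per register, so it is selected first.
order = rank_by_density({"a": (100, 10), "b": (90, 3), "c": (40, 1)})
```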

5. Previous Work

The success of Java has resulted in many JVM implementations. Some implementations, such as Harissa [5] and J2C, translate Java to C code. They then rely on a C compiler to perform register allocation within and over method call boundaries. Register mappings within C programs are beyond the scope of this paper.

In this section we examine how other JVM implementations perform register mapping and allocation and compare these to Dynamite.

5.1 Register allocation

Cacao [4] initially maps the JVM stack and local variables to pseudo-registers, which are then allocated to CPU registers. Each mapping and allocation begins at the start of a basic block and builds on the mappings and allocations of previous basic blocks. When CPU registers are exhausted, a register is spilled to memory and filled by a pseudo-register.

On method call boundaries Cacao pre-allocates registers. It uses CPU registers to pass arguments and to receive return values. On machines without register windows, pre-allocation of arguments is only possible for leaf methods.

Pre-allocation can tie in with existing compiler method call conventions: for example, in the DAISY JVM [7] arguments and return values are passed and received using the PowerPC’s C compiler calling conventions, which use standard registers for passing arguments.

Figure 2. Instruction Count against Local Variables

Figure 3. Instruction Count against Local Variables


5.2 Comparison with Dynamite

Method invocation creates a new frame on a call stack. Cacao avoids unnecessary accesses to this frame by pre-allocation, utilizing register windows or potentially by using the machine’s standard calling convention. However, these static mappings take no account of run-time information on register usage. Therefore registers could be allocated and then subsequently unused. Cacao would also calculate any parameters even if they were unused. Cacao would also have to copy from one register to another if it repackaged arguments to another method. Dynamite, on the other hand, can avoid this by value forwarding and dead code elimination within a group block.

Also, when registers are spilled, only the surrounding and previous basic blocks are considered. This means that a pseudo-register could be spilled in one basic block and then filled back again in the next, and Cacao wouldn’t know it could spill different registers which are unused in subsequent basic blocks. Dynamite’s runtime information about register usage can provide a better register allocation in this case.

6. Conclusion

In this paper, we have introduced Dynamite, an environment for creating dynamic binary translators. We have shown how the run-time concepts of the Java Virtual Machine are mapped onto the Dynamite front end interface and its internal register allocation algorithms. This mapping necessarily discards the concepts of methods and their local variables.

Preliminary investigations show that “method-free” register allocation shows promise for efficient code generation across architectures providing wide ranges of hardware register resources. We look forward to presenting more definitive numerical results at the workshop in October.

7. References

[1] Intel Corporation, “Intel Architecture Software Developer’s Manual, Volume 1: Basic Architecture,” Order Number 243190, 1999.

[2] Intel Corporation, “IA-64 Application Developer’s Architecture Guide,” Order Number 245188, 1999.

[3] Trimaran consortium, “Trimaran Project Homepage,” http://www.trimaran.org

[4] Andreas Krall, “Efficient JavaVM Just-in-Time Compilation,” International Conference on Parallel Architecture and Compilation Techniques (PACT98), Paris, France, October 13-17, 1998.

[5] G. Muller, B. Moura, F. Bellard, C. Consel, “Harissa: A Flexible and Efficient Java Environment Mixing Bytecode and Compiled Code,” Third USENIX Conference on Object-Oriented Technologies (COOTS-97), Portland, Oregon, June 16-20, 1997.

[6] Transvirtual Technologies Inc., “Kaffe Product Architecture,” http://www.transvirtual.com/products/architecture.html

[7] K. Ebcioglu, E. R. Altman, E. Hokenek, “A JAVA ILP Machine Based on Fast Dynamic Compilation,” IEEE MASCOTS International Workshop on Security and Efficiency Aspects of Java, Eilat, Israel, January 9-10, 1997.


[Paper title unrecoverable from the PDF extraction; the paper proposes a Decoupled Translate Execute (DTE) architecture for Java.]

Ramesh Radhakrishnan and Lizy Kurian John
Laboratory for Computer Architecture
Department of Electrical and Computer Engineering
The University of Texas at Austin, Austin, Texas 78712

Abstract

Java is increasing in popularity in the software industry, being used to implement server and enterprise-scale projects. Such applications require high performance, and need to run efficiently. Techniques like JIT compilers and PicoJava chips offer some speedup, but in this paper we look at alternate techniques in hardware which can be incorporated in future microprocessors. We propose a Decoupled Translate Execute (DTE) Architecture, which takes advantage of run-time characteristics of Java. The DTE architecture can be implemented using a mix of hardware and software or alternatively purely in hardware.

1 Introduction

Java technology has been adopted by the enterprise to offer software solutions in various fields, due to its advantages of portability, security and modularity. A factor which affects the execution speed of Java is that it is still predominantly a software emulated language. A program written in Java is translated (by software emulation) to an architecture-neutral instruction set called bytecodes. These bytecodes are platform independent and can be executed on any system which supports the Java runtime environment. Typically the bytecodes are interpreted or converted to the native instruction set of the machine by a Just-in-Time (JIT) compiler.

Translation consumes a major part of the cycles during Java execution using a JIT compiler [6]. In many of the SpecJVM98 programs, approximately half the time is spent in compile-time operations and only half of the time is spent in execute-related operations [6]. The data cache miss rates during execution with a JIT compiler were seen to be dominated by compulsory write misses, which were arising from installation of the translated code. Another phenomenon is that the translated code is deposited into the data cache and later fetched from the instruction cache, which results in avoidable data transfer and double-caching. In this paper we present the Decoupled Translate Execute (DTE) Architecture, which forms a solution to both of the above problems.

2 DTE Architecture

Translation and Execution are two distinct operations occurring during Java execution. While interrelated, there is significant potential overlap that can be exploited between these two operations using a Decoupled Translate Execute (DTE) Architecture along the lines of the traditional DAE architectures [2, 3, 4]. Figure 1 illustrates the structure of the proposed decoupled architecture, which consists of the execute and translate processors and necessary buffers for interprocessor communication. The bytecode executable will enter the Execute Processor (EP), and the execute processor will check whether this method is available in the Translated Code Cache (TCC) in the translated form. If it is not found in the TCC, it will deposit the methods to be translated into the E-to-T Queue. The Translate Processor (TP) will translate the code and put the translated methods into the TCC. The translate operation progresses concurrently with execution, once a few routines have been translated. The execution proceeds as long as the next required method has been translated and deposited into the TCC. The granularity for translation can be bytecode or method. Although a queue might simplify control, a cache-like structure can help method reuse. Hence we are using a cache-like structure for TP-to-EP communication and a simple queue structure for EP-to-TP communication. Unlike DAEs, splitting code into translate and execute threads is easier and straightforward due to the inherent threaded nature of Java code. The improvement obtained using such a decoupling strategy will


Figure 1: The Decoupled Translate Execute Architecture (DTE). Bytecodes enter the Execute Processor; the Translate Controller and E-to-T Q connect it to the Translate Processor, which fills the Translated Code Cache (TCC). E-to-T Q is a FIFO buffer between the Execute Processor (EP) and the Translate Processor (TP), used by the EP to communicate methods/bytecodes to be translated to the TP. TCC is a cache used by the TP to communicate translated methods/bytecodes back to the EP; it stores the translated methods/bytecodes and facilitates method reuse.

depend on the overlap that can be attained between translate and execute operations. In traditional DAE terminology, this would be called slip. The distinctly separate threads required during Java execution provide potential for significant overlap and parallelism.

The translate processor may use software translation or hardware translation. Software translation consists of exactly the same operations that happen in current JITs during the translate phase. Hardware translation would involve template matching and generation of translated code, as modern x86 processors generate RISC-like micro-operations. A multithreaded processor utilizing different threads for translation and execution may be considered as a software implementation of the DTE.

3 Hard-DTE Architecture

In this section we describe a subset of the DTE Architecture, the "Hard-DTE" Architecture. In the Hard-DTE architecture the translation of bytecodes to native code is done through a hardware translator, as opposed to using a software emulator. This provides advantages of speed in translation, but at the expense of increased design complexity and chip area.

Software emulation of bytecodes (interpretation, JIT compilation) is slow. Our studies on Java workloads [1] show that interpreted execution requires on the average about 30 SPARC instructions to emulate each bytecode. The average SPARC instructions required to emulate a bytecode was approximately 20 when using a JIT compiler [6]. Software emulation is easy to implement for new platforms but cannot offer a solution for fast execution of Java bytecodes. Using a hardware solution, which implies executing the bytecodes directly in hardware, would eliminate the requirement of a software layer to emulate the bytecodes.

There are several issues to be considered while employing hardware to perform translation. The current hardware implementations of the Java Virtual Machine (JVM), such as PicoJava, are stack based. However, the Instruction-Level Parallelism (ILP) which can be exploited using a stack architecture is limited. Figure 2 shows the percentage reduction in execution cycles and improvement in Instructions Per Cycle (IPC) for Java applications executed using the interpreter and the JIT compiler. The programs executed using the JIT compiler are seen to show slightly higher improvements in both execution time and IPC when executed on 4-way and 16-way processors. The individual data for each benchmark can be found in [5], and we include Figure 2 in this document to serve as a motivation for hardware translation into a non-stack ISA. Our simulator models realistic branch prediction; however, it assumes a perfect Branch Target Buffer (BTB), and therefore the performance of the interpreted programs will be still worse on a real machine in which the BTB is modeled, since the interpreter has more indirect branches than the JIT mode of execution.

Figure 2: A comparison of scalability during interpreted and JIT compiled execution. Percentage reduction in execution cycles and improvement in IPC relative to a 4-way configuration is shown. The interpreted code is increasingly stack-oriented, while the JIT code incorporates register optimizations. (This figure is provided as motivation for non-stack ILP in the core of Java processors.)

Based on results of preliminary performance estimates, we favor a hardware implementation of the DTE concept. Both EP and TP can be built on the same chip, and the architecture can be very similar to modern dynamically scheduled microprocessors except for a few building blocks. This architecture will include a hardware unit to convert the bytecodes into simple RISC-like instructions. These converted instructions will be register based, thereby enabling use of standard techniques to obtain high performance. The Translated Code Cache (TCC) will store the converted bytecodes. The TCC will be very similar to a trace cache [7, 8, 9] in the sense that it captures the dynamic translated code during program execution. A high-level block diagram of a system with the proposed hardware units is shown in Figure 3. Instead of fetching bytecodes from the standard instruction cache, whenever possible, the converted bytecodes from the decoded bytecode instruction cache will be directly fetched and executed by the processor core. All blocks in Figure 3, except the highlighted blocks, are part of any state-of-the-art microprocessor. Although not explicitly indicated, there are queues between various blocks in Figure 3, especially between the fetch unit and the Hardware Translator. Storing the translated bytecodes/methods in the TCC also provides opportunity for a variety of optimizations as performed by the fill unit in conjunction with a trace cache [9].

3.1 Fill Unit Approach to Decoding Bytecodes

The block diagram of the front-end for the Hard DTE is shown in Figure 4. The hardware translator/decoder will convert bytecodes into smaller fixed-length instructions, which will be packed into the Translated Code Cache (TCC) by the fill unit. A line in the TCC can be finalized or completed when one comes across a control flow instruction. The branch unit will provide the next instruction address, and it is fed back to the instruction cache and the TCC. If the lookup in the TCC results in a hit, then the decoded instructions are fed to the dynamically scheduled micro-engine directly from the TCC. The fill unit and the TCC help in creating larger atomic units of work, which can be fed to the dynamically scheduled micro-engine. Increasing the fetch bandwidth, using the hardware translator and the TCC will allow us to execute the bytecodes in a superscalar

Figure 3: The Hard-DTE implementation; microarchitecture of a decoupled processor that performs translations in hardware. Blocks: Fetch Unit, Instruction Cache, Decode Unit, Hardware Translator, Translated Code Cache (TCC), Integer unit, F.P unit, Load Store Unit, Data Cache, and Memory and I/O Interface unit.

Figure 4: Block diagram of the Hardware DTE with an instruction fill unit. Blocks: I-Cache, h/w translator, Fill Unit, TCC, Register Renaming Unit, Functional Units, and branch unit, which feeds the next instruction address back to the front end.



fashion, unlike the PicoJava, which did not execute instructions out of order, as the instruction fetch bandwidth was very limited. The PicoJava fetches only 1-2 bytecode instructions every cycle. The stack creates an implicit dependency amongst the bytecodes, and the variable length of the bytecode instructions makes it harder to decode multiple instructions simultaneously.

3.2 Decoding Bytecodes into Micro-ops

The hardware decoder will convert the bytecodes to multiple micro-operations. A decoding structure similar to the Intel P6 decoder can be used, which can decode multiple simple and one complex instruction in one cycle. Another approach is to decode only the commonly used bytecodes and store them in the TCC, and to execute the other bytecodes by trapping to micro-code or software.

Consider the bytecode dup2. This bytecode duplicates the top two elements of the stack and pushes them on the stack. This bytecode can be split into the following five micro-operations:

    TRSA = ld [SP]       ; load the top element of the stack
    TRSB = ld [SP+4]     ; load the second element
    SP   = SP - 8        ; make room for the two copies
    st [SP]   = TRSA     ; push a copy of the top element
    st [SP+4] = TRSB     ; push a copy of the second element

In decoding this bytecode, we used 2 temporary registers (TRSA and TRSB). We also modified the stack pointer (SP). The bytecode required 2 stores, 2 loads and 1 computation instruction, and used 2 intermediate registers to store temporaries. Similarly, we can calculate the load/store instructions and computation instructions required to implement different bytecodes to come up with a resource template which can be used during decoding.

4 Summary and Future Work

In this paper we propose the DTE architecture and also describe a Hard-DTE, which is a subset of the DTE architecture. The DTE architecture takes advantage of the concurrency available at Java run time, during the translate and execute stages.

Other researchers have proposed to take advantage of the concurrency between translate and execute by using a separate thread on a separate processor to do the translation. The DTE architecture differs from such a scheme in the fact that the communication between the TP and EP takes place through queues and the TCC, which reduces the cost. In the DTE each processor can run ahead of the other when possible, since we use buffers to store the result of each processor. Scheduling threads to run on different processors can be expensive if there is a lot of switching, and unless there is a lot of method re-use we would see a lot of translation taking place. By adding more functionality to the translate processor, we can keep it busy in the case of long-running applications with high method re-use. The TP can predict or guess the methods which are going to be called in the future and compile them, so that the native code can be used the next time the method is invoked.

The DTE architecture, and in particular the Hard-DTE, are being evaluated using the SpecJVM98 benchmarks. A set of Java micro-benchmarks are also used for validation purposes in the study. The ability of the DTE concept to exploit concurrency between translation and execution, hide the cost of translation,
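The EP/TP protocol of the DTE (the Execute Processor checking the TCC and queueing misses for the Translate Processor) can be sketched as a toy software model. This is our own illustration, not the paper's design: the container types, method ids, and the "translation" itself are stand-ins.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <string>

// Toy model of the DTE flow: integer ids stand in for bytecode methods,
// strings stand in for translated native code.
using MethodId = int;

std::map<MethodId, std::string> tcc;   // Translated Code Cache (TP -> EP)
std::deque<MethodId> e_to_t_q;         // E-to-T queue (EP -> TP)

// EP side: run from the TCC when possible, otherwise request translation.
bool execute(MethodId m) {
    if (tcc.count(m)) return true;     // hit: execute the translated copy
    e_to_t_q.push_back(m);             // miss: enqueue for the TP
    return false;                      // EP stalls or runs ahead
}

// TP side: drain the queue, depositing translations into the TCC.
void translate_pending() {
    while (!e_to_t_q.empty()) {
        MethodId m = e_to_t_q.front();
        e_to_t_q.pop_front();
        tcc[m] = "native code";        // stand-in for real translation
    }
}
```

Because the TCC is a cache rather than a queue, a second invocation of the same method hits without any further translation, which is the method-reuse benefit the paper argues for.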

and enable Java to achieve the performance of off-line compiled code without losing the advantages associated with JIT compilation is being studied.

References

[1] R. Radhakrishnan, N. Vijaykrishnan, L. K. John, and A. Sivasubramaniam, "Architectural Issues in Java Runtime Systems," Tech. Rep. TR-990719, Laboratory for Computer Architecture, Electrical and Computer Engineering Department, The University of Texas at Austin, 1999. http://www.ece.utexas.edu/projects/ece/lca/ps/tr990719.pdf

[2] J. E. Smith, "Decoupled Access/Execute Computer Architectures," ACM Transactions on Computer Systems, pp. 289-308, November 1984.

[3] L. John, P. T. Hulina, and L. Coraor, "Memory latency effects in decoupled architectures with a single data memory module," in Proceedings of the 19th Annual International Symposium on Computer Architecture, pp. 236-245, May 1992.

[4] L. John, V. Reddy, P. T. Hulina, and L. Coraor, "Program balance and its impact on high performance RISC architectures," in Proceedings of the International Symposium on High Performance Computer Architecture, pp. 370-379, January 1995.

[5] R. Radhakrishnan, J. Rubio, and L. John, "Characterization of Java applications at the bytecode level and at UltraSPARC-II machine code level," in Proceedings of the International Conference on Computer Design, October 1999. To appear.

[6] R. Radhakrishnan, J. Rubio, L. John, and N. Vijaykrishnan, "Execution characteristics of just-in-time compilers," Tech. Rep., Laboratory for Computer Architecture, Electrical and Computer Engineering Department, The University of Texas at Austin, 1999.

[7] E. Rotenberg, S. Bennett, and J. E. Smith, "Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching," in Proceedings of the 29th International Symposium on Microarchitecture, December 1996.

[8] S. J. Patel, M. Evers, and Y. N. Patt, "Improving trace cache effectiveness with branch promotion and trace packing," in Proceedings of the 25th International Symposium on Computer Architecture, June 1998.

[9] D. H. Friendly, S. J. Patel, and Y. N. Patt, "Putting the Fill Unit to Work: Dynamic Optimizations for Trace Cache Microprocessors," in Proceedings of the 31st ACM/IEEE International Symposium on Microarchitecture, December 1998.


Session 3

Object-Oriented Architectural Support


Applying Predication to Reduce the Cost of Virtual Function Calls

Chris Sadler and Sandeep K. S. Gupta
Department of Computer Science
Colorado State University
Fort Collins, CO 80523
{sadler,gupta}@cs.colostate.edu

Rohit Bhatia
VLSI Technology Center (VTC)
Hewlett-Packard Company
3404 East Harmony Road
Fort Collins, CO
[email protected]

Abstract

The direct cost of virtual function calls in object-oriented programs is a runtime overhead incurred by the number of operations required to compute a target function address, and the time to perform these operations. We present a technique that uses an HPL PlayDoh architectural feature known as predication to reduce the direct costs of virtual function calls. This technique is based on the possibility that the same virtual function table will be shared between virtual function calls, and exploits this possibility by interleaving the function calls for objects whose type cannot be determined statically. With cost models we show that predication will eliminate redundant loads of the virtual function table from memory, and thereby reduce the impact of memory latency on the overall runtime performance of virtual function calls.

1. The overhead of virtual function calls

The use of virtual function calls within object-oriented (OO) languages has a direct cost of degrading the runtime performance of programs. As opposed to static function calls that can be resolved during compilation, a virtual function call is resolved during runtime if it cannot be statically bound during compilation. Hence, it will incur a runtime overhead to determine which function entry point to jump to. A study performed by Driesen and Holzle showed that C++ programs spend a median of 5.2% of their execution time, and 3.7% of their instructions, in performing virtual function calls [1]. They go the additional step of saying that this overhead is likely to increase moderately on future processors. We believe that this will not be the case if compilers are enhanced to better utilize newer architectural features in order to reduce the additional time and/or instruction overhead. In this study, we present such a compiler optimization that applies predication on interleaved virtual function calls

to eliminate unnecessary loading of virtual function tables (VFT) for objects whose type cannot be determined statically. By eliminating redundant loads, this technique will minimize the time spent accessing the memory system, and will reduce potential stalls in the instruction stream. We can show that the cost of applying this technique will be no more than a single cycle if the predication is not successful. However, we believe this cost is worth the potential savings when considering the nature of OO programs, and the savings that can be obtained in light of the growing performance gap between processor and memory systems.
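As a software analogue of this idea (our own sketch in plain C++, not the authors' HPL-PD code), the second of two interleaved dynamic dispatches can be guarded on a comparison with the first call's already-loaded table pointer, so that a shared table is fetched only once:

```cpp
#include <cassert>

// Hypothetical model: the names, the single-slot table, and the load
// counter are ours, introduced only to make the eliminated load visible.
typedef void (*Fn)(int*);

struct Vft { Fn fn; };
struct Obj { const Vft* vptr; int data; };

static int g_vft_loads = 0;

static Fn load_entry(const Vft* vft) {
    ++g_vft_loads;            // stands in for the memory load of a VFT entry
    return vft->fn;
}

static void bump(int* p) { ++*p; }   // example virtual function body

// Two interleaved "virtual calls". The second table lookup is guarded by a
// pointer comparison, the software analogue of predicating the load on the
// tables being distinct.
void call_pair(Obj* a, Obj* b) {
    const Vft* ta = a->vptr;
    Fn fa = load_entry(ta);                    // unconditional first load
    const Vft* tb = b->vptr;
    Fn fb = (tb == ta) ? fa : load_entry(tb);  // load only if tables differ
    fa(&a->data);
    fb(&b->data);
}
```

When both receivers share one table, only one entry load reaches memory; when they differ, the guard costs a compare, mirroring the single-cycle penalty cited above for unsuccessful predication.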

1.1. The mechanics of virtual function calls

In order to understand how the overhead of virtual function calls can be reduced, the mechanics of such a call should be understood. Given in Figure 1(a) is an example of a class hierarchy using single inheritance, and its use. The object and VFT layouts for this example are shown in Figure 1(b), which are based on a standard layout as described by Ellis and Stroustrup [2]. According to Srinivasan and Sweeney [3], there are four steps required to perform a virtual function call when implementing a standard VFT layout. These are:

1. Load the VFT which contains the entry for the called function (i.e. access the VFT via the vptr).

2. Access from the VFT the function entry point to branch to (i.e. load the address of A::foo or B::fee).

3. Adjust the reference to the object through which the function is being called, to refer to the sub-object that contains the definition of the function which will be called (i.e. load A-offset and add to obj1, or load B-offset and add to obj2). This is known as a late cast.

4. Branch to the entry point of the function.
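The four steps above can be spelled out as explicit loads and adds over a hypothetical flat layout. The field names and the pairing of entry point with offset in each slot are our illustration, not a particular compiler's ABI:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative layout: each VFT slot pairs a function entry point with the
// late-cast offset of its defining sub-object (names are ours).
struct VftEntry {
    void (*fn)(void* self);
    std::intptr_t offset;
};

struct Object {
    const VftEntry* vptr;
    int x;
};

static int g_last_x = 0;

// Example target function: reads the receiver it was handed.
static void show_x(void* self) { g_last_x = static_cast<Object*>(self)->x; }

// The four steps from the text, one per line.
void virtual_call(Object* obj, int slot) {
    const VftEntry* vft = obj->vptr;                // 1. load the VFT via the vptr
    void (*entry)(void*) = vft[slot].fn;            // 2. load the function entry point
    void* self = reinterpret_cast<char*>(obj)
                 + vft[slot].offset;                // 3. late cast of the receiver
    entry(self);                                    // 4. branch to the entry point
}
```

Steps 1, 2 and 3 are exactly the memory loads whose latency the rest of the section attributes the overhead to.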


Class Declarations:

    class A {
      public:
        virtual foo();
        virtual fee();
        int x;
    };

    class B : public A {
      public:
        virtual fee();
        int y;
    };

Virtual Function Calls:

    A *obj1 = new B;
    A *obj2 = new B;
    ...other code....
    obj1->foo();
    obj2->fee();

(a) Example class hierarchy and its use

Object layouts (obj1 and obj2): [ vptr | int x | int y ]

Class B's VFT: [ A::foo(), A-offset ] [ B::fee(), B-offset ]

(b) The object and VFT layouts

Figure 1. In C++ syntax, an example use of a single inheritance class hierarchy, and the resulting object and VFT layouts based on a standard implementation.
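The declarations in Figure 1 omit return types. A compilable rendering of the same hierarchy (return types and function bodies are our additions, made only so the dispatch target can be observed) behaves as the VFT layout predicts: obj1->foo() resolves to A::foo and obj2->fee() to B::fee.

```cpp
#include <string>

// Figure 1's hierarchy with return types added.
struct A {
    virtual std::string foo() { return "A::foo"; }
    virtual std::string fee() { return "A::fee"; }
    virtual ~A() {}
    int x;
};

struct B : public A {
    std::string fee() { return "B::fee"; }  // overrides A::fee in B's VFT
    int y;
};

// Both calls are dispatched at runtime through B's VFT.
std::string call_foo(A* obj) { return obj->foo(); }
std::string call_fee(A* obj) { return obj->fee(); }
```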

In a single inheritance hierarchy, an object has a single VFT in which all virtual function addresses are stored, thereby simplifying step 1 by accessing the VFT via the vptr. Whereas, in a multiple inheritance hierarchy, an object may have multiple VFTs with different virtual function addresses. Consequently, an extra step known as an early cast may be required. In an early cast, the reference to the object through which the function is being called is adjusted at runtime to refer to a sub-object whose VFT contains an entry for the called function. For either single or multiple inheritance, the late cast (i.e. step 3) is required only when the compiler stores offsets in the VFT rather than "thunk" code. The difference is that the "thunk" code handles the casting of the object to another object, while an offset provides a relative position of an object's definition within the object layout [3].

The overhead incurred by the above steps becomes apparent when translating the steps into the instructions that will be executed in order to perform the virtual function call. Given in Figure 2 is the low-level intermediate representation (LIR) and the data dependency graph for a virtual function call under the HPL PlayDoh (HPL-PD) architecture [4]. As can be seen from this figure, the overhead of the virtual function call can be attributed to the load operations that are performed, and the dependencies between the load and the ALU operations (nodes 1, 4 and 5). In most architectures, the memory loads will incur the largest overhead in comparison to the ALU operations, and will most likely stall the instruction stream if other non-dependent operations cannot be scheduled during the load.

    1  r3 = ld r2           ;load the virtual function table (VFT)
    2  r4 = r3 + 12         ;calc the address for the late cast offset
    3  r5 = r3 + 20         ;calc the address of the function entry point
    4  r6 = ld r4           ;load the late cast offset
    5  r7 = ld r5           ;load the function entry point
    6  intp1 = r2 + r6      ;perform the late cast
    7  r8 = pbrr r7 1       ;prepare to branch
    8  ret_addr = brl r8    ;branch to the function entry point

(a) Low-Level Intermediate Representation

(b) Data Dependency Graph: nodes 1 through 8 correspond to the instructions above; the load of the VFT (node 1) feeds the address calculations (nodes 2 and 3), which feed the loads of the offset and entry point (nodes 4 and 5), which in turn feed the late cast (node 6) and the branch operations (nodes 7 and 8).

Figure 2. The HPL-PD instructions for a virtual function call, and the data dependency graph.

In contrast to a static function call, as shown in Figure 3, the number of instructions required to perform a virtual function call is greater, and consequently the performance is lower.

    r3 = pbrr _$fn_foo__!DFv 1  ;prepare to branch to the function
    intp1 = r2                  ;set the this pointer
    ret_addr = brl r3           ;branch to the static function address

Figure 3. The HPL-PD instructions for a static function call.

A relationship can be established between the runtime performance of these function calls by expressing the runtime of the virtual function call (T_v) in terms of the runtime of the static function call (T_s). Ideally, T_v would be the same as T_s; however, due to the extra instructions and the dependencies between these instructions, T_v will be larger by some delta. With this in mind, the relationship between T_v and T_s can be expressed as:

    T_v = T_s + Δ_v

(1)

The value of Δ_v is a factor of the extra instructions that are required for a virtual function call (IC_extra), the number of cycles to issue an instruction (CPI), and the clock rate of the machine. Hence, Δ_v can be expressed as:

    Δ_v = IC_extra × CPI × (1 / ClockRate)

(2)

and Equation 1 can be rewritten as:

    T_v = T_s + IC_extra × CPI × (1 / ClockRate)   (3)

As for the CPI, this value can be expressed as the sum of the ideal CPI and the number of pipeline stall cycles per instruction. In the case of the virtual function call, the data dependencies between the load and ALU operations will be the cause of the pipeline stalls, and can be categorized as Load Stall Cycles and ALU Stall Cycles. For the purpose of this study we do not consider the stalls associated with the branch delay, and assume that the branch operation for a static function call will consume as many cycles as with a virtual function call. With this simplification of the CPI, we can express it as:

    CPI = Ideal CPI + (Load Stall Cycles + ALU Stall Cycles) / IC_extra

(4)

Since the clock rate of a machine is constant, the only way to reduce the runtime overhead of virtual function calls (Δ_v) is to reduce the number of additional instructions (IC_extra) and/or reduce the cycles per instruction (CPI). We propose a technique below that will do exactly this, by utilizing an HPL-PD feature known as predication to eliminate unnecessary loads of the VFT.

1.2. Applying predication to virtual function calls

A common practice within OO programs is to partition functionality into small, re-usable functions, also known as methods. Consequently, the average number of calls to methods in OO programs is higher than the number of calls to functions in procedural programs [5]. However, given that these methods are called through a finite set of objects, one would expect that a number of the methods are called with like objects (i.e. instances of the same class). If this is the case, and the methods are virtual functions, then the calls to these functions will use the same VFT, as is shown in the example given in Figure 1(b). Therefore, if the compiler were able to interleave these calls, it could eliminate the redundant loads of the same VFT for each subsequent call after the initial call. This would reduce the number of load stall cycles (i.e. reduce the CPI) and reduce the pressure on the load/store units, which opens the opportunity to schedule other memory bound instructions. The compiler optimization presented in this study does exactly this by interleaving multiple virtual function calls for objects whose type cannot be determined statically, and applies predication to conditionally load the VFT when needed.

Predication supports conditional execution of individual operations based on boolean guards, which are implemented as predicate register values [6]. By use of predication, the compiler can transform a virtual function call so that certain operations (e.g. loading the VFT) are controlled by 1-bit predicate registers. The 1-bit predicate registers are read by the hardware during the instruction decode/register fetch cycle, and forwarded onto the execute cycle. Depending on the value of the qualifying predicate, the computed results of the guarded instructions are either applied towards the processor state, or discarded. As an example, Figures 4 and 5 show the non-predicated and predicated HPL-PD instructions, respectively, for two interleaved virtual function calls. These schedules are based on the example given in Figure 1(a), and on the assumptions given in Table 1. Note that the stalls shown in cycles S1 and S2 are due to structural hazards resulting from a single load/store unit with a latency that is dependent on the average memory access (AMA) time. Hence, the length of these stalls is variable, and in the case of the S2 stall, it may be eliminated if the memory latency is short (e.g. a memory latency of 2 cycles).

    Architecture        2-Way Issue VLIW
    Integer ALU Units   2 units with a latency of 1 cycle
    Load/Store Units    1 unit with a latency based on AMA time

Table 1. Assumptions applied to the schedules given in Figures 4 and 5

    Cycle      VLIW Instruction
    1          r3 = ld r2                       (load obj1's VFT)
    S1         ** Structural hazard stall **
    2+S1       r4 = r3 + 12   r5 = r3 + 20     (calc. address for obj1's offset & function entry pt.)
    3+S1       r10 = ld r9                      (load obj2's VFT)
    S2         ** Structural hazard stall **
    4+S1+S2    r6 = ld r4                       (load obj1's offset)
    5+S1+S2    r11 = r10 + 16   r12 = r10 + 24 (calc. address of obj2's offset & function entry pt.)
    6+S1+S2    ** Remaining instructions **

Figure 4. The schedule of non-predicated HPL-PD instructions for the two virtual function calls shown in Figure 1(a).

As shown in the schedule where predication is used, if the objects are of the same class type, then the redundant loading of the VFT is eliminated at cycle 2+S1. This would avoid the stall cycles at S2, and provide an opportunity to schedule another memory bound instruction in its place (i.e. in the S2 cycle). On the other hand, if the objects are of different class types, then all loads are performed and the length of the schedule is the same as when predication is not used.

In order to quantify the benefits of predication, we should first determine the cost of loading all VFTs, as would normally occur. In a schedule without the use of predication, the number of VLIW instructions required to load the VFTs


    Cycle      VLIW Instruction
    1          r3 = ld r2   p1,p2 = cmpr(r2,r9)  (p1 = 1, p2 = 0 iff r2 = r9; p1 = 0, p2 = 1 iff r2 != r9)
    S1         ** Structural hazard stall **
    2+S1       (p2) r10 = ld r9   (p1) r10 = r3  (reload obj2's VFT, or reuse obj1's)
    3+S1       r4 = r3 + 12   r5 = r3 + 20
    S2         ** Potentially Eliminated Stalls **
    4+S1+S2    r6 = ld r4
    5+S1+S2    r11 = r10 + 16   r12 = r10 + 24
    6+S1+S2    ** Remaining instructions **

Figure 5. The schedule of the predicated HPL-PD instructions for the two virtual function calls shown in Figure 1(a).

is ⌈V/N⌉, where V is the number of interleaved virtual function calls, and N is the maximum number of load operations allowed per VLIW instruction. Each VLIW instruction will complete the loading of the VFTs in the time period for which it takes to locate the VFTs within the memory hierarchy (i.e. the average memory access cycles). This is a factor of the hit cycles, miss rate and miss penalty at each level in the memory hierarchy. For simplicity, we will refer to this as the AMA cycles, which is based on the average memory access time as shown by Hennessy and Patterson [7]. Hence, the average number of cycles required to load the VFTs can be expressed as:

    VFTLoadCycles = ⌈V/N⌉ × AMAcycles

(5)

and the AMA cycles for a 3-level memory hierarchy with an L1 and L2 cache and main memory, is:

    AMAcycles = HitCycles_L1 + MissRate_L1 × (HitCycles_L2 + MissRate_L2 × HitCycles_MainMemory)

By use of predication, the number of operations to load the VFTs is reduced by the cardinality of the set of virtual function calls (S) that use the same object type as the first virtual function call. In other words, once the first VFT is loaded, each subsequent function call using the same VFT can eliminate its load operation. Since the first virtual function call will load the VFT into the L1 cache, each eliminated load from the subsequent virtual function calls will reduce the overhead by the number of cycles required to access the L1 cache, assuming a nonblocking cache using a simple hit-under-one-miss scheme. We can express the number of cycles to load the VFTs with use of predication as an extension to Equation 5:

    VFTLoadCycles with predication = ⌈(V − |S|) / N⌉ × AMAcycles

(6)

Given the schedule shown in Figure 5, where the number of interleaved virtual function calls is 2, and the assumed latencies and miss rates for the different levels in the memory hierarchy are as given in Table 2, the average load cycles with predication is:

    VFTLoadCycles with predication = ⌈(2 − |S|) / 1⌉ × (2 + (0.2 × (7 + 0.12 × 35)))

    Level         Latency   Miss Rate
    L1 Cache      2         20%
    L2 Cache      7         12%
    Main Memory   35        -

Table 2. Example latencies and miss rates

In the case where the virtual functions are called with different types of objects, the set of virtual function calls (S) is empty. Whereas, if called with like objects, then S is {obj2->fee}. The average load cycles required in these two cases is 8.48 and 4.24 respectively. Since in some cases the load cycles can be masked by overlapping non-dependent instructions with the loads, the reduction of load cycles by use of predication cannot always be translated into a speedup for the virtual function call. However, when the load cycles cannot be fully masked, the enhanced speedup resulting from the use of predication is approximately the ratio of the average load cycles without predication to the average load cycles with predication. One would expect that as the number of interleaved virtual function calls using like objects increases, so will the enhanced speedup.

What is missing from the average load cycles with predication is the cost of performing the predication (i.e. setting the predicate registers and evaluating the predicated instructions). Given an architecture that would allow a maximum of N loads in a single cycle, where N > 1, the use of predication would replace N - 1 loads with compare-to-predicate operations (CMPR). Since the CMPR operation has a latency of 1 cycle, which is traditionally less than the latency of the replaced load operations, the impact of using predication thus far is a reduction in cycles. The N - 1 replaced loads then become predicated loads, which can be scheduled in the cycle following the initial load. If any of these predicated loads are executed, then they would complete execution one cycle after the initial load, based on an average memory access time. Thus, the cost of using predicated loads when the object types do differ can be a single cycle. However, as shown in the schedule from Figure 5, the cost is 0 cycles when the load operation is not replaced but is merely predicated (i.e. N = 1), and the predicated load and assignment operations can be scheduled on the same cycle.

The other case to consider is when the objects are the same type, and the assignment of the VFT must be performed. This assignment cannot take place until the initial load has completed. In an architecture that allows N load operations in a single VLIW instruction, one can safely assume it will also support N assignment operations in a single VLIW instruction. Consequently, if all N - 1 predicated assignment operations were to be executed, they could be scheduled in the same cycle and would complete in a single cycle (assuming a latency of 1 for ALU operations). Interestingly enough, if there exists a mix of virtual function calls that do and do not use the same object types, the predicated assignment operations can be scheduled on the last execution cycle of the predicated load operations since these operations are independent of each other. Hence, we can safely say that the cost of using predication in virtual function calls will be one cycle in the worst case, thereby modifying Equation 6 to be:

    VFTLoadCycles with predication = ⌈(V − |S|) / N⌉ × AMAcycles + 1

(7)

In terms of Equation 3, predication reduces the number of load cycles, which can potentially reduce the number of load stall cycles. The load stall cycles are used to compute the CPI, as shown in Equation 4. On the other hand, the instruction count, which is also used to compute the CPI, increases due to the additional compare-to-predicate and assignment instructions. However, the predicated instructions (i.e. the VFT load and assignment) may or may not be committed to the processor state. This can impact the CPI. Thus, to accurately reflect the CPI on architectures that support predication, a raw CPI and a useful CPI should be considered. The raw CPI reflects all instructions, regardless of whether they are predicated. The useful CPI reflects only those predicated instructions that are committed. For this study, we consider only the raw CPI, which is an upper bound on the actual CPI. With this in mind, we can express T_v with predication as:

    T_v with predication = T_s + ((IC_extra + 2) × (CPI − Eliminated Stall Cycles / (IC_extra + 2))) × (1 / ClockRate)

2 Other techniques

This application of predication towards virtual function calls is not limited to single inheritance hierarchies, as our example depicts. In fact, this concept can be further extended with multiple inheritance hierarchies to reduce the cost of the early cast. For instance, if two virtual functions are contained in the same VFT, and the objects with which these functions are called are siblings in the class hierarchy (i.e. they inherit from the same base classes), then both the early cast and the loading of the VFT can be eliminated for one of the calls.

Another application of predication to reduce the overhead of virtual function calls is to use it in conjunction with runtime class tests, also known as I-Call If Conversion [8]. This would convert the virtual function calls into static function calls, and reduce the branch mispredicts that are generally seen with this type of conversion.

3 Summary

To minimize the runtime overhead associated with virtual function calls, one must either reduce the number of operations required to perform a call, or reduce the CPI for the call. With a basic understanding of the mechanics of virtual function calls, one can see why they incur this overhead, and how this overhead might be reduced in light of newer architectural features. Predication is one such feature that could be used to conditionally eliminate loading the same VFT for multiple virtual function calls. By eliminating redundant load operations, we can minimize the time spent accessing the memory system and reduce potential stalls in the instruction stream. If the load operations cannot be reduced, then the cost of using predication, in the worst case, is a single cycle. We believe this cost is worth the potential savings when considering the nature of OO programs, and the savings that can be obtained in light of the growing performance gap between the processor and memory systems.

References

[1] K. Driesen and U. Hölzle. The direct cost of virtual function calls in C++. In Proceedings of the Conference on Object-Oriented Programming Systems, Languages and Applications (OOPSLA '96), October 1996.

[2] M. A. Ellis and B. Stroustrup. The Annotated C++ Reference Manual. Addison-Wesley, 1990.

[3] H. Srinivasan and P. Sweeney. Evaluating virtual dispatch mechanisms for C++. Technical report, IBM Research Division, Jan 1996.

[4] V. Kathail, M. Schlansker, and B. Ramakrishna Rau. HPL PlayDoh architecture specification: Version 1.0. Technical report, Hewlett-Packard Computer Systems Laboratory, HPL-93-80, Feb 1994.

[5] B. Calder, D. Grunwald, and B. Zorn. Quantifying behavioral differences between C and C++ programs. Journal of Programming Languages, 2:4, 1994.

[6] P. Y. Hsu and E. S. Davidson. Highly concurrent scalar processing. In Proceedings of the 13th International Symposium on Computer Architecture, pages 386–395, June 1986.

[7] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. 2nd Edition, Morgan Kaufmann, San Francisco, CA, 1996.

[8] B. Calder and D. Grunwald. Reducing indirect function call overhead in C++ programs. In ACM Principles and Practices of Programming Languages, Portland, Oregon, 1994.


Hardware Support for Profiling Java Programs

[email protected] [email protected]

Abstract

Assuming the Java version of a program provides good performance, many programmers are interested in using Java as a replacement for many traditional programming languages because of the portability of Java and the extensive runtime libraries. However, in many cases the performance of the Java code requires improvement before it is acceptable. Profiling provides an effective means of identifying the sections of code that consume the most processing time and are the best candidates for optimization.

A prototype low-overhead, time-based profiling system has been developed for the Kaffe Java Virtual Machine's (JVM) Just-In-Time (JIT) i386 translator using the high-resolution timestamp register of the Intel Pentium processor. Experience with this approach suggests that a "virtual time" register would be a useful addition to the processor to simplify measuring the performance of multithreaded programs. Direct user control of the performance monitoring hardware would reduce the cost of measuring multiple performance metrics on a per-method basis.

1. Introduction


2. Instrumentation Architecture

3. Results


Table 1: Comparison of Two Methods of Measuring Benchmark Runtime.

Table 2: Comparison of coverage of JVM runtime by profiler.

Table 3: Top 20 times for HelloWorldApp.


Table 4: Volano measurements.

4. Conclusion

Acknowledgments

Bibliography



Session 4

Microarchitectures for Java


VLSI Architecture Using Lightweight Threads (VAULT) - Choosing the Instruction Set Architecture

I. Watson, G. Wright, A. El-Mahdy
University of Manchester, UK.

[email protected], [email protected], [email protected]


Abstract

The VAULT project is concerned with the design of a ‘multi-processor on a chip’ aimed specifically at multi-threaded Java implementation. It has wide ranging aims that require research in a variety of hardware and software areas. The project is still in its early stages and most of the work is still to do. This paper provides an overview of the project as envisaged currently and then examines some of the initial work in detail. In order to perform a comprehensive evaluation of the VAULT approach, it was thought necessary to perform detailed instruction level simulation. A single CPU structure therefore needed to be defined to form the basic building block of the system and hence the simulator. One of the first decisions needed was the basic Instruction Set Architecture of the CPU. The reasons behind the choice are examined using results obtained by detailed instrumentation of the Java Virtual Machine and the running of a variety of Java benchmarks.

1. Introduction

The VAULT project has two distinct but closely connected aims: firstly to explore the potential of future VLSI to produce a ‘multi-processor on a chip’ and secondly to provide support for high performance Java based systems of the future.

The combination of these two issues is neither accidental nor arbitrary. The case for considering single chip multi-threaded and/or multi-processor structures to utilize future VLSI technology is well known. However, most proposals in this area have either been extensions of current super-scalar designs or integrated versions of conventional multi-processors [1]. Concentration on the Java environment allows us to consider new approaches which can optimize the software performance, while at the same time benefit from the natural multi-threaded structure, to produce high performance, low power hardware.

The environment is significantly different from that required to support serial or parallel C or FORTRAN based code for which current processors are optimized. The most obvious differences are:-

• Execution by dynamic compilation of bytecode.
• Object oriented code with heavy use of method calling, dynamic binding, object access (via indirections) and garbage collection.
• Dynamic linking and loading of objects.
• Program level multi-threading.
• Particular importance of multi-media applications.

It is believed that hardware structures can be tailored to these needs with a resulting performance increase. It is also important to note that many such systems will need to be portable and hence power dissipation is a significant issue.

It is assumed that compatibility with previous hardware is not necessary (this is probably essential to permit the investigation of novel ‘on-chip’ mechanisms). Dynamic binary translation can be used to run ‘legacy’ software if necessary.

This paper outlines the overall features of the VAULT architecture, which are aimed at realizing high performance for the Java environment. However, the project is at an early stage and much of the detail is still being explored. The major detail of the work described here is concerned with the selection of an Instruction Set Architecture (ISA). A selection of programs from the JavaSPEC [2] benchmarks has been used to analyse the various ways in which bytecode can be executed and the resulting overheads which occur. This analysis suggests that a register windows based CPU would provide optimum performance. Assuming that this approach is followed, the final section describes a more detailed analysis of the benchmark execution in order to ascertain the numbers of registers and windows necessary to achieve that performance.

2. Project Principles

The following are the principles and features which are guiding our research. This is currently at an early stage so that the detail of some of these issues has not yet been explored.

1. Parallelism through multiple simple CPUs rather than exploiting ILP etc.



• Natural exploitation of threaded parallelism - potentially higher performance than the alternative of time sliced execution on a super-scalar CPU as thread count increases.
• Complexity/performance ratio per CPU lower leading to more efficient silicon use.
• Power/performance ratio per CPU lower - particularly important for portable computing.
• Simple processors imply significantly reduced design effort.

2. Processor structure optimized for support of (dynamically) compiled Java and multiple thread support.

• Register windows structure for efficient support of frequent subroutine (method) calls.
• Multiple register set ‘heap’ with hardware assisted allocation and spilling to allow frequent thread switching within a CPU.
• Caching structure tailored to object accessing and dynamic binding.

3. Inter processor communication at register and cache level via special bus structures.

• On-chip communication will enable specialized communication paths to be implemented at high speed.
• Support for very lightweight thread creation. Use of dynamic thread creation for ‘real’ parallelism (loops etc.) within Java level threads for higher parallel performance (e.g. for multi-media computations).
• Support for dynamic load balancing. Rapid exchange of load information and rapid task creation minimize ‘feedback system instability’.
• Ability to query remote caches a significant aid to task placement.

4. Processor support for multi-media applications.

• Multi-dimensional cache structures.
• Multi-media processor functions?

5. Dynamic compilation (for parallelism).

• Dynamic compilation for optimal use of single processor structure (Hot Spot etc.) [3].
• Decisions on size of parallel tasks best left until runtime.
• Decisions about how to divide (e.g. data sets) best left until run time.
• Data representations can be altered dynamically? (e.g. tree structured arrays)

There have been a number of architectures proposed for the efficient execution of object-oriented languages, most of them aimed at Smalltalk. The SOAR [4] processor had a register windows structure but these were not organized as a freely allocated heap like VAULT. Most of the other SOAR features such as tagged data are not applicable in a Java environment. The Mushroom [5] project examined novel memory mechanisms for object accessing and caching which may be relevant to VAULT.

We have (very) recently become aware of some of the details of the Sun MAJC architecture [6]. This appears to share many of its higher level aims with VAULT. The major differences appear to be at the individual processor level where they propose the use of multiple function units and a VLIW ISA.

3. Choosing the VAULT CPU ISA

It is tempting to think that the correct way to approach the design of hardware to execute Java efficiently is to produce a stack based CPU. This might either implement the full functionality of the Java Virtual Machine in hardware or at least provide a simple stack based ISA into which the bytecodes can readily be translated.

It was felt that the first of these options was against basic RISC philosophy and was thus unlikely to lead to optimum performance. The picoJava [7] project has already investigated the second approach. It uses stack caching techniques coupled with hardware supported instruction folding to overcome some of the inherent disadvantages of a stack based ISA concerned with the movement and duplication of operands. Such a processor is particularly suited to embedded applications where it requires minimal software support. A number of other projects have also studied similar hardware techniques [8][9].

We were not convinced that, if one assumed the use of sophisticated JIT or dynamic compilation techniques, a stack based structure would outperform a more conventional register based ISA. However, due to the heavy use of method calling in many Java applications, we thought that the use of register windows might be beneficial. Although it is, of course, possible to study Java implementations on real processors which exhibit the various design alternatives, we felt that there was a need to perform an evaluation using a common methodology. We therefore studied a number of different ISAs together with appropriate software support using a variety of Java benchmarks.



In order to produce a useful comparison, it was necessary to postulate some simple cases which represented distinct points in the design space. We chose to compare the following models of execution:-

• Bytecode - A direct execution of Java bytecode assuming no optimization.
• Folding_2 - Java bytecode execution assuming the folding optimizations used in the picoJava-1 processor (up to two instructions are folded).
• Folding_3 - Extends the Folding_2 model to consider 3 instruction folding.
• Folding_4 - Java bytecode execution assuming the folding optimizations used in the picoJava-2 processor (up to four instructions are folded).
• Reg - A simple register based ISA together with a ‘state of the art’ register allocation algorithm. We chose to base this on the techniques described for the Cacao compiler [10] as these seemed to represent an efficient but simple mechanism.
• Regwin - A register windows based ISA again using the Cacao algorithm.

We have implemented these models and added them to the Sun JDK JavaVM through which the JavaSPEC benchmark programs are analyzed. (The _227_mtrt program was omitted as it is a multithreaded program and we are concerned here with simple single thread characteristics.)

The nature of the benchmarks is described briefly in Table 1.

The results are summarized in Figures 1 and 2. Figure 1 shows the dynamic bytecode execution frequencies for various bytecode classes; constant loads (const), local variable load/store (local), array load/store (array), stack operations (stack), arithmetic/logic (ALU), conditional/unconditional branches (branch), field load/store (field), method invocation (invoke) and other (ow). The results assume the Bytecode execution model. The high occurrence of const, local and stack operations, contributing around 50% of the total instruction count, is responsible for the stack overhead.

Figure 1. Dynamic instruction mix.

Figure 2 shows the number of instructions executed by each benchmark program for the six execution models. Folding_2 reduced the number of instructions by a maximum of 17% (for _209_db) and a minimum of 6% (for _202_jess); the average is 12%. Folding_3 and Folding_4 contribute another 3% at most, with an average of 1%.

Reg worked better than the folding models for four of the programs; reductions ranged from 26% to 44%. For the other two programs (_202_jess, _213_javac) there is a smaller reduction of 14%. As can be seen from Figure 1, these programs have nearly double (2, and 1.7 respectively) the number of method calls of the other programs, increasing the relative call overhead.

Figure 2. Relative instruction counts

Regwin was able to outperform all other models. This was expected observing, from Figure 1, that method calls account for between 1% and 5% of the executed instructions. Regwin reduced the number of instructions by at least 40% for five programs. The remaining program (_228_jack) showed only a 29% reduction. This is attributed to the rela-

Table 1: Benchmark Programs

Program          Description
_201_compress    Lempel-Ziv compression
_202_jess        Java Expert Shell System
_209_db          Database functions
_213_javac       JDK 1.02 Compiler
_222_mpegaudio   Decompress MPEG-3 audio
_228_jack        Java version of yacc



tively frequent stack (at least 8%) and infrequent local and ALU operations. This suggests less sharing of local values, and hence the benefit of using registers is decreased.

We have not considered the effects of 'in-lining' in our study. This would undoubtedly narrow the gap between the Reg and Regwin results for some applications. However, there are limits to which in-lining techniques can be used for deeply nested programs, and it is inapplicable in the general cases of both recursion and dynamic binding. It can be argued that most of the JavaSPEC benchmarks are not representative of programs written using the full Object Oriented style, where the above issues will be of increased relevance. We therefore believe that achieving optimum performance on real method calls is important and hence our approach is justified.

4. Register and Window Usage

The previous analysis assumed an unlimited supply of registers and register windows. We re-instrumented the Sun JDK JavaVM to count the number of local variables accessed. Figure 3 shows the accumulated percentage of local variable usage assuming that all local variables are mapped into registers.

Figure 3. Cumulative percentage of register usage

The x-axis shows the number of registers required on a base 2 logarithmic scale (i.e., the number of bits required to encode a local variable). The most register hungry application (_222_mpegaudio) has a 'knee' at 13 registers covering 91% of variable accesses. The next significant 'knee' is at 30 registers, where several applications achieve 95% usage. This indicates that the decision is between 16 and 32 registers, although a more detailed study of dynamic register usage in the presence of dynamic register allocation algorithms is required before a final decision is reached.
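The coverage question behind Figure 3 can be sketched as follows. This is our own reconstruction, not the instrumented JDK code, and it uses one simple allocation variant (hottest variables first) on a hypothetical access profile.

```java
import java.util.Arrays;

// Sketch: given access counts per local-variable index, how many registers
// are needed to cover a target fraction of all accesses?
public class RegisterCoverage {
    static int registersFor(long[] accessCounts, double target) {
        long[] sorted = accessCounts.clone();
        Arrays.sort(sorted);                        // ascending
        long total = 0;
        for (long c : sorted) total += c;
        long covered = 0;
        int regs = 0;
        for (int i = sorted.length - 1; i >= 0; i--) {   // hottest first
            covered += sorted[i];
            regs++;
            if ((double) covered / total >= target) return regs;
        }
        return regs;
    }

    public static void main(String[] args) {
        long[] counts = { 500, 300, 100, 60, 40 };  // hypothetical profile
        System.out.println(registersFor(counts, 0.90)); // 3 (900 of 1000 accesses)
    }
}
```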

In order to determine the number of register windows, it is necessary to study the method call depth. We extended our instrumentation to provide this information. Figure 4 shows the call depth distribution for the various benchmark programs. The x-axis is the actual call depth, while the y-axis is the percentage of total instructions which get executed at that depth. The absolute value of the call depth is of minor importance; in fact the offset of 13 in Figure 4 is due to a set of 'wrapper' methods around the benchmark suite which are executed initially. These will, of course, require a register window allocation and register 'spills' if the window total is exceeded, but this will only occur once. The important characteristic is the width of the profile. Programs such as _209_db with a very narrow profile indicate that execution takes place with very shallow nesting, while a broad profile like _202_jess indicates deep nesting.

Figure 4. Call depth distribution

Flynn [11] suggests that a useful measure of the call depth of programs is the relative call depth, defined as the average of absolute differences from the average call depth. Table 2 shows the relative call depth for the benchmarks used. This clearly distinguishes the different characteristics which are apparent from the graphical profile. However, it does not give an accurate figure for the actual number of windows needed.
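The relative call depth defined above is just a mean absolute deviation. A minimal sketch, with a hypothetical per-instruction depth trace rather than the paper's measured data:

```java
// Flynn's relative call depth: mean absolute deviation of the
// per-instruction call depth from its mean.
public class RelativeCallDepth {
    // depths[i] is the call depth at which instruction i executed.
    static double relativeCallDepth(int[] depths) {
        double mean = 0;
        for (int d : depths) mean += d;
        mean /= depths.length;
        double dev = 0;
        for (int d : depths) dev += Math.abs(d - mean);
        return dev / depths.length;
    }

    public static void main(String[] args) {
        // A flat profile (shallow nesting) gives a small relative call depth.
        System.out.println(relativeCallDepth(new int[] { 3, 3, 3, 3 })); // 0.0
        System.out.println(relativeCallDepth(new int[] { 1, 3, 5, 7 })); // 2.0
    }
}
```

Note that a constant offset (such as the wrapper methods at depth 13) shifts the mean but leaves the relative call depth unchanged, which is exactly why the measure captures profile width rather than absolute depth.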

A more accurate estimate of the register window requirements is necessary before design decisions can be made. Our instrumentation was yet further extended to study the way in which the provision of windows affected the execution.

We simulated the benchmarks with varying numbers


Table 2. Relative Call Depth

Benchmark             201   202   209   213   222   228
Relative Call Depth   0.16  5.88  0.54  7.40  1.70  5.67



windows and measured the miss ratio, defined as the ratio of the number of window accesses resulting in window overflow or underflow to the total number of window accesses. A window access takes place twice per method call, once on entry and once on exit. The results are shown in Figure 5.

As expected from the relative call depth figures, two of the benchmarks require only a small number of windows and, for them, two might suffice. However, to achieve a low miss ratio (less than 0.02) for all benchmarks, eight register windows appear to be necessary. To emphasize: at this level, over 98% of all method calls would not require register spilling.
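The miss-ratio measurement can be sketched as a small simulation. This is our own reconstruction of the model behind Figure 5 (resident windows tracked as a contiguous depth range), not the instrumentation code itself:

```java
// Window miss-ratio simulation. trace: +1 = method call, -1 = return.
// Each event is one window access, so accesses = trace.length.
public class WindowSim {
    static double missRatio(int[] trace, int numWindows) {
        int depth = 0, base = 0, misses = 0; // windows resident for depths base..depth
        for (int ev : trace) {
            if (ev > 0) {                    // call: needs a new window
                depth++;
                if (depth - base >= numWindows) { base++; misses++; } // overflow: spill
            } else {                         // return: needs the caller's window
                depth--;
                if (depth < base) { base = depth; misses++; }         // underflow: fill
            }
        }
        return (double) misses / trace.length;
    }

    public static void main(String[] args) {
        // Nesting to depth 3 and back with only 2 windows:
        // 2 spills on the way down, 2 fills on the way up.
        int[] trace = { +1, +1, +1, -1, -1, -1 };
        System.out.println(missRatio(trace, 2)); // 4 misses / 6 accesses
        System.out.println(missRatio(trace, 4)); // 0.0
    }
}
```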

From these experiments, we believe we have determined that a configuration of eight register windows, each containing 16 or 32 visible registers, would be sufficient to achieve good performance.

Figure 5. Window miss ratios

It is thought that this configuration is of acceptable complexity for the processor design being considered.

5. Conclusions

This paper has presented an overview of the VAULT project, followed by the details of the choice of an ISA.

The design issues are outlined together with possible solutions currently being investigated. The most important features of VAULT are thought to be:-

• Simple CPU structure with optimized support for Java-like languages.

• On-chip multi-processor structure with fast thread synchronization facilities.

• Support for multi-media processing.

Initial results concerning the ISA design are presented which have demonstrated that an ISA using register windows results in a dramatic reduction of the stack overhead, far better than can be achieved by folding techniques. The register windows structure is important for reducing method call overheads.

We have verified by detailed simulation that the number of registers and register windows required to achieve good performance is modest and therefore consistent with the level of hardware complexity envisaged.

The work done so far has enabled us to narrow the design space to a level where we are able to embark on the construction of an instruction level simulator for a VAULT CPU. This work, together with compilation routes from Java bytecode (and C), is almost complete. Using this, we will be able to perform a very accurate verification of the design decisions presented above and make any adjustments to the CPU structure which are necessary. The next stage is to extend the simulation to the multi-processor structure in order that we can start to study the full potential of the VAULT approach.

6. References

[1] Doug Burger and James R. Goodman. Billion-transistor Architectures. IEEE Computer, 30(9):46-49, September 1997.

[2] SPEC JVM98 Benchmarks, Standard Performance Evaluation Corporation. http://www.spec.org/osg/jvm98.

[3] David Griswold. The Java HotSpot Virtual Machine Architecture. White paper, Sun Microsystems, March 1998. http://java.sun.com/products/hotspot/whitepaper.html.

[4] David M. Ungar. The Design and Evaluation of a High Performance Smalltalk System. ACM Distinguished Dissertation. MIT Press, Cambridge, Massachusetts, 1987.

[5] I.W. Williams, M.I. Wolczko and T.P. Hopkins. Dynamic Grouping in an Object Oriented Virtual Memory Hierarchy. Proceedings ECOOP 1987, pages 87-96.

[6] Introduction to the MAJC Architecture. Sun Microsystems, August 1999. http://www.sun.com/microelectronics/MAJC/documentation/majcintro.html

[7] Harlan McGhan and Mike O'Connor. PicoJava: A Direct Execution Engine for Java Bytecode. IEEE Computer, 30(9):79-85, September 1997.

[8] N. Vijaykrishnan, N. Ranganathan, and R. Gadekarla. Object-Oriented Architectural Support for a Java Processor. In E. Jul, editor, Proceedings of the 12th European Conference on Object-Oriented Programming, number 1445 in Lecture Notes in Computer Science, pages 330-354. Springer, July 1998.

[9] Patriot Scientific Corporation. PSC1000 Microprocessor. http://www.ptsc.com/psc1000/index.html.

[10] Andreas Krall. Efficient JavaVM Just-In-Time Compilation. In Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques, Paris, France, October 1998.

[11] Michael J. Flynn. Computer Architecture: Pipelined and Parallel Processor Design. Jones and Bartlett, Boston, 1995.



A Two Step Approach in the Development of a Java Silicon Machine (JSM) for Small Embedded Systems

H. Ploog · R. Kraudelt · N. Bannow · T. Rachui · F. Golatowski · D. Timmermann
Department of Electrical Engineering and Information Technology
University of Rostock, Germany
E-mail: [email protected]

Abstract

In current solutions, a Java Virtual Machine executes Java byte code by interpretation or dynamic compilation. To increase execution performance, we report our experiences in the development of a processor architecture that can directly execute JavaCard 2.0 compliant byte code.

1. Introduction

Java has been developed for desktop and internet-based systems, but there are several implementations in the embedded area where specific Java advantages can be reused, so the idea behind Java and its benefits is successively being transported into the embedded-systems industry. In this paper we focus on static embedded systems in which there is no need for dynamic class loading during runtime (e.g., car radio, cola machine, smart card). But the user of such systems also has the possibility to choose from a set of different applications.

To execute Java byte code on a platform there are at least three possibilities, which differ in terms of execution speed:

• Interpreted execution: The byte code is interpreted by software.

• Compiled execution: In an additional step the Java byte code is compiled to a specific processor architecture so that there is no runtime overhead for interpretation. It is also possible to compile the byte code just in time (JIT), the so-called dynamic compilation technique.

• Direct execution (JSM): Java byte code can be directly executed by a real Java processor.

By real Java processor we mean a processor not only optimized for executing Java code, but capable of executing Java code without a software-implemented Java Virtual Machine.

In the last few years the importance of SUN's Java technology has risen and it has become a topic of many university and commercial research programs, but only a small number of projects dealing with the development of a Java Silicon Machine (JSM) are known.

Many of the existing chips [7],[8] remap the Java byte code to a new (reduced) instruction set to speed up the performance. Glossner et al. described a system which is based on the same idea, but they also used a multithreading architecture [3]. Another theoretical description of a Java processor architecture can be found in [6]. At ETH Zurich an ongoing project is called JAMA [4]. It seems that this processor will be designed for direct execution of Java byte code.

To our knowledge, by now only three (real) Java processors are known to exist: picoJava-I [12], microJava (picoJava-II), and JEM-1 [13]. The latter is designed by Rockwell Collins Inc. and the former by SUN itself. Since microJava is based on the intellectual property module picoJava-II, one anticipates an increasing number of custom Java processors.

In this paper we describe our experiences in the development of a JSM for small embedded systems like smart cards.

The project is split into two parts. The first part is a software-based Java Virtual Machine suitable for 8051-processor-like systems, used for architecture exploration, and part two is the JSM itself. The second part is still under development, as it is currently work in progress.

2. Differences between Java and Java for smart cards

A smart card is a single-chip computer based on an 8-bit microcontroller. The two most commonly used chips are Motorola's 6805 and Intel's 8051. These systems


contain three different types of memory: RAM, EEPROM and ROM. The RAM is only used to store intermediate results during calculation. The EEPROM holds private cardholder values such as a private encryption key or a bank account number. The ROM is used to store the program that runs on the smart card. Smart cards are connected via five pins to the smart card reader, so these systems are "on" only if they are inserted into a reader.

The size of the die is constrained to 25 mm². Therefore memory space is severely limited. On average, these systems contain 4 to 20 Kbytes of ROM, 0.1 to 1 Kbytes of RAM, and up to 10 Kbytes of EEPROM.

The clock frequency is typically about 3.57 to 5 MHz, and an external clock has to be supplied. Smart cards can be seen as special cases of embedded systems.

SUN proposed a specification for the usage of Java on smart cards [11]. Because of the limited memory on smart cards, SUN had to remove some memory-consuming opcodes and features, like

• string manipulation,

• floating point arithmetic and

• threads.

Neither the types char, float, double and long, nor operations on those types, are supported. Smart cards also do not support arrays with more than one dimension. Object usage is limited, as there is no <clinit> method.

Besides these obvious modifications there are other restrictions resulting from the behavior of a smart card. Java Card systems are not able to load classes dynamically. All classes used are masked into the ROM of the card during manufacturing. Installing through a secure installation process after the card has been delivered to the smart card producer is possible too. Programs executing on the card may only refer to classes which already exist on the card, as there is no way to download classes during the normal execution of application code. For more details see [11].

Due to the lack of memory space on smart cards, the JVM has to be split into two parts: one for offline preparation, such as loading, resolving, and linking all class files, and one for online execution of the Java byte code, as shown in Figure 1.

Figure 1. Separation of the Java Virtual Machine into two parts

During preparation the applets are converted to applet images which can be directly executed on the smart card.

In conventional JVMs the byte code is verified before it is executed. In the smart card area the Byte-Code Verifier is part of the offline block due to its time and area consumption. Therefore, a downloaded applet is not verified during runtime. To guard against illegally operating applets, each applet is signed with a digital signature (CAP-file), so the card itself can determine whether the applet belongs to an environment it trusts or not.
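The sign-off-card/verify-on-card principle can be illustrated with the standard java.security API. This is a conceptual illustration only: the card-side check is a much smaller hand-coded routine, and the algorithm names here are our choice, not the paper's.

```java
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

public class CapSignatureDemo {
    // Sign an applet image with a freshly generated issuer key pair,
    // then verify it, returning the verification result.
    static boolean signAndVerify(byte[] capImage) throws Exception {
        KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
        gen.initialize(2048);
        KeyPair issuer = gen.generateKeyPair();   // the trusted environment's keys

        // Off-card: the trusted environment signs the applet image.
        Signature signer = Signature.getInstance("SHA256withRSA");
        signer.initSign(issuer.getPrivate());
        signer.update(capImage);
        byte[] sig = signer.sign();

        // On-card: check the signature before accepting the applet,
        // since no byte code verifier runs at execution time.
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(issuer.getPublic());
        verifier.update(capImage);
        return verifier.verify(sig);
    }

    public static void main(String[] args) throws Exception {
        byte[] image = "applet image bytes".getBytes(StandardCharsets.US_ASCII);
        System.out.println(signAndVerify(image)); // true
    }
}
```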

But even a signature is no guarantee that an applet is always working correctly [10]. Unfortunately, there is no way to protect the card against transitive Trojan Horses, but currently no such attacks are known. More details on Java Cards can be found in [16], [17] and [18].

3. Simulation model based on 8051

The first step in the development process was to build a software version of a Java Virtual Machine according to SUN's JavaCard Specification 2.0, which is suitable for implementation on 8051-processor systems [5]. It is a cleanroom implementation and was used to understand the basic behavior of the JVM.

Since the goal was a Java processor, some techniques used for describing hardware systems were applied to implement the Java Virtual Machine. E.g., the execution engine of the JVM has one big "switch-case" construct. It turned out that such a switch-case construct is not very well suited for a small-footprint implementation. The next step was to identify groups of opcodes so that function calls like ALU(opcode, A, B) or stackOp(opcode, value) can be used. Thereby we were able to encapsulate those function blocks.
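The grouping idea can be sketched as follows. The opcode values and the tiny opcode set are hypothetical (the real JavaCard opcode numbering differs); the point is that arithmetic is routed through one encapsulated ALU block, mirroring a structural hardware description.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a grouped execution engine with encapsulated function blocks.
public class EngineSketch {
    static final int OP_PUSH = 0x10, OP_ADD = 0x60, OP_MUL = 0x68; // hypothetical

    final Deque<Integer> stack = new ArrayDeque<>();

    // Encapsulated ALU function block: all arithmetic goes through here.
    static int alu(int opcode, int a, int b) {
        switch (opcode) {
            case OP_ADD: return a + b;
            case OP_MUL: return a * b;
            default: throw new IllegalArgumentException("bad ALU op " + opcode);
        }
    }

    // Encapsulated stack function block.
    void stackOp(int opcode, int value) {
        if (opcode == OP_PUSH) stack.push(value);
    }

    // Top-level dispatch: one small switch instead of one case per opcode.
    void execute(int opcode, int operand) {
        switch (opcode) {
            case OP_PUSH:
                stackOp(opcode, operand);
                break;
            case OP_ADD:
            case OP_MUL:
                int b = stack.pop(), a = stack.pop();
                stack.push(alu(opcode, a, b));
                break;
        }
    }

    public static void main(String[] args) {
        EngineSketch vm = new EngineSketch();
        vm.execute(OP_PUSH, 6);
        vm.execute(OP_PUSH, 7);
        vm.execute(OP_MUL, 0);
        System.out.println(vm.stack.peek()); // 42
    }
}
```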

The advantage is that the simulation model now looks similar to a structural description of a JSM, so that the hardware designer only has to make minor modifications.

On the other hand, the disadvantage is that function calls result in a small software overhead.

To avoid nondeterministic and time-consuming searches for classes, methods, or fields in the constant pool, it is recommended to have direct access to these structures. This can only be achieved if the addresses of all accessible objects are known in advance. Since resolved applets contain each class they need, each relative location is fixed. But instead of storing an absolute address, an applet-relative address is stored. In this way applets can be moved inside the persistent memory (relocatable applets) at the cost of an additional adder.

Firmware or device specific software is located in the application programming interface (API). On smart cards


the API functionality is located in ROM and the applets are stored in EEPROM. To select an API or applet method, the highest bit in the available address space is used as a selector (see Figure 2). This bit must be set by the converter while linking the class files. The base address of the currently active applet is stored in an applet base register. Native methods for direct hardware access are also located inside the API. These methods are necessary since different applets must be able to read data, e.g. a bank account number, or they have to increment the number of tries made to activate the smart card.
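The selector-plus-adder scheme of Figure 2 can be sketched in a few lines. A 16-bit address space and the base addresses are our own assumptions for illustration; the register names are also ours.

```java
// Sketch of the address computation in Figure 2: the highest bit selects
// API (=1) or applet (=0), the rest is a base-relative address.
public class AddressSelector {
    static final int SELECTOR_BIT = 0x8000;     // assumed 16-bit address space

    final int apiBase;                          // ROM base of the API
    int appletBase;                             // base of the currently active applet

    AddressSelector(int apiBase, int appletBase) {
        this.apiBase = apiBase;
        this.appletBase = appletBase;
    }

    // Resolve a linked, base-relative address to an absolute one.
    int resolve(int address) {
        int relative = address & ~SELECTOR_BIT;
        int base = (address & SELECTOR_BIT) != 0 ? apiBase : appletBase;
        return base + relative;                 // the "additional adder"
    }

    public static void main(String[] args) {
        AddressSelector mmu = new AddressSelector(0x0000, 0x4000);
        System.out.println(Integer.toHexString(mmu.resolve(0x8010))); // 10   (API)
        System.out.println(Integer.toHexString(mmu.resolve(0x0010))); // 4010 (applet)
        mmu.appletBase = 0x5000;  // applet relocated: the same linked code still works
        System.out.println(Integer.toHexString(mmu.resolve(0x0010))); // 5010
    }
}
```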

The implementation requires about 14 Kbytes of ROM on an 8051 system, including the cost of about 3 Kbytes for the additional modularity. Since this is a simulation model, we have not optimized the source code for the specified architecture.

Figure 2. Access to applet-image or API

4. Moving from software to hardware

A Java Card system is more than just a Java processor. Moreover, the virtual machine is part of the Java Card runtime environment (JCRE). The API, the executive (for handling different applets), and the native methods also belong to the JCRE.

For security reasons, Java does not offer any byte code for direct hardware access.

To access IO using a Java Virtual Machine on standard or dedicated processor architectures, an API function is called. Inside this function the Java environment is left to access IO with the underlying processor's IO opcode, e.g. mov port_adr, #val (see Figure 3).

Figure 3. IO-access with JAVA

When using a Java processor there is no environment which could be left. To overcome the IO-access problem in Java, the instruction set will be expanded. This is possible since two opcodes ($FE, $FF) in the opcode space are reserved for custom usage.

Due to a missing standard, these solutions are proprietary. On picoJava-II, about 33% of the implemented opcodes cannot be found in the Java specification.

Obviously, these new (hidden) opcodes cannot be generated using an ordinary Java compiler. Accessing these opcodes becomes somewhat difficult because a processor-specific compiler has to be used.

Moreover, applets compiled in such a way are no longer interchangeable, and one of the major benefits of Java is lost.

As soon as implementation details become public knowledge, it should not be too hard to write malicious code, i.e. Trojan horses. This cannot be accepted in the smart card area (revealing PINs and POSs).

It can be shown [9] that only two additional opcodes are necessary: IO-Read and IO-Write.

In the proposed JSM the address of the opcode is traced to avoid illegal IO access. Since application applets are stored in reprogrammable memory while the JavaCard runtime environment (JCRE) is located in ROM or in a different EEPROM, it is possible to trace the address of the opcodes to be executed, and a very small online checker can easily allow or prohibit the execution of the opcode and may generate an exception in case of a fault [2].
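The online checker's decision can be sketched as a pure function of the opcode and the address it was fetched from. The address ranges here are hypothetical placeholders, not the paper's memory map.

```java
// Sketch of the online checker: IO opcodes may only execute when fetched
// from the trusted JCRE region (hypothetical address ranges).
public class IoOpcodeChecker {
    static final int IO_READ = 0xFE, IO_WRITE = 0xFF;        // reserved custom opcodes
    static final int JCRE_START = 0x0000, JCRE_END = 0x3FFF; // ROM holding the JCRE

    // Returns true if executing 'opcode' fetched from 'pc' is permitted.
    static boolean permitted(int opcode, int pc) {
        if (opcode != IO_READ && opcode != IO_WRITE) return true; // ordinary opcode
        return pc >= JCRE_START && pc <= JCRE_END;  // IO only from trusted code
    }

    public static void main(String[] args) {
        System.out.println(permitted(IO_READ, 0x0100)); // true:  JCRE code
        System.out.println(permitted(IO_READ, 0x8100)); // false: applet EEPROM
        System.out.println(permitted(0x60, 0x8100));    // true:  iadd runs anywhere
    }
}
```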

Although IO access is required in different native functions, we only implemented IO-Read and IO-Write. Therefore we have no hardwired native functions, and the functionality is realized by software inside the JCRE.

Page 57: Workshop on Hardware Support for Objects And ...apt.cs.manchester.ac.uk/ftp/pub/apt/papers/ICCD_Java_99.pdfActive Disks[1, 22]) which aims to enhance the memory unit with computational

5. Java Silicon Machine

In Figure 4 the basic concept for the JSM is shown. Parts of it are already implemented using VHDL. For multi-applet cards one additional opcode has to be implemented: set_applet. It is used for selecting one of the applets and loads the applet base register with the corresponding start address of the chosen applet.

We have to adapt parts of the JSM to the new JavaCard specification 2.1, since the format of the downloadable applet is now specified and it has a great impact on some parts of the JSM.

The JSM is fully controlled by microcode. Therefore many changes may result in mere 'software' updates [1]. To achieve maximum independence between the submodules, the state machines are only loosely coupled. Once these state machines are started, they run automatically until the end of the requested operation and feed back the result. The internal behavior of one state machine is completely hidden from its connected state machines.

To speed up the proposed architecture it is important to know the dynamic probability of opcodes in a given applet. In [6] some values are given, but they are based on benchmarks for complete JVM/JSMs. In the smart card area the situation is different, so we are currently analyzing different applets to get the percentage distribution based on the dynamic instruction count. After the linking process the constant pool just contains constants of type integer. Those constants can only be addressed using the ldc or ldc_w opcode. So the index into the constant pool can easily be replaced with the constant itself. Consequently, we do not need a constant pool, which results in some speed-up and less memory usage. Benchmarking different architectures is difficult, since execution speed depends on the compiler and coding style of an applet. We benchmarked the Dallas iButton by measuring the time to call methods and compared the results with first theoretical values of our architecture.
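The constant-pool elimination described above amounts to a simple rewrite pass. The sketch below is our simplified model: code is held as (opcode, operand) integer pairs, whereas real ldc has a one-byte operand and ldc_w a two-byte operand.

```java
// Sketch of constant-pool elimination: after linking, every ldc index is
// replaced by the integer constant itself, so the pool can be dropped.
public class LdcRewrite {
    static final int LDC = 0x12;

    // Rewrite in place: code is a flat list of (opcode, operand) pairs.
    static void inlineConstants(int[] code, int[] constantPool) {
        for (int i = 0; i + 1 < code.length; i += 2) {
            if (code[i] == LDC) {
                code[i + 1] = constantPool[code[i + 1]]; // index -> constant
            }
        }
    }

    public static void main(String[] args) {
        int[] pool = { 1000, 2000, 3000 };
        int[] code = { LDC, 2, LDC, 0 };      // ldc #2; ldc #0
        inlineConstants(code, pool);
        System.out.println(code[1] + " " + code[3]); // 3000 1000
    }
}
```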

We assume that the iButton is a software implementation, since implementation details are not published. The underlying processor itself is driven by an unstabilized ring oscillator operating over a range of 10 to 20 MHz [15]. Therefore the clock frequency of an iButton is not constant. The comparison shows a speed-up of 100 against the iButton (clocking the JSM at 3.5 MHz).

Techniques like caching promise some speed-up, but at the cost of additional hardware [6]. Since die size is limited, we do not benefit from the usage of these speed-up techniques.

6. Conclusion and future work

We presented some of our experiences in the development of a JSM for small embedded systems like smart cards. We separated the development process into two parts. The experiences resulting from implementing the JVM on an 8051 system directed us to the proposed JSM architecture. We described some problems and presented solutions for a JSM in the smart card area. It is planned to complete the architecture by the end of the year '99. For functional verification, an APTIX system explorer M3PC containing four XCV1000-4 will be used.

Figure 4. Datapath and control-path of the JSM


7. References

[1] N. Bannow, "Concept of a Java-Processor," Technical Report, University of Rostock, Jan. 1999.

[2] F. Golatowski, H. Ploog, R. Kraudelt, O. Hagendorf, and D. Timmermann, "Java Virtual Machines für ressourcenkritische eingebettete Systeme und Smart-Cards," accepted for presentation at JIT'99, Frankfurt, Germany, Sep. 1999.

[3] C. J. Glossner and S. Vassiliadis, "The Delft-Java Engine: An Introduction," Euro-Par '97, Conf. Proc., pp. 766-770, August 1997, Passau, Germany.

[4] http://www.tik.ee.ethz.ch/~jama/

[5] R. Kraudelt, "Entwicklung und Implementierung einer JAVA virtuellen Maschine (JVM) für den Einsatz in besonders ressourcenkritischen Systemen (Smartcards)," diploma thesis, University of Rostock, 1999.

[6] Vijaykrishnan Narayanan, "Issues in the design of a JAVA processor architecture," PhD thesis, University of South Florida, 1998.

[7] Eric Nguyen, "JAVA-Based Devices from Mitsubishi," JavaOne, 1996, slides at: http://java.sun.com/javaone/javaone96/pres/Mitsu.pdf

[8] Patriot Scientific, Java Processor PSC1000.

[9] H. Ploog, T. Rachui, and D. Timmermann, "Design Issues in the development of a JAVA-processor for small embedded applications," ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA'99, Monterey, Feb 21-23.

[10] J. Posegga and H. Vogt, "Byte Code Verification for Java Smart cards based on Model Checking," 5th European Symposium on Research in Computer Security (ESORICS), Springer Verlag, 1998.

[11] JavaCard 2.0 Language Subset and Virtual Machine Specification, http://www.javasoft.com/products/javacard, Sun Microsystems, Inc., 1997.

[12] M. O'Connor and M. Tremblay, "picoJava-I: The Java virtual machine in hardware," IEEE Micro, March-April 1997, pp. 45-47.

[13] Rockwell, http://www.techweb.com/wire/news/1997/09/922java.html

[14] Dallas Semiconductor, http://www.ibutton.com

[15] Stephen M. Curry, "An introduction to the Java Ring," http://www.javaworld.com/javaworld/jw-04-1998/jw-04-javadev.html

[16] G. B. Guthery, "Java card: Internet computing on a small card," IEEE Internet Computing, Jan-Feb 1997, pp. 57-59.

[17] Michael Montgomery, "Get a jumpstart on the Java Card," http://www.javaworld.com/javaworld/jw-02-1998/jw-02-javadev.html, 1998.

[18] Zhiqun Chen, "Understanding Java Card 2.0," http://www.javaworld.com/javaworld/jw-03-1998/jw-03-javadev.html, 1998.


1. This research was supported through a grant from the Natural Sciences and Engineering Research Council of Canada (NSERC) to the second author.

Abstract

Java has emerged to dominate the network-programming world. This imposes certain requirements on its virtual machine instruction set architecture and on any processor design that intends to support Java. The purpose of this study is to carry out a behavioral analysis of the different aspects of the Java instruction set architecture. This will help in extracting the hardware requirements for executing Java bytecodes. Recommendations for architectural requirements for Java processors will be made throughout this study.

1. Introduction

Java was introduced to deal with heterogeneous networks, which require building software that is platform-independent [1,2]. This means that compiled software shipped around the network needs to be able to run on any CPU it lands on (which is formulated as "write once, execute anywhere"). In addition, designed to be a modern high-level language, Java includes all modern features like modularity and object-orientation. To achieve all these goals, Java targets an intermediate virtual platform, instead of direct execution on the host CPU [3,4,5,6]. All we need to execute cross-platform programs on the Internet is to port this virtual layer to the CPU/OS combination we want to run Java on. But this comes at a high price. The special features supported by Java have a tremendous impact on the overall system performance and impose certain requirements on the Java system [7].

A number of schemes have been proposed to improve Java performance as a tool for programming on the web and networking in general [8,9,10,11,12,13,14]. Some of the promising directions incorporate hardware solutions. Building microprocessors for Java, or simply modifying other general-purpose processors to boost Java, are among these hardware options [15,16,17,18,19,20,21,22,23,24].

Designing hardware for Java requires extensive working knowledge of its virtual machine organization and functionality. The Java virtual machine (JVM) instruction set architecture (ISA) defines categories of operations that manipulate several data types, reached through a well-defined set of addressing modes [25,26,27,28,29]. The JVM specification defines the instruction encoding mechanism required to package this information into the bytecode stream. It also includes details about the different modules required for processing these bytecodes. At runtime, the JVM implementation and the execution environment affect the instruction execution performance. This is manifested directly in the wall-clock time needed to perform a certain task and indirectly in the different overheads associated with executing the job (e.g., memory management) [30].

The goal of this research is to conduct a comprehensive behavioral analysis of the Java virtual machine instruction set architecture [31,32]. Observing the Java instruction set architecture while it executes Java benchmarks reveals a lot about the details of the Java environment. These observations translate into suggestions for hardware improvements and additions that boost the performance of Java. Revised encoding formats and hardware organizations (including appropriate levels of parallelism, pipelining, caching, functionality, etc.), informed by insights into the internal details of Java, will lead to better performance. Our rationale for conducting such a study is based on the simple observation that modern programs spend 80-90% of their time accessing only 10-20% of the instruction set architecture [33]. To be most effective, optimization efforts should focus on just the 10-20% that really matters to execution speed.
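The 80/20 observation above can be made concrete with a small sketch. Given an opcode-frequency histogram gathered from a bytecode trace, the snippet below finds how few opcodes account for 90% of dynamic executions. The opcode names and counts are illustrative assumptions, not measurements from this paper.

```python
from collections import Counter

def coverage_subset(freq, target=0.90):
    """Return the smallest set of opcodes whose combined dynamic
    frequency reaches `target` (as a fraction of all executions)."""
    total = sum(freq.values())
    covered, subset = 0, []
    # Greedily take opcodes from most to least frequent.
    for op, n in sorted(freq.items(), key=lambda kv: -kv[1]):
        subset.append(op)
        covered += n
        if covered / total >= target:
            break
    return subset

# Illustrative trace histogram (hypothetical counts).
trace = Counter({"iload": 500, "aload": 300, "iadd": 120,
                 "getfield": 50, "i2c": 20, "lmul": 10})
hot = coverage_subset(trace)
print(len(hot), "of", len(trace), "opcodes cover 90% of the trace")
# → 3 of 6 opcodes cover 90% of the trace
```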

An ISA study is of great importance for every attempt to devise an arrangement that boosts Java performance. The results collected here affect the way Java instructions are encoded into a binary representation for execution by any CPU supporting Java. They also affect the internal processor datapath design for any architecture that targets Java. It is worth noting that although the JVM ISA shares many general aspects with traditional microprocessors, it has its own distinguishing features. This stems from the fact that it is an intermediate layer for a high-level language.

Quantitative Analysis for Java Microprocessor Architectural Requirements: Instruction Set Design

M. Watheq El-Kharashi, Fayez ElGuibaly, Kin F. Li

Department of Electrical and Computer Engineering, University of Victoria

P. O. Box 3055, Victoria, BC, Canada, V8W 3P6
{watheq, fayez, kinli}@engr.UVic.CA

Workshop on Hardware Support for Objects and Microarchitectures for Java, ICCD'99 (apt.cs.manchester.ac.uk/ftp/pub/apt/papers/ICCD_Java_99.pdf)

For example, the branch prediction model of the underlying hardware affects overall Java performance, as is the case for all modern microprocessors. On the other hand, JVM-supporting hardware will be unique in the way its stack model handles method invocations.

To meet these objectives, a Java interpreter was instrumented to produce a Java trace. Pendragon Software's CaffeineMark 3.0 was selected as the benchmark because it is computationally interesting and exercises various JVM aspects [34]. It is a synthetic benchmark that runs nine tests to measure Java performance. The machine used in the evaluation is an UltraSPARC I/140 running at 143 MHz with 64 Mbytes of memory; the OS is Solaris 2.6. Based on the data gathered, general requirements for Java processors are drawn. In conducting this study, we followed the methodology used by Patterson and Hennessy in studying instruction set design [33].

This paper is organized as follows: Section 2 analyzes access patterns for data types. Addressing modes are studied in Section 3. Section 4 is concerned with the different instruction encoding aspects. Section 5 concludes the paper.

2. Access patterns for data types

Here we study the access patterns for the different data types. Data types that are heavily used need more attention when designing a hardware architecture to support Java [30]. This information will prove useful when decisions are made about storage allocation.

2.1. Single-type operations

Figure 1. Distribution of data accesses by type.

Figure 1 shows the distribution of data accesses by type. ("Generic" refers to operations that have no data type associated with them.) From this figure we see that integer data types dominate the typed operations, followed by reference types. Architectural support for Java object-orientation therefore needs to privilege integer and reference data types in hardware. Also from this figure, we see that 32-bit data types are used the most. This has an impact on the size of the register file and CPU datapaths. Furthermore, a superscalar design may want to provide multiple functional units that process integer and reference data types. From the ALU point of view, all integer operations need efficient support.
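A per-type histogram like Figure 1 can be approximated from a mnemonic trace because most JVM mnemonics encode their operand type in the first letter (i = int, l = long, f = float, d = double, a = reference, b = byte, c = char, s = short). The sketch below is a deliberate simplification (real opcode tables have exceptions such as `ldc` and `swap`, and the exclusion list here is only partial); the trace is illustrative.

```python
from collections import Counter

TYPE_PREFIX = {"i": "int", "l": "long", "f": "float", "d": "double",
               "a": "reference", "b": "byte", "c": "char", "s": "short"}

def access_type(mnemonic):
    """Classify a mnemonic by its type prefix; untyped ops are 'generic'.
    Simplified: a full implementation would use the complete opcode table."""
    t = TYPE_PREFIX.get(mnemonic[0])
    if t is None or mnemonic in ("dup", "goto", "nop", "ldc", "swap"):
        return "generic"
    return t

# Hypothetical mnemonic trace.
trace = ["iload", "iadd", "aload", "getfield", "iload", "dup", "fmul"]
hist = Counter(access_type(m) for m in trace)
print(hist.most_common())
# → [('int', 3), ('generic', 2), ('reference', 1), ('float', 1)]
```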

2.2. Type conversion operations

The JVM has a set of instructions that convert data of one type to another. This is necessary for a strongly typed language like Java. Figure 2 shows that conversion from integer dominates all type conversion operations (especially to character). This information, combined with the results from the previous subsection, implies that integer conversion operations are the most used. For better Java performance, the ALU design needs to perform this conversion in one clock cycle or less.
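The JVM's conversion mnemonics follow the form `<src>2<dst>` (e.g. `i2c`, `f2d`), so a conversion matrix like Figure 2 can be tallied directly from the trace. The trace below is an illustrative assumption.

```python
from collections import Counter

def parse_conversion(mnemonic):
    """Split an <src>2<dst> conversion mnemonic into its type pair."""
    src, dst = mnemonic.split("2")
    return src, dst

# Hypothetical conversion-instruction trace.
trace = ["i2c", "i2c", "i2b", "f2i", "d2l"]
matrix = Counter(parse_conversion(m) for m in trace)
print(matrix[("i", "c")])  # → 2
```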

Figure 2. Frequency of type conversion instructions.

3. Addressing modes

This section is concerned with the use of JVM addressing modes. The traditional concept of addressing modes, as used in general-purpose processors, is not exactly applicable to the JVM, which uses a stack-based intermediate language. This, together with the object-orientation approach, is reflected in a combination of traditional and non-traditional addressing modes [30].


The addressing mode usage patterns are shown in Figure 3. Local variable access dominates all other modes. Also important are immediate access and the quick reference. (The "Quick Reference" item summarizes the quick bytecode optimization.) Hardcoded addressing modes (in which the operand value is encoded in the instruction itself) account for more than one third of all addressing mode uses. We conclude that Java processors need to support at least the immediate, local variable, quick reference, and stack addressing modes.

Figure 3: Usage of memory addressing modes.

4. Instruction encoding

The JVM specification requires Java bytecodes to be provided as a stream of bytes grouped into variable-length instructions. However, it hardly mentions a general instruction format for the adopted instructions [1]. This irregular format might stand in the way of general and efficient hardware execution of Java bytecodes.

This section quantitatively analyzes the different fields that constitute JVM instructions. We aim at determining the optimum number of bits required for encoding these fields. The analysis presented here should not be considered as contradicting the JVM specification, which states exactly the size of each instruction field. Architectures that provide support for Java might choose a native instruction format that differs from the JVM one in its addressing modes, data types, etc. This approach helps attain generality and efficiency.

4.1. Immediates

As a stack machine, the JVM does not rely heavily on immediates for ALU operations. Immediates in the JVM are either pushed on the stack or used as an offset for a control flow or table switching instruction, and can be up to 32 bits long. As shown in Figure 4, immediates used in CaffeineMark are up to 18 bits in length, with an average of 4 bits (standard deviation of 2.07 bits). The peak occurs at 3 bits. Three bits are enough to cover more than 50% of immediate usage, and 5 bits can cover more than 75%. Based on the statistics shown in Figure 4, we suggest using 8 bits to encode immediate values in JVM instructions, which covers 98% of the cases. Situations that require more bits can be handled by the compiler using a special wide format.
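A measurement like Figure 4 reduces to computing, per immediate, the minimum bit width and then the cumulative coverage of a candidate field size. The sketch below uses an unsigned-width convention and an illustrative sample; neither is taken from the paper's data.

```python
def bits_needed(value):
    """Minimum unsigned bit width for a non-negative immediate
    (zero still needs one bit to encode)."""
    return max(1, value.bit_length())

# Hypothetical sample of immediate values from a trace.
sample = [0, 1, 3, 7, 10, 100, 255, 4000, 70000]
widths = [bits_needed(v) for v in sample]
# Fraction of the sample an 8-bit immediate field would cover.
covered_by_8 = sum(1 for w in widths if w <= 8) / len(widths)
print(widths, f"{covered_by_8:.0%} fit in 8 bits")
```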

Figure 4: Distribution of number of bits in immediates.

4.2. Array indices

Figure 5: Number of bits representing an array index.

Java technology carries array information down the hierarchy to the JVM. At run time, the index required to access an array is popped from the operand stack. Although the index can be up to 32 bits, Figure 5 shows that CaffeineMark does not require more than 13 bits to access arrays (with an average and standard deviation of 3 and 2.79, respectively). For instruction encoding, we see that a 3-bit index size covers more than 50% of the array accesses and 5 bits cover more than 80%. The graph shows an interesting pattern: the maximum occurs at 0 and then follows a decaying behavior with a sudden drop after 6, indicating that about 90% of the array accesses take place within the first 64 elements. This could give useful guidelines for data caching: the first few elements of an array (up to the 64th) should be cached. This could also have an impact on a Java processor's cache organization, such as block size, replacement strategy, etc.

4.3. Constant pool indices

The JVM constant pool is a collection of all the symbolic data needed by a class to reference fields, classes, interfaces, and methods. A constant pool index is up to 16 bits in size, or up to 32 bits in the wide format. From Figure 6, we see that 16 bits cover almost all constant pool accesses (the average is 8 bits and the standard deviation is 2.9).

Figure 6: Number of bits for a constant pool index.

4.4. Local variable indices

As mentioned before, instead of specifying a set of general-purpose registers, the JVM adopts the concept of referencing local variables. As Figure 7 shows, Java methods typically require up to 16 local variables (with an average of 2 and standard deviation of 1.31), though the specification allows referencing up to 255, or 65535 in the case of wide instructions. The graph also shows nearly no preference in accessing these variables. For hardware support of Java, this graph suggests allocating the local variables on chip. In this case, a general-purpose register file can be configured to work as a reservoir for local variables, allowing Java programs to run faster.
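The design point suggested above can be sketched as a simple mapping: local-variable slots below a cutoff live in an on-chip register file, and the rest spill to memory. The register-file size of 16 follows the Figure 7 observation; the function and names are hypothetical, not a description of any shipped Java processor.

```python
REGISTER_FILE_SIZE = 16  # Figure 7: methods typically use <= 16 locals

def locate(slot):
    """Map a JVM local-variable index to a register, or to a spill slot
    in memory once the on-chip register file is exhausted."""
    if slot < REGISTER_FILE_SIZE:
        return ("reg", slot)
    return ("mem", slot - REGISTER_FILE_SIZE)

print(locate(2), locate(20))  # → ('reg', 2) ('mem', 4)
```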

Figure 7: Number of bits for a local variable index.

4.5. Branching distances

Java bytecodes only deliver the offset in branches, and the JVM converts it internally to the corresponding absolute address. As Figure 8 (dots) shows, CaffeineMark requires a target offset width of less than 10 bits (with an average of 4 and standard deviation of 1.49), though up to 16 bits are allowed. In the design of an instruction format, 8 bits appear enough to cover more than 98% of the offset distances. Furthermore, if a branch target buffer (BTB) is used for branch speculation, a window of 512 bytecodes (256 forward and 256 backward) is sufficient. Figure 8 (triangles) also shows statistics for absolute jump-to addresses. Although this information depends on the run-time environment, it does indicate typical behavior.
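As a back-of-the-envelope check on the BTB sizing: a signed branch offset of b bits reaches 2^(b-1) bytecodes in each direction, so the less-than-10-bit offsets observed above (i.e. a 9-bit signed field) correspond to the 512-bytecode window quoted. This sketch is ours, not from the paper.

```python
def btb_window(offset_bits):
    """Reach of a signed branch offset of the given width:
    (backward reach, forward reach, total window) in bytecodes."""
    half = 2 ** (offset_bits - 1)
    return half, half, 2 * half

print(btb_window(9))  # → (256, 256, 512)
```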

Figure 8: Number of bits for an address.


5. Conclusions

In this work we conducted a behavioral analysis of the Java virtual machine instruction set architecture. In light of the results collected from each part, we drew conclusions about the general architectural requirements for designing microprocessors that support Java. Our study clearly shows that the advanced features of Java are its weakest points in terms of performance. Hardware support is required to increase the efficiency of Java.

Acknowledgement

The authors would like to acknowledge Sun for the Java development kit license. In addition, we would like to thank Kenneth Kent of the Department of Computer Science, University of Victoria, Canada, for his support while compiling the Java development kit.

References

[1] J. Gosling, B. Joy, and G. Steele, The Java Language Specification, The Java Series, Addison-Wesley, Reading, MA, 1996.

[2] J. Gosling and H. McGilton, "The Java Language Environment, A White Paper," Sun Microsystems, Mountain View, CA, October 1995.

[3] J. Gosling, "Java Intermediate Bytecodes," ACM SIGPLAN Notices, January 1995, pp. 111-118.

[4] M. Lentczner, "Java's Virtual World," Microprocessor Report, March 25, 1996, pp. 8-11, 17.

[5] T. Lindholm and F. Yellin, The Java Virtual Machine Specification, The Java Series, Addison-Wesley, Reading, MA, 1997.

[6] J. Meyer and T. Downing, Java Virtual Machine, O'Reilly and Associates, Inc., Sebastopol, CA, 1997.

[7] P. Wayner, "How to Soup Up Java: Part II," Byte, May 1998, pp. 76-80.

[8] B. Case, "Implementing the Java Virtual World," Microprocessor Report, March 25, 1996, pp. 12-17.

[9] T. Halfhill, "How to Soup Up Java: Part I," Byte, May 1998, pp. 60-74.

[10] C.-H. A. Hsieh, W.-M. W. Hwu, and M. Conte, "Compiler and Architecture Support for Java," tutorial presented at the ASPLOS-VII conference, Boston, October 1-5, 1996.

[11] C.-H. A. Hsieh, M. T. Conte, T. L. Johnson, J. C. Gyllenhaal, and W.-M. W. Hwu, "Optimizing NET Compilers for Improved Java Performance," IEEE Computer, June 1997, pp. 67-75.

[12] C.-H. A. Hsieh, J. C. Gyllenhaal, and W.-M. W. Hwu, "Java Bytecode to Native Code Translation: The Caffeine Prototype and Preliminary Results," Proc. of the 29th Annual International Symposium on Microarchitecture (MICRO-29), IEEE Computer Society Press, Los Alamitos, CA, December 2-4, 1996, pp. 90-97.

[13] H. McGhan and J. M. O'Connor, "picoJava: A Direct Execution Engine for Java Bytecode," IEEE Computer, October 1998, pp. 22-30.

[14] J. M. O'Connor and M. Tremblay, "picoJava-I: The Java Virtual Machine in Hardware," IEEE Micro, March/April 1997, pp. 45-53.

[15] B. Case, “Java Virtual Machine Should Stay Virtual,”Microprocessor Report, April 15, 1996, pp. 14-15.

[16] B. Case, “Java Performance Advancing Rapidly,”Microprocessor Report, May 27, 1996, pp. 17-19.

[17] M. Watheq El-Kharashi and F. ElGuibaly, "Java Microprocessors: Computer Architecture Implications," Proceedings of PACRIM'97, Victoria, BC, Canada, August 20-22, 1997, pp. 277-280.

[18] R. Grehan, "JavaSoft's Embedded Specification Overdue, but Many Tool Vendors Aren't Waiting," Computer Design, April 1998, pp. 14-18.

[19] Patriot Scientific Corporation, PSC1000 Microprocessorhome page, http://www.ptsc.com/psc1000/

[20] M. Tremblay and J. M. O'Connor, “picoJava: AHardware Implementation of the Java Virtual Machine,”Hotchips Presentation, 1996.

[21] J. Turley, "MicroJava Pushes Bytecode Performance," Microprocessor Report, October 28, 1997, pp. 28-31.

[22] J. Turley, "Most Significant Bits," Microprocessor Report, August 4, 1997, pp. 4-5, 9.

[23] J. Turley, "Sun Reveals First Java Processor Core," Microprocessor Report, November 17, 1996, pp. 9-12.

[24] P. Wayner, “Sun Gambles on Java Chips,” Byte,November 1996, pp. 79-88.

[25] Sun Microelectronics, “Sun Blazes Another Trail –Introducing the microJava 701 microprocessor,” PressReleases, October 1997.

[26] Sun Microelectronics, “The Burgeoning Market for JavaProcessors. Inside The Networked Future: TheUnprecedented Opportunity for Java Systems,” whitepaper 96-043, October 1996.

[27] Sun Microelectronics, "Sun Microelectronics' picoJava-I Posts Outstanding Performance," white paper 0015-01, October 1996.

[28] Sun Microelectronics, "picoJava-I Microprocessor Core Architecture," white paper 0014-01, October 1996.

[29] A. Tanenbaum and J. Goodman, Structured Computer Organization, Fourth Edition, Prentice Hall, Englewood Cliffs, NJ, 1999.

[30] B. Venners, Inside the Java Virtual Machine, McGraw-Hill, NY, 1998.

[31] M. Watheq El-Kharashi, F. ElGuibaly, and K. F. Li,“Hardware Adaptations for Java: A Design SpaceApproach,” Technical Report ECE-99-1, Department ofElectrical and Computer Engineering, University ofVictoria, January 25, 1999.

[32] M. Watheq El-Kharashi, F. ElGuibaly, and K. F. Li,“Architectural Requirements for Java Processors: AQuantitative Analysis,” Technical Report ECE-98-5,Department of Electrical and Computer Engineering,University of Victoria, November 9, 1998.

[33] D. A. Patterson and J. L. Hennessy, Computer Architecture: A Quantitative Approach, second edition, Morgan Kaufmann Publishers, San Francisco, CA, 1996.

[34] Pendragon Software, CaffeineMark Benchmark Home Page, http://www.pendragon-software.com/pendragon/cm3