Polar Opposites: Next Generation Languages & Architectures Kathryn S McKinley The University of Texas at Austin


Page 2: Collaborators

• Faculty
  – Steve Blackburn, Doug Burger, Perry Cheng, Steve Keckler, Eliot Moss
• Graduate Students
  – Xianglong Huang, Sundeep Kushwaha, Aaron Smith, Zhenlin Wang (MTU)
• Research Staff
  – Jim Burrill, Sam Guyer, Bill Yoder

Page 3: Computing in the Twenty-First Century

• New and changing architectures
  – Hitting the microprocessor wall
  – TRIPS - an architecture for future technology
• Object-oriented languages
  – Java and C# becoming mainstream
• Key challenges and approaches
  – Memory gap, parallelism
  – Language & runtime implementation efficiency
  – Orchestrating a new software/hardware dance
  – Break down artificial system boundaries

Page 4: Technology Scaling Hitting the Wall

[Figure: a 20 mm chip edge at 130 nm, 100 nm, 70 nm, and 35 nm process technologies, shown analytically and qualitatively]

Either way, partitioning for on-chip communication is key.

Page 5: End of the Road for Out-of-Order SuperScalars

• Clock ride is over
  – Wire and pipeline limits
  – Quadratic out-of-order issue logic
  – Power, a first order constraint
• Major vendors ending processor lines
• Problems for any architectural solution
  – ILP - instruction level parallelism
  – Memory latency

Page 6: Where are Programming Languages?

• High Productivity Languages
  – Java, C#, Matlab, S, Python, Perl
• High Performance Languages
  – C/C++, Fortran
• Why not both in one?
  – Interpretation/JIT vs compilation
  – Language representation
    • Pointers, arrays, frequent method calls, etc.
  – Automatic memory management costs

Obscure ILP and memory behavior

Page 7: Outline

• TRIPS
  – Next generation tiled EDGE architecture
  – ILP compilation model
• Memory system performance
  – Garbage collection influence
  – The GC advantage
    • Locality, locality, locality
    • Online adaptive copying
  – Cooperative software/hardware caching

Page 8: TRIPS

• Project Goals
  – Fast clock & high ILP in future technologies
  – Architecture sustains 1 TRIPS in 35 nm technology
  – Cost-performance scalability
  – Find the right hardware/software balance
    • New balance reduces hardware complexity & power
  – New compiler responsibilities & challenges
• Hardware/Software Prototype
  – Proof-of-concept of scalability and configurability
  – Technology transfer

Page 9: TRIPS Prototype Architecture

Page 10: Execution Substrate

[Figure: execution array of execution nodes with register banks (columns 0 to 3), I-cache banks 0 to 3 and H, D-cache/LSQ banks 0 to 3, global control, and branch predictor]

Interconnect topology & latency exposed to compiler scheduler

Page 11: Large Instruction Window

[Figure: an execution node with router, ALU, control, and instruction buffers holding (opcode, src1, src2) entries. Out-of-order instruction buffers form a logical “z-dimension” in each node: 4 logical frames of 4 x 4 instructions]

• Instruction buffers add depth to execution array
  – 2D array of ALUs; 3D volume of instructions
• Entire 3D volume exposed to compiler

Page 12: Execution Model

• SPDI - static placement, dynamic issue
  – Dataflow within a block
  – Sequential between blocks
• TRIPS compiler challenges
  – Create large blocks of instructions
    • Single entry, multiple exit, predication
  – Schedule blocks of instructions on a tile
  – Resource limitations
    • Registers, Memory operations

Page 13: Block Execution Model

• Program execution
  – Fetch and map block to TRIPS grid
  – Execute block, produce result(s)
  – Commit results
  – Repeat
• Block dataflow execution
  – Each cycle, execute a ready instruction at every node
  – Single read of registers and memory locations
  – Single write of registers and memory locations
  – Update the PC to successor block
• TRIPS core may speculatively execute multiple blocks (as well as instructions)
• TRIPS uses branch prediction and register renaming between blocks, but not within a block

[Figure: control-flow graph from start to end through blocks A, B, C, D, and E]
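The fetch / execute / commit / repeat loop on this slide can be pictured with a toy interpreter. This is only a sketch in Python under stated assumptions: the block format, `run_block`, `run_program`, and the example block `loop` are all hypothetical, and speculation, placement on the grid, and memory operations are ignored.

```python
def run_block(block, regs):
    """Execute one block in dataflow order; return (register writes, successor)."""
    values = {}                      # block-local temporaries, never visible outside
    pending = dict(block["insts"])   # temp name -> (function, operand names)
    while pending:
        # An instruction is ready once every operand is a register read at
        # block entry or a temporary already produced inside this block.
        ready = [name for name, (_, srcs) in pending.items()
                 if all(s in values or s in regs for s in srcs)]
        if not ready:
            raise ValueError("block has operands that can never be produced")
        results = {}
        for name in ready:           # one "cycle": fire every ready instruction
            fn, srcs = pending.pop(name)
            args = [values[s] if s in values else regs[s] for s in srcs]
            results[name] = fn(*args)
        values.update(results)
    writes = {reg: values[tmp] for reg, tmp in block["writes"].items()}
    return writes, block["next"](values)


def run_program(blocks, entry, regs):
    """Fetch a block, execute it, commit its register writes, repeat."""
    regs = dict(regs)
    name = entry
    while name is not None:
        writes, name = run_block(blocks[name], regs)
        regs.update(writes)          # commit happens only at the block boundary
    return regs


# Hypothetical one-block program: sum += i; i -= 1; loop while i > 0.
BLOCKS = {
    "loop": {
        "insts": {
            "t_sum": (lambda s, i: s + i, ["sum", "i"]),
            "t_i": (lambda i: i - 1, ["i"]),
            "t_more": (lambda t: t > 0, ["t_i"]),
        },
        "writes": {"sum": "t_sum", "i": "t_i"},
        "next": lambda v: "loop" if v["t_more"] else None,
    }
}

final_regs = run_program(BLOCKS, "loop", {"sum": 0, "i": 4})
print(final_regs)
```

Temporaries such as `t_sum` live only inside `run_block`, which mirrors the slide's point that registers are read once at block entry and written once at commit.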

Page 14: Just Right Division of Labor

• TRIPS architecture
  – Eliminates short-term temporaries
  – Out-of-order execution at every node in grid
  – Exploits ILP, hides unpredictable latencies
    • without superscalar quadratic hardware
    • without VLIW guarantees of completion time
• Scale compiler - generate ILP
  – Large hyperblocks - predicate, unroll, inline, etc.
  – Schedule hyperblocks
    • Map independent instructions to different nodes
    • Map communicating instructions to same or close nodes
  – Let hardware deal with unpredictable latencies (loads)

Exploits Hardware and Compiler Strengths
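To make the two scheduling rules concrete, here is a deliberately simple greedy placement sketch in Python. It is not the Scale compiler's algorithm: the function `place`, the 4 x 4 single-frame grid, the Manhattan-distance cost, and the example instructions are all assumptions made for illustration.

```python
from itertools import product


def manhattan(a, b):
    """Hop count between two grid nodes, assuming a mesh interconnect."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def place(insts, rows=4, cols=4):
    """Greedily map instructions to grid nodes.

    insts is a list of (name, [producer names]) in dependence order.
    Each node holds one instruction (a single frame), so independent
    instructions land on different nodes; each consumer takes the free
    node closest to its producers. Assumes len(insts) <= rows * cols.
    """
    free = list(product(range(rows), range(cols)))
    where = {}
    for name, producers in insts:
        best = min(free, key=lambda node: sum(manhattan(node, where[p])
                                              for p in producers))
        free.remove(best)
        where[name] = best
    return where


# Hypothetical hyperblock: c consumes a and b, d consumes c.
placement = place([("a", []), ("b", []), ("c", ["a", "b"]), ("d", ["c"])])
print(placement)
```

A real scheduler would also weigh register-bank and cache-bank distance and the extra frames at each node; this sketch only shows the distance-driven part of the idea.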

Page 15: High Productivity Programming Languages

• Interpretation/JIT vs compilation
• Language representation
  – Pointers, arrays, frequent method calls, etc.
• Automatic memory management costs

MMTk in IBM Jikes RVM – ICSE’04, SIGMETRICS’04
  – Memory Management Toolkit for Java
  – High Performance, Extensible, Portable
  – Mark-Sweep, Copying SemiSpace, Reference Counting
  – Generational collection, Beltway, etc.

Page 16: Allocation Choices

• Bump-Pointer
  – Fast (increment & bounds check)
  – Can't incrementally free & reuse: must free en masse
• Free-List
  – Relatively slow (consult list for fit)
  – Can incrementally free & reuse cells
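A minimal sketch of the two allocation styles, written in Python for illustration only. MMTk is a Java toolkit and none of these class or method names come from it; the free list here is a plain first-fit list without coalescing, chosen as the simplest case of "consult list for fit".

```python
class BumpAllocator:
    """Allocation is an increment plus a bounds check; freeing is all-or-nothing."""

    def __init__(self, size):
        self.cursor = 0
        self.limit = size

    def alloc(self, nbytes):
        if self.cursor + nbytes > self.limit:   # bounds check
            return None
        addr = self.cursor
        self.cursor += nbytes                   # bump the pointer
        return addr

    def free_all(self):
        self.cursor = 0                         # free en masse


class FreeListAllocator:
    """First-fit free list: slower to allocate, but single cells can be reused."""

    def __init__(self, size):
        self.free = [(0, size)]                 # (address, length) chunks

    def alloc(self, nbytes):
        for i, (addr, length) in enumerate(self.free):
            if length >= nbytes:                # consult the list for a fit
                if length == nbytes:
                    del self.free[i]
                else:
                    self.free[i] = (addr + nbytes, length - nbytes)
                return addr
        return None

    def free_cell(self, addr, nbytes):
        self.free.append((addr, nbytes))        # incremental free and reuse


bump = BumpAllocator(32)
first, second = bump.alloc(16), bump.alloc(16)
full = bump.alloc(8)            # no room left until everything is freed
bump.free_all()

cells = FreeListAllocator(32)
x, y = cells.alloc(16), cells.alloc(16)
cells.free_cell(x, 16)          # give back just one cell
reused = cells.alloc(8)         # and reuse part of it immediately
```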

Page 17: Allocation Choices

• Bump pointer
  – ~70 bytes IA32 instructions, 726MB/s
• Free list
  – ~140 bytes IA32 instructions, 654MB/s
• Bump pointer 11% faster in tight loop
  – < 1% in practical setting
  – No significant difference (?)
• Second order effects?
  – Locality??
  – Collection mechanism??
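The 11% figure follows directly from the two throughput numbers on the slide. A quick check in Python (the variable names are only for illustration):

```python
bump_mb_per_s = 726        # bump-pointer allocation rate from the slide
free_list_mb_per_s = 654   # free-list allocation rate from the slide

speedup = bump_mb_per_s / free_list_mb_per_s - 1
print(f"bump pointer is {speedup:.1%} faster in the tight loop")
```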

Page 18: Implications for Locality

• Compare SemiSpace (SS) & MarkSweep (MS) mutator
  – Mutator time
  – Mutator memory performance: L1, L2 & TLB

Page 19: javac

[Figure: four panels for javac comparing MarkSweep and SemiSpace: mutator time, L1 misses, L2 misses, and TLB misses. Each panel plots the normalized metric against normalized heap size (1 to 6).]

Page 20: pseudojbb

[Figure: four panels for pseudojbb comparing MarkSweep and SemiSpace: mutator time, L1 misses, L2 misses, and TLB misses. Each panel plots the normalized metric against normalized heap size (1 to 6).]

Page 21: db

[Figure: four panels for db comparing MarkSweep and SemiSpace: L1 misses, mutator time, L2 misses, and TLB misses. Each panel plots the normalized metric against normalized heap size (1 to 6).]

Page 22: Locality & Architecture

Page 23: MS/SS Crossover 1.6GHz PPC

[Figure: normalized total time (1 to 3) versus heap size relative to minimum (1 to 6) for SemiSpace and MarkSweep on a 1.6GHz PPC]

Page 24: MS/SS Crossover 1.9GHz AMD

[Figure: normalized total time (1 to 3) versus heap size relative to minimum (1 to 6) for SemiSpace and MarkSweep, adding a 1.9GHz AMD to the 1.6GHz PPC curves]

Page 25: MS/SS Crossover 2.6GHz P4

[Figure: normalized total time (1 to 3) versus heap size relative to minimum (1 to 6) for SemiSpace and MarkSweep, adding a 2.6GHz P4 to the 1.6GHz PPC and 1.9GHz AMD curves]

Page 26: MS/SS Crossover 3.2GHz P4

[Figure: normalized total time (1 to 3) versus heap size relative to minimum (1 to 6) for SemiSpace and MarkSweep, adding a 3.2GHz P4 to the 1.6GHz PPC, 1.9GHz AMD, and 2.6GHz P4 curves]

Page 27: MS/SS Crossover

[Figure: normalized total time (1 to 3) versus heap size relative to minimum (1 to 6) for SemiSpace and MarkSweep on all four machines (1.6GHz PPC, 1.9GHz AMD, 2.6GHz P4, 3.2GHz P4), with the crossover marked for each clock speed and the annotations "locality" and "space"]

Page 28: Locality in Memory Management

• Explicit memory management on its way out
  – Key GC vs Explicit MM insights 20 yrs old
  – Technology has changed and is still changing
• Generational and Beltway Collectors
  – Significant collection time benefits over full heap collectors
  – Collect young objects
  – Infrequently collect old space
  – Copying nursery attains similar locality effects as full heap

Page 29: Where are the Misses?

[Figure: _209_db under a generational copying collector. Total accesses in millions (0 to 2000), split into hits and misses, for Boot Image, Immortal, LOS, Older Gen, and Nursery]

Page 30: Copy Order

• Static copy orders
  – Breadth first - Cheney scan
  – Depth first, hierarchical
  – Problem: one size does not fit all
• Static profiling per class
  – Inconsistent with JIT
• Object sampling
  – Too expensive in our experience
• OOR - Online Object Reordering
  – OOPSLA’04
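To see why a static copy order is "one size does not fit all", compare the layouts that breadth-first and depth-first traversals give the same object graph. The sketch below is Python and only models the resulting order, not a real Cheney scan over to-space; the graph and the function names are hypothetical.

```python
from collections import deque


def breadth_first_order(root, children):
    """Order in which a breadth-first (Cheney-style) copy would lay out objects."""
    order, seen, queue = [], {root}, deque([root])
    while queue:
        obj = queue.popleft()
        order.append(obj)
        for child in children.get(obj, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


def depth_first_order(root, children):
    """Order in which a depth-first copy would lay out the same objects."""
    order, seen, stack = [], set(), [root]
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        order.append(obj)
        # Push children reversed so the first field is followed first.
        stack.extend(reversed(children.get(obj, [])))
    return order


# Hypothetical heap: root points to a and b; a to a1 and a2; b to b1.
HEAP = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
bfs = breadth_first_order("root", HEAP)
dfs = depth_first_order("root", HEAP)
print(bfs)
print(dfs)
```

Breadth first places siblings next to each other, depth first places a parent next to its first child's subtree. Which one matches the program's access pattern depends on the program, which is the gap an online scheme aims to close.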

Page 31: OOR Overview

• Records object accesses in each method (excludes cold basic blocks)
• Finds hot methods by dynamic sampling
• Reorders objects with hot fields in higher generation during GC
• Copies hot objects into separate region

Page 32: Static Analysis Example

Method Foo {
  Class A a;
  try {
    …=a.b; …
  }
  catch(Exception e){
    …a.c
  }
}

[Figure: the compiler collects access info for the hot basic block and ignores the cold basic block]

Access List:
1. A.b
2. ….
….

Page 33: Adaptive Sampling

Method Foo {
  Class A a;
  try {
    …=a.b; …
  }
  catch(Exception e){
    …a.c
  }
}

[Figure: adaptive sampling finds that Foo is hot. Foo's access list (1. A.b, 2. ….) then marks A.b as hot. The diagram shows object A with fields b and c, and object B]
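The two steps (per-method access lists from the compiler, hot methods from sampling) combine into a set of hot fields. A minimal Python sketch of that join follows. The sample counts, the threshold, and the second method `Bar` with field `C.d` are invented for illustration; only `Foo` and `A.b` come from the slides.

```python
def hot_fields(access_lists, sample_counts, threshold):
    """Fields accessed in hot basic blocks of methods that sampling marks hot."""
    hot_methods = {m for m, n in sample_counts.items() if n >= threshold}
    return sorted({field
                   for method in hot_methods
                   for field in access_lists.get(method, [])})


# Access lists as the compiler would record them: a.c is absent from Foo's
# list because it only appears in the cold catch block.
ACCESS_LISTS = {"Foo": ["A.b"], "Bar": ["C.d"]}
SAMPLE_COUNTS = {"Foo": 12, "Bar": 1}   # hypothetical sampler output

advice = hot_fields(ACCESS_LISTS, SAMPLE_COUNTS, threshold=5)
print(advice)
```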

Page 34: Advice Directed Reordering

• Example
  – Assume (1,4), (4,7) and (2,6) are hot field accesses
  – Order: 1,4,7,2,6 : 3,5

[Figure: object graph with nodes 1 through 7]
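One traversal policy that reproduces the order on this slide is sketched below in Python. Both the policy and the graph shape are assumptions: the figure's edges did not survive in this transcript, so the sketch assumes object 1 points to 2, 3, and 4, object 2 points to 5 and 6, and object 4 points to 7. The rule used is that an object with a hot field is copied to the hot region and its hot referents are copied immediately after it, while everything else is deferred.

```python
from collections import deque


def advice_directed_order(roots, children, hot_edges):
    """Return (hot_order, cold_order) for a toy advice-directed copy."""
    hot_sources = {src for src, _ in hot_edges}
    hot_order, cold_order = [], []
    copied = set()
    deferred = deque(roots)

    def copy_hot_chain(obj):
        # Copy obj to the hot region, then follow its hot fields first so
        # that parent and hot child end up adjacent.
        copied.add(obj)
        hot_order.append(obj)
        for child in children.get(obj, []):
            if child in copied:
                continue
            if (obj, child) in hot_edges:
                copy_hot_chain(child)
            else:
                deferred.append(child)

    while deferred:
        obj = deferred.popleft()
        if obj in copied:
            continue
        if obj in hot_sources:
            copy_hot_chain(obj)
        else:
            copied.add(obj)
            cold_order.append(obj)
            deferred.extend(c for c in children.get(obj, [])
                            if c not in copied)
    return hot_order, cold_order


GRAPH = {1: [2, 3, 4], 2: [5, 6], 4: [7]}   # assumed edges, see above
HOT = {(1, 4), (4, 7), (2, 6)}              # hot field accesses from the slide

hot, cold = advice_directed_order([1], GRAPH, HOT)
print(hot, ":", cold)
```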

Page 35: OOR System Overview

[Figure: block diagram with components Source Code, Baseline Compiler, Executing Code, Adaptive Sampling, Optimizing Compiler, Hot Methods, Access Info Database, and GC (copies objects). Edge labels: "Adds Entries", "Register Hot Field Accesses", "Look Up", "Advice", and "Affects Locality". Legend: OOR addition, Jikes RVM, Input/Output]

Page 36: Cost of OOR

Benchmark    Default      OOR   Difference
jess            4.39     4.43        0.84%
jack            5.79     5.82        0.57%
raytrace        4.63     4.61       -0.59%
mtrt            4.95     4.99        0.70%
javac          12.83    12.70       -1.05%
compress        8.56     8.54        0.20%
pseudojbb      13.39    13.43        0.36%
db             18.88    18.88       -0.03%
antlr           0.94     0.91       -2.90%
gcold           1.21     1.23        1.49%
hsqldb        160.56   158.46       -1.30%
ipsixql        41.62    42.43        1.93%
jython         37.71    37.16       -1.44%
ps-fun        129.24   128.04       -1.03%
Mean                                -0.19%

Page 37: Performance db

Page 38: Performance jython

Page 39: Performance javac

Page 40: Software is not enough; Hardware is not enough

• Problem: inefficient use of cache
• Hardware limitations: set associativity, cannot predict the future
• Cooperative Software/Hardware Caching
  – Combines high level compiler analysis with dynamic miss behavior
• Lightweight ISA support conveys compiler’s global view to hardware
  – Compiler-guided cache replacement (evict-me)
  – Compiler-guided region prefetching
  – ISCA’03, PACT’02
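A toy model shows how a compiler hint can help where plain LRU cannot. This Python sketch is not the ISCA’03 or PACT’02 mechanism: the class `EvictMeCache`, the fully associative two-line cache, and the access trace are assumptions chosen to show one effect, namely that lines tagged evict-me are chosen as victims before LRU is consulted.

```python
from collections import OrderedDict


class EvictMeCache:
    """Fully associative LRU cache that prefers victims tagged evict-me."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()   # address -> evict_me flag, oldest first
        self.hits = 0
        self.misses = 0

    def access(self, addr, evict_me=False):
        if addr in self.lines:
            self.hits += 1
            self.lines.move_to_end(addr)        # most recently used
        else:
            self.misses += 1
            if len(self.lines) >= self.capacity:
                self._evict()
        self.lines[addr] = evict_me             # record the latest hint

    def _evict(self):
        victim = next((a for a, flagged in self.lines.items() if flagged), None)
        if victim is None:
            self.lines.popitem(last=False)      # fall back to plain LRU
        else:
            del self.lines[victim]


def run(trace, use_hints):
    """Replay (address, is_streaming) pairs; return (hits, misses)."""
    cache = EvictMeCache(capacity=2)
    for addr, streaming in trace:
        cache.access(addr, evict_me=use_hints and streaming)
    return cache.hits, cache.misses


# Hypothetical trace: A is reused, the S* lines are touched once each.
TRACE = [("A", False), ("S1", True), ("S2", True), ("A", False),
         ("S3", True), ("S4", True), ("A", False)]

lru_result = run(TRACE, use_hints=False)
hinted_result = run(TRACE, use_hints=True)
print("LRU:", lru_result, "with evict-me hints:", hinted_result)
```

In this trace the streaming lines push A out under plain LRU, while the hinted run keeps A resident. The hardware cannot know the S* lines are dead, but a compiler that sees the whole loop can.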

Page 41: Exciting Times

• Dramatic architectural changes
  – Execution tiles
  – Cache & Memory tiles
• Next generation system solutions
  – Moving hardware/software boundaries
  – Online optimizations
  – Key compiler challenges (same old…): ILP and Cache Memory Hierarchy