© imperial college london exploring the barrier to entry incremental generational garbage...

Exploring the Barrier to Entry

Incremental Generational Garbage Collection for Haskell

Andy Cheadle & Tony Field, Imperial College London

Simon Marlow & Simon Peyton Jones, Microsoft Research, Cambridge, UK

Lyndon While, The University of Western Australia, Perth

Introduction

We focus on Haskell with the intent of building a garbage collector for GHC that is:

• Efficient
• Barrierless
• Hybrid
• Incremental
• Generational
• …

Investigate pause time bounds and mutator utilisation.

Explore application to other dynamic dispatch systems.

Highlights

• Improving Non-Stop Haskell
  – Incremental GC read-barrier optimisation without the per-object space overhead

• Bridging the Generation Gap
  – Generational GC write-barrier optimisation

• Consistent Mutator Utilisation
  – Time-based versus work-based scheduling

Barriers: Friend or Foe - Summary

• Blackburn & Hosking - ISMM 2004

• Conditional read-barrier
  – AMD: 21.24%, P4: 15.91%, PPC: 6.49%
  – Incremental GC: Standard Baker read-barrier

• Unconditional read-barrier
  – AMD: 8.05%, P4: 5.04%, PPC: 0.85%
  – Brooks indirection read-barrier
  – Metronome ‘Eager’ barrier ~ 4%
  – BUT: space overhead -> increased GC count

• Must consider GC cost!!!

Non-Stop Haskell

• Implementing Baker’s incremental collector typically introduces high overheads

– The software read-barrier

• We have shown that this can be done efficiently in systems with dynamic dispatching

Caveat: Dynamic dispatching already “costs” something; we show that incremental garbage collection comes at virtually no extra cost.

Dynamic Dispatch and the STG Machine

• The STG machine is a model for the compilation of lazy functional languages

• All objects are represented on the heap as closures:

• To compute function ‘f’ applied to arguments ‘a b c d’ jump to Entry code

[Figure: a heap closure for f with payload a b c d (heap pointers); its info pointer refers to a static info table holding the entry code and other fields.]
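The dispatch described above can be sketched in a few lines. This is a toy model, not GHC's representation: the names `InfoTable`, `Closure`, `enter` and the body of `f_entry` are assumptions made for illustration. It shows only that entering a closure is an indirect jump through its info pointer.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class InfoTable:
    """Static and shared by all closures of one type (hypothetical model)."""
    name: str
    entry: Callable[["Closure"], object]   # the entry code


@dataclass
class Closure:
    """Heap object: an info pointer followed by the payload words."""
    info: InfoTable
    payload: List[object]


def enter(closure: Closure) -> object:
    # Every evaluation is an indirect jump through the info pointer.
    return closure.info.entry(closure)


def f_entry(closure: Closure) -> object:
    # Hypothetical body for f: it just sums its four arguments.
    a, b, c, d = closure.payload
    return a + b + c + d


F_INFO = InfoTable("f", f_entry)
f_closure = Closure(F_INFO, [1, 2, 3, 4])
```

Because `enter` never inspects the closure's type, swapping the info pointer changes what happens on entry without touching any call site.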

The Read-Barrier Invariant

[Figure: from-space and to-space, with the stack (stack top marked): frame 1 scavenged, frames 2 and 3 unscavenged. Labels: Problem 1, Problem 2.]

• When the garbage collector is on, make info pointers point to code that scavenges evacuated closures before entering them

• At all other times the system operates with no read barrier!

Invariant Problem 1: Scavenging Closures

[Figure: a closure whose info pointer refers to self-scavenging code; labels: heap pointers, other fields.]

Q: How do we restore the original info pointer?

A: We remember it when the closure is evacuated

Non-Stop Haskell:

• Use an extra word in to-space

• Note: the space overhead applies only to objects copied from from-space but effectively reduces to-space by 30%

• Freshly allocated objects carry no space overhead
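A minimal sketch of the extra-word scheme, under stated assumptions: closures are modelled as Python objects, `saved_info` stands for the extra to-space word, and `forward` stands for a forwarding pointer left in from-space. All names here are hypothetical; the sketch illustrates the idea rather than the Non-Stop Haskell implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class InfoTable:
    name: str
    entry: Callable[["Closure"], object]


@dataclass
class Closure:
    info: InfoTable
    payload: List[object]
    saved_info: Optional[InfoTable] = None   # the extra word (to-space copies only)
    forward: Optional["Closure"] = None      # forwarding pointer in from-space


def enter(closure):
    return closure.info.entry(closure)


def evacuate(closure):
    """Copy to to-space, remember the original info pointer, hijack the live one."""
    if closure.forward is not None:
        return closure.forward
    copy = Closure(SELF_SCAV_INFO, list(closure.payload), saved_info=closure.info)
    closure.forward = copy
    return copy


def scavenge(closure):
    """Evacuate everything the closure points to, then restore its info pointer."""
    closure.payload = [evacuate(p) if isinstance(p, Closure) else p
                       for p in closure.payload]
    closure.info, closure.saved_info = closure.saved_info, None


def self_scav_entry(closure):
    scavenge(closure)        # the read barrier, paid only on first entry
    return enter(closure)    # now runs the original entry code


SELF_SCAV_INFO = InfoTable("self_scav", self_scav_entry)

# Usage: an "add" closure pointing at two constants, all still in from-space.
CONST_INFO = InfoTable("const", lambda c: c.payload[0])
ADD_INFO = InfoTable("add", lambda c: enter(c.payload[0]) + enter(c.payload[1]))
old = Closure(ADD_INFO, [Closure(CONST_INFO, [2]), Closure(CONST_INFO, [3])])
new = evacuate(old)
result = enter(new)   # scavenges `new` and its children on demand
```

Closures allocated after collection starts would be created with their normal info pointer and no `saved_info`, which matches the point that fresh allocations carry no overhead.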

[Figure: the closure shown with both its original info table (entry code, other fields) and the self-scavenging code (other fields); labels: heap pointers.]


In production:

• Specialise every closure type at compile time

• Runtime space overhead is replaced by a static one of ~ 25%
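The specialised scheme can be sketched as follows, assuming a hypothetical `specialise` step that emits two info tables per closure type. The self-scavenging variant knows its normal partner statically, so no per-object word is needed. The `scavenged` list is only a stand-in for real scavenging work.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class InfoTable:
    name: str
    entry: Callable[["Closure"], object]
    scav_variant: Optional["InfoTable"] = None   # set on the normal table only


@dataclass
class Closure:
    info: InfoTable
    payload: List[object]


scavenged = []   # records scavenged closures, for illustration only


def specialise(name, entry_code):
    """Emit a normal info table plus its self-scavenging variant."""
    normal = InfoTable(name, entry_code)

    def self_scav_entry(closure):
        scavenged.append(closure)     # stand-in for scavenging the payload
        closure.info = normal         # original is known statically
        return entry_code(closure)    # the "JMP Entry code" step

    normal.scav_variant = InfoTable(name + "_scav", self_scav_entry)
    return normal


def evacuate(closure):
    """Copy to to-space selecting the variant table; no extra word is stored."""
    return Closure(closure.info.scav_variant, list(closure.payload))


CONST_INFO = specialise("const", lambda closure: closure.payload[0])
copy = evacuate(Closure(CONST_INFO, [7]))
```

The trade-off is visible in the model: every call to `specialise` produces two tables and two pieces of entry code, which corresponds to the static code-size cost quoted above.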

[Figure: the specialised layout, in which the self-scavenging code ends with JMP Entry code; labels: heap pointers, other fields.]

Invariant Problem 2: Stack Scavenging

• STG machine stack frames look just like closures

• Before returning to the caller frame we ‘hijack’ the caller’s return address, replacing it with a pointer to self-scavenging code for that frame

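[Figure: the stack during collection. First frame 1 is scavenged and frames 2 and 3 are unscavenged; then frame 2 is scavenged. Return code shown for the frames: "scav; mod 3r; update; return", "scav; mod 4r; update; return" and "update; return".]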
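The return-address hijacking can be modelled with a list of frames, as a rough sketch. The names `Frame`, `hijack`, `self_scav_return` and `start_gc` are assumptions, and setting `scavenged = True` stands in for actually evacuating the frame's pointers.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Frame:
    ret: Callable[[list, int], int]        # return address: code run on return
    payload: int
    saved_ret: Optional[Callable] = None   # real return address while hijacked
    scavenged: bool = False


def add_payload(stack, value):
    """Ordinary return code: pop the frame and use its payload."""
    frame = stack.pop()
    return value + frame.payload


def hijack(stack, index):
    """Redirect frame `index` to self-scavenging code if it still needs it."""
    if index < 0:
        return
    frame = stack[index]
    if frame.scavenged or frame.saved_ret is not None:
        return
    frame.saved_ret, frame.ret = frame.ret, self_scav_return


def self_scav_return(stack, value):
    """Scavenge the frame being returned to, then run its real return code."""
    frame = stack[-1]
    frame.scavenged = True               # stand-in for evacuating its pointers
    frame.ret, frame.saved_ret = frame.saved_ret, None
    hijack(stack, len(stack) - 2)        # protect the next frame down
    return frame.ret(stack, value)


def start_gc(stack):
    """Scavenge only the top frame eagerly and hijack its caller."""
    stack[-1].scavenged = True
    hijack(stack, len(stack) - 2)


def run(stack, value=0):
    """Keep returning through the stack until it is empty."""
    while stack:
        value = stack[-1].ret(stack, value)
    return value


frames = [Frame(add_payload, 1), Frame(add_payload, 10), Frame(add_payload, 100)]
stack = list(frames)          # top of stack is the last element
start_gc(stack)
hijacked_at_start = [f.ret is self_scav_return for f in frames]
result = run(stack)
```

At any moment only one frame is hijacked, so the mutator pays for scavenging a frame exactly when it is about to return into it.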

Background Scavenging

• GHC’s heap is block allocated. So, scavenge at:
  – Every Allocation (EA)
  – Every Block allocation (EB)

• Reduce forced-completions via block chaining

• Incremental scavenger pauses are allocation-dependent

• Exploit GHC’s lightweight scheduler to implement a time-scheduled scavenger (Jikes RVM Metronome)
  – Consistent mutator utilisation
  – Increase in forced-completions due to allocation bursts
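To make the utilisation contrast concrete, here is a small sketch that measures mutator utilisation over sliding windows. Time is in arbitrary integer units, and the two pause schedules are invented inputs, not measurements from GHC: one is periodic (time-based) and one clusters pauses around an allocation burst (work-based).

```python
def mutator_utilisation(pauses, start, window):
    """Fraction of [start, start + window) in which the mutator runs.

    `pauses` is a list of non-overlapping (begin, end) collector intervals.
    """
    end = start + window
    gc_time = sum(max(0, min(e, end) - max(b, start)) for b, e in pauses)
    return 1.0 - gc_time / window


def min_utilisation(pauses, window, horizon):
    """Worst utilisation over every window that starts on a whole time unit."""
    return min(mutator_utilisation(pauses, s, window)
               for s in range(0, horizon - window + 1))


def time_based(period, quantum, horizon):
    """Collector runs for `quantum` units at the start of every `period`."""
    return [(t, t + quantum) for t in range(0, horizon, period)]


# Hypothetical work-based schedule: three back-to-back increments caused by
# an allocation burst, then two isolated ones.
bursty = [(0, 2), (2, 4), (4, 6), (30, 32), (50, 52)]

periodic_mmu = min_utilisation(time_based(8, 2, 64), window=8, horizon=64)
bursty_mmu = min_utilisation(bursty, window=8, horizon=64)
```

With these inputs the periodic schedule never drops below 0.75 in any window of 8 units, while the burst drives the work-based schedule down to 0.25, even though the bursty schedule spends less total time collecting.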

Results – Binary Sizes

[Table: column labels Application, Stop-copy, Baker, (EA), Metronome; row labels Max 36, Min 36, wave4main, symalg, circsim, All 36.

Values as extracted: 12.99%, 12.76%, 12.82%, 12.84%, 12.71%, 12.12%, 14.40%, 12.98%, 14.15%, 15.83%, 19.25%, 11.49%, 13.04%, 20.22%, 14.87%, 11.67%, 16.01%, 16.90%, 10.54%, 32.20%, 30.83%, 29.89%, 28.03%, 28.46%, 26.98%, 31.90%, 29.59%, 26.72%, 26.65%, 26.63%, 22.95%, 23.95%, 22.04%, 27.83%, 25.23%, 34.71%, 34.02%, 32.14%, 32.11%, 32.04%, 29.05%, 34.74%, 32.95%.]

Results – Runtimes

[Table: column labels Application, Stop-copy (seconds), Baker, (EA), Metronome; row labels Max 36, Min 36, wave4main, symalg, circsim, All 36.

Values as extracted: 181.41, 218.39, 29.33%, 56.80%, 16.56%, -0.11%, 79.02%, 27.34%, 23.30%, 34.05%, 17.25%, -0.08%, 73.56%, 23.61%, 20.29%, 35.47%, -4.04%, 35.47%, 11.22%, 13.94%, 17.74%, -3.40%, 37.26%, 23.66%, -3.10%, 24.20%, -2.54%, 18.65%, 18.75%, -13.52%, 70.89%, 13.63%.]

The Generational Write-barrier

[Figure: generation N and generation N - 1 with the root set; an inter-generational pointer contributes to the root set for generation N - 1.]

Depending on the number of updates, the write-barrier can impose an overhead of 8 – 24% (NJ/ML and Clean).
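For reference, a conventional generational write barrier looks roughly like the sketch below. It is a generic illustration, not GHC's code: `Obj`, `write_field`, `remembered_set` and `minor_roots` are hypothetical names, and a higher generation number means an older generation.

```python
class Obj:
    """A heap object with a generation number and a single pointer field."""

    def __init__(self, generation, field=None):
        self.generation = generation
        self.field = field


remembered_set = set()   # old objects that may point into a younger generation


def write_field(obj, value):
    """Pointer store guarded by the write barrier."""
    # The barrier: this test runs on every store, whether or not it is needed.
    if isinstance(value, Obj) and obj.generation > value.generation:
        remembered_set.add(obj)
    obj.field = value


def minor_roots(roots):
    """Roots for collecting the young generation: real roots plus remembered objects."""
    return list(roots) + list(remembered_set)


old = Obj(generation=1)
young = Obj(generation=0)
write_field(old, young)     # creates an inter-generational pointer
write_field(young, old)     # young-to-old store needs no record
```

The cost quoted above comes from the test inside `write_field`, which is executed on every update even though most stores do not create an old-to-young pointer.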

Bridging the Generation Gap

We implement in GHC a mechanism that again exploits dynamic dispatch to eliminate unnecessary write-barriers:

[Figure sequence, with the root set shown at each step:

1. Generation 0 only (root set for generation 0): THUNK_SELECT, THUNK_1, THUNK_2.
2. Promote to generation 1 (generation 1 and generation 0 shown): THUNK_SELECT, THUNK_1, THUNK_2; IND_PRE_UPD, IND_PRE_UPD.
3. Force THUNK selectee evaluation: THUNK_SELECT, THUNK_1, THUNK_2; IND_UPD, IND_PRE_UPD.
4. THUNK_SELECT, THUNK_1, IND_OLDGEN; IND_UPD, IND_PRE_UPD; CONSTR_2; inter-generational pointer.]

Preliminary benchmarks suggested a reduction of 5–9%; in production it is actually around 2–3%.

Ongoing Work

Application of read-barrier optimisation to Java

• Unfortunately Java programs are not “pure” in their use of dynamic dispatch
  – Field access via get() / set() methods
  – Inlining must be disallowed

Investigating within Jikes RVM:

• Inter- and intra-class inlining

• Code bloat arising from get() / set() methods, restricted inlining and additional per-class VMT

• Cost of VMT TIB pointer flip

Removal of collector-specific barriers and tests:

• Yields cheaper ‘vanilla’ collectors

• Allows the efficient hybridisation of multiple collector algorithms

Conclusion

Time-based scheduling is massively attractive, but:

• Complete decoupling from the allocator is problematic*

• A hybrid approach looks promising:
  – Parameterised by mutator utilisation
  – Sensitive to allocation rate

Elimination of per-object overhead:

• Mandatory for our production collector
