
Page 1: Lp seminar

On-the-Fly Garbage Collection Using Sliding Views

Erez Petrank, Technion – Israel Institute of Technology

Joint work with Yossi Levanoni, Hezi Azatchi, and Harel Paz

Page 2: Lp seminar

Garbage Collection

The user allocates space dynamically; the garbage collector automatically frees the space when it is "no longer needed".

Usually "no longer needed" = unreachable by a path of pointers from program local references (roots).

The programmer does not have to decide when to free an object. (No memory leaks, no dereferencing of freed objects.)

Built into Java, C#.

Page 3: Lp seminar

Garbage Collection: Two Classic Approaches

Reference counting [Collins 1960]: keep a reference count for each object; reclaim objects with count 0.

Tracing [McCarthy 1960]: trace reachable objects; reclaim objects not traced.

Traditional wisdom: tracing is good, reference counting is problematic.

Page 4: Lp seminar

What (was) Bad about RC?

Does not reclaim cycles (e.g., two objects A and B that point to each other are never reclaimed, even when unreachable).

Heavy overhead on pointer modifications.

Traditional belief: "Cannot be used efficiently with parallel processing."

Page 5: Lp seminar

What's Good about RC?

Reference counting work is proportional to the work on creations and modifications. Can tracing deal with tomorrow's huge heaps?

Reference counting has good locality.

The challenge: RC overhead on pointer modification seems too expensive. RC seems impossible to "parallelize".

Page 6: Lp seminar

Garbage Collection Today

Today's advanced environments: multiprocessors + large memories.

Dealing with multiprocessors: single-threaded stop-the-world.

Page 7: Lp seminar

Garbage Collection Today

Today's advanced environments: multiprocessors + large memories.

Dealing with multiprocessors: concurrent collection, parallel collection.

Page 8: Lp seminar

Terminology (stop-the-world, parallel, concurrent, …)

[Diagram: program threads vs. GC activity over time for each mode.]

Stop-the-World: all program threads are halted while a single GC thread collects.
Parallel (STW): all program threads are halted while several GC threads collect in parallel.
Concurrent: collection proceeds alongside the program, with short pauses in which all threads are stopped together.
On-the-Fly: collection proceeds alongside the program, stopping only one thread at a time, never all of them together.

Page 9: Lp seminar

Benefits & Costs

[Diagram: informal pause times and throughput loss for stop-the-world, parallel (STW), concurrent, and on-the-fly collection — pause times shrink from roughly 200ms to 20ms to 2ms as collection becomes more concurrent, at a throughput loss of about 10-20%.]

Page 10: Lp seminar

This Talk

Introduction: RC and tracing, coping with SMPs. RC introduction and the parallelization problem.

Main focus: a novel concurrent reference counting algorithm (suitable for Java). Concurrent made on-the-fly based on "sliding views".

Extensions: cycle collection, mark and sweep, generations, age-oriented.

Implementation and measurements on Jikes. Extremely short pauses, good throughput.

Page 11: Lp seminar

Basic Reference Counting

Each object has an RC field; new objects get o.rc := 1.

When a pointer p that points to o1 is modified to point to o2, execute: o2.rc++, o1.rc--.

If o1.rc == 0 then: delete o1, decrement o.rc for all children o of o1, and recursively delete objects whose rc is decremented to 0.
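
To make the above concrete, here is a minimal C sketch of a basic (non-deferred) reference-counting update. The Object layout with a fixed child array and the helper names are assumptions made for this illustration, not part of the talk's implementation.

#include <stdlib.h>

#define MAX_CHILDREN 8

typedef struct Object {
    int rc;                                 /* reference count                  */
    struct Object *children[MAX_CHILDREN];  /* outgoing pointers of this object */
} Object;

/* Recursively delete an object whose count dropped to zero. */
static void rc_delete(Object *o) {
    for (int i = 0; i < MAX_CHILDREN; i++) {
        Object *c = o->children[i];
        if (c != NULL && --c->rc == 0)
            rc_delete(c);
    }
    free(o);
}

/* Basic RC write barrier: the pointer in *slot is redirected from o1 to o2. */
static void rc_update(Object **slot, Object *o2) {
    Object *o1 = *slot;
    if (o2 != NULL)
        o2->rc++;                           /* o2 gains a reference */
    *slot = o2;
    if (o1 != NULL && --o1->rc == 0)        /* o1 loses a reference */
        rc_delete(o1);
}

Note that every pointer update pays the rc++/rc-- cost here; the next slides show how deferred RC and update logging remove most of this overhead.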

Page 12: Lp seminar

An Important Term

A write barrier is a piece of code executed with each pointer update. "p := o2" implies:

  Read p (see o1);
  p := o2;
  o2.rc++;
  o1.rc--;

Page 13: Lp seminar

Deferred Reference Counting

Problem: overhead on updating program variables (locals) is too high.

Solution [Deutsch & Bobrow 76]: don't update rc for local variables (roots). "Once in a while": collect all objects with o.rc = 0 that are not referenced from local variables.

Deferred RC reduces overhead by 80%. Used in most modern RC systems.

Still, the "heap" write barrier is too costly.
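
A minimal sketch of the deferred-RC idea, under stated assumptions: the zero-count-table (ZCT) helpers and the Object layout below are illustrative, not the Deutsch-Bobrow interface.

#include <stdbool.h>
#include <stddef.h>

typedef struct Object { int rc; /* ... other fields ... */ } Object;

/* Assumed helpers for this sketch only. */
void zct_add(Object *o);                  /* remember a potentially dead object       */
bool referenced_from_roots(Object *o);    /* would scan thread stacks and registers   */
void rc_delete(Object *o);                /* recursive free, as in the earlier sketch */

/* Deferred RC write barrier: only heap slots adjust counts; roots are ignored. */
void deferred_rc_update(Object **heap_slot, Object *o2) {
    Object *o1 = *heap_slot;
    if (o2 != NULL) o2->rc++;
    *heap_slot = o2;
    if (o1 != NULL && --o1->rc == 0)
        zct_add(o1);                      /* maybe garbage; decide at the next collection */
}

/* "Once in a while": reclaim zero-count objects not referenced from the roots. */
void deferred_rc_collect(Object **zct, size_t n) {
    for (size_t i = 0; i < n; i++) {
        Object *o = zct[i];
        if (o != NULL && o->rc == 0 && !referenced_from_roots(o))
            rc_delete(o);
    }
}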

Page 14: Lp seminar

Multithreaded RC?

Traditional wisdom: the write barrier must be synchronized!

Page 15: Lp seminar

Multithreaded RC?

Problem 1: reference-count updates must be atomic.

Fortunately, this can be easily solved: each thread logs the required updates in a local buffer, and the collector applies all the updates during GC (as a single thread).

Page 16: Lp seminar

Multithreaded RC?

Problem 2: parallel updates confuse the counters. Example with objects A, B, C, D, where A.next initially points to B:

  Thread 1: Read A.next (see B); A.next := C; B.rc--; C.rc++
  Thread 2: Read A.next (see B); A.next := D; B.rc--; D.rc++

Both threads see B as the old value, so B.rc is decremented twice, and both C.rc and D.rc are incremented although only one of them is actually referenced by A.next.

Page 17: Lp seminar

Known Multithreaded RC

[DeTreville 1990, Bacon et al. 2001]: compare-and-swap for each pointer modification; the thread records its updates in a buffer.

Page 18: Lp seminar

To Summarize the Problems…

Write barrier overhead is high, even with deferred RC.

Using RC with multithreading seems to bear a high synchronization cost: a lock or compare-and-swap with each pointer update.

Page 19: Lp seminar

Reducing RC Overhead

We start by looking at the "parent's point of view": we are counting rc for the child, but rc changes when a parent's pointer is modified.

Page 20: Lp seminar

An Observation

Consider a pointer p that takes the following values between GCs: O0, O1, O2, …, On.

All RC algorithms perform 2n operations:
  O0.rc--; O1.rc++; O1.rc--; O2.rc++; O2.rc--; …; On.rc++;

But only two operations are needed: O0.rc-- and On.rc++.

Page 21: Lp seminar

Use of Observation

Only the first modification of each pointer between collections is logged.

Between two garbage collections:
  p := O1;  (record p's previous value O0)
  p := O2;  (do nothing)
  …
  p := On;  (do nothing)

At garbage collection, for each modified slot p: read p to get On, read the records to get O0, then O0.rc--, On.rc++.
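
A rough C sketch of this scheme, assuming a single global log of (slot, old value) pairs and per-slot dirty-bit helpers; the names are illustrative, and the talk's actual implementation logs per object and uses per-thread buffers, as the next slide notes.

#include <stdbool.h>
#include <stddef.h>

typedef struct Object { int rc; } Object;

typedef struct LogEntry {
    Object **slot;                 /* the modified pointer slot          */
    Object  *old;                  /* its value before the first update  */
} LogEntry;

/* Assumed dirty-bit helpers for this sketch. */
bool is_dirty(Object **slot);
void set_dirty(Object **slot);
void clear_dirty(Object **slot);

static LogEntry log_buf[4096];     /* fixed-size log, enough for the sketch */
static size_t   log_len;

/* Mutator side: log only the first modification of each slot. */
void logged_update(Object **slot, Object *new_val) {
    if (!is_dirty(slot)) {
        log_buf[log_len].slot = slot;
        log_buf[log_len].old  = *slot;     /* this is O0 */
        log_len++;
        set_dirty(slot);
    }
    *slot = new_val;
}

/* Collector side: exactly O0.rc-- and On.rc++ per modified slot. */
void apply_log(void) {
    for (size_t i = 0; i < log_len; i++) {
        Object *o0 = log_buf[i].old;       /* value at the previous collection  */
        Object *on = *log_buf[i].slot;     /* current value, read from the heap */
        if (o0 != NULL) o0->rc--;
        if (on != NULL) on->rc++;
        clear_dirty(log_buf[i].slot);
    }
    log_len = 0;
}

However many times a slot changes between collections, it contributes one log entry and two counter updates, matching the observation above.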

Page 22: Lp seminar

Some Technical Remarks

When a pointer is first modified, it is marked "dirty" and its previous value is logged.

We actually log each object that gets modified (and not just a single pointer). Reason 1: we don't want a dirty bit per pointer. Reason 2: an object's pointers tend to be modified together.

Only non-null pointer fields are logged. New objects are "born dirty".

Page 23: Lp seminar

Effects of Optimization

• RC work significantly reduced:
• The number of logging & counter updates is reduced by a factor of 100-1000 for typical Java benchmarks!

Page 24: Lp seminar

Elimination of RC Updates

Benchmark    No. of stores    No. of "first" stores    Ratio of "first" stores
jbb          71,011,357       264,115                  1/269
Compress     64,905           51                       1/1273
Db           33,124,780       30,696                   1/1079
Jack         135,174,775      1,546                    1/87435
Javac        22,042,028       535,296                  1/41
Jess         26,258,107       27,333                   1/961
Mpegaudio    5,517,795        51                       1/108192

Page 25: Lp seminar

Effects of Optimization

• RC work significantly reduced:
• The number of logging & counter updates is reduced by a factor of 100-1000 for typical Java benchmarks!
• Write barrier overhead dramatically reduced.
• The vast majority of the write barriers run a single "if".
• Last but not least: the task has changed! We need to record the first update.

Page 26: Lp seminar

Reducing Synchronization Overhead

Our second contribution: a carefully designed write barrier (and an observation) does not require any synchronization operation.

Page 27: Lp seminar

The Write Barrier

Update(Object **slot, Object *new) {
    Object *old = *slot;
    if (!IsDirty(slot)) {
        log(slot, old);
        SetDirty(slot);
    }
    *slot = new;
}

Observation: if two threads (1) invoke the write barrier in parallel, and (2) both log an old value, then both record the same old value.

Page 28: Lp seminar

Running the Write Barrier Concurrently

Thread 1:

Update(Object **slot, Object *new) {
    Object *old = *slot;
    if (!IsDirty(slot)) {
        /* If we got here, Thread 2 has not */
        /* yet set the dirty bit, thus, has */
        /* not yet modified the slot.       */
        log(slot, old);
        SetDirty(slot);
    }
    *slot = new;
}

Thread 2:

Update(Object **slot, Object *new) {
    Object *old = *slot;
    if (!IsDirty(slot)) {
        /* If we got here, Thread 1 has not */
        /* yet set the dirty bit, thus, has */
        /* not yet modified the slot.       */
        log(slot, old);
        SetDirty(slot);
    }
    *slot = new;
}

Page 29: Lp seminar

Concurrent Algorithm

Use the write barrier with the program threads. To collect (a rough code sketch of one cycle follows this list):

Stop all threads.
Scan roots (local variables).
Get the buffers with the modified slots.
Clear all dirty bits.
Resume the threads.
For each modified slot: decrement rc for the old value (written in the buffer), increment rc for the current value (read from the heap).
Reclaim non-local objects with rc 0.
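
A skeleton of one such collection cycle, with the thread-control and buffer primitives left as assumed helpers (the names are illustrative):

/* Assumed helpers for this sketch. */
void stop_all_threads(void);
void resume_all_threads(void);
void scan_roots(void);                  /* record objects referenced from locals         */
void take_update_buffers(void);         /* grab each thread's log of modified slots      */
void clear_all_dirty_bits(void);
void adjust_counts_from_buffers(void);  /* old.rc-- from buffers, current.rc++ from heap */
void reclaim_unreferenced(void);        /* free non-root objects whose rc == 0           */

/* One cycle of the concurrent RC collector described above. */
void concurrent_rc_collect(void) {
    stop_all_threads();                 /* short pause                                   */
    scan_roots();
    take_update_buffers();
    clear_all_dirty_bits();
    resume_all_threads();

    adjust_counts_from_buffers();       /* runs concurrently with the program            */
    reclaim_unreferenced();
}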

Page 30: Lp seminar

Timeline

Stop threads → Scan roots; get buffers; erase dirty bits → Resume threads → Decrement values read from buffers; increment "current" values → Collect dead objects.

Page 31: Lp seminar

Timeline (continued)

Stop threads → Scan roots; get buffers; erase dirty bits → Resume threads → Decrement values read from buffers; increment "current" values → Collect dead objects.

Note: unmodified current values are in the heap; modified ones are in the new buffers.

Page 32: Lp seminar

Concurrent Algorithm (revisited)

Use the write barrier with the program threads. To collect: stop all threads; scan roots (local variables); get the buffers with the modified slots; clear all dirty bits; resume the threads; for each modified slot, decrease rc for the old value (written in the buffer) and increase rc for the current value (read from the heap); reclaim non-local objects with rc 0.

Goal 1: clear the dirty bits during the program run.
Goal 2: stop one thread at a time.

Page 33: Lp seminar

The Sliding Views "Framework"

Develop a concurrent algorithm: there is a short time in which all the threads are stopped simultaneously to perform some task.

Avoid stopping the threads together; instead, stop one thread at a time (see the sketch below).

Tricky part: "fix" the problems created by this modification.

Idea borrowed from the distributed computing community [Lamport].
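
A simplified sketch of the "one thread at a time" handshake, under stated assumptions: the per-thread primitives below are illustrative, and the real sliding-views algorithm must also apply the snooping fix described on the next slides.

/* Assumed per-thread primitives for this sketch. */
typedef struct Thread Thread;
Thread *first_thread(void);
Thread *next_thread(Thread *t);
void    suspend_thread(Thread *t);
void    resume_thread(Thread *t);
void    scan_thread_roots(Thread *t);   /* record this thread's local references */
void    take_thread_buffer(Thread *t);  /* grab its log of modified slots        */

/* Instead of one global pause, pause each thread briefly in turn.            */
/* The roots and buffers gathered this way form a "sliding view" of the heap. */
void soft_handshake(void) {
    for (Thread *t = first_thread(); t != NULL; t = next_thread(t)) {
        suspend_thread(t);
        scan_thread_roots(t);
        take_thread_buffer(t);
        resume_thread(t);
    }
}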

Page 34: Lp seminar

Graphically

[Diagram: heap address vs. time. A snapshot reads all heap addresses at a single time t; a sliding view reads them over an interval from t1 to t2.]

Page 35: Lp seminar

Fixing Correctness

The way to do this in our algorithm is to use snooping: while collecting the roots, record objects that get a new pointer, and do not reclaim these objects. (No further details here; a simplified sketch follows.)
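
A simplified sketch of a snooping write barrier, assuming an illustrative per-object snooped flag and a global snoop_flag that the collector raises while it gathers the roots; this sketches the idea only, not the paper's exact code.

#include <stdbool.h>

typedef struct Object { int rc; bool snooped; } Object;

/* Raised by the collector while it is collecting the threads' roots. */
extern volatile bool snoop_flag;

/* Assumed helpers from the logging write barrier sketched earlier. */
bool is_dirty(Object **slot);
void log_and_set_dirty(Object **slot, Object *old);

/* Write barrier extended with snooping: while roots are being collected, */
/* any object that receives a new pointer is marked so it will not be     */
/* reclaimed in the current cycle.                                        */
void snooping_update(Object **slot, Object *new_val) {
    Object *old = *slot;
    if (!is_dirty(slot))
        log_and_set_dirty(slot, old);
    *slot = new_val;
    if (snoop_flag && new_val != NULL)
        new_val->snooped = true;
}

Objects marked this way are simply kept alive for the current cycle and reconsidered at the next one.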

Page 36: Lp seminar

Cycle Collection

Our initial solution: use a tracing algorithm infrequently.

More about this tracing collector and about cycle collectors later…

Page 37: Lp seminar

Performance Measurements

Implementation for Java on the Jikes Research JVM.

Compared collectors: Jikes parallel stop-the-world (STW); Jikes concurrent RC (Jikes concurrent).

Benchmarks: SPECjbb2000, a server benchmark that simulates business-like transactions; SPECjvm98, a client suite of mostly single-threaded benchmarks.

Page 38: Lp seminar

Pause Times vs. STW

[Chart: pause times (ms), LevPet vs. Jikes STW; data below.]

Benchmark   LevPet   Jikes STW
jess        1.3      260.67
db          0.67     188.33
javac       1.68     643.33
mpeg        0.59     205.67
jack        0.97     225
mtrt2       0.89     376
jbb-1       0.8      322
jbb-2       0.61     416.67
jbb-3       1.06     511.33

Page 39: Lp seminar

Pause Times vs. Jikes Concurrent

[Chart: pause times (ms), Jikes concurrent vs. LevPet; data below.]

Benchmark   Jikes Concurrent   LevPet
jess        2.77               1.3
db          1.84               0.67
javac       2.81               1.68
mpeg        0.8                0.59
jack        1.66               0.97
mtrt2       1.8                0.89
jbb-1       1.79               0.8
jbb-2       2.6                0.61
jbb-3       3.15               1.06

Page 40: Lp seminar

SPECjbb2000 Throughput

[Chart: SPECjbb2000, LevPet vs. Jikes concurrent — throughput ratio (y-axis, 0.8 to 2) over heap sizes 256-704 MB (x-axis), one series per configuration jbb1-jbb8.]

Page 41: Lp seminar

SPECjvm98 Throughput

[Chart: SPECjvm98, throughput ratio Jikes concurrent / LevPet (y-axis, 0.8 to 1.6) over heap sizes 24-96 MB (x-axis), one series per benchmark: jess, db, javac, mpeg, jack, mtrt.]

Page 42: Lp seminar

SPECjbb2000 Throughput

[Chart: throughput ratio LevPet / parallel tracing (y-axis, 0.6 to 1.1) over heap sizes 256-704 MB (x-axis), eight series.]

Page 43: Lp seminar

A Glimpse into Subsequent Work: SPECjbb2000 Throughput

[Chart: throughput ratio tracing / RC (y-axis, 0.5 to 1.1) over heap sizes 256-704 MB (x-axis), eight series.]

Page 44: Lp seminar

Subsequent Work

Cycle Collection [CC'05]

A Mark-and-Sweep Collector [OOPSLA'03]

A Generational Collector [CC'03]

An Age-Oriented Collector [CC'05]

Page 45: Lp seminar

Related Work

It's not clear where to start… RC, concurrent, generational, etc. Some of the more relevant work was mentioned along the way.

Page 46: Lp seminar

Conclusions

A study of concurrent garbage collection with a focus on RC.

Novel techniques obtaining short pauses and high efficiency.

The best approach: age-oriented collection, with concurrent RC for the old generation and concurrent tracing for the young.

Implementation and measurements on Jikes demonstrate non-obtrusiveness and high efficiency.

Page 47: Lp seminar

Project Building Blocks

A novel reference counting algorithm.
State-of-the-art cycle collection.
Generational RC (for old) and tracing (for young).
A concurrent tracing collector.
An age-oriented collector: fitting generations with concurrent collectors.