ParaLog: Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

Post on 25-Feb-2016


DESCRIPTION

ParaLog: Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications. Evangelos Vlachos, Michelle L. Goodstein, Michael A. Kozuch, Shimin Chen, Phillip B. Gibbons, Babak Falsafi and Todd C. Mowry. Software Errors & Analysis Tools.

TRANSCRIPT

Computer Architecture Lab at

Evangelos Vlachos, Michelle L. Goodstein, Michael A. Kozuch, Shimin Chen, Phillip B. Gibbons, Babak Falsafi and Todd C. Mowry

ParaLog: Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 2

Software Errors & Analysis Tools

• Errors abundant in parallel software
  – Program crashes/vulnerabilities, limited performance

• Three main categories of analysis tools
  – Checking before, during or after program execution

• Instruction-grain Lifeguards
  – Online detailed analysis, but with high overhead
  – Several tools available, but mostly support for single-threaded code

© Evangelos Vlachos

ParaLog: a framework for efficient analysis of parallel applications

Lifeguards and Parallel Applications

Application threads can be monitored in three ways:

• Timesliced Execution & Analysis (DBI tools available today)
  - high overhead due to serialization

• Parallel Execution & Analysis with Butterfly Analysis (previous talk): windows of uncertainty
  - some false positives
  + software-based

• Parallel Execution & Analysis with ParaLog (this talk): precise application order
  - new hardware required
  + no false positives
  + even better performance

Low-Overhead Instruction-level Analysis


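[Chen et al., ISCA'08]

[Figure: online monitoring platform. The application thread runs on the application core; event capturing turns its instructions into an event stream, and event delivery hands each event to the lifeguard thread on the lifeguard core, which maintains the metadata. Accelerators: IT, IF, MTLB.]

Example: the application instruction

    add r1 r2, r4

is delivered as the event (add, r1, r2, r4) and handled by:

    add_handler() {
        i = load_state(r2);
        j = load_state(r4);
        if (check(i, j)) upd_state(r1);
        else error();
    }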
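The handler on this slide can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the metadata is taken to be one boolean flag per register, `check` is assumed to pass when both source flags are set, and `state`, `deliver`, and `LifeguardError` are hypothetical names that do not come from the talk.

```python
class LifeguardError(Exception):
    """Stands in for the slide's error() call."""


# Assumed metadata: one "valid" flag per register (not ParaLog's real layout).
state = {"r1": False, "r2": True, "r4": True}


def load_state(reg):
    return state[reg]


def upd_state(reg):
    state[reg] = True


def check(i, j):
    # Assumed policy: the result is valid only if both sources are valid.
    return i and j


def add_handler(dst, src1, src2):
    i = load_state(src1)
    j = load_state(src2)
    if check(i, j):
        upd_state(dst)
    else:
        raise LifeguardError(f"add {dst} uses invalid source {src1} or {src2}")


HANDLERS = {"add": add_handler}


def deliver(event):
    """Event delivery: dispatch one captured event to its handler."""
    op, dst, src1, src2 = event
    HANDLERS[op](dst, src1, src2)


# The event captured for the application instruction on the slide.
deliver(("add", "r1", "r2", "r4"))
```

After the call, the flag for `r1` is set. Clearing the flag for `r2` and delivering the same event again raises `LifeguardError` instead.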


Challenges in Parallel Monitoring

[Figure: online parallel monitoring platform [ParaLog]. Each application thread 1..k has its own event capturing and event stream, which event delivery hands to a matching lifeguard thread 1..k. The lifeguard threads share global metadata. Accelerators: IT, IF, MTLB.]

Addressing the Challenges

1. Application event ordering

2. Ensuring metadata access atomicity efficiently

3. Parallelizing hardware accelerators

[Figure: online parallel monitoring platform [ParaLog]. For each application thread 1..k, application-only order capturing sits next to event capturing on the application side, and order enforcing sits next to event delivery on the lifeguard side. The event stream carries dependence arcs to the matching lifeguard thread, and the lifeguard threads share global metadata.]

Outline

• Introduction

• Addressing the Challenges of Parallel Monitoring
  1. Capturing & enforcing application event ordering
  2. Ensuring metadata access atomicity
  3. Parallelizing hardware accelerators

• Evaluation

• Conclusions


Event Ordering: the Problem

• Case Study: Information flow analysis (e.g., TaintCheck)

[Figure: in application time, application thread j executes store(A) before application thread k executes load(A). In lifeguard time, lifeguard thread j runs st_handler(A) and lifeguard thread k runs ld_handler(A), with nothing ordering the two handlers.]

Expose happens-before information to lifeguards

Event Ordering: the Solution (1/2)

• Coherence-based ordering of application events
  – Similar to FDR, but online, focusing on application-only events

[Figure: application thread j executes store(A) at time tj, and application thread k then executes load(A) at time tk. The load's record carries the dependence arc {thread j, tj}. Each lifeguard thread publishes a progress counter (progressj, progressk). Before running ld_handler(A), lifeguard thread k waits while progressj < tj, so st_handler(A) runs first.]
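The "wait while progressj < tj" rule can be sketched in software. This is a minimal Python illustration, not the hardware mechanism from the talk: the `Progress` class, the `(timestamp, event, arc)` record layout, and `run_lifeguard` are assumptions made for the example, and a counter value of t is assumed to mean "the record with timestamp t has been handled".

```python
import threading


class Progress:
    """Per-lifeguard progress counter that other lifeguards can wait on."""

    def __init__(self):
        self._value = 0
        self._cond = threading.Condition()

    def advance(self, t):
        with self._cond:
            self._value = t
            self._cond.notify_all()

    def wait_for(self, t):
        # Mirrors the slide: wait while progress < t.
        with self._cond:
            self._cond.wait_for(lambda: self._value >= t)


def run_lifeguard(name, log, progress, handled, lock):
    for t, event, arc in log:
        if arc is not None:
            src, src_t = arc
            progress[src].wait_for(src_t)  # enforce the dependence arc
        with lock:
            handled.append((name, event))  # stands in for running the handler
        progress[name].advance(t)


def demo():
    progress = {"j": Progress(), "k": Progress()}
    handled, lock = [], threading.Lock()
    logs = {
        "j": [(1, "st_handler(A)", None)],
        "k": [(1, "ld_handler(A)", ("j", 1))],  # arc {thread j, tj = 1}
    }
    # Start lifeguard k first to show that the arc, not luck, orders the handlers.
    threads = [
        threading.Thread(target=run_lifeguard,
                         args=(n, logs[n], progress, handled, lock))
        for n in ("k", "j")
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return handled
```

Calling `demo()` always returns the store handler before the load handler, because lifeguard k blocks until lifeguard j has advanced its counter past the store.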

Is monitoring coherence enough?

Event Ordering: the Solution (2/2)

• Previous work has not solved the problem of Logical Races
• Both logical races and system calls resolved with Conflict Alert messages

[Figure: application thread j calls free(A) while application thread k executes load(A): a Logical Race. In lifeguard time, lifeguard thread j updates Metadata(A) between free(A) start and free(A) end, and a Conflict Alert message creates the dependence that orders ld_handler(A) in lifeguard thread k against it.]
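One way to picture a Conflict Alert is as a record broadcast into every other thread's log that carries a dependence arc back to the high-level event. The Python below is a simplified, single-threaded sketch under that assumption. It models free(A) as a single event rather than the start/end pair on the slide, and `broadcast_conflict_alert`, `log_event`, and `replay` are hypothetical helpers, not ParaLog interfaces.

```python
from collections import deque


def log_event(logs, clocks, tid, event, arc=None):
    """Append one record (timestamp, event, arc) to a thread's log."""
    clocks[tid] += 1
    logs[tid].append((clocks[tid], event, arc))
    return clocks[tid]


def broadcast_conflict_alert(logs, clocks, sender, event):
    """Log a high-level event and a Conflict Alert record in every other log."""
    t = log_event(logs, clocks, sender, event)
    for tid in logs:
        if tid != sender:
            log_event(logs, clocks, tid, "conflict_alert", arc=(sender, t))


def replay(logs):
    """Handle records in an order that respects every dependence arc."""
    queues = {tid: deque(log) for tid, log in logs.items()}
    progress = {tid: 0 for tid in logs}
    order = []
    while any(queues.values()):
        advanced = False
        # Visit thread k before thread j to show the arc still wins.
        for tid in sorted(queues, reverse=True):
            queue = queues[tid]
            while queue:
                t, event, arc = queue[0]
                if arc is not None and progress[arc[0]] < arc[1]:
                    break  # wait for the remote lifeguard
                queue.popleft()
                progress[tid] = t
                order.append((tid, event))
                advanced = True
        if not advanced:
            raise RuntimeError("dependence cycle in logs")
    return order


logs = {"j": [], "k": []}
clocks = {"j": 0, "k": 0}
broadcast_conflict_alert(logs, clocks, "j", "free(A)")
log_event(logs, clocks, "k", "load(A)")
order = replay(logs)
```

In this sketch the load in thread k is handled only after thread j has handled free(A), even though no coherence traffic relates the two events.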

Metadata Atomicity

• Frequent use of locking is too expensive
  – # of instructions added & synchronization cost

• Dependence arcs handle the majority of the cases
  – Sufficient conditions:
    1. One-to-one data-to-metadata mapping
    2. Application reads don’t become metadata writes
  – When both hold, enforcing dependence arcs gives race-free operation

• Rest of the cases handled by acquiring a lock
  – Lock used only in the load_handler(); other handlers safe

(more details in the paper)

Parallel Hardware Accelerators

• Speed-up frequent lifeguard actions
  – Metadata-TLB: fast metadata address calculation
  – Idempotent Filters: filter out redundant checking
  – Inheritance Tracking: fast tracking of dataflow paths

• Accelerators have only local view of the analysis
  – Cache locally analysis information (e.g., frequent events)
  – Important events have application-wide effects (e.g., free())
  – Coherence-like issues with accelerators’ local state

• Important events accompanied by Conflict Alerts
  – Use Conflict Alerts to flush accelerators’ state

Outline

• Introduction

• Addressing the Challenges of Parallel Monitoring
  – Capturing & enforcing application event ordering
  – Ensuring metadata access atomicity
  – Parallelizing hardware accelerators

• Evaluation

• Conclusions


Experimental Framework

• Log-Based Architectures framework
  – Simics full-system simulation
  – CMP system with {2, 4, 8, 16} cores
  – {1, 2, 4, 8} application threads and as many lifeguard threads
  – Sequentially Consistent memory model

• Benchmarks and multithreaded Lifeguards used
  – SPLASH-2 and PARSEC
  – TaintCheck: Information flow tracking; accelerated by M-TLB, IT
  – AddrCheck: Memory access checking; accelerated by M-TLB, IF

• Comparison with Timesliced Monitoring

Performance Results: AddrCheck

[Figure: AddrCheck. Normalized Execution Time (y-axis, 0.0 to 1.6) of No Monitoring, Timesliced Monitoring, and ParaLog with 1, 2, 4, and 8 threads for BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM, RADIOSITY, and GEO. MEAN. Normalized to sequential, unmonitored execution. 8 app/lifeguard threads, 16 cores total. Labels on off-scale bars: 2.3, 6.1, 6.7, 1.7, 1.9, 2.9, 9.5, 15.4, 2.1, 6.2, 1.9, 2.4.]

• Timesliced Monitoring is not scalable
• On average 15x slowdown over No Monitoring (8 threads)

• Highest overhead with 8 threads: SWAPTIONS 6x
• Lowest overhead with 8 threads: < 5%
• Average overhead with 8 threads: 26%

Performance Results: TaintCheck

[Figure: TaintCheck. Normalized Execution Time (y-axis, 0 to 1.6) of No Monitoring, Timesliced Monitoring, and ParaLog with 1, 2, 4, and 8 threads for BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM, RADIOSITY, and GEO. MEAN. Labels on off-scale bars: 2.1, 11.5, 12.9, 1.9, 10, 1.7, 1.9, 2.9, 6.6, 4.6, 15.7, 2.4, 2.8, 1.7.]

• Timesliced Monitoring is not scalable
• On average 23x slowdown over No Monitoring (8 threads)

• Highest overhead with 8 threads: BARNES 2.6x
• Lowest overhead with 8 threads: LU 5%
• Average overhead with 8 threads: 48%

Other Results in the Paper

• Order capturing and order enforcing under TSO

• Performance Impact of Lifeguard Accelerators
  – AddrCheck: [1.13x – 3.4x], TaintCheck: [2x – 9x]

• A less expensive order capturing mechanism gets similar performance results
  – 1 timestamp per core vs. 1 timestamp per cache block

Conclusions

• ParaLog: Fast and precise parallel monitoring

• Components of event ordering
  – Normal memory accesses: monitor coherence activity
  – Logical Races: use of Conflict Alert messages

• Metadata Atomicity
  – Enforcing dependence arcs ensures atomicity (most cases)

• Parallel Hardware Accelerators
  – Flush local state on remote events (Conflict Alert)

• Average overhead is relatively low
  – AddrCheck: 26% and TaintCheck: 48% (8 threads)

Questions ?


Backup Slides


Metadata Atomicity

• Synchronization-free fast path vs. slow path
  – Concurrent application reads; no ordering available!
    • Concurrent metadata reads: follow the fast path
    • Concurrent metadata writes: follow the slow path, acquiring a lock
    • Concurrent metadata read and write: read may get either value
  – In any other case dependence arcs are available

[Table: Application Event vs. Lifeguard Action (R/W) for AddrCheck, TaintCheck, MemCheck, and LockSet. Extracted cell values: R R W, W R W.]
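The fast-path and slow-path split can be illustrated with a toy load handler. The Python below is a sketch under assumptions that are not in the talk: metadata lives in a dictionary, a single global lock stands in for whatever locking granularity the real system uses, and a lifeguard whose reads write metadata is modeled as recording the set of reader thread IDs.

```python
import threading


class Lifeguard:
    """Toy lifeguard metadata table (hypothetical; not ParaLog's actual layout)."""

    def __init__(self, reads_write_metadata):
        self.reads_write_metadata = reads_write_metadata
        self.table = {}
        self.lock = threading.Lock()

    def load_handler(self, addr, tid):
        if not self.reads_write_metadata:
            # Fast path: an application read only reads metadata, so
            # concurrent handlers need no synchronization.
            return self.table.get(addr)
        # Slow path: concurrent application reads are unordered, yet each
        # one writes metadata, so the update is guarded by a lock.
        with self.lock:
            readers = self.table.setdefault(addr, set())
            readers.add(tid)
            return len(readers)


def run_readers(lifeguard, addr, n_threads):
    """Run one load_handler per thread on the same address, then return its metadata."""
    threads = [
        threading.Thread(target=lifeguard.load_handler, args=(addr, tid))
        for tid in range(n_threads)
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return lifeguard.table.get(addr)
```

With `Lifeguard(True)`, `run_readers(lg, "A", 8)` ends with all eight reader IDs recorded. With `Lifeguard(False)`, the handlers only read and never take the lock.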

Parallel Hardware Accelerators

• Accelerators have only local view of the analysis
  – Important events have system-wide effects
  – Case study: Idempotent Filters and AddrCheck

[Figure: two lifeguards, LG 0 and LG 1, each behind an Idempotent Filter (IF). Streams of reads R(A), R(B), R(C) pass through the filters: ✔ = delivered to lifeguard, ✖ = redundant; discarded. A free(A) flushes the local and remote IF filters, so the next R(A) is delivered again. Builds on Remote Conflict Messages.]

• Details for parallel M-TLB and IT can be found in the paper
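The filter-and-flush behavior in this case study can be sketched as follows. This Python is an illustration only: the real Idempotent Filter is a hardware structure with bounded capacity, whereas this sketch assumes an unbounded set, and `IdempotentFilter`, `admit`, and `conflict_alert` are hypothetical names.

```python
class IdempotentFilter:
    """Discards events already seen since the last flush."""

    def __init__(self):
        self._seen = set()

    def admit(self, event):
        """Return True if the event must be delivered to the lifeguard."""
        if event in self._seen:
            return False  # redundant; discarded
        self._seen.add(event)
        return True  # delivered to lifeguard

    def flush(self):
        self._seen.clear()


def conflict_alert(filters):
    """An important event such as free(A) flushes local and remote filters."""
    for f in filters:
        f.flush()


lg0, lg1 = IdempotentFilter(), IdempotentFilter()
before = [lg0.admit(e) for e in ["R(A)", "R(B)", "R(A)"]]
remote_before = lg1.admit("R(A)")

conflict_alert([lg0, lg1])  # free(A) observed by one thread

after = lg0.admit("R(A)")
remote_after = lg1.admit("R(A)")
```

Here the second R(A) on LG 0 is filtered out before the flush, while the first R(A) after the flush is delivered on both lifeguards, so AddrCheck gets the chance to see the access to freed memory.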

Performance Impact of Lifeguard Accelerators

TaintCheck slowdown (8 threads):

Benchmark     Not Accelerated   Accelerated
BARNES        5.7               2.6
LU            9.4               1.0
OCEAN         6.8               1.1
BLACKSCH.     4.2               1.3
FLUIDANIM.    4.3               1.4
SWAPTIONS     5.4               2.2
FMM           7.3               1.4
RADIOSITY     11.3              1.5

• Accelerators provide a major speedup [2x – 9x]
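The [2x – 9x] range can be checked against the bar labels on this slide. The short Python sketch below assumes that the labels map to the benchmarks in left-to-right order. That mapping is an inference from the extracted text, although it agrees with the 2.6x BARNES overhead reported in the main TaintCheck results.

```python
benchmarks = ["BARNES", "LU", "OCEAN", "BLACKSCH.",
              "FLUIDANIM.", "SWAPTIONS", "FMM", "RADIOSITY"]

# TaintCheck slowdown with 8 threads, read off the slide's bar labels.
not_accelerated = [5.7, 9.4, 6.8, 4.2, 4.3, 5.4, 7.3, 11.3]
accelerated = [2.6, 1.0, 1.1, 1.3, 1.4, 2.2, 1.4, 1.5]


def speedups(slow, fast):
    """Speedup from accelerators = unaccelerated slowdown / accelerated slowdown."""
    return {name: s / f for name, s, f in zip(benchmarks, slow, fast)}


taintcheck = speedups(not_accelerated, accelerated)
lowest = min(taintcheck, key=taintcheck.get)
highest = max(taintcheck, key=taintcheck.get)
print(f"{lowest}: {taintcheck[lowest]:.1f}x, {highest}: {taintcheck[highest]:.1f}x")
# prints "BARNES: 2.2x, LU: 9.4x"
```

Under that mapping the per-benchmark speedups span roughly 2.2x to 9.4x, consistent with the range quoted on the slide.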

Performance Impact of Lifeguard Accelerators

• Accelerators provide a major speedup [1.13x – 3.4x]

[Figure: AddrCheck. Slowdown (8 threads), y-axis 0 to 9, Not Accelerated vs. Accelerated, for BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM, and RADIOSITY. Bar labels in extraction order: 3.9, 3.2, 1.0, 1.4, 1.1, 8.4, 1.0, 1.4, 1.1, 1.0, 1.0, 1.0, 1.0, 6.0, 1.0, 1.0.]

Transitive Reduction Sensitivity Study

TaintCheck slowdown (8 threads):

Benchmark     Limited (1 timestamp / cache)   Ideal (1 timestamp / cache block)
BARNES        2.9                             2.6
LU            1.1                             1.0
OCEAN         1.2                             1.1
BLACKSCH.     1.3                             1.3
FLUIDANIM.    1.5                             1.4
SWAPTIONS     3.1                             2.2
FMM           1.5                             1.4
RADIOSITY     1.6                             1.5

• Limited transitive reduction
  – No major performance impact; savings in chip area

Supporting Total Store Order (TSO)

• Cycle of dependencies in relaxed memory models
  – TSO relaxes the RAW ordering
  – Previous work (RTR): maintain versions of data
  – Identify SC offending instructions; save loaded value

• This paper: maintain versions of metadata

[Figure: Thread 0 commits Wr(A) then Rd(B); Thread 1 commits Wr(B) then Rd(A) (commit order 0, 1, 2, shown against the memory order). Log 0 holds P(v1, A), Wr(A), C(v0, B), Rd(B, v0); Log 1 holds P(v0, B), Wr(B), C(v1, A), Rd(A, v1). Lifeguard 0 runs:]

    produce_version(v1, A)
    store_handler(A)
    wait_until_available(v0, B)
    load_handler(B, v0)
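The produce/consume sequence for Lifeguard 0 can be sketched with a small versioned metadata store. The Python below is an interpretation, not the paper's mechanism: it assumes `produce_version` snapshots the metadata as it was before the store handler runs, that Lifeguard 1 mirrors Lifeguard 0 with A and B swapped, and that `VersionStore`, `lifeguard`, and the "clean"/"tainted" values are hypothetical.

```python
import threading


class VersionStore:
    """Metadata snapshots keyed by (version, address)."""

    def __init__(self):
        self._versions = {}
        self._cond = threading.Condition()

    def produce_version(self, version, addr, value):
        with self._cond:
            self._versions[(version, addr)] = value
            self._cond.notify_all()

    def wait_until_available(self, version, addr):
        with self._cond:
            self._cond.wait_for(lambda: (version, addr) in self._versions)
            return self._versions[(version, addr)]


def lifeguard(own, other, produced, consumed, metadata, versions, seen):
    # Snapshot the old metadata of `own` for the remote load that saw the old value.
    versions.produce_version(produced, own, metadata[own])
    metadata[own] = "tainted"  # store_handler(own)
    # load_handler(other, version): read the snapshot, not the live metadata.
    seen[other] = versions.wait_until_available(consumed, other)


def demo():
    metadata = {"A": "clean", "B": "clean"}
    versions, seen = VersionStore(), {}
    t0 = threading.Thread(target=lifeguard,
                          args=("A", "B", "v1", "v0", metadata, versions, seen))
    t1 = threading.Thread(target=lifeguard,
                          args=("B", "A", "v0", "v1", metadata, versions, seen))
    for th in (t0, t1):
        th.start()
    for th in (t0, t1):
        th.join()
    return metadata, seen
```

Each lifeguard produces its snapshot before it waits, so neither blocks the other, and both load handlers observe the pre-store metadata. Ordering the four handlers directly would instead form a cycle: each load handler would have to precede the other thread's store handler while following its own.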

Parallel Hardware Accelerators

• Speed-up frequent lifeguard actions
  – Fast metadata address calculation: Metadata-TLB
  – Fast tracking of data-flow paths: Inheritance Tracking
  – Filter out redundant checking: Idempotent Filters
    • Per-instruction checking gives the same result; cache the event

• Accelerators have only local view of the analysis
  – Important events have system-wide effects (e.g., free())
  – Coherence-like issues with accelerators’ local state

• Important events accompanied by Conflict Alerts
  – Use Conflict Alerts to flush state and deliver pending events

Experimental Framework

Benchmarks      Input
barnes          16K bodies
ocean           Grid: 258 x 258
lu              Matrix: 1024 x 1024
fmm             32768 particles
radiosity       Base problem
blackscholes    Simlarge
fluidanimate    Simlarge
swaptions       Simlarge

Simulation Parameters
Cores                  {2, 4, 8, 16}, 1 GHz, in-order scalar x86
L1I & L1D (private)    64KB, 64B line, 4-way assoc.
L2 (shared)            {1, 2, 4, 8}MB, 64B line, 8-way assoc., 6-cycle latency
Memory                 90-cycle latency
Log Buffer             64KB per thread

Multithreaded Lifeguards
TaintCheck: Information flow tracking; accelerated by M-TLB and IT
AddrCheck: Memory access checking; accelerated by M-TLB and IF

Relative Slowdown - TaintCheck

[Figure: TaintCheck. Slowdown (y-axis, 0.0 to 3.0) with 1, 2, 4, and 8 threads for BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM, and RADIOSITY. Each bar is broken down into Waiting for Application, Waiting for Dependence, and Useful Work.]

Relative Slowdown - AddrCheck

[Figure: AddrCheck. Slowdown (y-axis, 0.0 to 1.4) with 1, 2, 4, and 8 threads for BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM, and RADIOSITY. Each bar is broken down into Waiting for Application, Waiting for Dependence, and Useful Work. Labels on off-scale bars: 3.0, 6.0.]


Parallel Hardware Accelerators

• Speed-up frequent lifeguard actions
  – Metadata-TLB & Inheritance Tracking (discussed in the paper)
  – Idempotent Filters: identify and filter out redundant checking
    • Per-instruction checking gives the same result
    • Cache incoming event and local state to identify redundancy

• Accelerators have only local view of the analysis
  – Important events have application-wide effects (e.g., free())
  – Coherence-like issues with accelerators’ local state

• Important events accompanied by Conflict Alerts
  – Use Conflict Alerts to flush accelerators’ state
