ParaLog: Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

Evangelos Vlachos, Michelle L. Goodstein, Michael A. Kozuch, Shimin Chen, Phillip B. Gibbons, Babak Falsafi, and Todd C. Mowry

Computer Architecture Lab at

Software Errors & Analysis Tools

• Errors abundant in parallel software
– Program crashes/vulnerabilities, limited performance

• Three main categories of analysis tools
– Checking before, during or after program execution

• Instruction-grain Lifeguards
– Online detailed analysis, but with high overhead
– Several tools available, but mostly support for single-threaded code

ParaLog: a framework for efficient analysis of parallel applications

© Evangelos Vlachos ASPLOS '10 - ParaLog

Lifeguards and Parallel Applications

[Figure: application threads over time, comparing timesliced execution & analysis with parallel execution & analysis]

• Timesliced Execution & Analysis: DBI tools available today
– high overhead due to serialization

• Parallel Execution & Analysis: Butterfly Analysis (previous talk)
– windows of uncertainty
– some false positives
+ software-based

• Parallel Execution & Analysis: ParaLog (this talk)
– precise application
– new hardware required
+ no false positives
+ even better performance

Low-Overhead Instruction-level Analysis

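[Chen et al., ISCA’08]

[Figure: online monitoring platform. The application thread runs on the application core; event capturing turns each instruction into a record in the event stream, and event delivery passes it to the lifeguard thread on the lifeguard core, which maintains the metadata. Accelerators: IT, IF, MTLB]

Application instruction:  add r1, r2, r4
Event record in stream:   add, r1, r2, r4
Lifeguard handler:

    add_handler() {
        i = load_state(r2);   /* metadata of the first source */
        j = load_state(r4);   /* metadata of the second source */
        if (check(i, j))
            upd_state(r1);    /* update destination metadata */
        else
            error();
    }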
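The capture, deliver, and handle flow on this slide can be sketched in Python. This is only an illustration under stated assumptions: the `Event` record, the `"init"`/`"uninit"` metadata values, and the rule inside `check` are hypothetical stand-ins, chosen so that the handler has the same shape as the slide's `add_handler()`.

```python
from collections import namedtuple

# Hypothetical event record, mirroring the slide's "add, r1, r2, r4".
Event = namedtuple("Event", ["op", "dst", "src1", "src2"])

# Lifeguard metadata (assumed layout): registers listed here are initialized.
state = {"r2": "init", "r4": "init"}

def load_state(reg):
    """Return the metadata for a register ("uninit" if never written)."""
    return state.get(reg, "uninit")

def check(i, j):
    """Assumed rule: both source operands must be initialized."""
    return i == "init" and j == "init"

def upd_state(reg):
    """Mark the destination register as initialized."""
    state[reg] = "init"

def add_handler(ev):
    i = load_state(ev.src1)
    j = load_state(ev.src2)
    if check(i, j):
        upd_state(ev.dst)
    else:
        raise RuntimeError(f"uninitialized operand in {ev}")

HANDLERS = {"add": add_handler}

def deliver(ev):
    """Event delivery: dispatch one captured event to its handler."""
    HANDLERS[ev.op](ev)

deliver(Event("add", "r1", "r2", "r4"))  # r1 becomes initialized
```

The handler table keeps the lifeguard logic separate from event delivery, which is the split the platform figure draws between the event stream and the lifeguard thread.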

Challenges in Parallel Monitoring

[Figure: online parallel monitoring platform (ParaLog). Each application thread 1..k produces its own event stream; event capturing and event delivery feed it to the matching lifeguard thread 1..k. All lifeguard threads share global metadata, and each has its own accelerators (IT, IF, MTLB)]

Addressing the Challenges

1. Application event ordering

2. Ensuring metadata access atomicity efficiently

3. Parallelizing hardware accelerators

[Figure: online parallel monitoring platform (ParaLog) with ordering support. Application-only order capturing on each application thread 1..k records dependence arcs in the event stream; order enforcing on each lifeguard thread 1..k applies them before handlers access the global metadata]

Outline

• Introduction

• Addressing the Challenges of Parallel Monitoring
1. Capturing & enforcing application event ordering
2. Ensuring metadata access atomicity
3. Parallelizing hardware accelerators

• Evaluation

• Conclusions

Event Ordering: the Problem

• Case Study: Information flow analysis (i.e., TaintCheck)

[Figure: in application time, thread j executes store(A) before thread k executes load(A). In lifeguard time, nothing stops lifeguard thread k from running ld_handler(A) before lifeguard thread j has run st_handler(A). Annotations: the load event carries the arc {thread j, tj}; each lifeguard publishes a progress counter (progressj, progressk)]

Expose happens-before information to lifeguards

Event Ordering: the Solution (1/2)

• Coherence-based ordering of application events
– Similar to FDR, but online, focusing on application-only events

[Figure: store(A) on thread j at time tj and load(A) on thread k at time tk. Lifeguard j runs st_handler(A); lifeguard k must wait while progressj < tj before it handles load(A)]
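The "wait while progressj < tj" rule can be illustrated with two Python threads standing in for lifeguard threads j and k. This is a minimal sketch, not the hardware mechanism: the `Progress` class, the timestamp value 5, and the `"tainted"` metadata value are assumptions made for the example.

```python
import threading

class Progress:
    """Per-lifeguard progress counter that other lifeguards can wait on."""
    def __init__(self):
        self._value = 0
        self._cond = threading.Condition()

    def advance(self, t):
        # Publish that all events up to timestamp t have been handled.
        with self._cond:
            self._value = t
            self._cond.notify_all()

    def wait_for(self, t):
        # Same effect as the slide's "wait while progress_j < t_j".
        with self._cond:
            self._cond.wait_for(lambda: self._value >= t)

T_J = 5                  # assumed timestamp of store(A) on thread j
metadata = {}            # global metadata shared by the lifeguards
progress_j = Progress()
seen = []

def lifeguard_j():
    metadata["A"] = "tainted"     # st_handler(A)
    progress_j.advance(T_J)

def lifeguard_k():
    # The load event carries the dependence arc {thread j, T_J}.
    progress_j.wait_for(T_J)
    seen.append(metadata["A"])    # ld_handler(A) reads up-to-date metadata

thread_k = threading.Thread(target=lifeguard_k)
thread_j = threading.Thread(target=lifeguard_j)
thread_k.start()   # start the consumer first on purpose
thread_j.start()
thread_k.join()
thread_j.join()
```

Even though lifeguard k is started first, it cannot read the metadata for A until lifeguard j has handled the store, so the lifeguards observe the same order as the application.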

Is monitoring coherence enough?

Event Ordering: the Solution (2/2)

• Previous work has not solved the problem of Logical Races

• Both logical races and system calls resolved with Conflict Alert messages

[Figure: one thread calls free(A) while another executes load(A). In application time the two form a logical race: free(A) changes Metadata(A) between free(A) start and free(A) end. Conflict Alert messages create the missing dependence in lifeguard time]
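One way to picture the Conflict Alert dependence is a small deterministic simulation. Everything here is a hypothetical model rather than the actual message protocol: the record shapes, the timestamps, and the assumption that the alert reaches thread 1 before its load executes. The scheduler deliberately favors lifeguard 1 to show that the alert still forces the free to be handled first.

```python
# Per-thread event logs; record shapes are assumptions for this sketch.
logs = {
    0: [("free", "A", 3)],   # thread 0 frees A at its timestamp 3
    1: [("load", "A", 7)],   # thread 1 loads A at its timestamp 7
}
metadata = {"A": "allocated"}
progress = {0: 0, 1: 0}
errors = []

def broadcast_conflict_alert(src, ts):
    """Place a Conflict Alert ahead of the pending events of other threads."""
    for tid, log in logs.items():
        if tid != src:
            log.insert(0, ("conflict_alert", src, ts))

def step(tid):
    """Handle the next event of lifeguard tid; False means stalled or empty."""
    if not logs[tid]:
        return False
    kind, arg, ts = logs[tid][0]
    if kind == "conflict_alert":
        if progress[arg] < ts:   # arg is the id of the alerting thread
            return False         # stall until the remote free() is handled
    elif kind == "free":
        metadata[arg] = "unallocated"
        progress[tid] = ts
    elif kind == "load":
        if metadata[arg] != "allocated":
            errors.append((tid, "load", arg))
        progress[tid] = ts
    logs[tid].pop(0)
    return True

broadcast_conflict_alert(src=0, ts=3)   # issued when free(A) is captured

# Round-robin that prefers lifeguard 1; it only falls back to 0 when stalled.
while any(logs.values()):
    if not step(1):
        step(0)
```

Without the alert record, lifeguard 1 would handle the load against stale "allocated" metadata and miss the use-after-free.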

Metadata Atomicity

• Frequent use of locking too expensive
– # of instructions added & synchronization cost

• Dependence arcs handle the majority of the cases
– Sufficient conditions:
1. One-to-one data-to-metadata mapping
2. Application reads don’t become metadata writes
– Enforcing dependence arcs ensures race-free operation

• Rest of the cases handled by acquiring a lock
– Lock used only in the load_handler(); other handlers safe

(more details in the paper)
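The case where an application read does become a metadata write can be sketched as a load handler with a lock-free fast path and a locked slow path. The slide states only that the lock is confined to `load_handler()`; the re-check under the lock, the `"unread"`/`"read"` states, and the `transitions` list are assumptions made for this example.

```python
import threading

metadata = {"A": "unread"}   # hypothetical per-location state
transitions = []             # records every metadata write, for inspection
meta_lock = threading.Lock()

def load_handler(addr):
    # Fast path: a plain metadata read, no synchronization.
    if metadata[addr] == "read":
        return
    # Slow path: this read must write metadata, so take the lock and
    # re-check, because a concurrent load handler may have written first.
    with meta_lock:
        if metadata[addr] != "read":
            metadata[addr] = "read"
            transitions.append(addr)

# Concurrent application reads have no dependence arcs between them,
# so eight lifeguard threads may run the load handler for A at once.
threads = [threading.Thread(target=load_handler, args=("A",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Only the first handler to reach the slow path performs the write; later handlers take the fast path once the state has settled.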

Parallel Hardware Accelerators

• Speed-up frequent lifeguard actions
– Metadata-TLB: fast metadata address calculation
– Idempotent Filters: filter out redundant checking
– Inheritance Tracking: fast tracking of dataflow paths

• Accelerators have only local view of the analysis
– Cache analysis information locally (e.g., frequent events)
– Important events have application-wide effects (e.g., free())
– Coherence-like issues with accelerators’ local state

• Important events accompanied by Conflict Alerts
– Use Conflict Alerts to flush accelerators’ state
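The flush-on-alert idea can be sketched with a software stand-in for an Idempotent Filter. The class, its method names, and the two-core setup are hypothetical; the real filter is a hardware structure, and this sketch only models its visible behavior for an AddrCheck-style check.

```python
class IdempotentFilter:
    """Per-core filter that drops checks already performed (sketch)."""
    def __init__(self):
        self._checked = set()

    def admit(self, addr):
        """Return True if the event must be delivered to the lifeguard."""
        if addr in self._checked:
            return False        # redundant: same check, same result
        self._checked.add(addr)
        return True

    def flush(self):
        self._checked.clear()

filters = {0: IdempotentFilter(), 1: IdempotentFilter()}

def conflict_alert():
    """Hypothetical alert handler: flush local and remote filters."""
    for f in filters.values():
        f.flush()

delivered = [filters[1].admit("A"), filters[1].admit("A")]  # second is filtered
conflict_alert()                          # e.g. core 0 handled free(A)
delivered.append(filters[1].admit("A"))   # delivered again after the flush
```

Without the flush, core 1 would keep discarding accesses to A as redundant even though free(A) on another core changed what the check should report.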

Outline

• Introduction

• Addressing the Challenges of Parallel Monitoring
– Capturing & enforcing application event ordering
– Ensuring metadata access atomicity
– Parallelizing hardware accelerators

• Evaluation

• Conclusions

Experimental Framework

• Log-Based Architectures framework
– Simics full-system simulation
– CMP system with {2, 4, 8, 16} cores
– {1, 2, 4, 8} application threads and as many lifeguard threads
– Sequentially Consistent memory model

• Benchmarks and multithreaded Lifeguards used
– SPLASH-2 and PARSEC
– TaintCheck: Information flow tracking; accelerated by M-TLB, IT
– AddrCheck: Memory access checking; accelerated by M-TLB, IF

• Comparison with Timesliced Monitoring

Performance Results: AddrCheck

[Figure: AddrCheck slowdown, normalized to sequential, unmonitored execution; 8 app/lifeguard threads, 16 cores total. Bar labels: 2.3, 6.1, 6.7, 1.7, 1.9, 2.9, 9.5, 15.4, 2.1, 6.2, 1.9, 2.4]

• Timesliced Monitoring is not scalable
– On average 15x slowdown over No Monitoring (8 threads)

• Highest overhead with 8 threads: SWAPTIONS 6x
• Lowest overhead with 8 threads: < 5%
• Average overhead with 8 threads: 26%

Performance Results: TaintCheck

[Figure: TaintCheck slowdown. Bar labels: 2.1, 11.5, 12.9, 1.9, 10, 1.7, 1.9, 2.9, 6.6, 4.6, 15.7, 2.4, 2.8, 1.7]

• Timesliced Monitoring is not scalable
– On average 23x slowdown over No Monitoring (8 threads)

• Highest overhead with 8 threads: BARNES 2.6x
• Lowest overhead with 8 threads: LU 5%
• Average overhead with 8 threads: 48%

Other Results in the Paper

• Order capturing and order enforcing under TSO

• Performance Impact of Lifeguard Accelerators
– AddrCheck: [1.13x – 3.4x], TaintCheck: [2x – 9x]

• A less expensive order capturing mechanism gets similar performance results
– 1 timestamp per core vs. 1 timestamp per cache block

Conclusions

• ParaLog: Fast and precise parallel monitoring

• Components of event ordering
– Normal memory accesses: monitor coherence activity
– Logical Races: use of Conflict Alert messages

• Metadata Atomicity
– Enforcing dependence arcs ensures atomicity (most cases)

• Parallel Hardware Accelerators
– Flush local state on remote events (Conflict Alert)

• Average overhead is relatively low
– AddrCheck: 26% and TaintCheck: 48% (8 threads)

Questions?

Backup Slides

Metadata Atomicity

• Synchronization-free fast path vs. slow path
– Concurrent application reads; no ordering available!

• Concurrent metadata reads: follow the fast path

• Concurrent metadata writes: follow the slow path, acquiring a lock

• Concurrent metadata read and write: read may get either value
– In any other case dependence arcs are available

[Table: Application Event vs. Lifeguard Action for AddrCheck, TaintCheck, MemCheck, and LockSet]

• Accelerators have only local view of the analysis
– Important events have system-wide effects
– Case study: Idempotent Filters and AddrCheck

[Figure: event streams of two threads passing through Idempotent Filters. Legend: ✔ delivered to lifeguard; ✖ redundant, discarded. On free(A), flush local and remote IF filters. Builds on Remote Conflict Messages]

• Details for parallel M-TLB and IT can be found in the paper

Performance Impact of Lifeguard Accelerators: TaintCheck

[Figure: performance impact of accelerators. Bar labels: 9.4, 6.8, 7.3, 11.3]

• Accelerators provide a major speedup [2x – 9x]

Performance Impact of Lifeguard Accelerators: AddrCheck

• Accelerators provide a major speedup [1.13x – 3.4x]

Transitive Reduction Sensitivity Study

• Limited transitive reduction
– No major performance impact; savings in chip area

Supporting Total Store Order (TSO)

• Cycle of dependencies in relaxed memory models
– TSO relaxes the write-to-read (W→R) ordering
– Previous work (RTR): maintain versions of data
– Identify SC offending instructions; save loaded value

• This paper: maintain versions of metadata

[Figure: Thread 0 commits Wr(A) then Rd(B); Thread 1 commits Wr(B) then Rd(A). The memory order and the logs are annotated with versions: Log 0 holds P(v1, A), C(v0, B) and Rd(B, v0); Log 1 holds P(v0, B), C(v1, A) and Rd(A, v1). Lifeguard 0 runs produce_version(v1, A), store_handler(A), wait_until_available(v0, B), load_handler(B, v0)]

• Speed-up frequent lifeguard actions
– Fast metadata address calculation: Metadata-TLB
– Fast tracking of data-flow paths: Inheritance Tracking
– Filter out redundant checking: Idempotent Filters
• Per-instruction checking gives the same result; cache event

• Accelerators have only local view of the analysis
– Important events have system-wide effects (e.g., free())
– Coherence-like issues with accelerators’ local state

• Important events accompanied by Conflict Alerts
– Use Conflict Alerts to flush state and deliver pending

Experimental Framework

Benchmarks      Input
barnes          16K bodies
ocean           Grid: 258 x 258
lu              Matrix: 1024 x 1024
fmm             32768 particles
radiosity       Base problem
blackscholes    simlarge
fluidanimate    simlarge
swaptions       simlarge

Simulation Parameters
Cores                 {2, 4, 8, 16}, 1 GHz, in-order scalar x86
L1I & L1D (private)   64KB, 64B line, 4-way assoc.
L2 (shared)           {1, 2, 4, 8}MB, 64B line, 8-way assoc., 6-cycle latency
Memory                90-cycle latency
Log Buffer            64KB per thread

Multithreaded Lifeguards
TaintCheck: Information flow tracking; accelerated by M-TLB and IT
AddrCheck: Memory access checking; accelerated by M-TLB and IF

Relative Slowdown - TaintCheck

[Figure: TaintCheck relative slowdown with 1, 2, 4, and 8 threads for BARNES, LU, OCEAN, BLACKSCHOLES, FLUIDANIMATE, SWAPTIONS, FMM, and RADIOSITY. Each bar is split into Useful Work, Waiting for Dependence, and Waiting for Application]

Relative Slowdown - AddrCheck

[Figure: AddrCheck relative slowdown with 1, 2, 4, and 8 threads for the same benchmarks and the same breakdown. Bar labels: 3.0, 6.0]


• Speed-up frequent lifeguard actions
– Metadata-TLB & Inheritance Tracking (discussed in the paper)
– Idempotent Filters: identify and filter out redundant checking
  • Per-instruction checking gives the same result
  • Cache incoming event and local state to identify redundancy

• Accelerators have only local view of the analysis
– Important events have application-wide effects (e.g., free())
– Coherence-like issues with accelerators’ local state

• Important events accompanied by Conflict Alerts
– Use Conflict Alerts to flush accelerators’ state
