The SPA Project: GOLF and ESP. Manuvir Das, Microsoft Research (joint work with Manuel Fahndrich, Jakob Rehof)

Post on 02-Feb-2016

Page 1: The SPA Project GOLF and ESP

The SPA Project

GOLF and ESP

Manuvir Das
Microsoft Research

(joint work with Manuel Fahndrich, Jakob Rehof)

Page 2: The SPA Project GOLF and ESP

SPA Group Mentor

Page 3: The SPA Project GOLF and ESP

Software Productivity Tools

• Jim Larus runs the group
• research.microsoft.com/spt
• SLAM, Vault, Behave, PipelineServer …
• Focus on software reliability

Page 4: The SPA Project GOLF and ESP

What’s wrong with analysis?

• A: We don’t write or look at real code

• B: We don’t solve real problems

Page 5: The SPA Project GOLF and ESP

Why does this happen?

• Analysis is a mix of theory and practice

• But
– math and theory are elegant
– experimentation needs infrastructure
– engineering is boring

Page 6: The SPA Project GOLF and ESP

Today we’ll talk about …

• Doing analysis research the right way

• My day job
– Slicing and Partial Evaluation
– Pointer analysis
– Error detection

Page 7: The SPA Project GOLF and ESP

Slicing and Partial Evaluation

• PE: Which computations depend only on known inputs? Do these early.

• Or, which computations may depend on unknown inputs? Don’t do these early.

• Insight: If a computation depends on unknown input, there must be an unknown input in its slice.

Page 8: The SPA Project GOLF and ESP

Forward slicing and BTA

• Binding-time analysis
– identify static computations

• BTA via slicing
– mark all unknown input nodes
– forward slice from the marked nodes and mark every node in the slice
– all unmarked nodes are static computations

Page 9: The SPA Project GOLF and ESP

Why is this interesting?

• Slicing incorporates control dependence
• Previous work used reaching definitions

read(y);
x = 0;
while (y != 0) { y--; x++; }
z = x;

• We can now prove correctness
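The effect of control dependence on the loop above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dependence edges are written by hand, the names `DATA_DEPS`, `CONTROL_DEPS`, `forward_slice` and `static_statements` are hypothetical, and data dependences alone only roughly stand in for the reaching-definitions approach mentioned on the slide.

```python
from collections import deque

STATEMENTS = ["read(y)", "x = 0", "while (y != 0)", "y--", "x++", "z = x"]

# Hand-built dependence edges for the loop (an assumption of this sketch, not
# the output of a real dependence analysis). Each key maps to its dependents.
DATA_DEPS = {
    "read(y)": ["while (y != 0)", "y--"],
    "y--": ["while (y != 0)", "y--"],
    "x = 0": ["x++", "z = x"],
    "x++": ["x++", "z = x"],
}
CONTROL_DEPS = {
    "while (y != 0)": ["y--", "x++"],
}


def forward_slice(seeds, *edge_maps):
    """Return every statement reachable from the seeds along dependence edges."""
    marked = set(seeds)
    worklist = deque(seeds)
    while worklist:
        node = worklist.popleft()
        for edges in edge_maps:
            for succ in edges.get(node, []):
                if succ not in marked:
                    marked.add(succ)
                    worklist.append(succ)
    return marked


def static_statements(unknown_inputs, *edge_maps):
    """Binding-time analysis: whatever the forward slice misses is static."""
    dynamic = forward_slice(unknown_inputs, *edge_maps)
    return [s for s in STATEMENTS if s not in dynamic]


data_only = static_statements(["read(y)"], DATA_DEPS)
with_control = static_statements(["read(y)"], DATA_DEPS, CONTROL_DEPS)
print(data_only)     # x++ and z = x are treated as static, which is unsafe
print(with_control)  # only x = 0 remains static
```

With data dependences only, `x++` and `z = x` look static even though the number of loop iterations depends on the unknown input `y`; adding the control dependence edge marks them as dynamic.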


Page 10: The SPA Project GOLF and ESP

This project had flaws …

• A: We don’t write or look at real code
– cubic algorithm, ran on 2k lines in 30 minutes
– only one benchmark (ray tracer)

• B: We don’t solve real problems
– who uses PE in practice?
– was the lack of safety critical?
– why not use a timer?

Page 11: The SPA Project GOLF and ESP

Then I visited MSR …

• Daniel Weise – 1.5 million lines of real code

• Real problems – software reliability

• I was hooked!

– find buffer overflows using static analysis
– oops, need pointer analysis

Page 12: The SPA Project GOLF and ESP

Papers don’t tell the whole truth!

• Implemented Ste96, engineered it
– lightning fast, but poor results

• Lots of papers on how to improve
– structures, signatures, SH97

• Tried it all, nothing worked on real code
• Needed Andersen (subtyping) on real code

Page 13: The SPA Project GOLF and ESP

Frameworks are good

• A spectrum from Ste96 to And94
– DGC POPL 98 : unification vs flow
– SH POPL 97 : buckets within ECRs

• Frameworks
– give us a way of tuning precision vs efficiency
– help us understand the problem

Page 14: The SPA Project GOLF and ESP

Frameworks are bad

• The real issue: how do you find the best trade-off point in a principled manner?

• What if the parameter being varied is not the key concept?
– CFA varies control depth rather than data
– SH 97 picks random categories
– DGC 98 alters the behaviour of the same statement

Page 15: The SPA Project GOLF and ESP

Back to pointer analysis …

• No way to run Andersen on MLOC

Page 16: The SPA Project GOLF and ESP

So, I hid in my office …

• Stared at SPEC code, wrote perl scripts
– every feature is used
– code is idiomatic
– pointers are never assigned, except heap
– most pointers arise through parameter passing
– some code is just too hard for any analysis

• Result: new algorithm driven by real code

Page 17: The SPA Project GOLF and ESP

Pointer Analysis Landscape

FICI: Flow-insensitive, Context-insensitive

FICS: Flow-insensitive, Context-sensitive

FSCI: Flow-sensitive, Context-insensitive

FSCS: Flow-sensitive, Context-sensitive

[Figure: the four classes arranged along axes labelled Precision and Cost]

Page 18: The SPA Project GOLF and ESP

FICI Pointer Analysis

[Figure: FICI analyses plotted from Imprecise to Precise and from Cheap to Expensive]

Steensgaard (almost linear)

Andersen (cubic)

One level flow (quadratic)

500 KLOC in several minutes, 2 GB

1.5 MLOC in 1 minute, 100 MB

Page 19: The SPA Project GOLF and ESP

Andersen’s Algorithm

p = &q;

p = q;

[Figure: points-to graphs for the two statements, over nodes p, q, r1, r2, r3]

Page 20: The SPA Project GOLF and ESP

Andersen’s Algorithm

p = *q;

*p = q;

[Figure: points-to graphs for the two statements, over nodes p, q, r1, r2, s1, s2, s3]
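The four statement forms on these two slides can be read as inclusion constraints on points-to sets. The following is a minimal sketch of a naive fixed-point solver for them, assuming a hypothetical tuple encoding of statements; it is not tuned in any way and says nothing about how the talk's implementation was engineered.

```python
from collections import defaultdict


def andersen(constraints):
    """Naive fixed-point solver for inclusion-based points-to analysis.

    Each constraint is (kind, lhs, rhs), a hypothetical encoding:
      ("addr",  p, q)  for  p = &q
      ("copy",  p, q)  for  p = q
      ("load",  p, q)  for  p = *q
      ("store", p, q)  for  *p = q
    """
    pts = defaultdict(set)
    changed = True

    def include(target, values):
        # Grow pts[target] and remember that another pass is needed.
        nonlocal changed
        if not values <= pts[target]:
            pts[target] |= values
            changed = True

    while changed:
        changed = False
        for kind, lhs, rhs in constraints:
            if kind == "addr":
                include(lhs, {rhs})
            elif kind == "copy":
                include(lhs, pts[rhs])
            elif kind == "load":
                for loc in list(pts[rhs]):
                    include(lhs, pts[loc])
            elif kind == "store":
                for loc in list(pts[lhs]):
                    include(loc, pts[rhs])
    return {var: targets for var, targets in pts.items() if targets}


# p = &a; q = &b; *p = q; r = *p; t = p;
program = [
    ("addr", "p", "a"),
    ("addr", "q", "b"),
    ("store", "p", "q"),
    ("load", "r", "p"),
    ("copy", "t", "p"),
]
result = andersen(program)
print(result)
```

Because sets only grow and each assignment is directional, `t = p` gives `t` the targets of `p` without changing `p`.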

Page 21: The SPA Project GOLF and ESP

Steensgaard’s Algorithm

p = q;

[Figure: three graph panels over nodes p and q]

Page 22: The SPA Project GOLF and ESP

Motivation for One Level Flow

foo(&s1);
foo(&s2);
bar(&s3);

foo(struct s *p) { p->a = 3; bar(p); }
bar(struct s *q) { q->b = 4; }

Page 23: The SPA Project GOLF and ESP

Simplified Example

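p = &s1;
p = &s2;
q = &s3;
q = p;
p->a = 3;
q->b = 4;

[Figure: two points-to graphs for p and q. In one, both point to a single merged node s1,s2,s3; in the other, s1, s2 and s3 are separate nodes.]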
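The merged node in the figure is what a unification-based analysis produces on this example. The sketch below is a simplified union-find version of that idea, assuming only address-of and copy statements; the class and method names are hypothetical, and the full algorithm in Ste96 handles more statement forms.

```python
class Unifier:
    """Simplified unification-based points-to analysis (a sketch)."""

    def __init__(self):
        self.parent = {}   # union-find parent links
        self.pointee = {}  # class representative -> node it points to
        self.fresh = 0

    def find(self, n):
        self.parent.setdefault(n, n)
        while self.parent[n] != n:
            self.parent[n] = self.parent[self.parent[n]]  # path halving
            n = self.parent[n]
        return n

    def target(self, n):
        """Representative of the class that n points to, created on demand."""
        r = self.find(n)
        if r not in self.pointee:
            self.fresh += 1
            self.pointee[r] = ("tmp", self.fresh)
        return self.find(self.pointee[r])

    def unify(self, a, b):
        a, b = self.find(a), self.find(b)
        if a == b:
            return
        ta, tb = self.pointee.pop(a, None), self.pointee.pop(b, None)
        self.parent[a] = b
        if ta is not None and tb is not None:
            self.pointee[b] = ta
            self.unify(ta, tb)  # merged classes must share one pointee
        elif ta is not None or tb is not None:
            self.pointee[b] = ta if ta is not None else tb

    def address_of(self, p, x):  # p = &x
        self.unify(self.target(p), x)

    def assign(self, p, q):      # p = q
        self.unify(self.target(p), self.target(q))

    def points_to(self, p, locations):
        t = self.target(p)
        return {loc for loc in locations if self.find(loc) == t}


u = Unifier()
u.address_of("p", "s1")
u.address_of("p", "s2")
u.address_of("q", "s3")
u.assign("q", "p")
locations = ["s1", "s2", "s3"]
print(u.points_to("p", locations), u.points_to("q", locations))
```

Because `q = p` merges the targets of both pointers, `p` is reported as possibly pointing to `s3` even though no statement makes it do so.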

Page 24: The SPA Project GOLF and ESP

One Level Flow

p = q;

[Figure: two graph panels over nodes p and q]

Page 25: The SPA Project GOLF and ESP

Simplified Example

p = &s1;
p = &s2;
q = &s3;
q = p;
p->a = 3;
q->b = 4;

[Figure: One Level Flow graph built step by step over nodes p, q, s1, s2, s3]

Page 26: The SPA Project GOLF and ESP

OLF: Simple Reachability

Single query: Linear

All queries: Quadratic
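The reachability view can be sketched on the simplified example (p = &s1; p = &s2; q = &s3; q = p). This is a deliberately reduced model, assuming only top-level address-of and copy statements: the unification that One Level Flow performs below the top level is omitted because the example has no second-level pointers, and the function names are hypothetical.

```python
from collections import defaultdict


def build_flow_graph(statements):
    """Record address-of labels and directed flow edges between pointers."""
    labels = defaultdict(set)  # pointer -> locations taken by p = &x
    preds = defaultdict(set)   # pointer -> pointers whose targets flow into it
    for kind, lhs, rhs in statements:
        if kind == "addr":
            labels[lhs].add(rhs)
        elif kind == "copy":
            preds[lhs].add(rhs)
    return labels, preds


def points_to(var, labels, preds):
    """Answer one query by walking flow edges backwards from var."""
    seen, stack, result = {var}, [var], set()
    while stack:
        node = stack.pop()
        result |= labels[node]
        for pred in preds[node]:
            if pred not in seen:
                seen.add(pred)
                stack.append(pred)
    return result


statements = [
    ("addr", "p", "s1"),
    ("addr", "p", "s2"),
    ("addr", "q", "s3"),
    ("copy", "q", "p"),
]
labels, preds = build_flow_graph(statements)
print(points_to("p", labels, preds))
print(points_to("q", labels, preds))
```

Each query is one graph traversal, so it is linear in the size of the graph; repeating it for every pointer gives the quadratic bound for all queries. Unlike the unified result, `p` is not reported as pointing to `s3`.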

Page 27: The SPA Project GOLF and ESP

OLF: Cached Reachability

[Figure: flow graph with nodes x and y and a node labelled MAX]

MS Word: from 1 hour to 30 seconds for all queries

Page 28: The SPA Project GOLF and ESP

Running time (seconds)

Benchmark   LOC         AST nodes   Ste96    One level flow
compress    1,400       2,000       0.03     0.05
li          5,800       23,000      0.43     0.67
m88ksim     13,600      66,000      0.79     1.22
ijpeg       20,700      79,000      0.97     1.51
go          26,800      109,000     0.89     1.42
perl        23,700      116,000     1.21     2.1
vortex      50,600      200,000     3.35     5.66
gcc         148,000     604,000     5.70     9.45
word97      1,440,000   5,961,000   61.34    126.83

Page 29: The SPA Project GOLF and ESP

Average sizes of points-to sets

Benchmark   Ste96      And94    One level flow
compress    2.1        1.22     1.22
li          287.7      185.62   185.62
m88ksim     86.3       3.19     3.29
ijpeg       17.0       11.76    11.78
go          45.2       14.79    14.79
perl        36.1       22.22    22.22
vortex      1,064.5    45.54    59.3
gcc         245.8      7.71     7.72
word97      27,258.6   ?        11,219.5

Page 30: The SPA Project GOLF and ESP

This project had flaws too …

• B: We don’t solve problems
– solved an open problem in pointer analysis

• But
– never got around to buffer overflow
– didn’t use PTA for optimization
– addressed these issues later, but
– should have been driven by the problem

Page 31: The SPA Project GOLF and ESP

Since then …

• Others have made And94 fast
– Heintze PLDI 01
– suggested by OLF results

• But what about context-sensitivity?
– crucial for value flow analysis

• GOLF (DLFR SAS 01)
– combines OLF and one level of instantiation constraints (Rehof’s lecture)
– context-sensitive value flow on MLOC

Page 32: The SPA Project GOLF and ESP

OLF: Call Example

r = &x;
p = r;
r = &y;
q = r;
*p = 3;

id(r) { return r; }
p = id(&x);
q = id(&y);
*p = 3;

Page 33: The SPA Project GOLF and ESP

OLF: Call Example

r = &x;
p = r;
r = &y;
q = r;
*p = 3;

[Figure: OLF graph over nodes r, *r, p, *p, q, *q, x, y]

Page 34: The SPA Project GOLF and ESP

GOLF: Call Example

id(r) { return r; }
p = id(&x);
q = id(&y);
*p = 3;

[Figure: GOLF graph over nodes r, *r, p, *p, q, *q, x, y, with ( ) and [ ] edge labels]

Page 35: The SPA Project GOLF and ESP

We have an analysis that is …

• fast enough to run on MLOC
• good enough for static optimization
– who cares; leave it to the chip makers!

• not good enough for dynamic optimization (MDCE PASTE 01)

• not good enough to track interesting correctness properties in real code

Page 36: The SPA Project GOLF and ESP

Correctness: the killer app

• Hardware can
– speed up programs
– enforce correctness at run-time

• Hardware cannot
– enforce correctness before product is shipped

• Testers can
– find errors on some paths

• Testers cannot
– find errors on all paths

• So, use static analysis to find errors

Page 37: The SPA Project GOLF and ESP

ESP Vision

• Error Detection via Scalable Program Analysis

• Must be driven by real code
• Must be sound (report all errors)
• Must report few false positives

• Use knowledge of tradeoffs in analysis
• Let user help the analysis

Page 38: The SPA Project GOLF and ESP

Step 1: Identify the problem

• Solve a realistic problem:
– partial correctness
– user specified, finite-state properties

• Solve a non-trivial problem:
– don’t check uninits, NULL pointers
– check locking protocols, resource usage

Page 39: The SPA Project GOLF and ESP

Parameterized Protocol Tracking

• User specified
– FSM with parameterized actions
– patterns

• Rest is automatic

C code pattern          Transition
KeAcquireSpinLock(l)    Lock(l)
KeReleaseSpinLock(l)    Unlock(l)
return e                Ret

[Figure: FSM with states INIT(l), LOCKED(l), ERROR(l) and edges labelled Lock(l), Unlock(l), Ret]
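A protocol of this shape can be written down as a pattern table plus a transition table. The sketch below is illustrative only: the exact edges of the slide's FSM are not recoverable from the transcript, so the Unlock-in-INIT and Ret rows are assumptions, and the names `PATTERNS`, `TRANSITIONS`, `step` and `track` are hypothetical.

```python
import re

# Pattern table from the slide: C code pattern -> FSM action.
PATTERNS = [
    (re.compile(r"KeAcquireSpinLock\((\w+)\)"), "Lock"),
    (re.compile(r"KeReleaseSpinLock\((\w+)\)"), "Unlock"),
    (re.compile(r"return\b"), "Ret"),
]

# The two Lock rows match the slides' summary for a lock acquisition.
# The Unlock-in-INIT and Ret rows are assumptions of this sketch.
TRANSITIONS = {
    ("INIT", "Lock"): "LOCKED",
    ("LOCKED", "Lock"): "ERROR",
    ("LOCKED", "Unlock"): "INIT",
    ("INIT", "Unlock"): "ERROR",
    ("INIT", "Ret"): "INIT",
    ("LOCKED", "Ret"): "ERROR",
}


def step(state, action):
    """ERROR is a sink; every other move comes from the table."""
    return "ERROR" if state == "ERROR" else TRANSITIONS[(state, action)]


def track(lines):
    """Run one FSM instance per lock along a single straight-line path."""
    states = {}
    for line in lines:
        for pattern, action in PATTERNS:
            match = pattern.search(line)
            if not match:
                continue
            if action == "Ret":
                # Ret carries no parameter, so it applies to every tracked lock.
                for lock in states:
                    states[lock] = step(states[lock], "Ret")
            else:
                lock = match.group(1)
                states[lock] = step(states.get(lock, "INIT"), action)
    return states


balanced = track(["KeAcquireSpinLock(l);", "KeReleaseSpinLock(l);", "return OK;"])
leaked = track(["KeAcquireSpinLock(l);", "return OK;"])
print(balanced, leaked)
```

This tracks a single path only; the rest of the talk is about doing the same across all paths and across function calls.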

Page 40: The SPA Project GOLF and ESP

Step 2: Examine real code

• Find common idioms
• Understand level of precision needed

• Windows device drivers
– mostly control dominated protocols
– global data flow needs CS, but not FS/PS
– path feasibility seems to matter

Page 41: The SPA Project GOLF and ESP

Sample driver code

STATUS Initialize(Object o)
{
    Object p = o;

    if (p->needLock) KeAcquireSpinLock(p);

    p->data = 0;

    if (p->needLock) KeReleaseSpinLock(p);

    return OK;
}

Page 42: The SPA Project GOLF and ESP

Step 3: Break up the problem

• Three distinct entities to be tracked
– the temporal sequence of actions along a particular control flow path
– the data involved in the actions
– the data involved in path feasibility

• Can use different levels of static analysis to track each entity

Page 43: The SPA Project GOLF and ESP

Data analysis vs control analysis

• RHS 95: Cost is O(ED^3). What is D?
– dataflow: D is generally related to program size
– program size grows because of pointers, globals

• What if there is only a single global FSM?
– D is just the #states in the FSM!

• Control is cheap, data is expensive
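As a rough worked instance of the bound, take the three-state lock FSM (INIT, LOCKED, ERROR) as the only tracked fact, so D = 3. The second line uses a purely hypothetical D = 1000 to stand for a data-flow domain that grows with program size; neither figure is a measurement from the talk.

```latex
\[
\text{cost} = O(E \cdot D^{3}), \qquad
D = 3 \;\Rightarrow\; E \cdot D^{3} = 27\,E
\]
\[
D = 1000 \;\Rightarrow\; E \cdot D^{3} = 10^{9}\,E
\]
```

The number of edges E is the same in both cases; only the domain size changes the constant.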

Page 44: The SPA Project GOLF and ESP

Step 4: Design static analyses

• track the temporal sequence of actions along a particular control flow path
– cannot use flow-insensitive analysis
– RHS95 is too expensive

• eliminate the data involved in the actions
– use GOLF value flow

• now we have a control property, use RHS95

• both analyses are context-sensitive

Page 45: The SPA Project GOLF and ESP

Data elimination

STATUS Initialize(Object o)
{
    Object p = o;

    if (p->needLock) KeAcquireSpinLock(p);

    p->data = 0;

    if (p->needLock) KeReleaseSpinLock(p);

    return OK;
}

Page 46: The SPA Project GOLF and ESP

Data elimination

Initialize()
{
    if (*) Lock;
    if (*) Unlock;
}

[Figure: FSM states I, L, E]
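Once the data is eliminated, the abstracted procedure is a pure control problem: which FSM states are reachable when every `if (*)` may go either way. A minimal sketch follows, assuming a hypothetical list encoding of the abstracted body and the same assumed transition table as before (in particular, Unlock in INIT is taken to be an error).

```python
# Assumed lock FSM; the Unlock-in-INIT row is an assumption of this sketch.
TRANSITIONS = {
    ("INIT", "Lock"): "LOCKED",
    ("LOCKED", "Lock"): "ERROR",
    ("LOCKED", "Unlock"): "INIT",
    ("INIT", "Unlock"): "ERROR",
}


def step(state, action):
    return "ERROR" if state == "ERROR" else TRANSITIONS[(state, action)]


def reachable_states(program, start="INIT"):
    """Propagate a set of FSM states through a straight-line abstract program.

    Each entry is (guard, action); guard "*" means the action may be skipped.
    """
    states = {start}
    for guard, action in program:
        taken = {step(s, action) for s in states}
        states = taken | states if guard == "*" else taken
    return states


# Initialize() { if (*) Lock; if (*) Unlock; }
ABSTRACT_INITIALIZE = [("*", "Lock"), ("*", "Unlock")]
final_states = reachable_states(ABSTRACT_INITIALIZE)
print(sorted(final_states))
```

All three states are reachable, including the error state, because the two branches are treated as unrelated. Ruling that out requires knowing that both tests read the same value, which is the path feasibility question taken up later in the talk.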

Page 47: The SPA Project GOLF and ESP

Do we need context-sensitivity?

• What if GOLF cannot provide MUST info?

void Initialize(Object o1, Object o2)
{
    LockWrapper(o1);
    LockWrapper(o2);

    KeReleaseSpinLock(o1);
    KeReleaseSpinLock(o2);
}

void LockWrapper(Object p)
{
    KeAcquireSpinLock(p);
}

Page 48: The SPA Project GOLF and ESP

Interface nodes

• Limit scope of value flow to interface nodes
• Produce RHS summaries for interface nodes

void LockWrapper(Object p) { KeAcquireSpinLock(p); }

p: INIT -> LOCKED, LOCKED -> ERROR

• Copy summaries to callers

Page 49: The SPA Project GOLF and ESP

Back to our example …

void Initialize(Object o1, Object o2)
{
    i: LockWrapper(o1);
    j: LockWrapper(o2);

    KeReleaseSpinLock(o1);
    KeReleaseSpinLock(o2);
}

void LockWrapper(Object p)
{
    KeAcquireSpinLock(p);
}

[Figure: value flow graph with o1 and o2 flowing to p along edges labelled with call sites i and j]
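Copying a summary to its callers can be sketched as a table lookup per call site. The summary for `p` is the one given on the interface-nodes slide; the ERROR row, the release table and every function name here are assumptions added for illustration, and the binding of `p` to `o1` or `o2` stands in for what the value flow graph provides at sites i and j.

```python
# Summary for LockWrapper's formal p, as on the slide, plus an ERROR sink row.
LOCK_WRAPPER_SUMMARY = {"p": {"INIT": "LOCKED", "LOCKED": "ERROR", "ERROR": "ERROR"}}

# Effect of KeReleaseSpinLock; the INIT row is an assumption of this sketch.
RELEASE_EFFECT = {"LOCKED": "INIT", "INIT": "ERROR", "ERROR": "ERROR"}


def apply_summary(states, summary, binding):
    """Apply a callee summary at one call site, renaming formals to actuals."""
    updated = dict(states)
    for formal, effect in summary.items():
        actual = binding[formal]
        updated[actual] = effect[states.get(actual, "INIT")]
    return updated


def release(states, lock):
    updated = dict(states)
    updated[lock] = RELEASE_EFFECT[states.get(lock, "INIT")]
    return updated


def analyze_initialize():
    states = {}
    states = apply_summary(states, LOCK_WRAPPER_SUMMARY, {"p": "o1"})  # site i
    states = apply_summary(states, LOCK_WRAPPER_SUMMARY, {"p": "o2"})  # site j
    after_calls = dict(states)
    states = release(states, "o1")
    states = release(states, "o2")
    return after_calls, states


after_calls, final = analyze_initialize()
print(after_calls, final)
```

Because each call site binds `p` to its own actual, the two locks are tracked separately and both end in INIT; if `p` were treated as one object for both calls, the second acquisition would look like a double lock.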

Page 50: The SPA Project GOLF and ESP

Consider the abstraction!

• ESP makes an upfront abstraction
– interface nodes in the GOLF graph
– Plus: linear size, controls overall cost
– Minus: may be too coarse

• SLAM allows tuning of abstraction
– but now we are back in the framework game

Page 51: The SPA Project GOLF and ESP

Path sensitivity

• PSCS is too expensive (need to track data)
– function calls
– loops
– sequential, unrelated diamonds

• Function calls
– use dataflow summaries
– can only track local correlations

• Loops and diamonds
– use abstract simulation

Page 52: The SPA Project GOLF and ESP

Abstract simulation

• Split simulator state into concrete state + FSM state

• At join points, merge simulator states with identical FSM states

• Extended constant-prop lattice of concrete states per FSM state

• Polynomial bound, better than dataflow
• Handles common case efficiently
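The merge rule can be sketched on the sample driver code, reduced to its two `needLock` tests. This is a simplified model under stated assumptions: the concrete state is a dictionary of known boolean values (a variable missing from it is unknown), the program encoding and all names are hypothetical, and the transition table is the same assumed lock FSM used for illustration elsewhere.

```python
# Assumed lock FSM; the Unlock-in-INIT and Ret rows are assumptions.
TRANSITIONS = {
    ("INIT", "Lock"): "LOCKED", ("LOCKED", "Lock"): "ERROR",
    ("LOCKED", "Unlock"): "INIT", ("INIT", "Unlock"): "ERROR",
    ("INIT", "Ret"): "INIT", ("LOCKED", "Ret"): "ERROR",
}


def step(state, action):
    return "ERROR" if state == "ERROR" else TRANSITIONS[(state, action)]


def join_env(a, b):
    """Constant-propagation style join: keep only facts both sides agree on."""
    return {k: v for k, v in a.items() if k in b and b[k] == v}


def merge(pairs):
    """At a join point, merge simulator states that share an FSM state."""
    merged = {}
    for fsm, env in pairs:
        merged[fsm] = join_env(merged[fsm], env) if fsm in merged else dict(env)
    return merged


def simulate(program):
    states = {"INIT": {}}  # FSM state -> concrete state
    for stmt in program:
        out = []
        for fsm, env in states.items():
            if stmt[0] == "act":
                out.append((step(fsm, stmt[1]), env))
            else:  # ("if", variable, action)
                _, var, action = stmt
                # A known value selects one branch; an unknown value splits.
                values = [env[var]] if var in env else [True, False]
                for value in values:
                    new_fsm = step(fsm, action) if value else fsm
                    out.append((new_fsm, {**env, var: value}))
        states = merge(out)
    return states


DRIVER = [("if", "needLock", "Lock"), ("if", "needLock", "Unlock"), ("act", "Ret")]
result = simulate(DRIVER)
print(result)
```

After the first test the two simulator states differ in FSM state, so they stay separate and each remembers its value of `needLock`; after the second test both are in INIT and are merged, dropping the value. The error state is never reached, and at most one simulator state per FSM state survives each join.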

Page 53: The SPA Project GOLF and ESP

ESP Analysis

• Three increasingly precise phases of static analysis
– Phase I : global FICS value flow analysis
  • use GOLF to build a call graph and answer value flow questions
– Phase II : global FSCS protocol tracking
  • use RHS95 combined with polymorphic value flow to track protocols attached to data
– Phase III : local PS feasibility analysis
  • use local abstract simulation + summaries from Phase II to find infeasible paths

Page 54: The SPA Project GOLF and ESP

Step 5: Answer two questions

• Will this be precise enough?
– manual inspection of drivers
– some level of false positives is OK

• Will this scale?
– PI : Yes. DLFR SAS 01 on 1.5MLOC
– PII : Yes! Use RHS95, control vs data
– PIII : Yes. Local, abstract simulation

Page 55: The SPA Project GOLF and ESP

Step 6: Commit to the full process

• Analysis is only 10% of error detection
• Collaborate with others
– PREFix team
– SQL Server team

• Be willing to work on deployment
• Do the dirty work

Page 56: The SPA Project GOLF and ESP

Step 0: Find good interns!

Page 57: The SPA Project GOLF and ESP

Final word on ESP

• Related work
– PREFix/PREFast
– SLAM
– ESC
– Metal

• Problem driven analysis research
• All about scaling and real code

Page 58: The SPA Project GOLF and ESP

My view of program analysis

• Static analysis is all about a tradeoff
– efficiency vs precision

• Tradeoff along several dimensions
– path feasibility : FI, FS, PS
– path validity : CI, CS
– level of abstraction : AI

• We’ve studied the theory …

Page 59: The SPA Project GOLF and ESP

… you need to be engineers

• Engineering is not menial labour
• Engineers can write papers
• Engineers produce real tools

• Engineers understand that the space of programs is limited in practice

Page 60: The SPA Project GOLF and ESP

… you need to solve problems

• Make connections between research areas
• Don’t be intimidated by the literature
• Static analysis is a means to an end

• Focus on software reliability
• Correctness is wide open
– plenty of opportunity
– critical problem
– needs good people