Using System-Specific Compiler Extensions to Find Errors in Systems Code

Dawson Engler, Ben Chelf, Andy Chou, Seth Hallem

Stanford University

Talk about how to find hundreds of errors by putting just a little bit of system-specific knowledge into a compiler.

Checking systems software

Systems software has many ad-hoc restrictions:
– “acquire lock L before accessing shared variable X”
– “do not allocate large variables on 6K kernel stack”

Error = crashed system. How to find errors?
– Formal verification
  » + rigorous
  » - costly + expensive. *Very* rare to do for software
– Testing
  » + simple, few false positives
  » - requires running code: doesn’t scale & can be impractical
– Manual inspection
  » + flexible
  » - erratic & doesn’t scale well
– What to do??

Never do X (do not use floating point, allocate large vars on 6K kernel stack) Always do X before/after Y (acquire lock before use, release after use)


Practically intractable: far too strenuous, and even if you do, spec isn’t code.

Method of choice if you build systems for money: test. Problem: O(paths) = exponential in length of code. So if you build systems, what you wind up with is a system that only crashes after a week. Further, mapping crash back to cause can be *really* hard.

Inspection is dead: modify code, have to do it again.

Another approach

Observation: rules can be checked with a compiler
– scan source for “relevant” acts, check if they make sense
– E.g., to check “disabled interrupts must be re-enabled”: scan for calls to disable()/enable(), check that they match, not done twice

Main problem:
– compiler has machinery to automatically check, but not knowledge
– implementor has knowledge but not machinery

Meta-level compilation (MC):
– give implementors a framework to add easily-written, system-specific compiler extensions



Want to make compilers aggressively system specific: if you design a system or interface and see a use of it, you invariably see ways of doing it better: give you way to articulate this knowledge and have compiler do it for you automatically

Meta-level compilation (MC)

Implementation:
– Extensions dynamically linked into GNU g++ compiler
– Applied down all paths in input program source
– E.g. 64-line extension to check disable/enable (82 bugs)

Static detection of real errors in real systems:
– 600+ bugs in Linux, OpenBSD, FLASH, Xok exokernel
– most extensions < 100 lines, written by system outsiders

Linux: raid5.c
    save(flags);
    cli();
    if(!(buf = alloc()))
        return NULL;
    restore(flags);
    return buf;

[Figure: raid5.c is fed to the GNU C++ compiler with the interrupt checker loaded, which reports “did not re-enable interrupts!”]

Actual error in Linux: raid5 driver disables interrupts, and then if it fails to allocate buffer, returns with them disabled. This kernel deadlock is actually hidden by an immediate segmentation fault since the callers dereference the pointer without checking for NULL

The main results: it works really well, and it’s easy.

A bit more detail

    { #include "linux-includes.h" }
    sm chk_interrupts {
        decl { unsigned } flags;

        // named patterns
        pat enable = { sti(); } | { restore_flags(flags); };
        pat disable = { cli(); };

        // states
        is_enabled:
            disable ==> is_disabled
          | enable ==> { err("double enable"); }
          ;
        is_disabled:
            enable ==> is_enabled
          | disable ==> { err("double disable"); }
          | $end_of_path$ ==> { err("exiting w/intr disabled!"); }
          ;
    }

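[Figure: state machine for the interrupt checker. States: is_enabled (initial), is_disabled, error. disable moves is_enabled to is_disabled; enable moves is_disabled back to is_enabled; a second enable, a second disable, or end-of-path while disabled leads to error.]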
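The state machine above can be sketched in plain C, assuming the checker has already reduced one program path to a sequence of disable/enable events. The names check_path, EV_DISABLE, EV_ENABLE and the verdict values are made up for illustration; this is not how the MC system itself is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Events seen along one path (hypothetical encoding). */
enum event { EV_DISABLE, EV_ENABLE };

enum verdict { PATH_OK, DOUBLE_ENABLE, DOUBLE_DISABLE, EXIT_DISABLED };

/* Walk one path's events through the two-state machine. */
enum verdict check_path(const enum event *path, size_t n)
{
    int enabled = 1;                          /* initial state: is_enabled */
    for (size_t i = 0; i < n; i++) {
        if (path[i] == EV_DISABLE) {
            if (!enabled)
                return DOUBLE_DISABLE;
            enabled = 0;                      /* is_enabled -> is_disabled */
        } else {
            if (enabled)
                return DOUBLE_ENABLE;
            enabled = 1;                      /* is_disabled -> is_enabled */
        }
    }
    /* end-of-path rule: leaving with interrupts off is an error */
    return enabled ? PATH_OK : EXIT_DISABLED;
}
```

The raid5.c failure path from earlier is the one-event sequence { EV_DISABLE }: the function returns before restore(flags), so the sketch reports EXIT_DISABLED.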

“X before Y” rule: system call pointers

Applications are evil
– OS must check all input pointers before use
– one missing check = security hole

MC checker:
– Bind syscall ptr’s to “tainted” state
– tainted vars only touched w/ “safe” routines
– or: explicit check to make “clean”

    /* from sys/kern/disk.c */
    int sys_disk_request(… struct buf *reqbp, u_int k) {
        ...
        /* bypass for direct scsi commands */
        if (reqbp->b_flags & B_SCSICMD)
            return sys_disk_scsicmd (sn, k, reqbp);

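[Figure: state machine for the pointer checker. Each input ptr P starts in P.tainted. check(p) moves it to P.clean; copyin(p) and copyout(p) are allowed while tainted; use(p) while tainted leads to error; use(p) while clean is fine.]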
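A rough C sketch of the tainted/clean rule, assuming each path has been reduced to the actions applied to one syscall pointer. The enum names and taint_violation are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Actions on one syscall pointer along a path (hypothetical encoding). */
enum paction {
    ACT_CHECK,   /* explicit validity check: makes the pointer clean */
    ACT_COPY,    /* copyin()/copyout(): safe on a tainted pointer     */
    ACT_USE      /* direct dereference                                */
};

/* Returns 1 if the path dereferences the pointer while still tainted. */
int taint_violation(const enum paction *acts, size_t n)
{
    int tainted = 1;                 /* every input pointer starts tainted */
    for (size_t i = 0; i < n; i++) {
        switch (acts[i]) {
        case ACT_CHECK:
            tainted = 0;
            break;
        case ACT_COPY:
            break;                   /* safe routine: no state change */
        case ACT_USE:
            if (tainted)
                return 1;
            break;
        }
    }
    return 0;
}
```

In sys_disk_request above, reqbp->b_flags is a direct use with no preceding check, which corresponds to the single-action path { ACT_USE }.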

Deriving specification from common usage

Problem: difficult to specify all user pointers
– so: see what code usually does, deviations probably errors
– if ever pass ptr to paranoid routine, make sure always do

Found 5 security errors in Linux.
– Canonical example: hole in an “ioctl” routine for some obscure device driver.

    /* drivers/usb/evdev.c */
    static int evdev_ioctl(..., unsigned long arg) {
        ...
        switch (cmd) {
        case EVIOCGVERSION:
            return put_user(EV_VERSION, (__u32 *) arg);
        case EVIOCGID:
            /* copy_to_user(to, from)! */
            return copy_to_user(&dev->id, (void *) arg,
                                sizeof(struct input_id));

Drivers pull them from all sorts of random locations.

Kernel alloc/dealloc rules

– Must check that alloc succeeded
– Must allocate enough space
– Must not use after free()
– Must free alloc’d object on error:

    /* from drivers/char/tea6300.c */
    client = kmalloc(sizeof *client,GFP_KERNEL);
    if (!client)
        return -ENOMEM;
    ...
    tea = kmalloc (sizeof *tea, GFP_KERNEL);
    if (!tea)
        return -ENOMEM;
    ...
    MOD_INC_USE_COUNT; /* bonus bug: kmalloc could sleep */
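For contrast, here is a user-space sketch of the “free alloc’d object on error” rule being followed. It uses malloc/free rather than kmalloc/kfree, and struct pair, pair_init, flaky_alloc and allocs_before_failure are all invented for this example; it is not the fix that went into the driver.

```c
#include <assert.h>
#include <stdlib.h>

struct pair { int *client; int *tea; };

/* Demo allocator (hypothetical): succeeds allocs_before_failure times,
 * then returns NULL. A negative value means it never fails on purpose. */
int allocs_before_failure = -1;

void *flaky_alloc(size_t n)
{
    if (allocs_before_failure == 0)
        return NULL;
    if (allocs_before_failure > 0)
        allocs_before_failure--;
    return malloc(n);
}

/* Returns 0 on success, -1 on failure. On failure nothing stays allocated. */
int pair_init(struct pair *p)
{
    p->client = flaky_alloc(sizeof *p->client);
    if (!p->client)
        return -1;

    p->tea = flaky_alloc(sizeof *p->tea);
    if (!p->tea) {
        free(p->client);       /* the step missing from the snippet above */
        p->client = NULL;
        return -1;
    }
    return 0;
}
```

The second error return is the interesting one: it releases the first allocation before reporting failure, which is exactly what the leak checker looks for on error paths.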

47 error leaks, 48 false positives. 72 mod_inc/mod_dec errors.

Stripped-down kernel malloc/free checker

    decl { scalar } sz;          // match any scalar
    decl { const int } retv;     // match const ints
    state decl { any_ptr } v;    // match any pointer, can bind to a state

    // Bind malloc results to “unknown” until observed
    start:
        { v = (any)malloc(sz) } ==> v.unknown
      | { free(v) } ==> v.freed;

    // can compare in states unknown, null, not_null
    v.unknown, v.null, v.not_null:
        { (v == 0) } ==> true = v.null, false = v.not_null
      | { (v != 0) } ==> true = v.not_null, false = v.null;

    // Cannot reach error path with unknown or not-null
    v.unknown, v.not_null:
        { return retv; } ==> { if(mgk_int_cst(retv) < 0) err("Error path leak!"); };

    // No dereferences of null, freed, or unknown ptrs.
    v.null, v.freed, v.unknown:
        { *(any *)v } ==> { err("Using ptr illegally!"); };

Interesting since it’s so small: a few lines of code and we have a path sensitive high-level analysis pass

Some amusing bugs

No check (130 errors, 11 false pos). Worst case (many uses):

    /* include/linux/coda_linux.h:CODA_ALLOC */
    ptr = (cast)vmalloc((unsigned long) size);
    ...
    if (ptr == 0)
        printk("kernel malloc returns 0\n");
    memset( ptr, 0, size );

Use after free (14 errors, 3 false pos): 5 cut&paste of:

    /* drivers/isdn/pcbit:pcbit_init_dev */
    kfree(dev);
    iounmap((unsigned char*)dev->sh_mem);
    release_mem_region(dev->ph_mem, 4096);

Wrong size (2 errors):

    /* drivers/parport/daisy.c:add_dev:50 */
    newdev = kmalloc (GFP_KERNEL,sizeof(struct daisydev));

“In context Y, don’t do X”: blocking

Linux: if interrupts are disabled, or spin lock held, do not call an operation that could block.

MC checker:
– Compute transitive closure of all potentially blocking fn’s
– Hit disable/lock: warn of any calls
– 123 errors, 8 false pos

    /* drivers/net/pcmcia/wavelan_cs.c */
    spin_lock_irqsave (&lp->lock, flags);   /* 1889 */
    switch(cmd)
    ...
    case SIOCGIWPRIV:
    ...
        if(copy_to_user(wrq->u.data.pointer, …))   /* 2305 */
            ret = -EFAULT;

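[Figure: state machine for the blocking checker. States: clean, NoBlock, error. disable() or lock(l) moves clean to NoBlock; enable() or unlock(l) moves NoBlock back to clean; a blocking call while in NoBlock leads to error.]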
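The “transitive closure of all potentially blocking fn’s” step can be sketched as a fixed-point pass over a call graph. This assumes a small adjacency matrix and function numbering made up for the example; how the actual checker represents the kernel call graph is not described here.

```c
#include <assert.h>
#include <stdbool.h>

#define NFUNCS 5   /* size of the toy call graph */

/* calls[i][j] is true when function i directly calls function j.
 * On entry, blocks[] marks the primitives known to block; on return it
 * marks every function that can reach one of them. */
void mark_blocking(bool calls[NFUNCS][NFUNCS], bool blocks[NFUNCS])
{
    bool changed = true;
    while (changed) {                    /* iterate until nothing changes */
        changed = false;
        for (int i = 0; i < NFUNCS; i++) {
            if (blocks[i])
                continue;
            for (int j = 0; j < NFUNCS; j++) {
                if (calls[i][j] && blocks[j]) {
                    blocks[i] = true;    /* i calls something that blocks */
                    changed = true;
                    break;
                }
            }
        }
    }
}
```

Once the set is computed, the checker only has to flag calls to any marked function while in the NoBlock state.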

The violation comes about 400 lines after the lock is taken. This is a common pattern: the implementor just doesn’t know about the rule, so keeps violating it. It happens because rules are manually enforced and poorly documented.

Example: statically checking assert

Assert(x) used to check “x” at runtime. Abort if false
– compiler oblivious, so cannot analyze statically
– Use MC to build an assert-aware extension

Result: found 5 errors in FLASH.
– Common: code cut&paste from other context
– Manual detection questionable: 300-line path explosion between violation and check

General method to push dynamic checks to static

    msg.len = 0;
    ...
    assert(msg.len !=0);

[Figure: the assert checker processes this code and reports “line 211: assert failure!”]

Another way to use MC is to push dynamic checks to static. Usually have some amount of dynamic type checking going on where you have a series of if statements at the beginning of your routine to check for error conditions. So just pull into the compiler and check.

Result overview

    Check           Errors   False pos   LOC
    Static assert   5        0           100
    Stack check     10+      0           53
    Allocation      184      64          60
    Blocking        123      8           131
    Module race     ~75      2           133
    Mutex           82       201         64
    FLASH           34       69          553
    Total           ~545     ~226        ~1100
    (+others)

High bits: small number of LOC = big bug count, 2:1 ratio of false positives

Conclusions

MC goal: make programming much more powerful
– How: Raise compilation from level of programming language to the “meta level” of the systems implemented in that language

MC works well in real, heavily tested systems
– We found bugs in every system we’ve looked at.
– Over 600 bugs in total, many capable of crashing system
– Easily written by people unfamiliar w/ checked system

Currently:
– making correctors, using domain-knowledge to extract verifiable specs, deriving errors by usage deviations, performing meta-level optimization…

Given a set of uses of some interface you’ve built, you invariably see better ways of doing things. This gives you a way to articulate this knowledge and have the compiler do it for you automatically. Let one person do it.


Conclusions

Meta-level compilation:
– Make compilers aggressively system-specific
– Easy: digest sentence fragment, write checker/optimizer.
– Result: Static, precise, immediate error diagnosis
  » As outsiders found errors in every system looked at
  » Over 600 bugs, many capable of crashing system



Bugs as deviant behavior

One way to find bugs:
– have a deep understanding of code semantics, detect when code makes no sense. Hard.

Easier:
– see what code usually does: deviations probably bugs
– x protected by lock(a) 1000 times, by lock(b) once, probably an error
– Find inverses by looking for common pairings
– More general: derive temporal orderings. Use machine learning to derive more sophisticated patterns?

    lock(a);
    x++;
    unlock(a);

    lock(b);
    x++;
    unlock(b);
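The “1000 times vs. once” heuristic can be sketched as a simple count comparison. This assumes an earlier pass has already tallied, per lock, how many accesses to x happen while that lock is held. The function name and the 10x threshold are arbitrary choices for illustration, not figures from the talk.

```c
#include <assert.h>

#define NLOCKS 4   /* number of distinct locks in the toy example */

/* counts[l] = accesses to one variable made while holding lock l.
 * Returns the index of a lock whose use looks deviant, or -1 if none. */
int find_deviant_lock(const int counts[NLOCKS])
{
    int best = 0;
    for (int l = 1; l < NLOCKS; l++)
        if (counts[l] > counts[best])
            best = l;                    /* most common protecting lock */

    for (int l = 0; l < NLOCKS; l++)
        if (l != best && counts[l] > 0 && counts[best] >= 10 * counts[l])
            return l;                    /* rare alternative: suspicious */
    return -1;
}
```

With counts of { 1000, 1, 0, 0 } the second lock is flagged. With { 5, 4, 0, 0 } nothing is, since neither usage clearly dominates.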


See system call polarity: return < 0 or > 0 on error: found 8 places in Linux. If they ever check a routine for failure, make sure they always check for failure. See what they do after calling a security check (suser): if they do it a lot and deviate, whine. Weakness: if code always does the wrong thing and never the right one, there is no deviation to flag.

What to do when static analysis too weak?

Static analysis works in some cases, not well in others
– hit undecidable problems with loop termination conditions, data values, pointers,…

Alternative:
– use domain-specific slicing to extract spec from code
– run through verifier

Main lever: a little domain knowledge goes a long way
– e.g., strip out Linux TCP finite-state-machine by keying off of variable “sk->state”
– Real example: checking FLASH code

Extracting specs from FLASH code

Embedded sw for cache coherence in FLASH machine
– errors crash or deadlock machine: can take week to track
– typical protocol: 18K lines of hairy C code

Extract specifications from source by simple slicing
– found 9 errors in code
– despite 5+ years of heavy testing and formal verification!

How?
– Given list of data structure fields and message operations, slice out all relevant operations
– Compose with specification (manual) boilerplate
– run through Murphi model checker
– Levers: aliasing and globals, but in a stylized way that we can mostly ignore. 4 loops in code.

Deeply nested control structures (29 conditional compilation directives, 21 if statements)

FLASH vs Murphi

Murphi:

    nh.len := len_cacheline;
    if ((DH.Pending = 0)) then
      if ((DH.Dirty = 0)) then
        assert(nh.len != len_nodata);
        mbResult := pi_send_func(src, PI_Putt);
        DH.Local := 1;
      else
        assert((DH.List = 0));
        assert((DH.RealPtrs = 0));
        …

FLASH:

    HANDLER_GLOBALS(header.nh.len) = LEN_CACHELINE;
    if (! HANDLER_GLOBALS(h.hl.Pending)) {
        if (! HANDLER_GLOBALS(h.hl.Dirty)) {
            ASSERT(!HANDLER_GLOBALS(h.hl.IO));
            PI_SEND(F_DATA, F_FREE, F_SWAP,…,…);
            HANDLER_GLOBALS(h.hl.Local) = 1;
            /* ... deleted 14 lines */
        } else {
            ASSERT(!HANDLER_GLOBALS(h.hl.List));
            ASSERT(!HANDLER_GLOBALS(h.hl.RealPtrs));

Checkers into Correctors

Problem: big system, lots of bugs
– may not be your system or take too long to fix manually

Can turn some classes of checkers into correctors:
– “Do not allocate large variables on kernel stack”: if you hit a violation, rewrite code to dynamically allocate var
– “Do not call blocking memory allocator with interrupts disabled”: hoist allocation out
– “On error paths, rollback side-effects”: dynamically track what these are, and reverse.

Interesting: trade dynamic checks for simplicity

Before:

    cli();
    p = malloc(sizeof *p);
    sti();

After:

    tmp = malloc(sizeof *p);
    cli();
    p = tmp;
    sti();

Malloc: can insert null pointer check (if you can figure out right value to return) or can preallocate. Hoist mod_inc/dec above sleeping operations.

Empirical observation: we have sent out hundreds of bug reports. 20-30 have gotten fixed.

MC optimization

Optimization rules similar to checking:
– “if data is not shared with interrupt handlers, protect using spin locks rather than interrupt disable/enable”
– “to save an instruction when setting a message opcode, xor it with the new and old (msg.opcode ^= (new ^ old));”
– “replace quicksort with radix sort when sorting integers”

Common rule: “In situation X, do Y rather than Z”:
– “if a variable is not modified, protect using read locks”
– and with a few lines: change opt into checker:

Unoptimized:

    lock(q->lock);
    head = q->head;
    n = q->nelem;
    unlock(q->lock);

Optimized:

    read_lock(q->lock);
    head = q->head;
    n = q->nelem;
    read_unlock(q->lock);

Checker:

    read_lock(q->lock);
    q->head = h;          “modifying q with read lock!”
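The xor rule from the list above relies on a simple identity: if the field currently holds old, then old ^ (new ^ old) equals new, and new ^ old is a compile-time constant. A small sketch, with opcode values and names (OP_OLD, OP_NEW, retag) invented for the example:

```c
#include <assert.h>

enum { OP_OLD = 0x12, OP_NEW = 0x47 };   /* hypothetical opcode values */

struct msg { unsigned opcode; };

/* Switch the opcode from OP_OLD to OP_NEW with a single xor.
 * Only valid when m->opcode is known to equal OP_OLD on entry. */
void retag(struct msg *m)
{
    m->opcode ^= (OP_NEW ^ OP_OLD);      /* constant folded by the compiler */
}
```

The precondition is the part that needs system-specific knowledge: a general-purpose compiler has no way to know the opcode’s current value, whereas an extension that tracks the message state could establish it.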

MC analysis vs. traditional compiler analysis

Meaning more apparent + domain-specific knowledge

Easier to bound side-effects: use knowledge of abstract state to ignore many concrete actions

Aliasing less of a problem
– typical: opaque handles vs normal mess of pointers

Operations more coarse grain
– read()/write() vs load/store; matrix ops vs +/-

Before:

    Bigint a, b, c;
    set(a, 3);
    mul(b, a, a);
    mul(c, b, b);
    printf("%s", bigint_to_str(c));

After:

    printf("81");

Interfaces let compilers treat implementation as a black box just like programmers!

Could imagine following bunches of pointers and memory allocation: instead, we know that it does a +, - whatever and we can ignore it. Similarly: ignoring ptrs.

Many compiler problems hard because static. Not a problem with runtime checking. Less control needed: pushing around function calls, rather than doing register allocation
