![Page 1: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/1.jpg)
1
EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR
CLAIRE LE GOUES
SITE VISIT
FEBRUARY 7, 2013
![Page 2: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/2.jpg)
2
“Benchmarks set standards for innovation, and can encourage or stifle it.”
-Blackburn et al.
![Page 3: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/3.jpg)
3
2009: 15 papers on automatic program repair*
2011: Dagstuhl seminar on self-repairing programs
2012: 30 papers on automatic program repair*
2013: dedicated program repair track at ICSE
*Counts come from a manual review of the results of an ACM Digital Library search for “automatic program repair”.
AUTOMATIC PROGRAM REPAIR OVER TIME
![Page 4: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/4.jpg)
4
Manually sift through bugtraq data.
Indicative example: Axis project for automatically repairing concurrency bugs
• 9 weeks of sifting to find 8 bugs to study.
• Direct quote from Charles Zhang, senior author, on the process: “it's very painful”
Very difficult to compare against previous or related work or generate sufficiently large datasets.
CURRENT APPROACH
![Page 5: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/5.jpg)
5
GOAL: HIGH-QUALITY EMPIRICAL EVALUATION
![Page 6: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/6.jpg)
6
SUBGOAL: HIGH-QUALITY BENCHMARK SUITE
![Page 7: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/7.jpg)
7
Indicative of important real-world bugs, found systematically in open-source programs.
Support a variety of research objectives.
• “Latitudinal” studies: many different types of bugs and programs
• “Longitudinal” studies: many iterative bugs in one program.
Scientifically meaningful: passing the test cases should imply a repair.
Admit push-button, simple integration with tools like GenProg.
BENCHMARK REQUIREMENTS
![Page 9: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/9.jpg)
9
http://genprog.cs.virginia.edu
Goal: a large set of important, reproducible bugs in non-trivial programs.
Approach: use historical data to approximate discovery and repair of bugs in the wild.
SYSTEMATIC BENCHMARK SELECTION
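One plausible reading of "use historical data" is to scan a project's revision history for points where a previously failing test starts to pass, and to treat the revisions on either side as a (buggy, fixed) pair. The sketch below illustrates that idea only. The data layout, the function name `find_bug_fix_pairs`, and the revision and test names are assumptions made for illustration, not the selection procedure described in the slides.

```python
from typing import Dict, List, Tuple

# Each entry: (revision id, {test name: passed?}), ordered oldest first.
History = List[Tuple[str, Dict[str, bool]]]


def find_bug_fix_pairs(history: History) -> List[Tuple[str, str, List[str]]]:
    """Return (buggy_rev, fixed_rev, repaired_tests) for adjacent revisions
    where at least one test flips from failing to passing."""
    pairs = []
    for (old_rev, old_results), (new_rev, new_results) in zip(history, history[1:]):
        repaired = sorted(
            test
            for test, passed in new_results.items()
            # Only count tests that existed and failed in the older revision.
            if passed and old_results.get(test) is False
        )
        if repaired:
            pairs.append((old_rev, new_rev, repaired))
    return pairs


# Hypothetical history: revision ids and test names are made up.
history = [
    ("r100", {"t_parse": True, "t_unicode": False}),
    ("r101", {"t_parse": True, "t_unicode": False}),
    ("r102", {"t_parse": True, "t_unicode": True}),
]
bug_fix_pairs = find_bug_fix_pairs(history)
```

Here the only candidate is the step from `r101` to `r102`, where `t_unicode` starts passing. A real pipeline would also need to build and run each revision, which is where most of the integration effort goes.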
![Page 10: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/10.jpg)
10
Indicative of important real-world bugs, found systematically in open-source programs:
• Add new programs to the set, with as wide a variety of types as possible (support “latitudinal” studies)
Support a variety of research objectives:
• Allow studies of iterative bugs, development, and repair: generate a very large (100) set of bugs in one program (php) (support “longitudinal” studies).
NEW BUGS, NEW PROGRAMS
![Page 11: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/11.jpg)
11
| Program | LOC | Tests | Bugs | Description |
|---|---:|---:|---:|---|
| fbc | 97,000 | 773 | 3 | Language (legacy) |
| gmp | 145,000 | 146 | 2 | Multiple precision math |
| gzip | 491,000 | 12 | 5 | Data compression |
| libtiff | 77,000 | 78 | 24 | Image manipulation |
| lighttpd | 62,000 | 295 | 9 | Web server |
| php | 1,046,000 | 11,995 | 100 | Language (web) |
| python | 407,000 | 355 | 11 | Language (general) |
| wireshark | 2,814,000 | 63 | 7 | Network packet analyzer |
| valgrind | 711,000 | 595 | 2 | Simulator and debugger |
| vlc | 522,000 | 17 | ?? | Media player |
| svn | 629,000 | 1,748 | ?? | Source control |
| Total | 7,001,000 | 16,077 | 163 | |
![Page 12: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/12.jpg)
12
Indicative of important real-world bugs, found systematically in open-source programs.
Support a variety of research objectives.
• “Latitudinal” studies: many different types of bugs and programs
• “Longitudinal” studies: many iterative bugs in one program.
Scientifically meaningful: passing the test cases should imply a repair.
Admit push-button, simple integration with tools like GenProg.
BENCHMARK REQUIREMENTS
![Page 13: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/13.jpg)
13
They must exist.
• Sometimes, but not always, true (see: Jonathan Dorn)
TEST CASE CHALLENGES
![Page 14: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/14.jpg)
14
| Program | LOC | Tests | Bugs | Description |
|---|---:|---:|---:|---|
| fbc | 97,000 | 773 | 3 | Language (legacy) |
| gmp | 145,000 | 146 | 2 | Multiple precision math |
| gzip | 491,000 | 12 | 5 | Data compression |
| libtiff | 77,000 | 78 | 24 | Image manipulation |
| lighttpd | 62,000 | 295 | 9 | Web server |
| php | 1,046,000 | 11,995 | 100 | Language (web) |
| python | 407,000 | 355 | 11 | Language (general) |
| wireshark | 2,814,000 | 63 | 7 | Network packet analyzer |
| valgrind | 711,000 | 595 | 2 | Simulator and debugger |
| Total | 5,850,000 | 14,312 | 163 | |
BENCHMARKS
![Page 16: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/16.jpg)
16
They must exist.
• Sometimes, but not always, true (see: Jonathan Dorn)
They should be of high quality.
• This has been a challenge from day 0: nullhttpd
• Lincoln Labs noticed it too: sort
• In both cases, adding test cases led to better repairs.
They must be automated to run one at a time, programmatically, from within another framework.
TEST CASE CHALLENGES
![Page 17: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/17.jpg)
17
Need to be able to compile and run new variants programmatically.
Need to be able to run test cases one at a time.
• This is not simple, and it becomes increasingly tricky as we scale up to real-world systems.
• Much of the challenge is unrelated to the program in question, and instead requires highly technical knowledge of OS-level details.
PUSH-BUTTON INTEGRATION
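A minimal sketch of what "run test cases one at a time, programmatically" can look like from a driving framework. The function name `run_single_test`, the outcome labels, and the stand-in test commands are assumptions for illustration; this is not GenProg's actual harness. It uses Python's standard `subprocess.run`, which reports a negative return code on POSIX systems when the child was killed by a signal.

```python
import subprocess
import sys
from typing import List


def run_single_test(command: List[str], timeout_seconds: float = 10.0) -> str:
    """Run one test command in its own process and classify the outcome."""
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return "timeout"           # e.g. a variant stuck in an infinite loop
    except OSError:
        return "could-not-start"   # the OS failed to launch the process
    if completed.returncode == 0:
        return "pass"
    if completed.returncode < 0:
        return "killed-by-signal"  # POSIX: negative value encodes the signal
    return "fail"


# Stand-in "tests": tiny Python programs that exit with a chosen status.
passing = run_single_test([sys.executable, "-c", "raise SystemExit(0)"])
failing = run_single_test([sys.executable, "-c", "raise SystemExit(3)"])
hanging = run_single_test(
    [sys.executable, "-c", "import time; time.sleep(30)"], timeout_seconds=1.0
)
```

Separating "could not start", "timed out", and "killed by signal" from an ordinary failure matters for repair: a variant that hangs or crashes the harness should not be confused with one that simply produces the wrong answer.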
![Page 18: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/18.jpg)
18
Calling a process from within another process:
• system("run test 1") ...; wait()
wait() reports the child's termination status as a single status word, which packs the exit code together with other termination details.
Decoding that word is complex.
• Example: a system() call can fail because the OS ran out of memory while creating the process, or because the process itself ran out of memory.
How do we tell the difference?
• Answer: bit masking
DIGRESSION ON WAIT()
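To make the bit masking concrete, here is a small sketch that decodes a raw status word by hand. It assumes the conventional 16-bit layout used on Linux (low 7 bits: terminating signal, bit 7: core-dump flag, next 8 bits: exit code). That layout is an implementation detail rather than a guarantee, and the function name `decode_wait_status` is made up for this example.

```python
def decode_wait_status(status: int) -> dict:
    """Decode a raw wait()-style status word with explicit bit masks.

    Assumes the conventional Linux layout; portable code should rely on
    the platform's W* helpers instead of hand-written masks.
    """
    if (status & 0xFF) == 0x7F:
        # Low byte 0x7F marks a stopped (not terminated) child.
        return {"how": "stopped", "signal": (status >> 8) & 0xFF}
    low7 = status & 0x7F
    if low7 == 0:
        # No terminating signal: the exit code lives in the next 8 bits.
        return {"how": "exited", "exit_code": (status >> 8) & 0xFF}
    return {"how": "signaled", "signal": low7, "core_dumped": bool(status & 0x80)}


exited_with_3 = decode_wait_status(3 << 8)  # child called exit(3)
killed_by_9 = decode_wait_status(9)         # child was terminated by signal 9

# A faulty extraction that masks without shifting reads exit(3) as 0,
# which makes a failing test look like a passing one.
naive_exit_code = (3 << 8) & 0xFF
```

In Python the same distinctions are available without manual masks through `os.WIFEXITED`, `os.WEXITSTATUS`, `os.WIFSIGNALED`, and `os.WTERMSIG` on Unix; in C the corresponding macros come from `<sys/wait.h>`.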
![Page 19: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/19.jpg)
19
Moral: integration is tricky, and lends itself to human mistakes.
Possibility 1: original programmers make mistakes in developing the test suite.
• Test cases can have bugs, too.
Possibility 2: we (GenProg devs/users) make mistakes in integration.
• A few old php test cases are not up to our standards: the integration used faulty bit-shift math to extract the components of the return value.
REAL-WORLD COMPLEXITY
![Page 20: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/20.jpg)
20
Interested in more, better benchmark design, with easy integration (without gnarly OS details).
• Virtual machines provide one approach.
Need a better definition of “high quality test case” vs. “low quality test case”:
• Can the empty program pass it?
• Can every program pass it?
• Can the “always crashes” program pass it?
INTEGRATION CONCERNS
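The questions on this slide suggest a simple sanity check: run each test against deliberately degenerate programs and flag any test they can pass. The sketch below models programs and tests as Python callables purely for illustration; the names `audit`, `weak_test`, and `strong_test` are hypothetical. "Can every program pass it?" cannot be checked exhaustively, so the degenerate programs serve as a cheap approximation.

```python
from typing import Callable, Dict

Program = Callable[[str], str]        # maps an input string to an output string
TestCase = Callable[[Program], bool]  # returns True if the program passes


def empty_program(_text: str) -> str:
    """Degenerate program that does nothing and produces no output."""
    return ""


def always_crashes(_text: str) -> str:
    """Degenerate program that fails on every input."""
    raise RuntimeError("simulated crash")


def passes(test: TestCase, program: Program) -> bool:
    """Treat any exception raised while testing as a failure."""
    try:
        return bool(test(program))
    except Exception:
        return False


def audit(test: TestCase) -> Dict[str, bool]:
    """Report which degenerate programs a test lets through."""
    return {
        "empty_program_passes": passes(test, empty_program),
        "crashing_program_passes": passes(test, always_crashes),
    }


def weak_test(program: Program) -> bool:
    # Only checks that the output contains no error marker.
    return "ERROR" not in program("b\na\n")


def strong_test(program: Program) -> bool:
    # Checks the actual expected output of a sort-like program.
    return program("b\na\n") == "a\nb\n"


weak_report = audit(weak_test)
strong_report = audit(strong_test)
```

A test like `weak_test`, which the empty program passes, gives a repair tool no pressure to preserve real behavior, so deleting functionality can look like a fix.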
![Page 21: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/21.jpg)
21
Over the past year, we have conducted studies of representation and operators for automatic program repair:
• One-point crossover on patch representation.
• Non-uniform mutation operator selection.
• Alternative fault localization framework.
Results on the next slide incorporate “all the bells and whistles”:
• Improvements based on those large-scale studies.
• Manually confirmed quality of testing framework.
CURRENT REPAIR SUCCESS
![Page 22: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/22.jpg)
22
CURRENT REPAIR SUCCESS
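| Program | Previous Results | Current Results |
|---|---:|---:|
| fbc | 1/3 | 1/3 |
| gmp | 1/2 | 1/2 |
| gzip | 1/5 | 1/5 |
| libtiff | 17/24 | 17/24 |
| lighttpd | 5/9 | 5/9 |
| php | 28/44 | 55/100 |
| python | 1/11 | 2/11 |
| wireshark | 1/7 | 4/7 |
| valgrind | --- | 1/2 |
| Total | 55/105 | 87/163 |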
![Page 23: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/23.jpg)
23
TRANSITION
![Page 24: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/24.jpg)
24
REPAIR TEMPLATES
CLAIRE LE GOUES
SHIRLEY PARK
DARPA SITE VISIT
FEBRUARY 7, 2013
![Page 25: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/25.jpg)
BIO + CS INTERACTION
25
![Page 26: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/26.jpg)
Immune response is equally fast for large and small animals.
• A human lung is 100x larger than a mouse lung, yet the immune response still finds influenza infections in ~8 hours.
• Successfully balances local search and global response.
Balance between generic and specialized T-cells:
• Rapid response to new pathogens vs. long-term memory of previous infections (cf. vaccines).
IMMUNOLOGY: T-CELLS
26
![Page 27: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/27.jpg)
27
[Figure: mutate-and-evaluate loop. Stages shown: INPUT, MUTATE, EVALUATE FITNESS, then ACCEPT or DISCARD, and finally OUTPUT.]
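The loop on this slide (mutate, evaluate fitness, accept or discard) can be sketched as a single-candidate search. This is a deliberately simplified stand-in, not GenProg's actual genetic algorithm: programs are modeled as lists of statement strings, fitness is the number of passing tests, and all names and the toy program are assumptions for illustration.

```python
import random
from typing import Callable, List, Optional

Program = List[str]                   # a program modeled as a list of statements
TestCase = Callable[[Program], bool]


def fitness(program: Program, tests: List[TestCase]) -> int:
    """Number of test cases the candidate passes."""
    return sum(1 for test in tests if test(program))


def mutate(program: Program, rng: random.Random) -> Program:
    """Apply one generic edit: delete, insert, or replace a statement.
    Inserted or replacement code is copied from elsewhere in the same program."""
    candidate = list(program)
    action = rng.choice(["delete", "insert", "replace"])
    position = rng.randrange(len(candidate))
    donor = rng.choice(program)
    if action == "delete" and len(candidate) > 1:
        del candidate[position]
    elif action == "insert":
        candidate.insert(position, donor)
    else:
        # "replace", or a "delete" that would have emptied the program.
        candidate[position] = donor
    return candidate


def repair(program: Program, tests: List[TestCase], rng: random.Random,
           max_iterations: int = 1000) -> Optional[Program]:
    current, current_fitness = program, fitness(program, tests)  # INPUT
    for _ in range(max_iterations):
        if current_fitness == len(tests):
            return current                                       # OUTPUT
        candidate = mutate(current, rng)                         # MUTATE
        candidate_fitness = fitness(candidate, tests)            # EVALUATE FITNESS
        if candidate_fitness > current_fitness:                  # ACCEPT
            current, current_fitness = candidate, candidate_fitness
        # Otherwise DISCARD the candidate and mutate the current program again.
    return current if current_fitness == len(tests) else None


# Toy example: the "bug" is a statement that must not be present.
tests = [
    lambda p: "crash()" not in p,
    lambda p: "read()" in p,
    lambda p: "write()" in p,
]
buggy = ["read()", "crash()", "write()"]
repaired = repair(buggy, tests, random.Random(0))
```

Because fitness is defined entirely by the tests, the quality of the test suite bounds the quality of whatever this loop accepts.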
![Page 28: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/28.jpg)
Tradeoff between generic mutation actions and more specific action templates:
• Generic: INSERT, DELETE, REPLACE
• Specific:
`if (<var> != NULL) { <code using <var>> }`
AUTOMATIC SOFTWARE REPAIR
28
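The contrast between a generic action and a specific template can be shown in a few lines. The sketch below treats C statements as plain strings; the function names and the example pointer `buf` are hypothetical, and a real tool would operate on the program's syntax tree rather than on text.

```python
from typing import List


def apply_generic_delete(statements: List[str], index: int) -> List[str]:
    """Generic DELETE: drop one statement, with no knowledge of why it fails."""
    return statements[:index] + statements[index + 1:]


def apply_null_check_template(pointer_name: str, statement: str) -> str:
    """Specific template: wrap a C statement that uses `pointer_name`
    in a NULL guard, keeping the original behavior when the pointer is valid."""
    return f"if ({pointer_name} != NULL) {{\n    {statement}\n}}"


original = ["buf = lookup(key);", "buf->len = 0;", "return buf;"]
after_delete = apply_generic_delete(original, 1)
guarded = apply_null_check_template("buf", "buf->len = 0;")
```

The generic edit removes the dereference entirely, while the template keeps it and guards it. The tradeoff is that a template applies to fewer situations but encodes more of what a human fix tends to look like.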
![Page 29: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/29.jpg)
29
HYPOTHESIS: GENPROG CAN REPAIR MORE BUGS, AND REPAIR BUGS MORE QUICKLY, IF WE AUGMENT MUTATION ACTIONS WITH “REPAIR TEMPLATES.”
![Page 30: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/30.jpg)
30
Insight: Just like T-cells “remember” previous infections, abstract previous fixes to generate new mutations.
Approach:
• Model previous changes using structured documentation.
• Cluster a large set of changes by similarity.
• Abstract the center of each cluster.
Example:
`if (<var> < 0)`
`    return 0;`
`else`
`    <code using <var>>`
OPTION 1: PREVIOUS CHANGES
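A rough sketch of the "cluster by similarity, then take the center" steps. The slides do not say which similarity measure or clustering algorithm is used, so this example swaps in simple stand-ins: textual similarity from Python's `difflib.SequenceMatcher`, a greedy one-pass clustering, and a medoid as the cluster center. The change strings and the threshold are made up for illustration.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Textual similarity in [0, 1]; a stand-in for a structural measure."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_changes(changes: List[str], threshold: float = 0.6) -> List[List[str]]:
    """Greedy clustering: put each change in the first cluster whose
    first member is similar enough, else start a new cluster."""
    clusters: List[List[str]] = []
    for change in changes:
        for cluster in clusters:
            if similarity(change, cluster[0]) >= threshold:
                cluster.append(change)
                break
        else:
            clusters.append([change])
    return clusters


def cluster_center(cluster: List[str]) -> str:
    """Medoid: the member with the highest total similarity to the cluster."""
    return max(cluster, key=lambda c: sum(similarity(c, other) for other in cluster))


# Hypothetical previous fixes, written as the code each change added.
changes = [
    "if (len < 0) return 0;",
    "if (size < 0) return 0;",
    "free(buffer);",
    "if (count < 0) return 0;",
    "free(table);",
]
clusters = cluster_changes(changes)
centers = [cluster_center(cluster) for cluster in clusters]
```

The remaining step, abstracting a center into a template, would replace the parts that vary across the cluster (here the variable name) with a placeholder; that step is left out of this sketch.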
![Page 31: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/31.jpg)
31
Insight: People look things up at a library to find the best examples of what they want to reproduce; existing code can serve the same role.
Approach:
• Generate static paths through C programs.
• Mine API usage patterns from those paths.
• Abstract the patterns into mutation templates.
Example:
while(it.hasnext())
<code using it.next()>
OPTION 2: EXISTING BEHAVIOR
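As an illustration of mining usage patterns from paths, the sketch below counts how often one call precedes another across a set of paths and keeps the pairs with enough support. The paths are written by hand here rather than generated by static analysis, and the pair-counting approach, the function name `mine_ordered_pairs`, and the support threshold are assumptions, not the mining technique used in this work.

```python
from collections import Counter
from itertools import combinations
from typing import List, Tuple


def mine_ordered_pairs(paths: List[List[str]], min_support: int) -> List[Tuple[str, str]]:
    """Return call pairs (a, b) where a occurs before b on at least
    `min_support` distinct paths."""
    support: Counter = Counter()
    for path in paths:
        # combinations() preserves path order, so (a, b) means a came first.
        pairs_on_path = {(a, b) for a, b in combinations(path, 2) if a != b}
        support.update(pairs_on_path)  # count each pair once per path
    return sorted(pair for pair, count in support.items() if count >= min_support)


# Hand-written stand-ins for statically generated paths through C code.
paths = [
    ["fopen", "fread", "fclose"],
    ["fopen", "fclose"],
    ["malloc", "free"],
    ["fopen", "fwrite", "fclose"],
]
frequent_pairs = mine_ordered_pairs(paths, min_support=3)
```

A frequent pair such as `fopen` followed by `fclose` could then be abstracted into a template that inserts the missing second call when only the first is present.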
![Page 32: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/32.jpg)
32
THIS WORK IS ONGOING.
![Page 33: EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES SITE VISIT FEBRUARY 7, 2013 1](https://reader037.vdocuments.us/reader037/viewer/2022110400/56649dca5503460f94ac1327/html5/thumbnails/33.jpg)
33
We are generating a benchmark suite to support GenProg research, integration and tech transfer, and the automatic repair community at large.
Current GenProg results for the 12-hour repair scenario: 87/163 (53%) of the real-world bugs in the dataset are repaired.
Repair templates will augment GenProg’s mutation operators to help repair more bugs, and repair bugs more quickly.
CONCLUSIONS