Post on 04-Jul-2020

Meaningful Variable Names for Decompiled Code: A Machine Translation Approach

Alan Jaffe, Jeremy Lacomis, Edward J. Schwartz*, Claire Le Goues, and Bogdan Vasilescu

Problem: Obfuscated Variable Names in Code

Original JavaScript:

function callback(error, response, body) {
  if (!error && response.statusCode == 200) {
    var info = JSON.parse(body);
    …

Minified JavaScript:

function callback(o, s, a) {
  if (!o && s.statusCode == 200) {
    var c = JSON.parse(a);
    …

Original C Code:

cp = buf;
(void)asxTab(level + 1);
for (n = asnContents(asn, buf, 512); n > 0; n--) {
  printf(" %02X ", *(cp++));
}

Decompiled C Code:

v14 = &v15;
asxTab(a2 + 1);
for (v13 = asnContents(a1, &v15, 512LL); v13 > 0; --v13) {
  v9 = (unsigned char*)(v14++);
  printf(" %02X ", *v9);
}

Problem: Obfuscated Variable Names in Code

Minified JavaScript:

• Software is “natural” [Hindle et al., 2011].
• Use large corpora + machine learning to predict better identifier names.
• Corpora are easy to generate!
• Bavishi et al., Context2Name, 2017
• Vasilescu et al., JSNaughty, 2017
• Raychev et al., JSNice, 2015
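To illustrate why corpora are easy to generate for minified JavaScript, here is a minimal sketch that shortens the names in one snippet and pairs the result with the original. It is not the pipeline used by any of the tools listed above. The function `minify_names`, the explicit `local_names` argument, and the a, b, c naming scheme are all assumptions made for this example; a real minifier parses the code and handles scoping and name collisions.

```python
import re
import string


def minify_names(source, local_names):
    """Rename the given local names to a, b, c, ... (toy scheme, at most 26 names)."""
    if not local_names:
        return source, {}
    mapping = dict(zip(local_names, string.ascii_lowercase))
    # Match whole identifiers only, so "body" does not match inside "somebody".
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, local_names)) + r")\b")
    minified = pattern.sub(lambda m: mapping[m.group(0)], source)
    return minified, mapping


# Hypothetical one-line snippet in the spirit of the slide's example.
original = "function callback(error, response, body) { var info = JSON.parse(body); }"
minified, mapping = minify_names(original, ["error", "response", "body", "info"])

# One aligned training pair: obfuscated input, meaningful output.
training_pair = (minified, original)
print(training_pair)
```

Because only identifiers change, the two sides of the pair stay aligned token by token, which is what makes this kind of corpus cheap to produce.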

Decompiled C Code:

• Can we use similar strategies for decompiled code?

Statistical Machine Translation (SMT)

• Noisy channel model
• English → French:

Va faire de la recherche!
Go do some research!

$$\arg\max_e P(e \mid f) = \arg\max_e \frac{P(f \mid e)\,P(e)}{P(f)} = \arg\max_e P(f \mid e)\,P(e)$$

• $P(f \mid e)$: Translation Model. Probability that f is a translation of e.
• $P(e)$: Language Model. “Fluency” of e.

MOSES SMT:
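The decoding rule argmax over e of P(f | e) P(e) can be sketched with a toy example. This is not how Moses works internally. The two candidate sentences and every probability below are invented for illustration, and the names `translation_model`, `language_model`, and `decode` are hypothetical.

```python
import math

# All probabilities are invented for illustration only.
# translation_model[e][f] stands in for P(f | e); language_model[e] for P(e).
translation_model = {
    "Go do some research!": {"Va faire de la recherche!": 0.6},
    "Go make of the research!": {"Va faire de la recherche!": 0.7},
}
language_model = {
    "Go do some research!": 0.05,
    "Go make of the research!": 0.0001,
}


def decode(f, candidates):
    """Return the candidate e that maximizes P(f | e) * P(e), scored in log space."""
    def score(e):
        p_f_given_e = translation_model.get(e, {}).get(f, 0.0)
        p_e = language_model.get(e, 0.0)
        if p_f_given_e == 0.0 or p_e == 0.0:
            return float("-inf")
        return math.log(p_f_given_e) + math.log(p_e)

    return max(candidates, key=score)


best = decode("Va faire de la recherche!", list(language_model))
print(best)
```

The word-for-word candidate has the slightly higher translation probability here, but the language model penalizes it as not fluent, so the product favors the natural sentence.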

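SMT Model for Natural Language

• Aligned French/English corpus (trains the translation model)
• English corpus (trains the language model)

SMT Model for Minified JavaScript

• Aligned original/minified source corpus (trains the translation model)
• Original source corpus (trains the language model)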

Problem: Obfuscated Identifiers in Code

Decompiled C Code:

v14 = &v15;
asxTab(a2 + 1);
for (v13 = asnContents(a1, &v15, 512LL); v13 > 0; --v13) {
  v9 = (unsigned char*)(v14++);
  printf(" %02X ", *v9);
}

Can we use SMT for decompiled code?

SMT Model for Decompiled Code?

• Aligned original/decompiled source corpus (nontrivial)
• Original source corpus

Difficulty: Decompilation Changes Structure

• Different line count (9 lines vs. 8 lines).
• Different numbers of variables.
• Different types of loops.

Original Source:

#include <stdio.h>
int main() {
  int cur = 0;
  while (cur <= 9) {
    printf("%d\n", cur);
    ++cur;
  }
  return 0;
}

Decompiled Code:

#include <stdio.h>
int main() {
  int v1 = 0;
  int v2;
  for (v2 = 0; v2 < 10; ++v2)
    printf("%d\n", v2);
  return v1;
}

Decompiled Code Corpus Generation

Decompiled Code:

#include <stdio.h>
int main() {
  int v1 = 0;
  int v2;
  for (v2 = 0; v2 < 10; ++v2)
    printf("%d\n", v2);
  return v1;
}

Original Code:

#include <stdio.h>
int main() {
  int cur = 0;
  while (cur <= 9) {
    printf("%d\n", cur);
    ++cur;
  }
  return 0;
}

Decompiled Code with the matched variable marked:

#include <stdio.h>
int main() {
  int v1 = 0;
  int __;
  for (__ = 0; __ < 10; ++__)
    printf("%d\n", __);
  return v1;
}

Renamed Decompiled Code:

#include <stdio.h>
int main() {
  int v1 = 0;
  int cur;
  for (cur = 0; cur < 10; ++cur)
    printf("%d\n", cur);
  return v1;
}

Alignment target for the decompiled code: ❌ Original Code, ✓ Renamed Decompiled Code
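The renaming step can be sketched as follows, assuming the correspondence between v2 and cur has already been found. This is an illustration, not the authors' implementation. The regex tokenizer and the helper names `rename` and `aligned_pair` are hypothetical, and a real system would work on a parsed representation rather than raw text.

```python
import re

IDENT = re.compile(r"[A-Za-z_]\w*")
# Toy tokenizer: identifiers, integers, double-quoted strings, then any single symbol.
TOKEN = re.compile(r'[A-Za-z_]\w*|\d+|"[^"]*"|\S')


def rename(code, renaming):
    """Replace each identifier that has an entry in `renaming`; leave the rest alone."""
    return IDENT.sub(lambda m: renaming.get(m.group(0), m.group(0)), code)


def aligned_pair(decompiled, renaming):
    """Return token pairs (decompiled token, renamed token), aligned one to one."""
    renamed = rename(decompiled, renaming)
    src = TOKEN.findall(decompiled)
    tgt = TOKEN.findall(renamed)
    assert len(src) == len(tgt)  # renaming never changes the structure
    return list(zip(src, tgt))


decompiled = r'int v1 = 0; int v2; for (v2 = 0; v2 < 10; ++v2) printf("%d\n", v2); return v1;'
# Assumed result of the matching step: v2 corresponds to cur, v1 has no counterpart.
pairs = aligned_pair(decompiled, {"v2": "cur"})
changed = [p for p in pairs if p[0] != p[1]]
print(changed)
```

Because the renamed code keeps the decompiler's structure, every token lines up with its counterpart, which is not the case when pairing the decompiled code with the original source.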

Better SMT Model for Decompiled Code

• Aligned renamed/decompiled source corpus
• Renamed source corpus

Choosing Renamings

Original Code:

#include <stdio.h>
int main() {
  int cur = 0;
  while (cur <= 9) {
    printf("%d\n", cur);
    ++cur;
  }
  return 0;
}

Decompiled Code:

#include <stdio.h>
int main() {
  int v1 = 0;
  int v2;
  for (v2 = 0; v2 < 10; ++v2)
    printf("%d\n", v2);
  return v1;
}

cur (original) and v2 (decompiled) are used in the same way:

• Not used as the return value.
• Used inside of a loop.
• Used in a function call.
• Same operations.

Result: v2 is renamed to cur; v1 keeps its decompiler name.
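The matching criteria above can be sketched as a comparison of usage features. The feature names, the Jaccard similarity, the greedy assignment, and the 0.5 threshold are all assumptions made for this illustration; the slides do not specify how the comparison is computed, and the feature sets are written by hand here rather than extracted from code.

```python
# Hypothetical usage features for each variable in the running example.
original_vars = {
    "cur": {"in_loop", "call_argument", "compared", "incremented", "assigned"},
}
decompiled_vars = {
    "v1": {"returned", "assigned"},
    "v2": {"in_loop", "call_argument", "compared", "incremented", "assigned"},
}


def jaccard(a, b):
    """Similarity of two feature sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def choose_renamings(original_vars, decompiled_vars, threshold=0.5):
    """Greedily map each original name to its most similar unused decompiled name."""
    renamings = {}
    taken = set()
    for orig_name, orig_feats in original_vars.items():
        candidates = [
            (jaccard(orig_feats, feats), name)
            for name, feats in decompiled_vars.items()
            if name not in taken
        ]
        if not candidates:
            continue
        score, best = max(candidates)
        if score >= threshold:
            renamings[best] = orig_name
            taken.add(best)
    return renamings


renamings = choose_renamings(original_vars, decompiled_vars)
print(renamings)
```

In this example v1 shares only the assignment with cur, so it falls below the threshold and is left with its decompiler name.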

System Architecture

[Figure: system architecture diagram]

Results and Evaluation

Original:

my_rc base2_string(base2_handle base2_h, char* buffer, size_t buffer_size)

Decompiled:

my_rc base2_string(base2_handle a1, char* a2, size_t a3)

Renamed Decompiled:

my_rc base2_string(base2_handle base2_h, char* buf, size_t len)

• Exact: base2_h recovered as base2_h
• Approx: buffer recovered as buf
• Not a match: buffer_size recovered as len

• 12.7% Exact
• 16.2% Exact + Approx
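The three categories can be sketched as a small classifier. The slides do not define "Approx", so the prefix rule below is an assumption chosen only because it reproduces the three labels in the example. The function `classify` is hypothetical, and the three-name tally is not the evaluation data behind the percentages above.

```python
from collections import Counter


def classify(original, predicted):
    """Label a predicted name against the original name."""
    if predicted == original:
        return "exact"
    # Assumption: one name being a prefix of the other counts as approximate.
    if predicted and original and (
        original.startswith(predicted) or predicted.startswith(original)
    ):
        return "approx"
    return "not a match"


# (original name, name suggested for the decompiled code), taken from the example.
pairs = [("base2_h", "base2_h"), ("buffer", "buf"), ("buffer_size", "len")]
labels = [classify(original, predicted) for original, predicted in pairs]
counts = Counter(labels)
print(labels, counts)
```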

Preliminary Investigation: Human Study

• Presented users with short snippets (<50 lines) of decompiled code, asked to perform various maintenance tasks, graded and timed:

1 int x = 1;
2 int y = 0;
3 while (x <= 5) {
4   y += 2;
5   x += 1;
6 }
7 printf("%d", y);

- What is the value of the variable y on line 7?

• For correct answers, the time to answer using our renamings was statistically significantly lower than when using the decompiler names.

Conclusion

• Questions?
• Suggestions?
