Extracting Compiler Provenance From Program Binaries

Nathan Rosenblum
Computer Sciences Department
University of Wisconsin
[email protected]
http://pages.cs.wisc.edu/~nater/

Joint work with Barton Miller and Xiaojin Zhu

Thursday, October 7, 2010



Provenance in two parts

[Diagram: source, binary production toolchain, program binary]

binary production toolchain:
• compiler family
• compiler version
• library/operating system versions
• optimization level
• link-time optimization
• binary rewriting [obfuscation]

GCC? ICC?

program binary regions: ICC, GCC, GCC

Provenance in two parts

[Diagram: source, binary production toolchain, program binary]

stripped binary artifact:
• exhaustive disassembly
• control flow graph

Why compiler provenance?

IDA Pro

Why should this work?


    int bar(int foo) {
        int i, j;

        for (i = 0; i < foo; ++i) {
            i = j + i;
            j *= i;
        }
        return j;
    }

GCC:

    test edi,edi
    jle  4004ae <bar+0x16>
    mov  eax,0x0
    lea  eax,[rdx+rax]
    imul edx,eax
    add  eax,0x1
    cmp  edi,eax
    jg   4004a1 <bar+0x9>
    mov  eax,edx
    ret

ICC:

    xor  edx,edx
    test edi,edi
    jle  400989 <bar+0x11>
    add  edx,eax
    imul eax,edx
    inc  edx
    cmp  edx,edi
    jl   40097e <bar+0x6>
    ret

Binary code model

program binary regions: GCC, GCC, ICC, ICC

program bytes:
… c7 04 24 10 70 05 08 ff d0 c9 c3 90 81 ec e4 00 00 00 8b b4 24 ec 00 00 00 …

• padding: 8d b4 26 00 00 00 00 8d bc 27 00 00 00 00 90
• addresses: 80 4c 90, 80 4c 94, 80 4c 98, 80 4c 9b
• data: match_init, zp_init_keys, seekable

sequence labels: ... y_{i-1}, y_i, ..., y_j, y_{j+1} ... ∈ {icc, gcc, ..., data}
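To make the labeling model concrete, here is a minimal sketch that collapses a per-location label sequence into contiguous regions of the binary. The function name `label_regions` and the toy label list are assumptions for illustration only; the talk does not show how regions are represented.

```python
from itertools import groupby


def label_regions(labels):
    """Collapse per-location labels into (start, end, label) runs; end is exclusive."""
    regions = []
    offset = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)  # number of consecutive locations with this label
        regions.append((offset, offset + length, label))
        offset += length
    return regions


# Toy sequence: four GCC locations, two data locations, three ICC locations.
labels = ["gcc"] * 4 + ["data"] * 2 + ["icc"] * 3
print(label_regions(labels))
```

Each tuple marks one region of the binary attributed to a single compiler, or to data.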

Binary code features

‹mov [IMM], RAX ; sub [IMM], RAX›

‹push EBP ; * ; mov ESP, EBP›

• single-instruction wildcard
• opcode class abstraction
• hidden immediates

... branch: long-range control flow interaction
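A rough sketch of how idiom features like these could be matched against a disassembled instruction stream. Everything here is an assumption for illustration: instructions are plain strings, `normalize` hides only numeric immediates behind `IMM`, and `*` stands for exactly one arbitrary instruction. The actual feature extraction used in the talk is not shown on the slides.

```python
import re

# Hypothetical normalization: replace decimal or 0x-prefixed immediates with IMM.
_IMM = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+)\b")


def normalize(insn):
    """Canonicalize spacing and hide immediates in one instruction string."""
    insn = re.sub(r"\s*,\s*", ",", insn.strip())
    insn = re.sub(r"\s+", " ", insn)
    return _IMM.sub("IMM", insn)


def matches_at(idiom, insns, i):
    """True if the idiom matches the instructions starting at index i."""
    if i + len(idiom) > len(insns):
        return False
    return all(p == "*" or p == normalize(insn)
               for p, insn in zip(idiom, insns[i:]))


def find_idiom(idiom, insns):
    """Return every start index at which the idiom matches."""
    return [i for i in range(len(insns) - len(idiom) + 1)
            if matches_at(idiom, insns, i)]


insns = ["push ebp", "xor eax,eax", "mov esp,ebp", "sub esp,0x18"]
print(find_idiom(["push ebp", "*", "mov esp,ebp"], insns))
print(find_idiom(["sub esp,IMM"], insns))
```

The wildcard lets one idiom cover an instruction that the compiler schedules between two characteristic ones, and hiding immediates keeps program-specific constants out of the feature.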

Learning framework

Linear-chain CRF (Lafferty, et al. 2001)

feature functions:  f_{I(u,c)}(x_i, y_i)   f_{T(c,c')}(y_i)   f_{CF}(y_i, y_j)
weights (learned):  λ_{I(u,c)}             λ_{T(c,c')}        λ_{CF}

[approximate]

x  program location
y  assigned label
u  idiom
c  compiler
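As an illustration of how learned weights turn into labels, the sketch below runs Viterbi decoding over a chain using only idiom weights and transition weights. It is a simplified stand-in, not the authors' implementation: the control-flow feature f_CF is left out, the transition weight is applied to adjacent label pairs as in a standard linear-chain CRF, and all weights and idioms in the example are made-up values.

```python
def viterbi(idioms_at, labels, w_idiom, w_trans):
    """Return the highest-scoring label sequence.

    idioms_at: list of sets, the idioms observed at each program location
    w_idiom:   {(idiom, label): weight}, standing in for lambda_I(u,c)
    w_trans:   {(prev_label, label): weight}, standing in for lambda_T(c,c')
    """
    def emit(i, c):
        return sum(w_idiom.get((u, c), 0.0) for u in idioms_at[i])

    if not idioms_at:
        return []
    best = {c: emit(0, c) for c in labels}
    back = []
    for i in range(1, len(idioms_at)):
        nxt, ptr = {}, {}
        for c in labels:
            prev = max(labels, key=lambda p: best[p] + w_trans.get((p, c), 0.0))
            nxt[c] = best[prev] + w_trans.get((prev, c), 0.0) + emit(i, c)
            ptr[c] = prev
        best = nxt
        back.append(ptr)
    # Follow back-pointers from the best final label.
    path = [max(labels, key=lambda c: best[c])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Made-up weights, for illustration only.
w_idiom = {("mov eax,IMM", "gcc"): 2.0, ("add eax,IMM", "gcc"): 1.0,
           ("xor edx,edx", "icc"): 2.0, ("inc edx", "icc"): 1.0}
w_trans = {("gcc", "gcc"): 1.0, ("icc", "icc"): 1.0}
idioms_at = [{"mov eax,IMM"}, {"add eax,IMM"}, {"xor edx,edx"}, {"inc edx"}]
print(viterbi(idioms_at, ["gcc", "icc"], w_idiom, w_trans))
```

The transition weights reward staying with the same compiler from one location to the next, which is what lets the model recover contiguous regions rather than isolated per-location guesses.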

Single-compiler provenance

                     GCC   ICC   MSVC
training binaries     20    20     20
testing binaries     579   174    386

learned weights: λ_{I(u,c)}, λ_{T(c,c')} (~80k parameters)

learned weights + testing binaries, sequence labeling: 92.5% accuracy

Mixed-compiler provenance

weights: λ_{I(u,c)}, λ_{T(c,c')} (previously learned)

compilers: GCC 4.3, ICC 10.1

test binaries: 10 GNU coreutils programs × 10 random compilations (per-file random selection)

previously learned weights + test binaries, sequence labeling: 93.8% accuracy
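One way to read "per-file random selection" is that each source file of a program is independently assigned one of the two compilers, and this is repeated to get several differently mixed builds. The sketch below generates such assignments. The file names, compiler strings, and function names are hypothetical, and no actual build is run.

```python
import random


def assign_compilers(source_files, compilers, rng):
    """Pick one compiler per source file, independently and uniformly."""
    return {path: rng.choice(compilers) for path in source_files}


def random_compilations(source_files, compilers, n, seed=0):
    """Generate n independent per-file compiler assignments."""
    rng = random.Random(seed)  # seeded so the plan is reproducible
    return [assign_compilers(source_files, compilers, rng) for _ in range(n)]


# Hypothetical file list; a real program would have many more files.
files = ["ls.c", "quote.c", "hash.c"]
plans = random_compilations(files, ["gcc-4.3", "icc-10.1"], n=10)
for plan in plans[:2]:
    print(plan)
```

Each plan records the ground-truth compiler for every object file, which is what the labeling accuracy would be measured against.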

Stripped binary parsing

[Diagram: stripped binary, exhaustive disassembly, CFG]

Function Entry Point (Rosenblum, et al. 2008)

Which model?
• λ_I(u) GCC
• λ_I(u) ICC
• λ_I(u) MSVC

Integrating compiler provenance

train: λ_I(u) × {GCC, ICC, MSVC} → λ_{I(u,c)}

stripped program binary regions: GCC, ICC, GCC, ICC, ICC

features: f_{I(u,GCC)}(x_i, y_i), f_{I(u,ICC)}(x_i, y_i)

18% error reduction
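The step from λ_I(u) to λ_I(u,c) can be read as crossing each idiom with each compiler, so that an idiom feature is active only where the inferred provenance matches. The sketch below shows that reading. It is an interpretation of the slide, not the authors' code, and the names `expand_features` and `feature_value` are assumptions.

```python
from itertools import product


def expand_features(idioms, compilers):
    """Cross every idiom u with every compiler c, giving (u, c) features."""
    return list(product(idioms, compilers))


def feature_value(feature, idioms_at_x, compiler_at_x):
    """1 if idiom u occurs at this location and the inferred compiler is c, else 0."""
    u, c = feature
    return 1 if (u in idioms_at_x and compiler_at_x == c) else 0


features = expand_features(["push ebp"], ["GCC", "ICC", "MSVC"])
print(features)
print([feature_value(f, {"push ebp"}, "GCC") for f in features])
```

With this expansion a single model can hold separate weights for the same idiom under each compiler, rather than requiring a choice among three separately trained models.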

Future directions

toolchain provenance

Compiler [done]
• family
• version
• optimization level
• source language

System
• glibc static code
• library imports

Link & post-link
• whole-program optimization
• rewriting tools
• obfuscation tools

Questions