Extracting Compiler Provenance From Program Binaries
Nathan Rosenblum
Computer Sciences Department
University of Wisconsin
[email protected]
pages.cs.wisc.edu/~nater/
Joint work with Barton Miller and Xiaojin Zhu
Thursday, October 7, 2010

Provenance in two parts
source → binary production toolchain → program binary
binary production toolchain:
• compiler family
• compiler version
• library/operating system versions
• optimization level
• link-time optimization
• binary rewriting [obfuscation]
program binary: GCC? ICC? (regions labeled ICC, GCC, GCC)

Provenance in two parts
source → binary production toolchain → program binary
stripped binary artifact → exhaustive disassembly → control flow graph

int bar(int foo) {
    int i, j;
    for (i = 0; i < foo; ++i) {
        i = j + i;
        j *= i;
    }
    return j;
}

GCC:
    test edi,edi
    jle 4004ae <bar+0x16>
    mov eax,0x0
    lea eax,[rdx+rax]
    imul edx,eax
    add eax,0x1
    cmp edi,eax
    jg 4004a1 <bar+0x9>
    mov eax,edx
    ret

ICC:
    xor edx,edx
    test edi,edi
    jle 400989 <bar+0x11>
    add edx,eax
    imul eax,edx
    inc edx
    cmp edx,edi
    jl 40097e <bar+0x6>
    ret

Binary code model
program binary: regions labeled GCC, GCC, ICC, ICC
program bytes: … c7 04 24 10 70 05 08 ff d0 c9 c3 90 81 ec e4 00 00 00 8b b4 24 ec 00 00 00 …
Non-code content among the program bytes:
• padding: 8d b4 26 00 00 00 00 8d bc 27 00 00 00 00 90
• addresses: 80 4c 90, 80 4c 94, 80 4c 98, 80 4c 9b
• data: match_init, zp_init_keys, seekable
sequence labels: … y_{i-1}, y_i, ..., y_j, y_{j+1} …
each label y ∈ {icc, gcc, ..., data}

Binary code features
‹mov [IMM], RAX ; sub [IMM], RAX›
‹push EBP ; * ; mov ESP, EBP›
• single-instruction wildcard (the * above)
• opcode class abstraction
• hidden immediates ([IMM] above)
• long-range control flow interaction (branch)
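
The idiom features on this slide can be illustrated with a small matcher. This is only a sketch under assumptions: the textual instruction format, the `normalize` and `find_idiom` helpers, and the example idiom are all hypothetical and are not taken from the talk. It shows two of the listed abstractions, hidden immediates and the single-instruction wildcard, applied to the GCC listing for `bar` (with the `<bar+…>` symbol annotations omitted).

```python
import re

# Assumption: numeric literals (hex such as 0x1c, or plain decimal) are immediates.
IMM = re.compile(r"\b0x[0-9a-f]+\b|\b\d+\b")

def normalize(insn):
    """Lower-case, collapse whitespace, and hide immediates behind the token IMM."""
    text = " ".join(insn.lower().split())
    return IMM.sub("IMM", text)

def find_idiom(idiom, insns):
    """Return every index where the idiom matches; '*' matches any one instruction."""
    normalized = [normalize(s) for s in insns]
    return [
        i
        for i in range(len(normalized) - len(idiom) + 1)
        if all(p == "*" or p == normalized[i + k] for k, p in enumerate(idiom))
    ]

# GCC output for bar() as shown on the earlier slide.
GCC_BAR = [
    "test edi,edi", "jle 4004ae", "mov eax,0x0", "lea eax,[rdx+rax]",
    "imul edx,eax", "add eax,0x1", "cmp edi,eax", "jg 4004a1",
    "mov eax,edx", "ret",
]

# Hypothetical idiom: an immediate load, any one instruction, then a multiply.
IDIOM = ("mov eax,IMM", "*", "imul edx,eax")
matches = find_idiom(IDIOM, GCC_BAR)
```

Opcode class abstraction and control flow interaction are left out here; they would need a real instruction decoder rather than string matching.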

Learning framework
Linear-chain CRF (Lafferty, et al. 2001)
feature functions: f_{I(u,c)}(x_i, y_i), f_{T(c,c')}(y_i), f_{CF}(y_i, y_j)
weights (learned) [approximate]: λ_{I(u,c)}, λ_{T(c,c')}, λ_{CF}
x: program location
y: assigned label
u: idiom
c: compiler
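
For orientation, a conditional random field combines weighted feature functions into a distribution over label sequences. The form below is the generic CRF template written with the slide's symbols; how the sums are grouped, and the assumption that the pair (i, j) ranges over locations linked by control flow, are illustrative guesses rather than the talk's exact model.

```latex
P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
\exp\Bigg(
  \sum_{i} \sum_{u,c} \lambda_{I(u,c)} \, f_{I(u,c)}(x_i, y_i)
  + \sum_{i} \sum_{c,c'} \lambda_{T(c,c')} \, f_{T(c,c')}(y_i)
  + \sum_{(i,j)} \lambda_{CF} \, f_{CF}(y_i, y_j)
\Bigg)
```

Here Z(x) is the normalizer that sums the exponentiated score over all label sequences. The long-range f_{CF} term breaks the strict chain structure, which may be what the slide's "[approximate]" note refers to.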

Single-compiler provenance
training binaries: 20 GCC, 20 ICC, 20 MSVC
learned weights: λ_{I(u,c)}, λ_{T(c,c')} (~80k parameters)
testing binaries: 579 174 386
learned weights + testing binaries → sequence labeling → 92.5% accuracy
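
The sequence labeling step can be sketched with Viterbi decoding over idiom and transition weights. This is a minimal illustration, not the talk's implementation: the `viterbi` function, the idiom names, and the toy weights are hypothetical, the transition weight is assumed to depend on the previous and current label, and the control-flow term λ_{CF} is ignored.

```python
def viterbi(idiom_hits, labels, w_idiom, w_trans):
    """Return the highest-scoring label sequence.

    idiom_hits[i] lists the idioms observed at program location i.
    w_idiom[(u, c)] plays the role of the idiom weight for idiom u and compiler c.
    w_trans[(c_prev, c)] plays the role of the transition weight.
    Missing weights default to 0.
    """
    if not idiom_hits:
        return []

    def emit(i, c):
        return sum(w_idiom.get((u, c), 0.0) for u in idiom_hits[i])

    best = {c: emit(0, c) for c in labels}
    back = []  # back[i-1][c] is the best predecessor of label c at location i
    for i in range(1, len(idiom_hits)):
        nxt, ptr = {}, {}
        for c in labels:
            prev = max(labels, key=lambda p: best[p] + w_trans.get((p, c), 0.0))
            nxt[c] = best[prev] + w_trans.get((prev, c), 0.0) + emit(i, c)
            ptr[c] = prev
        best = nxt
        back.append(ptr)

    path = [max(labels, key=lambda c: best[c])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy example with made-up idioms and weights.
LABELS = ["gcc", "icc"]
W_IDIOM = {("lea-imul", "gcc"): 2.0, ("xor-test", "icc"): 2.0, ("inc", "icc"): 1.0}
W_TRANS = {("gcc", "gcc"): 0.5, ("icc", "icc"): 0.5}
HITS = [["lea-imul"], ["lea-imul"], ["xor-test"], ["inc"]]
decoded = viterbi(HITS, LABELS, W_IDIOM, W_TRANS)
```

With these toy weights the first two locations decode as gcc and the last two as icc, since the same-label transition bonus is smaller than the idiom evidence at the switch point.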

Mixed-compiler provenance
weights: λ_{I(u,c)}, λ_{T(c,c')} (previously learned)
compilers: GCC 4.3, ICC 10.1
test set: 10 GNU coreutils programs × 10 random compilations (per-file random selection)
previously learned weights + test set → sequence labeling → 93.8% accuracy
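
The per-file random selection used to build mixed-compiler binaries might look like the following sketch. The compiler identifiers, the `build_test_set` helper, and the file names in the usage lines are hypothetical; the slide only states that each of 10 programs was compiled 10 times with a compiler chosen at random per file.

```python
import random

# Hypothetical identifiers standing in for GCC 4.3 and ICC 10.1.
COMPILERS = ("gcc-4.3", "icc-10.1")

def random_compilation(source_files, rng, compilers=COMPILERS):
    """Assign one randomly chosen compiler to each source file."""
    return {path: rng.choice(compilers) for path in source_files}

def build_test_set(programs, n_compilations=10, seed=0):
    """Produce n_compilations random per-file assignments for every program."""
    rng = random.Random(seed)  # seeded so the test set is reproducible
    return [
        (name, random_compilation(files, rng))
        for name, files in programs.items()
        for _ in range(n_compilations)
    ]

# Usage with made-up file lists.
programs = {"ls": ["ls.c", "util.c"], "cat": ["cat.c"]}
test_set = build_test_set(programs)
```

Each assignment doubles as ground truth: the compiler chosen for a file is the correct label for the code that file contributes to the linked binary.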

Stripped binary parsing
stripped binary → exhaustive disassembly → CFG
Function Entry Point models (Rosenblum, et al. 2008):
• λ_{I(u)} GCC
• λ_{I(u)} ICC
• λ_{I(u)} MSVC
Which model?

Integrating compiler provenance
stripped program binary: regions labeled GCC, ICC, GCC, ICC, ICC
train λ_{I(u)} × {GCC, ICC, MSVC} → λ_{I(u,c)}
f_{I(u,GCC)}(x_i, y_i), f_{I(u,ICC)}(x_i, y_i)
18% error reduction
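
One way to read this slide is that the recovered compiler label selects which idiom weights score a candidate function entry point. The sketch below illustrates that idea only; the thresholded score, the helper names, the idiom names, and the weights are all hypothetical and stand in for the actual entry point model.

```python
def fep_score(idioms_at, compiler, weights):
    """Sum the weights for the idioms seen at a candidate, for one compiler."""
    return sum(weights.get((u, compiler), 0.0) for u in idioms_at)

def pick_entry_points(candidates, provenance, weights, threshold=0.0):
    """Keep candidate offsets whose compiler-specific score exceeds the threshold.

    candidates maps offset -> idioms observed there.
    provenance maps offset -> compiler label from the provenance labeling.
    """
    return sorted(
        off
        for off, idioms in candidates.items()
        if fep_score(idioms, provenance[off], weights) > threshold
    )

# Toy data: the same idiom counts as evidence under gcc but not under icc.
WEIGHTS = {("push-mov", "gcc"): 1.5, ("push-mov", "icc"): -0.5, ("sub-esp", "icc"): 1.0}
CANDIDATES = {0x10: ["push-mov"], 0x40: ["push-mov"], 0x80: ["sub-esp"]}
PROVENANCE = {0x10: "gcc", 0x40: "icc", 0x80: "icc"}
entries = pick_entry_points(CANDIDATES, PROVENANCE, WEIGHTS)
```

The candidate at 0x40 is rejected because its region is labeled icc, where the same idiom carries a negative weight.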

Future directions
toolchain provenance
Compiler (marked done):
• family
• version
• optimization level
• source language
System:
• glibc static code
• library imports
Link & post-link:
• whole-program optimization
• rewriting tools
• obfuscation tools