Recovery of Variables and Heap Structure in x86 Executables
Gogul Balakrishnan, Thomas Reps
University of Wisconsin
Overview
• Introduction
• Challenges
• Background
• Recovering A-locs via Iteration
• An Abstraction for Heap-Allocated Storage
• Experiments
Introduction
• The Need for Analyzing Executables
– What You See Is Not What You eXecute
  e.g., memset(password, '\0', len); free(password);
  (a compiler may remove the memset as a dead store, so the executable never clears the password)
• Many Obstacles in Analyzing Executables
– Data objects are not easily identifiable.
– Absence of symbol table & debugging information
– Determining the memory addresses of data objects
– Difficult to track the flow of data through memory
– Challenging to get useful information about the heap
Challenges(1/3)
• Recovering Variable-like Entities
– The layout of memory is known at compile time or assembly time (IDAPro's approach)
– To recover y, the set of values that eax holds at instruction 5 needs to be determined.

void main() {
    int x, y;
    x = 1;
    y = 2;
    return;
}

proc main
1  mov ebp, esp
2  sub esp, 8
3  mov [ebp-8], 1
4  mov eax, ebp
5  mov [eax-4], 2
6  add esp, 8
7  retn
Challenges(2/3)
• Granularity of Recovered Variable-like Entities
– Affects the complexity and accuracy of subsequent analyses
• The Structure of Heap-Allocated Objects
– Only the size of the allocated block is known.
– Recovered using an abstraction-refinement algorithm
Challenges(3/3)
• Resolving Virtual-Function Calls
– With the one-variable-per-malloc-site abstraction, a definite link between the object and the virtual-function table is never established, because only weak updates are possible.
Background(1/6)
• Abstract Locations (A-locs)
– Memory-Region
  • A set of disjoint memory areas
  • Represents a group of locations that have similar runtime properties
– Abstract Locations
  • Locations between two addresses/offsets in a memory-region
  • Addresses and offsets are statically determined
Background(2/6)
• Abstract Locations (cont’d)

proc main
    mov ebp, esp
    sub esp, 40
    mov ecx, 0
    lea eax, [ebp-40]
L1: mov [eax], 1
    mov [eax+4], 2
    add eax, 8
    inc ecx
    cmp ecx, 5
    jl L1
    mov eax, [ebp-36]
    add esp, 40
    retn
Background(3/6)
• Value-Set Analysis (VSA)
– Combined numeric analysis & pointer analysis
– Computes an over-approximation of the values that each a-loc holds at each program point
– Value-Set
  • The set of addresses and numeric values
  • An N-tuple of strided intervals of the form s[l, u], where N is the number of memory-regions
  • Tuple components: (Global Region, Procedure Region, …)
  • (1[0, 9], ∅) versus (∅, 8[-40, -8])
    e.g., 8[-40, -8] = {-40, -32, -24, -16, -8}
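As a minimal sketch of the s[l, u] notation, a strided interval can be expanded into the set of values it denotes. The helper name `concretize` and the convention that a stride of 0 marks a singleton are assumptions made for this illustration, not part of the slides.

```python
def concretize(stride, lower, upper):
    """Return the values denoted by the strided interval stride[lower, upper].

    Assumption: a stride of 0 denotes a singleton (lower == upper).
    """
    if stride == 0:
        return [lower]
    return list(range(lower, upper + 1, stride))


print(concretize(8, -40, -8))  # [-40, -32, -24, -16, -8]
```

The printed list matches the example 8[-40, -8] given on the slide.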
Background(4/6)
• Value-Set Analysis (cont’d)
– The Value-Set of eax at L1
  • (∅, 8[-40, -8])
  • eax holds the offsets {-40, -32, -24, -16, -8}
  • These are the starting addresses of field x of the elements of p

proc main
    mov ebp, esp
    sub esp, 40
    mov ecx, 0
    lea eax, [ebp-40]
L1: mov [eax], 1
    mov [eax+4], 2
    add eax, 8
    inc ecx
    cmp ecx, 5
    jl L1
    mov eax, [ebp-36]
    add esp, 40
    retn

typedef struct {
    int x, y;
} Point;

int main() {
    int i;
    Point p[5];
    for (i = 0; i < 5; ++i) {
        p[i].x = 1;
        p[i].y = 2;
    }
    return p[0].y;
}
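To make the link between the value-set and the source-level array concrete, the following sketch recomputes the stack offsets of the fields of p. It assumes 4-byte ints (so sizeof(Point) is 8) and that p starts at offset -40 in main's activation record, as suggested by lea eax,[ebp-40]; the names `POINT_SIZE`, `P_BASE`, and `field_offsets` are hypothetical.

```python
POINT_SIZE = 8   # assumed: two 4-byte ints per Point
P_BASE = -40     # assumed: offset of p[0] relative to ebp (from lea eax,[ebp-40])


def field_offsets(field_offset, count=5):
    """Offsets of one field across all elements of p."""
    return [P_BASE + i * POINT_SIZE + field_offset for i in range(count)]


x_offsets = field_offsets(0)  # targets of mov [eax], 1
y_offsets = field_offsets(4)  # targets of mov [eax+4], 2

print(x_offsets)  # [-40, -32, -24, -16, -8]
print(y_offsets)  # [-36, -28, -20, -12, -4]
```

Under these assumptions the x offsets coincide with the value-set 8[-40, -8] of eax at L1, and the first y offset, -36, is the location read by mov eax,[ebp-36] for return p[0].y.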
Background(5/6)
• Aggregate Structure Identification (ASI)
– Can distinguish between accesses to different parts of the same aggregate
– An aggregate is broken up into smaller parts (atoms)
– Data-Access Constraint Language (DAC)
  • Specifies the data-access patterns in the program

    DataRef            Reference to a set of sequences of bytes
    UnifyConstraint    Flow of data in the program
Background(6/6)
• Aggregate Structure Identification (cont’d)
– Data-Access Constraint Language (DAC)
  • DataRef[l : u] refers to bytes l through u in DataRef
  • DataRef\n treats DataRef as an array, where n is the number of elements
    e.g., P[0:11]\3 = P[0:3], P[4:7], or P[8:11]
– ASI DAG
  Constraints for the example program:
  p[0:39]\5[0:3] ≈ const_1[0:3];
  p[0:39]\5[4:7] ≈ const_2[0:3];
  return_main[0:3] ≈ p[4:7]
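The array form of a DataRef can be illustrated by expanding it into explicit byte ranges. This is only a sketch of the notation: the function names `split_array` and `select` are hypothetical, and elements are assumed to be equal-sized and contiguous.

```python
def split_array(lower, upper, n):
    # Split bytes [lower:upper] into n equal elements (the array form of a DataRef).
    total = upper - lower + 1
    if total % n != 0:
        raise ValueError("byte range is not a multiple of the element count")
    size = total // n
    return [(lower + i * size, lower + (i + 1) * size - 1) for i in range(n)]


def select(elements, lo, hi):
    # Apply a trailing [lo:hi] to every element; offsets are relative to each element.
    return [(start + lo, start + hi) for start, _ in elements]


print(split_array(0, 11, 3))                 # [(0, 3), (4, 7), (8, 11)]
print(select(split_array(0, 39, 5), 0, 3))   # [(0, 3), (8, 11), (16, 19), (24, 27), (32, 35)]
```

The first call reproduces the P[0:11] example; the second lists the bytes named by the first constraint on p, i.e. field x of each of the five elements.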
Recovering A-locs via Iteration
• Problems of VSA
– Can only represent a contiguous sequence of memory locations
– Cannot detect internal substructure
• Basic Idea
1. VSA is used to obtain memory-access patterns in the executable;
2. ASI is used as a heuristic to determine a set of a-locs according to the memory-access patterns obtained from the information recovered by VSA.
[Figure: IDAPro → VSA ⇄ ASI → Final Value-Sets]
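The alternation between VSA and ASI can be sketched as a simple loop. Everything below is an assumption made for illustration: `run_vsa` and `run_asi` are hypothetical callables standing in for the two analyses, a-locs are modeled as (region, offset, size) tuples, and the loop stops when the a-loc set no longer changes or after `max_rounds` rounds. The slides do not state the termination criterion.

```python
def recover_alocs(initial_alocs, run_vsa, run_asi, max_rounds=10):
    """Alternate VSA and ASI until the set of a-locs stabilizes (assumed criterion)."""
    alocs = frozenset(initial_alocs)
    value_sets = run_vsa(alocs)
    for _ in range(max_rounds):
        new_alocs = frozenset(run_asi(alocs, value_sets))
        if new_alocs == alocs:
            break
        alocs = new_alocs
        value_sets = run_vsa(alocs)
    return alocs, value_sets


# Toy stand-ins, only to exercise the loop.
def toy_vsa(alocs):
    return {a: "top" for a in alocs}  # placeholder value-sets


def toy_asi(alocs, value_sets):
    # Pretend ASI splits every 8-byte a-loc into two 4-byte a-locs.
    out = set()
    for region, offset, size in alocs:
        if size == 8:
            out.add((region, offset, 4))
            out.add((region, offset + 4, 4))
        else:
            out.add((region, offset, size))
    return out


final_alocs, final_value_sets = recover_alocs({("AR_main", -8, 8)}, toy_vsa, toy_asi)
print(sorted(final_alocs))  # [('AR_main', -8, 4), ('AR_main', -4, 4)]
```

In this toy run the initial 8-byte a-loc is refined into two 4-byte a-locs in the first round, and the second round detects that nothing changes.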
Recovering A-locs via Iteration
• Generating Data-Access Constraints from Value-Sets
<Algorithm 1 SI2ASI>
Input : (r, s[l, u], length)
Output : (ASI Ref, Boolean)
if s[l, u] is a singleton then               // e.g., s[l, l]
    return <"r[l : l+length-1]", true>
else
    size ← max(s, length)
    n ← (u – l + size) / size                // the number of array elements
    ref ← "r[l : u+size-1]\n[0 : size-1]"    // r[l : u+size-1] is the actual byte range
    return <ref, (s = size)>
end if

e.g., (AR_main, 8[-40, -8], length) ⇒ AR_main[(-40):(-1)]\5[0:7], which covers
AR_main[-40:-33][0:7]
AR_main[-32:-25][0:7]
AR_main[-24:-17][0:7]
AR_main[-16:-9][0:7]
AR_main[-8:-1][0:7]
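A minimal Python sketch of Algorithm 1 follows. It assumes that a strided interval is passed as separate `stride`, `lower`, `upper` arguments and that the result is an ASI reference string plus the Boolean flag. The element count is computed here as (u - l + size) // size, i.e. the length of the byte range divided by the element size, which yields the five elements of the AR_main example; treat that as an assumption of the sketch. The access length of 4 in the usage line is also assumed, since the slide leaves it symbolic.

```python
def si2asi(region, stride, lower, upper, length):
    """Convert a strided interval in `region` into an ASI reference string.

    Returns (ref, exact), where exact is True when the stride equals the
    element size used in the reference.
    """
    if lower == upper:  # singleton: a single access of `length` bytes
        return f"{region}[{lower}:{lower + length - 1}]", True
    size = max(stride, length)
    n = (upper - lower + size) // size  # number of array elements
    ref = f"{region}[{lower}:{upper + size - 1}]\\{n}[0:{size - 1}]"
    return ref, stride == size


print(si2asi("AR_main", 8, -40, -8, 4)[0])  # AR_main[-40:-1]\5[0:7]
```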
Recovering A-locs via Iteration
• Generating Data-Access Constraints from Value-Sets
<Algorithm 2>
if s1[l1, u1] or s2[l2, u2] is a singleton then
    return SI2ASI(r, s1[l1, u1] ⊕ s2[l2, u2], length)
end if
// Determine the base register
if s1 ≥ (u2 – l2 + length) then
    baseSI ← s1[l1, u1]       // base address
    indexSI ← s2[l2, u2]      // index
else if s2 ≥ (u1 – l1 + length) then
    baseSI ← s2[l2, u2]       // base address
    indexSI ← s1[l1, u1]      // index
else
    return SI2ASI(r, s1[l1, u1] ⊕ s2[l2, u2], length)
end if
<baseRef, exactRef> ← SI2ASI(r, baseSI, stride(baseSI))
if exactRef is false then
    return SI2ASI(r, s1[l1, u1] ⊕ s2[l2, u2], length)
else
    // Row-major order
    return concat(baseRef, SI2ASI('', indexSI, length))
end if

e.g., eax : (1[0, 9], ∅), ecx : (∅, 16[-160, -16])
For [ecx+eax] ⇒ AR[-160:-1]\10[0:15][0:9]\10[0:0]
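Algorithm 2 can be sketched in Python on top of the same SI2ASI helper. Several choices here are assumptions: strided intervals are (stride, lower, upper) tuples, the function name `two_si_to_asi` is hypothetical, ⊕ is modeled as gcd of the strides with summed bounds (ignoring overflow), the element count in `si2asi` is (u - l + size) // size, and only the reference string is returned rather than a pair.

```python
from math import gcd


def si2asi(region, stride, lower, upper, length):
    """Algorithm 1 sketch: strided interval -> (ASI reference string, exact flag)."""
    if lower == upper:
        return f"{region}[{lower}:{lower + length - 1}]", True
    size = max(stride, length)
    n = (upper - lower + size) // size
    return f"{region}[{lower}:{upper + size - 1}]\\{n}[0:{size - 1}]", stride == size


def si_add(a, b):
    """Assumed model of the strided-interval sum: gcd of strides, summed bounds."""
    (s1, l1, u1), (s2, l2, u2) = a, b
    return gcd(s1, s2), l1 + l2, u1 + u2


def two_si_to_asi(region, si1, si2, length):
    """Algorithm 2 sketch for an access [reg1 + reg2] of `length` bytes."""
    s1, l1, u1 = si1
    s2, l2, u2 = si2

    def fallback():
        return si2asi(region, *si_add(si1, si2), length)[0]

    if l1 == u1 or l2 == u2:          # one operand is a singleton
        return fallback()
    if s1 >= u2 - l2 + length:        # determine the base register
        base, index = si1, si2
    elif s2 >= u1 - l1 + length:
        base, index = si2, si1
    else:
        return fallback()
    base_ref, exact = si2asi(region, *base, base[0])
    if not exact:
        return fallback()
    # Row-major order: array of rows, then the index within a row.
    return base_ref + si2asi("", *index, length)[0]


print(two_si_to_asi("AR", (1, 0, 9), (16, -160, -16), 1))
# AR[-160:-1]\10[0:15][0:9]\10[0:0]
```

With eax = 1[0, 9] and ecx = 16[-160, -16] and a 1-byte access, the sketch picks ecx as the base and eax as the index, giving the reference shown in the slide's example.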
Recovering A-locs via Iteration
• Interpreting Indirect Memory-References
– Lookup Algorithm
  • NodeDesc : <name, length>
      name : the name associated with the ASI tree node
      length : the length of that node
  • NodeDescList : An ordered list of NodeDesc, e.g., [nd1, nd2, …, ndn]
  • Three Operations

    Name                     Output
    GetChildren(aloc)        List of child nodes
    GetRange(start, end)     List of nodes with offsets in the given range [start, end]
    GetArrayElements(m)      List of nodes with m elements
Recovering A-locs via Iteration
• Lookup Algorithm Examples

e.g., Lookup p[0:39]\5[0:3]

GetChildren(p) = [<a3, 4>, <a4, 4>, <i2, 32>]
GetRange(0, 39) = [<a3, 4>, <a4, 4>, <i2, 32>]
GetArrayElements(5) = [<a3, 4>, <a4, 4>], [<a5, 4>, <a6, 4>]
GetRange(0, 3) = [<a3, 4>, <a5, 4>]
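The GetRange step can be sketched on its own. The version below assumes that a NodeDescList is a list of (name, length) pairs laid out contiguously from offset 0 and that a node is selected when it overlaps the requested range; `get_range` is a hypothetical name, and GetChildren and GetArrayElements are not modeled because they need the full ASI tree.

```python
def get_range(nodes, start, end):
    """Return the (name, length) nodes whose bytes overlap [start, end]."""
    result, offset = [], 0
    for name, length in nodes:
        first, last = offset, offset + length - 1
        if first <= end and last >= start:
            result.append((name, length))
        offset += length
    return result


children_of_p = [("a3", 4), ("a4", 4), ("i2", 32)]
print(get_range(children_of_p, 0, 39))           # [('a3', 4), ('a4', 4), ('i2', 32)]
print(get_range([("a3", 4), ("a4", 4)], 0, 3))   # [('a3', 4)]
print(get_range([("a5", 4), ("a6", 4)], 0, 3))   # [('a5', 4)]
```

Applying the last step to each of the two element lists returned by GetArrayElements(5) yields a3 and a5, matching the final line of the example.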
An Abstraction for Heap-Allocated Storage
• Previous Abstraction
– All of the nodes allocated at a given allocation site s are folded together into a single summary node n_s.
• Recency Abstraction
– Allows VSA & ASI to recover information about virtual-function tables
– Uses two memory-regions per allocation site s
  • MRAB[s] : Most Recently Allocated Block
  • NMRAB[s] : Non-Most Recently Allocated Blocks
  • count : how many concrete blocks the memory-region represents (MRAB[s].count, NMRAB[s].count)
    – Values are drawn from SmallRange = {[0, 0], [0, 1], [1, 1], [0, ∞], [1, ∞], [2, ∞]}
  • size : an over-approximation of the size of a block (MRAB[s].size, NMRAB[s].size)
An Abstraction for Heap-Allocated Storage
• Operation
– AbsEnv[s] : MRAB[s]/NMRAB[s] → <count, size, alocEnv>
– AlocEnv = a-loc → ValueSet
– Allocation site s transforms absEnv to absEnv’
  • absEnv’(MRAB[s]) = <[0,1], size, a-loc.Value-Set>
  • absEnv’(NMRAB[s]).count = absEnv(NMRAB[s]).count + absEnv(MRAB[s]).count
  • absEnv’(NMRAB[s]).size = absEnv(NMRAB[s]).size ∪ absEnv(MRAB[s]).size
  • absEnv’(NMRAB[s]).alocEnv = absEnv(NMRAB[s]).alocEnv ∪ absEnv(MRAB[s]).alocEnv
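The transformer for an allocation site can be sketched as follows. The representation is assumed for illustration: counts are (lower, upper) pairs from SmallRange, sizes and value-sets are Python frozensets, a missing a-loc is treated as the empty value-set, and the new block's a-loc environment is passed in by the caller. The rounding in `sr_add` (lower bound capped at 2, any upper bound above 1 becomes ∞) is one plausible way to stay inside SmallRange and is not taken from the slides.

```python
from math import inf


def sr_add(a, b):
    """Add two SmallRange counts, rounding the result back into SmallRange."""
    lower = min(a[0] + b[0], 2)
    upper = a[1] + b[1]
    if upper > 1:
        upper = inf
    return (lower, upper)


def join_env(e1, e2):
    """Pointwise union of two a-loc environments (missing a-loc = empty set)."""
    return {k: e1.get(k, frozenset()) | e2.get(k, frozenset())
            for k in e1.keys() | e2.keys()}


def allocate(abs_env, site, size, initial_env):
    """Fold MRAB[site] into NMRAB[site], then install a fresh MRAB[site]."""
    mrab = abs_env[("MRAB", site)]
    nmrab = abs_env[("NMRAB", site)]
    new_env = dict(abs_env)
    new_env[("NMRAB", site)] = {
        "count": sr_add(nmrab["count"], mrab["count"]),
        "size": nmrab["size"] | mrab["size"],
        "alocEnv": join_env(nmrab["alocEnv"], mrab["alocEnv"]),
    }
    new_env[("MRAB", site)] = {
        "count": (0, 1),
        "size": frozenset({size}),
        "alocEnv": dict(initial_env),
    }
    return new_env


empty = {"count": (0, 0), "size": frozenset(), "alocEnv": {}}
env = {("MRAB", "s"): dict(empty), ("NMRAB", "s"): dict(empty)}
for _ in range(3):
    env = allocate(env, "s", 16, {"vptr": frozenset()})
print(env[("NMRAB", "s")]["count"])  # (0, inf)
```

After three abstract allocations at site s, the MRAB still describes at most one block while the NMRAB count has been widened to [0, ∞], which is what allows strong updates on the most recently allocated block only.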
Experiments
• Environments
– Software

  OS         Compiler             Language    Target Files
  Windows    Visual Studio 6.0    C++         .obj
Experiments
• Results of Virtual-Function Call Resolution
Experiments
• Results of A-loc Identification
– Comparing the results of the algorithm with debugging information
– The structure of 87% of the local variables is correct
Experiments
• Results of A-loc Identification
The structure of 72% of the objects in the heap is correct
Q & A