Turn a x86 binary back into C source code
- 24 mins- Objective: turn a x86 binary executable back into C source code.
- Understand how the compiler turns C into assembly code.
- Low-level OS structures and executable file format.
Arithmetic Instructions
mov eax,2 ; eax = 2 
mov ebx,3 ; ebx = 3
add eax,ebx ; eax = eax + ebx 
sub ebx, 2 ; ebx = ebx - 2
Accessing Memory
mox eax, [1234] ; eax = *(int*)1234 
mov ebx, 1234 ; ebx = 1234 
mov eax, [ebx] ; eax = *ebx 
mov [ebx], eax ; *ebx = eax 
Conditional Branches
cmp eax, 2 ; compare eax with 2 
je label1 ; if(eax==2) goto label1
 ja label2 ; if(eax>2) goto label2
jb label3 ; if(eax<2) goto label3 
jbe label4 ; if(eax<=2) goto label4
 jne label5 ; if(eax!=2) goto label5
 jmp label6 ; unconditional goto label6
Function calls
First calling a function: call func ; store return address on the stack and jump to func The first operations is to save the return pointer:
pop esi ; save esi 
Right before leaving the function:
pop esi ; restore esi
ret ; read return address from the stack and jump to it 
Modern Compiler Architecture
C code –> Parsing –> Intermediate representation –> optimization –> Low-level intermediate representation –> register allocation –> x86 assembly
High-level Optimizations
Inlining
For example, the function c:
int foo(int a, int b){
     return a+b }
 c = foo(a, b+1) 
translates to
c = a+b+1
Loop unrolling
The loop:
for(i=0; i<2; i++){
      a[i]=0;
 }
 
becomes
   a[0]=0; 
   a[1]=0; 
Loop-invariant code motion
The loop:
for (i = 0; i < 2; i++) {
 a[i] = p + q; 
} 
becomes:
temp = p + q;
for (i = 0; i < 2; i++) {
    a[i] = temp;
}
 
Common subexpression elimination
The variable attributions:
- Objective: turn a x86 binary executable back into C source code.
- Understand how the compiler turns C into assembly code.
- Low-level OS structures and executable file format.
Arithmetic Instructions
mov eax,2 ; eax = 2 
mov ebx,3 ; ebx = 3
add eax,ebx ; eax = eax + ebx 
sub ebx, 2 ; ebx = ebx - 2
Accessing Memory
mox eax, [1234] ; eax = *(int*)1234 
mov ebx, 1234 ; ebx = 1234 
mov eax, [ebx] ; eax = *ebx 
mov [ebx], eax ; *ebx = eax 
Conditional Branches
cmp eax, 2 ; compare eax with 2 
je label1 ; if(eax==2) goto label1
 ja label2 ; if(eax>2) goto label2
jb label3 ; if(eax<2) goto label3 
jbe label4 ; if(eax<=2) goto label4
 jne label5 ; if(eax!=2) goto label5
 jmp label6 ; unconditional goto label6
Function calls
First calling a function: call func ; store return address on the stack and jump to func The first operations is to save the return pointer:
pop esi ; save esi 
Right before leaving the function:
pop esi ; restore esi
ret ; read return address from the stack and jump to it 
Modern Compiler Architecture
C code –> Parsing –> Intermediate representation –> optimization –> Low-level intermediate representation –> register allocation –> x86 assembly
High-level Optimizations
Inlining
For example, the function c:
int foo(int a, int b){
     return a+b }
 c = foo(a, b+1) 
translates to
c = a+b+1
Loop unrolling
The loop:
for(i=0; i<2; i++){
      a[i]=0;
 } 
becomes
   a[0]=0; 
   a[1]=0; 
Loop-invariant code motion
The loop:
for (i = 0; i < 2; i++) {
 a[i] = p + q; 
} 
becomes:
temp = p + q;
for (i = 0; i < 2; i++) {
    a[i] = temp;
}
 
Common subexpression elimination
The variable attributions:
a = b + (z + 1)
p = q + (z + 1)
becomes
temp = z + 1
a = b + z
p = q + z
Constant folding and propagation
The assignments:
a = 3 + 5
b = a + 1
func(b)
Becomes:
func(9)
Dead code elimination
Delete unnecessary code:
a = 1
if (a < 0) {
printf(“ERROR!”)
}
to
a = 1
Low-Level Optimizations
Strength reduction
Codes such as:
y = x * 2
y = x * 15
Becomes:
y = x + x
y = (x << 4) - x
Code block reordering
Codes such as :
if (a < 10) goto l1
printf(“ERROR”)
goto label2
l1:
    printf(“OK”)
l2:
    return;
Becomes:
if (a > 10) goto l1
printf(“OK”)
l2:
return
l1:
printf(“ERROR”)
goto l2
Register allocation
- Memory access is slower than registers.
- Try to fit as many as local variables as possible in registers.
- The mapping of local variables to stack location and registers is not constant.
Instruction scheduling
Assembly code like:
mov eax, [esi]
add eax, 1
mov ebx, [edi]
add ebx, 1
Becomes:
mov eax, [esi]
mov ebx, [edi]
add eax, 1
add ebx, 1
a = b + (z + 1)
p = q + (z + 1)
becomes
temp = z + 1
a = b + z
p = q + z
Constant folding and propagation
The assignments:
a = 3 + 5
b = a + 1
func(b)
Becomes:
func(9)
Dead code elimination
Delete unnecessary code:
a = 1
if (a < 0) {
printf(“ERROR!”)
}
to
a = 1
Low-Level Optimizations
Strength reduction
Codes such as:
y = x * 2
y = x * 15
Becomes:
y = x + x
y = (x << 4) - x
Code block reordering
Codes such as :
if (a < 10) goto l1
printf(“ERROR”)
goto label2
l1:
    printf(“OK”)
l2:
    return;
Becomes:
if (a > 10) goto l1
printf(“OK”)
l2:
return
l1:
printf(“ERROR”)
goto l2
Register allocation
- Memory access is slower than registers.
- Try to fit as many as local variables as possible in registers.
-  The mapping of local variables to stack location and registers is not constant. 
- Objective: turn a x86 binary executable back into C source code.
- Understand how the compiler turns C into assembly code.
- Low-level OS structures and executable file format.
Arithmetic Instructions
mov eax,2 ; eax = 2 
mov ebx,3 ; ebx = 3
add eax,ebx ; eax = eax + ebx 
sub ebx, 2 ; ebx = ebx - 2
Accessing Memory
mox eax, [1234] ; eax = *(int*)1234 
mov ebx, 1234 ; ebx = 1234 
mov eax, [ebx] ; eax = *ebx 
mov [ebx], eax ; *ebx = eax 
Conditional Branches
cmp eax, 2 ; compare eax with 2 
je label1 ; if(eax==2) goto label1
 ja label2 ; if(eax>2) goto label2
jb label3 ; if(eax<2) goto label3 
jbe label4 ; if(eax<=2) goto label4
 jne label5 ; if(eax!=2) goto label5
 jmp label6 ; unconditional goto label6
Function calls
First calling a function: call func ; store return address on the stack and jump to func The first operations is to save the return pointer:
pop esi ; save esi 
Right before leaving the function:
pop esi ; restore esi
ret ; read return address from the stack and jump to it 
Modern Compiler Architecture
C code –> Parsing –> Intermediate representation –> optimization –> Low-level intermediate representation –> register allocation –> x86 assembly
High-level Optimizations
Inlining
For example, the function c:
int foo(int a, int b){
     return a+b }
 c = foo(a, b+1) 
translates to
c = a+b+1
Loop unrolling
The loop:
for(i=0; i<2; i++){
      a[i]=0;
 } 
becomes
   a[0]=0; 
   a[1]=0; 
Loop-invariant code motion
The loop:
for (i = 0; i < 2; i++) {
 a[i] = p + q; 
} 
becomes:
temp = p + q;
for (i = 0; i < 2; i++) {
    a[i] = temp;
}
 
Common subexpression elimination
The variable attributions:
- Objective: turn a x86 binary executable back into C source code.
- Understand how the compiler turns C into assembly code.
- Low-level OS structures and executable file format.
Arithmetic Instructions
mov eax,2 ; eax = 2 
mov ebx,3 ; ebx = 3
add eax,ebx ; eax = eax + ebx 
sub ebx, 2 ; ebx = ebx - 2
Accessing Memory
mox eax, [1234] ; eax = *(int*)1234 
mov ebx, 1234 ; ebx = 1234 
mov eax, [ebx] ; eax = *ebx 
mov [ebx], eax ; *ebx = eax 
Conditional Branches
cmp eax, 2 ; compare eax with 2 
je label1 ; if(eax==2) goto label1
 ja label2 ; if(eax>2) goto label2
jb label3 ; if(eax<2) goto label3 
jbe label4 ; if(eax<=2) goto label4
 jne label5 ; if(eax!=2) goto label5
 jmp label6 ; unconditional goto label6
Function calls
First calling a function: call func ; store return address on the stack and jump to func The first operations is to save the return pointer:
pop esi ; save esi 
Right before leaving the function:
pop esi ; restore esi
ret ; read return address from the stack and jump to it 
Modern Compiler Architecture
C code –> Parsing –> Intermediate representation –> optimization –> Low-level intermediate representation –> register allocation –> x86 assembly
High-level Optimizations
Inlining
For example, the function c:
int foo(int a, int b){
     return a+b }
 c = foo(a, b+1) 
translates to
c = a+b+1
Loop unrolling
The loop:
for(i=0; i<2; i++){
      a[i]=0;
 } 
becomes
   a[0]=0; 
   a[1]=0; 
Loop-invariant code motion
The loop:
for (i = 0; i < 2; i++) {
 a[i] = p + q; 
} 
becomes:
temp = p + q;
for (i = 0; i < 2; i++) {
    a[i] = temp;
}
 
Common subexpression elimination
The variable attributions:
a = b + (z + 1)
p = q + (z + 1)
becomes
temp = z + 1
a = b + z
p = q + z
Constant folding and propagation
The assignments:
a = 3 + 5
b = a + 1
func(b)
Becomes:
func(9)
Dead code elimination
Delete unnecessary code:
a = 1
if (a < 0) {
printf(“ERROR!”)
}
to
a = 1
Low-Level Optimizations
Strength reduction
Codes such as:
y = x * 2
y = x * 15
Becomes:
y = x + x
y = (x << 4) - x
Code block reordering
Codes such as :
if (a < 10) goto l1
printf(“ERROR”)
goto label2
l1:
    printf(“OK”)
l2:
    return;
Becomes:
if (a > 10) goto l1
printf(“OK”)
l2:
return
l1:
printf(“ERROR”)
goto l2
Register allocation
- Memory access is slower than registers.
- Try to fit as many as local variables as possible in registers.
- The mapping of local variables to stack location and registers is not constant.
Instruction scheduling
Assembly code like:
mov eax, [esi]
add eax, 1
mov ebx, [edi]
add ebx, 1
Becomes:
mov eax, [esi]
mov ebx, [edi]
add eax, 1
add ebx, 1
a = b + (z + 1)
p = q + (z + 1)
becomes
temp = z + 1
a = b + z
p = q + z
Constant folding and propagation
The assignments:
a = 3 + 5
b = a + 1
func(b)
Becomes:
func(9)
Dead code elimination
Delete unnecessary code:
a = 1
if (a < 0) {
printf(“ERROR!”)
}
to
a = 1
Low-Level Optimizations
Strength reduction
Codes such as:
y = x * 2
y = x * 15
Becomes:
y = x + x
y = (x << 4) - x
Code block reordering
Codes such as :
if (a < 10) goto l1
printf(“ERROR”)
goto label2
l1:
    printf(“OK”)
l2:
    return;
Becomes:
if (a > 10) goto l1
printf(“OK”)
l2:
return
l1:
printf(“ERROR”)
goto l2
Register allocation
- Memory access is slower than registers.
- Try to fit as many as local variables as possible in registers.
- The mapping of local variables to stack location and registers is not constant.
Instruction scheduling
Assembly code like:
mov eax, [esi]
add eax, 1
mov ebx, [edi]
add ebx, 1
Becomes:
mov eax, [esi]
mov ebx, [edi]
add eax, 1
add ebx, 1
Instruction scheduling
Assembly code like:
mov eax, [esi]
add eax, 1
mov ebx, [edi]
add ebx, 1
Becomes:
mov eax, [esi]
mov ebx, [edi]
add eax, 1
add ebx, 1