Abstract:
One assembly language instruction can be encoded differently in machine code. Possible applications are steganography and compiler identification.

Created by Peter Kankowski
Last changed
Filed under Assembly language and machine code

Share on social sitesReddit Digg Delicious Buzz Facebook Twitter

Redundancy of x86 Machine Code

There are many ways to encode the same assembly language program in machine code. The encoding of ModR/M and SIB bytes is redundant; the SHL instruction has two equivalent opcodes; the operands can be swapped for some instructions. You can use this redundancy to identify the compiler, or to hide some information inside your executable files.

Some examples are listed below.

Two encodings for register-register operations

A typical instruction of x86 architecture has two opcodes. The first of them has a register as the first operand and a register or a memory location as the second one (that's abbreviated "reg, reg/mem32" in the opcode reference or "Gv, Ev" in the opcode table). The operands for the second opcode are reversed (that's abbreviated "reg/mem32, reg" or "Ev, Gv"). This makes sense: the processor must know if it copies to the memory, or from the memory. But when both operands are registers, the encoding becomes redundant:

                  ; mod reg r/m
03C3 add eax, ebx ;  11 000 011
01D8 add eax, ebx ;  11 011 000

The opcode 03 (ADD reg, r/m) instructs the processor to take the first operand from the reg field, and the second from the r/m field. 000 is the encoding for eax register, and 011 means ebx register. So 03C3 can be disassembled as "add eax, ebx".

The opcode 01 means "ADD r/m, reg", so the first operand will be from r/m (000, which means eax), and second will be from reg (011, which means ebx). This again gives us "add eax, ebx".

The encodings for these instructions are redundant: add, adc, and, xor, or, sbb, sub, cmp, and mov. Some assemblers emit 03C3 for "add eax, ebx", and some emit 01D8, so this technique can be used to identify the compiler that produced the executable file.

You can also swap the base and the index register if the index is not scaled (if the scale factor is 1):

                     ;        base  index
C60418 05            mov byte[eax + ebx*1],5
C60403 05            mov byte[ebx + eax*1],5

Two distinct opcode extensions for SAL/SHL

The SAL/SHL instruction can be encoded with two opcode extensions: /4 and /6 (100 and 110 in binary). Intel's manual documents only /4; AMD's manual mentions both of them (however, /6 works okay on Pentiums). This again can be used to distinguish between compilers.

The TEST instruction with an immediate operand also has 2 opcode extensions (/0 and /1):

F7C3 05000000        test ebx, 5
F7CB 05000000        test ebx, 5

Again, the alternative encoding is documented only in AMD's manual, but also works on Intel processors. Many disassemblers and debuggers (including OllyDbg) cannot recognize the second instruction, so you can use it in anti-debugging code.

Alternative opcode for instructions with an immediate byte operand

Some instructions (namely, add, or, adc, sbb, and, sub, xor, and cmp) have two opcodes when used with a immediate byte operand. Here is an example:

8000 00              add byte[eax],0
8200 00              add byte[eax],0

The second opcode is invalid in 64-bit mode, so the trick can be used only in 32-bit mode.

Size-changing tricks

The techniques listed above don't change the size of instruction. If the alternative encoding can have a different size, you have a wider choice of tricks:

Special encoding for EAX and byte immediates

If the immediate operand of some instructions is in range -80..7F, it can be encoded in 1 byte instead of 4 (compilers use the opcode 83 for that). But the same instruction can be encoded in larger number of bytes with opcode 81. Also, there is a special encoding for instructions with EAX register as the first operand:

83C0 01              add eax,1
81C0 01000000        add eax,1
05 01000000          add eax,1

Using SIB byte when it's not needed

The SIB byte specifies Scale, Index and Base values, but it also can be used to address an absolute address in memory. Here are five (valid and documented) encodings for one instruction:

C605 00104000 05     mov byte[401000],5
C60425 00104000 05   mov byte[401000],5
C60465 00104000 05   mov byte[401000],5
C604A5 00104000 05   mov byte[401000],5
C604E5 00104000 05   mov byte[401000],5

The same trick works with ebp-relative addressing:

C645 04 05           mov byte[ebp+4],5
C64465 04 05         mov byte[ebp+4],5
C64425 04 05         mov byte[ebp+4],5
C644A5 04 05         mov byte[ebp+4],5
C644E5 04 05         mov byte[ebp+4],5

Using zero offset

013E                 add dword[esi],edi
017E 00              add dword[esi+00],edi
01BE 00000000        add dword[esi+00000000],edi

This trick is used by MSVC++ compiler to emit the NOP instructions of different length (for padding before jump targets). For example, MSVC++ generates the following code if it needs 4-byte and 6-byte padding:

8d6424 00            lea [ebx+00],ebx       ; 4-byte padding
8d9b 00000000        lea [esp+00000000],esp ; 6-byte padding

The first line is marked as "npad 4" in assembly listings generated by the compiler, and the second is "npad 6". The registers (ebx, esp) can be chosen from the rarely used ones to avoid false dependencies in the code.

Usage

There is a steganographical tool, Hydan, that changes x86 instructions to their equivalents and hides your message inside executable files. Hydan uses the tricks described above (not all of them: it doesn't "know" about SAL/SHL thing). It also can reorder the independent instructions and use more high-level tricks, for example, it changes "XOR eax, eax" to "SUB eax, eax" or vice versa. Hydan is an open-source program; you should have GCC installed to compile it from sources. It works with Portable Executable (.exe) and ELF formats.

The x86 architecture has a long history, and the weird encodings are often maintained for compatibility with old software. The complicated addressing modes are quite redundant themselves. So, you have a good chance to hide something "between the lines" of your code.

Peter Kankowski
Peter Kankowski

About the author

Peter is the developer of Aba Search and Replace, a tool for replacing text in multiple files. He likes to program in C with a bit of C++, also in x86 assembly language, Python, and PHP.

Created by Peter Kankowski
Last changed

11 comments

John Li,
Hi,

I cannot open page "can be used to identify the compiler". Can you help on this?

Thanks,
John
Peter Kankowski,
Try again, it may be some temporate problems with FASM forum server.
bitRAKE,
I have often thought about writing a re-code generator which can use different models for code generation. For example, use specific encodings to make code more compressible; use only ASCII instructions (has been done before); encodings that produce data for another type of file (JPG, etc.), or creates a picture.
Peter Kankowski,
It's a nice idea, but will not be easy to implement. Do you know some code generator which produces only ASCII instructions? Would you please provide an URL to it?
ac,
At first I didn't understand why would anybody need "only ASCII instructions" and only now it occurred to me: that guy wants to make an xploit, that is, to put his code in a textbox or something similar and then somehow make it executing. Otherwise any encoder would do. Not my cup of tea.
arkon,

Another cool trick first I saw it at Ken Silverman's page IIRC. sub eax, -128 to add 0x80, using byte imm. add eax, -128 to sub 0x80, using byte imm.

Enjoy

Michael Rolle,

Swapping the base and index registers CAN produce different results in rare cases. The place where this is so is something like [esp + eax] and [eax + esp], because a different default segment is used. 64 bit mode doesn't have this problem because SS and DS are always equivalent. But in 16 or 32 bit mode, if your DS and SS are different segments, then you have trouble.

Peter Kankowski,

Formally, you are right, but Win32 code runs in flat memory model, where DS and SS are the same, so it's usually not a problem.

David Bakin,

The Alsys Ada compiler for x86/DOS used these techniques (esp. the multiple encodings for reg/reg moves) to encode:

a) in each subroutine prologue, whether or not there were exception handlers defined in that procedure (the handlers themselves were described in a table)

b) before each subroutine call, the number of items pushed on the x87 stack at the point of call (used when unwinding the stack for exceptions)

c) in the compilation unit, starting from the beginning, a "signature" that proved the code was generated by the Alsys compiler (for copyright protection)

Most of the time there were existing reg/reg instructions that could be used, but if necessary, a nop reg/reg instruction (e.g., mov ax,ax, which also, of course, has two encodings) would be inserted.

Peter Kankowski,

Thank you, it's an interesting example.

b4 4c cd 21,

The x86 instruction set may seem weird, but if you look at the encodings the redundancy makes sense :)

80 xx ib : alu_op eb, ib

81 xx iw : alu_op ew, iw

82 xx ib : alu_op eb, ib

83 xx ib : alu_op ew, ib (sign extended)

so bit 0 encodes whether the destination is a byte or (d)word, and bit 1 means the immediate operand is a signed byte, which of course makes no difference if both operands are bytes.

Your name:


Comment: