Metasm, the Ruby assembly manipulation suite
============================================

* sample scripts in samples/ -- read comments at the beginning of the files
* all files are licensed under the terms of the LGPL

Author: Yoann Guillot <john at ofjj.net>


Basic overview:

Metasm allows you to interact with executables formats (ExeFormat):
PE, ELF, Mach-O, Shellcode, etc.
There are three approaches to an ExeFormat:
 - compiling one up, from scratch
 - decompiling an existing format
 - manipulating the file structure


Ready-to-use scripts can be found in the samples/ subdirectory, check the
comments in the scripts headers. You can also try the --help argument if
you're feeling lucky.

For more information, check the doc/ subdirectory. The text files can be
compiled to html using the misc/txt2html.rb script.


Here is a short overview of the Metasm internals.


Assembly:

When compiling, you start from a source text (ruby String, consisting
mostly in a sequence of instructions/data/padding directive), which is parsed.

The string is handed to a Preprocessor instance (which handles #if, #ifdef,
#include, #define, /* */ etc, should be 100% compatible with gcc -E), which is
encapsulated in an AsmPreprocessor for assembler sources (to handles asm macro
definitions, 'equ' and asm ';' comments).
The interface to do that is ExeFormat#parse(text[, filename, lineno]) or
ExeFormat.assemble (which calls .new, #parse and #assemble).

The (Asm)Preprocessor returns tokens to the ExeFormat, which parses them as Data,
Padding, Labels or parser directives. Parser directives always start with a dot.
They can be generic (.pad, .offset...) or ExeFormat-specific (.section,
.import, .entrypoint...). They are handled by #parse_parser_instruction().
If the ExeFormat does not recognize a word, it is handed to its CPU instance,
which is responsible for parsing Instructions (or raise an exception).
All those tokens are stored in one or more arrays in the @source attribute of
the ExeFormat (Shellcode's @source is an Array, for PE/ELF it is a hash
[section name] => [Array of parsed data])
Every immediate value can be an arbitrary Expression (see later).

You can then assemble the source to binary sections using ExeFormat#assemble.

Once the section binaries are available, the whole binary executable can be
written to disk using ExeFormat#encode_file(filename[, format]).

PE and ELF include an autoimport feature that allows automatic creation of
import-related data for known OS-specific functions (e.g. unresolved calls to
'strcpy' will generate data so that the binary is linked against the libc
library at runtime).

The samples/{exe,pe,elf}encode.rb can take an asm source file as argument
and compile it to a working executable.

The CPU classes are responsible for parsing and encoding individual
instructions. The current Ia32 parser uses the Intel syntax (e.g. mov eax, 42).
The generic parser recognizes labels as a string at the beginning of a line
followed by a colon (e.g. 'some_label:'). GCC-style local labels may be used
(e.g. '1:', refered to using '1b' (backward) or '1f' (forward) ; may be
redefined as many times as needed.)
Data are specified using 'db'-style notation (e.g. 'dd 42h', 'db "blabla", 0')
See samples/asmsyntax.rb


EncodedData:

In Metasm all binary data is stored as an EncodedData.
EncodedData has 3 main attributes:
 - #data which holds the raw binary data (generally a ruby String, but see
VirtualString)
 - #export which is a hash associating an export name (label name) to an offset
within #data
 - #reloc which is a hash whose keys are offsets within #data, and whose values
are Relocation objects.
A Relocation object has an endianness (:little/:big), a type (:u32 for unsigned
32bits) and a target (the intended value stored here).
The target is an arbitrary arithmetic/logic Expression.

EncodedData also has a #virtsize (for e.g. .bss sections), and a #ptr (internal
offset used when decoding things)

You can fixup an EncodedData, with a Hash variable name => value (value should
be an Expression or a numeric value). When you do that, each relocation's target
is bound using the binding, and if the result is calculable (no external variable
name used in the Expression), the result is encoded using the relocation's
size/sign/endianness information. If it overflows (try to store 128 in an 8bit
signed relocation), an EncodeError exception is raised. Use the :a32 type to
allow silent overflow truncating.
If the relocation's target is not numeric, the target is unchanged if you use 
EncodedData#fixup, or it is replaced with the bound target with #fixup! .


Disassembly:

This code is found in the metasm/decode.rb source file, which defines the
Disassembler class.

The disassembler needs a decoded ExeFormat (to be able to say what data is at
which virtual address) and an entrypoint (a virtual address or export name).
It can then start to disassemble instructions. When it encounters an
Opcode marked as :setip, it asks the CPU for the jump destination (an
Expression that may involve register values, for e.g. jmp eax), and backtraces
instructions until it finds the numeric value.

On decoding, the Disassembler maintains a #decoded hash associating addresses
(expressions/integer #normalize()d) to DecodedInstructions.

The disassembly generates an InstructionBlock graph. Each block holds a list of
DecodedInstruction, and pointers to the next/previous block (by address).

The disassembler also traces data accesses by instructions, and stores Xrefs
for them.
The backtrace parameters can be tweaked, and the maximum depth to consider
can be specifically changed for :r/:w backtraces (instruction memory xrefs)
using #backtrace_maxblocks_data.
When an Expression is backtracked, each walked block is marked so that loops
are detected, and so that if a new code path is found to an existing block,
backtraces can be resumed using this new path.

The disassembler makes very few assumptions, and in particular does not
suppose that functions will return ; they will only if the backtrace of the
'ret' instructions is conclusive. This is quite powerful, but also implies
that any error in the backtracking process can lead to a full stop ; and also
means that the disassembler is quite slow.

The special method #disassemble_fast can be used to work around this when the
code is known to be well-formed (ie it assumes that all calls returns)

When a subfunction is found, a special DecodedFunction is created, which holds
a summary of the function's effects (like a DecodedInstruction on steroids).
This allows the backtracker to 'step over' subfunctions, which greatly improves
speed. The DecodedFunctions may be callback-based, to allow a very dynamic
behaviour.
External function calls create dedicated DecodedFunctions, which holds some
API information (e.g. stack fixup information, basic parameter accesses...)
This information may be derived from a C header parsed beforehand.
If no C function prototype is available, a special 'default' entry is used,
which assumes that the function has a standard ABI.

Ia32 implements a specific :default entry, which handles automatic stack fixup
resolution, by assuming that the last 'call' instruction returns. This may lead
to unexpected results ; for maximum accuracy a C header holding information for
all external functions is recommanded (see samples/factorize-headers-peimports
for a script to generate such a header from a full Visual Studio installation
and the target binary).

Ia32 also implements a specific GetProcAddress/dlsym callback, that will
yield the correct return value if the parameters can be backtraced.

The scripts implementing a full disassembler are samples/disassemble{-gui}.rb
See the comments for the GUI key bindings.


ExeFormat manipulation:

You can encode/decode an ExeFormat (ie decode sections, imports, headers etc)

Constructor: ExeFormat.decode_file(str), ExeFormat.decode_file_header(str)
Methods: ExeFormat#encode_file(filename), ExeFormat#encode_string

PE and ELF files have a LoadedPE/LoadedELF counterpart, that are able to work
with memory-mmaped versions of those formats (e.g. to debug running
processes)


VirtualString:

A VirtualString is a String-like object: you can read and may rewrite slices of
it. It can be used as EncodedData#data, and thus allows virtualization
of most Metasm algorithms.
You cannot change a VirtualString length.
Taking a slice of a VirtualString will return either a String (for small sizes)
or another VirtualString (a 'window' into the other). You can force getting a
small VirtualString using the #dup(offset, length) method.
Any unimplemented method called on it is forwarded to a frozen String which is
a full copy of the VirtualString (should be avoided if possible, the underlying
string may be very big & slow to access).

There are currently 3 VirtualStrings implemented:
- VirtualFile, whichs loads a file by page-sized chunks on demand,
- WindowsRemoteString, which maps another process' virtual memory (uses the
windows debug api through WinDbgAPI)
- LinuxRemoteString, which maps another process' virtual memory (need ptrace
rights, memory reading is done using /proc/pid/mem)

The Win/Lin version are quite powerful, and allow things like live process
disassembly/patching easily (using LoadedPE/LoadedELF as ExeFormat)


Debugging:

Metasm includes a few interfaces to handle debugging.
The WinOS and LinOS classes offer access to the underlying OS processes (e.g.
OS.current.find_process('foobar') will retrieve a running process with foobar
in its filename ; then process.mem can be used to access its memory.)

The Windows and Linux low-level debugging APIs have a basic ruby interface
(PTrace and WinAPI) ; which are used by the unified high-end Debugger class.
Remote debugging is supported through the GDB server wire protocol.

High-level debuggers can be created with the following ruby line:
Metasm::OS.current.create_debugger('foo')

Only one kind of host debugger class can exist at a time ; to debug multiple
processes, attach to other processes using the existing class. This is due
to the way the OS debugging API works on Windows and Linux.

The low-level backends are defined in the os/ subdirectory, the front-end is
defined in debug.rb.

A linux console debugging interface is available in samples/lindebug.rb ; it
uses a (simplified) SoftICE-like look and feel.
It can talk to a gdb-server socket ; use a [udp:]<host:port> target.

The disassembler-gui sample allow live process interaction when using as
target 'live:<pid or part of program name>'.


C Parser:

Metasm includes a hand-written C Parser.
It handles all the constructs i am aware of, except hex floats:
 - static const L"bla"
 - variable arguments
 - incomplete types
 - __attributes__(()), __declspec()
 - #pragma once
 - #pragma pack()
 - C99 declarators - type bla = { [ 2 ... 14 ].toto = 28 };
 - Nested functions
 - __int8 etc native types
 - Label addresses (&&label)
Also note that all those things are parsed, but most of them will fail to
compile on the Ia32/X64 backend (the only one implemented so far.)

Parsing C files should be done using an existing ExeFormat, with the
parse_c_file method. This ensures that format-specific macros/ABI are correctly
defined (ex: size of the 'long' type, ABI to pass parameters to functions, etc)

When you parse a C String using C::Parser.parse(text), you receive a Parser
object. It holds a #toplevel field, which is a C::Block, which holds #structs,
#symbols and #statements. The top-level functions are found in the #symbol hash
whose keys are the symbol names, associated to a C::Variable object holding
the functions. The function parameter/attributes are accessible through
func.type, and the code is in func.initializer, which is itself a C::Block.
Under it you'll find a tree-like structure of C::Statements (If, While, Asm,
CExpressions...)

A C::Parser may be #precompiled to transform it into a simplified version that
is easier to compile: typedefs are removed, control sequences are transformed
into 'if (XX) goto YY;' etc.

To compile a C program, use PE/ELF.compile_c, that will create a C::Parser with
exe-specific macros defined (eg __PE__ or __ELF__).

Vendor-specific headers may need to use either #pragma prepare_visualstudio
(to parse the Microsoft Visual Studio headers) or prepare_gcc (for gcc), the
latter may be auto-detected (or may not).
Vendor headers tested are VS2003 (incl. DDK) and gcc4 ; ymmv.

Currently the CPU#compilation of a C code will generate an asm source (text),
which may then be parsed & assembled to binary code.

See ExeFormat#compile_c, and samples/exeencode.rb