Metasm, the Ruby assembly manipulation suite ============================================ * You have some samples in samples/ * LICENCE is LGPL Author: Yoann Guillot Basic overview: Metasm allows you to interact with executables formats (ExeFormat): PE, ELF, Shellcode, etc There are three approaches of an ExeFormat: - compiling one up, from scratch ( -> source) - decompiling an existing format ( -> blocks) - manipulating the file structure( -> encoded) Assembly: When compiling, you start from a source text (ruby String, consisting mostly in a sequence of instructions/data/padding directive), then you parse it. The string is handed to a Preprocessor (which handles #if, #ifdef, #include, #define, comments etc, almost 100% compatible with gcc -E), which is encapsulated in an AsmPreprocessor (which handles asm macro definitions, equ and asm comments). This AsmPreprocessor returns tokens to the ExeFormat, which parses them as Data, Padding, Labels or parser directives. Parser directives always start with a dot. They can be generic (.pad, .offset...) or ExeFormat-specific (.section, .import...). If the ExeFormat does not recognize a word, it hands it to its CPU instance, which is responsible for parsing Instructions, or raise an exception. All these tokens are stored in one or more arrays in the @source attribute of the ExeFormat (Shellcode's @source is an Array, for PE/ELF it is a hash of section name => Array) Every immediate value can be an arbitrary Expression (see later). You can then assemble the source to binary sections. ExeFormat has a constructor to do that: ExeFormat.assemble(cpu, source) it parses the source, assemble it, and return the ExeFormat instance. EncodedData: In Metasm all binary data is stored as an EncodedData. EncodedData has 3 main attributes: - @data which holds the raw binary data (generally a ruby String, but see VirtualString) - @export which is a hash associating an export name (label name) to an offset within @data - @reloc which is a hash whose keys are offsets within @data, and whose values are Relocation objects. A Relocation object has an endianness (:little/:big), a sign (:signed/:unsigned/:any), a size (in bits) and a target. The target is an arbitrary arithmetic/logic Expression. EncodedData also has a @virtualsize (for e.g. .bss sections), and a @ptr (used when decoding things) You can fixup an EncodedData, with a Hash variable name => value (value should be an Expression or a numeric value). When you do that, each relocation's target is bound using the binding, and if the result is calculable (no external variable name used in the Expression), the result is encoded using the relocation's size/sign/endianness information. If it overflows (try to store 128 in an 8bit signed relocation), an EncodeError exception is raised. If the relocation's target is not numeric, the target is unchanged if you use EncodedData#fixup, or it is replaced with the bound target with #fixup! . Desassembly: (experimental) When decompiling, you start from a decoded ExeFormat (you need to be able to say what data is at which virtual address), you specify a virtual address to start (virtual address or export name). The ExeFormat starts disassembling instructions. When it encounters an Opcode marked as :setip, it calls the CPU to find the jump destination, and backtracks instructions until it finds the numeric value. The disassembled code is stored as InstructionBlocks, whichs holds a list of DecodedInstruction, a list of @from and @to (array of block addresses) A DecodedInstruction has an Instruction, an Opcode and a bin_length (to allow printing the hex dump) (experimental for now, does not handle external calls, does not handle well subfunctions, should only be used on small shellcodes) Constructor: Shellcode.disassemble(cpu, binary) ExeFormat manipulation: You can encode/decode an ExeFormat (ie decode sections, imports, headers etc) Constructor: ExeFormat.decode_file(str), ExeFormat.decode_file_header(str) Methods: ExeFormat#encode_file(filename), ExeFormat#encode_string VirtualString: A VirtualString is an object String-like : you can read/maybe write slices of it. It can be used as @data in an EncodedData, and thus allows virtualization of most Metasm algorithms. You cannot change a VirtualString length. Taking a slice of a VirtualString can return either a String (length smaller than 4096) or another VirtualString. You can force getting a small VirtualString using the #dup(from, length) method. Any unimplemented method called on it is forwarded to frozen String which is a full copy of the VirtualString (should generally not be used). There are currently 3 VirtualStrings implemented: - VirtualFile, whichs loads a file by 4096-bytes chunks, on demand, - WindowsRemoteString, which maps another process' virtual memory (uses windows debug api) - LinuxRemoteString, which maps another process' virtual memory (need ptrace rights, memory reading is done using /proc/pid/mem) The Win/Lin version are quite powerful, and allow things like live process disassembly/patching easily (use LoadedPE/LoadedELF as ExeFormat) Things planned: Write a C parser (at least for headers), and adding syntax to support C structs in assembly. Write a good disassembler, supporting external calls through C header parsing, recognize/handle sub functions. Write an UI for dasm