For most hashes where MD5 is used, building a proper md5 format is likely
not the best bet overall.  A format is not trivial.  It requires maintenance
and will likely require specific enhancements to get it to perform
optimally on all hardware.  Likely there will need to be 'generic' C
code written, and then code to tie it into CPU-specific optimizations
such as SSE, MMX, intrinsic SSE, GPU, and so on.  This also means that,
to stay up to date, the format will require ongoing work and maintenance.

However, there is one format which may reduce this maintenance work to
very little.  That format itself will need to be kept up to date, but any
formats built upon its internal workings will not.  That format is
md5-gen.  In this 'format', there is a scripting language, where a format
developer only needs to describe the actual operations properly, and the
format is 'done', and working.

This document will go over how to 'build' a format that uses this md5-gen
format, how to optimize it to work faster, and how to build a 'thin'
quasi format which insulates the end user from the md5-gen format line
building.

**** Introduction ****

To start off with, a little background on 'how' and 'where' to build the
scripts that run md5-gen, and what internal data structures are available
to be used.

The 'where', which a format developer can easily build into john, is to add a
new md5-gen format 'script' to the john.ini file (john.conf).  This
file is usually located in the current directory where john is run out
of (but --config=file can override the default behavior).  Within
john.conf, a new 'section' can be added for a md5 generic format. The
new 'section' is named using this form:

[List.Generic:md5_gen(NUM)]

You replace the NUM with the sub-format number (from 1001 to 9999).
Pick a number that is not used.

Within this 'section', there will be multiple lines added.  These lines
are primarily of the form:    Type=Value

The actual contents of these scripts will be addressed later.  That will
be the 'How', and performing this is actually outside of the intro section.

The 'Data' and runtime information is this:

Inside of the md5-gen format, there are 2 input buffers (actually ALL data
is arrays of 128 of each buffer type): input1 and input2.
The main operations on these buffers are to clear them and to append data,
building the string which will later be md5 hashed.

There are also 2 output buffers.  These buffers will receive the md5 hashing
from the 2 input buffers.  NOTE, when the format processing is complete, the
results MUST be placed into output1 buffer. This is where all of the comparison
functions will check against.

In the format, there is a salt (if the format is salted).   There may also be
a second salt value.

There are also 'keys' value(s).  These are the passwords being tested at this
given time.

There are also 8 'constant' strings which can be used within a format.  A
format such as md5-po has a couple of constants within it.

There are also numerous optimization 'flags' which do special things when
loading keys or salts, and there are numerous special 'optimization' primitive
functions within the format, for speedup of certain operations.
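
The pieces listed above can be pictured with a small model.  The Python
sketch below is illustrative only: the class name MD5GenState and its field
names are invented here and are not identifiers from john's source.  Only
the counts (2 input buffers, 2 output buffers, 8 constants, 128 slots) come
from the text.

```python
from dataclasses import dataclass, field

SLOTS = 128        # every buffer type is an array of 128 entries
NUM_CONSTANTS = 8  # up to 8 constant strings per format


def _slots():
    # One independent, empty byte buffer per slot.
    return [bytearray() for _ in range(SLOTS)]


@dataclass
class MD5GenState:
    """Hypothetical picture of the data a md5-gen script works on."""
    input1: list = field(default_factory=_slots)   # strings being built
    input2: list = field(default_factory=_slots)
    output1: list = field(default_factory=_slots)  # final result must end up here
    output2: list = field(default_factory=_slots)
    keys: list = field(default_factory=_slots)     # candidate passwords
    salt: bytes = b""                              # unused for unsalted formats
    salt2: bytes = b""                             # optional second salt
    constants: list = field(default_factory=lambda: [b""] * NUM_CONSTANTS)


state = MD5GenState()
state.keys[0] += b"test1"
state.input1[0] += state.keys[0]   # "append keys" for slot 0
```

Each script function described later acts on all 128 slots at once; the
last two lines show the effect on slot 0 only.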

**** Simple format building ****

We will start out with a few simple formats, and simply 'show' how to build
a straightforward script. The scripts may or may not be optimal.  Later
we will optimize these somewhat.  When building the formats here, there will
be comments interspersed, listing just what is being done, and why.

We will build these formats:
md5_gen(1030) md5($p.$p)
md5_gen(1031) md5($s.md5($p).$p)
md5_gen(1032) md5(md5($s).md5($p).$p)

[List.Generic:md5_gen(1030)]
Expression=md5_gen(1030): md5($p.$p)
Func=MD5GenBaseFunc__clean_input
Func=MD5GenBaseFunc__append_keys
Func=MD5GenBaseFunc__append_keys
Func=MD5GenBaseFunc__crypt
Test=md5_gen(1030)42b72f913c3201fc62660d512f5ac746:test1
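
Because a script repeats the Func= key and depends on the order of those
lines, anything that reads this layout has to keep every line in sequence.
The Python below is a minimal sketch of that idea and is not john's actual
config loader; parse_md5_gen_section is a name made up for this example.

```python
import re

SCRIPT = """\
[List.Generic:md5_gen(1030)]
Expression=md5_gen(1030): md5($p.$p)
Func=MD5GenBaseFunc__clean_input
Func=MD5GenBaseFunc__append_keys
Func=MD5GenBaseFunc__append_keys
Func=MD5GenBaseFunc__crypt
Test=md5_gen(1030)42b72f913c3201fc62660d512f5ac746:test1
"""

HEADER = re.compile(r"^\[List\.Generic:md5_gen\((\d+)\)\]$")


def parse_md5_gen_section(text):
    """Return (sub-format number, ordered list of (Type, Value) pairs)."""
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith("#")]
    match = HEADER.match(lines[0])
    if match is None:
        raise ValueError("section name must be [List.Generic:md5_gen(NUM)]")
    num = int(match.group(1))
    if not 1001 <= num <= 9999:
        raise ValueError("user sub-format numbers run from 1001 to 9999")
    # Split on the first '=' only; values may contain '=' themselves.
    pairs = [tuple(ln.split("=", 1)) for ln in lines[1:]]
    return num, pairs


num, pairs = parse_md5_gen_section(SCRIPT)
funcs = [value for kind, value in pairs if kind == "Func"]
print(num, funcs)
```

A plain list of pairs is used instead of a dictionary so that the two
identical append_keys lines both survive, in their original order.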

Here is the exact same format, with some comments added, describing the
sub-sections, and exactly what is being done.

#first line is the section name. It MUST be of the format shown.
[List.Generic:md5_gen(1030)]
#
#the next line, is a required line.  It serves 2 purposes.  It is output
#in john, when the format 'starts'.  Also, the md5_gen(#) part is used
#to distinguish this exact format (so the command line of --sub=md5_gen(1030)
#would specify this and only this format)
#
Expression=md5_gen(1030): md5($p.$p)
#
#This is the set of functions.  This is the ONLY section of the format
#where order IS important.  The functions ARE handled one after the
#other, from top to bottom, to perform the string operations, and md5
#operations which are needed to perform the hash of this format
#The functions ARE a required part of the format.
#
#first step, clean the input.  All work for this format is done using
#only input 1 and output 1 buffers.
Func=MD5GenBaseFunc__clean_input
#
#Step 2, append the keys. Note, the buffer is clean, so this is simply
#the same as Input=keys (but required 2 steps, the clean and append keys).
Func=MD5GenBaseFunc__append_keys
#
#Step 3, append keys again (the format is $p.$p, or keys appended to keys).
Func=MD5GenBaseFunc__append_keys
#
#Step 4, final step performs md5 of $p.$p  This will properly leave the
#results in output1
Func=MD5GenBaseFunc__crypt
#
#This is a test string.  These ARE required. You can provide more than
#one.  5 or 6 are best, to make sure the format is valid.
#
Test=md5_gen(1030)42b72f913c3201fc62660d512f5ac746:test1
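
To make the top-to-bottom execution concrete, here is a toy Python
interpreter for the functions used by this script.  It handles a single
password rather than arrays of 128, and the dictionary-based dispatch is an
assumption made for illustration; only the function names come from the
script.

```python
import hashlib


def run_script(funcs, password):
    """Run Func= names in order for one candidate; return output1 as hex."""
    state = {"input1": bytearray(), "output1": b""}

    def clean_input():
        state["input1"].clear()

    def append_keys():
        state["input1"] += password

    def crypt():
        # md5 of input1; the raw 16-byte digest lands in output1.
        state["output1"] = hashlib.md5(bytes(state["input1"])).digest()

    primitives = {
        "MD5GenBaseFunc__clean_input": clean_input,
        "MD5GenBaseFunc__append_keys": append_keys,
        "MD5GenBaseFunc__crypt": crypt,
    }
    for name in funcs:          # order matters: executed top to bottom
        primitives[name]()
    return state["output1"].hex()


FORMAT_1030 = [
    "MD5GenBaseFunc__clean_input",
    "MD5GenBaseFunc__append_keys",
    "MD5GenBaseFunc__append_keys",
    "MD5GenBaseFunc__crypt",
]

result = run_script(FORMAT_1030, b"test1")
print(result)  # compare this by hand with the hash in the Test= line
```

Dropping one of the two append_keys entries from the list turns the toy
into plain md5($p), which is a quick way to see that each Func= line
contributes exactly one step.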

Ok, here is the second format.  The format being done is md5($s.md5($p).$p)
Here are a few comments about this format:
1.  There is a Flag= value.  This is because this is a Salted format. This
    REQUIRES the MGF_SALTED flag.
2.  We only use input 1 and output 1.
3.  There are a couple of calls to crypt (md5).  The first simply gets
    md5($p) and puts it into output1, which will later be appended in
    base-16 format as we build our string.
4.  After the first crypt (md5), we clear our input buffer, then put
    the salt in, append the base-16 of md5($p), and then append $p.
5.  Finally, a call to crypt is done, which leaves the results in
    output1, so the rest of the md5-gen format can properly compare it.

[List.Generic:md5_gen(1031)]
Expression=md5_gen(1031): md5($s.md5($p).$p)
Flag=MGF_SALTED
Func=MD5GenBaseFunc__clean_input
Func=MD5GenBaseFunc__append_keys
Func=MD5GenBaseFunc__crypt
Func=MD5GenBaseFunc__clean_input
Func=MD5GenBaseFunc__append_salt
Func=MD5GenBaseFunc__append_from_last_output_as_base16
Func=MD5GenBaseFunc__append_keys
Func=MD5GenBaseFunc__crypt
Test=md5_gen(1031)a459f60614498dbdd9a79dcc9c538749$aabbccdd:test1
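
As a cross-check of the steps above, the same expression can be written
directly with Python's hashlib.  This is an independent sketch, not code
from john; the names md5_gen_1031 and parse_test_line are invented here.
The point to notice is that the inner md5($p) is appended as 32 lowercase
hex characters, not as 16 raw bytes.

```python
import hashlib


def md5_gen_1031(salt: bytes, password: bytes) -> str:
    """md5($s.md5($p).$p) with the inner hash appended in base-16."""
    inner_hex = hashlib.md5(password).hexdigest().encode("ascii")  # 32 bytes
    return hashlib.md5(salt + inner_hex + password).hexdigest()


def parse_test_line(line: str):
    """Split 'md5_gen(N)HASH$SALT:PASSWORD' into (hash, salt, password)."""
    body = line.split(")", 1)[1]            # drop the 'md5_gen(N)' tag
    hash_and_salt, password = body.split(":", 1)
    expected, salt = hash_and_salt.split("$", 1)
    return expected, salt.encode(), password.encode()


expected, salt, password = parse_test_line(
    "md5_gen(1031)a459f60614498dbdd9a79dcc9c538749$aabbccdd:test1"
)
computed = md5_gen_1031(salt, password)
print(computed == expected)  # should agree if the Test= line is valid
```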


Now, here is the final format:  md5(md5($s).md5($p).$p)

[List.Generic:md5_gen(1032)]
Expression=md5_gen(1032): md5(md5($s).md5($p).$p)
Flag=MGF_SALTED
Func=MD5GenBaseFunc__clean_input
Func=MD5GenBaseFunc__append_salt
Func=MD5GenBaseFunc__crypt
Func=MD5GenBaseFunc__clean_input2
Func=MD5GenBaseFunc__append_keys2
Func=MD5GenBaseFunc__crypt2
Func=MD5GenBaseFunc__clean_input
Func=MD5GenBaseFunc__append_from_last_output_as_base16
Func=MD5GenBaseFunc__append_from_last_output2_to_input1_as_base16
Func=MD5GenBaseFunc__append_keys
Func=MD5GenBaseFunc__crypt
Test=md5_gen(1032)042d1f15ed57929a2ac8ee4f0a924679$aabbccdd:test1

Ok, now that these have been built, here are a few 'benchmarks' showing
whether they are WORKING, and at what speed:

Here is MinGW build 'x86'

john_x86 -test -for=md5-gen -sub=md5_gen(1030)
Benchmarking: md5_gen(1030) md5_gen(1030): md5($p.$p) [128x1 (MD5_Go)]... DONE
Raw:    3530K c/s

john_x86 -test -for=md5-gen -sub=md5_gen(1031)
Benchmarking: md5_gen(1031) md5_gen(1031): md5($s.md5($p).$p) [128x1 (MD5_Go)]... DONE
Many salts:     1945K c/s
Only one salt:  1890K c/s

john_x86 -test -for=md5-gen -sub=md5_gen(1032)
Benchmarking: md5_gen(1032) md5_gen(1032): md5(md5($s).md5($p).$p) [128x1 (MD5_Go)]... DONE
Many salts:     1016K c/s
Only one salt:  1031K c/s


Here is MinGW build of SSE2

john_sse2 -test -for=md5-gen -sub=md5_gen(1030)
Benchmarking: md5_gen(1030) md5_gen(1030): md5($p.$p) SSE2 [SSE2 32x4 (.S)]... DONE
Raw:    7250K c/s

john_sse2 -test -for=md5-gen -sub=md5_gen(1031)
Benchmarking: md5_gen(1031) md5_gen(1031): md5($s.md5($p).$p) SSE2 [SSE2 32x4 (.S)]... DONE
Many salts:     5065K c/s
Only one salt:  4436K c/s

john_sse2 -test -for=md5-gen -sub=md5_gen(1032)
Benchmarking: md5_gen(1032) md5_gen(1032): md5(md5($s).md5($p).$p) SSE2 [SSE2 32x4 (.S)]... FAILED (get_hash[0](0))


Here are some timings to check against:

john_x86 -test -for=md5-gen -sub=md5_gen(0)
Benchmarking:  md5_gen(0): md5($p)  (raw-md5)  [128x1 (MD5_Go)]... DONE
Raw:    4005K c/s

john_sse2 -test -for=md5-gen -sub=md5_gen(0)
Benchmarking:  md5_gen(0): md5($p)  (raw-md5)  SSE2 [SSE2 32x4 (.S)]... DONE
Raw:    10740K c/s


**** Optimizations of prior formats ****

For format 1030, the speed should be very close to that of md5_gen(0).
In both formats, there is only 1 call to md5().  However, we are seeing that
(1030) is slower than (0).  The explanation is that the (0) format uses an
optimization which we cannot use in (1030).  The (1030) format is likely
about as optimal as it can be made in the current md5-gen format.   The optimization
for format (0) is:   Flag=MGF_KEYS_INPUT  What that does is place the keys
directly into the input field, and then later, when john gets the keys back (it
does this if a hash is cracked), john gets them from the input.  In the (1030)
format, we load the keys into the 'keys' arrays.  We then have to call a function
to clean input buffer 1, and to append the keys (twice).  Thus, what we have is
additional memory movement, and that slows things down.  However, to use the
MGF_KEYS_INPUT optimization, we would have had to keep the input1 buffer pristine
and ONLY put in the keys (passwords).  Since we had to append the keys twice,
we simply 'blew' that requirement, and thus could NOT use it.    At a later
time, we will show a format WHERE we can use this optimization.

For format 1031, there also appears to be no optimizations available.

For 1032, there are optimizations.  In this format, we notice that we have
this sub expression:  md5($s).  There is an optimization which, when the
input file is loaded, converts all salts into md5($s) and uses that value
instead.  So, at startup time, we perform md5 hashes of all salts, but at
runtime, we simply place the salt into the string being built, instead of
performing an MD5 on the salt.  So, in 1032, we had 3 calls to crypt.  By
using this optimization, we can reduce that to 2 crypts. The starting format
is md5(md5($s).md5($p).$p).  This optimization makes the format 'behave' at
runtime like md5($s.md5($p).$p), which was format 1031.  Note that after
we make this optimization, the timings move toward the 1031 timings,
although the benchmarks below show 1042 still trailing 1031.

Also note that the hash in the Test line is exactly the same for 1032 and
1042.  These are the same formats; 1042 just performs fewer crypt calls
per test.

Also note that in the 'original' SSE2 run, the 1032 format failed.  This
failure is due to the SSE2 / MMX code only working for strings up to 54
bytes (for optimization reasons).  The string md5($s).md5($p) is 64 bytes
by itself, and we also append $p to that.  Thus, our string is OVER 54
bytes in length, and cannot be used in SSE2 mode.  We do have a couple of
workarounds for this, to get it working properly on SSE2 builds.  We can
use a flag which simply stops SSE2 dead in its tracks (and performs all
work using x86 code).  This is the flag MGF_NOTSSE2Safe.

[List.Generic:md5_gen(1042)]
Expression=md5_gen(1042): md5(md5($s).md5($p).$p)
Flag=MGF_SALTED
Flag=MGF_SALT_AS_HEX
Flag=MGF_NOTSSE2Safe
Func=MD5GenBaseFunc__clean_input
Func=MD5GenBaseFunc__append_keys
Func=MD5GenBaseFunc__crypt
Func=MD5GenBaseFunc__clean_input
Func=MD5GenBaseFunc__append_salt
Func=MD5GenBaseFunc__append_from_last_output_as_base16
Func=MD5GenBaseFunc__append_keys
Func=MD5GenBaseFunc__crypt
Test=md5_gen(1042)042d1f15ed57929a2ac8ee4f0a924679$aabbccdd:test1
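
Both claims in the discussion above can be checked with a few lines of
Python: that precomputing md5($s) turns this format into 1031 applied to a
different salt, and that the final string is too long for the SSE2 code.
The sketch is illustrative; the helper names are invented here, and the
54-byte limit is simply the figure quoted in the text.

```python
import hashlib

SSE2_LIMIT = 54  # maximum string length quoted above for the SSE2/MMX code


def hexmd5(data: bytes) -> bytes:
    """md5 of data as 32 ASCII hex characters."""
    return hashlib.md5(data).hexdigest().encode("ascii")


def format_1032(salt: bytes, password: bytes) -> bytes:
    """md5(md5($s).md5($p).$p): three md5 calls per candidate."""
    return hexmd5(hexmd5(salt) + hexmd5(password) + password)


def format_1031(salt: bytes, password: bytes) -> bytes:
    """md5($s.md5($p).$p): two md5 calls per candidate."""
    return hexmd5(salt + hexmd5(password) + password)


salt, password = b"aabbccdd", b"test1"

# What MGF_SALT_AS_HEX does conceptually: hash each salt once, at load time.
loaded_salt = hexmd5(salt)
same = format_1031(loaded_salt, password) == format_1032(salt, password)

# Length of the final string: 32 + 32 + len($p), always over the limit.
final_len = len(loaded_salt) + len(hexmd5(password)) + len(password)
print(same, final_len, final_len > SSE2_LIMIT)  # True 69 True
```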

Once the above changes have been done, here are the speeds:

john_x86 -test=5 -for=md5-gen -sub=md5_gen(1031)
Benchmarking: md5_gen(1031) md5_gen(1031): md5($s.md5($p).$p) [128x1 (MD5_Go)]... DONE
Many salts:     2007K c/s
Only one salt:  1913K c/s

john_x86 -test=5 -for=md5-gen -sub=md5_gen(1032)
Benchmarking: md5_gen(1032) md5_gen(1032): md5(md5($s).md5($p).$p) [128x1 (MD5_Go)]... DONE
Many salts:     1052K c/s
Only one salt:  1030K c/s

john_x86 -test=5 -for=md5-gen -sub=md5_gen(1042)
Benchmarking: md5_gen(1042) md5_gen(1042): md5(md5($s).md5($p).$p) [128x1 (MD5_Go)]... DONE
Many salts:     1420K c/s
Only one salt:  1372K c/s

john_sse2 -test=5 -for=md5-gen -sub=md5_gen(1042)
Benchmarking: md5_gen(1042) md5_gen(1042): md5(md5($s).md5($p).$p) SSE2 [128x1 (MD5_Go)]... DONE
Many salts:     1416K c/s
Only one salt:  1372K c/s


We can also perform even more optimizations in the format.  What we do in this format, is we
md5 the salt (when we first load the file). Thus the salts which john works with, are really
md5($s)  (same as we did in format 1042).  Then we use a different flag, which puts the
md5($p) into offset 32 of input1 (where we want it). Then we simply overwrite the data in
input 1 with the salt (which is md5($s) in base-16 format), then force set length to 64, then
append the keys, then crypt.

[List.Generic:md5_gen(1052)]
Expression=md5_gen(1052): md5(md5($s).md5($p).$p)
Flag=MGF_SALTED
Flag=MGF_SALT_AS_HEX
Flag=MGF_KEYS_BASE16_IN1_Offset32
Flag=MGF_NOTSSE2Safe
Func=MD5GenBaseFunc__overwrite_salt_to_input1_no_size_fix
Func=MD5GenBaseFunc__set_input_len_64
Func=MD5GenBaseFunc__append_keys
Func=MD5GenBaseFunc__crypt
Test=md5_gen(1052)042d1f15ed57929a2ac8ee4f0a924679$aabbccdd:test1

Here are the benchmarks for the above format:

john_x86 -test=5 -for=md5-gen -sub=md5_gen(1052)
Benchmarking: md5_gen(1052) md5_gen(1052): md5(md5($s).md5($p).$p) [128x1 (MD5_Go)]... DONE
Many salts:     2251K c/s
Only one salt:  1369K c/s

john_sse2 -test=5 -for=md5-gen -sub=md5_gen(1052)
Benchmarking: md5_gen(1052) md5_gen(1052): md5(md5($s).md5($p).$p) SSE2 [128x1 (MD5_Go)]... DONE
Many salts:     2251K c/s
Only one salt:  1369K c/s


Now, note the speed for 'many salts'.  It is very close to the speed of (1031), actually faster.
This is the speed john will have for normal password cracking, where you have dozens (or
hundreds, or 1000's) of password hashes to crack.

To understand WHY this format is this much faster ('Many salts' is the normal way to
benchmark the speed of a salted hash), we need to look at what is happening under the hood within
john's 'crypt all' loop.

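   while (!feof(password_file)) {
      /* load a full block of candidate passwords */
      for (i = 0; i < max_num_passwords; i++)
         SetKey(i, getnextpassword(password_file));
      if (salted)
      {
         z = 0;
         while (z<salt_count)
         {
            SetSalt(salt[z]);
            crypt_all();
            /* compare against every hash stored under this salt */
            for (each binary in all_binaries_for_salt[z])
               CheckForMatched(binary);
            z++;
         }
      }
   }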

The above code is certainly not 'exact', but should show close enough, the algorithm used
within john.  Now, the algorithm as used within md5-gen will be shown (specifically for the
flag  MGF_KEYS_BASE16_IN1_Offset32).

 - SetKey() is called numerous times.  This will set a 'dirty flag' for the keys inside of md5-gen.
 - SetSalt() will be called.  The salt handed to us is actually md5($s), since MGF_SALT_AS_HEX is set
   The SetSalt() calls are happening within the 'while(z<salt_count)' loop in john.
 - crypt_all is called.
   Within crypt_all, md5-gen knows that we want the base-16 md5($p) to be placed at offset 32
   within input1.  So the first call to crypt_all (for the first salt), will cause the md5($p)
   to be computed, and to be placed at offset 32.
   Then the script will overwrite the starting bytes of input1 with the 32 bytes of the salt,
   then the length is set to 64, then the key is appended, then a crypt, and then comparisons.
 - NOW, we are at the next loop within the 'while(z<salt_count)'.
 - Then john loads the next salt [ SetSalt() ].
 - Then john calls crypt_all.
   At this time, there have been NO additional SetKey() calls. Thus, md5-gen knows that the
   base-16 text of md5($p) is STILL located at offset 32 of Input1. So, the format DOES NOT
   perform this crypt again (until new SetKey() function calls happen).
 - This SetSalt .. crypt_all .. compare continues until all salts are tested.  However, there
   will be no crypt calls to md5($p) again, UNTIL the working code within john calls SetKey()
   again (when starting with new passwords, after all salts have been checked).
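
The saving described in the steps above can be counted with a small
simulation.  The Python below is a sketch of the idea only, with invented
names, and it processes one password at a time rather than john's blocks of
keys: the naive version recomputes md5($p) for every salt, while the cached
version recomputes it only when a new key has been set.

```python
import hashlib

calls = {"md5": 0}


def hexmd5(data: bytes) -> bytes:
    calls["md5"] += 1  # count every md5 invocation
    return hashlib.md5(data).hexdigest().encode("ascii")


def crack_naive(passwords, hex_salts):
    """Like 1042: md5($p) is recomputed inside every crypt_all."""
    out = []
    for p in passwords:
        for s in hex_salts:
            out.append(hexmd5(s + hexmd5(p) + p))
    return out


def crack_cached(passwords, hex_salts):
    """Like 1052: md5($p) sits at offset 32 until the key changes."""
    out = []
    for p in passwords:
        dirty = True                      # SetKey() marks the key dirty
        buf = bytearray(64)
        for s in hex_salts:
            if dirty:
                buf[32:64] = hexmd5(p)    # done once per password
                dirty = False
            buf[0:32] = s                 # overwrite salt, no size fix
            del buf[64:]                  # force length back to 64
            buf += p                      # append keys
            out.append(hexmd5(bytes(buf)))
    return out


passwords = [b"test1", b"secret", b"letmein"]
# With MGF_SALT_AS_HEX the salts handed to the script are already md5($s).
raw_salts = (b"aa", b"bb", b"cc", b"dd")
hex_salts = [hashlib.md5(s).hexdigest().encode("ascii") for s in raw_salts]

calls["md5"] = 0
naive = crack_naive(passwords, hex_salts)
naive_calls = calls["md5"]

calls["md5"] = 0
cached = crack_cached(passwords, hex_salts)
cached_calls = calls["md5"]

print(naive == cached, naive_calls, cached_calls)  # True 24 15
```

With P passwords and S salts the naive loop makes 2*P*S md5 calls and the
cached loop makes P + P*S, so the gap widens as the number of salts grows.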


Now, in the final format, we start from 1042, and do NOT turn off the sse2 code. What we do, is
to turn off SSE2 when it is not valid.  This will generate x86 code (generic) that runs exactly
the same as in 1042 (the 2 function calls of MD5GenBaseFunc__SSEtoX86_switch_output1 and
MD5GenBaseFunc__X86toSSE_switch_output1 are no-ops in x86 builds). However, in SSE mode,
the first crypt will be done using SSE.  Thus, as we see, the speed went from 1420k, up
to almost 1800k.  But note, this is NOT as fast as format 1052, for 'many' salts.

[List.Generic:md5_gen(1062)]
Expression=md5_gen(1062): md5(md5($s).md5($p).$p)
Flag=MGF_SALTED
Flag=MGF_SALT_AS_HEX
Func=MD5GenBaseFunc__clean_input
Func=MD5GenBaseFunc__append_keys
Func=MD5GenBaseFunc__crypt
Func=MD5GenBaseFunc__SSEtoX86_switch_output1
Func=MD5GenBaseFunc__clean_input
Func=MD5GenBaseFunc__append_salt
Func=MD5GenBaseFunc__append_from_last_output_as_base16
Func=MD5GenBaseFunc__append_keys
Func=MD5GenBaseFunc__crypt
Func=MD5GenBaseFunc__X86toSSE_switch_output1
Test=md5_gen(1062)042d1f15ed57929a2ac8ee4f0a924679$aabbccdd:test1

john_sse2 -test=5 -for=md5-gen -sub=md5_gen(1062)
Benchmarking: md5_gen(1062) md5_gen(1062): md5(md5($s).md5($p).$p) SSE2 [SSE2 32x4 (.S)]... DONE
Many salts:     1792K c/s
Only one salt:  1715K c/s

So all in all, 1032, 1042, 1052 and 1062 are all equivalent (1032 only on x86 builds, since it fails in
SSE2 builds, but that was 'fixed' in 1042).  They run using differing sets of flags and differing
sets of function primitives, and have different runtime speeds.  However, in the end, they all
compute the same hash, md5(md5($s).md5($p).$p), as their identical Test hashes show.


Now, the above format 1062 is slower than 1052 for many salts. This is due to the final crypt still
having to be done in x86 mode, and because in 1062 we crypt EVERY password for each salt.  Thus you
can see there is no speed gain between many salts and 1 salt.  Yes, the md5($p) IS done using SSE2,
which is much faster, but in version 1052, when there are multiple salts, the slower md5($p) is done
only 1 time per password.


Now, the flag MGF_KEYS_BASE16_IN1_Offset32 (or other flags like it), CAN be used in SSE2 to
get much faster behavior, however, it has to be in a format that IS SSE2 friendly.  Here
is an example:

md5(md5($p).$s)   In this format, we CAN build an SSE2 friendly format, that is VERY fast.
For this test, we will set the salt length to a fixed size of 12.

Here is a very easy to read, but also very far from optimal format for the above type:
[List.Generic:md5_gen(1033)]
Expression=md5_gen(1033): md5(md5($p).$s)
Flag=MGF_SALTED
Func=MD5GenBaseFunc__clean_input
Func=MD5GenBaseFunc__append_keys
Func=MD5GenBaseFunc__crypt
Func=MD5GenBaseFunc__clean_input
Func=MD5GenBaseFunc__append_from_last_output_as_base16
Func=MD5GenBaseFunc__append_salt
Func=MD5GenBaseFunc__crypt
Test=md5_gen(1033)e9fb44106edf60419d26a10b5439d0c7$aabbccddeeff:test1
SaltLen=12

john_x86 -test -format=md5-gen -subf=md5_gen(1033)
Benchmarking: md5_gen(1033) md5_gen(1033): md5(md5($p).$s) [128x1 (MD5_Go)]... DONE
Many salts:     1918K c/s
Only one salt:  1889K c/s

john_sse2 -test -format=md5-gen -subf=md5_gen(1033)
Benchmarking: md5_gen(1033) md5_gen(1033): md5(md5($p).$s) SSE2 [SSE2 32x4 (.S)]... DONE
Many salts:     5479K c/s
Only one salt:  4922K c/s


Here is a MUCH more optimal version (1043).  This version will use the flag
MGF_KEYS_BASE16_IN1 to load the md5($p) into input 1, at the start of that string.  That
will ONLY be done if there has been a SetKey() change.  Then we simply set the input length
to 32, append the salt, and call crypt.

[List.Generic:md5_gen(1043)]
Expression=md5_gen(1043): md5(md5($p).$s)
Flag=MGF_SALTED
Flag=MGF_KEYS_BASE16_IN1
Func=MD5GenBaseFunc__set_input_len_32
Func=MD5GenBaseFunc__append_salt
Func=MD5GenBaseFunc__crypt
Test=md5_gen(1043)e9fb44106edf60419d26a10b5439d0c7$aabbccddeeff:test1
SaltLen=12
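
The three-function script above relies on md5($p) already sitting at the
start of input1.  A rough Python picture of one salt loop follows; it is a
sketch with invented names, not john's implementation, and it handles a
single password.

```python
import hashlib


def hexmd5(data: bytes) -> bytes:
    """md5 of data as 32 ASCII hex characters."""
    return hashlib.md5(data).hexdigest().encode("ascii")


def crypt_all_salts(password: bytes, salts):
    """md5(md5($p).$s) for each salt, hashing the password only once."""
    # MGF_KEYS_BASE16_IN1: base-16 md5($p) is loaded at the start of input1
    # when the key changes, not on every crypt.
    input1 = bytearray(hexmd5(password))
    results = []
    for salt in salts:
        del input1[32:]              # set_input_len_32: drop the old salt
        input1 += salt               # append_salt
        results.append(hexmd5(bytes(input1)))  # crypt
    return results


salts = [b"aabbccddeeff", b"001122334455"]   # fixed length 12, as SaltLen=12
fast = crypt_all_salts(b"test1", salts)
slow = [hexmd5(hexmd5(b"test1") + s) for s in salts]
print(fast == slow, 32 + len(salts[0]))  # True 44
```

The string being hashed is 32 + 12 = 44 bytes, under the 54-byte limit
mentioned earlier, which is why this format can stay in SSE2 mode.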

john_x86 -test -format=md5-gen -subf=md5_gen(1043)
Benchmarking: md5_gen(1043) md5_gen(1043): md5(md5($p).$s) [128x1 (MD5_Go)]... DONE
Many salts:     4128K c/s
Only one salt:  1890K c/s

john_sse2 -test -format=md5-gen -subf=md5_gen(1043)
Benchmarking: md5_gen(1043) md5_gen(1043): md5(md5($p).$s) SSE2 [SSE2 32x4 (.S)]... DONE
Many salts:     13096K c/s
Only one salt:  4834K c/s

So in this case, we see that the 'only 1 salt' speed is pretty much a wash.  However, the
'many salts' speed has gone from 1900k to 4100k for non-SSE, and from 5500k to 13100k for SSE2.

NOTE, the above format is actually md5_gen(6) (also md5_gen(7)) format.