These are the "public" functions you can use to support UTF-8 in your own
formats for JtR. All of them are in unicode.c.

If you are not familiar with UCS-2, just consider it a synonym for UTF-16. For
our purposes, they are the same.


Convert from ISO-8859-1 or UTF-8 to UTF-16LE:
=============================================
The source encoding depends on whether the --utf8 flag was given to john at
run time. If the length is exceeded or malformed data is found, the function
returns a negative number telling you how much of the source was read. That
is, a return code of -32 means you have an UNKNOWN number of UCS-2 characters
in the destination buffer [use strlen16() to count them] and you should
truncate your "saved_plain" (if applicable) at 32.

    #include "unicode.h"
    int plaintowcs(UTF16 *dst, int maxdstlen, const UTF8 *src, int srclen);
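
The negative-return convention described above (also documented for
utf8towcs() and E_md4hash() below) could be handled roughly as in this sketch.
The helper name apply_conversion_result() is hypothetical and not part of
unicode.c. It assumes that a non-negative return value means the whole source
was converted, and that the negated return value is an offset into the source
buffer.

```c
#include <assert.h>

/*
 * Hypothetical helper, NOT part of unicode.c.
 *
 * ret         - value returned by plaintowcs() (or a similar function)
 * saved_plain - caller's writable copy of the plaintext, which must hold
 *               at least srclen + 1 bytes
 * srclen      - length of the plaintext that was passed as the source
 *
 * Returns the (possibly shortened) plaintext length.
 * Assumption: ret >= 0 means nothing needs to be truncated.
 */
int apply_conversion_result(int ret, char *saved_plain, int srclen)
{
    int consumed;

    assert(saved_plain != 0);
    assert(srclen >= 0);

    if (ret >= 0)
        return srclen;          /* whole source was accepted */

    consumed = -ret;            /* e.g. ret == -32 gives 32 */
    if (consumed > srclen)      /* defensive clamp, never write past the end */
        consumed = srclen;

    saved_plain[consumed] = '\0';   /* truncate saved_plain at that offset */
    return consumed;
}
```

A format would typically call this right after the conversion, passing the
return value together with its own saved_plain buffer.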


Convert UTF-8 to UTF-16LE:
==========================
The source is always UTF-8. This function is optimised for speed. If the
length is exceeded or malformed data is found, it returns a negative number
telling you how much of the source was read. That is, a return code of -32
means you have an UNKNOWN number of UCS-2 characters in the destination buffer
[use strlen16() to count them] and you should truncate your "saved_plain" at
32.

    #include "unicode.h"
    int utf8towcs(UTF16 *target, int maxtargetlen,
           const UTF8 *source, int sourcelen);


Convert UTF-16LE to UTF-8:
==========================
Currently used in NT_fmt.c in order to avoid using a saved_plain buffer.

    #include "unicode.h"
    char *utf16toutf8(const UTF16 *source);


Return length (in characters) of a UTF16 string:
================================================
The number of octets is the result * sizeof(UTF16).

    #include "unicode.h"
    int strlen16(const UTF16 *str);
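
To illustrate the relation between the character count and the octet count,
here is a minimal stand-in. The names strlen16_sketch() and octets16_sketch()
are hypothetical, and the UTF16 typedef is an assumption (an unsigned 16-bit
type); the real definitions are in unicode.h and unicode.c. The sketch also
assumes the string ends with a 16-bit zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UTF16;     /* assumption: 16-bit unsigned, as in unicode.h */

/* Hypothetical stand-in for strlen16(): counts 16-bit units up to, but
 * not including, the terminating 0x0000. */
int strlen16_sketch(const UTF16 *str)
{
    int len = 0;

    assert(str != NULL);
    while (str[len] != 0)
        len++;
    return len;
}

/* Octets occupied by the string, excluding the terminator:
 * result * sizeof(UTF16). */
size_t octets16_sketch(const UTF16 *str)
{
    return (size_t)strlen16_sketch(str) * sizeof(UTF16);
}
```

The octet count is what a format would normally pass to a hash function that
takes a length in bytes.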


Create an NT hash:
==================
This converts from ISO-8859-1 or UTF-8, depending on the --utf8 option. The
function uses Alain's fast NT hashing if the length is <= 27 characters;
otherwise it uses Solar's MD4. Lengths up to MAX_PLAINTEXT_LENGTH are thus
supported with no hassle for you. If the length is exceeded or malformed data
is found, it returns a negative number telling you how much of the source was
read. That is, a return code of -32 means you should truncate your
"saved_plain" at 32.

    #include "unicode.h"
    int E_md4hash(const UTF8 *passwd, int len, unsigned char *p16);


NOTE that there are more functions available in ConvertUTF.c.original
(ConvertUTF.h.original). If we need any of them, we should copy them to
unicode.c/h and simplify/optimize them.