CyberThreatIntel/101/Yara.md
StrangerealIntel 08c96e4ce8
Create Yara.md
2022-12-26 01:23:27 +01:00

31 KiB

YARA

The following page aims to explain the basics for creating Yara rules, classification and hunting. The learning axis begins with a summary of the documentation and then on tips on everyday cases and on the approach for hunting threat actors.

Generalities

To begin with it, the syntax used is close to the C language, for example, comments are // (single-line) and /* */ (multi-line comment). Each rule begins by the keyword rule is identifed by their rule identifier that the title of the rule.
This identifier can have any alphanumeric character but must replace space by the underscore character or the first character can't be a digit and don't contains one of following reserved keywords.
Yara keywords
all and any ascii at base64 base64wide condition
contains endswith entrypoint false filesize for fullword global
import icontains iendswith in include int16 int16be int32
int32be int8 int8be istartswith matches meta nocase not
of or private rule startswith strings them true
uint16 uint16be uint32 uint32be uint8 uint8be wide xor
Note : Rule identifiers are case sensitive and limited to 128 characters.
The structure of the Yara rule is generally composed of three sections :
  • meta which content all the metadata of the rule
  • strings definition that define the strings which will be used condition part of the rule. Each one have an identifier which begin by "$" character and subject to the same rules as for rule identifier.
  • condition that delimits the conditions on the strings previously defined in the previous section.
rule MyRule : tags
{
    meta: 
    strings:
    condition:
}

Strings

Three different types of string can be used in Yara : text strings, regular expressions and hexadecimal strings.

Text stings

The first type of strings is text based definition. Two cases of encoding are possible ascii and wide this can have modifiers in more theirs strings declarations.
The text strings must have escape sequences with \ like in the language C :
\r Carriage return
\t Horizontal tab
\n New line
\xdd byte in hexa
\" Double quote
\\ Backslash
By default, all the text strings without modifiers are defined as ascii.

Modifiers

Like said previously, a list of modifiers can be used by the analyst for improve the quality of the detection :
  • nocase (Case-insensitive)
This modifier allows to turn into case-insensitive mode (aaa = AAA) in pushing at the end of the text string.
$str_ascii_nocase = "my ascii string" ascii nocase
This can be combine with any other modifier except base64 based modifier : base64 and base64wide (with base 64 string, a series of different upper or lower case letters can change the result into another string than if we are just looking for the same one without a sensitive case)
  • wide
This gives the possibility to parse other types of strings which are encoded with two bytes per character (\xYY\x00). For parsing it, the system will be emulate to UTF-16 strings.
\x45\x76\x69\x6c -> "Evil" ascii
\x45\x00\x76\x00\x69\x00\x6c\x00 -> "Evil" wide
  • xor
In some cases, we have need to detected possibles reused strings that xored on the malicious.
To do this we can write to have each byte XOR applied to the string this way :
$str_xor = "Evil" xor
In this way, this is equivalent to testing all possibilities from 0x0 to 0xff.
This possible since YARA 3.11 to specify a range of values to test as in the following example (here between 0x10 and 0xa - xor(min-max) ):
$str_xor = "Evil" xor(0x10-0xa)
Note : You can test both ascii and wide in same time in using the modifiers ascii wide but don't forget that in terms of optimizations, it's better to test a value already with an xor than generated with an xor modifier.
  • fullword
In the case where we need to match only when the string is fully contained and breakable.
By the way, this can be useful if you know that the attacker uses declination of a specifiatic and compound string.
$str_full = "legit" fullword

In this example, you can observe that the string works with legit-software , legit-offers but not with mylegitoffer.

  • base64
By default, this search the first three permutations in base64 for the text string.
$str_b64 = "iex($payload)" base64
This possible to use a custom alphabet :
$str_b64 = "iex($payload)" base64("3GHIJKLMNOPQRSTUb=cdefghijklmnopWXYZ/12+406789VaqrstuvwxyzABCDEF5")
Important : YARA strips the leading and trailing characters in the beginning and the end of the string thus a rule will be valid for strings having a common section (axxxa = cxxxc).

Regex (Regular expressions)

For more flexibility in detecting an attacker's recurring pattern, it's possible to use regex with Yara. The syntax used is close to the PERL language but has some limitations with POSIX character classes and backreferences.
In the following case, I recommend the regex101 site which allows both to check the quality of our regrex and to have solid educational documentation on each element. It is also possible to use drag & drop to quickly construct and verify each regex.
This allows the following metacharacters :
\ Quote the next metacharacter
^ Match the beginning of the file
$ Match the end of the file
| Alternation
() Grouping
[] Bracketed character class
The list of the quantifiers this can be used are :
* Match 0 or more times
+ Match 1 or more times
? Match 0 or 1 times
{n} Match exactly n times
{n,} Match at least n times
{,m} Match at most m times
{n,m} Match n to m times
Note : Perl regular expressions are greedy, meaning they will match which all the data possible before a new line, using ? at the end of the quantifier allows to cut as quickly as possible once the regex triggers.
Following escape sequence and character classes can be used in the regex :
\t Tab (HT, TAB)
\n New line (LF, NL)
\r Return (CR)
\xnn Character whose ordinal number is the given hexadecimal number
\w Match a word character (alphanumeric letter)
\W Match a non-word character
\s Match a whitespace character
\S Match a non-whitespace character
\d Match a decimal digit character
\D Match a non-digit character
The regex must be declared with a couple of / like the following exemple :
$s = /{host : \w{3,16},os : \w{3,9}, av : [a-zA-Z]{3,32}}}/

Hexadecimal sequences (traits)

In some cases, it's necessary to detect similiar partitions of asm codes or which may present several alternatives.
First, this possible to use wildcard ? for specify portions of hexadecimal values that can vary and should match with anything.
As example, we can see that it's possible to put wildcards on part of a hex value :
$str_hex = { FA ?1 ?? 7? 34 }
Another highlight is the ability to jump groups of hex values in specifying the length ranges, this particularly useful for joining parts of reused codes, since the compliers each have different in terms of optimization or way of compiling the code, the opcodes will be different and as a result the hexadecimal sequences (traits) will have more or less important disparities even if we have the same original code.
In this example, we can see two similiar parts of codes FA 21 44 and 77 98 34 that have a different sequence of hex code inside, for this, we chose to specify the jump interval with the smallest and largest jump value, i.e. 2 and 4 ([min-max]).
FA 21 44 [2-4] 77 98 34
FA 21 44 XX XX 77 98 34
FA 21 44 YY YY YY YY 77 98 34
Note : Since YARA 2, this possible to use infinite range inno specify the min or/and max range ([-], [x-],[-y]), the parser will continue until this match on the next block of the identifier. This method is interesting the case of the attacker insert lot of spaces or junk code but this consume lot of ressources and aren't optimized in the fast approach for real time.
Another point of interest is the ability to switch with mutliple cases for like on a regex aproach in using multiple jumps :
Here, we have an attacker which use as persistence folder a set of keywords ìnstall,word2021 and excel2021 in the system folder, we can concatenate with a jump the commun part \xxxx2021 ( need to have hex code in the both side of the jump, explains why use \) and push the different cases in the switch state with different "OR" conditions.
25 53 79 73 74 65 6d 52 6f 6f 74 25 5c 53 79 73 74 65 6d 33 32 ( 5c 69 6e 73 74 61 6c 6c | 5c [4-5] 32 30 32 31 ) 5c

-> %SystemRoot%\System32(\install|\[4-5]2021)\

That's valid for :

%SystemRoot%\System32\install\
%SystemRoot%\System32\Word2021\
%SystemRoot%\System32\Excel2021\

Condition

Conditions content the same syntaxe that the C# language :
  • The typical Boolean operators : and, or and not
  • Relational operators of comparaision : >=, <=, <, >, == and !=
  • Arithmetic operators +, -, *, \, %
  • Bitwise operators &, |, <<, >>, ~, ^
note: this is possible to check a specific count of the repetion of a sequence in push a # behind a variable like this #a = 10
Sometimes we need to check that the location of some of the strings that the rule must detect is at these precise positions in the data stream, for this the "at" operator allows to define this location
In all the examples above, the number of strings have been specified by a numeric constant, but any expression returning a numeric value can be used. The keywords any and all can be used as well.
all of them       // all strings in the rule
any of them       // any string in the rule
all of ($a*)      // all strings whose identifier starts by $a
any of ($a,$b,$c) // any of $a, $b or $c
1 of ($*)         // same that "any of them"
It's also possible to define where one of the conditions may be correct for the rule to work, for this we needs to set the string to a specific offset on the file or to a virtual address in the address space of the process in using the at keywork like this :
condition:
        $s at x 
note: this also possible to use a specific range by this time in keyword
$s in (x..y)
  • Filesize
  • One of these special variable filesize allows the size of the file as a required condition to valid.
    This able to support a postfix of a value of the constant of 1024 (KB,MB,GB...).
  • Filetype
  • As for the previous case, it is also possible to use the magic numbers (onstant numerical or text value used to identify a file format or protocol) to verify that it is the correct type of file.
    This identifier used the unsigned integers (16 based for 2 bytes and 32 for 4 bytes) for compared to the magic number. This used the Big endian, the blocks of bytes must inversed between them (abcd -> dcba).
    Here a list of usual identifier types of files :
    Filetype Magic numbers
    MSDOS executable uint16(0) == 0x5a4d
    JAR uint32(0) == 0x04034b50
    bmp uint16(0) == 0x4d42
    class uint32(0) == 0xbebafeca
    jpg uint16(0) == 0xd8ff
    Postcript uint16(0) == 0x53502125
    Docx/Xlsx/PPTX/Zip uint32(0) == 0x04034b50
    Microsoft Database uint32(0) == 0x6e617453
    Xls/Doc/PPT uint32(0) == 0xe011cfd0
    PDF uint32(0) == 0x46445025
    CAB Installer uint32(0) == 0x4643534d
    MSI uint32(0) == 0xe011cfd0
    RAR uint32(0) == 0x21726152
    EPS uint32(0) == 0x53502125
    SQLite Database uint32(0) == 0x694c5153
    Mach-O (x86) uint32(0) == 0xcefaedfe
    Mach-O (x64) uint32(0) == 0xcffaedfe
    DEX uint32(0) == 0x0a786564
    RTF uint32(0) == 0x74725c7b
    Compressed ISO image uint32(0) == 0x215a7349
    ISO image uint32(0) == 0x30304443
    This possible to use the following recipe on cyberchef for generate your custom identifier :

    https://gchq.github.io/CyberChef/#recipe=Swap_endianness('Hex',{2|4},true/disabled)&input={input data in base64}

    Using modules

    The YARA core allows to load an additional modules for improve functionalities of your Yara rules.
    The modules must be declared by the import statement before used later like this :
    import pe
    import elf
    ...
    

    PE module

    The module allows to analyze the characteristics of the PE file, the most common variables are following :
    Variable Description
    pe.entry_point Confer the address of the entrypoint of the PE
    pe.sections.name Give the section name
    pe.number_of_sections Give the number of sections in the PE file
    pe.pdb_path Path of the PDB file for this PE if its present (Yara 4.0.0)
    pe.imphash() Function returning the import hash or imphash for the PE file
    pe.exports(x) Function returning true if the PE exports match with regex, string or ordinal in the argument
    pe.imports(x) Function returning true if the PE imports match with regex, string or ordinal in the argument
    These variables allow to create conditions on a specific sections names for detect packers, pdb paths...

    Entrypoint

    Sometimes, it's necessary to specify conditions on the entrypoints like a specific address of a data or a pattern of data. These approaches are used for detect some patterns basically at the entrypoint to detect packers or simple file infectors.
    rule myrule
    {
        strings:
            $a = { 9c 50 66 a1 [3] 00 66 a9 [2] 58 0f 85 }
        condition:
           $a in (pe.entry_point..pe.entry_point + 100)
    }
    

    Certificates

    Sometimes, attackers can use signed binaries or script, this possible to hunt by the properties of these certifications. Here, the most common variables that can be used :
    Variable Description
    pe.signatures.subject Allow to search on the subject of the certificate
    pe.signatures.issuer Looking on the issuer field
    pe.signatures.serial Give the serial identifier
    pe.number_of_signatures Number of authenticode signatures found in the PE file
    For example, in the following case, the rule allow to check if one of the signatures contains a serial number of compromised certificates used by threat actors for PE files.
    Note: Binaries can have multiple authenticode signatures, to do this we use for structure for any i in x : (similar pyhthon language)
    import "pe"
    rule CERT_APTX_Mal_Cert_Nov_2022_1 {
       meta:
          description = "Detects a compromised certificate used by the APTX"
          date = "XXXX-XX-XX"
          hash = "-"
       condition:
          uint16(0) == 0x5a4d and
          for any i in (0 .. pe.number_of_signatures) : (
             // check the issuer
             pe.signatures[i].issuer contains "COMODO RSA Code Signing CA" and
                (  // Check with the list of compromissed certificates
                   pe.signatures[i].serial == "1c:25:43:56:2c:2d:05:34:03:1a:77:cb:49:22:4c:cc" or 
                   pe.signatures[i].serial == "4c:78:75:76:45:2d:06:15:30:98:75:ac:49:67:4d:c2"
                )
          )
    }
    

    Referencing other rules

    One of great point of Yara is the fact to allow to use as reference to a condition an another rule as building blocks for a chain of complex technics approach or specific artifacts.
    The rule must be previously defined and called by their name of rule like this :
    rule myrefrule
    {
        strings:
            $s = "artifact"
        condition:
            $s
    }
    rule hunt_rule
    {
        strings:
            $x = "password"
        condition:
            $x and myrefrule
    }
    

    Tricks

    Dotnet malware

    C# malware can be easily translated into source code by tools like DNSpy, ILSpy, this possible to point to its location of the file using the shortcut crtl+x or by GUI.
    Once the pointer go to location the code in the file, you can copy/paste the hex sequences with the same structure of code of the different payload for make a hex sequence with jump or wildcard of the different value of opcodes due to different compiler used for built the payloads (different compiler have their own optimisation operations when there translate the source code to asm code).

    Hunting ATM malwares

    ATMs need XFS middleware that provides client-server architecture for financial applications on Microsoft Windows platform and allow to communicate with service providers via the XFS Manager using a set of API.
    These largely check for the presence of MSXFS.dll and instructions to devices via WFSExecute or lock/open with WFSLock/WFSOpen.
    Thus, searching via these strings (in ascii, base64...) can be a good start for the search for new ATM malware, it should not however be forgotten that these strings can also be legitimate and be the MSXFS dll used in the ATM (it has already been noted to me that these legitimate dlls had been submitted to Virustotal/Hybrid Analysis).
    It's also possible to emulate this proprietary dll with the following project : here
    For more information about the structure of ATMs, Trend Micro has published an excellent article on this subject here

    Translate asm code

    Sometimes, it isn't possible to have the samples because they aren't shared by TLPs or we are looking to find equivalent samples on a different architecture from our samples. This possible to use a online translator for translate the instructions as asm code of the chosen architecture.
    Here, on the folowing example, we can translate to the hex code the instruction shared o nthe proofpoint article :

    Note: Some SOC based architecture like MIPS, ARM use a stack of strings for optimize the quality of the programms, in this case, the structure of the code can be the same but the hex code will be greatly different between each compiler and therefore difficult to make a rule with hex sequences

    What approach for creating my rule ?

    The first question to ask when you want to create a yara rule is what is the purpose of what you want to detect, several possibilities are available by:
    • Hunting : The purpose of this type of rule is to detect specific methods such as MITER ATTACK-related, specific functions or artifacts to proactively/actively detect incidents or malware evolutions or the appearance of new malware.
    • Attribution : The goal is to define a rule for detecting a specific malware or threat actor.
    • Chain of rules : This rule must be include like a module for detecting a particular case on a common basis. For example, let's imagine in a case of leaks from builders like for conti and babuk, we had artifacts linked to an actor (TA) which exploits these leaks and already made a valid rule to detect this ransomware, it is then interesting to add a reference in the conditions (cf Referencing other rules) so that a chain of rules must be valid (ransomware which leaked and the artifacts of the TA) rather than making a rule again for this slightly modified malware.
    Once, the type on what kind of rule has been defined, we have different approaches available :
    • Strings : the most basic approach, based to string although this method allows to detect possible samples on various architectures that a hexadecimal sequence approach doesn't allow (because specific to theirs specific architectures), any good threat actor will delete the debug comments or specific artifacts reported by articles of threat intelligence for improve their stealth.
    • Regex : Regular expression give more mobility on your yara rule although still as affected like a string approach, this approach is particularly effective against strings where the content of can easily be altered by the attacker.
    • Hex sequences: As previously explained similarity in code structures having common opcodes can be a great tool in the context of a specific assignment of a Yara rule to a particular malware. However, in addition to being case sensitive with different architectures, it needs many different samples to refine with different compilers (even with different versions of the compiler) for be precise, that a cost or for example, not possible due to you have catch an only one (first sample detection).
    The final rule can of course include several different approaches depending on the needs, however we must not forget that it is the intelligence that is put in the rule (what does it detect?, what information should we draw? , what strategic advantage does it bring?) that will make the difference depending on the context (for an incident response context, pro-active research, monitoring of the evolution of TA capabilities, etc.)

    Tools

    The following section presents tools that I commonly use in implementation or research and being free to the infosec community.

    Yargen

    A tool (create by Florian Roth) to generate Yara rules from sample strings submitted as script arguments. Although often deprecated because it is based only on strings and therefore quickly generates obcel rules, this can bring added value in cases where Yara rules have to be made for the search for new samples for an approach via reuse code or find residual artifacts or configuration at the same TA in a mass of samples.
    Repository : here

    Binlex

    This tool created by 3rb3ru5d3d53c allows to extract hex sequences (reuse code) in common with several samples put in argument of the script.
    In this following example, we run binlex and select the traits between 8 and 16 bytes (sort the unique traits) :
    binlex -m auto -i {path} | jq -r 'select((.size > 8 and .size < 16) and (.bytes_sha256 != .traits.sha256)) | .trait' | head -10
    33 c0 39 81 39 50 75 12
    45 00 00 ?? ?? 00 74 1f
    ...
    
    This also possible to generate a Yara rule with the traits found with binlex, its therefore necessary to pass the output in a pipe to blyara for obtain our Yara rule :
    binlex -m raw:x86 -i {path} | jq -r 'select(.size > 16 and .size < 32) | .trait' | blyara --name myrule -m author {author} -m tlp clear -c 3
    rule myrule {
        metadata:
            author = "{author}"
            tlp = "clear"
        strings:
            trait_0 = {8b 01 d6 31 ff 31 8b 42 ?? 01 d0 8b 40 ?? 85 c0 74 4c}
            trait_1 = {49 8b 34  c0 c1 cf ?? 52 57 8b 52 ?? ac 01 c7 38 e0 75 f4}
            trait_2 = {00 00 6a 00 e8 67 00  6a ?? 56 57 68 ?? ?? f8 00 7e 36 ?? ?? ff d5 83 }
        condition:
            3 of them
    }
    
    The output content of binlex being in JSON, it can easily be integrated into trait databases, shared with collaborators...
    binlex -m auto -i {path} -c malware -g {tag} | head -1 | jq
    {
      "average_instructions_per_block": 29,
      "blocks": 1,
      "bytes": "ca 1c 89 45 fc 33 d2 8b 45 fc 59 f7 f1 89 45 fc 81 75 fc 3c 95 0e 00 c7 45 f4 0b ff 81 75 f8 30 f4 44 00 c7 45 fc 22 51 53 00 8b 45 fc 6a 3f 59 f7 f1 6a",
      "bytes_entropy": 5.778987,
      "bytes_sha256": "176cdc211b579ea4395b5b0f24f31a54caa02dcf55fd799b34c0eacff4e7ee99",
      "corpus": "malware",
      "cyclomatic_complexity": 3,
      "edges": 2,
      "file_sha256": "7b01c7c835552b17f17ad85b8f900c006def811d708890b5f49f231448aaccd3",
      "file_tlsh": "42E34A10F3D341F7DC9608F219B6B22F9F791E023124DFA987981F57ADB5246A2B981C",
      "instructions": 29,
      "invalid_instructions": 0,
      "mode": "pe:x64",
      "offset": 49711,
      "size": 118,
      "tags": [
        "{tag}"
      ],
      "trait": "1c 83 65 55 8b ec 83 ec ?? ?? cf 4a 88 89 ?? ?? ?? 81 ff 4d ?? 75 21",
      "trait_entropy": 3.45690,
      "trait_sha256": "e13464de470490ab00ab77bc8969b0a2a52be369b61f4dcd16758608cc49e48e",
      "type": "block"
    }
    
    Repository : here
    An introduction To Binlex presented by c3rb3ru5d3d53c is avialable here

    Yobi

    Created by imp0rtp3, this allow to add as module of firefox an yara scanner on the web page loaded in the browser with the rules imported on the preferences.
    Although it doesn't prevent the execution of the code, this one is useful with a bot for proactive research of skimmer, javascript loader which scans lists of domains suspected of being used by threat actors.
    Repository : here

    Cyberchef

    Needless to present and famous Swiss army knife that is Cyberchef, created by Cyber Swiss Army Knife, a module is available to test Yara rules.

    Note: Rather than creating a Yara rule for an archive such as an office document or a zip, it's possible to use cyberchef's Unzip module to decompress the content and test your rule (unzip content can be compressed and so don't match with your traits in the Yara rule.

    Yara scanner

    Same functionality as the previous tool but allows you to scan with yara rules on a mass of files or processes in memory.
    In this case, we can use the -r argument for recursive between the folders the scan :
    .\yara64.exe {Path_Rule} -r {Path_Samples}
    RAN_Royal_Rumble_Dec_2022_1 {Path_Samples}\250bcbfa58da3e713b4ca12edef4dc06358e8986cad15928aa30c44fe4596488.bin
    RAN_Royal_Rumble_Dec_2022_1 {Path_Samples}\7cbfea0bff4b373a175327d6cc395f6c176dab1cedf9075e7130508bec4d5393.bin
    RAN_Royal_Rumble_Dec_2022_1 {Path_Samples}\de025f921dd477c127fba971b9f90accfb58b117274ba1afb1aaf2222823b6ac.bin
    
    Its also possible to display the strings that match the rule with the -s argument, however the condition must be valid to be taken into account. Therefore, in the perspective of a debug, put the condition to "1 of them" for see that identifiers don't works of the rule.
    .\yara64.exe {Path_Rule} -r {Path_Samples} -s 
    RAN_Royal_Rumble_Dec_2022_1 {Path_Samples}\7cbfea0bff4b373a175327d6cc395f6c176dab1cedf9075e7130508bec4d5393.bin
    0x4b12d:$s1: 83 C4 04 8D 85 D4 EF FF FF 83 BD E8 EF FF FF 08 0F 43 85 D4 EF FF FF 6A 00 6A 00 6A 02 6A 00 6A ...
    0x4c763:$s2: 68 00 02 00 00 8D 84 24 DC 47 00 00 6A 00 50 E8 A9 3B 12 00 83 C4 0C 8D 84 24 D8 47 00 00 68 E8 ...
    0x4df96:$s3: 68 B0 F4 60 00 FF 15 54 51 59 00 50 68 B0 F4 60 00 57 E8 B3 12 00 00 6A 00 6A 00 6A 00 57 E8 27 ...
    0xbb7e1:$s4: 50 68 A2 57 59 00 57 68 28 14 5C 00 56 B3 01 E8 8B 1E FB FF FF 74 24 2C C7 44 24 2C 00 00 00 00 ...
    RAN_Royal_Rumble_Dec_2022_1 {Path_Samples}\de025f921dd477c127fba971b9f90accfb58b117274ba1afb1aaf2222823b6ac.bin
    0x4af8e:$s1: 83 C4 04 8D 85 DC EF FF FF 83 BD F0 EF FF FF 08 0F 43 85 DC EF FF FF 6A 00 6A 00 6A 02 6A 00 6A ...
    0x4c153:$s2: 68 00 02 00 00 8D 84 24 DC 47 00 00 6A 00 50 E8 E9 4D 12 00 83 C4 0C 8D 84 24 D8 47 00 00 68 70 ...
    0x4ddb2:$s3: 68 20 98 60 00 FF 15 54 01 59 00 50 68 20 98 60 00 57 E8 E7 11 00 00 83 C4 0C 6A 00 6A 00 6A 00 ...
    ...
    
    Repository : here
    Note: its possible to do the same thing that cyberchef in extracting the archives in using the yextend module (here).

    MalwareBazaar

    MalwareBazaar is a project operated by abuse.ch, allow to shared samples and submit Yara rule on your account and be notified when a sample match your yara rules.
    It isn't possible to do retrohunting this way however some additional projects like YARA Scan Service from riskmitigation.ch allows to do a retrohunt the samples submitted on the MalwareBazaar platform.