Since around February 2020,[1] a multitude of adversaries have been abusing an old feature of Microsoft Excel as a novel malware delivery method. The Excel 4.0 macros (XLM) feature was introduced in Excel version 4.0 back in 1992.[2] This style of macro predates the also commonly abused Visual Basic for Applications (VBA) macros. Some of the early adopters of this variation of the technique[3] were found to deliver Zloader[4] and Dridex[5]. As time went on, many different adversaries adopted this technique. Quite a bit of research[6] has been done to extract[7] and analyze the contents of the macros to find payloads and callback URLs. Building on that research, what is presented here is a deep dive into how to detect the presence of these macros in an Excel compound file in the first place. There are three basic categories of indicators which can be identified: (i) the beginning of file (BOF) record, (ii) the boundsheet record, and (iii) a property record found in the document summary information stream. While conducting this research, it became apparent how hard it can be to differentiate the various flavors of Microsoft compound files from one another. Therefore, as a bonus, a method for identifying Excel files among the multitude of other compound files is also detailed.
1. Beginning of File (BOF)
The examination of the BOF record requires some attention to history. The specification that this record is a part of is called the Binary Interchange File Format, or BIFF.[8] Looking back at the earlier file formats for Excel documents, one realizes that the first bytes of the BOF record used to be the "magic number"[9] for these earlier Excel file versions. This can be seen in the progression of the first eight bytes of the reference Excel documents from Open Office's test document set.[10] This progression is shown in Figures 1 through 4, with Figure 4 showing the current magic number for Microsoft compound files: D0CF11E0 A1B11AE1. All four are shown in the Hex Fiend[11] hex editor.
Figure 1: BIFF Version 2 Magic Number
Figure 2: BIFF Version 3 Magic Number
Figure 3: BIFF Version 4 Magic Number
Figure 4: BIFF Version 5 and Above Compound File Magic Number
These early Excel magic numbers vary somewhat based on certain features of the file or version of Excel.[12] The newer compound file basically became a wrapper for the older spreadsheet encoding format. In addition to the specification document provided by Microsoft, another great reference is the Library of Congress's file format description of BIFF.[13]
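The compound file magic number from Figure 4 can be tested in a few lines of Python. This is a minimal sketch for illustration only; the function names `is_compound_file` and `is_compound_file_le` are hypothetical. The second form reads the same eight bytes as two little-endian 32-bit integers, which is how a YARA condition built from `uint32()` sees them.

```python
import struct

# The eight-byte magic number of a Microsoft compound file.
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def is_compound_file(data: bytes) -> bool:
    """Return True when the buffer starts with the compound file magic number."""
    return data[:8] == CFB_MAGIC


def is_compound_file_le(data: bytes) -> bool:
    """Same test, expressed as two little-endian 32-bit reads."""
    if len(data) < 8:
        return False
    first, second = struct.unpack_from("<II", data, 0)
    return first == 0xE011CFD0 and second == 0xE11AB1A1
```

Both functions should agree on any input; the byte-swapped constants in the second one are simply the same magic number as seen through little-endian integer reads.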
To get a closer look at exactly what the components of a BOF record are and what they mean, without having to carefully read the specification, one needs a tool that can parse and display the data. Each of the tools that fill this role has pluses and minuses, so each is outlined here. The first is OffVis. This is a combination hex editor and compound file parser. It is an older Microsoft product which is still available for download.[14] A newer version of OffVis is also available on GitHub, but because it is distributed in binary form and not directly from Microsoft, appropriate caution should be exercised.[15] The global BOF record as parsed by OffVis is highlighted in Figure 5.
Figure 5: Global BOF Record in OffVis
As seen above, the BOF record starts with the bytes 0908. The next tool which one can use to parse and view a BOF record is BiffView++.[16] An example of a global BOF record from a BIFF version 8 file is shown in Figure 6.
Figure 6: Global BOF Record in BiffView++
Lastly, oledump can be used along with the plugin for BIFF parsing to output the bytes for each BOF record in an Excel document. The output from this command line tool is shown in Figure 7.
Figure 7: BOF Record Output from oledump
For the purposes of detecting Excel 4.0 macros, the global BOF record as seen above is not the interesting one. Instead, the BOF for the particular substream containing the macros is the one to find and parse. The following description of the BOF record is condensed from pages 43 and 44 of the BIFF specification documentation.[17]
In BIFF versions 5, 7, and 8, the BOF record begins with the same bytes, 0908, which are best seen in the screenshot from OffVis above. The next two bytes of the BOF record are the length in bytes of the record. In the example below, these are the third and fourth bytes (08 00).
09 08 08 00 00 05 10 00 6C 09 C9 07
The BOF record in BIFF 5 and 7 has a length of 8 bytes. In the newer BIFF 8, the length of the BOF record can be either 8 or 16 bytes. The next two bytes are the version with 0005 representing BIFF 5/7 and 0006 representing BIFF 8. Taken together, the first six bytes of the BOF record are fixed and vary only by the BIFF version.
The next two bytes are the important ones for identifying Excel 4.0 macros. These bytes represent the substream type with 4000 indicating an Excel 4.0 macro sheet. In the case of a 16 byte BIFF 8 BOF record, this yields the byte string shown in Figure 8.
Figure 8: First Bytes of 16 Byte BIFF8 BOF Record
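The field layout described above can be sketched as a small parser. This is an illustration under stated assumptions, not the author's tooling: the function name `parse_bof` and the dictionary keys are hypothetical, and the substream type names other than the macro sheet value are taken from the BIFF specification rather than from this article. All fields are read as little-endian 16-bit integers, so the byte pair 40 00 becomes the value 0x0040.

```python
import struct

# Substream type values from the BIFF specification (labels are this sketch's own).
SUBSTREAM_TYPES = {
    0x0005: "workbook globals",
    0x0006: "Visual Basic module",
    0x0010: "worksheet or dialog sheet",
    0x0020: "chart",
    0x0040: "Excel 4.0 macro sheet",
    0x0100: "workspace",
}


def parse_bof(record: bytes) -> dict:
    """Decode the first eight bytes of a BIFF 5/7/8 BOF record."""
    rec_type, length, version, substream = struct.unpack_from("<HHHH", record, 0)
    if rec_type != 0x0809:  # stored on disk as the bytes 09 08
        raise ValueError("not a BIFF 5/7/8 BOF record")
    return {
        "length": length,          # 8, or 16 in some BIFF 8 files
        "version": version,        # 0x0500 for BIFF 5/7, 0x0600 for BIFF 8
        "substream": substream,
        "substream_name": SUBSTREAM_TYPES.get(substream, "unknown"),
        "is_xlm_macro_sheet": substream == 0x0040,
    }


if __name__ == "__main__":
    # The example record shown earlier in this section.
    print(parse_bof(bytes.fromhex("09 08 08 00 00 05 10 00 6C 09 C9 07")))
```

Run against the example record from earlier in this section, the parser reports a BIFF 5/7 record that is not a macro sheet, which is why the substream type field is the one that matters for detection.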
The 16 byte BOF record has a few additional fields, the final one recording the last save flags. The very last bytes of this additional information are reserved and according to the specification document must be zero. These null bytes are shown by OffVis in Figure 9.
Figure 9: Trailing Reserved Bytes in BOF Record
These last two trailing null bytes can be incorporated into the YARA signature for the 16 byte BOF to decrease false positives. The final string for that particular signature is the following.
$bof = { 09 08 10 00 00 06 40 00 [10] 00 00 }
All of the YARA signatures based on these BOF variations are provided at the end of the blog.
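For readers who want to experiment outside of YARA, the three BOF byte strings can be approximated with Python regular expressions. This is a rough sketch, assuming the whole file fits in memory; `BOF_PATTERNS` and `find_macro_bofs` are hypothetical names. The `.{10}` in the last pattern plays the role of the `[10]` jump in the YARA string.

```python
import re

# Byte patterns mirroring the three BOF signatures for Excel 4.0 macro sheets.
BOF_PATTERNS = {
    "BIFF5/7, 8-byte": re.compile(rb"\x09\x08\x08\x00\x00\x05\x40\x00"),
    "BIFF8, 8-byte": re.compile(rb"\x09\x08\x08\x00\x00\x06\x40\x00"),
    "BIFF8, 16-byte": re.compile(
        rb"\x09\x08\x10\x00\x00\x06\x40\x00.{10}\x00\x00", re.DOTALL
    ),
}


def find_macro_bofs(data: bytes) -> list:
    """Return (offset, label) pairs for every macro sheet BOF candidate."""
    hits = []
    for label, pattern in BOF_PATTERNS.items():
        hits.extend((m.start(), label) for m in pattern.finditer(data))
    return sorted(hits)
```

As with the YARA strings, a hit is only a candidate: the same bytes can occur by chance in unrelated data, which is why the rules also require the file to be identified as Excel.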
2. Boundsheet Record
This record is another component of the BIFF specification. Details can be found in the documentation referenced above on page 48. This record contains the name of a sheet in the Excel file along with the sheet type and the position of the sheet within a stream. There are two basic variations of this record, one for BIFF 7 and the other for BIFF 8. The bytes that are relevant for detection are the first ten and they are the same in both versions. The difference between the versions is in how the name of the sheet is encoded and stored. The later BIFF 8 version has bytes that support Unicode strings. The boundsheet record also records how many characters there are in the sheet name, but with a slight difference between the versions. The older version, BIFF 7, stores the length of the sheet name in bytes, and the newer version stores the length of the sheet name in characters. Both of these bits of information have important ramifications for building a byte string YARA signature which will be detailed below.
The boundsheet record can be viewed in OffVis under "Globals" towards the very bottom of the list near the end of file (EOF) record. There is one boundsheet record for each sheet in an Excel file. An example is shown in Figure 10.
Figure 10: Boundsheet Record in OffVis
This same boundsheet is seen in Figure 11, but parsed using BiffView++.
Figure 11: Boundsheet Record in BiffView++
Finally, Figure 12 shows the same boundsheet record as parsed by oledump.
Figure 12: Boundsheet Record in oledump
The first two bytes 8500 indicate the start of the boundsheet record. The next two bytes are the overall length of the boundsheet record including the name of the sheet, which is variable. There are, however, boundaries on this number. The sheet name can only be 31 characters long. Even if every character were encoded using four bytes, the longest possible length still fits in one byte. Therefore the second byte of this length is always zero. A boundsheet record for a sheet named "Sheet1" is shown below; the length of the record is stored in the third and fourth bytes (0E 00).
85 00 0E 00 99 0C 00 00 00 00 06 00 53 68 65 65 74 31
In addition to the second byte always being zero, the first byte also cannot be zero; the rule requires it to be at least 8 and no more than 136 (0x88). Therefore, to reduce false positives in the YARA signatures for boundsheets, the following two constraints are added to the condition section of the rule with the string named "$bs".
uint8(@bs + 2) >= 0x8 and uint8(@bs + 2) <= 0x88
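To make the layout concrete, the "Sheet1" record above can be decoded field by field. This is a minimal sketch that assumes the BIFF 8 layout, in which a one-byte option flag precedes the sheet name and selects between one-byte and two-byte characters; that detail comes from the BIFF specification, not from this article. The function name `parse_boundsheet` and the dictionary keys are hypothetical.

```python
import struct


def parse_boundsheet(record: bytes) -> dict:
    """Decode a BIFF 8 boundsheet record, including its four-byte header."""
    rec_type, length = struct.unpack_from("<HH", record, 0)
    if rec_type != 0x0085:  # stored on disk as the bytes 85 00
        raise ValueError("not a boundsheet record")
    # Stream position (4 bytes), state, sheet type, and name length in characters.
    position, state, sheet_type, name_chars = struct.unpack_from("<IBBB", record, 4)
    flags = record[11]                 # BIFF 8 only: bit 0 set means UTF-16 name
    raw_name = record[12:4 + length]   # the length field excludes the header
    encoding = "utf-16-le" if flags & 0x01 else "latin-1"
    return {
        "length": length,
        "length_plausible": 0x08 <= length <= 0x88,  # same bounds as the condition
        "position": position,
        "state": state,
        "sheet_type": sheet_type,      # 0x01 marks an Excel 4.0 macro sheet
        "name_chars": name_chars,
        "name": raw_name.decode(encoding),
    }


if __name__ == "__main__":
    sheet1 = bytes.fromhex("85 00 0E 00 99 0C 00 00 00 00 06 00 53 68 65 65 74 31")
    print(parse_boundsheet(sheet1))
```

The `length_plausible` field applies the same lower and upper bounds as the YARA condition, so a record that fails it would also be rejected by the rule.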
The next four bytes are the stream position and are highly variable. Therefore, they are left as a jump in the byte string. The next byte signifies the state of the boundsheet: visible, hidden, or very hidden. In the byte string in the YARA rules, this byte is included in the stream position jump. The reason for this is some malicious Excel files observed in the wild have unused bits set that are part of this byte. Therefore, to test for this byte properly, it must be done in the condition by taking the bitwise AND of the byte and the mask shown in the documentation (0x3) then comparing that with 0x0 for visible, 0x1 for hidden, and 0x2 for very hidden. An example of this condition that tests for visible boundsheets is shown below.
uint8(@bs + 8) & 0x3 == 0x0
Because these bits have so far only been observed in malicious files, an additional test can be applied in the condition to identify these suspicious files as shown below. This tests whether the byte is larger than 2, signifying that additional reserved bits have been set.
uint8(@bs + 8) > 0x2
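The masking logic can be illustrated with a short helper. This is a sketch only, with hypothetical names (`VISIBILITY`, `describe_state`). It reports both the plain `> 0x2` comparison used in the condition above and a direct test of the upper six bits; the two differ only for the value 0x3, where both low bits are set but no reserved bit is.

```python
# Visibility values selected by the low two bits of the state byte.
VISIBILITY = {0x0: "visible", 0x1: "hidden", 0x2: "very hidden"}


def describe_state(state_byte: int) -> dict:
    """Split a boundsheet state byte into visibility and reserved-bit checks."""
    return {
        "visibility": VISIBILITY.get(state_byte & 0x03, "undefined"),
        "greater_than_two": state_byte > 0x02,        # the comparison shown above
        "reserved_bits_set": bool(state_byte & 0xFC),  # upper six bits only
    }
```

For example, a state byte of 0x41 decodes as a hidden sheet with reserved bits set, which is the suspicious combination described above.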
The next byte is the critical one for detecting Excel 4.0 macros. This is indicated by this byte being set to 0x1. Tying all this information together yields the following example YARA rule for detecting a hidden sheet with Excel 4.0 macros with the suspicious bits set. This is shown in Figure 13. The full set of rules for the three sheet states is provided at the end of this blog.
Figure 13: YARA Rule for Detecting Excel 4.0 Macros in a Hidden Sheet with Suspicious Bits
The first bytes of the string used in this signature are shown in Figure 14. Each component in the string is labelled.
Figure 14: First Bytes of Boundsheet Record
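The pieces of the boundsheet signature can be combined into a small Python scanner that mirrors the byte string and conditions used in the rules at the end of this blog. It is an approximation for experimentation, not a replacement for the YARA rules; `find_macro_boundsheets` and the result keys are hypothetical names. A lookahead is used so that every offset is tested, including overlapping candidates.

```python
import re

VISIBILITY = {0x0: "visible", 0x1: "hidden", 0x2: "very hidden"}

# 85 00, one length byte, 00, five skipped bytes (position and state), then type 01.
BOUNDSHEET_MACRO = re.compile(rb"(?=\x85\x00.\x00.{5}\x01)", re.DOTALL)


def find_macro_boundsheets(data: bytes) -> list:
    """Return candidate boundsheet records that declare an Excel 4.0 macro sheet."""
    results = []
    for match in BOUNDSHEET_MACRO.finditer(data):
        offset = match.start()
        length = data[offset + 2]
        state = data[offset + 8]
        if not 0x08 <= length <= 0x88:
            continue  # length outside the plausible range
        if (state & 0x03) not in VISIBILITY:
            continue  # low two bits must be 0, 1, or 2
        results.append({
            "offset": offset,
            "visibility": VISIBILITY[state & 0x03],
            "reserved_bits_set": bool(state & 0xFC),
        })
    return results
```

A result with `reserved_bits_set` equal to True corresponds to the more suspicious variant of the rule.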
3. Document Summary Information
The Document Summary Information stream contains a variety of properties which vary depending on what kind of compound file it is found in.[18] In the particular case of this stream in an Excel file, there is a set of heading pairs which display a count of how many sheets of each type are contained in the document. When there are one or more sheets containing Excel 4.0 macros, one of the heading pairs will indicate this with the string "Excel 4.0 Macros". However, depending on the language localization of Office used to save the Excel file, this heading pair can be slightly different. One example of this localized heading pair as extracted using Exiftool[19] is shown in Figure 15.
Figure 15: Russian Language Localized Heading Pairs
Fortunately, across all the observed variations of this heading pair, each one contains one of three substrings. The first substring occurs with languages that use postpositive adjectives with borrowed words. In these strings, the word for "macros" is localized, and the term "Excel 4.0" is in English. Therefore, the ASCII hexadecimal code for a space, 0x20, is found immediately before the bytes representing "Excel 4.0". The byte immediately after is then a null byte: 0x00. This variant is the byte string 20 45 78 63 65 6C 20 34 2E 30 00, bounded by 20 at the start and 00 at the end.
For languages that have prepositive adjectives, these two bounding characters are reversed. Certain languages, Norwegian being one, have an orthography which employs a hyphen character between the word for macros and Excel 4.0. An example of this form is shown in Figure 16 as extracted using Exiftool.
Figure 16: Norwegian Language Localized Heading Pairs
This third form is the byte string 00 45 78 63 65 6C 20 34 2E 30 2D, where the important bytes are the leading 00 and the trailing 2D (the hyphen).
There is one case of a false positive that contains one of these three substrings. This false positive is found in certain authentic Microsoft installer patches and MSI files. That string must be excluded in the string matching rules to prevent these false positives, if the YARA rule is not restricted to Excel compound files. That string is the following:
31 39 39 32 20 45 78 63 65 6C 20 34 2E 30 00
In addition to the above strings, there are two more fields that are added to the byte strings to further reduce false positives. The first is 1E000000 which indicates the start of a property record. The second is the length of the property record. This length only uses the first byte in this particular type of property record. Therefore the byte with the length must be a wild card: ??000000. This yields the rule shown in Figure 17.
Figure 17: Heading Pairs YARA Rule
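The three substrings and the false positive exclusion can be tried out in Python against an extracted stream. This is a minimal sketch; the names `VARIANTS`, `FALSE_POSITIVE`, and `match_heading_pairs` are hypothetical, and the variant labels are this sketch's own shorthand for the three forms described above.

```python
# The three bounded forms of "Excel 4.0" observed in localized heading pairs.
VARIANTS = {
    "postpositive": b" Excel 4.0\x00",   # 20 ... 00
    "prepositive": b"\x00Excel 4.0 ",    # 00 ... 20
    "hyphenated": b"\x00Excel 4.0-",     # 00 ... 2D
}

# String seen in some authentic Microsoft installer patches and MSI files.
FALSE_POSITIVE = b"1992 Excel 4.0\x00"


def match_heading_pairs(stream: bytes) -> list:
    """Return the labels of every variant found, unless the known false positive is present."""
    if FALSE_POSITIVE in stream:
        return []
    return [label for label, needle in VARIANTS.items() if needle in stream]
```

As in the YARA rule, the presence of the false positive string suppresses every match, so this check is best applied to the Document Summary Information stream rather than to arbitrary files.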
Because these properties occur in a stream, they are subject to fragmentation. This causes a problem for YARA rules because the rule cannot predict at which point in the byte string the split will occur. An example of this type of fragmentation can be seen in Figure 18.
Figure 18: Fragmented Excel 4.0 Property
Of course, this can be handled by generating a gigantic rule for each permutation of the byte string with a split at each possible location. But there is a better way when using the Titanium Platform. One of the basic processes that is applied to a compound file when it is submitted is to extract the various streams and defragment them into contiguous files. These contiguous files are then scanned by any YARA rules loaded into the A1000. The effect of this is to completely circumvent the problem of fragmented streams. To navigate to exactly where one needs to, after submitting a file, drill down into the Document Summary Information stream via the Extracted Files feature. This is shown in Figure 19.
Figure 19: Extracted Document Summary Information Stream
Once that extracted stream is open, select the YARA analysis under "Titanium Core" to view the rules that matched with that particular stream. This is shown in Figure 20.
Figure 20: Excel 4.0 Macros YARA Rule Match
A full set of rules that match these strings by themselves as well as the strings within a property record are all provided at the end of this blog.
What is an Excel File?
All of the indicators for Excel 4.0 macros described above can produce misleading results depending on the type of compound file that triggers the rule. Because Microsoft compound files can be nested in different ways, the indicators above may trigger when there is an Excel file embedded in a Word or PowerPoint file. Another case is when an Excel file is attached to an email and the email is saved in Outlook MSG format. Because of this problem, determining that a file is Excel and not another type of compound file is important. Alternatively, searching for these indicators more loosely within any compound file can also be valuable.
To determine if a file is an Excel document, there are three main ways, each of which can be used to form a YARA rule. The first, and least complex is using the magic module in YARA. This rule is shown in Figure 21.
Figure 21: YARA Rule for Identifying Excel Via the Magic Module
Due to the sluggishness of the magic module in YARA, many deployments of YARA do not have this module compiled in. Therefore, a different method that uses strings and conditions only must be employed. There are two strong indicators that a file is an Excel document, and both are located in the directory stream of the compound file.
To analyze the directory in a compound file, first one must locate the root entry of the directory. This offset is calculated by parsing two fields found in the compound file header: the sector shift and the sector number of the first directory sector. The locations of these bytes are described in the specification for Microsoft compound files.[20] One example of these two values is shown in Figure 22 as parsed by OffVis.
Figure 22: Sector Shift and First Directory Sector Number
These two values are then used in the following equation to calculate the offset of the start of the directory stream. Sector shift is "S" and the number of the first directory sector is "N".
offset = 2^S + N * 2^S
The sector shift value is stored at offset 30 in the header and the sector number of the first directory sector is stored at offset 48. When implementing this equation in a YARA condition, one must use the bitwise operators since there is no exponent operator available. Therefore the part of the equation that uses sector shift translates to using the bitwise left shift operator like so:
1 << S
By combining all of the above and then reducing the terms of the equation, one is left with the following condition which calculates the offset of the beginning of the directory stream.
(uint32(48) + 1) * (1 << uint16(30))
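The same calculation can be checked in Python. This is a small sketch in which `directory_offset` is a hypothetical name; it reads the two header fields at the offsets given above and applies the reduced form of the equation.

```python
import struct


def directory_offset(header: bytes) -> int:
    """Compute the file offset of the first directory sector of a compound file."""
    (sector_shift,) = struct.unpack_from("<H", header, 30)      # S
    (first_dir_sector,) = struct.unpack_from("<I", header, 48)  # N
    # 2^S + N * 2^S reduces to (N + 1) * 2^S.
    return (first_dir_sector + 1) * (1 << sector_shift)
```

With a sector shift of 9 (512-byte sectors) and a first directory sector of 1, the directory starts at (1 + 1) * 512 = 1024.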
Armed with this offset, the first indicator that a file is Excel is the class ID of the root storage object. These class IDs are not officially documented, but there are two[21] websites[22] which maintain a decent list of known IDs. The class ID of the root storage object is located 80 bytes into the root entry. The resulting rule which detects Excel 8's root class ID is shown in Figure 23.
Figure 23: YARA Rule for Detecting Excel 8 Root Entry Class ID
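The root class ID check can also be sketched in Python. The two class ID byte strings below are the ones used in the rules at the end of this blog; the names `ROOT_CLSIDS`, `directory_offset`, and `root_clsid_label` are hypothetical, and the sketch assumes the first directory sector is present in the buffer.

```python
import struct

# Root entry class IDs, as raw on-disk bytes, taken from the YARA rules.
ROOT_CLSIDS = {
    bytes.fromhex("10 08 02 00 00 00 00 00 C0 00 00 00 00 00 00 46"): "Excel BIFF5",
    bytes.fromhex("20 08 02 00 00 00 00 00 C0 00 00 00 00 00 00 46"): "Excel BIFF8",
}


def directory_offset(data: bytes) -> int:
    """Offset of the first directory sector: (N + 1) * 2^S."""
    (sector_shift,) = struct.unpack_from("<H", data, 30)
    (first_dir_sector,) = struct.unpack_from("<I", data, 48)
    return (first_dir_sector + 1) * (1 << sector_shift)


def root_clsid_label(data: bytes):
    """Return the Excel label for the root entry class ID, or None if it is not known."""
    if len(data) < 52:
        return None
    start = directory_offset(data) + 80  # class ID sits 80 bytes into the root entry
    return ROOT_CLSIDS.get(data[start:start + 16])
```

A return value of None does not prove that a file is not Excel; it only means the root entry class ID is not one of the two listed here.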
There is a problem, however, with relying solely on the root entry class ID. This field can be stomped by the adversary and the file will still open in Excel. In fact, there is an entire category of malicious Excel files whose detection names contain the string "abracadabra" and which have random data written to the root entry class ID. Additionally, these files are often missing the document summary information stream entirely. These files are easy to find in the Titanium Platform via search using "threatname:abracadabra". The results of this search are shown in Figure 24.
Figure 24: Abracadabra Malicious Documents
The third method of identifying an Excel document is to examine the rest of the entries in the directory stream, looking for one of two stream names: "Workbook" for Excel 8 and newer, and "Book" for Excel 5. The way to do this is to iterate across the entries in the directory, checking whether the stream name matches one of these two. A single directory entry is 128 bytes long, and the rule checks the 31 entry slots that follow the root entry. Two nested loops are therefore used: the outer loop walks every match of the stream name string, and the inner loop compares the offset of that match against the offset of each entry slot. The resulting rule that searches for "Workbook" is shown in Figure 25.
Figure 25: YARA Rule for Identifying Directory Entry Named "Workbook"
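The directory walk can be sketched in Python as well. This is an approximation of the rule's logic rather than a full compound file parser: it assumes the directory entries of interest are contiguous in the file, and the names `ENTRY_SIZE`, `directory_offset`, and `has_stream_entry` are hypothetical. The byte checks at offsets 65 to 67 of each entry correspond to the trailing `00 02 0?` in the YARA string.

```python
import struct

ENTRY_SIZE = 128  # size of one directory entry in bytes


def directory_offset(data: bytes) -> int:
    """Offset of the first directory sector: (N + 1) * 2^S."""
    (sector_shift,) = struct.unpack_from("<H", data, 30)
    (first_dir_sector,) = struct.unpack_from("<I", data, 48)
    return (first_dir_sector + 1) * (1 << sector_shift)


def has_stream_entry(data: bytes, name: str, slots: int = 31) -> bool:
    """Check the entry slots after the root entry for a stream whose name starts with `name`."""
    if len(data) < 52:
        return False
    needle = name.encode("utf-16-le")
    base = directory_offset(data)
    for j in range(1, slots + 1):
        start = base + j * ENTRY_SIZE
        entry = data[start:start + ENTRY_SIZE]
        if len(entry) < 68:
            break  # ran off the end of the buffer
        if (entry.startswith(needle)
                and entry[65] == 0x00            # high byte of the name length
                and entry[66] == 0x02            # object type: stream
                and (entry[67] & 0xF0) == 0):    # color flag, high nibble zero
            return True
    return False
```

Calling `has_stream_entry(data, "Workbook")` or `has_stream_entry(data, "Book")` on the raw bytes of a file approximates the two directory entry rules. Like the YARA rule, the sketch only matches the start of the name, and it will miss entries that live in a later, non-contiguous directory sector.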
Again, this method is not complete since the directory stream is a stream and therefore susceptible to fragmentation just like the document summary information stream. If the Workbook entry is located after the first fragment, the rule above will not trigger. However, not to worry, the Titanium Platform itself performs file identification for you, so all you need to do is submit the file, make coffee, and check back with the analysis results.
Figure 26: Titanium Platform File Format Identification
YARA Rules
NOTE: If restricting the Excel 4.0 macros rules to Excel files only is too restrictive, just widen the search by substituting the CompoundFile rule for the Excel_CompoundFile rule in the various rules' conditions.
import "magic"

private rule CompoundFile
{
    meta:
        author = "Malware Utkonos"
        date = "2020-07-04"
        description = "Magic number for Microsoft compound files: 'D0CF11E0A1B11AE1'."
    condition:
        uint32(0) == 0xE011CFD0 and uint32(4) == 0xE11AB1A1
}

rule Excel5_RootCLSID
{
    meta:
        author = "Malware Utkonos"
        date = "2020-07-15"
        description = "Excel BIFF5 root entry record class ID."
    strings:
        $clsid = { 10 08 02 00 00 00 00 00 C0 00 00 00 00 00 00 46 }
    condition:
        CompoundFile and
        $clsid at ((uint32(48) + 1) * (1 << uint16(30)) + 80)
}

rule Excel8_RootCLSID
{
    meta:
        author = "Malware Utkonos"
        date = "2020-08-08"
        description = "Excel BIFF8 root entry record class ID."
    strings:
        $clsid = { 20 08 02 00 00 00 00 00 C0 00 00 00 00 00 00 46 }
    condition:
        CompoundFile and
        $clsid at ((uint32(48) + 1) * (1 << uint16(30)) + 80)
}

rule Excel_DirEntryName_BIFF5
{
    meta:
        author = "Malware Utkonos"
        date = "2020-08-09"
        description = "Directory entry for a stream named 'Book' in a compound file."
    strings:
        $dirname = { 42 00 6F 00 6F 00 6B 00 [57] 00 02 0? } // Book
    condition:
        CompoundFile and
        for any i in (1..#dirname) : (
            for any j in (1..31) : ( @dirname[i] == (uint32(48) + 1) * (1 << uint16(30)) + j * 128 )
        )
}

rule Excel_DirEntryName_BIFF8
{
    meta:
        author = "Malware Utkonos"
        date = "2020-08-09"
        description = "Directory entry for a stream named 'Workbook' in a compound file."
    strings:
        $dirname = { 57 00 6F 00 72 00 6B 00 62 00 6F 00 6F 00 6B 00 [49] 00 02 0? } // Workbook
    condition:
        CompoundFile and
        for any i in (1..#dirname) : (
            for any j in (1..31) : ( @dirname[i] == (uint32(48) + 1) * (1 << uint16(30)) + j * 128 )
        )
}

rule Excel_MIME
{
    condition:
        magic.mime_type() == "application/vnd.ms-excel"
}

rule Excel_CompoundFile
{
    condition:
        Excel5_RootCLSID or Excel8_RootCLSID or Excel_DirEntryName_BIFF5 or Excel_DirEntryName_BIFF8 or Excel_MIME
}

rule Excel_Macros40_String
{
    meta:
        author = "Malware Utkonos"
        date = "2020-07-23"
        description = "Variations of the Excel 4.0 Macros string found in the Document Summary Info property."
    strings:
        $a = { 20 45 78 63 65 6C 20 34 2E 30 00 }
        $b = { 00 45 78 63 65 6C 20 34 2E 30 20 }
        $c = { 00 45 78 63 65 6C 20 34 2E 30 2D }
        $fp = { 31 39 39 32 20 45 78 63 65 6C 20 34 2E 30 00 }
    condition:
        Excel_CompoundFile and any of ($a,$b,$c) and
        not $fp
}

rule Excel_DocSumInfo_Macros40_Prop
{
    meta:
        author = "Malware Utkonos"
        date = "2020-07-23"
        description = "Document Summary Information containing a property value for Excel 4.0 macros."
    strings:
        $ = { 1E 00 00 00 ?? 00 00 00 [0-20] 20 45 78 63 65 6C 20 34 2E 30 00 }
        $ = { 1E 00 00 00 ?? 00 00 00 [0-20] 45 78 63 65 6C 20 34 2E 30 20 }
        $ = { 1E 00 00 00 ?? 00 00 00 [0-20] 45 78 63 65 6C 20 34 2E 30 2D }
    condition:
        Excel_CompoundFile and any of them
}

rule Excel_BOF_BIFF57_Macros40
{
    meta:
        author = "Malware Utkonos"
        date = "2020-07-23"
        description = "Beginning of File (BOF) record in BIFF5 or BIFF7 format with Excel 4.0 macros."
    strings:
        $bof = { 09 08 08 00 00 05 40 00 }
    condition:
        Excel_CompoundFile and $bof
}

rule Excel_BOF_BIFF8_Macros40_8
{
    meta:
        author = "Malware Utkonos"
        date = "2020-07-23"
        description = "Beginning of File (BOF) record in BIFF8 format with Excel 4.0 macros of length 8."
    strings:
        $bof = { 09 08 08 00 00 06 40 00 }
    condition:
        Excel_CompoundFile and $bof
}

rule Excel_BOF_BIFF8_Macros40_16
{
    meta:
        author = "Malware Utkonos"
        date = "2020-07-23"
        description = "Beginning of File (BOF) record in BIFF8 format with Excel 4.0 macros of length 16."
    strings:
        $bof = { 09 08 10 00 00 06 40 00 [10] 00 00 }
    condition:
        Excel_CompoundFile and $bof
}

rule Excel_Boundsheet_Macros40
{
    meta:
        author = "Malware Utkonos"
        date = "2020-07-23"
        description = "Boundsheet record with sheet type: Excel 4.0 macro at any visibility setting."
    strings:
        $bs = { 85 00 ?? 00 [5] 01 }
    condition:
        Excel_CompoundFile and $bs and
        for any i in (1..#bs) : (
            for any j in (0..2) : ( uint8(@bs[i] + 8) & 0x3 == j and uint8(@bs[i] + 2) >= 0x8 and uint8(@bs[i] + 2) <= 0x88 )
        )
}

rule Excel_Boundsheet_Macros40_ResSet
{
    meta:
        author = "Malware Utkonos"
        date = "2020-07-23"
        description = "Boundsheet record with sheet type: Excel 4.0 macro at any visibility setting and data in the reserved bits."
    strings:
        $bs = { 85 00 ?? 00 [5] 01 }
    condition:
        Excel_CompoundFile and $bs and
        for any i in (1..#bs) : (
            for any j in (0..2) : ( uint8(@bs[i] + 8) & 0x3 == j and uint8(@bs[i] + 8) > 0x2 and uint8(@bs[i] + 2) >= 0x8 and uint8(@bs[i] + 2) <= 0x88 )
        )
}

[1] https://www.lastline.com/labsblog/evolution-of-excel-4-0-macro-weaponization/
[2] https://en.wikipedia.org/wiki/Microsoft_Excel#Excel_4.0_(1992)
[3] https://attack.mitre.org/techniques/T1204/002/
[4] https://malpedia.caad.fkie.fraunhofer.de/details/win.zloader
[5] https://malpedia.caad.fkie.fraunhofer.de/details/win.dridex
[6] https://inquest.net/blog/2019/01/29/Carving-Sneaky-XLM-Files
[7] https://hatching.io/blog/excel-xlm-extraction/
[8] http://download.microsoft.com/download/1/A/9/1A96F918-793B-4A55-8B36-84113F275ADD/Excel97-2007BinaryFileFormat(xls)Specification.pdf
[9] https://en.wikipedia.org/wiki/Magic_number_(programming)
[10] https://www.openoffice.org/sc/testdocs/index.html
[11] https://ridiculousfish.com/hexfiend/
[12] https://mail-archives.apache.org/mod_mbox/tika-dev/201804.mbox/%3CJIRA.13152232.1523611896000.267077.1523860860153@Atlassian.JIRA%3E
[13] https://www.loc.gov/preservation/digital/formats/fdd/fdd000510.shtml
[14] https://go.microsoft.com/fwlink/?LinkId=158791
[15] https://github.com/arizvisa/windows-binary-tools/tree/master/offvis
[16] http://b2xtranslator.sourceforge.net/download.html
[17] http://download.microsoft.com/download/1/A/9/1A96F918-793B-4A55-8B36-84113F275ADD/Excel97-2007BinaryFileFormat(xls)Specification.pdf
[18] https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-oshared/2ea8be67-a4a0-4e2e-b42f-49a182645562
[19] https://exiftool.org/
[20] https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-cfb/53989ce4-7b05-4f8d-829b-d08d6148375b
[21] http://fileformats.archiveteam.org/wiki/Microsoft_Compound_File
[22] https://github.com/decalage2/oletools/blob/master/oletools/common/clsid.py