Thursday, March 6, 2014

Microsoft Compound Document Internals (Part 5 - Rebuilding a Corrupted Document Header)

Recently I was contacted by an individual who had been infected with the CryptorBit virus. This virus is a ransomware variant that supposedly encrypts your files and then demands payment of a certain amount of money in order to unlock your files. The person who contacted me explained that rather than encrypting the entire contents of the file (like CryptoLocker), this particular bit of malware just encrypts the first 512 bytes of the file. This provides an interesting opportunity with respect to MS Office Compound Documents because the first 512 bytes are the Compound Document Header (described in detail at MSDN). The interesting thing here is that many of the values that are in the Compound Document Header are static values, and those that are variable can possibly be derived from the actual file contents. Since the file contents are not affected by CryptorBit, we should, in theory, be able to scan the streams, storages, and directory entries within the Compound Document file and recover the data necessary to rebuild the Document header. Challenge accepted.

An initial review of the Compound Document Header reveals that a large number of its fields are static. The static fields are as follows:

  1. Header Signature
  2. Header CLSID
  3. Minor Version
  4. Major Version*
  5. Byte Order
  6. Sector Shift*
  7. Mini Sector Shift
  8. Reserved
  9. Number of Directory Sectors
  10. Transaction Signature Number*
  11. Mini Stream Cutoff Size
The values marked by * MAY be variable. There is a version 4 of the Compound Document Specification that uses 4096-byte sectors. I have yet to encounter one, although it should be relatively easy to identify one. I will incorporate this into future analysis if any readers can provide a version 4 document for me to analyze. For this post, we will assume that the documents that we will be working with are version 3 documents with 512-byte sector sizes.
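The static portion of a version 3 header can be sketched as a table of constants. This is a minimal sketch: the field names are my own, the values come from the Compound Document specification, and the starred fields from the list above are filled in with the common version 3 defaults.

```python
import struct

# Static (or near-static) header fields for a version 3 compound document.
# Fields marked with * in the list above MAY vary; version 3 defaults used here.
STATIC_HEADER_FIELDS = [
    ("header_signature", b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"),
    ("header_clsid", b"\x00" * 16),
    ("minor_version", struct.pack("<H", 0x003E)),
    ("major_version", struct.pack("<H", 0x0003)),       # * version 3 assumed
    ("byte_order", struct.pack("<H", 0xFFFE)),          # little-endian
    ("sector_shift", struct.pack("<H", 0x0009)),        # * 2**9 = 512-byte sectors
    ("mini_sector_shift", struct.pack("<H", 0x0006)),   # 2**6 = 64-byte mini sectors
    ("reserved", b"\x00" * 6),
    ("num_directory_sectors", struct.pack("<I", 0)),    # always 0 for version 3
    # num_fat_sectors goes here      (variable -- must be recovered)
    # first_directory_sector here    (variable -- must be recovered)
    ("transaction_signature", struct.pack("<I", 0)),    # * commonly 0
    ("mini_stream_cutoff", struct.pack("<I", 0x1000)),  # 4096 bytes
]
```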

Given the above, the only remaining values that we have to fill in are:
  1. Number of FAT Sectors
  2. First Directory Sector Location
  3. First MiniFAT Sector Location
  4. First DIFAT Sector Location
  5. Number of DIFAT Sectors
  6. The DIFAT that is contained within the Document Header
No problem, right? Well, it wasn't as easy as I thought it would be, but it turns out to be do-able.

The person who contacted me was nice enough to provide some documents where the header had been scrambled. I opened them up with a hex editor to take a look and see what I was dealing with. The header looked as follows:

Figure 1: Corrupted Document Header

Obviously not a valid Document Header! However, immediately following the header was the unencrypted / unaltered sector 0 (and subsequent sectors) of the file. This is good because the data structures are intact and we can begin digging through the file to find structures that will give us the answers to the 6 questions above that we have to answer in order to rebuild the document header.

When corresponding with the individual regarding this virus, my initial response was that I would first start to look for the document FAT sectors and use those as a reference for finding the answers to the other questions. So, how do we identify the FAT sectors?

Recall that the FAT is used to store the allocation status of each sector within the file, and from the FAT we can derive the sector chains for the streams and storages contained within the document. A FAT will thus contain sector numbers for the chains and also will contain special values such as 0xFFFFFFFD for FAT sectors, 0xFFFFFFFC for DIFAT sectors, and 0xFFFFFFFE for end-of-chain markers. Initially I surmised that it would be easy to find FAT sectors by searching for sectors that contained two or more of the end-of-chain markers. However, this logic proved to be flawed due to the possibility that large streams that encompass an entire FAT sector would cause that FAT sector to not contain any end-of-chain markers. Thus I had to come up with an alternate algorithm for identifying FAT sectors.

Taking a closer look at FAT sectors, we can easily notice a pattern that frequently appears. That pattern is that of incrementing sector numbers for each sector. An image will help illustrate this better.

Figure 2: FAT structure

We can see that, for an 8-byte sequence in the FAT, there is a good possibility that the second 4 bytes will be equal to the first 4 bytes plus one. Why is this? Recall that the value of each 4-byte entry in the FAT is a pointer to the next sector in a sector chain. Because Microsoft tries to keep the streams and storages within a particular file as contiguous as possible, streams are very often arranged in sector-sequential order. This means that the sector chain will typically look like:

x, x+1, x+2 ... x+n

where the chain spans n + 1 sectors, starting at sector x. When this type of sector chain is written to the FAT, it results in sequential numerical values stored in consecutive 4-byte entries. We can use this pattern to identify FAT sectors (and MiniFAT sectors too, but more on that later). The presence of the special sector identifiers (0xFFFFFFFC, 0xFFFFFFFD, 0xFFFFFFFE, and 0xFFFFFFFF) is also a good indicator that this is an actual, valid FAT sector within the Document File.
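This heuristic can be sketched in a few lines of Python. The function name and the 50% threshold are my own choices; the function simply counts how many adjacent 4-byte entries are either incrementing by one or special markers.

```python
import struct

SECTOR_SIZE = 512
FAT_ENTRIES = SECTOR_SIZE // 4  # 128 DWORD entries per FAT sector
SPECIAL = {0xFFFFFFFC, 0xFFFFFFFD, 0xFFFFFFFE, 0xFFFFFFFF}

def looks_like_fat_sector(sector: bytes, threshold: float = 0.5) -> bool:
    """Heuristic: a FAT sector is mostly runs of incrementing DWORDs
    (contiguous sector chains) plus the special marker values."""
    entries = struct.unpack("<128I", sector)
    hits = 0
    for a, b in zip(entries, entries[1:]):
        if b == a + 1 or b in SPECIAL:
            hits += 1
    return hits / (FAT_ENTRIES - 1) >= threshold
```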

Using this algorithm, we can scan the document and identify all of the FAT sectors. But then we have the problem of putting them in the proper order. If we have multiple FAT sectors within a compound document, we must figure out the "sector chain" for the FAT sectors. This is typically done using the DIFAT, but if you recall, the first 109 entries of the DIFAT are also contained within the Document Header, thus we will have to figure out this information on our own somehow...

Given that sector numbers are stored as DWORD values (that is, 4 bytes) in the FAT, each 512-byte FAT sector can store sector numbers for up to 128 sectors (512 / 4). The technique that I used assumes that the document sectors addressed by each FAT sector fall within a certain range. That is, the first FAT sector will address sectors 0 - 127, the second FAT sector will address sectors 128 - 255, and so on. Working with this assumption, I took the average of the sector numbers listed within each FAT sector and compared it to 128. If the average is less than 128, it's probably FAT sector 0. Otherwise, the 128-sector window that the average falls into gives the sector's position in the FAT chain. The equation looked like this:

Round Up(averageSectorNumber / 128) = FAT Sequence Number

In my test documents that I used, this equation proved to be accurate and properly identified the FAT sector chain sequence.
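The equation above can be sketched in Python. The function name and the filtering of special markers are my own; the result is a 1-based sequence number, matching the "Round Up" formula above.

```python
import math
import struct

def fat_sequence_number(fat_sector: bytes, entries_per_sector: int = 128) -> int:
    """Estimate this FAT sector's 1-based position in the FAT chain by
    averaging the real sector numbers it contains and rounding up the
    average divided by 128 (entries per FAT sector)."""
    entries = struct.unpack("<128I", fat_sector)
    # Skip the special markers (0xFFFFFFFC and above are not real sectors).
    real = [e for e in entries if e < 0xFFFFFFFA]
    avg = sum(real) / len(real)
    return math.ceil(avg / entries_per_sector)
```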

So once we have the FAT sector chain, we can build the DIFAT! Recall that the first 109 FAT sector numbers are stored in the document header. If there are more than 109 FAT sectors, then a DIFAT sector will be required to store the sector numbers for the additional FAT sectors. Most documents do not require DIFAT sectors, but if you are dealing with a document that is larger than ~6.8MB, you may need a DIFAT sector within your document. Fortunately, DIFAT sectors are pretty easy to find - they are specially listed within the FAT. Once we have the FAT sector chain, we can scan each FAT sector to find sectors marked as DIFAT sectors (0xFFFFFFFC). Any sector within the FAT that is marked as a DIFAT sector can be analyzed separately.
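That scan can be sketched as follows, assuming the FAT sectors have already been recovered in chain order (the function name is mine). The index of each 0xFFFFFFFC entry, offset by the FAT sector's position in the chain, is the sector number of a DIFAT sector.

```python
import struct

def find_difat_sectors(ordered_fat_sectors):
    """Scan the recovered FAT (in chain order) for entries marked
    0xFFFFFFFC; each such entry's overall index is the sector number
    of a DIFAT sector."""
    difat_sectors = []
    for seq, fat_sector in enumerate(ordered_fat_sectors):
        entries = struct.unpack("<128I", fat_sector)
        for i, entry in enumerate(entries):
            if entry == 0xFFFFFFFC:
                difat_sectors.append(seq * 128 + i)
    return difat_sectors
```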

For the purpose of this post, we will look at a document that does not have a special DIFAT sector and has only three FAT sectors. Given the three FAT sectors, now that we have them in order, we have the information that we need to answer questions 1, 4, 5 & 6 on the above list of unknowns:

1. Number of FAT sectors: Our algorithm has shown that there are three FAT sectors in the document that we are analyzing.
4. First DIFAT Sector Location: Since there are no DIFAT sectors in a document that has only three FAT sectors, this value will be the End-of-Chain value (0xFFFFFFFE).
5. Number of DIFAT Sectors: Zero
6. The DIFAT...

Recall that the DIFAT contains the sector numbers for each FAT sector, in sequence. In the document in question, the FAT sectors are at sectors 0, 100, and 290 (sequentially). Thus the first 4 bytes of the DIFAT would be 0x00000000 (sector 0), the second 4 bytes would be 0x00000064 (sector 100), and the third 4 bytes would be 0x00000122 (sector 290). These three values point to the three FAT sectors within the document. The remaining DIFAT entries would be set to 0xFFFFFFFF to reflect that there are no more FAT sectors used.
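Building the 109-entry header DIFAT from the ordered FAT sector numbers is then mechanical. A minimal sketch (function name mine):

```python
import struct

def build_header_difat(fat_sector_numbers):
    """Pack the first 109 DIFAT entries as stored in the header: the FAT
    sector numbers in chain order, padded with 0xFFFFFFFF (unused)."""
    assert len(fat_sector_numbers) <= 109
    entries = list(fat_sector_numbers)
    entries += [0xFFFFFFFF] * (109 - len(entries))
    return struct.pack("<109I", *entries)
```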

So at this point we have four of the six questions answered. We are almost there. Question 2 asks for the location of the first (Root) directory sector. This should be quite simple to find: the Root Directory sector starts with the Unicode string "Root Entry", so we can search the document for sectors that begin with that string. This works well, until we find that some documents contain that string more than once.
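The search can be sketched as follows, assuming 512-byte sectors and remembering that sector 0 begins at file offset 512, just after the header (function name mine):

```python
# A directory entry begins with its name, stored as UTF-16LE.
ROOT_ENTRY = "Root Entry".encode("utf-16-le")

def find_root_candidates(data: bytes, sector_size: int = 512):
    """Return the sector numbers of every sector that begins with the
    UTF-16LE string 'Root Entry'. Sector 0 starts at offset 512."""
    candidates = []
    for sector_num in range((len(data) - sector_size) // sector_size):
        offset = sector_size * (sector_num + 1)  # skip the 512-byte header
        if data[offset:offset + len(ROOT_ENTRY)] == ROOT_ENTRY:
            candidates.append(sector_num)
    return candidates
```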

Some documents can have artifacts from previous versions of the document. As the document is edited and data is moved around, old sectors can be marked as unallocated while their data is retained within the file. Other documents can have embedded documents within them. Thus there is a possibility that you may encounter the Root Entry value more than once. In my particular case, I had to validate which one was the correct Root Entry and thus the First Directory Sector. We will come back to this topic. For now, let's file away the two sectors that are candidates for the First Directory Sector and take a look at question 3.

Question 3 asks us about the first MiniFAT sector location. According to Microsoft's documentation, "The mini FAT is used to allocate space in the mini stream. The mini stream is divided into smaller, equal-length sectors, and the sector size used for the mini stream is specified from the Compound File Header (64 bytes)." The nice thing about the MiniFAT is that it looks just like a regular FAT. Thus our FAT scanning algorithm above will also identify MiniFAT sectors. So how do we tell the MiniFAT sectors apart from the regular FAT sectors?

Well, the first way we can tell is to check the FAT against itself. FAT sectors within the FAT are marked as a special value - 0xFFFFFFFD. If we have a candidate that we think is a regular FAT sector, we can check to see if the value for that sector is listed as 0xFFFFFFFD in the table. If so, then it is very likely a valid FAT sector. MiniFAT sectors are different - they are chained just like normal streams in the FAT. Thus if we have identified a FAT sector but it does not show up as a FAT sector (0xFFFFFFFD) in the FAT itself, then it is quite possibly a MiniFAT sector. We can then pull the sector chain for that MiniFAT sector from the FAT.

Another thing that I have found about the MiniFAT is that it is typically located very close to the start of the Mini Stream. The Mini Stream is where user-defined content that is less than the Mini Stream Cutoff value (4096 bytes) is stored. The first sector of the Mini Stream is stored in the Root Directory entry. So here is our tie-in with the previous section on the Root Directory: we can now check each of the candidate Root Directory entries that we found and extract the starting sector and stream size for the Mini Stream. Once we have those values, we can compare them with the MiniFAT sectors that we have found and make sure that everything lines up properly.

In the document that I was working on, the first Root Directory sector started at sector 1 (file offset 1024):
Figure 3: Root Directory Entry


The MSDN documentation about the Root Directory Entry states that the entry contains the first sector of the Mini Stream as well as the Mini Stream size. The sector number is at offset 116 from the start of the Root Directory entry and is 4 bytes long; the Mini Stream size is the 8 bytes immediately after it. Together these values make up the last 12 bytes of the 128-byte directory entry. As seen above, the values are 0x00000003 (sector 3) for the Mini Stream starting sector number and 0x000004C0 (1216 bytes) for the stream size.
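Extracting those two values from a 128-byte Root Directory entry is a short unpack (function name mine):

```python
import struct

def mini_stream_info(root_entry: bytes):
    """From a 128-byte Root Directory entry, return the Mini Stream's
    starting sector (DWORD at offset 116) and size (QWORD at offset 120)."""
    start_sector = struct.unpack_from("<I", root_entry, 116)[0]
    stream_size = struct.unpack_from("<Q", root_entry, 120)[0]
    return start_sector, stream_size
```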

The second Root Directory entry was found at sector 289 (file offset 148480). The values recovered from this directory entry were 0x00000127 (sector 295) for the Mini Stream starting sector and 0x00000500 (1280 bytes) for the stream size.

At this point we consult our list of possible FAT sectors (that was discovered during our initial search for FAT sectors) to determine which of them could be our MiniFAT sector. To do this, we use a simple technique - see which of the candidate FAT sectors have the same number of addressed sectors as are present in the Mini Stream.

Per Microsoft's Documentation, the Mini Stream is broken up into 64-byte sectors. Thus, if we have the Mini Stream size (we do) we can figure out how many sectors there are within the Mini Stream by dividing the stream size by 64. In our first Root Directory entry, the stream size is 1216 bytes. This would imply that there are 19 sectors (1216 / 64) within the Mini Stream. None of our candidate FAT sectors contain 19 sectors that are addressed.

The second Root Directory entry has a stream size value of 1280 bytes for the Mini Stream. This implies that there would be 20 addressed sectors (1280 / 64) in the MiniFAT. In the sample document, the candidate FAT sector at sector 293 matched this number of addressed sectors. It is also close (two sectors away) to the start of the Mini Stream. This is very likely our MiniFAT sector.
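The matching step can be sketched like this: count the non-free entries in each candidate FAT-shaped sector and compare against the expected mini sector count, the stream size divided by 64 and rounded up. Function names are mine.

```python
import math
import struct

def addressed_sector_count(fat_like_sector: bytes) -> int:
    """Count the non-free (not 0xFFFFFFFF) entries in a FAT-shaped sector."""
    entries = struct.unpack("<128I", fat_like_sector)
    return sum(1 for e in entries if e != 0xFFFFFFFF)

def expected_mini_sectors(mini_stream_size: int, mini_sector_size: int = 64) -> int:
    """How many 64-byte mini sectors a Mini Stream of this size occupies."""
    return math.ceil(mini_stream_size / mini_sector_size)
```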

Given this bit of information, we can answer the two remaining questions (numbers 2 & 3):

2. First Directory Sector Location: In this case, the only directory sector with a valid Root Entry was the one at sector 289.
3. First MiniFAT Sector Location: In this case, the valid Root Directory entry and its corresponding stream size indicated that the MiniFAT candidate at sector 293 was our first MiniFAT sector.

We can now plug these values into the document header, or (like I did) use a python script to write out the new document header. Using a hex editor I copied the values into place, overwriting the scrambled data. I was then able to successfully open the document.

If anyone else is out there dealing with this virus, please feel free to hit me up and I can provide the python script that I wrote to pick apart the document internals.
