As I have mentioned in previous posts, the Microsoft Office Compound Document file uses an internal File Allocation Table (FAT) structure to keep track of allocated and unallocated sectors within the file. In addition to the FAT, there is a Double Indirect File Allocation Table (DIFAT), which keeps track of the file sectors used by the FAT itself. The Compound Document file also uses a MiniFAT, which allocates storage in the Mini Stream and will be the topic of another post. Together these structures map the allocation status of each sector within the Compound Document file, and they are used to recover sector chains - that is, sequences of sectors that contain the data for a particular stream. In this post, we will discuss the FAT concept in general, and the implementation specifics of the FAT within the Compound Document file.
The FAT filesystem has been around for decades; it was first developed in the late 1970s for use on floppy disks. The File Allocation Table concept uses a table (or array) to map the allocation status of each sector of a piece of media. In our case the piece of media is the Compound Document file, but the concept is the same as that used in the FAT filesystem on digital storage media: for every sector within the media, there is an entry in the table that contains information about the allocation status of that sector.
When we want to retrieve the contents of a stream, we first look in the Directory Entry to find the starting sector location for that stream. We then consult the File Allocation Table entry for that starting sector. The entry contains the number of the next sector that holds data belonging to the stream. We look up that next sector in the FAT, and repeat the process until we reach a FAT entry that contains the End of Chain (EOC) marker. This indicates that we have recovered the entire "sector chain" for that particular stream.
Figure 1: Sector Chaining
Each sector within a Compound Document file is represented by a 32-bit value in the Compound Document FAT. This 32-bit value holds the sector number of the next sector containing the stream data. In the image above, we see that the Directory Entry tells us that the starting sector for a stream is sector 7. We consult the File Allocation Table entry for sector 7 and see that the next sector is sector 1. We then look up sector 1 in the FAT and continue onward until we find the EOC marker, which indicates that there are no more sectors in the stream. Figure 1 shows that the sector chain for this particular stream is 7-1-3-5: the FAT entry for sector 5 holds the EOC marker, so sector 5 is the last sector in the chain. Thus, to recover the entire contents of this stream, we would collect the data at each of these sectors in chain order, appending the data from each successive sector as we go along.
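The chain walk described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `follow_chain` and the toy eight-entry table are my own, and only the four entries mentioned in the text (7, 1, 3 and 5) are filled in; the rest are assumed to be free.

```python
ENDOFCHAIN = 0xFFFFFFFE  # chain terminator
FREESECT = 0xFFFFFFFF    # unallocated sector


def follow_chain(fat, start_sector):
    """Return the list of sector numbers in the chain starting at start_sector."""
    chain = []
    seen = set()
    sector = start_sector
    while sector != ENDOFCHAIN:
        # Guard against corrupt tables: out-of-range entries or loops.
        if sector >= len(fat) or sector in seen:
            raise ValueError("corrupt sector chain at sector %d" % sector)
        seen.add(sector)
        chain.append(sector)
        sector = fat[sector]
    return chain


# Toy FAT: only the four entries described in the text are filled in.
fat = [FREESECT] * 8
fat[7], fat[1], fat[3], fat[5] = 1, 3, 5, ENDOFCHAIN

chain = follow_chain(fat, 7)
print(chain)  # [7, 1, 3, 5]
```

The loop and range checks are not part of the format itself; they are there because a damaged or deliberately malformed file could otherwise send a parser into an endless loop.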
The stream shown in Figure 1 consists of four sectors; for a Compound Document file that uses a 512-byte sector size, this means that the stream represented by this sector chain can have a maximum size of 2048 bytes (512 x 4). A stream will often not use all of the space within its last sector, because the stream size may not be an exact multiple of the sector size. Recall that the Stream Size is one of the values stored in the Directory Entry for the stream; this value can be used to calculate how much slack space will be present for a given stream. When extracting streams from a compound document, it is important to note that there may be slack space between the end of the stream and the end of the last sector in the sector chain, and that there may be data in this slack that is forensically interesting.
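As a rough sketch of the slack calculation, the allocated space is the number of sectors in the chain multiplied by the sector size, and the slack is whatever is left after the stream size is subtracted. The function names below are hypothetical, and the 1900-byte stream size is an invented value used only to show the arithmetic for a four-sector chain.

```python
def slack_bytes(stream_size, sector_count, sector_size=512):
    """Bytes left over between the end of the stream and the end of its last sector."""
    allocated = sector_count * sector_size
    if stream_size > allocated:
        raise ValueError("stream size exceeds the space allocated to the chain")
    return allocated - stream_size


def split_stream_and_slack(chain_data, stream_size):
    """Split the concatenated sector data into (stream contents, slack)."""
    return chain_data[:stream_size], chain_data[stream_size:]


# Hypothetical example: a four-sector chain holding a 1900-byte stream.
print(slack_bytes(1900, 4))  # 148
```

Keeping the slack as a separate byte string, rather than discarding it, makes it easy to hex dump later.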
To locate the FAT within a Compound Document file, an examiner must consult the DIFAT (Double Indirect File Allocation Table), which lists the sectors that contain the FAT itself. Recall that the Compound Document Header contains information about DIFAT sectors and also holds the first part of the DIFAT. In smaller Compound Document files, the entirety of the DIFAT information may be stored within the Compound Document Header. Let's take a look at the header from the sample document that we were analyzing in previous posts:
Figure 2: Compound Document Header DIFAT Information
The circled information shown in Figure 2 contains the "First DIFAT Sector Location", highlighted in green, and the "Number of DIFAT Sectors", highlighted in blue. In this case, the First DIFAT Sector Location value is 0xFFFFFFFE, which is Microsoft's End of Chain (EOC) marker. This, along with the value of 0x0 for Number of DIFAT Sectors, indicates that there are no additional DIFAT sectors within the Compound Document file, and that all of the DIFAT information is contained within the Compound Document Header. It is important to note that larger files need more: with 512-byte sectors, the 109 DIFAT entries in the header cover 109 x 128 x 512 = 7,143,424 bytes (roughly 6.8 MB), and a file beyond that size will have DIFAT sectors that contain the additional entries.
The DIFAT consists of 32-bit sector numbers that point to the sectors used by the Compound Document File Allocation Table. If more than one sector is needed to hold the FAT, those sectors are listed in order in the DIFAT. DIFAT sectors themselves are marked in the FAT with a special value, 0xFFFFFFFC (DIFSECT), and thus are not chained in the FAT. Instead, the last four bytes of a DIFAT sector contain the sector number of the next DIFAT sector. If there are no more DIFAT sectors, then the last four bytes of the DIFAT sector will be the EOC marker (0xFFFFFFFE).
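The following is a minimal sketch of how the FAT sector numbers could be collected from the DIFAT, assuming the whole file has been read into a byte string. The header offsets used (sector shift at byte 30, First DIFAT Sector Location at byte 68, Number of DIFAT Sectors at byte 72, and the 109 header DIFAT entries starting at byte 76) follow the published header layout; the function name and the synthetic header used for the demonstration are my own, and the sector number 36 is simply a sample value.

```python
import struct

ENDOFCHAIN = 0xFFFFFFFE
FREESECT = 0xFFFFFFFF


def read_fat_sector_numbers(data):
    """Collect FAT sector numbers from the header DIFAT and any DIFAT sectors."""
    (sector_shift,) = struct.unpack_from("<H", data, 30)
    sector_size = 1 << sector_shift
    first_difat, difat_count = struct.unpack_from("<II", data, 68)

    # The header holds the first 109 DIFAT entries, starting at byte 76.
    entries = list(struct.unpack_from("<109I", data, 76))

    per_sector = sector_size // 4  # 32-bit values in one DIFAT sector
    sector = first_difat
    for _ in range(difat_count):
        if sector in (ENDOFCHAIN, FREESECT):
            break
        offset = (sector + 1) * sector_size  # the header occupies the first sector
        values = struct.unpack_from("<%dI" % per_sector, data, offset)
        entries.extend(values[:-1])  # all but the last value are DIFAT entries
        sector = values[-1]          # last four bytes: next DIFAT sector or EOC
    return [e for e in entries if e != FREESECT]


# Synthetic 512-byte header: no DIFAT sectors, one FAT sector (number 36).
header = bytearray(512)
struct.pack_into("<H", header, 30, 9)               # sector shift 9 -> 512 bytes
struct.pack_into("<II", header, 68, ENDOFCHAIN, 0)  # first DIFAT sector, count
struct.pack_into("<109I", header, 76, 36, *([FREESECT] * 108))

fat_sectors = read_fat_sector_numbers(bytes(header))
print(fat_sectors)  # [36]
```

Bounding the loop by the Number of DIFAT Sectors value is a design choice: it keeps a corrupt "next DIFAT sector" pointer from producing an endless loop.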
Going back to our sample document, we can see that there are no DIFAT sectors, and there is only one sector number contained in the Compound Document Header DIFAT:
Figure 3: Compound Document Header
The values listed in light blue represent the DIFAT entries. The Compound Document Header holds a maximum of 109 DIFAT entries - thus there can be up to 109 FAT sectors within the file without the need for additional DIFAT sectors to track the sector numbers for each FAT sector. A numerical value indicates the sector number of a FAT sector. 0xFFFFFFFF indicates an unallocated (free) entry, which can later be used to store the sector number of an additional FAT sector.
The first four bytes of the DIFAT in the Compound Document Header in Figure 3 hold the value 0x24, or 36, indicating that the first (and in this case, only) sector containing the Compound Document File Allocation Table is sector number 36. Since we have already determined that this file uses 512-byte sectors, we can use our previous equation to calculate the file offset of the first sector of the FAT: (512 x 36) + 512 = 18944 bytes. For larger files there may be more than one sector containing the FAT, so we should be prepared to examine more than one sector to recover all of the contents of the FAT.
In Compound Documents where the sector size is 512 bytes, a FAT sector can manage the allocation status of up to 128 sectors of data (512 / 4 bytes per entry). For a Compound Document file with a 4096-byte sector size, each FAT sector can manage the allocation status of up to 1024 sectors. If a file requires more sectors than can be tracked by a single FAT sector, then additional FAT sectors will have to be allocated, and the sector numbers for those new FAT sectors will be appended to the DIFAT. Figure 4 shows a document that has multiple FAT sectors allocated in the DIFAT:
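The offset calculation can be written as a tiny helper. This is a sketch with hypothetical function names; it expresses the equation as (sector number + 1) x sector size, which for 512-byte sectors is the same as (512 x sector number) + 512, and which assumes that the header occupies exactly one sector at the start of the file.

```python
def sector_offset(sector_number, sector_size=512):
    """File offset of a sector; sector 0 begins right after the header sector."""
    return (sector_number + 1) * sector_size


def read_sector(data, sector_number, sector_size=512):
    """Return the raw bytes of one sector from the file contents."""
    start = sector_offset(sector_number, sector_size)
    return data[start:start + sector_size]


print(sector_offset(36))  # 18944
```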
Figure 4: DIFAT with Multiple FAT Sectors Allocated
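The capacity figures above follow directly from the four-byte entry size, which can be checked with a short sketch. The function names are my own, and the 129-sector file is just an invented example of a file that needs a second FAT sector.

```python
def entries_per_fat_sector(sector_size):
    """Each FAT entry is 4 bytes, so a sector holds sector_size / 4 entries."""
    return sector_size // 4


def fat_sectors_needed(total_sectors, sector_size=512):
    """FAT sectors required to hold one entry for every sector in the file."""
    per_sector = entries_per_fat_sector(sector_size)
    return -(-total_sectors // per_sector)  # ceiling division


print(entries_per_fat_sector(512))   # 128
print(entries_per_fat_sector(4096))  # 1024
print(fat_sectors_needed(129))       # 2
```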
Once we have located the FAT sectors, we can then use the FAT to determine sector chains for streams that we wish to recover. Returning to the example in our previous post, we were examining the SummaryInformation stream, which started at sector 20. Given the starting sector number, we can look up the entry for that sector within the File Allocation Table and determine the sector chain. The equation to calculate the byte offset in the FAT for a given sector number is (Sector Number x 4 bytes per FAT entry). Thus, to calculate the sector chain for the SummaryInformation stream, we would jump to byte offset 20 x 4 = 80 bytes into the FAT.
Figure 5: FAT entries for the SummaryInformation Stream
Figure 5 shows the first (and only) 512-byte FAT sector for the sample document we have been examining in this blog series. Each sector in the Compound Document file is represented by a four-byte value within the FAT. Sector 0 is mapped to the first four bytes, Sector 1 is mapped to the second four bytes, and so on. Normally these four bytes will be used to store the next sector number in a sector chain, as we previously mentioned. However, there are some special values that can be stored in a FAT entry. Per Microsoft, "Special values are reserved for chain terminators (ENDOFCHAIN = 0xFFFFFFFE), free sectors (FREESECT = 0xFFFFFFFF), and sectors that contain storage for FAT sectors (FATSECT = 0xFFFFFFFD) or DIFAT Sectors (DIFSECT = 0xFFFFFFFC), which are not chained in the same way as the others."
We can see in Figure 5 the sector chain for the SummaryInformation stream, which begins at sector number 20. We jump to offset 20 x 4 = 80 bytes into the FAT and see the sector number for the next sector in the stream. In this case, the next sector number is 0x15 (21). We calculate the offset into the FAT for sector 21, which is 21 x 4 = 84, and see that the next sector is 0x16 (22). We continue until we reach the EOC marker, which is stored in the FAT entry for sector 27. Thus, for this file, sectors 20-27 contain the SummaryInformation stream.
Given that this file is using a 512-byte sector size, the eight sectors in this chain give the SummaryInformation stream 8 x 512 = 4096 bytes of allocated storage within this particular Compound Document file. From the Directory Entry for the SummaryInformation stream, recall that the stream size for this stream is 4096 bytes, indicating that there will be no slack space for this stream, since the stream size matches the total size of the allocated sectors.
In my next post, I will investigate in more detail the structure of the SummaryInformation stream, including the PropertySet data structures that we can use to extract metadata information from the Compound Document file.
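A parser needs to tell these special values apart from ordinary "next sector" numbers. The sketch below shows one way to decode a raw FAT sector into 32-bit little-endian entries and label each one; the function names and the four-entry sample are hypothetical, while the constant values are the ones quoted above.

```python
import struct

SPECIAL_VALUES = {
    0xFFFFFFFF: "FREESECT",
    0xFFFFFFFE: "ENDOFCHAIN",
    0xFFFFFFFD: "FATSECT",
    0xFFFFFFFC: "DIFSECT",
}


def parse_fat_sector(raw):
    """Decode a raw FAT sector into a list of 32-bit little-endian entries."""
    count = len(raw) // 4
    return list(struct.unpack("<%dI" % count, raw[:count * 4]))


def describe_entry(value):
    """Label a FAT entry as a special value or as a pointer to the next sector."""
    name = SPECIAL_VALUES.get(value)
    return name if name else "next sector %d" % value


# Hypothetical four-entry fragment of a FAT sector.
raw = struct.pack("<4I", 0x15, 0xFFFFFFFE, 0xFFFFFFFD, 0xFFFFFFFF)
labels = [describe_entry(v) for v in parse_fat_sector(raw)]
print(labels)
# ['next sector 21', 'ENDOFCHAIN', 'FATSECT', 'FREESECT']
```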
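The same walk can be reproduced against raw FAT bytes using the (Sector Number x 4) offset equation. Since the real FAT sector from the sample document is not reproduced here, the sketch below builds a synthetic 512-byte FAT sector that contains only the SummaryInformation chain as described in the text; every other entry is assumed to be free, and the function name `fat_entry` is my own.

```python
import struct

ENDOFCHAIN = 0xFFFFFFFE
FREESECT = 0xFFFFFFFF


def fat_entry(fat_bytes, sector_number):
    """Read the 32-bit FAT entry for a sector at byte offset sector_number * 4."""
    (value,) = struct.unpack_from("<I", fat_bytes, sector_number * 4)
    return value


# Synthetic FAT sector: sectors 20..26 point to the next sector, 27 ends the chain.
fat = bytearray(struct.pack("<128I", *([FREESECT] * 128)))
for sector in range(20, 27):
    struct.pack_into("<I", fat, sector * 4, sector + 1)
struct.pack_into("<I", fat, 27 * 4, ENDOFCHAIN)

chain = []
sector = 20
# Stop at the EOC marker; the length cap protects against a looping chain.
while sector not in (ENDOFCHAIN, FREESECT) and len(chain) < 128:
    chain.append(sector)
    sector = fat_entry(fat, sector)

print(chain)  # [20, 21, 22, 23, 24, 25, 26, 27]
```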
Very interesting article. I've lost track of the amount of time I've spent wading through .doc files looking for useful artefacts. MS have a cool tool called OffVis (Office Visualizer) which allows these internal data structures to be examined in detail.
Any thoughts on recovering deleted text or highlighting altered text within a document when track changes is not enabled?
I had not seen OffVis, thanks for sharing that. Looks pretty cool. I was thinking of writing a python script or some sort of parser to dump out the stream slack. I have noticed in some documents that I was reviewing that there are some artifacts in the SummaryInformation stream such as fragments / substrings of the author's name repeated a couple of times in what appeared to be slack space. I have to dig into that a little further. Recovering deleted text or other file structures may be possible - I haven't tried that yet either, but it should theoretically be possible to go into a compound document and analyze the FAT for all sectors that are present in the file but marked as 0xFFFFFFFF (free / available) and then print a hex dump of those sectors. It may turn up something useful. Also a parser for stream slack should not be too hard to put together; I just need to have the free time to do it and then test it out on a multitude of sample files! Thanks again for the OffVis reference!