Thursday, December 19, 2013

Microsoft Office Compound Document Internals (Part 1 - Document Header)

Recently I was working on a case where we had a large collection of MS Office documents for review. There were literally thousands of them, and we had to work out which documents belonged to which user, and build a picture of who wrote each document, when they wrote it, and who had viewed or edited it. It occurred to me that the information I was looking for was contained within the Office documents themselves - in the metadata structures. You see, Microsoft Office keeps track of several bits of metadata within its documents - things such as the usernames of the document author and last editor, the dates and times the document was created, last saved, and last printed, as well as a bunch of other potentially useful information. This post describes my efforts to do bulk extraction of this metadata from my massive collection of documents and present it in a useful way.


Microsoft Office Compound Documents, as described by OpenOffice.org, the Apache POI project, and Microsoft's own documentation, are basically self-contained filesystems that use concepts from the File Allocation Table filesystem. They are composed of two primary structures - streams, which are analogous to files within a FAT filesystem, and storages, which are analogous to folders or directories. The Compound Document file is broken up into sectors, much like a traditional filesystem, and each sector is addressed by its sector number within internal structures that specify the location of streams and storages.

To start looking into the Office Compound Document, we must first analyze the Compound Document Header, which is the first 512 bytes of the file. This header contains information about the structure of the Compound Document file itself, such as the location and number of the FAT structures, and information about where to begin looking for streams and storages. Forensic investigators will likely recognize the first 8 bytes of the document header - they are 0xD0 0xCF 0x11 0xE0 0xA1 0xB1 0x1A 0xE1, and this 8-byte sequence is the same for all Microsoft Office Compound Document files. Microsoft Office Compound Documents have no footer. The header signature is useful in file carving to quickly identify Office Compound Documents that have had their file extensions changed, or that have been deleted but are still resident in unallocated space.
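As a rough illustration of the signature check described above, here is a minimal Python sketch. The function names and the 512-byte alignment default are assumptions made for this example, not part of any existing tool; real carving tools apply additional validation.

```python
from typing import List

# The 8-byte Compound Document signature quoted in the text above.
CFB_SIGNATURE = bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1])


def has_cfb_signature(data: bytes) -> bool:
    """Return True if data begins with the Compound Document signature."""
    return data[:8] == CFB_SIGNATURE


def find_signatures(blob: bytes, alignment: int = 512) -> List[int]:
    """Return offsets in blob where the signature sits on an aligned boundary.

    alignment=512 is an assumption (files usually start on a disk sector
    boundary); pass alignment=1 to report every occurrence.
    """
    hits = []
    pos = blob.find(CFB_SIGNATURE)
    while pos != -1:
        if pos % alignment == 0:
            hits.append(pos)
        pos = blob.find(CFB_SIGNATURE, pos + 1)
    return hits
```

A signature match only says that a Compound Document may start at that offset; the rest of the header still has to be checked before the hit can be trusted.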

Microsoft describes the high-level structure of a Compound Document file as follows:

Compound File Structure
Figure 1 - Compound Document Structure

For this post, we will be focusing on the Compound File Header, which describes the overall structure of the document and provides reference information that can be used to parse out the remaining structures in the document.

In addition to the first 8 bytes, there is quite a bit of additional information contained within the document header that is crucial to parsing out the contents of a Compound Document. These values include the byte order, the sector size, and the locations of the FAT, DIFAT, and Mini-FAT structures. The FAT, DIFAT, and Mini-FAT structures will be described in future posts. The byte structure of the Compound Document header is as follows:

Offset (bytes)  Size (bytes)  Field
0               8             Header Signature
8               16            Header CLSID
24              2             Minor Version
26              2             Major Version
28              2             Byte Order
30              2             Sector Shift
32              2             Mini Sector Shift
34              6             Reserved
40              4             Number of Directory Sectors
44              4             Number of FAT Sectors
48              4             First Directory Sector Location
52              4             Transaction Signature Number
56              4             Mini Stream Cutoff Size
60              4             First Mini FAT Sector Location
64              4             Number of Mini FAT Sectors
68              4             First DIFAT Sector Location
72              4             Number of DIFAT Sectors
76              436           DIFAT (109 four-byte entries)
Figure 2 - Compound Document Header Structure
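The header layout lends itself to a single `struct.unpack` call. The following is a sketch only: the `CfbHeader` tuple and its field names are invented for this example, and the field sizes assume the layout in Microsoft's documentation (a 512-byte header ending in 109 four-byte DIFAT entries).

```python
import struct
from collections import namedtuple

# Little-endian, no padding: signature, CLSID, five 16-bit fields,
# 6 reserved bytes, nine 32-bit fields, then 109 DIFAT entries (512 bytes).
HEADER_FORMAT = "<8s16s5H6s9I109I"

# Field names are illustrative, chosen to mirror Figure 2.
CfbHeader = namedtuple("CfbHeader", [
    "signature", "clsid", "minor_version", "major_version", "byte_order",
    "sector_shift", "mini_sector_shift", "reserved",
    "num_directory_sectors", "num_fat_sectors", "first_directory_sector",
    "transaction_signature", "mini_stream_cutoff", "first_mini_fat_sector",
    "num_mini_fat_sectors", "first_difat_sector", "num_difat_sectors",
    "difat",
])


def parse_header(data: bytes) -> CfbHeader:
    """Unpack the first 512 bytes of a Compound Document into named fields."""
    if len(data) < 512:
        raise ValueError("a Compound Document header is 512 bytes long")
    fields = struct.unpack(HEADER_FORMAT, data[:512])
    # The first 17 values are scalar fields; the remaining 109 are the DIFAT.
    return CfbHeader(*fields[:17], difat=fields[17:])
```

With a header parsed this way, values such as `sector_shift` and `first_directory_sector` can be read by name rather than by raw offset.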

The key values that we need to be aware of in the Compound Document Header are "Byte Order", "Sector Shift", and "First Directory Sector Location". According to Microsoft's documentation, the "Byte Order" field, at offset 28 bytes within the header, should always be 0xFFFE, which indicates that values within the file are stored in little-endian byte order. The field that immediately follows, at offset 30, is "Sector Shift", which specifies the sector size used within this Compound Document file as a power of two: the sector size is 2 raised to the Sector Shift value. A sector shift value of 0x0009 therefore indicates that sectors are 512 bytes in size (2^9), and a value of 0x000C indicates that the sector size is 4096 bytes (2^12).
Once the sector size is known, we can then identify the file offset of any sector by using the following equation:

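File Offset = (Sector Number + 1) * Sector Size

We add 1 to the sector number to account for the Compound Document Header, which occupies the first sector-sized block of the file. According to Microsoft's documentation, the header itself is always 512 bytes long, so for files with 512-byte sectors this is the same as (Sector Size * Sector Number) + 512. In files with 4096-byte sectors, the rest of the first 4096 bytes after the header is zero-filled padding, so Sector 0 begins at file offset 4096. Recall from Figure 1 that Sector 0 starts immediately after the Compound Document Header, which is file offset 512 in a file with 512-byte sectors.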
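The offset calculation can be sketched in a few lines of Python. The function names are illustrative and not from any existing library. Note one assumption: per Microsoft's specification the header is padded out to a full sector in files with 4096-byte sectors, so the sketch computes (sector number + 1) * sector size, which equals (sector size * sector number) + 512 whenever sectors are 512 bytes.

```python
def sector_size(sector_shift: int) -> int:
    """Sector size in bytes: 2 raised to the Sector Shift value."""
    return 1 << sector_shift


def sector_offset(sector_number: int, sector_shift: int) -> int:
    """File offset of a sector, skipping the header's sector-sized block.

    Assumes the header occupies one full sector at the start of the file,
    which for 512-byte sectors is exactly the 512-byte header.
    """
    return (sector_number + 1) * sector_size(sector_shift)
```

For example, `sector_offset(0, 9)` gives 512, the first byte after the header in a file with 512-byte sectors.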

The next value of interest in the Compound Document Header is the "First Directory Sector Location". This field is located at offset 48 bytes within the Compound Document Header. This is analogous to the sector number containing the Root directory in FAT filesystems. If we locate this sector within the Compound Document file, we will find the Root directory entry for the Compound Document. This field will contain a sector number, which we can plug into the above equation to find the file offset of the sector containing the root directory entry.

Compound Document Header
Figure 3 - Compound Document Header Fields

In Figure 3 above, we can see that the value for "Byte Order" at offset 28 is 0xFFFE (when read in little-endian, right-to-left byte order). At offset 30 we see the value 0x0009 for Sector Shift, indicating that this file uses 512-byte sectors. Finally, at offset 48, we see the value 0x00000025, indicating that the first directory sector is sector 0x25, or sector 37. We can then use the equation to find the file offset of that sector, which is (37 x 512) + 512, or 19456 bytes. If we jump to that offset we will find our first Compound File Directory Entries. We will begin to analyze these structures in a separate post.
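Putting the walkthrough together, the following Python sketch reads the three header fields at offsets 28, 30, and 48 and then seeks to the first directory sector. The function name is hypothetical, and the sketch assumes a well-formed file opened in binary mode; it does not validate the signature or follow sector chains.

```python
import struct


def read_first_directory_sector(f) -> bytes:
    """Return the raw bytes of the first directory sector.

    f is any seekable binary file object positioned at the start of a
    Compound Document (for example, open(path, "rb")).
    """
    header = f.read(512)
    if len(header) < 512:
        raise ValueError("truncated Compound Document header")

    # Byte Order (offset 28) and Sector Shift (offset 30), both 16-bit.
    byte_order, sector_shift = struct.unpack_from("<HH", header, 28)
    if byte_order != 0xFFFE:
        raise ValueError("unexpected Byte Order value")

    # First Directory Sector Location (offset 48), 32-bit.
    (first_dir_sector,) = struct.unpack_from("<I", header, 48)

    size = 1 << sector_shift
    # The header occupies the first sector-sized block, hence the +1.
    f.seek((first_dir_sector + 1) * size)
    return f.read(size)
```

For the file in Figure 3 this would seek to offset 19456 and return 512 bytes, which hold the first directory entries.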
