Monday, December 30, 2013

Microsoft Office Compound Document Internals (Part 2 - Directory Entries)

Building on my previous post about the Compound Document file format, in this post we will discuss the Compound Document Directory Entry structures. Directory Entries are structures that store information about a stream or storage within a Compound Document file. Similar to directory entries in the FAT filesystem, Compound Document Directory Entries contain information such as timestamps, stream / storage names, and starting sector information. In order to recover the contents of a stream or storage, an examiner must first locate the Directory Entry for that particular stream or storage. In this post we will discuss how to locate Directory Entries, how to analyze their structure and content, and data that could be potentially useful to the forensic examiner.

Recall from Part 1 that Compound Document files are composed of streams and storages, which are analogous to files and folders, respectively, on a traditional filesystem. In Part 1, we discussed how to parse the Compound Document Header to find the sector size and the first directory sector location. Once we have this information, we can begin to investigate the actual contents of the file. Building on our previous example, recall that the "First Directory Sector Location" value was 0x25, or sector 37. We determined that the file offset corresponding to sector 37 was (37 x 512) + 512 = 19456 bytes. The following image shows the contents of that sector:
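The offset arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name is made up for this post, and it assumes the 512-byte header and 512-byte sectors of our example file.

```python
HEADER_SIZE = 512  # assumption: the header fills the first 512 bytes, as in the example file


def sector_to_offset(sector_number: int, sector_size: int = 512) -> int:
    """Return the file offset of a sector: (SectorNumber x SectorSize) + 512."""
    return sector_number * sector_size + HEADER_SIZE


# First Directory Sector Location from the header was 0x25 (sector 37).
print(sector_to_offset(0x25))
```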


Figure 1: First Directory Sector

The First Directory Sector begins at file offset 19456, as shown in Figure 1, and contains Directory Entry structures. Per Microsoft's documentation, these Directory Entry structures are fixed length - 128 bytes long. Highlighting has been applied to the Directory Entry structures in Figure 1 to show that there are four Directory Entries in each 512-byte sector within the Compound Document file.
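Because the entries are fixed length, carving a directory sector into individual Directory Entries is just a matter of slicing. The following is a minimal sketch; the function name is hypothetical, and the all-zero sector is a stand-in for real data read from the file.

```python
ENTRY_SIZE = 128  # fixed Directory Entry length, per Microsoft's documentation


def split_directory_sector(sector: bytes) -> list:
    """Split the raw bytes of one directory sector into 128-byte entries."""
    if len(sector) % ENTRY_SIZE != 0:
        raise ValueError("sector length is not a multiple of 128 bytes")
    return [sector[i:i + ENTRY_SIZE] for i in range(0, len(sector), ENTRY_SIZE)]


# Stand-in for the 512 bytes read at file offset 19456.
placeholder_sector = bytes(512)
print(len(split_directory_sector(placeholder_sector)))
```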

Each storage or stream within a Microsoft Office Compound Document will be represented by a single Directory Entry. Per Microsoft, "The number of directory entries can exceed the number of storage objects and stream objects due to unallocated directory entries", which implies that when a stream or storage is deleted from within a Compound Document, the Directory Entry will remain; however, I have neither encountered nor investigated this yet. If true, Directory Entry structures left behind by deleted streams and storages may be useful forensic artifacts that describe editing activity that took place within a document.
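One way to start testing this idea is to look for entries whose Object Type byte is 0x00, the value Microsoft's documentation assigns to unknown or unallocated entries. The sketch below is untested against real deleted streams; the function name is hypothetical, and the byte offset of the Object Type field (66) follows from the field layout in Microsoft's documentation.

```python
ENTRY_SIZE = 128
OBJECT_TYPE_OFFSET = 66  # Object Type follows the 64-byte name and 2-byte name length


def find_unallocated_entries(directory: bytes) -> list:
    """Return the indexes of entries whose Object Type byte is 0x00."""
    return [
        offset // ENTRY_SIZE
        for offset in range(0, len(directory) - ENTRY_SIZE + 1, ENTRY_SIZE)
        if directory[offset + OBJECT_TYPE_OFFSET] == 0x00
    ]


# Synthetic sector: entries 0 and 1 marked as streams (0x02), entries 2 and 3 left empty.
synthetic = bytearray(512)
synthetic[0 * ENTRY_SIZE + OBJECT_TYPE_OFFSET] = 0x02
synthetic[1 * ENTRY_SIZE + OBJECT_TYPE_OFFSET] = 0x02
print(find_unallocated_entries(bytes(synthetic)))
```

Whether an unallocated entry still holds a recoverable name or starting sector is exactly the open question raised above, so any hits would need to be examined by hand.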

The below table describes the byte structure of a directory entry.


Offset (bytes)   Size (bytes)   Field
0                64             Directory Entry Name
64               2              Directory Entry Name Length
66               1              Object Type
67               1              Color Flag
68               4              Left Sibling ID
72               4              Right Sibling ID
76               4              Child ID
80               16             CLSID
96               4              State Bits
100              8              Creation Time
108              8              Modified Time
116              4              Starting Sector Location
120              8              Stream Size
Table 1: Directory Entry fields
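The layout in Table 1 maps directly onto a Python struct format string. The sketch below is one possible way to unpack an entry, not a reference implementation: the tuple and function names are my own, and the sample bytes are synthetic. They are modeled on the SummaryInformation entry examined later in this post, with the sibling IDs set to placeholder values.

```python
import struct
from collections import namedtuple

# Little-endian, no padding: 64+2+1+1+4+4+4+16+4+8+8+4+8 = 128 bytes.
DIR_ENTRY_FORMAT = "<64sHBBIII16sIQQIQ"

DirEntry = namedtuple(
    "DirEntry",
    "raw_name name_length object_type color_flag left_sibling_id "
    "right_sibling_id child_id clsid state_bits creation_time "
    "modified_time starting_sector stream_size",
)


def parse_directory_entry(entry: bytes) -> DirEntry:
    """Unpack one 128-byte Directory Entry into named fields."""
    if len(entry) != struct.calcsize(DIR_ENTRY_FORMAT):
        raise ValueError("a Directory Entry must be exactly 128 bytes")
    return DirEntry._make(struct.unpack(DIR_ENTRY_FORMAT, entry))


# Synthetic entry; the sibling IDs here are placeholders, not values from the sample file.
NOSTREAM = 0xFFFFFFFF
sample = struct.pack(
    DIR_ENTRY_FORMAT,
    "\x05SummaryInformation\x00".encode("utf-16-le"),  # padded to 64 bytes by struct
    0x28, 0x02, 0x01, NOSTREAM, NOSTREAM, NOSTREAM,
    bytes(16), 0, 0, 0, 0x14, 0x1000,
)
entry = parse_directory_entry(sample)
print(entry.object_type, entry.starting_sector, entry.stream_size)
```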

Some of the key pieces of information stored in the directory entry are the "Directory Entry Name", "Starting Sector Location", and "Stream Size". These fields help us identify a particular stream or storage, determine its starting sector (from which we can calculate the file offset), and estimate how many sectors the stream will occupy (which helps with sector chaining using the FAT, to be described in a future post). The creation and last modification timestamps of a storage may also be useful or interesting to forensic investigators.

In our sample document, pictured in Figure 1, let's take a look at one of the individual Directory Entries contained within the first directory sector:

Figure 2: Directory Entry for the SummaryInformation stream

Figure 2 shows the Directory Entry for the SummaryInformation stream object (which we will discuss in greater detail in a future post). Highlighting has been applied to show the different fields within the directory entry structure as listed in Table 1. The first field is the Directory Entry Name, highlighted in light yellow in Figure 2, which indicates that this Directory Entry is for the "SummaryInformation" stream within the compound document file. It should be noted that per Microsoft, the Directory Entry Name field contains characters in the UTF-16 encoding format, and can be no longer than 64 bytes.

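The next value is the Directory Entry Name Length, which is highlighted in green and contains the hex value 0x0028 (40), the length of the directory entry name in bytes. Remember that we are reading in little-endian byte order. This length value includes the terminating NULL character. Note that the stored name begins with the control character 0x0005 ahead of "SummaryInformation", which is why the length is 40 bytes (20 UTF-16 characters, including the NULL) rather than 38.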
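Decoding the name is a matter of trimming the 64-byte field to the stated length, dropping the terminating NULL, and decoding the rest as little-endian UTF-16. A minimal sketch, with a hypothetical function name and synthetic name bytes (the real stream name starts with the control character 0x0005, which accounts for 2 of the 40 bytes):

```python
def decode_entry_name(raw_name: bytes, name_length: int) -> str:
    """Decode a Directory Entry Name using its length in bytes (NULL included)."""
    if name_length < 2 or name_length > 64 or name_length % 2 != 0:
        raise ValueError("name length must be an even value between 2 and 64")
    # Drop the 2-byte terminating NULL before decoding.
    return raw_name[:name_length - 2].decode("utf-16-le")


# Synthetic 64-byte name field modeled on Figure 2.
raw = "\x05SummaryInformation\x00".encode("utf-16-le").ljust(64, b"\x00")
name = decode_entry_name(raw, 0x0028)
print(repr(name))
```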

The next value is the Object Type field, which is highlighted in light blue and contains the value 0x02. This indicates whether the object referred to by this directory entry is a stream, a storage, or another object type. Typical values are 0x01 for storage objects, 0x02 for stream objects, and 0x05 for the root storage object. The directory entry pictured in Figure 2 is type 0x02, indicating that it points to a stream object.
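A small lookup table keeps these values readable when printing parsed entries. The dictionary and function names below are my own; the values are the ones listed in Microsoft's documentation.

```python
OBJECT_TYPES = {
    0x00: "unknown or unallocated",
    0x01: "storage",
    0x02: "stream",
    0x05: "root storage",
}


def describe_object_type(value: int) -> str:
    """Translate the Object Type byte into a readable label."""
    return OBJECT_TYPES.get(value, f"invalid (0x{value:02X})")


print(describe_object_type(0x02))
```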

The next value is the "Color Flag", which indicates a color, red or black, for each node. This color flag is used to organize directory entries into a red-black tree, a type of self-balancing binary search tree that allows for quick searching of entries within a compound document file. In our particular example file, the flag is set to 0x01, which indicates that the directory entry is a black node in the red-black tree. A value of 0x00 would indicate a red node. For more information on the compound document red-black tree, see Microsoft's documentation here.

The next values of interest are the Left and Right Sibling IDs, highlighted in orange and red, respectively. It is tempting to think of these as the forward and reverse pointers of a doubly-linked list, but they are more accurately the left and right branches of the red-black tree described above: within a storage, the left sibling sorts before the current entry and the right sibling sorts after it. These values maintain the hierarchy and interrelationships of the streams and storages within the file. Valid values for these fields fall between 0x00000000 and 0xFFFFFFF9, and the value represents the stream ID of the left or right sibling. If there is no Left or Right Sibling, the value will be 0xFFFFFFFF.

The next value is the Child ID, highlighted in green. Its value in our sample document is 0xFFFFFFFF, indicating that there is no child object associated with the object referenced by this directory entry. Like the Sibling ID values, the value of the Child ID field represents the stream ID of the child object. It is also important to note that stream objects cannot have child objects - only storage objects can. If we think of streams as files and storages as folders / directories, this makes sense.
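To see how the Sibling and Child IDs fit together, here is a rough sketch that walks a directory and yields each entry with its depth in the hierarchy. The entry list is entirely hypothetical (it is not taken from the sample file), entries are represented as plain dictionaries for brevity, and the `seen` set is my own guard against loops in a malformed file.

```python
NOSTREAM = 0xFFFFFFFF  # "no sibling" / "no child" marker


def walk(entries, entry_id, depth=0, seen=None):
    """Yield (depth, name) pairs: left subtree, this entry, its children, right subtree."""
    seen = set() if seen is None else seen
    if entry_id == NOSTREAM or entry_id in seen or entry_id >= len(entries):
        return
    seen.add(entry_id)
    entry = entries[entry_id]
    yield from walk(entries, entry["left"], depth, seen)
    yield depth, entry["name"]
    yield from walk(entries, entry["child"], depth + 1, seen)
    yield from walk(entries, entry["right"], depth, seen)


# Hypothetical directory: a root storage whose child tree is rooted at entry 2.
entries = [
    {"name": "Root Entry", "left": NOSTREAM, "right": NOSTREAM, "child": 2},
    {"name": "Data", "left": NOSTREAM, "right": NOSTREAM, "child": NOSTREAM},
    {"name": "1Table", "left": 1, "right": 3, "child": NOSTREAM},
    {"name": "WordDocument", "left": NOSTREAM, "right": NOSTREAM, "child": NOSTREAM},
]
for depth, name in walk(entries, 0):
    print("  " * depth + name)
```

The stream ID is simply the entry's index in the directory, so entry 0 is the first Directory Entry in the first directory sector.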

After the Child ID is the CLSID, which must be set to all 0x00 values for stream objects. A CLSID is a GUID (Globally Unique Identifier) which represents an object class. Future posts will contain some common GUIDs that may be encountered within a Microsoft Office Compound Document file. In our example file, since the directory entry being examined represents a stream object, the field contains all NULL values.
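Python's standard uuid module can render the 16 raw CLSID bytes in the familiar GUID notation, since the first three GUID components are stored little-endian on disk. A short sketch with a hypothetical helper name and illustrative bytes:

```python
import uuid


def format_clsid(raw_clsid: bytes) -> str:
    """Render the 16-byte CLSID field as a GUID string."""
    if len(raw_clsid) != 16:
        raise ValueError("the CLSID field must be exactly 16 bytes")
    return str(uuid.UUID(bytes_le=raw_clsid))


# A stream object, such as the one in Figure 2, has an all-NULL CLSID.
print(format_clsid(bytes(16)))

# Illustrative non-NULL bytes, to show the little-endian reordering.
print(format_clsid(bytes.fromhex("0609020000000000C000000000000046")))
```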

The next field contains State bits, which allow for user-defined state data to be retained for storage or root-storage objects. In the case of our sample file, this field is highlighted in pale yellow and contains all NULL values.

The next 16 bytes contain the creation time and modification time of the object, if the object is a storage or root storage object. Stream objects do not retain this information, so for them these fields will be all NULL values. In the case of our sample document, the object being examined is a stream object and thus does not contain timestamp information. If timestamp information were available, it would be stored in the Microsoft FILETIME data type (a 64-bit count of 100-nanosecond intervals since January 1, 1601 UTC) and would thus be trivial to parse. If you encounter storage objects within a compound document file, they are likely to contain this timestamp information, which may be useful in a forensic examination for building a timeline of document editing activity.
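As an illustration of how little work FILETIME parsing takes, the sketch below converts the 64-bit value to a Python datetime. The function name is my own, and treating an all-NULL value as "no timestamp" is an assumption that matches how stream objects appear in the sample file.

```python
from datetime import datetime, timedelta, timezone

# FILETIME counts 100-nanosecond intervals since 1601-01-01 00:00:00 UTC.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(filetime: int):
    """Convert a FILETIME value to a UTC datetime, or None if it is all NULL."""
    if filetime == 0:
        return None
    # Integer division drops sub-microsecond precision.
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)


# 116444736000000000 is the FILETIME value for the Unix epoch.
print(filetime_to_datetime(116444736000000000))
```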

The next field is the "Starting Sector Location", which gives us the sector number within the document where the data for this object is located. The directory entry points to the first sector of the stream or storage, which can then be used along with the File Allocation Table to determine sector chains for the particular stream. In the case of our sample file, the starting sector location is highlighted in pink and contains the value 0x00000014 (20). Thus the starting sector location for this stream object is in sector 20. We can use our equation from the previous post to calculate the file offset for this sector: FileOffset = (SectorNumber x SectorSize) + 512. Thus, the starting sector location for this stream object is at file offset (20 x 512) + 512 = 10752 bytes.

The final field within the Directory Entry is a 64-bit field that indicates the Stream Size. This field contains the "size of the user defined data, if this is a stream object." In the example shown in Figure 2, the Stream Size for the SummaryInformation stream is 0x1000 (4096) bytes. This value will be important when calculating sector chains so we should remember it for later when we will parse out stream objects from within the compound document file.
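The Stream Size also tells us roughly how long the sector chain should be. As a quick sketch (the function name is hypothetical, and it assumes the stream is stored in regular 512-byte sectors):

```python
def sectors_needed(stream_size: int, sector_size: int = 512) -> int:
    """Return how many whole sectors are required to hold stream_size bytes."""
    # Ceiling division without floating point.
    return -(-stream_size // sector_size)


# SummaryInformation stream from Figure 2: 0x1000 bytes.
print(sectors_needed(0x1000))
```

For our sample stream this works out to eight sectors, so when we follow the sector chain we should expect eight entries starting from sector 20.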

Figure 3: Starting Sector Location for SummaryInformation Stream Object

Figure 3 shows the Starting Sector Location for the SummaryInformation stream object contained within the sample file. It should be noted that if the stream object is larger than one sector's worth of data, then the data will be stored in multiple sectors which may or may not be contiguously located within the file. Thus it will be necessary to determine the "sector chain" in order to locate and recover all of the sectors that contain the stream object data. This will be the subject of a future post.

4 comments:

  1. Very Nice post, It provides detailed information about Compound Document File. I have few questions.
    1) Is there any way to find total count of Directory Entries in Header.
    2) How could I go directly to the offset if .Doc File contains Macro?
