Wednesday, January 29, 2014

Microsoft Office Compound Document Internals (Part 4 - SummaryInformation)

In this segment of the series on Microsoft's Compound Document file format, I am going to discuss extracting information from the SummaryInformation data structure. You may have noticed the SummaryInformation references in the directory entries that we viewed in the previous post. The SummaryInformation structure is the internal data structure within Compound Document files that holds the document's metadata: the author's username, the username of the last person to edit the document, date and time values for file creation, last save, and last print, and statistical information about the file. The SummaryInformation data structure is documented on MSDN. In this post, we will walk through our sample document and extract the document property metadata from the SummaryInformation structure.

My desire to access the information in the SummaryInformation and DocumentSummaryInformation structures came from a case I was working in which there were thousands of documents that we had to make sense of. My job as a forensic investigator was to determine as much information as possible about who was editing / processing / sending / receiving these files, and to create a general timeline of events. Absent any other forensic data, the first place I thought to look was within the document properties. Initially I did this by opening copies of the documents in MS Office and noting the information. I noticed that the usernames and date/time information were largely intact, but doing this manually across thousands of documents was rather tedious, so I looked for ways to extract the information in bulk. A few Google searches led me to the Apache POI project, which provides a Java API for extracting data from the SummaryInformation and DocumentSummaryInformation structures, among a wealth of other things. A few days later, I had developed MetadataMiner, which can do bulk extraction, searching, and reporting on the metadata fields contained in the SummaryInformation and DocumentSummaryInformation structures in MS Office Compound Document files as well as Office Open XML format files.

After I wrote MetadataMiner, I was still curious about what was going on at the byte level within the file when this data was being extracted. Where is the data actually stored within the file, and how do we get it out? Given the work that we have done already on analyzing the MS Compound Document file format, we are halfway there. From the directory entry for the SummaryInformation stream that we recovered in our previous post, we can tell that the SummaryInformation data structure starts at sector 0x14 (20). We can jump to that sector by calculating its file offset: (20 x 512) + 512 = 10752 bytes, where the extra 512 bytes account for the file header that precedes sector 0. Jumping to this offset in our hex editor, we see the following:

Figure 1: SummaryInformation Data Structure
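The offset arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name is made up for this example, and it assumes the 512-byte sector size used by our sample document, with the 512-byte file header sitting in front of sector 0.

```python
SECTOR_SIZE = 512  # assumption: the sample document uses 512-byte sectors
HEADER_SIZE = 512  # the file header occupies the bytes before sector 0


def sector_to_offset(sector: int, sector_size: int = SECTOR_SIZE) -> int:
    """Return the file offset at which the given sector begins."""
    return sector * sector_size + HEADER_SIZE


# Sector 0x14 (20) is where the SummaryInformation stream starts.
print(sector_to_offset(0x14))  # 10752
```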

The good news is that we can already see the actual values within the highlighted data that correspond to the values in the document's metadata properties. But that's not enough for me. I want to know how the data is structured and how I can parse it out and make sense of it. To do so, I have to consult the MSDN documentation on the SummaryInformation Property Set structure. The data structure according to Microsoft is as follows:

ByteOrder (2 bytes)
Version (2 bytes)
SystemIdentifier (4 bytes)
CLSID (16 bytes)
NumPropertySets (4 bytes)
FMTID (16 bytes)
Offset 0 (4 bytes)
PropertySet0 (Variable)
Table 1: SummaryInformation Property Set Structure

The key bits of information in the SummaryInformation Property Set structure are the NumPropertySets, FMTID (Format Identifier), and Offset 0 values. The NumPropertySets value for SummaryInformation should always be 0x01, and the FMTID value should always be F29F85E0-4FF9-1068-AB91-08002B27B3D9. The Offset 0 value should be 0x30 (48), which is the offset from the start of the SummaryInformation Property Set structure to the first PropertySet contained within. We can see all of the data structures applied to the SummaryInformation Property Set structure below:

Figure 2: SummaryInformation with structural highlighting applied.

In Figure 2, we can see the NumPropertySets value highlighted in dark blue, and it is indeed 0x01 (recall that we are working in little-endian byte order). The value in pink that follows is the FMTID value, which matches the Microsoft specification. Thus we know that this is a valid SummaryInformation structure. The value in dark red that follows is the Offset 0 value, which in this case (and in most cases) is 0x30 (48). This tells us that the first and only PropertySet structure starts 48 bytes into the SummaryInformation Property Set structure. We can then add 48 to 10752 to get the offset of the first Property Set (PropertySet 0), which starts at file offset 10800. This is the bright yellow field shown in Figure 2.
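The header walk above can be sketched in Python as follows. This is a hedged illustration, not the code used by MetadataMiner or Apache POI: the function name and the dictionary keys are invented for this example, and `stream` is assumed to hold the raw bytes of the SummaryInformation stream beginning at its first byte (file offset 10752 in our sample).

```python
import struct
import uuid

# FMTID that identifies the SummaryInformation property set.
FMTID_SUMMARY_INFORMATION = uuid.UUID("F29F85E0-4FF9-1068-AB91-08002B27B3D9")


def parse_stream_header(stream: bytes) -> dict:
    """Parse the fixed 48-byte header that precedes PropertySet 0."""
    # ByteOrder (2 bytes), Version (2 bytes), SystemIdentifier (4 bytes).
    byte_order, version, system_id = struct.unpack_from("<HHI", stream, 0)
    # GUIDs are stored with their first three fields in little-endian order.
    clsid = uuid.UUID(bytes_le=stream[8:24])
    (num_property_sets,) = struct.unpack_from("<I", stream, 24)
    fmtid = uuid.UUID(bytes_le=stream[28:44])
    (offset0,) = struct.unpack_from("<I", stream, 44)
    return {
        "byte_order": byte_order,
        "version": version,
        "system_id": system_id,
        "clsid": clsid,
        "num_property_sets": num_property_sets,
        "fmtid": fmtid,
        "offset0": offset0,
    }
```

A caller would typically check that `fmtid` equals `FMTID_SUMMARY_INFORMATION` before going further. For our sample, `offset0` is 48, so PropertySet 0 begins at 10752 + 48 = 10800.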

In order to extract data from the PropertySet structure, we must understand how it is laid out. A PropertySet is much as its name suggests: it begins with a collection of entries that describe the type and location of the actual Property data values within the document. These entries are called PropertyIdentifierAndOffset values. Each PropertyIdentifierAndOffset value identifies a Property (i.e. Author, Last Editor, etc.) and gives the offset of that Property's data from the start of the PropertySet structure. From Microsoft's documentation, the PropertySet data structure is variably sized and is set up as follows:

Size (4 bytes)
NumProperties (4 bytes)
PropertyIdentifierAndOffset 0 (variable)
PropertyIdentifierAndOffset n (variable)
Property 0 (variable)
Property n (variable)
Table 2: PropertySet data structure

The key bits of information we need here are the Size of the PropertySet and the NumProperties value. The Size value indicates the overall size in bytes of the PropertySet structure. We can use this value to determine where the PropertySet structure ends. In the case of our sample document, we can see in Figure 2 that the Size value for this PropertySet structure is highlighted in yellow and its value is 0x164 (356) bytes. The next value is NumProperties, highlighted in green, and its value is 0x10 (16), indicating that 16 PropertyIdentifierAndOffset values follow.

Each PropertyIdentifierAndOffset value is 8 bytes long. The first four bytes hold the PropertyIdentifier, which says which Property the entry describes. The second four bytes hold the offset of that Property value from the start of the PropertySet structure. The following figure shows each of these PropertyIdentifierAndOffset values highlighted:

Figure 3: PropertyIdentifierAndOffset fields

In Figure 3, the PropertyIdentifierAndOffset fields start at file offset 10808 and run up to file offset 10936 (16 entries of 8 bytes each). Each one specifies a particular document property and its offset from the start of the PropertySet structure (which, remember, is at file offset 10800). Per Microsoft, the values for PropertyIdentifier are as follows:

CodePage property (0x00000001)
PIDSI_TITLE (0x00000002)
PIDSI_SUBJECT (0x00000003)
PIDSI_AUTHOR (0x00000004)
PIDSI_KEYWORDS (0x00000005)
PIDSI_COMMENTS (0x00000006)
PIDSI_TEMPLATE (0x00000007)
PIDSI_LASTAUTHOR (0x00000008)
PIDSI_REVNUMBER (0x00000009)
PIDSI_APPNAME (0x00000012)
PIDSI_EDITTIME (0x0000000A)
PIDSI_LASTPRINTED (0x0000000B)
PIDSI_CREATE_DTM (0x0000000C)
PIDSI_LASTSAVE_DTM (0x0000000D)
PIDSI_PAGECOUNT (0x0000000E)
PIDSI_WORDCOUNT (0x0000000F)
PIDSI_CHARCOUNT (0x00000010)
PIDSI_DOC_SECURITY (0x00000013)

Using these values, we can go through each of the PropertyIdentifierAndOffset values and calculate their offsets. For example, in Figure 3 we can see that the PIDSI_TITLE entry is located at file offset 10816. It contains the value 0x00000002 for the PropertyIdentifier and 0x90 (144) for the offset. We can then calculate the file offset by adding 144 to 10800 (the file offset of the start of the PropertySet structure), which gives us 10944. We can then jump to that file offset and see the structure for the document's Title Property:

Figure 4: Title Property Highlighted
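The lookup just described, from PropertyIdentifierAndOffset entry to the location of the Property value, can be sketched like this. Treat it as an illustration under assumptions: the function and dictionary names are invented here, `stream` is assumed to be the raw bytes of the SummaryInformation stream, and the returned offsets are relative to the start of that stream rather than to the start of the file.

```python
import struct

# Identifiers copied from the list above.
PROPERTY_NAMES = {
    0x01: "CodePage", 0x02: "PIDSI_TITLE", 0x03: "PIDSI_SUBJECT",
    0x04: "PIDSI_AUTHOR", 0x05: "PIDSI_KEYWORDS", 0x06: "PIDSI_COMMENTS",
    0x07: "PIDSI_TEMPLATE", 0x08: "PIDSI_LASTAUTHOR", 0x09: "PIDSI_REVNUMBER",
    0x0A: "PIDSI_EDITTIME", 0x0B: "PIDSI_LASTPRINTED", 0x0C: "PIDSI_CREATE_DTM",
    0x0D: "PIDSI_LASTSAVE_DTM", 0x0E: "PIDSI_PAGECOUNT", 0x0F: "PIDSI_WORDCOUNT",
    0x10: "PIDSI_CHARCOUNT", 0x12: "PIDSI_APPNAME", 0x13: "PIDSI_DOC_SECURITY",
}


def read_property_index(stream: bytes, property_set_offset: int):
    """Return (size, [(name, identifier, stream_offset), ...])."""
    # Size and NumProperties are the first two 32-bit little-endian values.
    size, num_properties = struct.unpack_from("<II", stream, property_set_offset)
    entries = []
    for i in range(num_properties):
        # Each entry is 8 bytes: PropertyIdentifier, then Offset.
        entry_pos = property_set_offset + 8 + 8 * i
        identifier, offset = struct.unpack_from("<II", stream, entry_pos)
        name = PROPERTY_NAMES.get(identifier, f"0x{identifier:08X}")
        # Offsets are measured from the start of the PropertySet.
        entries.append((name, identifier, property_set_offset + offset))
    return size, entries
```

To turn a returned offset into a file offset, add the file offset of the stream itself. For the Title property in our sample that is 10752 + 48 + 0x90 = 10944.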

Per Microsoft's documentation, each of these Property values is referred to as a TypedPropertyValue. Each property value has a specific data type, but the most common ones we will see are VT_LPSTR (string text in the local codepage) and VT_FILETIME (a 64-bit integer counting 100-nanosecond intervals since January 1, 1601 UTC, used to represent date and time values). In the case of a VT_LPSTR value, the first four bytes of the TypedPropertyValue will be 0x0000001E. The next four bytes will be an integer value that indicates the length of the VT_LPSTR in bytes, including its terminating NULL. Immediately following that will be the property value (variable length). In the case of our sample document, the Document Title property is shown highlighted in Figure 4. The first four bytes, highlighted in green, indicate VT_LPSTR data. The second four bytes indicate that the string length is 0x18 (24) bytes. The yellow-highlighted value is the 24-byte string that holds the Document Title property. This is the value that we would see if we were to open this document, select File->Properties, and look in the Title field.
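Reading a VT_LPSTR value as described above might look like the following sketch. The function name is hypothetical, and the default `cp1252` encoding is purely an assumption for illustration. A careful parser would take the encoding from the CodePage property instead.

```python
import struct

VT_LPSTR = 0x0000001E


def read_lpstr(stream: bytes, offset: int, encoding: str = "cp1252") -> str:
    """Decode the VT_LPSTR TypedPropertyValue found at the given offset."""
    # First four bytes: the type. Next four bytes: the string length in bytes.
    prop_type, length = struct.unpack_from("<II", stream, offset)
    if prop_type != VT_LPSTR:
        raise ValueError(f"not a VT_LPSTR value: 0x{prop_type:08X}")
    raw = stream[offset + 8 : offset + 8 + length]
    # Keep only the text in front of the terminating NULL.
    text = raw.split(b"\x00", 1)[0]
    return text.decode(encoding, errors="replace")
```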

For VT_FILETIME data, the value for the PropertyType will be 0x00000040. The 8 bytes immediately following the PropertyType will be the actual VT_FILETIME value. Figure 5 shows data highlighted for the document's PIDSI_CREATE_DTM property. Using WinHex's built-in FILETIME interpreter, I can quickly see the filetime values.

Figure 5: File Create Date/Time Value Highlighted
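If you do not have a hex editor with a FILETIME interpreter at hand, the conversion can be sketched in Python. The function name is made up for this example. It relies on a FILETIME being a 64-bit count of 100-nanosecond intervals since January 1, 1601 UTC.

```python
import struct
from datetime import datetime, timedelta, timezone

VT_FILETIME = 0x00000040
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def read_filetime(stream: bytes, offset: int) -> datetime:
    """Decode the VT_FILETIME TypedPropertyValue found at the given offset."""
    # Four bytes of type followed by the 8-byte little-endian FILETIME.
    prop_type, ticks = struct.unpack_from("<IQ", stream, offset)
    if prop_type != VT_FILETIME:
        raise ValueError(f"not a VT_FILETIME value: 0x{prop_type:08X}")
    # One tick is 100 nanoseconds, so ten ticks make one microsecond.
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)
```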

When parsing the content of the PropertySet, there is one item of note. It appears that values in the PropertySet that contain zero-length strings are written from a data buffer that is not zeroed out before it is copied into the file. This can be seen in the repeating "ear" value, which consists of the letters at index 1, 2, and 3 of the character array saved as the document title. What is likely happening is that Word is reusing the same buffer that it used for the Document Title property. Each of the property values below has a length of 0x01 (1) byte, which holds the terminating NULL character for the string. This single byte is likely written into the first position of a 4-byte buffer whose remaining three bytes (bytes 2-4) still contain characters from the document title. This is very likely implementation-specific and may not occur in all versions of Microsoft Word. The sample document that we are analyzing was created in Microsoft Word v.X for Macintosh. Figure 6 shows the residual values that follow the NULL terminator in the four-byte strings.

Figure 6: Repeating Values
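A parser can separate the declared string from any leftover bytes by rounding the string length up to the next multiple of four, which is how the string data is padded in the file. The sketch below uses an invented function name and works on raw bytes only.

```python
import struct


def split_lpstr_padding(stream: bytes, offset: int):
    """Return (value_bytes, residual_padding) for a VT_LPSTR value."""
    prop_type, length = struct.unpack_from("<II", stream, offset)
    # String data is padded out to a multiple of four bytes.
    padded_length = (length + 3) & ~3
    field = stream[offset + 8 : offset + 8 + padded_length]
    # Anything past the declared length is padding, zeroed or not.
    return field[:length], field[length:]
```

For a 1-byte string like the ones in Figure 6, the value is the lone NULL and the residual padding is the three bytes left over in the buffer.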

In our next post, we will get into some of the actual stream contents and how to extract data from within the streams. I will demonstrate how to extract a JPEG image that is embedded inside a Word document.


1 comment:

  1. Ian,

    Thanks for your work in this area. Not only is what you're doing important, but it's valuable even in the face of the newer Office file format. Jump Lists on Windows 7 and 8 follow the same OLE/compound document format. I wrote a parser for this in Perl, and then someone said that they found a hit on a keyword search of a Jump List file that did not appear in one of the streams. Then I realized that one of the things that needed to be done is a mapping of the sectors used, as well as mapping of unused/unallocated sectors, in order to keep track of these sorts of things.
