Tuesday, April 8, 2014

MetadataMiner - Network Graphing of Microsoft Office Document Metadata

MetadataMiner is a tool that I wrote a few years ago to allow for bulk extraction of metadata from Microsoft Office documents. The metadata in Office documents is contained in the SummaryInformation and DocumentSummaryInformation structures that we have reviewed in previous posts. The SummaryInformation structure contains the following data fields:

  • Title
  • Subject
  • Author
  • Keywords
  • Comments
  • Template
  • Last Saved By
  • Revision Number
  • Total Editing Time
  • Last Print Date
  • Creation Date
  • Last Save Date
  • Number of Pages
  • Number of Words
  • Number of Characters
  • Thumbnail
  • Creating Application
  • Security

And the DocumentSummaryInformation structure contains the following data fields:

  • Category
  • Company
  • Manager
  • Hidden character count
  • Line count
  • Note count

MetadataMiner extracts these fields from documents in bulk and records the values in a database so that they can be queried and searched. This is useful if, for example, you have a batch of documents and want to quickly determine the relationships among them.

Say, for example, that you have one document of interest and want to find all of the other documents that were authored or last saved by the same individual. MetadataMiner lets you do this quickly. You can also generate a spreadsheet of all of the metadata fields for each document and then search, sort, and filter to your heart's content. If you want to build a timeline of document activity, you can sort on the "Creation Date" field to see which documents were created when, in chronological order.
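
As a rough illustration of the kind of query described above, here is a minimal sketch using Python's sqlite3 module. The `documents` table, its column names, and the sample rows are assumptions made up for this example; they are not MetadataMiner's actual schema or output.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        "filename TEXT, author TEXT, last_saved_by TEXT, creation_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO documents VALUES (?, ?, ?, ?)",
        [
            ("budget.doc", "alice", "bob", "2013-11-02T09:15:00"),
            ("memo.doc", "bob", "bob", "2013-10-20T14:00:00"),
            ("plan.doc", "carol", "carol", "2013-12-01T08:30:00"),
        ],
    )
    return conn


def documents_by_user(conn, username):
    """Filenames authored or last saved by username, oldest first."""
    sql = (
        "SELECT filename FROM documents "
        "WHERE author = ? OR last_saved_by = ? "
        "ORDER BY creation_date"
    )
    return [row[0] for row in conn.execute(sql, (username, username))]


if __name__ == "__main__":
    print(documents_by_user(make_demo_db(), "bob"))
```

Storing the dates as ISO 8601 text means that a plain ORDER BY also yields the timeline ordering.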




One of the new features I have added is the ability to visualize relationships between documents. When a document is created, the Author field and the "Last Saved By" field are set to the same value - that is, the username of the individual who first created the document. If a document is shared with another user, and that user edits the document, then that user's username is stored in the "Last Saved By" field. Thus there is a relationship between the document's original author and the person with whom the document was shared.
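
To make the relationship concrete, here is a minimal sketch of how author-to-editor edges could be derived from the two fields and written out in Graphviz DOT format. This is not MetadataMiner's own graphing code; the function names and the sample records are hypothetical.

```python
from collections import Counter


def sharing_edges(records):
    """Count directed edges (author -> last saved by).

    records is an iterable of (author, last_saved_by) pairs. Pairs with a
    missing value, or where both fields name the same user, are skipped
    because they show no sharing relationship.
    """
    edges = Counter()
    for author, last_saved_by in records:
        if author and last_saved_by and author != last_saved_by:
            edges[(author, last_saved_by)] += 1
    return edges


def to_dot(edges):
    """Render the edge counts as a Graphviz DOT digraph."""
    lines = ["digraph sharing {"]
    for (src, dst), count in sorted(edges.items()):
        lines.append(f'  "{src}" -> "{dst}" [label="{count}"];')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [("alice", "bob"), ("alice", "bob"), ("bob", "bob"), ("carol", "")]
    print(to_dot(sharing_edges(sample)))
```

The edge label is the number of documents that share the same author and last editor, which gives a crude measure of how strong the relationship is.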

MetadataMiner is available on GitHub. It is written in Java, so you will need a Java runtime environment (I know, I know - security issues), but you don't have to enable the browser plugin. MetadataMiner is a standalone app that does not run in your browser, so you are less exposed to the security vulnerabilities that come with the Java browser plugin. To run MetadataMiner, simply download the JAR file from GitHub, or clone the project if you prefer via the following command:

git clone https://github.com/ipduffy/MetadataMiner

Then follow the instructions on GitHub to get MetadataMiner up and running and parsing your documents. I have run it on fairly large document collections with no issues, so I'm pretty confident in sharing this tool with the forensics community. I am still semi-actively developing this project, so if you have any feedback, bug reports, or feature requests, feel free to hit me up.

Thursday, March 6, 2014

Microsoft Compound Document Internals (Part 5 - Rebuilding a Corrupted Document Header)

Recently I was contacted by an individual who had been infected with the CryptorBit virus. This virus is a ransomware variant that supposedly encrypts your files and then demands payment to unlock them. The person who contacted me explained that rather than encrypting the entire contents of the file (like CryptoLocker), this particular bit of malware encrypts only the first 512 bytes of the file. This provides an interesting opportunity with respect to MS Office Compound Documents because the first 512 bytes are the Compound Document Header (described in detail at MSDN). Many of the values in the Compound Document Header are static, and those that are variable can possibly be derived from the actual file contents. Since the file contents are not affected by CryptorBit, we should, in theory, be able to scan the streams, storages, and directory entries within the Compound Document file and recover the data necessary to rebuild the Document header. Challenge accepted.
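
To show which header values are static and which must be recovered, here is a minimal sketch that assembles a 512-byte header with Python's struct module. It assumes a version 3 file (512-byte sectors), the commonly used minor version 0x003E, and a zero transaction signature. The function name `build_header` and its parameters are hypothetical: the variable values still have to be recovered from the intact part of the file, and this sketch does not do that.

```python
import struct

SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")
FREESECT = 0xFFFFFFFF          # marks an unused DIFAT slot
HEADER_DIFAT_ENTRIES = 109     # DIFAT slots that fit in the header


def build_header(num_fat_sectors, first_dir_sector, first_minifat_sector,
                 num_minifat_sectors, first_difat_sector, num_difat_sectors,
                 fat_sector_locations):
    """Assemble a version 3 compound file header from recovered values."""
    difat = list(fat_sector_locations[:HEADER_DIFAT_ENTRIES])
    difat += [FREESECT] * (HEADER_DIFAT_ENTRIES - len(difat))
    fixed = struct.pack(
        "<8s16sHHHHH6sIIIIIIIII",
        SIGNATURE,             # static: file signature
        bytes(16),             # static: header CLSID, all zeros
        0x003E,                # minor version (assumed typical value)
        0x0003,                # major version 3
        0xFFFE,                # static: little-endian byte order mark
        0x0009,                # sector shift: 2**9 = 512-byte sectors
        0x0006,                # static: mini sector shift, 64 bytes
        bytes(6),              # static: reserved
        0,                     # directory sector count, zero in version 3
        num_fat_sectors,       # variable
        first_dir_sector,      # variable
        0,                     # transaction signature (assumed zero)
        0x00001000,            # static: mini stream cutoff, 4096 bytes
        first_minifat_sector,  # variable
        num_minifat_sectors,   # variable
        first_difat_sector,    # variable
        num_difat_sectors,     # variable
    )
    return fixed + struct.pack("<109I", *difat)


if __name__ == "__main__":
    header = build_header(1, 1, 2, 1, 0xFFFFFFFE, 0, [0])
    print(len(header), header[:8].hex())
```

Only seven inputs vary in this sketch; everything else is either fixed by the format or assumed, which is why a rebuild from the undamaged sectors is plausible.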

Friday, January 31, 2014

Forensic Lunch 1/31/14





Thanks again to David Cowen and his team for the opportunity to present my work on the Compound Document File format. For those who were interested in the tools / scripts that were discussed on the show, here are some links:

Python scripts for parsing MS Compound Documents - I have not had a chance to download and test/evaluate these yet but I'm hoping I'll have some free time to do so soon.

Microsoft OffVis tool for parsing MS Compound Documents and detecting malware - direct download

Link to good article describing OffVis and what it does

MSDN documentation on the Microsoft Compound File Binary format

Wednesday, January 29, 2014

Microsoft Office Compound Document Internals (Part 4 - SummaryInformation)

In this segment of the series on Microsoft's Compound Document file format, I am going to discuss the extraction of information from the SummaryInformation data structure. You may have noticed the SummaryInformation references in the directory entries that we viewed in this previous post. The SummaryInformation structure is the internal data structure within Compound Document files that contains the metadata information - things such as the author's username, the username of the last person to have edited the document, date and time information for file creation, last save, and last print, and statistical information about the file. The SummaryInformation data structure is described here at MSDN. In this post, we will walk through our sample document and extract the document property metadata from the SummaryInformation structure.
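
As a rough preview of what that extraction involves, here is a minimal sketch of a property set parser in Python. It assumes the SummaryInformation stream has already been pulled out of the file as bytes, that it holds a single property set, and that strings use code page 1252. Only three value types are handled, and the function names are hypothetical.

```python
import struct
from datetime import datetime, timedelta, timezone

PROPERTY_NAMES = {
    2: "Title", 3: "Subject", 4: "Author", 5: "Keywords", 6: "Comments",
    7: "Template", 8: "Last Saved By", 9: "Revision Number",
    10: "Total Editing Time", 11: "Last Print Date", 12: "Creation Date",
    13: "Last Save Date", 14: "Number of Pages", 15: "Number of Words",
    16: "Number of Characters", 17: "Thumbnail", 18: "Creating Application",
    19: "Security",
}
VT_I4, VT_LPSTR, VT_FILETIME = 0x0003, 0x001E, 0x0040
EDIT_TIME_ID = 10  # stored as a FILETIME but represents a duration


def filetime_to_datetime(filetime):
    """Convert 100-nanosecond ticks since 1601-01-01 UTC to a datetime."""
    epoch = datetime(1601, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=filetime // 10)


def parse_summary_information(stream, encoding="cp1252"):
    """Return {property name: value} for the first property set."""
    (byte_order,) = struct.unpack_from("<H", stream, 0)
    if byte_order != 0xFFFE:
        raise ValueError("not a property set stream")
    # The offset of the first property set is stored at byte 44.
    (set_offset,) = struct.unpack_from("<I", stream, 44)
    _size, count = struct.unpack_from("<II", stream, set_offset)
    properties = {}
    for index in range(count):
        prop_id, rel = struct.unpack_from("<II", stream, set_offset + 8 + 8 * index)
        pos = set_offset + rel  # offsets are relative to the property set
        (vtype,) = struct.unpack_from("<H", stream, pos)
        pos += 4                # skip the 2-byte type and 2 bytes of padding
        if vtype == VT_LPSTR:
            (length,) = struct.unpack_from("<I", stream, pos)
            raw = stream[pos + 4:pos + 4 + length].split(b"\x00")[0]
            value = raw.decode(encoding, errors="replace")
        elif vtype == VT_I4:
            (value,) = struct.unpack_from("<i", stream, pos)
        elif vtype == VT_FILETIME:
            (ticks,) = struct.unpack_from("<Q", stream, pos)
            if prop_id == EDIT_TIME_ID:
                value = timedelta(microseconds=ticks // 10)
            else:
                value = filetime_to_datetime(ticks)
        else:
            continue            # other types are out of scope for this sketch
        properties[PROPERTY_NAMES.get(prop_id, prop_id)] = value
    return properties
```

A real parser would also read the code page property and handle the thumbnail and Unicode string types; those are left out here to keep the sketch short.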

Wednesday, January 8, 2014

Microsoft Office Compound Document Internals (Part 3 - FAT, DIFAT, and Mini FAT)

As I have mentioned in previous posts, the Microsoft Office Compound Document file uses an internal File Allocation Table (FAT) structure to keep track of allocated and unallocated sectors within the file. In addition to the FAT, there is also a Double Indirect File Allocation Table (DIFAT), which is used to keep track of the file sectors used by the FAT. The Compound Document file also uses a MiniFAT, which allocates storage in the Mini Stream; the Mini Stream will be the topic of another post. All of these structures are used to map the allocation status of each sector within the Compound Document file, and are used to recover sector chains - that is, sequences of sectors that contain the data for a particular stream or storage. In this post, we will discuss the FAT concept in general, and the implementation specifics of FAT within the Compound Document file.
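
The idea of a sector chain can be sketched in a few lines of Python. The sketch assumes the FAT has already been read into a list of 32-bit integers and that the file uses 512-byte sectors; the function names are hypothetical.

```python
ENDOFCHAIN = 0xFFFFFFFE  # FAT value that terminates a chain
FREESECT = 0xFFFFFFFF    # FAT value for an unallocated sector
SECTOR_SIZE = 512


def sector_offset(sector_number, sector_size=SECTOR_SIZE):
    """File offset of a sector; sector 0 begins right after the header."""
    return (sector_number + 1) * sector_size


def follow_chain(fat, start_sector):
    """Return the list of sector numbers in the chain starting at start_sector."""
    chain = []
    seen = set()
    sector = start_sector
    while sector != ENDOFCHAIN:
        # A loop or an out-of-range entry means the FAT is damaged.
        if sector >= len(fat) or sector in seen:
            raise ValueError(f"corrupt chain at sector {sector:#x}")
        seen.add(sector)
        chain.append(sector)
        sector = fat[sector]
    return chain


if __name__ == "__main__":
    fat = [1, 3, ENDOFCHAIN, 4, ENDOFCHAIN, FREESECT]
    print(follow_chain(fat, 0))
```

Each FAT entry holds the number of the next sector in the chain, so reading a stream amounts to repeated lookups until the end-of-chain marker appears.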

Monday, December 30, 2013

Microsoft Office Compound Document Internals (Part 2 - Directory Entries)

Building on my previous post about the Compound Document file format, in this post we will discuss the Compound Document Directory Entry structures. Directory Entries are structures that store information about a stream or storage within a Compound Document file. Similar to directory entries in the FAT filesystem, Compound Document Directory Entries contain information such as timestamps, stream / storage names, and starting sector information. In order to recover the contents of a stream or storage, an examiner must first locate the Directory Entry for that particular stream or storage. In this post we will discuss how to locate Directory Entries, how to analyze their structure and content, and which of their data could be useful to the forensic examiner.
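
For reference, a single 128-byte Directory Entry can be unpacked with a short Python sketch like the one below. It assumes the entry has already been located and read as bytes; the function name and the keys of the returned dictionary are hypothetical.

```python
import struct
from datetime import datetime, timedelta, timezone

OBJECT_TYPES = {0: "unallocated", 1: "storage", 2: "stream", 5: "root storage"}
ENTRY_FORMAT = "<64sHBBIII16sIQQIQ"  # 128 bytes, little-endian, no padding


def filetime_to_datetime(filetime):
    """Convert a FILETIME to a UTC datetime; zero means the field is not set."""
    if filetime == 0:
        return None
    epoch = datetime(1601, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=filetime // 10)


def parse_directory_entry(entry):
    """Decode the fields of one directory entry into a dictionary."""
    if len(entry) != 128:
        raise ValueError("a directory entry is exactly 128 bytes")
    (raw_name, name_length, object_type, _color, _left, _right, _child,
     _clsid, _state, created, modified, start_sector, size) = struct.unpack(
        ENTRY_FORMAT, entry)
    # name_length counts bytes and includes the 2-byte UTF-16 terminator.
    name = raw_name[:max(name_length - 2, 0)].decode("utf-16-le")
    return {
        "name": name,
        "type": OBJECT_TYPES.get(object_type, "unknown"),
        "created": filetime_to_datetime(created),
        "modified": filetime_to_datetime(modified),
        "start_sector": start_sector,
        # In version 3 files only the low 32 bits of the size are reliable.
        "size": size,
    }
```

The starting sector and the size are the two values an examiner needs in order to walk the sector chain and cut the stream to its true length.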

Thursday, December 19, 2013

Microsoft Office Compound Document Internals (Part 1 - Document Header)

Recently I was working on a case where we had a large collection of MS Office documents for review. There were literally thousands of them, and we had to make sense of which documents belonged to which user, and build some sort of picture of who wrote which documents, when they wrote them, and who had viewed or edited them. It occurred to me that the information that I was looking for was contained within the Office documents themselves - in the metadata structures. You see, Microsoft Office keeps track of several bits of metadata within its documents - things such as the usernames of the document author and last editor, the dates and times the document was created, last saved, and last printed, as well as a bunch of other potentially useful information. This post describes my efforts to do bulk extraction of this metadata from my massive collection of documents and present that metadata in a way that was useful.