Ticket #138 (new enhancement)

Opened 1 year ago

Last modified 1 year ago

OLE2 Parser Changes + Updates

Reported by: nneonneo <nneonneo@gmail.com>
Assigned to: haypo
Priority: normal
Milestone:
Component: parser
Keywords:
Cc:

Description

[warning: very large patch!] OK, here are the promised OLE2 changes, without the stream hack :)

I have moved a number of the smaller classes and files into a separate file, ole2_util.py, since they could be used by other parsers (e.g. RawParser, FragmentGroup, etc.) and are already used by many parsers across the OLE2 suite. I think this is a good idea, but of course the final decision is left to haypo ;)

  • MSOffice parser has been significantly expanded; it now includes parsers for the PowerPoint Document stream and the Workbook stream from Excel.
  • A new class "OLE2FragmentParser" has been created, which is used as the generic template for the parsers.
  • Fixed a bug where the first big block would be omitted from the root entry in certain cases, resulting in broken files.
  • Fixed small block seeking in RootEntry (formerly OfficeRootEntry); this allows small blocks to be parsed correctly.
  • Changed RootEntry to more closely resemble OLE2_File, as they are basically the same parser with minor changes (and of course RootEntry lacks the DIFAT & FAT tables).
  • Changed the format of PROPERTY_NAME to include a parser field too; this way, streams can easily have parsers associated with them.
  • Summary has three new property parsers: Null, Blob and WidePascalString32 (this last one needs a new name :P)
  • Large (>7MB) file support. Also, changed SECT to be a UInt32, since that's what it is according to the specification (sector numbers can be up to 4 billion in size).
  • DIFAT chaining, to support larger files.
  • Very large (>2GB) file support is untested and probably doesn't work. It would require support for range-lock sectors and (possibly) other things.
  • Fix for small root entries; by definition, a root entry lives in a Big Block even if it isn't over the threshold.
  • ...other changes which are very minor (e.g. spelling)
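
For reference, here is a rough sketch of what DIFAT chaining involves, written as standalone Python rather than against the hachoir field API. The function name collect_fat_sectors and its arguments are made up for illustration and are not taken from the patch. The offsets and special values follow the compound file layout as I understand it: 109 DIFAT entries in the header at offset 76, and the first DIFAT sector number at offset 68. With 512-byte sectors the 109 header entries cover 109 x 128 x 512 bytes, roughly 7 MB, which is why larger files need the chain.

```python
import struct

# Layout constants (assumed from the compound file format, not from the patch)
HEADER_DIFAT_OFFSET = 76       # first of the 109 DIFAT entries in the header
FIRST_DIFAT_SECT_OFFSET = 68   # sector number of the first extra DIFAT sector
FREESECT = 0xFFFFFFFF          # unused entry
ENDOFCHAIN = 0xFFFFFFFE        # end of a sector chain


def collect_fat_sectors(data, sector_size=512):
    """Return the sector numbers of all FAT sectors (hypothetical helper).

    Reads the 109 DIFAT entries stored in the header, then follows the
    chain of extra DIFAT sectors. Sector numbers are unsigned 32-bit values.
    """
    entries = list(struct.unpack_from("<109I", data, HEADER_DIFAT_OFFSET))
    (next_sect,) = struct.unpack_from("<I", data, FIRST_DIFAT_SECT_OFFSET)
    per_sector = sector_size // 4
    seen = set()
    while next_sect not in (ENDOFCHAIN, FREESECT):
        if next_sect in seen:
            raise ValueError("DIFAT chain loops back to sector %d" % next_sect)
        seen.add(next_sect)
        # Sector N starts right after the header: offset (N + 1) * sector_size
        offset = (next_sect + 1) * sector_size
        values = struct.unpack_from("<%dI" % per_sector, data, offset)
        entries.extend(values[:-1])   # all but the last entry are FAT sectors
        next_sect = values[-1]        # last entry points to the next DIFAT sector
    return [sect for sect in entries if sect != FREESECT]
```

The loop check is there because a damaged file can contain a DIFAT chain that points back at itself; without it the parser would never terminate.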

A patch and the file ole2_util.py are attached.

Attachments

ole2.diff (46.7 kB) - added by nneonneo <nneonneo@gmail.com> on 06/27/07 20:55:32.
Patch to hachoir-parser/hachoir_parser/misc/msoffice.py, msoffice_summary.py and ole2.py
ole2_util.py (3.2 kB) - added by nneonneo <nneonneo@gmail.com> on 06/27/07 20:56:22.
hachoir-parser/hachoir_parser/misc/ole2_util.py
ole2.2.diff (46.7 kB) - added by nneonneo <nneonneo@gmail.com> on 06/27/07 21:10:14.
Patch to hachoir-parser/hachoir_parser/misc/msoffice.py, msoffice_summary.py and ole2.py [updated]
ole2.3.diff (47.3 kB) - added by nneonneo <nneonneo@gmail.com> on 06/29/07 06:32:55.
More minor changes to the ole2 parser suite.

Change History

06/27/07 20:55:32 changed by nneonneo <nneonneo@gmail.com>

  • attachment ole2.diff added.

Patch to hachoir-parser/hachoir_parser/misc/msoffice.py, msoffice_summary.py and ole2.py

06/27/07 20:56:22 changed by nneonneo <nneonneo@gmail.com>

  • attachment ole2_util.py added.

hachoir-parser/hachoir_parser/misc/ole2_util.py

06/27/07 21:10:14 changed by nneonneo <nneonneo@gmail.com>

  • attachment ole2.2.diff added.

Patch to hachoir-parser/hachoir_parser/misc/msoffice.py, msoffice_summary.py and ole2.py [updated]

06/27/07 21:10:50 changed by nneonneo <nneonneo@gmail.com>

See last patch [ole2.2.diff]; first one contained a small error.

06/29/07 06:32:55 changed by nneonneo <nneonneo@gmail.com>

  • attachment ole2.3.diff added.

More minor changes to the ole2 parser suite.

06/29/07 06:36:41 changed by nneonneo <nneonneo@gmail.com>

Another new patch submitted (sorry about that); perhaps a meta-diff (diff of the diffs) is in order :)

Changes here are primarily aimed at fixing issues with the CompObj parser. It now correctly handles Macintosh CompObj sections, supports some extra fields, and computes the remaining space correctly (it now uses self.datasize, which is set by the parent parser on the stream passed to CompObj).

I would also recommend removing the OS_VERSION enum (I forgot to do so before creating the patch, sorry about that!)

Finally, I recommend that slack space be factored into each fragment/stream parser, since a lot of deleted or overwritten data lives in the slack space. For an example, see RawParser, which separates the "real" data (as marked by the datasize attribute) from the slack space in the stream.
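
To make the slack space idea concrete, here is a minimal standalone sketch. It is not the RawParser code from the patch; the helper names split_data_and_slack and slack_size are hypothetical, and it assumes the parser already has the full allocated bytes of the stream plus its declared size (the role datasize plays above).

```python
def split_data_and_slack(raw, datasize):
    """Split the allocated bytes of a stream into (data, slack).

    raw is everything stored in the blocks allocated to the stream;
    datasize is the declared stream length. Hypothetical helper.
    """
    if not 0 <= datasize <= len(raw):
        raise ValueError("datasize %d is outside the allocated %d bytes"
                         % (datasize, len(raw)))
    return raw[:datasize], raw[datasize:]


def slack_size(datasize, block_size):
    """Number of unused bytes in the last block of a stream."""
    return (-datasize) % block_size
```

For example, a 5-byte stream stored in a 64-byte small block leaves 59 bytes of slack, which may still hold data from whatever occupied the block before.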

06/29/07 06:45:24 changed by nneonneo <nneonneo@gmail.com>

Also of note: I have a file with 6 CompObj sections, and it is rather misleading to call them compobj[0] .. [5], because that implies that they belong to one stream. Perhaps they should be named compobj[0]content[0] or something like that, for clarity's sake. Also, descriptions of property contents should carry the "filename" or stream name of the property, to make identification easier.

I will work on these tomorrow if I have the time.

