This new version of pdf-parser adds support for analyzing stream objects (/ObjStm). Use the new option -O to enable this mode.
Stream objects (/ObjStm, called object streams in the PDF specification) are objects whose stream contains other objects. These contained objects cannot have a stream themselves.
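To make that layout concrete, here is a minimal Python sketch of how the decoded stream of an /ObjStm can be split into its contained objects. It is not pdf-parser's actual code. It relies on the layout from the PDF specification: the decoded data starts with /N pairs of integers (object number, byte offset), and the objects themselves start at byte offset /First. The function name parse_object_stream, the sample object contents, and the use of zlib (standing in for a /FlateDecode filter) are assumptions made for illustration.

```python
import zlib


def parse_object_stream(decoded, n, first):
    """Split decoded /ObjStm data into {object_number: raw_bytes}.

    decoded: stream data after decompression
    n:       value of /N (number of contained objects)
    first:   value of /First (byte offset of the first contained object)
    """
    header = decoded[:first].split()
    if len(header) < 2 * n:
        raise ValueError("object stream header is shorter than /N pairs")
    # Each pair is (object number, offset relative to /First).
    pairs = [(int(header[2 * i]), int(header[2 * i + 1])) for i in range(n)]
    objects = {}
    for index, (number, offset) in enumerate(pairs):
        start = first + offset
        # An object ends where the next one begins, or at the end of the data.
        end = first + pairs[index + 1][1] if index + 1 < n else len(decoded)
        objects[number] = decoded[start:end].strip()
    return objects


# Build a small, made-up object stream so the example is self-contained.
contained = {2: b"<< /Type /Catalog >>", 7: b"<< /S /JavaScript >>"}
offsets, body = [], b""
for number, raw in contained.items():
    offsets.append((number, len(body)))
    body += raw + b"\n"
header = b" ".join(b"%d %d" % pair for pair in offsets) + b"\n"

compressed = zlib.compress(header + body)  # what would sit in the PDF
decoded = zlib.decompress(compressed)      # what a parser works on
extracted = parse_object_stream(decoded, n=len(contained), first=len(header))
for number, raw in extracted.items():
    print(number, raw)
```

Note that the contained objects are stored without their obj/endobj keywords, which is why the index at the start of the stream is needed to tell them apart.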
pdfid.py detects the presence of stream objects:
But pdfid cannot look inside a stream to figure out which objects it contains. That’s why I always say to use pdf-parser to select and decompress stream objects, and then pipe the output through pdfid:
When pdf-parser parses a stream object, it does not parse the content of its stream:
This changes with this new version of pdf-parser. When option -O is used, pdf-parser extracts objects from /ObjStm streams and handles them like normal objects. In the following example, object 2 is contained in object 1:
pdf-parser provides statistics for a PDF’s content with option -a:
Combining option -a with option -O includes the objects present inside stream objects (this is an alternative to combining both tools: pdf-parser -s objstm -f a.pdf | pdfid -f):
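For those who prefer to script the two-tool alternative, the following Python sketch runs the same pipeline with subprocess. It assumes that pdf-parser and pdfid can be started under exactly those names (on your system they may be pdf-parser.py and pdfid.py). The helper names build_pipeline and run_pipeline are made up for this example, and the options are copied unchanged from the command above.

```python
import subprocess


def build_pipeline(pdf_path):
    """Return the two command lines of: pdf-parser -s objstm -f FILE | pdfid -f"""
    # Executable names are an assumption; adjust them to your installation.
    parser_cmd = ["pdf-parser", "-s", "objstm", "-f", pdf_path]
    pdfid_cmd = ["pdfid", "-f"]
    return parser_cmd, pdfid_cmd


def run_pipeline(pdf_path):
    """Pipe the output of pdf-parser into pdfid and return pdfid's report."""
    parser_cmd, pdfid_cmd = build_pipeline(pdf_path)
    parser = subprocess.Popen(parser_cmd, stdout=subprocess.PIPE)
    pdfid = subprocess.Popen(pdfid_cmd, stdin=parser.stdout,
                             stdout=subprocess.PIPE)
    # Close our copy of the pipe so pdf-parser sees a broken pipe if pdfid exits.
    parser.stdout.close()
    output, _ = pdfid.communicate()
    parser.wait()
    return output.decode(errors="replace")


# Usage (requires both tools to be installed):
#     print(run_pipeline("a.pdf"))
```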
If we forget to use option -O, object 7 is not found:
Here is a video showing this new feature: