Working with content streams

A content stream is a stream object associated with either a page or a Form XObject that describes where and how to draw images, vectors, and text.

Content streams are binary data that can be thought of as a list of instructions, each made up of zero or more operands followed by an operator. The language is stack-based and loosely modeled on PostScript, but without any programmable features: there are no variables, loops, or conditionals.

A typical example is as follows (with additional whitespace):
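The listing below is an illustrative sketch in plain Python, not pikepdf's parser. It groups the tokens of a small, made-up content stream into instructions to show the operands-then-operator ordering. The stream contents, the `OPERATORS` set, and the `group_instructions` helper are assumptions for illustration; real content streams also contain strings, arrays, and inline images that need a proper tokenizer.

```python
# A small, hypothetical content stream: save the graphics state, set a
# transformation matrix, draw an image XObject, restore the graphics state.
SAMPLE_STREAM = """
q
100 0 0 100 0 0 cm
/Image1 Do
Q
"""

# Only the operators used in this sample; the full PDF operator set is larger.
OPERATORS = {"q", "Q", "cm", "Do"}


def group_instructions(stream):
    """Group whitespace-separated tokens into (operands, operator) pairs."""
    instructions = []
    operands = []
    for token in stream.split():
        if token in OPERATORS:
            # An operator closes the current instruction.
            instructions.append((operands, token))
            operands = []
        else:
            # Anything else is an operand for the next operator.
            operands.append(token)
    return instructions


for operands, operator in group_instructions(SAMPLE_STREAM):
    print("Operands {}, operator {}".format(operands, operator))
```

Note that `q` and `Q` take no operands, while `cm` takes six numbers (a transformation matrix) and `Do` takes the name of an object to draw.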

pikepdf provides a C++ optimized content stream parser and a filter. The parser is best used for reading and interpreting content streams; the filter is best used for rewriting them.

In [1]: import pikepdf

In [2]: pdf = pikepdf.open("../tests/resources/congress.pdf")

In [3]: page = pdf.pages[0]

In [4]: for operands, operator in pikepdf.parse_content_stream(page):
   ...:     print("Operands {}, operator {}".format(operands, operator))
   ...: 

Extracting text from PDFs

If you guessed that the content streams were the place to look for text inside a PDF, you'd be correct. Unfortunately, extracting the text is fairly difficult because a content stream actually specifies a font and the glyph numbers to use. Sometimes there is a transparent 1:1 mapping between Unicode code points and glyph numbers, and a dump of the content stream will show the text. In general, you cannot rely on there being a transparent mapping; in fact, it is perfectly legal for a font to specify no Unicode mapping at all, or to use an unconventional mapping (when a PDF contains a subsetted font, for example).
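The problem can be sketched in plain Python with made-up data. The glyph codes and the `to_unicode` table below are invented for illustration, with the table standing in for a font's Unicode mapping. They do not come from a real PDF, and this is not how pikepdf handles fonts.

```python
# Glyph codes as they might appear in a text-showing operand for a
# subsetted font. These values are invented for illustration.
glyph_codes = bytes([1, 2, 3, 3, 4])

# Hypothetical glyph-to-Unicode table for that subset font. A real PDF may
# omit such a table entirely.
to_unicode = {1: "H", 2: "e", 3: "l", 4: "o"}


def decode_with_map(codes, mapping):
    """Translate glyph codes to text, marking unmapped glyphs with U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


def decode_naively(codes):
    """Pretend the glyph codes are Latin-1 character codes."""
    return codes.decode("latin-1")


print(repr(decode_with_map(glyph_codes, to_unicode)))  # readable text
print(repr(decode_naively(glyph_codes)))               # control characters
print(repr(decode_with_map(glyph_codes, {})))          # nothing recoverable
```

Only the first call yields readable text. Without the font's mapping, the same bytes are meaningless.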

We strongly recommend against trying to scrape text from the content stream.

pikepdf does not currently implement text extraction. We recommend pdfminer.six, a read-only text extraction tool. If you wish to write PDFs containing text, consider reportlab.