Tutorial

This brief tutorial should give you an introduction and orientation to pikepdf’s paradigm and syntax. From there, we refer you to various topics.

Opening and saving PDFs

In contrast to better-known PDF libraries, pikepdf uses a single object to represent a PDF, whether reading, writing or merging. We have cleverly named this pikepdf.Pdf. In this documentation, a Pdf is a class that allows you to manipulate the PDF, meaning the file.

from pikepdf import Pdf
new_pdf = Pdf.new()
with Pdf.open('sample.pdf') as pdf:
    pdf.save('output.pdf')

You may of course use from pikepdf import Pdf as ... if the short class name conflicts or from pikepdf import Pdf as PDF if you prefer uppercase.

pikepdf.open() is a shorthand for pikepdf.Pdf.open().

The Pdf class API follows the example of the widely-used Pillow image library. For clarity, there is no default constructor, since the arguments used for creation and opening are different. Pdf.open() also accepts seekable streams as input, and Pdf.save() accepts streams as output.

Inspecting pages

Manipulating pages is fundamental to PDFs. pikepdf presents the pages in a PDF through the pikepdf.Pdf.pages property, which follows the list protocol. As such, page numbers begin at 0.

Let’s open a simple PDF that contains four pages.

In [1]: from pikepdf import Pdf

In [2]: pdf = Pdf.open('../tests/resources/fourpages.pdf')

How many pages?

In [3]: len(pdf.pages)
Out[3]: 4

pikepdf integrates with IPython and Jupyter’s rich object APIs so that you can view PDFs, PDF pages, or images within a PDF in an IPython window or Jupyter notebook. This makes it easier to test visual changes.

In [4]: pdf
Out[4]: « In Jupyter you would see the PDF here »

In [5]: pdf.pages[0]
Out[5]: « In Jupyter you would see an image of the PDF page here »

You can also examine individual pages, which we’ll explore in the next section. Suffice it to say that you can access pages by indexing and slicing them.

In [6]: pdf.pages[0]
Out[6]: « In Jupyter you would see an image of the PDF page here »

Note

pikepdf.Pdf.open() can open almost all types of encrypted PDF! Just provide the password= keyword argument.

For more details on document assembly, see PDF split, merge and document assembly.

Pages are dictionaries

In PDFs, the main data structure is the dictionary, a key-value data structure much like a Python dict or attrdict. The major difference is that the keys can only be names, and the values can only be PDF types, including other dictionaries.

PDF dictionaries are represented as pikepdf.Dictionary, and names are of type pikepdf.Name. A page is just a dictionary with a few required fields and a reference from the document’s “page tree”. (pikepdf manages the page tree for you.)

In [7]: from pikepdf import Pdf

In [8]: example = Pdf.open('../tests/resources/congress.pdf')

In [9]: page1 = example.pages[0]

repr() output

Let’s examine the page’s repr() output:

In [10]: page1

The angle brackets in the output indicate that this object cannot be constructed with a Python expression because it contains a reference. When angle brackets are omitted from the repr() of a pikepdf object, then the object can be replicated with a Python expression, such as eval(repr(x)) == x. Pages typically contain indirect references to themselves and other pages, so they cannot be represented as an expression.

In Jupyter and IPython, pikepdf will instead attempt to display a preview of the PDF page, assuming a PDF rendering backend is available.

Item and attribute notation

Dictionary keys may be looked up using attributes (page1.MediaBox) or keys (page1['/MediaBox']).

In [11]: page1.MediaBox      # preferred notation for required names

In [12]: page1['/MediaBox']  # also works

By convention, pikepdf uses attribute notation for standard names, and item notation for names that are set by PDF developers. For example, the images belonging to a page always appear at page.Resources.XObject, but the names of the images are set by the PDF creator:

In [13]: page1.Resources.XObject['/Im0']

Item notation here would be quite cumbersome: page1['/Resources']['/XObject']['/Im0'] (not recommended).

Attribute notation is convenient, but not robust if elements are missing. For elements that are not always present, you can use .get(), which behaves like dict.get() in core Python. A library such as glom might help when working with complex structured data that is not always present.

(For now, we’ll set aside what a page’s MediaBox and Resources.XObject are for. See Working with pages for details.)

Deleting pages

Removing pages is easy too.

In [14]: del pdf.pages[1:3]  # Remove pages 2-3 labeled "second page" and "third page"

In [15]: len(pdf.pages)
Out[15]: 2

Saving changes

Naturally, you can save your changes with pikepdf.Pdf.save(). filename can be a pathlib.Path, which we accept everywhere. (Saving is commented out to avoid upsetting the documentation generator.)

In [16]: pdf.save('output.pdf')

You may save a file multiple times, and you may continue modifying it after saving.

To save an encrypted (password protected) PDF, use a pikepdf.Encryption object to specify the encryption settings. By default, pikepdf selects the strongest security handler and algorithm (AES-256), but allows full access to modify file contents. A pikepdf.Permissions object can be used to specify restrictions.

In [17]: import pikepdf

In [18]: no_extracting = pikepdf.Permissions(extract=False)

In [19]: pdf.save('encrypted.pdf', encryption=pikepdf.Encryption(
   ....:      user="user password", owner="owner password", allow=no_extracting
   ....: ))

Next steps

Have a look at pikepdf topics that interest you, or jump to our detailed API reference…