Manipulating pages

pikepdf presents the pages in a PDF through the pikepdf.Pdf.pages property, which follows the list protocol. As such, page numbers begin at 0.

Since one of the most common things people want to do is split and merge PDF pages, we’ll begin by exploring that.

Let’s look at a simple PDF that contains four pages.

In [1]: from pikepdf import Pdf

In [2]: pdf = Pdf.open('../tests/resources/fourpages.pdf')

How many pages?

In [3]: len(pdf.pages)

pikepdf integrates with IPython and Jupyter’s rich object APIs so that you can view PDFs, PDF pages, or images within a PDF in an IPython window or Jupyter notebook. This makes it easier to test visual changes.

In [4]: pdf
Out[4]: « In Jupyter you would see the PDF here »

You can also examine individual pages, which we’ll explore in the next section. Suffice it to say that you can access pages by indexing and slicing pdf.pages, including with negative indexes.

In [5]: pdf.pages[-1].MediaBox

Reversing the order of pages

Suppose the file was scanned backwards, a common problem with automatic document scanners. We can easily reverse the page order in place.

In [6]: pdf.pages.reverse()
In [7]: pdf

Pretty nice, isn’t it? But the pages in this file were already in the correct order, so let’s put them back.

In [8]: pdf.pages.reverse()

Deleting pages

Removing and adding pages is easy too.

In [9]: del pdf.pages[1:3]  # Remove pages 2-3 labeled "second page" and "third page"
In [10]: pdf

We’ve trimmed down the file to its essential first and last page.

Copying pages from other PDFs

Now, let’s add some content from another file. Because pdf.pages behaves like a list, we can use pages.extend() on another file’s pages.

In [11]: pdf = Pdf.open('../tests/resources/fourpages.pdf')

In [12]: appendix = Pdf.open('../tests/resources/sandwich.pdf')

In [13]: pdf.pages.extend(appendix.pages)

We can use pages.insert() to insert a page at a specific position, pushing every page after it back by one.

In [14]: graph = Pdf.open('../tests/resources/graph.pdf')

In [15]: pdf.pages.insert(1, graph.pages[0])

In [16]: len(pdf.pages)

We can also replace specific pages with assignment (or slicing).

In [17]: congress = Pdf.open('../tests/resources/congress.pdf')

In [18]: pdf.pages[2] = congress.pages[0]

Note

Some interactive PDF features such as hyperlinks internal to the document may stop working when a page is copied from one file to another.

Copying pages within a PDF

When a page is copied (assigned) to a different position within the same PDF, the copy is constructed as a new page rather than a reference to the existing one. This is different from standard Python behavior.

For a detailed explanation and workarounds, see Copying and updating pages.

Saving changes

Naturally, you can save your changes with pikepdf.Pdf.save(). The filename argument can be a pathlib.Path, which pikepdf accepts everywhere a filename is expected. (Saving is commented out to avoid upsetting the documentation generator.)

In [19]: pdf.save('output.pdf')

You may save a file multiple times, and you may continue modifying it after saving.

Saving with encryption

To save an encrypted (password-protected) PDF, use a pikepdf.Encryption object to specify the encryption settings. By default, pikepdf selects the strongest security handler and algorithm, but allows full access to modify file contents. A pikepdf.Permissions object can be used to specify restrictions.

In [20]: import pikepdf  # needed here because only Pdf was imported earlier
   ....: no_extracting = pikepdf.Permissions(extract=False)
   ....: 

In [21]: pdf.save('output.pdf', encryption=pikepdf.Encryption(
   ....:      user="user password", owner="owner password", allow=no_extracting
   ....: ))
   ....: 

Splitting a PDF into one-page PDFs

All we need is a new PDF to hold the destination page.

In [22]: pdf = Pdf.open('../tests/resources/fourpages.pdf')

In [23]: for n, page in enumerate(pdf.pages):
   ....:     dst = Pdf.new()
   ....:     dst.pages.append(page)
   ....:     dst.save('{:02d}.pdf'.format(n))
   ....: 

Note

This example will transfer data associated with each page, so that every page stands on its own. It will not transfer some metadata associated with the PDF as a whole, such as the list of bookmarks.

Merging a PDF from several files

You might be able to guess: create a new Pdf, then extend its pages with the pages of each source file.

In [24]: from glob import glob

In [25]: pdf = Pdf.new()

In [26]: for file in sorted(glob('*.pdf')):  # sort, since glob order is arbitrary
   ....:     src = Pdf.open(file)
   ....:     pdf.pages.extend(src.pages)
   ....: 

In [27]: pdf.save('merged.pdf')

Note

This code sample does not deduplicate objects. The resulting file may be large if the source files have content in common.

Using counting numbers

Because PDF pages are usually numbered in counting numbers (1, 2, 3…), pikepdf provides a convenience accessor .p() that uses counting numbers:

In [28]: pdf.pages.p(1)        # The first page in the document

In [29]: pdf.pages[0]          # Also the first page in the document

In [30]: pdf.pages.remove(p=1)   # Remove first page in the document

To avoid confusion, the .p() accessor does not accept Python slices, and .p(0) raises an exception. It is also not possible to delete a page through .p(); use .remove(p=...) instead, as in the last line above.

PDFs may define their own numbering scheme or different numberings for different sections, such as using Roman numerals for an introductory section. .pages does not look up this information.

Note

Because of technical limitations in underlying libraries, pikepdf keeps the source PDF open when content is copied from it to another PDF, even when all Python variables pointing to the source are removed. If a PDF is being assembled from many sources, then all of those sources are held open in memory. This memory can be released by saving and re-opening the PDF.

Warning

It’s possible to obtain page information through the PDF /Root object as well, but this is not recommended. The internal consistency of the various /Page and /Pages objects is not guaranteed when accessed in this manner, and in some PDFs the data structure for these is fairly complex. Use the .pages interface.