Working with images

PDFs embed images as binary stream objects within the PDF’s data stream. The stream object’s dictionary describes properties of the image such as its dimensions and color space. The same image may be drawn multiple times on multiple pages, at different scales and positions.

In some cases, such as JPEG2000, the standard file format of the image is used verbatim, even though the file format contains headers and information that are repeated in the stream dictionary. In other cases, such as PNG-style encoding, the image file format is not used directly.
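As a rough illustration of that distinction, the sketch below maps the standard PDF stream filter names to whether the stream bytes are generally a complete image file on their own. The table and the helper `is_standalone_file` are hypothetical names made up for this example; they are not part of pikepdf and do not describe its internal logic.

```python
# Illustrative lookup only: (description, stream is a complete image file?)
FILTER_NOTES = {
    "/DCTDecode": ("JPEG", True),              # stream is a JPEG file
    "/JPXDecode": ("JPEG 2000", True),         # stream is a JPEG 2000 file
    "/CCITTFaxDecode": ("CCITT fax data", False),       # raw fax data, no TIFF header
    "/JBIG2Decode": ("JBIG2 data", False),              # embedded form, no file header
    "/FlateDecode": ("zlib-compressed samples", False), # PNG-style, but not a PNG file
}


def is_standalone_file(filter_name):
    """Return True if a stream with this filter is already a complete image file."""
    _description, standalone = FILTER_NOTES.get(filter_name, ("unknown", False))
    return standalone


print(is_standalone_file("/JPXDecode"))    # True
print(is_standalone_file("/FlateDecode"))  # False
```

For the filters marked False, an extractor has to build a file wrapper (for example a PNG or TIFF header) around the decoded or raw data.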

pikepdf currently has no facility to embed new images into PDFs. We recommend img2pdf for that task, because it does the job so well. What pikepdf offers is image inspection and, where possible, lossless, transcode-free extraction (“pdf2img”).

Playing with images

pikepdf provides a helper class PdfImage for manipulating images in a PDF. The helper class helps manage the complexity of the image dictionaries.

In [1]: from pikepdf import Pdf, PdfImage, Name

In [2]: example = Pdf.open('../tests/resources/congress.pdf')

In [3]: page1 = example.pages[0]

In [4]: list(page1.images.keys())

In [5]: rawimage = page1.images['/Im0']  # The raw object/dictionary

In [6]: pdfimage = PdfImage(rawimage)

In [7]: pdfimage

In Jupyter (or IPython with a suitable backend) the image will be displayed.

You can also inspect the properties of the image. The parameters are similar to Pillow’s.

In [8]: pdfimage.colorspace

In [9]: pdfimage.width, pdfimage.height

Note

.width and .height are the resolution of the image in pixels, not the size of the image in page coordinates. The size of the image in page coordinates is determined by the content stream.
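To make the distinction concrete: PDF paints an image into a unit square that is transformed by the matrix in effect when the image is drawn (set in the content stream with the `cm` operator). The sketch below assumes you have already obtained that six-number matrix by some other means, and that the page uses the default user space unit of 1/72 inch. The helper names `drawn_size_points` and `effective_dpi` are hypothetical and not part of pikepdf.

```python
import math


def drawn_size_points(cm):
    """Return (width, height) in points of an image drawn under matrix cm.

    cm is [a, b, c, d, e, f]; the image occupies the unit square, so the
    lengths of the transformed unit vectors give its drawn size.
    """
    a, b, c, d, _e, _f = cm
    return math.hypot(a, b), math.hypot(c, d)


def effective_dpi(pixel_width, pixel_height, cm):
    """Return (horizontal, vertical) pixels per inch, assuming 72 points per inch."""
    width_pt, height_pt = drawn_size_points(cm)
    return pixel_width / (width_pt / 72.0), pixel_height / (height_pt / 72.0)


# Hypothetical case: a 1700x2200 pixel scan drawn to fill a 612x792 pt page
matrix = [612, 0, 0, 792, 0, 0]
print(drawn_size_points(matrix))          # (612.0, 792.0)
print(effective_dpi(1700, 2200, matrix))  # (200.0, 200.0)
```

The same image drawn under a different matrix would report the same `.width` and `.height` but a different effective resolution.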

Extracting images

Extracting images is straightforward. extract_to() will extract images to a specified file prefix. The extension is determined while extracting and appended to the filename. Where possible, extract_to writes compressed data directly to the stream without transcoding.

In [10]: pdfimage.extract_to(fileprefix='image')
Out[10]: 'image.jpg'

It is also possible to extract to a writable Python stream using .extract_to(stream=...).

You can also retrieve the image as a Pillow image:

In [11]: pdfimage.as_pil_image()

Another way to view the image is using Pillow’s Image.show() method.

Not all images can be extracted. In addition, some PDFs pair an image with a mask to produce transparency effects. pikepdf can only extract the images themselves, not rasterize them exactly as they appear in a PDF viewer. In the vast majority of cases, however, the image can be extracted as it appears.

Note

This simple example PDF displays a single full page image. Some PDF creators will paint a page using multiple images, and features such as layers, transparency and image masks. Accessing the first image on a page is like an HTML parser that scans for the first <img src=""> tag it finds. A lot more could be happening. There can be multiple images drawn multiple times on a page, vector art, overdrawing, masking, and transparency. A set of resources can be grouped together in a “Form XObject” (not to be confused with a PDF Form), and drawn all at once. Images can be referenced by multiple pages.
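Because an image can be shared between pages, a script that walks every page may meet the same image object more than once. The following is a minimal sketch of one way to handle that, assuming pikepdf is installed and that `page.images` behaves like a mapping, as in the session above. The function name `collect_images` is hypothetical. It does not look inside Form XObjects, and any extraction error simply propagates to the caller.

```python
import io


def collect_images(pdf_path):
    """Return a list of (extension, data) pairs, one per distinct image object."""
    # Imported here so that this sketch can be loaded without pikepdf installed.
    import pikepdf

    seen = set()
    results = []
    with pikepdf.Pdf.open(pdf_path) as pdf:
        for page in pdf.pages:
            for rawimage in page.images.values():
                # An indirect object is identified by (object number, generation),
                # so a shared image is only extracted the first time it is seen.
                if rawimage.objgen in seen:
                    continue
                seen.add(rawimage.objgen)
                buffer = io.BytesIO()
                extension = pikepdf.PdfImage(rawimage).extract_to(stream=buffer)
                results.append((extension, buffer.getvalue()))
    return results
```

Calling `collect_images('../tests/resources/congress.pdf')` on the example file used in this section would be expected to return a single entry, since that PDF displays one full page image.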

Replacing an image

In this example we extract an image and replace it with a grayscale equivalent.

In [12]: import zlib

In [13]: rawimage = pdfimage.obj

In [14]: pillowimage = pdfimage.as_pil_image()

In [15]: grayscale = pillowimage.convert('L')

In [16]: grayscale = grayscale.resize((32, 32))

In [17]: rawimage.write(zlib.compress(grayscale.tobytes()), filter=Name("/FlateDecode"))

In [18]: rawimage.ColorSpace = Name("/DeviceGray")

In [19]: rawimage.Width, rawimage.Height = 32, 32

Notes on this example:

  • It is generally possible to use zlib.compress() to generate compressed image data, although this is not as efficient as using a program that knows it is preparing a PDF.
  • In general we can resize an image to any scale. The PDF content stream specifies where to draw an image and at what scale.
  • This example would replace all occurrences of the image if it were used multiple times in a PDF.
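When replacing image data this way, the uncompressed data must match what the image dictionary declares. The sketch below uses only the standard library to check that relationship: each row of samples is padded to a whole number of bytes, so the expected length follows from the width, height, number of color components, and bits per component. The helper `expected_stream_length` is a hypothetical name, and the gradient bytes stand in for `grayscale.tobytes()`.

```python
import zlib


def expected_stream_length(width, height, components, bits_per_component):
    """Return the uncompressed size in bytes of image data with these parameters."""
    bits_per_row = width * components * bits_per_component
    bytes_per_row = (bits_per_row + 7) // 8  # rows are padded to a byte boundary
    return bytes_per_row * height


width, height = 32, 32
# Stand-in for grayscale.tobytes(): one byte per pixel, a horizontal gradient
raw = bytes((x * 255) // (width - 1) for _y in range(height) for x in range(width))

# 8-bit /DeviceGray has one component, so 32 x 32 pixels take 1024 bytes
assert len(raw) == expected_stream_length(width, height, 1, 8)

compressed = zlib.compress(raw)
# /FlateDecode reverses exactly this compression
assert zlib.decompress(compressed) == raw
print(len(raw))  # 1024
```

If the lengths disagree, for example because the Pillow image was not converted to mode 'L' before calling tobytes(), a viewer is likely to show a corrupted image or none at all.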