Realize all kinds of operations on PDF with Python

2025-07-02 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

This article shows how to use Python to perform a variety of operations on PDF files. The methods introduced here are simple, fast, and practical.

The Portable Document Format, or PDF, is a file format for presenting and exchanging documents reliably across operating systems. Although PDF was originally invented by Adobe, it is now an open standard maintained by the International Organization for Standardization (ISO). You can work with pre-existing PDFs in Python by using the PyPDF2 package.

PyPDF2 is a pure Python package that can be used for many different types of PDF operations.

This article will show you how to do the following:

Extract document information from a PDF in Python

Rotate pages

Merge PDFs

Split PDFs

Add a watermark

Encrypt a PDF

History of pyPdf, PyPDF2 and PyPDF4

The original pyPdf package was released in 2005. The last official release of pyPdf was in 2010. About a year later, a company called Phasit sponsored a fork of pyPdf called PyPDF2. The code was written to be backward compatible with the original, and it worked well for several years. Its last release was in 2016.

There was a brief series of releases of a package called PyPDF3, and then the project was renamed PyPDF4. All of these projects do much the same thing, but the biggest difference between pyPdf and PyPDF2 and later is that the later versions added Python 3 support. There is a separate Python 3 fork of the original pyPdf, but it has not been maintained for years.

Although PyPDF2 was recently abandoned, the new PyPDF4 is not fully backward compatible with PyPDF2. Most of the examples in this article work fine with PyPDF4, but some don't, which is why PyPDF4 is not featured more heavily in this article. Feel free to replace the PyPDF2 imports with PyPDF4 and see how it works.

pdfrw: an alternative PDF operation package

Patrick Maupin created a package called pdfrw, which does much of the same work as PyPDF2. Except for the special case of encryption, pdfrw can perform all of the PyPDF2 operations covered later in this article.

The biggest difference is that pdfrw integrates with the ReportLab package, so you can take some or all of a pre-existing PDF and use it to build a new PDF with ReportLab.

Installation of PyPDF2

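You can install PyPDF2 with pip, or with conda if you use Anaconda instead of regular Python. Here is how to install PyPDF2 using pip:

    $ pip install pypdf2

Because PyPDF2 does not have any dependencies, the installation is very fast. Note that the examples in this article use the PdfFileReader and PdfFileWriter classes, which were removed in PyPDF2 3.0, so they need an earlier release (for example, pip install "pypdf2<3.0").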

How to extract PDF document information in Python

We can use PyPDF2 to extract metadata and some text from a PDF, which is especially useful when automating work on pre-existing PDF files.

The following are the types of data that can currently be extracted:

Author

Creator

Producer

Subject

Title

Number of pages

You can use any PDF file on your computer to try this out. The following code shows how to access these properties:

    from PyPDF2 import PdfFileReader

    def extract_information(pdf_path):
        with open(pdf_path, 'rb') as f:
            pdf = PdfFileReader(f)
            information = pdf.getDocumentInfo()
            number_of_pages = pdf.getNumPages()

        txt = f"""
        Information about {pdf_path}:

        Author: {information.author}
        Creator: {information.creator}
        Producer: {information.producer}
        Subject: {information.subject}
        Title: {information.title}
        Number of pages: {number_of_pages}
        """

        print(txt)
        return information

    if __name__ == '__main__':
        path = 'xxxx.pdf'
        extract_information(path)

First import PdfFileReader from the PyPDF2 package. PdfFileReader is a class with several methods for interacting with PDF files. In this example, we call .getDocumentInfo(), which returns an instance of DocumentInformation that contains most of the information we are interested in. We also call .getNumPages() on the reader object to get the number of pages in the document.

The variable information has multiple instance properties that you can use to get the rest of the required metadata from the document. We can print out this information and return it for future use.
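
If the metadata is needed for later processing rather than just printing, one option is to copy it into a plain dictionary. The following is a minimal sketch of that idea. metadata_to_dict and FIELDS are hypothetical names, not part of PyPDF2, and the helper only assumes that the object it receives exposes the attributes used in the example above. A stand-in object is used so the sketch runs without a PDF file.

```python
from types import SimpleNamespace

# Attribute names read from the DocumentInformation object in the example above.
FIELDS = ("author", "creator", "producer", "subject", "title")


def metadata_to_dict(information, number_of_pages):
    """Copy the metadata attributes into a plain dict (hypothetical helper).

    Attributes that are missing on `information` are stored as None.
    """
    result = {name: getattr(information, name, None) for name in FIELDS}
    result["number_of_pages"] = number_of_pages
    return result


# Stand-in for the value returned by pdf.getDocumentInfo(), so that the
# sketch does not depend on a real PDF file.
sample = SimpleNamespace(author="Jane Doe", title="Sample", producer=None)
print(metadata_to_dict(sample, 3))
```

A dictionary like this can then be written to JSON or CSV when many files are processed in a batch.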

Although PyPDF2 has .extractText(), which can be used on its page objects (not shown in this example), it doesn't work very well. Some PDFs will return text, and some will return an empty string. If you want to extract text from a PDF, take a look at the PDFMiner project instead. PDFMiner is more robust and is designed specifically for extracting text from PDFs.
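
One way to find out whether a given file is affected is to check which pages come back empty. The sketch below assumes only that each page object has an .extractText() method, as PyPDF2 page objects do. pages_without_text is a hypothetical helper name.

```python
def pages_without_text(pages):
    """Return the zero-based indexes of pages whose extracted text is empty.

    Each item in `pages` only needs an .extractText() method, like the
    page objects returned by PdfFileReader.getPage().
    """
    empty = []
    for index, page in enumerate(pages):
        text = page.extractText() or ""
        if not text.strip():
            empty.append(index)
    return empty
```

With a reader object named pdf, the pages could be collected with [pdf.getPage(i) for i in range(pdf.getNumPages())]. A non-empty result suggests that the file is a candidate for PDFMiner.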

How do I rotate the page?

Sometimes a PDF arrives with pages in landscape orientation instead of portrait, or even upside down. This is likely to happen when someone scans a document to PDF or email. We could print out the document and read the paper version, or we can use the power of Python to rotate the pages in question.

Let's take a look at how to use PyPDF2 to rotate some pages of a document:

    from PyPDF2 import PdfFileReader, PdfFileWriter

    def rotate_pages(pdf_path):
        pdf_writer = PdfFileWriter()
        pdf_reader = PdfFileReader(pdf_path)
        # Rotate page 1 by 90 degrees clockwise
        page_1 = pdf_reader.getPage(0).rotateClockwise(90)
        pdf_writer.addPage(page_1)
        # Rotate page 2 by 90 degrees counterclockwise
        page_2 = pdf_reader.getPage(1).rotateCounterClockwise(90)
        pdf_writer.addPage(page_2)
        # Add page 3 in its normal orientation
        pdf_writer.addPage(pdf_reader.getPage(2))

        with open('rotate_pages.pdf', 'wb') as fh:
            pdf_writer.write(fh)

    if __name__ == '__main__':
        path = 'new path.pdf'
        rotate_pages(path)

In addition to PdfFileReader, PdfFileWriter is imported because we need to write a new PDF. rotate_pages() takes the path of the PDF to be modified. Inside the function, we create a writer object named pdf_writer and a reader object named pdf_reader.

Next, we use .getPage() to get the page we want. Here we fetch page 0, the first page, call the page object's .rotateClockwise() method, and pass in 90. Then, for the second page, we call .rotateCounterClockwise() and again pass in 90.

After each call to a rotation method, .addPage() is called, which adds the rotated version of the page to the writer object. The last page added is page 3, which is not rotated. Finally, all the pages are written to the new PDF using .write().
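
The same steps can be generalized so that the pages to rotate are passed in rather than hard-coded. This is only a sketch. normalize_rotation and rotate_selected are hypothetical helpers, and they assume reader, page, and writer objects with the methods used above (.getNumPages(), .getPage(), .rotateClockwise(), .addPage()).

```python
def normalize_rotation(degrees):
    """Map any multiple of 90 to the equivalent clockwise angle (0, 90, 180 or 270)."""
    if degrees % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    return degrees % 360


def rotate_selected(pdf_reader, pdf_writer, rotations):
    """Copy every page to the writer, rotating the pages listed in `rotations`.

    `rotations` maps a zero-based page index to clockwise degrees;
    negative values mean counterclockwise.
    """
    for index in range(pdf_reader.getNumPages()):
        page = pdf_reader.getPage(index)
        angle = normalize_rotation(rotations.get(index, 0))
        if angle:
            page.rotateClockwise(angle)
        pdf_writer.addPage(page)
```

For example, rotate_selected(pdf_reader, pdf_writer, {0: 90, 1: -90}) would reproduce the rotations from rotate_pages() while copying any remaining pages unchanged. The writer still has to be saved with .write() afterwards.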

How to merge PDF?

In many cases, we want to merge two or more PDFs into a single PDF. For example, you might have a standard cover page that needs to be added to many types of reports. Python can help with this kind of work.

The following code merges PDFs:

    from PyPDF2 import PdfFileReader, PdfFileWriter

    def merge_pdfs(paths, output):
        pdf_writer = PdfFileWriter()

        for path in paths:
            pdf_reader = PdfFileReader(path)
            for page in range(pdf_reader.getNumPages()):
                # Add each page to the writer object
                pdf_writer.addPage(pdf_reader.getPage(page))

        # Write out the merged PDF
        with open(output, 'wb') as out:
            pdf_writer.write(out)

    if __name__ == '__main__':
        paths = ['document1.pdf', 'document2.pdf']
        merge_pdfs(paths, output='merged.pdf')

If you have a list of PDFs to merge together, you can pass it directly to merge_pdfs(). The function takes a list of input paths and an output path as parameters.

First, iterate over the input paths and create a PDF reader object for each one. Then iterate over all the pages in that PDF file and add them to the writer object using .addPage(). Once all pages of all PDFs in the list have been added, the result is written out at the end.

If you don't want to merge every page of each PDF, you can enhance the script slightly by adding a range of pages to include. For more of a challenge, you can also use Python's argparse module to create a command line interface for this function.
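
Both enhancements could look roughly like the following. This is a sketch under stated assumptions, not a definitive implementation. The pages parameter, the --pages and --output options, and the helpers parse_page_range, build_parser, and main are all hypothetical. The same page range is applied to every input file, and indexes past the end of a shorter file are skipped. PyPDF2 is imported inside merge_pdfs() so that the argument parsing can be tried without it.

```python
import argparse


def parse_page_range(spec):
    """Turn a 1-based, inclusive spec such as "2-4" or "3" into a range of 0-based indexes."""
    start, _, end = spec.partition("-")
    first = int(start)
    last = int(end) if end else first
    if first < 1 or last < first:
        raise ValueError(f"invalid page range: {spec!r}")
    return range(first - 1, last)


def merge_pdfs(paths, output, pages=None):
    """Merge the PDFs in `paths`; if `pages` is given, copy only those indexes from each input."""
    # Imported lazily; this is the same PyPDF2 API used elsewhere in this article.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    pdf_writer = PdfFileWriter()
    for path in paths:
        pdf_reader = PdfFileReader(path)
        total = pdf_reader.getNumPages()
        for page in (pages if pages is not None else range(total)):
            if page < total:  # skip indexes past the end of a shorter document
                pdf_writer.addPage(pdf_reader.getPage(page))

    with open(output, "wb") as out:
        pdf_writer.write(out)


def build_parser():
    """Build the command line interface (option names are illustrative)."""
    parser = argparse.ArgumentParser(description="Merge PDF files")
    parser.add_argument("paths", nargs="+", help="input PDF files, in merge order")
    parser.add_argument("-o", "--output", default="merged.pdf", help="output file")
    parser.add_argument("--pages", type=parse_page_range, default=None,
                        help='1-based inclusive range taken from each input, e.g. "1-3"')
    return parser


def main(argv=None):
    """Parse the arguments and run the merge."""
    args = build_parser().parse_args(argv)
    merge_pdfs(args.paths, args.output, pages=args.pages)
```

Calling main(['document1.pdf', 'document2.pdf', '--pages', '1-2', '-o', 'merged.pdf']) would merge the first two pages of each document. In a script, main() would typically be called from an if __name__ == '__main__': block.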

How to split PDF?

Sometimes it may be necessary to split a PDF into multiple PDFs, especially when it contains a lot of scanned content. Here is how to split a PDF into multiple files using PyPDF2:

    from PyPDF2 import PdfFileReader, PdfFileWriter

    def split(path, name_of_split):
        pdf = PdfFileReader(path)
        for page in range(pdf.getNumPages()):
            # One new writer, and one output file, per page
            pdf_writer = PdfFileWriter()
            pdf_writer.addPage(pdf.getPage(page))

            output = f'{name_of_split}{page}.pdf'
            with open(output, 'wb') as output_pdf:
                pdf_writer.write(output_pdf)

    if __name__ == '__main__':
        path = 'xxx.pdf'
        split(path, 'jupyter_page')

In this function, a PDF reader object is created again and its pages are traversed. For each page in the PDF, a new PDF writer instance is created and that single page is added to it. The page is then written to a uniquely named file. After the script runs, each page of the original PDF has been split into a separate PDF.
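
A variation on this is to split the document into groups of several pages instead of single pages. The sketch below shows one possible way. chunk_ranges and split_in_chunks are hypothetical helpers, and PyPDF2 is imported lazily inside the function that needs it.

```python
def chunk_ranges(number_of_pages, chunk_size):
    """Return ranges of page indexes, each covering at most `chunk_size` pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [range(start, min(start + chunk_size, number_of_pages))
            for start in range(0, number_of_pages, chunk_size)]


def split_in_chunks(path, name_of_split, chunk_size):
    """Write one output PDF per group of `chunk_size` consecutive pages."""
    # Imported lazily; this is the same PyPDF2 API used elsewhere in this article.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    pdf = PdfFileReader(path)
    for number, pages in enumerate(chunk_ranges(pdf.getNumPages(), chunk_size)):
        pdf_writer = PdfFileWriter()
        for page in pages:
            pdf_writer.addPage(pdf.getPage(page))

        with open(f"{name_of_split}{number}.pdf", "wb") as output_pdf:
            pdf_writer.write(output_pdf)
```

For a seven-page document and a chunk size of 3, chunk_ranges() yields the page groups 0 to 2, 3 to 5, and 6, so three files would be written.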

How do I add a watermark?

Watermarks are identifying images or patterns on printed or digital documents, and some watermarks can only be seen under special lighting conditions. Watermarking matters because it helps protect your intellectual property, such as images or PDFs.

We can use Python and PyPDF2 to add a watermark to a document. To do so, you need a PDF that contains only the watermark image or text. Here is how to add a watermark to a PDF:

    from PyPDF2 import PdfFileWriter, PdfFileReader

    def create_watermark(input_pdf, output, watermark):
        watermark_obj = PdfFileReader(watermark)
        watermark_page = watermark_obj.getPage(0)

        pdf_reader = PdfFileReader(input_pdf)
        pdf_writer = PdfFileWriter()

        # Add the watermark to every page
        for page in range(pdf_reader.getNumPages()):
            page = pdf_reader.getPage(page)
            page.mergePage(watermark_page)
            pdf_writer.addPage(page)

        with open(output, 'wb') as out:
            pdf_writer.write(out)

    if __name__ == '__main__':
        create_watermark(
            input_pdf='Jupyter_Notebook_An_Introduction.pdf',
            output='watermarked_notebook.pdf',
            watermark='watermark.pdf')

create_watermark() takes three parameters:

input_pdf: the path of the PDF file to be watermarked

output: the path where the watermarked version of the PDF is saved

watermark: a PDF containing the watermark image or text

In the code, we open the watermark PDF and grab the first page from the document, because this is where the watermark should reside. Then we create a PDF reader object from input_pdf and a generic pdf_writer object for writing out the watermarked PDF.

The next step is to traverse the pages in input_pdf. For each page, we call .mergePage() with the watermark_page object read above as the argument, which overlays watermark_page on top of the current page, and then add the newly merged page to the pdf_writer object. After the loop, the newly watermarked PDF is written to disk.
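
The loop can also be adapted to leave some pages untouched, for example a cover page. The following sketch illustrates the idea. watermark_pages and its skip parameter are hypothetical, and the function assumes reader, page, and writer objects with the methods used above.

```python
def watermark_pages(pdf_reader, pdf_writer, watermark_page, skip=()):
    """Copy all pages to the writer, merging the watermark into each page not in `skip`.

    `skip` is a collection of zero-based page indexes that are copied unchanged.
    """
    for index in range(pdf_reader.getNumPages()):
        page = pdf_reader.getPage(index)
        if index not in skip:
            page.mergePage(watermark_page)
        pdf_writer.addPage(page)
```

Calling watermark_pages(pdf_reader, pdf_writer, watermark_page, skip={0}) would watermark everything except the first page. The writer is then saved with .write() as before.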

How to encrypt PDF?

PyPDF2 currently only supports adding a user password and an owner password to a pre-existing PDF. In PDF terms, the owner password gives you administrator privileges over the PDF and allows you to set permissions on the document, while the user password only allows you to open the document.

In practice, PyPDF2 does not let you set any permissions on the document, even though it lets you set the owner password. Either way, this is how to add a password, which also inherently encrypts the PDF:

    from PyPDF2 import PdfFileWriter, PdfFileReader

    def add_encryption(input_pdf, output_pdf, password):
        pdf_writer = PdfFileWriter()
        pdf_reader = PdfFileReader(input_pdf)

        # Copy every page to the writer before encrypting
        for page in range(pdf_reader.getNumPages()):
            pdf_writer.addPage(pdf_reader.getPage(page))

        pdf_writer.encrypt(user_pwd=password, owner_pwd=None,
                           use_128bit=True)

        with open(output_pdf, 'wb') as fh:
            pdf_writer.write(fh)

    if __name__ == '__main__':
        add_encryption(input_pdf='reportlab-sample.pdf',
                       output_pdf='reportlab-encrypted.pdf',
                       password='twofish')

add_encryption() takes the input and output PDF paths and the password to add to the PDF as parameters. Because the entire input PDF needs to be encrypted, we traverse all of its pages and add them to the writer. The final step is to call .encrypt(), which takes the user password, the owner password, and whether 128-bit encryption should be used as parameters. 128-bit encryption is enabled by default. If it is set to False, 40-bit encryption is applied.
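
To read a file that was protected this way, the reader has to be decrypted first. The sketch below relies on the reader's .isEncrypted attribute and .decrypt() method from the PyPDF2 API used in this article. The unlock helper itself is a hypothetical name.

```python
def unlock(pdf_reader, password):
    """Decrypt the reader if needed; return True when its pages can be read."""
    if not pdf_reader.isEncrypted:
        return True
    # .decrypt() returns 0 when the password matches neither the user
    # password nor the owner password, and a non-zero value otherwise.
    return pdf_reader.decrypt(password) != 0
```

With the file from the example above, unlock(PdfFileReader('reportlab-encrypted.pdf'), 'twofish') should return True, after which .getNumPages() and .getPage() can be used as usual.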

Conclusion

The PyPDF2 package is very useful, and you can use it to write scripts that automate batch operations on PDF documents. This article described how to extract metadata from a PDF, rotate pages, merge and split PDFs, add watermarks, and add encryption.

Also keep an eye on the newer PyPDF4 package, as it will likely replace PyPDF2. You can also take a look at the pdfrw package, which can perform many of the same operations as PyPDF2.
