How to extract metadata from PDF by Python

2025-07-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article explains how to extract metadata from a PDF with Python, and also shows how to rotate, merge, split, watermark, and encrypt PDF files. The examples are short and easy to follow.

History of pyPdf, PyPDF2, and PyPDF4

The original pyPdf package was released in 2005, and its last official release was in 2010. About a year later, a company called Phasit sponsored a fork of pyPdf called PyPDF2. The code was written to be backward compatible with the original and worked well for several years, with its last release in 2016.

There was a brief series of releases of a package called PyPDF3, and then the project was renamed PyPDF4. All of these projects do much the same thing; the biggest difference between pyPdf and PyPDF2+ is that the later versions added Python 3 support. There is a separate Python 3 fork of the original pyPdf, but it has not been maintained for years.

Although PyPDF2 was recently abandoned, the newer PyPDF4 is not fully backward compatible with it. Most of the examples in this article work with PyPDF4, but some do not, which is why PyPDF4 is not featured more heavily here. Feel free to swap the PyPDF2 imports for PyPDF4 and see how it works for you.

pdfrw: an alternative PDF package

Patrick Maupin created a package called pdfrw that does much of the same work as PyPDF2. With the notable exception of encryption, pdfrw can perform all of the PyPDF2 operations described later in this article.

The biggest difference is that pdfrw integrates with the ReportLab package, so you can take some or all of a pre-existing PDF and use it to build a new PDF with ReportLab.

Installation of PyPDF2

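You can install PyPDF2 with pip, or with conda if you use Anaconda instead of regular Python. Here is how to install PyPDF2 using pip:

$ pip install pypdf2

Because PyPDF2 does not have any dependencies, the installation is very fast. Note that the examples in this article use the PdfFileReader and PdfFileWriter classes and camelCase methods such as .getPage(), which PyPDF2 3.0 removed. If the imports fail on a current release, installing an older version (for example with pip install "pypdf2<3.0") should restore them.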

How to extract PDF document information in Python

We can use PyPDF2 to extract metadata and some text from a PDF, which is especially useful when automating work on pre-existing PDF files.

The following are the types of data that can currently be extracted:

Author

Creator

Producer

Subject

Title

Number of pages

You can use any PDF file on your computer to try this out. The following code reads the PDF and shows how to access these properties:

from PyPDF2 import PdfFileReader

def extract_information(pdf_path):
    # Read the metadata while the file is still open
    with open(pdf_path, 'rb') as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()

    txt = f"""
    Information about {pdf_path}:

    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """

    print(txt)
    return information

if __name__ == '__main__':
    path = 'xxxx.pdf'
    extract_information(path)

The code first imports PdfFileReader from the PyPDF2 package. PdfFileReader is a class with several methods for interacting with PDF files. This example calls .getDocumentInfo(), which returns an instance of DocumentInformation containing most of the information we are interested in. It also calls .getNumPages() on the reader object, which returns the number of pages in the document.

The information variable has several instance attributes that provide the rest of the metadata from the document. The function prints this information and also returns it for future use.

Although PyPDF2 has .extractText(), which can be called on its page objects (not shown in this example), it does not work very well. Some PDFs return text and others return an empty string. If you want to extract text from a PDF, take a look at the PDFMiner project instead. PDFMiner is more powerful and is dedicated to extracting text from PDFs.
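A field that the PDF does not define typically comes back as None, so the printed report would show the literal text None. The following is a minimal sketch, not part of the original script, that gathers the same fields into a plain dictionary with a fallback string. The helper name summarize_metadata, the fallback text, and the stand-in object are assumptions made for illustration; the helper relies only on the attribute names used above.

```python
from types import SimpleNamespace

# Attribute names taken from the extract_information() example
FIELDS = ("author", "creator", "producer", "subject", "title")


def summarize_metadata(information, number_of_pages, missing="(not set)"):
    """Collect document info attributes into a dict, replacing empty values."""
    summary = {}
    for field in FIELDS:
        # getattr with a default keeps this safe for objects lacking a field
        value = getattr(information, field, None)
        summary[field] = value if value else missing
    summary["number_of_pages"] = number_of_pages
    return summary


# Stand-in object for demonstration; in practice pass pdf.getDocumentInfo()
fake_info = SimpleNamespace(author="Jane Doe", creator=None, producer="LaTeX",
                            subject=None, title="Sample")
for key, value in summarize_metadata(fake_info, 12).items():
    print(f"{key}: {value}")
```

A dictionary like this is also easier to write to CSV or JSON than the formatted string.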

How do I rotate the page?

Sometimes a PDF arrives with pages in landscape orientation instead of portrait, or even upside down. This often happens when someone scans a document to PDF or email. We could print the document and read the paper version, or we can use Python to rotate the offending pages.

Let's take a look at how to use PyPDF2 to rotate a few pages of a PDF:

from PyPDF2 import PdfFileReader, PdfFileWriter

def rotate_pages(pdf_path):
    pdf_writer = PdfFileWriter()
    pdf_reader = PdfFileReader(pdf_path)
    # Rotate the first page 90 degrees clockwise
    page_1 = pdf_reader.getPage(0).rotateClockwise(90)
    pdf_writer.addPage(page_1)
    # Rotate the second page 90 degrees counterclockwise
    page_2 = pdf_reader.getPage(1).rotateCounterClockwise(90)
    pdf_writer.addPage(page_2)
    # Add the third page in its normal orientation
    pdf_writer.addPage(pdf_reader.getPage(2))

    with open('rotate_pages.pdf', 'wb') as fh:
        pdf_writer.write(fh)

if __name__ == '__main__':
    path = 'new path .pdf'
    rotate_pages(path)

In addition to PdfFileReader, this example imports PdfFileWriter because we need to write out a new PDF. rotate_pages() takes the path of the PDF to modify. Inside the function, we create a writer object named pdf_writer and a reader object named pdf_reader.

Next, we use .getPage() to get the desired page. The code above grabs page 0, which is the first page, and calls the page object's .rotateClockwise() method with an argument of 90. Then, for the second page, it calls .rotateCounterClockwise() with 90 to rotate in the opposite direction.

After each rotation call, .addPage() adds the rotated version of the page to the writer object. The last page added is page 3, without any rotation. Finally, .write() writes all the pages to the new PDF.
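The example above hard-codes three pages. As a rough sketch of a more general variant, the code below takes a mapping from zero-based page index to a clockwise angle and copies every page, rotating only the ones listed. The names normalize_rotation and rotate_selected are hypothetical, and the PyPDF2 calls assume the same pre-3.0 API used throughout this article.

```python
def normalize_rotation(angle):
    """Map any multiple of 90 degrees to 0, 90, 180, or 270 (clockwise)."""
    if angle % 90 != 0:
        raise ValueError("PDF pages can only be rotated in 90 degree steps")
    # Python's modulo turns negative (counterclockwise) angles into
    # the equivalent clockwise angle, e.g. -90 becomes 270
    return angle % 360


def rotate_selected(pdf_path, output_path, rotations):
    """Copy all pages, rotating those listed in rotations {index: angle}."""
    # Imported here so normalize_rotation() works without PyPDF2 installed
    from PyPDF2 import PdfFileReader, PdfFileWriter  # assumes PyPDF2 < 3.0

    pdf_reader = PdfFileReader(pdf_path)
    pdf_writer = PdfFileWriter()
    for index in range(pdf_reader.getNumPages()):
        page = pdf_reader.getPage(index)
        angle = normalize_rotation(rotations.get(index, 0))
        if angle:
            page.rotateClockwise(angle)
        pdf_writer.addPage(page)

    with open(output_path, 'wb') as fh:
        pdf_writer.write(fh)
```

With this sketch, rotate_selected(path, 'rotate_pages.pdf', {0: 90, 1: -90}) would be expected to produce the same rotations as the original example, while keeping every remaining page instead of only the third.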

How to merge PDFs?

In many cases, we want to merge two or more PDFs into a single PDF. For example, you might have a standard cover page that needs to go onto many types of reports. Python can help with this kind of work.

The following code merges PDFs:

from PyPDF2 import PdfFileReader, PdfFileWriter

def merge_pdfs(paths, output):
    pdf_writer = PdfFileWriter()

    for path in paths:
        pdf_reader = PdfFileReader(path)
        for page in range(pdf_reader.getNumPages()):
            # Add each page to the writer object
            pdf_writer.addPage(pdf_reader.getPage(page))

    # Write out the merged PDF
    with open(output, 'wb') as out:
        pdf_writer.write(out)

if __name__ == '__main__':
    paths = ['document1.pdf', 'document2.pdf']
    merge_pdfs(paths, output='merged.pdf')

If you have a list of PDFs to merge, you can pass it directly to merge_pdfs(). The function takes the list of input paths and the output path as parameters.

It first iterates through the input paths and creates a PDF reader object for each one. It then iterates over all the pages in that PDF and adds them to the writer object using .addPage(). Once all pages of all PDFs in the list have been added, the result is written out at the end.

If you don't want to merge every page of each PDF, you can enhance the script by adding a range of pages to include. For an extra challenge, you can also use Python's argparse module to create a command-line interface for this function.
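One possible starting point for both suggestions is sketched below, using only the standard library. The option names (--pages, -o/--output), the page-range syntax, and the helper names parse_page_range and build_parser are all assumptions chosen for illustration, not something the merge script above defines.

```python
import argparse


def parse_page_range(spec):
    """Turn a 1-based spec such as '1-3,5' into zero-based page indexes."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(n) for n in part.split("-", 1))
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Convert to zero-based indexes as expected by .getPage()
        pages.extend(range(start - 1, end))
    return pages


def build_parser():
    """Build a command-line parser for a hypothetical merge script."""
    parser = argparse.ArgumentParser(description="Merge PDF files.")
    parser.add_argument("paths", nargs="+", help="input PDF files, in order")
    parser.add_argument("-o", "--output", default="merged.pdf",
                        help="path of the merged PDF")
    parser.add_argument("--pages", type=parse_page_range, default=None,
                        help="1-based pages to take from every input, e.g. 1-3,5")
    return parser


# Parse a sample argument list instead of sys.argv for demonstration
args = build_parser().parse_args(["a.pdf", "b.pdf", "--pages", "1-3,5"])
print(args.paths, args.output, args.pages)
```

Inside merge_pdfs(), the inner loop could then iterate over args.pages instead of range(pdf_reader.getNumPages()), skipping any index beyond a document's page count.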

How to split a PDF?

Sometimes you need to split a PDF into multiple PDFs, especially when it contains a lot of scanned content. Here is how to split a PDF into multiple files using PyPDF2:

from PyPDF2 import PdfFileReader, PdfFileWriter

def split(path, name_of_split):
    pdf = PdfFileReader(path)
    for page in range(pdf.getNumPages()):
        # Each output file gets its own writer holding a single page
        pdf_writer = PdfFileWriter()
        pdf_writer.addPage(pdf.getPage(page))

        output = f'{name_of_split}{page}.pdf'
        with open(output, 'wb') as output_pdf:
            pdf_writer.write(output_pdf)

if __name__ == '__main__':
    path = 'xxx.pdf'
    split(path, 'jupyter_page')

This function again creates a PDF reader object and loops over its pages. For each page in the PDF, it creates a new PDF writer instance, adds the single page to it, and writes that page to a uniquely named file. When the script finishes, each page of the original PDF has been split into a separate PDF.
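One file per page is not always what you want. As a hedged variation on the same idea, the sketch below splits a PDF into chunks of several pages each. The names chunk_ranges and split_in_chunks and the chunk_size parameter are hypothetical, and the PyPDF2 calls again assume the pre-3.0 API used in this article.

```python
def chunk_ranges(num_pages, chunk_size):
    """Return (start, stop) index pairs covering num_pages in chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [(start, min(start + chunk_size, num_pages))
            for start in range(0, num_pages, chunk_size)]


def split_in_chunks(path, name_of_split, chunk_size):
    """Write consecutive groups of chunk_size pages to separate files."""
    # Imported here so chunk_ranges() can be used without PyPDF2 installed
    from PyPDF2 import PdfFileReader, PdfFileWriter  # assumes PyPDF2 < 3.0

    pdf = PdfFileReader(path)
    ranges = chunk_ranges(pdf.getNumPages(), chunk_size)
    for number, (start, stop) in enumerate(ranges):
        pdf_writer = PdfFileWriter()
        for page in range(start, stop):
            pdf_writer.addPage(pdf.getPage(page))
        with open(f'{name_of_split}{number}.pdf', 'wb') as output_pdf:
            pdf_writer.write(output_pdf)


# A 10-page document split into chunks of 4 pages
print(chunk_ranges(10, 4))
```

The last chunk is simply shorter when the page count is not a multiple of chunk_size.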

How do I add a watermark?

Watermarks are identifying images or patterns on printed or digital documents, and some can only be seen under special lighting conditions. Watermarking matters because it helps protect your intellectual property, such as your images or PDFs.

We can use Python and PyPDF2 to watermark a document. To do so, we need a PDF that contains only the watermark image or text. Here is how to add a watermark to a PDF:

from PyPDF2 import PdfFileWriter, PdfFileReader

def create_watermark(input_pdf, output, watermark):
    watermark_obj = PdfFileReader(watermark)
    watermark_page = watermark_obj.getPage(0)

    pdf_reader = PdfFileReader(input_pdf)
    pdf_writer = PdfFileWriter()

    # Add the watermark to all pages
    for page in range(pdf_reader.getNumPages()):
        page = pdf_reader.getPage(page)
        page.mergePage(watermark_page)
        pdf_writer.addPage(page)

    with open(output, 'wb') as out:
        pdf_writer.write(out)

if __name__ == '__main__':
    create_watermark(
        input_pdf='Jupyter_Notebook_An_Introduction.pdf',
        output='watermarked_notebook.pdf',
        watermark='watermark.pdf')

create_watermark() takes three parameters:

input_pdf: the path of the PDF file to be watermarked

output: the path where the watermarked version of the PDF is saved

watermark: a PDF containing the watermark image or text

The code opens the watermark PDF and grabs only its first page, because that is where the watermark should reside. It then creates a PDF reader object from input_pdf and a generic pdf_writer object for writing out the watermarked PDF.

The next step is to loop over the pages in input_pdf. For each page, the code calls .mergePage() with watermark_page as the argument, which overlays watermark_page on top of the current page, and then adds the merged page to the pdf_writer object. After the loop finishes, the newly watermarked PDF is written to disk.

How to encrypt a PDF?

PyPDF2 currently only supports adding a user password and an owner password to a pre-existing PDF. In PDF terms, the owner password gives you administrator privileges over the PDF and lets you set permissions on the document, while the user password only allows the document to be opened.

In practice, PyPDF2 does not let you set any permissions on the document, even though it lets you set the owner password. Regardless, this is how you add a password, which also inherently encrypts the PDF:

from PyPDF2 import PdfFileWriter, PdfFileReader

def add_encryption(input_pdf, output_pdf, password):
    pdf_writer = PdfFileWriter()
    pdf_reader = PdfFileReader(input_pdf)

    # Copy every page so the whole document is encrypted
    for page in range(pdf_reader.getNumPages()):
        pdf_writer.addPage(pdf_reader.getPage(page))

    pdf_writer.encrypt(user_pwd=password, owner_pwd=None,
                       use_128bit=True)

    with open(output_pdf, 'wb') as fh:
        pdf_writer.write(fh)

if __name__ == '__main__':
    add_encryption(input_pdf='reportlab-sample.pdf',
                   output_pdf='reportlab-encrypted.pdf',
                   password='twofish')

add_encryption() takes the input and output PDF paths along with the password to add to the PDF. Because the entire input PDF needs to be encrypted, the function loops over all of its pages and adds them to the writer. The final step is to call .encrypt(), which takes the user password, the owner password, and a flag for whether 128-bit encryption should be used. 128-bit encryption is enabled by default. If the flag is set to False, 40-bit encryption is applied instead.
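The example passes the password as a string literal, which is fine for a demo but leaves the password in the script. A small sketch of one alternative is shown below: read the password from an environment variable and fall back to an interactive prompt. The function name resolve_password and the variable name PDF_PASSWORD are assumptions for illustration only.

```python
import getpass
import os


def resolve_password(env_var="PDF_PASSWORD"):
    """Return the password from the environment, or prompt for it."""
    password = os.environ.get(env_var)
    if password:
        return password
    # getpass hides the typed characters on most terminals
    return getpass.getpass("PDF password: ")
```

The result can then be passed along, for example as add_encryption(input_pdf, output_pdf, password=resolve_password()).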

