How to extract text from images and PDF on Linux

2025-04-05 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to extract text from images and PDF files on Linux. It introduces gImageReader, covers installation, and walks through a short test so you can judge whether it fits your own use case.

gImageReader is a front end for the open source Tesseract OCR engine. Tesseract was originally developed at HP and released as open source in 2005, with Google sponsoring its development from 2006.

Basically, an OCR (Optical Character Recognition) engine lets you extract text from a picture or a file such as a PDF. Tesseract can detect multiple languages and also supports Unicode characters.

However, Tesseract itself is a command-line tool without any GUI. This is where gImageReader comes in: it lets any user extract text from images and files without touching the terminal.
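
For comparison, here is roughly what a direct Tesseract call looks like without the GUI. This is a minimal sketch and not part of the original article: the function names `output_base` and `ocr_image` and the file name `scan.png` are made up for illustration, and it assumes the `tesseract` binary and the `eng` language data are already installed.

```shell
# Derive the output base name; tesseract appends ".txt" to it by itself.
output_base() {
    printf '%s\n' "${1%.*}"
}

# Run Tesseract on one image. The language defaults to English (eng).
ocr_image() {
    local image lang
    image=$1
    lang=${2:-eng}
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract is not installed" >&2
        return 1
    fi
    tesseract "$image" "$(output_base "$image")" -l "$lang"
}

# Example call (scan.png is a hypothetical file):
# ocr_image scan.png eng    # would write the text to scan.txt
```

gImageReader wraps this kind of call in a graphical workflow, which is the main reason to prefer it when you only need to extract text occasionally.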

Let me highlight a few things about it and describe my experience while testing it.

gImageReader: a cross-platform front end for Tesseract OCR

To simplify your work, gImageReader can easily extract text from PDF files or images that contain any type of text.

Whether you need it for spell checking or for translation, it should be useful for a specific group of users.

Introduction to gImageReader features:

Add PDF documents and images from disk, scanning devices, the clipboard, and screenshots

Ability to rotate images

Common image controls to adjust brightness, contrast, and resolution

Scan images directly through the application

Ability to process multiple images or files at a time

Manual or automatic definition of the recognition area

Recognize to plain text or to hOCR documents

Editor that displays the recognized text

Spell checking of the extracted text

Convert / export hOCR documents to PDF documents

Export the extracted text to a .txt file

Cross-platform (also available for Windows)

Install gImageReader on Linux

Note: You need to explicitly install the Tesseract language packs from your software manager before gImageReader can detect text in images and files.

You can find gImageReader in the default repositories of some Linux distributions, such as Fedora and Debian.

For Ubuntu, you need to add a PPA before installing it. To do this, enter the following in the terminal:

    sudo add-apt-repository ppa:sandromani/gimagereader
    sudo apt update
    sudo apt install gimagereader tesseract-ocr tesseract-ocr-eng tesseract-ocr-chi-sim tesseract-ocr-chi-tra -y
    sudo apt install tesseract-ocr-chi-sim-vert tesseract-ocr-chi-tra-vert -y

You can also find it for openSUSE in its build service, and Arch Linux users can get it from the AUR.

All links to the repositories and packages can be found on the project's GitHub page.
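
Because the language packs have to be installed separately, it can be worth checking which languages Tesseract actually sees once installation is done. The sketch below is an illustration rather than an official procedure: the helper names `pkg_for_lang`, `has_lang`, and `check_langs` are hypothetical, and the package naming rule (`tesseract-ocr-` plus the language code with `_` replaced by `-`) is an assumption based on the Debian/Ubuntu package names used above.

```shell
# Map a Tesseract language code to the assumed Debian/Ubuntu package name,
# e.g. chi_sim -> tesseract-ocr-chi-sim.
pkg_for_lang() {
    printf 'tesseract-ocr-%s\n' "$(printf '%s' "$1" | tr '_' '-')"
}

# Succeed if the language code in $1 appears as a whole line on stdin.
has_lang() {
    grep -qxF -- "$1"
}

# For each requested code, report whether "tesseract --list-langs" lists it.
check_langs() {
    local installed code
    # Some Tesseract versions print the list on stderr, so capture both streams.
    installed=$(tesseract --list-langs 2>&1) || return 1
    for code in "$@"; do
        if printf '%s\n' "$installed" | has_lang "$code"; then
            echo "$code: installed"
        else
            echo "$code: missing, try: sudo apt install $(pkg_for_lang "$code")"
        fi
    done
}

# Example call:
# check_langs eng chi_sim chi_tra
```

A language that shows up as missing here will most likely not be selectable in gImageReader either.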

Experience in using gImageReader

gImageReader is a very useful tool for extracting text from images when you need it. It also works well with PDF files.

When extracting text from photos taken with a smartphone, the detection was close but a bit inaccurate. Character recognition may work better on a properly scanned document.

So you will need to try it yourself and see how it works for your use case. I tried it on Ubuntu 20.04.2 LTS.

Operation steps

Open gImageReader

Add a PDF

Select multiple recognition languages: Recognition language => Simplified Chinese [chi_sim] + English [eng]

Copy or save recognized text
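
The same steps can be approximated on the command line. This is only a sketch under a few assumptions: as far as I know Tesseract does not read PDF files directly, so the pages are first rendered to PNG with `pdftoppm` (from poppler-utils, which the article does not mention); the helper names `join_langs` and `ocr_pdf` and the file name `document.pdf` are hypothetical; and the `chi_sim` and `eng` language packs are assumed to be installed.

```shell
# Join language codes with "+", the form accepted by tesseract's -l option,
# e.g. "chi_sim eng" -> "chi_sim+eng".
join_langs() {
    (IFS=+; printf '%s\n' "$*")
}

# Render each PDF page to a PNG, OCR the pages in order, and collect the
# recognized text in a .txt file next to the PDF.
ocr_pdf() {
    local pdf langs workdir page
    pdf=$1
    shift
    langs=$(join_langs "$@")
    workdir=$(mktemp -d) || return 1
    # 300 dpi is a common choice for OCR input, not a requirement.
    if pdftoppm -r 300 -png "$pdf" "$workdir/page"; then
        for page in "$workdir"/page-*.png; do
            tesseract "$page" stdout -l "$langs"
        done > "${pdf%.*}.txt"
    fi
    rm -rf "$workdir"
}

# Example call (document.pdf is a hypothetical file):
# ocr_pdf document.pdf chi_sim eng
```

Combining languages with `+` mirrors the multi-language selection made in the GUI above.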

The operation results are shown in the following figure:

The only problem I ran into was managing the languages in the settings, and I did not find a quick solution. If you hit the same issue, you may need to troubleshoot it further to find out how to resolve it.

That's all for "how to extract text from images and PDF on Linux". Thank you for reading.