
How ES conducts full-text search of Word and PDF documents

2025-01-19 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article shows how Elasticsearch (ES) can run full-text search over Word and PDF documents, and compares the tools available for extracting their content.

ES can search Word and PDF documents once their text has been extracted: a plug-in or external tool extracts the contents of each document, the extracted text is imported into ES, and it can then be searched. Several tools are available for the extraction step.

Comparison of several content extraction plug-ins:

https://ambar.cloud/blog/2017/10/24/ingesting-documents-into-es/

The following is Ambar's own summary of the file content extraction options for ES.

1. Ingest Attachment Plugin

Official website.

This is the easiest solution to use, and it is the official plug-in for Elasticsearch. It can extract content from almost all document types. Its attachment processing cannot be fine-tuned, which is why it cannot handle large files.
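As a rough illustration of how the plug-in is typically used, the sketch below creates an ingest pipeline with the `attachment` processor, indexes a base64-encoded file through it, and searches the extracted text in `attachment.content`. The node address, the index name `docs`, the pipeline name `attachment`, and the `filename` field are assumptions made for this example, not details from the article.

```python
import base64
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local test node without authentication


def attachment_pipeline():
    """Body for PUT _ingest/pipeline/<name>: run the attachment processor on `data`."""
    return {
        "description": "Extract text from base64-encoded files",
        "processors": [{"attachment": {"field": "data"}}],
    }


def document_body(raw_bytes, filename):
    """Body for indexing one file; the processor expects base64 in the source field."""
    return {
        "filename": filename,  # hypothetical extra field, kept for display
        "data": base64.b64encode(raw_bytes).decode("ascii"),
    }


def content_query(text):
    """Full-text query against the field the processor writes the text to."""
    return {"query": {"match": {"attachment.content": text}}}


def send(method, path, body):
    """Send one JSON request to Elasticsearch and return the decoded reply."""
    req = urllib.request.Request(
        ES_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def index_and_search(path, query_text, index="docs", pipeline="attachment"):
    """Create the pipeline, index one file through it, then search its content."""
    send("PUT", f"/_ingest/pipeline/{pipeline}", attachment_pipeline())
    with open(path, "rb") as fh:
        body = document_body(fh.read(), path)
    send("POST", f"/{index}/_doc?pipeline={pipeline}&refresh=true", body)
    return send("POST", f"/{index}/_search", content_query(query_text))
```

Calling `index_and_search("report.pdf", "invoice")` against a node with the plug-in installed would return the usual search response. Because the whole file travels inside one JSON request as base64, this approach gets uncomfortable as files grow, which matches the limitation described above.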

2. Apache Tika

Official website

Apache Tika is the de facto standard for extracting content from files. Roughly speaking, Tika is a collection of open source libraries for extracting file content, merged into a single library. It is open source and has a REST API. You need some experience setting it up and configuring it on a server. Note also that Tika does not work well with some kinds of PDF (those containing images), and that the REST API runs much slower than direct Java calls, even on localhost.

So, you have Tika installed. What is the next step? You need to write some kind of wrapper that will:

Download a file

Call Tika to extract file contents

Submit parsed content to ElasticSearch

For Elasticsearch to search large files quickly, you have to tune it yourself. To sum up, Tika is a good solution, but it requires a lot of coding and fine-tuning, especially for edge cases, which for Tika means unusual PDFs and OCR.
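The three wrapper steps listed above can be sketched roughly as follows. This is a minimal illustration, not a production wrapper: it assumes a Tika server listening on its default port 9998 and an unsecured Elasticsearch node on port 9200, and the index name `docs` and the `url` and `content` field names are hypothetical.

```python
import json
import urllib.request

TIKA_URL = "http://localhost:9998/tika"          # assumption: default tika-server address
ES_DOC_URL = "http://localhost:9200/docs/_doc"   # hypothetical index name "docs"


def download(url):
    """Step 1: fetch the raw bytes of the file."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def tika_request(raw_bytes):
    """Step 2 (request only): PUT the bytes to Tika and ask for plain text back."""
    return urllib.request.Request(
        TIKA_URL,
        data=raw_bytes,
        headers={"Accept": "text/plain"},
        method="PUT",
    )


def es_request(source_url, text):
    """Step 3 (request only): index the extracted text as one JSON document."""
    body = json.dumps({"url": source_url, "content": text}).encode("utf-8")
    return urllib.request.Request(
        ES_DOC_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ingest(source_url):
    """Run all three steps for one file and return Elasticsearch's reply."""
    raw = download(source_url)
    with urllib.request.urlopen(tika_request(raw)) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    with urllib.request.urlopen(es_request(source_url, text)) as resp:
        return json.load(resp)
```

Error handling, retries, and OCR configuration are left out on purpose. Those are exactly the parts where the coding and fine-tuning effort mentioned above tends to go.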

3. FsCrawler

Official website

FsCrawler is a "quick and dirty" open source solution for those who want to index documents from a local file system or over SSH. It crawls your file system, indexes new files, updates existing ones, and deletes old ones. FsCrawler is written in Java and takes some extra work to install and configure. It supports scheduled crawling (for example, every 15 minutes) and has a basic API for submitting documents and managing the schedule. FsCrawler uses Tika internally, so in general you can think of FsCrawler as the glue between Tika and Elasticsearch.
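To make the crawl cycle concrete, here is a small sketch of the idea of detecting new, updated, and deleted files between two scheduled runs. It is not FsCrawler's actual implementation; the function names, the use of modification times, and the `handle` callback are assumptions made for illustration.

```python
import os
import time


def snapshot(root):
    """Map each file path under `root` to its last-modified time."""
    files = {}
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            files[path] = os.path.getmtime(path)
    return files


def diff(previous, current):
    """Compare two snapshots and return (new, updated, deleted) path lists."""
    new = sorted(p for p in current if p not in previous)
    updated = sorted(
        p for p in current if p in previous and current[p] > previous[p]
    )
    deleted = sorted(p for p in previous if p not in current)
    return new, updated, deleted


def crawl_forever(root, handle, interval_seconds=15 * 60):
    """Every `interval_seconds`, pass the detected changes to `handle`."""
    previous = {}
    while True:
        current = snapshot(root)
        handle(*diff(previous, current))  # hypothetical callback: index, update, delete
        previous = current
        time.sleep(interval_seconds)
```

In a real setup, `handle` would send new and updated files through Tika and into Elasticsearch, and remove the documents for deleted files from the index.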

4. Ambar

Official website

It can handle large files (> 100 MB) very well.

It extracts content from PDFs (even poorly formatted ones with embedded images) and runs OCR on the images.

It provides users with an easy-to-use REST API and web UI.

Easy to deploy (thanks to Docker)

It is open source under the Fair Source 1 v0.9 license

Out of the box, it provides users with an analytical, real-time search experience.

