How do Java engineers master full-text search engines

2025-07-02 Update From: SLTechnology News&Howtos, Shulou (Shulou.com)

This article introduces how Java engineers can get to grips with full-text search engines: what full-text search is, why and when to use it, and how to choose between Lucene, Solr, and Elasticsearch.

What is full-text search?

First, let's settle the first question: what is a full-text search engine?

Baidu Baike defines it as follows:

The full-text search engine is the mainstream, most widely used kind of search engine today. It works by having an indexing program scan every word in an article and build an index entry for each word, recording how many times and where the word appears in the article.

When the user submits a query, the retrieval program searches the pre-built index and returns the results to the user. The process is similar to looking up a word through the index of a dictionary.

This definition gives a rough idea of how full-text retrieval works. To explain it in more detail, let's start with the kinds of data we meet in everyday life.

There are two kinds of data in our lives: structured data and unstructured data.

Structured data: data with a fixed format or limited length, such as database records and metadata.

Unstructured data: also known as full-text data, this is data with variable length or no fixed format, such as emails and Word documents.

There is also a third kind, semi-structured data, such as XML and HTML. Depending on the need, it can be processed as structured data, or its plain text can be extracted and processed as unstructured data.

Corresponding to these two kinds of data, search also falls into two kinds:

Structured data search

Unstructured data search

Structured data is usually stored and searched in the tables of a relational database (MySQL, Oracle, etc.), where indexes can also be built. For unstructured data, that is, full-text data, there are two main search methods: sequential scanning and full-text retrieval.

Sequential scanning: as the name suggests, this means searching for a specific keyword by scanning the text in order.

For example, suppose you are handed a newspaper and asked to find every place where the word "RNG" appears. You have no choice but to read the newspaper from beginning to end and mark each section in which the keyword occurs.

This is undoubtedly the most time-consuming and least efficient method. If the newspaper is set in small type and has many pages, or if there are many newspapers, your eyes will be worn out by the time you finish.
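The newspaper example can be sketched in a few lines of Java. This is only a minimal illustration using the JDK; the class name `SequentialScan`, the method `findOccurrences`, and the sample pages are hypothetical and not from the original article. Note that every query has to read all of the text again, which is why the cost grows with the total amount of text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SequentialScan {
    /** Reads every page from start to end and records each "page:offset" where the keyword occurs. */
    public static List<String> findOccurrences(List<String> pages, String keyword) {
        List<String> hits = new ArrayList<>();
        for (int page = 0; page < pages.size(); page++) {
            String text = pages.get(page);
            int offset = text.indexOf(keyword);
            while (offset >= 0) {
                hits.add(page + ":" + offset);
                // Continue searching just past the previous hit.
                offset = text.indexOf(keyword, offset + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical newspaper pages, for illustration only.
        List<String> newspaper = Arrays.asList(
                "RNG wins the opening match",
                "Weather: sunny with light wind",
                "Fans cheer as RNG meets EDG");
        System.out.println(findOccurrences(newspaper, "RNG")); // prints [0:0, 2:14]
    }
}
```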

Full-text search: sequential scanning of unstructured data is slow. Can we optimize it? Could we find a way to give our unstructured data some structure?

We extract part of the information from the unstructured data and reorganize it so that it has a certain structure, then search that structured data, which makes the search comparatively fast.

This approach is the basic idea of full-text retrieval. The information that is extracted from the unstructured data and then reorganized is called an index.

Take reading newspapers as an example again. Suppose we want to follow the news about the recent League of Legends S8 World Championship. If we are RNG fans, how can we quickly find the newspapers and sections that carry RNG news?

The full-text search approach is to extract keywords from every section of every newspaper, such as "EDG", "RNG", "FW", "team", "League of Legends", and so on. We then index these keywords, so that each keyword leads directly to the newspapers and sections in which it appears.
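As a rough sketch of the keyword index described above, the following Java class maps each keyword to the newspaper sections that mention it. The class `KeywordIndex`, its methods, and the section labels are hypothetical names chosen for illustration; a real engine tokenizes the text instead of checking a fixed keyword list.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class KeywordIndex {
    // keyword -> sorted set of "newspaper/section" labels in which it appears
    private final Map<String, Set<String>> index = new HashMap<>();

    /** Index time: the text is read once, and each keyword it contains points back to the section. */
    public void add(String section, String text, List<String> keywords) {
        for (String keyword : keywords) {
            if (text.contains(keyword)) {
                index.computeIfAbsent(keyword, k -> new TreeSet<>()).add(section);
            }
        }
    }

    /** Query time: a single map lookup, with no need to re-read any newspaper. */
    public Set<String> lookup(String keyword) {
        return index.getOrDefault(keyword, Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("EDG", "RNG", "FW");
        KeywordIndex idx = new KeywordIndex();
        idx.add("Daily/Sports", "RNG beat EDG in the quarter-final", keywords);
        idx.add("Daily/Weather", "Sunny with light wind", keywords);
        idx.add("Weekly/Esports", "RNG announces its roster", keywords);
        System.out.println(idx.lookup("RNG")); // prints [Daily/Sports, Weekly/Esports]
    }
}
```

The trade-off is that the work moves from query time to index time: the text is processed once up front, and each later lookup no longer depends on how much text there is.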

Why use a full-text search engine?

So the second question is: why use a search engine at all?

All our data is already in a database, and Oracle, SQL Server, and other databases also provide query, retrieval, and clustering analysis functions, so why not simply query the database directly?

Indeed, most of our query needs can be met by the database. If queries are slow, we can build database indexes, optimize the SQL, or even introduce a cache to return data faster. If the data volume grows further, we can shard it across multiple databases and tables to spread the query load.

Why, then, do we still need a full-text search engine? The main reasons are as follows:

Data type

A full-text index supports searching unstructured data, and can quickly find any word or phrase in large volumes of unstructured text.

Web search engines such as Google and Baidu, for example, build their indexes from the keywords in web pages. When we enter keywords in the search box, they return all the pages that match those keywords.

Searching project logs is another common application. Relational databases do not support searching this kind of unstructured text well.

Maintenance of index

In general, full-text retrieval is hard to do in a traditional database, because few people use a database to store large text fields.

Without a suitable index, full-text retrieval requires scanning the entire table, and optimizing the SQL syntax helps little when the data volume is large. Even when an index is built, it is cumbersome to maintain, because it has to be rebuilt on insert and update operations.

The third question: when to use a full-text search engine?

The data to be searched is a large amount of unstructured text.

The number of documents reaches hundreds of thousands, millions, or more.

A large number of interactive text-based queries must be supported.

Very flexible full-text search queries are required.

Highly relevant search results are specifically required, and no available relational database can meet that need.

There is relatively little need for different record types, non-text data manipulation, or secure transaction processing.

Lucene, Solr, or Elasticsearch?

The mainstream search engines today are Lucene, Solr, and Elasticsearch.

All three build their indexes as inverted indexes. What is an inverted index?

Let's take a look at Wikipedia's explanation:

An inverted index, also known as a reverse index, postings file, or inverted file, is an indexing method used in full-text search to store a mapping from a word to its locations in a document or a group of documents. It is the most commonly used data structure in document retrieval systems.
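To make the "mapping from a word to its locations" concrete, here is a minimal positional inverted index in plain Java. It is a sketch under simplifying assumptions, not how Lucene stores its index: the class `PositionalInvertedIndex` and its methods are hypothetical, tokenization is a naive lowercase split on non-word characters, and positions are counted in words.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class PositionalInvertedIndex {
    // term -> (document id -> word positions of the term inside that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    /** Splits the text into lowercase terms and records where each term occurs. */
    public void addDocument(int docId, String text) {
        int position = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty first token
            }
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position);
            position++;
        }
    }

    /** Returns document id -> positions for the term, or an empty map if the term is unknown. */
    public Map<Integer, List<Integer>> positions(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.<Integer, List<Integer>>emptyMap());
    }

    public static void main(String[] args) {
        PositionalInvertedIndex index = new PositionalInvertedIndex();
        index.addDocument(1, "Lucene is a search library");
        index.addDocument(2, "Solr is built on Lucene, and Elasticsearch is built on Lucene too");
        System.out.println(index.positions("lucene")); // prints {1=[0], 2=[4, 10]}
    }
}
```

Keeping positions, and not just document ids, is what makes phrase and proximity queries possible, since the engine can check whether two terms occur next to each other.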

Let's take a look at these three mainstream full-text search engines that use inverted indexing.

Lucene

Lucene is a full-text search engine written entirely in Java. Lucene is not a complete application, but a library and API that can easily be used to add search capabilities to an application.

Lucene provides powerful features through a simple API:

Scalable, high-performance indexing

Over 150 GB/hour on modern hardware

Small RAM requirements: only 1 MB heap

Incremental indexing as fast as batch indexing

Index size roughly 20-30% of the size of the indexed text

Powerful, accurate and efficient search algorithm

Ranked search: the best results are returned first

Many powerful query types: phrase query, wildcard query, proximity query, range query, etc.

Fielded search (e.g. title, author, content)

Sort by any field

Multi-index search with merged results

Allow simultaneous update and search

Flexible faceting, highlighting, joins, and result grouping

Fast, memory-efficient, and typo-tolerant suggesters

Pluggable ranking models, including vector space model and Okapi BM25

Configurable storage engine (codec)

Cross-platform solution

Available as open source software under the Apache license, allowing you to use Lucene in commercial and open source programs

100%-pure Java

Index-compatible implementations are available in other programming languages

But Lucene is just a framework. To make full use of its functionality, you need to write Java and integrate Lucene into your program. Understanding how it works takes a lot of study, and becoming proficient with Lucene is genuinely complicated.
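The Okapi BM25 ranking model named in the feature list above can be illustrated with the textbook formula. The sketch below is not Lucene's actual implementation (Lucene's own scoring code differs in details such as constant factors and length encoding); the class `Bm25Sketch` is hypothetical, and the parameter values k1 = 1.2 and b = 0.75 are the commonly used defaults, assumed here.

```java
class Bm25Sketch {
    static final double K1 = 1.2;  // controls how quickly repeated occurrences of a term saturate
    static final double B = 0.75;  // controls how strongly long documents are penalized

    /** Inverse document frequency: terms found in fewer documents get a higher weight. */
    public static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Contribution of one query term to the score of one document. */
    public static double score(int termFreq, int docLength, double avgDocLength,
                               long docCount, long docFreq) {
        double lengthNorm = 1.0 - B + B * (docLength / avgDocLength);
        double tfPart = (termFreq * (K1 + 1.0)) / (termFreq + K1 * lengthNorm);
        return idf(docCount, docFreq) * tfPart;
    }

    public static void main(String[] args) {
        // Same term, same collection: a short document versus a long one.
        System.out.println(score(2, 50, 100.0, 1000, 10));
        System.out.println(score(2, 400, 100.0, 1000, 10));
    }
}
```

A document's total score for a query is the sum of this value over the query terms. Under these assumptions, a document scores higher when it contains a rare term more often and is shorter than average, which is how "best results first" is decided.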

Solr

Apache Solr is an open source search platform built on the Java library Lucene. It exposes Apache Lucene's search capabilities in a user-friendly way.

It provides distributed indexing, replication, load-balanced queries, and automatic failover and recovery. If it is properly deployed and well managed, it can be a highly reliable, scalable, and fault-tolerant search engine.

Many Internet giants, such as Netflix, eBay, Instagram, and Amazon CloudSearch, use Solr because it can index and search multiple sites.

The main feature list includes:

Full-text search

Paging search

Real-time indexing

Dynamic clustering

Database integration

NoSQL features and rich document processing (such as Word and PDF files)

Elasticsearch

Elasticsearch is an open source RESTful search engine based on the Apache Lucene library, launched a few years after Solr.

It provides a distributed, multi-tenant full-text search engine with an HTTP web interface (REST) and schema-free JSON documents.

Official Elasticsearch client libraries are available for Java, Groovy, PHP, Ruby, Perl, Python, .NET, and JavaScript.

In this distributed search engine, an index can be divided into shards, and each shard can have multiple replicas. Each Elasticsearch node can hold one or more shards, and a node can also act as a coordinator, delegating operations to the correct shards.
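The shard and coordinator roles described above can be sketched in plain Java. This is a simplified model, not Elasticsearch code: the class `ShardedIndexSketch` is hypothetical, every shard lives in one process, replicas are left out, and `String.hashCode()` stands in for the hash function a real cluster would use to route documents.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ShardedIndexSketch {
    // Each shard is modeled as its own small inverted index: term -> document ids.
    private final List<Map<String, Set<String>>> shards = new ArrayList<>();

    public ShardedIndexSketch(int numberOfShards) {
        for (int i = 0; i < numberOfShards; i++) {
            shards.add(new HashMap<>());
        }
    }

    /** Routing: hash(document id) mod shard count, so an id always lands on the same shard. */
    public int shardFor(String docId) {
        return Math.floorMod(docId.hashCode(), shards.size());
    }

    /** Write path: the coordinator forwards the document to exactly one shard. */
    public void index(String docId, String text) {
        Map<String, Set<String>> shard = shards.get(shardFor(docId));
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                shard.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Read path: the coordinator queries every shard and merges the partial results. */
    public Set<String> search(String term) {
        String normalized = term.toLowerCase(Locale.ROOT);
        Set<String> merged = new TreeSet<>();
        for (Map<String, Set<String>> shard : shards) {
            Set<String> partial = shard.get(normalized);
            if (partial != null) {
                merged.addAll(partial);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        ShardedIndexSketch cluster = new ShardedIndexSketch(3);
        cluster.index("doc-1", "Full-text search with Lucene");
        cluster.index("doc-2", "Distributed search in Elasticsearch");
        cluster.index("doc-3", "Relational databases");
        System.out.println(cluster.search("search")); // prints [doc-1, doc-2]
    }
}
```

Because routing depends on the shard count, changing the number of shards in this model would send existing ids to different shards, which hints at why the shard count of an index is usually fixed up front.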

Elasticsearch is scalable and offers near real-time search. One of its main features is multi-tenancy. The main feature list includes:

Distributed search

Multi-tenant

Analytical search

Grouping and aggregation

Elasticsearch vs. Solr, how to choose?

Because of its complexity, Lucene is rarely the first choice for search, except at companies that need to develop their own search framework and must build directly on Lucene. So here we focus on Elasticsearch and Solr.

Elasticsearch vs. Solr: which is better? What are the differences between them? Which one should you use?

Historical comparison

Apache Solr is a mature project with a large and active development and user community, as well as the Apache brand. First released as open source in 2006, Solr long dominated the search engine field and was the first choice for anyone who needed search capabilities.

Its maturity translates into rich functionality that goes well beyond simple text indexing and search, such as faceting, grouping, powerful filtering, pluggable document processing, pluggable search chain components, language detection, and more.

Solr dominated the search field for many years. Then, around 2010, Elasticsearch appeared as another option on the market. At that time it was far less stable than Solr, and it had neither Solr's functional depth nor its mindshare or brand.

Although Elasticsearch is young, it has advantages of its own. It is based on more modern principles, targets more modern use cases, and was built to make handling large indexes and high query rates easier.

In addition, because it was too young to have a community to coordinate with, it was free to move forward without needing consensus or cooperation with others (users or developers), without backward compatibility, and without the other constraints that more mature software usually has to deal with.

As a result, it exposed some very popular features before Solr did (for example, near real-time (NRT) search).

Technically, the NRT search capability comes from Lucene, the underlying search library used by both Solr and Elasticsearch.

Ironically, because Elasticsearch exposed NRT search first, people associate NRT search with Elasticsearch, even though Solr and Lucene are part of the same Apache project and one would therefore expect Solr to have gained such in-demand functionality first.

Feature difference comparison

Both are popular, advanced open source search engines, and both are built around the same core search library, Lucene.

But they are different. Like everything, each has its advantages and disadvantages, and depending on your needs and expectations, either may be the better or the worse fit.

So, without further ado, let's look at how they differ:

Comprehensive comparison

In addition, we will analyze it from the following aspects:

Popularity trend in recent years

Let's take a look at the Google search trends for both products. Google Trends shows that Elasticsearch attracts more interest than Solr, but that doesn't mean Apache Solr is dead.

Although some may disagree, Solr is still one of the most popular search engines, with strong community and open source support.

[Figure: Google Trends comparison of Elasticsearch and Solr]

Installation and configuration

Compared with Solr, Elasticsearch is easy to install and very lightweight. You can have Elasticsearch installed and running in a few minutes.

However, this ease of deployment and use can become a problem if Elasticsearch is not managed properly. JSON-based configuration is simple, but if you want to annotate each setting in the file with comments, it is not for you.

In general, if your application uses JSON, Elasticsearch is the better choice. Otherwise, use Solr, because its schema.xml and solrconfig.xml are well documented.

Community

Solr has a larger community of developers and contributors, while Elasticsearch has a smaller but more active community and a growing user base.

Solr is a true open source community project: anyone can contribute to Solr, and new Solr developers (known as committers) are selected on merit.

Elasticsearch is technically open source, but less so in spirit. Anyone can see the source, and anyone can change it and offer contributions, but only Elasticsearch employees can actually commit changes to Elasticsearch.

Solr contributors and committers come from many different organizations, while Elasticsearch committers come from a single company.

Maturity

Solr is more mature, but Elasticsearch is growing rapidly, and I consider it stable.

Documentation

Solr scores high here. It is a very well documented product, with clear examples and API use case scenarios.

Elasticsearch's documentation is well organized, but it lacks good examples and clear configuration instructions.

That concludes this look at how Java engineers can get to grips with full-text search engines. Theory is best learned together with practice, so go and try these tools out.
