What is full-text retrieval technology?

Updated 2025-03-27. From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/02 Report

This article explains what full-text retrieval technology is. The explanation is kept simple and clear so that it is easy to follow. Let's work through the question "what is full-text retrieval technology" step by step.

Full-text search technology

I. Search framework comparison

1. What is a full-text search engine?

The full-text search engine is the mainstream type of search engine in wide use today. It works as follows: an indexing program scans every word in a document and builds an index entry for each word, recording how many times and at which positions the word appears in the document. When a user submits a query, the retrieval program looks it up in the pre-built index and returns the matching results to the user. The process is similar to looking up a word through the lookup table of a dictionary.
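The indexing step described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular engine implements it: the function names `build_index` and `lookup`, the simple lowercase tokenization, and the sample documents are all assumptions made for the example.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each word to {doc_id: [positions]} (a toy inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumed tokenization: lowercase runs of word characters.
        for pos, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def lookup(index, word):
    """Return {doc_id: (count, positions)} for a single query word."""
    postings = index.get(word.lower(), {})
    return {doc_id: (len(p), p) for doc_id, p in postings.items()}

# Hypothetical sample documents.
docs = {
    1: "Full-text search builds an index on every word",
    2: "The index records where each word appears",
}
index = build_index(docs)
print(lookup(index, "index"))
```

The query never touches the original text: it only reads the entry for one word, which is what makes lookups fast once the index exists.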

This definition gives a rough idea of how full-text retrieval works. To explain it in more detail, let's start with the kinds of data we meet in everyday life.

There are two kinds of data in our lives: structured data and unstructured data.

Structured data: data with a fixed format or limited length, such as database records and metadata.

Unstructured data: also known as full-text data, this is data with variable length or no fixed format, such as emails and Word documents.

Semi-structured data, such as XML and HTML, can be processed as structured data when needed, or its plain text can be extracted and processed as unstructured data.

Corresponding to these two kinds of data, there are also two kinds of search: structured data search and unstructured data search.

Structured data can generally be stored and searched in the tables of a relational database (MySQL, Oracle, etc.), and indexes can be built on it. For unstructured data, that is, full-text data, there are two main search methods: sequential scanning and full-text retrieval.

Sequential scanning: as the name suggests, this means searching for a specific keyword by scanning the text in order. For example, suppose you are handed a newspaper and asked to find every place where the word "RNG" appears. You have to scan the newspaper from beginning to end and mark each section in which the keyword occurs and where it occurs.

This method is undoubtedly the most time-consuming and least efficient. If the newspaper is set in a small font and has many pages, or if there are many newspapers, your eyes will be worn out by the time you finish scanning.
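As a rough sketch of sequential scanning, the following Python reads every document from start to finish for each query. The function name `sequential_scan` and the sample newspaper text are made up for illustration.

```python
def sequential_scan(documents, keyword):
    """Scan every document in full and record each place the keyword occurs."""
    hits = []
    for name, text in documents.items():
        start = text.find(keyword)
        while start != -1:
            hits.append((name, start))
            start = text.find(keyword, start + 1)
    return hits

# Hypothetical newspaper pages.
newspapers = {
    "page-1": "RNG wins the opening match. Fans cheer for RNG.",
    "page-2": "Weather: sunny with light wind.",
}
hits = sequential_scan(newspapers, "RNG")
print(hits)
```

Every query pays the cost of reading all of the text again, so the work grows with the total size of the collection rather than with the number of matches.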

Full-text search: sequential scanning of unstructured data is slow, so can we optimize it? Could we find a way to give our unstructured data some structure? The idea is to extract part of the information from the unstructured data and reorganize it so that it has a certain structure, and then search that structured data instead, which makes the search comparatively fast. This is the basic idea of full-text retrieval. The information that is extracted from the unstructured data and then reorganized is called an index.

Take reading newspapers as an example again. Suppose we want to follow the news about the recent League of Legends S8 global finals, and we are all RNG fans. How can we quickly find the newspapers and sections that carry RNG news? The full-text search approach is to extract keywords from every section of every newspaper, such as "EDG", "RNG", "FW", "team", "League of Legends" and so on, and then build an index on these keywords. Through the index, each keyword leads us to the newspapers and sections in which it appears. Note that this is different from a directory search engine.
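The newspaper example can be sketched as a keyword-to-location index. The newspaper names, section names, and the `search` helper below are hypothetical and chosen only to mirror the example; the keywords are the ones mentioned above.

```python
from collections import defaultdict

# Hypothetical sample data: (newspaper, section) -> keywords extracted from it.
extracted = {
    ("Daily Sports", "Esports"): ["RNG", "EDG", "League of Legends", "team"],
    ("Daily Sports", "Football"): ["team"],
    ("City News", "Games"): ["FW", "League of Legends"],
}

# Build the index once: keyword -> set of (newspaper, section).
index = defaultdict(set)
for location, keywords in extracted.items():
    for keyword in keywords:
        index[keyword].add(location)

def search(*keywords):
    """Return the locations that contain every given keyword."""
    sets = [index.get(k, set()) for k in keywords]
    return set.intersection(*sets) if sets else set()

print(search("RNG"))
print(search("League of Legends", "team"))
```

Combining keywords becomes a set intersection over small index entries, with no need to reread any newspaper.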

2. Why use a full-text search engine?

Why use a search engine at all? All of our data is already in a database, and Oracle, SQL Server and other databases also provide query, retrieval and clustering analysis functions, so why not simply query the database directly? Indeed, most query needs can be met with database queries. If a query is slow, we can improve it by building database indexes, optimizing the SQL, or introducing a cache to return data faster. If the data volume grows further, the data can be split across multiple databases and tables to spread the query load.

Then why do we still need a full-text search engine? The main reasons are as follows:

Data type: full-text index search supports unstructured data and can quickly find any word or phrase in large volumes of unstructured text. For example, web search sites such as Google and Baidu build their indexes from the keywords on each page; when we enter a keyword, they return all the pages that the index matches for that keyword. Log search in everyday project work is another common case. Relational databases do not support searching this kind of unstructured text well.

Index maintenance: in a traditional database, full-text retrieval is generally of little practical use, because few people store large text fields in a database. Full-text retrieval there requires scanning the entire table, and when the data volume is large, even optimized SQL has little effect. An index can be built, but it is cumbersome to maintain, since it has to be rebuilt for both insert and update operations.
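The full-table-scan point can be observed with a small experiment. The sketch below uses SQLite from Python's standard library as a stand-in for the relational databases mentioned above (an assumption of this example; other databases have their own query planners), with a made-up `articles` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("CREATE INDEX idx_body ON articles (body)")
conn.executemany(
    "INSERT INTO articles (body) VALUES (?)",
    [("RNG reaches the finals",), ("weather report",), ("fans support RNG",)],
)

# A pattern with a leading wildcard cannot use the ordinary index to
# narrow the search, so every row (or every index entry) is visited.
query = "SELECT id FROM articles WHERE body LIKE '%RNG%'"
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
ids = sorted(row[0] for row in conn.execute(query))

for row in plan:
    print(row[-1])  # the last column holds the human-readable plan step
print(ids)
```

The plan reports a scan rather than an index search, which is the behavior the paragraph above describes: the cost of a substring query grows with the size of the table.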

When to use a full-text search engine:

The data object searched is a large amount of unstructured text data.

The number of document records reaches hundreds of thousands or millions or more.

Support a large number of interactive text-based queries.

Demand for very flexible full-text search queries.

Highly relevant search results are a particular requirement, and no available relational database can meet it.

Situations where there is relatively little need for different record types, non-text data manipulation, or secure transaction processing.

3. Lucene, Solr, and Elasticsearch comparison

3.1 Lucene

Lucene is a full-text search engine library written entirely in Java. It is not a complete application, but a code library and API that can easily be used to add search capabilities to an application.

Lucene provides powerful features through a simple API:

Scalable, high-performance indexing:

Over 150 GB/hour on modern hardware

Small RAM requirements: only 1 MB heap

Incremental indexing as fast as batch indexing

Index size roughly 20-30% of the size of the indexed text

Powerful, accurate, and efficient search algorithms:

Ranked searching: the best results are returned first

Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries, etc.

Fielded searching (e.g. title, author, contents)

Sorting by any field

Multiple-index searching with merged results

Simultaneous update and searching

Flexible faceting, highlighting, joins, and result grouping

Fast, memory-efficient, and typo-tolerant suggesters

Pluggable ranking models, including the vector space model and Okapi BM25

Configurable storage engine (codecs)

Cross-platform solution:

Available as open source software under the Apache License, which lets you use Lucene in both commercial and open source programs

100% pure Java

Implementations in other programming languages are available that are index-compatible

The Apache Software Foundation provides support for the Apache community of open source software projects.

But Lucene is only a framework. To make full use of its functionality, you need to write Java and integrate Lucene into your program. It takes a lot of study to understand how it works, and becoming proficient with Lucene is genuinely complicated.
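Of the ranking models named in the feature list above, Okapi BM25 is compact enough to sketch directly. The code below follows the textbook form of the formula and is not Lucene's own implementation; the function name `bm25_scores`, the whitespace tokenization, and the sample documents are assumptions of this example, and the parameter values k1 = 1.2 and b = 0.75 are commonly used defaults.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with textbook Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: the number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical sample documents, tokenized on whitespace.
docs = [
    "lucene is a search library".split(),
    "solr is built on lucene".split(),
    "the weather is sunny today".split(),
]
scores = bm25_scores(["lucene", "search"], docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

The document that matches both query terms ranks first, and the document that matches neither scores zero, which is the "best results first" behavior a ranking model provides.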

3.2 Solr

Apache Solr is an open source search platform built on the Java library Lucene. It offers the search capabilities of Apache Lucene in a user-friendly way. Having been an industry player for nearly a decade, it is a mature product with a strong and extensive user community. It provides distributed indexing, replication, load-balanced querying, and automatic failover and recovery. If it is properly deployed and well managed, it can serve as a highly reliable, scalable and fault-tolerant search engine. Many Internet giants, such as Netflix, eBay, Instagram and Amazon CloudSearch, use Solr because it can index and search multiple sites.

The main feature list includes:

Full-text search

Highlighting

Faceted search

Real-time indexing

Dynamic clustering

Database integration

NoSQL features and rich document processing (such as Word and PDF files)

3.3 Elasticsearch

Elasticsearch is an open source (Apache 2 license), RESTful search engine built on the Apache Lucene library.

Elasticsearch was launched a few years after Solr. It provides a distributed, multi-tenant full-text search engine with an HTTP web interface (REST) and schema-free JSON documents. Official Elasticsearch client libraries are available for Java, Groovy, PHP, Ruby, Perl, Python, .NET and JavaScript.
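To give a feel for the HTTP and JSON interface, the sketch below builds a search request with Python's standard library without sending it. The `_search` endpoint and the `match` query are part of Elasticsearch's REST API; the node address `http://localhost:9200`, the index name `articles`, the field name `content`, and the helper `build_search_request` are assumptions of this example.

```python
import json
import urllib.request

# Assumed local node and index name; adjust both for a real cluster.
BASE_URL = "http://localhost:9200"
INDEX = "articles"

def build_search_request(field, text):
    """Build (but do not send) a JSON search request for the _search endpoint."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{BASE_URL}/{INDEX}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_search_request("content", "full-text retrieval")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# Sending it would be urllib.request.urlopen(request), which needs a running node.
```

Because the whole exchange is plain HTTP carrying JSON, any language with an HTTP client can talk to the engine, even without one of the official client libraries.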

In this distributed search engine, an index can be divided into shards, and each shard may have multiple replicas. Each Elasticsearch node can hold one or more shards, and a node can also act as a coordinator, delegating operations to the correct shards.
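The shard and coordinator roles can be illustrated with a toy in-memory model. This is a simplified sketch rather than Elasticsearch's actual routing code: the CRC32 hash, the shard count of 3, and the helper names are assumptions, and replicas are not modeled.

```python
import zlib

NUM_SHARDS = 3  # assumed number of primary shards

def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Pick a shard from a stable hash of the document id (simplified routing)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

# Each shard is modeled as a plain dict: doc_id -> text.
shards = [dict() for _ in range(NUM_SHARDS)]

def index_document(doc_id, text):
    """Store the document on the single shard its id hashes to."""
    shards[shard_for(doc_id)][doc_id] = text

def search(keyword):
    """Coordinator role: query every shard, then merge the partial results."""
    hits = []
    for shard in shards:
        hits.extend(doc_id for doc_id, text in shard.items() if keyword in text)
    return sorted(hits)

for i, text in enumerate(["RNG wins", "EDG loses", "RNG fans", "FW plays"]):
    index_document(f"doc-{i}", text)

print(search("RNG"))
```

A write goes to exactly one shard, while a search fans out to all shards and merges the answers, which is why adding shards spreads both storage and query work across nodes.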

Elasticsearch is scalable and offers near real-time search. One of its main features is multi-tenancy.

The main feature list includes:

Distributed search

Multi-tenant

Analysis and search

Grouping and aggregation

The choice of Elasticsearch vs. Solr

Because of its complexity, Lucene is rarely the first choice for search, except at companies that need to develop their own search framework and must build on Lucene at the lowest level. So here we focus on Elasticsearch and Solr.

Elasticsearch vs. Solr: which is better? What is the difference between them? Which one should you use?

Apache Solr is a mature project with a large and active development and user community, as well as the Apache brand. Solr was first released as open source in 2006 and long dominated the search engine field as the first choice for anyone who needed search capabilities. Its maturity translates into rich functionality that goes well beyond simple text indexing and search, such as faceting, grouping, powerful filtering, pluggable document processing, pluggable search chain components, language detection, and more.

Solr dominated the search field for many years. Then, around 2010, Elasticsearch appeared as another option on the market. At that time it was far less stable than Solr and had neither Solr's functional depth nor its mindshare, brand, and so on.

Although Elasticsearch was young, it had some advantages of its own. Elasticsearch is based on more modern principles, targets more modern use cases, and was built to make it easier to handle large indexes and high query rates. In addition, because it was so young and had no community to coordinate with, it was free to move forward without needing consensus or cooperation from others (users or developers), without backward compatibility, and without the other burdens that more mature software usually has to deal with.

As a result, it exposed some very popular features before Solr did (for example, near real-time search, or NRT search). Technically, the NRT search capability comes from Lucene, the underlying search library used by both Solr and Elasticsearch. Ironically, because Elasticsearch exposed NRT search first, people associate NRT search with Elasticsearch, even though Solr and Lucene are part of the same Apache project and one would therefore have expected Solr to offer such an in-demand feature first.

Feature comparison: Solr/SolrCloud vs. Elasticsearch

Community and developers
Solr/SolrCloud: Apache Software Foundation and community support
Elasticsearch: A single commercial entity and its employees

Node discovery
Solr/SolrCloud: Apache ZooKeeper, mature and battle-tested in a large number of projects
Elasticsearch: Zen, built into Elasticsearch itself; requires dedicated master nodes for split-brain protection

Shard placement
Solr/SolrCloud: Static in nature; migrating shards requires manual work; starting with Solr 7, the Autoscaling API allows some dynamic operations
Elasticsearch: Shards can be moved on demand according to the cluster state

Caches
Solr/SolrCloud: Global; invalidated with each segment change
Elasticsearch: Better suited to dynamically changing data

Analytics engine performance
Solr/SolrCloud: Very well suited to exact calculations on static data
Elasticsearch: Accuracy of results depends on data placement

Full-text search features
Solr/SolrCloud: Lucene-based language analysis, multiple suggesters, spell checking, rich highlighting support
Elasticsearch: Lucene-based language analysis, a single suggest API implementation, highlighting rescoring

DevOps friendliness
Solr/SolrCloud: Not fully there yet, but coming
Elasticsearch: Very good APIs

Non-flat data handling
Solr/SolrCloud: Nested documents and parent-child support
Elasticsearch: Natural support through nested and object types, allowing almost unlimited nesting, plus parent-child support

Query DSL
Solr/SolrCloud: JSON (limited), XML (limited), or URL parameters
Elasticsearch: JSON

Index/collection leader control
Solr/SolrCloud: Leader placement control and leader rebalancing to even out the load on nodes
Elasticsearch: Not possible

Machine learning
Solr/SolrCloud: Built in, on top of streaming aggregations, focused on logistic regression, plus a learning-to-rank contrib module
Elasticsearch: Commercial feature, focused on anomalies and outliers in time-series data

Thank you for reading. That concludes "what is full-text retrieval technology". After studying this article, you should have a deeper understanding of what full-text retrieval technology is; how to use it in a specific case still needs to be verified in practice.
