01-Full-text search (concepts)
Use full-text search (e.g. the Lucene framework) when the data volume is large, the search requirements are high, and searching the database content puts too much pressure on the database server.
What data is searched?
Text (the important case)
Multimedia
How is the search performed?
Semantics are not processed;
it searches for articles containing the specified words.
Scope of application?
Web search, forum (Tieba) search, document search, etc.
What are the requirements for full-text search?
Search speed must be fast.
Results must be accurate.
When multiple results are found, the best-matching results come first (relevance ranking).
Case insensitive.
02-Description of Lucene's functionality
lucene.apache.org
Apache also provides Tomcat, Struts, BeanUtils, DbUtils, ...
The principle of full-text retrieval based on Lucene
With a large amount of data, how does Lucene achieve fast retrieval?
Baidu search: the user's request is sent to the Baidu server, and results are returned.
│
Crawlers grab pages from the Internet, and the data is organized into a specific, fast-to-search format - the index library - placed on the Baidu server.
│
Lucene manages the index library and provides the search function.
03-Introduction to Lucene's API and data structures
The index library stores a pile of binary data; it can be understood as a database.
How is the index library's catalog created?
A web page or file is represented in Java as an object - an ordinary JavaBean; a Map can carry all the information in a JavaBean, e.g. Map<String, String>. In Lucene the corresponding structure is the Document with its Fields.
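The notes don't show how the index library's directory is opened; a minimal sketch, assuming Lucene 3.0.x and an index folder at ./indexDir (these variables are reused by the later snippets):

import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// open (or create) the index library folder on disk
Directory directory = FSDirectory.open(new File("./indexDir"));
// the analyzer that splits text into the keywords stored in the catalog area
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);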
04-Prepare the development environment for Lucene + HelloWorld (indexing)
Core package: lucene-core-3.0.1.jar
Feature-specific packages:
Analyzer (word splitter): lucene-analyzers-3.0.1.jar
Keyword highlighting: lucene-highlighter-3.0.1.jar
Dependency of the highlighter: lucene-memory-3.0.1.jar
Editor shortcut: Shift + Alt + A (block-selection mode in Eclipse)
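Section 04 names a HelloWorld indexing example, but the notes omit its code; a minimal sketch using the jars above and the directory/analyzer variables from the earlier snippet (field values are made up for illustration):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;

// build a Document and write it into the index library
Document doc = new Document();
doc.add(new Field("id", "1", Store.YES, Index.NOT_ANALYZED));
doc.add(new Field("title", "Preparing the development environment for Lucene", Store.YES, Index.ANALYZED));
doc.add(new Field("content", "Lucene is a framework for full-text search.", Store.YES, Index.ANALYZED));

IndexWriter indexWriter = new IndexWriter(directory, analyzer, MaxFieldLength.UNLIMITED);
indexWriter.addDocument(doc);
indexWriter.close(); // flush the in-memory buffer to disk and release the write lock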
06-Internal structure of the index library
The index library has two areas:
1. Catalog area - the analyzer splits out keywords and stores the keyword-to-document correspondence (which n articles contain each keyword)
2. Data area - stores the Documents
Doc.add (new Field ("id", idStr, Store.YES, Index.ANALYZED))
Doc.add (new Field ("title", article.getTitle (), Store.YES, Index.ANALYZED))
Doc.add (new Field ("content", article.getContent (), Store.NO, Index.ANALYZED))
The Store parameter
Specifies whether the original value of a Field is stored in the data area.
YES - stored; the value of this field can be read from the retrieved Document.
NO - not stored; the value of this field cannot be found in the retrieved Document.
The Index parameter
Specifies whether (and how) the text value of a Field is updated into the catalog area.
NO - not updated into the catalog area; the field cannot be searched.
ANALYZED - the field's text value is split into words first, and the result is updated into the catalog.
NOT_ANALYZED - no word splitting; the field's text value is updated into the catalog as a single term. Application scenarios: author, date, number, URL, file path.
Store | Index        | Application scenario
YES   | ANALYZED     | Can be searched and displayed.
YES   | NO           | Not searched by this field, but the field is shown when results are displayed. For example, articles are not searched by author, yet the author is shown with each article. Data stored in the index library can be returned directly, with no further database query: one search yields all the display data, which is efficient (cf. Baidu's page snapshots). If the data is very large, consider not storing the content in the index library.
NO    | ANALYZED     | Searchable by this field, but its content is not displayed. For example, e-books can be searched by content while the results page shows only the title and author, not the body text.
NO    | NO           | Not allowed: a field that is neither stored nor indexed is meaningless, and Lucene rejects it.
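As a sketch, the table rows (plus the NOT_ANALYZED case from the list above) expressed as Field declarations; getAuthor() and getUrl() are hypothetical accessors on the Article bean:

doc.add(new Field("title",   article.getTitle(),   Store.YES, Index.ANALYZED));     // row 1: search and display
doc.add(new Field("author",  article.getAuthor(),  Store.YES, Index.NO));           // row 2: display only
doc.add(new Field("content", article.getContent(), Store.NO,  Index.ANALYZED));     // row 3: search only
doc.add(new Field("url",     article.getUrl(),     Store.YES, Index.NOT_ANALYZED)); // whole value indexed as one term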
07-Analysis of the indexing and searching process: building the index (adding to, deleting from and modifying the index library)
1. Convert the Article to a Document
Document doc = new Document();
doc.add(new Field("id", idStr, Store.YES, Index.ANALYZED));
2. Add it to the index library
IndexWriter indexWriter = new IndexWriter(directory, analyzer, MaxFieldLength.UNLIMITED);
indexWriter.addDocument(doc);
This does two things: (1) saves the Document to the data area, where an internal number is assigned automatically;
(2) updates each field value (split into words or not) into the catalog area.
2. Searching
Search process:
1. Convert the query string into a Query object (here the default search field is "title")
QueryParser queryParser = new QueryParser(Version.LUCENE_30, "title", analyzer);
Query query = queryParser.parse(queryString);
2. Execute the query against the catalog area and get the intermediate results
IndexSearcher indexSearcher = new IndexSearcher(directory); // specify the index library to use
TopDocs topDocs = indexSearcher.search(query, 100); // return at most the first n results
int count = topDocs.totalHits; // total number of matching documents
ScoreDoc[] scoreDocs = topDocs.scoreDocs;
3. Extract the real Document data from the data area according to each docId
When searching, the query string is also split into keywords, which are matched against the keywords in the catalog area.
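The notes give no code for step 3; a minimal sketch of fetching the stored fields by docId, continuing the variables above:

// for each hit, read the stored Document out of the data area by its internal number
for (ScoreDoc scoreDoc : scoreDocs) {
    int docId = scoreDoc.doc;                 // internal number assigned at indexing time
    Document hit = indexSearcher.doc(docId);  // fetch the Document from the data area
    System.out.println(hit.get("id") + ": " + hit.get("title")); // only Store.YES fields can be read back
}
indexSearcher.close(); // release the searcher when done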
Analyzer (word splitter)
Word-segmentation rules:
indexing and searching must use the same segmentation rules.
MaxFieldLength
LIMITED - 10,000 terms, the default
UNLIMITED - no limit (Integer.MAX_VALUE)
When building the catalog, only the first n words of each field are processed.
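A small sketch of the two options (Lucene 3.0.x constructor, reusing the directory and analyzer variables from above):

// UNLIMITED: index every term of every field
IndexWriter indexWriter = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
// or LIMITED: cap each field at the first 10,000 terms
// IndexWriter indexWriter = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.LIMITED);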
08-Scenarios for using Lucene in web applications
What problems arise when a web application adds to, deletes from and modifies both the database and the index library?
Both the database and the index library hold the article's information: 1. storage space may be wasted; 2. the two states must be kept in sync.
1. Is it a waste?
The principle for storing data in the index library: store the data that needs to be searched. For example, if the database has an author table and an article table, only the article data goes into the index library; you can search for articles by author, but you cannot look up the author's detailed information there. The duplicated data is only the data that has to be searchable.
One search returns all the display data, reducing the pressure on the database.
The index library is necessary for full-text search. To avoid storing duplicate information, could we drop the database and keep all the data in the index library?
No! Some database functions cannot be provided by the index library, e.g. transaction management.
2. Schemes that keep the index library consistent with the data source:
(1) Scheme 1: whenever the database is added to, deleted from or modified, do the same operation on the index library. What is the problem? If the program already exists and belongs to someone else, you may not be allowed to modify it, yet you still want to add a search function - for example, Baidu organizes other people's web pages into a searchable index.
(2) Scheme 2: when you cannot control the data source, grab data from it periodically - the crawler concept - and periodically rebuild the index library (or synchronize it with the data source). Sometimes synchronizing is no faster than rebuilding, unless the index library is small. For web pages, a crawler analyzes whether a page has newer data, e.g. by taking an MD5 digest of the page content (a digest is irreversible, and different content yields a different digest in practice); compare digests and update the index library only when they differ. Even rebuilding millions of documents is relatively fast.
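A minimal sketch of the MD5 change-detection idea (java.security.MessageDigest is the standard API; the method name and usage are illustrative):

import java.security.MessageDigest;

// digest the page content; if the hex string differs from the one stored for this
// URL last time, the page has changed and its entry in the index library is refreshed
public static String md5(String pageContent) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    byte[] digest = md.digest(pageContent.getBytes("UTF-8"));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
        hex.append(String.format("%02x", b)); // two lowercase hex digits per byte
    }
    return hex.toString();
}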
Vertical search: deeper, domain-specific analysis - e.g. Taobao product search can find products by attributes such as price.
Adopt different schemes for different data sources:
1. Data source is a database: scheme 1 or scheme 2.
2. Data source is web pages: scheme 2 (the popular "vertical search").
3. Data source is files: use JNI (calling C or C++ from Java) to get operating-system information promptly; use scheme 2.
E.g.: use Struts2 to add, delete, modify and query forum posts.

ArticleAction {
    add() {
        // form -> article
        dao.save(article);        // save to the database
        indexDao.save(article);   // save to the index library
    }
    delete() {
        id = getParam("id");
        dao.delete(id);           // delete from the database
        indexDao.delete(id);      // delete from the index library
    }
    modify() {
        article = getById(id);    // form -> article
        dao.update(article);      // update the database
        indexDao.update(article); // update the index library
    }
}
The IndexDao adds to, deletes from, modifies and queries the index library:

IndexDao {
    save(article)
    delete(id)
    update(article)
    search(str)
    ...
}
09-Implementation of IndexDao (1)
The article's id must also be stored in the index library, and it must be stored without word splitting: it is the unique identifier and pinpoints the article exactly - for example, it is used to find and delete the article in the index library by id.
For converting the int id to a String, use Lucene's own utility methods, which encode the int's bytes compactly. Calling toString() directly turns 4 bytes of data into a dozen or so bytes, wasting space and making sorting hard.
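A sketch of this id handling in Lucene 3.0.x (NumericUtils is Lucene's own utility; indexWriter, doc and the Article accessors are as in the earlier snippets):

import org.apache.lucene.index.Term;
import org.apache.lucene.util.NumericUtils;

// encode the int id compactly and sortably, and store it without word splitting
String idStr = NumericUtils.intToPrefixCoded(article.getId());
doc.add(new Field("id", idStr, Store.YES, Index.NOT_ANALYZED));

// later, the unique id term pinpoints the article, e.g. to delete it from the index library
indexWriter.deleteDocuments(new Term("id", NumericUtils.intToPrefixCoded(id)));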
Ctrl + T: view the inheritance hierarchy (Eclipse shortcut)
12-Implement IndexDao (4): manage a singleton IndexWriter
If an IndexWriter is initialized but never closed while in use, its in-memory cache may never be flushed to disk and its resources are not released. In a web application, create the IndexWriter in an application-scope listener and close it before the application exits; in a plain Java program, close the IndexWriter before the JVM exits, i.e. register a piece of code to run just before the JVM shuts down.
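A minimal sketch of such a singleton (the class name LuceneUtils matches its later uses in these notes; the shutdown hook is the standard JVM mechanism for "code to run before the JVM exits"; the index path is assumed):

import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class LuceneUtils {
    private static IndexWriter indexWriter;

    static {
        try {
            Directory directory = FSDirectory.open(new File("./indexDir"));
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
            indexWriter = new IndexWriter(directory, analyzer, MaxFieldLength.UNLIMITED);
            // register code to run just before the JVM exits:
            // flush the buffers to disk and release the write lock
            Runtime.getRuntime().addShutdownHook(new Thread() {
                public void run() {
                    try {
                        indexWriter.close();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static IndexWriter getIndexWriter() {
        return indexWriter;
    }
}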
13-Optimizing the index library: merging files
The indexDir folder contains many files, some of which are merely marked as deleted. Why mark them instead of deleting them directly? Because the folder is large, and physically deleting on every frequent modification, addition or removal would be inefficient; the marked files are actually removed (merged away) when Lucene is not busy. How are the marked-deleted documents kept out of search results? All the data is fetched first, and the deleted entries are then filtered out. After optimization there are no .del files; optimization merges several small files into one large file, reducing IO operations.
Optimization means merging files: several small files are merged into one large file.
When are files merged?
In batch operations - e.g. rebuilding the index library every day after a crawl - many small files are merged into one large file and placed in the index library. When the number of files with the same extension reaches a threshold (default 10, minimum 2), they are merged automatically: automatic optimization.
@Test
public void test() throws Exception {
    // merge the index files explicitly
    LuceneUtils.getIndexWriter().optimize();
}

// automatic file merging
@Test
public void testAuto() throws Exception {
    // when the number of small files reaches this threshold, they are automatically
    // merged into one large file; the default is 10, the minimum is 2
    LuceneUtils.getIndexWriter().setMergeFactor(5);
    // build an index entry
    Article article = new Article();
    article.setId(1);
    article.setTitle("Preparing the development environment for Lucene");
    article.setContent("If the retrieval system only went out to the Internet to look for answers after a search request arrived, it could not return results within a limited time.");
    new ArticleIndexDao().save(article);
}
07-Implementation scheme of a web crawler (1)
Crawler: its function is to grab web pages.
How to download all the pages of a website?
Scheme:
0. Initial condition: the home page
1. Download a page and get its content
2. Extract all the hyperlinks in it
3. Remove already-downloaded links and off-site links
4. Loop back to step 1 for every remaining valid hyperlink; stop when no new links appear
Technique:
How to download a URL: URLConnection (built on Socket)
HTTP protocol: send a request, read the response's entity content
1. Download the web page

import java.io.InputStream;
import java.net.URL;
import java.util.Scanner;

public static String downLoad(String urlString) throws Exception {
    InputStream in = new URL(urlString).openConnection().getInputStream(); // the data in this stream is the page content
    Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A"); // the \A delimiter reads the whole stream as one token
    String content = scanner.hasNext() ? scanner.next() : "";
    scanner.close(); // also closes the underlying stream
    return content; // assumes the page is UTF-8-encoded
}
2. Extract all the hyperlinks
DOM + XPath
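A minimal DOM + XPath sketch of link extraction (JDK javax.xml API only; it assumes the downloaded page is well-formed XHTML - real-world HTML usually needs a lenient HTML parser first):

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public static List<String> extractLinks(String pageContent) throws Exception {
    // parse the page into a DOM tree (the default factory is not namespace-aware,
    // so the plain element name "a" matches)
    org.w3c.dom.Document dom = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(pageContent)));
    // select the href attribute of every <a> element
    XPath xpath = XPathFactory.newInstance().newXPath();
    NodeList hrefs = (NodeList) xpath.evaluate("//a/@href", dom, XPathConstants.NODESET);
    List<String> links = new ArrayList<String>();
    for (int i = 0; i < hrefs.getLength(); i++) {
        links.add(hrefs.item(i).getNodeValue());
    }
    return links;
}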
3. Remove already-downloaded links and off-site links
Keep all downloaded links in a database or in a collection, and check whether each newly found link is already contained in it.
To do this better: when many threads access the network, a few requests always fail for network reasons, so a link may be retried up to 3 times.
Multiple threads keep taking tasks from a task queue.
The task queue (the work still to be done) is kept in the database, so a power outage does not matter; the threads query tasks from the database and delete them when done. The queue form solves the problems of recursion: memory overflow, loss of state on power outage, and the inability to use multithreading.
In Java a LinkedList can represent the queue:
addFirst() / removeLast()
addLast() / removeFirst()
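A tiny sketch of the queue idea (single-threaded here; a real multi-threaded crawler would need synchronization or a BlockingQueue, and the download/extract calls are placeholders):

import java.util.LinkedList;

LinkedList<String> taskQueue = new LinkedList<String>();
taskQueue.addLast("http://example.com/"); // 0. initial condition: the home page
while (!taskQueue.isEmpty()) {
    String url = taskQueue.removeFirst(); // take the next task from the head of the queue
    // 1. download(url); 2. extract its hyperlinks;
    // 3. addLast() every new, not-yet-downloaded, on-site link
}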
Using design patterns, MVC and inheritance:
there are more classes and the relationships become more complex, but
1. it suits more complex and demanding situations;
2. the code is more readable;
3. the structure is reasonable and easy to modify;
4. it is easy to extend and highly maintainable.