This article explains the basic principles of Lucene. The explanation is simple and clear and easy to follow.
I. General introduction
According to the definition at lucene.apache.org/java/docs/i...:
Lucene is an efficient, Java-based full-text retrieval library.
To understand Lucene, it therefore helps to first understand full-text retrieval.
So what is full-text retrieval? Let's start with the data in our lives.
There are two kinds of data in our lives: structured data and unstructured data.
Structured data: data with a fixed format or limited length, such as database records or metadata.
Unstructured data: data with no fixed length or format, such as e-mail or Word documents.
Some sources also mention a third kind, semi-structured data, such as XML and HTML, which can be treated as structured data when needed, or have its plain text extracted and treated as unstructured data.
Unstructured data is also called full-text data.
According to the classification of data, search is also divided into two types:
Search of structured data: for example querying a database with SQL statements, or searching metadata, such as using Windows Search to find files by name, type, modification time and so on.
Search of unstructured data: for example, Windows Search can also search file contents, as can the grep command under Linux; Google and Baidu search enormous amounts of content data.
There are two main ways to search for unstructured data, that is, full-text data:
One is sequential scanning (Serial Scanning). To find the files containing a certain string, you look at each document in turn, from beginning to end; if a document contains the string, it is one of the files you are looking for, and you then move on to the next file until all documents have been scanned. Windows Search can search file contents this way, but it is quite slow: to find a file containing some string on an 80 GB hard drive may well take hours. The grep command under Linux works the same way. This method may seem primitive, but for a small amount of data it is still the most direct and convenient; for a large number of files, however, it is slow.
Some people may say that sequential scanning of unstructured data is slow, while searching structured data is relatively fast (because structured data has a definite structure, search algorithms can exploit it to speed things up). So why not find a way to give our unstructured data some structure?
This natural idea is exactly the basic idea of full-text retrieval: extract part of the information from the unstructured data, reorganize it so that it has a definite structure, and then search that structured data in order to achieve relatively fast search.
This part of the information extracted from unstructured data and then reorganized is called an index.
This may sound abstract, so a few examples help. Take a dictionary: its pinyin table and radical index are the dictionary's index, while the explanation of each word is the unstructured data. If the dictionary had no pinyin table and no radical index, finding a word in the vast sea of characters would only be possible by sequential scanning. But some information about each word can be extracted and structured: the pronunciation, for instance, is quite structured, since there are only a limited number of initials and finals that can be enumerated. So the pronunciations are taken out and arranged in a certain order, with each pronunciation pointing to the page number of the word's detailed explanation. When searching, we find the pronunciation in the structured pinyin table and then, following the page number it points to, find our unstructured data: the explanation of the word.
This process of building an index and then searching the index is called full-text search (Full-text Search).
The following picture is from "Lucene in Action"; it describes not only Lucene's retrieval process but also the general process of full-text retrieval.
Full-text retrieval consists of two processes, index creation (Indexing) and search index (Search).
Index creation: the process of extracting information from all structured and unstructured data in the real world and creating an index.
Search index: the process of getting the user's query request, searching the created index, and then returning the results.
Therefore, there are three important problems in full-text retrieval:
What exactly is stored in the index? (Index)
How do I create an index? (Indexing)
How do I search the index? (Search)
Let's study each problem in sequence.
II. What exactly is stored in the index
What exactly needs to be stored in the index?
First of all, let's take a look at why sequential scanning is slow:
In fact, it is because the information we want to search for is not consistent with the way the information is stored in the unstructured data.
What the unstructured data stores is which strings each file contains: given a file, it is relatively easy to obtain its strings, a mapping from file to strings. What we want to search for, however, is which files contain a given string: given a string, we want the files, a mapping from string to files. The two are exactly opposite. So if the index can store the mapping from string to files, it will greatly improve the search speed.
Because the mapping from string to files is the reverse of the mapping from file to strings, an index that stores this information is called an inverted index.
The information stored in an inverted index is roughly as follows:
Suppose my document collection contains 100 documents. For ease of presentation, we number them from 1 to 100 and obtain the following structure:
On the left is a series of strings, called the dictionary.
Each string points to a linked list of the documents (Document) that contain it; this list is called a posting list (Posting List).
With such an index, the stored information matches the information we want to search for, which greatly speeds up searching.
For example, to find the documents that contain both the string "lucene" and the string "solr", we only need the following steps (a small sketch follows the list):
1. Take out the posting list of the documents containing the string "lucene".
2. Take out the posting list of the documents containing the string "solr".
3. Merge the two lists to find the documents that contain both "lucene" and "solr".
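As a concrete illustration of step 3 (not Lucene's actual merge code), the sketch below intersects two sorted posting lists of document IDs; the document IDs are invented for the example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PostingListMerge {

    // intersect two ascending lists of document IDs, as in step 3 above
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int docA = a.get(i), docB = b.get(j);
            if (docA == docB) {        // this document contains both terms
                result.add(docA);
                i++;
                j++;
            } else if (docA < docB) {  // advance whichever list has the smaller ID
                i++;
            } else {
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> lucene = Arrays.asList(1, 3, 7, 12);  // documents containing "lucene" (invented)
        List<Integer> solr = Arrays.asList(3, 5, 12, 40);   // documents containing "solr" (invented)
        System.out.println(intersect(lucene, solr));        // prints [3, 12]
    }
}
```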
At this point you might say that full-text retrieval does speed up searching, but adding the indexing step means the combination is not necessarily much faster than sequential scanning. Indeed, with the indexing step included, full-text retrieval is not necessarily faster than sequential scanning, especially when the amount of data is small; and indexing a large amount of data is itself slow.
However, there is a difference between the two: sequential scanning has to scan everything on every search, whereas the index only needs to be created once and can then be used again and again. Each search only queries the already-created index; the indexing step is not repeated.
This is one of the advantages of full-text retrieval over sequential scanning: index once, use many times.
III. How to create an index
The index creation process of full-text retrieval generally has the following steps:
Step 1: prepare some original documents (Document) to be indexed.
To illustrate the index creation process, two documents are used as examples:
Document 1: Students should be allowed to go out with their friends, but not allowed to drink beer.
Document 2: My friend Jerry went to school to see his students but found them drunk which is not allowed.
Step 2: pass the original documents to the tokenizing component (Tokenizer).
The tokenizer (Tokenizer) does the following (a process called Tokenize):
Divide the document into a single word.
Remove punctuation.
Remove the stop word (Stop word).
Stop words (Stop word) are some of the most common words in a language. Because they carry no special meaning, in most cases they cannot serve as search keywords, so they are removed when creating the index, which also reduces the index size.
English stop words include "the", "a", "this" and so on.
The tokenizer (Tokenizer) for each language has its own stop word set.
The output of the tokenizer (Tokenizer) is called tokens (Token).
In our example, we get the following tokens (Token):
"Students", "allowed", "go", "their", "friends", "allowed", "drink", "beer", "My", "friend", "Jerry", "went", "school", "see", "his", "students", "found", "them", "drunk", "allowed".
Step 3: pass the resulting tokens (Token) to the language processing component (Linguistic Processor).
The language processing component (Linguistic Processor) mainly performs language-specific processing on the tokens.
For English, the language processing component (Linguistic Processor) generally does the following:
Convert to lowercase (Lowercase).
Reduce words to their root form, such as "cars" to "car". This operation is called stemming.
Transform words into their root form, such as "drove" to "drive". This operation is called lemmatization.
Similarities and differences between stemming and lemmatization:
What they have in common: both stemming and lemmatization reduce words to a root form.
They differ in approach:
Stemming uses a "reduction" approach: "cars" to "car", "driving" to "drive".
Lemmatization uses a "transformation" approach: "drove" to "drive", "driving" to "drive".
They also differ in algorithm:
Stemming mainly applies fixed rules, such as removing "s", removing "ing" and adding "e", changing "ational" into "ate" and "tional" into "tion".
Lemmatization mainly relies on a dictionary of mappings, for example "driving" to "drive", "drove" to "drive" and "am, is, are" to "be"; transforming a word is simply a dictionary lookup.
Stemming and lemmatization are not mutually exclusive but overlap: some words can be reduced to the same root form by either method (a tiny illustrative sketch follows).
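To make the contrast concrete, here is a deliberately tiny, non-production sketch: the stemmer applies fixed suffix rules like those mentioned above, while the lemmatizer looks words up in a small hand-made dictionary. Both the rules and the dictionary entries are taken only from the examples in the text:

```java
import java.util.HashMap;
import java.util.Map;

public class RootForms {

    // toy stemmer: fixed suffix rules, as described above ("cars" -> "car", "driving" -> "drive")
    static String stem(String word) {
        if (word.endsWith("ing")) return word.substring(0, word.length() - 3) + "e";
        if (word.endsWith("s"))   return word.substring(0, word.length() - 1);
        return word;
    }

    // toy lemmatizer: dictionary lookup, as described above ("drove" -> "drive", "am/is/are" -> "be")
    static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("drove", "drive");
        LEMMAS.put("driving", "drive");
        LEMMAS.put("am", "be");
        LEMMAS.put("is", "be");
        LEMMAS.put("are", "be");
    }

    static String lemmatize(String word) {
        return LEMMAS.getOrDefault(word, word);
    }

    public static void main(String[] args) {
        System.out.println(stem("cars"));        // car
        System.out.println(stem("driving"));     // drive
        System.out.println(lemmatize("drove"));  // drive
        System.out.println(lemmatize("is"));     // be
    }
}
```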
The output of the language processing component (Linguistic Processor) is called terms (Term).
In our example, after language processing we obtain the following terms (Term):
"student", "allow", "go", "their", "friend", "allow", "drink", "beer", "my", "friend", "jerry", "go", "school", "see", "his", "student", "find", "them", "drink", "allow".
It is precisely because of this language processing step that a search for "drove" can also find documents containing "drive".
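In Lucene, the tokenizer and the linguistic processor are packaged together in an Analyzer. Below is a minimal sketch using the bundled English analyzer from lucene-analyzers-common; the exact tokens depend on that analyzer's stop list and stemmer, so the output in the comment is only indicative:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzeExample {
    public static void main(String[] args) throws Exception {
        String text = "Students should be allowed to go out with their friends, but not allowed to drink beer.";
        try (Analyzer analyzer = new EnglishAnalyzer()) {
            // tokenization, lowercasing, stop-word removal and stemming in one pass
            TokenStream stream = analyzer.tokenStream("content", new StringReader(text));
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.print(term.toString() + " ");  // e.g. "student allow go ... drink beer"
            }
            stream.end();
            stream.close();
        }
    }
}
```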
Step 4: pass the resulting terms (Term) to the index component (Indexer).
The index component (Indexer) mainly does the following things:
1. Create a dictionary from the resulting terms (Term).
In our example, the dictionary is as follows:
Term      Document ID
student   1
allow     1
go        1
their     1
friend    1
allow     1
drink     1
beer      1
my        2
friend    2
jerry     2
go        2
school    2
see       2
his       2
student   2
find      2
them      2
drink     2
allow     2
2. Sort the dictionary alphabetically.
Term      Document ID
allow     1
allow     1
allow     2
beer      1
drink     1
drink     2
find      2
friend    1
friend    2
go        1
go        2
his       2
jerry     2
my        2
school    2
see       2
student   1
student   2
their     1
them      2
3. Merge identical terms (Term) into posting lists (Posting List).
In the resulting structure there are two definitions:
Document Frequency: the document frequency, i.e. how many documents in total contain this term (Term).
Frequency: the term frequency, i.e. how many times the term (Term) appears in a given document.
For the term (Term) "allow", for example, two documents contain it, so the posting list after this term has two entries. The first entry represents the first document containing "allow", document 1, in which "allow" appears twice; the second entry represents the second document containing "allow", document 2, in which "allow" appears once.
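A small sketch of this structure (not Lucene's internal representation) using the "allow" example above, where the length of a posting list is the document frequency and each entry carries the term frequency:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InvertedIndexSketch {

    // one entry in a posting list: a document ID plus the term frequency inside that document
    static class Posting {
        final int docId;
        final int frequency;  // how often the term occurs in this document (tf)
        Posting(int docId, int frequency) { this.docId = docId; this.frequency = frequency; }
        @Override public String toString() { return docId + "(tf=" + frequency + ")"; }
    }

    public static void main(String[] args) {
        // dictionary: term -> posting list; the list length is the document frequency (df)
        Map<String, List<Posting>> index = new HashMap<>();
        index.put("allow", Arrays.asList(new Posting(1, 2), new Posting(2, 1)));  // df = 2
        index.put("beer", Arrays.asList(new Posting(1, 1)));                      // df = 1

        List<Posting> postings = index.get("allow");
        System.out.println("df(allow) = " + postings.size() + ", postings = " + postings);
        // prints: df(allow) = 2, postings = [1(tf=2), 2(tf=1)]
    }
}
```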
At this point, the index has been created, and we can quickly find the document we want through it.
And in the process we are pleasantly surprised to find that searches for "drive", "driving", "drove" and "driven" can all be answered, because in our index "driving", "drove" and "driven" all become "drive" after language processing. When searching, an input such as "driving" also goes through the tokenization and language processing described here and becomes the query "drive", so the desired documents can be found.
IV. How to search the index
At this point it seems we can announce that we have found the documents we want.
However, the matter is not over; this is only one half of full-text retrieval. If only one or ten documents contain our query string, we have indeed found them. But what if there are a thousand, or even tens of thousands, of results? Which document do you want most?
Open Google, for example. Say you want to find a job at Microsoft, so you type "Microsoft job", and you find that 22,600,000 results are returned. What a big number: suddenly, finding too much is just as much of a problem as finding nothing. Among so many results, how do we put the most relevant ones first?
Of course Google does this well, and you find the Microsoft job pages right away. Imagine how terrible it would be if the first few results were all "Microsoft does a good job at software industry...".
So, like Google, how do we find the most relevant results among thousands of search results for a query?
How do we judge the relevance between the retrieved documents and the query statement?
This brings us back to our third question: how to search the index.
Search is mainly divided into the following steps:
Step 1: the user enters the query statement.
Like ordinary languages, query statements have a syntax.
Different query languages have different syntaxes; SQL statements, for example, have their own syntax.
The query syntax varies with the implementation of the full-text retrieval system, but the most basic operators are AND, OR, NOT and so on.
For example, the user enters the statement: lucene AND learned NOT hadoop.
This indicates that the user is looking for documents that contain both lucene and learned but not hadoop.
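As a hedged illustration, Lucene's classic QueryParser understands exactly this AND/OR/NOT syntax; the field name "content" and the choice of StandardAnalyzer below are assumptions made only for the example:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class QuerySyntaxExample {
    public static void main(String[] args) throws Exception {
        // the classic query parser understands AND, OR, NOT and builds a BooleanQuery
        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
        Query query = parser.parse("lucene AND learned NOT hadoop");
        System.out.println(query);  // e.g. +content:lucene +content:learned -content:hadoop
    }
}
```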
Step 2: perform lexical analysis, syntax analysis and language processing on the query statement.
Because the query statement has a syntax, it must go through lexical analysis, syntax analysis and language processing.
1. Lexical analysis mainly identifies words and keywords.
In the example above, lexical analysis yields the words lucene, learned and hadoop, and the keywords AND and NOT.
If a keyword is misspelled, it is not recognised as a keyword: in lucene AMD learned, for example, AND is misspelled as AMD, so AMD simply participates in the query as an ordinary word.
2. Syntax analysis mainly forms a syntax tree according to the syntax rules of the query statement.
If the query statement does not satisfy the syntax rules, an error is reported. For example, lucene NOT AND learned produces an error.
For the example above, lucene AND learned NOT hadoop forms a syntax tree with an AND node joining lucene and learned, and hadoop attached under a NOT node.
3. Language processing is almost the same as the language processing during indexing.
For example, learned becomes learn, and so on.
After this second step, we obtain a language-processed syntax tree.
Step 3: search the index to get the document that conforms to the syntax tree.
There are several small steps in this step:
First, in the inverted index, find the posting lists of the documents containing lucene, learn and hadoop respectively.
Second, merge the lists for lucene and learn to obtain the list of documents containing both lucene and learn.
Then subtract the hadoop list from that result to remove the documents containing hadoop, which gives the list of documents that contain both lucene and learn but do not contain hadoop.
This list is exactly the documents we are looking for (a small sketch follows).
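Continuing the earlier posting-list sketch, the AND step is an intersection and the NOT step is a set difference; the document IDs below are invented for the example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BooleanEvaluationSketch {
    public static void main(String[] args) {
        // toy posting lists (document IDs are invented for the example)
        List<Integer> lucene = new ArrayList<>(Arrays.asList(2, 5, 8, 9));
        List<Integer> learn = Arrays.asList(2, 4, 8);
        List<Integer> hadoop = Arrays.asList(5, 8);

        // "lucene AND learn": keep only IDs present in both lists
        lucene.retainAll(learn);   // -> [2, 8]
        // "... NOT hadoop": remove IDs whose documents also contain hadoop
        lucene.removeAll(hadoop);  // -> [2]
        System.out.println(lucene); // the documents matching the whole query
    }
}
```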
Step 4: sort the results according to the correlation between the document and the query statement.
Although the previous step produced the desired documents, the results should still be sorted by their relevance to the query statement, with the more relevant ranked higher.
How to calculate the correlation between documents and query statements?
Why not treat the query statement as a short document and score the relevance between documents? A document with a higher relevance score should be ranked higher.
So how do we score the relevance between documents?
This is not an easy task. Let's first look at relationships between people.
First, a person has many attributes: character, beliefs, hobbies, clothing, height, build and so on.
Second, for relationships between people, different attributes have different importance: character, beliefs and hobbies may matter more, while clothing, height and build matter less. So people with the same or similar character, beliefs and hobbies are more likely to become good friends, while people with different clothing, height or build can still become good friends.
Therefore, to judge the relationship between two people, first find the attributes that matter most to that relationship, such as character, beliefs and hobbies; then compare those attributes between the two people. Say one is cheerful and the other outgoing, one believes in Buddhism and the other in God, one likes basketball and the other football: both are positive in character, kind in belief and sporty in their hobbies, so they should get along well.
Let's take a look at the relationship between companies again.
First, a company consists of many people: the general manager, managers, the chief technology officer, ordinary staff, security guards, doormen and so on.
Second, for relationships between companies, different people have different importance: the general manager, managers and chief technology officer may matter more, while ordinary employees, security guards and doormen matter less. So if the general managers, managers and chief technology officers of two companies get along well, the two companies easily have a good relationship; conversely, even if an ordinary employee has a deep feud with an ordinary employee of the other company, it will not affect the relationship between the two companies.
Therefore, to judge the relationship between two companies, first find the people who matter most to that relationship, such as the general manager, managers and chief technology officer; then judge the relationships between those people. Say the two general managers were classmates, the managers come from the same home town, and the chief technology officers were once business partners: since the general managers, managers and chief technology officers all get along well, the two companies should have a very good relationship.
After analyzing the two relationships, let's take a look at how to determine the relationship between documents.
First, a document consists of many terms (Term), such as search, lucene, full-text, this, a, what and so on.
Second, for relationships between documents, different terms (Term) have different importance. For this document, for example, search, lucene and full-text are relatively important, while this, a and what may be relatively unimportant. So if two documents both contain search, lucene and full-text, they are highly relevant to each other; whereas whether or not they both contain this, a or what has little effect on their relevance.
Therefore, to judge the relationship between documents, first find out which terms (Term) matter most to that relationship, such as search, lucene, full-text; then judge the relationship between those terms (Term).
The process of determining how important a term (Term) is to a document is called computing the term weight (Term weight).
Computing a term weight (term weight) involves two parameters: the term (Term) and the document (Document).
The term weight (Term weight) indicates how important the term (Term) is in that document: the more important the term, the larger its weight, and the greater its influence on the computed relevance between documents.
Judging the relationships between terms (Term) to obtain document relevance uses an algorithm called the vector space model (Vector Space Model).
Let's analyze these two processes in detail:
1. The process of computing term weights (Term weight).
Two main factors affect the importance of a term (Term) in a document:
Term Frequency (tf): how many times the term (Term) appears in this document. The larger tf is, the more important the term.
Document Frequency (df): how many documents contain the term (Term). The larger df is, the less important the term.
Is that easy to understand? The more often a term (Term) appears in a document, the more important it is to that document: if the word "search" appears many times in this document, the document is probably mainly about searching. But in an English document, does "this" appearing more often make it more important? No, and that is what the second factor adjusts for: the more documents contain a term (Term), the more common the term is, and the less able it is to distinguish one document from another, hence the less important it is.
It is like the technologies we programmers learn: the more deeply a programmer masters a technology (the more time spent on it, the larger the tf), the more competitive he is when looking for a job; yet the fewer people who know the technology (the smaller the df), the more competitive he is too. The value of a person lies in being hard to replace.
Now let's look at the formula:
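The formula appears only as an image in the source article and is not reproduced here; a typical textbook form of such a term weight (one common variant, not necessarily the exact formula in the original figure) is:

w(t, d) = tf(t, d) × log( n / df(t) )

where tf(t, d) is the number of occurrences of term t in document d, df(t) is the number of documents containing t, and n is the total number of documents in the index.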
This is only a simple, typical implementation of the term weight formula. Implementers of full-text retrieval systems each have their own variants, and Lucene's is slightly different.
2. The process of judging the relationships between terms (Term) to obtain document relevance: the vector space model (VSM) algorithm.
We regard a document as a series of terms (Term), each with a weight (Term weight); different terms (Term) influence the document's relevance score according to their weights in the document.
So we regard the term weights of all the terms (Term) in a document as a vector:
Document = {term1, term2, ..., term N}
Document Vector = {weight1, weight2, ..., weight N}
Similarly, we regard the query statement as a simple document, also represented as a vector:
Query = {term1, term2, ..., term N}
Query Vector = {weight1, weight2, ..., weight N}
We put all the searched document vectors and the query vector into an N-dimensional space, where each term (Term) is one dimension.
As shown in the figure:
We believe that the smaller the angle between two vectors, the greater the relevance.
So we compute the cosine of the angle as the relevance score: the smaller the angle, the larger the cosine, the higher the score and the greater the relevance.
Someone may ask: a query statement is generally very short and contains few terms (Term), so its vector should have few dimensions, while a document is long and contains many terms, so its vector should have many dimensions. Why are both N-dimensional in the figure?
Because we want to put them into the same vector space, the dimensions must naturally be the same: we take the union of the two term sets, and if a term (Term) does not appear, its term weight (Term Weight) is 0.
The correlation scoring formula is as follows:
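The scoring formula also appears only as an image in the source article; the standard cosine measure of the vector space model, which is what the surrounding text describes, is:

score(q, d) = cos(θ) = (Vq · Vd) / (|Vq| × |Vd|)

where Vq · Vd is the sum over all terms of w(q, i) × w(d, i), and |Vq| and |Vd| are the lengths of the query vector and the document vector (the square roots of the sums of their squared weights).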
For example, suppose the query statement has 11 terms (Term) and three documents are retrieved. Their term weights (Term weight) are shown in the table below.
[Table: term weights of t1 through t11 for the query and documents 1-3 (values such as .477, .176 and .954); the layout of the original table could not be recovered.]
From these weights, the relevance score of each of the three documents to the query statement is computed.
Document 2 has the highest relevance and is returned first, followed by document 1 and finally document 3.
At this point, we can find the documents we want most.
Having said all this, we have not actually entered Lucene yet; this is only the basic theory of information retrieval (Information Retrieval). But once we look at Lucene, we will find that Lucene is a basic practice of this theory, so in later articles analysing Lucene we will often see the theories above applied in Lucene.
Before entering Lucene, here is a summary of the index creation and search process described above, as shown in the figure:
This figure is based on the article "Open Source full-text search engine Lucene" at www.lucene.com.cn/about.htm.
1. Indexing process:
1) There are a series of documents to be indexed.
2) The documents to be indexed go through tokenization and language processing to form a series of terms (Term).
3) Index creation forms the dictionary and the inverted index table.
4) The index is written to the hard disk through index storage.
2. Search process:
A) The user enters a query statement.
B) The query statement goes through lexical analysis and language processing to produce a series of terms (Term).
C) Syntax analysis produces a query tree.
D) The index is read into memory through index storage.
E) The query tree is used to search the index, yielding the posting list of each term (Term); the lists are intersected and differenced to obtain the result documents.
F) The search results are sorted by their relevance to the query.
G) The results are returned to the user.
Now we can enter the world of Lucene.
The link to this article in CSDN is blog.csdn.net/forfuture19 …
The link to this article in Javaeye is forfuture1978.javaeye.com/blog/546771
Calling Lucene through the Java API in Spring Boot
The code has been put on GitHub; import the spring-boot-lucene-demo project.
GitHub: spring-boot-lucene-demo
Add the dependencies:

```xml
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-highlighter</artifactId>
    <version>7.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-smartcn</artifactId>
    <version>7.1.0</version>
</dependency>
<dependency>
    <groupId>cn.bestwu</groupId>
    <artifactId>ik-analyzers</artifactId>
    <version>5.1.0</version>
</dependency>
<dependency>
    <groupId>com.chenlb.mmseg4j</groupId>
    <artifactId>mmseg4j-solr</artifactId>
    <version>2.4.0</version>
    <exclusions>
        <!-- exclude solr-core pulled in by mmseg4j-solr -->
        <exclusion>
            <groupId>org.apache.solr</groupId>
            <artifactId>solr-core</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```

Configure Lucene:

```java
private Directory directory;
private IndexReader indexReader;
private IndexSearcher indexSearcher;

@Before
public void setUp() throws IOException {
    // the index is stored in the indexDir/ folder under the current directory
    directory = FSDirectory.open(Paths.get("indexDir/"));
    // create the index reader
    indexReader = DirectoryReader.open(directory);
    // create the index searcher used to query the index
    indexSearcher = new IndexSearcher(indexReader);
}

@After
public void tearDown() throws Exception {
    indexReader.close();
}

/**
 * Execute the query and print the number of matching documents.
 */
public void executeQuery(Query query) throws IOException {
    TopDocs topDocs = indexSearcher.search(query, 100);
    // print the number of matching documents
    System.out.println("Total hits: " + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // fetch the corresponding stored document
        Document document = indexSearcher.doc(scoreDoc.doc);
        System.out.println("id: " + document.get("id"));
        System.out.println("title: " + document.get("title"));
        System.out.println("content: " + document.get("content"));
    }
}

/**
 * Tokenize the given text with the given analyzer and print every token.
 */
public void printAnalyzerDoc(Analyzer analyzer, String text) throws IOException {
    TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    try {
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            System.out.println(charTermAttribute.toString());
        }
        tokenStream.end();
    } finally {
        tokenStream.close();
        analyzer.close();
    }
}
```

Create an index:

```java
@Test
public void indexWriterTest() throws IOException {
    long start = System.currentTimeMillis();
    // the index is stored in the indexDir/ folder under the current directory
    Directory directory = FSDirectory.open(Paths.get("indexDir/"));
    // since Lucene 6.6 the Version argument is no longer needed; the analyzers have
    // no-arg constructors, so the default StandardAnalyzer can be used directly
    // Version version = Version.LUCENE_7_1_0;
    // Analyzer analyzer = new StandardAnalyzer();      // standard analyzer, suited to English
    // Analyzer analyzer = new SmartChineseAnalyzer();  // Chinese analyzer
    // Analyzer analyzer = new ComplexAnalyzer();       // Chinese analyzer
    Analyzer analyzer = new IKAnalyzer();               // Chinese analyzer
    // create the index writer configuration
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
    // create the index writer
    IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
    // create a Document object to hold the indexed fields
    Document doc = new Document();
    int id = 1;
    // add fields to the document
    doc.add(new IntPoint("id", id));
    doc.add(new StringField("title", "Spark", Field.Store.YES));
    doc.add(new TextField("content",
            "Apache Spark is a fast and general computing engine designed for large-scale data processing!",
            Field.Store.YES));
    doc.add(new StoredField("id", id));
    // save the document to the index
    indexWriter.addDocument(doc);
    indexWriter.commit();
    // close the writer
    indexWriter.close();
    long end = System.currentTimeMillis();
    System.out.println("Indexing took " + (end - start) + " ms");
}
```
Response
```
17:58:14.655 [main] DEBUG org.wltea.analyzer.dic.Dictionary - load extension dictionary: ext.dic
17:58:14.660 [main] DEBUG org.wltea.analyzer.dic.Dictionary - load extension stop dictionary: stopword.dic
Indexing took 879 ms
```

Delete a document

```java
@Test
public void deleteDocumentsTest() throws IOException {
    // Analyzer analyzer = new StandardAnalyzer();      // standard analyzer, suited to English
    // Analyzer analyzer = new SmartChineseAnalyzer();  // Chinese analyzer
    // Analyzer analyzer = new ComplexAnalyzer();       // Chinese analyzer
    Analyzer analyzer = new IKAnalyzer();               // Chinese analyzer
    // create the index writer configuration
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
    // create the index writer
    IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
    // delete the documents whose title field contains the term "Spark"
    long count = indexWriter.deleteDocuments(new Term("title", "Spark"));
    // IndexWriter also provides the following delete methods:
    //   deleteDocuments(Query query):     delete documents matching the query
    //   deleteDocuments(Query[] queries): delete documents matching any of the queries
    //   deleteDocuments(Term term):       delete documents containing the term
    //   deleteDocuments(Term[] terms):    delete documents containing any of the terms
    //   deleteAll():                      delete all documents
    // Note: documents are not deleted immediately; the deletion is buffered and only
    // takes effect when IndexWriter.commit() or IndexWriter.close() is called.
    indexWriter.commit();
    indexWriter.close();
    System.out.println("Deleted: " + count);
}
```
Response
```
Deleted: 1
```

Update a document

```java
/**
 * Update test: an update is effectively a delete followed by an add.
 */
@Test
public void updateDocumentTest() throws IOException {
    // Analyzer analyzer = new StandardAnalyzer();      // standard analyzer, suited to English
    // Analyzer analyzer = new SmartChineseAnalyzer();  // Chinese analyzer
    // Analyzer analyzer = new ComplexAnalyzer();       // Chinese analyzer
    Analyzer analyzer = new IKAnalyzer();               // Chinese analyzer
    // create the index writer configuration
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
    // create the index writer
    IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
    Document doc = new Document();
    int id = 1;
    doc.add(new IntPoint("id", id));
    doc.add(new StringField("title", "Spark", Field.Store.YES));
    doc.add(new TextField("content",
            "Apache Spark is a fast and general computing engine designed for large-scale data processing!",
            Field.Store.YES));
    doc.add(new StoredField("id", id));
    long count = indexWriter.updateDocument(new Term("id", "1"), doc);
    System.out.println("Updated: " + count);
    indexWriter.close();
}
```
Response
```
Updated: 1
```

Search by term (TermQuery)

```java
/**
 * TermQuery is the simplest and most commonly used Query; it can be understood as
 * "search by term". The most basic search in a search engine is to look up a term
 * in the index, and TermQuery is what does that.
 * A term is the smallest search unit in Lucene; it is essentially a name/value pair,
 * where the "name" is the field name and the "value" is a keyword contained in the field.
 */
@Test
public void termQueryTest() throws IOException {
    String searchField = "title";
    // TermQuery is the API for adding a single exact-term condition
    TermQuery query = new TermQuery(new Term(searchField, "Spark"));
    // execute the query and print the matching documents
    executeQuery(query);
}
```
Response
```
Total hits: 1
id: 1
title: Spark
content: Apache Spark is a fast and general computing engine designed for large-scale data processing!
```

Multi-condition query (BooleanQuery)

```java
/**
 * BooleanQuery is another frequently used Query in real projects. It is a combined
 * query to which various Query objects can be added together with the logical
 * relationship between them. BooleanQuery is a container of Boolean clauses and
 * provides an API for adding clauses and specifying how they relate.
 */
@Test
public void BooleanQueryTest() throws IOException {
    String searchField1 = "title";
    String searchField2 = "content";
    Query query1 = new TermQuery(new Term(searchField1, "Spark"));
    Query query2 = new TermQuery(new Term(searchField2, "Apache"));
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    // BooleanClause expresses the relationship of a Boolean clause:
    //   BooleanClause.Occur.MUST      - must contain
    //   BooleanClause.Occur.MUST_NOT  - must not contain
    //   BooleanClause.Occur.SHOULD    - may contain
    // The six combinations:
    //   1. MUST and MUST:         intersection of the clauses' results
    //   2. MUST and MUST_NOT:     results must not contain the MUST_NOT clause's matches
    //   3. SHOULD and MUST_NOT:   behaves like MUST and MUST_NOT
    //   4. SHOULD and MUST:       results are the MUST clause's matches; SHOULD only affects ranking
    //   5. SHOULD and SHOULD:     "or" relationship, union of all clauses' matches
    //   6. MUST_NOT and MUST_NOT: meaningless, no results
    builder.add(query1, BooleanClause.Occur.SHOULD);
    builder.add(query2, BooleanClause.Occur.SHOULD);
    BooleanQuery query = builder.build();
    // execute the query and print the matching documents
    executeQuery(query);
}
```
Response
```
Total hits: 1
id: 1
title: Spark
content: Apache Spark is a fast and general computing engine designed for large-scale data processing!
```

Prefix match (PrefixQuery)

```java
/**
 * PrefixQuery matches documents whose indexed term starts with the specified string,
 * i.e. the equivalent of "xxx%".
 */
@Test
public void prefixQueryTest() throws IOException {
    String searchField = "title";
    Term term = new Term(searchField, "Spar");
    Query query = new PrefixQuery(term);
    // execute the query and print the matching documents
    executeQuery(query);
}
```
Response
```
Total hits: 1
id: 1
title: Spark
content: Apache Spark is a fast and general computing engine designed for large-scale data processing!
```

Phrase search (PhraseQuery)

```java
/**
 * PhraseQuery searches by phrase. For example, to look for the phrase "big car",
 * a document matches if the specified field contains "big car". If the field
 * contains "big black car", the match fails unless slop is set.
 * slop is the maximum distance allowed between the positions of the two terms.
 */
@Test
public void phraseQueryTest() throws IOException {
    String searchField = "content";
    String query1 = "apache";
    String query2 = "spark";
    PhraseQuery.Builder builder = new PhraseQuery.Builder();
    builder.add(new Term(searchField, query1));
    builder.add(new Term(searchField, query2));
    builder.setSlop(0);
    PhraseQuery phraseQuery = builder.build();
    // execute the query and print the matching documents
    executeQuery(phraseQuery);
}
```
Response
```
Total hits: 1
id: 1
title: Spark
content: Apache Spark is a fast and general computing engine designed for large-scale data processing!
```

Fuzzy search (FuzzyQuery)

```java
/**
 * FuzzyQuery is a fuzzy query that can match terms similar to the given one.
 */
@Test
public void fuzzyQueryTest() throws IOException {
    String searchField = "content";
    Term t = new Term(searchField, "large");
    Query query = new FuzzyQuery(t);
    // execute the query and print the matching documents
    executeQuery(query);
}
```
Response
```
Total hits: 1
id: 1
title: Spark
content: Apache Spark is a fast and general computing engine designed for large-scale data processing!
```

Wildcard search (WildcardQuery)

```java
/**
 * Lucene also provides wildcard queries through WildcardQuery.
 * The wildcard "?" stands for one character and "*" for zero or more characters.
 */
@Test
public void wildcardQueryTest() throws IOException {
    String searchField = "content";
    Term term = new Term(searchField, "large*scale");
    Query query = new WildcardQuery(term);
    // execute the query and print the matching documents
    executeQuery(query);
}
```
Response
```
Total hits: 1
id: 1
title: Spark
content: Apache Spark is a fast and general computing engine designed for large-scale data processing!
```

Analyzed query (QueryParser)

```java
/**
 * Query with analysis: the query string is analyzed before searching.
 */
@Test
public void queryParserTest() throws IOException, ParseException {
    // Analyzer analyzer = new StandardAnalyzer();      // standard analyzer, suited to English
    // Analyzer analyzer = new SmartChineseAnalyzer();  // Chinese analyzer
    // Analyzer analyzer = new ComplexAnalyzer();       // Chinese analyzer
    Analyzer analyzer = new IKAnalyzer();               // Chinese analyzer
    String searchField = "content";
    // specify the search field and the analyzer
    QueryParser parser = new QueryParser(searchField, analyzer);
    // the user's input
    Query query = parser.parse("calculation engine");
    // execute the query and print the matching documents
    executeQuery(query);
}
```
Response
```
Total hits: 1
id: 1
title: Spark
content: Apache Spark is a fast and general computing engine designed for large-scale data processing!
```

Multi-field analyzed query (MultiFieldQueryParser)

```java
/**
 * Analyzed query over multiple fields.
 */
@Test
public void multiFieldQueryParserTest() throws IOException, ParseException {
    // Analyzer analyzer = new StandardAnalyzer();      // standard analyzer, suited to English
    // Analyzer analyzer = new SmartChineseAnalyzer();  // Chinese analyzer
    // Analyzer analyzer = new ComplexAnalyzer();       // Chinese analyzer
    Analyzer analyzer = new IKAnalyzer();               // Chinese analyzer
    String[] filedStr = new String[] {"title", "content"};
    // specify the search fields and the analyzer
    QueryParser queryParser = new MultiFieldQueryParser(filedStr, analyzer);
    // the user's input
    Query query = queryParser.parse("Spark");
    // execute the query and print the matching documents
    executeQuery(query);
}
```
Response
```
Total hits: 1
id: 1
title: Spark
content: Apache Spark is a fast and general computing engine designed for large-scale data processing!
```

Chinese word segmentation (analyzers)

```java
/**
 * IKAnalyzer Chinese analyzer.
 * SmartChineseAnalyzer (smartcn) requires the matching Lucene dependency and must be
 * kept in sync with the Lucene version.
 */
@Test
public void AnalyzerTest() throws IOException {
    // Analyzer analyzer = new StandardAnalyzer();      // standard analyzer, suited to English
    // Analyzer analyzer = new SmartChineseAnalyzer();  // Chinese analyzer
    // Analyzer analyzer = new ComplexAnalyzer();       // Chinese analyzer
    // Analyzer analyzer = new IKAnalyzer();            // Chinese analyzer
    Analyzer analyzer = null;
    String text = "Apache Spark is a fast and general computing engine designed for large-scale data processing";
    analyzer = new IKAnalyzer();            // IKAnalyzer Chinese segmentation
    printAnalyzerDoc(analyzer, text);
    System.out.println();
    analyzer = new ComplexAnalyzer();       // MMSeg4j Chinese segmentation
    printAnalyzerDoc(analyzer, text);
    System.out.println();
    analyzer = new SmartChineseAnalyzer();  // Lucene smartcn Chinese segmentation
    printAnalyzerDoc(analyzer, text);
}
```
Response of the three analyzers
[The three analyzers split the sample sentence into three slightly different token streams; the original token output is not reproduced here.]

Highlighting

```java
/**
 * Highlight the matched keywords in the search results.
 */
@Test
public void HighlighterTest() throws IOException, ParseException, InvalidTokenOffsetsException {
    // Analyzer analyzer = new StandardAnalyzer();      // standard analyzer, suited to English
    // Analyzer analyzer = new SmartChineseAnalyzer();  // Chinese analyzer
    // Analyzer analyzer = new ComplexAnalyzer();       // Chinese analyzer
    Analyzer analyzer = new IKAnalyzer();               // Chinese analyzer
    String searchField = "content";
    String text = "Apache Spark large-scale data processing";
    // specify the search field and the analyzer
    QueryParser parser = new QueryParser(searchField, analyzer);
    // the user's input
    Query query = parser.parse(text);
    TopDocs topDocs = indexSearcher.search(query, 100);
    // HTML tags used to wrap the highlighted keywords; requires lucene-highlighter-xxx.jar
    // (the concrete tag strings were lost when the article was published)
    SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("", "");
    Highlighter highlighter = new Highlighter(simpleHTMLFormatter, new QueryScorer(query));
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // fetch the corresponding stored document
        Document document = indexSearcher.doc(scoreDoc.doc);
        // highlight the content field
        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(document.get("content")));
        String content = highlighter.getBestFragment(tokenStream, document.get("content"));
        System.out.println(content);
    }
}
```
Response
Apache Spark is a fast and general computing engine designed for large-scale data processing! (with the matched keywords wrapped in the highlight tags)