Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

What is the inverted index of search engines?

2025-04-02 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)06/01 Report--

This article mainly explains "what is the inverted index of search engine". Interested friends might as well take a look. The method introduced in this paper is simple, fast and practical. Next let the editor to take you to learn "what is the inverted index of search engine"!

What is an inverted index?

See its name and its meaning, there is an inverted index, corresponding to the affirmation, there is a positive index.

Forward index (forward index) and reverse index (inverted index) are better known as inverted indexes. In the search engine, each file corresponds to a file ID, and the content of the file is represented as a collection of keywords (in fact, in the search engine index database, the keywords have also been converted to the keyword ID). For example, "document 1" extracts 20 keywords after word segmentation, and each keyword records the number and location of its occurrence in the document. The structure of the forward index is as follows: ID > word 1 of "document 1": occurrence times, occurrence location list; word 2: occurrence times, occurrence location list;. ID of "document 2" > list of keywords that appear in this document.

Usually through key, to find value.

When a user searches for the keyword "Huawei mobile phone" on the home page, assuming that there is only a positive index (forward index), then all documents in the index database need to be scanned, all documents containing the keyword "Huawei mobile phone" need to be found, graded according to the scoring model, ranked and presented to the user. Because the number of documents included in search engines on the Internet is astronomical, such an index structure simply can not meet the requirements of real-time return of ranking results. Therefore, the search engine will rebuild the forward index into an inverted index, that is, the mapping of file ID to keywords into the mapping of keywords to file ID, and each keyword corresponds to a series of files in which this keyword appears. The structure of the inverted index is as follows: "keyword 1": ID of "document 1", ID of "document 2", …. "keyword 2": a list of documents ID with this keyword.

From the keywords of the word, go to the document.

1. Word-document matrix word-document matrix is a conceptual model that expresses the inclusion relationship between the two, and figure 1 shows its meaning. Each column in figure 3-1 represents a document, each line represents a word, and the position of the check mark represents the inclusion relationship.

Figure 1 word-document matrix

From the vertical dimension, that is, the document, each column represents which words the document contains. For example, document 1 contains words 1 and 4, but not other words. In terms of horizontal, that is, the word dimension, each line represents which documents contain a word. For example, for word 1, word 1 has appeared in documents 1 and 4, while other documents do not contain word 1. Other rows in the matrix can also be interpreted in this way. The index of the search engine is actually the concrete [data structure] that implements the "word-document matrix" (http://lib.csdn.net/base/datastructure "algorithm and data structure knowledge base"). There are different ways to implement the above conceptual model, such as "inverted index", "signature file", "suffix tree" and so on. However, the experimental data show that "inverted index" is the best way to realize the mapping relationship between words and documents, so this blog mainly introduces the technical details of "inverted index". two。 Inverted indexing basic concept document (Document): the general search engine deals with Internet web pages, but the concept of document is broader, representing storage objects that exist in the form of text, and covering more forms than web pages, such as Word,PDF,html,XML and other different formats can be called documents. For example, an email, a text message, a Weibo can also be called a document. In the rest of this book, documents will be used to represent text information in many cases. Document collection (Document Collection): a collection of documents is called a document collection. For example, a large number of Internet pages or a large number of e-mails are concrete examples of document collections. Document number (Document ID): within the search engine, each document in the document collection will be given a unique internal number, with this number as the unique identification of this document, so as to facilitate internal processing, the internal number of each document is called "document number", and sometimes DocID is used to represent the document number conveniently. Word number (Word ID): similar to the document number, a word is represented by a unique number within the search engine, and the word number can be used as the only representation of a word. Inverted index (Inverted Index): inverted index is a specific storage form to implement "word-document matrix". Through inverted index, you can quickly get a list of documents containing this word based on the word. The inverted index mainly consists of two parts: "word dictionary" and "inverted file". Word dictionary (Lexicon): the usual index unit of a search engine is a word. A word dictionary is a collection of strings made up of all the words that have appeared in the document collection. Each index entry in the word dictionary records some information about the word itself and a pointer to the "inverted list". Inverted list (PostingList): an inverted list records a list of all documents in which a word has appeared and where the word appears in that document. Each record is called an inverted item (Posting). According to the inverted list, you can know which documents contain a word. Inverted file (Inverted File): an inverted list of all words is often sequentially stored in a file on disk, which is called an inverted file, which is a physical file that stores an inverted index. The relationship between these concepts can be seen more clearly in figure 2.

3. Inverted index simple instance inverted index is very simple in terms of logical structure and basic ideas. Below, we use specific examples to illustrate, so that readers can have a macro and direct feeling of the inverted index. Suppose the document collection contains five documents, each of which is shown in figure 3, and the leftmost column in the figure is the document number corresponding to each document. Our task is to build an inverted index on this document collection.

Figure 3 document collection

Chinese and English and other languages are different, there is no clear separation between words, so first of all, the document should be automatically divided into word sequences with a word segmentation system. In this way, each document is converted into a data stream composed of a sequence of words. for the convenience of subsequent processing in the system, we need to give a unique word number to each different word and record which documents contain the word. We can get the simplest inverted index (see figure 3-4). In figure 4, the "word ID" column records the word number of each word, the second column is the corresponding word, and the third column is the inverted list of each word. For example, the word "Google" is numbered as 1, and the inverted list is {1, 2, 4, 4, 5}, which means that every document in the document collection contains the word.

Figure 4 simple inverted index

The inverted index shown in figure 4 is the simplest because the index system records only which documents contain a word, when in fact, the index system can record more information than that. Figure 5 is a relatively complex inverted index. Compared with the basic indexing system in figure 4, the inverted list corresponding to words records not only the document number, but also the word frequency information (TF), that is, the number of occurrences of the word in a document. The reason for recording this information is that when the word frequency information is sorted by search results, calculating query and document similarity is a very important factor. Therefore, it is recorded in the inverted list to facilitate the score calculation in subsequent sorting. In the example in figure 5, the word "founder" is numbered 7, and the corresponding inverted list is: (3:1), where 3 represents the document with document number 3 contains the word, and the number 1 represents word frequency information. that is, this word appears only once in document 3, and the inverted lists corresponding to other words have the same meaning.

Figure 5 inverted index with word frequency information

The practical inverted index can also record more information. the index system shown in figure 6 records two types of information in addition to document number and word frequency information. that is, the "document frequency information" corresponding to each word (corresponding to the third column of figure 6) and recording the location information of the word in a document in the inverted list.

Figure 6 inverted index with word frequency, document frequency and location information

"document frequency information" represents how many documents in a document collection contain a word, and the reason for recording this information is the same as word frequency information, which is a very important factor in the ranking calculation of search results. The location information of words in a document does not have to be recorded by the index system, and it can be included or not included in the actual index system, because this information is not necessary for the search system, and location information can only be used when supporting "phrase query".

Take the word "Russ" as an example, the word number is 8 and the document frequency is 2, which means that two documents in the whole document collection contain the word, and the corresponding inverted list is: {(3 / 1;), (5 / 1). )}, which means that the word appears in document 3 and document 5 with a frequency of 1, and the word "Lars" appears in 4 in both documents, that is, the fourth word in the document is "Lars". The inverted index shown in figure 6 is already a very complete index system, and the index structure of the actual search system is basically like this. The only difference is which specific data structure is adopted to implement the above logical structure. With this index system, the search engine can easily respond to the user's query, for example, the user enters the query word "Facebook", and the search system looks up the inverted index, from which the document containing the word can be read out. These documents are the search results provided to the user, and these candidate search results can be sorted using word frequency information and document frequency information to calculate the similarity between the document and the query. According to the similarity score from high to low sort output, this is part of the internal process of the search system, the specific implementation plan will be described in detail in the fifth chapter of this book. 4. Word dictionary

The word dictionary is a very important part of the inverted index. it is used to maintain the relevant information of all the words that have appeared in the document collection, and to record the position information of the inverted list corresponding to a word in the inverted file. When supporting search, according to the user's query words, go to the word dictionary to query, you can get the corresponding inverted list, and use it as the basis for subsequent sorting.

For a large document collection, it may contain hundreds of thousands or even millions of different words, whether a word can be located quickly, which directly affects the response speed of search. Therefore, efficient data structures are needed to build and search word dictionaries. Common data structures include hash list structure and tree dictionary structure.

4.1 Hashgar linked list

Figure 7 is a schematic diagram of this dictionary structure. This dictionary structure mainly consists of two parts:

The main part is the hash table, each hash table item holds a pointer, the pointer points to the conflict linked list, in the conflict linked list, words with the same hash value form the linked list structure. There is a conflict linked list because two different words get the same hash value, if so, it is called a conflict in the hash method, and words with the same hash value can be stored in the linked list for later search.

In the process of indexing, the dictionary structure will be constructed accordingly. For example, when parsing a new document, for a word T that appears in the document, first use the hash function to obtain its hash value, and then read the stored pointer according to the hash table item corresponding to the hash value, and find the corresponding conflict linked list. If the word already exists in the conflict list, the word has already appeared in the previously parsed document. If the word is not found in the conflict list, it means that the word is encountered for the first time, then add it to the conflict list. In this way, when all the documents in the document collection are parsed, the corresponding dictionary structure is established.

When responding to a user's query request, the process is similar to creating a dictionary, except that even if a word does not appear in the dictionary, it will not be added to the dictionary. Taking figure 7 as an example, assuming that the query request entered by the user is word 3, hash the word and locate it to slot 2 in the hash table, and the conflict chain list can be obtained from the reserved pointer. Compare word 3 with the words in the conflict chain list in turn, and find that word 3 is in the conflict chain list. Then you can read out the inverted list corresponding to this word for follow-up work. If the word is not found, no document in the document collection contains the word, then the search result is empty.

4.2 Tree structure

The B-tree (or B + tree) is another efficient search structure, and figure 8 is a schematic diagram of the B-tree structure. B-tree is different from hash search, which requires dictionary entries to be sorted by size (number or character order), while hash does not need data to meet this requirement.

The B-tree forms a hierarchical search structure, the intermediate node is used to indicate which subtree the dictionary items in a certain order range are stored in, and plays the role of navigation according to the comparative size of the dictionary items. The lowest leaf node stores the address information of the word, and the word string can be extracted according to this address.

Figure 8 B-tree lookup structure

Summary

Word ID: record the word number of each word

Words: corresponding words

Document frequency: represents how many documents in the document collection contain a word

Inverted list: contains the word ID and other necessary information

DocId: id of the document in which the word appears

TF: the number of times a word appears in a document

POS: where the word appears in the document

Take the word "join" as an example, the word number is 6 and the document frequency is 3, which means that three documents in the whole document collection contain the word, and the corresponding inverted list is {(2 / 1;), (3 / 1;), (5 / 1). )}, the meaning is that the word appeared in document 2recover3, 5, and once in each document, the word "join" in the POS of the first document is 4, that is, the fourth word of the document is "join", and others are similar.

This inverted index is already a very complete index system, and the index structure of the actual search system is basically like this.

At this point, I believe you have a deeper understanding of "what is the inverted index of search engine". You might as well do it in practice. Here is the website, more related content can enter the relevant channels to inquire, follow us, continue to learn!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report