2025-04-04 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
Today I'm going to tell you about inverted indexes.
Indexing is one of the core technologies behind search engines, and indexes are also common in daily life. For example, when reading a book, I first look at the table of contents, which lets me quickly find the page number of a specific chapter and speeds up looking up the content.
Documents are usually stored in database management systems such as MySQL and Oracle, but search engine data is generally not stored in a database, for two main reasons. First, the volume of data is very large: large search engines need to handle hundreds of millions of web pages, and a database is difficult to manage at that scale. Second, a search engine's operations on its data are relatively simple, and basic insert, delete, update, and query operations are enough, whereas a database supports more complex operations at the cost of speed and space, while a search engine requires fast responses and efficient retrieval. Search engines therefore mainly use inverted indexes to store web page data.
An inverted index, also known as a reverse index, is an indexing method used in full-text search to store a mapping from each word to its locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems.
The following simple example, taken from a book, illustrates the inverted index. There are two documents, doc1 and doc2: doc1 contains three keywords (China, USA, and Korea), and doc2 contains four keywords (UK, China, USA, and Germany). The mapping from documents to words is as follows:
Document    Words
doc1        China, USA, Korea
doc2        UK, China, USA, Germany
Inverting this gives the mapping from each word to the documents that contain it:
Word       Documents
China      doc1, doc2
USA        doc1, doc2
Korea      doc1
UK         doc2
Germany    doc2
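The inversion in the two tables above can be sketched in a few lines of Python. This is only a minimal illustration of the idea, not how a production search engine stores its index; the names `forward_index` and `invert` are made up for this example.

```python
from collections import defaultdict

# Forward mapping from the example: document -> words it contains.
forward_index = {
    "doc1": ["China", "USA", "Korea"],
    "doc2": ["UK", "China", "USA", "Germany"],
}


def invert(forward):
    """Turn a document -> words mapping into a word -> documents mapping."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word in forward[doc_id]:
            # Record each document at most once per word.
            if doc_id not in inverted[word]:
                inverted[word].append(doc_id)
    return dict(inverted)


inverted_index = invert(forward_index)
for word, docs in inverted_index.items():
    print(word, "->", ", ".join(docs))
```

Looking up a word is then a single dictionary access, for example `inverted_index["China"]`, instead of a scan over every document.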
Let's take a closer look at the inverted index using the following table, in which each document is assigned a document ID.
Document ID    Document content
1              Artificial intelligence becomes the focus of the Internet conference
2              Google launches open source artificial intelligence system tools
3              The future of the Internet in artificial intelligence
4              Google open source machine learning tools
The document content must first be tokenized. Unlike English, where words are separated by spaces, Chinese has no explicit separator between words, so the text is passed through a word segmentation system that splits it into terms. For example, document 4 is split into "Google", "open source", "machine", "learning", and "tools". The term "Google" appears once each in document 2 and document 4, so its document frequency is 2 and its inverted record table (posting list) is recorded as 2 -> 4. The document frequency is therefore also the length of the inverted record table. Counting the document frequency and the inverted record table of each term in turn builds the inverted index, part of which is as follows:
Term            Document frequency    Inverted record table
artificial      3                     1 -> 2 -> 3
intelligence    3                     1 -> 2 -> 3
become          1                     1
Internet        2                     1 -> 3
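The counting step can be sketched as follows, assuming the documents have already been segmented into terms. Only the segmentation of document 4 is given in the text; the term lists for documents 1 to 3 are assumptions made for illustration, and `build_inverted_index` is a hypothetical helper name.

```python
from collections import defaultdict

# Pre-segmented documents keyed by document ID.
# Document 4 follows the text; documents 1-3 are assumed segmentations.
segmented_docs = {
    1: ["artificial", "intelligence", "become", "Internet", "conference", "focus"],
    2: ["Google", "launch", "open source", "artificial", "intelligence", "system", "tools"],
    3: ["Internet", "future", "artificial", "intelligence"],
    4: ["Google", "open source", "machine", "learning", "tools"],
}


def build_inverted_index(docs):
    """Return term -> (document frequency, list of document IDs in ascending order)."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        # dict.fromkeys removes repeated terms within a document, keeping order.
        for term in dict.fromkeys(docs[doc_id]):
            postings[term].append(doc_id)
    # The document frequency is simply the length of the list of document IDs.
    return {term: (len(ids), ids) for term, ids in postings.items()}


index = build_inverted_index(segmented_docs)
for term in ["artificial", "intelligence", "become", "Internet", "Google"]:
    df, ids = index[term]
    print(f"{term}\t{df}\t{' -> '.join(map(str, ids))}")
```

Because the documents are visited in ascending ID order, each list comes out sorted without a separate sort step, and "Google" yields a document frequency of 2 with the list 2 -> 4, matching the example in the text.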