How to Implement Full-Text Search in JavaScript

2025-03-28 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)05/31 Report--

Many readers are unfamiliar with the topic of this article, "How to Implement Full-Text Search in JavaScript", so it lays the material out in detail, with clear steps, and should have some reference value. I hope you get something out of it. Let's take a look.

Relevance

For each search query, it is easy to define a "relevance score" for each document. When users search, we can sort by relevance score instead of by the document's creation time. This way, the most relevant documents are listed first, no matter how long ago they were created (though sometimes creation time matters as well).

There are many ways to calculate the relevance between a query and a document, but let's start with the simplest, statistics-based method. This method needs no understanding of the language itself; it determines a "relevance score" by counting word usage and matches, and by weighting terms according to how common they are across the documents.

This algorithm does not care whether a word is a noun or a verb, nor does it care about the word's meaning. All it cares about is which words are common and which are rare. If a search query contains both common words and rare words, documents containing the rare words score higher, and the weight of the common words is reduced.

This algorithm is called Okapi BM25. It builds on two basic concepts: term frequency, abbreviated "TF", and inverse document frequency, abbreviated "IDF". Put together, they form "TF-IDF", a statistical measure of how important a word (term) is within a document.

TF-IDF

Term frequency (Term Frequency), or "TF" for short, is a very simple measure: the number of times a particular word appears in a document. You can divide this value by the total number of words in the document to get a score. For example, if a document has 100 words and the word "the" appears 8 times, then the TF of "the" is 8, or 8/100, or 8% (depending on how you want to express it).
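As a minimal sketch (the function name is mine, not from the article), the TF calculation above can be written as:

```javascript
// Term frequency (TF): occurrences of `term` in the document's token list,
// divided by the total number of tokens in the document.
function termFrequency(term, tokens) {
  var count = 0;
  for (var i = 0; i < tokens.length; i++) {
    if (tokens[i] === term) { count++; }
  }
  return count / tokens.length;
}
```

For the 100-word example above, `termFrequency('the', tokens)` would return 0.08, i.e. 8%.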

Inverse document frequency (Inverse Document Frequency), or "IDF" for short, is a bit more involved: the rarer a word, the higher its value. It is computed by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of the resulting quotient. The rarer the word, the higher its "IDF".
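A corresponding sketch of IDF, using base-10 logarithms to match the code later in the article (the function name is illustrative):

```javascript
// Inverse document frequency (IDF): log of the total number of documents
// divided by the number of documents containing the term.
// Rarer terms get higher values.
function inverseDocumentFrequency(totalDocuments, docsContainingTerm) {
  return Math.log10(totalDocuments / docsContainingTerm);
}
```

A term found in 10 of 1,000 documents gets log10(100) = 2, while a term found in every document gets log10(1) = 0.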

If you multiply these two numbers together (TF * IDF), you get the weight of a word in a document. "Weight" is defined as: how rare is the word, and how often does it appear in this document?

You can apply this concept to search queries over documents. For each keyword in the query, calculate its TF-IDF score and add them all up. The document with the highest score is the best match for the query.
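Putting the two together, here is a naive sketch of query scoring with plain TF-IDF (all names are illustrative, and the corpus is represented simply as an array of token arrays):

```javascript
// Score one document against a query: sum TF * IDF over the query terms.
// `corpus` is an array of token arrays; `docTokens` is the document to score
// and is assumed to be a member of `corpus`.
function tfIdfScore(queryTerms, docTokens, corpus) {
  var score = 0;
  for (var i = 0; i < queryTerms.length; i++) {
    var term = queryTerms[i];
    // TF: how often the term occurs in this document.
    var count = docTokens.filter(function (t) { return t === term; }).length;
    if (count === 0) { continue; } // absent term contributes nothing
    var tf = count / docTokens.length;
    // IDF: how rare the term is across the corpus.
    var containing = corpus.filter(function (d) {
      return d.indexOf(term) !== -1;
    }).length;
    score += tf * Math.log10(corpus.length / containing);
  }
  return score;
}
```

To rank results, compute this score for every document and sort descending, exactly as described above.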

Okapi BM25

The algorithm above is serviceable, but not perfect. It gives a purely statistics-based relevance score, and we can improve on it further.

Okapi BM25 is considered one of the most advanced ranking algorithms to date (hence its use in ElasticSearch). On top of TF-IDF, Okapi BM25 adds two tunable parameters, k1 and b, which govern "term frequency saturation" and "field-length normalization" respectively. What on earth are those?

To get an intuition for "term frequency saturation", imagine two articles about baseball of roughly the same length. Assume also that the rest of the corpus has little baseball-related content, so the word "baseball" has a high IDF: it is rare and important. Both articles are about baseball and both spend a lot of time on it, but one uses the word "baseball" more than the other. Is the other article really much worse in that case? Since both documents are devoted to baseball, it should make no difference whether the word "baseball" appears 40 times or 80 times. In fact, its contribution ought to be capped after, say, 30 occurrences!

This is term frequency saturation. The native TF-IDF algorithm has no notion of saturation, so a document in which "baseball" appears 80 times gets twice the score of one in which it appears 40 times. Sometimes that is what we want; sometimes it is not.
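The doubling is easy to see concretely: with raw term frequency, 80 occurrences in a document of fixed length scores exactly twice 40 occurrences (a tiny sketch with an illustrative name):

```javascript
// Raw term frequency has no saturation: the score grows linearly with
// the occurrence count, so double the occurrences means double the score.
function rawTf(count, docLength) {
  return count / docLength;
}
```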

To control this, Okapi BM25 has the k1 parameter, which adjusts how quickly saturation sets in. Its value is generally between 1.2 and 2.0: the lower the value, the faster saturation occurs. (With fast saturation, the two documents above end up with nearly the same score, because both contain the word "baseball" many times.)

Field-length normalization (Field-length normalization) scales a document's length relative to the average length of all documents. It is useful for single-field collections (single-field collections) like ours, putting documents of different lengths on the same footing for comparison. It makes even more sense for two-field collections such as "title" and "body", where it can normalize the title and body fields to the same scale. Field-length normalization is controlled by b, whose value lies between 0 and 1: 1 means full normalization, 0 means none.
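A sketch of BM25's term-frequency component (the same formula that appears in the search code later in the article) shows both knobs at once: k1 saturates repeated terms, and b penalizes longer-than-average documents. The function name is mine.

```javascript
// BM25's replacement for raw TF:
//   TF * (k1 + 1) / (TF + k1 * (1 - b + b * docLength / avgDocLength))
// k1 controls how quickly repeated terms saturate (the value can never
// exceed k1 + 1); b controls how strongly document length is normalized
// against the corpus average.
function bm25TfComponent(termCount, docLength, avgDocLength, k1, b) {
  var lengthNorm = 1 - b + b * (docLength / avgDocLength);
  return (termCount * (k1 + 1)) / (termCount + k1 * lengthNorm);
}
```

With k1 = 1.3 and two average-length articles, 80 occurrences of "baseball" score only slightly higher than 40 occurrences, instead of twice as high; and with b = 1, the same occurrence count in a document twice the average length scores noticeably lower than in a shorter one.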

You can look up the formula of the Okapi BM25 algorithm on its Wikipedia page. Now that we know what every term in the formula means, it should be easy to understand, so instead of dwelling on the formula, we go straight to the code:

```javascript
BM25.Tokenize = function (text) {
    // stemmer() is a Porter stemmer implementation and stopStems is a list
    // of stemmed stop words; both are assumed to be defined elsewhere.
    text = text
        .toLowerCase()
        .replace(/\W/g, ' ')
        .replace(/\s+/g, ' ')
        .trim()
        .split(' ')
        .map(function (a) { return stemmer(a); });

    // Filter out stop stems (very common words)
    var out = [];
    for (var i = 0, len = text.length; i < len; i++) {
        if (stopStems.indexOf(text[i]) === -1) {
            out.push(text[i]);
        }
    }

    return out;
};
```

We define a simple static method, Tokenize, whose purpose is to parse a string into an array of tokens. Along the way we lowercase all tokens (to reduce entropy), run the Porter stemmer algorithm to reduce entropy further and also improve the match rate ("walking" and "walk" match as the same token), and filter out stop words (very common words) to reduce entropy one step more. Please bear with me if I over-explain this section before the concepts get deeper.

```javascript
BM25.prototype.addDocument = function (doc) {
    if (typeof doc.id === 'undefined') { throw new Error(1000, 'ID is a required property of documents.'); }
    if (typeof doc.body === 'undefined') { throw new Error(1001, 'Body is a required property of documents.'); }

    // Raw tokenized list of words
    var tokens = BM25.Tokenize(doc.body);

    // Will hold unique terms and their counts and frequencies
    var _terms = {};

    // docObj will eventually be added to the documents database
    var docObj = { id: doc.id, tokens: tokens, body: doc.body };

    // Count number of terms
    docObj.termCount = tokens.length;

    // Increment totalDocuments
    this.totalDocuments++;

    // Readjust averageDocumentLength
    this.totalDocumentTermLength += docObj.termCount;
    this.averageDocumentLength = this.totalDocumentTermLength / this.totalDocuments;

    // Calculate term frequency
    // First get terms count
    for (var i = 0, len = tokens.length; i < len; i++) {
        var term = tokens[i];
        if (!_terms[term]) {
            _terms[term] = { count: 0, freq: 0 };
        }
        _terms[term].count++;
    }

    // Then re-loop to calculate term frequency.
    // We'll also update inverse document frequencies here.
    var keys = Object.keys(_terms);
    for (var i = 0, len = keys.length; i < len; i++) {
        var term = keys[i];
        // Term Frequency for this document.
        _terms[term].freq = _terms[term].count / docObj.termCount;

        // Inverse Document Frequency initialization
        if (!this.terms[term]) {
            this.terms[term] = {
                n: 0, // Number of docs this term appears in, uniquely
                idf: 0
            };
        }
        this.terms[term].n++;
    }

    // Calculate inverse document frequencies.
    // This is SLOWish so if you want to index a big batch of documents,
    // comment this out and run it once at the end of your addDocuments run.
    // If you're only indexing a document or two at a time you can leave this in.
    // this.updateIdf();

    // Add docObj to docs db
    docObj.terms = _terms;
    this.documents[docObj.id] = docObj;
};
```

This is where the addDocument method works its magic. We essentially build and maintain two parallel data structures: this.documents and this.terms.

this.documents is a database holding all of the documents. It stores each document's full original text, its length, and a list of every term in the document along with each term's count and frequency. With this data structure we can easily and quickly (yes, very quickly, with an O(1) hash-table lookup) answer questions such as: how many times does the word 'walk' appear in document #3?

We also use a second data structure, this.terms, which represents every term in the corpus. Through it, we can answer in O(1) time: in how many documents does 'walk' appear, and what are their ids?

Finally, we record each document's length, as well as the average document length across the entire corpus.

Note that in the code above, idf is initialized to 0 and the updateIdf call is commented out. That is because the method runs very slowly, and it only needs to run once, after the index has been built. Since running it once is enough, there is no reason to run it 5,000 times. Comment it out, run it once after a large batch of indexing operations, and you save a great deal of time. Here is the code for that function:

```javascript
BM25.prototype.updateIdf = function () {
    var keys = Object.keys(this.terms);
    for (var i = 0, len = keys.length; i < len; i++) {
        var term = keys[i];
        var num = (this.totalDocuments - this.terms[term].n + 0.5);
        var denom = (this.terms[term].n + 0.5);
        this.terms[term].idf = Math.max(Math.log10(num / denom), 0.01);
    }
};
```

This is a very simple function, but because it has to walk through every term in the corpus and update each term's value, it works somewhat slowly. It implements the standard formula for inverse document frequency (which you can find on Wikipedia): divide the total number of documents by the number of documents containing the term, then take the logarithm of the quotient. I made a small modification so that the return value is always greater than 0.

```javascript
BM25.prototype.search = function (query) {
    var queryTerms = BM25.Tokenize(query);
    var results = [];

    // Look at each document in turn. There are better ways to do this
    // with inverted indices.
    var keys = Object.keys(this.documents);
    for (var j = 0, nDocs = keys.length; j < nDocs; j++) {
        var id = keys[j];
        // The relevance score for a document is the sum of a tf-idf-like
        // calculation for each query term.
        this.documents[id]._score = 0;

        // Calculate the score for each query term
        for (var i = 0, len = queryTerms.length; i < len; i++) {
            var queryTerm = queryTerms[i];

            // We've never seen this term before so IDF will be 0.
            // Means we can skip the whole term, it adds nothing to the score
            // and isn't in any document.
            if (typeof this.terms[queryTerm] === 'undefined') {
                continue;
            }

            // This term isn't in the document, so the TF portion is 0 and this
            // term contributes nothing to the search score.
            if (typeof this.documents[id].terms[queryTerm] === 'undefined') {
                continue;
            }

            // The term is in the document, let's go.
            // The whole term is:
            // IDF * (TF * (k1 + 1)) / (TF + k1 * (1 - b + b * docLength / avgDocLength))

            // IDF is pre-calculated for the whole docset.
            var idf = this.terms[queryTerm].idf;
            // Numerator of the TF portion.
            var num = this.documents[id].terms[queryTerm].count * (this.k1 + 1);
            // Denominator of the TF portion.
            var denom = this.documents[id].terms[queryTerm].count
                + (this.k1 * (1 - this.b + (this.b * this.documents[id].termCount / this.averageDocumentLength)));

            // Add this query term to the score
            this.documents[id]._score += idf * num / denom;
        }

        if (!isNaN(this.documents[id]._score) && this.documents[id]._score > 0) {
            results.push(this.documents[id]);
        }
    }

    results.sort(function (a, b) { return b._score - a._score; });
    return results.slice(0, 10);
};
```

The search method iterates through all documents, computes each document's BM25 score, and sorts them from largest to smallest. Of course, it is unwise to traverse every document in the corpus during a search; that problem is addressed in Part Two (inverted indexing and performance).

The code above is well commented, and the main points are as follows: calculate the BM25 score for each document and each query term. The idf score of each term has been computed in advance, so you only need to look it up when using it; the term frequencies have also been pre-computed as part of the document properties. After that, only four simple arithmetic operations are needed. Finally, add a temporary variable _score to each document, sort the documents in descending order by score, and return the first 10 results.
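To tie the pieces together end to end, here is a standalone, simplified sketch of the same indexing-and-search flow: no stemming or stop-word filtering, and the MiniBM25 name, constructor, and default k1/b values are mine (the article's full class additionally needs a constructor plus the Porter stemmer and stop-word list wired in).

```javascript
// Simplified, self-contained BM25 index: addDocument() to index,
// updateIdf() once after indexing, then search() to rank.
function MiniBM25() {
  this.k1 = 1.3;
  this.b = 0.75;
  this.documents = {};
  this.terms = {};
  this.totalDocuments = 0;
  this.totalLength = 0;
}

// Lowercase and split on non-word characters (no stemming here).
MiniBM25.tokenize = function (text) {
  return text.toLowerCase().replace(/\W/g, ' ').trim().split(/\s+/);
};

MiniBM25.prototype.addDocument = function (doc) {
  var tokens = MiniBM25.tokenize(doc.body);
  var counts = {};
  tokens.forEach(function (t) { counts[t] = (counts[t] || 0) + 1; });
  this.documents[doc.id] = { id: doc.id, termCount: tokens.length, terms: counts };
  this.totalDocuments++;
  this.totalLength += tokens.length;
  for (var term in counts) {
    if (!this.terms[term]) { this.terms[term] = { n: 0, idf: 0 }; }
    this.terms[term].n++;
  }
};

// Same IDF formula as the article: clamped so it stays above 0.
MiniBM25.prototype.updateIdf = function () {
  for (var term in this.terms) {
    var t = this.terms[term];
    t.idf = Math.max(
      Math.log10((this.totalDocuments - t.n + 0.5) / (t.n + 0.5)), 0.01);
  }
};

MiniBM25.prototype.search = function (query) {
  var avg = this.totalLength / this.totalDocuments;
  var queryTerms = MiniBM25.tokenize(query);
  var results = [];
  for (var id in this.documents) {
    var doc = this.documents[id];
    doc._score = 0;
    for (var i = 0; i < queryTerms.length; i++) {
      var term = queryTerms[i];
      if (!this.terms[term] || !doc.terms[term]) { continue; }
      var tf = doc.terms[term];
      var denom = tf + this.k1 * (1 - this.b + this.b * doc.termCount / avg);
      doc._score += this.terms[term].idf * tf * (this.k1 + 1) / denom;
    }
    if (doc._score > 0) { results.push(doc); }
  }
  results.sort(function (a, b) { return b._score - a._score; });
  return results.slice(0, 10);
};
```

Usage follows the shape described in the article: index a few documents, call updateIdf once, then search; only documents containing at least one query term come back, ranked by score.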

That is all for "how to implement full-text search in JavaScript". I hope the content shared here has been helpful and gives you a working understanding of the topic.
