"from Lucene to Elasticsearch: full-text Retrieval practice" Learning Notes 5 05/09 Update SLTechnology News&Howtos

"from Lucene to Elasticsearch: full-text Retrieval practice" Learning Notes 5

2025-05-09 Update From: SLTechnology News&Howtos

Today I'm going to tell you about tf-idf weight calculation.

Tf-idf weight calculation:

Tf-idf (term frequency-inverse document frequency) measures how important a term is to a document within a document set or corpus. The importance of a term increases with the number of times it appears in the document and decreases with how often it appears across the document set. If a term appears very frequently in a document, it is likely important to that document; if it also appears frequently in many other documents, it is probably a common word and should carry less weight.

Tf is the term frequency. To compute the frequency of a given term, count the number of times it appears in the document. If the word "football" appears three times in a 3000-word document, it is hard to conclude that the article is about football; but if "football" appears three times in a Weibo post, which is far shorter, it is fairly safe to conclude that the post is about football. To reduce the influence of document length, the term frequency needs to be normalized. The calculation formula is as follows.
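A standard length-normalized form that matches this description divides the raw count by the document's total term count. The notation is an assumption, since the notes do not fix one: n_{t,d} is the number of times term t appears in document d, and the denominator sums the counts of all terms in d.

```latex
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}
```

Under this form, three occurrences in a 3000-term document give 3/3000 = 0.001, while three occurrences in a much shorter text give a proportionally larger value.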

There is more than one way to normalize term frequency, and Lucene uses a different method:
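In Lucene's classic TF-IDF similarity (`ClassicSimilarity`), term frequency is the square root of the raw count, tf = sqrt(freq). That this is the formula the notes refer to is an assumption. The minimal Python sketch below compares it with length normalization; the function names and the 30-term post length are hypothetical, chosen only for illustration.

```python
import math


def tf_length_normalized(count, doc_length):
    """Raw count divided by the document's total term count."""
    return count / doc_length


def tf_sqrt(count):
    """Square root of the raw count (the form used by Lucene's ClassicSimilarity)."""
    return math.sqrt(count)


# "football" appears 3 times in a 3000-term article and in a 30-term post.
article_tf = tf_length_normalized(3, 3000)
post_tf = tf_length_normalized(3, 30)  # 30 is a made-up length for illustration
sqrt_tf = tf_sqrt(3)                   # depends only on the count, not the length

print(article_tf, post_tf, sqrt_tf)
```

The square-root form dampens large counts but ignores document length on its own; Lucene's classic similarity accounts for length separately through a length norm.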

The document frequency df is the number of documents that contain the specified term. Because df is usually large, it is mapped to a smaller range of values, expressed as the inverse document frequency (idf):
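A common form consistent with this description, including the 1 added to the denominator, is sketched here. The symbol N (total number of documents in the set) and the unspecified logarithm base are assumptions.

```latex
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t) + 1}
```

For example, with N = 1000 and a natural logarithm, a term found in 9 documents gets idf = ln(1000 / 10) = ln 100 ≈ 4.61, while a term found in 499 documents gets ln(1000 / 500) = ln 2 ≈ 0.69.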

The formula shows that the larger the denominator, the more common the term and the smaller its inverse document frequency. Adding 1 to the denominator is a smoothing step that prevents division by zero when no document contains the term. The weight of a term is given by tf-idf, calculated as follows:
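The weight is the product of the two quantities. The notation tf(t, d) for the term frequency of term t in document d and idf(t) for its inverse document frequency is assumed here.

```latex
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)
```

A term therefore scores highly only when it is frequent in the document and rare across the document set.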

With tf-idf, each document can be represented as an n-dimensional vector of term weights, with one dimension per term.
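As an illustration, the sketch below builds such vectors for a tiny made-up corpus. It assumes length-normalized term frequency (count divided by document length) and idf = log(N / (df + 1)) with a natural logarithm; the function name `tfidf_vectors` and the sample documents are hypothetical, and this is not how Lucene or Elasticsearch compute scores internally.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of tokenized documents.

    Each vector has one tf-idf weight per vocabulary term, in vocabulary order.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vector = []
        for term in vocab:
            tf = counts[term] / total if total else 0.0
            idf = math.log(n_docs / (df[term] + 1))  # +1 smoothing in the denominator
            vector.append(tf * idf)
        vectors.append(vector)
    return vocab, vectors


docs = [
    "football match football goal".split(),
    "stock market news".split(),
    "market news today".split(),
    "weather news today".split(),
]
vocab, vectors = tfidf_vectors(docs)
for vector in vectors:
    print([round(weight, 3) for weight in vector])
```

With this smoothing, a term that appears in all but one document gets idf = log(1) = 0 ("news" here), and a term that appears in every document gets a slightly negative weight, so very common terms contribute little or nothing to the vector.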

Author: Ke Zhimeng

Source: CSDN

Original: https://blog.csdn.net/yin4302008/article/details/86104662

Copyright notice: this is an original article by the blogger; please include a link to the blog post when reprinting.
