How to extract text feature words with TF-IDF

2025-02-28 Update From: SLTechnology News&Howtos

In this issue, the editor explains how to extract text feature words with TF-IDF. The article walks through the idea behind the method and its formulas, and I hope you get something out of it.

01

-

What does TF-IDF mainly do?

TF-IDF is mostly used in text classification. For example, given a message pushed by Sina, the machine has to judge whether it belongs to news, finance, sports, or entertainment. Another example is extracting the keywords of an article pushed by Jinri Toutiao, so that articles matching our taste can be recommended to us.

02

-

The main ideas of TF-IDF

The main idea of TF-IDF is that if a word or phrase appears frequently in one article (a high TF) and rarely appears in other articles (a high IDF), the word or phrase is considered good at distinguishing categories and is suitable for classification.

03

-

What is the full name of TF-IDF?

TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. TF stands for term frequency, and IDF stands for inverse document frequency.

04

-

Why is it called inverse document frequency?

TF, the term frequency in TF-IDF, is easy to understand: it measures how often a word appears in an article. But how should we understand the inverse document frequency?

Take our mother tongue as an example. Words such as "yes" and "we" have a very large TF, but do they help us judge whether an article is about sports or entertainment? They do not, so it is natural to multiply TF by a weighting factor: IDF, the inverse document frequency. For example, suppose the word "Bayes" appears in an article. Looking through the corpus, we find that only 500 of the existing 100 million web pages contain "Bayes", while the word "of" appears in all 100 million. In this case we want the IDF of "Bayes" to be larger than the IDF of "of", that is, "Bayes" should carry more weight. The IDF formula does achieve this effect in the end, as we can see below.

05

-

The mathematical formulas for TF and IDF

Suppose the total number of words in a web page is 100 and the word "Bayes" appears three times. The term frequency of "Bayes" in this document is then 3/100 = 0.03.

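The corresponding mathematical formula:

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

In this formula, $i$ indexes the $i$-th word in the corpus, $j$ is the number of the current web page, $n_{i,j}$ is the number of times word $i$ appears in page $j$, and the denominator is the total number of words in page $j$.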
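The term-frequency calculation above can be sketched in Python as follows. This is only a minimal illustration: the function name `term_frequency` and the 100-word page made of filler tokens are assumptions for the example, not part of the original article.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {term: count / total number of tokens} for one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


# Hypothetical 100-word page in which "bayes" appears three times.
page = ["bayes"] * 3 + ["filler"] * 97
tf = term_frequency(page)
print(tf["bayes"])  # 0.03
```

Dividing by the page length keeps long pages from getting a higher TF simply because they contain more words.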

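When analyzing the 100 million web pages in the corpus, we find that 500 pages contain "Bayes", so the IDF of the word "Bayes" is given by:

$$\mathrm{idf}_i = \log \frac{|D|}{|\{\, j : t_i \in d_j \,\}|}$$

Here $i$ is still the index of the word in the corpus ("Bayes" in this case), $|D|$ is the total number of pages in the corpus, and the size of the set in the denominator is the number of pages that contain the word: 500 of the 100 million pages, as mentioned above. After taking the logarithm, the IDF of "Bayes" is $\log(100{,}000{,}000 / 500) = \log 200{,}000$, which is much larger than the IDF of "of", $\log(100{,}000{,}000 / 100{,}000{,}000) = \log 1 = 0$.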
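To check the numbers, here is a small Python sketch of the IDF calculation for "Bayes" and "of". The article does not say which logarithm base is used, so base 10 is an assumption here, and the function name `inverse_document_frequency` is made up for the example.

```python
import math


def inverse_document_frequency(total_docs, docs_with_term):
    """log10 of (pages in the corpus / pages containing the term)."""
    # Base 10 is an assumption; another base only rescales every IDF value.
    return math.log10(total_docs / docs_with_term)


# "Bayes" appears in 500 of 100 million pages, "of" in all of them.
idf_bayes = inverse_document_frequency(100_000_000, 500)
idf_of = inverse_document_frequency(100_000_000, 100_000_000)

print(round(idf_bayes, 2))  # 5.3
print(idf_of)               # 0.0
```

A word that appears in every page gets an IDF of 0, so its TF-IDF weight is 0 no matter how large its TF is.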

06

-

Putting it together

Multiplying the two gives the TF-IDF weight:

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$$

The effect achieved by this formula:

A high frequency of a word in a particular document, combined with a low frequency of that word across the whole document collection, produces a high TF-IDF weight.

Common words, such as "yes", "we" and "eat", are filtered out.

Finally, the important words in an article are extracted.
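As a rough end-to-end illustration, the sketch below multiplies TF by IDF over a tiny made-up corpus and picks the highest-weighted word of one document. The three sample sentences, the whitespace tokenization, the base-10 logarithm, and the names `tf_idf` and `top_terms` are all assumptions for the example. A real corpus would need proper word segmentation.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs is a list of token lists; returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log10(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def top_terms(weight_by_term, k=2):
    """Return the k terms with the largest TF-IDF weight."""
    return sorted(weight_by_term, key=weight_by_term.get, reverse=True)[:k]


# Hypothetical three-document corpus, split on whitespace.
docs = [
    "we watch the match and we cheer the team".split(),
    "we read the bayes paper on the bayes classifier".split(),
    "we eat and we talk about the film".split(),
]
weights = tf_idf(docs)

print(top_terms(weights[1], k=1))  # ['bayes']
print(weights[1]["we"])            # 0.0
```

Because "we" and "the" occur in every document, their IDF is 0 and they drop out, while "bayes", which is frequent in one document and absent from the others, gets the highest weight.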

That is how TF-IDF extracts text feature words. If you happen to have had similar questions, the analysis above may help you understand it.
