How to write a small project of word frequency statistics with Python 07/06 Update SLTechnology News&Howtos

How to write a small project of word frequency statistics with Python

2025-07-06 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >

Shulou(Shulou.com)05/31 Report--

How to use Python to write a word frequency statistics project, many novices are not very clear about this, in order to help you solve this problem, the following editor will explain in detail, people with this need can come to learn, I hope you can gain something.

Here we use python to do a little statistics of English word frequency. Of course, write their own, there is no stop words, calculate the weight of words and other functions, purely to write code to practice.

First of all, here is an English article, like the following 185 small paragraphs, the amount of data is still not large, Harry Potter novels seem to have 10W lines, you can find points if you are interested.

Although I installed 2 and 3 versions. Python2 is used here, because Python2 printing does not seem to need to write parentheses, which is more convenient.

Needless to say, there are mainly two scripts, one is word segmentation, the other is statistical word frequency:

one

Word segmentation

Here I use the cmd window command to read one line in turn to form a file stream, processing one line at a time, otherwise you need to get a large list (list).

As shown in the above lines of code, it is very simple to segment English words, just by separating them according to spaces. Unlike Chinese, it also requires a thesaurus and a series of algorithms. And then print it to the console. The words printed in this way are still unordered, and we need to sort them, so that the adjacent words are the same as each other, and we need to sort them, like this:

The cmd window enters the command to execute the script:

Type The_Clock_and_the_Key.txt | python2 splitText.py | sort

The "type" here opens a text file, and the "|" is a pipe: give the left content as an argument to the function on the right.

In this way, each word has one line, which is actually one of the basic functions of hadoop: [sorting].

two

Statistical word frequency

The idea is that if the word currently read is different from the saved word, the word count is over. Because, after the last word is assigned to current_word, there is no comparison (it is already on the last line, and when printing here, you need to print once outside the loop, line 23).

One word after the first script is processed on one line, and the data stream of adjacent words is piped into the script for processing.

The cmd window enters the command to execute the script:

Type The_Clock_and_the_Key.txt | python2 splitText.py | sort | python2 splitText2.py | sort / R

Where sort / R stands for reverse, which is a function.

The windows command line is not very playful, and the final sort is this:

It seems to be sorted according to the dictionary, , that's it! Students with obsessive-compulsive disorder can use lists or dictionaries to call Python's own sort function to sort it out.

Is it helpful for you to read the above content? If you want to know more about the relevant knowledge or read more related articles, please follow the industry information channel, thank you for your support.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.