This article walks through the process of making a dynamic word frequency bar chart with Python. The content is fairly detailed; interested readers can use it as a reference, and I hope it will be helpful to you.
Preface
I believe the topic of "data visualization" is no stranger to you. On some platforms, you can often see videos of dynamic bar charts, mostly about changes in a country's GDP or in the number of infections across countries, and so on.
In this article, we will use Python to draw a dynamic word frequency bar chart, which, as the name implies, is a dynamic bar chart with word frequency as a quantitative indicator.
Preparation
Install the necessary libraries by entering the following command:
pip install JianshuResearchTools
pip install jieba
pip install pandas
pip install bar_chart_race
Selection and acquisition of data
The data we use this time is the income ranking of Jianshu articles, with dates ranging from June 20, 2020 to September 17, 2021.
Parsing this data out of the web pages is rather involved, so it is handled by JianshuResearchTools, a data science library for the Jianshu platform.
To facilitate debugging, we use Jupyter Notebook for interactive development.
Import JianshuResearchTools and set an alias for it:
import JianshuResearchTools as jrt
Call the API to obtain the data of September 17, 2021:
jrt.rank.GetArticleFPRankData("20210917")
The data returned is as follows:
[{'ranking': 0, 'aslug': 'a03adf9d5dd5', 'title': 'Lucky that your heart is like my heart', 'author_name': 'Wild goose array chills', 'author_avatar_url': 'https://upload.jianshu.io/users/upload_avatars/26225608/682b892e-6661-4f98-9aab-20b4038a433b.jpg', 'fp_to_author': 3123.148, 'fp_to_voter': 3123.148, 'total_fp': 6246.297},
{'ranking': 1, 'aslug': '56f7fe236842', 'title': 'Scar', 'author_name': 'Li Wending', 'author_avatar_url': 'https://upload.jianshu.io/users/upload_avatars/26726969/058e18c4-908f-4710-8df7-1d34d05d61e3.jpg', 'fp_to_author': 1562.198, 'fp_to_voter': 1562.198, 'total_fp': 3124.397}, (omitted below)
As you can see, the returned data contains each article's ranking, title, author name, a link to the author's avatar, and some information about the article's Jianshu assets (FP).
We only need the title of the article for statistics, so we assign the data obtained above to the variable raw_data, and then:
[item ["title"] for item in raw_data]
Using a list comprehension, we get a list of article titles.
To facilitate processing, we concatenate the data, separated by spaces:
"" .join ([item ["title"] for item in raw_data])
But we came across an error:
TypeError: sequence item 56: expected str instance, NoneType found
From the error message, we can see that there is an empty value in the article title list, which causes the string concatenation to fail.
(The null values appear because the authors deleted those articles.)
So we need to add logic to remove null values; the code looks like this:
"" .join (filter (None, [item ["title"] for item in raw_data]))
When its first argument is None, the filter function removes all falsy values, including None, from the iterable.
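For example, here is a quick sanity check (my own illustration, not part of the original script):

# filter(None, iterable) keeps only truthy items, dropping the None entries
titles = ["Scar", None, "Lucky"]
print(list(filter(None, titles)))  # prints ['Scar', 'Lucky']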
Now the data we have obtained is as follows:
Fortunately, your heart is like my heart scars short story | A Sheng "my favorite Friends" selection | Council Mid-Autumn Festival Carnival, waiting for you to come! There is no need to ask whether it is predestined or not, Shi Huo is a fan of professional diaries in poor years | from honeymoon to stranger: something about me and the American foreign teacher Red Mansion | on the impression of a stubborn city at the beginning of A Dream of Red Mansions | the love between leopards and dogs in gossip city ends in the war between people and animals (omitted below)
Next, we need to get all the data within the time frame.
Consulting the JRT function documentation shows that the function takes a string parameter in "YYYYMMDD" format, representing the date of the target data.
So we need to write a program to generate these date strings, as follows:
from datetime import date, timedelta

def DateStrGenerator():
    start_date = date(2020, 6, 20)
    after = 0
    result = None
    while result != "20210917":
        current_date = start_date + timedelta(days=after)
        result = current_date.strftime(r"%Y%m%d")
        yield result
        after += 1
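As a quick check (my own addition, not in the original post), the first few date strings the generator yields can be inspected like this:

from itertools import islice

# Peek at the first three dates without exhausting the generator
print(list(islice(DateStrGenerator(), 3)))  # ['20200620', '20200621', '20200622']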
Next, we write a piece of code to obtain this data:
from tqdm import tqdm

result = []
for current_date in tqdm(DateStrGenerator(), total=455):
    raw_data = jrt.rank.GetArticleFPRankData(current_date)
    processed_data = " ".join(filter(None, [item["title"] for item in raw_data]))
    result.append({"date": current_date, "data": processed_data})
Here a progress bar is displayed using the tqdm library, which is not required.
Use the pandas library to convert the data we collected into a DataFrame:
import pandas

df = pandas.DataFrame(result)
Word segmentation
We use the jieba library for word segmentation; let's first try it on the first piece of data:
import jieba

jieba.lcut(df["data"][0])
Use Counter from the collections module in the Python standard library for word frequency statistics:
from collections import Counter

Counter(jieba.lcut(df["data"][0]))
Simply draw a bar chart:
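The original post shows the resulting chart as an image. As a minimal sketch of how such a chart could be drawn with matplotlib (my own illustration; the variable names are assumptions, not the author's code):

import matplotlib.pyplot as plt
import jieba
from collections import Counter

# Count the words of the first day's titles and keep the 20 most common ones
word_counts = Counter(jieba.lcut(df["data"][0]))
top_words = word_counts.most_common(20)

plt.figure(figsize=(12, 6))
plt.bar([word for word, _ in top_words], [count for _, count in top_words])
plt.xticks(rotation=45)
plt.show()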
As you can see, spaces and some punctuation marks, along with meaningless words such as "de" (的) and "I" (我), appear very frequently, and we need to weed them out.
We build a txt file that holds these stopwords, then read it with the following code and convert it into a list:
stopwords_list = [item.replace("\n", "") for item in open("stopwords.txt", "r", encoding="utf-8").readlines()]
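As an aside, an equivalent variant using a context manager (my own sketch, not the author's code) closes the file automatically and strips the newlines in one step:

# splitlines() drops the trailing newlines, so no replace() is needed
with open("stopwords.txt", "r", encoding="utf-8") as f:
    stopwords_list = f.read().splitlines()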
Next, we write a function to remove the stopwords. To simplify subsequent processing, we also drop single-character words and words that appear only once:
def process_words_count(count_dict):
    result = {}
    for key, value in count_dict.items():
        if value < 2:
            continue
        if len(key) >= 2 and key not in stopwords_list:
            result[key] = value
    return result
In addition, we use the jieba library's add_word function to add some Jianshu organization names and proper nouns to the dictionary, improving segmentation accuracy. The code is as follows:
keywords_list = [item.replace("\n", "") for item in open("keywords.txt", "r", encoding="utf-8").readlines()]
for item in keywords_list:
    jieba.add_word(item)
After these adjustments, the segmentation results are noticeably better.
Finally, use this code to segment all the data and save the results to another DataFrame:
from datetime import datetime

data_list = []
date_list = []
for _, item in df.iterrows():
    date_list.append(datetime(int(item["date"][0:4]), int(item["date"][4:6]), int(item["date"][6:8])))
    data_list.append(process_words_count(Counter(jieba.lcut(item["data"]))))

processed_df = pandas.DataFrame(data_list, index=date_list)
My final result is a DataFrame with 455 rows and 2087 columns.
Screening and visualization
With so much data, a large part of it is not representative of the overall picture, so we need to filter it.
Use the following code to compute the sum of each column's values, that is, the number of times each keyword appears across the whole dataset, and store it in a row named sum:
# Iterating past the last column raises IndexError, which ends the loop
try:
    result = []
    for i in range(3000):
        result.append(processed_df.iloc[:, i].sum())
except IndexError:
    processed_df.loc["sum"] = result
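Incidentally, pandas can compute the same column sums directly; a one-line equivalent (my own sketch, not the original code) would be:

# sum() aggregates each column and skips NaN values by default
processed_df.loc["sum"] = processed_df.sum()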
Run the following code to keep only keywords that appear at least 300 times in the dataset:
smaller_df = processed_df.T[processed_df.T["sum"] >= 300].T
smaller_df = smaller_df.drop(labels="sum")
smaller_df.columns
The number of columns in the dataset has now been reduced to 24, and we can move on to visualization.
Don't forget to import the module first:
import bar_chart_race as bcr
To use this module, you need to install ffmpeg first; tutorials for that are easy to find online.
In addition, to support Chinese text, we need to open the _make_chart.py file inside this module and add the following two lines of code after the imports:
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
These two lines of code will replace the default font of matplotlib with a font that supports Chinese display.
Finally, complete the visualization with one line of code:
bcr.bar_chart_race(smaller_df, "output.mp4", steps_per_period=30, period_length=1500, title="Visualization of Jianshu income rankings", bar_size=0.8, fixed_max=10, n_bars=10)
In this line of code, we use smaller_df as the dataset, write the output to output.mp4, render 30 frames per period, and display each row of data for 1.5 seconds (period_length is given in milliseconds).
Because of the large amount of data, this step takes a while and uses a fair amount of memory. When the run finishes, the output file can be found in the working directory.
That covers the process of making a dynamic word frequency bar chart with Python. I hope the content above is of some help and that you learned something from it. If you think the article is good, feel free to share it for more people to see.