
How to Generate a Word Cloud with Python Code

2025-01-15 Update From: SLTechnology News&Howtos


This article explains in detail how to generate a word cloud with Python code. It is shared here for reference, and should give you a working understanding of the topic.

What is a word cloud?

A word cloud is a visual presentation that highlights the high-frequency "keywords" in text data, rendering them as a colorful, cloud-like picture so that the main ideas of the text can be grasped at a glance.

Nowadays word clouds of all kinds can be found on the Internet. The following picture comes from teacher Shen Hao's Weibo:

You can also find more well-made word clouds on Baidu Images. Some screenshots are as follows:

There are many tools for making word clouds

Technically, a word cloud is an interesting form of data visualization, and there are many off-the-shelf tools on the Internet:

Wordle is a toy for generating word cloud images from text.

Tagxedo can make personalized word clouds online.

Tagul is a Web service that can also create gorgeous word clouds.

Tagcrowd can also take the URL of a web page and generate the word cloud of that page directly.

...

Ten lines of code

But as a veteran programmer, I still prefer to generate my own word clouds with my own code. Is it complicated? Does it take long? Many articles describe all sorts of methods, but in fact it takes only 10 lines of Python code.

    import matplotlib.pyplot as plt
    from wordcloud import WordCloud
    import jieba
    text_from_file_with_apath = open('/Users/hecom/23tips.txt').read()
    wordlist_after_jieba = jieba.cut(text_from_file_with_apath, cut_all=True)
    wl_space_split = " ".join(wordlist_after_jieba)
    my_wordcloud = WordCloud().generate(wl_space_split)
    plt.imshow(my_wordcloud)
    plt.axis("off")
    plt.show()

That's all. The generated word cloud looks like this:

Reading these 10 lines of code:

Lines 1-3 import the plotting library matplotlib, the word cloud generation library wordcloud, and the word segmentation library jieba, respectively.

Line 4 reads the local file. The text used here is this official account's article "Two or Three Things about R&D Management in Lao Cao's Eyes".

Lines 5-6 use jieba for word segmentation and join the resulting words with spaces.

Line 7 generates the word cloud from the segmented text.

Lines 8-10 use pyplot to display the word cloud picture.

This is one of the reasons I like Python: it is concise and lively.

Execution environment

If these ten lines of code do not run, check your execution environment. For a complete development and learning environment, you can refer to this official account's article "Development Learning Environment in Lao Cao's Eyes". For Python-oriented data analysis, if you like Anaconda, you can download and install it from https://www.continuum.io/downloads/. The interface after a successful installation is as follows:

Anaconda is a boon for Python data enthusiasts.

Installing the wordcloud and jieba libraries is equally easy:

    pip install wordcloud
    pip install jieba

I ran into one small pitfall. When I first ran these ten lines of code, only a number of small colored rectangles were displayed and no Chinese words appeared. I suspected the dreaded UTF-8 problem, but debugging showed that printing the jieba segmentation results displayed the Chinese correctly, so the problem lay in the font that wordcloud uses to render the words. The advantage of open source is that you can go straight to the source code of wordcloud.py and find the code related to the font:

    FONT_PATH = os.environ.get("FONT_PATH", os.path.join(os.path.dirname(__file__), "DroidSansMono.ttf"))

wordcloud uses the DroidSansMono.ttf font by default. Change it to a TTF font that supports Chinese and rerun the ten lines of code. Of course, there are more elegant ways than editing the library source.
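A less invasive alternative to editing wordcloud.py is to hand the font to the library from your own code. The sketch below assumes the font_path argument of the WordCloud constructor; the candidate font paths and both helper names are hypothetical and only for illustration, so replace them with fonts actually installed on your machine. Since the source line above reads the FONT_PATH environment variable, setting that variable before importing wordcloud should work as well.

```python
import os

# Hypothetical candidate paths; replace them with CJK-capable fonts on your machine.
CANDIDATE_FONTS = [
    "/System/Library/Fonts/PingFang.ttc",
    "/usr/share/fonts/truetype/wqy/wqy-microhei.ttc",
    "C:/Windows/Fonts/simhei.ttf",
]


def first_existing_font(candidates=CANDIDATE_FONTS):
    """Return the first candidate path that exists as a file, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def make_chinese_wordcloud(segmented_text, font_path=None):
    """Build a word cloud with an explicit font instead of patching wordcloud.py."""
    # Imported lazily so the font lookup above works without wordcloud installed.
    from wordcloud import WordCloud

    font_path = font_path or first_existing_font()
    if font_path is None:
        raise FileNotFoundError("no CJK-capable font found; pass font_path explicitly")
    return WordCloud(font_path=font_path).generate(segmented_text)
```

In the ten-line script, this would replace WordCloud().generate(wl_space_split) with make_chinese_wordcloud(wl_space_split).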

Looking at the source code

Once inside the source code, it is hard not to be curious about how wordcloud is implemented.

wordcloud.py is no more than 600 lines in total, with plenty of comments, so it is easy to read. It uses a number of libraries: the common random, os, sys, and re (regular expressions), the lovely numpy, and PIL for drawing. Some readers will probably run into the usual pitfalls when installing PIL.

The principle of generating a word cloud is not complicated. It can be divided into five steps:

Segment the text data into words, which is also the first step of much NLP text processing. In wordcloud, the process_text() method mainly handles stop words.

Calculate the frequency of each word in the text and build a hash table. Word frequency counting is the equivalent of wordcount, the first example on every distributed computing platform, and it has the same status as the hello world program in every language.

Generate the layout of the picture in proportion to the word frequencies. The IntegralOccupancyMap class is where the word cloud's layout algorithm lives, and it is the core of the word cloud's data visualization.

Draw the words onto the layout according to their frequencies. The core method is generate_from_frequencies: both generate() and generate_from_text() end up calling generate_from_frequencies.

Color each word in the word cloud. By default the colors are random.

Most enhancements to the word cloud can be made through the WordCloud constructor, which provides 22 parameters and can be extended.
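To make the counting and layout steps concrete, here is a minimal plain-Python sketch. It is not the code in wordcloud.py: the stop-word list and all function names are hypothetical, and the occupancy check only illustrates the integral-image (summed-area table) idea that the name IntegralOccupancyMap suggests, which is how a layout can cheaply test whether a candidate rectangle is still empty.

```python
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOPWORDS = {"the", "is", "a", "of"}


def word_frequencies(tokens, stopwords=STOPWORDS):
    """Drop stop words, then count each remaining token (the hash table)."""
    return Counter(t for t in tokens if t and t not in stopwords)


def integral_image(grid):
    """Summed-area table: sat[r][c] is the number of occupied cells in grid[:r][:c]."""
    rows, cols = len(grid), len(grid[0])
    sat = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            sat[r + 1][c + 1] = (grid[r][c] + sat[r][c + 1]
                                 + sat[r + 1][c] - sat[r][c])
    return sat


def is_free(sat, top, left, height, width):
    """True if the height x width box whose corner is (top, left) is unoccupied."""
    bottom, right = top + height, left + width
    occupied = (sat[bottom][right] - sat[top][right]
                - sat[bottom][left] + sat[top][left])
    return occupied == 0
```

With the table built once, each "does this word fit here?" query costs four lookups regardless of the size of the box, which is why this structure suits placing many words of different sizes.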

More small examples

Here is a word cloud of a text in quasi-classical Chinese. The text is an old article from this official account last year, "Wife". Several parameters for canvas size and font size are passed to the constructor:

    width=800, height=400, max_font_size=84, min_font_size=16
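In context, these keyword arguments go into the WordCloud constructor. The following is a small sketch under that assumption; the helper names merged_options and build_wordcloud are hypothetical, and the four values are the ones quoted above.

```python
# Size settings quoted in the text; the dictionary name is hypothetical.
SIZE_OPTIONS = {
    "width": 800,
    "height": 400,
    "max_font_size": 84,
    "min_font_size": 16,
}


def merged_options(**overrides):
    """Return the size settings with any per-call overrides applied."""
    options = dict(SIZE_OPTIONS)
    options.update(overrides)
    return options


def build_wordcloud(segmented_text, **overrides):
    """Create a WordCloud from space-separated words using the settings above."""
    # Imported lazily so the option handling can be tried without wordcloud installed.
    from wordcloud import WordCloud

    return WordCloud(**merged_options(**overrides)).generate(segmented_text)
```

For Chinese text, a font_path override pointing at a font that supports Chinese is still needed, for example build_wordcloud(wl_space_split, font_path=...).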

This produced a word cloud like this:

I am ashamed to say that the picture conveys neither the flavor of classical Chinese nor the feelings for my wife. Perhaps the words fall short, or perhaps this is a limitation of word clouds.

A rectangular word cloud is rather plain. Filling the shape of a picture with the words is much more interesting, and wordcloud supports this through its mask parameter. Using a picture of myself and the text of the article "Talk Again", the resulting word cloud is as follows:

It is still hard to make out the outline of the portrait, but fortunately that hides the ugliness. Three lines of code were added:

    from PIL import Image
    import numpy as np
    abel_mask = np.array(Image.open("/Users/hecom/chw.png"))

When constructing the WordCloud, just pass in the mask:

    background_color="black", mask=abel_mask
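Put together, the mask version might look like the sketch below. It assumes, as I understand the wordcloud documentation, that pure white pixels of the mask are treated as masked out and every other pixel may be drawn on. The function names are hypothetical; drawable_fraction is only a quick sanity check on a grayscale image given as nested lists, since a portrait without a white background leaves almost the whole rectangle drawable and the outline disappears.

```python
def drawable_fraction(gray_rows, white=255):
    """Fraction of pixels that are not pure white, i.e. where words may be drawn."""
    total = sum(len(row) for row in gray_rows)
    drawable = sum(1 for row in gray_rows for value in row if value != white)
    return drawable / total


def masked_wordcloud(segmented_text, mask_path, font_path=None):
    """Build a word cloud that fills the non-white area of the image at mask_path."""
    # Imported lazily so drawable_fraction can be used without these packages.
    import numpy as np
    from PIL import Image
    from wordcloud import WordCloud

    mask = np.array(Image.open(mask_path))
    cloud = WordCloud(background_color="black", mask=mask, font_path=font_path)
    return cloud.generate(segmented_text)
```

When a mask is given, the shape of the mask image determines the size of the output, so width and height no longer need to be passed.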

The word cloud pictures I made myself are still rather crude. A prototype is easy; a good product is hard. Making a truly beautiful word cloud still takes work on many details.

For example:

Word segmentation: should a meaningless word such as "is" appear in the word cloud at all?

Which keywords should be deliberately chosen for display?

How do you choose a suitable font?

How can the words be colored in a better, self-defined way?

Picture preprocessing: how can the word cloud preserve the main features of the original picture?

...
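Two of these details, stop words and custom coloring, can be approached through the library itself. The sketch below assumes the stopwords constructor argument, the STOPWORDS set exported by wordcloud, and the recolor(color_func=...) method; the extra stop words, the grey color scheme, and the function names are hypothetical choices for illustration.

```python
# Hypothetical extra stop words; extend with whatever is meaningless in your text.
EXTRA_STOPWORDS = {"是", "的", "了"}


def grey_color_func(word, font_size, position, orientation,
                    random_state=None, **kwargs):
    """Color larger (more frequent) words in a lighter grey."""
    lightness = max(40, min(100, 40 + int(font_size) // 2))
    return "hsl(0, 0%, {}%)".format(lightness)


def styled_wordcloud(segmented_text, font_path=None):
    """Filter stop words during generation, then recolor the finished layout."""
    # Imported lazily so grey_color_func can be tried without wordcloud installed.
    from wordcloud import STOPWORDS, WordCloud

    cloud = WordCloud(font_path=font_path,
                      stopwords=STOPWORDS | EXTRA_STOPWORDS)
    cloud.generate(segmented_text)
    return cloud.recolor(color_func=grey_color_func)
```

Recoloring reuses the existing layout, so different color schemes can be compared without recomputing the word positions.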

Behind the word cloud

Behind the word cloud is in fact a typical data integration and processing workflow, known as the 6Cs, as shown in the following figure:

Connect: the goal is to acquire data from a variety of data sources. For each source, consider its API, input format, data collection rate, and provider limits.

Correct: focus on transforming the data for further processing, while ensuring that the quality and consistency of the data are maintained.

Collect: decide where the data is stored and in what format, to facilitate the later stages of composition and consumption.

Compose: focus on how to mash up the various data sets that have been collected and enrich the information, in order to build a successful data-driven product.

Consume: focus on the use and rendering of the data, and on how the right data achieves the right results at the right time.

Control: a sixth, additional step that becomes necessary as the data, the organization, and the participants grow. It ensures control over the data.


The data behind the word cloud built with these ten lines of code was not fetched directly from the official account (wireless_com) through an API. Simplification and abstraction are typical engineering approaches: here the collection process was a copy and paste, the correct step was omitted altogether, and the data was stored directly in a plain text file. Generating the visual image with the word cloud is the consume step, and the jieba word segmentation that precedes it is the compose step. Organizing the word clouds you generate into different file directories for easy retrieval can be regarded as a rudimentary form of control.
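As a small illustration of that rudimentary control step, the sketch below files each generated picture under a per-month directory. The directory layout and function names are hypothetical; the only library call assumed is the to_file method of a WordCloud object.

```python
from datetime import date
from pathlib import Path


def output_path(base_dir, source_name, day):
    """Build base_dir/YYYY-MM/source_name.png (a hypothetical layout)."""
    return Path(base_dir) / day.strftime("%Y-%m") / (source_name + ".png")


def save_wordcloud(cloud, base_dir, source_name, day=None):
    """Write the word cloud image into its per-month directory and return the path."""
    path = output_path(base_dir, source_name, day or date.today())
    path.parent.mkdir(parents=True, exist_ok=True)
    cloud.to_file(str(path))
    return path
```

For the ten-line script, save_wordcloud(my_wordcloud, "clouds", "23tips") would file the picture by the month in which it was generated.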

That is all on how to generate a word cloud with Python code. I hope the content above is of some help to you.
