How Python crawls the Harry Potter novels

2025-03-29 Update From: SLTechnology News&Howtos


This article explains how to crawl the Harry Potter novels with Python and analyze them with the jieba word-segmentation package. The method is simple, fast and practical, so let's walk through it.

First, a brief introduction to the jieba Chinese word-segmentation package. jieba offers three segmentation modes:

Exact mode (the default): segments the text precisely, suitable for text analysis;

Full mode: outputs every fragment that could be a word; fast, but the results are ambiguous;

Search-engine mode: builds on exact mode and splits long words again, suitable for search-engine indexing.

Commonly used jieba statements:

Segmentation: jieba.cut(text, cut_all=False); set cut_all=True for full mode

Load a custom dictionary: jieba.load_userdict(file_name)

Add a word: jieba.add_word(word, freq, tag)

Delete a word: jieba.del_word(word)

Harry Potter is a series of fantasy novels by the British writer J. K. Rowling, which follows Harry Potter's adventures during his seven years at Hogwarts School of Witchcraft and Wizardry. Below is a worked example of jieba in action, using the intricate character relationships in Harry Potter.

# Load the required packages
import numpy as np
import pandas as pd
import jieba, codecs
import jieba.posseg as pseg  # part-of-speech module
from pyecharts import Bar, WordCloud

# Import the name list, stop words and the custom thesaurus
renmings = pd.read_csv('person name.txt', engine='python', encoding='utf-8', names=['renming'])['renming']
stopwords = pd.read_csv('mystopwords.txt', engine='python', encoding='utf-8', names=['stopwords'])['stopwords'].tolist()
book = open('Harry Potter.txt', encoding='utf-8').read()
jieba.load_userdict('Harry Potter thesaurus.txt')

# Define a segmentation function
def words_cut(book):
    words = list(jieba.cut(book))
    stopwords1 = [w for w in words if len(w) == 1]  # treat single characters as stop words
    seg = set(words) - set(stopwords) - set(stopwords1)  # filter stop words for a more precise result
    result = [i for i in words if i in seg]
    return result

# Initial segmentation
bookwords = words_cut(book)
renming = [i.split(' ')[0] for i in set(renmings)]  # keep only the name itself
nameswords = [i for i in bookwords if i in set(renming)]  # keep only character names

# Count word frequencies
bookwords_count = pd.Series(bookwords).value_counts().sort_values(ascending=False)
nameswords_count = pd.Series(nameswords).value_counts().sort_values(ascending=False)
bookwords_count[:100].index

After the initial segmentation most tokens look fine, but a small number of names are not split accurately, such as 'Bully', 'Ron said', 'Flat', 'Sny' and 'Ground said', and some names, such as 'Umbridge' and 'Hogwarts', are missing from the dictionary.

# Customize the dictionary
jieba.add_word('Dumbledore', 100, 'nr')
jieba.add_word('Hogwarts', 100, 'n')
jieba.add_word('Umbridge', 100, 'nr')
jieba.add_word('La Tonkes', 100, 'nr')
jieba.add_word('Voldemort', 100, 'nr')
jieba.del_word('Ron said')
jieba.del_word('ground said')
jieba.del_word('Snei')

# Segment again
bookwords = words_cut(book)
nameswords = [i for i in bookwords if i in set(renming)]
bookwords_count = pd.Series(bookwords).value_counts().sort_values(ascending=False)
nameswords_count = pd.Series(nameswords).value_counts().sort_values(ascending=False)
bookwords_count[:100].index

After the second segmentation the errors from the first pass have been corrected, so we can move on to the statistical analysis.

# Bar chart of the TOP15 most frequent words
bar = Bar('Most frequent words TOP15', background_color='white', title_pos='center', title_text_size=20)
x = bookwords_count[:15].index.tolist()
y = bookwords_count[:15].values.tolist()
bar.add('', x, y, xaxis_interval=0, xaxis_rotate=30, is_label_show=True)
bar

Harry, Hermione, Ron, Dumbledore, wand, magic, Malfoy, Snape and Sirius are among the most frequent words in the book.

Stringing these words together already gives a rough summary of the plot: Harry, accompanied by his friends Hermione and Ron, and helped and trained by the great wizard Dumbledore, uses his wand and magic to defeat the arch-villain Voldemort. The actual story is, of course, far more wonderful than that.

# Bar chart of the TOP20 main characters
bar = Bar('Main characters TOP20', background_color='white', title_pos='center', title_text_size=20)
x = nameswords_count[:20].index.tolist()
y = nameswords_count[:20].values.tolist()
bar.add('', x, y, xaxis_interval=0, xaxis_rotate=30, is_label_show=True)
bar

Counting appearances across the whole series, Harry's position as protagonist is unshakable: he leads second-placed Hermione by more than 13,000 appearances. That is hardly surprising; the series is, after all, Harry Potter, not Hermione Granger.

# Word cloud of the whole novel
name = bookwords_count.index.tolist()
value = bookwords_count.values.tolist()
wc = WordCloud(background_color='white')
wc.add("", name, value, word_size_range=[10, 200], shape='diamond')
wc

# Character relationship analysis: count co-occurrences of names on the same line
names = {}
relationships = {}
lineNames = []
with codecs.open('Harry Potter.txt', 'r', 'utf8') as f:
    n = 0
    for line in f.readlines():
        n += 1
        print('Processing line {}'.format(n))
        poss = pseg.cut(line)
        lineNames.append([])
        for w in poss:
            if w.word in set(nameswords):
                lineNames[-1].append(w.word)
                if names.get(w.word) is None:
                    names[w.word] = 0
                    relationships[w.word] = {}
                names[w.word] += 1

for line in lineNames:
    for name1 in line:
        for name2 in line:
            if name1 == name2:
                continue
            if relationships[name1].get(name2) is None:
                relationships[name1][name2] = 1
            else:
                relationships[name1][name2] = relationships[name1][name2] + 1

# Build node and edge tables for the relationship graph
node = pd.DataFrame(columns=['Id', 'Label', 'Weight'])
edge = pd.DataFrame(columns=['Source', 'Target', 'Weight'])
for name, times in names.items():
    node.loc[len(node)] = [name, name, times]
for name, edges in relationships.items():
    for v, w in edges.items():
        if w > 3:
            edge.loc[len(edge)] = [name, v, w]

After processing we find that the same character can appear under several different names, so we merge them before counting, which leaves 88 nodes.
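The merge can also be expressed as a single mapping with pandas replace instead of one assignment per alias. A minimal sketch with a hypothetical alias table and toy weights (the real analysis works on the full node table):

```python
import pandas as pd

# Hypothetical alias table: map alternate names to one canonical name
aliases = {'Harry': 'Harry Potter', 'Potter': 'Harry Potter', 'Albus': 'Dumbledore'}

# Toy node table for illustration
node = pd.DataFrame({'Id': ['Harry', 'Potter', 'Albus', 'Ron'],
                     'Label': ['Harry', 'Potter', 'Albus', 'Ron'],
                     'Weight': [10, 5, 7, 3]})

# replace() applies the whole mapping in one pass over both columns
node[['Id', 'Label']] = node[['Id', 'Label']].replace(aliases)

# Collapse the merged rows by summing their weights
merged = node.groupby(['Id', 'Label'], as_index=False)['Weight'].sum()
print(merged)
```

The same mapping can be applied to the 'Source' and 'Target' columns of the edge table, which keeps the alias list in one place.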

# Merge the different names used for the same character
node.loc[node['Id'] == 'Harry', 'Id'] = 'Harry Potter'
node.loc[node['Id'] == 'Potter', 'Id'] = 'Harry Potter'
node.loc[node['Id'] == 'Albus', 'Id'] = 'Dumbledore'
node.loc[node['Label'] == 'Harry', 'Label'] = 'Harry Potter'
node.loc[node['Label'] == 'Potter', 'Label'] = 'Harry Potter'
node.loc[node['Label'] == 'Albus', 'Label'] = 'Dumbledore'
edge.loc[edge['Source'] == 'Harry', 'Source'] = 'Harry Potter'
edge.loc[edge['Source'] == 'Potter', 'Source'] = 'Harry Potter'
edge.loc[edge['Source'] == 'Albus', 'Source'] = 'Dumbledore'
edge.loc[edge['Target'] == 'Harry', 'Target'] = 'Harry Potter'
edge.loc[edge['Target'] == 'Potter', 'Target'] = 'Harry Potter'
edge.loc[edge['Target'] == 'Albus', 'Target'] = 'Dumbledore'

# Aggregate and export the node and edge tables
nresult = node['Weight'].groupby([node['Id'], node['Label']]).agg({'Weight': np.sum}).sort_values('Weight', ascending=False)
eresult = edge.sort_values('Weight', ascending=False)
nresult.to_csv('node.csv', index=False)
eresult.to_csv('edge.csv', index=False)

At this point you should have a deeper understanding of how Python crawls the Harry Potter novels. Give it a try yourself!
