2025-02-24 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article explains how to crawl a celebrity ("star") Tieba forum with Python. The content is short and easy to follow, so work through it step by step.
I. Website analysis
Tieba paginates through the URL, mainly with the pn parameter, which is the page number multiplied by 50:
https://tieba.baidu.com/f?kw=<star>&ie=utf-8&pn=<page number * 50>
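As a minimal sketch of the URL pattern described above, the page URLs can be built with the standard library instead of by hand. The names page_url, BASE_URL and THREADS_PER_PAGE are hypothetical helpers introduced only for this illustration, and "star" stands in for the forum name.

```python
from urllib.parse import urlencode

BASE_URL = "https://tieba.baidu.com/f"
THREADS_PER_PAGE = 50  # pn advances by 50 for each page


def page_url(kw, page):
    """Return the list-page URL for forum `kw`; `page` is zero-based."""
    query = urlencode({"kw": kw, "ie": "utf-8", "pn": page * THREADS_PER_PAGE})
    return "{0}?{1}".format(BASE_URL, query)


# Print the URLs of the first three pages.
for page in range(3):
    print(page_url("star", page))
```

urlencode also percent-encodes a non-ASCII forum name as UTF-8, which is consistent with the ie=utf-8 parameter.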
The title, author, and reply count of each thread can all be found in the page source.
So we only need to send the request with requests and then parse the response with bs4 to extract what we want.
II. Python implementation
1. Crawl data
This is the usual approach for crawling a static page: based on the structure of the page source, use find_all to extract each thread's title, author, and reply count, then combine the three into a two-dimensional list, result, which is convenient to store in the database. The code is as follows:

    import time

    import requests
    from bs4 import BeautifulSoup

    num_pages = 10  # placeholder: the page count is missing from the source listing
    result = []     # one [title, author, reply count] row per thread

    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:69.0) '
                            'Gecko/20100101 Firefox/69.0'}

    for t in range(num_pages):
        print('page {0}'.format(t + 1))
        # kw is the forum name; pn is the page number multiplied by 50
        url = 'https://tieba.baidu.com/f?kw=star&ie=utf-8&pn={0}'.format(t * 50)
        # headers must be passed by keyword; the second positional argument is params
        response = requests.get(url, headers=header)
        soup = BeautifulSoup(response.text, 'html.parser')
        items_content = soup.find_all('a', class_='j_th_tit')  # thread title
        items_user = soup.find_all('span', class_='tb_icon_author')  # author nickname
        items_comment = soup.find_all(class_='threadlist_rep_num center_text')  # reply count
        for i, j, k in zip(items_content, items_user, items_comment):
            result.append([i.get('title'), j.get('title')[5:], k.text])
        time.sleep(1)  # pause between requests
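The row-building step can be tried without any network access. The sketch below shows how three parallel lists are zipped into [title, author, num] rows and how a slice removes a label prefix from the author string. All sample values are made up, and "user:" is a hypothetical five-character prefix chosen only to match the [5:] slice; build_rows and PREFIX_LEN are illustrative names, not part of the original code.

```python
PREFIX_LEN = 5  # matches the [5:] slice used in the crawler


def build_rows(titles, author_labels, reply_counts):
    """Combine three parallel lists into [title, author, num] rows."""
    rows = []
    for title, label, num in zip(titles, author_labels, reply_counts):
        rows.append([title, label[PREFIX_LEN:], num])
    return rows


# Made-up sample data; "user:" is a hypothetical five-character label prefix.
titles = ["first thread", "second thread", "third thread"]
author_labels = ["user:alice", "user:bob"]  # one entry short on purpose
reply_counts = ["12", "3", "7"]

for row in build_rows(titles, author_labels, reply_counts):
    print(row)
```

Because zip stops at the shortest input, only two rows come out here. On a real page, a thread that lacks one of the three elements would likewise shorten or misalign the rows, so it may be worth comparing the three list lengths before zipping.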
2. Save to the database
First create a new table named 'STAR' with three columns, 'title', 'author', and 'num', to hold the data crawled in step 1. Then store the contents of the two-dimensional list result in the database:

    import pymysql

    conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                           password='database password', db='test1',
                           charset='utf8mb4')
    cur = conn.cursor()
    # drop the STAR table if it already exists
    cur.execute("DROP TABLE IF EXISTS STAR")
    # create the STAR table; the column lengths are missing from the source
    # listing, so 255 is an assumed value
    sql = """create table STAR (title char(255), author char(255), num char(255))"""
    cur.execute(sql)
    for i in result:
        # parameter binding lets the driver escape quotes and backslashes
        cur.execute("INSERT INTO STAR (title, author, num) VALUES (%s, %s, %s)",
                    (i[0], i[1], i[2]))
    conn.commit()

Because thread titles contain emoji and other symbols, I use the 'utf8mb4' charset so that emoji can also be stored in the database. Some punctuation marks in titles break an INSERT statement that is assembled by string formatting, so they must either be stripped with replace or, as here, passed as query parameters so that the driver escapes them.
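To try the insert step without a MySQL server, the sketch below uses Python's built-in sqlite3 module as a stand-in. This is an assumption made for illustration only; the article itself uses MySQL through pymysql. It shows that values containing quotes, backslashes, and emoji are stored unchanged when they are passed as query parameters. Note that sqlite3 uses ? placeholders where pymysql uses %s. The function save_rows and the sample rows are hypothetical.

```python
import sqlite3


def save_rows(rows):
    """Store [title, author, num] rows in an in-memory table and read them back."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS STAR")
    cur.execute("CREATE TABLE STAR (title TEXT, author TEXT, num TEXT)")
    # Values are bound as parameters, so no manual escaping is needed.
    cur.executemany("INSERT INTO STAR (title, author, num) VALUES (?, ?, ?)", rows)
    conn.commit()
    stored = cur.execute(
        "SELECT title, author, num FROM STAR ORDER BY rowid"
    ).fetchall()
    conn.close()
    return stored


# Made-up rows with characters that break naive string formatting.
sample = [
    ["it's a \"quoted\" title \\ with a backslash", "alice", "12"],
    ["emoji title \U0001F600", "bob", "3"],
]
stored = save_rows(sample)
print(stored == [tuple(r) for r in sample])
```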
In total I crawled more than 13,000 records, which covers essentially all threads from the last two years.
III. Visualization
Use SQLAlchemy's create_engine function together with pandas to read the contents of the database table, as follows:

    import pandas as pd
    from sqlalchemy import create_engine

    # initialize the database connection with create_engine
    engine = create_engine('mysql+pymysql://root:password@127.0.0.1:3306/test1')
    # query that selects all rows of the STAR table
    sql = '''select * from STAR'''
    # read_sql_query takes two arguments: the SQL statement and the database connection
    df = pd.read_sql_query(sql, engine)
    # convert the num column from strings to integers
    df['num'] = [int(i) for i in list(df['num'])]
    # drop duplicate rows, keeping the first occurrence
    df = df.drop_duplicates(subset=['title', 'author', 'num'], keep='first')

Because num was saved as a string, it is converted to an integer first, and then duplicate rows are removed with the drop_duplicates method. At that point the data is cleaned up.
With more than ten thousand records in front of you, nothing stands out to the naked eye, so I pick a few angles here and use Python statistics to see what this forum is hiding.
1. Find the 10 users with the most threads
Put simply: create an empty dictionary, convert df['author'] into a list, count how often each author appears in the list, store each author and count in the dictionary, sort the entries by count, and draw the top ten as a bar chart. The code is as follows:

    from pyecharts import Bar  # pyecharts 0.5.x API

    # ranking
    rank_num = {}
    authors = list(df['author'])
    for i in set(authors):
        # the replace() arguments are garbled in the source; stripping spaces is an assumption
        rank_num[i.replace(' ', '')] = authors.count(i)
    rank_num = sorted(rank_num.items(), key=lambda x: x[1], reverse=True)
    bar = Bar("histogram", "number of posts-nickname")
    bar.add("number of posts-nickname",
            [i[0] for i in rank_num[:10]],
            [i[1] for i in rank_num[:10]],
            xaxis_rotate=45, mark_line=["average"], mark_point=["max", "min"])
    bar.render('number of posts-nickname.html')
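The counting step can also be sketched with collections.Counter from the standard library, which counts in a single pass and avoids calling list.count once per author. This is an alternative sketch, not the code the article uses; top_authors and the sample nicknames are hypothetical, and removing spaces from nicknames is an assumption.

```python
from collections import Counter


def top_authors(authors, n=10):
    """Return the n most frequent authors as (author, count) pairs."""
    # Assumption: spaces are removed from nicknames before counting.
    cleaned = [a.replace(" ", "") for a in authors]
    return Counter(cleaned).most_common(n)


# Made-up nicknames for illustration.
sample_authors = ["alice", " bob", "alice", "carol", "bob", "alice"]
print(top_authors(sample_authors, n=2))  # [('alice', 3), ('bob', 2)]
```

With the DataFrame from the previous step, the input would be list(df['author']); the resulting pairs can be split into nickname and count lists for the bar chart.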
The results are as follows:
The user at the top of the ranking is quite prolific: a single person posted as many as 751 threads, which is really impressive.
2. Find the 10 threads with the most replies

    dff = df.sort_values(by='num', ascending=False).head(10)
    bar = Bar('ranking the number of posts', width=1000, height=400)
    bar.use_theme('dark')
    bar.add('', dff['title'][::-1], dff['num'][::-1], is_convert=True,
            is_yaxis_inverse=False, xaxis_rotate=45, is_label_show=True,
            label_pos='right')
    bar.render("ranking the number of posts.html")

The thread with the highest count turned out to be a filler (chit-chat) thread, with 73,459 replies.
3. Make a word cloud from all thread titles
First, join all the titles into one string, segment it into words with jieba, and then render the word cloud over a background image. The code is as follows:

    import matplotlib.pyplot as plt
    import jieba
    from wordcloud import WordCloud

    text = ''
    for i in list(df['title']):
        text += i
    print(text)
    cut_text = jieba.cut(text)
    result = []
    for i in cut_text:
        result.append(i)
    # the source listing is cut off at this point ("result= ...")
The effect is as follows:
Thank you for reading. That is how to crawl a celebrity Tieba forum with Python. After working through this article you should have a better understanding of the process, though the specifics still need to be verified in practice.