This article works through a Python data visualization example from end to end: crawling WeChat official-account articles via Sogou's WeChat search, getting around the site's anti-crawling measures, and then analysing and visualizing the collected data.
01 Web page analysis
The goal is to collect, for each article found through Sogou's WeChat search, four pieces of information: the title, the opening text, the official account name, and the publication time.
Inspecting the page in the browser's developer tools shows that the request method is GET and the request URL is the Sogou WeChat search URL; the rest of the request details are not needed.
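To make this concrete, here is a minimal sketch of that GET request made with requests. The URL and the query parameters (query, type, page, ie) are the same ones the crawler uses later in this article; the User-Agent is just an ordinary browser string.

import requests

url = 'http://weixin.sogou.com/weixin'
params = {'query': 'python', 'type': 2, 'page': 1, 'ie': 'utf8'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

response = requests.get(url, params=params, headers=headers)
print(response.status_code)   # 200 when the result page comes back normally
print(response.url)           # the full request URL seen in the developer tools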
02 Bypassing the anti-crawling measures
So when does the CAPTCHA page appear?
There are two triggers: the same IP requesting the pages repeatedly, or the same Cookies being reused across many requests.
When both happen at once, you get blocked even faster. I only managed one complete crawl.
At first I set nothing at all and was soon redirected to the CAPTCHA page. After switching to proxy IPs I was still redirected to the CAPTCHA page, and the crawl only succeeded once I also started changing the Cookies.
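Before working around the block, it helps to detect it. The helper below is a hypothetical sketch of mine, not code from the original crawl: it assumes the CAPTCHA redirect can be recognised by an 'antispider' marker in the final URL, so check what the CAPTCHA page URL actually looks like in your browser and adjust the marker accordingly.

import requests

def hit_captcha(response):
    # 'antispider' as a URL marker is an assumption; replace it with whatever
    # the CAPTCHA page URL actually contains in your browser
    return 'antispider' in response.url

resp = requests.get('http://weixin.sogou.com/weixin?query=python&type=2&page=1&ie=utf8')
if hit_captcha(resp):
    print('Blocked: rotate the proxy IP and refresh the Cookies before retrying')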
01 Proxy IP settings
import numpy as np
import pandas as pd

def get_proxies(i):
    """Read the i-th proxy from the local CSV of proxies that passed validation."""
    df = pd.read_csv('sg_effective_ip.csv', header=None,
                     names=["proxy_type", "proxy_url"])
    proxy_type = ["{}".format(t) for t in np.array(df['proxy_type'])]
    proxy_url = ["{}".format(u) for u in np.array(df['proxy_url'])]
    proxies = {proxy_type[i]: proxy_url[i]}
    return proxies
How to collect and validate proxies is not covered here; it was discussed in a previous article for anyone who is interested.
After two days of experimenting, free proxy IPs turned out to be close to useless: the site saw through to my real IP within seconds.
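For completeness, here is a minimal sketch of how candidate proxies might be tested and the survivors written to sg_effective_ip.csv, the two-column file that get_proxies() above reads. The test URL and the candidate list are placeholders of my own, not part of the original workflow.

import csv
import requests

def check_proxies(candidates, test_url='http://httpbin.org/ip', timeout=5):
    """Keep only the proxies that can actually fetch a test page.

    `candidates` is a list of (proxy_type, proxy_url) tuples,
    e.g. ('http', 'http://1.2.3.4:8080').
    """
    working = []
    for proxy_type, proxy_url in candidates:
        try:
            r = requests.get(test_url, proxies={proxy_type: proxy_url}, timeout=timeout)
            if r.status_code == 200:
                working.append((proxy_type, proxy_url))
        except requests.RequestException:
            continue
    return working

# Placeholder candidate list; write the survivors in the format get_proxies() expects
candidates = [('http', 'http://1.2.3.4:8080')]
with open('sg_effective_ip.csv', 'w', newline='') as f:
    csv.writer(f).writerows(check_proxies(candidates))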
02 Cookies settings
import random
import re
import time

import requests

def get_cookies_snuid():
    """Request a fresh SNUID value from Sogou."""
    time.sleep(float(random.randint(2, 5)))
    url = "http://weixin.sogou.com/weixin?type=2&s_from=input&query=python&ie=utf8&_sug_=n&_sug_type_="
    headers = {"Cookie": "ABTEST=your value; IPLOC=CN3301; SUID=your value; SUIR=your value"}
    # HEAD request: fetch only the response headers, not the page body
    response = requests.head(url, headers=headers).headers
    result = re.findall('SNUID=(.*?); expires', response['Set-Cookie'])
    snuid = result[0]
    return snuid
Generally speaking, the Cookies are the most important part of the whole anti-crawling defence, and the key is to refresh the SNUID value dynamically.
I won't go into the reasons in detail; my understanding comes from posts written by people far more experienced, and it is still fairly shallow.
I only managed to crawl all 100 pages once; other runs stopped at 75 pages, 50 pages, or were blocked almost immediately.
I don't want to get stuck in the crawl-versus-anti-crawl quagmire; what comes after the crawler, the data analysis and data visualization, is my real goal.
So for anything on a larger scale, I can only bow to Sogou's engineers.
03 Data acquisition
1 Construct the request headers
head = """
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding:gzip, deflate
Accept-Language:zh-CN,zh;q=0.9
Connection:keep-alive
Host:weixin.sogou.com
Referer:http://weixin.sogou.com/
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36
"""

# Your own Cookie line, without the SNUID value
cookie = 'your Cookies'

def str_to_dict(header):
    """Turn a raw header block into a dict; different functions can build different headers."""
    header_dict = {}
    header = header.split('\n')
    for h in header:
        h = h.strip()
        if h:
            k, v = h.split(':', 1)
            header_dict[k] = v.strip()
    return header_dict
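As a quick check that the helper behaves as expected, the small sketch below parses the header block into a dict. In get_message() further down, the Cookie line (with the refreshed SNUID appended) is added to head before it is parsed in the same way.

headers = str_to_dict(head)
print(headers['Host'])          # 'weixin.sogou.com'
print(headers['User-Agent'])    # the Chrome user-agent string defined above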
2 Get the web page information
from bs4 import BeautifulSoup

# requests, re, time and random are imported above; head, cookie, str_to_dict,
# get_cookies_snuid and get_proxies are defined in the previous snippets.

def get_message():
    """Fetch each result page and save title, opening text, account name and date."""
    failed_list = []
    for i in range(1, 101):
        print('Page ' + str(i))
        # Set a delay; posts online suggest sleeping more than 15 seconds to avoid being blocked
        time.sleep(float(random.randint(15, 20)))
        # Refresh the SNUID value every 10 pages
        if (i - 1) % 10 == 0:
            value = get_cookies_snuid()
            snuid = 'SNUID=' + value + ';'
        # Set the Cookies
        cookies = cookie + snuid
        url = 'http://weixin.sogou.com/weixin?query=python&type=2&page=' + str(i) + '&ie=utf8'
        host = cookies + '\n'
        header = head + host
        headers = str_to_dict(header)
        # Set the proxy IP
        proxies = get_proxies(i)
        try:
            response = requests.get(url=url, headers=headers, proxies=proxies)
            html = response.text
            soup = BeautifulSoup(html, 'html.parser')
            data = soup.find_all('ul', {'class': 'news-list'})
            lis = data[0].find_all('li')
            for j in range(len(lis)):
                h4 = lis[j].find_all('h4')
                title = h4[0].get_text().replace('\n', '').replace(',', ' ')
                p = lis[j].find_all('p')
                article = p[0].get_text().replace(',', ' ')
                a = lis[j].find_all('a', {'class': 'account'})
                name = a[0].get_text()
                span = lis[j].find_all('span', {'class': 's2'})
                cmp = re.findall(r"\d{10}", span[0].get_text())
                date = time.strftime("%Y-%m-%d", time.localtime(int(cmp[0])))
                with open('sg_articles.csv', 'a+', encoding='utf-8-sig') as f:
                    f.write(title + ',' + article + ',' + name + ',' + date + '\n')
            print('Page ' + str(i) + ' succeeded')
        except Exception as e:
            print('Page ' + str(i) + ' failed')
            failed_list.append(i)
            continue
    # Print the page numbers that failed
    print(failed_list)


def main():
    get_message()


if __name__ == '__main__':
    main()
With that, the data was obtained successfully.
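Before moving on to the charts, a quick sanity check of the scraped file is worthwhile. This small sketch simply loads sg_articles.csv with the same column names used by the plotting scripts below.

import pandas as pd

df = pd.read_csv('sg_articles.csv', header=None,
                 names=["title", "article", "name", "date"])
print(df.shape)    # number of records collected
print(df.head())   # first few rows: title, opening text, account name, date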
04 Data visualization
1 Number of WeChat articles published: top 10 accounts
Here the collected articles are grouped by account to find the ten official accounts that published the most Python articles.
I would really like to know whether these accounts are run by teams or by individuals, but either way they are worth following first.
The result is probably also influenced by searching with the keyword Python: almost every account name contains "Python" (CSDN being the exception).
from pyecharts import Bar
import pandas as pd

df = pd.read_csv('sg_articles.csv', header=None,
                 names=["title", "article", "name", "date"])

# Get the publication year of each article
list1 = []
for j in df['date']:
    year = j.split('-')[0]
    list1.append(year)
df['year'] = list1

# Keep only articles published in 2018
df = df.loc[df['year'] == '2018']

place_message = df.groupby(['name'])
place_com = place_message['name'].agg(['count'])
place_com.reset_index(inplace=True)
place_com_last = place_com.sort_index()
dom = place_com_last.sort_values('count', ascending=False)[0:10]

attr = dom['name']
v1 = dom['count']
bar = Bar("Top 10 accounts by number of WeChat articles", title_pos='center', title_top='18',
          width=800, height=400)
bar.add("", attr, v1, is_convert=True, xaxis_min=10, yaxis_rotate=30, yaxis_label_textsize=10,
        is_yaxis_boundarygap=True, yaxis_interval=0, is_label_show=True, is_legend_show=False,
        label_pos='right', is_yaxis_inverse=True, is_splitline_show=False)
bar.render("wechat_article_count_top10.html")
2 Distribution of WeChat article publication times
Because some of the articles returned were published before 2018, they are dropped here, and the publication times of the remaining articles are examined.
After all, information is about timeliness; searching for stale content makes little sense, especially in the fast-changing Internet industry.
import numpy as np
import pandas as pd
from pyecharts import Bar

df = pd.read_csv('sg_articles.csv', header=None,
                 names=["title", "article", "name", "date"])

# Get the year and month of each article
list1 = []
list2 = []
for j in df['date']:
    time_1 = j.split('-')[0]
    time_2 = j.split('-')[1]
    list1.append(time_1)
    list2.append(time_2)
df['year'] = list1
df['month'] = list2

# Keep only articles published in 2018
df = df.loc[df['year'] == '2018']

month_message = df.groupby(['month'])
month_com = month_message['month'].agg(['count'])
month_com.reset_index(inplace=True)
month_com_last = month_com.sort_index()

attr = ["{}".format(str(i) + ' month') for i in range(1, 12)]
v1 = np.array(month_com_last['count'])
v1 = ["{}".format(int(i)) for i in v1]
bar = Bar("Distribution of WeChat article publication times", title_pos='center', title_top='18',
          width=800, height=400)
bar.add("", attr, v1, is_stack=True, is_label_show=True)
bar.render("wechat_article_time_distribution.html")
3 Word clouds of the article titles and openings
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
import pandas as pd
import jieba

df = pd.read_csv('sg_articles.csv', header=None,
                 names=["title", "article", "name", "date"])

text = ''
# For the word cloud of the article openings, iterate over df['article'].astype(str) instead
for line in df['title']:
    text += ' '.join(jieba.cut(line, cut_all=False))

backgroud_Image = plt.imread('python_logo.jpg')
wc = WordCloud(background_color='white', mask=backgroud_Image,
               font_path='C:\\Windows\\Fonts\\STZHONGS.TTF',
               max_words=2000, max_font_size=150, random_state=30)
wc.generate_from_text(text)
img_colors = ImageColorGenerator(backgroud_Image)
wc.recolor(color_func=img_colors)
plt.imshow(wc)
plt.axis('off')
# For the article-opening word cloud, save to "article.jpg" instead
wc.to_file("title.jpg")
print('Word cloud generated successfully!')

This concludes the Python data visualization example. Pairing theory with practice is the best way to learn, so go and try it yourself!