
How to use a Python crawler to crawl data

2025-04-06 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article introduces how to use a Python crawler to crawl data. Many people run into difficulties when working through cases like this in practice, so let the editor walk you through how to handle these situations. I hope you read carefully and learn something!

A Python crawler in six steps. Step 1: Install the requests library and the BeautifulSoup library:

In the program, the two libraries are imported as follows:

import requests
from bs4 import BeautifulSoup

Because I program Python in PyCharm, I'll explain how to install these two libraries there. On the main page, open the File menu and find Settings, then find Project Interpreter. Click the + sign above the package list in that dialog to search for and install packages. Readers who have installed packages through an IDE before should find this a straightforward start. The details are shown below.
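If you prefer the command line to PyCharm's interface, the same two libraries can be installed with pip (these are their standard PyPI package names):

pip install requests beautifulsoup4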

Step 2: Get the headers and cookies needed by the crawler:

I once wrote a crawler program that crawls Weibo, so I'll use it directly as the example here. Obtaining the headers and cookies is a must for a crawler program; they directly determine whether the program can accurately locate the page content it wants to crawl.

First, open the Weibo hot search page and press F12 to bring up the browser's developer tools, as shown below. Find the Network tab, then press Ctrl+R to refresh the page. If request entries are already listed, there is no need to refresh, though refreshing does no harm. Then browse the Name column, find the request we want to crawl, right-click it, select Copy, and copy it as a cURL command. This is shown below.

Next, open a "convert curl commands to code" tool page. It automatically generates the headers and cookies from the cURL command you copied, as shown below. Copy the generated headers and cookies and paste them directly into the program.

#crawler header data
cookies = {
    'SINAGLOBAL': '6797875236621.702.1603159218040',
    'SUB': '_2AkMXbqMSf8NxqwJRmfkTzmnhboh2ygvEieKhMlLJJRMxHRl-yT9jqmg8tRB6PO6N_Rc_2FhPeZF2iThYO9DfkLUGpv4V',
    'SUBP': '0033WrSXqPxfM72-Ws9jqgMF55529P9D9Wh-nU-QNDs1Fu27p6nmwwiJ',
    '_s_tentry': 'www.baidu.com',
    'UOR': 'www.hfut.edu.cn,widget.weibo.com,www.baidu.com',
    'Apache': '7782025452543.054.1635925669528',
    'ULV': '1635925669554:15:1:1:7782025452543.054.1635925669528:1627316870256',
}
headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 SLBrowser/7.0.0.6241 SLBChan/25',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'cross-site',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}
params = (
    ('cate', 'realtimehot'),
)

Copied into the program, it looks like the block above. These are the request headers for Weibo's hot search page.

Step 3: Get the web page:

Once we have the headers and cookies, we can copy them into our program. After that, we can use requests.get() to fetch the web page, as the next code block shows.
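One optional safeguard, not in the author's program: after the requests.get() call shown below, you can ask requests to raise an error for any 4xx/5xx response, so a blocked or failed fetch is caught early:

#Optional check after fetching (editorial addition, not in the original program)
response.raise_for_status()  # raises requests.exceptions.HTTPError if the fetch failed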

#Get web page
response = requests.get('https://s.weibo.com/top/summary', headers=headers, params=params, cookies=cookies)

Step 4: Parse the web page:
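The parsing itself is just two lines, which also appear in the full program at the end of this article: the response text is decoded as utf-8 and handed to BeautifulSoup:

#parse web page (these two lines also appear in the full program below)
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')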

At this point, we need to go back to the web page. Press F12 again and find the Elements tab. Click the arrow icon in the small box at the top left, then click the part of the page you want, as shown below. The page will then highlight, on the right, the code corresponding to the part you selected.

After we find the page code for the part of the page we want to crawl, place the mouse over that code, right-click, and choose Copy > Copy selector, as shown above.

Step 5: Analyze the information obtained and simplify the address:

The selector we just copied is, in effect, the address where the corresponding part of the web page is stored. Since we want a whole class of information on the page, we need to analyze the copied addresses and generalize them. Using a single address as-is is not impossible, of course; it just means you only get the one item you selected on the page.

#pl_top_realtimehot > table > tbody > tr:nth-child(1) > td.td-02 > a
#pl_top_realtimehot > table > tbody > tr:nth-child(2) > td.td-02 > a
#pl_top_realtimehot > table > tbody > tr:nth-child(9) > td.td-02 > a

These are the three addresses I copied. You can see they are nearly identical; the only difference is the tr part. Since tr is an HTML tag, the :nth-child(n) suffix is a pseudo-class selector that picks out the n-th row, so we can infer that each entry of this kind is stored in its own tr row. If we select tr without the suffix, we get all the entries at once. The simplified address is therefore:

#pl_top_realtimehot > table > tbody > tr > td.td-02 > a

This step is probably easier for readers with some knowledge of front-end languages such as HTML and CSS, but it doesn't matter if you have none. The main trick is to keep the part the addresses share and experiment patiently; you will get it right eventually.
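To make the difference concrete, here is a minimal, self-contained sketch using a tiny made-up HTML fragment (a hypothetical stand-in for the Weibo page, not its real markup): the nth-child() address returns one row, while the generalized address returns them all.

from bs4 import BeautifulSoup

#A tiny made-up HTML table standing in for the Weibo page structure (hypothetical example)
html = """
<div id="pl_top_realtimehot"><table><tbody>
<tr><td class="td-02"><a>first topic</a></td></tr>
<tr><td class="td-02"><a>second topic</a></td></tr>
</tbody></table></div>
"""
soup = BeautifulSoup(html, 'html.parser')

#With :nth-child(1), select() returns only the first row's link
print(soup.select('#pl_top_realtimehot > table > tbody > tr:nth-child(1) > td.td-02 > a'))
#Without it, select() returns the links from every row
print(soup.select('#pl_top_realtimehot > table > tbody > tr > td.td-02 > a'))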

Step 6: Crawl content and clean data

Once this step is complete, we can crawl the data directly. Store the simplified address in a variable; passing that variable to soup.select() pulls out exactly the page content we want.

#crawl
content = "#pl_top_realtimehot > table > tbody > tr > td.td-02 > a"

Then we use soup.select() together with .text to strip away the unneeded parts, such as the surrounding HTML and script markup, so the reader sees only the information itself. With that, we have successfully crawled the information.

fo = open("./microblogging search.txt", 'a', encoding="utf-8")
a = soup.select(content)
for i in range(0, len(a)):
    a[i] = a[i].text
    fo.write(a[i] + '\n')
fo.close()

I store the data in a text file, hence the write operations. Where you store your data, or how you use it, is up to you.
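As a stylistic alternative (my rewrite, not the author's original), the same write-out can use a with block so the file is closed automatically even if an error occurs:

#Equivalent crawl-and-save using a context manager (stylistic alternative, not the original)
with open("./microblogging search.txt", 'a', encoding="utf-8") as fo:
    for tag in soup.select(content):
        fo.write(tag.text + '\n')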

import os
import requests
from bs4 import BeautifulSoup

#crawler header data
cookies = {
    'SINAGLOBAL': '6797875236621.702.1603159218040',
    'SUB': '_2AkMXbqMSf8NxqwJRmfkTzmnhboh2ygvEieKhMlLJJRMxHRl-yT9jqmg8tRB6PO6N_Rc_2FhPeZF2iThYO9DfkLUGpv4V',
    'SUBP': '0033WrSXqPxfM72-Ws9jqgMF55529P9D9Wh-nU-QNDs1Fu27p6nmwwiJ',
    '_s_tentry': 'www.baidu.com',
    'UOR': 'www.hfut.edu.cn,widget.weibo.com,www.baidu.com',
    'Apache': '7782025452543.054.1635925669528',
    'ULV': '1635925669554:15:1:1:7782025452543.054.1635925669528:1627316870256',
}
headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 SLBrowser/7.0.0.6241 SLBChan/25',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'cross-site',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}
params = (
    ('cate', 'realtimehot'),
)

#data store
fo = open("./microblogging search.txt", 'a', encoding="utf-8")

#get web page
response = requests.get('https://s.weibo.com/top/summary', headers=headers, params=params, cookies=cookies)

#parse web page
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')

#crawl
content = "#pl_top_realtimehot > table > tbody > tr > td.td-02 > a"

#clean data
a = soup.select(content)
for i in range(0, len(a)):
    a[i] = a[i].text
    fo.write(a[i] + '\n')
fo.close()
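To confirm everything worked, you can read the saved file back (a trivial check of my own, assuming the same file name used in the program above):

#Read back the saved hot search list (assumes the file name used in the program above)
with open("./microblogging search.txt", encoding="utf-8") as f:
    print(f.read())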

"How to use python crawler to crawl data" content is introduced here, thank you for reading. If you want to know more about industry-related knowledge, you can pay attention to the website. Xiaobian will output more high-quality practical articles for everyone!
