How can a Python web crawler be used to capture the pictures and videos in a Baidu Tieba comment area? This article gives a detailed analysis and answer to that question, in the hope of helping readers find a simple, workable approach.
Baidu Tieba is the largest Chinese-language communication platform in the world. Like me, have you ever seen a picture in a comment area and wanted to download it? Or seen a video and wanted to save it?
Today, let's search by keyword and grab the pictures and videos in the comment area.
[II. Project objectives]
Save the pictures and videos obtained from Tieba posts to local files.
[III. Libraries and websites involved]
1. The website is as follows (the kw= parameter holds the search keyword, here "Wu Jing"):
https://tieba.baidu.com/f?ie=utf-8&kw=吴京&fr=search
2. Libraries involved: requests, lxml, urllib
[IV. Project Analysis]
1. Handling anti-crawling measures
During earlier testing, it was found that the site applies several anti-crawler measures:
1) When the requests library is used directly without setting any headers, the website returns no data at all.
2) Visiting from the same IP more than about 40 times in a row gets the IP blocked outright; that is how my IP was blocked at first.
After some research, the following approach turned out to solve both problems effectively: capture the HTTP request headers that a normal browser sends, and set those regular headers on each requests call, as sketched below.
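As a minimal sketch of this idea (the User-Agent string and the delay value here are illustrative assumptions, not values the site prescribes):
import time
import requests

# a User-Agent header copied from a normal browser request (illustrative value)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

url = "http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=0"
response = requests.get(url, headers=headers)
print(response.status_code)

time.sleep(1)  # pause between requests (illustrative delay) to stay under the IP-ban threshold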
2. How is keyword search implemented?
Looking at the URL, you can see that the content you want to search for simply goes after kw=. So that part can be replaced with a {} placeholder, which we will fill in later; a quick sketch follows.
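For example (a small sketch; "Wu Jing" stands in for any keyword), urllib.parse.quote URL-encodes the keyword before it is substituted into the placeholder:
from urllib import parse

url_template = "http://tieba.baidu.com/f?kw={}&ie=utf-8&pn=0"
key_word = parse.quote("吴京")  # URL-encode the keyword ("Wu Jing")
print(url_template.format(key_word))
# prints: http://tieba.baidu.com/f?kw=%E5%90%B4%E4%BA%AC&ie=utf-8&pn=0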
[V. Project implementation]
1. Create a class named BaiduImageSpider, define a main method main and an initialization method __init__, and import the required libraries:
import requests
from lxml import etree
from urllib import parse

class BaiduImageSpider(object):
    def __init__(self, tieba_name):
        pass

    def main(self):
        pass

if __name__ == '__main__':
    inout_word = input("Please enter the information you want to query:")
    spider = BaiduImageSpider(inout_word)
    spider.main()
2. Prepare the url address and the request headers needed for the request:
import requests
from lxml import etree
from urllib import parse

class BaiduImageSpider(object):
    def __init__(self, tieba_name):
        self.tieba_name = tieba_name  # the keyword entered
        self.url = "http://tieba.baidu.com/f?kw={}&ie=utf-8&pn=0"
        self.headers = {
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET 4.0C; InfoPath.3)'
        }

    # send a request and parse the response
    def get_parse_page(self, url, xpath):
        html = requests.get(url=url, headers=self.headers).content.decode("utf-8")
        parse_html = etree.HTML(html)
        r_list = parse_html.xpath(xpath)
        return r_list

    def main(self):
        url = self.url.format(self.tieba_name)

if __name__ == '__main__':
    inout_word = input("Please enter the information you want to query:")
    key_word = parse.quote(inout_word)  # URL-encode the keyword
    spider = BaiduImageSpider(key_word)
    spider.main()
3. Data analysis with XPath
3.1. chrome_Xpath plug-in installation
1) A browser plug-in is used here to quickly check whether the information we crawl is correct. The installation steps are as follows.
2) Download chrome_Xpath_v2.0.2.crx (it can be found via Baidu) and enter chrome://extensions/ in the Chrome browser.
3) Drag chrome_Xpath_v2.0.2.crx directly onto the extensions page.
4) If installation fails, a pop-up says something like "unable to add apps, extensions, and user scripts from this site". The workaround: enable developer mode, rename the .crx file's suffix to .rar (or unzip it directly) and extract it into a folder, then use developer mode's option to load an unpacked extension, select the extracted folder, and confirm. The extension then installs successfully.
3.2. Using the chrome_Xpath plug-in
With the plug-in installed, we can now use it.
1) Open the browser and press F12.
2) Select the element, as shown in the figure below.
3) Right-click, then choose "Copy XPath", as shown in the figure below.
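If you prefer to verify a copied XPath in code rather than in the plug-in, a quick check with lxml looks like this (a sketch; the URL is the encoded "Wu Jing" list page, the XPath is the post-link expression used by get_tlink below, and the User-Agent value is illustrative):
import requests
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}  # illustrative
html = requests.get("http://tieba.baidu.com/f?kw=%E5%90%B4%E4%BA%AC&ie=utf-8&pn=0",
                    headers=headers).content.decode("utf-8")
parse_html = etree.HTML(html)
# print the first few post links the XPath matches
print(parse_html.xpath('//div[@class="threadlist_lz clearfix"]/div/a/@href')[:3])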
3.3. Write the link-fetching function
Above, we obtained the XPath for the post links. Next we define a link function get_tlink, which reuses self.get_parse_page, so that multi-page crawling can be added later (see the paging sketch after this code):
# get link function
def get_tlink(self, url):
    xpath = '//div[@class="threadlist_lz clearfix"]/div/a/@href'
    t_list = self.get_parse_page(url, xpath)
    # print(len(t_list))
    for t in t_list:
        t_link = "http://www.tieba.com" + t
        # next, request each post address and save its media locally
        self.write_image(t_link)
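The pn=0 at the end of the url template controls paging. To crawl several list pages, main can loop over pn values; the sketch below assumes Tieba advances pn by 50 per page, which you should verify against the site, and it requires turning the page number into a second placeholder:
# in __init__, make the page number a placeholder as well (assumption):
# self.url = "http://tieba.baidu.com/f?kw={}&ie=utf-8&pn={}"

def main(self):
    for page in range(3):  # crawl the first three list pages as an example
        # assumption: Tieba advances pn by 50 per list page
        url = self.url.format(self.tieba_name, page * 50)
        self.get_tlink(url)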
4. Save the data
A write_image method is defined here to save the data, as shown below. Its XPath uses | to match, in a single query, both comment-area images (the BDE_Image img tags) and embedded videos (the embed tag's data-video attribute):
# save-to-local function
def write_image(self, t_link):
    xpath = "//div[@class='d_post_content j_d_post_content clearfix']/img[@class='BDE_Image']/@src | //div[@class='video_src_wrapper']/embed/@data-video"
    img_list = self.get_parse_page(t_link, xpath)
    for img_link in img_list:
        html = requests.get(url=img_link, headers=self.headers).content
        filename = "Baidu/" + img_link[-10:]
        with open(filename, 'wb') as f:
            f.write(html)
            print("%s download succeeded" % filename)
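As noted in the effect display below, write_image expects a folder named "Baidu" to exist next to the script. If you would rather create it from code, a small guard at the start of main (a sketch; the folder name follows the article's convention) avoids the missing-folder error:
import os

# create the output folder next to the script if it does not exist yet
os.makedirs("Baidu", exist_ok=True)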
[VI. Effect display]
1. Click Run and enter the information you want to query, as shown in the figure below.
2. Taking Wu Jing as an example, type the keyword and press Enter.
3. The pictures are downloaded into a folder named "Baidu". Be sure to create this folder under the same directory as the code in advance (or create it from code as sketched above); otherwise the program will not find the folder and will raise a file-not-found error for the "Baidu" path.
4. The MP4 file in the figure below is a video from the comment area.
Summary:
1. It is not recommended to grab too much data; that puts unnecessary load on the server. A light touch is enough.
2. Based on a Python web crawler, this article used the requests and lxml libraries to crawl Baidu Tieba comment areas, explained some of the difficulties of crawling Tieba with Python in detail, and provided effective solutions.
3. You are welcome to try it yourself. Things that look easy when someone else does them tend to throw up all kinds of problems when you do them yourself. Don't aim too high; work through it diligently and you will understand it far more deeply, learning both the use of the requests library and how crawlers are written.
This is the answer to the question of how to use a Python web crawler to grab the pictures and videos in a Baidu Tieba comment area. I hope the above content is of some help to you.