How can Python crawl the results of a specified Baidu search and extract the title of each result? This article walks through the analysis and the solution in detail, in the hope of helping readers facing the same problem find a simpler way to solve it.
Hello, everyone. Today we bring you the first use case combining the requests and lxml libraries introduced earlier: search Baidu for a specified term and extract the titles from the result page. Without further ado, let's get straight to the main course.
Step 1: complete the preliminary analysis
Let's start with the first step: analyzing our goal. Don't underestimate this step, because only when the goal is thought through clearly can we write clean code in a relatively short time.
1. Determine the url:
First of all, to request a web page we must know its url. So I open Chrome, go to the Baidu homepage, search for "python", and look at the address of the result page: the wd parameter in the query string carries the search term, and that is all we need.
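To make that pattern concrete, here is a minimal sketch of building the search url by hand. Only the wd parameter is assumed here; the address bar usually shows many extra parameters that Baidu adds, which we can ignore.

from urllib.parse import quote

base = 'http://www.baidu.com/s?wd={name}'
print(base.format(name='python'))       # http://www.baidu.com/s?wd=python
# For Chinese or other non-ASCII search terms, quote() keeps the url valid:
print(base.format(name=quote('爬虫')))   # http://www.baidu.com/s?wd=%E7%88%AC%E8%99%AB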
Step 2: complete the general framework of the page-fetching program
When writing Python programs there is sometimes a small dilemma: object-oriented or procedural? Either is fine (I pick whichever I feel like, since nobody is telling me how to write), so here I write this program in an object-oriented way.
1. First of all, write the general framework:

# file for getting the web page content
import requests

class MySpider(object):
    def __init__(self):
        # format() is used later to fill in the search term
        self.url = 'http://www.baidu.com/s?wd={name}'
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}

    def main(self):
        # process the url
        self.target = input('Please enter what you are interested in:')
        self.url = self.url.format(name=self.target)  # rebuild the url; print it here when testing
        # request
        text = self.get()
        # write to the file
        self.write(text)

    def get(self):
        '''request and return the web page source code'''
        pass

    def write(self, text):
        '''write the returned source code to a file, to be parsed later'''
        pass

if __name__ == '__main__':
    spider = MySpider()
    spider.main()

2. Why write to a file instead of parsing directly:
There are several reasons.
1 > Generally speaking, code rarely works on the first try; it always needs to be modified and tested. With ordinary code we can simply rerun it as often as we like to check whether it is correct, but with a crawler we can't: visiting a site too many times in a short period may trigger restrictive measures against us, such as a CAPTCHA to check whether we are human, or a temporary ip ban. So we write the page source to a file once, and then we never have to revisit the site while writing and debugging the parsing code (a small caching sketch follows this list).
2 > We save the file as html, so we can open it in a browser and see clearly whether it is the page we wanted to crawl. The picture below shows the file I crawled, saved, and opened in Chrome:
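To make the caching idea concrete, here is a minimal sketch; the fetch_or_load helper and its file name are my own illustration, not part of the program we are about to write. Only the first call touches the network; every later call reads the saved copy.

import os
import requests

def fetch_or_load(url, headers, cache_file):
    # Return the page source, hitting the network only when no cached copy exists.
    if os.path.exists(cache_file):
        with open(cache_file, 'r', encoding='utf-8') as f:
            return f.read()
    response = requests.get(url, headers=headers, timeout=10)
    with open(cache_file, 'w', encoding='utf-8') as f:
        f.write(response.text)
    return response.text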
Complete the program to get the page:
Let's finish the program code to get the page:
1. Complete the request function get:

def get(self):
    '''request and return the web page source code'''
    response = requests.get(self.url, headers=self.headers)
    if response.status_code == 200:
        return response.text
There is not much to say here; this is the most basic request code.
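In real use the request can also time out or come back with an error status, so a slightly more defensive variant might look like this (same self.url and self.headers attributes as above; the timeout and error handling are my additions, not part of the original program):

def get(self):
    '''request and return the web page source code'''
    try:
        response = requests.get(self.url, headers=self.headers, timeout=10)
    except requests.RequestException as e:
        print('request failed:', e)
        return ''
    if response.status_code == 200:
        # guard against a mis-detected text encoding
        response.encoding = response.apparent_encoding
        return response.text
    print('unexpected status code:', response.status_code)
    return ''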
2. Complete the write-to-file function write:

def write(self, text):
    # self.target is the search term that was entered
    with open('%s.html' % self.target, 'w', encoding='utf-8') as f:
        f.write(text)
This is just basic file handling, so there is not much to explain; just note that we save the page as an html file, not a txt file.
3. Check whether it is working properly and whether it is the result we want:
The way to check, as mentioned earlier, is to open the saved file in a browser and see whether the page looks right. The result is as follows:
It indicates that the program is running normally.
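If you prefer, this check can also be scripted; a small convenience sketch (not part of the original program) that opens the saved file in the default browser:

import webbrowser
from pathlib import Path

webbrowser.open(Path('python.html').resolve().as_uri())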
Complete the page-parsing program: overall framework and content extraction

1. Complete the framework of the parsing program:

# file for parsing the page
from lxml import etree

class Parse(object):
    def __init__(self):
        # read the saved content and initialize the parse tree
        with open('python.html', 'r', encoding='utf-8') as f:
            self.html = etree.HTML(f.read())

    # parse the page
    def parse(self):
        pass

if __name__ == '__main__':
    parser = Parse()
    parser.parse()

2. Complete the parsing function parse:
Let's finish the final step: writing the parsing function.
First of all, we need to work out which tag holds the content we want. The analysis process is shown in the figure (personally I think this part matters most; as a beginner I was mainly confused about two things: how to handle failed requests, and how to approach parsing). Inspecting the saved page shows that each result title sits in a heading tag whose class contains "t".
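Since the screenshot of that analysis is not reproduced here, a toy snippet shaped like the structure just described can stand in for it; running the same XPath on it shows what kind of list we will get back (the markup is illustrative only, the real Baidu page may differ):

from lxml import etree

snippet = '''
<div id="content_left">
  <h4 class="t"><a href="https://www.python.org/">Welcome to <em>Python</em>.org</a></h4>
  <h4 class="t"><a href="https://docs.python.org/">Python documentation</a></h4>
</div>
'''
html = etree.HTML(snippet)
titles = html.xpath('//h4[contains(@class, "t")]//text()')
print([t.strip() for t in titles])
# ['Welcome to', 'Python', '.org', 'Python documentation']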
OK, after analyzing where the content we need is, we can use lxml to write the code, as follows:
def parse(self):
    # get the titles
    h4_tags = self.html.xpath('//h4[contains(@class, "t")]//text()')
    h4_tags = [i.strip() for i in h4_tags]
    print(h4_tags)
What remains is to clean up these strings, but that is not our focus here and the data itself is not important, so we will leave them as they are (a possible per-title clean-up is sketched below).
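If you did want clean, one-string-per-result titles, one option (my addition, not in the original code) is to iterate over the heading nodes and join the text fragments inside each one:

def parse(self):
    titles = []
    for h4 in self.html.xpath('//h4[contains(@class, "t")]'):
        # join the text fragments of a single result into one title string
        titles.append(''.join(h4.xpath('.//text()')).strip())
    print(titles)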
Summary and full code:

1. First, the full code:

# -*- coding:utf-8 -*-
# file for getting the web page content
import requests

class MySpider(object):
    def __init__(self):
        self.url = 'http://www.baidu.com/s?wd={name}'
        # write the headers explicitly
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}

    def main(self):
        # process the url
        self.target = input('Please enter what you are interested in:')
        self.url = self.url.format(name=self.target)
        # request
        text = self.get()
        # write to the file
        self.write(text)

    def get(self):
        '''request and return the web page source code'''
        response = requests.get(self.url, headers=self.headers)
        if response.status_code == 200:
            return response.text

    def write(self, text):
        with open('%s.html' % self.target, 'w', encoding='utf-8') as f:
            f.write(text)

if __name__ == '__main__':
    spider = MySpider()
    spider.main()

# -*- coding:utf-8 -*-
# file for parsing the page
from lxml import etree

class Parse(object):
    def __init__(self):
        with open('python.html', 'r', encoding='utf-8') as f:
            self.html = etree.HTML(f.read())

    def parse(self):
        # get the titles
        h4_tags = self.html.xpath('//h4[contains(@class, "t")]//text()')
        h4_tags = [i.strip() for i in h4_tags]
        print(h4_tags)

if __name__ == '__main__':
    parser = Parse()
    parser.parse()

2. Summary:
Okay, that's all for today. This is the most basic kind of case: its main purpose is to get you familiar with writing a crawler program with requests and lxml, and secondly to get you familiar with the train of thought for analyzing a site.
This is the answer to the question of how Python can crawl a specified Baidu search and extract the titles from the result page. I hope the above content is of some help to you.