This article explains in detail how to quickly write a simple information collector (web scraper) in Python. It is shared as a reference; after reading it you should have a basic understanding of the techniques involved.
Environment:
Python 3
Modules:
lxml
requests
beautifulsoup4
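If these modules are not already installed, they can typically be installed with pip (the beautifulsoup4 package provides the bs4 module used below):
pip install requests lxml beautifulsoup4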
Start:
First, take a look at the target station:
http://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-1.html
The page is a directory: if we click on the first entry, Beijing, we can see a table listing the names of all the universities in Beijing.
Our goal is to save the universities of each city into a separate txt file.
Now let's start the analysis.
Inspecting the page elements, the target we want to extract is the school name.
The page structure is clear: the content we want sits inside a tbody, within tr tags. Analyzing the next few names reveals the pattern:
each school name sits in its own tr tag.
OK, now let's compare the URL of the Beijing page with the URL of the next city's page.
http://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-2.html
http://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-3.html
You can see that only the final number differs: it starts at 2 and increases by one for each page. Now that we have the basic information about the target, let's start writing the code.
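As a quick check of this URL pattern, here is a minimal sketch that generates the page URLs (the upper bound of 33 pages is taken from the loop used later in this article):

base = "http://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-%s.html"
for i in range(2, 34):  # pages 2 through 33
    print(base % i)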
Let's start with one page.
# coding=utf-8
import requests
import lxml
from bs4 import BeautifulSoup as bs  # import BeautifulSoup and alias it as bs, because the full name is long

def school():  # define a function
    url = "http://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-2.html"
    r = requests.get(url=url)  # request the target site with requests
    soup = bs(r.content, "lxml")  # parse the returned content with BeautifulSoup and assign it to soup
    print(soup)  # print out the content

if __name__ == '__main__':  # entry point: call the function we just defined, otherwise nothing runs
    school()
Run it after writing: the request returns successfully, and it turns out we do not even need to set any request headers, which saves some trouble.
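If the site ever starts rejecting bare requests, a User-Agent header can be added. A minimal sketch (the header value below is just an example):

import requests

headers = {"User-Agent": "Mozilla/5.0"}  # example value; any common browser UA string works
r = requests.get("http://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-2.html",
                 headers=headers)
print(r.status_code)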
Now let's extract the content.
The content we want is inside the tr tags with height="29", so we use that tag as the anchor and find everything inside it. Here's the code; run it:
def school():
    url = "http://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-2.html"
    r = requests.get(url=url)
    soup = bs(r.content, "lxml")
    content = soup.find_all(name="tr", attrs={"height": "29"})
    print(content)
OK, this successfully returns what we need, but it also includes a lot of useless extras, and now we are going to get rid of them. Continue editing the school function: we loop over the fetched content, treat each tr tag as a separate item, parse it again, and use find_all to collect its td tags into a list, which makes the fields easy to pick out. The school name is in the second td tag, i.e. at index 1 of the list. The code is as follows.
def school():
    url = "http://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-2.html"
    r = requests.get(url=url)
    soup = bs(r.content, "lxml")
    content = soup.find_all(name="tr", attrs={"height": "29"})
    for content1 in content:
        soup_content = bs(str(content1), "lxml")
        soup_content1 = soup_content.find_all(name="td")
        print(soup_content1[1])
After adding this, we run the code. It throws an error, but don't panic; let's look at the error message.
The error says the list index is out of range. Yet we did successfully return one piece of content, so let's analyze the page source.
Looking at the first three tr tags: we successfully got the contents of the "school name" header line in the first tr tag, and then the second tr raised the error. Our code prints the element at index 1 of the list, but that second tr tag contains only one td, so index 1 does not exist. The remaining rows are fine again. How do we solve this? With Python's exception handling: when an error is raised, ignore it and keep running. The code becomes this.
def school():
    url = "http://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-2.html"
    r = requests.get(url=url)
    soup = bs(r.content, "lxml")
    content = soup.find_all(name="tr", attrs={"height": "29"})
    for content1 in content:
        try:
            soup_content = bs(str(content1), "lxml")
            soup_content1 = soup_content.find_all(name="td")
            print(soup_content1[1])
        except IndexError:
            pass
Run it again.
It works, but this is only one city and we need the rest. Next we use a for loop running from 2 to 33, adding 1 each time, and substitute the loop variable into the page-number part of the URL.
# coding=utf-8
import requests
import lxml
from bs4 import BeautifulSoup as bs

def school():
    for i in range(2, 34):  # pages 2 to 33, as described above
        url = "http://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-%s.html" % (str(i))
        r = requests.get(url=url)
        soup = bs(r.content, "lxml")
        content = soup.find_all(name="tr", attrs={"height": "29"})
        for content1 in content:
            try:
                soup_content = bs(str(content1), "lxml")
                soup_content1 = soup_content.find_all(name="td")
                print(soup_content1[1])
            except IndexError:
                pass

if __name__ == '__main__':
    school()
We can see schools not only in Beijing but also in Tianjin, and of course all the cities after them are printed too. Now we need to strip the tags. Change the print statement to the line below so that only the text is shown.
print(soup_content1[1].string)
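To see what .string does, here is a minimal sketch with a made-up table row (the real markup will differ): printing the tag itself includes the <td> markup, while .string returns only the enclosed text.

from bs4 import BeautifulSoup as bs

row = bs("<table><tr><td>1</td><td>Example University</td></tr></table>", "lxml")
cell = row.find_all("td")[1]
print(cell)         # <td>Example University</td>
print(cell.string)  # Example University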
The main function is done, but we still need to store the results in separate files with meaningful file names, and of course we are not going to type every city name by hand. Remember the row that raised the error earlier? That row happens to contain the city name we want. So the plan is: take the city name from the page, create a txt file named after that city, and write the schools we extracted into their respective files. To keep an error from stopping the whole run, we add another exception handler so the program can continue to the end.
def school():
    for i in range(2, 34):
        try:
            url = "http://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-%s.html" % (str(i))
            r = requests.get(url=url)
            soup = bs(r.content, "lxml")
            content2 = soup.find_all(name="td", attrs={"colspan": "7"})[0].string  # the city name cell
            f1 = open("%s.txt" % (content2), "w")
            content = soup.find_all(name="tr", attrs={"height": "29"})
            for content1 in content:
                try:
                    soup_content = bs(str(content1), "lxml")
                    soup_content1 = soup_content.find_all(name="td")
                    f1.write(soup_content1[1].string + "\n")
                    print(soup_content1[1].string)
                except IndexError:
                    pass
        except IndexError:
            pass
Complete code:
# coding=utf-8
import requests
import lxml
from bs4 import BeautifulSoup as bs

def school():
    for i in range(2, 34):
        try:
            url = "http://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-%s.html" % (str(i))
            r = requests.get(url=url)
            soup = bs(r.content, "lxml")
            content2 = soup.find_all(name="td", attrs={"colspan": "7"})[0].string  # the city name cell
            f1 = open("%s.txt" % (content2), "w")
            content = soup.find_all(name="tr", attrs={"height": "29"})
            for content1 in content:
                try:
                    soup_content = bs(str(content1), "lxml")
                    soup_content1 = soup_content.find_all(name="td")
                    f1.write(soup_content1[1].string + "\n")
                    print(soup_content1[1].string)
                except IndexError:
                    pass
        except IndexError:
            pass

if __name__ == '__main__':
    school()
Summary:
This program is not difficult; it uses no multithreading and no classes. It is very simple, and more code is not necessarily better: sometimes we just want to reach the goal quickly, and then the code should be as concise as possible. I hope this gives beginners some useful ideas. Finally, I should mention that I added one more exception handler, because another error similar to the one above came up; to get to the goal quickly I simply wrapped it in a try/except, and the program then ran normally.
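For reference, here is a slightly tidied sketch of the same collector (not the author's original code): it uses with open(...) so each city's file is closed automatically, skips rows whose second td has no text (one plausible source of the "other error" mentioned above), and catches request failures per page.

# coding=utf-8
# A tidied sketch of the same collector: files are closed automatically,
# empty cells are skipped, and failed page requests do not stop the run.
import requests
from bs4 import BeautifulSoup as bs

def school():
    for i in range(2, 34):  # pages 2 to 33
        url = "http://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-%s.html" % i
        try:
            r = requests.get(url, timeout=10)
            soup = bs(r.content, "lxml")
            city = soup.find_all(name="td", attrs={"colspan": "7"})[0].string
        except (requests.RequestException, IndexError):
            continue  # skip pages that fail to load or lack the city-name cell
        with open("%s.txt" % city, "w", encoding="utf-8") as f1:
            for row in soup.find_all(name="tr", attrs={"height": "29"}):
                cells = row.find_all(name="td")
                if len(cells) > 1 and cells[1].string:  # skip short header rows and empty cells
                    f1.write(cells[1].string + "\n")

if __name__ == '__main__':
    school()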
That's how to quickly write an information collector in Python. I hope the content above is helpful and gives you something more to learn from. If you think the article is good, feel free to share it so more people can see it.