Today we will walk through a Python crawler case study. The content is detailed and the logic is straightforward. Since most people are not yet very familiar with this topic, the article is shared here for your reference; I hope you get something out of it. Let's take a look.
Environment building
Since we are using Python, the first thing we need is a Python environment, so I downloaded version 3.5 from the official website. After installing it, I needed an editor; there are quite a few Python editors around, and I settled on PyCharm.
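To confirm which interpreter the editor actually picked up, a quick check from the Python console works (this snippet is generic, not from the original article):

import sys
print(sys.version)   # should report 3.5.x if the install above succeeded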
Send a request
At first I didn't know how Python makes web requests, nor what had changed between Python 2 and Python 3. After quite a few twists and turns in between, I finally wrote the simplest possible piece of request code.
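Before the code itself, a note on the 2-vs-3 point: most of the confusion comes down to a module reshuffle. What Python 2 exposed as urllib2 and urllib.urlencode was reorganized into urllib.request and urllib.parse in Python 3. A minimal side-by-side sketch (the url value is a placeholder, not from this article):

# Python 2
import urllib, urllib2
url = 'http://example.com/list'                       # placeholder URL
data = urllib.urlencode({'PageIndex': 1})
html = urllib2.urlopen(url + '?' + data).read()

# Python 3
import urllib.parse, urllib.request
url = 'http://example.com/list'                       # placeholder URL
data = urllib.parse.urlencode({'PageIndex': 1})
html = urllib.request.urlopen(url + '?' + data).read().decode('utf-8')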
import urllib.parse
import urllib.request

# params: CategoryId=808 CategoryType=SiteHome ItemListActionName=PostList PageIndex=3 ParentCategoryId=0 TotalPostCount=4000
def getHtml(url, values):
    # Pretend to be a regular browser so the server does not reject the request
    user_agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36'
    headers = {'User-Agent': user_agent}
    # Encode the query parameters and append them to the URL
    data = urllib.parse.urlencode(values)
    request = urllib.request.Request(url + '?' + data, headers=headers)
    response_result = urllib.request.urlopen(request).read()
    html = response_result.decode('utf-8')
    return html

# get data
def requestCnblogs(index):
    print('request data')
    url = 'http://www.cnblogs.com/mvc/AggSite/PostList.aspx'
    value = {
        'CategoryId': 808,
        'CategoryType': 'SiteHome',
        'ItemListActionName': 'PostList',
        'PageIndex': index,          # which page of the list to fetch
        'ParentCategoryId': 0,
        'TotalPostCount': 4000
    }
    result = getHtml(url, value)
    return result
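For reference, here is a minimal way to exercise the two functions above; the page index 1 is arbitrary:

if __name__ == '__main__':
    html = requestCnblogs(1)       # fetch page 1 of the post list
    print(len(html))               # crude sanity check on the raw HTML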
In fact, the cnblogs ("Blog Garden") endpoint is quite standard and well suited to crawling, since what it returns is plain HTML. (It would be even better if it returned JSON.)
Data parsing
As mentioned above, BeautifulSoup is used for parsing. Its advantage is that you don't have to write extraction rules by hand; you just follow its syntax, and after many attempts the data parsing finally worked. Let's first look at the HTML that comes back; once its structure is clear, the parsing code is much easier to follow.
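The sample HTML itself is not reproduced here, but the parsing idea can be sketched. The class names post_item and titlelnk below are assumptions about the cnblogs list markup, not something this article confirms:

from bs4 import BeautifulSoup

def parseHtml(html):
    # html is the string returned by requestCnblogs above
    soup = BeautifulSoup(html, 'html.parser')
    posts = []
    # 'post_item' / 'titlelnk' are assumed class names for each list entry
    for item in soup.select('div.post_item'):
        link = item.find('a', class_='titlelnk')
        if link is None:
            continue
        posts.append({
            'title': link.get_text(strip=True),
            'url': link.get('href')
        })
    return posts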