
Case analysis of Python crawler


Today I'll share a Python crawler case study. The content is detailed and the logic is clear. I believe most people still don't know much about this topic, so I'm sharing this article for your reference; I hope you get something out of it. Let's take a look.

Environment building

Since we're using Python, we can't do without a Python runtime, so I downloaded version 3.5 from the official website. After installing it, I picked an editor more or less at random and settled on PyCharm; there are quite a lot of Python editors to choose from.
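A quick way to confirm the interpreter is the one you expect (a trivial check, not from the original article):

import sys
print(sys.version)  # should report the 3.5.x build installed above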

Send a request

At first I had no idea how Python makes web requests, or what the difference between 2.x and 3.x is (in Python 2 the module is urllib2; in Python 3 it was split into urllib.request and urllib.parse). After quite a few twists and turns, I finally wrote the simplest possible piece of request code.

import urllib.parse
import urllib.request

# params: CategoryId=808 CategoryType=SiteHome ItemListActionName=PostList PageIndex=3 ParentCategoryId=0 TotalPostCount=4000
def getHtml(url, values):
    user_agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36'
    headers = {'User-Agent': user_agent}
    data = urllib.parse.urlencode(values)
    # wrap the URL in a Request so the User-Agent header is actually sent
    request = urllib.request.Request(url + '?' + data, headers=headers)
    response_result = urllib.request.urlopen(request).read()
    html = response_result.decode('utf-8')
    return html

# get data
def requestCnblogs(index):
    print('request data')
    url = 'http://www.cnblogs.com/mvc/AggSite/PostList.aspx'
    value = {
        'CategoryId': 808,
        'CategoryType': 'SiteHome',
        'ItemListActionName': 'PostList',
        'PageIndex': index,
        'ParentCategoryId': 0,
        'TotalPostCount': 4000,
    }
    result = getHtml(url, value)
    return result
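As a quick sanity check, the functions above can be exercised like this (a hypothetical usage snippet, not from the original article):

if __name__ == '__main__':
    html = requestCnblogs(1)  # fetch page 1 of the cnblogs post list
    print(html[:200])         # peek at the beginning of the returned HTML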

In fact, the cnblogs request is quite standard and well suited for crawling, because it returns plain HTML. (It would be even nicer if it returned JSON.)

Data parsing

As mentioned above, BeautifulSoup is used for parsing. Its advantage is that you don't have to write extraction rules from scratch; you just follow its syntax, and after several rounds of testing the parsing was done. It helps to look at the returned HTML first and then map it to the parsing code.
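The article's sample HTML and parsing code don't survive in this copy, so here is a minimal sketch of the idea instead; the selectors ('post_item', 'titlelnk') are assumptions about the cnblogs list markup, not taken from the article:

from bs4 import BeautifulSoup

def parseHtml(html):
    # the selectors below are assumed cnblogs markup; adjust them
    # after inspecting the real response
    soup = BeautifulSoup(html, 'html.parser')
    posts = []
    for item in soup.select('div.post_item'):
        link = item.select_one('a.titlelnk')
        if link is None:
            continue
        posts.append({
            'title': link.get_text(strip=True),
            'url': link.get('href'),
        })
    return posts

Each entry in the returned list pairs a post title with its URL, which is usually enough to build a simple crawl queue.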
