What are the basic knowledge points of Python crawlers?

This article mainly explains the basic knowledge points for getting started with Python crawlers. The explanations are simple and clear, and easy to learn and understand. Please follow along and study them step by step.

1. What is a crawler?

"Reptile" is a figurative term. The Internet is likened to a large web, and a crawler is a program or script that crawls on this large web. Encounter bugs (resources), if the required resources to obtain or download down. This resource is usually a web page, file, etc. You can continue to crawl these linked resources by following the url links inside the resource.

You can also think of a crawler as a simulation of our normal Internet browsing: open a web page and analyze its content to get what we want.

This is where the http protocol and related knowledge come in.

When we open a web page, we are basically opening a URL link. In that process, a lot actually happens.

Opening a URL link makes the browser automatically send a request to the server behind that link, telling the server, in effect, "I need the content at this URL, please return the data to me." The server processes the request, responds to it, and returns the result to the browser.

A crawler needs to simulate this process: following the http protocol, it constructs a request, sends it to the target server (identified by a URL link), and then waits for the server's response.

All the relevant data is in the response, and that is the basic logic of how a crawler works.

2. urllib2 implements GET requests

GET and POST are the two most common request methods (the HTTP standard defines several others as well, such as PUT, DELETE, HEAD and OPTIONS).

A GET request passes its parameters or data through the URL link itself. Opening a URL is generally a GET request, such as opening the Baidu or Google homepage.

Sometimes, parameters need to be passed along with the link.

For example, after searching Baidu for a word, I found that the link became https://www.baidu.com/s?ie=UTF-8&wd=test

Notice the question mark here, followed by a string of data: the data behind the question mark is the parameters of the GET request, and there are two of them.

1) ie=UTF-8

2) wd=test

Parameters are joined together with the & symbol. Within each parameter, the part before the equals sign is the parameter name and the part after it is the parameter value.

For example, the second parameter means that the Baidu search keyword is "test". The first parameter sets the input encoding; it is optional and is included here only as an illustration.
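To make this concrete, here is a minimal sketch (using urllib, the same helper as in the full script below) of how a set of parameters is turned into such a query string:

# A minimal sketch: building the query string above with urllib.urlencode
import urllib

query = urllib.urlencode([('ie', 'UTF-8'), ('wd', 'test')])
print(query)                                # ie=UTF-8&wd=test
print('https://www.baidu.com/s?' + query)   # the full search link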

Next, I use urllib2 to simulate the Baidu search; the code is as follows:

#coding: utf-8
import urllib
import urllib2

# First half of the link (note: http, not https)
url_pre = 'http://www.baidu.com/s'

# GET parameters
params = {}
params['wd'] = u'test'.encode('utf-8')
url_params = urllib.urlencode(params)

# Full GET request link
url = '%s?%s' % (url_pre, url_params)

# Open the link and get the response
response = urllib2.urlopen(url)

# Get the html
html = response.read()

# Save the html to a file
with open('test.txt', 'w') as f:
    f.write(html)

After executing the code, open test.txt to see the crawled content.
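Note that urllib2 is a Python 2 library. On Python 3 it has been merged into urllib.request (and urlencode lives in urllib.parse); here is a minimal sketch of the same request on Python 3, assuming the page is utf-8 encoded:

# coding: utf-8
# Minimal Python 3 sketch of the same GET request
from urllib.parse import urlencode
from urllib.request import urlopen

url = 'http://www.baidu.com/s?' + urlencode({'wd': 'test'})

# Open the link and read the html; read() returns bytes, so decode to text
response = urlopen(url)
html = response.read().decode('utf-8')

# Save the html to a file
with open('test.txt', 'w', encoding='utf-8') as f:
    f.write(html)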

5. Set headers to deal with anti-crawling

Some servers check the request headers to prevent crawling. Headers are data sent to the server along with the request; from them the server can tell the browser type, whether the visit came from a mobile device or a computer, which page the link was followed from, and so on.

If the server finds that the request does not look like normal browser access, it will simply refuse it.

So, to simulate browser behaviour more closely, we also need to set the headers.

#coding: utf-8
import urllib, urllib2

# Set the header
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}

# Construct a Request; the second argument is the data (None here)
url = 'http://www.server.com/login'
request = urllib2.Request(url, None, headers)

# Send the Request and read the response
response = urllib2.urlopen(request)
html = response.read()

Similarly, if you do not know which headers to set, you can capture them with packet-capture software such as Fiddler.
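For completeness, a POST request (the other common method mentioned in section 2) works the same way, except the parameters go into the data argument of Request instead of the URL. A minimal sketch, where the login URL and the form field names are only placeholders:

#coding: utf-8
import urllib, urllib2

url = 'http://www.server.com/login'    # placeholder URL, as above
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}

# Hypothetical form fields; POST data is sent in the request body, not the URL
data = urllib.urlencode({'username': 'test', 'password': 'test'})

# Passing a non-None data argument makes urllib2 send a POST instead of a GET
request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)
html = response.read()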

6. Parse html

Everything said so far has been about obtaining the web page content, the html. Now that we have the html, how do we parse it and extract the data we need?

The html we get is essentially a string. The most basic way to process a string is with the built-in string functions, but that is inefficient and error-prone.

You can also use regular expressions to work on the string. There is a lot to learn about that topic, which you can explore on your own.
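As a quick illustration of the regular-expression route (only a sketch; the BeautifulSoup approach below is the one recommended here):

#coding: utf-8
# A minimal sketch: pulling the text out of a p tag with a regular expression
import re

html = '<p id="test_p" class="test">test1</p>'
match = re.search(r'<p[^>]*>(.*?)</p>', html)
if match:
    print(match.group(1))    # test1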

Here, the approach I want to show you is Beautiful Soup.

BeautifulSoup is a library for parsing html/xml. It is a third-party library, installed as follows:

pip install beautifulsoup4
pip install lxml

The lxml library is installed to speed up html parsing.

First we set up a small piece of html content; parsing it with BeautifulSoup works as follows:

#coding: utf-8
from bs4 import BeautifulSoup

# Assume a small piece of html (any html will do; this one uses the id and class
# referred to in the examples below)
html = '''
<body>
<p id="test_p" class="test">test1</p>
<p class="test">test2</p>
</body>
'''

# Parse the html using lxml
soup = BeautifulSoup(html, 'lxml')

soup is the parsed result. We can get the corresponding node according to the html structure. For example, to get the p tag:

p = soup.body.p

But this way we only get the first matching node. If there are many p nodes under the body tag, this method cannot get all of them.

Here, we can use the find_all or select methods to get them all. I recommend the select method, which can be used almost like a jQuery selector. For example:

p1 = soup.select('p')           # Get p tags
p2 = soup.select('#test_p')     # Get the tag with id test_p
p3 = soup.select('.test')       # Get tags with class test
p4 = soup.select('body .test')  # Get tags with class test under body
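For reference, the same lookups can also be written with find_all (a minimal sketch; select remains the recommended method here):

all_p = soup.find_all('p')              # Get all p tags
by_id = soup.find_all(id='test_p')      # Get tags with id test_p
by_class = soup.find_all(class_='test') # Get tags with class test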

The complete code, which prints the result:

#coding: utf-8
from bs4 import BeautifulSoup

# Assume the same small piece of html as above
html = '''
<body>
<p id="test_p" class="test">test1</p>
<p class="test">test2</p>
</body>
'''

# Parse the html
soup = BeautifulSoup(html, 'lxml')

# Get all p tags
for p in soup.select('p'):
    print(p)

With this method, all the p tags can be printed.

What if I want to get the attributes and text of the p tags? The method is as follows:

for p in soup.select('p'):
    print(p.name)      # Tag name
    # Tag attribute; p['id'] also works, but raises an error if the attribute
    # does not exist, just like dictionary key access
    print(p.get('id'))
    print(p.string)    # Tag content

If a tag contains many child tags, you can use select again to go further down.

To get the text content of all the child tags under a tag, you can use the strings attribute to get a generator, but it may contain a lot of line breaks and spaces. To strip the line breaks and spaces, use the stripped_strings attribute instead. As follows:

print(''.join(soup.body.strings))
print(' '.join(soup.body.stripped_strings))

We will get:

u'\ntest1\ntest2\n'
u'test1 test2'

Thank you for reading. The above covers the basic knowledge points for getting started with Python crawlers. After studying this article, I believe you have a deeper understanding of the topic, though the details still need to be verified in practice. More articles on related knowledge points will follow, so stay tuned!
