
How to use Python web crawler to extract information


Many inexperienced readers are at a loss when it comes to using a Python web crawler to extract information. This article therefore summarizes the problem and its solutions, and I hope it helps you solve it.

Below is an article on Python web crawlers and information extraction, explained with examples. I think it is quite good and share it here as a reference. Let's have a look together.

Course architecture:

1. Requests library: automatically crawls HTML pages and submits web requests.
2. robots.txt: the web crawler exclusion standard.
3. BeautifulSoup library: parses HTML pages.
4. Re library: regular expressions, for extracting key information from pages.
5. Scrapy framework: an introduction to web crawler principles and to a professional crawler framework.

Concept: The Website is the API …

Python development tools:
- Text-editor IDEs: IDLE, Notepad++, Sublime Text, Vim & Emacs, Atom, Komodo Edit.
- Integrated IDEs: PyCharm, Wing, PyDev & Eclipse, Visual Studio, Anaconda & Spyder, Canopy.
- IDLE is the default, entry-level editor bundled with Python; it is suitable for shorter programs.
- Sublime Text is a third-party editor built specifically for programmers; it improves the editing experience and offers a variety of editing styles.
- Wing is a paid IDE from Wingware with rich debugging, version control, and synchronization features; it is suitable for multi-person development and for writing large programs.
- Visual Studio is maintained by Microsoft; with PTVS configured it can be used to write Python, mainly on Windows, and has rich debugging features.
- Eclipse is an open-source IDE; Python can be written by configuring PyDev, but the configuration process is complex and requires some development experience.
- PyCharm comes in a free Community edition and a paid Professional edition; it is simple and highly integrated, and suitable for more complex projects.
- IDEs for scientific computing and data analysis: Canopy is a paid tool maintained by Enthought that supports nearly 500 third-party libraries and suits application development in scientific computing; Anaconda is open source and free and supports nearly 800 third-party libraries.

Installing the Requests library: the Requests library is widely recognized as the best Python third-party library for crawling web pages, noted for its simplicity. Official website: http://www.python-requests.org. On Windows, find "cmd.exe" in "C:\Windows\System32", run it as administrator, and enter "pip install requests" at the command line.
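A quick way to confirm the installation succeeded (a small check of my own, not part of the original walkthrough) is to import the library and print its version string:

import requests
# a successful install lets this import run and prints the installed version
print(requests.__version__)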

Use IDLE to test the Requests library:

>>> import requests
>>> r = requests.get("http://www.baidu.com")  # crawl the Baidu home page
>>> r.status_code
>>> r.encoding = 'utf-8'
>>> r.text


The seven main methods of the Requests library

The get() method: r = requests.get(url). The get() method constructs a Request object that asks the server for a resource and returns a Response object containing the server's resource.

requests.get(url, params=None, **kwargs)
- url: the URL link of the page to fetch
- params: extra parameters added to the URL, in dictionary or byte-stream format, optional
- **kwargs: 12 optional parameters that control access

The two important objects in the Requests library are Request and Response. The Response object contains everything the crawler got back. Its properties:
- r.status_code: the return status of the HTTP request; 200 means the connection succeeded, 404 means failure.
- r.text: the string form of the HTTP response content, i.e. the page content at the URL.
- r.encoding: the content encoding guessed from the HTTP header; if there is no charset field in the header, it is assumed to be ISO-8859-1. r.text is displayed according to r.encoding.
- r.apparent_encoding: the encoding inferred from the page content itself, which can be used as an alternative to r.encoding.
- r.content: the binary form of the HTTP response content.

Exceptions in the Requests library, and a general code framework for crawling web pages:

r.raise_for_status(): if the status is not 200, it raises requests.HTTPError. The check that r.status_code equals 200 happens inside the method, so no extra if statement is needed, which makes try-except exception handling convenient.

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # if the status is not 200, an HTTPError exception is raised
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "an exception occurred"

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))
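As a hedged refinement of the framework above (an addition, not part of the original course code), the bare except can be narrowed to requests.RequestException, the base class of the library's own exceptions, so that unrelated errors are not silently swallowed:

import requests

def getHTMLTextSafe(url):
    # same general framework as above, but only Requests' own exceptions are caught
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "an exception occurred"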


A general code framework makes web crawling more efficient, stable, and reliable.

The HTTP protocol: HTTP, the Hypertext Transfer Protocol, is a stateless application-layer protocol based on a request-response model. The HTTP protocol uses the URL as the identifier for locating network resources.

URL format: http://host[:port][path]

host: a legal Internet host domain name or IP address

port: the port number; the default port number is 80.

path: the path of the requested resource.

Understanding HTTP URLs: a URL is the Internet path for accessing a resource through the HTTP protocol, and each URL corresponds to one data resource.
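To make the host/port/path breakdown concrete, here is a small sketch (an addition, using the standard library's urllib.parse; the URL is only an illustration) that splits a URL into those parts:

from urllib.parse import urlsplit

# split an example URL into the components described above
parts = urlsplit("http://www.example.com:80/path/to/resource")
print(parts.hostname)  # www.example.com   -> host
print(parts.port)      # 80                -> port (None if omitted)
print(parts.path)      # /path/to/resource -> path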

Operations on resources in the HTTP protocol, and the difference between PATCH and PUT: suppose the URL location holds a set of data UserInfo containing 20 fields such as UserID and UserName, and the requirement is that the user has modified UserName while everything else stays the same. With PATCH, only a local update request for UserName is submitted to the URL. With PUT, all 20 fields must be submitted to the URL, and any field that is not submitted is deleted. The main benefit of PATCH is saving network bandwidth (a short sketch contrasting the two follows the params example below).

Parsing the main method of the Requests library: requests.request(method, url, **kwargs)
- method: the request method, corresponding to the seven types such as GET, PUT, and POST, e.g. r = requests.request('OPTIONS', url, **kwargs)
- url: the URL link of the page to fetch
- **kwargs: 13 parameters that control access, all optional:

params: dictionary or byte sequence, added to the URL as parameters. Example:

kv = {'key1': 'value1', 'key2': 'value2'}
r = requests.request('GET', 'http://python123.io/ws', params=kv)
print(r.url)
# http://python123.io/ws?key1=value1&key2=value2
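Returning to the PATCH versus PUT comparison above, the sketch below (an addition; the URL and field names are hypothetical) shows how the two requests would differ for the UserInfo example:

import requests

url = "http://example.com/userinfo"  # hypothetical location of the UserInfo record

# PATCH: submit only the changed field; the other fields on the server stay untouched
r = requests.patch(url, data={'UserName': 'new_name'})

# PUT: the whole record must be submitted, otherwise unsubmitted fields are lost
full_record = {'UserID': '001', 'UserName': 'new_name'}  # ...plus the remaining fields
r = requests.put(url, data=full_record)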


data: dictionary, byte sequence, or file object, used as the content (body) of the Request.
json: data in JSON format, used as the content of the Request.
headers: dictionary, HTTP custom headers. Example:

hd = {'user-agent': 'Chrome/10'}
r = requests.request('POST', 'http://www.yanlei.shop', headers=hd)
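To illustrate the difference between the data and json parameters described above, here is a short sketch (an addition; httpbin.org is assumed as a public echo endpoint):

import requests

payload = {'key1': 'value1'}

# data= sends the dictionary as a form-encoded request body
r1 = requests.request('POST', 'http://httpbin.org/post', data=payload)

# json= serializes the dictionary to JSON and sets the Content-Type header accordingly
r2 = requests.request('POST', 'http://httpbin.org/post', json=payload)

print(r1.status_code, r2.status_code)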


cookies: dictionary or CookieJar, the cookies in the Request.
auth: tuple, supports HTTP authentication.
files: dictionary type, for transferring files. Example:

fs = {'file': open('data.xls', 'rb')}
r = requests.request('POST', 'http://python123.io/ws', files=fs)


timeout: timeout for the request, in seconds.
proxies: dictionary type, sets access proxy servers; login authentication can be added.
allow_redirects: True/False, default True; redirect switch.
stream: True/False, default True; switch for downloading the content immediately.
verify: True/False, default True; switch for SSL certificate verification.
cert: path to a local SSL certificate.

# methods and their parameters

requests.get(url, params=None, **kwargs)
requests.head(url, **kwargs)
requests.post(url, data=None, json=None, **kwargs)
requests.put(url, data=None, **kwargs)
requests.patch(url, data=None, **kwargs)
requests.delete(url, **kwargs)
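As a brief illustration of these wrappers and the control parameters above (an addition; httpbin.org is assumed as a test endpoint), the sketch below issues a HEAD request with a timeout and inspects only the response headers:

import requests

# HEAD fetches only the response headers, which is a cheap way to probe a resource
r = requests.head("http://httpbin.org/get", timeout=10)
print(r.status_code)
print(r.headers.get('Content-Type'))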


Problems caused by web crawlers:
- Performance harassment: limited by the skill and purpose of their authors, web crawlers can place a heavy load on web servers.
- Legal risk: the data on a server has property rights, and profiting from data obtained by a crawler carries legal risk.
- Privacy disclosure: web crawlers may be able to break through simple access controls and obtain protected data, disclosing personal privacy.

Ways of limiting web crawlers:
- Source review: check the User-Agent field of the incoming HTTP request header and only respond to visits from browsers or friendly crawlers.
- Announcement: the Robots protocol tells all crawlers the site's crawling policy and asks them to comply.

Robots protocol (Robots Exclusion Standard, the web crawler exclusion standard). Function: the website tells web crawlers which pages may be crawled and which may not. Format: a robots.txt file in the root directory of the website. Case: JD.com's Robots protocol, http://www.jd.com/robots.txt (note: * means all, / means the root directory):

User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /


Using the Robots protocol: a web crawler identifies robots.txt, automatically or manually, and then crawls the content accordingly.
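A minimal sketch of the "automatic identification" route (an addition, using the standard library's urllib.robotparser; the crawler names follow the JD.com listing above):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.jd.com/robots.txt")
rp.read()  # download and parse the robots.txt shown above

# per the listing, a generic crawler may fetch ordinary pages,
# while EtaoSpider is disallowed from the whole site
print(rp.can_fetch("*", "http://www.jd.com/somepage.html"))
print(rp.can_fetch("EtaoSpider", "http://www.jd.com/"))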

Binding force: the Robots protocol is advisory rather than binding; a web crawler may choose not to comply, but it then bears legal risk.

Web crawling practice with the Requests library:

1. A JD.com product page

Url = "https://item.jd.com/5145492.html"

Try:

R = requests.get (url)

R.raise_for_status ()

R.encoding = r.apparent_encoding

Print (r.text [: 1000])

Except:

Print (crawl failed)


2. An Amazon product page. Crawling an Amazon product page directly is denied access (by default Requests identifies itself as python-requests, which the site rejects), so a browser-like 'user-agent' field needs to be added.

import requests

url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
try:
    kv = {'user-agent': 'Mozilla/5.0'}  # identify as a browser
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])
except:
    print("crawl failed")


3. Submitting keywords to the Baidu / 360 search interfaces.
Baidu keyword interface: http://www.baidu.com/s?wd=keyword
360 keyword interface: http://www.so.com/s?q=keyword

# Baidu

import requests

keyword = "Python"
try:
    kv = {'wd': keyword}
    r = requests.get("http://www.baidu.com/s", params=kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except:
    print("crawl failed")


# 360

import requests

keyword = "Python"
try:
    kv = {'q': keyword}
    r = requests.get("http://www.so.com/s", params=kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except:
    print("crawl failed")


4. Crawling and storing web images. Format of a web image link: http://www.example.com/picture.jpg. National Geographic: http://www.nationalgeographic.com.cn/. Choose an image link, for example: https://cache.yisu.com/upload/information/20200703/146/45150.jpg. Full image-crawling code:

import requests
import os

url = "https://cache.yisu.com/upload/information/20200703/146/45150.jpg"
root = "D://pics//"
path = root + url.split('/')[-1]
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path, 'wb') as f:
            f.write(r.content)
        print("File saved successfully")
    else:
        print("File already exists")
except:
    print("crawl failed")
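For larger files, a hedged variant of the download above (an addition) combines the stream parameter mentioned earlier with iter_content, so the body is written to disk in chunks instead of being held in memory all at once:

import requests

url = "https://cache.yisu.com/upload/information/20200703/146/45150.jpg"
path = "D://pics//" + url.split('/')[-1]

r = requests.get(url, stream=True)  # defer downloading the response body
r.raise_for_status()
with open(path, 'wb') as f:
    for chunk in r.iter_content(chunk_size=8192):  # write the body chunk by chunk
        f.write(chunk)
print("File saved successfully")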


5. Automatic lookup of where an IP address is located, using the IP query service at www.ip138.com: http://ip138.com/ips138.asp?ip=ipaddress or http://m.ip138.com/ip.asp?ip=ipaddress

Url = "http://m.ip138.com/ip.asp?ip="

Ip = "220.204.80.112"

Try:

R = requests.get (url + ip)

R.raise_for_status ()

R.encoding = r.apparent_encoding

Print (r.text[ 1900:])

Except:

Print (crawl failed)


# using IDLE
>>> import requests
>>> url = "http://m.ip138.com/ip.asp?ip="
>>> ip = "220.204.80.112"
>>> r = requests.get(url + ip)
>>> r.status_code
>>> r.text

After reading the above, have you mastered how to use a Python web crawler to extract information? If you want to learn more skills or find out more about the topic, you are welcome to follow the industry information channel. Thank you for reading!
