What are the libraries of Python crawler

2025-01-18 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 06/02 Report --

This article mainly explains which libraries Python crawlers use. The explanations are simple, clear, and easy to learn and understand. Follow along to learn which libraries Python crawlers rely on.

1. Request library

1. Requests

GitHub: https://github.com/psf/requests

Requests is probably the most popular and most practical request library for crawlers today, and it is very user-friendly. I have also written a separate article on using the Requests library in Python, which you can take a look at.

For the most detailed use of requests, you can refer to the official document: https://requests.readthedocs.io/en/master/

A small usage example:

>>> import requests
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
u'{"type": "User"...'
>>> r.json()
{u'disk_usage': 368627, ...}

2. Urllib3

GitHub: https://github.com/urllib3/urllib3

Urllib3 is a very powerful HTTP client library that provides thread-safe connection pooling and a series of utilities for working with URLs.

For more information on how to use it, please refer to: https://urllib3.readthedocs.io/en/latest/

A small usage example:

>>> import urllib3
>>> http = urllib3.PoolManager()
>>> r = http.request('GET', 'http://httpbin.org/robots.txt')
>>> r.status
200
>>> r.data
b'User-agent: *\nDisallow: /deny\n'

3. Selenium

GitHub: https://github.com/SeleniumHQ/selenium

An automated testing tool. It is a driver for the browser: through this library you can call a real browser directly to complete certain operations, such as entering a CAPTCHA.

This library is not limited to Python; Selenium also has bindings for Java, C#, and other languages.

For information about how the Python language uses this library, you can visit https://seleniumhq.github.io/selenium/docs/api/py/ to see the official documentation.

A small usage example:

from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://seleniumhq.org/')

4. aiohttp

GitHub: https://github.com/aio-libs/aiohttp

An HTTP framework based on asyncio. With the async/await keywords, this asynchronous library can greatly improve the efficiency of fetching data.

This is an asynchronous library that advanced crawler writers must master. For more details about aiohttp, see the official documentation: https://aiohttp.readthedocs.io/en/stable/

A small usage example:

import aiohttp
import asyncio

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, 'http://python.org')
        print(html)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

2. Parsing library

1. BeautifulSoup

Official document: https://www.crummy.com/software/BeautifulSoup/

Parses HTML and XML and extracts information from web pages, with a powerful API and a variety of parsing backends. It is a parsing library I use often, very handy for HTML parsing, and a must-master library for anyone writing crawlers.
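As a small sketch of the idea, the snippet below parses an HTML fragment with BeautifulSoup's built-in html.parser backend. The HTML, tag names, and selectors are all made up for illustration:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1 id="title">Hello</h1>
  <ul>
    <li class="item">one</li>
    <li class="item">two</li>
  </ul>
</body></html>
"""

# Parse with the stdlib html.parser backend; "lxml" also works if installed
soup = BeautifulSoup(html, "html.parser")

title = soup.h1.get_text()                                # access a tag by name
items = [li.get_text() for li in soup.select("li.item")]  # CSS selector query

print(title)   # Hello
print(items)   # ['one', 'two']
```

Switching the second argument of `BeautifulSoup()` swaps the parsing backend without changing the rest of the code.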

2. lxml

GitHub: https://github.com/lxml/lxml

Supports HTML and XML parsing, supports XPath queries, and parses very efficiently.
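A minimal sketch of XPath extraction with lxml; the HTML fragment and the XPath expressions are invented for the example:

```python
from lxml import etree

html = ("<html><body><div class='post'>"
        "<a href='/page1'>First</a>"
        "<a href='/page2'>Second</a>"
        "</div></body></html>")

# etree.HTML() builds a tree even from imperfect HTML
tree = etree.HTML(html)

# XPath queries: text nodes and attribute values
texts = tree.xpath("//div[@class='post']/a/text()")
hrefs = tree.xpath("//div[@class='post']/a/@href")

print(texts)   # ['First', 'Second']
print(hrefs)   # ['/page1', '/page2']
```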

3. pyquery

GitHub: https://github.com/gawel/pyquery

A Python implementation of jQuery that lets you query and manipulate HTML documents with jQuery-style syntax; it is easy to use and parses quickly.

3. Storage library

1. pymysql

GitHub: https://github.com/PyMySQL/PyMySQL

Official document: https://pymysql.readthedocs.io/en/latest/

A pure-Python implementation of a MySQL client library. Very practical and very simple.

2. pymongo

GitHub: https://github.com/mongodb/mongo-python-driver

Official document: https://api.mongodb.com/python/

As the name implies, a library for connecting directly to a MongoDB database and running queries against it.

3. redis-dump

Redis-dump is a tool for converting Redis data to and from JSON. It is written in Ruby, so it requires a Ruby environment; newer versions of redis-dump require Ruby 2.2.2 or above, while yum on CentOS only provides Ruby 2.0. In that case you need to install rvm, Ruby's version manager, and use it to install a newer Ruby first.

Thank you for reading. That covers which libraries Python crawlers use. After studying this article, I believe you have a deeper understanding of the topic; specific usage still needs to be verified in practice.
