2025-02-25 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article takes a detailed look at the Python crawler tools available today. I hope it serves as a useful reference and leaves you with a solid understanding of the landscape.
A list of common crawler-related modules.
Network
General
Urllib-Network Library (stdlib).
Requests-Network library.
Grab-Network library (based on pycurl).
Pycurl-Network library (bindings for libcurl).
Urllib3-Python HTTP library with thread-safe connection pooling, file post support, and high availability.
Httplib2-Network library.
RoboBrowser-A simple, Pythonic library for browsing the web without a standalone browser.
MechanicalSoup-A Python library that automatically interacts with websites.
Mechanize-stateful, programmable Web browsing library.
Socket-underlying network interface (stdlib).
Unirest for Python-Unirest is a lightweight HTTP library that can be used in many languages.
Hyper-HTTP/2 client for Python.
PySocks-An updated, actively maintained fork of SocksiPy with bug fixes and extra features. A drop-in replacement for the socket module.
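Most of the libraries above share the same basic shape: build a request, attach headers, send it. A minimal sketch using only the stdlib urllib (the URL and User-Agent string are illustrative; nothing is sent over the network until urlopen is invoked):

```python
from urllib.request import Request

# Build a request with a custom User-Agent; no network call happens
# until urlopen(req) is actually invoked.
req = Request(
    "https://example.com/page",
    headers={"User-Agent": "my-crawler/0.1"},
)
print(req.full_url)                  # https://example.com/page
print(req.get_header("User-agent"))  # my-crawler/0.1
```

Note that urllib capitalizes header names internally, hence the "User-agent" lookup.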
Async
Treq-API similar to requests (based on twisted).
Aiohttp-HTTP client / server (PEP-3156) for asyncio.
Web crawler framework
Full-featured crawlers
Grab-Web crawler framework (based on pycurl/multicurl).
Scrapy-Web crawler framework (based on Twisted); modern releases support Python 3.
Pyspider-A powerful crawler system.
Cola-A distributed crawler framework.
Other
Portia-A visual crawler based on Scrapy.
Restkit-HTTP resource kit for Python. It allows you to easily access HTTP resources and build objects around it.
Demiurge-PyQuery-based crawler micro-framework.
HTML/XML parser
General
Lxml-Efficient HTML/XML processing library written in C. Supports XPath.
Cssselect-selects DOM tree elements with CSS selectors.
Pyquery-queries DOM trees with jQuery-like selectors.
BeautifulSoup-Slower, pure-Python HTML/XML processing library.
Html5lib-builds DOM trees for HTML/XML documents according to the WHATWG specification, the one followed by all modern browsers.
Feedparser-parses the RSS/ATOM feeds.
MarkupSafe-implements strings safely escaped for XML/HTML/XHTML.
Xmltodict-A Python module that makes working with XML feel like working with JSON.
Xhtml2pdf-converts HTML/CSS to PDF.
Untangle-easily converts XML files to Python objects.
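For a taste of what these parsers do, here is a small link extractor built on the stdlib html.parser, standing in for the heavier libraries above (the sample HTML is made up):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html = '<p>See <a href="/docs">docs</a> and <a href="https://example.com">home</a>.</p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/docs', 'https://example.com']
```

Libraries like lxml or BeautifulSoup give the same result with far less ceremony, plus real-world error tolerance.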
Cleaning
Bleach-cleans up the HTML (requires html5lib).
Sanitize-brings sanity to a messy data world.
Text processing
A library for parsing and manipulating simple text.
General
Difflib-(Python standard library) helps compute differences between sequences.
Levenshtein-quickly calculate Levenshtein distance and string similarity.
Fuzzywuzzy-Fuzzy string matching.
Esmre-regular expression Accelerator.
Ftfy-automatically fixes broken Unicode text, repairing mojibake.
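Difflib and the fuzzy-matching libraries above all boil down to string similarity scores. A quick stdlib-only sketch (the strings are illustrative):

```python
import difflib

# Similarity ratio between two strings, in [0, 1].
ratio = difflib.SequenceMatcher(None, "crawler", "crawlers").ratio()
print(round(ratio, 2))  # 0.93

# Fuzzy lookup of the nearest candidates, like a tiny Fuzzywuzzy.
close = difflib.get_close_matches("scrapy", ["scrapy", "scrapely", "requests"], n=2)
print(close)  # ['scrapy', 'scrapely']
```

Levenshtein and Fuzzywuzzy compute comparable scores much faster on large workloads.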
Conversion
Unidecode-converts Unicode text to ASCII.
Character encoding
Uniout-prints readable characters instead of escaped strings.
Chardet-character encoding detector compatible with Python 2 and 3.
Xpinyin-A library that converts Chinese characters into pinyin.
Pangu.py-inserts spacing between CJK and alphanumeric characters in text.
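Unidecode-style transliteration can be approximated for Latin scripts with nothing but the stdlib. A rough sketch (real Unidecode handles far more scripts than this accent-folding trick):

```python
import unicodedata

def ascii_fold(text: str) -> str:
    """Rough ASCII transliteration: decompose accented characters,
    then drop anything that is not ASCII."""
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")

print(ascii_fold("Café déjà vu"))  # Cafe deja vu
```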
Slug
Awesome-slugify-A Python slugify library that can retain unicode.
Python-slugify-A Python slugify library that can convert Unicode to ASCII.
Unicode-slugify-A tool that can generate Unicode slugs.
Pytils-A simple tool for handling Russian strings (including pytils.translit.slugify).
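All of these slugify libraries follow the same recipe: normalize, strip, lowercase, join with hyphens. A minimal stdlib-only sketch (not the exact behavior of any one library above):

```python
import re
import unicodedata

def slugify(text: str) -> str:
    """ASCII slug: fold accents, collapse non-alphanumerics to '-', lowercase."""
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^a-zA-Z0-9]+", "-", text).strip("-")
    return text.lower()

print(slugify("Héllo, Wörld!"))  # hello-world
```

The libraries above add the hard parts: Unicode-preserving mode, transliteration tables, and collision handling.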
General parsers
PLY-Python implementation of the lex and yacc parsing tools.
Pyparsing-A general-purpose framework for building parsers.
Person names
Python-nameparser-A component that parses a person's name.
Phone number
Phonenumbers-parsing, formatting, storing and verifying international phone numbers.
User agent string
Python-user-agents-the parser for the browser user agent.
HTTP Agent Parser-HTTP proxy parser for Python.
Processing files in specific formats
Libraries that parse and process specific text formats.
General
Tablib-A module that exports data to XLS, CSV, JSON, YAML, and so on.
Textract-extract text from various files, such as Word, PowerPoint, PDF, and so on.
Messytables-A tool for parsing messy tabular data.
Rows-A common data interface that supports many formats (currently CSV, HTML, XLS, TXT; more planned).
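Before reaching for Tablib or Messytables, note that the stdlib csv module already handles quoting and embedded delimiters correctly. A small sketch with made-up data:

```python
import csv
import io

# Parse CSV text in memory; csv handles quoted fields with embedded commas.
raw = 'name,price\n"Widget, large",9.99\nGadget,4.50\n'
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["name"])          # Widget, large
print(float(rows[1]["price"]))  # 4.5
```

The libraries above earn their keep when the input is messy, multi-sheet, or needs round-tripping to XLS/JSON/YAML.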
Office
Python-docx-reads, queries, and modifies Microsoft Word 2007/2008 docx files.
Xlwt / xlrd-writes and reads data and formatting information from Excel files.
XlsxWriter-A Python module that creates Excel.xlsx files.
Xlwings-A BSD-licensed library that makes it easy to call Python in Excel and vice versa.
Openpyxl-A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
Marmir-takes Python data structures and turns them into spreadsheets.
PDFMiner-A tool for extracting information from PDF documents.
PyPDF2-A library that can split, merge, and transform PDF pages.
ReportLab-allows you to quickly create rich PDF documents.
Pdftables-extract the table directly from the PDF file.
Markdown
Python-Markdown-A Python implementation of John Gruber's Markdown.
Mistune-the fastest, full-featured Markdown pure Python parser.
Markdown2-A fast Markdown implemented entirely in Python.
YAML
PyYAML-A YAML parser for Python.
CSS
Cssutils-A CSS library for Python.
ATOM/RSS
Feedparser-General purpose feed parser.
SQL
Sqlparse-A non-validated SQL statement parser.
HTTP
Http-parser-HTTP request/response message parser implemented in C.
Microformat
Opengraph-A Python module used to parse Open Graph protocol tags.
Portable Executable
Pefile-A multi-platform module for parsing and processing portable executable (PE) files.
PSD
Psd-tools-reads Adobe Photoshop PSD files into Python data structures.
Natural language processing
Libraries that deal with human language.
NLTK-the best platform for writing Python programs to process human language data.
Pattern-A web mining module for Python. It bundles natural language processing tools, machine learning, and more.
TextBlob-provides a consistent API for common natural language processing tasks. Stands on the shoulders of NLTK and Pattern.
Jieba-Chinese word segmentation tool.
SnowNLP-Chinese text processing library.
Loso-another Chinese word segmentation library.
Genius-Chinese word segmentation based on conditional random fields.
Langid.py-A standalone language identification system.
Korean-A library for Korean morphology.
Pymorphy2-Russian morphological analyzer (part of speech tagging + word form change engine).
PyPLN-A distributed natural language processing pipeline written in Python. The project aims to make it easy to process large corpora with NLTK through a web interface.
Browser automation and simulation
Selenium-automates real browsers (Chrome, Firefox, Opera, IE).
Ghost.py-A wrapper around PyQt's WebKit (requires PyQt).
Spynner-A wrapper around PyQt's WebKit (requires PyQt).
Splinter-General API browser emulator (selenium web driver, Django client, Zope).
Multiprocessing
Threading-(Python standard library) thread-based parallelism. Effective for I/O-bound tasks; of little use for CPU-bound tasks because of the Python GIL.
Multiprocessing-the standard Python library runs multiple processes.
Celery-Asynchronous task queue / job queue based on distributed messaging.
Concurrent.futures-provides a high-level interface for asynchronously executing callables.
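The division of labor described above, threads for I/O-bound work, is easiest to see through concurrent.futures; fetch below is only a stand-in for a real download:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url: str) -> str:
    # Stand-in for a real download; I/O-bound work is where threads help.
    return f"fetched:{url}"

urls = ["https://a.example", "https://b.example", "https://c.example"]

# map() runs fetch across the pool and preserves input order in the results.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch, urls))
print(results)
```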
Async
Asynchronous network programming library
Asyncio-(Python standard library, version 3.4+) asynchronous I/O, event loops, coroutines, and tasks.
Twisted-an event-driven network engine framework.
Tornado-A network framework and asynchronous network library.
Pulsar-Python event-driven concurrency framework.
Diesel-A greenlet-based event I/O framework for Python.
Gevent-A coroutine-based Python network library that uses greenlet.
Eventlet-Asynchronous framework with WSGI support.
Tomorrow-magic decorator syntax for writing asynchronous code.
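The asyncio building blocks these frameworks rely on fit in a few lines; fetch here merely simulates network I/O with a no-op sleep:

```python
import asyncio

async def fetch(url: str) -> str:
    # Stand-in for an aiohttp request; sleep(0) just yields to the event loop.
    await asyncio.sleep(0)
    return f"done:{url}"

async def main() -> list:
    # gather() runs both coroutines concurrently on one thread.
    return await asyncio.gather(fetch("https://a.example"), fetch("https://b.example"))

results = asyncio.run(main())
print(results)  # ['done:https://a.example', 'done:https://b.example']
```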
Queue
Celery-Asynchronous task queue / job queue based on distributed messaging.
Huey-small multithreaded task queue.
Mrq-Mr. Queue-A distributed worker task queue in Python using Redis & gevent.
RQ-lightweight task queue manager based on Redis.
Simpleq-A simple, infinitely extensible, Amazon SQS-based queue.
Python-gearman-Python API for Gearman.
Cloud Computing
Picloud-executes Python code in the cloud.
Dominoup.com-executes R, Python, and MATLAB code in the cloud.
E-mail analysis library
Flanker-An email address and MIME parsing library.
Talon-A Mailgun library for extracting quotations and signatures from messages.
URL and network address manipulation
Libraries for parsing and modifying URLs and network addresses.
URL
Furl-A small Python library that makes it easy to manipulate URL.
Purl-A simple, immutable URL class with a clean API for interrogation and manipulation.
Urllib.parse-(Python standard library) splits URL strings into their components (scheme, network location, path, and so on), reassembles components into a URL string, and resolves a relative URL into an absolute one given a base URL.
Tldextract-accurately separates the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.
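The urllib.parse behavior described above, splitting a URL and resolving relative references against a base, takes two calls (URLs are illustrative):

```python
from urllib.parse import urljoin, urlsplit

# Resolve a relative URL against a base URL, as a crawler does for links.
base = "https://example.com/docs/index.html"
print(urljoin(base, "../images/logo.png"))  # https://example.com/images/logo.png

# Split a URL into its components.
parts = urlsplit("https://example.com:8080/path?q=1#frag")
print(parts.netloc, parts.path, parts.query)  # example.com:8080 /path q=1
```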
Network address
Netaddr-Python library for displaying and manipulating network addresses.
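Alongside Netaddr, the stdlib ipaddress module covers the common cases. A quick sketch:

```python
import ipaddress

# Membership testing and size of a network, straight from the stdlib.
net = ipaddress.ip_network("192.168.0.0/24")
print(ipaddress.ip_address("192.168.0.42") in net)  # True
print(net.num_addresses)                            # 256
```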
Web page content extraction
A library that extracts content from a web page.
Text and metadata for HTML pages
Newspaper-use Python for news extraction, article extraction, and content curating.
Html2text-converts HTML to Markdown format text.
Python-goose-HTML content / article extractor.
Lassie-A humanized web content retrieval tool.
Micawber-A small library for extracting rich content from web sites.
Sumy-A module for automatically summarizing text files and HTML web pages.
Haul-an extensible image crawler.
Python-readability-A fast Python interface to the arc90 readability tool.
Scrapely-A library for extracting structured data from HTML pages. Given a few example pages and the data to extract, scrapely builds a parser for all similar pages.
Video
Youtube-dl-A small command line program that downloads videos from YouTube.
You-get-A Python 3 video downloader for YouTube, Youku, and Niconico.
Wiki
WikiTeam-A tool for downloading and saving wikis.
WebSocket
Libraries for working with WebSocket.
Crossbar-Open source application messaging router (a Python implementation of WebSocket and WAMP using Autobahn).
AutobahnPython-An open-source Python implementation of the WebSocket and WAMP protocols.
WebSocket-for-Python-WebSocket client and server libraries for Python 2, Python 3, and PyPy.
DNS parsing
Dnsyo-check your DNS on more than 1500 DNS servers worldwide.
Pycares-An interface to c-ares, a C library for asynchronous DNS requests and name resolution.
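Plain resolution, without the diagnostics these tools add, is a single stdlib call; localhost is used here so the example needs no external DNS:

```python
import socket

# Resolve a hostname to its addresses; localhost needs no external DNS.
infos = socket.getaddrinfo("localhost", 80, proto=socket.IPPROTO_TCP)
addrs = {info[4][0] for info in infos}
print(addrs)  # e.g. {'127.0.0.1', '::1'}
```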
Computer vision
OpenCV-Open Source computer Vision Library.
SimpleCV-A readable interface to cameras, image processing, feature extraction, and format conversion (based on OpenCV).
Mahotas-Fast computer image processing algorithm (implemented entirely in C++), based entirely on numpy array as its data type.
Proxy server
Tproxy-A simple TCP routing proxy (layer 7) built on Gevent and configured in Python.
That covers the Python crawler tools worth knowing. I hope the content above has been helpful and gives you plenty to explore. If you found the article useful, feel free to share it so more people can see it.
© 2024 shulou.com SLNews company. All rights reserved.