2025-04-04 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/02 Report --
This article surveys the web crawler frameworks available in Python. We think it is very practical, so we share it here as a reference; read on for the details.
1. Scrapy: Scrapy is an application framework written to crawl website data and extract structured data. It can be used in a wide range of programs for data mining, information processing, or archiving historical data. It is a powerful crawler framework well suited to straightforward crawls where the URL patterns are known in advance; with it, you can easily scrape data such as Amazon product listings. For more complex pages, however, such as Weibo pages rendered dynamically, this framework alone cannot meet the requirements. Its features include built-in support for selecting and extracting data from HTML and XML sources, a set of reusable components (Item Loaders) shared between spiders, and built-in support for intelligently processing scraped data.
2. Crawley: crawls the content of a website at high speed, supports both relational and non-relational databases, and can export data to JSON, XML, and other formats.
3. Portia: an open source visual crawler tool that lets users scrape websites without any programming knowledge. You simply annotate the pages you are interested in, and Portia creates a spider to extract data from similar pages. In short: it is built on the Scrapy engine, scrapes content visually without requiring development expertise, and dynamically matches content that shares the same page template.
4. Newspaper: used to extract news, articles, and article metadata for content analysis. It uses multithreading and supports more than 10 languages, all handled as Unicode. Inspired by the simplicity and power of the requests library, the author wrote it in Python as a program for extracting the body content of articles.
5. python-goose: a Python rewrite of Goose, an article extraction tool originally written in Java. The information the python-goose framework can extract includes the main body of an article, its main images, any YouTube/Vimeo videos embedded in it, the meta description, and meta tags.
6. Beautiful Soup: famous, and covers many common crawler needs out of the box. It is a Python library that extracts data from HTML or XML files, and it works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. Beautiful Soup can save you hours or even days of work. Its disadvantage is that it cannot execute JavaScript, so dynamically rendered content is invisible to it.
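The navigate-and-search workflow described above can be sketched in a few lines. This assumes beautifulsoup4 is installed (`pip install beautifulsoup4`); the HTML snippet is an invented example, and the stdlib `html.parser` backend is used so no extra parser is needed.

```python
# A short Beautiful Soup sketch: pull the heading and all links
# out of an HTML snippet (the snippet itself is made up).
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Reading list</h1>
  <ul>
    <li><a href="https://docs.python.org">Python docs</a></li>
    <li><a href="https://scrapy.org">Scrapy</a></li>
  </ul>
</body></html>
"""

# "html.parser" is Python's built-in parser; lxml is a faster alternative.
soup = BeautifulSoup(html, "html.parser")

title = soup.h1.get_text()                                   # "Reading list"
links = {a.get_text(): a["href"] for a in soup.find_all("a")}
print(title, links)
```

In a real crawler the `html` string would come from an HTTP response body (e.g. fetched with requests or urllib) rather than a literal.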
7. Mechanize: its advantage is that it emulates a stateful browser, handling forms, cookies, redirects, and browsing history. Of course, it also has disadvantages, such as a serious lack of documentation; even so, with the official examples and some trial and error, it is just about usable.
8. Selenium: a driver for controlling browsers. Through this library you can directly drive a browser to perform actions such as entering a CAPTCHA. Selenium is an automated testing tool that supports all mainstream browsers, including Chrome, Safari, and Firefox; with the corresponding browser driver installed, testing of web interfaces becomes straightforward. Selenium also supports development in multiple languages, such as Java, C#, and Ruby. A common crawling pattern is to use PhantomJS to render and parse JavaScript, Selenium to drive the browser and interface with Python, and Python for post-processing.
9. Cola: a distributed crawler framework. Users only need to write a few specific functions and do not have to worry about the details of distributed operation: tasks are automatically assigned to multiple machines, and the whole process is transparent. The project's overall design is somewhat weak, however, and the coupling between modules is high.
10. PySpider: a powerful web crawler system by a Chinese developer, with a powerful WebUI. It is written in Python, has a distributed architecture, and supports multiple database backends; its WebUI includes a script editor, task monitor, project manager, and results viewer. Crawls are controlled by Python scripts, and you can use any HTML parsing package you like.
Thank you for reading! That concludes this article on the crawler frameworks available in Python. We hope the content above is of some help to you and broadens your knowledge; if you found the article useful, please share it so more people can see it.
© 2024 shulou.com SLNews company. All rights reserved.