How to start writing Python crawlers with zero foundation

2025-03-31 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/02 Report

This article explains how to start writing Python crawlers with zero foundation. The approach described here is simple, fast and practical, so let's walk through how to get started writing Python crawlers from scratch.

-❶- The basics are not always the easiest place to start

At first I didn't know much about crawlers and had no computer science or programming background, so I was genuinely a little lost. I had no clear idea of where to start, what to learn first, and what could wait until I had some foundation.

Since these are Python crawlers, Python itself is a must, so I started there: I read some tutorials and books to learn the basic data structures (lists, dictionaries, tuples), functions, and control statements (conditionals and loops).

After studying for a while I realized I still hadn't touched an actual crawler, and the purely theoretical material was quickly forgotten; going back to review it felt like a waste of time, which was discouraging. After working through the Python basics I hadn't even installed an IDE I could type code into, which in hindsight made me laugh and cry at the same time.

-❷- Getting started

After reading a technical article on crawlers, its clear reasoning and plain language made me feel that this was the kind of crawling I wanted to learn. So I decided to set up an environment first and see how crawlers actually work. (Call it impatience if you like, but it's true that every rookie wants to do something intuitive with quick feedback.)

For fear of making mistakes, I installed the safer Anaconda distribution and used its bundled Jupyter Notebook as my IDE for writing code. I was glad I did: many people report all kinds of bugs caused by the environment setup. Much of the time it isn't the crawler itself that beats you, it's configuring the environment.

Another problem was that a Python crawler can be implemented with many different packages or frameworks, so which one should I choose? My principle was simple: easy to use and as little code as possible. For a rookie, performance and efficiency could wait. So I started with urllib and BeautifulSoup, simply because I had heard others mention them.

The case I started with was crawling Douban movies. Countless people recommend Douban as a beginner's example, because the pages are simple and the anti-crawling measures are not strict. Through a few entry-level examples of crawling Douban movies, I picked up the basic workflow of a crawler: download the page, parse the page, locate and extract the data.

Of course, I didn't study urllib and BeautifulSoup systematically. I only needed to solve the problems in the examples at hand, such as downloading and parsing pages, which are basically fixed statements; I just used them directly without learning the principles first.

The fixed statements for downloading and parsing a page with urllib
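A minimal sketch of those fixed statements; the URL and parser choice below are only illustrative assumptions:

```python
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# Placeholder page used only to illustrate the "fixed statements"
url = "https://movie.douban.com/top250"

# Download the page; a User-Agent header keeps the request from being rejected outright
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urlopen(req).read().decode("utf-8")

# Parse the HTML so elements can be located and extracted later
soup = BeautifulSoup(html, "html.parser")
print(soup.title.get_text())
```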

Of course, the basic BeautifulSoup methods can't be ignored, but they amount to little more than find(), get_text() and the like, so there isn't much to absorb. In this way, by following other people's ideas and searching for how to use BeautifulSoup myself, I completed the crawling of basic Douban movie information.

Using BeautifulSoup to crawl Douban movie details
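A minimal sketch of that kind of find()/get_text() extraction, assuming hypothetical tag and class names on the listing page:

```python
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = "https://movie.douban.com/top250"
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(urlopen(req).read().decode("utf-8"), "html.parser")

# The tag and class names are assumptions; inspect the real page to confirm them
for item in soup.find_all("div", class_="item"):
    title = item.find("span", class_="title").get_text()
    rating = item.find("span", class_="rating_num").get_text()
    print(title, rating)
```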

-❸- The crawlers get better

With some patterns and templates in hand, I had goals and could move on. Still on Douban, I explored crawling more information on my own: multiple movies, multiple pages. At this point I found my foundation was lacking: the control statements needed to crawl multiple elements, turn pages and handle different cases, and the string, list and dictionary handling needed to extract the content, were nowhere near enough.

So I went back to fill in the Python basics. This time the study was very targeted, could be applied immediately to solve problems, and sank in more deeply. By the time I had crawled Douban's TOP250 books and movies, I basically understood the workflow of a crawler.

BeautifulSoup works fine, but it takes some time to understand the basics of how web pages are structured; otherwise locating and selecting certain elements is still a headache.

Later, when I discovered XPath, I only wished we had met sooner. It is the essential sharp tool for getting started: you can copy an element's XPath straight out of Chrome and point it wherever you like. Even if you want to write your own XPath, an hour with the first few pages of the w3school XPath tutorial is enough. Requests also turned out to be easier to use than urllib, but groping around is always a process of trial and error, and the cost of trial and error is time.

Requests + XPath: crawling Douban TOP250 book information
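A rough sketch of the requests + XPath pattern; the URL and XPath expressions are assumptions standing in for whatever Chrome's "Copy XPath" gives you:

```python
import requests
from lxml import etree

# Placeholder URL for the first page of the book list
url = "https://book.douban.com/top250?start=0"
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
tree = etree.HTML(resp.text)

# XPath expressions are illustrative assumptions, as if copied from Chrome
titles = tree.xpath('//div[@class="pl2"]/a/@title')
ratings = tree.xpath('//span[@class="rating_nums"]/text()')
for title, rating in zip(titles, ratings):
    print(title, rating)
```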

-❹- Up against anti-crawling measures

With requests + XPath I could crawl quite a few websites, and I went on to practise on Xiaozhu's rental listings and Dangdang's book data. The problems appeared when I crawled Lagou: at first my requests returned no information at all. I had to disguise my crawler as a browser, and finally understood what that block of headers information in other people's code was for.

Adding headers to the crawler so it impersonates a real browser
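A hedged sketch of what adding headers looks like; the header values are typical examples copied from a browser's developer tools, not anything Lagou specifically requires:

```python
import requests

# Example header values; which fields a site actually checks varies,
# so treat these as placeholders and copy the real ones from your browser.
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"),
    "Referer": "https://www.lagou.com/",
}

resp = requests.get("https://www.lagou.com/", headers=headers)
print(resp.status_code)
```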

Then there were elements that could not be located at all. That's when I learned about asynchronous loading: the data isn't in the page source at all, and you have to capture the network traffic to find it. So I previewed the various JS and XHR requests, looking for the ones whose responses contained the data.

Zhihu was a lucky case: it doesn't load many files, so I found the JSON response and got the data directly. (A Chrome extension worth recommending here is JSONView, which makes it easy for a beginner to read JSON files.)

Capturing data loaded by JavaScript in the browser
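A sketch of requesting such a JSON endpoint directly; the endpoint URL and parameters are placeholders that would be copied from the Network panel once the request carrying the data has been found:

```python
import requests

# Placeholder XHR endpoint; the real URL and parameters come from developer tools
api_url = "https://example.com/api/items"
headers = {"User-Agent": "Mozilla/5.0"}

resp = requests.get(api_url, headers=headers, params={"offset": 0, "limit": 20})
data = resp.json()  # the JSON body holds the data missing from the page source
for item in data.get("data", []):
    print(item)
```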

Here I got my first taste of anti-crawling measures. This is of course still the most basic kind; stricter IP limits, CAPTCHAs, text encryption and so on may bring many more problems. But isn't it enough that today's problems can be solved? Breaking them one at a time is a more efficient way to learn.

For example, my IP was later blocked while crawling other websites. A simple fix is to control the crawling frequency with time.sleep(); if the limits are stricter, or crawling speed has to be maintained, proxy IPs are the way to solve it.
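A minimal sketch combining both ideas; the page URLs and the proxy address are placeholders:

```python
import time
import requests

headers = {"User-Agent": "Mozilla/5.0"}
urls = [f"https://example.com/page/{i}" for i in range(1, 6)]  # placeholder pages

# Placeholder proxy address; a real proxy IP would go here when limits are strict
proxies = {"http": "http://127.0.0.1:8080", "https": "http://127.0.0.1:8080"}

for url in urls:
    resp = requests.get(url, headers=headers, proxies=proxies, timeout=10)
    print(url, resp.status_code)
    time.sleep(2)  # throttle the crawl so the IP is less likely to be blocked
```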

Of course, I also tried Selenium later. It drives a real browser and imitates real user behaviour (clicking, searching, paging), so when strict anti-crawling sites leave you no other way in, Selenium is a super useful thing, even if it is a bit slow.
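A small Selenium sketch, assuming Chrome is installed and using a placeholder selector:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # recent Selenium versions manage the driver automatically
driver.get("https://movie.douban.com/top250")

# The browser renders JavaScript, so dynamically loaded elements can be located
# directly; the CSS selector below is a placeholder.
for span in driver.find_elements(By.CSS_SELECTOR, "span.title"):
    print(span.text)

driver.quit()
```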

-❺- Trying the powerful Scrapy framework

With requests + XPath plus the packet-capture trick, you can do a great deal: movies under Douban's categories, 58.com, Zhihu and Lagou were all basically no problem. However, when the amount of data to crawl is large and each module has to be handled flexibly, this approach starts to feel inadequate.

So I learned about the powerful Scrapy framework. It not only builds Requests easily, and its powerful Selector parses Responses just as easily, but what surprised me most was its very high performance and the way it lets you turn crawlers into engineered, modular projects.

Basic components of the Scrapy framework

Learning Scrapy and trying to build a simple crawler project with it made it possible to approach large-scale data crawling in a structured, engineered way, that is, to think from the perspective of crawler engineering.

Of course, Scrapy's own Selectors, middleware, spiders and so on take more effort to understand, so it is still best to work through concrete examples and other people's code to see how everything fits together.

Crawling a large amount of rental information with Scrapy
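A minimal sketch of what such a Scrapy spider might look like; the site, selectors and field names are placeholders, not the actual project described here:

```python
import scrapy

class RentalSpider(scrapy.Spider):
    name = "rentals"
    start_urls = ["https://example.com/rentals"]  # placeholder listing page

    def parse(self, response):
        # Scrapy's Selector supports both CSS and XPath expressions
        for listing in response.css("div.listing"):
            yield {
                "title": listing.css("h2::text").get(),
                "price": listing.css("span.price::text").get(),
            }
        # follow pagination; the scheduler queues and deduplicates requests
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Saved as a single file, a spider like this can be run with `scrapy runspider rentals.py -o rentals.json`.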

-❻- When local files can't cope, move to a database

After crawling back a large amount of data, I found that saving it in local files was very inconvenient, and even when saved, the computer grinds to a halt opening large files. What to do? Move to a database, of course, so I started with MongoDB. It can store both structured and unstructured data, and with PyMongo installed you can operate the database conveniently from Python.

Installing MongoDB itself can be troublesome, and if you fight it alone you may well get stuck. I also ran into all kinds of bugs at the beginning; fortunately, guidance from the expert Xiao X solved a lot of problems.

Of course, crawling doesn't call for much advanced database technology, mainly storing and retrieving data, and along the way you pick up the basic insert and delete operations. In short, being able to store and extract the crawled data efficiently is enough.

Crawling Lagou recruitment data and storing it with MongoDB
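A sketch of the basic PyMongo operations mentioned above, with placeholder database, collection and record names:

```python
from pymongo import MongoClient

# Connect to a local MongoDB instance; database and collection names are placeholders
client = MongoClient("mongodb://localhost:27017/")
collection = client["crawler"]["jobs"]

# Insert a record of the kind a job-listing crawler might produce
collection.insert_one({"title": "Python Engineer", "city": "Beijing", "salary": "15k-25k"})

# Basic retrieval and deletion, the operations mentioned above
for doc in collection.find({"city": "Beijing"}):
    print(doc["title"], doc["salary"])
collection.delete_many({"city": "Beijing"})
```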

-❼- The legendary distributed crawler

At this point, a large share of web pages can basically be crawled, and the bottleneck shifts to the efficiency of crawling data at scale. Having learned Scrapy, I naturally came across a powerful name: the distributed crawler.

Distributed crawling sounded powerful, even intimidating, when I didn't understand it, but in fact it just uses the principle of multi-threading to make multiple crawlers work at the same time. Besides the Scrapy and MongoDB covered earlier, it seems you also need to understand Redis.

Scrapy does the basic page crawling, MongoDB stores the crawled data, and Redis stores the queue of pages waiting to be crawled, that is, the task queue.

Distributed crawling may seem scary, but once it's broken down and learned step by step, that's really all there is to it.

Distributed crawling of 58.com: defining the item fields
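A sketch of the Scrapy + Redis + MongoDB split described above, assuming the third-party scrapy-redis package (no specific library is named here); the class names, selectors and settings are placeholders:

```python
import scrapy
from scrapy_redis.spiders import RedisSpider

class ListingItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()

class City58Spider(RedisSpider):
    name = "city58"
    redis_key = "city58:start_urls"  # every worker pulls start URLs from this Redis list

    def parse(self, response):
        item = ListingItem()
        item["title"] = response.css("h1::text").get()       # placeholder selectors
        item["price"] = response.css("span.price::text").get()
        yield item

# settings.py excerpt: share the scheduler and dedup filter through Redis
# SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# REDIS_URL = "redis://localhost:6379"
```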

There are indeed many pitfalls in learning crawlers from zero, which can be summarised as follows:

1. Environment configuration: the various packages and environment variables to set up are very unfriendly to rookies.

2. Without a reasonable learning path, jumping between Python, HTML and everything else makes it extremely easy to give up.

3. Python offers many packages and frameworks to choose from, and a beginner has no idea which is friendlier.

4. Not even knowing how to describe a problem, let alone how to find a solution.

5. The material online is scattered and unfriendly to rookies, and much of it reads as if it were up in the clouds.

6. Some things seem understandable, yet writing the code yourself still turns out to be very difficult.

At this point, I believe you have a deeper understanding of how to start writing Python crawlers with zero foundation. You might as well put it into practice; for more related content, follow us and keep learning!
