2025-01-19 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article mainly explains the stages of learning a Python crawler. The explanation is simple, clear, and easy to understand; please follow the editor's train of thought to study the stages of learning a Python crawler.
What can a crawler do?
Besides fetching data from the Internet, crawlers can also automate many tedious manual operations, not only retrieving data but also submitting it, for example:
1. Voting
2. Managing multiple accounts across multiple platforms (such as an account on each e-commerce platform)
3. WeChat chatbots
Real-world applications go far beyond the list above, and those are only uses apart from the data itself; applications of the data itself are also very broad:
1. Corpora for machine learning
2. Services in vertical sectors (e.g., used-car valuation)
3. Aggregation services (Qunar, Meituan)
4. News recommendation (Jinri Toutiao)
5. Prediction and decision support (e.g., in medicine)
So crawlers can do a great deal, which makes demand for them stronger and stronger. Yet many people with back-end development experience think crawling is trivial: just use requests to fetch an HTML page and parse it. Is a crawler really that simple?
Before answering, let's first ask a few questions:
1. What if a web page requires login to access?
2. For the problem above, many people say simulated login will do, but in reality many websites raise the difficulty of simulated login in various ways: CAPTCHAs of all kinds, obfuscated and encrypted login logic, and encrypted request parameters. How do you solve these problems?
3. What if a website only allows login from a mobile phone?
4. For the sake of user experience and server load, many websites load page elements asynchronously or render them with JavaScript. Can you analyze these?
5. Websites roll out anti-crawling schemes one after another. When your crawler gets blocked, how do you work out what anti-crawling measure the other side is using?
6. How does a crawler discover the latest data? How does it detect that a piece of data has been updated?
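For the last question, one common approach is to hash each page's content and re-crawl only when the hash changes. Here is a minimal sketch of that idea; the `has_changed` helper and its in-memory store are made up for illustration (a real crawler would persist the hashes in a database or Redis):

```python
import hashlib

# In-memory store mapping URL -> last seen content hash.
# A real crawler would persist this in a database or Redis.
_seen: dict[str, str] = {}

def fingerprint(content: str) -> str:
    """Hash normalized page content so whitespace-only noise is ignored."""
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def has_changed(url: str, content: str) -> bool:
    """Return True (and record the new hash) if the page content changed."""
    fp = fingerprint(content)
    if _seen.get(url) == fp:
        return False
    _seen[url] = fp
    return True
```

The same fingerprints also tell the scheduler which pages are worth revisiting frequently.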
If you just need a one-off crawler that grabs some data from a website once, that is certainly easy. But if you want to run a crawling service, you have to face all of the questions above, and that is before even mentioning data extraction and parsing.
Summarizing the questions above, let's look at what we need to learn:
The first stage: introduction to the basics
1. Computer networking fundamentals, including the TCP/IP protocol suite, socket programming, and the HTTP protocol
2. Front-end fundamentals: mainly JavaScript and Ajax
3. Basic Python syntax
4. Database fundamentals: any database will do, but MySQL or PostgreSQL is strongly recommended
5. HTML parsing basics: using BeautifulSoup, XPath, and CSS selectors
6. HTML downloading basics: using urllib or requests
7. Data persistence basics: for a relational database (MySQL), you can use pymysql and then peewee; for a document database (MongoDB), you can use pymongo and then mongoengine
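As a small taste of items 5 and 6, here is a minimal sketch using only the standard library. The HTML snippet is made up; a real crawler would download the page with urllib or requests and would more often parse it with BeautifulSoup:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# In a real crawler this HTML would come from urllib.request.urlopen(url).read()
html = '<html><body><a href="/page1">one</a><a href="/page2">two</a></body></html>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/page1', '/page2']
```

BeautifulSoup, XPath, and CSS selectors do the same job with far less code, which is why they are worth learning early.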
The second stage: crawling in practice
After the previous stage you have only the most basic crawler knowledge; if you want to crawl real sites, you need to learn further.
1. Simulated login: you need to understand how cookie- and session-based login works, and if you need to crawl Weibo you also need to know the OAuth 2.0 flow in detail.
2. Dynamic page analysis: the most basic approach is to analyze the JavaScript and HTML directly, but many websites make this logic very complex, so you also need to learn the basics of Selenium and ChromeDriver.
3. CAPTCHA recognition: this includes the most basic recognition, such as OCR. For more complex CAPTCHAs, if you want to recognize them yourself you also need to understand machine learning and image recognition; the simple route is to call a third-party service.
4. For anti-crawling, you need to understand basic nginx configuration and be familiar with the details of the HTTP protocol.
5. Crawler development calls for multi-threading, so you need to learn more about multi-threaded development, including the basics of inter-thread communication and thread synchronization.
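To illustrate item 5, here is a minimal multi-threaded crawl sketch. The `fetch` function is a stand-in for a real network call (e.g. `requests.get(url).text`), and the lock shows the thread-synchronization point:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

results = {}
lock = threading.Lock()  # synchronizes access to the shared dict

def fetch(url: str) -> str:
    # Stand-in for a real HTTP request, e.g. requests.get(url).text
    return f"<html>{url}</html>"

def worker(url: str) -> None:
    body = fetch(url)
    with lock:  # protect the shared dict from concurrent writes
        results[url] = len(body)

urls = [f"http://example.com/page/{i}" for i in range(5)]
with ThreadPoolExecutor(max_workers=3) as pool:
    # map submits all tasks; the with-block waits for them to finish
    list(pool.map(worker, urls))

print(sorted(results))
```

In a real crawler the threads usually communicate through a `queue.Queue` of URLs rather than a fixed list, which is the inter-thread communication the text mentions.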
The third stage: crawler monitoring and operations
Once a crawler runs in a production environment you have to monitor it, and monitoring is best done through a management page, so you need to understand:
1. Linux basics, for deploying services
2. Docker basics; the advantages and popularity of Docker deployment need no introduction
3. Django or Flask, because you need to build pages to monitor your crawlers
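The framework-independent part of such a monitoring page is just aggregating run statistics. Here is a minimal sketch of a helper a Django or Flask view might render; the record schema (`spider`, `status`, `pages`) is invented for illustration:

```python
from collections import Counter

def summarize(runs: list) -> dict:
    """Aggregate crawl-run records into the stats a monitoring page might show.

    Each record is assumed to look like
    {"spider": str, "status": "ok" | "error", "pages": int};
    this schema is made up for the example.
    """
    statuses = Counter(r["status"] for r in runs)
    return {
        "total_runs": len(runs),
        "errors": statuses.get("error", 0),
        "pages_crawled": sum(r["pages"] for r in runs),
    }

runs = [
    {"spider": "news", "status": "ok", "pages": 120},
    {"spider": "news", "status": "error", "pages": 3},
    {"spider": "shop", "status": "ok", "pages": 80},
]
print(summarize(runs))  # {'total_runs': 3, 'errors': 1, 'pages_crawled': 203}
```

A Flask route would simply return this dict as JSON, or feed it into a template.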
The fourth stage: crawler frameworks and distributed crawlers
1. You need to know at least one crawler framework, scrapy or pyspider.
2. To understand scrapy fully, you also need to know scrapy-redis and how it solves the distributed-crawling problem.
3. You need to understand distributed storage solutions, such as the Hadoop ecosystem.
4. You need to understand the MongoDB document database.
5. You need to understand the Elasticsearch search engine.
6. You need to understand Kafka, a distributed publish-subscribe messaging system.
7. You need to understand the principles behind distributed fundamentals such as distributed locks.
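A core trick behind scrapy-redis (item 2) is request deduplication by fingerprint: every node hashes each request and checks a set shared in Redis before scheduling it. The sketch below shows only the fingerprint idea with a local Python set, so it runs standalone; in scrapy-redis the set lives in a shared Redis instance:

```python
import hashlib

class DupeFilter:
    """Skip URLs that have already been scheduled.

    scrapy-redis keeps the fingerprint set in a shared Redis instance so
    every crawler node sees the same set; a plain Python set is used here
    so the sketch runs standalone.
    """
    def __init__(self):
        self.fingerprints = set()

    def fingerprint(self, url: str) -> str:
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def seen(self, url: str) -> bool:
        fp = self.fingerprint(url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False

df = DupeFilter()
print(df.seen("http://example.com/a"))  # False: first time
print(df.seen("http://example.com/a"))  # True: duplicate
```

Moving this set into Redis is also what turns it into the distributed coordination problem that items 6 and 7 address.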
The fifth stage: applications of crawlers
This stage belongs to the application domain. If you want to do artificial intelligence, you need the relevant AI knowledge; for data analysis, the basics of data analysis; for web services, the basics of web development; and for search engines and recommendation systems, the corresponding fundamentals.
Thank you for reading. The above covers the stages of learning a Python crawler. After studying this article, you should have a deeper understanding of these stages; specific usage still needs to be verified in practice. The editor will keep pushing more articles on related topics, welcome to follow!