This article explains what a Python web crawler is and how it is used. The material is very practical, so I am sharing it with you here; I hope you get something out of it after reading.
Web crawlers, also known as web robots, can automatically collect and organize data and information from the Internet in place of manual work. In the era of big data, information collection is important work: relying on manpower alone is not only inefficient and tedious, it also raises the cost of collection.
Instead, we can use web crawlers to collect data automatically, for example crawling sites for search engines, gathering data for analysis and mining, or collecting financial data for financial analysis. Web crawlers are also used in fields such as public opinion monitoring and analysis and collecting data on target customers.
Of course, to learn web crawler development we first need to understand web crawlers. This article introduces several typical kinds of web crawlers and their common functions.
I. What is a web crawler
With the advent of the big data era, web crawlers play an increasingly important role on the Internet. The amount of data on the Internet is enormous, and how to obtain the information we are interested in automatically and efficiently is an important problem; crawler technology was born to solve it.
The information we are interested in comes in different types. If we are building a search engine, the information of interest is as many high-quality pages on the Internet as possible; if we want data from a particular vertical field, or have a clear retrieval need, then the information of interest is the pages that match that need, and we have to filter out the useless ones. A crawler of the former kind is called a general-purpose web crawler, and one of the latter kind a focused web crawler.
1. First acquaintance with web crawlers
Web crawlers, also known as web spiders, web ants, or web robots, can automatically browse information on the network. Of course, this browsing must follow rules that we define, and these rules are called web crawler algorithms. With Python, it is convenient to write crawler programs that retrieve Internet information automatically.
Search engines cannot do without crawlers. For example, the crawler of the Baidu search engine is called Baidu Spider (Baiduspider). Every day Baidu Spider crawls through the vast amount of information on the Internet, collecting high-quality pages. When a user searches for a keyword on Baidu, Baidu analyzes and processes the keyword, finds the relevant pages among those it has indexed, sorts them according to certain ranking rules, and shows the results to the user.
In this process, Baidu Spider plays a vital role. So how does it cover as many high-quality pages on the Internet as possible? How does it filter out duplicate pages? These questions are decided by Baidu Spider's crawling algorithms. Different algorithms give the crawler different efficiency and different crawling results.
Therefore, when studying crawlers we need to understand not only how to implement them but also some common crawler algorithms, and if necessary we may need to design algorithms of our own. Here, though, a basic understanding of the concept of a crawler is enough.
Baidu is not the only search engine that depends on crawlers; the other search engines all have their own too. For example, 360's crawler is called 360Spider, Sogou's crawler is called Sogou Spider, and Bing's crawler is called Bingbot.
If we want to implement a small search engine, we can write our own crawler. Its performance and algorithms may not compare with a mainstream search engine, but the degree of personalization will be very high, and building it helps us understand more deeply how search engines work.
The big data era is also inseparable from crawlers. For big data analysis or data mining, we can download data sources from some large official sites, but these sources are limited. How can we obtain more, higher-quality data sources? We can write our own crawler programs to obtain data and information from the Internet. So the status of crawlers will only become more important in the future.
2. Why learn web crawlers?
We now have a preliminary understanding of web crawlers, but why learn them? Only by knowing our purpose clearly can we learn the material well, so let us analyze the reasons for learning web crawlers.
Of course, different people may learn crawlers for different purposes. Here we summarize four common reasons.
1) By learning crawlers, you can build a custom search engine and gain a deeper understanding of how search engines collect data.
Some readers hope to understand in depth how search engine crawlers work, or want to develop a private search engine of their own; in that case, learning crawlers is essential.
Simply put, after learning how to write a crawler, we can use it to collect information from the Internet automatically and then store or process it. When we need to look something up, we only have to search within the information we have collected, which amounts to a private search engine.
Of course, how to crawl the information, how to store it, how to do word segmentation, and how to compute relevance all have to be designed by us; crawler technology mainly solves the problem of fetching the information.
2) In the era of big data, data analysis requires data sources first. Learning crawlers lets us obtain more data sources, collected according to our own purpose with a lot of irrelevant data removed.
When doing big data analysis or data mining, data sources can be obtained from websites that publish statistics, or from literature and internal material, but these channels sometimes cannot meet our needs, and searching for data manually on the Internet takes too much effort.
In such cases we can use crawler technology to fetch the data we are interested in from the Internet automatically and use it as our data source, enabling deeper data analysis and more valuable insights.
3) For many SEO practitioners, learning crawlers gives a deeper understanding of how search engine crawlers work, which makes for better search engine optimization.
Since the job is search engine optimization, we must clearly understand how search engines work, including how their crawlers work, so that we know both ourselves and the other side when optimizing.
4) From an employment perspective, crawler engineers are currently in short supply and salaries are generally high, so mastering this technology in depth is very beneficial for finding a job.
Some readers may learn crawlers in order to find a job or to change jobs. From that point of view, crawler engineering is a good choice: demand for crawler engineers keeps growing while relatively few people are qualified for the position, so it is a comparatively scarce career direction. Moreover, as the big data era arrives, crawler technology will be applied more and more widely and will have good room for development.
Besides the four common reasons summarized above, you may have other reasons of your own for learning crawlers. In short, whatever the reason, figuring out your purpose helps you study a technology better and stick with it.
3. The composition of web crawlers
Next, we introduce the composition of a web crawler. A web crawler is composed of control nodes, crawler nodes, and a resource library.
Figure 1-1 shows the structural relationship between the control node and the crawler node of the web crawler.
Figure 1-1 the structural relationship between the control node and the crawler node of the web crawler
As the figure shows, a web crawler can contain multiple control nodes, and each control node can have multiple crawler nodes under it. Control nodes can communicate with each other; a control node can also communicate with the crawler nodes under it, and crawler nodes under the same control node can communicate with one another.
The control node, also known as the crawler's central controller, is mainly responsible for allocating threads according to URL addresses and calling crawler nodes to do the actual crawling.
A crawler node crawls web pages according to the relevant algorithms; its work includes downloading pages and processing their text. After crawling, the crawler node stores the results in the corresponding resource library.
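As an illustration only, here is a minimal sketch of what a single crawler node might do: download a page, extract its text and outgoing links, and store the result in a resource library. It assumes the third-party requests and beautifulsoup4 packages are installed, and the function name fetch_and_store and the in-memory resource_store dictionary are made up for this example.

```python
# A minimal, hypothetical sketch of a crawler node: download a page,
# extract text and links, and store the result in a "resource library".
# Assumes `pip install requests beautifulsoup4`.
import requests
from bs4 import BeautifulSoup

resource_store = {}  # stand-in for the resource library (e.g. a database)

def fetch_and_store(url, timeout=10):
    """Download one page, extract its text and outgoing links, store the result."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    page_text = soup.get_text(separator=" ", strip=True)
    links = [a["href"] for a in soup.find_all("a", href=True)]
    resource_store[url] = {"text": page_text, "links": links}
    return links  # new URLs that a control node could hand to other crawler nodes

if __name__ == "__main__":
    new_links = fetch_and_store("https://example.com")
    print(f"Stored 1 page, found {len(new_links)} links")
```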
4. Types of web crawlers
Now that we have a basic understanding of the composition of web crawlers, what specific types of web crawlers are there?
According to the technology and structure used, web crawlers can be divided into general-purpose web crawlers, focused web crawlers, incremental web crawlers, deep web crawlers, and so on. A crawler used in practice is usually a combination of several of these types.
4.1 General Web Crawler
First, let us look at the general-purpose web crawler (General Purpose Web Crawler). It is also called a whole-web crawler because, as the name implies, the resources it targets span the entire Internet.
The target data of a general-purpose web crawler is huge and its crawling range is very wide; precisely because the data it crawls is so massive, the performance requirements for this kind of crawler are very high. It is mainly used in large search engines and has very high application value.
A general-purpose web crawler mainly consists of an initial URL collection, a URL queue, a page crawling module, a page analysis module, a page database, a link filtering module, and so on. When crawling, it adopts certain crawling strategies, mainly the depth-first strategy and the breadth-first strategy.
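To make the two strategies concrete, here is a small illustrative sketch (not taken from the original text): the only difference between breadth-first and depth-first crawling is whether the frontier of URLs is treated as a FIFO queue or a LIFO stack. The get_links helper is an assumed function that downloads a page and returns its outgoing links, for example as in the earlier crawler-node sketch.

```python
# Illustrative sketch: breadth-first vs. depth-first crawling.
# `get_links(url)` is a hypothetical helper that downloads `url`
# and returns the URLs it links to.
from collections import deque

def crawl(start_url, get_links, max_pages=100, breadth_first=True):
    frontier = deque([start_url])
    visited = set()
    while frontier and len(visited) < max_pages:
        # FIFO queue -> breadth-first; LIFO stack -> depth-first.
        url = frontier.popleft() if breadth_first else frontier.pop()
        if url in visited:
            continue
        visited.add(url)
        for link in get_links(url):
            if link not in visited:
                frontier.append(link)
    return visited
```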
4.2 Focused Web Crawler
The focused web crawler (Focused Crawler) is also called a topic web crawler. As its name implies, a focused web crawler selectively crawls pages according to pre-defined topics. Unlike a general-purpose crawler, it does not target resources across the whole Internet but only the pages related to its topic, which can greatly save the bandwidth and server resources a crawler needs.
Focused web crawlers are mainly used to crawl specific information and mainly serve a particular group of users.
A focused web crawler mainly consists of an initial URL collection, a URL queue, a page crawling module, a page analysis module, a page database, a link filtering module, a content evaluation module, a link evaluation module, and so on. The content evaluation module evaluates how important a page's content is; likewise, the link evaluation module evaluates how important a link is. Based on the importance of links and content, the crawler decides which pages to visit first.
Focused web crawlers use four main crawling strategies: one based on content evaluation, one based on link evaluation, one based on reinforcement learning, and one based on context graphs. The specific strategies of focused crawlers will be analyzed in detail later.
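As a purely illustrative sketch of content and link evaluation (not the original's algorithm), the example below scores text by keyword overlap with the crawler's topic and uses the score of each link's anchor text to decide which links to visit first; the keyword set and URLs are made up.

```python
# Hypothetical sketch of content/link evaluation for a focused crawler:
# score text by how many topic keywords it contains, then prioritize links.
import heapq

TOPIC_KEYWORDS = {"python", "crawler", "scraping", "spider"}  # example topic keywords

def relevance(text):
    """Fraction of topic keywords that appear in the text (0.0 to 1.0)."""
    words = set(text.lower().split())
    return len(TOPIC_KEYWORDS & words) / len(TOPIC_KEYWORDS)

def prioritize_links(links_with_anchor_text):
    """Return a priority queue of (negated score, url), most relevant link first."""
    scored = [(-relevance(anchor), url) for url, anchor in links_with_anchor_text]
    heapq.heapify(scored)
    return scored

links = [("https://example.com/a", "Python crawler tutorial"),
         ("https://example.com/b", "Cooking recipes")]
queue = prioritize_links(links)
print(heapq.heappop(queue))  # the most topic-relevant link comes out first
```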
4.3 Incremental Web Crawler
The incremental web crawler (Incremental Web Crawler): the word "incremental" here corresponds to incremental updating.
Incremental updating means that, when updating, only what has changed is updated and what has not changed is left alone. Accordingly, when crawling, an incremental web crawler only crawls pages whose content has changed or pages that are newly generated; pages that have not changed are not crawled again.
To a certain extent, an incremental web crawler ensures that the crawled pages are as fresh as possible.
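One simple way to detect whether a page has changed, shown here only as an illustrative sketch rather than the original's method, is to keep a hash of each page's last-seen content and re-process the page only when the hash differs; the download helper is assumed.

```python
# Illustrative sketch of incremental crawling via content hashing:
# only pages whose content hash has changed (or new pages) are processed.
import hashlib

last_seen_hash = {}  # url -> hash of the content we stored last time

def content_hash(html):
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def crawl_incrementally(url, download):
    """`download(url)` is a hypothetical helper returning the page HTML."""
    html = download(url)
    digest = content_hash(html)
    if last_seen_hash.get(url) == digest:
        return False  # unchanged: skip re-processing
    last_seen_hash[url] = digest
    # ...store / re-index the new or changed page here...
    return True
```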
4.4 Deep Web Crawler
The deep web crawler (Deep Web Crawler) can crawl deep pages on the Internet. To understand it, we first need to understand the concept of a deep page.
On the Internet, web pages can be classified by the way they exist into surface pages and deep pages. A surface page is a static page that can be reached through static links without submitting a form, while a deep page is hidden behind a form and cannot be obtained directly through a static link; it can only be obtained after submitting certain keywords.
On the Internet, the number of deep pages is often far greater than the number of surface pages, so we need a way to crawl deep pages.
To crawl a deep page, the crawler must be able to fill in the corresponding form automatically, so the most important part of a deep web crawler is form filling.
A deep web crawler mainly consists of a URL list, an LVS list (LVS stands for the label/value set, i.e. the data source used to fill in forms), a crawl controller, a parser, an LVS controller, a form analyzer, a form processor, a response analyzer, and so on.
Deep web crawlers fill in forms in two ways (a small sketch follows this list):
The first is form filling based on domain knowledge. Simply put, a keyword database is built in advance, and when a form needs to be filled in, suitable keywords are chosen according to semantic analysis.
The second is form filling based on analysis of the web page structure. This method is generally used when domain knowledge is limited; it analyzes the structure of the page and fills in the form automatically.
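As an illustration of the first approach only, the sketch below picks keywords from a small domain keyword list (playing the role of the LVS data source) and submits them through a search form with an HTTP POST request. The requests package is assumed to be installed, and the form URL and field name are hypothetical.

```python
# Hypothetical sketch of domain-knowledge-based form filling:
# pick keywords from a prepared list and submit them to a search form.
# Assumes `pip install requests`; the URL and field name are made up.
import requests

DOMAIN_KEYWORDS = ["python", "web crawler", "data mining"]  # the "LVS" data source

def crawl_behind_form(form_url, keyword_field="q"):
    deep_pages = {}
    for keyword in DOMAIN_KEYWORDS:
        # Submitting the form exposes "deep" result pages that have no static link.
        response = requests.post(form_url, data={keyword_field: keyword}, timeout=10)
        response.raise_for_status()
        deep_pages[keyword] = response.text
    return deep_pages

# Example (hypothetical endpoint):
# pages = crawl_behind_form("https://example.com/search")
```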
Above, we have introduced several common types of web crawlers. I hope readers now have a basic understanding of how web crawlers are classified.
5. Crawler extension: the focused crawler
Because a focused crawler crawls purposefully according to its topic and can save a great deal of server and bandwidth resources, it is very practical, so here we explain the focused crawler in more detail. Figure 1-2 shows the process a focused crawler follows; once we are familiar with this process, we will understand much more clearly how a focused crawler works.
Figure 1-2 The process a focused crawler follows
First, the focused crawler has a control center responsible for managing and monitoring the whole crawler system. Its main tasks include handling user interaction, initializing the crawler, determining the topic, coordinating the modules, and controlling the crawling process.
Then the initial URL collection is passed to the URL queue; the page crawling module reads the first batch of URLs from the queue and crawls the corresponding pages from the Internet.
The crawled content is passed to the page database for storage. During crawling, new URLs are also discovered; the link filtering module filters out links irrelevant to the topic, and the link evaluation module or content evaluation module then ranks the remaining URLs by topic relevance. The new URL addresses are added to the URL queue for the page crawling module to use.
Meanwhile, after a page has been crawled and stored in the page database, the page analysis module analyzes and processes it according to the topic, and an index database is built from the results. When a user searches for information, the search is carried out against the index database and the corresponding results are returned.
This is the main workflow of a focused crawler. Understanding it helps us write focused crawlers with a clearer structure in mind.
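Putting the pieces above together, here is a compact illustrative sketch (not the original's code) of the focused-crawler loop: take the most promising URL from the queue, fetch and store the page, evaluate the newly found links by the relevance of their anchor text, and push the relevant ones back onto the queue. The fetch_page and relevance helpers are assumed, as in the earlier sketches.

```python
# Illustrative sketch of the focused-crawler workflow described above.
# `fetch_page(url)` is assumed to return (page_text, [(link_url, anchor_text), ...]);
# `relevance(text)` is the topic-scoring helper from the earlier sketch.
import heapq

def focused_crawl(seed_urls, fetch_page, relevance, max_pages=50, min_score=0.25):
    url_queue = [(-1.0, url) for url in seed_urls]   # priority queue: best first
    heapq.heapify(url_queue)
    page_database = {}

    while url_queue and len(page_database) < max_pages:
        _, url = heapq.heappop(url_queue)
        if url in page_database:
            continue
        text, links = fetch_page(url)                 # page crawling module
        page_database[url] = text                     # page database
        for link_url, anchor_text in links:           # link filtering + evaluation
            score = relevance(anchor_text)            # rank links by anchor-text relevance
            if score >= min_score and link_url not in page_database:
                heapq.heappush(url_queue, (-score, link_url))
    return page_database
```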
II. Overview of web crawler skills
Above, we gained a preliminary understanding of web crawlers. So what exactly can a web crawler do? What interesting things can you do with one? We explain this in detail in this chapter.
1. Overview of web crawler skills
As shown in figure 2-1, we summarize the common functions of web crawlers.
Figure 2-1 schematic diagram of web crawler skills
As figure 2-1 shows, web crawlers can do many things in place of manual work. For example, they can serve as search engines or crawl the pictures on a website; some people crawl all the pictures on certain websites and browse them in one place. Web crawlers can also be used in financial investment, for instance automatically crawling financial information for investment analysis.
Sometimes there are several news websites we prefer, and opening each of them every time is troublesome. We can use a web crawler to crawl the news from these sites and read it all in one place.
Sometimes, when browsing information on the web, we run into a lot of advertisements. We can use a crawler to crawl the information on the page and automatically filter out the advertisements, making the information easier to read and use.
Sometimes we need to do marketing, and finding target customers and their contact information is a key issue. We could search manually on the Internet, but that would be very inefficient. Instead, we can give a crawler the corresponding rules and have it automatically collect data such as target users' contact information from the Internet for our marketing use.
Sometimes we want to analyze the users of a website, for example their activity, number of posts, or which articles are popular. If we are not the site's webmaster, gathering such statistics by hand would be an enormous task. With a crawler we can easily collect this data for further analysis, and all the crawling is carried out automatically; we only need to write the crawler and design the rules.
Crawlers can also do many other powerful things. In short, crawlers can, to a certain extent, replace manual access to web pages, so operations that we used to perform by hand can now be automated, letting us use the useful information on the Internet more efficiently.
2. The core of a search engine
Crawlers and search engines are inseparable; since we have discussed web crawlers, search engines inevitably come up. Here we briefly explain the core technology of a search engine.
Figure 2-2 shows the core workflow of a search engine. First, the search engine uses its crawler module to crawl pages on the Internet and stores them in the original page database. The crawler module consists mainly of a controller and the crawler itself: the controller controls the crawling, and the crawler carries out the specific crawling tasks.
The data in the original page database is then indexed and stored in the index database.
When users search for information, they enter their query through the user interface, which corresponds to the search engine's input box. The searcher then performs word segmentation and other operations on the query and retrieves matching data from the index database.
When a user enters a query, the user's behavior is stored in a user log database, for example the user's IP address and the keywords entered. The data in the user log database is then handed to the log analyzer, which, based on large amounts of user data, adjusts the original database and the index database, changes the ranking of results, or performs other operations.
Figure 2-2 Core Workflow of search engine
That is a brief overview of the core workflow of a search engine. Perhaps the concepts of indexing and retrieval are still hard to tell apart, so let us discuss them a little further here.
In a nutshell, retrieval is an action, while an index is an attribute. For example, a supermarket contains a large number of goods; to find them quickly, they are grouped into categories such as daily necessities, beverages, and clothing. The category names of these goods serve as the index, and the index is maintained by the indexer.
If a customer wants to find a particular product, they have to look for it among the supermarket's many goods; this process is retrieval. With a good index, retrieval is efficient; without an index, retrieval is very inefficient.
For instance, if the goods in a supermarket were not categorized at all, it would be much harder for customers to find a particular product among so many goods.
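To ground the distinction in code, here is a tiny illustrative sketch (not from the original) of an inverted index: building the index is a one-time pass over the stored pages, and retrieval afterwards is just a dictionary lookup for each query word.

```python
# Illustrative sketch: a tiny inverted index.
# Indexing is done once over the stored pages; retrieval is then a fast lookup.
from collections import defaultdict

pages = {
    "page1": "python web crawler tutorial",
    "page2": "python data analysis",
    "page3": "web design basics",
}

# Build the index: word -> set of page ids containing that word.
inverted_index = defaultdict(set)
for page_id, text in pages.items():
    for word in text.lower().split():
        inverted_index[word].add(page_id)

def retrieve(query):
    """Return the ids of pages containing every word in the query."""
    word_sets = [inverted_index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*word_sets) if word_sets else set()

print(retrieve("python web"))  # {'page1'}
```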
3. About user crawlers
The user crawler is one type of web crawler: a crawler used specifically to crawl user data on the Internet. Because user data on the Internet is relatively sensitive, user crawlers also have relatively high utilization value.
There are many things you can do with a user crawler; let us look at some interesting ones.
In 2015, some Zhihu users crawled Zhihu's user data and analyzed it, obtaining a great deal of latent information about the site, for example:
Gender ratio of registered Zhihu users: males account for more than 60%.
Location of registered Zhihu users: Beijing has the largest share of the population, more than 30%.
Industries of registered Zhihu users: the Internet industry has the largest share of users, also more than 30%.
As long as we dig carefully, we can uncover even more latent information, and to analyze such data we must first obtain the user data; web crawler technology lets us crawl this useful user information easily.
Similarly, in 2015 some netizens crawled the information of 30 million Qzone users and likewise obtained a lot of latent information, for example:
When Qzone users post: around 22:00 in the evening, the average number of posts is the highest of the day.
Birth months of Qzone users: more users were born in January and October.
Age distribution of Qzone users: relatively more users were born between 1990 and 1995.
Gender distribution of Qzone users: males account for more than 50%, females for more than 30%, and unspecified for about 10%.
Besides the two examples above, user crawlers can do many other things. For example, crawling Taobao user information lets us analyze which products Taobao users like, which helps us position our own goods.
As you can see, user crawlers can yield a lot of interesting latent information. Are such crawlers difficult to write? In fact they are not; I believe you could write one too.
Summary
Web crawlers, also known as web spiders, web ants, or web robots, can browse information on the network automatically; this browsing follows rules that we define, called web crawler algorithms. With Python, it is convenient to write crawler programs that retrieve Internet information automatically.
Learning crawlers lets you: ① build a custom search engine and understand more deeply how search engines collect data; ② obtain more high-quality data sources for big data analysis; ③ do better search engine optimization; ④ improve your prospects for finding or changing jobs.
A web crawler is composed of control nodes, crawler nodes, and a resource library.
According to the technology and structure used, web crawlers can be divided into general-purpose web crawlers, focused web crawlers, incremental web crawlers, deep web crawlers, and so on; a crawler used in practice is usually a combination of several of these types.
A focused web crawler mainly consists of an initial URL collection, a URL queue, a page crawling module, a page analysis module, a page database, a link filtering module, a content evaluation module, a link evaluation module, and so on.
Crawlers can, to a certain extent, replace manual access to web pages, so operations that we used to perform by hand to obtain Internet information can now be automated, letting us use the useful information on the Internet more efficiently.
Retrieval is an action, while an index is an attribute. With a good index, retrieval is efficient; without one, retrieval is very inefficient.
The user crawler is one type of web crawler: a crawler used specifically to crawl user data on the Internet. Because user data on the Internet is relatively sensitive, user crawlers have relatively high utilization value.
The above is what a Python web crawler is and how it is used. Some of these points come up in everyday work, and I hope you have learned something from this article.