This article introduces the main categories of Python crawlers. The content is detailed yet easy to follow and the steps are simple, so it should serve as a useful reference. After reading, you should have a working grasp of how Python crawlers are classified. Let's take a look.
I. General crawler
A general-purpose web crawler is an important part of a search engine's crawling system (Baidu, Google, Sogou, etc.). Its main purpose is to download web pages from the Internet to local storage, forming a mirror backup of Internet content that provides search support for the search engine.
Step 1
Search engines crawl data from thousands of websites.
Step 2
The search engine stores the pages fetched by the crawler in an original page database (that is, the document library). The stored page data is exactly the same as the HTML a user's browser would receive.
Step 3
The search engine preprocesses the crawled pages: Chinese word segmentation, noise removal, and index building.
After organizing and processing the information, the search engine provides a keyword retrieval service for users and displays the relevant results to them; the results are ranked before being displayed. A minimal fetch-and-save sketch of the crawling step follows below.
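The "download web pages locally" step above can be illustrated with a short script. This is only a minimal sketch, assuming the third-party requests library is installed; the seed URLs and the file-naming scheme are illustrative, not part of any real search engine.

# Minimal sketch of "download web pages locally" (assumes `requests` is installed;
# seed URLs and file names are illustrative only).
import requests

seed_urls = [
    "https://www.example.com/",
    "https://www.example.org/",
]

for url in seed_urls:
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        # Derive a simple local file name from the URL and save the raw HTML,
        # forming a tiny "mirror backup" of the page.
        filename = url.replace("https://", "").strip("/").replace("/", "_") or "index"
        with open(filename + ".html", "w", encoding="utf-8") as f:
            f.write(response.text)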
II. Limitations of search engines
Search engines crawl entire web pages, not the specific, detailed information within them.
Search engines cannot provide search results tailored to a specific user's needs.
Focused crawler
Because of these limitations of general crawlers, focused crawler technology is widely used. A focused crawler is a "subject-oriented" web crawler. The difference between a focused crawler and a general search-engine crawler is that a focused crawler processes and filters content while crawling, trying to ensure that only page data relevant to the requirement is fetched. A small sketch of this filtering idea appears below.
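To make the "process and filter while crawling" idea concrete, here is a minimal sketch, again assuming the requests library is installed; the topic keywords, the URL, and the is_relevant helper are hypothetical illustrations, not a standard API.

# Sketch of a focused ("subject-oriented") crawler: fetch a page, then keep it
# only if it looks relevant to the topic. Keywords and URL are illustrative.
import requests

TOPIC_KEYWORDS = ["python", "crawler"]

def is_relevant(html: str) -> bool:
    """Very rough relevance filter: does the page mention a topic keyword?"""
    text = html.lower()
    return any(keyword in text for keyword in TOPIC_KEYWORDS)

url = "https://www.example.com/"
response = requests.get(url, timeout=10)
if response.status_code == 200 and is_relevant(response.text):
    print(f"Keep {url}: it matches the topic")
else:
    print(f"Skip {url}: not relevant")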
III. Robots protocol
Robots is an agreement between a website and crawlers. In a simple, plain-text file (robots.txt) it tells crawlers what they are permitted to access, and robots.txt is the first file a search engine checks when visiting a website. When a search spider visits a site, it first looks for robots.txt in the site's root directory. If the file exists, the spider determines its crawling scope from its contents; if it does not exist, all spiders can access every page on the site that is not password protected. (Source: Baidu Encyclopedia)
The Robots protocol is also known as the crawler protocol or robot protocol; its full name is the Robots Exclusion Protocol. Websites use the Robots protocol to tell search engines which pages may be crawled and which may not, for example:
Taobao: https://www.taobao.com/robots.txt
Baidu: https://www.baidu.com/robots.txt
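A crawler can check these rules programmatically before fetching a page. The sketch below uses only Python's standard-library urllib.robotparser; the user-agent name and target URL are illustrative.

# Check robots.txt before crawling, using the standard library only.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.taobao.com/robots.txt")
rp.read()  # download and parse the robots.txt file

# Ask whether a crawler with this (illustrative) user agent may fetch the URL.
print(rp.can_fetch("MyCrawler", "https://www.taobao.com/some/page"))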
IV. Request and response
Network communication consists of two parts: client request message and server response message.
The process by which the browser sends an HTTP request:
1. When we enter the URL https://www.baidu.com in the browser, the browser sends a Request to fetch the HTML file of https://www.baidu.com, and the server sends the Response file object back to the browser.
2. The browser parses the HTML in the Response and finds that it references many other files, such as image files, CSS files, and JS files. The browser then automatically sends further Requests to fetch those images, CSS files, and JS files.
3. When all the files are downloaded successfully, the web page will be fully displayed according to the HTML syntax structure.
In fact, when we crawl data using crawler techniques, we go through this same process: sending a request to the server and receiving the server's response data.
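The same request/response exchange can be reproduced in a few lines of Python. This is a minimal sketch assuming the requests library is installed; unlike a browser, it only fetches the page itself, not the referenced images, CSS, or JS.

# Minimal request/response sketch: send a Request, inspect the Response.
import requests

response = requests.get("https://www.baidu.com")

print(response.status_code)                  # e.g. 200 on success
print(response.headers.get("Content-Type"))  # metadata from the response message
print(response.text[:200])                   # start of the returned HTML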
This concludes this article on the classification of Python crawlers. Thank you for reading! You should now have a basic understanding of how Python crawlers are classified.