Introduction to the principle of web crawler

2025-01-17 Update From: SLTechnology News&Howtos


This article introduces the principles of web crawlers. Many people have questions about how crawlers work, so this article sorts the topic into simple, easy-to-follow steps. I hope it helps answer your doubts. Let's get started!

Learn about browsers and servers

Everyone is familiar with the browser; it's safe to say that anyone who has been on the Internet knows it. However, not many people understand how browsers actually work.

If you want to develop a crawler, you must understand how the browser works. It is the indispensable tool for writing crawlers; nothing else comes close.

During an interview, have you come across a question like this, at once sweeping and detailed:

Please describe what happens between typing a URL into the browser's address bar and seeing the page.

This really tests the breadth of your knowledge: an experienced programmer could talk about it for three days and nights, or distill the essence into a few minutes, while a beginner may know little about the whole process.

Coincidentally, the more thoroughly you understand this question, the more it helps you write crawlers. In other words, crawling is an area that tests comprehensive skills. So, are you ready for this challenge?

Without further ado, let's start from this question, get to know browsers and servers, and see what knowledge crawlers need.

As mentioned earlier, this question could be discussed for three days and three nights, but we don't have that much time, so we will skip some details and walk through the general process as it relates to crawlers, in three parts:

The browser makes a request

The server responds

The browser receives the response

1. The browser makes a request

Enter a URL in the browser address bar and press Enter. The browser sends the server a request for a web page; in other words, it tells the server, "I want to see one of your web pages."

This short sentence hides many details, which I will cover one by one. Mainly:

Is the URL valid?

Where is the server?

What does the browser send to the server?

What did the server return?

1) Is the URL valid?

First of all, the browser needs to determine whether the URL you entered is valid. Programmers are no strangers to URLs, those long strings that begin with http(s), but did you know a URL can also start with ftp, mailto, file, data, or irc? Here is the complete syntax format:

URI = scheme:[//authority]path[?query][#fragment]
# where authority looks like this:
authority = [userinfo@]host[:port]
# userinfo may contain both a user name and a password, separated by ":":
userinfo = [user_name:password]

A diagram illustrates this syntax more vividly (figure not reproduced here).

Experience: judging the validity of a URL

In Python, urllib.parse can be used to perform all kinds of operations on URLs.

In [1]: import urllib.parse
In [2]: url = 'http://dachong:the_password@www.yuanrenxue.com/user/info?page=2'
In [3]: zz = urllib.parse.urlparse(url)
In [4]: zz
Out[4]: ParseResult(scheme='http', netloc='dachong:the_password@www.yuanrenxue.com', path='/user/info', params='', query='page=2', fragment='')

We see that the urlparse function parses the URL into six parts:

scheme://netloc/path;params?query#fragment

The main thing to note is that netloc is not the same as host in the URL syntax definition: netloc is the whole authority component, which may also include userinfo and a port.
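To make the difference concrete, here is a small sketch using urlparse on the example URL above (with a port added for illustration), showing that netloc carries userinfo, host and port together while hostname is just the host:

```python
from urllib.parse import urlparse

url = 'http://dachong:the_password@www.yuanrenxue.com:8080/user/info?page=2'
parts = urlparse(url)

# netloc is the entire authority: userinfo, host and port together
print(parts.netloc)    # dachong:the_password@www.yuanrenxue.com:8080
# hostname is only the host, without userinfo or port
print(parts.hostname)  # www.yuanrenxue.com
print(parts.port)      # 8080
print(parts.username)  # dachong
print(parts.password)  # the_password
```

This is why a crawler should use hostname (not netloc) when it needs to resolve the server's address.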

2) Where is the server?

The host in the URL definition above is a server on the Internet. It can be an IP address, but it is usually what we call a domain name. A domain name is bound to one (or more) IP addresses via DNS. To visit a website by its domain name, the browser must first resolve the domain name through a DNS server to get the real IP address.

Domain name resolution is generally done by the operating system, and a crawler does not need to care about it. However, when you write a large crawler, like the Google or Baidu search engines, efficiency becomes very important, and the crawler has to maintain its own DNS cache.

Old ape experience: large crawlers need to maintain their own DNS cache

3) What does the browser send to the server?

Once the browser has the IP address of the web server, it can send a request to the server. This request follows the HTTP protocol, and what you need to care about when writing a crawler is the HTTP headers. For example, when accessing en.wikipedia.org/wiki/URL, the browser sends request headers with a dictionary-like structure:

Authority: the target machine that is accessed

Method: the HTTP request method. There are several:

GET

HEAD

POST

PUT

DELETE

CONNECT

OPTIONS

TRACE

PATCH

Generally speaking, GET and POST are most used by crawlers.

Path: the path to the website visited

Scheme: the protocol type of the request; here it is https.

Accept: acceptable response content type (Content-Types)

Accept-encoding: list of acceptable encodings

Accept-language: a natural language list of acceptable responses

Cache-control: specify instructions that must be followed by all caching mechanisms in this request / response chain

Cookie: an HTTP cookie previously sent by the server via Set-Cookie

This is one of the things that crawlers care about, and the login information is all here.

Upgrade-insecure-requests: a non-standard request field, which can be ignored.

User-agent: browser identity

This is also a part crawlers care about a great deal. For example, if you need to get the mobile version of a page, you set User-Agent to that of a mobile browser.

Experience: communicate with the server by setting up headers
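As a sketch of this idea with the standard library's urllib.request (the User-Agent string, cookie value and URL below are made-up examples):

```python
import urllib.request

# Pretend to be a mobile browser by overriding User-Agent,
# and carry login state via the Cookie header.
req = urllib.request.Request(
    'https://www.yuanrenxue.com/user/info',
    headers={
        'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 15_0 like Mac OS X)',
        'Cookie': 'sessionid=xxxx',
    },
)

# urllib stores header names in capitalized form, e.g. 'User-agent'
print(req.get_header('User-agent'))
# urllib.request.urlopen(req) would then send the request with these headers
```

Third-party libraries such as requests offer the same thing through a `headers=` argument, but the principle is identical: the headers you set are what the server sees.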

4) What does the server return?

If we enter a web address (not a file download address) in the browser address bar and press Enter, we soon see a web page with formatted text, pictures, videos and other data: a page rich in content. However, when we view the source code through the browser, we see a pile of HTML code in text format.

Yes, it is a pile of code, but the browser renders it into a beautiful web page. Inside this code are:

CSS: the browser lays out text, pictures and so on according to it

JavaScript: the browser runs it so that users can interact with the web page

Links, such as to images: the browser downloads them and renders them into the final page

The information we want to crawl is hidden in this HTML code, and we can extract what we want by parsing it. If the data we want is not in the HTML code but we can see it on the web page, then the browser loaded that data asynchronously (quietly downloaded it) through an ajax request.

In that case, we need to observe the browser's loading process to find out which ajax request loaded the data we need.
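As a sketch of the "parse and extract" step, the standard library's html.parser can pull links out of a page. The HTML snippet below is made up for illustration:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

html = '<html><body><a href="/user/info?page=2">next</a></body></html>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/user/info?page=2']
```

In practice most crawlers reach for lxml or BeautifulSoup for their more convenient selectors, but the underlying job is the same: walk the HTML tree and pick out the pieces you want.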

At this point, this introduction to the principles of web crawlers is over. Pairing theory with practice is the best way to learn, so go and try it!
