2025-02-23 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article explains the high-frequency interview questions for Python crawlers. The answers below are kept short and practical, so readers preparing for an interview can work through them one by one.
1. What does an HTTP request contain?
1. Request method: mainly GET and POST. The parameters of a POST request are carried in the request body rather than in the URL.
2. Request URL: the uniform resource locator. A web document, an image, a video, and so on can each be uniquely identified by a URL.
3. Request headers: including User-Agent (browser identification), Host, Cookie, and others.
4. Request body: a GET request generally has none; a POST request's body usually carries form-data.
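The four parts above can be seen by building requests with the standard library's `urllib.request` (the third-party `requests` library is the more common choice in crawlers, but this sketch needs no install). The URL, form fields, and header values here are invented for illustration.

```python
# Sketch: GET parameters live in the URL; POST parameters live in the body.
from urllib.parse import urlencode
from urllib.request import Request

# GET: parameters travel in the URL itself, so there is no body.
get_req = Request("https://example.com/search?q=python")

# POST: parameters travel in the request body as form-data.
body = urlencode({"user": "alice", "page": "1"}).encode()
post_req = Request(
    "https://example.com/login",
    data=body,  # supplying a body makes urllib use POST
    headers={"User-Agent": "Mozilla/5.0", "Host": "example.com"},
)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```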
2. What information does a Response contain?
1. Response status: the status code, e.g. 200 for a normal response, 30x for a redirect.
2. Response headers: content type, content length, server information, Set-Cookie, and so on.
3. Response body: the page source code, binary image data, etc.
3. Common http status codes
200 status code: the server handled the request normally.
301 status code: the requested resource has been permanently moved to a new location. When the server returns this response (to a GET or HEAD request), the requester is automatically moved to the new location.
302 status code: the requested resource temporarily responds from a different URI, but the requester should continue to use the original location for future requests.
401 status code: the request requires authentication. For pages that need a login, the server may return this response.
403 status code: the server understood the request but refuses to execute it. Unlike a 401 response, authenticating does not help, and the request should not be submitted repeatedly.
404 status code: the request failed; the desired resource was not found on the server.
500 status code: the server encountered an unexpected condition that prevented it from completing the request. In general this happens when the server's code goes wrong.
503 status code: the server is currently unable to process requests due to temporary maintenance or overload.
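The codes listed above are available in the standard library's `http.HTTPStatus` enum, which makes status handling in a crawler more readable than bare integers. A minimal sketch:

```python
# HTTPStatus maps each numeric code to its standard reason phrase.
from http import HTTPStatus

for code in (200, 301, 302, 401, 403, 404, 500, 503):
    print(code, HTTPStatus(code).phrase)

def is_redirect(code: int) -> bool:
    # Any 3xx code is a redirect class response.
    return 300 <= code < 400

print(is_redirect(301))  # True
print(is_redirect(200))  # False
```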
4. What do HTTP request and response headers contain?
Common HTTP request headers:
Accept: the content types the browser can handle
Accept-Charset: the character sets the browser can display
Accept-Encoding: the compression encodings the browser can handle
Accept-Language: the language currently set in the browser
Connection: the type of connection between browser and server
Cookie: any cookies set for the current page
Host: the domain of the page making the request
Referer: the URL of the page making the request
User-Agent: the browser's user-agent string

Common HTTP response headers:
Date: the time the message was sent, in the format defined by RFC 822
Server: the server name
Connection: the type of connection between browser and server
Content-Type: the MIME type of the document that follows
Cache-Control: controls HTTP caching
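Since HTTP headers are plain `Name: value` lines, a raw header block can be parsed with the standard library's `email` parser, which Python's own HTTP tooling reuses internally. The response below is made up for the demo:

```python
# Parse a raw header block; lookups on the result are case-insensitive.
from email import message_from_string

raw_headers = (
    "Date: Tue, 01 Jan 2030 00:00:00 GMT\r\n"
    "Server: nginx\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "Cache-Control: no-cache\r\n"
    "Connection: keep-alive\r\n"
)
headers = message_from_string(raw_headers)

print(headers["Server"])        # nginx
print(headers["content-type"])  # case-insensitive lookup works too
```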
5. Under what circumstances does a MySQL index fail?
1. If a condition contains OR, the index is not used even when a conditional index exists (which is why OR should be used as little as possible). For the index to take effect with OR, every column in the OR condition must be indexed.
2. For a multi-column (composite) index, the index is not used unless the query uses its leftmost part.
3. A LIKE query whose pattern starts with % does not use the index.
4. If the column type is a string, be sure to quote the value in the condition; otherwise the implicit type conversion prevents the index from being used.
5. If MySQL estimates that a full table scan is faster than using the index, the index is not used.
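Rule 3 can be checked with a query plan. MySQL's `EXPLAIN` shows whether an index is used; as a self-contained stand-in, the sketch below uses SQLite's `EXPLAIN QUERY PLAN` from Python's built-in `sqlite3`, with a made-up table, to show that a leading-wildcard LIKE falls back to a full scan.

```python
# Demonstrate that LIKE '%...' cannot use the index on name.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX idx_name ON users(name)")

def plan(sql: str) -> str:
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(str(r) for r in rows)

# Leading wildcard: the index on name cannot be used, so the table is scanned.
print(plan("SELECT * FROM users WHERE name LIKE '%lee%'"))
```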
6. What storage engines does MySQL have, and how do they differ?
The main engines are MyISAM and InnoDB; the key differences are as follows:
1. InnoDB supports transactions and MyISAM does not, which is very important. A transaction lets a batch of inserts, updates, and deletes be rolled back on any error; MyISAM cannot do this.
2. MyISAM suits query- and insert-heavy applications; InnoDB suits frequently modified data and applications with higher safety requirements.
3. InnoDB supports foreign keys; MyISAM does not.
4. MyISAM is the default engine (in older MySQL versions); InnoDB must be specified explicitly.
5. InnoDB does not support FULLTEXT indexes (in older versions).
6. InnoDB does not store the table's row count, so SELECT COUNT(*) FROM table must scan the whole table, while MyISAM simply reads out its saved row count. Note that when the COUNT(*) statement has a WHERE condition, MyISAM also needs to scan the table.
7. For an auto-increment field, InnoDB requires an index containing only that field, while a MyISAM table can combine it with other fields in a joint index.
8. When emptying a whole table, InnoDB deletes row by row, which is very slow; MyISAM rebuilds the table.
9. InnoDB supports row-level locking (though some statements lock the whole table, e.g. UPDATE t SET a = 1 WHERE user LIKE '%lee%').
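Difference 1 is worth making concrete: "supports transactions" means a failed step can roll back every change in the batch. The sketch below uses SQLite (via the built-in `sqlite3` module) purely as a stand-in for InnoDB's rollback semantics, with an invented accounts table.

```python
# A transfer where the failure happens after the debit but before the credit.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    conn.execute("UPDATE accounts SET balance = balance - 100 WHERE name = 'alice'")
    raise RuntimeError("simulated failure: the credit to bob never runs")
except RuntimeError:
    conn.rollback()  # the debit is undone; a MyISAM table could not do this

balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'"
).fetchone()[0]
print(balance)  # 100: alice's money was not lost
```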
7. Advantages and disadvantages of Scrapy:
Advantages: Scrapy is asynchronous; it uses the more readable XPath instead of raw regular expressions; it has powerful statistics and logging systems; it can crawl different URLs at the same time; it supports a shell mode that makes debugging convenient; middleware can be written independently, making it easy to add unified filters; and extracted data can be stored in a database through pipelines.
Disadvantages: it is a Python-based crawler framework with limited extensibility; because it is built on the Twisted framework, a running exception will not kill the reactor, and the asynchronous framework will not stop other tasks after an error, so data errors are difficult to detect.
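On the "readable XPath" point: Scrapy's selectors (via parsel) accept full XPath. As a dependency-free illustration of the style, the standard library's `xml.etree.ElementTree` supports a limited XPath subset, shown here on a made-up, well-formed HTML snippet.

```python
# Extract link hrefs with an XPath-style path expression.
import xml.etree.ElementTree as ET

html = """
<html><body>
  <div class="item"><a href="/page/1">first</a></div>
  <div class="item"><a href="/page/2">second</a></div>
</body></html>
"""
root = ET.fromstring(html)

# ".//div[@class='item']/a" reads much like the Scrapy selector equivalent.
links = [a.get("href") for a in root.findall(".//div[@class='item']/a")]
print(links)  # ['/page/1', '/page/2']
```

Note that ElementTree requires well-formed markup; real pages usually need parsel or lxml, which tolerate messy HTML.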
8. How does HTTPS transfer data securely?
The client (usually a browser) first sends the server a request to set up encrypted communication, carrying a random number random1; the server replies with its certificate and a second random number random2. After the client verifies the certificate, it generates a random pre-master secret, encrypts it with the server's public key, and sends it to the server. The server decrypts it with its private key to obtain the pre-master secret, then derives the session key and MAC keys from random1, random2, and the pre-master secret according to an agreed algorithm. The client derives the same session key and MAC keys from the same inputs using the same algorithm. In subsequent interactions, both sides use this symmetric session key and the MAC keys to encrypt, decrypt, and authenticate the transmitted content.
9. Describe how the scrapy framework works?
The engine takes the first batch of URLs from start_urls and sends the requests to the scheduler, which places them in the request queue. The scheduler hands requests from the queue to the downloader, which fetches the corresponding responses. Each response is passed to the spider's own parse method for extraction: extracted data items are handed to the pipeline for processing, while extracted URLs go back through the earlier steps (the engine sends the request to the scheduler for queuing, and so on) until there are no requests left in the queue and the program ends.
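The loop just described can be modeled in a few lines of plain Python: a request queue stands in for the scheduler, a lookup function for the downloader, and a parse step that yields either items (to the pipeline) or new URLs (back to the queue). The pages and links are invented for the demo; real Scrapy does all of this asynchronously over the network.

```python
# A synchronous miniature of the Scrapy engine/scheduler/downloader/pipeline loop.
from collections import deque

FAKE_SITE = {
    "/start": {"links": ["/a", "/b"], "item": None},
    "/a": {"links": [], "item": "data-from-a"},
    "/b": {"links": ["/a"], "item": "data-from-b"},
}

def download(url):  # stands in for the downloader
    return FAKE_SITE[url]

def parse(response):  # stands in for the spider's parse() method
    if response["item"]:
        yield ("item", response["item"])
    for link in response["links"]:
        yield ("url", link)

queue, seen, pipeline = deque(["/start"]), set(), []
while queue:  # engine loop: run until the request queue is empty
    url = queue.popleft()
    if url in seen:
        continue
    seen.add(url)
    for kind, value in parse(download(url)):
        if kind == "item":
            pipeline.append(value)  # the item pipeline stores extracted data
        elif value not in seen:
            queue.append(value)     # a new request goes back to the scheduler

print(pipeline)  # ['data-from-a', 'data-from-b']
```

The deduplication via `seen` mirrors Scrapy's built-in duplicate request filter.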