What are the common Python interview questions?

2025-02-25 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article goes through common Python interview questions, with a focus on web crawlers. The answers are short and practical, so let's work through them one by one.

One. What anti-crawler strategies and solutions have you encountered?

1. Anti-crawler checks on request headers

2. Anti-crawler based on user behavior (how often the same IP accesses the site in a short period of time)

3. Dynamic web pages (the data is requested through Ajax or generated by JavaScript)

4. Encryption of part of the data (the data appears garbled)

Solutions:

For basic web pages, customize the request headers and send the header data a browser would send.

Crawl through multiple proxy IPs, or reduce the crawl frequency.

Dynamic web pages can be crawled using selenium + phantomjs (PhantomJS is no longer maintained; a headless Chrome or Firefox driven by selenium plays the same role).

If part of the data is encrypted, you can take screenshots with selenium and recognize the text with the third-party pytesseract library (it does not ship with Python), but this is slow. The most direct way is to find the encryption method and reverse it.
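The proxy and frequency advice above can be sketched with the standard library alone. This is a minimal illustration, not a production crawler: the proxy addresses and the two-second delay are placeholder assumptions, and `Throttle`, `opener_for`, and `fetch` are hypothetical helper names.

```python
import itertools
import time
import urllib.request

# Placeholder proxy addresses, for illustration only.
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_delay, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def opener_for(proxy):
    """Build an opener that routes HTTP and HTTPS traffic through `proxy`."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


proxy_pool = itertools.cycle(PROXIES)   # rotate proxies round-robin
throttle = Throttle(min_delay=2.0)      # assumed delay; tune per site


def fetch(url):
    """Wait out the throttle, then fetch `url` through the next proxy."""
    throttle.wait()
    opener = opener_for(next(proxy_pool))
    with opener.open(url, timeout=10) as response:
        return response.read()
```

The clock and sleep functions are injectable so the throttle can be exercised without real waiting.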

Two. What is the difference between urllib and urllib2?

Both urllib and urllib2 (Python 2 modules) are used to make URL requests, but urllib2 can accept an instance of the Request class, which lets you set the headers of a URL request, while urllib can only accept a URL. This means urllib cannot disguise your User-Agent string.

urllib provides the urlencode() method for generating GET query strings, while urllib2 does not. This is why urllib is often used together with urllib2.
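In Python 3 the two modules were merged into the urllib package: Request lives in urllib.request and urlencode in urllib.parse. A short sketch of the combination described above, assuming a made-up endpoint and User-Agent string:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and parameters, for illustration only.
BASE_URL = "https://example.com/search"
params = {"q": "python interview", "page": 2}

# urlencode() (urllib.urlencode in Python 2) builds the GET query string.
query = urlencode(params)
url = f"{BASE_URL}?{query}"

# Request (urllib2.Request in Python 2) carries custom headers such as User-Agent.
request = Request(url, headers={"User-Agent": "Mozilla/5.0 (compatible; demo-crawler)"})

print(url)
print(request.get_header("User-agent"))
```

Nothing is sent here; passing `request` to `urllib.request.urlopen` would perform the actual GET.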

Three. List the networking packages and parsing packages used by web crawlers.

Networking packages: urllib, urllib2, requests

Parsing packages: re, xpath, Beautiful Soup, lxml

Four. Briefly describe the steps of a crawler.

1. Define the requirements

2. Identify the target resources

3. Fetch the data the website returns for each URL

4. Locate the data in the response

5. Store the data
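The fetch, locate, and store steps can be sketched end to end with the standard library. This is only an illustration under assumptions: the sample HTML, the `links` table, and the helper names are invented, and a real crawler would more likely use requests with lxml or Beautiful Soup.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch(url):
    """Fetch step: get the data the site returns for a URL."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (compatible; demo-crawler)"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class LinkParser(HTMLParser):
    """Locate step: collect every <a href=...> together with its text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def store(connection, links):
    """Store step: write the extracted rows to a database table."""
    connection.execute("CREATE TABLE IF NOT EXISTS links (href TEXT, title TEXT)")
    connection.executemany("INSERT INTO links VALUES (?, ?)", links)
    connection.commit()


# Sample page so the sketch runs offline; in practice this would be fetch(url).
SAMPLE_HTML = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'
connection = sqlite3.connect(":memory:")
store(connection, extract_links(SAMPLE_HTML))
print(connection.execute("SELECT href, title FROM links").fetchall())
```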

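Five. How do you deal with anti-crawling mechanisms?

Anti-crawling mechanism:

Header-based checks

The site checks the User-Agent, the Referer, and the Cookie.

Countermeasure: send all the header information a browser would send.

Note: the Accept-Encoding: gzip, deflate header should be left out, otherwise the server may return a compressed body that the client has to decompress itself (unless your HTTP client does this automatically).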
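If the Accept-Encoding header is sent anyway, the body has to be decompressed according to the response's Content-Encoding header. A minimal sketch of that step, assuming only gzip and deflate need handling; `decode_body` is a hypothetical helper name:

```python
import gzip
import zlib


def decode_body(raw, content_encoding):
    """Return the decompressed body for the given Content-Encoding header value."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            # zlib-wrapped deflate, which is what the HTTP specification describes
            return zlib.decompress(raw)
        except zlib.error:
            # raw deflate stream without the zlib wrapper, sent by some servers
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    # no encoding or an unrecognized one: return the bytes unchanged
    return raw
```

With urllib, the header value would come from `response.headers.get("Content-Encoding")`.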

Six. What are the common HTTP methods?

GET: requests the specified resource and returns the entity body.

HEAD: similar to a GET request, except that the response has no body; it is used to retrieve only the headers.

POST: submits data to a specified resource for processing (such as a form submission or a file upload). The data is contained in the request body.

PUT: transfers data from the client to the server to replace the contents of the specified document.

DELETE: requests that the server delete the specified resource.

CONNECT: reserved in HTTP/1.1 for proxy servers that can switch the connection to tunnel mode.

OPTIONS: allows the client to query the capabilities of the server, such as the methods a resource supports.

TRACE: echoes back the request as the server received it, mainly for testing or diagnosis.
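As a small illustration, urllib.request.Request accepts a method argument, so most of these methods can be expressed without any third-party library. The URL and payload below are made up, and no request is actually sent:

```python
from urllib.request import Request

URL = "https://example.com/items/1"  # hypothetical resource

requests_by_method = {
    "GET": Request(URL),                                   # default when there is no body
    "HEAD": Request(URL, method="HEAD"),
    "POST": Request(URL, data=b"name=demo"),               # default when a body is given
    "PUT": Request(URL, data=b"name=demo", method="PUT"),
    "DELETE": Request(URL, method="DELETE"),
    "OPTIONS": Request(URL, method="OPTIONS"),
}

for name, request in requests_by_method.items():
    print(name, request.get_method())
```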

Seven. What is the role of Redis in scrapy-redis?

It replaces the Scheduler of the Scrapy framework with one backed by a Redis database, so that the request queue is managed in one place and shared between crawler instances.

Advantages:

It can make full use of the bandwidth of multiple machines.

It can make full use of the IP addresses of multiple machines.
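In practice the swap is a configuration change in the project's settings.py. A minimal fragment using the setting names from the scrapy-redis README; the Redis address is an assumption for a local setup:

```python
# settings.py fragment for a Scrapy project that uses scrapy-redis.

# Use the Redis-backed scheduler instead of Scrapy's default one.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Share the duplicate-request filter through Redis as well.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue in Redis between runs so a crawl can be paused and resumed.
SCHEDULER_PERSIST = True

# Assumed address of the shared Redis instance; every machine points here.
REDIS_URL = "redis://localhost:6379/0"
```

Every crawler process started with these settings pulls requests from the same queue, which is what lets several machines share one crawl.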

Eight. What anti-crawler strategies have you encountered, and how did you solve them?

Anti-crawler through headers: customize the headers and send the header data a browser would send.

Anti-crawler based on user behavior (IP blocking): crawl through multiple proxies, or reduce the crawl frequency.

Dynamic web pages (data requested through JS or Ajax): these can be crawled using selenium + phantomjs.

Encryption of part of the data (the data appears garbled): find the encryption method and reverse it.

Nine. If you were asked to protect a website against crawlers, how would you make crawling harder?

Check the User-Agent in the request headers.

Detect the access frequency of the same IP.

Load the data through Ajax.

Crawlers fetch the source of a page. To protect the HTML of a static web page, you can generate the HTML with jQuery on the client, so that the content is not present in the page source.
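The first two defenses can be sketched on the server side as a User-Agent check plus a sliding-window counter per IP. This is an illustrative sketch only: the marker list, the limits, and the names `looks_like_bot` and `IpRateLimiter` are assumptions, and a real deployment would usually keep the counters in a shared store.

```python
import time
from collections import defaultdict, deque

# Illustrative substrings of User-Agent values sent by common HTTP tools.
BLOCKED_AGENT_MARKERS = ("python-urllib", "python-requests", "scrapy", "curl")


def looks_like_bot(user_agent):
    """Flag requests with a missing User-Agent or one that names a known tool."""
    ua = (user_agent or "").lower()
    return not ua or any(marker in ua for marker in BLOCKED_AGENT_MARKERS)


class IpRateLimiter:
    """Allow at most `max_requests` per IP within a sliding time window."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip):
        now = self._clock()
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A request handler would reject the request when `looks_like_bot(...)` is true or `limiter.allow(ip)` is false.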

Ten. How many components is Scrapy divided into, and what does each one do?

It is divided into 5 parts: Spiders (crawlers), Scrapy Engine (engine), Scheduler (scheduler), Downloader (downloader), and Item Pipeline (processing pipeline).

Spiders: classes defined by the developer to parse web pages and extract content from the responses returned for the specified URLs.

Scrapy Engine: controls the data flow of the whole system and triggers events.

Scheduler: receives the requests sent by the engine and puts them in a queue, so that it can hand them back when the engine asks for them.

Downloader: fetches web pages and provides them to the engine, which forwards them to the Spiders.

Item Pipeline: responsible for processing the data extracted by the Spiders.

Typical tasks include cleaning HTML data, validating the scraped data (checking that an item contains certain fields), checking for duplicates (and dropping them), and saving the results to a database.
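An Item Pipeline is a plain class with a process_item(item, spider) method. The sketch below covers validation and duplicate dropping; the field names are hypothetical, and a local DropItem class stands in for scrapy.exceptions.DropItem so the example runs without Scrapy installed.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem (assumption for this sketch)."""


class ValidateAndDedupePipeline:
    """Validate required fields, drop duplicates, and tidy the title."""

    required_fields = ("url", "title")  # hypothetical item fields

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        # Validation: every required field must be present and non-empty.
        missing = [field for field in self.required_fields if not item.get(field)]
        if missing:
            raise DropItem(f"missing fields: {missing}")
        # Duplicate check: drop items whose URL was already processed.
        if item["url"] in self.seen_urls:
            raise DropItem(f"duplicate item: {item['url']}")
        self.seen_urls.add(item["url"])
        # Cleaning: strip surrounding whitespace from the title.
        item["title"] = item["title"].strip()
        return item
```

In a real project the class would import DropItem from scrapy.exceptions and be enabled through the ITEM_PIPELINES setting.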

That covers the common Python interview questions on web crawlers. The best way to absorb them is to try each one in practice.


© 2024 shulou.com SLNews company. All rights reserved.
