This article mainly explains common Python crawler interview questions. The methods introduced here are simple, fast, and practical, so interested readers may want to follow along.
1. What anti-crawler strategies and solutions have you encountered?
1. Anti-crawling via headers (the server inspects request headers)
2. Anti-crawling based on user behavior (e.g., the frequency of access from the same IP in a short period)
3. Dynamic-page anti-crawling (data is requested via Ajax or generated by JavaScript)
4. Encrypting part of the data (the data appears garbled)
Solutions:
For basic pages, customize the request headers and send the header data with each request (see the sketch after this list).
Use multiple proxy IPs to crawl, or reduce the crawl frequency.
Dynamic pages can be crawled with Selenium + PhantomJS.
For partially encrypted data, you can take screenshots with Selenium and recognize them with the third-party pytesseract OCR library, but this is slow; the more direct way is to find the encryption routine and reverse it.
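For illustration, a minimal sketch of the headers and proxy countermeasures using the requests library; the target URL and proxy addresses are placeholders, not real endpoints:

import random
import time

import requests

# Browser-like headers so requests do not look like a default client.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://example.com/",
}

# Placeholder proxy pool; substitute real proxy addresses.
PROXIES = [
    {"http": "http://127.0.0.1:8080", "https": "http://127.0.0.1:8080"},
]

def fetch(url):
    """Fetch a page with custom headers, a rotating proxy, and a polite delay."""
    time.sleep(random.uniform(1, 3))  # reduce the crawl frequency
    proxy = random.choice(PROXIES)    # rotate proxy IPs
    resp = requests.get(url, headers=HEADERS, proxies=proxy, timeout=10)
    resp.raise_for_status()
    return resp.text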
2. What is the difference between urllib and urllib2?
Both urllib and urllib2 are Python 2 modules for making URL requests, but urllib2 can accept an instance of the Request class, which lets you set headers on the request, while urllib can only accept a URL. This means urllib cannot disguise your User-Agent string.
urllib provides the urlencode() method for building GET query strings, while urllib2 does not. This is why urllib is often used together with urllib2. (In Python 3 the two were merged into urllib.request and urllib.parse.)
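To make the split concrete, a minimal Python 2 sketch; the URL is a placeholder:

# Python 2 only; in Python 3 use urllib.request and urllib.parse instead.
import urllib
import urllib2

params = urllib.urlencode({"q": "python", "page": 1})  # urllib builds the query string
req = urllib2.Request(
    "http://example.com/search?" + params,
    headers={"User-Agent": "Mozilla/5.0"},  # only urllib2.Request can carry headers
)
html = urllib2.urlopen(req).read()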
3. List the network (request) packages and parsing packages used by web crawlers.
Network packages: urllib, urllib2, requests
Parsing packages: re, XPath, BeautifulSoup, lxml (see the sketch below)
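A minimal sketch pairing one request package with two parsing styles; the URL and selectors are placeholders:

import requests
from bs4 import BeautifulSoup
from lxml import etree

html = requests.get("https://example.com", timeout=10).text

# XPath via lxml
titles = etree.HTML(html).xpath("//h1/text()")

# Tree navigation via BeautifulSoup
soup = BeautifulSoup(html, "lxml")
links = [a.get("href") for a in soup.find_all("a")]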
4. Briefly describe the steps of a crawler.
Determine the requirements
Identify the target resources (sites)
Fetch the data the site returns via its URL
Locate (parse) the data you need
Store the data (the sketch below walks through these steps)
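An end-to-end sketch of these steps, assuming the requests and lxml libraries; the URL, XPath, and output file are placeholders:

import json

import requests
from lxml import etree

URL = "https://example.com/list"  # step 2: the identified resource

def crawl():
    html = requests.get(URL, timeout=10).text  # step 3: fetch the returned data
    items = etree.HTML(html).xpath("//li[@class='item']/text()")  # step 4: locate the data
    with open("items.json", "w") as f:
        json.dump(items, f)  # step 5: store the data

if __name__ == "__main__":
    crawl()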
5. How do you deal with anti-crawling mechanisms?
Anti-crawling mechanisms:
From the headers angle: the server may check User-Agent, Referer, and Cookie.
Countermeasure: add all of the browser's header information to the request (see the sketch below).
Note: the Accept-Encoding: gzip, deflate header needs to be commented out, otherwise the raw response body may come back compressed.
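A minimal sketch of copying a browser's headers wholesale with requests; all values are illustrative, as if captured from a typical browser session:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://example.com/",
    "Cookie": "sessionid=xxxx",  # placeholder cookie
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
    # "Accept-Encoding": "gzip, deflate",  # commented out, per the note above
}

resp = requests.get("https://example.com/page", headers=headers, timeout=10)
print(resp.status_code)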
6. What are the common HTTP methods?
GET: requests the specified page and returns the entity body.
HEAD: similar to GET, but the response carries no body; it is used to fetch only the headers.
POST: submits data to the specified resource for processing (such as a form submission or file upload); the data is contained in the request body.
PUT: transfers data from the client to the server to replace the contents of the specified document.
DELETE: requests deletion of the specified page.
CONNECT: reserved in HTTP/1.1 for proxy servers that can switch the connection to tunnel (pipeline) mode.
OPTIONS: allows the client to query the capabilities the server supports (e.g., which methods are allowed).
TRACE: echoes back the request received by the server, mainly for testing or diagnosis (a short requests sketch follows).
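A quick sketch exercising several of these methods with the requests library against a placeholder URL:

import requests

url = "https://example.com/resource"

requests.get(url)                    # fetch the entity body
h = requests.head(url)               # headers only, no body
print(h.headers.get("Content-Type"))

o = requests.options(url)            # ask what the server supports
print(o.headers.get("Allow"))

requests.post(url, data={"k": "v"})  # data travels in the request body
requests.put(url, data=b"new body")  # replace the document's contents
requests.delete(url)                 # request deletion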
7. What is the role of Redis in scrapy-redis?
It replaces the Scheduler in the Scrapy framework with a Redis database, implementing a request queue that is shared across crawler instances for distributed crawling (see the settings sketch after the advantages).
Advantages:
It can make full use of the bandwidth of multiple machines.
It can make full use of the IP addresses of multiple machines.
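For reference, a minimal settings sketch for wiring Redis into a Scrapy project, assuming the scrapy-redis package; the Redis URL is a placeholder:

# settings.py of a Scrapy project

# Replace Scrapy's default scheduler with the Redis-backed one.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests in a shared Redis set instead of per-machine memory.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue in Redis so crawls can be paused, resumed, and shared.
SCHEDULER_PERSIST = True

REDIS_URL = "redis://127.0.0.1:6379/0"  # placeholder connection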
8. What anti-crawler strategies have you encountered, and what were the solutions?
Headers-based anti-crawling: customize the headers and add browser header data to each request.
Behavior-based anti-crawling (IP bans): use multiple proxies to crawl, or reduce the crawl frequency.
Dynamic-page anti-crawling (data requested via JS or Ajax): crawl such pages with Selenium + PhantomJS (see the sketch after this list).
Partial data encryption (garbled data): find the encryption routine and reverse it.
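A minimal Selenium sketch for such dynamic pages. PhantomJS is no longer maintained, so this assumes headless Chrome as a stand-in; the URL is a placeholder:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # headless Chrome in place of PhantomJS

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/dynamic")  # JS/Ajax executes in a real browser
    html = driver.page_source                  # the fully rendered HTML
finally:
    driver.quit()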
9. If you were asked to defend a website against crawlers, how would you increase the difficulty of crawling?
Check the User-Agent in the request headers.
Detect the access frequency of the same IP.
Load the data via Ajax rather than embedding it in the HTML.
A crawler scrapes the page's source file; if the HTML of a nominally static page is assembled client-side with JavaScript (e.g., built with jQuery), the raw source alone will not contain the data. (A server-side sketch of the first two checks follows.)
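A server-side sketch of the User-Agent and frequency checks, assuming Flask; the thresholds and responses are illustrative only:

import time
from collections import defaultdict, deque

from flask import Flask, abort, request

app = Flask(__name__)
hits = defaultdict(deque)  # per-IP request timestamps

MAX_HITS = 30  # illustrative threshold
WINDOW = 60    # seconds

@app.before_request
def anti_crawler_checks():
    # 1. Reject requests without a browser-like User-Agent.
    if "Mozilla" not in request.headers.get("User-Agent", ""):
        abort(403)
    # 2. Throttle IPs that exceed the frequency threshold.
    now = time.time()
    q = hits[request.remote_addr]
    while q and now - q[0] > WINDOW:
        q.popleft()
    q.append(now)
    if len(q) > MAX_HITS:
        abort(429)

@app.route("/")
def index():
    return "hello"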
10. How many components is Scrapy divided into, and what does each do?
It is divided into 5 parts: Spiders, Scrapy Engine, Scheduler, Downloader, and Item Pipeline.
Spiders: classes defined by the developer that parse web pages and extract the content returned for the specified URLs.
Scrapy Engine: controls the data-processing flow of the whole system and triggers transactions.
Scheduler: receives requests sent by the engine, enqueues them, and hands them back to the engine later when requested.
Downloader: fetches the web pages and provides them to the engine, which forwards them to the Spiders.
Item Pipeline: responsible for processing the data extracted by the Spiders,
such as cleaning HTML data, validating crawled data (checking that an item contains certain fields), checking for (and discarding) duplicates, and saving the results to a database. (A minimal spider sketch follows.)
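To ground the Spiders component, a minimal spider sketch; the domain and CSS selectors are placeholders:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://example.com/quotes"]  # placeholder URL

    def parse(self, response):
        # Items yielded here flow into the Item Pipeline.
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}
        # Follow-up requests go back to the Scheduler via the Engine.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)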
At this point, I believe you have a deeper understanding of common Python crawler interview questions; you might as well try these techniques out in practice.