This article introduces some of Python's high-frequency interview questions. In daily work many people are unsure how to answer them, so the editor has consulted a variety of materials and sorted out simple, practical answers. I hope it helps you clear up your doubts about Python's high-frequency interview questions. Please follow along and study!
1. Briefly describe the basic workflow of Scrapy.
The Scrapy workflow can be broken into nine steps:
1. The Spider starts from the initial start_urls (or the start_requests method), which generates Requests and hands them to the Engine.
2. The Engine sends the Requests to the Scheduler.
3. The Engine gets the next Requests from the Scheduler and passes them to the Downloader for downloading.
4. On the way to the Downloader, the Requests pass through the Downloader Middlewares (via the process_request method).
5. After the Downloader fetches the page, it generates a Response and returns it to the Engine; on the way back the Response passes through the Downloader Middlewares (via the process_response method), or the process_exception method if an error occurred during the transfer.
6. The Engine sends the Response received from the Downloader to the Spider for processing, passing through the Spider Middlewares (via the process_spider_input method).
7. The Spider processes the Response and returns either Requests or Items, which are passed to the Engine through the Spider Middlewares (via the process_spider_output method).
8. The Engine receives the returned objects: Items are passed to the Item Pipeline, while Requests are passed back to the Scheduler to continue crawling.
9. Steps 3-8 repeat until there is no more data to crawl. A minimal spider illustrating this flow follows below.
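As a point of reference, here is a minimal Scrapy spider sketch that exercises this flow. The spider name, start URL and CSS selectors are illustrative assumptions, not something from the original questions.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # Step 1: the initial Requests come from start_urls
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Step 7: the Spider turns the Response into Items ...
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # ... or into follow-up Requests, which go back to the Scheduler (step 8)
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Running it with `scrapy crawl quotes -o quotes.json` would exercise the Engine, Scheduler, Downloader and Item Pipeline described above.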
2. What does enumerate mean in Python 3.5?
For an iterable / traversable object (such as a list or a string), enumerate wraps it into an indexed sequence, so that both the index and the value can be obtained at the same time.
Enumerate is mostly used to get a count in a for loop.
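A quick sketch of enumerate in a for loop; the sample list is made up for illustration.

```python
seasons = ["spring", "summer", "autumn", "winter"]

# enumerate yields (index, value) pairs; the default start index is 0
for index, value in enumerate(seasons):
    print(index, value)

# Counting can start from any number, e.g. 1
print(list(enumerate(seasons, start=1)))
```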
3. What do you know about Google's headless browser?
A headless browser is a browser without a graphical interface. Since it is still a browser, it has everything a browser should have; you simply cannot see the interface.
PhantomJS, which can be driven from Python's selenium module, is such an interface-less (headless) browser: a headless browser based on QtWebKit.
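For illustration, here is a minimal Selenium sketch of driving a headless browser. Recent Selenium releases dropped PhantomJS support, so this sketch substitutes Chrome's headless mode (my assumption, not part of the original answer); the URL is a placeholder.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")           # run without a visible window
driver = webdriver.Chrome(options=options)   # requires chromedriver on PATH

driver.get("https://example.com")
print(driver.title)                          # the page is fully rendered, JS included
driver.quit()
```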
4. What is the difference between scrapy and scrapy-redis?
Scrapy is a general-purpose crawler framework, but it does not support distributed crawling by itself. Scrapy-redis provides a set of redis-based components that make it easier to implement distributed Scrapy crawlers.
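For concreteness, a minimal settings.py sketch for scrapy-redis might look like the following; the Redis URL is a placeholder and the exact settings are an assumption on my part rather than something stated in the original answer.

```python
# Swap in the redis-backed scheduler and duplicate filter so that several
# crawler processes can share a single request queue in Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True              # keep the queue in Redis between runs
REDIS_URL = "redis://127.0.0.1:6379"
```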
Why choose redis database?
Because redis supports master-slave replication and keeps its data cached in memory, redis-based distributed crawlers can read requests and data at high frequency very efficiently.
What is master-slave synchronization?
In Redis, a user can ask one server to replicate another by executing the SLAVEOF command or setting the slaveof option. The server being replicated is called the master, while the server that replicates the master is called the slave. When a client sends the SLAVEOF command to a slave, asking it to replicate a master, the slave first performs a synchronization operation, that is, it updates its own database state to match the database state of the master.
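As a small illustration, the SLAVEOF command can be issued from Python with the redis-py client; the hosts and ports below are placeholders, so treat this as a sketch rather than a production setup.

```python
import redis

# Connect to the instance that should become the replica (slave)
replica = redis.Redis(host="127.0.0.1", port=6380)

# SLAVEOF <master-host> <master-port>: start replicating the master;
# the replica then synchronizes its database state with the master's.
replica.slaveof("127.0.0.1", 6379)

# Calling slaveof() with no arguments (SLAVEOF NO ONE) promotes it back to a master.
# replica.slaveof()
```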
5. What are the advantages and disadvantages of scrapy? Why choose the scrapy framework?
Advantages:
It uses the more readable XPath instead of regular expressions; it has a powerful statistics and logging system; it can crawl different URLs concurrently; it supports a shell mode that makes debugging convenient; middlewares can be written independently to add unified filters; and scraped data can be stored in a database through pipelines (a minimal pipeline sketch follows this answer).
Disadvantages:
It is a Python-based crawler framework, so its extensibility is relatively limited. It is built on the twisted framework, where an exception raised during operation will not kill the reactor, and the asynchronous framework will not stop other tasks after an error, so data errors are difficult to detect.
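To make the pipelines advantage concrete, here is a minimal Item Pipeline sketch; the SQLite database and the two-column table are illustrative assumptions. It would be enabled through the ITEM_PIPELINES setting in settings.py.

```python
import sqlite3


class SQLitePipeline:
    def open_spider(self, spider):
        # Open the database once when the spider starts
        self.conn = sqlite3.connect("quotes.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every Item the spider yields
        self.conn.execute(
            "INSERT INTO quotes VALUES (?, ?)",
            (item.get("text"), item.get("author")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```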
6. How do scrapy and requests compare for crawling?
Requests works in a sequential, polling fashion and blocks on the network, so it is not suitable for crawling large amounts of data.
The bottom layer of scrapy is the asynchronous framework twisted, and concurrency is its biggest advantage.
7. Describe how the scrapy framework works.
The first batch of URLs is taken from start_urls and requests are sent; the engine puts the requests into the request queue managed by the scheduler. The scheduler hands queued requests to the downloader, which fetches the corresponding responses; each response is given to the parsing method you wrote for extraction. If the extracted result is the required data, it is handed to the pipeline for processing; if the extracted result is a URL, the previous steps are repeated. The program ends when there are no more requests in the queue.
8. Is it better to use multiple processes or multiple threads to write crawlers?
For IO-intensive code (file processing, web crawlers, etc.), multithreading can effectively improve efficiency: a single thread waits on IO and wastes time, whereas with multiple threads the program can automatically switch to thread B while thread A is waiting, so no CPU resources are wasted and execution efficiency improves. In practice, when collecting data you need to consider not only network speed and server response but also your own machine's hardware when deciding how many processes or threads to start.
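A small sketch of an IO-bound crawl using a thread pool; the URLs and the pool size here are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

import requests

urls = [f"https://httpbin.org/get?page={i}" for i in range(10)]


def fetch(url):
    # While this thread waits on the network, the other threads keep working
    return requests.get(url, timeout=10).status_code


with ThreadPoolExecutor(max_workers=5) as pool:
    for url, status in zip(urls, pool.map(fetch, urls)):
        print(status, url)
```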
9. What are common anti-crawler measures and how do you deal with them?
Detection based on user behavior, e.g. the same IP visiting the same page many times in a short period: use proxy IPs and build an IP pool (see the sketch after this list).
Checking the User-Agent in the request headers: build a User-Agent pool (different operating systems and browsers simulate different users).
Dynamic loading (the data you capture differs from what the browser displays), i.e. JS rendering: simulate the underlying ajax requests, which return data in JSON form.
Or use selenium / webdriver to simulate browser loading.
Capture and analyze the network traffic to find the real data requests.
Encrypted parameter fields; session tracking (cookie); hotlink protection (Referer).
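A sketch of the proxy-IP and User-Agent pools mentioned above, using requests; every entry in these pools is a made-up placeholder rather than a working proxy.

```python
import random

import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]
PROXIES = [
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
]


def fetch(url):
    headers = {"User-Agent": random.choice(USER_AGENTS)}  # rotate the User-Agent
    proxy = random.choice(PROXIES)                        # rotate the proxy IP
    return requests.get(
        url,
        headers=headers,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
```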
10. What problems do distributed crawlers mainly solve?
When facing a large number of web pages to crawl, only a distributed architecture makes it possible to complete a round of crawling in a relatively short time.
It also keeps development relatively fast and simple.
At this point, the study of Python's high-frequency interview questions is over. I hope it has resolved your doubts. Combining theory with practice is the best way to learn, so go and try it out! If you want to continue learning more related knowledge, please keep following this website; the editor will keep working hard to bring you more practical articles.