What are the high-frequency interview questions for Python crawlers?


This article introduces the high-frequency interview questions that come up around Python crawlers. The editor has consulted a range of material and sorted it into simple, practical answers. I hope it helps resolve your doubts about Python crawler interviews. Now, let's go through them!

1. Why do requests need to be sent with headers?

The reason: to simulate a browser and trick the server into returning the same content it would serve to a real browser.

The headers are passed as a dictionary:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

Usage: requests.get(url, headers=headers)
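
For reference, a minimal runnable sketch of the usage above; the URL is a placeholder and the User-Agent string is the one from the example header:

import requests

url = "https://example.com"  # placeholder target
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/54.0.2840.99 Safari/537.36"
}

response = requests.get(url, headers=headers)
print(response.status_code)
print(response.text[:200])  # first 200 characters of the returned page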

2. Talk about what you know about Selenium and PhantomJS.

Selenium is a Web automation testing tool. Following our instructions, it can make a browser load pages automatically, fetch the data we need, take screenshots of a page, or check whether certain actions have taken place on a site. Selenium does not ship with a browser of its own and provides no browser functionality itself; it has to drive a third-party browser. Sometimes, though, we need it to run embedded in our code without a visible window, and for that we can use a tool such as PhantomJS in place of a real browser. The Selenium library exposes an API called WebDriver. WebDriver behaves a bit like a browser that can load a website, but it can also be used like BeautifulSoup or other selector objects to locate page elements, interact with them (send text, click, and so on), and perform the other actions a web crawler needs.

PhantomJS is a WebKit-based "headless" browser: it loads a website into memory and executes the JavaScript on the page, but because it never displays a graphical interface it runs more efficiently and consumes fewer resources than a full browser such as Chrome or Firefox.

Combining Selenium with PhantomJS gives a very capable crawler that can handle JavaScript, cookies, headers, and anything else a real user's browser would do. Note that after the main program exits, Selenium does not guarantee that the PhantomJS process exits as well, so it is best to shut the PhantomJS process down manually; otherwise multiple PhantomJS processes may keep running and eat memory. WebDriverWait can reduce waiting time, but it currently has bugs (it throws various errors); in those cases a plain sleep works. PhantomJS crawls data slowly, so multithreading can help; and if you find at run time that some pages work while others do not, try switching from PhantomJS to Chrome.
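
A hedged illustration: PhantomJS development has stopped and recent Selenium releases no longer ship a PhantomJS driver, so the sketch below uses headless Chrome (assumed to be installed) in the same "no visible window" role; the target URL is a placeholder.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Headless mode plays the role PhantomJS used to play: no visible window.
options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")  # placeholder URL
    print(driver.title)
    print(driver.find_element(By.TAG_NAME, "h1").text)
finally:
    driver.quit()  # shut the browser process down manually, as advised above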

3. Write a regular expression for an email address?

[A-Za-z0-9\u4e00-\u9fa5]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+$
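
A quick sanity check of the pattern (a minimal sketch; the test addresses are made up):

import re

# The pattern from the answer above (Chinese characters are allowed before the @).
EMAIL_RE = re.compile(r"[A-Za-z0-9\u4e00-\u9fa5]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+$")

for candidate in ["user@example.com", "数据@example.cn", "not an email"]:
    print(candidate, bool(EMAIL_RE.match(candidate)))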

4. What anti-crawler measures have you encountered, and how did you deal with them?

Anti-crawling based on request headers (for example, checking the User-Agent).

Anti-crawling based on user behavior: for example, the same IP visiting the same page many times in a short period, or the same account performing the same operation many times in a short period.

Anti-crawling on dynamic pages: for example, the data we want is loaded through AJAX requests or generated by JavaScript.

Partial data encryption: for example, part of the data we want can be captured normally while another part is encrypted and comes back as garbled text.

Coping strategies:

For basic page crawling, you can customize the headers, attach header data to the request, and use proxies; some sites also require a simulated login before the complete data can be fetched.

For sites that limit crawl frequency, lower the crawl rate; for sites that restrict by IP, crawl through a pool of proxy IPs and rotate through them (a short sketch follows below).

For dynamic pages, selenium + phantomjs can render and crawl them, though this is relatively slow; if the underlying AJAX interface can be found, calling that interface directly is faster.

For partially encrypted data, one workaround is to take screenshots with selenium and then recognize them with the third-party pytesseract OCR library; the slowest but most direct way is to find the encryption method and reverse it.
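
A minimal sketch of two of these countermeasures, rotating User-Agents and proxy IPs with requests; the proxy addresses and user-agent strings are placeholders:

import random
import time
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]
PROXIES = [
    {"http": "http://10.0.0.1:8080", "https": "http://10.0.0.1:8080"},
    {"http": "http://10.0.0.2:8080", "https": "http://10.0.0.2:8080"},
]

def polite_get(url):
    # Pick a random User-Agent and proxy for each request,
    # and sleep a little to keep the crawl frequency down.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    proxy = random.choice(PROXIES)
    time.sleep(random.uniform(1, 3))
    return requests.get(url, headers=headers, proxies=proxy, timeout=10)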

5. What is the principle behind a distributed crawler?

Scrapy-redis makes Scrapy distributed, and the principle is actually quite simple. For ease of description, call the core server the master and the machines that run the crawlers the slaves.

We know that when crawling a site with the scrapy framework, we first give the spider some start_urls; the crawler visits the URLs in start_urls and then, following our specific logic, extracts elements from those pages or follows links into second- and third-level pages. To make this distributed, the only thing we need to work on is these start_urls.

We set up a redis database on the master (note that this database only stores URLs; it does not hold the crawled data itself, so do not confuse it with the mongodb or mysql mentioned below), and open a separate list key for each type of site to be crawled. Each slave is configured, through scrapy-redis, to fetch its URLs from the master's address. The result is that even with many slaves there is only one place URLs are handed out: the redis database on the master. And thanks to scrapy-redis's own queueing mechanism, the links the slaves receive never conflict with one another. When each slave finishes its crawl task, it writes the results back to the server (at this point the storage is no longer redis but a content database such as mongodb or mysql). The advantage of this approach is that the program is very portable: as long as the path issues are handled, moving a slave's program to another machine is basically copy and paste.
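
For illustration, a minimal scrapy-redis sketch of this setup; the Redis host name "master", the key name, and the spider details are assumptions rather than anything from the original text:

# settings.py on each slave: take requests from the shared Redis queue on the master.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True            # keep the request queue across restarts
REDIS_URL = "redis://master:6379"   # the URL queue lives on the master

# spider on each slave: start URLs are read from Redis instead of start_urls.
from scrapy_redis.spiders import RedisSpider

class ExampleSpider(RedisSpider):
    name = "example"
    redis_key = "example:start_urls"   # the master pushes URLs into this Redis list

    def parse(self, response):
        # Crawled items go on to the item pipeline (e.g. into MongoDB or MySQL).
        yield {"url": response.url, "title": response.css("title::text").get()}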

6. What is the difference between urllib and urllib2 in Python 2.x?

Similarities and differences: both perform URL request operations, but the differences are clear. urllib2 can accept an instance of the Request class, which lets you set the headers of a URL request, while urllib only accepts a plain URL; this means you cannot use urllib alone to disguise your User-Agent string (i.e. masquerade as a browser). On the other hand, urllib provides the urlencode method for building GET query strings, and urllib2 does not, which is why the two are usually used together. In short, urllib2.urlopen can take a Request object as a parameter and thereby control the headers of the HTTP request, but urllib.urlretrieve and the quote/unquote family of functions such as urllib.quote were never added to urllib2, so urllib is still needed at times.
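
A short Python 2 sketch of that division of labour; it will not run under Python 3, where both modules were merged into urllib.request and urllib.parse:

# Python 2 only
import urllib
import urllib2

params = urllib.urlencode({"q": "python crawler"})      # urlencode lives in urllib
request = urllib2.Request("http://example.com/search?" + params,
                          headers={"User-Agent": "Mozilla/5.0"})  # headers set via Request
response = urllib2.urlopen(request)                     # urllib2.urlopen accepts a Request
print response.read()[:200]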

7. What is the robots protocol?

The robots protocol (also known as the crawler protocol, crawler rules, or robot protocol), embodied in the robots.txt file, is how a website tells search engines which pages may be crawled and which may not.

The robots protocol is a widely accepted code of conduct among websites on the Internet. Its purpose is to protect website data and sensitive information and to ensure that users' personal information and privacy are not violated. It is not an enforceable order, so compliance depends on search engines (and crawlers) observing it voluntarily.
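
Although the article does not mention it, Python's standard library can check robots.txt for you; a minimal sketch (the site and user agent are placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # placeholder site
rp.read()

# True if the given user agent is allowed to fetch the given URL.
print(rp.can_fetch("MyCrawler/1.0", "https://example.com/private/page"))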

8. What is a crawler?

A crawler is an automated program that requests websites and extracts data.

9. What is the basic workflow of a crawler?

1. Initiate a request to the target site through an HTTP library, i.e. send a Request; the request can carry extra headers and other information, and we then wait for the server to respond.

2. If the server responds normally, we get back a Response whose body contains the page content.

3. Parse the content: with regular expressions, a page-parsing library, or as JSON.

4. Save the data: as text, or into a database.
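
Putting the four steps together, a minimal end-to-end sketch with requests; the URL, the regular expression, and the output file are placeholders:

import json
import re
import requests

url = "https://example.com"                  # placeholder target
headers = {"User-Agent": "Mozilla/5.0"}

# 1. send the Request
response = requests.get(url, headers=headers, timeout=10)

# 2. receive the Response body
html = response.text

# 3. parse the content (here with a simple regular expression)
titles = re.findall(r"<title>(.*?)</title>", html)

# 4. save the data (here to a local JSON file)
with open("result.json", "w", encoding="utf-8") as f:
    json.dump({"url": url, "titles": titles}, f, ensure_ascii=False)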

10. What are Request and Response?

We send a Request from the local machine to the server; the server returns a Response based on the request, and the browser then renders the result as a page.

1. The browser sends a message to the server where the URL lives; this process is called an HTTP Request.

2. After receiving the browser's message, the server does some work based on the content of that message and then sends a message back to the browser; this process is called an HTTP Response.

3. After receiving the server's Response, the browser processes the information accordingly and then displays the page.

That concludes this study of the high-frequency interview questions for Python crawlers. I hope it has cleared up your doubts; pairing the theory with hands-on practice is the best way to learn, so go and try it out!
