
What is the data collection method of a Python Douyin crawler


This article mainly explains "what is the data collection method of a Python Douyin crawler". The content is simple, clear, and easy to learn and understand. Now please follow the editor's train of thought step by step as we study "what is the data collection method of a Python Douyin crawler" together!

A brief introduction to crawlers and anti-crawlers

A crawler is a program we use in place of manual work to read and collect information from websites in batches. Anti-crawling is the opposite: it is a website's effort to prevent automated collection of its information. The two are locked in opposition, and so far the information on most websites can still be crawled fairly easily.

The crawler's way around anti-crawling measures is to convince the server, as far as possible, that you are a person rather than a machine program. So in your program you have to disguise yourself as a browser when visiting the website, which greatly reduces the probability of being blocked. How, then, do you disguise yourself as a browser?

1. Use request headers (headers) to disguise yourself. The most commonly used header is User-Agent (literally "user agent", often abbreviated UA), which is part of the HTTP protocol and belongs to the header fields. It is a special string that tells the website you visit your browser type and version, operating system and version, browser kernel, and other information; in other words, it identifies the client currently accessing the server. If the same identity accesses a server too frequently, it will be recognized as a machine and hit by anti-crawling measures, so you need to change the User-Agent information frequently. A typical User-Agent field contains the following parts: browser identifier (operating-system identifier; encryption-level identifier; browser language) rendering-engine identifier and version information.

2. Use different User-Agent values to circumvent anti-crawling strategies.

Besides User-Agent, other commonly used request header fields include:

Accept: the data types the client supports, separated by commas and listed in order of preference; within each type, the part before the slash is the primary type and the part after it is the subtype

Accept-Encoding: the content-compression encodings returned by the web server that the browser can support

Accept-Language: the natural languages the browser accepts

Connection: sets whether the HTTP connection is persistent, usually Keep-Alive

Host: the domain name or IP address of the server and, if it is not the default port, the port number

Referer: the address of the page from which the URL of the current request was linked
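
To make the User-Agent structure described in point 1 concrete, here is one string from the list below, annotated part by part (a reading aid using a typical desktop Chrome UA; it does nothing on its own):

# One User-Agent string, split into the parts described above:
ua = ("Mozilla/5.0 "                             # browser identifier (kept for historical compatibility)
      "(Windows NT 6.1; WOW64) "                 # operating-system identifier
      "AppleWebKit/537.36 (KHTML, like Gecko) "  # rendering-engine identifier and version
      "Chrome/65.0.3325.181 Safari/537.36")      # browser name and version information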

user_agent_list = [
    "Opera/9.80 (X11; Linux i686; U; hu) Presto/2.9.168 Version/11.50",
    "Opera/9.80 (X11; Linux i686; U; ru) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (X11; Linux i686; U; es-ES) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/5.0 Opera 11.11",
    "Opera/9.80 (X11; Linux x86_64; U; bg) Presto/2.8.131 Version/11.10",
    "Opera/9.80 (Windows NT 6.0; U; en) Presto/2.8.99 Version/11.10",
    "Opera/9.80 (Windows NT 5.1; U; zh-tw) Presto/2.8.131 Version/11.10",
    "Opera/9.80 (Windows NT 6.1; Opera Tablet/15165; U; en) Presto/2.8.149 Version/11.1",
    "Opera/9.80 (X11; Linux x86_64; U; Ubuntu/10.10 (maverick); pl) Presto/2.7.62 Version/11.01",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0",
    "Opera/9.80 (X11; Linux i686; Ubuntu/14.10) Presto/2.12.388 Version/12.16",
    "Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14",
    "Mozilla/5.0 (Windows NT 6.0; rv:2.0) Gecko/20100101 Firefox/4.0 Opera 12.14",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0) Opera 12.14",
    "Opera/12.80 (Windows NT 5.1; U; en) Presto/2.10.289 Version/12.02",
    "Opera/9.80 (Windows NT 6.1; U; es-ES) Presto/2.9.181 Version/12.00",
    "Opera/9.80 (Windows NT 5.1; U; zh-sg) Presto/2.9.181 Version/12.00",
    "Opera/12.0 (Windows NT 5.2; U; en) Presto/22.9.168 Version/12.00",
    "Opera/12.0 (Windows NT 5.1) Presto/22.9.168 Version/12.00",
    "Mozilla/5.0 (Windows NT 5.1) Gecko/20100101 Firefox/14.0 Opera/12.0",
    "Opera/9.80 (Windows NT 6.1; WOW64; U; pt) Presto/2.10.229 Version/11.62",
    "Opera/9.80 (Windows NT 6.0; U; pl) Presto/2.10.229 Version/11.62",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; de) Presto/2.9.168 Version/11.52",
    "Opera/9.80 (Windows NT 5.1; U; en) Presto/2.9.168 Version/11.51",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; de) Opera 11.51",
    "Opera/9.80 (X11; Linux x86_64; U; fr) Presto/2.9.168 Version/11.50",
]
referer_list = ["https://www.test.com/", "https://www.baidu.com/"]

Pick a random index, so that each request uses a randomly chosen user agent and referer address. (Note: when collecting multiple pages in a loop, it is best to wait a few seconds between requests to reduce the load on the server.):

import random
import re
import time
import urllib.request

import lxml.html
import requests

def get_random(data):
    # Return a random valid index into the given list.
    return random.randint(0, len(data) - 1)

def crawl():
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Connection': 'keep-alive',
        'Host': 'test.com',
        'Referer': 'https://test.com/',
    }
    # Pick a random User-Agent and Referer for this request.
    random_index = get_random(user_agent_list)
    headers['User-Agent'] = user_agent_list[random_index]
    random_index_01 = get_random(referer_list)
    headers['Referer'] = referer_list[random_index_01]
    session = requests.session()
    url = "https://www.test.com/"
    html_data = session.get(url, headers=headers, timeout=180)
    html_data.raise_for_status()
    html_data.encoding = 'utf-8-sig'
    data = html_data.text
    data_doc = lxml.html.document_fromstring(data)
    # ... parse, extract, and store the data here ...
    time.sleep(random.randint(3, 5))
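
As a side note, the standard library's random.choice does the same job as the get_random index helper; a minimal equivalent sketch, assuming the user_agent_list and referer_list defined above:

import random

headers = {}
# random.choice picks one element uniformly at random, with no index bookkeeping.
headers['User-Agent'] = random.choice(user_agent_list)
headers['Referer'] = random.choice(referer_list)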

3. Use a proxy IP to circumvent anti-crawling: when the same IP sends a large number of requests to other servers, it is more likely to be identified as a crawler, and the IP may be temporarily blocked.

According to their anonymity, proxy IPs can be divided into the following four categories:

Transparent proxy (Transparent Proxy): although a transparent proxy "hides" your IP address on the surface, the target site can still find out who you are.

Anonymous proxy (Anonymous Proxy): an anonymous proxy is a little better than a transparent one: others only know that you are using a proxy, not who you are.

Distorting proxy (Distorting Proxy): as with an anonymous proxy, others will still know that you are using a proxy, but they will see a fake IP address that looks more realistic.

High anonymity proxy (Elite Proxy or High Anonymity Proxy): a high anonymity proxy makes it impossible for others to detect that you are using a proxy at all, so it is the best choice.

In practice, a high anonymity proxy undoubtedly gives the best results.
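
If you want to check which category a particular proxy falls into, one common trick is to request a page that echoes your request headers back. A minimal sketch (the proxy address is a placeholder, and httpbin.org is just one convenient echo service):

import requests

# Placeholder proxy address; substitute a live one.
proxies = {"http": "http://117.30.113.248:9999"}
r = requests.get("http://httpbin.org/headers", proxies=proxies, timeout=10)
print(r.json())
# A transparent proxy forwards your real IP (e.g. in X-Forwarded-For);
# an anonymous proxy hides your IP but reveals itself (e.g. via a Via header);
# a high anonymity proxy sends neither, so the target sees a normal request.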

Below, I use a free high anonymity proxy IP to do the collection:

# Proxy IP source: https://www.xicidaili.com/nn
import requests

proxies = {
    "http": "http://117.30.113.248:9999",
    "https": "https://120.83.120.157:9999",
}
r = requests.get("https://www.baidu.com", proxies=proxies)
r.raise_for_status()
r.encoding = 'utf-8-sig'
print(r.text)
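
Free proxies like these go stale quickly, so in practice it helps to keep a small pool and fall back to the next address when one fails. A minimal sketch (the pool entries below are placeholders, not live proxies):

import random
import requests

# Placeholder pool; free proxy addresses die quickly, so refresh this list regularly.
proxy_pool = [
    {"http": "http://117.30.113.248:9999", "https": "https://120.83.120.157:9999"},
    # ... more entries ...
]

def fetch(url):
    # Try the proxies in random order until one answers successfully.
    for proxies in random.sample(proxy_pool, len(proxy_pool)):
        try:
            r = requests.get(url, proxies=proxies, timeout=10)
            r.raise_for_status()
            return r
        except requests.RequestException:
            continue  # dead or blocked proxy; try the next one
    raise RuntimeError("no working proxy left in the pool")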

Thank you for reading. The above covers "what is the data collection method of a Python Douyin crawler". After studying this article, I believe you have a deeper understanding of the topic; the specific techniques still need to be verified in practice. The editor will keep pushing more articles on related knowledge points, welcome to follow!
