
"docker practical articles" python's docker- Douyin web data crawling (19)


Original article, welcome to reprint. When reprinting, please credit the source: reproduced from IT Story Association, thank you!

Original link: "Docker in Practice" series: python's docker - Douyin web data crawling (19)

Douyin crawling in practice. Why crawl this data? For example: an internet fresh-food e-commerce company wants to buy advertising on a high-traffic platform to increase the exposure of its products. Douyin carries a huge amount of traffic, so the boss wants to test whether advertising there would be profitable and effective. To decide, the company analyzes Douyin's data and user profiles to determine how well Douyin's user base matches the company's target customers; it needs each account's follower count, likes, following count, and nickname. By matching users' preferences, the company can weave its products into videos and promote them more effectively. With the same data, PR companies can also spot up-and-coming influencers and package them for marketing. Source code: https://github.com/limingios/dockerpython.git (douyin)

Introduction to the Douyin share page

https://www.douyin.com/share/user/ followed by a user ID: the user IDs come from a txt file in the source code. With an ID appended, the link opens the corresponding share page, and the basic profile information can then be crawled from that web page.
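A minimal sketch of that flow, assuming the ids sit one per line in the repository's douyin_hot_id.txt (the file name used later in this article):

```python
# Build a share-page URL for every user id in the txt file.
with open('douyin_hot_id.txt', 'r') as f:
    for line in f:
        share_id = line.strip()
        if share_id:
            print('https://www.douyin.com/share/user/' + share_id)
```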

Install the XPath Helper extension for Chrome

Get the crx file from the source code.

In Chrome's address bar, open: chrome://extensions/

Drag xpath-helper.crx directly onto the chrome://extensions/ page.

Once it is installed successfully, the shortcut ctrl+shift+x toggles XPath Helper; it is usually used together with Chrome's F12 developer tools.

Crawling the Douyin share-page data with python

Analyze the share page https://www.douyin.com/share/user/76055758243

1. Douyin has an anti-crawling mechanism: the digits in the Douyin ID are rendered as icon-font glyph characters, so each glyph string has to be substituted back to its digit.

```python
# Glyph-to-digit mapping table. The glyph characters in 'name' are
# private-use codepoints from Douyin's icon font; they were lost in this
# page's rendering, see the repository source for the actual values.
regex_list = [
    {'name': ['', '', ''], 'value': 0},
    {'name': ['', '', ''], 'value': 1},
    {'name': ['', '', ''], 'value': 2},
    {'name': ['', '', ''], 'value': 3},
    {'name': ['', '', ''], 'value': 4},
    {'name': ['', '', ''], 'value': 5},
    {'name': ['', '', ''], 'value': 6},
    {'name': ['', '', ''], 'value': 7},
    {'name': ['', '', ''], 'value': 8},
    {'name': ['', '', ''], 'value': 9},
]
```
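To make the substitution concrete, here is a small self-contained sketch. The real glyphs are private-use codepoints served by Douyin's icon font and were lost in this page's rendering, so the codepoints below ('\ue602' and friends) are hypothetical placeholders, not the actual mapping:

```python
import re

# Hypothetical glyph-to-digit table; the real codepoints come from
# Douyin's icon font and differ from these placeholders.
regex_list = [
    {'name': ['\ue602', '\ue60e', '\ue618'], 'value': 1},
    {'name': ['\ue603', '\ue60d', '\ue616'], 'value': 0},
]

raw = 'fans: \ue602\ue603'  # page text with glyph "digits"
for entry in regex_list:
    for glyph in entry['name']:
        raw = re.sub(glyph, str(entry['value']), raw)

print(raw)  # -> fans: 10
```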

2. Get the XPath for each node we need.

```
# nickname
//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()
# Douyin ID
//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/i/text()
# job (verify info)
//div[@class='personal-card']/div[@class='info2']/div[@class='verify-info']/span[@class='info']/text()
# description
//div[@class='personal-card']/div[@class='info2']/p[@class='signature']/text()
# address
//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[1]/text()
# constellation
//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[2]/text()
# following count
//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='focus block']//i[@class='icon iconfont follow-num']/text()
# followers
//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']//i[@class='icon iconfont follow-num']/text()
# likes
//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']//i[@class='icon iconfont follow-num']/text()
```

Complete code:

```python
import re
import requests
import time
from lxml import etree


def handle_decode(input_data, share_web_url, task):
    search_douyin_str = re.compile(r'Douyin ID:')
    # Glyph-to-digit table; the glyph characters were lost in this
    # article's rendering, see the repository source for the real values.
    regex_list = [
        {'name': ['', '', ''], 'value': 0},
        {'name': ['', '', ''], 'value': 1},
        {'name': ['', '', ''], 'value': 2},
        {'name': ['', '', ''], 'value': 3},
        {'name': ['', '', ''], 'value': 4},
        {'name': ['', '', ''], 'value': 5},
        {'name': ['', '', ''], 'value': 6},
        {'name': ['', '', ''], 'value': 7},
        {'name': ['', '', ''], 'value': 8},
        {'name': ['', '', ''], 'value': 9},
    ]
    # Replace every glyph with its digit before parsing the HTML.
    for i1 in regex_list:
        for i2 in i1['name']:
            if not i2:  # glyph lost in this rendering; restore from the repo
                continue
            input_data = re.sub(i2, str(i1['value']), input_data)
    share_web_html = etree.HTML(input_data)
    douyin_info = {}
    douyin_info['nick_name'] = share_web_html.xpath(
        "//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()")[0]
    douyin_id = ''.join(share_web_html.xpath(
        "//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/i/text()"))
    douyin_info['douyin_id'] = re.sub(
        search_douyin_str, '',
        share_web_html.xpath(
            "//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/text()")[0]
    ).strip() + douyin_id
    try:
        douyin_info['job'] = share_web_html.xpath(
            "//div[@class='personal-card']/div[@class='info2']/div[@class='verify-info']/span[@class='info']/text()")[0].strip()
    except:
        pass  # not every account has verify info
    douyin_info['describe'] = share_web_html.xpath(
        "//div[@class='personal-card']/div[@class='info2']/p[@class='signature']/text()")[0].replace('\n', ',')
    douyin_info['location'] = share_web_html.xpath(
        "//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[1]/text()")[0]
    douyin_info['xingzuo'] = share_web_html.xpath(
        "//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[2]/text()")[0]
    douyin_info['follow_count'] = share_web_html.xpath(
        "//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='focus block']//i[@class='icon iconfont follow-num']/text()")[0].strip()
    fans_value = ''.join(share_web_html.xpath(
        "//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']//i[@class='icon iconfont follow-num']/text()"))
    unit = share_web_html.xpath(
        "//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']/span[@class='num']/text()")
    if unit[-1].strip() == 'w':
        douyin_info['fans'] = str(int(fans_value) / 10) + 'w'
    like = ''.join(share_web_html.xpath(
        "//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']//i[@class='icon iconfont follow-num']/text()"))
    unit = share_web_html.xpath(
        "//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']/span[@class='num']/text()")
    if unit[-1].strip() == 'w':
        douyin_info['like'] = str(int(like) / 10) + 'w'
    douyin_info['from_url'] = share_web_url
    print(douyin_info)


def handle_douyin_web_share(share_id):
    share_web_url = 'https://www.douyin.com/share/user/' + share_id
    print(share_web_url)
    share_web_header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
    }
    share_web_response = requests.get(url=share_web_url, headers=share_web_header)
    handle_decode(share_web_response.text, share_web_url, share_id)


if __name__ == '__main__':
    while True:
        share_id = "76055758243"
        if share_id is None:
            print('current processing task is: %s' % share_id)
            break
        else:
            print('current processing task is: %s' % share_id)
            handle_douyin_web_share(share_id)
            time.sleep(2)
```
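Running the script prints one douyin_info dict per request; with the XPaths above it carries the keys nick_name, douyin_id, job (only when the account has verify info), describe, location, xingzuo (constellation), follow_count, fans, like, and from_url.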

![](https://upload-images.jianshu.io/upload_images/11223715-651b910cb91c1c8d.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

#### mongodb

> Create a virtual machine through vagrant and run mongodb in it via docker; see "python's docker crawler technology - app crawling with python scripts (13)".

```bash
su -
# password: vagrant
```

> https://hub.docker.com/r/bitnami/mongodb
> default port: 27017

```bash
docker pull bitnami/mongodb:latest
mkdir bitnami
cd bitnami
mkdir mongodb
docker run -d -v /path/to/mongodb-persistence:/root/bitnami -p 27017:27017 bitnami/mongodb:latest
# turn off the firewall
systemctl stop firewalld
```
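Before pointing the crawler at the database, it helps to confirm the container is reachable. A small sanity check, assuming the vagrant box IP 192.168.66.100 that the article's scripts connect to:

```python
import pymongo

# Ping the bitnami/mongodb container started above.
client = pymongo.MongoClient(host='192.168.66.100', port=27017,
                             serverSelectionTimeoutMS=3000)
try:
    client.admin.command('ping')
    print('mongodb is reachable')
except pymongo.errors.ServerSelectionTimeoutError as exc:
    print('cannot reach mongodb:', exc)
```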

Working with mongodb

Read the txt file to get the userId number.

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time: 2019-1-30
# @Author: Aries
# @Site:
# @File: handle_mongo.py
# @Software: PyCharm
import pymongo
from pymongo.collection import Collection

client = pymongo.MongoClient(host='192.168.66.100', port=27017)
db = client['douyin']


def handle_init_task():
    # store every share_id from the txt file as a task document
    task_id_collections = Collection(db, 'task_id')
    with open('douyin_hot_id.txt', 'r') as f:
        f_read = f.readlines()
    for i in f_read:
        task_info = {}
        task_info['share_id'] = i.replace('\n', '')
        # insert() is the pymongo 2/3 API used here; pymongo 4 renamed it insert_one()
        task_id_collections.insert(task_info)


def handle_get_task():
    task_id_collections = Collection(db, 'task_id')
    # return task_id_collections.find_one({})
    return task_id_collections.find_one_and_delete({})

# handle_init_task()
```

Modify the python crawler to call these helpers:

```python
import re
import requests
import time
from lxml import etree
from handle_mongo import handle_get_task
from handle_mongo import handle_insert_douyin


def handle_decode(input_data, share_web_url, task):
    search_douyin_str = re.compile(r'Douyin ID:')
    # Glyph-to-digit table; the glyph characters were lost in this
    # article's rendering, see the repository source for the real values.
    regex_list = [
        {'name': ['', '', ''], 'value': 0},
        {'name': ['', '', ''], 'value': 1},
        {'name': ['', '', ''], 'value': 2},
        {'name': ['', '', ''], 'value': 3},
        {'name': ['', '', ''], 'value': 4},
        {'name': ['', '', ''], 'value': 5},
        {'name': ['', '', ''], 'value': 6},
        {'name': ['', '', ''], 'value': 7},
        {'name': ['', '', ''], 'value': 8},
        {'name': ['', '', ''], 'value': 9},
    ]
    for i1 in regex_list:
        for i2 in i1['name']:
            if not i2:  # glyph lost in this rendering; restore from the repo
                continue
            input_data = re.sub(i2, str(i1['value']), input_data)
    share_web_html = etree.HTML(input_data)
    douyin_info = {}
    douyin_info['nick_name'] = share_web_html.xpath(
        "//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()")[0]
    douyin_id = ''.join(share_web_html.xpath(
        "//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/i/text()"))
    douyin_info['douyin_id'] = re.sub(
        search_douyin_str, '',
        share_web_html.xpath(
            "//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/text()")[0]
    ).strip() + douyin_id
    try:
        douyin_info['job'] = share_web_html.xpath(
            "//div[@class='personal-card']/div[@class='info2']/div[@class='verify-info']/span[@class='info']/text()")[0].strip()
    except:
        pass  # not every account has verify info
    douyin_info['describe'] = share_web_html.xpath(
        "//div[@class='personal-card']/div[@class='info2']/p[@class='signature']/text()")[0].replace('\n', ',')
    douyin_info['location'] = share_web_html.xpath(
        "//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[1]/text()")[0]
    douyin_info['xingzuo'] = share_web_html.xpath(
        "//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[2]/text()")[0]
    douyin_info['follow_count'] = share_web_html.xpath(
        "//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='focus block']//i[@class='icon iconfont follow-num']/text()")[0].strip()
    fans_value = ''.join(share_web_html.xpath(
        "//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']//i[@class='icon iconfont follow-num']/text()"))
    unit = share_web_html.xpath(
        "//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']/span[@class='num']/text()")
    if unit[-1].strip() == 'w':
        douyin_info['fans'] = str(int(fans_value) / 10) + 'w'
    like = ''.join(share_web_html.xpath(
        "//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']//i[@class='icon iconfont follow-num']/text()"))
    unit = share_web_html.xpath(
        "//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']/span[@class='num']/text()")
    if unit[-1].strip() == 'w':
        douyin_info['like'] = str(int(like) / 10) + 'w'
    douyin_info['from_url'] = share_web_url
    print(douyin_info)
    handle_insert_douyin(douyin_info)


def handle_douyin_web_share(task):
    share_web_url = 'https://www.douyin.com/share/user/' + task["share_id"]
    print(share_web_url)
    share_web_header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
    }
    share_web_response = requests.get(url=share_web_url, headers=share_web_header)
    handle_decode(share_web_response.text, share_web_url, task["share_id"])


if __name__ == '__main__':
    while True:
        task = handle_get_task()
        if task is None:
            break  # stop when the task queue is empty
        handle_douyin_web_share(task)
        time.sleep(2)
```

mongodb notes:

> handle_init_task stores the ids from the txt file in mongodb.
> handle_get_task finds one task and deletes it as it hands it out; the ids still exist in the txt file, so deleting them from mongodb does not matter.

The final handle_mongo.py, with handle_insert_douyin added and handle_init_task() enabled to seed the queue:

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Author: Aries
# @Site:
# @File: handle_mongo.py
# @Software: PyCharm
import pymongo
from pymongo.collection import Collection

client = pymongo.MongoClient(host='192.168.66.100', port=27017)
db = client['douyin']


def handle_init_task():
    # seed the task queue with every share_id from the txt file
    task_id_collections = Collection(db, 'task_id')
    with open('douyin_hot_id.txt', 'r') as f:
        f_read = f.readlines()
    for i in f_read:
        task_info = {}
        task_info['share_id'] = i.replace('\n', '')
        task_id_collections.insert(task_info)


def handle_insert_douyin(douyin_info):
    # store one crawled profile in the douyin_info collection
    douyin_info_collections = Collection(db, 'douyin_info')
    douyin_info_collections.insert(douyin_info)


def handle_get_task():
    task_id_collections = Collection(db, 'task_id')
    # return task_id_collections.find_one({})
    return task_id_collections.find_one_and_delete({})


handle_init_task()
```
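Once a few profiles have been crawled, a quick way to see what landed in the douyin_info collection (a usage sketch against the same host and database as above):

```python
import pymongo

client = pymongo.MongoClient(host='192.168.66.100', port=27017)
db = client['douyin']
douyin_info = db['douyin_info']

# Count stored profiles and peek at the first few.
print('stored profiles:', douyin_info.count_documents({}))
for doc in douyin_info.find().limit(3):
    print(doc.get('nick_name'), doc.get('fans'), doc.get('like'))
```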

PS: The 1,000 ids in the txt file are nowhere near enough to crawl at scale. In practice the app side and the PC side cooperate: the PC side seeds the initial data, each user's fan list is fetched via the userID, and the crawl loops over those fans continuously, which yields a large amount of data.
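That strategy could look roughly like the loop below. fetch_fan_ids is a hypothetical placeholder (the fan-list endpoint is app-side and not covered in this article); the point is only the shape of the queue: every crawled user's fans become new tasks.

```python
import time
import pymongo
from pymongo.collection import Collection

client = pymongo.MongoClient(host='192.168.66.100', port=27017)
db = client['douyin']
task_id_collections = Collection(db, 'task_id')


def fetch_fan_ids(share_id):
    """Hypothetical helper: return the fan ids of one user.
    The real fan-list API is app-side and not covered here."""
    return []


while True:
    task = task_id_collections.find_one_and_delete({})
    if task is None:
        break
    # crawl the profile as handle_douyin_web_share(task) does, then
    # feed the user's fans back into the queue as new tasks
    for fan_id in fetch_fan_ids(task['share_id']):
        task_id_collections.insert_one({'share_id': fan_id})
    time.sleep(2)
```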
