2025-02-23 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
Many newcomers are unclear about how to use Scrapy, Splash, and a Lua script to crawl CSDN pages that load content on scroll. This article walks through the setup in detail; readers with this need can follow along, and hopefully gain something from it.

The main point here is to demonstrate Splash. Note that crawling CSDN too frequently seems to trigger automatic 504 responses, so keep the request rate modest. On to the main text:
Install Scrapy, and install Splash, which requires Docker. Detailed installation steps are in my CSDN blog post:
https://blog.csdn.net/zhao_5352269/article/details/82850496
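For reference, the install normally boils down to a few commands. This is a sketch assuming Docker itself is already installed (the blog post linked above has the detailed walkthrough):

```shell
pip install scrapy scrapy-splash            # Scrapy plus the scrapy-splash plugin
docker pull scrapinghub/splash              # the official Splash image
docker run -d -p 8050:8050 scrapinghub/splash   # serve Splash on port 8050
```

After this, Splash's web UI is reachable on port 8050 of the Docker host.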
Open CSDN and you will notice that articles load as you scroll down the page. If you crawl with the Scrapy framework alone, you only get the content that is initially rendered; to reach the content further down, we need Splash to perform the scrolling. (Selenium would also work, of course.)
After installing Splash, start the service and open its web UI in a browser.

Click Examples, then select "Scroll page"; the sample script appears. Click "Render me!" and it runs directly, returning what you need. The default URL can be slow, so paste a CSDN address instead and run it again. The full Lua API is documented at splash.readthedocs.io.
splash:set_viewport_full() resizes the viewport to the full page (call it before splash:png or splash:jpeg) so that a screenshot captures the entire page.

Save a screenshot to confirm that the page actually scrolled inside Splash.
The next step is to use splash in the scrapy framework.
Copy the script from the Splash UI into the crawler:
script = """
function main(splash, args)
    splash:go(args.url)
    local scroll_to = splash:jsfunc("window.scrollTo")
    scroll_to(0, 1000)
    splash:set_viewport_full()
    splash:wait(10)
    return {html=splash:html()}
end
"""
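The script above scrolls only once, to a fixed offset of 1000 px. A loop-based variant (an illustrative sketch; the scroll count of 5 and the 1-second waits are assumptions to tune per site) scrolls to the bottom repeatedly so each batch of lazily loaded articles has time to render:

```python
# Sketch: scroll to the current page bottom several times, waiting between
# scrolls so lazily loaded content can render before the HTML is returned.
scroll_script = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(1.0))
    local get_height = splash:jsfunc("function() { return document.body.scrollHeight }")
    local scroll_to = splash:jsfunc("window.scrollTo")
    for i = 1, 5 do
        scroll_to(0, get_height())
        splash:wait(1.0)
    end
    splash:set_viewport_full()
    return {html = splash:html()}
end
"""
```

This string is used exactly like `script` above: pass it as `lua_source` in the SplashRequest args.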
You do not need the whole example script; this part is enough. Then pass parameters through args in SplashRequest (alternatively, you can pass them through meta with scrapy.Request):
yield SplashRequest(nav_url, endpoint='execute', args={'lua_source': script, 'url': nav_url})
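To make the shape of that call explicit, here is a small hypothetical helper (the function name and `wait` parameter are my own, for illustration) that assembles the args dict scrapy-splash forwards to Splash's /execute endpoint. `lua_source` is the script itself; every other key becomes a field on the Lua `args` table:

```python
# Hypothetical helper: build the args dict for SplashRequest(endpoint='execute').
# 'lua_source' is the Lua script; 'url' is read inside the script as args.url.
def make_splash_args(script, url, wait=10):
    return {
        "lua_source": script,  # the Lua script Splash will run
        "url": url,            # available as args.url inside the script
        "wait": wait,          # available as args.wait, if the script uses it
    }

args = make_splash_args("return 1", "https://blog.csdn.net/")
```

The yield above is equivalent to `yield SplashRequest(nav_url, endpoint='execute', args=make_splash_args(script, nav_url))`.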
The following settings are required in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,  # without this, no data comes back
    # 'Technology.middlewares.TechnologyDownloaderMiddleware': 543,
}

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

SPLASH_URL = "http://192.168.99.100:8050"  # address of the Splash service in your own Docker install
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
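The scrapy-splash setup instructions also recommend one spider middleware in addition to the downloader middlewares above, so that identical Splash arguments (such as a large Lua script) are not re-sent with every request:

```python
# Recommended by scrapy-splash: deduplicate Splash arguments across requests.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
```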
Then run the spider.