2025-02-23 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
Many newcomers are unclear about how to use Scrapy, Splash, and a Lua script to crawl CSDN pages that load content on scroll. This article walks through the setup in detail; readers with this need can follow along, and hopefully gain something from it.

The main point here is to demonstrate Splash. Note that crawling CSDN too frequently seems to trigger automatic 504 responses, so keep the request rate modest. On to the main text:
Install Scrapy, and install Splash, which requires Docker. Detailed installation steps are in my CSDN blog post:
https://blog.csdn.net/zhao_5352269/article/details/82850496
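For reference, the install normally boils down to a few commands. This is a sketch assuming Docker itself is already installed (the blog post linked above has the detailed walkthrough):

```shell
pip install scrapy scrapy-splash            # Scrapy plus the scrapy-splash plugin
docker pull scrapinghub/splash              # the official Splash image
docker run -d -p 8050:8050 scrapinghub/splash   # serve Splash on port 8050
```

After this, Splash's web UI is reachable on port 8050 of the Docker host.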
Open CSDN and you will notice that articles load as you scroll down the page. If you crawl with the Scrapy framework alone, you only get the content that is initially rendered; to reach the content further down, we need Splash to perform the scrolling. (Selenium would also work, of course.)
After installing Splash, start the service and open its web UI in a browser.

Click Examples, then select "Scroll page"; the sample script appears. Click "Render me!" and it runs directly, returning what you need. The default URL can be slow, so paste a CSDN address instead and run it again. The full Lua API is documented at splash.readthedocs.io.
splash:set_viewport_full() resizes the viewport to the full page (call it before splash:png or splash:jpeg) so that a screenshot captures the entire page.

Save a screenshot to confirm that the page actually scrolled inside Splash.
The next step is to use splash in the scrapy framework.
Copy the script from the Splash UI into the crawler:
script = """
function main(splash, args)
    splash:go(args.url)
    local scroll_to = splash:jsfunc("window.scrollTo")
    scroll_to(0, 1000)
    splash:set_viewport_full()
    splash:wait(10)
    return {html=splash:html()}
end
"""
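The script above scrolls only once, to a fixed offset of 1000 px. A loop-based variant (an illustrative sketch; the scroll count of 5 and the 1-second waits are assumptions to tune per site) scrolls to the bottom repeatedly so each batch of lazily loaded articles has time to render:

```python
# Sketch: scroll to the current page bottom several times, waiting between
# scrolls so lazily loaded content can render before the HTML is returned.
scroll_script = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(1.0))
    local get_height = splash:jsfunc("function() { return document.body.scrollHeight }")
    local scroll_to = splash:jsfunc("window.scrollTo")
    for i = 1, 5 do
        scroll_to(0, get_height())
        splash:wait(1.0)
    end
    splash:set_viewport_full()
    return {html = splash:html()}
end
"""
```

This string is used exactly like `script` above: pass it as `lua_source` in the SplashRequest args.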
You do not need the whole example script; this part is enough. Then pass parameters through args in SplashRequest (alternatively, you can pass them through meta with scrapy.Request):
yield SplashRequest(nav_url, endpoint='execute', args={'lua_source': script, 'url': nav_url})
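To make the shape of that call explicit, here is a small hypothetical helper (the function name and `wait` parameter are my own, for illustration) that assembles the args dict scrapy-splash forwards to Splash's /execute endpoint. `lua_source` is the script itself; every other key becomes a field on the Lua `args` table:

```python
# Hypothetical helper: build the args dict for SplashRequest(endpoint='execute').
# 'lua_source' is the Lua script; 'url' is read inside the script as args.url.
def make_splash_args(script, url, wait=10):
    return {
        "lua_source": script,  # the Lua script Splash will run
        "url": url,            # available as args.url inside the script
        "wait": wait,          # available as args.wait, if the script uses it
    }

args = make_splash_args("return 1", "https://blog.csdn.net/")
```

The yield above is equivalent to `yield SplashRequest(nav_url, endpoint='execute', args=make_splash_args(script, nav_url))`.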
The following settings are required in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,  # without this, no data comes back
    # 'Technology.middlewares.TechnologyDownloaderMiddleware': 543,
}

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

SPLASH_URL = "http://192.168.99.100:8050"  # address of the Splash service in your own Docker install
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
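The scrapy-splash setup instructions also recommend one spider middleware in addition to the downloader middlewares above, so that identical Splash arguments (such as a large Lua script) are not re-sent with every request:

```python
# Recommended by scrapy-splash: deduplicate Splash arguments across requests.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
```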
Then run the spider.