
How to use the scrapy_splash component for automatic JS rendering


This article shows how to use the scrapy_splash component for automatic JS rendering. I hope you will get something out of it.

1. What is scrapy_splash?

scrapy_splash is a scrapy component. scrapy-splash loads JS data by relying on Splash.

Splash is a JavaScript rendering service: a lightweight browser that exposes an HTTP API. It is implemented in Python and Lua and built on modules such as Twisted and Qt.

The final response obtained through scrapy-splash is equivalent to the page source after a browser has finished rendering it.

Splash official documentation: https://splash.readthedocs.io/en/stable/
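Because Splash exposes a plain HTTP API, you can try it independently of scrapy. Below is a minimal sketch, assuming a splash service is already listening on localhost:8050 (see section 3) and that the `requests` package is installed; `render.html` is splash's documented HTML-rendering endpoint.

```python
# Minimal sketch: call the Splash HTTP API directly with `requests`.
# Assumes a splash service is running on localhost:8050 (see section 3 below).
import requests

resp = requests.get(
    'http://127.0.0.1:8050/render.html',                  # splash's HTML-rendering endpoint
    params={'url': 'https://www.baidu.com', 'wait': 2},   # wait 2 seconds for JS to run
)
print(resp.text[:200])  # beginning of the rendered page source
```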

2. The role of scrapy_splash

scrapy-splash can simulate a browser to load JS and return the data produced after the JS has run.

3. Installation of the scrapy_splash environment

3.1 Use the splash docker image

Splash's Dockerfile: https://github.com/scrapinghub/splash/blob/master/Dockerfile

The Dockerfile shows that splash's dependencies are fairly involved, so it is easiest to use the splash docker image directly.

If you do not want to use the docker image, refer to the splash official documentation to install the required dependencies.

3.1.1 Install and start the docker service

Installation reference: https://www.yisu.com/article/213611.htm

3.1.2 Obtain the splash image

With docker correctly installed, pull the splash image:

```bash
sudo docker pull scrapinghub/splash
```

3.1.3 Verify that the installation is successful

Run splash's docker service and access port 8050 through a browser to verify that the installation succeeded.

Run in the foreground:

```bash
sudo docker run -p 8050:8050 scrapinghub/splash
```

Run in the background:

```bash
sudo docker run -d -p 8050:8050 scrapinghub/splash
```

Visit http://127.0.0.1:8050; if the splash console page loads, the installation succeeded.
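As an alternative to opening the page in a browser, a small health-check sketch (assuming the `requests` package) can probe the same port:

```python
# Sketch of a health check against the splash service started above.
import requests

resp = requests.get('http://127.0.0.1:8050')
print('splash is up' if resp.ok else 'unexpected status: %s' % resp.status_code)
```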

3.1.4 Resolve image pull timeouts: change docker's registry mirror

Take Ubuntu 18.04 as an example:

1. Create and edit docker's configuration file:

```bash
sudo vi /etc/docker/daemon.json
```

2. Write the registry mirror configuration for the domestic docker-cn.com mirror, then save and exit:

```json
{"registry-mirrors": ["https://registry.docker-cn.com"]}
```

3. Restart the docker service (or the computer), then pull the splash image again.

4. If it is still slow, use a mobile hotspot.

3.1.5 Shut down the splash service

You need to stop a container before you can delete it:

```bash
sudo docker ps -a
sudo docker stop CONTAINER_ID
sudo docker rm CONTAINER_ID
```

3.2 Install the scrapy-splash package in the python virtual environment

```bash
pip install scrapy-splash
```
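After installing, a one-line sanity check (a sketch; run it inside the same virtual environment) confirms the package is importable:

```python
# Verify that scrapy-splash installed correctly into this environment.
import scrapy_splash

print(scrapy_splash.SplashRequest)  # should print the SplashRequest class
```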

4. Using splash in scrapy

Take baidu as an example.

4.1 Create a project and the two crawlers

```bash
scrapy startproject test_splash
cd test_splash
scrapy genspider no_splash baidu.com
scrapy genspider with_splash baidu.com
```

4.2 Complete the settings.py configuration file

Add the splash configuration and modify the robots protocol in the settings.py file:

```python
# rendering service url
SPLASH_URL = 'http://127.0.0.1:8050'

# downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# deduplication filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# use Splash's HTTP cache
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
```

4.3 Without splash
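A note on the numbers, which the original does not spell out: in scrapy, the values in DOWNLOADER_MIDDLEWARES are priorities, where lower values sit closer to the engine and higher values closer to the downloader; 723 and 725 place the splash middlewares ahead of the built-in HttpCompressionMiddleware at 810.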

Complete spiders/no_splash.py:

```python
import scrapy


class NoSplashSpider(scrapy.Spider):
    name = 'no_splash'
    allowed_domains = ['baidu.com']
    start_urls = ['https://www.baidu.com/s?wd=13161933309']

    def parse(self, response):
        with open('no_splash.html', 'w') as f:
            f.write(response.body.decode())
```

4.4 Using splash

Complete spiders/with_splash.py:

```python
import scrapy
from scrapy_splash import SplashRequest  # the request object provided by the scrapy_splash package


class WithSplashSpider(scrapy.Spider):
    name = 'with_splash'
    allowed_domains = ['baidu.com']
    start_urls = ['https://www.baidu.com/s?wd=13161933309']

    def start_requests(self):
        yield SplashRequest(self.start_urls[0],
                            callback=self.parse_splash,
                            args={'wait': 10},       # maximum timeout, unit: seconds
                            endpoint='render.html')  # the fixed rendering endpoint of the splash service

    def parse_splash(self, response):
        with open('with_splash.html', 'w') as f:
            f.write(response.body.decode())
```

4.5 Run the two crawlers and observe the results

4.5.1 Run the two crawlers separately

```bash
scrapy crawl no_splash
scrapy crawl with_splash
```

4.5.2 Observe the two html files obtained

Without splash, no_splash.html contains the page source before any JS has run; with splash, with_splash.html contains the source after rendering, including JS-generated content.
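A quick, rough way to see the difference is to compare the sizes of the two files; the splash version is typically much larger. A small sketch, run from the directory where the spiders wrote the files:

```python
# Compare the two captured pages produced by the crawls above.
for name in ('no_splash.html', 'with_splash.html'):
    with open(name) as f:
        html = f.read()
    print('%s: %d characters' % (name, len(html)))
```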

4.6 Conclusion

Splash is similar to selenium: it can access the url in the request object like a browser, send subsequent requests in order according to the content of each response, render the content of the multiple responses corresponding to those requests, and finally return the fully rendered response object.
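For finer browser-like control than the fixed `render.html` endpoint allows, splash also provides an `execute` endpoint that runs a Lua script passed via the `lua_source` argument. The spider below is not part of the original article's project; it is a sketch assuming the same service and settings as section 4:

```python
import scrapy
from scrapy_splash import SplashRequest

# Illustrative Lua script: load the page, wait for JS, return the HTML.
LUA_SCRIPT = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(2)
    return splash:html()
end
"""


class LuaSplashSpider(scrapy.Spider):
    name = 'lua_splash'  # hypothetical spider, not from the article's project
    allowed_domains = ['baidu.com']
    start_urls = ['https://www.baidu.com/s?wd=13161933309']

    def start_requests(self):
        yield SplashRequest(self.start_urls[0],
                            callback=self.parse,
                            endpoint='execute',              # splash's script-execution endpoint
                            args={'lua_source': LUA_SCRIPT})

    def parse(self, response):
        self.logger.info('rendered %d bytes', len(response.body))
```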

5. Learn more

About splash: https://www.yisu.com/article/219166.htm

About scrapy_splash (screenshots, get_cookies, etc.): https://www.e-learn.cn/content/qita/800748

6. Summary

1. The role of the scrapy_splash component

Splash is similar to selenium: it accesses the url in the request object like a browser, sends subsequent requests in order according to each response's content, renders the content of the corresponding responses, and finally returns the fully rendered response object.

2. The use of the scrapy_splash component

It needs a splash service as support. The constructed request object becomes scrapy_splash.SplashRequest. It is used in the form of downloader middleware and requires scrapy_splash-specific configuration.

3. The specific configuration of scrapy_splash

```python
SPLASH_URL = 'http://127.0.0.1:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```

After reading this article, I believe you have some understanding of how to use the scrapy_splash component for automatic JS rendering. If you want to learn more, welcome to follow the industry information channel. Thank you for reading!
