The editor shares with you how to use the Scrapy_splash component for automatic JS rendering. I hope you will gain something from this article; let's work through it together.
1. What is scrapy_splash?
Scrapy_splash is a component for scrapy.
Scrapy-splash loads js data by relying on Splash.
Splash is a Javascript rendering service. It is a lightweight browser that exposes an HTTP API, implemented in Python and Lua and built on modules such as Twisted and QT.
The final response obtained through scrapy-splash is equivalent to the page source after the browser has finished rendering.
Splash official documentation: https://splash.readthedocs.io/en/stable/
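To make the HTTP API concrete, here is a minimal sketch (not from the original article) that asks a Splash instance to render a page; it assumes Splash is already running on 127.0.0.1:8050 and uses the requests library, with a placeholder url:

import requests

# render.html returns the page HTML after JavaScript has executed
params = {
    'url': 'https://example.com',  # placeholder page to render
    'wait': 2,                     # seconds to wait after the page loads
}
resp = requests.get('http://127.0.0.1:8050/render.html', params=params)
print(resp.text[:200])             # beginning of the rendered HTML source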
2. The role of scrapy_splash
Scrapy-splash can simulate a browser to load the js on a page and return the data after the js has run.
3. Installation of the scrapy_splash environment
3.1 use splash's docker image
Splash's Dockerfile: https://github.com/scrapinghub/splash/blob/master/Dockerfile
The Dockerfile shows that splash's dependency environment is somewhat complex, so we can use splash's docker image directly.
If you prefer not to use the docker image, refer to the splash official documentation to install the required dependencies.
3.1.1 install and start the docker service
Installation reference https://www.yisu.com/article/213611.htm
3.1.2 obtain the image of splash
With docker correctly installed, pull the splash image:
sudo docker pull scrapinghub/splash
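As an alternative to the shell command, the same pull can be done from Python with the docker SDK (pip install docker); this is an illustrative sketch, not part of the original tutorial:

import docker

client = docker.from_env()  # connect to the local docker daemon
image = client.images.pull('scrapinghub/splash', tag='latest')
print(image.tags)           # e.g. ['scrapinghub/splash:latest']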
3.1.3 verify that the installation is successful
Run splash's docker service and access port 8050 through the browser to verify that the installation is successful
Run in the foreground: sudo docker run -p 8050:8050 scrapinghub/splash
Run in the background: sudo docker run -d -p 8050:8050 scrapinghub/splash
Visit http://127.0.0.1:8050 in a browser; if the Splash welcome page appears, the installation succeeded.
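Besides the browser check, the service can also be verified from Python; Splash exposes a /_ping endpoint that reports its status. A minimal sketch, assuming the service runs on 127.0.0.1:8050:

import requests

resp = requests.get('http://127.0.0.1:8050/_ping', timeout=5)
print(resp.status_code, resp.json())  # expect 200 and a JSON body with "status": "ok"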
3.1.4 resolve image-pull timeouts: change docker's registry mirror
Take ubuntu18.04 as an example
1. Create and edit the configuration file for docker
sudo vi /etc/docker/daemon.json
2. Write the domestic docker-cn.com mirror address into the configuration, then save and exit
{"registry-mirrors": ["https://registry.docker-cn.com"]}"
3. Restart the docker service (or the machine), then pull the splash image again
4. If it is still slow, try a mobile hotspot (at the cost of your data plan, orz)
3.1.5 shut down the splash service
You need to stop the container before deleting it
sudo docker ps -a
sudo docker stop CONTAINER_ID
sudo docker rm CONTAINER_ID
3.2 install the scrapy-splash package in the python virtual environment
pip install scrapy-splash
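A quick, optional sanity check that the package is importable in the active virtual environment (illustrative, not from the original article):

import scrapy_splash

print(scrapy_splash.SplashRequest)  # should print the request class without raising ImportError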
4. Using splash in scrapy
Take baidu as an example
4.1 create the project and the crawlers
scrapy startproject test_splash
cd test_splash
scrapy genspider no_splash baidu.com
scrapy genspider with_splash baidu.com
4.2 complete the settings.py configuration file
Add the splash configuration and modify the robots protocol setting in the settings.py file
# rendering service url
SPLASH_URL = 'http://127.0.0.1:8050'
# downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# splash-aware duplicate request filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# use Splash's HTTP cache
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
4.3 without using splash
Complete spiders/no_splash.py:
import scrapy

class NoSplashSpider(scrapy.Spider):
    name = 'no_splash'
    allowed_domains = ['baidu.com']
    start_urls = ['https://www.baidu.com/s?wd=13161933309']

    def parse(self, response):
        with open('no_splash.html', 'w') as f:
            f.write(response.body.decode())

4.4 using splash
Complete spiders/with_splash.py:

import scrapy
from scrapy_splash import SplashRequest  # the request object provided by the scrapy_splash package

class WithSplashSpider(scrapy.Spider):
    name = 'with_splash'
    allowed_domains = ['baidu.com']
    start_urls = ['https://www.baidu.com/s?wd=13161933309']

    def start_requests(self):
        yield SplashRequest(self.start_urls[0],
                            callback=self.parse_splash,
                            args={'wait': 10},       # maximum wait timeout, unit: seconds
                            endpoint='render.html')  # fixed endpoint of the splash service

    def parse_splash(self, response):
        with open('with_splash.html', 'w') as f:
            f.write(response.body.decode())

4.5 run the two crawlers and observe the results
4.5.1 run two crawlers separately
scrapy crawl no_splash
scrapy crawl with_splash
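As an alternative to the two CLI commands, both spiders can also be run from a single Python script with scrapy's CrawlerProcess; a minimal sketch, assuming it is executed from the test_splash project directory:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# load the project's settings.py (including the splash configuration)
process = CrawlerProcess(get_project_settings())
process.crawl('no_splash')    # spider names as registered via genspider
process.crawl('with_splash')
process.start()               # blocks until both crawls finish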
4.5.2 observe the two html files obtained
[screenshot: no_splash.html, without splash rendering]
[screenshot: with_splash.html, after splash rendering]
4.6 conclusion
Splash is similar to selenium: it can visit the url address in the request object just as a browser does.
Based on the content of each url's response, it sends the follow-up requests in order,
renders the content of the multiple responses corresponding to those requests,
and finally returns the fully rendered response object.
5. Learn more
About splash https://www.yisu.com/article/219166.htm
About scrapy_splash (screenshot, get_cookies, etc.) https://www.e-learn.cn/content/qita/800748
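To illustrate the screenshot and get_cookies features mentioned above, here is a hedged sketch of SplashRequest with the execute endpoint and a custom Lua script; the spider name and script are illustrative, not from the original article:

import scrapy
from scrapy_splash import SplashRequest

LUA_SCRIPT = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(2))
    return {
        html = splash:html(),           -- rendered page source
        cookies = splash:get_cookies(), -- cookies collected while rendering
        png = splash:png(),             -- base64-encoded screenshot
    }
end
"""

class ExecuteDemoSpider(scrapy.Spider):  # hypothetical demo spider
    name = 'execute_demo'
    start_urls = ['https://www.baidu.com/']

    def start_requests(self):
        yield SplashRequest(self.start_urls[0],
                            callback=self.parse_result,
                            endpoint='execute',  # run a custom Lua script
                            args={'lua_source': LUA_SCRIPT})

    def parse_result(self, response):
        # for the execute endpoint, scrapy-splash exposes the returned Lua table as response.data
        self.logger.info('got %d cookies', len(response.data['cookies']))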
6. Summary
1. The role of the scrapy_splash component
Splash is similar to selenium: it can visit the url address in the request object just as a browser does.
Based on the content of each url's response, it sends the follow-up requests in order,
renders the content of the multiple responses corresponding to those requests,
and finally returns the fully rendered response object.
2. The use of the scrapy_splash component
A splash service is required as support.
The constructed request object becomes scrapy_splash.SplashRequest.
It is used in the form of downloader middleware.
Scrapy_splash requires its specific configuration.
3. The specific configuration of scrapy_splash
SPLASH_URL = 'http://127.0.0.1:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

After reading this article, I believe you have some understanding of "how to use the Scrapy_splash component of JS automatic rendering". Thank you for reading!