In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-01-16 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >
Share
Shulou(Shulou.com)06/02 Report--
This article mainly explains the "installation of chrome headless mode crawling web page source code method under the Linux environment", the content of the article is simple and clear, easy to learn and understand, the following please follow the editor's train of thought slowly in depth, together to study and learn "Linux environment installation chrome headless mode crawling web page source code method"!
Origin: the company's business needs to grab the source code on the home page of the target website, conduct business analysis, and just want to use Jsoup to grab it directly in the mind of the meeting. It is very simple. After arriving at the meeting, the actual situation is very complicated.
Some websites are generated using js, some sites also have anti-crawl function to determine whether you use a browser to access, if not a browser, then directly give a warning js code, (due to industry reasons, do not release the target site and captured js code)
Therefore, we can only rely on the real browser to really visit the target website and get the page source code. This method draws lessons from the headless mode of selenium and linux version of chrome in the testing world, and starts as follows:
0. References from the brief book: https://www.jianshu.com/p/b2609ed57f07 is here to thank the author for providing such a good tutorial and ideas.
1. Execute the command before entering linux:
Yum install libX11 libXcursor libXdamage libXext libXcomposite libXi libXrandr gtk3 libappindicator-gtk3 xdg-utils libXScrnSaver liberation-fonts
two。 Download chrome
Wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
3. Install chrome
Rpm-ivh google-chrome-stable_current_x86_64.rpm
At this time, an error may be reported in some circumstances, similar to
Warning: google-chrome-stable_current_x86_64.rpm: Header V4 DSA/SHA1 Signature, key ID 7fac5991: NOKEYerror: Failed dependencies: / usr/bin/lsb_release is needed by google-chrome-stable-75.0.3770.100-1.x86_64
The solution I found is to install chrome so that it can be successfully installed.
Yum localinstall google-chrome-stable_current_x86_64.rpm
4. To check what version of chrome is installed, you need to download the corresponding chromedriver
[root@a80f552643f3] # google-chrome-- versionGoogle Chrome 75.0.3770.100
See the version number, so you can find the corresponding chromedriver to look here: http://chromedriver.chromium.org/downloads
Download the corresponding chromedriver of your own version, and then upload it to your own linux. The location is random (but this is reflected in the code, which will be mentioned below).
Chromedriver should be a zip, which needs to be unlocked. There is a file with the name of chromedriver. Set the permission (I have given the permission of 777 here).
Chmod 777 chromedriver
5. Go to the code editor (idea, springboot, other languages to find the corresponding code settings)
Set up pom.xml to add content to the tag.
Org.seleniumhq.selenium selenium-java 3.141.59 org.seleniumhq.selenium selenium-remote-driver 3.141.59 org.seleniumhq.selenium selenium-api 3.141.59 org.seleniumhq.selenium Selenium-chrome-driver 3.141.59 org.seleniumhq.selenium selenium-support 3.141.59 com.google.guava guava 23.0 org.jsoup jsoup 1.12.1
Write a controller, and then provide the following method, pay attention to the comments, comments are very important
@ GetMapping ("info") public String info (@ RequestParam String url) {/ / here is the address of setting chromedriver String driverPath = "/ root/chromedriver"; / / set this address to the system variable parameter System.setProperty ("webdriver.chrome.driver", driverPath) / / if there is no url parameter, by default, take a page's source code if (Objects.isNull (url)) {url = "https://www.jianshu.com/p/b2609ed57f07";} / / start setting chrome ChromeOptions chromeOptions=new ChromeOptions () / / the headless mode of chrome must be set, otherwise the linux command line mode starts error chromeOptions.setHeadless (Boolean.TRUE); / / No sandbox must be set, which is also to prevent DevToolsActivePort file doesn't exist chromeOptions.addArguments ("--no-sandbox"); / / prevent DevToolsActivePort file doesn't exist chromeOptions.addArguments ("--disable-dev-shm-usage") / / Speed up chromeOptions.addArguments ("blink-settings=imagesEnabled=false") without loading pictures; / / set ua chromeOptions.addArguments ("--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10: 14) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"); ChromeDriver chromeDriver = new ChromeDriver (chromeOptions); chromeDriver.get (url) / / get the page source code String pageSource = chromeDriver.getPageSource (); return pageSource;}
Package and upload to the Linux server, which can be used as an interface for external calls to return the page source code of the destination address.
Thank you for your reading, the above is the "Linux environment to install chrome headless mode crawling web page source code method" content, after the study of this article, I believe that you install chrome headless mode crawling web page source code method under the Linux environment has a more profound understanding of this problem, the specific use of the need for you to practice and verify. Here is, the editor will push for you more related knowledge points of the article, welcome to follow!
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.