
How to Use Selenium to Implement a Dynamic Crawler in Python


Many beginners do not know how to use selenium to implement a dynamic crawler in Python, so this article walks through the problem and its solution step by step. I hope it helps you solve the problem.

1. Installation

Selenium is straightforward to install with pip. Open cmd and type:

pip install selenium

and you are done.
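
As a quick sanity check (a minimal sketch; the exact version string depends on your environment), you can confirm the installation from Python:

import selenium

print(selenium.__version__)  # any version printing here confirms the install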

2. Install chromedriver

Chromedriver is the driver for Google Chrome; I use it here because Chrome is the browser I normally work with.

Note that the version of chromedriver must correspond to the version of Chrome you have installed. You can find your Chrome version via the menu in the upper right corner of the browser: Help > About Google Chrome. The specific correspondence is as follows:

chromedriver version    supported Chrome versions
v2.40                   v66-68
v2.39                   v66-68
v2.38                   v65-67
v2.37                   v64-66
v2.36                   v63-65
v2.35                   v62-64
v2.34                   v61-63
v2.33                   v60-62
v2.32                   v59-61
v2.31                   v58-60
v2.30                   v58-60
v2.29                   v56-58
v2.28                   v55-57
v2.27                   v54-56
v2.26                   v53-55
v2.25                   v53-55
v2.24                   v52-54
v2.23                   v51-53
v2.22                   v49-52
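
If you want to confirm which chromedriver you have (a minimal sketch, assuming chromedriver is already on your Path), you can query its version from Python:

import subprocess

# chromedriver prints its version to stdout, e.g. "ChromeDriver 2.40.565383 (...)"
result = subprocess.run(["chromedriver", "--version"], capture_output=True, text=True)
print(result.stdout)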

After downloading, add the directory containing the driver to the system Path. If you skip this step, running the program raises an error saying the driver has not been added to the Path.
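
Alternatively, if you would rather not modify the Path, Selenium can be pointed at the driver directly. A sketch, where /path/to/chromedriver is a placeholder for wherever you saved the driver:

from selenium import webdriver

# Selenium 3.x accepts the driver location via executable_path;
# Selenium 4+ wraps it in selenium.webdriver.chrome.service.Service instead
browser = webdriver.Chrome(executable_path="/path/to/chromedriver")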

3. Start crawling.

The URL to crawl today is https://www.upbit.com/service_center/notice. Clicking the page-turn button, you will notice that the URL does not change; but if you watch the request addresses through F12 (the developer tools), you can find

https://www.upbit.com/service_center/notice?id=1

The main thing that changes is the trailing id: 1, 2, 3, and so on.
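
Since only the id changes, the page URLs are easy to generate up front, as in this minimal sketch (the range of ids is an arbitrary example):

base = "https://www.upbit.com/service_center/notice?id={}"
urls = [base.format(i) for i in range(1, 11)]  # ids 1..10; extend as needed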

Before starting on the selenium crawler itself, you need to define the following:

from selenium import webdriver
import time

# set options for Google Chrome
opt = webdriver.ChromeOptions()
# make it a headless browser, i.e. no window is displayed while crawling
opt.set_headless()  # in Selenium 4+, use opt.add_argument('--headless') instead
# create the Chrome browser with the options set above
browser = webdriver.Chrome(options=opt)

save = []
home = 'https://www.upbit.com/home'

# after the browser object is created, a URL can be sent to it through the get() method
browser.get(home)
time.sleep(15)

Next comes locating elements in the html. In selenium, the methods for locating elements are:

find_element_by_id(self, id_)
find_element_by_name(self, name)
find_element_by_class_name(self, name)
find_element_by_tag_name(self, name)
find_element_by_link_text(self, link_text)
find_element_by_partial_link_text(self, link_text)
find_element_by_xpath(self, xpath)
find_element_by_css_selector(self, css_selector)

The id, name, and so on can be found through the browser's developer tools. The purpose of locating an element is to get the information we want, then parse and save it. The text of an element can be read through its text attribute.
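
As a short example of locating an element and reading its text (a sketch; the id "notice" is a made-up placeholder):

# Selenium 3.x style, matching the methods listed above
title = browser.find_element_by_id("notice").text

# Selenium 4+ removed the find_element_by_* methods in favour of By locators
from selenium.webdriver.common.by import By
title = browser.find_element(By.ID, "notice").text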

The full code of the crawler is posted below for your reference.

from selenium import webdriver
import time
from tqdm import trange
from collections import OrderedDict
import pandas as pd

def stringpro(inputs):
    # normalize a scraped string: strip whitespace and remove newlines/tabs
    inputs = str(inputs)
    return inputs.strip().replace("\n", "").replace("\t", "")

opt = webdriver.ChromeOptions()
opt.set_headless()  # in Selenium 4+, use opt.add_argument('--headless') instead
browser = webdriver.Chrome(options=opt)
save = []
home = 'https://www.upbit.com/home'
browser.get(home)
time.sleep(15)

for page in trange(500):
    try:
        rows = OrderedDict()
        url = "https://www.upbit.com/" \
              "service_center/notice?id={}".format(page)
        browser.get(url)
        content = browser.find_element_by_class_name('txtB').text
        title_class = browser.find_element_by_class_name('titB')
        title = title_class.find_element_by_tag_name('strong').text
        times_str = title_class.find_element_by_tag_name('span').text
        times = times_str.split('|')[0].split(" ")[1:]
        num = times_str.split("|")[1].split(" ")[1]
        rows['title'] = title
        rows['times'] = "".join(times)
        rows['num'] = num
        rows['content'] = stringpro(content)
        save.append(rows)
        print("{}, {}".format(page, rows))
    except Exception:
        continue

df = pd.DataFrame(save)
df.to_csv("./datasets/www_upbit_com.csv", index=None)
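
One fragile point in the code above is the fixed time.sleep(15). A more robust alternative is an explicit wait; this sketch assumes, as the crawler does, that an element with class titB marks a loaded notice page:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait until the title element is present, up to 15 seconds, instead of always sleeping
WebDriverWait(browser, 15).until(
    EC.presence_of_element_located((By.CLASS_NAME, "titB"))
)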

After reading the above, have you mastered how to use selenium to implement a dynamic crawler in Python? If you want to learn more skills or find out more, you are welcome to follow the industry information channel. Thank you for reading!
