
How to Batch Collect JD.com Product Data with Python


This article mainly introduces how to batch collect JD.com product data with Python. It explains the process in great detail and has some reference value; interested friends should definitely read it!

Preparation: driver installation

Before implementing this case, we have to install a Chrome driver, because we use selenium to manipulate the driver, which in turn manipulates the browser, simulating human behavior to operate the browser automatically.

Taking Chrome as an example: open the browser and check its version, then download the ChromeDriver release that matches (or is closest to) that version. After downloading, put the extracted executable into our Python environment's directory, or alongside the code.
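If the extracted driver is not on your PATH, you can also point Selenium at it explicitly. A minimal sketch, assuming the chromedriver executable sits right next to the script (executable_path is the Selenium 3 style used throughout this article):

from selenium import webdriver

# assumption: chromedriver(.exe) was extracted next to this script;
# adjust the path if you placed it elsewhere
driver = webdriver.Chrome(executable_path='./chromedriver')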

Modules used

selenium: install with pip install selenium. If you enter just selenium, the latest version is installed by default; appending == and a version number after selenium installs that specific version.
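Note that the code in this article uses the old find_element_by_... methods, which were deprecated in Selenium 4 and later removed. If the examples fail on a current install, pinning the last 3.x release (an assumption about your setup, not a requirement stated by the article) keeps them working as written:

pip install selenium==3.141.0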

csv: a built-in module, no installation needed; used here to save the data to a CSV table (which Excel can open).

time: a built-in module, no installation needed; mainly used here for delays and waiting.

Process analysis

When we visit a website, we start by entering a URL, so that is where the code starts as well.

First import the module

from selenium import webdriver

Do not name your file or package selenium, or the import will fail. webdriver can be thought of as the browser's driver: to drive a browser you must go through webdriver, and it supports a variety of browsers.
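For example, besides Chrome, webdriver exposes classes for the other major browsers; each one needs its matching driver executable installed (geckodriver for Firefox, msedgedriver for Edge):

from selenium import webdriver

# pick whichever browser you have a driver for
driver = webdriver.Firefox()
# driver = webdriver.Edge()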

Instantiate a browser object. I use Chrome here, and I suggest you use Chrome too for convenience.

driver = webdriver.Chrome()

We use get() to visit a URL; it opens the page automatically.

driver.get('https://www.jd.com/')

Run it.

After opening the URL, take buying lipstick as an example.

First of all, we have to search for the product through the keyword we want to buy, and obtain the product information from the search results.

Then we need to send input to the search box: right-click on an empty part of the page and select Inspect.

Select the Elements panel.

Click the arrow button at the top left of the panel, then click on the search box; it will locate the search tag directly.

Right-click on the tag, select Copy, and choose Copy selector.

If you prefer XPath, copy its XPath instead.
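For illustration, the XPath equivalent of the #key selector used below would look like this (assuming JD.com keeps key as the id of its search box):

driver.find_element_by_xpath('//*[@id="key"]').send_keys('lipstick')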

Then type in what we want to search for.

driver.find_element_by_css_selector('#key').send_keys('lipstick')

When you run it again, it will automatically open a browser, go to the target URL, and type lipstick into the search box.

In the same way, find the search button and click it.

driver.find_element_by_css_selector('.button').click()

If you run it again, it will click the button and search automatically.
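As an aside, an equivalent this article does not use: you can skip the button entirely and submit the search by sending an ENTER keystroke to the input box.

from selenium.webdriver.common.keys import Keys

# pressing ENTER in the search box triggers the same search as clicking the button
driver.find_element_by_css_selector('#key').send_keys(Keys.ENTER)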

The search results page has come out. Normally, when we browse a web page we scroll it down, right? So let's make it scroll down automatically. First, import the time module.

import time

Perform the operation of scrolling the page

def drop_down():
    """Perform the page scrolling operation"""
    for x in range(1, 12, 2):  # the for loop controls the number of scrolls; x takes 1, 3, 5, 7, 9, 11
        time.sleep(1)  # the page height keeps changing while it is being scrolled, so pause each step
        j = x / 9  # fraction of the full page height to scroll to: 1/9, 3/9, ..., 11/9
        # document.documentElement.scrollTop specifies the position of the scroll bar
        # document.documentElement.scrollHeight gets the maximum height of the browser page
        js = 'document.documentElement.scrollTop = document.documentElement.scrollHeight * %f' % j
        driver.execute_script(js)  # execute our JS code

Write the loop, and then call it.

drop_down()

Let's give it another delay.

driver.implicitly_wait(10)

This is an implicit wait: it waits for the page to load, in case the network is poor and loading is slow.

An implicit wait does not necessarily wait the full ten seconds: if the page finishes loading within ten seconds, execution continues as soon as it is ready; only if it still has not loaded after ten seconds does the wait give up and execution proceed anyway.

There is another kind of wait: a fixed one, where it waits exactly as many seconds as you write, which is comparatively less flexible.

time.sleep(10)
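A third option this article does not use is an explicit wait, which blocks until a specific condition is met and fails with a timeout otherwise. A minimal sketch, reusing the #J_goodsList container selector that appears later in this article:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the product list to appear, then continue immediately
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '#J_goodsList'))
)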

After the data has loaded, we need to find where the product data lives in the page.

Price / title / reviews / cover image / store, etc.

Again, right-click and choose Inspect; in the Elements panel, click the small arrow and then click on the data you want to view.

You can see it's all inside li tags.

To get all the li tag content, do the same as before and copy the selector directly.

The copied selector path is shown in the lower left corner of the panel.

What is shown is the selector for the first item only, but we want to get all the tags, so the part after li (the :nth-child(...) suffix) in that selector can be deleted.

Having done that, you can see 60 items of product data are matched; there are 60 per page.

So let's copy the trimmed selector over and receive the results with lis.

lis = driver.find_elements_by_css_selector('#J_goodsList ul li')

Because we are getting all the matching tags, the method name has one more s in it (find_elements) than before.

Print it.

print(lis)

lis returns the element objects in a list [].
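To confirm the selector matched the whole page rather than just the first item, you can print the length instead; per the 60-items-per-page count mentioned above, a fully scrolled page should report 60:

print(len(lis))  # expect 60 on a fully loaded search page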

Go through it and take out all the elements.

for li in lis:
    title = li.find_element_by_css_selector('.p-name em').text.replace('\n', '')  # get the tag's text data
    price = li.find_element_by_css_selector('.p-price strong i').text  # price
    commit = li.find_element_by_css_selector('.p-commit strong a').text  # comment count
    shop_name = li.find_element_by_css_selector('.J_im_icon a').text  # store name
    href = li.find_element_by_css_selector('.p-img a').get_attribute('href')  # product details page
    icons = li.find_elements_by_css_selector('.p-icons i')
    icon = ','.join([i.text for i in icons])  # list comprehension; ','.join concatenates the list into one string
    dit = {
        'Product title': title,
        'Product price': price,
        'Comment count': commit,
        'Store name': shop_name,
        'Label': icon,
        'Product details page': href,
    }
    csv_writer.writerow(dit)
    print(title, price, commit, href, icon, sep=' | ')
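One caveat, as a hedged improvement the article does not include: not every li in the grid necessarily has every sub-element (a missing store link, for instance, raises NoSuchElementException and aborts the whole loop). A sketch of guarding one fragile lookup:

from selenium.common.exceptions import NoSuchElementException

try:
    shop_name = li.find_element_by_css_selector('.J_im_icon a').text  # store name
except NoSuchElementException:
    shop_name = ''  # some list items carry no store link; record an empty value instead of crashing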

Search function

key_world = input('Please enter the product data you want to obtain: ')

To save the data after obtaining it, first create the CSV file it will be written into.

import csv

f = open(f'JD.com {key_world} product data.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    'Product title',
    'Product price',
    'Comment count',
    'Store name',
    'Label',
    'Product details page',
])
csv_writer.writeheader()

And then write an automatic page turn.

for page in range(1, 11):
    print(f'Crawling the data on page {page}')
    time.sleep(1)
    drop_down()
    get_shop_info()  # collect the data
    driver.find_element_by_css_selector('.pn-next').click()  # click the next-page button

Complete code

from selenium import webdriver
import time
import csv


def drop_down():
    """Perform the page scrolling operation"""
    for x in range(1, 12, 2):  # x takes 1, 3, 5, 7, 9, 11; the page height changes as you scroll
        time.sleep(1)
        j = x / 9  # fraction of the full page height to scroll to
        # document.documentElement.scrollTop specifies the position of the scroll bar
        # document.documentElement.scrollHeight gets the maximum height of the browser page
        js = 'document.documentElement.scrollTop = document.documentElement.scrollHeight * %f' % j
        driver.execute_script(js)  # execute the JS code


key_world = input('Please enter the product data you want to obtain: ')

f = open(f'JD.com {key_world} product data.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    'Product title',
    'Product price',
    'Comment count',
    'Store name',
    'Label',
    'Product details page',
])
csv_writer.writeheader()

# instantiate a browser object
driver = webdriver.Chrome()
driver.get('https://www.jd.com/')  # visit the URL; the browser opens it

# locate the #key input tag (found in the Elements panel) via CSS syntax and type the keyword
driver.find_element_by_css_selector('#key').send_keys(key_world)
driver.find_element_by_css_selector('.button').click()  # find the search button and click it
# time.sleep(10)  # fixed wait
# driver.implicitly_wait(10)  # implicit wait


def get_shop_info():
    # step 1: get all the li tag content
    driver.implicitly_wait(10)
    lis = driver.find_elements_by_css_selector('#J_goodsList ul li')  # get multiple tags
    # lis is a list [] of element objects
    # print(len(lis))
    for li in lis:
        title = li.find_element_by_css_selector('.p-name em').text.replace('\n', '')  # tag text data
        price = li.find_element_by_css_selector('.p-price strong i').text  # price
        commit = li.find_element_by_css_selector('.p-commit strong a').text  # comment count
        shop_name = li.find_element_by_css_selector('.J_im_icon a').text  # store name
        href = li.find_element_by_css_selector('.p-img a').get_attribute('href')  # product details page
        icons = li.find_elements_by_css_selector('.p-icons i')
        icon = ','.join([i.text for i in icons])  # concatenate the label texts into one string
        dit = {
            'Product title': title,
            'Product price': price,
            'Comment count': commit,
            'Store name': shop_name,
            'Label': icon,
            'Product details page': href,
        }
        csv_writer.writerow(dit)
        print(title, price, commit, href, icon, sep=' | ')
        # print(href)


for page in range(1, 11):
    print(f'Crawling the data on page {page}')
    time.sleep(1)
    drop_down()
    get_shop_info()  # collect the data
    driver.find_element_by_css_selector('.pn-next').click()  # click the next-page button

driver.quit()  # close the browser

Effect display

The above is all the content of this article on how to batch collect JD.com product data with Python. Thank you for reading! I hope the shared content helps you; for more related knowledge, welcome to follow the industry information channel!
