How to use Python to crawl data from any website

This article explains how to use Python to crawl data from any website. The method introduced here is simple, fast, and practical, so let's walk through it step by step.

First of all, a word of warning about the legality of web scraping. Scraping itself is generally legal, but the way you use the data you extract may not be. Make sure you don't scrape:

Copyrighted content - because it is someone's intellectual property, it is protected by law and you can't simply reuse it.

Personal data - if the information you collect can be used to identify a person, it counts as personal data, and for EU citizens it is protected by the GDPR. Unless you have a legitimate reason to store that data, it's best to skip it altogether.

In general, you should always read a site's terms and conditions before scraping it, to make sure you are not violating its policies. If you are not sure how to proceed, contact the site owner and ask for permission.

What does your scraper need?

To start building your own web scraper, you first need Python installed on your machine. Ubuntu 20.04 and many other Linux distributions come with Python 3 pre-installed.

To check if Python is installed on your device, run the following command:

python3 --version

If you installed Python, you should receive output similar to the following:

Python 3.8.2

In addition, for our web scraper we will use the Python packages BeautifulSoup (for selecting specific data) and Selenium (for rendering dynamically loaded content). To install them, simply run the following commands:

pip3 install beautifulsoup4

And

pip3 install selenium

The final step is to make sure Google Chrome and ChromeDriver are installed on your machine. These are necessary if we want to use Selenium to scrape dynamically loaded content.

Using Firefox or another browser also requires the corresponding browser driver.
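For instance, here is a minimal sketch of the same kind of setup with Firefox instead of Chrome. It assumes Firefox and geckodriver are already installed and that geckodriver is available on your PATH:

from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument('--headless')  # run Firefox without opening a window

driver = webdriver.Firefox(options=options)  # picks up geckodriver from your PATH
driver.get('https://www.imdb.com/chart/top/')
print(driver.title)  # quick check that the page loaded
driver.quit()

The rest of this article sticks with Chrome, but the scraping logic is the same whichever browser you drive.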

How to inspect the page

Now that you've installed everything, it's time to start our crawl project.

You should select the website you want to crawl according to your needs. Keep in mind that the content structure of each site is different, so when you start crawling it yourself, you need to adjust what you have learned here. Each site requires minor changes to the code.

For this article, I decided to grab information about the first ten movies from IMDb's Top 250 list: https://www.imdb.com/chart/top/.

First, we will get the title, and then we will delve further into it by extracting information from the pages of each movie. Some data will need to be rendered by JavaScript.

To begin to understand the structure of the content, right-click the first title in the list and choose Inspect Element.

By pressing CTRL+F and searching within the HTML structure, you will see that there is only one <table> tag on the page. This is useful because it tells us how to reach the data.

The HTML selector table tbody tr td.titleColumn a will give us all the titles on the page. That's because every title is an anchor inside a table cell with the class "titleColumn".

Using this CSS selector and getting the innerText of each anchor gives us the titles we need. You can simulate this with the following JavaScript line in the browser console of the window you just opened:

document.querySelectorAll("table tbody tr td.titleColumn a")[0].innerText

You should see the first movie's title printed in the console.

Now that we have this selector, we can start writing Python code and extracting the information we need.

How to use BeautifulSoup to extract statically loaded content

The movie titles in our list are static content. If you look at the page source (press CTRL+U on the page, or right-click and choose View Page Source), you will see that the titles are already there.

Static content is usually easier to grab because it does not require JavaScript rendering. To extract the top ten headings from the list, we will use BeautifulSoup to get the content and print it in our Scraper output.

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.imdb.com/chart/top/')  # Getting page HTML through requests
soup = BeautifulSoup(page.content, 'html.parser')  # Parsing content using BeautifulSoup
links = soup.select("table tbody tr td.titleColumn a")  # Selecting all of the anchors with titles
first10 = links[:10]  # Keep only the first 10 anchors
for anchor in first10:
    print(anchor.text)  # Display the innerText of each anchor

The above code uses the selector we found in the first step to extract the movie title anchors from the page. It then iterates over the first ten and displays the innerText of each.

The output should look like this:
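At the time of writing it starts something like the following, one title per line (the exact titles and order depend on IMDb's current rankings):

The Shawshank Redemption
The Godfather
The Dark Knight
...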

How to extract dynamically loaded content

As technology has advanced, websites have started to load their content dynamically. This improves page performance and user experience, and it even acts as an extra obstacle for scrapers.

However, this complicates things because the HTML retrieved from a simple request will not contain dynamic content. Fortunately, with Selenium, we can simulate a request in the browser and wait for the dynamic content to be displayed.

How to make a request using Selenium

You need to know the location of your chromedriver. The following code is the same as in the previous step, but this time we use Selenium to make the request. We will still parse the page content with BeautifulSoup as before.

from bs4 import BeautifulSoup
from selenium import webdriver

option = webdriver.ChromeOptions()
# I use the following options as my machine is a Windows Subsystem for Linux.
# I recommend using at least the headless option out of the three.
option.add_argument('--headless')
option.add_argument('--no-sandbox')
option.add_argument('--disable-dev-shm-usage')

# Replace YOUR-PATH-TO-CHROMEDRIVER with your chromedriver location
driver = webdriver.Chrome('YOUR-PATH-TO-CHROMEDRIVER', options=option)
driver.get('https://www.imdb.com/chart/top/')  # Getting page HTML through the Selenium-driven browser
soup = BeautifulSoup(driver.page_source, 'html.parser')  # Parsing content using BeautifulSoup. Notice driver.page_source instead of page.content
links = soup.select("table tbody tr td.titleColumn a")  # Selecting all of the anchors with titles
first10 = links[:10]  # Keep only the first 10 anchors
for anchor in first10:
    print(anchor.text)  # Display the innerText of each anchor

Don't forget to replace "YOUR-PATH-TO-CHROMEDRIVER" with the location where you extracted chromedriver. Also note that instead of page.content, we now pass driver.page_source when creating the BeautifulSoup object; it provides the HTML content of the page as rendered by the browser.
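A side note: the snippets in this article use the older Selenium 3 style API (passing the chromedriver path directly and calling find_elements_by_css_selector). If you have a recent Selenium 4 installation, the equivalent setup looks roughly like this sketch (the path placeholder is the same assumption as above):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

option = webdriver.ChromeOptions()
option.add_argument('--headless')

# Selenium 4 passes the driver path through a Service object
# (recent versions can also locate a matching driver automatically)
driver = webdriver.Chrome(service=Service('YOUR-PATH-TO-CHROMEDRIVER'), options=option)
driver.get('https://www.imdb.com/chart/top/')
soup = BeautifulSoup(driver.page_source, 'html.parser')

# find_elements_by_css_selector is no longer available in recent Selenium 4 releases;
# use find_elements(By.CSS_SELECTOR, ...) instead
links = driver.find_elements(By.CSS_SELECTOR, 'table tbody tr td.titleColumn a')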

How to use Selenium to extract statically loaded content

Using the above code, we can now access each movie page by calling the click method on each anchor.

first_link = driver.find_elements_by_css_selector('table tbody tr td.titleColumn a')[0]
first_link.click()

This simulates a click on the first movie's link. However, in this case I recommend that you keep using driver.get instead. The reason is that once you click() through to a different page, you can no longer use this method, because the new page does not contain the links to the other nine movies.

So after clicking the first title in the list, you would need to go back to the first page, then click the second title, and so on, as the sketch below illustrates. That wastes performance and time. Instead, we will simply use the extracted links and visit them one by one.
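For comparison, the wasteful click-and-go-back pattern just described would look roughly like this sketch; every iteration reloads the Top 250 list:

# Click-and-go-back pattern, shown only to illustrate why we avoid it
for i in range(10):
    anchors = driver.find_elements_by_css_selector('table tbody tr td.titleColumn a')
    anchors[i].click()  # open the i-th movie page
    # ...scrape the movie page here...
    driver.back()  # reload the Top 250 list before clicking the next title

Collecting the hrefs once and calling driver.get on each of them, as we do below, avoids all of those extra page loads.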

For The Shawshank Redemption, the movie page is https://www.imdb.com/title/tt0111161/. We will extract the movie's year and duration from that page, but this time we will use Selenium's own functions instead of BeautifulSoup, as an example. In practice you can use either, so pick the one you like best.

To retrieve the year and duration of the movie, repeat the inspection step we performed earlier, this time on the movie's page.

You'll notice that all the information can be found in the first element with the class ipc-inline-list (the ".ipc-inline-list" selector), and that all the elements in that list have the role attribute set to presentation (the [role='presentation'] selector).

from bs4 import BeautifulSoup
from selenium import webdriver

option = webdriver.ChromeOptions()
# I use the following options as my machine is a Windows Subsystem for Linux.
# I recommend using at least the headless option out of the three.
option.add_argument('--headless')
option.add_argument('--no-sandbox')
option.add_argument('--disable-dev-shm-usage')

# Replace YOUR-PATH-TO-CHROMEDRIVER with your chromedriver location
driver = webdriver.Chrome('YOUR-PATH-TO-CHROMEDRIVER', options=option)
driver.get('https://www.imdb.com/chart/top/')  # Getting page HTML through the Selenium-driven browser
soup = BeautifulSoup(driver.page_source, 'html.parser')  # Parsing content using BeautifulSoup

totalScrapedInfo = []  # In this list we will save all the information we scrape
links = soup.select("table tbody tr td.titleColumn a")  # Selecting all of the anchors with titles
first10 = links[:10]  # Keep only the first 10 anchors
for anchor in first10:
    driver.get('https://www.imdb.com/' + anchor['href'])  # Access the movie's page
    infolist = driver.find_elements_by_css_selector('.ipc-inline-list')[0]  # Find the first element with class 'ipc-inline-list'
    informations = infolist.find_elements_by_css_selector("[role='presentation']")  # Find all elements with role='presentation' inside it
    scrapedInfo = {
        "title": anchor.text,
        "year": informations[0].text,
        "duration": informations[2].text,
    }  # Save all the scraped information in a dictionary
    totalScrapedInfo.append(scrapedInfo)  # Append the dictionary to the totalScrapedInfo list
print(totalScrapedInfo)  # Display the list with all the information we scraped

How to use Selenium to extract dynamically loaded content

The next important step in network crawling is to extract dynamically loaded content. You can find such content on each movie page (for example, https://www.imdb.com/title/tt0111161/) in the edit list section.

If you inspect the page, you will see that this section is an element whose data-testid attribute is set to firstListCardGroup-editorial. But if you look at the page source, you won't find this attribute value anywhere. That is because the editorial lists section is loaded dynamically by IMDb.

In the following example, we will scrape the editorial lists for each movie and add them to our current scraping results.

To do this, we will import more packages to wait for our dynamic content to load.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

option = webdriver.ChromeOptions()
# I use the following options as my machine is a Windows Subsystem for Linux.
# I recommend using at least the headless option out of the three.
option.add_argument('--headless')
option.add_argument('--no-sandbox')
option.add_argument('--disable-dev-shm-usage')

# Replace YOUR-PATH-TO-CHROMEDRIVER with your chromedriver location
driver = webdriver.Chrome('YOUR-PATH-TO-CHROMEDRIVER', options=option)
driver.get('https://www.imdb.com/chart/top/')  # Getting page HTML through the Selenium-driven browser
soup = BeautifulSoup(driver.page_source, 'html.parser')  # Parsing content using BeautifulSoup

totalScrapedInfo = []  # In this list we will save all the information we scrape
links = soup.select("table tbody tr td.titleColumn a")  # Selecting all of the anchors with titles
first10 = links[:10]  # Keep only the first 10 anchors
for anchor in first10:
    driver.get('https://www.imdb.com/' + anchor['href'])  # Access the movie's page
    infolist = driver.find_elements_by_css_selector('.ipc-inline-list')[0]  # Find the first element with class 'ipc-inline-list'
    informations = infolist.find_elements_by_css_selector("[role='presentation']")  # Find all elements with role='presentation' inside it
    scrapedInfo = {
        "title": anchor.text,
        "year": informations[0].text,
        "duration": informations[2].text,
    }  # Save all the scraped information in a dictionary
    WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "[data-testid='firstListCardGroup-editorial']")))  # Wait up to 5 seconds for the element with data-testid set to 'firstListCardGroup-editorial'
    listElements = driver.find_elements_by_css_selector("[data-testid='firstListCardGroup-editorial'] .listName")  # Extract the editorial list elements
    listNames = []  # Create an empty list and append only each element's text
    for el in listElements:
        listNames.append(el.text)
    scrapedInfo['editorial-list'] = listNames  # Add the editorial list names to our scrapedInfo dictionary
    totalScrapedInfo.append(scrapedInfo)  # Append the dictionary to the totalScrapedInfo list
print(totalScrapedInfo)  # Display the list with all the information we scraped

For the previous example, you should get the following output:

How to save the captured content

Now that we have all the data we need, we can save it as a .json or .csv file for easy reading.

To do this, we will simply use the json and csv packages from Python's standard library and write our content to new files:

import csv
import json

...

file = open('movies.json', mode='w', encoding='utf-8')
file.write(json.dumps(totalScrapedInfo))

writer = csv.writer(open("movies.csv", 'w'))
for movie in totalScrapedInfo:
    writer.writerow(movie.values())
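One caveat with writer.writerow(movie.values()): the resulting CSV has no header row. If you want column names in the file, a csv.DictWriter sketch like the following handles that (the output file name is arbitrary, and it assumes every dictionary in totalScrapedInfo has the same keys):

import csv

with open('movies-with-header.csv', mode='w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=totalScrapedInfo[0].keys())
    writer.writeheader()  # writes the column names: title, year, duration, ...
    writer.writerows(totalScrapedInfo)  # one row per scraped movie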

Crawling tips and tricks

Although what we have covered so far is enough to handle JavaScript-rendered pages, there is still a lot to explore in Selenium.

In this section, I will share some tips and tricks that may come in handy.

1. Timing your requests

If you spam a server with hundreds of requests in a short period of time, a CAPTCHA is likely to appear at some point, or your IP may even get blocked. Unfortunately, there is no workaround in Python to avoid that.

Therefore, you should place some timeout intervals between each request so that the traffic looks more natural.

import time
import requests

page = requests.get('https://www.imdb.com/chart/top/')  # Getting page HTML through requests
time.sleep(30)  # Wait 30 seconds
page = requests.get('https://www.imdb.com/')  # Getting page HTML through requests
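If you are fetching many pages in a loop, a variation worth considering is randomizing the pause so the traffic pattern is less regular. A minimal sketch, with arbitrary example URLs and delay bounds:

import random
import time
import requests

urls = ['https://www.imdb.com/chart/top/', 'https://www.imdb.com/']  # pages to fetch
for url in urls:
    page = requests.get(url)  # Getting page HTML through requests
    time.sleep(random.uniform(10, 30))  # pause a random 10-30 seconds between requests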

2. Error handling

Because websites are dynamic and can change their structure at any time, error handling comes in handy if you use the same scraper frequently.

try:
    WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.CSS_SELECTOR, "your selector")))
    break
except TimeoutException:
    # If the loading took too long, print a message and try again
    print("Loading took too much time!")

The try/except syntax is useful when you are waiting for an element, extracting it, or even when you are simply making a request.
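The break in the snippet above only makes sense inside a loop. A fuller sketch of the retry pattern might look like this (the attempt count and the selector are placeholders):

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

for attempt in range(3):  # number of attempts is arbitrary
    try:
        WebDriverWait(driver, 5).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "your selector")))
        break  # element found, stop retrying
    except TimeoutException:
        print("Loading took too much time!")
        driver.refresh()  # reload the page before the next attempt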

3. Screenshot

If you need to get a screenshot of the web page you are crawling at any time, you can use:

driver.save_screenshot('screenshot-file-name.png')

This helps you debug when you are working with dynamically loaded content.
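For example, a small sketch that saves a screenshot whenever a dynamic element fails to appear, so you can see what the headless browser was actually rendering (the file name and selector here are just placeholders):

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

try:
    WebDriverWait(driver, 5).until(
        EC.visibility_of_element_located(
            (By.CSS_SELECTOR, "[data-testid='firstListCardGroup-editorial']")))
except TimeoutException:
    driver.save_screenshot('debug-editorial-section.png')  # capture what the headless browser rendered at the moment of failure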

At this point, I believe you have a deeper understanding of how to use Python to crawl data from any website. The best way to consolidate it is to try it out in practice yourself.
