
How to Learn Python "Prison-Oriented" Web Crawling

2025-01-16 Update From: SLTechnology News&Howtos


This article mainly introduces how to learn Python "prison-oriented" web crawling. In daily work, I believe many people have questions about this topic, so the editor has consulted a variety of materials and put together a simple, easy-to-follow method. I hope it helps resolve your doubts about learning Python web crawling. Now, please follow the editor and study!

To put it simply, a web crawler fetches the data you want from the network in bulk.

There are two ways to get data from the web:

Use the official API

Web scraping

An API (Application Programming Interface) is a standard way for different systems to exchange data. However, much of the time site owners do not provide any API. In that case, we can only use web scraping to extract the data.

Basically, each web page is returned from the server as HTML, which means our actual data is wrapped inside HTML elements. This makes the process of retrieving specific data fairly simple and straightforward.

This tutorial is an end-to-end guide, so that you can learn web crawling with Python as easily as possible. First, I'll introduce some basic examples to familiarize you with web scraping. Later, we will use that knowledge to extract football match data from Livescore.cz.

Start

To get started, you need to create a new Python 3 project and install Scrapy (a web crawling library for Python). I use pipenv in this tutorial, but you can also use pip with venv, or conda.

pipenv install scrapy

Now that you have Scrapy, you still need to create a new web crawling project; Scrapy provides a command-line tool that does this for us.

Let's use the Scrapy CLI to create a new project called web_scraper.

If you use pipenv like I do, use:

pipenv run scrapy startproject web_scraper

Or in your own virtual environment, use:

scrapy startproject web_scraper

This will create a basic project in the working directory with the following structure:
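A typical Scrapy layout looks roughly like this (exact files may vary slightly between Scrapy versions):

web_scraper/
    scrapy.cfg              # deploy configuration file
    web_scraper/            # the project's Python module
        __init__.py
        items.py            # item definitions
        middlewares.py      # project middlewares
        pipelines.py        # item pipelines
        settings.py         # project settings
        spiders/            # directory where our spiders will live
            __init__.py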

01. Use XPath

We will start our web crawling tutorial with a very simple example. First, we will locate the logo of the Live Code Stream website in the HTML. As it happens, it is just text, not an image, so we will simply extract that text.

Code

To get started, we need to create a new spider for this project. We can do this either by creating a new file by hand or by using the CLI.
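If you go the CLI route, Scrapy ships a spider generator; a quick example (the spider name and domain here mirror this tutorial and are otherwise assumptions):

pipenv run scrapy genspider lcs livecodestream.dev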

Because we already know the code we need, we will simply create a new Python file by hand at web_scraper/spiders/live_code_stream.py.

Here is the code in this file.
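What follows is a minimal sketch consistent with the interpretation below; the start URL and the exact XPath are assumptions and should be verified against the live page:

import scrapy


class LiveCodeStreamSpider(scrapy.Spider):
    # "lcs" is the unique name we pass to `scrapy crawl` later
    name = "lcs"
    # Assumed URL of the Live Code Stream site
    start_urls = ["https://livecodestream.dev/"]

    def parse(self, response):
        # Illustrative XPath in the style copied from Chrome DevTools;
        # /text() returns only the text node, not the full element
        logo = response.xpath("//a[@class='site-name']/text()").get()
        yield {"logo": logo}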

Code interpretation

First, we import the Scrapy library because we need its functionality to create a Python web spider. The spider will then be used to crawl the specified website and extract useful information.

We create a class and name it LiveCodeStreamSpider. It inherits from scrapy.Spider, which is why we pass it as a parameter.

Now, the important step is to define a unique name for your spider using the variable named name. Keep in mind that you cannot reuse the name of an existing spider, nor create a new crawler with a name already taken; it must be unique throughout the project.

After that, we pass the website URL via the start_urls list.

Finally, the parse() method locates the logo tag in the HTML code and extracts its text. In Scrapy, there are two ways to find an HTML element in the source code, mentioned below:

CSS and XPath

You can even use some external libraries, such as BeautifulSoup and lxml. However, for this example, we used XPath.

One way to quickly determine the XPath of any HTML element is to open it in Chrome DevTools: right-click the element's HTML code, hover the mouse over Copy in the context menu that appears, and then click the Copy XPath menu item.

By the way, I appended /text() to the element's actual XPath so that it retrieves only the text from that element, not the complete element code.
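To see the difference, assume a hypothetical element <h1>Live Code Stream</h1>:

response.xpath("//h1").get()         # '<h1>Live Code Stream</h1>' (full element)
response.xpath("//h1/text()").get()  # 'Live Code Stream' (text only)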

Note: you may not use other names for the variables, lists, or functions mentioned above (name, start_urls, parse()). These names are predefined in the Scrapy library, so you must use them exactly as they are. Otherwise, the program will not work properly.

Run the crawler

Since we are already in the web_scraper folder at the command prompt, let's execute the spider and write the results to a new file, lcs.json, using the following command. The results we get will be neatly structured in JSON format.

pipenv run scrapy crawl lcs -o lcs.json

Result

After running the above command, we will see a new file, lcs.json, in the project folder.

The following is the content of this file:

[{"logo": "Live Code Stream"}]

02. Use CSS

Most of us like sports, such as football.

Football matches are organized all over the world, and several websites provide real-time feeds of match results. However, most of these sites do not offer any official API.

This, in turn, creates an opportunity for us to use our web scraping skills and extract meaningful information by crawling those sites directly.

On its home page, Livescore.cz neatly displays the matches being held today (the date you visit the website).

We can retrieve the following information:

Name of the competition

Game time

Name of team A

Number of goals scored by team A

Name of team B

Number of goals for team B

And so on.

In our code example, we will extract the names of the competitions with matches today.

Code

Let's create a new spider in the project to retrieve the competition names. I named the file livescore_t.py.

Here is the code you need to enter in livescore_t.py:
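The following is a sketch consistent with the interpretation below; the CSS class name is an assumption and must be checked against the real page in your browser's dev tools:

import scrapy


class LiveScoreTSpider(scrapy.Spider):
    # Unique spider name used with `scrapy crawl LiveScoreT`
    name = "LiveScoreT"
    start_urls = ["https://www.livescore.cz/"]

    def parse(self, response):
        # The class name below is an assumption; inspect the live
        # page for the real selector of tournament headers
        for title in response.css(".event__title--name::text").getall():
            yield {"tournament": title}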

Code interpretation

Import Scrapy as usual

Create a class that inherits from scrapy.Spider

Give our spider a unique name, LiveScoreT

Provide URL for livescore.cz

Finally, the parse() function iterates over all the matched elements containing a competition name and yields them one by one. In the end, we receive the names of all the competitions with matches today. One thing to note is that this time I used CSS selectors instead of XPath.

Running

It's time to see how our spider performs. Run the following command to send the spider to the home page of the Livescore.cz website; the crawl results will then be written to a new file called ls_t.json.

pipenv run scrapy crawl LiveScoreT -o ls_t.json

Result

This is what our web crawler extracted from Livescore.cz on November 18, 2020. Remember, the output may change from day to day.

03. A more advanced example

In this section, we will not only retrieve the tournament titles; we will move on to the next stage and get the full details of the tournaments and their matches.

Create a new file in web_scraper/web_scraper/spiders/ and name it livescore.py.
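A reconstruction of the parse() logic described below; every selector here is an assumption about the page structure and must be verified in the browser:

import scrapy


class LiveScoreSpider(scrapy.Spider):
    name = "LiveScore"
    start_urls = ["https://www.livescore.cz/"]

    def parse(self, response):
        # Walk every row on the page; a row is either a tournament
        # header or a single match (all class names are assumptions)
        for row in response.css(".event > div"):
            if row.css(".event__title--name"):
                # Tournament header: keep only its name
                yield {"tournament": row.css(".event__title--name::text").get()}
            else:
                # Match row: time, status, and both teams with the score
                yield {
                    "time": row.css(".event__time::text").get(),
                    "status": row.css(".event__stage::text").get(),
                    "home_team": row.css(".event__participant--home::text").get(),
                    "away_team": row.css(".event__participant--away::text").get(),
                    "score": row.css(".event__scores::text").get(),
                }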

Code interpretation

The code structure of this file is the same as in the previous example. Here, we just update the parse() method with new functionality.

Basically, we extract all the relevant HTML elements from the page. Then, looping over them, we determine whether each one is a tournament header or a match. For a tournament, we extract its name. For a match, we extract its time, its status, and the names and scores of the two teams.

Running

Type the following command into the console and execute it:

pipenv run scrapy crawl LiveScore -o ls.json

Result

Here are some samples that have been retrieved:

Now that we have this data, we can do whatever we want with it, such as using it to train our own neural network to predict future matches.

At this point, the study of how to learn Python "prison-oriented" web crawling is over. I hope it has resolved your doubts. Pairing theory with practice can help you learn better, so go and try it! If you want to continue learning more related knowledge, please keep following the site; the editor will keep working hard to bring you more practical articles!
