How to crawl the latest movies from Movie Paradise with Python

2025-01-16 Update From: SLTechnology News&Howtos

This article explains in detail how to use Python to crawl the latest movies from Movie Paradise. It should be a useful reference for anyone interested in the topic.

1 Crawl target

The site crawled here is Movie Paradise, at www.ydtt8.net. The goal is to collect the information for every movie on the site, including the movie name, director, cast, and download address. The specific information to crawl is shown in the following figure:

2 Designing the crawler

2.1 Determining the crawl entry

Movie Paradise hosts thousands of movies across a bewildering range of genres. To make sure no movie is crawled twice, we need to settle on a crawl order, and at first glance there is no obvious place to start. Clicking the "Latest Movies" option on the home page, however, jumps to a new page that opens up a way forward.

As the picture shows, Movie Paradise has five movie columns: latest films, Japanese and Korean films, European and American films, domestic films, and comprehensive films. Each column spans a number of pages, and each page lists 25 movies. The program can therefore use five entry URLs, one for the first page of each column.

2.2 Crawling approach

Once the crawl entry is known, the rest of the job is much easier. Testing showed that, apart from the page URLs, the XPath expressions for extracting information are the same across columns. I therefore model the five columns with a single class and crawl each column through that class.

I use the latest movies column as an example to illustrate the crawling approach.

1) Request the first page of the column to get the total number of pages, and derive the URL of each page

2) Store the resulting page URLs in a queue named floorQueue

3) Take the page URLs from floorQueue one by one and request them using multiple threads

4) Save the movie detail page URLs found on each page to a queue named middleQueue

5) Take the movie detail page URLs from middleQueue one by one and request them using multiple threads

6) Parse each response and extract the required movie information using XPath

7) Save the extracted movie information to a queue named contentQueue

8) Take the movie information from contentQueue and save it to the database
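The eight steps can be sketched roughly as follows. This is only an illustration of the queue-and-threads flow, not the article's actual code: the URL pattern, the `fetch_listing` and `fetch_movie` stand-ins (which return fake data instead of making HTTP requests), and the `run_stage` helper are all assumptions made for this sketch. Only the three queue names come from the text.

```python
import queue
import threading

# Hypothetical URL pattern; the site's real page URLs are not given in the text.
BASE = "https://example.invalid/latest"

def build_page_urls(total_pages):
    """Step 1: derive one listing-page URL per page from the total page count."""
    return [f"{BASE}/list_{n}.html" for n in range(1, total_pages + 1)]

def fetch_listing(page_url):
    """Stand-in for requesting a listing page; returns fake detail-page URLs."""
    return [f"{page_url}#movie{i}" for i in range(3)]

def fetch_movie(movie_url):
    """Stand-in for requesting and parsing a detail page (steps 5 and 6)."""
    return [{"url": movie_url}]

def run_stage(source, sink, work, workers=4):
    """Drain `source` with several threads, putting every result into `sink`."""
    def loop():
        while True:
            try:
                item = source.get_nowait()
            except queue.Empty:
                return  # source was fully filled before the stage started
            for result in work(item):
                sink.put(result)
    threads = [threading.Thread(target=loop) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def crawl(total_pages):
    floorQueue, middleQueue, contentQueue = queue.Queue(), queue.Queue(), queue.Queue()
    for url in build_page_urls(total_pages):         # steps 1-2
        floorQueue.put(url)
    run_stage(floorQueue, middleQueue, fetch_listing)  # steps 3-4
    run_stage(middleQueue, contentQueue, fetch_movie)  # steps 5-7
    movies = []
    while not contentQueue.empty():                  # step 8 would write these to the database
        movies.append(contentQueue.get())
    return movies
```

Calling `crawl(2)` with these stand-ins yields six records (two pages with three fake movies each). In a real crawler the two stand-in functions would issue HTTP requests and parse the responses.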

2.3 Designing the crawler architecture

Based on this approach, I designed the crawler architecture, shown in the following figure:

2.4 Code implementation

This section describes the code of several important classes.

Main class

The main class has two tasks: first, instantiate a dytt8Moive object and start crawling; second, once the crawl is finished, insert the data into the database.

The logic code for handling the crawler is as follows:

The code that creates the database and the table and then inserts the movie information into the database is as follows:
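As a rough illustration of this storage step, the sketch below assumes SQLite through Python's standard `sqlite3` module. The text does not say which database is used, and the table name, column names, and the `save_movies` helper are hypothetical.

```python
import sqlite3

def save_movies(conn, movies):
    """Create the table if needed, then insert one row per movie dictionary."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS movies ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "name TEXT, director TEXT, actors TEXT, download_url TEXT)"
    )
    # dict.get returns None for a missing field, which SQLite stores as NULL.
    rows = [
        (m.get("name"), m.get("director"), m.get("actors"), m.get("download_url"))
        for m in movies
    ]
    conn.executemany(
        "INSERT INTO movies (name, director, actors, download_url) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)

# Usage with an in-memory database and made-up records.
conn = sqlite3.connect(":memory:")
inserted = save_movies(conn, [
    {"name": "Movie A", "director": "Director A", "download_url": "ftp://example.invalid/a"},
    {"name": "Movie B"},  # fields missing from the page are stored as NULL
])
```

Parameterized `?` placeholders are used so that movie titles containing quotes do not break the SQL statement.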

TaskQueue class

TaskQueue is the class that manages the floorQueue, middleQueue, and contentQueue queues. A queue is chosen as the data structure because the crawler uses multithreading, and a queue ensures thread safety.
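A minimal sketch of what such a class could look like, assuming Python's standard `queue.Queue` (which handles locking internally). The accessor method names and the demo function are assumptions; only the class name and the three queue names come from the text.

```python
import queue
import threading

class TaskQueue:
    """Holds the three queues shared by all crawler threads."""
    floorQueue = queue.Queue()    # listing-page URLs
    middleQueue = queue.Queue()   # movie detail-page URLs
    contentQueue = queue.Queue()  # extracted movie dictionaries

    @classmethod
    def getFloorQueue(cls):
        return cls.floorQueue

    @classmethod
    def getMiddleQueue(cls):
        return cls.middleQueue

    @classmethod
    def getContentQueue(cls):
        return cls.contentQueue

def demo_concurrent_puts(workers=4, per_worker=100):
    """Several threads put into the same queue without any explicit lock."""
    q = TaskQueue.getMiddleQueue()
    def produce(worker_id):
        for i in range(per_worker):
            q.put(f"movie-{worker_id}-{i}")
    threads = [threading.Thread(target=produce, args=(w,)) for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return q.qsize()
```

Because `queue.Queue` synchronizes `put` and `get` itself, no item is lost even when many threads write at once.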

Dytt8Moive class

The dytt8Moive class is the backbone of the program. The original goal was to crawl all five movie columns, but so far only the latest movies column has been implemented. To crawl every column, dytt8Moive needs only minor modifications.

The getMoiveInforms method parses the movie information nodes and packs them into a dictionary. In the code you will see more than one XPath expression for each item. Because the layout of Movie Paradise's detail pages is inconsistent, a single expression each for the content, for the poster and screenshots, and for the download address is far from enough.
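One way to handle such inconsistent layouts is to try several path expressions in order and keep the first one that matches. The sketch below illustrates that idea using the limited XPath subset in the standard library's `xml.etree.ElementTree`; a real crawler would more likely use a full XPath engine. The two page snippets, the `Zoom` id, and the expressions are hypothetical and not taken from the actual site.

```python
import xml.etree.ElementTree as ET

def first_match(root, expressions):
    """Return the elements found by the first expression that matches anything."""
    for expr in expressions:
        found = root.findall(expr)
        if found:
            return found
    return []

# Two hypothetical layouts for the same information (the poster image).
PAGE_A = "<html><body><div id='Zoom'><p><img src='poster_a.jpg'/></p></div></body></html>"
PAGE_B = "<html><body><div id='Zoom'><span><img src='poster_b.jpg'/></span></div></body></html>"

# Candidate expressions, tried from most to least common layout.
POSTER_EXPRESSIONS = [
    ".//div[@id='Zoom']/p/img",
    ".//div[@id='Zoom']/span/img",
]

def poster_srcs(page):
    """Parse a page and return the src attribute of every poster image found."""
    root = ET.fromstring(page)
    return [img.get("src") for img in first_match(root, POSTER_EXPRESSIONS)]
```

With these snippets, `poster_srcs(PAGE_A)` returns `['poster_a.jpg']` and `poster_srcs(PAGE_B)` returns `['poster_b.jpg']`, even though the image sits under different parent tags.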

The choice of a dictionary as the data structure for movie information was also made after running into a pitfall, which is another quirk of the site. Some movie detail pages lack certain content nodes, such as type and Douban score, so the fields cannot be saved positionally in a list.
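The difference can be seen in a small example. The field names and sample values below are made up for illustration, and `to_record` is a hypothetical helper rather than part of the article's code.

```python
# Fields we would like to store for every movie (hypothetical names).
FIELDS = ["name", "type", "douban_score", "director"]

def to_record(pairs):
    """Build a dictionary from (field, value) pairs; absent fields become None."""
    info = dict(pairs)
    return {field: info.get(field) for field in FIELDS}

# A page that has every field, and one that lacks type and Douban score.
full = [("name", "Movie A"), ("type", "Drama"),
        ("douban_score", "8.1"), ("director", "Director A")]
sparse = [("name", "Movie B"), ("director", "Director B")]

# Saving only the values in page order loses track of which field is which:
# position 1 holds the type for the full page but the director for the sparse one.
full_values = [value for _, value in full]
sparse_values = [value for _, value in sparse]

# Keyed by field name, each value stays attached to the right field.
record = to_record(sparse)
```

Here `sparse_values[1]` is `"Director B"` although index 1 was meant to be the type, while `record["type"]` is `None` and `record["director"]` is `"Director B"`.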

3 Crawl results

Here is the first part of the more than 4000 records I crawled from the latest movies column.
