
How to Crawl 2,000 Condom Listings on Taobao with Python

2025-04-04 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article shows you how to use Python to crawl 2,000 condom listings from Taobao. The content is concise and easy to follow, and through the detailed walkthrough below, I hope you will take something away from it.

I. Taobao login review

We have already covered how to log in to Taobao with the requests library, and received a lot of feedback and questions from readers. Brother Pig is very pleased, and apologizes to those whose questions were not answered in time!

By the way, the login code itself works fine. If you log in but fail to obtain the st code, replace all of the request parameters in the _verify_password method.

In the Taobao login 2.0 improvement, we added a cookie serialization feature. The purpose is to make crawling Taobao data easier, because logging in frequently from the same IP may trigger Taobao's anti-scraping mechanism!

As for the success rate of the Taobao login: in practice Brother Pig can basically always log in. If it fails for you, change the login parameters as described above!
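The cookie serialization mentioned above can be sketched as follows. This is a minimal illustration, not the original code: the file name and helper names are made up here, and pickling the requests cookie jar is one common way to persist a login.

```python
import pickle
from pathlib import Path

import requests

COOKIE_FILE = Path("taobao_cookies.pkl")  # hypothetical file name


def save_cookies(session: requests.Session, path: Path = COOKIE_FILE) -> None:
    # Serialize the logged-in session's cookie jar so later runs can
    # reuse it instead of logging in again (and tripping anti-scraping).
    with path.open("wb") as f:
        pickle.dump(session.cookies, f)


def load_cookies(session: requests.Session, path: Path = COOKIE_FILE) -> bool:
    # Restore a previously saved cookie jar; returns True if one was found.
    if not path.exists():
        return False
    with path.open("rb") as f:
        session.cookies.update(pickle.load(f))
    return True
```

On a later run you would call load_cookies first and only fall back to the full password login when it returns False.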

II. Crawling product information on Taobao

The main purpose of this article is to explain how to crawl the data; the analysis of that data will appear in the next article. The reason for splitting them is that crawling Taobao raises so many problems, and Brother Pig wants to explain the crawling in detail, so considering both length and how much readers can absorb at once, it is split into two articles. The goal stays the same: even a beginner should be able to follow it!

This crawl calls Taobao's PC-side search interface, extracts the returned data, and saves it as an Excel file.

It sounds like a simple job, but it hides quite a few problems. Let's work through it step by step.

III. Crawling a single page of data

When starting a crawler project, we should break it into small steps and proceed gradually. Usually the first step is to try crawling a single page.

1. Find the URL that loads the data

Open Taobao in the browser, log in, open Chrome's developer tools, click Network, check Preserve log, and then type the name of the product you want to search for into the search box.

This is the request for the first page. Looking at the response, we find that the product data is embedded in the web page rather than returned directly as pure JSON!

2. Is there an interface that returns pure JSON?

Brother Pig then wondered whether there was an interface that returned pure JSON, so he clicked through to the next page (that is, page 2).

The second page's response turned out to be pure JSON, so Brother Pig compared the two request URLs to find the parameter that makes the server return only JSON.

By comparison, we find that the search URL returns JSON directly when it carries the ajax=true parameter, so we can simulate a request for the JSON data directly!
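That comparison can be captured in a tiny URL builder. This is a sketch: the endpoint path and the offset step of 44 items per page are observations about Taobao's PC search at the time of writing, so treat them as assumptions.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://s.taobao.com/search"  # PC-side search endpoint


def build_search_url(keyword: str, page: int, as_json: bool = True) -> str:
    # s is the result offset; the PC search shows 44 items per page (observed).
    params = {"q": keyword, "s": (page - 1) * 44}
    if as_json:
        # ajax=true is the parameter found by diffing the page-1 and
        # page-2 request URLs: with it, the server answers with pure JSON.
        params["ajax"] = "true"
    return f"{SEARCH_URL}?{urlencode(params)}"
```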

So Brother Pig reused the second page's request parameters to request JSON directly, but requesting the first page that way produced an error:

Instead of JSON, it returned a bare link. What on earth is this link? Click it and see.

Clang, the slider CAPTCHA appears. Some readers will ask: can you handle Taobao's slider with requests? Brother Pig consulted several crawler veterans. The slider works by collecting response time, drag speed, timing, position, trajectory, retry count, and so on, and then deciding whether the slide was done by a human, and the algorithm changes frequently. So Brother Pig chose to give up on this route!

3. Use the page-style request interface

So we can only use a request like the first page's (a URL without the ajax=true parameter, which returns the whole page) and then extract the data from it.

In this way, we can crawl Taobao's search pages.

IV. Extracting product attributes

Once we have the page, all that remains is extracting the data. First we pull the JSON data out of the page, then we parse that JSON to get the attributes we want.

1. Extracting the product JSON from the page

Now that we have chosen to request the entire page, we need to know where the data is embedded in the page and how to extract it.

After searching and comparing, Brother Pig found that the js variable g_page_config in the returned page holds the product information we want, and it is already in JSON format!

Then we write a rule to extract the data:

goods_match = re.search(r'g_page_config = (.*?)}};', response.text)

2. Get attributes such as the product price

To pull attributes out of the JSON data, you need to understand its structure. You can paste the data into a JSON plugin or parse it with an online tool.

Once we understand the JSON structure, we can write a method to extract the attributes we want.
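Putting the two steps together, here is a sketch: pull g_page_config out with the regex shown earlier, then walk the parsed JSON. The field names (raw_title, view_price, and so on) are as observed in Taobao's PC search payload and may change, so treat them as assumptions.

```python
import json
import re


def extract_page_config(html: str) -> dict:
    # The non-greedy group stops before the closing '}};', so the two
    # braces have to be added back before parsing.
    m = re.search(r"g_page_config = (.*?)\}\};", html, re.S)
    if m is None:
        raise ValueError("g_page_config not found; the page layout may have changed")
    return json.loads(m.group(1) + "}}")


def parse_auctions(page_config: dict) -> list:
    # Keep only the attributes we care about for each listed item.
    auctions = page_config["mods"]["itemlist"]["data"]["auctions"]
    return [
        {
            "title": item.get("raw_title"),
            "price": float(item.get("view_price", 0)),
            "sales": item.get("view_sales"),
            "shop": item.get("nick"),
            "location": item.get("item_loc"),
        }
        for item in auctions
    ]
```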

V. Saving as Excel

There are many libraries for working with Excel. Someone online has compared and reviewed the Excel libraries; if you are interested, take a look: https://dwz.cn/M6D8AQnq.

Brother Pig chose pandas for the Excel work, because pandas is convenient to use and is one of the most common data-analysis libraries anyway.

1. Install the libraries

pandas actually depends on some other libraries for Excel work, so we need to install several:

pip install xlrd
pip install openpyxl
pip install numpy
pip install pandas

2. Save to Excel

The slightly tricky part is that pandas has no append mode for Excel, so you have to read the existing data first, append the new rows to it, and then write the whole thing back out.
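A sketch of that read-then-append workaround (requires openpyxl; modern pandas removed DataFrame.append, so pd.concat is used here for the same effect; the file name is illustrative):

```python
from pathlib import Path

import pandas as pd


def append_to_excel(rows: list, path: str = "taobao.xlsx") -> pd.DataFrame:
    # pandas cannot append to an existing sheet in place, so: read the
    # old data if the file exists, concatenate the new rows, rewrite all.
    new = pd.DataFrame(rows)
    p = Path(path)
    if p.exists():
        old = pd.read_excel(p)
        new = pd.concat([old, new], ignore_index=True)
    new.to_excel(p, index=False)
    return new
```

Each batch of parsed items can be passed straight in; the file simply grows run by run.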

3. View the result

VI. Batch crawling

Once the single-page process (request, data extraction, saving) is complete, we can call it in batches.

The delay in seconds used here comes from Brother Pig's trial and error: with 3s or 5s instead of 10s or more, the CAPTCHA shows up far too often!
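The batch loop can be sketched generically. Here fetch_page is a stand-in for the single-page crawl-and-parse described above, and the default delay follows the 10s-plus finding:

```python
import time


def crawl_pages(fetch_page, pages: int, delay: float = 10.0) -> list:
    # fetch_page(page) is assumed to return a list of item dicts.
    # Sleeping between requests keeps the CAPTCHA frequency down.
    items = []
    for page in range(1, pages + 1):
        items.extend(fetch_page(page))
        if page < pages:
            time.sleep(delay)
    return items
```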

Over several runs, Brother Pig crawled more than 2,000 items this way.

VII. Problems encountered while crawling Taobao

Crawling Taobao raised a lot of problems; here they are, listed one by one:

1. Login issues

Question: what if obtaining the st code fails?

Answer: replace all of the request parameters in the _verify_password method.

If the parameters are correct, the login will basically succeed!

2. Proxy pool

To keep his IP from being blocked, Brother Pig used a proxy pool. Crawling Taobao requires high-quality IPs; Brother Pig tried many free online IPs, and basically none of them could crawl it.

But the IPs from one site are quite good: http://ip.zdaye.com/dayProxy.html. This site publishes a fresh batch of IPs every hour, and in Brother Pig's experience quite a few of them can still crawl Taobao.
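A minimal way to route requests through such a pool might look like this. It is a sketch, not the original code: pool entries are assumed to be "ip:port" strings, and proxies that fail are dropped so another one is tried.

```python
import random

import requests


def get_with_proxy(url: str, proxy_pool: list, timeout: float = 10.0):
    # Try random proxies from the pool until one answers; remove the
    # ones that fail so they are not retried.
    pool = list(proxy_pool)
    while pool:
        proxy = random.choice(pool)
        proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        try:
            return requests.get(url, proxies=proxies, timeout=timeout)
        except requests.RequestException:
            pool.remove(proxy)
    raise RuntimeError("no working proxy left in the pool")
```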

3. Retry mechanism

To keep an occasional failed request from killing the run, Brother Pig added a retry mechanism to the crawl method!
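The article uses the retry library's decorator for this. To show what that decorator does, here is the same idea in a few lines of standard library; the names here are illustrative, not the library's actual implementation:

```python
import functools
import time


def with_retry(tries: int = 3, delay: float = 2.0, exceptions=(Exception,)):
    # Re-run the wrapped function up to `tries` times, sleeping `delay`
    # seconds between attempts; re-raise once the attempts are used up.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator
```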

You need to install the retry library:

pip install retry

4. The slider appears

With all of the above in place, the slider still shows up sometimes. Brother Pig tested this many times; it is most likely to appear somewhere between 20 and 40 crawls.

When the slider appears, all you can do is wait about half an hour before crawling again, because the requests library cannot solve the slider. Later we will look at selenium and similar frameworks to see whether they can handle it.

5. The current state of this crawler

At present this crawler is not finished; it can only be considered a semi-finished product. There is plenty that could be improved, such as automatic maintenance of the IP pool, multi-threaded crawling of page ranges, solving the slider problem, and so on. We will keep improving it until it becomes a complete, well-behaved crawler!

That is how to crawl 2,000 condom listings on Taobao with Python. Have you picked up the knowledge or skills? If you want to learn more skills or enrich your knowledge, feel free to follow the industry information channel.
