2025-01-20 Update · SLTechnology News & Howtos > Internet Technology
Shulou (Shulou.com) 06/01 Report --
This article explains in detail how to use IP proxies to work around the blocking and rate limiting a web crawler runs into. It is shared as a reference; hopefully you will come away with a clear understanding of the topic.
How to solve blocked and rate-limited IPs during data collection? A crawler for Tianyancha (the "SkyEye" business-data search site) as an example
Three months ago the author built a distributed web crawler system in Python to collect and update the target site's data in real time. The crawler mirrors Tianyancha's data modules and data storage structure; the idea at the time was to build a data-service platform just like Tianyancha, with the data sources synchronized from Tianyancha in real time by the author's crawler. Preparation needed to collect the Tianyancha data:
1. The first step is to analyze the target website's data modules:
Before a Python 3 web crawler starts pulling data, the first step is to analyze the site's data modules. The core data of the whole site spans 19 dimensions of enterprise data: basic information, legal representatives, key members, shareholders & investment, change records, company annual reports, judicial risks, public-opinion events, job recruitment, commodity information, website filing, trademark data, patent data, works and software copyright, outbound investment relations, tax rating, administrative penalties, import & export credit, and enterprise credit rating.
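The 19 dimensions above can be kept as a simple configuration list so the crawler iterates over them uniformly. This is a sketch, not Tianyancha's actual API: the module keys are invented English slugs for the dimensions the article names (works and software copyright are grouped under one entry to match the article's count of 19).

```python
# The enterprise-data dimensions listed above, as a configuration the
# scheduler can iterate over. Slug names are illustrative, not real
# Tianyancha identifiers.
DATA_MODULES = [
    "basic_info", "legal_representatives", "key_members",
    "shareholders_and_investment", "change_records", "annual_reports",
    "judicial_risks", "public_opinion_events", "job_recruitment",
    "commodity_info", "website_filing", "trademarks", "patents",
    "works_and_software_copyright", "outbound_investment",
    "tax_rating", "administrative_penalties", "import_export_credit",
    "enterprise_credit_rating",
]

def work_items(company_id: str):
    """Yield one (module, company_id) crawl task per data dimension."""
    for module in DATA_MODULES:
        yield module, company_id

print(len(DATA_MODULES))  # -> 19
```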
2. Write a crawler demo to analyze the site's page structure and code structure.
The author simulates an HTTP request to the target website to see what the response data looks like. Under normal access it is easy to get the list data and the detail links it contains; by following those links, the crawler collects each enterprise's detailed data packet.
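The list-page step above can be sketched as follows. The HTML snippet and the `/company/<id>` URL pattern are invented for illustration, since the article does not show Tianyancha's actual markup; in the real crawler the HTML would come from the HTTP response.

```python
import re

# Step 2 sketch: given a list page's HTML, pull out the per-company
# detail links to follow. Markup and URL pattern are illustrative.
SAMPLE_LIST_HTML = """
<div class="result">
  <a href="/company/1001">Company A</a>
  <a href="/company/1002">Company B</a>
  <a href="/about">About</a>
</div>
"""

DETAIL_LINK = re.compile(r'href="(/company/\d+)"')

def extract_detail_links(html: str):
    """Return the detail-page paths found on a list page."""
    return DETAIL_LINK.findall(html)

print(extract_detail_links(SAMPLE_LIST_HTML))
# -> ['/company/1001', '/company/1002']

# In the live crawler the html argument would be fetched first, e.g.:
#   html = requests.get(list_url, headers=headers, timeout=10).text
```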
3. Collecting too frequently gets the crawler blocked and rate-limited. How to solve the IP problem?
When the author sends an HTTP request to the Tianyancha site, it normally returns status 200, meaning the request was accepted, and the data comes back. But Tianyancha has its own anti-crawling mechanism: if the same IP keeps collecting data from the site, that IP gets blacklisted, and any further collection from it is blocked for good. The solution is actually simple: access the site through proxy IPs. Every request goes out through a proxy, and the proxy is changed at random so that each request uses a different one. This proxy-IP technique solves the blocking and rate-limiting problem.
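The per-request rotation described above can be sketched like this, assuming the `requests` library's `proxies` parameter. The proxy addresses are placeholders, not real servers.

```python
import random

# Sketch of per-request proxy rotation: every request picks a proxy
# at random from the pool, so consecutive requests arrive from
# different IPs. Addresses below are placeholders.
PROXY_POOL = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

def random_proxies():
    """Build a requests-style proxies dict with a randomly chosen proxy."""
    proxy = random.choice(PROXY_POOL)
    return {"http": proxy, "https": proxy}

# Usage with requests (not executed here; url is hypothetical):
#   resp = requests.get(url, proxies=random_proxies(), timeout=10)
#   if resp.status_code == 200:
#       parse(resp.text)
```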
4. How to store Tianyancha's 200 million records? How many proxy IPs are needed?
When the author first wrote the crawler for Tianyancha, he started with free proxy IPs found online; as a result, some 90% of them were already blocked or rate-limited. So when collecting data at this scale, do not rely on free IPs from the internet: such an IP can expire within seconds, meaning it dies before the request even completes and the collection fails. In the end the author built his own proxy pool to solve the IP-blocking problem for collecting the 200 million Tianyancha records. If you lack the ability or resources to build your own IP pool, it is recommended to choose professional proxy-IP software, such as Sun Software.
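A self-built proxy pool of the kind the author describes can be sketched minimally as follows: each proxy carries an expiry time (free proxies die within seconds, paid ones last longer), only live proxies are handed out, and a failed request evicts its proxy immediately. This is an illustrative design, not the author's actual implementation.

```python
import time

class ProxyPool:
    """Minimal proxy pool: proxies expire, and failures evict them."""

    def __init__(self):
        self._pool = {}  # proxy url -> expiry timestamp

    def add(self, proxy: str, ttl_seconds: float):
        """Register a proxy that stays valid for ttl_seconds."""
        self._pool[proxy] = time.time() + ttl_seconds

    def get_live(self):
        """Return only the proxies that have not yet expired."""
        now = time.time()
        return [p for p, exp in self._pool.items() if exp > now]

    def report_failure(self, proxy: str):
        """Evict a proxy that produced a blocked/failed request."""
        self._pool.pop(proxy, None)

pool = ProxyPool()
pool.add("http://10.0.0.1:8080", ttl_seconds=60)
pool.add("http://10.0.0.2:8080", ttl_seconds=-1)  # already expired
print(pool.get_live())  # -> ['http://10.0.0.1:8080']
```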
5. Storing the site's hundreds of millions of records.
Database design is critical here: storing hundreds of millions of records stands or falls on how the database is designed.
That is how to use IP proxies to solve a crawler's blocking and rate-limiting problems. Hopefully the content above is helpful; if you found the article good, share it so more people can see it.
© 2024 shulou.com SLNews company. All rights reserved.