This article explains how to use Python to write a multithreaded crawler that scrapes email addresses and mobile phone numbers from Baidu Tieba posts. The technique is quite practical, and I hope you get something out of it.
I don't know how you spent Chinese New Year; I slept at home all day, and when I woke up someone on QQ was asking me for crawler source code. I remembered that, while practicing a while back, I wrote a crawler that scrapes the email addresses and mobile phone numbers that appear in the replies of Baidu Tieba posts, so I'm sharing it here for study and reference.
Requirements Analysis:
The main purpose of this crawler is to grab the content of posts in Baidu Tieba and extract mobile phone numbers and email addresses from that content. The main flow is explained in detail in the code comments.
Test environment:
The code was tested on Windows 7 64-bit with Python 2.7 64-bit (with the MySQLdb extension installed) and on CentOS 6.5 with Python 2.7 (also with the MySQLdb extension).
Environmental preparation:
As the saying goes, to do a good job you must first sharpen your tools. My environment is Windows 7 + PyCharm, and my Python is 2.7 64-bit, which is a comfortable setup for beginners. I also suggest installing easy_install, an installer used to install extension packages. For example, Python cannot operate a MySQL database natively; we must first install the MySQLdb package. With easy_install, a single command installs the MySQLdb extension, much like composer in PHP, yum in CentOS, or apt-get in Ubuntu.
The related tools can be found in my GitHub repository cw1997/python-tools. To install easy_install, just run the py script from the Python command line and wait a moment; it will automatically add the Windows environment variables. If typing easy_install at the Windows command line produces output, the installation succeeded.
Details of the environment selection:
As for hardware, the faster the better; start with at least 8 GB of memory, because the crawler itself needs to store and parse a lot of intermediate data. This is especially true for multithreaded crawlers: when crawling paginated lists and detail pages, and when the amount of data is large, the queues used to distribute crawling tasks take up a lot of memory. And if the data we grab is JSON and we store it in a NoSQL database such as MongoDB, that also uses a lot of memory.
A wired network connection is recommended, because some low-quality wireless routers and ordinary consumer wireless network cards will intermittently disconnect, drop data, or lose packets when traffic is heavy. I speak from experience.
As for the operating system and Python, definitely choose 64-bit. A 32-bit operating system cannot use large amounts of memory, and a 32-bit Python may feel fine when fetching data on a small scale, but once the data volume grows (for example, when a large amount of data is stored in lists, queues, and dictionaries), Python will raise a memory overflow error when its memory usage exceeds about 2 GB. The reason is explained by Evian's answer to a question I once asked on SegmentFault (the httplib module starts reporting memory overflow errors once memory usage reaches about 1.9 GB).
If you are going to use MySQL to store data, use a version newer than MySQL 5.5; newer versions (5.7 and later) support a JSON data type, so you can do without MongoDB. (Some people say MySQL is a bit more stable than MongoDB; I'm not sure about that.)
Python is now at version 3.x, so why am I still using 2.7? My personal reason is that the Core Python Programming book I bought long ago is the second edition, which uses 2.7 for its examples, and a large amount of tutorial material online is still written for 2.7. Version 2.7 still differs from 3.x in some respects, and if we never learned 2.7 we might misunderstand or fail to follow demo code because of subtle syntax differences. Some dependency packages are also still only compatible with 2.7. My advice: if you are in a hurry to learn Python and get a job, and the company has no legacy code to maintain, consider going straight to 3.x; if you have plenty of time, no experienced mentor to guide you systematically, and can only learn from scattered blog posts online, then learn 2.7 first and move on to 3.x afterwards; after all, picking up 3.x once you know 2.7 is quick.
Knowledge points involved in multithreaded crawlers:
In fact, for any software project, if we want to know what knowledge is needed to write it, we can look at which packages are imported in the project's main entry file.
Now let's look at our project. For someone who has just started with Python, some of these packages may be unfamiliar, so in this section we will briefly cover what each package does, which knowledge points it involves, and what keywords to search for. This article will not start from the very basics, so learn to make good use of search engines and look up those keywords yourself. Let's go through these knowledge points one by one.
HTTP protocol:
The essence of a crawler is to keep making HTTP requests, receive HTTP responses, and store them on our computer. Understanding the HTTP protocol helps us precisely control parameters that can speed up crawling, such as keep-alive.
Threading module (multithreaded):
The programs we usually write are single-threaded: the code runs in the main thread, and the main thread runs in the Python process. For an explanation of threads and processes, see Ruan Yifeng's blog post "A Simple Explanation of Processes and Threads".
Multithreading in Python is implemented through a module called threading. There used to be a thread module, but threading gives us more control over threads, so we now use threading for multithreaded programming.
For the usage of threading, I think this article is good: "[python] Topic 8: thread and threading for multithreaded programming"; you can refer to it.
To put it simply, writing a multithreaded program with the threading module means defining a class that inherits from threading.Thread and putting the work each thread should do into that class's run method. If the thread needs to do some initialization when it is created, write that initialization code in its __init__ method, similar to constructors in PHP or Java. A minimal sketch follows.
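Here is a minimal sketch of that pattern, assuming Python 2.7; the class and attribute names (SpiderThread, thread_id) are only illustrative and are not taken from the original project:

```python
# -*- coding: utf-8 -*-
import threading


class SpiderThread(threading.Thread):
    """A hypothetical worker thread; names are illustrative only."""

    def __init__(self, thread_id):
        # Always call the parent constructor so the thread is set up properly.
        threading.Thread.__init__(self)
        self.thread_id = thread_id  # per-thread initialization goes here

    def run(self):
        # The work the thread performs goes in run(); it executes
        # automatically once start() is called on the thread object.
        print "thread %d is working" % self.thread_id
```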
One additional concept here is thread safety. In a single-threaded program, only one thread operates on a resource (a file, a variable) at any moment, so there can be no conflict. With multiple threads, however, two threads may operate on the same resource at the same time and corrupt it, so we need a mechanism to resolve such conflicts, such as locking; examples include row-level locks in the InnoDB engine of MySQL and read locks on file operations, which are handled by the underlying program. We usually just need to know which operations, or which programs, already deal with thread safety, and then we can use them in multithreaded code. A program written with thread safety in mind is generally called a "thread-safe version"; for example, PHP has a TS (Thread Safety) build. The Queue module we discuss below is a thread-safe queue data structure, so we can use it safely in multithreaded programming.
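For shared data that is not already thread safe, the threading module provides locks. A short sketch, assuming a shared counter (the names counter and counter_lock are hypothetical):

```python
# -*- coding: utf-8 -*-
import threading

counter = 0
counter_lock = threading.Lock()


def increment():
    global counter
    # Only one thread at a time may enter this block, so the
    # read-modify-write on the shared counter cannot interleave.
    with counter_lock:
        counter += 1
```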
We should also talk about the critical concept of thread blocking. After studying the threading module we know how to create and start threads, but if we create a thread and then call its start method, the whole program seems to end right away. What is going on? The main thread only contains the code that starts the child threads, so starting them is its entire job; the code the child threads execute is just a method written in a class and is never run by the main thread itself. So after the main thread starts the child threads, its own work is done and it exits gloriously. Once the main thread exits, the Python process ends and the other threads have no memory space in which to keep running. Therefore we need the main thread to wait until all child threads have finished before it exits. Is there a way to make the thread objects hold up the main thread? sleep()? That is indeed an option, but how long should the main thread sleep? We don't know in advance how long a task will take, so that won't work. What we really want is a way to let a child thread "jam" the main thread. "Jam" sounds too casual; the proper term is "block", so we can search for "python child thread blocks main thread". If we use a search engine correctly we will find a method called join(). Yes, join() is the method used to block the main thread while a child thread has not yet finished: the main thread stops at the line containing the join() call, and the code after join() does not run until that thread has finished.
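Continuing the hypothetical SpiderThread sketch above, this is roughly how starting several threads and then blocking the main thread with join() looks:

```python
threads = [SpiderThread(i) for i in range(10)]

for t in threads:
    t.start()    # run() begins executing in each child thread

for t in threads:
    t.join()     # the main thread blocks here until every child thread ends

print "all child threads finished, main thread can now exit"
```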
Queue module (queue):
Suppose we need to crawl someone's blog. We know the blog has two kinds of pages: a list.php page showing links to all of the blog's articles, and a view.php page showing the full content of a single article.
If we want to grab all the articles on this blog, the single-threaded approach is: first use a regular expression to grab the href attribute of every link (a tag) on the list.php page and store them in an array called article_list (Python doesn't call it an array; it is called a list), then iterate over article_list with a for loop, fetch each page's content with whatever function grabs web pages, and store it in the database. A sketch of this idea is shown below.
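A minimal single-threaded sketch of that idea, assuming Python 2.7 with urllib2 and re; the URL and the regular expression are placeholders, not the ones used against Baidu Tieba:

```python
# -*- coding: utf-8 -*-
import re
import urllib2

list_html = urllib2.urlopen('http://example.com/list.php').read()

# Grab the href of every <a> tag on the list page (placeholder pattern).
article_list = re.findall(r'<a\s+href="([^"]+)"', list_html)

for url in article_list:
    view_html = urllib2.urlopen(url).read()
    # ... parse view_html and store the result in the database ...
```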
If we are going to write a multithreaded crawler to accomplish this task, assuming that our program uses 10 threads, then we have to find a way to divide the previously captured article_list into 10 parts, each of which is allocated to one of the child threads.
But the problem is that if the length of article_list is not a multiple of 10, that is, the number of articles is not an integer multiple of 10, then some threads will be assigned fewer tasks than others and will finish earlier.
That may not seem like a problem when we are only crawling blog posts of a few thousand words, but if a task (not necessarily crawling a web page; it might be mathematical computation, graphics rendering, or some other time-consuming operation) takes much longer, it results in a great waste of resources and time. The whole point of multithreading is to use all available computing resources and time as fully as possible, so we need to find a more scientific and reasonable way to distribute tasks.
There is another situation to consider: when there are a great many articles, we not only want to grab the article content quickly, we also want to see what has been grabbed as soon as possible. This requirement comes up often on CMS content-collection sites.
For example, suppose the target blog has tens of millions of articles. In that case the blog will usually be paginated, and if we follow the traditional approach above, crawling all the pages of list.php alone would take hours or even days. If the boss wants the crawled content displayed on our CMS collection site as soon as possible, then we need one thread to crawl list.php and throw the captured data into an article_list array, while another thread extracts article URLs from article_list, visits the corresponding page, and uses regular expressions to grab the article content. How do we implement this?
We need to start two kinds of threads at the same time: one kind is responsible for grabbing the URLs in list.php and throwing them into the article_list array, and the other is responsible for extracting URLs from article_list and fetching the corresponding article content from the view.php pages.
But remember the concept of thread safety mentioned earlier? The first kind of thread writes data into the article_list array while the other kind reads from article_list and deletes the entries that have already been read. However, list in Python is not a thread-safe data structure, so doing this can lead to unpredictable errors. Instead we can use a more convenient, thread-safe data structure: the Queue queue mentioned in the subtitle of this section.
Similarly, Queue also has a join() method, which is similar in spirit to the join() method of threading discussed in the previous section, except that Queue's join() blocks until every item that was put into the queue has been taken out and marked finished with task_done(); only then does the code after join() continue. In this crawler I used this method to block the main thread, rather than blocking it through each thread's join directly. The advantage is that we don't have to write a busy loop to check whether the task queue still has unfinished tasks, which makes the program run more efficiently and the code more elegant.
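A minimal producer/consumer sketch of this pattern, assuming Python 2.7; the URLs and thread counts are placeholders rather than values from the original crawler:

```python
# -*- coding: utf-8 -*-
import threading
from Queue import Queue   # note the capital Q in Python 2.7

url_queue = Queue()


def list_worker():
    # Producer: pretend these article URLs were parsed out of list.php.
    for i in range(1, 6):
        url_queue.put('http://example.com/view.php?id=%d' % i)


def view_worker():
    # Consumer: keep taking URLs and "crawling" them.
    while True:
        url = url_queue.get()
        print 'crawling %s' % url
        url_queue.task_done()   # tell the queue this task is finished


producer = threading.Thread(target=list_worker)
producer.start()

consumer = threading.Thread(target=view_worker)
consumer.setDaemon(True)   # daemon thread, so it won't keep the process alive
consumer.start()

producer.join()    # wait until the producer has queued everything
url_queue.join()   # then block until the consumer has finished every task
print 'all queued tasks finished'
```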
Another detail: the queue module is named Queue in Python 2.7, but in Python 3.x it was renamed queue; the difference is the capitalization of the first letter. If you are copying code from the Internet, keep this small difference in mind.
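If you want code that runs under both versions, a common hedge is a guarded import (a sketch, not taken from the original project):

```python
try:
    from Queue import Queue   # Python 2.7
except ImportError:
    from queue import Queue   # Python 3.x
```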
Getopt module:
If you have learned the C language, you should be familiar with this module; it is responsible for extracting options from the command line. For example, we usually connect to a MySQL database from the command line by typing mysql -h127.0.0.1 -uroot -p, where the "-h127.0.0.1 -uroot -p" part after mysql is the option section.
When we write crawlers there are parameters that the user needs to supply, such as the MySQL host IP, user name, and password. To make the program friendlier, such configuration items should not be hard-coded but passed in dynamically when the program is run; combined with the getopt module, we can achieve this, as shown in the sketch below.
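A small sketch of parsing such options with getopt, assuming Python 2.7; the option letters mirror the mysql example above and are only illustrative:

```python
# -*- coding: utf-8 -*-
import getopt
import sys

# "h:u:p:" means -h, -u and -p each take a value.
opts, args = getopt.getopt(sys.argv[1:], 'h:u:p:')

config = {}
for opt, value in opts:
    if opt == '-h':
        config['host'] = value
    elif opt == '-u':
        config['user'] = value
    elif opt == '-p':
        config['password'] = value

print config
```

Running something like `python spider.py -h127.0.0.1 -uroot -psecret` would then fill the config dictionary.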
Hashlib (hash):
A hash is essentially a family of mathematical algorithms with this property: given an input, it produces another result that, although very short, can be treated as approximately unique. For example, md5 and sha-1, which we have all heard of, are hash algorithms. After a series of mathematical operations they can turn documents or strings into a jumbled string of digits and letters less than a hundred characters long.
The hashlib module in python encapsulates these mathematical functions for us, and we only need to call it to complete the hash operation.
Why does my crawler use this package? Because for some interface requests the server requires a check code to ensure the requested data has not been tampered with or lost. These check codes are generally produced by hash algorithms, so we need this module to compute them.
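A minimal sketch of computing such a check code with hashlib; the string being hashed is a placeholder, and the real signing rule depends on the interface:

```python
# -*- coding: utf-8 -*-
import hashlib

payload = 'kw=python&pn=0'            # hypothetical request parameters
sign = hashlib.md5(payload).hexdigest()
print sign                            # a 32-character hex digest
```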
Json:
Much of the time the data we grab is not HTML but JSON. JSON is essentially a string of key-value pairs, and if we need to extract a particular value, the json module converts the JSON string into a dict so it is easy to work with.
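For example, a short sketch with a made-up JSON string:

```python
# -*- coding: utf-8 -*-
import json

raw = '{"no": 1, "content": "contact me at test@example.com"}'
data = json.loads(raw)        # JSON string -> dict
print data['content']
```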
Re (regular expression):
Sometimes, after grabbing the content of a web page, we need to extract pieces of text in a particular format. For example, an email address is usually a few letters and digits, followed by an @ symbol, followed by a domain name such as xxx.xxx. To describe such a format in a way a computer understands, we use regular expressions and let the computer automatically match text in that particular format out of a long string.
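A sketch of matching email addresses and 11-digit mobile numbers out of a page; both patterns are simplified illustrations, not the exact ones used in the crawler:

```python
# -*- coding: utf-8 -*-
import re

text = 'reach me at demo@example.com or 13800138000, thanks'

emails = re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', text)
phones = re.findall(r'1\d{10}', text)   # 11 digits starting with 1 (simplified)

print emails, phones
```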
Sys:
This module mainly handles system-related matters; in this crawler I use it to solve the output encoding problem.
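One common (if inelegant) Python 2 way to deal with console encoding errors when printing Chinese text is the setdefaultencoding trick; whether the original crawler uses exactly this approach is an assumption on my part:

```python
# -*- coding: utf-8 -*-
import sys

reload(sys)                        # setdefaultencoding is hidden after startup
sys.setdefaultencoding('utf-8')    # avoid UnicodeEncodeError when printing
```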
Time:
Anyone with a little English can guess that this module handles time. In this crawler I use it to get the current timestamp; when the main thread ends, subtracting the timestamp recorded at the start of the program from the current one gives the program's running time.
In one test run, 50 threads crawled 100 list pages (30 posts per page, about 3,000 posts) and extracted the phone numbers and email addresses from the post content in 330 seconds.
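A sketch of that timing pattern:

```python
# -*- coding: utf-8 -*-
import time

start_time = time.time()   # timestamp when the program starts

# ... crawling work happens here ...

elapsed = time.time() - start_time
print 'finished in %.2f seconds' % elapsed
```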
Urllib and urllib2:
Both modules handle HTTP requests and URL formatting. The core HTTP request code of my crawler uses these modules.
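A minimal request sketch with urllib2, assuming Python 2.7; the URL and User-Agent value are placeholders:

```python
# -*- coding: utf-8 -*-
import urllib2

request = urllib2.Request(
    'http://tieba.baidu.com/p/1',                     # placeholder post URL
    headers={'User-Agent': 'Mozilla/5.0 (crawler demo)'}
)
response = urllib2.urlopen(request, timeout=10)
html = response.read()
print len(html)
```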
MySQLdb:
This is a third-party module for operating mysql databases in python.
One detail to note here: the MySQLdb module is not thread-safe, which means we cannot share a single MySQL connection handle across multiple threads. That is why, in my code, I pass a fresh MySQL connection handle into each thread's constructor, so that every child thread uses only its own independent connection.
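A sketch of that per-thread connection idea, again with a hypothetical worker class; the connection parameters are placeholders:

```python
# -*- coding: utf-8 -*-
import threading

import MySQLdb


class SpiderThread(threading.Thread):
    def __init__(self, thread_id, db_conn):
        threading.Thread.__init__(self)
        self.thread_id = thread_id
        self.db = db_conn              # each thread keeps its own handle

    def run(self):
        cursor = self.db.cursor()
        cursor.execute('SELECT 1')
        cursor.close()


threads = []
for i in range(5):
    # One independent connection per thread; never shared between threads.
    conn = MySQLdb.connect(host='127.0.0.1', user='root',
                           passwd='secret', db='tieba', charset='utf8')
    threads.append(SpiderThread(i, conn))
```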
Cmd_color_printers:
This is also a third-party module; the relevant code can be found online. It is mainly used to print colored strings to the command line. For example, when the crawler hits an error, printing the message in red makes it more conspicuous, and that is what this module is for.
Error handling for automated crawlers:
If you run the crawler in an environment where network quality is poor, you will find that it sometimes throws exceptions; this is because, to save effort, I did not write handling logic for every kind of exception.
In general, if we are going to write highly automated crawlers, we need to anticipate all the exceptions that our crawlers may encounter and deal with them.
For example, when such an error occurs, we should put the task we were working on back into the task queue, otherwise information would be missed. This is one of the finer points of writing crawlers.
That is how to use Python to write a multithreaded crawler that scrapes email addresses and mobile phone numbers from Baidu Tieba. Some of these knowledge points come up in everyday work, and I hope this article helps you learn more from them.