An inventory of the 8 commonly used crawler skills for getting started with Python crawlers

2025-04-03 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

Today I will talk with you about an inventory of the eight commonly used crawler skills for getting started with Python crawlers. Many people may not know them well, so to help you understand better, the editor has summarized the following contents; I hope you will get something from this article.

Programming is not an easy task for any novice, and Python is a real blessing for anyone who wants to learn it: reading Python code is like reading an article. Python provides very elegant syntax and is often called one of the most elegant languages.

When getting started with Python, crawler scripts of all kinds are what I used most frequently: scripts for verifying proxies locally, scripts for logging in and posting on forums automatically, scripts for fetching email automatically, and scripts for simple CAPTCHA recognition.

These scripts have one thing in common: they are all web-related, and they always need some way to fetch URLs, so I have accumulated a fair amount of site-scraping experience. I am summarizing it here so the work need not be repeated later.

1. Basic crawl of web pages

The GET method

The POST method
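The original code snippets did not survive in this copy, so here is a minimal Python 3 sketch of both methods using `urllib.request` (the modern successor to the urllib2 of this article's era); the URL and parameters are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

url = "http://www.example.com"  # placeholder target

# GET: parameters travel in the query string
get_req = Request(url + "?" + urlencode({"q": "python"}))

# POST: parameters travel in the body as encoded bytes;
# passing data= automatically turns the request into a POST
post_req = Request(url, data=urlencode({"user": "abc", "pwd": "123"}).encode())

# To actually fetch a page (needs network access):
# html = urlopen(get_req).read().decode()
```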

2. Use a proxy server

This is useful in some situations, for example when your IP has been blocked or the number of visits per IP is limited.
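A sketch of what this looks like with Python 3's `urllib.request`; the proxy address is a hypothetical placeholder.

```python
from urllib.request import ProxyHandler, build_opener, install_opener

# Hypothetical local proxy address; replace with a real proxy
proxy_support = ProxyHandler({"http": "http://127.0.0.1:8087"})
opener = build_opener(proxy_support)
install_opener(opener)  # from now on, every urlopen() goes through the proxy
# html = urlopen("http://www.example.com").read()
```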

3. Cookie processing

Yes, that's right: if you want to use a proxy and cookies at the same time, add proxy_support and build the opener with both handlers, as follows:
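Since the original snippet is missing, here is a Python 3 sketch of combining the two handlers; the proxy address is again a placeholder.

```python
import http.cookiejar
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener, install_opener

cookie_support = HTTPCookieProcessor(http.cookiejar.CookieJar())
proxy_support = ProxyHandler({"http": "http://127.0.0.1:8087"})  # hypothetical proxy

# One opener that both carries cookies between requests and goes through the proxy
opener = build_opener(cookie_support, proxy_support)
install_opener(opener)
```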

4. Masquerade as browser access

Some websites resent visits from crawlers, so they refuse every such request. At this point we need to pretend to be a browser, which can be achieved by modifying the header in the HTTP request:
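A minimal Python 3 sketch; the User-Agent string is just an example of a browser-like value.

```python
from urllib.request import Request

# Override the default "Python-urllib/3.x" User-Agent, which some sites reject
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
req = Request("http://www.example.com", headers=headers)
# urlopen(req) would now look like an ordinary browser visit to the server
```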

5. Page parsing

The most powerful tool for page parsing is of course the regular expression. Since the pattern differs from site to site and from user to user, there is no need for much explanation here.
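As a toy illustration of the idea, non-greedy capture groups can pull links out of a fetched page; the HTML fragment here is made up.

```python
import re

# A toy fragment of the kind of HTML a crawler fetches
html = '<a href="/thread-1.html">First post</a> <a href="/thread-2.html">Second post</a>'

# Non-greedy groups capture each link target and its text
links = re.findall(r'<a href="(.*?)">(.*?)</a>', html)
for href, text in links:
    print(href, text)
```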


The second option is a parsing library; the two commonly used ones are lxml and BeautifulSoup.

My evaluation of these two libraries: both are HTML/XML processing libraries. BeautifulSoup is implemented in pure Python and is inefficient, but its features are practical; for example, the source code of an HTML node can be obtained from a search result. lxml is implemented in C, is efficient, and supports XPath.
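A small sketch of the BeautifulSoup side, assuming the third-party beautifulsoup4 package is installed; the HTML fragment is made up.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = '<html><body><div class="title">Hello</div></body></html>'
soup = BeautifulSoup(html, "html.parser")  # pass "lxml" here instead for speed
node = soup.find("div", class_="title")

print(node.get_text())  # the node's text
print(str(node))        # the node's HTML source, as described above
```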

6. Processing of CAPTCHA

What should I do when I run into a CAPTCHA? There are two situations to deal with:

For CAPTCHAs of the Google kind, there is nothing to be done.

A simple CAPTCHA, with a limited set of characters that are only translated or rotated and overlaid with noise, without distortion, can still be handled. The general idea is to rotate the characters back and remove the noise, then segment the individual characters, reduce the dimensionality by feature extraction (such as PCA), and generate a feature library. The CAPTCHA is then compared against the feature library.

This is fairly involved and will not be expanded on here; please consult a relevant textbook for the specific techniques.
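To make the PCA step concrete, here is a minimal sketch (assuming NumPy, with the function name and sizes invented for illustration) of reducing flattened character images to a few features:

```python
import numpy as np

def pca_features(samples, k=2):
    """Project flattened character images onto their top-k principal components."""
    X = np.asarray(samples, dtype=float)
    X = X - X.mean(axis=0)                     # center each pixel column
    cov = np.cov(X, rowvar=False)              # pixel-by-pixel covariance
    vals, vecs = np.linalg.eigh(cov)           # eigh: symmetric matrix, ascending order
    top = vecs[:, np.argsort(vals)[::-1][:k]]  # the k directions of largest variance
    return X @ top                             # reduced-dimension feature vectors

# In a real pipeline, each row of `samples` would be one segmented, denoised
# character image; features of known characters form the library, and an
# unknown character is matched by its nearest feature vector.
```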

7. Gzip/deflate support

Today's web pages generally support gzip compression, which can often cut transmission time dramatically.

Take the VeryCD home page as an example: the uncompressed version is 247K, while the compressed version is 45K, about 1/5 of the original. That means the fetch can be roughly five times faster.

However, Python's urllib/urllib2 does not support compression by default. To get the compressed format back, you must write 'accept-encoding' into the request header, and then, after reading the response, check its header for a 'content-encoding' entry to decide whether the body needs to be decoded, which is tedious.

How can urllib2 be made to support gzip and deflate automatically?

You can inherit from the BaseHandler class and then install it with build_opener:
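The original handler code is missing, so here is a Python 3 sketch of the idea; the class and function names are invented, and for brevity only the request side is wired up, with the response-side decoding shown as a helper.

```python
import gzip
import zlib
from urllib.request import BaseHandler, Request, build_opener

def decompress(data, encoding):
    """Undo the Content-Encoding the server applied to the body."""
    if encoding == "gzip":
        return gzip.decompress(data)
    if encoding == "deflate":
        try:
            return zlib.decompress(data)
        except zlib.error:                      # raw deflate stream, no zlib header
            return zlib.decompress(data, -zlib.MAX_WBITS)
    return data

class AcceptEncodingHandler(BaseHandler):
    """Advertise gzip/deflate support on every outgoing request."""
    def http_request(self, req):
        req.add_header("Accept-Encoding", "gzip, deflate")
        return req
    # A complete handler would also define http_response() and run the body
    # through decompress() based on the Content-Encoding response header.

opener = build_opener(AcceptEncodingHandler)
```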

8. Multithreaded concurrent fetching

If single threading is too slow, multithreading is needed.

Here's a simple thread pool template that simply prints 1-10.
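The template itself did not survive in this copy; the following is a sketch in the same spirit, using the standard library's `queue` and `threading` modules, with a results list added so the output can be checked.

```python
from queue import Queue
from threading import Thread

def worker(q, results):
    while True:
        item = q.get()
        if item is None:        # sentinel: this worker is done
            q.task_done()
            return
        print(item)             # the "work": just print the number
        results.append(item)
        q.task_done()

q, results = Queue(), []
threads = [Thread(target=worker, args=(q, results)) for _ in range(3)]
for t in threads:
    t.start()
for i in range(1, 11):          # the numbers 1-10 as jobs
    q.put(i)
for _ in threads:               # one shutdown sentinel per worker
    q.put(None)
q.join()
for t in threads:
    t.join()
```

The numbers come out in whatever order the three workers grab them, which is exactly the point: the jobs run concurrently.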

But you can see that it's concurrent.

Although Python's multithreading is rather underwhelming (the GIL keeps threads from running Python code in parallel), for crawlers, which spend most of their time waiting on the network, it can still improve efficiency to a certain extent.

9. Summary

Reading code written in Python feels like reading English; it lets users focus on solving problems rather than on figuring out the language itself.

Although Python's reference implementation is written in C, it abandons C's complex pointers, making the language simple and easy to learn.

And as open source software, Python allows its code to be read, copied, and even improved.

These features contribute to Python's high productivity. "Life is too short, I use Python": it is a wonderful and powerful language.

All in all, when you start to learn Python, you must pay attention to these four points:

1. Follow code conventions. This is itself a very good habit; if you do not maintain good code style from the start, it will be very painful later.

2. Do more hands-on work and read less. Many people learn Python by blindly reading books, but this is not mathematics or physics, where you may understand material just by reading worked examples; learning Python is mainly about learning programming thinking.

3. Practice frequently. After learning a new concept, you must remember to apply it, otherwise you will forget it; this profession is mainly learned through hands-on work.

4. Study efficiently. If you feel your efficiency is very low, stop, find the reason, and ask people who have been through it why.

After reading the above, do you have a better understanding of the eight commonly used crawler skills for getting started with Python crawlers? If you want to learn more, please follow the industry information channel. Thank you for your support.
