
How to write a Python crawler script from scratch

2025-02-28 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 06/02 Report --

This article introduces the basics of "how to write a Python crawler script from scratch". Many people run into exactly this kind of problem in real projects, so let the editor walk you through how to handle these situations. I hope you read it carefully and get something out of it!

0. Preparatory work

What you need to prepare: Python, scrapy, and an IDE or any other text editor.
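If scrapy is not installed yet, it can usually be installed with pip (the exact command depends on your environment, for example pip3 or a virtualenv):

pip install scrapy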

1. The technical department has deliberated and decided that you will be the one to write the crawler.

Create a working directory anywhere you like, then use the command line to create a project called miao (replace the name with whatever you prefer):

scrapy startproject miao

Then create a python file, such as miao.py, inside the spiders folder to serve as the crawler's script.
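For reference, this is roughly the layout scrapy generates (the exact set of files varies a little between scrapy versions), with the new miao.py added under spiders/:

miao/
    scrapy.cfg
    miao/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            miao.py        <- the crawler script you just created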

The contents are as follows:

import scrapy

class NgaSpider(scrapy.Spider):
    name = "NgaSpider"
    host = "http://bbs.ngacn.cc/"
    # start_urls is the initial page we are going to crawl
    start_urls = [
        "http://bbs.ngacn.cc/thread.php?fid=406",
    ]

    # This is the parse function. Unless told otherwise, scrapy hands every fetched page to this function for parsing.
    # Page processing and analysis happen here; in this example we simply print the page content.
    def parse(self, response):
        print(response.body)

2. Why not give it a run?

If you use the command line, this is it:

cd miao

scrapy crawl NgaSpider

You can see that the crawler has printed out the first page of the forum's StarCraft board, unprocessed of course, with html tags and js scripts mixed in.

Analysis

Next, we will parse the page we just captured and extract the post titles from that pile of html and js.

Parsing a page is honestly grunt work; there are many ways to do it, and here we only introduce xpath.

0. Why not try the magic xpath?

Take a look at what you just grabbed, or open the page manually in the Chrome browser and press F12 to inspect the page structure.

Each title is actually wrapped in an html tag like this (the href value is omitted here), for example:

<a href='...' class='topic'>[cooperation mode] A tentative idea of modifying the cooperation mode</a>

You can see that href is the address of the post (with the forum's base address prepended, of course), and the text wrapped by the tag is the post's title.

So we use xpath to locate and extract the elements with class='topic'.

1. Look at the effect of xpath

Add an import at the top:

from scrapy import Selector

Change the parse function to:

def parse(self, response):
    selector = Selector(response)
    # Here xpath extracts all the tags with class=topic; the result is a list
    # and every element in this list is one of the html tags we are looking for.
    content_list = selector.xpath("//*[@class='topic']")
    # Iterate over the list and process each tag.
    for content in content_list:
        # Parse the tag here and extract the post title we need.
        topic = content.xpath('string(.)').extract_first()
        print(topic)
        # Extract the url address of the post here.
        url = self.host + content.xpath('@href').extract_first()
        print(url)

Run it again and you will see the titles and urls of all the posts on the first page of the board printed out.

Recursion

Next we will grab the content of each post.

You need to use python's yield here.

yield Request(url=url, callback=self.parse_topic)

This tells scrapy to fetch that url and then parse the fetched page with the specified parse_topic function.

At this point, we need to define a new function to analyze the content of a post.

The complete code is as follows:

import scrapy
from scrapy import Selector
from scrapy import Request

class NgaSpider(scrapy.Spider):
    name = "NgaSpider"
    host = "http://bbs.ngacn.cc/"
    # In this example only one page is specified as the starting url to crawl.
    # Of course, the starting urls could also be read from a database, a file, or somewhere else.
    start_urls = [
        "http://bbs.ngacn.cc/thread.php?fid=406",
    ]

    # Entry point of the crawler. You can do some initialization here, such as reading the starting urls from a file or database.
    def start_requests(self):
        for url in self.start_urls:
            # Add the starting url to scrapy's crawl queue and specify its parse function.
            # scrapy schedules the request itself, visits the url and brings the content back.
            yield Request(url=url, callback=self.parse_page)

    # Board-level parse function: parses the titles and addresses of the posts on a board page.
    def parse_page(self, response):
        selector = Selector(response)
        content_list = selector.xpath("//*[@class='topic']")
        for content in content_list:
            topic = content.xpath('string(.)').extract_first()
            print(topic)
            url = self.host + content.xpath('@href').extract_first()
            print(url)
            # Add the parsed post address to the crawl queue and specify its parse function.
            yield Request(url=url, callback=self.parse_topic)
        # You could parse the pagination info here to crawl multiple pages of the board.

    # Post-level parse function: parses the content of every reply in a post.
    def parse_topic(self, response):
        selector = Selector(response)
        content_list = selector.xpath("//*[@class='postcontent ubbcode']")
        for content in content_list:
            content = content.xpath('string(.)').extract_first()
            print(content)
        # You could parse the pagination info here to crawl multiple pages of the post.

At this point, the crawler can crawl the titles of all the posts on the first page of the board, and the content of every reply on the first page of each post.

The principle of crawling multiple pages is the same: take care to parse the pagination url, set a termination condition, and specify the corresponding page parse function.
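For illustration only, here is a minimal sketch of what that could look like inside NgaSpider's parse_page; the xpath for the next-page link is an assumption and has to be adapted to the real page structure:

def parse_page(self, response):
    selector = Selector(response)
    # ... parse the post titles and urls as shown above ...
    # Hypothetical xpath for the "next page" link; adjust it to the actual page.
    next_href = selector.xpath("//a[@class='pager_next']/@href").extract_first()
    # Termination condition: stop when there is no next-page link.
    if next_href:
        # Feed the next board page back to this same parse function.
        yield Request(url=self.host + next_href, callback=self.parse_page)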

Pipelines

This is where the crawled and parsed content is processed; through pipelines it can be written to local files or a database.

0. Define an Item

Create an items.py file in the miao folder.

from scrapy import Item, Field

class TopicItem(Item):
    url = Field()
    title = Field()
    author = Field()

class ContentItem(Item):
    url = Field()
    content = Field()
    author = Field()

Here we define two simple classes to describe the results of our crawl.

1. Write a processing method

Find the pipelines.py file under the miao folder; scrapy should have generated it automatically earlier.

We can add a processing method there.

# Import the Item classes defined in items.py.
from miao.items import TopicItem, ContentItem

class FilePipeline(object):
    ## Every result parsed by the crawler is handed to this function by scrapy.
    def process_item(self, item, spider):
        if isinstance(item, TopicItem):
            ## File writing, database writing and other operations can be performed here.
            pass
        if isinstance(item, ContentItem):
            ## File writing, database writing and other operations can be performed here.
            pass
        ## ...
        return item
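As a concrete illustration only (a minimal sketch; the topics.jl file name and the JSON-lines format are choices of mine, not part of the original article), writing TopicItems to a local file could look roughly like this:

import json

from miao.items import TopicItem

class FilePipeline(object):
    def open_spider(self, spider):
        # Open the output file once when the spider starts (hypothetical file name).
        self.file = open("topics.jl", "w")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        if isinstance(item, TopicItem):
            # Write each topic out as one JSON line.
            self.file.write(json.dumps(dict(item)) + "\n")
        return item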

2. Call this processing method in the crawler.

To use it, we just have to invoke it from the crawler; for example, the original content-handling function can be changed to:

def parse_topic(self, response):
    selector = Selector(response)
    content_list = selector.xpath("//*[@class='postcontent ubbcode']")
    for content in content_list:
        content = content.xpath('string(.)').extract_first()
        ## The above is the original content.

        ## Create a ContentItem object and put what we crawled into it.
        item = ContentItem()
        item["url"] = response.url
        item["content"] = content
        item["author"] = ""

        ## Just yield it like this.
        ## scrapy will hand this item over to the FilePipeline we just wrote for processing.
        yield item

3. Specify this pipeline in the configuration file

Find the settings.py file and add:

ITEM_PIPELINES = {
    'miao.pipelines.FilePipeline': 400,
}

Now, whenever yield item is called in the crawler, the item will be handled by this FilePipeline. The number 400 after it indicates the priority.

Multiple pipelines can be configured here. scrapy hands the item to each pipeline in order of priority, and the result of each stage is passed on to the next pipeline for further processing.

You can configure multiple pipelines like this:

ITEM_PIPELINES = {
    'miao.pipelines.Pipeline00': 400,
    'miao.pipelines.Pipeline01': 401,
    'miao.pipelines.Pipeline02': 402,
    'miao.pipelines.Pipeline03': 403,
    ## ...
}

Middleware

Through middleware we can make changes to the request before it goes out; the common cases such as setting the UA, a proxy, or login information can all be configured with middleware.

0. Middleware configuration

Similar to configuring a pipeline, add the name of the middleware to settings.py, for example:

DOWNLOADER_MIDDLEWARES = {
    'miao.middleware.UserAgentMiddleware': 401,
    'miao.middleware.ProxyMiddleware': 402,
}

1. This broken website checks the UA, so I want to change my UA.

Some websites will not let you visit at all without a UA.

Create a middleware.py under the miao folder

import random

agents = [
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
]

class UserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Pick a random UA from the list and set it on the outgoing request.
        agent = random.choice(agents)
        request.headers["User-Agent"] = agent

This is a simple middleware that randomly swaps the UA; you can extend the agents list yourself.

2. This broken website blocks my IP, so I want to use a proxy.

For example, if a proxy is listening locally at 127.0.0.1 on port 8123, you can configure a middleware so that the crawler accesses the target website through that proxy.

Also add to middleware.py:

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Fill in your own proxy here.
        # If you buy proxies, you can use the provider's API to fetch a list of proxies and pick one at random.
        proxy = "http://127.0.0.1:8123"
        request.meta["proxy"] = proxy

Many websites limit the number of visits and temporarily block an IP that accesses them too frequently.

Proxy IPs can be purchased online if necessary; such services usually provide an API for fetching the currently available IP pool, and you just pick one and fill it in here.
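For illustration only, a minimal sketch of picking a random proxy per request; the addresses in the list are placeholders, and in practice you would fill or refresh the list from your provider's API:

import random

# Hypothetical proxy pool; refresh it from your provider's API in practice.
proxies = [
    "http://127.0.0.1:8123",
    "http://127.0.0.1:8124",
]

class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        # Use a different, randomly chosen proxy for each request.
        request.meta["proxy"] = random.choice(proxies)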

Some common configurations

Some common configurations in settings.py

# Download interval, in seconds: the time scrapy waits between two consecutive requests.
DOWNLOAD_DELAY = 5

# Whether to retry when an access exception occurs.
RETRY_ENABLED = True

# Retry when one of the following http status codes is encountered.
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 403, 404, 408]

# Number of retries.
RETRY_TIMES = 5

# Pipeline concurrency: how many items the pipelines may process at the same time.
CONCURRENT_ITEMS = 200

# Maximum number of concurrent requests.
CONCURRENT_REQUESTS = 100

# Maximum number of concurrent requests for a single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 50

# Maximum number of concurrent requests for a single IP.
CONCURRENT_REQUESTS_PER_IP = 50

I just want to use PyCharm.

If you insist on using PyCharm as your development and debugging tool, you can set it up in the run configuration as follows:

Configuration page:

In the Script field, enter the path to scrapy's cmdline.py; for example, mine is:

/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py

Then fill in the crawler's name in Script parameters, which in this case is:

crawl NgaSpider

Finally, for Working directory, find your settings.py file and fill in the directory that contains it.
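Alternatively (not part of the original walkthrough, just a common convenience built on the same idea), you can put a small run.py next to scrapy.cfg and point PyCharm at that script instead; scrapy.cmdline.execute does the same thing as the scrapy command:

# run.py -- place it next to scrapy.cfg so scrapy can find the project settings
from scrapy.cmdline import execute

# Equivalent to running "scrapy crawl NgaSpider" on the command line.
execute(["scrapy", "crawl", "NgaSpider"])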

This is the end of "how to write a Python crawler script from scratch". Thank you for reading. If you want to learn more, follow the site; the editor will keep publishing practical, high-quality articles for you!
