This article introduces how to write a Python crawler script from scratch. When working through real cases, many people run into the same difficulties, so let me walk you through how to handle these situations. I hope you read it carefully and get something out of it!
0. Preparatory work
What you need to prepare: Python, scrapy, and an IDE or any text editor.
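If scrapy is not installed yet, it can usually be installed through pip (this assumes a working Python and pip in your environment):

pip install scrapy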
1. The technical department has studied and decided that you should write the crawler.
Create a working directory anywhere you like, then create a project named miao from the command line (replace the name with whatever you prefer):
scrapy startproject miao
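For reference, a project generated this way typically looks roughly like the following (the exact files vary slightly between scrapy versions):

miao/
    scrapy.cfg            # deployment configuration
    miao/
        __init__.py
        items.py          # Item definitions (used later in this article)
        pipelines.py      # pipelines (used later in this article)
        settings.py       # project settings
        spiders/
            __init__.py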
Create a python file, for example miao.py, in the spiders folder to serve as the crawler script.
Its contents are as follows:
import scrapy

class NgaSpider(scrapy.Spider):
    name = "NgaSpider"
    host = "http://bbs.ngacn.cc/"
    # start_urls is the list of initial pages we are going to crawl
    start_urls = [
        "http://bbs.ngacn.cc/thread.php?fid=406",
    ]

    # This is the parse function. Unless told otherwise, every page scrapy captures
    # is handed to this function.
    # Page processing and analysis happen here; in this example we simply print
    # out the content of the page.
    def parse(self, response):
        print(response.body)
2. Why not give it a run?
If you use the command line, it is:
cd miao
scrapy crawl NgaSpider
You can see that the crawler has printed out the first page of the forum board, unprocessed of course, with html tags and js scripts mixed in.
Analysis
Next, we analyze the page we just captured and extract the post titles from this pile of html and js.
Parsing a page is really manual work. There are many ways to do it; here we only introduce xpath.
0. Why not try the magic xpath?
Take a look at what you just grabbed, or open the page manually in a Chrome browser and press F12 to inspect the page structure.
Each title is actually wrapped in an html tag like this, for example:
<a href='/read.php?tid=xxxxxx' class='topic'>[cooperation mode] A tentative idea of modifying the cooperation mode</a>
You can see that href is the address of the post (with the forum address prepended, of course), and the content wrapped by the tag is the post title.
So we use xpath's absolute positioning to extract the elements with class='topic'.
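Before wiring the xpath into the spider, you can also test it interactively with scrapy's built-in shell; a quick session might look like this (the exact output depends on the page):

# from the command line
scrapy shell "http://bbs.ngacn.cc/thread.php?fid=406"

# inside the shell a ready-made `response` object is available:
>>> titles = response.xpath("//*[@class='topic']")
>>> titles.xpath('string(.)').extract_first()   # title of the first post
>>> titles.xpath('@href').extract_first()       # relative address of the first post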
1. Look at the effect of xpath
Add an import at the top:
from scrapy import Selector
Change the parse function to:
def parse(self, response):
    selector = Selector(response)
    # The xpath below extracts all the tags with class='topic'; the result is a list.
    # Every element of this list is one of the html tags we are looking for.
    content_list = selector.xpath("//*[@class='topic']")
    # Iterate through the list and process each tag.
    for content in content_list:
        # Parse the tag here and extract the post title we need.
        topic = content.xpath('string(.)').extract_first()
        print(topic)
        # Extract the url address of the post here.
        url = self.host + content.xpath('@href').extract_first()
        print(url)
Run it again and you will see the titles and urls of all the posts on the first page of the board printed out.
Recursion
Next we will grab the content of each post.
You need to use python's yield here.
yield Request(url=url, callback=self.parse_topic)
This tells scrapy to crawl that url and then parse the captured page with the specified parse_topic function.
At this point, we need to define a new function to analyze the content of a post.
The complete code is as follows:
import scrapy
from scrapy import Selector
from scrapy import Request

class NgaSpider(scrapy.Spider):
    name = "NgaSpider"
    host = "http://bbs.ngacn.cc/"
    # In this example only one page is given as the starting url.
    # Of course, the starting urls could also be read from a database, a file, or anywhere else.
    start_urls = [
        "http://bbs.ngacn.cc/thread.php?fid=406",
    ]

    # Entry point of the crawler. Initialization work can be done here,
    # such as reading the starting urls from a file or a database.
    def start_requests(self):
        for url in self.start_urls:
            # Add the starting url to scrapy's crawl queue and specify its parse function.
            # scrapy schedules the requests itself, fetches each url and hands the content back.
            yield Request(url=url, callback=self.parse_page)

    # Board-level parse function: extracts the title and address of every post on a board page.
    def parse_page(self, response):
        selector = Selector(response)
        content_list = selector.xpath("//*[@class='topic']")
        for content in content_list:
            topic = content.xpath('string(.)').extract_first()
            print(topic)
            url = self.host + content.xpath('@href').extract_first()
            print(url)
            # Add the parsed post address to the crawl queue and specify its parse function.
            yield Request(url=url, callback=self.parse_topic)
        # The paging links could also be parsed here to crawl more pages of the board.

    # Post-level parse function: extracts the content of every floor of a post.
    def parse_topic(self, response):
        selector = Selector(response)
        content_list = selector.xpath("//*[@class='postcontent ubbcode']")
        for content in content_list:
            content = content.xpath('string(.)').extract_first()
            print(content)
        # The paging links could also be parsed here to crawl more pages of the post.
At this point, the crawler can crawl the titles of all the posts on the first page of the board, and the content of every floor on the first page of each post.
Crawling multiple pages works on the same principle: take care to parse the url of the paging links, set a termination condition, and specify the corresponding page-parsing function.
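As a rough illustration of that idea, parse_page could follow the paging link back into itself. Note that the "pager_next" selector below is a made-up placeholder, not the real markup of the page; it has to be adapted to whatever the actual "next page" link looks like:

    # sketch only: the next-page selector is hypothetical
    def parse_page(self, response):
        selector = Selector(response)
        for content in selector.xpath("//*[@class='topic']"):
            url = self.host + content.xpath('@href').extract_first()
            yield Request(url=url, callback=self.parse_topic)
        # follow the "next page" link, if any, and parse it with this same function
        next_href = selector.xpath("//a[@class='pager_next']/@href").extract_first()
        if next_href:
            yield Request(url=self.host + next_href, callback=self.parse_page)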
Pipelines
This is where the crawled and parsed content gets processed; through pipelines it can be written to local files, databases, and so on.
0. Define an Item
Create an items.py file in the miao folder.
from scrapy import Item, Field

class TopicItem(Item):
    url = Field()
    title = Field()
    author = Field()

class ContentItem(Item):
    url = Field()
    content = Field()
    author = Field()
Here we define two simple classes to describe the results of our crawl.
1. Write a processing method
Find the pipelines.py file under the miao folder; scrapy should have generated it automatically earlier.
We can build a handler here:
from miao.items import TopicItem, ContentItem

class FilePipeline(object):
    ## scrapy hands every item produced by the crawler to this function
    def process_item(self, item, spider):
        if isinstance(item, TopicItem):
            ## file writing, database writing and other operations can be done here
            pass
        if isinstance(item, ContentItem):
            ## file writing, database writing and other operations can be done here
            pass
        ## ...
        return item
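As a minimal sketch of what the "file writing" placeholder could look like, here is one way to append each floor's content to a local text file. The file name is arbitrary, and io.open is used so the sketch works on both Python 2 and 3; open_spider and close_spider are standard pipeline hooks:

    import io

    from miao.items import ContentItem

    class FilePipeline(object):
        def open_spider(self, spider):
            # called once when the spider starts
            self.file = io.open("nga_content.txt", "a", encoding="utf-8")

        def close_spider(self, spider):
            # called once when the spider finishes
            self.file.close()

        def process_item(self, item, spider):
            if isinstance(item, ContentItem):
                # content may be None if the xpath matched nothing
                self.file.write(item["url"] + u"\n")
                self.file.write((item["content"] or u"") + u"\n\n")
            return item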
2. Call this processing method from the crawler.
To use it, we only need to yield items from the crawler; for example, the original content-handling function can be changed to:
def parse_topic(self, response):
    selector = Selector(response)
    content_list = selector.xpath("//*[@class='postcontent ubbcode']")
    for content in content_list:
        content = content.xpath('string(.)').extract_first()
        ## the lines above are the original content
        ## create a ContentItem object and put what we crawled into it
        item = ContentItem()
        item["url"] = response.url
        item["content"] = content
        item["author"] = ""  ## ...
        ## just yield it like this
        ## scrapy will hand this item over to the FilePipeline we just wrote
        yield item
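TopicItem, defined earlier, can be fed to the pipeline in exactly the same way. A sketch of the corresponding change to parse_page might look like this, assuming the same imports as the full spider above plus TopicItem from miao.items (the author field is left empty, just like in the example above):

    def parse_page(self, response):
        selector = Selector(response)
        for content in selector.xpath("//*[@class='topic']"):
            topic = content.xpath('string(.)').extract_first()
            url = self.host + content.xpath('@href').extract_first()
            ## wrap the title and address in a TopicItem and hand it to the pipeline
            item = TopicItem()
            item["url"] = url
            item["title"] = topic
            item["author"] = ""  ## not extracted in this example
            yield item
            ## still queue the post itself for crawling
            yield Request(url=url, callback=self.parse_topic)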
3. Specify this pipeline in the configuration file
Find the settings.py file and add:
ITEM_PIPELINES = {
    'miao.pipelines.FilePipeline': 400,
}
Now whenever the crawler calls
yield item
the item will be handled by this FilePipeline. The number 400 is the priority.
Multiple pipelines can be configured; scrapy hands each item to them in order of priority, and the result of each pipeline is passed on to the next one for further processing.
You can configure multiple pipelines like this:
ITEM_PIPELINES = {
    'miao.pipelines.Pipeline00': 400,
    'miao.pipelines.Pipeline01': 401,
    'miao.pipelines.Pipeline02': 402,
    'miao.pipelines.Pipeline03': 403,
    ## ...
}
Middleware
Through middleware we can modify the request, for example to set the UA, a proxy, login information and so on; all of this can be configured through middleware.
0. Middleware configuration
Similar to configuring a pipeline, add the name of the middleware to settings.py, for example:
DOWNLOADER_MIDDLEWARES = {
    "miao.middleware.UserAgentMiddleware": 401,
    "miao.middleware.ProxyMiddleware": 402,
}
1. The damn website checks the UA, so I want to change the UA.
Some websites won't let you in without a UA.
Create a middleware.py under the miao folder:
import random

agents = [
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
]

class UserAgentMiddleware(object):
    def process_request(self, request, spider):
        # pick a random UA from the list for every request
        agent = random.choice(agents)
        request.headers["User-Agent"] = agent
This is a simple middleware that randomly replaces the UA; the agents list can be extended as needed.
2. The damn website blocks IPs, so I want to use a proxy.
For example, if a proxy is listening locally on 127.0.0.1 port 8123, you can likewise configure the middleware so that the crawler reaches the target website through this proxy.
Also add to middleware.py:
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # fill in your own proxy here
        # if you buy proxies, you can fetch a proxy list through the vendor's API and pick one at random
        proxy = "http://127.0.0.1:8123"
        request.meta["proxy"] = proxy
Many websites limit the number of visits and temporarily block an IP that visits too frequently.
If necessary, proxy IPs can be bought online; the vendor usually provides an API that returns the currently available IP pool, so just pick one and fill it in here.
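A sketch of that idea, assuming the proxy list is simply hard-coded (in practice it would come from the vendor's API); like the ProxyMiddleware above, it would have to be registered in DOWNLOADER_MIDDLEWARES:

    import random

    # hypothetical proxy addresses, for illustration only
    PROXIES = [
        "http://10.0.0.1:8123",
        "http://10.0.0.2:8123",
    ]

    class RandomProxyMiddleware(object):
        def process_request(self, request, spider):
            # pick a different proxy for every request
            request.meta["proxy"] = random.choice(PROXIES)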
Some common configurations
Some common configurations in settings.py
# Download interval, in seconds: the time scrapy waits between two consecutive requests
DOWNLOAD_DELAY = 5
# Whether to retry when a request fails
RETRY_ENABLED = True
# Retry when one of the following http status codes is returned
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 403, 404, 408]
# Number of retries
RETRY_TIMES = 5
# Pipeline concurrency: how many items the pipelines may process at the same time
CONCURRENT_ITEMS = 200
# Maximum number of concurrent requests
CONCURRENT_REQUESTS = 100
# Maximum number of concurrent requests for a single website
CONCURRENT_REQUESTS_PER_DOMAIN = 50
# Maximum number of concurrent requests for a single IP
CONCURRENT_REQUESTS_PER_IP = 50
I just want to use PyCharm.
If you insist on using PyCharm as your development and debugging tool, you can set it up in the run configuration as follows:
Configuration page:
Script: enter the path of scrapy's cmdline.py, for example mine is
/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py
Then fill the crawler arguments into Script parameters, which in this case is:
crawl NgaSpider
Finally, for Working directory, find your settings.py file and fill in the directory that contains it.
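An alternative that is often used is a small launcher script placed next to scrapy.cfg, which can then be run or debugged directly from PyCharm like any ordinary Python file; something along these lines (run.py is a hypothetical file name):

    # run.py: launcher script for running the crawler from an IDE
    from scrapy import cmdline

    cmdline.execute("scrapy crawl NgaSpider".split())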
This concludes "how to write a Python crawler script from scratch". Thank you for reading.