
How to Use Regular Expressions in Python to Crawl an Ancient Poetry Website

2025-03-26 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article introduces "how to use regular expressions in Python to crawl an ancient poetry website". Many people run into difficulties with this in practice, so let's walk through how to handle these situations step by step. I hope you read carefully and learn something!

Analysis of Ancient Chinese Poetry Website

Figure 1 below shows the home-page data of the ancient poetry website's "poetry column".

The address of the second page is so.gushiwen.cn/shiwens/default.aspx?page=2&tstr=&astr=&cstr=&xstr=. The address of page n simply changes this to page=n; everything else stays the same.
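Given that pattern, building the URL for page n is just string formatting. A minimal sketch, assuming the empty tstr/astr/cstr/xstr parameters can be omitted without changing the result:

```python
# Base listing URL of the poetry column.
BASE_URL = "https://so.gushiwen.cn/shiwens/default.aspx"

def page_url(n: int) -> str:
    """Return the URL of page n of the poetry listing.

    Assumption: the empty tstr/astr/cstr/xstr query parameters
    seen in the browser address bar are safe to leave out.
    """
    return f"{BASE_URL}?page={n}"

print(page_url(2))  # https://so.gushiwen.cn/shiwens/default.aspx?page=2
```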

1. Get the total number of pages with regular expressions

The matching regular expression has the general shape r'.*? (.*?) ', anchored on the pagination markup that holds the total page count.

First, the r prefix marks a raw string, so backslashes are handed to the regex engine unchanged. Inside the pattern, . matches any single character (except a newline by default), and * means the preceding element may repeat zero or more times, so .* matches any run of characters. On its own, ? means the preceding element occurs zero or one time; placed after another quantifier such as *, it instead switches that quantifier to non-greedy mode, so .*? matches as few characters as possible rather than as many as it can.

Parentheses define a capture group. Since we only need the page count, we wrap just that part of the pattern in (.*?), so findall returns only the captured text rather than the whole match.
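The difference between greedy and non-greedy matching, and what a capture group returns, can be seen on a made-up pagination snippet (invented markup for illustration only; the live page's tags differ):

```python
import re

# A made-up pagination fragment, for illustration only.
html = '<div class="pages"><span>other</span><span>5 pages</span></div>'

# Greedy .* runs on to the LAST </span>; non-greedy .*? stops at the first one.
greedy = re.findall(r'<span>(.*)</span>', html)
lazy = re.findall(r'<span>(.*?)</span>', html)

print(greedy)  # ['other</span><span>5 pages']
print(lazy)    # ['other', '5 pages']
```

Note that findall returns only the text inside the parentheses, which is why the surrounding tags never appear in the results.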

So, the final code is:

def get_total_pages():
    resp = requests.get(first_url)
    # Get the total number of pages
    ret = re.findall(r'.*? (.*?) ', resp.text, re.DOTALL)
    result = re.search(r'\d+', ret[0])
    for page_num in range(1, int(result.group()) + 1):  # pages are numbered from 1
        url = 'https://so.gushiwen.cn/shiwens/default.aspx?page=' + str(page_num)
        parse_page(url)

Passing re.DOTALL to findall makes the . metacharacter also match the newline character \n, so the pattern can span multiple lines of HTML.

The value of ret[0] is a string like "/ 5 pages". To pull out the number 5 from it, we run a second match: re.search(r'\d+', ret[0]).
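That second search can be tried in isolation, using the "/ 5 pages" string from above as input:

```python
import re

page_text = "/ 5 pages"              # what the first match returned
match = re.search(r'\d+', page_text)  # \d+ matches one or more digits
total_pages = int(match.group())

print(total_pages)  # 5
```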

2. Extract the title of the poem

Figure 2 above shows the HTML source code of a poem's title. As you can see, each title sits inside its own tag, so the matching regular expression has the same shape as before: a non-greedy .*? prefix to reach the tag, then (.*?) to capture the title text.

Non-greedy matching is used here again so that each match stops at the first closing tag, giving one clean title per match instead of swallowing everything up to the last title on the page.
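A self-contained sketch of the idea, run against an invented fragment (the real page's tag names and class attributes will differ):

```python
import re

# Invented markup standing in for the real listing page.
html = """
<div class="sons"><b>Poem One</b></div>
<div class="sons"><b>Poem Two</b></div>
"""

# Non-greedy (.*?) so each match stops at its own closing tag;
# re.DOTALL lets the pattern span the newlines between divs.
titles = re.findall(r'<div class="sons">.*?<b>(.*?)</b>', html, re.DOTALL)
print(titles)  # ['Poem One', 'Poem Two']
```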

3. Extract author and dynasty

Figure 3 above shows the HTML source code for the author and dynasty of a poem. As you can see, the author and dynasty sit in the two a tags under the same parent tag.

3.1 Extract the author

The regular expression for extracting the author follows the same shape: first match up to the parent tag, then capture the content of the first a tag with (.*?).

3.2 Extract the dynasty

The dynasty pattern differs from the author pattern by one extra non-greedy segment, because the dynasty sits in the second a tag, so the pattern must first skip past the whole first a tag.
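The first-tag/second-tag distinction can be seen on a toy fragment (invented markup; the real class names differ):

```python
import re

html = '<p class="source"><a>Zhang San</a><a>Tang Dynasty</a></p>'

# Author: capture the content of the FIRST <a>.
authors = re.findall(r'<p class="source">.*?<a>(.*?)</a>', html, re.DOTALL)

# Dynasty: skip one whole <a>...</a>, then capture the SECOND.
dynasties = re.findall(r'<p class="source">.*?<a>.*?</a><a>(.*?)</a>', html, re.DOTALL)

print(authors)    # ['Zhang San']
print(dynasties)  # ['Tang Dynasty']
```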

4. Extract the content of the poem

Figure 4 above shows the HTML source code of the poem content. As you can see, the verses all sit inside a single content tag, so we only need to match that tag and capture (.*?) inside it.

However, the text that comes out of this match still contains the line-break tags that separate the verses. We strip them with the sub method, calling re.sub with the tag's pattern, the empty string "" as the replacement, and content; every occurrence of the tag is removed.
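Assuming the line breaks are ordinary br tags (an assumption; the real markup may use a different tag), the cleanup looks like this:

```python
import re

content = "First verse,<br />second verse.<br />"

# Remove every <br>, <br/> or <br /> occurrence (assumed tag form).
cleaned = re.sub(r'<br\s*/?>', '', content)
print(cleaned)  # First verse,second verse.
```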

Organizing the code

So we have all the data we want. Next, we need to process the data. The final data format we expect is:

poems = [
    {
        "title": "fisherman's pride · suddenly heard two oars at the bottom of the flower",
        "author": "Zhang San",
        "dynasty": "Tang Dynasty",
        "content": "xxxxxx"
    },
    {
        "title": "Goose Goose Goose",
        "author": "Li Si",
        "dynasty": "Tang Dynasty",
        "content": "xxxxxx"
    }
]

Earlier, we got a list of all titles; a list of all authors; a list of all dynasties; and a list of all verses.

So how do we combine these lists into the form above?

Here, you need the zip function. It pairs up multiple lists element by element, yielding tuples. For example:

a = ['name', 'age']
b = ['Zhang San', 18]
c = zip(a, b)

Calling zip yields a zip object, which can be converted into a list object. The final result is shown in Figure 5 below.
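Putting the four lists together the same way produces exactly the dictionary format we want (toy one-element lists here, field values invented):

```python
titles = ['Goose Goose Goose']
authors = ['Li Si']
dynasties = ['Tang Dynasty']
contents = ['xxxxxx']

poems = []
# zip walks the four lists in lockstep, yielding one tuple per poem,
# which we unpack directly in the for statement.
for title, author, dynasty, content in zip(titles, authors, dynasties, contents):
    poems.append({
        'title': title,
        'author': author,
        'dynasty': dynasty,
        'content': content,
    })

print(poems)
```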

Full source code

# -*- coding: utf-8 -*-
"""
@url: https://blog.csdn.net/u014534808
@Author: Code Nong Fei Ge
@File: gushiwen_rep.py
@Time: 2021/12/7 07:40
@Desc: Crawl the ancient poetry website with regular expressions
Address of the ancient poetry website: www.gushiwen.cn/
"""
import re
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36'
}
first_url = 'https://so.gushiwen.cn/shiwens/default.aspx'


def get_total_pages():
    resp = requests.get(first_url)
    # Get the total number of pages
    ret = re.findall(r'.*? (.*?) ', resp.text, re.DOTALL)
    result = re.search(r'\d+', ret[0])
    for page_num in range(1, int(result.group()) + 1):  # pages are numbered from 1
        url = 'https://so.gushiwen.cn/shiwens/default.aspx?page=' + str(page_num)
        parse_page(url)


# Parse one listing page
def parse_page(url):
    resp = requests.get(url)
    text = resp.text
    # Extract titles. By default . cannot match \n; re.DOTALL lets . match everything.
    # Greedy version (matches too much):
    # titles = re.findall(r'.* (.*) ', text, re.DOTALL)
    # Non-greedy version:
    titles = re.findall(r'.*? (.*?) ', text, re.DOTALL)
    # Extract authors
    authors = re.findall(r'.*? (.*?) ', text, re.DOTALL)
    # Extract dynasties
    dynastys = re.findall(r'.*? (.*?) ', text, re.DOTALL)
    # Extract verses
    content_tags = re.findall(r'(.*?) ', text, re.DOTALL)
    contents = []
    for content in content_tags:
        content = re.sub(r'+', "", content)
        contents.append(content)
    poems = []
    for value in zip(titles, authors, dynastys, contents):
        # Unpack
        title, author, dynasty, content = value
        poems.append({
            'title': title,
            'author': author,
            'dynasty': dynasty,
            'content': content
        })
    print(poems)


if __name__ == '__main__':
    get_total_pages()

The final run result is:

That wraps up "how to use regular expressions in Python to crawl an ancient poetry website". Thank you for reading! If you want to learn more industry knowledge, you can keep following this site; Xiaobian will keep putting out high-quality practical articles for everyone!
