

How to use Python to get the specified content of a web page

2025-01-17 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article introduces how to use Python to get specified content from a web page. Many readers have questions about this, so the editor has consulted various materials and put together a simple, easy-to-follow method. I hope it helps resolve your doubts. Please follow along and study!

Preface

Python is well suited to data processing. If you want to write a crawler, Python is a good choice: it has many ready-made packages, and simply calling them lets you complete complex functions with very little code.

Before we begin, we need to install some dependency packages. First, open a command line.

Make sure you have Python and pip on your computer; if not, you need to install them first.

We can then use pip to install the prerequisite module, requests:

pip install requests

requests is an easy-to-use HTTP library implemented in Python that is much simpler to use than urllib. requests lets you send HTTP/1.1 requests: specify a URL, optionally add a query string, and start crawling web page information.
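As a quick illustration of how requests assembles a URL with a query string, here is a minimal sketch using requests' Request/prepare API. No request is actually sent; the base URL is the supplier-list page used throughout this article:

```python
import requests

# Build (but do not send) a GET request, to see the final URL
# that requests assembles from the base URL plus a params dict.
prepared = requests.Request(
    "GET",
    "https://www.crrcgo.cc/admin/crr_supplier.html",
    params={"page": 1},
).prepare()

print(prepared.url)  # https://www.crrcgo.cc/admin/crr_supplier.html?page=1
```

Passing `params` this way is equivalent to appending `?page=1` by hand, as the article does below.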

1. Grab the source code of the web page

Take this platform as an example and grab the company name data on the page. The web link is: https://www.crrcgo.cc/admin/crr_supplier.html?page=1

You can view the source code of the target web page in your browser.

First of all, clarify the steps.

1. Open the target site

2. Grab the target site's code and output it

import requests

This imports the requests module we need.

page = requests.get('https://www.crrcgo.cc/admin/crr_supplier.html?page=1')

This command uses GET to fetch the page data. What we get is the same data a browser receives when it opens this URL.

print(page.text)

This line outputs (print) the text content of the data we fetched.

```python
import requests

page = requests.get('https://www.crrcgo.cc/admin/crr_supplier.html?page=1')
print(page.text)
```

We have successfully crawled the source code of the target web page.

2. Grab some tag content from the source code of a web page

But the code captured above is full of angle brackets and is not directly useful to us. The angle-bracketed data is the web page file we received from the server; just as Office uses the doc and pptx file formats, web page files are usually in html format. Our browser renders this html code as the page we see.

If we need to extract valuable data from these characters, we must first understand the tag elements.

The text content of each tag is sandwiched between an opening and a closing tag, the closing tag beginning with /. The tag name (such as img or div) indicates the element type (image or text block), and other attributes (such as src) can appear inside the angle brackets.

The tag's text content is the data we need, but we have to use the id or class attribute to pick out the tag elements we want from the many tags on the page.
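To see how the id and class attributes pick out elements, here is a small sketch run on an inline, made-up HTML snippet (the markup is hypothetical, for illustration only):

```python
from bs4 import BeautifulSoup

# A tiny made-up HTML snippet standing in for a real page.
html = """
<div class="detail_head">Company A</div>
<div class="other">Not what we want</div>
<p id="intro">About this page</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Locate one element by its class attribute, another by its id.
print(soup.find("div", class_="detail_head").text)  # Company A
print(soup.find(id="intro").text)                   # About this page
```

Note that `class_` has a trailing underscore because `class` is a reserved word in Python.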

We can open any web page in a browser and press the F12 key to open the element viewer (Elements); there we can see the hundreds of tag elements that make up the page.

Tag elements can be nested layer by layer. For example, a body element may nest a div element: body is the parent (outer) layer, and div is the child (inner) layer.
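This nesting can also be inspected from code; a minimal sketch with a made-up snippet:

```python
from bs4 import BeautifulSoup

# body nests div, which nests span: body is the outermost parent.
doc = "<body><div><span>inner text</span></div></body>"
soup = BeautifulSoup(doc, "html.parser")

span = soup.find("span")
print(span.parent.name)         # div  (the immediate parent layer)
print(span.parent.parent.name)  # body (the layer above that)
```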


Let's go back to crawling. Now we only want to grab the company names on the page, nothing else.

Looking at the html code of the web page, we find that the company name is inside a tag whose class is detail_head.

```python
import requests

req = requests.get('https://www.crrcgo.cc/admin/crr_supplier.html?page=1')
```

As explained above, these two lines are to get the page data.

from bs4 import BeautifulSoup

We need the BeautifulSoup module to turn the angle-bracket-filled html data into a more usable structure. `from bs4 import BeautifulSoup` means importing BeautifulSoup from the package bs4; bs4 contains multiple modules, and BeautifulSoup is just one of them.

req.encoding = "utf-8"

This specifies that the acquired web page content should be decoded as utf-8.
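Setting req.encoding controls how the raw response bytes are decoded into text. The effect is the same as decoding bytes yourself; a minimal standard-library sketch (the sample string is made up):

```python
# Raw bytes as they might arrive over HTTP, utf-8 encoded.
raw = "中车公司".encode("utf-8")

# Decoding with the right encoding recovers the original text;
# a wrong encoding would produce garbled characters instead.
print(raw.decode("utf-8"))  # 中车公司
```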

soup = BeautifulSoup(req.text, 'html.parser')

This code uses the html parser to analyze the html text we obtained with requests; soup is the result of the parsing.

company_item = soup.find('div', class_="detail_head")

find returns the first match, while find_all returns all matches. Here we find the first element whose tag name is div and whose class attribute is detail_head.
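The difference between find and find_all can be seen on a made-up two-element snippet: find returns a single tag, while find_all returns a list of tags:

```python
from bs4 import BeautifulSoup

html = '<div class="detail_head">A</div><div class="detail_head">B</div>'
soup = BeautifulSoup(html, "html.parser")

# find: the first matching tag only.
print(soup.find("div", class_="detail_head").text)  # A

# find_all: every matching tag, as a list.
items = soup.find_all("div", class_="detail_head")
print([d.text for d in items])  # ['A', 'B']
```

This is why a loop is needed later when we switch to find_all: you cannot call .text on the list itself, only on each tag inside it.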

dd = company_item.text.strip()

The strip() method removes the specified character or character sequence (spaces and newlines by default) from the beginning and end of a string. Here it removes the extra whitespace around the extracted text; the .text property has already discarded the html tags themselves.
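A quick demonstration of strip() with the default and with explicit characters (the sample strings are made up):

```python
# Default: removes leading/trailing whitespace, including newlines.
print("\n   Company A   \n".strip())  # Company A

# With an argument: removes the given characters from both ends.
print("--Company B--".strip("-"))     # Company B
```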

The final assembled code is as follows:

```python
import requests
from bs4 import BeautifulSoup

req = requests.get(url="https://www.crrcgo.cc/admin/crr_supplier.html?page=1")
req.encoding = "utf-8"
soup = BeautifulSoup(req.text, features="html.parser")
company_item = soup.find("div", class_="detail_head")
dd = company_item.text.strip()
print(dd)
```

Executing this successfully grabs the company information we wanted from the page, but only one company is captured; the rest are not.

So we add a loop to grab all the company names on the page; the code barely changes.

```python
for company_item in company_items:
    dd = company_item.text.strip()
    print(dd)
```

The final code is as follows:

```python
import requests
from bs4 import BeautifulSoup

req = requests.get(url="https://www.crrcgo.cc/admin/crr_supplier.html?page=1")
req.encoding = "utf-8"
soup = BeautifulSoup(req.text, features="html.parser")
company_items = soup.find_all("div", class_="detail_head")
for company_item in company_items:
    dd = company_item.text.strip()
    print(dd)
```

Running this prints all the company names on the page.

3. Grab the contents of sub-tags across multiple web pages

What if we want to crawl the company names from multiple pages? Quite simple: the general code is already written, and we just need to add one more loop.

Looking at the pages we need to crawl, we find that when the page changes, only the number after page= changes. Of course, the URLs of many large sites, such as JD.com and Taobao, are often puzzling and hard to guess.

```python
inurl = "https://www.crrcgo.cc/admin/crr_supplier.html?page="
for num in range(1, 6):
    print("=== crawling page " + str(num) + " data ===")
```

We write a loop that grabs only pages 1 to 5, implemented with the range function. range is closed on the left and open on the right, so to crawl up to page 5 we must specify 6.
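The left-closed, right-open behavior of range can be verified directly:

```python
# range(1, 6) yields 1 through 5: the stop value 6 is excluded.
pages = list(range(1, 6))
print(pages)  # [1, 2, 3, 4, 5]
```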

```python
    outurl = inurl + str(num)
    req = requests.get(url=outurl)
```

Assemble the loop value and url into a complete url, and get the page data
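The URL assembly for all five pages can be checked without sending any requests:

```python
inurl = "https://www.crrcgo.cc/admin/crr_supplier.html?page="

# Build the five page URLs the loop will visit.
urls = [inurl + str(num) for num in range(1, 6)]
print(urls[0])   # ends with page=1
print(urls[-1])  # ends with page=5
```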

The complete code is as follows:

```python
import requests
from bs4 import BeautifulSoup

inurl = "https://www.crrcgo.cc/admin/crr_supplier.html?page="
for num in range(1, 6):
    print("=== crawling page " + str(num) + " data ===")
    outurl = inurl + str(num)
    req = requests.get(url=outurl)
    req.encoding = "utf-8"
    soup = BeautifulSoup(req.text, features="html.parser")
    company_items = soup.find_all("div", class_="detail_head")
    for company_item in company_items:
        dd = company_item.text.strip()
        print(dd)
```

This successfully crawls all the company names (sub-tag contents) from pages 1 to 5.

At this point, the study of "how to use Python to get the specified content of a web page" is over. I hope it has resolved your doubts. Combining theory with practice is the best way to learn, so go and try it! If you want to keep learning more related knowledge, please continue to follow the site; the editor will keep working hard to bring you more practical articles!
