
How to use Python to crawl Baidu hot search information


In this article, the editor shares how to use Python to crawl Baidu hot search information. I hope you will gain something after reading it; let's discuss it together!

Preface

What is a crawler? It is essentially a program that uses a computer to simulate a person operating web pages.

For example, simulating a human browsing a shopping website.

Be sure to check the target website's crawling rules before using a crawler :-)

You can append /robots.txt to the target site's address to view which pages may be crawled.

For example, for Tmall you can visit https://brita.tmall.com/robots.txt to view it.

User-agent specifies which crawler the rules apply to

The asterisk * matches any crawler or search engine

Disallow lists the paths that are not allowed to be accessed

/ means everything starting from the root directory

Allow lists the paths that are allowed to be accessed
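
As a quick sketch, Python's built-in urllib.robotparser module can check these rules for you (the Tmall URL is just the example from above; any site works the same way):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://brita.tmall.com/robots.txt")  # the example site from above
rp.read()  # download and parse the rules
# can_fetch(user_agent, url) returns True if that agent may crawl the URL
print(rp.can_fetch("*", "https://brita.tmall.com/"))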

In this example, I crawled the top 30 items on Baidu's hot search list (I originally planned to crawl the top 50 heroes from the data center on the League of Legends homepage, but had no choice but to fall back to Baidu hot search), put their basic information into an Excel table, and displayed it on a Flask web page for data visualization. Students interested in data visualization can also crawl other content.

Due to my limited skill, the crawler in this article is relatively basic.

Library preparation

How to install a Python library:

Open a cmd command prompt and type pip install XXX (where XXX is the name of the library you want to install)

For the specific use of these libraries, follow my walkthrough below.

You only need a simple grasp of a few commonly used functions.

Bs4

That is BeautifulSoup

Used to parse HTML pages and extract specified data.

The detailed usage will be shown in my demonstration later.
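
As a minimal illustration of the idea (the HTML snippet is invented for this sketch):

from bs4 import BeautifulSoup

html = '<div class="hot"><a href="https://example.com">Some headline</a></div>'
soup = BeautifulSoup(html, "html.parser")  # build a parse tree from the HTML
tag = soup.find("a")                       # grab the first <a> tag
print(tag.text, tag["href"])               # -> Some headline https://example.com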

Re

Regular expressions are used to match a specified pattern within a string.

The Rookie Tutorial covers regular expressions in great detail if you want to go deeper.
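
A one-line taste of what we will use it for later (the string here is invented):

import re

# findall returns every substring that matches the rule; (\d+) captures runs of digits
print(re.findall(r"(\d+)", "rank 1, heat 4996543"))  # -> ['1', '4996543']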

Urllib

An HTTP request library that comes with Python, which can perform a series of operations on URLs.

Xlwt/xlrd

For writing (xlwt) / reading (xlrd) data in Excel tables.
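
A minimal sketch of the writing side (the file name and cell contents are made up):

import xlwt

book = xlwt.Workbook(encoding="utf-8")      # create a workbook
sheet = book.add_sheet("baidu hot search")  # add a worksheet
sheet.write(0, 0, "title")                  # write(row, column, value)
sheet.write(0, 1, "heat")
book.save("hot_search.xls")                 # xlwt writes the old .xls format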

Flask

This library provides a simple web framework, that is, a website, for data visualization.

In fact, my grasp of data visualization is also quite shallow; I simply import the data into the web page.
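
A minimal sketch of how Flask serves a page (the route and template name are assumptions, not this project's actual files):

from flask import Flask, render_template

app = Flask(__name__)

@app.route("/")  # visiting the site root triggers this function
def index():
    # render_template looks for templates/index.html and passes datalist into it
    return render_template("index.html", datalist=[])

if __name__ == "__main__":
    app.run()  # start the development server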

Jinja2

The purpose of this library is to insert variables from the backend into placeholders in an HTML page.

Backend: name = "HQ"

Front end: {{ name }} is so handsome!

Display: HQ is so handsome!
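
That tiny example can be reproduced as a runnable sketch with Jinja2 directly:

from jinja2 import Template

# Jinja2 substitutes the backend value into the {{ name }} placeholder
template = Template("{{ name }} is so handsome!")
print(template.render(name="HQ"))  # -> HQ is so handsome!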

Markupsafe

Used together with Jinja to escape untrusted input and avoid injection attacks when rendering pages (although probably no one will attack you).

Data crawling

The data crawling and data visualization code are kept in two separate py files.

Data crawling requires importing four libraries: re, bs4, urllib, and xlwt.
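
For reference, the import block at the top of the crawling file looks like this:

import re                      # regular expressions
from bs4 import BeautifulSoup  # HTML parsing
import urllib.request          # sending HTTP requests
import xlwt                    # writing Excel files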

Web page crawling

Calling functions in the following way makes the function call relationships clearer.

if __name__ == "__main__":  # when the program executes, call the main function
    main()  # (in the actual file this guard sits at the bottom, after the definitions)

def askurl(url):
    head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36 Edg/97.0.1072.55"}
    # the user agent tells the server that I am just an ordinary browser
    request = urllib.request.Request(url, headers=head)  # build the request with headers
    response = urllib.request.urlopen(request)  # send it and get a response object
    # read() returns a bytes string; decode() converts it to a str
    html = response.read().decode("utf-8")
    # save the crawled page to a document for easy viewing
    path = r"C:\Users\XXX\Desktop\Python\text.txt"  # the r prefix stops backslashes being treated as escapes
    f = open(path, "w", encoding="utf-8")
    f.write(html)
    f.close()  # now the page source can be viewed in the txt file
    return html

The value of headers can be found by pressing F12 on the web page.

Then click the Network tab, pick any request, and scroll to the bottom of its request headers to find the User-Agent information.

It is worth noting that if headers is not set in the request, the server will return status code 418.

That means the server recognizes that you are a crawler and says: "I'm a teapot".

The code indicates that the server refuses to brew coffee because it is permanently a teapot (this is an in-joke from the April Fools' Hyper Text Coffee Pot Control Protocol).

Data parsing

Change the suffix of the crawled txt file to html and open it as a local web page.

If vscode reports an error due to overly long lines, please refer to the following blog.

The opened web page is shown in the figure

Use this local copy of the page to see where the information that needs to be crawled is located.

In this project, we capture the title, content, popularity and links of the target information.

We can find that all the information we need is contained in div tags whose class is category-wrap_iQLoo horizontal_1eKyQ.

So we use BeautifulSoup to parse the web page.

def getData(html):
    datalist = []
    soup = BeautifulSoup(html, "html.parser")  # define a parsing object
    # soup.find_all('div', class_=...) matches on tag type plus class and returns
    # a list of every div whose class is category-wrap_iQLoo horizontal_1eKyQ
    for item in soup.find_all('div', class_="category-wrap_iQLoo horizontal_1eKyQ"):
        item = str(item)  # convert each sub-tag in the list to a string for re matching

Next, re matching is performed for each item.

First create a matching rule using re.compile(), then apply it with findall().

The matching rule is created by viewing the special characters before and after the target information in the HTML file

(.*?) marks the string to be matched, and the *? makes the match non-greedy.

For example

The information immediately before the title is ellipsis">, and immediately after it comes the tag that closes it.
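
As an illustrative sketch only (the exact characters around the title depend on Baidu's current page source, so treat this pattern as an assumption):

# assumed rule: the title sits between ellipsis"> and the tag that closes it
findTitle = re.compile(r'ellipsis"> (.*?) </div>', re.S)  # (.*?) matches non-greedily
titles = re.findall(findTitle, item)  # apply the rule to one item string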
