This article shows how to use Python to crawl Baidu hot search information. I hope you gain something from reading it; let's discuss it together!
Preface
What is a crawler? It is essentially using a program to simulate operations on web pages,
for example, simulating a person browsing a shopping website.
Be sure to check the target website's robots.txt before running a crawler :-)
You can append /robots.txt to the target site's address to see which parts of the site may be crawled.
For example, for Tmall you can visit https://brita.tmall.com/robots.txt to view it.
User-agent specifies which crawler the rules below apply to
The asterisk * stands for any crawler or search engine
Disallow lists the parts that are not allowed to be accessed
/ means starting from the root directory
Allow lists the parts that are allowed to be accessed
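As a quick way to check those rules programmatically, here is a minimal sketch using Python's built-in urllib.robotparser (my own addition; the article itself only inspects robots.txt by eye):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://brita.tmall.com/robots.txt")      # the example URL from above
rp.read()                                             # download and parse robots.txt
print(rp.can_fetch("*", "https://brita.tmall.com/"))  # True if crawler "*" may fetch this path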
In this project I crawled the top 30 items of the Baidu hot search list (I originally planned to crawl the top 50 heroes from the data center of the League of Legends homepage, but that fell through, so I settled for Baidu hot search), wrote their basic information into an Excel sheet, and displayed it on a Flask web page for data visualization. Readers interested in data visualization can crawl other content instead.
Due to my limited skill, the crawler in this article is fairly basic.
Library function preparation
How to install the Python library:
Open a cmd command prompt and type pip install XXX (where XXX is the name of the library you want to install).
For how these libraries are used in practice, just follow along with my steps below.
You only need to master a few commonly used functions.
Bs4
That is, BeautifulSoup.
Used to parse HTML pages and extract specified data.
The detailed usage will be shown in my demonstration later.
Re
Regular expressions are used to match target substrings inside a string.
The Runoob ("rookie") tutorial covers regular expressions in great detail if you want to dig deeper.
Urllib
An HTTP request library that ships with Python and can perform a series of operations on URLs.
Xlwt/xlrd
Used for writing (xlwt) and reading (xlrd) data in Excel tables.
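As a small illustration of the write side, here is a minimal xlwt sketch (the file name and cell contents are made up for illustration):

import xlwt

workbook = xlwt.Workbook(encoding="utf-8")        # create a workbook
sheet = workbook.add_sheet("baidu_hot_search")    # add one worksheet
sheet.write(0, 0, "title")                        # row 0, column 0: a header cell
sheet.write(1, 0, "example title")                # row 1, column 0: one data cell
workbook.save("hot_search.xls")                   # xlwt writes the legacy .xls format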
Flask
This library is used to build a simple web framework, that is, a website, for data visualization.
In fact my grasp of data visualization is also quite shallow; I simply feed the crawled data into the web page.
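For orientation, here is a minimal Flask sketch (the route, template name, and sample data are placeholders of my own, not the article's actual app):

from flask import Flask, render_template

app = Flask(__name__)

@app.route("/")
def index():
    # "temp.html" and the sample data are placeholders for illustration
    return render_template("temp.html", datalist=[["example title", "example content"]])

if __name__ == "__main__":
    app.run(debug=True)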
Jinja2
The purpose of this library is to insert variables into placeholders in HTML pages.
Back end: name = "HQ"; front-end template:
{{ name }} is so handsome!
Rendered result: HQ is so handsome!
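The same example written directly with jinja2.Template (a standalone sketch, independent of the Flask app):

from jinja2 import Template

template = Template("{{ name }} is so handsome!")  # front-end template with a placeholder
print(template.render(name="HQ"))                  # the back end fills it in: "HQ is so handsome!"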
Markupsafe
Used together with Jinja2 to escape untrusted input and guard against injection attacks when rendering pages (although probably no one is going to attack you).
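A minimal markupsafe sketch of what that escaping looks like (the input string is made up):

from markupsafe import escape

# untrusted input is escaped so it cannot inject HTML or JavaScript into the rendered page
print(escape("<script>alert('hi')</script>"))   # the <, > and quote characters become HTML entities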
Data crawling
The data crawling and data visualization code live in two separate .py files.
Data crawling requires importing four libraries: re, bs4, urllib, and xlwt.
Web page crawling
Calling functions in the following way makes the call relationships between them clearer.
import re
import urllib.request
import xlwt
from bs4 import BeautifulSoup


def askurl(url):
    # The user agent tells the server that I am just an ordinary browser
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36 Edg/97.0.1072.55"
    }
    request = urllib.request.Request(url, headers=head)  # build the request
    response = urllib.request.urlopen(request)           # send it and get the response object
    # read() returns bytes; decode('utf-8') converts them to a str
    html = response.read().decode("utf-8")
    # Save the crawled page to a file so it can be inspected later.
    # The r prefix stops the backslashes in the path from being treated as escape characters.
    path = r"C:\Users\XXX\Desktop\Python\text.txt"
    f = open(path, "w", encoding="utf-8")
    f.write(html)
    f.close()     # the page source can now be viewed in the txt file
    return html


if __name__ == "__main__":    # call main() when the program is executed
    main()                    # main() ties the steps together; it is built up over the rest of the article
The values for headers can be found by pressing F12 on the web page.
Then click the Network tab, pick any request, and scroll to the bottom of its request headers to find the User-Agent information.
It is worth noting that if headers is not set in the request, the server will return status code 418,
which means the server has recognized that you are a crawler and is telling you: "I'm a teapot",
i.e. the server refuses to brew coffee because it is, after all, a teapot (this is a joke, based on the humorous HTTP status code 418).
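Here is a sketch of how that surfaces in code, assuming the request is sent without a User-Agent header (the URL is my assumption of the hot search page, not given in this section):

import urllib.request
import urllib.error

url = "https://top.baidu.com/board?tab=realtime"   # assumed Baidu hot search URL; substitute the one you crawl

try:
    urllib.request.urlopen(urllib.request.Request(url))   # no User-Agent header is set here
except urllib.error.HTTPError as e:
    print(e.code)   # some servers answer a bare crawler request with 418 ("I'm a teapot")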
Data parsing
Change the suffix of the crawled txt file to .html and open it as a local web page.
If VS Code reports an error because the lines are too long, a quick search will turn up how to fix it.
The opened web page is shown in the figure.
Use this local copy to find where the information you need to crawl lives.
In this project we capture the title, content, popularity, and link of each target item.
We can see that all the information we need sits inside div tags whose class is category-wrap_iQLoo horizontal_1eKyQ.
So we use BeautifulSoup to parse the web page.
def getData(html):
    datalist = []
    soup = BeautifulSoup(html, "html.parser")   # create a parsing object
    # find_all('div', class_="...") returns a list of every div whose class is
    # category-wrap_iQLoo horizontal_1eKyQ (class_ avoids clashing with the Python keyword)
    for item in soup.find_all('div', class_="category-wrap_iQLoo horizontal_1eKyQ"):
        item = str(item)   # convert each matched tag to a string so it can be matched with re
        # the re matching for each item is described below
    return datalist
Next, re matching is performed on each item.
First create a matching rule with re.compile(), then apply it with findall().
The matching rule is worked out by looking at the characters that appear immediately before and after the target information in the HTML file,
with (.*?) standing in for the string to be captured; the *? makes the match non-greedy.
For example,
the fragment immediately before the title is ellipsis">.
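Here is a sketch of that pattern style; the HTML fragment and the rule below are hypothetical, since the article's exact rules are cut off in this copy:

import re

# hypothetical fragment and pattern, only to illustrate (.*?) non-greedy matching
item = '<div class="c-single-text-ellipsis"> Example hot search title </div>'
findTitle = re.compile(r'ellipsis"> (.*?) </div>')   # (.*?) lazily captures the title text
print(re.findall(findTitle, item))                   # -> ['Example hot search title']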