This article walks through an example of crawling rental-listing data with Python. The content is fairly detailed; interested readers can use it as a reference and will hopefully find it helpful.
1. What is a crawler
A crawler, also known as a "web crawler", is a program that automatically visits websites and downloads their content. Crawlers are also the foundation of search engines: Baidu and Google, for example, rely on powerful web crawlers to retrieve vast amounts of Internet content, store it in the cloud, and provide quality search services to users.
2. What are crawlers used for
You might ask: apart from search-engine companies, what is the point of learning to write crawlers? Good question. Here is an example: company A runs a user forum, and many users leave messages there talking about their experience with the product. A now needs to understand user needs and analyze user preferences to prepare for the next round of product iteration. How does it get that data? It needs a crawler to collect it from the forum. So besides Baidu and Google, many companies offer high salaries to recruit crawler engineers. Search for "crawler engineer" on any recruitment website and look at the number of openings and the salary range to see how much demand there is.
3. How crawlers work
Initiate a request: send a Request to the target site over the HTTP protocol, then wait for the target site's server to respond.
Get the response content: if the server responds normally, you get a Response whose body is the page content to be fetched; the response may contain HTML, a JSON string, binary data (such as images or video), and so on.
Parse the content: the content obtained may be HTML, which can be parsed with regular expressions or a web-page parsing library; it may be JSON, which can be converted directly into a JSON object and parsed; or it may be binary data, which can be saved or processed further.
Save the data: once parsing is complete, the data is saved, either as a text document or in a database.
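As a minimal sketch of these four steps (the URL and the tag being extracted below are illustrative placeholders, not part of the article's example):

# A minimal sketch of the four steps: request, receive, parse, save.
# The URL and the <title> tag extracted here are illustrative placeholders.
import requests
from bs4 import BeautifulSoup

url = 'https://example.com/'                      # 1. initiate a request
response = requests.get(url)                      # 2. get the response content
soup = BeautifulSoup(response.text, 'lxml')       # 3. parse the HTML
title = soup.title.text if soup.title else ''
with open('page_title.txt', 'w', encoding='utf-8') as f:
    f.write(title)                                # 4. save the data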
4. Python crawler example
The sections above covered the definition, uses and working principle of crawlers, and I believe many readers are now interested and ready to give it a try. Time for some "hands-on" material: below is the code of a simple Python crawler.
1. Preparation: install the Python environment, install the PyCharm IDE, install the MySQL database, create a new database exam, and create a table house in exam to store the crawler results [SQL statement: create table house (price varchar(88), unit varchar(88), area varchar(88));]
2. The goal of the crawler: crawl the price, unit and area of every listing linked from the home page of a rental site, then store the results in the database.
3. Crawler source code:
import requests                 # request URL page content
from bs4 import BeautifulSoup   # parse page elements
import pymysql                  # connect to the database
import time                     # time functions
import lxml                     # parsing library (supports HTML/XML parsing and XPATH)

# get_page: fetch the content of the url with requests' get method, then wrap it
# in a format that BeautifulSoup can handle
def get_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    return soup

# get_links: get all the rental links on the listing page
def get_links(link_url):
    soup = get_page(link_url)
    links_div = soup.find_all('div', class_="pic-panel")
    links = [div.a.get('href') for div in links_div]
    return links

# get_house_info: get the information of a single listing page: price, unit, area, etc.
def get_house_info(house_url):
    soup = get_page(house_url)
    price = soup.find('span', class_='total').text
    unit = soup.find('span', class_='unit').text.strip()
    area = 'test'   # a custom placeholder used to test the area field
    info = {
        'price': price,
        'unit': unit,
        'area': area
    }
    return info

# database configuration collected in a dictionary
DataBase = {
    'host': '127.0.0.1',
    'database': 'exam',
    'user': 'root',
    'password': 'root',
    'charset': 'utf8mb4'
}

# connect to the database
def get_db(setting):
    return pymysql.connect(**setting)

# insert the data obtained by the crawler into the database
def insert(db, house):
    values = "'{}'," * 2 + "'{}'"
    sql_values = values.format(house['price'], house['unit'], house['area'])
    sql = """insert into house (price, unit, area) values ({})""".format(sql_values)
    cursor = db.cursor()
    cursor.execute(sql)
    db.commit()

# main flow: 1. connect to the database  2. get the list of listing URLs
# 3. loop over each URL and fetch its details (price, etc.)  4. insert each record
db = get_db(DataBase)
links = get_links('https://bj.lianjia.com/zufang/')
for link in links:
    time.sleep(2)
    house = get_house_info(link)
    insert(db, house)
First of all, "if you want to do a good job, you must first sharpen its tools", using Python to write crawlers is the same reason, writing crawlers need to import a variety of library files, it is these and useful library files to help us complete most of the work of the crawler, we only need to call the relevant excuse functions. The format of the import is the import library file name. It should be noted here that the library files are installed in PYCHARM. You can install the library files by pressing the ctrl+alt key while placing the cursor over the library file name, or by the command line (Pip install library file name). If the installation fails or is not installed, then the subsequent crawler will definitely report an error. In this code, the first five lines of the program are imported related library files: requests used to request URL page content; BeautifulSoup used to parse page elements; pymysql used to connect to the database; time contains various time functions; lxml is a parsing library used to parse HTML, XML format files, while it also supports XPATH parsing.
Next, let's walk through the entire crawler flow starting from the main program at the end of the code:
1) Connect to the database through the get_db function. Looking further into get_db, you can see that the connection is made by calling pymysql's connect function. Here **setting is Python's way of collecting keyword arguments: we write the database connection information into a dictionary, DataBase, and pass the entries of that dictionary to connect as keyword arguments.
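As a small illustration of this ** unpacking (a minimal sketch using the article's placeholder credentials; it assumes a local MySQL server is running), the two connect calls below are equivalent:

# A minimal sketch of keyword-argument unpacking with pymysql.connect.
# The credentials are the article's placeholders; a local MySQL server is assumed.
import pymysql

settings = {'host': '127.0.0.1', 'database': 'exam', 'user': 'root',
            'password': 'root', 'charset': 'utf8mb4'}

db1 = pymysql.connect(**settings)                     # ** expands the dictionary
db2 = pymysql.connect(host='127.0.0.1', database='exam', user='root',
                      password='root', charset='utf8mb4')   # equivalent explicit call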
2) Through the get_links function, get the links to all the listings on the Lianjia home page; they are stored in the list links. The get_links function first fetches the content of the Lianjia home page with a requests request, then organizes it into a format BeautifulSoup can process. Finally, the find_all function finds all the div elements with the picture style ("pic-panel"), and a list comprehension collects the href attribute of the hyperlink tag (a) contained in each of those divs, so that all the hyperlinks end up in the list links.
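A toy illustration of this pattern (the HTML below is made up for demonstration; the real Lianjia page structure may differ or have changed since the article was written):

from bs4 import BeautifulSoup

# Made-up HTML imitating a listing page: each listing sits in a
# <div class="pic-panel"> whose <a> tag links to the detail page.
html = '''
<div class="pic-panel"><a href="https://example.com/zufang/1.html">room 1</a></div>
<div class="pic-panel"><a href="https://example.com/zufang/2.html">room 2</a></div>
'''
soup = BeautifulSoup(html, 'lxml')
links_div = soup.find_all('div', class_='pic-panel')
links = [div.a.get('href') for div in links_div]
print(links)  # ['https://example.com/zufang/1.html', 'https://example.com/zufang/2.html']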
3) Iterate over all the links in links with a for loop (for example, one of the links is: https://bj.lianjia.com/zufang/101101570737.html).
4) Using the same method as in 2), locate the elements with the find function to get the price, unit and area information from the link in 3), and write this information into the dictionary info.
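Again as a made-up illustration (the class names match the crawler code above, but the actual detail page may differ), find and .text.strip() work like this:

from bs4 import BeautifulSoup

# Made-up detail-page fragment using the same class names as the crawler code.
html = '<span class="total">5200</span><span class="unit"> 元/月 </span>'
soup = BeautifulSoup(html, 'lxml')
price = soup.find('span', class_='total').text          # '5200'
unit = soup.find('span', class_='unit').text.strip()    # '元/月'
print(price, unit)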
5) Call the insert function to write the info obtained from one link into the house table of the database. Going into the insert function, we can see that it obtains a cursor through the connection's cursor() method, executes a SQL statement with it, and then has the connection commit so that the insert takes effect. The SQL statement here is written in a somewhat special way: the format function is used to fill in the values so that the function can be reused.
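As a side note on this design choice, pymysql's cursor.execute also accepts the values as a separate parameter tuple with %s placeholders, which avoids hand-building the quoted string; a minimal sketch of an equivalent insert (same table and columns as above, not the article's original code):

# A sketch of the same insert using pymysql's parameter placeholders
# instead of str.format; house is the dictionary returned by get_house_info.
def insert_parameterized(db, house):
    sql = "insert into house (price, unit, area) values (%s, %s, %s)"
    cursor = db.cursor()
    cursor.execute(sql, (house['price'], house['unit'], house['area']))
    db.commit()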
Finally, run the crawler code, and you can see that all the listing information from the Lianjia home page has been written into the database. (Note: test is the test string I filled in manually for the area field.)
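To double-check the result, you could read the rows back out, for example (a minimal sketch, assuming the same placeholder database settings used earlier in the article):

import pymysql

# A minimal sketch for checking the crawled rows, assuming the same
# placeholder database settings used earlier in the article.
db = pymysql.connect(host='127.0.0.1', database='exam', user='root',
                     password='root', charset='utf8mb4')
cursor = db.cursor()
cursor.execute("select price, unit, area from house")
for row in cursor.fetchall():
    print(row)
db.close()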
Postscript: a Python crawler is really not that difficult. Once you are familiar with the overall crawling process, there are only a few details to pay attention to, such as how to get page elements and how to build SQL statements. Don't panic when you run into problems: you can squash the bugs one by one with the help of the IDE's hints and finally get the result we expect.
That is all for this example of crawling rental data with Python. I hope the content above is helpful and lets you learn something new. If you think the article is good, please share it so more people can see it.