This article introduces how to use a Python crawler with a forward proxy to collect rental listing data from Lianjia. It is intended as a practical reference; the sections below walk through the process step by step.
First, select the target website:
Lianjia: https://bj.lianjia.com/
Click "Rent" to go to the home page:
This is the home page to climb.
Second, crawl a single page
1. Analyze the page
Right-click the link of a listing and click "Inspect", as shown in the figure:
This opens the browser's developer tools.
You can see the link inside the a tag:
The link can be extracted with XPath, but the extracted href is only the path portion of the real URL, so it must be concatenated with the site domain to form the full URL.
This is reflected in the code later; a short sketch follows below.
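A minimal sketch of this step, assuming the current Lianjia list-page layout (the XPath class name below is taken from the full source code in the fourth section):

import requests
from lxml import etree

# Fetch the first rental list page (headers omitted here for brevity)
html = requests.get("https://bj.lianjia.com/zufang/pg1/#contentList").text
tree = etree.HTML(html)

# The href in the <a> tag is only the path half of the real URL
hrefs = tree.xpath("//a[@class='content__list--item--aside']/@href")
for href in hrefs:
    full_url = "https://bj.lianjia.com" + href  # concatenate to get the real URL
    print(full_url)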
For now, only the information marked in the following figure is crawled:
title, release time, price, room layout, and floor area.
Third, crawl all the pages
1. Analyze the page url
Click "Rent" and note the page it jumps to: https://bj.lianjia.com/zufang/
This is the list page to crawl.
Scroll to the bottom and click "next page" or any other page number.
Page 1: https://bj.lianjia.com/zufang/pg1/#contentList
Page 2: https://bj.lianjia.com/zufang/pg2/#contentList
Page 3: https://bj.lianjia.com/zufang/pg3/#contentList
...
Page 100: https://bj.lianjia.com/zufang/pg100/#contentList
Observing these URLs reveals the pattern: only the number after "pg" changes, and it equals the page number.
All pages can therefore be crawled by building each URL with string concatenation inside a loop, as sketched below.
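A minimal sketch of that loop (the 1-100 page range matches the full source code in the next section):

# Build the list-page URL for each of pages 1-100
for number in range(1, 101):
    page_url = "https://bj.lianjia.com/zufang/pg" + str(number) + "/#contentList"
    print(page_url)  # in the real crawler, each URL is fetched and parsed here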
Fourth, the source code
Development tool: PyCharm
Python version: 3.7.2
import requests
from lxml import etree

# Common helper: takes a url and returns the page text
def get_one_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
    }
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    if response.status_code == 200:
        return response.text
    else:
        print("Request status code != 200, the URL is wrong.")
        return None

for number in range(1, 101):  # loop over 1-100 to crawl pages 1 to 100
    initialize_url = "https://bj.lianjia.com/zufang/pg" + str(number) + "/#contentList"  # build the list-page url for pages 1-100
    html_result = get_one_page(initialize_url)  # get the page text
    html_xpath = etree.HTML(html_result)  # convert to an xpath-able tree
    # grab the listing urls on the list page (there are 30 listings per page)
    page_url = html_xpath.xpath("//div[@class='content w1150']/div[@class='content__article']//div[@class='content__list']/div/a[@class='content__list--item--aside']/@href")
    for i in page_url:  # loop over each listing url
        true_url = "https://bj.lianjia.com" + i  # build the detail-page url
        true_html = get_one_page(true_url)  # get the detail-page text
        true_xpath = etree.HTML(true_html)  # convert to xpath format
        # grab the page title, i.e. the title of the listing detail page
        title = true_xpath.xpath("//div[@class='content clear w1150']/p[@class='content__title']//text()")
        # grab the release date and split the string
        release_date = true_xpath.xpath("//div[@class='content clear w1150']//div[@class='content__subtitle']//text()")
        release_date_result = str(release_date[2]).strip().split(":")[1]
        # grab the price
        price = true_xpath.xpath("//div[@class='content clear w1150']//p[@class='content__aside--title']/span//text()")
        # grab the room layout
        house_type = true_xpath.xpath("//div[@class='content clear w1150']//ul[@class='content__aside__list']//span[2]//text()")
        # grab the floor area
        acreage = true_xpath.xpath("//div[@class='content clear w1150']//ul[@class='content__aside__list']//span[3]//text()")
        print(str(title) + "--" + str(release_date_result) + "--" + str(price) + "--" + str(house_type) + "--" + str(acreage))
        # write the information to a text file
        with open(r"E:\admin.txt", 'a') as f:
            f.write(str(title) + "--" + str(release_date_result) + "--" + str(price) + "--" + str(house_type) + "--" + str(acreage) + "\n")
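The title mentions a forward proxy, and requests can route traffic through one via its proxies parameter. A minimal sketch, assuming a hypothetical forward proxy listening at 127.0.0.1:7890 (replace with your own proxy address); it would slot into the request inside get_one_page above:

# Hypothetical forward proxy address; substitute your own
proxies = {
    "http": "http://127.0.0.1:7890",
    "https": "http://127.0.0.1:7890",
}
response = requests.get(url, headers=headers, proxies=proxies)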
Finally, the crawled information is printed and, at the same time, written to a text file; it could also be written to a JSON file or stored directly in a database.
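As a sketch of the JSON alternative (the field names are illustrative, not part of the original code), each listing could be appended as one JSON object per line:

import json

record = {
    "title": title,
    "release_date": release_date_result,
    "price": price,
    "house_type": house_type,
    "acreage": acreage,
}
# Append one JSON object per line (JSON Lines format)
with open(r"E:\admin.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")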
Thank you for reading. I hope this walkthrough of "how to use a Python crawler + forward proxy for Lianjia data collection" is helpful.