This article introduces how to use a Python crawler with a forward proxy to collect rental listing data from Lianjia. It is intended as a practical reference; the sections below walk through the process step by step.
First, select the target website:
Lianjia: https://bj.lianjia.com/
Click "Rent" to go to the home page:
This is the home page to climb.
Second, crawl a single page
1. Analyze the page
Right-click the link of a listing and click "Inspect", as shown in the figure:
This opens the browser's developer tools.
You can see the link inside the a tag:
The link can be extracted with XPath, but the extracted href is only the path portion of the real URL, so it must be concatenated with the site domain to form the full URL.
This is reflected in the code later; a short sketch follows below.
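A minimal sketch of this step, assuming the current Lianjia list-page layout (the XPath class name below is taken from the full source code in the fourth section):

import requests
from lxml import etree

# Fetch the first rental list page (headers omitted here for brevity)
html = requests.get("https://bj.lianjia.com/zufang/pg1/#contentList").text
tree = etree.HTML(html)

# The href in the <a> tag is only the path half of the real URL
hrefs = tree.xpath("//a[@class='content__list--item--aside']/@href")
for href in hrefs:
    full_url = "https://bj.lianjia.com" + href  # concatenate to get the real URL
    print(full_url)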
For now, only the information marked in the following figure is crawled:
title, release time, price, room layout, and floor area.
Third, crawl all the pages
1. Analyze the page url
Click "Rent" and note the page it jumps to: https://bj.lianjia.com/zufang/
This is the list page to crawl.
Scroll to the bottom and click "next page" or any other page number.
Page 1: https://bj.lianjia.com/zufang/pg1/#contentList
Page 2: https://bj.lianjia.com/zufang/pg2/#contentList
Page 3: https://bj.lianjia.com/zufang/pg3/#contentList
...
Page 100: https://bj.lianjia.com/zufang/pg100/#contentList
Observing these URLs reveals the pattern: only the number after "pg" changes, and it equals the page number.
All pages can therefore be crawled by building each URL with string concatenation inside a loop, as sketched below.
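A minimal sketch of that loop (the 1-100 page range matches the full source code in the next section):

# Build the list-page URL for each of pages 1-100
for number in range(1, 101):
    page_url = "https://bj.lianjia.com/zufang/pg" + str(number) + "/#contentList"
    print(page_url)  # in the real crawler, each URL is fetched and parsed here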
Fourth, the source code
Development tool: PyCharm
Python version: 3.7.2
import requests
from lxml import etree

# Common helper: takes a url and returns the page text
def get_one_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
    }
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    if response.status_code == 200:
        return response.text
    else:
        print("Request status code != 200, the URL is wrong.")
        return None

for number in range(1, 101):  # loop over 1-100 to crawl pages 1 to 100
    initialize_url = "https://bj.lianjia.com/zufang/pg" + str(number) + "/#contentList"  # build the list-page url for pages 1-100
    html_result = get_one_page(initialize_url)  # get the page text
    html_xpath = etree.HTML(html_result)  # convert to an xpath-able tree
    # grab the listing urls on the list page (there are 30 listings per page)
    page_url = html_xpath.xpath("//div[@class='content w1150']/div[@class='content__article']//div[@class='content__list']/div/a[@class='content__list--item--aside']/@href")
    for i in page_url:  # loop over each listing url
        true_url = "https://bj.lianjia.com" + i  # build the detail-page url
        true_html = get_one_page(true_url)  # get the detail-page text
        true_xpath = etree.HTML(true_html)  # convert to xpath format
        # grab the page title, i.e. the title of the listing detail page
        title = true_xpath.xpath("//div[@class='content clear w1150']/p[@class='content__title']//text()")
        # grab the release date and split the string
        release_date = true_xpath.xpath("//div[@class='content clear w1150']//div[@class='content__subtitle']//text()")
        release_date_result = str(release_date[2]).strip().split(":")[1]
        # grab the price
        price = true_xpath.xpath("//div[@class='content clear w1150']//p[@class='content__aside--title']/span//text()")
        # grab the room layout
        house_type = true_xpath.xpath("//div[@class='content clear w1150']//ul[@class='content__aside__list']//span[2]//text()")
        # grab the floor area
        acreage = true_xpath.xpath("//div[@class='content clear w1150']//ul[@class='content__aside__list']//span[3]//text()")
        print(str(title) + "--" + str(release_date_result) + "--" + str(price) + "--" + str(house_type) + "--" + str(acreage))
        # write the information to a text file
        with open(r"E:\admin.txt", 'a') as f:
            f.write(str(title) + "--" + str(release_date_result) + "--" + str(price) + "--" + str(house_type) + "--" + str(acreage) + "\n")
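The title mentions a forward proxy, and requests can route traffic through one via its proxies parameter. A minimal sketch, assuming a hypothetical forward proxy listening at 127.0.0.1:7890 (replace with your own proxy address); it would slot into the request inside get_one_page above:

# Hypothetical forward proxy address; substitute your own
proxies = {
    "http": "http://127.0.0.1:7890",
    "https": "http://127.0.0.1:7890",
}
response = requests.get(url, headers=headers, proxies=proxies)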
Finally, the crawled information is printed and, at the same time, written to a text file; it could also be written to a JSON file or stored directly in a database.
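As a sketch of the JSON alternative (the field names are illustrative, not part of the original code), each listing could be appended as one JSON object per line:

import json

record = {
    "title": title,
    "release_date": release_date_result,
    "price": price,
    "house_type": house_type,
    "acreage": acreage,
}
# Append one JSON object per line (JSON Lines format)
with open(r"E:\admin.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")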
Thank you for reading. I hope this walkthrough of "how to use a Python crawler + forward proxy for Lianjia data collection" is helpful.