Shulou(Shulou.com)06/02 Report--
This article explains in detail how to use Python to collect Chengdu rental listings. It is shared as a reference; hopefully you will come away with a working understanding of the techniques involved.
We collect the listing data from two sources: Ganji (cd.ganji.com) and Ziroom (cd.ziroom.com).
1. Getting data from Ganji
i. Get the content of the current page
The page structure here is straightforward: parse the page with XPath and each field falls out easily. Loop over the listing div blocks, extract the fields from each one, collect them into a list, and return it.
import re
import requests
from lxml import etree

def get_this_page_gj(url, tmp):
    html = etree.HTML(requests.get(url).text)
    divs = html.xpath('//div[@class="f-list-item ershoufang-list"]')
    for div in divs:
        title = div.xpath('./dl/dd[@class="dd-item title"]/a/text()')[0]
        house_url = div.xpath('./dl/dd[@class="dd-item title"]/a/@href')[0]
        size = ', '.join(div.xpath('./dl/dd[@class="dd-item size"]/span/text()'))
        address = '-'.join([data.strip() for data in div.xpath('./dl/dd[@class="dd-item address"][1]//a//text()') if data.strip() != ''])
        agent_string = div.xpath('./dl/dd[@class="dd-item address"][2]/span/span/text()')[0]
        agent = re.sub(r'\s', '', agent_string)  # pattern reconstructed (garbled in source): strip whitespace
        price = div.xpath('./dl/dd[@class="dd-item info"]/div[@class="price"]/span[@class="num"]/text()')[0]
        tmp.append([title, size, price, address, agent, house_url])
    return tmp
ii. URL construction
Visit the index page to read the total page count, build each page's URL from the site's URL pattern, and call the per-page function on each one. Every page URL here starts with http://cd.ganji.com/zufang/pn followed by the page number.
def house_gj(headers):
    index_url = 'http://cd.ganji.com/zufang/'
    html = etree.HTML(get_html(index_url, headers))
    # the second-to-last pagination link holds the total page count
    total = html.xpath('//div[@class="pageBox"]/a[position()=last()-1]/span/text()')[0]
    result = []
    for num in range(1, int(total) + 1):
        result += get_this_page_gj('http://cd.ganji.com/zufang/pn{}'.format(num), [])
        print('finished page {} of Ganji'.format(num))
    return result
2. Getting data from Ziroom
Ziroom is similar to Ganji: the page structure is alike, the scraping approach is the same, and we again grab the basic fields plus the listing URL. The difference is the price, which is not displayed as plain text but rendered as an image sprite plus a CSS offset, so it is harder to obtain.
i. Price acquisition
Each digit corresponds to a position in a sprite image: the digit shown on the page is cut out of the sprite according to the offset set in the style attribute, and the sprite differs from page to page, which makes it awkward to handle.
Looking closely, however, the spacing between digits in the sprite is constant. By inspecting the values on the page you can verify that each digit occupies 21.4 px, measured from the left edge of the sprite, so a digit's subscript = |offset / 21.4|. Depending on the page's rendering there is a small sub-pixel error, but it is negligible. Finally, OCR the sprite into a list of digits (we use tesseract) and read off the digit at each subscript.
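The offset-to-index arithmetic described above can be sketched in isolation. This is a minimal, self-contained example: the 21.4 px digit width comes from the text above, while the sample offsets and the digit list (standing in for one page's OCR result) are hypothetical.

```python
# Sketch of the offset -> digit-index arithmetic described above.
DIGIT_WIDTH = 21.4  # each digit in the sprite is 21.4 px wide

def offsets_to_indices(offsets):
    """Map CSS background-position offsets (px strings, usually negative)
    to digit subscripts via index = |offset / 21.4|."""
    # round() absorbs the small sub-pixel error mentioned above
    return [abs(round(float(off) / DIGIT_WIDTH)) for off in offsets]

# Hypothetical OCR result of one page's sprite (digit order varies per page):
code_list = ['6', '1', '9', '3', '0', '8', '4', '7', '2', '5']

# Hypothetical offsets scraped from one price's style attributes:
offsets = ['-0', '-42.8', '-64.2', '-21.4']
price = ''.join(code_list[i] for i in offsets_to_indices(offsets))
# price == '6931'
```

The real scraper does the same lookup, except that code_list comes from running tesseract on the downloaded sprite.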
import re
import requests
import pytesseract
from PIL import Image

# excerpt from the per-page loop: collect each digit's offset and the sprite URL
price_strings = div.xpath('./div[@class="info-box"]/div[@class="price"]/span[@class="num"]/@style')
offset_list = []
for data in price_strings:
    offset_list.append(re.findall('position: (.*?)px', data)[0])
style_string = html.xpath('//div[@class="info-box"]/div[@class="price"]/span[@class="num"]/@style')[0]
pic = 'http:' + re.findall(r'background-image: url\((.*?)\)', style_string)[0]
price = get_price_zr(pic, offset_list)

def get_price_zr(pic_url, offset_list):
    """index holds the sprite subscript of every digit; after OCR of the
    sprite image, look up each subscript to rebuild the price string."""
    index, price = [], []
    with open('pic.png', 'wb') as f:
        f.write(requests.get(pic_url).content)
    code_list = list(pytesseract.image_to_string(Image.open('pic.png')))
    for data in offset_list:
        index.append(abs(int(float(data) / 21.4)))  # subscript = |offset / 21.4|
    for data in index:
        price.append(code_list[data])
    return ''.join(price)
pic_url is the address of the page's sprite image, which is downloaded and parsed with pytesseract; the function returns the price string assembled from the digit at each subscript. offset_list is the list of offsets, one per digit.
ii. Get the data of the current page
As with Ganji, we write a function that parses one page of data and call it with each page's URL. Note the extended XPath usage (the contains() function) and the regex used to pull out the sprite image link.
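For reference, contains() in an XPath predicate matches elements by substring of their text, which is handy when an element has no distinguishing class attribute. A minimal lxml sketch, using a made-up HTML fragment that only mimics the listing layout described above:

```python
from lxml import etree

# Made-up listing fragment; class names and text are illustrative only.
fragment = '''
<div class="info-box">
  <div class="desc">
    <div class="location">Jinjiang District</div>
    <div>45 m2 | South-facing</div>
  </div>
</div>
'''
html = etree.HTML(fragment)

# contains(text(), ...) picks the div whose text includes "m2"
area = html.xpath('//div[@class="desc"]/div[contains(text(), "m2")]/text()')[0]
# area == '45 m2 | South-facing'
```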
def get_this_page_zr(url, tmp):
    html = etree.HTML(requests.get(url).text)
    divs = html.xpath('//div[@class="item"]')
    for div in divs:
        if div.xpath('./div[@class="info-box"]/h6/a/text()'):
            title = div.xpath('./div[@class="info-box"]/h6/a/text()')[0]
        else:
            continue
        link = 'http:' + div.xpath('./div[@class="info-box"]/h6/a/@href')[0]
        location = div.xpath('./div[@class="info-box"]/div[@class="desc"]/div[@class="location"]/text()')[0]
        # marker text inside contains() is an assumption; it was garbled in the source
        area = div.xpath('./div[@class="info-box"]/div[@class="desc"]/div[contains(text(), "㎡")]/text()')[0]
        price_strings = div.xpath('./div[@class="info-box"]/div[@class="price"]/span[@class="num"]/@style')
        offset_list = []
        for data in price_strings:
            offset_list.append(re.findall('position: (.*?)px', data)[0])
        style_string = html.xpath('//div[@class="info-box"]/div[@class="price"]/span[@class="num"]/@style')[0]
        pic = 'http:' + re.findall(r'background-image: url\((.*?)\)', style_string)[0]
        price = get_price_zr(pic, offset_list)
        tag = ', '.join(div.xpath('./div[@class="info-box"]//div[@class="tag"]/span/text()'))
        tmp.append([title, tag, price, area, location, link])
    return tmp
iii. URL construction
The principle is the same as for Ganji; the main point of interest is the XPath expression position() = last()-1.
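The position() = last()-1 trick selects the second-to-last pagination link, which on both sites holds the total page count (the last link is typically a "next" button). A minimal sketch with a made-up pagination fragment:

```python
from lxml import etree

# Made-up pagination bar: the last <a> is the "next" button, so the
# second-to-last <a> carries the highest page number.
html = etree.HTML('''
<div class="pageBox">
  <a><span>1</span></a>
  <a><span>2</span></a>
  <a><span>72</span></a>
  <a><span>next</span></a>
</div>
''')

# position() = last()-1 selects the second-to-last <a> element
total = html.xpath('//div[@class="pageBox"]/a[position() = last()-1]/span/text()')[0]
# total == '72'
```

The page URLs are then built from the count, e.g. 'http://cd.ziroom.com/z/p{}/'.format(num) for Ziroom.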
def house_zr(headers):
    index_url = 'http://cd.ziroom.com/z/'
    html = etree.HTML(get_html(index_url, headers))
    total = html.xpath('//div[@class="Z_pages"]/a[position()=last()-1]/text()')[0]
    result = []
    for num in range(1, int(total) + 1):
        result += get_this_page_zr('http://cd.ziroom.com/z/p{}/'.format(num), [])
        print('finished page {} of Ziroom'.format(num))
    return result

That is all on how to use Python to collect Chengdu rental information. I hope the above content is of some help. If you found the article useful, feel free to share it with others.