This article introduces how to use a Python crawler to fetch a ranked list of international bridges. Many people run into questions with this kind of task, so this tutorial pulls the relevant material together into a simple, easy-to-follow method. Follow along and try it yourself!
Foreword:
To get started, install pyquery into your local development environment: pip install pyquery (this article uses version 1.4.3).
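To confirm which version actually got installed, the standard library can report it (a quick sketch, Python 3.8+):

from importlib.metadata import version

# Prints the installed pyquery version, e.g. 1.4.3.
print(version("pyquery"))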
Basic usage is shown below; once you can follow this, you have already mastered half of the library. It really is that simple.
from pyquery import PyQuery as pq

s = '<html><head><title>Eraser PyQuery Small Classroom</title></head></html>'
doc = pq(s)
print(doc('title'))
The output is as follows:
<title>Eraser PyQuery Small Classroom</title>
You can also pass the URL to be parsed directly to the pyquery object, as follows:
from pyquery import PyQuery as pq

url = "https://www.bilibili.com/"
doc = pq(url=url, encoding="utf-8")
print(doc('title'))  # <title>bilibili (゜-゜)つロ Cheers~-bilibili</title>
Following the same idea, you can also initialize the pyquery object from a file; just change the parameter to filename.
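A minimal sketch of file-based initialization, assuming a local file named demo.html exists (the filename here is hypothetical):

from pyquery import PyQuery as pq

# Assumes demo.html exists in the current working directory.
doc = pq(filename="demo.html")
print(doc('title'))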
With the foundation laid, we can move on to the practical part. Below is the analysis of the target to be scraped this time.
Target site analysis
The target is the List of Highest International Bridges, which presents its data as a ranked, sortable table.
While browsing, we found that most of the bridges were designed in China. Sure enough, our infrastructure is world-class.
The pagination pattern is as follows:
http://www.highestbridges.com/wiki/index.php?title=List_of_Highest_International_Bridges/Page_1
http://www.highestbridges.com/wiki/index.php?title=List_of_Highest_International_Bridges/Page_2
# In testing, the data runs out at page 13; there are about 1,200 bridges in total.
http://www.highestbridges.com/wiki/index.php?title=List_of_Highest_International_Bridges/Page_13
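The pattern is easy to generate with a format string; a quick sketch that prints all 13 page URLs (the full script below does the same thing):

base = "http://www.highestbridges.com/wiki/index.php?title=List_of_Highest_International_Bridges/Page_{}"
# Pages run from 1 to 13 inclusive.
for page in range(1, 14):
    print(base.format(page))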
Since the target data is stored as a table, we only need to extract it column by column according to the header: Rank, Name, Height (meters/feet), Main Span Length, Completed, Location, Country.
Coding time
Before writing the full script, practice on the first page:
from pyquery import PyQuery as pq

url = "http://www.highestbridges.com/wiki/index.php?title=List_of_Highest_International_Bridges/Page_1"
doc = pq(url=url, encoding='utf-8')
print(doc('title'))

def remove(s):
    # Strip spaces and newlines from cell text.
    return s.replace(" ", "").replace("\n", "")

# Get the rows that hold the data. The expression below is a CSS selector;
# calling it a jQuery selector is also fine.
items = doc.find('table.wikitable.sortable tr').items()
for item in items:
    td_list = item.find('td')
    rank = td_list.eq(1).find("span.sorttext").text()
    name = td_list.eq(2).find("a").text()
    height = remove(td_list.eq(3).text())
    length = remove(td_list.eq(4).text())
    completed = td_list.eq(5).text()
    location = td_list.eq(6).text()
    country = td_list.eq(7).text()
    print(rank, name, height, length, completed, location, country)
Writing this code out makes clear how heavily the approach leans on selectors: you need to be fluent with them to pick out the target elements and reach the final data.
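As a small illustration of the selector calls used above (find, eq, text), here is a self-contained sketch against a hypothetical HTML fragment:

from pyquery import PyQuery as pq

# A hypothetical two-row table, just to exercise the selector API.
html = '''
<div>
  <table class="wikitable sortable">
    <tr><td>1</td><td><a>Bridge A</a></td></tr>
    <tr><td>2</td><td><a>Bridge B</a></td></tr>
  </table>
</div>
'''
doc = pq(html)
for row in doc.find('table.wikitable.sortable tr').items():
    cells = row.find('td')
    # eq(n) picks the n-th matched cell; text() returns its text content.
    print(cells.eq(0).text(), cells.eq(1).find('a').text())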
Expand the above code to cover all the data, and modify it to fetch page by page:
from pyquery import PyQuery as pq
import time

def remove(s):
    # Strip spaces and newlines, and swap English commas for full-width
    # commas so they do not collide with the CSV delimiter.
    return s.replace(" ", "").replace("\n", "").replace(",", "，")

def get_data(page):
    url = "http://www.highestbridges.com/wiki/index.php?title=List_of_Highest_International_Bridges/Page_{}".format(page)
    print(url)
    doc = pq(url=url, encoding='utf-8')
    print(doc('title'))
    # Get the rows that hold the data. The expression below is a CSS selector;
    # calling it a jQuery selector is also fine.
    items = doc.find('table.wikitable.sortable tr').items()
    for item in items:
        td_list = item.find('td')
        rank = td_list.eq(1).find("span.sorttext").text()
        name = remove(td_list.eq(2).find("a").text())
        height = remove(td_list.eq(3).text())
        length = remove(td_list.eq(4).text())
        completed = remove(td_list.eq(5).text())
        location = remove(td_list.eq(6).text())
        country = remove(td_list.eq(7).text())
        data_tuple = (rank, name, height, length, completed, location, country)
        save(data_tuple)

def save(data_tuple):
    try:
        my_str = ",".join(data_tuple) + "\n"
        # print(my_str)
        with open("./data.csv", "a+", encoding="utf-8") as f:
            f.write(my_str)
        print("write complete")
    except Exception as e:
        pass

if __name__ == '__main__':
    for page in range(1, 14):
        get_data(page)
        time.sleep(3)
Note that some of the scraped fields contain English commas, which would break the comma-delimited output; the remove() function uniformly replaces them with full-width commas.
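As an alternative sketch (not the author's original approach), Python's standard csv module quotes fields that contain commas, which would make the manual replacement unnecessary:

import csv

def save(data_tuple):
    # csv.writer quotes any field containing a comma, so no character
    # replacement is needed before writing.
    with open("./data.csv", "a+", encoding="utf-8", newline="") as f:
        csv.writer(f).writerow(data_tuple)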
This concludes the study of how to use a Python crawler to obtain a ranked list of foreign bridges. Hopefully it has answered your questions; theory sticks best when paired with practice, so go and try it yourself! If you want to keep learning, more practical articles are on the way.