This article introduces how to scrape a famous-quotes website (quotes.toscrape.com) with a Python crawler. The walkthrough is detailed and should serve as a useful reference for anyone interested.
1. Enter the URL
Open https://quotes.toscrape.com/ and look at the homepage. The structure of the page is very clear: each entry is divided into three fields, the quote text, the author, and the tags, and these three fields are exactly what we will extract.
2. Determine requirements and analyze web page structure
Open the developer tools and click the Network tab to analyze how the data is requested. The site is fetched with a plain GET request and needs no parameters, so we can simulate it with the get() method of the requests library. We do need to send a headers dictionary with browser information, otherwise the site's server may detect the request as coming from a crawler.
You can also click the element-picker arrow at the far left of the developer tools to quickly locate where a given piece of page data sits in the Elements tab.
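As a quick sanity check of this step, here is a minimal sketch of the GET request with a browser-style header (the User-Agent string is only an example; any realistic one will do):

import requests

url = "https://quotes.toscrape.com/"
# Browser-style header so the server does not flag the request as a bot
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

resp = requests.get(url, headers=headers)
print(resp.status_code)   # expect 200
print(resp.text[:200])    # first few characters of the returned HTML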
3. Analyze the structure of the web page and extract the data.
After the request succeeds, we can start extracting the data. I use XPath for parsing, so the returned page is first parsed with the HTML parser from lxml (etree.HTML). The element-picker arrow mentioned above helps us quickly find where the data lives in the Elements tab. Because the quotes are laid out on the page as a list of blocks, we first locate the whole list of quote blocks, then grab the fields from each block one by one and save them into a list, which makes the next data-cleaning step easier.
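A minimal sketch of that locate-then-extract idea is below (the class names text, author and tag match the markup quotes.toscrape.com uses today; verify them in the Elements tab, and see the full script further down):

import requests
from lxml import etree

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
res = requests.get("https://quotes.toscrape.com/", headers=headers).text
html = etree.HTML(res)

# Locate the list of quote blocks first, then pull the three fields out of each block
for quote in html.xpath('//div[@class="quote"]'):
    text = quote.xpath('./span[@class="text"]/text()')                 # quote text
    author = quote.xpath('.//small[@class="author"]/text()')           # author name
    tags = quote.xpath('./div[@class="tags"]/a[@class="tag"]/text()')  # list of tags
    print(text[0] if text else "", author[0] if author else "", tags)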
4. Save to CSV file
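The full script follows in the next section; as a standalone illustration of just the CSV step, here is a minimal sketch assuming the extracted records have already been collected as (quote, author, tags) rows (the file name quotes.csv is just an example):

import csv

# Assume 'rows' holds the records extracted earlier, one (quote, author, tags) tuple each
rows = [("Sample quote", "Sample Author", "tag1,tag2")]

with open("quotes.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["quote", "author", "tags"])  # header row
    writer.writerows(rows)                        # one CSV line per quote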
Source code sharing
import requests
from lxml import etree
import csv

url = "https://quotes.toscrape.com/"
headers = {
    # Browser-style User-Agent so the server does not reject the crawler
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
}

res = requests.get(url, headers=headers).text
html = etree.HTML(res)

# Locate every quote block inside the main column
quote_list = html.xpath('//div[@class="col-md-8"]/div[@class="quote"]')
lists = []

for quote in quote_list:
    # Quote text
    title = quote.xpath('./span[1]/text()')
    # Author
    author = quote.xpath('./span[2]/small/text()')
    # Quote tags
    tags = quote.xpath('./div[@class="tags"]/a[@class="tag"]/text()')
    # One row per quote; join the tags into a single cell
    lists.append([title[0] if title else '', author[0] if author else '', ','.join(tags)])

# Write the collected rows to a CSV file (file name is arbitrary)
with open("./quotes.csv", 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['quote', 'author', 'tags'])
    for row in lists:
        writer.writerow(row)

That is all the content of "How to Scrape a Famous Quotes Website with a Python Crawler". Thanks for reading, and I hope it helps.