How to grab relevant news by keyword with a Python crawler
This article introduces how to grab relevant news by keyword with a Python crawler, using Sina News search as the example. It should be a useful reference; I hope you gain something from reading it.
Preface
First of all, if you search directly from Sina News, you will find that the results display at most 20 pages, so we search from Sina's front page instead, where there is no page limit.
Web page structure analysis
After opening Sina and searching for the keyword, I found that no matter how I turned the pages, the URL never changed, yet the page content kept updating. Experience told me this was done through Ajax, so I pulled down Sina's page source and had a look.
Clearly, every page turn sends a request to a certain address through the onclick of an a tag. But if you paste that address directly into your browser's address bar and press Enter:
Congratulations, you get an error page.
Take a closer look at the onclick in the HTML and you will find it calls a function named getNewsData. Look up this function in the relevant JS file, and you can see that it constructs the request URL before each Ajax call and sends a GET request whose returned data format is jsonp (cross-domain).
So we just need to mimic its request format to get the data.
var loopnum = 0;
function getNewsData(url){
    var oldurl = url;
    if(!key){
        $("#result").html("No search terms");
        return false;
    }
    if(!url){
        url = 'https://interface.sina.cn/homepage/search.d.json?q=' + encodeURIComponent(key);
    }
    var stime = getStartDay();
    var etime = getEndDay();
    url += '&stime=' + stime + '&etime=' + etime + '&sort=rel&highlight=1&num=10&ie=utf-8';
    //'&from=sina_index_hot_words&sort=time&highlight=1&num=10&ie=utf-8';
    $.ajax({
        type: 'GET',
        dataType: 'jsonp',
        cache: false,
        url: url,
        success: //callback function omitted, it is too long to reproduce here
    })
}

Send request

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0",
}
params = {
    "t": "",
    "q": "tourism",
    "pf": "0",
    "ps": "0",
    "page": "1",
    "stime": "2019-03-30",
    "etime": "2020-03-31",
    "sort": "rel",
    "highlight": "1",
    "num": "10",
    "ie": "utf-8"
}
response = requests.get("https://interface.sina.cn/homepage/search.d.json?", params=params, headers=headers)
print(response)
This time, using the requests library, we construct the same URL and send the request. The result we receive is a cold 403 Forbidden:
So go back to the site and see what went wrong.
Find the returned JSON file in the developer tools and look at its request headers: they carry cookies, so when constructing our headers we can simply copy the browser's request headers. Run it again, and the response is 200! The rest is simple: we only need to parse the returned data and write it into Excel.
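Before the full script, here is a minimal sketch of that fix together with parsing the response. The Cookie value is a placeholder you would copy from your own browser session, and the field names (result, list, origin_title, datetime, media, url) are the ones used in the complete script below.

import json
import requests

# Copy the request headers (including Cookie) from the browser's developer tools.
# The Cookie below is a placeholder, not a working value.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0",
    "Cookie": "<paste the Cookie string from your own browser session here>",
}
params = {
    "t": "", "q": "tourism", "pf": "0", "ps": "0", "page": "1",
    "stime": "2019-03-30", "etime": "2020-03-31",
    "sort": "rel", "highlight": "1", "num": "10", "ie": "utf-8",
}
response = requests.get("https://interface.sina.cn/homepage/search.d.json?", params=params, headers=headers)
print(response.status_code)  # expect 200 once the copied headers are accepted

# The body is plain JSON; the articles live under result.list.
data = json.loads(response.text)
for item in data["result"]["list"]:
    print(item["origin_title"], item["datetime"], item["media"], item["url"])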
Complete code

import requests
import json
import xlwt

def getData(page, news):
    # Fetch one page of search results and append them to the news list.
    headers = {
        "Host": "interface.sina.cn",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0",
        "Accept": "*/*",
        "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Referer": r"http://www.sina.com.cn/mid/search.shtml?range=all&c=news&q=%E6%97%85%E6%B8%B8&from=home&ie=utf-8",
        "Cookie": "ustat=__172.16.93.31_1580710312_0.68442000; genTime=1580710312; vt=99; Apache=9855012519393.69.1585552043971; SINAGLOBAL=9855012519393.69.1585552043971; ULV=1585552043972:1:1:1:9855012519393.69.1585552043971:; historyRecord={'href':'https://news.sina.cn/','refer':'https://sina.cn/'}; SMART=0; dfz_loc=gd-default",
        "TE": "Trailers"
    }
    params = {
        "t": "",
        "q": "tourism",
        "pf": "0",
        "ps": "0",
        "page": page,
        "stime": "2019-03-30",
        "etime": "2020-03-31",
        "sort": "rel",
        "highlight": "1",
        "num": "10",
        "ie": "utf-8"
    }
    response = requests.get("https://interface.sina.cn/homepage/search.d.json?", params=params, headers=headers)
    dic = json.loads(response.text)
    news += dic["result"]["list"]
    return news

def writeData(news):
    # Write the collected articles to an Excel sheet.
    workbook = xlwt.Workbook(encoding='utf-8')
    worksheet = workbook.add_sheet('MySheet')
    worksheet.write(0, 0, "Title")
    worksheet.write(0, 1, "Time")
    worksheet.write(0, 2, "Media")
    worksheet.write(0, 3, "URL")
    for i in range(len(news)):
        print(news[i])
        worksheet.write(i + 1, 0, news[i]["origin_title"])
        worksheet.write(i + 1, 1, news[i]["datetime"])
        worksheet.write(i + 1, 2, news[i]["media"])
        worksheet.write(i + 1, 3, news[i]["url"])
    workbook.save('data.xls')

def main():
    news = []
    for i in range(1, 501):
        news = getData(i, news)
    writeData(news)

if __name__ == '__main__':
    main()
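One practical note before running it: main() sends 500 requests back to back, so the interface may start throttling or blocking the session partway through. The variation of main() below is only a sketch of how you might soften that; it reuses getData and writeData from the script above, and the one-second pause and the broad error handling are my own assumptions, not part of the original article.

import time

def main():
    news = []
    for i in range(1, 501):
        try:
            news = getData(i, news)
        except Exception as e:
            # A blocked or malformed response should not abort the whole crawl.
            print("page", i, "failed:", e)
        time.sleep(1)  # pause briefly so the requests are not fired back to back
    writeData(news)

Either way, the script collects the search results for the chosen keyword and date range and saves them to data.xls.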