Overview
In this section, we will cover crawling local news from a government website and loading the captured news data into the following two tables, news_site and news.
news_site (news source)

Field         Type          Description
id            bigint        primary key, auto increment
name          varchar(128)  source name

news (news)

Field         Type          Description
id            bigint        primary key, auto increment
title         varchar(128)  title
site_id       bigint        foreign key, points to the id field of news_site
content       text          content
pub_date      datetime      release time
date_created  datetime      time the record was added
It is easy to see that the two tables are related. How to write the crawled data so that this association is preserved is what we will introduce step by step; a sketch of the corresponding DDL is shown below.
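To make the schema concrete, here is a minimal sketch of the two tables as MySQL-style DDL issued through Python. The connection parameters, the pymysql driver, and the database name golddata are placeholders for illustration, not part of the original article.

import pymysql  # assumed MySQL driver; any DB-API client works the same way

DDL_NEWS_SITE = """
CREATE TABLE IF NOT EXISTS news_site (
    id   BIGINT PRIMARY KEY AUTO_INCREMENT,  -- primary key, auto increment
    name VARCHAR(128)                        -- source name
)
"""

DDL_NEWS = """
CREATE TABLE IF NOT EXISTS news (
    id           BIGINT PRIMARY KEY AUTO_INCREMENT,  -- primary key, auto increment
    title        VARCHAR(128),                       -- title
    site_id      BIGINT,                             -- foreign key -> news_site.id
    content      TEXT,                               -- content
    pub_date     DATETIME,                           -- release time
    date_created DATETIME,                           -- time the record was added
    FOREIGN KEY (site_id) REFERENCES news_site(id)
)
"""

conn = pymysql.connect(host="localhost", user="root", password="", database="golddata")
with conn.cursor() as cur:
    cur.execute(DDL_NEWS_SITE)
    cur.execute(DDL_NEWS)
conn.commit()
conn.close()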
Define sites and datasets
Define fetching and extraction rules
We need to fill in the entry address here, for example http://sousuo.gov.cn/column/40520/0.htm. If there are multiple entry addresses, separate them with half-width (English) commas.
Next, when writing the rules, the first thing to do is match the URL, which means filling in a regular expression here. Clicking the "?" next to the field pops up the corresponding help document. A quick check of the two patterns used in this article is sketched below.
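As a sanity check, the two match patterns from the rule content below can be tested in Python against their sample URLs. The patterns and URLs are taken straight from the rule (with the escaping adjusted to Python's re syntax); only the test harness is added.

import re

# Pattern for the column (list) pages and for the article pages,
# taken from the match0 entries in the rule content below.
list_pat = re.compile(r"http://sousuo\.gov\.cn/column/40520/\d+\.htm")
article_pat = re.compile(r"http://www\.gov\.cn/xinwen/2019-\d{2}/\d{2}/content_\d+\.htm")

assert list_pat.match("http://sousuo.gov.cn/column/40520/0.htm")
assert article_pat.match("http://www.gov.cn/xinwen/2019-02/26/content_5368539.htm")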
Then note that if all we need to crawl from a page is links, "dataset?" should be set to No, and the fields must include one named href.
Otherwise, "dataset?" should be set to Yes, and the fields must include one named sn. The value stored in sn is generally unique and plays the same role as the id field in a database table; in the rule below it is derived from the page URL, as sketched next.
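In the rule content below, sn is computed by the JS snippet var xx=md5(baseUri); xx;, i.e. an MD5 hash of the page URL. A Python equivalent, for illustration only:

import hashlib

def make_sn(base_uri: str) -> str:
    # MD5 of the page URL, mirroring the rule's `var xx = md5(baseUri); xx;`
    return hashlib.md5(base_uri.encode("utf-8")).hexdigest()

print(make_sn("http://www.gov.cn/xinwen/2019-02/26/content_5368539.htm"))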
The complete rule content is shown below:
[{
    __sample: http://sousuo.gov.cn/column/40520/0.htm
    match0: http\://sousuo\.gov\.cn/column/40520/\d+\.htm
    fields0: {
        __model: false
        __node: .news_box a
        href: {expr: a, attr: abs:href, js: "", __label: link, __showOnList: false, __type: "", down: "0", accessPathJs: "", uploadConf: ""}
    }
},
{
    __sample: http://www.gov.cn/xinwen/2019-02/26/content_5368539.htm
    match0: http\://www\.gov\.cn/xinwen/2019-\d{2}/\d{2}/content_\d+\.htm
    fields0: {
        __model: true
        __dataset: news
        __node: .article
        sn: {expr: "", attr: "", js: 'var xx = md5(baseUri); xx;', __label: number, __showOnList: false, __type: "", down: "0", accessPathJs: "", uploadConf: ""}
        title: {expr: .article > h2, attr: "", js: "", __label: title, __showOnList: true, __type: "", down: "0", accessPathJs: "", uploadConf: ""}
        pubdate: {expr: .pages-date:matchText, attr: "", js: "", __label: release time, __showOnList: false, __type: "", down: "0", accessPathJs: "", uploadConf: ""}
        source: {expr: .pages-date > span.font:contains(Source), attr: "", js: 'var xx = source.replace("Source:", ""); xx;', __label: source, __showOnList: true, __type: "", down: "0", accessPathJs: "", uploadConf: ""}
        content: {expr: .pages_content, attr: "", js: "", __label: news content, __showOnList: false, __type: "", down: "0", accessPathJs: "", uploadConf: ""}
    }
}]

Configure and start the crawler
A crawler can be configured to grab multiple sites, and a site can also be configured with multiple crawlers.
Then click Start to launch the crawler.
View and export data
You can export data according to the current search criteria. After clicking the Export button, you will be prompted to choose which fields to export, and the file is then generated. A small amount of data is exported as an Excel file; a larger export is downloaded as a packaged zip file.
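A small helper like the following (hypothetical, in Python with pandas; the exact layout of files inside the zip is an assumption) can load either form of export into one DataFrame:

import zipfile
import pandas as pd

def load_export(path: str) -> pd.DataFrame:
    # Large exports arrive as a zip of spreadsheets; small ones as a single Excel file.
    if path.endswith(".zip"):
        with zipfile.ZipFile(path) as zf:
            frames = [pd.read_excel(zf.open(name)) for name in zf.namelist()]
        return pd.concat(frames, ignore_index=True)
    return pd.read_excel(path)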
This section ends here; the next article will cover how to merge the crawled data into the database tables through GoldData.
(Note: this content is based on the training video at https://golddata.100shouhou.com/front/docs)