Overview
In this section, we will cover crawling local news from a government website and loading the captured news data into the following two tables, news_site and news.
news_site (news source)

Field         Type          Description
id            bigint        primary key, auto increment
name          varchar(128)  source name

news (news)

Field         Type          Description
id            bigint        primary key, auto increment
title         varchar(128)  title
site_id       bigint        foreign key, points to the id field of news_site
content       text          content
pub_date      datetime      release time
date_created  datetime      time the record was added
It is easy to see that the two tables are related. How to write the crawled data so that this association is preserved is what we will introduce step by step; a sketch of the corresponding DDL is shown below.
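To make the schema concrete, here is a minimal sketch of the two tables as MySQL-style DDL issued through Python. The connection parameters, the pymysql driver, and the database name golddata are placeholders for illustration, not part of the original article.

import pymysql  # assumed MySQL driver; any DB-API client works the same way

DDL_NEWS_SITE = """
CREATE TABLE IF NOT EXISTS news_site (
    id   BIGINT PRIMARY KEY AUTO_INCREMENT,  -- primary key, auto increment
    name VARCHAR(128)                        -- source name
)
"""

DDL_NEWS = """
CREATE TABLE IF NOT EXISTS news (
    id           BIGINT PRIMARY KEY AUTO_INCREMENT,  -- primary key, auto increment
    title        VARCHAR(128),                       -- title
    site_id      BIGINT,                             -- foreign key -> news_site.id
    content      TEXT,                               -- content
    pub_date     DATETIME,                           -- release time
    date_created DATETIME,                           -- time the record was added
    FOREIGN KEY (site_id) REFERENCES news_site(id)
)
"""

conn = pymysql.connect(host="localhost", user="root", password="", database="golddata")
with conn.cursor() as cur:
    cur.execute(DDL_NEWS_SITE)
    cur.execute(DDL_NEWS)
conn.commit()
conn.close()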
Define sites and datasets
Define fetching and extraction rules
We need to fill in the entry address here, for example http://sousuo.gov.cn/column/40520/0.htm. If there are multiple entry addresses, separate them with half-width (English) commas.
Next, when writing the rules, the first thing to do is match the URL, which means filling in a regular expression here. Clicking the "?" next to the field pops up the corresponding help document. A quick check of the two patterns used in this article is sketched below.
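As a sanity check, the two match patterns from the rule content below can be tested in Python against their sample URLs. The patterns and URLs are taken straight from the rule (with the escaping adjusted to Python's re syntax); only the test harness is added.

import re

# Pattern for the column (list) pages and for the article pages,
# taken from the match0 entries in the rule content below.
list_pat = re.compile(r"http://sousuo\.gov\.cn/column/40520/\d+\.htm")
article_pat = re.compile(r"http://www\.gov\.cn/xinwen/2019-\d{2}/\d{2}/content_\d+\.htm")

assert list_pat.match("http://sousuo.gov.cn/column/40520/0.htm")
assert article_pat.match("http://www.gov.cn/xinwen/2019-02/26/content_5368539.htm")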
Then note that if all we need to crawl from a page is links, "dataset?" should be set to No, and the fields must include one named href.
Otherwise, "dataset?" should be set to Yes, and the fields must include one named sn. The value stored in sn is generally unique and plays the same role as the id field in a database table; in the rule below it is derived from the page URL, as sketched next.
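In the rule content below, sn is computed by the JS snippet var xx=md5(baseUri); xx;, i.e. an MD5 hash of the page URL. A Python equivalent, for illustration only:

import hashlib

def make_sn(base_uri: str) -> str:
    # MD5 of the page URL, mirroring the rule's `var xx = md5(baseUri); xx;`
    return hashlib.md5(base_uri.encode("utf-8")).hexdigest()

print(make_sn("http://www.gov.cn/xinwen/2019-02/26/content_5368539.htm"))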
The complete rule content is shown below:
[{
    __sample: http://sousuo.gov.cn/column/40520/0.htm
    match0: http\://sousuo\.gov\.cn/column/40520/\d+\.htm
    fields0: {
        __model: false
        __node: .news_box a
        href: {expr: a, attr: abs:href, js: "", __label: link, __showOnList: false, __type: "", down: "0", accessPathJs: "", uploadConf: ""}
    }
},
{
    __sample: http://www.gov.cn/xinwen/2019-02/26/content_5368539.htm
    match0: http\://www\.gov\.cn/xinwen/2019-\d{2}/\d{2}/content_\d+\.htm
    fields0: {
        __model: true
        __dataset: news
        __node: .article
        sn: {expr: "", attr: "", js: 'var xx = md5(baseUri); xx;', __label: number, __showOnList: false, __type: "", down: "0", accessPathJs: "", uploadConf: ""}
        title: {expr: .article > h2, attr: "", js: "", __label: title, __showOnList: true, __type: "", down: "0", accessPathJs: "", uploadConf: ""}
        pubdate: {expr: .pages-date:matchText, attr: "", js: "", __label: release time, __showOnList: false, __type: "", down: "0", accessPathJs: "", uploadConf: ""}
        source: {expr: .pages-date > span.font:contains(Source), attr: "", js: 'var xx = source.replace("Source:", ""); xx;', __label: source, __showOnList: true, __type: "", down: "0", accessPathJs: "", uploadConf: ""}
        content: {expr: .pages_content, attr: "", js: "", __label: news content, __showOnList: false, __type: "", down: "0", accessPathJs: "", uploadConf: ""}
    }
}]

Configure and start the crawler
A crawler can be configured to grab multiple sites, and a site can also be configured with multiple crawlers.
Then click Start to launch the crawler.
View and export data
You can export data according to the current search criteria. After clicking the Export button, you will be prompted to choose which fields to export, and the file is then generated. A small amount of data is exported as an Excel file; a larger export is downloaded as a packaged zip file.
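A small helper like the following (hypothetical, in Python with pandas; the exact layout of files inside the zip is an assumption) can load either form of export into one DataFrame:

import zipfile
import pandas as pd

def load_export(path: str) -> pd.DataFrame:
    # Large exports arrive as a zip of spreadsheets; small ones as a single Excel file.
    if path.endswith(".zip"):
        with zipfile.ZipFile(path) as zf:
            frames = [pd.read_excel(zf.open(name)) for name in zf.namelist()]
        return pd.concat(frames, ignore_index=True)
    return pd.read_excel(path)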
This section ends here; the next article will cover how to merge the crawled data into the database tables through GoldData.
(Note: this content is based on the training video at https://golddata.100shouhou.com/front/docs)