What is the data collection process of the National Bureau of Statistics?

2025-01-14 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

This article introduces the data collection process of the National Bureau of Statistics. Many readers have questions about how this data can be collected, so the editor has consulted various sources and organized a simple, practical method that should answer them. Follow along with the steps below!

Acquisition process

The first step in collecting all kinds of public data is to analyze web pages.

The screenshot above shows the annual-data interface of the National Bureau of Statistics: on the left is a tree menu of data categories, and on the right is the data displayed when a menu item is clicked. The data can also be filtered by year.

Collecting the data classification tree

Given the page layout, we first collect the tree menu, and then collect the data on the right for each category in the menu. Working category by category this way avoids missing any data.

A crawler generally obtains data in one of two ways:

Fetch the HTML page, then parse its structure and extract the data

Check whether an API serves the data, and pull the data directly from the API

By analyzing how the web page loads, it turns out that the data of the National Bureau of Statistics is served by an API, which saves a lot of time.

The API information is as follows:

Host: "https://data.stats.gov.cn/easyquery.htm"method: POSTparams: id=zb&dbcode=hgnd&wdcode=zb&m=getTree

The tree menu data can be obtained by simulating this POST request with Python's requests library.

import logging
import pickle

import requests

host = "https://data.stats.gov.cn/easyquery.htm"

def init_tree(tree_data_path):
    data = get_tree_data()
    with open(tree_data_path, "wb") as f:
        pickle.dump(data, f)

def get_tree_data(id="zb"):
    r = requests.post(f"{host}?id={id}&dbcode=hgnd&wdcode=zb&m=getTree", verify=False)
    logging.debug("access url: %s", r.url)
    data = r.json()
    for node in data:
        # recurse into branch nodes to fetch their children
        if node["isParent"]:
            node["children"] = get_tree_data(node["id"])
        else:
            node["children"] = []
    return data

Just call the init_tree function above, and the tree menu will be serialized (pickled) to tree_data_path.

The purpose of serialization is reuse: later collection runs can load the saved tree instead of re-crawling it every time. (After all, the menu rarely changes.)
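As a minimal sketch of that reuse, a later run can deserialize the tree instead of re-crawling it, assuming tree_data_path points at the file written by init_tree:

import pickle

tree_data_path = "./tree.data"

# Reload the pickled tree menu from the first step instead of re-crawling it.
with open(tree_data_path, "rb") as f:
    tree_data = pickle.load(f)

# Top-level categories; each node has "id", "isParent" and "children".
print([node["id"] for node in tree_data])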

Collect data according to classification

With the category menu in hand, the next step is to collect the actual data. Again, analysis of the web page shows that this data is also served by an API, so there is no need to fetch HTML pages and extract the data from them.

Host: "https://data.stats.gov.cn/easyquery.htm"method: GETparams: parameters have variables. For more information, please see the code.

Collecting the data is a little more involved. Instead of calling the API once, as with the tree menu, the crawler traverses the tree menu and calls the API for each category it finds.

# -*- coding: utf-8 -*-
import logging
import os
import pickle
import time

import pandas as pd
import requests

host = "https://data.stats.gov.cn/easyquery.htm"
tree_data_path = "./tree.data"
data_dir = "./data"

def data(sj="1978-"):
    tree_data = []
    with open(tree_data_path, "rb") as f:
        tree_data = pickle.load(f)
    traverse_tree_data(tree_data, sj)

def traverse_tree_data(nodes, sj):
    for node in nodes:
        # only leaf nodes carry data
        if node["isParent"]:
            traverse_tree_data(node["children"], sj)
        else:
            write_csv(node["id"], sj)

def write_csv(nodeId, sj):
    fp = os.path.join(data_dir, nodeId + ".csv")
    # skip nodes that have already been crawled
    if os.path.exists(fp):
        logging.info("File already exists: %s", fp)
        return
    statData = get_stat_data(sj, nodeId)
    if statData is None:
        logging.error("NOT FOUND data for %s", nodeId)
        return
    # csv data
    csvData = {"zb": [], "value": [], "sj": [], "zbCN": [], "sjCN": []}
    for node in statData["datanodes"]:
        csvData["value"].append(node["data"]["data"])
        for wd in node["wds"]:
            csvData[wd["wdcode"]].append(wd["valuecode"])
    # meaning of the indicator (zb) and time (sj) codes
    zbDict = {}
    sjDict = {}
    for node in statData["wdnodes"]:
        if node["wdcode"] == "zb":
            for zbNode in node["nodes"]:
                zbDict[zbNode["code"]] = {
                    "name": zbNode["name"],
                    "cname": zbNode["cname"],
                    "unit": zbNode["unit"],
                }
        if node["wdcode"] == "sj":
            for sjNode in node["nodes"]:
                sjDict[sjNode["code"]] = {
                    "name": sjNode["name"],
                    "cname": sjNode["cname"],
                    "unit": sjNode["unit"],
                }
    # add zbCN and sjCN to the csv data
    for zb in csvData["zb"]:
        zbCN = (
            zbDict[zb]["cname"]
            if zbDict[zb]["unit"] == ""
            else zbDict[zb]["cname"] + "(" + zbDict[zb]["unit"] + ")"
        )
        csvData["zbCN"].append(zbCN)
    for sj in csvData["sj"]:
        csvData["sjCN"].append(sjDict[sj]["cname"])
    # write the csv file
    df = pd.DataFrame(csvData, columns=["sj", "sjCN", "zb", "zbCN", "value"])
    df.to_csv(fp, index=False)

def get_stat_data(sj, zb):
    payload = {
        "dbcode": "hgnd",
        "rowcode": "zb",
        "m": "QueryData",
        "colcode": "sj",
        "wds": "[]",
        "dfwds": '[{"wdcode": "zb", "valuecode": "' + zb + '"}, {"wdcode": "sj", "valuecode": "' + sj + '"}]',
    }
    r = requests.get(host, params=payload, verify=False)
    logging.debug("access url: %s", r.url)
    time.sleep(2)
    logging.debug(r.text)
    resp = r.json()
    if resp["returncode"] == 200:
        return resp["returndata"]
    else:
        logging.error("error: %s", resp)
        return None
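Putting the two steps together, a minimal driver might look like the sketch below. The logging setup and the directory creation are assumptions added for convenience, and init_tree comes from the first snippet:

import logging
import os

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # make sure the output directory exists (assumed convenience step)
    os.makedirs(data_dir, exist_ok=True)
    # crawl the tree menu once, then reuse the serialized copy
    if not os.path.exists(tree_data_path):
        init_tree(tree_data_path)
    # collect annual data from 1978 onward
    data(sj="1978-")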

Code description:

Tree_data_path = ". / tree.data": this is the tree menu data serialized in the first step

The collected data generates the corresponding csv according to the number of each menu in the tree menu.

Only each leaf node of the tree menu has data, and non-leaf nodes do not need to collect data.

Call the data function to collect data. The default is to collect data from 1978.
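To sanity-check one output file, it can be loaded with pandas. The file name A0101.csv below is a hypothetical leaf-node id used purely for illustration; real names come from the tree menu:

import pandas as pd

# "A0101.csv" is a hypothetical leaf-node id; real names come from the tree menu.
df = pd.read_csv("./data/A0101.csv")
print(df.columns.tolist())  # ['sj', 'sjCN', 'zb', 'zbCN', 'value']
print(df.head())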

Collection result

This collection run produced 1917 different categories of data, one csv file per category.
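A quick way to confirm that count, assuming one csv file per collected category in data_dir:

import os

data_dir = "./data"
# one csv file is written per collected category
csv_files = [f for f in os.listdir(data_dir) if f.endswith(".csv")]
print(len(csv_files))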

At this point, the study of the data collection process of the National Bureau of Statistics is complete. Hopefully it has answered your questions. Combining theory with practice is the best way to learn, so go and try it yourself! If you want to keep learning more related knowledge, please continue to follow the site; the editor will keep working hard to bring you more practical articles!
