How to get data in R language

2025-01-16 Update From: SLTechnology News&Howtos

Shulou (Shulou.com), 05/31

This article explains how to obtain data in the R language. The explanation is simple, clear, and easy to follow. Let's walk through the editor's approach and learn how to obtain data in R.

Today I am only sharing the code for data acquisition. To keep the project organized (honestly, partly to show off), I used the Create Project menu in RStudio for the first time to set up a local project repository. My R code used to be too casual, written without regard to whether anyone else could understand it, and I have already paid dearly for that at work.

Because the site has a second-level list page, the first step is to crawl the links for each year, and then traverse those links to grab the document for each year.

Perhaps because of my liberal-arts way of thinking, I am not used to writing a nested for loop directly (it is uncomfortable to read), so when a task needs two levels of traversal I usually break it into two small steps:

1. Traverse the years and collect the link to the home page of each year's government work report:

## !/user/bin/env RStudio 1.1.423
## -*- coding: utf-8 -*-

## Pages_links Acquisition
## Load the required packages:
library("rvest")
library("stringr")
library("Rwordseg")
library("wordcloud2")
library("dplyr")

# Main URL
# [lost in extraction: everything between the variable name and the first pipe, including the URL]
url <- ... %>% html_nodes("p") %>% html_text()

# Extract year and link information:
# [lost in extraction: the left-hand end of each pipeline below]
Base <- ... %>% html_nodes("div.history_report") %>% html_nodes("a")
Year <- ... %>% html_text(trim = TRUE) %>% as.numeric()
Links <- ... %>% html_nodes("a") %>% html_attr("href") %>% str_trim("both")

# Merge into a data frame:
# [lost in extraction: the data frame construction and the opening of the loop over years; only the tail of the loop body survives]
Reports_links <- ...
  ... %>%
    html_nodes("td.p1,tr > td,div.pages_content") %>%
    html_text("both") %>%
    cat(file = sprintf("./data/Corpus/%d.txt", i))
}
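Since part of the listing above did not survive, here is a rough sketch of the first step only: collecting the year and link of each report into a data frame. It assumes a list page whose year links sit inside div.history_report, as the selectors above suggest. The inline HTML, the example.com addresses, and the name list_page are made up for illustration and are not the author's URL or code.

```r
library(rvest)

# Hypothetical stand-in for the list page; the real page and URL are not shown in the article.
list_page <- read_html('
  <div class="history_report">
    <a href=" https://example.com/report/2017.htm ">2017</a>
    <a href="https://example.com/report/2016.htm">2016</a>
  </div>')

# Select every year link inside the history block.
Base <- list_page %>% html_nodes("div.history_report") %>% html_nodes("a")

# The link text is the year; the href attribute is the target page.
Year  <- Base %>% html_text(trim = TRUE) %>% as.numeric()
Links <- Base %>% html_attr("href") %>% trimws()

# One row per year, ready to be traversed in the second step.
Reports_links <- data.frame(Year = Year, Links = Links, stringsAsFactors = FALSE)
print(Reports_links)
```

Keeping the links in a data frame is what lets the second step be a single pass over rows instead of an inner loop.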

The listing above relies on fairly basic CSS selectors, used together with rvest, to extract the document text. If you are not familiar with this topic, catch up quickly with the web data acquisition notes in the menu.
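For readers new to CSS selectors, a small sketch of how the grouped selector "td.p1,tr > td,div.pages_content" behaves may help. The page fragment below is invented for illustration; only the selector string comes from the article.

```r
library(rvest)

# Made-up page fragment; the class names mirror the selector used in the article.
page <- read_html('
  <div class="pages_content"><p>Body text A</p></div>
  <table><tr><td class="p1">Body text B</td></tr></table>
  <div class="sidebar">Not wanted</div>')

# A comma means "match any of these alternatives";
# "tr > td" matches td elements that are direct children of a tr.
nodes <- page %>% html_nodes("td.p1,tr > td,div.pages_content")

# The td matches two alternatives but is returned only once.
texts <- nodes %>% html_text(trim = TRUE)
print(texts)
```

Grouping several alternatives in one selector is presumably a way to cope with report pages from different years that use different layouts, so one call can cover all of them.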

Instead of writing an explicit loop, I use the multi-process parallel crawling approach provided by the foreach package to handle the repeated traversal (the volume here is too small to show the advantage of parallelism, but the code is more concise than writing out the loop).

system.time({
  if (!dir.exists("./data/Corpus")) {
    dir.create("./data/Corpus")
  }
  cl
  # [the listing breaks off here in the source]
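The listing is cut off in the source, so what follows is only a sketch of how a foreach-based parallel pass might look, not the author's code. It assumes the doParallel backend (the article names only foreach), writes to a temporary folder instead of ./data/Corpus, and parses made-up inline HTML rather than downloading real pages; the names out_dir, pages, and written are hypothetical.

```r
library(foreach)
library(doParallel)

# Hypothetical output folder under the session temp directory.
out_dir <- file.path(tempdir(), "Corpus")
if (!dir.exists(out_dir)) dir.create(out_dir, recursive = TRUE)

# Made-up stand-ins for the downloaded pages, keyed by year.
pages <- c("2016" = "<div class='pages_content'>Report text 2016</div>",
           "2017" = "<div class='pages_content'>Report text 2017</div>")

cl <- makeCluster(2)      # start two worker processes
registerDoParallel(cl)    # make them the backend for %dopar%

elapsed <- system.time({
  # Each iteration runs on a worker: parse one page, extract the text, write one file.
  written <- foreach(yr = names(pages), .packages = "rvest", .combine = c) %dopar% {
    path <- file.path(out_dir, sprintf("%d.txt", as.integer(yr)))
    read_html(pages[[yr]]) %>%
      html_nodes("div.pages_content") %>%
      html_text(trim = TRUE) %>%
      cat(file = path)
    path
  }
})

stopCluster(cl)           # always release the workers
print(written)
```

The .packages argument loads rvest on each worker, which is needed because worker processes start with a clean session. With only a handful of pages, the cost of starting the cluster can outweigh any speedup, which matches the remark above about the small data volume.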
