How to use the tidyr package in R

2025-01-18 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/02

This article explains how to use the tidyr package in R. The explanation is kept simple and easy to follow; work through it step by step to learn how the package is used.

Preface

The reshape2 package makes it easy to convert data between long and wide formats during data processing. The tidyr package can be seen as an evolution of reshape2. It was written by Hadley Wickham, chief scientist at RStudio and one of the most influential figures in the R community. tidyr is often used together with dplyr and is well placed to replace reshape2, so it is a package worth getting to know.

There are four commonly used functions in the tidyr package, which are:

gather(): converts wide data to long data, collapsing several columns into key-value rows

spread(): converts long data to wide data, spreading the rows of a key-value pair out into columns

unite(): merges multiple columns into one column

separate(): splits one column into multiple columns

Next, we study these four functions in detail and, building on that, look at some other practical functions in the tidyr package.

1. gather() function

Load the required packages:

> library(dplyr)
> library(tidyr)

As mentioned earlier, the gather() function converts wide data into long data. Its call signature is:

> gather(data =, key =, value =, ..., na.rm =, convert =, factor_key =)
# key: name of the new column that will hold the old column names of the original data
# value: name of the new column that will hold the values stored under those old columns
# ...: the columns to gather, chosen according to actual needs
# na.rm: logical; whether to drop rows whose value is missing
# convert: logical; whether to convert the data type of the key column
# factor_key: logical; if FALSE the key is stored as a character string, otherwise as a factor whose levels keep the original column order
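To make the key, value, na.rm and factor_key arguments concrete, here is a minimal sketch on a tiny data frame. The data frame scores and its column names are hypothetical and were made up for illustration; they do not come from the article.

```r
library(tidyr)

# Hypothetical wide data: one row per student, one column per subject.
scores <- data.frame(
  name    = c("Ann", "Bob"),
  math    = c(90, NA),
  physics = c(85, 70)
)

# Gather the two subject columns into key-value pairs.
# na.rm = TRUE drops the row whose value is NA;
# factor_key = TRUE stores the key as a factor whose levels follow
# the original column order.
long <- gather(scores, key = "subject", value = "score",
               math, physics, na.rm = TRUE, factor_key = TRUE)

print(long)
print(levels(long$subject))
```

Because Bob has no math score, the gathered result should contain three rows rather than four.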

First, let's look at the raw data:

> head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

Use the gather() function to reshape the data. ('%>%' in the code is the pipe operator, which is why we load the dplyr package. For more information on the pipe, see R Language Learning: dplyr package (11).)

> iris %>%
+   gather(key = var1, value = var2, 1:4, na.rm = FALSE) %>%
+   arrange(desc(var2)) %>%
+   head(3)
    Species         var1 var2
1 virginica Sepal.Length  7.9
2 virginica Sepal.Length  7.7
3 virginica Sepal.Length  7.7

2. spread() function

The spread() function converts long data into wide data, spreading the values of one column out into new columns. Its call signature is:

> spread(data =, key =, value =, fill =, convert =, drop =)
# key: the column whose values become the new column names
# value: the column whose values are distributed into the corresponding cells
# fill: a value used to replace missing cells
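The effect of fill is easiest to see when one key-value combination is absent. The following is a small sketch under that assumption; the data frame long and its columns are hypothetical and not taken from the article.

```r
library(tidyr)

# Hypothetical long data in which the (Bob, math) combination is absent.
long <- data.frame(
  name    = c("Ann", "Ann", "Bob"),
  subject = c("math", "physics", "physics"),
  score   = c(90, 85, 70)
)

# Spread subject into columns. The cell that has no matching row
# is filled with 0 instead of NA.
wide <- spread(long, key = subject, value = score, fill = 0)

print(wide)
```

Without fill, the missing cell would simply be NA.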

We use the economics dataset (shipped with the ggplot2 package) to try out the function.

# the economics dataset lives in ggplot2
> library(ggplot2)
# extract the first three columns of the original dataset
# (assignment reconstructed from the comment above; its right-hand side is missing in the source)
> data <- economics[, 1:3]
> head(data, 3)
# A tibble: 3 × 3
        date   pce    pop
1 1967-07-01 507.4 198712
2 1967-08-01 510.5 198911
3 1967-09-01 516.3 199113

# since no suitable long data is at hand, we first use gather() to generate long data,
# and then assume we want to turn its rows back into columns
> data %>%
+   gather(key = var1, value = var2, -date) %>%
+   head(3)
# A tibble: 3 × 3
        date  var1  var2
1 1967-07-01   pce 507.4
2 1967-08-01   pce 510.5
3 1967-09-01   pce 516.3
# -date: the date column is excluded from the gather and stays unchanged

# convert the long data back to wide data with spread():
# the values of var1 become the new column names, and the values of var2
# are distributed into those columns as the corresponding observations
> data %>%
+   gather(key = var1, value = var2, -date) %>%
+   spread(key = var1, value = var2) %>%
+   head(3)
# A tibble: 3 × 3
        date   pce    pop
1 1967-07-01 507.4 198712
2 1967-08-01 510.5 198911
3 1967-09-01 516.3 199113

3. unite() function

The unite() function merges multiple columns of a data frame into one column. Its call signature is:

> unite(data =, col =, ..., sep =, remove =)
# col: name of the new column
# ...: the columns to combine
# sep: separator placed between the values in the new column
# remove: logical; whether to remove the combined input columns from the result (FALSE keeps them)

# data preparation: the code that builds date, hour, min, second, event and data is missing from the source
> head(data, 3)
        date hour min second event
1 2016-11-01   23  59     11     y
2 2016-11-02   21  12      4     u
3 2016-11-03    2  55     42     I
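The code that built this data frame did not survive, so the sketch below is only one way to obtain something comparable. It assumes the values of the three printed rows and is not the author's original code. It also shows what remove = FALSE does; the column name time is chosen here for illustration.

```r
library(tidyr)

# Assumption: values copied from the three printed rows; the article's
# own construction code is not available.
data <- data.frame(
  date   = c("2016-11-01", "2016-11-02", "2016-11-03"),
  hour   = c(23, 21, 2),
  min    = c(59, 12, 55),
  second = c(11, 4, 42),
  event  = c("y", "u", "I"),
  stringsAsFactors = FALSE
)

# Combine hour, min and second into a new column called time.
# remove = FALSE keeps the three input columns next to the new one.
kept <- unite(data, col = "time", hour, min, second,
              sep = ":", remove = FALSE)

print(kept)
```

With the default remove = TRUE, only date, time and event would remain.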

Here, we use the unite () function to merge the date and time values into one column.

# date and hour are joined with a space
# datehour and the remaining time values are joined with ':'
> data %>%
+   unite(datehour, date, hour, sep = ' ') %>%
+   unite(datetime, datehour, min, second, sep = ':') %>%
+   head(3)
            datetime event
1 2016-11-01 23:59:1     y
2 2016-11-02 21:12:4     u
3 2016-11-03 2:55:42     I

4. separate() function

After learning the unite() function, the separate() function is easy to understand. It does the opposite of unite(): it splits one column of a data frame into multiple columns at a separator, and is commonly used to split date-time values. Its call signature is:

> separate(data =, col =, into =, sep =, remove =,
+          convert =, extra =, fill =, ...)
# col: the column to split
# into: names of the new columns created by the split
# sep: separator
# remove: logical; whether to remove the column that was split
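The convert argument is not described above, so here is a short sketch of what it changes. The data frame df is hypothetical and made up for illustration.

```r
library(tidyr)

# Hypothetical data: one character column holding "hour:min:second".
df <- data.frame(time = c("23:59:11", "2:55:42"), stringsAsFactors = FALSE)

# Split time into three columns.
# convert = TRUE converts the new columns from character to a more
# specific type (integers here); remove = FALSE keeps the time column.
parts <- separate(df, col = time, into = c("hour", "min", "second"),
                  sep = ":", remove = FALSE, convert = TRUE)

str(parts)
```

With the default convert = FALSE, hour, min and second would stay character columns.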

We use the time data set obtained in the previous section, define it as data_unite, and split it.

# split date and time first, then split the time
> data_unite %>%
+   separate(datetime, c('date', 'time'), sep = ' ') %>%
+   separate(time, c('hour', 'min', 'second'), sep = ':') %>%
+   head(3)
        date hour min second event
1 2016-11-01   23  59      1     y
2 2016-11-02   21  12      4     u
3 2016-11-03    2  55     42     I

5. Simple imputation of missing values

> library(readxl)
# data is read from an Excel file here; the read call is missing from the source
> data
# A tibble: 8 × 2
   type   num
1     a    75
2     b    72
3  <NA>    66
4     a    NA
5     c    69
6     b    65
7     a    72
8     c    NA

The data above has missing values in both the type column and the num column. For the missing type value (categorical) we substitute the mode; for the missing num values (numeric) we substitute the mean. The median or another statistic could be used instead, depending on the situation.
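The following sketch shows one way to carry out this imputation with tidyr's replace_na(). It is an assumption about how the steps could be written, not the article's original code: the data frame is rebuilt by hand from the printed values rather than read from a file, and the names num_mean and type_mode are taken from the text.

```r
library(tidyr)

# Assumption: values transcribed from the printed tibble; the article
# reads them from an Excel file that is not available here.
data <- data.frame(
  type = c("a", "b", NA, "a", "c", "b", "a", "c"),
  num  = c(75, 72, 66, NA, 69, 65, 72, NA),
  stringsAsFactors = FALSE
)

# Mean of the non-missing numeric values.
num_mean <- mean(data$num, na.rm = TRUE)

# Mode: the most frequent non-missing value of type
# (table() ignores NA by default).
type_mode <- names(which.max(table(data$type)))

# replace_na() takes a named list with one replacement value per column.
filled <- replace_na(data, list(type = type_mode, num = num_mean))

print(filled)
```

Swapping mean() for median() gives the median-based variant mentioned in the text.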

# compute num_mean (the mean of num) and type_mode (the mode of type), then use them
# to fill the missing values in data; the code for these steps is missing from the source
> data
# A tibble: 8 × 2
   type      num
1     a 75.00000
2     b 72.00000
3     a 66.00000
4     a 69.83333
5     c 69.00000
6     b 65.00000
7     a 72.00000
8     c 69.83333

Thank you for reading. That concludes this introduction to using the tidyr package in R. You should now have a better understanding of how the package works, but how best to apply it in a specific case still needs to be verified in practice.
