In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-01-16 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >
Share
Shulou(Shulou.com)06/02 Report--
This article mainly introduces the relevant knowledge of "how to use dplyr software package in R language to clean and convert data". The editor shows you the operation process through an actual case. The operation method is simple, fast and practical. I hope this article "how to use dplyr software package in R language to clean and convert data" can help you solve the problem.
Dplyr package is one of the most powerful and popular software packages in R. The software package is written by the most popular R programmer, Hadley Wickham, who has written many useful R packages, such as ggplot2,tidyr. What is dplyr?
Dplyr is a powerful R software package for processing, cleaning and summarizing unstructured data. In short, it makes data exploration and data manipulation in R easy and fast.
What's so special about dplyr?
The software package "dplyr" contains many main data manipulation functions, such as applying filters, selecting specific columns, sorting data, adding or deleting columns, and aggregating data. Another most important advantage of this package is that it is very easy to learn and use the dplyr function. It's also easy to recall these features. For example, filter () is used to filter rows. The processing speed of dplyr function is faster than that of basic R function. This is because the dplyr function is written in a computationally efficient way. They are also more syntactically stable and support data frames better than vectors. The following are the methods and uses in the package:
Dplyr FunctionDescriptionEquivalent SQLselect () Selecting columns (variables) SELECTfilter () Filter (subset) rows.WHEREgroup_by () Group the dataGROUP BYsummarise () Summarise (or aggregate) data-arrange () Sort the dataORDER BYjoin () Joining data frames (tables) JOINmutate () Creating New VariablesCOLUMN ALIASdplyr use filter Series: filter out the data you want # install and load packages # demonstrate directly using built-in iris and mtcars datasets # iris dataset Filter samples with Species "setosa" and Sepal.Length greater than 5 # "&" can also be replaced with "," > filter (iris) Species = = "setosa" & Sepal.Length > = 5.5) Sepal.Length Sepal.Width Petal.Length Petal.Width Species1 5.8 4.0 1.2 0.2 setosa2 5.7 4.4 1.5 0.4 setosa3 5.7 3.8 1.7 0.3 setosa4 5.5 4.2 1.4 0.2 setosa5 5.5 3.5 1.3 0.2 setosa
Filter supports the following forms of judgment:
Relationship type: =, =,! =, is.na (),! is.na ()
&, |,!, xor () # Operation judgment on vector set
Between (),% in%, near () # sqrt (2) ^ 2 = = 2 returns FALSE,near (sqrt (2) ^ 2, 2) then TRUE
All_vars () and any_vars () # appear in filter_all, filter_at and filter_if as judgment conditions
Filter series, and several variogram functions: filter_all, filter_at, filter_if
# filter any sample filter_all (mtcars, any_vars (. ) # filter variables end with "d" and the variable "% 2" equals 0filter_at (mtcars, vars (starts_with ("d")), any_vars ((. % 2) = = 0) # filter variables round down = = the original variable value, and the value of this part of the variable! Sample set filter_if (mtcars, ~ all (floor (.) = =.), all_vars (. ! = 0))
More usage:
Mtcars% >% filter_all (all_vars (. > 150)% >% head () # filter all rows with variables greater than 150. the result is empty mtcars% >% filter_all (any_vars (. > 150)% >% head () # filter rows with variables greater than 150. for all columns with variable names beginning with d Filter all rows that have variables divisible by 2 mtcars% >% filter_at (vars (starts_with ("d")), any_vars ((.% 2) = = 0) # for columns where variables are all integers Filter all rows with non-zero variables mtcars% >% filter_if (~ all (floor (.) = =.), all_vars (.! = 0)) # support purrr syntax filtering mtcars% >% filter_at (vars (hp, vs), ~.% 2 = = 0) # filter all row select functions where hp and vs variables are even: keep only the columns you need and support changing variable names
Usage: select (.data, …)
Unlike filter explained earlier, select filters variables, while filter filters sample sets.
Application scenario: suppose the data is stored in a wide table (such as a table with 100 variables), and you only need a few of them. The key to select is "…" The judgment condition of
# mtcars dataset Filter mpg, cyl, wt, vs,4 variable data # mtcars [, c ("mpg", "cyl", "wt", "vs")] to achieve the same function > select (mtcars,c ("mpg", "cyl", "wt", "vs")) mpg cyl wt vs 21.06 2.620 021.06 2.875 022.8 4 2.320 Tips:select also supports ":" and "-" operation # for example: select (mtcars) C ("mpg": "vs")), indicates continuous column selection # select (mtcars,- "mpg") deletes the mpg column
The above gives people the impression that the real power of select is that it supports the following conditional judgments without going through select and using data boxes and vector operations:
Inclusion relationships: starts_with (), ends_with (), one_of ()
Matching relationship: matches (), contains (), num_range ()
# inclusion relationship: in Iris, filter variables that begin with Petal or end with Width > select (iris, starts_with ("Petal"), ends_with ("Width")) # Petal.Length Petal.Width Sepal.Width# 1.4 0.2 3. (data omission) # Tips:starts_with ("Petal"), ends_with ("Width"), two conditions are not "and" relations, but "or" inclusion relations: often need to extract changing data sets, the use of one_of is perfect # extract mtcars "mpg", "cyl", "wt", "vs", "vss" > var1 select (mtcars, one_of (var1)) # mpg cyl wt vs# 21.06 2.620 "21.06 2.875" 2.84 2.320. (omit data) # Warning message: Unknown columns: `vss` # Tips: select cannot find the quota variable, the system will return warning # matching relationship: filter Iris dataset The variable name with "wid" in the variable name > select (iris, matches (".wid.")) > select (iris, contains ("wid")) # Sepal.Width Petal.Width# 3.5 0. 0. 0. 0. Num_range can efficiently match the # random data box with variable names similar to x01, x02, x03 by X1~X5 Y composition: df select (df, c (var1 = "y", num_range ("x", 1:3)) # var1 x1 x2 xrooma 0.96631605 0.29815009 0.654541roomb 0.61046600 0.76547552 0.824719 roomc 0.70510879 0.46636723 0.447258. (data omission) mutate series: calculate the data to produce new data
Usage: mutate (.data, …)
The use of mutate mainly depends on "…" Changes in the formula to generate new variables
Mutate supports the following formulas:
Common calculation methods such as +, -, *, /,%,% |%, etc.
Lead (), lag ()
Dense_rank (), min_rank (), percent_rank (), row_number (), cume_dist (), ntile ()
Cumsum (), cummean (), cummin (), cummax (), cumany (), cumall ()
Na_if (), coalesce ()
If_else (), recode (), case_when ()
Let's start with the "rank" series, which is mainly used to divide rankings, grades, percentages, density, and so on.
# simple +, -, *, /,%,% |% can add new data columns > mutate (mtcars,aa=hp-drat) Bb=mpg*cyl) mpg cyl disp hp drat wt qsec vs am gear carb aa bb1 21.06 160.0 1103.90 2.620 16.46 0 1 4 4 106.10 126.02 21.06 160.0 110 3.90 2.875 17.02 0 1 4 4 106.10 126.03 22.8 4 108.0 93 3.85 2.320 18.61 11 4 1 89.15 91.24 21.4 6 258. 0 110 3.08 3.215 19.44 10 3 1 106.92 128.45 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 171.85 149.66 18.1 6 225.0 105 2.76 3.460 20.22 10 3 1 102.24 108.67 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 241.79 114.4# If you only want to keep these new variables, you can: > transmute (mtcars Aa=hp-drat,bb=mpg*cyl) aa bb1 106.10 126.02 106.10 126.03 89.15 91.24 106.92 128.45 171.85 149.66 102.24 108.6#percent_rank Sort the identifiers of x% pipe according to the percentage of [0Get 1] # for example, divide the identifiers of x% pipe according to the value of x, in the select function Everything () can be used to change the variable order > iris% >% mutate (Length_rank = percent_rank (Sepal.Length))% >% select (Length_rank). Everything () # Length_rank Sepal.Length Sepal.Width Petal.Length Petal.Width Species# 0.21476510 5.1 3.5 1.4 0.2 setosa# 0.10738255 4.9 3.0 1.4 0.2 setosa#. (data omission) # row_number (), not only can be used to sort the number of rows you want, but also can be used to get the number of rows mutate (mtcars, row_number () = = 1L) # newly generated variable, used to determine whether it is the first row, return TRUE or FALSE mtcars% >% filter (between (row_number (), 1,10)) # filter 1-10 rows through row_number. Somewhat similar to top_n (10) # ntile, it cuts the dataset into N blocks and returns specific values. It belongs to ntile (runif (10), 5) # [1] 1 2 4 5 5 3 4 2 3. To some extent, ntile can be used to divide the training set and test set (similar to sample function) # ind = 0.
< 0 的情况下, y + 1 df % mutate( xy = if_else(x >= 0, y-1, yellow1, y) # xy xy#-5 0.7760150 1.776015-4 0.9310976 1.931097 case_when, the judgment conditions can be more diversified # case_when, and case...when... in SQL Same # same as switch in C language x arrange (mtcars,hp,mpg Cyl) mpg cyl disp hp drat wt qsec vs am gear carb1 30.4 4 75.7 52 4.93 1.615 18.52 1 14 22 24.4 146.7 62 3.69 3.190 20.00 1 0 4 23 33.9 4 71.1 65 4.22 1.835 19.90 1 14 27.3 4 79.066 4.08 1.935 18.90 1 14 15 32.4 4 78.7 66 4.08 2.200 19.47 11 4 16 26.0 4 120.3 91 4.43 2.140 16.70 01 5 27 22.8 4 108.0 93 3.85 2.320 18.61 11 4 18 22.8 4 140.8 95 3.92 3.150 22.90 10 4 29 21.5 4 120.1 97 3.70 2.465 20.01 10 3 110 18.1 6 225.0 105 2.76 3.460 20.22 10 3 1
Multi-column sort, descending with desc ()
> arrange (mtcars,desc (hp), mpg Cyl) mpg cyl disp hp drat wt qsec vs am gear carb1 15.08 301.0 3353.54 3.570 14.60 01 5 82 15.8 351.0 264 4.22 3.170 14.50 01 5 43 13.3 350.0 245 3.73 3.840 15.41 00 3 44 14.3 360.0 2.45 3.21 3.570 15.84 00 3 45 14.7 8 440.0 230 3.23 5.345 17.42 00 3 46 10.4 8 460.0 215 3.00 5.424 17.82 00 3 47 10.4 8 472.0 205 2.93 5.250 17.98 00 3 48 15.2 8 275.8 180 3.07 3.780 18.00 00 3 39 16.4 8 275.8 180 3.07 4.070 17.40 00 3 310 17.3 8 275.8 1803.07 3.730 17.60 003 3sample_n () and sample_frac () randomly selects a subset of data by row
Sample_n () and sample_frac () is selected randomly according to the fixed number of rows, and one is selected according to the proportion of the number of rows.
> sample_n (mtcars 10) mpg cyl disp hp drat wt qsec vs am gear carb1 18.7 8 360.0 175 3.15 3.440 17.02 00 3 22 14.3 360.0 245 3.21 3.570 15.84 00 3 43 21.5 4 120.1 97 3.70 2.465 10 3 14 32.4 4 78.7 66 4.08 2.200 19.47 1 14 26.0 4 120.3 91 4.43 2.140 16.70 0 15 26 19.2 6 167.6 123 3.92 3.440 18.30 10 4 47 17.8 6 167.6 123 3.92 3.440 18.90 10 4 48 27.3 4 79.0 66 4.08 1.935 18.90 1 14 19 15.8 8 351.0 264 4.22 3.170 14.50 0 15 410 15.5 8 318.0 150 2.76 3.520 16.87 00 32 2 > sample_frac (mtcars Mpg cyl disp hp drat wt qsec vs am gear carb1 10.4 8 472 205 2.93 5.25 17.98 0 3 42 14.3 8 360 245 3.21 3.57 15.84 00 3 43 13.3 8 350 245 3.73 3.84 15.41 0034% data pipeline example data (iris) data (mtcars) iris% >% head () mtcars% >% head () # Filter variables iris% >% select (starts_with ("Sepal"))% >% head () # more than one filter condition iris% >% select (- starts_with ("Sepal"))% >% head () # filter variables named variable iris% >% select (ends_with ("Length"))% >% head () iris% >% select (- ends_with (") Length ")% >% head () # move the variable Species to the front iris% >% select (Species Everything ()% >% head () # Anti-filtering Filter other variables except Sepal.Length variable iris% >% select (- Sepal.Length)% >% head () # move variable Species to the last face iris% >% select (everything (), Species)% >% head () iris% >% select (- Species, Species)% >% head () # incorrect usage The result is empty iris% >% select (Species,-Species)% >% head () iris% >% select (contains ("etal"))% >% head () iris% >% select (matches (".t.")% >% head () # filter name, t in the middle variable. Iris% >% select (last_col ())% >% head () # Last variable iris% >% select (last_col (offset = 2))% >% head () # penultimate variable iris% >% select (one_of (c ("Petal.Length") "Petal.Width"))% >% head () iris% >% group_by (Species)% >% select (group_cols ())% >% distinct ()% >% head () # get grouping variable name df% as_tibble () head (df) df% >% select (V4:V7)% >% head () # filter V4 column to V7 column df% >% select (num_range ("V") 4:7)% >% head () # result is the same as before # column name rename iris% >% select (petal_length = Petal.Length)% >% head () # rename iris% >% select (obs = starts_with ('S'))% >% head () # multiple variable renaming on "how to use dplyr package in R language to clean and convert data" ends here Thank you for your reading. If you want to know more about the industry, you can follow the industry information channel. The editor will update different knowledge points for you every day.
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.