2025-01-19 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article explains how to handle missing values in R when working with large datasets. The editor found it practical and is sharing it here, in the hope that you will take something away from it.
Once you have the data and have analyzed the requirements clearly, do not rush into statistics and models. Clean the data first. Real data often contains missing values, outliers, incorrect values and so on. This article covers missing values first, so that later analysis is sounder and modeling is more accurate and efficient.
First, check whether the dataset has missing values
R uses NA to represent a missing value. is.na() identifies missing values and returns TRUE or FALSE for each element. Because the logical values TRUE and FALSE are equivalent to 1 and 0 respectively, sum() and mean() can be applied to the result to get the number and the proportion of missing values in a dataset.
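As a minimal sketch of the point above, the toy vector below (made up for illustration, not part of the sleep data) shows how is.na() yields logical values that sum() and mean() turn into a count and a proportion.

```r
# Toy vector with two missing values (illustrative data, not from the article)
x <- c(3.2, NA, 5.1, NA, 4.8)

# is.na() returns one logical value per element
missing_flags <- is.na(x)

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of NAs
n_missing <- sum(missing_flags)

# ... and mean() gives the proportion of NAs
prop_missing <- mean(missing_flags)

print(missing_flags)
print(n_missing)
print(prop_missing)
```

The same two calls work unchanged on a whole data frame, because is.na() on a data frame returns a logical matrix.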
Load the R package and its built-in dataset
library(VIM) # load the VIM package
data(sleep, package = "VIM") # sample sleep dataset
1) Check the number and the proportion of missing values in the dataset as a whole
sum(is.na(sleep))
mean(is.na(sleep))
2) Check the number and the proportion of missing values in a specific variable (column)
sum(is.na(sleep$Sleep))
mean(is.na(sleep$Sleep))
3) Check the proportion of rows in the dataset that contain missing values
mean(!complete.cases(sleep))
4) List the rows with no missing values
sleep[complete.cases(sleep), ] # uses the complete.cases() function
na_rows <- which(rowSums(is.na(sleep)) > 0); sleep[-na_rows, ] # same effect as above
5) List the rows with one or more missing values
sleep[!complete.cases(sleep), ]
na_rows <- which(rowSums(is.na(sleep)) > 0); sleep[na_rows, ]
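The two ways of selecting rows described in items 4) and 5) can be compared side by side. The sketch below uses a small made-up data frame rather than the sleep data, and the variable names are illustrative. It also guards one edge case: negative indexing with an empty index vector returns no rows at all.

```r
# Toy data frame (illustrative only); each of rows 2-4 has exactly one NA
df <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, "z"),
  c = c(TRUE, FALSE, TRUE, NA)
)

# Approach 1: logical mask from complete.cases()
complete_rows <- df[complete.cases(df), ]
incomplete_rows <- df[!complete.cases(df), ]

# Approach 2: indices of the rows that contain at least one NA
na_rows <- which(rowSums(is.na(df)) > 0)

# df[-integer(0), ] would select zero rows, so guard the empty case
complete_rows_2 <- if (length(na_rows) > 0) df[-na_rows, ] else df
incomplete_rows_2 <- df[na_rows, ]

print(complete_rows)
print(incomplete_rows)
```

The complete.cases() form needs no such guard, which is one reason to prefer it when a dataset may turn out to have no missing values.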
Second, explore the missing values
2.1 Use the mice package to show the overall pattern of missing data
library(mice)
md.pattern(sleep)
   BodyWgt BrainWgt Pred Exp Danger Sleep Span Gest Dream NonD
42       1        1    1   1      1     1    1    1     1    1  0
 9       1        1    1   1      1     1    1    1     0    0  2
 3       1        1    1   1      1     1    1    0     1    1  1
 2       1        1    1   1      1     1    0    1     1    1  1
 1       1        1    1   1      1     1    0    1     0    0  3
 1       1        1    1   1      1     1    0    0     1    1  2
 2       1        1    1   1      1     0    1    1     1    0  2
 2       1        1    1   1      1     0    1    1     0    0  3
         0        0    0   0      0     4    4    4    12   14 38
Here '1' represents an observed value and '0' represents a missing value. In the first column on the left, '42' means that 42 rows have no missing values, and the first '9' means that 9 rows are missing both Dream and NonD. The last column gives the number of variables missing in each pattern. The last row gives the number of missing values for each variable (column), and 38 is the total number of missing values. The accompanying plot conveys the same information.
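To make the layout of that table concrete, here is a rough base R sketch that tabulates the same kind of information by hand on a made-up data frame. It is not how mice implements md.pattern(), and the variable names are illustrative.

```r
# Toy data frame (illustrative only)
df <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(1, NA, 3, NA, 5),
  c = c(NA, NA, 3, NA, 5)
)

# 1 = observed, 0 = missing, mirroring the coding described above
observed <- (!is.na(df)) + 0L

# Collapse each row into a pattern string such as "1 0 0",
# then count how many rows share each pattern
pattern_key <- apply(observed, 1, paste, collapse = " ")
pattern_counts <- table(pattern_key)

# Number of missing values per column, and the grand total
missing_per_column <- colSums(is.na(df))
total_missing <- sum(missing_per_column)

print(pattern_counts)
print(missing_per_column)
print(total_missing)
```

The pattern counts correspond to the left-hand column of the md.pattern() output, and the per-column counts and total correspond to its last row.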
2.2 Use the VIM package to show missing data
1) Show the overall missingness of the sleep dataset
library("VIM")
aggr(sleep, prop = FALSE, numbers = TRUE)
2) Show the missingness of the variables of interest in the sleep dataset
marginplot(sleep[c("Sleep", "Dream")], pch = c(20), col = c("darkgray", "red", "blue"))
Third, handle the missing values
Once the missing values are well understood, the rows that contain NA, and some of the columns that contain NA, can be handled according to the size of the dataset and whether a given column is an important predictor.
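One way to act on that idea is sketched below, under stated assumptions: the 50% cut-off, the data frame and all names are hypothetical and chosen only for illustration. Columns that are mostly missing are dropped first, then any remaining incomplete rows are dropped.

```r
# Toy data frame (illustrative only)
df <- data.frame(
  id             = 1:6,
  mostly_missing = c(NA, NA, NA, NA, 5, NA),
  few_missing    = c(1.5, NA, 2.5, 3.0, 3.5, 4.0)
)

# Hypothetical cut-off: drop columns with more than half their values missing
max_na_share <- 0.5

# Share of missing values in each column
na_share <- colMeans(is.na(df))
kept <- df[, na_share <= max_na_share, drop = FALSE]

# Then drop the rows that still contain an NA
cleaned <- kept[complete.cases(kept), , drop = FALSE]

print(na_share)
print(cleaned)
```

A column that is an important predictor might be kept despite a high missing share, in which case it would be excluded from the threshold test by hand.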
3.1 Delete missing values
1) Delete all rows and columns in the dataset that contain NA
Sleep_noNA
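One base R function that drops every row containing an NA is na.omit(). Whether the step above uses it is an assumption. A minimal sketch on a made-up data frame:

```r
# Toy data frame (illustrative only)
df <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(10, 20, NA, 40)
)

# na.omit() keeps only the rows with no NA in any column
df_no_na <- na.omit(df)

# The positions of the dropped rows are recorded in the "na.action" attribute
dropped <- attr(df_no_na, "na.action")

print(df_no_na)
print(as.integer(dropped))
```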