How to deal with missing values in R when working with big data

2025-07-19 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to handle missing values in R when working with large datasets.

Once you have the data and have clarified the requirements, do not rush into statistics and models. Clean the data first. Real data often contains missing values, outliers, erroneous values, and so on. This article covers how to handle missing values, so that later analysis and modeling are more accurate and efficient.

1. Check whether the dataset contains missing values

R uses NA to represent a missing value. is.na() identifies missing values and returns TRUE or FALSE for each element. Because the logical values TRUE and FALSE are treated as 1 and 0 in arithmetic, applying sum() and mean() to the result gives the number and the proportion of missing values in a dataset.
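As a minimal sketch of the point above, the following uses a made-up vector (the values are illustrative and not taken from the article) to show how is.na(), sum(), and mean() work together.

```r
# Toy vector with two missing values (illustrative data only)
x <- c(3.2, NA, 5.1, NA, 7.4)

na_flags <- is.na(x)        # logical vector, TRUE where the value is missing
na_count <- sum(na_flags)   # TRUE counts as 1, so this is the number of NAs
na_share <- mean(na_flags)  # average of 0/1 values, i.e. the proportion of NAs

print(na_count)  # 2
print(na_share)  # 0.4
```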

Load the R package and the built-in dataset

library(VIM)                  # the sleep sample dataset ships with the VIM package
data(sleep, package = "VIM")

1) Check the number and proportion of missing values in the dataset as a whole

sum(is.na(sleep))    # total number of missing values
mean(is.na(sleep))   # proportion of missing values

2) Check the number and proportion of missing values for a specific variable (column)

sum(is.na(sleep$Sleep))
mean(is.na(sleep$Sleep))

3) Check what proportion of rows in the dataset contain missing values

mean(!complete.cases(sleep))

4) List the rows with no missing values

sleep[complete.cases(sleep), ]   # using complete.cases()

list <- which(rowSums(is.na(sleep)) > 0); sleep[-list, ]   # same result as above

5) List the rows with one or more missing values

sleep[!complete.cases(sleep), ]
list <- which(rowSums(is.na(sleep)) > 0); sleep[list, ]
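The two ways of selecting rows shown above can be compared on a small made-up data frame (the data and the variable names `ok` and `bad_idx` are illustrative assumptions, not from the article). This is only a sketch of how the logical and index-based approaches line up.

```r
# Toy data frame: rows 2 and 3 each contain one NA (illustrative data only)
df <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, "w"),
  c = c(10, 20, 30, 40)
)

ok      <- complete.cases(df)              # TRUE for rows without any NA
bad_idx <- which(rowSums(is.na(df)) > 0)   # positions of rows with at least one NA

complete_rows   <- df[ok, ]    # rows 1 and 4
incomplete_rows <- df[!ok, ]   # rows 2 and 3

# Both approaches flag the same rows
print(all(which(!ok) == bad_idx))  # TRUE
```

The logical form `df[ok, ]` is usually the safer choice: if no row has a missing value, `bad_idx` is empty and `df[-bad_idx, ]` returns zero rows instead of the whole data frame.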

2. Explore the missing values

2.1 Use the mice package to show the overall pattern of missing data

library(mice)
md.pattern(sleep)

   BodyWgt BrainWgt Pred Exp Danger Sleep Span Gest Dream NonD
42       1        1    1   1      1     1    1    1     1    1  0
9        1        1    1   1      1     1    1    1     0    0  2
3        1        1    1   1      1     1    1    0     1    1  1
2        1        1    1   1      1     1    0    1     1    1  1
1        1        1    1   1      1     1    0    1     0    0  3
1        1        1    1   1      1     1    0    0     1    1  2
2        1        1    1   1      1     0    1    1     1    0  2
2        1        1    1   1      1     0    1    1     0    0  3
         0        0    0   0      0     4    4    4    12   14 38

Here '1' marks an observed value and '0' marks a missing value. The first column on the left gives the number of rows that share each pattern: '42' means 42 rows have no missing values, and '9' means 9 rows are missing both Dream and NonD. The last column gives the number of variables missing in that pattern. The last row gives the number of missing values for each variable (column), and 38 is the total number of missing values. The plot produced alongside this output conveys the same information.
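To make the layout of such a pattern table concrete, here is a rough base-R sketch that tabulates missingness patterns by hand on a made-up data frame. It is not how mice implements md.pattern(); the data and the names `observed`, `pattern`, and `pattern_counts` are assumptions for illustration.

```r
# Toy data frame with a few missing values (illustrative data only)
df <- data.frame(
  a = c(1, 2, NA, 4, NA),
  b = c(NA, 2, 3, 4, NA),
  c = c(1, 2, 3, 4, 5)
)

# 1 = observed, 0 = missing, the same coding used in the pattern table
observed <- (!is.na(df)) * 1L

# Collapse each row into a pattern string such as "101", then count the patterns
pattern        <- apply(observed, 1, paste, collapse = "")
pattern_counts <- table(pattern)

missing_per_column <- colSums(is.na(df))  # analogous to the table's last row
total_missing      <- sum(is.na(df))      # analogous to the bottom-right cell

print(pattern_counts)
print(missing_per_column)
print(total_missing)  # 4
```

In this toy example the pattern "111" (no missing values) occurs twice, and each of the other three patterns occurs once.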

2.2 Use the VIM package to visualize missing data

1) Show the overall missingness in the sleep dataset

library("VIM")
aggr(sleep, prop = FALSE, numbers = TRUE)

2) Show the missing values for variables of interest in the sleep dataset

marginplot(sleep[c("Sleep", "Dream")], pch = c(20), col = c("darkgray", "red", "blue"))
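One question a margin plot helps answer is whether one variable is distributed differently depending on whether the other is missing. As a rough numeric sketch of that idea in base R (this is not what marginplot() computes internally, and the data frame and names below are made up for illustration):

```r
# Toy data: y is missing in rows 2 and 4 (illustrative data only)
df <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(10, NA, 30, NA, 50, 60)
)

y_missing <- is.na(df$y)

# Mean of x, split by whether y is missing ("FALSE" = y observed, "TRUE" = y missing)
x_by_missing <- tapply(df$x, y_missing, mean)

print(x_by_missing)
```

A large gap between the two group means would hint that the missingness of y is related to x. Here the means are 3.75 (y observed) and 3 (y missing).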

3. Handle the missing values

Once the missing values are well understood, rows with NA and some columns with NA can be handled according to the size of the dataset and whether a given column is an important predictor.
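One possible way to act on this, sketched below, is to drop columns whose share of missing values exceeds a cutoff. The 0.5 cutoff, the data, and the column names are assumptions chosen for illustration; the article does not prescribe a threshold, and an important predictor may be worth keeping even when it has many missing values.

```r
# Toy data frame (illustrative data only)
df <- data.frame(
  id             = 1:5,
  mostly_missing = c(NA, NA, NA, 4, NA),
  some_missing   = c(1, NA, 3, 4, 5)
)

max_na_share <- 0.5  # assumed cutoff, tune to the dataset at hand

na_share <- colMeans(is.na(df))       # proportion of NAs per column
keep     <- na_share <= max_na_share  # columns at or below the cutoff

df_reduced <- df[, keep, drop = FALSE]

print(na_share)
print(names(df_reduced))  # "id" "some_missing"
```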

3.1 Delete missing values

1) Delete all rows and columns in the dataset that contain NA

Sleep_noNA
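The code for this step is cut off in the source. As a minimal sketch of row deletion, assuming base R's na.omit() is an acceptable stand-in for whatever the original used, the following drops every row that contains an NA from a made-up data frame (the data and the name `df_no_na` are illustrative).

```r
# Toy data frame: rows 2 and 3 each contain one NA (illustrative data only)
df <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, "w")
)

df_no_na <- na.omit(df)  # keeps only rows without any missing value

print(nrow(df_no_na))    # 2
print(anyNA(df_no_na))   # FALSE

# na.omit() records which rows were removed in the "na.action" attribute
dropped <- as.integer(attr(df_no_na, "na.action"))
print(dropped)           # 2 3
```

Note that this removes rows only. Removing columns that contain NA is a separate decision.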
