How to understand the handling of missing values in the R language

2025-04-16 Update From: SLTechnology News&Howtos

This article explains how missing values are handled in the R language: how to classify them, how to detect them, how to explore their patterns, and how to deal with them.

Handling missing values is an important part of data preprocessing. Data can be missing for several reasons: the data were lost, storage failed, or respondents in a survey refused to disclose the relevant information. Here we use the sleep data set from the VIM package as a sample to introduce methods for handling missing values. The sleep data set records sleep information for 62 mammals, including body weight, length of sleep, dreaming time and so on.

Missing value classification

1. Missing completely at random (MCAR): the missingness has nothing to do with any other variable, observed or unobserved. If every variable with missing values is MCAR, the complete cases can be seen as a simple random sample from the larger data set.

2. Missing at random (MAR): the missingness is related to other observed variables, but not to the unobserved value of the variable itself. For example, animals with low body weight are more likely to have missing Dream values (smaller animals are harder to observe). Once body weight is taken into account, the missingness of Dream is random.

3. Missing not at random (MNAR): the missingness depends not only on other variables but also on the unobserved value of the variable itself. For example, if animals with short dreaming time are more likely to have missing Dream values (perhaps because short dreaming periods are hard to measure), the data are MNAR.

Missing data usually fall into the first two cases. The last case is more complex: in addition to modeling the relationship of interest, you also need to model the mechanism that generates the missing values, and you may need to keep collecting new data.
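
The three mechanisms can be made concrete with a small simulation. The sketch below is not from the original article: the variables body_wgt and dream are simulated stand-ins for the sleep columns, and the seed, sample size and coefficients are arbitrary assumptions chosen only for illustration.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

n <- 1000
body_wgt <- rlnorm(n)                                    # simulated "body weight"
dream <- 2 + 0.5 * log(body_wgt) + rnorm(n, sd = 0.5)    # simulated "dreaming time"

# MCAR: every Dream value has the same 20% chance of being missing
dream_mcar <- ifelse(runif(n) < 0.2, NA_real_, dream)

# MAR: the chance of a missing Dream value depends only on the observed body weight
p_mar <- plogis(-1.5 - 1.5 * log(body_wgt))              # lighter animals: more missing
dream_mar <- ifelse(runif(n) < p_mar, NA_real_, dream)

# MNAR: the chance of a missing Dream value depends on the Dream value itself
p_mnar <- plogis(-1.5 - 2 * (dream - mean(dream)))       # shorter dreams: more missing
dream_mnar <- ifelse(runif(n) < p_mnar, NA_real_, dream)

# Compare the mean of the values that remain observed with the true mean
c(true = mean(dream),
  mcar = mean(dream_mcar, na.rm = TRUE),
  mar  = mean(dream_mar,  na.rm = TRUE),
  mnar = mean(dream_mnar, na.rm = TRUE))
```

Under MCAR the observed mean should stay close to the true mean. Under MAR and MNAR the naive mean of the observed values drifts away from it; in the MAR case the drift can in principle be corrected by conditioning on body weight, whereas in the MNAR case it cannot be corrected from the observed data alone.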

Identifying missing values

1. The is.na() function. Applied to a vector or data frame, it returns TRUE where the corresponding value is missing and FALSE otherwise. Applying the sum() function to the result of is.na() returns the number of missing values.

2. The complete.cases() function, which returns a logical vector with one element per row. In contrast to is.na(), it returns FALSE for a row that contains a missing value and TRUE for a complete row, so it is often used to select the rows that have no missing data.
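
A minimal base R sketch of both functions follows. The data frame animals is a hypothetical toy example invented for illustration (its column names only mimic those of sleep), so that the code runs without any extra package.

```r
# Hypothetical toy data frame; the values are made up for illustration
animals <- data.frame(
  BodyWgt = c(6654, 1, 3.4, 0.9, 2547),
  Dream   = c(NA, 2, NA, NA, 1.8),
  Sleep   = c(3.3, 8.3, 12.5, 16.5, NA)
)

is.na(animals$Dream)                  # TRUE where Dream is missing
sum(is.na(animals$Dream))             # number of missing Dream values: 3
colSums(is.na(animals))               # missing values per column

complete.cases(animals)               # TRUE for rows without any missing value
animals[complete.cases(animals), ]    # keep only the complete rows
animals[!complete.cases(animals), ]   # inspect the incomplete rows
```

Because is.na() on a data frame returns a logical matrix, colSums() and mean() can be applied to it directly to get per-column counts or the overall share of missing cells.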

Identifying missing-data patterns

1. Tabular display of the missing patterns. The md.pattern() function in the mice package is used here.

2. Graphical display of the missing patterns.

(1) The aggr() function

The plot on the left shows the number of missing values in each variable. In the plot on the right, each row represents one missing pattern: red marks a missing variable, blue marks an observed one, and the numbers at the right edge give the count of observations that show that pattern. These counts can be checked against the md.pattern() result.

(2) The matrixplot() function

Light colors represent small values, dark colors represent large values, and red represents missing values. matrixplot() shows the missingness of each individual observation.

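(3) The marginplot() function, which can only show the missingness of two variables at a time.

The box plots in the margins compare the distribution of one variable for observations where the other variable is missing (red with the default colors) and where it is observed (blue with the default colors). If the data are MCAR, the two box plots should look very similar.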

3. Exploring missingness with correlations. A shadow matrix is generated in which 1 represents a missing value and 0 an observed one. The variables that contain missing values are then selected, and the correlation matrix of their missingness indicators is computed. This helps show which variables tend to be missing together, and correlating the indicators with the original data helps analyze the relationship between the missingness of one variable and the observed values of the others.
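
The three steps above can be sketched as follows. This assumes the VIM and mice packages are installed; the choice of Gest and Dream for marginplot() and the helper names shadow and shadow_miss are illustrative assumptions, not taken from the original article.

```r
library(VIM)    # sleep data, aggr(), matrixplot(), marginplot()
library(mice)   # md.pattern()

data(sleep, package = "VIM")   # VIM's mammal sleep data, not base R's sleep

# 1. Tabular view: one row per missing pattern (1 = observed, 0 = missing)
md.pattern(sleep)

# 2. Graphical views
aggr(sleep, prop = FALSE, numbers = TRUE)    # counts per variable and per pattern
matrixplot(sleep, interactive = FALSE)       # one row per observation
marginplot(sleep[, c("Gest", "Dream")])      # two variables at a time

# 3. Shadow matrix: 1 = missing, 0 = observed
shadow <- as.data.frame(abs(is.na(sleep)))
shadow_miss <- shadow[, colSums(shadow) > 0]  # keep variables that have missing values

cor(shadow_miss)   # which variables tend to be missing together

# Missingness indicators against observed values; expect some NA entries
# (with a warning) where an indicator is constant among the complete pairs
cor(sleep, shadow_miss, use = "pairwise.complete.obs")
```

In the second correlation matrix the rows are the original variables and the columns are the missingness indicators, so a sizeable entry suggests that a variable's missingness is related to the observed values of another variable, which would argue against MCAR.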

Dealing with missing values

1. Deletion. If only a few observations have missing values and the missingness is random, you can consider deleting those observations directly, using na.omit(sleep) or sleep[complete.cases(sleep), ]. If more than 5% of the values in a variable are missing, you can consider dropping that variable.

2. Replacing the missing values. Missing values can be replaced by the mean, the median, or random numbers, but this introduces bias.

3. Multiple imputation. The mice() function in the mice package fills in the missing values several times and returns an object, here called imp, that holds multiple completed data sets. The with() function can then fit a linear regression to each data set in imp, and finally the pool() function combines the regression results.

The mice() function generates five completed data sets by default. To view the imputed values, use the imp component of the returned object (imp$imp); for each variable it holds a table with one column per imputed data set and one row per observation that had a missing value.

In the pooled output, nmis is the number of missing values in each variable, and fmi is the fraction of missing information, that is, the share of the variance of an estimate that is attributable to the missing data (the columns shown can differ between mice versions). The with() function fits the chosen analysis model to each imputed data set separately, and pool() combines those fits into a single set of estimates and standard errors; neither function is meant for picking one "best" data set.

Finally, the complete() function is used to generate a completed data set; here the first imputed data set is chosen to fill in the missing values.
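
The three approaches can be sketched on the sleep data as follows. This assumes the mice and VIM packages are installed. The seed, the use of the median for Dream, and the regression formula Dream ~ Span + Gest are illustrative assumptions; the original article does not state which model was fitted.

```r
library(mice)
data(sleep, package = "VIM")   # requires the VIM package to be installed

# 1. Deletion: both lines keep only the complete rows
sleep_cc  <- na.omit(sleep)
sleep_cc2 <- sleep[complete.cases(sleep), ]

# Share of missing values per variable, to judge whether a variable is worth keeping
colMeans(is.na(sleep))

# 2. Single replacement, here with the median of Dream (introduces bias)
sleep_med <- sleep
sleep_med$Dream[is.na(sleep_med$Dream)] <- median(sleep$Dream, na.rm = TRUE)

# 3. Multiple imputation with the default m = 5 data sets
imp <- mice(sleep, m = 5, seed = 1234, printFlag = FALSE)  # seed is arbitrary
imp$imp$Dream    # imputed Dream values: rows = observations, columns = data sets

fit <- with(imp, lm(Dream ~ Span + Gest))   # illustrative analysis model
pooled <- pool(fit)                         # combine the five fits
summary(pooled)

# Completed data set built from the first imputation
sleep_imp1 <- complete(imp, action = 1)
```

For inference it is usually the pooled result that matters, since it reflects the uncertainty caused by the missing values; a single completed data set such as sleep_imp1 is convenient for inspection but hides that uncertainty.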

That concludes this overview of how to understand and handle missing values in the R language. If you run into similar questions, the analysis above may serve as a reference.
