2025-01-16 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
Missing data is a problem that data scientists often encounter. This article presents imputation approaches suited to different situations. No imputation method is perfect, but there is usually one that fits the situation at hand better than the others.
One of the most common problems I encounter in data cleaning and exploratory analysis is handling missing values. The first thing to understand is that there is no perfect way to deal with them. Different problems (time series analysis, machine learning, regression, and so on) call for different imputation methods, so it is difficult to offer a general solution. In this article, I try to summarize the most commonly used methods and arrive at a structured approach.
Imputation vs. removing data
Before discussing imputation methods, we need to understand why data goes missing.
1. Missing at random (MAR): the probability that a value is missing does not depend on the missing value itself, only on some of the observed data.
2. Missing completely at random (MCAR): the probability that a value is missing is unrelated to both its hypothetical value and the values of other variables.
3. Missing not at random (MNAR): there are two possible cases. The missingness depends on the hypothetical value itself (for example, people with high incomes often do not want to disclose their income in a survey), or on the value of another variable (for example, supposing women are less willing to disclose their age, missingness in the age variable is affected by the gender variable). Strictly speaking, if that other variable is itself fully observed, the second case meets the definition of MAR given in item 1.
In the first two cases, it is reasonably safe to delete data with missing values, depending on how often they occur. In the third case, deleting observations with missing values can bias the model. So we need to be very careful before deleting data. Note that imputation does not necessarily give better results either.
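The effect of the three mechanisms on deletion can be illustrated with a small simulation. This is only a sketch under stated assumptions: the variables `age` and `income`, the linear relationship between them, and the logistic missingness probabilities are all hypothetical choices made for the demonstration, not something taken from the article. Income is made missing in three different ways, rows with missing income are deleted, and the slope of income on age is re-estimated from the remaining rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: income rises by 1000 per year of age, plus noise.
age = rng.normal(40, 10, n)
income = 1000 * age + rng.normal(0, 5000, n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed missingness mechanisms for the income column.
masks = {
    # MCAR: every value has the same 30% chance of being missing.
    "MCAR": rng.random(n) < 0.3,
    # MAR: missingness depends only on the fully observed age.
    "MAR": rng.random(n) < sigmoid((age - 40) / 5),
    # MNAR: missingness depends on the unobserved income itself.
    "MNAR": rng.random(n) < sigmoid((income - 40_000) / 5000),
}

slopes = {}
for name, missing in masks.items():
    kept = ~missing  # delete every row whose income is missing
    slope, intercept = np.polyfit(age[kept], income[kept], 1)
    slopes[name] = slope
    print(f"{name}: slope after deletion = {slope:.1f} (true slope = 1000)")
```

In this setup the slope estimated after deletion stays close to the true value of 1000 under MCAR and MAR, while under MNAR it is clearly pulled away from it. Other quantities, such as the overall mean of income, can still be biased by deletion under MAR, so the result should not be read as a general guarantee.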
Deletion
Listwise deletion
Listwise deletion (complete-case analysis) removes an entire observation if it contains at least one missing value. If missing values make up only a small part of the data, simply removing those observations may be the easiest option. In most cases, however, listwise deletion is a poor choice, because the assumption that data are missing completely at random (MCAR) is usually hard to satisfy. When that assumption fails, listwise deletion produces biased parameter estimates.
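A minimal sketch of row-wise (complete-case) deletion with pandas follows. The data frame and its column names are hypothetical; `DataFrame.dropna()` with its default arguments drops every row that has at least one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, 52],
    "income": [32000, 41000, np.nan, 58000, 61000],
    "score": [3.1, 2.7, 4.0, np.nan, 3.6],
})

# Drop every row that contains at least one missing value.
complete_cases = df.dropna()

share_kept = len(complete_cases) / len(df)
print(complete_cases)
print(f"Rows kept: {len(complete_cases)} of {len(df)} ({share_kept:.0%})")
```

Each column here is missing only one value in five, yet only two of the five rows survive, which shows how quickly this approach can shrink a data set when missing values are spread across several columns.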
Pairwise deletion
Pairwise deletion analyzes every case in which the variables of interest are present, so each calculation uses as much of the available data as possible. Its strength is that it increases the statistical power of the analysis, but it has several disadvantages. It assumes the data are missing completely at random (MCAR). It also means that different parts of the model end up being based on different numbers of observations, which can make the model difficult to interpret.
For example, observations 3 and 4 are used to calculate the covariance of ageNa and DV1, while observations 2, 3, and 4 are used to calculate the covariance of DV1 and DV2.
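The covariance example above can be reproduced with pandas, whose `DataFrame.cov()` excludes missing values pair by pair. The numeric values below are hypothetical; only the column names and the pattern of which rows are observed follow the text.

```python
import numpy as np
import pandas as pd

# Hypothetical values, indexed 1..4 to match the observation numbers above.
df = pd.DataFrame(
    {
        "ageNa": [20, np.nan, 30, 40],
        "DV1": [np.nan, 5, 6, 9],
        "DV2": [1, 2, 4, 3],
    },
    index=[1, 2, 3, 4],
)

# Each covariance uses only the rows where both variables are observed.
pairwise_cov = df.cov()

# Number of rows available for each pair of variables.
observed = df.notna().astype(int)
pair_counts = observed.T @ observed

print(pairwise_cov)
print(pair_counts)
```

With these values the covariance of ageNa and DV1 is computed from 2 rows, while the covariance of DV1 and DV2 is computed from 3 rows, so the entries of a single covariance matrix rest on different subsets of the data.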
Dropping variables
In my opinion, it is always better to keep data than to throw it away. Sometimes a variable can be dropped outright if more than 60% of its observations are missing, but only if that variable is insignificant. Having said that, imputation is always preferable to dropping a variable.
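The 60% rule of thumb above could be applied along these lines. The data frame is hypothetical, and the check only measures how much of each column is missing; whether a column is unimportant enough to drop remains a judgment the code cannot make.

```python
import numpy as np
import pandas as pd

# Hypothetical data; "income" is missing in 4 of 5 rows (80%).
df = pd.DataFrame({
    "age": [25, 31, np.nan, 47, 52],
    "income": [32000, np.nan, np.nan, np.nan, np.nan],
    "score": [3.1, 2.7, 4.0, np.nan, 3.6],
})

threshold = 0.6  # the 60% cutoff mentioned in the text

# Fraction of missing values per column.
missing_share = df.isna().mean()

# Candidate columns to drop: more than 60% of their values are missing.
to_drop = missing_share[missing_share > threshold].index.tolist()
reduced = df.drop(columns=to_drop)

print(missing_share)
print("Dropped:", to_drop)
```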