
Why Not to One-Hot Encode Categorical Variables


This article mainly explains why you should not one-hot encode categorical variables; interested friends may as well take a look. The methods introduced here are simple, fast, and practical. Now let the editor take you through why one-hot encoding categorical variables is often a bad idea!

One-hot encoding (also known as dummy variables) is a method of converting categorical variables into several binary columns, where a 1 indicates that a row belongs to that category. From a machine learning point of view, however, it is often a poor choice for encoding categorical variables.
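
As a quick illustration, here is a minimal sketch of what one-hot encoding produces, using pandas (the state column and its values are made up):

import pandas as pd

df = pd.DataFrame({'state': ['CA', 'NY', 'CA', 'TX']})

# One binary column per category; 1 marks the rows belonging to it
one_hot = pd.get_dummies(df['state'], prefix='state').astype(int)
print(one_hot)
#    state_CA  state_NY  state_TX
# 0         1         0         0
# 1         0         1         0
# 2         1         0         0
# 3         0         0         1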

The most obvious problem is that it adds a great many dimensions, and in general, fewer dimensions are better. For example, if a column represents a US state (California, New York, and so on), the one-hot encoding scheme adds roughly 50 extra dimensions.

Not only does this add many dimensions to the dataset, it also carries very little information: a few ones scattered across a sea of zeros. This makes optimization difficult, especially for neural networks, whose optimizers easily wander into the wrong part of the space among the many empty dimensions.

To make matters worse, the sparse columns are linearly related to one another: any one of them can be predicted from the others, which can lead to multicollinearity problems in high dimensions.

The best datasets contain features whose information has independent value, whereas one-hot encoding creates exactly the opposite environment. Of course, if there are only three or even four classes, one-hot encoding may not be a bad choice. But depending on the relative size of the dataset, other alternatives may be worth exploring.

Target encoding represents a categorical column effectively while occupying only a single feature's worth of space. Also known as mean encoding, it replaces each value in the column with the mean target value for that category. This represents the relationship between the categorical variable and the target more directly, and it is a very popular technique (especially in Kaggle competitions).
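
As a concrete sketch of the idea (column names and data invented), mean encoding can be computed by hand with pandas:

import pandas as pd

df = pd.DataFrame({'state': ['CA', 'NY', 'CA', 'TX', 'NY'],
                   'target': [1, 0, 0, 1, 1]})

# Replace each category with the mean target value of its rows
means = df.groupby('state')['target'].mean()   # CA -> 0.5, NY -> 0.5, TX -> 1.0
df['state_encoded'] = df['state'].map(means)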

This encoding method has some drawbacks. First, it makes it harder for the model to learn relationships between a mean-encoded variable and other variables: it draws similarities within the column based only on its relationship to the target, which can be either an advantage or a disadvantage.

The method is also very sensitive to the y variable, which affects the model's ability to extract the encoded information.

Because every value in a category is replaced by the same number, the model may tend to overfit the encoded values it has seen (for example, associating 0.8 with something entirely different from 0.79). This is an effect of treating values on a continuous scale as heavily repeated classes. Therefore, you need to monitor the y variable carefully for outliers.

To implement this, you can use the category_encoders library. Since the target encoder is a supervised method, it requires both the X and y training sets.

from category_encoders import TargetEncoder

# Replace each listed column with its per-category mean target value
enc = TargetEncoder(cols=['Name_of_col', 'Another_name'])
training_set = enc.fit_transform(X_train, y_train)
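
If memory serves, TargetEncoder also exposes a smoothing parameter that blends each category's mean with the global target mean, which softens the overfitting problem described above; check the library's documentation for the exact behavior.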

Leave-one-out encoding attempts to compensate for the dependence on the y variable and the lack of value diversity by computing the category mean while excluding the current row's value. This dampens the effect of outliers and creates more diverse encoded values.
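
A minimal hand-rolled sketch of the computation (column names and data invented): each row receives its category's target sum minus its own target, divided by the category count minus one.

import pandas as pd

df = pd.DataFrame({'state': ['CA', 'NY', 'CA', 'NY', 'CA'],
                   'target': [1, 0, 0, 1, 1]})

g = df.groupby('state')['target']
# Leave-one-out mean: exclude the current row from its category's mean
df['state_loo'] = (g.transform('sum') - df['target']) / (g.transform('count') - 1)
# Rows of the same category now get different values: 0.5, 1.0, 1.0, 0.0, 0.5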

Since the model is given not just one identical value per encoded class but a small range of values, it can generalize better. As usual, the implementation is available as LeaveOneOutEncoder in the category_encoders library.

from category_encoders import LeaveOneOutEncoder

# Encode each row with its category's target mean, excluding the row itself
enc = LeaveOneOutEncoder(cols=['Name_of_col', 'Another_name'])
training_set = enc.fit_transform(X_train, y_train)

Another strategy that achieves a similar effect is to add normally distributed noise to the encoded scores, where the standard deviation is a tunable parameter.
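
A sketch of that idea, assuming encoded is a column of target-encoded training values and sigma is the tunable standard deviation:

import numpy as np

encoded = np.array([0.5, 1.0, 0.5, 1.0, 0.0])  # e.g. mean-encoded values
rng = np.random.default_rng(0)
sigma = 0.05  # tunable noise level

# Perturb the training-time encodings so identical categories
# no longer map to exactly the same number
encoded_noisy = encoded + rng.normal(0.0, sigma, size=encoded.shape)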

Bayesian target encoding is a more mathematical approach that uses the target for encoding. Using only the mean can be a deceptive measure, so Bayesian target encoding tries to incorporate other statistics of the target variable's distribution, such as its variance or skewness (higher moments).

By combining these distributional properties through a Bayesian model, it produces an encoding that better reflects all aspects of the category's target distribution. The results, however, are harder to interpret.
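
The library details aside, one simple form of this idea for a binary target is a beta-binomial posterior mean; below is a sketch under that assumption, with an invented prior strength (it shrinks rare categories toward the global event rate rather than capturing variance or skewness, but it conveys the flavor):

import pandas as pd

df = pd.DataFrame({'state': ['CA', 'NY', 'CA', 'TX', 'CA'],
                   'target': [1, 0, 0, 1, 1]})

# Beta prior centered on the global event rate
prior_strength = 5.0
global_rate = df['target'].mean()
alpha0 = global_rate * prior_strength

stats = df.groupby('state')['target'].agg(['sum', 'count'])
# Posterior mean: small categories are pulled toward the global rate
posterior_mean = (stats['sum'] + alpha0) / (stats['count'] + prior_strength)
df['state_bayes'] = df['state'].map(posterior_mean)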

Weight of Evidence (WoE) takes another nuanced view of the relationship between a categorical independent variable and the dependent variable. WoE comes from the credit-scoring industry, where it was used to measure how well a variable separates customers who default on a loan from those who repay it. Mathematically, Weight of Evidence is defined as the natural logarithm of the odds ratio:

WoE = ln(% of non-events / % of events)

"Non-events" here is the share of rows where the event does not occur, so under this definition a higher WoE means the event is less likely for that category. Weight of Evidence establishes a monotonic relationship (one that never changes direction) with the dependent variable and keeps the categories on a logistic scale. WoE is also a key component of the Information Value (IV) metric, which measures how much information a feature contributes to the prediction.
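
Worked out by hand for a binary target (data and column names invented; a small epsilon guards the logarithm for categories with no events or no non-events):

import numpy as np
import pandas as pd

df = pd.DataFrame({'state': ['CA', 'NY', 'CA', 'TX', 'CA', 'NY'],
                   'target': [1, 0, 0, 1, 1, 0]})

events = df.groupby('state')['target'].sum()
non_events = df.groupby('state')['target'].count() - events

# Each category's share of all events / all non-events
pct_event = events / events.sum()
pct_non_event = non_events / non_events.sum()

eps = 1e-6  # avoid log(0)
woe = np.log((pct_non_event + eps) / (pct_event + eps))
df['state_woe'] = df['state'].map(woe)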

from category_encoders import WOEEncoder

# Note: WOEEncoder supports binary targets only
enc = WOEEncoder(cols=['Name_of_col', 'Another_name'])
training_set = enc.fit_transform(X_train, y_train)

All of the methods above are supervised encoders, that is, encoding methods that take the target variable into account, so they are usually the more effective encoders for prediction tasks. However, this is not necessarily the case when the analysis needs to be unsupervised.

Nonlinear PCA is an approach to principal component analysis that can handle categorical variables via categorical quantification: it finds the best numeric values for the categories, namely the ones that maximize the performance (explained variance) of the ordinary PCA that follows.
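
There is no standard scikit-learn implementation of this. Below is only a toy alternating-least-squares sketch of the optimal-scaling idea, with all names and data invented: categories get numeric scores, the scored matrix goes through ordinary PCA, and the scores are updated to the conditional means of the first principal component.

import numpy as np

def quantify_category(cat, X_num, n_iter=20, seed=0):
    # Alternating least squares: find numeric scores for the categories
    # that play well with a plain PCA on the quantified data.
    rng = np.random.default_rng(seed)
    cats, idx = np.unique(cat, return_inverse=True)
    scores = rng.normal(size=len(cats))            # initial quantification
    for _ in range(n_iter):
        q = scores[idx]
        q = (q - q.mean()) / q.std()               # standardize the column
        X = np.column_stack([X_num, q])
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        pc1 = Xc @ Vt[0]                           # first principal component
        # Re-quantify: each category becomes the mean PC score of its rows
        scores = np.array([pc1[idx == k].mean() for k in range(len(cats))])
    return scores[idx]

# Toy usage with invented data
cat = np.array(['a', 'b', 'a', 'c', 'b', 'c'])
X_num = np.random.default_rng(1).normal(size=(6, 2))
encoded = quantify_category(cat, X_num)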

At this point, you should have a deeper understanding of why not to one-hot encode categorical variables. You might as well try these techniques in practice!
