This article mainly introduces "how to understand one-hot encoding through machine learning" and the alternatives to it. The editor walks you through the operations with practical examples; the methods are simple, fast, and practical. I hope this article helps you solve your problem.
Introduction
There are many better options than one-hot encoding.
One-hot encoding, also known as dummy variables, converts a categorical variable into several binary columns, where a 1 marks the rows belonging to that category.
From a machine learning point of view, however, it is often a poor choice for encoding categorical variables. The most obvious problem is that it adds a great many dimensions, and lower dimensionality is usually better. For example, if we used a single column to represent a US state (such as California or New York), a one-hot encoding scheme would create roughly 50 additional dimensions.
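For concreteness, here is a minimal sketch of what one-hot encoding produces, using pandas on an invented toy column:

import pandas as pd

df = pd.DataFrame({'state': ['CA', 'NY', 'TX', 'CA']})
# each distinct category becomes its own binary column
one_hot = pd.get_dummies(df['state'], prefix='state')
print(one_hot)  # columns: state_CA, state_NY, state_TX

With all 50 states present, the same call would produce 50 such columns.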
Not only does one-hot encoding add many dimensions to the dataset, those dimensions carry very little information: a sea of zeros occasionally dotted with a one. The result is an unusually sparse feature space, which makes optimization difficult. This is especially true for neural networks, whose optimizers can easily wander into poor regions of the loss surface when faced with dozens of mostly empty dimensions.
To make matters worse, there are linear relationships among the sparse columns: any one of them can be predicted from the others, which causes multicollinearity.
The ideal dataset consists of features whose information is independently valuable, whereas one-hot encoding creates a completely different environment.
Admittedly, if there are only three or four categories, one-hot encoding may not be a bad choice; but depending on the relative size of the dataset, it is worth exploring the other options below.
Target encoding is a very effective way to represent a categorical column while occupying only a single feature dimension. Also known as mean encoding, it replaces each value in the column with the mean target value for that category. This gives a more direct representation of the relationship between the categorical variable and the target variable, and it is a very popular technique (especially in Kaggle competitions).
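Conceptually, mean encoding is nothing more than the per-category mean of the target mapped back onto the column. A minimal sketch with invented toy data (the same df is reused in the sketches below):

import pandas as pd

df = pd.DataFrame({'state': ['CA', 'NY', 'CA', 'TX', 'NY'],
                   'target': [1, 0, 1, 0, 1]})
means = df.groupby('state')['target'].mean()
df['state_encoded'] = df['state'].map(means)
# CA -> 1.0, NY -> 0.5, TX -> 0.0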
This encoding method has some drawbacks. First, it makes it harder for the model to learn relationships between a mean-encoded variable and the other variables: it draws similarities within a column based only on the column's relationship with the target, which can be a benefit or a drawback.
More fundamentally, this encoding is very sensitive to the y variable, which affects the model's ability to extract the encoded information.
Because every value in a category is replaced by the same number, the model may tend to overfit the encoded values it has seen (for example, treating 0.8 as something completely different from 0.79); it is, in effect, treating values on a continuous scale as repeated classes.
You therefore need to monitor the y variable carefully for outliers and similar problems.
To apply it, use the category_encoders library. Because the target encoder is a supervised method, it requires both the X and y training sets.
from category_encoders import TargetEncoder

enc = TargetEncoder(cols=['Name_of_col', 'Another_name'])
training_set = enc.fit_transform(X_train, y_train)
# the fitted encoder is then applied unchanged to held-out data
test_set = enc.transform(X_test)
Leave-one-out encoding attempts to compensate for the dependence on the y variable and to diversify the encoded values by computing each row's category mean while excluding that row's own value. This blunts the impact of outliers and creates more diverse encodings.
Because the model sees not a single value per encoded class but a range of them, it learns to generalize better.
As usual, this is implemented as LeaveOneOutEncoder in the category_encoders library.
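By hand, the leave-one-out value is (category sum minus the row's own target) divided by (category count minus one). A sketch reusing the toy df from above; note that a single-row category divides by zero here, a case the library implementation below handles for you:

sums = df.groupby('state')['target'].transform('sum')
counts = df.groupby('state')['target'].transform('count')
# each row's encoding excludes that row's own target value
df['state_loo'] = (sums - df['target']) / (counts - 1)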
from category_encoders import LeaveOneOutEncoder

enc = LeaveOneOutEncoder(cols=['Name_of_col', 'Another_name'])
training_set = enc.fit_transform(X_train, y_train)
Another strategy with a similar effect is to add normally distributed noise to the encoded scores, where the standard deviation is a parameter that can be tuned.
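In category_encoders this is exposed, to my knowledge, as the sigma parameter of LeaveOneOutEncoder, which adds Gaussian noise to the training encodings only:

enc = LeaveOneOutEncoder(cols=['Name_of_col', 'Another_name'], sigma=0.05)
training_set = enc.fit_transform(X_train, y_train)  # noise is applied here
test_set = enc.transform(X_test)  # test encodings stay noise-free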
Bayesian target encoding is a more mathematical approach to using the target for encoding. The mean alone can be a deceptive metric, so Bayesian target encoding tries to incorporate other statistics of the target variable's distribution, such as its variance or skewness, known as 'higher moments'.
These distributional properties are then merged through a Bayesian model to produce an encoding that captures more aspects of the per-category target distribution. The results, however, are harder to interpret.
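As a simplified sketch of the Bayesian idea, covering only the first moment rather than the full higher-moment treatment described above: with a binary target and an assumed Beta(alpha, beta) prior, the encoding becomes the posterior mean, which shrinks small categories toward the prior instead of trusting their raw mean:

# alpha and beta are assumed prior pseudo-counts, a tuning choice
alpha, beta = 2.0, 2.0
stats = df.groupby('state')['target'].agg(['sum', 'count'])
posterior_mean = (stats['sum'] + alpha) / (stats['count'] + alpha + beta)
df['state_bayes'] = df['state'].map(posterior_mean)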
Weight of Evidence (WoE) is another scheme for measuring the relationship between the independent and dependent variables. WoE originated in credit scoring, where it was used to separate customers who default on a loan from those who repay it. Mathematically, Weight of Evidence is the natural logarithm of the odds ratio:
WoE = ln(% of events / % of non-events)
The higher the WoE, the more likely the event is to occur ('non-events' being the share of rows that do not belong to the class). Weight of Evidence establishes a monotonic relationship with the dependent variable and places the categories on a 'logistic' scale, which makes it a natural fit for logistic regression. WoE is also the key component of another metric, Information Value (IV), which measures how much information a feature provides for prediction.
from category_encoders import WOEEncoder

# note: WOEEncoder expects a binary target
enc = WOEEncoder(cols=['Name_of_col', 'Another_name'])
training_set = enc.fit_transform(X_train, y_train)
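To make the definitions concrete, WoE and IV can also be computed by hand for a binary target. A sketch reusing the toy df, with a small epsilon as a guard against log(0) in pure categories:

import numpy as np

eps = 1e-6
events = df.groupby('state')['target'].sum()
non_events = df.groupby('state')['target'].count() - events
pct_event = (events + eps) / events.sum()
pct_non_event = (non_events + eps) / non_events.sum()
woe = np.log(pct_event / pct_non_event)
# Information Value sums each category's contribution
iv = ((pct_event - pct_non_event) * woe).sum()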
These are supervised encoders: methods that take the target variable into account, which usually makes them more effective encoders for prediction tasks. That is not necessarily the case, however, when the analysis to be performed is unsupervised.
Nonlinear PCA is a variant of principal component analysis that handles categorical variables through categorical quantification: it searches for the best numeric values to assign to the categories, the values that maximize the performance (explained variance) of the resulting ordinary PCA.
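There is no standard scikit-learn implementation of this technique (it is the CATPCA idea from SPSS), but a toy alternating scheme conveys the flavor: score the categories numerically, run ordinary PCA, move each category's score toward the mean first-component score of its rows, and repeat. Everything below is an illustrative sketch, not a faithful CATPCA implementation:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def quantify_category(df, cat_col, num_cols, n_iter=20):
    # initialize quantifications from integer codes, standardized
    codes = df[cat_col].astype('category').cat.codes.astype(float)
    q = (codes - codes.mean()) / codes.std()
    X_num = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()
    for _ in range(n_iter):
        X = np.column_stack([q, X_num])
        scores = PCA(n_components=1).fit_transform(X)[:, 0]
        # optimal-scaling step: each category takes the mean score of its rows
        q = pd.Series(scores, index=df.index).groupby(df[cat_col]).transform('mean')
        q = (q - q.mean()) / q.std()
    return q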
This concludes "how to understand one-hot encoding through machine learning". Thank you for reading.