This article introduces practical ML and DNN modeling skills in detail. Interested readers can use it as a reference; I hope it is helpful to you.
Data preprocessing (Data Preparation)
Processing raw data (Process Your Own Data)
Because clients may not know how to carry out data processing and feature engineering, data analysts need to preprocess data inside the model or pipeline.
Take text classification as an example, where BERT is used as the classifier. Data analysts cannot require clients to do tokenization and feature preparation themselves.
Take a regression problem as an example, where time is one of the features. In the initial model, the data analyst may only use the day of the week (such as Thursday) as a feature. After several iterations, the day of the week may no longer be a good feature, and the analyst may want to use the day of the month (such as the 31st) instead. On the other hand, the client may only provide the day of the week rather than the exact date, so the data preprocessing has to happen on the analyst's side.
Take speech recognition as an example: clients can only send raw audio to data analysts; they cannot provide classic engineered features such as Mel-frequency cepstral coefficients (MFCC).
Therefore, it is recommended to embed data preprocessing in the code (or model pipeline) rather than requiring the client to preprocess the data, as sketched below.
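Below is a minimal sketch of this idea for the text-classification case: the raw text is tokenized inside the serving code, so the client only sends plain strings. The model/tokenizer names and the predict wrapper are illustrative assumptions, not part of the original article.
# a hedged sketch: preprocessing lives inside the serving code (names are illustrative)
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

def predict(raw_texts):
    # the client sends plain strings; tokenization happens here, not on the client side
    inputs = tokenizer(raw_texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(dim=-1).tolist()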
Use Tensor (Use Tensor)
A tensor is an N-dimensional array used for multi-dimensional computation. Operating on tensors is faster than looping over Python dictionaries or lists, and tensors are the native data format of deep learning frameworks such as PyTorch and TensorFlow.
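As a small illustration (a sketch, not from the original article), the same element-wise operation can be written as a Python loop or as a single tensor operation:
# comparing an element-wise Python loop with one vectorized tensor operation
import torch

values = list(range(1000))
doubled_loop = [v * 2 for v in values]        # element by element in Python
doubled_tensor = torch.tensor(values) * 2     # one vectorized tensor operation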
Data augmentation (Data Augmentation)
The lack of labeled data is one of the common challenges faced by practitioners. Transfer learning is one way to overcome this problem: computer vision practitioners can consider using ResNet, and natural language processing practitioners can consider BERT. Alternatively, synthetic data can be generated to enlarge the labeled data set. albumentations and imgaug can generate image data, while nlpaug can generate text data.
If you know your data well, you should design augmentation methods tailored to it. Remember the golden rule of data science: garbage in, garbage out.
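A minimal image-augmentation sketch using albumentations is shown below; the specific transforms, probabilities, and the stand-in image are illustrative assumptions.
# a hedged sketch of image augmentation with albumentations (transforms are illustrative)
import albumentations as A
import numpy as np

transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
])

image = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)  # stand-in for a real image
augmented_image = transform(image=image)["image"]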
Data sampling (Sampling Same Data)
In most cases, we want to sample data randomly so that the probability distribution stays consistent across the training, validation, and test sets. At the same time, we want this "random" behavior to be reproducible, so that we get the same training, validation, and test sets every time.
If the data has a date attribute, you can split the data by date.
Otherwise, you can fix the random seed to get consistent random behavior.
import torch
import numpy as np
import random

seed = 1234
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
Model training (Model Training)
Saving intermediate checkpoints (Saving Intermediate Checkpoint)
Saving the model only after training is completed usually has the following disadvantages:
Due to model complexity, computing resources, and the size of the training data, the whole training process may take days or weeks. Without stored intermediate states this is risky, because the machine may shut down unexpectedly.
Generally speaking, longer training achieves better results (for example, lower loss). However, overfitting may occur, and in most cases the last model state does not give the best results; we usually need an intermediate-state model for production.
Using early stopping can save money: if the model has not improved for several epochs, we can stop training early to save time and resources.
Ideally, we could store the model continuously (for example, after every epoch), but that requires a lot of storage. In practice, we recommend keeping only the best model (or the best three models) plus the last model.
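A minimal Keras sketch of checkpointing plus early stopping is shown below; the monitored metric, patience, file name, and the x_train/x_val variables are illustrative assumptions.
# a hedged sketch: keep the best checkpoint and stop early when validation loss stalls
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

callbacks = [
    ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True),
    EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
]
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=100, callbacks=callbacks)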
Virtual epoch (Virtual Epoch)
The number of epochs is a very common training parameter. If it is set incorrectly, it may hurt model performance.
For example, if we have 1 million records and set 5 epochs, we train on 5 million records in total. Three weeks later we receive another 500,000 records. If we keep the same number of epochs, the total training data reaches 7.5 million records. The problems are:
It is difficult to tell whether the improvement in model performance comes from the larger number of training steps or from the additional data itself.
The new 500,000 records extend the training time by hours or even days, which increases the risk of machine failure.
It is recommended to use a virtual epoch instead of the original static epoch. The virtual epoch can be calculated from the total training data size, the expected number of checkpoints, and the batch size.
The usual static epoch is as follows:
# original
num_data = 1000 * 1000
batch_size = 100
num_step = 14 * 1000 * 1000
num_checkpoint = 20
steps_per_epoch = num_step // num_checkpoint

# TensorFlow/Keras
model.fit(x, epochs=num_checkpoint, steps_per_epoch=steps_per_epoch,
          batch_size=batch_size)
The virtual epoch is as follows:
num_data = 1000 * 1000
num_total_data = 14 * 1000 * 1000
batch_size = 100
num_checkpoint = 20
steps_per_epoch = num_total_data // (batch_size * num_checkpoint)

# TensorFlow/Keras
model.fit(x, epochs=num_checkpoint, steps_per_epoch=steps_per_epoch,
          batch_size=batch_size)
Principle of simplification (Simple is Beauty)
Practitioners often want to use a state-of-the-art model as the initial model. In fact, we suggest building a sufficiently simple model as the baseline. The reasons are:
We always need a baseline model to prove that the proposed model is correct.
The baseline model does not need to be very good in terms of performance, but it must be interpretable. Business users always want to know the reason for the forecast.
Being easy to implement is very important. Customers can't wait a year for a good enough model: we need an initial model to gain momentum from investors, and can then build the better model on top of it.
Here are some suggested baseline models for different areas:
Speech recognition: you can use classic features such as Mel-frequency cepstral coefficients (MFCC) or mel spectrogram features, instead of training a model to obtain vector representations (such as adding an embedding layer). These features are passed to a long short-term memory network (LSTM) or a convolutional neural network (CNN) followed by a fully connected layer for classification or prediction.
Computer vision: TODO.
Natural language processing: a bag-of-words model, or classic word embeddings fed into an LSTM, is a good starting point before moving to heavier models such as BERT or XLNet, as in the sketch below.
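A minimal Keras sketch of such an NLP baseline; the vocabulary size, dimensions, and class count are illustrative assumptions.
# a hedged baseline sketch: word embeddings + LSTM + fully connected classifier
from tensorflow.keras import layers, models

vocab_size, embed_dim, num_classes = 20000, 128, 10  # assumed values

baseline = models.Sequential([
    layers.Embedding(vocab_size, embed_dim),
    layers.LSTM(64),
    layers.Dense(num_classes, activation="softmax"),
])
baseline.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])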
Debug (Debugging)
Simplify the problem (Simplifying Problem)
Sometimes a classification problem involves 1 million records and 1,000 categories. When model performance is below expectations it is hard to debug, because poor performance can be caused by model complexity, data quality, or bugs. It is therefore suggested to simplify the problem first, so that we can make sure the pipeline itself is defect-free. Deliberate overfitting can be used to achieve this.
At the beginning there is no need to classify all 1,000 categories: first sample 10 categories with 100 records each and train the model. By using the same data set (or a subset of it) as both training and evaluation data, we should be able to overfit the model and obtain good results (for example, 80% or even 90%+ accuracy). Developing the model on this basis reduces the chance of hidden bugs; a small sampling sketch follows.
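A small sketch of building such an overfitting subset, assuming the data sits in a pandas DataFrame named df with a "label" column (both are assumptions, not part of the original article):
# a hedged sketch: sample 10 categories with 100 records each to build an overfit-test subset
import pandas as pd

# df is assumed to be a DataFrame with a "label" column
top_classes = df["label"].value_counts().index[:10]
subset = (df[df["label"].isin(top_classes)]
          .groupby("label")
          .head(100))
# train and evaluate on this same subset; a healthy pipeline should overfit it easily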
Training in evaluation mode (Using Eval Mode for Training)
If the evaluation accuracy does not change from epoch to epoch, a common cause is forgetting to switch the model back to "training" mode after evaluation.
In PyTorch, you need to switch between training mode and evaluation mode during the training and evaluation phases. Training mode affects batch normalization, dropout, and related layers. Sometimes a data analyst forgets to re-enable training mode after evaluating the model.
model = MyModel()  # default mode is training mode
for e in range(epoch):
    # model.train()  # forgot to re-enable train mode
    logits = model(x_train)
    loss = loss_func(logits, y_train)
    model.zero_grad()
    loss.backward()
    optimizer.step()

    model.eval()  # enable eval mode
    with torch.no_grad():
        eval_preds = model(x_val)
Data shift (Data Shifting)
Data shift occurs when there is a significant difference between the training data set and the evaluation/test data set. In computer vision tasks, for example, most of the training images may have been taken during the day while the test images were taken at night.
If you find a large gap between training loss/accuracy and test loss/accuracy, randomly select samples from both data sets and inspect them. To address the problem, consider the following methods:
Ensure that similar data distribution is maintained between training, testing, and prediction datasets.
If possible, add more training data.
Add synthetic data by using related libraries. Consider using nlpaug (for natural language processing and acoustic tasks) and imgaug (for computer vision tasks).
Underfitting problem (Addressing Underfitting)
Underfitting means that the training error is higher than expected; in other words, the model cannot reach the expected performance. Many factors can cause a large error. To address it, start with the simpler remedies below and check whether they solve the problem:
Perform error analysis. Explain your model through LIME, SHAP, or Anchor so that you can sense the problem.
The initial model may be too simple. Increase model complexity, for example by adding long short-term memory (LSTM) layers, convolutional neural network (CNN) layers, or fully connected (FC) layers.
Reduce regularization so that the model fits slightly more. Dropout and weight decay exist to prevent overfitting, so try removing these regularization layers and see whether the underfitting goes away.
Adopt a more advanced model architecture. Consider using transformers (such as BERT or XLNet) for natural language processing (NLP).
Introduce synthetic data. Generating more data helps improve model performance without manual labeling. In theory, the generated data should share the same label; it lets the model "see" more varied data and ultimately improves robustness. nlpaug and imgaug can be used to perform this data augmentation.
Assign better hyperparameters and optimizers. Consider performing hyperparameter tuning instead of using the default learning rate, number of epochs, or batch size. Consider using beam search, grid search, or random search to identify better hyperparameters and optimizers (see the random-search sketch after this list). This method is relatively simple, requiring only hyperparameter changes, but it may take a long time.
Re-examine the data and introduce additional features.
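A minimal random-search sketch, assuming a user-defined train_and_evaluate helper and an illustrative search space (both are assumptions, not part of the original article):
# a hedged random-search sketch over learning rate and batch size
import random

search_space = {"lr": [1e-4, 3e-4, 1e-3, 3e-3], "batch_size": [32, 64, 128]}
best_score, best_params = float("-inf"), None
for _ in range(10):
    params = {name: random.choice(choices) for name, choices in search_space.items()}
    score = train_and_evaluate(**params)  # assumed user-defined training/evaluation helper
    if score > best_score:
        best_score, best_params = score, params
print(best_params, best_score)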
Overfitting problem (Addressing Overfitting)
In addition to underfitting, you may face the opposite problem of overfitting. Overfitting means that your model fits the training set too well and generalizes poorly to other data; in other words, training accuracy is much better than validation accuracy. Consider the following solutions:
Perform error analysis. Explain your model through LIME, SHAP, or Anchor so that you may find the problem.
Add more training data.
Introduce regularization and normalization layers. Dropout (a regularization layer) and batch normalization (a normalization layer) help reduce overfitting by randomly dropping some inputs and smoothing activations (see the sketch after this list).
Introduce synthetic data. Generating more data helps improve the performance of the model without any manual operation.
Assign better hyperparameters and optimizers.
Remove some features.
The model may be too complex; reduce its complexity.
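A minimal Keras sketch showing where dropout and batch normalization can be inserted; the layer sizes and class count are illustrative assumptions.
# a hedged sketch: a small classifier with batch normalization and dropout layers
from tensorflow.keras import layers, models

num_classes = 10  # assumed value

clf = models.Sequential([
    layers.Dense(128, activation="relu"),
    layers.BatchNormalization(),  # normalization layer: smooths/normalizes activations
    layers.Dropout(0.3),          # regularization layer: randomly drops 30% of units
    layers.Dense(num_classes, activation="softmax"),
])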
Production (Production)
Metadata Association (Meta Data Association)
After the model is rolled out, you will need to investigate abnormal predictions. One approach is to generate an ID for every prediction and store it in a database. However, this approach brings several problems and makes troubleshooting harder. The disadvantages include:
It reduces the flexibility of the system. From an architectural point of view, decoupling is one of the ways to build a highly flexible system. If we generate an ID and return the prediction together with that ID, the client has to persist it in their own database; if we later change its format or data type, we must notify every user to update their database.
We may need to collect more metadata keyed on the client's data. The additional key data increases join complexity and storage consumption.
To overcome this, the prediction results should be associated directly with the client's own key data rather than with a newly generated ID, as in the sketch below.
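A minimal sketch of this idea in a prediction handler; the request/response field names and the preprocess helper are illustrative assumptions.
# a hedged sketch: echo the client's own key with the prediction instead of generating a new ID
def predict_handler(request):
    features = preprocess(request["payload"])   # assumed preprocessing helper
    prediction = model(features)                # the trained model from earlier sections
    return {
        "client_key": request["client_key"],    # the client's existing key, no new ID created
        "prediction": prediction.tolist(),
    }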
Switch to inference mode (Switch to Inference Mode)
When using PyTorch, there are several settings to be aware of when deploying a model to production. The eval mode mentioned earlier makes layers such as Dropout and BatchNorm behave in inference mode, so for example no dropout is applied and all units contribute to the prediction; wrapping inference in torch.no_grad() additionally skips gradient bookkeeping and speeds up the process.
model.eval()  # enable eval mode
with torch.no_grad():
    eval_preds = model(x_val)
Scaling cost (Scaling Cost)
When trying to scale an API to handle larger data volumes, you may consider using a GPU. A GPU virtual machine is indeed much more expensive than a CPU one, but it brings advantages such as shorter computation time and fewer VMs needed to maintain the same service level. Data analysts should evaluate whether the GPU actually saves money overall.
Stateless (Stateless)
Try to make your API stateless so that the API service can be scaled easily. Statelessness means that no intermediate results are kept on the API server (in memory or on local storage). Keep the API server simple and return results to the client without storing anything.
Batch processing (Batch Process)
Predicting a batch of records is usually faster than predicting them one by one, because most modern machine learning and deep learning frameworks optimize prediction speed for batched input. You may notice a large efficiency gain simply by switching to batch prediction, as in the comparison below.
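A small illustrative comparison; model and records are assumed PyTorch objects (a trained module and a list of same-shaped input tensors).
# a hedged sketch: per-record prediction vs. one batched call
import torch

with torch.no_grad():
    preds_one_by_one = [model(x.unsqueeze(0)) for x in records]  # one record at a time: slow
    preds_batched = model(torch.stack(records))                  # single batched call: usually much faster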
Use C++
Although Python is the mainstream language in machine learning, it may be too slow compared with languages such as C++. If you need low-latency inference, consider TorchScript: you can still train your model in Python, then export a TorchScript model that can be loaded and run from C++.
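A minimal TorchScript sketch; the example input shape and file name are illustrative assumptions.
# a hedged sketch: trace a trained PyTorch model and save it for loading from C++ (libtorch)
import torch

example_input = torch.rand(1, 3, 224, 224)      # assumed input shape
traced = torch.jit.trace(model, example_input)   # model is the trained PyTorch model
traced.save("model_traced.pt")                   # load in C++ with torch::jit::load("model_traced.pt")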
That concludes this sharing of ML and DNN modeling skills. I hope the content above is helpful and that you learned something new from it. If you found the article useful, share it so more people can see it.