How do you do deep learning for NLP when you don't have enough data? Many newcomers are at a loss here, so this article summarizes the causes of the problem and practical ways to deal with it.
As a data scientist, one of your most important skills is choosing the right modeling techniques and algorithms for your problem. A few months ago, I was working on a text classification problem: deciding which news articles are relevant to my client.
I had only a few thousand labeled examples, so I started with simple, classic machine learning approaches such as logistic regression over TF-IDF features; this kind of model usually works well for text classification of long documents.
After examining my model's errors, I found that understanding words alone was not enough for this task. I needed a model with a deeper semantic understanding of the documents.
Deep learning models perform very well on complex tasks that require a deep understanding of text, such as translation, question answering, summarization, and natural language inference. This seemed like a good fit, but deep learning usually needs hundreds of thousands or even millions of labeled training examples, and a few thousand is clearly not enough.
Deep learning usually relies on large datasets to avoid overfitting. Deep neural networks have many parameters, so without enough data they tend to memorize the training set and perform poorly on the test set. To avoid this without big data, we need to use special techniques.
Regularization
Regularization methods are used in different ways inside machine learning models to avoid overfitting. They have a strong theoretical background and address most problems in a fairly general way.
L1 and L2 regularization
This method is probably the oldest and has been used in many machine learning models for years. Here we add a penalty on the weights (their L1 or L2 norm) to the loss function the model is trying to minimize. The model then tries to keep the weights small, and weights that do not help the model are driven close to zero, where they no longer affect the predictions. In this way we can fit the training set with fewer effective weights.
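For illustration, here is a minimal Keras sketch of adding L1/L2 penalties to a small classifier; the layer sizes and penalty strengths are illustrative assumptions, not values from the original experiments.

```python
# Minimal sketch: L2 and L1 penalties on the weights of a small text classifier.
# Layer sizes and penalty strengths are illustrative, not tuned values.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    layers.Input(shape=(5000,)),                                  # e.g. TF-IDF features
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(0.01)),       # L2 keeps weights small
    layers.Dense(1, activation="sigmoid",
                 kernel_regularizer=regularizers.l1(0.001)),      # L1 drives weights to zero
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```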
Dropout
Dropout is another, newer regularization method. During training, each node (neuron) in the neural network is randomly dropped (its output is set to zero). That way the network cannot rely on specific neurons or on specific interactions between neurons, and must instead learn each pattern in different parts of the network. This makes the model focus on the important patterns that generalize to new data.
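A minimal Keras sketch of dropout between layers; the 0.5 rate is a common default, not a tuned value from the post.

```python
# Minimal sketch: dropout in a small text classifier.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(300,)),            # e.g. averaged word vectors
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                   # randomly zero 50% of activations during training
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```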
Early stopping
Early stopping is a simple regularization method: just monitor performance on a validation set and stop training once validation performance stops improving. This method is very important without big data, because models tend to overfit after 5-10 epochs or even earlier.
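A minimal sketch using the Keras EarlyStopping callback; the model and the training/validation arrays are assumed to exist, and the patience value is illustrative.

```python
# Minimal sketch: stop training when validation loss stops improving.
# `model`, x_train, y_train, x_val, y_val are assumed to exist already.
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss", patience=3,   # wait 3 epochs with no improvement
                           restore_best_weights=True)        # roll back to the best checkpoint
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=50, callbacks=[early_stop])
```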
Small number of parameters
If you do not have a large dataset, you should be very careful about the number of layers and the number of neurons in each layer. In addition, special layers such as convolutional layers have far fewer parameters than fully connected layers, so it is useful to use them when they suit your problem.
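As a rough illustration of the parameter savings, the sketch below builds a tiny Conv1D text model; the shapes are assumptions, and model.summary() shows the convolution uses only about 19k parameters where a fully connected layer over the flattened input would use over a million.

```python
# Minimal sketch: a 1D convolution reuses one small kernel across positions,
# so it needs far fewer parameters than a dense layer over the same input.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(200, 100)),                        # 200 tokens x 100-dim embeddings
    layers.Conv1D(64, kernel_size=3, activation="relu"),   # 3*100*64 + 64 = 19,264 params
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation="sigmoid"),
])
model.summary()   # compare: Dense(64) on the flattened 20,000 inputs would need 1,280,064 params
```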
Data augmentation
Data augmentation is a way to create more training data by changing training examples in ways that do not change the label. In computer vision, many image transformations are used to augment datasets, such as flipping, cropping, scaling, rotation, and so on.
These transformations work well for image data, but not for text. For example, flipping a sentence like "Dog loves me" does not produce a valid sentence, and training on it would make the model learn garbage. Here are some text data augmentation methods:
Synonym substitution
In this method, we replace random words in the text with their synonyms. For example, we change the sentence "I like this movie very much" to "I love this movie very much"; it still has the same meaning and, most likely, the same label. This method did not work for me, because synonyms have very similar word vectors, so the model treats the two sentences as almost identical rather than as a genuinely new example.
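A minimal sketch of synonym replacement using WordNet through NLTK; it assumes nltk is installed and the WordNet corpus has been downloaded with nltk.download("wordnet"), and it is not the exact procedure used in the post.

```python
# Minimal sketch: replace a couple of random words with WordNet synonyms.
import random
from nltk.corpus import wordnet

def replace_with_synonyms(sentence, n_replacements=2):
    words = sentence.split()
    positions = list(range(len(words)))
    random.shuffle(positions)
    replaced = 0
    for i in positions:
        # collect synonym lemmas that differ from the original word
        synonyms = {l.name().replace("_", " ")
                    for s in wordnet.synsets(words[i])
                    for l in s.lemmas()
                    if l.name().lower() != words[i].lower()}
        if synonyms:
            words[i] = random.choice(sorted(synonyms))
            replaced += 1
        if replaced >= n_replacements:
            break
    return " ".join(words)

print(replace_with_synonyms("I like this movie very much"))
```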
Back translation
In this method, we take our text, machine-translate it into an intermediate language, and then translate it back into the original language. This method was used successfully in the Kaggle Toxic Comment Classification Challenge. For example, if we translate "I like this movie very much" into Russian and then back into English, we get "I really like this movie". Back translation gives us synonym replacement like the first method, but it can also add or remove words and paraphrase the sentence while preserving the meaning.
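A minimal back-translation sketch using the Hugging Face transformers library with Helsinki-NLP MarianMT checkpoints; the post only mentions machine translation generically, so the specific library and model names here are assumptions.

```python
# Minimal sketch: English -> Russian -> English back translation with MarianMT.
from transformers import MarianMTModel, MarianTokenizer

def _translate(sentence, model_name):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer([sentence], return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.decode(generated[0], skip_special_tokens=True)

def back_translate(text, src="en", pivot="ru"):
    # model names follow the Helsinki-NLP opus-mt convention (an assumption)
    pivot_text = _translate(text, f"Helsinki-NLP/opus-mt-{src}-{pivot}")
    return _translate(pivot_text, f"Helsinki-NLP/opus-mt-{pivot}-{src}")

print(back_translate("I like this movie very much"))
```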
Document cropping
News articles are very long, and when looking at the data I realized I don't need the entire article to classify a document. This gave me the idea of cropping articles into several sub-documents as data augmentation, so I would get more data. First, I tried sampling a few sentences from each document and creating 10 new documents from it. This produced documents whose sentences had no logical relation to each other, and I got a bad classifier. My second attempt was to split each article into chunks of five consecutive sentences. This approach worked very well and gave me a nice performance boost.
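A minimal sketch of the second, successful variant: splitting each labeled article into chunks of five consecutive sentences, each inheriting the article's label. The naive period-based sentence split and the labeled_articles variable are simplifying assumptions.

```python
# Minimal sketch: crop long articles into 5-sentence chunks that keep the label.
def crop_document(text, label, chunk_size=5):
    sentences = [s.strip() for s in text.split(".") if s.strip()]   # naive split
    chunks = []
    for i in range(0, len(sentences), chunk_size):
        chunk = ". ".join(sentences[i:i + chunk_size])
        chunks.append((chunk, label))          # every chunk keeps the original label
    return chunks

augmented = []
for text, label in labeled_articles:           # labeled_articles: list of (text, label), assumed
    augmented.extend(crop_document(text, label))
```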
Generative adversarial networks
GANs are one of the most exciting recent developments in data science, and they are usually used as generative models for image creation. There are blog posts explaining how to use GANs for data augmentation of image data, but they can also be used for text.
Transfer learning
Transfer learning means using weights from a network that was trained on another problem, usually with a large dataset, for your own problem. Transfer learning is sometimes used to initialize the weights of some layers, and sometimes as a fixed feature extractor that we do not train further. In computer vision, starting from a pre-trained ImageNet model is a very common way to solve problems, but NLP does not have a single very large dataset like ImageNet that can be used for transfer learning.
Pre-trained word vectors
NLP deep learning architectures usually start with an embedding layer that converts one-hot encoded words into dense numerical vectors. We can train the embedding layer from scratch, but we can also use pre-trained word vectors such as Word2Vec, FastText, or GloVe, which were trained with unsupervised learning on huge amounts of data or on data from our own domain. Pre-trained word vectors are very effective because they give the model context for each word based on a large amount of data, and they reduce the number of trainable parameters, which significantly lowers the risk of overfitting.
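A minimal sketch of loading pre-trained GloVe vectors into a frozen Keras embedding layer; the file path, vector dimension, and word_index mapping are assumptions.

```python
# Minimal sketch: initialise a Keras Embedding layer with pre-trained GloVe vectors.
# `word_index` (word -> integer id) is assumed to come from your tokenizer.
import numpy as np
from tensorflow.keras.layers import Embedding

embedding_dim = 100
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
with open("glove.6B.100d.txt", encoding="utf-8") as f:        # file path is an assumption
    for line in f:
        values = line.split()
        word, vector = values[0], np.asarray(values[1:], dtype="float32")
        if word in word_index:
            embedding_matrix[word_index[word]] = vector

embedding_layer = Embedding(len(word_index) + 1, embedding_dim,
                            weights=[embedding_matrix],
                            trainable=False)   # freeze to reduce trainable parameters
```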
Pre-trained sentence vectors
We can change the input of the model from words to sentences; this lets us use smaller models with fewer parameters that still have sufficient expressive power. To do this, we can use pre-trained sentence encoders such as Facebook's InferSent or Google's Universal Sentence Encoder. We can also train a sentence encoder on unlabeled data using methods such as skip-thought vectors or language models. You can learn more about unsupervised sentence vectors in my previous blog post.
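A minimal sketch of classifying on top of fixed sentence embeddings, here using Google's Universal Sentence Encoder from TensorFlow Hub; the module URL and the train_sentences/train_labels variables are assumptions rather than details from the post.

```python
# Minimal sketch: encode sentences with a pre-trained sentence encoder and
# train a small classifier on the fixed embeddings.
import tensorflow_hub as hub
from sklearn.linear_model import LogisticRegression

encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")  # URL assumed
X_train = encoder(train_sentences).numpy()   # train_sentences: list of strings, assumed
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
```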
Pre-trained language models
Recent papers such as ULMFiT, the OpenAI transformer, and BERT have achieved amazing results on many NLP tasks by pre-training a language model on a very large corpus. Language modeling is the task of predicting the next word in a sentence from the previous words. For me, this kind of pre-training did not really improve my results, but the papers describe fine-tuning approaches that should help, which I have not tried yet. There are good blog posts about pre-trained language models.
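For illustration, here is a minimal sketch of fine-tuning a pre-trained language model for classification with the Hugging Face transformers library; this is not the setup used in the post, and the checkpoint name and label are placeholders.

```python
# Minimal sketch: one illustrative fine-tuning step of a pre-trained model
# on a classification example (optimizer and data loop omitted).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)
batch = tokenizer(["I like this movie very much"], return_tensors="pt",
                  padding=True, truncation=True)
outputs = model(**batch, labels=torch.tensor([1]))   # label is a placeholder
outputs.loss.backward()                              # gradients for one step
```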
Unsupervised or self-supervised pre-training
If we have a large amount of unlabeled data from our domain, we can use unsupervised methods, such as autoencoders or masked language models, to pre-train the model using only the text itself. Another option that worked better for me is self-supervision. A self-supervised model is one whose labels are extracted automatically, without human annotation. A good example is the DeepMoji project, where the authors trained a model to predict emojis from tweets; after getting good results on emoji prediction, they used their network to pre-train a tweet sentiment analysis model that achieved state-of-the-art results. Emoji prediction is obviously closely related to sentiment analysis, so it performed very well as a pre-training task. Self-supervised tasks for news data could include predicting the headline, the newspaper, the number of comments, the number of retweets, and so on. Self-supervision can be a very good pre-training approach, but it is often hard to tell which proxy label will correlate with your real label.
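A minimal sketch of the proxy-label idea for news data: pre-train a small encoder to predict an automatically available signal (here, comment counts), then reuse it for the real, small labeled task. All layer sizes and variable names are illustrative assumptions.

```python
# Minimal sketch: self-supervised pre-training on a proxy label, then reuse.
import tensorflow as tf
from tensorflow.keras import layers

encoder = tf.keras.Sequential([
    layers.Input(shape=(200,)),                 # token ids, length 200 assumed
    layers.Embedding(20000, 64),
    layers.Conv1D(64, 3, activation="relu"),
    layers.GlobalMaxPooling1D(),
], name="shared_encoder")

# 1) Pre-train on the proxy label, which needs no human annotation.
proxy_model = tf.keras.Sequential([encoder, layers.Dense(1)])
proxy_model.compile(optimizer="adam", loss="mse")
# proxy_model.fit(article_token_ids, comment_counts, epochs=5)   # unlabeled-source data, assumed

# 2) Reuse the encoder (optionally frozen) for the real, small labeled task.
encoder.trainable = False
classifier = tf.keras.Sequential([encoder, layers.Dense(1, activation="sigmoid")])
classifier.compile(optimizer="adam", loss="binary_crossentropy")
# classifier.fit(labeled_token_ids, labels, epochs=10)            # small labeled set, assumed
```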
Feature engineering
I know deep learning is said to have "killed" feature engineering, and that the topic is a bit out of fashion. However, when you don't have a large dataset, helping the network learn complex patterns through feature engineering can greatly improve performance. For example, in my classification of news articles, the author, the newspaper, the number of comments, the tags, and other metadata can help predict the label.
Multimodal architecture
We can use a multimodal architecture to combine document-level features with the text in our model. In a multimodal setup, we build two different networks, one for the text and one for the features, merge their output layers, and add more layers on top. These models are tricky to train, because the features usually carry a stronger signal than the text, so the network mostly learns the feature effects. There are good Keras tutorials on multimodal networks. This approach improved my performance by less than 1%.
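A minimal Keras sketch of such a multimodal network, with one branch for token ids and one for document-level features merged before the output; the shapes and feature count are illustrative assumptions.

```python
# Minimal sketch: two-branch (text + document features) model with the Keras functional API.
import tensorflow as tf
from tensorflow.keras import layers

text_input = layers.Input(shape=(200,), name="token_ids")
x = layers.Embedding(20000, 64)(text_input)
x = layers.GlobalAveragePooling1D()(x)

feat_input = layers.Input(shape=(5,), name="doc_features")   # e.g. comment count, tag flags
f = layers.Dense(8, activation="relu")(feat_input)

merged = layers.concatenate([x, f])                          # merge the two branches
merged = layers.Dense(32, activation="relu")(merged)
output = layers.Dense(1, activation="sigmoid")(merged)

model = tf.keras.Model([text_input, feat_input], output)
model.compile(optimizer="adam", loss="binary_crossentropy")
```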
Word-level features
Another type of feature engineering is word-level features, such as part-of-speech tags, semantic role labels, extracted entities, and so on. We can combine a one-hot encoded or embedded representation of the word-level feature with the word's embedding and use it as input to the model. We can also use other word features this way; for example, in a sentiment analysis task we can take a sentiment dictionary and add another dimension to the embedding, where 1 means the word is in the dictionary and 0 means it is not. That way the model can easily learn which words it needs to pay attention to. In my task, I added dimensions for some important entities, which gave me a nice performance boost.
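A minimal sketch of the sentiment-dictionary idea: append one extra dimension to each word vector that flags membership in a lexicon. The lexicon, the embedding lookup, and the 100-dimensional vectors are illustrative assumptions.

```python
# Minimal sketch: concatenate a lexicon-membership flag to each word embedding.
import numpy as np

sentiment_lexicon = {"love", "great", "terrible", "awful"}    # illustrative lexicon

def token_features(tokens, embeddings):
    """embeddings: dict mapping word -> np.ndarray of shape (100,), assumed given."""
    rows = []
    for tok in tokens:
        vec = embeddings.get(tok, np.zeros(100))              # 100-dim vectors assumed
        flag = np.array([1.0 if tok in sentiment_lexicon else 0.0])
        rows.append(np.concatenate([vec, flag]))              # 100 + 1 features per token
    return np.stack(rows)
```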
Preprocessing as feature engineering
The last feature engineering method is to preprocess the input text in a way that makes it easier for the model to learn. One example is a special kind of "stemming": if the specific sport is not important for our labels, we can map the words football, baseball, and tennis to a single common token. This helps the network learn that the differences between sports do not matter and reduces the number of parameters in the network. Another example is automatic summarization: as mentioned earlier, neural networks do not perform well on long texts, so we can run an automatic summarization algorithm such as TextRank on the text and give the network only the important sentences.
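A minimal sketch of the first idea, collapsing sport names into a single token before feeding text to the network; the word list and the <sport> token are illustrative assumptions.

```python
# Minimal sketch: collapse words whose fine-grained differences do not matter
# for the label into one shared token before tokenization.
SPORTS = {"football", "baseball", "tennis", "basketball"}     # illustrative set

def collapse_sports(text):
    return " ".join("<sport>" if w.lower() in SPORTS else w
                    for w in text.split())

print(collapse_sports("The baseball game and the tennis match were postponed"))
# -> "The <sport> game and the <sport> match were postponed"
```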
After reading the above, do you know what to do when there is not enough data for deep learning NLP? I hope these techniques help you solve the problem. Thank you for reading!