
Ten deep learning methods that AI practitioners need to apply

2025-07-11 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--


Abstract: If you want to understand artificial intelligence, how can you get by without knowing these ten deep learning methods?

In the past decade, interest in machine learning has surged. Almost every day, machine learning comes up in computer science courses, at industry conferences, in the Wall Street Journal, and elsewhere. In all of this discussion, many people conflate what machine learning can do with what they wish it could do. Fundamentally, machine learning uses algorithms to extract information from raw data and represent it in some type of model. We then use this model to infer things about other data that has not yet been modeled.

Neural networks are one type of machine learning model, and they have been around for at least 50 years. The basic unit of a neural network is the node, which is loosely inspired by the biological neuron in the mammalian brain. The connections between nodes are also modeled on biological brains, as is the way these connections develop over time through "training".

In the mid-1980s and early 1990s, many important architectural advances were made in neural networks. However, the amount of time and data needed to get good results kept growing, which cooled researchers' interest. In the early 2000s, computing power grew exponentially, and researchers saw a "Cambrian explosion" of computational techniques. Deep learning emerged from that explosive growth in computing power as a serious contender in the field, winning many important machine learning competitions. The trend has not abated; today, deep learning is mentioned in every corner of machine learning.

To keep myself up to date, I took Udacity's "Deep Learning" course, which is a good introduction to the motivation for deep learning and to designing intelligent systems that learn from large-scale datasets in TensorFlow. In the class, I developed convolutional neural networks for image recognition, embedding-based neural networks for natural language processing, and character-level text generation with recurrent neural networks / long short-term memory networks. All the code, as Jupyter Notebooks, can be found in this GitHub repository.

Recently, I began reading academic papers on deep learning. From my research, here are some publications that have had a great impact on the development of the field:

New York University's "Gradient-Based Learning Applied to Document Recognition" (1998), which introduced convolutional neural networks to the machine learning world.

The University of Toronto's "Deep Boltzmann Machines" (2009), which presents a new learning algorithm for Boltzmann machines that contain many layers of hidden variables.

Stanford and Google's "Building High-Level Features Using Large-Scale Unsupervised Learning" (2012), which addresses the problem of building high-level, class-specific feature detectors from only unlabeled data.

Berkeley's "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition" (2013), which released DeCAF, an open-source implementation of the deep convolutional activation features, along with all associated network parameters, so that vision researchers can experiment with deep representations across a range of visual concept learning paradigms.

DeepMind's "Playing Atari with Deep Reinforcement Learning" (2013), which presents the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning.

Through researching and studying these papers, I have learned a great deal about deep learning. Here, I'd like to share 10 powerful deep learning methods that AI engineers can apply to their machine learning problems. But first, let's define what deep learning is. Defining deep learning has been a challenge for many people because its form has gradually changed over the past decade. To put deep learning in context, the following figure illustrates the relationship between artificial intelligence, machine learning, and deep learning.

The field of artificial intelligence is broad and has existed for a long time. Deep learning is a subset of machine learning, which in turn is a subfield of artificial intelligence. The following facets distinguish deep learning networks from earlier feed-forward multilayer networks:

More neurons than previous networks

More complex ways of connecting layers

The "Cambrian explosion" of computing power available for training

Automatic feature extraction

By "more neurons", I mean that the neuron count has risen over the years so that networks can express more complex models. Layers have also evolved from being fully connected, as in multilayer networks, to locally connected patches of neurons between layers in convolutional neural networks, and to recurrent connections to the same neuron in recurrent neural networks (in addition to the connections from the previous layer).

Deep learning can then be defined as neural networks with a large number of parameters and layers, in one of four fundamental network architectures:

Unsupervised pre-trained networks

Convolutional neural networks

Recurrent neural networks

Recursive neural networks

In this article, I mainly discuss the latter three architectures.

A convolutional neural network (CNN) is basically a standard neural network that has been extended across space using shared weights. A CNN is designed to recognize images by applying convolutions internally, which pick out the edges of an object in the image.
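To make the idea of shared weights concrete, here is a minimal NumPy sketch of a single convolution. The function name `conv2d`, the toy image, and the hand-set `edge_kernel` are assumptions chosen for illustration; in a real CNN the kernel values are learned rather than written by hand.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one kernel over every valid position (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same kernel weights are reused at every location.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 4x6 image: dark left half (0), bright right half (1).
image = np.zeros((4, 6))
image[:, 3:] = 1.0

# Hypothetical hand-set kernel that responds to a dark-to-bright vertical edge.
edge_kernel = np.array([[-1.0, 1.0]])

response = conv2d(image, edge_kernel)
print(response)
```

The response is non-zero only in the column where the image switches from dark to bright, which is the sense in which a convolution "sees" an edge.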

A recurrent neural network (RNN) is basically a standard neural network that has been extended across time: it has edges that feed into the next time step instead of into the next layer within the same time step.

An RNN is designed to recognize sequences, such as speech signals or text. Its internal cycles mean that the network has a short-term memory. A recursive neural network is more like a hierarchical network: the input sequence has no real time aspect, but the input has to be processed hierarchically, in a tree-like fashion.

The following 10 methods can be applied to all of these architectures.

1-Back propagation

Back propagation ("back-prop") is simply a method for computing the partial derivatives (the gradient) of a function that has the form of a function composition, as in a neural network. When you solve an optimization problem with a gradient-based method (gradient descent is just one of them), you need to compute the gradient of the function at each iteration.

For a neural network, the objective function is a composition of functions. How do you compute its gradient? There are two common ways: (i) Analytic differentiation. If you know the form of the function, you compute the derivatives with the chain rule (basic calculus). (ii) Approximate differentiation using finite differences. This method is computationally expensive because the number of function evaluations is O(N), where N is the number of parameters, which is costly compared with analytic differentiation. Finite differences are, however, commonly used to validate a back propagation implementation when debugging.
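The two approaches can be compared on a tiny example. The sketch below assumes a hypothetical network with one tanh hidden unit and two scalar weights (`w1`, `w2`); it is only meant to show the chain rule at work and how finite differences can check it.

```python
import math

def loss(w1, w2, x, y):
    """Composite function: squared error of a one-hidden-unit network."""
    h = math.tanh(w1 * x)      # hidden activation
    y_hat = w2 * h             # network output
    return (y_hat - y) ** 2

def backprop(w1, w2, x, y):
    """Analytic gradient via the chain rule, working from output to input."""
    h = math.tanh(w1 * x)
    y_hat = w2 * h
    d_yhat = 2.0 * (y_hat - y)         # dL/dy_hat
    d_w2 = d_yhat * h                  # dL/dw2
    d_h = d_yhat * w2                  # dL/dh
    d_w1 = d_h * (1.0 - h * h) * x     # dL/dw1, using tanh'(z) = 1 - tanh(z)^2
    return d_w1, d_w2

def finite_difference(w1, w2, x, y, eps=1e-6):
    """Central differences: two loss evaluations per parameter."""
    d_w1 = (loss(w1 + eps, w2, x, y) - loss(w1 - eps, w2, x, y)) / (2 * eps)
    d_w2 = (loss(w1, w2 + eps, x, y) - loss(w1, w2 - eps, x, y)) / (2 * eps)
    return d_w1, d_w2

analytic = backprop(0.5, -1.2, 2.0, 1.0)
numeric = finite_difference(0.5, -1.2, 2.0, 1.0)
print("back-prop:        ", analytic)
print("finite difference:", numeric)
```

With only two parameters the finite-difference check is cheap; with millions of parameters the same check needs millions of loss evaluations, which is why it is used for debugging rather than training.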

2-Stochastic gradient descent

An intuitive way to think about gradient descent is to imagine the path of a river flowing down from the top of a mountain. The goal of gradient descent is exactly what the river strives to achieve: to reach the lowest point at the foot of the mountain.

Now, if the terrain of the mountain is shaped so that the river never has to stop completely before reaching its final destination, we have the ideal case. In machine learning, this amounts to saying that we have found the global minimum (the optimum) of the solution starting from the initial point (the top of the mountain). However, the terrain may contain pits along the river's path, which can trap the river and make it stagnate. In machine learning, such pits are called local minima, and they are undesirable. There are many ways to deal with local minima, which I will not discuss further here.

Gradient descent is therefore prone to getting stuck in local minima, depending on the nature of the terrain (the function, in ML terms). But when the mountain has a special bowl shape (a convex function, in ML terms), the algorithm can always find the optimum. You can picture the river again. Such terrain (convex functions) is always welcome for optimization in machine learning. In addition, depending on where on the mountain you start (the initial value of the function), you may end up taking a completely different path to the bottom. Similarly, depending on how fast the river flows (the learning rate, or step size, of the gradient descent algorithm), you may reach your destination in a different way. Both of these factors affect whether you fall into a pit (local minimum) or avoid it.
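The effect of terrain, starting point, and step size can be seen in one dimension. The sketch below uses plain (full-gradient) descent on two made-up functions: a convex bowl, f(x) = x^2, and a non-convex function with two valleys, f(x) = x^4 - 2x^2 + 0.5x. The functions, starting points, and learning rates are assumptions picked for illustration.

```python
def gradient_descent(grad, x0, learning_rate, steps=200):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

def bowl_grad(x):
    """Gradient of the convex bowl f(x) = x^2."""
    return 2.0 * x

def bumpy(x):
    """Non-convex terrain with a deep valley on the left, a shallow one on the right."""
    return x**4 - 2.0 * x**2 + 0.5 * x

def bumpy_grad(x):
    return 4.0 * x**3 - 4.0 * x + 0.5

# Convex case: any starting point reaches the same minimum at x = 0.
bowl_a = gradient_descent(bowl_grad, x0=-3.0, learning_rate=0.1)
bowl_b = gradient_descent(bowl_grad, x0=5.0, learning_rate=0.1)

# Non-convex case: the starting point decides which valley we end up in.
from_left = gradient_descent(bumpy_grad, x0=-2.0, learning_rate=0.05)
from_right = gradient_descent(bumpy_grad, x0=2.0, learning_rate=0.05)

# Same start as from_right, but a larger step carries us over the ridge.
from_right_big_step = gradient_descent(bumpy_grad, x0=2.0, learning_rate=0.1)

print(bowl_a, bowl_b)
print(from_left, bumpy(from_left))
print(from_right, bumpy(from_right))
print(from_right_big_step, bumpy(from_right_big_step))
```

Starting on the right with the small step ends in the shallow valley (a local minimum), while the left start, or the right start with the larger step, ends in the deeper one.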

3-Learning rate decay

Adapting the learning rate of your stochastic gradient descent optimization procedure can improve performance and shorten training time. This is sometimes called learning rate annealing or adaptive learning rates. The simplest and perhaps most commonly used adaptation is to reduce the learning rate over time. Large learning rate values at the beginning of training allow large changes to the weights; later in training, the learning rate is reduced so that the model makes smaller updates to the weights. The effect is to learn good weights quickly early on and fine-tune them later.

Two popular and easy-to-use learning rate decay schedules are as follows:

Decrease the learning rate gradually, epoch by epoch.

Decrease the learning rate in punctuated, large drops at specific epochs.
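The two schedules can be sketched as small functions of the epoch number. The formulas below are one common way to write them, not the only one; the function names and the parameters `decay`, `drop`, and `epochs_per_drop` are assumptions for illustration.

```python
def time_based_decay(initial_lr, decay, epoch):
    """Gradual decay: the rate shrinks a little with every epoch."""
    return initial_lr / (1.0 + decay * epoch)

def step_decay(initial_lr, drop, epochs_per_drop, epoch):
    """Punctuated decay: multiply the rate by `drop` every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)

for epoch in (0, 5, 10, 20, 25):
    gradual = time_based_decay(0.1, decay=0.1, epoch=epoch)
    stepped = step_decay(0.1, drop=0.5, epochs_per_drop=10, epoch=epoch)
    print(epoch, round(gradual, 4), round(stepped, 4))
```

The gradual schedule changes at every epoch, while the step schedule stays flat for ten epochs at a time and then halves.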

4-Dropout

Deep neural networks with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, which makes it difficult to deal with overfitting by combining the predictions of many different large neural networks at test time. Dropout is a technique for addressing this problem.

The key idea is to randomly drop units, along with their connections, from the neural network during training, which prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network with smaller weights. This significantly reduces overfitting and performs better than other regularization methods.

Dropout has been shown to improve the performance of neural networks on supervised learning tasks in computer vision, speech recognition, document classification, and computational biology, obtaining state-of-the-art results on many benchmark datasets.
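A minimal NumPy sketch of the train/test asymmetry described above, assuming a keep probability of 0.8 and a toy layer whose activations are all 1. Scaling the activations by the keep probability at test time has the same effect as shrinking the outgoing weights. Many libraries instead rescale during training ("inverted dropout"); that variant is not shown here.

```python
import numpy as np

def dropout_train(activations, keep_prob, rng):
    """Training: zero each unit independently with probability 1 - keep_prob."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask

def dropout_test(activations, keep_prob):
    """Test: keep every unit, but scale by keep_prob (equivalent to smaller weights)."""
    return activations * keep_prob

rng = np.random.default_rng(0)
keep_prob = 0.8                      # assumed value for illustration
hidden = np.ones((10000, 4))         # toy activations

train_out = dropout_train(hidden, keep_prob, rng)
test_out = dropout_test(hidden, keep_prob)

print("mean activation during training:", train_out.mean())
print("mean activation at test time:   ", test_out.mean())
```

The two means are close, which is the point of the test-time scaling: the unthinned network sees, on average, the same input magnitude that the thinned networks saw during training.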

5-Max pooling

Max pooling is a sample-based discretization process. The objective is to down-sample an input representation (an image, a hidden-layer output matrix, and so on), reducing its dimensionality and summarizing the features contained in each sub-region.

This is done in part to help with overfitting by providing an abstracted form of the representation. It also reduces the computational cost by reducing the number of parameters to learn, and it gives the internal representation basic translation invariance. Max pooling is done by applying a max filter to (usually) non-overlapping sub-regions of the initial representation.
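As a small illustration of a max filter over non-overlapping sub-regions, the NumPy sketch below pools a 4x4 matrix with 2x2 windows. The helper name `max_pool_2x2` and the input values are made up for the example.

```python
import numpy as np

def max_pool_2x2(x):
    """Max over non-overlapping 2x2 windows of a 2-D array with even sides."""
    h, w = x.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    # Split rows and columns into (block index, position inside block),
    # then take the maximum inside each block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 1, 5, 6],
              [2, 2, 7, 8]])

pooled = max_pool_2x2(x)
print(pooled)   # [[4 2]
                #  [2 8]]
```

The output has a quarter as many entries as the input, and a one-pixel shift of a feature inside a window leaves the pooled value unchanged.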

6-Batch normalization

Naturally, neural networks, including deep networks, require careful tuning of weight initialization and learning parameters. Batch normalization helps make this process a little easier.

Weights problem:

Whatever the initialization of the weights, whether random or empirically chosen, they are far from the learned weights. Consider a mini-batch: during the initial epochs, there will be many outliers among the feature activations.

A deep neural network is by itself fragile, that is, a small perturbation in the initial layers leads to a large change in the later layers.

During back propagation, these phenomena distract the gradients: the gradients have to compensate for the outliers before learning the weights that produce the required outputs. This adds extra time to convergence.

Batch normalization keeps the gradients from being distracted by outliers and lets them flow toward the common goal, by normalizing the activations within each mini-batch.

Learning rate problem: In general, learning rates are kept small so that only a small portion of the gradient corrects the weights, because gradients caused by outlier activations should not affect weights that have already been learned. With batch normalization, these outlier activations are reduced, so higher learning rates can be used to speed up learning.
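The normalization step itself is short. The NumPy sketch below shows only the training-time forward pass over one mini-batch, with a learnable scale `gamma` and shift `beta`; the running statistics used at inference time and the backward pass are left out, and the data and outlier value are made up.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                       # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 3))   # mini-batch: 64 samples, 3 features
x[0, 0] = 100.0                                    # an outlier activation

out = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))

print("input range: ", x.min(), x.max())
print("output mean: ", out.mean(axis=0))
print("output std:  ", out.std(axis=0))
```

After normalization every feature has roughly zero mean and unit standard deviation across the batch, so the outlier no longer sets the scale of what the next layer receives.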

7-Long short-term memory:

An LSTM network has the following three aspects that distinguish it from the usual neuron in a recurrent neural network:

1. It controls when the input is allowed into the neuron.

2. It controls when to remember what was computed in the previous time step.

3. It controls when the output is passed on to the next time step.

The advantage of the LSTM is that it decides all of this based on the current input itself, as shown in the following figure:

The input signal x(t) at the current time step decides all three points above. The input gate makes the decision for point 1, the forget gate makes the decision for point 2, and the output gate makes the decision for point 3. The input alone is capable of making all three decisions. This is inspired by how our brains work and can handle sudden context switches.
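The three gates can be written out as a single time step. The NumPy sketch below follows the commonly used LSTM formulation, in which each gate looks at the current input x(t) together with the previous hidden state; the weight layout (`W` with the four gate blocks stacked), the sizes, and the random initial values are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step; W maps [h_prev, x] to four stacked gate pre-activations."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    i = sigmoid(z[0:n])              # input gate: let new input in? (point 1)
    f = sigmoid(z[n:2 * n])          # forget gate: keep the old memory? (point 2)
    o = sigmoid(z[2 * n:3 * n])      # output gate: pass the output on? (point 3)
    g = np.tanh(z[3 * n:4 * n])      # candidate values to write into memory
    c = f * c_prev + i * g           # updated cell state (the memory)
    h = o * np.tanh(c)               # output passed to the next time step
    return h, c

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4       # assumed toy sizes
W = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):   # a sequence of 5 inputs
    h, c = lstm_step(x, h, c, W, b)

print("final hidden state:", h)
print("final cell state:  ", c)
```

Because every gate is a sigmoid between 0 and 1, each decision is soft: the cell partly admits, partly remembers, and partly emits, and the weights that set these proportions are learned.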

8-Skip-gram:

The goal of word embedding models is to learn a high-dimensional dense representation for each vocabulary term, in which the similarity between embedding vectors reflects the semantic or syntactic similarity between the corresponding words. Skip-gram is one model for learning word embeddings.

The main idea behind the skip-gram model (and many other word embedding models) is this: two vocabulary terms are similar if they share similar contexts.

In other words, suppose you have a sentence such as "cats are mammals". If you use the term "dogs" instead of "cats", the sentence is still meaningful. So in this example, "dogs" and "cats" can share the same context (namely, "are mammals").

Based on this hypothesis, you can consider a context window (a window containing k consecutive terms). You then skip one of these words and train a neural network that receives all the terms except the skipped one and predicts the skipped term. As a result, if two words repeatedly share similar contexts in a large corpus, their embedding vectors will be close to each other.
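"Close to each other" is usually measured with cosine similarity. The sketch below uses three made-up 3-dimensional vectors purely to show the computation; real embeddings are learned from a corpus and have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embedding vectors, invented for this example.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print("cat vs dog:", round(cat_dog, 3))
print("cat vs car:", round(cat_car, 3))
```

Words that share contexts, like "cat" and "dog" here, are expected to score higher than words that do not.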

9-Continuous bag of words (CBOW):

In natural language processing problems, we want to learn to represent each word in a document as a numerical vector, so that words appearing in similar contexts have vectors that are close to each other. In the continuous bag of words model, the goal is to use the context surrounding a particular word to predict that word.

We do this by taking a large number of sentences from a large corpus, and every time we see a word, we take its surrounding context words. We then feed the context words into a neural network and predict the word at the center of the context.

When we have thousands of such context words and center words, we have one instance of a dataset for the neural network. We train the neural network, and the output of the encoded hidden layer then represents the embedding of a particular word. It so happens that when we train on a large number of sentences, words in similar contexts get similar vectors.
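Building the (context words, center word) examples is the part that is easy to show in a few lines. The sketch below covers only this data preparation step, not the network itself; the function name `cbow_examples` and the window size are assumptions for illustration.

```python
def cbow_examples(tokens, window=2):
    """Return (context words, center word) pairs for every position in tokens."""
    examples = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after
        context = left + right
        if context:                               # skip one-word sentences
            examples.append((context, center))
    return examples

sentence = "cats are mammals".split()
pairs = cbow_examples(sentence, window=1)
for context, center in pairs:
    print(context, "->", center)
# ['are'] -> cats
# ['cats', 'mammals'] -> are
# ['are'] -> mammals
```

Each pair is one training example: the context words are the network's input and the center word is the target it should predict.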

10-Transfer learning:

Consider how an image passes through a convolutional neural network. Suppose you have an image, you apply a convolution to it, and you get combinations of pixels as output. Say these are edges. Apply a convolution again, and the output is now combinations of edges, or lines. Apply a convolution again, and the output is combinations of lines, and so on. You can think of each layer as looking for a specific pattern. The last layer of the neural network tends to become very specialized. If you are working with ImageNet, the last layer of your network will be looking for children or dogs or airplanes or something similar. Step back a few layers and you may see the network looking for eyes or ears or mouths or wheels.

Each layer in a deep CNN progressively builds up higher- and higher-level feature representations. The last few layers tend to be specialized to whatever data you fed into the model. The early layers, on the other hand, are much more generic, finding many simple patterns that appear across a larger class of images.

Transfer learning means that you train a CNN on one dataset, cut off the last layer, and retrain the last layer of the model on a different dataset. Intuitively, you are retraining the model to recognize different higher-level features. As a result, training time drops considerably, so transfer learning is a useful tool when you do not have enough data or when training would take too many resources.
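The "cut off and retrain the last layer" step can be sketched in NumPy without a real CNN. In the sketch below, `W_frozen` is a random stand-in for the early layers of a network trained on another dataset (in practice these weights would be loaded from that trained model), the new dataset is synthetic, and the new last layer is a logistic-regression head trained by gradient descent. All names and values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pre-trained early layers; random here, loaded from a model in practice.
W_frozen = rng.normal(size=(8, 16))
W_before = W_frozen.copy()

def features(x):
    """Frozen feature extractor: its weights are never updated below."""
    return np.maximum(0.0, x @ W_frozen)      # ReLU activations

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def head_loss(F, y, w, b):
    """Mean binary cross-entropy of the new last layer."""
    p = sigmoid(F @ w + b)
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

# Small synthetic "new" dataset with binary labels.
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

F = features(X)                 # computed once, since the extractor is frozen
w_head = np.zeros(16)           # new last layer, trained from scratch
b_head = 0.0
loss_before = head_loss(F, y, w_head, b_head)

for _ in range(500):
    p = sigmoid(F @ w_head + b_head)
    w_head -= 0.02 * (F.T @ (p - y) / len(y))   # update only the head
    b_head -= 0.02 * np.mean(p - y)

loss_after = head_loss(F, y, w_head, b_head)
print("loss before:", loss_before, "loss after:", loss_after)
```

Only `w_head` and `b_head` change during training, which is why the procedure is cheap: the frozen features are computed once and the optimization problem is the size of a single layer.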

This article gives only a general overview of these methods. I recommend reading the following articles for more detailed explanations:

Andrew Beam's "Deep Learning 101"

Andrey Kurenkov's "A Brief History of Neural Nets and Deep Learning"

Adit Deshpande's "A Beginner's Guide to Understanding Convolutional Neural Networks"

Chris Olah's "Understanding LSTM Networks"

Algobeans' "Artificial Neural Networks"

Andrej Karpathy's "The Unreasonable Effectiveness of Recurrent Neural Networks"

Deep learning is strongly focused on technique. There are not many concrete explanations for each of the new ideas; most of them come with experimental results attached to show that they work. Deep learning is like playing with Lego. Mastering Lego is as challenging as any other art, but getting started is comparatively easy.

Transferred from: https://www.cnblogs.com/DicksonJYL/p/9591732.html
