Table of contents:
Introduction
Motivation
Prerequisites
Data collection
Understanding the data
Data cleaning
Loading the training set
Data preprocessing - Images
Data preprocessing - Captions
Prepare data using generator functions
Word embedding
Model architecture
Inference
Evaluation
Conclusions and future work
References
1. Introduction
What do you see in the picture below?
Can you write a caption for it?
Some of you may say "a white dog lying on the grass", some may say "there is a white dog with brown spots", and others may say "a dog on the grass with some pink flowers".
All of these captions are relevant to this image, and there may well be others. The point is that it is easy for us humans to glance at a picture and describe it in appropriate language. Even a five-year-old can do this with ease.
But can you write a computer program that takes an image as input and produces a relevant caption as output?
Simple architecture
Before the development of deep neural networks, this problem was inconceivable even to the most advanced researchers in computer vision. But with the advent of deep learning, this problem can be solved fairly easily, provided you have the required dataset.
Andrej Karpathy, who studied this problem in depth in his PhD thesis at Stanford University, is now the Director of AI at Tesla.
The purpose of this article is to explain (as simply as possible) how deep learning can be used to solve the problem of generating a caption for a given image, hence the name Image Captioning.
To get a better feel for this problem, I strongly recommend trying out the system created by Microsoft called Caption Bot. Just go to this link and upload any image you like; the system will generate a caption for it. (https://www.captionbot.ai/)
2. Motivation
We must first understand how important this problem is in real-world scenarios. Let's look at a few applications where a solution to this problem can be very useful.
Self-driving cars - autonomous driving is one of the biggest challenges, and if we can properly caption the scene around the car, it can give a boost to the self-driving system.
Aid to the blind - we can create a product that guides blind people on the road without support from others, by first converting the scene to text and then the text to voice. Both are now well-known applications of deep learning.
CCTV cameras are everywhere today, but if we could also generate relevant captions alongside the video feed, we could raise an alarm as soon as malicious activity is detected. This could help reduce some crimes and/or accidents.
Automatic captioning can help make Google Image Search as good as Google Search: every image could first be converted into a caption, and then a search could be performed based on that caption.
3. Prerequisites
This article assumes familiarity with basic deep learning concepts such as multilayer perceptrons, convolutional neural networks, recurrent neural networks, transfer learning, gradient descent, overfitting, probability, text processing, Python syntax and data structures, the Keras library, and so on.
4. Data collection
There are many open source datasets available for this problem, such as Flickr 8k (including 8k images), Flickr 30k (including 30k images), MS COCO (including 180k images), and so on.
But for the purposes of this case study, I used the Flickr 8k dataset (https://forms.illinois.edu/sec/1713398), because training a model with a large number of images may not be feasible on a PC/laptop that is not very high-end.
The dataset contains 8000 images, each with five captions (as we saw in the introduction, an image can have multiple captions, all of which are relevant at the same time).
These images are split as follows:
Training set - 6000 images
Dev set - 1000 images
Test set - 1000 images
5. Understanding the data
If you download the data from the link above, you will also get some text files related to the images. One of these files is "Flickr8k.token.txt", which contains the name of each image along with its five captions. We can read this file as follows:
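A minimal sketch of reading this file (the directory path "Flickr8k_text/" is an assumption; adjust it to your own setup):

# Read the raw captions file: each line looks like "<image name>.jpg#<i>\t<caption>"
filename = "Flickr8k_text/Flickr8k.token.txt"   # assumed location; adjust to your setup
with open(filename, "r") as f:
    doc = f.read()

print(doc[:300])   # peek at the first few lines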
The text file is as follows:
Therefore, each line contains <image name>#i <caption>, where 0 ≤ i ≤ 4.
That is, the image name, the caption number (0 to 4), and the actual caption.
Now, let's create a dictionary named "descriptions" that contains the name of each image (without the .jpg extension) as a key, and a list of the five captions of the corresponding image as its value.
For example, referring to the screenshot above, the dictionary will look like this:
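A minimal sketch of how such a dictionary can be built from the file contents read above (the parsing details are assumptions and may differ slightly from the original notebook):

descriptions = {}
for line in doc.split("\n"):
    tokens = line.split()
    if len(tokens) < 2:
        continue                                   # skip empty or malformed lines
    image_id, caption_words = tokens[0], tokens[1:]
    image_id = image_id.split(".")[0]              # "example_image.jpg#0" -> "example_image"
    caption = " ".join(caption_words)
    descriptions.setdefault(image_id, []).append(caption)

# descriptions['<image name>'] is now the list of that image's five captions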
6. Data cleaning
When we deal with text, we usually perform some basic cleaning, such as lowercasing all words (otherwise "hello" and "Hello" would be treated as two separate words), removing special tokens (such as '%', '$', '#', etc.), and removing words that contain numbers (such as 'hey199', etc.).
The following code performs these basic cleaning steps:
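A minimal sketch of these cleaning steps, applied in place to the "descriptions" dictionary (the helper name clean_descriptions is illustrative):

import string

table = str.maketrans("", "", string.punctuation)

def clean_descriptions(descriptions):
    """Lowercase, strip punctuation and drop tokens containing digits, in place."""
    for caption_list in descriptions.values():
        for i, caption in enumerate(caption_list):
            words = caption.split()
            words = [w.lower() for w in words]            # "Hello" -> "hello"
            words = [w.translate(table) for w in words]   # remove '%', '$', '#', ...
            words = [w for w in words if w.isalpha()]     # drop 'hey199' and the like
            caption_list[i] = " ".join(words)

clean_descriptions(descriptions)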
Next, we create a vocabulary of all the unique words present across all 8000*5 (i.e. 40000) image captions (the corpus):
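A short sketch of building that vocabulary from the cleaned captions:

# Collect every unique word appearing across all 40,000 cleaned captions
vocabulary = set()
for caption_list in descriptions.values():
    for caption in caption_list:
        vocabulary.update(caption.split())

print("Original vocabulary size:", len(vocabulary))   # 8763 for this dataset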
This means we have 8763 unique words across all 40000 image captions. We write all these captions along with their image names to a new file, "descriptions.txt", and save it to disk.
However, if we think about it, many of these words occur only a handful of times, say 1, 2 or 3 times. Since we are building a predictive model, we would not want all of these words in our vocabulary; we prefer the words that occur more frequently. This helps the model become more robust to outliers and make fewer mistakes.
Therefore, we only consider the words that occur at least 10 times in the entire corpus. The code for this is as follows:
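A sketch of the frequency-threshold step (variable names are illustrative):

# Keep only the words that occur at least 10 times in the whole corpus
word_count_threshold = 10
word_counts = {}
for caption_list in descriptions.values():
    for caption in caption_list:
        for w in caption.split():
            word_counts[w] = word_counts.get(w, 0) + 1

vocab = [w for w in word_counts if word_counts[w] >= word_count_threshold]
print("Vocabulary size after thresholding:", len(vocab))   # 1651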
So now we have only 1651 unique words in our vocabulary.
7. Loading the training set
The text file "Flickr_8k.trainImages.txt" contains the names of the images that belong to the training set, so we load these names into a list called "train".
We now have the names of the 6000 training images stored in the list "train".
Next, we load the descriptions of these training images from "descriptions.txt" (saved to disk earlier) into the Python dictionary "train_descriptions".
However, when we load them, we add two tokens to every caption, as shown below (the reason is explained later; see the sketch after this list):
'startseq' -> a start-of-sequence token that is added at the beginning of every caption.
'endseq' -> an end-of-sequence token that is added at the end of every caption.
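A minimal sketch of loading the training image names and wrapping each of their captions with these two tokens (the file paths, and the assumption that "descriptions.txt" stores one "image_id caption" pair per line, follow the description above):

# Names of the 6000 training images (without the .jpg extension)
with open("Flickr8k_text/Flickr_8k.trainImages.txt", "r") as f:
    train = [line.split(".")[0] for line in f.read().split("\n") if line]
train_set = set(train)                       # for fast membership tests

# Load the cleaned captions and wrap each one with the start/end tokens
train_descriptions = {}
with open("descriptions.txt", "r") as f:
    for line in f.read().split("\n"):
        tokens = line.split()
        if len(tokens) < 2:
            continue
        image_id, caption = tokens[0], " ".join(tokens[1:])
        if image_id in train_set:
            wrapped = "startseq " + caption + " endseq"
            train_descriptions.setdefault(image_id, []).append(wrapped)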
8. Data preprocessing - Images
Images are simply the input (X) to our model. As you probably already know, any input to a model must be given in the form of a vector.
We need to convert every image into a fixed-size vector that can then be fed as input to the neural network. For this, we opt for transfer learning using the InceptionV3 model (a convolutional neural network) created by Google Research.
This model was trained on the ImageNet dataset to classify images into 1000 different classes. However, our purpose here is not to classify images but to obtain a fixed-length informative vector for each image. This process is called automatic feature engineering.
We simply remove the last softmax layer from the model and extract a 2048-length vector (bottleneck features) for every image, as shown below:
Feature vector extraction (feature engineering)
The code is as follows:
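A sketch of this step with Keras (assuming tf.keras; the variable name model_new is illustrative):

from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.models import Model

# InceptionV3 pre-trained on ImageNet
base_model = InceptionV3(weights="imagenet")

# Drop the final 1000-way softmax: the new output is the 2048-d layer just before it
model_new = Model(inputs=base_model.input, outputs=base_model.layers[-2].output)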
Now we pass every image through this model to get its corresponding 2048-length feature vector, as follows:
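A sketch of encoding the training images and saving the resulting dictionary to disk, as described next (the image directory path and the helper name encode are assumptions):

import pickle
import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.applications.inception_v3 import preprocess_input

def encode(image_path):
    """Return the 2048-length bottleneck feature vector for one image."""
    img = load_img(image_path, target_size=(299, 299))   # InceptionV3 expects 299x299 inputs
    x = img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)                               # scale pixels the way InceptionV3 expects
    feature = model_new.predict(x, verbose=0)
    return feature.reshape(2048)

# Encode every training image and pickle the resulting dictionary
encoding_train = {}
for name in train:
    encoding_train[name] = encode("Flickr8k_Dataset/" + name + ".jpg")   # assumed image folder

with open("encoded_train_images.pkl", "wb") as f:
    pickle.dump(encoding_train, f)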
We save all the bottleneck training features in a Python dictionary and pickle it to disk as "encoded_train_images.pkl", whose keys are the image names and whose values are the corresponding 2048-length feature vectors.
Note: if you do not have a high-end PC / laptop, this process may take an hour or two.
Similarly, we encode all the test images and save them in the file "encoded_test_images.pkl".
9. Data preprocessing - Captions
We must note that the captions are what we want to predict. So during training, the captions will be the target variable (Y) that the model learns to predict.
But we do not predict the entire caption at once; we predict it word by word. Therefore, we need to encode each word as a fixed-size vector. This will become clearer when we look at the model design; for now, we create two Python dictionaries, "wordtoix" (read as "word to index") and "ixtoword" (read as "index to word").
In a nutshell, we represent every unique word in the vocabulary by an integer index. As we saw above, we have 1651 unique words in the corpus, plus one index reserved for zero padding, which gives a vocabulary size of 1652; each word is represented by an integer index between 1 and 1651.
These two Python dictionaries can be used as follows:
wordtoix['abc'] -> returns the index of the word 'abc'
ixtoword[k] -> returns the word whose index is k
The code used is as follows:
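A minimal sketch of building the two dictionaries from the thresholded vocabulary:

# Map each vocabulary word to an integer index, and back
wordtoix = {}
ixtoword = {}

ix = 1                                   # index 0 is reserved for zero padding
for w in vocab:
    wordtoix[w] = ix
    ixtoword[ix] = w
    ix += 1

vocab_size = len(ixtoword) + 1           # 1651 words + 1 padding index = 1652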
There is one more parameter we need to compute: the maximum length of a caption, which we do as follows:
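A short sketch of this computation over the training captions (which already include the startseq/endseq tokens):

# Maximum caption length (in words), computed over the training captions
all_train_captions = [c for caps in train_descriptions.values() for c in caps]
max_length = max(len(c.split()) for c in all_train_captions)
print("Maximum caption length:", max_length)   # 34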
So the maximum length of any caption is 34.
10. Prepare data using generator functions
This is one of the most important steps in this case study. Here, we will learn how to prepare the data in a way that is convenient to feed to the deep learning model.
From now on, I will try to explain the remaining steps using the following example.
Consider that we have three images and their corresponding captions as follows:
Caption_1 -> the black cat sat on the grass
Caption_2 -> the white cat is walking on the road
Caption_3 -> the black cat is walking on the grass
Now, suppose we use the first two images and their captions to train the model, and the third image to test it.
The questions we have to answer now are: how do we frame this as a supervised learning problem? What does the data matrix look like? How many data points do we have? And so on.
First, we need to convert the two images into their corresponding 2048-length feature vectors, as discussed above. Let "Image_1" and "Image_2" be the feature vectors of the first two images, respectively.
Second, let's build a vocabulary for the first two (training) captions after adding the two tokens "startseq" and "endseq" to each of them (assume we have already performed the basic cleaning steps):
Caption_1 -> "startseq the black cat sat on the grass endseq"
Caption_2 -> "startseq the white cat is walking on the road endseq"
Vocab = {black, cat, endseq, grass, is, on, road, sat, startseq, the, walking, white}
Let's assign an index to each word in the vocabulary:
black-1, cat-2, endseq-3, grass-4, is-5, on-6, road-7, sat-8, startseq-9, the-10, walking-11, white-12
Now let's try to frame this as a supervised learning problem, where we have a set of data points D = {Xi, Yi}, where Xi is the feature vector of data point i and Yi is the corresponding target variable.
Let's take the first image vector Image_1 and its corresponding caption "startseq the black cat sat on the grass endseq". Recall that the image vector is the input and the caption is what we need to predict. But the way we predict the caption is as follows:
First, we provide the image vector and the first word as input and try to predict the second word, i.e.:
Input = Image_1 + 'startseq'; Output = 'the'
Then we provide the image vector and the first two words as input and try to predict the third word, i.e.:
Input = Image_1 + 'startseq the'; Output = 'black'
And so on...
Therefore, we can summarize the data matrix for one image and its corresponding caption as follows:
Data points corresponding to one image and its caption
It must be noted that one image + caption is not a single data point but multiple data points, depending on the length of the caption.
Similarly, if we consider both images and their captions, the data matrix will look like this:
Data matrix for both images and their captions
We must now understand that in every data point, it is not just the image that goes as input to the system, but also a partial caption, which helps predict the next word in the sequence.
Since we are processing sequences, we will use a recurrent neural network to read these partial captions (more on this later).
However, as discussed, we will not pass the actual English text of a caption; instead, we pass a sequence of indices, where each index represents a unique word.
Now that we have created an index for every word, let's replace the words with their indices and see what the data matrix looks like:
Data matrix after replacing the words with their indices
Since we will process the data in batches (explained later), we need to make sure that every sequence has the same length. Therefore, we append zeros to the end of each sequence. But how many zeros should we append to each sequence?
This is why we computed the maximum caption length, which is 34 (if you remember). So we append as many zeros as needed to bring every sequence to a length of 34.
The data matrix will then be as follows:
Zeros appended to each sequence so that they are all of length 34
The need for a data generator:
I hope this gives you a good sense of how we prepare the dataset for this problem. However, there is a big catch: in the example above, I considered only 2 images and their captions, which led to 15 data points.
In our actual training dataset, however, we have 6000 images, each with 5 captions. That makes 30000 image-caption pairs, and even if we assume that each caption is on average only 5 words long, it leads to a total of 30000 * 5, i.e. 150000 data points.
Let's do some more calculations:
What is the length of each data point?
Length of a data point = length of the image vector + length of the partial caption.
Length of the image feature vector = 2048 (discussed above).
But what is the length of a partial caption?
Well, you might think it's 34, but that's wrong.
Each word (or index) will be mapped (embedded) into a higher-dimensional space using a word embedding technique.
During the model-building stage, we will see that each word/index is mapped to a 200-length vector using a pre-trained GloVe word embedding model.
Each sequence contains 34 indices, and each index (after embedding) is a vector of length 200. Therefore, the length of one data point is:
2048 + (34 * 200) = 8848.
At a minimum, we will have at least 150000 data points. Therefore, the size of the data matrix is:
150000 * 8848 = 1,327,200,000 values.
Now, even if we assume that each value occupies just 2 bytes, storing this data matrix would require close to 3 GB of main memory. (Recall that we assumed the average caption length is 5 words; it could well be more.)
This is a very large requirement, and even if we managed to load this much data into RAM, it would make the system very slow.
For this reason, data generators are used heavily in deep learning. A data generator is natively supported in Python in the form of generator functions; the ImageDataGenerator class provided by the Keras API is simply an implementation of a Python generator.
So how does using a generator function solve this problem?
If you know the basics of deep learning, you know that to train a model on a particular dataset we use some variant of stochastic gradient descent (SGD), such as Adam, RMSprop, Adagrad, and so on.
With SGD, we do not compute the loss over the entire dataset to update the gradients. Instead, in every iteration we compute the loss over a batch of data points (typically 64, 128, 256, etc.) to update the gradients.
This means that we do not need to keep the entire dataset in memory at once; having just the current batch of points in memory is sufficient.
Generator functions in Python are used exactly for this purpose: a generator behaves like an iterator that resumes from where it left off on its last call.
The code for the data generator is as follows:
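A sketch of such a generator, assuming tf.keras utilities. It yields batches of ([image_features, padded_partial_captions], one-hot next words), grouping the data points of num_photos_per_batch photos into one batch; padding at the end of each sequence follows the description above.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def data_generator(descriptions, photos, wordtoix, max_length, num_photos_per_batch):
    """Yield batches of ([image_features, partial_captions], next_word_one_hot) forever."""
    X1, X2, y = [], [], []
    n = 0
    while True:
        for key, caption_list in descriptions.items():
            n += 1
            photo = photos[key]                                  # 2048-length feature vector
            for caption in caption_list:
                seq = [wordtoix[w] for w in caption.split() if w in wordtoix]
                # One data point per prefix of the caption: predict word i from words before it
                for i in range(1, len(seq)):
                    in_seq, out_word = seq[:i], seq[i]
                    in_seq = pad_sequences([in_seq], maxlen=max_length,
                                           padding="post")[0]    # append zeros up to length 34
                    out_word = to_categorical([out_word], num_classes=vocab_size)[0]
                    X1.append(photo)
                    X2.append(in_seq)
                    y.append(out_word)
            if n == num_photos_per_batch:
                yield ([np.array(X1), np.array(X2)], np.array(y))
                X1, X2, y = [], [], []
                n = 0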
11. Word embedding
As mentioned above, we map every word (index) to a 200-length vector, and for this we use a pre-trained GloVe model:
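A sketch of loading the 200-d GloVe vectors and building the embedding matrix (the file name glove.6B.200d.txt refers to the standard Stanford GloVe release; the path is an assumption):

import numpy as np

# Load the 200-d GloVe vectors
embeddings_index = {}
with open("glove.6B.200d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

embedding_dim = 200
embedding_matrix = np.zeros((vocab_size, embedding_dim))   # row 0 stays all-zero (padding)
for word, i in wordtoix.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector                        # words missing from GloVe keep a zero row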
Now, for all 1652 indices in our vocabulary (1651 words plus the padding index), we have created an embedding matrix that will be loaded into the model before training.
12. Model architecture
Since the input consists of two parts, an image vector and a partial caption, we cannot use the Sequential API provided by the Keras library. For this reason, we use the Functional API, which allows us to create merge models.
First, let's look at the high-level architecture with its main sub-modules:
High-level architecture
We define the model as follows:
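A sketch of a merge model of this kind using the Keras Functional API; the exact layer sizes (256 LSTM/Dense units, 0.5 dropout) are typical choices and may differ from the original notebook:

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

# Image feature branch: 2048-d bottleneck vector -> 256-d representation
inputs1 = Input(shape=(2048,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation="relu")(fe1)

# Partial-caption branch: word indices -> 200-d embeddings -> LSTM
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 200, mask_zero=True, name="word_embedding")(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

# Merge the two branches and predict a probability distribution over the vocabulary
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation="relu")(decoder1)
outputs = Dense(vocab_size, activation="softmax")(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.summary()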
Let's take a look at the model summary:
Summary of parameters in the model
The following figure helps to visualize the network structure and better understand the two input streams:
Annotated architecture diagram
The black text on the right consists of annotations provided to help you map your understanding of the data preparation onto the model architecture.
The LSTM (Long Short-Term Memory) layer is simply a specialized recurrent neural network that processes sequence input (the partial captions, in our case).
If you have followed the previous sections, I think reading these annotations will help you understand the model architecture in a straightforward way.
Recall that we created an embedding matrix from the pre-trained GloVe model; we need to include it in the model before we start training:
Note that since we are using a pre-trained embedding layer, we need to freeze this layer (trainable = False) before training the model, so that it does not get updated during backpropagation.
Finally, we compile the model using the Adam optimizer:
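A sketch of these three steps, reusing the embedding_matrix built earlier and the layer name "word_embedding" given in the model sketch above:

# Load the GloVe weights into the embedding layer and freeze it
model.get_layer("word_embedding").set_weights([embedding_matrix])
model.get_layer("word_embedding").trainable = False

# Compile with the Adam optimizer and categorical cross-entropy
# (the targets produced by the generator are one-hot vectors)
model.compile(loss="categorical_crossentropy", optimizer="adam")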
Hyperparameters during training:
The model was trained for 30 epochs, with an initial learning rate of 0.001 and 3 pictures per batch (batch size). After 20 epochs, the learning rate was reduced to 0.0001 and the model was trained on 6 pictures per batch.
This generally makes sense: in the later stages of training, the model is closer to convergence, so we must lower the learning rate to take smaller steps towards the minimum. Increasing the batch size over time also makes each gradient update more powerful (less noisy).
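A sketch of this training schedule driven by the data generator defined earlier (assuming a recent tf.keras where model.fit accepts Python generators; the 20 + 10 epoch split and the steps_per_epoch computation are illustrative):

from tensorflow.keras.optimizers import Adam

# Phase 1: 20 epochs, learning rate 0.001 (the Adam default), 3 pictures per batch
number_pics_per_batch = 3
steps = len(train_descriptions) // number_pics_per_batch
for _ in range(20):
    generator = data_generator(train_descriptions, encoding_train, wordtoix,
                               max_length, number_pics_per_batch)
    model.fit(generator, epochs=1, steps_per_epoch=steps, verbose=1)

# Phase 2: 10 more epochs, learning rate 0.0001, 6 pictures per batch
# (recompiling is a simple way to swap in an optimizer with a smaller learning rate)
model.compile(loss="categorical_crossentropy", optimizer=Adam(learning_rate=0.0001))
number_pics_per_batch = 6
steps = len(train_descriptions) // number_pics_per_batch
for _ in range(10):
    generator = data_generator(train_descriptions, encoding_train, wordtoix,
                               max_length, number_pics_per_batch)
    model.fit(generator, epochs=1, steps_per_epoch=steps, verbose=1)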
Time taken: I used a GPU + Gradient notebook on www.paperspace.com, so it took me about an hour to train the model. However, if you train it on a PC without a GPU, it could take anywhere from 8 to 16 hours, depending on your system configuration.
13. Inference
So far we have seen how to prepare the data and build the model. In the final step of this series, we will learn how to test (run inference with) the model by passing in new images, i.e. how to generate a caption for a new test image.
Recall that in the example of how to prepare the data, we used only the first two images and their captions. Now let's use the third image and try to understand how we would like the caption to be generated.
The third image vector and caption are as follows:
Image_3 -> the black cat is walking on the grass
The vocabulary in this example is still:
Vocab = {black, cat, endseq, grass, is, on, road, sat, startseq, the, walking, white}, with the following indices:
black-1, cat-2, endseq-3, grass-4, is-5, on-6, road-7, sat-8, startseq-9, the-10, walking-11, white-12
We will generate the caption iteratively, one word at a time, as follows:
Iteration 1:
We provide the image vector Image_3 and 'startseq' as the partial caption to the model. (You should now understand the importance of 'startseq': it serves as the initial partial caption for any image during inference.)
We now expect our model to predict the first word, "the".
But wait, the model actually generates a 12-length vector (in this toy example; 1652-length in the real case), which is a probability distribution over all the words in the vocabulary. We greedily select the word with the maximum probability, given the feature vector and the partial caption.
If the model is well trained, we should expect the word "the" to have the highest probability:
Inference step 1
This is called maximum likelihood estimation (MLE): we select the word that is most likely according to the model for the given input. This method is sometimes also called greedy search, because we greedily select the word with the highest probability.
Iteration 2:
This time, let's assume the model predicted "the" in the previous iteration. So now the input to the model is the image vector Image_3 and the partial caption "startseq the". We now expect the model to assign the highest probability to the word "black", given the image feature vector and the partial caption.
Inference step 2
In this way, we continue iterating to generate the next word in the sequence. But the big question is: when do we stop?
We stop when either of the following two conditions is met:
We encounter 'endseq', which means the model thinks this is the end of the caption. (You should now understand the importance of the 'endseq' token.)
We reach a maximum threshold on the number of words the model is allowed to generate.
If either of the above conditions is met, we break out of the loop and report the generated caption as the model's output for the given image. The inference code is as follows:
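A sketch of this greedy-search loop (the function name greedy_search is illustrative; it reuses the model, dictionaries, and max_length defined earlier):

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def greedy_search(photo_feature):
    """Generate a caption for one encoded (2048-d) image, one word at a time."""
    in_text = "startseq"
    for _ in range(max_length):                           # threshold on the number of words
        seq = [wordtoix[w] for w in in_text.split() if w in wordtoix]
        seq = pad_sequences([seq], maxlen=max_length, padding="post")
        yhat = model.predict([photo_feature.reshape(1, 2048), seq], verbose=0)
        word = ixtoword[int(np.argmax(yhat))]             # greedily take the most probable word
        in_text += " " + word
        if word == "endseq":                              # the model signals the end of the caption
            break
    words = in_text.split()
    # Strip the startseq/endseq tokens before reporting the caption
    if words[-1] == "endseq":
        words = words[:-1]
    return " ".join(words[1:])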
14. Evaluation
To understand how good the model is, let's generate captions for images from the test dataset (i.e. images the model did not see during training).
Output-1
Note: we must appreciate how accurately the model identifies the colors.
Output-2
Output-3
Output-4
Output-5
Of course, I would be lying to you if I only showed you the appropriate captions. No model in the world is perfect, and this model also makes mistakes. Let's look at some examples where the captions are not very relevant, and sometimes even irrelevant.
Output-6
Perhaps the color of the shirt got mixed up with the color of the background.
Output-7
Why does the model classify the famous Rafael Nadal as a woman? Perhaps because of the long hair.
Output-8
The grammar produced by the model is incorrect this time.
Output-9
Clearly, the model tried to understand the scenario, but the caption is still not very good.
Output-10
In this last example, the model fails completely and the caption is irrelevant.
All in all, I must say that my naive first-cut model, without any rigorous hyperparameter tuning, does a decent job of generating captions for images.
A very important point:
We must understand that the images used for testing must be semantically related to the images used for training the model. For example, if we train the model on images of cats, dogs, etc., we must not test it on images of airplanes, waterfalls, etc. This would be a case where the distributions of the train and test sets are very different, and in such cases no machine learning model in the world can deliver good performance.
15. Conclusions and future work
Please refer to my GitHub link to access the complete code written in Jupyter Notebook.
(https://github.com/hlamba28/Automatic-Image-Captioning.git)
Note that due to the stochastic nature of the model, the captions you generate (if you try to replicate the code) may not be exactly the same as those generated in my case.
Of course, this is only a first-cut solution, and a lot of modifications can be made to improve it, such as:
Use a larger dataset.
Change the model architecture, such as including an attention module.
Do more hyperparameter tuning (learning rate, batch size, number of layers, number of units, dropout rate, etc.).
Use cross-validation sets to understand overfitting.
Use Beam Search instead of Greedy Search during reasoning.
Use BLEU scores to evaluate and measure the performance of the model.
Write the code in a proper object-oriented way so that it is easier for others to replicate. :-)
16. References
https://cs.stanford.edu/people/karpathy/cvpr2015.pdf
https://arxiv.org/abs/1411.4555
https://arxiv.org/abs/1703.09137
https://arxiv.org/abs/1708.02043
https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/
https://www.youtube.com/watch?v=yk6XDFm3J2c
https://www.appliedaicourse.com/