This article focuses on methods for OCR performance optimization. The approach introduced here is simple and practical, so let's walk through it step by step.
Abstract: To optimize the performance of OCR, you must first understand the structure of the OCR network being optimized. Starting from motivation, this article derives step by step how an OCR network based on the Seq2Seq structure is built.
OCR refers to recognizing the printed text in an image. Recently we have been working on performance optimization of an OCR model, rewriting a TensorFlow-based OCR network in CUDA C and ultimately achieving a 5x performance improvement. Through this optimization work I gained a deeper understanding of the typical structure of OCR networks and the related optimization methods, and I plan to record it here in a series of posts, which also serve as a summary and study notes of my recent work.
To optimize the performance of OCR, you must first understand the structure of the OCR network to be optimized. In this article I will try to derive, from the perspective of motivation, how an OCR network based on the Seq2Seq structure is built step by step.
To understand this article, you only need to know the rule for matrix dimensions in matrix multiplication: an m x n matrix multiplied by an n x p matrix yields an m x p matrix. It helps, but is not required, to know the structure of CNN and RNN networks and to have some familiarity with how machine learning models are typically constructed.
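As a quick refresher, here is a minimal NumPy sketch of that dimension rule (the shapes are arbitrary examples chosen just for illustration):

    import numpy as np

    A = np.random.rand(2, 3)   # an m x n matrix (m=2, n=3)
    B = np.random.rand(3, 4)   # an n x p matrix (n=3, p=4)
    C = A @ B                  # result is an m x p matrix
    print(C.shape)             # (2, 4)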
First of all, the overall structure of the OCR BILSTM network to be analyzed in this paper is shown in the following figure:
Next, I will gradually explain the motivation and function of each structure from the upper right corner of the picture (the output of the model) to the lower left (the input of the model).
1. Construct the simplest OCR network
First, consider the OCR recognition scenario in the simplest case: assume the input is a single text image containing one character, and the image is 32 pixels high and 32 pixels wide, i.e. a 32 x 32 matrix. For ease of handling we can flatten it into a 1 x 1024 matrix. On the output side, because text is discrete, we can simply number all characters and output the number of the recognized character, so the output is a 1 x 1 matrix whose element is the number of the recognized character.
How do we get this 1 x 1 matrix? Following the usual probabilistic approach, assume there are 10000 characters in the world, numbered 1 to 10000; each of these 10000 candidates has some probability of being our output. So if we first compute, for each of the 10000 characters, the probability that it is the recognition result of the input image, we can then pick the one with the highest probability as the output. The problem thus becomes: how do we get a 1 x 10000 matrix (Y) from a 1 x 1024 matrix (X)? Here we can apply the most common assumption in machine learning model construction, the linear hypothesis: assume Y and X are linearly related, which gives the simplest and most classical linear model, Y = XA + B, where X (dimension 1 x 1024) is the input and Y (dimension 1 x 10000) is the output. Both A and B are parameters of the model; from the matrix multiplication rule, A must be 1024 x 10000 and B must be 1 x 10000.

So far only X is known, and to compute Y we need the actual values of A and B. In the usual machine learning routine, the parameters A and B are initialized randomly, and then, by feeding in a large number of examples of X together with their ground-truth Y, the machine gradually adjusts A and B toward their optimal values. This process is called model training, and the data fed in is called training data. After training, you multiply your new input X by the learned A, add the learned B to obtain the corresponding Y, and use the argmax operation to pick the index of the largest of the 10000 entries of Y; that index is the number of the recognized character.
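To make this concrete, here is a minimal NumPy sketch of that linear model. The weight values below are random placeholders rather than trained parameters, and the 10000-entry character table is the assumption made above:

    import numpy as np

    vocab_size = 10000                    # assumed size of the character table
    X = np.random.rand(1, 1024)           # flattened 32 x 32 text image
    A = np.random.rand(1024, vocab_size)  # weight matrix, learned during training
    B = np.random.rand(1, vocab_size)     # bias, learned during training

    Y = X @ A + B                # 1 x 10000 score for every candidate character
    char_id = int(np.argmax(Y))  # index of the most probable character
    print(char_id)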
Now, looking back at the upper-right corner of figure 1, you should be able to understand the meaning of the two yellow matrices, 384 x 10000 and 1 x 10000. The main difference between the figure and the example described in the previous paragraph is that the example above takes a single 1 x 1024 picture as input, whereas in the figure the input is 27 pictures, each represented by a 1 x 384 feature. So far, you have learned how to construct a simple OCR network. Next we begin to optimize this simple network.
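Here is the same classification step with the figure's shapes, where the features of 27 characters are scored in one batched multiplication (again just a sketch with random placeholder values, not trained weights):

    import numpy as np

    features = np.random.rand(27, 384)   # 27 characters, one 384-dim feature each
    W = np.random.rand(384, 10000)       # the yellow 384 x 10000 matrix
    b = np.random.rand(1, 10000)

    scores = features @ W + b            # 27 x 10000
    char_ids = scores.argmax(axis=1)     # one character number per picture
    print(char_ids.shape)                # (27,)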
2. Optimization strategy 1: reduce the amount of computation
In the example described above, for every character we recognize we have to perform a matrix multiplication of a 1 x 1024 matrix with a 1024 x 10000 matrix, which is a lot of computation. Is some of it redundant? Anyone familiar with PCA will immediately notice that when the text image is flattened into a 1 x 1024 matrix, the feature space for that character is 1024-dimensional. Even if each dimension could only take the values 0 and 1, this feature space could represent 2^1024 distinct values, vastly more than the 10000 characters in the whole character space we assumed. For this reason, we can use PCA or various other dimensionality reduction operations to shrink this input feature vector to far fewer dimensions, for example the 128 dimensions shown in the figure.
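One way to picture this dimensionality reduction is an explicit PCA fitted on a corpus of flattened text images. This is only an illustrative sketch using scikit-learn's PCA and random stand-in data; in the actual network the reduction would normally be learned as part of the model rather than done as a separate PCA step:

    import numpy as np
    from sklearn.decomposition import PCA

    images = np.random.rand(5000, 1024)   # stand-in for flattened 32 x 32 training images
    pca = PCA(n_components=128)           # keep 128 of the 1024 dimensions
    pca.fit(images)

    x = np.random.rand(1, 1024)           # one new flattened image
    x_reduced = pca.transform(x)          # 1 x 128 feature vector
    print(x_reduced.shape)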
3. Optimization strategy 2: consider the correlation between characters
(Reminder: to reflect the batch-size dimension, the figure above is drawn for 27 text pictures. The discussion below focuses on a single text picture, so wherever a dimension of 1 appears below, it corresponds to 27 in the figure.)
As you may have noticed, the "position image feature" that is multiplied by the yellow 384 x 10000 matrix does not have dimension 1 x 384 directly, but 1 x (128+128+128). There is in fact an optimization implied here, based on the hypothesis that adjacent characters are correlated. A simple example: if the previous character is "you" (as in the greeting "hello"), the next character is very likely to be "good". This statistical regularity in the order of text should be exploited to improve the recognition accuracy of text images. So how is this connection implemented?
In the figure, we can see a 10000 x 128 parameter matrix on the left. It is easy to see that this parameter acts like a database: it stores the processed features of all 10000 character images (the "processing" is the dimensionality reduction mentioned above; the raw features would be 10000 x 1024). According to the structure in the figure, we feed in the recognition result of the character preceding the one currently being recognized (recognition proceeds character by character), select the corresponding 1 x 128 feature for that previous character, and after some processing and transformation it serves as the first of the three parts of the 1 x 384 input.
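A sketch of that lookup: the 10000 x 128 matrix behaves like an embedding table, and the row belonging to the previously recognized character becomes a 1 x 128 feature. The values and the character number below are placeholders; in the trained network this table is a learned parameter:

    import numpy as np

    embedding = np.random.rand(10000, 128)   # learned 128-dim feature for each of the 10000 characters
    prev_char_id = 42                        # example: number output for the previously recognized character

    prev_feat = embedding[prev_char_id].reshape(1, 128)  # 1 x 128, first part of the 1 x 384 input
    print(prev_feat.shape)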
By the same token, what do the two 1 x 128 features at the bottom of the 1 x 384 input mean? Within a sentence, the previous character has a strong influence on the next one: even if the character currently being predicted is very blurry in the image, we can guess it from the previous character. Can it also be guessed from the previous k characters, or from the following k characters? Obviously the answer is yes. The two lower 1 x 128 features therefore represent the influence on the current character of reading the characters in the sentence image from front to back (Forward) and from back to front (Backward), and a bidirectional LSTM network is added earlier in the figure to generate these two features.
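Here is a minimal TensorFlow/Keras sketch of such a bidirectional LSTM over the 27 per-character image features, followed by the concatenation that yields the 1 x 384 input. The layer sizes are chosen to match the 128-dimensional features in the figure; this is an illustration under those assumptions, not the article's exact network:

    import numpy as np
    import tensorflow as tf

    seq = np.random.rand(1, 27, 128).astype("float32")     # 27 reduced character features in sentence order

    bilstm = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(128, return_sequences=True))  # 128 units per direction
    out = bilstm(seq).numpy()            # shape (1, 27, 256): forward and backward outputs concatenated

    t = 5                                # position of the character currently being recognized (example)
    fwd_feat = out[:, t, :128]           # 1 x 128 forward-direction feature
    bwd_feat = out[:, t, 128:]           # 1 x 128 backward-direction feature

    prev_feat = np.random.rand(1, 128)   # 1 x 128 previous-character feature from the lookup above
    x = np.concatenate([prev_feat, fwd_feat, bwd_feat], axis=1)   # the 1 x 384 classifier input
    print(x.shape)                       # (1, 384)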
At this point the outline of the improved OCR network is basically complete, but a few details remain to be resolved. As noted above, the 1 x 384 input contains three 1 x 128 features, which respectively represent the influence of the previous character on the current one, the influence of the front-to-back (Forward) ordering of the characters in the whole sentence image on the current character, and the influence of the back-to-front (Backward) ordering of the characters in the whole sentence image on the current character.
But all three have a feature length of 128! A single character gets 128 dimensions, and a whole sentence also gets 128? Sentences in different images may have different lengths, so how can they all be expressed with a feature the length of a single character's?
How do you represent a variable-length sentence with a fixed-length feature? At first glance this seems a thorny problem, but fortunately there is a crude and simple solution: a weighted sum, which is another standard routine in probability and statistics. No matter how many cases there are, the weights over all cases must sum to 1. It may be surprising that "variable" and "fixed", two things seemingly as incompatible as fire and water, can coexist so neatly, but that is the charm of mathematics.
The following figure illustrates how this works with a practical example. When we want to recognize the character "chopsticks" in the text snippet, even though that character is nearly obscured, we can fill in the blank by drawing on everyday experience: we look at the context and focus our attention on "is Chinese" before it and "eat" after it. The weighting coefficients are the mechanism that implements this kind of attention, and the "everyday experience" is what the attention network learns from a large amount of training data. This is where the 32 alpha values in figure 1 come from. In practice the attention network is usually built with a GRU; for reasons of space I will not expand on it here and may elaborate another time. For now you only need to know that there should also be an "attention network" on the right side of figure 1 that outputs the 32 alpha values.
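A small sketch of that weighted sum: whatever the sentence length (32 positions in figure 1), the attention weights alpha are normalized to sum to 1, so the result is always a single fixed-length feature. The features and raw scores below are random placeholders standing in for the real image features and for the GRU-based attention network:

    import numpy as np

    seq_feats = np.random.rand(32, 128)       # one 128-dim feature per position in the sentence image
    scores = np.random.rand(32)               # raw scores from the attention network (a GRU in practice)

    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax: 32 weights that sum to 1
    context = alpha @ seq_feats               # weighted sum -> a single 128-dim feature
    print(alpha.sum(), context.shape)         # 1.0 (128,)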
At this point, I believe you have a deeper understanding of these OCR performance optimization methods; why not try them out in practice?