Data Platform Precise Recommendation Team | OCR Technology: The Data Chapter

2025-01-28 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

Introduction: Successful application of deep learning to OCR requires a great deal of data. The Data Platform Precise Recommendation team uses image augmentation, semantic understanding, generative adversarial networks, and other techniques to generate sufficient high-quality data, providing fuel for the algorithm models and helping OCR services iterate quickly across a variety of business scenarios to improve their results.

I. Background Introduction

If deep learning is an engine, then large amounts of labeled data are its fuel, and the volume and quality of that fuel directly determine the engine's power. As computing power has grown dramatically, deep learning models have moved ever further in the wide & deep direction, and larger, deeper models need more training data. This is evident in the scale of the datasets published by academia and industry in recent years. Taking classic computer vision tasks as an example, as shown in fig.1, the size of public datasets has grown almost exponentially. These ever-larger datasets also carry more and more label information, allowing deep models to be fully trained for a wide range of machine vision, semantic understanding, and behavior prediction tasks.

Fig.1 Public computer vision datasets in recent years [1][2][3][4][5]

1.1 OCR data

As shown in fig.2, the job of OCR is to detect the text regions in an image and recognize their textual content. At present, our OCR algorithms are mainly applied to advertising images, where they not only assist ad review but also extract semantic features from ad material images for more precise recommendation [17]. Beyond advertising, we also serve recognition of content-feed images, game images, and various card images. Compared with object detection and recognition, OCR must handle tilted text boxes, low-resolution text, and diverse text layouts, so OCR data labeling has its own particularities and a higher labeling cost. As a result, it is difficult to obtain samples for labeling through user feedback to support OCR deep-model training. Therefore, apart from the indispensable manual labeling in specific business scenarios, our training samples must be obtained by machine generation.

Fig.2 Character recognition results on an OCR image

II. OCR Data Generation

For technologies based on deep learning, the amount of training data largely determines the final result. The company's business images contain large numbers of Chinese text lines, often mixed with English letters and digits, yet no text detection and recognition dataset of the required scale exists, and the cost of obtaining large amounts of labeled training data is high. Machine generation of data, which is easy to scale and fast, therefore becomes the first choice. In computer vision, machine data generation can be roughly divided into three types: low-level image processing, mid-level image understanding plus hand-crafted rules, and high-level end-to-end image generation. OCR data generation follows these same three types.

2.1 Data Augmentation via Image Processing

Generating training data through image processing has the lowest barrier to entry and is the most widely used approach. The most common operations include dozens of basic transforms such as flipping, translation, rotation, noise, and blur, as shown in fig.3; combining them lets each sample spawn a large number of new samples. In OCR, beyond these basic operations, the attributes of the rendered text and the background images can also be greatly diversified. Our background images come from a variety of business scenarios; hundreds of Chinese and English fonts are used; and for the corpus, building on the existing advertising corpus, we constructed a new corpus of nearly ten million entries. The generated samples are close to real advertising images, and the tens of millions of them give the model strong recognition ability and generalization performance.

Fig.3 Image augmentation samples: noise, rotation, contrast adjustment, blur, etc.
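The combinatorial idea above, each sample spawning many new samples by chaining random basic transforms, can be sketched as follows. This is a minimal illustration using NumPy on a grayscale array, not the team's actual pipeline; the transforms and parameters are assumptions for demonstration.

```python
import numpy as np

def augment(img, rng):
    """Apply a random combination of basic augmentations to a
    grayscale image (2-D uint8 array): flip, rotation, noise."""
    out = img.astype(np.float32)
    if rng.random() < 0.5:                    # random horizontal flip
        out = out[:, ::-1]
    k = rng.integers(0, 4)                    # rotate by 0/90/180/270 degrees
    out = np.rot90(out, k)
    noise = rng.normal(0.0, 10.0, out.shape)  # additive Gaussian noise
    out = np.clip(out + noise, 0, 255)
    return out.astype(np.uint8)

rng = np.random.default_rng(0)
sample = np.full((32, 100), 128, dtype=np.uint8)   # stand-in for a text-line crop
batch = [augment(sample, rng) for _ in range(8)]   # 8 new samples from 1
```

In a real pipeline the same chaining applies to blur, contrast, translation, and the text/font/background attributes described above, multiplying one labeled sample into many.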

2.2 Generation Based on Image Segmentation & Depth of Field

The strategy of writing text directly onto a background image ignores background variation; on complex backgrounds many generated samples look unrealistic, and in some the text content cannot even be discerned. Such samples are likely to harm the model's detection and recognition ability. Inspired by the 2016 paper from the VGG group at the University of Oxford [6], we analyze and understand the background image, choose regions of roughly uniform appearance for writing, and use the image's depth-of-field information to fit the writing plane to the surfaces of objects in the picture, so that the text conforms to the object surface and the visual effect is more realistic.

Fig.4 Background image processing flow and generated samples

(Green boxes mark text-line locations, black text shows the content of each box; image source [6])

In fig.4, the first row shows the background image processing flow, and the second column shows generated sample examples. The image goes through depth-of-field detection, image segmentation, and text-region filtering. When a line of text is written, the writing plane follows the object's surface, simulating a more realistic sample. In fig.5, note that the generated text carries not only a label box for each text line but also a precise box for the location of each character.

Fig.5 OCR-labeled images generated via image segmentation and depth of field

Based on the image segmentation and depth-of-field techniques above, we generate a large number of labeled samples on advertising images for training the text detection model. As shown in fig.6, fine-grained statistics gathered from advertising images, such as the ratio of text size to image size, the tilt angle of text lines, the mapping between text color and background color, and text spacing, make the generated samples closer to real ones. We further increase sample diversity through text transparency and italics, yielding more robust text detection and end-to-end detection-and-recognition ability.

Fig.6 Generated samples on advertising images

For most online images (advertising images, content-feed article images, game images, etc.), the business samples are themselves computer-generated, i.e. born-digital images, so our generated samples can be made very realistic. In some business scenarios, however, the images to be recognized come from real photography and belong to scene text recognition (STR). STR is a classic and popular topic in computer vision, with a long line of research [10][11][12][13] driving its progress. Yet STR has seen no breakthrough in the size of its public datasets, and attempts to generate natural-scene samples with methods similar to those used for web images have not achieved significant results.

2.3 Generative Adversarial Networks (GANs)

In natural scenes, text is not only less salient than in web images but is often accompanied by blur, reflection, and the like, which makes detection and recognition much harder. Take bank-card number recognition as an example: shadow, reflection, viewing angle, and background all complicate recognition, yet none of these is the core problem. The core problem is that bank-card images are personal private data, so obtaining a large number of real bank-card samples is almost impossible. How, then, do we obtain a sample set large and stylistically rich enough for model training? We need to find a breakthrough on the algorithm side.

Fig.7 Generative adversarial network structure

Since Ian Goodfellow proposed Generative Adversarial Networks (GANs) [7] at the end of 2014, a large number of GAN applications have appeared across the industry, including results on data generation [8][9][14][15]. The idea of GANs, shown in fig.7, is that a generator network produces images while a discriminator network predicts whether an input image is real; through alternating adversarial learning between generator G and discriminator D, the generator gradually learns to produce images that pass for real.
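The alternating game just described is the standard minimax objective from [7], where D maximizes its accuracy at telling real samples from generated ones and G minimizes it:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] +
  \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

At the equilibrium of this game, the distribution of G's outputs matches the real data distribution, which is exactly what "fake but passes for real" means.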

(a) Model training process

(b) Choice of generator model structure

Fig.8 pix2pix [14] schematic

Among the many adversarial-learning results, we found that image style transfer based on adversarial learning [14][15] fits our scenario best. As shown in fig.8, in pix2pix [14] the discriminator D learns to distinguish real samples from generated ones, while the generator G learns to produce samples real enough that D cannot tell them apart; the generator architecture may optionally include skip connections. On this basis, we can convert synthetically rendered digits on a plain black-and-white background into the style of bank-card numbers, increasing the training samples. As shown in fig.9, the left side is a real bank-card image and the right side is the corresponding card-number template. We expect the trained adversarial network to turn random card numbers into bank-card-style samples, giving us a large number of labeled bank-card samples for training the character recognition model.

Fig.9 Bank-card sample material (some digits blacked out to protect card-number privacy)

Fig.10 shows the generative adversarial model we use. Unlike typical image generation tasks, the card-number image is elongated (more than ten times as wide as it is tall). To keep the overall style of the generated image consistent, we adjusted the network structure so that its receptive field is large enough to cover most of the image. In addition, dropout is used to inject randomness, so that the same number template can generate hundreds of different styles.

Fig.10 Bank-card sample generation adversarial model
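The "dropout for randomness" point above can be illustrated with a toy sketch: if dropout stays active at inference time, the same input template produces a different output on every call because a fresh dropout mask is sampled each time. This is only a stand-in for a real generator, not the team's model.

```python
import numpy as np

def generate(template, rng, p=0.5):
    """Toy stand-in for a generator that keeps dropout active at
    inference: each call samples a new dropout mask, so the same
    template yields a different output every time."""
    mask = rng.random(template.shape) >= p   # random dropout mask
    return template * mask / (1.0 - p)       # inverted-dropout scaling

template = np.ones((4, 40))                  # stand-in for one card-number template
rng = np.random.default_rng()
a = generate(template, rng)
b = generate(template, rng)
# a and b differ (with overwhelming probability): one template, many styles
```

In a real generator the dropout layers sit inside the network, so the variation shows up as different rendering styles rather than zeroed pixels, but the mechanism is the same.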

The images in fig.11 were generated by the GAN. Although a few flaws remain (bottom image), most of the images look convincingly real. Following the digit-coding specification for bank-card numbers, we can quickly generate hundreds of thousands of digit templates and then convert them into bank-card style via the GAN. Thanks to the randomness of the inference process, we can produce millions of generated samples in a day to feed recognition-model training.

Fig.11 Bank-card number samples generated with the adversarial network
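Generating valid card-number templates at scale can be sketched with the Luhn check-digit scheme, the standard checksum for payment card numbers; the article does not name the exact specification the team used, and the "62" prefix and 16-digit length here are illustrative assumptions.

```python
import random

def luhn_check_digit(partial):
    """Compute the Luhn check digit for a digit string
    (the card number without its final digit)."""
    total = 0
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:       # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10

def random_card_number(prefix="62", length=16, rng=random):
    """Generate a random, Luhn-valid card-number template."""
    body = prefix + "".join(str(rng.randrange(10))
                            for _ in range(length - len(prefix) - 1))
    return body + str(luhn_check_digit(body))

# hundreds of thousands of such templates can be produced in seconds
templates = {random_card_number() for _ in range(1000)}
```

Each template string is then rendered as a plain black-and-white digit image and fed through the GAN to acquire bank-card styling, yielding labeled samples whose ground-truth digits are known by construction.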

2.4 Summary

Across all kinds of tasks we have generated tens of millions of samples, providing ample training data for OCR detection and recognition. Through the data generation techniques above, covering web images, natural-scene images, and specific business scenarios (bank cards, × ×.), the effect of OCR detection and recognition has improved significantly. For web images in particular, because the generated samples are lifelike, numerous, and sufficiently diverse, targeted samples can be generated quickly in response to bad-case feedback from the algorithm, letting OCR capability improve rapidly. On the GAN side, stabilizing the generative model and applying few-shot learning, semi-supervised learning, and related techniques will be the focus of our exploration.

III. Summary

This article has shared the Precise Recommendation team's work on data generation, which rests on three classes of techniques: image processing, image understanding, and generative adversarial networks, used to quickly generate large amounts of labeled data. In addition, manually labeled data is continuously accumulated as real samples; it not only objectively reflects the business scenario but also provides a benchmark for the generated data, i.e. the generation process performs large-scale simulation and generalization around these real data styles. In follow-up work we will focus on achieving continuous automatic data accumulation and automatic model training and updating through servitization and tooling. Beyond OCR, in computer vision and indeed all of machine learning, although "data-driven" has been said countless times, few products or technical services truly unleash data-driven capability. How to give the machine itself the ability to collect, organize, and analyze data, to tune and optimize algorithms autonomously, and to make independent judgments and decisions will be the direction of our exploration.

The Precise Recommendation team of Tencent TEG has accumulated a range of OCR technologies over the years and welcomes communication and cooperation with any business colleagues who have OCR-related needs. In line with TEG's mission of professional cooperation as partners, the team will continue to build industry-grade data, algorithms, and systems to provide quality services for business teams.

For technical & business cooperation, please contact hongfawang@tencent.com. We are also recruiting excellent algorithm engineers and interns on an ongoing basis; feel free to get in touch.

References:

[1] The 2005 PASCAL Visual Object Classes Challenge, M. Everingham, A. Zisserman, C. Williams, L. Van Gool, M. Allan, C. Bishop, O. Chapelle, N. Dalal, T. Deselaers, G. Dorko, S. Duffner, J. Eichhorn, J. Farquhar, M. Fritz, C. Garcia, T. Griffiths, F. Jurie, D. Keysers, M. Koskela, J. Laaksonen, D. Larlus, B. Leibe, H. Meng, H. Ney, B. Schiele, C. Schmid, E. Seemann, J. Shawe-Taylor, A. Storkey, S. Szedmak, B. Triggs, I. Ulusoy, V. Viitaniemi, and J. Zhang. In Selected Proceedings of the First PASCAL Challenges Workshop, LNAI, Springer-Verlag, 2006 (in press).

[2] SUN Database: LargeScale Scene Recognition from Abbey to Zoo. J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. IEEE Conference on Computer Vision and Pattern Recognition, 2010.

[3] The MIR Flickr Retrieval Evaluation. M. J. Huiskes, M. S. Lew, ACM International Conference on Multimedia Information Retrieval (MIR'08), Vancouver, Canada

[4] New Trends and Ideas in Visual Concept Detection. M. J. Huiskes, B. Thomee, M. S. Lew, ACM International Conference on Multimedia Information Retrieval (MIR'10), Philadelphia, USA.

[5] Abu-El-Haija, Sami, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. "Youtube-8m: a large-scale video classification benchmark." ArXiv preprint arXiv:1609.08675, 2016.

[6] Synthetic Data for Text Localisation in Natural Images, Ankush Gupta, Andrea Vedaldi, Andrew Zisserman. CVPR 2016.

[7] Generative Models, Andrej Karpathy, Pieter Abbeel, Greg Brockman, Peter Chen, Vicki Cheung, Rocky Duan, Ian Goodfellow, Durk Kingma, Jonathan Ho, Rein Houthooft, Tim Salimans, John Schulman, Ilya Sutskever, And Wojciech Zaremba, OpenAI, retrieved April 7, 2016.

[8] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Josh Susskind, Wenda Wang, Russ Webb, Learning from Simulated and Unsupervised Images through Adversarial Training, CVPR 2017.

[9] Xinlong Wang, Mingyu You, Chunhua Shen, Adversarial Generation of Training Examples for Vehicle License Plate Recognition. ArXiv 2017.

[10] Xinhao Liu, Takahito Kawanishi, Xiaomeng Wu, Kunio Kashino, Scene text recognition with high performance CNN classifier and efficient word inference. ICASSP 2016.

[11] Fei Yin, Yi-Chao Wu, Xu-Yao Zhang, Cheng-Lin Liu, Scene Text Recognition with Sliding Convolutional Character Models, Arxiv 2017.

[12] Suman K.Ghosh, Ernest Valveny, Andrew D. Bagdanov, Visual attention models for scene text recognition, CVPR 2017.

[13] Baoguang Shi, Xinggang Wang, Pengyuan Lyu, Cong Yao, Xiang Bai, Robust Scene Text Recognition with Automatic Rectification, CVPR 2016.

[14] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros, Image-to-Image Translation with Conditional Adversarial Nets, CVPR 2017.

[15] Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros, Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. Arxiv 2017.

[16] Mehdi Mirza, Simon Osindero, Conditional Generative Adversarial Nets. Arxiv 2014.

[17] Xue Wei, Big Data and Machine Learning in Advertising, Tencent Big Data Technology Summit, 2017.
