Official explanation of Xiaomi's table recognition technology, which supports intelligent extraction of tables from pictures


2026-02-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)11/24 Report--

CTOnews.com, September 3 (Xinhua) -- Lei Jun, founder, chairman and CEO of Xiaomi, said on social media this evening that Xiaomi engineers have developed a table recognition algorithm that can efficiently and accurately convert tables in pictures into editable Excel files, greatly improving the user experience.

At the same time, Xiaomi Technology officially published an article explaining some of the technical principles behind the table recognition algorithm, including the overall framework, the table detection algorithm, the table recognition algorithm, and the alignment algorithm.

The following is the official explanation from Xiaomi Technology:

Table recognition means converting the table structure and text in a picture into a data format that computers can process. It has broad practical value in office, business, education, and other scenarios, and has long been an active topic in document analysis research. To address this problem, we developed a table recognition algorithm that extracts tables from pictures efficiently and accurately and converts them into editable Excel files. The algorithm has already shipped on flagship models such as the Xiaomi 10s series and MIX Fold 2. You can try it from Album - More - table recognition, or through the Scan feature.

▲ Figure 1: Lei Jun introducing Xiaomi's table recognition algorithm at the MIX Fold 2 launch event

1. Background

Most of the documents people handle in daily office work are tables and text documents, and the importance of tables is beyond doubt. In desktop office scenarios across industries, Excel and WPS are the de facto standards for spreadsheets. We often need to import the contents of a table image into Excel.

In the past, the only option was to type the content into Excel by hand while looking at the picture, which was inefficient and error-prone. In recent years, as the technology has matured, OCR (optical character recognition) has become increasingly usable. Users can automatically extract text from pictures with the help of OCR software.

For tables, however, extracting text is not enough: users still have to copy and paste repeatedly by hand to rebuild the spreadsheet, which takes a lot of time. For this reason, we implemented a table image extraction solution that effectively improves users' office efficiency. Figure 2 shows our recognition results:

▲ Figure 2: Table recognition results

2. Overall framework

Figure 3 shows the overall framework of our current algorithm, which consists mainly of a table detection algorithm on the mobile device and a table recognition algorithm on the server.

▲ Figure 3: Technical framework of table recognition

The table detection algorithm accurately extracts the table area from the picture and rectifies it into a flat table image for the next step, table recognition. The table recognition algorithm extracts the table structure and the text content from that image, then combines the two to output an editable Excel table. Both algorithms are described in detail below.

3. Table detection algorithm

Table detection has the following difficulties. On the one hand, compute and memory on a mobile phone are limited. On the other hand, the accuracy requirement for the detection result is very high: other text often surrounds the table, and an inaccurate detection result will hurt the subsequent recognition results. Our table detection algorithm detects the table area and the four corners of the table at the same time. Through a perspective transformation and our self-developed anti-distortion algorithm, we obtain a flat image containing only the table area. The effect is shown in Figure 4.

▲ Figure 4: Table detection results
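The perspective step can be illustrated with a minimal sketch. Given four detected corners, a 3x3 homography maps them onto an upright rectangle. The function names, corner coordinates, and output size below are illustrative assumptions, not Xiaomi's implementation, and the self-developed anti-distortion step is not covered.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate corner configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for k in range(col, n + 1):
                    m[r][k] -= f * m[col][k]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src, dst):
    """3x3 matrix mapping four src points onto four dst points (h33 = 1)."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def warp_point(h, x, y):
    """Apply the homography to one point (projective division included)."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return u, v


# Hypothetical detected corners: top-left, top-right, bottom-right, bottom-left.
corners = [(12.0, 8.0), (205.0, 20.0), (198.0, 150.0), (5.0, 140.0)]
# Assumed size of the flattened table image: 200 x 140 pixels.
target = [(0.0, 0.0), (200.0, 0.0), (200.0, 140.0), (0.0, 140.0)]
H = homography(corners, target)
```

In practice the same matrix would be used to resample every pixel of the table region. Image libraries provide this as a single warp call, so the hand-written solver here only serves to show the underlying arithmetic.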

The framework of the table detection algorithm is shown in Figure 5. Because the algorithm runs on the phone, where running speed and model size both matter, we use a very lightweight one-stage detection framework. The backbone is ShuffleNetV2. The model detects the table box and regresses the keypoints at the same time, which makes perspective correction of the table convenient. Wing loss is used instead of L1 loss to make keypoint regression more accurate. On the data side, we used algorithms to mine a large amount of table detection data from public data at low cost, which significantly improved detection quality. The final model is about 1 MB in size and runs smoothly on Xiaomi phones.

▲ Figure 5: Table detection algorithm framework
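To show why Wing loss helps keypoint regression, here is a minimal sketch of its published form: logarithmic for small errors, linear for large ones, with a constant that joins the two pieces continuously. The parameter values w=10 and eps=2 are assumed defaults for illustration; the article does not state which values Xiaomi uses.

```python
import math


def wing_loss(pred, target, w=10.0, eps=2.0):
    """Mean Wing loss over paired coordinate values.

    For an error d = |pred - target|:
      d <  w : w * ln(1 + d / eps)   (amplifies small and medium errors)
      d >= w : d - c                 (behaves like L1 for large errors)
    where c = w - w * ln(1 + w / eps) makes the two pieces meet at d = w.
    """
    c = w - w * math.log(1.0 + w / eps)
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += w * math.log(1.0 + d / eps) if d < w else d - c
    return total / len(pred)


# A 1-pixel error is penalized more strongly than under L1 (which gives 1.0).
small_error = wing_loss([1.0], [0.0])
```

Compared with L1, the steeper slope near zero keeps pushing keypoints toward the exact corner position even when the remaining error is only a pixel or two.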

4. Table recognition algorithm

As shown in Figure 3, the table recognition algorithm runs on the server and consists mainly of text detection, text recognition, table structure prediction, cell matching, the alignment algorithm, and Excel export. The text detection and recognition modules use the OCR service that we launched earlier, so we will not focus on them here. The following mainly introduces the table structure prediction algorithm and the Cell coordinate aggregation algorithm. On the data side, because table data is hard to label, we built a set of table rendering tools that can synthesize table data in a variety of styles, which greatly reduces labeling cost.

Table styles vary widely: tables with ruling lines, tables without ruling lines, tables with only horizontal separator lines, and so on, and tables often contain complex merged cells. In addition, pictures may contain shadows, uneven lighting, distortion, and deformation, all of which make table structure prediction harder. There has been a lot of previous work on table structure prediction. Some methods extract table lines with traditional algorithms and derive rows, columns, and merged cells from the lines. Others detect cells with object detection and then use post-processing to organize the cells and restore the table structure. Still others segment the table lines with semantic segmentation and then post-process the segmentation results to restore the table structure. These approaches share a common problem: the post-processing is complex and not robust, so it usually has to be adapted to specific kinds of tables.

At present, the mainstream method is to represent the table as HTML and then predict the HTML tag sequence together with the corresponding coordinate information. This method has achieved good results on open-source datasets, and Ping An Technology and Baidu have also adopted it, but the large number of HTML tags makes table structure recognition error-prone. To address this shortcoming, we adopted a new encoding for tables in which only four tags are needed to represent a table of arbitrary structure, which greatly improves the accuracy of table structure recognition.

As shown in Figure 6, the table is defined as a matrix of N cells, together with its internal merged cells. "0" denotes an ordinary cell, "1" denotes a cell merged to the left, and "2" denotes a cell merged upward. Each cell also corresponds to a coordinate box, so that OCR results can be matched to it later. This definition has several advantages: it needs no hand-written grammar rules; the data has a natural two-dimensional alignment, so the network is less prone to drift; and a small number of tags can restore any table structure, so there is no open-set classification problem.

▲ Figure 6: Table structure definition
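A small sketch can make the tag matrix concrete. It decodes a grid of tags into cells with row and column spans. It handles only the three tags the text names (0, 1, 2); the fourth tag is not described in the article and is left out. The function name and the union-find grouping are illustrative assumptions, not the decoder Xiaomi uses.

```python
def decode_spans(tags):
    """Turn a grid of merge tags into (row, col, rowspan, colspan) cells.

    Assumed tag meanings, following the text: 0 = ordinary cell,
    1 = merged into the cell on its left, 2 = merged into the cell above.
    """
    rows, cols = len(tags), len(tags[0])
    parent = {(r, c): (r, c) for r in range(rows) for c in range(cols)}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    for r in range(rows):
        for c in range(cols):
            if tags[r][c] == 1 and c > 0:
                union((r, c), (r, c - 1))
            elif tags[r][c] == 2 and r > 0:
                union((r, c), (r - 1, c))

    groups = {}
    for p in parent:
        groups.setdefault(find(p), []).append(p)

    spans = []
    for _, members in sorted(groups.items()):
        rs = [m[0] for m in members]
        cs = [m[1] for m in members]
        spans.append((min(rs), min(cs),
                      max(rs) - min(rs) + 1, max(cs) - min(cs) + 1))
    return spans


# A 2 x 3 grid: the first two cells of row 0 are merged horizontally,
# and the last column is merged vertically.
example = [[0, 1, 0],
           [0, 0, 2]]
cells = decode_spans(example)
```

Because every grid position carries exactly one tag, the encoded table always has the same shape as the table itself, which is the two-dimensional alignment property the text describes.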

We use the table structure prediction framework shown in Figure 7, an image-to-sequence learning network built from a CNN and a Transformer decoder. The decoding stage has two prediction heads, which predict the table tag sequence and the coordinates of each table cell, respectively.

▲ Figure 7: Table structure prediction framework

The result of table structure recognition is shown in Figure 8. The algorithm predicts the position of each cell and the sequence tag corresponding to each position. The left and right pictures in Figure 8 correspond one to one: detection boxes of the same color on the left correspond to the cells on the right, and the cells are listed in order.

▲ Figure 8: Table structure recognition results

During deployment, we used the FasterTransformer inference framework for acceleration. Inference became about 20 times faster, significantly improving the user experience.

The Cell coordinate aggregation algorithm matches the text found by text detection with the cells predicted by table structure prediction. Its flow is shown in Figure 9. Each text box is matched to a cell box: first to the cell with the largest IoU, and if the IoU is 0, to the cell whose center is nearest to the center of the text box. If a cell contains several text boxes, their contents are output in reading order within the cell, with intelligent line wrapping to improve the user experience.

▲ Figure 9: Flow of the Cell coordinate aggregation algorithm
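The matching rule described above can be sketched as follows. Boxes are assumed to be axis-aligned tuples (x1, y1, x2, y2), and reading order is approximated by sorting on the top edge and then the left edge. Both are simplifying assumptions, as are the function names; the article does not specify these details.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def center(box):
    return (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0


def match_text_to_cells(cells, texts):
    """Assign each (box, text) pair to a cell index and join texts per cell."""
    assigned = {i: [] for i in range(len(cells))}
    for box, text in texts:
        scores = [iou(box, cell) for cell in cells]
        best = max(range(len(cells)), key=lambda i: scores[i])
        if scores[best] == 0.0:
            # No overlap with any cell: fall back to the nearest cell center.
            tx, ty = center(box)
            best = min(
                range(len(cells)),
                key=lambda i: (center(cells[i])[0] - tx) ** 2
                + (center(cells[i])[1] - ty) ** 2,
            )
        assigned[best].append((box, text))

    result = {}
    for i, items in assigned.items():
        # Simplified reading order: top to bottom, then left to right.
        items.sort(key=lambda item: (item[0][1], item[0][0]))
        result[i] = "\n".join(text for _, text in items)
    return result


# Hypothetical cells and OCR text boxes for illustration.
cells = [(0, 0, 100, 50), (100, 0, 200, 50)]
texts = [((10, 30, 90, 45), "world"),
         ((10, 5, 90, 20), "hello"),
         ((210, 10, 250, 40), "x")]
matched = match_text_to_cells(cells, texts)
```

The third text box lies outside both cells, so its IoU is 0 everywhere and it falls back to the second cell, whose center is closer.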

In the end, our algorithm leads its main industry competitors in the accuracy of both table structure extraction and end-to-end table restoration.

5. Alignment algorithm

The steps above largely restore the table's contents, but cells in the same table are not all aligned the same way: left-aligned, right-aligned, and centered cells may appear together. We designed an alignment algorithm that analyzes the position information of the cells in the table to align them automatically, faithfully restoring the original table and significantly improving the user experience. The effect of the alignment algorithm is shown in Figure 10:

▲ Figure 10: Alignment algorithm results
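The article does not say how the alignment is inferred from cell positions. One plausible heuristic, shown here purely as an assumption and not as Xiaomi's method, is to look at the text boxes in one column and pick the edge (left, center, or right) whose x-coordinate varies least.

```python
from statistics import pstdev


def infer_alignment(text_boxes):
    """Guess a column's alignment from its text boxes (x1, y1, x2, y2).

    Hypothetical heuristic: the edge whose x-coordinates have the smallest
    spread is taken as the alignment anchor of the column.
    """
    lefts = [b[0] for b in text_boxes]
    rights = [b[2] for b in text_boxes]
    centers = [(b[0] + b[2]) / 2.0 for b in text_boxes]
    spreads = {
        "left": pstdev(lefts),
        "center": pstdev(centers),
        "right": pstdev(rights),
    }
    return min(spreads, key=spreads.get)


# Three text boxes of different widths that all start at x = 10.
column = [(10, 0, 50, 10), (10, 12, 80, 22), (10, 24, 30, 34)]
alignment = infer_alignment(column)
```

A production version would also need to handle columns with a single entry or with text that fills the cell, where all three spreads are equal and the position alone cannot decide.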
