2025-01-27 Update. From: SLTechnology News&Howtos (shulou)
Shulou(Shulou.com)06/03 Report--
Introduction: If algorithms and data are a sports car's engine and fuel, then the system is its gearbox, and a stable, flexible gearbox is what lets an image recognition service move forward. Algorithm, data and system form a trinity: as the algorithms develop rapidly and the data keeps accumulating, the system is being upgraded efficiently and stably as well.
I. Background
The previous articles in this series covered the algorithms and the data. This one covers the system, the gearbox that combines with algorithm and data to form a complete online OCR service. As the algorithms were upgraded and more services were connected, the system evolved along three lines: from a stand-alone version to a distributed one; from a custom system module for each algorithm to a separation of framework and algorithm logic, which improves both the running efficiency of the algorithms and the reusability of modules; and from a single runtime environment to heterogeneous CPU/GPU parallelism. Beyond the functions that any general distributed system must provide, we also added features such as single-point hot update and cluster snapshots, designed around the characteristics of our algorithms and our operations work.
II. Challenges we face
As an online service, the image recognition framework faces the usual challenges of any business system or platform. As an algorithm system, it must also support complex algorithms and services. The main challenges are:
1. High performance and high reliability
Every online system must be fast and reliable. Our image recognition service has to respond quickly, and both the framework layer and the algorithm layer must be highly available.
2. System decoupling and high scalability
In a distributed algorithm system, decoupling the algorithm modules from the framework lets algorithm engineers and back-end engineers work in parallel and iterate on the algorithms and the framework independently. High scalability means that the framework must scale resources across the cluster and must also allow individual algorithms to be plugged in and replaced quickly.
3. Support of complex services and algorithms
As more services are connected and the algorithms grow more complex, the framework layer must support algorithms flexibly, reuse algorithm modules efficiently, and adapt quickly to new algorithms and new services.
4. Support for different operating environments
An algorithm system, and an image recognition system in particular, runs on both CPU and GPU. The framework must run efficiently in each environment and must let different stages of a pipeline run on different hardware, so that resources are used sensibly and efficiently.
III. Our solution
Image recognition involves many steps, from algorithm research and model training through to serving at scale. We divide development into two stages:
Algorithm research & model training:
Algorithm engineers handle algorithm research, model training, and coordination with the business side. They do not need to care about the framework or system scheduling; they only deliver the algorithm SO (a shared object library, .so) for a single module together with the trained model file.
Framework development & algorithm integration:
Back-end developers build the service framework and integrate into it the algorithm SOs and model files produced by the algorithm engineers, to provide a stable online image recognition service.
Next, we will focus on the design and implementation of the system.
IV. Image recognition service framework
4.1 System architecture
The framework layer is written in Java. The algorithm layer follows a plug-in design and can load JAR packages, SOs, or scripts, which makes the whole a multi-language hybrid system. Image recognition algorithms are generally compute-intensive, and some must run on GPU, so in the algorithm layer we write the SOs in CUDA C++ and load them through JNI for execution and scheduling. As shown in figure 1, the architecture has three main layers: the access layer, the framework layer, and the algorithm layer. Together with peripheral systems such as the evaluation system, storage system, monitoring and alarm system, and log system, they form a complete image recognition service.
Fig. 1 Image recognition service framework system architecture diagram
Access layer: protocol conversion, parameter passing, result adaptation, and so on.
Framework layer: the runtime framework of the image recognition service. It loads and runs the algorithm SOs to provide a stable recognition service, and includes:
Master: receives requests from the access layer, splits and schedules them, and merges the results.
Worker: the process that actually executes the algorithms. It loads and updates the algorithm SOs and models, and runs the algorithms.
Zookeeper: stores worker heartbeat information, the algorithm mapping, algorithm execution plans, and static and dynamic snapshot information.
ConfigServer: monitors worker heartbeats, updates the dynamic routing table in real time, and triggers the master to update its routing rules and connection pool.
Algorithm layer: the algorithm models and algorithm SOs supplied by the algorithm engineers.
Peripheral systems:
Evaluation system: evaluates each version.
Storage system: stores non-sensitive images and bad cases.
Monitoring and alarm system: monitors the running status of the service and raises an alarm when an exception occurs.
Log system: stores request logs and framework runtime logs for tracing and troubleshooting.
4.2 System operating mode
This section walks through the framework's runtime behavior using an actual OCR prediction request.
1) An example of OCR recognition
As shown in figure 2, we take STR (Scene Text Recognition) as an example; a typical use case is understanding advertising image material. The task is to recognize the text in the image and return its coordinates.
Fig. 2 An example of OCR recognition
2) System running state
Figure 3 shows in detail how the example above runs through the framework.
Fig. 3 Running state of the system
1. The request from the business side carries the image content (or an image URL), a bid (identifying the business) and a tid (identifying the algorithm category).
2. The master looks up the corresponding algorithm and then its execution plan, which defines the steps of the algorithm. For each step it finds the corresponding routing node, then splits, packages, and routes the request to the appropriate worker.
3. The master routes the original image to the detection subsystem. Detection runs on GPU: the algorithm SO detects each text box in the image (such as "JD.com" and "inner and outer dermis" in figure 2), crops the boxes, and returns the results to the master.
4. The master splits the detection results and distributes them in parallel to the recognition subsystem. Recognition runs on GPU: the algorithm SO recognizes the text of a single text box and returns each result to the master.
5. The master gathers the recognition results and sends them to the reordering subsystem. Reordering runs on CPU, and the algorithm SO returns the results to the master.
6. The master packages the final result and returns it.
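Steps 3 to 6 amount to a scatter/gather pattern on the master. The following is a minimal Java sketch of that pattern, not the framework's actual code: the class `StrPipelineSketch`, the `Box` type, and the `detector` and `recognizer` parameters are all hypothetical stand-ins for remote calls to the workers. The article does not say what the reordering step computes, so the sketch assumes a simple top-to-bottom, left-to-right sort.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Illustrative sketch only: all names here are assumptions, not the framework's API.
class StrPipelineSketch {
    /** One text box: top-left corner plus a payload (cropped pixels before recognition, text after). */
    static final class Box {
        final int x;
        final int y;
        final String payload;

        Box(int x, int y, String payload) {
            this.x = x;
            this.y = y;
            this.payload = payload;
        }
    }

    /**
     * detector and recognizer stand in for remote calls to the GPU workers;
     * the sort near the end stands in for the CPU reordering worker.
     */
    static List<String> handle(String image,
                               Function<String, List<Box>> detector,
                               Function<Box, String> recognizer) {
        // Step 3: one detection call returns every text box in the image.
        List<Box> boxes = detector.apply(image);

        // Step 4: fan out, one asynchronous recognition call per box.
        List<CompletableFuture<Box>> pending = new ArrayList<>();
        for (Box box : boxes) {
            pending.add(CompletableFuture.supplyAsync(
                    () -> new Box(box.x, box.y, recognizer.apply(box))));
        }

        // Gather: wait for every recognition result.
        List<Box> recognized = new ArrayList<>();
        for (CompletableFuture<Box> future : pending) {
            recognized.add(future.join());
        }

        // Step 5 (assumed behavior): order top-to-bottom, then left-to-right.
        recognized.sort(Comparator.comparingInt((Box b) -> b.y).thenComparingInt(b -> b.x));

        // Step 6: package the final result.
        List<String> texts = new ArrayList<>();
        for (Box box : recognized) {
            texts.add(box.payload);
        }
        return texts;
    }
}
```

Because recognition is issued per box, the slowest box bounds the latency of step 4 rather than the sum over all boxes, which is presumably why the master splits the detection results before dispatching them.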
The modules play the following roles in this process:
Algorithm mapping: maps bid+tid to a specific sub-algorithm. Combining the same tid with different bids lets different businesses customize algorithms of the same category.
Execution plan: defines the execution steps of an algorithm. For example, the STR text recognition in the figure has three steps: detection, recognition, and reordering.
Dynamic snapshot: the dynamic routing table, which defines the specific nodes that each stage of an algorithm maps to. Workers report heartbeats, ConfigServer collates them into the dynamic routing table, and the master nodes listen for changes to the routing table.
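The three lookups can be sketched as three maps consulted in order. This is a hypothetical Java illustration under stated assumptions: the class `RoutingSketch`, the `"bid:tid"` key format, the worker address strings, and the round-robin choice among live workers are all invented for the example; the article does not describe the real data layout or load-balancing policy.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: key formats and the round-robin policy are assumptions.
class RoutingSketch {
    // Algorithm mapping: "bid:tid" -> algorithm name.
    final Map<String, String> algorithmMapping = new HashMap<>();
    // Execution plan: algorithm name -> ordered list of steps.
    final Map<String, List<String>> executionPlans = new HashMap<>();
    // Dynamic snapshot: step -> addresses of workers that are currently alive.
    final Map<String, List<String>> dynamicSnapshot = new HashMap<>();

    private int cursor = 0; // round-robin counter (assumed policy)

    /** Resolves a request to one "step@worker" hop per step of its execution plan. */
    List<String> route(String bid, String tid) {
        String algorithm = algorithmMapping.get(bid + ":" + tid);
        if (algorithm == null) {
            throw new IllegalArgumentException("no algorithm mapped for " + bid + ":" + tid);
        }
        List<String> plan = executionPlans.get(algorithm);
        if (plan == null) {
            throw new IllegalStateException("no execution plan for " + algorithm);
        }
        List<String> hops = new ArrayList<>();
        for (String step : plan) {
            List<String> workers = dynamicSnapshot.getOrDefault(step, Collections.emptyList());
            if (workers.isEmpty()) {
                throw new IllegalStateException("no live worker for step " + step);
            }
            hops.add(step + "@" + workers.get(Math.floorMod(cursor++, workers.size())));
        }
        return hops;
    }
}
```

Keeping the mapping, the plan, and the snapshot as separate tables is what lets a business-specific variant (a new bid) reuse an existing plan, and lets workers come and go without touching either of the other two tables.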
4.3 Disaster recovery and cluster hot update
Hot update is a basic capability for any system that must provide a reliable, stable service: it allows lossless upgrades and also underpins disaster recovery. As shown in figure 4, the system implements cluster hot update mainly through ZooKeeper and the workers' heartbeat mechanism.
Fig. 4 Cluster hot update
Worker: at startup, creates an ephemeral node in ZooKeeper to maintain its heartbeat information.
ConfigServer: monitors the workers' heartbeat information in ZooKeeper. If a worker disconnects or reconnects, ConfigServer detects it immediately and modifies the dynamic snapshot.
Master: listens for the dynamic snapshot information in ZooKeeper. A change to the dynamic snapshot immediately triggers an update of the routing rules and the routing connection pool.
Through the cooperation of these roles, the master quickly switches away from a worker node that becomes abnormal, which keeps the system stable. The same mechanism supports cluster hot update. To update a worker, it is first taken offline; once the master notices, it stops sending requests to that worker. After the worker has been updated and restarted, the master re-establishes the connection and resumes sending requests.
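The interplay of the three roles can be illustrated with an in-memory simulation. The Java sketch below deliberately does not use the ZooKeeper client API: `workerUp` and `workerDown` stand in for the appearance and disappearance of a worker's ephemeral node, and the listener callback stands in for the master's watch on the dynamic snapshot. All class and method names are hypothetical, and the round-robin pick is an assumed policy.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Consumer;

// In-memory simulation only: no real ZooKeeper calls; all names are assumptions.
class HeartbeatSketch {
    /** Plays the ConfigServer role: turns heartbeat changes into a new dynamic snapshot. */
    static final class ConfigServer {
        private final Set<String> live = new TreeSet<>();
        private final List<Consumer<Set<String>>> listeners = new ArrayList<>();

        void subscribe(Consumer<Set<String>> listener) {
            listeners.add(listener);
        }

        // Stands in for a worker's ephemeral node appearing.
        void workerUp(String address) {
            if (live.add(address)) {
                publish();
            }
        }

        // Stands in for the ephemeral node vanishing (crash or planned offline).
        void workerDown(String address) {
            if (live.remove(address)) {
                publish();
            }
        }

        private void publish() {
            Set<String> snapshot = Collections.unmodifiableSet(new TreeSet<>(live));
            for (Consumer<Set<String>> listener : listeners) {
                listener.accept(snapshot);
            }
        }
    }

    /** Plays the master role: routes only to workers in the latest snapshot. */
    static final class Master {
        private volatile List<String> routes = Collections.emptyList();
        private int cursor = 0;

        void onSnapshot(Set<String> snapshot) {
            routes = new ArrayList<>(snapshot); // rebuild the routing table
        }

        String pick() {
            List<String> current = routes;
            if (current.isEmpty()) {
                throw new IllegalStateException("no live workers");
            }
            return current.get(Math.floorMod(cursor++, current.size()));
        }
    }
}
```

A rolling update then looks like `workerDown("w1")`, upgrade, `workerUp("w1")`: between the two calls the master's routing table simply does not contain `w1`, so no request reaches it.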
4.4 Single-point hot update
Single-point hot update means replacing or upgrading one or more modules inside a process without restarting the service. Unlike other business systems, an algorithm platform has to consider the following two scenarios:
1) Multiple algorithm SOs are loaded in one process, and one of the algorithm modules needs to be updated.
2) An algorithm chain contains several serial modules, and one of them needs to be tested or updated.
In both scenarios, a cluster hot update or a process restart is too heavyweight, so we implemented a scheme for dynamically updating a single SO inside a process. Dynamic loading and unloading of an SO cannot be done directly from Java code. As shown in figure 5, we introduce a proxy SO that performs the dlopen, dlsym and dlclose operations and thereby loads the algorithm SO dynamically. The proxy SO also encapsulates the JNI conversions and all the interfaces that the algorithms need, which decouples the framework from the algorithms cleanly. Besides the dynamic loading of SOs, we also implemented dynamic loading of models.
Fig. 5 Single-point hot update
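The native side of this scheme (dlopen, dlsym, dlclose inside the proxy SO) cannot be shown as runnable Java, so the sketch below illustrates only the swap pattern itself, as a pure-Java analog: the process keeps one reference to the currently loaded module, and an update replaces that reference and then releases the old module. The `Module` interface and `HotSwapSketch` class are hypothetical; in the real system `close()` would correspond to the proxy SO releasing the old library.

```java
import java.util.concurrent.atomic.AtomicReference;

// Pure-Java analog of the swap pattern; it does not call dlopen/dlclose. Names are assumptions.
class HotSwapSketch {
    /** Hypothetical handle to one loaded algorithm module. */
    interface Module {
        String run(String input);

        void close(); // stands in for releasing the old library
    }

    private final AtomicReference<Module> current;

    HotSwapSketch(Module initial) {
        current = new AtomicReference<>(initial);
    }

    /** Requests always go to whichever module is current at call time. */
    String run(String input) {
        return current.get().run(input);
    }

    /** Installs the replacement first, then releases the old module; the process keeps running. */
    void reload(Module replacement) {
        Module old = current.getAndSet(replacement);
        old.close();
    }
}
```

This sketch ignores requests that are still executing in the old module when `reload` is called. A production version would need to wait for them to finish, for example with reference counting, before releasing the old module.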
4.5 Static snapshot
A typical distributed framework does not need static snapshots. In an algorithm system, however, we frequently bring batches of SOs online or take them offline, and these SOs are spread across different machines and nodes. Dynamic loading and unloading in the relevant processes can be triggered simply by uploading or deleting the SO files on the server, but this becomes tedious and error-prone when the algorithms are complex and a process loads many SOs. We therefore optimized the system from the operations side to make it easier to operate and maintain.
As shown in figure 6, we introduce static snapshots alongside dynamic snapshots. A static snapshot is a static routing table that operations staff write to ZooKeeper through scripts or configuration files; it describes the expected initial state of the cluster. ConfigServer combines the static snapshot with the heartbeat information reported by the workers to generate the final dynamic snapshot.
Fig. 6 Static snapshot
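One plausible way to combine the two inputs is an intersection: a worker is routable for an algorithm only if the static snapshot expects it there and its heartbeat is currently alive. The article does not spell out the actual merge rule, so the Java sketch below is an assumption, and `SnapshotMergeSketch` and the worker names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative sketch only: the intersection rule is an assumption, not the documented behavior.
class SnapshotMergeSketch {
    /**
     * staticSnapshot: algorithm -> workers expected to serve it (written by operations staff).
     * liveWorkers: workers whose heartbeat is currently present.
     * Returns the dynamic snapshot: algorithm -> expected workers that are alive.
     */
    static Map<String, List<String>> merge(Map<String, List<String>> staticSnapshot,
                                           Set<String> liveWorkers) {
        Map<String, List<String>> dynamic = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : staticSnapshot.entrySet()) {
            List<String> routable = new ArrayList<>();
            for (String worker : entry.getValue()) {
                if (liveWorkers.contains(worker)) {
                    routable.add(worker);
                }
            }
            dynamic.put(entry.getKey(), routable);
        }
        return dynamic;
    }
}
```

Under this rule, taking an algorithm offline is a one-line change to the static snapshot: removing its entry removes it from the dynamic snapshot on the next merge, with no SO files touched on any server.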
Combining dynamic and static snapshots makes operations slightly more complex than using dynamic snapshots alone, but operations tooling can reduce that complexity. The advantages are clear. First, mistakes are less likely even with complex algorithms. Second, beyond what dynamic snapshots offer, algorithms can be brought online or taken offline quickly, and the algorithm flow can be changed quickly.
V. Some thoughts on the framework
We have successfully used a hybrid Java and CUDA C++ system as the image recognition service framework, and it supports many businesses. A mature, stable framework frees up more people for algorithm development and business integration. Java's mature open-source tools make the framework quick to develop and maintain. With Java as the scheduling and network layer of the framework, there is no performance difference compared with a traditional C++ framework, while writing the SOs in CUDA C++ fits the GPU environment and deep learning frameworks better and uses the machines' computing resources more efficiently. Our system continues to evolve, and we will keep working on resource utilization, scheduling efficiency, service onboarding speed, and finer-grained operations and maintenance.
VI. Concluding remarks
We have published a series of articles on OCR technology: "OCR Technology: Detection", "OCR Technology: Recognition", "OCR Technology: Data" and "OCR Technology: System". We hope these articles open a discussion of technologies and applications in the OCR field. The team will continue to work in this field, keep improving its technical level and service quality, and make a modest contribution to the development of OCR technology.