
G7's Exploration and Practice in Real-Time Computing


Author: Zhang Hao

G7 Business Quick View

G7 uses sensors installed on trucks to collect vehicle data such as trajectory, fuel consumption, ignition/flameout status, load, and temperature. It connects vehicles, drivers, fleets, and cargo owners, and addresses pain points in cargo transportation such as timeliness, safety, and cost.

All of this data is collected by vehicle-mounted sensor devices such as the company's Smart Box, CTBox, fuel sensors, and temperature probes. The devices report the vehicle data to the back-end platform, where it is computed and processed before finally being presented to users.


The business scenarios for G7 are typical IoT scenarios:

Sensor-generated data

Many different data types

Poor data quality

Low-latency requirements

Very large data volumes

The data quality is poor because the pipeline is very long: data collected from sensors is reported through network operators to the back-end servers, then parsed, passed through MQ, filtered, enriched by calls to third-party interfaces, processed by business logic, and finally written to storage. Such a long chain inevitably causes data duplication and data loss in transit. In addition, IoT scenarios demand very low end-to-end latency. Take geofence entry/exit alarms: when a vehicle enters an electronic fence, the alarm event must be generated quickly, usually within 30 seconds; any longer and the vehicle has already passed through the fence area, making the alarm worthless. Finally, the data volume is huge: we currently generate 2 billion trajectory points and 10 billion records per day, which places very high demands on computing performance.

Real-time computing engine selection

From the scenarios above, it is clear that G7's IoT business needs a low-latency, fast real-time computing engine. Initially, some of our architectures followed the Lambda pattern. Trajectory-point calculation, for example, used a real-time engine to compute live data, which had low latency but was not very accurate, and then recomputed the data offline; the offline result was accurate and could be used to correct the real-time data. The drawbacks of this approach are obvious: two sets of code (a real-time program and an offline program) have to be maintained, and the real-time data is inaccurate while the accurate data arrives too late. Later we were pleased to discover Kappa, an architecture based purely on real-time processing.

The Kappa architecture emphasizes the real-time nature of data; to guarantee it, data that arrives too late may be discarded. All computation logic lives only in the real-time layer, so there is a single set of logic: data is read from MQ, computed and processed by the data-processing layer, and finally lands in the storage layer, which serves queries to the outside world. Compared with Lambda, the Kappa architecture is a better fit for the IoT domain.

For Kappa architecture, we compared the mainstream real-time stream computing frameworks in the industry:

We compared the mainstream stream computing frameworks: Storm, Storm Trident, Spark Streaming, Google Cloud Dataflow, and Flink. Spark Streaming and Storm Trident are micro-batch based and their latency is too high for our scenario. Storm's latency is very low, but its data consistency is only at-least-once, its fault-tolerance mechanism is relatively complex, and its flow control is prone to jitter, so it is also not a good fit. Flink, by contrast, offers strong consistency guarantees (since version 1.4 it even supports end-to-end consistency), low latency, a fault-tolerance mechanism with relatively low overhead (distributed snapshots based on the Chandy-Lamport algorithm), more graceful flow control (data transfer between operators is based on distributed in-memory blocking queues), and separation of application logic from fault tolerance (operator processing versus distributed snapshot checkpoints). For these reasons we concluded that Flink is the better fit for this IoT scenario.

G7 Business Application Cases

Flink currently has three main application scenarios at G7:

Real-time computing

Real-time ETL

Statistical analysis

The following sections describe each of these three scenarios.

Real-time computing

In G7's business, many services fall into the real-time computing category, such as geofence entry/exit events, speeding events, idling events, fatigue alarm events, dangerous-driving alarms, fuel consumption calculation, and mileage calculation. Fatigue alarm calculation was one of the earliest services we implemented with Flink.

Fatigue Alarm Service Model

This is the large screen that G7 provides to customers; the risk-related portion of it is driven by the fatigue calculation.

According to G7's big-data analysis, fatigue driving accounts for about 20% of all truck accidents. Warning drivers about fatigue in time is therefore particularly important and can effectively reduce the likelihood of accidents.

Whether a driver is fatigued is judged from the vehicle's mileage, the driver's mileage, and the driving time. If the driving time exceeds the alarm threshold, an alarm is raised; if it is below the alarm threshold but above the early-warning threshold, a warning is issued. Both alarms and warnings are delivered as voice prompts to the cab to remind the driver.

The biggest challenges in this scenario are real-time performance and stability: risk can only be minimized by delivering the alarm to the relevant people as quickly and as reliably as possible.

Business process

In the overall processing flow, the fatigue configuration is fetched first, and then the vehicle status information and the driver's check-in (punch-card) information are combined with that configuration to decide whether a warning or an alarm should be raised. During the calculation, the state at the start of fatigue driving is cached; when fatigue driving ends, the previously cached state is retrieved, and once the start and end are matched a complete fatigue event is produced. Along the way, interface services such as dubbo are called to obtain the vehicle's configuration and status data. A generated fatigue alarm triggers a call to the voice-issuing interface, and the fatigue event results are also written to HBase, MySQL, Kafka, and so on. A sketch of the core start/end matching logic is shown below.
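
The following is a minimal, hypothetical sketch of that start/end matching, not G7's actual code: a KeyedProcessFunction keyed by vehicle id keeps the start of continuous driving in ValueState and emits a warning or alarm event when driving ends. The record layout (a Tuple3 of vehicle id, driving flag, and timestamp) and the thresholds are illustrative only.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class FatigueMatchFunction
        extends KeyedProcessFunction<String, Tuple3<String, Boolean, Long>, String> {

    // Illustrative thresholds: warn after 3.5 h of continuous driving, alarm after 4 h.
    private static final long WARNING_MS = 3L * 3600_000 + 1800_000;
    private static final long ALARM_MS = 4L * 3600_000;

    // Caches the timestamp at which continuous driving started for the keyed vehicle.
    private transient ValueState<Long> drivingStart;

    @Override
    public void open(Configuration parameters) {
        drivingStart = getRuntimeContext().getState(
                new ValueStateDescriptor<>("driving-start", Types.LONG));
    }

    @Override
    public void processElement(Tuple3<String, Boolean, Long> record,
                               Context ctx,
                               Collector<String> out) throws Exception {
        boolean isDriving = record.f1;
        long ts = record.f2;
        Long start = drivingStart.value();

        if (start == null && isDriving) {
            // Start of continuous driving: cache the state.
            drivingStart.update(ts);
        } else if (start != null && !isDriving) {
            // End of continuous driving: match it with the cached start and emit a complete event.
            long duration = ts - start;
            if (duration >= ALARM_MS) {
                out.collect("ALARM: vehicle " + record.f0 + " drove " + duration / 60_000 + " min");
            } else if (duration >= WARNING_MS) {
                out.collect("WARNING: vehicle " + record.f0 + " drove " + duration / 60_000 + " min");
            }
            drivingStart.clear();
        }
    }
}
```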

Streaming model

The final Flink program consists of the following operators from start to finish: a Kafka consumer operator, a type-conversion operator, a data-filtering operator, an asynchronous third-party-interface operator, a window-sorting operator, the fatigue business-logic operator, and a data-storage operator. A rough sketch of how such a chain can be wired together follows.
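
The sketch below is an assumption-laden illustration of that operator chain rather than G7's code: the topic and broker names are made up, the message layout is assumed to be a simple CSV of vehicle id, driving flag, and timestamp, AsyncConfigLookup is a hypothetical AsyncFunction standing in for the dubbo/config lookup, FatigueMatchFunction is the sketch from the previous section, and the exact Kafka consumer class depends on the connector version in use.

```java
import java.util.Properties;
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class FatiguePipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder
        props.setProperty("group.id", "fatigue-job");         // placeholder

        // 1. Consume raw vehicle messages from Kafka.
        DataStream<String> raw = env.addSource(
                new FlinkKafkaConsumer<>("vehicle-status", new SimpleStringSchema(), props));

        // 2. Type conversion (assumed CSV layout: vehicleId,isDriving,timestampMillis)
        //    and 3. data-quality filtering.
        DataStream<Tuple3<String, Boolean, Long>> records = raw
                .map(line -> {
                    String[] f = line.split(",");
                    return Tuple3.of(f[0], Boolean.valueOf(f[1]), Long.valueOf(f[2]));
                })
                .returns(Types.TUPLE(Types.STRING, Types.BOOLEAN, Types.LONG))
                .filter(r -> r.f0 != null && !r.f0.isEmpty());

        // 4. Asynchronously enrich with configuration/state from external services such as dubbo.
        //    AsyncConfigLookup is a hypothetical AsyncFunction<Tuple3<...>, Tuple3<...>>.
        DataStream<Tuple3<String, Boolean, Long>> enriched = AsyncDataStream.unorderedWait(
                records, new AsyncConfigLookup(), 5, TimeUnit.SECONDS, 100);

        // 5./6. Key by vehicle id and apply the fatigue matching logic
        //       (the window-sorting step is omitted here for brevity).
        DataStream<String> events = enriched
                .keyBy(new KeySelector<Tuple3<String, Boolean, Long>, String>() {
                    @Override
                    public String getKey(Tuple3<String, Boolean, Long> r) {
                        return r.f0;
                    }
                })
                .process(new FatigueMatchFunction());

        // 7. Store the results; the real job writes to HBase / MySQL / Kafka sinks.
        events.print();

        env.execute("fatigue-alarm-job");
    }
}
```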

We hit quite a few pitfalls along the way and gathered some experience:

Keep each operator's logic as simple as possible;

Keep each operator as cohesive as possible and the coupling between operators as low as possible;

Splitting operators apart and using asynchronous calls plus multi-threading gives better performance;

Setting the parallelism of each operator individually gives better performance;

Choose between hash and rebalance deliberately: use hash (keyBy) to redistribute data only where keyed state such as ValueState is required, and use rebalance elsewhere. Keep the parallelism of upstream and downstream operators consistent so that tasks can be chained into a single thread, which gives better network I/O performance;

Use Flink's Async I/O to call external interfaces such as dubbo, zuul, DB, and HBase.

Real-time ETL

Some scenarios are simple collect-process-store pipelines, i.e. real-time ETL, such as pulling data from Kafka into HDFS, DB, HBase, ES, or another Kafka topic. This work can be expressed with Flink operators as: Source -> Transformation -> Sink.

This part can usually be written with FlinkKafkaConsumer, a MapFunction, and a JDBCAppendTableSink, roughly as follows:
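
(The original code screenshot is not reproduced here.) The following is a hedged reconstruction of such an ETL job based on the classes named above, assuming the legacy flink-jdbc JDBCAppendTableSink API; the topic, table, message layout, and connection settings are placeholders.

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.io.jdbc.JDBCAppendTableSink;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.types.Row;

public class SimpleEtlJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder
        props.setProperty("group.id", "etl-job");             // placeholder

        // Source: consume raw messages from Kafka.
        DataStream<String> source = env.addSource(
                new FlinkKafkaConsumer<>("raw-topic", new SimpleStringSchema(), props));

        // Transformation: parse each line "vehicleId,mileage" into a Row (assumed layout).
        DataStream<Row> rows = source
                .map(line -> {
                    String[] fields = line.split(",");
                    Row row = new Row(2);
                    row.setField(0, fields[0]);
                    row.setField(1, Long.valueOf(fields[1]));
                    return row;
                })
                .returns(new RowTypeInfo(Types.STRING, Types.LONG));

        // Sink: append rows into MySQL through the JDBC table sink.
        JDBCAppendTableSink sink = JDBCAppendTableSink.builder()
                .setDrivername("com.mysql.jdbc.Driver")
                .setDBUrl("jdbc:mysql://db-host:3306/demo") // placeholder
                .setQuery("INSERT INTO vehicle_mileage (vehicle_id, mileage) VALUES (?, ?)")
                .setParameterTypes(Types.STRING, Types.LONG)
                .setBatchSize(200)
                .build();
        sink.emitDataStream(rows);

        env.execute("simple-etl-job");
    }
}
```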

Statistical analysis

Some scenarios require real-time statistical analysis, for example counting, for all cities nationwide over the last hour, the total number of vehicles, the total number of drivers, fatigue events, geofence entry/exit events, check-ins, and flameout events. Such scenarios can usually be handled with Flink SQL plus window functions (tumbling and sliding windows). The code looks roughly like the following:
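
(The original code screenshot is likewise not reproduced.) Below is an illustrative reconstruction using Flink SQL with a tumbling window; the table and column names are invented, the "events" table is assumed to have been registered from the Kafka stream with an event-time attribute, and the exact Table API entry points and package names vary slightly between Flink versions.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
// On older Flink versions this import is org.apache.flink.table.api.java.StreamTableEnvironment.
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class CityStatsJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Assumes a table "events" with columns (city STRING, event_type STRING,
        // event_time TIMESTAMP(3) declared as rowtime) has already been registered
        // from the Kafka stream, e.g. via tEnv.createTemporaryView(...).
        Table flameoutPerCity = tEnv.sqlQuery(
                "SELECT city, "
              + "       TUMBLE_END(event_time, INTERVAL '1' HOUR) AS window_end, "
              + "       COUNT(*) AS flameout_cnt "
              + "FROM events "
              + "WHERE event_type = 'flameout' "
              + "GROUP BY city, TUMBLE(event_time, INTERVAL '1' HOUR)");

        // For a sliding window, replace TUMBLE with HOP(event_time, INTERVAL '5' MINUTE, INTERVAL '1' HOUR).
        flameoutPerCity.execute().print();
    }
}
```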

Development and current status of the real-time computing platform

Building on these successful business cases, we also wanted to create a real-time computing platform to serve all business lines. After almost three months of work, the platform, internally code-named Glink, went live. Its approximate architecture is as follows:

Glink consists mainly of the following components:

HDFS distributed file system. Stores checkpoint/savepoint data generated by Flink jobs, stores and distributes task artifacts and third-party dependency packages, and holds temporary data produced while jobs run;

YARN unified compute-resource platform. Provides a unified distributed computing resource pool with job submission, scheduling, execution, and resource isolation. All Flink jobs are currently managed through YARN;

APM performance monitoring. We use Dianping's open-source Cat, with our own secondary development on top, named the "Tianshu" system. It provides 95th/99th-percentile, average, and maximum call latency, Java GC monitoring, thread monitoring, stack information, and so on;

Cluster monitoring and management. Machine resources are monitored with Zabbix, covering CPU, memory, disk I/O, network I/O, connection counts, and file handles. Cluster resources are monitored and managed with the open-source Ambari, which provides automated installation and configuration plus monitoring and alerting on overall cluster tasks, memory, CPU, HDFS space, and YARN resources;

Task monitoring and alerting. Metrics are shipped to InfluxDB via the StatsD reporter that Flink provides; the task's throughput is derived by scanning the InfluxDB data, and an alert fires when the throughput drops below the expected threshold (a configuration sketch appears at the end of this section);

Diagnostics and debugging. We use the mature ELK log stack (Elasticsearch + Logstash + Kibana): logs from each node are collected and written into Elasticsearch, and key information can be queried in Kibana, providing clues for diagnosing and tuning programs;

Flink application layer. The Flink applications themselves, typically covering real-time ETL, statistical analysis, and business computing scenarios;

Glink task control platform. Wraps these capabilities to provide unified task management and operations (O&M) management functions.

Real-time Computing Platform Showcase - Task Management

Real-time Computing Platform Showcase - Logging and Performance Monitoring

Some of the platform's features are described below:

Task management. Provides task release, modification, upgrade, and stop, resource application and review, and startup-log viewing;

Operations and maintenance management. Provides log viewing, program monitoring, task monitoring, throughput monitoring, exception alerting, and other functions.

Together, these functions of the Glink real-time computing platform largely allow users to independently handle everything from program development and release to optimization, going live, and day-to-day operations.
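
As an illustration of the metrics path mentioned in the task-monitoring component above (Flink's StatsD reporter shipping metrics toward InfluxDB), a flink-conf.yaml fragment along these lines could be used; the reporter name, host, port, and interval are placeholders, and InfluxDB needs a StatsD-compatible ingestion endpoint such as Telegraf in front of it:

```yaml
# flink-conf.yaml (sketch): ship Flink metrics over the StatsD protocol
metrics.reporter.stsd.class: org.apache.flink.metrics.statsd.StatsDReporter
metrics.reporter.stsd.host: influx-gateway-host   # placeholder
metrics.reporter.stsd.port: 8125                  # placeholder
metrics.reporter.stsd.interval: 10 SECONDS        # placeholder reporting interval
```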

Glink-Framework Development Framework

Besides the platform features, we also want to provide better wrappers and utility classes around the Flink ecosystem, so we built a development scaffold: the Glink-Framework.

Glink-Framework provides the following:

A simplified POM with far fewer dependency and plugin configurations;

Third-party call integration: dubbo, zuul;

Third-party database integration: MySQL, Redis;

Multi-environment management;

Dependency version management;

Code quality tools: Checkstyle, PMD, FindBugs.

Cooperation between the platform and business teams: the technical BP model

On the other hand, we recognize that Flink has a certain technical barrier to entry; engineers without prior experience in concurrent programming or cluster development need time to ramp up. To address this pain point we proposed a "technical BP" cooperation model: depending on the complexity of the business, the platform assigns one or more engineers to take part in the business team's entire development and operations work, from requirements analysis to going live, followed by ongoing technology sharing and training to help the business team build up its own development capability.

Pitfalls

During platform building and business development we hit quite a few Flink pitfalls; some of the more typical ones were:

Excessive parallelism lengthens barrier alignment time; one subtask with parallelism 28 took more than 50 s to align;

ValueState cannot be shared across operators;

The Flink 1.3 Kafka connector does not support newly added partitions;

Integrating with Spring raised handler-matching problems;

Hadoop dependency conflicts prevented the program from starting, with no exception reported;

One of the more interesting issues was that excessive parallelism caused barrier alignment to take too long. To understand it, first note that when Flink generates a checkpoint, it injects barriers at the sources in between the normal messages sent downstream; an operator triggers its checkpoint only after it has received the expected barrier. As shown below:

That is the single-stream case. It becomes a bit more complicated when multiple streams enter one operator at the same time. When Flink takes a checkpoint and an operator has multiple input streams, the messages following the barrier that arrives first are buffered inside the operator until the barrier from the other stream arrives, and only then is the checkpoint triggered. This waiting-and-buffering process is called barrier alignment, as shown below:

For one job running online, some operators' checkpoints timed out because barrier alignment took more than 50 s. We have two tuning strategies for this. One is to keep parallelism as low as possible, i.e. to minimize the number of input channels flowing into an operator; with parallelism within about 4, the alignment time stays relatively small. The other is to switch from exactly-once to at-least-once semantics, so that the checkpoint does not perform barrier alignment at all: data reaching the operator is checkpointed immediately and sent downstream. Our current approach is to decide per business scenario: if at-least-once semantics satisfy the business requirements, we use at-least-once; if not, we reduce the parallelism to shrink the data volume and the time spent on barrier alignment. The second option can be configured as sketched below.
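
A minimal sketch of the second mitigation, assuming a standard DataStream job; the checkpoint interval, parallelism, and dummy pipeline are illustrative only.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointModeExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 s without barrier alignment (at-least-once semantics).
        env.enableCheckpointing(60_000L, CheckpointingMode.AT_LEAST_ONCE);

        // The alternative mitigation keeps exactly-once but lowers the parallelism so that
        // fewer input channels have to be aligned, e.g.:
        // env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);
        // env.setParallelism(4);

        // Dummy pipeline so the sketch runs; a real job builds its operators here.
        env.fromElements(1, 2, 3).map(i -> i * 2).print();

        env.execute("checkpoint-mode-example");
    }
}
```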

Platform benefits

Through the platform work of the past months, the benefits in terms of "reducing cost and increasing efficiency" show up mainly in the following areas:

Higher resource utilization. Monitoring of the whole cluster currently shows an average CPU utilization of about 20% under mixed deployment, and higher for CPU-intensive computing services;

Higher development efficiency. For an ETL collection program, for example, traditional development of collection, transformation, and loading takes about a day, while a simple ETL program can be developed on the platform in about an hour;

Large data-processing capacity. More than 8 billion records are processed per day on average;

Wide business coverage. The platform has launched 30+ business applications and is expected to exceed 100 within the year, serving all business lines: the IoT platform, EMS, FMS, smart trailers, enterprise solutions, SaaS, the hardware department, and so on.

Future planning

Our future Flink plans revolve around the goal of "reducing cost, increasing efficiency, and providing a unified computing platform," focusing on the following areas:

More thorough resource isolation. YARN's default isolation covers only memory; in the future we will need YARN plus cgroups to isolate both memory and CPU. We are also considering YARN node labels to isolate at the machine level and dedicate different machine types to different services, e.g. CPU-heavy tasks on CPU-rich machines and I/O-heavy tasks on machines with better I/O (a configuration sketch follows this list);

Better platform usability. The platform should cover code release, debugging, monitoring, and troubleshooting as a one-stop experience;

Less code. Using Flink SQL plus UDFs, common methods and functions are encapsulated and business logic is expressed in SQL as much as possible to improve development efficiency. We will also consider CEP pattern matching, since many current services could be supported by dynamic CEP;

A universal scaffold. Continued development of Glink-Framework, providing more sources, sinks, utilities, and business wrappers to simplify development.
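
For reference, a possible starting point for the YARN plus cgroup CPU isolation mentioned in the first item could look like the yarn-site.xml fragment below; the values are illustrative, and node-label configuration is separate and not shown.

```xml
<!-- yarn-site.xml (sketch): enable cgroup-based CPU isolation for NodeManager containers -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
  <value>/hadoop-yarn</value>
</property>
<property>
  <name>yarn.nodemanager.resource.percentage-physical-cpu-limit</name>
  <value>80</value>
</property>
```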

This article is based on Zhang Hao's technology-sharing talk at the Flink China Community Offline Meetup, Chengdu Station.

For more information, please visit the Apache Flink Chinese community website.
