Technical Essentials of Hybrid Transaction/Analytical Processing (HTAP)

2025-01-16 Update From: SLTechnology News&Howtos



HTAP has been a popular concept in recent years. This article covers the origins of HTAP and its technical characteristics.

I. Categories of Data Applications

Data applications can be divided into the following categories according to how the data is used. Before choosing a technology platform, we need to establish this kind of positioning.

1.1 OLTP (On-Line Transaction Processing)

OLTP is event-driven and application-oriented, also known as transaction-oriented processing. Its defining feature is that user input received by the front end is immediately passed to the computing center for processing, and the result is returned within a very short time: a fast response to each user operation. Banking and e-commerce trading systems are typical OLTP systems.
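As a minimal sketch of what an OLTP-style unit of work looks like, here is a short, index-driven transaction against an in-memory SQLite database (the table and column names are illustrative, not from any particular system):

```python
import sqlite3

# In-memory database standing in for an OLTP store (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])

# A typical OLTP unit of work: short, keyed lookups, touching few rows,
# committed (or rolled back) atomically.
def transfer(conn, src, dst, amount):
    with conn:  # opens a transaction; commits on success, rolls back on error
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (src,)
        ).fetchone()
        if balance < amount:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                     (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (amount, dst))

transfer(conn, 1, 2, 30)
balances = dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

The key point is the shape of the workload: each operation reads or writes a handful of rows by primary key and must respond quickly.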

OLTP has the following characteristics:

- Directly application-facing: the data is generated inside the system.
- A transaction-based processing system.
- Each transaction touches a very small amount of data, and response-time requirements are very strict.
- The number of users is very large; the users are operators, and concurrency is very high.
- Database operations are mainly index-based.
- SQL is the carrier of interaction.
- The overall data volume is relatively small.

1.2 OLAP (On-Line Analytical Processing)

OLAP is oriented to data analysis, also known as information-analysis processing. It enables analysts to observe information from many angles quickly, consistently and interactively, in order to gain a deep understanding of the data. It is characterized by large data volumes, complex analytical operations, a focus on decision support, and intuitive, easy-to-understand query results. A data warehouse is a typical OLAP system.
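By contrast, an OLAP-style query scans and aggregates rather than fetching individual rows by key. A toy sketch against the same kind of in-memory SQLite database (the fact-table schema is illustrative):

```python
import sqlite3

# In-memory table standing in for a fact table (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("north", "a", 10), ("north", "b", 20), ("south", "a", 5), ("south", "b", 15)],
)

# A typical OLAP query: scan the whole table and aggregate,
# rather than fetching individual rows by key.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
```

At production scale this same query shape would run over billions of rows, which is why OLAP engines care about scans, joins and compression rather than point lookups.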

OLAP has the following characteristics:

- It does not produce data itself; its base data comes from the operational data of production systems.
- A query-based analysis system; complex queries often involve multi-table joins, full table scans and so on, and the number of rows involved is often very large.
- Each query touches a very large amount of data, and response time depends heavily on the specific query.
- The number of users is relatively small; the users are mainly business analysts and managers.
- Because the business questions are not fixed, database operations cannot rely entirely on indexes.
- SQL is the main carrier, but language-style interaction is also supported.
- The overall data volume is relatively large.

1.3 Other

In addition to the traditional OLTP and OLAP classes, some new patterns of data use have emerged in recent years, which I group under "other".

1) Multi-model

With the trends toward "Internet-scale" and "intelligent" business and toward "microservice" and "cloud" architectures, application systems place new demands on data storage and management, and the diversity of data has become a prominent problem. Early databases mainly handled structured data; as business developed, demand gradually arose for other kinds of data as well, including semi-structured data (JSON, XML, etc.), text, geospatial data, graph data, audio and video, and so on. Multi-model means that a single database supports the storage and processing of multiple types of data.
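A toy sketch of the multi-model idea: one table holding structured columns side by side with semi-structured JSON documents. Real multi-model databases index and query the JSON natively; here the JSON is simply parsed in Python, and all names are illustrative.

```python
import json
import sqlite3

# One table holding both structured columns and semi-structured JSON documents
# (a toy stand-in for multi-model storage; names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT, doc TEXT)")
docs = [
    (1, "click", json.dumps({"page": "/home", "ms": 120})),
    (2, "click", json.dumps({"page": "/cart", "ms": 340})),
    (3, "scroll", json.dumps({"page": "/home", "depth": 0.8})),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", docs)

# Filter on the structured column, then reach into the JSON payload.
pages = [
    json.loads(doc)["page"]
    for (doc,) in conn.execute("SELECT doc FROM events WHERE kind = 'click'")
]
```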

2) Streaming

Streaming (real-time computing) arises from the demand for timeliness in data processing. The business value of data decays rapidly over time, so data must be processed as soon as possible after it is produced. Traditional periodic batch processing clearly cannot meet this demand.

With the development of the mobile Internet, the Internet of Things and sensors, large volumes of streaming data are generated, and proprietary stream-processing platforms such as Storm and Kafka have emerged accordingly. In recent years many databases have begun to support stream processing, such as MemSQL and PipelineDB, while some proprietary streaming platforms have begun to provide SQL interfaces; for example, KSQL provides a streaming SQL engine on top of Kafka.
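The core primitive of stream processing is computing over a moving window of recent events rather than over a complete data set. A minimal pure-Python sketch of a time-based sliding-window sum (a toy stand-in for what engines like KSQL do; the event shape is illustrative):

```python
from collections import deque

# Minimal sliding-window aggregation over a stream of (timestamp, value)
# events: after each event, report the sum over the last `window` time units.
def windowed_sum(events, window):
    buf = deque()  # (timestamp, value) pairs currently inside the window
    total = 0
    for ts, value in events:
        buf.append((ts, value))
        total += value
        # Evict events that have fallen out of the window.
        while buf and buf[0][0] <= ts - window:
            _, old = buf.popleft()
            total -= old
        yield ts, total

stream = [(1, 10), (2, 20), (3, 30), (12, 5)]
out = list(windowed_sum(stream, window=10))
```

Note how the last event evicts the first two, so the reported sum drops: timeliness is built into the computation itself.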

3) Higher-order use

As data is used more deeply, its use is no longer limited to simple inserts, updates, deletes and group-by aggregations; more advanced uses are gradually drawing attention, for example analyzing data with machine-learning, statistical-analysis and pattern-recognition algorithms.
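As a tiny illustration of "higher-order" use beyond aggregation, here is an ordinary-least-squares line fit in pure Python, a toy stand-in for the statistical analysis an in-database analytics function might run:

```python
# Simple statistical analysis over data: ordinary least squares fit of
# y = a*x + b (a toy stand-in for in-database machine learning).
def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var       # slope
    b = my - a * mx     # intercept
    return a, b

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # data lies exactly on y = 2x + 1
```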

1.4 Comparison: OLTP vs OLAP

The characteristics above can be summarized side by side:

| Aspect | OLTP | OLAP |
|---|---|---|
| Data source | generated in the system | operational data from production systems |
| Workload | short transactions | complex analytical queries |
| Data per operation | very small | very large |
| Users | many operators, high concurrency | few business analysts and managers |
| Access pattern | index-based | multi-table joins, full table scans |
| Total data volume | relatively small | relatively large |

II. Data Processing Modes

Faced with the complex and varied application scenarios above, should the various categories of data applications be handled by a single platform or by different platforms? Generally speaking, a specialized system will outperform a general-purpose system by one or two orders of magnitude, which argues for using different systems for different services. On the other hand, as the ancients said, "the great trend under heaven is that what is long divided must unite, and what is long united must divide"; in data processing there is likewise a trend toward handling everything on a single platform.

The core choice here is how to weigh requirements against technology dialectically. The two form a pair of contradictions: when the tension between them is loose, data processing tends toward integration; when it is acute, data processing tends toward fragmentation. Given the current state of hardware and software technology and present demand, the trend toward integration is the more pronounced. An integrated data platform will be able to satisfy the vast majority of users, with only a very small number of enterprises needing specialized systems for their special requirements.

2.1 Distributed (Proprietary Platforms)

At present the conventional approach is to use several specialized platforms, each handling data for a different scenario. The result is a cross-platform architecture with data transfer between systems, which brings two problems: data synchronization and data redundancy. The core issue with synchronization is the timeliness of the data, since stale data often loses its value.

Common practices are as follows:

Data changes in the OLTP system are exposed as logs, transmitted through a message queue for decoupling, and pulled by an ETL consumer at the back end to synchronize the data into the OLAP system. The whole chain is long, which is a challenge for scenarios with strict timeliness requirements.
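The chain above (change log, queue, ETL consumer, analytical copy) can be sketched as a toy change-data-capture pipeline. The queue stands in for a broker such as Kafka, and the event shape is an assumption for illustration:

```python
from queue import Queue

# Toy change-data-capture pipeline: OLTP-side changes are published as log
# events to a queue (standing in for a message broker), and an ETL consumer
# applies them to an analytical copy. The event shape is illustrative.
log = Queue()

def publish(op, key, row=None):
    log.put({"op": op, "key": key, "row": row})

def consume(store):
    # Replays the change log in order onto the analytical copy.
    while not log.empty():
        ev = log.get()
        if ev["op"] in ("insert", "update"):
            store[ev["key"]] = ev["row"]
        elif ev["op"] == "delete":
            store.pop(ev["key"], None)

publish("insert", 1, {"name": "a", "qty": 3})
publish("insert", 2, {"name": "b", "qty": 1})
publish("update", 1, {"name": "a", "qty": 5})
publish("delete", 2)

analytical_copy = {}
consume(analytical_copy)
```

Every hop in this chain adds latency and another copy of the data, which is exactly the cost the text describes.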

In addition, as data flows along this chain, multiple redundant copies of it are kept; in a conventional high-availability environment each of those copies is further replicated, so large costs in technology, labor and data synchronization are hidden inside. Spanning so many technology stacks and database products also means that each stack needs its own team for support and maintenance, such as DBAs, big data and infrastructure, implying huge costs in manpower, technology, time, and operations. Precisely in order to meet diverse business needs at once, improve timeliness, reduce data redundancy and shorten the chain, converging the technology stack becomes very important. This is the starting point for general-purpose platform solutions.

2.2 Centralized (General-Purpose Platform)

Users are tired of juggling different systems for different data-processing tasks, and increasingly prefer an integrated platform that handles all of an enterprise's data types. For scenarios that combine online transaction processing with online real-time analysis, this is the HTAP discussed below. A general-purpose platform of this kind has the following advantages:

- Data integration avoids information silos and makes sharing and unified data management easier.
- An SQL-based integrated platform provides good data independence, letting applications focus on business logic without attending to low-level details of data handling.
- An integrated platform offers fresher and more complete data, enabling faster and more accurate analysis and decision-making for the business.
- It avoids the glue between systems; the enterprise's overall technical architecture is simpler, needs no complex data import/export, and is easier to manage and maintain.
- It simplifies training and knowledge sharing, since there is no need to train developers, operators and administrators for multiple specialized systems.

III. HTAP

HTAP stands for Hybrid Transaction/Analytical Processing. A 2014 Gartner report used the term to describe a new application architecture that breaks down the wall between OLTP and OLAP, serving both transactional and analytical database scenarios to enable real-time business decisions.

The advantages of this architecture are obvious: it avoids tedious and expensive ETL operations, and it allows the latest data to be analyzed sooner. This ability to analyze data quickly will become one of the core competitive strengths of enterprises.

3.1 Technical Points

- There is either only one copy of the underlying data, or it can be replicated quickly, while still sustaining highly concurrent real-time updates.
- To accommodate massive data volumes, it scales linearly in both storage and compute.
- It has a good optimizer that can satisfy both transactional and analytical statements.
- It supports standard SQL along with technologies such as secondary indexes, partitioning, columnar storage and vectorized execution.

3.2 Key Technologies: Row and Column Storage

1) Row storage (Row-based)

Row-based storage is generally used by traditional relational databases, such as Oracle Database, MySQL, IBM's DB2 and Microsoft's SQL Server. In a row-store, data is organized into row-based logical storage units, and the data of one row is laid out contiguously on the storage medium.

2) column storage (Column-based)

Column storage is the counterpart of row storage. Newer distributed databases such as HBase, HP Vertica and EMC Greenplum all adopt column storage. In a column-store, data is organized into column-based logical storage units, and the data of one column is laid out contiguously on the storage medium.

A traditional row-oriented database stores data by row, and maintaining large numbers of indexes and materialized views is expensive in both time (processing) and space (storage). A column database, by contrast, stores data by column; each column is stored separately, and the data itself serves as the index. Only the columns involved in a query are accessed, which greatly reduces system I/O; each column is processed by its own routine, and because the values in a column share one data type and similar characteristics, they compress very well.
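The contrast between the two layouts can be sketched in a few lines (an in-memory toy, not any real storage engine; the table and column names are illustrative):

```python
# Toy contrast between row-major and columnar layouts of the same table.
rows = [
    {"id": 1, "region": "north", "amount": 10},
    {"id": 2, "region": "south", "amount": 20},
    {"id": 3, "region": "north", "amount": 30},
]

# Row store: each record's fields sit together -- cheap point reads,
# but an aggregate must walk every field of every row.
row_total = sum(r["amount"] for r in rows)

# Column store: each column sits together -- an aggregate touches only the
# one column it needs, and the homogeneous values compress well.
columns = {
    "id": [1, 2, 3],
    "region": ["north", "south", "north"],
    "amount": [10, 20, 30],
}
col_total = sum(columns["amount"])
```

Both layouts hold the same data and give the same answer; what differs is how much of the storage an analytical query has to touch.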

3.3 Key Technologies: MPP

MPP (Massively Parallel Processing) refers to a shared-nothing database cluster in which each node has its own independent disk and memory. Business data is partitioned across the nodes according to the database model and application characteristics; the nodes, connected by a private or commodity network, cooperate to provide database services as a whole. Shared-nothing database clusters offer full scalability, high availability, high performance, an excellent price-performance ratio, resource sharing and other advantages.

Put simply, MPP scatters a task across many servers and nodes in parallel; after the computation finishes on each node, the partial results are gathered together to produce the final result. Greenplum's architecture is a typical example of an MPP product.
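The scatter-gather pattern just described can be sketched in a few lines, with threads standing in for cluster nodes (a simulation, not a distributed system; the partitioning scheme is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

# Toy scatter-gather: partition the data across "nodes", compute partial
# aggregates in parallel, then merge the partials into the final result.
data = list(range(1, 101))  # 1..100
n_nodes = 4
partitions = [data[i::n_nodes] for i in range(n_nodes)]  # round-robin split

def node_sum(part):
    # The work each node performs locally on its own shard.
    return sum(part)

with ThreadPoolExecutor(max_workers=n_nodes) as pool:
    partials = list(pool.map(node_sum, partitions))  # scatter

total = sum(partials)  # gather/merge
```

In a real MPP system the "nodes" are separate machines with their own disks and memory, and the merge step runs on a coordinator, but the division of work is the same.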

3.4 Key Technologies: Resource Isolation

OLTP and OLAP use resources differently, so they must be isolated at the resource level to avoid interfering with each other. A common practice is to define resource queues and assign users to them, which provides the isolation.
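The resource-queue idea can be sketched with counting semaphores: cap the number of concurrent slots for each workload class separately, so a burst of heavy analytical queries cannot starve transactional work. The slot counts here are illustrative assumptions, not recommendations:

```python
import threading

# Toy resource queues: separate concurrency caps per workload class.
oltp_slots = threading.Semaphore(8)   # many small transactions allowed
olap_slots = threading.Semaphore(2)   # few heavy analytical queries allowed

peak = {"olap": 0}
running = {"olap": 0}
lock = threading.Lock()

def olap_query():
    with olap_slots:  # blocks while both analytical slots are busy
        with lock:
            running["olap"] += 1
            peak["olap"] = max(peak["olap"], running["olap"])
        # ... the heavy query would run here ...
        with lock:
            running["olap"] -= 1

threads = [threading.Thread(target=olap_query) for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Even with six analytical queries submitted, at most two ever run at once; the OLTP pool is untouched.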

3.5 HTAP Products

The accompanying figure classified database products relevant to the HTAP category. Of course, this is just one author's opinion, for reference only!

Author: Han Feng

First published on the author's personal WeChat account, "Han Feng Channel".

Source: Yixin Institute of Technology



