Tencent's Xu Chunming: HBase Practice and Innovation in the Internet Finance Industry

This article is based on Xu Chunming's live talk at the Ninth China Database Technology Conference (DTCC) on May 11, 2018.

Speaker profile:

Xu Chunming heads Tencent's financial and payment database platform group. He has more than ten years of experience in the database industry and specializes in MySQL operations and kernel development. He has recently been studying HBase-related technology in depth and is responsible for landing HBase in Tencent's payment scenarios.

Summary:

This talk covers the challenges relational databases face in Tencent's financial payment scenarios, the main reasons for choosing HBase, and how we built a high-performance, highly reliable, and highly available HBase data center. It closes with the challenges HBase meets in internet finance businesses and our responses.

Outline:

1. Relational database challenges: scalability, maintainability, ease of use

2. How to build an HBase data center: high performance, high reliability, high availability

3. HBase challenges and responses: business limitations, secondary indexing, cluster replication

Text:

I. Relational database challenges: scalability, maintainability, ease of use

Tenpay, founded by Tencent in 2005, originally handled payments on the PC side and gradually faded from users' view after 2013. Today it mainly provides middle- and back-office business and financial services for WeChat red packets. In 2005 Tencent's database setup was still simple, basically MySQL, with a peak of only a few hundred transactions per second; the focus then was data security, compatibility, and cross-city disaster recovery. With the arrival of WeChat red packets in 2015, however, peak transaction volume grew 100-fold within a year, and we ran into many problems with performance, disaster recovery, and usability. The team did a great deal of work to solve them and eventually settled on a basic architecture of a transaction system plus a historical database.

The architecture above is built mainly on MySQL components and uses three storage engines. The real-time transaction cluster on the left uses InnoDB; its machines have strong IO capability, large memory, and good CPUs. It contains hundreds of shard groups, because WeChat Pay transactions are very complex, involving distributed transactions as well as intra-city and cross-city disaster recovery. Running MySQL with InnoDB, we found that every holiday our online capacity ran short, because even high-performance servers have limited capacity.

Holidays drove trading peaks, producing a jagged load curve, and database capacity also kept growing rapidly. Our answer was to drop unimportant fields, optimize indexes, shrink data content, and migrate historical data to the history cluster. Before 2015 the migration was not real-time: we typically batch-exported during the next day's trough, imported into the history cluster, and then deleted from the transaction cluster.

This approach solved the high-concurrency, high-volume problem for online transactions, but plenty of problems remained whenever the historical data was actually needed.

Our challenges fall into the areas shown in the figure above. The first is horizontal scaling: regulators require financial institutions to retain transaction data, and as it accumulates, single-machine storage and performance quickly hit their ceiling. The architecture could no longer cope with the data scale, so data splitting had to be considered; but splitting a traditional DB is hard, the workload is huge, and the cycle is long.

The second is data migration. We had a scheduled strategy that moved data from the transaction DB to the history DB. But with sharded databases and tables the migration rules are complex, which made the Spider configuration on top complicated and error-prone. Several times the business could not find its data, leaving business data incomplete.

Another problem is that the history DBs run on cheap, high-capacity machines that fail easily. Although we kept multiple history replicas and relied on replication to ride out failures, the data volume was so large that in operation the slave could never catch up with the master. Once the master failed, the lagging slave could not be promoted, so the only option was to repair the failed master, which is a big risk.

For batch queries, the traditional history DB used the TokuDB engine: the data is highly compressed, and disk IO and memory performance are relatively poor, so large scans hit single-machine performance hard and easily drag down the history database. There were also business-side query problems, such as inaccurate results and confused data.

In short, once the business was running, the database kept having problems, which badly hurt its reputation.

Our solution at the time was to compare candidate databases and pick the best fit. We compared Redis, MongoDB, and HBase, and finally chose HBase, mainly for its high write throughput; its read performance is comparatively low, but that matters little to us, since we query historical data rarely and far less demandingly than the online trading business does. The second reason is its flexible scalability: as long as the HBase cluster is stable, expansion is very easy. HBase's reliability is also high, so the team has much less to do and can focus its energy on other work.

II. How to build an HBase data center: high performance, high reliability, high availability

After settling on HBase for the data center, the next question was how to build it. The team set targets of 1 million+ writes per second, 99.9% availability, and zero records lost or wrong. The implementation proceeded in two steps.

First, we had to get the data into HBase. Our data source is MySQL; the log parsing system DBSync caches the data and then writes it into HBase. The intermediate pipeline was not complicated. At the time, the leader in charge of data mining came to us because their statistics were repeatedly inaccurate and drew complaints; and purely from a cost standpoint, we did not need to request extra servers for this, since we could write the data straight into our own HBase. Next, how is the data queried once it is in? Our trading business is written mostly in C++, while HBase is in Java; HBase's Thrift interface supports cross-language access, so we built an API on top of Thrift to bridge the languages.
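The talk does not show client code; as a minimal sketch of that Thrift path, here is a hedged Java example against HBase's thrift2 interface. The host, port, table name, and rowkey are hypothetical, the team's real clients were C++, and whether framed transport is needed depends on the server configuration:

```java
import java.nio.ByteBuffer;
import org.apache.hadoop.hbase.thrift2.generated.TColumnValue;
import org.apache.hadoop.hbase.thrift2.generated.TGet;
import org.apache.hadoop.hbase.thrift2.generated.THBaseService;
import org.apache.hadoop.hbase.thrift2.generated.TResult;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class ThriftHistoryQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint; assumes the default unframed binary protocol.
        TTransport transport = new TSocket("thrift-server.example", 9090);
        transport.open();
        THBaseService.Client client = new THBaseService.Client(new TBinaryProtocol(transport));

        // Point lookup of one historical order by rowkey.
        TGet get = new TGet(ByteBuffer.wrap("order-20180511-0001".getBytes("UTF-8")));
        TResult result = client.get(ByteBuffer.wrap("trans_history".getBytes("UTF-8")), get);
        for (TColumnValue cv : result.getColumnValues()) {
            System.out.println(new String(cv.getQualifier(), "UTF-8")
                    + " = " + new String(cv.getValue(), "UTF-8"));
        }
        transport.close();
    }
}
```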

How do business processes use it? HBase is not just our store for historical data. Besides the accounting DB, other transaction-related DBs share the same problem of filling up easily. To make HBase more useful in specific scenarios, take a refund of an order more than a month old, where the online data is no longer retained that long: we look the order up in HBase and write it back into MySQL. The core refund and financial flows still run entirely in MySQL, which minimizes changes to the business and keeps operations simple.
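As an illustration of that refund path, a sketch of the backfill step follows; all table, column, endpoint, and credential details are invented for the example:

```java
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RefundBackfill {
    public static void main(String[] args) throws Exception {
        String orderId = "order-20180101-0001";  // an order more than a month old
        try (org.apache.hadoop.hbase.client.Connection hbase =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table history = hbase.getTable(TableName.valueOf("trans_history"));
             java.sql.Connection mysql = DriverManager.getConnection(
                     "jdbc:mysql://mysql.example:3306/pay", "pay", "secret")) {

            // 1. Fetch the old order from the HBase history store.
            Result r = history.get(new Get(Bytes.toBytes(orderId)));
            byte[] amount = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("amount"));

            // 2. Write it back into MySQL so the unchanged core refund and
            //    financial flows can run there as usual.
            try (PreparedStatement ps = mysql.prepareStatement(
                    "INSERT INTO refund_orders (order_id, amount) VALUES (?, ?)")) {
                ps.setString(1, orderId);
                ps.setString(2, Bytes.toString(amount));
                ps.executeUpdate();
            }
        }
    }
}
```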

HBase's write path follows the LSM idea: data is first written to memory and acknowledged, then flushed to disk asynchronously. This leaves many small files on disk, which hurts query performance, so the small files must be compacted to reduce their number. Because compaction tasks were backing up, we increased the number of small-compaction threads. We also tuned the WAL parameters. Anyone who has worked on relational databases will recognize the trade-off: HBase offers four durability levels. The first writes to memory only, skipping the WAL; the second writes the WAL asynchronously; the third writes the WAL to the operating system without flushing to disk; the fourth flushes the WAL to disk in real time. Weighing performance against data reliability, we chose writing to the operating system without flushing, reasoning that the odds of all three replicas failing at the same moment are low; this reduced IO pressure and latency. We also shrank the read-cache share of the RegionServer heap: our workload is write-heavy, and the cache hit rate was genuinely low, only about 10% to 15%, so we allocated the memory to writes to raise write volume. Finally, since this is a batch-write business, the client can tune the number of records per batch to improve write performance.
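A hedged sketch of the client side of these write settings, using the standard HBase Java API; the table, column family, and buffer size are hypothetical:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchWriter {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             BufferedMutator mutator = conn.getBufferedMutator(
                     new BufferedMutatorParams(TableName.valueOf("trans_history"))
                             .writeBufferSize(4L * 1024 * 1024))) {  // batch size: tune per workload
            for (int i = 0; i < 1000; i++) {
                Put put = new Put(Bytes.toBytes("order-" + i));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("amount"), Bytes.toBytes("100"));
                // SYNC_WAL: the WAL edit reaches the filesystem (OS page cache /
                // DataNode pipeline) without an fsync, the trade-off chosen above.
                put.setDurability(Durability.SYNC_WAL);
                mutator.mutate(put);  // buffered client-side; flushed in batches
            }
            mutator.flush();
        }
    }
}
```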

After optimization, the stress test reached 1.5 million+ writes per second, a very good result.

Next came read optimization. Our servers are low-end with weak CPUs, so queries basically lean on raw disk IO. One drawback of the LSM design is that too many files hurt query performance: to find the row with rowkey=100 when there are 100 files, each file may contain a record for that rowkey, so all 100 must be scanned, which takes a long time. Our first fix was to cut the number of files per region, merging files regularly and consolidating small files into large ones. The second was group isolation: we moved certain tables into their own group to reduce coupling, and the machines in the group serve only those tables' reads and writes, so different tables cannot interfere with each other. There was also parameter tuning: early on we created tables with far too many regions, which stressed the whole cluster, so we reduced and capped the region counts of small tables. We also graded businesses by priority and treated the levels differently; some index tables can be built per year, others per month. Finally came a hardware upgrade, configuring higher-performance servers.
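The scheduled file merging can be driven through the Admin API; a minimal sketch, assuming a hypothetical table name and an off-peak scheduler around it:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class OffPeakCompaction {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Merge the many small HFiles under each region into large ones so a
            // point query no longer has to touch every file. Run this off-peak.
            admin.majorCompact(TableName.valueOf("trans_history"));
        }
    }
}
```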

After optimization, query latency dropped from 260 milliseconds to 60 milliseconds, and the latency curve gradually flattened out.

For high availability, first, HBase itself must stay available; the basics are avoiding full GC and avoiding a single blocked region making a RegionServer unavailable. Second, when a machine fails, how do we keep the whole cluster's reads and writes unaffected? We use a fast-eviction mechanism so that one heavily loaded server cannot drag down the cluster. Third, to cover network failures and machine-room failures, we run a primary and a standby cluster; with three replicas in each, that is six copies and a very high cost, so we considered reducing the replica count to two to save money. Fourth, to avoid avalanches: whether via Thrift Server or direct RPC, calls can run long, so we shortened the Thrift Server timeout and reduced the number of retries. Finally, machines are grouped by business importance to reduce coupling with other machines.
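A sketch of the timeout-plus-bounded-retries idea on the Thrift client side; the timeout and retry values are invented, since the talk only says they were shortened and reduced:

```java
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class BoundedRetryClient {
    // Invented values; chosen so slow calls fail fast instead of piling up
    // and snowballing into an avalanche.
    static final int TIMEOUT_MS = 2000;
    static final int MAX_RETRIES = 2;

    static void queryOnce() throws Exception {
        for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
            TTransport transport = new TSocket("thrift-server.example", 9090, TIMEOUT_MS);
            try {
                transport.open();
                // ... issue the query via THBaseService.Client as sketched earlier ...
                transport.close();
                return;  // success
            } catch (Exception e) {
                transport.close();
                if (attempt == MAX_RETRIES) throw e;  // give up rather than pile on
            }
        }
    }
}
```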

The figure above is the flow chart for the high-availability single-point guarantee: a client write first consults the Meta table for the RegionServer location, writes the data to that RegionServer, and the data is then written asynchronously down to the Hadoop layer for replication. As you can see, the RegionServer is the only single point in the cluster; every other component has a high-availability scheme. When a RegionServer fails, in-cluster region migration and recovery usually takes several minutes, which is unacceptable for OLTP business. The alternatives are to fail the business over to the standby cluster quickly, or to use region replicas.

We use three methods to guarantee data accuracy. The first is reconciliation, which every financial system has; any system will have bugs, so reconciliation is critical and must not go wrong. We run 10-minute incremental reconciliation and redundant full reconciliation, and we require the source IP to be separate from the checking IP and the source DB separate from the reconciliation DB. The second is reducing human error through process standards, operating and maintaining through systematic tooling. The third is more efficient monitoring, including second-granularity monitoring, so that problems are found and fixed fast; any reconciliation anomaly is reported immediately to the Aegis security system. Together, these three methods improved our data accuracy by an order of magnitude.
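A minimal sketch of the 10-minute incremental reconciliation, assuming a MySQL source with an mtime column and an HBase history table keyed by order id; all names, endpoints, and the alerting hook are hypothetical:

```java
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class IncrementalReconciler {
    public static void main(String[] args) throws Exception {
        // The reconciliation DB and IP are deliberately separate from the
        // write path, per the isolation rule above; credentials are invented.
        try (java.sql.Connection src = DriverManager.getConnection(
                     "jdbc:mysql://recon-replica.example:3306/pay", "recon", "secret");
             org.apache.hadoop.hbase.client.Connection hbase =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table history = hbase.getTable(TableName.valueOf("trans_history"))) {

            // Pull the last 10 minutes of source records.
            PreparedStatement ps = src.prepareStatement(
                    "SELECT order_id, amount FROM orders"
                    + " WHERE mtime >= NOW() - INTERVAL 10 MINUTE");
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                String orderId = rs.getString("order_id");
                Result r = history.get(new Get(Bytes.toBytes(orderId)));
                byte[] amount = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("amount"));
                if (amount == null || !rs.getString("amount").equals(Bytes.toString(amount))) {
                    // Hypothetical hook: a real system would raise an alarm here
                    // (the talk reports anomalies to the Aegis security system).
                    System.err.println("RECON MISMATCH: " + orderId);
                }
            }
        }
    }
}
```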

Updates deserve separate treatment. Financial data falls into two kinds. The first maps directly, such as append-only flow records of user funds, and is relatively simple. The harder kind is state data reconciled in time batches, such as balances and order statuses. We see tens of billions of records a day, averaging hundreds of thousands per second; how do we reconcile that? The system adds a modification time to make reconciliation more precise, but that creates data redundancy: an order number corresponds to one record in the trading system, yet HBase has no update here, everything is a direct mapping, so there would be two records, and a refund in that state could not succeed. To solve this, we split each update in the MySQL binlog into two messages, a delete plus an insert, so that only one record remains in HBase.

However, another problem appears: there may be two records within the same second, that is, the same order number completing a transaction within one second, and when the data is split, both records could end up deleted. Without a certain level of concurrency the problem never shows, but under load it does. We handle it by attaching a sequence number during the split and guaranteeing the insert is applied after the delete, which ensures exactly one record survives.
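A minimal sketch of the split-with-sequence-number idea; the message shape and field names are invented, and the real DBSync pipeline is not shown in the talk:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class UpdateSplitter {
    // One message handed to the HBase writer. 'seq' increases monotonically;
    // the writer applies messages in seq order, so the DELETE of the old row
    // image always lands before the INSERT of the new one, even when both
    // carry the same second-granularity timestamp.
    static final class Msg {
        final long seq; final String op; final String rowkey;
        Msg(long seq, String op, String rowkey) {
            this.seq = seq; this.op = op; this.rowkey = rowkey;
        }
    }

    private final AtomicLong seq = new AtomicLong();

    // Turn one binlog UPDATE into DELETE(old image) + INSERT(new image).
    List<Msg> splitUpdate(String oldRowkey, String newRowkey) {
        return Arrays.asList(
                new Msg(seq.incrementAndGet(), "DELETE", oldRowkey),
                new Msg(seq.incrementAndGet(), "INSERT", newRowkey));
    }
}
```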

We hit many other problems in practice. The first was separating HBase's read and write thread queues, because we write far more than we read; in the refund business, for example, we only write and never read. We tuned hbase.ipc.server.callqueue.read.share=0.3 to guarantee a share of threads for each side. We also saw one heavily loaded DataNode drag the whole cluster's write throughput down noticeably; we added second-granularity monitoring to spot such cases quickly and migrate the affected regions.
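That parameter lives in the server-side hbase-site.xml; a sketch follows, with the caveat that some HBase releases spell the property hbase.ipc.server.callqueue.read.ratio:

```xml
<!-- hbase-site.xml (server side) -->
<property>
  <!-- Reserve roughly 30% of the RPC call queues for reads so that heavy
       writes cannot starve queries. Note: some HBase releases name this
       property hbase.ipc.server.callqueue.read.ratio instead. -->
  <name>hbase.ipc.server.callqueue.read.share</name>
  <value>0.3</value>
</property>
```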

Then there are maintainability issues. Some HBase settings cannot be modified dynamically, there are no commands to inspect a cluster's operational state, and summary statistics are lacking. Capacity balancing is another: our data keeps growing, dozens of servers sit at 99% disk usage while some others use less than 1%, so data must migrate between servers. Rather than operating at the HBase layer, you can decommission the DataNode at the HDFS layer and let the NameNode migrate the blocks.
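The usual HDFS-layer procedure for this is DataNode decommissioning; a sketch, assuming the cluster's exclude file sits at the path configured by dfs.hosts.exclude (the hostname and path here are hypothetical):

```sh
# On the NameNode host: add the full/overloaded DataNode to the exclude file.
echo "dn-42.example" >> /etc/hadoop/conf/dfs.hosts.exclude

# Ask the NameNode to re-read the list; it decommissions the node and
# re-replicates its blocks onto emptier DataNodes.
hdfs dfsadmin -refreshNodes

# Watch until the node reports "Decommissioned" before servicing it.
hdfs dfsadmin -report
```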

Overall, although HBase has some shortcomings, its advantages are more obvious: its stable performance frees up manpower and saves costs.

III. HBase challenges and responses: business limitations, secondary indexing, cluster replication

Here are my thoughts on HBase. At this stage, HBase really fits only two application scenarios, which is a big limitation: it does not support SQL or cross-row and cross-table transactions, has no secondary indexes, and has high read latency. It cannot serve OLTP (On-Line Transaction Processing) business such as the core payment flow, but it suits storing historical data and handling needs such as historical reconciliation and historical data backtracking.

So how do we address transactions? First, although HBase has no cross-row or cross-table transactions, you can merge what would be several tables into one wide table, a single-table design, and lean on HBase's single-row transactions. When modeling, fold tables together as far as possible to avoid distributed transactions.
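For example, with a wide single-table design, a state change can ride on HBase's single-row atomicity via a check-and-put. This is a sketch using the classic 1.x-style API (newer releases favor the checkAndMutate builder); table, family, and status values are illustrative:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SingleRowTxn {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("orders_wide"))) {
            byte[] row = Bytes.toBytes("order-20180511-0001");
            byte[] cf  = Bytes.toBytes("cf");

            // Because everything about the order lives in ONE row of ONE wide
            // table, this state change is atomic without a distributed
            // transaction: flip status INIT -> PAID only if it is still INIT.
            Put put = new Put(row);
            put.addColumn(cf, Bytes.toBytes("status"), Bytes.toBytes("PAID"));
            boolean ok = table.checkAndPut(row, cf, Bytes.toBytes("status"),
                                           Bytes.toBytes("INIT"), put);
            System.out.println(ok ? "committed" : "lost the race; retry or abort");
        }
    }
}
```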

The second approach is a business-layer distributed transaction. For financial business, changing HBase itself is too hard, so make the change in the business layer using the two-phase commit protocol, which divides into a Prepare (P) phase and a Commit (C) phase. A transaction manager controls commit or rollback; a resource manager corresponds to a table, split across multiple resource managers with a lock-based scheme. The transaction manager is stateless, and its high-availability solutions are mature. The resource manager can be stateful; it can follow a master-slave scheme or be designed master-master. Whether distributed or single-machine, it wraps HBase and uses HBase's single-row transactions to control concurrency.

Here is a concrete illustration.
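Below is a bare-bones Java sketch of this business-layer two-phase commit; the interfaces are invented for illustration and are not the actual Tenpay implementation:

```java
import java.util.List;

// Invented interfaces for illustration only.
interface ResourceManager {
    boolean prepare(String txnId);  // P phase: lock the row, validate, record intent
    void commit(String txnId);      // C phase: make the intent visible
    void rollback(String txnId);    // undo the intent, release the lock (idempotent)
}

// Stateless driver, which is what makes its own high availability easy.
class TransactionManager {
    boolean execute(String txnId, List<ResourceManager> participants) {
        // Phase 1: every participant must vote yes.
        for (ResourceManager rm : participants) {
            if (!rm.prepare(txnId)) {
                // Any "no" vote rolls everyone back; rollback must be idempotent
                // because some participants may never have prepared.
                for (ResourceManager r : participants) r.rollback(txnId);
                return false;
            }
        }
        // Phase 2: all voted yes, so commit everywhere.
        for (ResourceManager rm : participants) rm.commit(txnId);
        return true;
    }
}
```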

The third approach is an open-source distributed transaction framework. The main ones are OMID and Percolator, and what they share is two-phase commit. The difference is that OMID has a global transaction manager that controls concurrency and decides which transaction to roll back, while Percolator relies mainly on HBase coprocessor techniques and is hard to put into practice.

The picture above shows applications of the coprocessor. The first is Phoenix, which supports SQL and secondary indexes via coprocessors, but it is unstable, and large data volumes can easily crash HBase. The second is our Observer-based implementation, which writes the index table before the main table. Finally, the tables are time-partitioned, designed as monthly or daily tables, so creating a new index does not require copying historical data.
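A hedged sketch of that Observer pattern with the HBase 1.x coprocessor API; the table, family, and qualifier names are invented, and error handling and index-row design are simplified:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Intercepts writes to the main table and writes the index row FIRST, so the
// worst failure mode is an orphaned index entry rather than a missing one.
public class SecondaryIndexObserver extends BaseRegionObserver {
    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                       Put put, WALEdit edit, Durability durability) throws IOException {
        Cell cell = put.get(Bytes.toBytes("cf"), Bytes.toBytes("user_id")).get(0);
        // Index rowkey = the queried column's value; value = the main-table rowkey.
        Put indexPut = new Put(CellUtil.cloneValue(cell));
        indexPut.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("main_row"), put.getRow());
        // Monthly index table, matching the time-partitioned design above.
        try (Table index = ctx.getEnvironment().getTable(
                TableName.valueOf("orders_idx_201805"))) {
            index.put(indexPut);  // index write completes before the main put proceeds
        }
    }
}
```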

HBase has a replication feature, but we found it unstable in use, so we later implemented double-writing at the application layer instead.

The figure above shows the isolation schemes HBase and Hadoop support. They achieve isolation to a degree, but it is incomplete, the coupling remains strong, and problem localization is complex, so true business isolation is still out of reach.

Finally, a word on the goals and direction of databases. Whether traditional relational or distributed, every database strives to provide data services that are highly available, high-performance, highly concurrent, easy to use, and easy to scale. There are many databases on the market today, but each falls short on a few features, so no single database rules them all; hence the hundred flowers blooming in database applications. That is also the direction and motivation of every database team.
