Rowkey Design for a Parallel Computing Architecture Based on HBASE

1. Big Data Application Scenarios for HBASE Storage, Computation and Query

Most of the massive data is transaction data, and transaction data is generated over time. The business time of a record may or may not follow the order in which the record is produced; for example, a transaction may occur at 10:00 in the morning but only be closed and written out at 5 p.m., so such data breaks time continuity when it is loaded into storage. Mining the massive data in turn produces statistical data, which also carries a time attribute. Saved statistics must, as far as possible, not change after the calculation; if new transaction data arrives after the statistics have been produced, the calculation has to be re-triggered and the previously stored results re-saved. The remaining data is general data, mainly configuration data. Based on these characteristics, the data can be divided into transaction data, statistical data and general data.

Queries follow the same classification. For transaction data, the user's query must specify a time range (if the user does not give one, the system supplies a default), because the transaction data is massive. Within that time range the data may be filtered, grouped, aggregated and joined with other tables under different conditions, so query efficiency is determined by how the data is persisted in files and by the structure of the index. Designing intelligently for these problems is one of the biggest issues in applying HBASE efficiently. Statistical data is more condensed than transaction data, but it faces the same query problems. General data does not involve complex query requirements, but for the long-term planning of the product, joins with other tables should still be considered. These are the three data forms of the big data application scenarios. Next I briefly analyze the architecture and functional features of HBASE in order to derive how the storage, computation and query requirements of these scenarios can be met.

2. Standard HBASE Functional Analysis

HBase is a distributed, column-oriented open source database and a sub-project of the Apache Hadoop project. Unlike a typical relational database, HBase is suited to storing unstructured data, and it is column-based rather than row-based. In the Hadoop stack, HBASE sits between HDFS and MapReduce; I will not repeat the basic introduction here, as plenty of reference material is available. The storage hierarchy of HBASE is RegionServer > Region > Store (MemStore) > StoreFile > HFile: HFile is the persistent storage medium for the data, and MemStore is its in-memory cache. HBASE stores columns as KeyValue pairs, and the Rowkey is the Key of the KeyValue, uniquely identifying a row. A Rowkey is a binary stream of at most 64 KB whose content is user-defined. Data is loaded in ascending binary order of the Rowkey, and HBASE automatically splits the data into multiple HFiles across multiple Regions according to its size.
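As a minimal sketch of how a single row is written, assuming the HBase 2.x Java client and a hypothetical daily table trade_20240601 with a column family cf (both names are illustrative): each column value is persisted as a KeyValue whose key begins with the Rowkey.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutDemo {
    public static void main(String[] args) throws Exception {
        // "trade_20240601" and column family "cf" are illustrative names only.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("trade_20240601"))) {
            // The Rowkey is an arbitrary byte array; HBASE keeps rows sorted by its binary order.
            byte[] rowkey = Bytes.toBytes("exampleRowkey");
            Put put = new Put(rowkey);
            // Each column is persisted as a KeyValue: (rowkey, family, qualifier, timestamp) -> value.
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("amount"), Bytes.toBytes("100.00"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("account"), Bytes.toBytes("A-0001"));
            table.put(put);
        }
    }
}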

HBASE retrieval is based on the Rowkey and supports three access modes: single-Rowkey access, that is, a get operation on one Rowkey value; a range scan over Rowkeys, that is, a scan bounded by startRowKey and endRowKey; and a full table scan, which reads every row record in the table. Retrieval by a single Rowkey is very efficient, typically taking less than 1 millisecond, so on the order of 1000 records can be fetched per second.
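A minimal sketch of the three access modes, assuming the HBase 2.x Java client (older clients use setStartRow/setStopRow) and the same hypothetical table trade_20240601; the Rowkey values are placeholders:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RowkeyAccessDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("trade_20240601"))) {

            // 1. Single-Rowkey access: a get on one Rowkey value.
            Result single = table.get(new Get(Bytes.toBytes("someRowkey")));

            // 2. Range scan: bounded by a start Rowkey (inclusive) and a stop Rowkey (exclusive).
            Scan rangeScan = new Scan()
                    .withStartRow(Bytes.toBytes("startRowKey"))
                    .withStopRow(Bytes.toBytes("endRowKey"));
            try (ResultScanner scanner = table.getScanner(rangeScan)) {
                for (Result r : scanner) {
                    // process each row in the range
                }
            }

            // 3. Full table scan: no start/stop row, every row record is read.
            try (ResultScanner scanner = table.getScanner(new Scan())) {
                for (Result r : scanner) {
                    // process every row in the table
                }
            }
        }
    }
}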

The system locates the Region that holds a given Rowkey (or Rowkey range) and routes the query request to that Region, so a reasonable distribution of the data is the design lever for improving retrieval performance. For example, fetching 1 million records at 1000 records per second from a single Region takes 1000 seconds. If the data is evenly distributed across the Regions of the cluster, the parallelism of the platform lets all Regions return data to the client at the same time: spread evenly over 100 Regions, the same 1 million records can be fetched in about 10 seconds. HBASE also supports pre-built (pre-split) Regions, which lets the user control the data distribution according to the characteristics of the data. The design of the Rowkey therefore controls the parallel computing efficiency of the platform; a minimal sketch of creating a pre-split table is shown below.
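A minimal sketch of pre-building Regions, assuming the HBase 2.x Admin API and that the first two Rowkey bytes are a hash prefix as in the Rowkey designs below; the table name, column family and region count are illustrative:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class PreSplitDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Split the 2-byte hash prefix space (0x0000~0xFFFF) into 16 evenly sized Regions,
            // so writes with a well-distributed prefix spread across all RegionServers.
            int regions = 16;
            byte[][] splitKeys = new byte[regions - 1][];
            for (int i = 1; i < regions; i++) {
                int prefix = i * 0x10000 / regions;               // boundary value of the hash prefix
                splitKeys[i - 1] = new byte[] { (byte) (prefix >> 8), (byte) prefix };
            }
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("trade_20240601"))
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                        .build(),
                splitKeys);
        }
    }
}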
3. Rowkey Design for Parallel Computing Based on the Functional Characteristics of HBASE

Given the characteristics of HBASE, the design of the Rowkey determines the parallel computing architecture.

3.1. Design principles

The first principle is Rowkey length. A Rowkey is a binary stream; many developers recommend a length of 10 to 100 bytes, but my advice is to keep it as short as possible and no longer than 16 bytes. First, the persistent file HFile stores data as KeyValue pairs. If the Rowkey is 100 bytes long, 10 million rows will spend 100 * 10 million = 1 billion bytes, nearly 1 GB, on Rowkeys alone, which greatly hurts HFile storage efficiency. Second, MemStore caches part of the data in memory; if the Rowkey field is too long, the effective utilization of memory drops, the system can cache less data, and retrieval efficiency falls. Third, current operating systems are 64-bit and align memory on 8-byte boundaries; keeping the Rowkey within 16 bytes, an integer multiple of 8 bytes, makes the best use of the operating system.

The second principle is Rowkey hashing. If the Rowkey grows with a timestamp, do not put the time at the front of the binary encoding. Instead, use the high bytes of the Rowkey as a hash field generated cyclically by the program, and put the time field in the low bytes. This raises the probability that the data is balanced across the RegionServers and achieves load balancing. Without a hash field, putting the time directly in the first field creates a hot spot: all new data accumulates on a single RegionServer, the retrieval load concentrates on individual RegionServers, and query efficiency drops.

The last principle is uniqueness: the Rowkey must be designed to be unique.

3.2. Architecture model

Based on the length, hashing and uniqueness principles, I propose different Rowkey designs for different application scenarios.

Rowkey design for transaction data: transaction data has a time attribute, and I store the time information in the Rowkey, which helps speed up query retrieval. By default I build one table per day for transaction data, and the benefits of this design are manifold. With a table per day, the date part can be removed from the time information, leaving only the time of day down to the millisecond, which fits in 4 bytes; adding a 2-byte hash field gives a unique Rowkey of 6 bytes in total (a sketch of assembling this key is given at the end of this section). The layout is:

Transaction data Rowkey layout
  byte 0 - byte 1: hash field, 0~65535 (0x0000~0xFFFF)
  byte 2 - byte 5: time field, millisecond of the day, 0~86399999 (0x00000000~0x05265BFF)
  byte 6 onwards: extension field

This design saves nothing at the level of operating system memory management, because 64-bit operating systems align on 8 bytes, but for the Rowkey portion of persistent storage it saves 25% of the overhead. One might ask why the time field is not stored in host byte order so that it could double as the hash field. The reason is that data within a time range should stay as contiguous as possible: searches over the same time range are very common, and contiguity benefits query retrieval, so an independent hash field works better. For some applications the hash field can also carry field information from the data itself, as long as the same hash value is unique within the same millisecond.

Rowkey design for statistical data: statistics also carry a time attribute, and their smallest unit is the minute (pre-computed statistics at second granularity are meaningless). Statistical data likewise uses a table per day by default, with the same benefits as above. With a table per day, the time information only needs to keep the minute of the day (0~1439), which fits in 2 bytes. Because some statistical dimensions have very high cardinality, 4 bytes are needed as a sequence field, so the hash field doubles as the sequence field, again giving a unique Rowkey of 6 bytes. The layout is:

Statistical data Rowkey layout
  byte 0 - byte 3: hash field (sequence field), 0x00000000~0xFFFFFFFF
  byte 4 - byte 5: time field, minute of the day, 0~1439 (0x0000~0x059F)
  byte 6 onwards: extension field

As before, this design saves nothing at the level of operating system memory management, because 64-bit operating systems align on 8 bytes, but for the Rowkey portion of persistent storage it saves 25% of the overhead. Pre-computed statistics may need to be recalculated, so it must be possible to delete invalidated data effectively without disturbing the balance of the hash distribution; this requires special handling.

Rowkey design for general data: general data uses an auto-incrementing sequence as its unique primary key, and the user can choose either a table per day or a single table. When multiple loading modules run at the same time, the uniqueness of the hash field (sequence field) must be guaranteed; consider giving each loading module a distinct unique factor. The layout is:

General data Rowkey layout
  byte 0 - byte 3: hash field (sequence field), 0x00000000~0xFFFFFFFF
  byte 4 onwards: extension field (controlled within 12 bytes), which may be composed of several user fields

Conclusion: the above summarizes the key points of Rowkey design in HBASE's parallel computing architecture. Besides the Rowkey there are other factors that influence parallel computing, which will be explained in other chapters.
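To make the transaction-data layout from section 3.2 concrete, here is a minimal sketch of assembling the 6-byte Rowkey, assuming a cyclic in-process counter as the 2-byte hash field and the millisecond of the business day as the 4-byte time field; the class name and counter scheme are illustrative, not part of the original design.

import java.nio.ByteBuffer;
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;

public class TransactionRowkey {
    // Cyclic counter used as the 2-byte hash prefix (0x0000~0xFFFF),
    // so that consecutive writes spread over the pre-split Regions.
    private static int counter = 0;

    /** Builds the 6-byte Rowkey: 2-byte hash field + 4-byte millisecond-of-day time field. */
    public static synchronized byte[] build(LocalDateTime businessTime) {
        int hash = counter++ & 0xFFFF;
        // The date part is dropped because transaction tables are built per day;
        // only the millisecond of the day remains: 0 ~ 86,399,999 (0x05265BFF).
        int millisOfDay = (int) Duration.between(LocalTime.MIDNIGHT,
                businessTime.toLocalTime()).toMillis();
        return ByteBuffer.allocate(6)
                .putShort((short) hash)
                .putInt(millisOfDay)
                .array();
    }
}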
