
How to modify and delete CarbonData

Updated 2025-01-18. From: SLTechnology News&Howtos (Shulou)


Shulou (Shulou.com) 06/01 report: This article explains how update and delete operations are implemented in CarbonData. Many CarbonData users ask how existing data can be modified or deleted, so the editor consulted the available material and summarized the design below.

CarbonData is an open-source columnar storage file format for Apache Hadoop, developed by Huawei. It supports indexing, compression, and encoding/decoding. Its purpose is to serve multiple kinds of workloads from a single copy of the data and to make interactive queries faster. At the time of writing, the project was in Apache incubation.

At the time of writing, CarbonData does not support modifying data that already exists in a table. In practice this feature is often needed, for example to modify dimension tables, correct data in fact tables, or clean data, and many CarbonData users have asked for update and delete support. To address this, the community opened issue CARBONDATA-440, whose goal is to provide Update/Delete functionality for CarbonData. The feature is expected to be released in CarbonData version 0.3.0. This article introduces the design and implementation of the Update/Delete function. The high-level design goals are:

(1) Provide a standard SQL interface for update and delete operations.

(2) When a CarbonData table is updated or deleted from, do not rewrite the entire existing CarbonData block. Instead, write the changes to delta files (differential files).

(3) After update and delete operations, CarbonData readers should skip deleted records and read updated records seamlessly, without requiring users to change their applications.

Update operation implementation

CarbonData stores its data on HDFS, and files in HDFS are immutable, so CarbonData data blocks cannot be modified in place. One way to update data is to delete and rewrite entire data blocks, but this is inefficient and can become a performance bottleneck. Instead, an update can be treated as a "delete" followed by an "insert", which is how CarbonData implements it. The update operation is divided into the following two steps:

Step 1 consists of two parts:

(1) First, CarbonData identifies the rows that need to be updated by performing filter and join operations. To identify individual rows, CarbonData uses the ROWID attribute. Once the rows to be updated are identified, they are marked as deleted in a separate file called the "Delete Delta" file, which is stored in the directory of the current table.

(2) Then CarbonData collects the column values to be updated from the source table and forms new rows. Each new row consists of the updated column values plus the remaining column values from the target table. These updated rows form a source RDD in the Spark processing layer.

Step 2: CarbonData uses the existing data loading path to convert the rows in the source RDD to the CarbonData format. This operation is similar to an incremental data load. The newly created CarbonData file is called the "Update Delta" file. Update Delta files are stored in the same segment, and an Update Delta file has its own B-tree and block-level statistics, just like a normal CarbonData file. The new B-tree is appended to the global B-tree and cached.
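The two steps above can be illustrated with a small in-memory sketch. This is not CarbonData code: the `Segment` class, its field names, and the `update` function are hypothetical stand-ins, and Python dictionaries and sets take the place of the real base, Delete Delta, and Update Delta files. It only shows the idea that an update never touches the base rows.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Toy stand-in for one segment (all names here are hypothetical)."""
    base_rows: dict                                    # ROWID -> row, never modified
    delete_delta: set = field(default_factory=set)     # ROWIDs marked as deleted
    update_delta: dict = field(default_factory=dict)   # ROWID -> rewritten row

    def visible_rows(self):
        """What a reader sees: base rows plus update delta, minus deleted ROWIDs."""
        all_rows = {**self.base_rows, **self.update_delta}
        return {rid: row for rid, row in all_rows.items()
                if rid not in self.delete_delta}


def update(segment, predicate, new_values):
    """Apply an update as 'delete' followed by 'insert'."""
    # Step 1 (1): filter to find the ROWIDs of the rows to update.
    matched = {rid: row for rid, row in segment.visible_rows().items()
               if predicate(row)}
    # Step 1 (2): build new rows from the updated values plus the unchanged ones.
    new_rows = [{**row, **new_values} for row in matched.values()]
    # Mark the old versions as deleted in the delete delta.
    segment.delete_delta.update(matched)
    # Step 2: append the new rows to the update delta, like an incremental load.
    for row in new_rows:
        segment.update_delta[f"u{len(segment.update_delta)}"] = row
    return len(matched)
```

For example, updating one row of a two-row segment leaves `base_rows` unchanged, adds one ROWID to `delete_delta`, and adds one rewritten row to `update_delta`.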

[Figure: sequence diagram of the CarbonData update operation]

Implementation of delete operation

When deleting data, CarbonData likewise identifies the rows to be deleted through filter and join operations, using the ROWID attribute to identify individual rows. Once the rows to be deleted are identified, they are marked as deleted in a separate file, also called the "Delete Delta" file. The CarbonData record scanner excludes these deleted rows from the result set. After a delete operation, CarbonData does not need to update the global dictionary, because some dictionary entries are still valid for other segments.
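A minimal sketch of this read path, under stated assumptions: a ROWID is modeled here as a `(block_id, offset)` pair, and the Delete Delta is modeled as a plain Python set. Both are illustrative choices, and `delete_rows` and `scan` are hypothetical names rather than CarbonData APIs.

```python
def delete_rows(blocks, delete_delta, predicate):
    """Mark matching rows as deleted without modifying the data blocks."""
    newly_deleted = 0
    for block_id, rows in blocks.items():
        for offset, row in enumerate(rows):
            rowid = (block_id, offset)          # hypothetical ROWID layout
            if rowid not in delete_delta and predicate(row):
                delete_delta.add(rowid)
                newly_deleted += 1
    return newly_deleted


def scan(blocks, delete_delta):
    """Yield every row whose ROWID is not recorded in the delete delta."""
    for block_id, rows in blocks.items():
        for offset, row in enumerate(rows):
            if (block_id, offset) not in delete_delta:
                yield row
```

The data blocks are only ever read; a delete adds entries to the delta, and the scanner filters on it at read time.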

Atomicity of delete operation

The delete operation in CarbonData is atomic: either all of the targeted rows are deleted or none of them are. The Delete Delta file produced by a delete operation is not visible to readers while the operation is still in progress. The newly deleted rows become invisible to readers only after the delete operation has completed successfully.

[Figure: sequence diagram of the CarbonData delete operation]
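One common way to get this kind of all-or-nothing visibility is to write the delta file first and then publish it with a single atomic step. The sketch below illustrates that general pattern on a local filesystem. It is an assumption for illustration, not CarbonData's actual mechanism: the `manifest.json` file, the function names, and the use of `os.replace` in place of an HDFS rename are all hypothetical.

```python
import json
import os


def read_manifest(table_dir):
    """Return the list of committed delta file names (empty if none yet)."""
    path = os.path.join(table_dir, "manifest.json")
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return json.load(f)


def commit_delete(table_dir, delta_name, rowids):
    """Write a delete delta file, then publish it atomically."""
    # 1. Write the delta file. Readers ignore it: it is not in the manifest yet.
    with open(os.path.join(table_dir, delta_name), "w") as f:
        json.dump(sorted(rowids), f)
    # 2. Publish by atomically swapping in a new manifest that lists the file.
    manifest = read_manifest(table_dir) + [delta_name]
    tmp_path = os.path.join(table_dir, "manifest.json.tmp")
    with open(tmp_path, "w") as f:
        json.dump(manifest, f)
    os.replace(tmp_path, os.path.join(table_dir, "manifest.json"))


def visible_deleted_rowids(table_dir):
    """Readers only consult delta files that the manifest lists."""
    deleted = set()
    for name in read_manifest(table_dir):
        with open(os.path.join(table_dir, name)) as f:
            deleted.update(json.load(f))
    return deleted
```

If the process fails before the final rename, the delta file may exist on disk but readers never see it, so the delete appears not to have happened at all.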

File merging

Each update operation generates Update Delta and Delete Delta files, so frequent update and delete operations produce more and more delta files. The result is many small files, which can degrade scan performance, so these delta files need to be merged. The operation of merging many delta files into one delta file is called compaction, or minor compaction.

Compaction can be triggered automatically by configuring a threshold on the number of delta files. After a delete or update operation, if the number of delta files has reached the configured threshold, compaction is triggered.
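The threshold rule can be sketched as follows. Each delete delta file is modeled here as a set of ROWIDs, and `maybe_compact` and its `threshold` parameter are hypothetical names used only for illustration, not CarbonData configuration keys.

```python
def maybe_compact(delta_files, threshold):
    """Merge all delta files into one once their count reaches the threshold.

    delta_files: list of sets of ROWIDs, one set per delete delta file.
    Returns the (possibly compacted) list of delta files.
    """
    if len(delta_files) < threshold:
        return delta_files                    # too few files, leave them as they are
    merged = set().union(*delta_files)        # one file holding every deleted ROWID
    return [merged]
```

Calling this after every update or delete keeps the number of delta files a scan has to open below the threshold, while the set of deleted ROWIDs stays the same.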

That concludes this overview of how update and delete operations are implemented in CarbonData.
