A complete solution for Apache ShardingSphere data desensitization

2025-07-19 Update. From: SLTechnology News&Howtos

Detailed explanation of Apache ShardingSphere data desensitization solution

A brief introduction to the author

Pan Juan is a senior DBA at JD Digits, mainly responsible for database middleware development, database operations automation platform development, and production database operations and maintenance. Pan Juan has repeatedly taken part in supporting JD.com's 6.18 and 11.11 promotional events, was previously responsible for the design and development of the JD Digits database automation platform, and now focuses on the development of the Apache ShardingSphere distributed database middleware. Keen to learn and explore in databases, automation, distributed systems, middleware and related fields.

I. Background

Security control has always been an important part of governance, and data desensitization falls within it. For both Internet companies and traditional industries, data security is an important and sensitive topic. Data desensitization transforms sensitive information according to desensitization rules so that private data is reliably protected. Data that involves customer security or is commercially sensitive, such as mobile phone numbers, card numbers, customer numbers and other personal information, must be desensitized in accordance with the regulations of the relevant departments.

In real business scenarios, business development teams often have to implement and maintain their own encryption and decryption system to meet the requirements of the company's security department, and whenever the desensitization scenario changes, that self-maintained system risks being rebuilt or modified. In addition, for businesses that are already online, how can desensitization be introduced seamlessly, without modifying business logic or business SQL?

In response to the industry's demand for desensitization and the pain points of business transformation, Apache ShardingSphere provides a complete, safe, transparent and low-cost integrated data desensitization solution.

II. Preface

Apache ShardingSphere is an ecosystem of open source distributed database middleware solutions. It consists of Sharding-JDBC, Sharding-Proxy and Sharding-Sidecar (planned), which are independent of each other but can be deployed together. They provide standardized data sharding, distributed transactions and distributed governance, and can be applied to a variety of scenarios, such as homogeneous Java applications, heterogeneous languages, containers and cloud native environments.

The data desensitization module is a sub-module of ShardingSphere's distributed governance core function. It parses the SQL entered by the user and rewrites it according to the desensitization configuration the user provides, so that the original data is encrypted and the ciphertext, together with the original data (optional), is stored in the underlying database. When the user queries the data, the module reads the ciphertext from the database, decrypts it, and returns the decrypted original data to the user. Apache ShardingSphere makes the desensitization process automatic and transparent, so users do not need to care about its implementation details and can use desensitized data just like ordinary data. In addition, whether an online business needs to be desensitized or a newly launched business wants to use the desensitization function, ShardingSphere provides a fairly complete solution.

III. Requirement scenario analysis

The requirement for data desensitization is generally divided into two situations in real business scenarios:

A new business is coming online. The security department stipulates that sensitive user information, such as bank card numbers and mobile phone numbers, must be encrypted before it is stored in the database and decrypted when it is used. Because the system is new, there is no existing data to clean, so the implementation is relatively simple.

The business is already online, and plaintext has been stored in the database so far. The relevant departments suddenly require the online business to be desensitized. This kind of scenario generally has to deal with three problems:

A) how to desensitize the historical data, that is, data cleansing.

B) how to desensitize new data and store it in the database without changing the business SQL and logic, and then decrypt it when it is used.

C) how to migrate the business system between plaintext and ciphertext data securely, seamlessly and transparently.

IV. Detailed explanation of the processing flow

Overall architecture

The Encrypt-JDBC provided by ShardingSphere is deployed together with the business code. The business side programs against Encrypt-JDBC through the JDBC interfaces. Because Encrypt-JDBC implements all of the standard JDBC interfaces, the business code remains compatible without additional modification. From then on, Encrypt-JDBC is responsible for all interaction between the business code and the database, and the business only needs to provide the desensitization rules. As a bridge between the business code and the underlying database, Encrypt-JDBC can intercept the user's operations, modify them, and then interact with the database.

Encrypt-JDBC intercepts the SQL issued by the user and parses it with the SQL syntax parser to understand what it does. Using the desensitization rules passed in by the user, it then identifies the fields that need to be desensitized and the encryptor to use, encrypts or decrypts the target fields, and interacts with the underlying database.

ShardingSphere will encrypt the plaintext requested by the user and store it in the underlying database; when the user queries, the ciphertext will be extracted from the database and decrypted and returned to the end user.

ShardingSphere hides the desensitization process, so users do not need to be aware of SQL parsing, data encryption or data decryption, and can use desensitized data just like ordinary data.
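The encrypt-on-write, decrypt-on-read round trip described above can be sketched with the JDK's own javax.crypto classes. This is only an illustration: the class name AesColumnCipher, the SHA-256 key derivation and the AES/ECB/PKCS5Padding mode are assumptions made for the sketch, not necessarily what ShardingSphere's built-in AES encryptor does.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.Base64;

// Illustrative sketch only; not ShardingSphere's encryptor implementation.
final class AesColumnCipher {
    private final SecretKeySpec key;

    AesColumnCipher(String passphrase) {
        this.key = deriveKey(passphrase);
    }

    // Assumption: a 128-bit key is derived by hashing the configured passphrase.
    private static SecretKeySpec deriveKey(String passphrase) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(passphrase.getBytes(StandardCharsets.UTF_8));
            return new SecretKeySpec(Arrays.copyOf(digest, 16), "AES");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Write path: plaintext from the user's SQL becomes Base64 ciphertext.
    String encrypt(String plaintext) {
        byte[] raw = run(Cipher.ENCRYPT_MODE, plaintext.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(raw);
    }

    // Read path: ciphertext fetched from the database becomes plaintext again.
    String decrypt(String ciphertext) {
        byte[] raw = run(Cipher.DECRYPT_MODE, Base64.getDecoder().decode(ciphertext));
        return new String(raw, StandardCharsets.UTF_8);
    }

    private byte[] run(int mode, byte[] input) {
        try {
            Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
            cipher.init(mode, key);
            return cipher.doFinal(input);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        AesColumnCipher cipher = new AesColumnCipher("123456abc");
        String stored = cipher.encrypt("123");
        System.out.println("stored in database: " + stored);
        System.out.println("returned to user:   " + cipher.decrypt(stored));
    }
}
```

The sketch uses a deterministic mode so that the same plaintext always yields the same ciphertext, which keeps equality lookups on an encrypted column possible. Whether that trade-off is acceptable depends on the security requirements of the business.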

Desensitization rule

Before we can explain the whole process in detail, we need to understand the desensitization rules and configuration, which is the basis of understanding the whole process. Desensitization configuration is mainly divided into four parts: data source configuration, encryption configuration, desensitization table configuration and query attribute configuration. The details are shown below:

Data source configuration: refers to the configuration of DataSource.

Encryption configuration: specifies which encryption strategy is used for encryption and decryption. At present, ShardingSphere has two built-in strategies: AES and MD5. Users can also provide their own encryption and decryption algorithm by implementing the interface provided by ShardingSphere.

Desensitization table configuration: used to tell ShardingSphere which column in the data table is used to store ciphertext data (cipherColumn), which column is used to store plaintext data (plainColumn), and which column the user wants to use for SQL writing (logicColumn).

How to understand which column the user wants to use for SQL writing (logicColumn)?

We can understand this from the purpose of Encrypt-JDBC. Its ultimate goal is to hide the underlying desensitization of data: we do not want users to have to know how the data is encrypted and decrypted, how plaintext is stored in plainColumn, or how ciphertext is stored in cipherColumn. In other words, we do not want users to have to know that plainColumn and cipherColumn exist or how they are used. We therefore give the user a conceptual column that is decoupled from the real columns of the underlying database. It may or may not be a real column in the table, so the user can rename the plainColumn and cipherColumn of the underlying database at will, or delete plainColumn and choose to store only ciphertext and never plaintext again. All that is required is that the user's SQL is written against this logical column and that the desensitization rule gives the correct mapping between logicColumn, plainColumn and cipherColumn.

Why do it this way? The answer comes later in the article: to let online businesses carry out the desensitization migration seamlessly, transparently and securely.

Query attribute configuration: when plaintext data and ciphertext data are stored in the underlying database table at the same time, the attribute switch is used to determine whether to directly query the plaintext data in the database table to return, or to query the ciphertext data to be decrypted by Encrypt-JDBC.
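To make the relationship between logicColumn, plainColumn, cipherColumn and the query switch concrete, here is a minimal sketch of one column rule. EncryptColumnRule and its methods are hypothetical names invented for this illustration, not ShardingSphere classes, and the fallback to cipherColumn when no plainColumn is configured is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of one desensitization column rule (not ShardingSphere's API).
final class EncryptColumnRule {
    private final String logicColumn;   // the column users write SQL against
    private final String plainColumn;   // null when plaintext is not stored
    private final String cipherColumn;  // always present

    EncryptColumnRule(String logicColumn, String plainColumn, String cipherColumn) {
        this.logicColumn = logicColumn;
        this.plainColumn = plainColumn;
        this.cipherColumn = cipherColumn;
    }

    String logicColumn() {
        return logicColumn;
    }

    // Physical columns that receive data when the logic column is written.
    List<String> writeColumns() {
        List<String> columns = new ArrayList<>();
        if (plainColumn != null) {
            columns.add(plainColumn);
        }
        columns.add(cipherColumn);
        return columns;
    }

    // Physical column that is read when the logic column is queried.
    // Assumption: without a plainColumn, the cipher column is the only choice.
    String queryColumn(boolean queryWithCipherColumn) {
        return (queryWithCipherColumn || plainColumn == null) ? cipherColumn : plainColumn;
    }

    public static void main(String[] args) {
        EncryptColumnRule rule = new EncryptColumnRule("pwd", "pwd_plain", "pwd_cipher");
        System.out.println(rule.logicColumn() + " is written to " + rule.writeColumns());
        System.out.println("switch off reads " + rule.queryColumn(false));
        System.out.println("switch on reads  " + rule.queryColumn(true));
    }
}
```

Because user SQL only ever names the logic column, the physical columns behind it can be renamed or dropped by changing the rule alone.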

Desensitization process

For example, suppose the database has a table called t_user with two physical fields: pwd_plain, which stores plaintext data, and pwd_cipher, which stores ciphertext data, and that logicColumn is defined as pwd. Users then write SQL against logicColumn, that is, INSERT INTO t_user SET pwd = '123'. When ShardingSphere receives this SQL, it finds from the desensitization configuration provided by the user that pwd is a logicColumn, so it rewrites the logical column and encrypts its plaintext value. In this way ShardingSphere maps the user-facing logical column to the plaintext and ciphertext columns of the underlying database, converting both the column names and the data. As shown in the following figure:

This is the core meaning of Encrypt-JDBC, that is, according to the desensitization rules provided by users, the user SQL is separated from the underlying data table structure, so that the user's SQL writing no longer depends on the real database table structure. The connection, mapping and transformation between the user and the underlying database are handled by ShardingSphere. Why are we doing this? The same sentence: in order to enable the online business to carry out data desensitization migration seamlessly, transparently and securely.
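The column mapping in the t_user example can be sketched as a small rewrite step. The class InsertRewriteSketch, its rewriteInsert method and the placeholder encryptor are all hypothetical and exist only for this illustration; the real rewrite works on a parsed SQL statement, whereas this sketch starts from already-extracted column assignments.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.function.UnaryOperator;

// Hypothetical sketch of mapping a logic column onto physical columns.
final class InsertRewriteSketch {

    // Rewritten statement plus its bound parameters.
    static final class Rewritten {
        final String sql;
        final List<String> parameters;

        Rewritten(String sql, List<String> parameters) {
            this.sql = sql;
            this.parameters = parameters;
        }
    }

    static Rewritten rewriteInsert(String table, Map<String, String> assignments,
                                   String logicColumn, String plainColumn,
                                   String cipherColumn, UnaryOperator<String> encryptor) {
        List<String> columns = new ArrayList<>();
        List<String> parameters = new ArrayList<>();
        for (Map.Entry<String, String> entry : assignments.entrySet()) {
            if (entry.getKey().equals(logicColumn)) {
                if (plainColumn != null) {          // plaintext copy is optional
                    columns.add(plainColumn);
                    parameters.add(entry.getValue());
                }
                columns.add(cipherColumn);          // ciphertext is always stored
                parameters.add(encryptor.apply(entry.getValue()));
            } else {                                // other columns pass through
                columns.add(entry.getKey());
                parameters.add(entry.getValue());
            }
        }
        StringJoiner set = new StringJoiner(", ");
        for (String column : columns) {
            set.add(column + " = ?");
        }
        return new Rewritten("INSERT INTO " + table + " SET " + set, parameters);
    }

    public static void main(String[] args) {
        Map<String, String> assignments = new LinkedHashMap<>();
        assignments.put("pwd", "123");
        // Placeholder encryptor: marks the value instead of really encrypting it.
        Rewritten result = rewriteInsert("t_user", assignments, "pwd",
                "pwd_plain", "pwd_cipher", value -> "enc(" + value + ")");
        System.out.println(result.sql + " " + result.parameters);
    }
}
```

Passing null as plainColumn models the case where only ciphertext is stored.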

To give readers a clearer understanding of the core processing flow of Encrypt-JDBC, the following figure shows the processing flow and conversion logic when inserting, deleting, updating and querying through Encrypt-JDBC.

V. Detailed explanation of the solution

Now that the ShardingSphere desensitization process is clear, the desensitization configuration and processing flow can be applied to real scenarios. All of the design and development work is aimed at the pain points encountered in business scenarios. So how should ShardingSphere be used to meet the business requirements described earlier?

New online service

Business scenario analysis: the new online business is relatively simple because everything starts from scratch and there is no problem of historical data cleaning.

Solution description: after selecting an appropriate encryptor, such as AES, you only need to configure the logical column (which users write SQL against) and the ciphertext column (in which the table stores ciphertext data). The logical column and the ciphertext column can be the same or different. The recommended configuration is as follows (shown in Yaml format):

encryptRule:
  encryptors:
    aes_encryptor:
      type: aes
      props:
        aes.key.value: 123456abc
  tables:
    t_user:
      columns:
        pwd:
          cipherColumn: pwd
          encryptor: aes_encryptor

With this configuration, Encrypt-JDBC only needs to convert between logicColumn and cipherColumn, and the underlying table stores only ciphertext and no plaintext, which is also what the security audit department requires. If users want to store both plaintext and ciphertext in the database, they only need to add the plainColumn configuration. The overall processing flow is shown in the following figure:

Online business transformation

Business scenario analysis: as the business is already running online, there must be a large amount of plaintext historical data in the database. The problem now is how to encrypt and clean the historical data, how to encrypt the incremental data, and how to make the business migrate seamlessly and transparently between the new and old data systems.

Solution description: before presenting a solution, let's think it through. Since the old business needs desensitization, it must store very important and sensitive information. That information is highly valuable and the business built on it is important; if something goes wrong, the whole team's KPI is gone. It is therefore out of the question to stop the business straight away, forbid new writes, encrypt and clean all the historical data with some encryptor, and then deploy the refactored code so that it encrypts and decrypts existing and incremental data online. Experience shows that such a crude approach is bound to fail.

Another, relatively safe approach is to build a pre-release environment identical to production, use migration and data cleansing tools to encrypt the existing production data and store it there, and have new data encrypted and written to the pre-release database through tools such as MySQL master-slave replication and tools developed by the business side. The refactored code that can encrypt and decrypt is then deployed to the pre-release environment. Production thus remains a query and update environment built around plaintext, while the pre-release environment is one built around ciphertext with encryption and decryption. After the two have been compared and found consistent for a period of time, production traffic can be switched to the pre-release environment during a night-time operation. This scheme is relatively safe and reliable, but it is expensive in time, manpower and money: the pre-release environment has to be built, production code has to be changed, auxiliary tools have to be developed, and so on. Unless there is no other way, business developers generally give up soon after getting started.

What business developers want most is to keep costs down, preferably without modifying business code, and to migrate the system safely and smoothly. This is where the desensitization module of ShardingSphere comes in. The migration can be divided into three steps:

Before system migration

Suppose the system needs to desensitize the pwd field of t_user, and the business side uses Encrypt-JDBC in place of the standard JDBC interface, which requires essentially no additional modification (we also provide access through SpringBoot, SpringNameSpace and Yaml to meet the needs of different business teams). In addition, a set of desensitization rules is provided, as follows:

encryptRule:
  encryptors:
    aes_encryptor:
      type: aes
      props:
        aes.key.value: 123456abc
  tables:
    t_user:
      columns:
        pwd:
          plainColumn: pwd
          cipherColumn: pwd_cipher
          encryptor: aes_encryptor
props:
  query.with.cipher.column: false

According to these desensitization rules, we first need to add a field called pwd_cipher, the cipherColumn, to the table t_user to store ciphertext data. We set plainColumn to pwd to store plaintext data, and logicColumn is also pwd. Since the existing SQL was written against pwd, that is, against the logical column, the business code does not need to change. For new data, Encrypt-JDBC writes the plaintext to the pwd column and, at the same time, encrypts it and stores the ciphertext in the pwd_cipher column. Because query.with.cipher.column is set to false, the business application still queries and stores through the plaintext column pwd, while the ciphertext of newly added data is additionally stored in the pwd_cipher column of the underlying table. The processing flow is as follows:

Newly inserted data is encrypted into ciphertext by Encrypt-JDBC and stored in cipherColumn. What remains is the existing historical plaintext data. Since Apache ShardingSphere does not provide migration and data cleansing tools, the business side has to encrypt the plaintext in pwd and store it in pwd_cipher itself.
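As a rough illustration of that cleansing step, the sketch below backfills the ciphertext column for rows that do not have one yet. Everything here is hypothetical: HistoricalDataBackfill is an invented name, the rows are held in memory rather than read from a database, and the encryptor is a placeholder. A real job would read and update t_user in batches and would have to use the same algorithm and key as the encryptor configured in Encrypt-JDBC, otherwise the backfilled values could not be decrypted later.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical in-memory sketch of backfilling ciphertext for historical rows.
final class HistoricalDataBackfill {

    // Fills cipherColumn for rows that have plaintext but no ciphertext yet.
    // Rows already written through Encrypt-JDBC are skipped, so the job can be re-run.
    static int backfill(List<Map<String, String>> rows, String plainColumn,
                        String cipherColumn, UnaryOperator<String> encryptor) {
        int updated = 0;
        for (Map<String, String> row : rows) {
            String plain = row.get(plainColumn);
            if (plain != null && row.get(cipherColumn) == null) {
                row.put(cipherColumn, encryptor.apply(plain));
                updated++;
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        Map<String, String> historical = new HashMap<>();
        historical.put("pwd", "old-secret");          // stored before the migration

        Map<String, String> recent = new HashMap<>();
        recent.put("pwd", "123");
        recent.put("pwd_cipher", "enc(123)");         // already written with ciphertext

        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(historical);
        rows.add(recent);

        // Placeholder encryptor: marks the value instead of really encrypting it.
        int updated = backfill(rows, "pwd", "pwd_cipher", value -> "enc(" + value + ")");
        System.out.println(updated + " row(s) backfilled: " + rows);
    }
}
```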

System migration

By now Encrypt-JDBC has been storing the ciphertext of new data in the ciphertext column and its plaintext in the plaintext column, and once the business side has encrypted and cleaned the historical data, its ciphertext is in the ciphertext column as well. In other words, the database now holds both plaintext and ciphertext, but because query.with.cipher.column=false in the configuration, the ciphertext has never been used. To switch queries over to the ciphertext data, we set query.with.cipher.column in the desensitization configuration to true. After the system is restarted, the business behaves normally, but Encrypt-JDBC now reads the ciphertext column from the database, decrypts it and returns the result to the user. For inserts, deletes and updates, the original data is still stored in the plaintext column and the encrypted data in the ciphertext column.

The business system now reads from the ciphertext column and decrypts the result, yet it still saves a copy of the original data to the plaintext column on every write. Why? So that the system can be rolled back. As long as ciphertext and plaintext exist side by side, business queries can be switched freely between cipherColumn and plainColumn through the configuration switch. In other words, if the system reports errors after queries have been switched to the ciphertext column and needs to be rolled back, you only need to set query.with.cipher.column back to false, and Encrypt-JDBC will query plainColumn again. The processing flow is shown in the following figure:

After system migration

Because of the requirements of the security audit department, a business system generally cannot keep the plaintext column and the ciphertext column of the database in sync forever, so the plaintext column has to be deleted once the system is stable. That is, after the migration we need to delete plainColumn, which is pwd. This raises a problem. The business SQL is written against pwd, yet the pwd column that stores plaintext is about to be removed from the underlying table, and the original data will have to be obtained by decrypting pwd_cipher. Does that mean the business side has to change all of its SQL so that it no longer uses the pwd column? Remember the core meaning of Encrypt-JDBC?

This is the core meaning of Encrypt-JDBC, that is, according to the desensitization rules provided by users, the user SQL is separated from the underlying database table structure, so that the user's SQL writing no longer depends on the real database table structure. The connection, mapping and transformation between the user and the underlying database are handled by ShardingSphere.

Yes, because of the existence of logicColumn, the user's writing SQL is oriented to this virtual column, and Encrypt-JDBC can map this logical column to the ciphertext column in the underlying data table. Therefore, the desensitization configuration after migration is:

encryptRule:
  encryptors:
    aes_encryptor:
      type: aes
      props:
        aes.key.value: 123456abc
  tables:
    t_user:
      columns:
        pwd: # Transformation mapping between pwd and pwd_cipher
          cipherColumn: pwd_cipher
          encryptor: aes_encryptor
props:
  query.with.cipher.column: true

The processing flow is as follows:

This completes the description of the desensitization solution for online businesses. We provide Java, Yaml, SpringBoot and SpringNameSpace access methods for users to choose from, and strive to meet the different access needs of each business. The solution has been rolled out progressively within JD Digits, where it supports internal basic services.

VI. Advantages of middleware desensitization service

An automated and transparent data desensitization process: users do not need to care about the implementation details of desensitization.

A variety of built-in and third-party (AKS) desensitization strategies that users can enable with simple configuration.

A desensitization strategy API: users can implement the interface and use a custom desensitization strategy.

Support for switching between different desensitization strategies.

For online services, plaintext and ciphertext data can be stored synchronously, and configuration decides whether the plaintext column or the ciphertext column is used for queries. An online system can thus migrate between unencrypted and encrypted data securely and transparently, without changing the business query SQL.

VII. Description of applicable scenarios

User projects are programmed in the Java language.

The backend databases are MySQL, Oracle, PostgreSQL, and SQLServer.

The user needs to desensitize one or more columns in the database table (data encryption & decryption).

Compatible with all commonly used SQL

VIII. Restrictive conditions

Users need to handle the cleansing of existing data in the database themselves.

When the desensitization function is combined with the database and table sharding function, some special SQL is not supported; please refer to the SQL usage specification.

Desensitized fields do not support comparison operations, such as greater than, less than, ORDER BY, BETWEEN, LIKE, etc.

Desensitized fields do not support calculation operations, such as AVG, SUM, and computed expressions.

IX. Follow-up

This article describes access through Encrypt-JDBC, one of the ShardingSphere products; you can also choose access through SpringBoot, SpringNameSpace and so on. This form of access targets homogeneous Java applications and is deployed in the production environment together with the business code. For heterogeneous languages, ShardingSphere also provides Encrypt-Proxy. Encrypt-Proxy is a server-side product that implements the binary protocols of MySQL and PostgreSQL. Users can deploy the Encrypt-Proxy service independently and access this virtual database server, which has the desensitization function built in, through third-party database management tools such as Navicat, Java connection pools and the command line, just like an ordinary MySQL or PostgreSQL database.

The desensitization function belongs to the distributed governance category of Apache ShardingSphere. The Apache ShardingSphere ecosystem also offers other, more powerful capabilities, such as data sharding, read-write splitting, distributed transactions, and monitoring and governance. You can combine any of these functional modules, for example data desensitization + data sharding, data sharding + read-write splitting, or monitoring and governance + data sharding. Besides combining functions, ShardingSphere also provides different forms of access, such as Sharding-JDBC and Sharding-Proxy, to meet the needs of different scenarios.

Closing remarks

In the beginning ShardingSphere only supported database and table sharding; it has since grown into an ecosystem whose core functions include data sharding, distributed governance and distributed transactions. It is therefore more than a distributed database middleware with sharding capability: it is a comprehensive solution ecosystem built around data sharding, distributed governance and distributed transactions. You are welcome to learn more on the official website and to follow us on GitHub!