Implementation of an appendable data mart based on the file system


2025-07-09 Update From: SLTechnology News&Howtos (shulou)

Shulou(Shulou.com)06/03 Report--

1. Background of the problem

In most application systems, the database initially handles both data storage and computation, serving business transactions and report queries at the same time. After several years of system build-out and data accumulation, however, the load on the database grows and it becomes a performance bottleneck.

Investigation often shows that reports querying historical data account for a large share of that load. Such reports usually have the following characteristics:

1. Little change: the historical data being queried is almost never modified.

2. Large volume: the amount of data keeps growing over time.

Fetching data through JDBC is slow for most databases: the fetch process has to convert data into objects, which can be an order of magnitude slower than reading the same data from a file. If this data always stays in the database, report performance drops sharply when the data volume or the concurrency is high, and the extra load can seriously affect related business operations such as marketing, data collation and re-reporting.

The common solution is to add a front-end database between the production database and the application. ETL tools extract data from the production database on a schedule, clean it, and load it into the front-end database. All historical report queries then run against the front-end database, which decouples them from the production database and relieves the pressure on it.

However, this scheme adds cost, redundant components and workload, and makes later management and maintenance harder. More importantly, report queries remain slow when the data volume is large, because the root cause is untouched: the IO performance of most databases is far lower than that of the file system, and report performance still depends heavily on the database fetch step.

2. The solution

To solve the problem at its root, suppose files had computing power. We could then move this rarely changing historical data out of the database and store it in the file system instead of a front-end database, which makes much higher IO performance than the database possible. This not only addresses slow report queries over large amounts of data but also brings the following benefits:

1. Convenient management. Files naturally support multi-level directories, and copying, transferring and splitting them is much simpler and more efficient than with a database. Users can organize data by business module, time period or other rules, and when an application is retired its data can be deleted by directory. Data management becomes simple and clear, and the workload drops significantly.

2. Low cost. Files can simply be stored on cheap hard disks, with no need to buy expensive database-specific software and hardware.

3. Less pressure to expand the database. With a lighter throughput burden, the point at which the database must be expanded is pushed back considerably, the existing database stays in service longer, and a lot of expansion cost is saved.

4. Better resource utilization. Using files to store data does not mean abandoning the database. Files should hold only peripheral data that has low security requirements but large volume, along with data that already lives outside the database, while the database continues to hold the core data. File storage and database storage each do what they are best at, and resource utilization improves significantly.

So how can files be given computing power? This is where the Raqsoft aggregator (esProc) comes in. With the aggregator, complex computation can be separated from report presentation, and its built-in computing engine gives files computing power, so that a wide range of difficult cases can be handled with ease. The following figure compares the usual report system structure with the structure after the aggregator is introduced. With the aggregator in place, the whole architecture becomes cleaner and more reasonable:

3. Scenario description

Next, we use a typical scenario to illustrate the role and usage of the aggregator:

Table A, "merchandise sales details", holds hundreds of millions of rows. Its field areaid is associated with the primary key id of table B, the "region table". Table A is called the fact table and table B the dimension table. The field in table A that is associated with the primary key of table B is called the foreign key of A to B, and B is called the foreign key table of A. The relationship from a table to its foreign key table is many-to-one, as shown in the following figure:

Next, by building a "daily sales growth report of salespeople in each region", let's look at how the aggregator uses files to move data out of the database and so improve report query efficiency. The final report looks like this:

The report is queried by a selected start date and end date. It groups the data by region name, salesperson code and sales day to compute each salesperson's daily sales, and then the day-over-day growth rate of those sales, calculated as "(sales of the day - sales of the previous day) / sales of the previous day". The query button at the top of the report is the "parameter template" function provided by the report tool. See the tutorial for details, which are not repeated here.
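The many-to-one association between the fact table and the dimension table can be illustrated with a small Python sketch. Only the field names areaid and id come from the description above; the other field names and the sample rows are hypothetical, and this is not how the aggregator itself performs the association.

```python
# Hypothetical sample rows; the real tables hold far more rows and columns.
regions = [  # dimension table B, primary key "id"
    {"id": 1, "name": "North"},
    {"id": 2, "name": "South"},
]
sales = [  # fact table A, foreign key "areaid" points at regions "id"
    {"salesman": "S01", "areaid": 1, "amount": 120.0},
    {"salesman": "S02", "areaid": 2, "amount": 80.0},
    {"salesman": "S03", "areaid": 1, "amount": 50.0},
]


def attach_region(fact_rows, dim_rows):
    """Resolve each fact row's areaid to its region name (many-to-one lookup)."""
    by_id = {row["id"]: row for row in dim_rows}  # index the dimension table once
    return [dict(row, region=by_id[row["areaid"]]["name"]) for row in fact_rows]


joined = attach_region(sales, regions)
```

Several fact rows can point at the same region row, while each fact row resolves to exactly one region, which is what makes the relationship many-to-one.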
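The grouping and growth-rate calculation described above can be sketched in Python as follows. The tuple layout and the sample values are assumptions made for illustration. The sketch also treats "previous day" as the previous day that has sales for the same salesperson, and leaves the growth rate undefined when there is no previous day or its sales are zero.

```python
from collections import defaultdict
from datetime import date


def daily_growth(rows):
    """rows: iterable of (region, salesman, day, amount) tuples.

    Returns {(region, salesman): [(day, total, growth), ...]} ordered by day,
    where growth is None when no usable previous day exists.
    """
    totals = defaultdict(float)
    for region, salesman, day, amount in rows:
        totals[(region, salesman, day)] += amount  # daily sales per salesperson

    report = defaultdict(list)
    for region, salesman, day in sorted(totals):
        total = totals[(region, salesman, day)]
        series = report[(region, salesman)]
        prev = series[-1][1] if series else None
        # (sales of the day - sales of the previous day) / sales of the previous day
        growth = (total - prev) / prev if prev else None
        series.append((day, total, growth))
    return dict(report)


rows = [
    ("North", "S01", date(2017, 11, 1), 100.0),
    ("North", "S01", date(2017, 11, 2), 100.0),
    ("North", "S01", date(2017, 11, 2), 50.0),
]
result = daily_growth(rows)
```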

3.1 Design data storage organization

Before we can benefit from storing data in the file system, we first need to define the directory structure for the files:

Historical data no longer changes once a transaction has been recorded, and its volume is huge. We can therefore split each year's data by business module, month and similar rules, so that each month's data is saved in its own set file (a set file uses the compressed format provided by the aggregator, which has better IO performance). The directory structure is /business module/data schedule/year-and-month file name, as shown in the following figure:

We also need a data synchronization script that runs every morning and appends the previous day's data to the current month's set file. On the 1st of each month, the script automatically creates a new set file named after the year and month according to the same rule.

3.2 Synchronize data

3.2.1 Synchronize historical data to files

First, move the historical data from January to October 2017 out of the database month by month (assuming there are 10 months of historical data). The aggregator's SPL script is as follows:
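As a rough illustration of this layout, the following Python sketch builds the path of the file for a given month. The root directory, the module name and the ".dat" extension are hypothetical placeholders (the source does not state the set file extension), and the table name is borrowed from the SQL later in the article.

```python
from datetime import date
from pathlib import PurePosixPath


def month_file(root, module, table, day, ext=".dat"):
    """Return <root>/<module>/<table>/<YYYYMM><ext> for the month containing `day`."""
    return PurePosixPath(root) / module / table / f"{day:%Y%m}{ext}"


path = month_file("/data", "sales", "sdrpts", date(2017, 11, 15))
```

Because the month is encoded in the file name, selecting the files for a date range only requires listing one directory and comparing names.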
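The daily append step might look like the following Python sketch. It writes CSV as a stand-in for the aggregator's set file format, and the column names and the rows_for_day callback are hypothetical. It files each day's rows under the month of the data itself, so a new file first appears when the first rows for a new month are written; that choice is an assumption about the intended rollover rule.

```python
import csv
from datetime import date, timedelta
from pathlib import Path


def append_previous_day(root, rows_for_day, today):
    """Append yesterday's rows to the month file that yesterday belongs to.

    rows_for_day(day) is a caller-supplied function returning that day's rows.
    Returns the path of the file that was appended to.
    """
    data_day = today - timedelta(days=1)
    target = Path(root) / f"{data_day:%Y%m}.csv"
    target.parent.mkdir(parents=True, exist_ok=True)
    is_new = not target.exists()  # the first write of a month creates the file
    with target.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["filedate", "salesman", "amount"])  # hypothetical columns
        writer.writerows(rows_for_day(data_day))
    return target
```

A scheduler such as cron could call this function once every morning with today's date.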

1    =connect("demo")

2    =10.("SELECT * FROM sdrpts WHERE filedate >= '2017-" / ~ "-01' AND filedate ...
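The month-by-month split can also be sketched in Python. Because the SPL script above is cut off, the exclusive upper bound (the first day of the following month) is an assumption, as is the "?" placeholder style, which depends on the database driver. The table and column names come from the script above.

```python
from datetime import date


def month_ranges(year, first_month, last_month):
    """Return [(start, end_exclusive)] for each month in the inclusive range."""
    ranges = []
    for month in range(first_month, last_month + 1):
        start = date(year, month, 1)
        # The end bound is the first day of the next month, rolling over in December.
        end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        ranges.append((start, end))
    return ranges


QUERY = "SELECT * FROM sdrpts WHERE filedate >= ? AND filedate < ?"

# One (query, parameters) pair per month from January to October 2017.
jobs = [(QUERY, (s.isoformat(), e.isoformat())) for s, e in month_ranges(2017, 1, 10)]
```

Each job's result set would then be written to the file for that month.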

