
The Implementation Scheme of Real-time (T+0) Reports

2025-04-03 Update From: SLTechnology News&Howtos shulou


Shulou(Shulou.com)06/03 Report--

A full real-time query based solely on the database can, once the data volume grows large, only be supported by expanding database capacity (including sharding), which is costly. By mixing the file system with the production database, low-cost, high-performance T+0 queries can be achieved, and the hot-export mechanism is the foundation of this scheme.

1. Problem background

In reporting applications, users pay more and more attention to data freshness and hope that the latest data is reflected in the report, that is, the T+0 scenario we often talk about, so that decisions can be assisted and operations driven in time.

For example, in traffic big data applications: we need real-time data to understand vehicle traffic density and plan roads reasonably, and we use historical data to predict line congestion and to issue reminders about accident-prone sections.

But the conventional scheme, report tool + data warehouse + ETL, finds it very difficult to deliver such real-time reports; often you can only see yesterday's, last week's or even last month's data, that is, T+1, T+7, T+30 and so on, which we collectively call T+n reports.

The reasons lie roughly in the following three aspects:

1. If both the historical data and the latest data of the report are read from the production system, T+0 reports are possible, but this puts pressure on the production database; as the data volume grows it causes a performance bottleneck that directly affects the business, and the large volume of historical data also drives up database costs (storage cost and performance cost).

2. If a data warehouse is used, ETL needs a long "window time" to extract data from the production database, usually after business staff get off work and before they start work the next morning, so the freshest data you can see is only T+1.

3. In theory, it is possible to read from both the historical database and the production database to form real-time reports, but ordinary report tools lack cross-database mixed computing capability, and other cross-database computing schemes are complex, hard to implement and slow.

2. The solution

So, is there a lower-cost, easier-to-implement T+0 reporting scheme? The aggregator introduced below is exactly such a tool: using its mixed data source capability, it achieves low-cost T+0 real-time reports.

The idea is to store the large volume of historical data that no longer changes outside the database, and to read only the small amount of new data from the production database. This ensures the real-time quality of the report, reduces the storage cost of the historical data, and lowers the load the reporting system puts on the production database.

The following figure shows a structural comparison between the conventional T+n scheme and the aggregator-based T+0 scheme. After introducing the aggregator, many unnecessary costs and redundant components are removed, and the whole architecture becomes cleaner and more reasonable:

The "export (non-real-time)" step in the new architecture above refers to periodically synchronizing new data from the production database into the file that stores the historical data during non-working hours (for example, at night).

Preparation and peripheral work such as the data export plan, the design of the data storage organization and the scheduled tasks are covered in the relevant chapters and will not be repeated here.

3. Mixed computing scenario

Next, let's look at how the aggregator mixes historical data with current data to implement the T+0 scheme, using a "real-time defect Pareto chart by process station" report as an example. The final report looks like this:

This report shows clearly that 80% of the problems in the production of electronic equipment are caused by 20% of the defect causes, which helps identify the key causes behind most of the problems.

The data in the report is queried as follows: filter by the start date and end date; group by defect code and sum the defect quantity of each category; sort in descending order by the summed quantity; then calculate the cumulative defect ratio (the algorithm is "(cumulative defect quantity / total defect quantity) * 100"). The query button at the top of the report is the "parameter template" function provided by the report tool; see the tutorial for details, which will not be repeated here.
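
For example, if three defect codes account for 60, 30 and 10 defects out of a total of 100, then after sorting in descending order the cumulative defect ratios are (60/100)*100 = 60, ((60+30)/100)*100 = 90 and (100/100)*100 = 100 (the figures are illustrative only).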

3.1 Writing data query scripts

We assume the rarely-changing historical data has already been moved out of the database and stored in a set file named MES-pre.btx (set files use the compressed binary format provided by the aggregator and give better IO performance), and that a data synchronization script runs every day to append the previous day's data to this file. The small amount of same-day data involved in the query is read directly from the production database (demo) to keep the data real-time. The aggregator's SPL script is as follows (it also supports querying historical data only):

A1  =connect("demo")
B1  =A1.query("SELECT code,name FROM watch")
A2  =A1.cursor@x("SELECT code,nums FROM meta_resource WHERE " + if(Efiledate>=date(now()), "DATE_FORMAT(fildate,'%Y-%m-%d')>='" + string(date(now())) + "'", "1=0"))
A3  =file("D:/PT/MES-pre.btx").cursor@b().select(filedate>=Bfiledate && filedate<=Efiledate)
B3  =[A2,A3].conjx()
A4  =B3.groups(code:defect code, code:defect name; sum(nums):defect quantity, sum(nums):cumulative defect quantity, sum(0):cumulative defect ratio)
A5  >A4.switch(defect name, B1:code)
B5  =A4.sort(defect quantity:-1)
A6  >B5.run(cumulative defect quantity += cumulative defect quantity[-1])
B6  =A4.sum(defect quantity)
A7  >B5.run(cumulative defect ratio = round((cumulative defect quantity / B6) * 100, 2))
B7  =B5.new(defect code, defect name.name:defect name, defect quantity, cumulative defect quantity, cumulative defect ratio)
A8  return B7

A1: connect to the preconfigured production database (demo).

B1: query the dictionary table for defect codes and defect names.

A2: create a database cursor that reads the data table with a simple SQL statement. The filter condition part of the SQL is spliced dynamically according to a logical judgment: when the end date >= the current system date, the real-time data of the current day is queried; otherwise a query with an empty result ("1=0") is executed, to adapt to the business scenario where only historical data is queried. The @x option means the database connection is closed after reading.

A3: create a cursor on the data file D:/PT/MES-pre.btx. A file cursor reads data from a big file in batches to avoid memory overflow. The @b option means the file is read in the binary (set file) format provided by the aggregator. The eligible records are filtered by the passed start date (Bfiledate) and end date (Efiledate).

B3: merge the database cursor (new data) and the file cursor (historical data).

A4: use the groups function to group and summarize the merged cursor, and construct several extra columns (defect name, cumulative defect quantity, cumulative defect ratio) to make the later assignment calculations easier.

A5: the switch() function turns the "defect name" field of the A4 result into a pointer reference to the matching record of the code field in table B1, thereby realizing the association, as shown below:

B5: sort in descending order by defect quantity, as shown below:

A6: calculate the cumulative defect quantity. As you can see, the aggregator uses "cumulative defect quantity[-1]" to refer to the value on the previous row, so relative-position calculations are easy to express.

B6: total the defect quantity.

A7: calculate the cumulative defect ratio. The algorithm is "(cumulative defect quantity / total defect quantity) * 100", keeping two decimal places. The calculation result is as follows:

B7-A8: take out the required fields and return the result set, with the defect name resolved through the association, to the report tool, as shown below:

3.2 as a report data source

After the data query is done in the aggregator, to use the query result in the report you simply set the aggregator as a data source of the report, which is as simple as using a database. The steps are:

- Define the parameters (Bfiledate, Efiledate) in the report

- Set up the aggregator dataset and pass the report parameters in

- Design the report's statistical chart

As shown in the following figure:

After completing the report design, enter the parameters for calculation, and you can get the desired report.
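
As a side note on the dataset setup: if the report tool connects to the aggregator through its JDBC driver, the dataset statement is typically a call to the saved script with the two dates as placeholders, for example call mes_pareto(?, ?), where mes_pareto is a hypothetical name under which the script from section 3.1 might be saved, and the two placeholders are bound to the report parameters Bfiledate and Efiledate.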

4. Data preprocessing

In the previous chapter, the mixed calculation of historical data (file) and real-time data (database) made the real-time report (T+0) solution easy to achieve; to make this work, the corresponding data preprocessing, including how to export to a file and how to design the storage organization, is particularly important.

Next, we discuss several modes of exporting historical data to files and their advantages and disadvantages: cold export, a compromise approach, and hot export.

4.1 Cold export

With regard to the many benefits of using files to store historical data, you can refer to the relevant sections, which will not be repeated here.

The so-called cold export sets aside a "time window" in which historical data is taken from the production database and exported to a file; for example, the scheduled task runs every day between 2:00 a.m. and 6:00 a.m.

The disadvantage of cold export is also obvious: while data is being appended and exported to the file, the file is unreadable, which means the related queries cannot run during that period. So, strictly speaking, cold export does not truly achieve T+0 real-time query (where neither the production system nor the query system ever stops).

By the way, this problem would not arise if another database were used to store the historical data, because a relational database supports transactional consistency and queries can still be served while data is being written. Of course, this sacrifices some performance, and when a large amount of data is exported every day it consumes a lot of resources (the database rollback segment becomes very large).

So consistency and high performance are, to some extent, contradictory. The database is consistent, but it is slow and expensive; the aggregator's set file achieves high performance, but it has no transactional consistency and cannot serve other calculations while it is being maintained.

However, when the business requirements are not very demanding, cold export is enough. Here is a simple example of an aggregator script that fetches yesterday's historical data and appends it to the current set file:

A1  =file("D:/PT/MES-pre.btx")
B1  =connect("demo")
A2  =B1.cursor@x("SELECT * FROM meta_resource WHERE DATE_FORMAT(fildate,'%Y-%m-%d')=?", after(date(now()),-1))
B2  >A1.export@ab(A2)

A1: open, by path, the set file to be exported to.

B1: connect to the database (demo).

A2: create a database cursor with the SQL that fetches yesterday's data; the parameter is yesterday's date. The @x option means the database connection is closed after reading.

B2: append the execution result to the set file.

4.2 Compromise approach

In view of the shortcomings of the "cold export" scheme, a compromise comes to mind easily: instead of appending historical data into one set file, split the files apart so that export and query are less coupled and do not affect each other. To do so, two rules need to be considered:

1. Export a separate set file every day, named by year-month-day, so that queries against already-exported historical data are not affected while the export is running.

2. Add a time-range judgment to the query script to avoid the export "time window". For example, if the scheduled task's window is 2:00 a.m. to 6:00 a.m. every day, the query script can judge by the time at which the query happens: if the query occurs after 6 o'clock on the same day, the data export has finished, so the data sources are the set files (historical data up to yesterday) plus the current database (new data up to the current moment today); if the query occurs before 6 o'clock, the sources are the set files (historical data up to the day before yesterday) plus the current database (new data from yesterday up to the current moment today).

The disadvantage of this approach is that the data storage organization is broken into many files, the logical-judgment part of the code becomes longer, and file management is more troublesome. Even so, it meets the requirement and achieves a true real-time report (T+0) solution. The implementation steps follow.

4.2.1 Design data storage organization

The historical data is divided by business module, and each day's data is saved in one set file. The directory structure is: /business module/data table/year-month-day file name, as shown in the following figure:
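
For example, under this convention the data of June 1 for the "data Table A" used later in this article would sit in a file such as D:/PT/data Table A/20240601 (the date is illustrative only).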

4.2.2 synchronizing yesterday's data to a file

Transform the data export script from the "cold export" scheme so that it fetches yesterday's historical data from the database every day and saves it into a separate set file named by year-month-day. The code is as follows:

A1  =file("D:/PT/" + string(after(date(now()),-1), "yyyyMMdd"))
B1  =connect("demo")
A2  =B1.cursor@x("SELECT * FROM meta_resource WHERE DATE_FORMAT(fildate,'%Y-%m-%d')=?", after(date(now()),-1))
B2  >A1.export@b(A2)

A1: open, by path, the set file to be exported to; one file per day, named by year-month-day.

The cells that have already been explained are not repeated here.

4.2.3 data query

First, we need to write a utility script whose main job is to work out, from the passed start date and end date, the set-file paths covering that range, and to check whether the set file at each path actually exists. The script is named "determine the scope of the read file.dfx", and the code is as follows:

A1  =if(endDate>=date(now()), if(now()>datetime(date(now()),"06:00:00"), after(endDate,-1), after(endDate,-2)), endDate)
A2  =periods(startDate, A1, 1)
A3  =A2.(path+string(~,"yyyyMMdd"))
A4  =A3.select(file(~).exists())
A5  return A4

The script receives three parameters: the start date (startDate), the end date (endDate), and the storage path of the set files (path).

A1: when the passed end date >= the current system date, return yesterday's date if the current time is after 6 o'clock of the current day, or the day before yesterday's date if it is before 6 o'clock; otherwise return the end date that was passed in.

A2: build the date range from the start date and the calculated end date, by default at one-day intervals.

A3: loop over A2, splicing the storage path with each date in the range, formatted as year-month-day by the string() function.

A4: check whether the file at each path really exists; A5 returns the list of actual file paths. The final result is as follows:

Then we need to modify the data query script from the "mixed computing scenario" section above. It is worth noting that the multicursor concept is adopted here: multiple cursors are merged into a single cursor. The modified script is as follows:

A1  =connect("demo")
B1  =A1.query("SELECT code,name FROM watch")
A2  =A1.cursor@x("SELECT code,nums FROM meta_resource WHERE " + if(Efiledate>=date(now()), if(now()>datetime(date(now()),"06:00:00"), "DATE_FORMAT(fildate,'%Y-%m-%d')>='" + string(date(now())) + "'", "DATE_FORMAT(fildate,'%Y-%m-%d')>='" + string(after(date(now()),-1)) + "'"), "1=0"))
A3  =call("D:/PT/determine the scope of the read file.dfx", Bfiledate, Efiledate, "D:/PT/data Table A/")
A4  =A3.(file(~).cursor@b())
B4  =(A2|A4).mcursor()
A5  =B4.groups(code:defect code, code:defect name; sum(nums):defect quantity, sum(nums):cumulative defect quantity, sum(0):cumulative defect ratio)
A6  >A5.switch(defect name, B1:code)
B6  =A5.sum(defect quantity)
C6  =A5.sort(defect quantity:-1)
A7  >C6.run(cumulative defect quantity += cumulative defect quantity[-1])
B7  >C6.run(cumulative defect ratio = round((cumulative defect quantity / B6) * 100, 2))
A8  =C6.new(defect code, defect name.name:defect name, defect quantity, cumulative defect quantity, cumulative defect ratio)
A9  return A8

The cells that have already been explained are not repeated here.

A2: set up the database cursor, splicing the SQL dynamically according to a logical judgment. When the query's end date >= the current system date and the current query time is after 6 o'clock of the same day, only the real-time data of that day is queried; if the current query time is before 6 o'clock of the day, the query returns the real-time data of both yesterday and today; otherwise, a query with an empty result is executed, to adapt to the scenario where only historical data is queried.

A3: call "determine the scope of the read file.dfx", passing in the start date and end date as script parameters, and get the collection of all set files within that range.

A4: loop over A3, opening each set file and creating a cursor on it; the cursor() function's @b option means reading from a set file.

B4: using the multicursor concept provided by the aggregator, multiple cursors with the same data structure are combined into one cursor. When it runs, a multicursor processes the data of each channel in parallel. The degree of parallelism can be set through n in cs.mcursor(n); when n is omitted, a default degree of parallelism is used.
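
For instance, to set the parallelism explicitly to four channels instead of relying on the default, cell B4 could be written as in the following sketch (the value 4 is illustrative only):

B4  =(A2|A4).mcursor(4)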

A9: finally, return the result set for the report tool to use.

4.3 Hot export

The so-called hot export is relative to cold export: it guarantees that the query system never stops, so a query request that arrives while data is being exported can still be served. Hot export is generally suitable for scenarios with high requirements for real-time query availability.

4.3.1 implementation ideas

Hot export uses a file backup mechanism, combined with database consistency, to achieve a hot-switch action. For ease of understanding, refer to the following logic diagram:

First, create a backup table in the database. Its main purpose is to record which backup file is currently in use and the date boundary of the hot data fetched from the DB; it is emptied when the query system starts.

Second, export the historical data to set file A and back it up as file B at the same time; record file A in the database backup table and set the boundary date of the hot data taken from the DB (for example, data after a certain point in time). This action is done only once, when the system is initialized and started.

Then, design the process of data query:

1. Create a status table in the database. When querying data, first look up in the backup table which backup file is available and the date boundary of the hot data, then add a record to the status table (its id is auto-incremented) to indicate that the backup file has an active query. When the query completes, delete that record from the status table.

Third, design the process of exporting data to the set file:

1. Run a scheduled task at 2:00 a.m. every day: first synchronize the historical data and append it to file B; when the export is complete, modify the record in the database backup table to use file B, and at the same time modify the boundary of the hot data fetched from the DB. Queries issued from then on will use file B.

2. Check and wait until the usage records of A in the status table are empty (all queries based on A have finished), then append the synchronized historical data to file A as well; otherwise check again every minute.

3. When the append in step 2 is done, modify the database backup table to use file A again; newly issued queries go back to using file A, which completes the hot-switch action.

4. Wait until the usage records of B in the status table are cleared (the B-based queries have also finished).

5. The whole process is complete; wait for the next round of export.

Note that the backup table and the status table must use the database as the medium so as to take advantage of the database's consistency; files cannot be used to record their contents, because files cannot maintain consistency and things can get messy when multiple tasks run concurrently.

4.3.2 data query

The first step is to define a "backup table" in the database, containing three fields (file name / boundary time / flag), and a "query status table", containing three fields (unique id / file name / current system time). The data structure is shown below:
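
As a minimal sketch of what these two tables might look like, the following SPL snippet creates them once at initialization time. The table and column names (backup, status, name, crashtime, flag, uniques, time) come from the scripts in this article; the column types and the MySQL-style auto-increment syntax are assumptions:

A1  =connect("demo")
A2  >A1.execute("CREATE TABLE backup (name VARCHAR(20), crashtime DATETIME, flag VARCHAR(30))")
A3  >A1.execute("CREATE TABLE status (uniques INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(20), time DATETIME)")
A4  >A1.close()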

The second step is to make backup set file B from set file A, then record the queryable backup file A in the backup table and set the hot-data boundary time fetched from the DB (defined here as midnight of the current day). If this step is performed with an aggregator script, a sample looks like this:

A1  =movefile@c(file("D:/PT/MES-A.btx"), "D:/PT/MES-B.btx")
B1  =connect("demo")
A2  >B1.execute("INSERT INTO backup (name,crashtime,flag) VALUES (?,?,?)", "A", date(now()), "WORKING_STATUS")
B2  >B1.close()

A1: copy the exported set file A to create an identical backup file B.

A2: after backup file B is created, write to the backup table the currently available set file A, the current system time (the midnight boundary), and the flag value WORKING_STATUS.

In the third step, we modify the data query script from the "mixed computing scenario" section again. The modified script is as follows (it likewise supports querying only historical data):

A1  =connect("demo")
B1  =A1.query("SELECT code,name FROM watch")
A2  =A1.query@1("SELECT NAME,crashtime FROM BACKUP WHERE flag='WORKING_STATUS'")
B2  >name=A2(1), crashtime=A2(2)
A3  =A1.cursor("SELECT code,nums FROM meta_resource WHERE " + if(Efiledate>=date(crashtime), "DATE_FORMAT(fildate,'%Y-%m-%d')>='" + string(date(crashtime)) + "'", "1=0"))
B3  >A1.execute("INSERT INTO status (name,time) VALUES (?,?)", name, now()), uniques=A1.query@1("SELECT @@identity")
A4  =file(concat("D:/PT/MES-", name, ".btx")).cursor@b().select(filedate>=Bfiledate && filedate<=Efiledate)
B4  =[A3,A4].conjx()
A5  =B4.groups(code:defect code, code:defect name; sum(nums):defect quantity, sum(nums):cumulative defect quantity, sum(0):cumulative defect ratio)
A6  >A5.switch(defect name, B1:code)
B6  =A5.sort(defect quantity:-1)
A7  >B6.run(cumulative defect quantity += cumulative defect quantity[-1])
B7  =A5.sum(defect quantity)
A8  >B6.run(cumulative defect ratio = round((cumulative defect quantity / B7) * 100, 2))
B8  =B6.new(defect code, defect name.name:defect name, defect quantity, cumulative defect quantity, cumulative defect ratio)
A9  >A1.execute("DELETE FROM STATUS WHERE uniques=?", uniques)
B9  >A1.close()
A10  return B8

The cells that have already been explained are not repeated here.

A2: query the currently available set file name and the boundary date/time of the hot data, using the flag WORKING_STATUS as the condition.

B2: define the variables name and crashtime and assign them, to make later cell calculations easier to reference.

B3: this cell does two things. First it writes a record into the status table to indicate that the current backup file has an active query, where uniques is an auto-increment column. Then, after the insert, it obtains the auto-increment value generated by that insert statement by executing SELECT @@identity and assigns it to the variable uniques, so that the A9 deletion can reference it. The effect in the database is as follows:

A9: when the query is completed, delete the record from the status table using the value of the variable uniques as the condition. The effect is as follows:

4.3.3 Synchronizing data and hot switching

The data export script from the "cold export" scheme is modified and executed at 2:00 a.m. every day. The code is as follows:

A1  =file("D:/PT/MES-B.btx")
B1  =connect("demo")
C1  =file("D:/PT/MES-A.btx")
A2  =B1.cursor("SELECT * FROM meta_resource WHERE DATE_FORMAT(fildate,'%Y-%m-%d')=?", after(date(now()),-1))
B2  >A1.export@ab(A2)
C2  >B1.execute("UPDATE BACKUP SET NAME=?, crashtime=? WHERE flag='WORKING_STATUS'", "B", date(now()))
A3  for connect("demo").query@1x("SELECT COUNT(*) FROM STATUS WHERE NAME='A'")>0
B3  >sleep(60*1000)
A4  =B1.cursor("SELECT * FROM meta_resource WHERE DATE_FORMAT(fildate,'%Y-%m-%d')=?", after(date(now()),-1))
B4  >C1.export@ab(A4)
C4  >B1.execute("UPDATE BACKUP SET NAME=?, crashtime=? WHERE flag='WORKING_STATUS'", "A", date(now()))
A5  for connect("demo").query@1x("SELECT COUNT(*) FROM STATUS WHERE NAME='B'")>0
B5  >sleep(60*1000)
A6  >B1.close()

The cells that have already been explained are not repeated here.

C2: when the historical data has been appended to file B, modify the record in the database "backup table" to use file B, and modify the boundary date from which data is fetched from the DB. The execution result is as follows:

A3-B3: loop to check whether the usage records of A in the status table have been cleared. If any A-based query has not finished, wait one minute and check again, until all A-based queries are complete.

A note on the database connection expression in A3: in general, since a database connection is already defined in cell B1, A3 could simply reference it and be written as:

for B1.query@1("SELECT COUNT(*) FROM STATUS WHERE NAME='A'")>0

However, some databases by default let one connection handle only one transaction at a time, which causes A3's loop condition to stick with the result of the first query. For example, the first query returns true, and even after the database changes it still returns true. To be safe, it can be written in the following form:

for connect("demo").query@1x("SELECT COUNT(*) FROM STATUS WHERE NAME='A'")>0

This falls into the category of database configuration and can be controlled through the database's connection parameters, which will not be explained in detail here.

C4: when the data has been appended to file A, modify the record in the database "backup table" to use file A; newly issued queries will use file A again. The execution result is as follows:

A5-B5: loop to check whether the usage records of B in the status table have been cleared. If any B-based query has not finished, wait one minute and check again. Once the B-based queries are complete, the whole export process is finished, and we wait for the next round of export.

5. Summary

Hot export of data is a complicated topic in real-time report (T+0) scenarios. However, the problem can be solved fairly easily by using the backup mechanism of the aggregator's set files combined with database consistency. Two advantages are mainly used:

1. Cross-database hybrid computing

As an independent computing engine, the aggregator can direct each database to compute separately and in parallel, collect the results, perform a further round of aggregation, and then submit the result to the front end or land it to storage, making full-volume report queries easy to achieve.

At the same time, the aggregator's cross-database hybrid computing model does not require the databases to be homogeneous, so historical data can be kept in lower-cost open-source databases, for example mixing an Oracle production database with MySQL clusters.
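
A minimal sketch of such a cross-database mixed query, assuming a hypothetical lower-cost history data source named historydb that holds a table meta_resource_his with the same structure as the production table, might look like this:

A1  =connect("demo")
B1  =connect("historydb")
A2  =A1.query("SELECT code,nums FROM meta_resource WHERE fildate>=?", date(now()))
B2  =B1.query("SELECT code,nums FROM meta_resource_his WHERE fildate<?", date(now()))
A3  =(A2|B2).groups(code; sum(nums):nums)
A4  >A1.close(), B1.close()

The two queries run against different databases; the aggregator unions and summarizes their results in A3, regardless of whether the two databases are of the same type.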

2. Cost-effective, high-performance set files

Without building a data warehouse, the historical data is stored externally in the file system, which is not only easy to manage but also delivers better IO performance and computing power, thereby resolving the performance bottleneck and storage cost that large data volumes cause in a relational database.
