Posing the problem
There are usually three options for the background data source of a multidimensional analysis (BI) system: first, an ordinary database; second, a professional data warehouse; third, the BI system's own built-in data source.
However, all three options have problems. Ordinary databases generally use row-based storage, so it is hard to obtain the high performance that multidimensional analysis requires; they are only suitable for small data volumes. Many professional data warehouses use column storage and have no major performance problems, but they are expensive, and the costs of construction, expansion, and maintenance are high. The BI system's own data source is relatively closed: it can only support its own BI front end and cannot provide data services to front ends from other vendors.
Solution ideas and process
The aggregator can independently take on the role of a lightweight multidimensional analysis backend, acting as a small or medium-sized data warehouse or data mart. The structure diagram is as follows:
The aggregator stores the data for multidimensional analysis in advance in a column-stored binary file called a group table. The multidimensional analysis front end generates SQL through drag-and-drop operations and submits it through the aggregator JDBC. The aggregator executes the SQL query against the group table and returns the result to the front end. The group table file can itself be collected and computed by the aggregator from various heterogeneous data sources.
Compared with the ordinary database scheme, the aggregator's column-stored binary files directly improve performance. Compared with expensive professional data warehouses and relatively closed BI built-in data sources, the aggregator provides a more economical and simpler solution, and it can gather data from many heterogeneous sources.
There are three ways to deploy the aggregator: 1. integrated in the front-end application; 2. as an independent server; 3. as a cluster with hot standby. The specific methods are described below.
Case scenario description
In the following case, the multidimensional analysis system performs self-service analysis of order data. For simplicity, the front end of the multidimensional analysis system is simulated by jdbc.jsp in a Tomcat server. Tomcat is installed at C:\tomcat6 on a Windows operating system.
The aggregator JDBC is integrated into the multidimensional analysis application. jdbc.jsp imitates the multidimensional analysis application: it generates SQL that conforms to the aggregator specification and submits it through the aggregator JDBC to an aggregator SPL script for processing.
The data of the multidimensional analysis system comes from the ORDERS table in the production database demo, an ORACLE database. To avoid putting too much pressure on the production database, the multidimensional analysis system must not connect to the demo database directly. The ORDERS table holds the full data, and the aggregator ETL synchronizes each day's latest data into the group table file every day. Dates are based on the order date ORDERDATE; assume the current date is 2015-07-18.
Background data initialization
Use the following ordersAll.sql file to complete the ORDERS table creation and data initialization in the ORACLE database.
OrdersAll
The data ends on July 17, 2015.
When the multidimensional analysis system goes online, the historical data in the ORDERS table should first be synchronized into the aggregator's binary file. This is one-time preparatory work before going live and does not need to be repeated afterwards.
In the aggregator, create a new data source orcl that connects to the ORACLE database. Use the SPL script etlAll.dfx to read all the data into the aggregator group table file orders.ctx. The SPL script is as follows:
A1: =connect("orcl")
B1: =A1.cursor@d("select ORDERDATE,CUSTOMERID,EMPLOYEEID,ORDERID,AMOUNT from ORDERS order by ORDERDATE,CUSTOMERID,EMPLOYEEID,ORDERID")
A2: =file("C:/tomcat6/webapps/DW/WEB-INF/data/orders.ctx")
A3: =A2.create(#ORDERDATE,#CUSTOMERID,#EMPLOYEEID,#ORDERID,AMOUNT)
A4: =A3.append(B1)
B4: >A1.close()
orders.ctx is a group table file. It is stored in columns by default and supports parallel computation over arbitrary segments, which effectively improves query speed. When generating a group table, pay attention to pre-sorting the data and defining the dimension fields sensibly. In this example, based on the fields most often filtered and grouped on, the dimension fields are: ORDERDATE, CUSTOMERID, EMPLOYEEID, ORDERID.
When fetching data from ORACLE, sort it by the dimension fields. ORDERDATE, CUSTOMERID, and EMPLOYEEID have more duplicate values, so they are sorted first; ORDERID has fewer duplicates, so it is sorted last.
The @d option of the database cursor in cell B1 converts numeric data to the double type when fetching from the ORACLE database, which offers sufficient precision for common values such as amounts. Without this option, numeric data is converted to BigDecimal by default, and computing performance suffers greatly.
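As a rough illustration of the cost difference (plain Java, not aggregator code; the row count and values are made up), summing a few million amounts as primitive double is typically many times faster than summing the same values as BigDecimal:

import java.math.BigDecimal;

public class DoubleVsDecimal {
    public static void main(String[] args) {
        int n = 2_000_000;
        double[] d = new double[n];
        BigDecimal[] b = new BigDecimal[n];
        for (int i = 0; i < n; i++) {
            d[i] = i * 0.01;                       // amount as double
            b[i] = BigDecimal.valueOf(i, 2);       // same amount, scale 2
        }

        long t0 = System.nanoTime();
        double sumD = 0;
        for (double v : d) sumD += v;              // primitive arithmetic
        long t1 = System.nanoTime();

        BigDecimal sumB = BigDecimal.ZERO;
        for (BigDecimal v : b) sumB = sumB.add(v); // a new object per addition
        long t2 = System.nanoTime();

        System.out.printf("double:     %.2f in %d ms%n", sumD, (t1 - t0) / 1_000_000);
        System.out.printf("BigDecimal: %s in %d ms%n", sumB, (t2 - t1) / 1_000_000);
    }
}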
ETL process
After the multidimensional analysis system goes online, the latest data of the day must be synchronized regularly every night. Assume the current date is 2015-07-18. Use the following ordersUpdate.sql file to add the day's data to the ORDERS table in the ORACLE database, simulating the data increment.
OrdersUpdate
Append the current day's data increment to the aggregator group table file orders.ctx with the SPL script etlUpdate1.dfx. The SPL script is as follows:
A1: =connect("orcl")
A2: =A1.cursor@d("select ORDERDATE,CUSTOMERID,EMPLOYEEID,ORDERID,AMOUNT from ORDERS where ORDERDATE=? order by ORDERDATE,CUSTOMERID,EMPLOYEEID,ORDERID", etlDate)
A3: =file("C:/tomcat6/webapps/DW/WEB-INF/data/orders.ctx")
A4: =A3.create()
B4: =A4.append(A2)
C4: =A3.rollback()
A5: >A1.close()
The input parameter of etlUpdate1.dfx is etlDate, the date whose data needs to be appended.
Cell B4 appends the new data directly to the group table file. Because the first sort field is ORDERDATE, appending new data does not break the sort order. If the first sort field were not ORDERDATE, the data would have to be re-sorted and merged.
rollback in C4 is a rollback function: if an error occurs while B4 executes the append, the rollback restores the group table to its state before the append. After normal execution, nothing is rolled back.
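A plain-Java analogy of this append-then-rollback pattern (hypothetical file handling, not the group table mechanism itself): record the file length before appending and truncate back to it if the append fails partway:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AppendWithRollback {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("orders", ".bin");
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            long mark = ch.size();        // state before the append
            ch.position(mark);
            try {
                ch.write(ByteBuffer.wrap("new rows...".getBytes(StandardCharsets.UTF_8)));
                // if a failure happened here, the catch below would undo the partial append
            } catch (IOException e) {
                ch.truncate(mark);        // "rollback": restore the previous state
                throw e;
            }
        }
        System.out.println("appended, size = " + Files.size(file));
    }
}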
The etlUpdate1.dfx script can be executed from the Windows or Linux command line and run regularly via scheduled tasks, or scheduled through an ETL tool.
The Windows command line is:
C:\Program Files\raqsoft\esProc\bin>esprocx.exe C:\etlUpdate1.dfx
The Linux command is:
/raqsoft/esProc/bin/esprocx.sh /esproc/etlUpdate1.dfx
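If the nightly synchronization is scheduled with cron on Linux, a crontab entry along the following lines could be used (the 1:00 a.m. run time is an assumption; choose whatever fits the actual ETL window):

0 1 * * * /raqsoft/esProc/bin/esprocx.sh /esproc/etlUpdate1.dfx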
Application structure 1: application integrated computing
The aggregator JDBC is integrated in the application of multidimensional analysis. After receiving the SQL, it looks up the local file orders.ctx and returns the result.
1. Copy the DW directory in the compressed file below to the application directory of tomcat.
DW
The directory structure is shown below:
The configuration file goes in the classes directory, and the license file obtained from the official website should also be placed in classes. The aggregator's jar packages go in the lib directory (refer to the aggregator tutorial for which jars are needed).
Modify the home directory configuration in raqsoftConfig.xml:
C:\tomcat6\webapps\DW\WEB-INF\data
Once the home directory is configured, the group table can be referenced simply as orders.ctx instead of by its full path.
2. Edit jdbc.jsp in the DW directory, which simulates the front-end interface by submitting SQL and displaying the result. Its key content is as follows (the connection part uses the aggregator's standard JDBC driver class and URL):
<%@ page language="java" import="java.sql.*" pageEncoding="utf-8" %>
<%
    Connection conn = null;
    try {
        // load the aggregator JDBC driver and open a local connection
        Class.forName("com.esproc.jdbc.InternalDriver");
        conn = DriverManager.getConnection("jdbc:esproc:local://");
        Statement statement = conn.createStatement();
        String sql = "select top 10 ORDERID,CUSTOMERID,EMPLOYEEID,ORDERDATE,AMOUNT"
                   + " from ORDERS.ctx where ORDERDATE=date('2015-07-18') and AMOUNT>100";
        out.println("Test page v1");
        out.println("order ID" + "\t" + "customer ID" + "\t" + "employee ID" + "\t"
                   + "order date" + "\t" + "order amount" + "<br>");
        ResultSet rs = statement.executeQuery(sql);
        int f1;
        String f2, f3, f4;
        float f5;
        while (rs.next()) {
            f1 = rs.getInt("ORDERID");
            f2 = rs.getString("CUSTOMERID");
            f3 = rs.getString("EMPLOYEEID");
            f4 = rs.getString("ORDERDATE");
            f5 = rs.getFloat("AMOUNT");
            out.println(f1 + "\t" + f2 + "\t" + f3 + "\t" + f4 + "\t" + f5 + "<br>");
        }
        rs.close();
        conn.close();
    } catch (ClassNotFoundException e) {
        System.out.println("Sorry, can't find the Driver!");
        e.printStackTrace();
    } catch (SQLException e) {
        e.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    }
%>
As you can see, in the JSP the aggregator JDBC connection is established first, and then the SQL is submitted for execution. The steps are exactly the same as with an ordinary database, giving high compatibility and generality. For a multidimensional analysis tool, connecting JDBC and submitting SQL are interface operations, but the underlying principle is exactly the same as in this JSP.
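For reference, here is the same flow as a minimal standalone Java program rather than a JSP (class name hypothetical; driver class and URL as in jdbc.jsp above):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AggregatorJdbcDemo {
    public static void main(String[] args) throws Exception {
        // load the aggregator JDBC driver and open a local connection
        Class.forName("com.esproc.jdbc.InternalDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:esproc:local://");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                 "select top 10 ORDERID,CUSTOMERID,EMPLOYEEID,ORDERDATE,AMOUNT"
                 + " from ORDERS.ctx where ORDERDATE=date('2015-07-18') and AMOUNT>100")) {
            while (rs.next()) {
                System.out.printf("%d\t%s\t%s\t%s\t%.2f%n",
                    rs.getInt("ORDERID"), rs.getString("CUSTOMERID"),
                    rs.getString("EMPLOYEEID"), rs.getString("ORDERDATE"),
                    rs.getFloat("AMOUNT"));
            }
        }
    }
}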
3. Start tomcat, access http://localhost:8080/DW/jdbc.jsp in the browser, and view the results.
You can also continue to test the following:
1. Group summary
sql = "select CUSTOMERID,EMPLOYEEID,sum(AMOUNT) orderTotal,count(1) orderCount from ORDERS.ctx where ORDERDATE=date('2015-07-18') group by CUSTOMERID,EMPLOYEEID"
2. Parallel query
sql = "select /*+ parallel(4) */ top 10 ORDERID,CUSTOMERID,EMPLOYEEID,ORDERDATE,AMOUNT from ORDERS.ctx where ORDERDATE=date('2015-07-18') and AMOUNT>100"
Like ORACLE, the aggregator's simple SQL also supports parallel query hints such as /*+ parallel(4) */.
Application structure 2: stand-alone server
The first scheme uses the application server's own resources. With high concurrency or large data volumes, it puts great pressure on the application server. In that case it is recommended to use an independent node server for data computation.
After receiving the SQL, the aggregator JDBC passes it to the DW.dfx program for processing. DW.dfx calls DWServer.dfx on the node server to do the computation. DWServer.dfx translates the table name into a file name, queries the local file orders.ctx, and returns the result.
Copy the DWServer directory below to a suitable location. The aggregator node server is cross-platform and can run on any operating system that supports Java; see the aggregator tutorial for deployment details. Here we assume it is placed in the root of the C: drive on Windows.
DWServer
1. Before the system goes online, run the initialization dfx and put the orders.ctx file in the C:/DWServer/data directory. For testing, you can simply copy the previously generated orders.ctx.
2. Modify the earlier etlUpdate1.dfx: change the file cell A3 to =file("C:/DWServer/data/orders.ctx") and save it as etlUpdate2.dfx. The modified etlUpdate2.dfx is in the C:\DWServer directory.
3. Open C:\tomcat6\webapps\DW\WEB-INF\dfx\DW.dfx on the application server and study the SPL code. The parameter sql is the incoming SQL statement.
A1: =callx("DWServer.dfx", [sql]; ["127.0.0.1:8281"])
A2: return A1.ifn()
A1: calls DWServer.dfx on the node machine. The parameter is [sql]; the brackets denote a sequence, here a sequence with a single member. ["127.0.0.1:8281"] is the sequence of node machines, written as IP:port.
A2: returns the result of the call in A1. Because the result of callx is a sequence, the ifn function takes its first non-null member, which is the result corresponding to the SQL.
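In plain Java terms, ifn simply picks the first non-null member of a sequence, roughly like this (hypothetical helper, not the aggregator API):

import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class Ifn {
    // first non-null member of a sequence, or null if all members are null
    static <T> T ifn(List<T> seq) {
        return seq.stream().filter(Objects::nonNull).findFirst().orElse(null);
    }

    public static void main(String[] args) {
        List<String> callResults = Arrays.asList(null, "result-from-node", null);
        System.out.println(ifn(callResults));   // prints: result-from-node
    }
}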
Modify the following configuration in C:\tomcat6\webapps\DW\WEB-INF\classes\raqsoftConfig.xml:
<dfxPathList>
    <dfxPath>C:\tomcat6\webapps\DW\WEB-INF\dfx</dfxPath>
</dfxPathList>
<JDBC>
    <load>Runtime,Server</load>
    <gateway>DW.dfx</gateway>
</JDBC>
The gateway tag here names the JDBC gateway dfx file. When the BI system calls the aggregator JDBC, the submitted SQL is handed to this gateway file for processing. If the tag is not configured, statements submitted through JDBC are parsed directly by the aggregator as script.
4. Start the node server.
Run esprocs.exe, as shown below:
Click the configuration button to configure the relevant parameters:
After clicking OK, return to the main interface and click the start button.
5. Open C:\DWServer\dfx\DWServer.dfx and study the SPL code.
A1: =filename="C:/DWServer/data/orders.ctx"
A2: =sql=replace(sql, "from ORDERS.ctx", "from " + filename)
A3: =connect()
B3: =A3.cursor@x(A2)
C3: return B3
A1: defines the absolute path of the group table file.
A2: replaces the table name in the SQL with the absolute file path.
A3-C3: connect to the file system, execute the SQL to get a cursor, and return it.
In server mode you can also configure the home directory as in application structure 1, so that A2 would not need an absolute path. Writing the path in the SPL, as here, has the advantage that one server can serve multiple sets of data tables (files); with many files in a single home directory, management would become inconvenient.
6. Restart tomcat and visit http://localhost:8080/DW/jdbc.jsp in the browser to view the result.
Application structure 3: cluster hot backup
As concurrency or data volume grows, node servers can be scaled out horizontally to handle the pressure of highly concurrent access or large-scale computation.
1. Deploy another set of aggregator nodes on a second Windows machine. The deployment method differs slightly from scheme 2 in that data partitions must be configured. The IP addresses of the two servers are 192.168.0.122 and 192.168.0.176. The C:\DWServer directory from scheme 2 is also copied to the second server.
In the figure, the data partition name is configured as 0 and the path as c:/DWServer/data. Note that both servers need this configuration.
2. Rewrite c:/DWServer/dfx/etlUpdate2.dfx on 192.168.0.122 and save it as etlUpdate3.dfx.
A1: =connect("orcl")
A2: =A1.cursor@d("select ORDERDATE,CUSTOMERID,EMPLOYEEID,ORDERID,AMOUNT from ORDERS where ORDERDATE=? order by ORDERDATE,CUSTOMERID,EMPLOYEEID,ORDERID", etlDate)
A3: =file("C:/DWServer/data/orders.ctx")
A4: =A3.create()
B4: =A4.append(A2)
C4: =A3.rollback()
A5: >A1.close()
A6: =sync(["192.168.0.176:8281"]; "192.168.0.122:8281"; 0)
Cell A6 synchronizes the updated orders.ctx to partition 0 of 192.168.0.176, that is, to its C:/DWServer/data directory. ["192.168.0.176:8281"] is the list of node machines to be synchronized; if more nodes need synchronization, write ["IP1:PORT1","IP2:PORT2","IP3:PORT3"].
Because the synchronization code is included here, the scheduled task only needs to execute etlUpdate3.dfx on 192.168.0.122.
3. Open C:\tomcat6\webapps\DW\WEB-INF\dfx\DW.dfx on the application server and modify it as follows:
A1: =callx("DWServer.dfx", [sql]; ["192.168.0.122:8281", "192.168.0.176:8281"])
A2: return A1.ifn()
A1: calls DWServer.dfx on the node machines. The parameter is [sql], a sequence with a single member. Because the nodes form a cluster, there are two IP addresses. Under concurrent load, callx distributes requests randomly across the two nodes.
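Conceptually the dispatch works like the following sketch (hypothetical helper in plain Java, not the aggregator's callx implementation): pick a node at random and fall back to the other node if the chosen one fails:

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Function;

public class NodeDispatcher {
    private final List<String> nodes;

    public NodeDispatcher(List<String> nodes) { this.nodes = nodes; }

    // try nodes starting from a random one; first success wins
    public String dispatch(Function<String, String> call) {
        int start = ThreadLocalRandom.current().nextInt(nodes.size());
        for (int i = 0; i < nodes.size(); i++) {
            String node = nodes.get((start + i) % nodes.size());
            try {
                return call.apply(node);          // e.g. submit the SQL to "ip:port"
            } catch (RuntimeException e) {
                // node unreachable: try the next one (hot backup)
            }
        }
        throw new IllegalStateException("all nodes failed");
    }

    public static void main(String[] args) {
        NodeDispatcher d = new NodeDispatcher(
            List.of("192.168.0.122:8281", "192.168.0.176:8281"));
        System.out.println(d.dispatch(node -> "queried " + node));
    }
}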
4. Restart tomcat and visit http://localhost:8080/DW/jdbc.jsp in the browser to view the result.
Summary of aggregator advantages
Open, lightweight data warehouse / data mart
The aggregator is professional data computing middleware (DCM) with independent computing power; it can provide data source services for multidimensional analysis front ends without a database or data warehouse.
The aggregator stores data in columns and offers the query performance of a professional data warehouse: over tens of millions of rows, detailed queries return in seconds. Ordinary databases are generally row-stored and cannot meet the performance requirements of multidimensional analysis. At the same time, unlike a professional data warehouse, the aggregator is cheaper, and the costs of construction, expansion, and maintenance are comparatively low.
The aggregator is open and provides standard JDBC services to multidimensional analysis front ends. It can form a platform-style backend data source serving front ends from many different vendors at the same time.
Group table column storage / ordered compressed storage
Advanced data storage mode is an important guarantee for the successful implementation of data computing middleware (DCM).
The aggregator group table stores data in columns, and the performance improvement is especially obvious for wide-table queries that touch only a few of many fields. The column storage mechanism of the group table differs from conventional ones. Conventional column storage (such as the parquet format) divides data into blocks and then stores columns within each block, which limits parallel computation. The group table's compressed column storage adopts a doubling segmentation technique that permits parallel computation over arbitrary segments, making full use of multiple CPU cores and the hard disk's I/O capacity.
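The effect can be sketched in plain Java (a schematic of segment-parallel column scanning, not the group table format itself): an aggregation reads only the AMOUNT column, and that column can be cut into as many segments as there are cores:

import java.util.stream.IntStream;

public class SegmentedColumnScan {
    public static void main(String[] args) {
        int n = 10_000_000;
        double[] amount = new double[n];          // one column, stored contiguously
        for (int i = 0; i < n; i++) amount[i] = i % 100;

        int segments = Runtime.getRuntime().availableProcessors();
        int segSize = (n + segments - 1) / segments;

        // each segment is summed on its own core, then the partial sums are combined
        double total = IntStream.range(0, segments).parallel()
            .mapToDouble(s -> {
                double sum = 0;
                for (int i = s * segSize, end = Math.min(n, i + segSize); i < end; i++)
                    sum += amount[i];
                return sum;
            }).sum();

        System.out.println("total = " + total);
    }
}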
When a group table is generated, the dimension fields should be specified; the data itself is stored in order by these fields, so common conditional filters achieve high performance without relying on an index. The files are stored compressed, which reduces disk footprint and speeds up reading. Because the compression ratio is kept moderate, the CPU time spent on decompression is negligible.
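A small illustration of why ordered storage can replace an index (plain Java; dates simplified to day numbers): when rows are sorted by ORDERDATE, two binary searches locate the contiguous block of rows for any given day:

public class OrderedFilter {
    public static void main(String[] args) {
        int n = 1_000_000;
        int[] orderDate = new int[n];                        // sorted day numbers
        for (int i = 0; i < n; i++) orderDate[i] = i / 2000; // ~2000 rows per day

        int day = 365;
        int lo = lowerBound(orderDate, day);                 // first row of that day
        int hi = lowerBound(orderDate, day + 1);             // first row of the next day
        System.out.println("rows for day " + day + ": [" + lo + ", " + hi + ")");
    }

    // first index whose value is >= key
    static int lowerBound(int[] a, int key) {
        int lo = 0, hi = a.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}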
The group table can also store data row-wise and entirely in memory, supporting operation as an in-memory database.
Cluster function
Agile clustering capabilities can ensure high performance and high availability of data computing middleware (DCM).
The aggregator node server is a standalone process that accepts computation requests from the aggregator JDBC and returns results. Concurrent requests can be spread across multiple servers computing at the same time, raising concurrent capacity. A single large computation task can also be split into many small tasks sent to multiple servers at once, achieving parallel computation over big data.
The aggregator's cluster computing scheme scales out agilely: when concurrency or data volume grows, the pressure can be relieved by quickly adding nodes. The aggregator cluster is also fault-tolerant: if individual nodes fail, the cluster keeps working and computation tasks continue to run, providing multi-machine hot backup and high availability.
Application and promotion
As data computing middleware (DCM), the backend data source provided by the aggregator can support many kinds of front-end applications, not just multidimensional analysis, for example: large-screen displays, management cockpits, real-time reports, reports over large data volumes, batch report subscriptions, and so on.
The backend data source formed by the aggregator can also work together with a database or data warehouse. The data computing gateway and routing implemented by the aggregator can switch intelligently between aggregator-cached data and the data warehouse, solving cases where the data warehouse alone cannot meet performance requirements, for example, computing hot and cold data separately. For details, see "How the aggregator implements computing routing to optimize BI backend performance".