Raising the question
The common structure of a BI system is: the front end is the BI application, responsible for user operations and the presentation of multidimensional analysis results; the back end is the database / data warehouse, responsible for data computation and storage. SQL serves as the interface between the front end and the back end.
In practice, the background data warehouse often comes under excessive pressure. The symptom is that the data warehouse responds more and more slowly and the front end's response time grows too long.
A common solution is to add a front database between the data warehouse and the application. However, it is difficult to implement data routing and hybrid computing between the front database and the background data warehouse. For example, frequently accessed hot data is placed in the front database while the large volume of cold data stays in the data warehouse, and each query must decide, according to certain rules, whether to visit the front database or the background data warehouse. If the front database and the background data warehouse are different products, SQL translation must also be considered.
Solution ideas and process
Building an independent data front layer is an important application mode of the aggregator as data computing middleware (DCM). The data front layer restructures the BI system into a three-tier architecture: the data storage and batch computing layer is handled by the database; the data front and cache layer is handled by the aggregator; the data analysis and presentation layer is handled by multidimensional analysis tools or reporting tools.
The aggregator can cache data and perform complex calculations independently of the database. At the same time, it provides a programmable gateway mechanism that switches freely between cached computation and SQL pass-through. Using the aggregator for front-layer computation keeps it separate from the batch computing tasks undertaken by the database, and there is no need to build another database.
The aggregator keeps hot data and recent data in the data front layer, acting as a data cache, which effectively speeds up data computation and reduces user waiting time.
The architecture of the system is as follows:
Case scenario description
The foreground BI system performs self-service queries on order data, with the order date as the required query condition. For simplicity, the foreground BI system is simulated by jdbc.jsp on a tomcat server.
The aggregator JDBC and intelligent gateway are integrated into the application system. Jdbc.jsp imitates the BI application: it generates SQL conforming to the aggregator's simple query specification and submits it through the aggregator JDBC to the intelligent gateway for processing.
The data comes from the ORDERS table of the ORACLE database demo. The ORDERS table holds the full data, while the aggregator stores only the last three years, e.g. 2015-2018, based on the order date.
Basic data preparation and extraction of cached data
Use the following orders.sql file to complete the ORDERS table creation and data initialization in the ORACLE database.
Click to download orders.sql
In the aggregator, create a new data source orcl connecting to the ORACLE database. The SPL script etl1.dfx pre-reads the last three years of data into the aggregator set file orders.btx. The SPL script is as follows:
A1	=year(now())-3
A2	=connect("orcl")
B2	=A2.cursor@d("select * from orders where to_char(orderdate,'yyyy')>?", A1)
A3	=file("C:/tomcat6/webapps/gateway/WEB-INF/data/orders.btx")
A4	=A3.export@z(B2)
B4	>A2.close()
As the SPL script shows, a single export statement in cell A4 writes the database data out to a file. The set file is a binary format built into the aggregator; thanks to its simple compression mechanism, the same data takes up less space than in the database. The @z option writes the file in segmentable form, which is ideal for multidimensional analysis operations that often need to run in parallel.
The @d option on the database cursor in cell B2 converts numeric data to the double type when fetching from ORACLE, which is accurate enough for common values such as amounts. Without this option, numbers are converted to BigDecimal by default, and computing performance is greatly affected.
The script can be executed from the windows or linux command line, and batch jobs can be run regularly in combination with scheduled tasks. On the windows command line:
C:\Program Files\raqsoft\esProc\bin>esprocx.exe C:\etl1.dfx
The linux command is:
/raqsoft/esProc/bin/esprocx.sh /gateway/etl1.dfx
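For example, a crontab entry along the following lines (an illustrative schedule, not part of the original setup) would refresh the cache every night at 2:00:
0 2 * * * /raqsoft/esProc/bin/esprocx.sh /gateway/etl1.dfx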
Solution 1: application server integrated computing
After receiving the SQL, the aggregator's JDBC intelligent gateway hands it to the gateway1.dfx program for processing. Gateway1.dfx determines whether the query falls within the last three years. If so, it changes the table name to the file name and queries the local file orders.btx to return the result. If not, it translates the SQL into ORACLE format and submits it to the database for processing.
1. Copy the following gateway directory to the application directory of tomcat.
Click to download gateway.zip
The directory structure is shown below:
Note: the configuration file goes in classes, and the license file obtained from the official website should also be placed in the classes directory. The aggregator's jar packages go in the lib directory (refer to the aggregator tutorial for which jars are needed). In addition, check and modify the following configuration in raqsoftConfig.xml:
<dfxPathList>
	<dfxPath>C:\tomcat6\webapps\gateway\WEB-INF\dfx</dfxPath>
</dfxPathList>
<load>Runtime,Server</load>
<gateway>gateway1.dfx</gateway>
The content of the <gateway> tag is the gateway dfx file. When the aggregator JDBC is called from the BI system, the SQL being executed is handed to this gateway file for processing. If the tag is not configured, statements submitted through JDBC are parsed directly by the aggregator as script, and the desired routing rules cannot be implemented.
2. Edit jdbc.jsp in the gateway directory to simulate the foreground interface: submit the SQL and display the result.
As you can see, the jsp first connects to the aggregator JDBC and then submits the SQL for execution. The steps are exactly the same as with an ordinary database, giving high compatibility and versatility. A BI tool connects to JDBC and submits SQL through interface operations, but the underlying principle is exactly the same as in the jsp.
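A minimal sketch of what jdbc.jsp does is shown below, written as a plain Java class. The driver class name and connection URL follow the usual esProc convention but are assumptions here; check the aggregator tutorial for the exact values of your version.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JdbcDemo {
    public static void main(String[] args) throws Exception {
        // Load the aggregator JDBC driver (assumed class name)
        Class.forName("com.esproc.jdbc.InternalDriver");
        // Local connection; raqsoftConfig.xml on the classpath supplies the gateway setting
        try (Connection con = DriverManager.getConnection("jdbc:esproc:local://");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                 "select top 10 ORDERID,CUSTOMERID,EMPLOYEEID,ORDERDATE,AMOUNT"
                 + " from ORDERS where ORDERDATE=date('2015-07-18') and AMOUNT>100")) {
            while (rs.next()) {
                // AMOUNT arrives as double because the cache was exported with @d
                System.out.println(rs.getInt("ORDERID") + "\t" + rs.getDouble("AMOUNT"));
            }
        }
    }
}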
3. Open gateway1.dfx in the dfx directory and study the SPL code.
First, note that gateway1.dfx takes two input parameters, sql and args, for example the SQL:
select top 10 ORDERID,CUSTOMERID,EMPLOYEEID,ORDERDATE,AMOUNT from ORDERS where ORDERDATE=date('2015-07-18') and AMOUNT>100
Next, you can see the SPL script as follows:
A1	=filename="C:/tomcat6/webapps/gateway/WEB-INF/data/orders.btx"
A2	=sql.sqlparse@w().split(",")
B2	=A2.select@1(like(~,"ORDERDATE=date('*"))
A3	=mid(right(B2,14),3,10)
B3	=year(now())-year(date(A3))
A2 splits the where clause of the SQL into individual conditions, and B2 picks out the ORDERDATE condition; A3 extracts the date string from it, and B3 computes how many years back the query date lies. Row 4 branches on B3: within three years, the ORDERS table name in the SQL is replaced by the file name and the query runs against the local orders.btx; otherwise the SQL is translated into ORACLE syntax (top 10, for instance, becomes a SELECT ... FROM (...) t WHERE ROWNUM<=10 wrapper), submitted to the database, and the connection is closed afterwards.
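For readers more at home in Java than SPL, here is a rough, hypothetical rendering of the same routing rule (the real logic lives in the SPL above; the class, method, and regex below are illustrative only):

import java.time.LocalDate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RoutingRule {
    // Matches the ORDERDATE condition produced by the simple-query SQL
    static final Pattern ORDER_DATE =
        Pattern.compile("ORDERDATE=date\\('(\\d{4}-\\d{2}-\\d{2})'\\)");

    // True when the query should be served from the local orders.btx cache
    static boolean useLocalCache(String sql) {
        Matcher m = ORDER_DATE.matcher(sql);
        if (!m.find()) return false;  // no order-date condition: send to the database
        int queryYear = LocalDate.parse(m.group(1)).getYear();
        return LocalDate.now().getYear() - queryYear < 3;  // hot data: last three years
    }

    public static void main(String[] args) {
        System.out.println(useLocalCache(
            "select top 10 ORDERID from ORDERS where ORDERDATE=date('2015-07-18') and AMOUNT>100"));
    }
}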
Unlike set files, group tables store data in columns by default and support parallel computing over arbitrarily chosen segments, which effectively improves query speed. When generating a group table, attention should be paid to pre-sorting the data and defining the dimension fields reasonably. In this example the dimension fields are determined to be CUSTOMERID, EMPLOYEEID, ORDERDATE and ORDERID, according to the fields most often used for filtering and grouping.
When A3 fetches the data, it sorts by the dimension fields. CUSTOMERID, EMPLOYEEID and ORDERDATE contain more duplicate values, so they are sorted first; ORDERID contains fewer duplicates, so it is sorted last.
When defining the group table in A4, # marks a dimension field.
Note that the group table also supports the parallel query option /*+ parallel(n) */.
Summary of multitasking performance tuning skills
The characteristics of BI applications are:
1. Response time requirements are strict, usually no more than 5-10 seconds.
2. The amount of data involved in a query ranges from several hundred megabytes to several gigabytes, with dozens or even hundreds of fields.
3. Concurrency is high, from tens to hundreds of concurrent queries.
The performance optimization methods are:
1. Adopt the group table to improve the response speed of single-task queries.
◇ Define the dimension fields reasonably according to requirements.
When the group table is defined, the dimension fields should be determined by business needs: choose fields that are often used as filter criteria or for grouping, and mark them with #.
◇ Pre-sort by dimension fields.
Sort the data by the dimension fields, with heavily duplicated fields first: for example, order by province, city, county, not the other way around.
◇ Choose whether to use parallel queries based on concurrency.
When concurrency is high, the parallel query option /*+ parallel(n) */ is not recommended for a single SQL query: parallel queries consume more threads and hurt performance under heavy concurrency.
2. Configure the parameters of the node server reasonably to give full play to the performance of each node.
Each server (physical machine or virtual machine) needs to run a node server. The configuration interface for starting each node server is as follows:
◇ Configure the number of processes based on hardware resources.
It is recommended that the number of processes in the process list (that is, the number of appropriate jobs) not exceed two-thirds of the total number of CPU cores. For example, if the server has 8 CPUs with two cores each, the total number of cores is 8*2=16, so the number of processes should not exceed 16*2/3≈10. The recommended maximum number of jobs is the number of appropriate jobs*2, that is, 10*2=20.
◇ Allocate as much memory as possible, but avoid overload.
The maximum memory of each node server process should be as large as possible, but the total must stay below the actual physical memory, so that the operating system does not fall back on the hard disk to make up for a memory shortfall. For example, with 32 gigabytes of memory and 8 processes, the maximum memory per process should not exceed 4 gigabytes. The maximum and minimum process memory is configured in C:\Program Files\raqsoft\esProc\bin\config.txt, for example:
jvm_args=-Xms128m -Xmx4845m sets the minimum memory to 128m and the maximum to roughly 4G.
3. Scale the node servers horizontally; use multiple machines to handle heavy concurrent access.
◇ Scale out to cope with high concurrency.
As concurrency grows and performance no longer meets requirements, increase the number of node servers through horizontal expansion to meet demand.
◇ Add a server-list configuration item.
At this point, the server sequence parameter of the callx function in gateway3.dfx needs to be modified. The server sequence can be written into a configuration file so that the dfx file does not have to be modified every time.
4. Use local hard disk data for calculation to avoid cross-network access.
The IO speed of a local hard disk is relatively dependable. When node servers access data on other servers over the network, or data on shared storage, the network often becomes a bottleneck and query response slows down. Therefore, as far as possible, each node server should compute only over data on its own machine and avoid cross-network access.
Summary of aggregator advantages
Programmable data routing
Programmable data routing is an important application scenario of data computing middleware (DCM).
In the example above, the data routing strategy is: data from the last three years is hot data, routed to the aggregator; all other data is cold data, routed to the database.
Similar routing rules include treating data from the last three days, plus the last day of each of the past twelve months, as hot data routed to the aggregator for calculation, while other data is routed to the database for aggregate calculation.
As for routing rules for hot and cold data computation, this article covers only the case where a query involves either hot or cold data alone. The case where a single query may involve both will be introduced in subsequent articles.
In practical applications, data routing rules may be very complex and changeable; implementing them through configuration would be very difficult, and programming is the best way. SPL, the aggregator's programming language, makes implementing complex data routing rules simple and efficient. The aggregator supports hybrid computing over diverse heterogeneous data sources and can be programmed to implement complex routing rules involving all of them.
SQL Analysis and Translation
When used as a background for multidimensional analysis, data computing middleware (DCM) should provide the necessary SQL parsing and translation functions.
The realization of data routing is inseparable from the aggregator's parsing and translation of SQL statements. First, the aggregator's SQL parsing capability is used to find the date field in the where condition, and the routing rules then decide whether to route to the file or to the database. If the query is routed to the database, the aggregator's SQL translation capability converts the aggregator's standard SQL into the database's SQL.
The aggregator's SQL parsing is realized by the sqlparse() function, and SQL translation by the sqltranslate() function.
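As a toy illustration of the kind of rewrite sqltranslate() performs (not the aggregator's actual implementation; the class and regex below are made up), a simple-SQL top N can be turned into an ORACLE ROWNUM wrapper:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TopTranslate {
    // Rewrites "select top N <rest>" into ORACLE's ROWNUM form; anything else passes through
    static String toOracle(String sql) {
        Matcher m = Pattern.compile("(?i)^select\\s+top\\s+(\\d+)\\s+(.*)$")
                           .matcher(sql.trim());
        if (!m.matches()) return sql;
        return "SELECT * FROM (SELECT " + m.group(2) + ") t WHERE ROWNUM<=" + m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(toOracle(
            "select top 10 ORDERID,AMOUNT from ORDERS where AMOUNT>100"));
    }
}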
SQL performance optimization
SQL performance optimization is also an indispensable capability of data computing middleware (DCM).
BI applications let users generate SQL by drag-and-drop, which produces a lot of low-performance SQL. For example, to count the total number of results, a count layer is wrapped directly around the detail query: select count(1) from (select f1,f2,f3,…,f30 from table1 where f1=1 and 1=1). If all of f1 through f30 are actually retrieved in the subquery, query performance degrades badly. Filter conditions such as 1=1 also cause meaningless time consumption.
The aggregator's simple SQL engine performs automatic query optimization: it removes useless conditions such as 1=1, and it does not retrieve all the fields just to complete the count. This SQL parsing and optimization effectively improves query performance.
Similarly, for select top 10 F1 from table1 order by F1, the aggregator keeps a small result set for comparison as it goes. No big sort is needed; a single pass over the data yields the desired result, effectively improving query speed.
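A small sketch of that idea, assuming the goal is the 10 largest values: scan once while keeping a bounded min-heap, instead of sorting everything (illustrative only, not the aggregator's code):

import java.util.PriorityQueue;
import java.util.Random;

public class TopN {
    public static void main(String[] args) {
        final int n = 10;
        PriorityQueue<Double> heap = new PriorityQueue<>(n);  // min-heap holding the current top n
        Random rnd = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {                 // a single pass over the data
            double v = rnd.nextDouble();
            if (heap.size() < n) {
                heap.offer(v);
            } else if (v > heap.peek()) {                     // larger than the smallest kept value
                heap.poll();
                heap.offer(v);
            }
        }
        System.out.println(heap);                             // the 10 largest values, no global sort
    }
}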
Group table column storage / ordered compressed storage
Advanced data storage mode is an important guarantee for the successful implementation of data computing middleware (DCM).
The aggregator's group table stores data in columns, and the performance gain is especially obvious for wide-table queries with many fields. The column storage mechanism of the group table differs from conventional ones: conventional column storage (such as the parquet format) can only divide data into blocks and store columns within each block, which limits parallel computing. The group table's compressed column storage adopts a doubling segmentation technique that allows parallel computing over arbitrarily chosen segments, using the computing power of multiple CPU cores to drive hard disk IO to its maximum.
Dimension fields should be specified when the group table is generated, and the data itself is stored ordered by those fields. Common conditional filtering calculations then achieve high performance without relying on an index. The files are stored compressed, which reduces the space occupied on the hard disk and speeds up reading; because the compression ratio is kept moderate, the CPU time spent on decompression is negligible.
The group table can also store data in row form and entirely in memory, supporting in-memory database operation.
Cluster function
Agile clustering capabilities can ensure high performance and high availability of data computing middleware (DCM).
The aggregator node server is an independent process that accepts computing requests from the aggregator gateway program and returns results. Under concurrent access, requests can be sent to multiple servers to compute simultaneously, raising concurrent capacity. A single large computing task can also be split into many small tasks sent to multiple servers at once, providing parallel computing over big data.
The aggregator's cluster computing scheme has agile scale-out capability: when concurrency or data volume grows, the problem is solved by quickly adding nodes. The cluster is also fault tolerant: if individual nodes fail, the cluster as a whole keeps working and computing tasks continue to execute, which amounts to multi-machine hot backup and guarantees high availability.
Application and promotion
As data computing middleware (DCM), the data computing gateway and routing implemented by the aggregator apply not only when the front end is a BI system: they address any situation where the data warehouse cannot meet performance requirements and hot and cold data must be kept separate yet computed together. Examples include large-screen displays, management cockpits, real-time reports, detail reports over big data, batch report subscriptions, and so on.