This article gives a detailed, example-driven analysis of operations on cross-database data tables. It is quite practical, so it is shared here for reference; I hope you get something out of it after reading.
A so-called cross-database data table is a logically single data table that is stored across different databases. The reason may be that the data volume is too large for one database to handle comfortably, or that the production database must be kept separate from the history database. The databases involved may simply be instances of the same kind of database deployed on different machines, or they may be database systems of entirely different types.

When facing cross-database tables, especially tables spread over different database types, the tools that come with the databases are usually powerless, and a third-party tool with good support for multiple data source types is generally needed. The aggregator can be said to be the best of these. Next, let's discuss several common cross-database mixed operations in detail.

1. Simple merge (FROM)

For cross-database operation, the simple and direct idea is to merge the logically identical data tables scattered across the libraries into a single table, and then operate on that table.
For example, two databases, HSQL and MYSQL, each hold part of a student score table with the same structure (fields such as class, student ID, subject, and score).
Using the aggregator, we can easily merge these two tables with the same structure into one table. The SPL script of the aggregator is as follows:
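A minimal sketch of what such a script could look like (the data source names hsql and mysql and the table name students are illustrative):

A1  =connect("hsql")  /connect to the HSQL database
B1  =connect("mysql")  /connect to the MYSQL database
A2  =A1.query("SELECT * FROM students")  /read this library's part of the score table
B2  =B1.query("SELECT * FROM students")
A3  =A2|B2  /splice the two table sequences into one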
A1 and B1 connect to the two libraries, A2 and B2 read the student scores from each, and A3 merges the two tables in a simple and intuitive way.
This method actually reads both tables into memory, generating a table sequence object of the aggregator for each, and then uses the table sequence's "|" operator to complete the merge. Some readers may ask: what if my data is too large to fit in memory? No matter; the aggregator, designed specifically for big data, is not stumped by such a small problem. Using cursors, we can achieve fast splicing of the tables just as easily:
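A sketch of the cursor version, under the same illustrative names:

A1  =connect("hsql")
B1  =connect("mysql")
A2  =A1.cursor("SELECT * FROM students")  /open a cursor; nothing is loaded into memory yet
B2  =B1.cursor("SELECT * FROM students")
A3  =[A2,B2].conjx()  /concatenate the two cursors into one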
A2 and B2 use cursors to open the student score tables in the two libraries, while A3 uses the conjx() function to merge the two cursors into a new cursor that can access both tables.
In SQL terms, this simple merge only accomplishes the FROM part of the work: the identically structured cross-database tables are "vertically" spliced into one accessible table sequence or cursor. Real-world operations also involve filtering (WHERE/HAVING), grouping and aggregation (GROUP BY + sum/count/avg/max/min), joins (JOIN + ON), deduplication (DISTINCT), sorting (ORDER BY), and taking partial data (LIMIT + OFFSET). We will now discuss these operations one by one.
Of course, when implementing these operations we cannot settle for mere functional correctness; we also need to consider efficiency and performance. In principle, we try to make full use of the databases' computing power, with the aggregator mainly responsible for the mixed, cross-database part of the computation. Sometimes, however, the aggregator takes over almost all of the computation and the databases are responsible only for storing the data.
2. WHERE
The essence of WHERE filtering is a comparison: records for which the comparison evaluates to false are removed. WHERE therefore acts on one record at a time, involves no operations between records, and does not need to care which database the data resides in. For example, continuing the earlier scenario, suppose we want to list the "math" scores of all the students in "Class one". The single-database SQL is:
SELECT student ID, score FROM student score table WHERE subject = 'math' AND class = 'Class one'
When there are multiple databases, you only need to write the where clause directly in SQL and let each database process the filtering in parallel:
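A sketch, with the filter pushed into each database's SQL (names are illustrative as before):

A2  =A1.query("SELECT studentID, score FROM students WHERE subject = 'math' AND class = 'Class one'")
B2  =B1.query("SELECT studentID, score FROM students WHERE subject = 'math' AND class = 'Class one'")
A3  =A2|B2  /each database has already filtered; the aggregator only merges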
We can also make the aggregator responsible for all the filtering, with the databases only storing data. In that case, use the aggregator's select function (not to be confused with SQL's SELECT keyword):
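A sketch, with A1 and B1 connecting as before and the databases returning raw data:

A2  =A1.query("SELECT * FROM students")
B2  =B1.query("SELECT * FROM students")
A3  =(A2|B2).select(subject=="math" && class=="Class one")  /filter on the merged table sequence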
When the data volume is large, replace the table sequences with cursors and connect them with the conjx function:
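A sketch of the cursor version:

A2  =A1.cursor("SELECT * FROM students")
B2  =B1.cursor("SELECT * FROM students")
A3  =[A2,B2].conjx().select(subject=="math" && class=="Class one")  /filter while traversing the cursor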
3. ORDER BY and LIMIT OFFSET
ORDER BY is processed after the result set is generated. In the example above, to sort by math score in a single database, we only need to add an ORDER BY clause:
SELECT class, student ID, score FROM student score table WHERE subject = 'math' AND class = 'Class one' ORDER BY score
For multiple databases, you can have the databases sort separately and then let the aggregator merge the ordered data. This gives full play to the performance of the databases and of parallel processing.
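A sketch (connections as before): each SQL carries the ORDER BY, and the aggregator's merge does an ordered merge on the sort field:

A2  =A1.query("SELECT class, studentID, score FROM students WHERE subject = 'math' AND class = 'Class one' ORDER BY score")
B2  =B1.query("SELECT class, studentID, score FROM students WHERE subject = 'math' AND class = 'Class one' ORDER BY score")
A3  =[A2,B2].merge(score)  /merge two already-sorted table sequences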
You can also sort in reverse order, by writing ORDER BY ... DESC in the SQL and adding "-" before the sort field when merging (the merge may happen to work without the "-", but adding it follows the standard usage):
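The descending variant might look like this:

A2  =A1.query("SELECT class, studentID, score FROM students WHERE subject = 'math' AND class = 'Class one' ORDER BY score DESC")
B2  =B1.query("SELECT class, studentID, score FROM students WHERE subject = 'math' AND class = 'Class one' ORDER BY score DESC")
A3  =[A2,B2].merge(-score)  /"-" marks the descending merge field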
Of course, it can be sorted entirely by the aggregator:
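Continuing with A2 and B2 reading the unsorted data as before, a one-line sketch:

A3  =(A2|B2).sort(score)  /databases only supply data; the sort happens in the aggregator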
Reverse sorting is implemented by the aggregator:
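A one-line sketch of the reverse sort:

A3  =(A2|B2).sort(-score)  /"-" before the field sorts descending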
For large amounts of data, you need to use cursors and mergex to complete the ordered merge:
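A sketch (each SQL sorted as before, but opened with cursors):

A2  =A1.cursor("SELECT class, studentID, score FROM students WHERE subject = 'math' AND class = 'Class one' ORDER BY score")
B2  =B1.cursor("SELECT class, studentID, score FROM students WHERE subject = 'math' AND class = 'Class one' ORDER BY score")
A3  =[A2,B2].mergex(score)  /ordered merge of cursors, without loading everything into memory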
The implementation of LIMIT and OFFSET comes after ORDER BY. In the example, suppose we want the 10 math scores right after the first place (that is, ranks 2 through 11). For a single database, the SQL is:
SELECT class, student ID, score FROM student score table WHERE subject = 'math' AND class = 'Class one' ORDER BY score DESC LIMIT 10 OFFSET 1
When there are multiple databases, the aggregator's to function realizes LIMIT/OFFSET: to(n+1, n+m) is equivalent to LIMIT m OFFSET n.
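A sketch for LIMIT 10 OFFSET 1 (n = 1, m = 10), with A2 and B2 reading the descending-sorted data as in the earlier sketch:

A3  =[A2,B2].merge(-score)  /descending ordered merge
A4  =A3.to(2,11)  /to(n+1, n+m): skip 1 record, take the next 10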
When cursors are used for large data volumes, OFFSET can be implemented with the aggregator's skip function, and LIMIT with the fetch function.
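A sketch of the cursor version (A2 and B2 opened as cursors over the descending-sorted data):

A3  =[A2,B2].mergex(-score)
A4  =A3.skip(1)  /advance the cursor past the first record (OFFSET 1)
A5  =A3.fetch(10)  /fetch the next 10 records (LIMIT 10)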
4. Aggregate operation
Let's discuss five common aggregation operations: sum/count/avg/max/min.
The basis of sum is addition. By the associative law of addition, summing the data inside each database first and then adding up the partial sums gives the same result as splicing everything into one table and summing it all at once.
The essence of count is to record 1 for each non-null item and 0 for each null item, and then accumulate. So it is still addition at heart and, like sum, obeys the associative law; the only difference is that what is accumulated is not the original value but 1 (non-null) or 0 (null).
The essence of avg is: avg = sum/count when count > 0, and avg = null when count = 0. Obviously avg cannot be computed piecewise the way sum and count can; but by definition we can compute sum and count first and then derive avg from them.
Both max and min are based on comparison operations, and because comparison operations are transitive, the maximum values of all databases can be obtained by comparing the maximum values of each database.
Continuing the example, suppose this time we want the total math score, the number of examinees, the average score, and the highest and lowest scores over both classes. For a single data source:
SELECT sum(score) AS total score, count(score) AS number of exams, avg(score) AS average score, max(score) AS highest score, min(score) AS lowest score FROM student score table WHERE subject = 'math'
The result set of an aggregation is tiny, just one row, so cursors can be used regardless of the source data volume, as follows:
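A sketch, gathering the raw scores with conjx and aggregating in one pass (field and alias names are illustrative):

A2  =A1.cursor("SELECT score FROM students WHERE subject = 'math'")
B2  =B1.cursor("SELECT score FROM students WHERE subject = 'math'")
A3  =[A2,B2].conjx()
A4  =A3.groups(; sum(score):total, count(score):cnt, avg(score):avgscore, max(score):highest, min(score):lowest)  /aggregate the whole cursor; the result is a single row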
In fact, the ORDER BY + LIMIT OFFSET discussed earlier can essentially also be viewed as an aggregation operation: top. Optimizing from this angle yields much better efficiency: when the data volume is large, a full sort is very expensive, while taking the top N records does not require one. The same method also applies when the data volume is small.
Specifically, for ORDER BY F LIMIT m OFFSET n, you only need to compute top(n+m, F, ~) first, and then use to(n+1,) to take from the (n+1)-th record to the end.
Let's take the previous SQL statement with order by+limit offset as an example:
SELECT class, student ID, score FROM student score table WHERE subject = 'math' AND class = 'Class one' ORDER BY score DESC LIMIT 10 OFFSET 1
For multiple databases, the script is as follows; here the reverse sort again only needs a "-" before the sort field.
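A sketch built on that idea, again with n = 1 and m = 10 (the three-argument top follows the form given above, with ~ standing for the whole record; names are illustrative):

A2  =A1.cursor("SELECT class, studentID, score FROM students WHERE subject = 'math' AND class = 'Class one'")
B2  =B1.cursor("SELECT class, studentID, score FROM students WHERE subject = 'math' AND class = 'Class one'")
A3  =[A2,B2].conjx()
A4  =A3.groups(; top(11, -score, ~):t)  /keep the 11 highest-scoring records without a full sort; "-" means descending
A5  =A4(1).t.to(2,11)  /discard the first place, keep ranks 2 through 11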
5. GROUP BY, DISTINCT and HAVING
A. Grouping and aggregation operations
For GROUP BY, the final result is independent of the order in which individual records arrive; as long as the overall set of records is unchanged, the result is unchanged. In other words, the grouping can be done in advance: each database can group its own data first, and the partial results can be aggregated again after all the data are gathered.
Suppose we want the total math score, the number of examinees, the average score, and the highest and lowest scores of Class 1 and Class 2 separately. For a single database:
SELECT class, sum(score) AS total score, count(score) AS number of exams, avg(score) AS average score, max(score) AS highest score, min(score) AS lowest score FROM student score table WHERE subject = 'math' GROUP BY class
Let's discuss it in three situations:
First, for small data volumes, the result set of an aggregation can only be smaller, so query + groups is recommended:
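A sketch (small data, all in memory; names are illustrative):

A2  =A1.query("SELECT class, score FROM students WHERE subject = 'math'")
B2  =B1.query("SELECT class, score FROM students WHERE subject = 'math'")
A3  =(A2|B2).groups(class; sum(score):total, count(score):cnt, avg(score):avgscore, max(score):highest, min(score):lowest)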
Second, for large data volumes, cursor + groupx should be used if the result set is also large:
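A sketch (large source data, large result set):

A2  =A1.cursor("SELECT class, score FROM students WHERE subject = 'math'")
B2  =B1.cursor("SELECT class, score FROM students WHERE subject = 'math'")
A3  =[A2,B2].conjx()
A4  =A3.groupx(class; sum(score):total, count(score):cnt, avg(score):avgscore, max(score):highest, min(score):lowest)  /groupx returns a cursor over the grouped result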
In addition, grouping a large result set is slow because intermediate data must be cached to external storage. If the databases sort the data first, this caching can be avoided (the computing pressure then shifts to the databases, so weigh this against the actual situation; generally, the database servers have more computing power).
The way to do this is to sort each SQL result set with ORDER BY, merge the ordered streams with the aggregator's mergex function, and then group with groupx's @o option:
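A sketch with the sorting pushed into SQL:

A2  =A1.cursor("SELECT class, score FROM students WHERE subject = 'math' ORDER BY class")
B2  =B1.cursor("SELECT class, score FROM students WHERE subject = 'math' ORDER BY class")
A3  =[A2,B2].mergex(class)  /ordered merge on the grouping field
A4  =A3.groupx@o(class; sum(score):total, count(score):cnt, avg(score):avgscore, max(score):highest, min(score):lowest)  /@o: input already ordered by class, no external caching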
Of course, if you do not want to increase the databases' burden, you can have them group without sorting. In that case the aggregator uses groupx directly, and note that the @o option must not be added; also, when gathering the data, conjx replaces mergex:
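A sketch where each database pre-groups its own data and the aggregator re-aggregates the partial results (per the associativity reasoning in section 4, avg is derived from sum and count at the end):

A2  =A1.cursor("SELECT class, sum(score) total, count(score) cnt, max(score) mx, min(score) mn FROM students WHERE subject = 'math' GROUP BY class")
B2  =B1.cursor("SELECT class, sum(score) total, count(score) cnt, max(score) mx, min(score) mn FROM students WHERE subject = 'math' GROUP BY class")
A3  =[A2,B2].conjx()
A4  =A3.groupx(class; sum(total):total, sum(cnt):cnt, max(mx):highest, min(mn):lowest)  /no @o: the partial results are not ordered
A5  =A4.fetch().derive(total/cnt:avgscore)  /avg = sum/count, computed last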
Third, if the result set is known in advance to be small, cursor + groups is recommended.
Here groups performs better than groupx, because groups keeps all working data in memory and thus saves the time groupx spends writing to external storage.
Moreover, using groups requires no pre-sorting in the database: the result set of a database GROUP BY is not necessarily ordered anyway, and adding ORDER BY has its own cost. For small result sets, the aggregator's groups@o is not necessarily more efficient than plain groups.
As usual, conjx is used to gather the data, as in the following sketch.
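A sketch (large source data, small result set):

A2  =A1.cursor("SELECT class, score FROM students WHERE subject = 'math'")
B2  =B1.cursor("SELECT class, score FROM students WHERE subject = 'math'")
A3  =[A2,B2].conjx()
A4  =A3.groups(class; sum(score):total, count(score):cnt, avg(score):avgscore, max(score):highest, min(score):lowest)  /groups keeps its working data in memory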
B. Count after de-duplication (count distinct)
To remove duplicates within a single database, the DISTINCT keyword suffices; to deduplicate across databases, use the aggregator's merge@u function. Note that before using it, you must make sure the data is ordered by the primary key fields (or by whatever field or combination of fields is unique).
Among the distinct aggregates, sum(distinct) and avg(distinct) are computed in much the same way as count(distinct) and are rarely needed in business, while max(distinct) and min(distinct) are no different from plain max and min. We therefore take only count(distinct) as an example.
For example, suppose we want to count the students (assuming only Class 1 and Class 2) who failed at least one of the three subjects other than language and thus need a make-up exam. The single-database SQL looks like this:
SELECT count(distinct student ID) AS number FROM student score table WHERE score < 60 AND subject <> 'language'
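A sketch: each database deduplicates and orders its own student IDs, and merge@u unions the two ordered sets (the passing mark of 60 and the field names are assumptions):

A2  =A1.query("SELECT DISTINCT studentID FROM students WHERE score < 60 AND subject <> 'language' ORDER BY studentID")
B2  =B1.query("SELECT DISTINCT studentID FROM students WHERE score < 60 AND subject <> 'language' ORDER BY studentID")
A3  =[A2,B2].merge@u(studentID)  /union with duplicates removed; inputs must be ordered by the key
A4  =A3.len()  /the de-duplicated count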