Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

Operation of cross-database data tables

2025-01-27 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)06/03 Report--

1. Simple merge (FROM)

The so-called cross-database data table means that logically the same data table is stored in different databases. The reason may be that the amount of data is so large that it is difficult to deal with in a database, or it may be necessary to separate the production database from the history database. Different databases may only be the same kind of databases deployed on different machines, or they may be database systems of different types.

In the face of cross-database data tables, especially in the case of different database types, the tools that come with the database are often powerless, and it is generally necessary to look for third-party tools that can well support multiple data source types, and the aggregator, it can be said to be the best of them. Next, let's discuss in detail several common cross-library hybrid operations:

Cross-library operation, the simple and rough idea is to merge logically the same data tables scattered in each library into a single table, and then operate on this table.

For example, in two databases, HSQL and MYSQL, there is a student score table, each of which stores a portion of the student information, as shown in the following figure:

Using the aggregator, we can easily merge these two tables with the same structure into one table. The SPL script of the aggregator is as follows:

A

B

one

= connect ("org.hsqldb.jdbcDriver", "jdbc:hsqldb:hsql://127.0.0.1/demo?user=sa")

= connect ("com.mysql.jdbc.Driver", "jdbc:mysql://127.0.0.1:3306/demo?user=root&password=password")

two

= A1.query ("select * from Student score sheet")

= B1.query ("select * from Student score sheet")

three

= A2 | B2

A1, A2 and B1, B2 respectively read the student scores in the two libraries, while A3 merges the two tables in a simple and intuitive way.

This method actually reads both tables into memory, generates the order table objects of the aggregator respectively, and then uses the operation "|" of the order table to complete the merge. Some students may ask: what if I have a large amount of data and can't read it all into memory? It doesn't matter. The aggregator specially designed to deal with big data will never be stumped by such a simple small problem. We can use cursors, and we can also achieve fast splicing of tables:

A

B

two

= A1.cursor ("select * from Student score sheet")

= B1.cursor ("select * from Student score sheet")

three

= [A2, B2] .conjx ()

A2 and B2 respectively use cursors to open the student scores in the two libraries, while A3 uses the conjx () function to merge the two cursors to form a new cursor that can access both tables at the same time.

Corresponding to SQL, this simple merging is like just completing the from work, so that the data across the database tables with the same structure are "vertically" spliced into an accessible ordinal table or cursor. In the actual operation, it will also involve filtering (where/having), grouping aggregation (group+sum/count/avg/max/min), join (join+on), deduplication (distinct), sorting (order), fetching partial data (limit+offset), and so on. Now we will discuss these operations one by one.

Of course, when we deal with the requirements of these operations, we can not simply implement the function, we also need to consider the efficiency and performance of the implementation, so in principle, we will try our best to make use of the computing power of the database. The aggregator is mainly responsible for mixed operations. However, sometimes the aggregator is responsible for almost all the operations, and the database is only responsible for storing data.

2. WHERE

The essence of where filtering is through comparison calculation, the result of removing the comparison is the record of false, so where only acts on one record, does not involve the operation between records, and does not need to consider which database the data is located in. For example, in the previous example, we want to count the "math" scores of all the students in "Class one". The SQL in the single database is as follows:

SELECT student ID, score FROM student transcript WHERE subject = 'math' AND class = 'Class one'

When there are multiple databases, you only need to write the where clause directly in SQL and let each database process the filtering in parallel:

A

B

two

= A1.query ("select student ID, score from student transcript where subject = 'math' and class = 'Class one'")

= B1.query ("select student ID, score from student transcript where subject = 'math' and class = 'Class one'")

three

= A2 | B2

We can also make the aggregator responsible for all filtering operations, and the database stores only data. At this point, you can use the select function of the aggregator (different from the select keyword of SQL)

A

B

two

= A1.query ("select student ID, grades, subjects, class from student transcript")

= B1.query ("select student ID, grades, subjects, class from student transcript")

three

= A2.select (subject = "Mathematics" & & Class = = "Class one"). New (student ID, grade)

= B2.select (subject = "Mathematics" & & Class = = "Class one"). New (student ID, grade)

four

= A3 | B3

When the amount of data is large, you can also replace the order table with cursors and use the conjx function to connect:

A

B

two

= A1.cursor ("select student ID, score from student transcript where subject = 'math' and class = 'Class one'")

= B1.cursor ("select student ID, score from student transcript where subject = 'math' and class = 'Class one'")

three

= [A2, B2] .conjx ()

3. ORDER BY and LIMIT OFFSET

Order by is processed after the result set is generated. In the above example, if we want to sort by math score, for a single database, we only need to add the order by clause:

SELECT class, student ID, score FROM student transcript WHERE subject = 'math' AND class = 'class one' ORDER BY score

For multiple databases, you can have the databases sort separately and then merge the ordered data by the aggregator. This can maximize the performance of the database and parallel server.

A

B

two

= A1.query ("select class, student ID, grade from student transcript where subject = 'math' and class = 'class one' order by score")

= B1.query ("select class, student ID, grade from student transcript where subject = 'math' and class = 'class one' order by score")

three

= [A2, B2] .merge (score)

You can also sort in reverse order, adding "-" before the sort field when merging (the merge function may not be added with "-", but it is added according to the standard)

A

B

two

= A1.query ("select class, student ID, grade from student report sheet where subject = 'math' AND class = 'class one' order by score desc")

= B1.query ("select class, student ID, grade from student report sheet where subject = 'math' AND class = 'class one' order by score desc")

three

= [A2, B2] .merge (- score)

Of course, it can be sorted entirely by the aggregator:

A

B

two

= A1.query ("select class, student ID, grade from student transcript where subject = 'math' AND class = 'class one'")

= B1.query ("select class, student ID, grade from student transcript where subject = 'math' AND class = 'class one'")

three

= [A2.sort (grade), B2.sort (grade)] .merge (grade)

Reverse sorting is implemented by the aggregator:

A

B

two

= A1.query ("select class, student ID, grade from student transcript where subject = 'math' AND class = 'class one'")

= B1.query ("select class, student ID, grade from student transcript where subject = 'math' AND class = 'class one'")

three

= [A2.sort (- score), B2.sort (- score)] .merge (- score)

For large amounts of data, you need to use cursors and mergex to complete orderly merging:

A

B

two

= A1.cursor ("select class, student ID, grade from student transcript where subject = 'math' AND class = 'class one' order by score")

= B1.cursor ("select class, student ID, grade from student transcript where subject = 'math' AND class = 'class one' order by score")

three

= [A2, B2] .mergex (grade)

The implementation of limit and offset comes after order. In the example, if you want to get a math score in addition to the top 10 after the first place (less than but not more than), in the case of a single database, SQL is like this:

SELECT class, student ID, score FROM student transcript WHERE subject = 'math' AND class = 'class one' ORDER BY score DESC LIMIT 10 OFFSET 1

When there are multiple databases, the to function of the aggregator can be used to realize the function of limit offset, and to is equivalent to limit m offset n.

A

B

two

= A1.query ("select class, student ID, grade from student report sheet where subject = 'math' AND class = 'class one' order by score desc")

= B1.query ("select class, student ID, grade from student report sheet where subject = 'math' AND class = 'class one' order by score desc")

three

= [A2, B2] .merge (- score) .to (2,11)

For cases where cursors are used in large amounts of data, the offset function can be implemented using the aggregator function skip, while the function of limit can be implemented using the function fetch.

A

B

two

= A1.cursor ("select class, student ID, grade from student report sheet where subject = 'math' AND class = 'class one' order by score desc")

= B1.cursor ("select class, student ID, grade from student report sheet where subject = 'math' AND class = 'class one' order by score desc")

three

= [A2, B2] .mergex (- score).

= A3.skip (1)

four

= A3.fetch (10)

4. Aggregate operation

Let's discuss five common aggregation operations: sum/count/avg/max/min.

The basis of sum is addition. According to the law of addition, the internal data in each database are summed first, then spliced into a table, and then summed up, which is the same as the result of splicing into a table and then summing together.

The essence of count is to calculate null data for each item of non-null data, and then accumulate them. Therefore, its essence is still addition operation, which conforms to the law of addition combination as well as sum. The only difference is that instead of accumulating its own values for the original data, the count is 1 (non-null) or 0 (null).

The essence of avg is avg = sum/count when count > 0 and avg = null when count = 0. Obviously, avg cannot be calculated separately like sum or count. However, by definition, we can first calculate sum and count, and then calculate avg through sum and count.

Both max and min are based on comparison operations, and because comparison operations are transitive, the maximum values of all databases can be obtained by comparing the maximum values of each database.

Still the above example, this time we ask all the students in the two classes for the total math score, the number of students, the average score, the highest and lowest scores, for single-source data:

Total score of SELECT sum, number of count exams, average score of avg, highest score of max, lowest score of min. FROM student grade table WHERE subject = 'math'

The result set of the aggregation operation is very small, with only one row, so cursors can be used regardless of the amount of source data, as follows:

A

B

two

= A1.cursor ("select sum total score, number of count exams, highest max score, lowest min score from student transcript where subject = 'math'")

= B1.cursor ("select sum total score, number of count exams, highest max score, lowest min score from student transcript where subject = 'math'")

three

= [A2, B2] .conjx () .total (sum (total score), sum (number of exams), max (highest score), min (lowest score)

four

= create (total score, number of examiners, average score, highest score, lowest score). Insert (0scoreA3 (1), A3 (2), if (A3 (2)! = 0scoreA3 (1) / A3 (2), null), A3 (3), A3 (4))

In fact, the order by + limit offset mentioned earlier can also be seen as an aggregation operation in essence: top. By optimizing from this point of view, higher computational efficiency can be obtained. After all, when the amount of data is large, the cost of full sorting is very high, and the operation of taking the first N data does not need full sorting. Of course, this method is also applicable to situations where the amount of data is small.

Specifically, in the case of order by F limit m offset n, you only need to use top (nimm, F, ~) first, and then use to (nym1,).

Let's take the previous SQL statement with order by+limit offset as an example:

SELECT class, student ID, score FROM student transcript WHERE subject = 'math' AND class = 'class one' ORDER BY score DESC LIMIT 10 OFFSET 1

For multiple databases, the script is as follows, where the reverse sort only needs to be preceded by "-" before the sort field:

A

B

two

= A1.cursor ("select class, student ID, grade from student transcript where subject = 'math' AND class = 'class one'")

= B1.cursor ("select class, student ID, grade from student transcript where subject = 'math' AND class = 'class one'")

three

= [A2, B2] .conjx (). Top (11,-grades, ~). To (2,)

5. GROUP BY, DISTINCT and HAVINGA, grouping aggregation operation

For group by, because the final result has nothing to do with the input order of the individual sample, as long as the population of the sample remains unchanged, the final result will not change. In other words, as long as the extraction of data from the sub-database and the final summary of all data, the classification operation can be carried out in advance.

Suppose we want to find the total math score, the number of students, the average score, the highest score and the lowest score of Class 1 and Class 2 respectively. The single database is as follows:

SELECT class, total score of sum (grade), number of count exams, average score of avg (grade), highest score of max (grade), lowest score of min (grade) FROM student grade table WHERE subject = 'math' GROUP BY class

Let's discuss it in three situations:

First, for small data, the result set of the aggregate operation will only be smaller, so query+groups is recommended:

A

B

two

= A1.query ("select class, total score of sum (grade), number of people taking the count test, highest score of max (grade), lowest score of min (grade) from student grade table where subject = 'math' group by class")

= B1.query ("select class, total score of sum (grade), number of people taking the count test, highest score of max (grade), lowest score of min (grade) from student grade table where subject = 'math' group by class")

three

= (A2 | B2) .groups (Class: class; sum (Total score): total score, sum (number of examiners): number of examiners, max (highest score): highest score, min (lowest score)

four

= A3.new (class, total score, number of examiners, if (number of examiners = 0, null, total score / number of examiners): average score, highest score, lowest score)

Second, for large amounts of data, cursor+groupx should be used if the result set is also large.

In addition, because the grouping of large result sets is slow, it is necessary to generate cached data in out-of-memory. If we sort the data in the database first, we can avoid this cache (at this time, the computing pressure will be transferred to the database, so it needs to be weighed according to the actual situation, and in general, the database server has more computing power).

The way to do this is to sort the result sets of SQL using order by, and then group them using the @ o option of groupx after merging using the mergex function in the aggregator:

A

B

two

= A1.cursor ("select class, total score of sum (grade), number of count exams, highest score of max (grade), lowest score of min (grade) from student grade table where subject = 'math' group by class order by class")

= B1.cursor ("select class, total score of sum (grade), number of count exams, highest score of max (grade), lowest score of min (grade) from student grade table where subject = 'math' group by class order by class")

three

= [A2, B2] .mergex (class) groupx@o (class: class; sum (total score): total score, sum (number of examiners): number of examiners, max (highest score): highest score, min (lowest score): lowest score)

four

= A3.new (class, total score, number of examiners, if (number of examiners = 0, null, total score / number of examiners): average score, highest score, lowest score)

Of course, if you do not want to increase the burden on the database, you can also let the database do only grouping but not sorting. At this time, the aggregator directly uses groupx, and note that the @ o option cannot be added. In addition, when xxx total data, replace mergex with conjx:

A

B

two

= A1.cursor ("select class, total score of sum (grade), number of people taking the count test, highest score of max (grade), lowest score of min (grade) from student grade table where subject = 'math' group by class")

= B1.cursor ("select class, total score of sum (grade), number of people taking the count test, highest score of max (grade), lowest score of min (grade) from student grade table where subject = 'math' group by class")

three

= [A2, B2] .conjx () .groupx (Class: class; sum (Total score): total score, sum (number of examiners): number of examiners, max (highest score): highest score, min (lowest score)

four

= A3.new (class, total score, number of examiners, if (number of examiners = 0, null, total score / number of examiners): average score, highest score, lowest score)

Third, cursor+groups is recommended if it is clearly known that the result set is small

At this time, groups has better performance than groupx, because groups saves all the operational data in memory, which saves the time of writing to out-of-memory files than groupx.

In addition, using groups does not require pre-sorting in the database, because the result set of the database group by itself is not necessarily orderly, and using orde by sorting will also increase the cost. For small result sets, using groups@o for aggregators is not necessarily more efficient than using groups directly.

Usually, conjx is used to summarize data.

A

B

two

= A1.cursor ("select class, total score of sum (grade), number of people taking the count test, highest score of max (grade), lowest score of min (grade) from student grade table where subject = 'math' group by class")

= B1.cursor ("select class, total score of sum (grade), number of people taking the count test, highest score of max (grade), lowest score of min (grade) from student grade table where subject = 'math' group by class")

three

= [A2, B2] .conjx () .groups (Class: class; sum (Total score): total score, sum (number of examiners): number of examiners, max (highest score): highest score, min (lowest score)

four

= A3.new (class, total score, number of examiners, if (number of examiners = 0, null, total score / number of examiners): average score, highest score, lowest score)

B, count after removing weight (count distinct)

To remove duplicates in each database, you can use the distinct keyword. If the data between databases is deduplicated, you can use the merge@u function of the aggregator. It is important to note that before using it, you should make sure that the data in the table is ordered to the primary key fields (or one or more fields that are unique).

For distinct, the calculation method of sum (distinct) and avg (distinct) is more or less the same as that of count (distinct), and it is not often used in business, while max (distinct) and min (distinct) are no different from simply using max and min. Therefore, we only take count (distinct) as an example to illustrate.

For example, if you want to calculate the total number of people who fail at least one make-up exam in three subjects other than language (assuming only Class 1 and Class 2), the SQL of a single database looks like this:

SELECT count (distinct student ID) number FROM student transcript WHERE score

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report