In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-04-06 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >
Share
Shulou(Shulou.com)06/01 Report--
This article introduces the relevant knowledge of "what join is used in java multi-table query". In the operation of actual cases, many people will encounter such a dilemma, so let the editor lead you to learn how to deal with these situations. I hope you can read it carefully and be able to achieve something!
We analyze three common join types: Merge join,Hash join and NestedLoop Join. Before that, let's introduce some key words:
Inner ralation and outer relation.
A relation can be
A table.
An index
An intermediate result of a previous operation
When you Join two relation, the join algorithm has a difference between inner and outer relation. Outer relation is the left dataset and inner relation is the right dataset.
For example, A JOIN B, An is outer relation,B and inner relation. And generally A JOIN B and B JOIN A use time is not the same.
Later we assume that outer relation has N elements and inner relation has M elements. However, in the actual optimizer, you can get the exact value from the statistics.
Nested loop join
Nested associations are the easiest. The process is probably:
Traverse every line of outer relation
Then check to see if each line of inner relation matches.
Written as pseudo code is like this:
Nested_loop_join (array outer, array inner) for each row an in outer for each row b in inner if (match_join_condition (amemb)) write_result_in_output (arecom b) end if end for end for
Because of double traversal, the complexity is O (Null M). Corresponding to the outer relation O of disk, each row of N rows in the disk needs to be read through M rows from inner relation.
So this algorithm needs to read N + Numeric rows of data from disk. However, if the inner relation is small enough to put in memory, you only need to read M + N times. Although there is no change in time complexity, this approach is not bad on disk I inner relation O, so it can be replaced by an index, and disk I hand O is more advantageous.
Hash join
Hash joins are more complex, but in many cases they are also cheaper than cyclic nested joins
The principle of hash connection is:
Get all elements from inner relation
Save the hash table to disk
Create a hash table in memory
Read all the elements of outer relation one by one
Calculate the hash value of each element (using the hash function of the hash table) to find the hash bucket associated with inner relation
Check to see if the element of outer relation has a match in the hash bucket.
We need to make some assumptions in terms of time complexity to simplify the problem:
Inner relation is divided into X hash buckets
The hash function distributes the hash values of the data within each relation nearly evenly, which is equivalent to saying that the hash bucket size is uniform.
The element of outer relation matches all the elements in the hash bucket, and the cost is the number of elements in the hash bucket.
The time complexity is (M _ max X) * N + cost_to_create_hash_table (M) + cost_of_hash_function*N. If the hash function creates a sufficiently small hash bucket, then the complexity is O (multiple N).
There is also a version of the hash join, which is more memory-friendly, but not good enough for disk Imax O. The situation is as follows:
Calculate the hash table of both outer relation and inner relation
Save the hash table to disk
Then compare the two relation hash buckets one by one (one relationship is read into memory, the other is read line by line)
Merge join
A merge join is the only join algorithm that produces a sort.
Note: this simplified merge join does not distinguish between inner or outer tables; both tables play the same role. But the actual implementation is different, such as when dealing with duplicate values.
1. (optional) sort join operation: both input sources are sorted by join keywords.
two。 Merge join operation: the sorted input sources are merged together.
Sort
We have already talked about merge sorting, which is a good algorithm here.
Sometimes the dataset has been sorted, such as:
If the interior of the table is ordered, for example, there is an index in the join condition to organize the table
If relation is an index in a join condition
If the join is acting on the intermediate result of a sorted query
Merge join
This part is very similar to the merge operation in the merge sort that we talked about. The only difference is that we don't pick all the elements from the two relationships, only the same elements.
The general principle is as follows:
In two relation, compare the current element (the current equal sign appears for the first time)
When it is the same, put both elements in the result and compare the next element in the two relationships
If it is different, look for the next element in the relationship with the smallest element.
Repeat steps 1, 2, and 3 until the last element of one of the relationships.
Because both relationships are sorted, you no longer need to "look back", so this method is very effective.
This algorithm is a simplified version that does not deal with multiple occurrences of the same data in two sets of data.
Which connection algorithm is the best?
If there were the best, there would be no need to make so many types. Since there are many factors to consider, there will not be a simple answer, such as these:
Free memory size: if you don't have enough memory, join with a powerful hash, or at least a fully in-memory hash join, say bye bye.
The size of two datasets: if a large table joins a very small table, a nested loop join is faster than a hash join because the latter has the cost of creating a hash; if both tables are very large, the cost of a nested loop join CPU is high.
Whether or not there are indexes: merge joins seem to be a smarter choice if there are two B+ tree indexes.
Whether the result set needs to be sorted: even if you are using an unordered dataset, you may want to use a more expensive merge join (with sorting), because the final result is ordered and you can combine it with another result set through a merge join (it is also possible that operators such as ORDER BY/GROUP BY/DISTINCT used in the query implicitly or explicitly require a sort result).
Whether relationships have been sorted: merge joins are the best choice at this time.
Type of join: is it an equivalent join (such as tableA.col1 = tableB.col2) or an inner join? Outer connection? Cartesian product? Or self-connection? Some connections do not work under certain circumstances.
Distribution of data: if the data of the join condition is skewed (for example, if you join people according to their last name, there will be many people with the same surname), hashing will be a disaster, because the hash function will produce extremely uneven hash buckets.
If you want join operations to use multi-threads or multi-processes.
This is the end of the content of "what join is used in java multi-table query". Thank you for your reading. If you want to know more about the industry, you can follow the website, the editor will output more high-quality practical articles for you!
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.