Learn, Practice & Share: Advanced Joins

A. Hash Join:

The hash join has two inputs: the build input and the probe input. The query optimizer assigns these roles so that the smaller of the two inputs is the build input. A variant of the hash join can also do duplicate removal and grouping, such as SUM(salary) GROUP BY department.

For example: Table A (input 1) – 10 rows (build input), Table B (input 2) – 10 rows (probe input).

Different types of hash joins: in-memory hash join, grace hash join, and recursive hash join.

In-Memory Hash Join:

The hash join first scans or computes the entire build input and then builds a hash table in memory. Each row is inserted into a hash bucket depending on the hash value computed for the hash key. If the entire build input is smaller than the available memory, all rows can be inserted into the hash table. This build phase is followed by the probe phase. The entire probe input is scanned or computed one row at a time, and for each probe row, the hash key's value is computed, the corresponding hash bucket is scanned, and the matches are produced.
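The build and probe phases described above can be sketched in a few lines of Python. This is a minimal illustration of the general technique, not SQL Server's implementation; the function name, table contents, and column names are all made up for the example.

```python
from collections import defaultdict


def in_memory_hash_join(build_rows, probe_rows, build_key, probe_key):
    """Inner equi-join: build a hash table on one input, probe it with the other."""
    # Build phase: insert every build row into the bucket for its hash key.
    hash_table = defaultdict(list)
    for row in build_rows:
        hash_table[row[build_key]].append(row)

    # Probe phase: read the probe input one row at a time and emit matches.
    for probe_row in probe_rows:
        for build_row in hash_table.get(probe_row[probe_key], []):
            yield build_row, probe_row


# Hypothetical sample data: the smaller input plays the build role.
departments = [{"dept_id": 1, "name": "Sales"}, {"dept_id": 2, "name": "HR"}]
employees = [
    {"emp": "Ann", "dept_id": 1},
    {"emp": "Bob", "dept_id": 2},
    {"emp": "Cy", "dept_id": 1},
    {"emp": "Di", "dept_id": 3},  # no matching department, so no output row
]

matches = [
    (d["name"], e["emp"])
    for d, e in in_memory_hash_join(departments, employees, "dept_id", "dept_id")
]
print(matches)
```

Because the build input is read completely before the first probe row is processed, keeping the build side small keeps the hash table small.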

Grace Hash Join:

If the build input does not fit in memory, a hash join proceeds in several steps. This is known as a grace hash join. Each step has a build phase and a probe phase. Initially, the entire build and probe inputs are consumed and partitioned (using a hash function on the hash keys) into multiple files. Using the hash function on the hash keys guarantees that any two joining records must be in the same pair of files. Therefore, the task of joining two large inputs has been reduced to multiple, but smaller, instances of the same task. The hash join is then applied to each pair of partitioned files.
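A rough Python sketch of the partition-then-join idea follows. It assumes an inner equi-join on a single key, and it uses in-memory lists as stand-ins for the partition files; the function names and the partition count are hypothetical.

```python
from collections import defaultdict


def partition(rows, key, num_partitions):
    """Split rows by hashing the join key (each list stands in for a file)."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key]) % num_partitions].append(row)
    return parts


def grace_hash_join(build_rows, probe_rows, key, num_partitions=4):
    # Step 1: partition both inputs with the same hash function, so rows that
    # can join always land in the same-numbered pair of partitions.
    build_parts = partition(build_rows, key, num_partitions)
    probe_parts = partition(probe_rows, key, num_partitions)

    # Step 2: join each pair independently with an ordinary in-memory hash join.
    results = []
    for build_part, probe_part in zip(build_parts, probe_parts):
        table = defaultdict(list)
        for row in build_part:
            table[row[key]].append(row)
        for probe_row in probe_part:
            for build_row in table.get(probe_row[key], []):
                results.append((build_row, probe_row))
    return results


# Hypothetical sample inputs.
build = [{"id": i, "b": f"b{i}"} for i in range(10)]
probe = [{"id": i % 12, "p": f"p{i}"} for i in range(30)]
joined = grace_hash_join(build, probe, "id")
print(len(joined))
```

Each partition pair only needs enough memory for its own share of the build input, which is the point of the extra partitioning pass.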

Recursive Hash Join:

If the build input is so large that inputs for a standard external merge would require multiple merge levels, multiple partitioning steps and multiple partitioning levels are required. If only some of the partitions are large, additional partitioning steps are used for only those specific partitions. In order to make all partitioning steps as fast as possible, large, asynchronous I/O operations are used so that a single thread can keep multiple disk drives busy.
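The idea of repartitioning only the partitions that are still too large can be sketched as a recursive function. This is an illustrative sketch only: MEMORY_LIMIT, FANOUT, MAX_DEPTH, and level_hash are hypothetical names and values, and the asynchronous I/O mentioned above is not modeled.

```python
from collections import defaultdict

MEMORY_LIMIT = 4  # hypothetical: max build rows that "fit in memory"
FANOUT = 2        # hypothetical: partitions created per partitioning step
MAX_DEPTH = 8     # guard against skew (many rows sharing one key value)


def level_hash(value, level):
    # Use a different hash per level; reusing the same one would send every
    # row of a partition to the same sub-partition again.
    return hash((level, value)) % FANOUT


def recursive_hash_join(build_rows, probe_rows, key, level=0):
    if len(build_rows) <= MEMORY_LIMIT or level >= MAX_DEPTH:
        # Base case: the build input fits, so do a plain in-memory hash join.
        table = defaultdict(list)
        for row in build_rows:
            table[row[key]].append(row)
        return [(b, p) for p in probe_rows for b in table.get(p[key], [])]

    # Otherwise partition both inputs and recurse on each pair. Pairs whose
    # build side now fits stop at the base case; larger ones split again.
    build_parts = [[] for _ in range(FANOUT)]
    probe_parts = [[] for _ in range(FANOUT)]
    for row in build_rows:
        build_parts[level_hash(row[key], level)].append(row)
    for row in probe_rows:
        probe_parts[level_hash(row[key], level)].append(row)

    results = []
    for b_part, p_part in zip(build_parts, probe_parts):
        results.extend(recursive_hash_join(b_part, p_part, key, level + 1))
    return results


# Hypothetical sample inputs: the build side is larger than MEMORY_LIMIT.
build = [{"id": i} for i in range(20)]
probe = [{"id": i % 25, "n": i} for i in range(50)]
joined = recursive_hash_join(build, probe, "id")
print(len(joined))
```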

NOTE: If the build input is only slightly larger than the available memory, elements of in-memory hash join and grace hash join are combined in a single step, producing a hybrid hash join.

It is not always possible during optimization to determine which hash join is used. Therefore, SQL Server starts by using an in-memory hash join and gradually transitions to grace hash join, and recursive hash join, depending on the size of the build input.

If the optimizer anticipates wrongly which of the two inputs is smaller and, therefore, should have been the build input, the build and probe roles are reversed dynamically. The hash join makes sure that it uses the smaller overflow file as build input. This technique is called role reversal. Role reversal occurs inside the hash join after at least one spill to the disk.
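For an inner join, role reversal on a single spilled partition pair might look like the sketch below. The function name and the row layout are assumptions made for illustration; the source does not describe how SQL Server implements this internally.

```python
from collections import defaultdict


def join_partition_pair(left_part, right_part, key):
    """Join one spilled partition pair, building on whichever side is smaller."""
    reversed_roles = len(right_part) < len(left_part)
    if reversed_roles:
        build, probe = right_part, left_part
    else:
        build, probe = left_part, right_part

    table = defaultdict(list)
    for row in build:
        table[row[key]].append(row)

    results = []
    for probe_row in probe:
        for build_row in table.get(probe_row[key], []):
            # Always emit (left, right) so the caller never sees the reversal.
            if reversed_roles:
                results.append((probe_row, build_row))
            else:
                results.append((build_row, probe_row))
    return results, reversed_roles


# Hypothetical partition pair: the planned build side (left) is the larger one.
left = [{"k": i % 3, "l": i} for i in range(6)]
right = [{"k": 0, "r": "x"}, {"k": 2, "r": "y"}]
pairs, was_reversed = join_partition_pair(left, right, "k")
print(was_reversed, [(l["l"], r["r"]) for l, r in pairs])
```

The output rows have the same shape either way, which is consistent with the reversal being invisible to the user.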

NOTE: Role reversal occurs independently of any query hints or structure. Role reversal does not display in your query plan; when it occurs, it is transparent to the user. The term hash bailout is sometimes used to describe grace hash joins or recursive hash joins.

Using HASH JOIN

USE AdventureWorks2008R2;

-- The HASH join hint forces the optimizer to use a hash join here.
SELECT p.Name, pr.ProductReviewID
FROM Production.Product p
LEFT OUTER HASH JOIN Production.ProductReview pr
ON p.ProductID = pr.ProductID
ORDER BY pr.ProductReviewID DESC;

Using Normal JOIN (no join hint, so the query optimizer picks the join algorithm)

SELECT p.Name, pr.ProductReviewID
FROM Production.Product p
LEFT OUTER JOIN Production.ProductReview pr
ON p.ProductID = pr.ProductID
ORDER BY ProductReviewID DESC;

B. Nested Loops Join

The nested loops join, also called nested iteration, uses one join input as the outer input table (shown as the top input in the graphical execution plan) and one as the inner (bottom) input table. The outer loop consumes the outer input table row by row. The inner loop, executed for each outer row, searches for matching rows in the inner input table.

In the simplest case, the search scans an entire table or index; this is called a naive nested loops join. If the search exploits an index, it is called an index nested loops join. If the index is built as part of the query plan (and destroyed upon completion of the query), it is called a temporary index nested loops join. All these variants are considered by the query optimizer.

A nested loops join is particularly effective if the outer input is small and the inner input is pre-indexed and large. In many small transactions, such as those affecting only a small set of rows, index nested loops joins are superior to both merge joins and hash joins. In large queries, however, nested loops joins are often not the optimal choice.
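The difference between the naive and index variants can be sketched in Python. This is a simplified illustration: a dict stands in for the index (a real index would typically be a B-tree), and all names and data are hypothetical.

```python
def naive_nested_loops_join(outer_rows, inner_rows, key):
    """For each outer row, scan the entire inner input looking for matches."""
    for outer_row in outer_rows:
        for inner_row in inner_rows:
            if outer_row[key] == inner_row[key]:
                yield outer_row, inner_row


def build_index(rows, key):
    """Stand-in for an index: a dict from key value to the rows with that value."""
    index = {}
    for row in rows:
        index.setdefault(row[key], []).append(row)
    return index


def index_nested_loops_join(outer_rows, inner_index, key):
    """For each outer row, look up matches in the index instead of scanning."""
    for outer_row in outer_rows:
        for inner_row in inner_index.get(outer_row[key], []):
            yield outer_row, inner_row


# Hypothetical data: a small outer input and a larger inner input.
outer = [{"id": 3}, {"id": 7}]
inner = [{"id": i % 10, "n": i} for i in range(100)]

naive = list(naive_nested_loops_join(outer, inner, "id"))
indexed = list(index_nested_loops_join(outer, build_index(inner, "id"), "id"))
print(len(naive), naive == indexed)
```

Here the naive version compares 2 × 100 row pairs, while the index version does 2 lookups. Calling build_index inside the query, as this usage does, loosely corresponds to the temporary index variant; with a pre-existing index the build cost would already have been paid.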

EX: Using LOOP JOIN

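-- Deletes quota history for salespeople with year-to-date sales above
-- 2,500,000; the LOOP hint forces a nested loops join.
DELETE FROM Sales.SalesPersonQuotaHistory
FROM Sales.SalesPersonQuotaHistory AS spqh
INNER LOOP JOIN Sales.SalesPerson AS sp
ON spqh.BusinessEntityID = sp.BusinessEntityID
WHERE sp.SalesYTD > 2500000.00;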

C. Merge Join

The merge join requires both inputs to be sorted on the merge columns, which are defined by the equality (ON) clauses of the join predicate. The query optimizer typically scans an index, if one exists on the proper set of columns, or it places a sort operator below the merge join. In rare cases, there may be multiple equality clauses, but the merge columns are taken from only some of the available equality clauses.

Because each input is sorted, the Merge Join operator gets a row from each input and compares them. For example, for inner join operations, the rows are returned if they are equal. If they are not equal, the lower-value row is discarded and another row is obtained from that input. This process repeats until all rows have been processed.
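The compare-and-advance loop can be sketched as follows for the simpler case where the left input has unique keys. This is a minimal sketch under that assumption; the function name and sample rows are hypothetical.

```python
def merge_join_one_to_many(left_rows, right_rows, key):
    """Inner merge join; both inputs are sorted on key and left keys are unique."""
    results = []
    i = j = 0
    while i < len(left_rows) and j < len(right_rows):
        left_key, right_key = left_rows[i][key], right_rows[j][key]
        if left_key == right_key:
            results.append((left_rows[i], right_rows[j]))
            j += 1  # stay on the left row: more right rows may share this key
        elif left_key < right_key:
            i += 1  # discard the lower-value row and read the next one
        else:
            j += 1
    return results


# Hypothetical inputs, already sorted on order_id.
headers = [{"order_id": 1}, {"order_id": 2}, {"order_id": 4}]
details = [
    {"order_id": 1, "line": 1},
    {"order_id": 1, "line": 2},
    {"order_id": 3, "line": 1},
    {"order_id": 4, "line": 1},
]

joined = [
    (h["order_id"], d["line"])
    for h, d in merge_join_one_to_many(headers, details, "order_id")
]
print(joined)
```

Each input is read once from start to end, which is why the join itself is cheap when the inputs arrive already sorted.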

The merge join operation may be either a regular or a many-to-many operation. A many-to-many merge join uses a temporary table to store rows. If there are duplicate values from each input, one of the inputs will have to rewind to the start of the duplicates as each duplicate from the other input is processed.
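The rewind step for duplicates on both sides can be sketched like this. It is an illustration only: the sketch rewinds an index into an in-memory list, whereas the temporary table mentioned above is how SQL Server stores the rows it needs to revisit. All names and data are hypothetical.

```python
def merge_join_many_to_many(left_rows, right_rows, key):
    """Inner merge join of two sorted inputs that may both contain duplicates."""
    results = []
    i = j = 0
    while i < len(left_rows) and j < len(right_rows):
        left_key, right_key = left_rows[i][key], right_rows[j][key]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            group_start = j  # remember where this run of right duplicates begins
            while i < len(left_rows) and left_rows[i][key] == left_key:
                j = group_start  # rewind the right input for each left duplicate
                while j < len(right_rows) and right_rows[j][key] == left_key:
                    results.append((left_rows[i], right_rows[j]))
                    j += 1
                i += 1
    return results


# Hypothetical inputs, both sorted on k, both with duplicate key values.
left = [{"k": 1, "l": "a"}, {"k": 2, "l": "b"}, {"k": 2, "l": "c"}, {"k": 3, "l": "d"}]
right = [{"k": 2, "r": "x"}, {"k": 2, "r": "y"}, {"k": 3, "r": "z"}, {"k": 5, "r": "w"}]

pairs = [(l["l"], r["r"]) for l, r in merge_join_many_to_many(left, right, "k")]
print(pairs)
```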

If a residual predicate is present, it is evaluated for every row that satisfies the merge predicate, and only the rows that also satisfy the residual predicate are returned.

Merge join itself is very fast, but it can be an expensive choice if sort operations are required. However, if the data volume is large and the desired data can be obtained presorted from existing B-tree indexes, merge join is often the fastest available join algorithm.

EX: Using MERGE JOIN

-- The MERGE hint forces a merge join on PurchaseOrderID.
SELECT poh.PurchaseOrderID, poh.OrderDate, pod.ProductID, pod.DueDate, poh.VendorID
FROM Purchasing.PurchaseOrderHeader AS poh
INNER MERGE JOIN Purchasing.PurchaseOrderDetail AS pod
ON poh.PurchaseOrderID = pod.PurchaseOrderID;

Friday, October 14, 2011
