MS-SQL Server Notes: join


Monday, November 29, 2010

Default Join Type

I'm always asked what the default join in SQL Server is when no join type is specified. My immediate reaction is to Google it. The problem is finding the answer among all the articles that show up in the search results.

Now that I found it, I decided to put it here...

Here is the actual article (MSDN BOL) where I found it. And just in case Microsoft decides to remove it, here's an excerpt:

INNER
Specifies that all matching pairs of rows are returned. Discards unmatched rows from both tables. This is the default if no join type is specified.

LEFT [ OUTER ]
Specifies that all rows from the left table not meeting the specified condition are included in the result set in addition to all rows returned by the inner join. Output columns from the other (right) table are set to NULL.

RIGHT [ OUTER ]
Specifies that all rows from the right table not meeting the specified condition are included in the result set in addition to all rows returned by the inner join. Output columns from the other (left) table are set to NULL.

FULL [ OUTER ]
If a row from either the left or right table does not match the selection criteria, specifies the row be included in the result set, and output columns that correspond to the other table be set to NULL. This is in addition to all rows usually returned by the inner join.
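The excerpt above can be checked with a small sketch. It uses Python's built-in sqlite3 module rather than SQL Server (an assumption made only so the example runs anywhere); SQLite also treats a bare JOIN as an INNER JOIN. The two-row tables are cut-down, hypothetical versions of the Authors and Books tables used elsewhere in these notes.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server here. Both treat a bare JOIN
# as INNER JOIN, which is the only behaviour this sketch relies on.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Authors (AuthorID int, AuthorName varchar(30));
    create table Books (aid int, BookTitle varchar(30));
    insert into Authors values (1, 'Jhon'), (3, 'George');
    insert into Books values (1, 'Yesterday');
""")

select_list = "select a.AuthorName, b.BookTitle from Authors a "
on_clause = " Books b on a.AuthorID = b.aid order by a.AuthorID"

bare = conn.execute(select_list + "join" + on_clause).fetchall()
inner = conn.execute(select_list + "inner join" + on_clause).fetchall()
left = conn.execute(select_list + "left join" + on_clause).fetchall()

print(bare)   # same rows as the explicit INNER JOIN
print(inner)
print(left)   # also keeps George, with NULL (None) for the book title
```

George has no book, so he only appears in the LEFT JOIN result, with the Books column set to NULL.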

~~ CK

Saturday, November 1, 2008

The Problem with IN

In one of my old notes, I wrote how JOIN, IN and EXISTS are all affected by NULL values in the columns used to join tables, or to check for existence of relationship (parent-child) between tables.

Here are some queries that use IN that will not give any syntax error but could give unexpected results.

Consider these sample tables:

select * from Authors
select * from books

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1           Jhon                           San Francisco
2           Paul                           Los Angeles
3           George                         San Diego
4           Ringo                          Oakland

aid         BookID      BookTitle                      BookPrice
----------- ----------- ------------------------------ ---------------------
1           1           Yesterday                      5.00
1           2           In My Life                     7.50
1           3           Hey Junde                      4.45
2           4           Fool On the Hill               NULL
2           6           If I Fell                      7.80
4           7           Let It Be                      NULL
4           8           Till There Was You             0.00
2           9           Yellow Submarine               34.65
1           10          I Should Have Known Better     65.33

Now, I executed this query:

select *
from authors
where  authorid in
   (select Authorid from Books 
        where BookTitle = 'Yesterday')

select *
from authors
where authorid in
   (select aid from Books
      where BookTitle = 'Yesterday')

Here are the result sets:

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1           Jhon                           San Francisco
2           Paul                           Los Angeles
3           George                         San Diego
4           Ringo                          Oakland

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1           Jhon                           San Francisco

Just like what I saw with EXISTS in my old notes, the parser did not return an error on the first query but returned an unexpected result instead. Run on its own, however, the subquery returns an error.

select Authorid from Books
where BookTitle = 'Yesterday'

Msg 207, Level 16, State 1, Line 1
Invalid column name 'Authorid'.

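Just like with EXISTS, the first query did not return an error because, even though AuthorID is not in the Books table, it exists in the Authors table. The unqualified name is resolved against the outer table, so every author is compared with its own AuthorID.

Unlike EXISTS, the relationship was established without the need to relate the relationship keys. To ensure accurate results, use table aliases and qualify the column names.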

select *
from authors a
where  authorid in
(select b.Authorid from Books b
    where b.BookTitle = 'Yesterday')

Msg 207, Level 16, State 1, Line 4
Invalid column name 'Authorid'.

select *
from authors a
where authorid in
(select b.aid from Books b
  where b.BookTitle = 'Yesterday')

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1           Jhon                           San Francisco
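The same trap can be reproduced outside SQL Server. The sketch below uses Python's sqlite3 module as a stand-in (an assumption; SQLite is not SQL Server, but it resolves an unqualified column in a subquery against the outer query in the same way). The tables are trimmed, hypothetical copies of the ones above.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; only column-name resolution
# in subqueries is being illustrated.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Authors (AuthorID int, AuthorName varchar(30));
    create table Books (aid int, BookID int, BookTitle varchar(30));
    insert into Authors values (1, 'Jhon'), (2, 'Paul'), (3, 'George'), (4, 'Ringo');
    insert into Books values (1, 1, 'Yesterday'), (4, 7, 'Let It Be');
""")

# AuthorID is not a column of Books, so it silently binds to the outer Authors row.
unqualified = conn.execute(
    "select AuthorID from Authors where AuthorID in "
    "(select AuthorID from Books where BookTitle = 'Yesterday') "
    "order by AuthorID").fetchall()

# With an alias, the wrong column name is caught instead of being rebound.
try:
    conn.execute(
        "select a.AuthorID from Authors a where a.AuthorID in "
        "(select b.AuthorID from Books b where b.BookTitle = 'Yesterday')")
    qualified_error = None
except sqlite3.OperationalError as exc:
    qualified_error = str(exc)

# The intended query, using the real key column of Books.
correct = conn.execute(
    "select a.AuthorID from Authors a where a.AuthorID in "
    "(select b.aid from Books b where b.BookTitle = 'Yesterday')").fetchall()

print(unqualified)      # every author comes back
print(qualified_error)  # an error message instead of a silent wrong answer
print(correct)          # only the author of 'Yesterday'
```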

~~CK

Monday, October 20, 2008

Careful EXISTS

In one of my old notes, I wrote how JOIN, IN and EXISTS are all affected by NULL values in the columns used to join tables, or to check for existence of relationship (parent-child) between tables.

Here are some queries that use EXISTS that will not give any syntax error but could give unexpected results.

Consider these sample tables:

select * from Authors
select * from Books

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1           Jhon                           San Francisco
2           Paul                           Los Angeles
3           George                         San Diego
4           Ringo                          Oakland

aid         BookID      BookTitle                      BookPrice
----------- ----------- ------------------------------ -----------
1           1           Yesterday                      5.00
1           2           In My Life                     7.50
1           3           Hey Junde                      4.45
2           4           Fool On the Hill               NULL
2           6           If I Fell                      7.80
4           7           Let It Be                      NULL
4           8           Till There Was You             0.00
2           9           Yellow Submarine               34.65
1           10          I Should Have Known Better     65.33

Now if I execute these queries:

select * 
from authors 
where  exists 
   (select Authorid from Books  
        where BookTitle = 'Yesterday')

select * 
from authors
where exists 
   (select aid from Books 
      where BookTitle = 'Let It Be')

Here are the result sets:

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1           Jhon                           San Francisco
2           Paul                           Los Angeles
3           George                         San Diego
4           Ringo                          Oakland

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1           Jhon                           San Francisco
2           Paul                           Los Angeles
3           George                         San Diego
4           Ringo                          Oakland

The first query did not return any error, even though the column AuthorID does not exist in the Books table.

Separately executed, the subquery will return an error:

select Authorid from Books  
  where BookTitle = 'Yesterday'

Msg 207, Level 16, State 1, Line 1
Invalid column name 'Authorid'.

The parser did not return any error even though AuthorID does not exist in Books. This is because it exists in Authors, so the query uses the column from the Authors table instead. This could return unexpected results.

To illustrate this query in another way:

declare @myVar int

set @myVar = 1

select @myVar myVar from Books where BookTitle = 'Yesterday'

set @myVar = 2

select @myVar myVar  from Books where BookTitle = 'Yesterday'

Here are the result sets:

myVar
-----------
1

myVar
-----------
2

I also noticed that the queries returned all rows. This is due to the nature of how EXISTS works: it checks whether the subquery returns at least one row. As the variable example shows, the subquery always returns a row regardless of the value of @myVar. The query itself is incomplete: the filter condition that should link the two tables is missing. If I want to use EXISTS to relate the two tables, the query should be written this way:

select * 
from authors a
where  exists 
   (select 1 from Books b 
        where a.AuthorID = b.aid and   BookTitle = 'Yesterday')

Here’s the result set:

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1           Jhon                           San Francisco
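The difference between the uncorrelated and the correlated form can be sketched as follows. Python's sqlite3 module is used as a stand-in for SQL Server (an assumption, made so the example is runnable anywhere), and the tables are trimmed, hypothetical copies of the ones above.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; EXISTS semantics are the same.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Authors (AuthorID int, AuthorName varchar(30));
    create table Books (aid int, BookID int, BookTitle varchar(30));
    insert into Authors values (1, 'Jhon'), (2, 'Paul'), (3, 'George'), (4, 'Ringo');
    insert into Books values (1, 1, 'Yesterday'), (4, 7, 'Let It Be');
""")

def author_ids(sql):
    """Run a query and return the AuthorID values as a flat list."""
    return [row[0] for row in conn.execute(sql)]

# No link between the tables: EXISTS is true for every author or for none.
uncorrelated_hit = author_ids(
    "select AuthorID from Authors where exists "
    "(select aid from Books where BookTitle = 'Let It Be') order by AuthorID")
uncorrelated_miss = author_ids(
    "select AuthorID from Authors where exists "
    "(select aid from Books where BookTitle = 'No Such Title') order by AuthorID")

# Linking the keys makes EXISTS answer the question per author.
correlated = author_ids(
    "select a.AuthorID from Authors a where exists "
    "(select 1 from Books b where a.AuthorID = b.aid and b.BookTitle = 'Let It Be')")

print(uncorrelated_hit)   # all four authors
print(uncorrelated_miss)  # no authors at all
print(correlated)         # only the author of 'Let It Be'
```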

~~CK

Saturday, October 11, 2008

The Difference Between IN, EXISTS and JOIN

JOIN, IN and EXISTS can all be used to check for a parent-child table relationship. However, it's a common misconception that these three behave the same way. I found out (the hard way) that it's not always the case, especially if the NOT operator is used to check for orphan records or broken links.

Consider these sample tables:

set nocount on

create table Authors
(AuthorID int, AuthorName varchar(30), AuthorCity varchar(15))

create table Books
(aid int, BookID int, BookTitle varchar(30), BookPrice money)

insert into Authors values (1, 'Jhon', 'San Francisco')
insert into Authors values (2, 'Paul', 'Los Angeles')
insert into Authors values (3, 'George', 'San Diego')
insert into Authors values (4, 'Ringo', 'Oakland')

insert into Books values (1, 1, 'Yesterday',5.00)
insert into Books values (1, 2, 'In My Life',7.50)
insert into Books values (1, 3, 'Hey Junde',4.45)
insert into Books values (2, 4, 'Fool On the Hill',NULL)
insert into Books values (2, 6, 'If I Fell',7.80)
insert into Books values (4, 7, 'Let It Be',NULL)
insert into Books values (4, 8, 'Till There Was You',0.00)
insert into Books values (2, 9, 'Yellow Submarine',34.65)
insert into Books values
(1, 10, 'I Should Have Known Better',65.33)

select * from Authors
select * from Books

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1           Jhon                           San Francisco
2           Paul                           Los Angeles
3           George                         San Diego
4           Ringo                          Oakland

aid         BookID      BookTitle                      BookPrice
----------- ----------- ------------------------------ ---------------------
1           1           Yesterday                      5.00
1           2           In My Life                     7.50
1           3           Hey Junde                      4.45
2           4           Fool On the Hill               NULL
2           6           If I Fell                      7.80
4           7           Let It Be                      NULL
4           8           Till There Was You             0.00
2           9           Yellow Submarine               34.65
1           10          I Should Have Known Better     65.33

Here's how these three work when checking for the existence of a parent-child relationship.

select * 
from authors a
where exists (select aid from Books b where authorid = aid)

select *
from authors a
where authorid in (select aid from books)

select distinct a.*
from authors a
 join books b on a.authorid = b.aid
where b.aid is not null

Here are the result sets:

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1           Jhon                           San Francisco
2           Paul                           Los Angeles
4           Ringo                          Oakland

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1           Jhon                           San Francisco
2           Paul                           Los Angeles
4           Ringo                          Oakland

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1           Jhon                           San Francisco
2           Paul                           Los Angeles
4           Ringo                          Oakland

In the case of JOIN, I have to use DISTINCT to avoid returning duplicate rows. So far these queries are behaving the way I expected them to.

Now I'll try and find parent records with no child using the NOT operator.

select * 
from authors a
where not exists (select aid from Books b where authorid = aid)

select *
from authors a
where  authorid not in (select aid from books)

select distinct a.*
from authors a
left join books b on a.authorid = b.aid
where b.aid is null

Here are the result sets:

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
3           George                         San Diego

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
3           George                         San Diego

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
3           George                         San Diego

The queries returned the expected results. Now, see what happens if I have a row with AuthorID = NULL on the parent table.

First I'll insert the dummy record.

insert into Authors values (NULL, 'Dummy', 'Daly City')

select * from Authors

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1           Jhon                           San Francisco
2           Paul                           Los Angeles
3           George                         San Diego
4           Ringo                          Oakland
NULL        Dummy                          Daly City

Running the queries again:

select * 
from authors a
where not exists (select aid from Books b where authorid = aid)

select *
from authors a
where  authorid not in (select aid from books)  

select distinct a.*
from authors a
left join books b on a.authorid = b.aid
where b.aid is null

Here are the result sets:

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
3           George                         San Diego
NULL        Dummy                          Daly City

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
3           George                         San Diego

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
NULL        Dummy                          Daly City
3           George                         San Diego

Notice that NOT IN did not return the expected result. It excluded the row with the NULL AuthorID on the parent table, because NULL NOT IN (...) evaluates to UNKNOWN rather than TRUE. To include that record, I actually have to add an OR condition.

select *
from authors a
where  authorid not in (select aid from books)
or authorid is null

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
3           George                         San Diego
NULL        Dummy                          Daly City

Now what if I don't have a row with AuthorID = NULL on the parent table, but it's on the child table instead?

First remove the record from the Authors table:

delete from Authors where AuthorId is null

select * from Authors

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1           Jhon                           San Francisco
2           Paul                           Los Angeles
3           George                         San Diego
4           Ringo                          Oakland

Then insert the record in the Books table:

insert into Books values
(NULL, 5, 'The Long and Winding Road',5.65)

select * from Books

aid         BookID      BookTitle                      BookPrice
----------- ----------- ------------------------------ ---------------------
1           1           Yesterday                      5.00
1           2           In My Life                     7.50
1           3           Hey Junde                      4.45
2           4           Fool On the Hill               NULL
2           6           If I Fell                      7.80
4           7           Let It Be                      NULL
4           8           Till There Was You             0.00
2           9           Yellow Submarine               34.65
1           10          I Should Have Known Better     65.33
NULL        5           The Long and Winding Road      5.65

Running the queries again:

select * 
from authors a
where not exists (select aid from Books b where authorid = aid)

select *
from authors a
where  authorid not in (select aid from books)  

select distinct a.*
from authors a
left join books b on a.authorid = b.aid
where b.aid is null

Here are the result sets:

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
3           George                         San Diego

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
3           George                         San Diego

Notice that NOT IN did not return any result at all. Now what will happen if both the parent and the child tables have a NULL key (AuthorID = NULL and aid = NULL)?

First I'll insert back the row on Authors table:

insert into Authors values (NULL, 'Dummy', 'Daly City')

Now the tables look like this:

select * from Authors
select * from Books

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1           Jhon                           San Francisco
2           Paul                           Los Angeles
3           George                         San Diego
4           Ringo                          Oakland
NULL        Dummy                          Daly City

aid         BookID      BookTitle                      BookPrice
----------- ----------- ------------------------------ ---------------------
1           1           Yesterday                      5.00
1           2           In My Life                     7.50
1           3           Hey Junde                      4.45
2           4           Fool On the Hill               NULL
2           6           If I Fell                      7.80
4           7           Let It Be                      NULL
4           8           Till There Was You             0.00
2           9           Yellow Submarine               34.65
1           10          I Should Have Known Better     65.33
NULL        5           The Long and Winding Road      5.65

First I'll check for the existence of a parent-child relationship:

select * 
from authors a
where exists (select aid from Books b where authorid = aid)

select *
from authors a
where authorid in (select aid from books)

select distinct a.*
from authors a
 join books b on a.authorid = b.aid
where b.aid is not null

Here are the result sets:

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1           Jhon                           San Francisco
2           Paul                           Los Angeles
4           Ringo                          Oakland

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1           Jhon                           San Francisco
2           Paul                           Los Angeles
4           Ringo                          Oakland

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1           Jhon                           San Francisco
2           Paul                           Los Angeles
4           Ringo                          Oakland

So even though there is a row in each table with a NULL key (AuthorID = NULL and aid = NULL), the result set did not return it. This is because SQL Server cannot compare two NULLs with the equals operator: with the default ANSI_NULLS setting, NULL = NULL is neither TRUE nor FALSE, it is UNKNOWN. Since the condition is not TRUE, which is the primary requirement for the filter to include the row in the result set, the row was excluded.

Now, if I try to see whether there are records on the parent table that do not have a child record:

select * 
from authors a
where not exists (select aid from Books b where authorid = aid)

select *
from authors a
where  authorid not in (select aid from books)

select distinct a.*
from authors a
left join books b on a.authorid = b.aid
where b.aid is null

Here are the result sets:

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
3           George                         San Diego
NULL        Dummy                          Daly City

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
NULL        Dummy                          Daly City
3           George                         San Diego

The query that uses NOT IN did not return any rows. This is because NOT IN cannot return any rows if the list contains a NULL value: the comparison against the NULL is UNKNOWN, so the NOT IN test is never TRUE. This is true whether the list is a list of values or a subquery.

select * from Authors

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1           Jhon                           San Francisco
2           Paul                           Los Angeles
3           George                         San Diego
4           Ringo                          Oakland
NULL        Dummy                          Daly City

select *
from authors a
where authorid not in (1,4,NULL)

Here's the result set:

AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
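The NOT IN behaviour with NULLs can be reproduced with a short sketch. It uses Python's sqlite3 module instead of SQL Server (an assumption; both follow the same three-valued logic for these predicates), with trimmed, hypothetical copies of the tables above that keep the NULL key on both sides.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; only NULL handling in
# NOT EXISTS, NOT IN and LEFT JOIN ... IS NULL is being illustrated.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Authors (AuthorID int, AuthorName varchar(30));
    create table Books (aid int, BookID int, BookTitle varchar(30));
    insert into Authors values
        (1, 'Jhon'), (2, 'Paul'), (3, 'George'), (4, 'Ringo'), (NULL, 'Dummy');
    insert into Books values
        (1, 1, 'Yesterday'), (2, 6, 'If I Fell'), (4, 7, 'Let It Be'),
        (NULL, 5, 'The Long and Winding Road');
""")

def names(sql):
    """Run a query and return the AuthorName values as a flat list."""
    return [row[0] for row in conn.execute(sql)]

not_exists = names(
    "select a.AuthorName from Authors a where not exists "
    "(select 1 from Books b where a.AuthorID = b.aid) order by a.AuthorName")

left_join = names(
    "select a.AuthorName from Authors a "
    "left join Books b on a.AuthorID = b.aid "
    "where b.aid is null order by a.AuthorName")

# The NULL aid in Books makes every NOT IN test UNKNOWN or FALSE.
not_in = names(
    "select AuthorName from Authors "
    "where AuthorID not in (select aid from Books)")

# Filtering NULLs out of the list brings George back, but the NULL author
# is still excluded because NULL NOT IN (...) is UNKNOWN.
not_in_filtered = names(
    "select AuthorName from Authors "
    "where AuthorID not in (select aid from Books where aid is not null)")

print(not_exists)       # ['Dummy', 'George']
print(left_join)        # ['Dummy', 'George']
print(not_in)           # []
print(not_in_filtered)  # ['George']
```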

Based on these samples, IN is the least reliable of the three for checking a parent-child relationship, especially if either or both of the parent and child tables have a NULL value in the column used as the join key.

Although some of these situations can be avoided by using the primary key and foreign key constraints built into SQL Server, these inconsistencies can still come up when running such queries during the data analysis stage.

~~CK

Tuesday, September 30, 2008

Optimizing Your WHERE Clause

The WHERE clause is one of the most commonly used optional parts of a query. Simply put, it filters out rows and narrows the number of rows returned by a query based on the conditions included in the clause.

It's a common misconception that the SQL optimizer will always use an index whenever the table has a useful one. This is not always the case. In some cases, the index will not be used and a table/index scan will be performed, resulting in slow processing.

It's also widely accepted that the arrangement of expressions and the operators used do not matter, since the optimizer will parse and prepare an execution plan for the query anyway. Although this is true most of the time, arranging the logical expressions properly can improve processing.

Here are some considerations that I keep in mind whenever I build my WHERE clause.

Avoid using an expression that applies a function to a column. This will prevent the optimizer from using the index, and it will perform a table/index scan instead.

This query will not use an index:

select *
from AdventureWorksDW..FactInternetSalesReason
WHERE substring(SalesOrderNumber,1,2) = 'SI'

Modified, this one will use an index:

select *
from AdventureWorksDW..FactInternetSalesReason
WHERE SalesOrderNumber like 'SI%'

If the use of a function cannot be avoided, use an indexed computed column instead.

If the SQL Server is not configured to be case-sensitive, do not bother using the LOWER() or UPPER() functions.

These three queries will return identical results:

select * 
from AdventureWorksDW..DimGeography 
WHERE CountryRegionCode = 'AU'

select * 
from AdventureWorksDW..DimGeography 
WHERE upper(CountryRegionCode) = 'au'

select * 
from AdventureWorksDW..DimGeography 
WHERE CountryRegionCode = 'au'

The second query, however, will not use an index.
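The effect of wrapping an indexed column in a function can be observed with a small sketch. It uses Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN rather than a SQL Server execution plan (an assumption; the two optimizers are different products, but both lose the index seek here). The table and the index name IX_Country are hypothetical, loosely modelled on DimGeography.

```python
import sqlite3

# Assumption: SQLite's planner stands in for SQL Server's optimizer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table DimGeography (
        GeographyKey integer primary key,
        CountryRegionCode varchar(3),
        City varchar(30));
    create index IX_Country on DimGeography (CountryRegionCode);
""")

def plan(sql):
    """Return the detail text of SQLite's query plan for a statement."""
    rows = conn.execute("explain query plan " + sql).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " | ".join(row[-1] for row in rows)

plain = plan("select * from DimGeography where CountryRegionCode = 'AU'")
wrapped = plan("select * from DimGeography where upper(CountryRegionCode) = 'AU'")

print(plain)    # a SEARCH that names IX_Country
print(wrapped)  # a SCAN of the whole table
```

The exact wording of the plan text varies between SQLite versions, so only the leading SEARCH or SCAN keyword should be relied on.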

Use the equal (=) operator to compare two strings instead of LIKE.

These two queries will return the same results:

select * from AdventureWorksDW..FactInternetSales
WHERE salesordernumber = 'SO43703'

select * from AdventureWorksDW..FactInternetSales
WHERE  salesordernumber LIKE 'SO43703'

The first query is more efficient than the second one. If the LIKE operator cannot be avoided, use as many leading characters as possible. If the application performs too many LIKE operations, consider SQL Server's full-text search option instead.

select * from AdventureWorksDW..FactInternetSalesReason
WHERE salesordernumber like 'S%1'

The above query will perform faster than the query below:

select * from AdventureWorksDW..FactInternetSalesReason
WHERE salesordernumber like '%1'

Although they are not identical, given the choice, use the former rather than the latter.

Here are the most common operators in the WHERE clause, arranged with the best performing first:

=
>, <, >=, <= 
LIKE 
<>, NOT

Avoid using the NOT operator as much as possible. Although it is not always the case, a WHERE clause that uses the NOT operator often does not utilize an index.

select * from AdventureWorksDW..FactInternetSales
WHERE not ShipDateKey >= 10

will perform faster as:

select * from AdventureWorksDW..FactInternetSales
where ShipDateKey < 10

Given the choice, use EXISTS() instead of IN(). Moreover, IN() has some issues handling NULL values.

Force the optimizer to utilize an index by using an index hint on the query. Use this as the last resort for optimization.

Given the choice, use BETWEEN instead of IN when the values form a contiguous range. The BETWEEN operator performs faster than IN.

If the clause has multiple logical expressions connected by two or more AND operators, put the expression that is LEAST likely to be true first. This way, if it's false, the clause will immediately end. If both expressions are equally likely to be false, test the least complex expression first. This way, if it's false, the more complex one need not be tested. Also, consider creating an index for a selective column or a covering index for the query.

If the clause has multiple logical expressions connected by two or more OR operators, put the expression that is MOST likely to be true first. This way, if it's true, the clause will immediately end. If both expressions are equally likely to be true, test the least complex expression first. This way, if it's true, the more complex one need not be tested.

Remember that the IN operator is actually another form of OR, so place the most probable value at the start of the list.

A query will perform a table/index scan if it contains an OR operator and any of the referenced columns does not have a useful index.

This query will perform a table/index scan. For this query to utilize an index, there must be an index on all four columns in the clause. Even if two out of the four columns have an index, it will still perform a table/index scan.

select *
from AdventureWorksDW..FactInternetSales
where ShipDateKey = 3
or PromotionKey = 3
or SalesTerritoryKey = 5
or SalesOrderLineNumber = 3

If creating an index for each of these columns is not an option, rewrite the query to use UNION ALL (not UNION) instead. This way the queries with a useful index will utilize it, even if that is just one or two out of the four queries. It will still execute more efficiently.

select *
from AdventureWorksDW..FactInternetSales
where ShipDateKey = 3
union all
select *
from AdventureWorksDW..FactInternetSales
where PromotionKey = 3
union all
select *
from AdventureWorksDW..FactInternetSales
where SalesTerritoryKey = 5
union all
select *
from AdventureWorksDW..FactInternetSales
where SalesOrderLineNumber = 3

The above query will return the same rows but will run faster than using multiple ORs, with one difference: a row that matches more than one of the conditions is returned once per matching branch, so duplicates have to be acceptable or be removed afterwards. If ShipDateKey and PromotionKey have useful indexes, it will use their respective indexes for those parts of the query, improving the speed of the entire query.

Given the option, use EXISTS or a LEFT JOIN instead of IN to test/compare the relationship between parent-child tables.

Beware of redundant clauses.

This query has a redundant WHERE clause:
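One thing worth checking before swapping OR for UNION ALL is how rows that satisfy more than one condition are handled. The sketch below uses Python's sqlite3 module as a stand-in for SQL Server (an assumption), with a tiny hypothetical Sales table in place of FactInternetSales and only two of the four conditions.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; OR, UNION ALL and UNION
# have the same row semantics in both.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Sales (SaleID integer primary key, ShipDateKey int, PromotionKey int);
    insert into Sales values (1, 3, 3), (2, 3, 1), (3, 1, 3), (4, 1, 1);
""")

def sale_ids(sql):
    """Run a query and return the SaleID values as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

with_or = sale_ids(
    "select SaleID from Sales where ShipDateKey = 3 or PromotionKey = 3")

with_union_all = sale_ids(
    "select SaleID from Sales where ShipDateKey = 3 "
    "union all "
    "select SaleID from Sales where PromotionKey = 3")

with_union = sale_ids(
    "select SaleID from Sales where ShipDateKey = 3 "
    "union "
    "select SaleID from Sales where PromotionKey = 3")

print(with_or)         # [1, 2, 3]
print(with_union_all)  # [1, 1, 2, 3]  (sale 1 matches both branches)
print(with_union)      # [1, 2, 3]
```

UNION removes the repeated row at the cost of a duplicate-elimination step, so which form fits depends on whether repeats are acceptable for the query at hand.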

select DueDateKey, ShipDateKey,
PromotionKey, CustomerKey
from AdventureWorksDW..FactInternetSales
WHERE
DueDateKey = 8 and ShipDateKey = 8 and
PromotionKey = 1 or ShipDateKey = 8 and PromotionKey = 1

The rows matching DueDateKey = 8 and ShipDateKey = 8 and PromotionKey = 1 are a subset of the rows matching ShipDateKey = 8 and PromotionKey = 1, so the first condition adds nothing. What the optimizer will do is return all the rows requested.

Here's the result set:

DueDateKey  ShipDateKey PromotionKey CustomerKey
----------- ----------- ------------ -----------
13          8           1            11003
13          8           1            14501
13          8           1            21768
13          8           1            25863
13          8           1            28389

It may look like the query returned the requested rows efficiently. In reality, what happens is that it returned these rows twice and then performed a DISTINCT to remove the redundant rows.

Convert a frequently running batch to a stored procedure, especially if the queries in the batch use user-defined variables. The optimizer might not take advantage of a useful index to run the queries in the batch. This is because the optimizer does not know the value of these variables when it chooses a way to access the data. If the batch is not frequently executed, consider including an INDEX hint on the query.

As much as possible, include a WHERE clause in the query. This will improve the way SQL Server retrieves the data.
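That the extra condition is redundant can be confirmed by comparing the original WHERE clause with a simplified one. The sketch uses Python's sqlite3 module as a stand-in for SQL Server (an assumption) and a four-row hypothetical table in place of FactInternetSales; it checks only that the two predicates select the same rows, not how either engine executes them.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; AND binds tighter than OR
# in both, which is all this comparison depends on.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Sales (DueDateKey int, ShipDateKey int, PromotionKey int, CustomerKey int);
    insert into Sales values
        (13, 8, 1, 11003),
        (8,  8, 1, 11004),
        (8,  8, 2, 11005),
        (13, 9, 1, 11006);
""")

def customers(where_clause):
    """Return the CustomerKey values selected by a WHERE clause, sorted."""
    sql = "select CustomerKey from Sales where " + where_clause
    return sorted(row[0] for row in conn.execute(sql))

original = customers(
    "DueDateKey = 8 and ShipDateKey = 8 and PromotionKey = 1 "
    "or ShipDateKey = 8 and PromotionKey = 1")

simplified = customers("ShipDateKey = 8 and PromotionKey = 1")

print(original)    # [11003, 11004]
print(simplified)  # [11003, 11004]
```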

~~CK