Saturday, October 11, 2008

The Difference Between IN, EXISTS and JOIN

JOIN, IN and EXISTS can all be use to check for any parent-child table relationship. However, it's a common misconception that these three behave the same way. I found out (the hard way) that it's not always the case, specially, if the NOT operator is used to check for orphan records or broken links.

Consider this sample table:
set nocount on

create table Authors
(AuthorID int, AuthorName varchar(30), AuthorCity varchar(15))

create table Books
(aid int, BookID int, BookTitle varchar(30), BookPrice money)

insert into Authors values (1, 'Jhon', 'San Francisco')
insert into Authors values (2, 'Paul', 'Los Angeles')
insert into Authors values (3, 'George', 'San Diego')
insert into Authors values (4, 'Ringo', 'Oakland')

insert into Books values (1, 1, 'Yesterday',5.00)
insert into Books values (1, 2, 'In My Life',7.50)
insert into Books values (1, 3, 'Hey Junde',4.45)
insert into Books values (2, 4, 'Fool On the Hill',NULL)
insert into Books values (2, 6, 'If I Fell',7.80)
insert into Books values (4, 7, 'Let It Be',NULL)
insert into Books values (4, 8, 'Till There Was You',0.00)
insert into Books values (2, 9, 'Yellow Submarine',34.65)
insert into Books values
(1, 10, 'I Should Have Known Better',65.33)


select * from Authors
select * from Books

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
3 George San Diego
4 Ringo Oakland

aid BookID BookTitle BookPrice
----------- ----------- ------------------------------ ---------------------
1 1 Yesterday 5.00
1 2 In My Life 7.50
1 3 Hey Junde 4.45
2 4 Fool On the Hill NULL
2 6 If I Fell 7.80
4 7 Let It Be NULL
4 8 Till There Was You 0.00
2 9 Yellow Submarine 34.65
1 10 I Should Have Known Better 65.33
Here's how these three works in checking for existence of parent-child relationship.
select * 
from authors a
where exists (select aid from Books b where authorid = aid)

select *
from authors a
where authorid in (select aid from books)

select distinct a.*
from authors a
join books b on a.authorid = b.aid
where b.aid is not null
Here are the result sets:
AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
4 Ringo Oakland

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
4 Ringo Oakland

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
4 Ringo Oakland
In the case of JOIN, I have to use DISTINCT to avoid returning duplicate rows. So far these queries are behaving the way I expected them to be.

Now I'll try and find parent records with no child using the NOT operator.
select * 
from authors a
where not exists (select aid from Books b where authorid = aid)

select *
from authors a
where authorid not in (select aid from books)

select distinct a.*
from authors a
left join books b on a.authorid = b.aid
where b.aid is null
Here are the result sets:
AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
3 George San Diego

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
3 George San Diego

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
3 George San Diego
The queries returned the expected results. Now, see what happen if I have a row with AuthorID = NULL on the parent table.

First I'll insert the dummy record.
insert into Authors values (NULL, 'Dummy', 'Daly City')

select * from Authors

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
3 George San Diego
4 Ringo Oakland
NULL Dummy Daly City
Running the queries again:
select * 
from authors a
where not exists (select aid from Books b where authorid = aid)

select *
from authors a
where authorid not in (select aid from books)

select distinct a.*
from authors a
left join books b on a.authorid = b.aid
where b.aid is null
Here are the result sets:
AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
3 George San Diego
NULL Dummy Daly City

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
3 George San Diego

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
NULL Dummy Daly City
3 George San Diego
Notice that IN did not return the expected result. It excluded the NULL on the parent table. To include that record, I actually have to add an OR condition.
select *
from authors a
where authorid not in (select aid from books)
or authorid is null


AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
3 George San Diego
NULL Dummy Daly City
Now what if I don't have a row with AuthorID = NULL on the parent table, but instead it's on the child table.

First remove the record from the Authors table :
delete from Authors where AuthorId is null

select * from Authors

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
3 George San Diego
4 Ringo Oakland
then insert the record in Books table:
insert into Books values
(NULL, 5, 'The Long and Winding Road',5.65)


select * from Books

aid BookID BookTitle BookPrice
----------- ----------- ------------------------------ ---------------------
1 1 Yesterday 5.00
1 2 In My Life 7.50
1 3 Hey Junde 4.45
2 4 Fool On the Hill NULL
2 6 If I Fell 7.80
4 7 Let It Be NULL
4 8 Till There Was You 0.00
2 9 Yellow Submarine 34.65
1 10 I Should Have Known Better 65.33
NULL 5 The Long and Winding Road 5.65
Running the queries again:
select * 
from authors a
where not exists (select aid from Books b where authorid = aid)

select *
from authors a
where authorid not in (select aid from books)

select distinct a.*
from authors a
left join books b on a.authorid = b.aid
where b.aid is null
Here are the result sets:
AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
3 George San Diego

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
3 George San Diego
Notice that IN did not return any result at all. Now what will happen if both the parent and the child tables have AuthorId = NULL and AID = NULL.

First I'll insert back the row on Authors table:
insert into Authors values (NULL, 'Dummy', 'Daly City')
Now the tables look like this:
select * from Authors
select * from Books

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
3 George San Diego
4 Ringo Oakland
NULL Dummy Daly City

aid BookID BookTitle BookPrice
----------- ----------- ------------------------------ ---------------------
1 1 Yesterday 5.00
1 2 In My Life 7.50
1 3 Hey Junde 4.45
2 4 Fool On the Hill NULL
2 6 If I Fell 7.80
4 7 Let It Be NULL
4 8 Till There Was You 0.00
2 9 Yellow Submarine 34.65
1 10 I Should Have Known Better 65.33
NULL 5 The Long and Winding Road 5.65
First I'll check for existence of parent-child relationship
select * 
from authors a
where exists (select aid from Books b where authorid = aid)

select *
from authors a
where authorid in (select aid from books)

select distinct a.*
from authors a
join books b on a.authorid = b.aid
where b.aid is not null
Here are the result sets:
AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
4 Ringo Oakland

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
4 Ringo Oakland

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
4 Ringo Oakland
So even if there is a row in both tables with AuthorID = NULL and AID = NULL the result set did not return it. This is because SQL Server can not compare two NULLs. So NULL = NULL is neither TRUE nor FALSE. Since the condition is not TRUE, which is the primary requirement for the filter to include the row in the result set, it was excluded.

Now, if I try and see if there are records on the parent table that does not have a child record…
select * 
from authors a
where not exists (select aid from Books b where authorid = aid)

select *
from authors a
where authorid not in (select aid from books)

select distinct a.*
from authors a
left join books b on a.authorid = b.aid
where b.aid is null
Here are the result sets:
AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
3 George San Diego
NULL Dummy Daly City

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
NULL Dummy Daly City
3 George San Diego
The query that uses IN did not return any result set. This is because IN will not return any rows if the list contains a NULL value. This is true whether the list is a list of values or a subquery.
select * from Authors

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
3 George San Diego
4 Ringo Oakland
NULL Dummy Daly City

select *
from authors a
where authorid not in (1,4,NULL)
Here's the result set:
AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
Based on these samples IN is the most unreliable way among the three in terms of checking parent-child relationship, specially if either or both the parent and child tables have a NULL on value on the column used as join key.

Although some of these situations may be avoided using the Primary Key and Foreign Key Constraint built in SQL Server, these inconsistencies could happen while using these queries during data analysis stage.



~~CK

No comments:

Post a Comment