Sunday, November 23, 2008

How to Find my Missing Identity

IDENTITY is a property of a numeric data type that tells SQL Server it is an auto-incrementing column. It increments every time a new record is inserted to the table.

An identity column can not start with value less than the seed and will always increment by the identity increment property. Although an identity column may be unique, it is not always sequential. If a record is deleted from the table, the deleted value will not be reuse, even if it's the last record.

Because it's not always sequential, it's possible that there are missing value on the column that breaks the sequence. Here's a query that could check for the missing sequence in an identity column:
select a.IdentityColumn+1
from TableName A
left join TableName B on A.IdentityColumn+1 = B.IdentityColumn
where B.IdentityColumn is null

~~CK

Saturday, November 1, 2008

The Problem with IN

In one of my old notes, I wrote how JOIN, IN and EXISTS are all affected by NULL values in the columns used to join tables, or to check for existence of relationship (parent-child) between tables.

Here are some queries that use IN that will not give any syntax error but could give unexpected results.

Consider this sample table:
select * from Authors
select * from books

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
3 George San Diego
4 Ringo Oakland

aid BookID BookTitle BookPrice
----------- ----------- ------------------------------ ---------------------
1 1 Yesterday 5.00
1 2 In My Life 7.50
1 3 Hey Junde 4.45
2 4 Fool On the Hill NULL
2 6 If I Fell 7.80
4 7 Let It Be NULL
4 8 Till There Was You 0.00
2 9 Yellow Submarine 34.65
1 10 I Should Have Known Better 65.33
Now, I executed this query:
select *
from authors
where authorid in
(select Authorid from Books
where BookTitle = 'Yesterday')

select *
from authors
where authorid in
(select aid from Books
where BookTitle = 'Yesterday')
Here are the result sets:
AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
3 George San Diego
4 Ringo Oakland

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco

Just like what I saw with EXISTS in my old notes, the parser did not return an error on the first query, but returned an unexpected result instead. Separately, the subquery will return an error.
select Authorid from Books
where BookTitle = 'Yesterday'

Msg 207, Level 16, State 1, Line 1
Invalid column name 'Authorid'.
Just like EXISTS, the first query did not return an error because even though AuthorID is not in Books table, because it exists in Authors table.

Unlike EXISTS, the relationship was established without the need to relate the relationship keys. To ensure accurate results, use table alias.
select *
from authors a
where authorid in
(select b.Authorid from Books b
where b.BookTitle = 'Yesterday')

Msg 207, Level 16, State 1, Line 4
Invalid column name 'Authorid'.

select *
from authors a
where authorid in
(select b.aid from Books b
where b.BookTitle = 'Yesterday')

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco


~~CK

Sunday, October 26, 2008

How to check for leap year?

Checking leap year is one of the most common checks that I perform in date values. Wikipedia have this pseudo-code to calculate for leap year:
if ((year modulo 4 is 0) and (year modulo 100 is not 0)) or
(year modulo 400 is 0) then

leap
else
not_leap
I've been using this algorithm all my programming life, and there's really nothing wrong with it. SQL Server, however, can accurately handle date data and have a couple functions that can be used to check for leap year without passing through this complicated algorithm. What I did is just to check if the expression is a valid date expression using the ISDATE() function.

Running the queries below:
select isdate('20070229') as validdate
select isdate('20080229') as validdate
will return the following result sets:
validdate
-----------
0
validdate
-----------
1
So if I have an integer year variable, I just need to build the date string.
set nocount on
declare @YearVar int

set @YearVar = 2007
select
isdate(cast(@YearVar as varchar(4)) + '0229') as validdate


set @YearVar = 2008
select
isdate(cast(@YearVar as varchar(4)) + '0229') as validdate
Here are the result sets:
validdate
-----------
0

validdate
-----------
1
To ensure that the date string will be properly interpreted, I use the unseparated date string format. For more on that, read my old notes.

Here's a sample code that illustrate how this algorithm works:
set nocount on
declare @YearVar int

set @YearVar = 2007
select
@YearVar as YearVar,

case
when isdate(cast(@YearVar as varchar(4)) + '0229') = 1 then

'leap year'
else 'not leap year'
end as validdate

YearVar validdate
----------- -------------
2007 not leap year


set @YearVar = 2008
select
@YearVar as YearVar,

case
when isdate(cast(@YearVar as varchar(4)) + '0229') = 1 then

'leap year'
else 'not leap year'
end as validdate

YearVar validdate
----------- -------------
2008 leap year


~~CK

Monday, October 20, 2008

Careful EXISTS

In one of my old notes, I wrote how JOIN, IN and EXISTS are all affected by NULL values in the columns used to join tables, or to check for existence of relationship (parent-child) between tables.

Here are some queries that use EXISTS that will not give any syntax error but could give unexpected results

Consider this sample table
select * from Authors
select * from Books

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
3 George San Diego
4 Ringo Oakland

aid BookID BookTitle BookPrice
----------- ----------- ------------------------------ -----------
1 1 Yesterday 5.00
1 2 In My Life 7.50
1 3 Hey Junde 4.45
2 4 Fool On the Hill NULL
2 6 If I Fell 7.80
4 7 Let It Be NULL
4 8 Till There Was You 0.00
2 9 Yellow Submarine 34.65
1 10 I Should Have Known Better 65.33
Now if I execute these queries:
select * 
from authors
where exists
(select Authorid from Books
where BookTitle = 'Yesterday')

select *
from authors
where exists
(select aid from Books
where BookTitle = 'Let It Be')

Here are the result sets:
AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
3 George San Diego
4 Ringo Oakland

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
3 George San Diego
4 Ringo Oakland

The first query did not result any error. This, in spite the column AuthorID does not exist in Books table.

Separately executed, the subquery will return an error:
select Authorid from Books  
where BookTitle = 'Yesterday'

Msg 207, Level 16, State 1, Line 1
Invalid column name 'Authorid'.
The parser did not return any error even though AuthorID does not exists in Books. This is because it exists in Authors. In this case, the query will use the column in Authors table instead. This could return unexpected results.

To illustrate this query in another way:
declare @myVar int

set @myVar = 1

select @myVar myVar from Books where BookTitle = 'Yesterday'

set @myVar = 2

select @myVar myVar from Books where BookTitle = 'Yesterday'

Here are the result sets:

myVar
-----------
1

myVar
-----------
2
I also noticed that the queries returned all rows. This is due to the nature of how EXISTS works. It checks if the subquery clause returns at least a single row. As illustrated in another way, the above query always returns a valid row regardless of the value of @myVar. The query itself is incomplete. The filter condition that should link the two table is missing. If I am to use EXISTS to relate the two tables, the query should be written this way:
select * 
from authors a
where exists
(select 1 from Books b
where a.AuthorID = b.aid and BookTitle = 'Yesterday')
Here’s the result set:
AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco


~~CK

Saturday, October 11, 2008

The Difference Between IN, EXISTS and JOIN

JOIN, IN and EXISTS can all be use to check for any parent-child table relationship. However, it's a common misconception that these three behave the same way. I found out (the hard way) that it's not always the case, specially, if the NOT operator is used to check for orphan records or broken links.

Consider this sample table:
set nocount on

create table Authors
(AuthorID int, AuthorName varchar(30), AuthorCity varchar(15))

create table Books
(aid int, BookID int, BookTitle varchar(30), BookPrice money)

insert into Authors values (1, 'Jhon', 'San Francisco')
insert into Authors values (2, 'Paul', 'Los Angeles')
insert into Authors values (3, 'George', 'San Diego')
insert into Authors values (4, 'Ringo', 'Oakland')

insert into Books values (1, 1, 'Yesterday',5.00)
insert into Books values (1, 2, 'In My Life',7.50)
insert into Books values (1, 3, 'Hey Junde',4.45)
insert into Books values (2, 4, 'Fool On the Hill',NULL)
insert into Books values (2, 6, 'If I Fell',7.80)
insert into Books values (4, 7, 'Let It Be',NULL)
insert into Books values (4, 8, 'Till There Was You',0.00)
insert into Books values (2, 9, 'Yellow Submarine',34.65)
insert into Books values
(1, 10, 'I Should Have Known Better',65.33)


select * from Authors
select * from Books

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
3 George San Diego
4 Ringo Oakland

aid BookID BookTitle BookPrice
----------- ----------- ------------------------------ ---------------------
1 1 Yesterday 5.00
1 2 In My Life 7.50
1 3 Hey Junde 4.45
2 4 Fool On the Hill NULL
2 6 If I Fell 7.80
4 7 Let It Be NULL
4 8 Till There Was You 0.00
2 9 Yellow Submarine 34.65
1 10 I Should Have Known Better 65.33
Here's how these three works in checking for existence of parent-child relationship.
select * 
from authors a
where exists (select aid from Books b where authorid = aid)

select *
from authors a
where authorid in (select aid from books)

select distinct a.*
from authors a
join books b on a.authorid = b.aid
where b.aid is not null
Here are the result sets:
AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
4 Ringo Oakland

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
4 Ringo Oakland

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
4 Ringo Oakland
In the case of JOIN, I have to use DISTINCT to avoid returning duplicate rows. So far these queries are behaving the way I expected them to be.

Now I'll try and find parent records with no child using the NOT operator.
select * 
from authors a
where not exists (select aid from Books b where authorid = aid)

select *
from authors a
where authorid not in (select aid from books)

select distinct a.*
from authors a
left join books b on a.authorid = b.aid
where b.aid is null
Here are the result sets:
AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
3 George San Diego

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
3 George San Diego

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
3 George San Diego
The queries returned the expected results. Now, see what happen if I have a row with AuthorID = NULL on the parent table.

First I'll insert the dummy record.
insert into Authors values (NULL, 'Dummy', 'Daly City')

select * from Authors

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
3 George San Diego
4 Ringo Oakland
NULL Dummy Daly City
Running the queries again:
select * 
from authors a
where not exists (select aid from Books b where authorid = aid)

select *
from authors a
where authorid not in (select aid from books)

select distinct a.*
from authors a
left join books b on a.authorid = b.aid
where b.aid is null
Here are the result sets:
AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
3 George San Diego
NULL Dummy Daly City

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
3 George San Diego

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
NULL Dummy Daly City
3 George San Diego
Notice that IN did not return the expected result. It excluded the NULL on the parent table. To include that record, I actually have to add an OR condition.
select *
from authors a
where authorid not in (select aid from books)
or authorid is null


AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
3 George San Diego
NULL Dummy Daly City
Now what if I don't have a row with AuthorID = NULL on the parent table, but instead it's on the child table.

First remove the record from the Authors table :
delete from Authors where AuthorId is null

select * from Authors

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
3 George San Diego
4 Ringo Oakland
then insert the record in Books table:
insert into Books values
(NULL, 5, 'The Long and Winding Road',5.65)


select * from Books

aid BookID BookTitle BookPrice
----------- ----------- ------------------------------ ---------------------
1 1 Yesterday 5.00
1 2 In My Life 7.50
1 3 Hey Junde 4.45
2 4 Fool On the Hill NULL
2 6 If I Fell 7.80
4 7 Let It Be NULL
4 8 Till There Was You 0.00
2 9 Yellow Submarine 34.65
1 10 I Should Have Known Better 65.33
NULL 5 The Long and Winding Road 5.65
Running the queries again:
select * 
from authors a
where not exists (select aid from Books b where authorid = aid)

select *
from authors a
where authorid not in (select aid from books)

select distinct a.*
from authors a
left join books b on a.authorid = b.aid
where b.aid is null
Here are the result sets:
AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
3 George San Diego

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
3 George San Diego
Notice that IN did not return any result at all. Now what will happen if both the parent and the child tables have AuthorId = NULL and AID = NULL.

First I'll insert back the row on Authors table:
insert into Authors values (NULL, 'Dummy', 'Daly City')
Now the tables look like this:
select * from Authors
select * from Books

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
3 George San Diego
4 Ringo Oakland
NULL Dummy Daly City

aid BookID BookTitle BookPrice
----------- ----------- ------------------------------ ---------------------
1 1 Yesterday 5.00
1 2 In My Life 7.50
1 3 Hey Junde 4.45
2 4 Fool On the Hill NULL
2 6 If I Fell 7.80
4 7 Let It Be NULL
4 8 Till There Was You 0.00
2 9 Yellow Submarine 34.65
1 10 I Should Have Known Better 65.33
NULL 5 The Long and Winding Road 5.65
First I'll check for existence of parent-child relationship
select * 
from authors a
where exists (select aid from Books b where authorid = aid)

select *
from authors a
where authorid in (select aid from books)

select distinct a.*
from authors a
join books b on a.authorid = b.aid
where b.aid is not null
Here are the result sets:
AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
4 Ringo Oakland

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
4 Ringo Oakland

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
4 Ringo Oakland
So even if there is a row in both tables with AuthorID = NULL and AID = NULL the result set did not return it. This is because SQL Server can not compare two NULLs. So NULL = NULL is neither TRUE nor FALSE. Since the condition is not TRUE, which is the primary requirement for the filter to include the row in the result set, it was excluded.

Now, if I try and see if there are records on the parent table that does not have a child record…
select * 
from authors a
where not exists (select aid from Books b where authorid = aid)

select *
from authors a
where authorid not in (select aid from books)

select distinct a.*
from authors a
left join books b on a.authorid = b.aid
where b.aid is null
Here are the result sets:
AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
3 George San Diego
NULL Dummy Daly City

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
NULL Dummy Daly City
3 George San Diego
The query that uses IN did not return any result set. This is because IN will not return any rows if the list contains a NULL value. This is true whether the list is a list of values or a subquery.
select * from Authors

AuthorID AuthorName AuthorCity
----------- ------------------------------ ---------------
1 Jhon San Francisco
2 Paul Los Angeles
3 George San Diego
4 Ringo Oakland
NULL Dummy Daly City

select *
from authors a
where authorid not in (1,4,NULL)
Here's the result set:
AuthorID    AuthorName                     AuthorCity
----------- ------------------------------ ---------------
Based on these samples IN is the most unreliable way among the three in terms of checking parent-child relationship, specially if either or both the parent and child tables have a NULL on value on the column used as join key.

Although some of these situations may be avoided using the Primary Key and Foreign Key Constraint built in SQL Server, these inconsistencies could happen while using these queries during data analysis stage.



~~CK

Wednesday, October 8, 2008

Updating a Row with Value from Preceding Row

I found this problem on one of the SQL Server forums that I am regular contributor of.

Given this table:
set nocount on
declare @yourtable table
(recnum int,
Col2 varchar(2),
FKCol3 varchar(15),
Col4 varchar(10))


insert into @yourtable values (1, 'AA', 'key0AA', 'Somedata')
insert into @yourtable values (2, 'CC', NULL, 'Somedata')
insert into @yourtable values (3, 'CC', NULL, 'Somedata')
insert into @yourtable values (4, 'AA', 'key0BB', 'Somedata')
insert into @yourtable values (5, 'BB', NULL, 'Somedata')
insert into @yourtable values (6, 'CC', NULL, 'Somedata')
insert into @yourtable values (7, 'AA', 'key0CC', 'Somedata')
insert into @yourtable values (8, 'CC', NULL, 'Somedata')
insert into @yourtable values (9, 'CC', NULL, 'Somedata')
insert into @yourtable values (10, 'AA', 'key0DD', 'Somedata')
insert into @yourtable values (11, 'AA', 'key0EE', 'Somedata')

select *
from @YourTable


recnum Col2 FKCol3 Col4
----------- ---- --------------- ----------
1 AA key0AA Somedata
2 CC NULL Somedata
3 CC NULL Somedata
4 AA key0BB Somedata
5 BB NULL Somedata
6 CC NULL Somedata
7 AA key0CC Somedata
8 CC NULL Somedata
9 CC NULL Somedata
10 AA key0DD Somedata
11 AA key0EE Somedata
The problem: The user wants to replace those PKCol3 that are NULL by the non-null PKCol3 that precedes it. In this case, records 2 and 3 will have PKCol3 with "key0AA". Records 5 and 6 will have PKCol3 = "key0BB", and so on.This has to be done for almost 2 million+ rows.

Here's the code I suggested:

First, here's a query that can be used for analysis:
select recnum, Col2, FKCol3, Col4,
newCol3 = isnull((
select top 1 FKCol3
from @YourTable b
where a.recnum > b.recnum and b.FKCol3 is not null
and a.FKCol3 is null

order by b.recnum desc),a.FKCol3)
from @YourTable a
Here's the result set:
recnum      Col2 FKCol3          Col4       newCol3
----------- ---- --------------- ---------- ---------------
1 AA key0AA Somedata key0AA
2 CC NULL Somedata key0AA
3 CC NULL Somedata key0AA
4 AA key0BB Somedata key0BB
5 BB NULL Somedata key0BB
6 CC NULL Somedata key0BB
7 AA key0CC Somedata key0CC
8 CC NULL Somedata key0CC
9 CC NULL Somedata key0CC
10 AA key0DD Somedata key0DD
11 AA key0EE Somedata key0EE
The intention is that newCol3 will be the new value of PKCol3.

Executing the update:
update @yourtable
set FKCol3 = isnull((
select top 1 FKCol3
from @YourTable b
where a.recnum > b.recnum and b.FKCol3 is not null
and a.FKCol3 is null

order by b.recnum desc),a.FKCol3)
from @yourtable a

select *
from @YourTable
Here is the updated result set:
recnum      Col2 FKCol3          Col4
----------- ---- --------------- ----------
1 AA key0AA Somedata
2 CC key0AA Somedata
3 CC key0AA Somedata
4 AA key0BB Somedata
5 BB key0BB Somedata
6 CC key0BB Somedata
7 AA key0CC Somedata
8 CC key0CC Somedata
9 CC key0CC Somedata
10 AA key0DD Somedata
11 AA key0EE Somedata
This script will only work if there's a unique column (in this case recnum) that can be used to identify and sort the rows.



~~ CK

Tuesday, September 30, 2008

Optimizing Your WHERE Clause

The WHERE clause is one of the most commonly used optional part of a query. Simply put, it filters out rows and narrows the number of rows returned by a query based on the condition included in the clause.

It's a common misconception that the SQL optimizer will always use an index whenever the table has a useful one. This is not always the case. In some cases, index will not be use and a table/index scan will be performed resulting in slow processing.

It's also widely accepted that the arrangement of expressions and the operator used does not matter, since the optimizer will parse and prepare an execution plan for the query anyway. Although this is true most of the time, arranging the logical expression properly can improve processing.

Here are some consideration that I keep in mind whenever I build my WHERE clause.

Avoid using expression that has a function on a column. This will prevent the optimizer from using the index and instead perform a table/index scan.

This query will not use an index:
select *
from AdventureWorksDW..FactInternetSalesReason
WHERE substring(SalesOrderNumber,1,2) = 'SI'

Modified, this one will use an index:
select *
from AdventureWorksDW..FactInternetSalesReason
WHERE SalesOrderNumber like 'SI%'

If the use of a function can not be avoided, use an indexed computed column instead.

If the SQL Server is not configured to be case-sensitive, do not bother using LOWER() or UPPER() functions .

These are three identical queries that will return identical results:
select * 
from AdventureWorksDW..DimGeography
WHERE CountryRegionCode = 'AU'

select *
from AdventureWorksDW..DimGeography
WHERE upper(CountryRegionCode) = 'au'

select *
from AdventureWorksDW..DimGeography
WHERE CountryRegionCode = 'au'
The second query, however, will not use an index.

Use the equal (=) operator to compare two strings instead of LIKE.

These two queries will return the same results:
select * from AdventureWorksDW..FactInternetSales
WHERE salesordernumber = 'SO43703'

select * from AdventureWorksDW..FactInternetSales
WHERE salesordernumber LIKE 'SO43703'
The first query is more efficient than the second one. If the LIKE operator can not be avoided, use as much leading character as much as possible. If the application performs too many LIKE operation, consider SQL Server's full-text search option instead.
select * from AdventureWorksDW..FactInternetSalesReason
WHERE salesordernumber like 'S%1'
The above query will perform faster than the query below:
select * from AdventureWorksDW..FactInternetSalesReason
WHERE salesordernumber like '%1'
Although they are not identical, given the choice, use the former rather than the latter.

Here are the most common operators in WHERE clause arranged based on best performing first:
=
>, <, >=, <=
LIKE
<>, NOT
Avoid using NOT operator as much as possible. Although not always the case, WHERE clause that uses the NOT operator does not utilize index.
select * from AdventureWorksDW..FactInternetSales
WHERE not ShipDateKey >= 10
will perform faster as:
select * from AdventureWorksDW..FactInternetSales
where ShipDateKey <10
Given the choice, use EXISTS() instead of IN(). Moreover, IN() have some issues handling with NULL values.

Force the optimizer to utilize an index by using index hint on the query. Use this as the last resort for optimization.

Given the choice, use BETWEEN instead of IN. The BETWEEN operator performs faster than IN.

If the clause have multiple logical expression connected by two or more AND operators, locate the expression that will be LEAST likely to be true. This way, if it's false, the clause will immediately end. If both expressions are equally likely to be false, test the least complex expression first. This way, if it's false, the more complex one need not be tested. Also, consider creating an index for a selective column or a covering index for the query.

If the clause have multiple logical expression connected by two or more OR operators, locate the expression that will be MOST likely to be true. This way, if it's true, the clause will immediately end. If both expressions are likely to be true, test the least complex expression first. This way, if it's true, the more complex one need not be tested.

Remember that IN operator is actually another form of OR, so place the most probable value at the start of the list.

A query will perform a table/index scan if it contains OR operator and if any of the referenced column does not have a useful index.

This query will perform a table/index scan. For this query to utilize an index, there must be an index on all three columns in the clause. Even if two out of the four columns have an index, it will still perform a table/index scan.
select *
from AdventureWorksDW..FactInternetSales
where ShipDateKey = 3
or PromotionKey = 3
or SalesTerritoryKey = 5
or SalesOrderLIneNumber = 3
If creating an index for each of these columns is not an option, rewrite the query to use UNION ALL (not UNION) instead. This way the query with useful index will be utilize, even if it's just the one or two out of the four query. It will still execute more efficiently.
select *
from AdventureWorksDW..FactInternetSales
where ShipDateKey = 3
union all
select *
from AdventureWorksDW..FactInternetSales
where PromotionKey = 3
union all
select *
from AdventureWorksDW..FactInternetSales
where SalesTerritoryKey = 5
union all
select *
from AdventureWorksDW..FactInternetSales
where SalesOrderLIneNumber = 3
The above query will give the same results but will run faster than using multiple ORs. If ShipDateKey and PromotionKey have useful index, it will use their respective index for that part of the query, improving the speed of the entire query.

Given the option, use EXIST or a LEFT JOIN instead of IN to test/compare the relationship of a parent-child tables.

Beware of redundant clause.

This query have a redundant WHERE clause:
select DueDateKey, ShipDateKey,
PromotionKey,
CustomerKey
from AdventureWorksDW..FactInternetSales
WHERE
DueDateKey = 8 and ShipDateKey = 8 and
PromotionKey = 1
or ShipDateKey = 8 and PromotionKey = 1
ShipDateKey = 8 and PromotionKey = 1 is a subset of DueDateKey = 8 and ShipDateKey = 8 and PromotionKey = 1. What the optimizer will do is it will return all the rows requested.

Here's the result set:
DueDateKey  ShipDateKey PromotionKey CustomerKey
----------- ----------- ------------ -----------
13 8 1 11003
13 8 1 14501
13 8 1 21768
13 8 1 25863
13 8 1 28389

It may look like the query returned the requested rows efficiently. In reality what happen is, it returned these rows twice then perform a DISTINCT to remove the redundant rows.

Convert frequently running batch to stored procedure, specially, if the queries in the batch use user-defined variables. The optimizer might not take advantage of a useful index to run the queries in the batch. This is because the optimizer does not know the value of these variables when it chooses a way to access the data. If the batch is not frequently executed, consider including an INDEX hit on the query.

As much as possible, include a WHERE clause to the query. This will improve the way SQL Server retrieve the data.



~~CK

Monday, September 29, 2008

Using Encryption to Store Sensitive Data

SQL Server, even the earlier version, has a significant number of ways to protect data from unauthorized users (ie. Views, functions, user rights, etc). However, administrators and database owners can still view table contents that might be too sensitive for anyone to see, like password stored in tables that are frequently used by UIs and other front-end apps.

SQL Server 2005 has an undocumented system function that can be use to encrypt a string to store encrypted information. It should be noted, however, that the function is a one way hash encryption. It means you can not decrypt the stored data. The only way to use it is to compare it with an encrypted string.
declare @string varchar(50), @EncryptedString varbinary(max)

set @string = 'correctpassword'
set @EncryptedString = pwdencrypt(@string)

select
pwdcompare('correctpassword',@EncryptedString ) correctpass,

pwdcompare('wrongpassword',@EncryptedString ) wongpass

Here's the result set:
correctpass wongpass
----------- -----------
1 0

This function is useful if you don't need to display the encrypted data later. If you need the decrypted data, later (like encrypting date of birth data then you'll need to display it later),it would be better to use a customized encryption function.



~~CK

Wednesday, September 24, 2008

How do you get the actual size of a Database?

How do you get the actual size of SQL Server Database? Quick answer: execute a sp_spaceused command. However, I found this excerpt from Books Online "When updateusage is specified, the SQL Server Database Engine scans the data pages in the database and makes any required corrections…..There are some situations, for example, after an index is dropped, when the space information for the table may not be current." In short, sp_spaceused is not always accurate.

For more accurate calculation of database size, try the following code:
select 
cast(size/128.0 as numeric(10,1)) as [size-mb],
cast(FILEPROPERTY(name, 'spaceused')/128.0 as numeric(10,1)) as [used-mb],
maxsize [maximum size-mb],
cast(growth*8/1024 as varchar(11))+
case when status&0x100000>0 then '%' else '' end [growth],

isnull(filegroup_name(groupid),'log') [groupname],
cast(100*(size/128-(FILEPROPERTY(name,'spaceused')/128))/(size/128.0)
as int) [% free]

from sysfiles
Here's the result set:
size-mb  used-mb  maximum size-mb growth   groupname  % free
-------- -------- --------------- -------- ---------- -----------
500.0 10.4 -1 64 PRIMARY 98
58.0 5.3 268435456 16 log 91


~~CK

What's the Difference Between SET and SELECT?

Is there really a difference between SET and SELECT when used to assign values to variables, in T-SQL? The misconception is that, there's no difference between them. I've been creating SQL Scripts for some time now and had been using these two statements interchangeably. That is until I got an unnoticed logical error resulting from using one instead of the other. So what are the similarities and differences? Both commands can assign values to a variable. For the most part, their similarities end there. Fundamentally, SET is the ANSI standard way of assigning values to a variable, while SELECT is not. Another visible difference is that SET can assign a value to only one variable, while SELECT can assign value to one or more variable at a time. Consider these variables:
declare @myVar1 int, @myVar2 int
This code should run successfully:
SELECT @myVar1 = 1, @myVar2 = 2
print '@myVar1'
print @myVar1
print ''
print '@myVar2'
print @myVar2
Here are the results:
@myVar1
1
 
@myVar2
2
This code will result to an error:
SET @myVar1 = 11, @myVar2 = 12
print '@myVar1'
print @myVar1
print ''
print '@myVar2'
print @myVar2
Here's the error returned:
Msg 102, Level 15, State 1, Line 11
Incorrect syntax near ','.
(I'll use the PRINT command to display variable content to avoid confusion between the command assigning values to variable and displaying values.) In order for this code to run, assignment of values should be done separately for each variable.
SET @myVar1 = 11
SET @myVar2 = 12
print '@myVar1'
print @myVar1
print ''
print '@myVar2'
print @myVar2
Here's the result:
@myVar1
11

@myVar2
12
Standards, however, is not the only consideration when using these two statements. Here's a sample application of how these commands could tremendously affect SQL Scripts:
declare @rowsaffected int, @errornumber int

select id/0 from sysobjects
set @rowsaffected = @@rowcount
set @errornumber = @@error

print 'rowsaffected'
print @rowsaffected
print ''
print 'errornumber'
print @errornumber
Here are the results:
Msg 8134, Level 16, State 1, Line 4
Divide by zero error encountered.

rowsaffected
0

errornumber
0
The code returned an error (division by zero), but the system variable @@error returned zero (0). This is not due to SET and SELECT per se. The @@error system variable checks if there's an error on the last t-sql statement executed. In this case set @rowsaffected = @@rowcount, is a valid statement, hence it did not return any error number. This could result in unhandled errors. In another application, I want to return the number of rows affected.
declare @rowsaffected int, @errornumber int

select id from sysobjects where id <= 10
set @errornumber = @@error
set @rowsaffected = @@rowcount

print 'rowsaffected'
print @rowsaffected
print ''
print 'errornumber'  
print @errornumber
Here are the results:
id
-----------
4
5
7
8

rowsaffected
1
 
errornumber
0
Notice that the query actually returned four rows, but the @@rowcount system variable reported only one row. The @@rowcount is a system variable that returns the number of rows affected by the previous statement. In this case, set @errornumber = @@error, returns one row. Using SET to assign the values of these system variables, in this case, is not advisable. To capture both values at the same time, use the SELECT statement.
declare @rowsaffected int, @errornumber int

select id from sysobjects where id <= 10
select @errornumber = @@error,@rowsaffected = @@rowcount

print 'rowsaffected'
print @rowsaffected
print ''
print 'errornumber'  
print @errornumber

id
-----------
4
5
7
8

rowsaffected
4
 
errornumber
0

select id/0 from sysobjects where id <= 10
select @errornumber = @@error,@rowsaffected = @@rowcount

print 'rowsaffected'
print @rowsaffected
print ''
print 'errornumber'  
print @errornumber
 
-----------
Msg 8134, Level 16, State 1, Line 4
Divide by zero error encountered.

rowsaffected
0
 
errornumber
8134
In both cases, the @@rowcount and @@error returned the right values. SET and SELECT also varies on the way scalar values are being assigned to variables, specifically if the value is coming from a query. Consider this sample table:
SET NOCOUNT ON
declare @Test table (i int, j varchar(15))
INSERT INTO @Test (i, j) VALUES (1, 'First Row')
INSERT INTO @Test (i, j) VALUES (2, 'Second Row')

select * from @Test

i           j
----------- ----------
1           First Row
2           Second Row
Using the two commands:
declare @myVarUseSET varchar(15), @myVarUseSELECT varchar(15)

SET @myVarUseSET = (select j from @Test where i = 2)
SELECT @myVarUseSELECT = j from @Test where i = 2

print '@myVarUseSET'
print @myVarUseSET
print ''
print '@myVarUseSELECT'
print @myVarUseSELECT
Here are the results:
@myVarUseSET
Second Row

@myVarUseSELECT
Second Row
The values are accurately assigned to both variables. What will happen if the query returned more than one rows.
SET @myVarUseSET = (select j from @Test)
SELECT @myVarUseSELECT = j from @Test

print '@myVarUseSET'
print @myVarUseSET
print ''
print '@myVarUseSELECT'
print @myVarUseSELECT
Here are the results:
Msg 512, Level 16, State 1, Line 10
Subquery returned more than 1 value. 
This is not permitted when the subquery follows =, !=, <, <= , >, >= or 
when the subquery is used as an expression.
@myVarUseSET

@myVarUseSELECT
Second Row
Notice that the SET query failed while the SELECT succeeded. How do SELECT statements choose which value among the rows returned will be stored to the variable? It will be the last row returned. If the query has an ORDER BY clause, it will return the last row as sorted by the ORDER BY clause. What will happen if, instead of multiple values returned, the query did not return any rows? To be able to appreciate more, I initialize the value of the variables and tried to assign a value that's coming from a query that did not return any rows.
select @myVarUseSET = 'initial value',
@myVarUseSELECT = 'initial value'

set @myVarUseSET = (select j from @Test where i = 3)
select @myVarUseSELECT = j from @Test where i = 3

print '@myVarUseSET'
print isnull(@myVarUseSET,'NULL')
print ''
print '@myVarUseSELECT'
print isnull(@myVarUseSELECT,'NULL')
Here are the results:
@myVarUseSET
NULL

@myVarUseSELECT
initial value
The way SET executes is that it will set the value of the variable to NULL. SELECT, on the other hand, will not replace the value of the variable. Notice that the variable @myVarUseSELECT still contains the initialized value of the variable. If not properly handled, this could lead to unnoticed errors that might affect the result of any scripts. Lastly, speed. Dealing with a single variable, there's not much of speed differential between the two commands. SET @variable = value and SELECT @variable = value have almost the same speed with very negligible difference. However, if it's dealing with three or more variables SELECT beats SET by almost 50% for the mere fact that SELECT can assign values to multiple variables at a time. Based on the abovementioned samples and cases, speed should not the only thing that must be considered. Keep in mind how these commands handle the results of the assignment operator.


~~ CK

Monday, September 22, 2008

Convert Column into A Delimited String, Part 1


Here's a problem that I usually encounter on the forum and at work. Sometimes I need a stored proc or function to read a table and return a single string delimited with either comma or tab or bar.
Although SQL Server 2005 included a PIVOT/UNPIVOT relational operators, I think it's still worth it to look at this solution.
Consider this table:
declare @myTable table (rownum int, rowname varchar(10))
declare @DelimitedString varchar(max)

insert @myTable values (1, 'One')
insert @myTable values (2, 'Two')
insert @myTable values (3, 'Three')
insert @myTable values (4, 'Four')
insert @myTable values (5, 'Five')
insert @myTable values (6, 'Six')
insert @myTable values (7, 'Seven')
insert @myTable values (8, 'Eight')
insert @myTable values (9, 'Nine')
insert @myTable values (10, 'Ten')

select * from @mytable

rownum      rowname
----------- ----------
1           One
2           Two
3           Three
4           Four
5           Five
6           Six
7           Seven
8           Eight
9           Nine
10          Ten
I need this output:
One,Two,Three,Four,Five,Six,Seven,Eight,Nine,Ten
So far, this is the best code that I got:
declare @DelimitedString varchar(max)

SELECT
@DelimitedString = COALESCE(@DelimitedString+',' , '') + ColumnName 
FROM @MyTable

SELECT @DelimitedString
Here's the result set:
-------------------------------------------------
One,Two,Three,Four,Five,Six,Seven,Eight,Nine,Ten
Then I tried parsing that string back to rows:
declare @xmldoc int
declare @DelimitedString varchar(250)

set @DelimitedString = 'One,Two,Three,Four,Five,Six,Seven,Eight,Nine,Ten'

set @DelimitedString = '<root><z><y>' + replace(@DelimitedString, ',', '</y></z><z><y>') + '</y></z></root>'

select @DelimitedString
The variable now looks like this:
<root>
  <z>
    <y>One</y>
  </z>
  <z>
    <y>Two</y>
  </z>
  <z>
    <y>Three</y>
  </z>
  <z>
    <y>Four</y>
  </z>
  <z>
    <y>Five</y>
  </z>
  <z>
    <y>Six</y>
  </z>
  <z>
    <y>Seven</y>
  </z>
  <z>
    <y>Eight</y>
  </z>
  <z>
    <y>Nine</y>
  </z>
  <z>
    <y>Ten</y>
  </z>
</root>
Followed by...
exec sp_xml_preparedocument @xmldoc OUTPUT, @DelimitedString

select y as rows
from openxml(@xmldoc, '/root/z',2)
with (y varchar(15))

exec sp_xml_removedocument @xmldoc
Here's the result set:
rows
---------------
One
Two
Three
Four
Five
Six
Seven
Eight
Nine
Ten
As always, comments are welcome


~~ CK

Friday, September 19, 2008

How do SQL Server Programmers Search for Their Dates?

Being a user of SQL Serever, I dealt with searching and using dates in logical conditions, in one way or another. Initially, it was quite confusing.

Here are a couple of factors.

First, it's because, as of SQL 2005, there are no data types that holds date only, nor time only data. SQL Server stores both the date part and time part using the same data type. The only difference is the precision of these data types.

Second is because you have to use string literals to handle static date data. Users usually miss the time part of the string literal. SQL Server will default the converted time part to zero (00:00:00.000). This could lead to logical conditions to return unexpected results. (For more on all these, read my old notes).

Try and simulate some logical conditions. Consider this sample table:
declare @myTable table(rownum int, DateData datetime)
set nocount on

insert into @mytable values(1,'2005-10-14 01:36:25.440')
insert into @mytable values(2,'2005-10-14 01:36:25.497')
insert into @mytable values(3,'2005-10-14 01:36:25.570')
insert into @mytable values(4,'2005-10-14 01:36:25.627')
insert into @mytable values(5,'2005-10-14 01:36:25.683')
insert into @mytable values(6,'2005-10-14 01:36:25.740')
insert into @mytable values(7,'2005-10-15 00:00:00.000')
insert into @mytable values(8,'2008-07-24 12:52:42.360')
insert into @mytable values(9,'2008-07-25 00:00:00.000')
insert into @mytable values(10,'2008-07-25 12:38:35.060')
insert into @mytable values(11,'2008-07-25 12:38:35.137')
insert into @mytable values(12,'2008-07-26 00:00:00.000')
insert into @mytable values(13,'2008-08-13 00:00:00.000')

select * from @myTable

rownum datedata
----------- -----------------------
1 2005-10-14 01:36:25.440
2 2005-10-14 01:36:25.497
3 2005-10-14 01:36:25.570
4 2005-10-14 01:36:25.627
5 2005-10-14 01:36:25.683
6 2005-10-14 01:36:25.740
7 2005-10-15 00:00:00.000
8 2008-07-24 12:52:42.360
9 2008-07-25 00:00:00.000
10 2008-07-25 12:38:35.060
11 2008-07-25 12:38:35.137
12 2008-07-26 00:00:00.000
13 2008-08-13 00:00:00.000

Now search for specific records using the date data as filter condition :
select *
from @myTable
where DateData = '20080724'


select *
from @myTable
where DateData = '20080813'
Here are the result sets:
rownum      datedata
----------- -----------------------

rownum datedata
----------- -----------------------
13 2008-08-13 00:00:00.000
Now, why did the first query did not return any rows while the second query returned the 13th row? That's in spite of row 8 being the only one with date July 24, 2008 anyway.

A more detail analysis shows that the WHERE clause of the query compares the DateData column to a date-formatted string. In order to compare two values, they must be of the same data type, otherwise, the data type with the lower precedence will be converted to the data type with the higher precedence. In this case, datetime is higher than any other native string data types. The first query is synonymous to:
select * from @myTable
where datedata = cast('20080724' as datetime)


rownum datedata
----------- -----------------------
Since the function converted a date-literal with no time part, it sets the time to zero (00:00:00.000). Resulting to a false condition (2008-07-24 12:53:00 <> 2008-07-24 00:00:00.000).

So how do SQL Server Programmers search for their dates? Here are some options how to retrieve the 8th record.

It's always possible tp convert the date column into string with no time part and compare two strings. Something like:
select * from @myTable
where convert(varchar(20), datedata, 112) = '20080724'

rownum datedata
----------- -----------------------
8 2008-07-24 12:52:42.360
It did. Technically, there's nothing wrong with this code. It returned the desired result. The only problem that it will encounter later is performance. SQL Server will not use index to optimize our query since we use a function in date column for our WHERE clause. This is OK if the query is processing a small table. If it's reading a large volume of data, this will tremendously affect the execution time.

Another option is to grab all records with dates between 20080724 AND 20080725.
select * from @myTable
where datedata BETWEEN '20080724' and '20080725'

rownum datedata
----------- -----------------------
8 2008-07-24 12:52:42.360
9 2008-07-25 00:00:00.000
Now why did it include row 9? It's the way how BETWEEN operator works. BETWEEN is a logical operator that operates inclusive-ly. The above query is similar to:
select * from @myTable
where datedata >= '20080724' and datedata <= '20080725'
The second condition will follow the “explicit conversion of lower precedent data type with no time part” rule, so the 9th row was included to the result set. So, try and include the time part:
select * from @myTable 
where datedata
BETWEEN '20080724 00:00:00.000' and '20080724 23:59:59.999'


rownum datedata
----------- -----------------------
8 2008-07-24 12:52:42.360
9 2008-07-25 00:00:00.000
Why did it still return 2 rows? The conversion, as it is, has nothing to do with this. The second expression ('20080724 23:59:59.999') will be properly converted to a DateTime expression, complete with time part. However, the DateTime precision is 3.333 seconds (For more about DateTime and SmallDatetime rounding and precision, read my old notes) resulting the second part to be rounded to '2008-07-25 00:00:00.000'. To resolve this rounding issue, try to prevent the time part from rounding up by doing:
select * from @myTable
where datedata
BETWEEN '20080724 00:00:00.000' and '20080724 23:59:59.998'


rownum datedata
----------- -----------------------
8 2008-07-24 12:52:42.360
And if it's SmallDateTime, the code should be :
select * from @myTable 
where datedata
BETWEEN '20080724 00:00:00.000' and '20080724 23:59:00'


rownum datedata
----------- -----------------------
8 2008-07-24 12:53:00
Now there are two ways of doing it depending on which data types the query is dealing with. Although, these are all valid codes, it is not recommended to have two versions of codes to handle these two sibling data types. Also, it's necessary to be always conscious of whether it's dealing with a DateTime or a SmallDateTime expression. To be more flexible, it would be better to create something like this:
select * from @myTable 
where datedata >='20080724 ' and datedata < '20080725'

rownum datedata
----------- -----------------------
8 2008-07-24 12:53:00
The above code is clearer and SQL Server can also use index for query optimization. For range of dates, it can easily be extended and modify the condition of the same code:
select * from @myTable 
where datedata >='20080724 ' and datedata < '20080727'

rownum datedata
----------- -----------------------
8 2008-07-24 12:53:00
9 2008-07-25 00:00:00
10 2008-07-25 12:39:00
11 2008-07-25 12:39:00
12 2008-07-26 00:00:00
As a general rule, use a method that can handle both DateTime and SmallDateTime, specially if you the requirement does need to deal with the time part. Also, consider the performance of the query. Placing a date data type column in a function as part of a filter condition will ignore the index and will perform a table scan. Most importantly, if using a function is unavoidable, be aware that most date and time functions are dependent to the user's default language, SET LANGUAGE and SET DATEFORMAT settings.



~~CK