November 2004 - Posts

Performance: ISNULL vs. COALESCE

Mladen aka spirit1 posted a speed test of COALESCE vs. ISNULL. Reported result: COALESCE is faster.

But leave it to Anatoly Lubarsky to argue with what was posted. He posted his own speed test, showing that ISNULL is faster.

Anatoly's results showed a miniscule difference, "52 seconds" vs. "52-53 seconds". Mlanden's tests show a larger difference, around 15%. But I don't trust either of these results.

One thing in common with both of the tests I linked to, and which makes them both flawed, is that they return data to the client. This factors greatly into testing time. What if there was a network hiccup, or what if the client UI did something different when rendering the results? We're not testing the network's ability to send data or the client's ability to render it. What's being tested is very specific: Speed of COALESCE vs. ISNULL.

So this leads me to present Adam's Number 1 Rule of Performance Testing: When performance testing a specific feature, do everything in your power to test only that feature itself. Isolate your test as much as possible so that there is no way network traffic or unrelated UI code will get in the way. If you aren't careful about this, you will end up testing these other resources instead of your goal. And when testing against tables in SQL Server, it's especially important to be careful given SQL Server's caching mechanisms. So when testing using tables, I'll always throw out the first few test runs, or even restart the server between tests, in order to control the cache in whever way is logical for the feature being tested.

Before getting to my own tests, I'd like to jump off on a quick tanget. COALESCE vs. ISNULL? Who cares! This isn't a performance question, this is a question of standards-conformant vs. proprietary code. ISNULL is non-standard and provides less functionality than COALESCE. Yet a lot of SQL Server developers love to use it, I suspect because it's a lot easier to remember (and spell). So learn a new word and type two extra characters and you'll end up with more maintainable, more functional code. Sounds good to me -- which is why I am a big fan of COALESCE.

But I am still curious... Which is faster?

In this case, no test data is needed. We're testing performance of the COALESCE and ISNULL functions themselves, not using them to access data from a table. So the most effective test, in my opinion, is to run COALESCE and ISNULL a bunch of times each (one million) and see which runs faster:

DECLARE @i INT SET @i = 1
DECLARE @CPU INT SET @CPU = @@CPU_BUSY
DECLARE @StartDate DATETIME SET @StartDate = GETDATE()

WHILE @i <= 1000000
BEGIN
	IF COALESCE('abc', 'def') = 'def'
		PRINT 1
	SET @i = @i + 1
END

PRINT 'COALESCE, both non-null'
PRINT 'Total CPU time: ' + CONVERT(varchar, @@CPU_BUSY - @CPU)
PRINT 'Total milliseconds: ' + CONVERT(varchar, DATEDIFF(ms, @StartDate, GETDATE()))
PRINT ''
GO

DECLARE @i INT SET @i = 1
DECLARE @CPU INT SET @CPU = @@CPU_BUSY
DECLARE @StartDate DATETIME SET @StartDate = GETDATE()

WHILE @i <= 1000000
BEGIN
	IF ISNULL('abc', 'def') = 'def'
		PRINT 1
	SET @i = @i + 1
END

PRINT 'ISNULL, both non-null'
PRINT 'Total CPU time: ' + CONVERT(varchar, @@CPU_BUSY - @CPU)
PRINT 'Total milliseconds: ' + CONVERT(varchar, DATEDIFF(ms, @StartDate, GETDATE()))
PRINT ''
GO

DECLARE @i INT SET @i = 1
DECLARE @CPU INT SET @CPU = @@CPU_BUSY
DECLARE @StartDate DATETIME SET @StartDate = GETDATE()

WHILE @i <= 1000000
BEGIN
	IF COALESCE(null, 'abc') = 'def'
		PRINT 1
	SET @i = @i + 1
END

PRINT 'COALESCE, first column null'
PRINT 'Total CPU time: ' + CONVERT(varchar, @@CPU_BUSY - @CPU)
PRINT 'Total milliseconds: ' + CONVERT(varchar, DATEDIFF(ms, @StartDate, GETDATE()))
PRINT ''
GO

DECLARE @i INT SET @i = 1
DECLARE @CPU INT SET @CPU = @@CPU_BUSY
DECLARE @StartDate DATETIME SET @StartDate = GETDATE()

WHILE @i <= 1000000
BEGIN
	IF COALESCE(null, 'abc') = 'def'
		PRINT 1
	SET @i = @i + 1
END

PRINT 'ISNULL, first column null'
PRINT 'Total CPU time: ' + CONVERT(varchar, @@CPU_BUSY - @CPU)
PRINT 'Total milliseconds: ' + CONVERT(varchar, DATEDIFF(ms, @StartDate, GETDATE()))
PRINT ''
GO

You'll notice that I'm not using STATISTICS TIME to get the CPU and run time. Unfortunately, STATSTICS TIME returns once per statement, so it is not usable for this test -- we would wind up with one million 0 millisecond results. If you're running on a quiet server (and you should always run targeted performance tests on a quiet server; that may have to become Adam's Number 2 Rule if I can't think of something better) @@CPU_BUSY will give a close enough approximation of how much CPU time the test is using. And DATEDIFF will give us a good enough time reading. Note that the predicate in the IF statement will never return true, so we know that we're not testing our network or client.

I ran these tests several times on a few different servers, and ISNULL appears to pretty consistently out-perform COALESCE by an average of 10 or 12 percent. But that's the difference between 6 seconds and 5.3 seconds (the approximate average runtimes per test on my servers), over the course of a million exections. Hardly worth the functionality and standards compliance sacrifice, at least in the scenarios I use these functions for.

The Church of XML

I just found this essay by Zef Hemel that gives us a very clear look into the minds of those who are obsessed with retrofitting this document management technology into every possible application.

We can only pray that this "church" is actually being run by Jim Jones...

Rowset string concatenation: Which method is best?

Yeah, yeah, yeah, let's get this out of the way right from the start: Don't concatenate rows into delimited strings in SQL Server. Do it client side.

Except if you really have to create delimited strings in SQL Server. In which case you should read on.

There was a little discussion on SQLTeam about the best way to concatenate. I recommended a scalar UDF solution, whereas Rob Volk recommended a solution involving a temp table.

I mentioned my dislike for the temp table solution for a couple of reasons. First of all, it relies on a clustered index for ordering. That will probably work in this example, but is not guaranteed to always work and relying on indexes rather than ORDER BY for ordering is definitely not a habit I want anyone to get into. The clustered index as it was described in Rob's example also has another problem that I didn't even notice until I was writing this entry. But I'll get to that in a moment. The second reason I dislike the temp table is that I felt it would be less efficient than the scalar UDF.

Rob didn't agree about the efficiency. And so I set out to prove him wrong...

We'll use the Authors table in Pubs. I want a comma-delimited list, per state, of the last name of each author who lives there.

First, the scalar UDF:

USE pubs
GO

CREATE FUNCTION dbo.ConcatAuthors(@State CHAR(2))
RETURNS VARCHAR(8000)
AS
BEGIN
	DECLARE @Output VARCHAR(8000)
	SET @Output = ''

	SELECT @Output =	CASE @Output 
				WHEN '' THEN au_lname 
				ELSE @Output + ', ' + au_lname 
				END
	FROM Authors
	WHERE State = @State
	ORDER BY au_lname

	RETURN @Output
END
GO

To find the list I want:

SELECT DISTINCT State, dbo.ConcatAuthors(State)
FROM Authors
ORDER BY State

... And the adaptation of Rob's temp table method... I did change two things due to problems I discovered during testing. One, I've altered the au_lname column to VARCHAR(8000); the column in the Authors table is VARCHAR(40), not large enough for all of the California authors. What if we were dealing with a much larger dataset? Second, I added an IDENTITY column, and I'm clustering on that instead of the actual data to get the ordering. I'm doing so because of the VARCHAR(8000). Index rows can be a maximum of 900 bytes, so if we had enough data to exceed that length, this method would fail.

CREATE TABLE #AuthorConcat
(
	State CHAR(2) NOT NULL,
	au_lname VARCHAR(8000) NOT NULL,
	Ident INT IDENTITY(1,1) NOT NULL PRIMARY KEY
)

INSERT #AuthorConcat 
(
	State,
	au_lname
)
SELECT
	State, 
	au_lname
FROM Authors
ORDER BY 
	State, 
	au_lname

DECLARE @Authors VARCHAR(8000)
SET @Authors = ''
DECLARE @State CHAR(2)
SET @State = ''

UPDATE #AuthorConcat
SET @Authors = au_lname =	CASE 
				WHEN @State = State THEN @Authors + ', ' + au_lname 
				ELSE au_lname END,
	@State = State

SELECT State, MAX(au_lname) 
FROM #AuthorConcat
GROUP BY State

Clever, but more complex and harder to read than the scalar UDF version. Output is identical, but that's not why we're here. Which one is more efficient?

Drumroll, please...

Results were tabulated using STATISTICS IO, STATISTICS TIME, and Query Analyzer's Show Execution Plan. DBCC DROPCLEANBUFFERS and DBCC FREEPROCCACHE were run before each test.

Scalar UDF Method
Total cost: 0.0492
Total Scan count: 1
Total Logical reads: 2
Total Physical reads: 2
Total time: 25 ms

Temp Table Method
Total cost: 0.2131
Total Scan count: 4
Total Logical reads: 9
Total Physical reads: 2
Total time: 88 ms

So in conclusion, neither method is incredibly taxing with the tiny Pubs dataset, but I think I have proven that the UDF is far more efficient.


Update, February 28, 2005: Modified the adapation of Rob Volk's method to use a CREATE TABLE instead of SELECT INTO, as the latter is not necessarily guaranteed to insert rows in the right order for the sake of this example. Thanks to "PW" on SQLServerCentral for pointing this problem out. Note that this changed the total costs very slightly -- for the better -- but the UDF still performs better by quite a large margin.

Sample data

Tal McMahon, a SQLServerCentral user, was kind enough to post a link to a big sample data file full of fake names, social security numbers, and phone numbers. Good stuff!

He was also kind enough to give me permission to link to it on here. So please feel free to grab it from here while Tal is still feeling generous.

Thanks again, Tal! It's always good to have massive amounts of data around for when that testing mood strikes :)

Is PATINDEX faster than LIKE?

I keep seeing the same suggestion on various "tips and tricks" websites: For situations in which you might want to use LIKE in the WHERE clause, but for which indexes cannot be used, PATINDEX will perform faster.

So, according to these sources, this:

SELECT *
FROM tbl
WHERE PATINDEX('%abc%', col) > 0

is faster than this:

SELECT *
FROM tbl
WHERE col LIKE '%abc%'

The thing is, I'm not one to just take this kind of stuff at face value, so I've tested this assertion several times. Every time I see this tip, I think, "I must be missing something," and I test again. And every single time I test again, with different data, different patterns, etc, I arrive at the same conclusion: They perform exactly the same.

Which brings us to today. In the SQL Server Central forums, William O'Malley told me that in his tests, on his data that he can't post due to his industry (?), PATINDEX does outperform LIKE.

So I decided to test yet again. And I'm still coming up with the exact same numbers. I'm hoping that some reader will be able to tell me this mysterious circumstance in which PATINDEX really does outperform LIKE. Or maybe explain why my test is totally incorrect.

Anyway, here's what I did today... First, I created a big table of test data (83 million rows) with the following (which you may notice that I lifted from a previous post):

SELECT DISTINCT A.Name + B.Name + C.Name AS SomewhatLargeString
INTO #BigTableOfStrings
FROM	master..spt_values A,
	master..spt_values B,
	master..spt_values C
WHERE	a.TYPE NOT IN ('P', 'R', 'F', 'F_U')
	AND b.TYPE NOT IN  ('P', 'R', 'F', 'F_U')


CREATE CLUSTERED INDEX CI_LargeString ON #BigTableOfStrings(SomewhatLargeString)

I decided to test against the pattern '%ossDbOwnChainRefere%', for which there are 1752 rows in the test table.

First, I ran the LIKE query:

SELECT COUNT(*)
FROM #BigTableOfStrings
WHERE SomewhatLargeString LIKE '%ossDbOwnChainRefere%'

Runtime: 9:55.

Then I tried PATINDEX:

SELECT COUNT(*)
FROM #BigTableOfStrings
WHERE PATINDEX('%ossDbOwnChainRefere%', SomewhatLargeString) > 0

Runtime: 9:56 (yes, worse)

Then I ran LIKE again with a runtime of 9:47, then PATINDEX again with a runtime of 9:50, and now I'm not patient enough to run either of them again.

So am I correct? Is this claim bogus? Or have I gone completely off-base?

Batching in SQL Server

Steve Jones of SQLServerCentral.com posted a good article about batching updates in SQL Server. He shows why it's not always best to do things in a 100% set-based manner. Blocking and locking during large updates can be mitigated by doing the work in smaller chunks. Definitely a good technique for the toolbox of anyone who has to deal with modifying big datasets.

Visual Studio Hosted Experience

Brian Keller posted on TheServerSide.NET about the Visual Studio Hosted Experience. Basically, you can log into this site and do various hands-on labs and exercises, all from the comfort of your web browser. No Visual Studio required. Very cool!

Here is the link: http://microsoft.demoservers.com/msdn

Paging in SQL Server 2005

I keep seeing questions on newsgroups about paging in stored procedures, and whether there will be a better way in SQL Server 2005. However, aside from a few answers in newsgroups, I haven't seen any content on how to do it. So I'd like to spend a few minutes and share with you the new features that will make paging stored procedures both easier to build and a lot more performant...

But first, since this is a performance-related blog, let's generate a bunch of big test data!

Since I don't have AdventureWorks installed on my test SQL Server 2005 server at the moment and am too lazy to track it down, the data is somewhat ugly... Anyway, start this up and let it run for a while (takes around 35 minutes on my test box):

SELECT DISTINCT A.Name + B.Name + C.Name AS SomewhatLargeString
INTO #BigTableOfStrings
FROM	master..spt_values A,
	master..spt_values B,
	master..spt_values C
WHERE	a.TYPE NOT IN ('P', 'R', 'F', 'F_U')
	AND b.TYPE NOT IN  ('P', 'R', 'F', 'F_U')

CREATE CLUSTERED INDEX CI_LargeString ON #BigTableOfStrings(SomewhatLargeString)

Coffee (or preferably, beer) time! See you in 35 minutes.

... All set?

Okay, now let's pretend we're in a web app with a SQL Server 2000 backend and we want the second page of data... Rows 11-20, ordered by SomewhatLarge. How might we do this in SQL Server 2000..? Here's one way:

SELECT A.SomewhatLargeString
FROM #BigTableOfStrings A
JOIN #BigTableOfStrings B ON B.SomewhatLargeString <= A.SomewhatLargeString
GROUP BY A.SomewhatLargeString
HAVING COUNT(*) BETWEEN 11 AND 20

... Still waiting for that to return? Keep waiting. Maybe get another coffee, or go home for the night. On my system, that query has quite possibly the biggest estimated cost I've ever seen, a staggering 1,523,110,700. Yeah, that sucks. So let's change it around a bit:

SELECT x.SomewhatLargeString
FROM (
	SELECT TOP 20 A.SomewhatLargeString, COUNT(*) AS TheCount
	FROM #BigTableOfStrings A
	JOIN #BigTableOfStrings B ON B.SomewhatLargeString <= A.SomewhatLargeString
	GROUP BY A.SomewhatLargeString
	ORDER BY A.SomewhatLargeString ) x
WHERE x.TheCount BETWEEN 11 AND 20

Much nicer than before, with an estimated cost on my system of 23.1, and a virtually instant return time. But wait, we have an over-zealous user who wants rows 20,001 - 20,010!

SELECT x.SomewhatLargeString
FROM (
	SELECT TOP 20010 A.SomewhatLargeString, COUNT(*) AS TheCount
	FROM #BigTableOfStrings A
	JOIN #BigTableOfStrings B ON B.SomewhatLargeString <= A.SomewhatLargeString
	GROUP BY A.SomewhatLargeString
	ORDER BY A.SomewhatLargeString ) x
WHERE x.TheCount BETWEEN 20001 AND 20010

Oops, cost skyrocketed to 50516, with a return time of 1:30... Can you feel your users abandoning your sinking ship of an app and heading towards the competitors? Probably not, since they won't actually click through 20,000 rows, but it makes a really good contrived example, so let's roll with it!

So how to solve this problem? In SQL Server 2000, the answer is, find some other paging mechanism, probably using a middle tier. But in SQL Server 2005, we have new and better toys to play with. Allow me to introduce your new paging best friend, the ROW_NUMBER() function. For those readers who are slow on the uptake, this function does exactly what its name implies; it creates a surrogate row number for each row in a result set. So now, instead of the very painfully inefficient COUNT(*) methods, we can let SQL Server do all the work as it builds the result set we actually want...

SELECT x.SomewhatLargeString
FROM (
	SELECT TOP 20010 A.SomewhatLargeString, ROW_NUMBER() OVER(ORDER BY A.SomewhatLargeString) AS TheCount
	FROM #BigTableOfStrings A
	ORDER BY A.SomewhatLargeString) x
WHERE x.TheCount BETWEEN 20000 AND 20010

I guess we can call that a tiny improvement. Total estimated cost: 0.202. That's only about a 25 MILLION PERCENT difference.

So why is ROW_NUMBER so much more efficient? It's a combination of the COUNT(*) method itself being inefficient and the query optimizer probably not handling it as well as it should. The COUNT(*) method requires an ordered-forward clustered index scan of the table, followed by correlation of each of the top 20010 rows that we asked for to all of the rows less than or equal to that row in the same table, for the count. Alas, the optimizer chooses a nested loop for that, which causes the cost to shoot up as the operation is repeated over and over.

The ROW_NUMBER method, on the other hand, requires only the scan of the TOP 20010 rows, followed by computation of the row number itself, which is nothing more than a scalar calculation. Then, just filter the rows. Simple, easy on the optimizer, easy for the system, and good for your customers. Definitely a great feature that I'm looking forward to using!