posted on Wednesday, November 10, 2004 9:27 PM by amachanic

Rowset string concatenation: Which method is best?

Yeah, yeah, yeah, let's get this out of the way right from the start: Don't concatenate rows into delimited strings in SQL Server. Do it client side.

Except if you really have to create delimited strings in SQL Server. In which case you should read on.

There was a little discussion on SQLTeam about the best way to concatenate. I recommended a scalar UDF solution, whereas Rob Volk recommended a solution involving a temp table.

I mentioned my dislike for the temp table solution for a couple of reasons. First of all, it relies on a clustered index for ordering. That will probably work in this example, but is not guaranteed to always work and relying on indexes rather than ORDER BY for ordering is definitely not a habit I want anyone to get into. The clustered index as it was described in Rob's example also has another problem that I didn't even notice until I was writing this entry. But I'll get to that in a moment. The second reason I dislike the temp table is that I felt it would be less efficient than the scalar UDF.

Rob didn't agree about the efficiency. And so I set out to prove him wrong...

We'll use the Authors table in Pubs. I want a comma-delimited list, per state, of the last name of each author who lives there.

First, the scalar UDF:

USE pubs
GO

CREATE FUNCTION dbo.ConcatAuthors(@State CHAR(2))
RETURNS VARCHAR(8000)
AS
BEGIN
	DECLARE @Output VARCHAR(8000)
	SET @Output = ''

	SELECT @Output =	CASE @Output 
				WHEN '' THEN au_lname 
				ELSE @Output + ', ' + au_lname 
				END
	FROM Authors
	WHERE State = @State
	ORDER BY au_lname

	RETURN @Output
END
GO

To find the list I want:

SELECT DISTINCT State, dbo.ConcatAuthors(State)
FROM Authors
ORDER BY State

... And the adaptation of Rob's temp table method... I did change two things due to problems I discovered during testing. One, I've altered the au_lname column to VARCHAR(8000); the column in the Authors table is VARCHAR(40), not large enough for all of the California authors. What if we were dealing with a much larger dataset? Second, I added an IDENTITY column, and I'm clustering on that instead of the actual data to get the ordering. I'm doing so because of the VARCHAR(8000). Index rows can be a maximum of 900 bytes, so if we had enough data to exceed that length, this method would fail.

CREATE TABLE #AuthorConcat
(
	State CHAR(2) NOT NULL,
	au_lname VARCHAR(8000) NOT NULL,
	Ident INT IDENTITY(1,1) NOT NULL PRIMARY KEY
)

INSERT #AuthorConcat 
(
	State,
	au_lname
)
SELECT
	State, 
	au_lname
FROM Authors
ORDER BY 
	State, 
	au_lname

DECLARE @Authors VARCHAR(8000)
SET @Authors = ''
DECLARE @State CHAR(2)
SET @State = ''

UPDATE #AuthorConcat
SET @Authors = au_lname =	CASE 
				WHEN @State = State THEN @Authors + ', ' + au_lname 
				ELSE au_lname END,
	@State = State

SELECT State, MAX(au_lname) 
FROM #AuthorConcat
GROUP BY State

Clever, but more complex and harder to read than the scalar UDF version. Output is identical, but that's not why we're here. Which one is more efficient?

Drumroll, please...

Results were tabulated using STATISTICS IO, STATISTICS TIME, and Query Analyzer's Show Execution Plan. DBCC DROPCLEANBUFFERS and DBCC FREEPROCCACHE were run before each test.

Scalar UDF Method
Total cost: 0.0492
Total Scan count: 1
Total Logical reads: 2
Total Physical reads: 2
Total time: 25 ms

Temp Table Method
Total cost: 0.2131
Total Scan count: 4
Total Logical reads: 9
Total Physical reads: 2
Total time: 88 ms

So in conclusion, neither method is incredibly taxing with the tiny Pubs dataset, but I think I have proven that the UDF is far more efficient.


Update, February 28, 2005: Modified the adapation of Rob Volk's method to use a CREATE TABLE instead of SELECT INTO, as the latter is not necessarily guaranteed to insert rows in the right order for the sake of this example. Thanks to "PW" on SQLServerCentral for pointing this problem out. Note that this changed the total costs very slightly -- for the better -- but the UDF still performs better by quite a large margin.

Comments

# re: Build a Comma Delemited String For All Rows @ Friday, January 07, 2005 1:22 PM

amachanic

# re: Rowset string concatenation: Which method is best? @ Tuesday, June 21, 2005 12:09 AM

This is really a good Article
UDF approach is always better

Madhivanan

Madhivanan

# re: Rowset string concatenation: Which method is best? @ Thursday, June 30, 2005 12:12 AM

This one is really good. I just got rid of a 16 hour headache because of this post. I used the technique in a query which concatenated results from one table and inserted them into another like this:

/* Declare temporary tables */
DECLARE @sourceTbl TABLE
(
ID int,
Word varchar(50)
)
DECLARE @destinationTbl TABLE
(
ID int,
Words varchar(200)
)

/*
* I fill the sourceTable with data from
* elsewhere. Result:
* ID Word
*---------------
* 1 word1
* 1 word2
* 1 word3
* 2 word4
* 2 word5
* 2 word6
* 3 word7
* 3 word8
* 3 word9
* 4 word10
*/

DECLARE @LastID int
DECLARE @i int
DECLARE @tempWords VARCHAR(200)

/* I need LastID to know when to stop iterating */
SET @LastID =
(
SELECT TOP 1 ID
FROM sourceTbl
ORDER BY ID DESC
)
SET @i = 1
WHILE @i <= @LastID
BEGIN
SET @tempWords = ''
SELECT @tempWords =
CASE @tempWords
WHEN '' THEN Word
ELSE @tempWords + ', ' + Word
END
FROM @sourceTbl
WHERE ID = @i
INSERT INTO @destinationTbl (ID, Words)
VALUES (@i, @tempWords)
SET @i = @i + 1
END

SELECT *
FROM destinationTbl

/*
* Result:
* ID Word
*---------------
* 1 word1, word2, word3
* 2 word4, word5, word6
* 3 word7, word8, word9
* 4 word10
*/

Lorf

# re: Rowset string concatenation: Which method is best? @ Monday, October 17, 2005 11:14 AM

I used the scalar udf method described. My query which ran quickly, now takes 4:30 minutes to process while in the query analyzer! When I run it as part of an export it doesn't take as long, but still takes a couple of minutes. Why do you think that is? The code does a bunch of counts on student data. It breaks down all the students by gender/ethnic code/etc. I used the concatenation for the course title as we have many combined classes. (For example Latin III and Latin IV being taught in the same classroom/by same teacher because there are not enough students to make 2 classes). I had to use 3 fields to make the function.
Create Function dbo.ConcatTchSched (@BegPeriod varchar(2), @EndPeriod varchar(2), @Tchname varchar(15))
Any ideas to make this more efficient??
Thanks,
J

jmj

# re: Rowset string concatenation: Which method is best? @ Wednesday, November 09, 2005 6:19 PM

This is a GREAT article! I agree, it's better to concat the client side, but there are some cases where that's just not possible. This article is GREAT!! Thanks go to the author!

bfrench

# re: Rowset string concatenation: Which method is best? @ Friday, December 02, 2005 6:20 PM

When using UDFs, one must be aware that the reported utilization of IO and TIME via set statistics do not include the UDF resources used. This is very easy to see using the Northwind DB (I do not have pubs)

CREATE FUNCTION dbo.Concatenate( @CategoryID integer )
RETURNS VARCHAR(8000)
AS
BEGIN
DECLARE @Output VARCHAR(8000)
SET @Output = ''
SELECT @Output = CASE @Output
WHEN '' THEN ProductName
ELSE @Output + ', ' + ProductName
END
FROM dbo.Products
WHERE CategoryID = @CategoryID
ORDER BY ProductName

RETURN @Output
END
GO
set statistics io on
go
SELECT CategoryID, CategoryName , dbo.Concatenate( CategoryID )
FROM dbo.Categories

The output is:
Table 'Categories'. Scan count 2, logical reads 2, physical reads 0, read-ahead reads 0.

WHERE ARE THE READS ON THE PRODUCT TABLE ?

In order to capture accurate resource usage, a trace must be run that includes the event "SQL:Statement Completed" This shows that 326 Reads are performed.

For the temporary table solution, there are 385 reads.

Of course, why the formatting cannot be done by the client is my pet-peave.

This takes 161 Reads:

SELECT Categories.CategoryID, Categories.CategoryName
, PRoducts.ProductName
FROM dbo.Categories
JOIN Products
on Categories.CategoryID = Products.CategoryID

In conclusion: if you are going to do this type of pivoting on the server, realize that this will require a 240% increase in server resources.

Anyone for 64-bit SQL Server ? ;-)

Carl Federl

# re: Rowset string concatenation: Which method is best? @ Friday, December 09, 2005 3:51 PM

Regarding the comment about why a query using this approach ran fast but now it takes 4m30s, be aware that for every row that is returned the UDF is run, and if the udf in turn selects from a table the UDF then runs its query for every row selected in the base table.

So if you select 100 rows, with the UDF approach above you are actually running 100 queries. And with 1000 rows, you are running 1000 queries. So this approach is fine if you limit it to small result sets, but for larger result sets it can be a performance problem.

Also, there was a bug in the Query Analyzer where it didn't return the correct reads for functions. I haven't checked if this is fixed. Profiler does return accurate reads, so use it for tuning UDFs.

hth,

Steve Moss

# re: Rowset string concatenation: Which method is best? @ Thursday, December 15, 2005 11:05 PM

This is really nice, but is there a way to get the 'authors' into columns instead of csv?

I have a view result that repeats 1,2,3,4,4,4,5 on each record and I need to transpose it so that I get columns (V1,V2,V3,V4_0,V4_1,V4_2,V5,ID). This table tracks value's 1 to 5 assigned to ID, so ID will be common for all 7 values.

I tried this method but it only worked in query analyzer. I tried to do it in a new view, but it timed out....even when changing it from Top 100 PERCENT to Top 3. Not sure why this happened either.

Thanks

TestEngineer

# re: Rowset string concatenation: Which method is best? @ Wednesday, February 08, 2006 9:56 AM

Well...i used the UDF method...the table has about 400.000 rows. It is a part of a larger procedure.

I ran only the part in witch the UDF should pivot the table. I ran it for 20 minutes and then quit.

Hm....its because of the size of the table or because i messed it up somewhere?

Cristian B

# re: Rowset string concatenation: Which method is best? @ Thursday, March 16, 2006 11:50 PM

Steve Moss wrote: "So this approach is fine if you limit it to small result sets, but for larger result sets it can be a performance problem."

I agree but only if you do nothing about it. See below for how to tune this bad boy.

Cristian B wrote: "I ran it for 20 minutes and then quit".

Yep, me too... until I tuned it and then it did it all in 5 seconds flat on a million rows for 50,000 CustID's.

First, here's the function I did the experiment with...

--===== Create a function to produce a CSV column from a set of data
-- identified by the input parameter @UserID
CREATE FUNCTION dbo.fTestCSV(@UserID INT)
RETURNS VARCHAR(8000)
AS
BEGIN
--===== Declare a working variable
DECLARE @CSV VARCHAR(8000)

--===== Concatenate the "SomeValue" column into a single CSV value
-- for a given user ID using a set based "loop"
SELECT @CSV = COALESCE(@CSV+',','')+CAST(SomeValue AS VARCHAR(10))
FROM dbo.CSVTest WITH (NOLOCK)
WHERE UserID = @UserID
RETURN @CSV
END

...here's the data I manufactured to have a nice large table to test against. Do notice the non-clustered index... That's the key to making this code run VERY fast..

--===== Create a populate a million row test table to demonstrate the
-- the power of SQL as compared to an "application". A "real life"
-- example would be much wider but this will suffice for test data
-- and is easy to write a comparison test for in an "application".
SELECT TOP 1000000
IDENTITY(INT,1,1) AS RowNum,
CAST(RAND(CAST(NEWID() AS VARBINARY))*50000+1 AS INT) AS UserID,
CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65))
+ CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65)) AS SomeValue,
'A column for kicks' AS Kicks,
'Still another column just for proofing' AS StillAnother
INTO dbo.CSVTest
FROM Master.dbo.SysColumns sc1,
Master.dbo.SysColumns sc2

--===== A table is not properly formed unless a Primary Key has been assigned
ALTER TABLE dbo.CSVTest
ADD PRIMARY KEY CLUSTERED (RowNum)

--===== ...and, we'll do a little code optimization by adding an index to
-- support the upcoming query.
-- THIS INDEX IS CRITICAL!!! WITHOUT IT, THE CODE TAKES MORE THAN
-- 22 MINUTES TO RUN!!! (I gave up and stopped the run
-- WITH IT, THE CODE ONLY TAKES 5 SECONDS TO RUN!!!
CREATE NONCLUSTERED INDEX CSVTest_UserID_SomeValue
ON dbo.CSVTest (UserID,SomeValue)

Last but not least, here's how I tested the code... I made my own "STATISTICS" reader just because...

--===== Ensure that the data is NOT cached so we can get a
-- true measure of performance of the code being tested.
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE

--===== Declare a handfull of variables to measure duration
DECLARE @ServerDurationStart DATETIME
DECLARE @ServerDurationEnd DATETIME
DECLARE @CpuUsageMSStart INT
DECLARE @CpuUsageMSEnd INT
DECLARE @DiskReadWriteStart INT
DECLARE @DiskReadWriteEnd INT
DECLARE @RowCount INT

--===== Capture the condition of the resources used so far
-- as a starting point.
SELECT @ServerDurationStart = GETDATE(),
@CpuUsageMSStart = SUM(CPU),
@DiskReadWriteStart = SUM(Physical_IO)
FROM Master.dbo.SysProcesses
WHERE SPID = @@SPID

--===== Run the code being tested
SELECT UserID,dbo.fTestCSV(UserID)
FROM dbo.CSVTest
GROUP BY UserID

--===== Capture the end condition of the resources used so far
-- as an ending point.
SELECT @ServerDurationEnd = GETDATE(),
@CpuUsageMSEnd = SUM(CPU),
@DiskReadWriteEnd = SUM(Physical_IO)
FROM Master.dbo.SysProcesses
WHERE SPID = @@SPID

-- Capture the rowcount
SELECT @RowCount = @@ROWCOUNT

--===== Print the performance report
PRINT STR(DATEDIFF(ms,@ServerDurationStart,@ServerDurationEnd)) + ' :Total server duration (ms)'
PRINT STR(@CpuUsageMSEnd-@CpuUsageMSStart) + ' :CPU Usage (ms)'
PRINT STR(@DiskReadWriteEnd-@DiskReadWriteStart) + ' :Total Disk Reads/Writes'

Again, the key to to properly tune code this agressive. As Steve Moss pointed out, it run's the UDF once for each row in the return... in the code above, it runs it 50,000 times. If you don't tune code like this, you WILL run into performance problems. And, the same goes for too many or the wrong type of indexes... ya just gotta do it right.

Sure hope this helps answer some of the question on this method and UDF.

--Jeff Moden

Jeff Moden

# re: Rowset string concatenation: Which method is best? @ Tuesday, April 11, 2006 7:53 AM

I think string concatenation can be done using a simpler approach..

declare @strConcat nvarchar(4000)
set @strConcat = ''
select @strConcat = @strConcat + ItemCode + ' ' + CONVERT(CHAR(10),DueDate,110) + ' ' + cast(UnPaid as varchar(200)) + ',' from Ledger
set @strConcat = Left(@strConcat,len(@strConcat)-1)

will concatenate all the records of one table and return a string..

Sweta Jha

# re: Rowset string concatenation: Which method is best? @ Tuesday, April 18, 2006 12:23 AM

Uh huh... That's what the main crux of the fuction in my code had...
--===== Concatenate the "SomeValue" column into a single CSV value
-- for a given user ID using a set based "loop"
SELECT @CSV = COALESCE(@CSV+',','')+CAST(SomeValue AS VARCHAR(10))
FROM dbo.CSVTest WITH (NOLOCK)
WHERE UserID = @UserID
RETURN @CSV
END

You may have missed it because of all the performance measuring code I built in to prove a point.

--Jeff Moden

# re: Rowset string concatenation: Which method is best? @ Wednesday, June 28, 2006 11:31 AM

A Great example and a Great help.
Thanks!
Eric

Eric

# re: Rowset string concatenation: Which method is best? @ Wednesday, July 12, 2006 9:14 AM

The concatenation methods described in this post may not be as safe as one would expect.

I've mentioned this in a post in my blog:
http://milambda.blogspot.com/2005/07/return-related-values-as-array.html

SQL 2005 adds another possibility using the FOR XML clause:
http://milambda.blogspot.com/2006/07/aggregate-concatenation-in-sql-2005.html

The latter is based on a post by Tony Rogerson:
http://sqlblogcasts.com/blogs/tonyrogerson/archive/2006/07/06/871.aspx


ML

ML

# Rowset string concatenation: Which method is best? @ Thursday, July 13, 2006 12:16 AM

Originally posted here.
Yeah, yeah, yeah, let's get this out of the way right from the start:
Don't...

Anonymous

# Bitmask Handling, part 2: Bitmask reconstitution @ Thursday, July 13, 2006 12:27 AM

Originally posted here.

Posting the first part
of my series on bitmasks (yes, this is now officially...

Anonymous

# Bitmask Handling, part 2: Bitmask reconstitution @ Monday, January 08, 2007 2:35 PM

Originally posted here . Posting the first part of my series on bitmasks (yes, this is now officially

Anonymous

# Rowset string concatenation: Which method is best? @ Monday, January 08, 2007 2:38 PM

Originally posted here . Yeah, yeah, yeah, let's get this out of the way right from the start: Don't

Anonymous