Alberto Ferrari

SQL Server 2005, BI, SSIS and SSAS.

<September 2008>
SuMoTuWeThFrSa
31123456
78910111213
14151617181920
21222324252627
2829301234
567891011


Navigation

Images

Subscriptions

Post Categories



Bulk Insert of sorted data

In a previous post I showed how to make bulk insert faster using SSIS. During Sql Server Conference 2007 I had the great opportunity to double check it with Stefano Stefani that analyzed the problem in greater detail and finally came up with both an explanation of the problem and a simple workaround when facing the same problem with T-SQL (my solution is still good with SSIS but Stefano’s one is very effective with T-SQL).

The problem is related to bad performances when inserting rows into a table that has a clustered index (typically a primary key) based on an INT IDENTITY field and is easily shown in this example:
Create Table Destination (
	ID_Test Int Identity (1,1) Primary Key,
	Cod_Test VarChar (20),
	Test VarChar (100))
Go

Create Table RowsToInsert (
	Cod_Test VarChar (20),
	Test VarChar (100))
Go
Now, suppose you have 1.000.000 of rows in RowsToInsert, if you analyze the execution plan of this simple insert:
INSERT 
    Destination
SELECT 
    Cod_Test, 
    Test
FROM
    RowsToInsert
OPTION (RECOMPILE)

You will get this:

As you can see, the data is sorted before being inserted into the table even if the IDENTITY specification of the primary key guarantees to SQL Server that the data is already sorted. Of course, sorting one million rows takes a long time.

The problem is that SQL Server can insert rows in parallel mode into the table, thus there is no guarantee that the IDENTITY fields will be generated sequentially. This leads to very poor performances when inserting a huge amount of data, as is normally the case in DWH tasks.

A very effective workaround, when facing the problem in a T-SQL task, is that of sorting the data based on a dummy column, as in this example:

INSERT 
    Destination
SELECT 
    Cod_Test, 
    Test
FROM
    (SELECT 1 A, * FROM RowsToInsert) Q
ORDER BY A
OPTION (RECOMPILE)

The execution plan is now:

The sort task disappeared, as now SQL Server knows that the IDENTITY fields will be generated sequentially because the input is sorted.

The final consideration is that you can have a dramatic speed gain in huge INSERT operations with SQL Server into clustered indexed tables if you provide SORTED SELECTS to the INSERT operation. When the clustered index is based on an INT IDENTITY field you can get the same speed by sorting on a dummy field, instructing SQL Server that the IDENTITY columns will be computed sequentially and there is no need to sort the data.

posted on Monday, July 02, 2007 3:53 PM by AlbertoFerrari


# The clustered index, the bulk insert and the sort operation @ Monday, July 02, 2007 5:31 PM

A few months ago I wrote about a SSIS setting you have to use just to get better performance when you

Anonymous




Powered by Dot Net Junkies, by Telligent Systems