Data Transormation Services (DTS/SSIS) (RSS)

Data Transformation Services now renamed SQL Server Integration Services

SQL Server 2005 SP2 - CTP available

Here is available the SP2 CTP for SQL Server 2005.

The what's new document explain that there is an improvement in many-to-many dimension relationships calculations. I will try to evaluate the performance improvement, but unfortunately I'm in the middle of a busy timeframe (next week I will be in Seattle for SQL Pass - if someone want to exchange some opinion face-to-face here, let's contact me directly).

If someone get hands-on experience on many-to-many performance changes, please let me know.

Request for a better lookup component (SSIS dev team: call to action!)

I just answered to a request on SQLBI.EU forum of someone that is looking for a better lookup component, specifically designed for typical Data Warehouse scenarios.

While you can read the whole thread on the forum, I try to resume most wanted requirements, hoping that Katmai (next SQL Server version) developers are already working on that!

  • Default value for lookup failure: simple "dummy value" assignment for lookup that fails, so that we can easily implement fact table processing in non-incremental scenarios
  • Synchronous action for lookup failure: it is absolutely necessary to have the chance to execute a script (or another component, but this seems harder) in a synchronous way against the processed row. The need for synchronization is that if I need to add a dimension member, I need to do that *before* to process the next fact table row, that can reference the same member and I don't want (to try) to create a second surrogate key for the same application key
  • Dynamic update at preloaded lookup table: the previous feature brings the need for a in-memory update of the lookup table initially loaded to get fast performance. This thing has to be done synchronously to the row process and it is necessary only to add elements (no delete/update required).

Having lookup source, it should be not so hard to obtain these features: 80% of functionality are already implemented. Critical point is the need to make operations synchronous to the row process, before to process another row.

Actually I'm doing that with a double lookup: the first try to lookup in memory for existing (at start time) dimension members, the second do the lookup calling a stored procedure that first try to lookup with a SELECT sql statement, and if it doesn't get the value, it INSERTs a new dimension member. That the better way I found until now to implement this pattern in a fast and readable way, but I really dislike to have to add a stored procedure to support my SSIS package (I know that my position could be controversial and it could be an argument for a future post!).

Improve DTS to SSIS package migration with DtsToSsis-Prepare

Two weeks ago I wrote a tool to ease and improve the migration of a DTS package to a SSIS package. I named the tool DtsToSsis-Prepare, it is available with full source code at www.sqlbi.eu and now I wrote a larger article that describes the needs that DtsToSsis-Prepare try to solve.

For people who are too lazy to click a link, I extracted a part of the article in this post.

This is a DTS package before treatement: UDL connections are used to easily change connection parameters.

If we try to migrate this DTS package into an SSIS package, we obtain this result.

It is not very useful to get one SSIS package containing two DTS packages.

Using the DtsToSsis-Prepare tool with all options enabled we get this “prepared” DTS package.

As you can see, UDL connections have been converted to standard connections and have been consolidated into only two physical connections. At this point this DTS package would be less parallelized, but we will never execute this package, we just want it to migrate in a fair way to SSIS. The next picture shows the result of this migration.

We have two Data Flow Tasks corresponding to our original Transform Data Tasks.

I would be glad to receive feedback and comments about this tool: probably there are other features that would be useful to add. There is also a dedicated support forum to submit comments/bug/ideas.

The new and improved SCD handling with SSIS

Thanks to Alberto Ferrari there is a new way to handle Slowly Changing Dimensions (SCD) with Integration Services: he just released Table Difference 1.1 and FlowSync 1.0. Both are downloadable at www.sqlbi.eu.

The problem that those components solve is lack of performance handling large dimensions. Standard SCD component is very slow.

The FlowSync component is also interesting to solve nasty memory problems occurring when you have many asynchronous transformation into a the same Data Flow Task and a lot of data to process.

I'd like that SSIS dev team take a look at those components for the next version of SQL Server.

A tool to improve DTS to SSIS migration

I wrote a small tool to improve the migration from SSIS to DTS. The problem I had to solve was this: I have many Data Transform Tasks that have a simple behavior, read data from a query and write data on a table, using fast load with batch insert.

You say: why not to use the SSIS Migration Wizard? This is exactly what I want to do, but unfortunately I used UDL files into my package and the migration wizard do not migrate them and leave embedded a legacy DTS package for each Data Transform Task.

My idea was to modify the DTS changing properties where needed, and my tool (DtsToSsisPrepare) generate a new DTS file that can be better migrated to SSIS.

This is the command line help:

Dts to SSIS Prepare 1.0.0.0
Syntax: DtsToSsisPrepare.exe [switch:value]
  /ConsolidateUdl
     Consolidate UDL connections into regular connections
  /OptimizeConnections
     Optimize connections recycling the same connection for several tasks
  /RemoveAdvancedTransformations
     Remove advanced transformations phases from Transform Data Task
  /ResetMaxErrorCount
     Reset Maximum Error Count property from Transform Data Task
  /DtsSource
     Source DTS file name
  /DtsDestination
     Destination DTS file name
  /help
     show help
  /all
     Activate all conversion features

The best way to use the tool is to specify /all to enable all conversions.
DtsSource and DtsDestination are mandatory parameters.

I wish to write an article explaining the tool behavior in the coming days, in the meantime I'd like to receive feedback on this. Full source code and binaries are downloadable on www.sqlbi.eu. Enjoy!

TableDifference: a faster way to handle SCD

I had the opportunity to test TableDifference (a component written by Alberto Ferrari) since early versions and I can say it is a very interesting way to handle Slowly Changing Dimension (SCD) much faster than with the standard SCD component of SSIS.

While you work with dimensions with millions of members and you don't have a "LastModified" field to filter variations since last dimension processing, the need to compare the whole dimension with an equally wide table force you to implement custom solutions. TableDifference is a general purpose component that helps you to implement fast SCD processing in a more standardized way.

TableDifference is well described in this article and you can download source code and binaries here.

Post hotfix post-SP1 problem with DTS Designer

After I installed build 2153, my SQL Server Managment Studio DTS 2000 Package Designer (which is part of SQL Server 2005 Feature Pack) was not working.

After I searched for an updated feature pack (which has not been released) I discovered that the problem has been described in KB917406. May be there are other factors, because I installed other programs yesterday after the build 2153, anyway I think this post is a good reminder.

SSIS service does not start with SP1: KB filed it as a bug

Some days ago I posted my troubles with SSIS after SP1 install.

Today Microsoft has published KB918644 where this behavior is officially described as a bug. It seems that product feedback is working well...

SSIS service does not start with SP1: workaround

SQL Server 2005 SP1 seems to stop many working installations of SSIS service. I lost some hour on this problem and only a few days after I discovered a good workaround here.

While you install SQL Server 2005, setup procedure brings you to use NETWORK SERVICE as account for SSIS service. SP1 verify signatures of the files getting a file from internet: it doesn't work if the server has no internet connectivity and it doesn't work if the user is NETWORK SERVICE.

Possible workarounds:

  • Change user account for SSIS Service to a regular (least privilege) user
  • Disable checking certificate revocation list for NETWORK SERVICE user

The registry hack is this:

Windows Registry Editor Version 5.00

[HKEY_USERS\S-1-5-20\Software\Microsoft\Windows\CurrentVersion\WinTrust\Trust Providers\Software Publishing]
"State"=dword:00023e00

Now, the big question is: what of those two techniques has to be considered safer? I would say that changing registry for NETWORK SERVICE account doesn't change much and should have less side effects than changing user account on a production machine, but sincerely I don't have a consolidate answer now, so comments are welcome.

Alternative approach for SCD dimensions in SSIS?

Tonight I've studied how to improve performance of SCD transformation task of SSIS. I started from the consideration that in an actual SSIS package, the dimension processing is at least one magnitude slower than fact table processing; most of the time is consumed by SCD transformation.

A simple SQL Profiler session showed me how SCD works processing data: for each processed row, a sp_executesql is called with a query that have business key fields as parameter. While the query is executed very fast on the server, the throughput is not very good because we have the latency of communication between different processes (and may be different servers). Client CPU is consumed more than server CPU, probably due directly or indirectly to the sql query string construction (it's a SELECT field, field, field, ... FROM table WHERE businesskey = @P1".

Now, when you have a SCD you could also have a timestamp column on original data that helps you to extract and process only good changing candidates for dimension processing. But when this is not the case, the whole dimension is reprocessed against the last processed dimension, just to find new or changed attribute. This scenario is very common when data are provided by legacy system that are out of our control (otherwise we would have added the timestamp column months before...). It could be wonderful if SCD transformation would have an optional input flow to provide "lookup" sorted data, working in a way similar to the merge task. The wizard could use the business keys as lookup/join condition and SCD would not have to use sql query anymore just to solve the SCD condition to process. Unfortunately this is not the case, so by now the only option is to make your own SCD processing by hand using basic components. But this topic deserves attention by SSIS development team, because this kind of operations are very very common in a data warehouse environment.

CalendarTransform component for SSIS

I just downloaded and tried the CalendarTransform component for SSIS that Microsoft realesed with a shared source permissive license.

It's a useful data flow transformation component that is able to generate apparently all of the attributes you can desire to place into a date dimension. While I think it's a pretty useful component (and you have source code that can be improved!), I'd like to highlight areas of improvement:

  • Editor can be more user friendly: you have to go into "Input and Output Properties" and define what output columns have to be generated.
  • Add a preview of output, better if provided with sample data: a sample talks more than tousands of words, and in this case you have only one word (the enum member name...).
  • It's not clear how to force locale id from SSIS editor: I correctly get italian names for calendar with my default configuration, but order of items in a complete date string is the american one, very strange... Anyway the biggest problem is that you probably want to enforce a localization identifier independently by execution condition, so for a SSIS user it could be not so intuitive as it is for a veteran .NET programmer.
  • A little legenda for Format property (who remember differences between MMM, MM and M?): actually is very difficult to get appropriate documentation for date format even from .NET