September 2005 - Posts

Loading legacy data? Use control files...

If your data warehouse or data mart is populated from external legacy files, be sure to request a control file too.  By control file, I mean a file that contains the name of files they plan on sending and their associated row counts.

FTP processes can be interrupted.  Files get corrupted.  Using a control file adds an extra layer of data integrity and protection, as you can build in logic to ensure that what you have received is what you should have received.  If the row counts don't match, you can short-circuit your load process rather than load incomplete or partial result sets.

Ever try to back out bad or partial data from a big data warehouse?  It isn't fun.  Use a control file to prevent the partial or bad data from entering your tables in the first place.

The process I use, at a general level, is as follows:

1. I create a DTS package that uses an ActiveX script to check for the arrival of a "new" control file.  If the file isn't there, the ActiveX script returns an error.  This package is scheduled in a SQL Server Agent job step, which will "on failure", keep re-trying until the control file arrives. I have the retry interval set to every minute - because the file arrival times vary - and I wan't to process files as soon as possible.

2. Once the file arrives, the SQL Server Agent job kicks off the consecutive packages. These packages load the individual files into the staging database (including the control file).

3. After the data is loaded, I then use a  stored procedure to check the row counts of the SQL Server staging tables versus the control file.  If there are issues, I raise an error and none of the data is processed.  If no issues, however,  I continue on with loading the dimension and fact table data.