posted on Monday, March 21, 2005 6:37 PM
by
Knight_Reign
Normalizing flat files
A common question asked on newsgroups is how to normalize a flat file with DTS.
With DTS the answer was usually, you load the file into a table and then perform queries on the table to extract the colums into normalized tables. For large flat files this can be a real pain because you must use a different set of tools to perform a simple operation and it may be unnecessarily time consuming because you have to load the data before you can even begin to transform it. Yes, ELT. Finally, you would need to use one process to extract the data and another to transform it, decentralizing the process.
It would be great if it were possible to normalize the data as it flows through the pipeline for a one time load sequence. With SSIS, you can transform the data as you load it, ETL.
For example, maybe a flat file contains header/detail information in the same row. Likely, since the flat file isn't normalized there's some redundant header data. One way to do this in Integration Services is by splitting the flow from a flat file source adapter. For the simple case, where you only need to split the flat file into two tables, use the multi-cast to split the flow into two. On the one flow, the one to extract header information, use the sort t\ransform to remove duplicates. Then, use a lookup that references the header destination table. If the header data (row) is already there, based on the key, then you'll want to discard that row. Here's a case where the error flow comes in handy. For the lookup, flow the rows not found (errors) into the header destination table. Then, only those rows that aren't already in the table will get inserted.
On the other flow, the details flow, you can just flow this directly to the details table. No need to remove duplicates etc. if the details are unique transaction line items etc.and there shouldn't be any redundency. If there is, you can build a flow similar to the one for header information to eliminate dups.
If there is information that needs to be separated into more than 2 tables similar flows can be created along the same lines.
The point here is that it can all be done in the pipe and within one process keeping it centralized and easier to debug and maintain. In some cases, it may also perform better, especially in cases where the header row count is low or there is a low threshold of deviance from a small set of header data.
Update: Just wanted to add one last thing. ETL scenarios all have one thing in common, they're all different. So the approach to them will be different as well. Ultimately, how you approach your solution is going to depend largely on the requirements of the particular project. This post is not about establishing _the_ correct SSIS methodology, but more to explain the expanded possibilities that are available with SSIS vs. DTS.
Universe.Earth.Software.Microsoft.SQLServer.SSIS.KirkHaselden