October 2004 - Posts

SQL Server Data Mining website live!

Thanks to the SQL Server Data Mining development team, it's live the www.sqlserverdatamining.com web site. You can browse DM models, read tutorials and download samples. Let's go and visit the site, there's no reason to describe it here.

One suggestion to the SQLDM development team: please give us the Movie db with your DM models, so we could play with it and discuss on the newsgroup with the same (real) data at our fingertips!

Direct MOLAP insert vs. proactive cache

While I'm testing Analysis Services 2005, I'm starting to think about when and how to use the proactive cache feature vs. using the DTS to insert data straight into a MOLAP partition (using the Dimension and Partition Processing Data Flow items). In both cases the target is to reduce the latency between a data change and the cube update.

My first attempts was thwarted by the user interface of BI studio: I was trying to change the cube storage settings from the cube property ProactiveCaching, but it has no effects on already created partitions. So remember: you have to change the aggregation design (and proactive cache mode) for each single partition of your cube (you have already one for each measure group). I hope that Microsoft will change this behaviour a little bit because is counterintuitive: if you have just created a cube and want to modify the storage settings, probably you want to do it for any partition of the cube; I'd expect at least a warning and a UI feature to change the storage settings for all cube's partitions.

When I solved this problem (thanks to an answer in the newsgroup after I sent a repro case) I tried to implement a cube without a back-end data mart. I made tests with Northwind, creating a DTS package (or how it's named now.. SSIS package?) that push dimension members and fact rows directly into the cube. You have to create a Data Source View (with corresponding SQL tables) anyway, but SQL tables are never populated and data are loaded directly into the MOLAP partition. You can load data both in full and incremental mode, and probably the incremental mode is very interesting.

My hypothesis on how to use these two modes are the following.

Direct data load in partition with DTS

  • Simple cube, large amount of data, no need of relational star schema, no need of complex transformations from the raw data
  • Incremental update with small amount of new data and a very high frequency of update

Proactive cache

  • Complex cube structure, need for high availability
  • Unscheduled and/or unexpected updates of data mart data (may be due to corrections or maintenance batch)

I'm not a strong believer of the proactive cache usefulness in a traditional data warehouse scenario: the number of transformations necessary to bring the data on the multidimensional structure is very high (and sometimes they are very complex) that is practically impossible to update the data warehouse in a continuous manner. Of course, there are a lot of other scenarios where proactive cache is the optimal solution: I think its primary use will be to get the best performance from the UDM (Unified Dimensional Model) that will become the primary source of data for any kind of report (starting from Reporting Services).

The direct data load with DTS is probably interesting to build a "frequent update" scenario like this: imagine a big and complex data warehouse with some time-critical cubes (sales, orders, stock availability). While the nightly update could be very complex due to many dimension processing, may be that for any transaction in the LOB applications you can build a simple transformation that can update the cube at least for the most important dimensions. I imagine you can easily build a queue of "events" to be loaded into the cube, resolving only the more important dimensions (like customer, product and time for a sale) and leaving a "dummy" member to other dimensions: a DTS could extract these events from the queue and incrementally update the MOLAP partition, with a very low resource-consumption from all the point of views. Remember, a ROLAP query (using the proactive cache solution) could be resource-intensive and managing many different partitions to reduce this effect could be harder and expensive, not mentioning the need to update the data mart incrementally with data coming from the LOB applications.

Do you agree with my thoughts or not? Have I missed something? Do you see other scenarios? Comments are welcome!

When marketing is evil

You know, I'd like to talk about technical issues more than anything else. But who works with BI and MS technologies know that MS offers you a good server (AS2000, that will be great with AS2005) but a poor client experience: if you put toghether Excel Pivot Table and OLAP Add-Ins, Data Analyzer, Office Web Components... well, you still doesn't have a solution who can take advantage of all the features of AS2000. Sure, you can write some VBA macro in your Excel sheet or you can still encapsulate your favourite OWC control inside your Win/Web form. But why lose your time writing something like this? Let's shop a client!

And here starts your pain: you have saved tousand of dollars on your server just to spend them for client platform. And what client? Often many bugs, not a great support (remember, I work mainly in Italy, Europe, the other side of the Ocean). I tried several products, more or less with the same troubles. Not that these products are so bad, but... when you dislike MS products... remember what a range of services (documentation, help, knowledge base and, why not?, newsgroups and community) you can use to find solutions to real problems.

Ok, I'll be ok with that if I but cheap software (or shareware), but as I said before... you are getting to spend a lot of money for these products. And, guess what? No demo. No evaluation version. No anything. You can, of course, call a salesperson.... who will say that you can but at least a 10 clients license (even if you have only 2 users). I ever thought that these guys has a wrong marketing. Today I had the demonstration. Today I received this e-mail (I only cleared references to companies and persons for obvious reasons).

Dear Marco,
We have had contact about YYYYYYYYYY and Microsoft based OLAP solutions.
You mentioned that your main interest is in consultancy and training. You could work with XXXXXXXXXXXXX to offer YYYYYYYYYY  to your customers.
I know that you are a speaker at events and provide training/consultancy. Can you introduce XXXXXXXXXXX to your customers?
Best regards,

Now: I'm not interested in a single penny of commission. My reputation worths much more than that. But how can I suggest a product to a single customer (or to any student or to any attendant of a conference) if I can't test it before? Of course, I already tried to make this clear in a previous request and, after some month of silence, today I receive this (spectacular?) answer. May be this product is wonderful, but I can't base my opinion on a data sheet.

Microsoft: it's time for a serious BI client, we can't wait for your partners to get advantage of your servers...

Analysis Services 2005: many-to-many dimension, killer feature!

SQL Server 2005 has a new version of Analysis Services with a lot of new features, so much that a whole (thick) book would be necessary to describe everything. Nevertheless, many improvements are "marginal", in the sense that bring us better productivity (= less development time) or better performance. Aside administrative and development features, there are not so (very) much end user features so important for him to convince to jump into the new release (even without changing every client...): Analysis Services 2000 is a great product and if pushed to the limits can resolve a lot of business problems, you only have to know the product very well and have to be ready to do some stunt.

Anyway, Analysis Services 2005 (AS2005 for shorter) has some feature that is revolutionary. One of those is the many-to-many dimension.

This is the business scenario: you have a fact table that describe a fact measure (the account balance at a certain date) for an entity (for a bank account) which can be joined to many members of another dimension (the many owners of a bank account). Who knows the multidimensional model already see the trouble, beacuse it's not easy to describe the non-aggregability of measures joined to dimensions with a many-to-many relationship (in this case, each bank account can have one or more owners and each owner can have one or more accounts).

With AS2005 the trouble, simply, disappears. The "trick" is to use an intermediate fact table that, in the relational model, indeed defines the relationship many-to-many. In the following figure fact tables are yellow and dimension tables are blue. Note that DimCustomerAccount is considered a fact table.

When you define relationships between dimensions and measure groups (a measure group is similar to a real cube of AS2000, while a AS2005 cube is more like a virtual cube of AS2000), you specify that the Customer dimension is joined to Balance with the dimension DimCustomerAccount (it's the selected item in the following figure).

The relationship is further described by the following dialog box that you obtain with a click on the button contained in the selected item of the previous image. This dialog box is available for every combination between dimensions and measure groups and it defines the type of relationship.

Here is the result. In test tables I created 5 customers (Luca, Marco, Paolo, Roberto e Silvano) and 9 accounts (numbered from 1 to 9). Each account is joined to one or more customers and the balance for each account is always 100.

As you can see, for each customer you can identify accounts he owns and for each account you can see the balance repeated for each owner.... but the total for each account is always 100 (Grand Total row) and the balance for all accounts is 900 (100 * 9). Try to do that with any other multidimensional model (and I'm not talking only about AS2000) and see what happens... The following image synthetize the non-aggregability of some measure in respect of some dimensions.

You can obtain the same result with AS2000 but only with some stunt and some tradeoff in terms of processing time or query performance. Who, like me, already experienced similar issues, can't wait to upgrade to SQL 2005 in a production environment just only for this feature...

Short presentation about myself

Hi! I'm Marco Russo and I'm a consultant (and also trainer, speaker, writer...) based in Italy. I already have an italian blog (http://blogs.devleap.com/marco) but this one is in english and it's completely dedicated to the world of Business Intelligence in SQL Server.

I've been working on BI since 1998 and with SQL Server since 7.0 version. I'm also a beta tester of SQL 2005 and with Beta 2 we can start to talk publicly about it.

I only hope that my english will be understandable for you. Enjoy!