Alright, so I tested the GUID usage a bit more. In my previous post you can see I already tested the inserts for 100,000 rows using the 3 different meathods. The first meathod (I refer to this meathod as Test1 throughout) used the NEWID() as a default on the uniqueidentifier clustered primary key. The second meathod was to use a composite key made up of 2 integer columns, one being an identitiy and one forced with a server id (this meathod is refered to as Test2 throughout), these columns make up this tables clustered primary key. And the third meathod uses a function to create a sequential uniqueidentifier, using a combination of the NEWID() function and a converted binary(6) datetime value (refered to as Test3 throughout) to make up the clustered primary key.
These tests did the following, first truncated the tables and inserted 4,000,000 rows into test table, Test1, Test2 and Test3. Next, I tested single row selects for each table. For the next test, I selected 70,000 rows from each table. For the 70,000 row selects I did a ranged select, first I cleared the proc cache, the buffer cache, did a checkpoint, and then ran the select statement. I did this 10 times for each query and took the averages to report here.
INSERTS
Inserts for Test1 took 79 seconds to insert 100,000 rows and 45,191 seconds to insert 4,000,000 rows. The inserts for Test2 it took 33 seconds for the 100,000 rows and took 1630 seconds for the 4,000,000 rows. For Test3 the inserts took 33 seconds for the 100,000 inserted rows and 2025 seconds for the 4,000,000 rows.
|
Test 1 - Uniqueidentifier NEWID() |
Test 2 - Composite Key |
Test 3 - Uniqueidentifier getdate() |
| Insert - 100000 |
79 sec |
33 sec |
33 sec |
| Insert - 4000000 |
45191 sec |
1630 sec |
2025 sec |
It took much longer for test1 because to insert because the GUID values are random values and because the default fill factor for inedxes is 80%, which wasn't changed for this test, there were a great number of page splits. Because the inserts for Test2 and Test3 are sequential the fill factor for those indexes could be set to 100% for better performance on inserts and selects because fewer data pages would be used or returned. I am not quite sure why the inserts took longer for Test3 than they did for Test2 though maybe because of the function overhead (but not sure that should equate to around 400 seconds).
SELECTS
The CPU Time for all single row selects took 0ms. However, the average CPU and Elapsed times for the 70,000 row ranged select varied for each test. On average Test1 performed better on CPU time than the other two tests but Test2 faired better on logical reads and elapsed time than the other 2 tests, this was because the size of a Test2 row is 9 bytes while the size of the Test1 and Test3 rows is 17 bytes. Test1 caused a tremendous amount of page splits and had a scan density around 12% when I ran a DBCC showcontig after the inserts completed. The Test2 table had a scan density around 98% and the Test3 table had a scan density around 86% after inserts (I should have rebuilt the indexes after the data load, but I didn't :)). All of the CPU times were close enough in range to be circumstantial, but the elapsed time seemed to very between all of them. I think the variance is because of the nature of the data being returned to Query Analyzer for a more accurate test of elapsed time (I should have used osql, but I didn't :)). The tests seem to prove that the binary comparison of the GUID performs quite well.
|
Test 1 - Uniqueidentifier NEWID() |
Test 2 - Composite Key |
Test 3 - Uniqueidentifier getdate() |
Select Range 70000 - CPU TIME (AVERAGE) |
43 ms |
50 ms |
48 ms |
Select Range 70000 - Elapsed TIME (AVERAGE) |
1987 ms |
712 ms |
1175 ms |
Select Range 70000 - Logical Reads |
331 |
159 |
439 |
Select Range 70000 - Physical Reads |
2 |
2 |
3 |
Select Range 70000 - Read Aheads |
331 |
159 |
440 |
Select 1 Row - CPU Time |
0 ms |
0 ms |
0 ms |
Select 1 Row - Logical Reads |
3 |
3 |
3 |
Select 1 Row - Physical Reads |
3 |
3 |
3 |
Conclusion
It would seem to be as a result of the tests that the uniqueidentifier datatype performs about the same as an integer when filtering the data through the where clause. There are only a couple of downsides I can see in using the uniqueidentifier. First, and probably the most important, the size of the column (which can be an issue). It adds overhead to a system to have to return more data pages than possibly necessary (i/o's are the most expensive of system costs). The large size could in turn cause more page splits during inserts depending on whether the data is static or not (but so can adding a non-necessary column to a table). Second, with the use of the NEWID() function inserts will be random and therefore will by nature cause page splits where using an integer identity column will place the records sequentially into the table. "Ah...", you say, "...but the use of the function that parses the NEWID() and adds the converted getdate value to the end will do that as well." Yes, it does, and still it remains that I don't often load 4 million rows into a table, and so for me that is not as big a deal (just don't use the NEWID() when doing a mass load into a table). The third issue is having to type that uniqueidentifier value into a query manually (more potential for user error).
So in short, it looks as though if you plan carefully and design wisely, then the use of the uniqueidentifier can perform as well in SELECT statements as an integer. Inserts are a different story and we already discussed that as long as they are sequential values it really shouldn't matter.
I appologize for my initial shock in the usage of the uniqueidentifier column, hoping that everyone who read the initial statement can forgive me (Integers still rule the INSERTs though). :)
Later,
Zach