September 2005 - Posts

PASS 2005 - Day 2

Wow is all I can really say.  I only got to show up today for PASS for a couple of hours but all I can say is WOW!  This was my first PASS conference and man was it great.  I got to see all of my old co-workers at Scalability Experts today and that was great, I really enjoyed seeing them.  I saw all of the great booths and new products in the SQL Server market, which was a lot of fun.  Met some new people and hopefully new friends.  I really enjoyed the social side of the gathering.  It was great spending some time with ex-coworkers of mine Randy Dyess, Rick Heiges, Erin Welker, Darshan Singh and Larry Chesnut.  I met Louis Davidson and talked with him for a bit (very nice guy). And I got to stay for almost all of one session. 

I went and saw Bob Dorr's session about the SQL Server 2005 OS Engine.  Man was it great, the content was excellent and there was a lot of talk about how to use the system dmv's that PSS will use for trouble shooting issues, how to get to a sessions last error message, all of the statistics (OS, SQL, thread, etc.) for that thread when the error occured and so on.  Man it was great and I wish that I could have stayed for the whole thing, but it was great while it lasted.  Some of the DMV's he covered were...

  • dm_os_ring_buffer
  • dm_os_sys_info
  • dm_os_hosts
  • dm_os_stacks
  • Dm_os_wait_tasks
  • Dm_os_workers
  • Dm_os_schedulers
  • Dm_os_buffer_descriptors

 SIDE NOTE:  Have I ever told you how much I love OneNote? Dang, I love that app.  It's a great app for taking notes at things like technical sessions.

There were awsome topics beinng covered and it was a lot of fun.  Hope everyone that is there enjoys Friday.

Thanks PASS, you all did a fantastic job!!

-Zach

PASS 2005

There are a lot of great blog posts about the events and things going on at PASS this year.  I get to go today for about half the day which will be exciting (Thanks Tom!!).  I am really looking forward to seeing everyone there. 

SQL Server 2005 - DMV's

This is a topic that I am extremely insterested in.  I learned more about SQL Server internals from Microsoft than any other source by diving into the shipped system tables and stored procedures.

This is a big thing in SQL Server 2005, because, as I'm sure you're aware, we no longer have access to the system tables within SQL Server 2005.  There is a great blog by Randy Dyess that talks about this topic in particular. 

Along the lines of looking into system stored procedures, the sp_helptext stored procedure is replaced with object_definition() function.  The object_definition function is an undocumented function that will return the definition of a given object.

EXAMPLE:
select object_definition(object_id('sys.objects'))

The example above will give you the code behind creating that view, and will more than likely reference resource objects that you can't get to unless you are using the dedicated ADMIN connection.  The 'sys.objects' object is a system view, this object can be replaced with the name of a stored procedure to return the source text of that procedure or the name of a funtion to return the text of a function. 

This will be huge in SQL 2005 to determine how the MS Dev team is using SQL Server to give you the information that you need, to discover how and where SQL is getting and putting its information. I will blog some more on this topic tomorrow as I get into the actual DMV's and what information they are providing.

Good luck in the SQL 2005 hunt,
Zach

P.S. If you are going to PASS, have fun!! Hopefully I'll get a day this week to go.

 

Retrieving Unique Key Column Names

It's easy enough to get the PK and the column names it uses from SQL Server but what about Unique Key column names?  Ever tried it?  A friend of mine asked how you get those column names from SQL since there is no direct table to retrieve those from.  This is how I went about doing it. 

First you have to have a starting point.  For example, for all unique keys give me the column names for each, or another example might be, for a given table give me the column names of it's unique keys (don't mistake a unique key for a primary key, they are different things).  These two examples require different queries.  But I'll show you how I did it using the second example.

select o.name, o.id, o.parent_obj, o.xtype, o2.name, o2.id, i.name, i.indid, k.id, k.colid, c.name
from sysobjects o
inner join sysobjects o2
on o2.id = o.parent_obj
inner join sysindexes i
on o2.id = i.id and o.name = i.name
inner join sysindexkeys k
on o2.id = k.id and i.indid = k.indid
inner join syscolumns c
on c.id = o2.id and k.colid = c.colid
WHERE o2.name = 'table3'

Got Questions? Let me know and I'll post the answers in the comments.

Take it easy,
Zach

CHAR vs VARCHAR

This topic was discussed a bit in a forum post recently that has quite a bit of interest in my opinion.  Since I posted about the use of the GUID recently I thought I would do some some testing on the performance differences and the downfalls of using the CHAR and VARCHAR data types.

First we'll consider the functional differences between the 2 data types.  The CHAR is obviously a static length character datatype while the VARCHAR is a variable length character datatype.  The obviously positive and negative (we’ll cover this later) thing about a VARCHAR, is that you only store the amount of data you need to and no more, well, to get a little deeper you store an additional 2 bytes with every VARCHAR value. 

FUNDAMENTAL CONCEPT

When inserting into a CHAR(100) column regardless of the length of the data going into the column, SQL Server will store 100 bytes for this column.  When inserting into a VARCHAR(100) column depending on the length of the data going into the column, SQL Server will store the number of bytes of data going into this column. 

SELECTS

If you are using the same data to insert into 2 separate tables, one utilizing a char and the other utilizing a varchar, and the length of the data going into those tables are of fixed length, then there will be no performance difference in the retrieval of that data as long as no functions are used as part of the data retrieval.  So in other words if I had a char(100) column in the first table and a varchar(100) column in the same spot on the second table and all of the data that went into those columns were exactly 100 characters in length then both data types would perform equally as well.  If the data that was going into the char and varchar columns was of variable length (values varying from 1-100 characters equally) then the varchar would perform better.  This is not because the varchar data type is more efficient but because there will be fewer data pages to retrieve from disk (I'll get into this some more when I go over the inserts).

INSERTS

Comparing inserts for these data types is where the difference in performance lies because of the storage of data.  If the same 2 tables are used as in the select example and the data that is going into those tables is of variable length then the inserts will perform faster on the table with varchar.  If those same 2 tables are used and the data is of a fixed 100 character length then the inserts using both data types will be equal. 

UPDATES

Now, lets take a future scenario, lets say you have your variable length data in your 2 tables, and you want to do an update.  If your varchar(100) value is only storing 10 characters in a particular row.  And your data pages are full.  And you update it with a value that is 100 characters long, a page split will occur.  This will cause this update to be slower than updating your char column value that is 10 characters in length with a 100 character value.  Why you may ask?  Well because when you insert a 10 character value into a char(100) column, the database will store a value of 100 characters, the original 10 characters plus a padding of 90 spaces on the end.   When you insert a 10 character value into a varchar(100) column, the database will store a value only giving space on the data page for the 10 characters.  After the page fills there is no room for the additional 90 bytes of data so the system splits the page and moves to row to a new data page, placing a pointer to the new page and row. Evidence of this activity can be seen in the test below.

TEST RESULTS

In these tests I cover a couple of different scenarios.  

  1. Inserting 200,000 rows into a table called T1, which contained an integer identity column and a char(100) column
  2. Inserting 200,000 rows into a table called T2, which contained an integer identity column and a varchar(100) column
  3. Selecting all rows from table T1
  4. Selecting all rows from table T2
  5. Selecting a range of 15,000 rows from table T1
  6. Selecting a range of 15,000 rows from table T2
  7. Selecting a single row from table T1
  8. Selecting a single row from table T2
  9. Updating 200,000 rows from table T1
  10. Updating 200,000 rows from table T1

Inserting 200,000 NULL rows into the char and varchar tables T1 and T2 proves the amount of space allocated to store the data for each table.  During these two tests table T1 (char table) required 2817 data pages to store the data.  Table T2 (varchar table) required only 322 data pages to store all of the data.  This was because the T1 table reserved the full 100 character value for the char column regardless of the fact that the data inserted into the char column was a NULL value and therefore required 2817 pages to store the data.  Table T2 did not reserve the 100 characters of space when inserting the NULL value into the table therefore only needing 322 pages to store the data.  The inserts for both tables took 60 seconds for T1 and 61 seconds for T2.  The inserts were run multiple times and the averages were used for reporting.   After the inserts on table T1 the Scan Density and Page Density were 99.44% and 99.09% respectively.  After the inserts on table T2 the Scan Density and Page Density were 95.35% and 99.73% respectively.

Selecting all of the rows from both tables only 174 milliseconds for T1 and 152 milliseconds for T2.  This is because of the drastic difference in the number of pages being returned by both queries with no WHERE clause.  The number of logical reads that required by the system for table T1 were 2819 and the number logical reads required to retrieve all pages for T2 was 323.  When returning a range of 15,000 rows from T1 the query completed in 21 milliseconds and retrieved 214 data pages.  When returning the 15,000 rows from table T2 the query completed in 17 milliseconds and retrieved only 26 data pages.   When returning a single row from T1 the query completed in 0 milliseconds and retrieved 3 data pages.  When returning a single row from table T2 the query completed in 0 milliseconds and retrieved only 2 data pages. 

The performance hit when using a varchar in these test cases can be seen doing updates on these two tables.  When updating all rows in these two tables from a NULL value to a 100 character value the results vary greatly.  It took 1032 milliseconds to update all 200,000 rows in table T1, while taking 1911 milliseconds to update all rows in T2.  After doing a SHOWCONTIG on the two tables after the update you can see the cost of the updates to be page splits caused by the updates against T2.  There are still only 2817 pages for table T1 but now there are 3859 pages for T2 instead of the 322 pages we saw earlier.  There is also still a Scan Density and Page Density of 99.44% and 99.09% for T1 but now there is a 46.04% and 74.25% Scan Density and Page Density for T2.

CONCLUSION

There is no performance difference fundamentally retrieving data from the char and varchar datatypes.  The difference in performance comes from how you will be utilizing the columns. 

Make sure that you plan for the way you will be using the data, be it static or dynamic.

-Zach

XML discussion

Check out Adam Machanic's blog from yesterday about XML opinions.  He has 2 links there pointing to XML discussions that have 2 great view points.

Kent Tegel's blog on his view of XML usage is, of course, a good read as usual.

Enjoy!!

-Zach

Using GUIDs - Selects

Alright, so I tested the GUID usage a bit more.  In my previous post you can see I already tested the inserts for 100,000 rows using the 3 different meathods.  The first meathod (I refer to this meathod as Test1 throughout) used the NEWID() as a default on the uniqueidentifier clustered primary key.  The second meathod was to use a composite key made up of 2 integer columns, one being an identitiy and one forced with a server id (this meathod is refered to as Test2 throughout), these columns make up this tables clustered primary key.  And the third meathod uses a function to create a sequential uniqueidentifier, using a combination of the NEWID() function and a converted binary(6) datetime value (refered to as Test3 throughout) to make up the clustered primary key.

These tests did the following, first truncated the tables and inserted 4,000,000 rows into test table, Test1, Test2 and Test3.  Next, I tested single row selects for each table.  For the next test, I selected 70,000 rows from each table.  For the 70,000 row selects I did a ranged select, first I cleared the proc cache, the buffer cache, did a checkpoint, and then ran the select statement.  I did this 10 times for each query and took the averages to report here. 

INSERTS

Inserts for Test1 took 79 seconds to insert 100,000 rows and 45,191 seconds to insert 4,000,000 rows.  The inserts for Test2 it took 33 seconds for the 100,000 rows and took 1630 seconds for the 4,000,000 rows.  For Test3 the inserts took 33 seconds for the 100,000 inserted rows and 2025 seconds for the 4,000,000 rows.

Test 1 -
Uniqueidentifier
NEWID()
Test 2 -
Composite Key 
Test 3 -
Uniqueidentifier
getdate()
Insert - 100000 79 sec 33 sec 33 sec
Insert - 4000000 45191 sec 1630 sec 2025 sec

It took much longer for test1 because to insert because the GUID values are random values and because the default fill factor for inedxes is 80%, which wasn't changed for this test, there were a great number of page splits. Because the inserts for Test2 and Test3 are sequential the fill factor for those indexes could be set to 100% for better performance on inserts and selects because fewer data pages would be used or returned.  I am not quite sure why the inserts took longer for Test3 than they did for Test2 though maybe because of the function overhead (but not sure that should equate to around 400 seconds).

SELECTS

The CPU Time for all single row selects took 0ms.  However, the average CPU and Elapsed times for the 70,000 row ranged select varied for each test. On average Test1 performed better on CPU time than the other two tests but Test2 faired better on logical reads and elapsed time than the other 2 tests, this was because the size of a Test2 row is 9 bytes while the size of the Test1 and Test3 rows is 17 bytes.  Test1 caused a tremendous amount of page splits and had a scan density around 12% when I ran a DBCC showcontig after the inserts completed. The Test2 table had a scan density around 98% and the Test3 table had a scan density around 86% after inserts (I should have rebuilt the indexes after the data load, but I didn't :)).  All of the CPU times were close enough in range to be circumstantial, but the elapsed time seemed to very between all of them.  I think the variance is because of the nature of the data being returned to Query Analyzer for a more accurate test of elapsed time (I should have used osql, but I didn't :)).  The tests seem to prove that the binary comparison of the GUID performs quite well.

Test 1 -
Uniqueidentifier
NEWID()
Test 2 -
Composite Key
Test 3 -
Uniqueidentifier
getdate()
Select Range 70000 -
CPU TIME (AVERAGE)
43 ms 50 ms 48 ms
Select Range 70000 -
Elapsed TIME (AVERAGE)
1987 ms 712 ms 1175 ms
Select Range 70000 -
Logical Reads
331 159 439
Select Range 70000 -
Physical Reads
2 2 3
Select Range 70000 -
Read Aheads
331 159 440
Select 1 Row -
CPU Time
0 ms 0 ms 0 ms
Select 1 Row -
Logical Reads
3 3 3
Select 1 Row -
Physical Reads
3 3 3

 

Conclusion

It would seem to be as a result of the tests that the uniqueidentifier datatype performs about the same as an integer when filtering the data through the where clause. There are only a couple of downsides I can see in using the uniqueidentifier.  First, and probably the most important, the size of the column (which can be an issue).  It adds overhead to a system to have to return more data pages than possibly necessary (i/o's are the most expensive of system costs).  The large size could in turn cause more page splits during inserts depending on whether the data is static or not (but so can adding a non-necessary column to a table).  Second, with the use of the NEWID() function inserts will be random and therefore will by nature cause page splits where using an integer identity column will place the records sequentially into the table. "Ah...", you say, "...but the use of the function that parses the NEWID() and adds the converted getdate value to the end will do that as well." Yes, it does, and still it remains that I don't often load 4 million rows into a table, and so for me that is not as big a deal (just don't use the NEWID() when doing a mass load into a table).  The third issue is having to type that uniqueidentifier value into a query manually (more potential for user error). 

So in short, it looks as though if you plan carefully and design wisely, then the use of the uniqueidentifier can perform as well in SELECT statements as an integer.  Inserts are a different story and we already discussed that as long as they are sequential values it really shouldn't matter.

I appologize for my initial shock in the usage of the uniqueidentifier column, hoping that everyone who read the initial statement can forgive me (Integers still rule the INSERTs though).  :)

Later,

Zach