Performance (RSS)

CHAR vs VARCHAR

This topic was discussed a bit in a forum post recently that has quite a bit of interest in my opinion.  Since I posted about the use of the GUID recently I thought I would do some some testing on the performance differences and the downfalls of using the CHAR and VARCHAR data types.

First we'll consider the functional differences between the 2 data types.  The CHAR is obviously a static length character datatype while the VARCHAR is a variable length character datatype.  The obviously positive and negative (we’ll cover this later) thing about a VARCHAR, is that you only store the amount of data you need to and no more, well, to get a little deeper you store an additional 2 bytes with every VARCHAR value. 

FUNDAMENTAL CONCEPT

When inserting into a CHAR(100) column regardless of the length of the data going into the column, SQL Server will store 100 bytes for this column.  When inserting into a VARCHAR(100) column depending on the length of the data going into the column, SQL Server will store the number of bytes of data going into this column. 

SELECTS

If you are using the same data to insert into 2 separate tables, one utilizing a char and the other utilizing a varchar, and the length of the data going into those tables are of fixed length, then there will be no performance difference in the retrieval of that data as long as no functions are used as part of the data retrieval.  So in other words if I had a char(100) column in the first table and a varchar(100) column in the same spot on the second table and all of the data that went into those columns were exactly 100 characters in length then both data types would perform equally as well.  If the data that was going into the char and varchar columns was of variable length (values varying from 1-100 characters equally) then the varchar would perform better.  This is not because the varchar data type is more efficient but because there will be fewer data pages to retrieve from disk (I'll get into this some more when I go over the inserts).

INSERTS

Comparing inserts for these data types is where the difference in performance lies because of the storage of data.  If the same 2 tables are used as in the select example and the data that is going into those tables is of variable length then the inserts will perform faster on the table with varchar.  If those same 2 tables are used and the data is of a fixed 100 character length then the inserts using both data types will be equal. 

UPDATES

Now, lets take a future scenario, lets say you have your variable length data in your 2 tables, and you want to do an update.  If your varchar(100) value is only storing 10 characters in a particular row.  And your data pages are full.  And you update it with a value that is 100 characters long, a page split will occur.  This will cause this update to be slower than updating your char column value that is 10 characters in length with a 100 character value.  Why you may ask?  Well because when you insert a 10 character value into a char(100) column, the database will store a value of 100 characters, the original 10 characters plus a padding of 90 spaces on the end.   When you insert a 10 character value into a varchar(100) column, the database will store a value only giving space on the data page for the 10 characters.  After the page fills there is no room for the additional 90 bytes of data so the system splits the page and moves to row to a new data page, placing a pointer to the new page and row. Evidence of this activity can be seen in the test below.

TEST RESULTS

In these tests I cover a couple of different scenarios.  

  1. Inserting 200,000 rows into a table called T1, which contained an integer identity column and a char(100) column
  2. Inserting 200,000 rows into a table called T2, which contained an integer identity column and a varchar(100) column
  3. Selecting all rows from table T1
  4. Selecting all rows from table T2
  5. Selecting a range of 15,000 rows from table T1
  6. Selecting a range of 15,000 rows from table T2
  7. Selecting a single row from table T1
  8. Selecting a single row from table T2
  9. Updating 200,000 rows from table T1
  10. Updating 200,000 rows from table T1

Inserting 200,000 NULL rows into the char and varchar tables T1 and T2 proves the amount of space allocated to store the data for each table.  During these two tests table T1 (char table) required 2817 data pages to store the data.  Table T2 (varchar table) required only 322 data pages to store all of the data.  This was because the T1 table reserved the full 100 character value for the char column regardless of the fact that the data inserted into the char column was a NULL value and therefore required 2817 pages to store the data.  Table T2 did not reserve the 100 characters of space when inserting the NULL value into the table therefore only needing 322 pages to store the data.  The inserts for both tables took 60 seconds for T1 and 61 seconds for T2.  The inserts were run multiple times and the averages were used for reporting.   After the inserts on table T1 the Scan Density and Page Density were 99.44% and 99.09% respectively.  After the inserts on table T2 the Scan Density and Page Density were 95.35% and 99.73% respectively.

Selecting all of the rows from both tables only 174 milliseconds for T1 and 152 milliseconds for T2.  This is because of the drastic difference in the number of pages being returned by both queries with no WHERE clause.  The number of logical reads that required by the system for table T1 were 2819 and the number logical reads required to retrieve all pages for T2 was 323.  When returning a range of 15,000 rows from T1 the query completed in 21 milliseconds and retrieved 214 data pages.  When returning the 15,000 rows from table T2 the query completed in 17 milliseconds and retrieved only 26 data pages.   When returning a single row from T1 the query completed in 0 milliseconds and retrieved 3 data pages.  When returning a single row from table T2 the query completed in 0 milliseconds and retrieved only 2 data pages. 

The performance hit when using a varchar in these test cases can be seen doing updates on these two tables.  When updating all rows in these two tables from a NULL value to a 100 character value the results vary greatly.  It took 1032 milliseconds to update all 200,000 rows in table T1, while taking 1911 milliseconds to update all rows in T2.  After doing a SHOWCONTIG on the two tables after the update you can see the cost of the updates to be page splits caused by the updates against T2.  There are still only 2817 pages for table T1 but now there are 3859 pages for T2 instead of the 322 pages we saw earlier.  There is also still a Scan Density and Page Density of 99.44% and 99.09% for T1 but now there is a 46.04% and 74.25% Scan Density and Page Density for T2.

CONCLUSION

There is no performance difference fundamentally retrieving data from the char and varchar datatypes.  The difference in performance comes from how you will be utilizing the columns. 

Make sure that you plan for the way you will be using the data, be it static or dynamic.

-Zach

Using GUIDs - Selects

Alright, so I tested the GUID usage a bit more.  In my previous post you can see I already tested the inserts for 100,000 rows using the 3 different meathods.  The first meathod (I refer to this meathod as Test1 throughout) used the NEWID() as a default on the uniqueidentifier clustered primary key.  The second meathod was to use a composite key made up of 2 integer columns, one being an identitiy and one forced with a server id (this meathod is refered to as Test2 throughout), these columns make up this tables clustered primary key.  And the third meathod uses a function to create a sequential uniqueidentifier, using a combination of the NEWID() function and a converted binary(6) datetime value (refered to as Test3 throughout) to make up the clustered primary key.

These tests did the following, first truncated the tables and inserted 4,000,000 rows into test table, Test1, Test2 and Test3.  Next, I tested single row selects for each table.  For the next test, I selected 70,000 rows from each table.  For the 70,000 row selects I did a ranged select, first I cleared the proc cache, the buffer cache, did a checkpoint, and then ran the select statement.  I did this 10 times for each query and took the averages to report here. 

INSERTS

Inserts for Test1 took 79 seconds to insert 100,000 rows and 45,191 seconds to insert 4,000,000 rows.  The inserts for Test2 it took 33 seconds for the 100,000 rows and took 1630 seconds for the 4,000,000 rows.  For Test3 the inserts took 33 seconds for the 100,000 inserted rows and 2025 seconds for the 4,000,000 rows.

Test 1 -
Uniqueidentifier
NEWID()
Test 2 -
Composite Key 
Test 3 -
Uniqueidentifier
getdate()
Insert - 100000 79 sec 33 sec 33 sec
Insert - 4000000 45191 sec 1630 sec 2025 sec

It took much longer for test1 because to insert because the GUID values are random values and because the default fill factor for inedxes is 80%, which wasn't changed for this test, there were a great number of page splits. Because the inserts for Test2 and Test3 are sequential the fill factor for those indexes could be set to 100% for better performance on inserts and selects because fewer data pages would be used or returned.  I am not quite sure why the inserts took longer for Test3 than they did for Test2 though maybe because of the function overhead (but not sure that should equate to around 400 seconds).

SELECTS

The CPU Time for all single row selects took 0ms.  However, the average CPU and Elapsed times for the 70,000 row ranged select varied for each test. On average Test1 performed better on CPU time than the other two tests but Test2 faired better on logical reads and elapsed time than the other 2 tests, this was because the size of a Test2 row is 9 bytes while the size of the Test1 and Test3 rows is 17 bytes.  Test1 caused a tremendous amount of page splits and had a scan density around 12% when I ran a DBCC showcontig after the inserts completed. The Test2 table had a scan density around 98% and the Test3 table had a scan density around 86% after inserts (I should have rebuilt the indexes after the data load, but I didn't :)).  All of the CPU times were close enough in range to be circumstantial, but the elapsed time seemed to very between all of them.  I think the variance is because of the nature of the data being returned to Query Analyzer for a more accurate test of elapsed time (I should have used osql, but I didn't :)).  The tests seem to prove that the binary comparison of the GUID performs quite well.

Test 1 -
Uniqueidentifier
NEWID()
Test 2 -
Composite Key
Test 3 -
Uniqueidentifier
getdate()
Select Range 70000 -
CPU TIME (AVERAGE)
43 ms 50 ms 48 ms
Select Range 70000 -
Elapsed TIME (AVERAGE)
1987 ms 712 ms 1175 ms
Select Range 70000 -
Logical Reads
331 159 439
Select Range 70000 -
Physical Reads
2 2 3
Select Range 70000 -
Read Aheads
331 159 440
Select 1 Row -
CPU Time
0 ms 0 ms 0 ms
Select 1 Row -
Logical Reads
3 3 3
Select 1 Row -
Physical Reads
3 3 3

 

Conclusion

It would seem to be as a result of the tests that the uniqueidentifier datatype performs about the same as an integer when filtering the data through the where clause. There are only a couple of downsides I can see in using the uniqueidentifier.  First, and probably the most important, the size of the column (which can be an issue).  It adds overhead to a system to have to return more data pages than possibly necessary (i/o's are the most expensive of system costs).  The large size could in turn cause more page splits during inserts depending on whether the data is static or not (but so can adding a non-necessary column to a table).  Second, with the use of the NEWID() function inserts will be random and therefore will by nature cause page splits where using an integer identity column will place the records sequentially into the table. "Ah...", you say, "...but the use of the function that parses the NEWID() and adds the converted getdate value to the end will do that as well." Yes, it does, and still it remains that I don't often load 4 million rows into a table, and so for me that is not as big a deal (just don't use the NEWID() when doing a mass load into a table).  The third issue is having to type that uniqueidentifier value into a query manually (more potential for user error). 

So in short, it looks as though if you plan carefully and design wisely, then the use of the uniqueidentifier can perform as well in SELECT statements as an integer.  Inserts are a different story and we already discussed that as long as they are sequential values it really shouldn't matter.

I appologize for my initial shock in the usage of the uniqueidentifier column, hoping that everyone who read the initial statement can forgive me (Integers still rule the INSERTs though).  :)

Later,

Zach

Using GUIDs - Inserts

Currently we are designing a database schema.  When I heard they were using GUID's (Global Unique Identifier) as PK's (primary keys) my jaw about dropped.  Turns out the reasoning behind the use of the keys was completely legitimate and actually is the reason GUID's were implemented by MS in the first place.  They needed a key that would be unique across databases and database servers.  The GUID certainly fulfills this need but it comes a cost. 

The GUID is a wide column (16 bytes to be specific) and contains a unique combination of 33 hex value characters.  This column, because it is the primary key, is going to be stored in, of course, the clustered index (unless specified to be a non-clustered index), and will be the page pointer for each leaf page in a non-clustered index.  Also, if a GUID is used instead of an integer identity column then the bits need to be matched up for each row which is relatively fast.  If a high volume of inserts are done on these tables then GUID's being large will contribute to page splits, as will the fact that NEWID() generates a random value, which could place a new record on any of the data pages will cause performance problems.  Another reason is that typically primary keys are searched most frequently and trying to recall a GUID from memory to type into a query is nearly impossible (my feeble mind wouldn't be able to do it).

If you needed a key that would be unique across all servers I would recommend a composite key as the PK instead of the GUID.  Use a single integer identity column to uniquely identify the rows and another column to identify the server, placing a default server id value and placing a check constraint on the server id column to validate that the value is only what you would expect for it to be for that server. 

I have mocked a test to validate my theory on inserts only (results below), I'll get to the selects a little later, but I am interested in what the rest of you have found.

Test 1 - Completed in 79 seconds on test machine
Created table with the following syntax.

create table test1
   (tid uniqueidentifier default newid() not null,
    col1 char(1) not null
      primary key (tid)) on [primary]


I used the following script to insert into the table.

declare @d1 datetime, @d2 datetime, @i int, @j int
select @d1=getdate(), @i=1, @j=100000

while @i <= @j
begin
   insert test1 (col1)
   values ('a')

   set @i = @i + 1
end

select @d2=getdate()
select datediff(s,@d1,@d2)

 

Test 2 - Completed in 33 seconds on test machine
Created the second test table with the following syntax.

create table test 2
   (tid int identity(1,1) not null,
    sid int default 2 not null
      check (sid=2),
    col1 char(1) not null
      primary key (tid,sid)) on [primary]
 

I used the following script to insert into the test2 table.

declare @d1 datetime, @d2 datetime, @i int, @j int
select @d1=getdate(), @i=1, @j=100000

while @i <= @j
begin
   insert test2 (col1)
   values ('a')

   set @i = @i + 1
end

select @d2=getdate()
select datediff(s,@d1,@d2)


Let me know if you have feedback or different experiences on this.  I am specifically interested in real world performance differences in the two approaches. 

Performance is the key,

Zach

P.S. Please read the comments to this entry. 

Scalability Testing - Part 1 of... Crap, who knows...

Alright, so here’s the first of my technical blogs. 

This blog is going to focus on a customer that I recently worked with and the project that I was placed on there.  The customer will remain unnamed for this blog but the situation seems to be very typical in home grown 3-tier applications or software developed by vendors.  This project was to design and perform a test harness for a scalability test against this company’s application.  If the application bottlenecked the requirement was to identify it and let the customer re-architect that tier or rewrite that particular piece of the application that was causing the bottleneck, then re-test with the new code.

I’m gonna be 2 part’in this topic so look for the next one too.  I figure that most of you will already know how to do this type of thing but hopefully I say something worth while at some point to make this thing interesting.

First, I’ll give a run down of the environment and then I will talk about what is important about that information and what is not so important.  Being able to identify the not-so-important information is almost as important as identifying the important things.

Customer A wants to test their application with 100 users, 150 users and 200 users.  Customer A has an application that runs in a 3-tier architecture that is database agnostic (can run on multiple different database platforms Ex: SQL Server, Oracle, MySQL, etc.).  The web tier runs on Windows 2003 and IIS 6.0.  The web tier will send requests to the application tier which is also running on Windows 2003 and the application requests run through a cobol compiler so that the application tier also remains OS agnostic.  The single application server will send static SQL statements (not dynamic, meaning not built on the fly by the application server) to the database server. 

There were 2 web servers used for the 100 user and 150 user tests, and 3 web servers for the 200 user test, all web servers are current dual proc 3.8GHz, 2GB RAM boxes.  There is a single application server and is a current 4-way 3.5GHz box with hyper-threading and 3.5GB of RAM.  The database server is a current 4-way 3.0GHz server with 3GB of RAM.

Alright, so the important stuff point #1 is the fact that there are three tests, 100, 150 and 200 users, these are the metrics for measuring scalability and the metric(s) are important to identify before you can even begin developing any scalability test. 

Important stuff point #2 and really point #3 too: is the fact that the application is that the database and the OS are agnostic.  This introduces a huge scalability issue most of the time, because, once you have an application that is agnostic 9 times out of 10 the application will not be optimized for any specific database platform but instead the development is focused on ensuring that the functionality works on multiple database or OS platforms and the performance is "good" universally.  With the application tier being OS agnostic, more times than not, it introduces scalability issues with additional overhead, especially if the application is a UNIX based application and is ported to windows (which it was).  When looking at the application server you can usually identify the additional overhead in the areas of CPU utilization and context switching.  Context switching occurs when application threads are switched into and out of a processor, when threads are competing for processor time.  A good threshold for context switches is around 5000 context switches/sec per logical processor.

Important stuff point #4: the amount of RAM in each server.  Without knowing this you can pull a Perfmon counter “% Available Memory” but you still don’t get a good feel for how much memory you have left to use without the “Available Memory /MB” comparing both to the amount of RAM in the system, or just one of them if you prefer.  Without this you won't be able to tell if the application and web servers have memory pressure or not.  The database server is a different story and there is more information that can be found on that in the “Not so important stuff point #3” below.

Not so important stuff point #1, the number of users for each test.  The reason this is not important is because this number is going to vary application to application.  If you do not have intimate knowledge of how the application works and the functions of the application (which I didn’t) you won’t be able to tie the users to the functions performed by each and you won’t be able to tie that to individual measurements of each function.

Not so important stuff point #2, the speed of the processors doesn’t matter one bit when measuring performance unless you have an application doing a lot of calculations and compilations and the CPU’s are pegged (higher than 85% over an extended period of time).  If that is the case, at the very least, faster processors will help the situation, even better, more processors, better than that, multiple 64-bit processors (generally anything 8 CPU’s or more is best, I’ll blog about this some other time)   

Not so important stuff point #3, the amount of RAM in the database server.  Now, I don't mean to say that memory is not important to SQL Server, because by God it is, and I don't mean to say that the amount of RAM that you dedicate to SQL Server is not important because we all know it is.  All I am trying to say is that it will not help you gauge whether or not SQL Server has memory pressure and that in SQL Server memory pressure is measured differently than for other systems.  SQL Server has Perfmon counters to help determine memory pressure on the server.  For example, Page Life Expectancy, or PLE as some call it, is a measurement of how many seconds a data page will stay in the buffer cache if not accessed then Lazy writer flushes the page so that another data page can take it's place... blah, blah, blah (don't want to get too long winded if it's not too late already).  The generally accepted threshold for this counter is 300 seconds or 5 minutes.  You really should give it attention if the value is <= 300 because SQL Server could be forced to go to disk (causing more I/O's) to retrieve data pages more than it really should be.  The need for this value to be greater really depends on the applications running against that database server and can be quite complex to determine accurately what you might need it to be.  As you evaluate this value on a regular basis on various machines you can get a better “gut feel” for the value.  This brings us to the next set of memory counters in SQL Server that I would look at which is Target and Total Server Memory (Kb).  These counters should always be equal.  Target server memory, tells you how much memory SQL Server would like to have.  Total server memory, tells you how much memory SQL Server is actually using.   If the Target server memory is higher than the Total Server memory then you know that SQL Server would like to have more memory than it can physically acquire.  Thus, memory pressure.  There are other counters as well around the mem-to-leave area and also the caches that are also important.  I can get into these next time.  You've suffered enough I think, well... for now anyway. 

Maybe I should just blog about counters some time and get away from this stuff for now.  Let me know if you would be interested in that and I’ll put some time into it. 

Anyway it’s about 2 AM and time to hit the sack.  Check back later for part 2 (or more) of this project.

Keep on truck’in,

Zach