August 2005 - Posts

Using GUIDs - Inserts

Currently we are designing a database schema.  When I heard they were using GUID's (Global Unique Identifier) as PK's (primary keys) my jaw about dropped.  Turns out the reasoning behind the use of the keys was completely legitimate and actually is the reason GUID's were implemented by MS in the first place.  They needed a key that would be unique across databases and database servers.  The GUID certainly fulfills this need but it comes a cost. 

The GUID is a wide column (16 bytes to be specific) and contains a unique combination of 33 hex value characters.  This column, because it is the primary key, is going to be stored in, of course, the clustered index (unless specified to be a non-clustered index), and will be the page pointer for each leaf page in a non-clustered index.  Also, if a GUID is used instead of an integer identity column then the bits need to be matched up for each row which is relatively fast.  If a high volume of inserts are done on these tables then GUID's being large will contribute to page splits, as will the fact that NEWID() generates a random value, which could place a new record on any of the data pages will cause performance problems.  Another reason is that typically primary keys are searched most frequently and trying to recall a GUID from memory to type into a query is nearly impossible (my feeble mind wouldn't be able to do it).

If you needed a key that would be unique across all servers I would recommend a composite key as the PK instead of the GUID.  Use a single integer identity column to uniquely identify the rows and another column to identify the server, placing a default server id value and placing a check constraint on the server id column to validate that the value is only what you would expect for it to be for that server. 

I have mocked a test to validate my theory on inserts only (results below), I'll get to the selects a little later, but I am interested in what the rest of you have found.

Test 1 - Completed in 79 seconds on test machine
Created table with the following syntax.

create table test1
   (tid uniqueidentifier default newid() not null,
    col1 char(1) not null
      primary key (tid)) on [primary]


I used the following script to insert into the table.

declare @d1 datetime, @d2 datetime, @i int, @j int
select @d1=getdate(), @i=1, @j=100000

while @i <= @j
begin
   insert test1 (col1)
   values ('a')

   set @i = @i + 1
end

select @d2=getdate()
select datediff(s,@d1,@d2)

 

Test 2 - Completed in 33 seconds on test machine
Created the second test table with the following syntax.

create table test 2
   (tid int identity(1,1) not null,
    sid int default 2 not null
      check (sid=2),
    col1 char(1) not null
      primary key (tid,sid)) on [primary]
 

I used the following script to insert into the test2 table.

declare @d1 datetime, @d2 datetime, @i int, @j int
select @d1=getdate(), @i=1, @j=100000

while @i <= @j
begin
   insert test2 (col1)
   values ('a')

   set @i = @i + 1
end

select @d2=getdate()
select datediff(s,@d1,@d2)


Let me know if you have feedback or different experiences on this.  I am specifically interested in real world performance differences in the two approaches. 

Performance is the key,

Zach

P.S. Please read the comments to this entry. 

New Adventures In Life

Well, yesterday I started a new job.  Looks to be challenging, which is always good.  As with every large company there are politics and upheavals (trying to figure out which battles to avoid is always fun).  I haven't had much time to blog with the new schedule, adjustments and other things. 

I recently started another new adventure, coaching my 5 year old's flag football team.  Think your job is hard?  Try working with 20, 5 and 6 year olds for an hour twice a week.  My appreciation for a desk and a computer has been renewed!!  God bless those kindergarden teachers! 

Take it easy, I'll be bloggin soon,

Zach

 

ROW_NUMBER() - The simple things in life...

For those of you already diving into SQL Server 2005 this is nothing new but for those of you who are newly looking into it here is a cool new function.

ROW_NUMBER ()

The ROW_NUMBER will attach a row number to each row returned in a result set.  So for example you can execute a query like this in 2005 and see the results below.

select object_id, name, row_number() over(order by name asc) as row_number
from sys.objects
where type = 'u'
go

Results
object_id    Name                     row_number
1147151132   MSreplication_options    1
117575457    spt_fallback_db          2
133575514    spt_fallback_dev         3
149575571    spt_fallback_usg         4
1099150961   spt_monitor              5
1115151018   spt_values               6

If this is the first time you have seen this function you probably noticed something else unknown to you, the over() clause.  The over() clause determines the order of a result set, before the function stated before it, is executed or applied.  In essence, you have to supply the over() clause the column that you want to order the row number in.  To use the example above, you may choose to give row numbers in the alphabetic ascending order of the name column as above.  The over() clause is also used with other ranking functions.

That is example really doesn’t show a practical use this function though.  Today I was asked to help with a client who is running report services on SQL 2005.  This particular report was to display a graph grouping 100 employees by employee_id but sorted by the employee’s name.  There was a problem that you could only legibly see about 10 employees per page on the graph.  The problem was not limiting the results to 10 employees per page but rather to have a table of contents to quickly identify which employee by employee_id was on which page. 

The query was actually quite simple in 2005.  As I stated above the graphs were limited to 10 employees per page, so here the query I came up with.  This is not the live data but only an example of how this could work.

-- Setup
-- creating the test database
create database test
go

use test
go

-- create the employee table
create table employee (id int identity(1,1), test int)
declare @i int, @j int
select @i = 1, @j = 200
while @i <= @j
begin
      -- load table
      insert into employee (test)
      values (@i)
      set @i = @i + 1
end
go

-- Querying test table
use test
go
select id, round(((row_number() over(order by id asc)+ 10) /10),1) as page_number
from employee
go

This query gives the result set below.  Notice the “+10” in the query above before the row_number is divided by 10?  The “+10” is to start with a page_number of 1 instead of 0. 

Results
id   page_number
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
10   2
11   2
12   2
13   2
14   2  
...  ...

Another cool thing about the row_number function is that you can filter by the function too but only if you utilize the new CTE functionality, I’ll talk about it later (Example below).

WITH test1 AS
(select id, row_number() over(order by id asc) as row, round(((row_number() over(order by id asc)+ 10) /10),1) as page_number
from employee)
select * from test1
where page_number between 5 and 6
go

This is just a couple of examples of how this could be used, but I am personally excited about the use of this function. 

Man, the smallest things in life seem to bring such satisfaction.

PEACE,
Zach

Scalability Testing - Part 1 of... Crap, who knows...

Alright, so here’s the first of my technical blogs. 

This blog is going to focus on a customer that I recently worked with and the project that I was placed on there.  The customer will remain unnamed for this blog but the situation seems to be very typical in home grown 3-tier applications or software developed by vendors.  This project was to design and perform a test harness for a scalability test against this company’s application.  If the application bottlenecked the requirement was to identify it and let the customer re-architect that tier or rewrite that particular piece of the application that was causing the bottleneck, then re-test with the new code.

I’m gonna be 2 part’in this topic so look for the next one too.  I figure that most of you will already know how to do this type of thing but hopefully I say something worth while at some point to make this thing interesting.

First, I’ll give a run down of the environment and then I will talk about what is important about that information and what is not so important.  Being able to identify the not-so-important information is almost as important as identifying the important things.

Customer A wants to test their application with 100 users, 150 users and 200 users.  Customer A has an application that runs in a 3-tier architecture that is database agnostic (can run on multiple different database platforms Ex: SQL Server, Oracle, MySQL, etc.).  The web tier runs on Windows 2003 and IIS 6.0.  The web tier will send requests to the application tier which is also running on Windows 2003 and the application requests run through a cobol compiler so that the application tier also remains OS agnostic.  The single application server will send static SQL statements (not dynamic, meaning not built on the fly by the application server) to the database server. 

There were 2 web servers used for the 100 user and 150 user tests, and 3 web servers for the 200 user test, all web servers are current dual proc 3.8GHz, 2GB RAM boxes.  There is a single application server and is a current 4-way 3.5GHz box with hyper-threading and 3.5GB of RAM.  The database server is a current 4-way 3.0GHz server with 3GB of RAM.

Alright, so the important stuff point #1 is the fact that there are three tests, 100, 150 and 200 users, these are the metrics for measuring scalability and the metric(s) are important to identify before you can even begin developing any scalability test. 

Important stuff point #2 and really point #3 too: is the fact that the application is that the database and the OS are agnostic.  This introduces a huge scalability issue most of the time, because, once you have an application that is agnostic 9 times out of 10 the application will not be optimized for any specific database platform but instead the development is focused on ensuring that the functionality works on multiple database or OS platforms and the performance is "good" universally.  With the application tier being OS agnostic, more times than not, it introduces scalability issues with additional overhead, especially if the application is a UNIX based application and is ported to windows (which it was).  When looking at the application server you can usually identify the additional overhead in the areas of CPU utilization and context switching.  Context switching occurs when application threads are switched into and out of a processor, when threads are competing for processor time.  A good threshold for context switches is around 5000 context switches/sec per logical processor.

Important stuff point #4: the amount of RAM in each server.  Without knowing this you can pull a Perfmon counter “% Available Memory” but you still don’t get a good feel for how much memory you have left to use without the “Available Memory /MB” comparing both to the amount of RAM in the system, or just one of them if you prefer.  Without this you won't be able to tell if the application and web servers have memory pressure or not.  The database server is a different story and there is more information that can be found on that in the “Not so important stuff point #3” below.

Not so important stuff point #1, the number of users for each test.  The reason this is not important is because this number is going to vary application to application.  If you do not have intimate knowledge of how the application works and the functions of the application (which I didn’t) you won’t be able to tie the users to the functions performed by each and you won’t be able to tie that to individual measurements of each function.

Not so important stuff point #2, the speed of the processors doesn’t matter one bit when measuring performance unless you have an application doing a lot of calculations and compilations and the CPU’s are pegged (higher than 85% over an extended period of time).  If that is the case, at the very least, faster processors will help the situation, even better, more processors, better than that, multiple 64-bit processors (generally anything 8 CPU’s or more is best, I’ll blog about this some other time)   

Not so important stuff point #3, the amount of RAM in the database server.  Now, I don't mean to say that memory is not important to SQL Server, because by God it is, and I don't mean to say that the amount of RAM that you dedicate to SQL Server is not important because we all know it is.  All I am trying to say is that it will not help you gauge whether or not SQL Server has memory pressure and that in SQL Server memory pressure is measured differently than for other systems.  SQL Server has Perfmon counters to help determine memory pressure on the server.  For example, Page Life Expectancy, or PLE as some call it, is a measurement of how many seconds a data page will stay in the buffer cache if not accessed then Lazy writer flushes the page so that another data page can take it's place... blah, blah, blah (don't want to get too long winded if it's not too late already).  The generally accepted threshold for this counter is 300 seconds or 5 minutes.  You really should give it attention if the value is <= 300 because SQL Server could be forced to go to disk (causing more I/O's) to retrieve data pages more than it really should be.  The need for this value to be greater really depends on the applications running against that database server and can be quite complex to determine accurately what you might need it to be.  As you evaluate this value on a regular basis on various machines you can get a better “gut feel” for the value.  This brings us to the next set of memory counters in SQL Server that I would look at which is Target and Total Server Memory (Kb).  These counters should always be equal.  Target server memory, tells you how much memory SQL Server would like to have.  Total server memory, tells you how much memory SQL Server is actually using.   If the Target server memory is higher than the Total Server memory then you know that SQL Server would like to have more memory than it can physically acquire.  Thus, memory pressure.  There are other counters as well around the mem-to-leave area and also the caches that are also important.  I can get into these next time.  You've suffered enough I think, well... for now anyway. 

Maybe I should just blog about counters some time and get away from this stuff for now.  Let me know if you would be interested in that and I’ll put some time into it. 

Anyway it’s about 2 AM and time to hit the sack.  Check back later for part 2 (or more) of this project.

Keep on truck’in,

Zach

I'm a bit odd, but then again, everybody is a bit... I think they call that normal...

Alright, so this is my fist Blog and it contains nothing of technical interest, my appologies for that, but I'll start soon enough. I want to thank Donny Mack for the site and the chance to share my random thoughts.  

A little about myself, I am a DBA Consultant with a company called Scalability Experts where we specialize in scalability (obviously from the name) and performance of databases and database servers.  I have been either a DBA or involved with database tuning, maintenance, performance or whatever for the past 10 years.  I am looking forward to sharing the technical information that I have gathered over the years as well as I bit of non-technical everyday life stuff from time to time. 

Keep an eye out I'll be posting soon.

Keep on keep'in on,

Zach