Common Symptoms and Tools
Customers are normally good at identifying processor and memory bottlenecks,
but they rarely start a support call with a statement like, “My disk subsystem
is a bottleneck.” Therefore, identifying and resolving disk (I/O) subsystem
bottlenecks are difficult cases. The cases usually start with, “The overall
performance has degraded over time”, “It is just slow”, or
“The workload has increased and everything is slow.” The customer
can’t pinpoint one query or workload that is causing the slow performance.
There are three areas of data collection to troubleshoot disk I/O bottlenecks.
The first is System Monitor’s Physical Disk object. System Monitor should
capture all counters and all instances prior, during, and after the slow performance
periods, if possible. (Please note that the Physical Disk counters are not enabled
by default on Windows NT. But on Windows 2000, Windows XP, and Windows 2003
servers they are enabled by default. For enabling disk counters on NT systems,
run “diskperf –y” from a command prompt and restart the server
box for it to take effect.) Secondly, the database file layout—what drives
are the .mdf, .ndf, and .ldf files residing on for each database? Thirdly, what
is the configuration of each drive? This includes what type of raid configuration
and how many underlying spindles (disk drives) make up the raid configuration.
Information about the number and type of disk controllers can also be helpful.
Disk Layout Best Practices
As a rule of thumb, there are two guidelines for laying your disks:
1. Database files that are accessed randomly. Typically data files
(.mdf and .ndf) service random I/O. SQL Server does not know what query will
be requested next or where on the disk platter the data resides. This causes
random I/O for which the disk platter will spin to find data.
2. Database files that are accessed sequentially. Transaction log
(.ldf) activity is sequential. Sequential I/O’s access data sectors
that are adjacent to each other and have a faster seek time, and therefore
have a higher throughput than random I/O.
If the database files are placed on separate disks, another distinct advantage
is that it will benefit from parallel I/O. The data files that are accessed
via random I/O take priority over sequential I/O.
When the data files and transaction log files of a database are on the same
physical disk, the transaction log writes (sequential I/O) should wait till
the random I/O is done, which reduces the throughput of the transaction logging.
The recommended configuration for good performance is to separate the log files
and data files onto separate dedicated disks.
On a side note, it is also not recommended to have transaction log files on
RAID 5 because this slows down logging performance. Depending on your business
reasons you can choose RAID 5 or any other RAID for the data files. But for
log files, RAID 10 is always recommended.
Disk drive performance can be broken down into four areas:
1. Seek time: The time required to move to the track that holds the data.
2. Rotational latency: The time required for that platter to rotate under
the head.
3. Time required to transfer data from the disk drive to the disk controller.
4. Time for device drivers in the operating system, which is usually negligible.
The sum of these components is the time an I/O takes.
Disk drives have physical limitations on how many I/Os a disk is capable of
handling. Depending on the workload, this limit can be reached. When the number
of I/O requests exceeds the disks I/O capacity, the I/O requests will take longer
and be queued for its turn on the disk. Because of this, spreading out data
files across multiple physical drives allows for parallel I/O. Adding more physical
disks and spreading the data files among them can resolve a lot of disk subsystem
bottlenecks. As a general rule, a disk drive can handle about 125 random I/O’s
per second and 225 sequential I/O’s per second. These numbers are general
guidelines and can differ based on the hardware. See the disk drives specifications
to determine actual numbers. Also note that a disk drive’s performance
degrades when it is at 85 percent of its capacity.
Suppose from the disk drive specifications you find out that the average seek
time is 3.9 ms. The average rotational latency is 2 ms. Then the theoretical
disk I/O capacity is (3.9ms + 2ms) / 1000 = 169 I/Os per second. The theoretical
limit is almost never reached, so take 75 percent of the theoretical limit (169
* .75) = 127 I/O’s per second. This gives you a disk I/O capacity to work
with.
Investigating the System Monitor
In the System Monitor there are five Physical Disk counters for each physical
disk contained in the Logical Drive that are key to identifying I/O bottleneck.
Each disk should be investigated individually.
1. Avg. Disk Queue Length is the average number of both read and
write requests that were queued for the selected disk during the sample interval.
2. Avg. Disk sec/Read is the average time, in seconds, of a read
of data from the disk.
3. Avg. Disk sec/Write is the average time, in seconds, of a write
of data to the disk.
4. Disk Reads/sec is the rate of read operations on the disk.
5. Disk Writes/sec is the rate of write operations on the disk.
First, on the System Monitor properties window, slide the time interval to
only include the period of “slow performance.” Then calculate the
I/Os per disk as follows.
Suppose you have a RAID 5 configuration with four physical disks, and you had
the following information from the System Monitor.
Disk Reads/sec 120
Disk Writes/sec 150
Avg. Disk Queue Length 12
Avg. Disk Sec/Read .035
Avg. Disk Sec/Write .045 |
Raid Calculations:
Raid 0 -- I/Os per disk = (reads + writes) / number of
disks
Raid 1 -- I/Os per disk = [reads + (2 * writes)] / 2
Raid 5 -- I/Os per disk = [reads + (4 * writes)] / number of disks
Raid 10 -- I/Os per disk = [reads + (2 * writes)] / number of disks |
You can use the above RAID calculations to determine the I/Os per disk. 120
reads + (4 * 150 writes) / 4 physical disks = 180 I/Os per disk. This is over
the general rule of 125 I/Os per disk capacity. This system has a disk bottleneck.
You can determine that with 720 total I/Os, six physical disks would be required
to resolve the disk bottleneck. (720 / 125= 5.76). Adding even more disks will
provide room for growth. You should calculate for peak workloads and future
growth.
The Average Disk queue length is the total for all physical disks in the RAID
configuration. In the example above, 12 for avg. disk queue length / 4 physical
disk = 3 queued I/Os per disk. Anything over 2 is reason to investigate further.
Next look at average disk per seconds. In general, Average Disk Sec/Read of
11 ms -15 ms or lower is good. Average Disk Sec/Write of 12 ms or lower are
good. Anything above this number is reason to investigate further.
Additional Comments
You should not rely on one counter to determine a bottleneck. Look for multiple
counters to confirm your analysis.
See if SQL Server is doing a lot of I/O. Tools like SQL Profiler and Blocker
Script output can tell you if the query(s) are resulting in these large I/Os,
and tuning or optimizing these individual queries from a statistics and indexes
point of view may help from the disk I/O perspective. An improper index strategy
can add to a disk subsystem workload. For example, instead of doing an Index
Seek, the SQL Server Optimizer may choose to do a Table/Index Scan, resulting
in a lot of unnecessary I/O.
If the disk subsystem is a bottleneck, use the Index Tuning Wizard to go through
a proper workload of the production load (captured in the form of a trace file)
and see if it recommends any additional indexes.
You have seen that adding appropriate indexes reduces load on disks, thus reducing
the disk I/O. If the proper indexes do exist, adding more spindles to the RAID
configuration may also help.
Your disk subsystem should be able to handle your peak workload. Suppose that
the slow performance is for thirty minutes each day; the thirty-minute interval
should be investigated. If you use System Monitor counters for the entire day,
the averages will show that there is no disk bottleneck.
© 2003 Microsoft