Shallow queue... What’s that?


A shallow queue somewhere in the I/O path (or an overflowing port buffer) will cause the I/O to back off. The queues need to be deep enough to absorb bursts, so sometimes increasing the queue depth is important. If the problem isn’t actually the bursts, though, but an I/O service time that can’t keep up with the sustained workload (that is, you have a slow or underconfigured array), increasing the queue depth will help for only a fraction of a second. After that the deeper queue fills up as well, and all you have done is increase the latency even more.
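To make the burst-versus-sustained distinction concrete, here is a minimal sketch of a bounded queue in discrete time steps. It is a toy model, not how any real HBA or array behaves: the function name `simulate`, the tick-based timing, and all the numbers are assumptions chosen for illustration.

```python
def simulate(arrivals, service_per_tick, queue_depth):
    """Toy model of a bounded I/O queue.

    arrivals: number of I/Os arriving in each tick (hypothetical workload)
    service_per_tick: I/Os the array completes per tick
    queue_depth: maximum number of I/Os allowed to wait
    Returns (rejected, peak_queued).
    """
    queued = 0
    rejected = 0
    peak = 0
    for arriving in arrivals:
        queued += arriving
        if queued > queue_depth:
            # Queue is full: the excess I/O has to back off.
            rejected += queued - queue_depth
            queued = queue_depth
        peak = max(peak, queued)
        # The array drains a fixed number of I/Os per tick.
        queued = max(0, queued - service_per_tick)
    return rejected, peak


burst = [48] + [4] * 9      # one spike, then a light load
sustained = [12] * 100      # steady load above what the array can service

for name, workload in (("burst", burst), ("sustained", sustained)):
    for depth in (32, 64):
        rejected, peak = simulate(workload, service_per_tick=8, queue_depth=depth)
        print(f"{name:9s} depth={depth:2d} rejected={rejected:3d} peak={peak:2d}")
```

In this model the deeper queue absorbs the single burst completely, but under the sustained overload it only postpones the back-off by a few ticks and ends up holding twice as many waiting I/Os, which means roughly twice the wait for the last I/O in line.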

While most customers will never run into this problem, some do. In VMware land, the usual cause is that the default LUN queue depth (and the corresponding Disk.SchedNumReqOutstanding value) is 32. For most use cases that is just fine, but when you have a datastore with many small VMs sitting on a single LUN, microbursting patterns become more likely.
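As a rough back-of-the-envelope illustration of why many small VMs on one LUN can exhaust a queue of 32, the sketch below divides the per-host LUN queue depth by the average number of outstanding I/Os each VM keeps in flight. The function name and the per-VM figures are hypothetical, and the even-load assumption is a simplification: real workloads burst, which is exactly when the queue fills.

```python
import math


def vms_before_queue_fills(lun_queue_depth, avg_outstanding_per_vm):
    """Estimate how many VMs on one host can share a LUN before the
    host's LUN queue is full on average (assumes steady, even load)."""
    if avg_outstanding_per_vm <= 0:
        raise ValueError("avg_outstanding_per_vm must be positive")
    return math.floor(lun_queue_depth / avg_outstanding_per_vm)


# Default depth of 32 with hypothetical per-VM averages.
for per_vm in (1, 2, 4, 8):
    print(per_vm, vms_before_queue_fills(32, per_vm))
```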

This is covered in this whitepaper, and summarized in this table (which I’ve referred to along with Vaughn). In both this table and the real world, the column on the left (outstanding I/O per LUN) is generally not the factor that determines the maximum number of VMs; it’s the “LUN queue on each ESX host” depth column.

image

If you think you might be running into this problem, it’s pretty easy to diagnose. Launch esxtop and press “u” to switch to the disk device screen. You’ll see a table like this, where QUED is the number of commands currently queued (the configured device queue depth is shown as DQLEN).

image

If this shows as 32 all the time, or during “bad performance periods”, check the array service time, which is DAVG/cmd. KAVG/cmd is the time spent in the VMkernel, and GAVG/cmd is the sum of the two. If DAVG/cmd is low (6-10ms), you should probably increase the queue depth. If you have a high array service time (>20ms), you should consider changing the configuration (usually adding more spindles to the meta object). BTW, KAVG/cmd above 3ms requires immediate attention!
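The rule of thumb above can be written down as a small decision function. This is only a sketch of the thresholds quoted in this post (32, 6-10ms, >20ms, 3ms); the function name `triage` and its return strings are made up for illustration, and the post says nothing about the range between 10ms and 20ms, so the code flags it as a judgment call.

```python
def triage(queued, davg_ms, kavg_ms, queue_depth=32):
    """Apply the post's rule of thumb to one esxtop reading.

    queued: queue reading from the disk device screen
    davg_ms: DAVG/cmd (array service time) in milliseconds
    kavg_ms: KAVG/cmd (time spent in the VMkernel) in milliseconds
    """
    findings = []
    if kavg_ms > 3:
        findings.append("kernel latency high: investigate immediately")
    if queued < queue_depth:
        findings.append("queue not saturated")
    elif davg_ms > 20:
        # A deeper queue would only add latency in front of a slow array.
        findings.append("array slow: fix array configuration first")
    elif davg_ms <= 10:
        findings.append("array fast: consider a deeper queue")
    else:
        # 10-20 ms is not covered by the thresholds in the text.
        findings.append("between thresholds: judgment call")
    return findings


print(triage(queued=32, davg_ms=8, kavg_ms=0.5))
print(triage(queued=32, davg_ms=25, kavg_ms=0.5))
```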

If you find your array service time is long, or the array LUN queue (if your array has one) is a problem, you need to fix that before you look at queue depths and multipathing. On EMC arrays this can be done easily and is included as a basic array function. Check with your storage admins for that.

 

Source: Virtualgeek.typepad.com

About PiroNet

Didier Pironet is an independent blogger and freelancer with +15 years of IT industry experience. Didier is also a former VMware inc. employee where he specialised in Datacenter and Cloud Infrastructure products as well as Infrastructure, Operations and IT Business Management products. Didier is passionate about technologies and he is found to be a creative and a visionary thinker, expressing with passion and excitement, hopefully inspiring and enrolling people to innovation and change.