Morgan Tocker (mtocker) wrote,
@ 2008-12-18 09:03:00
Entry tags:mysql

IO scheduling in the 2.6 kernel
I was surprised by how big the gap was in Vadim's post on the improvements from using the Noop IO scheduler. I've been changing my mind lately about what to set the scheduler to, and I'm now leaning towards Noop as the default.

An explanation first:
IO schedulers (aka elevators) try to get the best possible performance out of your disk subsystem. Since your disk is essentially a mechanical device, its performance differs depending on whether you access it sequentially or randomly. And this difference can be huge! Last time I tested, a typical 7200RPM consumer hard drive could write 60MB/s sequentially, but performance dropped to only a few MB/s when I started writing small pieces of random data.
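
To see where a drop like that comes from, here is a back-of-envelope sketch. It is not a benchmark: the 9 ms average seek time and the 16KB request size are numbers I am assuming for illustration, and the model ignores transfer time and any caching on the drive.

```python
def random_throughput_mb_s(rpm, avg_seek_ms, request_kb):
    """Rough random-IO throughput estimate for one spinning disk.

    Assumes every request pays one average seek plus half a rotation,
    and ignores transfer time and on-drive caching.
    """
    # On average the platter turns half a revolution before the
    # requested sector passes under the head.
    rotational_latency_ms = 0.5 * 60_000 / rpm
    service_time_ms = avg_seek_ms + rotational_latency_ms
    requests_per_second = 1000 / service_time_ms
    return requests_per_second * request_kb / 1024


# Assumed values: 7200RPM drive, 9 ms average seek, 16KB requests.
estimate = random_throughput_mb_s(rpm=7200, avg_seek_ms=9.0, request_kb=16)
print(f"{estimate:.2f} MB/s")
```

With those assumptions the estimate lands at roughly one MB/s, which is the same order of magnitude as the drop described above. Most of the time goes into moving the head and waiting for the platter, not into transferring data.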

So how do the IO schedulers work?
They achieve this (mostly) by reordering and merging requests, and by trying to sweep across the platters in one continuous direction. They may even detect that you are writing sequential blocks, and slightly delay an operation so that it can be combined with the next one ('saving cost').
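
A toy sketch of the merging and one-direction sweep might look like the following. This is not the kernel's code: the (start_sector, sector_count) tuples and both function names are made up for illustration, and real schedulers also deal with deadlines, fairness and read/write priorities.

```python
def merge_and_sort(requests):
    """Sort (start_sector, sector_count) requests and merge any that
    touch or overlap into a single larger request."""
    merged = []
    for start, count in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] >= start:
            prev_start, prev_count = merged[-1]
            end = max(prev_start + prev_count, start + count)
            merged[-1] = (prev_start, end - prev_start)
        else:
            merged.append((start, count))
    return merged


def elevator_order(requests, head):
    """Serve everything at or beyond the head position in ascending
    order, then wrap around to the lowest pending request."""
    pending = merge_and_sort(requests)
    ahead = [r for r in pending if r[0] >= head]
    behind = [r for r in pending if r[0] < head]
    return ahead + behind


# Hypothetical queue: three of these requests are contiguous.
queue = [(900, 8), (100, 8), (108, 8), (500, 8), (116, 8)]
order = elevator_order(queue, head=400)
print(order)
```

Five requests become three, and the head only has to travel in one direction before wrapping around once.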

Each IO scheduler has a different algorithm for this reordering, because different workloads have different priorities. For example, on a desktop operating system you are probably more concerned about your MP3s not skipping than about maximum sustained performance.

Death to schedulers
The problem with techniques like IO scheduling is that the Linux kernel knows very little about the layers below it. Hard drives have their own scheduling mechanisms, and if you are running a RAID controller *it* will have its own scheduling mechanisms too.

The last point is important - if you are doing scheduling when you have a RAID controller, from Linux's perspective it's probably all one big block device. The scheduler is making all sorts of assumptions about how blocks are laid out on disk and it's WRONG WRONG WRONG - you probably have some sort of striping. So all the IO scheduler is doing is adding latency (bad) and probably applying some partial serialization to writes (double bad).
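
To make the striping point concrete, here is a minimal sketch of how a plain RAID 0 layout could map the logical block addresses Linux sees onto physical disks. The stripe size (128 sectors) and the four-disk array are assumptions for the example, and real controllers may lay data out differently.

```python
def locate(lba, stripe_sectors, num_disks):
    """Map a logical block address on a simple RAID 0 array to
    (disk_number, sector_on_that_disk)."""
    stripe_index, offset = divmod(lba, stripe_sectors)
    row, disk = divmod(stripe_index, num_disks)
    return disk, row * stripe_sectors + offset


# Assumed layout: 4 disks, 128-sector stripes.
for lba in (0, 127, 128, 512):
    print(lba, locate(lba, stripe_sectors=128, num_disks=4))
```

Blocks 127 and 128 look adjacent to Linux but live on different disks, while blocks 0 and 512 look far apart and yet sit one stripe apart on the same disk. An elevator that sorts by logical address is optimizing for a geometry that doesn't exist.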

So in that case, it's better to tell Linux to mind its own business, which means you want the Noop scheduler.
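
On 2.6 kernels the scheduler can be inspected and changed per device through sysfs, where /sys/block/<device>/queue/scheduler lists the available schedulers with the active one in square brackets. A small sketch for checking and switching it follows; the helper names are my own, "sda" is just an example device, and writing to the file needs root.

```python
from pathlib import Path


def parse_schedulers(text):
    """Parse the contents of a queue/scheduler file, for example
    'noop anticipatory deadline [cfq]', into (active, available)."""
    available, active = [], None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # the bracketed entry is the active one
            active = token
        available.append(token)
    return active, available


def scheduler_path(device):
    return Path("/sys/block") / device / "queue" / "scheduler"


def set_scheduler(device, name):
    """Switch the scheduler for one device (requires root)."""
    path = scheduler_path(device)
    _, available = parse_schedulers(path.read_text())
    if name not in available:
        raise ValueError(f"{name!r} not offered for {device}: {available}")
    path.write_text(name)


# Parsing example only; call set_scheduler("sda", "noop") as root to switch.
active, available = parse_schedulers("noop anticipatory deadline [cfq]")
print(active, available)
```

A change made this way does not survive a reboot. To make it the default on a 2.6 kernel you would normally pass elevator=noop on the kernel command line.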

If you are curious, I think the best references for learning more about scheduling are some of the talks by the YouTube guys, and an earlier post by Domas Mituzas.




(3 comments) - (Post a new comment)


(Anonymous)
2008-12-18 05:10 pm UTC (link)
Small correction: I think a hard disk always spins in the same direction. You probably mean the platters going in a continuous direction.

Additionally, NCQ on disks is also bad. I want my disks to "trust" my controller and not try to be smart themselves. After all we have BBUs on controllers, but not on disks (although a UPS would help too). Also, the quality of the reordering algorithm depends heavily on the make/brand of controller. LSI MegaRAID (Dell PERC) for example are not that good. Arecas are great.

Dieter_be



mtocker
2008-12-18 05:21 pm UTC (link)
Thanks.

I guess it depends on the disks. It's probably similar to things like wear-leveling algorithms on flash controllers - most suck, because it's a hard problem to solve on quite a simple controller.

By Areca being "great", do you mean great for database workloads, or just great in general? (I guess I'm hinting that some controllers might choose to do lower latency reads, at the cost of overall performance).



(Anonymous)
2009-02-01 11:01 am UTC (link)
I meant great in reordering (scheduling) algorithms. But actually, they are great in all aspects :)

Re: db workloads: that's where your write cache comes in (write-back). You basically want to delay your writes as long as possible (if a high read workload demands this). Thanks to BBUs on your controller you know your data is safe once the kernel has pushed it to the block device. (But then again, once your controller has reordered writes and issued them to your disks, your data is volatile again: the disks should just execute them instead of trying to cache them, the controller is smarter than the disks ;-). E.g. disable NCQ.)

Dieter_be


