Atomic operations between keys

I've been hanging out with the cool kids lately, and learning new technologies. One common problem I notice people have, is safely emulating atomic operations between keys without transactions. i.e. in the classic example:

User Account 1 has $100 in it.
User Account 2 has $100 in it.
Transfer $20 From 1 -> 2.

Now, assume that our database crashes during the balance transfer indicated above. What happens?

The most common way I have seen this emulated is using a third-party 'table' (a.k.a. a journal, or log) and a scheme that could probably be described as a two phase commit. I see subtle bugs in most implementations of this. I do not want to point any fingers, so I will show two naive examples in my own non-transactional database, MyISAM:

Example 1: Using the third-party table to log "the atomic action"


CREATE TABLE accounts (
 id INT NOT NULL primary key auto_increment,
 balance INT NOT NULL

CREATE TABLE modifications_log (
 id INT NOT NULL primary key auto_increment
 account_1 INT NOT NULL,
 account_2 INT NOT NULL,
 operation VARCHAR('20') NOT NULL,
 amount INT NOT NULL,
 is_complete TINYINT,

# Load initial data
INSERT INTO accounts VALUES (1, 100), (2, 100);

# Statement 1: Transfer $20 from A->B in the journal.
INSERT INTO modifications_log (account_1, account_2, operation, amount, is_complete) VALUES (1, 2, 'MOVE_LEFT_TO_RIGHT', 20, 0);

# Statements 2 & 3: Perform the actual modifications
UPDATE accounts SET balance =  balance-20 WHERE id = 1;
UPDATE accounts SET balance =  balance+20 WHERE id = 2;

# Statement 4: Mark action as complete
UPDATE modifications_log SET is_complete = 1 WHERE id = [[insert id from statement 1]];

The (worst) problem with this design, is that it doesn't actually solve the problem that it was designed to solve. If we have a failure, we can't just look at the modifications_log and see which updates still need to be applied.

Let me explain: If we found an incomplete modification (modifications_log.is_complete=0), we know that statement 4 was not successful, and that statement 1 was successful, but we have *absolutely* no idea as to if statements 2 or 3 were successful. It is also unsafe to just re-apply these statements, because they are not idempotent. In simple terms idempotent means that we should be able to replay a statement over and over, and produce the same result.

Example 2: Using the third party table to log resulting values

CREATE TABLE accounts (
 id INT NOT NULL primary key auto_increment,
 balance INT NOT NULL

CREATE TABLE modifications_log (
 id INT NOT NULL primary key auto_increment
 account_1_new_value INT NOT NULL,
 account_2_new_value INT NOT NULL,
 is_complete TINYINT,

# Load initial data
INSERT INTO accounts VALUES (1, 100), (2, 100);

# Statement 1: Fetch what will be the new balance (current less $20)
SELECT balance-20 FROM accounts WHERE id = 1;

# Statement 2:  Fetch what will be the new balance (current plus $20)
SELECT balance+20 FROM accounts WHERE id = 2;

# Statement 3: Transfer from A->B in the modification log.
INSERT INTO modifications_log (account_1_new_value, account_2_new_value, is_complete) VALUES ([[statement 1 result]], [[statement 2 result]], 0);

# Statements 4 & 5: Perform the actual update
UPDATE accounts SET balance = [[statement1 result]] WHERE id = 1;
UPDATE accounts SET balance = [[statement2 result]] WHERE id = 2;

# Statement 6: Mark the whole operation as complete
UPDATE modifications_log SET is_complete = 1 WHERE id = [[insert id from statement 3]]

This solution makes it easy to recover. If statement 3 was successful, but you had a crash before statement 6, you should be able to reapply statements 4 & 5 safely, regardless of whether or not they had previously been run.

However, this design also has its own problems:

1) When you run statements 1 & 2, you need to make sure nobody else can modify the data until you get all the way to statement 6, in any part of the application. If you fail to do this, you have a potential race condition. For example, if we try two transfers at once:

# Statement 1 notices 100 is in the table, subtracts 20, returns result of 80.
# Statement 2 notices 100 is in the table, adds 20, returns result of 120
--> [[race condition query]] A request to transfer $10 from account 1 to account 2 commences, deciding that the new balance should be $90 and $110 respectively. 
# Statement 4 runs as it would have, and sets account #1 to $80.

By the time the race condition query is finished, we have either $90 and $110, respectively, or $80 and $120, when really we should have $70 and $130. Eek!

We can prevent the race condition by introducing locking to the equation. However, this introduces its own can of worms (noting that even if our database server doesn't support locking we can normally emulate it as long as it has some sort of atomic CAS operation).

So onto the first example of locking:

LOCK account1;
# run statement 1
LOCK account2;
# run statement 2
# run statement 3
# run statement 4
# run statement 5
# run statement 6
UNLOCK account1;
UNLOCK account2;

The problem with this locking implementation, is that it is also incomplete. What happens if we have SESSION1 issue a statement #1 that locks like this:

 LOCK account1;
 # run statement 1

And around the same time SESSION2 issue a statement 1 that looks like this:

LOCK account2;
 # run statement 1

(.. and also assume for the example that for statement 2, both want to acquire a lock on the other record.)

In this case, neither operation SESSION1 or SESSION2 will be able to complete, as they will be blocked like a dog chasing its tail. We have ourselves a deadlock: each SESSION has a resource the other one wants, and will not release their respective locks until they have finished all work.

What we have to do is work on (a) perhaps choosing a lock acquisition path that is less prone to deadlocks, and ideally (b) implementing deadlock detection.

I believe the Dining philosophers problem tells us that if we number our forks acquire locks in an agreed upon way we should be OK. i.e. order the account ids, always pick the lower number row first, and always acquire all locks before starting any work. (Noting that I say "I believe" in the context above, because it makes me nervous talking in tautologies here. I usually leave this problem to smarter people).

In terms of deadlock detection, we are lucky in the context above that data is only modified after all locks have been acquired so we do not ever have to worry about rolling back work. A poor man's method would be to say that any SESSION that has exceeded N seconds is automatically invalidated, and must be retried by the application.

In summary:

  • I am sure it is possible to get this done right.
  • It is not actually easy to get it done right. Every time you invest in those magical statements START TRANSACTION and COMMIT, you save yourself a lot of hidden effort.

On Testing MySQL Knowledge

I can't read posts like this one without at least a little chuckle. Is the number one question you should be asking people how to start and stop MySQL on Windows? How does that really demonstrate how good someone is at their job when most people deploy on Linux[1]?

The original MySQL certification for 4.1 used to ask a whole bunch of trivia exactly like this - my favourite was a question where you had to say if a particular subqueries caused a syntax error. I don't know how this tests skill, since most subqueries shouldn't be used in production (hint: they are unoptimized in MySQL).

But MySQL changed its certification format: the new exams are Performanced-Based. This means to pass, you have to solve some of the problems you will be doing in real life. Hats off to Dave for leading this initiative.

Technical interviews need to change just as MySQL has. They should be organized in a way that doesn't intimidate the candidate who might know what they are doing, but can't always express it words when under pressure. Silly questions and 'gut feelings' about responses tend to favour the over confident.

I had a hand in designing the interview process at Percona. One of the steps candidates go through is a challenge to be completed on two running EC2 instances. I don't think it's flawless, but you tell me what is likely to be a better indication of talent:

Test #1:
* What does tee command do in MySQL?
* What is a serial data type in MySQL?
* If I created a column with data type VARCHAR(3), what would I expect to see in MySQL table?

Test #2:
* Log into these two servers (xxx is the master, yyy is the slave).
* Tell me if you think there is a replication problem.
* Resync the slave using the lowest impact method possible if there is.
* Optimize these two queries while you are at it.

Test #2 isn't the actual test we use, but it's not far off.

I know a lot of DBAs that probably can't answer test #1 correctly. Does this mean they are bad at their job? That is one possibility, but the more likely is that this test is useless and should only come out on Pub Trivia night.

[1] On an unrelated note, the example answers for questions 3, 11, 12, 14, 15, 22 also demonstrate a misunderstanding ranging from small to just fundamentally wrong.

My MySQL Conference Submission 2010: Workarounds

I just submitted an abstract for the MySQL Conference 2010. What do I want to talk about? Workarounds.

Domas likes to do tricks with GDB. Sheeri likes symlinking log files to /dev/null when not in use. Peter likes to do what he calls a delayed join, and I like to use IN() lists.

What's your favorite workaround? Are there any that you've seen people use in production that no longer work (or could be considered harmful)?

What I can think of so far:
* We no longer have to use this .frm trick on ENUM columns in 5.1.
* Remembering to "FLUSH LOGS" when you delete the log file and add the symlink is important ;) You can't see them in an ls, but the open log files will just keep growing in the background (can be seen with lsof). If you're using MySQL 5.1, it's much better just to handle this at the mysql server level.

Input wanted!

Hidden gems in 5.1

I think 5.1 gets some bad press for not being a compelling upgrade. It's not the big features that make the difference, but the subtle ones. I wanted to highlight some of the these that may make your life easier when it's time to upgrade:

* Prepared statements can now use the query cache. BUG #735
* InnoDB auto_increment insertion is more scalable. Manual Page
* The long_query_time can be set to values less than 1 second. BUG #6238
* SHOW PROFILES is available for everyone, not just community users!
* Creating Triggers no longer requires the SUPER privilege.

Let's hope for 1000 more of these in MySQL 5.4.

Efficient way to copy large amounts of data?

Dear Lazyweb,

Yesterday I tried to Rsync a MySQL data directory from serverA to serverB on the same network. I thought that if out of a few hunded gigabytes maybe 2% changed, this should work, right? Wrong. Rsync is designed to minimize bandwidth, so in my case it was *much* quicker to wipe the data and start again (I feel this is something I should have known earlier, but it doesn't hurt to try and share your mistakes).

Which gets me thinking - it doesn't have to be this way. Bit-torrent works in a similar way to Rsync, but it's certainly not network efficient. Are there any projects similar to Rsync that are using network-hungry algorithms to try and make sure that two directories are in sync with the goal just being as fast as possible?

Very Simple Introduction to Using XtraBackup on Max OS X

I've started using Xtrabackup to backup MySQL on my MacBook. Here's an example of a quick backup and restore:

1. Download the latest .tar.gz from Percona:

$ cd /tmp
$ wget
$ tar -xzf xtrabackup-*-macos.x86_64.tar.gz
$ cd xtrabackup*

2. This directory should contain innobackupex-1.5.1 and xtrabackup. You need to install these into a directory that appears in your $PATH. In my case, I am going to group it install it where my MySQL binaries are located (/usr/local/mysql/bin):

$ ls
innobackupex-1.5.1 xtrabackup
$ cp * /usr/local/mysql/bin/

3. Create a directory where you want your backup to go. In this case it's just a demo - so I'll use my tmpdir.

$ mkdir -p /tmp/backup

4. Test running the backup:

$ innobackupex-1.5.1 /tmp/backup/

Quick Explanation:
innobackupex-1.5.1 is a Perl script that insures that all of your non-InnoDB tables and other MySQL meta data is backed up. You can think of it as a wrapper around xtrabackup, which backs up the data inside InnoDB.

5. Check the data is backed up:

$ cd /tmp/backup
$ ls
$ cd 2009-04-08_15-12-52
$ ls
backup-my.cnf mysql-stderr xtrabackup_binlog_info
employees mysql-stdout xtrabackup_checkpoints
ibdata1 test xtrabackup_logfile

6. Attempt a recovery:

innobackupex-1.5.1 --copy-back /tmp/backup/2009-04-08_15-12-52

Disclaimer: In case you didn't know it, I work for Percona - the company that wrote xtrabackup.

Who is going to make MySQL easier to use?

This may appear a bit of a rant - but it's really intended as more of an observation from having trained people how to use MySQL, and noticing that everyone seems to make the same beginner mistakes. If you read the "Continued MySQL Values" on the MySQL Website, you'll notice that the third one in the list is:
  • The best and the most-used database in the world for online applications
  • Available and affordable for all
  • Easy to use
  • Continuously improved while remaining fast, secure and reliable
  • Fun to use and improve
  • Free from bugs

Note that "Fun to use" doesn't sound much like a database, and "free from bugs" will always be a distant dream ;).

It was Easy to Use that got me into MySQL, but I think this is one of the goals that has lost focus over the years. If you look at, there are a lot of annoying little S5 (Feature Requests) that would probably take the right person only a few minutes to fix. The sort of things I am talking about are:
  • SHOW SLAVE STATUS - Has at least one annoying ease of use bug. If it refuses to connect to the master because it shares the same server id, it won't show you that here. You have to go to the log file.
  • SHOW SLAVE STATUS Also shows you the 'last error' it incurred, with no way to clear this error. It confuses beginners
  • What is the difference between this wait_timeout and interactive_timeout thing, and why is it that when I set wait_timeout to 0 it converts to wait_timeout =1 and disconnects me? - I would prefer wait_timeout=0 to mean unlimited - not that I can think of too many good reasons to use it.
  • The anonymous user serves almost no practical use.
  • Changing InnoDB log file size requires you to do some shutdown and rename trickery, when this should be automatically done for you.
  • When connecting to a server on -h localhost mysql decides that "oh, you want the socket file", and due to misconfiguration it might go to the wrong location and tell you it can't connect even though your server is running.
  • The error log file is very bad at offering levels of configurable verbosity, or consistency which would allow you to grep through it for errors easier. Often it tells me an error code, which I am supposed to look up in perror - but most beginners don't know about perror.
  • ...

But they are probably never going to be fixed. I think it is sad to see that many MySQL users have now seen that these issues don't get fixed, so they no longer submit feature request bug reports. My theory is that:

  • MySQL/Sun is busy implementing new features to snag new customers. If they update a feature (such as the subquery optimizations in 6.0) it is normally going to be one of those big compelling features that all of their customers are demanding.
  • The small number of Community Contributors are most likely going to be writing patches to scratch an itch (either better performance or diagnostics). It's not going to be one of these annoying little things they long ago discovered and will no longer care about.
It's good to see that Drizzle has actually been very good at attacking a large number of items on my list. It's just sad to have to wait until Drizzle is ready, because this part of MySQL isn't broken, it just needs some love.

New Amazon EC2 Features - Reserved Instances

So you can now reserve EC2 instances - which brings the cost down to about 6.7cents/hour averaged over a year.

It's a great idea. I wonder if the next step will be to allow more customized instance types at a price premium, provided a reservation is made for 1-3 years. I could certainly use a machine with faster IO for MySQL boxes.