[Most Recent Entries]
Below are the 20 most recent journal entries recorded in
Morgan Tocker's LiveJournal:
[ << Previous 20 ]
[ << Previous 20 ]
|Thursday, May 2nd, 2013|
|Saturday, September 17th, 2011|
|Atomic operations between keys
I've been hanging out with the cool kids lately, and learning new technologies. One common problem I notice people have, is safely emulating atomic operations between keys without transactions. i.e. in the classic example:
User Account 1 has $100 in it.
User Account 2 has $100 in it.
Transfer $20 From 1 -> 2.
Now, assume that our database crashes during the balance transfer indicated above. What happens?
The most common way I have seen this emulated is using a third-party 'table' (a.k.a. a journal, or log) and a scheme that could probably be described as a two phase commit. I see subtle bugs in most implementations of this. I do not want to point any fingers, so I will show two naive examples in my own non-transactional database, MyISAM:
Example 1: Using the third-party table to log "the atomic action"
CREATE TABLE accounts (
id INT NOT NULL primary key auto_increment,
balance INT NOT NULL
CREATE TABLE modifications_log (
id INT NOT NULL primary key auto_increment
account_1 INT NOT NULL,
account_2 INT NOT NULL,
operation VARCHAR('20') NOT NULL,
amount INT NOT NULL,
# Load initial data
INSERT INTO accounts VALUES (1, 100), (2, 100);
# Statement 1: Transfer $20 from A->B in the journal.
INSERT INTO modifications_log (account_1, account_2, operation, amount, is_complete) VALUES (1, 2, 'MOVE_LEFT_TO_RIGHT', 20, 0);
# Statements 2 & 3: Perform the actual modifications
UPDATE accounts SET balance = balance-20 WHERE id = 1;
UPDATE accounts SET balance = balance+20 WHERE id = 2;
# Statement 4: Mark action as complete
UPDATE modifications_log SET is_complete = 1 WHERE id = [[insert id from statement 1]];
The (worst) problem with this design, is that it doesn't actually solve the problem that it was designed to solve. If we have a failure, we can't just look at the modifications_log and see which updates still need to be applied.
Let me explain: If we found an incomplete modification (modifications_log.is_complete=0), we know that statement 4 was not successful, and that statement 1 was successful, but we have *absolutely* no idea as to if statements 2 or 3 were successful. It is also unsafe to just re-apply these statements, because they are not idempotent. In simple terms idempotent means that we should be able to replay a statement over and over, and produce the same result.
Example 2: Using the third party table to log resulting values
CREATE TABLE accounts (
id INT NOT NULL primary key auto_increment,
balance INT NOT NULL
CREATE TABLE modifications_log (
id INT NOT NULL primary key auto_increment
account_1_new_value INT NOT NULL,
account_2_new_value INT NOT NULL,
# Load initial data
INSERT INTO accounts VALUES (1, 100), (2, 100);
# Statement 1: Fetch what will be the new balance (current less $20)
SELECT balance-20 FROM accounts WHERE id = 1;
# Statement 2: Fetch what will be the new balance (current plus $20)
SELECT balance+20 FROM accounts WHERE id = 2;
# Statement 3: Transfer from A->B in the modification log.
INSERT INTO modifications_log (account_1_new_value, account_2_new_value, is_complete) VALUES ([[statement 1 result]], [[statement 2 result]], 0);
# Statements 4 & 5: Perform the actual update
UPDATE accounts SET balance = [[statement1 result]] WHERE id = 1;
UPDATE accounts SET balance = [[statement2 result]] WHERE id = 2;
# Statement 6: Mark the whole operation as complete
UPDATE modifications_log SET is_complete = 1 WHERE id = [[insert id from statement 3]]
This solution makes it easy to recover. If statement 3 was successful, but you had a crash before statement 6, you should be able to reapply statements 4 & 5 safely, regardless of whether or not they had previously been run.
However, this design also has its own problems:
1) When you run statements 1 & 2, you need to make sure nobody else can modify the data until you get all the way to statement 6, in any part of the application. If you fail to do this, you have a potential race condition. For example, if we try two transfers at once:
# Statement 1 notices 100 is in the table, subtracts 20, returns result of 80.
# Statement 2 notices 100 is in the table, adds 20, returns result of 120
--> [[race condition query]] A request to transfer $10 from account 1 to account 2 commences, deciding that the new balance should be $90 and $110 respectively.
# Statement 4 runs as it would have, and sets account #1 to $80.
By the time the race condition query is finished, we have either $90 and $110, respectively, or $80 and $120, when really we should have $70 and $130. Eek!
We can prevent the race condition by introducing locking to the equation. However, this introduces its own can of worms (noting that even if our database server doesn't support locking we can normally emulate it as long as it has some sort of atomic CAS operation).
So onto the first example of locking:
# run statement 1
# run statement 2
# run statement 3
# run statement 4
# run statement 5
# run statement 6
The problem with this locking implementation, is that it is also incomplete. What happens if we have SESSION1 issue a statement #1 that locks like this:
# run statement 1
And around the same time SESSION2 issue a statement 1 that looks like this:
# run statement 1
(.. and also assume for the example that for statement 2, both want to acquire a lock on the other record.)
In this case, neither operation SESSION1 or SESSION2 will be able to complete, as they will be blocked like a dog chasing its tail. We have ourselves a deadlock: each SESSION has a resource the other one wants, and will not release their respective locks until they have finished all work.
What we have to do is work on (a) perhaps choosing a lock acquisition path that is less prone to deadlocks, and ideally (b) implementing deadlock detection.
I believe the Dining philosophers problem tells us that if we
number our forks acquire locks in an agreed upon way we should be OK. i.e. order the account ids, always pick the lower number row first, and always acquire all locks before starting any work. (Noting that I say "I believe" in the context above, because it makes me nervous talking in tautologies here. I usually leave this problem to smarter people).
In terms of deadlock detection, we are lucky in the context above that data is only modified after all locks have been acquired so we do not ever have to worry about rolling back work. A poor man's method would be to say that any SESSION that has exceeded N seconds is automatically invalidated, and must be retried by the application.
- I am sure it is possible to get this done right.
- It is not actually easy to get it done right. Every time you invest in those magical statements START TRANSACTION and COMMIT, you save yourself a lot of hidden effort.
|Thursday, February 4th, 2010|
|On Testing MySQL Knowledge
I can't read posts like this one
without at least a little chuckle. Is the number one question you should be asking people how to start and stop MySQL on Windows? How does that really demonstrate how good someone is at their job when most people deploy on Linux?
The original MySQL certification for 4.1 used to ask a whole bunch of trivia exactly like this - my favourite was a question where you had to say if a particular subqueries caused a syntax error. I don't know how this tests skill, since most subqueries shouldn't be used in production (hint: they are unoptimized in MySQL).
But MySQL changed its certification format: the new exams are Performanced-Based
. This means to pass, you have to solve some of the problems you will be doing in real life. Hats off to Dave
for leading this initiative.
Technical interviews need to change just as MySQL has. They should be organized in a way that doesn't intimidate the candidate who might know what they are doing, but can't always express it words when under pressure. Silly questions and 'gut feelings' about responses tend to favour the over confident.
I had a hand in designing the interview process at Percona
. One of the steps candidates go through is a challenge to be completed on two running EC2 instances. I don't think it's flawless, but you tell me what is likely to be a better indication of talent:
* What does tee command do in MySQL?
* What is a serial data type in MySQL?
* If I created a column with data type VARCHAR(3), what would I expect to see in MySQL table?
* Log into these two servers (xxx is the master, yyy is the slave).
* Tell me if you think there is a replication problem.
* Resync the slave using the lowest impact method possible if there is.
* Optimize these two queries while you are at it.
isn't the actual test we use, but it's not far off.
I know a lot of DBAs that probably can't answer test #1
correctly. Does this mean they are bad at their job? That is one possibility, but the more likely is that this test is useless and should only come out on Pub Trivia night.
 On an unrelated note, the example answers for questions 3, 11, 12, 14, 15, 22 also demonstrate a misunderstanding ranging from small to just fundamentally
|Tuesday, January 12th, 2010|
|My MySQL Conference Submission 2010: Workarounds
I just submitted an abstract for the MySQL Conference 2010
. What do I want to talk about? Workarounds.
Domas likes to do tricks with GDB
. Sheeri likes symlinking log files
to /dev/null when not in use. Peter likes to do what he calls a delayed join
, and I like to use IN() lists
What's your favorite workaround? Are there any that you've seen people use in production that no longer work (or could be considered harmful)?
What I can think of so far:
* We no longer have to use this .frm trick
on ENUM columns in 5.1.
* Remembering to "FLUSH LOGS" when you delete the log file and add the symlink is important ;) You can't see them in an ls, but the open log files will just keep growing in the background (can be seen with lsof). If you're using MySQL 5.1, it's much better just to handle this at the mysql server level.
|Friday, June 5th, 2009|
|Hidden gems in 5.1
I think 5.1 gets some bad press
for not being a compelling upgrade. It's not the big features that make the difference, but the subtle ones. I wanted to highlight some of the these that may make your life easier when it's time to upgrade:
* Prepared statements can now use the query cache. BUG #735
* InnoDB auto_increment insertion is more scalable. Manual Page
* The long_query_time can be set to values less than 1 second. BUG #6238
* SHOW PROFILES
is available for everyone, not just community users!
* Creating Triggers
no longer requires the SUPER privilege.
Let's hope for 1000 more of these in MySQL 5.4.
|Thursday, May 28th, 2009|
|Efficient way to copy large amounts of data?
Yesterday I tried to Rsync a MySQL data directory from serverA to serverB on the same network. I thought that if out of a few hunded gigabytes maybe 2% changed, this should work, right? Wrong. Rsync is designed to minimize bandwidth, so in my case it was *much* quicker to wipe the data and start again (I feel this is something I should have known earlier, but it doesn't hurt to try and share your mistakes).
Which gets me thinking - it doesn't have to be this way. Bit-torrent works in a similar way to Rsync, but it's certainly not network efficient. Are there any projects similar to Rsync that are using network-hungry algorithms to try and make sure that two directories are in sync with the goal just being as fast as possible
|Sunday, April 26th, 2009|
|Wednesday, April 8th, 2009|
|Very Simple Introduction to Using XtraBackup on Max OS X
I've started using Xtrabackup to backup MySQL on my MacBook. Here's an example of a quick backup and restore:
1. Download the latest .tar.gz from Percona:
$ cd /tmp
$ wget http://www.percona.com/mysql/xtrabackup/0.5/xtrabackup-0.5-macos.x86_64.tar.gz
$ tar -xzf xtrabackup-*-macos.x86_64.tar.gz
$ cd xtrabackup*
2. This directory should contain
. You need to install these into a directory that appears in your
. In my case, I am going to group it install it where my MySQL binaries are located (/usr/local/mysql/bin):
$ cp * /usr/local/mysql/bin/
3. Create a directory where you want your backup to go. In this case it's just a demo - so I'll use my tmpdir.
$ mkdir -p /tmp/backup
4. Test running the backup:
$ innobackupex-1.5.1 /tmp/backup/
is a Perl script that insures that all of your non-InnoDB tables and other MySQL meta data is backed up. You can think of it as a wrapper around
, which backs up the data inside InnoDB.
5. Check the data is backed up:
$ cd /tmp/backup
$ cd 2009-04-08_15-12-52
backup-my.cnf mysql-stderr xtrabackup_binlog_info
employees mysql-stdout xtrabackup_checkpoints
ibdata1 test xtrabackup_logfile
6. Attempt a recovery:
innobackupex-1.5.1 --copy-back /tmp/backup/2009-04-08_15-12-52
In case you didn't know it, I work for Percona - the company that wrote xtrabackup
|Who is going to make MySQL easier to use?
This may appear a bit of a rant - but it's really intended as more of an observation from having trained people how to use MySQL, and noticing that everyone seems to make the same beginner mistakes
. If you read the "Continued MySQL Values
" on the MySQL Website, you'll notice that the third one in the list is:
- The best and the most-used database in the world for online applications
- Available and affordable for all
- Easy to use
- Continuously improved while remaining fast, secure and reliable
- Fun to use and improve
- Free from bugs
Note that "Fun to use" doesn't sound much like a database, and "free from bugs" will always be a distant dream ;).
It was Easy to Use
that got me into MySQL, but I think this is one of the goals that has lost focus over the years. If you look at bugs.mysql.com
, there are a lot of annoying little S5 (Feature Requests) that would probably take the right person only a few minutes to fix. The sort of things I am talking about are:
- SHOW SLAVE STATUS - Has at least one annoying ease of use bug. If it refuses to connect to the master because it shares the same server id, it won't show you that here. You have to go to the log file.
- SHOW SLAVE STATUS Also shows you the 'last error' it incurred, with no way to clear this error. It confuses beginners
- What is the difference between this wait_timeout and interactive_timeout thing, and why is it that when I set wait_timeout to 0 it converts to wait_timeout =1 and disconnects me? - I would prefer wait_timeout=0 to mean unlimited - not that I can think of too many good reasons to use it.
- The anonymous user serves almost no practical use.
- Changing InnoDB log file size requires you to do some shutdown and rename trickery, when this should be automatically done for you.
- When connecting to a server on -h localhost mysql decides that "oh, you want the socket file", and due to misconfiguration it might go to the wrong location and tell you it can't connect even though your server is running.
- The error log file is very bad at offering levels of configurable verbosity, or consistency which would allow you to grep through it for errors easier. Often it tells me an error code, which I am supposed to look up in perror - but most beginners don't know about perror.
But they are probably never going to be fixed. I think it is sad to see that many MySQL users have now seen that these issues don't get fixed, so they no longer submit feature request bug reports. My theory is that:
- MySQL/Sun is busy implementing new features to snag new customers. If they update a feature (such as the subquery optimizations in 6.0) it is normally going to be one of those big compelling features that all of their customers are demanding.
- The small number of Community Contributors are most likely going to be writing patches to scratch an itch (either better performance or diagnostics). It's not going to be one of these annoying little things they long ago discovered and will no longer care about.
It's good to see that Drizzle has actually been very good at attacking a large number of items on my list. It's just sad to have to wait until Drizzle is ready, because this part of MySQL isn't broken, it just needs some love
|Wednesday, March 18th, 2009|
|New Amazon EC2 Features - Reserved Instances
So you can now reserve
EC2 instances - which brings the cost down to about 6.7cents/hour averaged over a year.
It's a great idea. I wonder if the next step will be to allow more customized instance types at a price premium, provided a reservation is made for 1-3 years. I could certainly use a machine with faster IO for MySQL boxes.
|Wednesday, February 11th, 2009|
|I wish I had more levels of verbosity in logging
I've been working as a Ruby on Rails developer the last couple of months. It's interesting to see how my impression of MySQL changes when I'm on the other side - and using a development environment I am less familiar with. Here are a two things I wished I could have been able to do:
--log-warnings=2 is enabled, log all statements the server receives that cause warnings or syntax errors.
--log-warnings=2 is enabled and
--some-other-setting, log all statements which return empty results.
Not that it caused me too much pain - but I think I could have benefited. I think I've read something about both of these before too... anyone know if it was in the Drizzle or Google patches?
|Friday, January 30th, 2009|
|MySQL Consulting Companies
Has anyone else noticed that almost all of the consulting companies that support MySQL (and blog) start with a 'P'? (Percona
, Proven Scaling
). I think OpenQuery
needs a name change to keep up with the market. Even though 'O' is the letter next to 'P' in the alphabet, it requires more than one bit flip - so it's not that close.
|Sunday, January 25th, 2009|
|My take on the Sun Database Group visa issues
Seeing this post
make the news today really interested me - since I had the (dis)pleasure of being personally involved. In the Australian spirit of 'giving word to the underdog', let me provide some clarity. But before I do - my kudos to Kaj for already following up and correcting himself on the gray details.
The person in question ('KV') was not going to Australia to speak at a conference, but deliver a public training course. To do this, you do need a business visa. Heck, you need a visa if you want to teach in the USA - so those speculators calling Australia some draconian system that doesn't understand Open Source is just wrong. Some departments
know it very well. A lot more
than my adopted home of Quebec ;)
The only advantage the USA has over Australia, is that speaking at conferences can be done with the VISA waiver system. But then again, in Australia's defense the visas are acquired online, and much easier to acquire than my USA one was ;)Why was local_mysql_activist upset?
local_mysql_activist runs a business in Australia training on Open Source technologies. On the same trip KV was due to teach in Sydney - KV was supposed to teach in Canberra. And it was during that week in Canberra, local_mysql_activist had *already scheduled a class* that would no doubt compete for potential customers.
But competition is good, no? Well, yes. But Sun can survive a lot longer on classes that only half fill than local_mysql_activist can. In the spirit of healthy competition, they could have picked the week before/after - but that's their choice. There are also other cities in Australia that come to mind before Canberra, which would have filled up.
The real problem with the Sun/MySQL course is that it never actually ran. It was canceled at the last minute due to low numbers, and customers were offered credits/refunds. They probably didn't have enough time to book in to local_mysql_activist's class - so in the end he was the real loser.The show did go on!
At the time I was also in the same training group at MySQL/Sun - and I happen to hold an Australian passport. When KV was put-off by local_mysql_activist - the show still went on. People in Australia still got their MySQL DBA course. I think someone somewhere should acknowledge this bit in their stories.
I don't support the way local_mysql_activist went about things. Involving
a government is just a messy, messy, disaster. But I can fully understand his frustrations. What I would have done is capitalized on what MySQL's can't do - offer a completely reputable, third party criticism on which features work and which ones don't; i.e.
Students used to ask me questions about what guides we had on migrating from Oracle to MySQL. I used to tell them there was no really definitive guide, but if there was - don't you think we have a conflict of interest in producing it!?Visa issues suck
Visa issues suck. I completely agree with Kaj on this one, having gone through both a Canadian and a USA visa myself. Governments in general just don't know how to deal with the fact that you can be employed in a different country to where you work.
Initially I was told by an immigration official that I wouldn't require a Canadian visa since my travel loosely met the definitions of a 'business traveler' here on business from Australia, as long as I never received employment in Canada.
That was until I came back in the country one day, and they detained me in immigration for a few hours, while they questioned me about my work - and eventually decided they were going to let me in, but I had 50 days to get a proper visa or get out. I only just made it ;)
[Disclaimer - I no longer work for Sun Microsystems. If I wasn't so busy at my new job, I would have checked my RSS reader and replied earlier!]
|Wednesday, December 24th, 2008|
|Heading to India
I'm about to leave for India for a few weeks - so if you're looking for me in Montreal - I won't be there. What brings me there? this.
|Thursday, December 18th, 2008|
|IO scheduling in the 2.6 kernel
I was surprised by even the gap I saw on Vadim's post
on the improvements of using the Noop IO scheduler. I've been changing my thoughts on what to set the scheduler to lately, and it's all leaning to Noop as the default.An explanation first:
IO Schedulers (aka elevators) are a method of trying to get the best possible performance out of your disk subsystem as possible. Since your disk is essentially a mechanical device - it has a difference in performance between whether or not you are performing actions sequentially - or when you are performing actions randomly. And this difference can be huge! Last time I tested, a typical 7200RPM consumer hard drive could write 60MB/s sequentially, but performance dropped to only a few MB/s when I started trying to write small pieces of random data.So how do the IO schedulers work?
They achieve this (mostly) by doing request reordering and merging, and by trying to read platters in one continuous direction. They may even detect that you are writing sequential blocks, and slightly delay an operation in order to 'save cost'.
Each IO scheduler will have different algorithms regarding how they do this reordering. For example, on a desktop Operating System you are probably more concerned about your MP3s not skipping than about the maximum sustained performance.Death to schedulers
The problem with using techniques like IO scheduling is that the Linux kernel is pretty dumb to all the layers below it. Hard drives themselves have their own scheduling mechanisms, and if you are running a RAID controller *it* will have it's own scheduling mechanisms.
The last point is important - If you are doing scheduling when you have a RAID controller, from Linux's perspective it's probably all one big block device. The scheduler is making all sorts of assumptions about blocks being aligned on disk and it's WRONG WRONG WRONG - you probably have some sort of striping. So all the IO scheduler is doing is adding latency (bad) and to probably applying some partial serialization to writes (double bad).
So in that case, it's better to tell Linux to mind it's own business. In which case you want the Noop scheduler.
If you are curious where to learn more, I think the best references to learn more about scheduling have been some of the talks by the Youtube guys, and an earlier post by Domas Mituzas
|Wednesday, December 17th, 2008|
|Why you don't want to work in a travel job for Sun
I left Sun over a month ago. Despite my manager approving my expense report, the accounting team has refused it. Now I'm supposed to pay a $6,000 Amex bill while I wait for them to figure out what they are going to do.
|Wednesday, December 10th, 2008|
|On Synergy: Culture conflicts between Sun and MySQL
Working at Sun was my first acquisition experience. I guess it was what I expected; managers hyping it up about being a "perfect match", and how much the two companies had in common. It was kind of interesting to see this even turned up a notch after they received additional "Sun management training". Anyway, I digress....
I'll state upfront I consider my experience a bad one (but I'll save the personal stories for another day). Here was an issue I saw while training Sun staff on how to user MySQL:
Sun's has a conflict of interest in selling hardware.
MySQL (InnoDB) doesn't actually *work* on big computers. It only scales up to about 4-8 CPU cores, and then it hits all sorts of internal bottlenecks. Most architectures work around this by using many small machines rather than one big one (aka "scale out").
But for Sun the profits are larger on selling *bigger* hardware. Most of Sun's bigger hardware (SPARC) has many more CPU cores, but each of these cores are infact slower than most Intel/AMD cores. So it doesn't work.
I'm not sure that the "old guard" of Sun Sales people will take to selling smaller, lower margin systems. I can predict them still trying to continue to either sell bigger machines (and suggest deploying Oracle), or sell bigger machines that are actually unsuited to MySQL. I remember hearing a Clayton Christensen talk on when Intel launched Celeron - and they sold them out of a completely different office. That sounded smart.
I think the idea of using commodity hardware installing DRBD+Heartbeat was the hardest to explain to Sun employees in HA classes. They didn't see why someone wouldn't buy a $5,000-$10,000 SAN and be done with it (Note: I should point out DRBD has other advantages besides being a low-cost SAN replacement).
 This review
is just one example. The analysis
is even more interesting.
|Eating your own dog food.
I'm pretty happy to hear that the MySQL Website, and the MySQL Bugs system are powered by 5.1. I think this is a real step forward from when 5.0 was released.
I just want to know when the support.mysql.com website will use 5.1. It has a lot heavier requirements, and with contracted SLAs to customers Sun would be making a real commitment if it were to upgrade that.
|Friday, November 28th, 2008|
|There's nothing point one about 5.1
MySQL 5.1 is GA. Yay!
A lot of new features have been added, and the numbering convention of just adding a .1 doesn't really explain that. If I had of numbered it, I probably would have called it "6.0".
In some ways MySQL has done both themselves (and DBAs) a small injustice. While working at MySQL I met a lot of customers that tended to be conservative - they don't install first releases, but instead wait for the second release.
In the case of 5.1, just be aware that there will be quite a few more features, and with it will be
more bugs. I think it's more stable than 5.0 - but you will still need to do plenty of testing.
I'm happy to see it finally released though - 3 years in the making!
But again if I had it my way, it would have been good to see a real "Point 1" release to 5.0. There were a lot of new features introduced in late 2005 that only required small addition. Changes that were large enough that the current 'no new features in a GA release' rule restricted, but not big enough to break 99% of applications.
I can only hope that 6.0 is 5.1's "point 1 release", and not just a deluge of new features 2 years late. Partitioning could be awesome if things like the "can't mix storage engine" limitation were lifted. Quickly.
I probably will be waiting at least until Percona
update to 5.1 GA, and perhaps another month after that.
 I think Oracle causes this - having traditionally offered much stronger second editions.
 For example; I now don't have to use the SUPER privilege for triggers, but I still have no way of using SIGNAL in a stored procedure. It's a shame that for both of these the compile cache is still per-connection.
|Monday, November 24th, 2008|
|Dead-simple server monitoring solutions
Dear Lazyweb. So, here's my thoughts:
* Create a basic php/rails/insert your application language choice page.
* Have it do a simple SELECT 1+1 from MySQL.
* Print the results to screen.
Are there any third party (independently hosted) monitoring tools out there that (for free or cheap) I can use to then connect to this page, and make sure the results are as expected?