Talk. Listen.

News.

Unscheduled Downtime

There was some unscheduled downtime last night and this morning. The technical story is a corrupt database table. The human story is that people were unable to use Today’s Meet for several hours because, unfortunately, it happened while I was asleep.

I want to apologize to anyone who was affected by this. While I can never guarantee 100% uptime, it’s still my job to provide as stable and reliable a platform as I can, and when that fails, I take it personally. I’m sorry.

If your room is missing any data, please let me know as soon as possible, through email or Twitter. My rolling backups go for a few days, but sooner is better.

The timing is lousy, because I’m starting a cross-country move tomorrow, but when I land (probably in about three weeks) there are some new monitoring methods I want to integrate to automatically catch, report, and try to fix more obscure cases like this. Third-party uptime monitoring only notices the most egregious errors, like whole-server crashes. Even local monitoring with tools like monit wasn’t fine-grained enough to catch this. That means it’s time to look at custom solutions.
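To give a rough idea of what a finer-grained check might look like: MySQL's CHECK TABLE statement returns rows with the columns Table, Op, Msg_type and Msg_text, and a small script could flag any table that reports an error or a non-OK status. This is only a sketch under assumptions, not the monitoring described in this post. The function name, the table names and the sample rows are all hypothetical, and running the actual query is left to whatever database client is in use.

```python
from typing import Iterable, List, Tuple

# One row per line of CHECK TABLE output: (Table, Op, Msg_type, Msg_text).
Row = Tuple[str, str, str, str]

# Status messages treated as healthy. This set is an assumption and may
# need adjusting for a particular storage engine or MySQL version.
HEALTHY_MESSAGES = {"OK", "Table is already up to date"}


def unhealthy_tables(rows: Iterable[Row]) -> List[str]:
    """Return the tables that report an error row or a non-healthy status row."""
    bad: List[str] = []
    for table, _op, msg_type, msg_text in rows:
        kind = msg_type.lower()
        is_error = kind == "error"
        is_bad_status = kind == "status" and msg_text not in HEALTHY_MESSAGES
        if (is_error or is_bad_status) and table not in bad:
            bad.append(table)
    return bad


# Hypothetical sample rows, made up for illustration only.
SAMPLE_ROWS: List[Row] = [
    ("app.rooms", "check", "status", "OK"),
    ("app.messages", "check", "error", "illustrative error text"),
    ("app.messages", "check", "status", "Corrupt"),
]

if __name__ == "__main__":
    for name in unhealthy_tables(SAMPLE_ROWS):
        print(f"ALERT: {name} needs attention")
```

A scheduled job could feed real CHECK TABLE output into a function like this and send an alert, or attempt a repair, whenever the returned list is not empty.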

But, as I said, I’m moving, so it will be a little while. I will make sure everything stays up and running. These problems have been pretty rare. Hopefully nothing happens while I’m driving through Nebraska!

You are likely to be eaten by a grue.

A few weeks ago, visitors to Today’s Meet were greeted, not with the regular interface, but with a cryptic message warning that it was, in fact, very dark. They were likely to be eaten by a grue.

I don’t know how many of you are familiar with the Zork games? No? It doesn’t really matter.

The issue was caused by the database server. Specifically, it wasn’t running: MySQL failed to start with the system after a routine weekly restart.

Unfortunately, the monitoring I had in place at the time only checked that there was a response from the web server, since the database had been 100% reliable until that incident. The new monitoring checks the database server directly.
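As an illustration of the difference, a check aimed at the database server itself can be as small as confirming that something accepts connections on the database port. This is a minimal sketch, not the actual monitoring in place here. The function name is hypothetical, and the host and port are assumptions (3306 is MySQL's default port).

```python
import socket


def port_accepts_connections(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        # when nothing is listening or the host cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `port_accepts_connections("127.0.0.1", 3306)` would return False in a situation like this one, where the web server still answers but MySQL never started. It only proves that something is listening, so a more thorough check would also log in and run a trivial query.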

More unfortunately, there seems to have been some data loss at the time, which wasn’t noticed until more recently. I deeply apologize to those affected by the data loss. I know what a blow that can be. I have extended the life of my rolling database backups to provide more insulation against this type of thing.

I hope this will not happen again, and I hope that this transparency is appreciated, and that you will continue to use Today’s Meet.
