Unscheduled Downtime

There was some unscheduled downtime last night and this morning. The technical story is a corrupt database table. The human story is that people were unable to use Today’s Meet for several hours because, unfortunately, it happened while I was asleep.

I want to apologize to anyone who was affected by this. While I can never guarantee 100% uptime, it’s still my job to provide as stable and reliable a platform as I can, and when that fails, I take it personally. I’m sorry.

If your room is missing any data, please let me know as soon as possible, through email or Twitter. My rolling backups only go back a few days, so sooner is better.

The timing is lousy, because I’m starting a cross-country move tomorrow. When I land (probably in about three weeks), there are some new monitoring methods I want to integrate to automatically catch, report, and try to fix more obscure cases like this. Third-party uptime monitoring only notices the most egregious errors, like whole-server crashes. Even local monitoring with tools like monit wasn’t fine-grained enough to catch this. That means it’s time to look at custom solutions.
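To give a rough idea of what a custom check might look like, here is a minimal sketch, not what actually runs on the site. It assumes a Python script and uses SQLite as a stand-in database; the function name `check_tables` and the table names `rooms` and `messages` are hypothetical. The point is only that the check performs a real read against each table, which is the kind of failure a simple "is the server up?" ping would miss.

```python
import sqlite3


def check_tables(conn, tables):
    """Return a dict mapping each failing table name to its error message."""
    failures = {}
    for table in tables:
        try:
            # Do a real read; an unreadable or missing table raises here,
            # even though the server itself still answers requests.
            conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        except sqlite3.Error as exc:
            failures[table] = str(exc)
    return failures


if __name__ == "__main__":
    # Hypothetical setup: an in-memory database with only one of two tables.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rooms (id INTEGER PRIMARY KEY, name TEXT)")
    problems = check_tables(conn, ["rooms", "messages"])
    for table, error in problems.items():
        print(f"ALERT: table {table!r} failed its check: {error}")
```

Run from a scheduler every few minutes, something like this could send an alert (or kick off a repair) as soon as a single table stops responding, rather than waiting for the whole server to go down.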

But, as I said, I’m moving, so it will be a little while. I will make sure everything stays up and running. These problems have been pretty rare. Hopefully nothing happens while I’m driving through Nebraska!