There have been a handful of reports over the past few months of rooms set to expire in a year closing early. Unfortunately, once a room is closed, there is no way to retrieve the data; it’s gone forever. To avoid users unexpectedly losing data, I’ve temporarily disabled year-long rooms.
Unfortunately this is a very difficult issue to diagnose, owing to the long times involved. Users don’t always check rooms very frequently, so reports of early closures arrive anywhere from a few weeks to a few months after a room was opened. So far, my efforts to find a root cause have been fruitless.
I do hope to address this issue very soon and re-enable year-long rooms, but I won’t do it before I’m confident that users will not lose data.
Last night’s maintenance was a success, but was limited in scope.
Over the next two evenings, Saturday and Sunday, 20-21 October, from around 7pm Eastern Time, there may be small service interruptions for additional capacity building and monitoring work. This work should not require significant downtime.
Update: (20 Oct 2012 – 3:30pm EDT)
After a rough hour or so where cache performance was spotty, it’s now warmed up and running smoothly. This should improve response times during periods of high traffic to any particular room.
During the first two presidential debates this year, TodaysMeet experienced much higher than average load and did not respond as well as it should have. This was particularly acute during the second debate last night, causing many people to abandon their TodaysMeet rooms out of frustration. For this I am deeply sorry.
Before I go into some of the grittier technical details, I want to assure you that I understand the underlying issues and am doing what I can to ameliorate them. Not everything can be addressed quickly, but I have a good handle on significantly improving stability in time for the next debate on Monday. And I have an exciting announcement coming up about addressing stability and performance at a much deeper level.
So what actually went wrong?
There are a number of small things that contributed to extremely high server load and the site becoming unresponsive.
“via web” is very slow. On all posts on TodaysMeet, you see a name, a date, and the phrase “via web”. Back when I built TodaysMeet, I’d intended to open up the API for people to build clients besides the website. (And when you could still import tweets into a room, the tweets would be labeled “via Twitter”.) That never happened, but TodaysMeet still looks up the “source” of every message, even though they are all the same (“web”). This turns what should be a very simple database query into a very complicated one, and the server spends significantly longer than it needs to looking up messages (and message lookups occur about once every 12–15 seconds for every user on the site). It also complicates database writes every time someone posts a message.
During the second debate I reworked some of the code around this to avoid looking up the source. Since there is only one source, this change is permanent.
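To illustrate the kind of change this was, here is a minimal sketch. The table and column names below are made up for illustration (they are not TodaysMeet’s actual schema), and SQLite stands in for the real database:

```python
import sqlite3

# Illustrative schema only -- not TodaysMeet's actual tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE message (
        id INTEGER PRIMARY KEY,
        room TEXT, author TEXT, body TEXT,
        source_id INTEGER REFERENCES source(id)
    );
    INSERT INTO source (id, name) VALUES (1, 'web');
    INSERT INTO message (room, author, body, source_id)
        VALUES ('demo', 'alice', 'hello', 1);
""")

# Before: every message lookup joins against the source table,
# even though every row's source is the same ('web').
old = conn.execute("""
    SELECT m.author, m.body, s.name
    FROM message m JOIN source s ON s.id = m.source_id
    WHERE m.room = ?
""", ("demo",)).fetchall()

# After: skip the join entirely and hard-code the one known source.
new = [(author, body, "web") for author, body in conn.execute(
    "SELECT author, body FROM message WHERE room = ?", ("demo",))]

assert old == new  # same result, much simpler query
```

With only one possible source, dropping the join removes work from the single hottest query on the site without changing what users see.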
There were unnecessary services running. Very much like opening too many programs on your personal computer, having too many services running makes them all compete for server resources like CPU time and RAM. Some of these are installed by default by the operating system and I’d never taken the time to turn them off. Others I had installed and started, intending to use them, but never actually did.
I shut down as many of these services as I confidently could, and will audit the server to make sure only absolutely necessary processes are running.
The database configuration is inefficient. Like the “via web” issue, but on a larger scale, the way the database is currently set up is not very efficient. This is how the database was set up four or five years ago, and the technology has changed since then.
This sort of change requires significant downtime to effect, so I want to be sure to get it right the first time. I will be doing some research and making a decision on how to proceed here.
The web server was logging too aggressively. For technical and, ironically, performance reasons, when you visit TodaysMeet, you are hitting a web server (nginx) that talks to another web server behind the scenes (Apache). Both of these servers write log files to disk, even though those logs are almost exact duplicates of each other. Normally, this is fine—maybe a waste of disk space but ultimately not a big deal. However, during the debate, it became an issue as too many things were trying to write to disk at the same time.
During the debate, I reduced the amount of text Apache was writing to its logs, and now I have disabled its logging entirely, since the logs at the nginx level provide a superset of the information.
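As a sketch of what that change looks like (the directive names are Apache’s standard ones, but the file paths here are illustrative, not the actual server configuration):

```apache
# Before: Apache wrote its own access log, duplicating nginx's.
# CustomLog /var/log/apache2/access.log combined

# After: the CustomLog directive is removed entirely, so Apache
# writes no access log. nginx, sitting in front, still records
# every request, and Apache's ErrorLog stays on for diagnostics.
ErrorLog /var/log/apache2/error.log
```

Since every request passes through nginx first, its access log is a superset of Apache’s, and nothing is lost by turning the duplicate off.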
The server is too small. This is not such a small issue. TodaysMeet has gotten away with using very little hardware for a very long time. I’ve only had to increase its resource allocation once in the past several years. Now I need to do so again.
This is the simplest thing I can do, but requires at least half an hour to an hour of hard downtime. I will schedule that for sometime this week and announce it here.
Again, I want to apologize for the inconvenience and disruption during the debates. My goal is to provide a stable, reliable service, and I hope that you’ll give me another chance.
Beyond the changes I outlined above, I have been working on a project to significantly improve the speed and reliability of TodaysMeet. I had planned to announce that here this week, but it won’t be ready soon enough for the third presidential debate, so I’m focusing on short term fixes right now and will come back to the bigger project.
Those of you following TodaysMeet on Twitter may have seen a few tweets today about a styling bug in Internet Explorer 8. This seems like as good a time as any to bring a new tool to your attention: Today’s Meet Service Status.
This very simple site contains short status updates, and has an RSS feed, should you need it.
I’ll be using this to post updates on the Today’s Meet service when I’m either in a place where I can’t write a full blog post, or when I’d rather concentrate on fixing the issue and explain it later. It will be the first, best place to look during outages, both planned and unplanned. Both the status feed and this blog now run on a server separate from Today’s Meet itself, so they should remain reachable even when the main service is down.
Eventually, I may configure the status updates to cross post to Twitter, but for now you’ll just have to subscribe to the feed.
For those of you interested in the details of the bug, follow me below the break.
A few weeks ago, visitors to Today’s Meet were greeted, not with the regular interface, but with a cryptic message warning that it was, in fact, very dark. They were likely to be eaten by a grue.
I don’t know how many of you are familiar with the Zork games. No? It doesn’t really matter.
The issue was caused by the database server. Specifically, it wasn’t running: MySQL failed to start with the system after a routine weekly restart.
Unfortunately, the monitoring I had in place at the time only checked that there was a response from the web server, since the database had been 100% reliable until that incident. The new monitoring checks the database server directly.
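A direct database check can be as simple as running a trivial query and treating any failure as “down”. Here is a minimal sketch using Python’s DB-API, with SQLite standing in for MySQL so it runs anywhere; the function name is mine for illustration, not TodaysMeet’s actual monitoring code:

```python
import sqlite3


def database_is_up(connect):
    """Return True if we can open a connection and run a trivial query.

    `connect` is any zero-argument callable returning a DB-API
    connection (in production, a wrapper around the MySQL driver).
    """
    try:
        conn = connect()
        try:
            conn.execute("SELECT 1")
            return True
        finally:
            conn.close()
    except Exception:
        # Any failure -- refused connection, stopped server, bad
        # credentials -- counts as "database down" for monitoring.
        return False


# SQLite stands in for MySQL so the sketch is runnable as-is.
print(database_is_up(lambda: sqlite3.connect(":memory:")))  # prints True
```

The key difference from the old monitoring is that this talks to the database itself, so a web server that happily serves an error page no longer masks a stopped MySQL.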
More unfortunately, there seems to have been some data loss at the time, which wasn’t noticed until more recently. I deeply apologize to those affected by the data loss. I know what a blow that can be. I have extended the life of my rolling database backups to provide more insulation against this type of thing.
I hope this will not happen again, and I hope that this transparency is appreciated, and that you will continue to use Today’s Meet.
Today’s Meet rooms are no longer case sensitive. I originally intended this as a small privacy feature, but what I’ve seen is people getting confused and creating duplicate rooms, splitting up their group. So now http://todaysmeet.com/TodaysMeet and http://todaysmeet.com/todaysmeet will get you into the same room.
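Under the hood, a change like this amounts to normalizing the room name to one canonical form before lookup and creation. A minimal sketch (the function and in-memory store are illustrative, not TodaysMeet’s actual code):

```python
def normalize_room_name(name):
    """Fold a room name to one canonical form so that
    /TodaysMeet and /todaysmeet resolve to the same room."""
    return name.strip().lower()


rooms = {}  # canonical name -> list of messages


def get_or_create_room(name):
    """Look up a room by its normalized name, creating it if needed."""
    return rooms.setdefault(normalize_room_name(name), [])


# Both spellings now reach the same room instead of splitting the group.
assert get_or_create_room("TodaysMeet") is get_or_create_room("todaysmeet")
```

Normalizing at the single entry point means no other part of the site needs to care how a visitor happened to type the URL.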