Presidential Debate Post-mortem

During the first two presidential debates this year, TodaysMeet experienced much higher than average load and did not respond as well as it should have. This was particularly acute during the second debate last night, causing many people to abandon their TodaysMeet rooms out of frustration. For this I am deeply sorry.

Before I go into some of the grittier technical details, I want to assure you that I understand the underlying issues and am doing what I can to ameliorate them. Not everything can be addressed quickly, but I have a good handle on significantly improving stability in time for the next debate on Monday. And I have an exciting announcement coming up about addressing stability and performance at a much deeper level.

So what actually went wrong?

There are a number of small things that contributed to extremely high server load and the site becoming unresponsive.

“via web” is very slow. On all posts on TodaysMeet, you see a name, a date, and the phrase “via web”. Back when I built TodaysMeet, I’d intended to open up the API for people to build clients besides the website. (And when you could still import tweets into a room, the tweets would be labeled “via Twitter”.) That never happened, but TodaysMeet still looks up the “source” of every message, even though they are all the same (“web”). This makes what should be a very simple database query into a very complicated one, and the server spends significantly longer than it needs to looking up messages (and message lookups occur about once every 12-15 seconds for every user on the site). It also complicated database writes every time someone posted a message.

During the second debate I reworked some of the code around this to avoid looking up the source. Since there is only one source, this change is permanent.
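To give a rough sense of the difference, here is a minimal sketch in Python using an in-memory SQLite database. The table names, columns, and function names are all hypothetical (this is not the actual TodaysMeet schema or code); it only illustrates how a per-message source lookup adds a join that a hard-coded "web" label avoids.

```python
import sqlite3

# Hypothetical schema for illustration only; not the real TodaysMeet tables.
SCHEMA = """
CREATE TABLE sources (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE messages (
    id INTEGER PRIMARY KEY,
    room TEXT NOT NULL,
    author TEXT NOT NULL,
    body TEXT NOT NULL,
    source_id INTEGER NOT NULL REFERENCES sources(id)
);
"""


def make_db():
    """Build a tiny in-memory database where every message has the same source."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.execute("INSERT INTO sources (id, name) VALUES (1, 'web')")
    db.executemany(
        "INSERT INTO messages (room, author, body, source_id) VALUES (?, ?, ?, 1)",
        [("debate", "ann", "hello"), ("debate", "bob", "hi"), ("other", "cy", "hey")],
    )
    return db


def fetch_with_source(db, room):
    """Old style: join against the sources table for every message lookup."""
    return db.execute(
        "SELECT m.author, m.body, s.name FROM messages m "
        "JOIN sources s ON s.id = m.source_id "
        "WHERE m.room = ? ORDER BY m.id",
        (room,),
    ).fetchall()


def fetch_flat(db, room):
    """New style: skip the join and attach the only possible source in code."""
    rows = db.execute(
        "SELECT author, body FROM messages WHERE room = ? ORDER BY id",
        (room,),
    ).fetchall()
    return [(author, body, "web") for author, body in rows]
```

Both functions return the same rows for a room, but the second one asks the database to do less work per lookup, which adds up when every user on the site triggers a lookup every 12-15 seconds.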

There were unnecessary services running. Very much like opening too many programs on your personal computer, having too many services running makes them all compete for server resources like CPU time and RAM. Some of these are installed by default by the operating system and I’d never taken the time to turn them off. Others I had installed and started, intending to use them, but never actually did.

I shut down as many of these services as I confidently could, and will audit the server to make sure only absolutely necessary processes are running.

The database configuration is inefficient. Like the “via web” issue, but on a larger scale, the way the database is currently set up is not very efficient. This is how the database was set up four or five years ago, and the technology has changed since then.

This sort of change requires significant downtime to effect, so I want to be sure to get it right the first time. I will be doing some research and making a decision on how to proceed here.

The web server was logging too aggressively. For technical and, ironically, performance reasons, when you visit TodaysMeet, you are hitting a web server (nginx) that talks to another web server behind the scenes (Apache). Both of these servers write log files to disk, even though those logs are almost exact duplicates of each other. Normally, this is fine: maybe a waste of disk space, but ultimately not a big deal. However, during the debate, it became an issue as too many things were trying to write to disk at the same time.

During the debate, I reduced the amount of text Apache was writing to its logs, and now I have disabled its logging entirely, since the logs at the nginx level provide a superset of the information.

The server is too small. This is not such a small issue. TodaysMeet has gotten away with using very little hardware for a very long time. I’ve only had to increase its resource allocation once in the past several years. Now I need to do so again.

This is the simplest thing I can do, but it requires half an hour to an hour of hard downtime. I will schedule that for sometime this week and announce it here.

Again, I want to apologize for the inconvenience and disruption during the debates. My goal is to provide a stable, reliable service, and I hope that you’ll give me another chance.

Beyond the changes I outlined above, I have been working on a project to significantly improve the speed and reliability of TodaysMeet. I had planned to announce that here this week, but it won’t be ready soon enough for the third presidential debate, so I’m focusing on short-term fixes right now and will come back to the bigger project.