Attack of the Zombie Dynos

I ended up giving a 5 minute lightning talk on this topic for NYC.rb on Tuesday, May 8, the day of the outage. For slides from that talk, click here.

Some of you may have noticed that Tuesday morning between about 7am and 8:45am, the SideTour site was up and down, up and down. I thought I’d take a moment to explain a little bit about what happened.

So the morning of Tuesday, May 8, I woke up to this:

At SideTour, we use Pingdom for outage monitoring. There are a lot of other great ones out there, but we find the basic monitoring service indispensable.

This particular message woke me up (after the 3rd or 4th message - my phone was on vibrate - oops) and I immediately checked our application logs. We use Heroku for hosting, so our log looked something like this:

2012-08-05T07:51:37-07:00 heroku[router]: Error H12 (Request timeout) -> GET sidetour.com/ web=web.2 queue=0 wait=0ms service=0ms bytes=0

A simple Google search for Error H12 will give you Heroku’s error code docs and an explanation:

An HTTP request took longer than 30 seconds to complete. In the example below, the Rails app takes 37 seconds to render the page; the HTTP router returns a 503 prior to Rails completing its request cycle, but the Rails process continues and the completion message shows after the router message.

This last part is an important detail that Heroku expands on further in this article on timeout errors.

It turns out that a request being handled by one of the dynos allocated to our production instance was taking a very long time, perhaps never finishing at all. The user was served a 503 error page by the router after 30 seconds, but the request kept running on that dyno. Other dynos fell prey to the same problem, and eventually all of the dynos running our app were occupied by long-running requests, essentially turning them into zombies.

Fortunately, Heroku has a great suggestion for this problem - the rack-timeout gem.

The rack-timeout gem will abort any request running longer than 30 seconds. This will prevent your dynos from going zombie, but that’s only a short-term fix. After all, users making these long-running requests will still get served a 500 error page.

However, this Rack::Timeout::Error will get raised at the application level, rather than within Heroku’s routing layer, so services like Airbrake, which we used at SideTour, will be able to identify the offending method call.
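To make the idea concrete, here is a minimal sketch of a Rack middleware that enforces a time limit inside the app. It is not rack-timeout's actual implementation, just an illustration built on Ruby's standard Timeout module; the class name SketchTimeout and its seconds option are made up for this example.

```ruby
require "timeout"

# Hypothetical middleware illustrating the idea behind rack-timeout.
# This is NOT the gem's code; the names here are invented for the sketch.
class SketchTimeout
  # Raised inside the app when a request exceeds the limit, so an
  # exception tracker sees a backtrace pointing at the slow call.
  class Error < RuntimeError; end

  def initialize(app, seconds: 30)
    @app = app
    @seconds = seconds
  end

  def call(env)
    # Abort the downstream app if it runs longer than @seconds.
    Timeout.timeout(@seconds, Error) { @app.call(env) }
  end
end
```

Because the exception is raised while the slow code is still on the stack, the backtrace shows where the request was stuck, which is exactly what the router-level H12 error cannot tell you.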

In our case, it was processing an RSS feed from Tumblr. Tumblr had issues that caused the GET request for the RSS feed to hang indefinitely. The obvious solution, and the one we implemented, is to have a scheduled cron job pull in the RSS feed on a regular basis and persist it in the database, so the app queries its own stored copy instead of calling Tumblr during a web request.
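A rough sketch of that approach might look like the following. This is not our production code: FeedCache is a hypothetical in-memory stand-in for the database table, and the method names, timeout values, and injectable fetcher are assumptions made for illustration. The two points it shows are explicit timeouts on the outbound request and keeping the last good copy when a fetch fails.

```ruby
require "net/http"
require "uri"

# Hypothetical in-memory stand-in for the database table that
# would hold the most recent copy of the feed.
class FeedCache
  attr_reader :body, :fetched_at

  def store(body, now: Time.now)
    @body = body
    @fetched_at = now
  end
end

# Fetch the feed with explicit timeouts so a hung upstream server
# cannot block the caller forever. Returns nil on any failure.
def fetch_feed(url, open_timeout: 5, read_timeout: 10)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == "https",
                  open_timeout: open_timeout,
                  read_timeout: read_timeout) do |http|
    response = http.get(uri.request_uri)
    response.is_a?(Net::HTTPSuccess) ? response.body : nil
  end
rescue Net::OpenTimeout, Net::ReadTimeout, SocketError, SystemCallError
  nil
end

# Intended to run from a scheduled job, never from a web request.
# Keeps the previously stored feed when the fetch fails.
def refresh_feed(cache, url, fetcher: method(:fetch_feed))
  body = fetcher.call(url)
  cache.store(body) if body
  cache.body
end
```

The web app then reads only from the stored copy, so a slow or broken upstream feed means slightly stale content rather than a tied-up dyno.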

Tags: heroku