Saturday, April 23, 2011

Production Push: Redundancy is Repetitive

We pushed to production this morning, giving the website better flow and navigation. Small changes to content tend to have a big impact on interpretation. We hope it's now much easier to understand what we do.

At the same time, our fearless compute-cloud friends at Amazon AWS have been having some major bad times. A production-push announcement wouldn't be complete without acknowledging their current pain. We host the bitmenu.com platform on Amazon's EC2 and related services. Starting late Wednesday night we saw the lights flickering, but everything seemed to continue operating fine. Then, at about 1am Pacific time Thursday morning, a major outage hit their east-coast USA region, and they've been in the house of pain for the 48 hours since.

The key to our dodging the outage is that we're not constantly bouncing EC2 instances with EBS mount/unmount events. Doing so is not necessarily bad design; it may be needed in systems that repeatedly get stuck for unknown reasons. Hey, we're not built on Ruby *cough*.

We run our system on a hand-rolled AWS stack, built on bash shell scripts, with its own failover using haproxy, polling, and cron. As such, we can do progressive production pushes across multiple machines without customers seeing the event. Now if you are still following me, and you know what that is, then you are an old-school badass.
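For the curious, here is roughly what a progressive push looks like in shell. This is a minimal sketch, not our actual scripts: the function names, the host names, and the BAD_HOST switch are all made up for illustration, and the four helper functions are stubs standing in for whatever really takes a machine out of the load balancer, updates it, checks it, and puts it back.

```shell
#!/usr/bin/env bash
# Sketch of a progressive (rolling) push: one machine at a time.
# Every helper below is a hypothetical stub, not a real haproxy or AWS call.

drain_host()   { echo "drain $1"; }    # stub: take the host out of rotation
deploy_host()  { echo "deploy $1"; }   # stub: copy the new release to the host
enable_host()  { echo "enable $1"; }   # stub: put the host back in rotation

# Stub health check: fails only for the host named in BAD_HOST (illustration only).
health_check() { [ "$1" != "${BAD_HOST:-}" ]; }

rolling_push() {
    local host
    for host in "$@"; do
        drain_host "$host"
        deploy_host "$host"
        if ! health_check "$host"; then
            # Leave the failed host drained and stop, so the remaining
            # hosts keep serving the old release.
            echo "halt: $host failed its health check" >&2
            return 1
        fi
        enable_host "$host"
    done
}
```

Calling `rolling_push web1 web2 web3` walks the machines in order. Because a host only goes back into rotation after it passes its check, a bad release stops at the first machine instead of reaching all of them.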

In a nutshell, we didn't go out, even though we're in us-east-1, which has been suffering through the outage. Many thanks to Vincent Jorgensen, whose keen design is proving quite solid. UPDATE: O'Reilly Media's George Reese posted his perspective on the outage one day after.

All that, and today's my birthday.