Disclaimer: This post does not contain any information on basketball or recruiting. Reading this post has been linked to drowsiness, loss of time, and, in some rare cases, an acute interest in technology. You should always consult your boredom level before you begin reading any post.
As I'm sure some of you noticed, or surmised by the title of this post, yesterday we experienced some downtime. By some downtime, I mean about 6 hours of downtime. Which, by my reckoning, is about 6 hours too much downtime. Here's a recap of what transpired:
Noon PT - I leave home for some planned downtime with my wife to attend a SF Giants game.
12:26 PT - I receive an alert on my phone that the site is down.
12:27 PT - I think, "What did I do this time?"
12:28 PT - For the next hour or so, as we BART to the game, I curse at my phone. [Why can't you stay in 4G like all the other phones? Why aren't you a laptop?]
12:30 PT - My wife patiently wonders why I can't just take a couple of hours off.
1:30 PT - I futilely attempt to restart the Verbal Commits servers from my phone via Nezumi as I enter the Giants game, already in progress.
1:41 PT - I send out a tweet about our problems.
2:00 PT - I try to be present and enjoy the game and catch up with our friends, while only slightly bothered by the ulcer in my stomach. Although, to be fair, that could have been due to the bratwurst, not my stress level.
3:30 PT - We leave the game early and hop on BART.
4:30 PT - Once home, I immediately check our logs, and start troubleshooting.
4:38 PT - Bad news... Heroku claimed that the issue was already resolved at 2:59 PT.
5:25 PT - Heroku sends this response. "We experienced a large service event when EBS became stuck in a single us-east AZ. A small percentage of our disks have been lost, which means some dbs remain unavailable, including yours. We are performing a fork recovery of your db at this time to recover your data. When the fork is ready, it will restart your app with a new db address, at which time, your app should become available again."
6:28 PT - Our db is finally restored in read-only mode. We tell the world we're back!
8:56 PT - 8.5 hours later, and only 6 hours after Heroku claimed the issue was resolved, we finally get a fully operational site back.
I'd been frustrated with Heroku for multiple reasons prior to yesterday's events, but it's been an "if it ain't broke, don't fix it" type of situation. Yesterday's events made it clear that, while Heroku was a great service provider for getting Verbal Commits up and off the ground (they scaled with us from 0 to 800k monthly pageviews), they're no longer a good fit for us.
We're now in the market for a new provider. Here are some of our current criteria:
• Ability to failover from one location to the next in the event something goes wrong, like yesterday.
• Writable disk so we don't have to code around our "Verbal" to get to our "Commits".
• An economical pricing plan given our needs.
• Support for our current technology stack as well as our future plans.
• Minimal time spent working on configuration issues (i.e., a low DevOps budget/time).
• Easily scalable as we continue to grow (a bit presumptuous perhaps concerning our growth, but I've got faith in us).
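To make that first criterion concrete: failover comes down to health checks plus a decision rule for when to route traffic to a standby. Here's a minimal sketch of that logic — the region names and threshold are purely hypothetical, not any provider's actual API:

```python
# Hypothetical failover decision logic: after N consecutive failed
# health checks against the primary, traffic moves to a standby region.

FAILURE_THRESHOLD = 3  # consecutive failures before we fail over

def choose_region(health_history, primary="us-east", standby="us-west"):
    """Return the region traffic should be routed to, given a list of
    recent health-check results for the primary (True = healthy)."""
    recent = health_history[-FAILURE_THRESHOLD:]
    # Fail over only on a sustained outage, not a single blip.
    if len(recent) == FAILURE_THRESHOLD and not any(recent):
        return standby
    return primary

# A brief hiccup shouldn't trigger failover...
print(choose_region([True, False, True]))          # -> us-east
# ...but a sustained outage (like yesterday's) should.
print(choose_region([True, False, False, False]))  # -> us-west
```

The threshold is the knob that trades false failovers against time-to-recover; the point is that a provider (or our own tooling) needs to do this automatically, rather than waiting for someone to get home from a ballgame.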
We've been playing around with trial accounts at the following vendors:
• EngineYard - A mature platform with legendary support, but, like Heroku, a bit pricey.
• Rackspace - Another solid platform with great support, but also a bit pricey.
We'd like to move quickly on this, but we also want to make sure we do our due diligence. Unfortunately, you may see some delay in our bug fixing as well as new features coming to the site. We're hoping it all goes well, and you won't notice a thing. Otherwise you may find yourself reading another long blog post about how technology got the better of me.