We knew in our hearts this day would come: Stack Overflow has defeated Google!
On July 2, from 6:45 AM PDT until 12:35 PM PDT, Google App Engine (App Engine) experienced an outage that ranged from partial to complete.
Following is a timeline of events, an analysis of the technology and process failures, and a set of steps the team is committed to taking to prevent such an outage from happening again.

The App Engine outage was due to complete unavailability of the datacenter’s persistence layer, GFS, for approximately three hours.
The GFS failure was abrupt for reasons described below, and as a consequence the data belonging to App Engine applications remained resident on GFS servers but was unreachable during this period. Since needed application data was completely unreachable for a longer-than-expected period, we could not follow the usual procedure of serving App Engine applications from an alternate datacenter, because doing so would have resulted in inconsistent or unavailable data for applications.
The root cause of the outage was a bug in the GFS Master server: another client in the datacenter sent it an improperly formed filehandle that had not been sanitized on the server side, which caused a stack overflow on the Master when processed.
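The post-mortem doesn't show the GFS code, but the failure pattern is a classic one. Here's a minimal Python sketch — purely illustrative, with an invented filehandle format and function names — of how an unvalidated, client-shaped input can drive a recursive parser deep enough to exhaust the stack, and how a server-side sanity check stops it:

```python
def parse_handle_unsafe(handle):
    """Naive recursive parser: one frame per '/'-separated component,
    with no bound, so a malformed deeply nested handle exhausts the stack."""
    if not handle:
        return []
    head, _, rest = handle.partition('/')
    return [head] + parse_handle_unsafe(rest)

MAX_DEPTH = 64  # hypothetical sanity limit on path components

def parse_handle_safe(handle, depth=0):
    """Sanitized version: reject handles deeper than MAX_DEPTH instead of
    letting client-controlled input decide the recursion depth."""
    if depth > MAX_DEPTH:
        raise ValueError("malformed filehandle: too many components")
    if not handle:
        return []
    head, _, rest = handle.partition('/')
    return [head] + parse_handle_safe(rest, depth + 1)
```

In C++, where GFS actually runs, the unbounded recursion would smash the real call stack and crash the process; Python merely raises a catchable `RecursionError`. The defect is the same either way: input from a client gets to control recursion depth before anyone validates it.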
This is excerpted from a newsgroup posting by App Engine PM Chris Beckmann, and was forwarded along to me by Lenny Rachitsky.
In other, less amusing news, there will be no podcast this week. But don’t fret — next week, we will have the ineffable Miguel de Icaza of Mono fame. Joel and I are both big fans, so this one should be fun.