For those for whom #amazonfail isn't yet ancient history, I thought I'd throw in a couple of brief observations. This is the most plausible explanation for what happened that I've yet read; having worked for a reasonably large online retailer, I find it rings very true. Without going into specifics, I'm very aware of how easy it is for one person to make enormously visible screw-ups. I saw it happen more than once, though never quite to the global level of comment that Amazon suffered. Other times screw-ups happen and get caught before anyone outside notices.
The main lesson is the PR one that's been written about at length: you can't always avoid mistakes, and you can't avoid people over-reacting and cyclonic weather systems appearing in the crockery, but you can try to manage the situation better after the fact. There are a few lessons about the way that near-zero-cost global communication changes the dynamics of this, but I don't think they're fundamentally different to anything that's gone before.
There are also lessons in systems design, and they should be noted by anybody involved in the field as much as the communications lessons should be studied by PR types. Robustness is usually included in lists of “Ilities” (non-functional requirements), but is often the victim in distributed systems, hence Lamport's definition of a distributed system as one in which the failure of a computer you didn't even know existed can render your own computer unusable. The fact that it took so much time and (presumably) effort from Amazon to roll back the changes once they were made suggests to me that they're suffering from this effect. When designing a distributed system, it's important to be able to easily track what has changed and where the change has taken effect, and to have a method for quickly reversing any change.
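To make those three requirements a bit more concrete, here's a minimal sketch in Python. Everything in it is hypothetical and invented for illustration (the `ChangeLog` class, the node names, the `flag` key); it has nothing to do with how Amazon's systems actually work. Each change records what was changed, which nodes it reached, and the value each node held beforehand, so that it can be undone in one call.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Set

_MISSING = object()  # marks "the key did not exist before the change"


@dataclass
class Change:
    change_id: int
    key: str
    new_value: Any
    # node name -> value the key held on that node before the change
    previous: Dict[str, Any] = field(default_factory=dict)


class ChangeLog:
    """Hypothetical change tracker over a set of in-memory 'nodes'."""

    def __init__(self, nodes: Dict[str, Dict[str, Any]]):
        self.nodes = nodes
        self.history: List[Change] = []

    def apply(self, key: str, new_value: Any, targets: Iterable[str]) -> Change:
        """Apply a change to the target nodes, remembering the old values."""
        change = Change(len(self.history) + 1, key, new_value)
        for name in targets:
            store = self.nodes[name]
            change.previous[name] = store.get(key, _MISSING)
            store[key] = new_value
        self.history.append(change)
        return change

    def where_applied(self, change_id: int) -> Set[str]:
        """Return the names of the nodes a change has reached."""
        return set(self.history[change_id - 1].previous)

    def rollback(self, change_id: int) -> None:
        """Restore the recorded old value on every node the change reached.

        Simplification: this ignores any later change to the same key.
        """
        change = self.history[change_id - 1]
        for name, old in change.previous.items():
            store = self.nodes[name]
            if old is _MISSING:
                store.pop(change.key, None)
            else:
                store[change.key] = old


# Usage: apply a change to two nodes, then reverse it.
nodes = {"eu": {"flag": False}, "us": {}}
log = ChangeLog(nodes)
change = log.apply("flag", True, ["eu", "us"])
log.rollback(change.change_id)
```

A real distributed system has to cope with nodes that are unreachable, changes that overlap, and data that has already been copied onwards, none of which this toy handles. The point is only that the "before" state and the list of affected nodes are captured at the moment of the change, not reconstructed afterwards.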
There's also procedural robustness, possibly supported by software, to try to prevent the mistake happening in the first place. However, such procedures aren't always appropriate, can be expensive to operate, and aren't infallible anyway. I still think you need the software to be able to recover from human error.
Oracle have a very nifty bit of technology called flashback that helps with this on a single database. If you have an Oracle system, it would be the place to start building a robust recovery mechanism (although it doesn't solve any of the problems of data distribution or managing change consistency across a distributed system). If you don't have an Oracle system, it's a good place to see how it should be done.