One of the company's engineers followed up with a blog post (written on Sep. 24, 2010) explaining exactly what went wrong.
According to Robert Johnson:
The key flaw that caused this outage to be so severe was an unfortunate handling
of an error condition. An automated system for verifying configuration values
ended up causing much more damage than it fixed.
The intent of the automated system is to check for configuration values that are
invalid in the cache and replace them with updated values from the persistent
store. This works well for a transient problem with the cache, but it doesn’t
work when the persistent store is invalid.
Today we made a change to the persistent copy of a configuration value that was
interpreted as invalid. This meant that every single client saw the invalid
value and attempted to fix it. Because the fix involves making a query to a
cluster of databases, that cluster was quickly overwhelmed by hundreds of
thousands of queries a second.
To make matters worse, every time a client got an error attempting to query one
of the databases it interpreted it as an invalid value, and deleted the
corresponding cache key. This meant that even after the original problem had
been fixed, the stream of queries continued. As long as the databases failed to
service some of the requests, they were causing even more requests to
themselves.
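In other words, a database error was treated the same way as an invalid cached value, so overload produced more load. A minimal sketch of that flawed client logic might look like the following (a hypothetical Python illustration; the names get_config, is_valid, cache, and persistent_store are invented here and are not Facebook's actual code):

```python
# Hypothetical sketch of the flawed client behavior described in the quote above.
# All names are invented for illustration; this is not Facebook's code.

class DatabaseError(Exception):
    """Raised when a query to the persistent store fails (e.g. under overload)."""

class InvalidConfigError(Exception):
    """Raised when no valid value can be obtained for a configuration key."""

def is_valid(value):
    # Placeholder validity check; the real system had its own rules.
    return value is not None

def get_config(key, cache, persistent_store):
    value = cache.get(key)
    if is_valid(value):
        return value

    # Cached value looks invalid: try to "fix" it from the persistent store.
    try:
        value = persistent_store.query(key)  # every client does this at once
    except DatabaseError:
        # Flaw: a query *error* is treated like an *invalid value*, so the
        # cache key is deleted and the next lookup queries the database again.
        # Overload causes errors, and errors cause more load.
        cache.delete(key)
        raise

    if not is_valid(value):
        # Flaw: if the persistent copy itself is invalid, every client keeps
        # retrying the database instead of keeping the old value or backing off.
        cache.delete(key)
        raise InvalidConfigError(key)

    cache.set(key, value)
    return value
```

A common remedy for this pattern is to distinguish "the store is unreachable" from "the value is invalid," keep serving the last known-good value, and retry with backoff rather than deleting the cache key, so that database errors cannot amplify themselves.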