Google Details Their Big Outage (Dec 2020)
On Monday 14 December 2020, for a duration of 47 minutes, customer-facing Google services that required Google OAuth access were unavailable. Cloud service accounts used by GCP workloads were not impacted and continued to function. We apologize to our customers whose services or businesses were impacted during this incident, and we are taking immediate steps to improve the platform’s performance and availability.
The Google User ID Service maintains a unique identifier for every account and handles authentication credentials for OAuth tokens and cookies. It stores account data in a distributed database, which uses Paxos protocols to coordinate updates. For security reasons, this service rejects requests when it detects outdated data.

Google uses an evolving suite of automation tools to manage the quota of various resources allocated for services. As part of an ongoing migration of the User ID Service to a new quota system, a change was made in October to register the User ID Service with the new quota system, but parts of the previous quota system were left in place, which incorrectly reported the usage for the User ID Service as 0. An existing grace period on enforcing quota restrictions delayed the impact; when that grace period eventually expired, the automated quota systems decreased the quota allowed for the User ID Service, triggering this incident. Safety checks exist to prevent many unintended quota changes, but at the time they did not cover the scenario of zero reported load for a single service:
- Quota changes to a large number of users, since only a single group was the target of the change
- Lowering quota below usage, since the reported usage was inaccurately being reported as zero
- Excessive quota reduction to storage systems, since no alert fired during the grace period
- Low quota, since the difference between usage and quota exceeded the protection limit

As a result, the quota for the account database was reduced, which prevented the Paxos leader from writing. Shortly after, the majority of read operations became outdated, which resulted in errors on authentication lookups.
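The failure mode of those four safety checks can be made concrete with a small sketch. This is an illustrative model, not Google's actual code: the function name, parameters, and thresholds below are all assumptions. It shows how a usage figure mis-reported as zero lets a harmful quota reduction slip past every check.

```python
from dataclasses import dataclass

@dataclass
class QuotaChange:
    service: str
    new_quota: int

def safe_to_apply(change, reported_usage, affected_services,
                  alert_fired_in_grace_period, protection_limit):
    """Hypothetical model of the four safety checks described above.
    Returns True if an automated quota reduction may proceed."""
    # Check 1: reject changes targeting a large number of services.
    # (Passed in the incident: only the User ID Service was targeted.)
    if len(affected_services) > 1:
        return False
    # Check 2: never lower quota below reported usage.
    # (Passed: usage was incorrectly reported as 0.)
    if change.new_quota < reported_usage:
        return False
    # Check 3: excessive reductions to storage systems should raise an
    # alert during the grace period. (Passed: no alert fired.)
    if alert_fired_in_grace_period:
        return False
    # Check 4: flag suspiciously low quota, i.e. quota within the
    # protection limit of usage. (Passed: with usage reported as 0,
    # quota minus usage exceeded the protection limit.)
    if change.new_quota - reported_usage < protection_limit:
        return False
    return True

change = QuotaChange("user-id-service", new_quota=1000)

# With usage mis-reported as 0, every check passes: change applied.
print(safe_to_apply(change, reported_usage=0,
                    affected_services=["user-id-service"],
                    alert_fired_in_grace_period=False,
                    protection_limit=500))   # True

# With the real usage reported, check 2 blocks the reduction.
print(safe_to_apply(change, reported_usage=5000,
                    affected_services=["user-id-service"],
                    alert_fired_in_grace_period=False,
                    protection_limit=500))   # False
```

Note that each check individually behaved as designed; it was the combination of a single-service target and a zero usage reading that left no check able to fire.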
REMEDIATION AND PREVENTION
The scope of the problem was immediately clear as the new quotas took effect. It was detected by automated alerts for capacity at 2020-12-14 03:43 US/Pacific, and for errors with the User ID Service starting at 03:46, which paged Google engineers at 03:48, within one minute of customer impact. At 04:08 the root cause and a potential fix were identified, which led to disabling the quota enforcement in one datacenter at 04:22. This quickly improved the situation, and at 04:27 the same mitigation was applied to all datacenters, which returned error rates to normal levels by 04:33. As outlined below, some user services took longer to fully recover.

In addition to fixing the underlying cause, we will be implementing changes to prevent, reduce the impact of, and better communicate about this type of failure in several ways:

1. Review our quota management automation to prevent fast implementation of global changes
2. Improve monitoring and alerting to catch incorrect configurations sooner
3. Improve reliability of tools and procedures for posting external communications during outages that affect internal tools
4. Evaluate and implement improved write failure resilience into our User ID Service database
5. Improve resilience of GCP services to more strictly limit the impact to the data plane during User ID Service failures

We would like to apologize for the scope of impact that this incident had on our customers and their businesses. We take any incident that affects the availability and reliability of our customers extremely seriously, particularly incidents which span multiple regions. We are conducting a thorough investigation of the incident, and making the changes that result from it will be our top priority in Google Engineering.
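The first remediation item, preventing fast implementation of global changes, is commonly realized as a staged rollout with a soak period between datacenters. The sketch below is a minimal, hypothetical illustration of that pattern (the class name, datacenter names, and soak duration are assumptions, not anything from Google's report): a change cannot touch the next datacenter until the previous one has soaked long enough to surface problems.

```python
import time

class StagedRolloutGuard:
    """Illustrative guard that paces a configuration change across
    datacenters, refusing to release the next target until a soak
    period has elapsed since the previous application."""

    def __init__(self, datacenters, soak_seconds):
        self.pending = list(datacenters)
        self.soak_seconds = soak_seconds
        self.last_applied_at = None

    def next_target(self, now=None):
        """Return the next datacenter to update, or None if the
        previous change is still soaking (or nothing remains)."""
        now = time.monotonic() if now is None else now
        if not self.pending:
            return None
        if (self.last_applied_at is not None
                and now - self.last_applied_at < self.soak_seconds):
            return None  # still soaking: block the global rollout
        self.last_applied_at = now
        return self.pending.pop(0)

guard = StagedRolloutGuard(["dc-a", "dc-b", "dc-c"], soak_seconds=3600)
print(guard.next_target(now=0))      # 'dc-a'
print(guard.next_target(now=10))     # None: dc-a change still soaking
print(guard.next_target(now=3600))   # 'dc-b'
```

Had a guard like this gated the quota reduction, the error spike in the first datacenter would have halted the rollout before it became global.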