Issues with logins and user registration on console.us-gov-east-1.aws.elastic-cloud.com
Incident Report for ESS (GovCloud)
Postmortem

[RCA]: GovCloud User Console Not Accessible

2021-03-24

Summary

Impact

Users were unable to log in to the GovCloud Cloud Console or register new accounts for 20 hours 30 minutes.

Root Cause(s)

  1. During release operations a critical database migration was not run. As a result, access to user login and registration was lost.
  2. The impact duration was longer than anticipated for the following reasons:

    1. Time-to-detect was 20 hours:

      1. Monitoring and alerting for logins in GovCloud did not catch the failure and needs improvement.
      2. The incident was noticed only indirectly, through alerts that our monitoring clusters were unavailable.
    2. Time-to-respond was 27 minutes once the issue was detected.
    3. Time-to-remediate was 1 hour 18 minutes.

  3. Operator error was the primary cause; however, the lack of monitoring parity with other production environments is the primary reason for the longer time-to-detect (TTD).

Resolution

Running DB migrations restored access without requiring any further restarts.
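
As an illustration only, a release step could refuse to complete while any migration shipped with the release is still unapplied. The sketch below is a minimal example under stated assumptions: the function names and the migration IDs are hypothetical, and it does not describe the actual release tooling used in GovCloud.

```python
from typing import Iterable, List


def pending_migrations(expected: Iterable[str], applied: Iterable[str]) -> List[str]:
    """Return the expected migration IDs that are not yet applied, in release order."""
    applied_set = set(applied)
    return [m for m in expected if m not in applied_set]


def check_release(expected: Iterable[str], applied: Iterable[str]) -> None:
    """Raise if any migration shipped with the release is missing from the database.

    Both arguments are hypothetical inputs: `expected` would come from the
    release manifest, `applied` from the database's migration history.
    """
    missing = pending_migrations(expected, applied)
    if missing:
        raise RuntimeError(
            "release incomplete, pending migrations: " + ", ".join(missing)
        )
```

Calling check_release(["001", "002"], ["001"]) raises an error naming "002", so a missed migration fails the release instead of surfacing later as a login outage.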

Timeline

Action Items

Future Work

  • Mitigate failure: reduce likelihood of operator error by improving release automation in GovCloud
  • Mitigate impact: improve monitoring of login and registration failures for GovCloud
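
One possible shape for the login monitoring item, sketched under assumptions: a synthetic login probe runs on a schedule, and an alert fires after several consecutive failures. The class name and the threshold of 3 are hypothetical and chosen for illustration; this is not the monitoring that was or will be deployed.

```python
class ConsecutiveFailureAlert:
    """Fire an alert once a probe has failed `threshold` times in a row."""

    def __init__(self, threshold: int = 3) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, success: bool) -> bool:
        """Record one probe result and return True if an alert should fire."""
        if success:
            # Any successful login probe resets the failure streak.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

A probe-based check does not depend on real user traffic, which matters in an environment where login volume may be too low for a failure-rate alert to trigger quickly.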
Posted Mar 26, 2021 - 23:56 UTC

Resolved
This incident has been resolved and an RCA will be posted within 72 hours.
Posted Mar 24, 2021 - 17:49 UTC
Monitoring
We have restored access to user logins and registration on the Cloud Console, console.us-gov-east-1.aws.elastic-cloud.com. During a recent release, a critical database migration was not run. We are investigating why this step was missed and will take steps to prevent this from happening in the future. We will post an RCA within 72 hours.
Posted Mar 24, 2021 - 17:08 UTC
Investigating
We are investigating an issue with user logins and registration on the Cloud Console, console.us-gov-east-1.aws.elastic-cloud.com. Connectivity to existing deployments is not affected.
Posted Mar 24, 2021 - 16:46 UTC
This incident affected: AWS GovCloud US-East (aws-us-gov-east-1) (Cloud console).