Interruption of collector connectivity in us1 and us4
Incident Report for Auvik Networks Inc.
Postmortem

Service Disruption - Cluster US4

Duration of incident

Discovered: Jan 18, 2022 20:36 - UTC
Resolved: Jan 18, 2022 23:15 - UTC

Cause

A planned tenant migration from the US1 cluster to the US4 cluster
unexpectedly destabilized the US4 cluster because of resource constraints.

Effect

The resource overload caused several key services that handle communication with collectors to fail and disconnect in the US4 cluster of the Auvik platform. It also prevented a number of users on the US4 cluster from logging in.

Action taken

01/17/2022

13:00 - UTC Throughout the day, ~3450 tenants are migrated from the US1 cluster to the US4 cluster.
17:30 - UTC The US4 cluster begins to show early signs of destabilization, but these are not caught by proactive monitoring.

01/18/2022

20:36 - UTC Engineering begins investigating a higher-than-normal rate of collector failures in the US4 cluster.
20:57 - UTC Resource consumption continues to increase, forcing more services offline in the US4 cluster. The engineering team continues to investigate.
21:18 - UTC Engineering determines the root cause of the resource issue.
21:24 - UTC Engineering outlines a plan to address the resource issues on the US4 cluster.
21:37 - UTC The US1 cluster is reviewed by engineering for any possible issues from the migration of tenants. No adverse issues are discovered.
22:11 - UTC Additional resources are provisioned to the US4 cluster to address the collector disconnect issues.
22:37 - UTC Engineering begins to apply a fix to address the inability of users to log in to their tenants on the US4 cluster.
23:07 - UTC Engineering completes the rollout of the fix, allowing affected users to log in to their tenants on the US4 cluster.
23:15 - UTC Engineering declares the incident has been resolved.

Future consideration(s)

  • Auvik will add and improve alert monitoring on the specific cluster services and resources that were affected during this outage. Monitoring on the affected resources was either missing or misconfigured, so the needed alerts were not generated.
  • Auvik will review its cluster migration preparation documentation to better understand and prevent overloading a destination cluster's resources in future migrations.
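
As an illustration of the kind of pre-migration check the second item points toward, the sketch below estimates how many tenants a destination cluster could accept while keeping some capacity in reserve. It is a minimal sketch under stated assumptions, not a description of Auvik's tooling: the names `ClusterCapacity`, `max_safe_batch`, and `plan_migration`, the single-number capacity model, the per-tenant cost, and the 25% headroom default are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClusterCapacity:
    """Hypothetical capacity snapshot; units are arbitrary (e.g. CPU cores)."""
    total: float
    used: float


def max_safe_batch(capacity: ClusterCapacity,
                   per_tenant_cost: float,
                   headroom: float = 0.25) -> int:
    """Return how many tenants fit while keeping `headroom` of total capacity free."""
    if per_tenant_cost <= 0:
        raise ValueError("per_tenant_cost must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    # Capacity that may be consumed after reserving the headroom fraction.
    usable = capacity.total * (1 - headroom) - capacity.used
    if usable <= 0:
        return 0
    return int(usable // per_tenant_cost)


def plan_migration(tenants: int,
                   capacity: ClusterCapacity,
                   per_tenant_cost: float,
                   headroom: float = 0.25) -> tuple[int, int]:
    """Split a migration into (tenants to move now, tenants to defer)."""
    fit = max_safe_batch(capacity, per_tenant_cost, headroom)
    move_now = min(tenants, fit)
    return move_now, tenants - move_now


# Illustrative numbers only; the capacity figures are invented for the example.
if __name__ == "__main__":
    destination = ClusterCapacity(total=1000.0, used=400.0)
    print(plan_migration(3450, destination, per_tenant_cost=0.5))  # (700, 2750)
```

With these invented figures, only 700 of the ~3450 tenants would be moved in the first batch and the rest deferred until more resources are provisioned. A real check would need measured per-tenant costs and more than one resource dimension.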
Posted Mar 10, 2022 - 15:30 EST

Resolved
We have identified the cause of the interruption in connectivity and increased capacity to prevent a recurrence. All systems are operational.
Posted Jan 18, 2022 - 17:26 EST
Investigating
From 15:49 to 16:19 Eastern Time, some collectors disconnected from the Auvik service for sites in us1 and us4. Collectors have reconnected and the service is operating normally. We continue to investigate the root cause.
Posted Jan 18, 2022 - 16:53 EST
This incident affected: Network Mgmt (us1.my.auvik.com, us4.my.auvik.com).