Service Disruption - Some collectors not connecting
Incident Report for Auvik Networks Inc.
Postmortem

Service Disruption - Collector fails to initially connect or maintain a consistent connection to Auvik

Root Cause Analysis

Duration of incident

Discovered: Sep 13, 2022, 16:00 UTC
Resolved: Oct 1, 2022, 13:00 UTC

Cause

Software libraries in the collector were updated to address recommendations from a third-party security audit of the collector. These library updates affected the collector's connectivity to Auvik.

Effect

Affected collectors either failed to establish a connection to Auvik, or lost their connection and could not re-establish it with the Auvik cloud until the collector's WatchdogService forced the collector to reset.
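The watchdog behavior described above can be illustrated with a minimal sketch. This is not Auvik's WatchdogService implementation: the class name, the `reset` callback, and the 300-second timeout are all hypothetical, chosen only to show how a watchdog can force a reset once a connection has been down for too long.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ConnectionWatchdog:
    """Hypothetical watchdog: requests a reset when the link stays down too long."""

    timeout_s: float            # assumed threshold, not Auvik's actual value
    reset: Callable[[], None]   # hypothetical callback that restarts the collector
    last_ok: float = 0.0        # time of the last successful heartbeat

    def heartbeat(self, now: float) -> None:
        """Record that the cloud connection was healthy at time `now`."""
        self.last_ok = now

    def check(self, now: float) -> bool:
        """Fire a reset if no heartbeat arrived within `timeout_s`; return True if it fired."""
        if now - self.last_ok > self.timeout_s:
            self.reset()
            self.last_ok = now  # give the restarted collector a fresh window
            return True
        return False


resets = []
dog = ConnectionWatchdog(timeout_s=300, reset=lambda: resets.append("reset"))
dog.heartbeat(now=0)
print(dog.check(now=120))  # False: still within the 300 s window
print(dog.check(now=400))  # True: no heartbeat for more than 300 s
```

In a design like this, a collector that cannot reconnect on its own stays offline for up to the watchdog timeout, which would be consistent with the delayed reconnections customers reported.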

Action taken

09/13/2022 - 09/24/2022

Over the course of several days, Auvik receives several tickets from clients reporting that their collectors are experiencing random disconnects. Auvik opens tickets and begins troubleshooting the issue. Initial indications point to Auvik’s cloud connectivity. The impact is limited to a small, random subset of clients. The number of tickets for inconsistent connections rises, but the number of tickets and affected clients does not necessitate declaring an incident at the time.

09/24/2022

13:00 UTC - Auvik responds after approximately 2% of collectors fail to re-establish a connection to Auvik following the regular bi-weekly maintenance.

14:00 UTC - After an initial investigation, Auvik opens an incident on its status page about the collector service disruption and continues to troubleshoot the issue.

14:00-19:00 UTC - Auvik continues to troubleshoot the collector connection issues. The majority of collectors have reconnected; under 1% remain disconnected and are slowly connecting back to the network. From a practical standpoint, monitoring has resumed for customers, and the remaining collectors are gradually reconnecting. Auvik updates the Status page during this time, stating that the team will continue to monitor the situation over the weekend.

09/26/2022 - 09/29/2022

Auvik continues to gather data as to the specific cause of the intermittent connectivity issues. New tickets are opened and a test environment is put into place to pinpoint the specific issues that are causing the disconnects. The Auvik Status page is updated periodically during this time.

19:28 UTC - Auvik announces that it has scheduled an out-of-band maintenance window on CA1 at 7 am (Eastern Time) the next day.

09/30/2022

11:00 UTC - Auvik performs the out-of-band maintenance on the CA1 cluster collectors.
12:45 UTC - After reviewing the impact of the maintenance, it is found that additional changes are required.
14:30 UTC - Further alterations to the current collector build are tested in the lab and found to successfully address the ongoing collector disconnect issues.
15:30 UTC - Auvik announces a system-wide, out-of-band maintenance window to address the collector disconnect issues.

10/01/2022

12:00-13:00 UTC - Auvik performs the out-of-band maintenance on the collectors.

14:12 UTC - The latest fixes to the collector appear to have resolved the disconnect issues.

10/02/2022

14:20 UTC - Auvik closes the incident listed on the Status page as the connections of the collectors to Auvik have remained steady and the logs for connectivity are clear.
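The timeline above is given in UTC, while the status-page posts are stamped in EDT (UTC-4). As an aid for cross-referencing the two, here is a small sketch that converts a status-post stamp to UTC. The function name `status_post_to_utc` is an illustrative assumption, and the stamp format is the one used by the posts in this report.

```python
from datetime import datetime, timedelta, timezone

EDT = timezone(timedelta(hours=-4), "EDT")  # Eastern Daylight Time is UTC-4


def status_post_to_utc(stamp: str) -> datetime:
    """Convert a stamp such as 'Oct 02, 2022 - 10:20 EDT' to an aware UTC datetime."""
    local = datetime.strptime(stamp.removesuffix(" EDT"), "%b %d, %Y - %H:%M")
    return local.replace(tzinfo=EDT).astimezone(timezone.utc)


# The "Resolved" post lines up with the 14:20 UTC closing entry in the timeline.
print(status_post_to_utc("Oct 02, 2022 - 10:20 EDT").isoformat())
# 2022-10-02T14:20:00+00:00
```

The fixed offset is sufficient here because every post in this incident falls within daylight saving time; a general-purpose tool would use a proper time zone database instead.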

Future consideration(s)

  • After a review of the incident and its associated impact, Auvik has reclassified the incident. It was originally recorded as three days of a major incident and four days of a partial incident. The corrected classification is one hour of a partial incident (the time taken to identify the issue), with the rest of the incident’s duration classified as monitoring.
  • Auvik will increase resources for its lab environment to better test alpha and beta changes to the collector going forward.
  • Auvik will expand team resourcing and improve documentation for the Collector team and for Auvik as a whole. This will create an improved historical knowledge base for the collector and relevant technologies, including, in this instance, how software library changes might affect its connectivity.
Posted Oct 12, 2022 - 06:05 EDT

Resolved
The solution for the service disruption with some of the collectors not reconnecting to the cloud in a timely manner or periodically losing connection and having a delayed reconnection to the cloud has been successfully deployed and is working. The source of the disruption has been resolved, and services have been fully restored.

A Root Cause Analysis (RCA) will follow after a full review has been completed.
Posted Oct 02, 2022 - 10:20 EDT
Update
We are continuing to monitor for any further issues.
Posted Oct 01, 2022 - 10:12 EDT
Update
We’ve identified the source of the service disruption with some of the collectors not reconnecting to the cloud in a timely manner or periodically losing connection and having a delayed reconnection to the cloud. This is a code-related issue that we continue to work through to resolve with a permanent resolution. We will be installing an out-of-band collector update Saturday, October 1, 2022, at 8 AM ET. This will force a restart of the collector service(s). We appreciate your patience as we continue to work toward a resolution. We will continue to post updates to this page.
Posted Sep 30, 2022 - 10:28 EDT
Update
We’ve identified the source of the service disruption with some of the collectors not reconnecting to the cloud in a timely manner or periodically losing connection and having a delayed reconnection to the cloud. This is a code-related issue that we continue to work through to resolve with a permanent resolution. We have rolled out a new fix to the CA1 cluster and are monitoring the results. We appreciate your patience as we continue to work toward a resolution. We will continue to post updates to this page.
Posted Sep 30, 2022 - 10:00 EDT
Update
We’ve identified the source of the service disruption with some of the collectors not reconnecting to the cloud in a timely manner or periodically losing connection and having a delayed reconnection to the cloud. This is a code-related issue that we continue to work through to resolve with a permanent resolution. We will roll out a revised fix to the CA1 cluster at 8:30 AM EDT on September 30, 2022. This will cause collectors to reboot as they do during the bi-weekly maintenance. We appreciate your patience as we continue to work toward a resolution. We will continue to post updates to this page.
Posted Sep 29, 2022 - 15:30 EDT
Update
We’ve identified the source of the service disruption with some of the collectors not reconnecting to the cloud in a timely manner or periodically losing connection and having a delayed reconnection to the cloud. This is a code-related issue that we continue to work through to resolve with a permanent resolution. We have rolled out a prepared fix to the CA1 cluster and are monitoring the results. We appreciate your patience as we continue to work toward a resolution. We will continue to post updates to this page.
Posted Sep 29, 2022 - 10:09 EDT
Update
We’ve identified the source of the service disruption with some of the collectors not reconnecting to the cloud in a timely manner or periodically losing connection and having a delayed reconnection to the cloud. This is a code-related issue that we continue to work through to resolve with a permanent resolution. We will roll out a prepared fix to the CA1 cluster at 7 AM EDT on September 30, 2022. This will cause collectors to reboot as they do during the bi-weekly maintenance. We appreciate your patience as we continue to work toward a resolution. We will continue to post updates to this page.
Posted Sep 28, 2022 - 15:28 EDT
Update
We’ve identified the source of the service disruption with some of the collectors not reconnecting to the cloud in a timely manner or periodically losing connection and having a delayed reconnection to the cloud. This is a code-related issue that we continue to work through to resolve with a permanent resolution. We appreciate your patience as we continue to work toward a resolution. We will continue to post updates to this page.
Posted Sep 28, 2022 - 07:09 EDT
Update
We’ve identified the source of the service disruption with some collectors not reconnecting to the cloud in a timely manner. We are continuing to test a fix to maintain a consistent collector connection and continue to monitor the situation. We’ll keep you posted on a resolution.
Posted Sep 27, 2022 - 09:05 EDT
Update
We’ve identified the source of the service disruption with some collectors not reconnecting to the cloud in a timely manner. We are currently testing a fix to maintain a consistent collector connection and continue to monitor the situation. We’ll keep you posted on a resolution.
Posted Sep 26, 2022 - 14:00 EDT
Update
We’ve identified the source of the service disruption with some collectors not reconnecting to the cloud in a timely manner and continue to monitor the situation. We are continuing to develop a fix to maintain a consistent collector connection. We’ll keep you posted on a resolution.
Posted Sep 26, 2022 - 08:50 EDT
Monitoring
We’ve identified the source of the service disruption with the collectors. The situation seems to have stabilized. We will continue to monitor the situation for the remainder of the weekend and provide the next update on Monday, September 26th.
Posted Sep 24, 2022 - 15:30 EDT
Identified
We've identified the source of the service disruption with the collectors. After our standard maintenance a subset of collectors did not immediately come back online. We are working as quickly as possible to resolve this issue.
Posted Sep 24, 2022 - 14:50 EDT
Update
We are continuing to investigate the disruption to the collectors. We will continue to provide updates as they become available.
Posted Sep 24, 2022 - 13:33 EDT
Update
As the final few collectors come back online we are continuing our investigation to ensure stability for our customers. We will continue to provide updates as they become available.
Posted Sep 24, 2022 - 12:35 EDT
Update
We are continuing to investigate the disruption to the collectors. We will continue to provide updates as they become available.
Posted Sep 24, 2022 - 11:24 EDT
Update
Nearly all collectors have reconnected to the cloud, but a few are still not reconnecting. We continue to investigate and will provide further updates as they become available.
Posted Sep 24, 2022 - 10:29 EDT
Investigating
Some customers' collectors are not reconnecting to the cloud, resulting in network monitoring data not being updated in the cloud. We will continue to provide updates as they become available.
Posted Sep 24, 2022 - 09:38 EDT
This incident affected: Network Mgmt (my.auvik.com, us1.my.auvik.com, us2.my.auvik.com, us3.my.auvik.com, us4.my.auvik.com, eu1.my.auvik.com, eu2.my.auvik.com, au1.my.auvik.com, ca1.my.auvik.com), Auvik TrafficInsights, and Auvik Website (www.auvik.com).