Root Cause Analysis
Discovered: Sep 13, 2022, 16:00 UTC
Resolved: Oct 1, 2022, 13:00 UTC
The issue was introduced by updates to software libraries in the collector, made to address recommendations from a third-party security audit of the collector.
Affected collectors either failed to establish a connection to the Auvik cloud, or lost their connection and could not re-establish it until the collector's WatchdogService forced the collector to reset.
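The recovery behavior described above follows a common watchdog pattern: if no healthy connection is seen for some period, force a reset. The sketch below illustrates that general pattern only. It is not Auvik's WatchdogService; the class name, the `reset` hook, and the 300-second timeout are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ConnectionWatchdog:
    """Forces a reset when no healthy connection is seen within timeout_s."""

    timeout_s: float               # hypothetical threshold, not a real Auvik setting
    reset: Callable[[], None]      # hypothetical hook that restarts the collector
    last_ok: float = 0.0           # time of the last confirmed connection
    resets: int = 0                # number of resets forced so far

    def record_connected(self, now: float) -> None:
        # Call whenever the collector confirms it can reach the cloud.
        self.last_ok = now

    def check(self, now: float) -> bool:
        # Returns True if the watchdog fired and forced a reset.
        if now - self.last_ok < self.timeout_s:
            return False
        self.reset()
        self.resets += 1
        self.last_ok = now  # restart the timer so the reset has time to take effect
        return True


# Simulated timeline, in seconds, using explicit timestamps instead of a real clock.
events = []
wd = ConnectionWatchdog(timeout_s=300.0, reset=lambda: events.append("reset"))
wd.record_connected(now=0.0)
fired_early = wd.check(now=120.0)  # still within the timeout, nothing happens
fired_late = wd.check(now=400.0)   # no connection for 400 s, a reset is forced
```

Passing timestamps in explicitly keeps the sketch deterministic; a real service would more likely read a monotonic clock and run `check` on a timer.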
09/13/2022 - 09/24/2022
Over the course of several days, Auvik receives several tickets from clients reporting that their collectors are experiencing random disconnects. Auvik opens tickets and begins troubleshooting the issue. Initial indications point to Auvik’s cloud connectivity. The impact is limited to a small, random subset of clients. The number of tickets for inconsistent connections rises, but the number of tickets and affected clients does not necessitate declaring an incident at the time.
13:00 UTC - Auvik responds to approximately 2% of collectors failing to re-establish a connection to Auvik after the regular bi-weekly maintenance.
14:00 UTC - After an initial investigation, Auvik opens an incident on its status page about the collector service disruption and continues to troubleshoot the issue.
14:00-19:00 UTC - Auvik continues to troubleshoot the collector connection issues. The majority of collectors have reconnected; under 1% remain disconnected and are slowly reconnecting. From a practical standpoint, monitoring has resumed for customers. Auvik updates the Status page during this time, stating that the team will continue to monitor the situation over the weekend.
Auvik continues to gather data on the specific cause of the intermittent connectivity issues. New tickets are opened, and a test environment is set up to pinpoint the specific issues causing the disconnects. The Auvik Status page is updated periodically during this time.
19:28 UTC - Auvik announces that it has scheduled an out-of-band maintenance window on CA1 at 7 am (Eastern Time) the next day.
11:00 UTC - Auvik performs the out-of-band maintenance on the CA1 cluster connectors.
12:45 UTC - After reviewing the impact of the maintenance, Auvik finds that additional changes are required.
14:30 UTC - Further alterations to the current collector build are tested in the lab and found to successfully address the ongoing collector disconnect issues.
15:30 UTC - Auvik announces a system-wide, out-of-band maintenance window to address the collector disconnect issues.
12:00-13:00 UTC - Auvik performs the out-of-band maintenance on the collectors.
14:12 UTC - The latest fixes to the collector appear to have resolved the disconnect issues.
14:20 UTC - Auvik closes the incident on the Status page, as collector connections to Auvik have remained steady and the connectivity logs are clear.