Post Incident Report - Hosting Outage - 23rd May 2021
Tuesday 25th May 2021 12:30:00


Summary:

Between approximately 12:30PM and 5:30PM on the 23rd of May 2021, our hosting infrastructure suffered a network connectivity issue relating to equipment managed by our upstream datacenter providers. Our datacenter providers worked to isolate the issue, which appears to have been caused by a faulty switch that had taken over the redundancy cluster of switches, rendering the network inaccessible from affected external networks.

The issue only affected certain customers, depending on their ISP. We can confirm that Telstra residential and business services (NBN and mobile based) were affected, as was connectivity to 3rd party systems such as Office 365. Customers on other networks, such as TPG and Vocus based providers, were not affected.

Once the switching issue was isolated, the datacenter team removed the faulty switch from the network and restored the affected networks, including ours. Services were restored at approximately 5:30PM.

Further maintenance work was conducted last week to help determine the root cause of the issue, with the vendors of the switch and the DC firewall equipment working to address it.

Impact:

As noted above, the issue left customers of certain ISPs or providers unable to access our infrastructure. We can confirm that Telstra customers were affected, as was connectivity to 3rd party infrastructure such as Office 365. TPG, Vocus and other providers were not affected by this issue, and services remained accessible to them as normal.

Customers of affected providers were unable to access any resources in our infrastructure, including website and email services. In addition, email senders and recipients on affected networks experienced email deliverability delays, as those networks were unable to connect to each other to send and receive email.

Further monitoring of the situation determined that, as this was a networking issue, it would be resolved sooner than any datacenter migration failover strategy our team has in place could be completed, so we decided to wait for the network issue to be resolved.

As MyWebTeam was unable to determine which customers were directly affected, we updated our status page (https://status.linsec.com.au) and directly contacted customers known to use affected networks. Customers who notified us of issues were updated regularly during the outage and informed when services were restored.

Next Steps:

As noted in the summary, the datacenter team is carrying out a series of after hours maintenance updates (usually at 11PM, for up to 10-15 minutes) to rectify the issue. Maintenance notifications will be posted to our status site at https://status.linsec.com.au as we are advised of them by our datacenter partners.

To further assist customers affected by issues of this nature, MyWebTeam is exploring a notification system that does not rely on email, as this particular issue affected our ability to reach customers via email and notify them of the outage. Options such as opt-in SMS notifications are being explored and will be implemented over the next quarter, giving customers the ability to subscribe to both hosting outage and upcoming maintenance notifications by email and SMS. This will allow customers who opt in to be kept updated in the event of a hosting or infrastructure outage that affects their ability to access email or related services.

We are also looking at further failover strategies that can be put in place to reduce the time required to restore the infrastructure to a new DC in the event of a major connectivity or DC service issue.

Questions?

If you have any questions relating to this issue, feel free to contact us.