NRT1 connectivity issue
Incident Report for Packet
Postmortem

Incident: Top of Rack Switch Outage in Tokyo, JP (NRT1) Datacenter

Outage Start Time: 12:25 AM (EST) on November 7, 2017

Outage End Time: 1:30 AM (EST) on November 7, 2017

Reason for Outage

=Overview=

During this period, some customers connected to a single cabinet/top of rack switch in our Tokyo (NRT1) datacenter experienced a loss of connectivity.

=Identifying the Core Issue / Resolution=

On investigation, we identified an issue with the software running on the affected switch. A full reload of the device resolved it.

We are working with our hardware manufacturer, Juniper, at the highest levels, and have identified several flaws in their current code base in how firewall Access Control Lists (ACLs) are programmed into the hardware. Over time, these flaws cause valid customer traffic to be discarded. With their assistance, we’ve also identified a code revision that may solve these issues, and we are currently testing it in our lab. In addition, we’ve changed how our internal provisioning systems generate and deploy ACLs to help mitigate this issue.

=Moving Forward=

We are improving our internal procedures around incident response and network device troubleshooting, based upon the lessons learned from this particular outage. Though hardware failures are an unfortunate (and rare) fact of life, we endeavor to diagnose and recover from these issues as quickly as possible.

As a customer, if you are interested in greater switch diversity, you can review the “switch ID” note for each device. It is exposed in our API and in the customer portal, where it appears below the facility code on the server detail page.

This 8-character field identifies the physical (top of rack) switch that each server is connected to and depends on for connectivity. Most instance types are available on multiple switches in each datacenter, and we are happy to work with you to promote greater switch diversity in your deployment if that is of interest. As we work to introduce some provisioning-time options around diversity, please don’t hesitate to drop us a note (help@packet.net) if we can assist.
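To check how your own servers are spread across switches, a rough sketch along these lines may help. It is written under assumptions: the key names `hostname`, `facility` and `switch_id` and the sample values are hypothetical placeholders, not the actual field names returned by our API, and the device list is hand-written rather than fetched.

```python
from collections import defaultdict


def group_by_switch(devices):
    """Map each switch ID to the hostnames of the servers attached to it."""
    groups = defaultdict(list)
    for device in devices:
        # "switch_id" and "hostname" are assumed key names for illustration.
        groups[device["switch_id"]].append(device["hostname"])
    return dict(groups)


def shared_switch_groups(devices):
    """Return only the switches that carry more than one of your servers."""
    return {
        switch: hosts
        for switch, hosts in group_by_switch(devices).items()
        if len(hosts) > 1
    }


# Made-up inventory; the 8-character IDs are placeholders, not real switch IDs.
devices = [
    {"hostname": "web-1", "facility": "nrt1", "switch_id": "a1b2c3d4"},
    {"hostname": "web-2", "facility": "nrt1", "switch_id": "a1b2c3d4"},
    {"hostname": "db-1", "facility": "nrt1", "switch_id": "e5f6a7b8"},
]

print(shared_switch_groups(devices))
```

Any switch that shows up in the result carries more than one of your servers, so a single top of rack failure would affect all of them together. Those are the deployments worth raising with us if diversity matters to you.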

Posted Nov 09, 2017 - 11:30 EST

Resolved
This incident has been resolved.
Posted Nov 07, 2017 - 02:02 EST
Identified
Network has been restored, but our team is still working on NRT1 provisioning. In the meantime, please hold off on deploying or installing in NRT1.
Posted Nov 07, 2017 - 01:43 EST
Investigating
We are seeing an outage in the NRT1 facility. Provisioning and deprovisioning of devices and portal access via NRT1 are affected. Our team is already looking into this issue.
Posted Nov 07, 2017 - 00:53 EST