[Outages-discussion] [outages] [External]Re: CenturyLink Outages this morning

Jay Farrell jayfar at jayfar.com
Sat Dec 29 12:06:24 EST 2018


From the reddit thread:

Please note we have updated your ticket:

*Event Conclusion Summary*

Outage Start: December 27, 2018 08:40 GMT
Outage Stop: December 29, 2018 10:12 GMT
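
For reference, the total duration implied by the two timestamps above can be checked with a few lines of Python. This is only an arithmetic sketch; the variable names are illustrative and nothing here comes from CenturyLink.

```python
from datetime import datetime, timezone

# Timestamps copied from the Event Conclusion Summary (GMT treated as UTC).
outage_start = datetime(2018, 12, 27, 8, 40, tzinfo=timezone.utc)
outage_stop = datetime(2018, 12, 29, 10, 12, tzinfo=timezone.utc)

duration = outage_stop - outage_start
total_minutes = int(duration.total_seconds()) // 60
hours, minutes = divmod(total_minutes, 60)

print(f"{hours}h {minutes}m")  # prints "49h 32m"
```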

Root Cause: A CenturyLink network management card in Denver, CO was
propagating invalid frame packets across devices.

Fix Action: To restore services the card in Denver was removed from
the equipment, secondary communication channel tunnels between
specific devices were removed across the network, and a polling filter
was applied to adjust the way the packets were received in the
equipment. As repair actions were underway, it became apparent that
additional restoration steps were required for certain nodes, which
included either line card resets or Field Operations dispatches for
local equipment login. Once completed, all services restored.

RFO Summary: On December 27, 2018 at 08:40 GMT, CenturyLink identified
an initial service impact in New Orleans, LA. The NOC was engaged to
investigate the cause, and Field Operations were dispatched for
assistance onsite. Tier IV Equipment Vendor Support was engaged as it
was determined that the issue was larger than a single site. During
cooperative troubleshooting between the Equipment Vendor and
CenturyLink, a decision was made to isolate a device in San Antonio,
TX from the network as it seemed to be broadcasting traffic and
consuming capacity. This action did alleviate impact; however,
investigations remained ongoing. Focus shifted to additional sites
where network teams were unable to remotely troubleshoot equipment.
Field Operations were dispatched to sites in Kansas City, MO, Atlanta,
GA, New Orleans, LA and Chicago, IL for onsite support. As visibility
to equipment was regained, Tier IV Equipment Vendor Support evaluated
the logs to further assist with isolation. Additionally, a polling
filter was applied to the equipment in Kansas City, MO and New
Orleans, LA to prevent any additional effects. All necessary
troubleshooting teams, in cooperation with Tier IV Equipment Vendor
Support, were working to restore remote visibility to the remaining
sites. The issue had CenturyLink Executive level awareness for the
duration. A plan was formed to remove secondary communication channels
between select network devices until visibility could be restored,
which was undertaken by the Tier IV Equipment Vendor Technical Support
team in conjunction with CenturyLink Field Operations and NOC
engineers. While that effort continued, investigations into the logs,
including packet captures, were occurring in tandem, which ultimately
identified a suspected card issue in Denver, CO. Field Operations were
dispatched to remove the card. Once removed, it did not appear there
had been significant improvement; however, the logs were further
scrutinized by the Vendor's Advanced Support team and CenturyLink
Network Operations to identify that the source packet did originate
from this card. CenturyLink Tier III Technical Support shifted focus
to the application of strategic polling filters along with the
continued efforts to remove the secondary communication channels
between select nodes. Services began incrementally restoring. An
estimated restoral time of 09:00 GMT was provided; however, as repair
efforts steadily progressed, additional steps were identified for
certain nodes that impeded the restoration process. This included
either line card resets or Field Operations dispatches for local
equipment login. Various repair teams worked in tandem on these
actions to ensure that services were restored in the most expeditious
method available. By 2:30 GMT on December 29, it was confirmed that
the impacted IP, Voice, and Ethernet Access services were once again
operational. Point-to-point Transport Waves as well as Ethernet
Private Lines were still experiencing issues as multiple Optical
Carrier Groups (OCG) were still out of service. The Transport NOC
continued to work with the Tier IV Equipment Vendor Support and
CenturyLink Field Operations to replace additional line cards to
resolve the OCG issues. Several cards had to be ordered from the
nearest sparing depot. Once the remaining cards were replaced it was
confirmed that all services except a very small set of circuits had
restored, and the Transport NOC will continue to troubleshoot the
remaining impacted services under a separate Network Event. Services
were confirmed restored at 10:12 GMT. Please contact the Repair center
to address any lingering service issues.

Additional Information:
Please note that as formal post incident investigations and analysis
occur the details relayed here may evolve. Locating the management
card in Denver, CO that was sending invalid frame packets across the
network took significant analysis and packet captures to be identified
as a source as it was not in an alarm status. The CenturyLink network
continued to rebroadcast the invalid packets through the redundant
(secondary) communication routes. CenturyLink will review
troubleshooting steps to ensure that any areas of opportunity
regarding potential for restoral acceleration are addressed. These
invalid frame packets did not have a source, destination, or
expiration and were cleared out of the network via the application of
the polling filters and removal of the secondary communication paths
between specific nodes. The management card has been sent to the
equipment vendor where extensive forensic analysis will occur
regarding the underlying cause and how the packets were introduced in
this particular manner. The card has not been replaced and will not be
until the vendor review is supplied. There is no increased network
risk with leaving it unseated. At this time, there is no indication
that there was maintenance work on the card, software, or adjacent
equipment. The CenturyLink network is not at risk of reoccurrence due
to the placement of the polling filters and the removal of the
secondary communication routes between select nodes.
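
To illustrate why a frame with no source, destination, or expiration can keep circulating over redundant paths, here is a toy flood simulation. It is a sketch under loud assumptions: the four-node topology, the "forward to every neighbor except the one it came from" rule, and the drop_invalid flag standing in for the polling filter are all hypothetical and are not taken from the RFO or from the vendor's equipment.

```python
def frames_in_flight(links, origin, steps, drop_invalid=False):
    """Count copies of one expiry-less frame in flight after each step."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Each entry is (node holding a copy, node the copy arrived from).
    in_flight = [(origin, None)]
    history = []
    for _ in range(steps):
        forwarded = []
        for node, came_from in in_flight:
            if drop_invalid and came_from is not None:
                continue  # hypothetical filter: receiver discards the frame
            for peer in neighbors.get(node, ()):
                if peer != came_from:
                    forwarded.append((peer, node))
        in_flight = forwarded
        history.append(len(in_flight))
    return history


# Hypothetical topology: a primary chain plus secondary links that add loops.
primary = [("A", "B"), ("B", "C"), ("C", "D")]
secondary = [("D", "A"), ("A", "C")]

looped = frames_in_flight(primary + secondary, "A", steps=6)
tree = frames_in_flight(primary, "A", steps=6)
filtered = frames_in_flight(primary + secondary, "A", steps=6, drop_invalid=True)

print("with secondary links:   ", looped)
print("secondary links removed:", tree)
print("with receive filter:    ", filtered)
```

In this toy model the copy count never reaches zero while the loops exist, and it falls to zero once either the secondary links are removed or receivers drop the frame, which matches the two remediation steps described above only in spirit.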

https://www.reddit.com/r/networking/comments/a9z6tb/centurylink_outage_west_coast/ecszjax/

