[Outages-discussion] [outages] [External]Re: CenturyLink Outages this morning

Frank Bulk frnkblk at iname.com
Thu Jan 24 18:23:39 EST 2019


Just saw this today.  


https://www.abqjournal.com/1271943/century-link-explains-end-of-year-outage.html

ALBUQUERQUE, N.M. — CenturyLink executives on Wednesday blamed a faulty
electronic unit called a “network management card” for its Dec. 27-28
national service outage.

The failed unit, made by California-based Infinera Corp., clogged up
CenturyLink’s Internet services with a tidal wave of faulty messaging, state
Government Affairs Director Johnny Montoya told the New Mexico Public
Regulation Commission during a presentation Wednesday morning. That
overloaded the system and caused a two-day outage that impacted people from
Massachusetts to Washington state, including tens of thousands of Internet
customers in New Mexico and people whose phones are based on voice over
Internet protocol, or VoIP.

Verizon users lost phone service during the outage, since that company
relies on CenturyLink to manage wireless data traffic in New Mexico.

Emergency 911 calls were affected in places like Albuquerque and Las Cruces
because many people didn’t have phone service to make calls and because the
lack of Internet service impeded city systems from automatically routing
calls that did come through to emergency answering services, Montoya said.

CenturyLink’s corrective response was slowed because the faulty unit
impacted the company’s event-management system, which is set up to help
technicians rapidly isolate the cause of problems, Montoya said.

“Once the faulty card was identified, we pulled it,” Montoya said. “We then
applied filters and reinitialized over 100 network nodes to re-establish
customer traffic.”

The faulty unit has since been sent back to Infinera for analysis to
determine the root causes of the problem. But CenturyLink accepts its
responsibility, Montoya said.

“On behalf of the company, I apologize for the outage,” he said. “It was our
fault.”

CenturyLink is also cooperating with the Federal Communications Commission
to investigate what happened.

The company has eliminated the possibility of a cyberattack as the culprit,
Montoya said.

In the meantime, CenturyLink is installing network filters and monitoring
systems to prevent such incidents from re-occurring.

“We’re looking at additional steps to create more fault tolerance in the
future,” Montoya said. “It’s fair to say this was a wake-up call for us and
our customers. We’re looking for ways to make our Internet service and all
our services more resistant.”

Commissioners and others asked the company to establish more efficient
emergency communications in case of future incidents. During the December
outage, customers, public officials and others could not communicate with
CenturyLink, creating confusion.

In Las Cruces, local managers and technicians thought their own systems had
failed, said Doña Ana County Commissioner Shannon Reynolds. Had they known
immediately about the problems with CenturyLink, they could have rapidly
switched to an analog phone system to manage 911 calls.

“There needs to be more communication upfront,” Reynolds said. “When a
problem this massive occurs, I would encourage (CenturyLink) to send out an
announcement to every news agency and others to alert the public.”

PRC Chair Theresa Becenti-Aguilar asked the company to establish direct
emergency contacts with the PRC and other entities for swift, 24/7
communications in an emergency.


-----Original Message-----
From: Outages-discussion <outages-discussion-bounces at outages.org> On Behalf
Of Jay Farrell
Sent: Saturday, December 29, 2018 11:06 AM
To: outages-discussion at outages.org
Subject: Re: [Outages-discussion] [outages] [External]Re: CenturyLink
Outages this morning

From the reddit thread:

Please note we have updated your ticket:

*Event Conclusion Summary*

Outage Start: December 27, 2018 08:40 GMT
Outage Stop: December 29, 2018 10:12 GMT
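
For reference, the start and stop times above span a little over two days. A minimal Python sketch of the arithmetic, assuming both timestamps are taken at face value as GMT (the variable names are illustrative only):

```python
from datetime import datetime, timezone

# Timestamps copied from the Event Conclusion Summary, treated as GMT/UTC.
outage_start = datetime(2018, 12, 27, 8, 40, tzinfo=timezone.utc)
outage_stop = datetime(2018, 12, 29, 10, 12, tzinfo=timezone.utc)

duration = outage_stop - outage_start

# Break the total into whole hours and leftover minutes.
total_minutes = int(duration.total_seconds() // 60)
hours, minutes = divmod(total_minutes, 60)

print(f"Outage duration: {hours} h {minutes} min")  # 49 h 32 min
```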

Root Cause: A CenturyLink network management card in Denver, CO was
propagating invalid frame packets across devices.

Fix Action: To restore services the card in Denver was removed from
the equipment, secondary communication channel tunnels between
specific devices were removed across the network, and a polling filter
was applied to adjust the way the packets were received in the
equipment. As repair actions were underway, it became apparent that
additional restoration steps were required for certain nodes, which
included either line card resets or Field Operations dispatches for
local equipment login. Once completed, all services restored.

RFO Summary: On December 27, 2018 at 08:40 GMT, CenturyLink identified
an initial service impact in New Orleans, LA. The NOC was engaged to
investigate the cause, and Field Operations were dispatched for
assistance onsite. Tier IV Equipment Vendor Support was engaged as it
was determined that the issue was larger than a single site. During
cooperative troubleshooting between the Equipment Vendor and
CenturyLink, a decision was made to isolate a device in San Antonio,
TX from the network as it seemed to be broadcasting traffic and
consuming capacity. This action did alleviate impact; however,
investigations remained ongoing. Focus shifted to additional sites
where network teams were unable to remotely troubleshoot equipment.
Field Operations were dispatched to sites in Kansas City, MO, Atlanta,
GA, New Orleans, LA and Chicago, IL for onsite support. As visibility
to equipment was regained, Tier IV Equipment Vendor Support evaluated
the logs to further assist with isolation. Additionally, a polling
filter was applied to the equipment in Kansas City, MO and New
Orleans, LA to prevent any additional effects. All necessary
troubleshooting teams, in cooperation with Tier IV Equipment Vendor
Support, were working to restore remote visibility to the remaining
sites. The issue had CenturyLink Executive level awareness for the
duration. A plan was formed to remove secondary communication channels
between select network devices until visibility could be restored,
which was undertaken by the Tier IV Equipment Vendor Technical Support
team in conjunction with CenturyLink Field Operations and NOC
engineers. While that effort continued, investigations into the logs,
including packet captures, were occurring in tandem, which ultimately
identified a suspected card issue in Denver, CO. Field Operations were
dispatched to remove the card. Once removed, it did not appear there
had been significant improvement; however, the logs were further
scrutinized by the Vendor's Advanced Support team and CenturyLink
Network Operations to identify that the source packet did originate
from this card. CenturyLink Tier III Technical Support shifted focus
to the application of strategic polling filters along with the
continued efforts to remove the secondary communication channels
between select nodes. Services began incrementally restoring. An
estimated restoral time of 09:00 GMT was provided; however, as repair
efforts steadily progressed, additional steps were identified for
certain nodes that impeded the restoration process. This included
either line card resets or Field Operations dispatches for local
equipment login. Various repair teams worked in tandem on these
actions to ensure that services were restored in the most expeditious
method available. By 2:30 GMT on December 29, it was confirmed that
the impacted IP, Voice, and Ethernet Access services were once again
operational. Point-to-point Transport Waves as well as Ethernet
Private Lines were still experiencing issues as multiple Optical
Carrier Groups (OCG) were still out of service. The Transport NOC
continued to work with the Tier IV Equipment Vendor Support and
CenturyLink Field Operations to replace additional line cards to
resolve the OCG issues. Several cards had to be ordered from the
nearest sparing depot. Once the remaining cards were replaced it was
confirmed that all services except a very small set of circuits had
restored, and the Transport NOC will continue to troubleshoot the
remaining impacted services under a separate Network Event. Services
were confirmed restored at 10:12 GMT. Please contact the Repair center
to address any lingering service issues.

Additional Information:
Please note that as formal post incident investigations and analysis
occur the details relayed here may evolve. Locating the management
card in Denver, CO that was sending invalid frame packets across the
network took significant analysis and packet captures to be identified
as a source as it was not in an alarm status. The CenturyLink network
continued to rebroadcast the invalid packets through the redundant
(secondary) communication routes. CenturyLink will review
troubleshooting steps to ensure that any areas of opportunity
regarding potential for restoral acceleration are addressed. These
invalid frame packets did not have a source, destination, or
expiration and were cleared out of the network via the application of
the polling filters and removal of the secondary communication paths
between specific nodes. The management card has been sent to the
equipment vendor where extensive forensic analysis will occur
regarding the underlying cause and how the packets were introduced in
this particular manner. The card has not been replaced and will not be
until the vendor review is supplied. There is no increased network
risk with leaving it unseated. At this time, there is no indication
that there was maintenance work on the card, software, or adjacent
equipment. The CenturyLink network is not at risk of reoccurrence due
to the placement of the polling filters and the removal of the
secondary communication routes between select nodes.
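
To illustrate the idea of filtering frames that lack a source, destination, or expiration, here is a minimal Python sketch. It is not CenturyLink's or the vendor's actual filter: the RFO does not describe the frame format or the filter logic, so the `Frame` fields, the hop-limit model of "expiration", and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    # Hypothetical fields; the real management-channel frame layout
    # is not given in the RFO.
    source: Optional[str]
    destination: Optional[str]
    ttl: Optional[int]  # "expiration" modeled here as a hop limit


def is_valid(frame: Frame) -> bool:
    """Accept a frame only if it names a source and a destination
    and carries a positive hop limit."""
    return (
        frame.source is not None
        and frame.destination is not None
        and frame.ttl is not None
        and frame.ttl > 0
    )


def filter_frames(frames: List[Frame]) -> List[Frame]:
    """Drop malformed frames rather than passing them on."""
    return [f for f in frames if is_valid(f)]


frames = [
    Frame("node-a", "node-b", 8),  # well-formed frame
    Frame(None, None, None),       # no source, destination, or expiration
]
kept = filter_frames(frames)
print(len(kept))  # 1
```

A frame with no expiration is never aged out, which is consistent with the RFO's statement that the network kept rebroadcasting the invalid packets over the secondary routes until they were filtered.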

https://www.reddit.com/r/networking/comments/a9z6tb/centurylink_outage_west_coast/ecszjax/
_______________________________________________
Outages-discussion mailing list
Outages-discussion at outages.org
https://puck.nether.net/mailman/listinfo/outages-discussion



