[Outages-discussion] Columbia SC Level 3 down

Peter Beckman beckman at angryox.com
Mon Mar 6 00:27:03 EST 2017


Here's the RFO Text for Discussion.

For me, their Corrective Actions seem a bit vague. Do L3 and other
network carriers not monitor CPU utilization across their fleets for
changes? If one router has a baseline CPU utilization of 15%, which
correlates with a certain number of packets being processed, and later
is using 85% CPU for the same number of packets, doesn't that set off
alarms somewhere before the router falls over??

Are there not logs about what changed on the router when?

Or is their monitoring not that sophisticated?

Hell, I'm tiny, and even I monitor and alarm on related metrics for
consistency, day vs. day, week vs. week, to catch exactly this kind of drift.
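
The consistency check I mean can be sketched in a few lines. This is a minimal, hypothetical example (not anything L3 runs), assuming you already collect per-router CPU% and packets/sec samples:

```python
from statistics import mean, stdev

def cpu_alarm(history, current, sigma=3.0):
    """
    history: past (cpu_pct, pkts_per_sec) samples for one router.
    current: the latest (cpu_pct, pkts_per_sec) sample.
    Returns True when the CPU cost *per packet* drifts more than
    `sigma` standard deviations from its historical baseline --
    i.e. the router is suddenly working much harder for the same load.
    """
    ratios = [cpu / pps for cpu, pps in history]
    baseline, spread = mean(ratios), stdev(ratios)
    cpu, pps = current
    # max() guards against a zero spread on a perfectly flat baseline
    return abs(cpu / pps - baseline) > sigma * max(spread, 1e-9)
```

A router going from 15% CPU to 85% at the same packet rate trips this immediately, long before it falls over.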

I'd expect that at 19:48 GMT, L3 would have seen that something/someone
had modified some aspect of the config, OR an increase in SNMP requests;
either would have led them to root cause much faster.
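
On the SNMP side, the standard SNMPv2-MIB already counts inbound SNMP messages on every agent (snmpInPkts, OID 1.3.6.1.2.1.11.1), so a polling surge is detectable from two counter samples. A rough sketch, assuming you sample that counter periodically (counter wrap ignored for brevity):

```python
def poll_rate(prev, curr):
    """
    Inbound SNMP messages/sec from two (unix_time, snmpInPkts) samples.
    snmpInPkts (SNMPv2-MIB, OID 1.3.6.1.2.1.11.1) counts every SNMP
    message the agent has received since it restarted.
    """
    (t0, c0), (t1, c1) = prev, curr
    return (c1 - c0) / (t1 - t0)

def polling_surge(baseline_rate, prev, curr, factor=3.0):
    """
    True when the observed request rate exceeds `factor` times the
    router's normal polling rate -- the alarm that would have pointed
    straight at a runaway poller.
    """
    return poll_rate(prev, curr) > factor * baseline_rate
```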

   ---- Original RFO ----

Cause
A software process adjustment on an internal server application produced an
inadvertent increase to polling frequency, which caused multiple routers to
bounce or become isolated in multiple North America markets, impacting IP and
voice services.

Resolution
The software process adjustment was reversed to stabilize services.

Summary
On March 2, 2017 at 17:55 GMT, the IP Network Operations Center (NOC) detected
alarms indicating there were unreachable nodes in the network. Simultaneously,
the Technical Service Center (TSC) began receiving reports of impact to IP and
voice services in multiple markets within North America. The IP NOC proceeded
with investigations to determine the root cause of the service issues and
several impacted routers were identified. At 19:48 GMT, the IP NOC analyzed a
router in Houston, TX as a potential root cause. In an effort to protect other
components of the network, the IP NOC isolated the Houston router during
investigations. At 20:37 GMT, the equipment vendor was engaged to assist with
troubleshooting of the impacted devices. Diagnostic information was collected
from the impacted routers, and evaluations were conducted to identify the root
cause. Troubleshooting was consistent, with highly experienced engineers and
management engaged throughout the incident.

During triage, some services were restored when multiple routers rebooted and
recovered; the last known service impacts were restored by 22:47 GMT. The
Houston router and associated services remained isolated while troubleshooting
of the continued impact progressed. At approximately 00:00 GMT, the root cause
was determined to be related to Simple Network Management Protocol (SNMP).
Earlier in the day, a software process had been adjusted on an internal server
system and inadvertently resulted in an increased polling frequency. The
increased frequency caused a heightened demand on the Central Processing Unit
(CPU) processing cycles on various routers. The system operates by polling
routers across the network to collect statistics used by Level 3 for both
billing and troubleshooting. The software process adjustment was reversed and
the Houston router was reinstated, restoring all remaining services on March 3,
2017 at approximately 00:04 GMT.

Corrective Actions
Level 3 is making continuous improvements, including the refinement of
processes and increased business focus, intended to prevent future service
issues associated with these software changes. Level 3 Architecture and
Engineering, Systems Operations, and the equipment vendor are engaged in a
detailed analysis of this incident. The analysis includes investigations of
additional alarming capabilities to potentially provide earlier issue
identification. Appropriate actions will be taken in response to our continued
investigation, which may include an evaluation of potential enhanced
configurations and software code changes.

---------------------------------------------------------------------------
Peter Beckman                                                  Internet Guy
beckman at angryox.com                                 http://www.angryox.com/
---------------------------------------------------------------------------

