[Outages-discussion] Columbia SC Level 3 down

Charles Sprickman spork at bway.net
Mon Mar 6 01:09:08 EST 2017


Also, what platform are they running where SNMP problems on the control plane tank the forwarding plane?

> On Mar 6, 2017, at 12:27 AM, Peter Beckman <beckman at angryox.com> wrote:
> 
> Here's the RFO Text for Discussion.
> 
> For me, their Corrective Actions seem a bit vague. Do L3 and other
> network carriers not monitor CPU utilization across their fleet for
> changes? If one router has a baseline CPU utilization of 15%, which
> correlates with a certain number of packets being processed, and later
> is using 85% CPU for the same number of packets, doesn't that set off
> alarms somewhere before the router falls over??
> 
> Are there not logs about what changed on the router when?
> 
> Or is their monitoring not that sophisticated?
> 
> Hell, I'm tiny, and even I monitor and alarm on related metrics for
> consistency, day vs day and week vs week, to catch exactly these kinds of shifts.
> 
> I'd expect that at 19:48 GMT, L3 should have seen that
> something or someone had modified some aspect of the config, OR an increase in
> SNMP requests, either of which would have led them to root cause much faster.
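
For the sake of discussion, a minimal sketch of that kind of baseline check
(plain Python; the thresholds, sample source, and the idea of keying the
baseline to the same hour on previous days are assumptions for illustration,
not anything from the RFO):

    import statistics

    def cpu_alarm(baseline_samples, current_pct, ratio_threshold=2.0, floor_pct=30.0):
        """Flag a router whose CPU is far above its own recent baseline.

        baseline_samples: CPU% readings from the same hour on previous days
        current_pct:      the latest CPU% reading for that router
        """
        baseline = statistics.median(baseline_samples)
        # Only alarm when the router is both well above its own history and
        # above an absolute floor, to avoid noise on lightly loaded boxes.
        if current_pct >= floor_pct and current_pct >= ratio_threshold * baseline:
            return "ALARM: CPU %.0f%% vs baseline %.0f%%" % (current_pct, baseline)
        return None

    # A router that normally idles around 15% and is suddenly at 85%:
    print(cpu_alarm([14, 15, 16, 15], 85))

Nothing fancy, but a check like this against the 15%-to-85% jump described
above is exactly what I'd expect to fire before routers start falling over.
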
> 
>  ---- Original RFO ----
> 
> Cause
> A software process adjustment on an internal server application produced an
> inadvertent increase to polling frequency, which caused multiple routers to
> bounce or become isolated in multiple North America markets, impacting IP and
> voice services.
> 
> Resolution
> The software process adjustment was reversed to stabilize services.
> 
> Summary
> On March 2, 2017 at 17:55 GMT, the IP Network Operations Center (NOC) detected
> alarms indicating there were unreachable nodes in the network. Simultaneously,
> the Technical Service Center (TSC) began receiving reports of impact to IP and
> voice services in multiple markets within North America. The IP NOC proceeded
> with investigations to determine the root cause of the service issues and
> several impacted routers were identified. At 19:48 GMT, the IP NOC analyzed a
> router in Houston, TX as a potential root cause. In an effort to protect other
> components of the network, the IP NOC isolated the Houston router during
> investigations. At 20:37 GMT, the equipment vendor was engaged to assist with
> troubleshooting of the impacted devices. Diagnostic information was collected
> from the impacted routers, and evaluations were conducted to identify the root
> cause. Troubleshooting was consistent, with highly experienced engineers and
> management engaged throughout the incident.
> 
> During triage, some services were restored when multiple routers rebooted and
> recovered; the last known service impacts were restored by 22:47 GMT. The
> Houston router and associated services remained isolated while troubleshooting
> of the continued impact progressed. At approximately 00:00 GMT, the root cause
> was determined to be related to Simple Network Management Protocol (SNMP).
> Earlier in the day, a software process had been adjusted on an internal server
> system and inadvertently resulted in an increased polling frequency. The
> increased frequency caused a heightened demand on the Central Processing Unit
> (CPU) processing cycles on various routers. The system operates by polling
> routers across the network to collect statistics used by Level 3 for both
> billing and troubleshooting. The software process adjustment was reversed and
> the Houston router was reinstated, restoring all remaining services on March 3,
> 2017 at approximately 00:04 GMT.
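
As a rough illustration of why a polling-frequency change lands so hard on
the control plane, here's a back-of-envelope sketch (all of the numbers --
OID counts, per-OID CPU cost, intervals -- are assumptions for illustration,
not figures from the RFO):

    def control_plane_load(num_oids, cpu_ms_per_oid, poll_interval_s):
        """Estimate the fraction of one control-plane CPU spent answering SNMP."""
        busy_ms_per_cycle = num_oids * cpu_ms_per_oid
        return busy_ms_per_cycle / (poll_interval_s * 1000.0)

    # Same collector, same OIDs -- only the polling interval changes.
    for interval in (300, 60, 10):   # 5 min -> 1 min -> 10 s
        load = control_plane_load(num_oids=50000, cpu_ms_per_oid=0.5,
                                  poll_interval_s=interval)
        print("poll every %3ds -> ~%.0f%% of one CPU" % (interval, load * 100))

With those made-up numbers the same router goes from roughly 8% busy at a
5-minute interval to oversubscribed at 10 seconds, which is consistent with
the "heightened demand on CPU processing cycles" the RFO describes.
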
> 
> Corrective Actions
> Level 3 is making continuous improvements, including the refinement of
> processes and increased business focus, intended to prevent future service
> issues associated with these software changes. Level 3 Architecture and
> Engineering, Systems Operations, and the equipment vendor are engaged in a
> detailed analysis of this incident. The analysis included investigations of
> additional alarming capabilities to potentially provide earlier issue
> identification. Appropriate actions will be taken in response to our continued
> investigation, which may include an evaluation of potential enhanced
> configurations and software code changes.
> 
> ---------------------------------------------------------------------------
> Peter Beckman                                                  Internet Guy
> beckman at angryox.com                                 http://www.angryox.com/
> ---------------------------------------------------------------------------
> _______________________________________________
> Outages-discussion mailing list
> Outages-discussion at outages.org
> https://puck.nether.net/mailman/listinfo/outages-discussion


