[Outages-discussion] Columbia SC Level 3 down

Jeremy Chadwick jdc at koitsu.org
Mon Mar 6 02:36:20 EST 2017


I'm wondering this too, as well as pondering how polling interval could
impact something this severely.  I'm just sort of brain dumping as I go
here.  Because the RFO mentioned use of SNMP data for billing, I can't
help but think that means octet counters.

A 64-bit counter (Counter64) ranges from 0 to 18,446,744,073,709,551,615,
after which it wraps.  A 10GE interface at 100% utilisation would be
doing roughly 1,250,000,000 bytes/sec (1000-based, not 1024).  That
means roughly 14,757,395,258 seconds (about 467 years) before the
counter would wrap.

A 32-bit counter (Counter32) is a bigger problem: 0 to 4,294,967,295, so
a 10GE interface at 100% utilisation would wrap it in ~3.4 seconds.
Surely everyone is using ifHC{In,Out}Octets by now...?!
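
A quick sanity check of that arithmetic in Python (back-of-the-envelope
only; assumes a flat 1.25e9 bytes/sec, i.e. the 10GE line rate above):

    #!/usr/bin/env python3
    # Rough wrap-time estimate for SNMP octet counters on a 10GE link.
    # Assumes sustained 10 Gbit/s = 1,250,000,000 bytes/sec (decimal, not 1024-based).

    LINE_RATE_BYTES_PER_SEC = 10_000_000_000 / 8   # 1.25e9

    COUNTER32_MAX = 2**32 - 1   # ifInOctets / ifOutOctets
    COUNTER64_MAX = 2**64 - 1   # ifHCInOctets / ifHCOutOctets

    for name, max_val in (("Counter32", COUNTER32_MAX), ("Counter64", COUNTER64_MAX)):
        wrap_sec = max_val / LINE_RATE_BYTES_PER_SEC
        print(f"{name}: wraps after ~{wrap_sec:,.1f} s (~{wrap_sec / 86400 / 365.25:.1f} years)")

which prints ~3.4 s for Counter32 and ~14,757,395,258 s (~467 years)
for Counter64.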

When it comes to increased CPU usage due to SNMP queries, extensive
walks of an OID/MIB tree are a common cause (vs. doing very granular
per-OID gets/fetches).  Part of me wonders if someone tried to move from
the get method to a more "dynamic" walk or "discovery" method and didn't
think of the ramifications.
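
To put rough numbers on the walk-vs-get difference, here's a toy sketch
(every figure below is made up for illustration -- I have no idea what
L3's poller actually queries):

    # Hypothetical poll-cycle cost: targeted gets vs. a full ifTable walk.
    # All counts below are assumptions, purely for illustration.
    interfaces      = 500   # ports on a big edge router (assumption)
    iftable_columns = 22    # objects per ifEntry row in IF-MIB's ifTable
    billed_ports    = 50    # ports we actually bill on (assumption)
    billed_counters = 2     # ifHCInOctets + ifHCOutOctets per billed port

    # Targeted gets: one varbind per counter; several can share a single GET PDU.
    get_varbinds = billed_ports * billed_counters

    # A naive GETNEXT walk of the whole table touches every object of every
    # row, typically one varbind (and often one PDU) at a time.
    walk_varbinds = interfaces * iftable_columns

    print(f"targeted gets    : {get_varbinds} varbinds per poll cycle")
    print(f"full ifTable walk: {walk_varbinds} varbinds per poll cycle "
          f"({walk_varbinds // get_varbinds}x more)")

With those made-up numbers that's 100 varbinds vs. 11,000 per cycle --
crank the polling frequency up at the same time and the agent CPU hit
compounds quickly.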

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |

On Mon, Mar 06, 2017 at 01:09:08AM -0500, Charles Sprickman wrote:
> Also, what platform are they running where SNMP problems on the control plane tank the forwarding plane?
> 
> > On Mar 6, 2017, at 12:27 AM, Peter Beckman <beckman at angryox.com> wrote:
> > 
> > Here's the RFO Text for Discussion.
> > 
> > For me, their Corrective Actions seem a bit vague. Do L3 and other
> > Network Carriers not monitor the CPU utilization across their fleet for
> > changes? If one router has a baseline CPU utilization of 15%, which
> > correlates with a certain number of packets being processed, and then later
> > is using 85% CPU for the same number of packets, doesn't that set off
> > alarms somewhere, before the router falls over??
> > 
> > Are there not logs about what changed on the router when?
> > 
> > Or is their monitoring not that sophisticated?
> > 
> > Hell, I'm tiny, and even I monitor and alarm on related metrics for
> > consistency, day vs day, week vs week, for such adjustments.
> > 
> > I'd expect that at 19:48 GMT, L3 should have seen that
> > something/someone modified some aspect of the config, OR an increase in
> > SNMP requests; either would have led them to the root cause much faster.
> > 
> >  ---- Original RFO ----
> > 
> > Cause
> > A software process adjustment on an internal server application produced an
> > inadvertent increase to polling frequency, which caused multiple routers to
> > bounce or become isolated in multiple North America markets, impacting IP and
> > voice services.
> > 
> > Resolution
> > The software process adjustment was reversed to stabilize services.
> > 
> > Summary
> > On March 2, 2017 at 17:55 GMT, the IP Network Operations Center (NOC) detected
> > alarms indicating there were unreachable nodes in the network. Simultaneously,
> > the Technical Service Center (TSC) began receiving reports of impact to IP and
> > voice services in multiple markets within North America. The IP NOC proceeded
> > with investigations to determine the root cause of the service issues and
> > several impacted routers were identified. At 19:48 GMT, the IP NOC analyzed a
> > router in Houston, TX as a potential root cause. In an effort to protect other
> > components of the network, the IP NOC isolated the Houston router during
> > investigations. At 20:37 GMT, the equipment vendor was engaged to assist with
> > troubleshooting of the impacted devices. Diagnostic information was collected
> > from the impacted routers, and evaluations were conducted to identify the root
> > cause. Troubleshooting was consistent, with highly experienced engineers and
> > management engaged throughout the incident.
> > 
> > During triage, some services were restored when multiple routers rebooted and
> > recovered; the last known service impacts were restored by 22:47 GMT. The
> > Houston router and associated services remained isolated while troubleshooting
> > of the continued impact progressed. At approximately 00:00 GMT, the root cause
> > was determined to be related to Simple Network Management Protocol (SNMP).
> > Earlier in the day, a software process had been adjusted on an internal server
> > system and inadvertently resulted in an increased polling frequency. The
> > increased frequency caused a heightened demand on the Central Processing Unit
> > (CPU) processing cycles on various routers. The system operates by polling
> > routers across the network to collect statistics used by Level 3 for both
> > billing and troubleshooting. The software process adjustment was reversed and
> > the Houston router was reinstated, restoring all remaining services on March 3,
> > 2017 at approximately 00:04 GMT.
> > 
> > Corrective Actions
> > Level 3 is making continuous improvements, including the refinement of
> > processes and increased business focus, intended to prevent future service
> > issues associated with these software changes. Level 3 Architecture and
> > Engineering, Systems Operations, and the equipment vendor are engaged in a
> > detailed analysis of this incident. The analysis included investigations of
> > additional alarming capabilities to potentially provide earlier issue
> > identification. Appropriate actions will be taken in response to our continued
> > investigation, which may include an evaluation of potential enhanced
> > configurations and software code changes.
> > 
> > ---------------------------------------------------------------------------
> > Peter Beckman                                                  Internet Guy
> > beckman at angryox.com                                 http://www.angryox.com/
> > ---------------------------------------------------------------------------
> 
> _______________________________________________
> Outages-discussion mailing list
> Outages-discussion at outages.org
> https://puck.nether.net/mailman/listinfo/outages-discussion

