[Outages-discussion] Columbia SC Level 3 down

Mitch Patterson mitpatterson at gmail.com
Mon Mar 6 01:18:45 EST 2017


Does anyone affected have a traceroute with hostname resolution on? We may
be able to deduce what platform they're running from that.
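
For example (the destination, hop number, hostname, and IP below are made
up, just to show the kind of trace I mean):

  $ traceroute www.example.com
  ...
   8  ae-2-52.edge3.Houston1.Level3.net (4.69.x.x)  32.1 ms
  ...

Interface-style prefixes in the reverse DNS (ae-/xe-/ge- versus Gi/Te/Hu or
BE) usually hint at Juniper versus Cisco gear, which is why a resolved trace
from someone in the affected area would help narrow it down.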

On Mar 6, 2017 1:13 AM, "Charles Sprickman" <spork at bway.net> wrote:

> Also, what platform are they running where SNMP problems on the control
> plane tank the forwarding plane?
>
> > On Mar 6, 2017, at 12:27 AM, Peter Beckman <beckman at angryox.com> wrote:
> >
> > Here's the RFO Text for Discussion.
> >
> > For me, their Corrective Actions seem a bit vague. Do L3 and other
> > network carriers not monitor CPU utilization across their fleet for
> > changes? If one router has a baseline CPU utilization of 15%, which
> > correlates with a certain number of packets being processed, and later
> > is using 85% CPU for the same number of packets, doesn't that set off
> > alarms somewhere before the router falls over?
> >
> > Are there not logs about what changed on the router when?
> >
> > Or is their monitoring not that sophisticated?
> >
> > Hell, I'm tiny, and even I monitor and alarm on related metrics for
> > consistency, day vs day and week vs week, to catch adjustments like this.
> >
> > I'd expect that at 19:48 GMT, L3 should have seen that something or
> > someone had modified some aspect of the config, OR an increase in SNMP
> > requests, which would have led them to root cause much faster.
> >
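> > As a rough sketch of the kind of check I mean (the thresholds, the
> > fetch_cpu() helper, and the alert() hook are all hypothetical):
> >
> >   # Compare each router's current CPU% with its reading from a week
> >   # ago, and alert on a large jump at a similar traffic level.
> >   def check_cpu(router, fetch_cpu, alert):
> >       now = fetch_cpu(router, days_ago=0)       # e.g. 85.0 (%)
> >       baseline = fetch_cpu(router, days_ago=7)  # e.g. 15.0 (%)
> >       if baseline > 0 and now / baseline >= 3 and now - baseline >= 20:
> >           alert("%s CPU %.0f%% now vs %.0f%% a week ago"
> >                 % (router, now, baseline))
> >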
> >  ---- Original RFO ----
> >
> > Cause
> > A software process adjustment on an internal server application produced
> > an inadvertent increase to polling frequency, which caused multiple
> > routers to bounce or become isolated in multiple North America markets,
> > impacting IP and voice services.
> >
> > Resolution
> > The software process adjustment was reversed to stabilize services.
> >
> > Summary
> > On March 2, 2017 at 17:55 GMT, the IP Network Operations Center (NOC)
> > detected alarms indicating there were unreachable nodes in the network.
> > Simultaneously, the Technical Service Center (TSC) began receiving
> > reports of impact to IP and voice services in multiple markets within
> > North America. The IP NOC proceeded with investigations to determine the
> > root cause of the service issues and several impacted routers were
> > identified. At 19:48 GMT, the IP NOC analyzed a router in Houston, TX as
> > a potential root cause. In an effort to protect other components of the
> > network, the IP NOC isolated the Houston router during investigations.
> > At 20:37 GMT, the equipment vendor was engaged to assist with
> > troubleshooting of the impacted devices. Diagnostic information was
> > collected from the impacted routers, and evaluations were conducted to
> > identify the root cause. Troubleshooting was consistent, with highly
> > experienced engineers and management engaged throughout the incident.
> >
> > During triage, some services were restored when multiple routers
> > rebooted and recovered; the last known service impacts were restored by
> > 22:47 GMT. The Houston router and associated services remained isolated
> > while troubleshooting of the continued impact progressed. At
> > approximately 00:00 GMT, the root cause was determined to be related to
> > Simple Network Management Protocol (SNMP). Earlier in the day, a
> > software process had been adjusted on an internal server system and
> > inadvertently resulted in an increased polling frequency. The increased
> > frequency caused a heightened demand on the Central Processing Unit
> > (CPU) processing cycles on various routers. The system operates by
> > polling routers across the network to collect statistics used by Level 3
> > for both billing and troubleshooting. The software process adjustment
> > was reversed and the Houston router was reinstated, restoring all
> > remaining services on March 3, 2017 at approximately 00:04 GMT.
> >
> > Corrective Actions
> > Level 3 is making continuous improvements, including the refinement of
> > processes and increased business focus, intended to prevent future
> > service issues associated with these software changes. Level 3
> > Architecture and Engineering, Systems Operations, and the equipment
> > vendor are engaged in a detailed analysis of this incident. The analysis
> > included investigations of additional alarming capabilities to
> > potentially provide earlier issue identification. Appropriate actions
> > will be taken in response to our continued investigation, which may
> > include an evaluation of potential enhanced configurations and software
> > code changes.
> >
> > ---------------------------------------------------------------------------
> > Peter Beckman                                                  Internet Guy
> > beckman at angryox.com                              http://www.angryox.com/
> > ---------------------------------------------------------------------------
>
> _______________________________________________
> Outages-discussion mailing list
> Outages-discussion at outages.org
> https://puck.nether.net/mailman/listinfo/outages-discussion
>