<div dir="auto">Does anyone affected have a traceroute with hostname resolution on? We may be able to deduce whag platform by that.</div><div class="gmail_extra"><br><div class="gmail_quote">On Mar 6, 2017 1:13 AM, "Charles Sprickman" <<a href="mailto:spork@bway.net">spork@bway.net</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Also, what platform are they running that SNMP problems on the control plane tank the forwarding plane?<br>
<br>
> On Mar 6, 2017, at 12:27 AM, Peter Beckman <<a href="mailto:beckman@angryox.com">beckman@angryox.com</a>> wrote:<br>
><br>
> Here's the RFO Text for Discussion.<br>
><br>
> For me, their Corrective Actions seem a bit vague. Do L3 and other<br>
> network carriers not monitor CPU utilization across their fleets for<br>
> changes? If one router has a baseline CPU utilization of 15%, which<br>
> correlates with a certain number of packets being processed, and later<br>
> is using 85% CPU for the same number of packets, doesn't that set off<br>
> alarms somewhere before the router falls over?<br>
><br>
> Are there not logs about what changed on the router when?<br>
><br>
> Or is their monitoring not that sophisticated?<br>
><br>
> Hell, I'm tiny, and even I monitor and alarm on related metrics for<br>
> consistency, day vs. day and week vs. week, to catch changes like this.<br>
><br>
> I'd expect that by 19:48 GMT, L3 would have seen that something or<br>
> someone had modified some aspect of the config, or that SNMP requests<br>
> had increased, either of which should have led them to the root cause<br>
> much faster.<br>
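A minimal sketch of the kind of day-over-day CPU deviation check described above; the router name, sample values, and alarm threshold here are illustrative assumptions, and a real deployment would feed this from its own SNMP poller or time-series store.<br>
<pre>
#!/usr/bin/env python3
"""Alarm when a router's CPU utilization runs far above its prior baseline."""

from statistics import mean

DEVIATION_RATIO = 2.0  # assumed threshold: alarm when current CPU exceeds 2x baseline


def cpu_deviation_alarm(router, baseline_samples, current_samples):
    """Compare today's CPU% samples against a prior day's baseline."""
    baseline = mean(baseline_samples)
    current = mean(current_samples)
    if baseline > 0 and current > DEVIATION_RATIO * baseline:
        return f"ALARM {router}: CPU {current:.0f}% vs baseline {baseline:.0f}%"
    return None


# Synthetic example: a router that idled around 15% CPU yesterday and runs near 85% today.
msg = cpu_deviation_alarm(
    "edge1.hou",  # hypothetical router name
    baseline_samples=[14, 15, 16, 15],
    current_samples=[82, 85, 88, 84],
)
if msg:
    print(msg)
</pre>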
><br>
> ---- Original RFO ----<br>
><br>
> Cause<br>
> A software process adjustment on an internal server application produced an<br>
> inadvertent increase to polling frequency, which caused multiple routers to<br>
> bounce or become isolated in multiple North America markets, impacting IP and<br>
> voice services.<br>
><br>
> Resolution<br>
> The software process adjustment was reversed to stabilize services.<br>
><br>
> Summary<br>
> On March 2, 2017 at 17:55 GMT, the IP Network Operations Center (NOC) detected<br>
> alarms indicating there were unreachable nodes in the network. Simultaneously,<br>
> the Technical Service Center (TSC) began receiving reports of impact to IP and<br>
> voice services in multiple markets within North America. The IP NOC proceeded<br>
> with investigations to determine the root cause of the service issues and<br>
> several impacted routers were identified. At 19:48 GMT, the IP NOC analyzed a<br>
> router in Houston, TX as a potential root cause. In an effort to protect other<br>
> components of the network, the IP NOC isolated the Houston router during<br>
> investigations. At 20:37 GMT, the equipment vendor was engaged to assist with<br>
> troubleshooting of the impacted devices. Diagnostic information was collected<br>
> from the impacted routers, and evaluations were conducted to identify the root<br>
> cause. Troubleshooting was consistent, with highly experienced engineers and<br>
> management engaged throughout the incident.<br>
><br>
> During triage, some services were restored when multiple routers rebooted and<br>
> recovered; the last known service impacts were restored by 22:47 GMT. The<br>
> Houston router and associated services remained isolated while troubleshooting<br>
> of the continued impact progressed. At approximately 00:00 GMT, the root cause<br>
> was determined to be related to Simple Network Management Protocol (SNMP).<br>
> Earlier in the day, a software process had been adjusted on an internal server<br>
> system and inadvertently resulted in an increased polling frequency. The<br>
> increased frequency caused a heightened demand on the Central Processing Unit<br>
> (CPU) processing cycles on various routers. The system operates by polling<br>
> routers across the network to collect statistics used by Level 3 for both<br>
> billing and troubleshooting. The software process adjustment was reversed and<br>
> the Houston router was reinstated, restoring all remaining services on March 3,<br>
> 2017 at approximately 00:04 GMT.<br>
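For a sense of scale, a back-of-the-envelope sketch of how per-router SNMP load grows as the polling interval shrinks; the counter count and intervals are assumed figures, not Level 3's actual numbers.<br>
<pre>
#!/usr/bin/env python3
"""Show how shrinking the SNMP polling interval multiplies control-plane load."""

OIDS_PER_ROUTER = 5000  # assumed number of counters collected per poll cycle


def queries_per_second(interval_s):
    """SNMP GET operations a router must answer per second at a given polling interval."""
    return OIDS_PER_ROUTER / interval_s


for interval in (300, 60, 10):  # 5-minute, 1-minute, and 10-second polling
    print(f"poll every {interval:>3}s -> {queries_per_second(interval):6.1f} req/s per router")
</pre>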
><br>
> Corrective Actions<br>
> Level 3 is making continuous improvements, including the refinement of<br>
> processes and increased business focus, intended to prevent future service<br>
> issues associated with these software changes. Level 3 Architecture and<br>
> Engineering, Systems Operations, and the equipment vendor are engaged in a<br>
> detailed analysis of this incident. The analysis included investigations of<br>
> additional alarming capabilities to potentially provide earlier issue<br>
> identification. Appropriate actions will be taken in response to our continued<br>
> investigation, which may include an evaluation of potential enhanced<br>
> configurations and software code changes.<br>
><br>
> ---------------------------------------------------------------------------<br>
> Peter Beckman Internet Guy<br>
> <a href="mailto:beckman@angryox.com">beckman@angryox.com</a> <a href="http://www.angryox.com/" rel="noreferrer" target="_blank">http://www.angryox.com/</a><br>
> ---------------------------------------------------------------------------<br>
<br>
_______________________________________________<br>
Outages-discussion mailing list<br>
<a href="mailto:Outages-discussion@outages.org">Outages-discussion@outages.org</a><br>
<a href="https://puck.nether.net/mailman/listinfo/outages-discussion" rel="noreferrer" target="_blank">https://puck.nether.net/mailman/listinfo/outages-discussion</a><br>
</blockquote></div></div>