<div dir="auto">Does anyone affected have a traceroute with hostname resolution on? We may be able to deduce whag platform by that.</div><div class="gmail_extra"><br><div class="gmail_quote">On Mar 6, 2017 1:13 AM, "Charles Sprickman" <<a href="mailto:spork@bway.net">spork@bway.net</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Also, what platform are they running that SNMP problems on the control plane tank the forwarding plane?<br>
<br>
> On Mar 6, 2017, at 12:27 AM, Peter Beckman <<a href="mailto:beckman@angryox.com">beckman@angryox.com</a>> wrote:<br>
><br>
> Here's the RFO Text for Discussion.<br>
><br>
> For me, their Corrective Actions seem a bit vague. Do L3 and other<br>
> network carriers not monitor CPU utilization across their fleets for<br>
> changes? If one router has a baseline CPU utilization of 15%, which<br>
> correlates with a certain number of packets being processed, and later<br>
> is using 85% CPU for the same number of packets, doesn't that set off<br>
> alarms somewhere before the router falls over?<br>
><br>
> Are there not logs about what changed on the router when?<br>
><br>
> Or is their monitoring not that sophisticated?<br>
><br>
> Hell, I'm tiny, and even I monitor and alarm on related metrics for<br>
> consistency, day vs. day and week vs. week, to catch changes like this.<br>
><br>
> I'd expect that by 19:48 GMT, L3 would have seen that something or<br>
> someone had modified some aspect of the config, or that SNMP requests<br>
> had increased, either of which should have led them to the root cause<br>
> much faster.<br>
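A minimal sketch of the kind of day-over-day CPU deviation check described above; the router name, sample values, and alarm threshold here are illustrative assumptions, and a real deployment would feed this from its own SNMP poller or time-series store.<br>
<pre>
#!/usr/bin/env python3
"""Alarm when a router's CPU utilization runs far above its prior baseline."""

from statistics import mean

DEVIATION_RATIO = 2.0  # assumed threshold: alarm when current CPU exceeds 2x baseline


def cpu_deviation_alarm(router, baseline_samples, current_samples):
    """Compare today's CPU% samples against a prior day's baseline."""
    baseline = mean(baseline_samples)
    current = mean(current_samples)
    if baseline > 0 and current > DEVIATION_RATIO * baseline:
        return f"ALARM {router}: CPU {current:.0f}% vs baseline {baseline:.0f}%"
    return None


# Synthetic example: a router that idled around 15% CPU yesterday and runs near 85% today.
msg = cpu_deviation_alarm(
    "edge1.hou",  # hypothetical router name
    baseline_samples=[14, 15, 16, 15],
    current_samples=[82, 85, 88, 84],
)
if msg:
    print(msg)
</pre>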
><br>
> ---- Original RFO ----<br>
><br>
> Cause<br>
> A software process adjustment on an internal server application produced an<br>
> inadvertent increase to polling frequency, which caused multiple routers to<br>
> bounce or become isolated in multiple North America markets, impacting IP and<br>
> voice services.<br>
><br>
> Resolution<br>
> The software process adjustment was reversed to stabilize services.<br>
><br>
> Summary<br>
> On March 2, 2017 at 17:55 GMT, the IP Network Operations Center (NOC) detected<br>
> alarms indicating there were unreachable nodes in the network. Simultaneously,<br>
> the Technical Service Center (TSC) began receiving reports of impact to IP and<br>
> voice services in multiple markets within North America. The IP NOC proceeded<br>
> with investigations to determine the root cause of the service issues and<br>
> several impacted routers were identified. At 19:48 GMT, the IP NOC analyzed a<br>
> router in Houston, TX as a potential root cause. In an effort to protect other<br>
> components of the network, the IP NOC isolated the Houston router during<br>
> investigations. At 20:37 GMT, the equipment vendor was engaged to assist with<br>
> troubleshooting of the impacted devices. Diagnostic information was collected<br>
> from the impacted routers, and evaluations were conducted to identify the root<br>
> cause. Troubleshooting was consistent, with highly experienced engineers and<br>
> management engaged throughout the incident.<br>
><br>
> During triage, some services were restored when multiple routers rebooted and<br>
> recovered; the last known service impacts were restored by 22:47 GMT. The<br>
> Houston router and associated services remained isolated while troubleshooting<br>
> of the continued impact progressed. At approximately 00:00 GMT, the root cause<br>
> was determined to be related to Simple Network Management Protocol (SNMP).<br>
> Earlier in the day, a software process had been adjusted on an internal server<br>
> system and inadvertently resulted in an increased polling frequency. The<br>
> increased frequency caused a heightened demand on the Central Processing Unit<br>
> (CPU) processing cycles on various routers. The system operates by polling<br>
> routers across the network to collect statistics used by Level 3 for both<br>
> billing and troubleshooting. The software process adjustment was reversed and<br>
> the Houston router was reinstated, restoring all remaining services on March 3,<br>
> 2017 at approximately 00:04 GMT.<br>
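For a sense of scale, a back-of-the-envelope sketch of how per-router SNMP load grows as the polling interval shrinks; the counter count and intervals are assumed figures, not Level 3's actual numbers.<br>
<pre>
#!/usr/bin/env python3
"""Show how shrinking the SNMP polling interval multiplies control-plane load."""

OIDS_PER_ROUTER = 5000  # assumed number of counters collected per poll cycle


def queries_per_second(interval_s):
    """SNMP GET operations a router must answer per second at a given polling interval."""
    return OIDS_PER_ROUTER / interval_s


for interval in (300, 60, 10):  # 5-minute, 1-minute, and 10-second polling
    print(f"poll every {interval:>3}s -> {queries_per_second(interval):6.1f} req/s per router")
</pre>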
><br>
> Corrective Actions<br>
> Level 3 is making continuous improvements, including the refinement of<br>
> processes and increased business focus, intended to prevent future service<br>
> issues associated with these software changes. Level 3 Architecture and<br>
> Engineering, Systems Operations, and the equipment vendor are engaged in a<br>
> detailed analysis of this incident. The analysis included investigations of<br>
> additional alarming capabilities to potentially provide earlier issue<br>
> identification. Appropriate actions will be taken in response to our continued<br>
> investigation, which may include an evaluation of potential enhanced<br>
> configurations and software code changes.<br>
><br>
> ---------------------------------------------------------------------------<br>
> Peter Beckman Internet Guy<br>
> <a href="mailto:beckman@angryox.com">beckman@angryox.com</a> <a href="http://www.angryox.com/" rel="noreferrer" target="_blank">http://www.angryox.com/</a><br>
> ---------------------------------------------------------------------------<br>
<br>
_______________________________________________<br>
Outages-discussion mailing list<br>
<a href="mailto:Outages-discussion@outages.org">Outages-discussion@outages.org</a><br>
<a href="https://puck.nether.net/mailman/listinfo/outages-discussion" rel="noreferrer" target="_blank">https://puck.nether.net/mailman/listinfo/outages-discussion</a><br>
</blockquote></div></div>