[outages] Level3 Chicago
Jeremy Chadwick
outages at jdc.parodius.com
Tue Aug 18 15:56:05 EDT 2009
The packet loss shown at hops #4, #5, and possibly #6 is either ICMP
deprioritisation or high CPU utilisation on said routers. My vote is
the former, given that ICMP deprioritisation is common these days and
most backbone providers are known to use it (Level 3, AboveNet, Sprint,
Verizon/MCI, and AT&T, to name a few).
The 18ms jump between hops #3 and #4 is likely normal due to geographic
distance, though I don't know where hop #3 is located; #4 is obviously
in Missouri. The same goes for the 12ms increase between hops #6 and #7.
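As a rough sanity check on whether a latency jump can be explained by
geography, you can use the rule of thumb that light in fibre covers
about 200 km per millisecond (an assumed figure, roughly two-thirds of
c, ignoring queuing and serialisation delay). A minimal sketch:

```python
# Rule-of-thumb figure: light in fibre travels at roughly two-thirds of c,
# i.e. about 200 km per millisecond of one-way delay. This is an assumed
# approximation that ignores queuing and serialisation delay.
FIBRE_KM_PER_MS = 200.0

def max_one_way_km(rtt_delta_ms: float) -> float:
    """Upper bound on the added one-way fibre distance implied by an RTT increase."""
    one_way_ms = rtt_delta_ms / 2.0  # an RTT covers the path in both directions
    return one_way_ms * FIBRE_KM_PER_MS

# An 18ms RTT jump allows for up to ~1800 km of extra one-way fibre --
# easily enough to account for a hop that lands in Missouri.
print(max_one_way_km(18.0))  # 1800.0
```

So an 18ms increase is well within what a cross-country hop can add on
its own, before suspecting anything is wrong.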
If this evidence were presented to Level 3, I can assure you they'd tell
you the same thing. Here's a similar example showing the exact behaviour
I describe, but within the Comcast network. Note hop #6:
HOST: icarus.home.lan     Loss%   Snt  Rcv  Last   Avg  Best  Wrst
  1. --------------        0.0%    45   45   0.3   0.4   0.3   2.4
  2. ???                 100.0     45    0   0.0   0.0   0.0   0.0
  3. 68.85.191.253         0.0%    45   45   8.4   8.1   6.7  11.8
  4. 68.85.154.149         0.0%    45   45  16.3  13.1   8.9  16.3
  5. 68.86.90.137          0.0%    45   45  14.1  13.1  10.9  17.1
  6. 68.86.85.181         97.8%    45    1  22.4  22.4  22.4  22.4
  7. 4.71.118.9            0.0%    45   45  13.6  14.9  11.4  49.1
  8. 4.68.18.195           0.0%    45   45  14.0  26.1  11.5 172.7
  9. 4.79.219.106          0.0%    45   45  14.0  14.5  12.4  17.3
 10. 209.128.95.111        0.0%    45   45  13.5  14.3  12.0  27.8
 11. 72.20.109.194         0.0%    45   45  14.0  15.2  12.0  40.0
 12. 72.20.106.125         0.0%    45   45  12.7  14.2  12.2  19.4
Here's another. Note hops #7, #8, and #9 (all Cogent), and also note the
increased latency at hops #7 and #8, which doesn't carry through to
later hops:
HOST: icarus.home.lan     Loss%   Snt  Rcv  Last   Avg  Best  Wrst
  1. --------------        0.0%    45   45   0.4   0.4   0.3   0.4
  2. ???                 100.0     45    0   0.0   0.0   0.0   0.0
  3. 68.85.191.253         0.0%    45   45   6.9   8.1   6.5  12.6
  4. 68.85.154.149         0.0%    45   45   9.0  10.0   8.6  12.6
  5. 68.86.91.225          0.0%    45   45  11.1  11.9  10.7  13.7
  6. 68.86.85.78           0.0%    45   45  12.0  14.3  11.6  42.5
  7. 154.54.11.105        68.9%    45   14 191.4  26.3  11.9 191.4
  8. 154.54.28.81         51.1%    45   22 206.6  37.7  12.9 206.6
  9. 66.28.4.150          46.7%    45   24  14.9  16.8  14.5  29.7
 10. 38.112.39.114         0.0%    45   45  16.9  17.4  15.4  20.7
 11. 38.104.134.30         0.0%    45   45  14.7  18.5  13.9  26.6
 12. ???                 100.0     45    0   0.0   0.0   0.0   0.0
The easiest way to determine whether there's a real problem is to check
whether the loss seen at a hop continues through all succeeding hops.
Here's a real-life example (IPs removed for security reasons):
HOST: ----------------------   Loss%   Snt  Last   Avg  Best  Wrst StDev
  1. ---------------            0.0%    60   6.8   6.0   0.4  85.9  16.1
  2. ---------------            0.0%    60   1.0  48.1   0.6 320.0  76.4
  3. 204.70.193.101            86.7%    60  29.3 122.4  23.5 290.8 106.1
  4. 206.24.227.105            86.7%    60  50.7   9.4   1.3  50.7  17.5
  5. 204.70.192.53             88.3%    60  15.0  11.3   1.6  15.6   6.6
  6. 204.70.194.178            88.3%    60  15.3  25.4  15.1  34.7   9.8
  7. 204.70.192.70             86.7%    60  34.7  40.9  34.6  84.2  17.5
  8. 204.70.194.18             88.3%    60  35.0  34.9  34.7  35.1   0.1
  9. 204.70.194.10             88.3%    60  34.9  35.0  34.7  35.6   0.3
 10. 208.175.175.18            88.3%    60  35.7  35.6  35.2  36.2   0.4
 11. 216.52.191.11             95.0%    60  35.8  35.7  35.6  35.8   0.1
 12. ---------------          100.0     60   0.0   0.0   0.0   0.0   0.0
What you see here was an issue with SAVVIS. The root cause was a router
of theirs that rebooted unexpectedly.
The destination (hop #12) drops ICMP, so 100% loss there was completely
normal -- the rest wasn't. This was determined by comparing the loss to
historical data from when the SAVVIS issue wasn't occurring.
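The "does the loss persist downstream?" check can be sketched in a few
lines of code. This is a minimal illustration of the heuristic, not a
polished tool; the function name, the 5% threshold, and the hop-list
format are all my own assumptions, with hop data given as (hop, loss%)
pairs as reported by mtr:

```python
# Heuristic: loss at a single hop that vanishes downstream is usually ICMP
# deprioritisation; loss that persists through every succeeding hop points
# at a real problem. Threshold and names are illustrative assumptions.

def real_loss_start(hops, threshold=5.0, dest_filters_icmp=True):
    """Return the first hop whose loss persists to the end of the path, or None."""
    # Optionally ignore the destination: some hosts drop ICMP outright,
    # so 100% loss at the final hop alone is completely normal.
    usable = hops[:-1] if dest_filters_icmp else hops
    for i, (hop, loss) in enumerate(usable):
        if loss > threshold and all(l > threshold for _, l in usable[i:]):
            return hop
    return None

# SAVVIS-style trace: loss starts at hop 3 and never recovers -> real problem.
savvis = [(1, 0.0), (2, 0.0), (3, 86.7), (4, 86.7), (5, 88.3), (6, 88.3),
          (7, 86.7), (8, 88.3), (9, 88.3), (10, 88.3), (11, 95.0), (12, 100.0)]
print(real_loss_start(savvis))   # 3

# Comcast-style trace: 97.8% loss at hop 6 only -> deprioritisation, not loss.
comcast = [(3, 0.0), (4, 0.0), (5, 0.0), (6, 97.8), (7, 0.0), (8, 0.0)]
print(real_loss_start(comcast))  # None
```

The same logic, applied by eye, is all the examples in this message are
really demonstrating.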
And one more example -- which just occurred about 10 minutes ago on my
home Comcast connection:
HOST: ----------------------  Loss%   Snt  Rcv  Last   Avg  Best  Wrst
  1. ---------------          28.9%    45   32   0.5   0.6   0.3   5.6
  2. ???                     100.0     45    0   0.0   0.0   0.0   0.0
  3. 68.85.191.253            28.9%    45   32   9.8  12.4   6.8 134.1
  4. 68.85.154.149            28.9%    45   32  10.6  10.7   8.7  16.1
  5. 68.86.90.141             28.9%    45   32  12.8  13.3  10.7  36.0
  6. 68.86.85.181             28.9%    45   32  13.1  14.4  11.7  29.4
  7. 68.86.86.50              28.9%    45   32  16.6  17.1  15.3  21.8
  8. 75.149.228.254           28.9%    45   32  27.2  20.6  14.7  40.3
  9. 209.58.116.50            28.9%    45   32  28.2  17.3  14.5  34.4
 10. 216.6.33.6               28.9%    45   32  17.8  17.2  15.1  23.1
 11. 144.223.243.21           28.9%    45   32  16.5  17.5  15.9  21.3
 12. 144.232.8.111            28.9%    45   32  20.0  20.8  19.1  25.5
 13. 144.232.24.35            28.9%    45   32  23.2  24.3  22.8  28.9
 14. 144.232.20.60            31.1%    45   31  76.5  78.9  75.5 110.0
 15. 144.232.20.3             28.9%    45   32  76.3  76.7  74.9  81.4
 16. 144.223.34.242           28.9%    45   32  76.6  76.7  74.9  81.7
 17. 64.127.129.10            28.9%    45   32 107.7  89.0  85.8 107.7
 18. 96.34.2.9                28.9%    45   32  87.8  88.3  86.0 100.6
What happened here? Based on my router logs, it appears the DHCP lease
expired and numerous daemons on the router were automatically restarted
(this is normal in such a circumstance). It's possible that Comcast's
DHCP servers took too long to respond to a renewal and the router chose
to down/up the WAN interface, or possibly restart the relevant daemons.
Hard to say -- the logging isn't as verbose as I'd like.
My cable modem log doesn't indicate loss of signal (LOS) or a reboot, so
this was purely an IP-related issue, or an issue with my own equipment.
--
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, USA |
| Making life hard for others since 1977. PGP: 4BD6C0CB |
On Tue, Aug 18, 2009 at 01:31:28PM -0500, Brett Cooper wrote:
> I'm here in Kansas City, KS region and a trace to that IP in Detroit
> for Level3 shows this with the flags you have added for mtr. Level3
> is just having a bad week me thinks.
>
> fed11.home.lan (0.0.0.0) Tue Aug 18
> 13:28:42 2009
> Keys: Help Display mode Restart statistics Order of fields quit
> Packets Pings
> Host Loss% Snt Rcv Last Avg Best Wrst
> 1. router.home.lan 0.0% 274 274 0.2 0.2 0.1 0.3
> 2. ks-76-7-1-1.sta.embarqhsd.net 0.0% 274 274 7.7 8.3 6.2 114.8
> 3. ks-76-7-255-241.sta.embarqhsd.net 0.0% 274 274 8.4 8.6 6.1 114.4
> 4. ge-7-19.car1.StLouis1.Level3.net 41.5% 273 159 34.1 26.6 12.6 213.9
> 5. ae-11-11.car2.StLouis1.Level3.net 47.3% 273 144 12.6 25.9 12.6 205.6
> 6. ae-4-4.ebr2.Chicago1.Level3.net 1.1% 273 270 19.2 26.4 18.5 140.8
> 7. ae-8-8.car1.Detroit1.Level3.net 0.0% 273 273 27.8 38.8 24.1 229.2
> 8. ge-6-12-222.car2.Detroit1.Level3. 0.0% 273 273 26.5 39.1 24.1 239.0
>
> --Brett
>
>
> Jeremy Chadwick wrote:
> >It would be helpful if you could use something like mtr instead of
> >traceroute in this case. The below trace could be indicating ICMP
> >de-prioritisation on L3 routers, which is known to be enabled, but could
> >also be an indicator of packet loss starting at hop #5 and "trickling
> >down" through succeeding hops (possibly #10 or #11).
> >
> >mtr --order=LSRNABW can help in diagnosing this.
> >