[outages] Level3 Chicago
Jeremy Chadwick
outages at jdc.parodius.com
Tue Aug 18 15:56:05 EDT 2009
The packet loss shown at hops #4, #5, and possibly #6 is either ICMP
deprioritisation or high CPU utilisation on said routers. My vote is
the former, given that ICMP deprioritisation is common these days and
most backbone providers are known to use it (Level 3, AboveNet, Sprint,
Verizon/MCI, and AT&T, to name a few).
The 18ms jump between hops #3 and #4 is likely normal due to geographic
distance, though I don't know where hop #3 is located; #4 is obviously
in Missouri. The same goes for the 12ms increase between hops #6 and #7.
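As a rough sanity check on whether a latency jump can be explained by
geography, you can use the rule of thumb that light in fibre covers
about 200 km per millisecond (an assumed figure, roughly two-thirds of
c, ignoring queuing and serialisation delay). A minimal sketch:

```python
# Rule-of-thumb figure: light in fibre travels at roughly two-thirds of c,
# i.e. about 200 km per millisecond of one-way delay. This is an assumed
# approximation that ignores queuing and serialisation delay.
FIBRE_KM_PER_MS = 200.0

def max_one_way_km(rtt_delta_ms: float) -> float:
    """Upper bound on the added one-way fibre distance implied by an RTT increase."""
    one_way_ms = rtt_delta_ms / 2.0  # an RTT covers the path in both directions
    return one_way_ms * FIBRE_KM_PER_MS

# An 18ms RTT jump allows for up to ~1800 km of extra one-way fibre --
# easily enough to account for a hop that lands in Missouri.
print(max_one_way_km(18.0))  # 1800.0
```

So an 18ms increase is well within what a cross-country hop can add on
its own, before suspecting anything is wrong.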
If this evidence were presented to Level 3, I can assure you they'd tell
you the same thing. Here's a similar example showing the exact behaviour
I describe, but within the Comcast network. Note hop #6:
HOST: icarus.home.lan     Loss%   Snt  Rcv  Last   Avg  Best  Wrst
  1. --------------        0.0%    45   45   0.3   0.4   0.3   2.4
  2. ???                 100.0     45    0   0.0   0.0   0.0   0.0
  3. 68.85.191.253         0.0%    45   45   8.4   8.1   6.7  11.8
  4. 68.85.154.149         0.0%    45   45  16.3  13.1   8.9  16.3
  5. 68.86.90.137          0.0%    45   45  14.1  13.1  10.9  17.1
  6. 68.86.85.181         97.8%    45    1  22.4  22.4  22.4  22.4
  7. 4.71.118.9            0.0%    45   45  13.6  14.9  11.4  49.1
  8. 4.68.18.195           0.0%    45   45  14.0  26.1  11.5 172.7
  9. 4.79.219.106          0.0%    45   45  14.0  14.5  12.4  17.3
 10. 209.128.95.111        0.0%    45   45  13.5  14.3  12.0  27.8
 11. 72.20.109.194         0.0%    45   45  14.0  15.2  12.0  40.0
 12. 72.20.106.125         0.0%    45   45  12.7  14.2  12.2  19.4
Here's another. Note hops #7, #8, and #9 (all Cogent), and also note the
increased latency at hops #7 and #8, which doesn't carry through to
later hops:
HOST: icarus.home.lan     Loss%   Snt  Rcv  Last   Avg  Best  Wrst
  1. --------------        0.0%    45   45   0.4   0.4   0.3   0.4
  2. ???                 100.0     45    0   0.0   0.0   0.0   0.0
  3. 68.85.191.253         0.0%    45   45   6.9   8.1   6.5  12.6
  4. 68.85.154.149         0.0%    45   45   9.0  10.0   8.6  12.6
  5. 68.86.91.225          0.0%    45   45  11.1  11.9  10.7  13.7
  6. 68.86.85.78           0.0%    45   45  12.0  14.3  11.6  42.5
  7. 154.54.11.105        68.9%    45   14 191.4  26.3  11.9 191.4
  8. 154.54.28.81         51.1%    45   22 206.6  37.7  12.9 206.6
  9. 66.28.4.150          46.7%    45   24  14.9  16.8  14.5  29.7
 10. 38.112.39.114         0.0%    45   45  16.9  17.4  15.4  20.7
 11. 38.104.134.30         0.0%    45   45  14.7  18.5  13.9  26.6
 12. ???                 100.0     45    0   0.0   0.0   0.0   0.0
The easiest way to determine whether there's a real problem is to check
whether the loss seen at a hop continues through all succeeding hops.
Here's a real-life example (IPs removed for security reasons):
HOST: ----------------------   Loss%   Snt  Last   Avg  Best  Wrst StDev
  1. ---------------            0.0%    60   6.8   6.0   0.4  85.9  16.1
  2. ---------------            0.0%    60   1.0  48.1   0.6 320.0  76.4
  3. 204.70.193.101            86.7%    60  29.3 122.4  23.5 290.8 106.1
  4. 206.24.227.105            86.7%    60  50.7   9.4   1.3  50.7  17.5
  5. 204.70.192.53             88.3%    60  15.0  11.3   1.6  15.6   6.6
  6. 204.70.194.178            88.3%    60  15.3  25.4  15.1  34.7   9.8
  7. 204.70.192.70             86.7%    60  34.7  40.9  34.6  84.2  17.5
  8. 204.70.194.18             88.3%    60  35.0  34.9  34.7  35.1   0.1
  9. 204.70.194.10             88.3%    60  34.9  35.0  34.7  35.6   0.3
 10. 208.175.175.18            88.3%    60  35.7  35.6  35.2  36.2   0.4
 11. 216.52.191.11             95.0%    60  35.8  35.7  35.6  35.8   0.1
 12. ---------------          100.0     60   0.0   0.0   0.0   0.0   0.0
What you see here was an issue with SAVVIS. The root cause was a router
of theirs that rebooted unexpectedly.
The destination (hop #12) drops ICMP, so 100% loss there was completely
normal -- the rest wasn't. This was determined by comparing the loss to
historical data from when the SAVVIS issue wasn't occurring.
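The "does the loss persist downstream?" check can be sketched in a few
lines of code. This is a minimal illustration of the heuristic, not a
polished tool; the function name, the 5% threshold, and the hop-list
format are all my own assumptions, with hop data given as (hop, loss%)
pairs as reported by mtr:

```python
# Heuristic: loss at a single hop that vanishes downstream is usually ICMP
# deprioritisation; loss that persists through every succeeding hop points
# at a real problem. Threshold and names are illustrative assumptions.

def real_loss_start(hops, threshold=5.0, dest_filters_icmp=True):
    """Return the first hop whose loss persists to the end of the path, or None."""
    # Optionally ignore the destination: some hosts drop ICMP outright,
    # so 100% loss at the final hop alone is completely normal.
    usable = hops[:-1] if dest_filters_icmp else hops
    for i, (hop, loss) in enumerate(usable):
        if loss > threshold and all(l > threshold for _, l in usable[i:]):
            return hop
    return None

# SAVVIS-style trace: loss starts at hop 3 and never recovers -> real problem.
savvis = [(1, 0.0), (2, 0.0), (3, 86.7), (4, 86.7), (5, 88.3), (6, 88.3),
          (7, 86.7), (8, 88.3), (9, 88.3), (10, 88.3), (11, 95.0), (12, 100.0)]
print(real_loss_start(savvis))   # 3

# Comcast-style trace: 97.8% loss at hop 6 only -> deprioritisation, not loss.
comcast = [(3, 0.0), (4, 0.0), (5, 0.0), (6, 97.8), (7, 0.0), (8, 0.0)]
print(real_loss_start(comcast))  # None
```

The same logic, applied by eye, is all the examples in this message are
really demonstrating.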
And one more example -- which just occurred about 10 minutes ago on my
home Comcast connection:
HOST: ----------------------  Loss%   Snt  Rcv  Last   Avg  Best  Wrst
  1. ---------------          28.9%    45   32   0.5   0.6   0.3   5.6
  2. ???                     100.0     45    0   0.0   0.0   0.0   0.0
  3. 68.85.191.253            28.9%    45   32   9.8  12.4   6.8 134.1
  4. 68.85.154.149            28.9%    45   32  10.6  10.7   8.7  16.1
  5. 68.86.90.141             28.9%    45   32  12.8  13.3  10.7  36.0
  6. 68.86.85.181             28.9%    45   32  13.1  14.4  11.7  29.4
  7. 68.86.86.50              28.9%    45   32  16.6  17.1  15.3  21.8
  8. 75.149.228.254           28.9%    45   32  27.2  20.6  14.7  40.3
  9. 209.58.116.50            28.9%    45   32  28.2  17.3  14.5  34.4
 10. 216.6.33.6               28.9%    45   32  17.8  17.2  15.1  23.1
 11. 144.223.243.21           28.9%    45   32  16.5  17.5  15.9  21.3
 12. 144.232.8.111            28.9%    45   32  20.0  20.8  19.1  25.5
 13. 144.232.24.35            28.9%    45   32  23.2  24.3  22.8  28.9
 14. 144.232.20.60            31.1%    45   31  76.5  78.9  75.5 110.0
 15. 144.232.20.3             28.9%    45   32  76.3  76.7  74.9  81.4
 16. 144.223.34.242           28.9%    45   32  76.6  76.7  74.9  81.7
 17. 64.127.129.10            28.9%    45   32 107.7  89.0  85.8 107.7
 18. 96.34.2.9                28.9%    45   32  87.8  88.3  86.0 100.6
What happened here? Based on my router logs, it appears the DHCP lease
expired and numerous daemons on the router were automatically restarted
(this is normal in such a circumstance). It's possible that Comcast's
DHCP servers took too long to respond to a renewal and the router chose
to down/up the WAN interface, or possibly restart the relevant daemons.
Hard to say -- the logging isn't as verbose as I'd like.
My cable modem log doesn't indicate loss of signal (LOS) or a reboot, so
this was purely an IP-related issue, or an issue with my own equipment.
--
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, USA |
| Making life hard for others since 1977. PGP: 4BD6C0CB |
On Tue, Aug 18, 2009 at 01:31:28PM -0500, Brett Cooper wrote:
> I'm here in Kansas City, KS region and a trace to that IP in Detroit
> for Level3 shows this with the flags you have added for mtr. Level3
> is just having a bad week me thinks.
>
> fed11.home.lan (0.0.0.0) Tue Aug 18
> 13:28:42 2009
> Keys: Help Display mode Restart statistics Order of fields quit
> Packets Pings
> Host Loss% Snt Rcv Last Avg Best Wrst
> 1. router.home.lan 0.0% 274 274 0.2 0.2 0.1 0.3
> 2. ks-76-7-1-1.sta.embarqhsd.net 0.0% 274 274 7.7 8.3 6.2 114.8
> 3. ks-76-7-255-241.sta.embarqhsd.net 0.0% 274 274 8.4 8.6 6.1 114.4
> 4. ge-7-19.car1.StLouis1.Level3.net 41.5% 273 159 34.1 26.6 12.6 213.9
> 5. ae-11-11.car2.StLouis1.Level3.net 47.3% 273 144 12.6 25.9 12.6 205.6
> 6. ae-4-4.ebr2.Chicago1.Level3.net 1.1% 273 270 19.2 26.4 18.5 140.8
> 7. ae-8-8.car1.Detroit1.Level3.net 0.0% 273 273 27.8 38.8 24.1 229.2
> 8. ge-6-12-222.car2.Detroit1.Level3. 0.0% 273 273 26.5 39.1 24.1 239.0
>
> --Brett
>
>
> Jeremy Chadwick wrote:
> >It would be helpful if you could use something like mtr instead of
> >traceroute in this case. The below trace could be indicating ICMP
> >de-prioritisation on L3 routers, which is known to be enabled, but could
> >also be an indicator of packet loss starting at hop #5 and "trickling
> >down" through succeeding hops (possibly #10 or #11).
> >
> >mtr --order=LSRNABW can help in diagnosing this.
> >