[outages] IPv6 tunnels in FRA1 on HE.net down?
Constantine A. Murenin
mureninc at gmail.com
Wed Jun 12 18:19:56 EDT 2013
I smell a big outage:
% date; traceroute tserv1.fra1.he.net
Wed Jun 12 15:17:04 PDT 2013
traceroute to tserv1.fra1.he.net (216.66.80.30), 30 hops max, 60 byte packets
1 192.168.105.3 (192.168.105.3) 0.673 ms 0.773 ms 0.923 ms
2 10gigabitethernet7-6.core3.fmt2.he.net (65.49.10.217) 1.795 ms 1.811 ms 1.795 ms
3 10gigabitethernet12-1.core1.lax1.he.net (184.105.213.26) 19.773 ms 10gigabitethernet10-1.core1.sjc2.he.net (184.105.222.14) 0.786 ms 0.756 ms
4 10gigabitethernet10-8.core1.nyc4.he.net (72.52.92.225) 71.714 ms 76.569 ms 10gigabitethernet14-2.core1.nyc4.he.net (184.105.213.198) 71.690 ms
5 * * *
6 * * *
7 * * *
8 * * *
9 * * *
10 * * *
11 * * *
12 * * *
13 * * *
14 * * *
15 * *^C
However, a "reverse" traceroute, from a Hetzner host in Frankfurt to ns1.he.net, works fine:
% date; traceroute ns1.he.net
Wed Jun 12 15:18:47 PDT 2013
traceroute to ns1.he.net (216.218.130.2), 64 hops max, 40 byte packets
1 static.33.203.4.46.clients.your-server.de (46.4.203.33) 0.682 ms 0.531 ms 0.481 ms
2 hos-tr4.juniper2.rz13.hetzner.de (213.239.224.97) 0.245 ms
hos-tr3.juniper2.rz13.hetzner.de (213.239.224.65) 0.242 ms
hos-tr2.juniper1.rz13.hetzner.de (213.239.224.33) 0.241 ms
3 hos-bb2.juniper4.ffm.hetzner.de (213.239.240.150) 5.951 ms
hos-bb1.juniper1.ffm.hetzner.de (213.239.240.224) 4.803 ms 4.786 ms
4 30gigabitethernet4-3.core1.fra1.he.net (80.81.192.172) 6.354 ms 6.81 ms 5.451 ms
5 10gigabitethernet10-2.core1.par2.he.net (72.52.92.26) 22.803 ms 24.531 ms 24.787 ms
6 10gigabitethernet15-1.core1.ash1.he.net (184.105.213.93) 101.504 ms 99.563 ms 99.958 ms
7 10gigabitethernet11-1.core1.pao1.he.net (184.105.213.177) 163.687 ms 171.711 ms 175.34 ms
8 10gigabitethernet1-2.core1.fmt1.he.net (184.105.213.65) 163.940 ms 171.362 ms 167.67 ms
9 ns1.he.net (216.218.130.2) 163.52 ms 165.265 ms 164.143 ms
C.
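The forward traceroute above was read by eye to see where the path stops answering. A small helper can do the same on pasted output. This is a minimal sketch, assuming plain traceroute text like the output quoted in this message (including hop lines that a mail client wrapped); `parse_hops` and `last_responding_hop` are hypothetical names made up for illustration.

```python
import re

# A hop line starts with the hop number, e.g. "4 10gigabitethernet10-8...".
HOP_RE = re.compile(r"^\s*(\d+)\s+(.*)$")
# An RTT sample such as "71.714 ms"; silent "* * *" hops contain none.
RTT_RE = re.compile(r"\d+(?:\.\d+)?\s+ms\b")


def parse_hops(text):
    """Group traceroute output into (hop_number, text) pairs.

    Lines that do not start with a hop number (e.g. "0.756 ms" left over
    after a mail client wrapped a long hop line, or an indented alternate
    router) are appended to the previous hop. Lines before the first hop,
    such as the "traceroute to ..." header, are skipped.
    """
    hops = []
    for line in text.splitlines():
        match = HOP_RE.match(line)
        if match:
            hops.append([int(match.group(1)), match.group(2)])
        elif hops and line.strip():
            hops[-1][1] += " " + line.strip()
    return [(number, body) for number, body in hops]


def last_responding_hop(text):
    """Return (hop_number, first_router_name) of the last hop with an RTT, or None."""
    last = None
    for number, body in parse_hops(text):
        if RTT_RE.search(body):
            # First token that is not a "*" (lost probe) names the router.
            name = next(token for token in body.split() if token != "*")
            last = (number, name)
    return last
```

Fed the forward and reverse traceroutes above separately, it should stop at hop 4 (core1.nyc4) for the former and reach hop 9 (ns1.he.net) for the latter.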
On 12 June 2013 15:11, Constantine A. Murenin <mureninc at gmail.com> wrote:
> he.net fra1 tserv1 has been down for about 10 minutes (since ~14:57 PT);
> this time it seems like the whole thing is down: I cannot even ping
> ordns.he.net over IPv6, and the connection of a friend of mine who's
> running smokeping is also down, i.e. this is definitely widespread.
>
> Not sure what exactly is down, but it seems to be BGP-related, perhaps:
>
> 1 2600:3c01::8678:acff:fe0d:79c1 (2600:3c01::8678:acff:fe0d:79c1) 0.699 ms 0.828 ms 0.970 ms
> 2 10gigabitethernet7-6.core3.fmt2.he.net (2001:470:1:3b8::1) 5.936 ms 5.877 ms 5.859 ms
> 3 10gigabitethernet5-4.core1.pao1.he.net (2001:470:0:263::2) 6.743 ms 6.877 ms 6.978 ms
> 4 * * *
> 5 * * *
> 6 * * *
> 7 * * *
> 8 * * *
> 9 * * *
> 10 * * *
> 11 * * *
> 12 * * *
> 13 * * *
> 14 * * *
> 15 * * *
> 16 * * *
> 17 * * *
> 18 * * *
>
> C.
>
> On 14 May 2013 15:50, Constantine A. Murenin <mureninc at gmail.com> wrote:
>> For what it is worth, further details about the issue have surfaced.
>> I found a friend who also has a tunnel on tserv1.fra1.he.net., and he
>> has been running smokeping to various IPv4 and IPv6 resources for
>> quite a while.
>>
>> According to several of his smokeping reports, this very outage
>> occurred between 14T18:00/05 and 14T18:45/50; but we've also noticed
>> that there was another, 6-hour (yes, 6-hour) outage a day earlier,
>> ~13T12 to ~13T18 (which corresponds to early to late Monday morning,
>> Pacific Time).
>>
>> I've contacted he.net again this time around, and they said that
>> they're trying to hunt down some obscure kernel bug that is causing
>> these issues.
>>
>> Tunnelbroker.net is a free service, but a 6-hour outage, spanning a
>> quarter of a whole day, is absolutely ridiculous. I'm stunned that the
>> IPv6 connectivity of tserv1.fra1.he.net. is, apparently, still not
>> monitored, even though it's known to be having these issues.
>> ???
>>
>> Alternatively, it is, of course, possible that some engineer spent
>> those whole 5 or 6 hours on Sunday/Monday night troubleshooting the
>> root cause of this issue; but I find that somewhat hard to believe.
>> More likely, it broke, and no one responsible knew it was broken for
>> most of the time that it was.
>>
>> Even more troubling is that they don't even publish any reports about
>> these extended outages.
>>
>> For tserv1.fra1.he.net. end users: if you can `ping6 ordns.he.net`
>> (it runs on tserv itself, try $(host `dig +short -6 @ordns.he.net
>> whoami.akamai.net`)), but cannot `ping6 ns4.he.net`, then it most
>> likely means that tserv1.fra1.he.net IPv6-connectivity is down again,
>> and you must open a ticket with HE.net ASAP. Perhaps someone should
>> set up a smokeping with automatic emails to support at he.net?
>>
>> C.
>>
>> On 14 May 2013 09:42, Constantine A. Murenin <mureninc at gmail.com> wrote:
>>> This just happened again: all IPv6 tunnels on tserv1.fra1.he.net.
>>> were inaccessible; from within a tunnel, cannot access any IPv6
>>> resource, other than ordns.he.net, which runs on the tunnel server
>>> itself.
>>>
>>> Why does FRA1 lose IPv6 connectivity so often?
>>>
>>> Update: it seems to have been resolved while I was writing this
>>> email, but this seems to happen a few times too many.
>>>
>>> C.
>>>
>>> On 23 April 2013 19:28, Constantine A. Murenin <mureninc at gmail.com> wrote:
>>>> On 23 April 2013 18:28, Constantine A. Murenin <mureninc at gmail.com> wrote:
>>>>> As of a couple of minutes ago, my IPv6 tunnel seems to have no
>>>>> connectivity, weirdly other than being able to access ordns.he.net
>>>>> (2001:470:20::2) just fine, but not ns{2,3,4,5}.he.net, or any other
>>>>> IPv6 host.
>>>>
>>>> Update: reported to he.net support@ at 18:34, including a follow-up phone call;
>>>> everything's back online, as of 18:51 PT at the latest.
>>>> Pretty fast resolution, for a free service. :-)
>>>>
>>>> According to he.net, tserv wasn't responding on its IPv6 address, so it
>>>> has been rebooted.
>>>>
>>>> This is consistent with my mtr from a Linode:
>>>>
>>>> # mtr --report{,-wide,-cycles=60} --order "SRL BGAWV" -6 XXXXXXXXX ; date
>>>> 2. 10gigabitethernet2-3.core1.fmt1.he.net 60 60 0.0% 0.4 2.5 8.5 85.5 15.2
>>>> 3. 10gigabitethernet1-2.core1.sjc2.he.net 60 60 0.0% 0.8 2.5 5.3 51.6 8.5
>>>> 4. 10gigabitethernet3-3.core1.den1.he.net 60 60 0.0% 27.8 31.7 32.3 70.0 7.3
>>>> 5. 10gigabitethernet5-5.core1.mci3.he.net 60 60 0.0% 39.7 43.6 44.3 114.3 10.1
>>>> 6. 10gigabitethernet5-2.core1.chi1.he.net 60 60 0.0% 52.0 56.2 57.4 177.2 17.0
>>>> 7. 100gigabitethernet7-2.core1.nyc4.he.net 60 59 1.7% 69.1 72.2 72.5 119.8 7.1
>>>> 8. 10gigabitethernet1-2.core1.lon1.he.net 60 60 0.0% 137.8 141.8 142.2 229.2 12.0
>>>> 9. 10gigabitethernet4-2.core1.fra1.he.net 60 60 0.0% 149.4 154.0 154.4 242.8 12.7
>>>> 10. ??? 60 0 100.0 0.0 0.0 0.0 0.0 0.0
>>>> Tue Apr 23 18:14:29 PDT 2013
>>>>
>>>> ...
>>>>
>>>> 2. 10gigabitethernet2-3.core1.fmt1.he.net 60 60 0.0% 0.4 2.8 9.3 78.3 16.8
>>>> 3. 10gigabitethernet1-2.core1.sjc2.he.net 60 60 0.0% 0.8 2.3 3.9 13.9 3.8
>>>> 4. 10gigabitethernet3-3.core1.den1.he.net 60 60 0.0% 27.8 30.3 30.5 39.2 3.5
>>>> 5. 10gigabitethernet5-5.core1.mci3.he.net 60 60 0.0% 39.7 43.4 43.6 52.8 4.4
>>>> 6. 10gigabitethernet5-2.core1.chi1.he.net 60 60 0.0% 52.0 54.1 54.2 62.2 3.2
>>>> 7. 100gigabitethernet7-2.core1.nyc4.he.net 60 60 0.0% 69.1 71.5 71.6 82.1 3.5
>>>> 8. 10gigabitethernet1-2.core1.lon1.he.net 60 60 0.0% 137.8 140.3 140.4 149.1 3.5
>>>> 9. 10gigabitethernet4-2.core1.fra1.he.net 60 60 0.0% 149.4 152.3 152.3 162.5 3.7
>>>> 10. tserv1.fra1.he.net 60 60 0.0% 154.1 155.5 155.5 163.4 2.2
>>>> 11. IPv6.XXXXXX 60 59 1.7% 155.2 156.0 156.1 162.9 1.2
>>>> Tue Apr 23 18:55:34 PDT 2013
>>>>
>>>>
>>>>
>>>> And I guess ordns.he.net (2001:470:20::2) really runs on tserv
>>>> (and hence wasn't affected during the outage).
>>>>
>>>> Cns# echo {ordns,ns{2,3,4,5}}.he.net | xargs -n1 traceroute6 -l; date
>>>> traceroute6 to ordns.he.net (2001:470:20::2) from 2001:470:XXXX:YYYY::2, 64 hops max, 12 byte packets
>>>> 1 ordns.he.net (2001:470:20::2) 6.39 ms 6.495 ms 6.213 ms
>>>> traceroute6 to ns2.he.net (2001:470:200::2) from 2001:470:XXXX:YYYY::2, 64 hops max, 12 byte packets
>>>> 1 XXXXXX.tunnel.tserv6.fra1.ipv6.he.net (2001:470:XXXX:YYYY::1) 9.353 ms 8.783 ms 9.328 ms
>>>> 2 gige-g2-20.core1.fra1.he.net (2001:470:0:69::1) 13.12 ms 15.134 ms 6.252 ms
>>>> 3 10gigabitethernet5-3.core1.lon1.he.net (2001:470:0:1d2::1) 18.113 ms 23.554 ms 20.342 ms
>>>> 4 ns2.he.net (2001:470:200::2) 20.449 ms 22.873 ms 20.517 ms
>>>> traceroute6 to ns3.he.net (2001:470:300::2) from 2001:470:XXXX:YYYY::2, 64 hops max, 12 byte packets
>>>> 1 * XXXXXX.tunnel.tserv6.fra1.ipv6.he.net (2001:470:XXXX:YYYY::1) 8.743 ms 9.2 ms
>>>> 2 gige-g2-20.core1.fra1.he.net (2001:470:0:69::1) 5.934 ms 10.749 ms 6.197 ms
>>>> 3 10gigabitethernet5-3.core1.ams1.he.net (2001:470:0:225::1) 18.567 ms 17.728 ms 13.282 ms
>>>> 4 ns3.he.net (2001:470:300::2) 13.462 ms 13.412 ms 13.525 ms
>>>> traceroute6 to ns4.he.net (2001:470:400::2) from 2001:470:XXXX:YYYY::2, 64 hops max, 12 byte packets
>>>> 1 XXXXXX.tunnel.tserv6.fra1.ipv6.he.net (2001:470:XXXX:YYYY::1) 8.838 ms 8.684 ms 8.682 ms
>>>> 2 gige-g2-20.core1.fra1.he.net (2001:470:0:69::1) 6.233 ms 13.208 ms 5.97 ms
>>>> 3 ns4.he.net (2001:470:400::2) 6.145 ms 6.38 ms 6.384 ms
>>>> traceroute6 to ns5.he.net (2001:470:500::2) from 2001:470:XXXX:YYYY::2, 64 hops max, 12 byte packets
>>>> 1 * XXXXXX.tunnel.tserv6.fra1.ipv6.he.net (2001:470:XXXX:YYYY::1) 8.655 ms 8.848 ms
>>>> 2 gige-g2-20.core1.fra1.he.net (2001:470:0:69::1) 12.607 ms 6.527 ms 11.929 ms
>>>> 3 10gigabitethernet5-3.core1.ams1.he.net (2001:470:0:225::1) 13.289 ms 13.968 ms 16.754 ms
>>>> 4 ns5.he.net (2001:470:500::2) 14.111 ms 13.575 ms 13.385 ms
>>>> Tue Apr 23 18:59:48 PDT 2013
>>>>
>>>>
>>>> However, it's unclear why IPv6 connectivity of tserv1.fra1.he.net
>>>> doesn't seem to be monitored otherwise. :/
>>>>
>>>> Cheers,
>>>> Constantine.
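The check from the 14 May message (`ping6 ordns.he.net` works, but `ping6 ns4.he.net` does not) can be scripted, e.g. as the probe behind the suggested smokeping-style alerting. This is a minimal sketch, assuming a `ping6` binary on PATH that accepts `-c` (on newer Linux distributions `ping -6` may be needed instead); `reachable`, `diagnose` and the status strings are hypothetical names chosen here, not anything HE.net provides, and sending the alert email is left out.

```python
import subprocess


def reachable(host, count=3, timeout=30):
    """Hypothetical helper: True if `ping6 -c COUNT HOST` exits with status 0.

    Assumes a `ping6` binary on PATH; raises FileNotFoundError if it is
    missing. A ping that takes longer than `timeout` seconds counts as failed.
    """
    try:
        result = subprocess.run(
            ["ping6", "-c", str(count), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def diagnose(ordns_ok, ns4_ok):
    """Classify the tunnel state from two reachability results.

    ordns.he.net answers from the tunnel server itself, while ns4.he.net
    sits behind it, so comparing the two shows where the break is.
    """
    if ordns_ok and ns4_ok:
        return "ok"
    if ordns_ok:
        return "tserv-upstream-down"  # the 23 April / 14 May symptom: open a ticket
    if ns4_ok:
        return "ordns-unreachable"  # only the resolver address fails
    return "tunnel-down"  # even ordns.he.net fails, as on 12 June


# Example use (sends real ICMPv6 echo requests):
# print(diagnose(reachable("ordns.he.net"), reachable("ns4.he.net")))
```

Keeping the probing (`reachable`) separate from the classification (`diagnose`) lets the decision logic be checked without network access.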
--
V. V. Putin on perfection, 24 December 2000: If everything suits a
person, then he is a complete idiot. A healthy person of sound mind
cannot always be satisfied with everything.
More information about the Outages mailing list