[c-nsp] 256s cyclics in GRE? (was: 220s cyclic events?)

Dave Temkin dave at ordinaryworld.com
Tue Dec 27 08:27:00 EST 2005


What is the keying/data timeout set to on the IPSec tunnels in the PIX?

-Dave


On Tue, 27 Dec 2005, Andre Beck wrote:

> Re,
>
> I investigated this problem further but have found no solution yet.
> However, I now know the effect in more detail:
>
> On Tue, Dec 06, 2005 at 04:50:52PM +0100, Andre Beck wrote:
> > What we see is a short surge of lost packets, lasting approximately
> > 0.5 seconds (a 100ms interval ping will lose 5 to 6 packets), every
> > 220s or so (something in the range 215s to 220s, hard to measure
> > exactly). The whole remaining time is completely free of packet loss,
> > it's just the short hit every 220 seconds. It hoses IPT of course.
>
> When doing the first investigations, I was quite sure the effect was
> every 220s because mtr with 0.1s interval revealed it roughly every
> 2200 packets. But either I read the counters the wrong way or
> mtr doesn't actually use precisely 0.1s intervals. A more detailed
> test showed the effect to be somewhere between 252s and 259s, which
> now centers nicely around a very round number.
>
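
A quick arithmetic check of the two figures above, as a minimal sketch.
It assumes the bursts really were about 2200 probes apart and that the
true cycle lies in the 252s to 259s range; the function name is made up
for illustration. If both hold, mtr was averaging roughly 116 ms per
probe rather than the nominal 100 ms.

```python
def implied_probe_interval(cycle_seconds: float, probes_per_cycle: int) -> float:
    """Average time between probes if `probes_per_cycle` probes span one cycle."""
    return cycle_seconds / probes_per_cycle


NOMINAL_INTERVAL = 0.1  # seconds, the interval mtr was asked to use

# Lower bound, centre and upper bound of the measured cycle length.
for cycle in (252.0, 256.0, 259.0):
    actual = implied_probe_interval(cycle, 2200)
    drift = (actual / NOMINAL_INTERVAL - 1) * 100
    print(f"{cycle:.0f}s cycle: {actual * 1000:.1f} ms per probe, "
          f"{drift:.0f}% over nominal")
```
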
> > The most interesting observation about it is probably that it occurs
> > at the same time for *all* remote locations, so it likely is caused
> > by something in the central network, PIX or 3745.
>
> More interesting observations:
>
> * It occurs only in *one* direction. The packets going from central
>   to remote all reach their destinations, just the packets from
>   remote to central are lost in the surge. Together with the fact
>   that the surge occurs on all remote connections at the same time,
>   this clearly suggests the problem to be at the central side in
>   receiving, not at the remote side in sending.
>
> * I plugged an ACL onto the receiving tunnel interface on the 3745
>   which shows that the missing packets are never actually decapsulated
>   from the tunnel, so they are not lost in the central LAN after
>   being decapsed. I don't know of a way to prove they all reach the
>   router in encapsed form, though - at least not without a sniffer.
>
> * I normally can do a one million packets ping from the central router
>   to some LAN destination without losing any packet. But occasionally
>   I do lose one, and when that happens, it does so at the very moment
>   all tunnels observe the short surge of lost packets as well.
>
> * There are no anomalies in CEF or ARP at the occurrence of the effect
>   that I would see from running the respective debugs.
>
> > What completely baffles me, though, is that unfamiliar cycle time of
> > 220s. Would it be 60s, 120s or especially 300s I'd be able to name
> > a number of potential candidates for the phenomenon. ARP retries and
> > switch MAC timeouts would be prominent candidates. OSPF has way lower
> > timers, BGP is not involved, the GRE keepalive is 10s...
>
> Now with the (roughly) 256s cycles, I'm still baffled, but the number
> is way more familiar. Just not in networking terms, though...
>
> > Anyone know of an approx. 220s cyclic event on either an IOS router
> > or a PIX that could result in short events of packet loss? There are
> > no significant CPU spikes on the 3745. And for that matter, pinging
> > from a host in the central PIX515's DMZ (which is different from the
> > network that connects to the 3745) towards a remote PIX506 doesn't
> > result in *any* loss - so the problem must be within the VPN itself,
> > not in the infrastructure it's built on.
>
> I've also looked into that a bit and could prove the effect is *not*
> with the PIXen. There is no packet loss through the PIX IPsec VPN:
> at the very same time, to the same remote location, I can see packets
> go through the PIX VPN itself unharmed, while packets that travel
> over it GRE-encapsulated are hit by the surge.
>
> In a final test we are going to replace the LAN topology between the
> central PIX and router, eliminating some Bad Notworks switches I've
> been suspicious about for months. But I actually don't expect that to
> change the situation.
>
> Now after finding out about all this stuff, what would you say is going
> on here? Something on the 3745 hitting roughly every 256s and causing
> all GRE tunnels to lose some decapsulations at once? And how to debug
> this further, ideally with IOS debug and/or ACL means? I'm trying to
> show whether all encapsed GRE packets reach the router or whether they
> do not, but how to do this without a sniffer? Counting would be easy,
> but there's a lot of traffic on these links as they are in production...
>
> TIA,
> Andre.
>
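
On pinning down the cycle length more tightly than counting probes by
eye: a minimal sketch, assuming the timestamps (in seconds) of the lost
probes have already been extracted from a ping log. The function names
and the 2 s burst threshold are illustrative choices, not anything from
the setup described above.

```python
from statistics import median


def burst_starts(loss_times, max_gap=2.0):
    """Collapse lost-probe timestamps (seconds) into burst start times.

    Losses less than or equal to `max_gap` seconds apart count as one burst.
    """
    starts = []
    previous = None
    for t in sorted(loss_times):
        if previous is None or t - previous > max_gap:
            starts.append(t)
        previous = t
    return starts


def estimate_cycle(loss_times, max_gap=2.0):
    """Median spacing between consecutive bursts, or None with fewer than two."""
    starts = burst_starts(loss_times, max_gap)
    if len(starts) < 2:
        return None
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return median(gaps)


# Synthetic data: four bursts of five lost 100 ms probes, 256 s apart.
synthetic = [cycle * 256.0 + i * 0.1 for cycle in range(4) for i in range(5)]
print(estimate_cycle(synthetic))  # prints 256.0
```

The median is used rather than the mean so that one missed or split
burst does not skew the estimate much.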

