[c-nsp] 256s cyclics in GRE? (was: 220s cyclic events?)

Andre Beck cisco-nsp at ibh.net
Tue Dec 27 08:02:01 EST 2005


Re,

I investigated this problem further and have found no solution yet, but
I now know the effect in more detail:

On Tue, Dec 06, 2005 at 04:50:52PM +0100, Andre Beck wrote:
> What we see is a short surge of lost packets, lasting approximately
> 0.5 seconds (a 100ms interval ping will lose 5 to 6 packets), every
> 220s or so (something in the range 215s to 220s, hard to measure
> exactly). The whole remaining time is completely free of packet loss,
> it's just the short hit every 220 seconds. It hoses IPT of course.

When doing the first investigations, I was quite sure the effect was
every 220s because mtr with 0.1s interval revealed it roughly every
2200 packets. But either I read the counters the wrong way or
mtr doesn't actually use precisely 0.1s intervals. A more detailed
test showed the effect to be somewhere between 252s and 259s, which
now centers nicely around a very round number.
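
As a quick sanity check on the mtr theory, here is a minimal sketch of
the arithmetic. It assumes the count of roughly 2200 packets per cycle
was read correctly and that the real cycle is the 252s to 259s measured
later; the function name is just illustrative.

```python
def effective_interval_s(cycle_s: float, packets_per_cycle: int) -> float:
    """Average spacing between probes if packets_per_cycle probes span cycle_s seconds."""
    return cycle_s / packets_per_cycle


if __name__ == "__main__":
    # Lower bound, suspected center and upper bound of the measured cycle.
    for cycle_s in (252.0, 256.0, 259.0):
        ms = effective_interval_s(cycle_s, 2200) * 1000.0
        print(f"{cycle_s:.0f}s cycle / 2200 packets = {ms:.1f} ms per probe")
```

Under those assumptions the effective probe spacing comes out at about
115 to 118 ms instead of the nominal 100 ms, which would be enough to
explain the difference between 220s and 256s.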
 
> The most interesting observation about it is probably that it occurs
> at the same time for *all* remote locations, so it likely is caused
> by something in the central network, PIX or 3745.

More interesting observations:

* It occurs only in *one* direction. The packets going from central
  to remote all reach their destinations, just the packets from
  remote to central are lost in the surge. Together with the fact
  that the surge occurs on all remote connections at the same time,
  this clearly suggests the problem to be at the central side in
  receiving, not at the remote side in sending.

* I plugged an ACL onto the receiving tunnel interface on the 3745
  which shows that the missing packets are never actually decapsulated
  from the tunnel, so they are not lost in the central LAN after
  being decapsulated. I don't know of a way to prove they all reach the
  router in encapsulated form, though - at least not without a sniffer.

* I normally can do a one million packets ping from the central router
  to some LAN destination without losing any packet. But occasionally
  I do lose one, and when that happens, it does so at the very moment
  all tunnels observe the short surge of lost packets as well.

* There are no anomalies in CEF or ARP at the occurrence of the effect
  that I could see from running the respective debugs.
 
> What completely baffles me, though, is that unfamiliar cycle time of
> 220s. Would it be 60s, 120s or especially 300s I'd be able to name
> a number of potential candidates for the phenomenon. ARP retries and
> switch MAC timeouts would be prominent candidates. OSPF has way lower
> timers, BGP is not involved, the GRE keepalive is 10s...

Now with the (roughly) 256s cycles, I'm still baffled, but the number
is way more familiar. Just not in networking terms, though...
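
To take the probe tool's pacing out of the measurement, the cycle can
be derived from timestamps instead of packet counts. The following is
only a sketch: it assumes each probe result is available as a
(timestamp in seconds, success flag) pair, and the function names and
the 1s burst gap threshold are made up for illustration.

```python
def burst_starts(samples, max_gap_s=1.0):
    """Return the start time of each loss burst.

    samples: iterable of (timestamp_s, ok) pairs, sorted by time.
    Lost probes closer together than max_gap_s count as one burst.
    """
    starts = []
    last_loss = None
    for ts, ok in samples:
        if ok:
            continue
        if last_loss is None or ts - last_loss > max_gap_s:
            starts.append(ts)  # first lost probe of a new burst
        last_loss = ts
    return starts


def burst_periods(starts):
    """Time between the starts of consecutive bursts."""
    return [b - a for a, b in zip(starts, starts[1:])]


if __name__ == "__main__":
    # Synthetic data: one probe every 0.1s for 800s, with 5 consecutive
    # probes lost at 256s, 512s and 768s.
    samples = [(i / 10, not (i >= 2560 and i % 2560 < 5)) for i in range(8000)]
    print(burst_periods(burst_starts(samples)))
```

Feeding it real probe logs taken over an hour or so should show
whether the period really sits at 256s or just wanders around it.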
 
> Anyone know of an approx. 220s cyclic event on either an IOS router
> or a PIX that could result in short events of packet loss? There are
> no significant CPU spikes on the 3745. And for that matter, pinging
> from a host in the central PIX515's DMZ (which is different from the
> network that connects to the 3745) towards a remote PIX506 doesn't
> result in *any* loss - so the problem must be within the VPN itself,
> not in the infrastructure it's built on.

I've also looked into that a bit and could prove the effect is *not*
with the PIXen. There is no packet loss through the PIX IPsec VPN:
at the very same time, to the same remote location, I can see packets
go through the PIX VPN per se while packets that travel over it
GRE-encapsulated experience the surge.

In a final test we are going to replace the LAN topology between the
central PIX and router, eliminating some Bad Notworks switches I've been
suspicious about for months. But I actually don't expect that to change
the situation.

Now after finding out about all this stuff, what would you say is going
on here? Something on the 3745 hitting roughly every 256s and causing
all GRE tunnels to lose some decapsulations at once? And how to debug
this further, ideally with IOS debug and/or ACL means? I'm trying to
show whether all encapsulated GRE packets reach the router or whether they
do not, but how to do this without a sniffer? Counting would be easy,
but there's a lot of traffic on these links as they are in production...

TIA,
Andre.
-- 
                  The _S_anta _C_laus _O_peration
  or "how to turn a complete illusion into a neverending money source"

-> Andre Beck    +++ ABP-RIPE +++    IBH Prof. Dr. Horn GmbH, Dresden <-

