[c-nsp] BFD flapping on 6509 SUP720-3BXL

Peter Rathlev peter at rathlev.dk
Thu Mar 1 18:50:11 EST 2012


On Thu, 2012-03-01 at 12:42 -0500, Ross Halliday wrote:
> the SUP720s are running 12.2(33)SXI4a. The lone 7204 VXR is an NPE-G2
> box with 12.4(24)T1.

BFD should work okay on SXI in my experience. We haven't run SXI4a
though, so a weakness specific to that release, and not to SXI1 or SXI5
and later, is theoretically possible. I doubt it though.

BFD is handled at process level, so the more active BFD sessions you
have, the more CPU load. Could the device with the problems have a lot
more BFD sessions than the others?
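
Something like this should show how the boxes compare; the "| include
BFD" filter is just one way of picking out the BFD-related processes,
assuming their names contain "BFD":

  show bfd neighbors
  show processes cpu | include BFD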

> core-site_1-c6509#sh int gig 6/1 | incl drop
>   Input queue: 0/75/33/33 (size/max/drops/flushes); Total output drops: 0
> core-site_1-c6509#sh int gig 7/1 | incl drop
>   Input queue: 0/75/12/10 (size/max/drops/flushes); Total output drops: 115015
> core-site_1-c6509#sh int gig 7/6 | incl drop
>   Input queue: 0/75/4/4 (size/max/drops/flushes); Total output drops: 5943
> core-site_1-c6509#sh int gig 7/13 | incl drop
>   Input queue: 0/75/30/30 (size/max/drops/flushes); Total output drops: 0
> core-site_1-c6509#show ibc | inc spd drop
>         Potential/Actual paks copied to process level 1169099650/1167989541 (1110109 dropped, 276692 spd drops)

The per-interface "flushes" counter describes SPD drops specific to
that interface. Can you tell whether the flush counter increments at
the same time as the adjacency drops?
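
One quick way to check while the flaps are happening, using one of the
interfaces from your output just as an example:

  show int gig 7/1 | include flushes
  show bfd neighbors details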

I wonder if BFD is actually allowed to enter the SPD "headroom", or if
it gets discarded together with regular traffic. I would certainly
assume it's headroom eligible. Does anybody happen to know?
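
If your release supports it, "show ip spd" at least reports the current
SPD state and the headroom sizes, which might help narrow this down:

  show ip spd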

You might have luck raising the input hold-queue a little; we use
"hold-queue 256 in" on our TenGigE core interfaces. Be aware that a
hardware-forwarding device might not always benefit from raising this:
the input queue only serves traffic that has to be processed by the CPU
for one reason or another, and a lot of such traffic is usually a sign
that something is not right. On the other hand, the default of 75
packets is not much on an interface that can forward something like
10-20 Mpps. Even a short burst of maybe 200 packets can arrive faster
than the CPU can receive them, even though it could process them just
fine.
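
In config that is just something like this; the interface name is only
an example, pick whichever interfaces carry the BFD sessions:

  interface GigabitEthernet7/1
   hold-queue 256 in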

I wouldn't run the BFD timers any more aggressively than you need to.
We use 100/100/5 and each device typically has between 2 and 6 such
neighbors. More neighbors and more aggressive timers both make things
worse.
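
As a sketch, our 100/100/5 corresponds to this per-interface config;
the interface name is again just an example:

  interface GigabitEthernet7/1
   bfd interval 100 min_rx 100 multiplier 5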

> I don't quite understand what the IBC stuff is about. Does this
> indicate process-switched packets?

The IBC interface is the way traffic finds its way to the CPU for
software switching/processing. The "show ibc" command lists a lot of
information about this interface; the drops and rates are the most
interesting parts.

If your IBC interface carries too much traffic ("show ibc | incl rate")
you should investigate why. The busiest of our devices typically see a
rate of 300-500 pps rx on the IBC interface. A rate of more than twice
that would call for investigation IMO. I only have experience with our
local setup though; other networks might work fine with higher rates.

If you want to look at the CPU traffic you have at least two options:

 1) "debug netdr capture rx" and "show netdr captured-packets". This
    gives you a lot of nice information about IBC specific things and
    runs locally on the box. (Remember to undebug.)

 2) Use a SPAN session to send the traffic to a separate box where you
    can use Wireshark or another tool. Take a look here for a how-to:
    http://cisco.cluepon.net/index.php/6500_SPAN_the_RP
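
For option 1 the whole sequence is basically just this, run from enable
mode; do remember to turn the debug off again when you are done:

  debug netdr capture rx
  show netdr captured-packets
  undebug all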

-- 
Peter