[j-nsp] bfd = busted failure detection :)

Sat Nov 21 18:16:57 EST 2009

On Sat, Nov 21, 2009 at 12:53:58PM -0800, Nilesh Khambal wrote:
> Hi Richard,
> 
> Just talking from this router perspective, it looks like the remote
> end router has problem receiving BFD packets from this router. It
> signaled the BFD session down because of that.

Two interfaces between this pair of routers (both MX960s running 9.4R3,
both circuits long-haul with ~70ms latency) are flapping because of BFD.
The interesting part is that they land on different DPCs (on both ends),
other circuits between these same devices are not having BFD issues, and
regular RE-based pings between the devices (with src/dst set correctly
to force traffic over the links in question) recorded no loss at the
times BFD thought it was detecting a failure.

The other side sees:

 Session up time 02:39:23, previous down time 00:00:16
 Local diagnostic CtlExpire, remote diagnostic NbrSignal

 Session up time 02:39:36, previous down time 00:00:28
 Local diagnostic CtlExpire, remote diagnostic NbrSignal
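
For anyone reading along, here is a rough Python sketch of how I read
those diagnostic pairs. The label-to-code mapping (CtlExpire = RFC 5880
diag 1, NbrSignal = diag 3) is my assumption about the abbreviations,
and interpret() and the verdict strings are made-up names for
illustration, not anything the router provides:

```python
import re

# Assumed mapping of the abbreviated labels to RFC 5880 diagnostic codes:
# 1 = Control Detection Time Expired, 3 = Neighbor Signaled Session Down.
LABEL_TO_CODE = {"CtlExpire": 1, "NbrSignal": 3}

DIAG_RE = re.compile(r"Local diagnostic (\w+), remote diagnostic (\w+)")


def interpret(line):
    """Return (local_code, remote_code, verdict) for one diagnostic line.

    Returns None if the line is not a diagnostic line. Unknown labels map
    to a code of None and a verdict of "unclassified".
    """
    m = DIAG_RE.search(line)
    if m is None:
        return None
    local = LABEL_TO_CODE.get(m.group(1))
    remote = LABEL_TO_CODE.get(m.group(2))
    if local == 1:
        # This box's own detection timer ran out: it stopped hearing the peer.
        verdict = "rx-timeout-here"
    elif local == 3:
        # This box went down only because the peer said so.
        verdict = "peer-declared-down"
    else:
        verdict = "unclassified"
    return (local, remote, verdict)


if __name__ == "__main__":
    print(interpret(" Local diagnostic CtlExpire, remote diagnostic NbrSignal"))
```

Read that way, the far side is the one whose detection timer expired,
which is consistent with it not receiving control packets sent from
this end.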

Of course this is far from the only problem I have with BFD (just the
one I have right in front of me :P). There are other devices with
similar but more intermittent issues (a few times a day, seemingly at
random), so I'm interested in some better options for debugging it.

> You can start by looking at egress stats on the local router. See if
> there are any ttp queue drops (software queue drops) in "show pfe
> statistics traffic" or any queuing drops on the egress interface.

No software drops on either device, only a small and non-incrementing
number of hardware input drops and info cell drops.

> At the remote end, you can look for any input errors (framing, CRC,
> etc.) at the interface level. Then look for any drops at the route
> lookup level and PFE CPU level. Check if PFE CPU is being overrun due
> to some excess host bound traffic. You can check "show pfe statistics
> error" on both routers along with "show pfe statistics traffic
> fpc <slot>" to check if any ASIC blocks are having issues and
> dropping packets for this interface/PFE. Also, check the CPU and
> memory utilization of FPCs on either side using the "show chassis fpc"
> command.

Already checked the obvious stuff: no interface drops, no incrementing
errors (these devices have been up for a while, so there are a few old
ones from previous circuit flaps), the interfaces are nowhere close to
full, and ping over the affected links comes back clean.

PFE CPU/memory seems perfectly normal on all DPCs on both devices; they
all look pretty much like this:

                     Temp  CPU Utilization (%)   Memory    Utilization (%)
Slot State            (C)  Total  Interrupt      DRAM (MB) Heap     Buffer
  0  Online            41     11          0       1024       27         30
  1  Online            40     11          0       1024       26         30
  2  Online            37     12          0       1024       26         30
  3  Online            36     10          0       1024       26         31
  4  Online            36     15          0       1024       26         30
  5  Online            38     12          0       1024       26         30
  6  Empty           
  7  Online            39     15          0       1024       26         30
  8  Online            38     14          0       1024       26         30

One of the circuits shows a few Ipktwr Drops on one side (the 1836 is
the relevant I-chip; the other port is something completely unrelated),
while the other circuit is clean:

Ipktwr Drops:       1836        3        0     5010

I was hoping for some kind of ppm delegate and/or bfd counter stats that 
show exactly how many hellos are being missed, similar to what Cisco 
has:

Holddown (hits): 4979(0), Hello (hits): 999(745796)
Rx Count: 745694, Rx Interval (ms) min/max/avg: 564/1228/883 last: 16 ms ago
Tx Count: 745698, Tx Interval (ms) min/max/avg: 756/1000/880 last: 652 ms ago
Elapsed time watermarks: -1 0 (last: 0)
Registered protocols: ISIS TE/FRR
Uptime: 1w0d
Last packet: Version: 1            - Diagnostic: 0
             State bit: Up         - Demand bit: 0
             Poll bit: 0           - Final bit: 0
             Multiplier: 5         - Length: 24
             My Discr.: 1          - Your Discr.: 1
             Min tx interval: 999000    - Min rx interval: 999000
             Min Echo interval: 999000
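
To show why those counters are useful, here is a small Python sketch
that pulls the numbers out of that output and compares the worst
observed Rx gap against the detection time. It uses the RFC 5880
asynchronous-mode rule (peer's multiplier times the larger of the local
required min rx and the peer's min tx). The output only shows the
peer's packet, so the local min rx defaulting to the same value is an
assumption, and detection_margin() is a hypothetical helper name:

```python
import re

# Lines copied from the Cisco output above (intervals in the packet are in
# microseconds, the Rx Interval counters are in milliseconds).
SAMPLE = """\
Rx Count: 745694, Rx Interval (ms) min/max/avg: 564/1228/883 last: 16 ms ago
Multiplier: 5         - Length: 24
Min tx interval: 999000    - Min rx interval: 999000
"""


def detection_margin(text, local_min_rx_us=None):
    """Compare the worst observed Rx gap with the BFD detection time."""
    rx = re.search(r"Rx Interval \(ms\) min/max/avg: (\d+)/(\d+)/(\d+)", text)
    mult = int(re.search(r"Multiplier: (\d+)", text).group(1))
    peer_tx_us = int(re.search(r"Min tx interval: (\d+)", text).group(1))
    if local_min_rx_us is None:
        # Assumption: the local required min rx matches the peer's min tx.
        local_min_rx_us = peer_tx_us
    # Detection time = peer multiplier * negotiated receive interval.
    detect_ms = mult * max(local_min_rx_us, peer_tx_us) / 1000
    max_gap_ms = int(rx.group(2))
    return {
        "detect_ms": detect_ms,
        "max_gap_ms": max_gap_ms,
        "margin_ms": detect_ms - max_gap_ms,
    }


if __name__ == "__main__":
    print(detection_margin(SAMPLE))
```

With the numbers above that is 5 x 999 ms = 4995 ms of detection time
against a worst gap of 1228 ms, so that session never came close to
expiring. That kind of margin is exactly what I can't see on the
Juniper side.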

-- 
Richard A Steenbergen <ras at e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)