[j-nsp] bfd = busted failure detection :)
Richard A Steenbergen
ras at e-gerbil.net
Sat Nov 21 18:16:57 EST 2009
On Sat, Nov 21, 2009 at 12:53:58PM -0800, Nilesh Khambal wrote:
> Hi Richard,
>
> Just talking from this router perspective, it looks like the remote
> end router has problem receiving BFD packets from this router. It
> signaled the BFD session down because of that.
There are actually two particular interfaces between this pair of
routers (both MX960s running 9.4R3, both circuits are long-haul ~70ms
latency) that are flapping because of BFD. The interesting part is that
they both land on different DPCs (on both ends), there are other
circuits between these same devices which are not having BFD issues, and
I ran regular RE based pings between the devices (with src/dst set
correctly to force traffic over the links in question) and didn't record
any loss when BFD thought that it was detecting a failure.
The other side sees:
Session up time 02:39:23, previous down time 00:00:16
Local diagnostic CtlExpire, remote diagnostic NbrSignal
Session up time 02:39:36, previous down time 00:00:28
Local diagnostic CtlExpire, remote diagnostic NbrSignal
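For reference, here is a minimal sketch of how that pair of diagnostics can be read. The numeric codes come from RFC 5880 section 4.1. Mapping the abbreviations CtlExpire and NbrSignal onto codes 1 and 3 is an assumption of the sketch, and `interpret` is a hypothetical helper, not anything either vendor ships.

```python
# Diagnostic codes from RFC 5880, section 4.1. Pairing the abbreviated
# names from the output above with these codes is an assumption.
DIAG_CODES = {
    "CtlExpire": 1,  # Control Detection Time Expired
    "NbrSignal": 3,  # Neighbor Signaled Session Down
}


def interpret(local_diag: str, remote_diag: str) -> str:
    """Hypothetical helper: guess which side stopped hearing BFD packets."""
    local = DIAG_CODES.get(local_diag)
    remote = DIAG_CODES.get(remote_diag)
    if local == 1 and remote == 3:
        return "local detection timer expired; the peer followed our Down"
    if local == 3 and remote == 1:
        return "peer detection timer expired; we followed its Down"
    if local == 1 and remote == 1:
        return "both detection timers expired"
    return "not covered by this sketch"


print(interpret("CtlExpire", "NbrSignal"))
```

Read this way, the router reporting CtlExpire is the one whose receive timer ran out, which matches the suggestion that the far end is the side not receiving BFD packets.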
Of course this is far from the only problem I have with BFD (just the
one I have right in front of me :P). There are other devices with
similar but more intermittent issues (a few times a day, seemingly at
random), so I'm interested in some better options for debugging it.
> You can start by looking at egress stats on the local router.
> See if there are any ttp queue drops (software queue drops) in "show
> pfe statistics traffic" or any queuing drops on the egress interface.
No software drops on either device, only a small and non-incrementing
number of hardware input drops and info cell drops.
> At the remote end, you can look for any input errors (framing, CRC
> etc.) at the interface level. Then look for any drops at the route
> lookup level and PFE CPU level. Check if PFE CPU is being overrun due
> to some excess host bound traffic. You can check "show pfe statistics
> error" on the routers on both sides along with "show pfe statistics
> traffic fpc <slot>" to check if any ASIC blocks are having issues and
> are dropping packets for this interface/PFE. Also, check the CPU and
> memory utilization of FPCs on either side using the "show chassis fpc"
> command.
Already checked the obvious stuff, no interface drops, no incrementing
errors (these devices have been up for a while, so there are a few old
ones from previous circuit flaps), the interfaces are nowhere close to
full, and ping over the affected links comes back clean.
PFE CPU/memory seems perfectly normal on all DPCs on both devices, they
all look pretty much like this:
                     Temp  CPU Utilization (%)   Memory    Utilization (%)
Slot State            (C)  Total  Interrupt      DRAM (MB) Heap     Buffer
  0  Online            41     11          0       1024       27         30
  1  Online            40     11          0       1024       26         30
  2  Online            37     12          0       1024       26         30
  3  Online            36     10          0       1024       26         31
  4  Online            36     15          0       1024       26         30
  5  Online            38     12          0       1024       26         30
  6  Empty
  7  Online            39     15          0       1024       26         30
  8  Online            38     14          0       1024       26         30
One of the circuits has a few Ipktwr Drops reported on one side (the
1836 is the I-chip in question; the other port is something completely
unrelated). The other circuit is clean:
Ipktwr Drops: 1836 3 0 5010
I was hoping for some kind of ppm delegate and/or bfd counter stats that
show exactly how many hellos are being missed, similar to what Cisco
has:
Holddown (hits): 4979(0), Hello (hits): 999(745796)
Rx Count: 745694, Rx Interval (ms) min/max/avg: 564/1228/883 last: 16 ms ago
Tx Count: 745698, Tx Interval (ms) min/max/avg: 756/1000/880 last: 652 ms ago
Elapsed time watermarks: -1 0 (last: 0)
Registered protocols: ISIS TE/FRR
Uptime: 1w0d
Last packet: Version: 1 - Diagnostic: 0
State bit: Up - Demand bit: 0
Poll bit: 0 - Final bit: 0
Multiplier: 5 - Length: 24
My Discr.: 1 - Your Discr.: 1
Min tx interval: 999000 - Min rx interval: 999000
Min Echo interval: 999000
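As a cross-check on those counters, here is a minimal sketch of the asynchronous-mode detection time from RFC 5880 section 6.8.4: the remote Detect Mult times the larger of the local Required Min RX Interval and the remote Desired Min TX Interval. The multiplier (5) and min tx interval (999000 us) are taken from the last packet shown above. The local required min rx value is assumed to be 999000 us as well, and both function names are made up for illustration.

```python
def detection_time_us(remote_detect_mult: int,
                      local_required_min_rx_us: int,
                      remote_desired_min_tx_us: int) -> int:
    """Asynchronous-mode detection time per RFC 5880 section 6.8.4."""
    return remote_detect_mult * max(local_required_min_rx_us,
                                    remote_desired_min_tx_us)


def holddown_remaining_ms(detection_us: int, ms_since_last_rx: int) -> int:
    """Time left before the session would be declared down (hypothetical helper)."""
    return detection_us // 1000 - ms_since_last_rx


# Values from the output above; the local min rx of 999000 us is an assumption.
det = detection_time_us(5, 999_000, 999_000)
print(det)                             # 4995000
print(holddown_remaining_ms(det, 16))  # 4979
print(1228 < det // 1000)              # True: worst Rx gap is well inside the window
```

With the last packet received 16 ms ago, the result is 4979 ms, the same number as the Holddown field, which suggests that field is the time remaining before expiry. Counters of this kind make it visible how close a session came to missing its window.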
--
Richard A Steenbergen <ras at e-gerbil.net> http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)