[c-nsp] Write queue size of XXX exceeded limit of 100 messages

Tue Dec 14 12:07:03 EST 2010

I'm trying to work through an issue with an upstream where we're seeing
intermittent BGP session flapping.  Once or twice a day our BGP session
with that provider drops, then flaps for anywhere between 5 minutes and the
better part of an hour.  If we enable debugging on that session during the
flapping, we get messages like:

*Sep  4 21:10:48.727: BGP(0): 1.2.3.4 write queue size of 471 exceeded limit of 100 messages
*Sep  4 21:10:48.827: BGP(0): 1.2.3.4 write queue size of 471 exceeded limit of 100 messages
*Sep  4 21:10:48.927: BGP(0): 1.2.3.4 write queue size of 471 exceeded limit of 100 messages
*Sep  4 21:10:49.027: BGP(0): 1.2.3.4 write queue size of 471 exceeded limit of 100 messages
*Sep  4 21:10:49.127: BGP(0): 1.2.3.4 write queue size of 471 exceeded limit of 100 messages
*Sep  4 21:10:49.227: BGP(0): 1.2.3.4 write queue size of 471 exceeded limit of 100 messages
*Sep  4 21:10:49.327: BGP(0): 1.2.3.4 write queue size of 471 exceeded limit of 100 messages
*Sep  4 21:10:49.427: BGP(0): 1.2.3.4 write queue size of 471 exceeded limit of 100 messages
*Sep  4 21:10:49.527: BGP(0): 1.2.3.4 write queue size of 471 exceeded limit of 100 messages
*Sep  4 21:10:49.627: BGP(0): 1.2.3.4 write queue size of 471 exceeded limit of 100 messages
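
For anyone who wants to tally these messages from a captured debug log, here
is a minimal Python sketch.  It assumes the log has been saved as plain text
in exactly the format quoted above; the function name and the tuple layout
are made up for illustration and are not part of any Cisco tooling.

```python
import re

# Matches the debug line quoted above.  Whitespace is collapsed first, so
# lines that were wrapped by a terminal or mail client still match.
WRITE_Q = re.compile(
    r"\*(?P<ts>\w{3} \d+ \d\d:\d\d:\d\d\.\d+): BGP\(\d+\): "
    r"(?P<peer>\S+) write queue size of (?P<size>\d+) "
    r"exceeded limit of (?P<limit>\d+) messages"
)

def parse_write_queue(text):
    """Return (timestamp, peer, queue_size, limit) for each message found."""
    flat = " ".join(text.split())  # collapse newlines and repeated spaces
    return [
        (m["ts"], m["peer"], int(m["size"]), int(m["limit"]))
        for m in WRITE_Q.finditer(flat)
    ]

if __name__ == "__main__":
    sample = (
        "*Sep  4 21:10:48.727: BGP(0): 1.2.3.4 write queue size of 471 "
        "exceeded limit\nof 100 messages\n"
    )
    for row in parse_write_queue(sample):
        print(row)
```

Grouping the rows by peer and queue size would show whether the reported
size ever changes during a flap or stays pinned at one value, as it does in
the excerpt above.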

Those messages come as fast as the router can spit them out and continue
until the session goes idle again.  Google hasn't been a whole lot of help
with that particular message -- I've found a few mentions that seem to
indicate a memory problem but it's not clear whether that would be with the
router displaying the error or the remote router.  In this case I have no
visibility into the remote router, but the router on this side of the
connection shouldn't be having memory issues.  It looks to me like the error
is actually being sent by the remote router (1.2.3.4) but I'm not even
certain about that.  

This side is a 7204 VXR NPE-G1 with 1 GB RAM.  Here's how things look when
everything is working normally:

router#sh proc mem
Total: 922997376, Used: 324146356, Free: 598851020

router#sh ip bgp summ
BGP router identifier 1.1.1.1, local AS number 11111
BGP table version is 40446332, main routing table version 40446332
383853 network entries using 38769153 bytes of memory
658504 path entries using 31608192 bytes of memory
108261 BGP path attribute entries using 6497640 bytes of memory
97819 BGP AS-PATH entries using 2529712 bytes of memory
0 BGP route-map cache entries using 0 bytes of memory
0 BGP filter-list cache entries using 0 bytes of memory
BGP using 79404697 total bytes of memory
BGP activity 4995866/4612013 prefixes, 30926762/30268258 paths, scan interval 60 secs

Neighbor        V    AS MsgRcvd  MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
1.2.3.4         4   123 7062260 11841061 40446285    0    0 1d09h          331684
1.1.1.1         4 11111 8191283  5916235 40446332    0    0 5w6d           276144
2.2.2.2         4 11111  583483  7077668 40446332    0    0 12w0d           50654
3.3.3.3         4 11111 6876244  4191128        0    0    0 2w0d     Idle (Admin)
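
As a quick sanity check on the memory figures in the output above, the
arithmetic can be sketched in a few lines of Python.  The numbers are copied
straight from the "sh proc mem" and "sh ip bgp summ" output; the variable
names and the pct() helper are just for illustration.  It only shows that
the local totals are self-consistent, not anything about fragmentation or
about the remote router.

```python
# Figures copied from the "sh proc mem" output above.
total, used, free = 922997376, 324146356, 598851020

# Per-structure figures copied from the "sh ip bgp summ" output above.
bgp_parts = {
    "network entries": 38769153,
    "path entries": 31608192,
    "path attribute entries": 6497640,
    "AS-PATH entries": 2529712,
}
bgp_total = 79404697  # "BGP using ... total bytes of memory"

def pct(part, whole):
    """Percentage of `whole` taken up by `part`, to one decimal place."""
    return round(100.0 * part / whole, 1)

# The reported totals add up exactly.
assert used + free == total
assert sum(bgp_parts.values()) == bgp_total

print("free memory, % of total:", pct(free, total))
print("BGP memory, % of used:  ", pct(bgp_total, used))
```

With these figures roughly 65% of memory is free and BGP accounts for about
a quarter of what is in use, at least in the steady state.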

The neighbor in AS "123" is the upstream; the AS "11111" neighbors are iBGP
sessions to other devices on the local network.  Those iBGP sessions stay
nailed up throughout the problems with the upstream session.  While the
upstream session is flapping, it will go active, the OutQ will spike up to
400-600 and PfxRcd is usually either at 0 or only a few hundred.  It will
stay that way for anywhere from about 30 seconds to 4 or 5 minutes before
dropping back to idle and starting over again.  The debug output shows a
timeout error when the session drops and returns to idle.

Since this has been an ongoing issue for a while now, we've aimed some
additional monitoring at the BGP session and the circuit carrying that
traffic and have noticed that most (though not every) time this happens, the
initial BGP session drop is preceded by a brief burst of packet loss on the
circuit.  The packet loss lasts no more than a minute or two and is
generally less than 5%.  We're running a more or less continual "ping -f"
across the circuit, which is the only way we're even able to detect such a
small loss.  The packet loss correlates with the BGP drop often enough that
it doesn't seem coincidental, but it doesn't precede every drop.  In the
cases where packet loss precedes the BGP drop, it always disappears as soon
as BGP drops and does not return at any time during the subsequent
flapping.
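
One way to put a number on "often enough" is to line up the timestamps of
the loss bursts against the timestamps of the initial session drops.  Below
is a minimal Python sketch of that comparison.  It assumes both sets of
timestamps have already been pulled out of the monitoring data as datetime
objects; the function name, the two-minute default window and the sample
times are all hypothetical, not real measurements.

```python
from datetime import datetime, timedelta

def drops_preceded_by_loss(drop_times, loss_times,
                           window=timedelta(minutes=2)):
    """Return the drops that had a loss sample within `window` before them."""
    return [
        d for d in drop_times
        if any(timedelta(0) <= d - l <= window for l in loss_times)
    ]

if __name__ == "__main__":
    # Made-up example timestamps, for illustration only.
    drops = [datetime(2010, 12, 1, 3, 15), datetime(2010, 12, 2, 14, 40)]
    losses = [datetime(2010, 12, 1, 3, 14), datetime(2010, 12, 2, 9, 0)]
    hits = drops_preceded_by_loss(drops, losses)
    print(len(hits), "of", len(drops), "drops were preceded by loss")
```

Running the same count with the window shifted to some unrelated offset
gives a rough baseline for how many matches would turn up by chance.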

Upstream seems to be taking the position that if it isn't happening right
when they get around to looking at it, there is no problem, so I'm hoping
someone here can perhaps shed some more light on the write queue error.

Thanks!

Andrew