[c-nsp] Write queue size of XXX exceeded limit of 100 messages

Pete Lumbis alumbis at gmail.com
Tue Dec 14 18:29:12 EST 2010


The message you are seeing relates to sending updates to the peer.
If the peer isn't ACKing them fast enough, we start to queue outbound
messages. If that continues, the queue keeps growing and eventually
exceeds the BGP write queue limit (100 messages in your output).
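
To make the mechanism concrete, here is a toy Python model of a write queue
that grows when the peer acknowledges more slowly than we send. It is only a
sketch: the function names and the per-tick rates are made up for
illustration, and it does not reflect how IOS actually implements the queue.
The limit of 100 is taken from the debug message.

```python
from collections import deque


def simulate_write_queue(updates_per_tick, acked_per_tick, ticks):
    """Toy model: each tick the sender enqueues some updates and the peer
    acknowledges (drains) some. Returns the queue depth after each tick."""
    queue = deque()
    depths = []
    for tick in range(ticks):
        # Sender side: new updates are queued for this peer.
        for n in range(updates_per_tick):
            queue.append((tick, n))
        # Peer side: only as many messages leave the queue as get ACK'd.
        for _ in range(min(acked_per_tick, len(queue))):
            queue.popleft()
        depths.append(len(queue))
    return depths


def first_tick_over_limit(depths, limit=100):
    """Return the first tick at which the depth exceeds the limit, or None."""
    for tick, depth in enumerate(depths):
        if depth > limit:
            return tick
    return None
```

With a peer that keeps up (acked_per_tick >= updates_per_tick) the depth
stays at zero; with a slow peer it grows every tick until it passes the limit.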

If you have access to the far side, check TCP, input queues,
interfaces, etc. This is most likely a problem on the far side of the
link.
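
If you end up taking this to the upstream, it may help to boil the debug
output down to a per-peer summary. A rough Python sketch, assuming the lines
look like the ones quoted in your message (the regex and the summarize()
helper are my own illustration, not anything provided by IOS):

```python
import re
from collections import defaultdict

# Matches e.g. "BGP(0): 1.2.3.4 write queue size of 471 exceeded limit of 100 messages"
PATTERN = re.compile(
    r"BGP\(\d+\):\s+(?P<peer>\S+)\s+write\s+queue\s+size\s+of\s+(?P<size>\d+)"
    r"\s+exceeded\s+limit\s+of\s+(?P<limit>\d+)\s+messages"
)


def summarize(log_text):
    """Count write-queue messages per peer and record the largest queue size.

    Handles lines that were wrapped (and '>'-quoted) by a mail client by
    stripping quote markers and collapsing all whitespace before matching.
    """
    unquoted = re.sub(r"^[ \t]*>[ \t]?", "", log_text, flags=re.MULTILINE)
    flat = " ".join(unquoted.split())
    stats = defaultdict(lambda: {"count": 0, "max_size": 0, "limit": None})
    for match in PATTERN.finditer(flat):
        entry = stats[match.group("peer")]
        entry["count"] += 1
        entry["max_size"] = max(entry["max_size"], int(match.group("size")))
        entry["limit"] = int(match.group("limit"))
    return dict(stats)
```

Feeding it the captured debug text returns a dict keyed by peer address with
the message count, the largest queue size seen, and the reported limit.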

HTH,
Pete

On Tue, Dec 14, 2010 at 12:07 PM,  <andrew2 at one.net> wrote:
>
> Trying to work through an issue with an upstream where we're seeing
> intermittent BGP session flapping.  Once or twice a day we're seeing our BGP
> session with that provider drop, then flap for anywhere between 5 minutes
> and the better part of an hour.  During the flapping if we enable debugging
> on that session, we get messages like:
>
> *Sep  4 21:10:48.727: BGP(0): 1.2.3.4 write queue size of 471 exceeded limit of 100 messages
> *Sep  4 21:10:48.827: BGP(0): 1.2.3.4 write queue size of 471 exceeded limit of 100 messages
> *Sep  4 21:10:48.927: BGP(0): 1.2.3.4 write queue size of 471 exceeded limit of 100 messages
> *Sep  4 21:10:49.027: BGP(0): 1.2.3.4 write queue size of 471 exceeded limit of 100 messages
> *Sep  4 21:10:49.127: BGP(0): 1.2.3.4 write queue size of 471 exceeded limit of 100 messages
> *Sep  4 21:10:49.227: BGP(0): 1.2.3.4 write queue size of 471 exceeded limit of 100 messages
> *Sep  4 21:10:49.327: BGP(0): 1.2.3.4 write queue size of 471 exceeded limit of 100 messages
> *Sep  4 21:10:49.427: BGP(0): 1.2.3.4 write queue size of 471 exceeded limit of 100 messages
> *Sep  4 21:10:49.527: BGP(0): 1.2.3.4 write queue size of 471 exceeded limit of 100 messages
> *Sep  4 21:10:49.627: BGP(0): 1.2.3.4 write queue size of 471 exceeded limit of 100 messages
>
> Those messages come as fast as the router can spit them out and continue
> until the session goes idle again.  Google hasn't been a whole lot of help
> with that particular message -- I've found a few mentions that seem to
> indicate a memory problem but it's not clear whether that would be with the
> router displaying the error or the remote router.  In this case I have no
> visibility into the remote router, but the router on this side of the
> connection shouldn't be having memory issues.  It looks to me like the error
> is actually being sent by the remote router (1.2.3.4) but I'm not even
> certain about that.
>
> Here's how things look when everything is working normally:
>
> This side is a 7204 VXR NPE-G1 with 1GB RAM.
>
> router#sh proc mem
> Total: 922997376, Used: 324146356, Free: 598851020
>
> router#sh ip bgp summ
> BGP router identifier 1.1.1.1, local AS number 11111
> BGP table version is 40446332, main routing table version 40446332
> 383853 network entries using 38769153 bytes of memory
> 658504 path entries using 31608192 bytes of memory
> 108261 BGP path attribute entries using 6497640 bytes of memory
> 97819 BGP AS-PATH entries using 2529712 bytes of memory
> 0 BGP route-map cache entries using 0 bytes of memory
> 0 BGP filter-list cache entries using 0 bytes of memory
> BGP using 79404697 total bytes of memory
> BGP activity 4995866/4612013 prefixes, 30926762/30268258 paths, scan interval 60 secs
>
> Neighbor        V    AS MsgRcvd  MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
> 1.2.3.4         4   123 7062260 11841061 40446285    0    0 1d09h    331684
> 1.1.1.1         4 11111 8191283  5916235 40446332    0    0 5w6d     276144
> 2.2.2.2         4 11111  583483  7077668 40446332    0    0 12w0d    50654
> 3.3.3.3         4 11111 6876244  4191128        0    0    0 2w0d     Idle (Admin)
>
> The connection to AS "123" is the upstream connection, AS "11111" are iBGP
> connections to other devices on the local network.  Those BGP sessions stay
> nailed up throughout the problems with the upstream session.  While the BGP
> session is flapping, the session will go active, the OutQ will spike up to
> 400-600 and PfxRcd is usually either at 0 or only a few hundred.  It will
> stay that way for anywhere from about 30 seconds to 4 or 5 minutes before
> dropping back to idle and starting over again.  Debugging issues a timeout
> error when the session drops and returns to idle.
>
> Since this has been an ongoing issue for a while now we've aimed some
> additional monitoring at the BGP session and the circuit carrying that
> traffic and have noticed that most (though not every) time this happens, the
> initial BGP session drop is preceded by a brief burst of packet-loss on the
> circuit.  The packet-loss lasts no more than a minute or two and is
> generally less than 5%.  We're running a more or less continual "ping -f"
> across the circuit which is the only way we're even able to detect the small
> loss.  The packet-loss seems to correlate with the BGP drop often enough
> that it doesn't seem coincidental, but doesn't seem to always happen.  In
> the cases where packet-loss precedes the BGP drop, it always disappears as
> soon as BGP drops and does not return at any time during the subsequent
> flapping.
>
> Upstream seems to be taking the position that if it isn't happening right
> when they get around to looking at it, there is no problem, so I'm hoping
> someone here can perhaps shed some more light on the write queue error.
>
> Thanks!
>
> Andrew