[c-nsp] FE ignored errors

Mon Dec 20 14:20:03 EST 2004

On Mon, Dec 20, 2004 at 01:54:33PM -0500, Jon Lewis wrote:
> On Mon, 20 Dec 2004, Rodney Dunn wrote:
> 
> > Exactly what I said in my other email.  There are situations where
> > the RSP can do more work than the VIPs combined.
> 
> In our case, I don't think that's an option...at least not with our
> RSP4s.  We ran some of the 7500s without dCEF by accident for a couple
> of weeks (someone turned it off while troubleshooting and forgot to turn
> it back on) and they really don't seem to handle large routing updates
> (like one transit provider suddenly going away) very well while the RSP
> is trying to CEF switch ~50 Mbit/s.

It's all about the CPU overhead that's there.  The 75xx is no
different than any other router when it comes to that.

> 
> Based on the tests I did last night (flapping a BGP session at 4am), I
> really don't think there was enough aggregate traffic coming into our
> network for there to be bursts large enough to cause large numbers of
> ignores, so I don't have much faith in that explanation.

I've never actually done that test in the lab.  I can tell you
I've never worked on a problem where it did turn out to be a large
BGP update causing the ignores.  But there is a first time for everything.

> 
> What seems far more likely to me is that when there are large numbers of
> routing updates (i.e. one of our BGP transits flaps, and suddenly 60k
> routes "change") the VIPs are too busy receiving FIB updates from the RSP
> and are either failing to drain output packets from MEMD or are failing to
> move RX buffered packets from VIP particle buffers to MEMD as MEMD becomes
> available.

It would make sense that the VIP would spend its time updating its
forwarding table rather than switching packets on what could be bad
forwarding information.

> 
> I don't know how to prove this, but it seems much more likely to me given
> the low traffic levels at 4am.  If everything is chugging along fine with
> 3 transits (one on each 7500) and we take down one transit, shifting
> traffic onto the other two, we only see large numbers of ignores during
> the resulting BGP updates.  Once BGP has stabilized, ignores stop
> incrementing...at least in the short term.  If it were "bursty traffic
> overloading the VIP", I'd expect to continue to see ignores as long as one
> of the transits is down since that puts more traffic on the remaining two.

I could see it for a major update when a peer flaps but I wouldn't
think you would see that for normal BGP route churn on the backbone.

> 
> It might be interesting to graph and compare bgpPeerInUpdates
> (.1.3.6.1.2.1.15.3.1.10.peer-ip) against input errors (ignores).  I'll
> have to look into doing that.  I suspect we'll find that the ignores
> typically coincide with bursts of routing updates.

That's one idea.
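
A minimal sketch of that comparison, assuming you already poll both
counters at the same interval (by SNMP or otherwise) and keep the raw
values in two lists.  The sample numbers below are made up for
illustration, and the helper names `deltas` and `pearson` are
hypothetical.  It turns the raw Counter32 readings into per-interval
increases and computes a plain Pearson correlation between them; a value
near 1 would suggest the ignores track the update bursts, but it would
not prove the cause.

```python
from math import sqrt

COUNTER32_MOD = 2 ** 32  # bgpPeerInUpdates is a Counter32 and can wrap


def deltas(samples):
    """Per-interval increases of a Counter32, tolerating a single wrap."""
    return [(b - a) % COUNTER32_MOD for a, b in zip(samples, samples[1:])]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a flat series carries no correlation information
    return cov / sqrt(vx * vy)


# Made-up raw counter readings, one per polling interval.
updates_raw = [1000, 1010, 1025, 61025, 61040, 61050]  # bgpPeerInUpdates
ignores_raw = [50, 50, 51, 4051, 4051, 4052]           # interface ignores

update_deltas = deltas(updates_raw)
ignore_deltas = deltas(ignores_raw)
r = pearson(update_deltas, ignore_deltas)
print(update_deltas, ignore_deltas, round(r, 3))
```

Working on deltas rather than raw counters matters here: both counters
only ever increase, so the raw values would look correlated no matter
what.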

> 
> In the meantime, if my assumptions are correct, are there things that
> can be done to mitigate this?  I actually found a URL last night where
> it was suggested that too many routing updates could bog down the
> router; it suggested things like shrinking the TCP window to slow down
> the updates.

If your control plane isn't accurate, it's hard to convince me you would
be forwarding packets down the right path.  That being said, interrupt
level switching takes precedence over all process level work.  So other
than updating the CEF/adj tables when a new route goes in the table or
a route is changed, BGP changes are transparent to the forwarding.
Even more so on a fully distributed box.

> It's kind of funny, because normally we want BGP to converge as quickly as
> possible.
> 
> Are there other settings that might keep the VIP from ignoring network
> receive interrupts for too long?

None that I know of.  But remember, the slower the VIP, the slower
it is to both switch packets and update the forwarding table.
It could be a combination of things there.

Rodney

> 
> ----------------------------------------------------------------------
>  Jon Lewis                   |  I route
>  Senior Network Engineer     |  therefore you are
>  Atlantic Net                |
> _________ http://www.lewis.org/~jlewis/pgp for PGP public key_________