[c-nsp] 7600 RSP720 SRD4 bgp bounce triggers cpu exhaustion

Mon Oct 31 16:53:04 EDT 2011

Hardware is 7600s running software version 12.2(33)SRD4 on RSP720s, each with either 2GB or 4GB RAM.  All cards have DFC 3BXLs or better. Nothing fancy: only IPv4 and IPv6, with HSRP across 8-12 SVIs.  Two peering routers with a variety of transit peers (full tables) and dozens of bilateral peers.

Problem is that when IBGP (or the circuit itself) drops between the core & peering routers, the rest of the network practically falls apart with much unnecessary disruption. Drops of big EBGP peers don't seem to cause the problem.  Logs say basically:

1) connected interface or IBGP drops;
2) seconds later either OSPF to different IBGP neighbors resets, or HSRP transitions, swapping the master status;
3) more seconds pass and other BGP neighbors start to drop;
4) from here on, random failures of OSPF and HSRP until it settles down, perhaps 3-5 minutes later.

I conclude that #1 hammers the CPU badly enough to cause #2-4 above.  When I log in and look during the churn, IP Input is hanging around 50% of the processor, with BGP Router usually taking up the rest.  I have some basic packet filter ACLs on a few interfaces, but nothing extensive.

What could cause IP Input to be so high?  "show interfaces stats" does not show a disproportionate number of packets getting process switched.  "show ip cef switching statistics" doesn't look that bad either, at least not as a % of the total at the bottom. Anything else to inspect?

The L3 links between core routers and peering routers have MTU 9216, and "show ip bgp neighbor" shows an appropriate max data segment of 9176 bytes.  The link between each core router is an SVI with MTU 1500.  Again, "show ip bgp neighbor" shows BGP detecting the right segment size of 1460.  Question is whether this transit path in the middle having a smaller MTU gives the core routers in the middle any sort of extra workout during BGP convergence (full IBGP mesh in this configuration)?  I don't think any sort of fragmentation is happening here.
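
For reference, a quick sanity check of the segment sizes quoted above. This is only a sketch: it assumes the BGP sessions run over plain IPv4 with a 20-byte IP header and a 20-byte TCP header and no options, and the function name expected_mss is made up for illustration.

```python
# Assumed header sizes: IPv4 (20 bytes) and TCP (20 bytes), no options.
IPV4_HEADER = 20
TCP_HEADER = 20


def expected_mss(mtu: int) -> int:
    """Return the TCP max segment size expected for an IPv4 interface MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


# The two MTUs mentioned above: jumbo L3 links and the 1500-byte SVI.
for mtu in (9216, 1500):
    print(f"MTU {mtu} -> expected MSS {expected_mss(mtu)}")
```

Both results match what "show ip bgp neighbor" reports (9176 and 1460), so the sessions appear to be picking up the MTU of their own path correctly.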

Any known defects like this in SRD4?  Any recommendations or feedback for SRE, etc?  I don't recall having this issue back when running SRC code, but 

Am using "neighbor .... fall-over" in my IBGP peer-group to promote faster convergence.  Would that aggravate the CPU situation above? BGP & OSPF timers are default.

Thanks for any insight!