[c-nsp] Cisco 12000: linecard disabled due to not enough ram - VRRP stays active - how to avoid this nasty behavior

Sun Nov 6 03:33:47 EST 2005

Hello colleagues,

Just a few minutes ago a line card disabled itself due to shortage of local
memory on our Cisco 12000. One of our peers a few routers away
The BGP sessions went down and traffic got rerouted to our standby router,
but - and that's the really nasty thing - OSPF and VRRP stayed up.
This means that the Cisco was still VRRP master although forwarding on
the line card had stopped, and traffic got black-holed until I manually shut
the interface down.

Is there anything one can do to prevent this behavior? The Cisco
should really stop sending the VRRP heartbeats.

#CISCO12000:Show log
Nov  6 08:15:02.308 CET: %FIB-2-FIBDISABLE: Fatal error, slot 2: no memory
SLOT 2:Nov  6 08:15:02.228 CET: %SYS-2-MALLOCFAIL: Memory allocation of
65556 bytes failed from 0x400CE06C, alignment 16 
Pool: Processor  Free: 121296  Cause: Memory fragmentation 
Alternate Pool: None  Free: 0  Cause: No Alternate pool 

-Process= "CEF LC IPC Background", ipl= 0, pid= 57
-Traceback= 400D328C 400D5690 400CE074 40E508DC 40E15DD8 40E20768 40E2B17C
40E40648 40E380A4 40E38348 40E38714 40E39444
SLOT 2:Nov  6 08:15:02.300 CET: %FIB-3-NOMEM: Malloc Failure, disabling DCEF
on linecard
Nov  6 08:15:05.452 CET: %VRRP-6-STATECHANGE: Gi2/0.200 Grp 200 state Backup
-> Master
Nov  6 08:15:32.648 CET: %BGP-5-ADJCHANGE: neighbor x.x.x.2 Down BGP
Notification sent
Nov  6 08:15:32.648 CET: %BGP-3-NOTIFICATION: sent to neighbor x.x.x.2 4/0
(hold time expired) 0 bytes 
Nov  6 08:15:39.980 CET: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.2 on
GigabitEthernet2/0.991 from FULL to DOWN, Neighbor Down: Ded
Nov  6 08:16:32.941 CET: %BGP-5-ADJCHANGE: neighbor x.x.x.113 Down BGP
Notification sent
Nov  6 08:16:32.941 CET: %BGP-3-NOTIFICATION: sent to neighbor x.x.x.113 4/0
(hold time expired) 0 bytes 
Nov  6 08:17:49.061 CET: %BGP-5-ADJCHANGE: neighbor x.x.x.242 Down BGP
Notification sent
Nov  6 08:17:49.061 CET: %BGP-3-NOTIFICATION: sent to neighbor x.x.x.242 4/0
(hold time expired) 0 bytes 

As one can see, communication with the outside world has partly stopped.
BGP is down, and the Cisco even thinks that it can now become VRRP master for
Gi2/0.200. Terrible!

The Cisco kept sending VRRP heartbeats until I manually shut down the
interface on that linecard: then VRRP got disabled too and the backup router
kicked in.
#CISCO12000:Show log
Nov  6 08:45:45.840 CET: %VRRP-6-STATECHANGE: Gi2/0.200 Grp 200 state Master
-> Init
Nov  6 08:45:45.840 CET: %VRRP-6-STATECHANGE: Gi2/0.201 Grp 201 state Master
-> Init
Nov  6 08:45:45.844 CET: %VRRP-6-STATECHANGE: Gi2/0.202 Grp 202 state Master
-> Init
Nov  6 08:45:45.844 CET: %VRRP-6-STATECHANGE: Gi2/0.203 Grp 203 state Master
-> Init
Nov  6 08:45:45.844 CET: %VRRP-6-STATECHANGE: Gi2/0.204 Grp 204 state Master
-> Init
Nov  6 08:45:45.844 CET: %VRRP-6-STATECHANGE: Gi2/0.207 Grp 207 state Master
-> Init
Nov  6 08:45:45.848 CET: %VRRP-6-STATECHANGE: Gi2/0.208 Grp 208 state Master
-> Init
Nov  6 08:45:45.848 CET: %VRRP-6-STATECHANGE: Gi2/0.210 Grp 210 state Master
-> Init
Nov  6 08:45:45.848 CET: %VRRP-6-STATECHANGE: Gi2/0.211 Grp 211 state Master
-> Init
Nov  6 08:45:47.836 CET: %LINK-5-CHANGED: Interface GigabitEthernet2/0,
changed state to administratively down
Nov  6 08:45:48.836 CET: %LINEPROTO-5-UPDOWN: Line protocol on Interface
GigabitEthernet2/0, changed state to down
Nov  6 08:45:50.404 CET: %SYS-5-CONFIG_I: Configured from console by console

The annoying thing is that my second "backup" router sees BGP and OSPF to
the Cisco go down but still seems to be receiving OSPF heartbeats:
#BACKUPROUTER: show log
Nov  6 08:15:53:N:OSPF: originate LSA, rid x.x.x.2, area 0.0.0.0, LSA type
5, LSA id x.y.139.200, LSA router id x.x.x.2
Nov  6 08:15:53:N:OSPF: originate LSA, rid x.x.x.2, area 0.0.0.0, LSA type
5, LSA id x.y.139.208, LSA router id x.x.x.2
Nov  6 08:15:53:N:OSPF: originate LSA, rid x.x.x.2, area 0.0.0.0, LSA type
5, LSA id x.y.139.224, LSA router id x.x.x.2
Nov  6 08:15:53:N:OSPF: originate LSA, rid x.x.x.2, area 0.0.0.0, LSA type
5, LSA id x.y.143.0, LSA router id x.x.x.2
Nov  6 08:15:53:N:OSPF: originate LSA, rid x.x.x.2, area 0.0.0.0, LSA type
5, LSA id x.y.148.0, LSA router id x.x.x.2
Nov  6 08:15:48:N:OSPF: originate LSA, rid x.x.x.2, area 0.0.0.0, LSA type
5, LSA id x.y.143.0, LSA router id x.x.x.2
Nov  6 08:15:48:N:OSPF: originate LSA, rid x.x.x.2, area 0.0.0.0, LSA type
5, LSA id a.b.121.0, LSA router id x.x.x.2
Nov  6 08:15:48:N:OSPF: originate LSA, rid x.x.x.2, area 0.0.0.0, LSA type
5, LSA id a.b.110.0, LSA router id x.x.x.2
Nov  6 08:15:48:N:OSPF: originate LSA, rid x.x.x.2, area 0.0.0.0, LSA type
5, LSA id a.b.168.0, LSA router id x.x.x.2
Nov  6 08:15:48:N:OSPF: originate LSA, rid x.x.x.2, area 0.0.0.0, LSA type
5, LSA id a.b.148.0, LSA router id x.x.x.2
Nov  6 08:15:48:N:OSPF: originate LSA, rid x.x.x.2, area 0.0.0.0, LSA type
5, LSA id a.b.234.0, LSA router id x.x.x.2
Nov  6 08:15:48:N:OSPF: originate LSA, rid x.x.x.2, area 0.0.0.0, LSA type
5, LSA id a.b.240.0, LSA router id x.x.x.2
Nov  6 08:15:42:N:OSPF: originate LSA, rid x.x.x.2, area x.y.128.0, LSA type
1, LSA id x.x.x.2, LSA router id x.x.x.2
Nov  6 08:15:42:N:OSPF: nbr state changed, rid x.x.x.2, nbr addr x.x.x.17,
nbr rid x.x.x.1, state initializing, rcv event 1-WayReceived
Nov  6 08:15:39:N:BGP: Peer x.x.x.1 DOWN (Hold Timer Expired)

It then takes until 8:45 - that's when I shut down the interface on the
Cisco - for the backup router to kick in:
#BACKUPROUTER: show log
Nov  6 08:46:23:N:OSPF: nbr state changed, rid x.x.x.2, nbr addr x.y.17, nbr
rid x.x.x.1, state down, rcv event NeighborGoingDown
Nov  6 08:46:23:N:OSPF: nbr state changed, rid x.x.x.2, nbr addr x.y.131.17,
nbr rid x.x.x.1, state initializing, rcv event Inactivity Timer Expires
Nov  6 08:45:48:N:VRRP: VRRP intf state changed, intf v211, vrid 211, state
master
Nov  6 08:45:48:N:VRRP: VRRP intf state changed, intf v210, vrid 210, state
master
Nov  6 08:45:48:N:VRRP: VRRP intf state changed, intf v208, vrid 208, state
master
Nov  6 08:45:48:N:VRRP: VRRP intf state changed, intf v207, vrid 207, state
master
Nov  6 08:45:48:N:VRRP: VRRP intf state changed, intf v204, vrid 204, state
master
Nov  6 08:45:48:N:VRRP: VRRP intf state changed, intf v203, vrid 203, state
master
Nov  6 08:45:48:N:VRRP: VRRP intf state changed, intf v202, vrid 202, state
master
Nov  6 08:45:48:N:VRRP: VRRP intf state changed, intf v201, vrid 201, state
master

I really don't get why BGP and OSPF fail as they should but VRRP stays
active. Is there *anything* I can do about it? I really don't want this to
happen again.
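
One stopgap I could imagine, purely as a sketch, is an external watchdog on
the syslog server that does what I did by hand: when it sees
%FIB-2-FIBDISABLE for a slot, it prepares the commands to shut the interfaces
on that slot. The function names and the interface list below are my own
assumptions for illustration, the log lines are assumed to arrive unwrapped
(one message per line), and actually pushing the commands to the router is
left out on purpose.

```python
import re

# Matches the message from the log above, e.g.
# "%FIB-2-FIBDISABLE: Fatal error, slot 2: no memory"
FIB_DISABLE_RE = re.compile(r"%FIB-2-FIBDISABLE: Fatal error, slot (\d+)")

# The slot is taken to be the digits right before the "/" in a name
# such as "GigabitEthernet2/0" (an assumption about interface naming).
IFACE_SLOT_RE = re.compile(r"^[A-Za-z-]+(\d+)/")


def disabled_slots(log_lines):
    """Return the set of slot numbers for which FIBDISABLE was logged."""
    slots = set()
    for line in log_lines:
        match = FIB_DISABLE_RE.search(line)
        if match:
            slots.add(int(match.group(1)))
    return slots


def shutdown_commands(log_lines, interfaces):
    """Build config lines that shut every listed interface on an affected slot."""
    slots = disabled_slots(log_lines)
    commands = []
    for name in interfaces:
        match = IFACE_SLOT_RE.match(name)
        if match and int(match.group(1)) in slots:
            commands += [f"interface {name}", " shutdown"]
    return commands


if __name__ == "__main__":
    log = [
        "Nov  6 08:15:02.308 CET: %FIB-2-FIBDISABLE: Fatal error, slot 2: no memory",
        "Nov  6 08:15:05.452 CET: %VRRP-6-STATECHANGE: Gi2/0.200 Grp 200 state Backup -> Master",
    ]
    # Hypothetical interface inventory; only the slot 2 port should be shut.
    for cmd in shutdown_commands(log, ["GigabitEthernet2/0", "GigabitEthernet3/0"]):
        print(cmd)
```

This only automates the manual workaround; it does not explain why VRRP
keeps running on the router itself, which is still what I would like to fix.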

This is so nasty. On paper we do have a nice hot-failover redundancy
concept, but in reality a human needs to get involved :-(

Thanks for your help in advance.

Best regards,
Gunther

By the way: you shouldn't be using linecards or route processors with less
than 512 MB of RAM nowadays...