[c-nsp] Weird reloads again with 12.2(25)S5

Sun Oct 9 16:49:53 EDT 2005

Hi,

once again a single configuration change caused a funky reload of two
supposedly redundant NPE-G1 core routers.

We have two NPE-G1s acting as core routers. Each of them has links to
three BGP full-table upstreams and to two backbone transfer links, to
which several (seven at the moment) Cisco 7200 routers are attached for
aggregation.

I saw several BGP ghosts as described in my cisco-nsp mail on 2005-09-20
(Subject: BGP ghosts in 12.2(25)S5) on the first core router.
While investigating this, I saw that soft-reconfiguration was enabled
for one eBGP neighbor on this router (the one which had the ghosts). So
I tried:

conf t 
router bgp 29259
address-family ipv6 unicast
no neighbor 2001::x:y:z soft-reconfiguration inbound

The box apparently rebooted at that moment or very shortly afterwards; I
don't have any error messages in syslog. This wouldn't have been that
bad (after all, we do have two fully redundant boxes to work around that
danger), but the second router decided to die on us very shortly after
that, this time spewing out something to the remote syslog server:

21:51:26   backbone2 sees the b2b-gige to backbone1 going down
21:51:xx   backbone2 and all access-routers lose OSPF and BGP adjacency
           to backbone1
21:51:40   backbone2 has a traceback in syslog:

%SYS-2-GETBUF: Bad getbuffer, bytes= 59404
-Process= "IP Input", ipl= 0, pid= 45
-Traceback= 60768294 60798238 6098D9DC 6098D554 6098C618 609A2294
6097E76C 6097E8EC

21:51:51   backbone2 has the first OSPF adjacency to backbone1 again
...
21:52:13   access-routers have reestablished OSPF and BGP

The iBGP session between backbone1 and backbone2 is still missing in
syslog. This is the last message in the syslog until 21:54:15, when both
cores start spewing out their startup messages (Line protocol ...
changed state to up). All access routers survived the crash, having seen
a lost OSPF and BGP adjacency at around 19:52:17. Our OSPF holdtime is
10 seconds. At around 21:54 (see above) all sessions were reestablished
and working fine (except that backbone1 suddenly had no ip route-cache
cef on any interface, causing severe throughput problems).

Both routers claim to have been reloaded:

backbone1:
System returned to ROM by reload at 19:51:20 UTC Sun Oct 9 2005
System restarted at 19:51:28 UTC Sun Oct 9 2005
System image file is "sup-bootflash:/c7200-k91p-mz.122-25.S5.bin"

backbone2:
System returned to ROM by reload at 19:52:19 UTC Sun Oct 9 2005
System restarted at 19:52:27 UTC Sun Oct 9 2005
System image file is "sup-bootflash:/c7200-k91p-mz.122-25.S3.bin"
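
The syslog excerpts above use 21:5x timestamps while the "show version"
output is in UTC (19:5x). To line the two up, here is a minimal sketch
that puts the quoted events on one UTC timeline. It assumes the syslog
server logs in local time at UTC+2 (CEST); the post does not state this,
and the helper names are purely illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the remote syslog server stamps messages in CEST (UTC+2).
SYSLOG_TZ = timezone(timedelta(hours=2))
DAY = (2005, 10, 9)  # Sun Oct 9 2005, taken from the show version output


def _parse(hms, tz):
    """Build a timezone-aware datetime on DAY from an HH:MM:SS string."""
    h, m, s = (int(part) for part in hms.split(":"))
    return datetime(*DAY, h, m, s, tzinfo=tz)


def syslog_time(hms):
    """Syslog timestamp (assumed local time), converted to UTC."""
    return _parse(hms, SYSLOG_TZ).astimezone(timezone.utc)


def router_time(hms):
    """Timestamp from 'show version', which is already UTC."""
    return _parse(hms, timezone.utc)


# Events copied from the post; nothing here is new data.
events = [
    (router_time("19:51:20"), "backbone1 returned to ROM by reload"),
    (syslog_time("21:51:26"), "backbone2 sees b2b-gige to backbone1 down"),
    (syslog_time("21:51:40"), "backbone2 logs %SYS-2-GETBUF traceback"),
    (syslog_time("21:52:13"), "access routers reestablished OSPF and BGP"),
    (router_time("19:52:19"), "backbone2 returned to ROM by reload"),
]
events.sort(key=lambda event: event[0])

for when, what in events:
    print(when.strftime("%H:%M:%S UTC"), what)

# Time between the two reloads, and between backbone1 going down and
# backbone2 noticing the back-to-back link failure.
reload_gap = router_time("19:52:19") - router_time("19:51:20")
detect_gap = syslog_time("21:51:26") - router_time("19:51:20")
print("reload gap:", reload_gap, "link-down detection:", detect_gap)
```

Under that timezone assumption, backbone2 reloads a little under a
minute after backbone1, and only a few seconds after the access routers
had reestablished their sessions.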

I'm very disappointed by the code quality lately, especially in the
12.2(25)S train. The boxes don't really do anything very special
(no additional interfaces, OSPFv2, OSPFv3, BGP with 4+1 fulltables on 1G
RAM). Yet we have regular unexplained reloads. This is the third time
that both routers went down when one of them was reloaded, either
because of a software bug or manually.

Does anyone have experience running 12.4 in "core" applications? I would
like to try running different IOS trains on the two routers; maybe that
helps. But after I tried 12.4(3) on a spare router and OSPFv2 didn't
even bother to establish adjacencies (downgrading to 12.4(1a) worked
instantly), I'm not too confident in that either.

Bernhard