[nsp] Stable 6500 hybrid code?

Steve Francis steve@expertcity.com
Wed, 20 Nov 2002 20:47:22 -0800


What are the current recommendations for stable 6500 code in hybrid 
mode on a SupII/MSFC2?

(Fairly vanilla BGP, OSPF, HSRP, with some PBR)

We have been running CatOS 6.3(6) and IOS 12.1(8b)E9.

However, this morning the CEF tables in the switch and the router 
became inconsistent.  At first it looked like an RPF error (the switch 
would inconsistently drop packets, but only if the source address was 
routed out one particular peering), yet the RPF counters did not 
increment.

To work around that, we reloaded the router, after which basically 
nothing worked, and we had to admin down almost all interfaces to get a 
working network. (While you could ping an interface of the router via a 
router on a local subnet, and things like the loopback of the router 
were being advertised in OSPF, you could not ping the loopback even 
from an adjacent router on a shared interface.)  An ACL with the log 
keyword made individual IPs work, by forcing CPU switching.

At this point the TAC engineer on the router tried "no mls ip unicast", 
which caused the whole switch to crash with a TLB exception. (And, even 
more fun, it would not respond on the console except with garbled hex. 
It needed a power cycle.)

I cannot find any bugs matching what we experienced, so I can't see 
which versions fix them.

Most importantly, anyone have recommendations for stable CatOS and IOS?

Anyone recognize the above bugs?

Anyone have any idea how to make a 6500 run again if it crashes and 
outputs this:
TLB Exception (load/instruction fetch) occurred.

Software version = 6.3(6)
Process ID #1b, Name = Fib
EPC: 809EFC54
{stack trace}
GDB: TLB Exception (load/instruction fetch)
GDB: The system has trapped into the debugger.
GDB: It will hang until examined with gdb.
Please use normal gdb. special gdb will not work on this apollo+ board
||||$S10#b4
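
A side note on the last line of that output: "$S10#b4" has the shape of 
a GDB remote serial protocol packet ($payload#checksum, where the 
checksum is the sum of the payload bytes modulo 256, written as two hex 
digits). That would fit the "trapped into the debugger" messages. Below 
is a minimal Python sketch that checks this, assuming that framing; 
parse_gdb_packet is a hypothetical helper name, not anything from CatOS.

```python
def parse_gdb_packet(raw: str):
    """Return (payload, checksum_ok) for the first $...#xx packet in raw.

    Assumes GDB remote serial protocol framing: '$' + payload + '#' +
    two hex digits holding (sum of payload bytes) mod 256.
    """
    start = raw.index("$")
    end = raw.index("#", start)
    payload = raw[start + 1:end]
    claimed = int(raw[end + 1:end + 3], 16)
    computed = sum(payload.encode("ascii")) % 256
    return payload, claimed == computed


# The console line from the crash output, leading "||||" included.
payload, ok = parse_gdb_packet("||||$S10#b4")
print(payload, ok)  # prints: S10 True
```

The checksum matches, so the console was probably speaking the debugger 
protocol rather than printing random garbage. In that protocol an "S" 
packet is a stop reply carrying a signal number in hex, so "S10" would 
report signal 0x10.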

Getting remote staff to power cycle remote core switches (which, 
incidentally, failed in such an interesting way that I could still talk 
to some nodes attached to it, but it seemed to take out most nodes on 
its functionally paired switch) was not the quickest way to restore service.

Thx