[c-nsp] ospf issue with two 7206 and two 4948

Mark Kent mark at noc.mainstreet.net
Fri Mar 31 21:01:19 EST 2006


This post is long, so here is the quick summary:

  I've got four Cisco boxes as OSPF neighbors on a VLAN.
  Something is unstable: a small poke will drive the CPU
  up on one box and keep it there for 20 to 60 minutes.

  How can I get hints about what might be going on?
  What is the right "debug ip os ..." thing I should use?
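
  (To be concrete, these are the debugs I was planning to try; I
  *think* the adjacency and retransmission ones are the relevant
  knobs here, but corrections welcome:

    debug ip ospf adj
    debug ip ospf events
    debug ip ospf retransmission
    debug ip ospf hello
  )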

Here is the long version:

  [7206/npe-g1]<--------------->[7206/npe-g1]
   |                             |
   |                             |
   |                             |
  [4948 L3 switch]--------------[4948 L3 switch]

The vertical connections are trunks carrying VLANs 1 and 99,
the horizontal connection between the two switches is an unconstrained
trunk, and the horizontal connection between the two 7206s is a GigE
cross-connect numbered with a /30 (with iBGP over it).
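
For completeness, the cross-connect side looks roughly like this (the
addresses and AS number below are made-up placeholders, not the real
ones):

interface GigabitEthernet0/3
 description GigE cross-connect to the other 7206
 ip address 192.0.2.1 255.255.255.252
!
router bgp 65000
 neighbor 192.0.2.2 remote-as 65000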

Those four boxes each have an IP address in VLAN99,
and do ospf like this:

ROUTERS:
interface GigabitEthernet0/1.99
 ip ospf message-digest-key ...
 ip ospf network point-to-multipoint

router ospf 99
 area 0 authentication message-digest
 redistribute connected subnets
 passive-interface default
 no passive-interface GigabitEthernet0/1.99
 no passive-interface GigabitEthernet0/3    !btwn two routers
 network <blah1> area 0
 network <blah2> area 0                     !btwn two routers
 neighbor <foo> cost 16
 neighbor <baf> cost 4
 neighbor <bof> cost 8

L3 SWITCHES:

interface Vlan99
 ip ospf message-digest-key ...
 ip ospf network point-to-multipoint

router ospf 99
 area 0 authentication message-digest
 redistribute connected subnets
 passive-interface default
 no passive-interface Vlan99
 network <blah1> area 0

with similar " neighbor ... cost ..." statements.
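
(If output would help, I can post the point-to-multipoint state from
any of the four boxes, i.e.:

  show ip ospf neighbor detail
  show ip ospf interface Vlan99

and the equivalent for Gi0/1.99 on the routers.)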

The switches have a bunch of IP addresses in several VLANs 
attached to them.

Now, what happens is that one switch will lose all three OSPF neighbors and
only two of them will come back up.   The adjacency between the L3
switch and the router immediately above it will stay down for a long time.

During this time, the CPU on the *other* switch is at 75%,
up from the normal 10%.   That other switch has all of its OSPF
adjacencies up and normal.

This is what I see:

c4948-1#sh proc cpu | exc 0.00
CPU utilization for five seconds: 79%/17%; one minute: 78%; five minutes: 66%
 PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process 
  18        5240     11004        476  0.47%  0.40%  0.44%   0 ARP Input        
  29       10644     38595        275  0.63%  1.22%  1.20%   0 Cat4k Mgmt HiPri 
  30      427588   3709441        115 60.15% 59.00% 49.93%   0 Cat4k Mgmt LoPri 
  49         144      3908         36  0.07%  0.06%  0.02%   0 Spanning Tree    
  55        5756     19488        295  0.47%  0.56%  0.59%   0 IP Input         
  68        2772      1046       2650  0.71%  0.59%  0.46%   0 CEF: IPv4 proces 

and it seems that this is the reason:

c4948-1#sh int vlan99 stats 
Vlan99
             Switch path    Pkts In   Chars In   Pkts Out  Chars Out
               Processor          0          0          0          0
-->>         Route cache   25191292 16393804201          0          0
                Hardware   56700127 21444272300   77294912 74861950153
                   Total   81891419 37838076501   77294912 74861950153

That is, a lot of traffic on VLAN99 is not being handled in hardware.
Usually, those "Route cache" numbers are very close to zero.
I don't know what that traffic is.  It is not normal traffic; it only
shows up on that one switch.
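
My next step for identifying that traffic is to look at what is being
punted to the CPU and why.  I *think* the right incantations on the
4948 are these (corrections welcome if the syntax differs on EWA):

c4948-1#show platform health
c4948-1#show platform cpu packet statistics all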

This has happened twice so far.  The first time, the ARP cache had just
been cleared on both 4948 switches.  Twenty minutes later (OK, so maybe
it's not related to the ARP clearing :-), the problem appeared and
lasted for one hour.  It corrected itself without intervention.
I tried various things in the first half hour (clear ip os proc, etc.),
then I just sat and watched it.

The second time this happened was when the data center did a "routine"
generator test and we lost power to the switches but not the routers
(power sagged to 40 volts for a very short time and the 7206 power
supplies could ride it out, but the switches couldn't).
That time it corrected itself in 20 minutes.

The switches are running 

Cisco IOS Software, Catalyst 4000 L3 Switch Software (cat4000-I5S-M), 
  Version 12.2(25)EWA, RELEASE SOFTWARE (fc1)

the routers have 

Cisco IOS Software, 7200 Software (C7200-IK91O3S-M), Version 12.2(25)S5, 
  RELEASE SOFTWARE (fc1)

This is the OSPF failure:

Mar 27 09:39:29: %OSPF-5-ADJCHG: Process 99, Nbr x.y.z.w on GigabitEthernet0/1.99 from EXSTART to DOWN, Neighbor Down: Too many retransmissions
Mar 27 09:40:29: %OSPF-5-ADJCHG: Process 99, Nbr 0.0.0.0 on GigabitEthernet0/1.99 from DOWN to DOWN, Neighbor Down: Ignore timer expired
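
One cheap thing I plan to rule out: a neighbor stuck in EXSTART that
dies with "Too many retransmissions" is the classic signature of a
database-description exchange that never completes (often an MTU
mismatch), so I'll compare the MTU on both ends of each adjacency and,
purely as a test, maybe try the mtu-ignore knob:

show ip ospf neighbor detail
show interfaces GigabitEthernet0/1.99 | include MTU
show interfaces Vlan99 | include MTU

! test only:
interface Vlan99
 ip ospf mtu-ignore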

Any ideas or hints?  I know there are other ways to set this up, but I
can't think of a reason why the current way is bad (other than the
failure described here).

Thanks,
-mark


