[c-nsp] ospf issue with two 7206 and two 4948
Mark Kent
mark at noc.mainstreet.net
Fri Mar 31 21:01:19 EST 2006
This post is long, so here is the quick summary:
I've got four cisco boxes as ospf neighbors on a vlan.
Something is unstable: a small poke will drive the cpu
up on one box and keep it there for 20 to 60 minutes.
How can I get hints about what might be going on?
What is the right "debug ip os ..." thing I should use?
Here is the long version:
[7206/npe-g1]<--------------->[7206/npe-g1]
      |                             |
      |                             |
      |                             |
[4948 L3 switch]--------------[4948 L3 switch]
The vertical connections are trunks carrying vlan 1 and 99,
the horizontal connection between the two switches is an unconstrained
trunk, and the horizontal connection between the two 7206 is a gige
cross-connect numbered with a /30 (with ibgp over it).
Those four boxes each have an IP address in VLAN99,
and do ospf like this:
ROUTERS:
interface GigabitEthernet0/1.99
 ip ospf message-digest-key ...
 ip ospf network point-to-multipoint
router ospf 99
 area 0 authentication message-digest
 redistribute connected subnets
 passive-interface default
 no passive-interface GigabitEthernet0/1.99
 no passive-interface GigabitEthernet0/3   !btwn two routers
 network <blah1> area 0
 network <blah2> area 0                    !btwn two routers
 neighbor <foo> cost 16
 neighbor <baf> cost 4
 neighbor <bof> cost 8
L3 SWITCHES:
interface Vlan99
 ip ospf message-digest-key ...
 ip ospf network point-to-multipoint
router ospf 99
 area 0 authentication message-digest
 redistribute connected subnets
 passive-interface default
 no passive-interface Vlan99
 network <blah1> area 0
with similar " neighbor ... cost ..." statements.
The switches have a bunch of IP addresses in several VLANs
attached to them.
Now, what happens is one switch will lose all three OSPF neighbors and
only two of them will come back up. The session between the L3
switch and the router immediately above it will stay down for a long time.
During this time, the cpu on the *other* switch is at 75%,
up from the normal 10%. That other switch has all ospf sessions
up and normal.
This is what I see:
c4948-1#sh proc cpu | exc 0.00
CPU utilization for five seconds: 79%/17%; one minute: 78%; five minutes: 66%
 PID Runtime(ms)   Invoked  uSecs    5Sec   1Min   5Min TTY Process
  18        5240     11004    476   0.47%  0.40%  0.44%   0 ARP Input
  29       10644     38595    275   0.63%  1.22%  1.20%   0 Cat4k Mgmt HiPri
  30      427588   3709441    115  60.15% 59.00% 49.93%   0 Cat4k Mgmt LoPri
  49         144      3908     36   0.07%  0.06%  0.02%   0 Spanning Tree
  55        5756     19488    295   0.47%  0.56%  0.59%   0 IP Input
  68        2772      1046   2650   0.71%  0.59%  0.46%   0 CEF: IPv4 proces
and it seems that this is the reason:
c4948-1#sh int vlan99 stats
Vlan99
          Switch path     Pkts In      Chars In    Pkts Out     Chars Out
            Processor           0             0           0             0
-->>      Route cache    25191292   16393804201           0             0
             Hardware    56700127   21444272300    77294912   74861950153
                Total    81891419   37838076501    77294912   74861950153
That is, a lot of traffic on VLAN99 is not being handled in hardware.
Usually, those "Route cache" numbers are very close to zero.
I don't know what that traffic is. It is not normal traffic, it only
shows up on that one switch.
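To put a number on it, the inbound counters above can be turned into a share of packets that missed the hardware path. This is only a rough sketch: the values are copied by hand from the "Pkts In" column of the output above, the dictionary keys and the function name are made up for illustration, and it assumes (as I do above) that the "Route cache" row counts packets not handled in hardware.

```python
# Pkts In counters copied by hand from "sh int vlan99 stats" above.
# Key names are illustrative, not anything IOS emits.
pkts_in = {
    "processor": 0,
    "route_cache": 25191292,
    "hardware": 56700127,
}


def software_share(counters):
    """Return the fraction of inbound packets not switched in hardware.

    Assumes the "processor" and "route_cache" rows are the non-hardware paths.
    """
    total = sum(counters.values())
    if total == 0:
        return 0.0
    return (counters["processor"] + counters["route_cache"]) / total


share = software_share(pkts_in)
print(f"{share:.1%} of inbound packets on Vlan99 were not hardware switched")
```

Since these are counters accumulated since the last clear, the share during the event itself is probably higher than this lifetime average.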
This has happened twice so far. The first time, the arp cache had been
cleared on both 4948 switches. Twenty minutes later (OK, so maybe
it's not related to the arp clearing :-), the problem started and
lasted for one hour. It corrected itself without intervention.
I tried various things in the first half hour (clear ip os proc, etc.),
then I just sat and watched it.
The second time this happened was when the data center did a "routine"
generator test and we lost power to the switches but not the routers
(power sagged to 40 volts for a very short time and the 7206 power
supplies could ride it out, but the switches couldn't).
This corrected itself in 20 minutes.
The switches are running
Cisco IOS Software, Catalyst 4000 L3 Switch Software (cat4000-I5S-M),
Version 12.2(25)EWA, RELEASE SOFTWARE (fc1)
the routers have
Cisco IOS Software, 7200 Software (C7200-IK91O3S-M), Version 12.2(25)S5,
RELEASE SOFTWARE (fc1)
This is the ospf failure:
Mar 27 09:39:29: %OSPF-5-ADJCHG: Process 99, Nbr x.y.z.w on GigabitEthernet0/1.99 from EXSTART to DOWN, Neighbor Down: Too many retransmissions
Mar 27 09:40:29: %OSPF-5-ADJCHG: Process 99, Nbr 0.0.0.0 on GigabitEthernet0/1.99 from DOWN to DOWN, Neighbor Down: Ignore timer expired
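In case it helps anyone correlate these messages across the four boxes, here is a minimal sketch of pulling the fields out of an ADJCHG line. The regular expression is fitted only to the two lines shown above, so treat the pattern and the function name as assumptions rather than a complete description of the message format.

```python
import re

# Pattern fitted to the two %OSPF-5-ADJCHG lines quoted above; other IOS
# releases may format the message differently.
ADJCHG = re.compile(
    r"%OSPF-5-ADJCHG: Process (?P<process>\d+), "
    r"Nbr (?P<nbr>\S+) on (?P<intf>\S+) "
    r"from (?P<old>[^,\s]+) to (?P<new>[^,\s]+), "
    r"(?P<reason>.+)$"
)


def parse_adjchg(line):
    """Return a dict of fields from an ADJCHG syslog line, or None if no match."""
    match = ADJCHG.search(line)
    return match.groupdict() if match else None


sample = (
    "Mar 27 09:39:29: %OSPF-5-ADJCHG: Process 99, Nbr x.y.z.w on "
    "GigabitEthernet0/1.99 from EXSTART to DOWN, "
    "Neighbor Down: Too many retransmissions"
)
event = parse_adjchg(sample)
print(event["intf"], event["old"], "->", event["new"], "|", event["reason"])
```

Grouping the parsed events by interface and reason would show whether the adjacency keeps failing in EXSTART for the whole 20 to 60 minutes or only at the start.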
Any ideas or hints? I know there are other ways to set this up, but I
can't think of a reason why the current way is bad (other than the
failure described here).
Thanks,
-mark