[c-nsp] Cisco 7206 VXR hangs

Scott Lambert lambert at lambertfam.org
Sat Jan 20 03:50:47 EST 2007


On Mon, Jan 15, 2007 at 08:22:53AM +0100, Oliver Boehmer (oboehmer) wrote:
> cisco-nsp-bounces at puck.nether.net <> wrote on Sunday, January 14, 2007
> 10:05 PM:
> 
> > On Wed, Jan 10, 2007 at 05:20:26AM -0600, Scott Lambert wrote:
> >> I have a Cisco 7204VXR with NPE-G1 that has been hanging on me at
> >> one to three week intervals. 
> >> 
> >> The box is doing DSL aggregation as well as being our core router. 
> >> We have a handful of T1s on it.  Both DSL and Internet are on the
> >> same ATM OC3 interface. 
> >> 
> >> The box had been rock solid from July through October in the current
> >> hardware and software configuration before I tried some load reducing
> >> configuration changes.  The last hardware change was an upgrade from
> >> NPE-300 which had been working for years to the NPE-G1 which required
> >> the IOS upgrade to 12.2(28) SB2.
> >> 
> >> I did two config changes Oct 4 because the telco was trying to tell
> >> me my ATM throughput problems were due to CPU load on the box.  The
> >> CPU load dropped by half.  We went from 50% to 25% CPU utilization. 
> >> The ATM problems remained.  The telco eventually found a
> >> provisioning error and fixed the ATM issues. 
> >> 
> >> I enabled route-cache cef on my PPPo{E|A} virtual-templates and
> >> increased the small, middle, and big buffers' permanent settings.
> >> 
> >> @@ -172,6 +172,10 @@ controller T1 4/7
> >>   linecode b8zs
> >>   channel-group 0 timeslots 1-24
> >>  !
> >> +buffers small permanent 700
> >> +buffers middle permanent 700
> >> +buffers big permanent 400
> >> +!
> >>  bba-group pppoe global
> >>   virtual-template 3
> >>   sessions per-vc limit 1024
> >> @@ -8322,8 +8326,6 @@ interface Serial4/7:0
> >>  interface Virtual-Template1
> >>   description PPPoA Template
> >>   ip unnumbered Loopback0
> >> - no ip route-cache cef
> >> - no ip route-cache
> >>   ip ospf database-filter all out
> >>   peer default ip address pool dsl
> >>   ppp authentication pap callin
> >> @@ -8333,8 +8335,6 @@ interface Virtual-Template3
> >>   mtu 1492
> >>   ip unnumbered Loopback0
> >>   ip mtu 1492
> >> - no ip route-cache cef
> >> - no ip route-cache
> >>   ip ospf database-filter all out
> >>   no logging event link-status
> >>   peer default ip address pool dsl
> >> 
> >> November 4th, we had our first lockup during pretty much the slowest
> >> day of the week and off-peak hours at that.  There were no log
> >> entries because the syslog server was broken and there was no
> >> response on the serial console.  A power-cycle brought it right back
> >> up.  Everything appears to work normally after the power-cycle.  I
> >> crossed my fingers and hoped the cause was "cosmic ray".
> >> 
> >> Two weeks and one day later, the same thing happened.  That night I
> >> brought the IOS up to the current level 12.2(28) SB5.
> >> 
> >> Three weeks later, another lockup.  We ordered RAM and a spare ATM
> >> OC3 card.  They have arrived but not been installed yet.
> >> 
> >> Tonight, a week later, it happened again.  I have now fixed my syslog
> >> problems and enabled logging to the console for warning level and
> >> above messages. 
> >> 
> >> The CPU, temperature, line error rate, and bandwidth MRTG graphs are
> >> normal leading up to the hangs.
> >> 
> >> Are the above config statements known to be dangerous with
> >> 12.2(28)SB#? If it's not a known IOS bug, is there a more likely
> >> hardware culprit I should replace first?  What else do I need to be
> >> doing to track this problem down? 
> >> 
> >>  router-7204 uptime is 7 hours, 54 minutes
> >> System returned to ROM by power-on
> >> System restarted at 02:42:15 UTC Wed Jan 10 2007
> >> System image file is "disk2:c7200-ik91s-mz.122-28.SB5.bin"
> >> 
> >> Cisco 7204VXR (NPE-G1) processor (revision B) with 229376K/32768K bytes of memory.
> >> Processor board ID 21276969
> >> SB-1 CPU at 700Mhz, Implementation 1025, Rev 0.2, 512KB L2 Cache
> >> 4 slot VXR midplane, Version 2.1
> >> 
> >> 1 FastEthernet interface
> >> 3 Gigabit Ethernet interfaces
> >> 8 Serial interfaces
> >> 1 ATM interface
> >> 8 Channelized T1/PRI ports
> >> 509K bytes of NVRAM.
> >> 
> >> 20480K bytes of Flash PCMCIA card at slot 0 (Sector size 128K).
> >> 62592K bytes of ATA PCMCIA card at slot 2 (Sector size 512 bytes).
> >> 16384K bytes of Flash internal SIMM (Sector size 256K).
> >> Configuration register is 0x2102
> > 
> > We had another hang Saturday.  There were no warnings in the syslog
> > data, but there were messages on the serial console.
> > 
> > Three lines which repeat infinitely:
> > 
> > %SYS-2-NOTQ: unqueue didn't find 0 in queue 6432AC08
> > -Process= "<interrupt level>", ipl= 1
> > -Traceback= 6088E7AC 61764E24 61767234 604A04D4 604B0FFC
> > %SYS-2-NOTQ: unqueue didn't find 0 in queue 6432AC08
> > -Process= "<interrupt level>", ipl= 1
> > -Traceback= 6088E7AC 61764E24 61767234 604A04D4 604B0FFC
> > 
> > The closest thing I have found to match this error is:
> > 
> > http://noc.caravan.ru/ciscocd/cc/td/doc/product/software/ios122/122cavs/122tcavs.htm
> > 
> > I had reverted the route-cache cef changes post-boot Wednesday, but
> > there were still a few hundred PPPoE sessions using the vtemplate
> > sub-interface setup from before I could get in and change the
> > Virtual-Templates.  I probably should have kicked them off so
> > they could get the non-subif setup.
> > 
> > This was the shortest time ever between hangs, with the fewest users
> > running subif, i.e. CEF enabled, configurations.  This scares me.
> > 
> > Does anyone know what circumstances tickle the problem or if
> > CSCdx87590 is likely to affect a 7204VXR NPE-G1 running 12.2 SB?  The
> > Bug Toolkit doesn't seem to think CSCdx87590 affects anything but
> > 12.2T. 
> 
> Can you open a TAC case to have them take a look at this? 
> Can you pls send a "show ver" to decode the tracebacks?

router-7204#show ver
Cisco IOS Software, 7200 Software (C7200-IK91S-M), Version 12.2(28)SB5, RELEASE SOFTWARE (fc1)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2006 by Cisco Systems, Inc.
Compiled Mon 02-Oct-06 22:14 by richv

ROM: System Bootstrap, Version 12.3(4r)T1, RELEASE SOFTWARE (fc1)
BOOTLDR: 7200 Software (C7200-KBOOT-M), Version 12.2(4)BW2, EARLY DEPLOYMENT RELEASE SOFTWARE (fc1)

 router-7204 uptime is 2 minutes
System returned to ROM by reload at 07:36:38 UTC Sat Jan 20 2007
System restarted at 07:39:02 UTC Sat Jan 20 2007
System image file is "disk2:c7200-ik91s-mz.122-28.SB5.bin"


This product contains cryptographic features and is subject to United
States and local country laws governing import, export, transfer and
use. Delivery of Cisco cryptographic products does not imply
third-party authority to import, export, distribute or use encryption.
Importers, exporters, distributors and users are responsible for
compliance with U.S. and local country laws. By using this product you
agree to comply with applicable laws and regulations. If you are unable
to comply with U.S. and local laws, return this product immediately.

A summary of U.S. laws governing Cisco cryptographic products may be found at:
http://www.cisco.com/wwl/export/crypto/tool/stqrg.html

If you require further assistance please contact us by sending email to
export at cisco.com.

Cisco 7204VXR (NPE-G1) processor (revision B) with 229376K/32768K bytes of memory.
Processor board ID 21276969
SB-1 CPU at 700Mhz, Implementation 1025, Rev 0.2, 512KB L2 Cache
4 slot VXR midplane, Version 2.1

Last reset from power-on

PCI bus mb1 (Slots 1, 3 and 5) has a capacity of 600 bandwidth points.
Current configuration on bus mb1 has a total of 300 bandwidth points. 
This configuration is within the PCI bus capacity and is supported. 

PCI bus mb2 (Slots 2, 4 and 6) has a capacity of 600 bandwidth points.
Current configuration on bus mb2 has a total of 0 bandwidth points.
This configuration is within the PCI bus capacity and is supported. 
          
Please refer to the following document "Cisco 7200 Series Port 
Adaptor Hardware Configuration Guidelines" on CCO <www.cisco.com>, 
for c7200 bandwidth points oversubscription/usage guidelines.


1 FastEthernet interface
3 Gigabit Ethernet interfaces
8 Serial interfaces
1 ATM interface
8 Channelized T1/PRI ports
509K bytes of NVRAM.

20480K bytes of Flash PCMCIA card at slot 0 (Sector size 128K).
62592K bytes of ATA PCMCIA card at slot 2 (Sector size 512 bytes).
16384K bytes of Flash internal SIMM (Sector size 256K).
Configuration register is 0x2002


The queue address was slightly different this time.  I don't know if
that means anything.  The traceback is identical.

%SYS-2-NOTQ: unqueue didn't find 0 in queue 6436CAC8
-Process= "<interrupt level>", ipl= 1
-Traceback= 6088E7AC 61764E24 61767234 604A04D4 604B0FFC
%SYS-2-NOTQ: unqueue didn't find 0 in queue 6436CAC8
-Process= "<interrupt level>", ipl= 1
-Traceback= 6088E7AC 61764E24 61767234 604A04D4 604B0FFC
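
The comparison between the two hangs can also be checked mechanically rather than by eye. The following is only a rough sketch: the function name `summarize` and the regular expressions are made up for illustration, and the sample text is just the console output pasted in this thread.

```python
import re

# Patterns for the two message lines of interest (illustrative, not exhaustive).
NOTQ_RE = re.compile(r"%SYS-2-NOTQ: unqueue didn't find \S+ in queue ([0-9A-Fa-f]+)")
TRACE_RE = re.compile(r"-Traceback= (.+)")

def summarize(log_text):
    """Return (set of queue addresses, set of traceback tuples) seen in a capture."""
    queues, traces = set(), set()
    for line in log_text.splitlines():
        m = NOTQ_RE.search(line)
        if m:
            queues.add(m.group(1))
            continue
        m = TRACE_RE.search(line)
        if m:
            traces.add(tuple(m.group(1).split()))
    return queues, traces

# Console output from the earlier hang, as quoted in this thread.
first_hang = """\
%SYS-2-NOTQ: unqueue didn't find 0 in queue 6432AC08
-Process= "<interrupt level>", ipl= 1
-Traceback= 6088E7AC 61764E24 61767234 604A04D4 604B0FFC
"""

# Console output from the most recent hang.
second_hang = """\
%SYS-2-NOTQ: unqueue didn't find 0 in queue 6436CAC8
-Process= "<interrupt level>", ipl= 1
-Traceback= 6088E7AC 61764E24 61767234 604A04D4 604B0FFC
"""

q1, t1 = summarize(first_hang)
q2, t2 = summarize(second_hang)
print("same traceback:", t1 == t2)
print("same queue address:", q1 == q2)
```

Feeding a full conserver log through `summarize` should make it obvious if a different traceback ever shows up.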

> We could collect more information the next time when you enable "Break
> has effect" in the config register (set it to 0x2002) and reload the box
> to have the change take effect. Once the router hangs again, issue a
> break from the console and type "k 50" from the rommon to print the
> current stack trace. Then type "c" to continue (router will hang again),
> and repeat this a couple of times before resetting the router. 

I had problems with conserver.  I didn't get the stack trace yet.
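
For reference, the register change from 0x2102 to 0x2002 matters because bit 8 (0x0100) of the configuration register is the "Break disabled" bit, so clearing it lets a console break drop the box into rommon. The sketch below decodes only the three fields I am sure of (boot field, ignore NVRAM, break); the helper name `decode_confreg` is hypothetical and it is not a complete decoder.

```python
# Well-known configuration register bits; other bits are deliberately ignored here.
BOOT_FIELD = 0x000F      # bits 0-3: boot field
IGNORE_NVRAM = 0x0040    # bit 6: ignore the startup config in NVRAM
BREAK_DISABLED = 0x0100  # bit 8: when set, console Break has no effect after boot

def decode_confreg(value):
    """Decode a few fields of a 16-bit configuration register value."""
    return {
        "boot_field": value & BOOT_FIELD,
        "ignore_nvram": bool(value & IGNORE_NVRAM),
        "break_enabled": not (value & BREAK_DISABLED),
    }

# The old and new register values from the "show ver" output in this thread.
for reg in (0x2102, 0x2002):
    print(hex(reg), decode_confreg(reg))
```

With these values, 0x2102 decodes with break disabled and 0x2002 with break enabled; the boot field is 2 in both cases.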

This time there were no users on a virtual-template with "ip route-cache
cef" enabled.  So that's probably just a red herring.

We do have "ip cef" enabled globally.  But "ip cef" has been enabled
globally across multiple chassis and IOS revisions since before I came
to the company and set up RANCID.

-- 
Scott Lambert                    KC5MLE                       Unix SysAdmin
lambert at lambertfam.org


