[c-nsp] Cisco 7206 VXR hangs

Ted Mittelstaedt tedm at toybox.placo.com
Sat Jan 20 04:35:07 EST 2007


----- Original Message ----- 
From: "Scott Lambert" <lambert at lambertfam.org>
To: "Oliver Boehmer (oboehmer)" <oboehmer at cisco.com>
Cc: <cisco-nsp at puck.nether.net>
Sent: Saturday, January 20, 2007 12:50 AM
Subject: Re: [c-nsp] Cisco 7206 VXR hangs


> On Mon, Jan 15, 2007 at 08:22:53AM +0100, Oliver Boehmer (oboehmer) wrote:
> > cisco-nsp-bounces at puck.nether.net <> wrote on Sunday, January 14, 2007
> > 10:05 PM:
> >
> > > On Wed, Jan 10, 2007 at 05:20:26AM -0600, Scott Lambert wrote:
> > >> I have a Cisco 7204VXR with NPE-G1 that has been hanging on me at
> > >> one to three week intervals.
> > >>
> > >> The box is doing DSL aggregation as well as being our core router.
> > >> We have a handful of T1s on it.  Both DSL and Internet are on the
> > >> same ATM OC3 interface.
> > >>
> > >> The box had been rock solid from July through October in the current
> > >> hardware and software configuration before I tried some load reducing
> > >> configuration changes.  The last hardware change was an upgrade from
> > >> NPE-300 which had been working for years to the NPE-G1 which required
> > >> the IOS upgrade to 12.2(28) SB2.
> > >>
> > >> I did two config changes Oct 4 because the telco was trying to tell
> > >> me my ATM throughput problems were due to CPU load on the box.  The
> > >> CPU load dropped by half.  We went from 50% to 25% CPU utilization.
> > >> The ATM problems remained.  The telco eventually found a
> > >> provisioning error and fixed the ATM issues.
> > >>
> > >> I enabled route-cache cef on my PPPo{E|A} virtual-templates and
> > >> increased the small, middle, and big buffers' permanent settings.
> > >>
> > >> @@ -172,6 +172,10 @@ controller T1 4/7
> > >>   linecode b8zs
> > >>   channel-group 0 timeslots 1-24
> > >>  !
> > >> +buffers small permanent 700
> > >> +buffers middle permanent 700
> > >> +buffers big permanent 400
> > >> +!
> > >>  bba-group pppoe global
> > >>   virtual-template 3
> > >>   sessions per-vc limit 1024
> > >> @@ -8322,8 +8326,6 @@ interface Serial4/7:0
> > >>  interface Virtual-Template1
> > >>   description PPPoA Template
> > >>   ip unnumbered Loopback0
> > >> - no ip route-cache cef
> > >> - no ip route-cache
> > >>   ip ospf database-filter all out
> > >>   peer default ip address pool dsl
> > >>   ppp authentication pap callin
> > >> @@ -8333,8 +8335,6 @@ interface Virtual-Template3
> > >>   mtu 1492
> > >>   ip unnumbered Loopback0
> > >>   ip mtu 1492
> > >> - no ip route-cache cef
> > >> - no ip route-cache
> > >>   ip ospf database-filter all out
> > >>   no logging event link-status
> > >>   peer default ip address pool dsl
> > >>
> > >> November 4th, we had our first lockup during pretty much the slowest
> > >> day of the week and off-peak hours at that.  There were no log
> > >> entries because the syslog server was broken and there was no
> > >> response on the serial console.  A power-cycle brought it right back
> > >> up.  Everything appears to work normally after the power-cycle.  I
> > >> crossed my fingers and hoped the cause was "cosmic ray".
> > >>
> > >> Two weeks and one day later, the same thing happened.  That night I
> > >> brought the IOS up to the current level 12.2(28) SB5.
> > >>
> > >> Three weeks later, another lockup.  We ordered RAM and a spare ATM
> > >> OC3 card.  They have arrived but not been installed yet.
> > >>
> > >> Tonight, a week later, it happened again.  I have now fixed my syslog
> > >> problems and enabled logging to the console for warning level and
> > >> above messages.
> > >>
> > >> The CPU, temperature, line error rate, and bandwidth MRTG graphs are
> > >> normal leading up to the hangs.
> > >>
> > >> Are the above config statements known to be dangerous with
> > >> 12.2(28)SB#? If it's not a known IOS bug, is there a more likely
> > >> hardware culprit I should replace first?  What else do I need to be
> > >> doing to track this problem down?
> > >>
> > >>  router-7204 uptime is 7 hours, 54 minutes
> > >> System returned to ROM by power-on
> > >> System restarted at 02:42:15 UTC Wed Jan 10 2007
> > >> System image file is "disk2:c7200-ik91s-mz.122-28.SB5.bin"
> > >>
> > >> Cisco 7204VXR (NPE-G1) processor (revision B) with 229376K/32768K bytes of memory.
> > >> Processor board ID 21276969
> > >> SB-1 CPU at 700Mhz, Implementation 1025, Rev 0.2, 512KB L2 Cache
> > >> 4 slot VXR midplane, Version 2.1
> > >>
> > >> 1 FastEthernet interface
> > >> 3 Gigabit Ethernet interfaces
> > >> 8 Serial interfaces
> > >> 1 ATM interface
> > >> 8 Channelized T1/PRI ports
> > >> 509K bytes of NVRAM.
> > >>
> > >> 20480K bytes of Flash PCMCIA card at slot 0 (Sector size 128K).
> > >> 62592K bytes of ATA PCMCIA card at slot 2 (Sector size 512 bytes).
> > >> 16384K bytes of Flash internal SIMM (Sector size 256K).
> > >> Configuration register is 0x2102
> > >
> > > We had another hang Saturday.  There were no warnings in the syslog
> > > data, but there were messages on the serial console.
> > >
> > > Three lines which repeat infinitely:
> > >
> > > %SYS-2-NOTQ: unqueue didn't find 0 in queue 6432AC08
> > > -Process= "<interrupt level>", ipl= 1
> > > -Traceback= 6088E7AC 61764E24 61767234 604A04D4 604B0FFC
> > > %SYS-2-NOTQ: unqueue didn't find 0 in queue 6432AC08
> > > -Process= "<interrupt level>", ipl= 1
> > > -Traceback= 6088E7AC 61764E24 61767234 604A04D4 604B0FFC
> > >
> > > The closest thing I have found to match this error is:
> > >
> > > http://noc.caravan.ru/ciscocd/cc/td/doc/product/software/ios122/122cavs/122tcavs.htm
> > >
> > > I had reverted the route-cache cef changes post-boot Wednesday, but
> > > there were still a few hundred PPPoE sessions using the vtemplate
> > > sub-interface setup from before I could get in and change the
> > > Virtual-Templates.  I probably should have kicked them off so
> > > they could get the non-subif setup.
> > >
> > > This was the shortest time ever between hangs, with the fewest users
> > > running subif, ie. CEF enabled, configurations.  This scares me.
> > >
> > > Does anyone know what circumstances tickle the problem or if
> > > CSCdx87590 is likely to affect a 7204VXR NPE-G1 running 12.2 SB?  The
> > > Bug Toolkit doesn't seem to think CSCdx87590 affects anything but
> > > 12.2T.
> >
> > Can you open a TAC case to have them take a look at this?
> > Can you pls send a "show ver" to decode the tracebacks?
>
> router-7204#show ver
> Cisco IOS Software, 7200 Software (C7200-IK91S-M), Version 12.2(28)SB5, RELEASE SOFTWARE (fc1)
> Technical Support: http://www.cisco.com/techsupport
> Copyright (c) 1986-2006 by Cisco Systems, Inc.
> Compiled Mon 02-Oct-06 22:14 by richv
>
> ROM: System Bootstrap, Version 12.3(4r)T1, RELEASE SOFTWARE (fc1)
> BOOTLDR: 7200 Software (C7200-KBOOT-M), Version 12.2(4)BW2, EARLY DEPLOYMENT RELEASE SOFTWARE (fc1)
>
>  router-7204 uptime is 2 minutes
> System returned to ROM by reload at 07:36:38 UTC Sat Jan 20 2007
> System restarted at 07:39:02 UTC Sat Jan 20 2007
> System image file is "disk2:c7200-ik91s-mz.122-28.SB5.bin"
>
>
> This product contains cryptographic features and is subject to United
> States and local country laws governing import, export, transfer and
> use. Delivery of Cisco cryptographic products does not imply
> third-party authority to import, export, distribute or use encryption.
> Importers, exporters, distributors and users are responsible for
> compliance with U.S. and local country laws. By using this product you
> agree to comply with applicable laws and regulations. If you are unable
> to comply with U.S. and local laws, return this product immediately.
>
> A summary of U.S. laws governing Cisco cryptographic products may be found at:
> http://www.cisco.com/wwl/export/crypto/tool/stqrg.html
>
> If you require further assistance please contact us by sending email to
> export at cisco.com.
>
> Cisco 7204VXR (NPE-G1) processor (revision B) with 229376K/32768K bytes of memory.
> Processor board ID 21276969
> SB-1 CPU at 700Mhz, Implementation 1025, Rev 0.2, 512KB L2 Cache
> 4 slot VXR midplane, Version 2.1
>
> Last reset from power-on
>
> PCI bus mb1 (Slots 1, 3 and 5) has a capacity of 600 bandwidth points.
> Current configuration on bus mb1 has a total of 300 bandwidth points.
> This configuration is within the PCI bus capacity and is supported.
>
> PCI bus mb2 (Slots 2, 4 and 6) has a capacity of 600 bandwidth points.
> Current configuration on bus mb2 has a total of 0 bandwidth points.
> This configuration is within the PCI bus capacity and is supported.
>
> Please refer to the following document "Cisco 7200 Series Port
> Adaptor Hardware Configuration Guidelines" on CCO <www.cisco.com>,
> for c7200 bandwidth points oversubscription/usage guidelines.
>
>
> 1 FastEthernet interface
> 3 Gigabit Ethernet interfaces
> 8 Serial interfaces
> 1 ATM interface
> 8 Channelized T1/PRI ports
> 509K bytes of NVRAM.
>
> 20480K bytes of Flash PCMCIA card at slot 0 (Sector size 128K).
> 62592K bytes of ATA PCMCIA card at slot 2 (Sector size 512 bytes).
> 16384K bytes of Flash internal SIMM (Sector size 256K).
> Configuration register is 0x2002
>
>
> The queue address was slightly different this time.  I don't know if
> that means anything.  The traceback is identical.
>
> %SYS-2-NOTQ: unqueue didn't find 0 in queue 6436CAC8
> -Process= "<interrupt level>", ipl= 1
> -Traceback= 6088E7AC 61764E24 61767234 604A04D4 604B0FFC
> %SYS-2-NOTQ: unqueue didn't find 0 in queue 6436CAC8
> -Process= "<interrupt level>", ipl= 1
> -Traceback= 6088E7AC 61764E24 61767234 604A04D4 604B0FFC
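
To check whether two console captures really show the same fault, one could pull the queue addresses and tracebacks out of the text and compare them. What follows is only a rough sketch: the function name summarize_notq and the regular expressions are my own assumptions, written against the message format quoted in this thread, not a Cisco tool.

```python
import re

# Patterns are assumptions based on the console lines quoted in this thread.
NOTQ_RE = re.compile(r"%SYS-2-NOTQ: unqueue didn't find \S+ in queue ([0-9A-Fa-f]+)")
TRACE_RE = re.compile(r"-Traceback=\s*(.+)")

def summarize_notq(console_text):
    """Return (set of queue addresses, set of traceback tuples) seen in the text."""
    queues = set()
    tracebacks = set()
    for line in console_text.splitlines():
        match = NOTQ_RE.search(line)
        if match:
            queues.add(match.group(1).upper())
            continue
        match = TRACE_RE.search(line)
        if match:
            # Store the call chain as a tuple so identical chains collapse to one entry.
            tracebacks.add(tuple(match.group(1).split()))
    return queues, tracebacks

# The two captures from this thread, pasted together.
SAMPLE = """\
%SYS-2-NOTQ: unqueue didn't find 0 in queue 6432AC08
-Process= "<interrupt level>", ipl= 1
-Traceback= 6088E7AC 61764E24 61767234 604A04D4 604B0FFC
%SYS-2-NOTQ: unqueue didn't find 0 in queue 6436CAC8
-Process= "<interrupt level>", ipl= 1
-Traceback= 6088E7AC 61764E24 61767234 604A04D4 604B0FFC
"""

queues, tracebacks = summarize_notq(SAMPLE)
print(sorted(queues))      # two distinct queue addresses
print(len(tracebacks))     # one distinct traceback
```

One traceback with two queue addresses matches what is described above: the same code path, operating on a different queue each time.
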
>
> > We could collect more information the next time when you enable "Break
> > has effect" in the config register (set it to 0x2002) and reload the box
> > to have the change take effect.. Once the router hangs again, issue a
> > break from the console and type "k 50" from the rommon to print the
> > current stack trace. Then type "c" to continue (router will hang again),
> > and repeat this a couple of times before resetting the router.
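
For anyone wondering what actually differs between 0x2102 and 0x2002, here is a minimal sketch that decodes a few configuration-register bits. The bit meanings follow Cisco's published configuration-register table as I understand it; the names CONFREG_FLAGS, decode_confreg and changed_bits are made up for this example, and only a handful of bits are covered.

```python
# Partial bit table; meanings taken from Cisco's configuration-register
# documentation as I understand it. Not an exhaustive list.
CONFREG_FLAGS = {
    0x0040: "ignore NVRAM configuration",
    0x0100: "Break disabled",
    0x2000: "boot default ROM software if network boot fails",
}

def decode_confreg(value):
    """Return (boot field, list of flag names set) for a config-register value."""
    boot_field = value & 0x000F  # low four bits select the boot behaviour
    flags = [name for bit, name in sorted(CONFREG_FLAGS.items()) if value & bit]
    return boot_field, flags

def changed_bits(old, new):
    """Return a mask of the bits that differ between two register values."""
    return old ^ new

print(decode_confreg(0x2102))
print(decode_confreg(0x2002))
print(hex(changed_bits(0x2102, 0x2002)))  # 0x100
```

The only bit that changes is 0x0100, so clearing it is what lets a console Break drop the hung router into rommon.
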
>
> I had problems with conserver.  I didn't get the stack trace yet.
>
> This time there were no users with "ip route-cache cef" enabled
> virtual-template.  So that's probably just a red herring.
>
> We do have "ip cef" enabled globally.  But "ip cef" has been enabled
> globally across multiple chassis and IOS revisions since before I came
> to the company and set up RANCID.
>

Scott,

I've been following this problem since we have a couple of 7206VXRs
with ATM DS3s in them; we also have a 7206 non-VXR with an ATM DS3
card in it.  Our VXRs run NPE-300s; the non-VXR runs an NPE-200.  We
have not had a router crash for the last 4-6 years or so (it's been so long
I've forgotten).

This may sound dumb, but why don't you put the NPE-300 back in?

It seems to me that if the NPE-300 works and the NPE-G1 crashes, then
the problem isn't in the RAM or the OC3 card or the IOS.  Either the
NPE-G1 card is bad (which swapping it for another NPE-G1 and
testing would show) or the NPE-G1 is simply shit in this platform.

At the rate you're swapping things around, you have introduced so many
changes that you no longer know whether the config will work even on the
NPE-300 anymore.

Ted


