[c-nsp] Cisco 7206 VXR hangs

Scott Lambert lambert at lambertfam.org
Sat Jan 20 14:09:42 EST 2007


On Sat, Jan 20, 2007 at 01:35:07AM -0800, Ted Mittelstaedt wrote:
> 
> ----- Original Message ----- 
> From: "Scott Lambert" <lambert at lambertfam.org>
> To: "Oliver Boehmer (oboehmer)" <oboehmer at cisco.com>
> Cc: <cisco-nsp at puck.nether.net>
> Sent: Saturday, January 20, 2007 12:50 AM
> Subject: Re: [c-nsp] Cisco 7206 VXR hangs
> 
> 
> > On Mon, Jan 15, 2007 at 08:22:53AM +0100, Oliver Boehmer (oboehmer) wrote:
> > > cisco-nsp-bounces at puck.nether.net <> wrote on Sunday, January 14, 2007
> > > 10:05 PM:
> > >
> > > > On Wed, Jan 10, 2007 at 05:20:26AM -0600, Scott Lambert wrote:
> > > >> I have a Cisco 7204VXR with NPE-G1 that has been hanging on me at
> > > >> one to three week intervals.
> > > >>
> > > >> The box is doing DSL aggregation as well as being our core router.
> > > >> We have a handful of T1s on it.  Both DSL and Internet are on the
> > > >> same ATM OC3 interface.
> > > >>
> > > >> The box had been rock solid from July through October in the current
> > > >> hardware and software configuration before I tried some load reducing
> > > >> configuration changes.  The last hardware change was an upgrade from
> > > >> NPE-300 which had been working for years to the NPE-G1 which required
> > > >> the IOS upgrade to 12.2(28) SB2.
> > > >>
> > > >> I did two config changes Oct 4 because the telco was trying to tell
> > > >> me my ATM throughput problems were due to CPU load on the box.  The
> > > >> CPU load dropped by half.  We went from 50% to 25% CPU utilization.
> > > >> The ATM problems remained.  The telco eventually found a
> > > >> provisioning error and fixed the ATM issues.
> > > >>
> > > >> I enabled route-cache cef on my PPPo{E|A} virtual-templates and
> > > >> increased the small, middle, and big buffers' permanent settings.
> > > >>
> > > >> @@ -172,6 +172,10 @@ controller T1 4/7
> > > >>   linecode b8zs
> > > >>   channel-group 0 timeslots 1-24
> > > >>  !
> > > >> +buffers small permanent 700
> > > >> +buffers middle permanent 700
> > > >> +buffers big permanent 400
> > > >> +!
> > > >>  bba-group pppoe global
> > > >>   virtual-template 3
> > > >>   sessions per-vc limit 1024
> > > >> @@ -8322,8 +8326,6 @@ interface Serial4/7:0
> > > >>  interface Virtual-Template1
> > > >>   description PPPoA Template
> > > >>   ip unnumbered Loopback0
> > > >> - no ip route-cache cef
> > > >> - no ip route-cache
> > > >>   ip ospf database-filter all out
> > > >>   peer default ip address pool dsl
> > > >>   ppp authentication pap callin
> > > >> @@ -8333,8 +8335,6 @@ interface Virtual-Template3
> > > >>   mtu 1492
> > > >>   ip unnumbered Loopback0
> > > >>   ip mtu 1492
> > > >> - no ip route-cache cef
> > > >> - no ip route-cache
> > > >>   ip ospf database-filter all out
> > > >>   no logging event link-status
> > > >>   peer default ip address pool dsl
> > > >>
> > > >> November 4th, we had our first lockup during pretty much the slowest
> > > >> day of the week and off-peak hours at that.  There were no log
> > > >> entries because the syslog server was broken and there was no
> > > >> response on the serial console.  A power-cycle brought it right back
> > > >> up.  Everything appears to work normally after the power-cycle.  I
> > > >> crossed my fingers and hoped the cause was "cosmic ray".
> > > >>
> > > >> Two weeks and one day later, the same thing happened.  That night I
> > > >> brought the IOS up to the current level 12.2(28) SB5.
> > > >>
> > > >> Three weeks later, another lockup.  We ordered RAM and a spare ATM
> > > >> OC3 card.  They have arrived but not been installed yet.
> > > >>
> > > >> Tonight, a week later, it happened again.  I have now fixed my syslog
> > > >> problems and enabled logging to the console for warning level and
> > > >> above messages.
> > > >>
> > > >> The CPU, temperature, line error rate, and bandwidth MRTG graphs are
> > > >> normal leading up to the hangs.
> > > >>
> > > >> Are the above config statements known to be dangerous with
> > > >> 12.2(28)SB#? If it's not a known IOS bug, is there a more likely
> > > >> hardware culprit I should replace first?  What else do I need to be
> > > >> doing to track this problem down?
> > > >>
> > > >
> > > > We had another hang Saturday.  There were no warnings in the syslog
> > > > data, but there were messages on the serial console.
> > > >
> > > > Three lines which repeat infinitely:
> > > >
> > > > %SYS-2-NOTQ: unqueue didn't find 0 in queue 6432AC08
> > > > -Process= "<interrupt level>", ipl= 1
> > > > -Traceback= 6088E7AC 61764E24 61767234 604A04D4 604B0FFC
> > > > %SYS-2-NOTQ: unqueue didn't find 0 in queue 6432AC08
> > > > -Process= "<interrupt level>", ipl= 1
> > > > -Traceback= 6088E7AC 61764E24 61767234 604A04D4 604B0FFC
> > > >
> > > > The closest thing I have found to match this error is:
> > > >
> > > > http://noc.caravan.ru/ciscocd/cc/td/doc/product/software/ios122/122cavs/122tcavs.htm
> > > >
> > > > I had reverted the route-cache cef changes post-boot Wednesday, but
> > > > there were still a few hundred PPPoE sessions using the vtemplate
> > > > sub-interface setup from before I could get in and change the
> > > > Virtual-Templates.  I probably should have kicked them off so
> > > > they could get the non-subif setup.
> > > >
> > > > This was the shortest time ever between hangs, with the fewest users
> > > > running subif (i.e. CEF-enabled) configurations.  This scares me.
> > > >
> > > > Does anyone know what circumstances tickle the problem or if
> > > > CSCdx87590 is likely to affect a 7204VXR NPE-G1 running 12.2 SB?  The
> > > > Bug Toolkit doesn't seem to think CSCdx87590 affects anything but
> > > > 12.2T.
> > >
> > > Can you open a TAC case to have them take a look at this?
> > > Can you pls send a "show ver" to decode the tracebacks?
> >
> > router-7204#show ver
> > [cut]
> >
> > The queue address was slightly different this time.  I don't know if
> > that means anything.  The traceback is identical.
> >
> > %SYS-2-NOTQ: unqueue didn't find 0 in queue 6436CAC8
> > -Process= "<interrupt level>", ipl= 1
> > -Traceback= 6088E7AC 61764E24 61767234 604A04D4 604B0FFC
> > %SYS-2-NOTQ: unqueue didn't find 0 in queue 6436CAC8
> > -Process= "<interrupt level>", ipl= 1
> > -Traceback= 6088E7AC 61764E24 61767234 604A04D4 604B0FFC
> >
> > > We could collect more information the next time when you enable "Break
> > > has effect" in the config register (set it to 0x2002) and reload the box
> > > to have the change take effect.. Once the router hangs again, issue a
> > > break from the console and type "k 50" from the rommon to print the
> > > current stack trace. Then type "c" to continue (router will hang again),
> > > and repeat this a couple of times before resetting the router.
> >
> > I had problems with conserver.  I didn't get the stack trace yet.
> >
> > This time there were no users on an "ip route-cache cef" enabled
> > virtual-template.  So that's probably just a red herring.
> >
> > We do have "ip cef" enabled globally.  But "ip cef" has been enabled
> > globally across multiple chassis and IOS revisions since before I came
> > to the company and set up RANCID.
> >
> 
> Scott,
> 
>  I've been following this problem since we have a couple of 7206VXRs
> with ATM DS3s in them; we also have a 7206 non-VXR with an ATM DS3
> card in it.  Our VXRs run NPE-300s, the non-VXRs run NPE-200s.  We
> have not had a router crash for the last 4-6 years or so (it's been so
> long I've forgotten).
> 
> This may sound dumb but why don't you put the NPE-400 back in?
> 
> It seems to me that if the NPE-400 works and the NPE-G1 crashes, the
> problem isn't in the RAM or the OC3 card or the IOS.  Either the
> NPE-G1 card is bad (swapping it for another NPE-G1 and testing would
> show this) or the NPE-G1 is simply shit in this platform.

The NPE-400 was in it for about 2 weeks.  It couldn't handle the load.
We traded the NPE-400 for the NPE-G1.  The NPE-G1 then ran smoothly for
about 4 months.  In the current configuration, the NPE-G1 is at 50% CPU
utilization during the daytime hours.  With "ip route-cache cef" on the
virtual-templates, CPU utilization went down to 25%.
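
For anyone comparing, the CEF change amounted to this on each template
(a sketch reconstructed from the diff earlier in the thread, not a
fresh capture from the router):

  interface Virtual-Template1
   ip route-cache cef
  !
  interface Virtual-Template3
   ip route-cache cef

Note that existing virtual-access interfaces keep the switching path
they were cloned with until the session is re-established, which is why
reverting it didn't take effect for already-connected users.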
 
> At the rate you're swapping things around, you have introduced so many
> changes that you no longer know if the config will work even on the
> NPE-400 anymore.

Actually, other than the buffer settings, the IOS being three patch
levels more recent, break being enabled, some logging tweaks, and added
DSL customers, everything is as it was during the four months it ran
okay.  That does seem like quite a few changes, but only two of them
were elective.

The logging changes have allowed me to figure out what is happening.
Hopefully the break enabling in the config-register will give me more
insight.
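
If it helps anyone following along, the procedure Oliver described
looks roughly like this from the console (a sketch; prompts are
illustrative, not captured from this router):

  router-7204# configure terminal
  router-7204(config)# config-register 0x2002
  router-7204(config)# end
  router-7204# reload

  (after the next hang, send a BREAK from the console, then:)
  rommon 1 > k 50     <- print the current stack trace
  rommon 2 > c        <- continue; repeat the break/k 50 a few times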

I could revert the buffer settings and IOS if anybody thinks I need to.
The IOS was changed only after the first two occurrences of the problem.
The buffer settings were made prior to the first occurrence.  I only
backed out the "ip route-cache cef" changes previously because I am
trying to make one change at a time while trying to cure this problem.

  buffers small permanent 700
  buffers middle permanent 700
  buffers big permanent 400
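
After a reload, "show buffers" should confirm the pools took the new
sizes; the shape of the output is roughly this (from memory, not
captured from this router):

  router-7204# show buffers
  ...
  Small buffers, 104 bytes (total 700, permanent 700):
  Middle buffers, 600 bytes (total 700, permanent 700):
  Big buffers, 1524 bytes (total 400, permanent 400):
  ...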

If I didn't have to take all of these VCCs (1680 of them now) on one
OC3, I would split the load across a couple of NPE-300s, which were
stable for years.  Of course, during that time, we had fewer than 900
VCCs.
September 29th, we had 1446 VCCs.  October 3rd, we brought up another
130 VCCs from an acquisition.  So, by the time we had the first problem
we probably had about 1580 to 1600 VCCs.  That doesn't sound like a
magic number so I doubt the bug is being triggered by just the volume
of configured VCCs.  But I figured I'd throw it out just in case it is
important.

I am really not very comfortable with the networking gear.  I can
usually keep things running but in cases like this, I'm out of my
comfort zone.  

-- 
Scott Lambert                    KC5MLE                       Unix SysAdmin
lambert at lambertfam.org
