[c-nsp] Cisco 7206 VXR hangs

Sun Jan 14 16:05:05 EST 2007

On Wed, Jan 10, 2007 at 05:20:26AM -0600, Scott Lambert wrote:
> I have a Cisco 7204VXR with NPE-G1 that has been hanging on me at one to
> three week intervals.
> 
> The box is doing DSL aggregation as well as being our core router.  We
> have a handful of T1s on it.  Both DSL and Internet are on the same ATM
> OC3 interface.
> 
> The box had been rock solid from July through October in the current
> hardware and software configuration before I tried some load reducing
> configuration changes.  The last hardware change was an upgrade from
> NPE-300 which had been working for years to the NPE-G1 which required
> the IOS upgrade to 12.2(28) SB2.
> 
> I did two config changes Oct 4 because the telco was trying to tell me
> my ATM throughput problems were due to CPU load on the box.  The CPU
> load dropped by half.  We went from 50% to 25% CPU utilization.  The ATM
> problems remained.  The telco eventually found a provisioning error and
> fixed the ATM issues.
> 
> I enabled route-cache cef on my PPPo{E|A} virtual-templates and
> increased the small, middle, and big buffers' permanent settings.
> 
> @@ -172,6 +172,10 @@ controller T1 4/7
>   linecode b8zs
>   channel-group 0 timeslots 1-24
>  !
> +buffers small permanent 700
> +buffers middle permanent 700
> +buffers big permanent 400
> +!
>  bba-group pppoe global
>   virtual-template 3
>   sessions per-vc limit 1024
> @@ -8322,8 +8326,6 @@ interface Serial4/7:0
>  interface Virtual-Template1
>   description PPPoA Template
>   ip unnumbered Loopback0
> - no ip route-cache cef
> - no ip route-cache
>   ip ospf database-filter all out
>   peer default ip address pool dsl
>   ppp authentication pap callin
> @@ -8333,8 +8335,6 @@ interface Virtual-Template3
>   mtu 1492
>   ip unnumbered Loopback0
>   ip mtu 1492
> - no ip route-cache cef
> - no ip route-cache
>   ip ospf database-filter all out
>   no logging event link-status
>   peer default ip address pool dsl
> 
> November 4th, we had our first lockup during pretty much the slowest
> day of the week and off-peak hours at that.  There were no log entries
> because the syslog server was broken and there was no response on the
> serial console.  A power-cycle brought it right back up.  Everything
> appears to work normally after the power-cycle.  I crossed my fingers
> and hoped the cause was "cosmic ray".
> 
> Two weeks and one day later, the same thing happened.  That night I
> brought the IOS up to the current level 12.2(28) SB5.
> 
> Three weeks later, another lockup.  We ordered RAM and a spare ATM OC3
> card.  They have arrived but not been installed yet.
> 
> Tonight, a week later, it happened again.  I have now fixed my syslog
> problems and enabled logging to the console for warning level and above
> messages.
> 
> The CPU, temperature, line error rate, and bandwidth MRTG graphs are
> normal leading up to the hangs.
> 
> Are the above config statements known to be dangerous with 12.2(28)SB#?
> If it's not a known IOS bug, is there a more likely hardware culprit I
> should replace first?  What else do I need to be doing to track this
> problem down?
> 
>  router-7204 uptime is 7 hours, 54 minutes
> System returned to ROM by power-on
> System restarted at 02:42:15 UTC Wed Jan 10 2007
> System image file is "disk2:c7200-ik91s-mz.122-28.SB5.bin"
> 
> Cisco 7204VXR (NPE-G1) processor (revision B) with 229376K/32768K bytes of memory.
> Processor board ID 21276969
> SB-1 CPU at 700Mhz, Implementation 1025, Rev 0.2, 512KB L2 Cache
> 4 slot VXR midplane, Version 2.1
> 
> 1 FastEthernet interface
> 3 Gigabit Ethernet interfaces
> 8 Serial interfaces
> 1 ATM interface
> 8 Channelized T1/PRI ports
> 509K bytes of NVRAM.
> 
> 20480K bytes of Flash PCMCIA card at slot 0 (Sector size 128K).
> 62592K bytes of ATA PCMCIA card at slot 2 (Sector size 512 bytes).
> 16384K bytes of Flash internal SIMM (Sector size 256K).
> Configuration register is 0x2102

We had another hang Saturday.  There were no warnings in the syslog
data, but there were messages on the serial console.

The same three lines repeat endlessly:

%SYS-2-NOTQ: unqueue didn't find 0 in queue 6432AC08
-Process= "<interrupt level>", ipl= 1
-Traceback= 6088E7AC 61764E24 61767234 604A04D4 604B0FFC
%SYS-2-NOTQ: unqueue didn't find 0 in queue 6432AC08
-Process= "<interrupt level>", ipl= 1
-Traceback= 6088E7AC 61764E24 61767234 604A04D4 604B0FFC
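
A minimal sketch, assuming the serial console output has been captured
as text, of how the repeating reports could be tallied to confirm that
every repeat carries the same queue address and traceback.  The names
summarize_notq, NOTQ_RE and SAMPLE are hypothetical and exist only for
this illustration; the sample text is the console output quoted above.

```python
import re
from collections import Counter

# One NOTQ report is three lines: the message, the process line, the traceback.
# The optional ", pid= NN" seen in some IOS releases is tolerated by [^\n]*.
NOTQ_RE = re.compile(
    r"%SYS-2-NOTQ: unqueue didn't find (?P<item>\S+) in queue (?P<queue>[0-9A-Fa-f]+)\s*\n"
    r'-Process= "(?P<process>[^"]*)", ipl= (?P<ipl>\d+)[^\n]*\n'
    r"-Traceback= (?P<traceback>[0-9A-Fa-f ]+)"
)


def summarize_notq(console_text):
    """Count NOTQ reports keyed by (queue, process, ipl, traceback addresses)."""
    counts = Counter()
    for match in NOTQ_RE.finditer(console_text):
        key = (
            match.group("queue"),
            match.group("process"),
            int(match.group("ipl")),
            tuple(match.group("traceback").split()),
        )
        counts[key] += 1
    return counts


# Console output as quoted in this message (two repeats of the same report).
SAMPLE = """%SYS-2-NOTQ: unqueue didn't find 0 in queue 6432AC08
-Process= "<interrupt level>", ipl= 1
-Traceback= 6088E7AC 61764E24 61767234 604A04D4 604B0FFC
%SYS-2-NOTQ: unqueue didn't find 0 in queue 6432AC08
-Process= "<interrupt level>", ipl= 1
-Traceback= 6088E7AC 61764E24 61767234 604A04D4 604B0FFC
"""

if __name__ == "__main__":
    for (queue, process, ipl, traceback), n in summarize_notq(SAMPLE).items():
        print(n, "x queue", queue, "process", process, "ipl", ipl)
        print("    traceback:", " ".join(traceback))
```

A single key in the result means every repeat is the same report.  The
traceback addresses are specific to the running image, so they are only
comparable between captures taken on the same IOS build.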

The closest thing I have found to match this error is:

http://noc.caravan.ru/ciscocd/cc/td/doc/product/software/ios122/122cavs/122tcavs.htm

CSCdx87590

Symptoms A router may pause indefinitely under certain circumstances
         after Cisco Express Forwarding (CEF) is configured. Before
         pausing, the following messages may be seen on the console:

%SYS-2-NOTQ: unqueue didn't find 0 in queue 6311B9DC
-Process= "<interrupt level>", ipl= 3, pid= 46
-Traceback= 60C81950 60C7F4A0 60C7F4A0 61B1EC14 61B1EE8C 61B1FE30 61B3D754 61B396AC 61B39860 61B3A528 61B3B7F8 61B30DF4 61B3071C 61B3041C 61B33804

Conditions  This symptom is observed on a Cisco 3600 series router.

Workaround  There is no workaround.

I had reverted the route-cache cef changes after Wednesday's reboot, but
a few hundred PPPoE sessions had already come up on the vtemplate
sub-interface setup before I could get in and change the
Virtual-Templates.  I probably should have kicked them off so they would
reconnect with the non-subif setup.

This was the shortest time ever between hangs, with the fewest users
running subif, i.e. CEF-enabled, configurations.  This scares me.

Does anyone know what circumstances tickle the problem or if CSCdx87590
is likely to affect a 7204VXR NPE-G1 running 12.2 SB?  The Bug Toolkit
doesn't seem to think CSCdx87590 affects anything but 12.2T.

-- 
Scott Lambert                    KC5MLE                       Unix SysAdmin
lambert at lambertfam.org