[c-nsp] Cisco 7206 VXR hangs

Mon Jan 15 02:22:53 EST 2007

cisco-nsp-bounces at puck.nether.net <> wrote on Sunday, January 14, 2007
10:05 PM:

> On Wed, Jan 10, 2007 at 05:20:26AM -0600, Scott Lambert wrote:
>> I have a Cisco 7204VXR with NPE-G1 that has been hanging on me at
>> one to three week intervals. 
>> 
>> The box is doing DSL aggregation as well as being our core router. 
>> We have a handful of T1s on it.  Both DSL and Internet are on the
>> same ATM OC3 interface. 
>> 
>> The box had been rock solid from July through October in the current
>> hardware and software configuration before I tried some load reducing
>> configuration changes.  The last hardware change was an upgrade from
>> NPE-300 which had been working for years to the NPE-G1 which required
>> the IOS upgrade to 12.2(28) SB2.
>> 
>> I did two config changes Oct 4 because the telco was trying to tell
>> me my ATM throughput problems were due to CPU load on the box.  The
>> CPU load dropped by half.  We went from 50% to 25% CPU utilization. 
>> The ATM problems remained.  The telco eventually found a
>> provisioning error and fixed the ATM issues. 
>> 
>> I enabled route-cache cef on my PPPo{E|A} virtual-templates and
>> increased the small, middle, and big buffers' permanent settings.
>> 
>> @@ -172,6 +172,10 @@ controller T1 4/7
>>   linecode b8zs
>>   channel-group 0 timeslots 1-24
>>  !
>> +buffers small permanent 700
>> +buffers middle permanent 700
>> +buffers big permanent 400
>> +!
>>  bba-group pppoe global
>>   virtual-template 3
>>   sessions per-vc limit 1024
>> @@ -8322,8 +8326,6 @@ interface Serial4/7:0
>>  interface Virtual-Template1
>>   description PPPoA Template
>>   ip unnumbered Loopback0
>> - no ip route-cache cef
>> - no ip route-cache
>>   ip ospf database-filter all out
>>   peer default ip address pool dsl
>>   ppp authentication pap callin
>> @@ -8333,8 +8335,6 @@ interface Virtual-Template3
>>   mtu 1492
>>   ip unnumbered Loopback0
>>   ip mtu 1492
>> - no ip route-cache cef
>> - no ip route-cache
>>   ip ospf database-filter all out
>>   no logging event link-status
>>   peer default ip address pool dsl
>> 
>> November 4th, we had our first lockup during pretty much the slowest
>> day of the week and off-peak hours at that.  There were no log
>> entries because the syslog server was broken and there was no
>> response on the serial console.  A power-cycle brought it right back
>> up.  Everything appears to work normally after the power-cycle.  I
>> crossed my fingers and hoped the cause was "cosmic ray".
>> 
>> Two weeks and one day later, the same thing happened.  That night I
>> brought the IOS up to the current level 12.2(28) SB5.
>> 
>> Three weeks later, another lockup.  We ordered RAM and a spare ATM
>> OC3 card.  They have arrived but not been installed yet.
>> 
>> Tonight, a week later, it happened again.  I have now fixed my syslog
>> problems and enabled logging to the console for warning level and
>> above messages. 
>> 
>> The CPU, temperature, line error rate, and bandwidth MRTG graphs are
>> normal leading up to the hangs.
>> 
>> Are the above config statements known to be dangerous with
>> 12.2(28)SB#? If it's not a known IOS bug, is there a more likely
>> hardware culprit I should replace first?  What else do I need to be
>> doing to track this problem down? 
>> 
>>  router-7204 uptime is 7 hours, 54 minutes
>> System returned to ROM by power-on
>> System restarted at 02:42:15 UTC Wed Jan 10 2007
>> System image file is "disk2:c7200-ik91s-mz.122-28.SB5.bin"
>> 
>> Cisco 7204VXR (NPE-G1) processor (revision B) with 229376K/32768K bytes of memory.
>> Processor board ID 21276969
>> SB-1 CPU at 700Mhz, Implementation 1025, Rev 0.2, 512KB L2 Cache
>> 4 slot VXR midplane, Version 2.1
>> 
>> 1 FastEthernet interface
>> 3 Gigabit Ethernet interfaces
>> 8 Serial interfaces
>> 1 ATM interface
>> 8 Channelized T1/PRI ports
>> 509K bytes of NVRAM.
>> 
>> 20480K bytes of Flash PCMCIA card at slot 0 (Sector size 128K).
>> 62592K bytes of ATA PCMCIA card at slot 2 (Sector size 512 bytes).
>> 16384K bytes of Flash internal SIMM (Sector size 256K).
>> Configuration register is 0x2102
> 
> We had another hang Saturday.  There were no warnings in the syslog
> data, but there were messages on the serial console.
> 
> Three lines which repeat infinitely:
> 
> %SYS-2-NOTQ: unqueue didn't find 0 in queue 6432AC08
> -Process= "<interrupt level>", ipl= 1
> -Traceback= 6088E7AC 61764E24 61767234 604A04D4 604B0FFC
> %SYS-2-NOTQ: unqueue didn't find 0 in queue 6432AC08
> -Process= "<interrupt level>", ipl= 1
> -Traceback= 6088E7AC 61764E24 61767234 604A04D4 604B0FFC
> 
> The closest thing I have found to match this error is:
> 
> http://noc.caravan.ru/ciscocd/cc/td/doc/product/software/ios122/122cavs/122tcavs.htm
> 
> CSCdx87590
> 
> Symptoms A router may pause indefinitely under certain circumstances
>          after Cisco Express Forwarding (CEF) is configured. Before
>          pausing, the following messages may be seen on the console:
> 
> %SYS-2-NOTQ: unqueue didn't find 0 in queue 6311B9DC
> -Process= "<interrupt level>", ipl= 3, pid= 46
> -Traceback= 60C81950 60C7F4A0 60C7F4A0 61B1EC14 61B1EE8C
> 61B1FE30 61B3D754 61B396AC 61B39860 61B3A528 61B3B7F8
> 61B30DF4 61B3071C 61B3041C 61B33804
> 
> Conditions  This symptom is observed on a Cisco 3600 series router.
> 
> Workaround  There is no workaround.
> 
> 
> I had reverted the route-cache cef changes post-boot Wednesday, but
> there were still a few hundred PPPoE sessions using the vtemplate
> sub-interface setup from before I could get in and change the
> Virtual-Templates.  I probably should have kicked them off so
> they could get the non-subif setup.
> 
> This was the shortest time ever between hangs, with the fewest users
> running subif, i.e. CEF-enabled, configurations.  This scares me.
> 
> Does anyone know what circumstances tickle the problem or if
> CSCdx87590 is likely to affect a 7204VXR NPE-G1 running 12.2 SB?  The
> Bug Toolkit doesn't seem to think CSCdx87590 affects anything but
> 12.2T. 

Can you open a TAC case to have them take a look at this?
Can you please send the "show ver" output so the tracebacks can be decoded?

We could collect more information next time if you enable "Break
has effect" in the config register (set it to 0x2002) and reload the box
for the change to take effect. Once the router hangs again, issue a
break from the console and type "k 50" at the rommon prompt to print the
current stack trace. Then type "c" to continue (the router will hang
again), and repeat this a couple of times before resetting the router.
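
For anyone wondering why 0x2002 rather than the current 0x2102: the two
values differ only in bit 8 (0x0100), the "Break disabled" bit of the
configuration register. The small Python sketch below only illustrates
that arithmetic. The constant and function names are made up for this
example and are not part of any Cisco tool.

```python
# Illustrative only: names below are hypothetical, chosen for this sketch.
# Bit 8 (0x0100) of the configuration register is "Break disabled";
# bits 0-3 are the boot field.
BREAK_DISABLED = 0x0100
BOOT_FIELD_MASK = 0x000F


def break_enabled(confreg: int) -> bool:
    """Return True if a console break is honoured after boot."""
    return (confreg & BREAK_DISABLED) == 0


def enable_break(confreg: int) -> int:
    """Clear the 'Break disabled' bit and leave every other bit alone."""
    return confreg & ~BREAK_DISABLED


current = 0x2102          # value reported in the "show ver" output above
proposed = enable_break(current)

print(f"current  0x{current:04X}: break enabled = {break_enabled(current)}")
print(f"proposed 0x{proposed:04X}: break enabled = {break_enabled(proposed)}")
```

The boot field (the low four bits) is the same in both values, so the
router still loads its image the same way; only the handling of a console
break changes.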

	oli