[c-nsp] 7206vxr/npe-400 weird crash

Alastair Johnson alastair.johnson at maxnet.co.nz
Wed Nov 3 23:04:39 EST 2004


Hello,

Background: 7206VXR, NPE-400, IOS 12.0(26)S2, 256MB DRAM, I/O-2FE
and a PA-A3-OC3-MM in slot 2.

We were experimenting with using traffic shaping with bgp-defined
QoS policy in order to rate shape colocated customers for different
levels of bandwidth depending on which peer they were going to.

The service-policy was defined on the output of the VLAN facing
our core, which was carrying approx. 40M of traffic in each direction.
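For context, the setup was along these lines — a rough sketch only; the
class names, community list, qos-group values, and rates here are
illustrative, not our exact config:

```
! QPPB: BGP tags routes with a qos-group via a table-map (values illustrative)
route-map SET-QOS-GROUP permit 10
 match community PEER-A
 set ip qos-group 1
!
router bgp 65000
 table-map SET-QOS-GROUP
!
! Classify on the qos-group and shape on the core-facing VLAN output
class-map match-all PEER-A-TRAFFIC
 match qos-group 1
!
policy-map SHAPE-COLO
 class PEER-A-TRAFFIC
  shape average 2000000
!
interface FastEthernet0/0.100
 description core-facing VLAN (hypothetical interface)
 service-policy output SHAPE-COLO
```

(QPPB also needs `bgp-policy ... ip-qos-map` on the ingress interfaces so
packets actually get marked with the qos-group.)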

During testing, it worked as we had hoped.

After about 15 minutes, the router locked up and began logging
memory allocation failures for the OSPF process, among others.  It
refused to forward packets on the ATM interface and the customer-facing
interfaces, but the BGP sessions with the core did stay up.

The odd thing about all this was that the router had plenty
of processor memory free throughout the exercise.  This is
a typical output of the router (not during the event, but afterwards):

ar01.akl1#show mem
               Head     Total(b)     Used(b)     Free(b)   Lowest(b)  Largest(b)
Processor   61818480   226392960   138801136    87591824    78170172    78143232
      I/O    F000000    16777300     4432156    12345144    12345144    12342268

During the event itself, the I/O memory had only around 200 bytes free.

After trying to clear this, we decided a reload was the only way
to restore normal service to the router.

During reload, the router started barfing 'Unknown device in slot 2'
errors during both bootloader and IOS initialization.  The router
would then crash, and reload.

If I pulled the PA, it would boot OK.  OIRing the PA after it had
booted caused a crash.  I also tried slot 1, with the same result.

Power cycled the device (turned it off for 30 seconds), and it
booted up fine with the PA in slot 2.

Crashinfos during OIR show:

*Nov  4 15:38:54.435 NZDT: %OIR-6-INSCARD: Card inserted in slot 2, interfaces administratively shut down
%ERR-1-GT64120 (PCI-1): Fatal error, PCI Master abort
 GT=0xB4000000, cause=0x00000400, mask=0x00D01D00, real_cause=0x00000400
 bus_err_high=0x00000000, bus_err_low=0x00000000, addr_decode_err=0x00000470

etc.

I don't think it was a faulty PA, as we swapped it for a spare
PA-A3-OC3-MM we had with exactly the same result.  Only after shutting
the box off for a full 30 seconds did it actually want to reboot.


The question:  What would cause the box to run out of I/O memory and
crash?  Would this be related to the QoS policy?  Is it safe for
us to test again?

We are going to be replacing this router with a 7507 shortly.  Would
it be a better choice for this sort of QoS?

What would cause the device to refuse to recognize the PA until it
had been powered off for a full 30 seconds?

If anyone wants the full crashinfo files, let me know.

If anyone could provide insight, that'd be great!

thanks

aj

-- 
Network Operations		||	noc. +64.9.915.1825
Maxnet				||	cell. +64.21.639.706
