[c-nsp] %IPC-SPSTBY-5-WATERMARK errors on dual-sup 6500 & SXI

Thu Apr 30 12:32:01 EDT 2009

All,

We have a chassis with 2x sup720-3B running SXI that, for the second 
time, appears to have "lost" the standby SUP with the above error messages.

The first time, the pattern was:

Mar 17 17:24:37.378 GMT: %XDR-6-XDRIPCNOTIFY: Message not sent to slot 6/0 (6) because of IPC error timeout. Disabling linecard. (Expected during linecard OIR or system reloads)
Mar 17 17:24:42.826 GMT: %XDR-SPSTBY-3-XDRNOMEM: XDR failed to allocate memory during ipcQ chunks creation.
-Traceback= 40252F70 4025350C 40932AB8 40DD8E9C 40426BA8 40427068 40427534 40427E38 40428608 40F465F4 40F3699C 40F36BB8 416E175C

...we did not notice these, but then a few days later the router began
logging:

Mar 21 07:17:51.798 GMT: %IPC-SPSTBY-5-WATERMARK: 1600 messages pending in rcv for the port Card6/0:Request(2060000.7) seat 2060000
Mar 21 07:18:21.967 GMT: %IPC-SPSTBY-5-WATERMARK: 1600 messages pending in rcv for the port Card6/0:Request(2060000.7) seat 2060000
Mar 21 07:18:52.126 GMT: %IPC-SPSTBY-5-WATERMARK: 1600 messages pending in rcv for the port Card6/0:Request(2060000.7) seat 2060000

...with the number of IPC messages rising, basically forever.
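
For anyone wanting to check whether the pending count in these WATERMARK
messages is really climbing, here is a minimal Python sketch that pulls the
counts out of a saved log. The function names (watermark_counts, is_growing)
are made up for illustration, and the regex is an assumption based only on
the message layout shown in this post; it tolerates lines wrapped by a mail
client.

```python
import re

# Pattern assumed from the log excerpts in this post, not from Cisco docs.
# \s+ is used between words so that mail-wrapped messages still match.
WATERMARK_RE = re.compile(
    r"(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}\.\d+)\s+\w+:\s+"
    r"%IPC-(?:\w+-)?5-WATERMARK:\s+(?P<count>\d+)\s+messages\s+pending\s+in\s+"
    r"(?P<queue>\w+)\s+for\s+the\s+port\s+(?P<port>\S+)"
)


def watermark_counts(log_text):
    """Return (timestamp, port, pending_count) tuples in log order."""
    return [
        (m.group("ts"), m.group("port"), int(m.group("count")))
        for m in WATERMARK_RE.finditer(log_text)
    ]


def is_growing(counts):
    """True if the counts never decrease and end higher than they started."""
    values = [c for _, _, c in counts]
    return (
        len(values) > 1
        and values[-1] > values[0]
        and all(a <= b for a, b in zip(values, values[1:]))
    )


# Two of the messages quoted above, wrapped as they appeared in the mail.
excerpt = """\
Mar 21 07:17:51.798 GMT: %IPC-SPSTBY-5-WATERMARK: 1600 messages pending
in rcv for the
port Card6/0:Request(2060000.7) seat 2060000
Mar 21 07:18:21.967 GMT: %IPC-SPSTBY-5-WATERMARK: 1600 messages pending
in rcv for the
port Card6/0:Request(2060000.7) seat 2060000
"""

for ts, port, count in watermark_counts(excerpt):
    print(ts, port, count)
```

Running watermark_counts over the full log and passing the result to
is_growing gives a quick yes/no on whether the receive queue is still
backing up.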

TAC's advice basically amounted to re-seating the card, failing over to 
the standby sup to see whether the sup or the software was at fault 
(yikes...), swapping the sups between slots, and so forth. I re-seated 
the sup and it seemed stable, until a few days ago:

Apr 21 01:26:18.815 BST: %RPC-SPSTBY-2-FAILED_USERHANDLE: Failed to send RPC request online_diag_sp_request:get_rp_cpu_info
-Traceback= 40252F70 4025350C 40B43D3C 410D8528 410FCEF8 4109B750 4109C550 4109D140 4109AAD0 4109A8E4 4088E6C0 4088E6AC

...then...

Apr 24 08:18:46.367 BST: %IPC-SPSTBY-5-WATERMARK: 1600 messages pending in rcv for the port Card6/0:Request(2060000.7) seat 2060000

...again, rising forever.

I'm going to re-open the TAC case and see what they say, but I was 
wondering if anyone has come across this. There are some 
similar-sounding messages in the SXI release notes, but we've got other 
identically-configured boxes that don't display these symptoms, so I 
suspect a hardware fault (which would be ironic: this sup came from 
Cisco in response to an RMA...)