[c-nsp] per-LSP packet loss / FIB corruption?

Thu Jul 16 05:23:01 EDT 2009

All,

We had a very odd problem yesterday.

Our network (all 6500/sup720 running 12.2(33)SXI) runs MPLS layer 3 
VPNs for network segmentation, and there seemed to be packet loss 
between subnets on a pair of routers. Other subnets on those routers, in 
different VPNs, seemed fine.

The relevant topology is:

siteX --(10gig)-- coreB ==(2x10gig)== coreA --(10gig)-- datacentre

coreA and coreB are similarly configured, with a (fairly recently 
commissioned) 6716 in slot 1, and a 6704 in slot 2. The port channel 
between them has one member on the 6704, one on the 6716. The link from 
coreA -> datacentre is on the 6716, as are our firewall and some other 
intra-core links.

The loss was on packets going datacentre->siteX, and appeared to be 
"inside" coreA - according to a SPAN session (on coreA itself), 15 
packets would arrive at coreA, but only 13 would leave (for example). 
This was pretty consistent, though reports indicate the loss may have 
been higher earlier.
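
To put a number on the SPAN figures above, here is a minimal sketch of 
the arithmetic. The function name loss_fraction is just illustrative, 
and the 15/13 counts are the example values quoted above, not a 
measured average.

```python
def loss_fraction(packets_in, packets_out):
    """Return the fraction of packets that arrived but did not leave."""
    if packets_in <= 0:
        raise ValueError("packets_in must be positive")
    if packets_out < 0 or packets_out > packets_in:
        raise ValueError("packets_out must be between 0 and packets_in")
    return (packets_in - packets_out) / packets_in


# Example counts from the SPAN session: 15 packets in, 13 packets out.
example = loss_fraction(15, 13)
print(f"{example:.1%}")  # prints 13.3%
```

So the example corresponds to roughly 13% loss on the affected LSP.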

Other traffic from datacentre -> siteX, on different LSPs (i.e. in 
different VPNs) was fine, as far as we could tell.

While investigating, we shut down various links elsewhere in the 
network, and this seemed to "move" the problem around - it would 
manifest on other LSPs, and the original one would start working. 
Throughout, though, the problem seemed to be confined to coreA.
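
One way to picture how a topology change could "move" loss between 
LSPs is the toy model below. It is purely a sketch under assumptions: 
the CRC-based hash, the path names and the label range are all 
hypothetical, and it is not a description of how the sup720 actually 
selects internal paths. It only shows that if LSPs are spread over a set 
of paths and one path is bad, changing the set of available paths can 
change which LSPs land on the bad one.

```python
import zlib


def pick_path(lsp_label, paths):
    """Toy selection: hash the label and index into the available paths."""
    key = zlib.crc32(lsp_label.to_bytes(4, "big"))
    return paths[key % len(paths)]


def affected_lsps(labels, paths, bad_path):
    """Return the labels that this toy model maps onto the bad path."""
    return sorted(l for l in labels if pick_path(l, paths) == bad_path)


# Hypothetical labels and paths, for illustration only.
labels = range(16, 32)
all_paths = ["path0", "path1", "path2"]

# "path1" stands in for whatever component is dropping packets.
before = affected_lsps(labels, all_paths, "path1")
# Removing a path (e.g. shutting a link) changes the mapping.
after = affected_lsps(labels, ["path0", "path1"], "path1")

print("affected before:", before)
print("affected after: ", after)
```

In this model the set of affected labels depends on how many paths are 
available, which is at least consistent with the problem appearing to 
move when links elsewhere were shut down.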

The loss persisted if we shut down alternate members of the coreA -> 
coreB port-channel.

There appear to be no physical layer errors anywhere.

Given that coreA is definitely dropping packets, I'm inclined to think 
the problem lies there - but the question is, what might it be? I first 
considered FIB corruption, but it's hard to see how that can give the 
symptoms.

In the end we power-cycled the 6716 linecard, on the rationale that it 
was "new", and that seemed to solve the problem. But since it caused a 
routing change, it may of course just have "moved" the problem around 
again, to LSPs carrying little traffic.

The 6716 passed a full set of GOLD diagnostics when it was delivered, so 
I'm reluctant to believe it's faulty.

Could it be the slot? If so, why would it manifest only on a single LSP, 
or a small number of LSPs?

Baffling...