[c-nsp] C6k fabric drops
Peter Rathlev
peter at rathlev.dk
Wed Apr 19 05:46:09 EDT 2017
Hi all,
Sorry for this long tedious post.
I'm investigating fabric drops on a C6k switch. I see data that
confuses me, and my hope is that someone else has gone through
similar troubleshooting with more success. :-)
It's a C6509-V-E chassis with a single Sup2T (non-XL). Fabric drops are
occurring even though the fabric utilization doesn't seem all that high,
though measuring this precisely is difficult.
I have the following modules:
Mod Ports Card Type                              Model
--- ----- -------------------------------------- ------------------
  1    48 CEF720 48 port 10/100/1000mb Ethernet  WS-X6748-GE-TX
  2    48 CEF720 48 port 10/100/1000mb Ethernet  WS-X6748-GE-TX
  4    24 CEF720 24 port 1000mb SFP              WS-X6724-SFP
  5     5 Supervisor Engine 2T 10GE w/ CTS (Acti VS-SUP2T-10G
  8    20 DCEF2T 4 port 40GE / 16 port 10GE      C6800-16P10G
  9    20 DCEF2T 4 port 40GE / 16 port 10GE      WS-X6904-40G
I see fabric drops inbound from module 8 on both channels. We're RRD-
graphing these, and the last ~18 hours of utilization/drops look like this:
https://ampere.rathlev.dk/C6k-fabric-drops-20170419.png
The graph inputs come from scraping "show fabric utilization" (stored
as GAUGE) and "show fabric drop" (stored as DERIVE).
The utilization is difficult to graph since "show fabric utilization"
returns the instantaneous utilization at the moment of invocation, with
no averaging.
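In case it helps to see what the scraping amounts to, here is a rough
sketch of the poller. This is illustrative only: the regexes assume the
usual column layout of the two show commands and the RRD file names are
placeholders, not our actual script.

#!/usr/bin/env python
"""Simplified sketch of the fabric poller: utilization percentages go
into GAUGE data sources, drop counters into DERIVE so graphs show drops/s."""
import re
import subprocess

# Assumed column layout: "slot  channel  speed  Ingress%  Egress%"
UTIL_RE = re.compile(r'^\s*(\d+)\s+(\d+)\s+\S+\s+(\d+)\s+(\d+)\s*$')
# Assumed column layout: "slot  channel  Low-Q-drops  High-Q-drops"
DROP_RE = re.compile(r'^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$')


def parse_table(output, regex):
    """Return {(slot, channel): (col3, col4)} for lines matching the regex."""
    rows = {}
    for line in output.splitlines():
        m = regex.match(line)
        if m:
            slot, chan, a, b = (int(x) for x in m.groups())
            rows[(slot, chan)] = (a, b)
    return rows


def update_rrd(rrd_file, values):
    """Feed one sample to an existing RRD ("rrdtool create" done elsewhere)."""
    subprocess.check_call(['rrdtool', 'update', rrd_file,
                           'N:' + ':'.join(str(v) for v in values)])


# util_out / drop_out would come from scraping the switch CLI, e.g.:
# util = parse_table(util_out, UTIL_RE)    -> GAUGE RRD per slot/channel
# drops = parse_table(drop_out, DROP_RE)   -> DERIVE RRD per slot/channel
# update_rrd('fabric-util-slot8-ch0.rrd', util[(8, 0)])
# update_rrd('fabric-drop-slot8-ch0.rrd', drops[(8, 0)])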
Looking at "show fabric utilization detail" I can see that the peak load
(not recent) for module 8 channel 0/1 was 23%/59% ingress and 25%/39%
egress, so the fabric utilization should have stayed below these numbers
throughout the graphed interval. But I still see drops. (All drops are
Low-Q-drops by the way; all High-Q-drop counters are 0.)
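Part of what worries me about those numbers: if the utilization figures
are sampled or averaged, a short burst can disappear entirely. Just
arithmetic with made-up numbers, not measured data:

# Why averaged/sampled utilization can hide bursts: a 100 ms burst at
# 100% of the channel inside a 15 s polling window averages out to well
# under 1%, even though the channel was briefly full.
burst_seconds = 0.1     # length of the hypothetical burst
poll_interval = 15.0    # hypothetical polling/averaging window
burst_util = 1.0        # channel 100% full during the burst

average_util = burst_util * burst_seconds / poll_interval
print("Average over the interval: %.2f%%" % (average_util * 100))  # ~0.67%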
The drops (and slightly higher utilization) coincide with load on a
specific interface on module 8:
https://ampere.rathlev.dk/C6k-C6800-16p-interface-load-pps-20170419.png
https://ampere.rathlev.dk/C6k-C6800-16p-interface-load-bps-20170419.png
This traffic is MTU-sized (1500B) IBM TSM based backup traffic. It
flows to/from an interface on module 9, hence the higher utilization on
modules 8 and 9. (Most of the traffic is between two VLANs, which makes
the graph more symmetric.)
The fabric channels of both modules are 2 x 40G.
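For what it's worth, at 1500-byte frames even a full 40G channel is only
a few Mpps, which is part of why I doubt (but don't know) that a PPS
limit is in play. Back-of-the-envelope, ignoring whatever per-frame
fabric header overhead exists:

# Rough packets/second figures for 1500-byte frames on one 40G fabric
# channel (ignores any per-frame fabric/encapsulation overhead).
FRAME_BITS = 1500 * 8        # 12,000 bits per frame
CHANNEL_BPS = 40e9           # one fabric channel

full_rate_pps = CHANNEL_BPS / FRAME_BITS      # ~3.33 Mpps at 100% load
peak_pps = 0.59 * full_rate_pps               # ~1.97 Mpps at the 59% peak
print("%.2f Mpps at 100%%, %.2f Mpps at 59%%"
      % (full_rate_pps / 1e6, peak_pps / 1e6))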
Furthermore: if I start a VLAN-based local SPAN session, the number of
drops rises significantly. The rise in drops seems to follow the SPAN
session load, but even a few hundred Mbps of SPAN traffic leads to drops.
(A port-based local SPAN session does not lead to noticeable drops.) This
makes me curious, but SPAN being what it is, I'm not hoping for an explanation.
So my questions are something like:
1) Can I trust the "peak utilization" numbers? Or are they also
   based on sampling and thus not able to see bursts?
2) Are there PPS limits on the fabric channels? Or can I trust
   that the fabric can transport whatever I throw at it below
   BPS capacity? (This is 1500B packets, so it shouldn't be relevant.)
3) Is there any way to know more about why packets are dropped?
   I'm guessing it's similar to a "regular" interface dropping
   packets, e.g. lack of buffer space during high load. But with
   such moderate utilization?
4) Can anyone point to a detailed whitepaper for the C6800-16P10G
   module, similar to this one for the 6904 module:
   http://www.cisco.com/c/en/us/products/collateral/switches/catalyst-6500-series-switches/white_paper_c11-696669.pdf
   Shortened URL: "http://goo.gl/s3Ndhg". Something that shows
   internal bottlenecks and stuff like that.
5) What could I do to alleviate the problem, apart from swapping
   interfaces so heavy traffic stays on the same module? (This
   can be a bit of a puzzle.)
The "right" way to solve this is probably to replace the C6k with a
Nexus 7000/7700, but I'd like to at least understand the problem better
first.
Any comments/help are much appreciated. :-)
--
Peter