[c-nsp] Nexus 7k with F2e line cards and egress queuing

Andrey Kostin ankost at podolsk.ru
Mon Dec 23 10:11:32 EST 2019


Hi James,

All these are valid and well-known points, but in my case traffic was 
dropped very selectively, affecting only certain traffic patterns.
Traffic coming from a Facebook cache appliance became completely 
unusable for scrolling Instagram, while Facebook itself, served from 
the same cache, worked fine. Other bulk traffic sources taking the 
same path, such as YouTube and Netflix, weren't affected. I spent a 
lot of time analysing traffic captures taken from every port on the 
path and narrowed it down to this Nexus. For the record, the drops 
happen not at the egress port but in the VOQs. Troubleshooting it is a 
nightmare, because it requires manually mapping several internal 
variables to track down the queue for a particular egress port. And 
after doing all that dull work, providing it to TAC and receiving a 
confirmation from them, I've heard only silence from them for months. 
A failure from any point of view.

Kind regards,
Andrey Kostin

James Bensley wrote on 2019-12-19 16:22:
> On Sat, 14 Dec 2019 at 16:18, Curtis Piehler <cpiehler2 at gmail.com> wrote:
>> Has anyone had egress VQ congestion issues on the Nexus 7k using F2e
>> line cards causing input discards?  There has been an intentional
>> influx of traffic over the past few months to these units (primarily
>> VoIP traffic), i.e. SBCs and such.  These SBCs are mostly 1G
>> interfaces with a 10G uplink to the core router.  At some point of
>> traffic shift the switch interfaces facing the SBC accrue egress VQ
>> congestion, and input discards start dropping packets into the
>> switches from the core router uplinks.
>> 
>> We have opened a Cisco TAC ticket and they go through the whole
>> thing about the Nexus design and dropping packets on ingress if the
>> destination port is congested, etc., and I get all that.  They also
>> say going from a 10G uplink to a 1G downlink is not appropriate,
>> however those downstream devices are not capable of 10G.  The amount
>> of traffic influx isn't that much (you're talking 20-30M max of
>> VoIP).  We have removed Cisco FabricPath from all VDCs and even
>> upgraded our code from 6.2.16 to 6.2.20a on the SUP-2E supervisors.
>> 
>> I understand the N7K-F248XP-23E/25E have 288KB/port and 256KB/SoC,
>> and I would think these would be more than sufficient.  I know the
>> F3248XP-23/25 have 295KB/port and 512KB/SoC, however I can't see the
>> need to spend 7x the amount on line cards when these should be able
>> to handle this traffic.
> 
> Hi Curtis,
> 
> I haven't touched Nexus 7Ks in a few years and only worked with F2e
> cards, no others, so I'm no Nexus expert...
> 
> However, even at low traffic volumes it is possible to fill the
> egress interface buffers and experience packet loss. Seeing as you
> have voice traffic flying around, does this mean you also use QoS
> (even if just the default settings)?
> 
> Assuming the media streams are in an LLQ coming over the 10G link
> into the 7K, with the egress link at 1G and a 288KB per-port buffer
> as you say: that buffer holds about 0.2ms of traffic at 10Gbps, or
> about 2ms at 1Gbps.
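
To make that arithmetic concrete, here is a minimal Python sketch (the
288KB figure comes from Curtis's message; treating 1KB as 1024 bytes is
an assumption):

    # Drain time of a 288KB per-port buffer at each link speed.
    buffer_bits = 288 * 1024 * 8  # 288 KB expressed in bits
    for rate_bps in (10e9, 1e9):
        drain_ms = buffer_bits / rate_bps * 1000
        print(f"{rate_bps / 1e9:.0f}G drains 288KB in {drain_ms:.2f} ms")
    # 10G drains 288KB in 0.24 ms
    # 1G drains 288KB in 2.36 ms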
> 
> Suppose some signalling traffic is heading into your core router
> (someone setting up a new call, for example), and at the same time
> some RTP traffic arrives at the core router from somewhere else
> (e.g. media from an established call), with both flows destined to
> the same SBC attached to the 7K. The media traffic will be dequeued
> over the 10G link towards the 7K first, because it's in an LLQ, while
> the signalling traffic sits buffered for a millisecond or two. Then a
> short burst of signalling traffic is dequeued over the 10G link to
> the 7K at a rate of 10Gbps, and the 7K won't be able to dequeue that
> traffic out of the 1G link towards the SBC as fast as it comes in
> from the core router.
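
A simple fill model shows how quickly this goes wrong. The following is
a sketch under the assumption of a single line-rate burst into an
otherwise idle 1G port with the 288KB buffer from above:

    # Buffer fill model: burst arrives at 10Gb/s, port drains at 1Gb/s.
    in_bps, out_bps = 10e9, 1e9
    buffer_bits = 288 * 1024 * 8            # 288 KB per-port buffer
    fill_bps = in_bps - out_bps             # net fill rate: 9 Gb/s
    t_overflow = buffer_bits / fill_bps     # seconds until the buffer is full
    burst_kb = in_bps * t_overflow / 8 / 1024  # burst size that just fits
    print(f"overflow after {t_overflow * 1e6:.0f} us "
          f"(~{burst_kb:.0f} KB burst at line rate)")
    # overflow after 262 us (~320 KB burst at line rate)

So any single line-rate burst much over ~300KB overflows the port
buffer, regardless of how low the one-second average rate is.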
> 
> The volume of traffic doesn't need to be anywhere near 1Gbps; as long
> as a burst is larger than 288KB, it will be coming into the 7K faster
> than the 7K can dequeue it towards the SBC (so if the burst from core
> router to 7K lasts longer than ~0.2ms at 10Gbps, it will exceed the
> ~2ms of buffer on your 1Gbps port). If possible, run a packet capture
> on the 10G link and look at the traffic coming into the 7K when you
> have 1G egress congestion / 10G input discards. Even if you see an
> average output rate of 30Mbps over a 1 second sampling period,
> meaning the 10G link is idle for 99.7% of that second, a single 0.2ms
> stretch where the 10G link runs at full line rate will give you
> packet loss.
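
One way to spot such microbursts is to bin the capture's packet
timestamps into 0.2ms windows and flag any window that approaches line
rate. A rough sketch using the third-party dpkt library (the filename
and the 90% threshold are assumptions):

    import dpkt

    WINDOW = 0.0002        # 0.2 ms bins, matching the buffer drain time
    LINE_RATE = 10e9       # link speed in bit/s

    bins = {}
    with open("uplink.pcap", "rb") as f:   # capture taken on the 10G link
        for ts, frame in dpkt.pcap.Reader(f):
            key = int(ts / WINDOW)
            bins[key] = bins.get(key, 0) + len(frame) * 8

    for key in sorted(bins):
        util = bins[key] / (LINE_RATE * WINDOW)
        if util > 0.9:                     # window ran at ~line rate
            print(f"t={key * WINDOW:.4f}s burst at {util:.0%} of line rate")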
> 
> If you don't explicitly have QoS configured, check what the default
> settings are doing and whether you can increase the per-port
> buffers. It's been a while since I touched a 7K, but I do recall that
> the default settings were not great.
> 
> Cheers,
> James.
> _______________________________________________
> cisco-nsp mailing list  cisco-nsp at puck.nether.net
> https://puck.nether.net/mailman/listinfo/cisco-nsp
> archive at http://puck.nether.net/pipermail/cisco-nsp/


