[f-nsp] MLX throughput issues

Wilbur Smith wsmith at brocade.com
Fri Feb 20 22:09:01 EST 2015


Hello All,

Sorry to reply late, but it seems like you were hitting the buffer limit for a port domain (group of ports). I don’t have an FLS in front of me (flying ATM) so I can’t confirm, but I think we’re breaking up the buffer space into reserved segments for each port group. The reasoning behind this is that it keeps “slow drain” devices on a single interface from using up all available buffer space for the switch. The downside is that if a port exhausts its allotted buffers, it can cause slowdowns.

Over the years we’ve gone back and forth over whether it’s better to ship with shared buffers enabled; I think it would generate the same number of TAC requests no matter what we do. Although the FLS isn’t as beefy as the FCX or ICX, it should still have some knobs you can turn to increase performance. This should be in the config guide.

I’d try to narrow down which device or devices are causing buffer pressure on the switch and consider enabling ethernet pause frames (flow control) on the switch and neighboring devices. There are also different QoS settings that can switch from strict queues to weighted round-robin (and other types) to help make better use of the buffers on the uplink ports.
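A rough sketch of what I mean below; I’m going from memory (flying, as noted), so the port number is only an example and the “qos mechanism” command name in particular should be checked against the config guide for your FastIron release:

    show interfaces ethernet 0/1/1   ! look for output drops on the suspect server-facing port
    configure terminal
     interface ethernet 0/1/1
      flow-control                   ! enable 802.3x pause frames (the neighbor must honor them too)
      exit
     qos mechanism weighted          ! egress scheduling: strict priority -> weighted round-robin (command name from memory)
     write memory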

Sorry you’re running into this. The FLS is a very good campus access switch platform (good latency and minimal oversubscription, for a good cost), but my view is that it’s not the best switch to front-end server connections or heavy I/O. Others may disagree with me on this though.

Wilbur

From: "nethub at gmail.com<mailto:nethub at gmail.com>"
Date: Friday, February 13, 2015 at 4:13 PM
To: Brad Fleming
Cc: 'Jeroen Wunnink | Hibernia Networks', foundry-nsp at puck.nether.net
Subject: Re: [f-nsp] MLX throughput issues

We already tried a full system reboot last night and it didn’t seem to help.  I’ll definitely keep your switch fabric reboot procedure in mind in case we run into that in the future.

I think we may have figured out at least a short-term solution.  On the FLS648, we ran the command “buffer-sharing-full” and immediately we were able to get better speeds.  It seems as though the FLS648’s buffers may have been causing the issue.  We’ll continue to monitor over the next few days and see if this actually solves the issue.
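For reference, it’s a single config-level command; roughly this (the prompts are illustrative, and the write memory step is just the usual save):

    FLS648# configure terminal
    FLS648(config)# buffer-sharing-full   ! let all ports draw from the shared buffer pool instead of per-port-group reservations
    FLS648(config)# write memory          ! save so it persists across a reload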

Thanks everyone for your feedback thus far.



From: Brad Fleming [mailto:bdflemin at gmail.com]
Sent: Friday, February 13, 2015 4:24 PM
To: nethub at gmail.com
Cc: Jeroen Wunnink | Hibernia Networks; foundry-nsp at puck.nether.net
Subject: Re: [f-nsp] MLX throughput issues

Over the years we’ve seen odd issues where one of the switch-fabric links will “wig out” and some of the data moving between cards gets corrupted. When this happens we power cycle each switch fabric module (SFM) one at a time using this process (a rough CLI sketch follows the list):
1) Shut down SFM #3
2) Wait 1 minute
3) Power SFM #3 on again
4) Verify all SFM links to SFM #3 are up
5) Wait 1 minute
6) Perform steps 1-5 for SFM #2
7) Perform steps 1-5 for SFM #1
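One iteration of that at the CLI looks roughly like the following; the command names (“power-off snm”, “power-on snm”, “show sfm-links”) are from memory, so verify them in the NetIron command reference for your code train before trying this on a production box:

    MLX# power-off snm 3        ! take switch fabric module 3 offline
    MLX# power-on snm 3         ! after about a minute, power it back on
    MLX# show sfm-links 3       ! confirm every link to SFM 3 is up before moving to the next SFM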

Not sure you’re seeing the same issue that we see, but the “SFM Dance” (as we call it) is a once-every-four-months thing somewhere across our 16 XMR4000 boxes. It can be done with little to no impact if you are patient and verify status before moving to the next SFM.


On Feb 13, 2015, at 11:41 AM, nethub at gmail.com wrote:

We have three switch fabrics installed; all are under 1% utilization.


From: Jeroen Wunnink | Hibernia Networks [mailto:jeroen.wunnink at atrato.com]
Sent: Friday, February 13, 2015 12:27 PM
To: nethub at gmail.com; 'Jeroen Wunnink | Hibernia Networks'
Subject: Re: [f-nsp] MLX throughput issues

How many switch fabrics do you have in that MLX, and how high is the utilization on them?

On 13/02/15 18:12, nethub at gmail.com wrote:
We also tested with a spare Quanta LB4M we have and are seeing about the same speeds as with the FLS648 (around 20MB/s, or 160Mbps).

I also reduced the number of routes we are accepting down to about 189K and that did not make a difference.


From: foundry-nsp [mailto:foundry-nsp-bounces at puck.nether.net] On Behalf Of Jeroen Wunnink | Hibernia Networks
Sent: Friday, February 13, 2015 3:35 AM
To: foundry-nsp at puck.nether.net
Subject: Re: [f-nsp] MLX throughput issues

The FLS switches do something weird with packets. I've noticed they somehow interfere with changing the MSS window size dynamically, resulting in destinations further away having very poor speed results compared to destinations close by.

We got rid of those a while ago.


On 12/02/15 17:37, nethub at gmail.com wrote:
We are having a strange issue on our MLX running code 5.6.00c.  We are encountering some throughput issues that seem to be randomly impacting specific networks.

We use the MLX to handle both external BGP and internal VLAN routing.  Each FLS648 is used for Layer 2 VLANs only.

From a server connected by a 1 Gbps uplink to a Foundry FLS648 switch, which is in turn connected to the MLX on a 10 Gbps port, a speed test to an external network gets 20MB/s.

Connecting the same server directly to the MLX is getting 70MB/s.

Connecting the same server to one of my customer's Juniper EX3200 (which BGP peers with the MLX) also gets 70MB/s.

Testing to another external network, all three scenarios get 110MB/s.

The path to both test network locations goes through the same IP transit provider.
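For anyone trying to reproduce the comparison, a single-stream TCP test with something like iperf from the server is one way to get equivalent numbers (the hostname below is only a placeholder):

    iperf -c test-host.example.net -t 30   # single TCP stream for 30 seconds; 20MB/s is roughly 160Mbps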

We are running an NI-MLX-MR with 2GB RAM; an NI-MLX-10Gx4 connects to the Foundry FLS648 by XFP-10G-LR, and an NI-MLX-1Gx20-GC was used for directly connecting the server.  A separate NI-MLX-10Gx4 connects to our upstream BGP providers.  The customer’s Juniper EX3200 connects to the same NI-MLX-10Gx4 as the FLS648.  We take default routes plus full tables from three providers by BGP, but filter out most of the routes.

The fiber and optics on everything look fine.  CPU usage is less than 10% on the MLX and all line cards, and about 1% on the FLS648.  The ARP table on the MLX has about 12K entries, and the BGP table has about 308K routes.

Any assistance would be appreciated.  I suspect there is a setting that we’re missing on the MLX that is causing this issue.












--
Jeroen Wunnink
IP NOC Manager - Hibernia Networks
Main numbers (Ext: 1011): USA +1.908.516.4200 | UK +44.1704.322.300
Netherlands +31.208.200.622 | 24/7 IP NOC Phone: +31.20.82.00.623
jeroen.wunnink at hibernianetworks.com
www.hibernianetworks.com





_______________________________________________
foundry-nsp mailing list
foundry-nsp at puck.nether.net
http://puck.nether.net/mailman/listinfo/foundry-nsp


