[f-nsp] MLX throughput issues

Jeroen Wunnink | Hibernia Networks jeroen.wunnink at atrato-ip.com
Wed Feb 18 11:32:00 EST 2015


Try physically pulling the SFMs one by one rather than just 
power-cycling them. Yes, there is a difference there :-)

Also, 5.4.00d is fairly buggy; there were some major issues with CRC 
checksums in hSFMs and on 100G cards. We've had fairly good results with 
5.5.00e.



On 18/02/15 16:50, Brad Fleming wrote:
> TAC replaced hSFMs and line cards the first couple of times, but we’ve 
> seen this issue at least once on every node in the network. The ones 
> where we replaced every module (SFM, mgmt, port cards, even PSUs) have 
> still had at least one event. So I’m not even sure what hardware we’d 
> replace at this point. That led us to suspect a config problem, since 
> each box uses the same template, but after a lengthy audit with TAC 
> nobody could find anything. It happens infrequently enough that we 
> grew to just live with it.
>
>
>
>> On Feb 18, 2015, at 12:45 AM, Frank Bulk <frnkblk at iname.com> wrote:
>>
>> So don’t errors like this suggest replacing the hardware?
>> Frank
>> *From:* foundry-nsp [mailto:foundry-nsp-bounces at puck.nether.net] *On Behalf Of* Brad Fleming
>> *Sent:* Tuesday, February 17, 2015 3:10 PM
>> *To:* Josh Galvez
>> *Cc:* foundry-nsp at puck.nether.net
>> *Subject:*Re: [f-nsp] MLX throughput issues
>> The common symptoms for us are alarms of TM errors / resets. We’ve 
>> been told on multiple TAC cases that logs indicating transmit TM 
>> errors are likely caused by problems in one of the SFM links / lanes. 
>> We’ve been told that resetting the SFMs one at a time will clear the 
>> issue.
>> Symptoms during the issue are that 1/3rd of the traffic moving from 
>> one TM to another TM simply gets dropped. So we see TCP globally 
>> start to throttle like crazy, and if enough errors count up the TM 
>> will simply reset. After the TM reset it seems a 50/50 chance the box 
>> will remain stable or go back to dropping packets within ~20 minutes. So 
>> when we see a TM reset we simply do the SFM Dance no matter what.
>>> On Feb 16, 2015, at 10:08 PM, Josh Galvez <josh at zevlag.com> wrote:
>>> What kind of wigout? And how do you diagnose the corruption? I'm 
>>> intrigued.
>>> On Mon, Feb 16, 2015 at 8:43 AM, Brad Fleming <bdflemin at gmail.com> wrote:
>>>> We’ve seen it since installing the high-capacity switch fabrics 
>>>> into our XMR4000 chassis roughly 4 years ago. We saw it through 
>>>> IronWare 5.4.00d. I’m not sure what software we were using when 
>>>> they were first installed; probably whatever would have been 
>>>> stable/popular around December 2010.
>>>>
>>>> Command is simply “power-off snm [1-3]” then “power-on snm [1-3]”.
>>>>
>>>> Note that the power-on process causes your management session to 
>>>> hang for a few seconds. The device isn’t broken and packets aren’t 
>>>> getting dropped; it’s just going through checks and echoing back 
>>>> status.
>>>>
>>>> -brad
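For reference, the power-off/power-on sequence Brad describes could be scripted roughly like this. This is only a sketch: `send_cli` is a hypothetical placeholder for whatever session wrapper you use to drive the MLX CLI, and the one-minute waits come straight from the procedure quoted further down the thread.

```python
import time

# Sketch of the "SFM dance": power-cycle each switch fabric module one at
# a time, waiting between steps. The CLI commands ("power-off snm N" /
# "power-on snm N") are from this thread; send_cli is a placeholder, not
# a real API.
def sfm_dance(send_cli, modules=(3, 2, 1), wait_seconds=60):
    for snm in modules:
        send_cli(f"power-off snm {snm}")
        time.sleep(wait_seconds)   # wait a minute before powering back on
        send_cli(f"power-on snm {snm}")
        # operator should verify all SFM links are up before moving on
        time.sleep(wait_seconds)
```

In practice you would pause at the verification step and check SFM link status by hand before touching the next module, as the thread stresses.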
>>>>
>>>>
>>>> > On Feb 16, 2015, at 7:07 AM, Jethro R Binks <jethro.binks at strath.ac.uk> wrote:
>>>> >
>>>> > On Fri, 13 Feb 2015, Brad Fleming wrote:
>>>> >
>>>> >> Over the years we’ve seen odd issues where one of the
>>>> >> switch-fabric-links will “wigout” and some of the data moving
>>>> >> between cards will get corrupted. When this happens we power
>>>> >> cycle each switch fab one at a time using this process:
>>>> >>
>>>> >> 1) Shutdown SFM #3
>>>> >> 2) Wait 1 minute
>>>> >> 3) Power SFM #3 on again
>>>> >> 4) Verify all SFM links are up to SFM #3
>>>> >> 5) Wait 1 minute
>>>> >> 6) Perform steps 1-5 for SFM #2
>>>> >> 7) Perform steps 1-5 for SFM #1
>>>> >>
>>>> >> Not sure you’re seeing the same issue that we see, but the “SFM
>>>> >> Dance” (as we call it) is a once-every-four-months thing somewhere
>>>> >> across our 16 XMR4000 boxes. It can be done with little to no
>>>> >> impact if you are patient and verify status before moving to the
>>>> >> next SFM.
>>>> >
>>>> > That's all interesting. What code version is this? Also, how do you
>>>> > shut down the SFMs? I don't recall seeing documentation for that.
>>>> >
>>>> > Jethro.
>>>> >
>>>> >
>>>> >>
>>>> >>> On Feb 13, 2015, at 11:41 AM, nethub at gmail.com wrote:
>>>> >>>
>>>> >>> We have three switch fabrics installed, all are under 1% utilized.
>>>> >>>
>>>> >>>
>>>> >>> From: Jeroen Wunnink | Hibernia Networks [mailto:jeroen.wunnink at atrato.com]
>>>> >>> Sent: Friday, February 13, 2015 12:27 PM
>>>> >>> To: nethub at gmail.com; 'Jeroen Wunnink | Hibernia Networks'
>>>> >>> Subject: Re: [f-nsp] MLX throughput issues
>>>> >>>
>>>> >>> How many switch fabrics do you have in that MLX, and how high is
>>>> >>> the utilization on them?
>>>> >>>
>>>> >>> On 13/02/15 18:12, nethub at gmail.com wrote:
>>>> >>>> We also tested with a spare Quanta LB4M we have and are seeing
>>>> >>>> about the same speeds as we are seeing with the FLS648 (around
>>>> >>>> 20 MB/s, or 160 Mbps).
>>>> >>>>
>>>> >>>> I also reduced the number of routes we are accepting down to
>>>> >>>> about 189K, and that did not make a difference.
>>>> >>>>
>>>> >>>>
>>>> >>>> From: foundry-nsp [mailto:foundry-nsp-bounces at puck.nether.net] On Behalf Of Jeroen Wunnink | Hibernia Networks
>>>> >>>> Sent: Friday, February 13, 2015 3:35 AM
>>>> >>>> To: foundry-nsp at puck.nether.net
>>>> >>>> Subject: Re: [f-nsp] MLX throughput issues
>>>> >>>>
>>>> >>>> The FLS switches do something weird with packets. I've noticed
>>>> >>>> they somehow interfere with changing the MSS window size
>>>> >>>> dynamically, resulting in destinations further away having very
>>>> >>>> poor speed results compared to destinations close by.
>>>> >>>>
>>>> >>>> We got rid of those a while ago.
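The distance-dependent slowdown described above is what you would expect if something in the path breaks TCP window/MSS adjustment: throughput is capped at one window per round trip, so a clamped window hurts far-away destinations far more than nearby ones. A rough back-of-the-envelope sketch (the 64 KiB window and the RTT values are illustrative assumptions, not measurements from this thread):

```python
# TCP throughput ceiling for a fixed window: one window per round trip.
def tcp_ceiling_mbps(window_bytes, rtt_ms):
    """Maximum TCP throughput (Mbps) for a given window size and RTT."""
    return window_bytes * 8 / (rtt_ms / 1000.0) / 1e6

# 64 KiB window (i.e. no working window scaling), assumed RTTs:
near = tcp_ceiling_mbps(65535, 5)    # ~5 ms RTT, nearby peer: ~105 Mbps cap
far  = tcp_ceiling_mbps(65535, 100)  # ~100 ms RTT, distant peer: ~5 Mbps cap
```

With the same broken window, the nearby destination still fills a 100 Mbps pipe while the distant one crawls, which matches the "close by is fine, far away is poor" symptom.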
>>>> >>>>
>>>> >>>>
>>>> >>>> On 12/02/15 17:37, nethub at gmail.com wrote:
>>>> >>>>> We are having a strange issue on our MLX running code
>>>> >>>>> 5.6.00c. We are encountering some throughput issues that seem
>>>> >>>>> to be randomly impacting specific networks.
>>>> >>>>>
>>>> >>>>> We use the MLX to handle both external BGP and internal VLAN
>>>> >>>>> routing. Each FLS648 is used for Layer 2 VLANs only.
>>>> >>>>>
>>>> >>>>> From a server connected by a 1 Gbps uplink to a Foundry FLS648
>>>> >>>>> switch, which is then connected to the MLX on a 10 Gbps port,
>>>> >>>>> running a speed test to an external network gets 20 MB/s.
>>>> >>>>>
>>>> >>>>> Connecting the same server directly to the MLX is getting 70MB/s.
>>>> >>>>>
>>>> >>>>> Connecting the same server to one of my customer's Juniper
>>>> >>>>> EX3200 switches (which BGP peers with the MLX) also gets 70 MB/s.
>>>> >>>>>
>>>> >>>>> Testing to another external network, all three scenarios get
>>>> >>>>> 110 MB/s.
>>>> >>>>>
>>>> >>>>> The path to both test network locations goes through the same
>>>> >>>>> IP transit provider.
>>>> >>>>>
>>>> >>>>> We are running an NI-MLX-MR with 2 GB RAM. An NI-MLX-10Gx4
>>>> >>>>> connects to the Foundry FLS648 by XFP-10G-LR, and an
>>>> >>>>> NI-MLX-1Gx20-GC was used for directly connecting the server. A
>>>> >>>>> separate NI-MLX-10Gx4 connects to our upstream BGP providers.
>>>> >>>>> The customer’s Juniper EX3200 connects to the same NI-MLX-10Gx4
>>>> >>>>> as the FLS648. We take default routes plus full tables from
>>>> >>>>> three providers by BGP, but filter out most of the routes.
>>>> >>>>>
>>>> >>>>> The fiber and optics on everything look fine. CPU usage is
>>>> >>>>> less than 10% on the MLX and all line cards, and CPU usage is
>>>> >>>>> at 1% on the FLS648. The ARP table on the MLX is about 12K
>>>> >>>>> entries, and the BGP table is about 308K routes.
>>>> >>>>>
>>>> >>>>> Any assistance would be appreciated. I suspect there is a
>>>> >>>>> setting that we’re missing on the MLX that is causing this issue.
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> _______________________________________________
>>>> >>>>> foundry-nsp mailing list
>>>> >>>>> foundry-nsp at puck.nether.net
>>>> >>>>> http://puck.nether.net/mailman/listinfo/foundry-nsp
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>
>>>> >>>
>>>> >>
>>>> >
>>>> > .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
>>>> > Jethro R Binks, Network Manager,
>>>> > Information Services Directorate, University Of Strathclyde, Glasgow, UK
>>>> >
>>>> > The University of Strathclyde is a charitable body, registered in
>>>> > Scotland, number SC015263.
>>>>
>>>>
>
>
>


-- 

Jeroen Wunnink
IP NOC Manager - Hibernia Networks
Main numbers (Ext: 1011): USA +1.908.516.4200 | UK +44.1704.322.300
Netherlands +31.208.200.622 | 24/7 IP NOC Phone: +31.20.82.00.623
jeroen.wunnink at hibernianetworks.com
www.hibernianetworks.com


