[f-nsp] Trying to diagnose a possibly failing FESX648-PREM

Erich Hohermuth erich at hoh.ch
Fri May 9 01:10:37 EDT 2014


Hi

As general advice, I would sort it out layer by layer with a checklist:

Check grounding
Check cable type and shielding
Check the cables (FastIron# phy cable-diag tdr <n>) or use a diagnostic tool
Check speed/duplex settings
Check the input/output and error counters on the other side (see the sketch after this list)
Check packet size
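
If it helps, here is a minimal sketch in plain Python that flags ports with non-zero error counters from a captured 'show statistics' session like the one quoted below; the five-column layout is assumed from that output and may differ on other releases. Run it on two captures taken a few minutes apart and compare which counters moved.

import sys

# Flag ports with non-zero error counters in a captured
# "show statistics" session (five columns: port, in packets,
# out packets, in errors, out errors).
def ports_with_errors(text):
    flagged = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 5 and all(f.isdigit() for f in fields):
            port, in_pkts, out_pkts, in_err, out_err = (int(f) for f in fields)
            if in_err or out_err:
                flagged.append((port, in_err, out_err))
    return flagged

if __name__ == "__main__":
    # e.g.  python3 flag_errors.py < show-statistics.txt
    for port, in_err, out_err in ports_with_errors(sys.stdin.read()):
        print("port %d: %d in errors, %d out errors" % (port, in_err, out_err))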

Next, save a 'show tech' before any config changes, and use a diff or version-control tool to track what changed.
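
For the diff step, Python's difflib is enough if no dedicated tool is at hand; the file names below are just placeholders for whatever you save the captures as.

import difflib

# Print a unified diff of two saved captures, e.g. "show run" or
# "show tech" taken before and after a configuration change.
def diff_captures(before_path, after_path):
    with open(before_path) as f:
        before = f.readlines()
    with open(after_path) as f:
        after = f.readlines()
    return "".join(difflib.unified_diff(before, after,
                                        fromfile=before_path,
                                        tofile=after_path))

if __name__ == "__main__":
    print(diff_captures("show-tech-before.txt", "show-tech-after.txt"))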


On the device, check the hardware limits (TCAM) and the impact of changing system values (e.g. max-vlan), because the IPv6 models differ in how they allocate TCAM. Do you have IPv6 enabled?

Regards Erich






> On 09.05.2014 at 00:07, "ebradsha at gmail.com" <ebradsha at gmail.com> wrote:
> 
> Plenty of FCS errors and they're incrementing on the new switch as well. Flow control is enabled on all ports. Here's my 'show statistics' output:
> 
> SSH@FESX648 Router(config)#show statistics
> Port            In Packets          Out Packets       In Errors      Out Errors
> 1                   180855                    0               0               0
> 2                        0                    0               0               0
> 3                123136488             70341679               0               0
> 4                        0                    0               0               0
> 5                  5315137              6604598          648949               0
> 6                   342105              1549867          535454               0
> 7                  9669516             16503017         3137016               0
> 8                 14399232             29683571               1               0
> 9                  9974691             18817287         3853703               0
> 10                 4152353              4000770               0               0
> 11                13630527             25175503         5483288               0
> 12                   71369               149477            1642               0
> 13                 6881418              1668386          158036               0
> 14                  939892              3171692          261376               0
> 15                11008907             20921720         4404347               0
> 16                   77529               222362           24009               0
> 17                     433                87820               0               0
> 18                   82308              1759389          759693               0
> 19                       0                    0               0               0
> 20                   27175               109184            1567               0
> 21                       0                    0               0               0
> 22                       0                    0               0               0
> 23                       0                    0               0               0
> 24                       0                    0               0               0
> 25                       0                  391               0               0
> 26                     410                    0               0               0
> 27                       0                    0               0               0
> 28                       0                    0               0               0
> 29                       0                    0               0               0
> 
> Almost every port that is active has FCS errors.
> 
> I've had such a bizarre combination of symptoms (15% packet loss and erratic pings that were resolved by removing rate limiting) that I initially discounted the possibility that my cables were bad. However, I did self-terminate all of them (I've terminated thousands of cables), and I was using a new bag of RJ45 plugs that I haven't used elsewhere.
> 
> The datacenter technician who tested my uplink cross-connect cable also tested one of my self-terminated cables. Both cables passed the test, but maybe the rest of my self-terminated cables are bad...
> 
> 
>> On Thu, May 8, 2014 at 2:22 PM, Eldon Koyle <esk-puck.nether.net at esk.cs.usu.edu> wrote:
>> Could it be a cabling issue?  Are there any errors?
>> 
>> Is flow control enabled?
>> 
>> --
>> Eldon Koyle
>> 
>> On  May 08 14:13-0700, ebradsha at gmail.com wrote:
>> > Just spoke with a sysadmin working out of a different datacenter. They have
>> > FESX648-PREMs deployed and they're running sxr07400e.bin firmware as well.
>> > Completely stumped at this point :-/
>> >
>> >
>> > On Thu, May 8, 2014 at 1:38 PM, ebradsha at gmail.com <ebradsha at gmail.com> wrote:
>> >
>> > > I just had a replacement FESX648-PREM delivered overnight, hooked it up
>> > > and initially all looked good. However, when I imported my config and moved
>> > > over all of the CAT5e cables, the packet loss and erratic pings resumed.
>> > >
>> > > Assuming that there was some firmware issue at play, I started removing
>> > > different parts of my config while running a continuous ping test in the
>> > > background. The moment I removed all rate-limiting from the device, packet
>> > > loss halted and ping times stabilized. However, I continue to have problems
>> > > downloading files at full speed -- speed test files will do these 'stop and
>> > > start' pauses. Ultimately I can only average 6MB/s where I'd
>> > > normally expect to pull down at least 200MB/s.
>> > >
>> > > My original switch was running sxr07400e.bin and the replacement is
>> > > running sxr07400d.bin
>> > >
>> > > All my other switches are FESX448-PREMs, so unfortunately I don't have an
>> > > existing example config to model after.
>> > >
>> > > Anyone recommend a boot ROM and firmware version that works well with a
>> > > FESX648-PREM?
>> > >
>> > >
>> > >
>> > >
>> > > On Wed, May 7, 2014 at 4:36 PM, ebradsha at gmail.com <ebradsha at gmail.com> wrote:
>> > >
>> > >> This is a stand-alone switch in a cabinet so no L2 loop there. Pretty
>> > >> simple setup -- single BGP session with an upstream provider with the
>> > >> default route pointing right to them. CPU utilization currently sitting at
>> > >> 1%.
>> > >>
>> > >> Initially when I noticed the packet loss I thought I was getting DoS
>> > >> attacked, but I have sFlow monitoring activated on all ports and don't see
>> > >> anything out of the ordinary.
>> > >>
>> > >> I'll check the boot time diagnostics soon -- thanks for your input.
>> > >>
>> > >> - Elliot
>> > >>
>> > >>
>> > >> On Wed, May 7, 2014 at 4:28 PM, Jeroen Wunnink | Hibernia Networks <
>> > >> jeroen.wunnink at atrato.com> wrote:
>> > >>
>> > >>>  Could be an L2 loop or a DDoS against the mgmt IP. Is the CPU load
>> > >>> also high?
>> > >>>
>> > >>>
>> > >>> On 07/05/14 20:46, ebradsha at gmail.com wrote:
>> > >>>
>> > >>> Hi all,
>> > >>>
>> > >>>  I believe I have a failing switch on my hands and I'm wondering if you
>> > >>> might be able to provide an assessment based on the symptoms I've been seeing.
>> > >>>
>> > >>>  I'm currently running a Foundry FESX648-PREM with the following
>> > >>> version info:
>> > >>>
>> > >>>  SSH@FESX648 Router>show version
>> > >>>   SW: Version 07.4.00eT3e3 Copyright (c) 1996-2012 Brocade
>> > >>> Communications Systems, Inc. All rights reserved.
>> > >>>       Compiled on Dec 11 2013 at 19:00:43 labeled as SXR07400e
>> > >>>       (4593059 bytes) Primary sxr07400e.bin
>> > >>>        BootROM: Version 07.4.01T3e5 (FEv2)
>> > >>>   HW: Stackable FESX648-PREM6 (PROM-TYPE FESX648-L3U-IPV6)
>> > >>>
>> > >>> ==========================================================================
>> > >>>       Serial  #: FL18090011
>> > >>>          License: SX_V6_HW_ROUTER_IPv6_SOFT_PACKAGE   (LID: XXXXXXXXXXX)
>> > >>>        P-ASIC  0: type 0111, rev 00  subrev 01
>> > >>>       P-ASIC  1: type 0111, rev 00  subrev 01
>> > >>>       P-ASIC  2: type 0111, rev 00  subrev 01
>> > >>>       P-ASIC  3: type 0111, rev 00  subrev 01
>> > >>>
>> > >>> ==========================================================================
>> > >>>   300 MHz Power PC processor 8245 (version 0081/1014) 66 MHz bus
>> > >>>   512 KB boot flash memory
>> > >>>  8192 KB code flash memory
>> > >>>   256 MB DRAM
>> > >>> The system uptime is 26 minutes 49 seconds
>> > >>> The system : started=warm start   reloaded=by "reload"
>> > >>>
>> > >>>
>> > >>>  Quick summary of the symptoms:
>> > >>>
>> > >>>  1. These problems started only after ~15 servers were connected to the
>> > >>> switch. Although many servers were connected, utilization remains low, only
>> > >>> ~40Mbit on a 1Gbit uplink.
>> > >>>
>> > >>>  2. I just rebooted my switch 20 minutes ago, but I'm already seeing a
>> > >>> ton of FCS errors across many ports: http://pbrd.co/SABLtk
>> > >>>
>> > >>>  3. Inexplicably high and erratic ping times (80ms instead of the
>> > >>> usual 20ms over the same route, with variation of +/- 20ms on every ping).
>> > >>> Ping times were low and stable before many servers were connected.
>> > >>>
>> > >>>  4. High packet loss. Before a lot of servers were connected, there was
>> > >>> no packet loss. Yesterday, the packet loss was hovering around 10%. It
>> > >>> seems to be worsening now. Today the average packet loss is 20%.
>> > >>>
>> > >>>  Screen capture: http://pbrd.co/SADKO7 <http://pbrd.co/SABZ3D>
>> > >>>
>> > >>>  5. Yesterday I was also able to temporarily eliminate packet loss and
>> > >>> the high ping times by disabling specific ports. Today, disabling ports 7
>> > >>> and 11 has no effect.
>> > >>>
>> > >>>  6. The cross-connect cables were suspect, but all cables have since
>> > >>> been tested with a MicroTest PentaScanner and all passed. We even replaced
>> > >>> the CAT5 cross-connect with a machined and molded CAT6 cable -- the same
>> > >>> packet loss and erratic ping times persisted.
>> > >>>
>> > >>>  7. Other strange things have happened. Yesterday I attempted to
>> > >>> connect two new servers to the switch on ports 37 and 38. Ports 5-48
>> > >>> belong to the same default VLAN. The servers could connect to the switch,
>> > >>> and ping the gateway IP, but they could not ping to the outside world. I
>> > >>> then moved the CAT5 cables to ports 22 and 23 -- same VLAN -- and
>> > >>> everything worked perfectly.
>> > >>>
>> > >>>  Does this seem like a failing switch? Are there any further diagnostic
>> > >>> tests I could run to verify this?
>> > >>>
>> > >>>  Thanks,
>> > >>> Elliot
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>> --
>> > >>>
>> > >>> Jeroen Wunnink
>> > >>> IP NOC Manager - Hibernia Networks
>> > >>> jeroen.wunnink at hibernianetworks.com
>> > >>> Phone: +1 908 516 4200 (Ext: 1011)
>> > >>> 24/7 NOC Phone: +31 20 82 00 623
>> > >>>
>> > >>>
>> > >>
>> > >
>> 
> 
> _______________________________________________
> foundry-nsp mailing list
> foundry-nsp at puck.nether.net
> http://puck.nether.net/mailman/listinfo/foundry-nsp