[f-nsp] FCX and target path 8.0.10m (and an aside)
Jethro R Binks
jethro.binks at strath.ac.uk
Tue Feb 27 06:18:30 EST 2018
On Thu, 22 Feb 2018, Kennedy, Joseph wrote:
> What is connected to these stacks from a client perspective?
Edge stacks :) Mostly HPE/Comware stuff of various vintages.
> Are you running PIM on the interfaces? IGMP snooping? Have you checked
PIM yes, IGMP snooping usually no - the VLANs are mostly routed on this
switch so it is the IGMP querier. I think I see where you're going with
this: over many Foundry/Brocade platforms we've had issues with multicast.
> your IGMP and multicast groups and how many group members are present?
Only a couple of hundred groups usually.
> Do you have any link-local groups showing up? Assuming active IGMP
> querier configuration of some kind, does the loss line up with any of
> the IGMP interface timers reaching 0?
There are link-local groups present for sure. We have a core MLX which
sees far more groups (it is on the path to the RP) and shows related CPU
load, though not enough to be a problem at the moment. We're gradually
rolling out filtering of groups at the routed distribution layer to cut
down on this (although the effectiveness of that seems a bit hit and miss
on some platforms). The UPnP group (239.255.255.250) is particularly
pernicious, with TTLs != 1.
The ping loss was much less noticeable over the weekend, when there was
less activity on campus.
> Do your OSPF adjacencies look stable throughout or do they transition
> during the events? You said you notice loss through the stack but do you
> note any loss from the stack itself to the core uplinks?
OSPF adjacencies totally stable.
When the packet loss happens, it affects pings to the loopback address
and the OSPF interface addresses, but not so much addresses reached
through the stack. So it looks more like a control-plane problem than a
data-plane one.
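(If it helps anyone reproduce this, the comparison can be made
systematically with something like the quick sketch below: ping a
switch-owned address and a host behind the stack side by side, and only
log the seconds where one of them drops. It assumes a Linux monitoring
host with plain ping; the two addresses are placeholders, not our real
ones.)

    #!/usr/bin/env python3
    # Sketch: ping a switch-owned address and a host reached *through* the
    # stack side by side; control-plane-only loss shows up as the first entry
    # dropping while the second stays clean.  Addresses are placeholders.
    import datetime
    import subprocess
    import time

    TARGETS = {"switch-loopback": "10.0.0.1", "host-behind-stack": "10.0.1.50"}

    def ping_once(addr):
        # One echo request, 1 second timeout; returncode 0 means we got a reply.
        return subprocess.run(["ping", "-c", "1", "-W", "1", addr],
                              stdout=subprocess.DEVNULL).returncode == 0

    while True:
        results = {name: ping_once(addr) for name, addr in TARGETS.items()}
        if not all(results.values()):
            stamp = datetime.datetime.now().strftime("%H:%M:%S")
            lost = ", ".join(name for name, ok in results.items() if not ok)
            print(f"{stamp}  lost: {lost}")
        time.sleep(1)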
On a previous stack failure we flipped the stacking cable to the opposite
ports (there are two in the stack), but it failed again after that. We
upgraded from target path 8.0.10m to 8.0.30q, but it made absolutely no
difference. I'm also looking at the procedure to downgrade to 7.4 to see
whether the problem persists, although my gut feeling is that it is
hardware-related. The ping loss problem occurs on another stacked pair
that was 'upgraded' to 8.0.10m at the same time, but that one doesn't
exhibit the stack failure issue.
I'm now monitoring the MIB, so hopefully I will get an SMS alert on
failures, and I have rigged up remote access to the power so we can
re-power the stacks remotely without a visit. This will buy us some time,
but essentially we're looking at bringing forward a replacement we had
been planning for this summer.
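(For the monitoring piece, roughly the sort of poller I have in mind is
sketched below: Python with pysnmp, polling sysUpTime purely as a
stand-in reachability/reboot check. The real version would poll the
stack-unit state objects from the stacking MIB instead; the address and
community string here are placeholders.)

    #!/usr/bin/env python3
    # Sketch: poll sysUpTime over SNMP once a minute and shout when it goes
    # backwards (i.e. the stack rebooted) or stops answering.  sysUpTime is a
    # stand-in; the real poll would target the stack-unit state objects in the
    # stacking MIB.  Address and community are placeholders.
    import time
    from pysnmp.hlapi import (CommunityData, ContextData, ObjectIdentity,
                              ObjectType, SnmpEngine, UdpTransportTarget, getCmd)

    TARGET = "192.0.2.10"       # stack management address (placeholder)
    COMMUNITY = "public"        # read-only community (placeholder)
    SYS_UPTIME = "1.3.6.1.2.1.1.3.0"

    def poll_uptime():
        err_ind, err_stat, _, var_binds = next(getCmd(
            SnmpEngine(), CommunityData(COMMUNITY),
            UdpTransportTarget((TARGET, 161), timeout=2, retries=1),
            ContextData(), ObjectType(ObjectIdentity(SYS_UPTIME))))
        if err_ind or err_stat:
            return None                       # no answer: treat as a failure
        return int(var_binds[0][1])           # TimeTicks since the last reboot

    last = poll_uptime()
    while True:
        time.sleep(60)
        current = poll_uptime()
        if current is None or (last is not None and current < last):
            print("stack not answering or rebooted")  # hook the SMS gateway in here
        last = current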
Good questions, Joseph, thanks!
Addendum, Tuesday morning:
No failures since Friday for us so far. However, the other FCX stack
that was upgraded at the same time and exhibited the ping loss issue has
now also experienced the stack break issue. It had been running for 15
days (although I'm not sure why it rebooted then). Curiously, last week a
third FCX stack broke itself apart too, but that one is running 07400m
and has been stable for a long, long time. I had been minded to think we
just had a hardware issue on our most problematic stack, but with two
others now showing the same symptoms, I'm starting to wonder whether the
stack issue is being provoked by some traffic. Seems hard to believe, but
I've suspected it of other equipment in the past... ho hum.
Jethro.
>
> --JK
>
> -----Original Message-----
> From: foundry-nsp [mailto:foundry-nsp-bounces at puck.nether.net] On Behalf Of Jethro R Binks
> Sent: Thursday, February 22, 2018 5:16 AM
> To: foundry-nsp at puck.nether.net
> Subject: Re: [f-nsp] FCX and target path 8.0.10m (and an aside)
>
> The silence was deafening!
>
> So, a bit of a development with this. We had three stack failure events
> which required a hard reboot to sort. We made the decision to upgrade
> to 8.0.30q (we also replaced the CX4 cable, just in case it was degraded
> in some way). The upgrade was all fine.
>
> Initially after the reboot, we didn't see the ping loss issues. But
> over the past few hours it has started to creep in again, much the same
> as previously. I've not re-done all the tests, like shutting down one
> OSPF interface and then the other to see if it makes any difference to
> the problem, but my gut feeling is it will be just the same.
>
> Anyone any thoughts? Could there be some sort of hardware failure in
> one of the units that might cause these symptoms? Maybe there are
> more diagnostic tools available to me. What might also be interesting is
> trying to downgrade back to the 7.4 version we were running previously,
> where we didn't see these issues. But that's more service-affecting
> downtime.
>
> Jethro.
>
>
>
> On Fri, 16 Feb 2018, Jethro R Binks wrote:
>
> > I thought I was doing the right thing by upgrading a couple of my
> > slightly aging FCXs to target path release 8.0.10m, which tested fine
> > on an unstacked unit with a single OSPF peering.
> >
> > The ones I am running it on are stacks of two, each with two 10Gb/s
> > connections to core, one OSPF peering on each.
> >
> > Since the upgrade, both stacks suffer packet loss every 2 minutes
> > (just about exactly) for about 5-10 seconds, demonstrated by pinging
> > either a host through the stack, or an interface on the stack. There
> > are no log messages or changes in OSPF status or spanning tree
> > activity. When it happens, of course a remote session to the box stalls for the same period.
> >
> > Shutting down either one of the OSPF links doesn't make a difference.
> > CPU never changes from 1%. No errors on ints. I've used dm commands
> > to catch packets going to CPU at about the right time and see nothing
> > particularly alarming and certainly no flooding of anything.
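(To put numbers on the roughly-2-minute periodicity, a quick sketch along
these lines can timestamp each loss window and the gap since the previous
one. Again Python on a Linux host with plain ping, and the target address
is a placeholder.)

    #!/usr/bin/env python3
    # Sketch: timestamp each ping-loss window against a single address on the
    # stack, plus the gap since the previous window, to check the "roughly
    # every 2 minutes for 5-10 seconds" pattern.  The address is a placeholder.
    import subprocess
    import time

    TARGET = "10.0.0.1"     # an interface address on the stack (placeholder)

    in_loss, loss_start, prev_start = False, None, None

    while True:
        ok = subprocess.run(["ping", "-c", "1", "-W", "1", TARGET],
                            stdout=subprocess.DEVNULL).returncode == 0
        now = time.time()
        if not ok and not in_loss:
            gap = f", {now - prev_start:.0f}s after previous" if prev_start else ""
            print(f"{time.strftime('%H:%M:%S')}  loss started{gap}")
            in_loss, loss_start, prev_start = True, now, now
        elif ok and in_loss:
            print(f"{time.strftime('%H:%M:%S')}  loss ended after "
                  f"{now - loss_start:.0f}s")
            in_loss = False
        time.sleep(1)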
> >
> > This only started after the upgrade to 8.0.10m on each of them. I
> > have other FCX stacks on other code versions not exhibiting this issue.
> >
> > Some of the comments in this thread seem to be reflective of my issue:
> >
> > https://www.reddit.com/r/networking/comments/4j47uo/brocade_is_ruining_my_week_i_need_help_to/
> >
> > I'm a little dismayed to get these problems on a Target Path release,
> > which I assumed would be pretty sound. I've been eyeing a potential
> > upgrade to something in the 8.0.30 train (recommendations?), with the
> > usual added excitement of a fresh set of bugs.
> >
> > Before I consider reporting it, I wondered if anyone had any useful
> > observations or suggestions.
> >
> > And, as an aside, I wonder how we're all getting along in our new
> > homes for our dissociated Brocade family now. Very sad to see the
> > assets of a once good company scattered to the four winds like this.
> >
> > Jethro.
> >
> > . . . . . . . . . . . . . . . . . . . . . . . . .
> > Jethro R Binks, Network Manager,
> > Information Services Directorate, University Of Strathclyde, Glasgow,
> > UK
> >
> > The University of Strathclyde is a charitable body, registered in
> > Scotland, number SC015263.
> >
>
> . . . . . . . . . . . . . . . . . . . . . . . . .
> Jethro R Binks, Network Manager,
> Information Services Directorate, University Of Strathclyde, Glasgow, UK
>
> The University of Strathclyde is a charitable body, registered in Scotland, number SC015263.
>
. . . . . . . . . . . . . . . . . . . . . . . . .
Jethro R Binks, Network Manager,
Information Services Directorate, University Of Strathclyde, Glasgow, UK
The University of Strathclyde is a charitable body, registered in
Scotland, number SC015263.