[f-nsp] High LP CPU After Upgrade 4001a to 54c Multicast

Jethro R Binks jethro.binks at strath.ac.uk
Tue Nov 5 03:57:07 EST 2013


I think we've probably experienced the issue you described.  We have an 
MLX core, with other platforms at the distribution layer, and would 
experience peaks of very high CPU for a second or two at a time, which 
would disrupt OSPF and MRP at least.  It appears we were running 5.4.0d at 
the time.

However, since upgrading to 5.5.0c I've been working with Brocade on another 
issue, wherein the standby management card would reset every few minutes.

Filtering unwanted multicast groups both at the distribution layer, and 
then later directly at the core interfaces, helped a bit.  However the 
most effective fix was to remove "ip multicast-nonstop-routing"; as it was 
described to me: "The problem is seen when in a specific pattern the 
outgoing ports for the groups (239.255.255.250) added and removed ... 
Engineering team performed troubleshooting and determined that for some 
reason, the OIF tree that is rooted at certain forwarding entries is being 
corrupted, either in the middle of traversal, or when there is database 
update."

At the moment Brocade are trying to replicate it in their lab environment 
to work on a fix.  If they sort that, and merge in your defect fix, maybe 
we'll finally see the back of the CPU/multicast issues we've been plagued 
with.

Jethro.



On Tue, 5 Nov 2013, Kennedy, Joseph wrote:

> The problem for us was so severe that both MLX MPs were running at 99% 
> CPU and the LPs were flooding unicast.
> 
> After a lot of work testing in a lab environment, looking for an issue in 
> multicast routing that fit the symptoms (lol... no, it wasn't easy), I 
> confirmed that the source of the problem was in 5.2 and above (5.2.00 to 
> 5.4.00d), in the processing of IGMP reports. Brocade's code updated mcache 
> entries for every IGMP report, even when a matching mcache OIF entry 
> already existed.
> 
> All updates in a given IGMP query window in the problem code could be 
> represented as O(M(N^2)), where M is the number of OIFs and N is the 
> number of group members in a single group. For example, in an 
> environment with 100 OIFs and 300 group members this equates to 
> 9,000,000 updates per IGMP query window. By comparison, in previous 
> code releases the updates could be represented by O(MN), or, given the 
> same environment values as above, 30,000 updates per query window.
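> 
> To sanity-check those numbers, here is a trivial sketch (mine, not 
> Brocade's code; the loop structure is only an assumption used to 
> illustrate the O(M(N^2)) versus O(MN) growth):
> 
>     # Rough model of per-query-window mcache update counts (illustrative).
>     M = 100          # OIFs
>     N = 300          # group members in a single group
> 
>     # Problem code (5.2.00 - 5.4.00d): each of the N reports appears to
>     # touch all N members' state across all M OIFs -> M * N^2 updates.
>     problem_updates = M * N * N      # 9,000,000
> 
>     # Earlier releases: one update per member per OIF -> M * N updates.
>     old_updates = M * N              # 30,000
> 
>     print(problem_updates, old_updates)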
> 
> Many may not have noticed the issue because they don't have a large 
> number of OIFs or a large number of group members in a single group. Some 
> may have run into this previously and just filtered the UPnP/SSDP IPv4 
> group (239.255.255.250) to resolve it. If you are running PIM-SM, have 
> upgraded to 5.2.00 or above, and afterwards noted periods of abnormally 
> high MP/LP CPU, or you attempted the upgrade but had to revert due to 
> high MP CPU usage and unicast flooding (as we were seeing), then this may 
> be the root of your issue.
> 
> After we reported the problem to Brocade, they provided a fix build and 
> incorporated the fix into 5.4.00e. This problem "should be" resolved in 
> 5.4.00e. The problem is not specific to running PIM-SM with VRFs.
> 
> Related closed defect information from 5.4.00e:
> 
> Defect ID: DEFECT000468056
> Technical Severity: Medium
> Summary: High MP CPU utilization from IGMP reports after upgrade
> Symptom: After upgrading from 4.x to 5.4, high CPU utilization from IGMP reports in VRF
> Feature: IPv4-MC PIM-SM Routing
> Function: PERFORMANCE
> Reported In Release: NI 05.4.00
> 
> --JK
> 
> We have seen issues when our MLXes receive multicast traffic for which
> there have been no IGMP join messages sent (on edge ports).  I'm
> assuming that not getting any PIM joins would have the same effect.
> There are some applications that do not send IGMP messages if they
> expect their traffic to remain on the same L2 domain.  Apparently if the
> MLX doesn't have an entry for it, it punts it to the LP CPU.
> 
> To get an idea of which traffic is hitting the CPU, you can connect to
> the LP (rconsole <slot_number>, then enable) and run 'debug packet
> capture'.  That will show you a few packets as they hit the LP CPU, and
> should at least tell you the source IP, interface, and multicast group
> for the offending traffic.
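> 
> As a concrete example of the sequence (slot 3 is just a placeholder for 
> whichever LP is running hot):
> 
>     rconsole 3
>     enable
>     debug packet capture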
> 
> HTH,
> 
> --
> Eldon Koyle
> --
> BOFH excuse #319:
> Your computer hasn't been returning all the bits it gets from the Internet.
> 
> On  Jun 03 10:32-0400, Walter Meyer wrote:
> > We are seeing high CPU on our LPs after upgrading from 4001a to 54c on two
> > MLXs.
> >
> > We are using PIM-SM and the mcast process is using a large amount of LP
> > CPU, but only after the upgrade. We are stable on the same config prior to
> > the upgrade. Also, the MLX that is the RP for networks with a large number
> > of multicast streams is the one that has a high CPU. The other core doesn't
> > have an issue (aside from being unstable because of the other MLX with high
> > CPU). We are pretty sure it has something to do with multicast routing we
> > just can't figure out why.
> >
> > We do have a large number of group/OIF entries spanning multiple physical
> > ints and ves, but this shouldn't be an issue because of the OIF
> > optimization feature on the platform...right? On 4001a and 54c we have a
> > shareability coefficient / optimization of 98%...So it doesn't seem like a
> > resource problem...But we can't figure out why the traffic is hitting CPU.
> >
> > Has anyone seen mcast problems after upgrading or have any troubleshooting
> > tips?
> 
> 

.  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
Jethro R Binks, Network Manager,
Information Services Directorate, University Of Strathclyde, Glasgow, UK

The University of Strathclyde is a charitable body, registered in
Scotland, number SC015263.

