[f-nsp] FW: High LP CPU After Upgrade 4001a to 54c Multicast

Wilbur Smith wsmith at brocade.com
Wed Nov 6 14:16:41 EST 2013


(Sorry…sending again from correct email address)

>Hi Folks,
>Sorry you’re running into issues with multicast on the MLX. Most of my
>experience with multicast is based on L2 and IGMP Snooping, but the
>internal forwarding and programming of the MLX is very similar for both L2
>and L3 multicast.
>
>A good rule of thumb for the MLX is that if you see high CPU on an LP, it is
>usually caused by an entry not getting programmed into the LP’s hardware
>forwarding table; the high CPU is triggered because the traffic is being
>processed in software. This is not a 100% rule (this can also be triggered
>by excessive SNMP polling of an interface), but I think this applies in
>your case.
>
>So, what can cause a multicast hardware entry not to be programmed
>correctly? I’ve seen duplicate multicast IP addresses, or even a duplicate
>MAC address on a hardware encoder, trigger this. Basically, the router
>perceives a duplicate MAC or IP as a host rapidly moving between two ports.
>The MLX is constantly trying to program two separate egress ports with
>the same MAC/IP, so the hardware programming can never “stick”, and the LP
>CPU will try to forward the traffic in software while it is trying to
>program the hardware forwarding table.
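>
>To make that thrashing concrete, here is a minimal Python sketch (purely an
>illustration of the idea, not MLX internals; the MAC and port numbers are
>made up):
>
>fib = {}          # MAC -> egress port; a stand-in for the hardware table
>reprograms = 0
>
>def learn(mac, port):
>    global reprograms
>    if fib.get(mac) != port:   # entry disagrees with what was just seen
>        fib[mac] = port        # re-program the entry
>        reprograms += 1
>
># Two devices answering with the same MAC, one behind 1/1, one behind 2/1
>for _ in range(1000):
>    learn("0000.5e00.0101", "1/1")
>    learn("0000.5e00.0101", "2/1")
>
>print(reprograms)  # 2000 rewrites for 2000 packets; the entry never settles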
>
>If you haven’t changed any devices or re-addressed any multicast hosts,
>then you may have a separate problem triggering this. I would look closely
>at what you’ve set your query interval to. How often are you requiring
>clients to respond to General or Group-Specific Queries? If you have
>a large number of sources and receivers, or if you have clients or streams
>rapidly toggling on and off, this could cause issues.
>
>I would try increasing the query interval to a larger value with:
>
>"ip igmp query-interval 300”
>
>And increase the membership timeout to an even higher value:
>
>"ip igmp group-membership-time 900”
>
>Finally, I would increase the timeout for a client’s response to an IGMP
>Query:
>
>"ip igmp max-response-time 20”
>
>
>This may cause some problems with stale entries if you have lots of
>adds/removes of multicast streams, but it is a useful test. If the query rate
>is contributing to the CPU spikes, you would start to see the LP CPU drop
>within a minute or so after setting this.
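>
>As a rough cross-check on those three values (a minimal sketch; the
>robustness variable of 2 is just the usual IGMP default, not something I
>know about your configuration):
>
>query_interval = 300          # ip igmp query-interval 300
>max_response_time = 20        # ip igmp max-response-time 20
>group_membership_time = 900   # ip igmp group-membership-time 900
>robustness = 2                # assumed default
>
># Roughly the group-membership interval relationship from the IGMP RFCs:
># membership should outlive a couple of missed queries plus the response time.
>minimum_membership = robustness * query_interval + max_response_time
>print(minimum_membership)                            # 620
>print(group_membership_time >= minimum_membership)   # True, so the values fit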
>
>I would also enable “multicast filter” to prevent the flooding of
>multicast IGMP messages to clients that don’t need to see them.
>
>If trying some of this doesn’t help, can you give me some more details on
>your multicast traffic?
>
>How many multicast streams and *,G entries are on your router?
>What type of module (24X1G, 8X10G, etc.) are you using in the MLX?
>What is the average bandwidth of the streams?
>What is the source device of these multicast streams; a hardware or
>software encoder? Is this video?
>
>Hope this gives you a few ideas!
>
>-Wilbur
>
>Wilbur Smith
>SE Ninja, Brocade 
>wilbur.k.smith at gmail.com
>wsmith at brocade.com
>
>
>Disclosure: While I am a Brocade employee, my participation in this
>community is a personal choice and not directed by my employer. Any
>information or recommendations I provide are my own and not an official
>recommendation from Brocade. Sorry folks, just need to make sure you know
>I’m doin’ this “off the clock”!
>
>
>On 11/5/13, 12:57 AM, "Jethro R Binks" <jethro.binks at strath.ac.uk> wrote:
>
>>I think we've probably experienced the issue you described.  We have an
>>MLX core, with other platforms at the distribution layer, and would
>>experience peaks of very high CPU for a second or two at a time, which
>>would disrupt OSPF and MRP at least.  It appears we were running 5.4.0d
>>at the time.
>>
>>However, since upgrading to 5.5.0c I've been working with Brocade on another
>>issue, wherein the standby management card would reset every few minutes
>>or so.
>>
>>Filtering unwanted multicast groups both at the distribution layer, and
>>then later directly at the core interfaces, helped a bit.  However, the
>>most effective fix was to remove "ip multicast-nonstop-routing"; as it was
>>described to me: "The problem is seen when in a specific pattern the
>>outgoing ports for the groups (239.255.255.250) added and removed ...
>>Engineering team performed troubleshooting and determined that for some
>>reason, the OIF tree that is rooted at certain forwarding entries is being
>>corrupted, either in the middle of traversal, or when there is database
>>update."
>>
>>At the moment Brocade are trying to replicate it in their lab environment
>>to work on a fix.  If they sort that, and merge in your defect fix, maybe
>>we'll finally see the back of the CPU/multicast issues we've been plagued
>>with.
>>
>>Jethro.
>>
>>
>>
>>On Tue, 5 Nov 2013, Kennedy, Joseph wrote:
>>
>>> The problem for us was so severe that both MLX MPs were running at 99%
>>> CPU and the LPs were flooding unicast.
>>> 
>>> After a lot of work testing in a lab environment looking for an issue in
>>> multicast routing that fit the symptoms (lol... no, it wasn't easy), I
>>> confirmed that the source of the problem was in 5.2 and above (5.2.00 to
>>> 5.4.00d) and its processing of IGMP reports. Brocade's code updated mcache
>>> entries for every IGMP report even when a matching mcache OIF entry
>>> already existed.
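>>> 
>>> In rough pseudo-Python (a toy illustration of the difference the defect
>>> description points at, not Brocade's actual code; the "ve 10" OIF name is
>>> made up):
>>> 
>>> class McacheEntry:
>>>     def __init__(self):
>>>         self.oifs = set()
>>>         self.updates = 0
>>> 
>>> def process_igmp_report(entry, oif, buggy=False):
>>>     if not buggy and oif in entry.oifs:
>>>         return              # fixed behaviour: matching OIF, nothing to do
>>>     entry.oifs.add(oif)
>>>     entry.updates += 1      # stands in for the re-walk/re-programming work
>>> 
>>> e_fixed, e_buggy = McacheEntry(), McacheEntry()
>>> for _ in range(300):        # 300 members all reporting on the same OIF
>>>     process_igmp_report(e_fixed, "ve 10")
>>>     process_igmp_report(e_buggy, "ve 10", buggy=True)
>>> print(e_fixed.updates, e_buggy.updates)   # 1 vs 300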
>>> 
>>> All updates in a given IGMP query window in the problem code could be
>>> represented as O(M(N^2)), where M is the number of OIFs and N is the
>>> number of group members in a single group. For example, in an
>>> environment with 100 OIFs and 300 group members this equates to
>>> 9,000,000 updates per IGMP query window. This compares to previous
>>> code releases, where the updates could be represented by O(MN), or,
>>> given the same environment values as above, 30,000 updates per query
>>> window.
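>>> 
>>> And as a back-of-the-envelope check of those figures (a small Python
>>> sketch using only the example numbers above, nothing measured on a
>>> router):
>>> 
>>> oifs = 100       # M: outgoing interfaces
>>> members = 300    # N: group members in a single group
>>> 
>>> # Problem code (5.2.00 - 5.4.00d): work grows with M and the square of N
>>> print(oifs * members ** 2)   # 9,000,000 updates per query window
>>> 
>>> # Earlier releases: one pass per report, O(MN)
>>> print(oifs * members)        # 30,000 updates per query window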
>>> 
>>> Many may not have noticed the issue because they don't have a large
>>> number of OIFs or a large number of group members in a single group.
>>> Some may have run into this previously and just filtered the UPnP/SSDP
>>> IPv4 group (239.255.255.250) to resolve it. If you are running PIM-SM,
>>> have upgraded to 5.2.00 or above and afterwards noted periods of
>>> abnormally high MP/LP CPU, or you attempted the upgrade but had to
>>> revert due to high MP CPU usage and unicast flooding (as we were
>>> seeing), then this may be the root of your issue.
>>> 
>>> After reporting the problem to Brocade, they provided a fix build and
>>> incorporated the fix into 5.4.00e. This problem "should be" resolved in
>>> 5.4.00e. The problem is not specific to running PIM-SM with VRFs.
>>> 
>>> Related closed defect information from 5.4.00e:
>>> 
>>> Defect ID: DEFECT000468056
>>> Technical Severity: Medium
>>> Summary: High MP CPU utilization from IGMP reports after upgrade
>>> Symptom: After upgrading from 4.x to 5.4, high CPU utilization from IGMP
>>> reports in VRF
>>> Feature: IPv4-MC PIM-SM Routing
>>> Function: PERFORMANCE
>>> Reported In Release: NI 05.4.00
>>> 
>>> --JK
>>> 
>>> We have seen issues when our MLXes receive multicast traffic for which
>>> there have been no IGMP join messages sent (on edge ports).  I'm
>>> assuming that not getting any PIM joins would have the same effect.
>>> There are some applications that do not send IGMP messages if they
>>> expect their traffic to remain on the same L2 domain.  Apparently if the
>>> MLX doesn't have an entry for it, it punts it to the LP CPU.
>>> 
>>> To get an idea of which traffic is hitting the CPU, you can connect to
>>> the LP (rconsole <slot_number>, then enable) and run 'debug packet
>>> capture'.  That will show you a few packets as they hit the LP CPU, and
>>> should at least tell you the source IP, interface, and multicast group
>>> for the offending traffic.
>>> 
>>> HTH,
>>> 
>>> --
>>> Eldon Koyle
>>> --
>>> BOFH excuse #319:
>>> Your computer hasn't been returning all the bits it gets from the
>>> Internet.
>>> 
>>> On Jun 03 10:32 -0400, Walter Meyer wrote:
>>> > We are seeing high CPU on our LPs after upgrading from 4001a to 54c
>>> > on two MLXs.
>>> >
>>> > We are using PIM-SM and the mcast process is using a large amount of
>>> > LP CPU, but only after the upgrade. We were stable on the same config
>>> > prior to the upgrade. Also, the MLX that is the RP for networks with a
>>> > large number of multicast streams is the one that has a high CPU. The
>>> > other core doesn't have an issue (aside from being unstable because of
>>> > the other MLX with high CPU). We are pretty sure it has something to
>>> > do with multicast routing; we just can't figure out why.
>>> >
>>> > We do have a large number of group/OIF entries spanning multiple
>>> > physical ints and ves, but this shouldn't be an issue because of the
>>> > OIF optimization feature on the platform...right? On 4001a and 54c we
>>> > have a shareability coefficient / optimization of 98%...So it doesn't
>>> > seem like a resource problem...But we can't figure out why the traffic
>>> > is hitting the CPU.
>>> >
>>> > Has anyone seen mcast problems after upgrading or have any
>>> > troubleshooting tips?
>>> 
>>> 
>>
>>.  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
>>Jethro R Binks, Network Manager,
>>Information Services Directorate, University Of Strathclyde, Glasgow, UK
>>
>>The University of Strathclyde is a charitable body, registered in
>>Scotland, number SC015263.
>
>



