[f-nsp] Multicast is being switched by LP CPU on MLXe?

Alexander Shikoff minotaur at crete.org.ua
Mon Dec 26 13:51:38 EST 2016


Hello!



On Tue, Dec 20, 2016 at 08:31:25PM +0000, Wilbur Smith wrote:
> Multicast is always loads of fun to debug! Not really though. 
> 
> So it looks like your upstream querier (router) for VLAN 450 is hanging off of e9/5, while the multicast source is on e11/12 and the subscriber to the multicast group is on e7/1. Does this sound correct?
> 
> If you're seeing high LP usage and suspect cast traffic is being processed by the LP and not the hardware, this normally tells me the hardware tables are being repeatedly reprogrammed or there is some table churn happening somewhere. It's normal to see a burst of high LP CPU in the instant the initial hardware programming takes place, but this should only last 1 second or so before the FPGA takes over.
> 
> I've dealt a lot with this in the past and here are some things that I've seen cause this:
> 
> 1) A hardware defect on the LP. This is very rare and when this does happen on routers with supported code normally you would also see sys-mon alerts in the logs
> 
> 2) A duplicate multicast group address or source IP is used for two different cast streams. Example would be for encoding devices both using the same group or even the same IP address on their interface. Even if they were in separate VLANs this would cause issues because it triggers the MLXe to constantly update the FID entry on the LP. I've seen this a few times in the past.
> 
> 3) Duplicate IP or MAC on two subscribers. This is more common than you think with Linux host-based receivers or a semi-custom device that is manually programmed. Sometimes a sys-admin copies the ifcfg file between two hosts and forgets to delete the MAC from the file. I've also seen hardware-based devices where the IP address and/or MAC is flashed onto the device and an older config is re-used, causing a duplication.
> 
> 4) Something going on in our code. If you have a LAG between the two routers, try disabling all but one of the ports if possible. Bump the last port to make sure the table entries are reset and see if the issue still happens. If so, contact TAC because there may be something going on with cast FID programming on the LAG.
>  
> If possible, also try moving both the source and receiver to the same MLXe in a separate VLAN and just set that VLAN to 'multicast active'; make sure there's a VE with an IP on local VLAN though. 
> 
> With multicast active, the router provides the same IGMP messaging that's needed for IGMP Snooping or Multicast Passive to work, but without needing to run PIM. It can be a good way to confirm if this is an issue with PIM or a lower level issue. 
> 
> I can tell you that this does normally work very well on the MLX. I have some customers pushing 3.2 Terabits of multicast on an MLXe-16, all with LPs in the 1-3% CPU range. The trick is to figure out what's causing table churn at the LP level and preventing the hardware table from being used.

Dear Wilbur,

Apologies for the delay in replying.
Thank you for the many clues; I need some time to check them.

Meanwhile, I've discovered the same problem in a different VLAN
with a simpler configuration.

VLAN 779 consists of three ports. No LACP LAGs and no special
IP multicast configuration:

telnet at lsr1-gdr.ki#show vlan 779

PORT-VLAN 779, Name V779_Cosmonova_Multicast, Priority Level 0, Priority Force 0, Creation Type STATIC
Topo HW idx    : 65535    Topo SW idx: 257    Topo next vlan: 0
L2 protocols   : NONE
Statically tagged Ports    : ethe 1/2 ethe 10/2 ethe 12/3 
Associated Virtual Interface Id: NONE
----------------------------------------------------------
Port  Type      Tag-Mode  Protocol  State     
1/2   PHYSICAL  TAGGED    NONE      FORWARDING
10/2  PHYSICAL  TAGGED    NONE      FORWARDING
12/3  PHYSICAL  TAGGED    NONE      FORWARDING
Arp Inspection: 0
DHCP Snooping: 0
IPv4 Multicast Snooping: Disabled
IPv6 Multicast Snooping: Disabled

No Virtual Interfaces configured for this vlan

telnet at lsr1-gdr.ki#show run | b 779
vlan 779 name V779_Cosmonova_Multicast 
 tagged ethe 1/2 ethe 10/2 ethe 12/3 
!

Port 12/3 here is connected to the multicast source. Ports 1/2 and 10/2
are connected to multicast customers. There is neither IGMP nor PIM,
just some static multicast groups.

And I see that multicast packets in this VLAN are being switched
by the LP CPU:

LP-12#debug packet capture rx max 2
Rx capture enabled, maximum capture count 2.

[ppcr_rx_packet]: Packet received
Time stamp : 13 day(s) 14h 44m 53s:,
TM Header: [ 0564 eb24 0000 ]
Type: Fabric Unicast(0x00000000) Size: 1380 Class: 7 Src sys port: 2852
Dest Port: 0  Drop Prec: 0 Ing Q Sig: 0 Out mirr dis: 0x0 Excl src: 0 Sys mc: 0
**********************************************************************
Packet size: 1374, XPP reason code: 0x000682b5
00: 05f0 c403 5c50 e30b-8481 fffe 0e00 0000  FID     = 0x05f0
10: 0100 5e00 0083 f8c0-01e0 0a28 0800 4500  Offset  = 0x10
20: 0540 0000 4000 1e11-fc4b 0a31 6cad e400  VLAN    = 779(0x030b)
30: 0083 e1bd 04d2 052c-3c4a 4700 3113 4a9e  CAM     = 0x00ffff(R)
40: d024 f70f 5ee4 5e6b-389c f6a8 0bc4 f163  SFLOW   = 0
50: f292 9e41 fdb0 282b-4af5 bb2f f9ab 7543  DBL TAG = 0
60: abcc 6e4f b66b 0846-f71e acc1 a676 614d
70: c1ab 018c 8e9f 5c16-0490 e345 9fb9 be74
Pri CPU MON SRC   PType US BRD DAV SAV DPV SV ER TXA SAS Tag MVID
7   0   0   12/3  3     0  1   0   1   1   1  0  0   0   1   0

10.49.108.173 -> 228.0.0.131 UDP [57789 -> 1234] 
**********************************************************************
[ppcr_tx_packet] ACTION: Forward packet using fid 0xa018
Packet capture reached the limit. 2Issue the command again to activate it.
Packet Capture Disabled.
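
For reference, the header fields in the dump above can be checked by hand. Below is a minimal Python sketch that decodes the 42 bytes starting at Offset = 0x10. It assumes the frame at that offset is untagged Ethernet II followed by an IPv4 header without options (first byte 0x45) and a UDP header; the function and variable names are only illustrative and have nothing to do with the MLXe software.

```python
import ipaddress
import struct

# Bytes 0x10..0x39 copied from the capture above (the "-" separators removed).
FRAME_HEX = (
    "01005e000083f8c001e00a2808004500"
    "0540000040001e11fc4b0a316cade400"
    "0083e1bd04d2052c3c4a"
)


def dotted_mac(raw: bytes) -> str:
    """Format a 6-byte MAC in the xxxx.xxxx.xxxx style used in this thread."""
    h = raw.hex()
    return ".".join(h[i:i + 4] for i in range(0, 12, 4))


def decode(frame: bytes) -> dict:
    """Decode Ethernet II + IPv4 (no options) + UDP headers from raw bytes."""
    dst_mac, src_mac, ethertype = struct.unpack("!6s6sH", frame[:14])
    (ver_ihl, _tos, total_len, _ident, _frag,
     ttl, proto, _csum, src_ip, dst_ip) = struct.unpack("!BBHHHBBH4s4s", frame[14:34])
    if ver_ihl != 0x45:
        raise ValueError("sketch only handles IPv4 without options")
    sport, dport, _udp_len, _udp_csum = struct.unpack("!HHHH", frame[34:42])
    return {
        "dst_mac": dotted_mac(dst_mac),
        "src_mac": dotted_mac(src_mac),
        "ethertype": ethertype,
        "ip_total_len": total_len,
        "ttl": ttl,
        "proto": proto,
        "src_ip": str(ipaddress.IPv4Address(src_ip)),
        "dst_ip": str(ipaddress.IPv4Address(dst_ip)),
        "sport": sport,
        "dport": dport,
    }


fields = decode(bytes.fromhex(FRAME_HEX))
for name, value in fields.items():
    print(f"{name:13} {value}")
```

The decoded addresses and ports agree with the summary line the LP prints (10.49.108.173 -> 228.0.0.131 UDP [57789 -> 1234]).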


As I understand it, this should not happen.
The packet arrives on port 12/3 with src MAC f8c0.01e0.0a28 and dst
multicast MAC 0100.5e00.0083, and it should simply be flooded out of
ports 1/2 and 10/2. Instead, it is going to the LP CPU.
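
As a sanity check that the destination MAC really belongs to group 228.0.0.131, here is a small Python sketch of the standard IPv4 multicast to Ethernet mapping (the 01:00:5e prefix plus the low 23 bits of the group address). The function name multicast_mac is made up for this example.

```python
import ipaddress


def multicast_mac(group: str) -> str:
    """Map an IPv4 multicast group to its Ethernet MAC (xxxx.xxxx.xxxx form)."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    # Fixed 01:00:5e prefix, 25th bit zero, then the low 23 bits of the group.
    mac = (0x01005E << 24) | (int(addr) & 0x7FFFFF)
    h = f"{mac:012x}"
    return ".".join(h[i:i + 4] for i in range(0, 12, 4))


print(multicast_mac("228.0.0.131"))    # the group seen in the capture
print(multicast_mac("228.128.0.131"))  # a different group with the same low 23 bits
```

Both calls give 0100.5e00.0083. Only 23 bits of the group survive the mapping, so 32 different groups share each multicast MAC, and a match on the MAC alone does not identify the group uniquely.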

What's wrong with such a simple configuration?

-- 
MINO-RIPE

