[j-nsp] MC-LAG to EVPN migration triggering filter config bug?

Mon Feb 17 16:12:37 UTC 2025

Hi everyone!

Possible filter programming bug?

Environment:

MC-LAG pair of QFX5120-48YM switches, all hosts attached with LAG/LACP.
One VLAN with an IRB on the switches for routing.

We are trying to convert this into a collapsed-core EVPN setup while
the site stays in service. The same procedure worked at another site,
but that one carried almost no traffic. At this site we ran into
problems.

Simplified plan:
- disable all host interfaces on node 2
- convert config on node 2 to EVPN-based instead of MC-LAG-based. Reboot 
node 2.
- convert member interface of old ISL/ICL-link to be L2 trunk to carry 
cross-switch traffic
- one host at a time, disable link on node 1 and enable on node 2
instead
- once all hosts are moved, tear down temporary L2 trunk, convert node 1 
to EVPN and reboot.
- everybody happy

In reality we hit strange behaviour. When troubleshooting we discovered
that we could not ping between node 1 and node 2 via the temporary
L2 trunk. Each IRB has its own unique unicast IP address, but nothing
appeared in the ARP table, and even the Ethernet switching table was
suspiciously empty.
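
For reference, the checks on each node amount to something like the
commands below. This is only a sketch: ae0 is a placeholder for the
temporary trunk, not the real interface name, and no output is shown
because I am not reproducing our actual tables here.

```
# ARP entries learned by the switch (no reverse DNS lookups)
show arp no-resolve

# MAC addresses learned on the temporary trunk (ae0 is a placeholder)
show ethernet-switching table interface ae0
```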

Out of good ideas (later, in a lab setup), we removed the lo0 input
filter protecting the RE. After that, it started working the way it
should have from the beginning!

The lo0 input filter is applied only to family inet, so it should not
be able to influence MAC learning or ARP (L2 functions).
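
For context, the filter is attached in the usual way, roughly as
sketched below. The filter name PROTECT-RE and the use of unit 0 are
placeholders/assumptions for illustration, and the filter terms
themselves are omitted.

```
interfaces {
    lo0 {
        unit 0 {
            family inet {
                filter {
                    /* PROTECT-RE is a placeholder name */
                    input PROTECT-RE;
                }
            }
        }
    }
}
```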

Questions:

- Is it possible that interface programming related to MC-LAG ISL/ICL 
(no mac-learning, no normal ARP handling) could have been left on the 
interface I repurposed to a temporary L2-trunk?
- If the above is possible: is there a way to "flush" the interface
programming in a case like this?

The obvious solution of rebooting the node is unfortunately not
possible: several VMware clusters run their vSAN backend through the
switch, so planned downtime is not really an option.

/Per