[c-nsp] N3K: "VPC peer keep-alive receive has failed"

Thu Dec 27 05:57:06 EST 2018

Hi,

I have a strange problem with Nexus N3K switches and a QinQ tunnel.

I've configured two Nexus 3064 switches with vPC. This has worked well for months.

Recently I added a port-channel in dot1q-tunnel mode (the first one in this
mode).
Since then I get this message:
"%$ VDC-1 %$ %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 1, VPC peer
keep-alive receive has failed" multiple times a day on both switches.

Details:
  BIOS: version 4.1.0
  NXOS: version 7.0(3)I6(1)

  new interface & port-channel:

	interface Ethernet1/35
	  switchport mode dot1q-tunnel
	  switchport access vlan 72
	  spanning-tree port type edge
	  speed 10000
	  channel-group 1035

	interface port-channel1035
	  switchport mode dot1q-tunnel
	  switchport access vlan 72
	  speed 10000
	  vpc 1035

   A "sh vlan id 72" only reports the peer-link ports/port-channels and
   eth1/35 / po1035.

   For the moment this tunnel has no other end.

   The messages appear at different times on each switch (i.e. not at the
   same time on both switches) and not the same number of times per day. For
   example, today: 3 on one switch, 6 on the other.
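
   To compare the counts between the two switches, the messages could be
   tallied per switch and per day from a collected syslog file. The following
   is only a minimal sketch: the log line layout (date, time, hostname, then
   the message), the hostnames n3k-a / n3k-b and the sample lines are
   assumptions for illustration, not taken from the actual logs.

```python
import re
from collections import Counter

# Assumed syslog line layout: "YYYY Mon DD HH:MM:SS hostname <message>".
# Adjust the pattern if the collector writes a different prefix.
LINE_RE = re.compile(
    r"^(?P<day>\d{4} \w{3} +\d{1,2}) \d{2}:\d{2}:\d{2} (?P<host>\S+) .*"
    r"%VPC-2-PEER_KEEP_ALIVE_RECV_FAIL"
)


def count_keepalive_failures(lines):
    """Return a Counter keyed by (host, day) of keepalive receive failures."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[(match.group("host"), match.group("day"))] += 1
    return counts


# Hypothetical sample lines (hostnames and timestamps are made up).
MSG = ("%$ VDC-1 %$ %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 1, "
       "VPC peer keep-alive receive has failed")
sample = [
    "2018 Dec 27 02:14:09 n3k-a " + MSG,
    "2018 Dec 27 04:40:51 n3k-b " + MSG,
    "2018 Dec 27 05:57:06 n3k-a " + MSG,
    "2018 Dec 27 06:00:00 n3k-a %ETHPORT-5-IF_UP: Interface Ethernet1/35 is up",
]

counts = count_keepalive_failures(sample)
for (host, day), n in sorted(counts.items()):
    print(f"{day}  {host}: {n}")
```

   Lines for other messages are ignored, so the whole log file can be fed in
   as-is.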

   The load on the switches seems the same as before this new port-channel,
   and there's no load peak around the message date/time (Cacti 5-minute
   measurements).

   When I shut the port, the messages no longer appear. When I re-enable it,
   they come back.

   I've tried changing the keepalive parameters:
	--Keepalive interval            : 500 msec
	--Keepalive timeout             : 10 seconds
	--Keepalive hold timeout        : 6 seconds
   but the result is the same.

   The keepalive link is on a dedicated 2-port port-channel; the IPs are set
   directly on the port-channel, in a VRF.

   1st switch:
	vpc domain 1
	  role priority 1
	  peer-keepalive destination 10.0.6.3 source 10.0.6.2 vrf pkal \
	     interval 500 timeout 10 hold-timeout 6
	  peer-gateway
	  auto-recovery
	  ipv6 nd synchronize
	  ip arp synchronize

   2nd switch:
	vpc domain 1
	  role priority 2
	  peer-keepalive destination 10.0.6.2 source 10.0.6.3 vrf pkal \
	     interval 500 timeout 10 hold-timeout 6
	  peer-gateway
	  auto-recovery
	  ipv6 nd synchronize
	  ip arp synchronize

   There's nothing in the logs except the "receive has failed" message.

   There are no errors on the keepalive interfaces.

   On Cacti, I just notice a small drop in outgoing traffic on the keepalive
   ports around the time the messages appear, so it seems to be a transmit
   problem rather than a receive problem.

   If I configure two other N3Ks with the same configuration (back-to-back
   configuration) for the other end of the tunnel and propagate VLAN 72 toward
   them, I start getting the same message on those switches, even if the
   QinQ port on them is down. If I stop propagating the VLAN toward them, the
   messages stop on these two switches (but continue on the first two).

   Any ideas?

Manuel 

--
______________________________________________________________________
Manuel Guesdon - OXYMIUM