[c-nsp] Unique issue which is not making any sense, maybe not even Cisco related...

Sun Mar 2 21:03:57 EST 2014

First off, please excuse me if some of this does not make sense. I am 48 hours into this working day and have only had about a 2 hour nap so far...

I currently have 2 Cisco 6500 switches configured as the Layer 3 core within my production network.  They are configured with about 30 SVIs in a single VRF.  Each SVI has an HSRP version 1 configuration.  We are going through a project to migrate to Nexus 7000 devices at the Layer 3 core, as well as to replace our aging 6500s and 4500s with Nexus 5500 server switches for client connectivity.  Everything has been going exceptionally well; however, I have 1 oddity which caused a production outage today, in a way I have never seen before.  Currently all SVIs fall into a single VRF on the Nexus core as well.

Last night I migrated 6 SVIs from the 6500s to the Nexus.  All of them worked exactly as expected, except for 1 network, and specifically 1 pair of devices.  Configured on this VLAN I have a pair of F5 load balancers.  These load balancers exist in 6 different networks.  They appeared to have issues only in this single network.  All of these Cisco devices are behind a firewall.  There is an "Edge" network in place.  The legacy 6500s are configured in this network as .2 and .3 with .1 as the HSRP IP.  The new Nexus equipment is configured as .5 and .6 with .2 as the HSRP IP.  The firewall is .11.

Each of the 4 Cisco devices has a default route pointed to .11.  There is no dynamic routing at this point beyond EIGRP between all 4 devices, redistributing connected subnets.  The 6500s do not support VSS, so they are standalone devices.  The Nexus devices are configured in a vPC domain.  Uplinks through the production network are double-sided vPCs from the Nexus 7000 core to the Nexus 5000 distribution.  I am migrating from HSRP version 1 to HSRP version 2 to allow for more HSRP instances in the future.  I have a large number of additional networks that need to be spun up soon, and I figured I would do it right the first time...

This is where I am having trouble.  This network is fully integrated and has been working for about 2 months without any issue.  About 75% of our network and server infrastructure is already migrated onto the Nexus infrastructure, including several Layer 3 FHRP configurations.  Here is a snippet of the existing 6500 config.

## 6500 Core 01

interface Vlan44
ip address 192.168.44.3 255.255.252.0
no ip redirects
no ip unreachables
no ip proxy-arp
standby 83 ip 192.168.44.1
standby 83 timers 1 3
standby 83 priority 125
standby 83 preempt
standby 83 track 1 decrement 50
arp timeout 240
end

## 6500 Core 02

interface Vlan44
ip address 192.168.44.4 255.255.252.0
no ip redirects
no ip unreachables
no ip proxy-arp
standby 83 ip 192.168.44.1
standby 83 timers 1 3
standby 83 priority 90
standby 83 preempt
standby 83 track 1 decrement 50
arp timeout 240
end

The new Nexus configuration lines up very similarly.  It does not include the HSRP track configuration yet.  We are changing a large amount of the topology, and I did not implement tracking this evening because I did not want anything unexpected popping up.

## Nexus Core 01

interface Vlan44
  no ip redirects
  ip address 192.168.44.3/22
  hsrp version 2
  hsrp 44
    preempt
    priority 125
    timers  1  3

## Nexus Core 02

interface Vlan44
  no ip redirects
  ip address 192.168.44.4/22
  hsrp version 2
  hsrp 44
    preempt
    priority 90
    timers  1  3
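
For anyone comparing the two snippets above: the HSRP group changes from 83 to 44 and the version from 1 to 2, so the default virtual MAC behind the gateway address changes as well.  The sketch below applies the well-known default IPv4 HSRP vMAC patterns (0000.0c07.acXX for version 1, 0000.0c9f.fXXX for version 2).  The function name hsrp_virtual_mac is hypothetical and used only for illustration, and this is not meant as a diagnosis of the outage, just a way to see which MACs to expect in ARP and MAC tables.

```python
def hsrp_virtual_mac(version: int, group: int) -> str:
    """Return the default IPv4 HSRP virtual MAC in Cisco dotted notation.

    Hypothetical helper for illustration; assumes no manually configured
    'standby mac-address' / 'mac-address' override on the interface.
    """
    if version == 1:
        # HSRPv1: group is 8 bits (0-255), appended as two hex digits.
        if not 0 <= group <= 255:
            raise ValueError("HSRPv1 group must be 0-255")
        return f"0000.0c07.ac{group:02x}"
    if version == 2:
        # HSRPv2: group is 12 bits (0-4095), appended as three hex digits.
        if not 0 <= group <= 4095:
            raise ValueError("HSRPv2 group must be 0-4095")
        return f"0000.0c9f.f{group:03x}"
    raise ValueError("version must be 1 or 2")


# Groups taken from the configs above.
print(hsrp_virtual_mac(1, 83))  # 6500s:  0000.0c07.ac53
print(hsrp_virtual_mac(2, 44))  # Nexus:  0000.0c9f.f02c
```

A host that still holds an ARP entry for the old virtual MAC after the cutover would keep sending to an address nobody answers for, which is one of the things worth ruling out when checking ARP tables.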

Like I said, this is where it gets weird.  When I move from the 6500s to the Nexus, everything looks fine except for the pair of load balancers.  They are configured as 192.168.44.35 and 192.168.44.36, with about 90 VIPs through the network.  When HSRP is on the Nexus, traffic from all VLANs on the Nexus passes properly to the load balancer VIPs, with the exception of traffic sourced from the Edge VLAN, whether from other devices in that network or from behind the firewall in that network.

Here is where it gets really weird... some devices have functional access.  I have 2 workstations on my desk, one identified as .16 and the other as .17.  One of them can get to the F5 devices; the other cannot.  I tested from about 10 other points, and they are about 50/50 for functionality.  Again, this affects only the 2 load balancers.  Packet captures from them show that they appear to be seeing both the physical MACs and the HSRP MAC when connecting to the HSRP VIP, and tripping up on it.  I think...  Nothing else seems to be having issues with it, just these 2 devices...
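
When going through a capture like the one described above, it can help to label each source MAC seen for the gateway IP as an HSRP version 1 virtual MAC, an HSRP version 2 virtual MAC, or something else (most likely a burned-in interface address).  The following is a minimal sketch under that assumption; classify_gateway_mac is a hypothetical helper, and the sample addresses are made up for illustration rather than taken from the actual capture.

```python
import re


def classify_gateway_mac(mac: str) -> str:
    """Label a MAC as a default HSRPv1/HSRPv2 virtual MAC or neither.

    Hypothetical helper; accepts colon, dash, or Cisco dotted notation.
    """
    digits = re.sub(r"[^0-9a-fA-F]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    if digits.startswith("00000c07ac"):
        # Last two hex digits carry the HSRPv1 group number.
        return f"HSRPv1 virtual MAC, group {int(digits[10:], 16)}"
    if digits.startswith("00000c9ff"):
        # Last three hex digits carry the HSRPv2 group number.
        return f"HSRPv2 virtual MAC, group {int(digits[9:], 16)}"
    return "not an HSRP virtual MAC (likely a burned-in address)"


# Made-up sample addresses, NOT values from the real capture.
observed = ["00:00:0c:07:ac:53", "0000.0c9f.f02c", "00:11:22:33:44:55"]
for mac in observed:
    print(mac, "->", classify_gateway_mac(mac))
```

Seeing a version 1 virtual MAC for the old group after the cutover, or replies sourced from a physical address where the virtual MAC is expected, would narrow down which side is holding stale or mixed state.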

I am in the process of replacing these devices with a solution from another vendor, but I am at least 3 months from completion.  Any thoughts on this, or suggestions of where to look beyond HSRP states, ARP tables, and MAC tables?  If additional information is required, please let me know...

Thanks,
Blake