[c-nsp] Cisco 3550-12G VSI stops routing traffic

Tue Apr 22 15:11:45 EDT 2008

Hey guys,
I've run into a ridiculous problem that has me completely stumped.

Network is a standard edge/core/distribution/access network composed of
7206s, 6509s with Sup720-3BXLs, 3550s & 3750s, and 3550s/2950s, respectively.
Distribution is pure OSPF, with 226 routes currently in area 0, while the
cores & edges run full-mesh BGP. The cores originate defaults for the
distribution layer; the distribution layer carries all of the customer
gateways and advertises those networks into OSPF.

The distribution 3550-12G in question is running
c3550-ipservices-mz.122-25.SEB4.bin. It's configured with 22 VSIs, carries
all of Area 0 (226 routes), and has 354 MAC addresses listed and just shy of
300 ARP entries. Average traffic through the switch is approximately
120 Mbps. Not very loaded.

This switch randomly stopped routing traffic to two completely separate
VSIs (vlan 602 & vlan 149). These two VLANs are attached to the same port
and downstream access switch: G0/4 and a 2960. The Internet can see the VSI
IP addresses without issue, OSPF still advertises the routes without issue,
everything is great up to the switch. Hosts attached to the 3550-12G are
able to see their appropriate VSI gateway IP, but cannot see anything past
it. Attached hosts are, however, able to see all of the other 21 VSI IP
addresses on the switch -- just nothing off of the switch. No traffic is
able to pass from off-switch/Internet to affected attached hosts, period.

Resolution was to move the VSI/customer gateway to a different distribution
switch. Although the affected/broken 3550-12G is still in the switching
path, it does Layer 2 forwarding without issue -- those 2 VSIs simply
stopped routing traffic.

So this morning, we lost two more networks: the primary and secondary IP
addresses on a VSI for a completely different customer (vlan 609). On a lark,
I cleared the ARP cache and the two networks came back, but two other,
different VSIs went down (vlan 122 & vlan 167)!

The only thing that all of the VSIs have in common is that they are all
servicing customers attached to the 3550-12G's port G0/4. As mentioned
earlier, there was a 2960 switch attached to G0/4, which has been replaced
to no avail. Host configuration on the affected VSIs makes no difference -
swapping in different servers, my laptop, etc., all yield the same problem.
However, as of right now, if I plug my laptop into an access switch on G0/7
configured for the same now-broken vlan 167, it works just fine. It's almost
as if only the VSIs dealing specifically with G0/4 were having problems.

Fearing a broken G0/4 <-> 2960 trunk, I've reduced the port config to 4
lines, with no change in service:
!
interface GigabitEthernet0/4
 description down_acc12.fac01.cos
 switchport trunk encapsulation dot1q
 switchport mode trunk
 load-interval 30
!

If I move the VSI & gateway to a different distribution switch, it works fine.
If I move the access switch to a different port, it works fine. I have not
reloaded the switch yet, as there is some other stuff on there that I don't
want to incur 3-4 minutes of downtime on -- but I fear that the problem may
jump and cause more harm. Am I dealing with a randomly screwed up G0/4
that's smoking VSIs (how?), a buggy IOS that does this, or ???  I've been
searching the Internet the world over and would love to hear some ideas and
anecdotes.

Thanks for reading my wall of text,
Randal