[c-nsp] 6500 router hangs (IPV4 routing slows to a crawl) when IPV6 routing is enabled with VRFs.

Tue Jun 12 10:21:26 EDT 2012

I originally posted this on the IPV6-Ops mailing list, but it now seems to
be more of a switching issue than IPV6 protocol related.

Background:

Our enterprise backbone network has two 6500s with Sup720XLs which connect
to our 3 major ISPs at 10 Gb/s. We call these the Internet Hubs. They are
running SXI5 IOS and are configured for BGP (full table), Internet IPV4
multicast routing, and EIGRP as the IGP. They have been running both IPV4
and IPV6 in dual-stack mode with no problems for over a year.

These two routers connect to our Enterprise Edge routers (also 6500s with
Sup720XL-10G). They are running SXJ1 IOS code and house several VRFs,
mostly for guest networks. One of the VRFs is used for “outside” traffic. A
pair of Cisco ASAs connect the “outside VRF” and the “inside” global
routing tables. The ASAs are EIGRP neighbors of the router to learn about
IPV4 “inside” networks. These routers also do MPLS VPNs to connect to
various guest networks on different campuses as well as some other DMZ
stuff. We also have several outside partners connecting to these routers.

The “edge” routers connect to the Enterprise Core routers which route to
various campuses over a large DWDM Ethernet MAN/WAN.

The Problem:

The problem occurred when we tried to enable IPV6 routing on the edge
routers. We have narrowed the scenario down to these conditions:

1) “mls ipv6 vrf”, “ipv6 address-family” added to one or more VRF
definitions.

2) The “outside” VRF table holds the full Internet table + EIGRP routes to
local “outside” devices/subnets.

3) IPV4 BGP session to a neighbor is open and operational and sharing the
“outside” VRF.

4) No other IPV6 configuration has been entered yet.

When “ipv6 unicast-routing” is entered the following happens:

1) EIGRP & BGP neighbors drop on interfaces with BFD enabled. (We have
since removed BFD.)

2) Traffic through the router drops to a crawl (0-2000 bps). ICMP doesn’t
seem affected, but I’m not pushing that much ICMP.

3) The SP CPU goes to nearly 100%.

4) Most of the interface traffic is routed to the RP (confirmed by ERSPAN).

5) Telnet connections to the router don’t drop and EIGRP neighbors stay
connected.

This slowness isn’t the same as when BGP is first enabled and is loading
routes. It’s much worse: traffic throughput almost stops!

Both times we tried enabling IPV6 during a change window, it brought all
Internet connectivity to a halt. I think this is due to the neighbor
relationships staying up and the router acting as a “black hole”. We have
been able to duplicate the issue in a lab. At first we just duplicated the
hardware and configuration and all seemed OK, which is why we made the
2nd attempt with Cisco TAC and our senior engineers on hand. It turns out
you need to be pushing data through the router to see the problem. In the
lab I have 3 sessions pushing from the “outside” and 3 from the “inside”.
One session is doing ICMP pings to a host beyond the router, the 2nd is
doing TFTP GETs (UDP port 69), and the 3rd is doing HTTP GETs (TCP port 80)
using “curl” scripts.
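
A minimal sketch of the HTTP GET leg of such a load test, written in Python
rather than the “curl” scripts mentioned above (which are not shown in this
post). The function names, the URL, and the sampling interval are all
hypothetical; the point is only to log per-request throughput so that a
drop toward the 0-2000 bps range stands out in the samples.

```python
import time
import urllib.request


def throughput_bps(num_bytes, seconds):
    """Convert a byte count over an interval into bits per second."""
    if seconds <= 0:
        raise ValueError("interval must be positive")
    return num_bytes * 8 / seconds


def fetch_once(url, timeout=5.0):
    """Fetch url once and return (bytes_received, elapsed_seconds)."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read()
    return len(body), time.monotonic() - start


def run(url, interval=1.0, count=10, fetch=fetch_once):
    """Issue `count` GETs and return one throughput sample (bps) per request.

    A request that fails or times out is recorded as 0.0 bps, so a
    forwarding stall shows up as a run of zeros or very small values.
    """
    samples = []
    for _ in range(count):
        try:
            nbytes, elapsed = fetch(url)
            samples.append(throughput_bps(nbytes, elapsed))
        except OSError:  # URLError and socket timeouts are OSError subclasses
            samples.append(0.0)
        time.sleep(interval)
    return samples
```

Calling run() with the address of a lab web server on the far side of the
router (a placeholder here, since the lab addressing is not given) before,
during, and after entering the command would give three sets of samples to
compare.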

In the lab, the “slowness” lasts almost 2 minutes. During that time there
is no unusual traffic (i.e. BGP scanning or reloads) and no CPU process
rises to any noticeable level. Nothing gets logged. The only thing I
noticed is that the SP CPU goes to 100% and the RP starts getting flooded
with traffic from most interfaces. When we tried it in production it lasted
over 4 minutes, so we pulled the plug and removed the changes. The “problem”
happens each time the command is entered OR removed. Also it doesn’

FIB TCAM maximum routes :   (BGP routes in table = 408K)
=======================
Current :-
-------
 IPv4 + MPLS         - 512k (default)
 IPv6 + IP Multicast - 256k (default)
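
A quick arithmetic check of the figures above, as a sketch only. It assumes
the 408K BGP routes are the main consumer of the IPv4 + MPLS partition; the
EIGRP routes and MPLS labels mentioned earlier would add to that by an
amount not given here. The function and constant names are made up for
illustration, and counts are kept in units of K entries exactly as quoted.

```python
def partition_usage(entries_k, capacity_k):
    """Return (free_k, percent_used) for a FIB TCAM partition, in K entries."""
    if capacity_k <= 0:
        raise ValueError("capacity must be positive")
    free_k = capacity_k - entries_k
    percent_used = 100.0 * entries_k / capacity_k
    return free_k, percent_used


BGP_ROUTES_K = 408          # "BGP routes in table = 408K" (from the output above)
IPV4_MPLS_CAPACITY_K = 512  # "IPv4 + MPLS - 512k (default)"

free_k, pct = partition_usage(BGP_ROUTES_K, IPV4_MPLS_CAPACITY_K)
print(f"IPv4 + MPLS: {pct:.1f}% used, {free_k}K entries free")
# prints "IPv4 + MPLS: 79.7% used, 104K entries free"
```

On these numbers alone the IPv4 partition is not full, though this says
nothing about how the IPv6 + IP Multicast partition is used once IPV6
routing is enabled.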

The line cards in the production routers have 1 GB of RAM and are XL versions.

Cisco TAC hasn’t been too helpful on this one. I’m looking for any ideas on
how to determine the problem or its cause, or how to live with it. I figure
we could disable IPV4 routing temporarily, enable IPV6 routing, then
restart IPV4 routing, or just reload the router with the IPV6 commands
preloaded, but that seems like a hack to me, and I don’t know if this
problem will bite me in the ass later if we don’t better understand why
it is happening.

Any suggestions appreciated,

-Jim