[c-nsp] 6500 router hangs (IPV4 routing slows to a crawl) when IPV6 routing is enabled with VRFs.

Tue Jun 12 11:51:58 EDT 2012

Hey Jim,

Some things / guesses off the top of my head:

BFD on the cat6k/720 is implemented centrally.  In practice on this
platform I think it causes more outages than it is supposed to fix.

Are you really, really sure you are not out of fib space with your v4 
full table plus mpls labels?  (Check "sh plat hard cap").  Putting the 
dfz in a vrf may not be a good idea anyway.  
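
A rough sketch of the checks I mean, assuming 12.2SX-era syntax on a
Sup720 (output format and availability vary by release):

```
! FIB TCAM usage per protocol, plus adjacency usage
show platform hardware capacity forwarding
! How the FIB TCAM is partitioned between v4+MPLS and v6+multicast
show mls cef maximum-routes
! Route counts actually installed in hardware
show mls cef summary
! Whether the FIB has overflowed and fallen into exception state
show mls cef exception status
```

With 408K BGP routes against a 512k IPv4 + MPLS partition, that leaves
roughly 100k entries for labels and everything else, so the first two
are the ones to look at closely.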

You don't have something like ipv6-urpf configured, do you?

When you are actually hitting enter after "ipv6 unicast-routing" you
probably are asking the box to recompute a pile of data structures, assign
labels, pointers, etc.  Some of that is on the RP, and then the SP gets
to do some as well.  The SP then has to commit this (nearly full for
v4) rib to the tcam.   You are seeing the SP cpu hit 100% while this
happens.  If you are monitoring your DFCs you would probably see
activity there, too. 
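
If you want to watch this while it happens, something along these lines
should do, again assuming 12.2SX syntax (the slot number is a
placeholder for whichever DFC-equipped card you care about):

```
! RP CPU
show processes cpu sorted
! SP CPU, run from the RP
remote command switch show processes cpu sorted
! CPU on a DFC-equipped line card; slot 1 is hypothetical
remote command module 1 show processes cpu sorted
```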

During that time everything (well, nearly everything that is still a fib 
miss before the hardware shortcut is installed) is getting punted to the RP.
SPD is helping to bail you out from some of this flooding to keep
priority traffic like hellos up.

Keep in mind that these CPUs are slower than the ones in your previous
cell phone.

I think you have some options like you said:
- wait it out
- reboot w/ v6 enabled.
- if you use hsrp, use preempt delay of 5 min or so.  Even without v6, I
bet your topology converges from a cold start similarly.
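
For the hsrp option, a minimal sketch; the interface and group number
are made up, the delay is in seconds:

```
interface Vlan100
 ! Hypothetical interface and group; 300 s = the 5 min suggested above
 standby 1 preempt delay minimum 300
```

There is also a "reload" keyword on the same command if you only want
the delay to apply after a reboot.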

Dale

Thus spake Jim Trotz (jtrotz at gmail.com) on Tue, Jun 12, 2012 at 10:21:26AM -0400:
> I originally posted this on the IPV6-Ops mailing list, but it now seems to
> be more of a switching issue than IPV6 protocol related.
>
> Background:
>
> Our enterprise backbone network has two 6500s with Sup720XLs which connect
> to our 3 major ISPs at 10 Gb/s. We call these the Internet Hubs. They are
> running SXI5 IOS and are configured for BGP (full table), Internet IPV4
> Multicast routing and EIGRP for IGP. They have been running both IPV4 & IPV6
> in a dual stack mode with no problems for over a year.
>
> These two routers connect to our Enterprise Edge routers (also 6500s with
> Sup720XL-10G). They are running SXJ1 IOS code and house several VRFs,
> mostly for guest networks. One of the VRFs is used for "outside" traffic. A
> pair of Cisco ASAs connect the "outside VRF" and the "inside" global
> routing tables. The ASAs neighbor EIGRP with the router to learn about
> IPV4 "inside" networks. These routers also do MPLS VPNs to connect to
> various guest networks on different campuses as well as some other DMZ
> stuff. We also have several outside partners connecting to these routers.
>
> The "edge" routers connect to the Enterprise Core routers which route to
> various campuses over a large DWDM Ethernet MAN/WAN.
>
> The Problem:
>
> The problem occurred when we tried to enable IPV6 routing on the edge
> routers. We have narrowed the scenario down to these conditions:
>
> 1) "mls ipv6 vrf", "ipv6 address-family" added to one or more VRF
>    definitions.
> 2) The "outside" VRF table holds the full Internet table + EIGRP routes
>    to local "outside" devices/subnets.
> 3) IPV4 BGP session to a neighbor is open and operational and sharing
>    the "outside" VRF.
> 4) No other IPV6 configuration has been entered yet.
>
> When "ipv6 unicast-routing" is entered the following happens:
>
> 1) EIGRP & BGP neighbors drop on interfaces with BFD enabled. (We took
>    it out.)
> 2) Traffic through the router drops to a crawl (0-2000 bps). ICMP
>    doesn't seem affected, but I'm not pushing that much ICMP.
> 3) The SP CPU goes to nearly 100%.
> 4) Most of the interface traffic is routed to the RP (confirmed by
>    ERSPAN).
> 5) Telnet connections to the router don't drop and EIGRP neighbors
>    stay connected.
>
> This slowness isn't the same as when BGP is first enabled and is loading
> routes - it's much worse, traffic throughput almost stops...!!
>
> When we twice tried enabling IPV6 during a change window it brought all
> Internet connectivity to a halt. I think this is due to the neighbor
> relationships staying up and the router acting as a "black hole". We have
> been able to duplicate the issue in a lab. At first we just duplicated the
> hardware and configuration and it seemed all was OK, that's why we made the
> 2nd attempt with Cisco TAC and our senior engineers on hand. Turns out you
> need to be pushing data through the router to see the problem. In the lab I
> have 3 sessions pushing from the "outside" and 3 from the "inside". One
> session is doing ICMP pings to a host beyond the router. The 2nd session is
> doing TFTP GETs (UDP port 69) and the 3rd is doing HTTP GETs (TCP port 80)
> using "curl" scripts.
>
> In the lab, the "slowness" lasts almost 2 minutes. During that time there is
> no unusual traffic (i.e. BGP scanning or reloads) and no CPU processes rise
> to any noticeable level. Nothing gets logged. The only thing I noticed is the
> SP CPU goes to 100% and the RP starts getting flooded with traffic from
> most interfaces. When we tried it in production it was lasting over 4
> minutes, so we pulled the plug and removed the changes.  The "problem"
> happens each time the command is entered OR removed. Also it doesn?
>
> FIB TCAM maximum routes :   (BGP routes in table = 408K)
> =======================
> Current :-
> -------
> IPv4 + MPLS         - 512k (default)
> IPv6 + IP Multicast - 256k (default)
>
> The line cards in the production routers have 1GB ram and are XL versions.
>
> Cisco TAC hasn't been too helpful on this one. I'm looking for any ideas to
> determine the problem, cause or how to live with it. I figure we could
> disable IPV4 routing temporarily, enable IPV6 routing, then restart IPV4
> routing or just reload the router with the IPV6 commands preloaded - but
> that seems like a hack to me and I don't know if this problem will bite me
> in the ass later if we don't better understand why this is happening.
>
> Any suggestions appreciated,
>
> -Jim
> _______________________________________________
> cisco-nsp mailing list  cisco-nsp at puck.nether.net
> https://puck.nether.net/mailman/listinfo/cisco-nsp
> archive at http://puck.nether.net/pipermail/cisco-nsp/