[c-nsp] Hardware limitations on SUP32 with LDP and full routing table

Rodney Dunn rodunn at cisco.com
Thu Jan 22 13:43:46 EST 2009


I'm by no means a TCAM expert, but it seems you are asking for the
exception state, when the TCAM gets full, to somehow let the more
specific prefix in so that the longest prefix is still matched?

These exception cases come with all kinds of problems. One customer
wants it one way and the next customer wants it another so you never
win.

I suspect the punt comes from the next hop not resolving to a valid
adjacency, or from there being no hit at all. If the search is done
in hardware and, because the TCAM is full, the hardware has no
knowledge of the less specific prefix, it can't forward based on it.

I'm not sure there is a way to solve the problem you describe without
more TCAM space. We don't do any kind of RIB compression.

Rodney


On Thu, Jan 22, 2009 at 06:15:13PM +0100, Marcus.Gerdon wrote:
> Hi Jose,
> Hi Marek,
> 
> I've been facing the same symptom, losing connectivity on a couple of machines, for quite some time. With a DFZ table the TCAMs are simply overloaded.
> 
> I've been able to track that down but for some weeks now Cisco can't provide any solution.
> 
> The problem itself isn't that complex:
> 
> When the FIB is built (at power-up or when routing protocols come up; 'clear ip route *' also works, no reload required), the forwarding entries are created in the TCAM in ordered sequence, longest prefixes first.
> 
> Each packet's destination is first looked up in the TCAM. As the TCAM is walked sequentially, the longest match is found first and the next hop is determined. Only if no TCAM entry is found is the lookup switched over to the software-only tables.
> 
> Think about a /16 being available at startup and entered into the TCAM. At some later time a more specific /24 shows up in the routing table. While trying to create a forwarding entry, it is determined that the TCAM can't hold the additional /24 because the area organized for /24s at the time of the initial population is full ('sh mls cef masks' and some investigation show this, and it's even reproducible).
> 
> Because of that, only a software entry is created (you can check with 'sh ip cef'), but none in the TCAM (check with 'sh mls cef').
> 
> Now we have a TCAM and software entry for /16 and the overlapping /24 only in software.
> 
> When looking up an address within the /24, the TCAM is queried first and finds the /16 record. As a match is found, software isn't queried at all.
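
The lookup order described in the quoted text can be illustrated with a small Python sketch. This is only a toy model of "hardware first, software only on a miss", not how the PFC actually implements it; the prefixes 10.1.0.0/16 and 10.1.2.0/24, the next-hop labels, and the function names are all made up for illustration.

```python
import ipaddress

def longest_match(table, dst):
    """Return the most specific prefix in `table` that contains `dst`, or None."""
    addr = ipaddress.ip_address(dst)
    matches = [prefix for prefix in table if addr in prefix]
    return max(matches, key=lambda prefix: prefix.prefixlen, default=None)

def lookup(tcam, software_fib, dst):
    """Query the TCAM first; consult the software FIB only on a TCAM miss."""
    hit = longest_match(tcam, dst)
    if hit is not None:
        return ("tcam", tcam[hit])
    hit = longest_match(software_fib, dst)
    if hit is not None:
        return ("software", software_fib[hit])
    return ("drop", None)

# Hypothetical prefixes and next hops, chosen only for the example.
net16 = ipaddress.ip_network("10.1.0.0/16")
net24 = ipaddress.ip_network("10.1.2.0/24")

software_fib = {net16: "next-hop-A", net24: "next-hop-B"}
tcam = {net16: "next-hop-A"}  # the /24 arrived later and did not fit

result = lookup(tcam, software_fib, "10.1.2.5")
correct = software_fib[longest_match(software_fib, "10.1.2.5")]
print(result)   # the /16 in the TCAM wins
print(correct)  # what a true longest-prefix match would have used
```

In this model the address inside the /24 is forwarded via the /16's next hop, because the TCAM hit ends the lookup before the software table, which holds the more specific entry, is ever consulted.
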
> 
> It seems that somewhere the process that inserts a prefix in the middle of the TCAM and reorders it if needed is broken. I've tried to work around this using the CEF consistency checks, but although they work by and large, a few hundred ms of jitter is produced each time the TCAM is reordered. I disabled them again soon after, as customers started to complain about applications disconnecting due to the introduced jitter.
> 
> If someone has an idea, or even better a solution (or gets a definitive answer from Cisco; my case has been open for some time now and the engineer said they would try to reproduce this in the lab), please let me know.
> 
> 
> 
> 
> kind regards,
> 
> Marcus
> 
> 
> > -----Original Message-----
> > From: cisco-nsp-bounces at puck.nether.net 
> > [mailto:cisco-nsp-bounces at puck.nether.net] On Behalf Of Marek Tyban
> > Sent: Thursday, January 22, 2009 15:25
> > To: Jose
> > Cc: cisco-nsp at puck.nether.net
> > Subject: Re: [c-nsp] Hardware limitations on SUP32 with LDP 
> > and full routing table
> > 
> > 
> > Hi Jose,
> > 
> > I think that generally the SUP32 isn't suitable for today's full Internet 
> > routing table. It's due to the hardware limitations (as you wrote).
> > 
> > When you have full routes on SUP32 you should see log output as below
> > 
> > %MLSCEF-SP-4-FIB_EXCEPTION_THRESHOLD: Hardware CEF entry usage is at 95% capacity for IPv4 unicast protocol.
> > 
> > %CFIB-SP-7-CFIB_EXCEPTION: FIB TCAM exception, Some entries will be software switched
> > 
> > I have seen similar trouble, where some sites/networks weren't reachable 
> > through SUP720-3B (non-XL) routers even though the routing and CEF tables 
> > were correct.
> > 
> > Regards,
> > Marek
> > 
> > On Wed, 21 Jan 2009, Jose wrote:
> > 
> > > I was wondering if I could get some additional opinions on a case I have open
> > > with Cisco.  We have recently started turning up LDP on various links out
> > > towards some routers that are being converted to act as PEs.  The core is all
> > > connected together and has been running LDP on those particular links for
> > > over 8 months.
> > >
> > > This past weekend we turned up LDP on a link to one of our remote cities and
> > > we received sporadic complaints that some customers couldn't access any
> > > sites/addresses if the path was via one of our P routers.  If traffic was
> > > through any other path on the network it was fine.  Traceroutes to & from
> > > this P router were unsuccessful even though the routing table and LFIB
> > > all showed the correct information.  Turning off LDP across this link
> > > resolved the problem for the customers.
> > >
> > > After opening up the TAC case and lots of troubleshooting they showed us
> > > this:
> > >
> > > Without LDP on:
> > > frort04#sh ip cef exact-route 172.17.0.254 68.179.73.86
> > > 172.17.0.254    -> 68.179.73.86   : Vlan2210 (next hop 67.226.181.110)
> > > <<<<<< the next hop is correct
> > >
> > > With LDP on:
> > > frort04#sh mls cef exact-route 172.17.0.254 68.179.73.86
> > > Interface: Vl2210, Next Hop: 224.0.0.168, Vlan: 2210, Destination Mac: 00b0.4a5e.7419
> > > <<<<<<<< next hop can't be a multicast IP
> > >
> > > The CEF entry and the MLS CEF entry are different. After consulting the LAN-SW
> > > team, it was found that this router had an issue with overloaded routes
> > > causing the MLS CEF table to become corrupted.
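
The comparison TAC made by eye can be expressed as a small check. The sketch below uses only the two next-hop values from the quoted output; the function name is hypothetical, and it does not parse 'sh ip cef' or 'sh mls cef' output, so the addresses have to be supplied by hand.

```python
import ipaddress

def next_hop_problem(software_nh, hardware_nh):
    """Describe what is wrong with the hardware next hop, or return None if it looks sane."""
    sw = ipaddress.ip_address(software_nh)
    hw = ipaddress.ip_address(hardware_nh)
    if hw.is_multicast:
        # A unicast route should never resolve to a 224.0.0.0/4 next hop.
        return f"hardware next hop {hw} is a multicast address"
    if sw != hw:
        return f"hardware next hop {hw} differs from software next hop {sw}"
    return None

# Values taken from the two exact-route outputs quoted above.
print(next_hop_problem("67.226.181.110", "224.0.0.168"))
```

A result of None only means the two next hops agree and the hardware one is not multicast; it says nothing about whether the rest of the hardware entry is intact.
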
> > >
> > > So basically we were told that because the SUP32 has a hardware limitation of
> > > 250K routes that it can install in hardware CEF, we were getting corruption in our
> > > tables, which in turn corrupted how LDP was building its forwarding table.  The
> > > core P routers currently hold the entire Internet routing table, so yes, they
> > > technically are pretty full in terms of the number of routes they can hold.
> > > They want us to reload our router to clear the tables but they can't
> > > guarantee that this problem won't resurface down the road or right
> > > away.  I'm more curious whether there is some kind of IOS bug we might be hitting,
> > > which I'm hoping one of you might know about, but they're supposed to be doing a bug
> > > scrub as well.
> > >
> > > Any thoughts on what we're experiencing?  Should we bite the bullet and
> > > upgrade to SUP720-3BXLs?
> > >
> > > Thanks.
> > >
> > > Jose
> > >
> > > _______________________________________________
> > > cisco-nsp mailing list  cisco-nsp at puck.nether.net
> > > https://puck.nether.net/mailman/listinfo/cisco-nsp
> > > archive at http://puck.nether.net/pipermail/cisco-nsp/
> > >
> > 

