[c-nsp] BGP Hold time expired/ospf dropping 6500 Sup720-3BXL

Gergely Antal skoal at skoal.name
Fri Jan 22 03:07:42 EST 2010


Just a thought: check the output of

sh ip bgp neighbors | i Datagrams

Maybe one router negotiates the session with a low datagram size (max
data segment), and the update storm then floods the connection.
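
To put a rough number on why a small datagram size hurts, here is a
minimal sketch (Python, not from the thread) that counts how many TCP
segments a burst of BGP updates needs at a given max data segment. The
536 and 1460 byte values are the classic TCP default MSS and the MSS
that fits a 1500-byte Ethernet MTU. The 10,000,000-byte update volume is
a purely hypothetical figure for illustration, not something measured on
these routers.

```python
import math


def segments_needed(payload_bytes: int, mss: int) -> int:
    """Number of TCP segments needed to carry payload_bytes at a given MSS."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return math.ceil(payload_bytes / mss)


if __name__ == "__main__":
    # Hypothetical amount of BGP UPDATE data to resend after a session flap.
    update_bytes = 10_000_000
    for mss in (536, 1460):
        print(f"MSS {mss}: {segments_needed(update_bytes, mss)} segments")
```

Fewer, larger segments mean fewer packets for the route processor to
handle while a full table is resent, which is what the command above
helps to check.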


On Fri, 22 Jan 2010 02:06:53 +0100
"Andy B." <globichen at gmail.com> wrote:

>Hi,
>
>here we go:
>
>Core router that is causing headaches:
>
>interface Loopback0
> ip address x.x.x.130 255.255.255.255
>
>interface TenGigabitEthernet9/1
> ip address y.y.y.1 255.255.255.252
> no ip redirects
> no ip proxy-arp
> no cdp enable
>
>router ospf 1
> router-id x.x.x.130
> log-adjacency-changes
> redistribute connected subnets
> redistribute static subnets
> passive-interface default
> no passive-interface TenGigabitEthernet8/1
> no passive-interface TenGigabitEthernet9/1
> no passive-interface TenGigabitEthernet9/2
> network y.y.y.0 0.0.0.3 area 0
> network y.y.y.4 0.0.0.3 area 0
> network y.y.y.8 0.0.0.3 area 0
>
>
>Adjacent router (one of them):
>
>interface Loopback0
> ip address x.x.x.131 255.255.255.255
>
>interface TenGigabitEthernet4/1
> ip address y.y.y.2 255.255.255.252
> no ip redirects
> no ip proxy-arp
>
>router ospf 1
> router-id x.x.x.131
> log-adjacency-changes
> redistribute connected subnets
> redistribute static subnets
> passive-interface default
> no passive-interface TenGigabitEthernet4/1
> network y.y.y.0 0.0.0.3 area 0
>
>
>I hope this helps...
>
>Andy
>
>
>On Fri, Jan 22, 2010 at 1:53 AM, Jason LeBlanc
><jasonleblanc at gmail.com> wrote:
>> Can you send your <snipped> OSPF config?
>>
>> On Jan 21, 2010, at 5:28 PM, Andy B. wrote:
>>
>>> Hi,
>>>
>>> I just fell over this thread while doing a little research to solve a
>>> similar situation.
>>>
>>> Hardware:
>>>
>>> - 6509 with SUP720-3BXL on both ends
>>> - SXF15a
>>> - Uptime: 46 weeks
>>>
>>> Problem:
>>>
>>> - OSPF (for the loopback between cores) and BGP (mostly customers
>>> whom we send the full table) going up and down all the time:
>>>
>>> %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.130 on TenGigabitEthernet4/1
>>> from FULL to DOWN, Neighbor Down: Dead timer expired
>>> %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.131 on TenGigabitEthernet9/1
>>> from LOADING to FULL, Loading Done
>>> %BGP-5-ADJCHANGE: neighbor y.y.y.14 Down BGP Notification sent
>>> %BGP-3-NOTIFICATION: sent to neighbor y.y.y.14 4/0 (hold time
>>> expired) 0 bytes
>>> %BGP-5-ADJCHANGE: neighbor y.y.y.14 Up
>>>
>>> This keeps going on for several hours, and suddenly it stabilizes
>>> itself.
>>>
>>> Furthermore I use cacti to generate graphs from the core router via
>>> SNMP. I have one VLAN that has around 15 Gbps of traffic at peak
>>> times, and as soon as I hit more than 15 Gbps, no more graphs are
>>> drawn, the core router console becomes rather unresponsive and OSPF
>>> starts to behave strangely.
>>>
>>> What I can rule out is the fiber capacity. I have multiple circuits
>>> and different paths and operators. The OSPF issue happens on all
>>> circuits, not just a specific one. No 10 GE link is used more than
>>> 60%. In fact, traffic from inside my backbone to any place outside
>>> remains unaffected (thank God), but the core router itself is pretty
>>> useless. Pinging the core's loopback or any ip loaded on that box
>>> results in a 40-60% packet loss.
>>>
>>> CPU usage is not high, it's stable. No unusual processes, just IP
>>> Input and BGP Scanner. More than 50% memory is still free at that
>>> time.
>>>
>>> I've had this many times recently, but it really only happens when
>>> my core goes beyond roughly 15 Gbps of traffic (outbound). We were
>>> below 15 Gbps for 2 years and it never happened during that time.
>>> Now all this mess happens almost daily, rendering important billing
>>> graphs useless and annoying full table BGP customers.
>>>
>>> Is this a memory issue, due to the router's long uptime? Would
>>> reloading the router help in this case? That's the last thing I
>>> would want to do, but if it helps...
>>>
>>> Cheers,
>>>
>>> Andy
>>>
>>> On Fri, Dec 11, 2009 at 5:22 PM, Drew Weaver
>>> <drew.weaver at thenap.com> wrote:
>>>> Howdy all,
>>>>
>>>> Last night I had an interesting encounter on one of my 6509s w/
>>>> SUP720-3BXL.
>>>>
>>>> This switch has 3x iBGP sessions with full internet tables and is
>>>> also running OSPF.
>>>>
>>>> Two of the three iBGP sessions randomly dropped with:
>>>>
>>>> %BGP-3-NOTIFICATION: sent to neighbor x.x.x.3 4/0 (hold time
>>>> expired) 0 bytes
>>>>
>>>> I also noticed that during this period OSPF dropped with
>>>> "Neighbor Down: Dead timer expired"
>>>>
>>>> and then re-established, and then failed again, and
>>>> re-established, and failed again, and so-on, and so-on.
>>>>
>>>> I checked the physical interfaces between this 6500 and the two
>>>> GSR 12000s it peers with and there were no errors. There was also
>>>> no obvious spike in traffic that would account for latency that
>>>> might cause the hold timers to expire. I remember that when this
>>>> system first came online it took a really long time to download
>>>> the full internet tables from the upstream GSRs, and a lot of CPU
>>>> time was eaten up during that time. I am wondering if the first
>>>> session failing caused a sort of 'performance' domino effect which
>>>> then caused everything else to fail. The issue eventually
>>>> corrected itself and stabilized.
>>>>
>>>> This particular box is running 12.2(18)SXF17 so I am less likely
>>>> to believe it is a software bug.
>>>>
>>>> Does anyone have any tips on how I can avoid the hold timer issue
>>>> altogether, and on how I can make sure that a session that goes
>>>> down and re-establishes doesn't totally nail the CPU while it's
>>>> trying to re-establish/download the routes? A long time ago I
>>>> also read that increasing the MTU on both ends of a circuit can
>>>> make BGP tables download faster. I don't know if that's true or
>>>> not; has anyone else found that?
>>>>
>>>> thanks,
>>>> -Drew
>>>>
>>>>
>>>> _______________________________________________
>>>> cisco-nsp mailing list  cisco-nsp at puck.nether.net
>>>> https://puck.nether.net/mailman/listinfo/cisco-nsp
>>>> archive at http://puck.nether.net/pipermail/cisco-nsp/
>>>>
>>
>>

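On the 40-60% packet loss towards the core itself that Andy mentions
above: a back-of-the-envelope sketch (Python, not from the thread) of
how quickly that alone could expire an OSPF dead timer. It assumes the
default broadcast-interface timers (hello 10 s, dead 40 s, so four
hellos missed in a row), since the posted configs show no timer
changes. It also assumes that hellos are lost independently and at the
same rate as the pings. Both of those are simplifications, so treat the
result as an order of magnitude only.

```python
def prob_dead_timer_fires(loss: float, hellos: int, misses_needed: int = 4) -> float:
    """Probability of at least one run of `misses_needed` consecutive
    lost hellos within `hellos` hello intervals, with independent loss."""
    # state[i] = probability that the current run of lost hellos has length i
    state = [1.0] + [0.0] * (misses_needed - 1)
    fired = 0.0
    for _ in range(hellos):
        nxt = [0.0] * misses_needed
        for run, p in enumerate(state):
            nxt[0] += p * (1.0 - loss)      # hello received, run resets
            if run + 1 == misses_needed:
                fired += p * loss           # last allowed miss: adjacency drops
            else:
                nxt[run + 1] += p * loss    # one more miss in the current run
        state = nxt
    return fired


if __name__ == "__main__":
    # 360 hellos is one hour at the assumed 10 s hello interval.
    for loss in (0.4, 0.5, 0.6):
        print(f"loss {loss:.0%}: P(drop within 1h) = {prob_dead_timer_fires(loss, 360):.3f}")
```

BGP keepalives ride on TCP and get retransmitted, so this simple model
only applies to the OSPF hellos, not to the BGP hold timer.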