[c-nsp] BGP Hold time expired/ospf dropping 6500 Sup720-3BXL

Fri Jan 22 05:26:39 EST 2010

MTU is 1500 on all links:

Core 1:

#sh int te9/1 | i MTU
 MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,

#sh int te9/2 | i MTU
 MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,

#sh int te8/1 | i MTU
 MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,

Core 2:

#sh int te4/1 | i MTU
 MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,

Core 3:

#sh int te4/1 | i MTU
 MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,

Core 4:

#sh int te4/1 | i MTU
 MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,
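
A minimal sketch of how the per-link check above could be automated, assuming the `sh int ... | i MTU` lines have been collected by hand into a dictionary. The labels such as "core1 te9/1" and the helper names are made up for illustration and are not part of any Cisco tool. It also derives the TCP segment size one would expect BGP to negotiate on a 1500-byte link (MTU minus 20 bytes IPv4 header and 20 bytes TCP header, no options), which is the figure to compare against `sh ip bgp neighbors | i Datagrams`.

```python
import re

# Lines copied from the "sh int <if> | i MTU" output above.
# The dictionary keys are illustrative labels, not device hostnames.
SAMPLE = {
    "core1 te9/1": " MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,",
    "core1 te9/2": " MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,",
    "core1 te8/1": " MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,",
    "core2 te4/1": " MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,",
    "core3 te4/1": " MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,",
    "core4 te4/1": " MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,",
}

MTU_RE = re.compile(r"\bMTU (\d+) bytes")


def parse_mtu(line):
    """Extract the MTU value from one 'MTU <n> bytes, ...' line."""
    match = MTU_RE.search(line)
    if match is None:
        raise ValueError("no MTU found in line: %r" % line)
    return int(match.group(1))


def check_uniform(outputs):
    """Return (per-interface MTUs, True if every interface has the same MTU)."""
    mtus = {name: parse_mtu(line) for name, line in outputs.items()}
    return mtus, len(set(mtus.values())) == 1


def expected_tcp_mss(mtu):
    """MTU minus IPv4 header (20) and TCP header (20), assuming no options."""
    return mtu - 40


mtus, uniform = check_uniform(SAMPLE)
for name, mtu in sorted(mtus.items()):
    print(name, mtu)
print("uniform:", uniform)
print("expected max data segment:", expected_tcp_mss(min(mtus.values())))
```

If the max data segment reported for a BGP neighbor is much smaller than this expected value, that session is worth a closer look.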

Core 1 is physically connected to 2,3 and 4 (star topology).

BGP is fully meshed - no route reflector.

Andy

On Fri, Jan 22, 2010 at 11:00 AM, roy <bandwidth.user at gmail.com> wrote:
> We had a somewhat similar problem with ospf/bgp which was eventually
> resolved by making link mtu uniform across the links. Let me know if this
> helps.
>
> On Friday, 22 January, 2010 04:07 PM, Gergely Antal wrote:
>>
>> just a thought:
>> sh ip bgp neighbors | i Datagrams
>>
>> maybe one router tries to negotiate the session with low datagram size
>> and the update storm floods the connection.
>>
>>
>> On Fri, 22 Jan 2010 02:06:53 +0100
>> "Andy B."<globichen at gmail.com>  wrote:
>>
>>> Hi,
>>>
>>> here we go:
>>>
>>> Core router that is causing headaches:
>>>
>>> interface Loopback0
>>> ip address x.x.x.130 255.255.255.255
>>>
>>> interface TenGigabitEthernet9/1
>>> ip address y.y.y.1 255.255.255.252
>>> no ip redirects
>>> no ip proxy-arp
>>> no cdp enable
>>>
>>> router ospf 1
>>> router-id x.x.x.130
>>> log-adjacency-changes
>>> redistribute connected subnets
>>> redistribute static subnets
>>> passive-interface default
>>> no passive-interface TenGigabitEthernet8/1
>>> no passive-interface TenGigabitEthernet9/1
>>> no passive-interface TenGigabitEthernet9/2
>>> network y.y.y.0 0.0.0.3 area 0
>>> network y.y.y.4 0.0.0.3 area 0
>>> network y.y.y.8 0.0.0.3 area 0
>>>
>>>
>>> Adjacent router (one of them):
>>>
>>> interface Loopback0
>>> ip address x.x.x.131 255.255.255.255
>>>
>>> interface TenGigabitEthernet4/1
>>> ip address y.y.y.2 255.255.255.252
>>> no ip redirects
>>> no ip proxy-arp
>>>
>>> router ospf 1
>>> router-id x.x.x.131
>>> log-adjacency-changes
>>> redistribute connected subnets
>>> redistribute static subnets
>>> passive-interface default
>>> no passive-interface TenGigabitEthernet4/1
>>> network y.y.y.0 0.0.0.3 area 0
>>>
>>>
>>> I hope this helps...
>>>
>>> Andy
>>>
>>>
>>> On Fri, Jan 22, 2010 at 1:53 AM, Jason LeBlanc
>>> <jasonleblanc at gmail.com>  wrote:
>>>>
>>>> Can you send your <snipped> OSPF config?
>>>>
>>>> On Jan 21, 2010, at 5:28 PM, Andy B. wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I just came across this thread while doing a little research to
>>>>> solve a similar situation.
>>>>>
>>>>> Hardware:
>>>>>
>>>>> - 6509 with SUP720-3BXL on both ends
>>>>> - SXF15a
>>>>> - Uptime: 46 weeks
>>>>>
>>>>> Problem:
>>>>>
>>>>> - OSPF (for the loopbacks between cores) and BGP (mostly customers
>>>>> to whom we send the full table) going up and down all the time:
>>>>>
>>>>> %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.130 on TenGigabitEthernet4/1
>>>>> from FULL to DOWN, Neighbor Down: Dead timer expired
>>>>> %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.131 on TenGigabitEthernet9/1
>>>>> from LOADING to FULL, Loading Done
>>>>> %BGP-5-ADJCHANGE: neighbor y.y.y.14 Down BGP Notification sent
>>>>> %BGP-3-NOTIFICATION: sent to neighbor y.y.y.14 4/0 (hold time
>>>>> expired) 0 bytes
>>>>> %BGP-5-ADJCHANGE: neighbor y.y.y.14 Up
>>>>>
>>>>> This goes on for several hours and then suddenly stabilizes by
>>>>> itself.
>>>>>
>>>>> Furthermore I use cacti to generate graphs from the core router via
>>>>> SNMP. I have one VLAN that carries around 15 Gbps of traffic at peak
>>>>> times, and as soon as I hit more than 15 Gbps, no more graphs are
>>>>> drawn, the core router console becomes rather unresponsive and OSPF
>>>>> starts to behave strangely.
>>>>>
>>>>> What I can rule out is the fiber capacity. I have multiple circuits
>>>>> and different paths and operators. The OSPF issue happens on all
>>>>> circuits, not just a specific one. No 10 GE link is used more than
>>>>> 60%. In fact, traffic from inside my backbone to any place outside
>>>>> remains unaffected (thank God), but the core router itself is pretty
>>>>> useless. Pinging the core's loopback or any ip loaded on that box
>>>>> results in a 40-60% packet loss.
>>>>>
>>>>> CPU usage is not high, it's stable. No unusual processes, just IP
>>>>> Input and BGP Scanner. More than 50% memory is still free at that
>>>>> time.
>>>>>
>>>>> I've had this many times recently, but it really only happens when
>>>>> my core goes beyond roughly 15 Gbps of traffic (outbound). We've
>>>>> been below 15 Gbps for 2 years and it never happened in that time.
>>>>> Now all this mess happens almost daily, rendering important billing
>>>>> graphs useless and annoying full table BGP customers.
>>>>>
>>>>> Is this a memory issue, due to the router's long uptime? Would
>>>>> reloading the router help in this case? That's the last thing I
>>>>> would want to do, but if it helps...
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Andy
>>>>>
>>>>> On Fri, Dec 11, 2009 at 5:22 PM, Drew Weaver
>>>>> <drew.weaver at thenap.com>  wrote:
>>>>>>
>>>>>> Howdy all,
>>>>>>
>>>>>> Last night I had an interesting encounter on one of my 6509s w/
>>>>>> SUP720-3BXL.
>>>>>>
>>>>>> This switch has 3x iBGP sessions with full internet tables and is
>>>>>> also running OSPF.
>>>>>>
>>>>>> Two of the three iBGP sessions randomly dropped with:
>>>>>>
>>>>>> %BGP-3-NOTIFICATION: sent to neighbor x.x.x.3 4/0 (hold time
>>>>>> expired) 0 bytes
>>>>>>
>>>>>> I also noticed that during this period OSPF dropped with "Neighbor
>>>>>> Down: Dead timer expired", then re-established, then failed again,
>>>>>> then re-established, and so on.
>>>>>>
>>>>>> I checked the physical interfaces between this 6500 and the two
>>>>>> GSR 12000s it peers with and there were no errors. There was also
>>>>>> no obvious spike in traffic that would account for latency that
>>>>>> might cause the hold timers to expire. I remember that when this
>>>>>> system first came online it took a really long time to download
>>>>>> the full internet tables from the upstream GSRs, and a lot of CPU
>>>>>> time was being eaten up during that time. I am wondering if the
>>>>>> first session failing caused a sort of 'performance' domino effect
>>>>>> which then caused everything else to fail. The issue eventually
>>>>>> corrected itself and stabilized.
>>>>>>
>>>>>> This particular box is running 12.2(18)SXF17 so I am less likely
>>>>>> to believe it is a software bug.
>>>>>>
>>>>>> Does anyone have any tips on how I can avoid the hold timer issue
>>>>>> altogether, and also on how I can make it so that if a session
>>>>>> does go down and re-establish, it doesn't totally nail the CPU
>>>>>> while it's trying to re-establish/download the routes? A long time
>>>>>> ago I also read that increasing the MTU on both ends of a circuit
>>>>>> can make BGP tables download faster. I don't know if that's true
>>>>>> or not. Has anyone else found that?
>>>>>>
>>>>>> thanks,
>>>>>> -Drew
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> cisco-nsp mailing list  cisco-nsp at puck.nether.net
>>>>>> https://puck.nether.net/mailman/listinfo/cisco-nsp
>>>>>> archive at http://puck.nether.net/pipermail/cisco-nsp/
>>>>>>
>>>>
>>>>
>>
>>
>>