[c-nsp] Random BGP Drops
Catalin Dominte
catalin.dominte at nocsult.net
Fri Jul 24 11:10:53 EDT 2015
Just a few more details about this.
This did not happen on any IPv6 sessions, only on IPv4. The v6 sessions
haven't flapped for months.
The specific thing we are looking at in the logs on the other side is this
line:
Jul 24 00:33:04 rt1 rpd[1396]: bgp_read_v4_message:10656: NOTIFICATION
received from A.B.C.D (External AS *****): code 4 (Hold Timer Expired
Error), socket buffer sndcc: 57 rcvcc: 0 TCP state: 4, snd_una: 3040466763
snd_nxt: 3040466801 snd_wnd: 16194 rcv_nxt: 3738492361 rcv_adv: 3738508724,
hold timer out 90s, hold timer remain 1:07.779687s
More specifically: "hold timer remain 1:07.779687s"
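The figures in that log line can be cross-checked with a little arithmetic. This is only a sketch: it assumes "hold timer remain" is the time left on the Juniper's own hold timer (which RFC 4271 says is restarted on every KEEPALIVE or UPDATE received), and that the unacknowledged bytes are 19-byte KEEPALIVE messages. The function and variable names are made up for illustration.

```python
# Values copied from the rpd log line above; names are illustrative only.
HOLD_TIME_S = 90.0
KEEPALIVE_BYTES = 19  # a BGP KEEPALIVE is just the 19-byte header (RFC 4271)

def parse_remaining(text):
    """Convert 'M:SS.ffffff' into seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + float(seconds)

def unacked_bytes(snd_una, snd_nxt):
    """Bytes sent but not yet acknowledged, modulo 2**32 for sequence wrap."""
    return (snd_nxt - snd_una) % 2**32

remaining = parse_remaining("1:07.779687")
since_last_rx = HOLD_TIME_S - remaining          # time since the timer was last reset
in_flight = unacked_bytes(3040466763, 3040466801)
sndcc = 57                                       # bytes sitting in the send buffer

print(f"last message received {since_last_rx:.1f}s before the NOTIFICATION")
print(f"unacked in flight: {in_flight} bytes = {in_flight / KEEPALIVE_BYTES:g} keepalives")
print(f"send buffer: {sndcc} bytes = {sndcc / KEEPALIVE_BYTES:g} keepalives")
```

If those assumptions hold, the Juniper had heard from the Cisco roughly 22 seconds earlier while a few of its own keepalives were still unacknowledged, which would be consistent with loss in the Juniper-to-Cisco direction only.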
Does this indicate one-way communication over the BGP session? We can't
think of anything that would cause that apart from our CoPP policy. The
relevant part of it is:
policy-map copp
 class routing
 class management
 class monitoring
 class undesirable
  police 6000000 conform-action transmit exceed-action drop
 class other
 class netbios
  police cir 32000 conform-action drop exceed-action drop violate-action drop
Hardware Counters:
class-map: undesirable (match-all)
Match: access-group 125
police :
6000000 bps 187500 limit 187500 extended limit
Earl in slot 1 :
4182956794 bytes
5 minute offered rate 40 bps
aggregate-forwarded 4172677422 bytes action: transmit
exceeded 10279372 bytes action: drop
aggregate-forward 152 bps exceed 0 bps
Earl in slot 4 :
54888502997 bytes
5 minute offered rate 9040 bps
aggregate-forwarded 34946501956 bytes action: transmit
exceeded 19942001041 bytes action: drop
aggregate-forward 7016 bps exceed 0 bps
Software Counters:
Class-map: undesirable (match-all)
276617525 packets, 36984017831 bytes
5 minute offered rate 6000 bps, drop rate 0000 bps
Match: access-group 125
police:
cir 6000000 bps, bc 187500 bytes
conformed 276617377 packets, 36983876623 bytes; actions:
transmit
exceeded 150 packets, 141208 bytes; actions:
drop
conformed 6000 bps, exceed 0000 bps
Class-map: other (match-all)
109899621 packets, 10132415208 bytes
5 minute offered rate 4000 bps
Match: access-group 124
Hardware Counters:
class-map: netbios (match-all)
Match: access-group 126
police :
32000 bps 1500 limit 1500 extended limit
Earl in slot 1 :
0 bytes
5 minute offered rate 0 bps
aggregate-forwarded 0 bytes action: drop
exceeded 0 bytes action: drop
aggregate-forward 0 bps exceed 0 bps
Earl in slot 4 :
0 bytes
5 minute offered rate 0 bps
aggregate-forwarded 0 bytes action: drop
exceeded 0 bytes action: drop
aggregate-forward 0 bps exceed 0 bps
Software Counters:
Class-map: netbios (match-all)
0 packets, 0 bytes
5 minute offered rate 0000 bps, drop rate 0000 bps
Match: access-group 126
police:
cir 32000 bps, bc 1500 bytes, be 1500 bytes
conformed 0 packets, 0 bytes; actions:
drop
exceeded 0 packets, 0 bytes; actions:
drop
violated 0 packets, 0 bytes; actions:
drop
conformed 0000 bps, exceed 0000 bps, violate 0000 bps
Class-map: class-default (match-any)
3182132665 packets, 248587325791 bytes
5 minute offered rate 237000 bps, drop rate 0000 bps
Match: any
3182132679 packets, 248587324073 bytes
5 minute rate 237000 bps
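To put the policer counters above in perspective, here is a rough sketch that converts the configured burst sizes into time at the committed rate and works out what share of bytes each slot has dropped for class undesirable. The numbers are copied from the output; the helper names are hypothetical.

```python
def burst_seconds(cir_bps, burst_bytes):
    """How long the configured burst lasts at the committed rate."""
    return burst_bytes * 8 / cir_bps

def drop_fraction(forwarded_bytes, exceeded_bytes):
    """Share of offered bytes that the policer dropped."""
    return exceeded_bytes / (forwarded_bytes + exceeded_bytes)

# Figures copied from the hardware counters for class "undesirable".
slots = {
    1: {"total": 4182956794, "forwarded": 4172677422, "exceeded": 10279372},
    4: {"total": 54888502997, "forwarded": 34946501956, "exceeded": 19942001041},
}

print(f"undesirable burst: {burst_seconds(6_000_000, 187_500) * 1000:.0f} ms")
print(f"netbios burst:     {burst_seconds(32_000, 1_500) * 1000:.0f} ms")
for slot, c in slots.items():
    # forwarded + exceeded should add up to the total byte counter
    consistent = c["forwarded"] + c["exceeded"] == c["total"]
    print(f"slot {slot}: dropped {drop_fraction(c['forwarded'], c['exceeded']):.1%}, "
          f"counters consistent: {consistent}")
```

On these numbers slot 4 has dropped roughly 36% of the bytes offered to class undesirable over the lifetime of the counters, even though the current exceed rate shows 0 bps. The post does not show access-group 125, so this alone does not tell us whether any BGP traffic falls into that class.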
Kind regards,
Catalin Dominte
Senior Network Consultant
+44(0)1628302007
Nocsult Ltd
www.nocsult.net
On Fri, Jul 24, 2015 at 2:33 PM, Catalin Dominte <
catalin.dominte at nocsult.net> wrote:
> I checked this and the MSS matches on both sides:
>
> Juniper side:
> sndsbcc: 0 sndsbmbcnt: 0 sndsbmbmax: 262144
> sndsblowat: 2048 sndsbhiwat: 32768
> rcvsbcc: 0 rcvsbmbcnt: 0 rcvsbmbmax: 262144
> rcvsblowat: 1 rcvsbhiwat: 32768
> proc id: 3283 proc name: rpd
> iss: 1163062337 sndup: 1163062397
> snduna: 1163097242 sndnxt: 1163097242 sndwnd: 15130
> sndmax: 1163097242 sndcwnd: 65535 sndssthresh: 1073725440
> irs: 3033053077 rcvup: 3033087402
> rcvnxt: 3033087402 rcvadv: 3033069519 rcvwnd: 16384
> rtt: 0 srtt: 0 rttv: 12000
> rxtcur: 3000 rxtshift: 0 rtseq: 0
> rttmin: 1000 mss: 1460
> flags: ACKNOW [0x1]
>
> Cisco Side:
>
> Enqueued packets for retransmit: 0, input: 0 mis-ordered: 0 (0 bytes)
>
> Event Timers (current time is 0xEAB00ACB8):
> Timer Starts Wakeups Next
> Retrans 1813 10 0x0
> TimeWait 0 0 0x0
> AckHold 1821 1788 0x0
> SendWnd 0 0 0x0
> KeepAlive 0 0 0x0
> GiveUp 0 0 0x0
> PmtuAger 156412 156411 0xEAB00ADCB
> DeadWait 0 0 0x0
>
> iss: 3033053077 snduna: 3033087421 sndnxt: 3033087421 sndwnd: 16384
> irs: 1163062337 rcvnxt: 1163097261 rcvwnd: 15111 delrcvwnd: 1273
>
> SRTT: 300 ms, RTTO: 303 ms, RTV: 3 ms, KRTT: 0 ms
> minRTT: 0 ms, maxRTT: 8700 ms, ACK hold: 200 ms
> Flags: higher precedence, nagle, path mtu capable
>
> Datagrams (max data segment is 1460 bytes):
> Rcvd: 3611 (out of order: 0), with data: 1821, total data bytes: 34923
> Sent: 3616 (retransmit: 10), with data: 1803, total data bytes: 34343
>
> Another thing is the path-mtu is enabled, so TCP should negotiate the
> correct MSS. Am I wrong?
>
> Kind regards,
>
> Catalin Dominte
> Senior Network Consultant
> +44(0)1628302007
> Nocsult Ltd
> www.nocsult.net
>
>
> On Fri, Jul 24, 2015 at 1:56 PM, Mark Tinka <mark.tinka at seacom.mu> wrote:
>
>>
>>
>> On 24/Jul/15 14:48, Catalin Dominte wrote:
>>
>> Hi Mark,
>>
>> Thanks for getting back to me.
>>
>> This affects only a handful of customers and a handful of LINX peers.
>> They have always been stable, not had any issues with them.
>>
>> As far as I know they have not changed much in terms of hardware, but
>> in software configuration they could have changed stuff. I can control what
>> advertisements I receive from a customer and BGP policies, but I don't have
>> a lot of visibility into what the customers are doing during their normal
>> day to day operations.
>>
>> Besides it would be too much of a coincidence if say 5 peering sessions
>> get disconnected at random times, but all of them every time.
>>
>>
>> We've had issues like this across LINX peering sessions, where it turns
>> out to be an MTU issue.
>>
>> We have a standard TCP MSS of 1,500 bytes. We've generally solved this by
>> having the peer fix their MTU or MSS accordingly.
>>
>> I've always found it strange especially if the peer is physically
>> terminated on the LINX switch, and not coming in via a remote partner. But
>> fixing the MTU/MSS always works.
>>
>> Mark.
>>
>
>
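The two TCP outputs quoted above can be sanity-checked against each other. The sketch below assumes IPv4 and TCP headers without options (20 bytes each) and a 19-byte BGP KEEPALIVE per RFC 4271; the dictionary and function names are invented for the example.

```python
IPV4_HEADER = 20      # bytes, assuming no IP options
TCP_HEADER = 20       # bytes, assuming no TCP options
KEEPALIVE_BYTES = 19  # BGP KEEPALIVE size per RFC 4271

def mss_for_mtu(mtu):
    """Largest TCP payload that fits in one IPv4 packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER

# Values copied from the Juniper and Cisco outputs quoted above.
juniper = {"iss": 1163062337, "irs": 3033053077,
           "snduna": 1163097242, "rcvnxt": 3033087402, "mss": 1460}
cisco = {"iss": 3033053077, "irs": 1163062337,
         "snduna": 3033087421, "rcvnxt": 1163097261, "mss": 1460}

# Each side's initial send sequence should be the other's initial receive sequence.
same_session = juniper["iss"] == cisco["irs"] and juniper["irs"] == cisco["iss"]

# Difference between the two snapshots in each direction, modulo 2**32.
j_to_c_gap = (cisco["rcvnxt"] - juniper["snduna"]) % 2**32
c_to_j_gap = (cisco["snduna"] - juniper["rcvnxt"]) % 2**32

print(f"MSS for a 1500-byte MTU: {mss_for_mtu(1500)}")
print(f"same TCP session: {same_session}")
print(f"Juniper->Cisco gap: {j_to_c_gap} bytes, Cisco->Juniper gap: {c_to_j_gap} bytes")
```

Both sides report an MSS of 1460, which is what a 1500-byte MTU gives under these assumptions, and the sequence numbers differ by one keepalive-sized message in each direction, as expected if the two outputs were captured a short time apart.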