[c-nsp] Random BGP Drops
Catalin Dominte
catalin.dominte at nocsult.net
Fri Jul 24 12:48:38 EDT 2015
Each class matches an ACL that permits the relevant traffic for that class. I
then policed only certain classes.
I tested with a 1500-byte ping with the DF bit set, and it works to all the
peers that restarted, so I would have thought that rules out MTU-related issues.
Yup. Everything works fine most of the time, but the sessions drop randomly.
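The header arithmetic behind the 1500-byte DF-bit test (and the 1460-byte MSS that shows up later in the thread) can be sketched as below. It assumes IPv4 with minimum-size 20-byte IP and TCP headers and an 8-byte ICMP header. One caveat worth checking: as far as we know, IOS "ping ... size" counts the whole IP datagram while Junos "ping size" counts only the ICMP payload, so "1500 bytes" can mean a 1500-byte or a 1528-byte packet depending on which side the test ran from.

```python
# Minimal sketch: IPv4 header arithmetic, assuming no IP or TCP options.
IPV4_HEADER = 20  # bytes, minimum IPv4 header
TCP_HEADER = 20   # bytes, minimum TCP header
ICMP_HEADER = 8   # bytes, ICMP echo request/reply header


def max_tcp_mss(mtu: int) -> int:
    """Largest TCP segment payload that fits in one IPv4 packet of `mtu` bytes."""
    return mtu - IPV4_HEADER - TCP_HEADER


def max_icmp_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


if __name__ == "__main__":
    mtu = 1500
    print(f"MTU {mtu}: max MSS {max_tcp_mss(mtu)}, "
          f"max DF ping payload {max_icmp_payload(mtu)}")
```

Run as is, this prints "MTU 1500: max MSS 1460, max DF ping payload 1472".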
Catalin
On 24 Jul 2015 5:43 pm, "Daniel Dib" <daniel.dib at reaper.nu> wrote:
> As far as I can see, he is just policing undesirable and netbios. The other
> classes are just there without policing, so either they do nothing or he
> didn't paste the entire config here. Based on that output, I don't think it
> is related to CoPP.
>
> I suppose a Telnet to TCP port 179 on the other side works? Any other
> indications that something isn't stable?
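Daniel's Telnet check only proves that a TCP three-way handshake to port 179 completes. A minimal Python equivalent is sketched below; tcp_port_open is a hypothetical helper, and it needs to run from a source address the peer accepts, since a BGP speaker may refuse or reset connections from addresses that are not configured neighbors.

```python
import socket


def tcp_port_open(host: str, port: int = 179, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake to host:port completes within `timeout` s.

    Like 'telnet <peer> 179', this only shows the handshake succeeds; it says
    nothing about whether BGP keepalives later flow in both directions.
    """
    try:
        # create_connection performs the handshake; the socket is closed at once.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False
```

For example, tcp_port_open("192.0.2.1") (a documentation address standing in for the real peer) returns True only if the handshake completes within three seconds.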
>
> -----Original Message-----
> From: cisco-nsp [mailto:cisco-nsp-bounces at puck.nether.net] On Behalf Of
> Chuck Church
> Sent: 24 July 2015 17:55
> To: 'Catalin Dominte'
> Cc: cisco-nsp at puck.nether.net
> Subject: Re: [c-nsp] Random BGP Drops
>
> It looks like you're lumping all the traffic for routing, management,
> monitoring, and undesirable into a single police statement. There are
> millions of drops as a result. Dedicating a police statement to each class
> would be far better, especially since undesirable is grouped in there.
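Chuck's concern can be illustrated with a toy single-rate token-bucket policer: if BGP shares a bucket with bursty traffic, a burst can drain the tokens so that the next keepalive exceeds, whereas a dedicated bucket is unaffected. This is a simplified model, not how the PFC hardware policer is implemented; TokenBucket and keepalive_conforms are hypothetical names, and the 6 Mbps / 187,500-byte figures are borrowed from the undesirable class purely for illustration. It also assumes Chuck's reading that the classes share one police statement, which Daniel disputes.

```python
KEEPALIVE_LEN = 19  # bytes: a BGP KEEPALIVE is only the 19-byte message header


class TokenBucket:
    """Toy single-rate policer: conform if enough tokens, otherwise exceed (drop)."""

    def __init__(self, cir_bps: float, bc_bytes: float) -> None:
        self.rate = cir_bps / 8         # refill rate in bytes per second
        self.depth = bc_bytes           # bucket size (normal burst) in bytes
        self.tokens = float(bc_bytes)   # start with a full bucket
        self.last = 0.0                 # time of the previous packet, seconds

    def offer(self, size: int, now: float) -> bool:
        """Refill for the elapsed time, then return True if `size` bytes conform."""
        self.tokens = min(self.depth, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False


def keepalive_conforms(shared: bool) -> bool:
    """Drain the 'undesirable' bucket with a burst, then offer one BGP keepalive."""
    undesirable = TokenBucket(6_000_000, 187_500)
    # Shared: routing uses the same bucket. Dedicated: routing has its own.
    routing = undesirable if shared else TokenBucket(6_000_000, 187_500)
    for _ in range(125):  # 125 x 1500 bytes = 187,500 bytes, one full bucket
        undesirable.offer(1500, now=0.0)
    return routing.offer(KEEPALIVE_LEN, now=0.0)


if __name__ == "__main__":
    print("shared bucket:   ", keepalive_conforms(shared=True))
    print("dedicated bucket:", keepalive_conforms(shared=False))
```

In this model the keepalive is dropped only when it shares the drained bucket.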
>
> Chuck
>
> -----Original Message-----
> From: cisco-nsp [mailto:cisco-nsp-bounces at puck.nether.net] On Behalf Of
> Catalin Dominte
> Sent: Friday, July 24, 2015 11:11 AM
> To: Mark Tinka <mark.tinka at seacom.mu>
> Cc: cisco-nsp at puck.nether.net
> Subject: Re: [c-nsp] Random BGP Drops
>
> Just a few more details about this.
>
> This did not happen on any IPv6 sessions. Only IPv4. The v6 sessions
> haven't flapped for months.
>
> The specific thing we are looking at in the logs on the other side is this
> line:
> Jul 24 00:33:04 rt1 rpd[1396]: bgp_read_v4_message:10656: NOTIFICATION
> received from A.B.C.D (External AS *****): code 4 (Hold Timer Expired
> Error), socket buffer sndcc: 57 rcvcc: 0 TCP state: 4, snd_una: 3040466763
> snd_nxt: 3040466801 snd_wnd: 16194 rcv_nxt: 3738492361 rcv_adv: 3738508724,
> hold timer out 90s, hold timer remain 1:07.779687s
>
> More specifically: "hold timer remain 1:07.779687s"
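Reading the numbers in that rpd line: if "hold timer out 90s" is the negotiated hold time and "hold timer remain 1:07.779687s" (read here as minutes:seconds) is what was left on the Juniper's own timer, the Juniper had last heard from A.B.C.D about 22 seconds earlier, while A.B.C.D's hold timer had already expired. The socket fields point the same way: snd_nxt minus snd_una is 38 bytes, exactly two 19-byte KEEPALIVEs sent but not acknowledged, and the 57 bytes in the send buffer are also a multiple of 19. That is consistent with, but does not prove, loss in the Juniper-to-Cisco direction. A small Python sketch of the arithmetic follows; the field parsing is an assumption about this log format, not a documented Junos interface.

```python
import re

KEEPALIVE_LEN = 19  # bytes: a BGP KEEPALIVE is only the 19-byte message header

# Relevant part of the rpd log line quoted above.
LOG = ("socket buffer sndcc: 57 rcvcc: 0 TCP state: 4, snd_una: 3040466763 "
       "snd_nxt: 3040466801 snd_wnd: 16194 rcv_nxt: 3738492361 "
       "rcv_adv: 3738508724, hold timer out 90s, "
       "hold timer remain 1:07.779687s")


def field(name: str, text: str) -> int:
    """Return the integer after 'name: ' in the log text."""
    m = re.search(rf"\b{name}: (\d+)", text)
    if m is None:
        raise ValueError(f"field {name!r} not found")
    return int(m.group(1))


def hold_times(text: str) -> tuple[int, float]:
    """Return (hold time, seconds remaining), reading 'M:SS.ffffff' as min:sec."""
    hold = re.search(r"hold timer out (\d+)s", text)
    remain = re.search(r"hold timer remain (\d+):(\d+(?:\.\d+)?)s", text)
    if hold is None or remain is None:
        raise ValueError("hold timer fields not found")
    return int(hold.group(1)), int(remain.group(1)) * 60 + float(remain.group(2))


if __name__ == "__main__":
    hold, remain = hold_times(LOG)
    # Sequence numbers wrap at 2^32, so take the difference modulo 2^32.
    unacked = (field("snd_nxt", LOG) - field("snd_una", LOG)) % 2**32
    print(f"last heard from peer about {hold - remain:.1f}s before the NOTIFICATION")
    print(f"sent but unacked: {unacked} bytes ({unacked // KEEPALIVE_LEN} keepalives)")
    print(f"queued in send buffer: {field('sndcc', LOG)} bytes")
```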
>
> Does this indicate one-way communication over the BGP session? We can't
> think of what would cause that apart from our CoPP policy, the relevant part
> of which is:
>
> policy-map copp
>  class routing
>  class management
>  class monitoring
>  class undesirable
>   police 6000000 conform-action transmit exceed-action drop
>  class other
>  class netbios
>   police cir 32000 conform-action drop exceed-action drop violate-action drop
>
> Hardware Counters:
>
> class-map: undesirable (match-all)
> Match: access-group 125
> police :
> 6000000 bps 187500 limit 187500 extended limit
> Earl in slot 1 :
> 4182956794 bytes
> 5 minute offered rate 40 bps
> aggregate-forwarded 4172677422 bytes action: transmit
> exceeded 10279372 bytes action: drop
> aggregate-forward 152 bps exceed 0 bps
> Earl in slot 4 :
> 54888502997 bytes
> 5 minute offered rate 9040 bps
> aggregate-forwarded 34946501956 bytes action: transmit
> exceeded 19942001041 bytes action: drop
> aggregate-forward 7016 bps exceed 0 bps
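Chuck's "millions of drops" refers to these hardware counters. A quick sketch of the exceeded share per forwarding engine, using the byte counts copied from the output above (the dictionary layout is only for illustration):

```python
# Hardware counters for class-map 'undesirable', in bytes, copied from above.
EARLS = {
    "slot 1": {"total": 4_182_956_794, "forwarded": 4_172_677_422,
               "exceeded": 10_279_372},
    "slot 4": {"total": 54_888_502_997, "forwarded": 34_946_501_956,
               "exceeded": 19_942_001_041},
}


def exceed_share(counters: dict[str, int]) -> float:
    """Fraction of the bytes offered to the policer that were dropped as exceeding."""
    return counters["exceeded"] / counters["total"]


if __name__ == "__main__":
    for slot, c in EARLS.items():
        # Sanity check: forwarded + exceeded should add up to the total.
        assert c["forwarded"] + c["exceeded"] == c["total"]
        print(f"{slot}: {exceed_share(c):.2%} of bytes exceeded")
```

That is roughly a third of the bytes on slot 4 against about a quarter of a percent on slot 1. Whether those bytes were BGP or genuinely undesirable traffic depends on what ACL 125 matches, which the thread doesn't show; the 5-minute exceed rate is also 0 bps on both slots, so no drops were happening in the window of this capture.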
>
> Software Counters:
>
> Class-map: undesirable (match-all)
> 276617525 packets, 36984017831 bytes
> 5 minute offered rate 6000 bps, drop rate 0000 bps
> Match: access-group 125
> police:
> cir 6000000 bps, bc 187500 bytes
> conformed 276617377 packets, 36983876623 bytes; actions:
> transmit
> exceeded 150 packets, 141208 bytes; actions:
> drop
> conformed 6000 bps, exceed 0000 bps
>
> Class-map: other (match-all)
> 109899621 packets, 10132415208 bytes
> 5 minute offered rate 4000 bps
> Match: access-group 124
>
> Hardware Counters:
>
> class-map: netbios (match-all)
> Match: access-group 126
> police :
> 32000 bps 1500 limit 1500 extended limit
> Earl in slot 1 :
> 0 bytes
> 5 minute offered rate 0 bps
> aggregate-forwarded 0 bytes action: drop
> exceeded 0 bytes action: drop
> aggregate-forward 0 bps exceed 0 bps
> Earl in slot 4 :
> 0 bytes
> 5 minute offered rate 0 bps
> aggregate-forwarded 0 bytes action: drop
> exceeded 0 bytes action: drop
> aggregate-forward 0 bps exceed 0 bps
>
> Software Counters:
>
> Class-map: netbios (match-all)
> 0 packets, 0 bytes
> 5 minute offered rate 0000 bps, drop rate 0000 bps
> Match: access-group 126
> police:
> cir 32000 bps, bc 1500 bytes, be 1500 bytes
> conformed 0 packets, 0 bytes; actions:
> drop
> exceeded 0 packets, 0 bytes; actions:
> drop
> violated 0 packets, 0 bytes; actions:
> drop
> conformed 0000 bps, exceed 0000 bps, violate 0000 bps
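The burst sizes in these counters look like IOS defaults: as we understand it, when no burst is configured IOS uses CIR/32 bytes (a quarter of a second of traffic at the CIR) with a floor of 1500 bytes. A sketch of that assumed rule, which reproduces both values shown above (187,500 bytes for the 6 Mbps policer, 1500 bytes for the 32 kbps one):

```python
def default_bc_bytes(cir_bps: int) -> int:
    """Assumed IOS default normal burst: CIR/32 bytes (250 ms at CIR), minimum 1500."""
    # cir_bps / 8 is bytes per second; a quarter second of that is cir_bps / 32.
    return max(cir_bps // 32, 1500)


if __name__ == "__main__":
    for cir in (6_000_000, 32_000):
        print(f"cir {cir} bps -> bc {default_bc_bytes(cir)} bytes")
```

In other words, the undesirable policer tolerates a back-to-back burst of about 187.5 kB before it starts dropping; check the platform documentation before relying on the exact default.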
>
> Class-map: class-default (match-any)
> 3182132665 packets, 248587325791 bytes
> 5 minute offered rate 237000 bps, drop rate 0000 bps
> Match: any
> 3182132679 packets, 248587324073 bytes
> 5 minute rate 237000 bps
>
>
> Kind regards,
>
> Catalin Dominte
> Senior Network Consultant
> +44(0)1628302007
> Nocsult Ltd
> www.nocsult.net
>
>
> On Fri, Jul 24, 2015 at 2:33 PM, Catalin Dominte <
> catalin.dominte at nocsult.net> wrote:
>
> > I checked this and the MSS matches on both sides:
> >
> > Juniper side:
> > sndsbcc: 0 sndsbmbcnt: 0 sndsbmbmax: 262144
> > sndsblowat: 2048 sndsbhiwat: 32768
> > rcvsbcc: 0 rcvsbmbcnt: 0 rcvsbmbmax: 262144
> > rcvsblowat: 1 rcvsbhiwat: 32768
> > proc id: 3283 proc name: rpd
> > iss: 1163062337 sndup: 1163062397
> > snduna: 1163097242 sndnxt: 1163097242 sndwnd: 15130
> > sndmax: 1163097242 sndcwnd: 65535 sndssthresh: 1073725440
> > irs: 3033053077 rcvup: 3033087402
> > rcvnxt: 3033087402 rcvadv: 3033069519 rcvwnd: 16384
> > rtt: 0 srtt: 0 rttv: 12000
> > rxtcur: 3000 rxtshift: 0 rtseq: 0
> > rttmin: 1000 mss: 1460
> > flags: ACKNOW [0x1]
> >
> > Cisco Side:
> >
> > Enqueued packets for retransmit: 0, input: 0 mis-ordered: 0 (0 bytes)
> >
> > Event Timers (current time is 0xEAB00ACB8):
> > Timer       Starts    Wakeups          Next
> > Retrans       1813         10           0x0
> > TimeWait         0          0           0x0
> > AckHold       1821       1788           0x0
> > SendWnd          0          0           0x0
> > KeepAlive        0          0           0x0
> > GiveUp           0          0           0x0
> > PmtuAger    156412     156411   0xEAB00ADCB
> > DeadWait         0          0           0x0
> >
> > iss: 3033053077 snduna: 3033087421 sndnxt: 3033087421 sndwnd: 16384
> > irs: 1163062337 rcvnxt: 1163097261 rcvwnd: 15111 delrcvwnd: 1273
> >
> > SRTT: 300 ms, RTTO: 303 ms, RTV: 3 ms, KRTT: 0 ms
> > minRTT: 0 ms, maxRTT: 8700 ms, ACK hold: 200 ms
> > Flags: higher precedence, nagle, path mtu capable
> >
> > Datagrams (max data segment is 1460 bytes):
> > Rcvd: 3611 (out of order: 0), with data: 1821, total data bytes: 34923
> > Sent: 3616 (retransmit: 10), with data: 1803, total data bytes: 34343
> >
> > Another thing: path MTU discovery is enabled, so TCP should negotiate the
> > correct MSS. Am I wrong?
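The two dumps can also be cross-checked with 32-bit sequence arithmetic. Each side's ISS matches the other's IRS, so they describe the same connection; on the Cisco side, sequence number minus ISN minus one (for the SYN) reproduces the "total data bytes" counters; and the two snapshots differ by exactly 19 bytes in each direction, one BGP KEEPALIVE. That is consistent with the outputs simply having been captured a keepalive apart rather than with any disagreement between the peers. A sketch, with the values copied from the output above:

```python
MOD = 2**32          # TCP sequence numbers wrap at 2^32
KEEPALIVE_LEN = 19   # bytes: a BGP KEEPALIVE is only the 19-byte message header

# Values copied from the Juniper and Cisco TCP state shown above.
juniper = {"iss": 1163062337, "snduna": 1163097242,
           "irs": 3033053077, "rcvnxt": 3033087402}
cisco = {"iss": 3033053077, "sndnxt": 3033087421,
         "irs": 1163062337, "rcvnxt": 1163097261}


def seq_diff(a: int, b: int) -> int:
    """a - b in 32-bit sequence space."""
    return (a - b) % MOD


def data_bytes(next_seq: int, isn: int) -> int:
    """Payload bytes carried so far; the SYN consumes one sequence number."""
    return seq_diff(next_seq, isn) - 1


if __name__ == "__main__":
    assert juniper["iss"] == cisco["irs"] and cisco["iss"] == juniper["irs"]
    print("Cisco sent:", data_bytes(cisco["sndnxt"], cisco["iss"]))      # 34343
    print("Cisco received:", data_bytes(cisco["rcvnxt"], cisco["irs"]))  # 34923
    print("Juniper->Cisco gap:", seq_diff(cisco["rcvnxt"], juniper["snduna"]))
    print("Cisco->Juniper gap:", seq_diff(cisco["sndnxt"], juniper["rcvnxt"]))
```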
> >
> >
> > On Fri, Jul 24, 2015 at 1:56 PM, Mark Tinka <mark.tinka at seacom.mu> wrote:
> >
> >>
> >>
> >> On 24/Jul/15 14:48, Catalin Dominte wrote:
> >>
> >> Hi Mark,
> >>
> >> Thanks for getting back to me.
> >>
> >> This affects only a handful of customers and a handful of LINX peers.
> >> They have always been stable; we have not had any issues with them.
> >>
> >> As far as I know, they have not changed much in terms of hardware,
> >> but they could have changed their software configuration. I control
> >> the advertisements I accept from a customer and the BGP policies,
> >> but I don't have much visibility into what the customers are doing
> >> in their normal day-to-day operations.
> >>
> >> Besides, it would be too much of a coincidence if, say, 5 peering
> >> sessions got disconnected at random times, but all of them every time.
> >>
> >>
> >> We've had issues like this across LINX peering sessions, where it
> >> turns out to be an MTU issue.
> >>
> >> We have a standard TCP MSS of 1,500 bytes. We've generally solved
> >> this by having the peer fix their MTU or MSS accordingly.
> >>
> >> I've always found it strange, especially if the peer is physically
> >> terminated on the LINX switch, and not coming in via a remote
> >> partner. But fixing the MTU/MSS always works.
> >>
> >> Mark.
> >>
> >
> >
> _______________________________________________
> cisco-nsp mailing list cisco-nsp at puck.nether.net
> https://puck.nether.net/mailman/listinfo/cisco-nsp
> archive at http://puck.nether.net/pipermail/cisco-nsp/
>