[c-nsp] Random BGP Drops

Catalin Dominte catalin.dominte at nocsult.net
Fri Jul 24 12:48:38 EDT 2015


Each class matches its own ACL, which permits the traffic that belongs to
that class. I only police certain classes.

I tested with ping with the DF bit set, and 1500 bytes works to all peers
whose sessions restarted. I would have thought that rules out MTU-related
issues.
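
For reference, the arithmetic behind that check, as a minimal sketch. It
assumes 20-byte IPv4 and TCP headers with no options; the function names
are made up for illustration.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options
ICMP_HEADER = 8   # bytes, ICMP echo header


def tcp_mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - TCP_HEADER


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


print(tcp_mss_for_mtu(1500))       # 1460
print(icmp_payload_for_mtu(1500))  # 1472
```

The 1460 figure agrees with the MSS both routers report further down.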

Yup. Everything works fine most of the time, but the sessions drop randomly.

Catalin
On 24 Jul 2015 5:43 pm, "Daniel Dib" <daniel.dib at reaper.nu> wrote:

> As far as I can see he is just policing undesirable and netbios. The other
> classes are just there without policing, so they will not do anything,
> unless he didn't paste the entire config here. I don't think it looks
> related to CoPP based on that output.
>
> I suppose a Telnet to TCP port 179 on the other side works? Any other
> indications that something isn't stable?
>
> -----Original Message-----
> From: cisco-nsp [mailto:cisco-nsp-bounces at puck.nether.net] On Behalf Of
> Chuck Church
> Sent: den 24 juli 2015 17:55
> To: 'Catalin Dominte'
> Cc: cisco-nsp at puck.nether.net
> Subject: Re: [c-nsp] Random BGP Drops
>
> It looks like you're lumping all the traffic for routing, management,
> monitoring, and undesirable into a single police statement.  There are
> millions of drops as a result.  Dedicating a police statement to each class
> would be far better.  Especially since undesirable is grouped in there.
>
> Chuck
>
> -----Original Message-----
> From: cisco-nsp [mailto:cisco-nsp-bounces at puck.nether.net] On Behalf Of
> Catalin Dominte
> Sent: Friday, July 24, 2015 11:11 AM
> To: Mark Tinka <mark.tinka at seacom.mu>
> Cc: cisco-nsp at puck.nether.net
> Subject: Re: [c-nsp] Random BGP Drops
>
> Just a few more details about this.
>
> This did not happen on any IPv6 sessions. Only IPv4.  The v6 sessions
> haven't flapped for months.
>
> The specific thing we are looking at in the logs on the other side is this
> line:
> Jul 24 00:33:04  rt1 rpd[1396]: bgp_read_v4_message:10656: NOTIFICATION
> received from A.B.C.D (External AS *****): code 4 (Hold Timer Expired
> Error), socket buffer sndcc: 57 rcvcc: 0 TCP state: 4, snd_una: 3040466763
> snd_nxt: 3040466801 snd_wnd: 16194 rcv_nxt: 3738492361 rcv_adv: 3738508724,
> hold timer out 90s, hold timer remain 1:07.779687s
>
> More specifically: "hold timer remain 1:07.779687s"
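
A rough sketch of what that field implies, assuming "hold timer remain" is
the time left on the Juniper's own hold timer and that the timer restarts
whenever a message arrives from the peer. That reading is an assumption,
not something confirmed from Junos documentation, and the function name is
made up for illustration.

```python
import re

# Fragment of the log line quoted above.
LOG = "hold timer out 90s, hold timer remain 1:07.779687s"


def seconds_since_last_message(log_line: str) -> float:
    """Hold time minus remaining time, i.e. time since the timer last reset."""
    m = re.search(
        r"hold timer out (\d+)s, hold timer remain (\d+):(\d+(?:\.\d+)?)s",
        log_line,
    )
    if m is None:
        raise ValueError("hold timer fields not found")
    hold = int(m.group(1))
    remain = int(m.group(2)) * 60 + float(m.group(3))
    return hold - remain


print(round(seconds_since_last_message(LOG), 6))  # 22.220313
```

If that reading is right, the Juniper had heard from our side about 22
seconds before it received the NOTIFICATION, which would fit messages
getting through in one direction only.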
>
> Does this indicate one-way communication over the BGP session? We can't
> think what would cause that apart from our CoPP policy, the relevant bit of
> that is:
>
> policy-map copp
>   class routing
>   class management
>   class monitoring
>   class undesirable
>    police 6000000    conform-action transmit     exceed-action drop
>   class other
>   class netbios
>    police cir 32000    conform-action drop     exceed-action drop     violate-action drop
>
> Hardware Counters:
>
>     class-map: undesirable (match-all)
>       Match: access-group 125
>       police :
>         6000000 bps 187500 limit 187500 extended limit
>       Earl in slot 1 :
>         4182956794 bytes
>         5 minute offered rate 40 bps
>         aggregate-forwarded 4172677422 bytes action: transmit
>         exceeded 10279372 bytes action: drop
>         aggregate-forward 152 bps exceed 0 bps
>       Earl in slot 4 :
>         54888502997 bytes
>         5 minute offered rate 9040 bps
>         aggregate-forwarded 34946501956 bytes action: transmit
>         exceeded 19942001041 bytes action: drop
>         aggregate-forward 7016 bps exceed 0 bps
>
>   Software Counters:
>
>     Class-map: undesirable (match-all)
>       276617525 packets, 36984017831 bytes
>       5 minute offered rate 6000 bps, drop rate 0000 bps
>       Match: access-group 125
>       police:
>           cir 6000000 bps, bc 187500 bytes
>         conformed 276617377 packets, 36983876623 bytes; actions:
>           transmit
>         exceeded 150 packets, 141208 bytes; actions:
>           drop
>         conformed 6000 bps, exceed 0000 bps
>
>     Class-map: other (match-all)
>       109899621 packets, 10132415208 bytes
>       5 minute offered rate 4000 bps
>       Match: access-group 124
>
>   Hardware Counters:
>
>     class-map: netbios (match-all)
>       Match: access-group 126
>       police :
>         32000 bps 1500 limit 1500 extended limit
>       Earl in slot 1 :
>         0 bytes
>         5 minute offered rate 0 bps
>         aggregate-forwarded 0 bytes action: drop
>         exceeded 0 bytes action: drop
>         aggregate-forward 0 bps exceed 0 bps
>       Earl in slot 4 :
>         0 bytes
>         5 minute offered rate 0 bps
>         aggregate-forwarded 0 bytes action: drop
>         exceeded 0 bytes action: drop
>         aggregate-forward 0 bps exceed 0 bps
>
>   Software Counters:
>
>     Class-map: netbios (match-all)
>       0 packets, 0 bytes
>       5 minute offered rate 0000 bps, drop rate 0000 bps
>       Match: access-group 126
>       police:
>           cir 32000 bps, bc 1500 bytes, be 1500 bytes
>         conformed 0 packets, 0 bytes; actions:
>           drop
>         exceeded 0 packets, 0 bytes; actions:
>           drop
>         violated 0 packets, 0 bytes; actions:
>           drop
>         conformed 0000 bps, exceed 0000 bps, violate 0000 bps
>
>     Class-map: class-default (match-any)
>       3182132665 packets, 248587325791 bytes
>       5 minute offered rate 237000 bps, drop rate 0000 bps
>       Match: any
>         3182132679 packets, 248587324073 bytes
>         5 minute rate 237000 bps
>
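
To put the policer numbers above in proportion, a small sketch using only
the figures from that output. The helper names are made up for
illustration, and it assumes bc is a burst size in bytes measured against a
cir in bits per second.

```python
def burst_seconds(cir_bps: int, bc_bytes: int) -> float:
    """How long a full burst allowance lasts at the committed rate."""
    return bc_bytes * 8 / cir_bps


def drop_share(forwarded_bytes: int, exceeded_bytes: int) -> float:
    """Fraction of offered bytes that the policer dropped."""
    return exceeded_bytes / (forwarded_bytes + exceeded_bytes)


# class undesirable: cir 6000000 bps, bc 187500 bytes
print(burst_seconds(6_000_000, 187_500))  # 0.25
# class netbios: cir 32000 bps, bc 1500 bytes
print(burst_seconds(32_000, 1_500))       # 0.375

# Hardware counters for class undesirable, per Earl slot
print(f"{drop_share(4_172_677_422, 10_279_372):.2%}")       # 0.25%
print(f"{drop_share(34_946_501_956, 19_942_001_041):.2%}")  # 36.33%
```

So on slot 4 roughly a third of the bytes matching access-group 125 have
been dropped in hardware over the life of the counters, while the software
counters show only 150 exceeded packets.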
>
> Kind regards,
>
> Catalin Dominte
> Senior Network Consultant
> +44(0)1628302007
> Nocsult Ltd
> www.nocsult.net
>
>
> On Fri, Jul 24, 2015 at 2:33 PM, Catalin Dominte <catalin.dominte at nocsult.net> wrote:
>
> > I checked this and the MSS matches on both sides:
> >
> > Juniper side:
> >    sndsbcc:          0 sndsbmbcnt:          0  sndsbmbmax:     262144
> > sndsblowat:       2048 sndsbhiwat:      32768
> >    rcvsbcc:          0 rcvsbmbcnt:          0  rcvsbmbmax:     262144
> > rcvsblowat:          1 rcvsbhiwat:      32768
> >    proc id:       3283  proc name:        rpd
> >        iss: 1163062337      sndup: 1163062397
> >     snduna: 1163097242     sndnxt: 1163097242      sndwnd:      15130
> >     sndmax: 1163097242    sndcwnd:      65535 sndssthresh: 1073725440
> >        irs: 3033053077      rcvup: 3033087402
> >     rcvnxt: 3033087402     rcvadv: 3033069519      rcvwnd:      16384
> >        rtt:          0       srtt:          0        rttv:      12000
> >     rxtcur:       3000   rxtshift:          0       rtseq:          0
> >     rttmin:       1000  mss:       1460
> >      flags: ACKNOW [0x1]
> >
> > Cisco Side:
> >
> > Enqueued packets for retransmit: 0, input: 0  mis-ordered: 0 (0 bytes)
> >
> > Event Timers (current time is 0xEAB00ACB8):
> > Timer          Starts    Wakeups            Next
> > Retrans          1813         10             0x0
> > TimeWait            0          0             0x0
> > AckHold          1821       1788             0x0
> > SendWnd             0          0             0x0
> > KeepAlive           0          0             0x0
> > GiveUp              0          0             0x0
> > PmtuAger       156412     156411     0xEAB00ADCB
> > DeadWait            0          0             0x0
> >
> > iss: 3033053077  snduna: 3033087421  sndnxt: 3033087421     sndwnd:  16384
> > irs: 1163062337  rcvnxt: 1163097261  rcvwnd:      15111  delrcvwnd:   1273
> >
> > SRTT: 300 ms, RTTO: 303 ms, RTV: 3 ms, KRTT: 0 ms
> > minRTT: 0 ms, maxRTT: 8700 ms, ACK hold: 200 ms
> > Flags: higher precedence, nagle, path mtu capable
> >
> > Datagrams (max data segment is 1460 bytes):
> > Rcvd: 3611 (out of order: 0), with data: 1821, total data bytes: 34923
> > Sent: 3616 (retransmit: 10), with data: 1803, total data bytes: 34343
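
A quick look at what those datagram counters say, as a sketch using only
the numbers above. The 19-byte figure is the size of a BGP KEEPALIVE (a
header-only message per RFC 4271); the helper name is made up for
illustration.

```python
BGP_KEEPALIVE_BYTES = 19  # header-only BGP message (RFC 4271)


def avg_payload(total_data_bytes: int, data_segments: int) -> float:
    """Average TCP payload per data-carrying segment."""
    return total_data_bytes / data_segments


rcvd_avg = avg_payload(34923, 1821)
sent_avg = avg_payload(34343, 1803)
retransmit_share = 10 / 3616

print(f"{rcvd_avg:.2f}")          # 19.18
print(f"{sent_avg:.2f}")          # 19.05
print(f"{retransmit_share:.2%}")  # 0.28%
```

Both averages sit just above one keepalive per segment, so this session
appears to have carried almost nothing but keepalives, and segments
anywhere near the 1460-byte MSS would have been rare on it.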
> >
> > Another thing is the path-mtu is enabled, so TCP should negotiate the
> > correct MSS. Am I wrong?
> >
> > Kind regards,
> >
> > Catalin Dominte
> > Senior Network Consultant
> > +44(0)1628302007
> > Nocsult Ltd
> > www.nocsult.net
> >
> >
> > On Fri, Jul 24, 2015 at 1:56 PM, Mark Tinka <mark.tinka at seacom.mu> wrote:
> >
> >>
> >>
> >> On 24/Jul/15 14:48, Catalin Dominte wrote:
> >>
> >> Hi Mark,
> >>
> >>  Thanks for getting back to me.
> >>
> >>  This affects only a handful of customers and a handful of LINX peers.
> >> They have always been stable, not had any issues with them.
> >>
> >>  As far as I know they have not changed much in terms of hardware,
> >> but in software configuration they could have changed stuff. I can
> >> control what advertisements I receive from a customer and BGP
> >> policies, but I don't have a lot of visibility into what the
> >> customers are doing during their normal day to day operations.
> >>
> >>  Besides it would be too much of a coincidence if say 5 peering
> >> sessions get disconnected at random times, but all of them every time.
> >>
> >>
> >> We've had issues like this across LINX peering sessions, where it
> >> turns out to be an MTU issue.
> >>
> >> We have a standard TCP MSS of 1,500 bytes. We've generally solved
> >> this by having the peer fix their MTU or MSS accordingly.
> >>
> >> I've always found it strange especially if the peer is physically
> >> terminated on the LINX switch, and not coming in via a remote
> >> partner. But fixing the MTU/MSS always works.
> >>
> >> Mark.
> >>
> >
> >
> _______________________________________________
> cisco-nsp mailing list  cisco-nsp at puck.nether.net
> https://puck.nether.net/mailman/listinfo/cisco-nsp
> archive at http://puck.nether.net/pipermail/cisco-nsp/
>
>

