[c-nsp] Random BGP Drops

Fri Jul 24 11:10:53 EDT 2015

Just a few more details about this.

This did not happen on any IPv6 sessions, only on IPv4. The v6 sessions
haven't flapped for months.

The specific thing we are looking at in the logs on the other side is this
line:
Jul 24 00:33:04  rt1 rpd[1396]: bgp_read_v4_message:10656: NOTIFICATION
received from A.B.C.D (External AS *****): code 4 (Hold Timer Expired
Error), socket buffer sndcc: 57 rcvcc: 0 TCP state: 4, snd_una: 3040466763
snd_nxt: 3040466801 snd_wnd: 16194 rcv_nxt: 3738492361 rcv_adv: 3738508724,
hold timer out 90s, hold timer remain 1:07.779687s

More specifically: "hold timer remain 1:07.779687s"
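
The timer and socket fields in that line can be cross-checked with a little
arithmetic. The sketch below is only a rough sanity check, and it rests on two
assumptions that are not confirmed anywhere in this thread: that "hold timer
remain" is the time left on rpd's own hold timer for this peer (so 90s minus
that value is how long ago rpd last heard a KEEPALIVE or UPDATE from us), and
that snd_nxt minus snd_una is the number of bytes rpd has sent that we have
not yet acknowledged. The helper names are made up for illustration.

```python
import re

# The log line from above, rejoined onto one line (peer address and AS are
# masked in the original).
LOG = (
    "Jul 24 00:33:04  rt1 rpd[1396]: bgp_read_v4_message:10656: NOTIFICATION "
    "received from A.B.C.D (External AS *****): code 4 (Hold Timer Expired "
    "Error), socket buffer sndcc: 57 rcvcc: 0 TCP state: 4, snd_una: 3040466763 "
    "snd_nxt: 3040466801 snd_wnd: 16194 rcv_nxt: 3738492361 rcv_adv: 3738508724, "
    "hold timer out 90s, hold timer remain 1:07.779687s"
)

KEEPALIVE_BYTES = 19  # a BGP KEEPALIVE is just the 19-byte header (RFC 4271)


def field(name, text):
    """Return the integer that follows 'name: ' in the log text."""
    return int(re.search(rf"\b{name}: (\d+)", text).group(1))


def hold_timer_seconds(text):
    """Return (configured hold time, time remaining), both in seconds."""
    m = re.search(
        r"hold timer out (\d+)s, hold timer remain (?:(\d+):)?([\d.]+)s", text
    )
    total = int(m.group(1))
    remain = int(m.group(2) or 0) * 60 + float(m.group(3))
    return total, remain


total, remain = hold_timer_seconds(LOG)
# Assumption: the hold timer restarts on every KEEPALIVE/UPDATE received.
since_last_rx = total - remain
# TCP sequence numbers wrap at 2**32, so subtract modulo 2**32.
in_flight = (field("snd_nxt", LOG) - field("snd_una", LOG)) % 2**32
queued = field("sndcc", LOG)

print(f"time since rpd last heard from the peer: {since_last_rx:.1f}s")
print(f"unacked bytes in flight: {in_flight} ({in_flight / KEEPALIVE_BYTES:g} keepalives)")
print(f"bytes in send buffer:    {queued} ({queued / KEEPALIVE_BYTES:g} keepalives)")
```

If those assumptions hold, rpd had heard from us about 22 seconds before our
NOTIFICATION arrived, while it still had the equivalent of two or three
keepalives of its own sitting unacknowledged. That would be consistent with
traffic in our direction of transmission getting through and traffic towards
us not being received, but it does not say where the loss happens.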

Does this indicate one-way communication over the BGP session? We can't
think what would cause that apart from our CoPP policy. The relevant part of
it is:

policy-map copp
  class routing
  class management
  class monitoring
  class undesirable
   police 6000000    conform-action transmit     exceed-action drop
  class other
  class netbios
   police cir 32000    conform-action drop     exceed-action drop     violate-action drop

Hardware Counters:

    class-map: undesirable (match-all)
      Match: access-group 125
      police :
        6000000 bps 187500 limit 187500 extended limit
      Earl in slot 1 :
        4182956794 bytes
        5 minute offered rate 40 bps
        aggregate-forwarded 4172677422 bytes action: transmit
        exceeded 10279372 bytes action: drop
        aggregate-forward 152 bps exceed 0 bps
      Earl in slot 4 :
        54888502997 bytes
        5 minute offered rate 9040 bps
        aggregate-forwarded 34946501956 bytes action: transmit
        exceeded 19942001041 bytes action: drop
        aggregate-forward 7016 bps exceed 0 bps

  Software Counters:

    Class-map: undesirable (match-all)
      276617525 packets, 36984017831 bytes
      5 minute offered rate 6000 bps, drop rate 0000 bps
      Match: access-group 125
      police:
          cir 6000000 bps, bc 187500 bytes
        conformed 276617377 packets, 36983876623 bytes; actions:
          transmit
        exceeded 150 packets, 141208 bytes; actions:
          drop
        conformed 6000 bps, exceed 0000 bps

    Class-map: other (match-all)
      109899621 packets, 10132415208 bytes
      5 minute offered rate 4000 bps
      Match: access-group 124

  Hardware Counters:

    class-map: netbios (match-all)
      Match: access-group 126
      police :
        32000 bps 1500 limit 1500 extended limit
      Earl in slot 1 :
        0 bytes
        5 minute offered rate 0 bps
        aggregate-forwarded 0 bytes action: drop
        exceeded 0 bytes action: drop
        aggregate-forward 0 bps exceed 0 bps
      Earl in slot 4 :
        0 bytes
        5 minute offered rate 0 bps
        aggregate-forwarded 0 bytes action: drop
        exceeded 0 bytes action: drop
        aggregate-forward 0 bps exceed 0 bps

  Software Counters:

    Class-map: netbios (match-all)
      0 packets, 0 bytes
      5 minute offered rate 0000 bps, drop rate 0000 bps
      Match: access-group 126
      police:
          cir 32000 bps, bc 1500 bytes, be 1500 bytes
        conformed 0 packets, 0 bytes; actions:
          drop
        exceeded 0 packets, 0 bytes; actions:
          drop
        violated 0 packets, 0 bytes; actions:
          drop
        conformed 0000 bps, exceed 0000 bps, violate 0000 bps

    Class-map: class-default (match-any)
      3182132665 packets, 248587325791 bytes
      5 minute offered rate 237000 bps, drop rate 0000 bps
      Match: any
        3182132679 packets, 248587324073 bytes
        5 minute rate 237000 bps
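
To put the hardware counters for the "undesirable" class into proportion, here
is a small sketch that works out the share of bytes dropped per Earl and how
long a full burst lasts at the configured rate. All numbers are copied from the
output above; the variable and function names are made up for illustration,
and reading the 187500 "limit" as a burst size in bytes is an assumption based
on the "bc 187500 bytes" shown in the software counters.

```python
# Hardware counters for class "undesirable", copied from the output above.
EARLS = {
    "slot 1": {"total": 4182956794, "forwarded": 4172677422, "exceeded": 10279372},
    "slot 4": {"total": 54888502997, "forwarded": 34946501956, "exceeded": 19942001041},
}

CIR_BPS = 6_000_000    # "police 6000000"
BURST_BYTES = 187_500  # "187500 limit" / "bc 187500 bytes" (assumed to be bytes)


def drop_share(counters):
    """Percentage of offered bytes that the policer dropped."""
    return 100.0 * counters["exceeded"] / counters["total"]


for slot, counters in EARLS.items():
    # Sanity check: forwarded + exceeded should add up to the total.
    consistent = counters["forwarded"] + counters["exceeded"] == counters["total"]
    print(f"{slot}: {drop_share(counters):.2f}% of bytes dropped, "
          f"counters consistent: {consistent}")

# Time needed to send one full burst at the committed rate.
burst_seconds = BURST_BYTES * 8 / CIR_BPS
print(f"burst window at CIR: {burst_seconds:.2f}s")
```

These are lifetime counters, and the 5 minute exceed rate is 0 bps on both
Earls, so they show that the policer has dropped traffic in bursts at some
point, not that it is dropping now. Nothing in the output says whether BGP
packets match access-group 125, and in the policy as pasted the "routing"
class has no police statement.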

Kind regards,

Catalin Dominte
Senior Network Consultant
+44(0)1628302007
Nocsult Ltd
www.nocsult.net

On Fri, Jul 24, 2015 at 2:33 PM, Catalin Dominte <
catalin.dominte at nocsult.net> wrote:

> I checked this and the MSS matches on both sides:
>
> Juniper side:
>    sndsbcc:          0 sndsbmbcnt:          0  sndsbmbmax:     262144
> sndsblowat:       2048 sndsbhiwat:      32768
>    rcvsbcc:          0 rcvsbmbcnt:          0  rcvsbmbmax:     262144
> rcvsblowat:          1 rcvsbhiwat:      32768
>    proc id:       3283  proc name:        rpd
>        iss: 1163062337      sndup: 1163062397
>     snduna: 1163097242     sndnxt: 1163097242      sndwnd:      15130
>     sndmax: 1163097242    sndcwnd:      65535 sndssthresh: 1073725440
>        irs: 3033053077      rcvup: 3033087402
>     rcvnxt: 3033087402     rcvadv: 3033069519      rcvwnd:      16384
>        rtt:          0       srtt:          0        rttv:      12000
>     rxtcur:       3000   rxtshift:          0       rtseq:          0
>     rttmin:       1000  mss:       1460
>      flags: ACKNOW [0x1]
>
> Cisco Side:
>
> Enqueued packets for retransmit: 0, input: 0  mis-ordered: 0 (0 bytes)
>
> Event Timers (current time is 0xEAB00ACB8):
> Timer          Starts    Wakeups            Next
> Retrans          1813         10             0x0
> TimeWait            0          0             0x0
> AckHold          1821       1788             0x0
> SendWnd             0          0             0x0
> KeepAlive           0          0             0x0
> GiveUp              0          0             0x0
> PmtuAger       156412     156411     0xEAB00ADCB
> DeadWait            0          0             0x0
>
> iss: 3033053077  snduna: 3033087421  sndnxt: 3033087421     sndwnd:  16384
> irs: 1163062337  rcvnxt: 1163097261  rcvwnd:      15111  delrcvwnd:   1273
>
> SRTT: 300 ms, RTTO: 303 ms, RTV: 3 ms, KRTT: 0 ms
> minRTT: 0 ms, maxRTT: 8700 ms, ACK hold: 200 ms
> Flags: higher precedence, nagle, path mtu capable
>
> Datagrams (max data segment is 1460 bytes):
> Rcvd: 3611 (out of order: 0), with data: 1821, total data bytes: 34923
> Sent: 3616 (retransmit: 10), with data: 1803, total data bytes: 34343
>
> Another thing is that path MTU discovery is enabled, so TCP should negotiate
> the correct MSS. Am I wrong?
>
> Kind regards,
>
> Catalin Dominte
> Senior Network Consultant
> +44(0)1628302007
> Nocsult Ltd
> www.nocsult.net
>
>
> On Fri, Jul 24, 2015 at 1:56 PM, Mark Tinka <mark.tinka at seacom.mu> wrote:
>
>>
>>
>> On 24/Jul/15 14:48, Catalin Dominte wrote:
>>
>> Hi Mark,
>>
>>  Thanks for getting back to me.
>>
>>  This affects only a handful of customers and a handful of LINX peers.
>> They have always been stable; we have not had any issues with them.
>>
>>  As far as I know they have not changed much in terms of hardware, but
>> in software configuration they could have changed stuff. I can control what
>> advertisements I receive from a customer and BGP policies, but I don't have
>> a lot of visibility into what the customers are doing during their normal
>> day to day operations.
>>
>>  Besides, it would be too much of a coincidence if, say, 5 peering sessions
>> got disconnected at random times, but all of them every time.
>>
>>
>> We've had issues like this across LINX peering sessions, where it turns
>> out to be an MTU issue.
>>
>> We have a standard TCP MSS of 1,500 bytes. We've generally solved this by
>> having the peer fix their MTU or MSS accordingly.
>>
>> I've always found it strange, especially if the peer is physically
>> terminated on the LINX switch and not coming in via a remote partner. But
>> fixing the MTU/MSS always works.
>>
>> Mark.
>>
>
>