[c-nsp] Help with output drops
Randy McAnally
rsm at fast-serv.com
Mon Jul 13 00:30:29 EDT 2009
Hi all,
I just finished installing and configuring a new 6509 with dual Sup720-3BXLs
(12.2(18)SXF15a) and a 6724 linecard. It serves a simple purpose: maintaining a
single BGP session and handling layer 3 (VLANs) for various access switches.
No end devices are connected.
The problem is that I am getting constant output drops whenever the aggregation
uplink goes above ~400 Mbps, which is nowhere near the interface speed. See below;
note the massive 'Total output drops' counter with no other errors (on either end):
rtr1.ash#sh int g1/1
GigabitEthernet1/1 is up, line protocol is up (connected)
Hardware is C6k 1000Mb 802.3, address is 00d0.01ff.5800 (bia 00d0.01ff.5800)
Description: PTP-UPLINK
Internet address is 209.9.224.68/29
MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
reliability 255/255, txload 118/255, rxload 12/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 1000Mb/s, media type is T
input flow-control is off, output flow-control is off
Clock mode is auto
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:00:00, output 00:00:01, output hang never
Last clearing of "show interface" counters 05:01:25
Input queue: 0/1000/0/0 (size/max/drops/flushes); Total output drops: 718023
Queueing strategy: fifo
Output queue: 0/100 (size/max)
30 second input rate 47789000 bits/sec, 30797 packets/sec
30 second output rate 465362000 bits/sec, 48729 packets/sec
L2 Switched: ucast: 27775 pkt, 2136621 bytes - mcast: 24590 pkt, 1574763 bytes
L3 in Switched: ucast: 592150327 pkt, 95608889548 bytes - mcast: 0 pkt, 0 bytes mcast
L3 out Switched: ucast: 991372425 pkt, 1214882993007 bytes mcast: 0 pkt, 0 bytes
592554441 packets input, 95674494492 bytes, 0 no buffer
Received 33643 broadcasts (17872 IP multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 0 multicast, 0 pause input
0 input packets with dribble condition detected
991006394 packets output, 1214377864373 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 PAUSE output
0 output buffer failures, 0 output buffers swapped out
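For scale, that's roughly 718,000 drops against 991 million packets output in the
five hours since the counters were last cleared, i.e. on the order of 0.07% of
output traffic overall, but it only piles up during the periods above ~400 Mbps.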
The CPU usage is nil:
rtr1.ash#sh proc cpu sort
CPU utilization for five seconds: 1%/0%; one minute: 0%; five minutes: 0%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
6 3036624 252272 12037 0.47% 0.19% 0.18% 0 Check heaps
316 195004 99543 1958 0.15% 0.01% 0.00% 0 BGP Scanner
119 267568 2962884 90 0.15% 0.03% 0.02% 0 IP Input
172 413528 2134933 193 0.07% 0.03% 0.02% 0 CEF process
4 16 48214 0 0.00% 0.00% 0.00% 0 cpf_process_ipcQ
3 0 2 0 0.00% 0.00% 0.00% 0 cpf_process_msg_
5 0 1 0 0.00% 0.00% 0.00% 0 PF Redun ICC Req
2 772 298376 2 0.00% 0.00% 0.00% 0 Load Meter
9 23964 157684 151 0.00% 0.01% 0.00% 0 ARP Input
7 0 1 0 0.00% 0.00% 0.00% 0 Pool Manager
8 0 2 0 0.00% 0.00% 0.00% 0 Timers
<<<snip>>>
I THINK I have determined the drops are caused by buffer congestion on the port:
rtr1.ash#sh queueing interface gigabitEthernet 1/1
Interface GigabitEthernet1/1 queueing strategy: Weighted Round-Robin
Port QoS is enabled
Port is untrusted
Extend trust state: not trusted [COS = 0]
Default COS is 0
Queueing Mode In Tx direction: mode-cos
Transmit queues [type = 1p3q8t]:
Queue Id Scheduling Num of thresholds
-----------------------------------------
01 WRR 08
02 WRR 08
03 WRR 08
04 Priority 01
WRR bandwidth ratios: 100[queue 1] 150[queue 2] 200[queue 3]
queue-limit ratios: 50[queue 1] 20[queue 2] 15[queue 3] 15[Pri Queue]
<<<snip>>>
Packets dropped on Transmit:
queue dropped [cos-map]
---------------------------------------------
1 719527 [0 1 ]
2 0 [2 3 4 ]
3 0 [6 7 ]
4 0 [5 ]
So it would appear all of my traffic goes into queue 1, which makes sense since the
port is untrusted and everything defaults to CoS 0, and CoS 0/1 map to queue 1. It
would also seem that the default 50% queue-limit for queue 1 isn't enough? These are
the default settings, by the way.
I'm pretty sure that wrr-queue queue-limit and wrr-queue bandwidth should help
mitigate this frustrating packet loss, but I have no experience with these features
(I come from a Foundry/Brocade background) and would like some insight before I
start making changes; for example, is something along the lines of the sketch below
the right direction? I don't want to try anything that could risk downtime or
further issues in a production environment.
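For reference, the sort of change I had in mind looks something like this. The
percentages and weights are just placeholders to illustrate the syntax, not a
recommendation, and I'd want to verify the exact argument counts for a 1p3q8t port
against the config guide before touching anything:

interface GigabitEthernet1/1
 ! grow queue 1's share of the transmit buffer (values for WRR queues 1, 2, 3)
 wrr-queue queue-limit 70 15 15
 ! adjust the WRR scheduling weights for queues 1, 2, 3
 wrr-queue bandwidth 100 150 200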
And lastly, what should I look out for when modifying the buffers? Network
blips, more congestion, etc.? This is a production switch and the last thing I
need to do is make matters worse.
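My rough plan for verifying whether a change actually helps (assuming nothing
better is suggested) is just to watch the per-queue transmit drop counters under
load before and after:

rtr1.ash#sh queueing interface gigabitEthernet 1/1 | begin Packets dropped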
Thank you!
--
Randy