[c-nsp] performance problems / overruns on a 6500/sup720/dfc's

bas kilobit at gmail.com
Thu Jul 23 17:53:58 EDT 2009


Hello All,

I hope you guys can help me with the following issue.

It started a couple of weeks ago when one customer reported degraded
performance.
The customer has ~30 servers on a WS-C3750E-48TD, which in turn has a
single 10GE link to the 6500 in question.
The 10GE link on the 6500 has a service policy configured to limit IP
traffic to 8 Gbps (via an aggregate policer).
Before the problems started, the customer was able to push 8 Gbps on
the link for 16 hours a day; during the remaining hours their service
has fewer visitors.
The issue arises every day around the time the router as a whole
starts to forward 7.5 - 8 Mpps (approx. 50 Gbps).
When that moment comes, the interface facing the customer drops down
to 5 - 6 Gbps.
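For completeness, the 8 Gbps limit is implemented roughly like this (a
sketch only; the policer name, class-map, ACL name and burst value are
illustrative, not our exact config):

```
! Illustrative aggregate policer limiting the customer to 8 Gbps.
! Names and the burst size are made up for this example.
mls qos
mls qos aggregate-policer CUST-8G 8000000000 250000 conform-action transmit exceed-action drop
!
class-map match-all CUST-IP
  match access-group name CUST-IP-ACL
!
policy-map CUST-LIMIT
  class CUST-IP
    police aggregate CUST-8G
!
interface TenGigabitEthernet3/2
  service-policy input CUST-LIMIT
```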

In the interface counters we can see the overrun counter increasing very fast.
This continues until about 23:00, when the total traffic forwarded
drops below 8 Mpps.

mod1: WS-X6708-10GE
mod2: WS-X6748-SFP
mod3: WS-X6704-10GE
mod4: WS-X6748-GE-TX
mod5: WS-X6748-GE-TX
mod6: WS-SUP720-3BXL

Initially running 12.2(18)SXF15a
Currently running 12.2(33)SXI1

The customer was initially connected to Te1/7 and is currently on Te3/2.

Things we have investigated or changed (none of which resolved the issue):
- We saw through "sh plat hard cap fab" that some of the fabric
channels were (nearly) congested, so we swapped a couple of TenG
interfaces between channels and between slots 1 and 3.

- We suspected a possible relation to Cisco bugs CSCeh08451 or
CSCsl70634. Even though both are resolved in SXF12, we upgraded to SXI1.
- Possibly hitting some bottleneck in the PFC/fabric, so we upgraded
modules 2 and 3 (the most heavily utilized modules) with DFC-3BXLs.
- Tried different hold-queue settings, in and out.
- Tried several fabric buffer-reserve settings.
- Disabled all NetFlow.
- Removed the policy-map(s).
- Enabled/disabled send/receive flow control on several ports and
also on the customer's 3750.
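For anyone retracing our steps, these are roughly the commands
involved (the hold-queue depths and flow-control directions shown are
just examples; we tried several combinations):

```
! Checking per-channel fabric utilization
show platform hardware capacity fabric
show fabric utilization all
! One of the fabric buffer settings we experimented with
fabric buffer-reserve queue
! Hold-queue and flow-control experiments on the customer-facing port
interface TenGigabitEthernet3/2
  hold-queue 4096 in
  hold-queue 4096 out
  flowcontrol receive on
  flowcontrol send on
```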

More customers are noticing degraded performance: lower speeds and
5 - 20% packet loss.

The router has enough free memory, and the SP and RP CPUs are always below 30%.

Below is the "sh int" output for the first customer that reported issues.

TenGigabitEthernet3/2 is up, line protocol is up (connected)
  Hardware is C6k 10000Mb 802.3, address is 000f.35bb.0b40 (bia 000f.35bb.0b40)
  Description: XXX001 - MO08
  Internet address is xx.xx.240.126/26
  MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,
     reliability 255/255, txload 6/255, rxload 202/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full-duplex, 10Gb/s, media type is 10Gbase-LR
  input flow-control is off, output flow-control is off
  ARP type: ARPA, ARP Timeout 00:30:00
  Last input 00:00:00, output 00:00:00, output hang never
  Last clearing of "show interface" counters 00:56:37
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  30 second input rate 7935497000 bits/sec, 665152 packets/sec
  30 second output rate 239985000 bits/sec, 438880 packets/sec
  L2 Switched: ucast: 32 pkt, 2048 bytes - mcast: 1052 pkt, 318283 bytes
  L3 in Switched: ucast: 2016175646 pkt, 2998867098833 bytes - mcast: 0 pkt, 0 bytes
  L3 out Switched: ucast: 1483531972 pkt, 115723597149 bytes - mcast: 0 pkt, 0 bytes
     2228491744 packets input, 3314752535506 bytes, 0 no buffer
     Received 3005 broadcasts (0 IP multicasts)
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 206532318 overrun, 0 ignored
     0 watchdog, 0 multicast, 0 pause input
     0 input packets with dribble condition detected
     1482844739 packets output, 115625721402 bytes, 0 underruns
     0 output errors, 0 collisions, 0 interface resets
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 PAUSE output
     0 output buffer failures, 0 output buffers swapped out

As you can see, no problems are reported other than overruns (approx. 10% of input packets).

"sh plat hard cap forwarding" output:
     Forwarding engine load:
                     Module       pps   peak-pps                     peak-time
                     1        2852591    4416215  18:21:12 CEST Thu Jul 23 2009
                     2        1422180    1645505  22:42:03 CEST Thu Jul 23 2009
                     3         903195    1018577  11:28:05 CEST Wed Jul 22 2009
                     6        1756281    8244268  01:36:29 CEST Sat Jul 18 2009


We're pretty much stuck.
Thanks for reading if you've gotten this far.

Any help would be very appreciated.

Kind regards,

Bas

P.S. The box peaks at approx. 35 Mbps of IPv6 traffic; that shouldn't
affect IPv4 forwarding performance, right?
