[c-nsp] Unusual Problem with Catalyst 6500 and Sup720-3B

Stephan M. Mackenzie stephan.mackenzie at gmail.com
Sun Oct 17 14:32:10 EDT 2010


Hi, this is my first post, and I appreciate this resource; in fact, last night
I was quite certain it had helped me solve my problem.

 

I have a customer who noticed that download speeds for large files were not
consistent across his 8-server cluster.

 

He was convinced it was something to do with our bandwidth provider.

 

I have since ruled this out: tests against a neighboring facility that is only
1 hop away replicated his pattern of slow vs. fast front-end servers.

 

Also, when testing over our backup link to Cogent, we got very similar results
of 4 fast and 4 slow, but often inverted: the previously slow servers were now
the fast ones and vice versa.

 

On Friday night we rebooted the router and noticed that IOS had inserted a
line into the running config.

 

mls rate-limit unicast cef receive 10000 100

 

This is what led me to this list, and what I felt sure would be the solution
to my problem.
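
As far as I understand it, that line is something IOS adds on its own when the
FIB TCAM overflows, to rate-limit the traffic punted to the CPU (10000 pps
with a burst of 100, if I am reading the syntax right). To see the limiter
state I just ran:

show mls rate-limit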

 

Our TCAMs were over limit and the Sup was punting traffic to the CPU. Given
what we were doing with providers at the time, we decided that until we could
upgrade to 3BXLs we would just take a default route, and we rebooted the
router once more.
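
For anyone who wants to check the same thing on their own box, these are the
commands I used to look at the FIB TCAM state (syntax from memory, it may
differ a bit between releases):

show mls cef maximum-routes
show mls cef summary
show mls cef exception status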

 

I was so sure this was going to be the silver bullet: the router came up, and
the TCAM and over-limit errors are gone, but the problem persists.

 

We have done our best to eliminate everything else that could be a cause, like
file servers, load balancers, etc. Also, internal tests within our layer-2
environment run at 100+ MB/s.

 

We bound a fresh class C to the boxes, announced it only to Cogent, and
retested. This helped us eliminate a few more things: our primary provider,
and any single optical interface, GBIC, or optical cable.
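
For completeness, the announcement was done roughly like this (the prefix and
neighbor address below are made up for the list, not our real ones):

! advertise only the test /24 to the Cogent session
ip route 192.0.2.0 255.255.255.0 Null0
ip prefix-list COGENT-TEST-OUT seq 5 permit 192.0.2.0/24
router bgp 64496
 network 192.0.2.0 mask 255.255.255.0
 neighbor 198.51.100.1 remote-as 174
 neighbor 198.51.100.1 prefix-list COGENT-TEST-OUT out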

 

Each provider comes into the router via the Active and Hot Sup720-3Bs.

 

From certain datacenter or ISP test locations, on the same inbound route and
under the same test conditions, the server pattern of slow to fast will invert.

 

For example, in my original speed tests from Australia, Dallas, Toronto, and
DC, the same pattern was observed. Then, testing from Amsterdam, the pattern
was reversed.

 

It's worth noting that the slow servers start out as if they might run
normally, then regress.

 

For example, testing from Toronto, the fast servers run at about 5.5 MB/s; the
slow servers start at 500 KB/s to 1.1 MB/s, then settle down to as low as half
that speed.

 

At peak load the router is pushing 550 Mb/s, CPU is at 2%, and memory at 11%
(memory has been upgraded to 1 GB/1 GB).
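
Those figures are from the RP side; my understanding is that the SP (switch
processor) CPU should be checked separately on these boxes, which I believe is
done with:

show processes cpu sorted
remote command switch show processes cpu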

 

The supervisor cards have sequential serial numbers, but as seen below, the
hardware versions on the MSFC3 daughterboards are different (any cause for
alarm here?).

 

Could the combination of the older line cards and the (relatively) newer
Sup720-3Bs cause something squirrelly like this?

 

#show ver

Cisco IOS Software, s72033_rp Software (s72033_rp-ADVIPSERVICESK9_WAN-M),
Version 12.2(33)SXH, RELEASE SOFTWARE (fc5)

 

#show mod

Mod Ports Card Type                              Model              Serial No.
--- ----- -------------------------------------- ------------------ -----------
  1   48  48-port 10/100/1000 RJ45 EtherModule   WS-X6148A-GE-TX
  2   48  48-port 10/100/1000 RJ45 EtherModule   WS-X6148A-GE-TX
  5    2  Supervisor Engine 720 (Hot)            WS-SUP720-3B
  6    2  Supervisor Engine 720 (Active)         WS-SUP720-3B
  7   16  SFM-capable 16 port 1000mb GBIC        WS-X6516-GBIC
  8   16  16 port 1000mb GBIC ethernet           WS-X6416-GBIC
  9   16  16 port 1000mb GBIC ethernet           WS-X6416-GBIC

 

Mod  Sub-Module                  Model              Serial      Hw      Status
---- --------------------------- ------------------ ----------- ------- -------
  5  Policy Feature Card 3       WS-F6K-PFC3B                   2.3     Ok
  5  MSFC3 Daughterboard         WS-SUP720                      2.3     Ok
  6  Policy Feature Card 3       WS-F6K-PFC3B                   2.3     Ok
  6  MSFC3 Daughterboard         WS-SUP720                      2.5     Ok

 

After further investigation, it seems this issue is not limited to this one
customer and his 8 frontends. To some degree we have some hosts that run well
and some that don't, the pattern inverts based on the test location, and it
even changed after we rebooted the router (one host that was fast from almost
all locations became slow).

 

Typically the slow hosts run at around 10% of the speed of the fast hosts,

 

e.g. 500 KB/s versus 5.5 MB/s.

 

The farther away the test location, it seems, the worse it gets: the slow
servers will run at up to 100 KB/s initially, then gradually regress to
20 KB/s (for example, testing from Australia to St. Louis).

 

There are no rate-limit configs other than the default ones for logging/ICMP.

 

There are no QoS configs on the router either.
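
For what it is worth, the way I verified that (assuming these commands cover
everything relevant) was:

show mls qos
show class-map
show policy-map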

 

I am going to test nodes on each of the line cards to eliminate any single
card as faulty; at least so far, I know I have tested across more than one,
and likely two, different line cards and seen the same problem.
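
When I do that I will also check the error counters on each port involved,
something like this (the interface names here are just examples):

show interfaces GigabitEthernet7/1 counters errors
show interfaces GigabitEthernet8/1 counters errors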

 

By running these older line cards, am I crippling the router? Is it a
lowest-common-denominator situation?
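
On that question: my understanding is that the WS-X6148A and WS-X6416 cards
are classic bus-only, while the WS-X6516 is fabric-enabled, so mixing them may
force parts of the chassis into a slower switching mode. I believe this shows
what mode each module actually ends up in:

show fabric switching-mode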

 

I will get a WS-X6748-SFP and was considering removing all the older cards and
moving the cabling over to this newer line card.

 

I really appreciate any advice, as I am at the end of my rope on this.

 

 


