[c-nsp] CoPP IS-IS traffic on N7k

Tue Jan 25 07:57:09 EST 2011

> On 25/01/2011, at 1:12 AM, Matthew Melbourne wrote:
>
>> As a follow-up to this, I've discovered that the CPU utilisation for
>> 'netstack' process jumps to 100% for ~5 mins.
>
> 'netstack' is the software process that implements the IP / TCP stack for received frames hitting control-plane.
> if netstack cpu is high for an extended period then it implies you have excessive traffic hitting that

OK, that helps a lot.
>
> key is probably to find out what traffic is hitting it.
>
> did you disable any h/w rate limiters or CoPP at all?  or increase the rates on any of these?

We haven't changed any h/w rate limiters. We've tweaked the CoPP
'default' policy slightly, but just to implement the equivalent of
ACLs on vtys; we're running 5.0(2a).
>
>
> using the embedded ethanalyzer (wireshark) to snoop the inband port would probably be the easiest way.
> use "ethanalyzer local interface inband [..]"
>
> if you don't want to hang around to wait for it to happen you could create an EEM action that is triggered on high cpu to do this for you, e.g.
>
>        event manager applet debug_highcpu
>          event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.6.1 get-type exact entry-op gt entry-val 80 exit-op lt exit-val 40 poll-interval 30
>          action 1.0 syslog msg high cpu load condition detected
>          action 2.0 cli show clock >> bootflash:high_cpu.txt
>          action 2.5 cli show process cpu sort >> bootflash:high_cpu.txt
>          action 3.0 cli show hardware internal cpu-mac inband stats >> bootflash:high_cpu.txt
>          action 4.0 cli ethanalyzer local interface inband limit-captured-frames 200 >> bootflash:high_cpu.txt
>
> something like that at least then gives you (or TAC) something else to go on.

I managed to catch it, and for some strange reason iSCSI data-plane
traffic is hitting the control plane. When netstack is not running at
100%, I see the usual control plane traffic, e.g. HSRP, STP, ARP
(etc), but when it's at 100% I see lots of:

nx02.fhcon# show file bootflash:high_cpu.txt
2011-01-25 10:03:47.571183 10.10.35.138 -> 10.10.47.0   TCP 3260 > 62592 [ACK] Seq=1 Ack=1 Win=2213 Len=0 TSV=1300 TSER=1297
2011-01-25 10:03:47.571202 10.10.35.138 -> 10.10.47.0   TCP 3260 > 62592 [ACK] Seq=1 Ack=3137 Win=2115 Len=0 TSV=1300 TSER=1297
2011-01-25 10:03:47.571211 10.10.35.138 -> 10.10.47.0   TCP 3260 > 62592 [ACK] Seq=1 Ack=6081 Win=2025 Len=0 TSV=1300 TSER=1297
2011-01-25 10:03:47.571220 10.10.35.138 -> 10.10.47.0   TCP 3260 > 62592 [ACK] Seq=1 Ack=8777 Win=2070 Len=0 TSV=1300 TSER=1297
2011-01-25 10:03:47.571229 10.10.35.138 -> 10.10.47.0   TCP 3260 > 62592 [ACK] Seq=1 Ack=11473 Win=2168 Len=0 TSV=1300 TSER=1297
2011-01-25 10:03:47.571238 10.10.35.138 -> 10.10.47.0   TCP 3260 > 62592 [ACK] Seq=1 Ack=15569 Win=2040 Len=0 TSV=1300 TSER=1297
2011-01-25 10:03:47.571247 10.10.35.138 -> 10.10.47.0   TCP [TCP Window Update] 3260 > 62592 [ACK] Seq=1 Ack=15569 Win=2085 Len=0 TSV=1300 TSER=1297
2011-01-25 10:03:47.571256 10.10.35.143 -> 10.10.47.0   TCP 3260 > 62596 [ACK] Seq=1 Ack=1 Win=1670 Len=0 TSV=1425 TSER=1422
2011-01-25 10:03:47.571266 10.10.35.143 -> 10.10.47.0   TCP 3260 > 62596 [ACK] Seq=1 Ack=4345 Win=1534 Len=0 TSV=1425 TSER=1422
2011-01-25 10:03:47.571274 10.10.35.143 -> 10.10.47.0   TCP 3260 > 62596 [ACK] Seq=1 Ack=7241 Win=1444 Len=0 TSV=1425 TSER=1422
2011-01-25 10:03:47.571283 10.10.35.138 -> 10.10.47.0   TCP 3260 > 62592 [ACK] Seq=1 Ack=17065 Win=2123 Len=0 TSV=1300 TSER=1297
2011-01-25 10:03:47.571292 10.10.35.138 -> 10.10.47.0   TCP 3260 > 62592 [ACK] Seq=1 Ack=19713 Win=2040 Len=0 TSV=1300 TSER=1297
2011-01-25 10:03:47.571300 10.10.35.143 -> 10.10.47.0   TCP 3260 > 62596 [ACK] Seq=1 Ack=8689 Win=1534 Len=0 TSV=1425 TSER=1422
2011-01-25 10:03:47.571309 10.10.35.143 -> 10.10.47.0   TCP 3260 > 62596 [ACK] Seq=1 Ack=10137 Win=1625 Len=0 TSV=1425 TSER=1422
2011-01-25 10:03:47.571318 10.10.35.143 -> 10.10.47.0   TCP 3260 > 62596 [ACK] Seq=1 Ack=14481 Win=1489 Len=0 TSV=1425 TSER=1422
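
For anyone wanting to summarise a capture like this rather than eyeball it, here is a rough Python sketch that tallies frames per TCP flow. It is not part of any Cisco tooling; the regex and the count_flows name are my own assumptions, and it only understands the one-line-per-frame TCP summary format shown above (it tolerates the mail client's line wrapping). The sample text is three frames copied from the capture.

```python
import re
from collections import Counter

# Three frames copied from the capture above, including the mail line wraps.
CAPTURE = """\
2011-01-25 10:03:47.571183 10.10.35.138 -> 10.10.47.0   TCP 3260 >
62592 [ACK] Seq=1 Ack=1 Win=2213 Len=0 TSV=1300 TSER=1297
2011-01-25 10:03:47.571247 10.10.35.138 -> 10.10.47.0   TCP [TCP
Window Update] 3260 > 62592 [ACK] Seq=1 Ack=15569 Win=2085 Len=0
TSV=1300 TSER=1297
2011-01-25 10:03:47.571256 10.10.35.143 -> 10.10.47.0   TCP 3260 >
62596 [ACK] Seq=1 Ack=1 Win=1670 Len=0 TSV=1425 TSER=1422
"""

# src -> dst, "TCP", an optional bracketed note such as
# "[TCP Window Update]", then "sport > dport". \s+ also spans line wraps.
FLOW_RE = re.compile(
    r"(\d+\.\d+\.\d+\.\d+)\s+->\s+(\d+\.\d+\.\d+\.\d+)\s+TCP\s+"
    r"(?:\[[^\]]*\]\s+)?(\d+)\s+>\s+(\d+)"
)


def count_flows(text):
    """Return a Counter keyed by (src, dst, sport, dport) strings."""
    return Counter(FLOW_RE.findall(text))


if __name__ == "__main__":
    for (src, dst, sport, dport), n in count_flows(CAPTURE).most_common():
        print(f"{src}:{sport} -> {dst}:{dport}  {n} frames")
```

Every frame in the capture has source port 3260 (iSCSI) and destination 10.10.47.0, so the tally collapses to two flows, one per target address.
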

Which is weird, as those IPs fall within the subnets of SVIs on this
unit, inside a storage VRF:

interface Vlan22
  no shutdown
  vrf member vpsstorage
  ip address 10.10.34.3/23
  hsrp 1
    authentication md5 key-chain VPS-HSRP-MD5
    preempt delay minimum 180 reload 180
    priority 250
    timers  1  3
    ip 10.10.34.1

interface Vlan26
  no shutdown
  vrf member vpsstorage
  ip address 10.10.44.3/22
  hsrp 1
    authentication md5 key-chain VPS-HSRP-MD5
    preempt delay minimum 180 reload 180
    priority 250
    timers  1  3
    ip 10.10.44.1
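
As a sanity check on the addressing, a minimal sketch using Python's standard ipaddress module maps each address from the capture to the SVI whose subnet contains it. The SVIS table and the owning_svi helper are hypothetical names for illustration; the prefixes are copied from the config above.

```python
import ipaddress

# SVI addresses copied from the interface config above.
SVIS = {
    "Vlan22": ipaddress.ip_interface("10.10.34.3/23"),
    "Vlan26": ipaddress.ip_interface("10.10.44.3/22"),
}


def owning_svi(addr):
    """Return the name of the SVI whose subnet contains addr, else None."""
    ip = ipaddress.ip_address(addr)
    for name, iface in SVIS.items():
        if ip in iface.network:
            return name
    return None


if __name__ == "__main__":
    for addr in ("10.10.35.138", "10.10.35.143", "10.10.47.0"):
        print(addr, "->", owning_svi(addr))
```

The two iSCSI sources sit in Vlan22's 10.10.34.0/23 and the destination 10.10.47.0 sits in Vlan26's 10.10.44.0/22. Within a /22, 10.10.47.0 is an ordinary host address rather than the network or broadcast address, so the trailing .0 by itself is not an obvious reason for the traffic to be punted.
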

So, the obvious question is: why is this traffic hitting the control
plane? The other Nexus 7k unit has these same VLANs/VRFs defined and
is notionally the HSRP active router, although within the vPC world
the HSRP 'standby' device can also forward ingress traffic.

Thanks for the pointers; I suspect we may now have to get a TAC case raised.

Cheers,

Matt

-- 
Matthew Melbourne