[c-nsp] ME3600X 15.2S memory leak

Jason Lixfeld jason at lixfeld.ca
Sat Feb 27 12:09:12 EST 2016



> On Feb 27, 2016, at 1:21 AM, Mark Tinka <mark.tinka at seacom.mu> wrote:
> 
> 
> 
> On 26/Feb/16 20:42, Jason Lixfeld wrote:
> 
>> Upgrade to at least 15.3(3)S2. There are major issues with IPv6 and egress ACLs that cause this.
> 
> Funny you should mention that, Jason.
> 
> There is some IPv6 strangeness that I can't quite put my finger on when
> this happens, and we are re-routing across the ring. The last time I saw
> something like with egress IPv6 ACL's was when the ME3600X started
> shipping - 12.2EY days.
> 
> At any rate, I was considering upgrading the unit to 15.5(3)S2 and
> monitor it for a couple of days.
> 
> The box is running IPv6 and VPNv6. One customer is setup for IPv6, but
> that BGP session is down.
> 
> IPv6 ACL's exit only on the core-facing interfaces.
> 
> Do you have details on this issue you can share?

I worked with Cisco on it for months - this went past TAC and past the BU.  I worked directly with the IOS-XE, ME3600 and Nile ASIC hardware developers to identify the issue, and it took forever!  (Credit to this team of developers in India - they were relentless and amazing!  Too bad it took months to get to this team.)

Here’s the gist of it (the issue was not specific to egress IPv6 ACLs; it was the combination of IPv6 being enabled and *any* egress ACL being configured on any interface):

First identified in 15.3(3)S (but first seen in 15.2S) is an IPv6/egress ACL resource collision: the two features share memory, which causes memory corruption.  This can be seen by ChCompactChecksumerrorCount incrementing.  The fix is 'no ipv6 unicast-routing' followed by a reload.  The other option is to set 'platform acl egress-disable' to disable egress ACLs, but since egress ACLs were in use on our boxes, we opted to disable IPv6.  A reload is required to implement either fix.

I dug back through my emails and can’t find the bug ID that was provided for this issue, but I think it is CSCul27742.  Cisco seems to have redirected that bug ID to CSCui23725, but you can still sort of screen scrape it:

---

CSCul27742 Transit Packet Loss and Output Drops due to IPv6 Routing
Symptom: Transit traffic is randomly dropped. When traffic is lost the number of Output Drops under the "show interface" command is seen incrementing.

Conditions: me3600 or me36800 with "IPv6 unicast-routing" or "no ipv6 multicast-routing" configured. The traffic dropped does not have to be IPv6 traffic, and the box does not need to be configured for any other IPv6 services. This does not impact other platforms running this software version.

Workaround:...more
Details
Known Affected Releases: (1)
15.3(3)S
Known Fixed Releases: 0
Release Pending
Product: Cisco ME 3600X Series Ethernet Access Switches

---

In 15.3(3)S and earlier, there was no way to disable egress ACLs from the CLI, so the only way to do it was through sdcli:

service internal
exit (to return to enable mode)
sdcli
nile pp reg configegressouteracl configure 1 0 aclEnable 0
nile pp reg configegressouteracl configure 0 0 aclEnable 0
arsenic mmap i_write 0x40 0x00c24018 0x32
arsenic mmap i_write 0x40 0x00c2401c 0x30
arsenic mmap i_write 0x45 0x00c24018 0x32
arsenic mmap i_write 0x45 0x00c2401c 0x30
exit

NOTE:  These changes will *not* persist across reload.

'platform acl egress-disable' was introduced per CSCui23725, which made it possible to disable egress ACLs from the CLI while running a version of code affected by CSCul27742.

If you are running into odd issues with late 15.2 and 15.3, check the counters below to see if you are running up against CSCul27742 (ChCompactChecksumerrorCount incrementing between runs).  If so, disable egress ACLs or disable IPv6:

sdcli#nile debug stats 0 ChannelCompact
ChCompactReversalAbortCount            0 (0x0)
ChCompactDiscardCount            18362 (0x47BA)
ChCompactChecksumerrorCount            6046 (0x179E)
ChCompactLengthErrorCount            0 (0x0)
ChCompactSequenceErrorCount            0 (0x0)
ChCompactETxFifoFullDiscardCount            0 (0x0)
Ok

sdcli#nile debug stats 0 ChannelCompact
ChCompactReversalAbortCount            0 (0x0)
ChCompactDiscardCount            18381 (0x47CD)
ChCompactChecksumerrorCount            6921 (0x1B09)
ChCompactLengthErrorCount            0 (0x0)
ChCompactSequenceErrorCount            0 (0x0)
ChCompactETxFifoFullDiscardCount            0 (0x0)
Ok
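
The comparison between the two runs above can also be done off-box. The following is a minimal sketch in Python, assuming the two outputs have been copied into text exactly as shown; the function names (parse_counters, counter_deltas) are made up for illustration and are not part of any Cisco tooling.

```python
import re

# Matches lines such as "ChCompactChecksumerrorCount            6046 (0x179E)".
# The prompt line and the trailing "Ok" do not match and are skipped.
COUNTER_RE = re.compile(r"^(\w+)\s+(\d+)\s+\(0x[0-9A-Fa-f]+\)$")


def parse_counters(text):
    """Return {counter_name: decimal_value} from one pasted output."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_RE.match(line.strip())
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def counter_deltas(before, after):
    """Return {name: after - before} for counters present in both runs."""
    return {name: after[name] - before[name] for name in before if name in after}


# The two runs from this message, pasted as-is.
first = parse_counters("""
sdcli#nile debug stats 0 ChannelCompact
ChCompactReversalAbortCount            0 (0x0)
ChCompactDiscardCount            18362 (0x47BA)
ChCompactChecksumerrorCount            6046 (0x179E)
ChCompactLengthErrorCount            0 (0x0)
ChCompactSequenceErrorCount            0 (0x0)
ChCompactETxFifoFullDiscardCount            0 (0x0)
Ok
""")

second = parse_counters("""
sdcli#nile debug stats 0 ChannelCompact
ChCompactReversalAbortCount            0 (0x0)
ChCompactDiscardCount            18381 (0x47CD)
ChCompactChecksumerrorCount            6921 (0x1B09)
ChCompactLengthErrorCount            0 (0x0)
ChCompactSequenceErrorCount            0 (0x0)
ChCompactETxFifoFullDiscardCount            0 (0x0)
Ok
""")

deltas = counter_deltas(first, second)

# Report only the counters that moved between the two runs.
for name, delta in deltas.items():
    if delta > 0:
        print(f"{name} increased by {delta}")
```

For the two runs above, ChCompactDiscardCount moves by 19 and ChCompactChecksumerrorCount by 875; the checksum counter incrementing is the symptom described earlier in this message.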

There was supposed to be a feature introduced in later code to auto-detect whether to use egress ACLs or IPv6, depending on what the configuration was.  Aside from that, I honestly don’t know if this issue was ever actually fixed.  For us, once we got to 15.3(3)S2, we disabled IPv6 and abandoned the platform.

Hope that helps.



