[j-nsp] MX-series: SHEAF: possible leak ...

Alexandre Snarskii snar at snar.spb.ru
Mon Aug 31 06:44:08 EDT 2009


On Fri, Aug 28, 2009 at 03:18:01AM -0700, Nilesh Khambal wrote:
> Hi Alex,
> 
> Sometimes, these messages also suggest a transient spike in the 
> sheaf memory utilization on the FPC for one of the sheaves. It may not 
> necessarily be a memory leak. Sheaf memory is used for sending and 
> receiving the control and data packets between the RE and the PFE.

Thanks, that explains a lot. It looks like we have so many
of those 'cosmetic' 'jtree app NH negative' messages that 
they overflow the whole message queue: 

Aug 27 12:42:29 rt077-201 xntpd[80154]: bind() fd 243, family 2, port 123, addr 10.0.0.1, in_classd=0 flags=0: Can't assign requested address
Aug 27 12:42:29 rt077-201 Stats bucket for jtree app NH negative.. 
Aug 27 12:42:33 rt077-201 last message repeated 57 times
Aug 27 12:42:33 rt077-201 SYSLOG: 13065 messages lost, message queue overflowed. 
If this message queue is located in the same sheaf memory,
that might be the cause of the SHEAF error messages. 
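
To put a number on how much logging is being lost, the excerpt above can be tallied mechanically. This is a minimal sketch, assuming the log is available as plain text; the function name `tally` and the two regular expressions are illustrative and match only the exact wording seen in the excerpt.

```python
import re

# Sample lines copied from the log excerpt above.
LOG = """\
Aug 27 12:42:29 rt077-201 Stats bucket for jtree app NH negative..
Aug 27 12:42:33 rt077-201 last message repeated 57 times
Aug 27 12:42:33 rt077-201 SYSLOG: 13065 messages lost, message queue overflowed.
"""

# Patterns are assumptions based on the wording of the lines above.
REPEATED = re.compile(r"last message repeated (\d+) times")
LOST = re.compile(r"SYSLOG: (\d+) messages lost")


def tally(log_text):
    """Return (repeated, lost): summed counts from the two summary lines."""
    repeated = sum(int(m.group(1)) for m in REPEATED.finditer(log_text))
    lost = sum(int(m.group(1)) for m in LOST.finditer(log_text))
    return repeated, lost


repeated, lost = tally(LOG)
print(f"repeated: {repeated}, lost: {lost}")
```

Run over a full day's log, the same tally would show whether the overflow is a one-off burst or a steady condition.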

Anyway, after the JunOS upgrade all those messages disappeared, 
but I started getting a new one (along with other interesting, ungoogleable
messages) concerning the same ICHIP: 

[Aug 31 10:18:05.078 LOG: Err] ICHIP(0):Packet drop in Ichip pktwr,rate: 1, total: 93963

and 'show pfe statistics error' shows errors
on this one I-Chip: 

snar@RT077-201> show pfe statistics error 
Slot 3
ICHIP Error statistics:
ICHIP                     0        1        2        3
---------------------------------------------------------
Iwo DESRD:                0        0        0        0
Iwo HDRF:          359724673  2965007   103588     1530
Ipktwr Drops:         93988        0        0        0

The other FPCs show Iwo HDRF errors too, but no Ipktwr drops. 
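
For comparing FPCs or successive snapshots, the counter table can be parsed into per-ICHIP lists. The following is a rough sketch, assuming the output keeps the "name: value value ..." row layout shown above and that column position equals the ICHIP index; `parse_ichip_errors` and `nonzero_ichips` are hypothetical helper names, not part of any Juniper tooling.

```python
# Counter rows copied from the 'show pfe statistics error' output above.
SAMPLE = """\
ICHIP                     0        1        2        3
---------------------------------------------------------
Iwo DESRD:                0        0        0        0
Iwo HDRF:          359724673  2965007   103588     1530
Ipktwr Drops:         93988        0        0        0
"""


def parse_ichip_errors(text):
    """Return {counter_name: [value per ICHIP]} for 'name: n n n n' rows."""
    counters = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # header and separator rows carry no colon
        fields = rest.split()
        if fields and all(f.isdigit() for f in fields):
            counters[name.strip()] = [int(f) for f in fields]
    return counters


def nonzero_ichips(counters, name):
    """Return the ICHIP indexes whose counter `name` is non-zero."""
    return [i for i, value in enumerate(counters.get(name, [])) if value]


counters = parse_ichip_errors(SAMPLE)
print(nonzero_ichips(counters, "Ipktwr Drops"))
```

Diffing the parsed dictionaries from two snapshots taken some minutes apart would show which counters are still incrementing.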

Well, 94000 packets dropped out of the 50408188862 passed on 
this interface since the upgrade may be considered negligible,
but since the drops started almost immediately after the upgrade,
and the interface is far from saturation, maybe there
is some problem with that I-Chip, the fabric, or the fabric connection? 
And is there any good documentation on what these errors mean? 
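
The "negligible" estimate can be checked with a quick calculation. This sketch uses only the two figures quoted above (the Ipktwr drop counter and the packet total for the interface); the variable names are illustrative.

```python
dropped = 93_988          # Ipktwr Drops on ICHIP 0, from the output above
passed = 50_408_188_862   # packets passed on the interface since the upgrade

ratio = dropped / passed
ppm = ratio * 1_000_000   # drops per million packets

print(f"drop ratio: {ratio:.3e} ({ppm:.2f} packets per million)")
```

That works out to roughly two drops per million packets, which is small as a rate but still worth tracking if the counter keeps growing.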

> 
> Seeing the SHEAF and NH App messages together in this scenario suggests 
> too many nexthop-related operations in the PFE. If those messages started 
> after adding the new customer, you might want to check if this customer 
> is receiving any kind of attack traffic. 

This customer was even disconnected for debugging purposes, but
the error messages kept appearing right up until the JunOS upgrade. 
And no, this customer did not receive any attack traffic, and did
not participate in multicast exchanges. 

> This may also include the traffic for non-existing destinations 
> which are directly connected to this new customer interface. 
> Are there any PFE_NH_RESOLVE_THROTTLED messages seen as well 
> in the logs along with these SHEAF messages?

There are some PFE_NH_RESOLVE_THROTTLED messages on the same
FPC (and on others) in that box, but not on the same I-Chip. 

> Please get the command outputs below from FPC3 and get in touch 
> with JTAC by opening a case. Provide them with these outputs along with 
> answers to the above queries.

I tried to open a case, but it looks like our CTO forgot to extend 
our support contract :( 
Anyway, the logs are saved, and if you want to take a look at them, you may 
contact me directly. 

> 
> From FPC3:
> ++++++++++
> 
>  *   show route summary
>  *   show route manager statistics
>  *   show nhdb management operations
>  *   show packet
>  *   show packet statistics
>  *   show jtree [0-3] memory extensive composition
>  *   show jtree [0-3] memory extensive
>  *   show rsmon
> 
> Take at least 2-3 snapshots of these commands while these syslog messages are being seen.
> 
> Thanks,
> Nilesh.
> 
> 
> 
> On 8/28/09 2:25 AM, "Alexandre Snarskii" <snar at snar.spb.ru> wrote:
> 
> 
> 
> Hi!
> 
> Since yesterday one of our MX-series routers has been logging
> the following messages:
> 
>  fpc3 SHEAF: possible leak, ID 5 (packet(600)) (10563/512/1024)
>  Stats bucket for jtree app NH negative..
> 
> (both are logged by fpc3; the other FPCs are not affected), and after some
> googling I started worrying, because the first message (SHEAF leak) is only
> mentioned in j-nsp some years ago in the context of a PFE memory leak that can
> stop your router forwarding (PSN-2004-06-009), and the second is a cosmetic
> bug closed in PR/400917, concerning a PFE memory leak too.
> 
> So, the question is: do you think that an FPC reload may help with
> this issue, or should I plan a JunOS upgrade in the next maintenance
> window? (This router is currently running JunOS 8.4R4.2, which is EOE/EOL
> anyway.)
> 
> PS: the messages started right after provisioning another client's
> interface, nothing special:
> 
> cvs diff -u -D '2009-08-27 13:00' -D '2009-08-27 14:00' rt077-201
> [...]
> +        unit 2103 {
> +            description <censored>
> +            vlan-id 2103;
> +            family inet {
> +                mtu 1500;
> +                policer {
> +                    input 512Kbit;
> +                    output 512Kbit;
> +                }
> +                address <censored>;
> +            }
> +        }
> 
> _______________________________________________
> juniper-nsp mailing list juniper-nsp at puck.nether.net
> https://puck.nether.net/mailman/listinfo/juniper-nsp