[j-nsp] MX80 watchdog

Mon Jun 12 12:17:44 EDT 2023

Do you monitor RPD task memory use and FreeBSD process memory use?
Is it possible you are leaking memory over time and hitting DRAM
pressure at the 1500-day mark?
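
If you do collect those numbers, a rough way to check for a slow leak is to
fit a straight line through memory use versus uptime. Below is a minimal
sketch, not a tested tool: it assumes you already have (uptime in days,
used KiB) samples from somewhere, for example scraped from "show task memory"
or "show system processes extensive" output. The function names, the sample
values and the DRAM size are made up for illustration.

```python
# Hypothetical sketch: estimate memory growth from periodic samples.
# Each sample is an (uptime_days, used_kib) pair; how you collect them
# (CLI scraping, SNMP, ...) is up to you and not shown here.

def growth_per_day(samples):
    """Least-squares slope of used memory versus uptime, in KiB per day."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must cover more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return sxy / sxx


def days_until_exhausted(samples, total_kib):
    """Days after the last sample until use reaches total_kib at the
    fitted growth rate, or None if memory use is not growing."""
    slope = growth_per_day(samples)
    if slope <= 0:
        return None
    _, last_used = samples[-1]
    return max(0.0, (total_kib - last_used) / slope)


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    samples = [(0, 1000), (10, 2000), (20, 3000)]
    print(growth_per_day(samples))              # KiB per day
    print(days_until_exhausted(samples, 5000))  # days of headroom left
```

A steady positive slope for rpd or for total process memory over months
would support the leak theory; a flat line would point elsewhere.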

It might be this:
https://prsearch.juniper.net/problemreport/PR1099998

Initially, since you said it happens during strenuous SSD access, I was
thinking of the RE failover limits Junos has on disk I/O read/write
latency, which cause false-positive RE switchovers now and again
(more people have hit them than are aware of hitting them).
But in your case that can't be the cause, because the MX80 doesn't
have two REs. For completeness:
https://www.juniper.net/documentation/us/en/software/junos/high-availability/topics/ref/statement/not-on-disk-underperform-edit-chassis.html

On Mon, 12 Jun 2023 at 18:35, Tom Bird via juniper-nsp
<juniper-nsp at puck.nether.net> wrote:
>
> Afternoon,
>
> I've been upgrading some MX80 routers from 15.1, and they consistently
> seem to fall over during periods of strenuous SSD access, or indeed once
> during a "commit check".
>
> We thought this might be due to the uptime (~1500 days) so have been
> rebooting them prior to the upgrade which has mostly stopped the problem
> from happening.  Not completely, however - they get stuck for about an
> hour doing this, after which they reboot and continue to work.
>
>
> watchdog: scheduling fairness gone for 3540 seconds now.
> (da1:umass-sim1:1:0:0): Synchronize cache failed, status == 0x34, scsi status == 0x0
> Automatic reboot in 15 seconds - press a key on the console to abort
> Rebooting...
>
>
> I'd like them to wait a bit less than an hour. I see the watchdog
> can be configured, but I can't find any useful documentation about
> exactly what conditions make it fire and what the defaults are.
>
> Currently there is no configuration under "system processes watchdog",
> and it looks like it can be enabled, disabled and the timeout set up to
> 3600 seconds.
>
> So my question is: is it this watchdog that is resetting the thing after
> an hour, and would it be reasonable to set the timeout to, say, 300 seconds
> so there is less downtime if it goes wrong?
>
> Thanks,
> --
> Tom
>
> :: www.portfast.co.uk / @portfast
> :: hosted services, domains, virtual machines, consultancy

-- 
  ++ytti