[j-nsp] MX304 Port Layout

Litterick, Jeff (BIT) Jeff.Litterick at state.sd.us
Fri Jun 9 12:24:47 EDT 2023


Not sure, but anything shipped before May would most likely be affected, possibly a bit into May as well, since we were one of the first customers (if not the first) to get the fix applied to the equipment we received at the end of March.

We never learned the real root cause, other than that when it happened the primary RE would lock up solid: it would not respond to any command or input, would not allow access to any management port, and left no crash dumps or logs.  The backup RE would NOT take over; it would get stuck in a loop trying to claim mastership.  It still responded to management, but even rebooting it would not let it take mastership.  The only solution was a full power plug removal, or pulling the REs from the chassis to power-cycle them.  Juniper was able to reproduce this in the lab right after we reported it, worked on a fix, and got it to us about 1.5 weeks later.  We got lucky in that one of our 3 boxes would never last more than 6 hours after a reboot before the master RE locked up (no matter whether RE0 or RE1 was master); the other 2 boxes could go a week or more before locking up.  So we were a good test of whether the fix worked, since in Juniper's lab it could take up to 8 days before locking up.

FYI:  The one that would lock up inside 6 hours was our spare.  It had no traffic at all, not even an optic plugged into any port, and none of the test traffic the other 2 had running.  We did not go into production until 2 weeks after the fix was applied, just to make sure.

This problem would also only surface if you have more than one RE plugged into the system, even if failover was not configured; the mere presence of the 2nd RE was enough to trigger it.  I understand that the engineering team is now fully regression-testing all releases with multiple REs.  I guess that was not true before we found the bug.
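
For anyone wondering about their own exposure: the trigger condition is simply two REs installed, with or without failover configured.  A rough sanity check from the CLI (standard show commands; the redundancy stanza may legitimately be empty if you never configured GRES):

    show chassis hardware | match "Routing Engine"
    show chassis routing-engine
    show configuration chassis redundancy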

-----Original Message-----
From: juniper-nsp <juniper-nsp-bounces at puck.nether.net> On Behalf Of Mark Tinka via juniper-nsp
Sent: Thursday, June 8, 2023 10:53 PM
To: juniper-nsp at puck.nether.net
Subject: Re: [EXT] [j-nsp] MX304 Port Layout



On 6/9/23 00:03, Litterick, Jeff (BIT) via juniper-nsp wrote:

> The big issue we ran into is that if you have redundant REs, there is a super bad bug that will lock the entire chassis up solid after somewhere between 6 hours and 8 days (1 of our 3 boxes would lock up quickly after a reboot, the other 2 would take a very long time), and we had to pull the REs physically out to reboot them.     It is fixed now, but they had to manually poke new firmware into the ASICs on each RE while they were in a half-powered state.  It was a very complex procedure with tech support and the MX304 engineering team, and it took about 3 hours to do all 3 MX304s, one RE at a time.   We have not seen an update with this built in yet.  (We just did this back at the end of April.)

Oh dear, that's pretty nasty. So did they say new units shipping today would come with the REs already fixed?

We've been suffering a somewhat similar issue on the PTX1000, where a bug introduced via regression in Junos 21.4, 22.1 and 22.2 causes CPU queues to fill up with unknown-MAC-address frames that are never cleared. It takes 64 days for this packet accumulation to grow to the point where the queues are exhausted, causing a host loopback wedge.

You would see an error like this in the logs:

<date> <time> <hostname> alarmd[27630]: Alarm set: FPC id=150995048, color=RED, class=CHASSIS, reason=FPC 0 Major Errors
<date> <time> <hostname> fpc0 Performing action cmalarm for error /fpc/0/pfe/0/cm/0/Host_Loopback/0/HOST_LOOPBACK_MAKE_CMERROR_ID[1] (0x20002) in module: Host Loopback with scope: pfe category: functional level: major
<date> <time> <hostname> fpc0 Cmerror Op Set: Host Loopback: HOST LOOPBACK WEDGE DETECTED IN PATH ID 1 (URI: /fpc/0/pfe/0/cm/0/Host_Loopback/0/HOST_LOOPBACK_MAKE_CMERROR_ID[1])
Apr 1 03:52:28 PTX1000 fpc0 CMError: /fpc/0/pfe/0/cm/0/Host_Loopback/0/HOST_LOOPBACK_MAKE_CMERROR_ID[3] (0x20004), in module: Host Loopback with scope: pfe category: functional level: major

This causes the router to drop all control plane traffic, which, basically, makes it unusable. One has to reboot the box to get it back up and running, until it happens again 64 days later.
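
If you want to check whether a box is already in (or heading towards) this state, the alarm and the cmerror show up in the usual places, e.g.:

    show chassis alarms
    show log messages | match HOST_LOOPBACK

The chassis alarm output should show the FPC major alarm, and the messages log carries the HOST_LOOPBACK_MAKE_CMERROR_ID entries quoted above.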

The issue is resolved in Junos 21.4R3-S4, 22.4R2, 23.2R1 and 23.3R1.

However, these releases are not shipping yet, so Juniper gave us a workaround SLAX script that automatically runs and clears the CPU queues before the 64 days are up.
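
For reference, the usual way such a SLAX workaround gets wired up to run automatically is via event-options; a rough sketch (the script name and interval here are placeholders, not the actual Juniper script):

    event-options {
        generate-event {
            /* placeholder: fire once a day */
            clear-host-queues time-interval 86400;
        }
        policy run-clear-host-queues {
            events clear-host-queues;
            then {
                /* placeholder name for the workaround script */
                event-script clear-host-queues.slax;
            }
        }
        event-script {
            file clear-host-queues.slax;
        }
    }

The script itself lives in /var/db/scripts/event/ on the RE.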

We are currently running Junos 22.1R3.9 on this platform, and will move to 22.4R2 in a few weeks to permanently fix this.

Junos 20.2, 20.3 and 20.4 are not affected, nor is anything after 23.2R1.

I understand it may also affect the QFX and MX, but I don't have details on that.

Mark.

_______________________________________________
juniper-nsp mailing list
juniper-nsp at puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp

