[j-nsp] MX304 Port Layout
Mark Tinka
mark at tinka.africa
Thu Jun 8 23:53:04 EDT 2023
On 6/9/23 00:03, Litterick, Jeff (BIT) via juniper-nsp wrote:
> The big issue we ran into is that, if you have redundant REs, there is a super bad bug that will lock the entire chassis up solid somewhere between 6 hours and 8 days after a reboot (1 of our 3 would lock up quickly after a reboot, and the other 2 would take a very long time), to the point where we had to pull the REs physically out to reboot them. It is fixed now, but they had to manually poke new firmware into the ASICs on each RE while the REs were in a half-powered state. It was a very complex procedure with tech support and the MX304 engineering team, and it took about 3 hours to do all 3 MX304s, one RE at a time. We have not seen an update with this built in yet. (We just did this back at the end of April.)
Oh dear, that's pretty nasty. So did they say new units shipping today
would come with the REs already fixed?
We've been suffering a somewhat similar issue on the PTX1000, where a
regression introduced in Junos 21.4, 22.1 and 22.2 causes CPU queues to
fill up with unknown MAC address frames that are never cleared. It takes
64 days for this packet accumulation to grow to the point where the
queues are exhausted, causing a host loopback wedge.
You would see an error like this in the logs:
<date> <time> <hostname> alarmd[27630]: Alarm set: FPC id=150995048,
color=RED, class=CHASSIS, reason=FPC 0 Major Errors
<date> <time> <hostname> fpc0 Performing action cmalarm for error
/fpc/0/pfe/0/cm/0/Host_Loopback/0/HOST_LOOPBACK_MAKE_CMERROR_ID[1]
(0x20002) in module: Host Loopback with scope: pfe category: functional
level: major
<date> <time> <hostname> fpc0 Cmerror Op Set: Host Loopback: HOST
LOOPBACK WEDGE DETECTED IN PATH ID 1 (URI:
/fpc/0/pfe/0/cm/0/Host_Loopback/0/HOST_LOOPBACK_MAKE_CMERROR_ID[1])
Apr 1 03:52:28 PTX1000 fpc0 CMError:
/fpc/0/pfe/0/cm/0/Host_Loopback/0/HOST_LOOPBACK_MAKE_CMERROR_ID[3]
(0x20004), in module: Host Loopback with scope: pfe category: functional
level: major
This causes the router to drop all control plane traffic, which,
basically, makes it unusable. One has to reboot the box to get it back
up and running, until it happens again 64 days later.
The issue is resolved in Junos 21.4R3-S4, 22.4R2, 23.2R1 and 23.3R1.
However, these releases are not shipping yet, so Juniper gave us a
workaround SLAX script that automatically runs and clears the CPU queues
before the 64 days are up.
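For anyone curious, this is roughly how such a periodic workaround can be
wired up on Junos: a timer-generated event that invokes an event script.
This is just a sketch; the script name, event name and interval below are
made up for illustration, and the actual script and schedule Juniper
supplies with the workaround may differ:

    event-options {
        generate-event {
            /* hypothetical timer; fires once a day, well inside the 64-day window */
            drain-timer time-interval 86400;
        }
        policy run-drain-script {
            events drain-timer;
            then {
                /* hypothetical name for the JTAC-supplied SLAX script */
                event-script host-loopback-drain.slax;
            }
        }
        event-script {
            file host-loopback-drain.slax;
        }
    }

The script itself also needs to be copied to /var/db/scripts/event/ on
each RE before the config will commit.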
We are currently running Junos 22.1R3.9 on this platform, and will move
to 22.4R2 in a few weeks to permanently fix this.
Junos 20.2, 20.3 and 20.4 are not affected, nor is anything after 23.2R1.
I understand it may also affect the QFX and MX, but I don't have details
on that.
Mark.