[j-nsp] MX304 Port Layout
Mark Tinka
mark at tinka.africa
Thu Jun 8 23:53:04 EDT 2023
On 6/9/23 00:03, Litterick, Jeff (BIT) via juniper-nsp wrote:
> The big issue we ran into is that, if you have redundant REs, there is a super bad bug that will lock the entire chassis up solid somewhere between 6 hours and 8 days after a reboot (1 of our 3 would lock up quickly after a reboot, and the other 2 would take a very long time), to the point where we had to pull the REs physically out to reboot them. It is fixed now, but they had to manually poke new firmware into the ASICs on each RE while the REs were in a half-powered state. It was a very complex procedure with tech support and the MX304 engineering team, and it took about 3 hours to do all 3 MX304s, one RE at a time. We have not seen an update with this built in yet. (We just did this back at the end of April.)
Oh dear, that's pretty nasty. So did they say new units shipping today
would come with the REs already fixed?
We've been suffering a somewhat similar issue on the PTX1000, where a
regression introduced in Junos 21.4, 22.1 and 22.2 causes CPU queues to
fill up with unknown MAC address frames that are never cleared. It takes
64 days for this packet accumulation to grow to the point where the
queues are exhausted, causing a host loopback wedge.
You would see an error like this in the logs:
<date> <time> <hostname> alarmd[27630]: Alarm set: FPC id=150995048,
color=RED, class=CHASSIS, reason=FPC 0 Major Errors
<date> <time> <hostname> fpc0 Performing action cmalarm for error
/fpc/0/pfe/0/cm/0/Host_Loopback/0/HOST_LOOPBACK_MAKE_CMERROR_ID[1]
(0x20002) in module: Host Loopback with scope: pfe category: functional
level: major
<date> <time> <hostname> fpc0 Cmerror Op Set: Host Loopback: HOST
LOOPBACK WEDGE DETECTED IN PATH ID 1 (URI:
/fpc/0/pfe/0/cm/0/Host_Loopback/0/HOST_LOOPBACK_MAKE_CMERROR_ID[1])
Apr 1 03:52:28 PTX1000 fpc0 CMError:
/fpc/0/pfe/0/cm/0/Host_Loopback/0/HOST_LOOPBACK_MAKE_CMERROR_ID[3]
(0x20004), in module: Host Loopback with scope: pfe category: functional
level: major
This causes the router to drop all control plane traffic, which,
basically, makes it unusable. One has to reboot the box to get it back
up and running, until it happens again 64 days later.
The issue is resolved in Junos 21.4R3-S4, 22.4R2, 23.2R1 and 23.3R1.
However, these releases are not shipping yet, so Juniper gave us a
workaround SLAX script that automatically runs and clears the CPU queues
before the 64 days are up.
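For anyone curious, this is roughly how such a periodic workaround can be
wired up on Junos: a timer-generated event that invokes an event script.
This is just a sketch; the script name, event name and interval below are
made up for illustration, and the actual script and schedule Juniper
supplies with the workaround may differ:

    event-options {
        generate-event {
            /* hypothetical timer; fires once a day, well inside the 64-day window */
            drain-timer time-interval 86400;
        }
        policy run-drain-script {
            events drain-timer;
            then {
                /* hypothetical name for the JTAC-supplied SLAX script */
                event-script host-loopback-drain.slax;
            }
        }
        event-script {
            file host-loopback-drain.slax;
        }
    }

The script itself also needs to be copied to /var/db/scripts/event/ on
each RE before the config will commit.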
We are currently running Junos 22.1R3.9 on this platform, and will move
to 22.4R2 in a few weeks to permanently fix this.
Junos 20.2, 20.3 and 20.4 are not affected, nor is anything after 23.2R1.
I understand it may also affect the QFX and MX, but I don't have details
on that.
Mark.