[j-nsp] interpreting 10Gb interface "PCS statistics" values
Chuck Anderson
cra at WPI.EDU
Fri Oct 21 18:58:36 EDT 2016
When I was getting these and the Cisco far end was getting tons of
errors, the light levels were good all around. It ended up being a
fiber problem near the transmitter. Try shooting the fiber link with
an OTDR to see if you are getting lots of reflections.
On Fri, Oct 21, 2016 at 12:23:18PM -0700, Michael Loftis wrote:
> Was hoping someone who knew more could chime in...but it's measured in
> seconds basically because the PCS (physical coding sublayer) does NOT
> keep detailed statistics...so the "Seconds" value means there were X
> distinct seconds in which an error was flagged in that category...the
> previous response detailing bit vs errored blocks I think is wrong.
> The PCS layer can repair single bit errors, thus a second with one or
> more single bit (but correctable!) errors is a "bit errored second" -
> if it is unable to correct and recover a valid PCS block then you get
> the "errored block" seconds...
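
A minimal Python sketch of the "errored seconds" idea described above, assuming the counter increments once for every one-second window that saw at least one error. The timestamps and the `errored_seconds` helper are hypothetical; they only illustrate the counting rule, not how the PHY actually implements it.

```python
def errored_seconds(error_times):
    """Count distinct one-second windows that contain at least one error."""
    return len({int(t) for t in error_times})

# Hypothetical error timestamps, in seconds since the counters were cleared.
burst = [12.0 + i / 10000 for i in range(5000)]  # 5000 errors, all inside second 12
sparse = [3.2, 47.9, 47.95, 301.4]               # 4 errors spread over 3 seconds

burst_seconds = errored_seconds(burst)
sparse_seconds = errored_seconds(sparse)
print(len(burst), "errors ->", burst_seconds, "errored second(s)")
print(len(sparse), "errors ->", sparse_seconds, "errored second(s)")
```

Under this assumption, a raw error counter at the far end can climb quickly while a "Seconds" value barely moves, because a large burst inside one second still counts only once.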
>
> It's not a raw count of the number of those errors, just that it
> occurred in a ~1s window X times. You can totally get PCS errors
> unplugging an optic or otherwise shutting down the remote end. You
> can totally get spurious PCS errors from a marginal-ish link that
> shows PLENTY of light (SNR is low or a marginal cable). In MX
> specifically it *can* in very rare circumstances indicate a problem
> even between the optic and the MIC....most of the time my suggestion
> for PCS errors is clear counters and check in 1h and 24h. If you get
> a significant number of errored seconds in a 24h period then
> check/clean ends and patches, maybe replace optics.
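
If clearing counters is not convenient, the same check can be approximated by saving the output twice and comparing the two. This is a rough Python sketch under that assumption; `parse_pcs_seconds` and `new_errored_seconds` are hypothetical helpers, and the sample text reuses the values quoted later in this thread purely for illustration (in practice both snapshots would come from the same interface, 1h or 24h apart).

```python
import re

def parse_pcs_seconds(cli_text):
    """Extract the two PCS 'Seconds' counters from saved interface output."""
    counters = {}
    for name in ("Bit errors", "Errored blocks"):
        match = re.search(rf"{name}\s+(\d+)", cli_text)
        if match:
            counters[name] = int(match.group(1))
    return counters

def new_errored_seconds(before, after):
    """Return how many errored seconds were added between two snapshots."""
    old, new = parse_pcs_seconds(before), parse_pcs_seconds(after)
    return {name: new[name] - old.get(name, 0) for name in new}

# Illustrative snapshots only; the layout follows the output quoted in this thread.
baseline = """PCS statistics Seconds
Bit errors 4
Errored blocks 4"""
later = """PCS statistics Seconds
Bit errors 61
Errored blocks 69"""

delta = new_errored_seconds(baseline, later)
print(delta)
```

What counts as a "significant number" over 24h is left to the operator; no threshold is assumed here.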
>
> Also beware, lots of DOM bugs in various JunOS releases cause the DOM
> values to get stuck, and it can be hard or impossible to check in a
> non-outage-causing way (sometimes you can safely bend the patch cable
> and observe the increase in loss to verify your DOM values aren't
> stuck) - I've had this most commonly in the past on DPC cards but have
> also observed it in MPC cards. The DOM data is also highly dependent
> upon the optic itself and there's a LOT of buggy stuff out there so
> it's not all juniper's fault there.
>
>
> On Fri, Oct 21, 2016 at 11:07 AM, David B Funk
> <dbfunk at engineering.uiowa.edu> wrote:
> > Thanks guys but this isn't what I was asking.
> >
> > The optical power is similar (within a few tenths of a dB) at my end, down
> > by about 2.5 dB at the far end of the link that is having issues (-6.23 dBm
> > as opposed to -3.73 dBm) but not enough to explain what I'm seeing.
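
For reference, the arithmetic behind those two readings can be worked out as below. This is a small Python sketch; the variable names are made up here, and only the two dBm values come from the message above.

```python
problem_link_dbm = -6.23  # reading on the link that is having issues
reference_dbm = -3.73     # reading it is being compared against

# A difference between two dBm readings is a ratio, expressed in dB.
delta_db = problem_link_dbm - reference_dbm
power_ratio = 10 ** (delta_db / 10)

print(f"difference: {delta_db:.2f} dB")
print(f"linear power ratio: {power_ratio:.3f}")
```

A 2.5 dB drop means a bit over half the optical power arrives, which by itself says nothing about whether the receiver is still inside its sensitivity range.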
> >
> > The big question I have is: What does "30 Seconds" mean for an attribute
> > that by description of the docs is supposed to be number of PCS blocks with
> > invalid Sync headers?
> > Particularly when the guy on the Cisco at the other end says his error
> > counters are going up like crazy (and packets are being dropped) while the
> > stats at my end stay constant at "30 Seconds".
> > What does that mean?
> >
> > The particularly frustrating thing is that data streams are dropping packets
> > (e.g. iperf3 showing retries and seriously degraded performance) but none of
> > the interface stats are showing any values that indicate an issue other than
> > that "30 Seconds".
> >
> > Can anybody tell me what "30 Seconds" means (in the context of an error
> > counter)?
> >
> >
> >
> >
> > On Fri, 21 Oct 2016, Christopher Costa wrote:
> >
> >> Here's my notes from a jtac review about these a couple years ago:
> >>
> >>
> >>
> >> [pcs] encoding is continually transmitting to keep the line in sync.
> >> The PCS layer is directly below the MAC layer so for MX, it’s on the
> >> MIC. PCS errors can be caused by anything MIC or lower, i.e.
> >> transceiver, fiber, line equipment, etc.
> >>
> >>
> >>
> >> PCS functionality:
> >> ===================
> >> IEEE 802.3ae 10GbE interfaces use a 64B/66B encoder/decoder in the
> >> PHY-PCS (Physical Coding Sublayer) to allow reasonable clock recovery
> >> and facilitate alignment of the data stream at the receiver.
> >> As the scheme name suggests, 64 bits of data on the MAC layer are
> >> transmitted as a 66-bit code block on the PHY layer, which makes
> >> clock/timing synchronization easier. A 66-bit code block contains a
> >> 2-bit Sync. header + 8 octets data/control field.
> >> If the Sync. header is '01', the 8 octets are entirely data.
> >> If the Sync. header is '10', an 8-bit Type field follows, plus 56 bits
> >> of data/control field.
> >> The 8 octets data/control field is scrambled by using a
> >> self-synchronous scrambler to achieve complete DC-balance on the
> >> serial line.
> >> PCS statistics displays PCS fault conditions by checking for valid
> >> Sync. headers at every 66-bit interval, so that we can monitor the
> >> quality of the 10Gbps high-speed transmission line.
> >> If the 64B/66B receiver does not detect the 2-bit Sync. header at the
> >> regular 66-bit interval and it estimates a high BER (Bit Error Rate
> >> of >10^-4), PCS statistics will report a problem.
> >> PCS statistics :
> >> ================
> >> - "Bit errors" indicates the number of PCS blocks with invalid Sync
> >> headers.
> >> - "Errored blocks" indicates the number of PCS blocks with a valid Sync.
> >> header but invalid block format.
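
The two definitions in these notes can be sketched in Python as a classifier for a received 66-bit block. This follows the notes as written (a later reply in this thread disputes that reading of the counters). The `classify_block` helper is hypothetical, and the set of control block type values is quoted from memory of IEEE 802.3 Clause 49, so it should be checked against the standard before being relied on.

```python
# Assumed set of valid 64B/66B control block type values (IEEE 802.3 Clause 49);
# verify against the standard before relying on it.
VALID_CONTROL_TYPES = {
    0x1E, 0x2D, 0x33, 0x66, 0x55, 0x78, 0x4B, 0x87,
    0x99, 0xAA, 0xB4, 0xCC, 0xD2, 0xE1, 0xFF,
}

def classify_block(sync, block_type=None):
    """Classify a block from its 2-bit sync header and, for '10', its type field."""
    if sync == "01":
        return "data"                  # all 8 octets are data
    if sync == "10":
        if block_type in VALID_CONTROL_TYPES:
            return "control"           # valid header, recognised block format
        return "bad block format"      # what the notes call an "errored block"
    return "bad sync header"           # '00' or '11': what the notes call a "bit error"

results = [
    classify_block("01"),
    classify_block("10", 0x1E),
    classify_block("10", 0x00),
    classify_block("11"),
]
print(results)
```

The sync header is passed as the two-character string used in the notes, which sidesteps any assumption about on-the-wire bit ordering.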
> >>
> >>
> >> On Fri, Oct 21, 2016 at 9:37 AM, Michael Carey <mcarey at kinber.org> wrote:
> >> David,
> >>
> >> When I've seen PCS statistical errors before, it pointed to either a
> >> failing optic that needed to be replaced in our MX or a drastic change
> >> in optical light levels caused by an OSP fiber issue. How do your
> >> "show interface diagnostic optic" levels look?
> >>
> >> On Wed, Oct 19, 2016 at 7:40 PM, David B Funk
> >> <dbfunk at engineering.uiowa.edu> wrote:
> >>
> >> > I've got a couple of 10Gig-eth interfaces (xe- on MX480) of which I'm
> >> > trying to interpret the "PCS statistics" values.
> >> >
> >> > One of them is pretty steady at:
> >> >
> >> >   PCS statistics      Seconds
> >> >     Bit errors              4
> >> >     Errored blocks          4
> >> >
> >> > The other one seems to vary, with values ranging from 10 to 70,
> >> > e.g.:
> >> >
> >> >   PCS statistics      Seconds
> >> >     Bit errors             61
> >> >     Errored blocks         69
> >> >
> >> > The second interface will trigger a number of error conditions at the
> >> > other end, which terminates in a Cisco router, without showing any
> >> > error conditions at my end (e.g. BPDU Error: None, MAC-REWRITE Error:
> >> > None, CRC/Align errors 0, FIFO errors 0, etc.). During some of these
> >> > times I'll see significant packet loss and during others minimal
> >> > problems.
> >> >
> >> > According to Juniper docs the PCS statistics should mean:
> >> >
> >> > PCS statistics
> >> > (10-Gigabit Ethernet interfaces) Displays Physical Coding Sublayer
> >> > (PCS) fault conditions from the WAN PHY or the LAN PHY device.
> >> >
> >> > Bit errors - High bit error rate. Indicates the number of bit errors
> >> > when the PCS receiver is operating in normal mode.
> >> > Errored blocks - Loss of block lock. The number of errored blocks
> >> > when PCS receiver is operating in normal mode.
> >> >
> >> > But I don't know how to interpret a value of "16 seconds" with that
> >> > definition.
> >> > Can anybody shed some light on what those numbers mean?
> >> >
> >> > Thanks.