[j-nsp] interpreting 10Gb interface "PCS statistics" values

Fri Oct 21 18:58:36 EDT 2016

When I was getting these and the Cisco far end was getting tons of
errors, the light levels were good all around.  It ended up being a
fiber problem near the transmitter.  Try shooting the fiber link with
an OTDR to see if you are getting lots of reflections.

On Fri, Oct 21, 2016 at 12:23:18PM -0700, Michael Loftis wrote:
> Was hoping someone who knew more could chime in...but it's measured in
> seconds basically because the PCS (physical coding sublayer) does NOT
> keep detailed statistics...so the "Seconds" value means there were X
> distinct seconds in which an error was flagged in that category...the
> previous response detailing bit vs errored blocks I think is wrong.
> The PCS layer can repair single bit errors, thus a second with one or
> more single bit (but correctable!) errors is a "bit errored second" -
> if it is unable to correct and recover a valid PCS block then you get
> the "errored block" seconds...
> 
> It's not a raw count of the number of those errors, just that it
> occurred in a ~1s window X times.  You can totally get PCS errors
> unplugging an optic or otherwise shutting down the remote end.  You
> can totally get spurious PCS errors from a marginal-ish link that
> shows PLENTY of light (SNR is low or a marginal cable).  In MX
> specifically it *can* in very rare circumstances indicate a problem
> even between the optic and the MIC....most of the time my suggestion
> for PCS errors is clear counters and check in 1h and 24h.  If you get
> a significant number of errored seconds in a 24h period then
> check/clean ends and patches, maybe replace optics.
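
A minimal sketch of the counting behaviour described above, assuming the
counter simply goes up by one for every one-second window in which at least
one error was flagged. The timestamps and the name errored_seconds are made
up for illustration; this is not how Junos is known to implement it.

```python
import math
from typing import Iterable


def errored_seconds(event_times: Iterable[float]) -> int:
    """Count distinct one-second windows containing at least one error event."""
    return len({math.floor(t) for t in event_times})


# Hypothetical error timestamps, in seconds since the counters were cleared.
burst = [0.10, 0.11, 0.12, 0.95, 3.40, 3.41, 7.00]

raw_count = len(burst)            # what a per-event counter would report
seconds = errored_seconds(burst)  # what a "Seconds" style counter would report

print(f"raw events: {raw_count}, errored seconds: {seconds}")
```

Under that assumption a per-event counter on one end can climb quickly while
the "Seconds" value on the other end moves slowly or not at all.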
> 
> Also beware, lots of DOM bugs in various JunOS releases cause the DOM
> values to get stuck, and it can be hard or impossible to check in a
> non-outage-causing way (sometimes you can safely bend the patch cable
> and observe the increase in loss to verify your DOM values aren't
> stuck) - I've had this most commonly in the past on DPC cards but have
> also observed it in MPC cards.  The DOM data is also highly dependent
> upon the optic itself and there's a LOT of buggy stuff out there so
> it's not all Juniper's fault there.
> 
> 
> On Fri, Oct 21, 2016 at 11:07 AM, David B Funk
> <dbfunk at engineering.uiowa.edu> wrote:
> > Thanks guys but this isn't what I was asking.
> >
> > The optical power is similar (within a few tenths of a dB) at my end, down
> > by 2.5 dB at the far end of the link that is having issues (-6.23 dBm as
> > opposed to -3.73 dBm) but not enough to explain what I'm seeing.
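
For reference, the arithmetic behind those two readings, as a small sketch.
The variable names are just for illustration; the only inputs are the two
dBm values quoted above.

```python
def dbm_to_mw(dbm: float) -> float:
    """Convert an absolute power level in dBm to milliwatts."""
    return 10 ** (dbm / 10)


reference_dbm = -3.73      # reading quoted for comparison
problem_link_dbm = -6.23   # far end of the link that is having issues

# A difference between two dBm readings is a ratio, expressed in dB.
delta_db = problem_link_dbm - reference_dbm
power_ratio = dbm_to_mw(problem_link_dbm) / dbm_to_mw(reference_dbm)

print(f"difference: {delta_db:.2f} dB, power ratio: {power_ratio:.3f}")
```

So the weaker reading carries a bit more than half the optical power of the
other one.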
> >
> > The big question I have is: What does "30 Seconds" mean for an attribute
> > that by description of the docs is supposed to be number of PCS blocks with
> > invalid Sync headers?
> > Particularly when the guy on the Cisco at the other end says his error
> > counters are going up like crazy (and packets are being dropped) while the
> > stats my end stays constant at "30 Seconds".
> > What does that mean?
> >
> > The particularly frustrating thing is that data streams are dropping packets
> > (EG iperf3 showing retries and seriously degraded performance) but none of
> > the interface stats are showing any values that indicate an issue other than
> > that "30 Seconds".
> >
> > Can anybody tell me what "30 Seconds" means (in the context of an error
> > counter)?
> >
> >
> >
> >
> > On Fri, 21 Oct 2016, Christopher Costa wrote:
> >
> >> Here's my notes from a JTAC review about these a couple years ago:
> >>
> >>
> >>
> >> [pcs] encoding is continually transmitting to keep the line in sync. The
> >> PCS layer is directly below the MAC layer so for MX, it’s on the MIC. PCS
> >> errors can be caused by anything MIC or lower, i.e. transceiver, fiber,
> >> line equipment, etc.
> >>
> >>
> >>
> >>  PCS functionality:
> >>  ===================
> >>  IEEE 802.3ae 10GbE interfaces use a 64B/66B encoder/decoder in the
> >>  PHY-PCS (Physical Coding Sublayer) to allow reasonable clock recovery
> >>  and facilitate alignment of the data stream at the receiver.
> >>  As the scheme name suggests, 64 bits of data on the MAC layer are
> >>  transmitted as a 66-bit code block on the PHY layer, which makes
> >>  clock/timing synchronization easier. A 66-bit code block contains a
> >>  2-bit Sync. header + 8 octets data/control field.
> >>  If the Sync. header is '01', the 8 octets are entirely data.
> >>  If the Sync. header is '10', an 8-bit Type field follows, plus 56 bits
> >>  of data/control field.
> >>  The 8 octets data/control field is scrambled by using a
> >>  self-synchronous scrambler to achieve complete DC-balance on the
> >>  serial line.
> >>  PCS statistics displays PCS fault conditions by checking for a valid
> >>  Sync. header at every 66-bit interval, so that we can monitor 10Gbps
> >>  high speed transmission line quality.
> >>  If the 64B/66B receiver does not detect the 2-bit Sync. header at the
> >>  regular 66-bit interval and it estimates a high BER (Bit Error Rate
> >>  of >10^-4), PCS statistics will report a problem.
> >>
> >>  PCS statistics:
> >>  ===============
> >>  - "Bit errors" indicates the number of PCS blocks with invalid Sync.
> >>    headers.
> >>  - "Errored blocks" indicates the number of PCS blocks with a valid Sync.
> >>    header but invalid block format.
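
A rough sketch of the sync header check those notes describe, assuming the
header is handed over as a plain 2-bit integer written the same way as in
the notes ('01' data, '10' control). The function names and the sample
header list are hypothetical; a real PCS does this in hardware on the
serial stream.

```python
from collections import Counter
from typing import Iterable

MAC_DATA_RATE_BPS = 10_000_000_000  # 10GbE payload rate


def classify_sync_header(header: int) -> str:
    """Classify a 2-bit 64B/66B sync header value."""
    if header == 0b01:
        return "data"      # the 8 octets are entirely data
    if header == 0b10:
        return "control"   # 8-bit Type field + 56 bits data/control
    if header in (0b00, 0b11):
        return "invalid"   # no bit transition, not a legal sync header
    raise ValueError(f"not a 2-bit value: {header!r}")


def summarize(headers: Iterable[int]) -> Counter:
    """Tally how many blocks fall into each sync header class."""
    return Counter(classify_sync_header(h) for h in headers)


# Every 64 bits of MAC data go out as a 66-bit block on the line.
line_rate_bps = MAC_DATA_RATE_BPS * 66 // 64

sample_headers = [0b01, 0b01, 0b10, 0b00, 0b01, 0b11]  # made-up input
tally = summarize(sample_headers)

print(line_rate_bps, dict(tally))
```

The two legal headers always contain a bit transition, which is what lets
the receiver find and keep block alignment.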
> >>
> >>
> >> On Fri, Oct 21, 2016 at 9:37 AM, Michael Carey <mcarey at kinber.org> wrote:
> >>       David,
> >>
> >>       When I've seen PCS statistical errors before, it pointed to either a
> >>       failing optic that needed to be replaced in our MX or a drastic
> >>       change in optical light levels caused by an OSP fiber issue.  How do
> >>       your "show interfaces diagnostics optics" levels look?
> >>
> >>       On Wed, Oct 19, 2016 at 7:40 PM, David B Funk
> >>       <dbfunk at engineering.uiowa.edu> wrote:
> >>
> >>       > I've got a couple of 10Gig-eth interfaces (xe- on MX480) of which
> >>       > I'm trying to interpret the "PCS statistics" values.
> >>       >
> >>       > One of them is pretty steady at:
> >>       >
> >>       >   PCS statistics                      Seconds
> >>       >     Bit errors                             4
> >>       >     Errored blocks                         4
> >>       >
> >>       > The other one seems to vary with the values ranging from 10 to 70.
> >>       > EG:
> >>       >
> >>       >   PCS statistics                      Seconds
> >>       >     Bit errors                            61
> >>       >     Errored blocks                        69
> >>       >
> >>       > The second interface will trigger a number of error conditions
> >>       > at the other end, which terminates in a Cisco router, without
> >>       > showing any error conditions at my end (EG BPDU Error: None,
> >>       > MAC-REWRITE Error: None, CRC/Align errors 0, FIFO errors 0, etc.).
> >>       > During some of these times I'll see significant packet loss and
> >>       > at others I see minimal problems.
> >>       >
> >>       > According to Juniper docs the PCS statistics should mean:
> >>       >
> >>       >  PCS statistics
> >>       >   (10-Gigabit Ethernet interfaces) Displays Physical Coding
> >>       >   Sublayer (PCS) fault conditions from the WAN PHY or the LAN PHY
> >>       >   device.
> >>       >
> >>       >     Bit errors—High bit error rate. Indicates the number of bit
> >>       >       errors when the PCS receiver is operating in normal mode.
> >>       >     Errored blocks—Loss of block lock. The number of errored
> >>       >       blocks when PCS receiver is operating in normal mode.
> >>       >
> >>       > But I don't know how to interpret a value of "16 seconds" with
> >>       > that definition.
> >>       > Can anybody shed some light on what those numbers mean?
> >>       >
> >>       > Thanks.
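
For the "clear counters and check in 1h and 24h" approach suggested earlier
in the thread, a small sketch that pulls the two values out of saved CLI
output and diffs two polls. It assumes the text looks like the "PCS
statistics" block pasted in the original question; the first snapshot uses
the 61/69 values from that post, the second snapshot and the helper names
are invented for illustration.

```python
import re
from typing import Dict

# Matches the two counter lines of the "PCS statistics" block.
PCS_LINE = re.compile(r"^\s*(Bit errors|Errored blocks)\s+(\d+)\s*$", re.MULTILINE)


def parse_pcs(text: str) -> Dict[str, int]:
    """Extract the PCS errored-second counters from saved CLI output."""
    return {name: int(value) for name, value in PCS_LINE.findall(text)}


def pcs_delta(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Errored seconds accumulated between two polls."""
    return {name: after[name] - before[name] for name in after if name in before}


first_poll = """
  PCS statistics                      Seconds
    Bit errors                            61
    Errored blocks                        69
"""

# Hypothetical later poll of the same interface.
second_poll = """
  PCS statistics                      Seconds
    Bit errors                            75
    Errored blocks                        90
"""

growth = pcs_delta(parse_pcs(first_poll), parse_pcs(second_poll))
print(growth)
```

Dividing the growth by the polling interval gives the fraction of seconds
that saw at least one error, which is easier to compare across links than
the raw totals.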