[c-nsp] tracking down sporadic packet loss

Charles Sprickman spork at bway.net
Thu Dec 6 19:43:45 EST 2012


I'm having a tough time finding where else to dig for the source of
packet loss on what seems like a fairly lightly-loaded network.  We
have a very simple setup with a 7206/NPE-G2.

                       ___________  dot1q           dot1q
Transit1 (Gi0/1)-------|         |  trunk  ________ trunk
                       |   7206  |---------| 3560  |------- MetroE
DSL Provider (Gi0/2)---|         | (Gi0/3  |_______|        (Gi0/2)
                       |_________| to Gi0/1) |  |  |  
                                             |  |   \
                                             |  |    \
                                           Transit2    Servers
                                           (fa0/13,14)  (fa0/1-12)

Our aggregate usage is under 300Mb/s.  The MetroE connection peaks
at about 120Mb/s.  The DSL link peaks at around 110Mb/s.

DSL subs come in as a VLAN per customer, and each customer gets its
own subinterface.  Each subinterface uses "ip unnumbered loopback X",
where loopback X holds that customer's gateway address.

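A rough sketch of what one such per-customer subinterface could look
like in IOS.  The VLAN ID, loopback number, descriptions, and
addresses are made-up placeholders (documentation address range), not
taken from the real config, and however the customer prefix is routed
toward the subinterface is not shown:

```
! Placeholder values throughout: VLAN 101, Loopback101, 192.0.2.1/29
interface Loopback101
 description Gateway address for example DSL customer (hypothetical)
 ip address 192.0.2.1 255.255.255.248
!
interface GigabitEthernet0/2.101
 description Example DSL customer on VLAN 101 (hypothetical)
 encapsulation dot1Q 101
 ip unnumbered Loopback101
```
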
MetroE subs also come in one per VLAN and terminate on numbered
subinterfaces.  The VLANs are trunked through the switch.

The 3560 is set up as a standard "router on a stick": subinterfaces
are created on Gi0/3 on the 7206 for the Transit2 ports (fa0/13-14)
and for a few other small VLANs serving a handful of servers (less
than 15Mb/s peak).  The native VLAN is unused.
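
For illustration only, the two ends of that trunk might be configured
along these lines.  VLAN 20, the server port, and the 198.51.100.0/24
address are assumed placeholders, not values from the real config:

```
! --- 3560 side (placeholder VLAN 20) ---
! Trunk toward the 7206
interface GigabitEthernet0/1
 switchport trunk encapsulation dot1q
 switchport mode trunk
!
! One server-facing access port
interface FastEthernet0/1
 switchport mode access
 switchport access vlan 20
!
! --- 7206 side ---
! Routed subinterface for the same VLAN on the trunk port
interface GigabitEthernet0/3.20
 encapsulation dot1Q 20
 ip address 198.51.100.1 255.255.255.0
```

With a layout like this, every routed packet to or from a server
VLAN crosses the Gi0/3 <-> Gi0/1 trunk at least once.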

CPU usage on the G2 averages about 30% at peak times of the day.
Every link here runs clean as far as "sh int" can show me.

During peak traffic times, however, we start seeing some light packet
loss from the server VLANs to anything reached via Transit1 and to
DSL circuits (though for DSL it's hard to prove the loss isn't on
the backhaul or the customer's own line).  At the same time, a ping
running to anyone off the metro ethernet circuit is clean, as is
anything reached via Transit2.  There appears to be no loss from
MetroE customers to Transit1 destinations, nor from DSL clients to
Transit1.  I just added a bunch more smokeping targets in each area
mapped out above to try and narrow this down, but in the meantime,
what else can I look at?

As noted, there's nothing alarming in any interface counters here.
The pattern does seem to be that traffic from any of the server
VLANs that crosses the router/switch trunk and heads out any other
GigE interface on the router shows loss, while traffic from the
server VLANs that crosses the trunk, turns back around, and heads
out another port on the 3560 does not.

I don't have enough hard data yet to point any fingers, but what are
some of the more low-level items to look at on the 7206 and the
3560?

Thanks,

Charles

