[j-nsp] High latency and slow connections

Blaz Zupan blaz at inlimbo.org
Wed Nov 5 12:16:06 EST 2003


> Are the pings transiting the M5 or source from the RE?

Both packets transiting the M5 and packets sourced from or destined to the RE
exhibited this problem. The load on the RE was 0.1, although I did
happen to catch a moment where the load jumped to 1.0 with "sampled" and "rpd"
taking most of the CPU. I have netflow accounting turned on (with every 100th
packet being sampled) so some spikes in sampled's CPU usage are to be expected.
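
For reference, 1-in-100 sampling on a box like this is configured along these
lines (just a sketch; the exact hierarchy differs a bit between JunOS releases,
and the collector address below is a placeholder, not our real one):

forwarding-options {
    sampling {
        input {
            family inet {
                /* sample every 100th packet */
                rate 100;
            }
        }
        output {
            cflowd 192.0.2.1 {
                /* placeholder collector, v5 flow export */
                port 2055;
                version 5;
            }
        }
    }
}

plus "sampling input" under family inet on the interfaces that should be
sampled.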

> You say you have a fastethernet connected box.  Where is this box
> connected, is it on the production network or connected to the
> management network via fxp0?

The fastethernet-connected box is connected to a VLAN on a Cisco 3550, which
has a gigabit trunk to the M5. But even packets coming in on one STM-1 and
going out the other STM-1 exhibited this problem. In fact *any* packet going
through the M5 in any direction had this problem.

> If the FE Box is on the network, is there any other equipment in the
> path?  Do you have a way to isolate a path to carry out transit pings
> across the M5 only?

Here is some crude ASCII art. Obviously you need a monospace font to see it.

                         us1        us2
                          |          |
                          |STM1      |STM1
                          |          |
                          |          |
                         lj2--STM1--mb3
                        /            |
                      3550          3550
                        |           / \ \
                       s1          s2 s3 s4

"mb3" is the M5 having trouble. "lj2" is another M5, connected to mb3 through
an STM-1. "us1" and "us2" are our upstreams. The two "3550" boxes are Cisco
3550 switches, while s1 to s4 are various servers.

All tests were done with both ping and traceroute. We also had complaints
from customers about slow connections (a customer connected directly to one of
the switches and rate-limited to 4 Mbps saw download speeds of only 350 Kbps).

I did traceroutes from s2 to mb3, from s2 to lj2, from s2 to s1, from s2 to
us1, from s1 to us2 and from s1 to s2. Also from mb3 to lj2, us1, us2 and s2,
etc. All showed the same results.

Pings from s2 to mb3 showed 10 ms (they are normally below 1 ms). Pings from
s2 to s1 showed around 30 to 40 ms, whereas they are normally around 4 ms.

> It is hard to think of a cause for such a dramatic increase in the data
> plane of the M5, only buffering would introduce such delays and as the
> interfaces have low levels of traffic this would not be the case.  If
> however the pings are being sourced from the RE there may be some
> increased control traffic that is taking precedence over the pings.

Well, a low level of traffic is a relative term. The links had normal traffic
levels for this time of day: the STM-1 between lj2 and mb3 was loaded at
around 60 Mbps in both directions, the STM-1 to us2 was at around 100 Mbps,
and the gigabit ethernet from mb3 to the 3550 was at around 140 Mbps.

I do have some QoS configuration on some of the links, but I removed all of it
(literally) as a test. The interesting effect was an even higher latency (by
about another 10 ms).
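
If the extra delay were simple queueing on an egress interface, I would expect
the per-queue counters to show it. A quick way to check (the interface name is
only an example, based on the PIC inventory further down):

user@mb3> show interfaces so-0/1/0 extensive | find "Queue counters"

The "Queue counters" section lists queued, transmitted and dropped packets per
queue, so sustained buffering or drops should be visible there.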

I checked the pps rates on all the links. Most were between 10000 and
30000 pps, which is normal. During DoS attacks we usually see peaks of
80000 pps or more, and the M5 has handled those without any trouble.
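
For anyone who wants to look at the same numbers, the rates are easy to read
off with the standard operational commands, e.g. (interface name again only an
example):

user@mb3> monitor interface traffic
user@mb3> show interfaces so-0/1/0 extensive | match "bps|pps"
user@mb3> show pfe statistics traffic

The last one gives the aggregate packet rate through the forwarding engine,
which is what really matters on a single-FEB box like the M5.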

Interestingly enough, I saw a similar problem a couple of months ago. A DDoS
attack was targeted at a customer downstream of mb3 (connected to another
router which is connected to the 3550). The DDoS was hitting us through us2
with about 100000 pps and it completely filled up the remaining capacity on
the STM-1 to us2. Latency shot through the roof, which was to be expected.

But the annoying part was that even traffic traversing the gigabit ethernet,
for example going from s2 to s1 (which does not traverse us2 but goes
s2-mb3-lj2-s1), was seeing extremely high latency (300 ms and above). I
would have expected gigabit ethernet to easily handle 100000 pps and
about 200 Mbps of traffic. It was difficult to diagnose anything while under
the pressure of a DoS attack, so our upstream filtered the attack and I did
not collect any useful data. But it seemed very similar to what we
experienced today, except that today *all* traffic going through *any*
interface was affected.

Here's the output of "show chassis hardware detail" on the troubled box:

Hardware inventory:
Item             Version  Part number  Serial number     Description
Chassis                                59496             M5
Midplane         REV 03   710-002650   HB1361
Power Supply A   Rev 04   740-002497   LK22841           AC
Power Supply B   Rev 04   740-002497   LK23113           AC
Display          REV 04   710-001995   HJ3124
Routing Engine   REV 04   740-003877   9000019802        RE-2.0
Routing Engine                         7400000734579001  RE-2.0
FEB              REV 05   710-003311   HJ3655            E-FEB
FPC 0
  PIC 0          REV 01   750-005091   HD1292            1x G/E, 1000 BASE-SX
  PIC 1          REV 03   750-003748   HH1347            2x STM-1 SDH, SMIR

The box is running 5.7R3.4. Here's a traceroute from s2 to s1, now
that everything seems to be ok again:

traceroute to fog.amis.net (212.18.32.146), 64 hops max, 44 byte packets
 1  mb3-ge-0-0-0-3.router.amis.net (212.18.32.1)  0.859 ms  0.505 ms  0.626 ms
 2  lj2-so-0-2-1-0.router.amis.net (212.18.35.114)  13.176 ms  3.131 ms  3.170 ms
 3  fog (212.18.32.146)  3.292 ms  4.109 ms  3.458 ms

