[j-nsp] SRX 3600 dropped packets - how to debug?

Phil Mayers p.mayers at imperial.ac.uk
Fri May 24 07:02:27 EDT 2013


On 24/05/13 11:33, Wood, Peter (ISS) wrote:
> Hey Phil,
>
> A friendly hello from Lancaster Uni, also using SRX 3600's.
>
> Can you reproduce the loss? Or alternatively know source/destination
> ranges of likely connections? A user it's more likely to affect or
> can demonstrate it reliably?

Depends what you mean by "reproduce". The counter in question is rising 
continually, so (assuming that counter can be trusted) it's happening 
continually. But I have no idea *what* traffic might be being dropped.

Someone suggested to me that this counter might include sessions where 
the 3-way handshake is not completed successfully, which if true might 
account for it, but several hundred/sec seems too high for that.

To be clear: we don't have any *reports* of packet loss (well, not since 
I upgraded to 12.1R6.5 to fix PSN-2012-10-754 ;o) - it's just the 
counter value incrementing that has me concerned.

Could the counter be wrong/misleading?


> As pretty much unless this is a policy that's doing it (if you have
> "then deny", then get a "then count" on all those rules too, but it
> sounds like packet loss rather than session creation
> rejection/failure/timeout), you're gonna be stuck doing a datapath
> debug.

I did investigate the datapath debug and flow tracing (see below) but 
neither suggested anything like the rate of events required to match the 
rate of counter increments. There was a background of:

CID-00:FPC-11:PIC-00:THREAD_ID-08:RT:SPU invalid session id 00000000

...when I had flow tracing enabled, but that seemed to be ~10-20/sec. 
Unsure if it's related.
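
For what it's worth, the comparison being made here is just a rate check: sample the drop counter twice, divide the delta by the interval, and see whether the log-message rate is anywhere near it. A minimal sketch of that arithmetic follows; the function names and the sample values are purely illustrative assumptions, not readings from the box.

```python
def rate_per_sec(first, second, interval_s):
    """Average increments per second between two samples of a monotonic counter."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return (second - first) / interval_s


def could_explain(event_rate, counter_rate, tolerance=0.5):
    """True if event_rate is within a fractional tolerance of counter_rate."""
    return abs(event_rate - counter_rate) <= tolerance * counter_rate


# Hypothetical samples: two counter readings taken 60 seconds apart.
drop_rate = rate_per_sec(1_000_000, 1_018_000, 60)

# Hypothetical log rate, taken as the midpoint of "~10-20/sec".
log_rate = 15

print(drop_rate, could_explain(log_rate, drop_rate))
```

With numbers of that shape, a log rate of 10-20/sec is well over an order of magnitude short of a counter moving at several hundred/sec, so the messages alone would not account for it.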

Slightly OT, I did spend some time thinking it was dropping some 
fragmented packets, but that was a red herring - I didn't realise the 
SRX re-assembles then re-fragments IP frags, which means if some PPPoE 
customer sends you:

packet 0-1400
packet 1400-1450

...the SRX will merge them into a single unfragmented packet on egress - 
until I realised this, I was missing the egress non-fragment, and 
thinking they'd been dropped.
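
A rough sketch of that reassemble-then-refragment behaviour, to show why the two ingress fragments come out as one packet. It assumes the ranges above are payload byte ranges, a 20-byte IPv4 header with no options, and a 1500-byte egress MTU; the function names are made up for illustration and this is not how the SRX implements it internally.

```python
IP_HEADER = 20  # assumed IPv4 header size, no options


def reassemble(fragments):
    """fragments: list of (start, end) payload byte ranges, end exclusive.
    Returns the total payload length; raises on a gap or overlap."""
    expected = 0
    for start, end in sorted(fragments):
        if start != expected:
            raise ValueError(f"gap or overlap at byte {expected}")
        expected = end
    return expected


def refragment(payload_len, mtu):
    """Split payload_len bytes into (start, end) ranges that fit the MTU.
    Each fragment except possibly the last carries a multiple of 8 bytes."""
    if payload_len + IP_HEADER <= mtu:
        return [(0, payload_len)]  # fits whole: egress packet is unfragmented
    max_payload = (mtu - IP_HEADER) // 8 * 8
    if max_payload <= 0:
        raise ValueError("MTU too small to carry any payload")
    out = []
    start = 0
    while start < payload_len:
        end = min(start + max_payload, payload_len)
        out.append((start, end))
        start = end
    return out


total = reassemble([(0, 1400), (1400, 1450)])
print(refragment(total, 1500))  # one range: no fragments on egress
print(refragment(total, 1420))  # a smaller egress MTU splits it again
```

So when matching ingress against egress captures, a pair of ingress fragments has to be compared with a single unfragmented egress packet whenever the reassembled datagram fits the egress MTU.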

>
> http://www.juniper.net/techpubs/software/junos-security/junos-security10.2/junos-security-swconfig-security/topic-41983.html
>
>  If you're shifting anywhere like the amount of traffic we are you
> aren't going to want to set up a filter for 0/0 to 0/0. Something
> I've had to explain to JTAC on numerous occasions (something along
> the lines of "You want me to enable full flow debugging on three
> SPC's collectively pushing 8Gbps!?!").

At the moment, the SRX is sitting in front of our "personally owned" 
VRF; this means all our wireless and wired laptops, and RAS VPN address 
ranges. This is doing about 1Gbps, which is probably still more than I 
can sensibly debug with flow tracing or packet capture.

It is a shame there isn't a "datapath-debug drops".

>
> Also you using anything like AppTrack and AppFW/AppQos/AppDos?

They were enabled at one point, but I disabled them whilst investigating 
the above-mentioned loss/PSN, and haven't turned them back on yet.

> I've unfortunately had a fair amount of experience with datapath
> debugs, so feel free to give me a shout off list.

That's... slightly ominous!

I did wonder about interpretation; the pcap header contains various bits 
of metadata, but it's unclear to me how to interpret those, and which 
ones are valuable and which not. Is there any decent guide to that?

Completely unrelated, can I ask if you have separate NPCs or the newer 
integrated IOC/NPC, and whether you have any comments pro or con the latter?

Cheers,
Phil

