[nsp] [long] total packet corruption tricking TCP checksums ?

From: Kai (kai-cisco-nsp-trap@conti.nu)
Date: Mon Jan 31 2000 - 11:49:59 EST


I am having a *big* problem over here, and wonder if any reader of the
list has ever experienced something like this (I have a lot of other
anecdotal failures going back some 8 years, but never anything like
this):

(on another note: I wonder how I got silently unsubscribed from the list
sometime after Nov 5th, 1999)

Thanks,
bye,Kai

--
kai@conti.nu             "Just say No" to Spam            Kai Schlichting
Palo Alto, New York, You name it             Sophisticated Technical Peon
Kai's SpamShield <tm> is FREE!                 http://SpamShield.Conti.nu
|                                                                       |
LeasedLines-FrameRelay-IPLs-ISDN-PPP-Cisco-Consulting-VoiceFax-Data-Muxes
WorldWideWebAnything-Intranets-NetAdmin-UnixAdmin-Security-ReallyHardMath

------- [...] Customers started to complain sometime this week that downloaded files are corrupted; there was a lone complainer a week or two ago, though.

Basically, FTP/HTTP downloads from random sites get corrupted in random ways: zip files are broken, their md5 checksums are different every time, while their length is correct.

I tested this with like 5 sites today, and with several platforms and machines as the download platform: 3 BSDI boxes, 4 different Win98 boxes, all show the same thing.

I rebooted and power-cycled my C2926 switch, then removed/re-inserted the FastEther card in my 7204 router, finally rebooted and power-cycled the same.

While there is some variation in how many files in a sample get corrupted (sometimes I get 4 good ones out of 8 @5MB, sometimes I get 8 completely different copies of the same file), there seems to be no direct relation between the corruption rate and the power-cycling of switch or router.

I also tried to limit CEF impact by shutting down one of our two UUnet T1's (double-T, multiplexed via CEF) in turn - to no effect.

Just a trivial sample of random multi-MB files that I used:
http://www.tucows.syracuse.net/files5/ghvhopp.zip
http://www.newdealinc.com/eval/nde.zip (password for zip: new2u)
Random Netscape distributions from ftp://ftp.netscape.com

I tried downloading this on XXXXXXXXX's shell server, and I had no problem doing so, multiple times. www.XXXXXXXXX.com is hanging off the same UUnet router (on a neighboring HSSI card into UUnet's FR switch) as we are - a really short path : when trying to http the files from there, they got corrupted, too.

This leaves:
- UUnet's HSSI card corrupting things
- UUnet's FR switch corrupting things
- Our 7204 corrupting things on the serial input side
- Our 7204 corrupting things on the FastEther output side
- Our 2926 switch with VLAN trunking corrupting things

Given that IP and TCP checksums are merely 16-bit one's-complement sums rather than CRC checksums, a mere "bit switch" (the same bit flipped in opposite directions in ANY two 16-bit words of, say, the TCP data) can corrupt the data but leave the checksum looking correct. I never knew the checksum was so simple - how come we don't see more widespread reports of files that got mangled in transport ? To my knowledge, FTP does no checksumming of any kind on its own, nor does HTTP.
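To make the weakness concrete, here is a minimal Python sketch of a 16-bit one's-complement checksum in the style of RFC 1071. The function name internet_checksum is an assumption for illustration, and the sample bytes mimic the "debris"/"debt" corruption from the diff later in this message. Real TCP also sums a pseudo-header and the TCP header, which is left out here.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"                          # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]    # big-endian 16-bit word
    while total >> 16:                           # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Illustrative payloads only: 'b' became 'a' (byte value -1) and
# 't' became 'u' (byte value +1), both in the high byte of a word.
original = b"debris\ndebt\n"
corrupted = b"dearis\ndebu\n"

same = internet_checksum(original) == internet_checksum(corrupted)
print("payloads differ:", original != corrupted)
print("checksums match:", same)
```

The two changes cancel because one byte dropped by 1 while a byte in the same position of another 16-bit word rose by 1, so the sum is unchanged. A single flipped bit on its own would still be caught.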

Does the Internet live on borrowed time ?

What could be causing these corruptions, and: have you seen something like that before (I sure haven't) ?

The strangest thing I ever saw that came close was a Frame-Relay link hanging off a V.35 card in a BSDI machine that just wouldn't transfer one particular 304-byte red-ball.gif via HTTP in outbound direction : you would expect transfers to hang due to TCP corruptions, but never achieve a 100% non-hanging, 'bit reversal' type of corruption.

---

Next morning:

I have swapped the 7204 for a spare 2501 for some tests, and still get these corruptions! That makes UUnet's router (7xxx with RSP4 and HSSI card into a FrameRelay switch) and their frame switch suspects in this bizarre failure. Notes:
- NO errors on the frame pvc or my 7204 interfaces
- UUnet is (universally) running CEF per-destination
- I am running CEF per-packet load balancing
- I now receive widespread customer complaints about corrupt files that were ftp'd, http'd, or sent as email MIME attachments, in particular .wav voice mail files from people using universal messaging
- files get corrupted 'outbound' as well

======================================================================

Following: a friend who looked into this problem for me as a second opinion:

[...] so then i thought "hmmm how about if i can diff these things". to make a sample file i cat'd /usr/dict/words onto itself about a dozen times on my machine (XXXXXXXX.net) and then used wget to grab the file on YYYYYY. then i grabbed the file 8 times sequentially (back to back in a script).

[YYYYYY] [/tmp] ~/md5sum words*
d9e3cda3db13bd66418706fb2b0a571a  words1
d9e3cda3db13bd66418706fb2b0a571a  words2
40b970e8505147a88a004fa74b18689d  words3
d9e3cda3db13bd66418706fb2b0a571a  words4
d9e3cda3db13bd66418706fb2b0a571a  words5
21ba1e4e1beb66bd248af71da6355708  words6
610cdfacd3b06483a9bbe4341dcac162  words7
d9e3cda3db13bd66418706fb2b0a571a  words8
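The same comparison can be scripted. A rough Python sketch, assuming the downloaded copies are already available as byte strings: the in-memory sample data and the helper name odd_ones_out are made up for illustration, whereas the actual test used wget and md5sum on real files.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def odd_ones_out(copies: Dict[str, bytes]) -> List[str]:
    """Return names of copies whose MD5 differs from the most common digest."""
    digests = {name: hashlib.md5(blob).hexdigest()
               for name, blob in copies.items()}
    majority, _count = Counter(digests.values()).most_common(1)[0]
    return sorted(name for name, d in digests.items() if d != majority)


# Stand-in data: four "downloads", one with two bits of damage.
good = b"debris\ndebt\n" * 1000
copies = {
    "words1": good,
    "words2": good,
    "words3": good.replace(b"debris", b"dearis", 1),  # same length, bad content
    "words4": good,
}

print(odd_ones_out(copies))
```

Taking the most common digest as the reference is only a heuristic. It works when most copies arrive intact, as in the run above, but not when every copy is damaged differently.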

hmmmmmmmmmmmmmmm

[YYYYYY] [/tmp] diff words2 words3
282708,282709c282708,282709
< debris
< debt
---
> dearis
> debu

# b->a == 0110 0010 -> 0110 0001
# t->u == 0111 0100 -> 0111 0101

[YYYYYY] [/tmp] diff words5 words6
88829,88830c88829
< lessen
< lesson
---
> lessenzleson

# \n -> z == 0000 1010 -> 0111 1010

[YYYYYY] [/tmp] diff words5 words7
242878,242880c242878,242879
< parallel
< parallelepiped
< paralysis
---
> parallem
> paxall`leplpedparelysis

# e -> ` == 0110 0101 -> 0110 0000
# i -> l == 0110 1001 -> 0110 1100

just a few bits (not bytes) at a time are affected... i dont know if this means anything, since the sample set was so small...
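One way to check the "bits, not bytes" observation on a larger sample is to XOR the good and bad copies byte by byte. A small Python sketch, assuming equal-length inputs: flipped_bits is a hypothetical helper, and the sample strings are the two lines from the first diff above.

```python
from typing import Iterator, Tuple


def flipped_bits(good: bytes, bad: bytes) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (offset, good_byte, bad_byte, xor_mask) for each differing byte."""
    if len(good) != len(bad):
        raise ValueError("inputs must have the same length")
    for offset, (g, b) in enumerate(zip(good, bad)):
        if g != b:
            yield offset, g, b, g ^ b    # set bits in the mask are the flipped ones


good = b"debris\ndebt\n"
bad = b"dearis\ndebu\n"

for offset, g, b, mask in flipped_bits(good, bad):
    count = bin(mask).count("1")
    print(f"offset {offset}: {g:08b} -> {b:08b}  xor {mask:08b}  ({count} bit(s))")
```

A mask with only one or two bits set points to bit-level damage in transit rather than whole bytes being replaced.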



This archive was generated by hypermail 2b29 : Sun Aug 04 2002 - 04:12:09 EDT