[nsp] [long] total packet corruption tricking TCP checksums ?

From: Kai (kai-cisco-nsp-trap@conti.nu)
Date: Mon Jan 31 2000 - 11:49:59 EST

Next message: Alex Bligh: "Re: [nsp] [long] total packet corruption tricking TCP checksums ?"
Previous message: Martin, Christian: "RE: Packet drops"
Next in thread: Alex Bligh: "Re: [nsp] [long] total packet corruption tricking TCP checksums ?"
Reply: Alex Bligh: "Re: [nsp] [long] total packet corruption tricking TCP checksums ?"

I am having a *big* problem over here, and wonder if any reader of the
list has ever experienced something like this (I have a lot of other
anecdotal failures going back some 8 years, but never anything like
this):

(on another note: I wonder how I got silently unsubscribed from the list
sometime after Nov 5th, 1999)

Thanks,
bye,Kai

--
kai@conti.nu             "Just say No" to Spam            Kai Schlichting
Palo Alto, New York, You name it             Sophisticated Technical Peon
Kai's SpamShield <tm> is FREE!                 http://SpamShield.Conti.nu
|                                                                       |
LeasedLines-FrameRelay-IPLs-ISDN-PPP-Cisco-Consulting-VoiceFax-Data-Muxes
WorldWideWebAnything-Intranets-NetAdmin-UnixAdmin-Security-ReallyHardMath
-------
[...]
Customers started to complain this week that downloaded files are
sometimes corrupted - there was a lone complainer a week or two ago, though.
Basically, FTP/HTTP downloads from random sites get corrupted in
random ways: zip files are broken, their md5 checksums are different
every time, while their length is correct.
I tested this with like 5 sites today, and with several platforms and
machines as the download platform: 3 BSDI boxes, 4 different Win98
boxes, all show the same thing. 
I rebooted and power-cycled my C2926 switch, then removed/re-inserted
the FastEther card in my 7204 router, finally rebooted and power-cycled
the same.
While there is some variation in how many files in a sample get corrupted
(sometimes I get 4 out of 8 good ones @5MB, sometimes I get 8 completely
different copies of the same file), there seems to be no direct relation
between the corruption rate and the power-cycling of switch or router.
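A minimal sketch of how such a sample could be tallied, assuming the
repeated downloads of one file are already saved locally. The function
names are hypothetical, and treating the most common digest as the good
copy is an assumption that fails when every copy comes out different:

```python
import hashlib
from collections import Counter
from pathlib import Path


def md5_of(path, chunk_size=1 << 16):
    """Return the hex MD5 digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def summarize_copies(paths):
    """Summarize repeated downloads of the same file.

    Returns (most_common_digest, copies_with_that_digest, total_copies,
    set_of_file_sizes). Equal sizes with differing digests match the
    symptom described above: correct length, wrong content.
    """
    digests = [md5_of(p) for p in paths]
    sizes = {Path(p).stat().st_size for p in paths}
    digest, count = Counter(digests).most_common(1)[0]
    return digest, count, len(digests), sizes
```

Running summarize_copies() over the saved copies before and after each
power-cycle would give comparable numbers for the corruption rate.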
I also tried to limit CEF impact by shutting down one of our two UUnet
T1's (double-T, multiplexed via CEF) in turn - to no effect.
Just a trivial sample of random multi-MB files that I used:
http://www.tucows.syracuse.net/files5/ghvhopp.zip
http://www.newdealinc.com/eval/nde.zip (password for zip: new2u)
Random Netscape distributions from ftp://ftp.netscape.com
I tried downloading this on XXXXXXXXX's shell server, and I had
no problem doing so, multiple times. www.XXXXXXXXX.com is hanging
off the same UUnet router (on a neighboring HSSI card into UUnet's
FR switch) as we are - a really short path : when trying to http the
files from there, they got corrupted, too.
This leaves:
- UUnet's HSSI card corrupting things
- UUnet's FR switch corrupting things
- Our 7204 corrupting things on the serial input side
- Our 7204 corrupting things on the FastEther output side
- Our 2926 switch with VLAN trunking corrupting things
Given that IP and TCP checksums are merely 16-bit one's-complement sums
rather than CRCs, a mere "bit switch" (the same bit flipped in opposite
directions in ANY two 16-bit words of, say, the TCP data) can corrupt the
data while the checksum still appears correct. I never knew the checksum
was so simple - how come we don't see more widespread corruption of
files that got mangled in transport ? To my knowledge, FTP does no
checksumming of any kind on its own, nor does HTTP.
Does the Internet live on borrowed time ?
What could be causing these corruptions, and: have you seen something
like that before (I sure haven't) ?
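The checksum weakness described above can be sketched as follows. This
is an illustration only: it sums a made-up 8-byte payload, whereas the
real TCP checksum also covers the TCP header and a pseudo-header. Since
the sum is additive, the conclusion carries over. The helper names are
hypothetical:

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]   # big-endian 16-bit word
    while total >> 16:                       # end-around carry
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def flip_bit(data: bytes, word_index: int, bit: int) -> bytes:
    """Return a copy with one bit (0 = least significant) of one 16-bit word inverted."""
    buf = bytearray(data)
    byte_index = word_index * 2 + (0 if bit >= 8 else 1)
    buf[byte_index] ^= 1 << (bit % 8)
    return bytes(buf)


# Made-up payload: words 0x1234, 0x5678, 0x9ABC, 0xDEF0.
payload = bytes([0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0])

# Bit 3 is 0 in word 0 and 1 in word 1, so flipping it in both words
# adds 8 to one word and subtracts 8 from the other: the sum is unchanged.
cancelling = flip_bit(flip_bit(payload, 0, 3), 1, 3)

# Bit 0 is 0 in both words, so flipping it in both adds 2 to the sum.
same_direction = flip_bit(flip_bit(payload, 0, 0), 1, 0)

print(internet_checksum(payload) == internet_checksum(cancelling))      # True
print(internet_checksum(payload) == internet_checksum(same_direction))  # False
```

So the corruption slips through only when the two errors cancel in the
sum; a single flipped bit, or two flips in the same direction, would be
caught.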
The strangest thing I ever saw that came close was a Frame-Relay link
hanging off a V.35 card in a BSDI machine that just wouldn't transfer
one particular 304-byte red-ball.gif via HTTP in outbound direction :
you would expect transfers to hang due to TCP corruptions, but never
achieve a 100% non-hanging, 'bit reversal' type of corruption.
---
Next morning:
I have swapped the 7204 for a spare 2501 for some tests, and still
get these corruptions! That makes UUnet's router (7xxx with RSP4 and
HSSI card into a FrameRelay switch) and their frame switch suspects
in this bizarre failure. Notes:
- NO errors on the frame pvc or my 7204 interfaces
- UUnet is (universally) running CEF per-destination,
- I am running CEF per-packet load balancing.
- I now receive widespread customer complaints about corrupt files
   that were ftp'd, http'd, email MIME attachments, in particular .wav
   voice mail files from people using universal messaging.
- files get corrupted 'outbound' as well.
======================================================================
What follows is from a friend who looked into this problem for me as a
second opinion:
[...]
so then i thought "hmmm how about if i can diff these things".  to make a 
sample file i cat'd /usr/dict/words onto itself about a dozen times on my
machine (XXXXXXXX.net) and then used wget to grab the file on YYYYYY.  then
i grabbed the file 8 times sequentially (back to back in a script).
[YYYYYY] [/tmp] ~/md5sum words*        
d9e3cda3db13bd66418706fb2b0a571a  words1
d9e3cda3db13bd66418706fb2b0a571a  words2
40b970e8505147a88a004fa74b18689d  words3
d9e3cda3db13bd66418706fb2b0a571a  words4
d9e3cda3db13bd66418706fb2b0a571a  words5
21ba1e4e1beb66bd248af71da6355708  words6
610cdfacd3b06483a9bbe4341dcac162  words7
d9e3cda3db13bd66418706fb2b0a571a  words8
hmmmmmmmmmmmmmmm
[YYYYYY] [/tmp] diff words2 words3
282708,282709c282708,282709
< debris
< debt
---
> dearis
> debu
# b->a == 0110 0010 -> 0110 0001
# t->u == 0111 0100 -> 0111 0101
[YYYYYY] [/tmp] diff words5 words6
88829,88830c88829
< lessen
< lesson
---
> lessenzleson
# \n -> z == 0000 1010 -> 0111 1010
[YYYYYY] [/tmp] diff words5 words7
242878,242880c242878,242879
< parallel
< parallelepiped
< paralysis
---
> parallem
> paxall`leplpedparelysis
# e -> ` == 0110 0101 -> 0110 0000
# i -> l == 0110 1001 -> 0110 1100
just a few bits (not bytes) at a time are affected... i dont know if this 
means anything, since the sample set was so small...
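The bit-level comparison above can be sketched in a few lines. This is
only an illustration: it takes the first diff's "<" lines as the good
copy, assumes both damaged bytes travelled in the same TCP segment, and
uses hypothetical helper names:

```python
def byte_flips(good: bytes, bad: bytes):
    """List (offset, good_byte, bad_byte, xor_mask) for each differing byte."""
    if len(good) != len(bad):
        raise ValueError("copies must have the same length")
    return [(i, g, b, g ^ b)
            for i, (g, b) in enumerate(zip(good, bad)) if g != b]


def ones_complement_sum(data: bytes) -> int:
    """One's-complement sum of big-endian 16-bit words, as TCP adds them up."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # end-around carry
        total = (total & 0xFFFF) + (total >> 16)
    return total


good = b"debris\ndebt"   # "<" lines of the first diff
bad = b"dearis\ndebu"    # ">" lines of the first diff

for offset, g, b, mask in byte_flips(good, bad):
    print(f"offset {offset}: {chr(g)!r} -> {chr(b)!r}  "
          f"{g:08b} -> {b:08b}  xor {mask:08b}")

print("sums match:", ones_complement_sum(good) == ones_complement_sum(bad))
```

For this pair the two damaged bytes sit 8 bytes apart, so they fall in
the same half of their 16-bit words; one byte went down by 1 and the
other up by 1, and the one's-complement sums come out equal. That is
consistent with corruption that a TCP checksum would not catch.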


This archive was generated by hypermail 2b29 : Sun Aug 04 2002 - 04:12:09 EDT