[c-nsp] Prove it's not the network!

Jeff Fitzwater jfitz at Princeton.EDU
Thu May 15 08:55:43 EDT 2008


I sure hope Justin lets us know what the problem really was, after all
this.




Jeff Fitzwater
OIT Network Systems
Princeton University

On May 15, 2008, at 3:56 AM, Whisper wrote:

> Justin, I have always been under the impression that a Network Engineer's
> primary role was going around constantly proving that the Network is not
> the problem. :)
>
> Your rant, I suspect, is more or less repeated on a daily basis by Network
> Engineers all around the world.
>
> On Thu, May 15, 2008 at 3:41 PM, Justin Shore <justin at justinshore.com>
> wrote:
>
>> Nathan wrote:
>>
>>> Proceed by elimination. If there is someone else in the office (I
>>> suppose the T1 is not just for one person) whose Outlook is *not*
>>> slow, and especially if "someone else" can be extended to "everybody
>>> else" then the problem is not the network.
>>>
>>> Outlook can have severe speed/response problems when not kept  
>>> healthy;
>>> most notably there's something called PST files that have to be kept
>>> at a reasonable size, or re-indexed or something, and people who  
>>> like
>>> to keep all their mail tend to run into that.
>>
>> Here's a long account of a similar battle over PSTs that I fought.
>>
>> I fought a 'blame-the-network' battle at a customer's site a couple
>> years ago.  We built a brand-new GigE greenfield network in a new
>> building and helped the customer move into their new digs.  Shortly
>> thereafter a certain group of users started complaining that their
>> computers were horribly slow, most especially Outlook.  This reached
>> upper management before it came back down to us contractors so it  
>> was a
>> huge deal when it landed at our feet.
>>
>> First thing we did was narrow down exactly who had the problem and  
>> who
>> didn't.  95% of the complaints were "me too!" complaints and weren't
>> legitimate.  The remaining 5% were isolated to one group of users  
>> in one
>> specific area of the new building.  Their IT staff that was working  
>> on
>> this problem with us immediately blamed us again because "it had to  
>> be
>> the network's fault because all the users are in the same physical
>> vicinity".  I showed them graph after graph of the network I/O from  
>> the
>> Exchange servers through the core and down through the uplinks to
>> distribution.  In the end we ended up graphing every affected user's
>> port.  The graphs did not help; we were still to blame.
>>
>> Finally one day I sat down with the squeakiest user and had her  
>> show me
>> exactly what was slow and the steps she took to make that happen from
>> minute 1 of her walking into her office.  I had her shut down and  
>> start
>> from a cold boot.  She commented that the login process was faster  
>> than
>> normal and asked what I'd done to fix it (grrr).  She fired up  
>> Outlook
>> and I noticed that it was very slow.  She said that it was faster  
>> than
>> normal.  Finally Outlook came up and she started scrolling through  
>> her
>> email.  She selected a message and waited 10 seconds or so for the
>> message to come up.  Then she'd try to save the attachment to the
>> desktop and it would take 4-5 minutes (for a 20MB attachment).  She
>> continued on with her daily routine and started scrolling down through
>> her Outlook folders.  I stopped her when I saw "Inbox, Sent, Drafts, etc"
>> scroll by more than once.  This was the sign I was looking for.  I  
>> took
>> the wheel at this point and started counting.  She had 8 (count them,
>> EIGHT) sets of default Outlook folders because she had 8 PSTs  
>> mounted in
>> Outlook.  She explained that she hits the PST hard limit of 2GB
>> every 8-10 months.  The company's IT folks would export everything  
>> to a
>> new PST to give her a fresh inbox.  Then they'd mount it in Outlook  
>> so
>> she could have access to it (it was tax stuff so Legal wouldn't let  
>> her
>> delete anything, literally).  I started hunting for the PSTs and  
>> found
>> them on an old file server, one that we had no idea was related to  
>> the
>> mail system.  She was mounting 8 roughly 2GB PSTs across the  
>> network to
>> Outlook on a PC running XP w/ 128MB of RAM.  Wonderful.
>>
>> But it gets better.  I noticed that her inbox wasn't on the server  
>> but
>> was instead in a PST on the same file server and her email was set to
>> deliver to PST, not Exchange directly.  The way Exchange works in this
>> situation, email is held on the server for PST users until they
>> bring their Outlook online.  OL then downloads the queued up email  
>> and
>> stuffs it into the PST.  Well, the PST was stored on the server so  
>> the
>> client would have to manipulate the PST on the server.
>>
>> Oh, but it gets better still.  A few days later one of the sys admins was
>> looking at the newly discovered file server that was apparently critical
>> to the function of the mail server.  From across the room we heard loud
>> profanity and ran over to see what happened.  He discovered that the
>> idiot IT staff set up Windows to compress the non-RAIDed drive that
>> contains all the user PSTs and home directories because they ran  
>> low on
>> drive space about a year earlier.  Before a user's OL client can  
>> modify
>> the PST the server has to decompress the entire PST, then write the
>> changes for the client, and recompress the PST and then write it  
>> back to
>> disk.  The server was a low-end MS box with 256MB of RAM with no RAID
>> and a backup that usually failed.  Oh, and that sys admin also
>> discovered shortly thereafter that all of the users created in the  
>> past
>> year and a half were set to deliver to PST because of, you guessed  
>> it,
>> another drive space issue.  Isn't that nice.
>>
>> All the users that reported this problem turned out to be users that
>> handled tax data and couldn't delete any email.  That's why that  
>> group
>> of users all experienced the problem.  Every single one of these users
>> was mounting 2-8 2GB PSTs across the network.  Those that shut down at
>> night would come in at 8am and fire up their computers.  A couple dozen
>> different users would all try to pull down their PSTs from the
>> compressed file system of the poor server.  So it wasn't the  
>> network's
>> fault.  The network was running like a champ.  The POS server put  
>> into
>> mission critical service by incompetent IT staff was to blame.  We  
>> spent
>> weeks troubleshooting the problem and trying to convince management  
>> that
>> the network was fine.  In the end I had to sit down with a user,  
>> watch
>> everything that they did and then analyze their steps to figure out  
>> what
>> was causing the problem.  Oh, and the reason it was faster the day I
>> worked with her was because we did this mid-morning, not at 8am.  Did
>> anyone ever apologize (even figuratively) to the network folks?   
>> Nope.
>> Of course not.
>>
>>
>> As a network engineer I've found that the vast majority of my job is
>> helping other people find their problems.  The network seldom  
>> breaks and
>> when it does it's not subtle; it's catastrophic.  Even highly skilled
>> technical people still blame the network when their stuff doesn't  
>> work
>> right (after all my network is just a bunch of tubes, right?).
>> Networking is like mysterious dark magic that no one seems to
>> understand.  It's the gremlins on the wire that cause Windows to crash,
>> not poor programming and a lack of QA.  Networking is simply not
>> understood by most people and it's human nature to fear and loathe  
>> what
>> they don't understand.  To be able to do my job effectively I have to
>> know my shit and everyone else's well enough to know how something works
>> when it inevitably breaks.  Had I not come into networking with a
>> systems background and were I not a quick study under fire I would  
>> not
>> be good at what I do.  Did something "suddenly" break that must have
>> been caused by the network maintenance I did last week?  No, it's the
>> fact that it never worked to begin with and you never actually  
>> tested it
>> when you deployed it a year ago.  It wasn't until a user tested it  
>> for
>> you that you became aware of the fact that it wasn't working.  It  
>> just
>> happened to come a week after I did maintenance on an unrelated  
>> device
>> on an unrelated network.  But I'm going to spend all morning sniffing
>> and decoding traffic to help you realize that this device off to the
>> side over here couldn't possibly be involved.  *sigh*  Story of my  
>> life.
>>
>> </OT RANT>
>>
>> Justin
>> _______________________________________________
>> cisco-nsp mailing list  cisco-nsp at puck.nether.net
>> https://puck.nether.net/mailman/listinfo/cisco-nsp
>> archive at http://puck.nether.net/pipermail/cisco-nsp/
>>