[c-nsp] Prove it's not the network!

Justin Shore justin at justinshore.com
Thu May 15 01:41:40 EDT 2008


Nathan wrote:

> Proceed by elimination. If there is someone else in the office (I
> suppose the T1 is not just for one person) whose Outlook is *not*
> slow, and especially if "someone else" can be extended to "everybody
> else" then the problem is not the network.
> 
> Outlook can have severe speed/response problems when not kept healthy;
> most notably there's something called PST files that have to be kept
> at a reasonable size, or re-indexed or something, and people who like
> to keep all their mail tend to run into that.

Here's a long account of a similar battle over PSTs that I fought.

I fought a 'blame-the-network' battle at a customer's site a couple 
years ago.  We built a brand-new GigE greenfield network in a new 
building and helped the customer move into their new digs.  Shortly 
thereafter a certain group of users started complaining that their 
computers were horribly slow, most especially Outlook.  This reached 
upper management before it came back down to us contractors so it was a 
huge deal when it landed at our feet.

First thing we did was narrow down exactly who had the problem and who 
didn't.  95% of the complaints were "me too!" complaints and weren't 
legitimate.  The remaining 5% were isolated to one group of users in one 
specific area of the new building.  Their IT staff that was working on 
this problem with us immediately blamed us again because "it had to be 
the network's fault because all the users are in the same physical 
vicinity".  I showed them graph after graph of the network I/O from the 
Exchange servers through the core and down through the uplinks to 
distribution.  In the end we graphed every affected user's 
port.  The graphs did not help; we were still to blame.
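
For anyone who wants to check the port math behind graphs like these, here 
is a minimal Python sketch.  It assumes you already have two samples of an 
interface octet counter (for example a 64-bit SNMP counter) taken a known 
number of seconds apart.  The function name and the sample numbers are 
made up for illustration; they are not from the site described here.

```python
def utilization_percent(octets_start, octets_end, interval_s, link_bps,
                        counter_bits=64):
    """Average link utilization (percent) between two octet-counter samples.

    Hypothetical helper: the caller supplies the two counter readings,
    the seconds between them, and the link speed in bits per second.
    """
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval_s and link_bps must be positive")
    modulus = 2 ** counter_bits
    # The modulo copes with a single counter wrap between the two samples.
    delta_octets = (octets_end - octets_start) % modulus
    return delta_octets * 8 / (interval_s * link_bps) * 100.0


# Illustrative numbers only: 375,000,000 octets in 5 minutes on a GigE port
# works out to roughly 1% average utilization.
print(utilization_percent(0, 375_000_000, 300, 1_000_000_000))
```

A port that averages a percent or two while its user waits minutes on a 
file save is a strong hint that the wire is not the bottleneck.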

Finally one day I sat down with the squeakiest user and had her show me 
exactly what was slow and the steps she took to make that happen from 
minute 1 of her walking into her office.  I had her shut down and start 
from a cold boot.  She commented that the login process was faster than 
normal and asked what I'd done to fix it (grrr).  She fired up Outlook 
and I noticed that it was very slow.  She said that it was faster than 
normal.  Finally Outlook came up and she started scrolling through her 
email.  She selected a message and waited 10 seconds or so for the 
message to come up.  Then she'd try to save the attachment to the 
desktop and it would take 4-5 minutes (for a 20MB attachment).  She 
continued on with her daily routine and started scrolling down through her 
Outlook folders.  I stopped her when I saw "Inbox, Sent, Drafts, etc" 
scroll by more than once.  This was the sign I was looking for.  I took 
the wheel at this point and started counting.  She had 8 (count them, 
EIGHT) sets of default Outlook folders because she had 8 PSTs mounted in 
Outlook.  She explained that she hits the Exchange PST hard limit of 2GB 
every 8-10 months.  The company's IT folks would export everything to a 
new PST to give her a fresh inbox.  Then they'd mount it in Outlook so 
she could have access to it (it was tax stuff so Legal wouldn't let her 
delete anything, literally).  I started hunting for the PSTs and found 
them on an old file server, one that we had no idea was related to the 
mail system.  She was mounting 8 roughly 2GB PSTs across the network to 
Outlook on a PC running XP w/ 128MB of RAM.  Wonderful.
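
To put rough numbers on what she was seeing, here is a small Python sketch 
of the arithmetic.  The inputs come from the account above (a 20MB 
attachment, 4-5 minutes to save, eight roughly 2GB PSTs, 128MB of RAM); the 
function names are hypothetical, MB is taken as 10^6 bytes for the 
throughput figure, and 1GB is treated as 1024MB for the RAM comparison.

```python
def effective_throughput_mbps(size_mb, seconds):
    """Effective rate in megabits per second for a transfer of size_mb."""
    return size_mb * 8 / seconds


def ideal_seconds(size_mb, link_mbps):
    """Time the same transfer would take at full line rate, no overhead."""
    return size_mb * 8 / link_mbps


slow = effective_throughput_mbps(20, 5 * 60)   # 5-minute save
fast = effective_throughput_mbps(20, 4 * 60)   # 4-minute save
print(f"observed: {slow:.2f} to {fast:.2f} Mb/s")
print(f"GigE line rate would need about {ideal_seconds(20, 1000):.2f} s")

# Mounted PST data versus the PC's memory, both in MB.
pst_total_mb = 8 * 2 * 1024
ram_mb = 128
print(f"mounted PSTs are {pst_total_mb // ram_mb}x the PC's RAM")
```

A save that runs at well under 1 Mb/s on a gigabit port is three orders of 
magnitude off line rate, which points at the endpoints rather than the 
path between them.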

But it gets better.  I noticed that her inbox wasn't on the server but 
was instead in a PST on the same file server and her email was set to 
deliver to PST, not Exchange directly.  The way Exchange works in this 
situation, email is held on the server for PST users until they bring 
their Outlook online.  OL then downloads the queued-up email and stuffs 
it into the PST.  Well, that PST was stored on the file server, so the 
client had to manipulate it across the network.

Oh, but it gets better still.  A few days later one of the sys admins was 
looking at the newly discovered file server that was apparently critical to 
the function of the mail server.  From across the room we hear loud 
profanity and run over to see what happened.  He discovered that the 
idiot IT staff set up Windows to compress the non-RAIDed drive that 
contains all the user PSTs and home directories because they ran low on 
drive space about a year earlier.  Before a user's OL client can modify 
the PST the server has to decompress the entire PST, then write the 
changes for the client, and recompress the PST and then write it back to 
disk.  The server was a low-end MS box with 256MB of RAM with no RAID 
and a backup that usually failed.  Oh, and that sys admin also 
discovered shortly thereafter that all of the users created in the past 
year and a half were set to deliver to PST because of, you guessed it, 
another drive space issue.  Isn't that nice.

All the users that reported this problem turned out to be users that 
handled tax data and couldn't delete any email.  That's why that group 
of users all experienced the problem.  Every single one of these users 
was mounting 2-8 2GB PSTs across the network.  Those that shut down at 
night would come in at 8am and fire up their computers.  A couple dozen 
different users would all try to pull down their PSTs from the 
compressed file system of the poor server.  So it wasn't the network's 
fault.  The network was running like a champ.  The POS server put into 
mission critical service by incompetent IT staff was to blame.  We spent 
weeks troubleshooting the problem and trying to convince management that 
the network was fine.  In the end I had to sit down with a user, watch 
everything that they did and then analyze their steps to figure out what 
was causing the problem.  Oh, and the reason it was faster the day I 
worked with her was because we did this mid-morning, not at 8am.  Did 
anyone ever apologize (even figuratively) to the network folks?  Nope. 
Of course not.
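
For a sense of scale on that 8am rush, here is a rough Python sketch of the 
PST volume involved.  It assumes "a couple dozen" means 24 users and uses 
the 2-8 PSTs of 2GB each mentioned above.  It is an upper bound on the data 
sitting behind those mounts, not a measurement: Outlook would not 
necessarily read every file in full.

```python
def morning_pst_volume_gb(users, psts_per_user, pst_size_gb=2):
    """Total PST data (GB) mounted by a group of users.

    Hypothetical helper; 24 users below is an assumption standing in
    for "a couple dozen".
    """
    return users * psts_per_user * pst_size_gb


low = morning_pst_volume_gb(24, 2)    # everyone at the low end
high = morning_pst_volume_gb(24, 8)   # everyone at the high end
print(f"between {low} and {high} GB of PSTs behind one file server")
```

Even the low end is hundreds of times the 256MB of RAM in that server, all 
of it on one compressed, non-RAIDed drive.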


As a network engineer I've found that the vast majority of my job is 
helping other people find their problems.  The network seldom breaks and 
when it does it's not subtle; it's catastrophic.  Even highly skilled 
technical people still blame the network when their stuff doesn't work 
right (after all my network is just a bunch of tubes, right?). 
Networking is like mysterious dark magic that no one seems to 
understand.  It's the gremlins on the wire that cause Windows to crash, 
not poor programming and a lack of QA.  Networking is simply not 
understood by most people and it's human nature to fear and loathe what 
they don't understand.  To be able to do my job effectively I have to 
know my shit and everyone else's well enough to know how something works 
when it inevitably breaks.  Had I not come into networking with a 
systems background and were I not a quick study under fire I would not 
be good at what I do.  Did something "suddenly" break that must have 
been caused by the network maintenance I did last week?  No, it's the 
fact that it never worked to begin with and you never actually tested it 
when you deployed it a year ago.  It wasn't until a user tested it for 
you that you became aware of the fact that it wasn't working.  It just 
happened to come a week after I did maintenance on an unrelated device 
on an unrelated network.  But I'm going to spend all morning sniffing 
and decoding traffic to help you realize that this device off to the 
side over here couldn't possibly be involved.  *sigh*  Story of my life.

</OT RANT>

Justin

