[c-nsp] Prove it's not the network!
Justin Shore
justin at justinshore.com
Thu May 15 01:41:40 EDT 2008
Nathan wrote:
> Proceed by elimination. If there is someone else in the office (I
> suppose the T1 is not just for one person) whose Outlook is *not*
> slow, and especially if "someone else" can be extended to "everybody
> else" then the problem is not the network.
>
> Outlook can have severe speed/response problems when not kept healthy;
> most notably there's something called PST files that have to be kept
> at a reasonable size, or re-indexed or something, and people who like
> to keep all their mail tend to run into that.
Here's a long account of a similar battle over PSTs.
I fought a 'blame-the-network' battle at a customer's site a couple of
years ago. We built a brand-new greenfield GigE network in a new
building and helped the customer move into their new digs. Shortly
thereafter a certain group of users started complaining that their
computers were horribly slow, most especially Outlook. This reached
upper management before it came back down to us contractors so it was a
huge deal when it landed at our feet.
First thing we did was narrow down exactly who had the problem and who
didn't. 95% of the complaints were "me too!" complaints and weren't
legitimate. The remaining 5% were isolated to one group of users in one
specific area of the new building. The customer's IT staff working on
this problem with us immediately blamed us again because "it had to be
the network's fault because all the users are in the same physical
vicinity." I showed them graph after graph of the network I/O from the
Exchange servers through the core and down through the uplinks to
distribution. We ended up graphing every affected user's port. The
graphs did not help; we were still to blame.
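(Tangent for anyone who wants to do the same sort of sanity check without
a full graphing setup: below is a rough sketch of the idea, not the
tooling we actually used. It samples a switch port's 64-bit octet
counters twice over SNMP and prints the average rate in between. The
hostname, community string, and ifIndex are placeholders, and it assumes
net-snmp's snmpget is on the box.)

#!/usr/bin/env python3
# Sketch: poll a port's in/out octet counters twice and print the average
# bit rate over the interval.  Placeholder host/community/ifIndex below.
import subprocess
import time

HOST = "switch1.example.net"   # hypothetical access switch
COMMUNITY = "public"           # hypothetical read-only community
IFINDEX = "10101"              # hypothetical ifIndex of the user's port
INTERVAL = 30                  # seconds between samples

def get_counter(oid):
    """Fetch a Counter64 value with snmpget and return it as an int."""
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, HOST, oid], text=True)
    # Output looks like: IF-MIB::ifHCInOctets.10101 = Counter64: 123456789
    return int(out.split()[-1])

oid_in = "IF-MIB::ifHCInOctets." + IFINDEX
oid_out = "IF-MIB::ifHCOutOctets." + IFINDEX

in1, out1 = get_counter(oid_in), get_counter(oid_out)
time.sleep(INTERVAL)
in2, out2 = get_counter(oid_in), get_counter(oid_out)

# Counters count octets, so multiply the delta by 8 to get bits per second.
print("in:  %.2f Mbit/s" % ((in2 - in1) * 8 / INTERVAL / 1e6))
print("out: %.2f Mbit/s" % ((out2 - out1) * 8 / INTERVAL / 1e6))

If the numbers come back nowhere near line rate while the user is "slow",
that's one more data point that it isn't the wire.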
Finally one day I sat down with the squeakiest user and had her show me
exactly what was slow and the steps she took to make that happen from
minute 1 of her walking into her office. I had her shut down and start
from a cold boot. She commented that the login process was faster than
normal and asked what I'd done to fix it (grrr). She fired up Outlook
and I noticed that it was very slow. She said that it was faster than
normal. Finally Outlook came up and she started scrolling through her
email. She selected a message and waited 10 seconds or so for the
message to come up. Then she'd try to save the attachment to the
desktop and it would take 4-5 minutes (for a 20MB attachment). She
continued with her daily routine and started scrolling down through her
Outlook folders. I stopped her when I saw "Inbox, Sent, Drafts, etc"
scroll by more than once. This was the sign I was looking for. I took
the wheel at this point and started counting. She had 8 (count them,
EIGHT) sets of default Outlook folders because she had 8 PSTs mounted in
Outlook. She explained that she hits the Exchange PST hard limit of 2GB
every 8-10 months. The company's IT folks would export everything to a
new PST to give her a fresh inbox. Then they'd mount it in Outlook so
she could have access to it (it was tax stuff so Legal wouldn't let her
delete anything, literally). I started hunting for the PSTs and found
them on an old file server, one that we had no idea was related to the
mail system. She was mounting 8 roughly 2GB PSTs across the network to
Outlook on a PC running XP w/ 128MB of RAM. Wonderful.
But it gets better. I noticed that her inbox wasn't on the Exchange
server but was instead in a PST on that same file server: her email was
set to deliver to a PST, not to Exchange directly. The way Exchange
works in that situation, mail is held on the server for PST users until
they bring Outlook online; Outlook then downloads the queued-up mail and
stuffs it into the PST. Since her PST lived on the file server, the
client had to manipulate the PST on the server across the network.
Oh, but it gets better still. A few days later one of the sysadmins was
looking at the newly discovered file server that was apparently critical
to the function of the mail system. From across the room we heard loud
profanity and ran over to see what had happened. He'd discovered that
the idiot IT staff had set up Windows to compress the non-RAIDed drive
that held all the user PSTs and home directories because they had run
low on drive space about a year earlier. Before a user's Outlook client
could modify a PST, the server had to decompress the entire PST, write
the changes for the client, recompress it, and write it back to disk.
The server was a low-end MS box with 256MB of RAM, no RAID, and a backup
that usually failed. Oh, and that same sysadmin discovered shortly
thereafter that all of the users created in the past year and a half had
been set to deliver to PST because of, you guessed it, another drive
space issue. Isn't that nice.
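(Side note: if you ever have to prove that sort of thing to someone,
here's a rough sketch of how you could flag NTFS-compressed PSTs on a
file server with Python -- not something we actually ran at the time.
The share path is made up, and st_file_attributes is only populated on
Windows.)

#!/usr/bin/env python3
# Sketch: walk a (hypothetical) home-directory share and report any PSTs
# that carry the NTFS "compressed" attribute, along with their size.
import os
import stat

PST_ROOT = r"\\fileserver\users"   # hypothetical path to the home dirs

for dirpath, dirnames, filenames in os.walk(PST_ROOT):
    for name in filenames:
        if not name.lower().endswith(".pst"):
            continue
        path = os.path.join(dirpath, name)
        attrs = os.stat(path).st_file_attributes   # Windows-only field
        if attrs & stat.FILE_ATTRIBUTE_COMPRESSED:
            size_gb = os.path.getsize(path) / 2**30
            print("COMPRESSED  %.1f GB  %s" % (size_gb, path))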
All the users that reported this problem turned out to be users that
handled tax data and couldn't delete any email. That's why that group
of users all experienced the problem. Every single one of these users
was mounting two to eight 2GB PSTs across the network. Those who shut
down at
night would come in at 8am and fire up their computers. A couple dozen
different users would all try to pull down their PSTs from the
compressed file system of the poor server. So it wasn't the network's
fault. The network was running like a champ. The POS server put into
mission-critical service by incompetent IT staff was to blame. We spent
weeks troubleshooting the problem and trying to convince management that
the network was fine. In the end I had to sit down with a user, watch
everything that they did and then analyze their steps to figure out what
was causing the problem. Oh, and the reason it was faster the day I
worked with her was that we did this mid-morning, not at 8am. Did
anyone ever apologize (even figuratively) to the network folks? Nope.
Of course not.
As a network engineer I've found that the vast majority of my job is
helping other people find their problems. The network seldom breaks and
when it does it's not subtle; it's catastrophic. Even highly skilled
technical people still blame the network when their stuff doesn't work
right (after all, my network is just a bunch of tubes, right?).
Networking is like mysterious dark magic that no one seems to
understand. It's the gremlins on the wire that cause Windows to crash,
not poor programming and a lack of QA. Networking is simply not
understood by most people and it's human nature to fear and loathe what
they don't understand. To be able to do my job effectively I have to
know my shit and everyone else's well enough to know how something works
when it inevitably breaks. Had I not come into networking with a
systems background and were I not a quick study under fire I would not
be good at what I do. Did something "suddenly" break that must have
been caused by the network maintenance I did last week? No, it's the
fact that it never worked to begin with and you never actually tested it
when you deployed it a year ago. It wasn't until a user tested it for
you that you became aware that it wasn't working. It just
happened to come a week after I did maintenance on an unrelated device
on an unrelated network. But I'm going to spend all morning sniffing
and decoding traffic to help you realize that this device off to the
side over here couldn't possibly be involved. *sigh* Story of my life.
</OT RANT>
Justin