[outages] Lessons Learned: RRTB outage
Jay Ashworth
jra at baylink.com
Fri Sep 23 19:17:17 EDT 2011
So I had to renumber some servers this afternoon, because I was expanding to
a larger netblock (a /28 instead of a /29).
I renumbered my servers and my DNS (which I'd set the TTL on to 300 like a
good boy on Wednesday), and then pulled the trigger with Road Runner. He
"rescripted" his SMC router (the likely cause of some standard deviation noted
by a couple of reporters -- the router, not the rescripting), and I pinged
it and it was ok, and I mtr'd it and it was ok, so I hit the webserver,
and that came up fine, too.
So then my boss calls me 15 minutes later: it's not working.
"I wonder what that could be", sez I; I'd even traced and hit the webserver
from my Android phone (Sprint; Opera Mobile 11), and it had worked fine.
That was Red Herring #1.
So my boss uses a Mac. So does my best friend, and while he was on the way
out the door to a second-anniversary-wake for a guy we went to school with,
he took a moment to try to hit it as well. No luck.
That was Red Herring #2 (both of them use Macs).
Those of you who've been paying close, careful attention here may have
noticed by now the thing I did *not* say I'd done:
Changing the default gateway on the server.
My office lan could hit it *because its uplink was in the same network*;
*it* had a route for that network. Everyone else... couldn't.
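For anyone who wants to see that failure mode spelled out, here is a minimal sketch using Python's standard ipaddress module. Every address in it is hypothetical (taken from the RFC 5737 documentation ranges, not my real netblocks), the function names are made up for illustration, and it assumes the stale gateway sat outside the new block. It only shows which peers the server can answer without a working default gateway.

```python
import ipaddress


def gateway_is_on_link(interface_cidr: str, gateway: str) -> bool:
    """True if the configured gateway lies inside the interface's own network."""
    iface = ipaddress.ip_interface(interface_cidr)
    return ipaddress.ip_address(gateway) in iface.network


def needs_gateway(interface_cidr: str, peer: str) -> bool:
    """True if replies to this peer must go via the default gateway."""
    iface = ipaddress.ip_interface(interface_cidr)
    return ipaddress.ip_address(peer) not in iface.network


# Hypothetical values: server renumbered into a new /28, gateway left at the old one.
SERVER = "198.51.100.18/28"      # assumed new server address and prefix
STALE_GATEWAY = "192.0.2.9"      # assumed old gateway, never updated
OFFICE_UPLINK = "198.51.100.17"  # assumed office uplink inside the new /28
REMOTE_CLIENT = "203.0.113.5"    # assumed client somewhere else on the Internet

if __name__ == "__main__":
    print("gateway on-link:", gateway_is_on_link(SERVER, STALE_GATEWAY))
    print("office needs gateway:", needs_gateway(SERVER, OFFICE_UPLINK))
    print("remote needs gateway:", needs_gateway(SERVER, REMOTE_CLIENT))
```

Under those assumptions the office uplink is answered directly, while replies to everyone else depend on a gateway the server can no longer reach. That is why a test from inside the same network proves nothing about the rest of the world.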
Apparently, Sprint operates a caching server, even if you're using the
version of Opera (Mobile, not Mini) that does *not*, which explains Red
Herring #1.
As for Red Herring #2, well... Macs don't, apparently, hard-cache IPs the
way WinXP does (I'm looking at *you*, "ipconfig /flushdns"), but I already
knew that, because boss had the right address.
Lesson Learned: Make sure you know what your diagnostic tests are telling
you, before you use them to rule out possible problems. Better yet: don't
rule those potential problems out at all: work your whole diagnostic tree
every time.
Oh: I forgot Red Herring #3: the traces that broke *didn't hit that carrier
edge router* for some reason. No clue why.
Thanks to the dozen or so people who responded, a couple of whom have
way too {much time,many servers} on their hands. :-)
Followups to -discuss
Cheers,
-- jra
--
Jay R. Ashworth Baylink jra at baylink.com
Designer The Things I Think RFC 2100
Ashworth & Associates http://baylink.pitas.com 2000 Land Rover DII
St Petersburg FL USA http://photo.imageinc.us +1 727 647 1274