[outages] Lessons Learned: RRTB outage

Jay Ashworth jra at baylink.com
Fri Sep 23 19:17:17 EDT 2011


So I had to renumber some servers this afternoon, 'cause I was expanding to 
a larger netblock (a /28 instead of a /29).
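
For scale, here is a minimal sketch of what that move buys in address
space. The 192.0.2.0 prefixes are documentation addresses picked purely for
illustration (the real netblocks aren't given in this post), and
usable_hosts is a hypothetical helper name.

```python
import ipaddress

def usable_hosts(network: str) -> int:
    """Count assignable host addresses in an IPv4 block.

    Subtracts the network and broadcast addresses, which is the
    conventional accounting for blocks of this size.
    """
    net = ipaddress.ip_network(network)
    return net.num_addresses - 2

# Hypothetical example blocks, not the ones from the actual renumbering.
old_block = "192.0.2.0/29"
new_block = "192.0.2.0/28"

for block in (old_block, new_block):
    net = ipaddress.ip_network(block)
    print(block, "total:", net.num_addresses, "usable:", usable_hosts(block))
```

Keep in mind that one of those usable addresses normally goes to the
carrier's router, which is the default gateway the servers need to point at.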

I renumbered my servers and my DNS (which I'd set the TTL on to 300 like a
good boy on Wednesday), and then pulled the trigger with Road Runner.  He
"rescripted" his SMC router (the likely cause of some standard deviation noted
by a couple of reporters -- the router, not the rescripting), and I pinged
it and it was ok, and I mtr'd it and it was ok, so I hit the webserver,
and that came up fine, too.

So then my boss calls me 15 minutes later: it's not working.

"I wonder what that could be", sez I; I'd even traced and hit the webserver
from my Android phone (Sprint; Opera Mobile 11), and it had worked fine.

That was Red Herring #1.

So my boss uses a Mac.  So does my best friend, and while he was on the way
out the door to a second-anniversary-wake for a guy we went to school with,
he took a moment to try to hit it as well.  No luck.

That was Red Herring #2 (both of them use Macs).

Those of you who've been paying close, careful attention here may have
noticed by now the thing I did *not* say I'd done: 

Changing the default gateway on the server.

My office lan could hit it *because its uplink was in the same network*;
*it* had a route for that network.  Everyone else... couldn't.
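
The failure mode can be sketched as a simplified next-hop decision for the
server's reply packets. This is an illustration under stated assumptions,
not how any particular kernel implements routing: reply_path is a
hypothetical helper, and every address below is a documentation address
standing in for the real ones.

```python
import ipaddress
from typing import Optional

def reply_path(server_net: str, gateway: Optional[str], client: str) -> str:
    """Decide how a server would send a reply back to a client.

    Simplified model: destinations inside the server's own network are
    delivered directly; everything else needs a default gateway that is
    itself reachable on the local network.
    """
    net = ipaddress.ip_network(server_net)
    if ipaddress.ip_address(client) in net:
        return "on-link"
    if gateway is not None and ipaddress.ip_address(gateway) in net:
        return "via " + gateway
    return "unreachable"

NEW_NET = "192.0.2.0/28"      # assumed stand-in for the new netblock
STALE_GW = "198.51.100.1"     # assumed stand-in for the old, unchanged gateway
FIXED_GW = "192.0.2.1"        # assumed stand-in for the correct new gateway

# A client whose uplink sits in the same netblock still gets replies.
print(reply_path(NEW_NET, STALE_GW, "192.0.2.5"))
# A client anywhere else on the Internet does not.
print(reply_path(NEW_NET, STALE_GW, "203.0.113.9"))
# With the gateway corrected, the same client is reachable again.
print(reply_path(NEW_NET, FIXED_GW, "203.0.113.9"))
```

Under this model, a test from inside the same netblock says nothing about
whether the default gateway is right, because it never uses it.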

Apparently, Sprint operates a caching proxy, even if you're using the 
version of Opera (Mobile, not Mini) that does *not* go through one of its
own, which explains Red Herring #1.

As for Red Herring #2, well... Macs don't, apparently, hard-cache IPs the
way WinXP does (I'm looking at *you*, "ipconfig /flushdns"), but I already
knew that, because boss had the right address.

Lesson Learned: Make sure you know what your diagnostic tests are telling 
you, before you use them to rule out possible problems.  Better yet: don't
rule those potential problems out at all: work your whole diagnostic tree
every time.

Oh: I forgot Red Herring #3: the traces that broke *didn't hit that carrier
edge router* for some reason.  No clue why.

Thanks to the dozen or so people who responded; a couple of whom have
way too {much time,many servers} on their hands.  :-)

Followups to -discuss

Cheers,
-- jra
-- 
Jay R. Ashworth                  Baylink                       jra at baylink.com
Designer                     The Things I Think                       RFC 2100
Ashworth & Associates     http://baylink.pitas.com         2000 Land Rover DII
St Petersburg FL USA      http://photo.imageinc.us             +1 727 647 1274


