[Outages-discussion] comcast dns outage?

Jeremy Chadwick outages at jdc.parodius.com
Mon Dec 6 02:40:32 EST 2010


On Sun, Dec 05, 2010 at 11:16:14PM -0800, Avleen Vig wrote:
> On Sun, Dec 5, 2010 at 6:25 PM, Jeremy Chadwick
> <outages at jdc.parodius.com> wrote:
> > Comcast confirms the issue:
> >
> > http://www.dslreports.com/forum/r25163231-
> >
> > The above post also confirms their "status" page is not a real-time
> > monitor, which means it shouldn't be relied upon.  Don't they have a
> > (capable) NOC?  Amazing.
> 
> It's hard to suggest that their NOC isn't competent over something like this.
> Whose responsibility is communication to the customer?
> Most NOC's are only responsible for acting as middle-men to alert
> engineers when a problem appears, or to solve basic issues.
> 
> Let's leave the pejorative remarks at the door, please.

Moving to outages-discussion as per mailing list guidelines.

My point: what Comcast touts as a real-time monitor for system status
(and what people assume it is; this thread is proof) isn't one.  And that's OK --
it means Comcast's status page is updated manually when there's a known
problem.  The most appropriate group for this is the NOC.  Proper NOCs
have procedures in place to handle this situation, combined with proper
monitoring (meaning, part of the procedure is to update the page).  DNS
issues are detected via monitoring, escalations begin, and alongside the
escalations customers are notified and/or said status page is updated to
reflect reality.
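
The detect-then-update procedure above can be sketched in a few lines of
Python.  This is only an illustration under stated assumptions: the
resolver names, the `probe` callable, and the green/yellow/red labels
are all hypothetical, and nothing here reflects how Comcast actually
monitors its DNS servers.

```python
from typing import Callable, Iterable


def status_from_probes(resolvers: Iterable[str],
                       probe: Callable[[str], bool]) -> str:
    """Map per-resolver probe results to a status-page colour.

    `probe` is a hypothetical callable that returns True when the named
    resolver answered a test query and False when it did not.
    """
    results = [bool(probe(name)) for name in resolvers]
    if not results:
        return "unknown"   # nothing was probed, so claim nothing
    failures = results.count(False)
    if failures == 0:
        return "green"     # every resolver answered
    if failures == len(results):
        return "red"       # total outage
    return "yellow"        # partial outage
```

In this sketch the status page would be rewritten with the returned
value on every monitoring cycle, so "all green" can only appear when
every probe actually succeeded.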

Here's a comparison: are you familiar with the "Service Status" page
associated with Windows Live Messenger (what was known as Windows
Messenger or MSN Messenger for quite some time)?  That page is also
manually updated.  Do you know who's responsible for toggling the
status?  The Hotmail NOC (now MSN SOC).

Comcast's NOC, however, does not communicate with customers, and that
includes business-tier customers (please see the DSLR forum thread I
provided for evidence of this).  Communication from Comcast to its
customers is "best-effort".  Post-mortems are more common, but usually
only occur if the media/press gets wind of an
outage.  This is the 2nd or 3rd DNS-related outage Comcast has had in
the past few months.  Keep that in mind.

All that said, as someone who has spent the past 8 years of his life
working in multi-tiered NOCs as a senior engineer (given my SA/NA
career), I'm more than justified in making such remarks.  That sounds
highly narcissistic and egocentric, but it applies nonetheless.  I
can pass judgement because my job requires me to be familiar with the
procedures, processes, and necessities associated with customer-facing
operations groups.

Bottom line: companies need to make their "service status" pages
automated (this is possible; arguing it isn't is preposterous -- heck,
build hooks into a ticketing system if monitoring-based updates aren't
feasible), or accept the responsibility of what a public "service
status" page represents (read: promptly updating the page when there are
known problems.  Promptly means within 15 minutes tops; I imagine
Comcast business customers have SLAs, but non-business customers don't).
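
As a rough sketch of the ticketing-system fallback, the check below
flags an incident whose status-page update is overdue.  The 15-minute
window comes from the paragraph above; the function name, its
parameters, and the idea that a ticket carries an "opened" timestamp are
assumptions made purely for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

# "Promptly means within 15 minutes tops."
PROMPT_WINDOW = timedelta(minutes=15)


def update_is_overdue(incident_opened: datetime,
                      page_updated: Optional[datetime],
                      now: datetime) -> bool:
    """Return True if the status page still lags a known incident.

    `incident_opened` is when the (hypothetical) ticket was filed and
    `page_updated` is the last time the status page was changed, or None
    if it has never been changed.
    """
    if page_updated is not None and page_updated >= incident_opened:
        return False  # page was touched after the incident began
    return now - incident_opened > PROMPT_WINDOW
```

A hook like this could run whenever a ticket is opened or on a timer,
and page whoever owns the status page once it returns True.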

When your customers have absolutely no idea what's going on/what's
broken, and CSRs tell them "everything is fine" combined with a status
page that says "all green" despite hard evidence proving otherwise,
scrutiny is justified.

That's really all I have to say on the matter.  Improve the system or
improve the process/procedure.  Period.

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |

