[Outages-discussion] [outages] Outages Message List Scope

Thu Mar 9 14:42:53 EST 2017

I agree with the canary-in-the-coal-mine aspect. When I signed up I expected that to some degree, but with more of a backbone/service provider bent. I have plenty of filtering options available, but filtering said canaries can be problematic if you don’t know which colors of canaries you are going to get. :)

I worded the Slack statement poorly; my bad. My problem with the Slack stuff was the “me too” or “it’s working” messages arriving 2+ hours after Slack’s own status page noted they had a problem and had rolled back to fix it. Slack is critical infrastructure for some organizations, and any outage for that kind of cloud service is going to generate calls to my call center if it goes on longer than the 15 minutes it takes to get from end user to technical contact to our help desk.

AWS, Azure, & Google are absolutely a reason for a message; you are going to hear about it from your users/clients in a damn hurry if one of those is having a moment these days. Depending on your or your client’s needs, an outage with those services now looks almost exactly the same to your networks as a guy standing next to the Big Yellow Cable Finder with a 288-count fiber hanging from the bucket saying “ooopsy”. :)

Thanks for the input everybody!
~Jason
From: John Starta <john at starta.org>
Date: Thursday, March 9, 2017 at 12:54 PM
To: "outages-discussion at outages.org" <outages-discussion at outages.org>
Cc: Peter Beckman <beckman at angryox.com>, Jason Grider <jgrider at asc.edu>
Subject: Re: [outages] [Outages-discussion] Outages Message List Scope

As you might expect there are a variety of opinions on what’s important enough to merit an outages mailing list message. I personally don’t have an issue with smaller or website specific outages being reported. I view them as potential canaries in the coal mine. For instance, an issue with Slack can signal an AWS problem which might not have been registered on Amazon’s dashboard yet. (Given what resides on AWS, Azure, and Google Cloud these days I personally think they qualify as major [communications] infrastructure.)

Everyone on this mailing list should have the technical skill to use the filtering capabilities of their mail client and/or server. If you don’t like hearing about Slack outages, for instance, then create filters that pass only the incoming mailing list messages containing keywords important to you. I would recommend using server-side filters so that both your mobile[1] and desktop can benefit. If server-side filters aren’t available to you, then consider getting a Gmail account, which offers them, and subscribing to the outages list from there.
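As a rough illustration of the keyword filtering described above, here is a minimal Python sketch that checks a message's Subject header against a keyword list. The keyword list and the function name `is_relevant` are assumptions made up for this example; nobody on the list posted this code, and a real setup would more likely use your mail server's own filter rules.

```python
import re
from email import message_from_string

# Hypothetical keyword list; pick terms that matter to your own network.
KEYWORDS = ("bgp", "fiber", "aws", "azure")

# Match whole words only, case-insensitively, so "aws" does not hit "bylaws".
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def is_relevant(raw_message: str) -> bool:
    """Return True if the Subject header mentions any keyword."""
    msg = message_from_string(raw_message)
    subject = str(msg.get("Subject", ""))
    return bool(_PATTERN.search(subject))
```

Matching on the Subject header alone keeps the sketch short; extending it to the message body is a design choice that trades fewer misses for more noise.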

Simply put: why must everyone on the list be deprived of potentially useful information just because some don’t or won’t learn their tools to benefit themselves?

John Starta

[1] A frequent complaint of many is X outage shouldn’t be reported to this list — to paraphrase: “stop filling the inbox on my mobile with outages I find unimportant.”

On Mar 9, 2017, at 10:58 AM, Peter Beckman <beckman at angryox.com> wrote:
Agreed, and thanks for saying something.
Though, from some standpoints, Slack being down could be considered a
communications failure... but I agree, clarification should be given.
Appropriate posts (based on my reading of the "Mission Statement"):
    * Network link down or packet loss (show your work)
    * Telecommunications issue (voice, video, SMS)
    * BGP flaps
    * DDoS affecting network latency
    * generally packet delivery and receipt related
Maybe appropriate:
    * Large cloud/service provider outage (some communication may be
        dependent); e.g. recent AWS S3 US-East issue, CloudFlare Security issue
    * AT&T 911 Outage
    * Mobile Network outage/issue
    * MikroTik Zero Day
Not appropriate:
    * Web service is down (endpoint related) e.g. Slack, Twitter, Amazon
        retail, Facebook
    * "I'm seeing a problem, are you?" posts -- either know or don't post
    * "Me too" posts unless you are adding to the discussion with
       additional, previously unseen detail
    * After-the-fact posts "Yeah, I saw that happen"
I do not represent the Outages list, this is my personal take on what
should or shouldn't be here.
Oh, and this should go to -discussion.
On Thu, 9 Mar 2017, Jason Grider via Outages wrote:
Is a 504 – Gateway Error something that falls into this mailing list’s
scope? I’m asking because I’ve only been subscribed for a few days and
between a 10 minute outage at Slack generating ~25 messages over the
course of 5 hours and a (most likely internal to Invidia) HTTP server
error I’m wondering how much filtering I may need to put in place to get
the pieces of information I need from the chaff of events that have a
limited blast radius. There is no disrespect intended to anybody about
what has been sent to the list. I’m just trying to set my expectations in
line with the data I’ve asked for. :)
From the signup page: https://puck.nether.net/mailman/listinfo/outages (bold is my emphasis)
"The primary goal of this mailing list ("outages") is for
outages-reporting that would apply to failures of major communications
infrastructure components having significant traffic-carrying capacity,
similar to what FCC provided prior to 9/11 days but they seem to have
pulled back due to terrorism concerns. Some also believe that LEC's and
IXC's also like this model as they no longer have to air their dirty
laundry. Then again, this mailing list is not about making anyone look
bad, its all about information sharing and keeping network operators &
end users abreast on the situation as close to real-time information as
possible in order to assess and respond to major outage such as routing
voice/data via different carriers which may directly or indirectly impact
us and our customers. A reliable communications network is essential in
times of crisis.
The purpose of this list is to have a central place to lookup and report
so that end users & network operators know why their services (e-mail,
phones, etc) went down eliminating the need to open tons of trouble
tickets during a major event. One master ticket - such as fiber cut
affect xxx OC48's would suffice. We hope this would empower users and
network operators to post such events so that everyone could benefit from
it."
Then again “OC48” may date that statement a little bit. :)
“Thank you and have a great day!” I say from the confines of my shiny
silver flame suit.
Jason Grider
---------------------------------------------------------------------------
Peter Beckman                                                  Internet Guy
beckman at angryox.com                              http://www.angryox.com/
---------------------------------------------------------------------------
