[VoiceOps] Plivo Offline, Domain Expired; Out of Business?

Peter Beckman beckman at angryox.com
Sun Apr 23 18:42:32 EDT 2017


Still priceless. If our application servers cannot access critical
resources from a carrier, regardless of the cause (network, application
outage on their end, domain went un-registered), I now know it isn't
working.

That is still priceless. Monitoring should tell you what, not why. Blackbox
monitoring (measuring what customers actually experience) is wildly more
valuable than whitebox monitoring (a DB query took more than 10 seconds,
once).

Your metrics system should help you answer why, once you know that
something is wrong.
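
A minimal sketch of the blackbox idea above, in Python. It reports only
whether a vendor endpoint works end to end and deliberately collapses every
failure cause (DNS, network, TLS, HTTP error) into "not working". The names
vendor_is_up and default_fetch and the example URL are illustrative
assumptions, not anything from our actual monitoring platform.

```python
import http.client
import urllib.request


def default_fetch(url: str, timeout: float) -> int:
    """Fetch the URL and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def vendor_is_up(url: str, fetch=default_fetch, timeout: float = 5.0) -> bool:
    """Blackbox check: True only if the request succeeds end to end.

    The check answers "what" (is it working?), never "why".
    """
    try:
        status = fetch(url, timeout)
    except (OSError, http.client.HTTPException):
        # DNS failures, refused connections, timeouts, TLS errors and
        # urllib's HTTP errors are all OSError subclasses.
        return False
    return 200 <= status < 300


if __name__ == "__main__":
    # Hypothetical endpoint; substitute a resource your application needs.
    print(vendor_is_up("https://vendor.example/health"))
```

The fetch parameter exists only so the check can be exercised without a
network; the reason for a failure is left to the metrics system.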

Beckman

Good reading:
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit

From a prior Google SRE
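
The expiry alerting policy quoted further down this thread (a once a day
check that warns 30 days before expiry and pages from 9am on weekdays in the
last 7 days) could be sketched roughly as below. This is an illustration
under assumptions, not our real configuration: the function and constant
names are made up, both datetimes are assumed to be naive values in the same
local timezone, and falling back to a warning outside paging hours is my
guess at sensible behavior. Looking up the expiry date itself (registrar
data or a certificate's notAfter field) is left out.

```python
from datetime import datetime, timedelta

WARN_DAYS = 30   # start warning Operations this many days before expiry
PAGE_DAYS = 7    # start paging people this many days before expiry
PAGE_HOUR = 9    # earliest hour of the day at which paging is allowed


def expiry_alert(expires_at: datetime, now: datetime) -> str:
    """Return "ok", "warn" or "page" for a single check run."""
    remaining = expires_at - now
    is_weekday = now.weekday() < 5   # Monday is 0, Friday is 4

    if remaining <= timedelta(days=PAGE_DAYS):
        if is_weekday and now.hour >= PAGE_HOUR:
            return "page"
        # Assumption: inside the paging window but outside paging hours,
        # keep warning instead of staying silent.
        return "warn"
    if remaining <= timedelta(days=WARN_DAYS):
        return "warn"
    return "ok"


if __name__ == "__main__":
    now = datetime(2017, 4, 24, 10, 0)   # a Monday morning
    print(expiry_alert(now + timedelta(days=5), now))
```

The same function works for domains and for SSL/TLS certificates, since
both reduce to comparing an expiry timestamp against the current time.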

On Sun, 23 Apr 2017, Alex Balashov wrote:

> With regard to the so-called priceless information, I would be careful.
> Monitoring vendors oneself is good, but it can lead to the
> misapprehension that the blame for an outage lies with the vendor where
> it in fact does not, but is instead caused by intermediate routing or
> whatever.
>
> In my experience, if used with conceit, such monitoring can cause as many
> problems as it solves. It certainly isn't priceless. That depends on the
> specific situation.
>
> On April 23, 2017 4:27:00 PM EDT, Alexander Lopez <alex.lopez at opsys.com> wrote:
>> The information you gather by monitoring your vendors is priceless. If
>> you are able to determine within a short period of time that your
>> vendor is the problem, you have just saved precious time in identifying
>> the problem. You can then reroute around the issue, instead of spending
>> time looking at your platform.
>>
>> On Apr 23, 2017 3:10 PM, Peter Beckman <beckman at angryox.com> wrote:
>> Yep, we monitor our vendors. In some cases, better than they monitor
>> themselves. It's frustrating that they don't/can't/won't, but wanting them
>> to change doesn't help our customer experience. So we do it for them
>> proactively.
>>
>> Too many outages on our vendor services have been caused by their
>> ineptitude, such as this Plivo outage. We decided as a result of this Plivo
>> outage that, while it was merely annoying to our Operations team and
>> didn't result in any customer-facing issues, we could have seen this coming
>> and maybe averted this disaster for everyone by adding two config lines in
>> our monitoring platform and a process to ensure notification to, followup
>> with and closure of the issue with Plivo.
>>
>> The cost to our Operations team is small, the automated monitoring costs
>> nothing, but the impact of knowing that our vendors are having (or will
>> have) issues before they tell us or know themselves, AND before our
>> customers complain, improves our customer experience and operational
>> excellence despite our vendors' failings.
>>
>> It also gives us a chance to write defensive code to handle the situations
>> where the vendor is not meeting their contractually obligated level of
>> service.
>>
>> Beckman
>>
>>
>> On Sun, 23 Apr 2017, Keln Taylor wrote:
>>
>>> Just to clarify, you are saying that you monitor the domain and SSL cert of
>>> your vendors so you can notify them?
>>> That's cool.
>>>
>>>
>>> Sincerely,
>>> Keln Taylor
>>> 870-204-2121
>>> kelntaylor at gmail.com
>>>
>>> On Sun, Apr 23, 2017 at 12:31 PM, Peter Beckman <beckman at angryox.com> wrote:
>>>
>>>> We should all strive to NOT do that. We integrated a once a day check into
>>>> our Monitoring platform that starts warning Operations 30 days before the
>>>> domain expires, and actually pages people starting at 9am on Weekdays 7
>>>> days before if it hasn't been renewed. We had to tweak it for how our
>>>> registrar publishes that information, and we automated renewals so it
>>>> rarely goes off, but when it does we can get in front of it.
>>>>
>>>> We have the same thing in place for our public and internal SSL/TLS
>>>> Certificates.
>>>>
>>>> If you are running a business on the web and don't automate monitoring of
>>>> critical infrastructure, you get outages like this. Heck, we started
>>>> monitoring the domain and SSL certs of our critical-path dependent
>>>> services/vendors since another outage many years ago after an SSL cert
>>>> expired.
>>>>
>>>> Plivo wasn't in our mix, as they aren't critical-path, but they are now,
>>>> and they are still in alarm. Operations now will be automatically notified
>>>> when we can actually see Plivo again.
>>>>
>>>> Beckman
>>>>
>>>> On Sun, 23 Apr 2017, Gavin Henry wrote:
>>>>
>>>>> On 23 April 2017 at 17:31, Alex Balashov <abalashov at evaristesys.com> wrote:
>>>>>
>>>>>>
>>>>>> I've done it. Just got distracted, got to fiddling around and thinkin'
>>>>>> bout things, and before I know it, the domain's expired and I'm no longer
>>>>>> on the register of respectable party guests...
>>>>>>
>>>>>
>>>>> Embarrassing. I'm sure I'll do it at some point. :-)
>>>>>
>>>>>
>>>> ---------------------------------------------------------------------------
>>>> Peter Beckman                                                  Internet Guy
>>>> beckman at angryox.com                                 http://www.angryox.com/
>>>> ---------------------------------------------------------------------------
>>>> _______________________________________________
>>>> VoiceOps mailing list
>>>> VoiceOps at voiceops.org
>>>> https://puck.nether.net/mailman/listinfo/voiceops
>>>>
>>>
>>
>
>
> -- Alex
>
> --
> Principal, Evariste Systems LLC (www.evaristesys.com)
>
> Sent from my Google Nexus.
>

---------------------------------------------------------------------------
Peter Beckman                                                  Internet Guy
beckman at angryox.com                                 http://www.angryox.com/
---------------------------------------------------------------------------

