[VoiceOps] Plivo Domain Outage Post Mortem (fwd)

Mon May 8 21:04:36 EDT 2017

Thought the list would like to see the Plivo outage post mortem.

Make sure you renew your domains people!

Beckman
---------------------------------------------------------------------------
Peter Beckman                                                  Internet Guy
beckman at angryox.com                                 http://www.angryox.com/
---------------------------------------------------------------------------

---------- Forwarded message ----------
Date: Mon, 8 May 2017 16:11:52 EDT
From: Plivo <do-not-reply at plivo.com>
Subject: Plivo Domain Outage Post Mortem

Plivo Domain Outage Post Mortem & Analysis

Dear Valued Customer,

On April 23, 2017, we experienced an outage on our primary domain (plivo.com) and related subdomains. Although we shared regular updates and various workarounds during the outage, we would like to communicate our root cause analysis and the steps we will take to ensure this doesn’t happen in the future.

** What happened and what was the impact?
------------------------------------------------------------

At 10:39 UTC on April 23, 2017, our team noticed that our primary domain (plivo.com) and all of its related subdomains were unresolvable from most countries, which resulted in an outage for customers across all services.

Our on-call team immediately began taking action, and within the next 4 hours provided workarounds for our customers that ensured access to most of our services. We communicated these updates to customers via Twitter, a live status update document, and their respective account managers.

Over the next 18 hours, while working with our domain registrar, we isolated and corrected multiple configuration and provisioning errors. By 12:30 UTC on April 24, 2017, all of our services were back up through most DNS providers globally. However, a small percentage of our voice and SMS customers experienced increased latency and errors during the next few hours, which were resolved immediately.

** Timeline
------------------------------------------------------------

April 23, 2017
* 10:39 UTC: plivo.com and its subdomains could not be resolved from various locations globally. The on-call team immediately started investigating the issue.
* 10:42 UTC: Our domain registrar showed the domain as expired.
* 10:50 UTC: Our engineers contacted our domain registrar to understand the reason and resolve it with them, and in parallel started implementing a contingency action plan.
* 13:30 UTC: A patch was deployed to all of our servers to temporarily provide a workaround for the unresolvable domain, and switch all of our internal tools and servers to an internal domain name.
* 14:00 UTC: Our internal tools and servers were resolvable.
* 14:34 UTC: We communicated a workaround to our customers with the temporary IPs of our services, which ensured that service was not disrupted.
* 15:00 UTC: Outbound calls to PSTN came back online.
* 15:30 UTC: We communicated new domain names to our carriers and worked to re-establish Inbound Calls. At this time we saw 60% of our voice traffic back up.
* 17:00 UTC: We released alternative links to our WebSDK on the live update document.
* 17:30 UTC: Our SIP registrar (phone.plivo.com) was patched to accept direct connections using the IP address only, which allowed our customers to register.
* 19:46 UTC: Inbound SMS was back in service.
* 20:00 UTC: plivo.io was provisioned by setting up another cluster as a workaround.
* 22:18 UTC: api.plivo.io came online as a backup for api.plivo.com.
* 22:26 UTC: manage.plivo.io came online as a backup for manage.plivo.com.
* 22:30 UTC: We released a new version of our WebSDK using a different domain to mitigate the issues experienced by customers. At this time 80% of our Voice traffic was back up.
* 23:12 UTC: We released new plivo.io domain names for customers using custom inbound carriers.

April 24, 2017
* 00:38 UTC: phone.plivo.io and app.plivo.io came online, as temporary replacements for phone.plivo.com and app.plivo.com.
* 03:30 UTC: We saw some plivo.com domains starting to resolve on their original IP addresses. We kept monitoring the propagation of our nameservers.
* 09:30 UTC: Around 50% of the main DNS servers had been updated.
* 10:30 UTC: We requested a cache refresh for plivo.com on some DNS servers to expedite the propagation.
* 13:00 UTC: 90% of DNS servers had been updated correctly with Plivo’s name servers.
* 16:00 UTC: All Plivo services were fully operational and 99% of DNS servers had been updated correctly with Plivo’s name servers.
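
A minimal sketch of how propagation percentages like those in the timeline could be tallied, assuming the NS answer from each public resolver has already been collected by some other means. The post mortem does not say how Plivo measured propagation; the function name, resolver labels, and name server names below are hypothetical placeholders.

```python
from typing import Dict, Iterable


def _normalize(names: Iterable[str]) -> set:
    """Lower-case host names and drop any trailing dot so answers compare equal."""
    return {name.lower().rstrip(".") for name in names}


def propagation_percent(observed: Dict[str, Iterable[str]],
                        expected: Iterable[str]) -> float:
    """Return the percentage of resolvers whose NS answer matches `expected`.

    `observed` maps a resolver label to the NS records it returned.
    A resolver counts as updated only if its answer set equals the expected set.
    """
    if not observed:
        return 0.0
    want = _normalize(expected)
    updated = sum(1 for answer in observed.values() if _normalize(answer) == want)
    return 100.0 * updated / len(observed)


if __name__ == "__main__":
    # Hypothetical data: two resolvers see the new name servers, two do not.
    answers = {
        "resolver-a": ["ns1.example.net.", "ns2.example.net."],
        "resolver-b": ["NS1.example.net", "ns2.example.net"],
        "resolver-c": ["old1.example.org"],
        "resolver-d": [],
    }
    print(propagation_percent(answers, ["ns1.example.net", "ns2.example.net"]))
```

Comparing whole answer sets, rather than checking for a single name server, avoids counting a resolver that still holds a partially stale record set as updated.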

** Root-Cause Analysis
------------------------------------------------------------

Plivo’s primary domain (plivo.com) was set to renew automatically every year on April 17. However, due to a configuration error with our registrar, the domain expired instead of renewing. We did not see any issues with our domain until April 23, 2017, because the registrar had a grace period of 5 days upon expiration. The expiry also went unnoticed until the day of the incident because we never received any updates, warnings, or notifications regarding the possible expiry of the domain.

When our team investigated further with the registrar, we found that the registrar had sent no notifications or alerts regarding the expiry of the domain and that the auto-renewal for the domain had never been triggered. Although we still don’t have official confirmation of this from the registrar, we suspect it is due to a configuration error on their end.

Immediately after discovering the expiry, we started working with the registrar to restore the domain. The first reprovisioning order for the domain remained stuck in a queued state for almost 4 hours without being executed. We then escalated this to their team and retried the restoration manually, which also failed multiple times.

The official response we received from the registrar was that "Since the name server on the domains have expired it usually takes 12-24 hours to reprovision the name servers on the domain and some more time for the changes to propagate globally."

To get our services back online for our customers, we set up a temporary domain at “plivo.io” and pointed all of our services to this new domain. Then, we published this workaround in a live document that we updated throughout the incident.

After almost 18 hours of working with the registrar, at 03:30 UTC on April 24, 2017, the order was finally executed successfully and we started seeing the name server update reflected on some DNS providers.

** When will the workarounds expire?
------------------------------------------------------------

How long can customers continue to use plivo.io as a workaround domain?

We will ensure that the workaround domain “plivo.io” will remain operational until May 31, 2017. We will decommission the domain after May 31, 2017.
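
A minimal sketch of how a client could honor this cutoff, assuming it keeps an ordered list of API hosts to try. The host names and the May 31, 2017 date come from this message; the constant and function names are hypothetical, and this is not Plivo SDK code.

```python
from datetime import date
from typing import List

PRIMARY_HOST = "api.plivo.com"         # original domain, always preferred
FALLBACK_HOST = "api.plivo.io"         # temporary workaround domain
FALLBACK_LAST_DAY = date(2017, 5, 31)  # last day the workaround is promised to work


def candidate_hosts(today: date) -> List[str]:
    """Return the hosts to try, in order, on the given day.

    The workaround host is offered only as a second choice, and only
    through the announced decommission date.
    """
    hosts = [PRIMARY_HOST]
    if today <= FALLBACK_LAST_DAY:
        hosts.append(FALLBACK_HOST)
    return hosts


if __name__ == "__main__":
    print(candidate_hosts(date(2017, 5, 8)))
    print(candidate_hosts(date(2017, 6, 1)))
```

Putting the primary host first means the temporary domain is tried only when plivo.com fails, and the date check drops it automatically once it is no longer guaranteed.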

How long can customers continue to hardcode IPs in /etc/hosts?

We advise customers to go back to the plivo.com domain names as soon as possible. Because of our elastic architecture, we cannot guarantee that these IPs will stay the same in the near future.

It is especially critical to use api.plivo.com instead of the temporary IPs that we provided during the incident. We will send reminders to all of our customers who switched to the temporary domain to revert to the original plivo.com domain and IPs.

** Related Service disruptions
------------------------------------------------------------

Between April 24 and 27, 2017, 30% of our customer traffic saw irregular service degradation and disruptions in our SMS, Voice API, and WebRTC/SIP services.

These incidents are related to maintenance that was originally planned for April 23, 2017. When the unexpected DNS issue hit, we were near the end of a deployment intended to strengthen our Voice platform by making phone.plivo.com more redundant. However, the new deployment created performance issues in the form of locking and latency on a specific database table that was accessed for most customers. This occurred whenever traffic for these services started spiking.

We worked on those issues and built a new internal service to optimize the volume of data processed to avoid elevated database writes and latencies. We also deployed patches on April 27, 2017 that improved the overall stability and performance of our platform, while also readying it for much higher workloads.

** What are we doing about it for the future?
------------------------------------------------------------

While this incident was due to an error in configuration and provisioning by our domain registrar, we take complete responsibility for this outage. We are responsible for ensuring the uptime of our services for our customers. Clearly, with better checks and more thorough processes, we could have avoided the whole situation in spite of the error from the domain registrar.

This entire outage exposed some critical flaws in our dependency on our 3rd party service providers. To ensure we minimize impact on Plivo’s services by 3rd party errors or issues, we have outlined a set of steps that we will initiate immediately:

1. Categorize all 3rd party services into three priority levels (i.e., P1, P2, P3), based on potential impact on Plivo’s services. Detail potential workarounds in the event of experiencing downtime from these services. Perform monthly, quarterly, bi-annual, and annual audits and reviews of all P1 and P2 services for renewal and configuration settings.
2. Plan and renew all related categories of services, such as domain names and TLS certificates, for the longest period possible. We have already executed this for our domain by renewing it for the next 10 years. We will execute the same strategy for all of our certificates and related services.
3. Set up automated monitoring to alert and notify all stakeholders in case any of our domains or similar services come within a month of their expiry date. Stakeholders will have the authority and access to take action immediately. This will avoid dependency on vendor notifications.
4. When possible, update our SDKs to be able to dynamically update domain endpoints, so a switch is possible at the customer's end without any application code changes.
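
The expiry monitoring in step 3 could be sketched roughly as follows, assuming expiry dates for domains and certificates are already available from an inventory. The function name, the inventory entries, and the reading of "within a month" as 30 days are all assumptions made for illustration; this is not Plivo's implementation.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple

# Assumption: "within a month" is interpreted as 30 days.
WARNING_WINDOW = timedelta(days=30)


def expiring_soon(expiries: Dict[str, date],
                  today: date,
                  window: timedelta = WARNING_WINDOW) -> List[Tuple[str, int]]:
    """Return (name, days_left) for every asset expiring within `window`.

    Results are sorted with the most urgent first. Assets that have
    already expired are included with a negative days_left.
    """
    alerts = []
    for name, expires_on in expiries.items():
        days_left = (expires_on - today).days
        if days_left <= window.days:
            alerts.append((name, days_left))
    return sorted(alerts, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical inventory entries, not real expiry dates.
    inventory = {
        "domain: a.example": date(2017, 4, 17),
        "tls cert: api.a.example": date(2018, 1, 1),
    }
    for name, days_left in expiring_soon(inventory, today=date(2017, 3, 20)):
        print(f"{name} expires in {days_left} day(s)")
```

Because the check runs against an inventory the team controls, it raises an alert even when the vendor sends no notification, which is the failure mode described in the root-cause analysis.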

Our focus has always been to provide you the best quality of service and uptime, and this disruption clearly fell short of expectations.

We apologize for the disruption and the inconvenience that this has caused your business and your customers. We will work harder to earn back your trust by executing all of the steps outlined above.

Sincerely,
The Plivo Team

Copyright © 2017 Plivo All rights reserved / View as Webpage (http://mailchi.mp/7bab18c30aec/plivo-update-all-services-back-up-697689?e=92ecfbb00b)