[Outages-discussion] VoIP - complete outage at DASH Carrier Services

Frank Bulk frnkblk at iname.com
Wed Dec 15 17:55:34 EST 2010


Thanks for sharing.

I recently saw a telco share their ultra-redundant configuration, and they
used SBCs from different vendors specifically to avoid this kind of issue.
Not my first preference, but it has some validity.

Frank

-----Original Message-----
From: outages-discussion-bounces at outages.org
[mailto:outages-discussion-bounces at outages.org] On Behalf Of Chris Stone
Sent: Wednesday, December 15, 2010 3:01 PM
To: outages-discussion at outages.org
Subject: Re: [Outages-discussion] VoIP - complete outage at DASH Carrier
Services

DASH has posted (by email) the following with regard to their outage
yesterday:

Date of Incident:          Tuesday, 14 December 2010
Time Incident Began:       3:00 PM MST Denver POP, 3:00 PM MST Atlanta POP
Time Incident Resolved:    4:40 PM MST Denver POP, 4:50 PM MST Atlanta POP

Reason for Outage

dash experienced a corruption of a configuration file on our Acme Packet SBC
clusters in the Denver and Atlanta POPs.  The clusters do not share a common
configuration, but they are configured similarly.  dash is working with Acme
Packet to identify the cause of the corruption.

Services Affected

Inbound and outbound call routing.

Resolution

dash removed the corrupted entity and rebuilt that same portion of the
configuration in each cluster. No other changes were made to the
configuration.


Root Cause

The corrupt configuration database caused routing requests to fail and,
over a short time, caused a process failure on the Acme Packet SBC cluster.
Specifically, the process failure caused the public VRRP interfaces of the
border controller to drop.

dash is working with Acme Packet to identify the root cause and implement
corrective action as necessary. The root cause will be communicated once it
is identified.

Corrective Action

Until the root cause is identified and long-term corrective action is
implemented, dash monitoring will continue to send critical alerts if the
situation recurs. To resolve a recurrence, the corrupt configuration file
would be removed and rebuilt. Time to remove the corrupt file and rebuild it
is approximately one minute for each SBC cluster.
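
For anyone wanting to alert on this kind of corruption themselves, one
generic approach is to compare a configuration file against a known-good
checksum. The sketch below is only an illustration, not what dash or Acme
Packet actually do: it assumes the configuration is available as an ordinary
file, and the names sha256_of and config_is_intact are made up for the
example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def config_is_intact(path: Path, known_good_digest: str) -> bool:
    """Return True if the file exists and still matches the known-good digest.

    Assumption: `known_good_digest` was recorded right after the last
    deliberate configuration change.
    """
    if not path.is_file():
        return False
    return sha256_of(path) == known_good_digest
```

A monitoring job could call config_is_intact on a schedule and raise a
critical alert when it returns False. A check like this only notices that the
bytes changed; it cannot tell a corrupt file from a legitimate edit unless
the recorded digest is refreshed after every intended change.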



