[VoiceOps] Ideas for Building Inbound Redundancy
universe at truemetal.org
Fri Feb 3 15:01:49 EST 2017
On 02.02.2017 at 16:30, Voip Jacob wrote:
> The case that we're trying to protect against would be if both PBXs at
> both data centers were unreachable for our SIP provider (DNS issues,
> internal network routing issues, routing issues between SIP provider &
> datacenters, etc.). [...]
I can't help exactly with the call queues/groups thingie, but here's
what I did recently for my inbound infrastructure where DIDs get routed
to me from several carriers and routed from me to my customers, via SIP.
I believe I have covered all possible outage scenarios with this setup:
There are 2 data centers, geographically diverse. I colocate servers +
routers in each. Let's call that a "site". Each site has two redundant
(with VRRP) routers that speak BGP with the upstream routers.
Per site, there are 2 physical servers with CentOS + KVM, and each
server hosts 2 VMs:
VM 1) Asterisk for SIP/RTP + Galera Cluster for database
VM 2) quagga for BGP + kamailio as SIP proxy for load balancing/SIP failover
A total of 8 VMs. (To start with)
I took a spare /24 IPv4 netblock I had lying around, and quagga running
on the 4 quagga/kamailio VMs is announcing this prefix via BGP to the 4
internet-facing routers. Each quagga connects to both the active and
standby router local to this site. That means a total of 8 BGP sessions,
4 per site, 2 per router.
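
As a sanity check on the session count above, here is a tiny Python sketch that enumerates the peerings. The site, VM and router names are made up for illustration; the only thing taken from the setup is that each quagga VM peers with both routers at its own site.

```python
from itertools import product

# Hypothetical labels, not the real hostnames.
SITES = ["site1", "site2"]
QUAGGA_VMS = ["quagga-a", "quagga-b"]            # one per physical server
ROUTERS = ["router-active", "router-standby"]    # the VRRP pair

def bgp_sessions():
    """Each quagga VM peers with both routers local to its own site."""
    return list(product(SITES, QUAGGA_VMS, ROUTERS))

sessions = bgp_sessions()

# Sessions per site and per (site, router) pair.
per_site = {s: sum(1 for x in sessions if x[0] == s) for s in SITES}
per_router = {
    (s, r): sum(1 for x in sessions if x[0] == s and x[2] == r)
    for s in SITES
    for r in ROUTERS
}

print(len(sessions), per_site, per_router)
```

This gives 8 sessions in total, 4 per site and 2 per router.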
Announcing this prefix at multiple sites at the same time, where each
site uses different upstream providers, results in that IPv4 prefix
becoming "anycast'ed", meaning it is visible in the global routing table
via multiple paths, and the decision about which site IP packets for this
prefix end up at is made by the BGP algorithm (and by the provider
where the traffic originates).
A single IP address out of that /24 is up on all 4 quagga VMs as an
alias address. Yes, the same IP address, four times. You may think this
might be broken and cause problems if there is the same IP up in the
same VLAN, but the BGP algorithm on our internet-facing routers will
choose one of the VMs and send all traffic there; the other VM will
act as standby. There is no LAN traffic to/from this particular
IP, so it just works. Initially I messed around with "keepalived" and
some other tools, but it didn't work out, and running fewer
userland daemons (which can crash, too) is better. :)
Now, we tell the providers that we buy DIDs from: Hey, route all the SIP
packets for us to this particular IP only. A single IP is all they need
and get from us! No more "Please add our new IP address ...".
I use both Asterisk's dialplan and A2Billing (a "VoIP Softswitch
Solution", open source and free of charge) to route DIDs to customers.
Because of A2B and because of CDRs we need a MySQL database. This is
what Galera Cluster is for, a multi-master active-active replacement for
MySQL. There are 4 Asterisk/database VMs, so there are 4 instances
running which synchronize with each other all the time, so it does not
matter to which instance you are sending your write requests. I simply
use the local node for read + write and let Galera take care of the
internals. There is also a 9th VM in a country far, far away which only
runs Galera arbiter, does not store any MySQL data and simply acts like
a decision-making component which is there to prevent split-brain
situations because of the even node count. It's good that it's far away
so it is aware when a whole site is down due to network issues.
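
To illustrate why the even node count needs the extra arbiter, here is a simplified Python sketch of a majority rule. It assumes every member (data node or arbiter) has equal weight and that a partition stays writable only if it can see strictly more than half of all members; real Galera quorum handling has more detail (weights, graceful leaves), so treat this as a rough model only.

```python
def has_quorum(reachable: int, total: int) -> bool:
    """A partition keeps quorum only if it sees more than half of all members."""
    return 2 * reachable > total

# 4 data nodes, no arbiter, link between the sites is cut:
# each site sees only its own 2 nodes out of 4.
split_no_arbiter = has_quorum(2, 4)

# 4 data nodes + 1 arbiter, site 2 is gone:
# site 1 sees its own 2 nodes plus the far-away arbiter, 3 out of 5.
site1_with_arbiter = has_quorum(3, 5)

# 4 data nodes + 1 arbiter, site 1 is cut off from everything:
# it sees only its own 2 nodes out of 5 and must stop accepting writes.
isolated_site = has_quorum(2, 5)

print(split_no_arbiter, site1_with_arbiter, isolated_site)
# prints: False True False
```

Without the arbiter, a clean split between the two sites leaves neither side with a majority. With it, whichever side can still reach the arbiter carries on.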
SIP + media:
kamailio running on the quagga VM is the entry point for all inbound SIP
traffic. A simple configuration which basically just says: There are 2
Asterisk servers to distribute the calls to. Check if they're both up.
If one of them is down, send all traffic to the remaining server. If
both are up, distribute evenly 50/50 so we get load balancing and all of
our servers will actually process calls and not just sit idle until
disaster comes. We let Asterisk handle all RTP and don't worry about an
Each Asterisk VM has a public IP address that is local to this site and
unique. So I tell my customers: Please allow inbound calls from these IPs:
220.127.116.11 (site 1, server 1, VM 1)
18.104.22.168 (site 1, server 2, VM 1)
22.214.171.124 (site 2, server 1, VM 1)
126.96.36.199 (site 2, server 2, VM 1)
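
Coming back to the kamailio part: the distribution logic described above (two Asterisk backends, check whether they are up, 50/50 if both are, everything to the survivor otherwise) can be modelled in a few lines of Python. This is not kamailio configuration and not the actual config in use; the class name, method names and backend names are all hypothetical, and in kamailio itself this kind of thing is typically done with its dispatcher module.

```python
class Dispatcher:
    """Round-robin over backends, skipping any that are marked down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.up = {b: True for b in self.backends}
        self._next = 0  # index of the backend to try first on the next call

    def mark(self, backend, is_up):
        """Record the result of a health check for one backend."""
        self.up[backend] = is_up

    def pick(self):
        """Return the next healthy backend, or None if all are down."""
        n = len(self.backends)
        for offset in range(n):
            candidate = self.backends[(self._next + offset) % n]
            if self.up[candidate]:
                self._next = (self._next + offset + 1) % n
                return candidate
        return None


d = Dispatcher(["asterisk-1", "asterisk-2"])   # hypothetical backend names
both_up = [d.pick() for _ in range(4)]         # alternates between the two
d.mark("asterisk-1", False)                    # health check failed
one_down = [d.pick() for _ in range(3)]        # everything goes to asterisk-2
print(both_up, one_down)
```

With both backends up the calls alternate evenly; once one is marked down, every call goes to the remaining one until a health check marks it up again.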
Now, let's go through the possible disasters and see how this whole
thing will react:
- Data center lights up in a big ball of fire/upstreams go down/fiber
cut etc.: site 1 is down, thanks to BGP anycast all traffic will
instantly and without manual intervention go to site 2, and vice versa.
- Router dies: Remaining router takes over, BGP sessions to both quaggas
on each, BGP sessions to upstreams on each, VRRP between them. Instantly
+ no manual intervention.
- Physical server dies: quagga VM goes down, BGP session disappears, BGP
session to quagga VM on remaining physical server takes priority on
router. (quagga + kamailio are active all the time on both VMs; on the
"standby" VM they just sit idle until the other quagga disappears)
- quagga VM down/reboot: BGP session disappears, remaining VM gets priority.
- Asterisk crashes: kamailio detects this and sends all calls to
the remaining server.
All MySQL data is always everywhere, always write-able, regardless of
site blowup, server failure or VM crash. We don't worry about harddrive
faults or filesystem redundancy (GlusterFS, ...). Keep it simple. If a
server dies, we replace it. (We still use RAID, of course)
I make all configuration changes on a single node. A simple script syncs
the configuration with the other Asterisk instances and reloads them. Same for
web access to A2B, a single node is designated for that, but you could
easily make that redundant as well if web access is crucial, again
thanks to BGP anycast.
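
A config sync script of the kind mentioned above could look roughly like the following Python sketch. It is not the actual script: the peer hostnames are placeholders, root SSH access with keys is assumed, and /etc/asterisk/ is assumed to be the config directory. By default it only prints the commands it would run.

```python
import shlex
import subprocess

CONFIG_DIR = "/etc/asterisk/"   # assumed config location
PEERS = [                        # placeholder hostnames for the other nodes
    "asterisk-2.example.net",
    "asterisk-3.example.net",
    "asterisk-4.example.net",
]

def sync_commands(peer):
    """Build the rsync + reload commands for one peer."""
    return [
        # Copy the local config directory to the peer, deleting stale files.
        ["rsync", "-az", "--delete", CONFIG_DIR, f"root@{peer}:{CONFIG_DIR}"],
        # Ask the remote Asterisk to reload its configuration.
        ["ssh", f"root@{peer}", "asterisk -rx 'core reload'"],
    ]

def sync_all(dry_run=True):
    """Run (or, in dry-run mode, just print) the commands for every peer."""
    for peer in PEERS:
        for cmd in sync_commands(peer):
            if dry_run:
                print(shlex.join(cmd))
            else:
                subprocess.run(cmd, check=True)

if __name__ == "__main__":
    sync_all(dry_run=True)
```

Calling sync_all(dry_run=False) would actually execute the commands and stop at the first failure, so that a broken peer does not go unnoticed.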
The cool thing is that it scales for both load and redundancy count,
just add new servers as you please on any site and add them to Galera,
kamailio, even BGP if you wish. You could even add more sites in other
cities, countries, continents.
If you have read this far, you have possibly figured out that if
kamailio crashes on the quagga VM that currently has priority, calls
will go to a black hole. I was too lazy to set up kamailio failover... yet :)
Also, since there are multiple carriers that deliver DIDs, spread over
the world and using different upstreams, anycast really does its job and
some traffic arrives at site 1, other traffic at site 2. By luck, it's
currently close to a 50/50 distribution.
Since I took a couple of days to implement that, I sleep well again. :)
PS: BGP anycast is awesome.