Building a Resilient SIP Solution

I am the father of three boys. More accurately, I am the father of three men, since the oldest is 29 and the youngest is 22.

However, for most of their existence, they were boys and to me that meant years of worrying about broken bones, cuts, bruises, and much worse. They all came through adolescence and the teenage years in relatively good shape, but I spent more hours in emergency rooms and urgent care clinics than I care to think about.

I want to think that a lot of the reason why they survived into their twenties is that I was always insistent on doing things safely. They wore helmets when they snowboarded or rode their bikes.

Driving without seat belts was punishable by a denial of car privileges. While I wasn’t overbearing, I did my best to ensure that they understood the risks of their activities and took the proper precautions.

All of this leads me to what I want to write about today—risk management for SIP.

There are a number of places in a SIP infrastructure where things can go wrong. There are network elements that move SIP sessions in, out, and around your enterprise.

This article originally appeared on Andrew Prokop’s blog, SIP Adventures, and is reprinted with permission.

There are users who connect their SIP endpoints to that network. There are SIP servers that route sessions from place to place. There are trunks to and from your SIP carrier. Clearly, there are lots of places where things can wrong. Fortunately, there are just as many ways to ensure that things either don’t go wrong in the first place, or heal themselves as quickly as possible.

Network First

The quality of your SIP traffic is directly related to the quality of your network. Before I have anyone move to SIP, I insist on an assessment to ensure that all routers and switches are of a vintage that can handle real-time media and all are correctly configured. Don’t just trust that your network is ready to support quality-of-service. Go through the proper tests to make sure that all the I’s are dotted and the T’s are crossed. Believe me, it will be money well spent.

In terms of SIP and trunks, I highly recommend that you have multiple ingress and egress points. This may mean multiple MPLS drops to a single location. It may also mean having a primary circuit in your main data center and a backup circuit in a disaster recovery site. The important thing is to guarantee that if one connection drops there is another one ready to take over.

You may even want to consider multiple providers. For instance, you might want to use AT&T for your prime circuits and Verizon for your backup. This will complicate your dial plan, but the price of losing all call services might be worth the effort.

On the Border

The first line of SIP defense is your Session Border Controller (SBC). Every enterprise-grade SBC on the market today offers a high availability (HA) option. In an HA-SBC pair, one box runs in active mode, processing all incoming and outgoing SIP sessions. The second box is in standby mode, waiting to be told to take over that SIP traffic. There is a link between the two that keeps the standby box synchronized with the active box. This allows the standby box to hit the ground running whenever the main SBC dies. It knows everything that was occurring at the time of failure and all active sessions continue on as if nothing happened.

An important aspect of HA-SBCs today is that the active and standby modes must be connected by a Layer-2 network. This means that they must be on the same subnet. Since many primary and backup data centers only support Layer-3 between them, HA-SBC pairs cannot be geographically split. This doesn’t negate the need for HA, it simply means that HA can only be campus-based and the two boxes must be local to one another.

Session Management

Session management is what turns “SIP the protocol” into “SIP the architecture.” I’ve written extensively about session management (please see my articles, Session Management and Avaya Session Manager vs. Session Border Controller), but the main thing to understand here is that it loosely connects disparate SIP services, servers, and endpoints together to form a cohesive whole.

Unlike SBCs, Avaya Aura Session Manager supports HA in an active-active manner. This means that all Session Managers are active at all times and nothing runs in standby mode. Session Managers are grouped into communities where each member understands what the others in the community are doing. This facilitates a seamless failover if one of the members dies or is taken offline.

Another difference between the Aura Session Manager and a SBC is that Session Managers do not require a Layer-2 network between them. This allows an enterprise to spread Session Managers across the world and still maintain a single logical SIP solution.

The Endpoints

The last layer of resiliency resides with the endpoints. As I wrote about in “In the Beginning: SIP Registration,” SIP allows a single user to simultaneously register multiple devices. I always have at least two devices registered. When sitting at my desk, I take calls on my Avaya 9641 telephone, but those same calls will also ring One-X Mobile for iOS on my iPhone. It doesn’t matter if I am stationary or mobile. I never miss a call.

Lastly, Avaya telephones support something called SIP Outbound. This allows a single device to simultaneously register to multiple SIP servers. My 9641 can register to three different Session Managers at the same time (two in the core and one at the branch). If one goes down, I am already registered to a backup. This provides for a seamless failover with no dropped calls.

No Single Points of Failure

If you follow these guidelines, you will create a resilient SIP architecture with no single points of failure. I haven’t done the math, but I can easily envision five or six nines of availability. Remember, downtime is lost money, and who can afford that these days? So, like I always told my kids, wear your helmets and buckle up your seatbelts. No one wants to take their business to the emergency room.