How to Prevent Media Gateway Split Registrations
Back when Avaya Aura Communication Manager 5.2 was released, I recall reading about this new capability called Split Registration Prevention Feature (SRPF). Although I studied the documentation, it wasn’t until I read Timothy Kaye’s presentation (Session 717: SIP and Business Continuity Considerations: Optimizing Avaya Aura SIP Trunk Configurations Using PE) from the 2014 IAUG convention in Dallas that I fully understood its implications.
What is a Split Registration?
First I need to explain what SRPF is all about. Imagine a fairly large branch office that has two or more H.248 Media Gateways (MG), all within the same Network Region (NR). SRPF only works for MGs within a NR and provides no benefit to MGs assigned to different NRs.
Further, imagine that the MGs provide slightly different services. For example, one MG might provide local trunks to the PSTN, and another might provide Media Module connections to analog phones. For this discussion, it does not matter what type of phones (i.e. SIP, H.323, BRI, DCP, or Analog) exist within this Network Region. During a “sunny day,” all the MGs are registered to Processor Ethernet in the CM-Main, which is in a different NR somewhere else in the network. It aids understanding if you believe that all the resources needed for calls within a NR are provided by equipment within that NR.
A “rainy day” is when CM-Main becomes unavailable, perhaps due to a power outage. When a MG’s Primary Search Timer expires, it will start working down the list trying to register with any CM configured on the Media Gateway Controller (MGC) list. All MGs should have been configured to register to the same CM-Survivable server, which by virtue of their registration to it causes CM-Survivable to become active.
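The "work down the list" behavior can be pictured with a toy model. This is only a sketch: the function name pick_controller, the server names, and the idea of a "reachable" set are hypothetical, and the model ignores the Primary Search Timer and every other timer a real gateway uses.

```python
def pick_controller(mgc_list, reachable):
    """Return the first controller on the MGC list that is reachable, else None.

    mgc_list:  ordered controller names, primary first (hypothetical names).
    reachable: set of controllers the gateway can currently reach.
    """
    for controller in mgc_list:
        if controller in reachable:
            return controller
    return None


mgc_list = ["cm-main", "cm-survivable"]

# Sunny day: CM-Main answers, so the gateway stays with it.
sunny = pick_controller(mgc_list, {"cm-main", "cm-survivable"})

# Rainy day: CM-Main is unreachable, so the gateway falls to the next entry.
rainy = pick_controller(mgc_list, {"cm-survivable"})
```

If every MG in the region carries the same list, they all land on the same CM-Survivable, which is the outcome the design depends on.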
In this context a CM server is “active” if it controls one or more MGs. A more technical definition is that a CM becomes “active” when it controls DSP resources, which only happens if a MG, Port Network (PN) or Avaya Aura Media Server (AAMS) registers to the CM server.
Since all the MGs are registered to the same CM, all resources (e.g. trunks, announcements, etc.) are available to all calls. In effect, the “rainy day” system behaves the same as the “sunny day” with the exception of which CM is performing the call processing. Even if power is restored, only the CM-Survivable is active, and because no MGs are registered to CM-Main it is inactive.
In CM 5.2, SRPF was originally designed to work with splits between CM-Main and Survivable Remote (fka Local Survivable Processor) servers. In CM 6, the feature was extended to work with Survivable Core (fka Enterprise Survivable Server) servers. To treat the two server types interchangeably, I use the generalized term “CM-Survivable.”
A “Split Registration” is where within a Network Region some of the MGs are registered to CM-Main and some are registered to a CM-Survivable. In this case only some of the resources are available to some of the phones. Specifically, the resources provided by the MGs registered to CM-Main are not available to phones controlled by CM-Survivable, and vice versa. In my example above, it is likely some of the phones within the branch office would not have access to the local trunks.
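The definition boils down to a simple test: a Network Region is split when its gateways are registered to more than one CM. Here is a minimal sketch of that test, assuming you have already gathered a list of (gateway, network region, controller) tuples from somewhere. The function name, gateway names, and region numbers are all made up for illustration and do not come from any Avaya tool.

```python
from collections import defaultdict


def find_split_regions(registrations):
    """Return the sorted network regions whose gateways use more than one CM.

    registrations: iterable of (gateway, network_region, controller) tuples.
    """
    controllers_by_region = defaultdict(set)
    for _gateway, region, controller in registrations:
        controllers_by_region[region].add(controller)
    return sorted(
        region
        for region, controllers in controllers_by_region.items()
        if len(controllers) > 1
    )


# Hypothetical branch office (NR 10) with one gateway on each server.
registrations = [
    ("mg-trunks", 10, "cm-main"),
    ("mg-analog", 10, "cm-survivable"),
    ("mg-hq", 1, "cm-main"),
]
split_regions = find_split_regions(registrations)
```

In this example only NR 10 is reported, because its trunk gateway and its analog gateway answer to different servers.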
Further, the Avaya Session Managers (ASM) would discover CM-Survivable is active. They would learn of CM-Survivable server’s new status when either ASM or CM sent a SIP OPTIONS request to the other. The ASMs then might begin inappropriately routing calls to both CM-Main and CM-Survivable. Consequently, a split registration is even more disruptive than the simple failover to a survivable CM.
What can cause split registrations? One scenario is when the “rainy day” is caused by a partial network failure. In this case some MGs, but not all, maintain their connectivity with CM-Main while the others register to CM-Survivable. Another scenario is that all MGs fail over to CM-Survivable, but after connectivity to CM-Main has been restored some of the MGs are reset. Those MGs would then register to CM-Main.
How SRPF Functions
If the Split Registration Prevention Feature is enabled, CM-Main un-registers and/or rejects registrations from all MGs in the NRs whose gateways have registered to CM-Survivable. In other words, it pushes the remaining MGs to register to CM-Survivable. Thus, there is no longer a split registration.
When I learned that, my first question was how does CM-Main know that MGs have registered to CM-Survivable? The answer is that all CM-Survivable servers are constantly trying to register with CM-Main. If a CM-Survivable server is processing calls, then when it registers to CM-Main it announces that it is active. Thus, once connectivity to CM-Main is restored, CM-Main learns which CM-Survivable servers are active. This is an important requirement. If CM-Main and CM-Survivable cannot communicate with each other, a split registration could still occur.
My second question was how CM forces the MGs back to the CM-Survivable. What I learned was that CM-Main looks up all the NRs for which that Survivable server is administered. The list is administered under the IP network region’s “BACKUP SERVERS” heading. CM-Main then disables the NRs registered to CM-Survivable. That both blocks new registrations and terminates existing registrations of MGs and H.323 endpoints.
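To make the sequence concrete, here is a toy simulation of the steps just described: a survivable server reports that it is active, CM-Main looks up the NRs administered for that server, disables them, drops the gateways registered from those NRs, and refuses new registrations from them. This is only my illustration of the logic, not Avaya code. The class, method, and server names are hypothetical, and H.323 endpoints are left out.

```python
class CmMain:
    """Toy model of CM-Main's side of SRPF (hypothetical, for illustration)."""

    def __init__(self, backup_servers, registrations):
        # backup_servers: network region -> survivable server administered for it
        self.backup_servers = dict(backup_servers)
        # registrations: gateway -> network region, for gateways on CM-Main
        self.registrations = dict(registrations)
        self.disabled_regions = set()

    def survivable_reports_active(self, server):
        """Disable every NR backed by `server`; return the gateways dropped."""
        regions = {nr for nr, s in self.backup_servers.items() if s == server}
        self.disabled_regions |= regions
        dropped = sorted(
            mg for mg, nr in self.registrations.items() if nr in regions
        )
        for mg in dropped:
            del self.registrations[mg]
        return dropped

    def register(self, gateway, region):
        """Accept a registration unless the gateway's NR is disabled."""
        if region in self.disabled_regions:
            return False
        self.registrations[gateway] = region
        return True


main = CmMain(
    backup_servers={10: "lsp-branch", 20: "lsp-other"},
    registrations={"mg-trunks": 10, "mg-hq": 1},
)
dropped = main.survivable_reports_active("lsp-branch")
accepted = main.register("mg-analog", 10)
```

After the survivable server for NR 10 reports in, the NR 10 gateway is dropped and a later registration attempt from NR 10 is refused, while the gateway in NR 1 is untouched.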
Once the network issues have been fixed, with SRPF there are only manual ways to force MGs and H.323 endpoints to fail back to CM-Main. One fix would be to log into CM-Survivable and disable the NRs. Another would be to disable PROCR on CM-Survivable. An even better solution is to reboot the CM-Survivable server, because then you don’t have to remember to come back to it in order to enable NRs and/or PROCR.
Implications of SRPF
Enabling SRPF has some big implications for an enterprise’s survivability design. The first limitation is that within an NR the MGC list of every MG must be limited to two entries. The first entry is the Processor Ethernet (PE) of CM-Main, and the second is the PE of a particular CM-Survivable. In other words, for any NR there can only be one survivable server.
Similarly, all H.323 phones within the NR must be configured with an Alternate Gatekeeper List (AGL) of just one CM-Survivable. The endpoints get that list from the NR’s “Backup Servers” list (pictured above). This also means the administrator must ensure that for each NR all the MGs’ controller lists match the endpoints’ AGL.
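These rules lend themselves to a quick audit. The sketch below assumes you have copied each gateway's MGC list and the NR's Backup Servers list into plain Python data by hand. Nothing here reads from a real CM or gateway, and the function and server names are hypothetical.

```python
def check_region(cm_main, backup_servers, mgc_lists):
    """Return a list of problems found in one network region's configuration.

    cm_main:        name or address of CM-Main's Processor Ethernet.
    backup_servers: the NR's Backup Servers list (source of the phones' AGL).
    mgc_lists:      gateway -> ordered MGC list.
    """
    problems = []
    if len(backup_servers) != 1:
        problems.append(
            f"region: expected 1 backup server, found {len(backup_servers)}"
        )
    for gateway, mgc in sorted(mgc_lists.items()):
        if len(mgc) != 2:
            problems.append(f"{gateway}: expected 2 MGC entries, found {len(mgc)}")
        elif mgc[0] != cm_main:
            problems.append(f"{gateway}: first MGC entry is not CM-Main")
        elif mgc[1] not in backup_servers:
            problems.append(f"{gateway}: survivable entry not in Backup Servers")
    return problems


problems = check_region(
    cm_main="cm-main",
    backup_servers=["lsp-branch"],
    mgc_lists={
        "mg-trunks": ["cm-main", "lsp-branch"],
        "mg-analog": ["cm-main", "lsp-other"],
    },
)
```

Here the second gateway is flagged because its survivable entry differs from the server the phones would use, which is exactly the mismatch that would let a split slip past SRPF.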
Almost always, if SRPF is enabled, Media Gateway Recovery Rules should not be used. However, in some configurations enabling both might be desirable. In this case, all MGs must be using an mg-recovery rule with the “Migrate H.248 MG to primary:” field set to “immediately” when the “Minimum time of network stability” is met (default is 3 minutes). Be very careful when enabling both features, because there is a danger that in certain circumstances SRPF and the Recovery Rule will effectively negate each other.
Finally, SRPF only works with H.248 MGs. Port Networks (PN) do not have a recovery mechanism like SRPF to deal with rogue PN behavior.
The Split Registration Prevention Feature (Force Phones and Gateways to Active Survivable Servers?) is enabled globally on the CM form: change system-parameters ip-options.
If I had not found Tim Kaye’s presentation, I would not have completely understood SRPF. So, now whenever I come across a presentation or document authored by him, I pay very close attention. He always provides insightful information.