Understanding Avaya Internet Protocol Server Interface Resets
I cut my teeth on Port Network (PN) outages when I joined Avaya’s Tier-3 backbone support back in 2006. I was assigned to support the S8700 series of duplexed Communication Manager (CM) servers just as CM 3 was being released. Back then, the timers were very tight, and a large percentage of my trouble tickets involved explaining to customers why an IPSI (Internet Protocol Server Interface) had reset, which in turn had caused a port network outage.
Avaya uses several different heartbeat mechanisms so that devices know if they have lost connectivity. In the case of port networks, which means any IPSI-controlled cabinet (such as a G650), the heartbeat is variously known as a Sanity Checkslot, Socket Sanity, or IPSI Sanity. This TCP heartbeat is sent to every IPSI every second by the active CM-Main (and, in CM-Duplex, also by the standby server). So, if you had CM-Duplex and duplicated IPSIs in each of the maximum of 64 port networks, 256 heartbeats (2 CM × 2 IPSI × 64 PN) would fly through the network each second.
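To make that arithmetic easy to rerun for a smaller system, here is a minimal Python sketch. The function and parameter names are my own for illustration; nothing here is an Avaya tool or API.

```python
def heartbeats_per_second(cm_servers: int, ipsis_per_pn: int,
                          port_networks: int, rate_hz: int = 1) -> int:
    """Heartbeats crossing the network each second.

    Assumes every CM server sends one heartbeat per second to every IPSI,
    as described in the text above.
    """
    return cm_servers * ipsis_per_pn * port_networks * rate_hz


# Duplexed CM, duplicated IPSIs, maximum of 64 port networks.
print(heartbeats_per_second(cm_servers=2, ipsis_per_pn=2, port_networks=64))

# Simplex CM with a single IPSI in each of 10 port networks.
print(heartbeats_per_second(cm_servers=1, ipsis_per_pn=1, port_networks=10))
```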
Originally, the IPSI reacted after only three consecutive heartbeats went missing. Starting in CM 3.13, the timer was administrable by an Avaya engineer, and in CM 5.0 it became administrable by customers on the CM change system-parameters ipserver-interface form. The IPSI Socket Sanity Timeout now defaults to 15 seconds (values: 3 to 15 seconds). Any data received from CM counts in place of a missing heartbeat.
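The timeout logic can be pictured with a small Python sketch. This is only my mental model of the behavior, not IPSI firmware: the class name, method names, and the idea of tracking a single "last heard" timestamp are all assumptions made for illustration.

```python
class SanityMonitor:
    """Toy model of the IPSI Socket Sanity Timeout (hypothetical names)."""

    def __init__(self, timeout_s: float = 15.0):
        # The form accepts 3 to 15 seconds; 15 is the default.
        if not 3 <= timeout_s <= 15:
            raise ValueError("timeout must be between 3 and 15 seconds")
        self.timeout_s = timeout_s
        self.last_heard = 0.0

    def on_traffic(self, now: float) -> None:
        """Call for a heartbeat OR any data from CM; both count as proof of life."""
        self.last_heard = now

    def timed_out(self, now: float) -> bool:
        """True once nothing has arrived from CM for the full timeout."""
        return now - self.last_heard >= self.timeout_s


monitor = SanityMonitor(timeout_s=3)   # the old, tight three-second behavior
monitor.on_traffic(now=10.0)           # last heartbeat or data at t=10
print(monitor.timed_out(now=12.0))     # still inside the window
print(monitor.timed_out(now=13.0))     # three silent seconds have passed
```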
Frequently, the cause of missing heartbeats is a speed/duplex mismatch: the IPSI is locked to 100 Mbps/full duplex while the Ethernet switch port is set to auto-negotiate (so the switch falls back to half duplex), or vice versa. Other common causes are not enabling quality of service (QoS) to give priority to IPSI traffic and not segregating the IPSI traffic into a separate physical or virtual LAN.
Upon detecting the outage, the IPSI assumes it is sick and reacts by performing a warm reset. During the warm reset, stable calls using resources within the PN stay up. However, new calls cannot be initiated and established calls cannot transition to another state (e.g., hold), because there is no connection to CM to manage such transactions. The IPSI’s warm reset generally takes only a few seconds.
If it still doesn’t get heartbeats or data from CM, then after a default of 60 seconds (values: 60 to 120 seconds) the IPSI escalates to a cold reset. All calls using resources within the PN are dropped. On the change system-parameters port-networks form, the PN cold reset delay timer can be modified.
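The escalation from normal operation to warm reset to cold reset can be sketched as a simple lookup on how long the IPSI has heard nothing from CM. The function name is hypothetical, and I am assuming that the cold reset delay starts counting when the sanity timeout expires; treat that as my reading of the behavior rather than a documented fact.

```python
def ipsi_phase(silent_s: float,
               sanity_timeout_s: float = 15,
               cold_reset_delay_s: float = 60) -> str:
    """Classify the IPSI's state after `silent_s` seconds without CM traffic.

    Assumption: the cold reset delay is measured from the moment the
    sanity timeout expires (i.e., from the warm reset).
    """
    if silent_s < sanity_timeout_s:
        return "normal"
    if silent_s < sanity_timeout_s + cold_reset_delay_s:
        return "warm reset: stable calls stay up, no new call activity"
    return "cold reset: all calls in the PN are dropped"


for t in (5, 15, 74, 75):
    print(t, "->", ipsi_phase(t))
```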
Next, the IPSI waits for the No Service Time Out Interval, which defaults to 5 minutes (values: 2 to 15 minutes). During that time, while the IPSI is waiting for communication from CM-Main, the resources within that PN are unavailable. Note that if even one heartbeat gets through, perhaps on a flapping WAN circuit, the timer resets and the countdown starts from the beginning. If the No Service Time Out timer expires, the IPSI then attempts to register to a CM-Survivable Core (SC), formerly known as an Enterprise Survivable Server.
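The effect of a flapping circuit on the No Service timer is easier to see with numbers. The sketch below is a simplification with a hypothetical function name: it takes the times (in seconds after the wait begins) at which stray heartbeats arrive and returns when the timer would finally expire, assuming each heartbeat simply restarts the countdown.

```python
def failover_attempt_time(stray_heartbeats_s, no_service_s: float = 300) -> float:
    """Seconds after the No Service wait begins at which the timer expires.

    Assumption: every heartbeat that arrives before expiry restarts the
    full countdown, and does nothing else.
    """
    deadline = no_service_s
    for t in sorted(stray_heartbeats_s):
        if t >= deadline:
            break                      # timer had already expired
        deadline = t + no_service_s    # countdown starts over
    return deadline


print(failover_attempt_time([]))            # clean outage
print(failover_attempt_time([100, 350]))    # two stray heartbeats delay failover
```

With the default 5-minute timer, a clean outage reaches failover at 300 seconds, while stray heartbeats at 100 and 350 seconds push it out to 650 seconds. That is why a flapping link can leave a PN out of service far longer than a hard-down link.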
Each IPSI manages its own prioritized list of addresses for up to seven CM-SCs, plus the CM-Main, which is always first on the list. The server list is actually determined by how the CM-SCs are configured: it is the job of each CM-SC to advertise its own values to the IPSIs so that each IPSI can generate the appropriate list of eight server addresses. A customer can have up to 63 CM-SCs. Note that IPSIs cannot register to CM-Survivable Remote servers (formerly known as Local Survivable Processors).
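As a rough illustration of the list shape only, the sketch below puts CM-Main first and keeps at most seven survivable cores. The function name and server names are hypothetical, and the input is assumed to be already ranked for this particular IPSI; how that ranking is derived is deliberately not modeled.

```python
def build_server_list(cm_main: str, ranked_survivable_cores, max_sc: int = 7):
    """Return the IPSI's priority list: CM-Main first, then up to seven CM-SCs.

    Assumption: `ranked_survivable_cores` is already in priority order
    for this IPSI.
    """
    return [cm_main] + list(ranked_survivable_cores)[:max_sc]


# A system with 10 CM-SCs: only the top seven make this IPSI's list.
cores = [f"cm-sc-{n}" for n in range(1, 11)]
server_list = build_server_list("cm-main", cores)
print(len(server_list), server_list[0], server_list[-1])
```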
The preference setting (System Preferred, Local Preferred, or Local Only), along with the Community Size and Priority Score fields, determines the server’s priority on each IPSI’s list. How to weight these values is beyond the scope of this article.
Each server in a CM-Duplex configuration constantly compares its health to the other’s. One statistic among many that they compare is how many IPSIs each one can communicate with right now. If the standby server can communicate with more IPSIs than the active server, the standby takes over and makes itself active. This can cause frequent server interchanges if an unreliable WAN link to a PN causes some of the heartbeats to get lost. So, Avaya introduced the option to Ignore Connectivity in Server Arbitration on the change ipserver-interface n form, which can reduce such interchanges.
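Here is a minimal sketch of just the IPSI-count part of that comparison. Real arbitration weighs many other health statistics that are not modeled here, and the function and parameter names are mine, not Avaya’s. IPSIs flagged to be ignored are simply left out of both counts.

```python
def standby_should_take_over(active_sees, standby_sees, ignored=frozenset()):
    """Compare how many IPSIs each server can reach right now.

    `active_sees` / `standby_sees`: sets of IPSI (PN) numbers reachable
    from each server. `ignored`: IPSIs administered with Ignore
    Connectivity in Server Arbitration. Only this one statistic is modeled.
    """
    active_count = len(set(active_sees) - set(ignored))
    standby_count = len(set(standby_sees) - set(ignored))
    return standby_count > active_count


# PN 3 sits behind a flaky WAN link; the active server just lost it.
print(standby_should_take_over({1, 2}, {1, 2, 3}))                # interchange
print(standby_should_take_over({1, 2}, {1, 2, 3}, ignored={3}))   # no interchange
```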
I have ignored duplicated IPSIs because I am not a big fan of them. Most of the IPSI-related tickets I’ve received were caused by network issues that duplicated IPSIs would not have protected against.
Based on my experience, I recommend helping calls in progress stay up as long as they can by raising the PN cold reset delay timer to 120 seconds. I then suggest hurrying the registration to a CM-SC by setting the No Service Time Out Interval to 2 minutes.
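To compare the default timers with these recommended values, the sketch below adds the timers up. It assumes the three timers run back to back (sanity timeout, then cold reset delay, then No Service Time Out) and that the sanity timeout stays at its default of 15 seconds; both are simplifying assumptions, and the names are hypothetical.

```python
def outage_timeline(sanity_timeout_s: int, cold_reset_delay_s: int,
                    no_service_min: int) -> dict:
    """Seconds from the last CM traffic to each milestone.

    Assumption: the timers run strictly one after another.
    """
    calls_dropped_at = sanity_timeout_s + cold_reset_delay_s
    failover_attempt_at = calls_dropped_at + no_service_min * 60
    return {"calls_dropped_at": calls_dropped_at,
            "failover_attempt_at": failover_attempt_at}


defaults = outage_timeline(sanity_timeout_s=15, cold_reset_delay_s=60, no_service_min=5)
recommended = outage_timeline(sanity_timeout_s=15, cold_reset_delay_s=120, no_service_min=2)
print("defaults:   ", defaults)
print("recommended:", recommended)
```

Under these assumptions, the recommended settings keep stable calls up for 135 seconds instead of 75, yet the IPSI tries a CM-SC after 255 seconds instead of 375.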
Although PNs are fading from Avaya’s product mix, they are a solid technology representing 30 years of development. Many customers will rely on them for years to come.