
HARC – Operation and Recovery

Introduction

The behavior of the UCX components when operating in a High Availability cluster configuration depends heavily on the network topology and on the capabilities of the individual devices (particularly third-party devices) connected to it. This document describes the various failure modes and the expected recovery times of the system components.

Operating Modes

Once a node in the High Availability cluster has been assigned the role of the Active node, both nodes monitor the network for communication with the other node.  When the Standby node no longer detects that connection, it will initiate an internal process to change roles and become the Active node.  Once it becomes the Active node, it will only change roles back to a Standby node if it is instructed to do so through the High Availability Console interface.
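
The role rules above can be sketched as a small state model. This is an illustrative toy only, not the UCX implementation: the `Node` class and its method names are hypothetical, and the peer check is reduced to a boolean.

```python
from enum import Enum


class Role(Enum):
    ACTIVE = "active"
    STANDBY = "standby"


class Node:
    """Toy model of one cluster node (hypothetical, for illustration)."""

    def __init__(self, name: str, role: Role) -> None:
        self.name = name
        self.role = role

    def on_peer_check(self, peer_reachable: bool) -> None:
        # A Standby node that no longer detects its peer promotes itself.
        if self.role is Role.STANDBY and not peer_reachable:
            self.role = Role.ACTIVE
        # An Active node never demotes itself automatically,
        # even when the peer becomes reachable again.

    def on_console_switch_to_standby(self) -> None:
        # Only an explicit instruction from the console returns a node
        # to the Standby role.
        self.role = Role.STANDBY


secondary = Node("secondary", Role.STANDBY)
secondary.on_peer_check(peer_reachable=False)   # link lost: becomes Active
secondary.on_peer_check(peer_reachable=True)    # peer returns: stays Active
```

The asymmetry is the point of the sketch: promotion is automatic, while demotion requires an operator action.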

The operation of the High Availability cluster is therefore dependent on three factors:

  1. Communication between the nodes operating in the Active and Standby roles
  2. Network Connectivity of the node operating in the Standby role
  3. Network Connectivity of the node operating in the Active role

Communication between the nodes operating in the Active and Standby roles

Losing communication between the Active and Standby nodes while both nodes are still operational can result in both nodes operating in Active mode. The resulting behavior is unpredictable because it is highly dependent on the entire network topology.  This possibility can be practically eliminated if the nodes are connected to the same Layer 2 (Ethernet) switch.

NOTE

To ensure predictable behavior of the High Availability cluster, E-MetroTel recommends connecting both the Active and Standby nodes to the same Layer 2 (Ethernet) switch at the same location.

Network Connectivity of the node operating in the Standby role

The node operating in the Standby role can lose connectivity to the Active node in a number of ways, such as losing power, being gracefully shut down or restarted through the UCx Web-based Configuration Utility, or the network cable being disconnected at the UCX or the switch.  In all cases, the current operation of the High Availability cluster is not impacted, as long as the Active node is still connected to the network and running.  Once the node in the Active role has detected the loss of communication, the cluster will change the status of the Standby node on the Console page on the High Availability tab, highlighting the Secondary node in red, with no information regarding the Status, Inter-node Link, or HA Resources.
UCX70HAMidRebootonSecondaryState_0.png

It is also important to note that until the node in the Standby role has recovered and is detected by the High Availability cluster, the Action buttons are not available, because redundancy is temporarily unavailable.  Once the node in the Standby role has recovered and is communicating with the High Availability cluster, the Console will reflect this as follows:

Network Connectivity of the node operating in the Active role

The node in the Active role can also lose connectivity to the Standby node in a number of ways, such as losing power, being gracefully shut down or restarted through the UCx Web-based Configuration Utility, or the network cable being disconnected at the UCX or the switch.  In all cases, the node in the Standby role will detect the loss of communication and will initiate the process of changing its role to Active. In that role, it will begin to communicate using the IP address of the High Availability cluster.  The process to detect the loss and switch roles takes approximately 15 to 60 seconds. Once the role change has been completed, this is reflected in the Console page as follows:
UCX70HAStandbyPrimaryRebootStarted.png

As is the case above for loss of communication from the node in the Standby role, it is important to note that until the node formerly in the Active role (which was configured as Primary in this example) has recovered and is detected by the High Availability cluster, the Action buttons are not available, because redundancy is temporarily unavailable.  Once the node formerly in the Active role has recovered and is communicating with the High Availability cluster, it will remain in the Standby role until forced to switch roles or until the currently Active system goes off-line.  The Console will reflect this as follows:

Maintenance Mode

Once a cluster is operational with either the Primary or Secondary node assigned the Active role, the system can be manually placed in Maintenance mode through the Console configuration page. Whichever node is Active will continue to be the operating node with the cluster IP address assigned to it. The active node will continue to send database synchronization information to the standby node so that it can keep its own database up to date, but there will be no attempt to switch operation from the active node to the standby node (the one in Maintenance mode), either automatically (even in the event of a failure of the Active node) or manually (through the Console interface). This mode is provided to prevent unnecessary switchovers should the network connectivity between the two nodes become unstable, for example if network maintenance is required on that link. Because data synchronization from the active node to the standby node continues, the standby node can be manually re-activated and placed into service after the active node has experienced an outage.
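
The effect of Maintenance mode on switchover decisions can be summarized in a short sketch. The function names below are hypothetical and only restate the rules in the paragraph above; they do not correspond to any UCX interface.

```python
def manual_switchover_allowed(standby_in_maintenance: bool) -> bool:
    # A manual switchover from the Console is blocked while the
    # standby node is in Maintenance mode.
    return not standby_in_maintenance


def standby_takes_over(peer_reachable: bool, standby_in_maintenance: bool) -> bool:
    # Automatic takeover requires both a lost peer and a standby node
    # that is not in Maintenance mode.
    return (not peer_reachable) and (not standby_in_maintenance)


# Unstable inter-node link during network maintenance: no switchover occurs.
print(standby_takes_over(peer_reachable=False, standby_in_maintenance=True))
```

Database synchronization is deliberately absent from the sketch, since it continues regardless of Maintenance mode.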

NOTE

If the standby node status is set to maintenance and a software update is performed on the cluster, the software update process will activate the standby node. In other words, after a successful software update, both nodes will be active.

High Availability Cluster Recovery

Since all communication from peripheral devices and supporting equipment has been previously configured to use the cluster IP address, the system will be able to communicate with each network device without any user intervention, although different components will recover at different rates.  These are described below.

E-MetroTel XStim (DSM digital phones and E-MetroTel/Nortel/Avaya/Panasonic IP Phones)

Existing calls will be dropped. If the phones are in the same LAN infrastructure as the HA cluster, they should be fully operational for initiating and receiving new calls within 30 seconds of the new Active node becoming operational. If the phones are remotely connected, then the reconnection process will be automatically initiated by the phones within 90 to 120 seconds.  Manual intervention at the phone using the “Retry Now” softkey when it is displayed can shorten the recovery time.

Nortel/Avaya Digital and Analog Phones connected via an MGC

Existing calls will be dropped.  The MGC will typically recover communication with the cluster within about 60 seconds.  However, in some scenarios this process can take up to 11 minutes, because certain timeout parameters on the MGC must expire before it detects the loss and restarts the communication.  If access to the MGC is available, power cycling the MGC may shorten the overall recovery process.

InfinityOne Desktop Clients

Existing calls will be dropped.  The client application will also be logged out and the user will need to log back in.  Once logged in, the softphone will be immediately available for making and receiving calls.

SIP Phones

Existing calls will be dropped.  The phones will be able to make new calls as soon as the Secondary node becomes Active.  The phones will need to re-register as part of the standard SIP protocol process before being able to receive new incoming calls.  The re-registration process is controlled by the SIP phone configuration, and happens on an interval specified by the phone.  Some manufacturers have extremely long default values for this interval (even as much as one or two hours).

NOTE

E-MetroTel recommends setting the phone's re-registration interval to a value on the order of 60 to 120 seconds.

NOTE

The method for setting this parameter can vary for each manufacturer and phone type. Please consult the phone's configuration documentation for specific instructions.
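
To see why the re-registration interval matters, the following sketch gives a rough upper bound on how long a SIP phone may be unable to receive calls after a failure of the Active node. It assumes the phone re-registers only on its timer and retries a failed attempt one interval later; real phones may retry sooner. The function name is hypothetical.

```python
def worst_case_inbound_outage_s(switchover_s: float, reregister_interval_s: float) -> float:
    """Rough upper bound, in seconds, before incoming calls work again.

    Assumption: a registration attempt made while the switchover is still
    in progress fails, and the phone tries again one full interval later.
    """
    return switchover_s + reregister_interval_s


# Slowest quoted switchover (60 s) with a two-hour phone default.
print(worst_case_inbound_outage_s(60, 2 * 60 * 60))   # 7260 seconds

# Same switchover with the recommended upper value of 120 s.
print(worst_case_inbound_outage_s(60, 120))           # 180 seconds
```

Under these assumptions, shortening the interval from two hours to 120 seconds reduces the bound from roughly two hours to three minutes.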

SIP Trunks

Existing calls will be dropped.  The trunks will recover based on the timeout of the Registration settings in the UCX and at the Trunk Provider.  UCX Registration timers are set on the SIP Settings menu item of the PBX Configuration page of the PBX tab in the UCX Web-based Configuration Utility.  The default Registration Timer expiry on the UCX is 120 seconds.

Hospitality and HOBIC interfaces

The protocols used for these interfaces have their own mechanism for automatic reconnection after a temporary loss of communication.  Since the protocols also include an acknowledgement mechanism, any messages sent during the loss of communication will be resent after communication is restored.  The UCX High Availability cluster must first enable the Hospitality service as part of the role change for the soon-to-be Active node.  This process can take between 60 and 120 seconds, but the end-to-end recovery time will also depend on the reconnect timer settings on the PMS platform.
