Monday, August 31, 2015

EFM with Elastic Load Balancing on AWS

EnterpriseDB Failover Manager (EFM) has built-in support for virtual IP address use, but a VIP isn't appropriate for every environment, particularly for cloud environments where your nodes may be spread across different regions/zones/networks. When using Amazon Web Services, an Elastic IP Address (EIP) can be used instead through EFM's fencing script ability. An EIP isn't always the best choice, however, if you don't want a public IP address. This blog describes using Elastic Load Balancing (ELB) as an alternative. An ELB has the advantage that it can be completely internal to a VPC -- no public IP address will be involved -- while still being able to span multiple availability zones.

The ELB getting started guide is here, but the steps are fairly simple. Starting one from the AWS console involves the following (a rough CLI equivalent is sketched after the list):
  1. Giving the new balancer a name and picking your VPC (or using "EC2 Classic").
  2. If using a VPC, choose whether the load balancer should be internal or Internet-facing. If internal, you will also select the subnets and security groups to use in a later step.
  3. Choose the protocols/ports to listen for traffic to forward.
  4. Configure security settings if a secure protocol (SSL, HTTPS) was picked in step 3.
  5. Configure health checks.
  6. Add instances.
  7. Optionally, add tags for organizing your AWS resources.
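For reference, roughly the same creation step can be done with the AWS command line interface, which we install further down. This is only a sketch, assuming a database listening on port 5444; the load balancer name, subnet, and security group are placeholders for your own values:

aws elb create-load-balancer \
    --load-balancer-name <load balancer name> \
    --scheme internal \
    --listeners "Protocol=TCP,LoadBalancerPort=5444,InstanceProtocol=TCP,InstancePort=5444" \
    --subnets <subnet id> \
    --security-groups <security group id>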
For step 5 above, you don't have to configure the health checks in great detail. If EFM is managing failover, you just want the ELB to send traffic to the current master; you don't need the ELB itself deciding whether the master is alive. I use TCP checks against my database's port with rather large values for the timeout, interval, and unhealthy threshold. You can set the "healthy" threshold as low as 2 checks, but this shouldn't normally come into play.
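As a sketch of what such a loose health check might look like from the CLI (again assuming the database listens on port 5444, with the balancer name as a placeholder):

aws elb configure-health-check \
    --load-balancer-name <load balancer name> \
    --health-check Target=TCP:5444,Interval=60,Timeout=30,UnhealthyThreshold=10,HealthyThreshold=2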

For step 6 above, we'll use imaginary instances "i-aaaaaaaa" for the master, and "i-bbbbbbbb" and "i-cccccccc" for two replica nodes. (For more information on setting up EFM with three database nodes, you can see a video example here or follow the user's guide.) Initially, you only want to add the master database node to the ELB. In effect, we're using the ELB as a private EIP. After the ELB has been created, you will have options for the address to use on the balancer's description tab. Note that your instances can be spread across availability zones when using this feature.
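If you prefer the CLI for step 6 as well, adding only the master instance at creation time might look like the following (using the imaginary instance ID from above):

aws elb register-instances-with-load-balancer \
    --load-balancer-name <load balancer name> \
    --instances i-aaaaaaaa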

From the EFM perspective, this cluster is no different from any other 3-node cluster in which EFM is not managing a virtual IP address. To have EFM switch the load balancer after a failover, we need to add a script that is called when a standby is promoted. EFM 2.0 provides two hooks for user-supplied scripts: a fencing script that is used to fence off the old master (e.g. from a load balancer) and a post-promotion script for any work that needs to be done once promotion has completed. In this case, I recommend doing all of the work in the fencing script. The load balancer changes take only a couple of seconds, and even if promotion hasn't finished by then, the load balancer will see that the newly added instance is in service. See chapter 3 of the EFM user's guide for more information on the fencing script property.

Our script will be using the AWS command line interface to alter the ELB's instances. More information on installing the CLI can be found here. These are the steps I took on the database nodes, though there are several ways to install the tools:
  1. curl "https://s3.amazonaws.com/aws-cli/awscli-bundle.zip" -o "awscli-bundle.zip"
  2. yum install -y unzip (if needed)
  3. unzip awscli-bundle.zip
  4. ./awscli-bundle/install -i /usr/local/aws -b /usr/local/bin/aws
  5. aws configure
I ran the first four steps as root, though they could also be done through sudo. Because the fencing script will be run as the 'efm' user, I ran the final configuration step as 'efm'. If you don't want this user to have that much access to your AWS account, then the fencing script can simply be "sudo the_actual_script", with the appropriate sudoers permissions in place, where the_actual_script is the script we define below.
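As a sketch of that sudo approach, the sudoers entry might look something like the following; /var/efm/the_actual_script.sh is a hypothetical path standing in for the script defined below:

# /etc/sudoers.d/efm-fencing (hypothetical file; adjust the path and script name)
efm ALL=(root) NOPASSWD: /var/efm/the_actual_script.sh
# On RHEL/CentOS 6 you may also need to exempt the user from requiretty:
# Defaults:efm !requiretty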

With the above initial steps completed, the fencing script is quite simple. On each node, the script will remove either/both of the two other nodes from the ELB and then add itself. Here is the example script on the i-aaaaaaaa node. You may want this script only on the standby nodes, but it can also be useful on the master in case you ever bring it back up as a standby, or simply to reconfigure the balancer if a failed master is brought back online.
[root@ONE ]}> cat /var/efm/pickme.sh
#!/bin/sh

export ELB_NAME=<load balancer name>
export OLD="i-bbbbbbbb i-cccccccc"
export NEW=i-aaaaaaaa

aws elb deregister-instances-from-load-balancer --load-balancer-name ${ELB_NAME} --instances ${OLD}
aws elb register-instances-with-load-balancer --load-balancer-name ${ELB_NAME} --instances ${NEW}
That's all there is to the script. Make sure the 'efm' user can run the script on each node; you can run it by hand and refresh the AWS console to see the changes take effect (a quick check is sketched below). More information on the 'aws elb' commands is on this page.
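A quick way to test it, assuming the script and names above:

# Run the fencing script as the user EFM will use:
sudo -u efm /var/efm/pickme.sh

# Then confirm which instances the balancer knows about and their state:
sudo -u efm aws elb describe-instance-health --load-balancer-name <load balancer name>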

The final step is to specify this script in your efm.properties file. It will then be run right before the trigger file is created on a standby that is being promoted. In this example, the property would be (along with comments in the file):
# Absolute path to fencing script run during promotion
#
# This is an optional user-supplied script that will be run during
# failover on the standby database node.  If left blank, no action will
# be taken.  If specified, EFM will execute this script before promoting
# the standby. The script is run as the efm user.
#
# NOTE: FAILOVER WILL NOT OCCUR IF THIS SCRIPT RETURNS A NON-ZERO EXIT CODE.
script.fence=/var/efm/pickme.sh
The script should take only a couple of seconds to run, and the ELB will take another few seconds to decide that the newly added instance is in service and available for traffic. After that, traffic will be sent to your new master database.

Wednesday, August 26, 2015

Video: EFM 2.0 Installation and Startup


We've posted two new videos about installing EnterpriseDB Failover Manager and starting a new EFM cluster.

The first video shows the installation and setup of EFM.

The second video shows the steps involved in starting a failover manager cluster. After the initial cluster has started, we show more information about the new .nodes file as nodes are added to the cluster.

For more information, see the EFM 2.0 documentation.

Wednesday, July 29, 2015

Changes for EDB Failover Manager 2.0

EDB Failover Manager 2.0 includes several changes from 1.X. The user's guide contains information on upgrading and a full description of the properties. This blog gives a little more detail about just some of the new features.

New 'efm' command


The service interface now contains only the standard commands, such as start, stop, and status (the same is true for systemctl on RHEL 7). For everything else, there is a new 'efm' script installed in /usr/efm-2.0/bin. This script is used for commands such as cluster-status, stop-cluster, encrypt, etc., in addition to the new commands in version 2.0. A full description is here.
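As a rough example of the split, assuming the service is registered as efm-2.0 and the cluster uses the default name 'efm' (yours may differ):

# Service-level control stays with the usual mechanisms:
service efm-2.0 start

# Everything else goes through the new script, passing the cluster name:
/usr/efm-2.0/bin/efm cluster-status efm
/usr/efm-2.0/bin/efm stop-cluster efm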

Cluster name simplification


Every failover manager cluster running in the same network should have a unique cluster name. In 1.X, there were two separate places where the cluster name was specified: the service script (so that it could find the properties file location) and the properties file itself (to define the name used by jgroups for clustering).

Version 2.0 simplifies this by using the convention that your cluster name is the same as the .properties and .nodes file names, and the files are expected in the /etc/efm-2.0 directory. Thus, a single parameter in the service script tells it what file information needs to be passed into the agent at startup. Likewise, passing the cluster name into the 'efm' script tells the script where to find the needed files in order to connect to a running agent (for instance, when running the 'cluster-status' command). There is no more cluster.name parameter in the properties file.

This makes it even harder to accidentally run two clusters that interfere with each other, and cuts down on the information needed to run failover manager. Section 4.9 of the user's guide has full information on how to run more than one cluster at a time, using separate cluster names. The change also simplifies the password text encryption, because you no longer need to save the cluster name in a properties file before running the encrypt utility.
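For example, a hypothetical second cluster named 'acctg' would simply mean a second pair of files alongside the defaults, with that name passed to the 'efm' script (see section 4.9 of the user's guide for the full steps):

/etc/efm-2.0/efm.properties      # default cluster named "efm"
/etc/efm-2.0/efm.nodes
/etc/efm-2.0/acctg.properties    # hypothetical second cluster named "acctg"
/etc/efm-2.0/acctg.nodes

/usr/efm-2.0/bin/efm cluster-status acctg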

Specifying initial cluster addresses


In 1.X, a cluster always had exactly 3 nodes, with the addresses never changing. You specified the addresses for these in each properties file. This accomplished two things:
  1. The cluster knew which node addresses were allowed to join.
  2. An agent, at startup, knew which addresses to contact to find the other cluster members.

EFM 2.0 supports an arbitrary number of standby (or, for that matter, witness) nodes. You may not know all of the addresses when starting the initial members -- you might, for instance, add another standby months after the cluster was started. So now the properties file doesn't contain agent/witness addresses. Each properties file records only that node's binding address (which was inferred from agents/witness properties in 1.X) and whether or not a node is a witness node.
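Assuming the property names from the 2.0 user's guide, the relevant portion of a standby's properties file would look something like this (the address and port below are placeholders):

# This node's address and the port used for EFM communication:
bind.address=172.16.144.152:7800

# Whether or not this node is a witness (no local database to manage):
is.witness=false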

After starting the first member, the two steps above are more explicit. For step 1, the 'efm' utility is used to add a new node's information to the list of allowed addresses. For step 2, you now start an agent with a list of existing cluster members in a .nodes file, kept in the same directory as the properties file.

After an agent joins the cluster, EFM will keep this file up-to-date for you as other nodes join or leave the cluster. Section 4.2 of the user's guide walks you through these steps.
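As a sketch, the .nodes file for a node being started might simply list the addresses and ports of the members that are already running (the addresses here are placeholders; see section 4.2 of the user's guide for the exact format):

cat /etc/efm-2.0/efm.nodes
172.16.144.151:7800
172.16.144.152:7800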

Wednesday, June 3, 2015

IPv6 and CentOS 6.6 -- SocketException: Permission denied

What I don't know about IPv6 would fill a very long web page.

When verifying that EnterpriseDB Failover Manager (EFM) would work with IPv6 on CentOS 6.6, my connections (using JGroups) would fail at the socket level with java.net.SocketException: Permission denied.

A short Java app reproduces the problem:

[root@FOUR ~]}> cat IPv6Test.java
import java.net.InetAddress;
import java.net.Socket;
public class IPv6Test {
    public static void main(String[] args) {
        try {
            InetAddress ia = InetAddress.getByName("fe80::20c:29ff:feb0:ba66");
            System.err.println("Opening socket for: " + ia);
            Socket socket = new Socket(ia, 22);
            System.err.println("We have: " + socket);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
[root@FOUR ~]}> javac IPv6Test.java && java IPv6Test
Opening socket for: /fe80::20c:29ff:feb0:ba66
java.net.SocketException: Permission denied
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at [....]

I originally thought this was a problem at the OS level with creating the socket in the first place, but now I understand that the error could be coming back from the remote node, and this is simply how Linux reports it to the client. In case others run into this, here are the configuration changes needed to get a proper IPv6 setup working. First, disable the usual suspects (if you know how to properly configure NetworkManager and ip6tables, feel free to do so instead of killing them):

[root@FOUR ~]}> grep disabled /etc/selinux/config
#     disabled - No SELinux policy is loaded.
SELINUX=disabled
[root@FOUR ~]}> service NetworkManager stop
Stopping NetworkManager daemon:                            [  OK  ]
[root@FOUR ~]}> chkconfig NetworkManager off
[root@FOUR ~]}> service ip6tables stop
ip6tables: Setting chains to policy ACCEPT: filter         [  OK  ]
ip6tables: Flushing firewall rules:                        [  OK  ]
ip6tables: Unloading modules:                              [  OK  ]
[root@FOUR ~]}> chkconfig ip6tables off

At this point, the remaining problem is that each node has only a link-local scoped IPv6 address:

[root@THREE ~]}> ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0C:29:B0:BA:66  
          inet addr:172.16.144.153  Bcast:172.16.144.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:feb0:ba66/64 Scope:Link
          [….]

What we want is a globally-scoped address for each node. After a little editing, this is my current eth0 config:

[root@FOUR ~]}> cat /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
BOOTPROTO=dhcp
IPV6INIT=yes
IPV6ADDR=fdd9:6fe6:6a5b:3835::4
IPV6FORWARDING=no
IPV6_AUTOCONF=no
ONBOOT=yes
TYPE=Ethernet
PEERDNS=yes

The IPv6 address was found using this link, with ::1, ::2, etc. then used for the various virtual machines. After the above changes and a reboot, I now have a proper global IPv6 address on my nodes:

[root@THREE ~]}> ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0C:29:B0:BA:66  
          inet addr:172.16.144.153  Bcast:172.16.144.255  Mask:255.255.255.0
          inet6 addr: fdd9:6fe6:6a5b:3835::3/64 Scope:Global
          inet6 addr: fe80::20c:29ff:feb0:ba66/64 Scope:Link
          [….]

…and no more SocketException with the new address:

[root@FOUR ~]}> javac IPv6Test.java && java IPv6Test
Opening socket for: /fdd9:6fe6:6a5b:3835:0:0:0:3
We have: Socket[addr=/fdd9:6fe6:6a5b:3835:0:0:0:3,port=22,localport=49971]

My thanks to the JGroups users forum (and Bela), the OpenJDK java.net list, and Dave Page for their help getting me back on track with IPv6.