19.18 Highly Available Storage (HAST)

Contributed by Daniel Gerzo. With inputs from Freddie Cash, Pawel Jakub Dawidek, Michael W. Lucas, and Viktor Petersson.

19.18.1 Synopsis

High availability is one of the main requirements in serious business applications, and highly available storage is a key component in such environments. Highly Available STorage, or HAST, was developed by Pawel Jakub Dawidek as a framework which allows transparent storage of the same data across several physically separated machines connected by a TCP/IP network. HAST can be understood as a network-based RAID1 (mirror), and is similar to the DRBD® storage system known from the GNU/Linux® platform. In combination with other high-availability features of FreeBSD like CARP, HAST makes it possible to build a highly available storage cluster that is resistant to hardware failures.

After reading this section, you will know:

Before reading this section, you should:

The HAST project was sponsored by The FreeBSD Foundation with support from OMCnet Internet Service GmbH and TransIP BV.

19.18.2 HAST Features

The main features of the HAST system are:

19.18.3 HAST Operation

HAST provides synchronous block-level replication of any storage media to several machines, so it requires at least two nodes (physical machines): the primary (also known as master) node and the secondary (slave) node. Together, these two machines are called a cluster.

Note: HAST is currently limited to two cluster nodes in total.

Since HAST works in a primary-secondary configuration, only one of the cluster nodes can be active at any given time. The primary node, also called the active node, handles all I/O requests to HAST-managed devices. The secondary node is automatically synchronized from the primary node.

The physical components of the HAST system are:

HAST operates synchronously on a block level, which makes it transparent to file systems and applications. HAST provides regular GEOM providers in the /dev/hast/ directory for use by other tools or applications, so there is no difference between using HAST-provided devices and raw disks, partitions, etc.

Each write, delete, or flush operation is sent to the local disk and to the remote disk over TCP/IP. Each read operation is served from the local disk, unless the local disk is not up-to-date or an I/O error occurs. In such a case, the read operation is sent to the secondary node.

19.18.3.1 Synchronization and Replication Modes

HAST tries to provide fast failure recovery. For this reason, it is very important to reduce synchronization time after a node's outage. To provide fast synchronization, HAST manages an on-disk bitmap of dirty extents and only synchronizes those during a regular synchronization (with the exception of the initial sync).
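
To illustrate why tracking dirty extents shortens resynchronization, the following sketch works through the arithmetic. The 2 MB extent size, the 100 GB provider size, and the function names are assumptions made for this example; they are not read from a running HAST system.

```sh
#!/bin/sh
# Sketch: arithmetic behind dirty-extent synchronization.
# All sizes are in bytes. The figures used are illustrative assumptions.

# Number of extents needed to cover a provider (rounded up).
extent_count() {
        echo $(( ($1 + $2 - 1) / $2 ))
}

# Index of the extent that contains a given byte offset.
extent_index() {
        echo $(( $1 / $2 ))
}

# Bytes to copy when a given number of extents are marked dirty.
resync_bytes() {
        echo $(( $1 * $2 ))
}

ext_size=$(( 2 * 1024 * 1024 ))              # assumed extent size: 2 MB
prov_size=$(( 100 * 1024 * 1024 * 1024 ))    # assumed provider size: 100 GB

echo "extents on provider: $(extent_count $prov_size $ext_size)"
echo "offset 5 MB falls in extent: $(extent_index $(( 5 * 1024 * 1024 )) $ext_size)"
echo "bytes to copy for 10 dirty extents: $(resync_bytes 10 $ext_size)"
```

Under these assumptions, a short outage that dirtied ten extents requires copying about 20 MB rather than the whole 100 GB provider.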

There are many ways to handle synchronization. HAST implements several replication modes to handle different synchronization methods:

  • memsync: report write operation as completed when the local write operation is finished and when the remote node acknowledges data arrival, but before actually storing the data. The data on the remote node will be stored directly after sending the acknowledgement. This mode is intended to reduce latency, but still provides very good reliability. The memsync replication mode is currently not implemented.

  • fullsync: report write operation as completed when local write completes and when remote write completes. This is the safest and the slowest replication mode. This mode is the default.

  • async: report write operation as completed when local write completes. This is the fastest and the most dangerous replication mode. It should be used when replicating to a distant node where latency is too high for other modes. The async replication mode is currently not implemented.

Warning: Only the fullsync replication mode is currently supported.
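
The difference between the three modes comes down to which events must have happened before a write is reported as completed. The sketch below models only that decision as described in the list above; it is not HAST code, and the function name and the yes/no arguments are hypothetical.

```sh
#!/bin/sh
# Sketch: when is a write reported as completed in each replication mode?
# Arguments: mode, local write finished, remote acknowledged arrival,
# remote write finished (each flag is "yes" or "no").
write_completed() {
        mode=$1; local_done=$2; remote_ack=$3; remote_done=$4

        # Every mode requires the local write to be finished first.
        [ "$local_done" = yes ] || return 1

        case "$mode" in
        async)    return 0 ;;                   # local completion is enough
        memsync)  [ "$remote_ack" = yes ] ;;    # remote has received the data
        fullsync) [ "$remote_done" = yes ] ;;   # remote has stored the data
        *)        return 2 ;;                   # unknown mode
        esac
}

# Data has arrived at the remote node but is not yet stored there.
for mode in async memsync fullsync; do
        if write_completed "$mode" yes yes no; then
                echo "$mode: write reported as completed"
        else
                echo "$mode: still waiting"
        fi
done
```

In this scenario only fullsync is still waiting, which is why it is the safest and the slowest of the three modes.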

19.18.4 HAST Configuration

HAST requires GEOM_GATE support in order to function. The GENERIC kernel does not include GEOM_GATE by default. However, the geom_gate.ko loadable module is available in the default FreeBSD installation. For stripped-down systems, make sure this module is available. Alternatively, it is possible to build GEOM_GATE support into the kernel statically by adding the following line to the custom kernel configuration file:

options        GEOM_GATE

The HAST framework consists of several parts from the operating system's point of view:

The following example describes how to configure two nodes in master-slave / primary-secondary operation using HAST to replicate the data between the two. The nodes will be called hasta, with the IP address 172.16.0.1, and hastb, with the IP address 172.16.0.2. Both of these nodes will have a dedicated hard drive /dev/ad6 of the same size for HAST operation. The HAST pool (sometimes also referred to as a resource, i.e. the GEOM provider in /dev/hast/) will be called test.

HAST is configured in the /etc/hast.conf file. This file should be the same on both nodes. The simplest possible configuration is the following:

resource test {
        on hasta {
                local /dev/ad6
                remote 172.16.0.2
        }
        on hastb {
                local /dev/ad6
                remote 172.16.0.1
        }
}

For more advanced configuration, please consult the hast.conf(5) manual page.
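
Because the file has to be identical on both nodes, it can be convenient to generate it from one set of parameters and copy the result to each node. The following is a minimal sketch of that idea; the gen_hast_conf function is hypothetical, and it only reproduces the simple two-node layout shown above.

```sh
#!/bin/sh
# Sketch: print a two-node hast.conf in the layout used in this section.
# Arguments: resource, node A name, node A address, node B name,
# node B address, local disk (assumed to be the same on both nodes).
gen_hast_conf() {
        res=$1; node_a=$2; addr_a=$3; node_b=$4; addr_b=$5; disk=$6
        cat <<EOF
resource ${res} {
        on ${node_a} {
                local ${disk}
                remote ${addr_b}
        }
        on ${node_b} {
                local ${disk}
                remote ${addr_a}
        }
}
EOF
}

# Print the configuration used in this example to standard output.
gen_hast_conf test hasta 172.16.0.1 hastb 172.16.0.2 /dev/ad6
```

Note that each node's remote statement points at the other node's address. Redirect the output to a scratch file and review it before installing it as /etc/hast.conf.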

Tip: It is also possible to use host names in the remote statements. In such a case, make sure that these hosts are resolvable, e.g. they are defined in the /etc/hosts file, or alternatively in the local DNS.
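
When host names are used, a quick check that both names are present in a hosts-format file can catch typos early. This is a minimal sketch: the host_in_file function is hypothetical, it inspects only the file it is given, and it does not consult DNS.

```sh
#!/bin/sh
# Sketch: succeed if a host name appears in a hosts-format file.
# Arguments: host name, path to the file (for example /etc/hosts).
host_in_file() {
        awk -v name="$1" '
                { sub(/#.*/, "") }      # ignore comments
                { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
                END { exit(found ? 0 : 1) }
        ' "$2"
}

# Demonstration against a scratch file instead of the real /etc/hosts.
hosts=$(mktemp)
cat > "$hosts" <<EOF
172.16.0.1   hasta
172.16.0.2   hastb   # secondary node
EOF

for h in hasta hastb; do
        host_in_file "$h" "$hosts" || echo "$h is missing from $hosts"
done
rm -f "$hosts"
```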

Now that the configuration exists on both nodes, it is possible to create the HAST pool. Run the following commands on both nodes to place the initial metadata onto the local disk, and start the hastd(8) daemon:

# hastctl create test
# /etc/rc.d/hastd onestart

Note: It is not possible to use GEOM providers that already contain a file system (that is, to convert existing storage into a HAST-managed pool), because this procedure needs to store some metadata on the provider and there will not be enough space available for it.

HAST is not responsible for selecting a node's role (primary or secondary). The role has to be configured by an administrator, or by other software such as Heartbeat, using the hastctl(8) utility. On the primary node (hasta), issue the following command:

# hastctl role primary test

Similarly, run the following command on the secondary node (hastb):

# hastctl role secondary test

Caution: If the two nodes are unable to communicate with each other and both are configured as primary nodes, the resulting condition is called split-brain. To troubleshoot this situation, follow the steps described in Section 19.18.5.2.

It is possible to verify the result with the hastctl(8) utility on each node:

# hastctl status test

The important part of the output is the status line, which should say complete on each of the nodes. If it says degraded, something went wrong. At this point, the synchronization between the nodes has already started. The synchronization completes when the hastctl status command reports 0 bytes of dirty extents.
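
The two checks described above can be scripted by filtering the command's output. The sketch below assumes that the output contains lines of the form "status: complete" and "dirty: 0 (0B)"; that layout is an assumption and differs between FreeBSD releases, so adjust the patterns to match the local output. The function names are hypothetical.

```sh
#!/bin/sh
# Sketch: extract fields from status text read on standard input.
# ASSUMPTION: the text has "status: <word>" and "dirty: <bytes> ..." lines.
hast_status_of() {
        awk '$1 == "status:" { print $2; exit }'
}

hast_dirty_bytes() {
        awk '$1 == "dirty:" { print $2; exit }'
}

# Succeed only when the status is complete and no dirty bytes remain.
hast_in_sync() {
        out=$(cat)
        [ "$(printf '%s\n' "$out" | hast_status_of)" = "complete" ] &&
        [ "$(printf '%s\n' "$out" | hast_dirty_bytes)" = "0" ]
}

# Demonstration on sample text in the assumed layout (not real output).
sample='test:
  role: primary
  status: complete
  dirty: 0 (0B)'

if printf '%s\n' "$sample" | hast_in_sync; then
        echo "resource is synchronized"
else
        echo "resource is not synchronized yet"
fi
```

On a cluster node the same function could be fed from the real command, for example: hastctl status test | hast_in_sync.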

The last step is to create a filesystem on the /dev/hast/test GEOM provider and mount it. This has to be done on the primary node (as the /dev/hast/test appears only on the primary node), and it can take a few minutes depending on the size of the hard drive:

# newfs -U /dev/hast/test
# mkdir -p /hast/test
# mount /dev/hast/test /hast/test

Once the HAST framework is configured properly, the final step is to make sure that HAST is started automatically at system boot. The following line should be added to the /etc/rc.conf file:

hastd_enable="YES"
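
When both nodes are provisioned by script, the line can be appended only if no hastd_enable setting is present yet, so that repeated runs do not duplicate it. This is a minimal sketch; the enable_hastd function is hypothetical and it deliberately leaves an existing setting (including "NO") untouched.

```sh
#!/bin/sh
# Sketch: add hastd_enable="YES" to an rc.conf-style file exactly once.
# Argument: path to the file (defaults to /etc/rc.conf).
enable_hastd() {
        rc=${1:-/etc/rc.conf}
        # Append only when no hastd_enable line exists yet.
        grep -q '^hastd_enable=' "$rc" 2>/dev/null ||
                echo 'hastd_enable="YES"' >> "$rc"
}

# Demonstration on a scratch file; calling it twice adds one line only.
rc=$(mktemp)
enable_hastd "$rc"
enable_hastd "$rc"
cat "$rc"
rm -f "$rc"
```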

19.18.4.1 Failover Configuration

The goal of this example is to build a robust storage system which is resistant to the failure of any given node. The key task here is to handle the scenario in which the primary node of the cluster fails. Should this happen, the secondary node is there to take over seamlessly, check and mount the file system, and continue to work without missing a single bit of data.

To accomplish this task, another FreeBSD feature is required, one which provides automatic failover at the IP layer: CARP. CARP stands for Common Address Redundancy Protocol and allows multiple hosts on the same network segment to share an IP address. Set up CARP on both nodes of the cluster according to the documentation available in Section 32.14. After completing this task, each node should have its own carp0 interface with the shared IP address 172.16.0.254. The primary HAST node of the cluster has to be the master CARP node.

The HAST pool created in the previous section is now ready to be exported to the other hosts on the network. This can be accomplished by exporting it through NFS, Samba, etc., using the shared IP address 172.16.0.254. The only remaining problem is automatic failover should the primary node fail.

In the event of CARP interfaces going up or down, the FreeBSD operating system generates a devd(8) event, which makes it possible to watch for the state changes on the CARP interfaces. A state change on the CARP interface is an indication that one of the nodes failed or came back online. In such a case, it is possible to run a particular script which will automatically handle the failover.

To be able to catch the state changes on the CARP interfaces, the following configuration has to be added to the /etc/devd.conf file on each node:

notify 30 {
        match "system" "IFNET";
        match "subsystem" "carp0";
        match "type" "LINK_UP";
        action "/usr/local/sbin/carp-hast-switch master";
};

notify 30 {
        match "system" "IFNET";
        match "subsystem" "carp0";
        match "type" "LINK_DOWN";
        action "/usr/local/sbin/carp-hast-switch slave";
};

To put the new configuration into effect, run the following command on both nodes:

# /etc/rc.d/devd restart

In the event that the carp0 interface goes up or down (i.e. the interface state changes), the system generates a notification, allowing the devd(8) subsystem to run an arbitrary script, in this case /usr/local/sbin/carp-hast-switch. This is the script which will handle the automatic failover. For further clarification about the above devd(8) configuration, please consult the devd.conf(5) manual page.

An example of such a script could be the following:

#!/bin/sh

# Original script by Freddie Cash <fjwcash@gmail.com>
# Modified by Michael W. Lucas <mwlucas@BlackHelicopters.org>
# and Viktor Petersson <vpetersson@wireload.net>

# The names of the HAST resources, as listed in /etc/hast.conf
resources="test"

# delay in mounting HAST resource after becoming master
# make your best guess
delay=3

# logging
log="local0.debug"
name="carp-hast"

# end of user configurable stuff

case "$1" in
        master)
                logger -p $log -t $name "Switching to primary provider for ${resources}."
                sleep ${delay}

                # Wait for any "hastd secondary" processes to stop
                for disk in ${resources}; do
                        while pgrep -lf "hastd: ${disk} \(secondary\)" > /dev/null 2>&1; do
                                sleep 1
                        done

                        # Switch role for each disk
                        hastctl role primary ${disk}
                        if [ $? -ne 0 ]; then
                                logger -p $log -t $name "Unable to change role to primary for resource ${disk}."
                                exit 1
                        fi
                done

                # Wait for the /dev/hast/* devices to appear
                for disk in ${resources}; do
                        for I in $( jot 60 ); do
                                [ -c "/dev/hast/${disk}" ] && break
                                sleep 0.5
                        done

                        if [ ! -c "/dev/hast/${disk}" ]; then
                                logger -p $log -t $name "GEOM provider /dev/hast/${disk} did not appear."
                                exit 1
                        fi
                done

                logger -p $log -t $name "Role for HAST resources ${resources} switched to primary."


                logger -p $log -t $name "Mounting disks."
                for disk in ${resources}; do
                        mkdir -p /hast/${disk}
                        fsck -p -y -t ufs /dev/hast/${disk}
                        mount /dev/hast/${disk} /hast/${disk}
                done

        ;;

        slave)
                logger -p $log -t $name "Switching to secondary provider for ${resources}."

                # Switch roles for the HAST resources
                for disk in ${resources}; do
                        # Unmount the file system only if it is currently mounted
                        if mount | grep -q "^/dev/hast/${disk} on "
                        then
                                umount -f /hast/${disk}
                        fi
                        sleep $delay
                        hastctl role secondary ${disk} 2>&1
                        if [ $? -ne 0 ]; then
                                logger -p $log -t $name "Unable to switch role to secondary for resource ${disk}."
                                exit 1
                        fi
                        logger -p $log -t $name "Role switched to secondary for resource ${disk}."
                done
        ;;
esac

In a nutshell, the script does the following when a node becomes master / primary:

  • Promotes the HAST pools to primary on the given node.

  • Checks the file system under the HAST pool.

  • Mounts the pools at the appropriate place.

When a node becomes backup / secondary:

  • Unmounts the HAST pools.

  • Demotes the HAST pools to secondary.

Caution: Keep in mind that this is just an example script which serves as a proof of concept. It does not handle all possible scenarios and can be extended or altered in any way; for example, it can start and stop required services.
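
As one possible extension, the script could start services after the pools are mounted and stop them before the pools are unmounted. The sketch below only prints the commands it would run (a dry run). The service list is a hypothetical choice for an NFS export, and the role_services function is not part of the original script.

```sh
#!/bin/sh
# Sketch: print the rc.d commands to run for a given role (dry run).
# ASSUMPTION: the services below are an example choice for an NFS export.
services="mountd nfsd"

role_services() {
        case "$1" in
        master) action=onestart ;;      # after the pools are mounted
        slave)  action=onestop ;;       # before the pools are unmounted
        *)      return 1 ;;
        esac

        for svc in ${services}; do
                echo "/etc/rc.d/${svc} ${action}"
        done
}

role_services master
role_services slave
```

In a real deployment the order in which services are stopped may need to be the reverse of the start order, and the printed commands would be executed rather than echoed.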

Tip: For the purpose of this example we used a standard UFS file system. In order to reduce the time needed for recovery, a journal-enabled UFS or ZFS file system can be used.

More detailed information with additional examples can be found on the HAST Wiki page.

19.18.5 Troubleshooting

19.18.5.1 General Troubleshooting Tips

HAST should generally work without any issues. However, as with any other software product, there may be times when it does not work as expected. Problems can have different sources, but a good rule of thumb is to first ensure that the time is synchronized between all nodes of the cluster.

When troubleshooting HAST problems, increase the debugging level of hastd(8) by starting the daemon with the -d argument. This argument may be specified multiple times to further increase the debugging level. A lot of useful information can be obtained this way. Consider also using the -F argument, which starts hastd(8) in the foreground.
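
The two arguments can be combined on one command line. The following sketch only builds and prints such a command line for a chosen debugging level, using the -d and -F arguments described above; the hastd_debug_cmd function is hypothetical.

```sh
#!/bin/sh
# Sketch: build a hastd command line that runs in the foreground (-F)
# with -d repeated once per requested debugging level.
# Argument: debugging level (defaults to 1).
hastd_debug_cmd() {
        level=${1:-1}
        cmd="hastd -F"
        i=0
        while [ "$i" -lt "$level" ]; do
                cmd="${cmd} -d"
                i=$(( i + 1 ))
        done
        echo "$cmd"
}

# Print the command line for debugging level 2.
hastd_debug_cmd 2
```

An already running hastd(8) instance would presumably have to be stopped before the printed command is run by hand.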

19.18.5.2 Recovering from the Split-brain Condition

Split-brain is the condition in which the two nodes of the cluster are unable to communicate with each other and both are configured as primary nodes. This is dangerous because it allows both nodes to make incompatible changes to the data. The system administrator has to resolve this situation manually.

To fix this situation, the administrator has to decide which node has the more important changes (or merge them manually) and let HAST perform a full synchronization of the node which has the broken data. To do this, issue the following commands on the node which needs to be resynchronized:

# hastctl role init <resource>
# hastctl create <resource>
# hastctl role secondary <resource>
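
Because these commands discard the local copy of the data, it can help to print them for review before running anything. The sketch below prints the same three commands for a named resource; the resync_commands function is hypothetical and performs no changes itself.

```sh
#!/bin/sh
# Sketch: print the resynchronization commands for one resource (dry run).
# Argument: resource name as listed in /etc/hast.conf.
resync_commands() {
        res=$1
        if [ -z "$res" ]; then
                echo "usage: resync_commands resource" >&2
                return 1
        fi

        # Same order as above: reset the role, recreate the metadata,
        # then rejoin the cluster as secondary.
        for step in "role init" "create" "role secondary"; do
                echo "hastctl ${step} ${res}"
        done
}

resync_commands test
```

Run the printed commands only on the node whose data is to be overwritten by the resynchronization.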