Heartbeat install – config – operation

Heartbeat installation

Note that the same procedure can be followed to install Heartbeat on DB server 2 (192.168.2.52).

Heartbeat can be installed from the EPEL repository.

The following command installs EPEL:

rpm -Uvh http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-7.noarch.rpm

Then install Heartbeat:

yum --enablerepo=epel install heartbeat


Version installed: heartbeat-3.0.4-1.el6.x86_64

Install the Heartbeat resource file for MySQL

Create the file /etc/ha.d/resource.d/mysql
with the following content:

[root@db1 resource.d]# cat mysql
#!/bin/sh
set -v -x
#
# This script is intended to be used as a resource script by heartbeat
#
# Mar 2006 by Monty Taylor
#
###

. /etc/ha.d/shellfuncs

case "$1" in
    start)
        res=`/etc/init.d/mysqld start`
        ret=$?
        ha_log $res
        exit $ret
        ;;
    stop)
        res=`/etc/init.d/mysqld stop`
        ret=$?
        ha_log $res
        exit $ret
        ;;
    status)
#        if [[ `ps -ef | grep '[m]ysqld'` > 1 ]] ; then
#           echo "running"
#        else
#           echo "stopped"
#        fi

        if [[ `service mysqld status` == 'mysqld is stopped' ]] ; then
            echo "stopped"
        else
            echo "running"
        fi

        ;;
    *)
        echo "Usage: mysql {start|stop|status}"
        exit 1
        ;;
esac

exit 0
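
The status branch above matches the exact text printed by the init script, which may differ between versions of the script. As an alternative, here is a minimal sketch that derives the status word from the exit code instead. It assumes the init script returns 0 when the service is running (the LSB convention); status_word is a hypothetical helper name, not part of Heartbeat.

```shell
#!/bin/sh
# Sketch: map an init-script exit code to the word heartbeat expects.
# status_word is a hypothetical helper name.
# Assumption: exit code 0 means "running", anything else means "stopped".
status_word() {
    if [ "$1" -eq 0 ]; then
        echo "running"
    else
        echo "stopped"
    fi
}

# Intended use inside the status) branch (not executed here):
#   service mysqld status >/dev/null 2>&1
#   status_word $?
```

The resource script can also be exercised by hand before Heartbeat uses it, for example by running /etc/ha.d/resource.d/mysql status as root.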

We will configure Heartbeat to use this file to launch MySQL.

To configure Heartbeat, three files are needed in the folder /etc/ha.d:

  • /etc/ha.d/ha.cf
  • /etc/ha.d/haresources
  • /etc/ha.d/authkeys

The three files should be identical on both servers running Heartbeat, except that the ucast IP address in ha.cf is expected to differ, since each node names the address of its peer. See the ucast example in the ha.cf configuration in the following sections.
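
To check that two copies of ha.cf only differ in their ucast lines, something like the following sketch could be used. It assumes the peer's file has already been copied to a local path (for example with scp); same_except_ucast is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: compare two copies of ha.cf while ignoring the ucast lines,
# which are allowed to differ between the two nodes.
# same_except_ucast is a hypothetical helper name.
same_except_ucast() {
    a=$(mktemp) || return 2
    b=$(mktemp) || { rm -f "$a"; return 2; }
    # Drop the active ucast directives; comment lines are kept and compared
    grep -v '^[[:space:]]*ucast' "$1" > "$a"
    grep -v '^[[:space:]]*ucast' "$2" > "$b"
    cmp -s "$a" "$b"
    rc=$?
    rm -f "$a" "$b"
    return $rc
}
```

The function returns 0 when the files match apart from ucast, and a non-zero value otherwise. The haresources and authkeys files can be compared directly with cmp, since they should have no differences at all.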

The file /etc/ha.d/ha.cf

Here is the ha.cf file.

Note that some commented parts of the original file were removed from the wiki to make documentation more concise.

ha.cf:

#       File to write debug messages to
debugfile /var/log/ha-debug
#
#
#       File to write other messages to
#
logfile /var/log/ha-log
#
#
#       Facility to use for syslog()/logger
#
logfacility     local0

autojoin none

#
#
#       A note on specifying "how long" times below...
#
#       The default time unit is seconds
#               10 means ten seconds
#
#       You can also specify them in milliseconds
#               1500ms means 1.5 seconds
#
#
#       keepalive: how long between heartbeats?
#
keepalive 2
#
#       deadtime: how long-to-declare-host-dead?
#
#               If you set this too low you will get the problematic
#               split-brain (or cluster partition) problem.
#               See the FAQ for how to use warntime to tune deadtime.
#
deadtime 30
#
#       warntime: how long before issuing "late heartbeat" warning?
#       See the FAQ for how to use warntime to tune deadtime.
#
warntime 10
#
#
#       Very first dead time (initdead)
#
#       On some machines/OSes, etc. the network takes a while to come up
#       and start working right after you've been rebooted.  As a result
#       we have a separate dead time for when things first come up.
#       It should be at least twice the normal dead time.
#
initdead 120
#
#
#       What UDP port to use for bcast/ucast communication?
#
#udpport        694

#       Set up a unicast / udp heartbeat medium
#       ucast [dev] [peer-ip-addr]
#
#       [dev]           device to send/rcv heartbeats on
#       [peer-ip-addr]  IP address of peer to send packets to
#

ucast eth2 192.168.2.52

#
auto_failback off

node db1.snarvaez.poweredbygnulinux.com
node db2.snarvaez.poweredbygnulinux.com

#       debug - set debug level
#         defaults to zero
debug 1
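
The comments in ha.cf say that initdead should be at least twice the normal dead time. A minimal sketch of a check for this is shown below. It assumes the values are plain seconds (no "ms" suffix) and that deadtime, warntime and initdead are all present. The additional rule that warntime should be lower than deadtime is our own assumption, and check_timings is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: sanity-check the timing directives of a ha.cf file.
# check_timings is a hypothetical helper name.
# Assumption: values are plain seconds, and all three directives exist.
check_timings() {
    awk '
        $1 == "deadtime" { dead = $2 }
        $1 == "warntime" { warn = $2 }
        $1 == "initdead" { init = $2 }
        END {
            ok = 1
            # Our own assumption: a warning should fire before the node is declared dead
            if (warn + 0 >= dead + 0) { print "warntime should be below deadtime"; ok = 0 }
            # From the ha.cf comments: initdead at least twice the dead time
            if (init + 0 < 2 * dead)  { print "initdead should be at least 2 x deadtime"; ok = 0 }
            if (ok) print "timings look consistent"
            exit !ok
        }
    ' "$1"
}
```

With the values used here (deadtime 30, warntime 10, initdead 120), check_timings /etc/ha.d/ha.cf reports that the timings look consistent.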

The /etc/ha.d/authkeys file

This is a very short file that simply states the selected authentication method and the password.
It should be the same on both servers.

Put this file at /etc/ha.d/authkeys

and set its permissions to 0600.

Here is an example (the password is different in the original file). The line auth 2 selects key number 2, so the sha1 entry is the one in use:

authkeys:

auth 2
1 crc
2 sha1 thisisthesecret
3 md5 Hello!
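
As an illustration, an authkeys file with a random secret and the right permissions could be generated with a sketch like the one below. It writes a single sha1 key (so auth 1 instead of the three-key example above); write_authkeys is a hypothetical helper name, and the target path is passed as an argument.

```shell
#!/bin/sh
# Sketch: write an authkeys file with a random sha1 secret and mode 0600.
# write_authkeys is a hypothetical helper name.
write_authkeys() {
    # 20 random bytes rendered as 40 hexadecimal characters
    secret=$(od -An -tx1 -N20 /dev/urandom | tr -d ' \n')
    (
        umask 077                      # a new file is created as 0600
        printf 'auth 1\n1 sha1 %s\n' "$secret" > "$1"
    )
    chmod 600 "$1"                     # also covers a pre-existing file
}
```

The file would be generated once, for example with write_authkeys /etc/ha.d/authkeys on db1, and then copied unchanged to db2, because both servers need the same secret.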

The /etc/ha.d/haresources file

Here we indicate to heartbeat what services it should start and monitor.

haresources:

db1.snarvaez.poweredbygnulinux.com  192.168.2.55/24  192.168.10.55/24  172.16.30.55/24 \
   drbddisk::mysqldata Filesystem::/dev/drbd0::/mysqldata::ext4   mysql

It does the following, in order:

  • bring up the 3 virtual IP addresses
  • switch DRBD to primary mode for the resource mysqldata
  • mount /dev/drbd0 on /mysqldata
  • launch the MySQL server
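
To see the resources of a haresources file one per line, in the order described above, a small parser like the following sketch can help. It only joins backslash-continued lines and skips comments; list_resources is a hypothetical helper name, not a Heartbeat tool.

```shell
#!/bin/sh
# Sketch: print the resources of a haresources file, one per line,
# in the order they are written. list_resources is a hypothetical name.
list_resources() {
    awk '
        /^[ \t]*#/ { next }                      # skip comment lines
        { line = line $0 }                       # accumulate the logical line
        /\\$/ { sub(/\\$/, " ", line); next }    # continuation: keep reading
        {
            $0 = line                            # re-split the joined line
            for (i = 2; i <= NF; i++) print $i   # field 1 is the node name
            line = ""
        }
    ' "$1"
}
```

Running list_resources /etc/ha.d/haresources on the file above prints the three IP addresses first, then drbddisk::mysqldata, the Filesystem entry, and finally mysql.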

Configure the firewall for Heartbeat

The firewall on both DB servers should allow UDP connections on port 694.

Add the following rule to the iptables of each DB server:

-A INPUT -p udp -m udp --dport 694 -j ACCEPT

Edit the file /etc/sysconfig/iptables:

# Firewall configuration written by system-config-firewall
# Manual customization of this file is not recommended.
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 7788:7799 -j ACCEPT
-A INPUT -p udp --dport 694 -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
-A OUTPUT -p tcp -m tcp --dport 7788:7799 -j ACCEPT
COMMIT
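
Because iptables evaluates rules in order, the ACCEPT rule for port 694 has to appear before the final REJECT rule of the INPUT chain, as it does in the file above. A minimal sketch of a check for this ordering follows; hb_rule_before_reject is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: succeed only if a UDP 694 ACCEPT rule exists in the INPUT chain
# and comes before the first INPUT REJECT rule of the rules file.
# hb_rule_before_reject is a hypothetical helper name.
hb_rule_before_reject() {
    awk '
        /^-A INPUT/ && /-p udp/ && /--dport 694( |$)/ && /-j ACCEPT/ { if (!acc) acc = NR }
        /^-A INPUT/ && /-j REJECT/                                   { if (!rej) rej = NR }
        END { exit !(acc && (!rej || acc < rej)) }
    ' "$1"
}
```

For example, hb_rule_before_reject /etc/sysconfig/iptables returns 0 for the file shown above, and a non-zero value if the rule is missing or placed after the REJECT line.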

Restart the firewall for the new rules to take effect. This should be done on both servers:

[root@db1 ha.d]# service iptables restart
iptables: Flushing firewall rules:                         [  OK  ]
iptables: Setting chains to policy ACCEPT: filter          [  OK  ]
iptables: Unloading modules:                               [  OK  ]
iptables: Applying firewall rules:                         [  OK  ]

Disabling SELinux

Unfortunately, SELinux conflicts with Heartbeat when it tries to bind to the network card eth2.
This error seems to be related to SELinux and NICs other than eth0, as it appears to work fine with eth0.
This is the error in the logs:

Oct 07 18:46:12 db1.snarvaez.poweredbygnulinux.com heartbeat: [5366]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth2
Oct 07 18:46:12 db1.snarvaez.poweredbygnulinux.com heartbeat: [5366]: info: glib: ucast: bound send socket to device: eth2
Oct 07 18:46:12 db1.snarvaez.poweredbygnulinux.com heartbeat: [5366]: info: glib: ucast: bound receive socket to device: eth2
Oct 07 18:46:12 db1.snarvaez.poweredbygnulinux.com heartbeat: [5366]: ERROR: glib: ucast: error binding socket. Retrying: Permission denied
Oct 07 18:46:13 db1.snarvaez.poweredbygnulinux.com heartbeat: [5366]: ERROR: glib: ucast: error binding socket. Retrying: Permission denied
Oct 07 18:46:14 db1.snarvaez.poweredbygnulinux.com heartbeat: [5366]: ERROR: glib: ucast: error binding socket. Retrying: Permission denied
Oct 07 18:46:15 db1.snarvaez.poweredbygnulinux.com heartbeat: [5366]: ERROR: glib: ucast: error binding socket. Retrying: Permission denied
Oct 07 18:46:16 db1.snarvaez.poweredbygnulinux.com heartbeat: [5366]: ERROR: glib: ucast: error binding socket. Retrying: Permission denied

Oct 07 18:46:21 db1.snarvaez.poweredbygnulinux.com heartbeat: [5366]: ERROR: glib: ucast: error binding socket. Retrying: Permission denied
Oct 07 18:46:22 db1.snarvaez.poweredbygnulinux.com heartbeat: [5366]: ERROR: glib: ucast: unable to bind socket. Giving up: Permission denied
Oct 07 18:46:22 db1.snarvaez.poweredbygnulinux.com heartbeat: [5366]: ERROR: make_io_childpair: cannot open ucast eth2

So we should stop SELinux from enforcing its policy by switching it to permissive mode with the following commands
(make the change on both servers):

[root@db2]# getenforce
Enforcing

[root@db2]# setenforce 0
[root@db2]# getenforce
Permissive

Make the change permanent (it will persist across reboots) by setting SELINUX=permissive in /etc/selinux/config:

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
#SELINUX=enforcing
SELINUX=permissive
# SELINUXTYPE= can take one of these two values:
#     targeted - Targeted processes are protected,
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted
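
To confirm which mode the config file will apply at the next boot, the active SELINUX= line can be extracted with a sketch like this one. It ignores commented lines and the SELINUXTYPE= setting; selinux_mode is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print the mode set by the uncommented SELINUX= line of a config file.
# selinux_mode is a hypothetical helper name.
selinux_mode() {
    # Lines starting with "#" and the SELINUXTYPE= line do not match the pattern
    sed -n 's/^SELINUX=\([a-z]*\).*/\1/p' "$1" | tail -n 1
}
```

With the file above, selinux_mode /etc/selinux/config prints permissive. The mode currently in effect is still reported by getenforce.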

Starting heartbeat

Services like mysql should not be started automatically when the server restarts, because it is Heartbeat's job to start and stop services on both servers.

Check that the services are configured as follows:

Note that drbd and heartbeat should be automatically started when the machine boots.

(some services were removed from the output)

[root@db1 ha.d]# chkconfig --list

drbd                0:off   1:off   2:on    3:on    4:on    5:on    6:off
heartbeat           0:off   1:off   2:on    3:on    4:on    5:on    6:off
ip6tables           0:off   1:off   2:on    3:on    4:on    5:on    6:off
iptables            0:off   1:off   2:on    3:on    4:on    5:on    6:off
mysqld              0:off   1:off   2:off   3:off   4:off   5:off   6:off
netconsole          0:off   1:off   2:off   3:off   4:off   5:off   6:off
network             0:off   1:off   2:on    3:on    4:on    5:on    6:off
ntpd                0:off   1:off   2:on    3:on    4:on    5:on    6:off
ntpdate             0:off   1:off   2:off   3:off   4:off   5:off   6:off
rsyslog             0:off   1:off   2:on    3:on    4:on    5:on    6:off
sshd                0:off   1:off   2:on    3:on    4:on    5:on    6:off
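
The listing can also be checked with a small filter instead of by eye. The sketch below reads the output of chkconfig --list on standard input and prints the runlevels in which a given service is on; enabled_runlevels is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print the runlevels in which a service is "on", based on
# `chkconfig --list` output read from stdin.
# enabled_runlevels is a hypothetical helper name.
enabled_runlevels() {
    awk -v svc="$1" '
        $1 == svc {
            out = ""
            for (i = 2; i <= NF; i++)
                if ($i ~ /:on$/) {
                    split($i, p, ":")                       # "3:on" -> p[1] = "3"
                    out = out (out == "" ? "" : " ") p[1]
                }
            print out
        }
    '
}
```

For the configuration above, chkconfig --list | enabled_runlevels heartbeat prints 2 3 4 5, while the same command for mysqld prints an empty line. If a service is set differently, chkconfig mysqld off and chkconfig heartbeat on correct it.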

Now start heartbeat on both servers:

[root@db1]# service heartbeat start

[root@db2]# service heartbeat start

This should bring up the virtual IP addresses and the required services on the primary server.

Troubleshooting Heartbeat

According to our ha.cf config file, Heartbeat logs can be checked in the files:

/var/log/ha-debug

/var/log/ha-log

Here is an example of a successful heartbeat run:

Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28944]: info: **************************
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28944]: info: Configuration validated. Starting heartbeat 3.0.4
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28944]: info: Heartbeat Hg Version: node: fcd56a9dd18c286a8c6ad639997a56b5ea40d441
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: heartbeat: version 3.0.4
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Heartbeat generation: 1349649990
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth2
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: glib: ucast: bound send socket to device: eth2
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: glib: ucast: bound receive socket to device: eth2
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: glib: ucast: started on port 694 interface eth2 to 10.1.10.52
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: G_main_add_TriggerHandler: Added signal manual handler
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: G_main_add_TriggerHandler: Added signal manual handler
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Local status now set to: 'up'
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Managed write_hostcachedata process 28953 exited with return code 0.
Oct 10 06:31:08 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Link db2.snarvaez.poweredbygnulinux.com:eth2 up.
Oct 10 06:31:08 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Status update for node db2.snarvaez.poweredbygnulinux.com: status up
Oct 10 06:31:08 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Managed write_hostcachedata process 28955 exited with return code 0.
harc(default)[28954]:   2012/10/10_06:31:08 info: Running /etc/ha.d//rc.d/status status
Oct 10 06:31:08 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Managed status process 28954 exited with return code 0.
Oct 10 06:31:09 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Comm_now_up(): updating status to active
Oct 10 06:31:09 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Local status now set to: 'active'
Oct 10 06:31:09 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Managed write_hostcachedata process 28972 exited with return code 0.
Oct 10 06:31:09 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Managed write_delcachedata process 28973 exited with return code 0.
Oct 10 06:31:10 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Status update for node db2.snarvaez.poweredbygnulinux.com: status active
Oct 10 06:31:10 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: AnnounceTakeover(local 0, foreign 1, reason 'HB_R_BOTHSTARTING' (0))
Oct 10 06:31:10 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: AnnounceTakeover(local 0, foreign 1, reason 'T_RESOURCES' (0))
Oct 10 06:31:10 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: STATE 1 => 3
Oct 10 06:31:10 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: STATE 3 => 2
Oct 10 06:31:10 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: AnnounceTakeover(local 0, foreign 1, reason 'T_RESOURCES' (0))
Oct 10 06:31:10 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: other_holds_resources: 0
Oct 10 06:31:10 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: STATE 2 => 3
harc(default)[28974]:   2012/10/10_06:31:10 info: Running /etc/ha.d//rc.d/status status
Oct 10 06:31:10 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Managed status process 28974 exited with return code 0.
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: local resource transition completed.
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (0))
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Initial resource acquisition complete (T_RESOURCES(us))
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: remote resource transition completed.
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: other_holds_resources: 1
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: other_holds_resources: 1
/usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_192.168.2.55)[29027]:    2012/10/10_06:31:20 INFO:  Resource is stopped
req_resource(default)[29004]:   2012/10/10_06:31:20 debug: in /usr/share/heartbeat/req_resource 192.168.2.55/24
req_resource(default)[29004]:   2012/10/10_06:31:20 debug: dont_ask:  nice_failback: yes
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28991]: info: 1 local resources from [/usr/share/heartbeat/ResourceManager listkeys db1.snarvaez.poweredbygnulinux.com]
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28991]: info: Local Resource acquisition completed.
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28991]: info: FIFO message [type resource] written rc=81
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Managed req_our_resources(ask) process 28991 exited with return code 0.
harc(default)[29079]:   2012/10/10_06:31:20 info: Running /etc/ha.d//rc.d/ip-request-resp ip-request-resp
ip-request-resp(default)[29079]:        2012/10/10_06:31:20 received ip-request-resp 192.168.2.55/24 OK yes
ResourceManager(default)[29102]:        2012/10/10_06:31:20 info: Acquiring resource group: db1.snarvaez.poweredbygnulinux.com 192.168.2.55/24 192.168.10.55/24 172.16.30.55/24 drbddisk::mysqldata Filesystem::/dev/drbd0::/mysqldata::ext4 mysql
/usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_192.168.2.55)[29130]:    2012/10/10_06:31:20 INFO:  Resource is stopped
ResourceManager(default)[29102]:        2012/10/10_06:31:20 info: Running /etc/ha.d/resource.d/IPaddr 192.168.2.55/24 start
IPaddr(IPaddr_192.168.2.55)[29215]:       2012/10/10_06:31:20 INFO: Using calculated nic for 192.168.2.55: eth2
IPaddr(IPaddr_192.168.2.55)[29215]:       2012/10/10_06:31:20 INFO: Using calculated netmask for 192.168.2.55: 255.255.255.0
IPaddr(IPaddr_192.168.2.55)[29215]:       2012/10/10_06:31:20 INFO: eval ifconfig eth2:0 192.168.2.55 netmask 255.255.255.0 broadcast 192.168.2.255
/usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_192.168.2.55)[29192]:    2012/10/10_06:31:20 INFO:  Success
/usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_192.168.10.55)[29331]: 2012/10/10_06:31:20 INFO:  Resource is stopped
ResourceManager(default)[29102]:        2012/10/10_06:31:20 info: Running /etc/ha.d/resource.d/IPaddr 192.168.10.55/24 start
IPaddr(IPaddr_192.168.10.55)[29416]:    2012/10/10_06:31:20 INFO: Using calculated nic for 192.168.10.55: eth1
IPaddr(IPaddr_192.168.10.55)[29416]:    2012/10/10_06:31:20 INFO: Using calculated netmask for 192.168.10.55: 255.255.255.0
IPaddr(IPaddr_192.168.10.55)[29416]:    2012/10/10_06:31:20 INFO: eval ifconfig eth1:0 192.168.10.55 netmask 255.255.255.0 broadcast 192.168.10.255
/usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_192.168.10.55)[29393]: 2012/10/10_06:31:20 INFO:  Success
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: other_holds_resources: 1
/usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_172.16.30.55)[29532]:  2012/10/10_06:31:20 INFO:  Resource is stopped
ResourceManager(default)[29102]:        2012/10/10_06:31:20 info: Running /etc/ha.d/resource.d/IPaddr 172.16.30.55/24 start
IPaddr(IPaddr_172.16.30.55)[29617]:     2012/10/10_06:31:20 INFO: Using calculated nic for 172.16.30.55: eth0
IPaddr(IPaddr_172.16.30.55)[29617]:     2012/10/10_06:31:20 INFO: Using calculated netmask for 172.16.30.55: 255.255.255.0
IPaddr(IPaddr_172.16.30.55)[29617]:     2012/10/10_06:31:21 INFO: eval ifconfig eth0:0 172.16.30.55 netmask 255.255.255.0 broadcast 172.16.30.255
/usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_172.16.30.55)[29594]:  2012/10/10_06:31:21 INFO:  Success
ResourceManager(default)[29102]:        2012/10/10_06:31:21 info: Running /etc/ha.d/resource.d/drbddisk mysqldata start
/usr/lib/ocf/resource.d//heartbeat/Filesystem(Filesystem_/dev/drbd0)[29779]:    2012/10/10_06:31:21 INFO:  Resource is stopped
ResourceManager(default)[29102]:        2012/10/10_06:31:21 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /mysqldata ext4 start
Filesystem(Filesystem_/dev/drbd0)[29862]:       2012/10/10_06:31:21 INFO: Running start for /dev/drbd0 on /mysqldata
/usr/lib/ocf/resource.d//heartbeat/Filesystem(Filesystem_/dev/drbd0)[29854]:    2012/10/10_06:31:21 INFO:  Success
ResourceManager(default)[29102]:        2012/10/10_06:31:21 info: Running /etc/ha.d/resource.d/mysql  start
mysql(default)[30018]:  2012/10/10_06:31:22 Starting mysqld: [ OK ]
Oct 10 06:31:22 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Managed ip-request-resp process 29079 exited with return code 0.
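
When reading a long log like the one above, it can help to reduce it to two facts: how many ERROR lines it contains, and whether the "Initial resource acquisition complete" message was logged. The following is a minimal sketch of such a summary; hb_log_summary is a hypothetical helper name, and the choice of these two markers is our own.

```shell
#!/bin/sh
# Sketch: summarize a heartbeat log file.
# hb_log_summary is a hypothetical helper name.
hb_log_summary() {
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
    errors=$(grep -c 'ERROR:' "$1" || true)
    if grep -q 'Initial resource acquisition complete' "$1"; then
        acquired=yes
    else
        acquired=no
    fi
    echo "errors=$errors acquired=$acquired"
}
```

For example, hb_log_summary /var/log/ha-log can be run after starting Heartbeat. The matching ERROR lines themselves can then be listed with grep 'ERROR:' /var/log/ha-log.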