Heartbeat install – config – operation
Heartbeat installation
Note that the same procedure can be followed to install Heartbeat on DB server 2 (192.168.2.52).
Heartbeat can be installed from the EPEL repository.
The following command installs EPEL:
rpm -Uvh http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-7.noarch.rpm
Then install Heartbeat:
yum --enablerepo=epel install heartbeat
Version installed: heartbeat-3.0.4-1.el6.x86_64
Install heartbeat resource file for mysql
Create the file /etc/ha.d/resource.d/mysql
with the following content:
[root@db1 resource.d]# cat mysql
#!/bin/sh
set -v -x
#
# This script is intended to be used as resource script by heartbeat
#
# Mar 2006 by Monty Taylor
#
###

. /etc/ha.d/shellfuncs

case "$1" in
    start)
        res=`/etc/init.d/mysqld start`
        ret=$?
        ha_log $res
        exit $ret
        ;;
    stop)
        res=`/etc/init.d/mysqld stop`
        ret=$?
        ha_log $res
        exit $ret
        ;;
    status)
#       if [[ `ps -ef | grep '[m]ysqld'` > 1 ]] ; then
#           echo "running"
#       else
#           echo "stopped"
#       fi
        if [[ `service mysqld status` == 'mysqld is stopped' ]] ; then
            echo "stopped"
        else
            echo "running"
        fi
        ;;
    *)
        echo "Usage: mysql {start|stop|status}"
        exit 1
        ;;
esac
exit 0
We will configure Heartbeat to use this file to launch mysql.
Configuring Heartbeat requires 3 files in the folder /etc/ha.d:
- /etc/ha.d/ha.cf
- /etc/ha.d/haresources
- /etc/ha.d/authkeys
The 3 files should be exactly the same on both servers running Heartbeat, except that the ucast IP address in ha.cf can be different. See the example of the ucast and ha.cf config in the following sections.
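One way to check the "identical except for ucast" requirement is to compare the two copies of ha.cf with their ucast lines filtered out. The following is a minimal sketch; the function name and the idea of first copying db2's file to a scratch path are assumptions made for illustration, not part of the original setup.

```shell
#!/bin/sh
# Sketch: succeed if two ha.cf copies are identical once their "ucast"
# lines are ignored. The function name is illustrative.

same_except_ucast() {
    # $1 and $2 are paths to the two ha.cf copies.
    a=$(grep -v '^[[:space:]]*ucast[[:space:]]' "$1")
    b=$(grep -v '^[[:space:]]*ucast[[:space:]]' "$2")
    [ "$a" = "$b" ]
}

# Example usage, assuming db2's copy was first fetched to /tmp/ha.cf.db2
# (for instance with scp):
#   same_except_ucast /etc/ha.d/ha.cf /tmp/ha.cf.db2 && echo "ha.cf matches"
```

For haresources and authkeys, which must match exactly, a plain cmp or a checksum comparison of the two copies is enough.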
The file /etc/ha.d/ha.cf
Here is the ha.cf file.
Note that some commented parts of the original file were removed from the wiki to keep the documentation concise.
ha.cf:
# File to write debug messages to
debugfile /var/log/ha-debug
#
#
# File to write other messages to
#
logfile /var/log/ha-log
#
#
# Facility to use for syslog()/logger
#
logfacility local0

autojoin none
#
#
# A note on specifying "how long" times below...
#
# The default time unit is seconds
#     10 means ten seconds
#
# You can also specify them in milliseconds
#     1500ms means 1.5 seconds
#
#
# keepalive: how long between heartbeats?
#
keepalive 2
#
# deadtime: how long-to-declare-host-dead?
#
#     If you set this too low you will get the problematic
#     split-brain (or cluster partition) problem.
#     See the FAQ for how to use warntime to tune deadtime.
#
deadtime 30
#
# warntime: how long before issuing "late heartbeat" warning?
# See the FAQ for how to use warntime to tune deadtime.
#
warntime 10
#
#
# Very first dead time (initdead)
#
# On some machines/OSes, etc. the network takes a while to come up
# and start working right after you've been rebooted. As a result
# we have a separate dead time for when things first come up.
# It should be at least twice the normal dead time.
#
initdead 120
#
#
# What UDP port to use for bcast/ucast communication?
#
#udpport 694

# Set up a unicast / udp heartbeat medium
# ucast [dev] [peer-ip-addr]
#
# [dev]           device to send/rcv heartbeats on
# [peer-ip-addr]  IP address of peer to send packets to
#
ucast eth2 192.168.2.52
#
auto_failback off

node db1.snarvaez.poweredbygnulinux.com
node db2.snarvaez.poweredbygnulinux.com

# debug - set debug level
# defaults to zero
debug 1
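The comments in ha.cf state that initdead should be at least twice deadtime. A small script can check this after the file is edited. The sketch below is only illustrative: the function names are made up, the additional check that warntime is lower than deadtime is an assumption based on warntime being a "late heartbeat" warning, and only plain values in seconds are handled (no "ms" suffix).

```shell
#!/bin/sh
# Sketch: check ha.cf timing directives against simple rules of thumb.
# Function names are illustrative; only whole-second values are supported.

ha_value() {
    # Print the value of directive $2 in file $1. Comment lines are skipped
    # because their first field is "#", not the directive name.
    awk -v key="$2" '$1 == key { v = $2 } END { print v }' "$1"
}

check_ha_timings() {
    conf=$1
    dead=$(ha_value "$conf" deadtime)
    warn=$(ha_value "$conf" warntime)
    init=$(ha_value "$conf" initdead)

    if [ -z "$dead" ] || [ -z "$warn" ] || [ -z "$init" ]; then
        echo "deadtime, warntime and initdead must all be set in $conf"
        return 2
    fi

    status=0
    # Assumption: a late-heartbeat warning only helps if it fires before
    # the peer is declared dead.
    if [ "$warn" -ge "$dead" ]; then
        echo "warntime ($warn) should be lower than deadtime ($dead)"
        status=1
    fi
    # Rule from the ha.cf comments: initdead >= 2 * deadtime.
    if [ "$init" -lt $((dead * 2)) ]; then
        echo "initdead ($init) should be at least twice deadtime ($dead)"
        status=1
    fi
    return "$status"
}

# Example usage:
#   check_ha_timings /etc/ha.d/ha.cf && echo "timings look consistent"
```

With the values shown above (deadtime 30, warntime 10, initdead 120) both checks pass.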
The /etc/ha.d/authkeys file
This is a very short file that simply specifies the selected authentication method and the password.
It must be the same on both servers.
Put this file at /etc/ha.d/authkeys
and set its permissions to 0600.
Here is an example (the password is different in the original file):
authkeys:
auth 2
1 crc
2 sha1 thisisthesecret
3 md5 Hello!
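Rather than typing a password by hand, the file can be generated with a random secret and the required 0600 permissions in one step. This is a sketch under assumptions: the function name is made up, the secret is derived from /dev/urandom with sha1sum, and the output contains a single sha1 key instead of the three example methods listed above.

```shell
#!/bin/sh
# Sketch: write an authkeys file with one random sha1 key and mode 0600.
# The target path is a parameter so the function can be tried on a scratch
# file before touching /etc/ha.d/authkeys.

write_authkeys() {
    target=$1
    # Derive a 40-hex-digit secret from random bytes.
    secret=$(dd if=/dev/urandom bs=512 count=1 2>/dev/null | sha1sum | cut -d' ' -f1)
    # Create the file inside a subshell with a restrictive umask so it is
    # never readable by other users, then set the mode explicitly.
    ( umask 077; printf 'auth 1\n1 sha1 %s\n' "$secret" > "$target" )
    chmod 600 "$target"
}

# Example usage (as root, on one node only):
#   write_authkeys /etc/ha.d/authkeys
```

The generated file then has to be copied unchanged to the other server, since both nodes must share the same key.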
The /etc/ha.d/haresources file
Here we indicate to heartbeat what services it should start and monitor.
haresources:
db1.snarvaez.poweredbygnulinux.com 192.168.2.55/24 192.168.10.55/24 172.16.30.55/24 \
    drbddisk::mysqldata Filesystem::/dev/drbd0::/mysqldata::ext4 mysql
It does the following, in order:
- brings up the 3 virtual IP addresses
- switches DRBD to primary mode for the resource mysqldata
- mounts /dev/drbd0 on /mysqldata
- finally, launches the MySQL server
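The order of the entries matters: resources are started from left to right, and Heartbeat releases them in the reverse order on shutdown or failover. The following sketch simply prints both orders for a haresources entry; the function name is illustrative and the entry is passed with its continuation line already joined.

```shell
#!/bin/sh
# Sketch: print the resources of one haresources entry in start order
# (left to right) or stop order (right to left).

resource_order() {
    # $1: haresources entry on a single line; $2: "start" or "stop".
    printf '%s\n' "$1" | awk -v mode="$2" '{
        # Field 1 is the preferred node; the remaining fields are resources.
        if (mode == "stop") { for (i = NF; i >= 2; i--) print $i }
        else                { for (i = 2; i <= NF; i++) print $i }
    }'
}

entry='db1.snarvaez.poweredbygnulinux.com 192.168.2.55/24 192.168.10.55/24 172.16.30.55/24 drbddisk::mysqldata Filesystem::/dev/drbd0::/mysqldata::ext4 mysql'

echo "Start order:"
resource_order "$entry" start
echo "Stop order:"
resource_order "$entry" stop
```

This makes the dependency chain visible: mysql is started last because it needs the mounted filesystem, which in turn needs the DRBD device to be primary.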
Configure the firewall for Heartbeat
The firewall on both DB servers should allow UDP connections on port 694.
Add the following rule to the iptables configuration of each DB server:
-A INPUT -p udp -m udp --dport 694 -j ACCEPT
Edit file /etc/sysconfig/iptables:
# Firewall configuration written by system-config-firewall
# Manual customization of this file is not recommended.
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 7788:7799 -j ACCEPT
-A INPUT -p udp --dport 694 -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
-A OUTPUT -p tcp -m tcp --dport 7788:7799 -j ACCEPT
COMMIT
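Rule order matters in this file: the ACCEPT rule for UDP port 694 only takes effect if it comes before the final REJECT rule of the INPUT chain. The sketch below checks a rules file for that; the function name is an assumption for illustration and the match is a simple text pattern, not a full iptables parser.

```shell
#!/bin/sh
# Sketch: succeed if the rules file accepts UDP port 694 on the INPUT chain
# before the first INPUT REJECT rule.

udp694_allowed() {
    # $1: path to an iptables rules file such as /etc/sysconfig/iptables.
    awk '
        /^-A INPUT/ && /-p udp/ && /--dport 694( |$)/ && /-j ACCEPT/ { found = 1; exit }
        /^-A INPUT/ && /-j REJECT/ { exit }
        END { exit(found ? 0 : 1) }
    ' "$1"
}

# Example usage:
#   udp694_allowed /etc/sysconfig/iptables || echo "UDP 694 is not allowed"
```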
Restart the firewall for the new rules to take effect. This should be done on both servers:
[root@db1 ha.d]# service iptables restart
iptables: Flushing firewall rules: [ OK ]
iptables: Setting chains to policy ACCEPT: filter [ OK ]
iptables: Unloading modules: [ OK ]
iptables: Applying firewall rules: [ OK ]
Disabling SELinux
Unfortunately, SELinux conflicts with Heartbeat when Heartbeat tries to access the network card eth2.
Note that this error seems to be related to SELinux and NICs other than eth0, since the same setup seems to work fine with eth0.
This is the error in the logs:
Oct 07 18:46:12 db1.snarvaez.poweredbygnulinux.com heartbeat: [5366]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth2
Oct 07 18:46:12 db1.snarvaez.poweredbygnulinux.com heartbeat: [5366]: info: glib: ucast: bound send socket to device: eth2
Oct 07 18:46:12 db1.snarvaez.poweredbygnulinux.com heartbeat: [5366]: info: glib: ucast: bound receive socket to device: eth2
Oct 07 18:46:12 db1.snarvaez.poweredbygnulinux.com heartbeat: [5366]: ERROR: glib: ucast: error binding socket. Retrying: Permission denied
Oct 07 18:46:13 db1.snarvaez.poweredbygnulinux.com heartbeat: [5366]: ERROR: glib: ucast: error binding socket. Retrying: Permission denied
Oct 07 18:46:14 db1.snarvaez.poweredbygnulinux.com heartbeat: [5366]: ERROR: glib: ucast: error binding socket. Retrying: Permission denied
Oct 07 18:46:15 db1.snarvaez.poweredbygnulinux.com heartbeat: [5366]: ERROR: glib: ucast: error binding socket. Retrying: Permission denied
Oct 07 18:46:16 db1.snarvaez.poweredbygnulinux.com heartbeat: [5366]: ERROR: glib: ucast: error binding socket. Retrying: Permission denied
Oct 07 18:46:21 db1.snarvaez.poweredbygnulinux.com heartbeat: [5366]: ERROR: glib: ucast: error binding socket. Retrying: Permission denied
Oct 07 18:46:22 db1.snarvaez.poweredbygnulinux.com heartbeat: [5366]: ERROR: glib: ucast: unable to bind socket. Giving up: Permission denied
Oct 07 18:46:22 db1.snarvaez.poweredbygnulinux.com heartbeat: [5366]: ERROR: make_io_childpair: cannot open ucast eth2
So we switch SELinux to permissive mode with the following commands
(make the change on both servers):
[root@db2] getenforce
Enforcing
[root@db2] setenforce 0
[root@db2] getenforce
Permissive
Make the change permanent (it will persist across reboots) by setting SELINUX=permissive:
emacs /etc/selinux/config:
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
#SELINUX=enforcing
SELINUX=permissive
# SELINUXTYPE= can take one of these two values:
#     targeted - Targeted processes are protected,
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted
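The same edit can be scripted, which helps when it has to be repeated on both servers. This is a minimal sketch using sed; the function name is an assumption, and unlike the file above it replaces the active SELINUX= line in place instead of keeping the old value as a comment.

```shell
#!/bin/sh
# Sketch: set SELINUX=permissive in an SELinux config file, keeping a
# backup copy with a .bak suffix.

set_selinux_permissive() {
    # $1: path to the config file (normally /etc/selinux/config).
    # The pattern only matches the active SELINUX= line: commented lines
    # start with "#" and SELINUXTYPE= has a "T" where "=" is expected.
    sed -i.bak 's/^SELINUX=.*/SELINUX=permissive/' "$1"
}

# Example usage (as root, on each server):
#   set_selinux_permissive /etc/selinux/config
#   grep '^SELINUX=' /etc/selinux/config
```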
Starting heartbeat
Services like mysql should not be started automatically when the server restarts, because it is Heartbeat's job to start and stop services on both servers.
Check that the services are configured as follows.
Note that drbd and heartbeat should be started automatically when the machine boots.
(Some services were removed from the output.)
[root@db1 ha.d] chkconfig --list
drbd            0:off   1:off   2:on    3:on    4:on    5:on    6:off
heartbeat       0:off   1:off   2:on    3:on    4:on    5:on    6:off
ip6tables       0:off   1:off   2:on    3:on    4:on    5:on    6:off
iptables        0:off   1:off   2:on    3:on    4:on    5:on    6:off
mysqld          0:off   1:off   2:off   3:off   4:off   5:off   6:off
netconsole      0:off   1:off   2:off   3:off   4:off   5:off   6:off
network         0:off   1:off   2:on    3:on    4:on    5:on    6:off
ntpd            0:off   1:off   2:on    3:on    4:on    5:on    6:off
ntpdate         0:off   1:off   2:off   3:off   4:off   5:off   6:off
rsyslog         0:off   1:off   2:on    3:on    4:on    5:on    6:off
sshd            0:off   1:off   2:on    3:on    4:on    5:on    6:off
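The boot settings above can be applied with "chkconfig mysqld off", "chkconfig drbd on" and "chkconfig heartbeat on". To verify the result without reading the table by eye, the output of chkconfig --list can be filtered with a small helper. The sketch below is illustrative; the function name is made up and it only looks at one service and one runlevel at a time.

```shell
#!/bin/sh
# Sketch: read `chkconfig --list` output on stdin and succeed if the given
# service is "on" in the given runlevel.

service_on_in_runlevel() {
    # $1: service name, $2: runlevel number (for example 3).
    awk -v svc="$1" -v lvl="$2" '
        $1 == svc { for (i = 2; i <= NF; i++) if ($i == (lvl ":on")) ok = 1 }
        END { exit(ok ? 0 : 1) }
    '
}

# Example usage:
#   chkconfig --list | service_on_in_runlevel heartbeat 3 || echo "heartbeat is off at boot"
#   chkconfig --list | service_on_in_runlevel mysqld 3 && echo "mysqld must not start at boot"
```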
Now start heartbeat on both servers:
[root@db1] service heartbeat start
[root@db2] service heartbeat start
This should bring up the virtual IP addresses and the required services on the primary server.
Troubleshooting Heartbeat
According to our ha.cf config file, Heartbeat logs can be checked in the files:
/var/log/ha-debug
/var/log/ha-log
Here is an example of a successful heartbeat run:
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28944]: info: **************************
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28944]: info: Configuration validated. Starting heartbeat 3.0.4
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28944]: info: Heartbeat Hg Version: node: fcd56a9dd18c286a8c6ad639997a56b5ea40d441
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: heartbeat: version 3.0.4
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Heartbeat generation: 1349649990
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth2
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: glib: ucast: bound send socket to device: eth2
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: glib: ucast: bound receive socket to device: eth2
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: glib: ucast: started on port 694 interface eth2 to 10.1.10.52
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: G_main_add_TriggerHandler: Added signal manual handler
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: G_main_add_TriggerHandler: Added signal manual handler
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Local status now set to: 'up'
Oct 10 06:31:04 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Managed write_hostcachedata process 28953 exited with return code 0.
Oct 10 06:31:08 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Link db2.snarvaez.poweredbygnulinux.com:eth2 up.
Oct 10 06:31:08 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Status update for node db2.snarvaez.poweredbygnulinux.com: status up
Oct 10 06:31:08 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Managed write_hostcachedata process 28955 exited with return code 0.
harc(default)[28954]: 2012/10/10_06:31:08 info: Running /etc/ha.d//rc.d/status status
Oct 10 06:31:08 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Managed status process 28954 exited with return code 0.
Oct 10 06:31:09 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Comm_now_up(): updating status to active
Oct 10 06:31:09 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Local status now set to: 'active'
Oct 10 06:31:09 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Managed write_hostcachedata process 28972 exited with return code 0.
Oct 10 06:31:09 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Managed write_delcachedata process 28973 exited with return code 0.
Oct 10 06:31:10 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Status update for node db2.snarvaez.poweredbygnulinux.com: status active
Oct 10 06:31:10 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: AnnounceTakeover(local 0, foreign 1, reason 'HB_R_BOTHSTARTING' (0))
Oct 10 06:31:10 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: AnnounceTakeover(local 0, foreign 1, reason 'T_RESOURCES' (0))
Oct 10 06:31:10 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: STATE 1 => 3
Oct 10 06:31:10 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: STATE 3 => 2
Oct 10 06:31:10 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: AnnounceTakeover(local 0, foreign 1, reason 'T_RESOURCES' (0))
Oct 10 06:31:10 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: other_holds_resources: 0
Oct 10 06:31:10 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: STATE 2 => 3
harc(default)[28974]: 2012/10/10_06:31:10 info: Running /etc/ha.d//rc.d/status status
Oct 10 06:31:10 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Managed status process 28974 exited with return code 0.
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: local resource transition completed.
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (0))
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Initial resource acquisition complete (T_RESOURCES(us))
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: remote resource transition completed.
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: other_holds_resources: 1
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: other_holds_resources: 1
/usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_192.168.2.55)[29027]: 2012/10/10_06:31:20 INFO: Resource is stopped
req_resource(default)[29004]: 2012/10/10_06:31:20 debug: in /usr/share/heartbeat/req_resource 192.168.2.55/24
req_resource(default)[29004]: 2012/10/10_06:31:20 debug: dont_ask: nice_failback: yes
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28991]: info: 1 local resources from [/usr/share/heartbeat/ResourceManager listkeys db1.snarvaez.poweredbygnulinux.com]
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28991]: info: Local Resource acquisition completed.
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28991]: info: FIFO message [type resource] written rc=81
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Managed req_our_resources(ask) process 28991 exited with return code 0.
harc(default)[29079]: 2012/10/10_06:31:20 info: Running /etc/ha.d//rc.d/ip-request-resp ip-request-resp
ip-request-resp(default)[29079]: 2012/10/10_06:31:20 received ip-request-resp 192.168.2.55/24 OK yes
ResourceManager(default)[29102]: 2012/10/10_06:31:20 info: Acquiring resource group: db1.snarvaez.poweredbygnulinux.com 192.168.2.55/24 192.168.10.55/24 172.16.30.55/24 drbddisk::mysqldata Filesystem::/dev/drbd0::/mysqldata::ext4 mysql
/usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_192.168.2.55)[29130]: 2012/10/10_06:31:20 INFO: Resource is stopped
ResourceManager(default)[29102]: 2012/10/10_06:31:20 info: Running /etc/ha.d/resource.d/IPaddr 192.168.2.55/24 start
IPaddr(IPaddr_192.168.2.55)[29215]: 2012/10/10_06:31:20 INFO: Using calculated nic for 192.168.2.55: eth2
IPaddr(IPaddr_192.168.2.55)[29215]: 2012/10/10_06:31:20 INFO: Using calculated netmask for 192.168.2.55: 255.255.255.0
IPaddr(IPaddr_192.168.2.55)[29215]: 2012/10/10_06:31:20 INFO: eval ifconfig eth2:0 192.168.2.55 netmask 255.255.255.0 broadcast 192.168.2.255
/usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_192.168.2.55)[29192]: 2012/10/10_06:31:20 INFO: Success
/usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_192.168.10.55)[29331]: 2012/10/10_06:31:20 INFO: Resource is stopped
ResourceManager(default)[29102]: 2012/10/10_06:31:20 info: Running /etc/ha.d/resource.d/IPaddr 192.168.10.55/24 start
IPaddr(IPaddr_192.168.10.55)[29416]: 2012/10/10_06:31:20 INFO: Using calculated nic for 192.168.10.55: eth1
IPaddr(IPaddr_192.168.10.55)[29416]: 2012/10/10_06:31:20 INFO: Using calculated netmask for 192.168.10.55: 255.255.255.0
IPaddr(IPaddr_192.168.10.55)[29416]: 2012/10/10_06:31:20 INFO: eval ifconfig eth1:0 192.168.10.55 netmask 255.255.255.0 broadcast 192.168.10.255
/usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_192.168.10.55)[29393]: 2012/10/10_06:31:20 INFO: Success
Oct 10 06:31:20 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: other_holds_resources: 1
/usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_172.16.30.55)[29532]: 2012/10/10_06:31:20 INFO: Resource is stopped
ResourceManager(default)[29102]: 2012/10/10_06:31:20 info: Running /etc/ha.d/resource.d/IPaddr 172.16.30.55/24 start
IPaddr(IPaddr_172.16.30.55)[29617]: 2012/10/10_06:31:20 INFO: Using calculated nic for 172.16.30.55: eth0
IPaddr(IPaddr_172.16.30.55)[29617]: 2012/10/10_06:31:20 INFO: Using calculated netmask for 172.16.30.55: 255.255.255.0
IPaddr(IPaddr_172.16.30.55)[29617]: 2012/10/10_06:31:21 INFO: eval ifconfig eth0:0 172.16.30.55 netmask 255.255.255.0 broadcast 172.16.30.255
/usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_172.16.30.55)[29594]: 2012/10/10_06:31:21 INFO: Success
ResourceManager(default)[29102]: 2012/10/10_06:31:21 info: Running /etc/ha.d/resource.d/drbddisk mysqldata start
/usr/lib/ocf/resource.d//heartbeat/Filesystem(Filesystem_/dev/drbd0)[29779]: 2012/10/10_06:31:21 INFO: Resource is stopped
ResourceManager(default)[29102]: 2012/10/10_06:31:21 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /mysqldata ext4 start
Filesystem(Filesystem_/dev/drbd0)[29862]: 2012/10/10_06:31:21 INFO: Running start for /dev/drbd0 on /mysqldata
/usr/lib/ocf/resource.d//heartbeat/Filesystem(Filesystem_/dev/drbd0)[29854]: 2012/10/10_06:31:21 INFO: Success
ResourceManager(default)[29102]: 2012/10/10_06:31:21 info: Running /etc/ha.d/resource.d/mysql start
mysql(default)[30018]: 2012/10/10_06:31:22 Starting mysqld: [ OK ]
Oct 10 06:31:22 db1.snarvaez.poweredbygnulinux.com heartbeat: [28945]: info: Managed ip-request-resp process 29079 exited with return code 0.
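When a start does not go this cleanly, a quick first step is to pull only the ERROR entries written by the heartbeat daemon out of the log. The sketch below assumes the log line layout shown in this document ("heartbeat: [pid]: ERROR: ..."); the function names are illustrative.

```shell
#!/bin/sh
# Sketch: list or count ERROR entries written by the heartbeat daemon in a
# log file such as /var/log/ha-log.

heartbeat_errors() {
    # $1: path to the log file. Prints the matching lines.
    grep -E 'heartbeat: \[[0-9]+\]: ERROR:' "$1"
}

heartbeat_error_count() {
    # $1: path to the log file. Prints the number of matching lines.
    grep -Ec 'heartbeat: \[[0-9]+\]: ERROR:' "$1"
}

# Example usage:
#   heartbeat_errors /var/log/ha-log | tail -n 20
#   heartbeat_error_count /var/log/ha-log
```

Run against the SELinux failure shown earlier, this would surface the "error binding socket" and "cannot open ucast eth2" lines while skipping the info entries.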