Gentoo Cluster using heartbeat and drbd problem

Discussion in 'Server Operation' started by nekromancer, Nov 29, 2006.

  1. nekromancer

    nekromancer New Member

    Hi, I hope this is the right place to post this, since I saw others posting DRBD/Heartbeat questions here. If there is a mailing list or forum that deals specifically with such things, please direct me to it.

    I have set up 2 identical PCs running Gentoo. Both have DRBD v0.7.21 and Heartbeat v1.2.7 (with ldirectord) installed. I am going for a hot-standby (active/passive) system. Each node has one Ethernet interface (eth1) connected to the LAN, and a crossover Ethernet cable connects the two PCs directly on eth0; the crossover link is dedicated to DRBD replication.
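
    For reference, the interface layout would look roughly like this in /etc/conf.d/net on gentoo1. This is only a sketch: the baselayout-1 array syntax is assumed, and 192.168.1.1 and the netmasks are placeholders for the node's real LAN settings (the virtual IP 192.168.1.3 is managed by Heartbeat, not configured here).

    Code:
    # /etc/conf.d/net on gentoo1 (sketch; addresses/netmasks are placeholders)
    # eth0: dedicated crossover link for DRBD replication
    config_eth0=( "172.22.0.1 netmask 255.255.255.0" )
    # eth1: LAN interface used by Heartbeat and the virtual service IP
    config_eth1=( "192.168.1.1 netmask 255.255.255.0" )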

    Heartbeat is set up to use eth1 to connect to the LAN and send heartbeats. Both nodes start up, everything is fine, and they share the virtual IP address perfectly. Failover works if I test it by stopping Heartbeat on the primary node, and it also works if I pull the power supply from the primary node. But if I unplug the eth1 network cable, the IP address fails over but the DRBD disk does not switch: it stays mounted on the primary node and never moves to the secondary, even though Heartbeat failed over and the secondary node took over the virtual IP address.

    The only way I got this to work is by unplugging both the crossover cable (eth0) and the network cable (eth1) at the same time. With DRBD cut off along with the network, the secondary node finally takes over both the drbddisk and the IP together.
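
    From DRBD's point of view that is expected: with only eth1 unplugged, the replication link on eth0 is still up, so gentoo1 stays Primary and the drbddisk script on the secondary cannot promote the device. A quick way to see this during the test (a sketch, using the resource name mirror from the config below):

    Code:
    # Run on each node while the eth1 cable is out.
    # cs: should still read Connected and st: Primary/Secondary,
    # which is why the disk cannot move.
    cat /proc/drbd

    # On the secondary, a manual promotion attempt will be refused
    # while the peer is still Primary (DRBD 0.7 has no dual-primary):
    drbdadm primary mirror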

    Seeing this, I decided to use just one interface (eth1) for both DRBD and Heartbeat. Simulating a network failure (unplugging the eth1 cable), both fail over. But when I reconnect the cable, it fails back automatically even though I set auto_failback off! Not only that, the data on the drbddisk does not get replicated to the other node once it reconnects. A cat /proc/drbd somehow shows both disks in a Consistent state.

    If I set drbd.conf to use on-disconnect stand_alone instead of reconnect, and I handle the DRBD disks manually, the data does get replicated! That means running drbdadm commands by hand and only then starting Heartbeat on the failed node so it can fail back.
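
    For anyone following along, the manual hand-holding looks roughly like this. It is a sketch, assuming gentoo2 survived with the good data and gentoo1 is the returning node whose changes should be discarded:

    Code:
    # On gentoo1: demote, mark the local disk out of date so a full
    # sync is pulled from the peer, then reconnect.
    drbdadm secondary mirror
    drbdadm invalidate mirror
    drbdadm connect mirror

    # On gentoo2: reconnect its side as well.
    drbdadm connect mirror

    # Watch cat /proc/drbd until the sync finishes, then start
    # Heartbeat on gentoo1 so the controlled failback can happen.
    /etc/init.d/heartbeat start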

    Basically, I don't know why this is happening. What I want is an active/passive hot-standby setup: DRBD on the crossover cable and Heartbeat on the LAN. When one node fails the other should take over, and when the failed node comes back online it should NOT fail back automatically; I want an admin to tend to the node and decide whether to fail back or not, so that DRBD can do a proper sync.

    Below is the drbd.conf file I am using (at least the relevant parts)
    Note: 172.22.0.x is the crossover cable
    Note: 192.168.1.x is the LAN network

    Code:
    resource mirror {
    
       protocol C;
       incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";
    
       startup {
          degr-wfc-timeout 20;    # 20 seconds
       }
    
    
       disk {
          on-io-error   detach;
       }
    
    
       net {
          ko-count 4;
          on-disconnect stand_alone;
          #on-disconnect reconnect;
       }
    
    
       syncer {
          rate 10M;
          group 1;
          al-extents 257;
       }
    
    
       on gentoo1 {
          device     /dev/drbd0;
          disk       /dev/sda4;
          address    172.22.0.1:7788;
          meta-disk  internal;
       }
    
    
       on gentoo2 {
          device    /dev/drbd0;
          disk      /dev/sda4;
          address   172.22.0.2:7788;
          meta-disk internal;
       }
    } 
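
    After editing drbd.conf on both nodes, it can be sanity-checked and applied to the running resource without a restart (standard drbdadm 0.7 commands):

    Code:
    # Print the configuration as drbdadm parsed it
    drbdadm dump mirror
    # Apply changed settings to the running resource
    drbdadm adjust mirror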

    This is the ha.cf config file

    Code:
    logfile   /var/log/ha-log
    logfacility   local0
    
    keepalive 1
    deadtime 15
    warntime 5
    
    bcast   eth1
    
    auto_failback off
    
    node   gentoo1
    node   gentoo2 
    This is the haresources file

    Code:
    gentoo1 drbddisk::mirror Filesystem::/dev/drbd0::/ha::reiserfs 192.168.1.3/8/eth1 ldirectord apache2
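
    For reference, Heartbeat 1.x reads that line left to right: gentoo1 is the preferred node, and the resources start in the order listed (and stop in reverse). Broken out with comments for readability (the real file keeps everything on one line):

    Code:
    gentoo1                                  # preferred holder of the resource group
    drbddisk::mirror                         # promote DRBD resource "mirror" to Primary
    Filesystem::/dev/drbd0::/ha::reiserfs    # mount /dev/drbd0 on /ha as reiserfs
    192.168.1.3/8/eth1                       # bring up the virtual IP on eth1
    ldirectord apache2                       # finally start ldirectord and apache2
    Incidentally, the /8 gives the virtual IP a 255.0.0.0 netmask; for a 192.168.1.x LAN, /24 is probably what was intended.
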
    Thanks in advance!
     
  2. nekromancer

    nekromancer New Member

    meh, so much for this post.
    The nodes were split-brained. The fix was to add a STONITH device (a manageable remote power switch) for each node.
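
    For anyone who lands here with the same symptoms: without fencing, both nodes can end up Primary with locally Consistent but diverged data, which is exactly the split brain described above. In Heartbeat 1.x the device goes into ha.cf. A sketch, where the baytech type, addresses and credentials are placeholders (run stonith -L for the list of supported device types):

    Code:
    # /etc/ha.d/ha.cf additions (sketch; type and credentials are placeholders)
    # Each line: node that may use the device, device type, device parameters
    stonith_host gentoo1 baytech 192.168.1.50 mylogin mysecret
    stonith_host gentoo2 baytech 192.168.1.51 mylogin mysecret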
     
