VMWare replication and failover

Discussion in 'HOWTO-Related Questions' started by sebastienp, Aug 5, 2008.

  1. sebastienp

    sebastienp New Member

    OK, to be more precise:

    I have no problem with vm1 when it is started on srv1: it gets its IP (192.168.1.20, statically configured) and I can access it.
    But when I disconnect srv1, even though the instance comes online on srv2, vm1 on srv2 doesn't get any IP, because eth0 no longer exists in the VM once it runs on srv2.

    Is this normal?
    Does anyone have a clue?

    Thank you in advance,
    S.

    =*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

    Hi there,

    Once again, many thanks for the time you spent writing these howtos. They help a lot!!!

    Sorry to be a burden, but I have questions regarding the "Virtual Machine Replication & Failover with VMWare Server & Debian Etch (4.0)" howto.

    It looks like I missed something...

    OK, I have 2 physical nodes:
    srv1:
    eth0 : 192.168.1.11/24 - eth1 : 172.16.0.1/20 (heartbeat)
    srv2:
    eth0 : 192.168.1.12/24 - eth1 : 172.16.0.2/20 (heartbeat)

    DRBD and Heartbeat are working well.
    srv1:~# cat /proc/drbd
    version: 0.7.21 (api:79/proto:74)
    SVN Revision: 2326 build by [email protected]l, 2008-07-22 22:14:19
     0: cs:Connected st:Primary/Secondary ld:Consistent
        ns:2236 nr:0 dw:100 dr:2237 al:0 bm:27 lo:0 pe:0 ua:0 ap:0
    srv1:~#
    srv1:~# /etc/init.d/heartbeat status
    heartbeat OK [pid 2645 et al] is running on srv1.site.local [srv1.site.local]...
    srv1:~#

    Here are the config files:

    *drbd.conf :
    resource vm1 {
        protocol C;
        incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";
        startup {
            wfc-timeout 10;
            degr-wfc-timeout 30;
        }
        disk {
            on-io-error detach;
        }
        net {
            max-buffers 20000;
            unplug-watermark 12000;
            max-epoch-size 20000;
        }
        syncer {
            rate 500M;
            group 1;
            al-extents 257;
        }
        on srv1.site.local {
            device /dev/drbd0;
            disk /dev/cciss/c0d0p7;
            address 172.16.0.1:7789;
            meta-disk internal;
        }
        on srv2.site.local {
            device /dev/drbd0;
            disk /dev/cciss/c0d0p7;
            address 172.16.0.2:7789;
            meta-disk internal;
        }
    }

    *ha.cf :
    logfile /var/log/ha-log
    logfacility local0
    keepalive 1
    deadtime 10
    warntime 10
    udpport 694
    bcast eth1
    auto_failback on
    node srv1.site.local
    node srv2.site.local
    ping 192.168.1.1
    respawn hacluster /usr/lib/heartbeat/ipfail

    *authkeys :
    auth 1
    1 md5 secret

    *haresources :
    srv1.site.local 192.168.1.10 drbddisk::vm1 Filesystem::/dev/drbd0::/var/vm::ext3 vmstart

    vmstart points to the correct files in /var/vm.
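
    For reference, here is a rough sketch of what a vmstart-style heartbeat resource script can look like; the /var/vm/vm1/vm1.vmx path and the use of vmware-cmd are assumptions based on a default VMware Server 1.x install, not necessarily what the howto's script does:

    #!/bin/sh
    # Minimal sketch of a heartbeat resource script for one VM.
    # Assumed guest config path: /var/vm/vm1/vm1.vmx
    VMX=/var/vm/vm1/vm1.vmx

    case "$1" in
      start)
        # power the guest on headlessly via the VMware Server 1.x command-line tool
        vmware-cmd "$VMX" start
        ;;
      stop)
        # "trysoft" asks the guest to shut down cleanly, falling back to a hard stop
        vmware-cmd "$VMX" stop trysoft
        ;;
      status)
        vmware-cmd "$VMX" getstate
        ;;
      *)
        echo "Usage: $0 {start|stop|status}"
        exit 1
        ;;
    esac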

    VMWare server v.1.0.5 is installed and working on both servers, and the VMWare instance vm1.site.local is created on srv1.
    Hosts are declared in /etc/hosts.

    What I understood was that when vm1 boots, it would get the IP address configured in haresources (192.168.1.10 in this case).

    But when I boot vm1, it gets an IP via DHCP.
    I can access its services via this IP, but I don't have failover.
    When I disconnect srv1, the instance comes online on srv2, but eth0 doesn't exist anymore! It is declared in /etc/network/interfaces as dhcp, but it is not up.
    Trying ifup eth0, I get:
    SIOCSIFADDR: No such device
    eth0: ERROR while getting interface flags: No such device (twice)
    Bind socket to interface: No such device
    Failed to bring up eth0

    If I set another IP statically on vm1 (let's say 1.20), I don't have failover, since I lose 1.20 as soon as I disconnect srv1... even if the VM switches to srv2, with the 1.10 IP!
    Once again, eth0 disappears.

    If I set the haresources IP statically on vm1 (iface eth0 inet static, address 192.168.1.10, ...),
    then I reach srv1 (or srv2, depending on which server holds the eth0:0 alias...) instead of vm1.

    Could you please be so kind as to explain in more detail what should theoretically happen?
    What if I want to configure several virtual machines?
    Did I miss something? Did I misunderstand?

    Many thanks for your support,
    S.
     
    Last edited: Aug 6, 2008
  2. thanis

    thanis New Member

    Hi, please keep in mind that the entire configuration of Heartbeat and DRBD does NOT have anything to do with the virtual machines. The haresources IP address is the heartbeat IP address if configured correctly. NIC configuration for your virtual machines is done inside the virtual machines themselves; basically, whenever you talk about your VMs, it is all VMware related and no longer tied to the HA part of the tutorial.
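
    To illustrate the separation: the 192.168.1.10 address from haresources lives on the physical hosts as a heartbeat-managed alias, while the guest gets its own address inside its /etc/network/interfaces. A minimal sketch, assuming a static setup (the .20 address is just an example and not part of the howto):

    # /etc/network/interfaces inside the guest
    auto eth0
    iface eth0 inet static
        address 192.168.1.20    # the VM's own address, distinct from the heartbeat IP 192.168.1.10
        netmask 255.255.255.0
        gateway 192.168.1.1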

    So, what kind of OS are you running in your VM, and are the VMware Tools installed?

    Grtz,
    Thanis
     
  3. sebastienp

    sebastienp New Member

    to be continued

    Hi Thanis, first of all, many thanks for your answer; maybe you're on vacation... Nice of you to take the time.

    OK, I have worked on this a lot since my last post, and I've got it: heartbeat and DRBD are indeed "by themselves". Noted.

    I now use VMware Server 1.0.6 instead of 1.0.5, just in case...
    But it's the same mess...
    Downgrading to 1.0.2? Why not, it's my last resort!?

    Nevertheless, the mess is with the Ethernet card (these are HP servers, fully compatible with Debian Etch according to HP; I don't know about VMware).

    No problem with the primary server (srv1). My VMs bind the AMD PCnet card as eth0 with its IP statically configured.

    But when moving to srv2, I always have the same error:
    SIOCSIFADDR: No such device
    eth0: ERROR while getting interface flags: No such device (twice)
    Bind socket to interface: No such device
    Failed to bring up eth0
    ...during the boot sequence, and no eth0 is available.

    ifconfig -a shows that eth0 doesn't exist, but eth1 is there! (It's not there on srv1...)

    I read several things about /etc/udev/rules.d/z25_persistent-net.rules.
    I tried removing/tuning it, without success.
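
    (What I read suggests that on Etch, udev pins each ethN name to the MAC address it first saw, so if the virtual NIC's MAC changes on the second host, eth0 stays reserved for the old MAC and the new card becomes eth1. A rough sketch of the check and reset; the rule format is approximate and the MAC values here are made up:)

    # inside the guest: see which MAC eth0 is pinned to
    cat /etc/udev/rules.d/z25_persistent-net.rules
    # SUBSYSTEM=="net", DRIVERS=="?*", ATTRS{address}=="00:0c:29:aa:bb:cc", NAME="eth0"
    # SUBSYSTEM=="net", DRIVERS=="?*", ATTRS{address}=="00:0c:29:dd:ee:ff", NAME="eth1"

    # if a stale entry is the problem, delete the file so udev regenerates it on the next boot
    rm /etc/udev/rules.d/z25_persistent-net.rules
    reboot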

    I want to use several (2-4) Linux VMs and only one Windows 2003 Server VM (for a specific purpose/service). How many did you try in your lab? Which OSes?

    FYI, the problem is there with only Linux VMs, and also with only the Windows VM. The gods love us ;-)

    Once again, no problem on srv1.
    But as soon as I disconnect it, srv2 takes over the VMware instance, OK, but it's as if eth0 had vanished!

    For my first try, I used a bridged network configuration (what kind of VMware network configuration did you use for your test lab?).

    Now I'm trying a NAT config; I have a couple of possibilities:
    - tuning nat.conf for vmnet8 on both servers, so that they share the same "NATed network" (what about MAC addresses? they are the same for vmnet8) - to be tested;
    - using a tunnel broker solution, but I'm not familiar with IPv6 - to be tested.

    I'm still working on it...

    I can say I have never seen anything explicit in the log files, except the NIC failure... That's the main problem!

    Sorry to ask, but could you please send a basic sketch of the topology you used when you did it?
    I find your howto very interesting and full of knowledge, but if you permit me to say so, not very detailed considering the topic, even for Linux/VMware users.

    That's easy for me to say because I have never posted a howto, but I promise, if I succeed with this one, I swear I'll post something!

    I'll keep you updated, once again many thanks for your time.

    Regards,
    S.
     
  4. thanis

    thanis New Member

    Hi Sebastien,

    Could it be that your second server is connected differently? I have tried to recreate your issue, and I get the same problem if, for example, nic0 is connected/installed as eth0 on Server1 but nic1 is connected/installed on Server2: VMware then has a different physical NIC ID to bridge to. In that case the virtual NICs are also different, and that is why your VMware NIC is eth0 on Server1 but eth1 on Server2. Since you also have the same issue with Windows, you can be pretty sure it is related to VMware, so it kind of falls outside the scope.
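
    If you want to verify which physical NIC the bridged vmnet uses on each host, something like this should show it (the .vmx path comes from the earlier posts; the exact vmnet-bridge output and the configurator behaviour are assumptions based on a default VMware Server 1.x install):

    # on each physical host:
    ps ax | grep vmnet-bridge                 # shows which physical interface each vmnet bridge is attached to
    grep -i ethernet0 /var/vm/vm1/vm1.vmx     # connectionType should be "bridged" on both hosts
    vmware-config.pl                          # re-run the configurator to re-bind the bridged vmnet to the right NIC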

    Other news: I will create a newer/bigger howto soon using the latest VMWare with the latest DRBD. I will also try to get the active/active mode of drbd up & running.

    Grtz,
    Thanis
     
  5. sebastienp

    sebastienp New Member

    Hi Thanis,

    Thanks for your quick answer.

    Both servers (same model) are identically configured :
    - eth0 : LAN 192.168.1.11/24 for srv1, 192.168.1.12/24 for srv2;
    - eth1 : DRBD/Heartbeat 172.16.0.1/20 for srv1, 172.16.0.2/20 for srv2;

    When you set up the lab to reproduce my problem, did you manage to complete the howto without issues?

    The new, improved howto version is great news!!!

    Once again thanks for your time,
    Kind regards from Paris/France,

    S.
     
  6. thanis

    thanis New Member

    Hi sebastien, of course I had no problems with the howto, I wrote it myself :) But I must really stress that your problem is VMware config related; I have no clue as to why you are having this issue without seeing your actual environment. I think the VMware config on the second server is bridged to the wrong NIC, but as I say, I cannot be sure at all.

    Perhaps we should wait for Bart Van Kleef's other thread, to see if he has the same issue as you do.

    Grtz,
    Thanis
     
  7. sebastienp

    sebastienp New Member

    Hi Thanis,

    Of course, you did it so it worked for you...

    But which versions are you using (VMware Server 1.0.2? DRBD 0.7, OK, but which heartbeat package?)?
    What kind of VMware network config did you use?

    Another question regarding the vmstart script: on the second line, case "$1", do I have to add a case "$2" if I add another VM?

    I don't want to take advantage, so don't hesitate to throw me out the window if you feel I'm doing so, but it is possible to grant you SSH access to the cluster, if you want.

    Thank you again and again,
    S.
     
  8. thanis

    thanis New Member

    Hi sebastien, let's wait for Bart first and then we'll see. If I could have SSH access if all else fails, that would be great :)

    The "case" statements are for the start/stop/status arguments.

    Grtz,
    Thanis
     
  9. ipguru99

    ipguru99 New Member

    No srv1-eth0, srv2-eth0, too

    First, Great article. Second, I hope someone is still listening to this thread...

    I have a customer who wants to do something like this, but doesn't want to spend $30k getting it going with VMWare Fusion. We saw this and HAD to try it.

    I have the exact same problem as sebastienp. SRV1 works great, and everything fails over so fast it's unbelievable (easy once you realize what is going on).

    The VM moves over to SRV2 when I pull the eth0 cable out of SRV1... but there is no eth0 on SRV2. I even started over and created a new VM on SRV2... same thing when I fail it back to SRV1: everything moves over, but no eth0. The /etc/network/interfaces file says there is an eth0, but when I try to bring it up manually, the VM just says there isn't an eth0. If I fail it back to wherever I created the original VM, eth0 is fine and accessible.

    I am using old Gateway PCs as a test, but they are identical. I have a different pair of cards for eth1 (SRV1 has a 3c905 and SRV2 has a Digital)... so I don't think that is causing anything.

    Anyone ever get this resolved?

    Thanks!
     
  10. christr

    christr New Member

    Did you try to manually bring up the VM on the other host by any chance, using the VMware Console? If so, did you specify 'keep' or 'create' when you brought up the 'copy' on the other host? If you selected 'create', then the virtual MAC addresses of the Ethernet cards changed, and that may be causing your issue. You must select 'keep' to keep the virtual machine ID (and all virtual NIC MAC addresses) identical, or Linux will think it's a new interface. Hence the eth1 designation now (it still knows about eth0 having MAC address X, so it adds eth1 with the new MAC address).

    I've had this issue a few times myself, so on the VMware Server 'clusters' I build with this, what I do is move one of the VMs by hand to the other server (by killing heartbeat) while I have the VMware Console up and running. I shut down the VM, kill heartbeat (to make everything move), then manually 'start' the VM on the second box. At that point I get the prompt about create/keep/always create/always keep. Pick 'always keep' and you should be OK.
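
    Roughly, the one-time manual move looks like this (the paths are assumptions; the 'keep' prompt itself only appears in the Console GUI):

    # on srv1, with the VMware Console open:
    vmware-cmd /var/vm/vm1/vm1.vmx stop trysoft   # shut the guest down cleanly
    /etc/init.d/heartbeat stop                    # releases the cluster IP, the DRBD device and the /var/vm mount
    # heartbeat on srv2 takes over the resources; then start the VM once from the
    # Console on srv2 and answer "Always Keep" at the UUID prompt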

    Any questions, fire away... I'm actually in the process of building a new pair of servers this week using DRBD 8.3 & VMWare Server 2.0 (it needs a bunch of changes from the howto to get it to work, but it's not too bad so far).

    --Chris
     
  11. ipguru99

    ipguru99 New Member

    Chris,

    Thanks for getting back to me! I can't believe I didn't think about that..

    With the release of Xen 5 from Citrix (for free!), I changed my direction for this customer.

    Funny, though. Your article made me get totally immersed in the drbd stuff. I had always wanted to get more into the cluster stuff, but just never jumped in and started reading. Now, the drbd stuff (with Heartbeat) just looks like it could do SO much.

    Again, thanks for your post!

    ..and if you haven't seen the new free stuff from Xen, you should check it out. It does the VMotion stuff.. (which is what I was really interested in for this customer).

    Thanks all,
    ipguru99
     
  12. christr

    christr New Member

    Yes, Xen is quite nice... unfortunately a lot of my hardware doesn't play too well with it (no VT bit). So I use it when I can. :)

    Also, an FYI for those playing with VMWare Server 2.0 and all this HA/DRBD stuff: the workaround for moving the VM between machines is a little different. You're no longer prompted in the GUI for the 'always keep' or 'always create' options, so moving the VM between them can be a problem.

    The fix is to add the following line to the .vmx file of each of your virtual machines. It will keep the UUID (MAC address, etc.) the same as you shut down/suspend and move between hardware.

    uuid.action = "keep"
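
    If you have several guests, a quick way to apply it everywhere is something like the following (the /var/vm layout is an assumption based on the earlier posts):

    # append uuid.action = "keep" to every guest config that doesn't set it yet
    for vmx in /var/vm/*/*.vmx; do
        grep -q '^uuid.action' "$vmx" || echo 'uuid.action = "keep"' >> "$vmx"
    done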


    I'm still working on some details for doing all this automagically on vmware 2.0. Will post when done. Right now my play environment is a pair of Dell 2850's using the Proxmox bare metal install. Then I installed VMware Server on top of it. So I can use OpenVZ/PVE and VMWare at the same time on the same host. Pretty spiffy.

    --Chris
     
