HOW-TO: Recover from failed Storage VMotion

A while ago I received a request from the storage department to move a whole ESX cluster to another storage I/O-Group. This would be a disruptive action.
I was wondering if storage VMotion would help me out here if they assigned me some new storage on the other I/O-Group instead of moving the current LUNs.

If you are going to use sVMotion I would strongly suggest using the sVMotion plug-in from lostcreations.

As it turned out, sVMotion made it possible to move the cluster to the other I/O-Group online. This would save me a lot of time explaining to the customer why the entire cluster had to go offline. We are talking about 175+ VMs here.

So that was the path I chose, and it worked out fine, but I got so excited about sVMotion that I didn’t pay attention to the storage space left on my new LUNs. After a while a LUN filled up and the sVMotion process failed.
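
A quick pre-check before each batch of migrations would have caught this. The following is a minimal sketch, assuming you look up the size of the VMs to move and the free space on the target LUN yourself (for example in the datastore view of the VI client). The function name has_room and the default 10% headroom are hypothetical choices for illustration, not part of any VMware tool.

```shell
#!/bin/sh
# Return 0 when needed_kb plus a safety margin fits into free_kb.
# Arguments: needed_kb free_kb [margin_pct]
# margin_pct is an assumed headroom (default 10) to leave room for the
# snapshot files that sVMotion creates while it copies the disks.
has_room() {
    needed_kb="$1"
    free_kb="$2"
    margin_pct="${3:-10}"
    required_kb=$(( $needed_kb + $needed_kb * $margin_pct / 100 ))
    [ "$required_kb" -le "$free_kb" ]
}
```

For example, `has_room 52428800 62914560 || echo "target LUN too small"` compares a 50 GB VM against 60 GB of free space. How much headroom is really needed depends on how busy the VM is during the move.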

Whenever a sVMotion fails you probably end up in a situation where the config files are moved to the new location and the .vmdk files and their accompanying snapshot files (sVMotion creates a snapshot in order to copy the .vmdk files) are still in the old location.
Issuing another sVMotion will generate this error: “ERROR: A specified parameter was not correct. spec”. If you turn off the VM you get an extra option, “Complete Migration”. This option makes a copy of the .vmdk files on the same LUN, so it requires twice the space of the .vmdks on that LUN and, most importantly, it requires downtime.

Here’s what I did to resolve this split:

  • Create a snapshot of the vm. Since it’s not available via vCenter GUI in this state, you have to do this in the COS or connect your VIC directly to the ESX host.
    • Through SSH console session:
      • find the config_file_path of the VM
        vmware-cmd -l
      • Create a snapshot of the vm
        vmware-cmd <config_file_path> createsnapshot snapshot_name snapshot_description 1 1
    • Through VIC:
      • Use GUI as normal.
  • Remove (Commit) the snapshots:
    vmware-cmd <config_file_path> removesnapshots
    This will remove the newly created snapshot AND the snapshot created by sVMotion.
  • vCenter still thinks the vm is in dmotion state so you can’t edit settings, perform VMotion or anything else via vCenter. To fix this we need to clear the DMotionParent parameters in the .vmx file with the following commands from the COS:
    vmware-cmd <config_file_path> setconfig scsi0:0.DMotionParent ""
    vmware-cmd <config_file_path> setconfig scsi0:1.DMotionParent ""
    Do this for every DMotionParent entry in the .vmx file, so be sure to check your .vmx file to get the right SCSI IDs. Note that editing the .vmx file directly will not trigger a reload of the .vmx config file! 
  • Now perform a new storage migration to move the .vmx configuration file back to its original location.
  • Clean up destination LUN and remove any files/folders created by the failed sVMotion. We’re done and back in business again without downtime!
    We can retry the sVMotion now.
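
The DMotionParent step above is easy to get wrong on a VM with many disks. The following is a minimal sketch of two helpers that only read the .vmx file: one prints the setconfig commands for every DMotionParent entry, the other checks afterwards that no entry still holds a value. It assumes the entries look like `scsi0:0.DMotionParent = "some/path.vmdk"`. The function names and the example path are hypothetical, and nothing is executed until you run the printed commands yourself.

```shell
#!/bin/sh
# Print (do not run) one "vmware-cmd ... setconfig" line for every
# DMotionParent entry found in the given .vmx file.
# Assumption: entries have the form  scsi0:0.DMotionParent = "value"
print_dmotion_clear_cmds() {
    vmx="$1"
    # Keep only the key (the text before "=") of each DMotionParent line.
    sed -n 's/^[[:space:]]*\([A-Za-z0-9:]*\.DMotionParent\)[[:space:]]*=.*$/\1/p' "$vmx" |
    while read -r key; do
        printf 'vmware-cmd "%s" setconfig %s ""\n' "$vmx" "$key"
    done
}

# Return 0 when every DMotionParent entry in the .vmx file is empty ("").
# Entries that still hold a value are printed and the function returns 1.
dmotion_cleared() {
    if grep 'DMotionParent' "$1" | grep -v '=[[:space:]]*""[[:space:]]*$'; then
        return 1
    fi
    return 0
}
```

Usage on the COS could look like `print_dmotion_clear_cmds /vmfs/volumes/datastore1/testvm/testvm.vmx`, with a path taken from `vmware-cmd -l`. Review the printed lines, run them, then call `dmotion_cleared` with the same path before starting the new storage migration.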

Of course I didn’t find out all this by myself. All credits go to Argyle from the VMTN Forum. You can read his original thread here.

Someone would probably say “Don’t try this at home”, but if you’re curious and do want to try this at home, use the following procedure to reproduce this split situation:

  • perform a sVMotion of a TEST vm
  • on the COS of the ESX host issue:
    service mgmt-vmware restart
  • Have fun!!

23 Comments on “HOW-TO: Recover from failed Storage VMotion”

  1. #1 Sven Huisman
    on Mar 31st, 2009 at 11:15 am

    Thanks, great info!

  2. #2 Jay Rogers
    on Apr 1st, 2009 at 9:53 pm

    We just went through a very large storage migration project using the lostcreations tool. We had some SAN issues we were fighting which caused some migrations to fail.

    We were able to use the lostcreations tool again on the same virtual machine and it fixed everything itself, getting the vm moved and back all in a single folder.

    Very cool!

  3. #3 Arnim van Lieshout
    on Apr 2nd, 2009 at 12:44 pm

    Jay,

    I’m very curious which steps you took exactly.
    I wasn’t able to use sVMotion anymore after it failed.

    -Arnim

  4. #4 Jay Rogers
    on Apr 2nd, 2009 at 1:33 pm

    In my case in a std 2 disk vm, I had one disk on the new lun and the other disk on the other lun with all the other files that make up a vm.

    I went back into the tool, selected the vm, and selected another LUN to move it to, and it fixed everything.

  5. #5 Arnim van Lieshout
    on Apr 9th, 2009 at 11:26 am

    I guess your crash was different than mine. When my sVMotion crashed I was unable to perform another sVMotion, even to a third LUN. In my situation the crash occurred on moving the first disk, maybe that was the difference. I ended up with the configuration files on the new location and my disk files still at the old location.

  6. #6 Michael Escobar
    on Jun 17th, 2009 at 7:56 pm

    What version of ESX was this with? Using the RCLI (which may be the difference) with 3.5 Build 158874 I get the following error trying to create or remove the snapshots:

    Fault:
    SOAP Fault:
    ———–
    Fault string: A general system error occurred: You must power off the virtual machine and complete the migration before invoking this operation.
    Fault detail: SystemError=HASH(0xb772cf8)

    We’d like to complete this without downtime as well.

  7. #7 Arnim van Lieshout
    on Jun 17th, 2009 at 8:36 pm

    Michael,

    I don’t know exactly what build I was running back then.
    Must have been Update1 or Update2.

    You have to use the COS or connect your VI-client directly to your ESX host to remove the snapshot.

    -Arnim

  8. #8 bitsorbytes
    on Jul 7th, 2009 at 6:15 am

    Hi Arnim,

    Thanks heaps for your howto, this has been one of the most helpful posts I’ve seen this year!!

    I was looking for a way to recover from some failed svmotions and was sick of arranging downtime on the guests to fix failed moves. Now with this howto, you have made my life much easier!!

    Thanks

  9. #9 Marc Gijsman
    on Aug 5th, 2009 at 2:28 pm

    Hi Armin,

    I have tried to resolve the same issue with your solution and the procedure seems to be working. But when I try to svmotion the config files back to the original location I get the same error as before??

    Have I missed something?

    Marc

  10. #10 Arnim van Lieshout
    on Aug 5th, 2009 at 5:18 pm

    Hi Marc,

    Check if all steps had the desired effect.

    Make sure that all the snapshots are gone
    Make sure that the DMotionParent entries are set correctly to “”

    -Arnim

  11. #11 bitsorbytes
    on Aug 5th, 2009 at 11:34 pm

    For those worried about VMware not ‘supporting’ this, they have updated a KB on their website which runs through this. This website has more detail

    http://kb.vmware.com/kb/1009113

  12. #12 Arnim van Lieshout
    on Aug 5th, 2009 at 11:39 pm

    Thanks for sharing!

  13. #13 Maik from FL
    on Aug 19th, 2009 at 9:28 am

    Thanks!!! Works great for me. After deleting the guest I had an orphaned entry in the ESX host. Restarting the ESX host and the vCenter service did not solve it, but I disconnected the affected ESX host, removed it, and added it back to the cluster. No more orphaned entry, and all configuration such as LAN and storage is still available. THANKS for YOUR HELP!

    VMware KB entry:
    KB1003742

  14. #14 Chris Sprinkle
    on Sep 22nd, 2009 at 2:01 pm

    I performed an SVmotion on a VM and the swap file was left behind in the old datastore. There were no reported errors. The VM continued to run without an issue. All configurations are to store the swap with the VM.

    Anyone have any ideas?

  15. #15 Arnim van Lieshout
    on Sep 22nd, 2009 at 3:03 pm

    Chris,

    Is this swapfile actively used by the vm or just a leftover from the sVmotion?
    Check your .vmx file for sched.swap.derivedName= parameter.

    Probably the vm is using a new swap file in the new location. If this is the case, the old swapfile can be safely removed.

    -Arnim

  16. #16 Chris Sprinkle
    on Sep 22nd, 2009 at 4:59 pm

    The swapfile is listed in the .vmx file and is actively used. There is no new swapfile in the new location.

    I SVMotioned the VM again to a different datastore hoping that the swap file would change, but it remained in the original location.

  17. #17 Chris Sprinkle
    on Sep 22nd, 2009 at 5:31 pm

    Also, I have an entry for sched.swap.dir in the .vmx file.

  18. #18 Arnim van Lieshout
    on Sep 23rd, 2009 at 10:54 am

    Chris,

    Perhaps there’s some kind of corruption in this swap file, which prevents it from being moved.
    I haven’t seen this behaviour before.
    Did you try to power off the VM, delete swap file and clear the sched.swap.derivedName parameter in your .vmx file?
    Perhaps this will recreate the swap file in the correct location.
    If you still experience problems I suggest opening an SR with VMware Support or starting a discussion on the VMTN forum.

    -Arnim

  19. #19 Philippe Hoste
    on Oct 14th, 2009 at 12:26 pm

    Hello

    I also have a failed SVMotion. Now my server has its .vmdk on one LUN and all other files on the other, and it powered off. I’ve tried the procedure above but it doesn’t work for me: it already fails when trying to create a snapshot: I get error VMControl -3: Invalid arguments. I’ve tried the exact same command on other servers and it works fine.
    Can anybody help me ?

  20. #20 Arnim van Lieshout
    on Oct 19th, 2009 at 9:15 pm

    Try to power-on your VM first. If that doesn’t work, you have some sort of corruption on your VM.
    Try to resolve the corruption/misconfiguration first.
    Analyze the vmware.log file(s) to find out what is going wrong when powering-on your VM.

    -Arnim

  21. #21 Macnet
    on Mar 3rd, 2010 at 11:59 am

    Cheers for this Arnim (and Argyle).
    Just needed the first section:
    Create snapshot;
    remove snapshot;
    Migrated storage.
    (it even cleared up after itself…)
    Done

  22. #22 Alex Dumont
    on Mar 3rd, 2010 at 3:06 pm

    This post was the exact answer to my problem (same error)! Thanks a lot for taking the time to publish this entry.

  23. #23 Kiran Shewale
    on Jul 20th, 2010 at 4:58 pm

    Thanks Arnim for this important blog about VM recover from failed storage vmotion.
    Cheers!!!!!
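
On the swap file question from comments #14 to #18: the check of sched.swap.derivedName in the .vmx file can be scripted. The following is a minimal sketch, assuming the parameter is stored as `sched.swap.derivedName = "path"`. The function names are hypothetical. Note that the comparison is a plain string match on the directory, so a .vmx path written with the datastore label will not match a swap path written with the volume UUID, even when both point to the same place.

```shell
#!/bin/sh
# Print the value of sched.swap.derivedName from a .vmx file
# (prints nothing when the parameter is absent).
swap_path() {
    sed -n 's/^[[:space:]]*sched\.swap\.derivedName[[:space:]]*=[[:space:]]*"\(.*\)"[[:space:]]*$/\1/p' "$1"
}

# Return 0 when the swap file named in the .vmx file sits in the same
# directory as the .vmx file itself (compared as plain strings).
swap_next_to_vmx() {
    swap=$(swap_path "$1")
    [ -n "$swap" ] && [ "$(dirname "$swap")" = "$(dirname "$1")" ]
}
```

If `swap_next_to_vmx` fails for a VM that should keep its swap file with the VM, compare the output of `swap_path` with the location of the .vmx file by hand before deleting anything.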
