This is my first blog post for EMC Consulting, despite having been with the company for quite some time. I'm not sure why I've not got round to it earlier, but hey, better late than never!
A bit of background on me: I work as a Senior Systems Administrator for EMC Consulting. Prior to EMC, my skill set was mainly in Microsoft technologies, but EMC Consulting has given me the opportunity to work with a far wider array of technologies, including VMware and Microsoft virtualisation and EMC SAN technology.
Daily contact with these technologies has allowed me to learn a great deal, and I've encountered and overcome many problems and challenges along the way. Until now I've simply shared this knowledge with my colleagues. Going forward, I hope this blog will let me pass on any relevant knowledge I've gathered to a wider audience. Hopefully the ensuing dialogue will allow me to learn too!
That’s enough about me. Now, down to business.
I recently had a problem in my production VMware environment: a power failure took down a number of ESX hosts in our HA cluster (all running ESX 4.0 U1). When the hosts came back online, large numbers of VMs were greyed out in an invalid/unknown state. A large number (but not all) of the remaining VMs gave a 'Resource is busy/in use' or 'unable to access a file since it is locked' error when we tried to power them on. The vSphere server was one of the VMs stuck in the busy/in-use state.
Now, file locking typically occurs when more than one host thinks it owns a VM's files. I tried the procedure in this article by Gabrie van Zanten (http://www.gabesvirtualworld.com/unable-to-access-a-file-since-it-is-locked/) but was still unable to unlock the files. I even rebooted the hosts, without success.
I'd had a similar issue previously with locked files on a single VM, and resolved it by moving the VM in question back to its original host. This time, though, I tried something different, thanks to some assistance from VMware support. I decided to start with the vSphere server VM.
vmkfstools -D <path to vmdk>
tail /var/log/vmkernel | grep -i owner
Note that the -D switch doesn't print the lock details to the console; it dumps them into /var/log/vmkernel, which is why the second command checks the log rather than piping the output. The 'owner' field in the logged lock entry contains a UID, and the last 12 hex digits of that UID are the MAC address of the ESX host which owns the file.
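To make the owner field concrete, here's a minimal sketch of pulling the MAC address out of such a log entry. The log line below is hypothetical (the format and values are illustrative, modelled on published VMware KB examples), so treat it as an assumption rather than exact output from your own hosts:

```shell
# Hypothetical vmkernel log line of the kind 'vmkfstools -D' produces
# (format and all values are illustrative, not from a real host):
line='Jan 10 17:00:38 esx01 vmkernel: 1:02:43:39.069 cpu2:1038)Lock [type 10c00001 offset 13058048 v 20, hb offset 3499520 gen 19, mode 1, owner 45feb537-9c52009b-e812-00137266e200 mtime 1174669462]'

# Pull out the owner UID; its last 12 hex digits are the MAC address
# of the host holding the lock.
owner_uid=$(printf '%s\n' "$line" | grep -oE 'owner [0-9a-f-]+' | awk '{print $2}')
mac_hex=${owner_uid##*-}                                   # e.g. 00137266e200
mac=$(printf '%s\n' "$mac_hex" | sed 's/../&:/g; s/:$//')  # insert colons
echo "$mac"                                                # 00:13:72:66:e2:00
```

You can then compare that MAC against the NICs of each host in the cluster (for example via the vSphere client's network adapter details) to identify the culprit.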
Once the owning host had been identified from the MAC address and the stale lock released, the VM could be powered on.
Once the vSphere server was online, it resolved the locks on all the other VMs (there were over 100 VMs down), and all the invalid/unknown VM entries disappeared.
My understanding is that when the ESX hosts in the HA cluster went down, the remaining hosts tried to take control of the VMs on the failed hosts. When the failed hosts came back online a few minutes later, they still believed they owned those VMs, and that conflict created the locks. If the vSphere server had been online, I assume it would have resolved these conflicts, but with it down it was every ESX host for itself, it seems!
I spoke to a VMware support engineer about this but didn't really get a satisfactory answer as to why it occurred. In my case, having HA enabled actually created the whole issue: without it, the VMs could simply have been powered on once the hosts came back online, with no file locking.
If anyone has any knowledge of this issue or has experienced something similar, I'd love to hear from you.