We came across a very strange PSOD issue: 9 out of 10 ESXi hosts kept crashing with a PSOD, and the strange part was that each time it was a random set of 9 servers.
This was one of the toughest weekends ever, spent tracking down the root cause. 72 hours only :)
VMware vCenter 5.1
VMware ESXi 5.1
Cisco UCS Servers
After analysis, the root cause turned out to be a misconfiguration of a single VMDK (Disk 0:12). Within the disk's parameters, the C/H/S (Cylinders/Heads/Sectors) geometry had been changed so that "Heads" had a value of 256, which is outside the limit of 255. VMware always creates a VMDK file with a Heads value of 255, so the ESX host did not expect a value of 256 and could not handle it, which caused the PSOD.
Because HA was enabled, when one host crashed, HA tried to power on the VM on another host in the cluster, which then crashed in turn. This repeated until every host had gone down except the last working host in the cluster.
The value was corrected by modifying the VMX file of the affected VM.
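A check for this kind of out-of-range value can be sketched in a few lines of Python. This is only an illustration under assumptions, not the procedure we used: it assumes the geometry is written as a line of the form ddb.geometry.heads = "N", which is the form commonly seen in VMDK descriptor files, and that the configuration text has already been read into a string. The names find_bad_heads and MAX_HEADS and the sample text are hypothetical.

```python
import re

# Upper limit for the "Heads" value, as described in the root cause.
MAX_HEADS = 255

# ASSUMPTION: the Heads value appears on its own line as
#   ddb.geometry.heads = "255"
# The exact key used in the affected VM's files is not confirmed here.
HEADS_RE = re.compile(
    r'^\s*ddb\.geometry\.heads\s*=\s*"?(\d+)"?\s*$',
    re.IGNORECASE | re.MULTILINE,
)


def find_bad_heads(config_text, max_heads=MAX_HEADS):
    """Return every Heads value in config_text that exceeds max_heads."""
    values = (int(m.group(1)) for m in HEADS_RE.finditer(config_text))
    return [v for v in values if v > max_heads]


# Hypothetical sample text with the bad value from this incident.
sample = (
    'ddb.geometry.cylinders = "1305"\n'
    'ddb.geometry.heads = "256"\n'
    'ddb.geometry.sectors = "63"\n'
)
print(find_bad_heads(sample))  # prints [256]
```

An empty list means no Heads value above 255 was found in the text that was checked. The sketch only reads text and reports values; it does not modify any file.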
Permanent Fix / Resolution
VMware engineering is developing a fix to prevent the PSOD from occurring and/or a preventative mechanism that does not allow a Heads value higher than 255 to be created.