Having problems with new M6/M60 like VMs fail to power on, NVRM BAR1 error, ECC is enabled, or nvidia-smi fails

Updated 09/29/2021 01:02 PM

NVIDIA M60 / M6 Problems - checking your card in "graphics" mode

Quick Check

If you are experiencing issues with a new M60 / M6 card, please run this command:

lspci -n | grep 10de

You should get something that looks like:

[root@SAGRID-ESXi5:~] lspci -n | grep 10de

0000:04:00.0 Class 0300: 10de:11bf [vmgfx2]

0000:05:00.0 Class 0300: 10de:11bf [vmgfx3]

0000:08:00.0 Class 0300: 10de:11bf [vmgfx0]

0000:09:00.0 Class 0300: 10de:11bf [vmgfx1]

0000:86:00.0 Class 0300: 10de:11bf [vmgfx4]

0000:87:00.0 Class 0300: 10de:11bf [vmgfx5]

If your system reports class 0302 instead of 0300, the GPU is in "compute" mode. If it reports 0300 as above, the GPU is reporting "graphics" mode, but you should still continue with the checks below to confirm.
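
As an illustration of the class-code check above, here is a minimal shell sketch. The function name classify_gpu_mode is hypothetical, and it assumes awk is available on the host. It reads "lspci -n" output on stdin and prints one line per NVIDIA (vendor 10de) device with the mode implied by class 0300 or 0302.

```shell
# Hypothetical helper: classify NVIDIA devices from `lspci -n` output on stdin.
# Prints "<pci-address> <mode>" where mode is graphics, compute, or unknown.
classify_gpu_mode() {
    awk '{
        mode = "unknown"; nvidia = 0
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^10de:/) nvidia = 1          # NVIDIA vendor ID
            if ($i == "0300:") mode = "graphics"   # VGA controller class
            if ($i == "0302:") mode = "compute"    # 3D controller class
        }
        if (nvidia) print $1, mode
    }'
}
```

Usage: lspci -n | classify_gpu_mode. A "graphics" result only reflects what the card reports, so the remaining checks in this article still apply.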

If the GPUs are in compute mode, use the gpumodeswitch utility to change them to graphics mode, then reboot the server to apply the change. The gpumodeswitch utility is supplied in the driver download and is fully documented.

NVIDIA's Jason Southern has published a video to guide you through the utility too: https://www.youtube.com/watch?v=VAQhiNNFXxQ&feature=youtu.be

Side note: lspci is a standard utility on many Unix-like operating systems that probes PCI devices (such as GPUs). "10de" is the vendor identifier allocated to NVIDIA, so this command picks out just the NVIDIA devices from the potentially long list of PCI devices in your system.

NOTE: In some cases we have found that even though graphics mode is reported, VMs fail to start:

Cisco B200m4
Graphics Processor:
Tesla M6

Symptom 1 PRESENT:

1. VMs (Virtual Machines) fail to power on

You see an error message such as: "Could not initialize plugin '/usr/lib64/vmware/plugin/libnvidia-vgx.so' for vGPU 'grid_m60-2q'"

Symptom 6 PRESENT:

6. The memory on the GPU is missing 0.5 gigabytes of RAM

For example, vCenter shows 7.49 GB of RAM rather than the expected 7.99 GB; this could be a sign that the card is in compute mode.

Symptom 5 INCONCLUSIVE because there was no entry for the M6 anywhere in this output:

5. A Memory BAR of 8 gigabytes is being used (this means you are in compute mode)

On hypervisors with a full version of Linux, such as XenServer (VMware does not have a full version), you can use:

[root@localhost libvirt]# lspci -vvvv -d "10de:*" | grep Region

To check whether the GPU BARs are correctly assigned on ESX, use:

# esxcfg-info -a | less

Then scroll or search down to the Tesla M60 label. ESX does not have a full version of lspci, so the Linux command "lspci -vvv" won't work.

Side notes:
nvidia-smi behaved as if the card was in graphics mode.
When adding a device, "PCI Device Shared" was visible as if the driver VIB had loaded successfully, and the M6 profiles could be selected.
So without looking at the RAM reported for the graphics card in VMware vCenter, nothing indicated that there was a problem.

Resolution:
Apparently, in some cases the GPU will look like it's in graphics mode when in fact it isn't.
Uninstalled the GPU Manager VIB, re-ran the gpumodeswitch process to validate that the GPU was indeed in graphics mode, then reinstalled the GPU Manager VIB.

Background

By default, M60 / M6 cards are currently (April 2016) shipped to OEMs (server vendors) in "compute" mode, and many do not switch them over when shipping a server designed for GRID graphics remoting.

What Symptoms / problems could you see if in "compute" mode?

1. VMs (Virtual Machines) fail to power on

· You see an error message such as: "Could not initialize plugin '/usr/lib64/vmware/plugin/libnvidia-vgx.so' for vGPU 'grid_m60-2q'"

2. Running the nvidia-smi utility returns an error (after loading the VIB on ESX; see below for how to check that the VIB is loaded correctly)

· "NVRM: BAR1 is 8192M @ 0x387c00000000 (PCI:0000:83:00.0)
NVRM: This is a 64-bit BAR mapped above 16 TB by the system
NVRM: BIOS or the VMware ESXi kernel. This PCI I/O region assigned
NVRM: to your NVIDIA device is not supported by the kernel.
NVRM: BAR1 is 8192M @ 0x387800000000 (PCI:0000:84:00.0)"

3. ECC Protection is enabled (this means you are in compute mode)

4. I/O Base BAR is disabled (this means you are in compute mode)

5. A Memory BAR of 8 gigabytes is being used (this means you are in compute mode)

On hypervisors with a full version of Linux, such as XenServer (VMware does not have a full version), you can use:

· [root@localhost libvirt]# lspci -vvvv -d "10de:*" | grep Region

To check whether the GPU BARs are correctly assigned on ESX, use:

· # esxcfg-info -a | less

Then scroll or search down to the Tesla M60 label. ESX does not have a full version of lspci, so the Linux command "lspci -vvv" won't work.

6. The memory on the GPU is missing 0.5 gigabytes of RAM

· For example, vCenter shows 7.49 GB of RAM rather than the expected 7.99 GB; this could be a sign that the card is in compute mode.

7. An emulator required to run this VM failed to start
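
The memory BAR check in symptom 5 can be sketched as a small filter for hosts with a full lspci (not ESX). The function name find_8g_bars is hypothetical, and the sketch assumes lspci prints BAR sizes in its usual "[size=8G]" form on the Region lines.

```shell
# Hypothetical helper: reads `lspci -vvvv -d "10de:*"` output on stdin and
# prints every Region line that reports an 8 GB BAR.
# Exit status is 0 if at least one such line is found, 1 otherwise.
find_8g_bars() {
    grep 'Region' | grep -F '[size=8G]'
}
```

Usage: lspci -vvvv -d "10de:*" | find_8g_bars && echo "8 GB BAR found: GPU may be in compute mode"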

Side note: Other errors from nvidia-smi

· "NVIDIA-SMI has failed because it couldn't communicate with Nvidia Driver"

This message usually means that the driver is not loaded or is the wrong version, so after running nvidia-smi, type:

# dmesg

And check all the NVRM entries in the output.

But first check that the VIB is loaded correctly:

# esxcli software vib list | grep -i nvidia (to see whether the VIB is installed)

# esxcli system module load -m nvidia (it should return something like "Unable to load module /usr/lib/vmware/vmkmod/nvidia: Busy", which means the driver is already loaded)

If you see any other message or error, use dmesg to check the output for more information about it.
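
To make the NVRM entries easier to review, the following sketch filters log text for them. The function name summarize_nvrm and the wording of the hint line are hypothetical, and awk is assumed to be available. It prints every NVRM line and adds a hint when the 8192M BAR1 message quoted earlier is present.

```shell
# Hypothetical helper: reads kernel log text (for example from dmesg) on stdin.
# Prints all NVRM lines, then one hint line if the 8192M BAR1 message appears.
summarize_nvrm() {
    awk '/NVRM:/ { print; if ($0 ~ /BAR1 is 8192M/) bar = 1 }
         END { if (bar) print "hint: 8192M BAR1 reported; GPU may be in compute mode" }'
}
```

Usage: dmesg | summarize_nvrm. The hint is only a pointer to the compute mode check; as the caveat below says, the same symptom can have other causes.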

Caveat

Symptoms may have multiple root causes!

The symptoms above are signs that you may need to switch from "compute" to "graphics" mode. If you check and everything is OK, or you change the mode and still have issues after a reboot, then you probably have a different issue. M60 and M6 cards come with full support, and you should raise a support ticket with NVIDIA.

You can of course discuss issues on our support forums:

· https://gridforums.nvidia.com/

Applicable Products

NVIDIA M60 / M6
