Incorrect BIOS settings on a server when used with a hypervisor can cause MMIO address issues that result in GRID GPUs failing to be recognized.

Updated 09/29/2021 01:02 PM



Symptoms or Errors

If the server's BIOS settings are incompatible with the hypervisor's support for MMIO addressing, the hypervisor may fail to recognize the GPUs. Symptoms and errors can include:

· nvidia-smi calls fail

[root@localhost:~] nvidia-smi
Failed to initialize NVML: Unknown Error

Many other configuration errors can cause this to happen too, so further investigation is advised.

· vGPU profiles are not available in vCenter / XenCenter

· Running "dmesg | grep NVIDIA" on the hypervisor returns messages containing: "This PCI I/O region assigned to your NVIDIA device is invalid"

· A misconfiguration is sometimes noticed only after an upgrade, because the earlier setup happened to avoid the MMIO memory holes by chance

Many other misconfigurations could cause similar issues.
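
The dmesg check above can be scripted when many hosts need reviewing. The following is a minimal sketch, assuming the kernel log has already been captured as text (for example by redirecting dmesg output to a file). The function name and the sample log lines are hypothetical; only the message string comes from this article.

```python
from typing import List

# Message text quoted in this article.
INVALID_REGION_MSG = "This PCI I/O region assigned to your NVIDIA device is invalid"


def find_invalid_region_lines(log_text: str) -> List[str]:
    """Return the log lines that report an invalid PCI I/O region."""
    return [line for line in log_text.splitlines() if INVALID_REGION_MSG in line]


# Hypothetical captured log, for illustration only.
sample_log = "\n".join([
    "example unrelated kernel message",
    "example prefix: " + INVALID_REGION_MSG,
])

matches = find_invalid_region_lines(sample_log)
print(len(matches), "matching line(s)")
```

An empty result does not rule out an MMIO problem, and a match should still be confirmed against the BIOS settings described below.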

Known Configurations where care is needed - Citrix XenServer

All versions of Citrix XenServer prior to XS6.5 were 32-bit hypervisors; XS6.5 was Citrix's first 64-bit hypervisor. Versions of XenServer earlier than XS6.5 must therefore be run on servers with MMIO mapping above 4G disabled (server vendors name this BIOS option differently, e.g. 64-bit MMIO, Memory Hole for PCI MMIO, Above 4G Decoding). 32-bit addressing corresponds to a 4G limit (2 to the power of 32). Further information is available from Citrix: http://support.citrix.com/article/CTX139834.

Known Configurations where care is needed - VMware ESXi / vSphere

The current version of ESXi / vSphere at the time of writing (6.0, up to and including 6.0 Update 2) and earlier versions limit vGPU PCI devices to 44-bit addressing, even though ESXi is a 64-bit hypervisor. BIOS settings must ensure that PCI addressing for NVIDIA GRID GPUs stays below the 44-bit limit.

This KB article from VMware outlines the limits on MMIO access for PCI devices: https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2087943. The article documents:

· the Maximum Physical Address (MAXPA) is limited to 16TB in version 6.0 (44-bit, 2 to the power of 44).

· vSphere/ESXi 5.1.x is limited to 4GB for MAXPA, so decoding above 4G should be disabled.
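
The arithmetic behind these limits can be checked with a short sketch. It assumes that a region fits when its last byte lies below 2 to the power of the address width. The helper names and the example region (a 64 GiB window based at 2 TiB) are hypothetical and are not taken from the VMware article.

```python
GIB = 1 << 30
TIB = 1 << 40


def address_limit_bytes(address_bits: int) -> int:
    """Number of bytes addressable with the given address width."""
    return 1 << address_bits


def fits_below_limit(region_base: int, region_size: int, address_bits: int) -> bool:
    """True if the whole region [base, base + size) lies under the limit."""
    return region_base + region_size <= address_limit_bytes(address_bits)


print(address_limit_bytes(32) // GIB, "GiB for 32-bit addressing")
print(address_limit_bytes(44) // TIB, "TiB for 44-bit addressing")

# Hypothetical MMIO window: 64 GiB based at 2 TiB.
print(fits_below_limit(2 * TIB, 64 * GIB, 44))  # within a 44-bit limit
print(fits_below_limit(2 * TIB, 64 * GIB, 32))  # outside a 32-bit limit
```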

Server BIOS configuration

Customers experiencing errors associated with MMIO and PCI I/O allocation should consult their server vendor for advice on how best to configure the BIOS. The configuration must fit the MMIO region access constraints of both the particular server and the specific hypervisor version in use.

Specific example - SuperMicro Server

Some customers found that after upgrading to GRID 3.0, vGPU became unavailable within vCenter, where previously the MMIO holes had been avoided by luck. SuperMicro have confirmed that the appropriate BIOS settings for the server (1028GQ-TRT) should include MMIOHBase set to 2T. This ensures the MMIO mapping is based at 41 bits (2T is 2 to the power of 41), below the 44-bit limit.
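
The relationship between the 2T setting and the 41-bit figure can be verified with a small sketch. It assumes that the "T" and "G" suffixes in the BIOS value denote binary units (2 to the power of 40 and 30 bytes). The parsing helper is hypothetical and does not reflect any SuperMicro tool.

```python
# Assumed binary multipliers for BIOS size suffixes.
SUFFIXES = {"G": 1 << 30, "T": 1 << 40}


def parse_bios_size(value: str) -> int:
    """Convert a value such as '2T' into a byte count."""
    number, suffix = value[:-1], value[-1].upper()
    return int(number) * SUFFIXES[suffix]


def base_bit_position(base: int) -> int:
    """Exponent of the highest set bit; for a power of two this is log2(base)."""
    return base.bit_length() - 1


mmioh_base = parse_bios_size("2T")
print("MMIOHBase bytes:", mmioh_base)
print("bit position:", base_bit_position(mmioh_base))
print("below 44-bit limit:", mmioh_base < (1 << 44))
```

The base address alone does not show where the mapped region ends, so the size of the MMIO window still needs to be taken into account when comparing against the hypervisor limit.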

Relevant Products

Citrix XenServer

VMware vSphere/ESXi

NVIDIA GRID vGPU
