Incorrect BIOS settings on a server running a hypervisor can cause MMIO addressing issues that prevent GRID GPUs from being recognized.
If the BIOS settings are incompatible with the hypervisor's support for MMIO addressing, the GPUs may not be recognized. Symptoms and errors can include:
· nvidia-smi calls fail with the error:
Failed to initialize NVML: Unknown Error
Many other configuration errors can also cause this error, so further investigation is advised.
· vGPU profiles are not available in vCenter / XenCenter
· The hypervisor command "dmesg | grep NVIDIA" finds messages containing:
"This PCI I/O region assigned to your NVIDIA device is invalid"
· Sometimes a misconfiguration is only noticed on upgrade, because MMIO memory holes had previously been avoided by chance
Many other misconfigurations could cause similar issues.
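As a rough aid for the checks above, the following Python sketch scans captured log or command output for the two messages quoted in the symptom list. The function name `find_mmio_symptoms` and the sample input are hypothetical; the sample is synthetic text written for this illustration, not real driver output.

```python
# Message fragments quoted in the symptom list above.
SYMPTOM_MESSAGES = [
    "Failed to initialize NVML: Unknown Error",
    "This PCI I/O region assigned to your NVIDIA device is invalid",
]


def find_mmio_symptoms(text):
    """Return the lines of `text` that contain one of the quoted messages."""
    hits = []
    for line in text.splitlines():
        if any(message in line for message in SYMPTOM_MESSAGES):
            hits.append(line.strip())
    return hits


# Synthetic sample input (an assumption for illustration only).
sample = "\n".join([
    "example: This PCI I/O region assigned to your NVIDIA device is invalid",
    "example: an unrelated line",
])

for hit in find_mmio_symptoms(sample):
    print(hit)
```

A match only indicates that one of these symptoms is present; as noted above, other misconfigurations can produce the same messages.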
All versions of Citrix XenServer prior to XS6.5 were 32-bit hypervisors; XS6.5 was Citrix's first 64-bit hypervisor. As such, versions of XenServer earlier than XS6.5 must be run on servers with MMIO mapping above 4G disabled (server vendors name this BIOS option differently, e.g. 64-bit MMIO, Memory Hole for PCI MMIO, Above 4G Decoding). Further information is available from Citrix: http://support.citrix.com/article/CTX139834. (32-bit addressing corresponds to a 4 GB limit; see Wikipedia.)
The current version of ESXi / vSphere (6.0, up to and including 6.0 Update 2) and earlier versions are limited to 44-bit addressing for vGPU PCI devices, although ESXi is a 64-bit hypervisor. BIOS settings must ensure that PCI addressing for NVIDIA GRID GPUs stays below the 44-bit limit.
This KB article from VMware outlines the limits on MMIO access for PCI devices: https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2087943. The article documents:
· The Maximum Physical Address (MAXPA) is limited to 16TB in version 6.0 (44-bit, 2 to the power of 44).
· vSphere/ESXi 5.1.x is limited to 4GB for MAXPA, so decoding above 4G should be disabled.
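The limits above can be restated as address widths. The sketch below is an illustration only: the dictionary labels, the function names, and the example base and size values are assumptions chosen for this example, not values taken from any vendor tool.

```python
GIB = 1 << 30
TIB = 1 << 40

# Address widths implied by the limits stated in the text.
# The keys are informal labels chosen for this sketch.
ADDRESS_BITS = {
    "XenServer earlier than 6.5": 32,  # 32-bit hypervisor, 4 GB limit
    "ESXi 5.1.x": 32,                  # MAXPA of 4 GB
    "ESXi 6.0": 44,                    # MAXPA of 16 TB
}


def limit_bytes(bits):
    """Size of the address space reachable with `bits` address bits."""
    return 1 << bits


def region_fits(base, size, bits):
    """True if the region [base, base + size) lies entirely below 2**bits."""
    return base + size <= limit_bytes(bits)


print(limit_bytes(32) // GIB)   # 4  (GB)
print(limit_bytes(44) // TIB)   # 16 (TB)

# Hypothetical MMIO region: 64 GiB mapped at a 32 TiB base address.
print(region_fits(32 * TIB, 64 * GIB, ADDRESS_BITS["ESXi 6.0"]))  # False
```

A region mapped above the limit, as in the last call, is the kind of placement the BIOS settings need to avoid for the hypervisor version in use.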
Customers experiencing errors associated with MMIO and PCI I/O allocation should consult their server vendor for advice on how best to configure the BIOS so that MMIO region access on their particular server fits within the constraints of the hypervisor version they have chosen.
Some customers found that on upgrading GRID 3.0 they encountered vGPU unavailable within vCenter, where previously MMIO holes had been avoided by luck. SuperMicro have confirmed that the appropriate BIOS settings for the server (1028GQ-TRT) should include MMIOHBase set to 2T; this ensures the MMIO mapping is indexed at 41 bits (below the 44-bit limit).
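The arithmetic behind the 41-bit figure can be checked with a few lines of Python. This assumes "2T" means 2 x 2^40 bytes; the variable names are chosen for this sketch.

```python
TIB = 1 << 40

# Assumption: the BIOS value "2T" denotes 2 * 2**40 bytes.
mmioh_base = 2 * TIB

# Index of the highest set bit in the base address.
top_bit = mmioh_base.bit_length() - 1

print(top_bit)                    # 41
print(mmioh_base == 1 << 41)      # True
print(mmioh_base < (1 << 44))     # True: the base is below the 44-bit limit
```

Note that this only checks where the mapping starts; whether the whole MMIO region stays below the limit also depends on its size, which the server vendor can confirm.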
NVIDIA GRID vGPU