Error with the GPU while using Clouderizer with Google Colab


#1

I’m trying to run Clouderizer inside Google Colab, but whenever I check the GPU with nvidia-smi I get this error: “Failed to initialize NVML: Driver/library version mismatch”. I have tried installing with and without the CUDA option, purging and reinstalling the NVIDIA driver (both before starting Clouderizer and while it was running), starting a new Google Colab project, and using a different Google account, but I still get the same error. How can I solve it?


#2

Hi Mlewi

I suspect some CUDA driver changes on Google’s host machines, where the Colab Docker containers run.

Can you try setting this env variable?
export LD_PRELOAD=/usr/lib64-nvidia/libnvidia-ml.so

After this, try nvidia-smi in the same terminal session. Let me know if this works. I will then try to incorporate it into the Clouderizer init on Colab.

-Prakash
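
For anyone adding the suggestion above to a startup script, here is a minimal sketch that sets LD_PRELOAD only when the library file is actually present. The function name preload_nvml is made up for illustration; the default path is the one from the post and may differ on other Colab images.

```shell
# preload_nvml is a hypothetical helper name, not part of Clouderizer.
# The default path is the one given in the post above; pass another path if needed.
preload_nvml() {
    nvml_lib="${1:-/usr/lib64-nvidia/libnvidia-ml.so}"

    if [ -e "$nvml_lib" ]; then
        # Preload this NVML library for commands started from this shell.
        export LD_PRELOAD="$nvml_lib"
        echo "LD_PRELOAD set to $nvml_lib"
    else
        # Leave the environment untouched if the file is missing.
        echo "NVML library not found at $nvml_lib" >&2
        return 1
    fi
}

# Usage (in the same terminal session):
#   preload_nvml && nvidia-smi
```

Checking for the file first avoids exporting a path that does not exist, which would make the dynamic linker print a preload warning for every command run afterwards.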


#3

That worked seamlessly, thank you very much!
I then hit another issue with torch not recognizing CUDA (‘CUDA module initialization failed’), which may have been something on my end. I searched a bit and found a solution on the NVIDIA forums: adding some symlinks, and creating the file mentioned there with a small modification, fixed it, because it seems the CUDA installer didn’t create them properly. Thanks to the startup script the process is now automated, so everything works smoothly.
Again, thank you very much!


#4

Great!

Can you post the changes you made to get torch working? It might help others looking into this issue.

-Prakash


#5

Sure thing.
First, I added a symlink for the CUDA library, like so:
cd /usr/lib64-nvidia
ln -s libcuda.so.1 libcuda.so

Then I added the config file “nvidia-lib64.conf”, which was missing from the folder “/etc/ld.so.conf.d”. It contains the path to NVIDIA’s library directory (in my case, lib64), and I created it by running:
echo /usr/lib64-nvidia/ > /etc/ld.so.conf.d/nvidia-lib64.conf

Finally, I reloaded the linker cache and library links so the changes are picked up:
sudo ldconfig
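
For a startup script, the first two steps above could be wrapped in one re-runnable function along these lines. This is only a sketch: fix_cuda_libs and its two optional arguments are illustrative names rather than anything Clouderizer provides, and the defaults are simply the paths from this post. The ldconfig call is left to the caller.

```shell
# fix_cuda_libs is a hypothetical helper; both arguments are optional and
# default to the paths used in this thread.
fix_cuda_libs() {
    lib_dir="${1:-/usr/lib64-nvidia}"
    conf_file="${2:-/etc/ld.so.conf.d/nvidia-lib64.conf}"

    # Stop early if the versioned library is not where we expect it.
    if [ ! -e "$lib_dir/libcuda.so.1" ]; then
        echo "libcuda.so.1 not found in $lib_dir" >&2
        return 1
    fi

    # Step 1: create (or refresh) the unversioned symlink next to libcuda.so.1.
    ln -sf libcuda.so.1 "$lib_dir/libcuda.so" || return 1

    # Step 2: tell the dynamic linker to search this directory.
    echo "$lib_dir" > "$conf_file" || return 1
}

# Usage (as root, e.g. in the startup script):
#   fix_cuda_libs && ldconfig
```

Using ln -sf means the script can run on every startup without failing when the symlink already exists.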