Shiyu

Ubuntu troubleshooting notes

Reinstalling the GPU driver

After one installation, I ran into this bug:

Can’t run remote python interpreter: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request: unknown

nvidia-smi stopped working inside Docker, and running nvidia-smi directly on the host, outside Docker, also failed:

NVIDIA-SMI couldn’t find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system. Please also try adding directory that contains libnvidia-ml.so to your system PATH.

My guess is that a system update broke it at some point.

Fix: reinstall the GPU driver
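Before purging anything, it can help to confirm what is actually missing. Below is a minimal sketch, assuming a Linux host with Python 3; the helper name `driver_status` is my own and not part of any NVIDIA tool. It only checks whether `nvidia-smi` is on the PATH and whether the dynamic linker can locate the NVML library that the error message complains about.

```python
import ctypes.util
import shutil


def driver_status():
    """Report whether the NVIDIA user-space pieces are visible on this machine."""
    return {
        # Full path to the nvidia-smi executable, or None if it is not on PATH
        "nvidia_smi": shutil.which("nvidia-smi"),
        # Library name resolved for libnvidia-ml, or None if it cannot be found
        "libnvidia_ml": ctypes.util.find_library("nvidia-ml"),
    }


if __name__ == "__main__":
    for name, found in driver_status().items():
        print(f"{name}: {found or 'NOT FOUND'}")
```

If `libnvidia_ml` comes back as not found while `nvidia_smi` is present, that would be consistent with the error above, where the tool is installed but the driver library is gone.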

# BTW this is all in console mode (for me, alt+ctrl+F2)
# login + password as usual

# removing ALL nvidia software
$ sudo apt-get purge 'nvidia*' # quoted so the shell does not expand the pattern against local files

# Checking what's left:
$ dpkg -l | grep nvidia
# Then I deleted the ones that showed up (mostly libnvidia-* but also xserver-xorg-video-nvidia-xxx)
$ sudo apt-get purge 'libnvidia*' xserver-xorg-video-nvidia-440
$ sudo apt autoremove # clean it up

# now reinstall everything including nvidia-common
$ sudo apt-get install nvidia-common

# find the right driver again
$ sudo add-apt-repository ppa:graphics-drivers/ppa
$ sudo apt update
$ ubuntu-drivers devices
$ sudo apt-get install nvidia-driver-460 # the recommended one by ubuntu-drivers
$ sudo update-initramfs -u # needed to do this so rebooting wouldn't lose configuration I think

$ sudo reboot
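The "checking what's left" step is easy to get wrong by eye, so here is a rough sketch of doing it in Python. It assumes you have captured the text of `dpkg -l` somehow; the function name `leftover_nvidia_packages` and the sample rows (including their version strings) are made up for illustration. Unlike `grep nvidia`, it matches on the package name only, not the description.

```python
def leftover_nvidia_packages(dpkg_output):
    """Return (status, package) pairs for NVIDIA-related rows of `dpkg -l` output."""
    leftovers = []
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Package rows start with a short alphabetic status such as "ii" or "rc";
        # header rows start with "Desired=", "|" or "+++" and are skipped here.
        if len(fields) < 2 or not fields[0].isalpha() or len(fields[0]) > 3:
            continue
        status, package = fields[0], fields[1]
        if "nvidia" in package.lower():
            leftovers.append((status, package))
    return leftovers


# Hypothetical sample rows, only to show the expected input shape
SAMPLE = """\
||/ Name                    Version   Architecture Description
+++-=======================-=========-============-===========
ii  libnvidia-gl-440:amd64  440.100   amd64        NVIDIA OpenGL libraries
rc  nvidia-driver-440       440.100   amd64        NVIDIA driver metapackage
ii  python3                 3.6.7     amd64        interactive language
"""

if __name__ == "__main__":
    for status, package in leftover_nvidia_packages(SAMPLE):
        print(status, package)
```

A status of `rc` means the package was removed but its config files remain, which is exactly the kind of leftover that `purge` cleans up.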

Then reinstall nvidia-docker:

$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu18.04/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update

$ sudo apt-get install nvidia-docker2
$ sudo pkill -SIGHUP dockerd
$ docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

Test:

sudo nvidia-docker run --rm nvidia/cuda:10.1-devel nvidia-smi

Fortunately, CUDA and cuDNN were both still there.

>>> import torch
>>> torch.cuda.is_available()
True
>>> a=torch.randn(1,2)
>>> a.cuda()
tensor([[-0.4678, 0.1525]], device='cuda:0')

To make nvidia-docker the default runtime instead of plain docker (https://zhuanlan.zhihu.com/p/37519492), put the following in /etc/docker/daemon.json:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "registry-mirrors": ["https://gemfield.mirror.aliyuncs.com"]
}
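If daemon.json already has other settings in it, overwriting the file by hand is a good way to lose them. Here is a small sketch of merging the runtime settings into an existing config instead. The helper name `with_nvidia_default` is my own invention, and the function only works on an in-memory dict; reading and writing /etc/docker/daemon.json (which needs root) is left out on purpose.

```python
import json


def with_nvidia_default(config):
    """Return a copy of a Docker daemon config with nvidia as the default runtime."""
    updated = dict(config)
    runtimes = dict(updated.get("runtimes", {}))
    # Register the runtime only if it is missing, so existing settings survive
    runtimes.setdefault(
        "nvidia",
        {"path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": []},
    )
    updated["runtimes"] = runtimes
    updated["default-runtime"] = "nvidia"
    return updated


if __name__ == "__main__":
    # Example: a config that so far only sets a registry mirror
    existing = {"registry-mirrors": ["https://gemfield.mirror.aliyuncs.com"]}
    print(json.dumps(with_nvidia_default(existing), indent=4))
```

Running the result through `json.dumps` also doubles as a syntax check, since a stray comma or misplaced brace in daemon.json can keep the Docker daemon from starting.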

Using Docker in PyCharm

Python location: /home/shiyuuuu/anaconda3/bin/python
