1. CUDA out of memory (if you have multiple GPUs)
- If you have multiple GPUs, you can run PyTorch operations (e.g. torch.argsort) on any of them, not just a single GPU.
- Let's take this pattern as an example (sketched below): for every 800 input samples, I send the batch to a different GPU, picking the device ID from batch_idx.
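Here is a minimal sketch of what I mean; the data tensor and the argsort call are placeholders I made up, while the 800-sample batching and the batch_idx-based device choice are the actual idea:

```python
import torch

# Minimal sketch, assuming 2+ visible GPUs. The data tensor and the
# argsort call are placeholders; the 800-sample batching and the
# batch_idx-based device choice are the point of the example.
num_gpus = torch.cuda.device_count()
data = torch.randn(8000, 128)  # hypothetical input samples

for batch_idx, start in enumerate(range(0, data.size(0), 800)):
    batch = data[start:start + 800]
    device = torch.device(f"cuda:{batch_idx % num_gpus}")  # GPU ID from batch_idx
    batch = batch.to(device)             # send this batch to its own GPU
    order = torch.argsort(batch, dim=1)  # the op runs on that GPU, not cuda:0
```

Each batch allocates memory on its own card, so it is easy to blow past the memory of a GPU you forgot a tensor was sitting on.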
2. Loading a Python object saved on GPU back onto CPU
- I did a stupid thing: I saved a dictionary containing some GPU tensors to a .npy file.
- Then, when I loaded the .npy file, the tensors were restored straight onto my GPUs, which were already busy, and they ran out of memory.
- Hell yeah, I googled a lot but couldn't find any solution to load only part of the file or to load the dictionary directly onto the CPU.
- Suddenly I thought: oh, I can load the dictionary onto the GPU, move the tensors back to the CPU, and then overwrite the dictionary. Haha. How smart I am!
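Here is a minimal sketch of that workaround; ckpt.npy is a made-up filename standing in for the file I actually saved:

```python
import numpy as np
import torch

# Minimal sketch of the workaround; "ckpt.npy" is a hypothetical filename.
# np.load unpickles the dict, so the tensors land back on the GPU they
# were saved from (this still needs enough free GPU memory once).
ckpt = np.load("ckpt.npy", allow_pickle=True).item()

# Move every tensor back to the CPU and overwrite the dictionary.
ckpt = {k: (v.cpu() if torch.is_tensor(v) else v) for k, v in ckpt.items()}

# Re-save so future loads never touch the GPU again.
np.save("ckpt_cpu.npy", ckpt)
```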