Monday, September 6, 2021

PyTorch frequent errors and solutions

1. CUDA memory overflow (if you have multiple GPUs)

- If you have multiple GPUs, you can run PyTorch operations (e.g. torch.argsort) on all of them, not just on a single GPU.

- Let's take this snippet as an example:








OK, for every 800 input samples, I send the batch to a different GPU, whose ID is specified by batch_idx.
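
Here is a minimal sketch of that idea, not the original snippet. It assumes the GPU ID is simply batch_idx modulo the number of visible GPUs, and that the results are small enough to move back to the CPU. The helper names pick_device and argsort_in_batches are hypothetical, and the code falls back to the CPU when no GPU is available.

```python
import torch


def pick_device(batch_idx: int) -> torch.device:
    """Map a batch index to a device, cycling over the visible GPUs.

    Assumption: round-robin by batch_idx; falls back to the CPU without CUDA.
    """
    n_gpus = torch.cuda.device_count()
    if n_gpus == 0:
        return torch.device("cpu")
    return torch.device(f"cuda:{batch_idx % n_gpus}")


def argsort_in_batches(data: torch.Tensor, batch_size: int = 800) -> torch.Tensor:
    """Run torch.argsort row-wise, one batch at a time, spread over the GPUs."""
    results = []
    for batch_idx, start in enumerate(range(0, data.shape[0], batch_size)):
        device = pick_device(batch_idx)
        # Only this batch lives on the chosen GPU, so no single GPU holds everything.
        batch = data[start:start + batch_size].to(device)
        order = torch.argsort(batch, dim=1)
        # Move the result back to the CPU so GPU memory is not held by the output.
        results.append(order.cpu())
    return torch.cat(results, dim=0)


data = torch.rand(2000, 16)
order = argsort_in_batches(data)
```

With 2000 rows and a batch size of 800, this makes three batches; with two GPUs, batches 0 and 2 would go to cuda:0 and batch 1 to cuda:1.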

2. Load a Python object containing GPU tensors onto the CPU

- I made a silly mistake: I saved a dictionary containing some GPU tensors to a .npy file.

- Then, when I loaded the .npy file, it took over my GPUs and they ran out of memory.

- Hell yeah, I googled a lot but could not find any way to load only part of the file or to load the dictionary directly onto the CPU.

- Suddenly I thought: oh, I can load the dictionary onto the GPU, move the tensors back to the CPU, and then overwrite the dictionary. Haha. How smart I am!
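
The trick above could look roughly like this. It is only a sketch: the helper tensors_to_cpu, the dictionary keys, and the file name features.npy are made up for illustration, and it assumes the dictionary was written with np.save (which pickles it) and read back with np.load(..., allow_pickle=True).

```python
import os
import tempfile

import numpy as np
import torch


def tensors_to_cpu(obj):
    """Return a copy of obj with every tensor on the CPU (hypothetical helper).

    Dicts and lists are walked recursively; anything else is returned as-is.
    """
    if isinstance(obj, torch.Tensor):
        return obj.cpu()
    if isinstance(obj, dict):
        return {key: tensors_to_cpu(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [tensors_to_cpu(value) for value in obj]
    return obj


# Build a small dictionary like the one in the story (uses the CPU if there is no GPU).
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
saved = {"emb": torch.ones(4, 3, device=device), "label": "cat"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "features.npy")
    # np.save pickles the dictionary, tensors included.
    np.save(path, saved, allow_pickle=True)
    # Unpickling puts each tensor back on the device it was saved from.
    loaded = np.load(path, allow_pickle=True).item()

# Move everything to the CPU and overwrite the dictionary.
loaded = tensors_to_cpu(loaded)
```

After this step the dictionary can be saved again so that later loads never touch the GPU. For files written with torch.save instead, torch.load accepts map_location="cpu", which loads the tensors straight onto the CPU.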
