Description
Describe the bug
I have converted the DeepEdit transforms (https://github.com/Project-MONAI/MONAI/blob/dev/monai/apps/deepedit/transforms.py) to run on the GPU instead of the CPU. I just spent over a month debugging the code, since modifying the transforms to run on the GPU completely prevented me from running the code without OOM errors. I will paste some funny images below.
This means I could no longer run the code on an 11 GB GPU (with a smaller crop size) or a 24 GB GPU. After getting access to a big cluster I tried 50 GB and even 80 GB GPUs, and the code still crashed.
Most confusing of all, the crashes were apparently random: they happened at different epochs even when the same code was run twice, and the memory usage seemed to follow no pattern at all.
After debugging my own code for weeks, I realized by inspecting the garbage collector that some references are never cleared and the GC count keeps increasing. This insight led me to #3423, which describes the problem pretty well.
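For reference, a minimal sketch of the kind of instrumentation I used (the helper name and exact logging are my own, only the gc calls are standard library API):

```python
import gc


def log_gc_state(step: int) -> None:
    """Hypothetical debugging helper: print how much the cycle collector is tracking.

    In my runs both numbers kept growing across iterations instead of staying
    roughly constant, which pointed to references that are never cleaned up.
    """
    counts = gc.get_count()          # pending objects per generation (gen0, gen1, gen2)
    tracked = len(gc.get_objects())  # objects currently tracked by the collector
    print(f"step {step}: gc counts={counts}, tracked objects={tracked}")
```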
The problematic, nondeterministic behavior is linked to the garbage collection, which only cleans up orphaned references when they use a lot of memory. That was fine for the previous transforms, since they ran in RAM, where the orphaned memory areas are rather big and therefore get cleaned up very soon.
This is not true, however, for GPU tensors in torch, whose references are cleared at seemingly random times, but apparently not often enough for the code to work. This also explains why calling torch.cuda.empty_cache() brought no relief: the references to the memory still existed even though they were out of scope, so torch does not know that it can release the GPU memory.
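To illustrate the mechanism (plain gc and torch calls, nothing MONAI-specific): the cached GPU blocks can only be returned once the cycle collector has actually dropped the dangling references.

```python
import gc

import torch

# empty_cache() alone cannot help: tensors that are still reachable through
# uncollected reference cycles count as "in use" for the caching allocator.
torch.cuda.empty_cache()

# Only after the cycle collector has run can the allocator actually free
# the corresponding GPU memory.
gc.collect()
torch.cuda.empty_cache()
```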
The fix for this random behavior is to add a GarbageCollector(trigger_event="iteration") to the training and validation handlers, as sketched below.
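A minimal sketch of how I attach it (the GarbageCollector handler and its trigger_event argument are real MONAI API; the rest of the handler lists are placeholders for whatever handlers your workflow already uses):

```python
from monai.handlers import GarbageCollector

# Run gc.collect() after every iteration of both workflows.
train_handlers = [
    # StatsHandler(...), ValidationHandler(...), ... (your existing handlers)
    GarbageCollector(trigger_event="iteration"),
]
val_handlers = [
    # StatsHandler(...), ... (your existing handlers)
    GarbageCollector(trigger_event="iteration"),
]

# These lists are then passed as train_handlers= / val_handlers= to
# SupervisedTrainer and SupervisedEvaluator respectively.
```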
I did not find any MONAI docs that mention this behaviour, specifically when it comes to debugging OOM or cuDNN errors. However, since the GarbageCollector handler already exists, I guess other people must have run into this issue as well, which makes it even more frustrating to me.
--> Conclusion: I am not sure if there is an easy solution to this problem. Seeing that other people are running into this issue, and since these are hard, non-deterministic bugs, it is very important to fix it imo. What I do not know is how complex a fix would be; maybe someone here knows more. I also don't know whether this behavior sometimes occurs with plain PyTorch code, but if it is MONAI-specific it is framework breaking.
As a temporary fix I can add: the overhead of calling the GarbageCollector every iteration appears to be negligible in my case. Maybe this should be a default handler for SupervisedTrainer and SupervisedEvaluator, only to be turned off with a performance flag if needed.
To Reproduce
Run the DeepEdit code and follow the speedup guide, more specifically the part about moving the transforms to the GPU.
In my experience, adding ToTensord(keys=("image", "label"), device=device, track_meta=False) at the end of the transform chain is already enough to run out of GPU memory, or at least to increase the usage dramatically and, most importantly, non-deterministically.
I did, however, rework all of the transforms and moved them, including FindDiscrepancyRegionsDeepEditd, AddRandomGuidanceDeepEditd and AddGuidanceSignalDeepEditd, to the GPU (also see #1332 about that). A minimal reproduction sketch follows below.
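A sketch of the relevant change (the transform chain is abbreviated here; see the DeepEdit app for the full list of pre-transforms):

```python
import torch
from monai.transforms import Compose, EnsureChannelFirstd, LoadImaged, ToTensord

device = torch.device("cuda:0")

pre_transforms = Compose([
    LoadImaged(keys=("image", "label")),
    EnsureChannelFirstd(keys=("image", "label")),
    # ... the remaining DeepEdit transforms go here ...
    # Moving the data to the GPU at the end of the chain is already enough
    # to trigger the non-deterministic memory growth for me:
    ToTensord(keys=("image", "label"), device=device, track_meta=False),
])
```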
Expected behavior
No memory leak.
Screenshots
Important info before the images: training and validation samples were cropped to a fixed size, so in theory the GPU memory usage should remain constant over the epochs, although different between training and validation. The spikes seen in the later images are due to validation, which only ran every 10 epochs. The important thing here is that these spikes do not increase over time.
x axis: iterations; y axis: amount of GPU memory used, as returned by nvmlDeviceGetMemoryInfo() (see the helper sketched below)
One epoch is about 400 samples for training and 100 for validation.
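For completeness, the memory numbers were recorded roughly like this (the helper function is my own; only the pynvml calls are the actual NVML API):

```python
from pynvml import nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo, nvmlInit

nvmlInit()
_handle = nvmlDeviceGetHandleByIndex(0)


def gpu_memory_used_gb() -> float:
    """Return the GPU memory currently in use (in GB) as reported by the driver."""
    info = nvmlDeviceGetMemoryInfo(_handle)
    return info.used / 1024 ** 3
```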
After a few weeks I got it to a point where it ran much more consistently. Interestingly, some operations introduce more non-determinism in the GPU memory usage than others; I developed a feeling for which ones those were and removed or replaced them with different operations. The result is the image below. However, there is clearly still something fishy going on.
For comparison, this is how it looks after adding the GarbageCollector (iteration-level cleanup):
And using the GarbageCollector but with epoch-level cleanup (this only works on the 80 GB GPU and crashes on the 50 GB one; as we can see in the image above, the "actually needed memory" is 33 GB for this setting, while with garbage collection per iteration we need at least 71 GB and it might still crash on some bad GC day). What can clearly be seen is a lot more jitter.
Environment
This bug exists independent of the environment. I started with the officially recommended one, tried out different CUDA versions, and in the end upgraded to the most recent torch version to see if maybe it would be fixed there. I will paste the output from the last environment even though I know it will not be supported. You can verify this bug on the default MONAI pip installation as well, however.
================================
Printing MONAI config...
================================
MONAI version: 1.1.0
Numpy version: 1.24.3
Pytorch version: 2.0.0+cu117
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: a2ec3752f54bfc3b40e7952234fbeb5452ed63e3
MONAI __file__: /homes/mhadlich/.conda/envs/monai/lib/python3.10/site-packages/monai/__init__.py
Optional dependencies:
Pytorch Ignite version: 0.4.12
Nibabel version: 5.1.0
scikit-image version: 0.20.0
Pillow version: 9.5.0
Tensorboard version: 2.13.0
gdown version: 4.7.1
TorchVision version: 0.15.1+cu117
tqdm version: 4.65.0
lmdb version: 1.4.1
psutil version: 5.9.5
pandas version: 2.0.1
einops version: 0.6.1
transformers version: 4.21.3
mlflow version: 2.3.1
pynrrd version: 1.0.0
For details about installing the optional dependencies, please visit:
https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies
================================
Printing system config...
================================
System: Linux
Linux version: Ubuntu 22.04.2 LTS
Platform: Linux-5.15.0-73-generic-x86_64-with-glibc2.35
Processor: x86_64
Machine: x86_64
Python version: 3.10.10
Process name: python
Command: ['python', '-c', 'import monai; monai.config.print_debug_info()']
Open files: [popenfile(path='/projects/mhadlich_segmentation/sliding-window-based-interactive-segmentation-of-volumetric-medical-images_main/tmp.txt', fd=1, position=1040, mode='w', flags=32769)]
Num physical CPUs: 48
Num logical CPUs: 48
Num usable CPUs: 1
CPU usage (%): [100.0, 100.0, 59.9, 1.4, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.4, 0.2, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.7, 0.0, 0.0, 0.0, 0.4, 0.0, 0.0, 2.7, 0.0, 0.0, 0.0, 0.0]
CPU freq. (MHz): 1724
Load avg. in last 1, 5, 15 mins (%): [5.1, 5.0, 5.1]
Disk usage (%): 66.3
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 1007.8
Available memory (GB): 980.8
Used memory (GB): 20.0
================================
Printing GPU config...
================================
Num GPUs: 1
Has CUDA: True
CUDA version: 11.7
cuDNN enabled: True
cuDNN version: 8500
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
GPU 0 Name: NVIDIA RTX A6000
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 84
GPU 0 Total memory (GB): 47.5
GPU 0 CUDA capability (maj.min): 8.6
Additional context
I will publish the code with my master's thesis at the end of September, so if it should be necessary, I might be able to share it beforehand.