Memory leaks when using torch DataLoaders
This is related to https://github.com/pytorch/pytorch/issues/13246
If a torch Dataset holds a sampler object and a torch DataLoader uses multiple workers, then the underlying sampler object is sent to each worker over the torch multiprocessing channel.
This means that whenever any pure-Python attribute is referenced in a worker, its refcount is incremented, which triggers the operating system's copy-on-write behavior (incrementing a refcount is a write to the Python object's memory). This is not a problem for large numpy arrays, because only the base array object has a refcount; its elements are raw data, not Python objects.
However, for Python lists and dicts, every item access triggers copy-on-write (which, I believe, copies the entire memory page containing the object from the main process's address space into the child's; randomly accessing items scattered across many pages may explain the non-linear memory growth). This is a problem for the CocoDataset object that is carried over: it is basically a giant pure-Python dictionary, and accessing it causes a lot of copy-on-write.
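A toy illustration of the difference (this is my own sketch, not from the linked issue): a list stores one refcounted Python object per element, so indexing hands back the stored object and the worker ends up writing to the page that holds its refcount, while a numpy array hands out a fresh scalar on each access and its underlying buffer is never written.

```python
import numpy as np

py_list = list(range(1000))  # 1000 distinct Python int objects, each with its own refcount
np_arr = np.arange(1000)     # one refcounted object wrapping a raw buffer

# Indexing the list returns the SAME stored object every time, so each
# access bumps that object's refcount -- a write into the forked page.
assert py_list[500] is py_list[500]

# Indexing the array materializes a FRESH scalar on each access; the
# shared buffer itself is never written, so its pages stay shared.
assert np_arr[500] is not np_arr[500]
```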
Possible mitigation strategies:

- Use a multiprocessing.Manager to store the CocoDataset. Not sure what that entails.
- Pass a reference to a backend database connection instead of the coco file?
- Represent the CocoDataset --- or at least the relevant parts --- using numpy objects somehow.
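The last strategy might look something like the following (a hypothetical sketch; the class name and layout are mine, assuming the coco records are JSON-serializable): pack the records into a single numpy byte buffer plus an offsets array, so forked workers share two refcounted objects instead of millions.

```python
import json
import numpy as np

class NumpyBackedList:
    """Store a list of JSON-serializable records as two numpy arrays
    (a flat byte buffer and an offsets array), so forked workers read
    shared memory without touching per-item Python refcounts."""

    def __init__(self, records):
        blobs = [json.dumps(r).encode("utf-8") for r in records]
        # offsets[i]..offsets[i+1] delimits record i inside the buffer
        self._offsets = np.cumsum([0] + [len(b) for b in blobs])
        self._buffer = np.frombuffer(b"".join(blobs), dtype=np.uint8)

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, index):
        start, stop = self._offsets[index], self._offsets[index + 1]
        # Decoding creates fresh objects in the worker; the shared
        # buffer is only ever read, never written.
        return json.loads(self._buffer[start:stop].tobytes())

# Example usage with made-up coco-like annotation records:
records = [{"id": 1, "bbox": [0, 0, 10, 10]}, {"id": 2, "bbox": [5, 5, 3, 3]}]
backed = NumpyBackedList(records)
```

The decode-on-access cost is usually small compared to the memory saved, since each worker otherwise duplicates the whole dictionary page by page.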