Reproducible Deep Learning Using PyTorch

Darina Bal Roitshtain
4 min read · Sep 9, 2022


Photo by Samer Daboul from Pexels

Have you ever tried comparing the results of different model configurations, or of various models? Maybe you tried to refactor your deep learning (DL) code without changing its underlying functionality: rewriting messy code, integrating existing code with your task-specific requirements, optimizing specific code sections, and so on. How did you ensure that no bugs were introduced?

Neural network projects are full of non-deterministic processes that result in different outcomes in each execution. To perform a fair comparison, you need to enable reproducibility.

In this article, I will explain how to achieve it. Let's start!

“The assumption of an absolute determinism is the essential foundation of every scientific inquiry.” (Max Planck)

Getting reproducible functionality in the machine learning process involves several considerations: random seeds, data splitting, data loading, and deterministic operations.

Random Seeds

When you set random seeds, you ensure that various pseudorandom number generators (PRNGs) are reproducible. Do it as follows:

  • SEED can be any integer number of your choice.
  • RandomState(MT19937(SeedSequence(SEED))) constructs a new RandomState backed by an explicit BitGenerator. You can also seed the legacy global generator instead, but NumPy recommends the explicit construction as best practice.
  • np.random.seed() reseeds the legacy global BitGenerator. If any of your libraries or code rely on NumPy's global RNG, seed it as well.
  • torch.manual_seed() sets the seed for generating random numbers.
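A minimal seeding snippet covering all of the above might look like this (`SEED` is an arbitrary constant of your choice):

```python
import random

import numpy as np
import torch
from numpy.random import MT19937, RandomState, SeedSequence

SEED = 42  # any integer of your choice

random.seed(SEED)                              # Python's built-in RNG
rs = RandomState(MT19937(SeedSequence(SEED)))  # NumPy's recommended explicit generator
np.random.seed(SEED)                           # reseeds NumPy's legacy global BitGenerator
torch.manual_seed(SEED)                        # seeds PyTorch's RNGs
```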

Data Splitting

When you "randomly" split your data into training and validation subsets, you must ensure that you can reproduce the same split the next time you run and evaluate the model. For this, you set the seed as follows:

random_state (the parameter of scikit-learn's train_test_split) controls the shuffling applied to the data before the split, enabling reproducible output across multiple function calls.
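With scikit-learn's train_test_split, for example, the split becomes reproducible once random_state is fixed (the 80/20 ratio and the stand-in dataset below are arbitrary):

```python
from sklearn.model_selection import train_test_split

SEED = 42
X = list(range(100))  # stand-in dataset

# Re-running with the same random_state yields the identical split.
X_train, X_val = train_test_split(X, test_size=0.2, random_state=SEED)
```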

Data Loading

You also need to make sure that your data loading process is reproducible: the data fed to your model in each execution of the whole algorithm should be the same for the results to be comparable.

As suggested by NVIDIA GitHub, to shuffle differently but reproducibly every epoch, you should reset the generator by creating an instance (self.g) of torch.Generator in torch.utils.data.DataLoader and use it as follows:

set_epoch should be called at the start of each epoch.

Note that the function is implemented here as a method inside the DataLoader class, but you can implement it in whatever way is more convenient for you.
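A sketch of that idea follows; the class name and the subclassing approach are my own choices, the NVIDIA examples implement the same logic inside their own loader wrapper:

```python
import torch
from torch.utils.data import DataLoader


class ReproducibleDataLoader(DataLoader):
    """DataLoader whose shuffle order differs per epoch yet is reproducible."""

    def __init__(self, dataset, seed=0, **kwargs):
        self.seed = seed
        self.g = torch.Generator()
        self.g.manual_seed(seed)
        super().__init__(dataset, generator=self.g, **kwargs)

    def set_epoch(self, epoch):
        # Resetting the generator makes epoch k produce the same order in
        # every run, while different epochs still shuffle differently.
        self.g.manual_seed(self.seed + epoch)


loader = ReproducibleDataLoader(list(range(32)), seed=0, batch_size=8, shuffle=True)
for epoch in range(2):
    loader.set_epoch(epoch)  # call at the start of each epoch
    for batch in loader:
        pass  # training step goes here
```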

Additionally, as the PyTorch documentation suggests, in multi-process data loading DataLoader reseeds each worker with a distinct seed; worker_init_fn() lets you propagate that seed to other libraries' RNGs and preserve reproducibility, in the following way:
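The recipe from the PyTorch documentation looks roughly like this (the dataset and the loader parameters are placeholders):

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader


def seed_worker(worker_id):
    # PyTorch hands each worker base_seed + worker_id; derive the NumPy
    # and Python seeds from it so all libraries stay aligned per worker.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


g = torch.Generator()
g.manual_seed(0)

loader = DataLoader(
    list(range(100)),   # placeholder dataset
    batch_size=16,
    num_workers=2,
    worker_init_fn=seed_worker,
    generator=g,
)
```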

Please note that when creating a torch.Generator() object, the device should be specified explicitly, as shown below:
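For example (a DataLoader expects a CPU generator, so that is the device spelled out here):

```python
import torch

g = torch.Generator(device="cpu")  # DataLoader requires a CPU generator
g.manual_seed(42)
```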

I encourage you to experiment with different random states for data split and loading and examine the differences in results.

Deterministic Operations

NVIDIA CUDA Deep Neural Network (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. It is built on top of CUDA, NVIDIA's general-purpose GPU computing platform: cuDNN implements and optimizes its routines with CUDA, and CUDA in turn is the interface through which the GPU is programmed.

CUDA

Set the seed for generating random numbers for the current GPU:

or for all GPUs:
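Both are one-line calls (with `SEED` being the integer chosen earlier; they are safe to call even on a machine where CUDA is not yet initialized):

```python
import torch

SEED = 42  # reuse the seed chosen earlier

torch.cuda.manual_seed(SEED)      # current GPU only
torch.cuda.manual_seed_all(SEED)  # every visible GPU
```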

cuDNN

As declared in the NVIDIA cuDNN documentation, most cuDNN routines belonging to the same version are designed to generate the same bit-wise results across runs when executed on GPUs of the same architecture. However, there are exceptions, such as ConvolutionBackwardFilter, ConvolutionBackwardData, PoolingBackward, SpatialTfSamplerBackward, CTCLoss, etc., that don't guarantee reproducible results even on the same architecture. The reason is the use of atomic operations (operations that execute indivisibly, without interference from other threads) in a way that leaves the order of floating-point accumulation, and therefore its rounding error, effectively random. Also, across different architectures, no cuDNN routines guarantee bit-wise reproducibility.

So, what should you do?

torch.backends.cudnn.deterministic = True causes cuDNN to use only deterministic convolution algorithms. It does not guarantee that your training process will be deterministic if other non-deterministic functions exist. torch.use_deterministic_algorithms(True), on the other hand, affects all normally non-deterministic operations. As the documentation states, some of the listed operations have no deterministic implementation, and an error will be thrown for them. If you need such an operation anyway, the solution is to write a custom deterministic implementation yourself.

torch.backends.cudnn.benchmark = True causes cuDNN to benchmark multiple convolution algorithms and select the fastest; setting it to False disables this dynamic selection and ensures that the algorithm choice itself is reproducible. If your model does not change and your input sizes remain the same, you may benefit from setting torch.backends.cudnn.benchmark = True. But if input sizes vary, some layers are only activated under certain conditions, etc., cuDNN re-benchmarks every time a new input size appears, which degrades performance and can stall execution.
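Putting the flags from this section together might look like this; the CUBLAS_WORKSPACE_CONFIG variable is the one the PyTorch documentation requires for some cuBLAS operations on CUDA >= 10.2 when deterministic algorithms are enforced:

```python
import os

import torch

# Needed by some cuBLAS operations on CUDA >= 10.2 when
# torch.use_deterministic_algorithms(True) is in effect.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.backends.cudnn.deterministic = True  # deterministic cuDNN convolutions only
torch.backends.cudnn.benchmark = False     # reproducible algorithm selection
torch.use_deterministic_algorithms(True)   # error out on non-deterministic ops
```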

Note:

  • If the aforementioned cuDNN settings do not reproduce your results, set torch.backends.cudnn.enabled = False. This flag controls whether cuDNN is enabled at all, and disabling it can resolve the reproducibility issue.
  • Pay attention to the specific cuDNN, CUDA, or any Python library versions that are needed when utilizing cuDNN with machine learning models. Using the incorrect version can result in issues in your project.

Conclusion

To be on the "reproducible side", keep your data splitting and data loading processes as described above.
To quickly achieve deterministic behavior of the other part of your flow, I suggest defining the following function:
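A sketch of such a helper (the name `set_seed` is my own; it simply bundles the seeding calls and cuDNN settings discussed above):

```python
import os
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed every RNG in the pipeline and pin cuDNN to deterministic mode."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

Call it once, at the very top of your entry point, before any model or data objects are created.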

And call it at the beginning of your algorithm.
