By Antonin Vidon, Senior Machine Learning Engineer
It might feel counterintuitive at first, but training a Machine Learning model isn't necessarily a deterministic process; in fact, full determinism is usually not the preferred strategy. You might ask: if my only goal is to obtain the best model, why can't I just drop all random variation? Machine Learning theory tells you otherwise: such an approach hinders the model's ability to generalize to unseen examples, ultimately compromising its real-world performance.
To overcome this, research has introduced the idea of injecting randomness during model training. This helps the model explore a broader range of solutions, avoid getting stuck in suboptimal behavior, and avoid becoming overly reliant on patterns that may not generalize to real-world situations.
Case Study: Field Extraction at Hyperscience
Setting the Stage
Let’s illustrate this paradigm with one of Hyperscience’s core use-cases: Field Extraction. In this specific case, you are training a model to extract a set of predefined fields. For instance, you might need to extract the product name, price, and delivery date from a purchase order. To get your model ready to automate such a task, you’ll typically annotate fields across a set of training documents and let it learn from them.
The Downsides of Unpredictability
You might expect the algorithm to produce the same model every time you train, but random operations are inherently unpredictable, resulting in varying levels of performance – in other words, two training runs can end up extracting different fields from the same document.
All this might sound completely benign: different randomness shouldn't hinder convergence, and performance across runs should be similar, as illustrated by the plot above. However, when deploying to production – as opposed to research – "similar" isn't enough, and such unpredictability creates the following engineering challenges:
- Lack of reproducibility: if a customer encounters an unexpected scenario during training, we may struggle to replicate it due to our inability to ensure identical behavior
- Performance drift when re-training: a customer might see an automation rate decrease when re-training on the same dataset
- Internal QA testing cannot rely on metric consistency: we must arbitrarily define which metric variations are expected vs. unexpected to avoid false alarms
- Silent bugs: some dangerous bugs may cause deviations that fall within the expected range of variation, allowing them to go unnoticed
Breaking down sources of randomness
For all these reasons, managing randomness during training is crucial when developing a Machine Learning product. To understand how to do that, let's first go back to our field extraction use case and break down where that randomness actually comes from (each source is illustrated in the sketch after this list):
- Model weights initialization: the model’s architecture is partially dependent on the use-case, like the number of fields to extract per document, and the corresponding weights are initialized randomly to facilitate convergence to the optimal solution
- Sampling: at each training step, the data loader randomly selects a fixed-size batch of data from the dataset
- Data augmentation: examples are randomly transformed to increase the size and diversity of the dataset. In our case, this could be moving, shuffling, replacing, or deleting words on the page.
- Training algorithm: regularization techniques like dropout randomly deactivate part of the model to help it learn more meaningful, general patterns
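To make these sources concrete, here is a minimal PyTorch sketch; the model, dataset, and word-deletion augmentation are toy stand-ins for illustration, not our actual field extraction pipeline:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# 1. Model weights initialization: each layer draws its initial weights at random.
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.1),  # 4. Training algorithm: dropout randomly deactivates units
    nn.Linear(64, 10),
)

# 2. Sampling: shuffle=True makes the loader pick each batch in random order.
dataset = TensorDataset(torch.randn(1000, 128), torch.randint(0, 10, (1000,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# 3. Data augmentation: a toy transform that randomly deletes words from a page.
def drop_words(words: list[str], p: float = 0.1) -> list[str]:
    return [w for w in words if torch.rand(1).item() >= p]
```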
The trick
Fortunately, we have a way of making all these operations reproducible without losing the benefits of their random behavior. We can fix the initial value the generator uses to produce its sequence of random numbers: the random seed. With a fixed, hardcoded seed, the sequence of random numbers remains the same each time the program runs, allowing for entirely replicable experiments.
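In practice, this means seeding every generator the training process touches. A minimal sketch of this standard pattern, assuming a PyTorch stack:

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Seed every random generator involved in training."""
    random.seed(seed)                 # Python's built-in RNG (e.g., augmentations)
    np.random.seed(seed)              # NumPy RNG (e.g., preprocessing)
    torch.manual_seed(seed)           # PyTorch CPU RNG (weight init, dropout)
    torch.cuda.manual_seed_all(seed)  # PyTorch GPU RNGs
    # Some GPU kernels are non-deterministic even with fixed seeds, so we also
    # ask cuDNN to stick to deterministic implementations.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)  # the hardcoded value itself is arbitrary
```

With seeds fixed, two identical training runs now report bit-for-bit identical metrics: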
😐 machine only accuracy: previous 0.92937 vs. current 0.92937
😐 accuracy at 95% target accuracy: previous 0.94739 vs. current 0.94739
😐 automation rate at 95% target accuracy: previous 0.97046 vs. current 0.97046
This way, we can rely on metric identity to help us catch any bug that might affect training.
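An internal QA check can then be as strict as exact equality. A minimal sketch (the metric names mirror the log above and are illustrative):

```python
def assert_metrics_identical(previous: dict[str, float], current: dict[str, float]) -> None:
    """Fail loudly if any metric deviates at all from the reference run."""
    for name, expected in previous.items():
        actual = current[name]
        assert actual == expected, f"{name} drifted: expected {expected}, got {actual}"

# With fixed seeds, a re-run must reproduce the reference numbers exactly.
assert_metrics_identical(
    {"machine_only_accuracy": 0.92937, "automation_rate_at_95": 0.97046},
    {"machine_only_accuracy": 0.92937, "automation_rate_at_95": 0.97046},
)
```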
Going further: managing randomness despite training interruptions
At Hyperscience, the training of our models is resilient to interruptions. This means that even if the instance on which the model is trained is stopped without notice, we can resume training from the point where it left off. This feature is essential for cloud environments where instances may periodically restart, such as every 48 hours within FedRAMP environments, which Hyperscience will soon support with our FedRAMP Certification through Palantir’s FedStart program. It also enables the use of low-cost spot instances, which can be interrupted at any time by the cloud provider.
Ensuring reproducibility despite interruptions is more challenging, because even if we set the seeds at the beginning of training, the generators' states are not persisted across sessions. In other words, the random generators revert to their default state when the instance restarts. Re-applying the initial seeds when resuming doesn't work either, because we would replay the random sequence from the very start of training on a model that is already partially trained.
To maintain determinism across instance restarts, the trick is to periodically update the seeds using a deterministic algorithm. In practice, we synchronize this with the frequency of saving model checkpoints. Hence, when we resume training from a given checkpoint on a new machine, all seeds are instantly reset to a value specific to that exact training step. This way, all subsequent random behavior remains the same as if there had been no interruption.
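A minimal sketch of this checkpoint-synchronized reseeding, assuming a base seed and a simple derivation function (the names and the formula are illustrative, not our exact implementation):

```python
import random

import numpy as np
import torch

BASE_SEED = 42  # illustrative base seed for the whole training run

def seed_for_step(step: int) -> int:
    # Any deterministic function of (BASE_SEED, step) works; this is one simple choice.
    return (BASE_SEED * 1_000_003 + step) % (2**31)

def reseed_at_checkpoint(step: int) -> None:
    """Called when saving a checkpoint at `step`, and again when resuming from it."""
    seed = seed_for_step(step)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# Uninterrupted run: reseed at every checkpoint step.
# Resumed run: load the checkpoint, read its step, call reseed_at_checkpoint(step);
# all subsequent random draws then match the uninterrupted run.
```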
Taking a step back: how does it all fit into the Hyperscience roadmap?
While randomness plays a critical role in ensuring performance on customers' use cases, we've seen that it brings a lot of engineering challenges during productization. To address these, our team has developed a toolkit to manage randomness in a complex setting where training can be interrupted at any time. This approach allows us to meet stringent requirements such as those of FedRAMP environments, aligning with our objective to serve U.S. federal agencies with the highest security standards. In keeping with our focus on cost-efficiency, we've also made our internal experiments 3 times cheaper by running them on spot instances, and are preparing to extend this capability to our customers.