As we have seen previously in this series of blog posts, Active Learning is a Machine Learning training paradigm which requires the repetitive succession of the same steps until the model is trained to the satisfaction of the data scientist.
We have also seen that those who have experimented with Active Learning practically always use the exact same approach based on a least-confidence querying strategy. Very few people though, even among experts, realize that there are actually countless different approaches to Active Learning, and that in order to have good results, one would need to tune a certain number of elements that go into the design of an Active Learning process. In short, there are many variations on Active Learning.
In a way, educating the experts of today on how to leverage Active Learning isn’t without similarities to the early days of Deep Learning. For those who remember, Deep Learning had a rough start in the early 2010s, when many data scientists were resisting its increasing popularity. “I tried Deep Learning on my problem, but the performance just wasn’t as good as with other algorithms”, was something one would typically hear when running into a group of ML people. Deep Learning eventually gained in popularity when the skeptics finally realized that their results were subpar only because Deep Learning required a lot of tuning, and that tuning a Deep Learning model properly took a lot of patience and experience. Deep Learning itself wasn’t the reason they were not getting good results; rather, it was their failure to tune the hyperparameters properly that was at fault.
Now, we are observing the same phenomenon with Active Learning: to give you the results you expect for your particular use case and data type, it needs tuning. A lot of tuning.
This final blog post is meant to guide you through the many things you might want to consider when building an Active Learning process that is just right for you.
1. Tune the initialization sample
The initial loop of an Active Learning process is used to bootstrap the process, and usually relies on randomly selected records because the system knows nothing about the data yet at this stage. In subsequent loops, and the further we go, each unselected record will have additional metadata associated with it: a series of past predictions, confidence levels and other uncertainty measurements, model states, etc. But for “loop 0”, as we call it at Alectio, there is no such information to go by, so the data scientist usually has few options other than randomly selecting a batch of data.
Sometimes, though, data scientists take the process one step further by using unsupervised learning or information theory techniques to guarantee a more diverse initial distribution of the data, but that approach may not always be possible as it is more invasive. According to early research on Active Learning applied to classical Machine Learning, about 10% of initializations lead to stale results where the model’s learning curve never picks up. So the choice of an initial sample can be critical for the rest of the process.
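To make the idea of a more diverse "loop 0" concrete, here is a minimal sketch of one possible technique, greedy farthest-first sampling, which spreads the initial picks across feature space instead of drawing them at random. This is only one option among many, not a method prescribed by this series; the function name `farthest_first_sample` and the assumption that each record is already a numeric feature vector are ours.

```python
import math
import random


def farthest_first_sample(points, k, seed=0):
    """Pick k indices from `points` using greedy farthest-first traversal.

    Assumption: each record is a tuple of numeric features, so Euclidean
    distance is a meaningful notion of dissimilarity.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    rng = random.Random(seed)
    first = rng.randrange(len(points))  # the very first pick is still random
    chosen = [first]
    # Distance from every point to its nearest already-chosen point.
    nearest = [math.dist(p, points[first]) for p in points]
    while len(chosen) < k:
        # Take the point that is farthest from everything chosen so far.
        nxt = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    return chosen
```

On data with two well-separated clusters, asking for two records returns one from each cluster, whereas a random draw could easily land both picks in the same one. The cost is quadratic-ish in the pool size, so on large pools you would likely run it on a random subsample.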
2. Tune the model initialization
As we have already seen, the way that an Active Learning workflow is initialized can be critical to get good results, and this is equally true of the data that is sampled for the first loop as it is of the way the model itself is initialized. This is actually nothing new: even when training Deep Learning models using supervised learning, the way the weights are initialized can make a huge difference in the way learning unfolds. The additional challenge in our case is that the early loops are trained with very little data, making it more difficult for a model to recover from a bad initialization.
3. Choose whether to clear the parameters or not
Active Learning is all about incremental learning: at each loop, we should get closer to convergence. In most cases, people decide to resume training from the previous state of the model. In that case, should only the newly selected data be used to re-train, or should all selected records be used? Or should the parameters be cleared, and the model retrained from scratch? Retraining from scratch is clearly the most computationally intensive option. However, it might also be the safest, as it avoids inheriting early biases, which are more likely to appear towards the beginning of the process, before it has stabilized.
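The three options can be written down as an explicit policy switch. The sketch below uses a deliberately trivial stand-in model (`MeanModel`, which just tracks a running mean) so that it runs on its own; the class, the `retrain` helper and the mode names are all hypothetical, and a real workflow would plug in its own model and training routine.

```python
from dataclasses import dataclass


@dataclass
class MeanModel:
    """Toy stand-in for a real model: it learns the running mean of its inputs."""
    total: float = 0.0
    count: int = 0

    def fit(self, values):
        for v in values:
            self.total += v
            self.count += 1
        return self

    @property
    def estimate(self):
        return self.total / self.count if self.count else 0.0


def retrain(model, previous, new, mode):
    """Apply one of three retraining policies at the start of a loop.

    previous: records selected in earlier loops
    new:      records selected in the current loop
    """
    if mode == "resume_new_only":
        # Keep the learned state and train on the fresh records only.
        return model.fit(new)
    if mode == "resume_all":
        # Keep the learned state but revisit every selected record.
        return model.fit(previous + new)
    if mode == "from_scratch":
        # Discard the learned state: costlier, but nothing is inherited.
        return type(model)().fit(previous + new)
    raise ValueError(f"unknown mode: {mode}")
```

Even with this toy model the policies disagree: resuming on all records counts the early records twice and pulls the estimate towards them, which is a small-scale picture of the early bias discussed above.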
4. Tune the loop size (or loop sizes)
The “vanilla” Active Learning processes that you come across in papers or open source libraries assume that you will want to use loops of the exact same size (number of records) throughout the entire process, but there is actually no strong reason to do that. For instance, it might be a good idea to start with a larger loop, as a certain threshold of data might be required to reach a stable model and establish a strong starting point for subsequent loops. Or, on the contrary, one might want to use smaller loops at first to ensure that the querying strategy is tuned properly (we will discuss that in point 7) before scaling up.
And then, even if you concur that there is no reason to use a static, pre-established loop size throughout the process, there is the obvious question of how to best compute the ideal size dynamically, at each loop. Should one use an optimization approach designed to minimize computational or labeling costs? Could this be done with Bayesian tuning, with the loop size considered a hyperparameter? With no formal theory established, those questions are all left to the data scientist’s judgment.
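As an illustration of a non-constant schedule, here is a small sketch that starts with small loops and grows them geometrically until a fixed labeling budget is used up. The growth rule and the names (`loop_sizes`, `first_size`, `growth`) are our own assumptions for the example; nothing here is an established best practice, and a shrinking schedule is obtained simply by passing a growth factor below 1.

```python
def loop_sizes(total_budget, first_size, growth=1.5):
    """Yield loop sizes that change geometrically until the budget is spent.

    total_budget: total number of records we can afford to label
    first_size:   number of records in the first loop
    growth:       multiplier applied after each loop (1.0 = constant size)
    """
    if total_budget < 0 or first_size <= 0 or growth <= 0:
        raise ValueError("budget must be >= 0; first_size and growth must be > 0")
    remaining = total_budget
    size = float(first_size)
    while remaining > 0:
        # Never ask for more than what is left, and always make progress.
        batch = min(remaining, max(1, round(size)))
        yield batch
        remaining -= batch
        size *= growth
```

For example, `list(loop_sizes(1000, 100, growth=2.0))` gives `[100, 200, 400, 300]`: the last loop is truncated so that the total never exceeds the budget.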
5. Tune the stopping criteria
Oftentimes, the Active Learning process is ended after a specific number of loops have been completed, which is decided in advance and often arbitrarily. In practice, though, there are many other, much more pragmatic criteria that can be used to decide when it is time to end the Active Learning workflow. If the data scientist is on a budget, they can decide to terminate once the annotation money allocated for the project has been spent. The process can also be stopped when the time or computing power consumed exceeds what was agreed on, or simply when a specific performance criterion is met (for example, if the model reaches the accuracy threshold that was demanded by management).
Finally, some newer techniques based on information density can determine that the remaining data just does not contain novel information and hence would not help boost the performance of the model anymore, in which case it makes no sense to proceed further without bringing fresher data into the equation.
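These criteria combine naturally into a single check run at the end of every loop. The sketch below is one hedged way to do it; `LoopState`, `should_stop` and all default thresholds are hypothetical. Note that the "plateau" rule is only a crude proxy based on the accuracy history, not an implementation of the information-density techniques mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class LoopState:
    loops_done: int
    labeling_spend: float  # money spent on annotation so far
    accuracy_history: list = field(default_factory=list)  # one value per loop


def should_stop(state, max_loops=20, budget=5000.0,
                target_accuracy=0.95, patience=3, min_gain=0.002):
    """Return the name of the first stopping rule that fires, or None."""
    history = state.accuracy_history
    if state.loops_done >= max_loops:
        return "max_loops"
    if state.labeling_spend >= budget:
        return "budget"
    if history and history[-1] >= target_accuracy:
        return "target_accuracy"
    # Plateau: accuracy gained less than `min_gain` over the last `patience` loops.
    if len(history) > patience and history[-1] - history[-1 - patience] < min_gain:
        return "plateau"
    return None
```

Returning the reason rather than a bare boolean makes it easy to log why a run ended, which matters when you later compare runs that stopped for different reasons.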
6. Decide how the labeling process is to be managed
Somehow, due to some abuse of language in the data labeling space, many assume that Active Learning is synonymous with Human-in-the-Loop data labeling (the practice of using humans to annotate the trickiest data, which an automated process would struggle with), and hence very few people realize that they in fact have the option of using any labeling process they’d like to annotate the selected data. There is no reason why the selected data could not be annotated with an autolabeling process, or even with a weak supervision approach such as Snorkel.
One important note, though, is that the success of the Active Learning process as a whole depends critically on the quality of the labels. If the data is mis-annotated, the selection process will most likely be biased, which defeats the reason why you wanted to try Active Learning in the first place. Hence, it is perfectly wise to introduce a label audit process (for example, an anomaly detection framework) between the LABEL step and the TRAIN step of Active Learning, as your labels had better be pristine.
7. Tune the querying strategy
The querying strategy itself is the most sensitive part of the process. People often assume they have a limited number of querying strategies to pick from, but querying strategies are technically nothing more than sampling functions (“selective sampling” is part of the jargon you will often hear when studying Active Learning), so you can get really creative when designing them.
What is usually called a querying strategy is a predetermined rule, usually set somewhat arbitrarily, and based on only one type of “metadata”, such as the entropy or confidence level (refer to part 3 if you want to learn more about the most common vanilla querying strategies). Choosing such a querying strategy, however, is equivalent to using a rule-based selection process with only one feature, like a decision tree of depth 1. How reliable can we really expect such a process to be? This is why Alectio has invested a lot of time and research in replacing vanilla querying strategies with actual ML models that use more than one (meta)feature, are dynamic, and evolve from one loop to another. We even include multiple generations of metadata to reach a conclusion regarding whether or not a specific record should be used.
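To show how little there is to a vanilla querying strategy, here is a sketch of three common ones written as plain scoring functions over predicted class probabilities, plus a selector that keeps the top-scoring records. The `blended` helper is a naive, hypothetical way of mixing several signals with fixed weights; it is not Alectio's approach, only a hint at what "more than one (meta)feature" could look like.

```python
import math


def least_confidence(probs):
    """Higher when the top predicted class has low probability."""
    return 1.0 - max(probs)


def margin(probs):
    """Higher when the two most likely classes are close (needs >= 2 classes)."""
    top, second = sorted(probs, reverse=True)[:2]
    return 1.0 - (top - second)


def entropy(probs):
    """Shannon entropy of the predicted distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def blended(weighted_scores):
    """Combine several scoring functions with fixed, illustrative weights."""
    return lambda probs: sum(w * score(probs) for score, w in weighted_scores)


def select(predictions, score, k):
    """Return the ids of the k records with the highest score.

    predictions: dict mapping record id -> list of class probabilities
    """
    ranked = sorted(predictions, key=lambda rid: score(predictions[rid]), reverse=True)
    return ranked[:k]
```

The choice of signal matters: with predictions `{"a": [0.9, 0.05, 0.05], "b": [0.4, 0.35, 0.25], "c": [0.5, 0.5, 0.0]}` and `k=1`, least confidence and entropy both pick `"b"`, while margin picks `"c"`.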
8. Decide whether to tune the hyperparameters or not
It took time to see some success from Deep Active Learning (the practice of Active Learning on Deep Learning models), partially because, unlike algorithms with few or no hyperparameters to tune, Deep Learning calls for tuning hyperparameters that might depend heavily on, for example, the amount of training data used. No one would think of using a large batch size if their dataset was really small; similarly, using the same batch size for the first loop (where maybe only 100 records have been selected) and the 100th loop (where 10,000 records have been selected) would be a bad idea. Ultimately, even though Data-Centric AI is all the rage these days, the future of Active Learning truly lies in combining a Model-Centric AI and a Data-Centric AI approach, which allows model and data to be tuned simultaneously.
Tuning an Active Learning workflow might feel like little more than arbitrarily picking an off-the-shelf querying strategy, a loop size and a number of loops, and this is exactly why the results obtained by beginners are rarely compelling. The reality, though, is that tuning Active Learning like a pro takes at least as much know-how as hyperparameter-tuning a Deep Learning model.
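As a small illustration of a hyperparameter that follows the data, the sketch below picks a batch size from the number of records selected so far. The rule (a power of two giving roughly a fixed number of updates per epoch, clamped to a range) is a heuristic we made up for the example, and `batch_size_for` and its defaults are hypothetical; treat it as a starting point to tune, not a recommendation.

```python
def batch_size_for(n_labeled, min_size=8, max_size=256, steps_per_epoch=20):
    """Pick a power-of-two batch size that scales with the labeled set.

    Aims for roughly `steps_per_epoch` gradient updates per epoch, then
    clamps the result to [min_size, max_size].
    """
    target = max(1, n_labeled // steps_per_epoch)
    size = 1
    while size * 2 <= target:  # largest power of two not above the target
        size *= 2
    return max(min_size, min(max_size, size))
```

With these defaults, 100 selected records give a batch size of 8 and 10,000 records give 256, so the first and the 100th loop no longer share the same setting.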
Wondering if there are automated, algorithmic ways to tune your Active Learning, just like there are hyperparameter tuning libraries for Deep Learning algorithms? Well, look no further: this is one of the many things the Alectio platform will do for you.
Wanna give us a try? Reach out and we will show you a demo.