What is active learning?

Active learning is a semi-supervised machine learning strategy that aims to reduce the amount of labeled data required to train an effective model. An active learning (AL) model first learns from a random sample of the data, then actively requests labels for the specific records it expects will improve its performance the most. The result is a model that converges toward good performance faster, using less labeled data.
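To make the loop concrete, here is a minimal sketch of pool-based active learning on a toy problem. It is only an illustration, not Alectio's implementation: the one-dimensional threshold classifier, the `oracle` function standing in for a human labeler, and the parameter names `seed_size` and `budget` are all assumptions made for this example.

```python
import random

def oracle(x):
    """Stand-in for a human annotator (assumed): the true label is 1 when x >= 0.5."""
    return 1 if x >= 0.5 else 0

def fit_threshold(labeled):
    """Toy model: place the decision threshold halfway between the
    largest point labeled 0 and the smallest point labeled 1."""
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    lo = max(zeros) if zeros else 0.0
    hi = min(ones) if ones else 1.0
    return (lo + hi) / 2

def active_learning(pool, seed_size=5, budget=15, rng=None):
    """Label a random seed set, then repeatedly ask for the label of the
    unlabeled point the current model is least sure about."""
    rng = rng or random.Random(0)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)

    # Step 1: learn from a random sample of the data.
    labeled = [(x, oracle(x)) for x in unlabeled[:seed_size]]
    unlabeled = unlabeled[seed_size:]

    # Step 2: request the labels the model needs most.
    for _ in range(budget - seed_size):
        if not unlabeled:
            break
        threshold = fit_threshold(labeled)
        # The point closest to the decision boundary is the most uncertain one.
        query = min(unlabeled, key=lambda x: abs(x - threshold))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))

    return fit_threshold(labeled), len(labeled)

pool = [i / 1000 for i in range(1000)]
threshold, n_labels = active_learning(pool)
```

In this toy setup each query roughly halves the interval in which the true boundary can lie, so 15 labels out of 1,000 are enough to place the threshold very close to 0.5. Real data and real models are far less tidy, but the overall shape of the loop is the same.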

Alectio has extensive expertise in active learning and it forms a crucial cornerstone of our platform.

How does active learning help with labeling costs?

Much of the data we use to train models is either of minimal value (think duplicate data) or actually harmful (think mislabeled or spammy data). Training on a lot of useless or detrimental data reduces a model’s efficacy, and it means paying to label records that don’t help. Active learning addresses that problem by listening to what your models need to succeed, so you label only the data that matters.

Aren’t there other solutions to reduce labeling costs?

There are, yes. Labeling costs are a real issue for a lot of businesses, either because of the volume of their typical dataset or because getting quality labels is slow or expensive (think of labels that need to come from experts like surgeons, lawyers, or geophysicists). Labeling is also expensive because you often need to label data more than once. Crowdsourced labeling has helped some of us, but some companies don’t want to share data with third parties or simply cannot because of privacy concerns (or because the data requires expertise the crowd doesn’t have).

Human-in-the-loop and Snorkel are two popular approaches, but there are still issues you may run into. For example, both still require you to label a large amount of data (more efficient though they may be), and much of that effort is often a waste of time and money.

Our approach at Alectio is different because we’re interested in finding and prioritizing the most useful data for your models to ingest. This solves the issues around labeling bottlenecks, overfitting, compute resources, and more since you label less data but label the right data for your project.

Doesn't active learning use a lot of compute resources?

Active learning was originally developed to help people save on labeling costs, not compute resources. That means that yes, it can use more compute resources than “regular” supervised learning, since the model is typically retrained at every loop. That said, active learning can actually reduce your consumption of compute resources if your number of training loops isn’t too high.
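A rough back-of-the-envelope calculation shows why the number of loops matters. The sketch below rests on simplifying assumptions that are ours, not a measured result: training cost is proportional to samples times epochs, the model is retrained from scratch at each loop on everything labeled so far, each loop adds an equally sized batch, and the cost of scoring the unlabeled pool is ignored. The function names and the example numbers are hypothetical.

```python
def full_training_cost(n_samples, epochs):
    """Cost of one supervised training run on the whole dataset,
    measured in sample-passes (samples x epochs)."""
    return n_samples * epochs

def active_learning_cost(final_size, n_loops, epochs):
    """Total cost when the labeled set grows in equal batches and the
    model is retrained from scratch at every loop (assumed setup)."""
    batch = final_size // n_loops
    return sum(batch * (loop + 1) * epochs for loop in range(n_loops))

# Hypothetical scenario: 100,000 records, 10 epochs per training run,
# and active learning stops once 40,000 records have been labeled.
full = full_training_cost(100_000, epochs=10)
few_loops = active_learning_cost(40_000, n_loops=2, epochs=10)
many_loops = active_learning_cost(40_000, n_loops=10, epochs=10)

print(full, few_loops, many_loops)  # 1000000 600000 2200000
```

Under these assumptions, two loops cost less than training once on everything, while ten loops cost more than twice as much, even though the final labeled set is the same size in both cases.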

Okay, so Alectio is an active learning company?

We definitely leverage active learning here at Alectio, but we combine it with reinforcement learning, meta-learning, information theory, entropy analysis, topological data analysis, Data Shapley, and more. That’s because active learning in and of itself isn’t enough to get you the results you need. As we mentioned in our last answer, active learning was originally designed to reduce labeling costs and generally will not reduce the compute power you use or the training time your problem requires. Combining it with other methods and concepts helps keep those in check.

So why isn’t active learning used more in the industry?

Largely because people are most familiar with one particular kind of active learning, where the model selects the data it is least confident about to train on. This works in academia, where the data is clean and the labels are accurate, but in the real world, where data can be messy, this strategy often breaks down.
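For reference, least-confidence sampling can be sketched in a few lines. This is a generic illustration of the textbook strategy, not Alectio's method; the sample ids and predicted probabilities below are made up for the example.

```python
def least_confidence(probs):
    """Uncertainty score: 1 minus the probability of the most likely class."""
    return 1.0 - max(probs)

def select_least_confident(predictions, k):
    """Return the ids of the k samples the model is least sure about."""
    ranked = sorted(
        predictions,
        key=lambda sample_id: least_confidence(predictions[sample_id]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical class probabilities predicted for four unlabeled images.
predictions = {
    "img_001": [0.98, 0.01, 0.01],  # confident prediction
    "img_002": [0.40, 0.35, 0.25],
    "img_003": [0.55, 0.44, 0.01],
    "img_004": [0.34, 0.33, 0.33],  # close to a coin flip
}

selected = select_least_confident(predictions, k=2)
print(selected)  # ['img_004', 'img_002']
```

Note that nothing in the score distinguishes a genuinely informative sample from a noisy, corrupted, or ambiguous one: all of them can produce flat probability distributions, so on messy data the strategy may spend the labeling budget on records that cannot help the model.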

How can using less data lead to better model performance?

The belief that more data always means a better model is definitely pervasive. But remember something we said above: not all data is created equal. Some of it is really useful for model training. Some of it is less so (redundant data, for example, can cause overfitting). And some of it is actively harmful (mislabeled data, for example, can seriously confuse a model). Leaving out the useless and harmful records means the model learns only from data that helps it.

What kind of data does Alectio work with?

We can work with virtually any kind of data because our tech learns from the metadata (log files) generated by the training process itself, rather than from the raw data. That said, our approach excels especially with images.

Will Alectio help me with feature engineering?

Our technology identifies which records are the most impactful and useful to a model, not which features should be used in a model. That said, since it can identify which data is useless to a learning process, our technology can occasionally be used to find weaknesses in the model itself, which can in turn help with feature engineering.

What if I don’t have a model yet or I’m still developing it?

In most situations, usefulness is a function of the use case and the data rather than the model itself. Think about a facial recognition problem. Regardless of whether you’ve selected a model, data without a person in it or with poor resolution is going to be less useful than other data. We can uncover that without knowing what model you’re using.

So data usefulness isn’t model-specific?

Usually not! Our research shows that usefulness is data-specific, not model-specific. For example, data uselessness is usually due to either redundancy or irrelevance, and while irrelevance is use-case specific, redundancy is a more general concept. Data hurtfulness is also fairly use-case agnostic. You can read a bit more about that here.

Can I still use human-in-the-loop to curate my data?

Of course! Many companies have dedicated teams focused on data curation these days. The issue is that it is hard for people to know what a model actually needs, especially with black box models like deep learning. Having people decide which data matters often amounts to guesswork and can inject biases into your data. At Alectio, we sometimes say we give the model a voice: it decides what data it needs to learn.