At Alectio, we’ve pioneered a technique that lets us understand how a model is learning and what data it needs without looking at either the model or the data. Simply put: we use machine learning to understand how a machine learning model works and, importantly, how it learns. This allows us to make certain key inferences about the model without knowing the nuts and bolts of the data or the model itself.
People usually balk at that assertion. We understand why. After all, it’s a bit counterintuitive. We’ll explain it a bit more in a moment, but first, think about this:
Imagine a stranger on the bus. He pulls out a book you’ve never heard of and starts reading. A few pages in, he starts laughing. In fact, he’s losing his mind. He’s doubled over and, frankly, he’s making a little bit of a scene.
Across the aisle from him, there’s a different stranger. She pulls a book you’ve never heard of out from her backpack and starts reading. A few pages in, she’s frowning. She leafs through it absently and starts staring out the window. She looks at it a few more times but never turns the page before she puts it away for good.
Now, here’s the question: which book do you want to read?
Easy question, isn’t it? But think about how you answered it. You don’t know either the man or the woman on the bus. You’ve never heard of either book, and you can’t see what’s inside them. You have almost no information. But you made your decision from clues and context.
Now, think of the people in the example above as models. Think of the books as data. And think of their behavior (laughter or boredom) as signals the model is sending us.
This is broadly how this type of metalearning works. But to do it right, you have to train your model incrementally and examine its progress after each successive loop in your training process. You do this by showing the model a pool of unlabeled data and asking for its predictions. From those predictions, you decide which data should be labeled next.
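To make that loop concrete, here is a minimal sketch of incremental, loop-based training. The names and the synthetic data are hypothetical, and scikit-learn’s `LogisticRegression` simply stands in for whatever model you’re training; Alectio’s actual pipeline is not shown here. The key structure is the same, though: each loop trains on the labeled pool, asks the model for predictions on the unlabeled pool, and uses those predictions to decide what to label next.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic "ground truth" labels

labeled = list(range(20))                 # small seed set of labeled rows
unlabeled = list(range(20, 500))          # the rest acts as the unlabeled pool

for loop in range(5):
    model = LogisticRegression().fit(X[labeled], y[labeled])
    probs = model.predict_proba(X[unlabeled])   # the model's "signals"
    # Placeholder selection: pick a random batch to label next. A real
    # querying strategy would rank rows using `probs` instead.
    picks = rng.choice(len(unlabeled), size=20, replace=False)
    batch = [unlabeled[i] for i in picks]
    labeled += batch                             # simulate sending for labels
    unlabeled = [i for i in unlabeled if i not in batch]
```

The selection step is deliberately a stub here: everything interesting about this kind of metalearning lives in how you replace that random choice with a strategy driven by the model’s predictions.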
At Alectio, we spend a ton of time and research on this particular step of the process. We use an ensemble approach that includes active learning, but whatever you’re doing, don’t simply rely on the best judgment of a team of data scientists. People aren’t great at understanding how machine learning models really learn or what they actually need, especially with black-box approaches like deep learning.
We’ve written a bit about active learning querying strategies here, and we spend a ton of time researching them because they allow us to help our clients identify the right data to train their models. Suffice it to say: finding the right clues and interpreting them the right way is hard. It’s why simple querying strategies like “least confidence” don’t work on real-world datasets.
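For reference, the textbook least-confidence strategy looks like the sketch below: rank unlabeled rows by how low the model’s top predicted probability is, and label the shakiest ones first. The function name and toy probabilities are ours for illustration; this is exactly the kind of simple baseline that tends to fall short on its own with real-world data.

```python
import numpy as np

def least_confidence(probs: np.ndarray, batch_size: int) -> np.ndarray:
    """Return indices of the rows whose top predicted probability is lowest.

    probs: (n_rows, n_classes) probability outputs from the model.
    """
    uncertainty = 1.0 - probs.max(axis=1)        # high when no class dominates
    return np.argsort(uncertainty)[::-1][:batch_size]

probs = np.array([
    [0.98, 0.02],   # confident: unlikely to be selected
    [0.55, 0.45],   # near the decision boundary: selected first
    [0.70, 0.30],
])
least_confidence(probs, 2)   # -> rows 1 and 2
```

One weakness is visible even in this toy: the strategy only sees a single snapshot of the model’s confidence, with no notion of how that confidence has been changing as training progresses.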
But we want to circle back and highlight something really important here: it isn’t necessary to see either the model or the data to know what a model needs. All you need is the model’s predictions on unlabeled data.
We use the ensemble approach we mentioned above to look at how predictive behavior changes over loops, how a model becomes more or less confident on certain unlabeled rows, and a whole host of other signals. That’s how we use machine learning to observe your model’s behavior so we can help you unlock the data it needs to become more accurate with far less data than would otherwise be necessary. And we don’t have to look at your model or your data to do it.
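As one illustration of a cross-loop signal (the numbers and thresholds here are made up, and this is just one signal among the many an ensemble would combine), you can track each unlabeled row’s top-class probability across loops and flag rows where confidence is dropping:

```python
import numpy as np

# Hypothetical per-loop confidence matrix: rows = unlabeled examples,
# columns = training loops; each entry is the model's top-class
# probability for that example at that loop.
confidence = np.array([
    [0.50, 0.70, 0.90],   # steadily more confident: likely being learned
    [0.80, 0.55, 0.40],   # losing confidence: a row worth labeling
    [0.50, 0.48, 0.51],   # flat: the model may never learn this alone
])

trend = confidence[:, -1] - confidence[:, 0]   # net change over the loops
flagged = np.where(trend < 0)[0]               # rows the model is "forgetting"
```

Notice that nothing here touches the model’s weights or the raw data rows themselves; the signal is built entirely from the model’s predictions over successive loops.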