Welcome back to our explainer series, where we tackle complex machine learning topics in five levels of escalating complexity.
In this episode, we’re looking at one of those things data scientists absolutely adore doing: Data Labeling. (And, in case you’re curious, we can help you do less of it!)
00:00 – Intro
00:38 – Level 1: Kindergartener
01:42 – Level 2: Teenager
03:39 – Level 3: Non-expert adult
05:07 – Level 4: Computer science major
06:55 – Level 5: Machine learning expert
09:31 – ???
Contact us to learn how we can help your team build better models with less data. We’d love to show you how it works!
A 5-year-old
Did you know that machines can learn? In fact, machines learn a lot like you do. If you think about it, you didn’t know what a toaster was until your parent showed you an example of a toaster.
Machines learn from something called data, and data is basically just information. Data can be pictures of a dog, sentences, or clips of audio. But for a machine to learn from that data, it needs a human to label it, and labeling is sort of like putting a name tag on top of a picture.
But one thing that makes a difference between you and a machine is that machines need to learn from a lot of examples. Think of it as a machine needing to see the toaster 400 times before it realizes that a picture of a toaster is a toaster. But you, once you’ve seen it once or twice, you’re going to know that it’s a toaster immediately. In a way, that makes you a lot smarter than a machine, even though machines might seem like they’re a lot smarter than you.
A teenager
What comes to your mind when you think about artificial intelligence? Perhaps you’re thinking about amazing robots from Hollywood movies, self-driving cars in the streets, or even the voice assistants on your mobile phone, like Siri or Alexa.
That’s what most people see, but before any of those can see the light of day, Machine Learning scientists need to collect data so that a machine learning model can learn from it. And before the data can be used, it has to be labeled.
So, what is data labeling? It can be as simple as associating an image with its content, for example saying whether the image shows a car or a truck. But in reality it’s much more complex, because the images captured by a camera can contain many different objects, and we want to let our model know where each of those objects is. We do this by using something called bounding boxes. There are also other types of data labeling tasks.
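To make the bounding-box idea concrete, here is a minimal sketch of what such a label could look like in Python. The `BoundingBox` class, its field names, and the pixel coordinates are all made up for illustration; real labeling tools each have their own formats. The `iou` helper (intersection over union) is one common way to check how closely two boxes agree, for instance when two people label the same object.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    """One labeled object: a class name plus a rectangle in pixel coordinates (hypothetical format)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        # Clamp at zero so an "inside-out" rectangle counts as empty.
        width = max(0.0, self.x_max - self.x_min)
        height = max(0.0, self.y_max - self.y_min)
        return width * height


def iou(a: BoundingBox, b: BoundingBox) -> float:
    """Intersection over union: 1.0 for identical boxes, 0.0 for boxes that do not overlap."""
    overlap = BoundingBox(
        label="overlap",
        x_min=max(a.x_min, b.x_min),
        y_min=max(a.y_min, b.y_min),
        x_max=min(a.x_max, b.x_max),
        y_max=min(a.y_max, b.y_max),
    )
    intersection = overlap.area()
    union = a.area() + b.area() - intersection
    return intersection / union if union > 0 else 0.0


# The labels for one image: every object gets its own box (illustrative values).
image_labels = [
    BoundingBox("car", 40, 120, 260, 300),
    BoundingBox("truck", 300, 90, 620, 310),
]
```

Calling `iou(image_labels[0], image_labels[1])` compares the car box with the truck box; since the two rectangles do not overlap, their score is zero.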
In some cases, we want to identify all the topics in a document or the proper nouns in a sentence. Sometimes the data labeling task can even be subjective, as in content moderation, where people are asked whether a specific tweet contains information they find inappropriate.
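When a task is subjective, different people can give different answers for the same tweet. One simple way to handle that, sketched below, is to ask several people and keep the most common answer, along with how strongly they agreed. The function name and the label strings are invented for this example, and majority voting is only one of several possible strategies.

```python
from collections import Counter


def aggregate_votes(votes):
    """Return the most common label and the share of annotators who chose it."""
    if not votes:
        raise ValueError("need at least one vote")
    counts = Counter(votes)
    label, top_count = counts.most_common(1)[0]
    agreement = top_count / len(votes)
    return label, agreement


# Three hypothetical annotators judge the same tweet.
label, agreement = aggregate_votes(["inappropriate", "ok", "inappropriate"])
```

A low agreement score is a hint that the item is genuinely ambiguous and may deserve a closer look rather than a quick label.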
Now, think about the fact that every single record has to be labeled before it can be used. Isn’t that a bit daunting, knowing how much data we’re dealing with nowadays?
A non-expert adult
Have you ever signed up on a website and been asked to prove that you’re human? You were probably shown a grid of images: this is called a CAPTCHA, and it asks you to label a few images for a particular object. For example, you may have been asked to pick out the buses in the images you were shown. What happened to all of the labeled data that you generated? Oftentimes, a Machine Learning algorithm is trained on this data. For instance, a bus detection model would have been trained on all of the buses you selected. Sneaky, right?
How are companies using your data to profit from their models? Think about this one: Facebook recently announced that it has the best facial recognition AI system in the world. Now, where did they get all of that data? If you’ve ever tagged a friend, family member, or co-worker in a Facebook image, they’ve probably used that data to train their models for facial recognition.
Now, I don’t know your views on this, but I think this is relevant information you should be aware of!
A CS student
It’s a hot day today! Luckily, I grabbed myself a bottle of water and some hand sanitizer to use during these uncertain times. Whoa! I just noticed that these aren’t labeled. What if I drink from the wrong bottle? Sounds horrible, right?
In Machine Learning, you have a concept that is very similar called supervised learning. Supervised learning is a process in which humans supervise the learning of machines by accurately labeling the data points that they collected.
What if something goes wrong in this labeling process? What if you mislabel something? Something as catastrophic as an autonomous vehicle failing to detect a pedestrian could happen. Sounds horrendous, right?
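To see both ideas in a few lines, here is a toy sketch of supervised learning using the bottle example. It uses a 1-nearest-neighbor rule, chosen only because it is short; the single feature (alcohol percentage) and all of the numbers are made up. The model simply copies the label of the closest training example, so a single wrong label is enough to flip its answer.

```python
import math


def predict(training_data, features):
    """1-nearest-neighbor: return the label of the closest labeled example."""
    _, nearest_label = min(
        training_data, key=lambda example: math.dist(example[0], features)
    )
    return nearest_label


# Each example is (features, label); the only feature is alcohol percentage (invented values).
clean_labels = [
    ((0.0,), "water"),
    ((1.0,), "water"),
    ((62.0,), "sanitizer"),
    ((70.0,), "sanitizer"),
]

# The same data, except one sanitizer bottle was mislabeled as water.
noisy_labels = [
    ((0.0,), "water"),
    ((1.0,), "water"),
    ((62.0,), "water"),  # labeling mistake
    ((70.0,), "sanitizer"),
]

unknown_bottle = (65.0,)
prediction_clean = predict(clean_labels, unknown_bottle)
prediction_noisy = predict(noisy_labels, unknown_bottle)
```

With the clean labels, the unknown bottle is closest to the 62% example and comes back as sanitizer. With the mislabeled version, that same neighbor now says water, and the model repeats the mistake.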
Unfortunately, we don’t see a lot of people in academia talking to students about how critical this process is. People often think that data labeling is easy and overlook it, but it’s definitely not. To give you an example, imagine you’re labeling an autonomous driving scene. You’d be faced with scenarios in which there are occluded objects, reflections of objects, or even a poster that contains objects like a car or pedestrians. Are we even supposed to label these? Sounds confusing, right?
Let’s be honest: as Machine Learning engineers, we don’t want to sit all day labeling the data points that we collected. That’s why there are companies dedicated to labeling entire datasets efficiently.
An ML expert
With the amount of data that we collect every year, it seems that nothing stands in the way of taking Machine Learning to the next level. That is, until you consider that all of this data needs to be prepared and labeled before it can be used. For many machine learning teams out there, in fact, we’re already way past the point where labeling this data in-house is even an option.
This is where the many data labeling companies out there can help us, but even then we might already be seeing the limits of what can be done with human labeling. Did you know, for instance, that even if every single human on the planet were to stop what they are doing right now and do nothing but data labeling, we still wouldn’t have enough people to label all of the data?
So, even if you see an increase in the number of companies relying on crowdsourcing to grow their workforce, that might still turn out not to be enough. And that’s not even the only problem, because with such large volumes of data, all of those companies are also struggling to provide decent SLAs or even guarantee the quality of the labels that they return to their customers.
The only way out of this predicament, if we want to keep using supervised learning in the future, is to start relying more and more on things such as auto-labeling: using a Machine Learning process to generate synthetic labels that are then used in your Machine Learning process. If that sounds like a chicken-and-egg problem, maybe that’s because it is one.
For me, one of the only ways to think about the whole thing the right way is to consider a paradigm based on a human-in-the-loop approach, where you’d use a human to validate the labels and catch the bad ones coming out of the synthetic labeling process, in order to make the whole thing better.
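As a rough sketch of what a human-in-the-loop setup could look like, the snippet below routes machine-generated labels by confidence: labels the model is sure about are accepted, and the rest are queued for a person to check. This is only one possible design, not a description of any particular product. The `route_labels` function, the 0.9 threshold, and the sample predictions are all assumptions made for illustration.

```python
def route_labels(predictions, threshold=0.9):
    """Split auto-generated labels into auto-accepted ones and ones a human should review.

    `predictions` is a list of (item_id, label, confidence) tuples, where
    confidence is the model's own score between 0 and 1 (hypothetical format).
    """
    accepted, needs_review = [], []
    for item_id, label, confidence in predictions:
        if confidence >= threshold:
            accepted.append((item_id, label))
        else:
            needs_review.append((item_id, label))
    return accepted, needs_review


# Hypothetical output of an auto-labeling model.
auto_labels = [
    ("img_001", "bus", 0.97),
    ("img_002", "car", 0.55),
    ("img_003", "pedestrian", 0.91),
]

accepted, needs_review = route_labels(auto_labels)
```

The labels corrected by the human reviewer can then be fed back as training data, which is how the loop is meant to improve the auto-labeling model over time. Choosing the threshold is a trade-off: a higher value sends more items to people, a lower one lets more bad labels through.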
Unfortunately, there isn’t quite enough research in this space yet to take us to the levels where we need to be today. The good news though is that I anticipate a lot of very exciting research in this space in the upcoming years.