As the adoption of Machine Learning grew across the industry, so did the need for data annotation. Leaders in the space, as well as those funding these new initiatives, learned the hard way that collecting loads of data wasn’t enough: the training data that ML models require needs a lot of preparation and processing, and we are not just talking about ETL and feature extraction. Raw data just isn’t good enough for Machine Learning, and a lot of work is required to make it ML-ready.
But while preparing training data requires a lot more than just data labeling, many people unfortunately started spreading the somewhat misleading narrative that data labeling represents the bulk of the work. They failed to mention that the process also involves, among other things, cleaning data, dealing with outliers and missing data, curating the data, potentially applying data augmentation, and even validating the labels. This is how many labeling companies eventually started rebranding as “data preparation” companies, making it unclear what their core competency truly is.
We at Alectio have known this all along; in fact, that’s what DataPrepOps is all about. The mission of DataPrepOps is to bridge the many gaps in the data preparation journey that labeling companies cannot bridge for data scientists.
So what is it that DataPrepOps can do that labeling companies don’t?
1. Transactions with labeling companies are still manual
This might be hard to believe for those not used to interacting with data labeling companies, but even as we enter the Age of Generative AI, sending data out for annotation is still a manual process: you need to drop your data in Google Drive and share it with someone you don’t know, or even send it via email. This creates unnecessary data security and data privacy risks and leaves you with zero traceability, not to mention that you still need to jump on a call with a customer success person to confirm receipt of the data and explain what you expect. This is hardly a process on which an end-to-end ML training pipeline can be built.
2. Labeling companies’ timelines are inappropriate for the age of MLOps
While autolabeling is glamorized all over social media, it is not an option for 95% of use cases, ranging from search relevance to content moderation and computer vision in medical imaging. This means that most data needs to be annotated manually, even when a human-in-the-loop approach is taken. However, the fact that humans are required to annotate the data shouldn’t prevent efficient data sharing. In other words, data can still be annotated very quickly if it is routed to the right annotator, and if pipelines exist to return the output to the team that requested the annotations as soon as the work is completed. Yet because most labeling companies have been set up as service providers rather than technology companies, requesters will not get their annotations back for days, if not weeks: a stark contrast with how hard the rest of the ML industry is working to make everything automated and real-time. DataPrepOps brings this much-needed paradigm shift towards continuous labeling.
3. Labeling companies can’t cover all niches
The best-known labeling companies, such as Scale AI or Labelbox, excel at getting data annotated for the most common use cases, such as autonomous driving or facial recognition. They usually aren’t of much help with more niche use cases (such as annotating heavy machinery data or surgical videos), simply because their annotators are generalists. Smaller labeling providers often do offer such expertise, but the problem is that hardly anyone knows about them, and they are not easy to identify. This is why the future of data labeling lies in a marketplace of labeling providers.
4. Labeling companies won’t truthfully report their own performance metrics
If you have ever used a third party to annotate your data, you certainly know that there is usually a huge gap between the quality you expect and the quality reported by the labeling company. That’s because reliably measuring the quality of the process requires both objectivity and truthfulness, which in turn requires a certain distance from the process. Think about it: if the labeling company knew about its own weaknesses, it would probably have attempted to address them already, unless something is making it hard to do just that. In that case, reporting truthful metrics is probably not in its best interest, and that’s why you can never know whether the quality it claims is what you got. Label auditing needs to be done by a third party with no conflict of interest.
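To make the idea of an independent audit concrete, here is a minimal sketch of what one could compute on a small sample that has been re-annotated by someone other than the vendor. The function name `audit_labels` and the choice of metrics (raw agreement and Cohen’s kappa) are our own assumptions for illustration; they are one common way to quantify agreement, not a description of any particular product.

```python
from collections import Counter


def audit_labels(vendor, gold):
    """Compare vendor labels with an independent audit sample.

    vendor, gold: equal-length lists of labels for the same records.
    Returns (agreement, kappa), where agreement is the fraction of
    matching labels and kappa is Cohen's kappa (agreement corrected
    for the agreement expected by chance).
    """
    if len(vendor) != len(gold) or not gold:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(gold)
    observed = sum(v == g for v, g in zip(vendor, gold)) / n

    # Chance agreement: probability that both sides pick the same class
    # if each labeled at random according to its own class frequencies.
    vendor_freq = Counter(vendor)
    gold_freq = Counter(gold)
    expected = sum(vendor_freq[c] * gold_freq[c] for c in gold_freq) / (n * n)

    # If both sides always use the same single class, kappa is undefined;
    # we report 1.0 in that degenerate case (an assumption of this sketch).
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical audit sample of four records.
vendor_labels = ["cat", "cat", "dog", "dog"]
audit_sample = ["cat", "dog", "dog", "dog"]
agreement, kappa = audit_labels(vendor_labels, audit_sample)
print(f"agreement={agreement:.2f}, kappa={kappa:.2f}")
```

The point of reporting kappa alongside raw agreement is that a vendor can look accurate on a heavily skewed dataset simply by favoring the majority class; the chance correction makes that visible.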
5. Labeling companies can’t (and won’t) tell you which data you should use for your model
Datasets have become so large nowadays that labeling them in their entirety is often no longer an option, whether because of time, cost, or the lack of annotators with the right expertise on the market. The issue is that labeling companies obviously make more money when you annotate more data, so they have little incentive to help you identify which part of your dataset is actually meaningful. But even if they did, they simply couldn’t tell you which data you should sample, or whether any of your data is irrelevant or harmful, because they do not have the context of what you are trying to achieve with that data. Reliable data curation requires visibility into not only the use case but also the model itself, because each model is different, and the data that is useful to one model might not be useful to the next. Data curation needs to sit at the intersection of model training and data preparation, and data labeling companies are just not naturally equipped to help with that.
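As an illustration of why curation depends on the model, here is a minimal sketch of uncertainty sampling, a common active learning heuristic: the records the current model is least sure about are sent for labeling first. The function names, the record ids, and the probabilities below are hypothetical, and this is only one of many possible selection strategies.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a list of class probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Pick the `budget` records the model is most uncertain about.

    predictions: dict mapping a record id to the class probabilities
                 predicted by the current model for that record.
    Returns a list of record ids, most uncertain first.
    """
    ranked = sorted(
        predictions,
        key=lambda record_id: entropy(predictions[record_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical predictions from a binary classifier on unlabeled data.
unlabeled = {
    "a": [0.98, 0.02],  # model is confident
    "b": [0.50, 0.50],  # model is maximally unsure
    "c": [0.70, 0.30],
}
print(select_for_labeling(unlabeled, budget=2))
```

Because the ranking is computed from the model’s own predictions, a different model would select different records, which is exactly the context a labeling vendor does not have.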
6. Labeling companies can’t help with data collection, data augmentation or data generation
Even though many labeling companies label themselves data preparation companies (no pun intended!), they do not offer technology to help you with data sourcing. They just annotate what you send them, and even if an expert on their end can sometimes provide advice, they just don’t have enough visibility into your existing data collection processes to be truly helpful. Most of the time, they also don’t provide solutions to augment or synthesize data, and even those that do won’t help you figure out which augmentations to apply or what data to generate (partly because of their lack of visibility into the model). DataPrepOps solves this issue by connecting open-source libraries, as well as the model’s training pipeline, directly with the annotation process.
7. Labeling companies can’t do much to improve data quality
We’ve all seen ads from labeling companies claiming they will provide high-quality training data, but all they can do is provide high-quality labels. Labeling companies do not, and cannot, improve the quality of the raw data: again, they annotate the data you send. If your dataset is imbalanced or corrupted, they cannot clean it or fix it for you. Data cleaning, including identifying faulty records with an adversarial ML approach and even repairing specific records, is another problem that DataPrepOps is meant to solve for ML scientists and that labeling companies can’t.
Ultimately, labeling companies still play a critical part in the Machine Learning industry, and will keep doing so for years to come. But the data labeling processes that worked in the 2010s are seriously ill-suited to the Machine Learning industry of the future, both because we are entering the age of super large models (which require more data but also much better quality) and because of the community’s push for real-time model training in the form of ML observability or even online learning. DataPrepOps is key to modernizing the way datasets are prepared, and the way data preparation is managed and operationalized. And if you’re curious to learn more, we’d love to show you how.
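To give a sense of the kind of checks that sit outside the labeling step, here is a minimal sketch that flags records with missing fields and reports class balance. It uses simple rule-based checks, not the adversarial ML approach mentioned above, and the field names (`id`, `text`, `label`) and the function name are assumptions made for the example.

```python
from collections import Counter


def basic_quality_report(records, label_key="label", required=("id", "text")):
    """Run two simple data quality checks on a list of dict records.

    Returns a dict with:
      faulty          - indices of records missing a required field
      class_counts    - number of records per label
      imbalance_ratio - largest class size divided by smallest class size
                        (None if no record carries a label)
    """
    faulty = [
        i
        for i, record in enumerate(records)
        if any(record.get(key) in (None, "") for key in required)
    ]
    counts = Counter(
        record[label_key]
        for record in records
        if record.get(label_key) is not None
    )
    ratio = max(counts.values()) / min(counts.values()) if counts else None
    return {
        "faulty": faulty,
        "class_counts": dict(counts),
        "imbalance_ratio": ratio,
    }


# Hypothetical toy dataset: one record has an empty text field.
dataset = [
    {"id": 1, "text": "win a prize now", "label": "spam"},
    {"id": 2, "text": "", "label": "ham"},
    {"id": 3, "text": "see you at noon", "label": "ham"},
    {"id": 4, "text": "lunch tomorrow?", "label": "ham"},
]
print(basic_quality_report(dataset))
```

Checks like these are cheap to run before any data is sent out for annotation, so faulty records are fixed or dropped rather than paid for.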