You couldn’t have missed the tremendous growth in the adoption of ML/AI over the past few years. What was once only feasible in sci-fi movies has now become a significant part of our lives. And the one element that has been key in making this sci-fi dream a bit closer to reality is the data.
Still, we are very far from realizing the full potential of ML/AI. But one thing is certain: the ability to build pristine datasets is playing, and will keep playing, a critical role in the development of AI. However, working with Big Data is neither cheap nor easy, and DataPrepOps will be key to ensuring the sustainable and economical use of data at any scale.
ML is everywhere!
Just look around: wherever you are, there's a use case for Machine Learning, from healthcare and education all the way to the algorithms that identify the genres of the songs you like.
Yup, it’s almost everywhere!
But its implementation is no piece of cake.
Back in the mid-2010s, before the release of new MLOps tools and open source libraries like PyTorch, building a Machine Learning model was far from trivial; shipping it to prod was close to impossible for a data scientist without a DevOps background. Eventually, building models got easier… a lot easier. Still, preparing the data remains a challenge, and an ever-growing one too, as datasets keep getting larger.
But in early 2021 a new term – DataPrepOps – was coined to refer to a portfolio of workflows and technologies meant to tune not the model, but the data. DataPrepOps was born as the alter ego to MLOps, aiming at facilitating not traditional Model-Centric ML but Data-Centric ML, and eventually making it as easy and intuitive to optimize datasets as it has become to optimize ML models.
DataPrepOps is trending among organizations that pay attention to their data preparation practices and understand how important it is to manage/govern the data that is being collected and fed into their AI/ML projects.
Top executives are after data prep solutions that can improve their ML models, predictions, and other AI-related projects, enhancing overall efficiency and meeting the ever-evolving needs of their organizations.
DataPrepOps can be defined as “automating and operationalizing the preparation of a training dataset”. DataPrepOps is to data preparation what MLOps is to model building, tuning, and deployment: it is MLOps, but targeted at Data-Centric AI.
Even though data preparation is rarely considered a sexy part of the ML lifecycle, it is actually a complex field with many moving parts, ranging from data labeling to data augmentation and data validation. Even synthetic data generation, which is often viewed as independent of the workflows associated with the preparation of natural data (such as real-life pictures or documents), is inherently related to the entire data preparation workflow, and DataPrepOps is key in coordinating all those steps.
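To make the coordination idea concrete, here is a minimal, purely illustrative sketch of a DataPrepOps-style pipeline chaining the steps mentioned above. Every function name and rule here is a hypothetical stand-in, not part of any specific library or product:

```python
# Illustrative DataPrepOps pipeline: labeling -> augmentation -> validation.
# All logic below is a toy stand-in for real annotation/augmentation services.

def label(samples):
    # Placeholder labeler: tag each sample with a dummy binary class.
    return [{"data": s, "label": s % 2} for s in samples]

def augment(records):
    # Placeholder augmentation: add one perturbed copy of each record.
    return records + [{"data": r["data"] + 100, "label": r["label"]} for r in records]

def validate(records):
    # Placeholder validation: drop records with a missing label.
    return [r for r in records if r["label"] is not None]

def prepare(samples):
    # The point of DataPrepOps: the steps run as one repeatable workflow,
    # not as disconnected manual tasks.
    return validate(augment(label(samples)))

dataset = prepare([1, 2, 3])
print(len(dataset))  # 6: 3 labeled samples plus 3 augmented copies
```

The value is not in any single step but in making the whole chain automated, versioned, and repeatable, the way MLOps did for training and deployment.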
And while the field is still brand new, its adoption within organizations is off the charts for a simple reason: the world is on a quest for automation, while being determined not to break the bank for the sole sake of developing ML products.
So, all that being said, what’s in store for DataPrepOps in 2023?
In this article, we’ll talk about what we believe is about to get hot in the DataPrepOps domain. Let’s dive into it!
1. There will be a higher focus on DataPrepOps for NLP
In recent years, a lot of modern Data Preparation companies have focused on providing tools for Computer Vision problems, and especially for those tightly coupled to autonomous driving. However, due to the recent consolidation of the autonomous driving space and the popularization of large language models such as GPT-3 and the brand new ChatGPT, it is clear that pristine language training data will be required to both push the limits of what can be done in Conversational AI, and allow companies to tune those models to their own custom applications.
2. Data Preparation companies will have to focus on developing privacy-preserving approaches
Another significant advance in AI in the past few months has been the development of GitHub Copilot, an AI pair-coding companion trained on billions of open source lines of code. However, Copilot quickly got into hot water as coders decided to sue GitHub and its parent company, Microsoft, over the rights to the code they wrote, which had been used as training data. The situation has since spawned an interesting debate: can the creator of the training data – whatever that might be – expect to be compensated if a third party uses their work to train a model? Either way, Data Preparation companies will need approaches that protect both the privacy and the ownership rights of the people whose data ends up in training sets.
3. Active Learning pipelines will generalize
It’s taken decades for Active Learning to reach the level of attention that it deserves, but thanks to the recent spike in popularity of Data-Centric AI, the Machine Learning community is finally paying attention to its potential. Now, make no mistake: leveraging Active Learning is hard, for two reasons. Building the pipelines brings another level of complexity compared to regular MLOps, and making Active Learning perform well on a given use case takes as much tuning as a Deep Learning model. For these reasons, in 2023, we can anticipate a significant increase in the number of open source libraries for Active Learning, as well as in the tools meant to support its unique, incremental learning workflow. As a result, we can also hope to see more teams leveraging the power of Active Learning in practice, on a variety of different use cases.
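For readers unfamiliar with the workflow, here is a minimal Active Learning loop using uncertainty sampling, sketched with scikit-learn (assumed to be installed). The synthetic data and the labeling "oracle" are stand-ins for a real pool and a real annotation step:

```python
# Minimal Active Learning loop with least-confident (uncertainty) sampling.
# The data and the "oracle" labels are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(500, 5))
y_pool = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)  # stand-in oracle labels

labeled = list(range(10))          # small seed set of labeled examples
unlabeled = list(range(10, 500))   # the rest is an unlabeled pool

model = LogisticRegression()
for _ in range(5):                 # 5 Active Learning rounds
    model.fit(X_pool[labeled], y_pool[labeled])
    probs = model.predict_proba(X_pool[unlabeled])
    uncertainty = 1 - probs.max(axis=1)      # low max prob = uncertain
    query = np.argsort(uncertainty)[-20:]    # pick the 20 most uncertain
    newly_labeled = [unlabeled[i] for i in query]
    labeled += newly_labeled                 # "send" them for annotation
    unlabeled = [i for i in unlabeled if i not in newly_labeled]

print(len(labeled))  # 110: 10 seed examples + 5 rounds of 20 queries
```

This is exactly the incremental workflow the tooling needs to support: each round changes which data gets labeled next, so the labeling and training pipelines have to talk to each other continuously.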
4. Continuous labeling will be all the rage
Over the past decade, Data Preparation (especially Data Labeling) companies have proliferated to support the growing needs of Machine Learning teams everywhere. Companies have even been founded with the sole purpose of providing elegant annotation tools for those labeling companies to operate more efficiently. However, the process of getting data annotated has remained painfully asynchronous and difficult to manage. Think about it: even in 2022, most labeling requests are still being sent via email or through a Cloud storage solution, and customers often don’t hear about the status of their requests for weeks. This is at odds with the very goal of MLOps, which is to provide continuous training pipelines and model maintenance automation. In other words, as long as data preparation remains asynchronous, true model automation and autoML cannot exist, because a human will still need to upload the annotations into the training pipeline – and that’s before we even mention Active Learning and ML Observability pipelines. The solution: making labeling pipelines continuous and easy to integrate with other MLOps tools.
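The contrast with batch-and-email labeling can be sketched in a few lines. In this hypothetical example, samples stream into a queue, a stand-in annotation service labels them, and a callback ingests each labeled record straight into training storage, with no manual handoff. None of the names correspond to a real API:

```python
# Hypothetical continuous labeling pipeline: samples stream into a queue,
# labels come back through a callback, and labeled records land directly
# in training storage. Annotator and store are illustrative stand-ins.
from queue import Queue

labeling_queue = Queue()
training_store = []                 # stand-in for a feature/label store

def annotator(sample):
    # Stand-in for a human or model-assisted annotation service.
    return {"sample": sample, "label": sample["value"] > 0}

def on_label(record):
    # Ingest the label the moment it arrives: no emailed batches,
    # no manual upload step between labeling and training.
    training_store.append(record)

# Samples arrive continuously instead of as a one-off batch request.
for value in [3, -1, 7]:
    labeling_queue.put({"value": value})

while not labeling_queue.empty():
    on_label(annotator(labeling_queue.get()))

print(len(training_store))  # 3 records, available to the trainer immediately
```

In a real deployment the queue and callback would be a message broker and a webhook, but the design point is the same: labels flow into the training pipeline as events, not as files someone remembers to upload.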
5. DataPrepOps will be coming to the edge
The development of the robotics field is rapidly moving the Machine Learning community towards Edge Machine Learning. Training and retraining models on the edge is not only becoming possible, it’s becoming a necessity. However, even the finest edge-compute technology fails to solve the nagging problem of data preparation on the edge. If you are building models that can truly learn from their surroundings, chances are you will need data preparation on the edge.
It is clear that AI will keep growing at a rapid pace, and with that growth, the model training methods we used to follow will become obsolete. Organizations will pay more attention to efficiency, cost-effectiveness, and ease of use. Automated data preparation, privacy, and data-centric approaches will be the key areas where organizations invest in 2023.
Though DataPrepOps is relatively new in the tech industry, it’s catching on within the savviest data-led organizations, which are ready to test its ability to deliver better, more economical models with less data. Companies and ML enthusiasts have much to look forward to in 2023.
What will be your approach towards optimizing your DataPrepOps strategy in 2023?
If you haven’t given a thought to the above question yet, no worries, we’ve got you! Alectio is hell-bent on developing ever more data-centric solutions to address the problems commonly faced when using model-centric approaches. We are particularly excited to contribute our best solution yet to the problem of data preparation at the edge: our groundbreaking Data Filtering technology, meant to identify relevant data at the point of collection. And we can safely anticipate that many other Machine Learning experts will pitch disruptive ideas in this space.