Perhaps you already know the story of Timnit Gebru, the high-profile ethics researcher who was just forced out at Google, but if not, let’s level-set before we get started. Gebru is likely most famous for a paper she coauthored while at Microsoft Research that highlighted the gender and racial biases in commercial facial recognition. What she found was pretty staggering: namely, that commercially available facial recognition software was far more accurate at predicting gender for light-skinned people than for darker-skinned ones. The gap was widest between lighter-skinned men and darker-skinned women, where the accuracy differences sometimes exceeded 30 percent.
Gebru soon moved on to Google, where she continued her research. At least until recently. Gebru was forced out after a conflict with Google brass over a paper she coauthored on the size and potential negative effects of enormous language models. In brief: there are significant environmental, financial, and ethical costs to the massive, state-of-the-art NLP models Gebru was examining. Google claimed Gebru didn’t go through the proper approval channels (some other Google AI researchers balked at this reasoning), and she asked for conditions to be met or she’d resign. Soon after, she was cut off from her corporate email account.
Now, on a fundamental level, this is really disappointing. First and foremost, it cuts against Google’s own supposed commitment to ethical AI. It’s hard to claim you’re looking to avoid bias or be socially beneficial if you silence internal critics who are highlighting gaps in your practices. The response has been fairly widespread condemnation from not only within the tech and AI communities, but the public at large. And though we agree wholeheartedly that, based on what we know right now, Google messed up badly here, we’d like to step away from the optics and Google’s response to Gebru’s research and look more deeply at her research itself. Namely: can language models be too big?
At the risk of ruining the suspense, the answer is a resounding yes.
Let’s begin by stating unequivocally: we understand that natural language processing (NLP) models are complex. Where we once relied on unsophisticated methodologies like “bag of words” (which, if you’re not familiar, is exactly what it sounds like), current, state-of-the-art models are far more nuanced. But that nuance comes at a real cost: to the environment, to explainability, and to competition in tech and machine learning.
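If you’re curious just how unsophisticated “bag of words” is, here’s a minimal sketch (the function name and whitespace tokenization are ours, purely for illustration): it counts word occurrences and throws away order and context entirely.

```python
from collections import Counter

def bag_of_words(text):
    # A "bag of words" simply counts token occurrences,
    # discarding word order and context entirely.
    return Counter(text.lower().split())

print(bag_of_words("the cat sat on the mat"))
# "the" appears twice; every other word once
```

Compare that to a modern language model, which tracks how every token relates to every other token in context, and you can see where the explosion in size and compute comes from.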
See, big, complex models, at least the way Google is building them (and, let’s face it, the way most of us are building them), require tons of data and tons of compute power. As we’ve covered before, this isn’t a trivial amount of energy. Training models like these just once has the carbon footprint of several cars over their average lifetimes. For a company that made a lot of hay bragging about being carbon neutral, it’s hard to square those competing priorities. In fact, there’s an argument that Google more or less spearheaded the arms race in big data and the way we approach model building today. It’s hard to solve a problem you at least partially created.
Additionally, as these models balloon in size, it becomes harder to know what’s actually inside them. There’s simply no way researchers can adequately audit the millions of rows of data these models ingest, even at a company with Google’s size and resources. Language is a fluid, living thing, and state-of-the-art NLP models need to be constantly retrained to reflect this. Just think what the words “Donald Trump” meant in 2014, or what the word “virus” meant just last year. If your model is supposed to either understand or, more complicated still, generate text? It needs to be updated all the time. That brings the attendant environmental issues we mentioned above, but beyond that, it requires a ton of data. And if you can’t audit the data effectively, you could end up creating something like Microsoft’s Tay, the regrettably horrible bot poisoned on social media.
Lastly, if these models require an ever-increasing amount of compute and data, how long until nobody outside the Googles and Facebooks of the world can compete? Their resource advantage is already tremendous, and if we simply can’t match their successes because of that advantage? That’s not a good thing for the tech world, for research, for, well, all of us. Look far enough into the future and you can see a world where small, then medium, then even big companies cannot keep pace with the big tech firms in machine learning.
If this seems fatalist, let us apologize. It actually isn’t. It’s just fatalist if we don’t make any changes to the way we’re doing things. And if you’re a company trying to make models the way Google does, you likely know that’s a tall order, namely because of the advantages we just laid out.
So what’s the best course of action here? Stop buying into this paradigm that great models require oodles of data.
See, at the scale Google is operating at, they’re absolutely including garbage in their models. They can fight this by brute force, simply adding more and more data, but that isn’t efficient or viable long-term (or viable for smaller machine learning teams either). Data curation, data preparation, data augmentation, and synthetic data generation are just a few of the ways to reduce the amount of data you need, because you’ll include the most useful data, not just all the data. The reality is that you can get competitive, if not superior, models by investing not in quantity but in quality. Because, let’s face it, you aren’t going to beat the Googles of the world on quantity.
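To make the quality-over-quantity point concrete, here’s a hedged sketch of one simple curation pass: deduplication plus a minimum-length filter. The `curate` helper and its threshold are hypothetical, not any particular tool’s API; real curation pipelines layer on much more (quality scoring, toxicity filtering, source weighting).

```python
def curate(records, min_length=20):
    # Hypothetical curation pass: drop near-empty and duplicate
    # examples so the model trains on higher-quality data.
    seen = set()
    kept = []
    for text in records:
        normalized = " ".join(text.lower().split())
        if len(normalized) < min_length or normalized in seen:
            continue
        seen.add(normalized)
        kept.append(text)
    return kept

raw = [
    "Good example sentence about NLP models.",
    "good example sentence about NLP models.",  # duplicate after normalization
    "too short",
]
print(curate(raw))  # keeps only the first record
```

Even a trivial pass like this shrinks the training set without losing signal; the same idea, applied rigorously, is how smaller teams stay competitive.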
While Gebru’s sudden unemployment is truly unfortunate, she went out sounding alarm bells that we should all take to heart. Models are too big. For small companies now and the industry writ large in the future. They simply aren’t sustainable, especially if they continue growing this way.
We can all do our part by fighting against the need for more and more and more data and instead focusing on more responsible models built with more modest amounts of data. We have the option to save money, save time, reduce our carbon footprint, and avoid biases, all without trading off accuracy. It’s just a matter of refusing to think like the biggest companies in tech and instead embracing leaner, more agile paradigms. And if you’d like to get started, we’d love to help.