Corporations across sectors are leveraging machine learning, along with the availability of big data and compute resources, to bring remarkable improvements to all sorts of operations, including content recommendation, inventory management, sales forecasting, and fraud detection.

Yet despite their seemingly magical behavior, current AI algorithms are essentially very efficient statistical engines that can predict outcomes as long as those outcomes don’t deviate too much from the norm.

Currently, advances in AI are mostly tied to scaling deep learning models and creating neural networks with more layers and parameters. According to artificial intelligence research lab OpenAI, “since 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.4-month doubling time.” This means that in a matter of a few years, the metric has grown by a factor of 300,000.
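The arithmetic behind that figure is worth making explicit. A minimal sketch (the 63-month horizon is an illustrative round number, not OpenAI’s exact measurement window):

```python
# OpenAI's observation: compute in the largest AI training runs doubles
# roughly every 3.4 months, so the growth factor after a given number of
# months is 2 ** (months / 3.4).
def compute_growth(months, doubling_time=3.4):
    """Growth factor of training compute after `months` months."""
    return 2 ** (months / doubling_time)

# A doubling every 3.4 months compounds ferociously: after a little over
# five years (~63 months) the factor is already in the hundreds of
# thousands, consistent with OpenAI's ~300,000x figure.
print(f"{compute_growth(63):,.0f}x")
```

The point of the sketch is how quickly a 3.4-month doubling time compounds, not the precise endpoint.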

This requirement imposes severe limits on AI research and can also have other, less savory repercussions.

In a blog post titled “The Bitter Lesson,” AI scientist Rich Sutton argues that the artificial intelligence industry owes its advances to the “continued exponentially falling cost per unit of computation,” not to our progress in encoding the knowledge and reasoning of the human mind into computer software.

“Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation,” Sutton says, adding that “the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation.”

That short paragraph says a lot about the current state of artificial intelligence, but it needs a lot of unpacking.

In the post, Sutton’s message boils down to this: one should work on scalable methods that can maximally leverage compute and forget about modeling the world. He offers a number of examples to support this claim, namely Deep Blue and AlphaGo, which leveraged search and learning rather than human strategies, as well as speech recognition and visual object recognition.

And we can add a few more to the list: melanoma and tumor detection, statistical machine translation, and so on. There is no doubt a trend here that cannot be ignored.

But one really important factor is missing from Sutton’s argument: besides compute, data is perhaps the most fundamental raw material of machine learning. All the examples in his post share one crucial property: they are very well-defined, and rather narrow, problems where you can either generate your own data (e.g., AlphaGo) or have ample data available (e.g., speech).

In these regimes, data-driven, discriminative, black-box methods such as deep learning shine. We can view these as interpolation problems: the input domain is well delimited, we have sufficient data to cover it, and the model interpolates between the dots. The trouble starts when we need to extrapolate.
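The interpolation-versus-extrapolation distinction can be made concrete with a toy experiment. Here a flexible polynomial fit stands in for a black-box model; the function, noise level, and polynomial degree are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data covers only the interval [0, 2*pi], the "well-delimited
# input domain" where a purely data-driven fit works well.
x_train = np.linspace(0, 2 * np.pi, 200)
y_train = np.sin(x_train) + rng.normal(0, 0.05, x_train.shape)

# A flexible black-box fit (a degree-9 polynomial standing in for a
# neural network) interpolates well inside the training domain...
coeffs = np.polyfit(x_train, y_train, deg=9)
inside = np.polyval(coeffs, np.pi / 3)   # should land near sin(pi/3)

# ...but extrapolating beyond the data is a different story: the true
# value at 3*pi is sin(3*pi) = 0, yet the fitted polynomial strays far
# from it once the input leaves the region the data covered.
outside = np.polyval(coeffs, 3 * np.pi)

print(f"interpolation error: {abs(inside - np.sin(np.pi / 3)):.3f}")
print(f"extrapolation error: {abs(outside):.3f}")
```

The same fit that looks excellent inside the data’s range gives no guarantees outside it, which is the whole worry about extrapolation.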

Max Welling wrote in his blog post “Do we still need models or just more data and compute?”:

This question, or versions of it, seems to divide the AI community. And much like Bayesians and Frequentists they hold rather strong polarizing views on the matter. The question seems to come in different flavors: symbolic AI or statistical AI, white box AI or black box AI, model driven or data driven AI, generative or discriminative AI? A recent blog by Rich Sutton adds to the list compute-driven AI versus human-knowledge based AI. The discussion is both fascinating and deeply fundamental.

We should all be thinking about these questions.

For the moment, bigger is better

“Within many current domains, more compute seems to lead predictably to better performance, and is often complementary to algorithmic advances,” OpenAI’s researchers note.

We can witness this effect in many projects where the researchers have concluded they owed their advances to throwing more compute at the problem.

In June 2018, OpenAI introduced an AI that could play Dota 2, a complex battle arena game, at a professional level. Called OpenAI Five, the bot entered a major e-sports competition but lost to human players in the finals. The research lab returned this year with a revamped version of OpenAI Five and was able to defeat the human world champions.

The secret recipe as the AI researchers put it: “OpenAI Five’s victories on Saturday, as compared to its losses at The International 2018, are due to a major change: 8x more training compute.”

There are many other examples like this, where an increase in compute resources has resulted in better results. This is especially true in reinforcement learning, which is one of the hottest areas of AI research.

According to a paper by researchers at the University of Massachusetts Amherst, training a transformer AI model (often used in language-related tasks) with 213 million parameters causes as much pollution as the entire lifetime of five vehicles. Google’s famous BERT language model and OpenAI’s GPT-2 have 340 million and 1.5 billion parameters, respectively.

Given that current AI research is largely dominated by the “bigger is better” mantra, this environmental concern is only going to become worse. Unfortunately, AI researchers seldom report or pay attention to these aspects of their work. The University of Massachusetts researchers recommend that AI papers be transparent about the environmental costs of their models and provide the public with a better picture of the implications of their work.
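The kind of transparency the Amherst researchers call for can start with a back-of-the-envelope estimate like the sketch below. Every constant in it (per-GPU power draw, the data-center PUE overhead factor, the grid carbon-intensity figure) is an illustrative assumption, not a measurement from the paper:

```python
# Rough CO2 estimate for a training run: energy drawn by the hardware
# over the run, converted to CO2 via a grid carbon-intensity figure.
def training_co2_kg(gpu_count, hours, watts_per_gpu=300,
                    pue=1.58, kg_co2_per_kwh=0.477):
    """Back-of-the-envelope CO2 (kg) for a training run.

    pue: Power Usage Effectiveness, the data-center overhead multiplier.
    kg_co2_per_kwh: grid carbon intensity (an approximate US-average value).
    """
    kwh = gpu_count * watts_per_gpu * hours / 1000 * pue
    return kwh * kg_co2_per_kwh

# e.g. a hypothetical 8-GPU run for two weeks:
print(f"{training_co2_kg(8, 24 * 14):.0f} kg CO2")
```

Even a crude estimate like this, reported alongside results, would give readers the fuller picture the researchers recommend.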

Deep learning is data hungry

“In a world with infinite data, and infinite computational resources, there might be little need for any other technique,” cognitive scientist Gary Marcus says in his paper “Deep Learning: A Critical Appraisal.”

And therein lies the problem, because we don’t live in such a world.

You can never give every possible labelled sample of a problem space to a deep learning algorithm. Therefore, it will have to generalize or interpolate between its previous samples in order to classify data it has never seen before such as a new image or sound that’s not contained in its dataset.

“Deep learning currently lacks a mechanism for learning abstractions through explicit, verbal definition, and works best when there are thousands, millions or even billions of training examples,” says Marcus.

So what happens when a deep learning algorithm doesn’t have enough quality training data? It can fail spectacularly, mistaking a rifle for a helicopter, for example, or humans for gorillas.

The heavy reliance on precise and abundant data also makes deep learning algorithms vulnerable to spoofing. “Deep learning systems are quite good at some large fraction of a given domain, yet easily fooled,” Marcus says.

Testament to this are many strange stories, such as deep learning algorithms mistaking slightly defaced stop signs for speed limit signs, or British police software being unable to distinguish sand dunes from nudes.

“Deep learning is not likely to disappear, nor should it,” Marcus says. “But five years into the field’s resurgence seems like a good moment for a critical reflection, on what deep learning has and has not been able to achieve.”

Despite the huge contributions of deep learning to the field of artificial intelligence, there’s something very wrong with it: It requires huge amounts of data. This is one thing that both the pioneers and critics of deep learning agree on. In fact, deep learning didn’t emerge as the leading AI technique until a few years ago because of the limited availability of useful data and the shortage of computing power to process that data.

Reducing the data-dependency of deep learning is currently among the top priorities of AI researchers.

In his keynote speech at the AAAI conference, computer scientist Yann LeCun discussed the limits of current deep learning techniques and presented the blueprint for “self-supervised learning,” his roadmap to solve deep learning’s data problem. LeCun is one of the godfathers of deep learning and the inventor of convolutional neural networks (CNN), one of the key elements that have spurred a revolution in artificial intelligence in the past decade.
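To give a flavor of the self-supervised idea (deriving the training signal from the data itself by hiding part of the input), here is a minimal sketch: a linear model learns to reconstruct a masked value in a sequence from its visible neighbors, with no human-provided labels anywhere. This illustrates the principle only; it is not LeCun’s actual proposal:

```python
import numpy as np

rng = np.random.default_rng(1)

# Unlabeled "raw" data: smooth sequences in which each value is roughly
# the average of its neighbors. Each row is a window of 3 consecutive
# samples of a sine wave at a random phase.
t = rng.uniform(0, 2 * np.pi, (5000, 1))
seqs = np.sin(t + np.array([0.0, 0.1, 0.2]))

# The pretext task: hide the middle value and predict it from context.
# The masked value itself serves as a free "label".
context = seqs[:, [0, 2]]   # visible neighbors
target = seqs[:, 1]         # masked middle value

# Least-squares fit: predict the masked value from its context.
w, *_ = np.linalg.lstsq(context, target, rcond=None)

pred = context @ w
print(f"mean reconstruction error: {np.abs(pred - target).mean():.6f}")
```

The model never sees a human annotation, yet learns structure in the data, which is the essence of the self-supervised recipe.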


Self-supervised learning is one of several plans to create data-efficient artificial intelligence systems. At this point, it’s really hard to predict which technique will succeed in creating the next AI revolution (or if we’ll end up adopting a totally different strategy).

Data-centric AI

Think of a data-centric AI system as programming with a focus on data instead of code.

“Instead of focusing on the code, companies should focus on developing systematic engineering practices for improving data in ways that are reliable, efficient, and systematic. In other words, companies need to move from a model-centric approach to a data-centric approach.” — Andrew Ng, CEO and Founder of LandingAI

A data-centric AI approach involves building AI systems with quality data — with a focus on ensuring that the data clearly conveys what the AI must learn. Doing so helps teams reach the performance level required and removes unnecessary trial-and-error time spent on improving the model without changing inconsistent data.

In a data-centric approach, you spend relatively more of your time labeling, managing, slicing, augmenting, and curating the data, with the model itself remaining relatively more fixed.

The tectonic shift to a data-centric approach is as much a shift in the focus and culture of the machine-learning community as it is a technological or methodological one.

The key idea is that data is the primary arbiter of success or failure and is, therefore, the key focus of iterative development. It is important to note that this is not an either/or binary between data-centric and model-centric approaches. Successful AI requires both well-conceived models and good data.
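As a concrete illustration of that iterative, data-first loop, the sketch below holds a deliberately simple model fixed and improves only the labels. The synthetic dataset, the noise level, and the neighbor-vote heuristic for flagging suspect labels are all illustrative assumptions, and the human “review” step is simulated by restoring the true label:

```python
import numpy as np

rng = np.random.default_rng(2)

def knn_predict(X_train, y_train, X_test, k=1):
    # The fixed "model": k-nearest-neighbor majority vote. Its behavior
    # depends directly on label quality, which is the point here.
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :k]
    return (y_train[idx].mean(axis=1) > 0.5).astype(int)

# Two well-separated clusters; 20% of the training labels are wrong.
X_train = np.concatenate([rng.normal(0, 1, (500, 2)), rng.normal(5, 1, (500, 2))])
y_true = np.array([0] * 500 + [1] * 500)
y_noisy = y_true.copy()
flip = rng.choice(1000, 200, replace=False)
y_noisy[flip] ^= 1

X_test = np.concatenate([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
y_test = np.array([0] * 200 + [1] * 200)

# Model-centric baseline: the fixed model trained on noisy labels as-is.
acc_before = (knn_predict(X_train, y_noisy, X_test) == y_test).mean()

# Data-centric pass: flag training examples whose label disagrees with
# the majority of their nearest neighbors, "review" them (simulated by
# restoring the true label), and rerun the very same model.
maj = knn_predict(X_train, y_noisy, X_train, k=11)
suspect = maj != y_noisy
y_clean = y_noisy.copy()
y_clean[suspect] = y_true[suspect]
acc_after = (knn_predict(X_train, y_clean, X_test) == y_test).mean()

print(f"test accuracy with noisy labels:   {acc_before:.3f}")
print(f"test accuracy after label cleanup: {acc_after:.3f}")
```

The model code never changes between the two runs; the entire improvement comes from iterating on the data, which is exactly the shift in effort the data-centric view argues for.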

Data in Deployment

From the standpoint of data-centric AI, we can even make the case that the data itself is the largest potential source of technical debt in an ML system. There are two ways to see that this is true. The first is to look at the overall components of a typical production-level ML system.

It is useful to note that the ML code – the bit that we tend to think of as the cool part – is actually a small component of the overall system, maybe five percent or less in terms of overall code. Things like data collection, data verification, and feature extraction all form much larger parts of the overall system, and are all obviously at the heart of a data-centric approach.

But even typical Serving Infrastructure that helps to deploy the model to make predictions within the context of a live system will require an extensive data pipeline to ensure that all relevant information is provided to the model at prediction time.
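A minimal sketch of that idea: the same feature-pipeline object is shared between training and serving, so prediction-time inputs go through exactly the transformations the model was trained on. The class, the feature names, and the feature-store interface below are all hypothetical:

```python
from dataclasses import dataclass

@dataclass
class FeaturePipeline:
    """Shared by training and serving to avoid training/serving skew."""
    price_mean: float
    price_std: float

    def transform(self, raw: dict) -> list:
        # The same normalization and encoding must run in both
        # environments, or the model sees inputs it never trained on.
        price = (raw["price"] - self.price_mean) / self.price_std
        is_weekend = 1.0 if raw["day_of_week"] in (5, 6) else 0.0
        return [price, is_weekend]

def serve_prediction(model, pipeline, user_id, feature_store):
    # At prediction time the serving path pulls fresh raw data from a
    # feature store, applies the training-time transformations, and only
    # then asks the model for a prediction.
    raw = feature_store.lookup(user_id)
    return model.predict(pipeline.transform(raw))
```

Keeping the transformation logic in one shared object rather than duplicating it across training and serving code is one common way to keep the two paths from drifting apart.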

And if we consider Monitoring, any ML Ops engineer worth their salt will make sure that monitoring data distributions is a top priority. Overall, this means that something like 70% of our overall system complexity is tied to data – processing, handling, and monitoring – and that these tasks can bridge multiple systems or subsystems. No wonder it is a major source of technical debt.
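One common way to monitor those data distributions is the Population Stability Index (PSI), which compares production inputs against the training distribution bucket by bucket. A sketch follows; the 10-bucket layout and the usual alert thresholds (around 0.1 and 0.2) are rules of thumb, not standards:

```python
import numpy as np

def psi(expected, observed, bins=10):
    """Population Stability Index between two 1-D samples.

    Buckets are quantiles of the expected (training-time) sample, so
    each bucket holds ~1/bins of the reference distribution.
    """
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # catch out-of-range values
    e = np.histogram(expected, edges)[0] / len(expected)
    o = np.histogram(observed, edges)[0] / len(observed)
    e, o = np.clip(e, 1e-6, None), np.clip(o, 1e-6, None)
    return float(np.sum((o - e) * np.log(o / e)))

rng = np.random.default_rng(3)
train_feature = rng.normal(0, 1, 10000)
stable_live = rng.normal(0, 1, 10000)       # same distribution as training
drifted_live = rng.normal(0.5, 1.2, 10000)  # the live input has shifted

print(f"PSI, stable feed:  {psi(train_feature, stable_live):.3f}")
print(f"PSI, drifted feed: {psi(train_feature, drifted_live):.3f}")
```

A near-zero PSI says the live feed still looks like the training data; a clearly elevated value is the signal an ML Ops engineer would alert on.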


The second, perhaps even more important, way to see that data can be a source of unexpectedly large technical debt is this: the data defines the behavior of our models, and in that way takes on the role of code.

If we want a vision model to do a good job at identifying insects, or an audio model to recognize a verbal command, or a movie recommendation model to help a user pick entertainment for a Friday evening, the abilities and behaviors of our model will be defined primarily by what data we collect and use for training.


The traditional model-centric approach to ML has been tremendously successful and has brought the field to a place in which the models themselves are ever more downloadable, commoditized, and, above all, widely accessible.

But the newer, powerful deep learning models are now so data-hungry that not only have datasets and the manual labeling of training data become unwieldy; there are also diminishing returns in how much progress can be made by iterating only on the model. The answer to pushing AI forward now and over the coming years can be found in a data-centric approach.

John W. Tukey was among the first “Data Science” thought leaders, pondering the importance of data in The Future of Data Analysis as early as 1962. Data Science Journal, dedicated to publishing papers on “the management of data and databases in Science and Technology,” was launched in the early 2000s, when Big Data hit the scene and the role of Data Scientist was essentially born.

As the field of ML progresses, successful AI will continue to involve both well-built models and well-engineered data. But because of the sophistication of today’s models, the biggest returns moving forward will emerge from approaches that prioritize the data. And if data is increasingly the key arbiter of success or failure, data has to be the focus of iterative development moving forward.

AI and ML hold great promise in just about every industry sector, with lofty expectations to deliver lower costs, enhanced precision, improved customer experiences, and plenty of innovation. But there are major barriers, and at the top of the list is data.

Experts estimate that 31 percent of ML projects die because of a lack of access to production-ready data.

I’m glad that machine learning luminaries like Andrew Ng are bringing attention to the importance of systematic work on data. In the next few years, we will witness a new paradigm of data-centric machine learning infrastructure. Where the 2010s were about improving models, the 2020s are going to be all about data.

At this point in the evolution of AI, the success and future of real-world AI applications hinge on the quality of data. While ML/AI hardware infrastructure and modeling have made significant progress, it is time to move AI’s center of mass toward data. A campaign was needed to shift the community’s thinking, and no one is better placed than Andrew Ng to lead it.

Just as a human’s diet is critical to health and performance, how we fuel AI models is crucial to their success. The quality of data has a huge impact on how well the whole system works: learning demands good and relevant data, and inference demands recognizing new data.

The excitement around data-centric AI should galvanize researchers and companies to focus more attention on building data-centric tools and frameworks, enabling a systematic and principled approach to data excellence by improving data quality throughout the lifecycle of a machine learning project.