My name is Russell Kaplan, and I’m Head of Nucleus at Scale AI. Nucleus is a dataset management product that helps teams improve their machine learning models by improving their datasets.

In this article, I’ll talk about data-centric AI in practice and what we've learned at Scale AI, working with hundreds of computer vision teams, about what it takes to make machine learning work well in production.


Let’s dive in 👇

What is data-centric AI?

The concept of data-centric AI has gained a lot of traction recently. It's the observation that making machine learning and computer vision work well in production involves much more than iterating on the models and the code.

In fact, we're increasingly focused on iterating on the dataset. Data-centric AI is the discipline of systematically engineering the data used to build an AI system.

Over time, machine learning architectures, hyperparameter optimization schedules, and learning rates have become increasingly well explored and figured out. We have a good set of foundational building blocks that usually work when you plug them into a new machine learning problem. So production teams are shifting their investment and focus from model development to data development.

Challenges?

An important part of why this is necessary is that most machine learning problems we encounter in the real world are long-tail distributed.

As an example, the Berkeley DeepDrive dataset is a public dataset for self-driving cars, and it shows a classic distribution we see in all types of computer vision applications: the most common classes appear very frequently, and there's an exponential drop-off in frequency as we get to rarer and rarer classes (edge cases).

This is one of the things that makes machine learning in production so challenging: we have to perform well on all of these edge cases, yet they are inconsistently represented in the data we collect.

This has important implications for every part of the machine learning lifecycle. Not all data is created equal. When you're prioritizing what to annotate, would you rather annotate an easy, common scene or a rare edge case? The common scene might be easier for the annotator, but as the dataset builder, the edge case is much more valuable. How to label cars inside of other cars, for example, is the kind of edge case that makes real-world computer vision so challenging.

We can do this in a targeted, manual way, but we can also prioritize annotation data systematically. There's an entire area of literature called ‘active learning’ focused on figuring out strategies, given a model or set of models, to find the right unlabeled data that, when annotated, will drive the biggest model performance improvement. A common plot in this literature puts an annotation budget on the x-axis and measures, on the y-axis, how much accuracy you can get under that budget. The baseline is always random sampling: just collect some data, send it off for annotation, and train the model.
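
To make the idea concrete, here's a minimal sketch of one common active-learning strategy, least-confidence sampling. It assumes you already have a model that outputs class probabilities for an unlabeled pool; the function names and toy numbers are purely illustrative.

```python
import numpy as np

def least_confidence_scores(probs: np.ndarray) -> np.ndarray:
    """Score each unlabeled example by how unsure the model is.

    probs: (num_examples, num_classes) softmax outputs from the current model.
    Higher score = more uncertain = more valuable to annotate first.
    """
    top_class_confidence = probs.max(axis=1)
    return 1.0 - top_class_confidence

def select_for_annotation(probs: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of the `budget` most uncertain examples."""
    scores = least_confidence_scores(probs)
    return np.argsort(-scores)[:budget]

# Example: fake softmax outputs for a pool of 5 unlabeled images.
pool_probs = np.array([
    [0.98, 0.01, 0.01],   # model is confident -> low priority
    [0.40, 0.35, 0.25],   # model is unsure    -> high priority
    [0.55, 0.30, 0.15],
    [0.90, 0.05, 0.05],
    [0.34, 0.33, 0.33],   # nearly uniform     -> highest priority
])
print(select_for_annotation(pool_probs, budget=2))  # -> [4 1]
```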

Labeling

Data curation is not the only problem in adopting data-centric AI. The reality is that not all data is well labeled. This is one of the most understudied problems in the academic literature, because research relies on standard benchmark datasets to compare results. Those results tend to assume the data is annotated correctly, and all evaluations are with respect to those annotations. However, the annotations aren't entirely correct: it's estimated that around 6% of the ImageNet validation set is mislabeled, and there are similar figures for many other public datasets.

What makes this problem worse is that labeling errors are most common at the very tails of the data distribution. In those edge cases, we're not really sure what to do according to some ambiguous set of instructions: should we label that image as a single truck, as a car, or as a set of cars?

Cleaning bad labels is one of the most common ways we see teams systematically drive model performance improvements. Again, it's hard to study in an academic setting, because the test set also has bad labels; sometimes cleaning the labels in the training set makes your performance on the test set worse, because you start to under-fit the annotation errors. In a real-world setting, though, we don't care about that; we care about actual performance and actual results.
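
As a concrete illustration (a simple heuristic, not the specific process described here), one way to surface likely label errors is to flag annotated examples where a cross-validated model puts very low probability on the given label:

```python
import numpy as np

def flag_suspect_labels(probs: np.ndarray, labels: np.ndarray,
                        threshold: float = 0.2) -> np.ndarray:
    """Flag examples whose annotated label looks inconsistent with the model.

    probs:  (n, num_classes) predicted probabilities from a held-out or
            cross-validated model (not one trained directly on these labels,
            or it may have memorized the errors).
    labels: (n,) integer class labels as annotated.
    Returns indices where the model puts less than `threshold` probability
    on the annotated class -- candidates for re-review.
    """
    prob_of_given_label = probs[np.arange(len(labels)), labels]
    return np.where(prob_of_given_label < threshold)[0]

# Toy example: three annotated images, two classes (0 = car, 1 = truck).
probs = np.array([[0.90, 0.10],
                  [0.20, 0.80],
                  [0.05, 0.95]])
labels = np.array([0, 1, 0])               # the third label disagrees with the model
print(flag_suspect_labels(probs, labels))  # -> [2]
```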

There's an interesting relationship between the labeling accuracy of a dataset and the value of that dataset. You can derive mathematically that there's a roughly quadratic relationship between the error rate and the value of a data point: as the annotation error rate goes up linearly, model performance drops roughly quadratically, which is one of the reasons it's so valuable to invest in high-quality annotation.
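
The exact shape depends on the task, but you can check the trend empirically with a quick toy experiment: inject increasing amounts of synthetic label noise into a training set and evaluate on clean labels. The sketch below uses scikit-learn and is only a back-of-the-envelope illustration, not the derivation referenced above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy experiment: how does training-label noise degrade clean test accuracy?
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for error_rate in [0.0, 0.1, 0.2, 0.3]:
    noisy = y_train.copy()
    flip = np.random.RandomState(0).rand(len(noisy)) < error_rate
    noisy[flip] = 1 - noisy[flip]              # flip a fraction of the binary labels
    model = LogisticRegression(max_iter=1000).fit(X_train, noisy)
    acc = model.score(X_test, y_test)          # evaluate against clean labels
    print(f"label error rate {error_rate:.0%} -> clean test accuracy {acc:.3f}")
```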

Building the dataset and making sure it's well curated and well labeled is just the start; it gives you the inputs needed to build your first model. Under data-centric AI, we increasingly think of deployment not as an event but as a process. This is an evolving system, and we're trying to get the best possible accuracy as quickly as possible.

Measuring ML team progress

Given all this, we need some way to measure ML team progress. As a team, how are we doing at building a data-centric AI workflow, improving our velocity, and our time to production?

In brainstorming ways to measure this, I thought of another type of progress measurement that's much broader than machine learning: the Kardashev scale, a way to measure the progress of civilizations. There's a loose analogy for ML team productivity. There are three buckets of teams at different stages of their AI journey: Type 1, Type 2, and Type 3 ML teams.

Deployment velocity

The first axis of measurement is deployment velocity. A Type 1 ML team will often train a model ad hoc, deploy it, see how it does, and settle into a monthly or ad hoc cadence for shipping improvements to those models. It happens with dedicated focus and dedicated input, but it's not on a rigorous operational schedule.

As infrastructure and data practices improve and the team grows, a richer operational cadence emerges, and you're deploying weekly as a Type 2 ML team. Type 3 teams, with the most sophisticated infrastructure and the furthest along in building out their machine learning stack, tend to do continuous improvement with no intervention needed: always retraining, always getting better, always collecting new data.

Data curation

Another clear axis is data curation. We all start with random sampling because it's a simple baseline and gets you somewhere. Very quickly, though, when machine learning models get deployed into the real world, we learn that random sampling is not good enough: we get bug reports from customers and complaints from our bosses.

To fix those bugs, we need to do targeted dataset improvements and mine the edge cases where we're currently struggling the most. Type 2 ML teams have a data collection process to manually improve curation on targeted edge cases.

The most sophisticated ML teams do this in a highly leveraged way: they can take quick feedback and turn it into large curation campaigns to systematically improve performance on an edge case and make sure it doesn't regress in the future, while also using automated active learning so the pipeline of incoming data gets prioritized according to what will be most valuable.

Annotation quality

The third axis is annotation quality. When teams start out, they usually don't have a QA process for annotation. As teams realize the dramatic effect of very high-quality data on model performance, they start to implement their own QA or use a service that provides it. This kind of QA can significantly improve performance, especially on edge cases, where labels are most likely to be challenging.

The Type 3 ML teams do this with their models fully in the loop. In fact, your machine learning model knows a lot about whether a label is likely to be correct or incorrect.

Where do the improvements in a machine learning stack come from?

In the beginning, it's the machine learning scientists who drive those improvements. They're running the experiments, collecting the data, training the models, and getting things off the ground.

As teams grow, processes mature, and the data becomes more and more central to the improvement lifecycle, machine learning engineers work hand in hand with data PMs and data teams who specialize in collection, quality, and curation to make sure performance improvements are driven in the areas where they're needed most.

Once the infrastructure is good enough, your entire ML engineering team should theoretically be able to take a vacation. Andrej Karpathy famously called this ‘operation vacation’: as long as the labelers keep labeling and the system stays running, the team should be able to come back to a higher-performing model. This is a useful North Star for infrastructure and processes as machine learning teams continue to grow and develop.
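
A schematic of what that loop looks like in code is sketched below. Every function here is a toy stub standing in for real data collection, labeling, retraining, evaluation, and deployment infrastructure; swap them for your own pipeline.

```python
import random

def collect_new_data():
    return [random.random() for _ in range(100)]     # fresh production data

def prioritize(pool, model_version):
    return sorted(pool)[:10]                         # e.g. active learning / auto tags

def send_for_annotation(batch):
    return [(x, x > 0.5) for x in batch]             # labeling + QA

def retrain(model_version, labeled):
    return model_version + 1                         # pretend a model is just a version number

def evaluate(model_version):
    return min(0.99, 0.80 + 0.01 * model_version)    # stand-in for a validation metric

def deploy(model_version):
    print(f"deployed model v{model_version}")

def continuous_improvement_loop(current=0, rounds=3):
    for _ in range(rounds):                          # in production this runs continuously
        labeled = send_for_annotation(prioritize(collect_new_data(), current))
        candidate = retrain(current, labeled)
        if evaluate(candidate) > evaluate(current):  # ship only when metrics improve
            deploy(candidate)
            current = candidate
    return current

continuous_improvement_loop()
```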

Scale AI and Nucleus

These observations have motivated how we build our product suite at Scale AI, because we see teams face these challenges again and again. The product I work on, Nucleus, is a dataset management platform designed to help with two things. The first is figuring out where in the data distribution your models are struggling the most: you need to know the qualitative failure modes of your machine learning models to have strong performance guarantees.

Once you can find where your model is struggling, the next important part is fixing the problem. We see two common paths: better data curation and better labels. Nucleus helps you do both.

Use case: Berkeley DeepDrive

Imagine a machine learning engineer getting reports that safety drivers are intervening frequently when there are trucks at nighttime. Obviously, something in the stack isn't working, and we need to debug what's going wrong.

The first step is to look at the data distribution: do we have enough truck annotations in the first place? The Berkeley DeepDrive distribution is quite skewed, but we have around 34,000 truck annotations. That should be enough to learn what a truck looks like.
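
That first check is trivial to run once annotations are loaded. Assuming annotations are available as records with a label field (the storage format will differ per team), a quick count looks like this:

```python
from collections import Counter

# Quick class-distribution check before debugging the model:
# are trucks even well represented in the annotations?
annotations = [
    {"image_id": "img_001", "label": "car"},
    {"image_id": "img_001", "label": "truck"},
    {"image_id": "img_002", "label": "truck"},
    {"image_id": "img_003", "label": "pedestrian"},
]

counts = Counter(ann["label"] for ann in annotations)
for label, count in counts.most_common():
    print(f"{label:12s} {count}")
```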

All the plots in Nucleus are fully interactive, so you can always go from aggregates to specifics. One of the core parts of Nucleus is a flexible querying engine that ingests arbitrary metadata, model prediction data, and annotation data and makes it all searchable. In our case, it makes it possible to see all the images with at least one truck that were taken at night.
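
Outside of Nucleus, the equivalent query is easy to express in plain Python once each image carries its metadata and annotations. The record layout below is an illustrative assumption, not the Nucleus data model:

```python
# "All images with at least one truck, taken at night."
images = [
    {"id": "img_001", "metadata": {"time_of_day": "night"},
     "annotations": [{"label": "truck"}, {"label": "car"}]},
    {"id": "img_002", "metadata": {"time_of_day": "day"},
     "annotations": [{"label": "truck"}]},
]

night_trucks = [
    img for img in images
    if img["metadata"]["time_of_day"] == "night"
    and any(ann["label"] == "truck" for ann in img["annotations"])
]
print([img["id"] for img in night_trucks])   # -> ['img_001']
```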

We're interested in debugging our machine learning model, not just looking at the ground truth annotations. Fortunately, Nucleus is a developer product: via the API, it's easy to add your own machine learning model predictions alongside your ground truth. Once we add our model predictions, we can see them in the interface and filter on them alongside the ground truth.

We also get automatically generated metrics that tell us where our predictions and ground truth agree and disagree. But again, it's not enough to get these aggregate metrics; in data-centric AI you have to dive into the specifics to make your data better.

Let's look at the details. In the confusion matrix, for example, we can see that trucks in the ground truth are confused with cars 27% of the time. Finding qualitative examples where the model is failing is critical to improving the performance of a computer vision system.
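
For reference, a row-normalized confusion matrix of this kind can be computed in a few lines once ground-truth boxes have been matched to predictions (the IoU matching step is omitted here; the labels below are toy data):

```python
import numpy as np

classes = ["car", "truck", "bus"]
gt      = ["truck", "truck", "truck", "truck", "car", "bus"]   # matched ground truth classes
pred    = ["car",   "truck", "truck", "truck", "car", "bus"]   # corresponding predicted classes

idx = {c: i for i, c in enumerate(classes)}
matrix = np.zeros((len(classes), len(classes)))
for g, p in zip(gt, pred):
    matrix[idx[g], idx[p]] += 1

# Row i = ground truth class i, column j = predicted class j, normalized per row.
row_norm = matrix / matrix.sum(axis=1, keepdims=True)
print(row_norm)
```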

We can also debug our data with our ML models: instead of figuring out where the model is wrong, we can use the model to figure out where the data is wrong. For example, we can query for all of the annotated images in the dataset where, according to the model, there's a false positive: no ground truth, but a prediction. One thing that stands out is that these false positives really look like cars, so why are they considered false positives? If we hide the model predictions, all the annotations look good, except that one annotation is missing. It's affecting our training process, but it's actually an annotation mistake, not a model mistake. In Nucleus, you can easily flag this for review as an annotation mistake.
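
One simple way to surface these cases programmatically is to look for confident predictions that overlap no ground-truth box; such "false positives" are often missing annotations. The box format and thresholds below are assumptions for illustration:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def candidate_missing_annotations(predictions, ground_truth,
                                  conf_threshold=0.8, iou_threshold=0.3):
    """Confident predictions that overlap no ground-truth box.

    These are good candidates for label review: often the model is right
    and the annotation is missing.
    """
    return [
        pred for pred in predictions
        if pred["confidence"] >= conf_threshold
        and all(iou(pred["box"], gt["box"]) < iou_threshold for gt in ground_truth)
    ]

preds = [{"label": "car", "confidence": 0.95, "box": (10, 10, 50, 50)}]
gts   = [{"label": "car", "box": (200, 200, 260, 260)}]
print(candidate_missing_annotations(preds, gts))   # the prediction gets flagged
```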

Debugging

Being able to do this bulk data cleaning with your model in the loop, and figuring out where in the data distribution your annotations went wrong, is huge for driving performance improvements, especially in the long tail. Once you find those issues, you have to actually fix them.

One thing you can do is run a natural language search. Instead of relying on structured metadata or annotations, we can search the raw images with natural language. Under the hood, this is powered by CLIP embeddings, which produce aligned vectors for language and images: when those vectors are nearby, it's likely that they correspond to the same underlying concept. It doesn't depend on annotations being present; we're just looking at the raw pixels of the image and the natural language query.
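
Here's a small sketch of the same idea using the open-source CLIP model via Hugging Face transformers. The image paths are placeholders, and Nucleus' own implementation may differ:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = ["frame_001.jpg", "frame_002.jpg"]   # replace with your own image files
images = [Image.open(p) for p in image_paths]
query = "a police car at night"

with torch.no_grad():
    image_inputs = processor(images=images, return_tensors="pt")
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    image_emb = model.get_image_features(**image_inputs)
    text_emb = model.get_text_features(**text_inputs)

# Cosine similarity between the query and every image; higher = better match.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.T).squeeze(1)

ranked = sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1])
print(ranked)
```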

Auto tag

In Nucleus, you can start with a single image or set of images and create what's called an ‘auto tag’. Think of this as targeted data curation on a specific class of interest, where in a few clicks we train a binary classifier inside Nucleus to recognize, for example, police cars versus non-police cars.

Creating a police car tag with a few seed images kicks off training a machine learning model that then sorts the dataset according to what it thinks will be most informative for me to label. One of the results, though, is a white handicap bus with a blue stripe, which is not a police car. By skipping that example, we give an explicit hard-negative label of what we mean by ‘police car’ versus not.

When I hit ‘refine auto tag’, Nucleus takes, in real time, the examples I labeled as positive and the examples I skipped as negative, retrains the model, and re-runs it on the dataset to surface another set of images that will be informative for me to label. With just two rounds of this, you end up with surprisingly high-precision binary classifiers that you can use to prioritize unlabeled data for annotation.

After a few minutes, when we hit ‘commit auto tag’, a police car likeness score from -1 to 1 is written onto every item in the dataset, for all the data that's there now and in the future. With primitives like this, we can do targeted mining of the long tail: the classes and scenarios we're struggling with. Even if we can't articulate what we're looking for in rigorous taxonomy definitions, we can show a few examples and improve it.
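
Conceptually, an auto tag behaves like a small binary classifier trained on image embeddings from a few positives and hard negatives, producing a likeness score in [-1, 1]. The sketch below illustrates that workflow with scikit-learn and random stand-in embeddings; it is not Nucleus' internal implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 512))      # stand-in for real image embeddings (e.g. CLIP)

seed_positive_idx = [3, 17, 42]                # images you tagged as "police car"
hard_negative_idx = [7, 99]                    # e.g. the white bus you skipped

X = embeddings[seed_positive_idx + hard_negative_idx]
y = np.array([1] * len(seed_positive_idx) + [0] * len(hard_negative_idx))

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Map probability of "police car" into a [-1, 1] likeness score for every image.
likeness = 2 * clf.predict_proba(embeddings)[:, 1] - 1

# The most uncertain images (score near 0) are the most informative to review next.
to_review_next = np.argsort(np.abs(likeness))[:10]
print(to_review_next)
```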

For an auto tag that's already committed, we can run a search: for example, show the cases where our police car auto tag scores greater than zero. We can then refine the query to only the images that haven't been annotated yet, because we're trying to prioritize what to annotate, filter it down, and create a subset of our data to send off for annotation.

These are some of the processes we see teams adopting to improve performance on targeted scenarios, and to develop a more data-centric workflow.

A little about me

I joined Scale AI about a year and a half ago when they acquired my computer vision startup called Helia.

My previous role was at Tesla, as a machine learning scientist on the autopilot team responsible for the core vision neural network.

Before that, I was also a researcher in the Stanford vision lab working with Dr. Fei-Fei Li.

This article was originally published as a presentation at the Computer Vision Summit, San Jose, in April 2022. The talk was by Russell Kaplan, Head of Nucleus at Scale AI.