What is Data Curation and Do I Need It?

Accurately annotating your data is a key success factor in training high-performing machine learning models. Unfortunately, it’s also notoriously difficult, time-consuming, and expensive. If you have hundreds of thousands — or even hundreds of millions — of data points to label, how should you go about picking what to label, and in which order?

The good news is that there are techniques for curating your data efficiently. They free up your ML engineers to do what adds the most value: building high-performance models and integrating them into your business workflows.

To empower them to succeed, all you need is the right tools and a more systematic approach to data annotation.

Long-term consequences of not curating your data

One of the most common consequences of not curating training data is a model that generalizes poorly to conditions it has not seen during training.

Here’s an extreme example: let’s say you’re training a model to detect pedestrians captured by crosswalk cameras. You feed your model with images of thousands of pedestrians crossing over the course of many days, but all at the same intersection.

Even if your labels are extremely precise, your model won’t perform well when attempting to detect pedestrians crossing a different street, or from a different angle — let alone in the rain or snow.

To avoid situations like this, some organizations will manually filter their data, picking the subsets that they believe to be most relevant. While this is a step toward a more measured data annotation strategy, it adds cost and effort to an already labor-intensive data preparation process. Even if manual sorting could fit into a development schedule, human error and a misunderstanding of model training requirements could dilute your training data pool with improper or redundant examples.

In short: with manual data curation, there’s no guarantee that the data you select will in fact be the most valuable input for your model.


Defining your filtering goal

The first step to curating training data is investigating where your ML model falls short and setting training goals based on business needs. Look at your end-user experience: where is it failing? What is the opportunity cost of fixing it?

For instance, on one end of the spectrum, your model might require an annotated training set that is representative of the entire body of data because your model is failing to perform across a variety of jobs.

“Our model is performing so poorly that our users will not trust it. We need to raise that baseline, so let’s collect more data.”
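When the goal is a training set that represents the whole body of data, one simple option is to sample evenly across a metadata field rather than at random. The sketch below assumes each item carries such a field. The `intersection` key, the `stratified_sample` helper, and the toy frames are hypothetical names chosen for illustration, echoing the crosswalk example above.

```python
import random
from collections import defaultdict

def stratified_sample(items, key, per_group, seed=0):
    """Pick up to `per_group` items from each group defined by `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    selected = []
    for group_key in sorted(groups):
        members = groups[group_key]
        rng.shuffle(members)  # random choice within the group
        selected.extend(members[:per_group])
    return selected

# Hypothetical metadata: 1,000 frames from one intersection, 20 from another.
frames = (
    [{"id": i, "intersection": "A"} for i in range(1000)]
    + [{"id": 1000 + i, "intersection": "B"} for i in range(20)]
)
subset = stratified_sample(frames, key=lambda f: f["intersection"], per_group=10)
```

A purely random draw of 20 frames would almost always be dominated by intersection A. Sampling per group gives the rarer intersection equal weight in the batch sent for annotation.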

The inverse of this situation might also be true. Perhaps your model is performing well in general, but you have noticed that it often fails under specific circumstances. In this case, your training goal should be to increase data examples that contain information relevant to the ML model’s weak areas.

“My model is doing poorly on some classes that are very important to my end user. I want to find data as close as possible to those classes, annotate it, and use it to re-train my model.”
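One common way to find data "close to" a weak class is to compare embeddings: rank unlabeled items by their similarity to examples the model is known to get wrong. The sketch below assumes you already have an embedding vector for every item. The helper names, item ids, and two-dimensional vectors are made up for illustration, and cosine similarity is just one reasonable choice of distance.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest_unlabeled(failure_embeddings, unlabeled, top_k):
    """Rank unlabeled items by their best similarity to any known failure case."""
    scored = []
    for item_id, embedding in unlabeled.items():
        best = max(cosine_similarity(embedding, f) for f in failure_embeddings)
        scored.append((best, item_id))
    # Highest similarity first; ties broken by id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in scored[:top_k]]

# Hypothetical embeddings of examples the model currently gets wrong.
failure_embeddings = [[1.0, 0.0], [0.9, 0.1]]
# Hypothetical embeddings of the unlabeled pool.
unlabeled = {
    "img_001": [0.95, 0.05],
    "img_002": [0.0, 1.0],
    "img_003": [0.7, 0.3],
    "img_004": [-1.0, 0.0],
}
to_annotate = nearest_unlabeled(failure_embeddings, unlabeled, top_k=2)
# to_annotate == ["img_001", "img_003"]
```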

Countless training goals fall somewhere in between; they are organization-specific and will change over time as your model develops.


You need seamless integration between tasks

With your filtering goal in place, you need seamless integration between different tasks to allow model development and iteration to move quickly and efficiently.

Let’s consider what this might look like:

  • You’ve set your filtering goal, and have a system in place to allow you to efficiently curate your data accordingly.
  • You send this subset of your data for annotation, and leverage it to re-train your model.
  • With this new data in play, you use the predictions of your model to inform the next iteration of filtering, annotation, and re-training.
  • Rinse and repeat. Tweak your filtering goal as needed.
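The steps above can be sketched as a small loop. This is a minimal illustration, not a prescribed implementation: it assumes the filtering goal is "label what the model is least confident about", and `score_fn`, `annotate_fn`, and `retrain_fn` are hypothetical callables standing in for your model, your annotation workflow, and your training job.

```python
def select_least_confident(confidences, budget):
    """Return ids of the `budget` items the model is least sure about."""
    ranked = sorted(confidences, key=lambda item_id: (confidences[item_id], item_id))
    return ranked[:budget]

def curation_loop(unlabeled_ids, score_fn, annotate_fn, retrain_fn, budget, rounds):
    """Filter -> annotate -> retrain, repeated for a fixed number of rounds."""
    pool = set(unlabeled_ids)
    labeled = {}
    for _ in range(rounds):
        if not pool:
            break
        # 1. Filter: score the remaining pool with the current model.
        confidences = {item_id: score_fn(item_id, labeled) for item_id in pool}
        batch = select_least_confident(confidences, budget)
        # 2. Annotate only the selected subset.
        for item_id in batch:
            labeled[item_id] = annotate_fn(item_id)
        pool.difference_update(batch)
        # 3. Re-train so the next round's scores reflect the new labels.
        retrain_fn(labeled)
    return labeled

# Toy stand-ins: fixed confidences, a constant label, and a no-op "training" step.
toy_confidence = {"a": 0.9, "b": 0.2, "c": 0.5, "d": 0.4}
history = []
labeled = curation_loop(
    unlabeled_ids=toy_confidence,
    score_fn=lambda item_id, labeled: toy_confidence[item_id],
    annotate_fn=lambda item_id: "pedestrian",
    retrain_fn=lambda labeled: history.append(len(labeled)),
    budget=1,
    rounds=2,
)
# Items "b" and "d" (the two lowest confidences) are labeled; "a" and "c" are not.
```

Changing the filtering goal means swapping out the selection function; the rest of the loop stays the same.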

Coupled with a strong data filtering goal, a tight feedback loop will ensure you are never annotating images that are not aligned with your current model improvement strategy. You’ll be set up to label your data in a very strategic, almost surgical way.

Put another way, you’re not paying for labels that will not help your model reach its peak performance.

Not only are you targeting the data that is most helpful to your model, you are also reducing the amount of data that needs to be annotated, which speeds up the entire training cycle.

More data is not necessarily better

All else being equal, more labeled data is better.

But the fact is that most organizations don’t have the luxury of infinite data pools and cash flow. Without them, you must be smart about which data will have the biggest impact on your model performance in the shortest time.

With a strategic filtering goal, data curation tools, and tight feedback loops in place, you’ll never be annotating for long without seeing the results in your model performance.

If the results start to stagnate, you’ve either hit the maximum performance that you can reasonably achieve with the data you have, or you’re doing something else wrong – which should in turn trigger you to revisit your data filtering goal.
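Stagnation can be checked mechanically after each round. The sketch below flags a plateau when the most recent rounds fail to beat the earlier best score by a minimum margin. The `has_plateaued` helper, the `patience` and `min_gain` defaults, and the example scores are illustrative assumptions, not recommended values.

```python
def has_plateaued(scores, patience=3, min_gain=0.005):
    """True if none of the last `patience` rounds beat the earlier best by `min_gain`."""
    if len(scores) <= patience:
        return False  # not enough history to judge
    best_before = max(scores[:-patience])
    recent_best = max(scores[-patience:])
    return recent_best - best_before < min_gain

# Hypothetical validation scores, one per annotate-and-retrain round.
still_improving = has_plateaued([0.60, 0.70, 0.80, 0.85])
stagnating = has_plateaued([0.60, 0.70, 0.75, 0.751, 0.752, 0.751])
```

A `True` result is a prompt to revisit the filtering goal, not proof that the data has been exhausted.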

This approach to labeling your data yields more than just curated data; it also brings invaluable insights into your data that will get you to production more quickly and cost-effectively.

In a recent chat, Yannick Donnelly, Sr Solutions Engineer at Sama, summarized it beautifully for me:

“It’s not just data curation — it’s data insights.”
