It’s Time to Redefine Data Quality for Machine Learning

As the quality leaders in the computer vision space for over a decade, we at Sama know what to aim for with regard to your dataset quality. As the Machine Learning community has realized that improving datasets is the key to boosting model performance, we’ve worked with our customers to define quality benchmarks and the best practices to get there.

Rigor, not perfection, is the foundation of a strong data quality strategy for Machine Learning development. Indeed, one strength of deep learning models is that they are remarkably resistant to errors, or noise, in their training data. Yet many teams insist on working with noise-free (error-free) datasets. As we’ll see in what follows, this pursuit of perfection is both infeasible and unnecessary.

Noise is inevitable

Noise is everywhere. A typo in an article, glare from an oncoming car, and poor-quality speakers are all examples of commonplace noise we face regularly, each making it slightly harder to understand the outside world.

We’ve learned to ignore noise in our day-to-day lives because it is impossible to get rid of it completely. The same goes for deep learning models. The data they learn from are inevitably noisy for three main reasons:

  1. Ground truth, or the human-annotated labels used to train the models, is created through manual human labor. Humans inevitably make mistakes, and these lapses creep into the data during the annotation process. Given the volume of data required by deep learning algorithms, there is no way to ensure a completely error-free dataset.
  2. The data will always contain nuanced instances where the ground truth is subjective. For example, when tracking a vehicle at night in a video, it can be impossible to precisely define its outline against the background. Creating an absolute ground truth for every scenario is impossible.
  3. The underlying data will always contain some level of noise. Even the most precise data-capturing devices only record approximations of the real world. Imperfections in your unlabeled data, such as low-resolution or low-contrast images, set an upper bound on how accurate your annotations can be.

Your model doesn’t need perfection

This claim that chasing a noise-free dataset is futile can make people uncomfortable, especially those working on high-stakes use cases where accuracy is critical. If a model mistake, or failure mode, can lead to deaths, shouldn’t one make sure that training datasets are completely error free? Thankfully, no. A perfect dataset is superfluous.

Deep learning models are great at ignoring noise during training. If you feed your model scenes with tens of thousands of correctly annotated pedestrians, it will easily learn what a pedestrian looks like even if the annotators missed a handful. This holds as long as the model sees a vast majority of good examples and only a small minority of bad ones.

A deep learning model will do its best to fit patterns in the data, and the best fit it can achieve is to simply ignore a small number of mistakes. If your model is overfitting (memorizing all of your training examples rather than generalizing), that may not hold, but in that case you have much bigger problems anyway.
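To make this concrete, here is a minimal simulation of the claim. It uses NumPy and a toy nearest-centroid classifier on synthetic two-class data, not any real Sama pipeline or a deep network: flipping 5% of the training labels barely moves test accuracy on a well-separated problem.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_blobs(n):
    """Two well-separated Gaussian classes: a toy stand-in for 'pedestrian vs. not'."""
    X0 = rng.normal(loc=0.0, scale=1.0, size=(n, 2))
    X1 = rng.normal(loc=4.0, scale=1.0, size=(n, 2))
    return np.vstack([X0, X1]), np.array([0] * n + [1] * n)

def fit_centroids(X, y):
    """The simplest 'model': average the examples of each class."""
    return np.array([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(centroids, X):
    # Assign each point to its nearest class centroid.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

X_train, y_train = make_blobs(5000)
X_test, y_test = make_blobs(1000)

# Corrupt 5% of the training labels, mimicking annotator mistakes.
y_noisy = y_train.copy()
flip = rng.choice(len(y_noisy), size=int(0.05 * len(y_noisy)), replace=False)
y_noisy[flip] = 1 - y_noisy[flip]

acc_clean = (predict(fit_centroids(X_train, y_train), X_test) == y_test).mean()
acc_noisy = (predict(fit_centroids(X_train, y_noisy), X_test) == y_test).mean()
print(f"clean labels: {acc_clean:.3f}, 5% noisy labels: {acc_noisy:.3f}")
```

Because the mislabeled points are a small, roughly random minority, they barely shift the learned class averages, so the decision boundary (and test accuracy) is essentially unchanged.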

Which types of noise actually hurt your model

So yes, if your failure modes are extremely costly, you should do everything you can to avoid them. But attempts to drive the ground truth error rate to 0 are not worth your time or money.

Instead, you should focus on uncovering the specific types of noise that can cause failure modes. These dataset quality issues are worth surfacing and rectifying:

  1. A critically high level of noise in your data. We’ve seen that deep learning models are good at ignoring some level of noise, but too many incorrect examples can prevent your model from learning the signal. Typically, noise levels below 5% are acceptable, meaning at least 95% of your annotations are labeled correctly.

    Consider a batch of 4 tasks, where each task score is computed as follows: task 1 has 1 critical error (-100%) out of 10 shapes, for a score of 90%; task 2 has 0 errors out of 6 shapes, for a score of 100%; task 3 has 1 minor error (-20%) out of 20 shapes, for a score of 99%; task 4 has 1 minor error (-20%) out of 10 shapes, for a score of 98%. The average across all tasks is 96.75%, so this batch would pass a 95% SLA.
  2. Your dataset contains particularly noisy slices or segments. Even if only a very small amount of annotations in your dataset contains errors, it is possible that a certain type of annotation is much more error-prone. For example, half of your motorcycles could be annotated incorrectly, but because they represent just 1% of all annotations, your global error rate would only be 0.5%. Unfortunately, your model will struggle mightily to properly detect motorcycles.

    An overall acceptable quality metric can hide the fact that certain classes haven’t been labeled properly: a batch can clear a 95% SLA even though half of the motorcycles are labeled incorrectly.
  3. Your dataset contains consistent noise. Deep learning models will learn any trend they find in data. If the errors in your dataset form a pattern, your model will learn to replicate it. In some cases, this arises from misunderstood labeling instructions. If annotators make the same mistake across the dataset, such as labeling skateboarders as pedestrians, the model will learn that problematic pattern.
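The batch-scoring arithmetic from point 1 can be sketched in a few lines. The `task_score` helper and the penalty fractions below are illustrative assumptions, not Sama's actual scoring code:

```python
def task_score(error_penalties, num_shapes):
    """Score a task: each error's penalty (as a fraction of one shape)
    is deducted, normalized by the number of shapes in the task."""
    return 1.0 - sum(error_penalties) / num_shapes

# (error penalties, shapes per task) — the four tasks from the example:
batch = [
    ([1.00], 10),  # 1 critical error (-100%) out of 10 shapes -> 90%
    ([],      6),  # 0 errors out of 6 shapes                  -> 100%
    ([0.20], 20),  # 1 minor error (-20%) out of 20 shapes     -> 99%
    ([0.20], 10),  # 1 minor error (-20%) out of 10 shapes     -> 98%
]

scores = [task_score(p, n) for p, n in batch]
batch_score = sum(scores) / len(scores)  # 0.9675 — clears a 0.95 SLA
print([round(s, 4) for s in scores], round(batch_score, 4))
```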

How to identify problematic noise

How can you ensure your dataset doesn’t contain the types of problematic noise from which model failures arise? Here are some guidelines for making sure no insidious quality issues creep into your data:

  1. Review a random subset of your data to ensure it meets an acceptable global quality standard. For example, the Sama standard quality Service Level Agreement is 95%, but we can also offer up to 99%. For all the reasons stated above, we do not offer a 100% SLA.
  2. Explore the distribution of errors found in your random review to see whether certain slices of your dataset are much noisier. With Sama, you get the tools to review and explore your annotated data in flexible ways, helping you gain a deeper understanding of its quality and flag more pernicious noise.
  3. Once you find a mistake, search for similar assets to spot problematic trends. Sama offers functionality through Sama Curate to find similar images within your dataset and probe your dataset in a more efficient manner.
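As a sketch of step 2 above, a per-slice error breakdown can expose a noisy class that the global rate hides. The review data below is invented for illustration and recreates the hypothetical motorcycle scenario: motorcycles are 1% of annotations, and half of them are mislabeled.

```python
from collections import Counter

# Hypothetical review results: (class label, was the annotation correct?)
reviewed = ([("car", True)] * 990
            + [("motorcycle", True)] * 5
            + [("motorcycle", False)] * 5)

totals, errors = Counter(), Counter()
for label, correct in reviewed:
    totals[label] += 1
    if not correct:
        errors[label] += 1

# Global rate looks fine (0.5%), but the per-class view tells another story.
global_error = sum(errors.values()) / len(reviewed)
print(f"global error rate: {global_error:.1%}")
for label in totals:
    print(f"{label}: {errors[label] / totals[label]:.1%} error rate")
```

The global rate passes a 95% SLA comfortably, while the per-class breakdown immediately flags the 50% motorcycle error rate.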

Most importantly, find a reliable annotation partner who can guarantee the highest level of quality and will work with you to rectify issues in your dataset. Sama can ensure a data quality baseline unmatched in the industry, and we have the expertise and tools required to supercharge your ML development through a proven data-centric approach.
