How to Define and Measure Your Training Data Quality

Kyra Harrington

April 8, 2021

Data quality plays a crucial role in the performance of Machine Learning (ML) models and directly affects whether your projects succeed, fail, or go over budget. “Garbage In, Garbage Out” is a phrase commonly used in the machine learning community: the quality of the training data ultimately determines the quality of the model.

Our experience working with companies using ML has shown that the best models are based on comprehensive datasets, complete with a range of detailed labels. Unfortunately, many decision-makers still underestimate the time and resources needed to create, manage, and annotate datasets. Indeed, creating quality datasets is often one of the most expensive and time-consuming elements of building a machine learning application.

But how do you define data quality and measure it? Furthermore, how do you improve it? The answers depend on the type of problem you’re solving.


Defining Training Data Quality in Annotation

In data annotation, we often speak of “accuracy” and “consistency”. Accuracy in data labeling measures how close the labeling is to ground truth, or how well the labeled features in the data match real-world conditions. Consistency refers to the degree of accuracy across the overall dataset: are annotations consistently accurate across your datasets? Other characteristics of data quality may include completeness, integrity, reasonability, timeliness, uniqueness, validity, and accessibility.

The path to high-quality training data at scale always begins with a deep understanding of your project requirements, which allows you to develop well-defined annotation criteria against which to measure quality. Experienced AI-driven organizations often establish a quality rubric that describes what quality means in the context of a project. Sama’s approach is to anticipate the errors that we might see in a task and assign numerical penalty values to them before the annotation process starts.

The main benefit of this approach is to ensure that both your company and Sama’s team are on the same page about how quality is defined and measured. This allows our experts to create comprehensive and actionable instructions without room for interpretation and inconsistencies, saving a lot of time down the road.
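As a rough illustration of a penalty-based rubric, the sketch below turns the errors found during a review into a single quality score. The error types, penalty values, and the `quality_score` function are hypothetical examples chosen for illustration; they are not Sama’s actual rubric or scoring formula.

```python
# Hypothetical rubric: error types and penalty points are illustrative only.
PENALTIES = {
    "missed_object": 1.0,
    "wrong_class": 1.0,
    "loose_bounding_box": 0.5,
}


def quality_score(errors_found, items_reviewed):
    """Return a score in [0, 1]: one minus the penalty points per reviewed item."""
    if items_reviewed <= 0:
        raise ValueError("items_reviewed must be positive")
    total_penalty = sum(PENALTIES[error] for error in errors_found)
    return max(0.0, 1.0 - total_penalty / items_reviewed)


# Two errors (1.0 + 0.5 penalty points) found across 50 reviewed items.
score = quality_score(["wrong_class", "loose_bounding_box"], items_reviewed=50)
print(f"{score:.2%}")  # 97.00%
```

Agreeing on the penalty table before annotation starts means the score can be computed the same way by both teams, whatever the actual values are.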

Measuring Training Data Quality

Several methods exist to help companies measure data quality. To define what the correct annotation of a given piece of data looks like, you want to start by creating annotation guidelines. In addition to offering a multi-level system of quality checks, Sama’s experts have built up know-how in efficiently designing annotation guidelines that enhance data quality.

Here are some of the more common data quality measurement processes:
1. Benchmarks or gold sets help measure how well a set of annotations from a group or individual matches the vetted benchmark established by knowledge experts or data scientists. Benchmarks tend to be the most affordable QA option since they involve the least amount of overlapping work. Benchmarks can provide a useful reference point as you continue to measure the quality of your output during the project. They can also be used as test datasets to screen annotation candidates.

At Sama, gold tasks are used in two more ways: during training, to assess annotators and identify those ready to move into production; and once in production, to generate an automated quality metric. You can read more about Sama’s approach to gold tasks here.

2. Consensus measures the percentage of agreement between multiple human or machine annotators. To calculate a consensus score, divide the number of agreeing labels by the total number of labels per asset. The goal is to arrive at a consensus decision for each item. An auditor typically arbitrates any disagreement among the overlapping judgments. Consensus can either be performed by assigning a certain number of reviewers per data point or be automated.

3. Cronbach’s alpha is a statistic used to measure the internal consistency of a set of items, based on the average correlation between them. Depending on the characteristics of the research (for instance, its homogeneity), it may help quickly assess the labels’ overall reliability.

4. Review is another method to measure data quality. It relies on a domain expert checking label accuracy. The review is usually conducted by visually checking a limited number of labels, but some projects review all labels. Sama enables companies to easily review quality through a sampling portal: a dedicated portal providing full transparency and accountability on data quality. Your team can see the quality of each batch and provide direct feedback to data trainers.
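The first three measures in the list above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a description of how any particular platform computes them: the function names are hypothetical, the consensus score follows the "agreeing labels divided by total labels per asset" rule given above, and Cronbach’s alpha uses the standard formula, k / (k - 1) * (1 - sum of per-annotator variances / variance of the per-asset totals), for numeric ratings with one column per annotator.

```python
from collections import Counter
from statistics import variance


def benchmark_accuracy(labels, gold_labels):
    """Fraction of submitted labels that match the vetted gold labels."""
    if len(labels) != len(gold_labels) or not gold_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(1 for got, want in zip(labels, gold_labels) if got == want)
    return matches / len(gold_labels)


def consensus_score(labels_for_asset):
    """Return (majority label, share of annotators who chose it) for one asset."""
    if not labels_for_asset:
        raise ValueError("need at least one label")
    # On a tie, Counter returns the label seen first; a real workflow
    # would send tied assets to an auditor instead.
    label, agreeing = Counter(labels_for_asset).most_common(1)[0]
    return label, agreeing / len(labels_for_asset)


def cronbach_alpha(ratings):
    """Cronbach's alpha for numeric ratings.

    `ratings` has one row per asset and one column per annotator.
    """
    k = len(ratings[0])
    if k < 2 or len(ratings) < 2:
        raise ValueError("need at least two annotators and two assets")
    item_variance_sum = sum(variance(column) for column in zip(*ratings))
    total_variance = variance([sum(row) for row in ratings])
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


print(benchmark_accuracy(["cat", "dog", "cat", "bird"],
                         ["cat", "dog", "dog", "bird"]))  # 0.75
print(consensus_score(["cat", "cat", "dog"]))  # "cat", chosen by 2 of 3
print(cronbach_alpha([[1, 1, 1], [2, 2, 2], [3, 3, 3]]))  # close to 1.0
```

In the last call all three annotators give identical ratings, so alpha is at its maximum; values well below that suggest the annotators are not rating consistently.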

Because model testing and validation are iterative, keep in mind that data quality requirements can change during a project. As you train your model, or after your solution goes live, you’ll probably find patterns in your inaccuracies or identify edge cases that will force you to adapt your dataset.
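One simple way to look for such patterns is to tally which labels get confused with which. The sketch below assumes you have a list of expected labels and a list of labels produced by the model or by annotators; the `error_patterns` function and the sample labels are hypothetical.

```python
from collections import Counter


def error_patterns(expected, predicted):
    """Count each (expected, predicted) pair where the two labels differ."""
    return Counter(
        (want, got) for want, got in zip(expected, predicted) if want != got
    )


expected = ["car", "truck", "truck", "bus", "truck"]
predicted = ["car", "car", "car", "bus", "bus"]

# Most frequent confusions first.
for (want, got), count in error_patterns(expected, predicted).most_common():
    print(f"expected {want!r}, got {got!r}: {count}")
```

Here "truck" labeled as "car" shows up twice, which would point to a guideline or dataset gap worth fixing for that class.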

Reviewing Training Data Quality

Because no two AI projects are alike, you need to make sure that your quality assurance (QA) process is designed to meet the unique needs of your particular project. Here, both data accuracy and consistency are reviewed, in separate steps that a data scientist or project manager may perform manually or automatically.

Two different types of QA processes:
1. Sama’s own QA managers review the tasks using both manual and automated techniques.
2. The client’s QA process, performed by a data scientist and most likely manual only.

Reviewing tasks is a pain point for most data science teams. We believe that your team should spend less time on these time-consuming and tedious tasks and more time on strategic work. That is why we created the Auto QA process. Auto QA creates an instant feedback loop that catches logical errors, which helps annotators improve and get it right the first time.

Automated logic checks are triggered on Sama’s platform before an annotator submits a task. They can identify several potential issues in your data: for instance, invalid answer combinations, repetitions, or violated size requirements. Leveraging Auto QA will help you prevent errors that the manual QA review process may be unable to detect. It reduces the time spent by manual QAs and allows annotators to focus on the more critical errors or edge cases.
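To make the idea concrete, here is a minimal sketch of what pre-submission logic checks could look like. It is not Sama’s Auto QA implementation: the task structure, the `vehicle_present` question, the `MIN_BOX_SIDE_PX` threshold, and the `check_task` function are all assumptions made for illustration. It covers the three kinds of issue mentioned above: size requirements, repetitions, and invalid answer combinations.

```python
MIN_BOX_SIDE_PX = 10  # hypothetical minimum size requirement


def check_task(task):
    """Return a list of issues; an empty list means the task may be submitted."""
    issues = []
    seen_boxes = set()
    for index, box in enumerate(task["boxes"]):
        # Size requirement: flag boxes with a side below the minimum.
        if box["width"] < MIN_BOX_SIDE_PX or box["height"] < MIN_BOX_SIDE_PX:
            issues.append(f"box {index}: smaller than {MIN_BOX_SIDE_PX} px")
        # Repetition: the same label drawn twice at the same position.
        key = (box["label"], box["x"], box["y"], box["width"], box["height"])
        if key in seen_boxes:
            issues.append(f"box {index}: duplicate of an earlier box")
        seen_boxes.add(key)
    # Invalid answer combination: "no vehicle" contradicts drawn boxes.
    if task["answers"].get("vehicle_present") == "no" and task["boxes"]:
        issues.append("answer 'vehicle_present: no' conflicts with drawn boxes")
    return issues


task = {
    "answers": {"vehicle_present": "no"},
    "boxes": [
        {"label": "car", "x": 5, "y": 5, "width": 40, "height": 30},
        {"label": "car", "x": 5, "y": 5, "width": 40, "height": 30},
        {"label": "bus", "x": 90, "y": 20, "width": 4, "height": 60},
    ],
}
for issue in check_task(task):
    print(issue)
```

Running checks like these before submission gives the annotator the feedback immediately, instead of after a reviewer has found the problem.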

Sama’s training data platform handles the entire annotation lifecycle and employs a quality feedback loop, enabling us to offer the highest quality SLA (>95%), even on the most complex workflows. Understanding the importance of training data quality, and prioritizing it, will help your models succeed. The first step toward good-quality training data is finding the right processes and platforms to label it.