Having access to top-quality data is essential for developing high-performing Machine Learning (ML) models that can deliver business value. Algorithms rely on large amounts of data to learn and make predictions, but not all data is created equal.
Data annotation – the process of labeling data with metadata or tags – is an important step in preparing data for machine learning. The quality of data annotation can greatly impact the accuracy and reliability of machine learning models.
A recent survey of professionals working on ML projects showed that 78% of projects stall at some stage before deployment. One of the main reasons is the difficulty of reliably scaling data annotation in both volume and quality. This is often fueled by human error, unclear assumptions, ambiguity in images, and the subjective nature of the task.
Here are three examples of poor data quality resulting in bad ML algorithms.
1. Inaccurate or missing data leads to incorrect predictions
Data can be inaccurate or missing for a few reasons – human error during the data annotation process, faulty data collection techniques, or issues with the data source.
If a machine learning model is trained on inaccurate or missing data, it may falsely classify images or make incorrect recommendations. Missing data can negatively impact machine learning models by reducing accuracy or introducing bias.
Inaccurate image annotation was blamed in the case of a self-driving car that crashed into a truck in Florida in 2016. The car, which was being operated in autopilot mode, failed to detect the white truck against the bright sky, leading to the fatal accident.
According to the investigation report by the National Transportation Safety Board, the car’s object detection system was not able to recognize the truck as an obstacle due to the use of a “misplaced object classification algorithm”. The algorithm was trained to detect objects based on their height, width, and other features, but it was not designed to recognize the side of a large truck, which is what the car encountered.
This error was attributed to inaccurate image annotation during the training of the algorithm. The training data did not include enough images of large trucks from this angle, leading to a blind spot in the algorithm’s ability to detect them.
To mitigate the impact of inaccurate or missing data, it is crucial to carefully review and validate data before using it. This involves identifying potential gaps or scenarios where data may be missing, as well as ensuring that the data is accurate and representative.
Annotators should be trained on the task they will be performing, the annotation software they will be using, the type of annotations needed, and the level of accuracy required. Providing clear guidelines and examples of annotated data can help ensure consistency and accuracy. It’s also helpful to have a quality control process in place to review the annotations and provide feedback. Using multiple annotators to work on the same data and then comparing results can help identify discrepancies and improve accuracy.
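The comparison between annotators described above can be made measurable. The following is a minimal sketch, assuming two annotators have labeled the same items in the same order. It uses Cohen's kappa, a standard agreement statistic chosen here for illustration and not named in the original text. The function names and the sample labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


def disagreements(labels_a, labels_b):
    """Indices of items the annotators labeled differently (review queue)."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


# Hypothetical labels, for illustration only.
annotator_a = ["car", "truck", "car", "car", "truck", "car"]
annotator_b = ["car", "truck", "truck", "car", "truck", "car"]

kappa = cohen_kappa(annotator_a, annotator_b)
to_review = disagreements(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}, items to review: {to_review}")
```

A low kappa usually points to unclear guidelines rather than careless annotators, so the disagreeing items are a good starting point for refining the instructions and examples.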
2. Outdated and non-representative data can create biases
Outdated data is no longer relevant or accurate, while non-representative data does not accurately reflect the population being studied.
Data becomes outdated when it is collected over a long period of time or when the data source changes. It can negatively impact machine learning models by reducing accuracy or introducing bias. Data is non-representative when the sampling is biased or when the sample size is too small.
One notorious example is the case of Amazon’s recruiting tool. In 2018, it was revealed that Amazon had developed an ML algorithm to help automate the recruiting process by sifting through resumes and identifying top candidates. However, the algorithm was found to be biased against women, because the data it was trained on was itself historically biased against women.
The algorithm was trained on resumes submitted to Amazon over a 10-year period, the majority of which came from men. The algorithm learned to associate certain terms and qualifications with male candidates, leading it to rank them higher than equally qualified female candidates. It downgraded resumes that contained the word “women’s” – such as “women’s chess club captain” – and gave higher scores to resumes that contained words like “executed” and “captured,” which are often associated with male-dominated fields.
Amazon eventually abandoned the tool, but the case highlights the importance of ensuring training data is diverse and representative of all groups. Without accurate and unbiased data, ML algorithms can perpetuate and even amplify existing biases, leading to harmful consequences.
To mitigate the impact of outdated and non-representative data, it is essential to periodically update the data and evaluate its relevance. This includes assessing whether it is still representative of the target population and the current context. It is important to consider potential biases, such as underrepresented groups or overrepresented data points. Regularly monitoring the performance of the machine learning model and analyzing any errors or biases that may arise can also help identify and address any issues.
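One way to check whether training data is still representative is to compare the share of each group in the dataset against a reference distribution for the target population. The sketch below assumes such a reference is available. The function name, the group names, the counts, and the 10-percentage-point tolerance are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def representation_gaps(samples, reference_shares, tolerance=0.10):
    """Return groups whose share in `samples` deviates from the reference.

    samples: one group label per training example.
    reference_shares: expected share of each group (values sum to 1).
    tolerance: largest acceptable absolute difference in share (assumed value).
    """
    counts = Counter(samples)
    total = len(samples)
    gaps = {}
    for group, expected in reference_shares.items():
        actual = counts[group] / total if total else 0.0
        if abs(actual - expected) > tolerance:
            gaps[group] = (round(actual, 3), expected)
    return gaps


# Hypothetical dataset: 80 examples from one group, 20 from another.
training_groups = ["group_a"] * 80 + ["group_b"] * 20
reference = {"group_a": 0.5, "group_b": 0.5}

gaps = representation_gaps(training_groups, reference)
for group, (actual, expected) in gaps.items():
    print(f"{group}: {actual:.0%} of training data, expected about {expected:.0%}")
```

Running a check like this each time the dataset is refreshed gives an early signal that a group is drifting out of balance, before the effect shows up in model errors.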
3. Incomplete data fails to predict relevant outcomes
Incomplete data is missing important information or features. It can result from technical issues during data collection or from human error during data annotation.
In 2019, Walmart announced it would be implementing autonomous floor scrubbers in over 3,600 of its stores. The scrubbers would be able to navigate and clean floors without any human intervention, thanks to their sensors and cameras. However, the scrubbers kept getting stuck or colliding with obstacles.
Incomplete image annotations were causing the problem. The scrubbers were supposed to be able to identify and avoid obstacles such as shopping carts and displays, but the algorithm was not properly trained to recognize all of the different types of obstacles in a store. As a result, the scrubbers would sometimes collide with carts or other obstacles, causing damage to both the scrubbers and the objects in their path.
Walmart had to manually annotate more images of the store environment to improve the training data. By adding more data and retraining the algorithm, the scrubbers were eventually able to better navigate the stores without getting stuck or causing damage.
To mitigate the impact of incomplete data, it is important to identify missing information and either fill in the missing values or remove the incomplete records. Clear guidelines and examples also help data annotators understand the task and the expected output, which reduces incomplete or inaccurate annotations.
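Identifying missing information can be automated before training starts. The following is a minimal sketch, assuming annotations are stored as dictionaries. The required fields (image_id, label, bbox), the function names, and the sample records are hypothetical and would need to match the actual annotation schema.

```python
# Hypothetical schema: every annotation needs these three fields.
REQUIRED_FIELDS = ("image_id", "label", "bbox")


def missing_fields(record):
    """List the required fields that are absent or empty in one record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "", [])]


def split_by_completeness(records):
    """Separate usable records from those that need re-annotation."""
    complete, incomplete = [], []
    for record in records:
        gaps = missing_fields(record)
        if gaps:
            incomplete.append((record, gaps))
        else:
            complete.append(record)
    return complete, incomplete


# Hypothetical annotation records, for illustration only.
records = [
    {"image_id": "img_001", "label": "shopping_cart", "bbox": [10, 20, 50, 80]},
    {"image_id": "img_002", "label": "", "bbox": [5, 5, 40, 40]},
    {"image_id": "img_003", "label": "display"},
]

complete, incomplete = split_by_completeness(records)
for record, gaps in incomplete:
    print(f"{record['image_id']}: missing {', '.join(gaps)}")
```

Incomplete records are kept together with the names of their missing fields, so they can be sent back to annotators for completion or dropped, depending on how much data can be spared.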