Good annotation and testing practices are the foundation of building a great model. However, understanding what constitutes quality data is a tricky question. While “as accurate as possible” seems the obvious answer, this thinking can bog any team down in an endless cycle of manual annotation, training, more annotation, and retraining.
In this post we’ll go over what “quality” might mean for your autonomous vehicle use case and how to start measuring it.
Unfortunately, there is no single definition of quality scoring — especially with data and model requirements coming in all shapes and sizes. Let’s go over some of the factors that determine what “quality” might mean for your use case.
First, determine the acceptable limitations of your model. LiDAR is a flexible technology that can identify and differentiate objects, as well as track speed and direction, but a single organization’s sensor setup will not need perfect accuracy for every metric. Instead, identify the metrics and accuracy requirements that matter to your use case.
In certain applications — like drivable area detection, for example — highly accurate location data is needed, while cuboid size may be less important. This is not to say that, in this example, woefully inaccurate cuboid sizes are acceptable. Rather, the bar for what is a quality result is set based on your model’s true needs.
For one project, perhaps tracking the outline of a moving object within 10cm of its actual edge is acceptable, while for another application capturing the outline must be done within 5cm of its actual shape. There is no one-size-fits-all quality metric. Quality is determined by acceptable model limitations, and might even change over the course of development.
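The per-project tolerance idea above can be sketched in a few lines. This is a hypothetical check, not a real API: the function name, the error metric (maximum deviation of the annotated outline from the object's actual edge, in meters), and the tolerance values are illustrative assumptions.

```python
# Hypothetical acceptance check: the same annotation can pass one project's
# quality bar and fail another's, depending on the tolerance each model needs.

def edge_error_ok(max_edge_error_m: float, tolerance_m: float) -> bool:
    """Return True if the annotated outline stays within the project's tolerance."""
    return max_edge_error_m <= tolerance_m

# An annotation that deviates up to 8 cm from the object's actual edge:
print(edge_error_ok(0.08, tolerance_m=0.10))  # True: within a 10 cm tolerance
print(edge_error_ok(0.08, tolerance_m=0.05))  # False: outside a 5 cm tolerance
```

The point is that the threshold is a project-level parameter, not a universal constant, and it can be revised as the model's needs change.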
The primary reason to keep this minimum viable product mindset is to save time and money on annotation. Highly accurate annotations for each metric quickly run up costs and turnaround times, but if this kind of thoroughness isn’t needed then the heavy workload turns into wasted resources. This time is better spent refining your model and preparing the project for production.
We also recommend revisiting the quality question routinely throughout the development lifecycle. Needs, practices, and design principles change over time, and so should your testing and annotation strategies.
Recognizing the acceptable limitations of your model is the first step to attaining and maintaining quality scoring. The second step is deciding how to measure it.
Where and when should you be concerned with quality in the first place? Going back to the “it depends” theme of the previous section, some applications may need annotation for small or far away objects. However, very few of them will need to be trained to identify a moving car 300ft away that would be annotated as a 2cm cube. Cut out the objects that don’t need measuring on your quality rubric.
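One way to "cut out" such objects is a simple filter applied before scoring. This is a minimal sketch under assumed thresholds: the distance cutoff (90 m, roughly 300 ft), the minimum point count, and the object-record structure are all illustrative, not part of any real tool.

```python
# Hypothetical rubric filter: exclude objects that are too far away or
# represented by too few LiDAR points to be worth scoring for this use case.

def needs_scoring(distance_m: float, num_points: int,
                  max_distance_m: float = 90.0, min_points: int = 10) -> bool:
    """Return True if the object should count toward the quality rubric."""
    return distance_m <= max_distance_m and num_points >= min_points

objects = [
    {"id": "car_1", "distance_m": 25.0, "num_points": 480},
    {"id": "car_2", "distance_m": 300.0, "num_points": 4},  # tiny, far away
]
scored = [o for o in objects if needs_scoring(o["distance_m"], o["num_points"])]
print([o["id"] for o in scored])  # ['car_1']
```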
Another consideration might be tracking relative vs actual size of an object. A car looks to be different sizes depending on distance from the sensor, so should the model be tracking the change in size or estimating the maximum or actual size of the object throughout?
For the most part, the actual size of the object will be what matters — but regardless of your use case, be sure to define instructions like this early on.
An often overlooked component of this is the identification of edge cases. Minority class annotation might seem like a waste of time, but it is in fact integral to training a successful model. You cannot launch a self-driving car that does not know how to behave when it encounters edge cases; real world driving is full of edge cases. Edge case testing is a balancing act between training a model to be prepared to recognize rare occurrences, while also not exceeding the previously determined limitations of your model.
There is also the matter of scoring and penalties for the error types you might encounter:
| Error type | Example |
| --- | --- |
| Incorrect labels | An annotator labeled a child pedestrian as an adult |
| Missed objects | A faraway car with few points was not labeled |
| Missed points | Points outside a cuboid that should have been inside it |
| Bloom/reflection | Points labeled as part of the object when they are actually sunlight glare or bloom |
| Object tracking | An object left the scene, came back, and was tracked under a different ID |
| Object size | A faraway truck given the cuboid size of a sedan |
| Too tight/too loose | The cuboid was too small or too large for the object |
| Incorrect direction of travel | The direction of travel was handled inconsistently (should it change when the car reverses?) |
| Alignment errors | The object is not properly aligned across the sequence (yaw, pitch, and roll) |
| Jittering | The object's position jitters across the sequence in the X, Y, or Z axis |
What is the penalty for missing an object? What about the penalty for an incorrect label or direction of travel? The penalties you assign will vary based on your use case and your working definition of quality.
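A weighted penalty rubric is one common way to turn these decisions into a score. This is a hedged sketch: the error-type names, the weights, and the 100-point scale are all illustrative assumptions that should come from your own working definition of quality.

```python
# Hypothetical penalty rubric: each error type carries a weight reflecting
# how much it matters for this particular use case.

PENALTIES = {
    "missed_object": 5.0,       # safety-critical, penalized heavily
    "incorrect_label": 3.0,
    "incorrect_direction": 2.0,
    "too_tight_or_loose": 1.0,
    "jittering": 0.5,           # tolerable for this hypothetical model
}

def quality_score(errors: list[str], max_score: float = 100.0) -> float:
    """Subtract weighted penalties from a perfect score, flooring at zero."""
    total_penalty = sum(PENALTIES.get(e, 0.0) for e in errors)
    return max(0.0, max_score - total_penalty)

print(quality_score(["missed_object", "jittering"]))  # 94.5
```

A use case where direction of travel is critical would simply raise that weight; the rubric structure stays the same.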
You should also consider what level of noise is acceptable in your data. Models are generally good at tolerating some random noise, so aiming for 100% accuracy trades heavily against cost and throughput. Systematic or repetitive noise is the kind you want to avoid, and as long as your instructions and quality rubric are well defined, it shouldn’t be an issue.
Finally, determine if your model requires scoring by video sequence or by frame. (For example, object tracking annotation would likely need to be scored by video sequence, as you want to ensure the object ID is consistent across frames.) This question links back to many other considerations talked about in this post, so you may already have an answer, but maintain the minimum viable product mindset. Efficiency can go hand-in-hand with accuracy to make a great product or service happen.
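For sequence-level scoring, the object-tracking example above (a consistent ID across frames) can be checked programmatically. This is a minimal sketch: the frame/annotation structure and field names are assumptions for illustration, not a real annotation format.

```python
# Hypothetical sequence-level check: flag objects that were tracked under
# more than one ID across a video sequence, which a per-frame score would miss.
from collections import defaultdict

def inconsistent_tracks(frames: list[dict]) -> set[str]:
    """Return names of objects tracked under more than one ID in the sequence."""
    ids_seen = defaultdict(set)
    for frame in frames:
        for ann in frame["annotations"]:
            ids_seen[ann["object"]].add(ann["track_id"])
    return {obj for obj, ids in ids_seen.items() if len(ids) > 1}

frames = [
    {"annotations": [{"object": "car_a", "track_id": 1}]},
    {"annotations": []},                                    # car_a leaves the scene
    {"annotations": [{"object": "car_a", "track_id": 7}]},  # re-enters, new ID
]
print(inconsistent_tracks(frames))  # {'car_a'}
```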
A recommended route for many organizations is to utilize automated scoring.
3D Gold Tasks are a powerful automation tool for improving scoring quality by introducing manually scored tasks into your training. To generate Gold Tasks, a human annotator scores annotated data, and that scored task is then introduced into your model’s training. This method teaches your model what a correct scoring procedure looks like, strengthening the long-term efficiency of the automation.
Tools such as Auto QA can also render your annotation process more efficient, by preventing errors that are often overlooked by a manual QA review and allowing analysts to focus on more qualitative errors.
Automation, unfortunately, is not a silver bullet, even with the help of Gold Tasks and other automation such as Auto QA. These efficiency tools need to be bolstered by a skilled annotation team that works with you to catch edge cases and iterate on instructions through tight feedback loops. For the best results, these humans-in-the-loop must be set up to become experts on your data.
Third-party annotation is often needed for high-quality annotation, especially during the mid to late stages of development. In the early stages, crowdsourced or internal annotation is often suitable to get up and running. At some point, however, your model will likely become too complex for these methods. Third-party annotators bring expertise and efficiency to the table, while being much more cost effective than building out an in-house expert annotation team.
Quality in, quality out
Determining quality is a complex issue, but it shouldn’t be ignored.
Start by determining the limitations of your model and adjust your definition of quality accordingly. Then, decide how measuring and scoring should happen based on your model’s needs, and leverage efficiency boosting tools such as automation and third-party experts to get the most out of your training time.
If you’d like to learn more about 3D LiDAR annotation and working with training data, watch our webinar on the topic or learn more about how Sama is helping transportation and navigation organizations make data their competitive advantage here.