Data noise. It’s one of the biggest issues autonomous vehicle models face, especially when consistent noise creates risk that can result in life-or-death consequences in real-world environments. The problem is, as revolutionary as deep learning models are, they still learn from any trend found in the data and will replicate patterns, even if the source is an error. Catching and mitigating the noise is crucial during building and validation, but guaranteeing data quality isn’t easy and putting up checks and guardrails is key. Here are five effective ways to reduce uncertainty in AV models, with a focus on data quality and calibration.
Active learning algorithms identify data samples that are likely to be the most challenging or uncertain for the current model. By focusing data collection on these challenging scenarios, developers can acquire data that fills gaps in the model’s knowledge and improves its performance where it matters most. This ensures that the dataset includes critical information that might be missed through random or passive data collection.
As the model becomes more proficient in certain tasks or scenarios, active learning algorithms adjust their data selection strategies accordingly, improving model accuracy iteratively over time. They’re also good at identifying edge cases and rare events that are critical for AV safety. These scenarios may be infrequent in real-world driving, but they are essential for comprehensive testing and training and need to be adequately represented in the dataset.
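To make the idea concrete, here is a minimal sketch of one common active learning strategy, uncertainty sampling by predictive entropy. It is an illustration rather than a description of any particular production pipeline: the function names, the frame ids, and the probability values are all hypothetical, and it assumes the model outputs a class probability distribution for each unlabeled sample.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(predictions, budget):
    """Return the ids of the `budget` samples with the highest entropy.

    `predictions` maps a sample id to the model's class probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda sample_id: predictive_entropy(predictions[sample_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical softmax outputs for four unlabeled frames.
unlabeled = {
    "frame_001": [0.98, 0.01, 0.01],  # model is confident
    "frame_002": [0.40, 0.35, 0.25],  # model is unsure
    "frame_003": [0.70, 0.20, 0.10],
    "frame_004": [0.34, 0.33, 0.33],  # close to uniform
}

# Send the two frames the model is least sure about to annotation first.
to_label = select_most_uncertain(unlabeled, budget=2)
print(to_label)
```

In a real loop, the selected frames would be annotated, added to the training set, and the model retrained before the next round of selection, which is how the selection strategy adapts as the model improves.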
Ensuring data quality requires providing annotators with comprehensive coaching and guidance, especially at the start of the project. At Sama, for example, we require two weeks of project-specific training before our annotators start working with client data. Training includes comprehensive annotation instructions, including golden tasks and quality rubrics. Annotators need to understand the nuances of annotating different scenarios, including edge cases and complex situations, to avoid costly errors down the line.
The key is to focus on precision rather than speed. Don’t be afraid to slow things down to get this step right. Investing in training upfront means avoiding lengthy delays and hidden expenses due to rework or operational inefficiencies.
From sensor wear and tear to environmental conditions, the data for your AV models is bound to change. That’s why it’s so important to include ways to identify data shifts and recalibrate regularly.
Take LiDAR sensors, for example. AVs rely on them for precise distance measurements, but over time, the sensors may experience slight misalignments or fluctuations in performance. An automated way to flag suspect data allows developers to detect these changes early and make real-time adjustments to ensure data accuracy, minimizing the impact of sensor degradation and drift on model performance.
How often you complete calibration checks will depend on your project, but a good starting point is to do them weekly.
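One simple way to automate this kind of flagging is to compare recent readings against a baseline recorded right after calibration. The sketch below is only an illustration under stated assumptions: it supposes you log the range error of a LiDAR sensor against a surveyed reference target, and the 5 cm tolerance, the function name, and the sample values are all hypothetical.

```python
from statistics import mean


def detect_drift(baseline_errors, recent_errors, tolerance_m=0.05):
    """Compare recent range errors with the post-calibration baseline.

    Returns (drifted, shift), where `shift` is the change in mean error
    in metres and `drifted` is True when |shift| exceeds `tolerance_m`.
    """
    shift = mean(recent_errors) - mean(baseline_errors)
    return abs(shift) > tolerance_m, shift


# Hypothetical range errors (metres) against a surveyed reference target.
baseline = [0.01, -0.02, 0.00, 0.02, -0.01]  # logged right after calibration
this_week = [0.07, 0.09, 0.08, 0.06, 0.10]   # logged during the weekly check

drifted, shift = detect_drift(baseline, this_week)
if drifted:
    print(f"Recalibration recommended: mean error shifted by {shift:+.3f} m")
```

A check like this can run on whatever schedule suits the project. The tolerance should come from your sensor's specification and your model's sensitivity, not from the placeholder used here.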
Collecting data for AV models is a massive undertaking, and data quality can vary significantly. Random data sampling can help identify and rectify quality issues and reduce inherent human bias.
Take a lane detection algorithm, for example. Ideally, the autonomous vehicle is operating in standard daytime conditions with clear lane markings and nice weather. But faded lines, road debris, bad weather, and even unusual situations (like a winding mountain road) need to be accounted for. Taking random data samples at regular intervals will help ensure that the data is diverse and that annotations accurately represent real-life road conditions, even ones that aren’t encountered frequently.
You’ll want to do more sampling at the beginning stages of a project (even up to 30%), with the amount and frequency slowly decreasing over time as you become more confident in your model.
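A sampling schedule like this can be sketched in a few lines. The 30% starting rate comes from the guidance above; everything else is an assumption made for illustration, including the idea of applying the rate per batch, the geometric decay factor, the 5% floor, and the function names.

```python
import random


def review_rate(batch_index, start=0.30, floor=0.05, decay=0.8):
    """Fraction of a batch to review.

    Starts at `start`, shrinks geometrically with each batch, and never
    drops below `floor`. The decay and floor values are illustrative.
    """
    return max(floor, start * decay ** batch_index)


def sample_for_review(item_ids, batch_index, seed=None):
    """Draw a uniform random sample of items from one batch for QA review."""
    if not item_ids:
        return []
    rng = random.Random(seed)  # a seed makes the draw reproducible
    k = max(1, round(len(item_ids) * review_rate(batch_index)))
    return rng.sample(item_ids, k)


# Hypothetical batch of 1,000 annotated frames.
batch = [f"frame_{i:04d}" for i in range(1000)]

first_review = sample_for_review(batch, batch_index=0, seed=0)   # 30% of the batch
later_review = sample_for_review(batch, batch_index=20, seed=0)  # down at the floor
print(len(first_review), len(later_review))
```

Because the draw is uniform, rare conditions show up in the review set roughly in proportion to how often they occur in the data, so the reviewer is not choosing what to look at.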
A comprehensive approach to data QA involves flexible exploration. This means delving deeper into your dataset in diverse and creative ways to uncover potential issues that may not be apparent through random sampling alone.
For example, you may want to isolate assets or data points that are associated with busier or more complex scenarios, such as road intersections with heavy traffic, crowded urban areas, or challenging driving conditions. These scenarios often pose unique challenges that make AV models more prone to errors.
If possible, you may also want to automatically flag edge cases within your dataset. These flagged edge cases may represent rare or unexpected scenarios that require special attention.
Often, AV models don’t survive production because it’s not easy to maintain a high level of data accuracy or effectively manage edge cases when using such large data sets. By using these five strategies, you can reduce the amount of uncertainty in your data so that you can be more confident in your model, which ultimately leads to better outcomes.