3 key conversations from CVPR 2024

How generative models are improving auto-labeling and synthetic data, how that will impact human annotators, and exciting papers in the world of multi-modal models.

Sama’s ML team always comes back from CVPR energized and pondering new ideas, and 2024 was no exception. Here are the conversations, datasets, papers, and presentations we’re still thinking about a few weeks on from the event.

*Members of Sama's ML and data science departments at CVPR 2024*

Generative models are improving auto-labeling and synthetic data

Generative models are here to stay, and to help train new models with synthetic data. The question is, what proportion of the data should be real and what proportion can be synthetic?

High-quality, application-driven datasets are key to the transition from academic research to high-impact applications. In fact, a best paper award went to “Rich Human Feedback for Text-to-Image Generation,” highlighting how important well-designed datasets are for training impactful models.

💡 Check out the dataset used in the winning paper on GitHub

As auto-labeling improves and takes on more applications, it frees up Human-in-the-Loop (HITL) teams to focus on complex problems and high-stakes industries.

Which means…

The role for human annotators is evolving as AI matures

The auto-labeling conversation naturally raises the question: how is the human annotator’s role evolving? It’s a topic that is close to our mission (and business) at Sama.

The answer is, humans are still very relevant—but the role is changing.

Most projects will find a good balance of auto-labeling and HITL, and that balance will differ depending on your model’s use case. Some models can’t tolerate errors in their annotations (think of an autonomous vehicle, and the bad PR when it makes a grave miscalculation) and will always need humans.

As they become more prevalent, datasets used to train AI are facing the same quality questions. Nicolas Duchêne, Senior Applied Scientist, ML, at Sama, sat in on a workshop discussion on responsible data at CVPR 2024. One actionable insight was to add contextual metadata to dataset annotations, both to give credit to the human workers who make these datasets possible and to document the conditions in which the datasets were created.

One thing that came up that I found interesting, that everybody agreed on, is to have data cards attached to datasets. So ways of qualifying data sets and having metadata about, Who made the annotation? Where was it done? What were the conditions like? — Nicolas Duchêne, Sr. Applied Scientist, ML @ Sama

Humans are also critical in the cycle of improving auto-annotations. Feedback loops have to be correct and corrected. Otherwise they’ll spiral in the wrong direction.

There’s an opportunity to use human feedback and interaction with auto labels in order to improve the model or improve the efficiency. — Ryan Tavakolfar, Sr. Solutions Engineer @ Sama

How LLMs collide with Computer Vision

Conversations about Generative AI were, of course, everywhere. Given that it’s a computer vision conference, the spotlight was on multi-modal innovation and progress. From outdoor scene extrapolation within ADAS, to Meta’s text-to-3D generated llamas, the industry is finding more and more ways to use it.

We couldn’t pick just one favorite, so here’s what piqued our collective interest:

Pascal Jauffret, Sr. Applied Scientist @ Sama, especially liked the paper “Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs,” in which the authors investigate where and why the new, popular vision-language models fall short, and how to start fixing those issues. Plus, they announced Cambrian-1, which they say takes a more “vision-centric” approach.
Florence-2 by Microsoft, designed to take text prompts as task instructions and generate the desired results in text form, whether the task is captioning, object detection, grounding, or segmentation
PRISM-1 by Wayve: photorealistic reconstruction of static and dynamic scenes in 4D (aka 3D + time)
And finally, Visual Program Distillation from Google Research & University of Washington, which essentially uses LLMs to build on top of VisProg

Did you find this months after the event? Subscribe to our newsletter so you don't miss out next year!
