4 tech quotes from Meta FAIR’s 10-year anniversary


In 1956, a small group of scientists and mathematicians gathered at Dartmouth College for an eight-week brainstorming session around the parlance and possibilities of an exciting new field in computing, then referred to as “thinking machines.” While the genesis of Artificial Intelligence is a matter of some debate, the modern conception of the field is widely agreed to have been founded in that very summer workshop in Hanover, New Hampshire.

So went the opening remarks of Joelle Pineau, the VP of AI Research at Meta, kicking off the 10-year anniversary event for the Fundamental Artificial Intelligence Research (FAIR) program. To commemorate 10 years of research and progress in the fields of AI/ML, a variety of press outlets (and precisely one podcast host) got to peek under the hood at exactly what the Meta team has been working on. Portable microphone and notebook in hand, I watched FAIR’s best and brightest present their latest models. Here are some quick thoughts on the tech, with direct quotes from the experts at FAIR.

After the event, I checked in with Sama’s Director of Machine Learning, Jerome Pasquero, for some technical follow-up. You can listen to our conversation here.

Segment Anything

The Segment Anything model (SAM) was announced last year to much fanfare, and we were afforded a more technical look at how the model is trained. Nikhila Ravi, Research Engineering Manager, walked us through two of the key breakthroughs that made SAM possible:

One of the key innovations in Segment Anything is the idea of a data engine. We use the model interactively in the loop with human annotators to collect segmentation masks, and then we use the annotations to train the model. We call this data model co-evolution. The second innovation is architectural. So SAM needs to be efficient enough to power the data engine. And this is achieved by having a two-part model: a heavyweight image encoder that an image is passed through one time, which generates an image embedding. And then for every user prompt we only have to run a very lightweight mask decoder, which actually runs in real time on CPU, directly in a browser. So because of this architectural innovation, we're able to very easily generate segmentation masks for the objects in an image. We refer to this fast per-mask inference as “Amortized Real Time Inference.”
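To make the "encode once, decode per prompt" idea concrete, here is a minimal sketch of the pattern. It is my own toy illustration, not Meta's code: the class `ToySegmenter`, its methods, and the threshold-based "decoder" are all hypothetical stand-ins for SAM's real image encoder and mask decoder. What it shows is only the amortization, where the expensive step is cached per image and each click pays for the cheap step alone.

```python
from dataclasses import dataclass, field


@dataclass
class ToySegmenter:
    """Hypothetical two-part model: a costly encoder plus a cheap decoder."""
    encoder_calls: int = 0
    decoder_calls: int = 0
    _cache: dict = field(default_factory=dict)

    def _encode(self, image):
        # Stand-in for the heavyweight image encoder (assumed to be expensive).
        self.encoder_calls += 1
        return [[float(px) for px in row] for row in image]

    def embed(self, image_id, image):
        # Run the encoder once per image and reuse the embedding afterwards.
        if image_id not in self._cache:
            self._cache[image_id] = self._encode(image)
        return self._cache[image_id]

    def mask_for_click(self, image_id, image, click):
        # Stand-in for the lightweight mask decoder: select every pixel whose
        # embedded value is close to the value under the clicked point.
        embedding = self.embed(image_id, image)
        self.decoder_calls += 1
        row, col = click
        target = embedding[row][col]
        return [[abs(v - target) < 0.5 for v in r] for r in embedding]


# Three prompts (clicks) on the same image.
image = [[0, 0, 1],
         [0, 1, 1],
         [0, 0, 1]]
segmenter = ToySegmenter()
masks = [segmenter.mask_for_click("img-1", image, click)
         for click in [(0, 0), (1, 1), (2, 2)]]
print(segmenter.encoder_calls, segmenter.decoder_calls)
```

In this sketch the three clicks trigger a single encoder call and three decoder calls, which is the property that would let the per-prompt step stay fast enough for interactive use.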

The data model co-evolution appears to have been designed to meet the challenge of the sheer scale of data needed to train the model. While text and images are plentiful on the internet, segmentation masks aren’t typical user-generated content, so human annotations are paired with the model-in-the-loop approach described above.
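The loop Nikhila describes (the model proposes, a human corrects, the model retrains) can be sketched in a few lines. This is a deliberately tiny illustration under my own assumptions, not the actual SAM data engine: the "model" is a single brightness threshold, the annotator is simulated with ground-truth masks, and every function name here is hypothetical.

```python
def propose(threshold, pixel_values):
    # The current model proposes a mask: pixels at or above the threshold.
    return [v >= threshold for v in pixel_values]


def annotate(proposal, truth):
    # Simulated human annotator: fixes wrong labels and reports how many
    # corrections were needed (an assumption standing in for real people).
    corrections = sum(p != t for p, t in zip(proposal, truth))
    return list(truth), corrections


def retrain(dataset):
    # Refit the threshold as the midpoint between the brightest background
    # pixel and the darkest foreground pixel seen so far.
    # Assumes the dataset contains at least one pixel of each class.
    fg = [v for values, mask in dataset for v, m in zip(values, mask) if m]
    bg = [v for values, mask in dataset for v, m in zip(values, mask) if not m]
    return (max(bg) + min(fg)) / 2


def data_engine(images, truths, threshold):
    dataset, corrections_per_round = [], []
    for values, truth in zip(images, truths):
        proposal = propose(threshold, values)       # model in the loop
        mask, fixed = annotate(proposal, truth)     # human corrects
        dataset.append((values, mask))              # annotation is collected
        corrections_per_round.append(fixed)
        threshold = retrain(dataset)                # model improves
    return threshold, corrections_per_round


images = [[0.1, 0.4, 0.6, 0.9], [0.2, 0.45, 0.55, 0.8], [0.3, 0.48, 0.52, 0.7]]
truths = [[v >= 0.5 for v in values] for values in images]
final_threshold, corrections = data_engine(images, truths, threshold=0.9)
print(corrections)
```

In this toy run the annotator only has to fix the first proposal. After one round of retraining the model's proposals already match, which is the co-evolution effect in miniature.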

Ego-Exo 4D

Meta’s Chief AI Scientist, Turing Award winner, and bad boy of AI Yann LeCun made his presence known at the event with a few exciting contributions, namely a panel interview where he waxed poetic on the limitations of language-based teaching:

We have, again, an AI that passed the Bar Exam. But where is my self-driving car? Any teenager can learn to drive a car in 20 hours of practice, largely without causing any accidents. But we still don’t have self-driving cars. And the ones that we have that approach self-driving have been trained on hundreds of thousands of hours of data—this huge amount of engineering—and still don’t drive like a human. So obviously we’re missing something really, really big. We’re still very, very far from human-capable AI. It’s not around the corner.

Yann’s point is that learning happens outside mere data and language, in a manner of processing unknown to AI researchers. While he didn’t explicitly connect the dots to Meta’s Ego-Exo 4D project, it’s hard not to view the Mixed Reality (MR) application as an attempt to home in on how exactly humans learn. In this case, it’s monkey see, monkey do, but instead of swinging from branches, we’re swinging tennis rackets.

Alongside the above Ego-Exo 4D approach, Meta debuted an approach to instructional content paired with MR. The use case here couldn’t be more obvious: imagine double-checking steps of a recipe without covering your phone in flour, or reviewing how to fold in cheese without clicking back through the same 15-second clip. The key breakthrough here was a type of gesture recognition whereby your position in a sequence can be recognized, so instructional steps can be served up to you at your own pace. As I donned the Quest 3 headset, muddling mint and berries for my own mocktail, the tech used my hand positioning to know when I was slicing strawberries, or when it was time for me to reach for the muddler, and other steps along the way.

Seamless Communication

The first slide of the Seamless Communication presentation featured, merely and massively, the word “Babelfish”. Any self-respecting Douglas Adams fan would light up at this curious creature’s mention, and I was no exception. Automatic, real-time translation is here, and it goes far beyond a streamlined speech-to-text translation tool. Live translation presents some unique challenges, outlined here by Research Engineer and polyglot Juan Pino:

Streaming means that a translation is able to be generated before the end of the sentence. So this is very challenging because the model is operating with limited information. It's also challenging because people's languages have different word orders. As you can see in this example, which is German to English, we can wait for the third word, “internet”, in green, before we start translating into English. To solve this challenge, we designed a read-write policy that adaptively decides when to read more context and when to write more of the words or our speech segments. So it's important to have a model-based policy because we have to adapt not only to the input but also to the structure of the language.
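For a feel of what a read-write policy does, here is a minimal sketch. Note the swap: Meta's policy is described as adaptive and model-based, whereas this example uses a fixed "wait-k" rule (stay k source tokens ahead of the output), which is a much simpler, well-known baseline for simultaneous translation. The function names and the uppercase "translator" are hypothetical placeholders of mine, not part of Seamless.

```python
def wait_k_policy(num_read, num_written, source_finished, k):
    # Fixed rule, NOT the adaptive policy from the talk: keep reading until
    # we are k source tokens ahead of the output, then write one token.
    if source_finished:
        return "WRITE"
    return "READ" if num_read - num_written < k else "WRITE"


def toy_translate(prefix, written):
    # Placeholder translator: word-by-word, same order, uppercased.
    # Returns None when nothing is left to emit for this prefix.
    if len(written) < len(prefix):
        return prefix[len(written)].upper()
    return None


def simultaneous_translate(source, k):
    if k < 1:
        raise ValueError("k must be at least 1")
    read, written, actions = 0, [], []
    while True:
        finished = read == len(source)
        if wait_k_policy(read, len(written), finished, k) == "READ":
            read += 1                      # consume one more source token
            actions.append("R")
        else:
            token = toy_translate(source[:read], written)
            if token is None:              # source consumed, output complete
                break
            written.append(token)          # emit before the sentence ends
            actions.append("W")
    return written, "".join(actions)


output, trace = simultaneous_translate(["a", "b", "c", "d"], k=2)
print(output, trace)
```

With k=2 the trace interleaves reads and writes, so output starts before the source sentence is finished. An adaptive policy like the one Juan describes would instead choose each READ or WRITE based on the input itself, for example waiting longer when the language pair puts a key word late in the sentence.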

Meta also afforded us the chance to demo a video translation tool that detects language, tone, and style, and instantly generates a video of the user performing the same speech in another language. In addition to the translated mouth movement generation, the tool can detect when a subject is excited, whispering, sad, or a variety of other expressions. Getting intonation and mood right in speech is an exciting breakthrough; having your speech generated into a robotic facsimile is an effective but heartless way of getting a point across. Head here to check out the Seamless Expressive demo, read more about the model and peek at the code base.

Audiobox

Lastly (and most exciting of all to this audio professional) is Audiobox, Meta’s research model for audio generation. Audiobox is a generative approach to audio creation, by way of natural language text prompts combined with audio samples. In addition to demonstrating how the product can generate audio in new languages, in different scenarios (in this case, taking a clean studio-quality recording and placing the speaker in a sprawling, echoey medieval cathedral), Research Engineer Wei-Ning Hsu provided perhaps the most comprehensive mention of creating ethical & transparent technology:

To encourage studies for responsible and safe AI in audio generation, we're going to announce applications to the Audiobox Responsible Generation grant, which gives access to the code and the models. We believe such studies shouldn't be done just by one company or one organization. They should be combined together with joint efforts. We also know that such a study will be most effective when people have access to a state-of-the-art model. So for that purpose, I want to make the model available for those who are interested in these topics.

There’s far more to share from FAIR’s event. To unpack some of what we learned, I sat down with Sama’s Director of Machine Learning, Jerome Pasquero, to get a technical reaction to some of the models I saw demonstrated live. Also, the above-mentioned AI superstar Joelle Pineau will be joining How AI Happens for a discussion about FAIR’s approach to developing new technology, how Meta has made self-supervised learning models the foundation of their entire AI approach, and her views on where the industry is headed.

To hear my conversations with Jerome and Joelle, head to HowAIHappens.com or subscribe on your favorite podcasting app.

Author

The Sama Team
