Grounding Language by Seeing, Hearing, and Interacting


Authors

Zellers, Rowan

Abstract

As humans, our understanding of language is grounded in a rich mental model of "how the world works." As children, we build this mental model gradually: we take in raw perceptual input through all of our senses and learn to make sense of the people and objects around us -- enough to take action in the world. Our understanding of language and vision is grounded in the world.

Deep learning has made significant progress in recent years on a variety of AI problems. Yet today's state-of-the-art models in natural language processing (NLP) and computer vision (CV) are ungrounded. They learn exclusively from text-only or text-annotated data on the internet, making it harder for them to connect language and vision to the world beyond those modalities.

In this thesis, I present several lines of work to bridge this gap between machines and humans. I first discuss how we might measure grounded understanding, introducing a suite of approaches for constructing benchmarks that use machines in the loop to filter out spurious biases. These benchmarks test grounding through written text alone, through visual scenes, and through interaction with humans. Then, I introduce PIGLeT: a model that learns physical commonsense understanding by interacting with the world through simulation, and uses this knowledge to ground language. PIGLeT learns linguistic form and meaning -- together -- and outperforms text-to-text-only models that are orders of magnitude larger. Finally, I introduce MERLOT, which learns about situations in the world by watching millions of YouTube videos with transcribed speech. MERLOT is trained to jointly represent video, audio, and language, together and over time -- learning multimodal and neural script knowledge representations.

Description

Thesis (Ph.D.)--University of Washington, 2022
