An artificial intelligence model learned words from audio and video of a baby
Babies are prodigious language learners. After being fed the sights and words that a baby encountered, an artificial intelligence model picked up its first words.
The AI program was way less cute than a real baby. But like a baby, it learned its first words by seeing objects and hearing words.
After being fed dozens of hours of video of a growing tot exploring his world, an artificial intelligence model could more often than not associate words — ball, cat and car, among others — with their images, researchers report in the Feb. 2 Science. This AI feat, the team says, offers a new window into the mysterious ways that humans learn words (SN: 4/5/17).
Some ideas of language learning hold that humans are born with specialized knowledge that allows us to soak up words, says Evan Kidd, a psycholinguist at the Australian National University in Canberra who was not involved in the study. The new work, he says, is “an elegant demonstration of how infants may not necessarily need a lot of in-built specialized cognitive mechanisms to begin the process of word learning.”
The new model keeps things simple, and small — a departure from many of the large language models, or LLMs, that underlie today’s chatbots. Those models learned to talk from enormous pools of data. “These AI systems we have now work remarkably well, but require astronomical amounts of data, sometimes trillions of words to train on,” says computational cognitive scientist Wai Keen Vong, of New York University.
But that’s not how humans learn words. “The input to a child isn’t the entire internet like some of these LLMs. It’s their parents and what’s being provided to them,” Vong says. Vong and his colleagues intentionally built a more realistic model of language learning, one that relies on just a sliver of data. The question is, “Can [the model] learn language from that kind of input?”
To narrow the inputs down from the entirety of the internet, Vong and his colleagues trained an AI program with the actual experiences of a real child, an Australian baby named Sam. A head-mounted video camera recorded what Sam saw, along with the words he heard, as he grew and learned English from 6 months of age to just over 2 years.
The researchers’ AI program — a type called a neural network — used about 60 hours of Sam’s recorded experiences, connecting objects in Sam’s videos to the words he heard caregivers speak as he saw them. From this data, which represented only about 1 percent of Sam’s waking hours, the model learned how closely aligned the images and spoken words were.
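The alignment idea can be sketched as a contrastive objective: embed each video frame and each co-occurring word into a shared space, then score how well they match, rewarding pairs that occurred together. The sketch below is illustrative only — the features, projection matrices and temperature value are made up, and the real model’s architecture and loss details live in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(x, proj):
    """Project raw features into a shared space and L2-normalize."""
    v = x @ proj
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy stand-ins: 4 video frames paired with the 4 words heard at the time.
frames = rng.normal(size=(4, 32))   # hypothetical image features
words = rng.normal(size=(4, 16))    # hypothetical word features
img_emb = embed(frames, rng.normal(size=(32, 8)))
txt_emb = embed(words, rng.normal(size=(16, 8)))

# Frame-word similarity matrix; training would push the diagonal
# (co-occurring pairs) up and the off-diagonal (mismatches) down.
logits = (img_emb @ txt_emb.T) / 0.07   # 0.07: a common temperature choice
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
loss = -np.log(np.diag(probs)).mean()   # contrastive-style loss to minimize
print(loss > 0)  # True: untrained random embeddings score poorly
```

Minimizing a loss like this over many frame-word pairs is what gradually pulls a word’s embedding toward the images it co-occurs with.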
As this process happened iteratively, the model picked up some key words. Vong and his team tested it with a task similar to a lab test used to find out which words babies know. The researchers gave the model a word — crib, for instance — and asked it to find the picture containing a crib from a group of four pictures. The model landed on the right answer about 62 percent of the time; random guessing would have yielded correct answers 25 percent of the time.
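That four-picture test amounts to a nearest-neighbor lookup in the model’s shared embedding space: embed the word, embed the four candidate pictures, and pick the closest one. A toy sketch with hand-built vectors — the real model’s embeddings are learned, and these numbers are purely illustrative:

```python
import numpy as np

def forced_choice(word_vec, picture_vecs):
    """Return the index of the picture most similar to the word (cosine)."""
    w = word_vec / np.linalg.norm(word_vec)
    p = picture_vecs / np.linalg.norm(picture_vecs, axis=1, keepdims=True)
    return int(np.argmax(p @ w))

# Hand-built toy embeddings: candidate 0 points roughly the same way
# as the word "crib"; the other three are unrelated foils.
crib = np.array([1.0, 0.0, 0.0, 0.0])
pictures = np.array([
    [0.9, 0.1, 0.0, 0.0],   # crib photo (the match)
    [0.0, 1.0, 0.0, 0.0],   # foil
    [0.0, 0.0, 1.0, 0.0],   # foil
    [-1.0, 0.0, 0.0, 0.0],  # foil
])
print(forced_choice(crib, pictures))  # 0: the crib picture wins
```

A guesser picking one of four pictures at random is right a quarter of the time, which is why 25 percent is the baseline against which the model’s 62 percent is measured.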
“What they’ve shown is, if you can make these associations between the language you hear and the context, then you can get off the ground when it comes to word learning,” Kidd says. Of course, the results can’t say whether children learn words in a similar way, he says. “You have to think of [the results] as existence proofs, that this is a possibility of how children might learn language.”
The model made some mistakes. The word hand proved tricky: most of the training footage involving hand was shot at the beach, leaving the model confused between hand and sand.
Kids get tangled up with new words, too (SN: 11/20/17). A common mistake is overgeneralizing, Kidd says, calling all adult men “Daddy,” for instance. “It would be interesting to know if [the model] made the kinds of errors that children make, because then you know it’s on the right track,” he says.
Verbs might also pose problems, particularly for an AI system that doesn’t have a body. The dataset’s visuals for running, for instance, come from Sam running, Vong says. “From the camera’s perspective, it’s just shaking up and down a lot.”
The researchers are now feeding even more audio and video data to their model. “There should be more efforts to understand what makes humans so efficient when it comes to learning language,” Vong says.
W.K. Vong et al. Grounded language acquisition through the eyes and ears of a single child. Science. Vol. 383, February 2, 2024, p. 504. doi: 10.1126/science.adi1374.
Laura Sanders is the neuroscience writer. She holds a Ph.D. in molecular biology from the University of Southern California.