Recent progress in generative models has paved the way to a manifold of tasks that some years ago were only imaginable. With the help of large-scale image-text datasets, generative models can learn powerful representations exploited in fields such as text-to-image or image-to-text translation.
The recent release of Stable Diffusion and the DALL-E API led to great excitement around text-to-image generative models capable of generating complex and stunning novel images from an input descriptive text, similar to performing a search on the internet.
With the rising interest in the reverse task, i.e., image-to-text translation, several studies tried to generate captions from input images. These methods often presume a one-to-one correspondence between pictures and their captions. However, multiple images can be connected to and paired with a long text narrative, such as photos in a news article. Therefore, the need for illustrative correspondences (e.g., “travel” or “vacation”) rather than literal one-to-one captions (e.g., “airplane flying”).
With this purpose, researchers from Google introduced NewsStories, a large-scale dataset containing over 31M articles in English, 22M images, and 1M videos from more than 28k news sources.
Furthermore, based on the presented dataset, they propose the novel task of learning a contextualized representation for a given set of input images such that it can infer the relevant story.
The goal is to maximize the semantic similarity between each article and the input images, and this is achievable by exploring two MIL (Multiple Instance Learning) subtasks.
The first one consists of the alignment of a picture with the entire article, converted through an image encoder and a language encoder, respectively, into representations.
The second involves segmenting the text article into individual sentences and encoding them into different representations. The objective is the maximization of the mutual information between the images and text sequences, expressed in probability distributions.
This last presented solution resulted in the highest accuracy.
To summarize, the contributions of this work are multiple, starting from the challenging problem of aligning a story and a set of illustrative images without temporal ordering with applications such as automated story illustration. Secondly, a large-scale multi-modal news dataset, termed NewsStories, is introduced. Lastly, the researchers present an intuitive MIL approach that outperforms state-of-the-art methods by 10% on zero-shot image-set retrieval in the state-of-the-art GoodNews dataset.
This was a summary of NewsStories, a novel method for illustrating stories with visual summaries. You can find more information in the links below if you want to learn more about it.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.
Marktechpost is a California based AI News Platform providing easy-to-consume, byte size updates in machine learning, deep learning, and data science research
© 2021 Marktechpost LLC. All Rights Reserved. Made with ❤️ in California
Win All-Access Pass to AI Summit in NY worth $2,500 and Hang out with Q2 and Protopia AI Leadership
Thank you for submitting the form