Document Type
Article
Publication Date
2019
Publication Title
Advances in Cognitive Systems
Abstract
Describing the content of a visual image is a fundamental ability of human vision and language systems. Over the past several years, researchers have reported major improvements in image captioning, largely due to the development of deep learning systems trained on large data sets of images and human-written captions. However, these systems have major limitations, and their development has been narrowly focused on improving scores on relatively simple “bag-of-words” metrics. Very little work has examined the overall complex patterns of the language produced by image-captioning systems and how that language compares with human-written captions. In this paper, we closely examine patterns in machine-generated captions and characterize how inconsistently conventional metrics penalize nonhuman-like, erroneous output. We also hypothesize that the complexity of a visual scene should be reflected in the linguistic variety of its captions and, in testing this hypothesis, we find that human-generated captions exhibit a dramatically greater degree of lexical, syntactic, and semantic variation. These results have important implications for the design of performance metrics, for gauging what deep learning captioning systems really understand in images, and for the importance of the image-captioning task in cognitive systems research.
Volume
8
First Page
335
Last Page
1
Rights
© 2019 Cognitive Systems Foundation. All rights reserved.
Recommended Citation
Dai, Minyue; Grandic, Sandra; and Macbeth, Jamie C., "Linguistic Variation and Anomalies in Comparisons of Human and Machine-Generated Image Captions" (2019). Computer Science: Faculty Publications, Smith College, Northampton, MA.
https://scholarworks.smith.edu/csc_facpubs/175
Comments
Archived as published. Open access article.
Published at http://www.cogsys.org/journal/volume8/