Generating image captions through multimodal embedding

Dash, Sandeep Kumar; Saha, Saurav; Pakray, Partha; Gelbukh, Alexander

doi:10.3233/JIFS-179027

Issue title: Special Section: Intelligent and Fuzzy Systems applied to Language & Knowledge Engineering

Guest editors: David Pinto and Vivek Singh

Article type: Research Article

Authors: Dash, Sandeep Kumar^{a; *} | Saha, Saurav^a | Pakray, Partha^b | Gelbukh, Alexander^c

Affiliations: [a] Department of Computer Science and Engineering, National Institute of Technology Mizoram, India | [b] Department of Computer Science and Engineering, National Institute of Technology Silchar, India | [c] Natural Language Lab, Center for Computing Research, National Polytechnic Institute, Mexico

Correspondence: [*] Corresponding author. Sandeep Kumar Dash, Department of Computer Science and Engineering, National Institute of Technology, Mizoram, India. Tel.: +91 9612590039; Fax: 8847802530; E-mail: [email protected].

Abstract: Caption generation draws on both computer vision and natural language processing, and recent advances in the two fields have produced many efficient models. Automatic image captioning can be used to describe website content or to produce frame-by-frame descriptions of video for visually impaired users, among other applications. This work describes a model that generates novel captions for previously unseen images using a multimodal architecture that combines a Recurrent Neural Network (RNN) with a Convolutional Neural Network (CNN). The model is trained on Microsoft Common Objects in Context (MSCOCO), an image captioning dataset, and aligns captions and images in the same representation space, so that an image lies close to its relevant captions in that space and far from dissimilar captions and dissimilar images. The ResNet-50 architecture is used to extract features from the images, and GloVe embeddings are used together with a Gated Recurrent Unit (GRU) RNN for text representation. The MSCOCO evaluation server is used to evaluate the machine-generated caption for a given image.
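
The alignment idea in the abstract (an image close to its own captions and far from unrelated ones in a shared space) can be illustrated with a small sketch. The abstract does not state which objective the authors use; the pairwise max-margin ranking loss over cosine similarity shown here is an assumption, chosen because it is a common choice for joint image-text embeddings. The function names, the `margin` value, and the toy inputs are hypothetical, and the embeddings stand in for whatever the CNN and GRU encoders would produce.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length so dot products act as cosine similarity."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def contrastive_ranking_loss(image_emb, caption_emb, margin=0.2):
    """Sum-of-hinges ranking loss over a batch of matching pairs.

    Row i of image_emb and row i of caption_emb are assumed to be a
    matching image-caption pair; every other pairing is a mismatch.
    """
    img = l2_normalize(image_emb)
    cap = l2_normalize(caption_emb)
    scores = img @ cap.T                 # scores[i, j] = cos(image i, caption j)
    positives = np.diag(scores)          # similarity of each true pair

    # Image as anchor: a wrong caption j should score at least `margin` below the true one.
    cost_captions = np.maximum(0.0, margin - positives[:, None] + scores)
    # Caption as anchor: a wrong image i should score at least `margin` below the true one.
    cost_images = np.maximum(0.0, margin - positives[None, :] + scores)

    # True pairs must not be penalised against themselves.
    np.fill_diagonal(cost_captions, 0.0)
    np.fill_diagonal(cost_images, 0.0)
    return float(cost_captions.sum() + cost_images.sum())


if __name__ == "__main__":
    aligned = np.eye(3)                       # toy embeddings: each pair points the same way
    shuffled = np.roll(aligned, 1, axis=0)    # captions paired with the wrong images
    print(contrastive_ranking_loss(aligned, aligned))
    print(contrastive_ranking_loss(aligned, shuffled))
```

Under this assumed objective, the loss is zero when every true pair already beats every mismatch by the margin, and it grows as mismatched captions or images move closer to the anchor than the true match.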

Keywords: Image captioning, convolutional neural network

Journal: Journal of Intelligent & Fuzzy Systems, vol. 36, no. 5, pp. 4787-4796, 2019

Published: 14 May 2019
