Studying the impact of the Full-Network embedding on multimodal pipelines

Vilalta, Armand; Garcia-Gasulla, Dario; Parés, Ferran; Ayguadé, Eduard; Labarta, Jesus; Moya-Sánchez, E. Ulises; Cortés, Ulises

doi:10.3233/SW-180341

Issue title: Special Issue on Semantic Deep Learning

Guest editors: Dagmar Gromann, Luis Espinosa Anke and Thierry Declerck

Article type: Research Article

Authors: Vilalta, Armand^{a; *} | Garcia-Gasulla, Dario^a | Parés, Ferran^a | Ayguadé, Eduard^{a; b} | Labarta, Jesus^{a; b} | Moya-Sánchez, E. Ulises^a | Cortés, Ulises^{a; b}

Affiliations: [a] Barcelona Supercomputing Center (BSC), 08034 Barcelona, Spain. E-mails: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] | [b] Universitat Politècnica de Catalunya (UPC), 08034 Barcelona, Spain

Correspondence: [*] Corresponding author. E-mail: [email protected].

Abstract: The current state of the art for image annotation and image retrieval tasks is obtained through deep neural network multimodal pipelines, which combine an image representation and a text representation into a shared embedding space. In this paper we evaluate the impact of using the Full-Network embedding (FNE) in this setting, replacing the original image representation in four competitive multimodal embedding generation schemes. Unlike the one-layer image embeddings used by most approaches, the FNE provides a multi-scale discrete representation of images, which results in richer characterisations. Extensive testing is performed on three different datasets, comparing the performance of the studied variants and the impact of the FNE on a level playing field, i.e., using the same data, source CNN models and hyper-parameter tuning. The results indicate that the FNE is consistently superior to the one-layer embedding. Furthermore, its impact on performance is larger than the improvement stemming from the other variants studied. These results motivate the integration of the FNE into any multimodal embedding generation scheme.

Keywords: Multimodal embedding, Full-Network embedding, caption retrieval, image retrieval, deep neural network

DOI: 10.3233/SW-180341

Journal: Semantic Web, vol. 10, no. 5, pp. 909-923, 2019

Published: 26 September 2019
