NgramSPD: Exploring optimal n-gram model for sentiment polarity detection in different languages

Graovac, Jelena; Mladenović, Miljana; Tanasijević, Ivana

doi:10.3233/IDA-183879

Article type: Research Article

Authors: Graovac, Jelena^* | Mladenović, Miljana | Tanasijević, Ivana

Affiliations: Department of Computer Science, Faculty of Mathematics, University of Belgrade, Belgrade, Serbia

Correspondence: [*] Corresponding author: Jelena Graovac, Department of Computer Science, Faculty of Mathematics, University of Belgrade, Belgrade, Serbia. E-mail: [email protected].

Abstract: Due to the rapid growth of web platforms such as blogs, discussion forums, peer-to-peer networks, and various other types of social media, Sentiment Polarity Detection (SPD), the task of classifying texts by “positive” or “negative” orientation, has become a more important and challenging task in recent years. There is a growing need for the management and study of SPD not only in English, but also in other languages. The key to using Machine Learning (ML) for SPD lies in engineering a representative set of features. This paper explores different (byte, character and word) n-gram based text representation models in order to determine the most valuable model for representing text documents in various languages, one that ML classification techniques can use successfully to solve the SPD task. The proposed n-gram models were used in conjunction with the k-Nearest Neighbours (kNN), Support Vector Machine (SVM) and Maximum Entropy (MaxEnt) algorithms to determine the opinion polarity of movie reviews. The effectiveness and language independence of the proposed n-gram models were demonstrated in experiments performed on seven publicly available movie review benchmarks in Arabic, Czech, English, French, Spanish, Turkish, and Serbian, the last being the authors’ mother tongue. Formal evaluation has confirmed that the proposed byte and character n-gram models outperform the word n-gram model and, in conjunction with the presented MaxEnt algorithm, outperform other supervised ML techniques used with more complex document representation approaches. In some cases (Arabic, Czech, French, Serbian and Turkish), significant improvements over the baselines have been achieved. Despite their simplicity and broad applicability, byte and character n-grams have been shown to capture information on different levels: lexical and syntactic.
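The three representation levels named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, and it assumes UTF-8 encoding for the byte level and plain whitespace tokenization for the word level.

```python
from collections import Counter


def byte_ngrams(text, n):
    """Overlapping n-grams over the UTF-8 bytes of the text (encoding is an assumption)."""
    data = text.encode("utf-8")
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def char_ngrams(text, n):
    """Overlapping n-grams over the Unicode characters of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Overlapping n-grams over whitespace-separated tokens (a simplifying assumption)."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_profile(ngrams):
    """Count n-gram occurrences; such counts could serve as features for a classifier."""
    return Counter(ngrams)


if __name__ == "__main__":
    review = "odličan film"  # illustrative Serbian phrase, not taken from the benchmarks
    print(ngram_profile(byte_ngrams(review, 3)).most_common(3))
    print(ngram_profile(char_ngrams(review, 3)).most_common(3))
    print(ngram_profile(word_ngrams(review, 2)).most_common(3))
```

Byte and character n-grams differ whenever the text contains multi-byte characters such as "č", which is one reason the two levels may behave differently across languages. Neither requires a language-specific tokenizer, unlike the word level.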

Keywords: Sentiment polarity detection, movie reviews, n-grams, SVM, MaxEnt, kNN

Journal: Intelligent Data Analysis, vol. 23, no. 2, pp. 279-296, 2019

Published: 4 April 2019
