Affiliations: [a] The School of Computer and Information Science, Southwest University, China. E-mail: firstname.lastname@example.org | [b] Department of Computer Science, University of Memphis, Memphis, TN, USA. E-mail: email@example.com | [c] School of Software Engineering, Chongqing University, China. E-mail: firstname.lastname@example.org
Abstract: Automatically identifying Chinese characters that are similar in glyph, pronunciation and meaning is important for building smart question generation tools in a computer-assisted language-learning environment. Previous research on Chinese character similarity measurement focused on the character glyph (e.g. structures, strokes and radicals), using heuristic algorithms whose parameters have preset values. This article presents a machine learning (regression) approach to measuring the similarity between two Chinese characters, based on information that includes not only the glyph but also the pronunciation (pinyin) and the semantic meaning derived from HowNet. We evaluated various regression models on a testing set of 2586 pairs of characters selected from elementary Chinese textbooks. The results showed that four regression models (M5, Support Vector Machine, Gaussian Process and Linear Regression) performed similarly (0.617⩽Mean Absolute Error⩽0.641, 0.772⩽Root Mean Square Error⩽0.790). In addition, the results suggested that the performance of the regression model could be influenced by character frequency. We also evaluated the regression model on a well-known Chinese language learning resource, "100 pairs of the most confusing Chinese characters". The experimental results indicated that this approach has potential for the recognition and generation of confusing Chinese character pairs.
Keywords: Natural language processing, Chinese character similarity measurement, intelligent authoring tools