Searching for just a few words should be enough to get started. If you need to make more complex queries, use the tips below to guide you.
Article type: Research Article
Authors: Chen, Liang-Chinga | Chang, Kuei-Hub; * | Wu, Chia-Henga | Chen, Shin-Chia
Affiliations: [a] Department of Foreign Languages, R.O.C. Military Academy, Kaohsiung, Taiwan | [b] Department of Management Sciences, R.O.C. Military Academy, Kaohsiung, Taiwan
Correspondence: [*] Corresponding author. Kuei-Hu Chang, Department of Management Sciences, R.O.C. Military Academy, Kaohsiung 830, Taiwan. E-mail: [email protected].
Abstract: Although natural language processing (NLP) refers to a process involving the development of algorithms or computational models that empower machines to understand, interpret, and generate human language, machines are still unable to fully grasp the meanings behind words. Specifically, they cannot assist humans in categorizing words with general or technical purposes without predefined standards or baselines. Empirically, prior researches have relied on inefficient manual tasks to exclude these words when extracting technical words (i.e., terminology or terms used within a specific field or domain of expertise) for obtaining domain information from the target corpus. Therefore, to enhance the efficiency of extracting domain-oriented technical words in corpus analysis, this paper proposes a machine-based corpus optimization method that compiles an advanced general-purpose word list (AGWL) to serve as the exclusion baseline for the machine to extract domain-oriented technical words. To validate the proposed method, this paper utilizes 52 COVID-19 research articles as the target corpus and an empirical example. After compared to traditional methods, the proposed method offers significant contributions: (1) it can automatically eliminate the most common function words in corpus data; (2) through a machine-driven process, it removes general-purpose words with high frequency and dispersion rates –57% of word types belonging to general-purpose words, constituting 90% of the total words in the target corpus. This results in 43% of word types representing domain-oriented technical words that makes up 10% of the total words in the target corpus are able to be extracted. This allows future researchers to focus exclusively on the remaining 43% of word types in the optimized word list (OWL), enhancing the efficiency of corpus analysis for extracting domain knowledge. (3) The proposed method establishes a set of standard operation procedure (SOP) that can be duplicated and generally applied to optimize any corpus data.
Keywords: Corpus, natural language processing (NLP), technical word, advanced general-purpose word list (AGWL), COVID-19
DOI: 10.3233/JIFS-236635
Journal: Journal of Intelligent & Fuzzy Systems, vol. 46, no. 4, pp. 9945-9956, 2024
IOS Press, Inc.
6751 Tepper Drive
Clifton, VA 20124
USA
Tel: +1 703 830 6300
Fax: +1 703 830 2300
[email protected]
For editorial issues, like the status of your submitted paper or proposals, write to [email protected]
IOS Press
Nieuwe Hemweg 6B
1013 BG Amsterdam
The Netherlands
Tel: +31 20 688 3355
Fax: +31 20 687 0091
[email protected]
For editorial issues, permissions, book requests, submissions and proceedings, contact the Amsterdam office [email protected]
Inspirees International (China Office)
Ciyunsi Beili 207(CapitaLand), Bld 1, 7-901
100025, Beijing
China
Free service line: 400 661 8717
Fax: +86 10 8446 7947
[email protected]
For editorial issues, like the status of your submitted paper or proposals, write to [email protected]
如果您在出版方面需要帮助或有任何建, 件至: [email protected]