Impact Factor 2023: 1.7
Intelligent Data Analysis provides a forum for the examination of issues related to the research and applications of Artificial Intelligence techniques in data analysis across a variety of disciplines. These techniques include (but are not limited to): all areas of data visualization, data pre-processing (fusion, editing, transformation, filtering, sampling), data engineering, database mining techniques, tools and applications, use of domain knowledge in data analysis, big data applications, evolutionary algorithms, machine learning, neural nets, fuzzy logic, statistical pattern recognition, knowledge filtering, and post-processing.
In particular, preference is given to papers that discuss the development of new AI-related data analysis architectures, methodologies, and techniques and their applications to various domains.
Papers published in this journal are geared heavily towards applications, with an anticipated split of 70% of the papers published being applications-oriented research and the remaining 30% containing more theoretical research. Manuscripts should be submitted in *.pdf format only. Please prepare your manuscripts in single space, and include figures and tables in the body of the text where they are referred to. For all enquiries regarding the submission of your manuscript please contact the IDA journal editor: [email protected]
Authors: Zhou, Rucheng | Zhang, Dongmei | Zhu, Jiabao | Min, Geyong
Article Type: Research Article
Abstract: Traffic forecasting has become a core component of Intelligent Transportation Systems. However, accurate traffic forecasting is very challenging because of the complexity of traffic road networks. Most existing forecasting methods do not fully consider the topological structure information of road networks, making it difficult to extract accurate spatial features. In addition, spatial and temporal features have different impacts on traffic conditions, but existing studies ignore the distribution of spatial-temporal features in traffic regions. To address these limitations, we propose a novel graph neural network architecture named Attention-based Spatial-Temporal Adaptive Integration Gated Network (AST-AIGN). The originality of AST-AIGN is to obtain a spatial feature that more accurately reflects the topological structure of road networks by embedding Graph Attention Network (GAT) into Jumping Knowledge Net (JK-Net). We propose a data-dependent function called the spatial-temporal adaptive integration gate to process the diversity of feature distributions and highlight features in road networks that significantly affect traffic conditions. We evaluate our model on two real-world traffic datasets from the Caltrans Performance Measurement System (PEMS04 and PEMS08), and extensive experimental results demonstrate that the proposed AST-AIGN architecture outperforms other baselines.
Keywords: Traffic forecasting, spatial-temporal dependences, jumping knowledge, gating mechanism, self-attention
DOI: 10.3233/IDA-230101
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-25, 2024
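The two building blocks named in the abstract, GAT-style attention normalisation and jumping-knowledge aggregation, can be sketched generically in NumPy. This is a minimal illustration of the standard techniques, not the authors' AST-AIGN code; the function names are ours:

```python
import numpy as np

def jumping_knowledge_max(layer_reprs):
    """Aggregate per-layer node representations by element-wise max,
    as in JK-Net's max-pooling variant (a generic sketch)."""
    stacked = np.stack(layer_reprs, axis=0)  # (layers, nodes, features)
    return stacked.max(axis=0)               # (nodes, features)

def attention_coefficients(scores):
    """Row-wise softmax over neighbour scores: the normalisation GAT
    applies before combining neighbour features."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

In JK-Net, aggregating across layers lets each node pick the neighbourhood range that suits it, which is what the abstract credits for the more faithful spatial features.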
Authors: Yu, Dongjin | Ni, Ke | Li, Zhongyang | Zhang, Shengyi | Sun, Xiaoxiao | Hou, Wenjie | Ying, Yuke
Article Type: Research Article
Abstract: Process discovery techniques analyze process logs to extract models that characterize the behavior of business processes. In real-life logs, however, noise exists and adversely affects the extraction, thus decreasing the understandability of discovered models. In this paper, we propose a novel double-granularity filtering method, executed on both the event and trace levels, to detect noise by analyzing the directly-following and parallel relations between events. Based on the probability of an event occurring in a sequence, the infrequent behaviors and redundant events in the logs can be filtered out. In addition, the missing events in parallel blocks are detected to further improve the performance of filtering. Experiments on synthetic logs and five real-life datasets demonstrate that our method significantly outperforms other state-of-the-art methods.
Keywords: Process discovery, process mining, event logs, noise filtering, event dependency, parallel relation
DOI: 10.3233/IDA-230118
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-18, 2024
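The core idea of filtering events by the frequency of their directly-follows relations can be sketched as follows. This is a simplified, generic frequency filter, not the paper's double-granularity method; the support threshold and the drop-the-successor rule are our assumptions:

```python
from collections import Counter

def filter_infrequent_events(traces, min_support=0.2):
    """Drop events that only appear via rare directly-follows pairs
    (a toy sketch of frequency-based noise filtering)."""
    pairs = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            pairs[(a, b)] += 1
    total = sum(pairs.values()) or 1
    frequent = {p for p, c in pairs.items() if c / total >= min_support}
    kept = []
    for trace in traces:
        clean = [trace[0]] if trace else []
        for a, b in zip(trace, trace[1:]):
            if (a, b) in frequent:  # keep the successor only if the pair is common
                clean.append(b)
        kept.append(clean)
    return kept
```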
Authors: Belbekri, Adel | Benchikha, Fouzia | Slimani, Yahya | Marir, Naila
Article Type: Research Article
Abstract: Named Entity Recognition (NER) is an essential task in Natural Language Processing (NLP), and deep learning-based models have shown outstanding performance. However, the effectiveness of deep learning models in NER relies heavily on the quality and quantity of available labeled training datasets. A novel and comprehensive training dataset called SocialNER2.0 is proposed to address this challenge. Based on selected datasets dedicated to different NER-related tasks, the SocialNER2.0 construction process involves data selection, extraction, enrichment, conversion, and balancing steps. The pre-trained BERT (Bidirectional Encoder Representations from Transformers) model is fine-tuned using the proposed dataset. Experimental results highlight the superior performance of the fine-tuned BERT in accurately identifying named entities, demonstrating the SocialNER2.0 dataset’s capacity to provide valuable training data for performing NER in human-produced texts.
Keywords: Big data, deep learning, user-generated texts, text analysis, named entity recognition
DOI: 10.3233/IDA-230588
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-25, 2024
Authors: Hu, Haiping | Huo, Wei | Yan, Yingying | Zhu, Qiuyu
Article Type: Research Article
Abstract: In pattern recognition, most classification models are solved iteratively, except for linear LDA, KLDA, ELM, etc. In this paper, a nonlinear classification network model based on predefined evenly-distributed class centroids (PEDCC) is proposed. Its analytical solution can be obtained and has good interpretability. Using the characteristic of PEDCC that maximizes the inter-class distance, together with a derivative weighted minimum mean square error loss function that minimizes the intra-class distance, we can not only realize the effective nonlinearity of the network but also obtain the analytical solution of the network weights. Then, the sample is classified based on GDA. In order to further improve classification performance, PCA is used to reduce the dimensionality of the original samples; meanwhile, the CReLU activation function is adopted to enhance the expression ability of the features. The network transforms the samples into a higher-dimensional feature space through the weighted minimum mean square error, so as to find a better separating hyperplane. In experiments, the feasibility of the network structure is verified with purely linear 𝑾, 𝑾 + Tanh, and PCA + 𝑾 + Tanh on many small and large data sets, and compared with SVM and ELM in terms of training speed and recognition rate. The results show that, in general, this model has advantages on small data sets in both recognition accuracy and training speed, while it has advantages in training speed on large data sets. Finally, by introducing a multi-stage network structure based on the latent feature norm, the classifier network can further significantly improve classification performance: the recognition rate on small data sets is effectively improved and is much higher than that of existing methods, while the recognition rate on large data sets is similar to that of SVM.
Keywords: Pattern recognition, image classification, machine learning, GDA
DOI: 10.3233/IDA-230044
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-16, 2024
Authors: Cao, Jinhui | Di, Xiaoqiang | Liu, Xu | Xu, Rui | Li, Jinqing | Ren, Weiwu | Qi, Hui | Hu, Pengfei | Zhang, Kehan | Li, Bo
Article Type: Research Article
Abstract: Logs play an important role in anomaly detection, fault diagnosis, and trace checking of software and network systems. Log parsing, which converts each raw log line into a constant template and a variable parameter list, is a prerequisite for system security analysis. Traditional parsing methods utilizing specific rules can only parse logs of specific formats, and most parsing methods based on deep learning require labels. However, the existing parsing methods are not applicable to logs of inconsistent formats with insufficient labels. To address these issues, we propose a robust Log parsing method based on Self-supervised Learning (LogSL), which can extract templates from logs of different formats. The essential idea of LogSL is to model log parsing as a multi-token prediction task, which makes the multi-token prediction model learn the distribution of tokens belonging to the template in raw log lines in a self-supervised mode. Furthermore, to accurately predict the tokens of the template without labeled data, we construct a Multi-token Prediction Model (MPM) combining a pre-trained XLNet module, an n-layer stacked Long Short-Term Memory module, and a self-attention module. We validate LogSL on 12 benchmark log datasets; the average parsing accuracy of our parser is 3.9% higher than that of the best baseline method. Experimental results show that LogSL is superior in terms of robustness and accuracy. In addition, a case study of anomaly detection is conducted to demonstrate the support the proposed MPM provides to log-based system security tasks.
Keywords: System security, data analysis, log parsing, deep learning, self-supervised learning
DOI: 10.3233/IDA-230133
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-21, 2024
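The template-versus-parameter split that log parsing produces can be illustrated with a toy baseline: keep tokens that are constant across aligned log lines and mask the varying positions. This is a frequency heuristic for illustration only, unrelated to the LogSL model itself; the `<*>` placeholder follows common log-parsing convention:

```python
def extract_template(log_lines, placeholder="<*>"):
    """Infer a template from same-length log lines by masking the
    token positions whose values vary (a toy sketch)."""
    tokens = [line.split() for line in log_lines]
    assert len({len(t) for t in tokens}) == 1, "lines must align"
    template = []
    for position in zip(*tokens):  # iterate over token columns
        template.append(position[0] if len(set(position)) == 1 else placeholder)
    return " ".join(template)
```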
Authors: Nayancy, | Dutta, Sandip | Chakraborty, Soubhik
Article Type: Research Article
Abstract: Blockchain has attracted tremendous attention in recent years due to its significant features, including anonymity, security, immutability, and auditability. Blockchain technology has been used in several nonmonetary applications, including the Internet of Things. However, blockchain is resource-hungry, and its scalability is computationally expensive, resulting in delays and large bandwidth overhead that are unsuitable for many IoT devices. In this paper, we present a lightweight blockchain approach that is suited to IoT needs and provides end-to-end security. Decentralization is achieved in our lightweight blockchain implementation by building a network in which many high-resource devices collaborate to maintain the blockchain. The nodes in the network are arranged in sorted order with respect to execution time and count to reduce the mining overhead, and are accountable for handling the public blockchain. We propose a distributed execution-time-based consensus algorithm that decreases the delay and overhead of the mining process. We also propose a randomized node-selection algorithm for choosing the nodes that verify mined blocks, to eliminate double-spend and 51% attacks. The results are encouraging: mining overhead is significantly reduced, and the double-spending problem and 51% attack are kept in check.
Keywords: Blockchain, IoT, lightweight consensus, double-spend attack, 51% attack
DOI: 10.3233/IDA-230153
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-11, 2024
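The two ordering and selection steps the abstract describes can be sketched as follows. The field names (`exec_time`, `mined_count`) and the tie-breaking rule are illustrative assumptions, not the paper's exact algorithm:

```python
import random

def order_miners(nodes):
    """Sort candidate miners by measured execution time, breaking ties
    by how often each node has already mined (illustrative ordering)."""
    return sorted(nodes, key=lambda n: (n["exec_time"], n["mined_count"]))

def pick_verifiers(nodes, k, seed=None):
    """Randomly select k distinct verifier nodes, so that no fixed
    subset can collude to confirm a double-spend."""
    rng = random.Random(seed)
    return rng.sample(nodes, k)
```

Randomizing the verifier set is what makes a 51% attack harder: an attacker cannot predict which nodes will audit the mined block.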
Authors: Boullé, Marc
Article Type: Research Article
Abstract: Histograms are among the most popular methods used in exploratory analysis to summarize univariate distributions. In particular, irregular histograms are good non-parametric density estimators that require very few parameters: the number of bins with their lengths and frequencies. Although many approaches have been proposed in the literature to infer these parameters, most existing histogram methods are difficult to exploit for exploratory analysis in the case of real-world data sets, with scalability issues, truncated data, outliers, or heavy-tailed distributions. In this paper, we focus on the G-Enum histogram method, which exploits the Minimum Description Length (MDL) principle to build histograms without any user parameter. We then propose to extend this method by exploiting a new modeling space based on floating-point representation, with the objective of building histograms resistant to outliers and heavy-tailed distributions. We also suggest several heuristics and a methodology suitable for the exploratory analysis of large-scale real-world data sets whose underlying patterns are difficult to recover for digitization reasons. Extensive experiments show the benefits of the approach, evaluated with a dual objective: the accuracy of density estimation in the case of outliers or heavy-tailed distributions, and the effectiveness of the approach for exploratory data analysis.
Keywords: Density estimation, histograms, model selection, minimum description length, exploratory analysis
DOI: 10.3233/IDA-230638
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-48, 2024
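Why binning aligned with floating-point representation resists heavy tails can be seen in a crude form: grouping positive values by their binary exponent makes bin widths grow with magnitude, so a single huge outlier occupies one coarse bin instead of stretching every bin. This toy uses `math.frexp` and is far simpler than the paper's MDL-driven G-Enum criterion:

```python
import math
from collections import Counter

def exponent_bins(values):
    """Bin positive values by floor(log2(x)), computed exactly via
    frexp: x = m * 2**e with 0.5 <= m < 1, so the exponent is e - 1."""
    return Counter(math.frexp(v)[1] - 1 for v in values if v > 0)
```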
Authors: Shi, Xuefeng | Hu, Min | Ren, Fuji | Shi, Piao
Article Type: Research Article
Abstract: Active Learning (AL) is a technique widely employed to minimize the time and labor costs of annotating data. By querying and extracting specific instances to train the model, the relevant task’s performance is improved maximally within limited iterations. However, little work has been conducted to fully fuse features from different hierarchies to enhance the effectiveness of active learning. Inspired by the idea of information compensation in many famous deep learning models (such as ResNet), this work proposes a novel TextCNN-based Two-way Active Learning model (TCTWAL) to extract task-relevant texts. TextCNN takes advantage of requiring little hyper-parameter tuning and static vectors, and achieves excellent results on various natural language processing (NLP) tasks, which is also beneficial to human-computer interaction (HCI) and AL-relevant tasks. In the proposed AL model, candidate texts are measured from both global and local features by the proposed AL framework TCTWAL, which depends on the modified TextCNN. Besides, the query strategy is strongly enhanced by maximum normalized log-probability (MNLP), which is sensitive to detecting longer sentences. Additionally, the selected instances are characterized by general global information and abundant local features simultaneously. To validate the effectiveness of the proposed model, extensive experiments are conducted on three widely used text corpora, and the results are compared with those of eight manually designed instance query strategies. The results show that our method outperforms the planned baselines in terms of accuracy, macro precision, macro recall, and macro F1 score. In particular, for the classification results on the AG’s News corpus, the improvements in the four indicators after 39 iterations are 40.50%, 45.25%, 48.91%, and 45.25%, respectively.
Keywords: Active learning, TextCNN, maximum normalized log-probability, global information, local feature
DOI: 10.3233/IDA-230332
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-23, 2024
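The MNLP query criterion mentioned in the abstract is standard: average the per-token log probabilities so long sentences are not penalised merely for their length, then query the least-confident sentences. A minimal sketch (the model producing the log probabilities is assumed, not shown):

```python
import numpy as np

def mnlp_scores(log_probs_per_token):
    """Maximum Normalized Log-Probability: mean per-token log
    probability of each sentence; lower means less confident."""
    return np.array([np.mean(lp) for lp in log_probs_per_token])

def select_queries(log_probs_per_token, k):
    """Pick the k least-confident sentences under MNLP."""
    scores = mnlp_scores(log_probs_per_token)
    return np.argsort(scores)[:k].tolist()
```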
Authors: Zhong, Qing | Shao, Xinhui
Article Type: Research Article
Abstract: For the aspect-based sentiment analysis task, traditional works consider only the text modality. However, in social media scenarios, texts often contain abbreviations, clerical errors, or grammatical errors, which invalidate traditional methods. In this study, a cross-modal hierarchical interactive fusion network incorporating an end-to-end approach is proposed to address this challenge. In the network, a feature attention module and a feature fusion module are proposed to obtain the multimodal interaction feature between the image modality and the text modality. Through the attention mechanism and gated fusion mechanism, these two modules realize the auxiliary function of the image in the text-based aspect-based sentiment analysis task. Meanwhile, a boundary auxiliary module is used to explore the dependencies between the two core subtasks of aspect-based sentiment analysis. Experimental results on two publicly available multimodal aspect-based sentiment datasets validate the effectiveness of the proposed approach.
Keywords: Multimodal aspect-based sentiment analysis, hierarchical interactive fusion, multi-head interaction attention mechanism, gated mechanism
DOI: 10.3233/IDA-230305
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-16, 2024
Authors: Chen, Hongwei | Shi, Dewei | Zhou, Xun | Zhang, Man | Liu, Luanxuan
Article Type: Research Article
Abstract: Credit fraud is a common financial crime that causes significant economic losses to financial institutions. To address this issue, researchers have proposed various fraud detection methods. Recently, research on deep forests has opened up a new path for exploring deep models beyond neural networks. The deep forest combines the features of neural networks and ensemble learning, and has achieved good results in various fields. This paper studies the application of deep forests to the field of fraud detection and proposes a distributed dense rotation deep forest algorithm (DRDF-spark) based on an improved RotBoost. The model has three main characteristics. Firstly, it solves the problem that multi-granularity scanning faces on data lacking spatial correlation by introducing RotBoost. Secondly, Spark is used for parallel construction to improve the processing speed and efficiency of data. Thirdly, a pre-aggregation mechanism is added to the distributed algorithm to locally aggregate the statistical results of sub-forests in the same node in advance, improving communication efficiency. The experiments show that DRDF-spark performs better than deep forests and some mainstream ensemble learning algorithms on the fraud dataset in this paper, and the training speed is up to 3.53 times faster. Furthermore, if the number of nodes is increased further, the speedup ratio continues to increase.
Keywords: Deep forest, credit fraud detection, ensemble learning, RotBoost, spark
DOI: 10.3233/IDA-230193
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-25, 2024
Authors: Jiménez-Gaona, Yuliana | Rodríguez-Alvarez, María José | Escudero, Líder | Sandoval, Carlos | Lakshminarayanan, Vasudevan
Article Type: Research Article
Abstract: INTRODUCTION: Ultrasound, in conjunction with mammography imaging, plays a vital role in the early detection and diagnosis of breast cancer. However, speckle noise affects medical ultrasound images and degrades visual radiological interpretation. Speckle carries information about the interactions of the ultrasound pulse with the tissue microstructure, which generally causes several difficulties in identifying malignant and benign regions. The application of deep learning to image denoising has gained more attention in recent years. OBJECTIVES: The main objective of this work is to reduce speckle noise while preserving features and details in breast ultrasound images using GAN models. METHODS: We propose two GAN models (Conditional GAN and Wasserstein GAN) for speckle denoising on public breast ultrasound databases: BUSI (Dataset A) and UDIAT (Dataset B). The Conditional GAN model was trained using the U-Net architecture, and the WGAN model was trained using the ResNet architecture. Image quality for both algorithms was measured against the Peak Signal-to-Noise Ratio (PSNR, 35–40 dB) and Structural Similarity Index (SSIM, 0.90–0.95) standard values. RESULTS: The experimental analysis clearly shows that the Conditional GAN model achieves better breast ultrasound despeckling performance across the datasets, with PSNR = 38.18 dB and SSIM = 0.96, compared with the WGAN model (PSNR = 33.0068 dB and SSIM = 0.91) on the small ultrasound training datasets. CONCLUSIONS: The observed performance differences between CGAN and WGAN will help to better implement new tasks in a computer-aided detection/diagnosis (CAD) system. In future work, these data can be used as CAD training input for image classification, reducing overfitting and improving the performance and accuracy of deep convolutional algorithms.
Keywords: Breast cancer, ultrasound image denoising, generative adversarial network
DOI: 10.3233/IDA-230631
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-18, 2024
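The PSNR figures the abstract reports follow the standard definition, 10·log10(MAX²/MSE), which can be computed directly:

```python
import numpy as np

def psnr(reference, test, data_range=255.0):
    """Peak Signal-to-Noise Ratio in dB between a reference image and a
    denoised output; higher is better (identical images give infinity)."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)
```

With `data_range=255` for 8-bit images, the 35–40 dB range cited in the abstract corresponds to a per-pixel RMSE of roughly 2.5 to 4.5 grey levels.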
Authors: Göcs, László | Johanyák, Zsolt Csaba
Article Type: Research Article
Abstract: Intrusion detection systems (IDSs) are essential elements of IT systems. Their key component is a classification module that continuously evaluates certain features of the network traffic and identifies possible threats. Its efficiency is greatly affected by the right selection of the features to be monitored. Therefore, the identification of a minimal set of features that is necessary to safely distinguish malicious traffic from benign traffic is indispensable in the course of developing an IDS. This paper presents the preprocessing and feature selection workflow, as well as its results, in the case of the CSE-CIC-IDS2018 on AWS dataset, focusing on five attack types. To identify the relevant features, six feature selection methods were applied, and the final ranking of the features was elaborated based on their average scores. Next, several subsets of the features were formed based on different ranking threshold values, and each subset was tried with five classification algorithms to determine the optimal feature set for each attack type. During the evaluation, four widely used metrics were taken into consideration.
Keywords: Dataset preprocessing, dimension reduction, feature selection, classification, Python, CSE-CIC-IDS2018
DOI: 10.3233/IDA-230264
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-27, 2024
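The rank-averaging step described in the abstract (combine several feature-selection methods, then keep features under a rank threshold) can be sketched as follows. The score table and threshold are made up for illustration; the paper's exact scoring is not reproduced here:

```python
import numpy as np

def average_rank(score_table):
    """Average per-method feature ranks (rank 0 = best). Input is a
    (methods x features) array of scores, higher score = better."""
    score_table = np.asarray(score_table, dtype=float)
    # double argsort turns scores into ranks within each method's row
    ranks = np.argsort(np.argsort(-score_table, axis=1), axis=1)
    return ranks.mean(axis=0)

def select_features(score_table, threshold):
    """Keep feature indices whose average rank is at or below threshold."""
    avg = average_rank(score_table)
    return [i for i, r in enumerate(avg) if r <= threshold]
```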
Authors: Zhang, Shuo | Hu, Xingbang | Zhang, Wenbo | Chen, Jinyi | Huang, Hejiao
Article Type: Research Article
Abstract: For a modern Intelligent Transportation System (ITS), data missing during traffic raster acquisition is inevitable because of loop detector malfunction or signal interference. Nevertheless, missing data imputation is meaningful due to the periodic spatio-temporal characteristics and individual randomness of traffic raster data. In this paper, traffic raster data collected from all spatial regions at each time interval are considered as a multiple-channel image. Accordingly, the traffic raster data over a period of time can be regarded as a video, on which an unsupervised generative neural network called MSST-VAE (Multiple Streams Spatial Temporal-VAE) is proposed for traffic raster data imputation; this model performs robustly even at varied missing rates, where many other approaches fail. Two major innovations can be summarized in MSST-VAE. Firstly, it uses multiple periodic streams of Variational Auto-Encoders (VAEs) with Sylvester Normalizing Flows (SNFs), which shows strong generalization ability. Secondly, after the traffic raster data are transformed into videos, an ECB (Extraction-and-Calibration Block) consisting of dilated P3D gated convolution and a multi-horizon attention mechanism is employed to learn global-local-granularity spatial features and long-short-term temporal features. Extensive experiments on three real traffic flow datasets validate that MSST-VAE outperforms other classical traffic imputation models with the least imputation error.
Keywords: Intelligent transportation system, traffic raster data, data imputation
DOI: 10.3233/IDA-230091
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-22, 2024
Authors: Chen, Mingcai | Du, Yuntao | Tang, Wei | Zhang, Baoming | Wang, Chongjun
Article Type: Research Article
Abstract: Real-world machine learning applications seldom provide perfectly labeled data, posing a challenge in developing models robust to noisy labels. Recent methods prioritize noise filtering based on the discrepancies between model predictions and the provided noisy labels, assuming samples with minimal classification losses to be clean. In this work, we capitalize on the consistency between the learned model and the complete noisy dataset, employing the data’s rich representational and topological information. We introduce LaplaceConfidence, a method that obtains label confidence (i.e., clean probabilities) utilizing the Laplacian energy. Specifically, it first constructs graphs based on the feature representations of all noisy samples and minimizes the Laplacian energy to produce a low-energy graph. Clean labels should fit well into the low-energy graph while noisy ones should not, allowing our method to determine the data’s clean probabilities. Furthermore, LaplaceConfidence is embedded into a holistic method for robust training, where a co-training technique generates unbiased label confidence and a label refurbishment technique better utilizes it. We also explore a dimensionality reduction technique to accommodate our method on large-scale noisy datasets. Our experiments demonstrate that LaplaceConfidence outperforms state-of-the-art methods on benchmark datasets under both synthetic and real-world noise. Code is available at https://github.com/chenmc1996/LaplaceConfidence.
Keywords: Learning with noisy labels, graph energy, label refurbishment
DOI: 10.3233/IDA-230818
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-17, 2024
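The Laplacian (Dirichlet) energy that the abstract builds on is yᵀLy with L = D − W: it is small exactly when strongly connected nodes share labels, which is why clean labels "fit" a low-energy graph and noisy ones do not. A minimal illustration of the criterion, not the full LaplaceConfidence method:

```python
import numpy as np

def laplacian_energy(adjacency, labels):
    """Compute y^T L y for a symmetric adjacency matrix W, where
    L = D - W; equals half the weighted sum of squared label
    differences across edges."""
    adjacency = np.asarray(adjacency, dtype=float)
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    y = np.asarray(labels, dtype=float)
    return float(y @ laplacian @ y)
```

On a path graph 0–1–2, labeling the two connected ends alike costs less energy than a labeling that flips across both edges, mirroring how the method scores label assignments.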
Authors: Fabra-Boluda, Raül | Ferri, Cèsar | Hernández-Orallo, José | Ramírez-Quintana, M. José | Martínez-Plumed, Fernando
Article Type: Research Article
Abstract: The quest for transparency in black-box models has gained significant momentum in recent years. In particular, discovering the underlying machine learning technique type (or model family) from the performance of a black-box model is an important problem, both for better understanding its behaviour and for developing strategies to attack it by exploiting the weaknesses intrinsic to the learning technique. In this paper, we tackle the challenging task of identifying which kind of machine learning model is behind the predictions when we interact with a black-box model. Our innovative method involves systematically querying a black-box model (oracle) to label an artificially generated dataset, which is then used to train different surrogate models using machine learning techniques from different families (each one trying to partially approximate the oracle’s behaviour). We present two approaches based on similarity measures, one selecting the most similar family and the other using a conveniently constructed meta-model. In both cases, we use both crisp and soft classifiers and their corresponding similarity metrics. By experimentally comparing all these methods, we gain valuable insights into the explanatory and predictive capabilities of our model-family concept. This provides a deeper understanding of black-box models and increases their transparency and interpretability, paving the way for more effective decision making.
Keywords: Machine learning, family identification, adversarial, black-box, surrogate models
DOI: 10.3233/IDA-230707
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-21, 2024
Authors: Liu, Zhao | Wang, Aimin | Bao, Haiming | Zhang, Kunpeng | Wu, Jing | Sun, Geng | Li, Jiahui
Article Type: Research Article
Abstract: The goal of feature selection in machine learning is to maintain classification accuracy while reducing a large number of attributes. In this paper, we first design a fitness function that achieves both objectives jointly. Then we propose a chaos-based binary dragonfly algorithm (CBDA) that incorporates several improvements over the conventional dragonfly algorithm (DA), yielding a wrapper-based feature selection method that optimizes the fitness function. Specifically, the CBDA innovatively introduces three improved factors, namely a chaotic map, an evolutionary population dynamics (EPD) mechanism, and a binarization strategy, on the basis of the conventional DA, to balance the exploitation and exploration capabilities of the algorithm and make it more suitable for the formulated problem. We conduct experiments on 24 well-known data sets from the UCI repository, with three ablated versions of CBDA targeting different components of the algorithm in order to explain their contributions to CBDA, and with five established comparative algorithms, in terms of fitness value, classification accuracy, CPU running time, and number of selected features. The results show that the proposed CBDA has remarkable advantages on most of the tested data sets.
Keywords: Feature selection, dragonfly algorithm, chaos, evolutionary population dynamics, classification accuracy
DOI: 10.3233/IDA-230540
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-36, 2024
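A fitness function that trades classification error against the number of selected features is commonly written as a weighted sum; a generic sketch follows (the weight alpha=0.99 is a conventional choice in wrapper-based feature selection, not necessarily the paper's exact value):

```python
def fitness(error_rate, n_selected, n_total, alpha=0.99):
    """Wrapper feature-selection fitness: alpha weights classification
    error against the selected-feature ratio; lower is better."""
    return alpha * error_rate + (1.0 - alpha) * (n_selected / n_total)
```

With this form, two subsets of equal accuracy are separated by compactness: the one using fewer features scores lower, which is the joint objective the abstract describes.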
Authors: Feng, Zhuo | Du, Yajun | Huang, Jiaming | Li, Xianyong | Chen, Xiaoliang | Xie, Chunzhi
Article Type: Research Article
Abstract: Large-scale studies indicate that the distinct approach to opinion fusion employed by extreme agents exerts a more potent influence on overall opinion evolution than that of regular agents. The presence of extreme agents within a network tends to undermine the development of opinion neutrality, which is harmful to the guidance of online public opinion. Notably, prior research often overlooks the existence of extreme agents in social networks, and existing research seldom considers the time sunk cost in the evolution of opinions. Building upon this foundation, we introduce a temporal dimension to opinion evolution, integrating the time sunk cost with the opinion evolution process. Furthermore, we devise an agent partitioning method that categorizes agents into four states based on their opinion values: the watch state, subjective state, firm state, and extreme state, with extreme-state agents generally expressing radical opinions. We construct an agent network based on the phenomenon of time sunk costs and propose a model for the evolution of extreme opinions in this network. Our study finds that information sharing among extreme agents significantly influences the extremization of opinions in various networks. After restricting the exchange of opinions by extreme agents, the number of extreme agents in the network decreased by 40% to 50% compared to the initial situation. Additionally, we discovered that imposing restrictions on extreme agents in the early stages can help increase the possibility of network opinions moving towards neutral positions. When the restriction of extreme agents (REA) was applied at the beginning of the experiment rather than midway through, the final number of extreme-state agents decreased by 15.57%. The results show that extreme agents have a great influence on the spread and evolution of extreme opinions on platforms.
Keywords: Time sunk costs, extremists, opinion dynamics, bounded confidence model, social networks, opinion evolution
DOI: 10.3233/IDA-230677
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-20, 2024
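The bounded confidence model named in the keywords can be sketched in its generic form: each agent only averages with neighbours whose opinions lie within a confidence bound epsilon. The paper adds time sunk costs and a four-state partition on top of models of this kind; the parameters below are illustrative:

```python
import numpy as np

def bounded_confidence_step(opinions, epsilon=0.2, mu=0.5):
    """One synchronous update of a bounded-confidence model: each agent
    moves a fraction mu toward the mean opinion of agents within
    distance epsilon of its own opinion (itself included)."""
    opinions = np.asarray(opinions, dtype=float)
    new = opinions.copy()
    for i, x in enumerate(opinions):
        close = opinions[np.abs(opinions - x) <= epsilon]
        new[i] = x + mu * (close.mean() - x)
    return new
```

Note how an agent at opinion 1.0 with no neighbours inside its bound never moves: this is exactly the stubbornness that makes extreme agents so influential in such models.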
Authors: Zhang, Fei | Chan, Patrick P.K. | He, Zhi-Min | Yeung, Daniel S.
Article Type: Research Article
Abstract: A recommender system is susceptible to manipulation through the injection of carefully crafted profiles. Some recent profile identification methods perform well only in specific attack scenarios. A general attack detection method is usually complicated or requires labeled samples; such methods are prone to overtraining, and the annotation process incurs high expenses. This study proposes an unsupervised divide-and-conquer method for identifying attack profiles, utilizing a specifically designed model for each kind of shilling attack. Initially, our method categorizes the profile set into two attack types, namely Standard and Obfuscated Behavior Attacks. Subsequently, profiles are separated into clusters within the extracted feature space based on the identified attack type. The selection of attack profiles is then determined through target item analysis within the suspected cluster. Notably, our method offers the advantage of requiring no prior knowledge or annotation. Furthermore, precision is heightened because the identification method is designed for a specific attack type, employing a less complicated model. The outstanding performance of our model, validated through experimental results on MovieLens-100K and Netflix under various attack settings, demonstrates superior accuracy and reduced running time compared to current detection methods in identifying Standard and Obfuscated Behavior Attacks.
Keywords: PCA, item popularity, shilling attack detection, divide-and-conquer method
DOI: 10.3233/IDA-230575
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-16, 2024
Authors: Tran, Le-Anh | Kwon, Daehyun | Deberneh, Henock Mamo | Park, Dong-Chul
Article Type: Research Article
Abstract: This paper proposes a data clustering algorithm inspired by the prominent convergence property of the Projection onto Convex Sets (POCS) method, termed the POCS-based clustering algorithm. For disjoint convex sets, the simultaneous-projection form of the POCS method can yield a minimum mean square error solution. Relying on this important property, the proposed POCS-based clustering algorithm treats each data point as a convex set and simultaneously projects the cluster prototypes onto their respective member data points; the projections are convexly combined via adaptive weight values in order to minimize a predefined objective function for data clustering purposes. The performance of the proposed POCS-based clustering algorithm has been verified through large-scale experiments on various data sets. The experimental results show that the proposed POCS-based algorithm is competitive in terms of both effectiveness and efficiency with prevailing clustering approaches such as the K-Means/K-Means++ and Fuzzy C-Means (FCM) algorithms. Based on extensive comparisons and analyses, we can confirm the validity of the proposed POCS-based clustering algorithm for practical purposes.
Keywords: POCS, convex sets, clustering algorithm, unsupervised learning, machine learning
DOI: 10.3233/IDA-230655
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-18, 2024
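The projection-and-combine step described in the abstract can be sketched as follows. Since projecting a prototype onto a single-point convex set simply returns that point, each prototype update is a convex combination of its member points; the paper's adaptive weights are not specified in the abstract, so inverse-distance weights are assumed here purely for illustration.

```python
import numpy as np

def pocs_cluster(X, protos, n_iter=50):
    """Minimal sketch of a POCS-style clustering loop. X is (n, d) data,
    protos is (k, d) initial prototypes. The inverse-distance weighting is
    an assumption; the paper derives its own adaptive weights."""
    protos = protos.astype(float).copy()
    for _ in range(n_iter):
        # assign each point to its nearest prototype
        d = np.linalg.norm(X[:, None, :] - protos[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(len(protos)):
            members = X[labels == j]
            if len(members) == 0:
                continue
            # projection of a prototype onto a one-point convex set is the
            # point itself; combine the projections convexly
            w = 1.0 / (np.linalg.norm(members - protos[j], axis=1) + 1e-9)
            w /= w.sum()
            protos[j] = w @ members
    return protos, labels
```

Structurally this resembles K-Means with a non-uniform averaging step, which matches the abstract's claim that the method is competitive with K-Means/FCM.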
Authors: Huang, Jiaming | Li, Xianyong | Li, Qizhi | Du, Yajun | Fan, Yongquan | Chen, Xiaoliang | Huang, Dong | Wang, Shumin
Article Type: Research Article
Abstract: Emojis in texts provide substantial additional information for sentiment analysis. Previous implicit sentiment analysis models have primarily treated emojis as unique tokens or deleted them outright, thereby ignoring the explicit sentiment information carried by emojis. Considering the different relationships between emoji descriptions and texts, we propose a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model with emojis (BEMOJI) for Chinese and English sentiment analysis. At the pre-training stage, we pre-train BEMOJI by predicting emoji descriptions from the corresponding texts via prompt learning. At the fine-tuning stage, we propose a fusion layer that fuses text representations and emoji descriptions into fused representations, which are used to predict text sentiment orientations. Experimental results show that BEMOJI achieves the highest accuracy (91.41% and 93.36%), Macro-precision (91.30% and 92.85%), Macro-recall (90.66% and 93.65%) and Macro-F1-measure (90.95% and 93.15%) on the Chinese and English datasets, respectively. On average, BEMOJI outperforms emoji-based methods by 29.92% and 24.60%, and transformer-based methods by 3.76% and 5.81%, on the Chinese and English datasets, respectively. An ablation study verifies that the emoji descriptions and the fusion layer play a crucial role in BEMOJI. A robustness study further shows that BEMOJI achieves results comparable to BERT on four emoji-free sentiment analysis tasks, indicating that BEMOJI is a robust model. Finally, a case study shows that BEMOJI outputs more reasonable emojis than BERT.
Keywords: Pre-trained language model, emoji sentiment analysis, implicit sentiment analysis, prompt learning, multi-feature fusion
DOI: 10.3233/IDA-230864
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-25, 2024
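The abstract says only that a fusion layer combines text and emoji-description representations; one common design for such a layer is a learned sigmoid gate, sketched below. The gate formulation is an assumption for illustration, not BEMOJI's actual layer.

```python
import numpy as np

def gated_fusion(text_vec, emoji_vec, W, b):
    """Illustrative gated fusion of a text representation with an
    emoji-description representation. W is (d, 2d), b is (d,); the sigmoid
    gate decides, per dimension, how much of each source to keep. This is
    an assumed, generic design, not the paper's exact fusion layer."""
    z = np.concatenate([text_vec, emoji_vec])
    gate = 1.0 / (1.0 + np.exp(-(W @ z + b)))  # elementwise sigmoid
    return gate * text_vec + (1.0 - gate) * emoji_vec
```

With zero-initialized parameters the gate is 0.5 everywhere, so the fused vector starts as the mean of the two inputs and training moves it from there.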
Authors: Noronha, Marta D.M. | Zárate, Luis E.
Article Type: Research Article
Abstract: Characterizing longevity profiles from longitudinal studies is a task with many challenges. Longitudinal databases usually have high dimensionality, and the similarities between long-lived and non-long-lived records make profile characterization highly burdensome. Addressing these issues, in this work we use data from the English Longitudinal Study of Ageing (ELSA-UK) to characterize longevity profiles through data mining. We propose a feature engineering method that reduces data dimensionality through merging techniques, factor analysis and biclustering, applying biclustering to select the relevant features that discriminate the two profiles. Two classification models, one based on a decision tree and the other on a random forest, are built from the preprocessed dataset. Experiments show that our methodology can successfully discriminate longevity profiles, and we identify insights into the features that contribute to individuals being long-lived or non-long-lived. According to the results of both models, the main factor impacting longevity is the correlation between the economic situation and the mobility of the elderly. Since this factor is deemed relevant for profile classification, we suggest that the methodology can be applied to identify longevity profiles in other longitudinal studies.
Keywords: Longitudinal data mining, human ageing, biclustering, factor analysis, classification
DOI: 10.3233/IDA-230314
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-24, 2024
Authors: Fan, Zeping | Zhang, Xuejun | Huang, Min | Bu, Zhaohui
Article Type: Research Article
Abstract: The recently introduced Convolution-augmented Transformer (Conformer) model has attained state-of-the-art (SOTA) results in Automatic Speech Recognition (ASR). In this paper, a series of methodical investigations uncovers that the Conformer's design decisions may not be the most efficient choices under a limited computational budget. After a thorough re-evaluation of the Conformer architecture's design choices, we propose Sampleformer, which reduces the Conformer's architectural complexity and delivers more robust performance. We introduce downsampling into the Conformer encoder and, to exploit the information in the speech features, incorporate an additional downsampling module to enhance the efficiency and accuracy of our model. Additionally, we propose a novel and adaptable attention mechanism called multi-group attention, which effectively reduces the attention complexity from O(n^2·d) to O(n^2·d·f/g). In experiments on the AISHELL-1 corpus, our 13.3 million-parameter CTC model demonstrates a 3.0%/2.6% relative reduction in character error rate (CER) on the dev/test sets, all without the use of a language model (LM). The model also exhibits 30% faster inference than our CTC Conformer baseline and trains 27% faster.
Keywords: Speech recognition, conformer, attention mechanism, complexity reduction
DOI: 10.3233/IDA-230612
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-13, 2024
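The abstract does not detail multi-group attention, but the complexity factor f/g suggests sharing work across groups of heads. The sketch below illustrates one grouped-attention variant in that spirit: heads within a group share a single attention map, so score computation shrinks by a factor heads/groups. This is an illustrative analogue only, not the Sampleformer mechanism itself.

```python
import numpy as np

def grouped_attention(Q, K, V, groups):
    """Grouped attention sketch. Q, K, V are (heads, n, d). All heads in a
    group share one attention map computed from the group's mean query/key,
    so only `groups` score matrices are formed instead of `heads`.
    Assumed design for illustration; the paper's multi-group attention
    may differ."""
    h, n, d = Q.shape
    out = np.empty_like(V)
    per = h // groups
    for g in range(groups):
        hs = slice(g * per, (g + 1) * per)
        q, k = Q[hs].mean(axis=0), K[hs].mean(axis=0)   # shared (n, d)
        scores = q @ k.T / np.sqrt(d)                   # one map per group
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)              # row-wise softmax
        out[hs] = w @ V[hs]   # broadcast shared weights over group's heads
    return out
```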
Authors: Liu, Xiaoyang | Wu, Yudie | Fiumara, Giacomo | De Meo, Pasquale
Article Type: Research Article
Abstract: Traditional community detection models either ignore feature-space information and require a large amount of domain knowledge to define meta-paths manually, or fail to distinguish the importance of different meta-paths. To overcome these limitations, we propose a novel heterogeneous graph community detection method called KGNN_HCD (heterogeneous graph Community Detection based on a K-nearest neighbor Graph Neural Network). First, a similarity matrix is generated to construct the topological structure of the K-nearest neighbor graph; second, the meta-path information matrix is generated by a meta-path transformation layer (Mp-Trans Layer) with weighted convolution; finally, a graph convolutional network (GCN) learns high-quality node representations, and the k-means algorithm is applied to the node embeddings to detect the community structure. We perform extensive experiments on three heterogeneous datasets, ACM, DBLP and IMDB, against 11 competing community detection methods such as CP-GNN and GTN. The experimental results show that the proposed KGNN_HCD method improves NMI and ARI by 2.54% and 2.56% on the ACM dataset, 2.59% and 1.47% on the DBLP dataset, and 1.22% and 1.67% on the IMDB dataset, respectively. These findings suggest that KGNN_HCD is reasonable and effective, and that it can be applied to complex network classification and clustering tasks.
Keywords: Heterogeneous graph, meta-path, K-nearest neighbor graph, graph neural network, community detection
DOI: 10.3233/IDA-230356
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-22, 2024
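The first stage of the pipeline described in the abstract, building a K-nearest-neighbor graph from a similarity matrix, can be sketched as below. The GCN and k-means stages are omitted; the symmetrization rule (keep an edge if either endpoint selected it) is a common convention assumed here.

```python
import numpy as np

def knn_graph(S, k):
    """Build a symmetric k-nearest-neighbor adjacency matrix from a
    similarity matrix S (n x n, larger = more similar)."""
    n = S.shape[0]
    A = np.zeros_like(S)
    for i in range(n):
        # indices of the k most similar nodes, excluding the node itself
        order = np.argsort(-S[i])
        neigh = [j for j in order if j != i][:k]
        A[i, neigh] = 1.0
    # symmetrize: an edge survives if either endpoint chose it
    return np.maximum(A, A.T)
```

The resulting adjacency matrix is what a GCN layer would then consume (typically after adding self-loops and degree normalization).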
Authors: Yuan, Wei | Zhao, Shiyu | Wang, Li | Cai, Lijia | Zhang, Yong
Article Type: Research Article
Abstract: In the post-epidemic era, online learning has gained increasing attention due to advances in information and big data technology, producing large-scale online course data with diverse student behaviors. Online data mining has become a popular and important way of extracting valuable insights from large amounts of data. However, previous online course analysis methods often focused on individual aspects of the data and neglected the correlations among large-scale learning behavior data, which can lead to an incomplete understanding of the overall learning behavior and patterns within an online course. To solve these problems, this paper proposes an online course evaluation model based on a graph auto-encoder. In our method, the features of the collected online course data are used to construct K-Nearest Neighbor (KNN) graphs that represent the associations among courses. A variational graph auto-encoder (VGAE) is then introduced to learn useful implicit features, which are fed into unsupervised and semi-supervised downstream tasks for online course evaluation, respectively. We conduct experiments on two datasets. In the clustering task, our method shows a more than tenfold increase in the Calinski-Harabasz index compared to unoptimized features, demonstrating significant structural distinction and group coherence. In the classification task, our model outperforms traditional methods by about 10% overall, indicating its effectiveness in handling complex network data.
Keywords: Educational data mining, online course evaluation, deep learning, graph auto-encoder
DOI: 10.3233/IDA-230557
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-23, 2024
Authors: K, Subha | N, Bharathi
Article Type: Research Article
Abstract: In today’s digital era, the generation and sharing of information are expanding rapidly, and the resulting volume of complex data constitutes big data, of which YouTube is a primary source. The proliferation of the internet and smart devices has led to a significant increase in content creators across social media platforms, with YouTube among the foremost platforms for content generation and sharing. YouTubers face challenges in refining their content strategies because of the growing volume of comments on shared videos: reading through such a large amount of data manually to find viewers' opinions is time-consuming and makes it hard to gauge people's sentiments. To address this, Spark-based machine learning algorithms have emerged as a transformative tool for content creators to understand their audience. The proposed Improved Novel Ensemble Method (INEM) algorithm predicts viewers' sentiments and emotional responses to content from their comments. The results provide valuable insights that help content creators refine their strategies to optimize a channel's revenue and performance. As a case study, the Fit Tuber channel is analyzed for the sentiment of its user comments.
Keywords: Big data, sentiment analysis, machine learning, social-media, spark
DOI: 10.3233/IDA-240198
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-11, 2024
Authors: Gupta, Ayushi | Chug, Anuradha | Singh, Amit Prakash
Article Type: Research Article
Abstract: PURPOSE: Crop diseases can cause significant reductions in yield, subsequently impacting a country's economy. The current research concentrates on detecting diseases in three crops – tomatoes, soybeans, and mushrooms – using a real-time dataset collected for tomatoes and two publicly accessible datasets for the other crops. The primary emphasis is on datasets with exclusively categorical attributes, which pose a notable challenge to the research community. METHODS: After label encoding of the attributes, the datasets undergo four distinct preprocessing techniques to address missing values. The SMOTE-N technique is then employed to tackle class imbalance. Subsequently, the preprocessed datasets are classified using three ensemble methods: bagging, boosting, and voting. To further refine the classification process, the metaheuristic Ant Lion Optimizer (ALO) is used for hyper-parameter tuning. RESULTS: This comprehensive approach yields twelve distinct models. The top two performers are further validated on ten standard categorical datasets. The findings demonstrate that the hybrid model II-SN-OXGB surpasses all other models, as well as the current state of the art, in classification accuracy across all thirteen categorical datasets. II uses the Random Forest classifier to iteratively impute missing feature values with a nearest-features strategy; SMOTE-N (SN) serves as an oversampling technique for categorical attributes, again based on nearest neighbors; and ALO-optimized Xtreme Gradient Boosting (OXGB) sequentially trains multiple decision trees, each correcting the errors of its predecessor. CONCLUSION: The model II-SN-OXGB thus emerges as the optimal choice for classification on categorical datasets.
Applying the II-SN-OXGB model to crop datasets can significantly enhance disease detection, enabling farmers to take timely, appropriate measures to prevent yield losses and mitigate the economic impact of crop diseases.
Keywords: Categorical data, ensemble methods, missing values imputation, metaheuristic optimization, plant disease
DOI: 10.3233/IDA-230651
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-25, 2024
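The SMOTE-N oversampling step mentioned in the abstract, designed for purely categorical attributes, can be sketched as below: pick a minority sample, find its nearest neighbors under simple matching distance, and synthesize a new sample by per-feature majority vote among those neighbors. This is a simplified illustration of the idea, not the paper's exact procedure.

```python
import numpy as np
from collections import Counter

def smoten_sample(minority, k=3, seed=0):
    """Synthesize one sample from a (n, f) array of categorical minority
    samples. Simple matching distance = number of differing features;
    the new sample takes the majority value of each feature among the k
    nearest neighbors. Simplified sketch; ties break arbitrarily."""
    rng = np.random.default_rng(seed)
    i = rng.integers(len(minority))
    # simple matching distance to every minority sample
    d = (minority != minority[i]).sum(axis=1)
    neigh = np.argsort(d)[1:k + 1]  # drop the zero-distance entry at front
    synth = [Counter(col).most_common(1)[0][0]
             for col in minority[neigh].T]  # per-feature majority vote
    return np.array(synth)
```

A library implementation of this idea exists as `SMOTEN` in imbalanced-learn, which would normally be preferred over a hand-rolled version.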
IOS Press, Inc.
6751 Tepper Drive
Clifton, VA 20124
USA
Tel: +1 703 830 6300
Fax: +1 703 830 2300
[email protected]
For editorial issues, like the status of your submitted paper or proposals, write to [email protected]
IOS Press
Nieuwe Hemweg 6B
1013 BG Amsterdam
The Netherlands
Tel: +31 20 688 3355
Fax: +31 20 687 0091
[email protected]
For editorial issues, permissions, book requests, submissions and proceedings, contact the Amsterdam office [email protected]
Inspirees International (China Office)
Ciyunsi Beili 207(CapitaLand), Bld 1, 7-901
100025, Beijing
China
Free service line: 400 661 8717
Fax: +86 10 8446 7947
[email protected]
For editorial issues, like the status of your submitted paper or proposals, write to [email protected]
If you need help with publishing or have any suggestions, please email: [email protected]