Impact Factor 2023: 1.7
Intelligent Data Analysis provides a forum for the examination of issues related to the research and applications of Artificial Intelligence techniques in data analysis across a variety of disciplines. These techniques include (but are not limited to): all areas of data visualization, data pre-processing (fusion, editing, transformation, filtering, sampling), data engineering, database mining techniques, tools and applications, use of domain knowledge in data analysis, big data applications, evolutionary algorithms, machine learning, neural nets, fuzzy logic, statistical pattern recognition, knowledge filtering, and post-processing.
In particular, preference is given to papers that discuss the development of new AI-related data analysis architectures, methodologies, and techniques and their applications to various domains.
Papers published in this journal are geared heavily towards applications, with an anticipated split of 70% of the papers published being applications-oriented research and the remaining 30% containing more theoretical research. Manuscripts should be submitted in *.pdf format only. Please prepare your manuscripts in single space, and include figures and tables in the body of the text where they are referred to. For all enquiries regarding the submission of your manuscript please contact the IDA journal editor: [email protected]
Authors: Zhou, Rucheng | Zhang, Dongmei | Zhu, Jiabao | Min, Geyong
Article Type: Research Article
Abstract: Traffic forecasting has become a core component of Intelligent Transportation Systems. However, accurate traffic forecasting is very challenging because of the complexity of traffic road networks. Most existing forecasting methods do not fully consider the topological structure information of road networks, making it difficult to extract accurate spatial features. In addition, spatial and temporal features have different impacts on traffic conditions, but existing studies ignore the distribution of spatial-temporal features in traffic regions. To address these limitations, we propose a novel graph neural network architecture named Attention-based Spatial-Temporal Adaptive Integration Gated Network (AST-AIGN). The originality of AST-AIGN is to obtain a spatial feature that more accurately reflects the topological structure of road networks by embedding Graph Attention Network (GAT) into Jumping Knowledge Net (JK-Net). We propose a data-dependent function called the spatial-temporal adaptive integration gate to process the diversity of feature distributions and highlight features in road networks that significantly affect traffic conditions. We evaluate our model on two real-world traffic datasets from the Caltrans Performance Measurement System (PEMS04 and PEMS08), and extensive experimental results demonstrate that the proposed AST-AIGN architecture outperforms other baselines.
Keywords: Traffic forecasting, spatial-temporal dependences, jumping knowledge, gating mechanism, self-attention
DOI: 10.3233/IDA-230101
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-25, 2024
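The two building blocks named in the abstract, GAT-style attention normalisation and jumping-knowledge aggregation, can be sketched generically in NumPy. This is a minimal illustration of the standard techniques, not the authors' AST-AIGN code; the function names are ours:

```python
import numpy as np

def jumping_knowledge_max(layer_reprs):
    """Aggregate per-layer node representations by element-wise max,
    as in JK-Net's max-pooling variant (a generic sketch)."""
    stacked = np.stack(layer_reprs, axis=0)  # (layers, nodes, features)
    return stacked.max(axis=0)               # (nodes, features)

def attention_coefficients(scores):
    """Row-wise softmax over neighbour scores: the normalisation GAT
    applies before combining neighbour features."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

In JK-Net, aggregating across layers lets each node pick the neighbourhood range that suits it, which is what the abstract credits for the more faithful spatial features.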
Authors: Yu, Dongjin | Ni, Ke | Li, Zhongyang | Zhang, Shengyi | Sun, Xiaoxiao | Hou, Wenjie | Ying, Yuke
Article Type: Research Article
Abstract: Process discovery techniques analyze process logs to extract models that characterize the behavior of business processes. In real-life logs, however, noise exists and adversely affects the extraction, thus decreasing the understandability of discovered models. In this paper, we propose a novel double-granularity filtering method, executed on both the event and trace levels, to detect noise by analyzing the directly-following and parallel relations between events. Based on the probability of an event occurring in a sequence, the infrequent behaviors and redundant events in the logs can be filtered out. In addition, the missing events in parallel blocks are detected to further improve the performance of filtering. Experiments on synthetic logs and five real-life datasets demonstrate that our method significantly outperforms other state-of-the-art methods.
Keywords: Process discovery, process mining, event logs, noise filtering, event dependency, parallel relation
DOI: 10.3233/IDA-230118
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-18, 2024
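The core idea of filtering events by the frequency of their directly-follows relations can be sketched as follows. This is a simplified, generic frequency filter, not the paper's double-granularity method; the support threshold and the drop-the-successor rule are our assumptions:

```python
from collections import Counter

def filter_infrequent_events(traces, min_support=0.2):
    """Drop events that only appear via rare directly-follows pairs
    (a toy sketch of frequency-based noise filtering)."""
    pairs = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            pairs[(a, b)] += 1
    total = sum(pairs.values()) or 1
    frequent = {p for p, c in pairs.items() if c / total >= min_support}
    kept = []
    for trace in traces:
        clean = [trace[0]] if trace else []
        for a, b in zip(trace, trace[1:]):
            if (a, b) in frequent:  # keep the successor only if the pair is common
                clean.append(b)
        kept.append(clean)
    return kept
```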
Authors: Belbekri, Adel | Benchikha, Fouzia | Slimani, Yahya | Marir, Naila
Article Type: Research Article
Abstract: Named Entity Recognition (NER) is an essential task in Natural Language Processing (NLP), and deep learning-based models have shown outstanding performance. However, the effectiveness of deep learning models in NER relies heavily on the quality and quantity of available labeled training datasets. A novel and comprehensive training dataset called SocialNER2.0 is proposed to address this challenge. Based on selected datasets dedicated to different NER-related tasks, the SocialNER2.0 construction process involves data selection, extraction, enrichment, conversion, and balancing steps. The pre-trained BERT (Bidirectional Encoder Representations from Transformers) model is fine-tuned using the proposed dataset. Experimental results highlight the superior performance of the fine-tuned BERT in accurately identifying named entities, demonstrating the SocialNER2.0 dataset’s capacity to provide valuable training data for performing NER in human-produced texts.
Keywords: Big data, deep learning, user-generated texts, text analysis, named entity recognition
DOI: 10.3233/IDA-230588
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-25, 2024
Authors: Hu, Haiping | Huo, Wei | Yan, Yingying | Zhu, Qiuyu
Article Type: Research Article
Abstract: In pattern recognition, most classification models are solved iteratively, except for linear LDA, KLDA, ELM, etc. In this paper, a nonlinear classification network model based on predefined evenly-distributed class centroids (PEDCC) is proposed. Its analytical solution can be obtained and has good interpretability. Using the characteristic of PEDCC that maximizes the inter-class distance, together with a derivative weighted minimum mean square error loss function that minimizes the intra-class distance, we can not only realize the effective nonlinearity of the network but also obtain the analytical solution of the network weights. Then, the sample is classified based on GDA. In order to further improve classification performance, PCA is used to reduce the dimensionality of the original samples; meanwhile, the CReLU activation function is adopted to enhance the expression ability of the features. The network transforms the samples into a higher-dimensional feature space through the weighted minimum mean square error, so as to find a better separating hyperplane. In experiments, the feasibility of the network structure is verified with purely linear 𝑾, 𝑾 + Tanh, and PCA + 𝑾 + Tanh on many small and large data sets, and compared with SVM and ELM in terms of training speed and recognition rate. The results show that, in general, this model has advantages on small data sets in both recognition accuracy and training speed, while it has advantages in training speed on large data sets. Finally, by introducing a multi-stage network structure based on the latent feature norm, the classifier network can further significantly improve classification performance: the recognition rate on small data sets is effectively improved and is much higher than that of existing methods, while the recognition rate on large data sets is similar to that of SVM.
Keywords: Pattern recognition, image classification, machine learning, GDA
DOI: 10.3233/IDA-230044
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-16, 2024
Authors: Cao, Jinhui | Di, Xiaoqiang | Liu, Xu | Xu, Rui | Li, Jinqing | Ren, Weiwu | Qi, Hui | Hu, Pengfei | Zhang, Kehan | Li, Bo
Article Type: Research Article
Abstract: Logs play an important role in anomaly detection, fault diagnosis, and trace checking of software and network systems. Log parsing, which converts each raw log line into a constant template and a variable parameter list, is a prerequisite for system security analysis. Traditional parsing methods utilizing specific rules can only parse logs of specific formats, and most parsing methods based on deep learning require labels. However, the existing parsing methods are not applicable to logs of inconsistent formats with insufficient labels. To address these issues, we propose a robust Log parsing method based on Self-supervised Learning (LogSL), which can extract templates from logs of different formats. The essential idea of LogSL is to model log parsing as a multi-token prediction task, which makes the multi-token prediction model learn the distribution of tokens belonging to the template in raw log lines in a self-supervised mode. Furthermore, to accurately predict the tokens of the template without labeled data, we construct a Multi-token Prediction Model (MPM) combining a pre-trained XLNet module, an n-layer stacked Long Short-Term Memory module, and a self-attention module. We validate LogSL on 12 benchmark log datasets; the average parsing accuracy of our parser is 3.9% higher than that of the best baseline method. Experimental results show that LogSL is superior in terms of robustness and accuracy. In addition, a case study of anomaly detection is conducted to demonstrate the support the proposed MPM provides to log-based system security tasks.
Keywords: System security, data analysis, log parsing, deep learning, self-supervised learning
DOI: 10.3233/IDA-230133
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-21, 2024
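The template-versus-parameter split that log parsing produces can be illustrated with a toy baseline: keep tokens that are constant across aligned log lines and mask the varying positions. This is a frequency heuristic for illustration only, unrelated to the LogSL model itself; the `<*>` placeholder follows common log-parsing convention:

```python
def extract_template(log_lines, placeholder="<*>"):
    """Infer a template from same-length log lines by masking the
    token positions whose values vary (a toy sketch)."""
    tokens = [line.split() for line in log_lines]
    assert len({len(t) for t in tokens}) == 1, "lines must align"
    template = []
    for position in zip(*tokens):  # iterate over token columns
        template.append(position[0] if len(set(position)) == 1 else placeholder)
    return " ".join(template)
```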
Authors: Nayancy, | Dutta, Sandip | Chakraborty, Soubhik
Article Type: Research Article
Abstract: Blockchain has attracted tremendous attention in recent years due to its significant features, including anonymity, security, immutability, and auditability. Blockchain technology has been used in several nonmonetary applications, including the Internet of Things. However, blockchain is resource-hungry, and its scalability is computationally expensive, resulting in delays and large bandwidth overhead that are unsuitable for many IoT devices. In this paper, we present a lightweight blockchain approach that is suited to IoT needs and provides end-to-end security. Decentralization is achieved in our lightweight blockchain implementation by building a network in which many high-resource devices collaborate to maintain the blockchain. The nodes in the network are arranged in sorted order with respect to execution time and count to reduce the mining overhead, and are accountable for handling the public blockchain. We propose a distributed execution-time-based consensus algorithm that decreases the delay and overhead of the mining process. We also propose a randomized node-selection algorithm for choosing the nodes that verify mined blocks, to eliminate double-spend and 51% attacks. The results are encouraging: mining overhead is significantly reduced, and the double-spending problem and 51% attack are kept in check.
Keywords: Blockchain, IoT, lightweight consensus, double-spend attack, 51% attack
DOI: 10.3233/IDA-230153
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-11, 2024
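The two ordering and selection steps the abstract describes can be sketched as follows. The field names (`exec_time`, `mined_count`) and the tie-breaking rule are illustrative assumptions, not the paper's exact algorithm:

```python
import random

def order_miners(nodes):
    """Sort candidate miners by measured execution time, breaking ties
    by how often each node has already mined (illustrative ordering)."""
    return sorted(nodes, key=lambda n: (n["exec_time"], n["mined_count"]))

def pick_verifiers(nodes, k, seed=None):
    """Randomly select k distinct verifier nodes, so that no fixed
    subset can collude to confirm a double-spend."""
    rng = random.Random(seed)
    return rng.sample(nodes, k)
```

Randomizing the verifier set is what makes a 51% attack harder: an attacker cannot predict which nodes will audit the mined block.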
Authors: Boullé, Marc
Article Type: Research Article
Abstract: Histograms are among the most popular methods used in exploratory analysis to summarize univariate distributions. In particular, irregular histograms are good non-parametric density estimators that require very few parameters: the number of bins with their lengths and frequencies. Although many approaches have been proposed in the literature to infer these parameters, most existing histogram methods are difficult to exploit for exploratory analysis in the case of real-world data sets, with scalability issues, truncated data, outliers, or heavy-tailed distributions. In this paper, we focus on the G-Enum histogram method, which exploits the Minimum Description Length (MDL) principle to build histograms without any user parameter. We then propose to extend this method by exploiting a new modeling space based on floating-point representation, with the objective of building histograms resistant to outliers and heavy-tailed distributions. We also suggest several heuristics and a methodology suitable for the exploratory analysis of large-scale real-world data sets whose underlying patterns are difficult to recover for digitization reasons. Extensive experiments show the benefits of the approach, evaluated with a dual objective: the accuracy of density estimation in the case of outliers or heavy-tailed distributions, and the effectiveness of the approach for exploratory data analysis.
Keywords: Density estimation, histograms, model selection, minimum description length, exploratory analysis
DOI: 10.3233/IDA-230638
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-48, 2024
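Why binning aligned with floating-point representation resists heavy tails can be seen in a crude form: grouping positive values by their binary exponent makes bin widths grow with magnitude, so a single huge outlier occupies one coarse bin instead of stretching every bin. This toy uses `math.frexp` and is far simpler than the paper's MDL-driven G-Enum criterion:

```python
import math
from collections import Counter

def exponent_bins(values):
    """Bin positive values by floor(log2(x)), computed exactly via
    frexp: x = m * 2**e with 0.5 <= m < 1, so the exponent is e - 1."""
    return Counter(math.frexp(v)[1] - 1 for v in values if v > 0)
```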
Authors: Shi, Xuefeng | Hu, Min | Ren, Fuji | Shi, Piao
Article Type: Research Article
Abstract: Active Learning (AL) is a technique widely employed to minimize the time and labor costs of annotating data. By querying and extracting specific instances to train the model, the relevant task’s performance is improved maximally within limited iterations. However, little work has been conducted to fully fuse features from different hierarchies to enhance the effectiveness of active learning. Inspired by the idea of information compensation in many famous deep learning models (such as ResNet), this work proposes a novel TextCNN-based Two-way Active Learning model (TCTWAL) to extract task-relevant texts. TextCNN takes advantage of requiring little hyper-parameter tuning and static vectors, and achieves excellent results on various natural language processing (NLP) tasks, which is also beneficial to human-computer interaction (HCI) and AL-relevant tasks. In the proposed AL model, candidate texts are measured from both global and local features by the proposed AL framework TCTWAL, which depends on the modified TextCNN. Besides, the query strategy is strongly enhanced by maximum normalized log-probability (MNLP), which is sensitive to detecting longer sentences. Additionally, the selected instances are characterized by general global information and abundant local features simultaneously. To validate the effectiveness of the proposed model, extensive experiments are conducted on three widely used text corpora, and the results are compared with those of eight manually designed instance query strategies. The results show that our method outperforms the planned baselines in terms of accuracy, macro precision, macro recall, and macro F1 score. In particular, for the classification results on the AG’s News corpus, the improvements in the four indicators after 39 iterations are 40.50%, 45.25%, 48.91%, and 45.25%, respectively.
Keywords: Active learning, TextCNN, maximum normalized log-probability, global information, local feature
DOI: 10.3233/IDA-230332
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-23, 2024
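The MNLP query criterion mentioned in the abstract is standard: average the per-token log probabilities so long sentences are not penalised merely for their length, then query the least-confident sentences. A minimal sketch (the model producing the log probabilities is assumed, not shown):

```python
import numpy as np

def mnlp_scores(log_probs_per_token):
    """Maximum Normalized Log-Probability: mean per-token log
    probability of each sentence; lower means less confident."""
    return np.array([np.mean(lp) for lp in log_probs_per_token])

def select_queries(log_probs_per_token, k):
    """Pick the k least-confident sentences under MNLP."""
    scores = mnlp_scores(log_probs_per_token)
    return np.argsort(scores)[:k].tolist()
```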
Authors: Zhong, Qing | Shao, Xinhui
Article Type: Research Article
Abstract: For the aspect-based sentiment analysis task, traditional works consider only the text modality. However, in social media scenarios, texts often contain abbreviations, clerical errors, or grammatical errors, which invalidate traditional methods. In this study, a cross-modal hierarchical interactive fusion network incorporating an end-to-end approach is proposed to address this challenge. In the network, a feature attention module and a feature fusion module are proposed to obtain the multimodal interaction feature between the image modality and the text modality. Through the attention mechanism and gated fusion mechanism, these two modules realize the auxiliary function of the image in the text-based aspect-based sentiment analysis task. Meanwhile, a boundary auxiliary module is used to explore the dependencies between the two core subtasks of aspect-based sentiment analysis. Experimental results on two publicly available multimodal aspect-based sentiment datasets validate the effectiveness of the proposed approach.
Keywords: Multimodal aspect-based sentiment analysis, hierarchical interactive fusion, multi-head interaction attention mechanism, gated mechanism
DOI: 10.3233/IDA-230305
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-16, 2024
Authors: Chen, Hongwei | Shi, Dewei | Zhou, Xun | Zhang, Man | Liu, Luanxuan
Article Type: Research Article
Abstract: Credit fraud is a common financial crime that causes significant economic losses to financial institutions. To address this issue, researchers have proposed various fraud detection methods. Recently, research on deep forests has opened up a new path for exploring deep models beyond neural networks. The deep forest combines the features of neural networks and ensemble learning, and has achieved good results in various fields. This paper studies the application of deep forests to the field of fraud detection and proposes a distributed dense rotation deep forest algorithm (DRDF-spark) based on an improved RotBoost. The model has three main characteristics. Firstly, it solves the problem that multi-granularity scanning faces on data lacking spatial correlation by introducing RotBoost. Secondly, Spark is used for parallel construction to improve the processing speed and efficiency of data. Thirdly, a pre-aggregation mechanism is added to the distributed algorithm to locally aggregate the statistical results of sub-forests in the same node in advance, improving communication efficiency. The experiments show that DRDF-spark performs better than deep forests and some mainstream ensemble learning algorithms on the fraud dataset in this paper, and the training speed is up to 3.53 times faster. Furthermore, if the number of nodes is increased further, the speedup ratio continues to increase.
Keywords: Deep forest, credit fraud detection, ensemble learning, RotBoost, spark
DOI: 10.3233/IDA-230193
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-25, 2024
Authors: Jiménez-Gaona, Yuliana | Rodríguez-Alvarez, María José | Escudero, Líder | Sandoval, Carlos | Lakshminarayanan, Vasudevan
Article Type: Research Article
Abstract: INTRODUCTION: Ultrasound, in conjunction with mammography imaging, plays a vital role in the early detection and diagnosis of breast cancer. However, speckle noise affects medical ultrasound images and degrades visual radiological interpretation. Speckle carries information about the interactions of the ultrasound pulse with the tissue microstructure, which generally causes several difficulties in identifying malignant and benign regions. The application of deep learning to image denoising has gained more attention in recent years. OBJECTIVES: The main objective of this work is to reduce speckle noise while preserving features and details in breast ultrasound images using GAN models. METHODS: We propose two GAN models (Conditional GAN and Wasserstein GAN) for speckle denoising on public breast ultrasound databases: BUSI (Dataset A) and UDIAT (Dataset B). The Conditional GAN model was trained using the U-Net architecture, and the WGAN model was trained using the ResNet architecture. Image quality for both algorithms was measured against the Peak Signal-to-Noise Ratio (PSNR, 35–40 dB) and Structural Similarity Index (SSIM, 0.90–0.95) standard values. RESULTS: The experimental analysis clearly shows that the Conditional GAN model achieves better breast ultrasound despeckling performance across the datasets, with PSNR = 38.18 dB and SSIM = 0.96, compared with the WGAN model (PSNR = 33.0068 dB and SSIM = 0.91) on the small ultrasound training datasets. CONCLUSIONS: The observed performance differences between CGAN and WGAN will help to better implement new tasks in a computer-aided detection/diagnosis (CAD) system. In future work, these data can be used as CAD training input for image classification, reducing overfitting and improving the performance and accuracy of deep convolutional algorithms.
Keywords: Breast cancer, ultrasound image denoising, generative adversarial network
DOI: 10.3233/IDA-230631
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-18, 2024
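The PSNR figures the abstract reports follow the standard definition, 10·log10(MAX²/MSE), which can be computed directly:

```python
import numpy as np

def psnr(reference, test, data_range=255.0):
    """Peak Signal-to-Noise Ratio in dB between a reference image and a
    denoised output; higher is better (identical images give infinity)."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)
```

With `data_range=255` for 8-bit images, the 35–40 dB range cited in the abstract corresponds to a per-pixel RMSE of roughly 2.5 to 4.5 grey levels.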
Authors: Göcs, László | Johanyák, Zsolt Csaba
Article Type: Research Article
Abstract: Intrusion detection systems (IDSs) are essential elements of IT systems. Their key component is a classification module that continuously evaluates certain features of the network traffic and identifies possible threats. Its efficiency is greatly affected by the right selection of the features to be monitored. Therefore, the identification of a minimal set of features that is necessary to safely distinguish malicious traffic from benign traffic is indispensable in the course of developing an IDS. This paper presents the preprocessing and feature selection workflow, as well as its results, in the case of the CSE-CIC-IDS2018 on AWS dataset, focusing on five attack types. To identify the relevant features, six feature selection methods were applied, and the final ranking of the features was elaborated based on their average scores. Next, several subsets of the features were formed based on different ranking threshold values, and each subset was tried with five classification algorithms to determine the optimal feature set for each attack type. During the evaluation, four widely used metrics were taken into consideration.
Keywords: Dataset preprocessing, dimension reduction, feature selection, classification, Python, CSE-CIC-IDS2018
DOI: 10.3233/IDA-230264
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-27, 2024
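The rank-averaging step described in the abstract (combine several feature-selection methods, then keep features under a rank threshold) can be sketched as follows. The score table and threshold are made up for illustration; the paper's exact scoring is not reproduced here:

```python
import numpy as np

def average_rank(score_table):
    """Average per-method feature ranks (rank 0 = best). Input is a
    (methods x features) array of scores, higher score = better."""
    score_table = np.asarray(score_table, dtype=float)
    # double argsort turns scores into ranks within each method's row
    ranks = np.argsort(np.argsort(-score_table, axis=1), axis=1)
    return ranks.mean(axis=0)

def select_features(score_table, threshold):
    """Keep feature indices whose average rank is at or below threshold."""
    avg = average_rank(score_table)
    return [i for i, r in enumerate(avg) if r <= threshold]
```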
Authors: Zhang, Shuo | Hu, Xingbang | Zhang, Wenbo | Chen, Jinyi | Huang, Hejiao
Article Type: Research Article
Abstract: For a modern Intelligent Transportation System (ITS), data missing during traffic raster acquisition is inevitable because of loop detector malfunction or signal interference. Nevertheless, missing data imputation is meaningful due to the periodic spatio-temporal characteristics and individual randomness of traffic raster data. In this paper, traffic raster data collected from all spatial regions at each time interval are considered as a multiple-channel image. Accordingly, the traffic raster data over a period of time can be regarded as a video, on which an unsupervised generative neural network called MSST-VAE (Multiple Streams Spatial Temporal-VAE) is proposed for traffic raster data imputation; this model performs robustly even at varied missing rates, where many other approaches fail. Two major innovations can be summarized in MSST-VAE. Firstly, it uses multiple periodic streams of Variational Auto-Encoders (VAEs) with Sylvester Normalizing Flows (SNFs), which shows strong generalization ability. Secondly, after the traffic raster data are transformed into videos, an ECB (Extraction-and-Calibration Block) consisting of dilated P3D gated convolution and a multi-horizon attention mechanism is employed to learn global-local-granularity spatial features and long-short-term temporal features. Extensive experiments on three real traffic flow datasets validate that MSST-VAE outperforms other classical traffic imputation models with the least imputation error.
Keywords: Intelligent transportation system, traffic raster data, data imputation
DOI: 10.3233/IDA-230091
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-22, 2024
Authors: Chen, Mingcai | Du, Yuntao | Tang, Wei | Zhang, Baoming | Wang, Chongjun
Article Type: Research Article
Abstract: Real-world machine learning applications seldom provide perfectly labeled data, posing a challenge in developing models robust to noisy labels. Recent methods prioritize noise filtering based on the discrepancies between model predictions and the provided noisy labels, assuming samples with minimal classification losses to be clean. In this work, we capitalize on the consistency between the learned model and the complete noisy dataset, employing the data’s rich representational and topological information. We introduce LaplaceConfidence, a method that obtains label confidence (i.e., clean probabilities) utilizing the Laplacian energy. Specifically, it first constructs graphs based on the feature representations of all noisy samples and minimizes the Laplacian energy to produce a low-energy graph. Clean labels should fit well into the low-energy graph while noisy ones should not, allowing our method to determine the data’s clean probabilities. Furthermore, LaplaceConfidence is embedded into a holistic method for robust training, where a co-training technique generates unbiased label confidence and a label refurbishment technique better utilizes it. We also explore a dimensionality reduction technique to accommodate our method on large-scale noisy datasets. Our experiments demonstrate that LaplaceConfidence outperforms state-of-the-art methods on benchmark datasets under both synthetic and real-world noise. Code is available at https://github.com/chenmc1996/LaplaceConfidence.
Keywords: Learning with noisy labels, graph energy, label refurbishment
DOI: 10.3233/IDA-230818
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-17, 2024
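The Laplacian (Dirichlet) energy that the abstract builds on is yᵀLy with L = D − W: it is small exactly when strongly connected nodes share labels, which is why clean labels "fit" a low-energy graph and noisy ones do not. A minimal illustration of the criterion, not the full LaplaceConfidence method:

```python
import numpy as np

def laplacian_energy(adjacency, labels):
    """Compute y^T L y for a symmetric adjacency matrix W, where
    L = D - W; equals half the weighted sum of squared label
    differences across edges."""
    adjacency = np.asarray(adjacency, dtype=float)
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    y = np.asarray(labels, dtype=float)
    return float(y @ laplacian @ y)
```

On a path graph 0–1–2, labeling the two connected ends alike costs less energy than a labeling that flips across both edges, mirroring how the method scores label assignments.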
Authors: Fabra-Boluda, Raül | Ferri, Cèsar | Hernández-Orallo, José | Ramírez-Quintana, M. José | Martínez-Plumed, Fernando
Article Type: Research Article
Abstract: The quest for transparency in black-box models has gained significant momentum in recent years. In particular, discovering the underlying machine learning technique type (or model family) from the performance of a black-box model is an important problem, both for better understanding its behaviour and for developing strategies to attack it by exploiting the weaknesses intrinsic to the learning technique. In this paper, we tackle the challenging task of identifying which kind of machine learning model is behind the predictions when we interact with a black-box model. Our innovative method involves systematically querying a black-box model (oracle) to label an artificially generated dataset, which is then used to train different surrogate models using machine learning techniques from different families (each one trying to partially approximate the oracle’s behaviour). We present two approaches based on similarity measures, one selecting the most similar family and the other using a conveniently constructed meta-model. In both cases, we use both crisp and soft classifiers and their corresponding similarity metrics. By experimentally comparing all these methods, we gain valuable insights into the explanatory and predictive capabilities of our model-family concept. This provides a deeper understanding of black-box models and increases their transparency and interpretability, paving the way for more effective decision making.
Keywords: Machine learning, family identification, adversarial, black-box, surrogate models
DOI: 10.3233/IDA-230707
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-21, 2024
Authors: Liu, Zhao | Wang, Aimin | Bao, Haiming | Zhang, Kunpeng | Wu, Jing | Sun, Geng | Li, Jiahui
Article Type: Research Article
Abstract: The goal of feature selection in machine learning is to maintain classification accuracy while reducing a large number of attributes. In this paper, we first design a fitness function that achieves both objectives jointly. Then we propose a chaos-based binary dragonfly algorithm (CBDA) that incorporates several improvements over the conventional dragonfly algorithm (DA), yielding a wrapper-based feature selection method that optimizes the fitness function. Specifically, the CBDA innovatively introduces three improved factors, namely a chaotic map, an evolutionary population dynamics (EPD) mechanism, and a binarization strategy, on the basis of the conventional DA, to balance the exploitation and exploration capabilities of the algorithm and make it more suitable for the formulated problem. We conduct experiments on 24 well-known data sets from the UCI repository, with three ablated versions of CBDA targeting different components of the algorithm in order to explain their contributions to CBDA, and with five established comparative algorithms, in terms of fitness value, classification accuracy, CPU running time, and number of selected features. The results show that the proposed CBDA has remarkable advantages on most of the tested data sets.
Keywords: Feature selection, dragonfly algorithm, chaos, evolutionary population dynamics, classification accuracy
DOI: 10.3233/IDA-230540
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-36, 2024
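A fitness function that trades classification error against the number of selected features is commonly written as a weighted sum; a generic sketch follows (the weight alpha=0.99 is a conventional choice in wrapper-based feature selection, not necessarily the paper's exact value):

```python
def fitness(error_rate, n_selected, n_total, alpha=0.99):
    """Wrapper feature-selection fitness: alpha weights classification
    error against the selected-feature ratio; lower is better."""
    return alpha * error_rate + (1.0 - alpha) * (n_selected / n_total)
```

With this form, two subsets of equal accuracy are separated by compactness: the one using fewer features scores lower, which is the joint objective the abstract describes.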
Authors: Feng, Zhuo | Du, Yajun | Huang, Jiaming | Li, Xianyong | Chen, Xiaoliang | Xie, Chunzhi
Article Type: Research Article
Abstract: Large-scale studies indicate that the distinct approach to opinion fusion employed by extreme agents exerts a more potent influence on overall opinion evolution than that of regular agents. The presence of extreme agents within a network tends to undermine the development of opinion neutrality, which is harmful to the guidance of online public opinion. Notably, prior research often overlooks the existence of extreme agents in social networks, and existing research seldom considers the time sunk cost in the evolution of opinions. Building upon this foundation, we introduce a temporal dimension to opinion evolution, integrating the time sunk cost with the opinion evolution process. Furthermore, we devise an agent partitioning method that categorizes agents into four states based on their opinion values: the watch state, subjective state, firm state, and extreme state, with extreme-state agents generally expressing radical opinions. We construct an agent network based on the phenomenon of time sunk costs and propose a model for the evolution of extreme opinions in this network. Our study finds that information sharing among extreme agents significantly influences the extremization of opinions in various networks. After restricting the exchange of opinions by extreme agents, the number of extreme agents in the network decreased by 40% to 50% compared to the initial situation. Additionally, we discovered that imposing restrictions on extreme agents in the early stages can help increase the possibility of network opinions moving towards neutral positions. When the restriction of extreme agents (REA) was applied at the beginning of the experiment rather than midway through, the final number of extreme-state agents decreased by 15.57%. The results show that extreme agents have a great influence on the spread and evolution of extreme opinions on platforms.
Keywords: Time sunk costs, extremists, opinion dynamics, bounded confidence model, social networks, opinion evolution
DOI: 10.3233/IDA-230677
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-20, 2024
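The bounded confidence model named in the keywords can be sketched in its generic form: each agent only averages with neighbours whose opinions lie within a confidence bound epsilon. The paper adds time sunk costs and a four-state partition on top of models of this kind; the parameters below are illustrative:

```python
import numpy as np

def bounded_confidence_step(opinions, epsilon=0.2, mu=0.5):
    """One synchronous update of a bounded-confidence model: each agent
    moves a fraction mu toward the mean opinion of agents within
    distance epsilon of its own opinion (itself included)."""
    opinions = np.asarray(opinions, dtype=float)
    new = opinions.copy()
    for i, x in enumerate(opinions):
        close = opinions[np.abs(opinions - x) <= epsilon]
        new[i] = x + mu * (close.mean() - x)
    return new
```

Note how an agent at opinion 1.0 with no neighbours inside its bound never moves: this is exactly the stubbornness that makes extreme agents so influential in such models.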
Authors: Zhang, Fei | Chan, Patrick P.K. | He, Zhi-Min | Yeung, Daniel S.
Article Type: Research Article
Abstract: A recommender system is susceptible to manipulation through the injection of carefully crafted profiles. Some recent profile identification methods perform well only in specific attack scenarios. A general attack detection method is usually complicated or requires labeled samples; such methods are prone to overtraining, and the annotation process incurs high expenses. This study proposes an unsupervised divide-and-conquer method for identifying attack profiles, utilizing a specifically designed model for each kind of shilling attack. Initially, our method categorizes the profile set into two attack types, namely Standard and Obfuscated Behavior Attacks. Subsequently, profiles are separated into clusters within the extracted feature space based on the identified attack type. The selection of attack profiles is then determined through target item analysis within the suspected cluster. Notably, our method offers the advantage of requiring no prior knowledge or annotation. Furthermore, precision is heightened because the identification method is designed for a specific attack type, employing a less complicated model. The outstanding performance of our model, validated through experimental results on MovieLens-100K and Netflix under various attack settings, demonstrates superior accuracy and reduced running time compared to current detection methods in identifying Standard and Obfuscated Behavior Attacks.
Keywords: PCA, item popularity, shilling attack detection, divide-and-conquer method
DOI: 10.3233/IDA-230575
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-16, 2024
Authors: Tran, Le-Anh | Kwon, Daehyun | Deberneh, Henock Mamo | Park, Dong-Chul
Article Type: Research Article
Abstract: This paper proposes a data clustering algorithm inspired by the prominent convergence property of the Projection onto Convex Sets (POCS) method, termed the POCS-based clustering algorithm. For disjoint convex sets, the simultaneous-projection form of the POCS method can yield a minimum mean square error solution. Relying on this important property, the proposed POCS-based clustering algorithm treats each data point as a convex set and simultaneously projects the cluster prototypes onto their respective member data points; the projections are convexly combined via adaptive weight values in order to minimize a predefined objective function for data clustering purposes. The performance of the proposed POCS-based clustering algorithm has been verified through large-scale experiments on various data sets. The experimental results show that the proposed POCS-based algorithm is competitive in terms of both effectiveness and efficiency with prevailing clustering approaches such as the K-Means/K-Means++ and Fuzzy C-Means (FCM) algorithms. Based on extensive comparisons and analyses, we can confirm the validity of the proposed POCS-based clustering algorithm for practical purposes.
Keywords: POCS, convex sets, clustering algorithm, unsupervised learning, machine learning
DOI: 10.3233/IDA-230655
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-18, 2024
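The projection-and-combine step described in the abstract can be sketched as follows. Since projecting a prototype onto a single-point convex set simply returns that point, each prototype update is a convex combination of its member points; the paper's adaptive weights are not specified in the abstract, so inverse-distance weights are assumed here purely for illustration.

```python
import numpy as np

def pocs_cluster(X, protos, n_iter=50):
    """Minimal sketch of a POCS-style clustering loop. X is (n, d) data,
    protos is (k, d) initial prototypes. The inverse-distance weighting is
    an assumption; the paper derives its own adaptive weights."""
    protos = protos.astype(float).copy()
    for _ in range(n_iter):
        # assign each point to its nearest prototype
        d = np.linalg.norm(X[:, None, :] - protos[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(len(protos)):
            members = X[labels == j]
            if len(members) == 0:
                continue
            # projection of a prototype onto a one-point convex set is the
            # point itself; combine the projections convexly
            w = 1.0 / (np.linalg.norm(members - protos[j], axis=1) + 1e-9)
            w /= w.sum()
            protos[j] = w @ members
    return protos, labels
```

Structurally this resembles K-Means with a non-uniform averaging step, which matches the abstract's claim that the method is competitive with K-Means/FCM.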
Authors: Huang, Jiaming | Li, Xianyong | Li, Qizhi | Du, Yajun | Fan, Yongquan | Chen, Xiaoliang | Huang, Dong | Wang, Shumin
Article Type: Research Article
Abstract: Emojis in texts provide substantial additional information for sentiment analysis. Previous implicit sentiment analysis models have primarily treated emojis as unique tokens or deleted them outright, thereby ignoring the explicit sentiment information carried by emojis. Considering the different relationships between emoji descriptions and texts, we propose a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model with emojis (BEMOJI) for Chinese and English sentiment analysis. At the pre-training stage, we pre-train BEMOJI by predicting emoji descriptions from the corresponding texts via prompt learning. At the fine-tuning stage, we propose a fusion layer that fuses text representations and emoji descriptions into fused representations, which are used to predict text sentiment orientations. Experimental results show that BEMOJI achieves the highest accuracy (91.41% and 93.36%), Macro-precision (91.30% and 92.85%), Macro-recall (90.66% and 93.65%) and Macro-F1-measure (90.95% and 93.15%) on the Chinese and English datasets, respectively. On average, BEMOJI outperforms emoji-based methods by 29.92% and 24.60%, and transformer-based methods by 3.76% and 5.81%, on the Chinese and English datasets, respectively. An ablation study verifies that the emoji descriptions and the fusion layer play a crucial role in BEMOJI. A robustness study further shows that BEMOJI achieves results comparable to BERT on four emoji-free sentiment analysis tasks, indicating that BEMOJI is a robust model. Finally, a case study shows that BEMOJI outputs more reasonable emojis than BERT.
Keywords: Pre-trained language model, emoji sentiment analysis, implicit sentiment analysis, prompt learning, multi-feature fusion
DOI: 10.3233/IDA-230864
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-25, 2024
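The abstract says only that a fusion layer combines text and emoji-description representations; one common design for such a layer is a learned sigmoid gate, sketched below. The gate formulation is an assumption for illustration, not BEMOJI's actual layer.

```python
import numpy as np

def gated_fusion(text_vec, emoji_vec, W, b):
    """Illustrative gated fusion of a text representation with an
    emoji-description representation. W is (d, 2d), b is (d,); the sigmoid
    gate decides, per dimension, how much of each source to keep. This is
    an assumed, generic design, not the paper's exact fusion layer."""
    z = np.concatenate([text_vec, emoji_vec])
    gate = 1.0 / (1.0 + np.exp(-(W @ z + b)))  # elementwise sigmoid
    return gate * text_vec + (1.0 - gate) * emoji_vec
```

With zero-initialized parameters the gate is 0.5 everywhere, so the fused vector starts as the mean of the two inputs and training moves it from there.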
Authors: Noronha, Marta D.M. | Zárate, Luis E.
Article Type: Research Article
Abstract: Characterizing longevity profiles from longitudinal studies is a task with many challenges. Longitudinal databases usually have high dimensionality, and the similarities between long-lived and non-long-lived records make profile characterization highly burdensome. Addressing these issues, in this work we use data from the English Longitudinal Study of Ageing (ELSA-UK) to characterize longevity profiles through data mining. We propose a feature engineering method that reduces data dimensionality through merging techniques, factor analysis and biclustering, applying biclustering to select the relevant features that discriminate the two profiles. Two classification models, one based on a decision tree and the other on a random forest, are built from the preprocessed dataset. Experiments show that our methodology can successfully discriminate longevity profiles, and we identify insights into the features that contribute to individuals being long-lived or non-long-lived. According to the results of both models, the main factor impacting longevity is the correlation between the economic situation and the mobility of the elderly. Since this factor is deemed relevant for profile classification, we suggest that the methodology can be applied to identify longevity profiles in other longitudinal studies.
Keywords: Longitudinal data mining, human ageing, biclustering, factor analysis, classification
DOI: 10.3233/IDA-230314
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-24, 2024
Authors: Fan, Zeping | Zhang, Xuejun | Huang, Min | Bu, Zhaohui
Article Type: Research Article
Abstract: The recently introduced Convolution-augmented Transformer (Conformer) model has attained state-of-the-art (SOTA) results in Automatic Speech Recognition (ASR). In this paper, a series of methodical investigations uncovers that the Conformer's design decisions may not be the most efficient choices under a limited computational budget. After a thorough re-evaluation of the Conformer architecture's design choices, we propose Sampleformer, which reduces the Conformer's architectural complexity and delivers more robust performance. We introduce downsampling into the Conformer encoder and, to exploit the information in the speech features, incorporate an additional downsampling module to enhance the efficiency and accuracy of our model. Additionally, we propose a novel and adaptable attention mechanism called multi-group attention, which effectively reduces the attention complexity from O(n^2·d) to O(n^2·d·f/g). In experiments on the AISHELL-1 corpus, our 13.3 million-parameter CTC model demonstrates a 3.0%/2.6% relative reduction in character error rate (CER) on the dev/test sets, all without the use of a language model (LM). The model also exhibits 30% faster inference than our CTC Conformer baseline and trains 27% faster.
Keywords: Speech recognition, conformer, attention mechanism, complexity reduction
DOI: 10.3233/IDA-230612
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-13, 2024
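The abstract does not detail multi-group attention, but the complexity factor f/g suggests sharing work across groups of heads. The sketch below illustrates one grouped-attention variant in that spirit: heads within a group share a single attention map, so score computation shrinks by a factor heads/groups. This is an illustrative analogue only, not the Sampleformer mechanism itself.

```python
import numpy as np

def grouped_attention(Q, K, V, groups):
    """Grouped attention sketch. Q, K, V are (heads, n, d). All heads in a
    group share one attention map computed from the group's mean query/key,
    so only `groups` score matrices are formed instead of `heads`.
    Assumed design for illustration; the paper's multi-group attention
    may differ."""
    h, n, d = Q.shape
    out = np.empty_like(V)
    per = h // groups
    for g in range(groups):
        hs = slice(g * per, (g + 1) * per)
        q, k = Q[hs].mean(axis=0), K[hs].mean(axis=0)   # shared (n, d)
        scores = q @ k.T / np.sqrt(d)                   # one map per group
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)              # row-wise softmax
        out[hs] = w @ V[hs]   # broadcast shared weights over group's heads
    return out
```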
Authors: Liu, Xiaoyang | Wu, Yudie | Fiumara, Giacomo | De Meo, Pasquale
Article Type: Research Article
Abstract: Traditional community detection models either ignore feature-space information and require a large amount of domain knowledge to define meta-paths manually, or fail to distinguish the importance of different meta-paths. To overcome these limitations, we propose a novel heterogeneous graph community detection method called KGNN_HCD (heterogeneous graph Community Detection based on a K-nearest neighbor Graph Neural Network). First, a similarity matrix is generated to construct the topological structure of the K-nearest neighbor graph; second, the meta-path information matrix is generated by a meta-path transformation layer (Mp-Trans Layer) with weighted convolution; finally, a graph convolutional network (GCN) learns high-quality node representations, and the k-means algorithm is applied to the node embeddings to detect the community structure. We perform extensive experiments on three heterogeneous datasets, ACM, DBLP and IMDB, against 11 competing community detection methods such as CP-GNN and GTN. The experimental results show that the proposed KGNN_HCD method improves NMI and ARI by 2.54% and 2.56% on the ACM dataset, 2.59% and 1.47% on the DBLP dataset, and 1.22% and 1.67% on the IMDB dataset, respectively. These findings suggest that KGNN_HCD is reasonable and effective, and that it can be applied to complex network classification and clustering tasks.
Keywords: Heterogeneous graph, meta-path, K-nearest neighbor graph, graph neural network, community detection
DOI: 10.3233/IDA-230356
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-22, 2024
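The first stage of the pipeline described in the abstract, building a K-nearest-neighbor graph from a similarity matrix, can be sketched as below. The GCN and k-means stages are omitted; the symmetrization rule (keep an edge if either endpoint selected it) is a common convention assumed here.

```python
import numpy as np

def knn_graph(S, k):
    """Build a symmetric k-nearest-neighbor adjacency matrix from a
    similarity matrix S (n x n, larger = more similar)."""
    n = S.shape[0]
    A = np.zeros_like(S)
    for i in range(n):
        # indices of the k most similar nodes, excluding the node itself
        order = np.argsort(-S[i])
        neigh = [j for j in order if j != i][:k]
        A[i, neigh] = 1.0
    # symmetrize: an edge survives if either endpoint chose it
    return np.maximum(A, A.T)
```

The resulting adjacency matrix is what a GCN layer would then consume (typically after adding self-loops and degree normalization).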
Authors: Yuan, Wei | Zhao, Shiyu | Wang, Li | Cai, Lijia | Zhang, Yong
Article Type: Research Article
Abstract: In the post-epidemic era, online learning has gained increasing attention due to advances in information and big data technology, producing large-scale online course data with diverse student behaviors. Online data mining has become a popular and important way of extracting valuable insights from large amounts of data. However, previous online course analysis methods often focused on individual aspects of the data and neglected the correlations among large-scale learning behavior data, which can lead to an incomplete understanding of the overall learning behavior and patterns within an online course. To solve these problems, this paper proposes an online course evaluation model based on a graph auto-encoder. In our method, the features of the collected online course data are used to construct K-Nearest Neighbor (KNN) graphs that represent the associations among courses. A variational graph auto-encoder (VGAE) is then introduced to learn useful implicit features, which are fed into unsupervised and semi-supervised downstream tasks for online course evaluation, respectively. We conduct experiments on two datasets. In the clustering task, our method shows a more than tenfold increase in the Calinski-Harabasz index compared to unoptimized features, demonstrating significant structural distinction and group coherence. In the classification task, our model outperforms traditional methods by about 10% overall, indicating its effectiveness in handling complex network data.
Keywords: Educational data mining, online course evaluation, deep learning, graph auto-encoder
DOI: 10.3233/IDA-230557
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-23, 2024
Authors: K, Subha | N, Bharathi
Article Type: Research Article
Abstract: In today’s digital era, the generation and sharing of information are expanding rapidly, and the resulting volume of complex data constitutes big data, of which YouTube is a primary source. The proliferation of the internet and smart devices has led to a significant increase in content creators across social media platforms, with YouTube among the foremost platforms for content generation and sharing. YouTubers face challenges in refining their content strategies because of the growing volume of comments on shared videos: reading through such a large amount of data manually to find viewers' opinions is time-consuming and makes it hard to gauge people's sentiments. To address this, Spark-based machine learning algorithms have emerged as a transformative tool for content creators to understand their audience. The proposed Improved Novel Ensemble Method (INEM) algorithm predicts viewers' sentiments and emotional responses to content from their comments. The results provide valuable insights that help content creators refine their strategies to optimize a channel's revenue and performance. As a case study, the Fit Tuber channel is analyzed for the sentiment of its user comments.
Keywords: Big data, sentiment analysis, machine learning, social-media, spark
DOI: 10.3233/IDA-240198
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-11, 2024
Authors: Gupta, Ayushi | Chug, Anuradha | Singh, Amit Prakash
Article Type: Research Article
Abstract: PURPOSE: Crop diseases can cause significant reductions in yield, subsequently impacting a country's economy. The current research concentrates on detecting diseases in three crops – tomatoes, soybeans, and mushrooms – using a real-time dataset collected for tomatoes and two publicly accessible datasets for the other crops. The primary emphasis is on datasets with exclusively categorical attributes, which pose a notable challenge to the research community. METHODS: After label encoding of the attributes, the datasets undergo four distinct preprocessing techniques to address missing values. The SMOTE-N technique is then employed to tackle class imbalance. Subsequently, the preprocessed datasets are classified using three ensemble methods: bagging, boosting, and voting. To further refine the classification process, the metaheuristic Ant Lion Optimizer (ALO) is used for hyper-parameter tuning. RESULTS: This comprehensive approach yields twelve distinct models. The top two performers are further validated on ten standard categorical datasets. The findings demonstrate that the hybrid model II-SN-OXGB surpasses all other models, as well as the current state of the art, in classification accuracy across all thirteen categorical datasets. II uses the Random Forest classifier to iteratively impute missing feature values with a nearest-features strategy; SMOTE-N (SN) serves as an oversampling technique for categorical attributes, again based on nearest neighbors; and ALO-optimized Xtreme Gradient Boosting (OXGB) sequentially trains multiple decision trees, each correcting the errors of its predecessor. CONCLUSION: The model II-SN-OXGB thus emerges as the optimal choice for classification on categorical datasets.
Applying the II-SN-OXGB model to crop datasets can significantly enhance disease detection, enabling farmers to take timely, appropriate measures to prevent yield losses and mitigate the economic impact of crop diseases.
Keywords: Categorical data, ensemble methods, missing values imputation, metaheuristic optimization, plant disease
DOI: 10.3233/IDA-230651
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-25, 2024
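The SMOTE-N oversampling step mentioned in the abstract, designed for purely categorical attributes, can be sketched as below: pick a minority sample, find its nearest neighbors under simple matching distance, and synthesize a new sample by per-feature majority vote among those neighbors. This is a simplified illustration of the idea, not the paper's exact procedure.

```python
import numpy as np
from collections import Counter

def smoten_sample(minority, k=3, seed=0):
    """Synthesize one sample from a (n, f) array of categorical minority
    samples. Simple matching distance = number of differing features;
    the new sample takes the majority value of each feature among the k
    nearest neighbors. Simplified sketch; ties break arbitrarily."""
    rng = np.random.default_rng(seed)
    i = rng.integers(len(minority))
    # simple matching distance to every minority sample
    d = (minority != minority[i]).sum(axis=1)
    neigh = np.argsort(d)[1:k + 1]  # drop the zero-distance entry at front
    synth = [Counter(col).most_common(1)[0][0]
             for col in minority[neigh].T]  # per-feature majority vote
    return np.array(synth)
```

A library implementation of this idea exists as `SMOTEN` in imbalanced-learn, which would normally be preferred over a hand-rolled version.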
IOS Press, Inc.
6751 Tepper Drive
Clifton, VA 20124
USA
Tel: +1 703 830 6300
Fax: +1 703 830 2300
[email protected]
For editorial issues, like the status of your submitted paper or proposals, write to [email protected]
IOS Press
Nieuwe Hemweg 6B
1013 BG Amsterdam
The Netherlands
Tel: +31 20 688 3355
Fax: +31 20 687 0091
[email protected]
For editorial issues, permissions, book requests, submissions and proceedings, contact the Amsterdam office [email protected]
Inspirees International (China Office)
Ciyunsi Beili 207(CapitaLand), Bld 1, 7-901
100025, Beijing
China
Free service line: 400 661 8717
Fax: +86 10 8446 7947
[email protected]
For editorial issues, like the status of your submitted paper or proposals, write to [email protected]
If you need help with publishing or have any suggestions, please email: [email protected]