Searching for just a few words should be enough to get started. If you need to make more complex queries, use the tips below to guide you.
Article type: Research Article
Authors: Mehta, Brijesha; * | Rao, Udai Pratapb | Gupta, Ruchikac | Conti, Maurod
Affiliations: [a] Department of Computer Science and Engineering, College of Technology and Engineering, Maharana Pratap University of Agriculture and Technology, Udaipur, Rajasthan, India | [b] Computer Engineering Department, Sardar Vallabhbhai National Institute of Technology, Surat, Gujarat, India | [c] Computer Science and Engineering Department, Chandigarh University, Mohali, Punjab, India | [d] Department of Mathematics and HIT Center, University of Padua, Italy
Correspondence: [*] Corresponding author. B. Mehta, Department of Computer Science and Engineering, College of Technology and Engineering, Maharana Pratap University of Agriculture and Technology, Udaipur, Rajasthan, India. E-mail: [email protected].
Abstract: Various sources and sophisticated tools are used to gather and process the comparatively large volume of data or big data that sometimes leads to privacy disclosure (at broader or finer level) for the data owner. Privacy preserving data publishing approaches such as k-anonymity, l-diversity, and t-closeness are very well used to de-identify data, however, chances of re-identification of attributes always exist as data is collected from multiple sources such as public web, social media, Internet whereabouts, and sensors that are highly prone to data linkages. In literature, k-anonymity stands out amongst the most popular mainstream data anonymization approaches that can also be used for large sized data. However, applying k-anonymization for variety of data (especially unstructured data) is difficult in the traditional way, due to the fact that it requires the given data to be classified into the personal data, the quasi identifiers, and the sensitive data. We identify existing approaches from the literature of Natural Language Processing(NLP) to convert the unstructured data to structured form in order to apply k-anonymization over the generated structured records. We adopt a two phase Conditional Random Field (CRF) based Named Entity Recognition (NER) approach to represent unstructured data into the structured form. Further, we propose an Improved Scalable k-Anonymization (ImSKA) to anonymize the well represented unstructured data that achieves privacy preserving unstructured big data publishing. We compare both of the propose approaches namely NER and ImSKA with existing approaches and the results show that our proposed solutions outperform the existing approaches in terms of F1 score and Normalized Cardinality Penalty (NCP), respectively. Since, NER approaches are widely used for bio-medical datasets, we have also used a well-known Bio-NER dataset called GENIA corpus for measuring the performance.
Keywords: Privacy preserving big data publishing, unstructured data privacy, named entity recognition, k-anonymity, scalable k-anonymization
DOI: 10.3233/JIFS-181231
Journal: Journal of Intelligent & Fuzzy Systems, vol. 36, no. 4, pp. 3471-3482, 2019
IOS Press, Inc.
6751 Tepper Drive
Clifton, VA 20124
USA
Tel: +1 703 830 6300
Fax: +1 703 830 2300
[email protected]
For editorial issues, like the status of your submitted paper or proposals, write to [email protected]
IOS Press
Nieuwe Hemweg 6B
1013 BG Amsterdam
The Netherlands
Tel: +31 20 688 3355
Fax: +31 20 687 0091
[email protected]
For editorial issues, permissions, book requests, submissions and proceedings, contact the Amsterdam office [email protected]
Inspirees International (China Office)
Ciyunsi Beili 207(CapitaLand), Bld 1, 7-901
100025, Beijing
China
Free service line: 400 661 8717
Fax: +86 10 8446 7947
[email protected]
For editorial issues, like the status of your submitted paper or proposals, write to [email protected]
如果您在出版方面需要帮助或有任何建, 件至: [email protected]