Quality metrics for RDF graph summarization
Issue title: Special Issue on Intelligent Exploration of Semantic Data
Guest editors: Dhaval Thakker, Daniel Schwabe, Roberto García, Kouji Kozaki, Marco Brambilla and Vania Dimitrova
Article type: Research Article
Authors: Zneika, Mussab [a, b, c, d, *] | Vodislav, Dan [a, b, c, d] | Kotzinos, Dimitris [a, b, c, d]
Affiliations: [a] ETIS, UMR 8051 Paris Seine University, France | [b] University of Cergy-Pontoise, Cergy-Pontoise, France | [c] ENSEA, Cergy-Pontoise, France | [d] CNRS, France. E-mails: [email protected], [email protected], [email protected]
Correspondence: [*] Corresponding author. E-mail: [email protected].
Abstract: RDF Graph Summarization is the process of extracting concise but meaningful summaries from RDF Knowledge Bases (KBs) that represent the actual contents of the KB as closely as possible, in terms of both structure and data. RDF summarization enables better exploration and visualization of the underlying RDF graphs, query optimization or query evaluation in multiple steps, a better understanding of the connections in Linked Datasets, and many other applications. The literature reports several algorithms for extracting summaries from RDF KBs. These algorithms, however, produce different results when applied to the same KB, so a quality framework is needed to compare the produced summaries and to decide on their quality and fitness for specific tasks. In this work, we propose a comprehensive Quality Framework for RDF Graph Summarization that allows a better, deeper and more complete understanding of the quality of the different summaries and facilitates their comparison. We work at two levels: the level of the ideal summary of the KB, which could be provided by an expert user, and the level of the instances contained in the KB. At the first level, we compute how close the proposed summary is to the ideal solution (when one is available) by defining and computing its precision, recall and F-measure against the ideal solution. At the second level, we compute whether, and to what degree, the existing instances are covered (i.e. can be retrieved) by the proposed summary; again, we define and compute its precision, recall and F-measure against the data contained in the original KB. We also compute the connectivity of the proposed summary compared to the ideal one, since connectivity is an important factor in many cases (e.g., when we want to query the KB) and since linked datasets are commonly used in RDF.
We use our quality framework to evaluate the results of three of the best RDF Graph Summarization algorithms on KBs that are different in terms of content and diverse in terms of total size and number of instances, classes and predicates, and we present comparative results for them. We conclude by discussing these results and the suitability of the proposed quality framework for gaining useful insights into the quality of the presented results.
Keywords: Quality framework, quality metrics, RDF Summarization, Linked Open Data, RDF query processing
DOI: 10.3233/SW-190346
Journal: Semantic Web, vol. 10, no. 3, pp. 555-584, 2019
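
The schema-level comparison described in the abstract (precision, recall and F-measure of a proposed summary against an ideal one) can be illustrated with a minimal sketch. It assumes that both summaries are reduced to plain sets of class names and that a class counts as matched only when its name appears in both sets; the article's own metric definitions may be more refined than this set overlap. The function name `precision_recall_f1` and the example class names are hypothetical.

```python
from typing import Set, Tuple


def precision_recall_f1(proposed: Set[str], ideal: Set[str]) -> Tuple[float, float, float]:
    """Set-overlap precision, recall and F-measure of a proposed summary.

    Assumption: each summary is represented only by the set of class names
    it contains; exact name equality is the matching criterion.
    """
    matched = proposed & ideal
    # Precision: share of proposed classes that also appear in the ideal summary.
    precision = len(matched) / len(proposed) if proposed else 0.0
    # Recall: share of ideal classes that the proposed summary recovers.
    recall = len(matched) / len(ideal) if ideal else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    # F-measure: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: an expert-provided ideal summary and an algorithm's output.
ideal_classes = {"Person", "Painter", "Painting", "Museum"}
summary_classes = {"Person", "Painting", "Museum", "City"}

p, r, f = precision_recall_f1(summary_classes, ideal_classes)
print(f"precision={p:.2f} recall={r:.2f} f-measure={f:.2f}")
```

The same function could be applied at the instance level by passing sets of instance identifiers instead of class names, again under the assumption that set membership is an adequate notion of coverage.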