
RelTopic: A graph-based semantic relatedness measure in topic ontologies and its applicability for topic labeling of old press articles


Graph-based semantic measures have been used to solve problems in several domains. They tend to compare semantic entities in order to estimate their similarity or relatedness. While semantic similarity is applicable to hierarchies or taxonomies, semantic relatedness is adapted to ontologies. In this work, we propose a novel semantic relatedness measure, named RelTopic, within topic ontologies for topic labeling purposes. In contrast to traditional measures, which are dependent on textual resources, RelTopic considers semantic properties of entities in ontologies. Thus, correlations of nodes and weights of nodes and edges are assessed. The pertinence of RelTopic is evaluated for topic labeling of old press articles. For this purpose, a topic ontology representing the articles, named Topic-OPA, is derived from open knowledge graphs by applying a SPARQL-based automatic approach. A use-case is presented in the context of the old French newspaper Le Matin. The generated topics are evaluated using a dual evaluation approach with the help of human annotators. Our approach shows an agreement quite close to that shown by humans. The entire approach’s reuse is demonstrated for labeling a different context of articles, recent (modern) newspapers.


Fig. 1.

Excerpt of Le Matin.


This article presents work carried out as part of the ASTURIAS project in the domain of cultural heritage. The main goal of ASTURIAS is to organize a collection of old press articles thematically and automatically under a set of topics (e.g., Politics, Art, Sport, Science, etc.). A specific feature of the old press is that it offers no thematic entries (see Fig. 1): articles follow one another without any thematic logic. Under these conditions, it remains tedious to query sources that report the same events from different points of view in different areas of the newspaper. The scientific challenge is to propose robust approaches for analyzing texts that are noisy because of the imperfect automatic transcription of images into electronic text. These approaches also need to be multi-thematic and robust to linguistic evolution over the centuries. The ambition of the ASTURIAS project (whose workflow appears in Fig. 2) is to study the digitization process across the whole processing chain: WP1 – automatically analyze sections, articles and texts from newspaper images; WP2 – extract named entities from these elements; WP3 – label the articles with topics and hyperlink them, based on the analysis made in WP1 and the named entities extracted in WP2.

Fig. 2.

The pipeline of the project ASTURIAS.


Our work’s main goal is to propose a framework for the automatic labeling of old press articles (WP3). The framework aims to replace human annotators with software for labeling a number of articles too large to process manually. The task of labeling documents according to their topics has traditionally been addressed either by classifiers that assign a set of predefined topics to the articles (e.g., [8,39]), or by topic detection methods (e.g., probabilistic latent semantic analysis (pLSA) [22], latent Dirichlet allocation (LDA) [2]), which generate topics from textual resources [39]. The main advantage of the first approach is that it produces clean, formally defined research topics [39]; it is recommended when a good characterization of the research topics within a domain is available [39]. The second approach, by contrast, generates topics from scratch, which leads to noisier and less interpretable results [39].

In this study, we propose applying graph-based semantic relatedness measures to assess the semantic relatedness between the topics of a topic ontology and the content of articles. Graph-based semantic measures have been used to solve problems in a broad range of domains such as Natural Language Processing (e.g., [16]), Information Retrieval (e.g., [17]), Knowledge Engineering (e.g., [14]), Semantic Web and Linked Data (e.g., [9]) and Bioinformatics (e.g., [18]). They are considered essential tools for designing numerous algorithms in which semantics matters [19]. A graph-based semantic measure is a mathematical tool used to estimate the strength of the semantic interaction between entities (concepts or instances) based on the analysis of ontologies [19]. Thus, the application of such a measure strongly depends on the availability of an ontology that represents the application domain. Two main categories of graph-based semantic measures are distinguished: (1) similarity measures adapted to taxonomies and (2) relatedness measures adapted to semantic graphs composed of different types of relationships [19]. Building semantic relatedness measures is a challenging and important research issue since they have to consider several kinds of relations and not only the taxonomic ones [30]. In the literature, apart from Hirst and St-Onge’s measure [21], there have been relatively few attempts to develop relatedness measures [30,34]. Most efforts have been directed at designing similarity measures such as [25,26,35,37]. For comparing ontological entities, graph-based measures are classified into two basic approaches: path-based, which compares concepts according to properties of paths in graphs, and node-based, which uses properties of concepts in the ontology graph. However, these approaches suffer from different limitations.

The major contribution of this work is the design and evaluation of a semantic relatedness measure, named RelTopic, that considers the semantic properties of entities in topic ontologies. RelTopic is designed as a combination of node-based and path-based approaches. In contrast to existing measures, it assesses the relatedness of concepts and instances by considering different types of relations. RelTopic is used for topic labeling of old press articles, which are represented by a set of “not ambiguous” named entities extracted from open data sources (WP2). A second contribution is the construction of a topic ontology, named Topic-OPA, from the open knowledge graph Wikidata using a SPARQL-based automatic approach. Topic-OPA is required for the application of RelTopic. Based on RelTopic and Topic-OPA, we define the selection process of the most relevant topics for labeling the articles. To demonstrate the performance of our approach, a use case is presented in the context of Le Matin, an old French newspaper first published in 1884 and discontinued in 1944. Finally, Topic-OPA and RelTopic are evaluated using dual evaluation approaches. The reusability of our approach is demonstrated by labeling articles in different contexts, such as recent newspapers. The implementation of our study is available on GitHub [31].

The remainder of this paper is organized as follows: the research problem is specified in Section 2. Section 3 considers the main related works. In Section 4, we discuss our semantic relatedness measure RelTopic. Section 5 introduces Topic-OPA. Section 6 discusses the topic labeling process. In Section 7, we present a use case for labeling the articles of Le Matin. We evaluate and discuss the approach in Section 8 and Section 9 respectively. Finally, Section 10 concludes the paper.

2. Problem definition

To define our research problem, we adopt a fundamental hypothesis: articles are represented by a set of “not ambiguous” named entities (e.g., person, organization, product and location) extracted from open data sources (coming from WP2 of the ASTURIAS project, as shown in Fig. 2). The research problem can then be defined as follows: given a corpus of articles $A$, a set of named entities $N$ (represented by a set of URIs) collected from $A$ (WP2), and a topical structure $T$, we want to find the most relevant topics described in $T$ that label $A_i$, $\forall A_i \in A$. From this perspective, our work (WP3) mainly considers the following issues:

  1. Construction of the topical structure as a predefined set of topics: it takes as input $N$, i.e., the set of disambiguated named entities, and constructs $T$, i.e., a convenient topical structure based on $N$.

  2. Named entity-topic mapping process as a relevance assessment: this process is performed for each $A_i \in A$. It aims to map $N_i$, the named entities of $A_i$, to the topics of $T$ to evaluate their relevance. Thus, the mapping process takes as inputs $n \in N_i$ and $t \in T$, and evaluates whether $t$ is relevant to $n$. The relevance is examined as a semantic (not syntactic) relatedness, so a semantic measure is needed to compute it.

  3. Ranking and selection of the most relevant topics as a topic labeling process: for each $A_i \in A$, it takes as input the relatedness values of $n$ and $t$ obtained from the entity-topic mapping process, ranks them, and selects the best topic(s) to label $A_i$.

3. Related works

This section outlines the following related works: graph-based semantic measures, semantic relatedness measures, topic ontologies, and ontology-based labeling of articles.

3.1. Graph-based semantic measures

For comparing ontological entities, graph-based measures fall into two basic approaches: path-based and node-based. In path-based approaches, concepts are compared according to properties of paths in the graph. The most common property is the shortest path connecting two nodes in a given ontology: the shorter the path, the higher the similarity. Rada’s measure is an example of a similarity measure adapted to taxonomies:

$$dist_{Rada}(c_1,c_2) = sp(c_1,c_2), \qquad Sim_{Rada}(c_1,c_2) = \frac{1}{dist_{Rada}(c_1,c_2)+1}$$

where $dist_{Rada}$ is the shortest path $sp$ between $c_1$ and $c_2$, and $Sim_{Rada}$ is the distance to similarity conversion [35].

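Leacock and Chodorow’s measure is another example of this category, designed for WordNet [25]:

$$Sim_{LC}(c_1,c_2) = -\log \frac{len(c_1,c_2)}{2 \times \max_{c \in WordNet} depth(c)}$$

where $len(c_1,c_2)$ is the shortest path between $c_1$ and $c_2$ and $\max_{c \in WordNet} depth(c)$ is the maximum depth of a concept $c$ in WordNet.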

Also in this category, Hirst and St-Onge’s measure, which considers non-taxonomic links, quantifies the weight of the path between two concepts as follows [21]:

$$Rel_{HSO}(c_1,c_2) = C - len(c_1,c_2) - k \times turns(c_1,c_2)$$

where $C$ and $k$ are constants ($C=8$ and $k=1$), and $turns(c_1,c_2)$ is the number of times the path between $c_1$ and $c_2$ changes direction (i.e., a downward link after an upward link). The particular difficulty of this approach is to determine the direction of each link [30]. Path-based approaches suffer from a significant drawback: they consider all edges equivalent, which implies a uniform distance.
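As an illustration, the three path-based measures above can be written as small functions. This is a minimal sketch that assumes their commonly cited formulations (Rada: $1/(dist+1)$; Leacock and Chodorow: $-\log(len/(2 \times maxdepth))$; Hirst and St-Onge: $C - len - k \times turns$). The function names and the toy inputs are hypothetical, and path lengths are supplied directly rather than computed from an ontology.

```python
import math


def sim_rada(shortest_path_len: float) -> float:
    """Convert a shortest-path distance into a similarity in (0, 1]."""
    return 1.0 / (shortest_path_len + 1.0)


def sim_leacock_chodorow(path_len: float, max_depth: int) -> float:
    """Negative log of the path length scaled by twice the taxonomy depth."""
    if path_len <= 0 or max_depth <= 0:
        raise ValueError("path_len and max_depth must be positive")
    return -math.log(path_len / (2.0 * max_depth))


def rel_hirst_st_onge(path_len: int, turns: int, c: int = 8, k: int = 1) -> int:
    """Path weight that penalizes both length and changes of direction."""
    return c - path_len - k * turns


# Toy inputs (assumed values, not taken from a real ontology).
print(sim_rada(3))                  # a distance of 3 edges
print(sim_leacock_chodorow(2, 16))  # path of length 2, taxonomy of depth 16
print(rel_hirst_st_onge(3, 1))      # path of length 3 with one change of direction
```

All three functions count every edge as one unit of length, which is exactly the uniform-distance drawback noted above.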

Node-based approaches use properties of the concepts themselves in the ontology graph. The most common property is the Information Content (IC) of a node, which is calculated from the term’s frequency in a given corpus and denotes how specific and informative a concept is. The best-known IC measures, which are based on the lowest common subsumer (LCS), are Resnik’s [37] and Lin’s [26].

Resnik’s measure uses the Information Content of the LCS as the similarity value:

$$Sim_{Resnik}(c_1,c_2) = IC(LCS(c_1,c_2))$$

where the IC of a concept is defined as the negative log of the probability of that concept:

$$IC(c) = -\log p(c)$$

Lin’s measure is considered a refinement of Resnik’s measure and is computed as follows:

$$Sim_{Lin}(c_1,c_2) = \frac{2 \times IC(LCS(c_1,c_2))}{IC(c_1)+IC(c_2)}$$

Two main limitations are recognized for these approaches: (1) they depend on textual resources, and (2) they are applicable only to taxonomies.
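The following minimal sketch shows how IC-based similarity depends on corpus statistics. It assumes the standard definitions (IC as the negative log of a concept’s probability, Resnik as the IC of the LCS, Lin as twice that value normalized by the ICs of the two concepts). The concept counts are invented for illustration, and the LCS is passed in by hand instead of being derived from a taxonomy.

```python
import math
from typing import Dict


def information_content(count: int, total: int) -> float:
    """IC(c) = -log p(c), with p(c) estimated as count / total."""
    return -math.log(count / total)


def sim_resnik(ic: Dict[str, float], lcs: str) -> float:
    """Resnik similarity: the IC of the lowest common subsumer."""
    return ic[lcs]


def sim_lin(ic: Dict[str, float], c1: str, c2: str, lcs: str) -> float:
    """Lin similarity: Resnik's value normalized by the ICs of both concepts."""
    return 2.0 * ic[lcs] / (ic[c1] + ic[c2])


# Hypothetical corpus counts; each count includes occurrences of descendants.
COUNTS = {"entity": 1000, "animal": 200, "dog": 50, "cat": 40}
IC = {c: information_content(n, COUNTS["entity"]) for c, n in COUNTS.items()}

print(sim_resnik(IC, "animal"))
print(sim_lin(IC, "dog", "cat", "animal"))
```

Changing the corpus changes `COUNTS` and therefore every similarity value, which illustrates the first limitation listed above.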

3.2. Semantic relatedness measures

This section outlines significant works in the literature that address the design of semantic relatedness measures; these measures, however, remain strongly dependent on textual resources. Mazuel and Sabouret [30] proposed a measure that evaluates the semantic relatedness of two concepts by considering the object properties in ontologies. They distinguish two types of paths: the single-relation path, in which all the edges have the same type (e.g., hierarchical relations), and the mixed-relation path, in which different types of relations (hierarchical and non-hierarchical) are involved. The proposed measure is composed of three main tasks: (1) apply the set of patterns given in [21] to filter out the paths that are not semantically correct; (2) use the information-theoretic definition of semantic similarity given in [37] to weight the hierarchical edges in the graph; (3) compute the weight of the non-hierarchical edges. Finally, the relatedness value is obtained by summing the results of these tasks. Another work to cite is a context-vector approach proposed in the biomedical domain [27,34]. This approach aims to compute the semantic relatedness between pairs of concepts in the Unified Medical Language System (UMLS). It builds on the Gloss Overlaps approach (i.e., counting the number of shared words in the definitions of two concepts), which relies on the WordNet dictionary [3]. The gloss vector approach combines the definitions of concepts with co-occurrence data from a given corpus (e.g., clinical reports). Every word in a definition is replaced by its context vector from the co-occurrence data, and relatedness is calculated as the cosine of the angle between the two resulting vectors. Because WordNet provides only a limited set of semantic relations (is-a, part-of), the context-vector approach extends the construction of concept definitions by using the different relations available in the UMLS.

3.3. Topic ontologies

Topic ontologies are considered a special type of ontologies. Their purpose is to identify the “themes” necessary to describe the knowledge structure of an application domain [46]. A topic ontology is represented as a set of topics that are interconnected using semantic relations. Two main types of topic ontologies are defined: simple, and general [28]. The simple topic ontologies are composed of topics linked by hierarchical relations. Meanwhile, in general topic ontologies, transverse relations are included to link different topics in a non-hierarchical scheme. Topic ontologies are being increasingly used in various domains such as semantic matching [45], topic labeling [1], topic modeling [41], evaluating topical search [28] and classification of research articles [38].

The most commonly known approaches for building topic ontologies are keyword-based construction approaches, which rely mainly on text mining and information retrieval techniques [28,39]. However, these approaches are inefficient: constructing an ontology from a large corpus of documents with them is hard and time-consuming [28]. In the literature, few works address building topic ontologies from knowledge graphs (e.g., [6]) or Web sources (e.g., [33]). In [6], topic-specific ontologies are built from open knowledge graphs such as ConceptNet [43]. A query-based interactive approach is applied for extracting entities and relations from the knowledge graph, in three main phases: construction of the central taxonomy, ontology enrichment, and ontology cleaning. Another approach to cite is Klink-2 [33], which generates ontologies of research topics [38] by integrating multiple web sources. In particular, Klink-2 analyses networks of research entities (including papers, authors, venues, and technologies) to infer three main types of semantic relationships. For instance, the hierarchical relationships between two entities, which can occur in a set of documents, are inferred by considering the similarity between the distributions of co-occurring keywords and their string similarity. In addition, this approach handles the ambiguity of keywords that are associated with a set of noisy relationships.

3.4. Labeling articles using topic ontologies

As a notable related work, we present the CSO Classifier, an ontology-driven classifier that labels scholarly articles [39] according to the Computer Science Ontology (CSO) [38]. CSO includes 14K semantic topics and 162K relationships. The CSO Classifier takes as input the text from the metadata associated with a scholarly article (title, abstract, and keywords) and returns a list of CSO research topics. The selection of topics is performed in three steps: (1) identify all topics in the ontology that are explicitly mentioned, or referred to, in the paper; (2) identify semantically related topics that may not be explicitly referred to in the article, by using part-of-speech tagging and word embeddings; the word embeddings are used to compute the semantic similarity between the terms in the document and the CSO concepts; (3) enrich the results by including super-area topics according to CSO.

4. Our semantic relatedness measure

In this section, we propose a hybrid graph-based semantic relatedness measure within topic ontologies. As a contribution to the body of approaches that aim to overcome the limitations of existing measures (e.g., [30]), we designed our measure as a combination of path-based and node-based approaches. Thus, we comprehensively consider the semantic properties of nodes and edges:

  • Weighting of edges: to differentiate between hierarchical and non-hierarchical relations with regard to the properties of the paths. This semantic property aims to overcome the limitation of considering all edges equivalent in path-based approaches.

  • Weighting and correlation of nodes: to consider semantic properties of concepts independently of textual resources. The weight of a concept measures its neighborhood in the ontology graph, and its correlation measures its coverage [20]. Applying such semantic properties can overcome the dependency on texts in node-based approaches.

4.1. Topic ontologies as semantic graphs

To apply graph-based semantic measures, ontologies need to be represented as graphs using a graph-based formalism. In semantic graphs associated with general topic ontologies, topics and instances are denoted as nodes and the different types of relationships (hierarchical and non-hierarchical) as edges.

Definition 1.

We define the semantic graph associated with a general topic ontology as a directed weighted graph $G=(V,E,T,\tau,\omega,\delta)$, where $V$ is a finite set of nodes that represent topics and instances, $E \subseteq V \times V$ is a finite set of edges connecting pairs of nodes $(v_i,v_j)$ from $V$, $T$ is a finite set of edge types, $\tau: E \rightarrow T$ is a function that maps edges in $E$ to their types in $T$ = {subclassOf, part of, used by, …}, $\omega: V \rightarrow \mathbb{R}^{+}$ is a node-weighting function that maps nodes to their weights, and $\delta: E \rightarrow \mathbb{R}^{+}$ is an edge-weighting function that assigns weights to edges.

Definition 2.

The set of neighbours $N(v_i)$ of a node $v_i \in V$ is the set of nodes $\{v_j,\dots,v_k\}$ that are linked to $v_i$ by the edges $\{e_j,\dots,e_k\} \subseteq E$.

Definition 3.

The set of hypernyms $H(v_i)$ of a node $v_i \in V$ is the set of nodes $\{v_h,\dots,v_k\}$ that are linked to $v_i$ by the edges $\{e_h,\dots,e_k\}$, where $\tau(e_m) \in$ {subclassOf, instance of}, $\forall e_m \in \{e_h,\dots,e_k\}$.

Definition 4.

A path $P(v_i \rightarrow v_j)$ between $v_i, v_j \in V$ is a sequence of nodes and edges $\{v_i,e_i,\dots,v_k,e_k,v_{k+1},e_{k+1},v_j\}$ connecting $v_i$ and $v_j$. For every two consecutive nodes $v_k, v_{k+1} \in V$ in $P(v_i \rightarrow v_j)$, there exists an edge $e_k \in E$.

Definition 5.

The length of a path $|P(v_i \rightarrow v_j)|$ is obtained by summing the weights of the edges that constitute the path between $v_i$ and $v_j$: $|P(v_i \rightarrow v_j)| = \sum_{e_i \in E(P)} \delta(e_i)$.

Definition 6.

The distance $dist(v_i \rightarrow v_j)$ between $v_i$ and $v_j$ is the minimum length of a path from $v_i$ to $v_j$.

Definition 7.

The size of a semantic graph |G| is the total number of nodes in G.

4.2. Design of RelTopic

For designing RelTopic, five main phases are defined: (1) weight allocation for nodes, (2) weight allocation for edges, (3) computation of the degree centrality of nodes, (4) computation of the semantic distance and (5) computation of the semantic relatedness.

4.2.1. Weight allocation for nodes

Inspired by the information-content measures [36,37], which established the adequacy of the log function for node weighting [30], we base the weight allocation for nodes on this function. In addition, we take advantage of the neighborhood of nodes, and we differentiate between weights for topics and weights for instances. For topics, weights are formally defined by $\omega(v_i) = -\log\frac{|N(v_i)|}{|G|}$. For instances, two main cases are identified:

  1. $v_i$ is an instance of a single hypernym node $v_h$. In this case, the weight is formally defined by $\omega(v_i)=\omega(v_h)$.

  2. $v_i$ is an instance of multiple hypernym nodes represented by $H(v_i)=\{v_h,\dots,v_m\}$. Here, $\omega(v_i)=\overline{\omega(v_n)}_{v_n \in H(v_i)}$, where $\overline{\omega(v_n)}$ is the average of the weights of the hypernyms of $v_i$.
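The node-weighting rules above can be sketched in a few lines. This sketch assumes the negative-log form of the topic weight, $-\log(|N(v_i)|/|G|)$, so that weights stay non-negative as required by $\omega: V \rightarrow \mathbb{R}^{+}$. The function names and the toy neighbour counts are hypothetical.

```python
import math
from typing import Sequence


def topic_weight(num_neighbours: int, graph_size: int) -> float:
    """omega(v) = -log(|N(v)| / |G|): fewer neighbours give a higher weight."""
    if not 0 < num_neighbours <= graph_size:
        raise ValueError("expected 0 < |N(v)| <= |G|")
    return -math.log(num_neighbours / graph_size)


def instance_weight(hypernym_weights: Sequence[float]) -> float:
    """Average of the hypernym weights; a single hypernym passes its weight on."""
    if not hypernym_weights:
        raise ValueError("an instance needs at least one hypernym")
    return sum(hypernym_weights) / len(hypernym_weights)


# Toy graph of 1000 nodes (assumed sizes, for illustration only).
graph_size = 1000
w_politician = topic_weight(10, graph_size)   # a fairly specific topic
w_judge = topic_weight(100, graph_size)       # a better-connected topic

print(w_politician, w_judge)
print(instance_weight([w_politician]))            # case 1: single hypernym
print(instance_weight([w_politician, w_judge]))   # case 2: several hypernyms
```

Under this reading, a topic with few neighbours is treated as more specific and receives a larger weight, mirroring the role of Information Content without requiring a corpus.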

4.2.2. Weight allocation for edges

Given the diversity of relations within general topic ontologies, the allocation of weights for edges depends mainly on the relation types. Therefore, we consider a static weight allocation that reflects the “strength” of a given relation type [23,30]. Two main types of relations are recognized:

  • Hierarchical relations: subclassOf and instance of, which are classified as vertical relations with a cost of 1.

  • Non-hierarchical relations: part/whole relations (e.g., part of, has part) and general relations (e.g., facet of, field of work, practiced by, used by). This type of relation is considered informative, so the cost of such an edge must be low [30]. Based on experimentation, non-hierarchical relations are given a cost in $[0.1,0.4]$; in this study, we applied 0.25 as a discriminant value.

Given two nodes $v_i$ and $v_{i+1}$ linked by an edge $e_i$, the weight of $e_i$ is:

$$\delta(e_i) = \begin{cases} 1, & \text{if } \tau(e_i) \in \{\text{subclassOf}, \text{instance of}\} \\ 0.25, & \text{otherwise} \end{cases} \tag{7}$$
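The static edge weighting can be expressed directly as a lookup on the relation type. The sketch below uses the costs stated above (1 for hierarchical relations, 0.25 otherwise) and sums them to obtain a path length in the sense of Definition 5. The function names and the example relation sequences are assumptions made for illustration.

```python
from typing import Iterable

HIERARCHICAL_TYPES = {"subclassOf", "instance of"}
# Value applied in this study; the explored range is [0.1, 0.4].
NON_HIERARCHICAL_COST = 0.25


def edge_weight(edge_type: str) -> float:
    """delta(e): 1 for hierarchical relations, 0.25 for all other relation types."""
    return 1.0 if edge_type in HIERARCHICAL_TYPES else NON_HIERARCHICAL_COST


def path_length(edge_types: Iterable[str]) -> float:
    """Length of a path as the sum of its edge weights (Definition 5)."""
    return sum(edge_weight(t) for t in edge_types)


# Two hypothetical two-edge paths starting from an instance.
print(path_length(["instance of", "subclassOf"]))
print(path_length(["instance of", "field of work"]))
```

With these costs, a path that uses a non-hierarchical relation is shorter than a purely taxonomic path with the same number of edges.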

4.2.3. Computation of the degree centrality for nodes

The degree centrality of a node is a basic indicator for studying networks and is defined as the number of its adjacencies [32]. It corresponds to how much of the whole domain of interest the node is correlated with [20]. For unweighted graphs, the degree measure is formally defined by $D(v_i)=|N(v_i)|$, where $|N(v_i)|$ is the number of neighbours of the node $v_i$ [42]. In weighted graphs, $D(v_i)=\sum_{v_j \in N(v_i)} \delta(e_j) \times \omega(v_j)$, where $e_j=\{v_j,v_i\}$.

In our work, we use this measure to quantify the degree centrality of topics and instances. We consider that the degree centrality of an instance is related to the degree centrality of its hypernym node(s). More precisely, for every path $P(v_i \rightarrow v_k)$, where $v_i$ is the instance node and $v_k$ is the topic node, we calculate the degree centrality of $v_k$ and of the hypernym node(s) of $v_i$. Two main cases are identified:

  1. $v_i$ is an instance of a single hypernym node. The degree centrality of the instance is then formally defined by $D(v_i)=\sum_{v_j \in N(v_h)} \delta(e_j) \times \omega(v_j)$, where $v_h$ is the hypernym of $v_i$, $e_h=\{v_i,v_h\}$, $\tau(e_h)=$ {instance of} and $e_j=\{v_j,v_h\}$.

  2. $v_i$ is an instance of multiple hypernym nodes represented by $H(v_i)=\{v_h,\dots,v_m\}$. Here, $D(v_i)=\overline{D(v_n)}_{v_n \in H(v_i)}$, where $\overline{D(v_n)}$ is the average of the degree centrality of the hypernyms of $v_i$.
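A minimal sketch of the weighted degree centrality follows. It assumes that the graph is given as plain dictionaries (neighbour lists, edge weights and node weights). The helper names, the toy nodes and all numeric values are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def degree_centrality(node: str,
                      neighbours: Dict[str, List[str]],
                      delta: Dict[Edge, float],
                      omega: Dict[str, float]) -> float:
    """D(v) = sum over neighbours v_j of delta(e_j) * omega(v_j)."""
    total = 0.0
    for nb in neighbours[node]:
        # Each edge is stored once; look the pair up in either orientation.
        weight = delta[(nb, node)] if (nb, node) in delta else delta[(node, nb)]
        total += weight * omega[nb]
    return total


def instance_degree_centrality(hypernyms: List[str],
                               neighbours: Dict[str, List[str]],
                               delta: Dict[Edge, float],
                               omega: Dict[str, float]) -> float:
    """Average centrality of the hypernyms (one hypernym: its own centrality)."""
    values = [degree_centrality(h, neighbours, delta, omega) for h in hypernyms]
    return sum(values) / len(values)


# Toy data (assumed weights, not computed from Topic-OPA).
neighbours = {"Politician": ["Professional", "Politics"], "Judge": ["Magistrate"]}
delta = {("Politician", "Professional"): 1.0,   # hierarchical edge
         ("Politician", "Politics"): 0.25,      # non-hierarchical edge
         ("Judge", "Magistrate"): 1.0}
omega = {"Professional": 2.0, "Politics": 4.0, "Magistrate": 5.0}

print(degree_centrality("Politician", neighbours, delta, omega))
print(instance_degree_centrality(["Politician", "Judge"], neighbours, delta, omega))
```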

4.2.4. Semantic distance computation

To estimate the relatedness of two nodes $v_i$ and $v_j$, we need to calculate the semantic distance $dist(v_i \rightarrow v_j)$ (i.e., the shortest path) between them. In weighted graphs, different algorithms can be used to compute this distance, such as Dijkstra’s [11] and Bellman-Ford [4]. In our study, we applied Dijkstra’s algorithm.
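The distance computation can be sketched with a standard heap-based Dijkstra search. This is an illustrative implementation, not the authors’ code. It assumes that edges can be traversed in both directions, so that a path may go up and down the ontology, and it uses the edge costs of 1 and 0.25 described earlier. The toy triples are invented.

```python
import heapq
from typing import Dict, List, Tuple

HIERARCHICAL_TYPES = {"subclassOf", "instance of"}
Adjacency = Dict[str, List[Tuple[str, float]]]


def build_adjacency(triples: List[Tuple[str, str, str]]) -> Adjacency:
    """Turn (source, relation, target) triples into a weighted adjacency list."""
    adj: Adjacency = {}
    for source, relation, target in triples:
        weight = 1.0 if relation in HIERARCHICAL_TYPES else 0.25
        adj.setdefault(source, []).append((target, weight))
        adj.setdefault(target, []).append((source, weight))  # assumed traversable both ways
    return adj


def shortest_distance(adj: Adjacency, start: str, goal: str) -> float:
    """Dijkstra's algorithm; returns infinity when goal is unreachable."""
    best = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        dist, node = heapq.heappop(heap)
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        if node == goal:
            return dist
        for nxt, weight in adj.get(node, []):
            candidate = dist + weight
            if candidate < best.get(nxt, float("inf")):
                best[nxt] = candidate
                heapq.heappush(heap, (candidate, nxt))
    return float("inf")


# Toy excerpt of a topic ontology (hypothetical triples).
ADJ = build_adjacency([
    ("John Simon", "instance of", "Politician"),
    ("Politician", "field of this occupation", "Politics"),
    ("Politician", "subclassOf", "Professional"),
    ("Politics", "subclassOf", "Activity"),
])
print(shortest_distance(ADJ, "John Simon", "Politics"))
print(shortest_distance(ADJ, "John Simon", "Activity"))
```

Dijkstra’s algorithm is applicable here because all edge weights are positive; Bellman-Ford would only be needed if negative weights were allowed.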

4.2.5. Semantic relatedness computation

In this section, we present the computation of the semantic relatedness between instances and topics within topic ontologies. Consider an instance $v_i$ and a topic $v_j$ in a given topic ontology, and let $P(v_i \rightarrow v_j)$ be the path between them. The semantic relatedness measure takes these elements as input and returns a numerical value, $RelTopic \in [0,1]$, that quantifies their relatedness based on the following formula:

where $dist(v_i \rightarrow v_j)$ is the semantic distance between $v_i$ and $v_j$, $\omega(v_i)$ and $\omega(v_j)$ are the weights of $v_i$ and $v_j$ respectively, and $D(v_i)$ and $D(v_j)$ are the degree centrality of $v_i$ and $v_j$ respectively. The formula also uses a variable $k$ that takes two possible values:

$$k = \begin{cases} 1, & \text{if } P(v_i \rightarrow v_j) \text{ is semantically correct} \\ 0, & \text{if } P(v_i \rightarrow v_j) \text{ is semantically incorrect} \end{cases} \tag{8}$$

The correctness of the semantic path between two nodes is determined by the constraints proposed in [21]. If a path $P(v_i \rightarrow v_j)$ changes direction from upward (generalization) to downward (specialization) at a point related to a hierarchical link, it is considered semantically incorrect. This is the case, for instance, at a node $v_k$ in $P(v_i \rightarrow v_j)$ where $\{v_{k-1},e_k,v_k\}$ with $\tau(e_k)=$ {subClassOf} and $\{v_{k+1},e_{k+1},v_k\}$ with $\tau(e_{k+1})=$ {subClassOf}. Thereby, all the paths traversing the top of the ontology are penalized.
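The path-correctness rule can be sketched as a scan over consecutive traversal steps. The representation below is an assumption made for illustration: each step records the relation type and whether the edge is walked upward (towards the more general node), downward, or across a non-hierarchical relation. It implements only the narrow reading in which an upward hierarchical step is immediately followed by a downward hierarchical step.

```python
from typing import List, Tuple

HIERARCHICAL_TYPES = {"subClassOf", "instance of"}

# One traversal step: (relation type, direction in which the edge is walked).
Step = Tuple[str, str]


def is_semantically_correct(steps: List[Step]) -> bool:
    """False when an upward hierarchical step is followed by a downward one."""
    for (type_a, dir_a), (type_b, dir_b) in zip(steps, steps[1:]):
        both_hierarchical = (type_a in HIERARCHICAL_TYPES
                             and type_b in HIERARCHICAL_TYPES)
        if both_hierarchical and dir_a == "up" and dir_b == "down":
            return False
    return True


def k_factor(steps: List[Step]) -> int:
    """k = 1 for a semantically correct path, 0 otherwise."""
    return 1 if is_semantically_correct(steps) else 0


# Path that climbs to a shared superclass and then descends to a sibling topic.
via_common_parent = [("instance of", "up"), ("subClassOf", "up"), ("subClassOf", "down")]
# Path that reaches the topic through a non-hierarchical relation.
via_transverse = [("instance of", "up"), ("field of work", "across")]

print(k_factor(via_common_parent))
print(k_factor(via_transverse))
```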

5. Topic-OPA: A topic ontology for modeling topics of old press articles

In this study, RelTopic is applied within topic ontologies to compute the relatedness of instances and topics. For this purpose, we need to build a topic ontology that represents the domain of old press articles. In this section, we present Topic-OPA, a topic ontology harvested from open knowledge graphs for modeling topics of old press articles.

Generally, knowledge graphs are very large and contain many entities that are too general or too specific to be used successfully as topics for topic labeling [6]. However, they can be leveraged to build, with moderate effort, small to medium-sized meaningful topic ontologies. As a knowledge graph, we selected Wikidata. It is a free and open knowledge graph that acts as central storage for the structured data of its Wikimedia sister projects, including Wikipedia, Wiktionary, and others [13]. Wikidata stores more than 402 million statements about over 45 million entities [29]. Today, more than 60 million items are described. The data model of Wikidata is based on a directed, labeled graph where entities are connected by edges that are labeled by “properties” [5]. Thus, the system distinguishes two main types of entities: items and properties. Items are uniquely identified by a “Q” followed by a number, such as Paris (Q90). Properties describe detailed characteristics of an item and are identified by a “P” followed by a number, such as instance of (P31). Entities are represented by URIs (for example, one for the item Paris and one for the property instance of). In the following, we discuss the ontology definition, specification, requirements, and development.

5.1. Ontology definition

Topic-OPA is defined as a general topic ontology by considering instances and mapping to knowledge graphs [12].

Definition 8.

We define a general topic ontology, in which instances and mappings to knowledge graphs are considered, by $O=\langle T,I,R,E,\phi \rangle$, with

  • $T$ the set of topic concepts,

  • $I$ the set of instances,

  • $R$ the set of predicates: {subClassOf, instance of, part of, use, related by, etc.},

  • $E$ the set of relationships: $E \subseteq E_{TT} \cup E_{IT}$ with:

    • $E_{TT} \subseteq T \times R \times T$

    • $E_{IT} \subseteq I \times R \times T$

  • $\phi$ the mapping of $T$ and $R$ to entities in a knowledge graph $K$.

5.2. Ontology specification and requirements

The ontology specification states the purpose and the scope of the topic ontology. Concerning the purpose, Topic-OPA is intended to be used as a knowledge base for a topic labeling system in the domain of old press articles. Regarding the scope, Topic-OPA is an application-based, domain-dependent ontology. For example, given a corpus of articles from the year 1920, Topic-OPA is constructed from all the disambiguated named entities representing these articles.

For the requirements [44], Topic-OPA has a functional requirement that requires the definition of two different schemes in the ontology: hierarchical and non-hierarchical.

  • Hierarchical scheme: consists of hierarchical relations such as subClassOf that permit the inference of knowledge in the ontology graph.

  • Non-hierarchical scheme: involves non-hierarchical relations such as related, part of, used by, etc. that have an important implication in the semantic relationships between the concepts.

Besides, Topic-OPA has a non-functional requirement that considers data traceability and scalability by mapping the concepts and the relations of Topic-OPA to entities in open knowledge graphs such as Wikidata.

5.3. Ontology development: SPARQL-based approach

This section discusses a SPARQL-based approach that aims to harvest topic ontologies from open knowledge graphs. A main requirement of this approach is that the application domain is represented by a set of disambiguated named entities. The proposed approach is composed of three main phases: (1) construction of the hierarchical scheme, (2) construction of the non-hierarchical scheme and (3) ontology enrichment. In this study, the ontology development phases are applied to Wikidata.

5.3.1. Building the hierarchical scheme: Bottom-up approach

The hierarchical scheme of Topic-OPA, which represents the taxonomy of topic concepts, can be formally defined by $H=\langle T,R,E,\phi \rangle$, where $T$ is the set of topic concepts, $R$ is the unique predicate {subClassOf} used for ordering the topic concepts, $E$ is the set of ordered relations, and $\phi$ is the mapping function to Wikidata. In the hierarchy, a root element denoted $\top$ is defined as a general subsumer for all the topic concepts, i.e., $\forall t_i \in T$, $t_i \sqsubseteq \top$. For building the hierarchy, a query-based bottom-up approach is applied. The development process starts with the definition of the most specific topic concepts of the hierarchy and continues by extracting the more general concepts. The approach is launched from a set of named entities $N$ represented by a set of URIs (Fig. 5).

Definition of the most specific topic concepts. At this phase, a SELECT SPARQL query, relying mainly on $N$ and the knowledge graph $K$, is applied to define $S_T \subseteq T$, the most specific topic concepts of the hierarchy: $\forall t_i \in S_T$, $\nexists t_j \,/\, t_j \sqsubset t_i$. The SELECT query $q(n,r)$ takes as inputs a named entity $n \in N$ and a property $r \in K$ and returns a set of topic concepts. For the application of $q$, we defined two main relation types {P31, P106}. The property instance of (P31) is used for all the named entities to retrieve their superclasses.

Meanwhile, for the named entities that are instances of Human (Q5), which is a very general topic, applying the property occupation (P106) is required to fetch more specific topic concepts. In the following, the syntax of q is presented. We denote by entityId, the Wikidata ID of the named entity which is extracted from the URI.


As an example, let us consider a named entity $n$ = John Simon (Q333091). In Wikidata, John Simon is an instance of (P31) Human (Q5) and is linked to judge, lawyer and politician by the property occupation (P106). Thus, $S_T(n)$ = {Judge, Lawyer, Politician}.
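To make the role of $q(n,r)$ concrete, the sketch below builds one possible SELECT query as a string. It is a hypothetical reconstruction, not the authors’ query. It assumes the standard Wikidata prefixes `wd:` and `wdt:`, English labels, and a helper name (`build_specific_topics_query`) introduced here for illustration. It only builds the query text and sends no request.

```python
import re


def build_specific_topics_query(entity_id: str, property_id: str) -> str:
    """Build a SELECT query returning the topics linked to an entity by a property."""
    # Validate identifiers so that arbitrary text cannot be injected into the query.
    if not re.fullmatch(r"Q\d+", entity_id):
        raise ValueError(f"not a Wikidata item ID: {entity_id!r}")
    if not re.fullmatch(r"P\d+", property_id):
        raise ValueError(f"not a Wikidata property ID: {property_id!r}")
    return f"""PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?topic ?topicLabel WHERE {{
  wd:{entity_id} wdt:{property_id} ?topic .
  ?topic rdfs:label ?topicLabel .
  FILTER(LANG(?topicLabel) = "en")
}}"""


# John Simon (Q333091) with the property occupation (P106), as in the example above.
print(build_specific_topics_query("Q333091", "P106"))
```

Calling the helper with `P31` instead of `P106` would give the instance of variant used for entities that are not humans.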

Extraction of hierarchies. The aim of this phase is to build the taxonomy of topic concepts $H$. The building process proceeds from the most specific to the most general concepts. For this purpose, a CONSTRUCT SPARQL query $q_H(t_i)$, $t_i \in S_T$, associated with $\phi(t_i)$, is applied to fetch the parent classes of $t_i$ and build an RDF graph of the hierarchy. Each query returns three different types of triples: (1) triples that define the ontology classes, (2) triples that create the taxonomic relations (following the usage of rdfs:subClassOf in RDF) and (3) triples that label the ontology classes. All triples are denoted by $(s,p,o)$, where $s$ is the subject, $p$ the predicate and $o$ the object. In the following, the syntax of $q_H$ is presented. We denote by topicId the Wikidata ID of $t_i \in S_T$.


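Examples of triples extracted based on $S_T$(John Simon) are given below.

$$H=\{\text{Judge} \sqsubseteq \text{Magistrate},\ \text{Magistrate} \sqsubseteq \text{Official},\ \text{Magistrate} \sqsubseteq \text{Jurist},\ \text{Official} \sqsubseteq \text{Civil Servant},\ \text{Civil Servant} \sqsubseteq \text{Public Employee},\ \text{Public Employee} \sqsubseteq \text{Employee},\ \text{Politician} \sqsubseteq \text{Professional}\}$$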
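A CONSTRUCT query producing the three kinds of triples described for $q_H$ (class definitions, taxonomic relations and labels) might look as follows. This is a sketch under stated assumptions, not the authors’ listing: it uses the Wikidata property subclass of (P279) with a property path to climb the hierarchy, the `owl:Class` and `rdfs:` vocabulary for the output, and a hypothetical helper name. It only builds the query text.

```python
import re

PREFIXES = """PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>"""


def build_hierarchy_query(topic_id: str, language: str = "en") -> str:
    """Build a CONSTRUCT query that fetches a topic and all of its ancestors."""
    if not re.fullmatch(r"Q\d+", topic_id):
        raise ValueError(f"not a Wikidata item ID: {topic_id!r}")
    if not re.fullmatch(r"[a-z]{2,3}", language):
        raise ValueError(f"unexpected language code: {language!r}")
    return f"""{PREFIXES}

CONSTRUCT {{
  ?class a owl:Class .
  ?class rdfs:subClassOf ?parent .
  ?class rdfs:label ?label .
}} WHERE {{
  wd:{topic_id} wdt:P279* ?class .
  OPTIONAL {{ ?class wdt:P279 ?parent . }}
  OPTIONAL {{ ?class rdfs:label ?label . FILTER(LANG(?label) = "{language}") }}
}}"""


# Human (Q5) is used here only because its ID appears in the text.
print(build_hierarchy_query("Q5"))
```

Template triples whose variables are unbound (for example a class without a parent) are simply omitted from the constructed graph, so the root of the hierarchy does not need special handling.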

5.3.2. Building the non-hierarchical scheme

The non-hierarchical scheme of Topic-OPA can be formally defined by $NH=\langle T,R,E,\phi \rangle$, where $T$ is the set of topic concepts, $R$ is the finite set of predicates, $E \subseteq T \times R \times T$ is the set of transverse relationships among the topics and $\phi$ is the mapping function. In this phase, the non-hierarchical relations are extracted from Wikidata for building $NH$. These relations are represented by the domain/range definitions of the properties, which are added to the graph as edges between domains and ranges. For this purpose, a CONSTRUCT query $q_{NH}(t_i)$, $t_i \in T$, associated with $\phi(t_i)$, is applied to fetch all the triples where $t_i$ is a domain or a range. The selection of properties is restricted to a predefined list based on their relevance in different domains (e.g., field of work (P101), has part (P527), has quality (P1552), part of (P361), practiced by (P3095), etc.). In the following, the syntax of $q_{NH}$ is presented. We denote by topicId the Wikidata ID of $t_i \in T$.



The results obtained by executing qNH are represented by triples denoted (d,p,r), where d is the domain, p the predicate and r the range. Excerpts of these triples are presented in what follows.

NH={(Civil Servant, field of this occupation, Civil Service), (Politician, field of this occupation, Politics), (Judge, field of this occupation, Judiciary), (Magistrate, field of this occupation, Judiciary), (Public Employee, facet of, Public Sector), (Public Employee, facet of, Government)}

5.3.3.Ontology enrichment

In this phase, an ontology enrichment process is performed based on NH. The application of qNH imports new concepts into the ontology, such as Government, Judiciary and Politics, among many others. These concepts, as well as their parent classes, are therefore added to the hierarchy by applying the query qH. An excerpt of the appended hierarchical relations follows.

H={Political Organization ⊑ Organization, Government ⊑ Political Organization, Judiciary ⊑ Authority, Civil service ⊑ Organization, Politics ⊑ Activity}

6.The topic labeling process

This section defines the topic labeling process, which is based mainly on RelTopic and Topic-OPA. Given an article Ai ∈ A represented by a set of non-ambiguous named entities Ni, the topic labeling process of Ai is composed of three main phases: (1) assign Ni as instances of Topic-OPA, (2) apply an instance-topic mapping process, and (3) rank and select the best topics that label Ai.

6.1.Named entities as instances of topic-OPA

The named entities are categorized into persons, locations, organizations and products. For the labeling process, we are mainly interested in persons, organizations and products. The named entities of type location will be used in future work to contextualize the articles (e.g., regional, local and international news). The disambiguated named entities are assigned as Topic-OPA instances and thereby added as nodes to the ontology graph. In addition, the instance of relations are added as hierarchical edges to the graph.

To add the instances, we took advantage of the properties instance of (P31) and occupation (P106) in Wikidata to select the appropriate classes in Topic-OPA (for the same reason explained in Section 5.3.1). For example, in Wikidata, John Simon (Q333091) is an instance of Human (Q5) and related, by occupation (P106), to politician, judge and lawyer. Therefore, in Topic-OPA, John Simon is an instance of Politician, Judge and Lawyer.

6.2.Instance-topic mapping: Classification of topics

Let us consider the article Ai, which is represented by a set of instances I, and T the set of topic concepts of Topic-OPA; the instance-topic mapping process is performed as a binary classification process between I and T. For each pair (i,t), i ∈ I and t ∈ T, we evaluate whether t is a relevant topic for i. For this purpose, we apply RelTopic which, as mentioned earlier, returns a numerical relatedness value in [0,1] for each pair (i,t). Classifying the results requires a threshold. In this context, an ideal threshold is the average of all the relatedness values, denoted RelTopic(I,T). Therefore, we consider t relevant to i if RelTopic(i,t) ≥ RelTopic(I,T).
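The thresholding rule can be sketched in a few lines. This is a minimal illustration, assuming the relatedness values have already been computed by RelTopic; the function name `relevant_topics` and the toy scores are hypothetical.

```python
from statistics import mean


def relevant_topics(relatedness):
    """Classify (instance, topic) pairs against the average relatedness.

    relatedness: dict mapping (instance, topic) -> value in [0, 1],
    assumed to be produced beforehand by RelTopic.
    Returns the threshold and, per instance, the set of relevant topics.
    """
    threshold = mean(relatedness.values())
    relevant = {}
    for (instance, topic), value in relatedness.items():
        if value >= threshold:
            relevant.setdefault(instance, set()).add(topic)
    return threshold, relevant


# Toy values, invented for illustration only.
scores = {
    ("i1", "t1"): 0.8,
    ("i1", "t2"): 0.2,
    ("i2", "t1"): 0.6,
    ("i2", "t3"): 0.4,
}
threshold, relevant = relevant_topics(scores)
print(threshold, relevant)
```

With these toy values the average is 0.5, so only t1 is kept for both instances.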

6.3.Ranking and selection of labeling topics

The ranking and selection of labeling topics is based on the results of the instance-topic mapping process. For Ai, ∀ i ∈ I, ∃ Ti ⊆ T such that ∀ t ∈ Ti, RelTopic(i,t) ≥ RelTopic(I,T). The task now is to rank the topics according to these values and select the most relevant topic(s) tk ∈ Tk ⊆ Ti for labeling Ai. For this purpose, we define the following procedure:

  • 1. Eliminate the non-relevant concepts based on three criteria:

    • (a) Level of abstraction: remove the most abstract topic concepts, such as Entity, Occurrent and Knowledge, by considering their depths. In Topic-OPA, the depths of these concepts are less than the average depth of all the topic concepts.

    • (b) Hypernyms of named entities: remove the topic concepts that are hypernyms of the named entities. For instance, in A1, John Simon is a Politician; therefore, concepts such as Professional, Worker, Person, Agent and Individual are eliminated.

    • (c) Hyponyms of general concepts: remove the topic concepts that are hyponyms of Person, Organization, Product and Location. For instance, in A1, Political Activist is related to the instance John Simon. Political Activist is not a hypernym of John Simon, but it is a subclass of Person. Thus, it is eliminated as a hyponym of Person.

  • 2. Compute the most common topic concepts Tc from Tn=⋃Ti, i ∈ I.

  • 3. Compute the size of Tc.

  • 4. If |Tc|=1, then Tc={tc} is the unique labeling topic of Ai.

  • 5. Otherwise, if |Tc|>1, calculate, for each tc ∈ Tc, the average of the semantic relatedness values RelTopic(i,tc) such that RelTopic(i,tc) ≥ RelTopic(I,T), i ∈ I.

  • 6. Define two strategies to rank Tc and to select the top-ranked topic(s) that label Ai: relatedness-guided and centrality-guided. The relatedness-guided strategy selects the most related topic concept(s) according to the average of the relatedness values. The centrality-guided strategy selects the most connected topic concept(s) based on the degree centrality values. Thus, the former considers the content of Ai, and the latter considers the semantic relevance of the topic concepts. By applying the dual strategy, we extend the selection of the best topics that label Ai.

    • (a) The relatedness-guided strategy is composed of:

      • i. Ranking the topic concepts tc ∈ Tc according to the average of the relatedness values RelTopic(i,tc),

      • ii. Selecting the topic concept(s) tr ∈ Tr ⊆ Tc having the highest value.

    • (b) The centrality-guided strategy is composed of:

      • i. Computing the degree centrality of each tc ∈ Tc,

      • ii. Ranking the topic concepts tc ∈ Tc according to their degree centrality,

      • iii. Selecting the topic concept(s) td ∈ Td ⊆ Tc having the highest value.

  • 7. Finally, compute the topic labeling set of Ai, Tk=Td ∪ Tr, as a combination of the results of the centrality-guided and the relatedness-guided strategies.
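Steps 2 to 7 of the procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "most common" means the topics related to the largest number of instances, that step 1 has already been applied, and that degree centrality values are supplied from the ontology. The function name `select_labels` and the per-instance relatedness values are hypothetical; only the two centrality values are taken from the use-case.

```python
from collections import Counter
from statistics import mean


def select_labels(relevant, degree_centrality):
    """Rank and select labeling topics (steps 2 to 7, under stated assumptions).

    relevant: dict instance -> {topic: RelTopic value}, after thresholding
              and after the eliminations of step 1.
    degree_centrality: dict topic -> degree centrality in the ontology.
    """
    # Step 2: most common topics, read here as the most frequent ones.
    counts = Counter(t for topics in relevant.values() for t in topics)
    if not counts:
        return set()
    top = max(counts.values())
    common = {t for t, c in counts.items() if c == top}
    # Steps 3-4: a single common topic is the unique label.
    if len(common) == 1:
        return common
    # Step 5: average relatedness of each common topic.
    avg = {t: mean(v[t] for v in relevant.values() if t in v) for t in common}
    # Step 6 (a): relatedness-guided selection.
    best_avg = max(avg.values())
    t_r = {t for t in common if avg[t] == best_avg}
    # Step 6 (b): centrality-guided selection.
    best_deg = max(degree_centrality.get(t, 0.0) for t in common)
    t_d = {t for t in common if degree_centrality.get(t, 0.0) == best_deg}
    # Step 7: combine both strategies.
    return t_d | t_r


# Per-instance values are invented; centralities follow the use-case figures.
relevant = {
    "e1": {"Military Affairs": 0.70, "War": 0.60},
    "e2": {"Military Affairs": 0.64, "War": 0.66, "Politics": 0.90},
}
centrality = {"Military Affairs": 6.94, "War": 22.22, "Politics": 29.17}
print(select_labels(relevant, centrality))
```

In this toy run, Military Affairs wins on average relatedness and War wins on degree centrality, so the article receives both labels, which is the dual outcome the two strategies are meant to allow.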

Fig. 3. Example of articles from Le Matin.

Fig. 4. Example of articles from the selected corpus of Le Matin.

Fig. 5. Example of named entities extracted from {A1,A2,…,A8}.

7.Use-case: Le Matin

In this section, we present a case study for labeling the old French newspaper Le Matin. For this purpose, a corpus A={A1,A2,…,A48} of 48 articles, published between 1910 and 1937, is selected. Every article Ai ∈ A is described by an XML file consisting of Ni, a set of disambiguated named entities represented by Wikidata URIs (see Fig. 5 for an example). Generally, the named entities representing the articles are the outcome of WP2. However, in this work, they are collected manually following the hypothesis presented in Section 2. Besides, T, the set of all the topic concepts of Topic-OPA, is considered. Figs 3 and 4 illustrate {A1,A2,…,A8}, a subset of A. Our main goal is to automatically label the articles by applying our proposed semantic relatedness measure RelTopic. To achieve this goal, we first need to construct the topic ontology Topic-OPA from these articles. Then, the following processes are performed: (1) the assignment of the named entities as instances of Topic-OPA, (2) the instance-topic mapping process, and (3) the ranking and selection process.

7.1.Topic-OPA of Le Matin

To build the Topic-OPA representing Le Matin, a set of N=392 named entities representing A is considered, and the SPARQL-based automatic approach (Section 5.3) is applied. As a result, we obtained a topic ontology, as a subset of Wikidata, which is accessible and manageable in ontology editors such as Protégé. Note that the topic ontology is not curated: we kept the concepts and relations as obtained by the application of the SPARQL-based approach. Topic-OPA contains 2073 concepts, 3261 SubClassOf relations and 1135 non-hierarchical relations. In Figs 6 and 7, we depict excerpts of Topic-OPA around the Politics and Medicine topics. The solid lines represent the SubClassOf relations, and the dashed lines represent the non-hierarchical relations.

Fig. 6. Excerpt of Topic-OPA around the concept Politics.

Fig. 7. Excerpt of Topic-OPA around the concept Medicine.

7.2.Assignment of disambiguated named entities as instances

For each article Ai ∈ A, the disambiguated named entities are assigned as instances of Topic-OPA. Therefore, ∀ Ai ∈ A, Ai is represented by a set of instances Ii. Table 1 shows the assignment of the named entities representing the articles {A1,A2,…,A8}.

Table 1

Assignment of the named entities of the subset articles of A as instances of Topic-OPA

| Article | Named entity | Instance of |
|---|---|---|
| A1 | John Simon | Politician, Lawyer, Judge |
| | Ramsay MacDonald | Politician, Journalist, Diplomat |
| | Adolf Hitler | Politician, Soldier, Stateperson, Writer, Painter |
| | Eric Phipps | Politician, Diplomat |
| | Anthony Eden | Politician, Diplomat |
| | Stanley Baldwin | Politician |
| | Foreign Office | ForeignAffairsMinistry |
| A2 | Miguel Primo de Rivera | Politician, MilitaryPersonnel |
| A3 | Jean Benoit-Lévy | FilmDirector, FilmProducer, Screenwriter |
| | Marie Epstein | FilmDirector, FilmProducer, Screenwriter, Actor |
| | La Maternelle | Film |
| | Pension Mimosas | Film |
| | Simone Berriau | FilmActor, Actor |
| | Simone Bourday | Actor |
| | Sylvette Fillacier | Actor |
| | Hubert Prelier | Actor |
| | Camille Bert | Actor |
| | Roland Caillaux | Actor, Painter |
| | Henri Debain | FilmActor, FilmDirector |
| | Françoise Rosay | Actor, Singer, StageActor, FilmActor |
| A4 | Paul Appell | UniversityTeacher, Mathematician |
| | Academy of Toulouse | AcademicDistrict |
| | Paris Academy | AcademicDistrict |
| | Legion of Honour | Order |
| A5 | Georges Pelletier d’Oisy | AircraftPilot |
| A6 | René Le Grèves | SportCyclist |
| | Ambrogio Morelli | SportCyclist |
| | Romain Maes | SportCyclist |
| | Félicien Vervaecke | SportCyclist |
| | Charles Pélissier | SportCyclist |
| | Aldo Bertocco | SportCyclist |
| A7 | Académie Nationale de Médecine | Academy, NationalAcademy |
| | Albert Calmette | Physician, Bacteriologist, Immunologist, Virologist |
| | BCG vaccine | Vaccine |
| A8 | Charles Rist | Economist, Banker |
| | William H. Woodin | Politician, Businessperson |
| | Trésor public | PublicTreasury |
| | Bank of France | Bank, CentralBank, Business |
| | National Bank of Belgium | CentralBank |
| | Paul van Zeeland | Economist, Politician, Lawyer, Diplomat, Jurist |

7.3.Instance-topic mapping

The instance-topic mapping process is performed between each article Ai ∈ A, which is represented by a set of instances Ii, and T, the set of topic concepts of Topic-OPA. The process is executed as a binary classification between Ii and T. For each pair (i,t), i ∈ Ii and t ∈ T, we evaluate whether t is a relevant topic for i. For this purpose, we apply RelTopic, which takes as inputs all the instances i ∈ Ii and the topic concepts t ∈ T. To classify the results, we need to apply the specified threshold, which is the average of all the relatedness values RelTopic(Ii,T).

However, since Topic-OPA is not curated, it contains a vast number of general concepts. As a consequence, the average of the relatedness values is low (around 0.28). Such a low threshold degrades the overall performance of the classification process. Experimentation has shown that a threshold of about 0.5 provides good and relevant results. Therefore, we propose to use threshold(Ai)=−log(RelTopic(Ii,T)), where log is the base-10 logarithm, in order to shift the threshold to the range of interest.

For instance, for the articles A7 and A8, the averages of the relatedness values are RelTopic(I7,T)=0.28 and RelTopic(I8,T)=0.30. Hence, the threshold values are threshold(A7)=−log(0.28)=0.55 and threshold(A8)=−log(0.30)=0.52. By applying these threshold values, we seek to select the most related topic concepts for each article. Therefore, we consider t relevant to i if RelTopic(i,t) ≥ −log(RelTopic(Ii,T)).
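The arithmetic of this threshold can be checked with a short sketch. It assumes a negated base-10 logarithm, which is the reading that reproduces the reported value of 0.52 for an average of 0.30; the function name `reltopic_threshold` is hypothetical.

```python
import math


def reltopic_threshold(avg_relatedness: float) -> float:
    """Map an average relatedness in (0, 1) to a classification threshold.

    Assumption: negated base-10 logarithm, inferred from the reported values.
    """
    if not 0.0 < avg_relatedness < 1.0:
        raise ValueError("average relatedness must lie strictly between 0 and 1")
    return -math.log10(avg_relatedness)


print(round(reltopic_threshold(0.30), 2))  # average reported for A8
print(round(reltopic_threshold(0.28), 2))  # overall average of about 0.28
```

Under this reading, averages between roughly 0.25 and 0.32 yield thresholds between 0.6 and 0.5. An average below 0.1 would push the threshold above 1, so that no topic could pass; the transformation is therefore only meaningful for averages in the range observed here.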

Table 2 shows the experimental results of the mapping process of A7 to Topic-OPA. It presents an excerpt of the instances, the relevant topics and the relatedness values such that RelTopic(i,t) ≥ −log(RelTopic(I7,T)), i ∈ I7 and t ∈ T.

Table 2

Excerpt of the instance-topic mapping process between A7 and T

| Instance (i) | Related Topic (t) | RelTopic(i,t) ≥ −log(RelTopic(I7,T)) |
|---|---|---|
| Académie Nationale de Médecine | Research Institute | 0.80 |
| | Academic District | 0.69 |
| | National Academy | 0.72 |
| Albert Calmette | Physician | 0.73 |
| | Health Professional | 0.58 |
| BCG vaccine | Medication | 0.58 |
| | Health Problem | 0.5 |

7.4.Ranking and selection of labeling topics

Given a set of relevant topics for each instance i ∈ Ii representing an article Ai ∈ A, a ranking and selection process is performed to choose the best topic(s) for labeling Ai. This process has been tested on the 48 articles of Le Matin. Table 3 shows an excerpt of the experimental results. It presents thresholds, most common topics, average relatedness values, degree centrality, relatedness-guided topics, and centrality-guided topics. The column Selected Topics indicates the best topics produced by RelTopic.

In the following, we describe the execution of the ranking and labeling procedure (Section 6.3) for A7 (see Table 2). Note that step 1 (a) is not shown in the present experimentation.

  • By fulfilling step 1 (b), the concepts Academy, National Academy, Physician, Health Professional, Immunologist, Medication, Vaccine, Biopharmaceutical and Disease are eliminated. For instance, Physician and Immunologist are eliminated being hypernyms of the instance Albert Calmette.

  • Furthermore, concepts such as Physicist and Research Institute are eliminated by fulfilling step 1 (c). Physicist is a hyponym of Person and Research Institute is a hyponym of Organization.

  • The aim of step 2 is to compute the most common topics Tc of Ai. For A7, Tc={Science, Medicine, Bacteriology, Immunology, Virology, Vaccination}. Thus, since |Tc|=6 (step 3), step 4 is not executed for A7. Meanwhile, it is executed for A3, A5 and A8, which are labeled by the topics Art, Aviation and Economics respectively.

  • Step 5 computes the average of the relatedness values for each common topic concept tc ∈ Tc.

  • In step 6 and step 7, A7 is labeled by Vaccination, the top-ranked topic, which has both the highest average of relatedness (RelTopic(I7,Vaccination)=0.71) and the highest degree centrality (D(Vaccination)=13.48).

    In contrast, A2 is labeled by the topic Military Affairs, which has the highest average of relatedness (RelTopic(I2,Military Affairs)=0.67), as well as by the topic War, which has the highest degree centrality (D(War)=22.22).

    In addition, A4 and A6 are labeled by dual topics in step 6 and step 7. The topics Higher Education and Science are selected as the best topics for labeling A4. The topics Cycle Sport and Cycling are the top-ranked topics for labeling A6.

Table 3

Ranking and selection of labeling topics

| Ai | Threshold | Most Common Topics (tc) | RelTopic(Ii,tc) | Degree Centrality | Relatedness-Guided | Centrality-Guided | Selected Topics |
|---|---|---|---|---|---|---|---|
| | | Political Activism | 0.56 | 6.94 | | | |
| A2 | 0.55 | Military Affairs | 0.67 | 6.94 | Military Affairs | War | Military Affairs-War |
| | | Political Activism | 0.62 | 6.94 | | | |
| A4 | 0.52 | Higher Education | 0.58 | 15.28 | Higher Education | Science | Higher Education-Science |
| A6 | 0.55 | Cycle Sport | 0.68 | 13.20 | Cycle Sport | Cycling | Cycle Sport-Cycling |

8.Evaluation and comparison

The first part of this section evaluates Topic-OPA being an application-based ontology. The second part assesses the performance of RelTopic by evaluating the results of the entire framework (Topic-OPA + RelTopic + the topic labeling process). Furthermore, we consider applying the whole approach for labeling recent press articles. Finally, RelTopic is compared to alternative graph-based semantic measures.

8.1.Evaluation of Topic-OPA

In the literature, various approaches for evaluating ontologies are recognized. These approaches are categorized depending on what kind of ontologies are being evaluated and for what purpose [7]. Examples of these approaches are [15]: gold standard-based, corpus-based, application-based, and criteria-based. In order to choose the “best” evaluation approach, there is a need to define the motivation behind evaluating a developed ontology [15]. In our study, as evoked earlier, Topic-OPA is an application-based ontology that is intended to be used in a topic labeling system for classifying and labeling a given set of old press articles. Thereby, gold standard-based and corpus-based approaches are eliminated for the following reasons. The former aims to compare the developed ontology with a previously created reference ontology. However, having a suitable gold ontology can be challenging since it should be created under similar conditions with similar goals to the developed ontology. The latter is eliminated since it is strongly dependent on textual resources. Therefore, the application-based and criteria-based approaches are applied to evaluate the performance and the semantic accuracy of Topic-OPA.

8.1.1.Application-based evaluation

The application-based approach evaluates the performance of ontologies in a specific task. Topic-OPA is employed for labeling old press articles by using it as a knowledge base. Technically, the semantic relatedness measure RelTopic is applied to the graph structure of Topic-OPA. RelTopic performs a “browsing” of the hierarchical and non-hierarchical structure of Topic-OPA. It inspects nodes and edges, their properties, such as weights and depths, and the correlation of nodes, which is defined by the degree centrality. Thus, the results obtained by using RelTopic for the classification and the labeling tasks determine the feasibility of Topic-OPA. For this purpose, the application-based evaluation of Topic-OPA is a function of the evaluation of RelTopic (see Section 8.2). Therefore, Topic-OPA is considered a pertinent ontology if the results obtained by RelTopic are accurate.

8.1.2.Structure-based evaluation

The structure-based approach quantifies how far an ontology adheres to specific desirable criteria (e.g., size and complexity). This approach is recommended as an efficient way to evaluate learned ontologies [10]. Several measures have been proposed for the structure-based evaluation, such as knowledge coverage and popularity measures (e.g., number of classes and number of properties) and structural measures (e.g., maximum depth, average depth, depth variance, etc.) [15]. The application of these measures relies on the assumption that a richly populated ontology, with higher depth and breadth variance, is more likely to provide reliable semantic content. In contrast to knowledge coverage and popularity measures, the structural measures are positively correlated with the semantic accuracy of the knowledge modeled in the ontology [40].

In the context of Topic-OPA, we quantified the following structural measures on its taxonomic structure: (1) maximum depth, the length of the longest taxonomic branch in the ontology, measured as the number of concepts from the root node to the leaves of the taxonomy (maximum depth = 28); (2) average depth, computed as the average length of all taxonomic branches (average depth = 4); (3) depth variance, the dispersion with respect to the average depth, computed as the standard mathematical variance (depth variance = 6.38). We conclude that the majority of the topic concepts within Topic-OPA are dispersed homogeneously within the core level. This result implies two essential points. First, it will be challenging for RelTopic to distinguish between the different concepts located at the same depth to select the best labeling topics. Second, in a semantic context, the hierarchical structure of Topic-OPA is a balanced taxonomy, in which the majority of taxonomic branches have almost the same depth.
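The three structural measures can be illustrated on a toy taxonomy. The sketch below is an assumption-laden illustration: the function name `branch_depths` is hypothetical, the toy hierarchy is loosely based on the excerpt of H given earlier, a branch is read as a path from a leaf up to a root, and the population variance is used because the text does not say which variance was computed. It also assumes the taxonomy is acyclic.

```python
from statistics import mean, pvariance


def branch_depths(parents):
    """Return the number of concepts on every leaf-to-root branch.

    parents: dict child -> list of parent classes (acyclic taxonomy assumed).
    """
    has_child = {p for ps in parents.values() for p in ps}
    leaves = [c for c in parents if c not in has_child]
    depths = []

    def walk(node, depth):
        ups = parents.get(node, [])
        if not ups:                 # reached a root: one branch is complete
            depths.append(depth)
            return
        for up in ups:
            walk(up, depth + 1)

    for leaf in leaves:
        walk(leaf, 1)
    return depths


# Toy taxonomy for illustration only.
toy = {
    "Judge": ["Magistrate"],
    "Magistrate": ["Official", "Jurist"],
    "Official": ["Civil Servant"],
    "Civil Servant": ["Public Employee"],
    "Public Employee": ["Employee"],
    "Politician": ["Professional"],
}
depths = branch_depths(toy)
print(max(depths), mean(depths), pvariance(depths))
```

The toy taxonomy has three branches of 6, 3 and 2 concepts. On a real ontology with many branches, a maximum far above the average combined with a moderate variance indicates that most branches cluster around the average depth.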

8.2.Evaluation of RelTopic

The evaluation of RelTopic consists of measuring how well this measure can label a given corpus of articles. Thereby, we evaluate the performance of the whole framework (Topic-OPA + RelTopic + the topic labeling process) using a dual evaluation approach: (1) a quantitative evaluation that compares the automatic labeling to human labeling [1] and (2) a qualitative evaluation that appraises the generated topics regarding their semantic interpretability [47].

8.2.1.Quantitative evaluation

To evaluate the relevance of the generated topics, a quantitative evaluation is carried out using human-based labeling [1] and rating [24] methods. For this purpose, we considered A, the corpus of 48 articles from Le Matin introduced in Section 7. Since human annotators may disagree on specific articles, three different annotators are involved in labeling and rating each article Ai ∈ A. For the labeling process, the textual content of the articles Ai ∈ A is given to the human annotators. The annotators, who were blind to Topic-OPA and to the results generated by RelTopic, read the articles and assigned one or more labeling topics based on the content (see Table 4). Based on the human labeling, an inter-annotator evaluation is established to compute the agreement among the annotators for each Ai ∈ A. A comparison of the assigned topics (performed in the context of Wikidata) has shown an agreement of 46% for exact topics (e.g., A3, A5, A6), 26% for specific/general topics (e.g., A7, A8), and 15.5% for semantically related topics (e.g., A1, A2, A4). Furthermore, we compared the RelTopic topics (see Table 3) with those assigned by humans. Our approach showed an agreement with human labeling of 42% for exact topics (e.g., A5, A6), 34% for specific/general topics (e.g., A3, A7, A8), and 6% for semantically related topics (e.g., A1, A4).

Table 4

Excerpt of human labeling of the articles represented in Table 3

| Article | Annotator 1 | Annotator 2 | Annotator 3 |
|---|---|---|---|
| A1 | International Politics | International Relations | International Politics |
| A2 | Politics | Foreign Policy | Politics |
| A3 | Cinema | Art, Cinema | Cinema |
| A4 | Higher Education | Politics, Education | Politics, Science |
| A5 | Aviation | Event, Exploration, Aviation | Aviation |
| A6 | Cycling | Sport, Cycling | Cycling |
| A7 | Medicine | Science, Medicine | Science, Vaccination |

To summarize, our approach has an agreement quite close to the annotators’ agreement. For the rating method, the annotators are asked to rate the RelTopic labels for each Ai ∈ A using the following scores [24]: 3 for very good labels; 2 for reasonable labels; 1 for semantically related labels, but not considered as good topics; 0 for inappropriate labels. As a result (see Table 5 for an example), 36% of the RelTopic topics are assessed as very good, 40% as reasonable, 14% as semantically related, and 10% as inappropriate.

Table 5

Excerpt of human rating of the articles represented in Table 3


In the following, we analyze the cases where RelTopic produced general or irrelevant labels, considering the validity of the named entities. Two main issues are observed: (1) the existence of named entities that are not disambiguated; (2) the lack of some types of named entities. For this purpose, two additional articles, A9 and A10, are considered (Fig. 8).

Fig. 8. Le Matin.

The existence of named entities that are not disambiguated. In the presented use-case (Section 7), 20 articles are represented by some named entities that are not disambiguated (e.g., A5, A7). In this section, we discuss the influence of these named entities on the relevance of the automatically generated labeling topics. First, we analyzed two articles, A7 (Fig. 4) and A9 (Fig. 8a). A7 consists of 5 disambiguated named entities and 2 that are not disambiguated. Despite this shortcoming, RelTopic assigned Vaccination (see Table 3), a more specific topic than Medicine, which was assigned by the human annotators. Second, we considered article A9, which consists of 10 disambiguated named entities and 2 that are not disambiguated. By applying RelTopic, A9 is labeled by Science (step 4 of the topic labeling process), whereas Medicine is assigned by the human annotators. By surveying the results of the instance-topic mapping phase and the computation of the common related topics, we found that Medicine is commonly related 8 times, while Science is commonly related 10 times. In addition, we inspected the named entities that are not disambiguated in A9 (Robert Wilbert and Marcel Léger). Robert Wilbert is a Veterinarian and Marcel Léger is an Epidemiologist, Microbiologist and Bacteriologist. We conclude that the presence of these non-disambiguated named entities has eliminated Medicine from the set of most common topics and has thereby affected the labeling relevance of A9.

The influence of the lack of named entity types. As mentioned earlier, in this study we are interested in three main types of named entities: person, organization and product. In this section, we discuss the influence of the lack of some types on the relevance of the generated topics. For instance, article A10 (Fig. 8b) is composed of 6 persons and 2 products, and the majority of the persons are politicians (see Table 6). Thereby, A10 is labeled by Politics (step 4 of the topic labeling process). However, based on the content and the subject of A10, the human annotators assigned the topic Economics. We conclude that the predominance of politicians, combined with the absence of organizations or persons related to economics, has affected the pertinence of the labeling results.

Table 6

Assignment of the disambiguated named entities of A10 as instances of Topic-OPA

| Article | Named Entity | Instance of |
|---|---|---|
| A10 | César Caire | Jurist, Lawyer |
| | Henri Galli | Politician, Journalist |
| | Emile Desvaux | |
| | Ambroise Rendu | Politician |
| | Alexandre Luquet | Politician |

8.2.2.Qualitative evaluation

The qualitative evaluation assesses the labeling topics generated by RelTopic according to their semantic quality [47]. In linguistics, the topic, or theme, of a sentence is what is being talked about. In a semantic context, defining a labeling topic within topic ontologies is not an easy task. A topic ontology consists of various concepts, including the labeling topics, and it is difficult to identify which concepts are suitable as labeling topics. In our experiment, by applying RelTopic for labeling the old press articles (see Table 3), we identified three essential characteristics that define the semantic quality of a labeling topic:

  • Highly correlated: a concept with high degree centrality designates a large surface of connection with the concepts within the ontology. For instance, Politics, War, Science, Art and Sport have respectively 29.17, 22.22, 23.62, 31.34 and 13.89 values of degree centrality. Meanwhile, concepts such as Activity, Occupation and Group Behaviour have respectively 8.68, 9.81 and 7.63 values of degree centrality.

  • Core concept: the depth of concepts in ontologies indicates their degree of generality. In Topic-OPA, abstract concepts, such as Entity, Agent, Object, Product and Occurrence, are located at depths less than the average depth in Topic-OPA, which is equal to 4 (e.g., depth(Entity)=1, depth(Object)=2 and depth(Occurrence)=3). These concepts are not recommended as labeling topics because they are too abstract to be interpretable. Meanwhile, the majority of the labeling topics produced by our relatedness measure (e.g., Politics, Art, Science, etc.) are located at depths greater than or equal to the average depth in Topic-OPA (e.g., depth(Politics)=5, depth(Art)=4 and depth(Science)=5). However, these topics are more general than the specific concepts (e.g., Contract Law, Pharmacy, etc.), which are located at greater depths (e.g., depth(Contract Law)=7 and depth(Pharmacy)=9).

  • Not a hypernym of named entities: a labeling topic is not linked hierarchically to the named entities. Therefore, it is not a subclass of Person, Organization, Location or Product.

8.3.Evaluation of topic labeling using RelTopic in recent press articles

To evaluate the performance of topic labeling using RelTopic, we applied the entire approach to a different context of articles, namely recent newspapers (e.g., Le Monde, Le Figaro, Liberation). For this purpose, a corpus of 36 recent articles is considered. The named entities representing these articles are defined and disambiguated manually using Wikidata. The total number of named entities (disambiguated and not disambiguated) is 738. As for old press articles, three types of named entities are considered (person, organization, and product), with a cardinality of 443. In contrast to old press articles, the recent articles are thematically classified using commonly known topics such as Sport, Science, Politics, and Art. The articles of the given corpus fall into four categories: 9 articles are labeled with Sport, 10 with Science, 9 with Politics and 8 with Art.

To automatically label these articles with RelTopic, the following phases, which are defined in our approach, are fully applied:

  • 1. Construct a topic ontology representing the application domain, named Topic-RPA (Topic ontology for Recent Press Articles). The SPARQL-based approach (Section 5.3) is applied based on the articles’ disambiguated named entities (the number of disambiguated named entities is 371). As a result, we obtained Topic-RPA, an uncurated topic ontology composed of 2616 concepts, 1584 object properties, and 4132 SubClassOf relations. In contrast to Topic-OPA, Topic-RPA contains contemporary concepts such as Computer Science, Telecommunication, and Electronic Journal. Meanwhile, semantic properties such as the average depth are identical in both ontologies (average depth = 4). Concerning the ontology size, Topic-RPA is larger than Topic-OPA (+25%).

  • 2. Apply the topic labeling process using RelTopic (Section 6). This phase is composed of three basic steps: (1) assign named entities as instances of Topic-RPA, (2) apply an instance-topic mapping process, and (3) rank and select the best topics that label the recent articles. As a result, RelTopic labeled 47% of the articles with exact topics and 25% with specific topics. The failure of RelTopic to label the remaining articles is due to the following reasons. First, the high threshold values, close to 0.65 (since the ontology is not curated and thus contains a large number of abstract or noisy concepts), make it challenging to select the most commonly related topics for some articles. Second, the named entities that are not disambiguated and the lack of some types of entities have led to cases similar to those observed for old press articles (Section 8.2.1).

To conclude, our proposed approach has generated promising results in recent press endorsing its reusability for labeling different textual resource contexts. In this regard, applying the approach in different contexts or domains is independent of languages. It is based mainly on disambiguated named entities detached from any language.

8.4.Comparison of RelTopic with alternative graph-based measures

In this section, we compare RelTopic (Equation (8)) with alternative graph-based measures. Specifically, we choose path-based measures, since node-based measures depend on textual resources, which are out of the scope of our study. To analyze the importance of semantic relatedness with respect to semantic similarity, we compared RelTopic to SimRada (Equation (1)). SimRada is applied to the whole graph of Topic-OPA, including the hierarchical and non-hierarchical schemes. In addition, a comparison with the most commonly known semantic relatedness measure, RelHS (Equation (3)), is carried out. Applying RelHS requires computing the direction changes of each link (hierarchical and non-hierarchical) along all the paths. However, this computation is considered a difficult task [30]. To simplify, we computed the direction changes of the hierarchical edges only. We then compared the results of applying these measures in the instance-topic mapping process on A. Table 7 shows an excerpt of the results of mapping A7 to Topic-OPA. The results imply that the topic concepts ti related to i, i ∈ Ii (Ii are the instances associated with Ai ∈ A), are identified by RelTopic as well as by SimRada and RelHS (e.g., Education and Research are related to i=Académie Nationale de Médecine in A7, Fig. 9). However, RelTopic and RelHS also make it evident which topics are not related to i, i ∈ Ii, owing to the considerable gap among the relatedness values (e.g., Economics and Business are not related to i=Académie Nationale de Médecine in A7). Besides, the results obtained by RelTopic and RelHS are close. Nevertheless, computing semantic relatedness with RelTopic is less demanding, since it does not require counting the edges’ direction changes. For an accurate comparison, the relatedness values of RelHS ([0,8], Equation (3)) are converted to [0,1] (division by 8).
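The rescaling of RelHS and the closeness of the two measures can be checked with a short sketch. The helper name `normalize_relhs` is hypothetical; the value pairs are read from the Table 7 excerpt, where the normalized figures appear truncated to two decimals (7/8 is reported as 0.87).

```python
def normalize_relhs(value: float, upper: float = 8.0) -> float:
    """Rescale a RelHS value from [0, upper] to [0, 1] by division."""
    if not 0.0 <= value <= upper:
        raise ValueError(f"RelHS value out of range: {value}")
    return value / upper


# (RelHS, RelTopic) pairs read from the Table 7 excerpt.
pairs = [(7, 0.81), (5, 0.57), (7, 0.73), (7, 0.74), (6, 0.69), (6, 0.71), (4, 0.48)]

# Absolute gap between normalized RelHS and RelTopic for each pair.
gaps = [abs(normalize_relhs(relhs) - reltopic) for relhs, reltopic in pairs]
print(max(gaps))
```

For these rows the largest gap is about 0.145, which is consistent with the observation that the results of RelTopic and RelHS are close.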

Table 7

Excerpt of the results of the instance-topic mapping process of A7 to T

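| Instance (i) | Topic Concepts (t_i) | SimRada | RelHS | RelHS/8 | RelTopic(i, t_i) |
|---|---|---|---|---|---|
| Académie Nationale de Médecine | Research Institute | 0.5 | 7 | 0.87 | 0.81 |
| | Higher Education | 0.25 | 5 | 0.62 | 0.57 |
| Albert Calmette | Physician | 0.5 | 7 | 0.87 | 0.73 |
| Alfred Boquet | Physician | 0.5 | 7 | 0.87 | 0.74 |
| | Veterinary Medicine | 0.33 | 6 | 0.75 | 0.69 |
| BCG vaccine | Vaccination | 0.33 | 6 | 0.75 | 0.71 |
| | Health Care | 0.25 | 4 | 0.5 | 0.48 |

Fig. 9.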

Comparison of the results of the instance-topic mapping of A7 (Académie Nationale de Médecine).
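The conversion of RelHS to [0,1] described above can be checked directly against Table 7. The following is a minimal sketch, not the authors’ code: the row values are copied from the table, continuation rows are assumed to belong to the instance named above them, and the helper names (`normalize_rel_hs`, `normalized_rows`) are hypothetical. The table appears to truncate rather than round (7/8 = 0.875 is listed as 0.87), so the sketch truncates to two decimals as well.

```python
import math

# Rows copied from Table 7: (instance, topic concept, SimRada, RelHS, RelTopic).
# Assumption: a row with an empty instance cell continues the previous instance.
TABLE_7 = [
    ("Académie Nationale de Médecine", "Research Institute", 0.5, 7, 0.81),
    ("Académie Nationale de Médecine", "Higher Education", 0.25, 5, 0.57),
    ("Albert Calmette", "Physician", 0.5, 7, 0.73),
    ("Alfred Boquet", "Physician", 0.5, 7, 0.74),
    ("Alfred Boquet", "Veterinary Medicine", 0.33, 6, 0.69),
    ("BCG vaccine", "Vaccination", 0.33, 6, 0.71),
    ("BCG vaccine", "Health Care", 0.25, 4, 0.48),
]

REL_HS_MAX = 8  # upper bound of the RelHS range [0, 8] stated in the text


def normalize_rel_hs(rel_hs: float) -> float:
    """Map a RelHS value from [0, 8] to [0, 1], truncated to two decimals."""
    return math.floor(rel_hs / REL_HS_MAX * 100) / 100


def normalized_rows():
    """Yield (instance, topic, RelHS/8, RelTopic, absolute difference)."""
    for instance, topic, _sim_rada, rel_hs, rel_topic in TABLE_7:
        scaled = normalize_rel_hs(rel_hs)
        yield instance, topic, scaled, rel_topic, round(abs(scaled - rel_topic), 2)


if __name__ == "__main__":
    for row in normalized_rows():
        print(row)
```

On this excerpt, the normalized RelHS values and the RelTopic values stay within 0.14 of each other, which is consistent with the statement that the two measures give close results.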



This study’s main contribution is the design of a novel graph-based semantic relatedness measure, named RelTopic, for topic labeling purposes. By proposing RelTopic as a hybrid measure, we contributed to overcoming the limitations of node-based and edge-based approaches. RelTopic considers hierarchical and non-hierarchical relations and inspects the semantic properties of entities within topic ontologies. We considered the correlation of nodes to overcome the dependency of existing measures on textual resources. In addition, we assigned different weights to hierarchical and non-hierarchical edges to overcome the limitation of path-based approaches, which treat all edges as equal. RelTopic takes as inputs two entities (e.g., instances and concepts) and returns a numerical value representing their relatedness according to a topic ontology. In this work, RelTopic is applied mainly for labeling old press articles by assessing the relatedness of instances (named entities) and topic concepts in the topic ontology. RelTopic is also reused for labeling a different kind of article, recent newspapers. This demonstrates that RelTopic can be reused for any purpose that requires computing the semantic relatedness between entities in a given ontology. For such reuse, however, two factors are mandatory: (1) a reasonable characterization of the application domain using a domain ontology and (2) a definition of the input entities that should be included in the ontology (e.g., concept-concept, instance-concept).

This study’s second contribution is the development of the general topic ontology Topic-OPA using a SPARQL-based automatic approach. Topic-OPA is harvested from open knowledge graphs (e.g., Wikidata) based on a set of disambiguated named entities representing the application domain. Topic-OPA is a domain-dependent topic ontology, since it is developed from the named entities of the given domain. Nevertheless, if Topic-OPA is developed from all the named entities of the application domain (e.g., Le Matin), it could be reused as a topic ontology for labeling old press articles of any journal or newspaper belonging to the same period of time. We assume that approximately the same types of persons (e.g., politician, diplomat, actor, physician, botanist, etc.), organizations (e.g., bank, public treasury, academy, etc.), or products (e.g., vaccine, film, etc.) are found during a comparable period of time (e.g., 1910–1945). Moreover, the SPARQL-based approach is reusable (as shown in Section 8.3 for recent newspapers) for harvesting ontologies from open knowledge graphs; it requires only the starting named entities representing the domain of discourse.
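To give a feel for what harvesting from Wikidata with SPARQL can look like, the sketch below builds a query that lists the classes of one disambiguated entity together with their superclasses. It is an illustration under stated assumptions, not the query used to build Topic-OPA: the function name `build_class_query` is hypothetical, and the choice of the Wikidata properties P31 (instance of) and P279 (subclass of) is an assumption about which relations are followed. The code only builds the query string and does not contact the endpoint.

```python
# Public Wikidata SPARQL endpoint (shown for context; not called here).
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_class_query(qid: str, limit: int = 50) -> str:
    """Return a SPARQL query listing the classes of an item and their superclasses.

    Assumption: classes are reached through P31 (instance of) followed by
    zero or more P279 (subclass of) steps.
    """
    if not (qid.startswith("Q") and qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item identifier: {qid!r}")
    return f"""
SELECT DISTINCT ?class ?classLabel WHERE {{
  wd:{qid} wdt:P31/wdt:P279* ?class .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {limit}
""".strip()


if __name__ == "__main__":
    # "Q42" is only a placeholder identifier, unrelated to the corpus.
    print(build_class_query("Q42"))
```

The resulting string can be pasted into the Wikidata Query Service, which predefines the `wd:`, `wdt:`, `wikibase:` and `bd:` prefixes.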

Finally, a significant contribution is the application of an ontology-based automatic topic labeling approach to press articles. This process, which combines a topic ontology with the semantic relatedness measure RelTopic, can be generalized to label any text, including newspaper or magazine articles. As demonstrated in this work, the entire approach is applied for labeling articles in two different contexts, old and recent press. A primary requirement for reusing the approach is the availability of the named entities representing the text to be labeled. These entities, which are independent of any language, make it possible to build a topic ontology representing the domain. RelTopic then assesses the relatedness of each text’s named entities to the topics of the topic ontology. Finally, a selection process picks the best topics to label the textual resources.
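The last two steps of this process (scoring each named entity against the topics, then selecting the best topics) can be sketched as follows. This is a simplified illustration, not the selection procedure of the paper: the scores are the RelTopic values reported in Table 7 for article A7, while averaging per topic and keeping the `top_k` highest averages are assumptions made only for this example, as is the function name `rank_topics`.

```python
from collections import defaultdict

# (instance, topic concept, relatedness) triples taken from the RelTopic
# column of Table 7 for article A7.
SCORES = [
    ("Académie Nationale de Médecine", "Research Institute", 0.81),
    ("Académie Nationale de Médecine", "Higher Education", 0.57),
    ("Albert Calmette", "Physician", 0.73),
    ("Alfred Boquet", "Physician", 0.74),
    ("Alfred Boquet", "Veterinary Medicine", 0.69),
    ("BCG vaccine", "Vaccination", 0.71),
    ("BCG vaccine", "Health Care", 0.48),
]


def rank_topics(scores, top_k=3):
    """Average the relatedness per topic and return the top_k best topics.

    Assumption: averaging is a stand-in aggregation chosen for illustration.
    """
    per_topic = defaultdict(list)
    for _instance, topic, value in scores:
        per_topic[topic].append(value)
    averaged = {topic: sum(vals) / len(vals) for topic, vals in per_topic.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)[:top_k]


if __name__ == "__main__":
    for topic, score in rank_topics(SCORES):
        print(f"{topic}: {score:.3f}")
```

With these inputs the three best-ranked topic concepts are Research Institute, Physician and Vaccination, all of which point toward a medical or scientific label for A7.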

In this context, an important question arises. What if named entities are lacking, ambiguous, or inexact? This situation puts the validity of this work’s general hypothesis (see Section 2) to the test. In Section 8.2.1, we analyzed the influence of two issues on the generated results: (1) the presence of named entities that are not disambiguated and (2) the lack of some types of named entities. Both had an impact on the generated topics. However, the extent of this impact depends on the named entities representing the text to label. For example, the first issue affected the generality of the assigned topic (e.g., Science is given instead of Medicine), whereas the second issue affected the relevance of the assigned topic (e.g., Politics is given instead of Economics).

To summarize, the relevance of the whole framework’s outcome is a crucial measure of the validity of this work’s hypothesis. Thus, the given named entities representing the articles are valid if RelTopic and the whole framework achieve relevant labeling topics. This assumption is demonstrated in two different contexts, old and recent press.


The task of automatically labeling newspaper articles according to a predefined set of topics is a challenging research issue, specifically in cultural heritage. A pertinent characterization of the application domain is required for this purpose. In the context of the ASTURIAS project, which aims to label a vast number of old press articles automatically, we envisaged graph-based semantic measures. These measures have shown effective results in different areas such as knowledge engineering, Semantic Web, and Natural Language Processing. Graph-based semantic measures are composed of similarity and relatedness measures. The former class is adapted to taxonomies and widely investigated in the community. The latter class is adapted to ontologies, and few attempts have been found in the literature to design such measures. Designing semantic relatedness measures is a challenging research task. Nevertheless, they are valuable since they inspect the semantic properties of entities in ontologies.

In this study, we proposed a novel semantic relatedness measure, named RelTopic, within topic ontologies for topic labeling of old press articles. In contrast to existing measures, RelTopic considers hierarchical and non-hierarchical relations and assesses the relatedness between instances and concepts. To apply RelTopic, we considered topic ontologies as weighted graphs where nodes and edges are given positive numerical weights. In addition, RelTopic considers the degree centrality of nodes, which reflects a node’s surface of connection with regard to the rest of the ontology. For the application of RelTopic, a topic ontology named Topic-OPA, representing the domain of old press articles, is harvested from Wikidata by applying a SPARQL-based automatic approach.
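The two graph notions mentioned here, edge weights that differ by relation type and the degree centrality of a node, can be illustrated on a toy graph. The sketch below does not implement RelTopic (Equation (8)); it only shows the ingredients. The weight values, the edges, and the function names are hypothetical; only the node names are borrowed from Table 7. Degree centrality is computed in its standard form, the number of neighbours divided by n − 1.

```python
from collections import defaultdict

# Hypothetical weights: the text only states that hierarchical and
# non-hierarchical edges receive different positive weights.
EDGE_WEIGHTS = {"hierarchical": 1.0, "non_hierarchical": 0.5}

# Toy fragment: node names come from Table 7, the edges are illustrative only.
EDGES = [
    ("Albert Calmette", "Physician", "non_hierarchical"),
    ("Alfred Boquet", "Physician", "non_hierarchical"),
    ("Physician", "Health Care", "hierarchical"),
    ("Vaccination", "Health Care", "hierarchical"),
    ("BCG vaccine", "Vaccination", "non_hierarchical"),
]


def build_graph(edges):
    """Build an undirected adjacency map: node -> {neighbour: weight}."""
    graph = defaultdict(dict)
    for source, target, kind in edges:
        weight = EDGE_WEIGHTS[kind]
        graph[source][target] = weight
        graph[target][source] = weight
    return dict(graph)


def degree_centrality(graph):
    """Number of neighbours of each node divided by (n - 1)."""
    n = len(graph)
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}


def strength(graph):
    """Sum of the weights of the edges incident to each node."""
    return {node: sum(neighbours.values()) for node, neighbours in graph.items()}


if __name__ == "__main__":
    g = build_graph(EDGES)
    print(degree_centrality(g)["Physician"], strength(g)["Physician"])
```

In this toy graph, Physician is connected to three of the five other nodes, so it has the highest degree centrality, which matches the intuition of a node with a large surface of connection.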

The proposed approach is evaluated using a dual evaluation approach. First, a quantitative evaluation is performed with the help of three different annotators. The human annotators assigned labels to a corpus of 48 articles from Le Matin. Our approach showed an agreement quite close to that shown by humans for exact, specific, or general topics. Furthermore, the annotators rated the results of RelTopic regarding their relevance: 76% of the generated topics were rated as very good or reasonable. The second phase of the evaluation applies a qualitative approach that appraises the semantic interpretability of the automatically generated topics. We noticed that the topic labels within Topic-OPA are highly correlated and located at the ontology’s core level. Additionally, the reuse of the entire approach is demonstrated for labeling recent newspaper articles. The promising results achieved support the reusability of the labeling approach using RelTopic in different domains. Finally, we compared RelTopic to alternative graph-based semantic measures. The strength of RelTopic is its capability to clearly separate the related topics from the non-related topics with an undemanding computation of the direction changes of paths.

In future work, we will address the following tasks. First, we envisage contextualizing the articles by taking into account the named entities of type location (e.g., A1 could be labeled with International Politics, A3 with Local or French Art, and A6 with French Sport). Second, in this study we did not consider the topic ontology’s curation; we maintained the ontology structure and content, including the abstract and specific concepts, as derived from Wikidata. In further work, we will apply a curation process to clean and leverage Topic-OPA. Furthermore, we will study the application of RelTopic on the leveraged version of Topic-OPA and assess the quality of the generated labeling topics.


1 Structural Analysis and Semantic Indexing of Newspaper Articles.

2, last visited on April 8, 2020.

4, last visited February 4, 2021.

5, last visited February 5, 2021.

6, last visited February 4, 2021.

7, last visited July 23, 2020.

10, last visited April 28, 2020.


This work is funded by the Normandy Region (France) and the European Union with the European Regional Development Fund (ERDF).



[1] M. Allahyari and K. Kochut, A knowledge-based topic modeling approach for automatic topic labeling, International Journal of Advanced Computer Science and Applications 8(9) (2017), 335–349. doi:10.14569/IJACSA.2017.080947.

[2] D.M. Blei, A.Y. Ng and M.I. Jordan, Latent Dirichlet allocation, The Journal of Machine Learning Research 3 (2003), 993–1022.

[3] S. Banerjee and T. Pedersen, Extended gloss overlaps as a measure of semantic relatedness, in: Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco, Mexico, 2003, pp. 805–810.

[4] R. Bellman, On a routing problem, Quarterly of Applied Mathematics 16 (1958), 87–90. doi:10.1090/qam/102435.

[5] A. Bielefeldt, J. Gonsior and M. Krötzsch, Practical linked data access via SPARQL: The case of Wikidata, in: Proceedings of the WWW2018 Workshop on Linked Data on the Web (LDOW-18), CEUR Workshop Proceedings, 2018.

[6] K. Böhm and M. Ortiz, A tool for building topic-specific ontologies using a knowledge graph, in: Proceedings of the 31st International Workshop on Description Logics Co-Located with KR 2018, 2018.

[7] J. Brank, M. Grobelnik and D. Mladenić, A survey of ontology evaluation techniques, in: Proceedings of the Conference on Data Mining and Data Warehouses (SiKDD 2005), 2005.

[8] E. Chernyak, An approach to the problem of annotation of research publications, in: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining – WSDM’15, 2015, pp. 429–434. doi:10.1145/2684822.2697032.

[9] C. D’Amato, Similarity-based learning methods for the semantic web, Ph.D. thesis, Università degli Studi di Bari, 2007.

[10] K. Dellschaft and S. Staab, Strategies for the evaluation of ontology learning, in: Proceedings of the 2008 Conference on Ontology Learning and Population: Bridging the Gap Between Text and Knowledge, Frontiers in Artificial Intelligence and Applications, 2008, pp. 253–272.

[11] E.W. Dijkstra, A note on two problems in connexion with graphs, Numerische Mathematik 1 (1959), 269–271. doi:10.1007/BF01386390.

[12] M. El Ghosh, C. Zanni-Merk, N. Delestre, J.P. Kotowicz and H. Abdulrab, Topic-OPA: A topic ontology for modeling topics of old press articles, in: Proceedings of the 12th International Conference on Knowledge Engineering and Ontology Development, 2020, pp. 275–282. doi:10.5220/0010147202750282.

[13] F. Erxleben, M. Günther, M. Krötzsch, J. Mendez and D. Vrandečić, Introducing Wikidata to the linked data web, in: The Semantic Web Conference – ISWC 2014, LNCS, 2014, pp. 50–65. doi:10.1007/978-3-319-11964-9_4.

[14] J. Euzenat and P. Shvaiko, Ontology Matching, 2nd edn, Springer-Verlag, Berlin Heidelberg (DE), 2013. doi:10.1007/978-3-642-38721-0.

[15] M. Fernández, C. Overbeeke, M. Sabou and E. Motta, What makes a good ontology? A case-study in fine-grained knowledge reuse, in: The Semantic Web, Springer, Berlin, Heidelberg, 2009, pp. 61–75. doi:10.1007/978-3-642-10871-6_5.

[16] S. Fernando and M. Stevenson, A semantic similarity approach to paraphrase detection, in: Proceedings of Computational Linguistics Colloquium, U.K., 2008, pp. 45–52.

[17] N. Fiorini, S. Ranwez, J. Montmain and V. Ranwez, USI: A fast and accurate approach for conceptual document annotation, BMC Bioinformatics 16(83) (2015), 1–10. doi:10.1186/s12859-015-0513-4.

[18] P.H. Guzzi, M. Mina, C. Guerra and M. Cannataro, Semantic similarity analysis of protein data: Assessment with biological features and issues, Briefings in Bioinformatics 13(5) (2012), 569–585. doi:10.1093/bib/bbr066.

[19] S. Harispe, S. Ranwez, S. Janaqi and J. Montmain, Semantic similarity from natural language and ontology analysis, Synth. Lect. Hum. Lang. Technol. 8 (2015), 1–254. doi:10.2200/S00639ED1V01Y201504HLT027.

[20] J. Heitzig, N. Marwan, Y. Zou, J. Donges and J. Kurths, Consistently weighted measures for complex network topologies, Europ. Phys. J. B 85 (2011), 1–16.

[21] G. Hirst and D. St-Onge, Lexical chains as representations of context for the detection and correction of malapropisms, in: WordNet: An Electronic Lexical Database, 1998, pp. 305–332.

[22] T. Hofmann, Probabilistic latent semantic indexing, in: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, 1999, pp. 50–57. doi:10.1145/312624.312649.

[23] J. Jiang and D. Conrath, Semantic similarity based on corpus statistics and lexical taxonomy, in: Proc. on International Conference on Research in Computational Linguistics, Taiwan, 1997, pp. 19–33.

[24] J.H. Lau, K. Grieser, D. Newman and T. Baldwin, Automatic labelling of topic models, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 1536–1545.

[25] C. Leacock and M. Chodorow, Filling in a sparse training space for word sense identification, in: Proceedings of the 14th International Joint Conference on Artificial Intelligence, 1995, pp. 448–453.

[26] D. Lin, An information-theoretic definition of similarity, in: Proceedings of the Fifteenth International Conference on Machine Learning, ICML, 1998, pp. 296–304.

[27] Y. Liu, B.T. McInnes, T. Pedersen, G. Melton-Meaux and S. Pakhomov, Semantic relatedness study using second order co-occurrence vectors computed from biomedical corpora, UMLS and WordNet, in: Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium, 2012, pp. 363–372. doi:10.1145/2110363.2110405.

[28] A.G. Maguitman, R.L. Cecchini, C.M. Lorenzetti and F. Menczer, Using topic ontologies and semantic similarity data to evaluate topical search, in: Proceedings of 36th Latin American Informatics Conference (CLEI), 2010.

[29] S. Malyshev, M. Krötzsch, L. Gonzalez, J. Gonsior and A. Bielefeldt, Getting the most out of Wikidata: Semantic technology usage in Wikipedia’s knowledge graph, in: Proceedings of the 17th International Semantic Web Conference (ISWC’18), LNCS, Springer, 2018, pp. 376–394.

[30] L. Mazuel and N. Sabouret, Semantic relatedness measure using object properties in an ontology, in: The Semantic Web – ISWC 2008, A. Sheth et al., eds, LNCS, Vol. 5318, Springer, Berlin, Heidelberg. doi:10.1007/978-3-540-88564-1_43.

[31] melghosh, melghosh/RelTopic: SWJ (v1.0.2-beta), Zenodo, 2022. doi:10.5281/zenodo.6201279.

[32] T. Opsahl, F. Agneessens and J. Skvoretz, Node centrality in weighted networks: Generalizing degree and shortest paths, Social Networks 32(3) (2010), 245–251. doi:10.1016/j.socnet.2010.03.006.

[33] F. Osborne and E. Motta, Klink-2, integrating multiple web sources to generate semantic topic networks, in: The Semantic Web Conference – ISWC 2015, LNCS, Springer International Publishing, Cham, 2015, pp. 408–424. doi:10.1007/978-3-319-25007-6_24.

[34] T. Pedersen, S.V.S. Pakhomov, S. Patwardhan and C.G. Chute, Measures of semantic similarity and relatedness in the biomedical domain, Journal of Biomedical Informatics 40(3) (2007), 288–299. doi:10.1016/j.jbi.2006.06.004.

[35] R. Rada, H. Mili, E. Bicknell and M. Blettner, Development and application of a metric on semantic nets, IEEE Transactions on Systems, Man and Cybernetics 19 (1989), 17–30. doi:10.1109/21.24528.

[36] P. Resnik, Using information content to evaluate semantic similarity in a taxonomy, in: 14th International Joint Conference on Artificial Intelligence, 1995, pp. 448–453.

[37] P. Resnik, Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language, J. Artif. Intell. Res. 11 (1998), 95–130. doi:10.1613/jair.514.

[38] A. Salatino, T. Thanapalasingam, A. Mannocci, F. Osborne and E. Motta, The computer science ontology: A large-scale taxonomy of research areas, in: The Semantic Web – ISWC, 2018, pp. 187–205. doi:10.1007/978-3-030-00668-6_12.

[39] A. Salatino, T. Thanapalasingam, A. Mannocci, F. Osborne and E. Motta, The CSO classifier: Ontology-driven detection of research topics in scholarly articles, in: Digital Libraries for Open Knowledge, 2019, pp. 296–311. doi:10.1007/978-3-030-30760-8_26.

[40] D. Sanchez, M. Batet, S. Martinez and J.D. Ferrer, Semantic variance: An intuitive measure for ontology accuracy evaluation, Engineering Applications of Artificial Intelligence 39 (2015), 89–99. doi:10.1016/j.engappai.2014.11.012.

[41] J. Sleeman, T. Finin and M. Halem, Ontology-grounded topic modeling for climate science research, in: Proceedings of Semantic Web for Social Good Workshop, ISWC, 2018.

[42] J. Sosnowska and O. Skibski, Attachment centrality for weighted graphs, in: Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), 2017, pp. 416–422.

[43] R. Speer, J. Chin and C. Havasi, Conceptnet 5.5: An open multilingual graph of general knowledge, in: AAAI, 2017, pp. 4444–4451.

[44] M.C. Suárez-Figueroa, A. Gómez-Pérez and B. Villazón-Terrazas, How to write and use the ontology requirements specification document, in: On the Move to Meaningful Internet Systems: OTM 2009, R. Meersman, T. Dillon and P. Herrero, eds, LNCS, Springer, Berlin, Heidelberg, 2009.

[45] Y. Tang, P.D. Baer, G. Zhao and R. Meersman, On constructing, grouping and using topical ontology for semantic matching, in: Proceedings of OTM 2009 Workshops, R. Meersman, P. Herrero and T. Dillon, eds, Vol. 5872, Springer, Heidelberg, 2009, pp. 816–825.

[46] G. Zhao and R. Meersman, Architecting ontology for scalability and versatility, in: On the Move to Meaningful Internet Systems 2005: CoopIS, DOA, and ODBASE, OTM 2005, R. Meersman and Z. Tari, eds, LNCS, Vol. 3761, Springer, Berlin, Heidelberg, 2005.

[47] Y. Zuo, J. Zhao and K. Xu, Word network topic model: A simple but general solution for short and imbalanced texts, Knowledge and Information Systems 48 (2016), 379–398. doi:10.1007/s10115-015-0882-z.