We have created a knowledge graph based on major data sources used in ecotoxicological risk assessment. We have applied this knowledge graph to an important task in risk assessment, namely chemical effect prediction. We have evaluated nine knowledge graph embedding models from a selection of geometric, decomposition, and convolutional models on this prediction task. We show that using knowledge graph embeddings can increase the accuracy of effect prediction with neural networks. Furthermore, we have implemented a fine-tuning architecture that adapts the knowledge graph embeddings to the effect prediction task and leads to better performance. Finally, we evaluate certain characteristics of the knowledge graph embedding models to shed light on their individual performance.
Ecotoxicology is a multidisciplinary field that studies the potentially adverse toxicological effects of chemicals on organisms, from the molecular level to individuals, sub-populations, communities, and ecosystems. One major societal contribution of ecotoxicology is ecological risk assessment, which compares environmental concentrations of chemicals with existing laboratory effect data to evaluate the health status of an ecosystem. While laboratory experiments are thus crucial, they are labour intensive and involve a large amount of animal testing. Therefore, the development of modelling techniques for extrapolating from existing laboratory effect data is a major effort in the field of ecotoxicology.
A very important challenge in ecotoxicological risk assessment is the interoperability of the disparate data sources, formats and vocabularies. The use of Semantic Web technologies and (RDF-based) knowledge graphs can address this challenge and facilitate the orchestration of these datasets. Hence, extrapolation or prediction models can benefit from an integrated view of the data and the background knowledge provided by a knowledge graph. The use of knowledge graphs also enables the use of the available infrastructure to perform automated reasoning, explore the data via semantic queries, and compute semantic embeddings for machine learning prediction.
In this work we have created the Toxicological Effect and Risk Assessment Knowledge Graph (TERA) and implemented a prediction model over this knowledge graph to extrapolate adverse biological effects of chemicals on organisms. Here, we limit ourselves to binary effect prediction of mortality (shortened to effect prediction), i.e., whether there is a chance that a chemical can affect a species in a lethal way. The work and evaluation conducted in this paper is driven by the following research question: does the use of contextual information in the form of knowledge graph embeddings bring added value to the prediction of adverse biological effects?
Our contributions can be summarized as follows:
(i) TERA aims at consolidating the information relevant to the ecological risk assessment domain. TERA integrates several disparate datasets and enables unified (semantic) access to them. The formats of these data sources vary from tabular files to RDF files and SPARQL endpoints over public linked data. We have exploited external resources (e.g., Wikidata ) and ontology alignment methods (e.g., LogMap ) to discover equivalences between the data sources.
(ii) We have designed and implemented a model tailored to binary lethal chemical effect prediction. This model relies on TERA and builds upon existing knowledge graph embedding models. Moreover, it supplies the knowledge graph embedding models with additional information, which is used to tailor the embeddings to this specific task.
(iii) We have evaluated nine knowledge graph embedding (KGE) models, together with a naive baseline on the binary chemical effect prediction task. This evaluation includes four data sampling strategies which highlight the different settings of chemical effect prediction (i.e., the test data contains unseen chemical-organism pairs where: (a) the chemical and the organism may be known (but not in previously seen pairs), (b) the chemical is unknown, (c) the organism is unknown, and (d) both the chemical and the organism are unknown).
This paper extends our preliminary work presented in the In-Use Track of the 18th International Semantic Web Conference . We have (i) extended TERA with new sources (Encyclopedia of Life (EOL), MeSH, and a larger part of ChEMBL) and provided detailed steps about its creation; (ii) created a more robust prediction model with nine (up from three) embedding algorithms supported and a task-specific embedding fine-tuning strategy; and (iii) conducted a more comprehensive evaluation with all combinations of KGE models and sampling strategies totalling 648 data points (324 for each prediction model).
The rest of the paper is organized as follows. Section 2 introduces essential concepts to the subsequent sections. Section 3 introduces the use case where the knowledge graph and prediction models are applied. Section 4 introduces related work. The creation of the knowledge graph is described in Section 5. Section 6 introduces the prediction models, while Section 7 presents the evaluation of these models. Section 8 elaborates on the contributions and discusses future directions of research. Finally, the Appendix gives an overview of the knowledge graph embedding models used in this work.
In this section we introduce important background concepts that will be used throughout the paper. Table 1 contains the most important symbols.
|RDF||Resource Description Framework|
|OWL||Web Ontology Language|
|SPARQL||SPARQL Protocol and RDF Query Language|
|KGE||Knowledge graph embedding|
|s||The subject of a triple|
|o||The object of a triple|
|p, r||The predicate/relation of a triple|
|e||A KG entity|
| ||The set of KG triples|
| ||The set of KG entities|
| ||The set of KG relations|
| ||The set of literal values|
|e||The vector representation of an entity or relation|
|k||The dimension of a vector|
| ||The scoring function of a KGE model|
| ||Pre-trained KGE-based model|
| ||Fine-tuning KGE-based model|
|S||Refers to species|
|C||Refers to chemicals|
Taxonomy in this work refers to a species classification hierarchy. Any node in a taxonomy is called a taxon. A species is a taxon which is also a leaf node in the taxonomy. An organism denotes an individual living organism, which is an instance of a species. Chemicals or compounds are unique isotopes of substances consisting of two or more atoms. Effect, used in this work as a short form for chemical effect, refers to the response of an organism (or population) to a chemical at a specific concentration. Endpoint denotes a measured effect on the test population at a certain time; e.g., lethal concentration to 50% of the test population (LC50) measured at 48 hours. Note that an experiment can have several endpoints, e.g., LC50 at 48 hours and LC100 at 96 hours (lethal concentration for all test organisms). See Table 2 for the most common endpoints.
2.2.Ontology-enhanced knowledge graphs
In this work we consider the most broadly accepted notion of knowledge graph within the Semantic Web: an ontology-enhanced RDF-based knowledge graph (KG) . This kind of knowledge graph enables the use of the available Semantic Web infrastructure, including SPARQL engines and OWL reasoners. Thus, in our setting, KGs are composed of RDF triples of the form ⟨s, p, o⟩, where s represents a subject (a class or an instance), p represents a predicate (a property) and o represents an object (a class, an instance or a literal). KG entities (i.e., classes, properties and instances) are represented by a URI (Uniform Resource Identifier).
An (ontology-enhanced) KG can be split into a TBox (terminology) and an ABox (assertions). The TBox is composed of triples using RDF Schema (RDFS) constructors like class subsumption and property domain and range, and OWL constructors like disjointness, equivalence and property inverses. The ABox contains assertions among instances, including OWL equality and inequality, and semantic type definitions. Table 5 shows several examples of TBox and ABox triples.
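As a minimal illustration of this triple structure, the sketch below stores a KG as Python tuples and splits it into TBox and ABox parts. The URIs and the set of TBox predicates are hypothetical and chosen for illustration; a real implementation would use a library such as rdflib.

```python
# Hand-picked set of schema-level (TBox) predicates -- illustrative only.
TBOX_PREDICATES = {
    "rdfs:subClassOf", "rdfs:domain", "rdfs:range",
    "owl:disjointWith", "owl:equivalentClass", "owl:inverseOf",
}

def split_tbox_abox(triples):
    """Split a set of (s, p, o) triples into TBox and ABox parts."""
    tbox = {t for t in triples if t[1] in TBOX_PREDICATES}
    return tbox, triples - tbox

kg = {
    ("ex:Daphnia_magna", "rdfs:subClassOf", "ex:Daphnia"),   # terminology
    ("ex:organism1", "rdf:type", "ex:Daphnia_magna"),        # assertion
}
tbox, abox = split_tbox_abox(kg)
```
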
Ontology alignment is the process of finding mappings or correspondences between a source and a target ontology or knowledge graph [23,66]. These mappings typically represent equivalences or broader/narrower relationships among the entities of the input ontologies. In the ontology matching community , mappings are exchanged using the RDF Alignment format ; but they can also be interpreted as standard OWL axioms (e.g., [24,35]). In this work we treat ontology alignments as OWL axioms (e.g., triple in Table 5). An ontology matching system (e.g., LogMap ) is a program that, given as input two ontologies or knowledge graphs, generates as output a set of mappings (i.e., an alignment) M.
Knowledge graph embedding (KGE) [63,78] plays a key role in link prediction problems where it is applied to knowledge graphs to resolve missing facts in largely connected knowledge graphs, such as DBpedia . Biomedical link prediction is another area where embedding models have been applied successfully (e.g., [1,5]).
The embeddings of the entities in a KG are commonly learned by (i) defining a scoring function over a triple, which is typically proportional to the probability of the existence of that triple in the KG; and (ii) minimizing a loss function (i.e., the deviation of the prediction of the scoring function with respect to the truth available in the KG). More specifically, KGE models (i) initialize the entities in a triple as vector representations of dimension k; (ii) apply a scoring function to these vectors; and (iii) adapt the vector representations to improve the scoring and minimize the loss.
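This objective can be sketched for a translation-based scoring function in the TransE style, together with a margin ranking loss over a positive triple and a corrupted (negative) one. This is a simplified illustration with hand-crafted vectors, not the exact training setup used in this work.

```python
import numpy as np

def transe_distance(s, p, o):
    """TransE models p as a translation: a triple is plausible
    when s + p is close to o (score = -||s + p - o||)."""
    return np.linalg.norm(s + p - o)

def margin_loss(pos_dist, neg_dist, gamma=1.0):
    """Margin ranking loss: positive triples should score at least
    gamma better (closer) than negative ones."""
    return max(0.0, gamma + pos_dist - neg_dist)

k = 4                                 # embedding dimension
s = np.zeros(k)
p = np.ones(k)
o = s + p                             # a triple that holds exactly
o_neg = o + 2.0                       # corrupted object (negative sample)
loss = margin_loss(transe_distance(s, p, o), transe_distance(s, p, o_neg))
```

In a real KGE model the vectors are trainable parameters updated by gradient descent on this loss over many (positive, negative) pairs.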
Several knowledge graph embedding models have been proposed. In this work, we used models of three major categories: decomposition models, geometric models, and convolutional models. The decomposition models represent the triples of the KG as a one-hot 3rd-order tensor and apply tensor decomposition to learn entity vectors. Geometric models, also known as translational models, try to learn embeddings by defining a scoring function where the predicate in the triple acts as a geometric transformation (e.g., a translation or rotation) from subject to object. Convolutional models, unlike the previous models, learn entity embeddings with non-linear scoring functions via convolutional layers.
3.Ecotoxicological risk assessment and adverse biological effect prediction
The task of ecotoxicological risk assessment is to study the potential hazardous effects of chemicals on organisms from individuals to ecosystems. In this context, risk is the result of the intrinsic hazards of a substance on species, populations or ecosystems, combined with an estimate of the environmental exposure, i.e., the product of exposure and effect (hazard).
Figure 1 shows a simplified risk assessment pipeline. Exposure data is gathered from analysis of environmental concentrations of one or more chemicals, while effects (hazards) are characterized for a number of species in the laboratory as a proxy for more ecologically relevant organisms. These two data sources are used to calculate the so-called risk quotient (RQ; the ratio between exposure and effects). The RQ for one chemical or a mixture of many chemicals is used to identify the chemicals with the highest RQs (risk drivers), identify relevant modes of action (MoA) and characterize detailed toxicity mechanisms for one or more species (or taxa). Results from these predictions can generate a number of new hypotheses that can be investigated in the laboratory or studied in the environment. Note that this risk assessment pipeline is a simplified version of the one in use at the Norwegian Institute for Water Research; however, similar methodologies are used across regulatory risk assessment pipelines.
|LC50||0.16||Lethal concentration for 50% of test population|
|EC50||0.05||Effective concentration for 50% of test population|
|LOEC||0.04||Lowest observable effect concentration|
|NR-LETH||0.02||Lethal to 100% of test population|
|LD50||0.02||Lethal dose for 50% of test population|
The chemical effect data is gathered during laboratory experiments, where a sub-population of a single species is exposed to increasing concentrations of a toxic chemical. The endpoints of the experiments are recorded at given chemical concentrations and times after exposure. These endpoints are grouped into several categories, e.g., by the lethality rate of the test population (see Table 2).
Ecological risk assessment methods require a large amount of such experimental data to give an accurate depiction of the long-term risk to an ecosystem. The data must cover the relevant chemicals and species present in the ecosystem; e.g., an ecological risk assessment of agricultural runoff in Norway will mostly concern pesticides and water fleas, copepods, and frogs, among other species . Even with just a few relevant chemicals and species the search space becomes immense and performing laboratory experiments becomes unfeasible. Thus, it is essential to develop in silico methods to extrapolate new chemical-species effects from known combinations. We differentiate between two types of complementary strategies: (i) highly specialized (restricted in chemical and species domains) models to predict chemical concentrations that will have an effect on a test species, and (ii) models that produce rankings of highly representative chemical-species pair hypotheses which can be used by a laboratory to perform targeted experiments. In this paper we focus on the latter strategy, using a method based on knowledge graph embeddings. Methods that fall into the first strategy are introduced in Section 4.1.
This section will cover related work from ecotoxicology and knowledge graph based prediction.
There are two main research areas in toxicology for extrapolating chemical effects, i.e., Quantitative Structure-Activity Relationship (QSAR) modelling and read-across. QSAR modelling tries to find a relationship between the structure of a chemical and the chemical's biological activity (cf. reviews [22,26]). This relationship is described using derived chemical features. Some features are simple, e.g., the octanol-water partition coefficient (logP), while others concern the entire chemical, e.g., chemical fingerprints. The basis of the QSAR relationship is usually modeled as a polynomial equation. Parthasarathi and Dhawan  take this further by using the logarithm of the chemical concentration to achieve a polynomial relationship log(1/κ) = f(π) + σ, where f is a polynomial of nth degree, κ is the chemical concentration, and π and σ denote the derived chemical features hydrophobicity and electronic effects in the molecule, respectively. The drawback of these models is their limited applicability domain. Usually, a QSAR model considers a small set of chemicals (tens to hundreds) and one single species. This means that new features and relationships need to be developed for each species and each chemical group.
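A Hansch-type polynomial relationship of this kind can be recovered with ordinary least squares. The sketch below fits synthetic, noise-free data; the feature values and coefficients are invented for illustration and do not come from any real QSAR study.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = rng.uniform(-2, 2, 50)       # hydrophobicity values (synthetic)
sigma = rng.uniform(-1, 1, 50)    # electronic-effect values (synthetic)

# Synthetic activities from a known quadratic-in-pi relationship.
y = -0.5 * pi**2 + 1.2 * pi + 0.8 * sigma + 2.0

# Design matrix for log(1/kappa) = a*pi^2 + b*pi + c*sigma + d.
X = np.column_stack([pi**2, pi, sigma, np.ones_like(pi)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Because the data is noise-free, the fitted coefficients recover the generating ones; with real laboratory data the fit would only approximate them, and the polynomial degree of f would itself be a modelling choice.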
The read-across methods try to mitigate these drawbacks, mainly by considering extrapolation of the effect at the chemical and species levels. Similar to QSAR models, read-across of chemicals use the chemical features to create similarity measures between chemicals to justify the read-across of chemical effects. The read-across in the species domain is harder. Species do not tend to have easily derived features. Therefore, genetic similarity has emerged as a viable option. Sequence Alignment to Predict Across Species Susceptibility (SeqAPASS), developed by the United States Environmental Protection Agency (U.S. EPA), is an example of such an approach [20,41]. SeqAPASS uses a large amount of data available for humans, mice, rats, and zebrafish to extrapolate to areas with lower coverage.
In this work, we use nine KGE models across three categories. Here, we give a brief introduction to the models, while a more extended explanation of each can be found in the Appendix. We refer the interested reader to  for a comprehensive survey.
The three categories of models are decomposition, geometric, and convolutional . The decomposition models are DistMult, ComplEx, and HolE. DistMult models the score of a triple as the trilinear product of the representations of the subject, predicate and object . ComplEx uses the same scoring function as DistMult, but in a complex vector space, such that it can handle inverse relations . HolE is based on holographic embeddings ; however, it has been shown that HolE is equivalent to ComplEx .
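For instance, the DistMult scoring function is simply a trilinear product, which also makes it symmetric in subject and object — a well-known limitation that ComplEx addresses. A minimal sketch with illustrative vectors:

```python
import numpy as np

def distmult_score(s, p, o):
    """DistMult score: sum_i s_i * p_i * o_i (a trilinear product)."""
    return float(np.sum(s * p * o))

s = np.array([1.0, 2.0])
p = np.array([0.5, 1.0])
o = np.array([3.0, 1.0])

# The product is symmetric in s and o, so DistMult assigns the same
# score to (s, p, o) and (o, p, s): it cannot model antisymmetric relations.
symmetric = distmult_score(s, p, o) == distmult_score(o, p, s)
```
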
The geometric models are TransE, RotatE, pRotatE, and HAKE. TransE is the basis of a whole family of models and scores triples based on the translation from subject to object using the representation of the predicate . RotatE is similar to TransE; however, the translation is performed by rotating the subject using the predicate (via Euler's identity) . Furthermore, pRotatE is a baseline for RotatE where the modulus in Euler's identity is ignored . Finally, HAKE is a hierarchy-aware model where entities at each level in the hierarchy are at an equal distance from the origin and relations within a level are modeled as rotations .
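The rotation idea behind RotatE can be sketched with complex vectors: each relation dimension is a unit-modulus complex number that rotates the corresponding subject dimension. The vectors and phases below are illustrative only.

```python
import numpy as np

def rotate_distance(s, r_phase, o):
    """RotatE-style distance: the relation e^{i*phase} (|r| = 1)
    rotates s in the complex plane; score = -|s * r - o|."""
    r = np.exp(1j * r_phase)
    return np.linalg.norm(s * r - o)

s = np.array([1 + 0j, 0 + 1j])            # complex entity embedding
phase = np.array([np.pi / 2, np.pi])      # per-dimension rotation angles
o = s * np.exp(1j * phase)                # object chosen so the triple holds
```

Because the relation has unit modulus, rotation preserves vector norms, which is what lets RotatE model symmetric, antisymmetric, inverse, and compositional relations.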
The convolutional models take a deep learning approach to the task of KGE. We use ConvKB  and ConvE , which are similar but have slightly different architectures. They have shown good performance given their relatively small number of parameters.
Although quite a few KGE models have been proposed, the ones adopted here are either classic models or achieve state-of-the-art performance on some benchmarks. They are representative of mainstream techniques and have been widely adopted in KGE research and applications . Thus, the benefits and shortcomings of the KGE models analysed in this study provide good evidence of the general performance of this type of model in a complex prediction task, i.e., predicting adverse biological effects of chemicals on organisms.
4.3.Using KGE for prediction
We use KGE models to predict whether a chemical has a lethal effect on an organism. KGE models have been explored in the biomedical domain to solve similar prediction tasks (e.g., finding relationships between diseases, drugs, genes, and treatments). Several works have shown improved results by using KGE models for prediction, e.g., [1,5,46]. Chen et al.  used random walks over networks to perform drug-target predictions. The ChEMBL and DrugBank KGs have also been used to predict the chemical mode of action (MoA) of anticancer drugs with high performance on benchmark datasets .
Opa2vec  and Blagec et al.  have developed embedding models to improve similarity-based prediction in the biomedical domain, while OpenBioLink  has created a framework for evaluating models in the biomedical domain.
EL Embeddings  and Opa2vec  present new semantic embedding methods for KGs with expressive logic expressions (i.e., OWL ontologies) to predict protein interactions. The former utilizes complex geometric structures to model the logic relationships between entities, while the latter learns a language model from a corpus extracted from the ontology. OWL2Vec*  also learns a language model from an ontology and applies the computed embeddings to two prediction tasks: class subsumption and class membership. OWL2Vec* has also been used to predict the plausibility of ontology alignments .
To the best of our knowledge there is no work using link prediction or KGE models to support ecotoxicological effect prediction. This study will give novel insights and empirical results of KGE models in this new domain.
5.TERA knowledge graph
One major challenge in ecological risk assessment processes is the interoperability of data. In this section, we introduce the Toxicological Effect and Risk Assessment Knowledge Graph (TERA), an ontology-enhanced RDF-based knowledge graph that aims at providing an integrated view of the relevant data sources for risk assessment.
The initial inspiration for TERA was to aid ecotoxicological effect prediction, where access to disparate resources was required (see Section 5.3). However, by integrating these sources into a KG, we were also able to directly apply TERA in the prediction process by leveraging knowledge graph embedding models (see Section 5.4).
The data sources integrated into TERA vary from tabular and RDF files to SPARQL endpoints over public linked data. The sources currently integrated into TERA are: (i) biological: NCBI Taxonomy, Encyclopedia of Life, and Wikidata mappings (∼500k species); (ii) chemical: PubChem, ChEMBL, MeSH, and Wikidata mappings (∼110M compounds); and (iii) biological effects: ECOTOXicology Knowledgebase (∼1M results, ∼12k compounds, ∼13k species), and system-generated mappings. These three distinct parts make up the sub-KGs of TERA, i.e., (i) the Taxonomy sub-KG (), (ii) the Chemical sub-KG (), and (iii) the Effects sub-KG (). The different processes to transform and integrate these sources into TERA are shown in Fig. 2.
A snapshot of TERA is available on Zenodo , where licenses permit. PubChem and ChEMBL are not included in the snapshot due to size constraints; these can be downloaded from the National Institutes of Health and the European Bioinformatics Institute, respectively. The subgraph of TERA used for prediction is available alongside the chemical effect prediction models in our GitHub repository. Table 5 shows several examples of RDF triples from TERA.
TERA, as mentioned above, is constructed by gathering a number of sources about chemicals, species and chemical toxicity, with a diverse set of formats including tabular data, RDF dumps and SPARQL endpoints.
Biological effect data of chemicals. The largest publicly available repository of effect data is the ECOTOXicology knowledgebase (ECOTOX), developed by the US Environmental Protection Agency . This data is gathered from published toxicological studies and limited internal experiments. The dataset consists of experiments covering chemicals and species, implying a maximum chemical–species pair coverage of . The resulting endpoint from an experiment is categorised as one of a plethora of predefined endpoints (see Table 2 above).
|1147366||12448||134623 (diethyltoluamide)||1 (Pimephales promelas)||Water|
Tables 3 and 4 contain an excerpt of the ECOTOX database. ECOTOX includes information about the chemicals and species used in the tests. This information, however, is limited and additional (external) resources are required to complement ECOTOX.
Chemicals. The ECOTOX database uses an identifier called the CAS Registry Number, assigned by the Chemical Abstracts Service, to identify chemicals. The CAS numbers are proprietary; however, Wikidata  (indirectly) encodes mappings between CAS numbers and open identifiers like the InChIKey, a 27-character hash of the International Chemical Identifier (InChI), which encodes chemical information uniquely . Wikidata also provides mappings to well-known databases like PubChem, ChEMBL and MeSH, which include relevant chemical information such as chemical structure, structural classification and functional classification.
Taxonomy. ECOTOX contains a taxonomy (of species); however, this only considers the species represented in the ECOTOX effect data. Hence, to enable extrapolation of effects across a larger taxonomic domain, we include the NCBI Taxonomy . This taxonomy data source consists of a number of database dump files, which contain a hierarchy for all sequenced species; this equates to around of the currently known life on Earth and is one of the most comprehensive taxonomic resources. For each of the taxa (species and classes), the taxonomy defines a handful of labels, the most commonly used of which are the scientific and common names. However, labels such as authority can be used to see the citation where the species was first mentioned, while a synonym is an alternate scientific name that may be used in the literature.
Species traits. As an analog to chemical features, we use species traits to expand the coverage of the knowledge graph. Apart from taxonomic classifications, traits are the most important information to identify species and will be of great importance when predicting the effect on the species.
The traits we have included in the knowledge graph are the habitat, endemic regions, and presence (and classifications of these). This data is gathered from the Encyclopedia of Life (EOL) , which is available as a property graph. Moreover, EOL uses external definitions of certain concepts, and mappings to these sources are available as glossary files. In addition to traits, researchers may be interested in species that have different conservation statuses, e.g., if the population is stable or declining, etc. This data can also be extracted from EOL.
In this section we present the different steps to extract, transform and integrate the source datasets into the main TERA components and sub-KGs. All data is transformed using custom mappings (scripts) from the sources to RDF triples. Table 5 shows an excerpt of the triples in TERA.
5.2.1.Effects sub-KG construction
The effect data in ECOTOX consists of two parts, i.e., test definitions and results associated with the test definitions (see Tables 3 and 4, respectively). The most important columns of a test are the chemical and the species used. Other columns include metadata, but these are optional and often empty. Each result is composed of an endpoint, an effect, and a concentration (with a unit) at which the endpoint and effect are recorded.
This tabular data in ECOTOX is transformed into triples that form the effects sub-KG in TERA (). Note that a test can have multiple results. A subset of the effect triples are listed in Table 5 (see triples –). A graphical representation for an effect test and its result is also shown in Fig. 3.
ECOTOX contains metadata about the species and chemicals used in the experiments. This metadata is also included in TERA to facilitate the alignment with other resources (see Section 5.2.2).
(i) The ECOTOX metadata file species.txt includes common and Latin names, along with a (species) ECOTOX group (see triples – in Table 5). This group is a categorization of the species based on ECOTOX use cases. Prefixes and abbreviations like sp., var. are removed from the label names.
(ii) The full hierarchical lineage is also available in the metadata file species.txt. Each column represents a taxonomic level, e.g., genus or family. If a column is empty, we construct an intermediate classification; for example, Daphnia magna has no genus classification in the data, so its genus is set to Daphniidae genus (family name + rank; the actual genus is Daphnia). We construct these classifications to ensure that the number of levels in the taxonomy is consistent (see triples and in Table 5). Note that when adding such triples, we also add a taxonomic rank to facilitate querying for a specific taxonomic level.
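The construction of intermediate classifications can be sketched as follows, using the "parent name + rank" convention described above. The rank list and input format are simplified assumptions, not the exact species.txt layout.

```python
# Simplified, ordered subset of taxonomic ranks (illustrative).
RANKS = ["family", "genus", "species"]

def fill_lineage(lineage):
    """lineage: dict rank -> name, with None where the column is empty.
    Empty ranks get an intermediate name built from the parent taxon."""
    filled, parent = {}, None
    for rank in RANKS:
        name = lineage.get(rank)
        if name is None and parent is not None:
            name = f"{parent} {rank}"   # e.g. "Daphniidae genus"
        filled[rank] = name
        parent = name
    return filled

lineage = fill_lineage({"family": "Daphniidae", "genus": None,
                        "species": "Daphnia magna"})
```
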
(iii) The ECOTOX source file chemicals.txt includes chemical metadata and it is handled similarly to species.txt. The file includes chemical name (see in Table 5) and a (chemical) ECOTOX group.
For the units in the effect data, e.g., chemical concentrations (mg/L, mol/L, mg/kg, etc.), we reuse the QUDT 1.1 ontologies. When a unit such as mg/L is not defined, we define it according to Listing 1.
5.2.2.Alignment with state-of-the-art tools
The ECOTOX database provides proprietary chemical identifiers (i.e., CAS numbers) and internal ECOTOX ids for species. In order to extrapolate effects across a larger set of chemicals and species than those available in ECOTOX, TERA integrates taxonomy and trait data from NCBI and EOL, and chemical data from PubChem, ChEMBL and MeSH.
Alignment between ECOTOX and the NCBI Taxonomy. There does not exist a complete and public alignment between the 23,439 ECOTOX species and the 1,830,312 NCBI Taxonomy species. We have used three methods, two state-of-the-art ontology alignment systems and a baseline, to align ECOTOX and the NCBI Taxonomy: (i) LogMap [33,34], (ii) AgreementMakerLight (AML) , and (iii) a string matching algorithm based on Levenshtein distance . LogMap and AML were chosen since they have performed well across many datasets in the Ontology Alignment Evaluation Initiative (e.g., [2,3,61]). Most mappings in our setting are expected to be lexical; therefore, we also selected a purely lexical matcher to evaluate whether more sophisticated systems like LogMap and AML bring additional value.
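A Levenshtein-based lexical baseline can be sketched as below: a classic dynamic-programming edit distance and a normalized similarity for comparing species labels. This is a minimal illustration; the threshold used in the actual experiments may differ.

```python
def levenshtein(a, b):
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Normalized similarity in [0, 1]; 1 means identical strings."""
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)

def is_match(label_a, label_b, threshold=0.9):
    """Lexical matching decision for a candidate species-name pair."""
    return similarity(label_a.lower(), label_b.lower()) >= threshold
```
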
Due to the large size of the NCBI Taxonomy, we needed to split NCBI into manageable chunks to enable the use of ontology alignment systems. Fortunately, this can be easily done by considering the species division, e.g., mammal or invertebrate. This divides the NCBI Taxonomy into 11 distinct parts, which can be aligned to the taxonomy in ECOTOX.
|String similarity ()||20,423||0.76||0.87|
Note that an entity from ECOTOX is expected to match a single entity in the NCBI Taxonomy, and vice-versa. Hence, 1-to-N and N-to-1 alignments were filtered according to the system-computed confidence. A partial mapping curated by experts can be obtained through the ECOTOX Web. We have gathered a total of 2,321 mappings for validation purposes. Table 6 shows the alignment results over the ground truth samples for the 1-to-1 (filtered) system mappings. We report the number of mappings (#M), Recall (R) and estimated precision () with respect to the known entities in the incomplete ground truth, assuming only 1-to-1 mappings are valid; the estimated precision is calculated over the mappings that involve entities present in the ground truth.
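A sketch of this precision estimation, under the assumption (ours, for illustration) that a system mapping counts toward the denominator only when at least one of its entities occurs in the partial ground truth:

```python
def estimated_precision(system, ground_truth):
    """Precision of system mappings, estimated against an incomplete
    ground truth: only mappings touching ground-truth entities count."""
    gt_entities = {e for pair in ground_truth for e in pair}
    evaluable = {m for m in system
                 if m[0] in gt_entities or m[1] in gt_entities}
    if not evaluable:
        return 0.0
    return len(evaluable & ground_truth) / len(evaluable)

gt = {("eco:1", "ncbi:10"), ("eco:2", "ncbi:20")}
system = {("eco:1", "ncbi:10"),   # correct
          ("eco:2", "ncbi:99"),   # wrong, but evaluable (eco:2 is in the GT)
          ("eco:3", "ncbi:30")}   # outside the ground truth: ignored
```
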
We have selected the union of the 1-to-1 equivalence mappings computed by AML and LogMap to be integrated within TERA, as they represent the mapping set with the best recall and a reasonable estimated precision. This choice was made considering the large uncertainty of the downstream applications (effect prediction and risk assessment), where we prefer a larger coverage of the domain. See triple in Table 5 for an example of a system-computed mapping between ECOTOX and the NCBI Taxonomy.
We use Wikidata as the source of alignments between the NCBI Taxonomy and EOL, and among the chemical datasets used. Alignments are extracted via Wikidata's query interface (i.e., its SPARQL endpoint). The data in Wikidata concerning species and chemicals is in large part manually curated  and will have a low error rate compared to the output of automated ontology alignment systems.
Alignment between the NCBI Taxonomy and EOL. In order to include trait data from EOL in TERA, we need to establish an alignment between EOL and the NCBI Taxonomy. We have constructed equivalence triples between NCBI Taxonomy and EOL identifiers using Wikidata. The species identifiers are available as literals in Wikidata; therefore, we concatenate them with the appropriate namespace. Listing 2 shows the SPARQL CONSTRUCT query used against the Wikidata endpoint. Here, we query Wikidata for instances of taxa, thereafter adding optional triple patterns for the NCBI Taxonomy and EOL identifiers, which are added as owl:sameAs triples to TERA.
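The namespace-concatenation step can be sketched as below, turning identifier literals (as returned by a Wikidata query) into owl:sameAs triples. The namespaces are the public identifier URL patterns and may differ from the exact ones used in TERA.

```python
# Public identifier URL patterns (illustrative; TERA may use others).
NCBI_NS = "https://www.ncbi.nlm.nih.gov/taxonomy/"
EOL_NS = "https://eol.org/pages/"

def same_as_triples(bindings):
    """bindings: result rows with optional 'ncbi' and 'eol' id literals.
    Rows missing either identifier produce no mapping triple."""
    triples = []
    for row in bindings:
        if row.get("ncbi") and row.get("eol"):
            triples.append((NCBI_NS + row["ncbi"], "owl:sameAs",
                            EOL_NS + row["eol"]))
    return triples

triples = same_as_triples([{"ncbi": "35525", "eol": "327932"},
                           {"ncbi": "7955"}])   # no EOL id: skipped
```
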
Examples of resulting mapping triples are shown in – in Table 5. The proportion of species in Wikidata where this mapping exists is .
Alignment between chemical entities. The mapping from ECOTOX chemical identifiers (CAS Registry Numbers) to Wikidata entities enables the alignment to a vast set of chemical datasets, e.g., PubChem, ChEBI, KEGG, ChemSpider, MeSH, and UMLS, to name a few. The construction of equivalence triples between CAS, ChEMBL, MeSH, PubChem and Wikidata identifiers is shown in Listing 3. As in the case of species identifiers, the literal representing a chemical identifier is concatenated with the corresponding namespace. For the CAS Registry Numbers we also remove the hyphens to match the ECOTOX notation. Examples of the resulting mapping triples are shown in – in Table 5.
These mappings are not complete, but for some the coverage is large. Out of the chemicals used in ECOTOX, have an equivalence in Wikidata (through the CAS registry numbers). Moreover, Wikidata chemicals has ChEMBL identifiers, MeSH identifiers, PubChem identifiers, and InChiKey identifiers.
5.2.3. Taxonomy sub-KG construction
The Taxonomy sub-KG integrates data from the NCBI Taxonomy and the EOL trait data. The integration of the NCBI Taxonomy into the TERA knowledge graph is split into several sub-tasks.
(i) We load the hierarchical structure included in the NCBI Taxonomy file nodes.dmp. The columns of interest are the taxon identifiers of the child and parent taxa, along with the rank of the child taxon and the division to which the taxon belongs. We use this to create triples like those in Table 5.
(ii) To aid the alignment between the NCBI Taxonomy and the ECOTOX identifiers, we add the synonyms found in names.dmp. Here, the taxon identifier, its name, and the name type are used to create triples like those in Table 5. Note that a taxon in the NCBI Taxonomy can have several synonyms, while a taxon in ECOTOX usually has two, i.e., a common name and a scientific name.
(iii) Finally, we add the labels of the divisions found in divisions.dmp (see triples and ). We also add disjointness axioms among unrelated divisions, e.g., triple in Table 5.
We use the TraitBank from EOL to add species traits to TERA. The TraitBank is modeled as a property graph and can be accessed as a Neo4j database or via a set of tabular files. To integrate the TraitBank into TERA we validate the identifiers used in EOL and convert them to URIs; if an identifier is not a valid URI, we replace the invalid symbols. A trait example is shown as a triple in Table 5. The EOL TraitBank also includes subsumption definitions (i.e., via rdfs:subClassOf) for a large portion of the traits. These subsumptions can be downloaded separately and are added to TERA in a similar way as described above.
5.2.4. Chemical sub-KG construction
The Chemical sub-KG is created from PubChem, ChEMBL, and MeSH. These datasets are available for download as RDF triples. In addition, ChEMBL and MeSH can be accessed through the EBI and MeSH SPARQL endpoints, respectively.
The chemical subset of PubChem is used since information about chemicals is standardized in PubChem, while information about substances is not. In this subset we use: (i) component information, i.e., the building blocks of the chemical or the parts of a mixture; (ii) type assertions, which either link to ChEBI or describe the type of molecule, e.g., small or large; (iii) role assertions, which describe additional attributes or relationships of the chemical, e.g., FDAApprovedDrug; and (iv) drug products, which link to the clinical data in SNOMED CT. Examples of these can be seen in Table 5.
Parent chemical data in PubChem is limited to permutations, e.g., bonds, polarity, and part-of-mixture axioms (triple in Table 5). Therefore, we use the hierarchical data about chemicals from MeSH. In addition, we create similarity triples between chemicals. These are impractical to download, but can be calculated on demand. We add a similarity triple to TERA when the Tanimoto (Jaccard) coefficient between the chemical fingerprints (gathered using PubChemPy) exceeds a threshold; see triple in Table 5.
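A minimal sketch of the Tanimoto (Jaccard) coefficient over binary fingerprints, using toy bit sets rather than actual PubChem fingerprints:

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) coefficient between two fingerprints,
    represented as sets of the indices of their set bits."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Two toy fingerprints sharing 2 of 4 distinct bits:
sim = tanimoto({1, 2, 3}, {2, 3, 4})   # -> 0.5

# A similarity triple is only added above a threshold (PubChem uses 0.9):
THRESHOLD = 0.9
is_similar = sim >= THRESHOLD
```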
ChEMBL contains facts about the bioactivity of chemicals, which contribute to assessing the danger of a chemical. In TERA, we use the mode of action (MoA) and the target (the receptor targeted by the MoA; triple in Table 5). These targets are organized in a hierarchy using chembl:relSubsetOf relations (see Table 5). Each receptor also links to the organism it belongs to; however, we leave the inclusion of this information for future work.
We use the entire MeSH dataset in TERA. MeSH is organised as several hierarchies, the most prominent classifications being based on chemical groups and the intended use of the chemicals. Table 5 shows examples of chemical group and functional classifications.
5.3. TERA for data access
TERA covers knowledge and data relevant to the ecotoxicological domain and enables integrated semantic access across datasets. In addition, the adoption of an RDF-based knowledge graph enables the use of an extensive range of Semantic Web infrastructure (e.g., reasoning engines, ontology alignment systems, SPARQL query engines).
The data integration efforts and the construction of TERA are in line with the vision of the computational risk assessment community (e.g., the Norwegian Institute for Water Research's Computational Toxicology Program (NCTP)), where increasing the availability and accessibility of knowledge enables optimal decision making.
The knowledge in TERA can be accessed via predefined queries (e.g., classification, sibling, and name queries, and fuzzy queries over species names) and arbitrary SPARQL queries. The output format is flexible to the task, and can be given either as a graph or in tabular form. Listing 4 shows an example query to extract the chemicals and the concentrations at which the species in the Oslofjord experience lethal effects.
5.4. TERA for effect prediction
TERA is used as background knowledge in combination with machine learning models for chemical effect prediction. TERA's sub-KGs play different roles in effect prediction. The rich semantics of the species and chemical entities in the Taxonomy and Chemical sub-KGs are embedded into low-dimensional vectors, while the Effects sub-KG provides the training samples for the prediction model. Each sample is composed of a chemical, a species, a chemical concentration, and the outcome or endpoint of the experiment. More details are given in Section 6, where the effect prediction model is built upon state-of-the-art knowledge graph embedding models.
Table 7 shows the sparsity-related measures of common benchmark datasets and of TERA's sub-KGs (triples involving literals are removed). We follow Pujara et al. and calculate the relational density, RD = |T|/|R|, and the entity density, ED = 2|T|/|E|, where T, R, and E are the sets of triples, relations, and entities in the knowledge graph, respectively. The entity entropy (EE) and the relation entropy (RE) indicate whether there are biases in the triples of the KG (the lower the EE or RE, the larger the bias), and are calculated as EE = −∑_{e∈E} p(e) log p(e) and RE = −∑_{r∈R} p(r) log p(r), where p(e) is the fraction of entity occurrences accounted for by e and p(r) is the fraction of triples with relation r.
In addition, we calculate the absolute density of the graph, |T| / (|E| (|E| − 1)), i.e., the ratio of edges to the maximum number of edges possible in a simple directed graph.
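These measures can be computed directly from a triple list. A minimal sketch (the natural logarithm and the helper name are assumptions for illustration):

```python
import math
from collections import Counter

def kg_stats(triples):
    """Relational density, entity density, relation entropy, entity
    entropy, and absolute density for (head, relation, tail) triples."""
    rels = Counter(r for _, r, _ in triples)
    ents = Counter()
    for h, _, t in triples:
        ents[h] += 1
        ents[t] += 1
    n_t = len(triples)
    rd = n_t / len(rels)                  # RD = |T| / |R|
    ed = 2 * n_t / len(ents)              # ED = 2|T| / |E|
    re_ = -sum((c / n_t) * math.log(c / n_t) for c in rels.values())
    tot = sum(ents.values())              # = 2|T|
    ee = -sum((c / tot) * math.log(c / tot) for c in ents.values())
    density = n_t / (len(ents) * (len(ents) - 1))
    return rd, ed, re_, ee, density

stats = kg_stats([("a", "p", "b"), ("a", "p", "c"), ("b", "q", "c")])
```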
High RD and low RE typically lead to worse performance, while high ED and low EE often lead to better link prediction performance (e.g., ). In Table 7 we can see that the density and entropy values lie between those of YAGO3-10 and FB15k-237, which typically lead to worse and better predictive performance, respectively. This shows that TERA is both suitable background knowledge for extrapolating effect data and an interesting dataset for benchmarking state-of-the-art knowledge graph embedding models. Note that, according to RD, using the full TERA sub-KGs will be more challenging for prediction than using the reduced TERA fragments. Full details of the construction of the reduced fragments are given in Section 7.1.1.
6. Adverse biological effect prediction
The aim of chemical effect prediction is to extrapolate existing data to new combinations of (possibly unknown) chemicals and species. In this section we present three classification models used to predict the adverse biological effect of chemicals on species: (i) a multilayer perceptron (MLP) model (our baseline); (ii) the baseline model fed with pre-trained KG embeddings; and (iii) a model that simultaneously trains the baseline model and the KGE models (i.e., it fine-tunes the KG embeddings). An MLP was chosen as the baseline because it is a basic model to which additional components and penalties can easily be added and assessed, as we do in our third model (see Section 6.3).
The models have three inputs, namely a chemical c, a species s, and a chemical concentration κ. The output is a binary value that represents whether the chemical at the given concentration has a lethal effect on the species.
Notation. Throughout this section we use bold lower case letters to denote vectors, while matrices are denoted by bold upper case letters. The vector representations of an entity and a relation are denoted e and r, respectively. These vectors are either in ℝ^k or ℂ^k, where k is the embedding dimension.
Our baseline prediction model is a multilayer perceptron (MLP) with multiple hidden layers. Hidden layers are appended to the embedding of the chemical c, to the embedding of the species s, and to the real-valued chemical concentration κ. Thereafter, n hidden layers are appended to the concatenated output of the previous hidden layers. Specifically, the model can be expressed by the following equations (with c, s, and κ as input):
We differentiate between two settings of the baseline model (see Fig. 4):
(i) Simple setting. Figure 4a shows the model without embedding transformation layers.
(ii) Complex setting. The complex model shown in Fig. 4b introduces transformation layers on the embeddings and the chemical concentration input. These transformations aim to extract the important information from the inputs and to disregard redundant information.
In the experiments we refer to the baseline models as Simple one-hot and Complex one-hot, depending on the selected MLP setting.
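The simple setting can be sketched as a single forward pass in NumPy; the sizes, initialisation, and activation below are illustrative assumptions rather than the tuned configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; the real embedding dimensions and layer widths are tuned.
n_chem, n_spec, k, hidden = 5, 4, 8, 16

# A one-hot input times a trainable matrix is equivalent to a row
# lookup, so each row of W_c / W_s acts as an entity embedding.
W_c = rng.normal(size=(n_chem, k))
W_s = rng.normal(size=(n_spec, k))

# Hidden layer over [chemical ; species ; concentration], sigmoid output.
W_h = rng.normal(size=(2 * k + 1, hidden))
w_out = rng.normal(size=hidden)

def predict(c_idx, s_idx, kappa):
    """Probability that chemical c is lethal to species s at
    log-normalized concentration kappa."""
    x = np.concatenate([W_c[c_idx], W_s[s_idx], [kappa]])
    h = np.maximum(0.0, x @ W_h)          # ReLU hidden layer
    return float(1.0 / (1.0 + np.exp(-(h @ w_out))))

p = predict(2, 1, -0.5)
```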
6.2. Baseline model with pre-trained KG embeddings
This model relies on pre-trained embeddings of the chemicals and species computed using state-of-the-art KGE models (see Section 4.2 and the Appendix for an overview). A (different) KGE model can be applied to the chemicals and to the species.
These pre-trained KG embeddings are then given as input instead of the one-hot encoding vectors of the baseline model. We replace the trainable matrices in Equation (16) by matrices composed of the embeddings computed by the respective KGE models, stacking the embedding of each chemical and of each species into the rows of the corresponding matrix.
In the experiments we refer to these models as Simple PT and Complex PT, depending on the selected MLP setting, where PT stands for pre-trained and the suffix names the KGE models used for the chemicals KG and the species KG, respectively (e.g., Complex PT DistMult-HAKE). For simplicity, we also refer to these models as PT-based models.
6.3. Fine-tuning optimization model
This model improves upon the pre-trained KG embeddings by fine-tuning them on the effect prediction data. This is done by simultaneously training the (selected) KGE models and the MLP-based baseline model, such that the KG embeddings and the MLP weights (Equations (10), (11), (14) and (15)) are optimized jointly. Note that we initialize the KGE models with the previously pre-trained embeddings.
The model architecture is shown in Fig. 5, and the overall loss to minimize is a weighted combination of the effect prediction loss and the losses of the two KGE models.
Figure 5 shows the full simultaneous fine-tuning model and the optimization process. The initial state of the entity lookups is given by the pre-trained embeddings. The full training procedure is summarised as follows:
1. Extract subsets of triples from the chemical and species sub-KGs. 2. Generate negative knowledge graph triples from these subsets (see Appendix A.5 for details).
3. Feed the input forward through the model, calculate the loss for each model component, and combine them according to the loss weights.
4. Optimize the KG entity and relation embeddings, and the MLP layers.
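The combined objective in step 3 can be sketched as follows, using DistMult as the KGE scoring function and uniform loss weights (both assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
k = 8
E_c = rng.normal(size=(10, k), scale=0.1)   # chemical entity embeddings
R_c = rng.normal(size=(3, k), scale=0.1)    # chemical relation embeddings

def distmult_score(h, r, t):
    """DistMult triple score: <e_h, w_r, e_t>."""
    return float(np.sum(E_c[h] * R_c[r] * E_c[t]))

def kge_loss(pos, neg):
    """Logistic loss over positive and (generated) negative triples."""
    losses = [np.log1p(np.exp(-distmult_score(*t))) for t in pos]
    losses += [np.log1p(np.exp(distmult_score(*t))) for t in neg]
    return sum(losses) / len(losses)

def total_loss(l_mlp, l_kg_c, l_kg_s, a=(1.0, 1.0, 1.0)):
    """Weighted sum of the effect prediction loss and the two KGE
    losses; uniform weights are an assumption here."""
    return a[0] * l_mlp + a[1] * l_kg_c + a[2] * l_kg_s

l_c = kge_loss(pos=[(0, 0, 1), (2, 1, 3)], neg=[(0, 0, 4)])
L = total_loss(0.7, l_c, 0.5)
```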
In the experiments we refer to these models as Simple FT and Complex FT, depending on the selected MLP setting, where FT stands for fine-tuning and the suffix names the KGE models used for the chemicals KG and the species KG, respectively (e.g., Simple FT HAKE-HAKE). For simplicity, we also refer to these models as FT-based models.
7.1.1. Preparation of TERA for prediction
As shown earlier, TERA consists of three sub-KGs, which are the basis for chemical effect prediction. We process the sub-KGs further to limit their size by removing triples that are irrelevant for prediction. This is necessary to scale up the training of the KGE models. The reduction of TERA's sub-KGs is performed in the following steps:
(i) Effect data. For prediction purposes, the effect data is limited to four features, namely, chemical, species, chemical concentration, and effect. The chemical concentrations (κ, converted to mg/L) are log-normalized to remove the large discrepancy in scales. As mentioned, we separate the effects into two categories for simplicity, lethal and non-lethal. This reduces the possibility of ambiguity among the effects that do not cause death in the test species. We label lethal effects as 1 and non-lethal effects as 0.
(ii) Chemicals. For each chemical in the effect data, we extract all triples connected to it using a directed crawl. This reduces the chemical sub-KG to a manageable size for the KGE models. Moreover, we do not deem triples that are not directly connected to the effect data relevant for the prediction task, and they may introduce unnecessary noise. As mentioned before, PubChem contains similarities between chemicals based on chemical fingerprints; however, for our use case it is impractical to query them from the PubChem RDF data. Therefore, we calculate similarity triples based on queried PubChem fingerprints, using the same similarity threshold as PubChem, i.e., 0.9.
(iii) Species. The same steps as for the chemicals are conducted for all species in the effect data.
These steps reduce the sub-KGs to 241,442 and 59,673 triples, respectively. Statistics of the full and reduced sub-KGs are given in Table 7 (Section 5.4). In the rest of the paper we refer simply to TERA's reduced sub-KGs.
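The directed crawl used in steps (ii) and (iii) can be sketched as a breadth-first traversal along outgoing edges from the seed entities (the function name and toy triples are illustrative):

```python
from collections import deque

def directed_crawl(triples, seeds):
    """Return all triples reachable from the seed entities by
    following head -> tail edges (a directed BFS)."""
    out_edges = {}
    for h, r, t in triples:
        out_edges.setdefault(h, []).append((r, t))
    visited, kept = set(seeds), []
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for r, t in out_edges.get(node, []):
            kept.append((node, r, t))
            if t not in visited:
                visited.add(t)
                queue.append(t)
    return kept

kg = [("c1", "type", "pesticide"),
      ("pesticide", "subClassOf", "chemical"),
      ("c2", "type", "solvent")]
fragment = directed_crawl(kg, seeds=["c1"])
```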
The transformation from TERA's sub-KGs to model input is done by first dropping literals, thereafter assigning each entity a unique integer identifier, which corresponds to the index of a column vector in the embedding matrices of Equation (16), depending on which sub-KG is transformed. Relations are treated similarly.
We use four sampling strategies over the effect data to analyze how the proposed classification models behave when varying the data used for training and testing. Note that we only consider effect data where the chemical and species have mappings to external sources (e.g., NCBI Taxonomy and Wikidata, cf. Section 5.2.2), so that there is additional contextual information that can be used by the KGE models. For each strategy, the validation and test sets contain unseen chemical-organism pairs with respect to the training set. The strategies, however, differ with respect to the individual organisms and chemicals as follows:
Strategy (i) Random training/validation/test split on the entire dataset (i.e., the chemicals and the organisms in the validation and test sets will most probably be known).
Strategy (ii) Training/validation/test split with no overlap between the chemicals in the three sets (i.e., the chemicals in the validation and test sets are unknown). This resulted in a split.
Strategy (iii) Training/validation/test split with no overlap between the species in the three sets (i.e., the species in the validation and test sets are unknown). This resulted in a split.
Strategy (iv) Training/validation/test split with no chemical or species overlap across the three sets (i.e., both the chemicals and the organisms in the validation and test sets are unknown). This resulted in a split.
There were originally 57,560 samples; however, this includes experiment duplicates, i.e., the same chemical, species, and endpoint with different chemical concentrations. This is due to the large variance in laboratory testing; therefore, we use the median concentration across the duplicates. The prior probability is approximately the same across all sampling methods, with the majority of samples labelled as lethal. We address this imbalance during training by randomly oversampling the minority class until the prior probabilities in the training set are balanced. In this case, the oversampling is performed by adding duplicate samples labelled as non-lethal. Oversampling is a well established technique used in many classification problems to remove bias during learning.
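The random oversampling step can be sketched as follows (the function name and toy data are illustrative):

```python
import random

def oversample(samples, labels, seed=0):
    """Duplicate random minority-class samples until both classes
    are equally frequent in the training set."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    minority, majority = (neg, pos) if len(neg) < len(pos) else (pos, neg)
    minority_label = 0 if minority is neg else 1
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = [(s, 1) for s in pos] + [(s, 0) for s in neg]
    balanced += [(s, minority_label) for s in extra]
    return balanced

# Four lethal (1) samples and one non-lethal (0) sample:
data = oversample(["a", "b", "c", "d", "e"], [1, 1, 1, 1, 0])
```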
|KGE hyper-parameters||Search space|
|Margin (only hinge loss)|
|Bias (only geometric models)|
|Prediction hyper-parameters||Search space|
|(10), (11), (12), n (14)|
|# units (10), (11), (14)|
|# units (12)|
| Model    | Loss function | Margin | Bias    | Embedding dimension | Negative samples |
| DistMult | /             | – / 2  | –       | 143 / 383           | 28 / 43          |
| ComplEx  | /             | – / 4  | –       | 163 / 372           | 27 / 42          |
| HolE     | /             | 6 / –  | –       | 188 / 376           | 30 / 100         |
| TransE   | /             | 4 / 7  | 14 / 20 | 226 / 196           | 23 / 57          |
| RotatE   | /             | 5 / 2  | 16 / 6  | 271 / 398           | 75 / 22          |
| pRotatE  | /             | – / –  | 14 / 16 | 164 / 210           | 34 / 82          |
| HAKE     | /             | – / –  | 12 / 10 | 108 / 359           | 56 / 13          |
| ConvKB   | /             | – / 5  | –       | 248 / 276           | 18 / 90          |
| ConvE    | /             | 7 / 3  | –       | 228 / 196           | 68 / 40          |
To optimize the hyper-parameters of the KGE and classification models we use random search over the parameter ranges, conducting 20 trials per model. Tables 8 and 9 contain the best hyper-parameters and can be used to reproduce the top performing models.
To find the best hyper-parameters for the KGE models, we use the training loss as a proxy for performance, normalized by the initial loss, i.e., L_t / L_0, where L_t is the training loss at epoch t and L_0 is the loss with the initial weights.
We use validation loss to select the best hyper-parameter setting for the classification models presented in Section 6. The best prediction models are refitted and evaluated 10 times to reduce the influence of initial conditions on the metrics. The average and standard deviation of the metrics are presented in Section 7.2.
The hyper-parameter ranges for the KGE models, based on common values from the literature, are shown in Table 8. We conduct 20 trials with random hyper-parameter choices and validate on the validation data. Table 9 shows the best hyper-parameters.
|Complex PT DistMult-HAKE (top-1 in (i))||(i)|
|Complex PT HolE-ConvKB (top-1 in (ii))||(ii)|
|Complex PT HAKE-DistMult (top-1 in (iii), (iv))||(iii)|
We can see in Table 9 that the decomposition models have similar hyper-parameters for the two sub-KGs. As shown in Section 5.4, the major difference between them is the relational density. Therefore, it is reasonable to believe that a KG with lower relational density requires more parameters to reach an equivalent representation in the embedding space. We make the same observation for the geometric models, except for TransE, where the embedding dimensions are similar. ConvE needs a smaller embedding dimension than ConvKB; however, since ConvE is slightly more complex than ConvKB, this is expected. The difference in negative samples could be due to our implementation of ConvE, which varies from the original. Our implementation of all models relies on 1-to-1 scoring of triples, while the original implementation of ConvE used 1-to-N scoring, where N is the number of entities in the KG.
To save on intensive computation, the fine-tuning optimization model (Section 6.3) reuses the hyper-parameters found for the KGE models. Depending on the optimizer, the choice of loss weights is important. However, our optimizer uses dynamic learning rates per variable and will therefore adapt regardless of the loss weights, so we can set them uniformly. Had we used, e.g., stochastic gradient descent, these weights would have needed to be tuned.
7.1.4. Initialization of the fine-tuning optimization models
As presented in Section 6.3, we simultaneously train the KGE models and the MLP-based baseline model. This is done by initializing the model with (i) the weights learned by the corresponding baseline model with pre-trained embeddings, and (ii) the KG embeddings learned by the respective KGE models. For example, the Complex FT DistMult-HAKE model is initialized with the weights learned by the Complex PT DistMult-HAKE model and the KG embeddings pre-trained with the DistMult and HAKE models. The model is then further trained with a small learning rate; we found that reducing the learning rate by a factor of 100 worked well. Using this learning rate we optimize the model until convergence.
7.1.5. Simple and complex settings
As presented in Section 6.1, we use two settings in our classification models: simple and complex. This helps us isolate the effect of the KG embeddings from the power of the MLP model. The simple setting uses no branching layers (Equations (10), (11), (12) and (14)) and 128 units in the hidden dense layer. For the complex models we use random search (20 trials) to find the optimal number of layers and units within the ranges shown in Table 8. The optimal choices for the top performing models (using one-hot and pre-trained embeddings) are shown in Table 10.
Looking at the layer configurations of the one-hot models in Table 10, we can see that complexity increases from the simplest sampling strategy (i.e., (i)) to the most challenging one (i.e., (iv)). The same can be seen for PT HAKE-DistMult from strategy (iii) to (iv), where the number of layers increases. Overall, the layer configurations of the chemical branch are more complex than those of the species branch. This indicates that the KGE models are better at representing the species sub-KG than the chemical sub-KG.
In this section we present a summary of the chemical effect prediction evaluation. Complete results are available in the project repository. The default decision threshold is set to 0.5; that is, if a model predicts a value above 0.5 for an input, then the chemical c is considered lethal to the species s at concentration κ.
We use several metrics to compare the different prediction models: Sensitivity (i.e., recall), Specificity, and Youden's index (J). Precision and F-score were also considered; however, they were not representative of the performance with respect to non-harmful chemicals. This is attributed to the larger number of positive samples (i.e., harmful chemicals) than negative samples (i.e., non-harmful chemicals) in the test data.
Sensitivity and Specificity are defined as Sensitivity = TP / (TP + FN) and Specificity = TN / (TN + FP).
In our setting, sensitivity measures how well the models identify harmful chemicals, while specificity measures the models' ability to identify non-harmful chemicals. Youden's index, J = Sensitivity + Specificity − 1, is used to capture the usefulness of a diagnostic test (or, in our case, a toxicity test). A useless test will have J = 0, while a test with J > 0 is useful. J can also be thought of as how well informed a decision might be. Note that J can be less than 0, but this is solved by swapping the labelled classes, similarly to how a negative correlation is still useful.
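The three metrics follow directly from the confusion-matrix counts; a minimal sketch:

```python
def sensitivity(tp, fn):
    """Recall on the lethal class: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Recall on the non-lethal class: TN / (TN + FP)."""
    return tn / (tn + fp)

def youden(tp, fn, tn, fp):
    """Youden's index J = sensitivity + specificity - 1."""
    return sensitivity(tp, fn) + specificity(tn, fp) - 1.0

# A test that favours sensitivity over specificity:
j = youden(tp=80, fn=20, tn=30, fp=70)
```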
|Simple PT HAKE-HAKE|
|Simple PT pRotatE-HAKE|
|Simple PT ConvE-HAKE|
|Simple PT pRotatE-ConvE|
|Simple PT RotatE-ConvE|
|Simple FT HAKE-HAKE|
|Simple FT pRotatE-HAKE|
|Simple FT ConvE-HAKE|
|Simple FT pRotatE-ConvE|
|Simple FT RotatE-ConvE|
|Complex PT DistMult-HAKE|
|Complex PT HAKE-ConvKB|
|Complex PT HolE-ConvKB|
|Complex PT ComplEx-DistMult|
|Complex PT HolE-pRotatE|
|Complex FT DistMult-HAKE|
|Complex FT HAKE-ConvKB|
|Complex FT HolE-ConvKB|
|Complex FT ComplEx-DistMult|
|Complex FT HolE-pRotatE|
|Simple PT HAKE-ConvKB|
|Simple PT HAKE-HAKE|
|Simple PT pRotatE-HAKE|
|Simple PT RotatE-ConvKB|
|Simple PT RotatE-ConvE|
|Simple FT HAKE-ConvKB|
|Simple FT HAKE-HAKE|
|Simple FT pRotatE-HAKE|
|Simple FT RotatE-ConvKB|
|Simple FT RotatE-ConvE|
|Complex PT HolE-ConvKB|
|Complex PT pRotatE-ConvKB|
|Complex PT TransE-ConvKB|
|Complex PT ComplEx-ConvE|
|Complex PT ConvKB-pRotatE|
|Complex FT HolE-ConvKB|
|Complex FT pRotatE-ConvKB|
|Complex FT TransE-ConvKB|
|Complex FT ComplEx-ConvE|
|Complex FT ConvKB-pRotatE|
|Simple PT ConvKB-DistMult|
|Simple PT HAKE-DistMult|
|Simple PT ConvKB-TransE|
|Simple PT ConvE-RotatE|
|Simple PT HolE-HAKE|
|Simple FT ConvKB-DistMult|
|Simple FT HAKE-DistMult|
|Simple FT ConvKB-TransE|
|Simple FT ConvE-RotatE|
|Simple FT HolE-HAKE|
|Complex PT HAKE-DistMult|
|Complex PT pRotatE-ComplEx|
|Complex PT ConvKB-DistMult|
|Complex PT ComplEx-HolE|
|Complex PT ComplEx-HAKE|
|Complex FT HAKE-DistMult|
|Complex FT pRotatE-ComplEx|
|Complex FT ConvKB-DistMult|
|Complex FT ComplEx-HolE|
|Complex FT ComplEx-HAKE|
|Simple PT HAKE-ComplEx|
|Simple PT pRotatE-ComplEx|
|Simple PT HolE-ComplEx|
|Simple PT pRotatE-RotatE|
|Simple PT HAKE-HAKE|
|Simple FT HAKE-ComplEx|
|Simple FT pRotatE-ComplEx|
|Simple FT HolE-ComplEx|
|Simple FT pRotatE-RotatE|
|Simple FT HAKE-HAKE|
|Complex PT HAKE-DistMult|
|Complex PT HolE-DistMult|
|Complex PT ConvKB-DistMult|
|Complex PT HolE-RotatE|
|Complex PT TransE-HAKE|
|Complex FT HAKE-DistMult|
|Complex FT HolE-DistMult|
|Complex FT ConvKB-DistMult|
|Complex FT HolE-RotatE|
|Complex FT TransE-HAKE|
Tables 11–14 show the results for each of the data sampling strategies (i)–(iv), respectively. The tables include the three best models (based on J) for the baseline model using one-hot and pre-trained (PT) KG embeddings, and the fine-tuning (FT) models using the same combinations of KGE models as the selected PT-based models. We have also included a model with middling performance (i.e., 40 out of 81 models) and the worst performing model. Note that for the PT- and FT-based models we have evaluated 81 combinations of KGE models. All models were evaluated using the simple and complex MLP settings. For example, Complex FT DistMult-HolE denotes that fine-tuning was used together with the complex MLP setting, with DistMult selected to embed the chemicals and HolE to embed the species. We present the mean and standard deviation over 10 evaluation runs, i.e., we re-initialize and re-train the models 10 times. Results highlighted in bold are the best mean results for the corresponding metrics. Underlined results are those where a single run may outperform the best mean (i.e., one standard deviation contains about 68% of results, assuming normally distributed results).
Overall, models with the complex setting and fine-tuning are needed as the data sampling strategies become more challenging. Moreover, all models favour sensitivity over specificity at the default decision threshold (0.5). This is due to the imbalance in the data, visible in the gap between sensitivity and specificity for most models. As we use a log-loss instead of a discrete loss, this is to be expected for imbalanced data.
For strategies (iii) and (iv) the performance drops and the standard deviation increases compared to the other strategies. This large standard deviation leads to large overlaps in quantiles among the top-3 models in all categories, such that, by chance, any of these models could perform best in an individual evaluation.
7.2.1. One-hot baseline models
For sampling strategy (i) the one-hot baseline models perform well, especially the complex one-hot model, which is equivalent in terms of J to the best simple pre-trained model. The picture is largely the same in strategy (ii), where the complex one-hot model performs close to the best simple pre-trained models. With strategies (iii) and (iv) the one-hot models degrade, especially in strategy (iv), where Youden's index is near zero. This is expected, as the one-hot baseline models lack the important background information about the entities, especially for unseen chemicals and species, that the KG embedding models aim to capture.
7.2.2. Baseline with pre-trained KG embeddings
We can see that the PT-based models do not lead to a substantial improvement in sampling strategy (i). The top-1 complex PT model, however, yields a better balance between sensitivity and specificity, leading to an improved J over the complex one-hot models. The two middling performing models, Simple PT pRotatE-ConvE and Complex PT ComplEx-DistMult, still retain a decent level of performance.
The results for strategy (ii) are similar to those for strategy (i); the delta in J between the simple and the complex PT-based models is small. This slight improvement is due to the increased balance between sensitivity and specificity, which in turn leads to a higher J.
In sampling strategy (iii) the improvement of the PT-based models over the one-hot models increases. In addition, we observe in this strategy that the standard deviation increases, especially in specificity, leading to a large portion of the models being within one standard deviation of the best model in terms of J.
Finally, the impact of using the PT-based models is strengthened in strategy (iv), where the delta between the one-hot and PT-based models is largest. We see that all models struggle with specificity in this setting, which is due to the difficulty of predicting true negatives. This also leads to a larger variation, with certain models yielding a standard deviation of the same order of magnitude as the metric (e.g., Simple FT HAKE-ComplEx).
7.2.3. Fine-tuning optimization model
The FT-based models, with some exceptions, improve the results over the PT-based models, most notably in sampling strategies (iii) and (iv). For example, the FT-based models Complex FT HolE-DistMult and Simple FT HolE-ComplEx are the best models in strategy (iv). We can also see in strategies (i) and (ii) that fine-tuning improves the middling and worst performing PT-based models, e.g., Simple FT RotatE-ConvE in strategy (i) improves over its PT counterpart. These results are expected, as the fine-tuned KG embeddings are tailored to the effect prediction task.
7.3. KG embedding analysis
In this section we look at correlations between KGE model choices and prediction performance. KGE models are designed to capture certain structures in the data, and this can give some explanation of which parts of the KGs are important for prediction.
First, Table 15 shows how many times each KGE model is used among the top 10 performing combinations (out of the 81 possible). We focus on the choices under the simple MLP setting to reduce the influence of the non-linear transformations on the embeddings.
|KGE model||# uses (i)||# uses (ii)||# uses (iii)||# uses (iv)|
Looking at Table 15, we can see that the KGE models used to embed the chemicals in the best performing models are distributed fairly evenly across most models and settings. This indicates that the performance of the prediction models is not highly correlated with the choice of KGE model for the chemical sub-KG. Referencing Table 7, the high relational density of the chemical sub-KG can contribute to worse performance and, therefore, to the even distribution of models in Table 15. This is different for the species sub-KG. For sampling strategies (i) and (ii), HAKE is extensively used in the top models to embed the species. HAKE is designed to embed hierarchies; this indicates that in strategies (i) and (ii) the hierarchical structure dwarfs the rest of the KG. The species sub-KG has a higher entity density and lower entity entropy (Table 7) than the chemical sub-KG. This should generally lead to higher performance, but might also lead to larger discrepancies between models, as seen in Table 15.
The use of the decomposition models increases in strategies (iii) and (iv) for embedding the species, which indicates that KG structures other than the hierarchy become important. Overall, DistMult and ComplEx can be used to great effect in strategies (iii) and (iv), while the geometric model HAKE is more successful in the less challenging strategies (i) and (ii).
Explained variance measures how much of the total variance in the embeddings is captured by a given number of principal components. In Fig. 6, we present how J depends on the explained variance of the top-10 principal components. We show all (81 per sampling strategy) PT-based prediction model results, with the simple MLP setting in Fig. 6a and the complex setting in Fig. 6b. For example, in Fig. 6a, the best model in strategy (iv), Simple PT pRotatE-ComplEx, has an explained variance of 0.49 compared to the worst model, Simple PT HAKE-HAKE, with an explained variance of 0.34. Coincidentally, these two points do not follow the trend lines in these figures, which indicate a negative correlation between J and explained variance. The trend lines can be interpreted in two ways. On the one hand, the negative correlation is counter-intuitive, as we would expect more descriptive embeddings, i.e., those with larger explained variance, to perform better. On the other hand, the top-10 principal components may not be representative enough to capture the semantics of the KG embeddings, and thus a large explained variance does not necessarily correlate with high performance.
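The explained variance of the top-k principal components can be computed from an embedding matrix via an eigendecomposition of its covariance; a NumPy sketch (the function name is illustrative):

```python
import numpy as np

def explained_variance_topk(embeddings, k=10):
    """Fraction of total variance captured by the top-k principal
    components of an (n_entities x dim) embedding matrix."""
    X = embeddings - embeddings.mean(axis=0)
    cov = (X.T @ X) / (len(X) - 1)
    eigvals = np.linalg.eigvalsh(cov)[::-1]     # descending order
    return float(eigvals[:k].sum() / eigvals.sum())

rng = np.random.default_rng(0)
E = rng.normal(size=(200, 32))                  # toy embedding matrix
ratio = explained_variance_topk(E, k=10)
```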
Figure 7 plots sensitivity against explained variance. We can see that the trend is flat for strategy (iv), but positive for strategies (i)-(iii). This means that the trends in Fig. 6 are explained by specificity rather than sensitivity. When sensitivity and specificity are balanced, as seen in Fig. 8, the rate of change is reduced compared to Fig. 6.
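As a concrete illustration of the diagnostic used in Figs. 6-8, the explained variance of the top-k principal components of an embedding matrix can be computed directly from a singular value decomposition; a minimal numpy sketch (the function name and the synthetic data are ours, not from the paper):

```python
import numpy as np

def top_k_explained_variance(embeddings: np.ndarray, k: int = 10) -> float:
    """Fraction of the total variance captured by the top-k principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Singular values relate to per-component variance: var_i = s_i^2 / (n - 1),
    # and the common (n - 1) factor cancels in the ratio below.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return float(variances[:k].sum() / variances.sum())

# Synthetic embeddings that vary almost entirely along one direction
# should give a ratio close to 1 already for k = 1.
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 64)) * 0.01
emb[:, 0] += rng.normal(size=200) * 10.0  # one dominant component
ratio = top_k_explained_variance(emb, k=1)
```

With k equal to the embedding dimension the ratio is 1 by construction, so only small k values are informative as a comparison between models.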
Table 16 shows a few examples of correct (TP and TN) and incorrect predictions (FN and FP).
| Chemical | Species | log concentration | Predicted probability | Actual effect | Outcome |
|---|---|---|---|---|---|
| D001556 (hexachlorocyclohexane) | 59899 (walking catfish) | −3.4 | 0.97 | 1 (yes) | TP |
| C037925 (benthiocarb) | 7965 (sea urchins) | 0.9 | 0.2 | 0 (no) | TN |
| D026023 (permethrin) | 378420 (bivalves) | 0.7 | 0.96 | 1 (yes) | TP |
| D011189 (potassium chloride) | 938113 (megacyclops viridis) | 6.7 | 0.27 | 1 (yes) | FN |
| C427526 (carfentrazone-ethyl) | 208866 (eudicots) | −0.9 | 0.82 | 0 (no) | FP |
| D010278 (parathion) | 201691 (green sunfish) | −0.9 | 0.86 | 0 (no) | FP |
Benthiocarb and permethrin are both biocides with different targets: benthiocarb is a herbicide and permethrin is an insecticide. It is therefore not surprising that benthiocarb has a low predicted effect on sea urchins, while permethrin has a severe effect on bivalves.
There are several possible explanations for the failed predictions. A wrong prediction of potassium chloride toxicity to a marine copepod (Megacyclops viridis) could be due to the prediction model not being accurate enough for metal salts, or the copepod species being particularly sensitive to changes in osmolarity due to salt content. The wrong prediction of lack of herbicide toxicity (i.e., carfentrazone-ethyl) to a flower (i.e., eudicots) could be due to the fact that flowers, and plants in general, are severely underrepresented in the available effect prediction data.
We have introduced the Toxicological Effect and Risk Assessment (TERA) knowledge graph and shown how we can directly use it in chemical effect prediction. The use of TERA improves the PT-based prediction models over the one-hot baselines. In the most challenging data sampling strategies, we have also seen the benefits of creating tailored (i.e., fine-tuned) KG embeddings in the FT-based prediction models.
8.1.TERA knowledge graph
The constructed knowledge graph consists of several sources from the ecotoxicological domain. There are three major parts in TERA: the effects data, the chemical data, and the species taxonomic data. Integrating each part poses different challenges. The chemical and pharmacological communities have come a long way in annotating their data as knowledge graphs and ontologies. Here, selecting the correct subsets to work with the chemical effect prediction data was a major challenge. This had to be done based on mappings between effect data and chemical data that were extracted from Wikidata. We selected a relatively small subset of the chemical sub-KG to facilitate faster model training, which was, however, still larger than the fragment extracted from the species sub-KG. The species sub-KG was created from tabular data and cleaned by removing several annotation labels with redundant information. This sub-KG was aligned, using ontology alignment systems, to the species taxonomy in the effects sub-KG. This required pre-processing of the KG, dividing it into smaller parts such that the selected systems could perform the alignment. We used several standard ontologies to facilitate the transformation of the effect data into a knowledge graph. This involved not only automatic processes, but also a substantial amount of manual work.
Integrating more data into TERA involves the creation of mappings to the existing data. This is possible for a large number of chemical datasets, as Wikidata links multiple datasets; e.g., the chemical compound diethyltoluamide (wd:Q408389) has distinct identifiers. Biological data, both taxonomic and effects, might be harder to align to TERA as these mappings are not available in Wikidata. Here, ontology alignment systems play an important role in filling this gap.
The additional integrated data will give larger coverage of the domain, and thereby, improve model performance. However, adding more data will also increase the memory and time requirements of KGE models. This was bypassed in this work by reducing TERA to only relevant parts.
Adding additional domain knowledge is also critical in other applications, such as using TERA for data access.
8.2.Performance of prediction models
We have shown that the ability of the different KGE models to embed certain structure types largely impacts the prediction models. We see that some KGE models fail to capture the semantics of the chemicals and the species, which leads to performance similar to the one-hot baselines. Moreover, in a few isolated cases the performance is reduced further, which leads us to believe that the embeddings collapse in one or more dimensions, making it impossible to distinguish among entities.
We suspect that the even distribution of KGE models used to embed the chemicals (Table 15) in most settings is likely down to the structure of the chemical sub-KG. This sub-KG has, unlike the tree structure of the species sub-KG, a forest structure, and models that can deal with trees (as in the species sub-KG) fail here; e.g., an entity in the chemical sub-KG can have multiple parents, but only one grand-parent. In this case, some models may create very similar or identical embeddings for the parent nodes.
9.Conclusions and future work
TERA is a novel knowledge graph which includes large amounts of data required by ecological risk assessment. We have conducted an extensive evaluation of KGE models in a novel and very challenging application domain. Moreover, we have shown the value of using TERA in an ecotoxicological effect prediction task. The fine-tuning architecture that adapts the KG embeddings to the prediction task has, to our knowledge, not been applied elsewhere.
9.1.Value for the ecotoxicology community
The creation of TERA is of great importance to future effect modelling and computational risk assessment approaches within ecotoxicology, where the strategic goal is to design and develop prediction models that assess the hazards and risks of chemicals and their mixtures in cases where traditional laboratory data cannot easily be acquired.
A great effort in the hazard and risk assessment of chemicals is the reduction of regulatory-mandated animal testing. Wide-scale predictive approaches, as described here, answer a direct and current need for generalized prediction frameworks. These can aid in identifying especially sensitive species and toxic chemicals. At the Norwegian Institute for Water Research (NIVA), TERA will be used in this regard and will support several research projects.
In environmental risk assessment it is often unfeasible to assess the hazard and risk a chemical poses to a local species in the environment. These species may not be suitable for lab testing, or may even be endangered and thus are protected by national or international legislation. The currently presented work provides an in silico approach to predict the hazard to such species based on the taxonomic position of the species within the tree of life.
From an economic perspective, TERA and the prediction models are useful tools to evaluate new industrial chemicals during the synthetic in silico stage. Candidate chemicals can be evaluated for their potential environmental hazard, which is in line with the Green Chemistry initiatives by authorities such as the European Parliament or the US Environmental Protection Agency.
The effect prediction using TERA is also in line with a larger shift in ecological risk assessment towards the use of artificial intelligence . We also believe the development of TERA contributes to a methodological change in the community, and encourages others to make their data interoperable.
9.2.TERA as background knowledge
As mentioned, in this work we use TERA directly in prediction models. However, TERA could be used as background knowledge to improve many emerging techniques for toxicity prediction (e.g., ). These methods often use chemical features, images, fingerprints and so on as input, and machine learning methods such as Convolutional Neural Networks and Random Forests as prediction models [81,84]. These models are often uninterpretable, and the predictions lack domain explanations. TERA can also provide context for machine learning tasks such as pre-processing, feature extraction, transfer and zero/few-shot learning. Furthermore, the knowledge graph is a possible source for the (semantic) explanation of the predictions (e.g., ).
9.3.Benchmarking KG embedding models
We have shown that embedding TERA brings new challenges to state-of-the-art KGE models with respect to capturing the semantics of the chemicals and the species. Furthermore, as shown in Section 5.4, the sparsity-related measures indicate that TERA represents an interesting KG. KGE models could be benchmarked on TERA in a standard KG completion task or in a specific task such as chemical effect prediction.
9.4.Value to the ontology alignment community
As mentioned in Section 5.2, there exists no complete and public alignment between ECOTOX species and the NCBI Taxonomy. Therefore, the computed mappings can also be seen as a very relevant resource for the ecotoxicology community. The alignment techniques used achieve high recall over the available (incomplete) reference mappings. However, aligning such large and challenging datasets requires pre-processing before ontology alignment systems can cope with them. We removed all nodes which did not share a word (or shared only a stop word) in their labels across the two taxonomies. This quartered the size of ECOTOX and reduced the NCBI Taxonomy 50-fold. However, the possible alignment between entities without labels is lost when reducing the dataset size. Thus, the alignment of ECOTOX and the NCBI Taxonomy has the potential of becoming a new track of the Ontology Alignment Evaluation Initiative (OAEI)  to push the limits of large-scale ontology alignment tools. Furthermore, the output of the different OAEI participants could be merged into a rich consensus alignment (e.g., as done in the phenotype-disease domain ) that could become the reference alignment to integrate ECOTOX and the NCBI Taxonomy.
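The label-based pre-processing described above can be sketched as a simple blocking step; the labels and stop words below are illustrative, not the actual taxonomies or the exact word lists used in the paper:

```python
STOP_WORDS = {"sp", "var", "subsp"}  # illustrative stop words

def informative_words(label: str) -> set:
    """Lower-cased words in a label, minus stop words."""
    return {w for w in label.lower().split() if w not in STOP_WORDS}

def blocking_filter(source_labels, target_labels):
    """Keep only source entities that share at least one informative
    word with some label on the target side."""
    target_words = set().union(*(informative_words(l) for l in target_labels))
    return [l for l in source_labels if informative_words(l) & target_words]

ecotox = ["Danio rerio", "Daphnia magna", "Unknown sp"]
ncbi = ["Danio rerio", "Cyprinus carpio", "Daphnia pulex"]
kept = blocking_filter(ecotox, ncbi)
```

Entities whose labels share no informative word with the other taxonomy ("Unknown sp" above) are discarded before the alignment systems run, which is exactly why alignments involving unlabelled entities cannot be recovered.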
We plan to extend TERA to include a larger part of ChEBI (which ChEMBL is a part of). ChEBI includes relevant data on the interaction between chemicals and species at a cellular level, which may be very important for chemical effect prediction. In this work we only consider effect data from ECOTOX as this is the largest data set available, however, the inclusion of e.g., TOXCAST  is in our interest. New sources will always bring more coverage of the domain and will improve TERA for prediction, as background knowledge, and for data access.
We plan to evaluate the effect prediction under different parts of TERA, i.e., determine which sources in TERA provide value in terms of the effect prediction and which do not. A similar effort exploring different KG crawling techniques is found in . In a similar vein, we plan to evaluate how the materialization, via OWL reasoning, of TERA's implicit triples affects prediction performance.
Finally, as mentioned already, some KGE models cannot deal with parts of the structure of TERA. An in-depth analysis of this is an interesting direction for future research. This could be solved by embedding the hierarchy separately, e.g., , or imposing restrictions on the embeddings, such as a minimum distance constraint.
We encourage feedback from domain researchers on extensions to TERA and associated tools.
A snapshot of TERA is available at5). However, we include the full and .
All the material related to this project is available at
Source codes to create TERA are available in the TERA GitHub repository. The prediction models and data used for prediction can be found in the KGs_and_Effect_Prediction_2020 GitHub repository. The prediction models require the implementation of the KGE models from the KGE-Keras GitHub repository.
1 Not to be confused with SPARQL endpoint.
2 RDF, RDFS, OWL and SPARQL are standards defined by the W3C: https://www.w3.org/standards/semanticweb/.
3 is the set of all classes and instances, is the set of all properties, while represents the set of all literal values.
5 For the embedding process, we focus on triples where is a class or an instance.
6 We refer the interested reader to  for a comprehensive survey.
7 The mode of action describes the molecular pathway by which a chemical causes physiological change in an organism.
8 NIVA: https://www.niva.no/en.
9 Measure of the absence of attraction to water.
10 Resources to create and access TERA: https://github.com/NIVA-Knowledge-Graph/TERA.
11 EOL: Various Creative commons (CC), NCBI: Creative Commons CC0 1.0 Universal (CC0 1.0), ECOTOX: No restrictions, PubChem: Open Data Commons Open Database License, ChEMBL: CC Attribution, MeSH: Open, Courtesy of the U.S. National Library of Medicine, Wikidata: CC0 1.0.
15 Prefixes associated to the URI namespaces of entities in TERA: et: (ECOTOXicology knowledgebase), ncbi: (NCBI taxonomy), eol: (Encyclopedia of Life), mesh: (Medical Subject Heading), compound: (PubChem compound), descr: (PubChem descriptors), vocab: (PubChem vocabulary), inchikey: (InChIKey identifiers), envo: (Environment Ontology) cheminf: (Chemical information ontology), chembl: (ChEMBL), chembl_m: (ChEMBL molecule subset), chembl_t: (ChEMBL target subset), wd: (WikiData entities), wdt: (Wikidata properties), qudt: (Quantities, Units, Dimensions and Types Catalog), snomedct: (SNOMED CT ontology), and bp: (Biological PAthway eXchange ontology). owl:, rdfs:, rdf: and xsd: are prefixes referring to W3C standard vocabularies.
16 Version dated Sep. 15, 2020.
17 While InChI is unique, InChIKey is not, and collisions have greater than zero probability .
18 In the context of the paper “taxonomy” typically refers to a classification of organisms.
19 As defined by U.S. EPA. Note that species hierarchies are contested among researchers.
20 QUDT 1.1: http://linkedmodel.org/catalog/qudt/1.1/
21 There are a total of 27,133 and 2,246,074 taxa in ECOTOX and NCBI, respectively. However, we focus on species, i.e., instances.
22 ECOTOX interface: https://cfpub.epa.gov/ecotox/search.cfm.
23 There is no need for more complex mappings in this use case.
24 Wikidata endpoint: https://query.wikidata.org/sparql.
25 Default value used in PubChem .
26 Predefined queries are typically abstractions of SPARQL queries.
28 If effect is mortality (e.g., see Table 4).
29 , where if c is the ith chemical in , else 0. is defined similarly.
30 Appendix A.5 introduces the loss functions used in this work. The loss function for a KGE model is selected via a hyper-parameter.
31 Section 7.1 describes how the known effect data extracted from ECOTOX is split into training, validation and test sets.
33 All data used to create TERA was downloaded on the 14th of May 2020.
34 for and for .
36 We set the decision threshold since the model output bias (cf. Equation (15)) will be (close to) 0.5 after training. Recall that we have oversampled the classes to reach a prior probability during training (cf. Section 7.1.2).
37 Note that we only consider the best mean result and not the standard deviation in both directions.
This work is supported by the grant 272414 from the Research Council of Norway (RCN), the MixRisk project (Research Council of Norway, project 268294), SIRIUS Centre for Scalable Data Access (Research Council of Norway, project 237889), Samsung Research UK, Siemens AG, and the EPSRC projects AnaLOG (EP/P025943/1), OASIS (EP/S032347/1), UK FIRES (EP/S019111/1) and the AIDA project (Alan Turing Institute).
Appendix. Knowledge graph embedding models
In this work, we use 9 KGE models from three major categories: decomposition models, geometric models, and convolutional models. We refer the interested reader to  for a comprehensive survey.
Throughout this section we use bold letters to denote vectors, while matrices are denoted as M. Common notation for all KGE models: $\lVert \cdot \rVert_n$ denotes the n-th norm, $\langle \mathbf{x}, \mathbf{y} \rangle$ the inner product (dot product) of x and y, $[\mathbf{x}; \mathbf{y}]$ the concatenation of x and y, $\hat{\mathbf{x}}$ the reshaping of a one-dimensional vector into a two-dimensional image (except in HolE, where it represents the complex conjugate), and $\mathrm{vec}(\mathbf{M})$ the reshaping of a matrix into a one-dimensional vector.
The vector representations of an entity and a relation are denoted $\mathbf{e}$ and $\mathbf{r}$, respectively. These vectors lie in either $\mathbb{R}^k$ or $\mathbb{C}^k$, where k is the embedding dimension.
DistMult. Developed by  and shown to have state-of-the-art performance on link prediction tasks under optimal hyper-parameters . This model scores a triple as the sum over the Hadamard (element-wise) product of the vectors representing the subject, predicate, and object of a triple:
$$f(s, p, o) = \langle \mathbf{e}_s, \mathbf{r}_p, \mathbf{e}_o \rangle = \sum_{i=1}^{k} (\mathbf{e}_s)_i (\mathbf{r}_p)_i (\mathbf{e}_o)_i$$
ComplEx. This model uses the same scoring function as DistMult . However, the entity and relation representations are in complex space ($\mathbb{C}^k$) and the object embedding is conjugated, which resolves DistMult's lack of directionality (its score is symmetric in subject and object):
$$f(s, p, o) = \mathrm{Re}(\langle \mathbf{e}_s, \mathbf{r}_p, \overline{\mathbf{e}_o} \rangle)$$
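Both decomposition scores are short enough to state in code; a numpy sketch (variable names are ours) that also demonstrates DistMult's symmetry in subject and object, which ComplEx avoids:

```python
import numpy as np

def distmult_score(e_s, r_p, e_o):
    """Sum over the element-wise (Hadamard) product of the three vectors."""
    return np.sum(e_s * r_p * e_o)

def complex_score(e_s, r_p, e_o):
    """DistMult's form in complex space; conjugating the object embedding
    makes the score direction-aware."""
    return np.real(np.sum(e_s * r_p * np.conj(e_o)))

# Random complex embeddings of dimension 8 (illustrative values).
rng = np.random.default_rng(1)
s = rng.normal(size=8) + 1j * rng.normal(size=8)
p = rng.normal(size=8) + 1j * rng.normal(size=8)
o = rng.normal(size=8) + 1j * rng.normal(size=8)
```

Swapping subject and object leaves the DistMult score unchanged on real vectors, while the ComplEx score generally differs.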
HolE. The Holographic embedding model is described in , and uses a circular correlation scoring function:
$$f(s, p, o) = \mathbf{r}_p^{\top} (\mathbf{e}_s \star \mathbf{e}_o)$$
where $\star$ denotes circular correlation.
TransE. The translational model has the scoring function
$$f(s, p, o) = -\lVert \mathbf{e}_s + \mathbf{r}_p - \mathbf{e}_o \rVert_n$$
RotatE. This model is inspired by Euler's identity ($e^{i\theta} = \cos\theta + i\sin\theta$) and scores triples by treating the relation embedding as a rotation of the subject embedding in complex space. RotatE has been shown to be capable of modelling symmetric, inverse and composite relations . The scoring function of RotatE is defined as
$$f(s, p, o) = -\lVert \mathbf{e}_s \circ \mathbf{r}_p - \mathbf{e}_o \rVert$$
where $\circ$ is the Hadamard product and each component of $\mathbf{r}_p$ has unit modulus, i.e., $|(\mathbf{r}_p)_i| = 1$.
pRotatE. This model is described as a baseline for RotatE, enabling a comparison between including modulus information in the model and limiting it to phase information only . pRotatE has the scoring function
$$f(s, p, o) = -2C \, \lVert \sin((\boldsymbol{\theta}_s + \boldsymbol{\theta}_p - \boldsymbol{\theta}_o)/2) \rVert_1$$
where $\boldsymbol{\theta}$ denotes the phase vectors and C is a fixed modulus shared by all entities.
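The translational and rotational scores can be sketched in the same style; a perfect translation or rotation attains the maximal score of zero (function names and example values are ours):

```python
import numpy as np

def transe_score(e_s, r_p, e_o, norm=1):
    """Negative distance between the translated subject and the object."""
    return -np.linalg.norm(e_s + r_p - e_o, ord=norm)

def rotate_score(e_s, theta_p, e_o):
    """Rotate each (complex) dimension of the subject by the relation's
    phase angle and measure the distance to the object."""
    rotation = np.exp(1j * theta_p)  # unit modulus in every dimension
    return -np.linalg.norm(e_s * rotation - e_o, ord=1)

# TransE: the object that exactly equals subject + relation scores 0.
e_s = np.array([0.2, -0.5, 1.0])
r_p = np.array([0.3, 0.5, -1.0])
perfect = transe_score(e_s, r_p, e_s + r_p)

# RotatE: an object that is exactly the rotated subject also scores 0.
theta_p = np.array([0.1, 0.2, 0.3])
z_s = np.array([1.0 + 0j, 0.5j, -1.0 + 0.5j])
z_o = z_s * np.exp(1j * theta_p)
```

Any deviation from the ideal translation or rotation makes the score strictly negative, which is what the margin-based losses below exploit.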
HAKE. The hierarchy-aware model uses both the modulus and the phase part of the embedding vectors , such that entities at the same level in the hierarchy are modelled using rotation, i.e., phase, and entities at different levels are modelled using the distance from the origin, i.e., modulus. Therefore, the scoring function of HAKE is modelled in two parts:
$$d_m(s, p, o) = \lVert \mathbf{e}_{s,m} \circ \mathbf{r}_{p,m} - \mathbf{e}_{o,m} \rVert_2, \qquad d_{ph}(s, p, o) = \lVert \sin((\boldsymbol{\theta}_s + \boldsymbol{\theta}_p - \boldsymbol{\theta}_o)/2) \rVert_1$$
combined as $f(s, p, o) = -(d_m(s, p, o) + \lambda \, d_{ph}(s, p, o))$, where $\lambda$ is a learned weighting parameter.
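HAKE's two-part distance can be sketched likewise; the λ weight and the example values below are illustrative:

```python
import numpy as np

def hake_score(mod_s, mod_p, mod_o, ph_s, ph_p, ph_o, lam=0.5):
    """HAKE's two-part distance: the modulus part separates hierarchy
    levels, the phase part separates entities on the same level."""
    modulus_part = np.linalg.norm(mod_s * mod_p - mod_o, ord=2)
    phase_part = np.linalg.norm(np.sin((ph_s + ph_p - ph_o) / 2.0), ord=1)
    return -(modulus_part + lam * phase_part)

# A child one hierarchy level below its parent: the relation scales the
# moduli exactly and leaves the phases untouched, giving the maximal score 0.
parent_mod = np.array([1.0, 2.0])
rel_mod = np.array([2.0, 0.5])
child_mod = parent_mod * rel_mod
zero_phase = np.zeros(2)
best = hake_score(parent_mod, rel_mod, child_mod, zero_phase, zero_phase, zero_phase)
```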
The final set of models used in this work are convolutional models. Convolution between an image X and filters $\omega$ is denoted $X * \omega$. The models also use dense layers, denoted by transformation matrices, e.g., W; note that these also include bias terms, even though we do not state them explicitly. Moreover, dropout layers are used between every convolutional and dense layer.
ConvKB. The scoring function of ConvKB  uses a single convolutional layer and a single dense layer:
$$f(s, p, o) = \mathrm{vec}(g([\mathbf{e}_s, \mathbf{r}_p, \mathbf{e}_o] * \omega)) \, \mathbf{W}$$
where $[\mathbf{e}_s, \mathbf{r}_p, \mathbf{e}_o]$ stacks the three vectors into a $k \times 3$ matrix and g is a non-linearity.
ConvE. In contrast to ConvKB, ConvE  only performs convolution over the subject and predicate image (concatenated and reshaped) and multiplies the output of the dense layer with the object vector:
$$f(s, p, o) = g(\mathrm{vec}(g([\hat{\mathbf{e}}_s; \hat{\mathbf{r}}_p] * \omega)) \, \mathbf{W}) \cdot \mathbf{e}_o$$
Work on KGE models usually defines loss functions specific to each model. However, as shown in [49,54], the choice of loss function has a large impact on model performance. In this work we use four loss functions. We experimented with other loss functions, e.g., absolute/squared error; however, these did not yield improved results.
To optimize a loss function we need to generate negative examples. Under the local closed world assumption we corrupt the object of each true triple $(s, p, o)$ and sample negative examples from the set $\{(s, p, o') \mid o' \in \mathcal{E}\}$, where $\mathcal{E}$ is the set of entities. This can be expanded to the stochastic local closed world assumption, which corrupts both the subject and the object of true triples (illustrated by Fig. 3 in ). The number of negative samples per positive sample is controlled by a hyper-parameter; however,  show that the largest possible number is favourable.
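Negative sampling under the local closed world assumption can be sketched as follows; the entity names and the filtering of known true triples are illustrative:

```python
import random

def corrupt_triples(triple, entities, known_triples, n_samples,
                    corrupt_subject=False, rng=None):
    """Sample negatives (with replacement) by replacing the object (or
    subject) with a random entity, skipping corruptions that are
    themselves known true triples."""
    rng = rng or random.Random(0)
    s, p, o = triple
    negatives = []
    while len(negatives) < n_samples:
        e = rng.choice(entities)
        candidate = (e, p, o) if corrupt_subject else (s, p, e)
        if candidate not in known_triples:
            negatives.append(candidate)
    return negatives

known = {("chem1", "hasEffect", "species1"), ("chem1", "hasEffect", "species2")}
entities = ["chem1", "chem2", "species1", "species2", "species3"]
negs = corrupt_triples(("chem1", "hasEffect", "species1"), entities, known, 3)
```

Corrupting the subject as well (`corrupt_subject=True` on a second call) gives the stochastic variant described above.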
Pointwise hinge. The objective of pointwise losses is to minimize the scores of negative triples and maximize the scores of positive triples:
$$\mathcal{L} = \sum_{t} \max(0, \gamma - l_t \cdot f(t))$$
where $l_t \in \{-1, 1\}$ is the label of triple t and $\gamma$ is the margin.
Pointwise logistic. In contrast to the hinge loss, the logistic loss applies a larger, non-linear penalty to predictions that are further away from the true label:
$$\mathcal{L} = \sum_{t} \log(1 + \exp(-l_t \cdot f(t)))$$
Pairwise logistic. Akin to the move from pointwise to pairwise hinge, the pairwise logistic loss maximizes the distance between positive and negative triples, but in a non-linear way:
$$\mathcal{L} = \sum_{t^+, t^-} \log(1 + \exp(f(t^-) - f(t^+)))$$
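The three losses above can be stated compactly; a numpy sketch with labels in {−1, +1} (function names and example scores are ours):

```python
import numpy as np

def pointwise_hinge(scores, labels, margin=1.0):
    """Zero loss once each score is on the right side of the margin."""
    return np.mean(np.maximum(0.0, margin - labels * scores))

def pointwise_logistic(scores, labels):
    """Smooth penalty that grows with the distance from the true label."""
    return np.mean(np.log1p(np.exp(-labels * scores)))

def pairwise_logistic(pos_scores, neg_scores):
    """Push each positive score above its paired negative score."""
    return np.mean(np.log1p(np.exp(neg_scores - pos_scores)))

# A well-separated pair: positive scored 2, negative scored -2.
scores = np.array([2.0, -2.0])
labels = np.array([1.0, -1.0])
hinge = pointwise_hinge(scores, labels)  # both beyond the margin
```

Note that the logistic losses never reach exactly zero, which is the non-linear, ever-decreasing penalty referred to in the text.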
A. Agibetov and M. Samwald, Benchmarking neural embeddings for link prediction in knowledge graphs under semantic and structural changes, J. Web Semant. 64: ((2020) ), 100590. doi:10.1016/j.websem.2020.100590.
A. Algergawy, M. Cheatham, D. Faria, A. Ferrara, I. Fundulaki, I. Harrow, S. Hertling, E. Jiménez-Ruiz, N. Karam, A. Khiat, P. Lambrix, H. Li, S. Montanelli, H. Paulheim, C. Pesquita, T. Saveta, D. Schmidt, P. Shvaiko, A. Splendiani, É. Thiéblin, C. Trojahn, J. Vatascinová, O. Zamazal and L. Zhou, Results of the ontology alignment evaluation initiative 2018, in: Proceedings of the 13th International Workshop on Ontology Matching Co-Located with the 17th International Semantic Web Conference, OM@ISWC 2018, Monterey, CA, USA, October 8, 2018, P. Shvaiko, J. Euzenat, E. Jiménez-Ruiz, M. Cheatham and O. Hassanzadeh, eds, CEUR Workshop Proceedings, Vol. 2288: , CEUR-WS.org, (2018) , pp. 76–116.
A. Algergawy, D. Faria, A. Ferrara, I. Fundulaki, I. Harrow, S. Hertling, E. Jiménez-Ruiz, N. Karam, A. Khiat, P. Lambrix, H. Li, S. Montanelli, H. Paulheim, C. Pesquita, T. Saveta, P. Shvaiko, A. Splendiani, É. Thiéblin, C. Trojahn, J. Vatascinová, O. Zamazal and L. Zhou, Results of the ontology alignment evaluation initiative 2019, in: Proceedings of the 14th International Workshop on Ontology Matching Co-Located with the 18th International Semantic Web Conference (ISWC 2019), Auckland, New Zealand, October 26, 2019, P. Shvaiko, J. Euzenat, E. Jiménez-Ruiz, O. Hassanzadeh and C. Trojahn, eds, CEUR Workshop Proceedings, Vol. 2536: , CEUR-WS.org, (2019) , pp. 46–85.
M. Ali, M. Berrendorf, C.T. Hoyt, L. Vermue, M. Galkin, S. Sharifzadeh, A. Fischer, V. Tresp and J. Lehmann, Bringing light into the dark: A large-scale evaluation of knowledge graph embedding models under a unified framework, CoRR, 2020. arXiv:2006.13365.
M. Alshahrani, M.A. Khan, O. Maddouri, A.R. Kinjo, N. Queralt-Rosinach and R. Hoehndorf, Neuro-symbolic representation learning on biological knowledge graphs, Bioinform. 33: (17) ((2017) ), 2723–2730. doi:10.1093/bioinformatics/btx275.
H. Arnaout and S. Elbassuoni, Effective searching of rdf knowledge graphs, Journal of Web Semantics 48: ((2018) ), 66–84. doi:10.1016/j.websem.2017.12.001.
T. Benson, Principles of Health Interoperability HL7 and SNOMED, Health Information Technology Standards, Springer, London, (2012) .
K. Blagec, H. Xu, A. Agibetov and M. Samwald, Neural sentence embedding models for semantic similarity estimation in the biomedical domain, BMC Bioinformatics 20: (1) ((2019) ), 178. doi:10.1186/s12859-019-2789-2.
K. Bollacker, C. Evans, P. Paritosh, T. Sturge and J. Taylor, Freebase: A collaboratively created graph database for structuring human knowledge, in: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, Association for Computing Machinery, New York, NY, USA, (2008) , pp. 1247–1250. doi:10.1145/1376616.1376746.
A. Bordes, N. Usunier, A. García-Durán, J. Weston and O. Yakhnenko, Translating embeddings for modeling multi-relational data, in: Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Proceedings of a meeting held December 5–8, 2013, Lake Tahoe, Nevada, United States, C.J.C. Burges, L. Bottou, Z. Ghahramani and K.Q. Weinberger, eds, (2013) , pp. 2787–2795.
P. Branco, L. Torgo and R.P. Ribeiro, A survey of predictive modeling on imbalanced domains, ACM Comput. Surv. 49: (2) ((2016) ), 31:1–31:50. doi:10.1145/2907070.
A. Breit, S. Ott, A. Agibetov and M. Samwald, Openbiolink: A benchmarking framework for large-scale biomedical link prediction, Bioinformatics 36: (13) ((2020) ), 4097–4098. doi:10.1093/bioinformatics/btaa274.
J. Chen, P. Hu, E. Jiménez-Ruiz, O.M. Holter, D. Antonyrajah and I. Horrocks, OWL2Vec*: Embedding of OWL ontologies, Mach. Learn. 110: (7) ((2021) ), 1813–1845. doi:10.1007/s10994-021-05997-6.
J. Chen, E. Jiménez-Ruiz, I. Horrocks, D. Antonyrajah, A. Hadian and J. Lee, Augmenting ontology alignment by semantic embedding and distant supervision, in: European Semantic Web Conference (ESWC), (2021) , pp. 392–408.
X. Chen, M.-X. Liu and G.-Y. Yan, Drug–target interaction prediction by random walk on the heterogeneous network, Mol. BioSyst. 8: ((2012) ), 1970–1978. doi:10.1039/c2mb00002d.
F. Chollet et al., Keras, 2015. https://github.com/fchollet/keras.
T.F. Coleman and J.J. Moré, Estimation of sparse Jacobian matrices and graph coloring problems, SIAM Journal on Numerical Analysis 20: (1) ((1983) ), 187–209. doi:10.1137/0720013.
J. David, J. Euzenat, F. Scharffe and C.T. dos Santos, The alignment API 4.0, Semantic Web 2: (1) ((2011) ), 3–10. doi:10.3233/SW-2011-0028.
T. Dettmers, P. Minervini, P. Stenetorp and S. Riedel, Convolutional 2d knowledge graph embeddings, in: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2–7, 2018, S.A. McIlraith and K.Q. Weinberger, eds, AAAI Press, (2018) , pp. 1811–1818.
J.A. Doering, S. Lee, K. Kristiansen, L. Evenseth, M.G. Barron, I. Sylte and C.A. LaLone, In silico site-directed mutagenesis informs species-specific predictions of chemical susceptibility derived from the sequence alignment to predict across species susceptibility (SeqAPASS) tool, Toxicological Sciences 166: (1) ((2018) ), 131–145.
X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun and W. Zhang, Knowledge vault: A web-scale approach to probabilistic knowledge fusion, in: The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA, August 24–27, 2014, S.A. Macskassy, C. Perlich, J. Leskovec, W. Wang and R. Ghani, eds, ACM, (2014) , pp. 601–610. doi:10.1145/2623330.2623623.
A.Z. Dudek, T. Arodz and J. Gálvez, Computational methods in developing quantitative structure-activity relationships (QSAR): A review, Combinatorial Chemistry & High Throughput Screening 9: (3) ((2006) ), 213–228. doi:10.2174/138620706776055539.
J. Euzenat and P. Shvaiko, Ontology Matching, 2nd edn, Springer, (2013) .
D. Faria, E. Jiménez-Ruiz, C. Pesquita, E. Santos and F.M. Couto, Towards annotating potential incoherences in bioportal mappings, in: Proceedings, Part II, The Semantic Web – ISWC 2014 – 13th International Semantic Web Conference, Riva del Garda, Italy, October 19–23, 2014, Proceedings, Part II, P. Mika, T. Tudorache, A. Bernstein, C. Welty, C.A. Knoblock, D. Vrandecic, P. Groth, N.F. Noy, K. Janowicz and C.A. Goble, eds, Lecture Notes in Computer Science, Vol. 8797: , Springer, (2014) , pp. 17–32.
D. Faria, C. Pesquita, E. Santos, M. Palmonari, I.F. Cruz and F.M. Couto, The AgreementMakerLight ontology matching system, in: On the Move to Meaningful Internet Systems: OTM 2013 Conferences – Confederated International Conferences: CoopIS, DOA-Trusted Cloud, and ODBASE 2013, Graz, Austria, September 9–13, 2013, Proceedings, (2013) , pp. 527–541.
J. Fukuchi, A. Kitazawa, K. Hirabayashi and M. Honma, A practice of expert review by read-across using QSAR toolbox, Mutagenesis 34: (1) ((2019) ), 49–54. doi:10.1093/mutage/gey046.
B.C. Grau, I. Horrocks, B. Motik, B. Parsia, P.F. Patel-Schneider and U. Sattler, OWL 2: The next step for OWL, J. Web Semant. 6: (4) ((2008) ), 309–322. doi:10.1016/j.websem.2008.05.001.
I. Harrow, E. Jiménez-Ruiz, A. Splendiani, M. Romacker, P. Woollard, S. Markel, Y. Alam-Faruque, M. Koch, J. Malone and A. Waaler, Matching disease and phenotype ontologies in the ontology alignment evaluation initiative, J. Biomed. Semant. 8: (1) ((2017) ), 55:1–55:13. doi:10.1186/s13326-017-0162-9.
J. Hastings, G. Owen, A. Dekker, M. Ennis, N. Kale, V. Muthukrishnan, S. Turner, N. Swainston, P. Mendes and C. Steinbeck, ChEBI in 2016: Improved services and an expanding collection of metabolites, Nucleic acids research 44: (D1) ((2016) ), 214–219.
K. Hayashi and M. Shimbo, On the equivalence of holographic and complex embeddings for link prediction, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada, July 2017, Association for Computational Linguistics, (2017) , pp. 554–559. doi:10.18653/v1/P17-2088.
S.R. Heller, A. McNaught, I.V. Pletnev, S. Stein and D. Tchekhovskoi, Inchi, the IUPAC international chemical identifier, J. Cheminformatics 7: ((2015) ), 23. doi:10.1186/s13321-015-0068-4.
A. Hogan, E. Blomqvist, M. Cochez, C. d’Amato, G. de Melo, C. Gutiérrez, S. Kirrane, J.E.L. Gayo, R. Navigli, S. Neumaier, A.N. Ngomo, A. Polleres, S.M. Rashid, A. Rula, L. Schmelzeisen, J.F. Sequeda, S. Staab and A. Zimmermann, Knowledge graphs, ACM Comput. Surv. 54: (4) ((2021) ), 71:1–71:37.
E. Jiménez-Ruiz, B. Cuenca Grau, Y. Zhou and I. Horrocks, Large-scale interactive ontology matching: Algorithms and implementation, in: 20th European Conference on Artificial Intelligence (ECAI), (2012) , pp. 444–449.
E. Jiménez-Ruiz and B. Cuenca Grau, LogMap: Logic-based and scalable ontology matching, in: 10th International Semantic Web Conference (ISWC), (2011) , pp. 273–288.
E. Jiménez-Ruiz, B.C. Grau, I. Horrocks and R.B. Llavori, Logic-based assessment of the compatibility of UMLS ontology sources, J. Biomed. Semant. 2: (S-1) ((2011) ), S2.
R. Kadlec, O. Bajgar and J. Kleindienst, Knowledge base completion: Baselines strike back, in: Proceedings of the 2nd Workshop on Representation Learning for NLP, Rep4NLP@ACL 2017, Vancouver, Canada, August 3, 2017, P. Blunsom, A. Bordes, K. Cho, S.B. Cohen, C. Dyer, E. Grefenstette, K.M. Hermann, L. Rimell, J. Weston and S. Yih, eds, Association for Computational Linguistics, (2017) , pp. 69–74.
S. Kim, E.E. Bolton and S.H. Bryant, Similar compounds versus similar conformers: Complementarity between PubChem 2-D and 3-D neighboring sets, Journal of Cheminformatics 8: (1) ((2016) ), 62. doi:10.1186/s13321-016-0163-1.
S. Kim, J. Chen, T. Cheng, A. Gindulyte, J. He, S. He, Q. Li, B.A. Shoemaker, P.A. Thiessen, B. Yu, L. Zaslavsky, J. Zhang and E.E. Bolton, PubChem 2019 update: Improved access to chemical data, Nucleic Acids Research 47: (D1) ((2018) ), D1102–D1109.
D.P. Kingma and J. Ba, Adam: A method for stochastic optimization, in: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Y. Bengio and Y. LeCun, eds, Conference Track Proceedings, (2015) .
M. Kulmanov, W. Liu-Wei, Y. Yan and R. Hoehndorf, EL embeddings: Geometric construction of models for the description logic EL++, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10–16, 2019, S. Kraus, ed., ijcai.org, (2019) , pp. 6103–6109.
C. LaLone, D. Villeneuve, H. Helgen and G. Ankley, Sequence alignment to predict across-species susceptibility, in: SETAC Europe, Basel, Switzerland, May 11–15, (2014) .
M. Lare, (Skolelaboratoriet i realfag ved Universitetet i Bergen). Smỵr i ferskvann. Accessed 11.06.2020.
F. Lécué and J. Wu, Semantic explanations of predictions, CoRR, 2018. arXiv:1805.10587.
J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P.N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer and C. Bizer, Dbpedia – A large-scale, multilingual knowledge base extracted from Wikipedia, Semantic Web 6: (2) ((2015) ), 167–195. doi:10.3233/SW-140134.
V.I. Levenshtein, Binary codes capable of correcting deletions, insertions and reversals, Soviet Physics Doklady 10: ((1966) ), 707.
X. Liang, D. Li, M. Song, A. Madden, Y. Ding and Y. Bu, Predicting biomedical relationships using the knowledge and graph embedding cascade model, PLOS ONE 14: (6) ((2019) ), 1–23.
NLM. Medical Subject Headings (MeSH) RDF, 2020. https://id.nlm.nih.gov/mesh/.
G.A. Miller, Wordnet: A lexical database for English, Commun. ACM 38: (11) ((1995) ), 39–41. doi:10.1145/219717.219748.
S.K. Mohamed, V. Novácek, P. Vandenbussche and E. Muñoz, Loss functions in knowledge graph embedding models, in: Proceedings of the Workshop on Deep Learning for Knowledge Graphs (DL4KG2019) Co-Located with the 16th Extended Semantic Web Conference 2019 (ESWC 2019), M. Alam, D. Buscaldi, M. Cochez, F. Osborne, D.R. Recupero and H. Sack, eds, CEUR Workshop Proceedings, Vol. 2377, CEUR-WS.org, (2019), pp. 1–10.
S. Mumtaz and M. Giese, Hierarchy-based semantic embeddings for single-valued & multi-valued categorical variables, Journal of Intelligent Information Systems (2021), (in press).
E.B. Myklebust, E. Jiménez-Ruiz, J. Chen, R. Wolf and K.E. Tollefsen, Knowledge graph embedding for ecotoxicological effect prediction, in: The Semantic Web – ISWC 2019, (2019), pp. 490–506.
E.B. Myklebust, E. Jiménez-Ruiz, J. Chen, R. Wolf and K.E. Tollefsen, Ontology alignment in ecotoxicological effect prediction, in: 15th International Workshop on Ontology Matching, (2020).
E.B. Myklebust, E. Jiménez-Ruiz, J. Chen, R. Wolf and K.E. Tollefsen, Toxicological Effect and Risk Assessment (TERA) Knowledge Graph, 2020, (Version 1.1.0) [Data set]. Zenodo. doi:10.5281/zenodo.4244313.
M. Nayyeri, C. Xu, Y. Yaghoobzadeh, H.S. Yazdi and J. Lehmann, Toward understanding the effect of loss function on the performance of knowledge graph embedding, 2019.
D.Q. Nguyen, T.D. Nguyen, D.Q. Nguyen and D.Q. Phung, A novel embedding model for knowledge base completion based on convolutional neural network, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, M.A. Walker, H. Ji and A. Stent, eds, (2018), pp. 327–333.
M. Nickel, L. Rosasco and T.A. Poggio, Holographic embeddings of knowledge graphs, in: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, Arizona, USA, February 12–17, 2016, D. Schuurmans and M.P. Wellman, eds, AAAI Press, (2016), pp. 1955–1961.
C.S. Parr, N. Wilson, P. Leary, K.S. Schulz, K. Lans, L. Walley, J.A. Hammock, A. Goddard, J. Rice, M. Studer, J.T.G. Holmes and J.R.J. Corrigan, The encyclopedia of life v2: Providing global access to knowledge about life on Earth, Biodiversity Data Journal 2 (2014), e1079.
R. Parthasarathi and A. Dhawan, Chapter 5 – In silico approaches for predictive toxicology, in: In Vitro Toxicology, A. Dhawan and S. Kwon, eds, Academic Press, (2018), pp. 91–109. doi:10.1016/B978-0-12-804667-8.00005-5.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011), 2825–2830.
M.A.N. Pour, A. Algergawy, R. Amini, D. Faria, I. Fundulaki, I. Harrow, S. Hertling, E. Jiménez-Ruiz, C. Jonquet, N. Karam, A. Khiat, A. Laadhar, P. Lambrix, H. Li, Y. Li, P. Hitzler, H. Paulheim, C. Pesquita, T. Saveta, P. Shvaiko, A. Splendiani, É. Thiéblin, C. Trojahn, J. Vatascinová, B. Yaman, O. Zamazal and L. Zhou, Results of the ontology alignment evaluation initiative 2020, in: Proceedings of the 15th International Workshop on Ontology Matching Co-Located with the 19th International Semantic Web Conference (ISWC 2020), Virtual conference (originally planned to be in Athens, Greece), November 2, 2020, P. Shvaiko, J. Euzenat, E. Jiménez-Ruiz, O. Hassanzadeh and C. Trojahn, eds, CEUR Workshop Proceedings, Vol. 2788, CEUR-WS.org, (2020), pp. 92–138.
J. Pujara, E. Augustine and L. Getoor, Sparsity and noise: Where knowledge graph embeddings fall short, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, Sept. 2017, Association for Computational Linguistics, (2017), pp. 1751–1756.
A. Rossi, D. Barbosa, D. Firmani, A. Matinata and P. Merialdo, Knowledge graph embedding for link prediction: A comparative analysis, ACM Trans. Knowl. Discov. Data 15(2) (2021), 14:1–14:49.
E.W. Sayers, T. Barrett, D.A. Benson, S.H. Bryant, K. Canese, V. Chetvernin, D.M. Church, M. DiCuccio, R. Edgar, S. Federhen, M. Feolo, L.Y. Geer, W. Helmberg, Y. Kapustin, D. Landsman, D.J. Lipman, T.L. Madden, D.R. Maglott, V. Miller, I. Mizrachi, J. Ostell, K.D. Pruitt, G.D. Schuler, E. Sequeira, S.T. Sherry, M. Shumway, K. Sirotkin, A. Souvorov, G. Starchenko, T.A. Tatusova, L. Wagner, E. Yaschenko and J. Ye, Database resources of the National Center for Biotechnology Information, Nucleic Acids Research 37(suppl_1) (2008), D5–D15.
A.K. Sharma, G.N. Srivastava, A. Roy and V.K. Sharma, ToxiM: A toxicity prediction tool for small molecules developed using machine learning and chemoinformatics approaches, Frontiers in Pharmacology 8 (2017), 880. doi:10.3389/fphar.2017.00880.
P. Shvaiko and J. Euzenat, Ontology matching: State of the art and future challenges, IEEE Trans. Knowl. Data Eng. 25(1) (2013), 158–176. doi:10.1109/TKDE.2011.253.
N.P.O. Skrindebakke, Understanding the Role of Background Knowledge in Predictions, Master’s thesis, 2020.
F.Z. Smaili, X. Gao and R. Hoehndorf, OPA2Vec: Combining formal and informal content of biomedical ontologies to improve similarity-based prediction, Bioinform. 35(12) (2019), 2133–2140. doi:10.1093/bioinformatics/bty933.
F.M. Suchanek, G. Kasneci and G. Weikum, YAGO: A core of semantic knowledge, in: Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8–12, 2007, C.L. Williamson, M.E. Zurko, P.F. Patel-Schneider and P.J. Shenoy, eds, ACM, (2007), pp. 697–706. doi:10.1145/1242572.1242667.
Z. Sun, Z. Deng, J. Nie and J. Tang, RotatE: Knowledge graph embedding by relational rotation in complex space, in: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019, OpenReview.net, (2019).
M. Swain et al., PubChemPy: Python wrapper for the PubChem PUG REST API, 2014. [Online; accessed 15.08.2019].
M.E. Tipping and C.M. Bishop, Probabilistic principal component analysis, Journal of the Royal Statistical Society. Series B (Statistical Methodology) 61(3) (1999), 611–622. doi:10.1111/1467-9868.00196.
T. Trouillon, J. Welbl, S. Riedel, É. Gaussier and G. Bouchard, Complex embeddings for simple link prediction, CoRR, 2016. arXiv:1606.06357.
U.S. Environmental Protection Agency. Ecotox user guide: Ecotoxicology knowledgebase system, version 5.3, 2020.
U.S. Environmental Protection Agency. ToxCast & Tox21 Summary Files from invitrodb_v3, 2020.
D. Vrandecic and M. Krötzsch, Wikidata: A free collaborative knowledgebase, Commun. ACM 57(10) (2014), 78–85. doi:10.1145/2629489.
A. Waagmeester, G. Stupp, S. Burgstaller, B. Good, M. Griffith, O. Griffith, K. Hanspers, H. Hermjakob, T. Hudson, K. Hybiske, S. Keating, M. Manske, M. Mayers, D. Mietchen, E. Mitraka, A. Pico, T. Putman, A. Riutta, N. Queralt-Rosinach and A. Su, Wikidata as a knowledge graph for the life sciences, eLife 9 (2020), e52614.
Q. Wang, Z. Mao, B. Wang and L. Guo, Knowledge graph embedding: A survey of approaches and applications, IEEE Trans. Knowl. Data Eng. 29(12) (2017), 2724–2743. doi:10.1109/TKDE.2017.2754499.
E. Willighagen, InChIKey collision: The DIY copy/pastables, 2011.
C. Wittwehr, P. Blomstedt, J.P. Gosling, T. Peltola, B. Raffael, A.-N. Richarz, M. Sienkiewicz, P. Whaley, A. Worth and M. Whelan, Artificial intelligence for chemical risk assessment, Computational Toxicology 13 (2019), 100114.
Y. Wu and G. Wang, Machine learning based toxicity prediction: From chemical structural description to transcriptome analysis, International Journal of Molecular Sciences 19 (2018), 2358.
Z. Wu, W. Lu, D. Wu, A. Luo, H. Bian, J. Li, W. Li, G. Liu, J. Huang, F. Cheng and Y. Tang, In silico prediction of chemical mechanism of action via an improved network-based inference method, British Journal of Pharmacology 173(23) (2016), 3372–3385. doi:10.1111/bph.13629.
B. Yang, W. Yih, X. He, J. Gao and L. Deng, Embedding entities and relations for learning and inference in knowledge bases, in: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Y. Bengio and Y. LeCun, eds, Conference Track Proceedings, (2015).
H. Yang, L. Sun, W. Li, G. Liu and Y. Tang, In silico prediction of chemical toxicity for drug design using machine learning methods and structural alerts, Frontiers in Chemistry 6 (2018), 30. doi:10.3389/fchem.2018.00030.
W.J. Youden, Index for rating diagnostic tests, Cancer 3(1) (1950), 32–35. doi:10.1002/1097-0142(1950)3:1<32::AID-CNCR2820030106>3.0.CO;2-3.
Z. Zhang, J. Cai, Y. Zhang and J. Wang, Learning hierarchy-aware knowledge graph embeddings for link prediction, in: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI Press, (2020), pp. 3065–3072.