
Semantics and canonicalisation of SPARQL 1.1

Abstract

We define a procedure for canonicalising SPARQL 1.1 queries. Specifically, given two input queries that return the same solutions modulo variable names over any RDF graph (which we call congruent queries), the canonicalisation procedure aims to rewrite both input queries to a syntactically canonical query that likewise returns the same results modulo variable renaming. The use-cases for such canonicalisation include caching, optimisation, redundancy elimination, question answering, and more besides. To begin, we formally define the semantics of the SPARQL 1.1 language, including features often overlooked in the literature. We then propose a canonicalisation procedure based on mapping a SPARQL query to an RDF graph, applying algebraic rewritings, removing redundancy, and then using canonical labelling techniques to produce a canonical form. Unfortunately a full canonicalisation procedure for SPARQL 1.1 queries would be undecidable. We rather propose a procedure that we prove to be sound and complete for a decidable fragment of monotone queries under both set and bag semantics, and that is sound but incomplete in the case of the full SPARQL 1.1 query language. Although the worst case of the procedure is super-exponential, our experiments show that it is efficient for real-world queries, and that such difficult cases are rare.

1.Introduction

The Semantic Web provides a variety of standards and techniques for enhancing the machine-readability of Web content in order to increase the levels of automation possible for day-to-day tasks. RDF [54] is the standard framework for the graph-based representation of data on the Semantic Web. In turn, SPARQL [24] is the standard querying language for RDF, composed of basic graph patterns extended with expressive features that include path expressions, relational algebra, aggregation, federation, among others.

The adoption of RDF as a data model and SPARQL as a query language has grown significantly in recent years [4,26]. Prominent datasets such as DBpedia [35] and Wikidata [61] contain in the order of hundreds of millions or even billions of RDF triples, and their associated SPARQL endpoints receive millions of queries per day [37,52]. Hundreds of other SPARQL endpoints are likewise available on the Web [4]. However, a survey carried out by Buil-Aranda et al. [4] found that a large number of SPARQL endpoints experience performance issues such as latency and unavailability. The same study identified the complexity of SPARQL queries as one of the main causes of these problems, which is perhaps an expected result given the expressivity of the SPARQL query language where, for example, the decision problem consisting of determining if a solution is given by a query over a graph is PSPACE-hard for the SPARQL language [44].

One way to address performance issues is through caching of sub-queries [41,62]. The caching of queries is done by evaluating a query and then storing its result set, which can then be used to answer future instances of the same query without re-evaluating it over the data. The caching of sub-queries identifies common query patterns whose results can be returned for queries that contain said query patterns. However, this is complicated by the fact that a given query can be expressed in different, semantically equivalent ways. As a result, if we are unable to verify that a given query is equivalent to one that has already been cached, we are not using the cached results optimally: we may miss relevant results.

Ideally, for the purposes of caching, we could use a procedure to canonicalise SPARQL queries. To formalise this idea better, we call two queries equivalent if (and only if) they return the same solutions over any RDF dataset. Note however that this notion of equivalence requires the variables of the solutions of both queries to coincide. In practice, variable names will often differ across queries, where we would still like to be able to cache and retrieve the results for queries whose results are the same modulo variable names. Hence we call two queries congruent if they return the same solutions, modulo variable names, over any RDF dataset; in other words, two queries are congruent if (and only if) there exists a one-to-one mapping from the variables in one query to the variables of the other query that makes the former equivalent to the latter.

In this paper, we propose a procedure by which congruent SPARQL queries can be “canonicalised”. We call such a procedure sound if the output query is congruent to the input query, and complete if the same output query is given for any two congruent input queries.

Example 1.1.

Consider two SPARQL queries asking for the names of aunts (both are sketched below). Both queries are equivalent: they always return the same results over any RDF dataset. Now consider a third SPARQL query (also sketched below). Note that the pattern ?b :name ?n . in this query is redundant. This query is not equivalent to the former two because the variable that is returned is different, and thus the solutions (which contain the projected variable) will not be the same. However, all three queries are congruent; for example, if we rewrite ?n to ?z in the third query, all three queries become equivalent.
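The following sketches are hypothetical reconstructions consistent with the description above, assuming set semantics (SELECT DISTINCT) and assumed predicates :mother, :father, :sister and :name; the first two are equivalent, while the third projects ?n rather than ?z and contains the redundant pattern ?b :name ?n .

    # First query: a join over a union of parent patterns
    SELECT DISTINCT ?z
    WHERE {
      { ?x :mother ?p } UNION { ?x :father ?p }
      ?p :sister ?a .
      ?a :name ?z .
    }

    # Second query: the same query flattened into a union of basic graph patterns
    SELECT DISTINCT ?z
    WHERE {
      { ?x :mother ?p . ?p :sister ?a . ?a :name ?z . }
      UNION
      { ?x :father ?p . ?p :sister ?a . ?a :name ?z . }
    }

    # Third query: congruent to the first two; note the redundant pattern ?b :name ?n .
    SELECT DISTINCT ?n
    WHERE {
      { ?x :mother ?p } UNION { ?x :father ?p }
      ?p :sister ?a .
      ?a :name ?n .
      ?b :name ?n .
    }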

Canonicalisation aims to rewrite all three (original) queries to a unique, congruent, output query.

The potential use-cases we foresee for a canonicalisation procedure include the following:

Query caching:

As aforementioned, a canonicalisation procedure can improve caching for SPARQL endpoints. By capturing knowledge about query congruence, canonicalisation can increase the cache hit rate. Similar techniques could also be used to identify and cache frequently appearing (congruent) sub-queries [41].

Views:

In a conceptually similar use case to caching, our canonical procedure can be used to describe views [9]. In particular, the canonicalisation procedure can be used to create a key that uniquely identifies each of the views available.

Log analysis:

SPARQL endpoint logs can be analysed in order to understand the importance of different SPARQL features [7,52], to build suitable benchmarks [52], to understand how users build queries incrementally [7,64], etc. Our canonicalisation procedure could be used to pre-process and group congruent queries in logs.

Query optimisation:

Canonicalisation may help with query optimisation by reducing the variants to be considered for query planning, detecting duplicate sub-queries that can be evaluated once, removing redundant patterns (as may occur under query rewriting strategies for reasoning [33]), etc.

Learning over queries:

Canonicalisation can reduce superficial variance in queries used to train machine learning models. For example, recent question answering systems learn translations from natural language questions to queries [10], where canonicalisation can be used to homogenise the syntax of queries used for training.

Other possible but more speculative use-cases involve signing or hashing SPARQL queries, discovering near-duplicate or parameterised queries (by considering constants as variables), etc. Furthermore, with some adaptations, the methods proposed here could be generalised to other query languages, such as to canonicalise SQL queries, Cypher queries [3], etc.

A key challenge for canonicalising SPARQL queries is the prohibitively high computational complexity that it entails. More specifically, the query equivalence problem takes two queries and returns true if and only if they return the same solutions for any dataset, or false otherwise. In the case of SPARQL, this problem is intractable (NP-complete) even when simply permitting joins (with equality conditions). Even worse, the problem becomes undecidable when features such as projection and optional matches are added [45]. Since a canonicalisation procedure can be directly used to decide equivalence, this means that canonicalisation is at least as hard as the equivalence problem in computational complexity terms, meaning it will likewise be intractable for even simple fragments and undecidable when considering the full SPARQL language. There are thus fundamental limitations in what can be achieved for canonicalising SPARQL queries.

With these limitations in mind, we propose a canonicalisation procedure that is always sound, but only complete for a monotone fragment of SPARQL under set or bag semantics. This monotone fragment permits unions and joins over basic graph patterns, some examples of which were illustrated in Example 1.1. We further provide sound, but incomplete, canonicalisation of the full SPARQL 1.1 query language, whereby the canonicalised query will be congruent to the input query, but not all pairs of congruent input queries will result in the same output query. In the case of incomplete canonicalisation, we are still able to find novel congruences, in particular through canonical labelling of variables, which further allows for ordering operands in a consistent manner. Reviewing the aforementioned use-cases, we believe that this “best-effort” form of canonicalisation is still useful, as in the case of caching, where missing an equivalence will require re-executing the query (which would have to be done in any case), or in the case of learning over queries, where incomplete canonicalisation can still increase the homogeneity of the training examples used.

As a high-level summary, our procedure combines four main techniques for canonicalisation.

  • 1. The first technique is to convert SPARQL queries to an algebraic graph, which abstracts away syntactic variances, such as the ordering of operands for operators that are commutative, and the grouping of operands for operators that are associative.

  • 2. The second technique is to apply algebraic rewritings on the graph to achieve normal forms over combinations of operators. For example, we rewrite monotone queries – that allow any combination of join, union, basic graphs patterns, etc. – into unions of basic graph patterns; this would rewrite the first and third queries shown in Example 1.1 into a form similar to the second query.

  • 3. The third technique is to apply redundancy elimination within the algebraic graph, which typically involves the removal of elements of the query that do not affect the results; this technique would remove the redundant ?b :name ?n . pattern from the third query of Example 1.1.

  • 4. The fourth and final technique is to apply a canonical labelling of the algebraic graph, which will provide consistent labels to variables, and which in turn allows for the (unordered) algebraic graph to be serialised back into the (ordered) concrete syntax of SPARQL in a canonical way.

We remark that the techniques do not necessarily follow the presented order; in particular, the second and third techniques can be interleaved in order to provide further canonicalisation of queries.

This paper extends upon our previous work [50] where we initially outlined a sound and complete procedure for canonicalising monotone SPARQL queries. The novel contributions of this extended paper include:

  • We close a gap involving the translation into set semantics of monotone queries that, under bag semantics, cannot return duplicates.

  • We provide a detailed semantics for SPARQL 1.1 queries; formalising and understanding this is a key prerequisite for canonicalisation.

  • We extend our algebraic graph representation in order to be able to represent SPARQL 1.1 queries, offering partial canonicalisation support.

  • We implement algebraic rewriting rules for specific SPARQL 1.1 operators, such as those relating to filters; we further propose techniques to canonicalise property path expressions.

  • We provide more detailed experiments, which now include results over a Wikidata query log, a comparison with existing systems from the literature that perform pairwise equivalence checks, and more detailed stress testing.

We also provide extended proofs of results that were previously unpublished [51], as well as providing extended discussion and examples throughout.

The outline of the paper is then as follows. Section 2 provides preliminaries for RDF, while Section 3 provides a detailed semantics for SPARQL. Section 4 provides a problem statement, formalising the notion of canonicalisation. Section 5 discusses related works in the areas of systems that support query containment, equivalence, and congruence. Sections 6 and 7 discuss our SPARQL canonicalisation framework for monotone queries, and SPARQL 1.1, respectively. Section 8 presents evaluation results. Section 9 concludes.

2.RDF data model

We begin by introducing the core concepts of the RDF data model over which the SPARQL query language will later be defined. The following is a relatively standard treatment of RDF, as can be found in various papers from the literature [22,28]. We implicitly refer to RDF 1.1 unless otherwise stated.

2.1.Terms and triples

RDF assumes three pairwise disjoint sets of terms: IRIs (I), literals (L) and blank nodes (B). Data in RDF are structured as triples, which are 3-tuples of the form (s,p,o) ∈ (I ∪ B) × I × (I ∪ B ∪ L), denoting the subject s, the predicate p, and the object o of the triple. There are three types of literals: a plain literal s is a simple string, a language-tagged literal (s,l) is a pair of a simple string and a language tag, and a datatype literal (s,d) is a pair of a simple string and an IRI d (denoting a datatype).

In this paper we use Turtle/SPARQL-like syntax, where :a, xsd:string, etc., denote IRIs; _:b, _:x1, etc., denote blank nodes; "a", "xy z", etc., denote plain literals; "hello"@en, "hola"@es, etc., denote language-tagged literals; and "true"^^xsd:boolean, "1"^^xsd:int, etc., denote datatype literals.

2.2.Graph

An RDF graph G is a set of RDF triples. It is called a graph because each triple (s,p,o) ∈ G can be viewed as a directed labelled edge of the form s –p→ o, and a set of such triples forms a directed edge-labelled graph.

2.3.Simple entailment and equivalence

Blank nodes in RDF have special meaning; in particular, they are considered to be existential variables. The notion of simple entailment [22,25] captures the existential semantics of blank nodes (among other fundamental aspects of RDF). This same notion also plays a role in how the SPARQL query language is defined.

Formally, let α : B → I ∪ B ∪ L denote a mapping that maps blank nodes to RDF terms; we call such a mapping a blank node mapping. Given an RDF graph G, let bnodes(G) denote all of the blank nodes appearing in G. Let α(G) denote the image of G under α; i.e., the graph G but with each occurrence of each blank node b ∈ bnodes(G) replaced with α(b). Given two RDF graphs G and H, we say that G simple-entails H, denoted G ⊨ H, if and only if there exists a blank node mapping α such that α(H) ⊆ G [22,25]. Furthermore, if G ⊨ H and H ⊨ G, then we say that they are simple-equivalent, denoted G ≡ H.

Deciding simple entailment G ⊨ H is known to be NP-complete [22]. Deciding simple equivalence G ≡ H is likewise known to be NP-complete.

We remark that the RDF standard defines further entailment regimes that cover the semantics of datatypes and the special RDF and RDFS vocabularies [25]; we will not consider such entailment regimes here.

2.4.Isomorphism

Given that blank nodes are defined as existential variables [25], two RDF graphs differing only in blank node labels are thus considered isomorphic [19,28].

Formally, if a blank node mapping of the form α : B → B is one-to-one, we call it a blank node bijection. Two RDF graphs G and H are defined as isomorphic, denoted G ≅ H, if and only if there exists a blank node bijection α such that α(G) = H; i.e., the two RDF graphs differ only in their blank node labels. We remark that if G ≅ H, then G ≡ H; however, the converse does not always hold, as we discuss in the following.

Deciding the isomorphism G ≅ H is known to be GI-complete [28] (as hard as the graph isomorphism problem).

2.5.Leanness and core

Existential blank nodes may give rise to redundant triples. In particular, an RDF graph G is called lean if and only if there does not exist a proper subgraph G′ ⊊ G of it such that G′ ≡ G; otherwise G is called non-lean. Non-lean graphs can be seen, under the RDF semantics, as containing redundant triples. For example, given an RDF graph G = {(:x,:y,:z), (:x,:y,_:b)}, the second triple is seen as redundant: it states that :x has some value on :y, but we know this already from the first triple, so the second triple says nothing new.

The core of an RDF graph G is then an RDF graph G′ such that G ≡ G′ and G′ is lean; intuitively it is a version of G without redundancy. For example, the core of the aforementioned graph would be G′ = {(:x,:y,:z)}; note that G ≡ G′ and G′ is lean, but G ≇ G′. The core of a graph is unique modulo isomorphism [22]; hence we refer to the core of a graph.

Deciding whether or not an RDF graph G is lean is known to be coNP-complete [22]. Deciding if G′ is the core of G is known to be DP-complete [22].

2.6.Merge

Blank nodes are considered to be scoped to a local RDF graph. Hence when combining RDF graphs, applying a merge (rather than a union) avoids blank nodes with the same name in two (or more) graphs clashing. Given two RDF graphs G and G′, and a blank node bijection α such that bnodes(α(G)) ∩ bnodes(G′) = ∅, we call α(G) ∪ G′ an RDF merge, denoted G ⊎ G′. The merge of two graphs is unique modulo isomorphism.

3.SPARQL 1.1 semantics

We now define SPARQL 1.1 in detail [24]. We will begin by defining a SPARQL dataset over which queries are evaluated. We then introduce an abstract syntax for SPARQL queries. Thereafter we discuss the evaluation of queries under different semantics.

These definitions extend similar preliminaries found in the literature. However, our definitions of the semantics of SPARQL 1.1 extend beyond the core of the language and rather aim to be exhaustive, where a clear treatment of the full language is a prerequisite for formalising the canonicalisation of queries using the language. Table 1 provides a summary of prior works that have defined the semantics of SPARQL features. We exclude works that came after one of the works shown and use a subset of the features of that work (even if they may contribute novel results about those features). Some SPARQL 1.0 features, such as UNION, FILTER and OPTIONAL, have been featured in all studies. In terms of SPARQL 1.1, the most extensive formal definitions have been provided by Polleres and Wallner [47], and by Kaminski et al. [32]. However, both works omit query features: Polleres and Wallner [47] omit federation and aggregation, whereas Kaminski et al. [32] omit named graphs, federation, and non-SELECT query forms. Compared to these previous works, we aim to capture the full SPARQL 1.1 query language, with one simplification: we define functions and expressions abstractly, rather than defining all of the many built-ins that SPARQL 1.1 provides (e.g., +, BOUND, COUNT, IF, etc.)

Table 1

Studies that define the semantics of features in SPARQL (1.1), including Monotone (basic graph patterns, joins, UNION, un-nested SELECT DISTINCT), Filters, Optionals, Negation (OPTIONAL & !BOUND, MINUS, FILTER (NOT) EXISTS), Named Graphs (GRAPH, FROM (NAMED)), Paths, Federation (SERVICE), Assignment (BIND, VALUES), Aggregation (GROUP BY and aggregate functions), Sub Queries (nested SELECT), Solution Modifiers (LIMIT, OFFSET, ORDER BY), Query Forms (CONSTRUCT, ASK, DESCRIBE), Expressions and Functions (e.g., +, BOUND, COUNT, IF), Bag Semantics; we denote by “*” partial definitions or discussion

PaperYearMonFiltOptNegNGraPathFedAssnAggSubQSolMFormExpBag
Perez et al. [43,44]2006**
Polleres [46]2007****
Alkhateeb et al. [2]2009**
Arenas and Pérez [5]2012****
Polleres and Wallner [47]2013**
Kaminski et al. [32]2017**
Salas and Hogan2021*
Table 2

SPARQL property path syntax

The following are path expressions:
  p   (a predicate IRI)
  !(p1|…|pk|^pk+1|…|^pn)   (any predicate, or inverse predicate, not listed)
and if e, e1, e2 are path expressions, then the following are also path expressions:
  ^e   (an inverse path)
  e1/e2   (a path of e1 followed by e2)
  e1|e2   (a path of e1 or e2)
  e*   (a path of zero or more e)
  e+   (a path of one or more e)
  e?   (a path of zero or one e)
  (e)   (brackets used for grouping)

3.1.Query syntax

Before we introduce an abstract syntax for SPARQL queries, we provide some preliminaries:

  • A triple pattern (s,p,o) is a member of the set (V ∪ I ∪ B ∪ L) × (V ∪ I) × (V ∪ I ∪ B ∪ L) (i.e., an RDF triple allowing variables in any position and literals as subject).

  • A basic graph pattern B is a set of triple patterns. We denote by vars(B) := ⋃_{(s,p,o)∈B} (V ∩ {s,p,o}) the set of variables used in B.

  • A path pattern (s,e,o) is a member of the set (V ∪ I ∪ B ∪ L) × P × (V ∪ I ∪ B ∪ L), where P is the set of all path expressions defined by Table 2.

  • A navigational graph pattern N is a set of path patterns and triple patterns (with variable predicates). We denote by vars(N) := ⋃_{(s,e,o)∈N} (V ∩ {s,e,o}) the set of variables used in N.

  • A term in V ∪ I ∪ B ∪ L is a built-in expression. Let ⊥ denote an unbound value and ε an error. We call ϕ a built-in function if it takes a tuple of values from I ∪ B ∪ L ∪ {⊥, ε} as input and returns a single value in I ∪ B ∪ L ∪ {⊥, ε} as output. An expression ϕ(R1, …, Rn), where each of R1, …, Rn is a built-in expression, is itself a built-in expression.

  • An aggregation function ψ is a function that takes a bag of tuples from I ∪ B ∪ L as input and returns a value in I ∪ B ∪ L ∪ {⊥, ε} as output. An expression ψ(R1, …, Rn), where each of R1, …, Rn is a built-in expression, is an aggregation expression.

  • If R is a built-in expression, and Δ is a boolean value indicating ascending or descending order, then (R,Δ) is an order comparator.

We then define the abstract syntax of a SPARQL query as shown in Table 3. Note that we abbreviate OPTIONAL as opt, FILTER EXISTS as fe, and FILTER NOT EXISTS as fne. Otherwise mapping from SPARQL’s concrete syntax to this abstract syntax is straightforward, with the following exceptions:

  • For brevity, we consider the following SPARQL 1.1 operators to be represented as functions:

    • boolean operators: ! for negation, && for conjunction, || for disjunction;

    • equality and inequality operators: =, <, >, <=, >=, !=;

    • numeric operators: unary + and - for positive/negative numbers; binary + and - for addition/subtraction, * for multiplication and / for division;

    for example, replacing ?a+?b, we assume addition to be defined as a function SUM(?a,?b).

  • We combine FROM and FROM NAMED into one feature, from, so they can be evaluated together.

  • A query such as DESCRIBE <x> <y> in the concrete syntax can be expressed in the abstract syntax with an empty pattern: DESCRIBE_{x,y}({}).

  • Aggregates without grouping can be expressed with GROUP_∅(Q). We assume that every query in the abstract syntax with a group-by pattern uses agg – possibly AGG_∅(Q) – to generate a graph pattern (and “flatten” groups).

  • Some aggregation functions in SPARQL take additional arguments, including a DISTINCT modifier, or a delimiter in the case of CONCAT. For simplicity, we assume that these are distinct functions, e.g., count(·) versus countdistinct(·).

  • SPARQL allows SELECT * to indicate that values for all variables should be returned. Otherwise SPARQL requires that at least one variable be specified. A SELECT * clause can be written in the abstract syntax as SELECT_V(Q), where Q is a graph pattern on V. The abstract syntax also allows empty projections SELECT_∅(Q), which greatly simplifies certain definitions and proofs; this can be represented in the concrete syntax as SELECT ?empty, where ?empty is a fresh variable not appearing in the query.

  • In the concrete syntax, SELECT allows for built-in expressions and aggregation expressions to be specified. We only allow variables to be used. However, such expressions can be bound to variables using bind or agg.2

  • In the concrete syntax, ORDER BY allows for using aggregation expressions in the order comparators. Our abstract syntax does not allow this as it complicates the definitions of such comparators. Ordering on aggregation expressions can rather be achieved using sub-queries.

  • We use [Q1 SERVICE_x^Δ Q2] to denote SERVICE, where Δ = true indicates that the SILENT keyword is invoked, and Δ = false indicates that it is not.

  • We do not consider SERVICE with variables as it has no normative semantics in the standard [48].

Aside from the latter point, these exceptions are syntactic conveniences that help simplify later definitions.
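As an illustration of these conventions (with assumed IRIs and variables), an expression in a SELECT clause is treated as if it were bound to a fresh variable with BIND, and addition is represented by a function such as SUM(?a,?b):

    # concrete query with an expression in the SELECT clause:
    SELECT ?x (?a + ?b AS ?c)
    WHERE { ?x :p ?a ; :q ?b . }

    # treated as the following, i.e., SELECT_{?x,?c}(BIND_{SUM(?a,?b),?c}(B))
    # for the basic graph pattern B of the WHERE clause:
    SELECT ?x ?c
    WHERE { ?x :p ?a ; :q ?b . BIND(?a + ?b AS ?c) }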

Table 3

Abstract SPARQL syntax

– B is a basic graph pattern.
∴ B is a graph pattern on vars(B).

– N is a navigational graph pattern.
∴ N is a graph pattern on vars(N).

– Q1 is a graph pattern on V1.
– Q2 is a graph pattern on V2.
∴ [Q1 AND Q2] is a graph pattern on V1 ∪ V2;
∴ [Q1 UNION Q2] is a graph pattern on V1 ∪ V2;
∴ [Q1 OPT Q2] is a graph pattern on V1 ∪ V2;
∴ [Q1 MINUS Q2] is a graph pattern on V1.

– Q is a graph pattern on V.
– Q1 is a graph pattern on V1.
– Q2 is a graph pattern on V2.
– v is a variable not in V.
– R is a built-in expression.
∴ FILTER_R(Q) is a graph pattern on V;
∴ [Q1 FE Q2] is a graph pattern on V1;
∴ [Q1 FNE Q2] is a graph pattern on V1;
∴ BIND_{R,v}(Q) is a graph pattern on V ∪ {v}.

– Q is a graph pattern on V.
– M is a bag of solution mappings on V_M = ⋃_{μ∈M} dom(μ).
∴ VALUES_M(Q) is a graph pattern on V ∪ V_M.

– Q is a graph pattern on V.
– x is an IRI.
– v is a variable.
∴ GRAPH_x(Q) is a graph pattern on V;
∴ GRAPH_v(Q) is a graph pattern on V ∪ {v}.

– Q1 is a graph pattern on V1.
– Q2 is a graph pattern on V2.
– x is an IRI.
– Δ is a boolean value.
∴ [Q1 SERVICE_x^Δ Q2] is a graph pattern on V1 ∪ V2.

– Q is a graph pattern on V.
– Q′ is a group-by pattern on (V′, V).
– V′ is a set of variables.
– A is an aggregation expression.
– Λ is a (possibly empty) set of pairs {(A1,v1), …, (An,vn)}, where A1, …, An are aggregation expressions, v1, …, vn are variables not appearing in V ∪ V′ such that vi ≠ vj for 1 ≤ i < j ≤ n, and where vars(Λ) = {v1, …, vn}.
∴ GROUP_{V′}(Q) is a group-by pattern on (V′, V);
∴ HAVING_A(Q′) is a group-by pattern on (V′, V);
∴ AGG_Λ(Q′) is a graph pattern on V′ ∪ vars(Λ).

– Q is a graph pattern or sequence pattern on V.
– Ω is a non-empty sequence of order comparators.
– k is a non-zero natural number.
∴ Q is a graph pattern and sequence pattern on V;
∴ ORDER_Ω(Q) is a sequence pattern on V;
∴ DISTINCT(Q) is a sequence pattern on V;
∴ REDUCED(Q) is a sequence pattern on V;
∴ OFFSET_k(Q) is a sequence pattern on V;
∴ LIMIT_k(Q) is a sequence pattern on V.

– Q is a sequence pattern on V that does not contain the same blank node b in two different graph patterns.
– V′ is a set of variables.
– B is a basic graph pattern.
– X is a set of IRIs and/or variables.
∴ SELECT_{V′}(Q) is a query and a graph pattern on V′;
∴ ASK(Q) is a query;
∴ CONSTRUCT_B(Q) is a query;
∴ DESCRIBE_X(Q) is a query.

– Q is a query but not a from query.
– X and X′ are sets of IRIs.
∴ FROM_{X,X′}(Q) is a from query and a query.

3.2.Datasets

SPARQL allows for indexing and querying more than one RDF graph, which is enabled through the notion of a SPARQL dataset. Formally, a SPARQL dataset D := (G, {(x1,G1), …, (xn,Gn)}) is a pair of an RDF graph G, called the default graph, and a set of named graphs of the form (xi, Gi), where xi is an IRI (called a graph name) and Gi is an RDF graph; additionally, graph names must be unique, i.e., xj ≠ xk for 1 ≤ j < k ≤ n. We denote by G_D the default graph G of D, and by named(D) := {(x1,G1), …, (xn,Gn)} the set of all named graphs in D. We further denote by G_D[xi] the graph Gi such that (xi, Gi) ∈ named(D), or the empty graph if xi does not appear as a graph name in D.
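For example (with assumed graph names and predicates), given a dataset D with named graphs :g1 and :g2, the following query evaluates its first pattern over the default graph G_D and its GRAPH pattern over G_D[:g1]:

    SELECT ?s ?o
    WHERE {
      ?s :p ?o .                 # matched against the default graph G_D
      GRAPH :g1 { ?s :q ?o }     # matched against the named graph G_D[:g1]
    }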

3.3.Services

While the SPARQL standard defines a wide range of features that compliant services must implement, a number of decisions are left to a particular service. First and foremost, a service chooses what dataset to index. Along these lines, we define a SPARQL service as a tuple S = (D, R, A, ⩽, describe), where:

  • D is a dataset;

  • R is a set of supported built-in expressions;

  • A is a set of supported aggregation expressions;

  • ⩽ is a total ordering of RDF terms and ⊥;

  • describe is a function used to describe RDF terms.

We will denote by D_S the dataset of a particular service S. The latter two elements will be described in more detail as they are used. The SPARQL standard does define some minimal requirements on the set of built-in expressions, the set of aggregation expressions, the ordering of terms, etc., that a standards-compliant service should respect. We refer to the SPARQL standard for details on these requirements [24].

3.4.Query evaluation

The semantics of a SPARQL query Q can be defined in terms of its evaluation over a SPARQL dataset D, denoted Q(D) which returns solution mappings that represent “matches” for Q in D.

3.4.1.Solution mappings

A solution mapping μ is a partial mapping from variables in V to terms in I ∪ B ∪ L. We denote the set of all solution mappings by 𝐌. Let dom(μ) denote the domain of μ, i.e., the set of variables for which μ is defined. Given {v1, …, vn} ⊆ V and {x1, …, xn} ⊆ I ∪ B ∪ L ∪ {⊥}, we denote by {v1/x1, …, vn/xn} the mapping μ such that dom(μ) = {vi | xi ∉ {⊥, ε}} for 1 ≤ i ≤ n and μ(vi) = xi for vi ∈ dom(μ). We denote by μ∅ the empty solution mapping (such that dom(μ∅) = ∅).

We say that two solution mappings μ1 and μ2 are compatible, denoted μ1 ∼ μ2, if and only if μ1(v) = μ2(v) for every v ∈ dom(μ1) ∩ dom(μ2). We say that two solution mappings μ1 and μ2 are overlapping, denoted μ1 ≬ μ2, if and only if dom(μ1) ∩ dom(μ2) ≠ ∅.

Given two compatible solution mappings μ1 ∼ μ2, we denote by μ1 ∪ μ2 their combination such that dom(μ1 ∪ μ2) = dom(μ1) ∪ dom(μ2), and if v ∈ dom(μ1) then (μ1 ∪ μ2)(v) = μ1(v), otherwise if v ∈ dom(μ2) then (μ1 ∪ μ2)(v) = μ2(v). Since the solution mappings μ1 and μ2 are compatible, for all v ∈ dom(μ1) ∩ dom(μ2) it holds that (μ1 ∪ μ2)(v) = μ1(v) = μ2(v), and thus μ1 ∪ μ2 = μ2 ∪ μ1.

Given a solution mapping μ and a triple pattern t, we denote by μ(t) the image of t under μ, i.e., the result of replacing every occurrence of a variable v in t by μ(v) (generating an unbound ⊥ if v ∉ dom(μ)). Given a basic graph pattern B, we denote by μ(B) the image of B under μ, i.e., μ(B) := {μ(t) | t ∈ B}. Likewise, given a navigational graph pattern N, we analogously denote by μ(N) the image of N under μ.

Blank nodes in SPARQL queries can likewise play a similar role to variables, though they cannot form part of the solution mappings. Given a blank node mapping α, we denote by α(B) the image of B under α, and by bnodes(B) the set of blank nodes used in B; we define α(N) and bnodes(N) analogously.

Finally, we denote by R(μ) the result of evaluating the image of the built-in expression R under μ, i.e., the result of replacing all variables v in R (including in nested expressions) with μ(v) and evaluating the resulting expression. We denote by μ ⊨ R that μ satisfies R, i.e., that R(μ) returns a value interpreted as true.

3.4.2.Set vs. bag vs. sequence semantics

SPARQL queries can be evaluated under different semantics, which may return a set of solution mappings M, a bag of solution mappings M, or a sequence of solution mappings M. Sets are unordered and do not permit duplicates. Bags are unordered and permit duplicates. Sequences are ordered and permit duplicates.

Given a solution mapping μ and a bag of solution mappings M, we denote by M(μ) the multiplicity of μ in M, i.e., the number of times that μ appears in M; we say that μ ∈ M if and only if M(μ) > 0 (otherwise, if M(μ) = 0, we say that μ ∉ M). We denote by |M| = Σ_{μ∈M} M(μ) the (bag) cardinality of M. Given two bags M and M′, we say that M ⊆ M′ if and only if M(μ) ≤ M′(μ) for all μ ∈ M. Note that M ⊆ M′ and M′ ⊆ M if and only if M = M′.

Given a sequence M of length n (we denote that |M| = n), we use M[i] (for 1 ≤ i ≤ n) to denote the ith solution mapping of M, and we say that μ ∈ M if and only if there exists 1 ≤ i ≤ n such that M[i] = μ. We denote by M[i..j] (for 1 ≤ i ≤ j) the sub-sequence (M[i], M[i+1], …, M[j−1], M[j]) of elements i to j of M, inclusive, in order; if i = j, then M[i..j] is defined to be (M[i]); if i > n, then M[i..j] is defined to be the empty sequence (); otherwise, if j > n, then M[i..j] is defined to be M[i..n]. Given two sequences M1 and M2, we denote by M1 ⧺ M2 their concatenation. For a sequence M of length n > 0, we define the deletion of index i ≤ n, denoted del(M,i), as the concatenation M[1..i−1] ⧺ M[i+1..n] if 1 < i < n, or M[2..n] if i = 1, or M[1..n−1] otherwise. We call j a repeated index of M if there exists 1 ≤ i < j such that M[i] = M[j]. We define dist(M) to be the fixpoint of recursively deleting repeated indexes from M. We say that M′ is contained in M, denoted M′ ⊆ M, if and only if we can derive M′ by recursively removing zero-or-more indexes from M. Note that M ⊆ M′ and M′ ⊆ M if and only if M = M′.

Next we provide some convenient notation to convert between sets, bags and sequences. Given a sequence M of length n, we denote by bag(M) the bag that preserves the multiplicity of elements in M (such that bag(M)(μ) := |{i | 1 ≤ i ≤ n and M[i] = μ}|). Given a set M, we denote by bag(M) the bag such that bag(M)(μ) = 1 if and only if μ ∈ M; otherwise bag(M)(μ) = 0. Given a bag M, we denote by set(M) := {μ | μ ∈ M} the set of elements in M; we further denote by seq(M) a random permutation of the bag M (more formally, any sequence seq(M) satisfying bag(seq(M)) = M). Given a sequence M, we use set(M) as a shortcut for set(bag(M)), and given a set M, we use seq(M) as a shortcut for seq(bag(M)). Finally, given a set M, a bag M′ and a sequence M″, we define that set(M) = M, bag(M′) = M′ and seq(M″) = M″.

We continue by defining the semantics of SPARQL queries under set semantics. Later we cover bag semantics, and subsequently discuss aggregation features. Finally we present sequence semantics.

3.5.Query patterns: Set semantics

Query patterns evaluated under set semantics return sets of solution mappings without order or duplicates. We first define a set algebra of operators and then define the set evaluation of SPARQL graph patterns.

3.5.1.Set algebra

The SPARQL query language can be defined in terms of a relational-style algebra consisting of unary and binary operators [18]. Here we describe the operators of this algebra as they act on sets of solution mappings. Unary operators transform from one set of solution mappings (possibly with additional arguments) to another set of solution mappings. Binary operators transform two sets of solution mappings to another set of solution mappings. In Table 4, we define the operators of this set algebra. This algebra is not minimal: some operators (per, e.g., the definition of left-outer join) can be expressed using the other operators.

Table 4

Set algebra, where M, M1, and M2 are sets of solution mappings; V is a set of variables and v is a variable; and R is a built-in expression

M1 ⋈ M2 := {μ1 ∪ μ2 | μ1 ∈ M1, μ2 ∈ M2 and μ1 ∼ μ2}   (Natural join)
M1 ∪ M2 := {μ | μ ∈ M1 or μ ∈ M2}   (Union)
M1 ▷ M2 := {μ1 ∈ M1 | ∄μ2 ∈ M2 : μ1 ∼ μ2}   (Anti-join)
M1 − M2 := {μ1 ∈ M1 | ∄μ2 ∈ M2 : μ1 ∼ μ2 and μ1 ≬ μ2}   (Minus)
M1 ⟕ M2 := (M1 ⋈ M2) ∪ (M1 ▷ M2)   (Left-outer join)
π_V(M) := {μ′ | ∃μ ∈ M : μ′ ∼ μ and dom(μ′) = V ∩ dom(μ)}   (Projection)
σ_R(M) := {μ ∈ M | μ ⊨ R}   (Selection)
β_{R,v}(M) := {μ ∪ {v/R(μ)} | μ ∈ M}   (Bind)

3.5.2.Navigational graph patterns

Given an RDF graph G, we define the set of terms appearing as a subject or object in G as follows: so(G) := {x | ∃p, y : (x,p,y) ∈ G or (y,p,x) ∈ G}. We can then define the evaluation of path expressions as shown in Table 5 [34], which returns a set of pairs of nodes connected by a path in the RDF graph G that satisfies the given path expression.

Table 5

Path expressions where G is an RDF graph, p, p1, …, pn are IRIs, and e, e1, e2 are path expressions

p(G) := {(s,o) | (s,p,o) ∈ G}   (Predicate)
!(p1|…|pn)(G) := {(s,o) | ∃q : (s,q,o) ∈ G and q ∉ {p1,…,pn}}   (Negated property set)
!(^p1|…|^pn)(G) := {(s,o) | ∃q : (o,q,s) ∈ G and q ∉ {p1,…,pn}}   (Negated inverse property set)
!(p1|…|pk|^pk+1|…|^pn)(G) := !(p1|…|pk)(G) ∪ !(^pk+1|…|^pn)(G)   (Negated (inverse) property set)
^e(G) := {(s,o) | (o,s) ∈ e(G)}   (Inverse)
e1/e2(G) := {(x,z) | ∃y : (x,y) ∈ e1(G) and (y,z) ∈ e2(G)}   (Concatenation)
e1|e2(G) := e1(G) ∪ e2(G)   (Disjunction)
e+(G) := {(y1,yn+1) | ∃y2, …, yn such that for 1 ≤ i ≤ n : (yi,yi+1) ∈ e(G)}   (One-or-more)
e*(G) := e+(G) ∪ {(x,x) | x ∈ so(G)}   (Zero-or-more)
e?(G) := e(G) ∪ {(x,x) | x ∈ so(G)}   (Zero-or-one)

Given a navigational graph pattern N, we denote by paths(N) := {e ∈ P | ∃s, o : (s,e,o) ∈ N} the set of path expressions used in N (including simple IRIs, but not variables). We define the path graph of G under N, which we denote by G_N, as the set of triples that materialise paths of N in G; more formally, G_N := {(s,e,o) | e ∈ paths(N) and (s,o) ∈ e(G)}.
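For example (predicates assumed), the following navigational graph pattern combines disjunction and concatenation of paths; its path graph G_N contains a triple (s, (:mother|:father)/:sister, o) for each pair of nodes (s,o) connected by such a path in G:

    SELECT ?x ?n
    WHERE {
      ?x (:mother|:father)/:sister ?a .   # an aunt of ?x: a parent step followed by a sister step
      ?a :name ?n .
    }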

3.5.3.Service federation

The SERVICE feature allows for sending graph patterns to remote SPARQL services. In order to define this feature, we denote by ω a federation mapping from IRIs to services such that, given an IRI x ∈ I, ω(x) returns the service S hosted at x, or returns ε in the case that no such service exists or can be retrieved. We denote by S.Q(D_S) the evaluation of a query Q on a remote service S. When a service is not indicated (e.g., Q(D)), we assume that the query is evaluated on the local service. Finally, in the case that ω(x) = ε, we define that ε.Q(D_ε) invokes a query-level error ε – i.e., the evaluation of the entire query fails – when SILENT is not invoked, while it instead returns the set containing only the empty solution mapping {μ∅} when SILENT is invoked.

3.5.4.Set evaluation

The set evaluation of a SPARQL graph pattern transforms a SPARQL dataset D into a set of solution mappings. The base evaluation is given in terms of B(D) and N(D), i.e., the evaluation of a basic graph pattern B and a navigational graph pattern N over a dataset D, which generate sets of solution mappings. These solution mappings can then be transformed and combined using the aforementioned set algebra by invoking the corresponding pattern. The set evaluation of graph patterns is then defined in Table 6. We remark that for the definition of fe (filter exists) and fne (filter not exists), there is some ambiguity about what μ(Q2) precisely means when Q2 involves variables mentioned outside of a basic graph pattern or a path expression; this is a known issue for the SPARQL 1.1 standard [27,32,42,55], which we will discuss in more detail in Section 3.10.

Table 6

Set evaluation of graph patterns where D is a dataset; B is a basic graph pattern; N is a navigational graph pattern; Q, Q1 and Q2 are graph patterns; V is a set of variables; R is a built-in expression; v is a variable; M is a set of solution mappings; and x is an IRI

B(D) := {μ | ∃α : μ(α(B)) ⊆ G_D and dom(μ) = vars(B) and dom(α) = bnodes(B)}
N(D) := {μ | ∃α : μ(α(N)) ⊆ (G_D)_N ∪ G_D and dom(μ) = vars(N) and dom(α) = bnodes(N)}
[Q1 AND Q2](D) := Q1(D) ⋈ Q2(D)
[Q1 UNION Q2](D) := Q1(D) ∪ Q2(D)
[Q1 FE Q2](D) := {μ ∈ Q1(D) | (μ(Q2))(D) ≠ ∅} (see † below)
[Q1 FNE Q2](D) := {μ ∈ Q1(D) | (μ(Q2))(D) = ∅} (see † below)
[Q1 MINUS Q2](D) := Q1(D) − Q2(D)
[Q1 OPT Q2](D) := Q1(D) ⟕ Q2(D)
SELECT_V(Q)(D) := π_V(Q(D))
FILTER_R(Q)(D) := σ_R(Q(D))
BIND_{R,v}(Q)(D) := β_{R,v}(Q(D))
VALUES_M(Q)(D) := Q(D) ⋈ M
GRAPH_x(Q)(D) := Q((G_D[x], named(D)))
GRAPH_v(Q)(D) := ⋃_{(xi,Gi) ∈ named(D)} β_{xi,v}(Q((Gi, named(D))))
[Q1 SERVICE_x^false Q2](D) := Q1(D) ⋈ ω(x).Q2(D_{ω(x)}) (if ω(x) = ε, a query-level error is raised)
[Q1 SERVICE_x^true Q2](D) := Q1(D) ⋈ ω(x).Q2(D_{ω(x)}) (if ω(x) = ε, the right operand instead evaluates to {μ∅})

† μ(Q2) is not well-defined in all cases. Please see Section 3.10 for discussion.
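To illustrate how a concrete query decomposes into these patterns (IRIs assumed), the query below corresponds to SELECT_{?s,?n}(FILTER_R([B1 OPT B2])), where B1 and B2 are the basic graph patterns indicated and R is the filter expression; its set evaluation is π_{?s,?n}(σ_R(B1(D) ⟕ B2(D))).

    SELECT ?s ?n
    WHERE {
      ?s a :Person .                     # B1
      OPTIONAL { ?s :name ?n }           # B2
      FILTER(!BOUND(?n) || ?n != "")     # R
    }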

3.6.Query patterns: Bag semantics

Query patterns evaluated under bag semantics return bags of solution mappings. Like under set semantics, we first define a bag algebra of operators and then define the bag evaluation of SPARQL graph patterns.

3.6.1.Bag algebra

The bag algebra is analogous to the set algebra, but further operates over the multiplicity of solution mappings. We define this algebra in Table 7.

Table 7

Bag algebra where M, M1, and M2 are bags of solution mappings; μ, μ′, μ1 and μ2 are solution mappings; V is a set of variables and v is a variable; and R is a built-in expression; for legibility, we use Iverson bracket notation, where [ϕ] = 1 if and only if ϕ holds; otherwise, if ϕ is false or undefined, then [ϕ] = 0

(M1 ⋈ M2)(μ) := Σ_{(μ1,μ2) ∈ M1×M2} M1(μ1) · M2(μ2) · [μ1 ∼ μ2 and μ1 ∪ μ2 = μ]   (Natural join)
(M1 ∪ M2)(μ) := M1(μ) + M2(μ)   (Union)
(M1 ▷ M2)(μ) := M1(μ) · [∄μ′ ∈ M2 : μ ∼ μ′]   (Anti-join)
(M1 − M2)(μ) := M1(μ) · [∄μ′ ∈ M2 : μ ∼ μ′ and μ ≬ μ′]   (Minus)
(M1 ⟕ M2)(μ) := ((M1 ⋈ M2) ∪ (M1 ▷ M2))(μ)   (Left-outer join)
π_V(M)(μ) := Σ_{μ′ ∈ M} M(μ′) · [dom(μ) = V ∩ dom(μ′) and μ ∼ μ′]   (Projection)
σ_R(M)(μ) := M(μ) · [μ ⊨ R]   (Selection)
β_{R,v}(M)(μ) := Σ_{μ′ ∈ M} M(μ′) · [μ′ ∪ {v/R(μ′)} = μ]   (Bind)

3.6.2.Bag evaluation

The bag evaluation of a graph pattern is based on the bag evaluation of basic graph patterns and navigational graph patterns, as defined in Table 8, where the multiplicity of each individual solution is based on how many blank node mappings satisfy the solution. With the exceptions of fe and fne – which are also defined in Table 8 – the bag evaluation of other graph patterns then follows from Table 6 by simply replacing the set algebra (from Table 4) with the bag algebra (from Table 7). Note that VALUES_M(Q) can now also accept a bag of solutions M, and that there are again issues with the definition of fe and fne that will be discussed in Section 3.10. The set evaluation of paths (from Table 5) is again used, with the exception that some path expressions in navigational patterns are rewritten (potentially recursively) to analogous query operators under bag semantics. Table 9 lists these rewritings.

Table 8

Bag evaluation of graph patterns where D is a dataset; B is a basic graph pattern; N is a navigational graph pattern; Q1, Q2 are graph patterns

B(D)(μ) := |{α | μ(α(B)) ⊆ G_D and dom(α) = bnodes(B)}| · [dom(μ) = vars(B)]
N(D)(μ) := |{α | μ(α(N)) ⊆ (G_D)_N and dom(α) = bnodes(N)}| · [dom(μ) = vars(N)]
[Q1 FE Q2](D)(μ) := Q1(D)(μ) · [(μ(Q2))(D) ≠ ∅] (see † below)
[Q1 FNE Q2](D)(μ) := Q1(D)(μ) · [(μ(Q2))(D) = ∅] (see † below)

† μ(Q2) is not well-defined in all cases. Please see Section 3.10 for discussion.

Table 9

Bag evaluation of navigational patterns where D is a dataset, N is a navigational pattern, and x is a fresh blank node

N(D) :=
  ({(o, e, s)} ∪ (N ∖ {(s, ^e, o)}))(D)   if there exists (s, ^e, o) ∈ N;
  ({(s, e1, x), (x, e2, o)} ∪ (N ∖ {(s, e1/e2, o)}))(D)   if there exists (s, e1/e2, o) ∈ N;
  [[{(s, e1, o)} UNION {(s, e2, o)}] AND (N ∖ {(s, e1|e2, o)})](D)   if there exists (s, e1|e2, o) ∈ N;
  N(D) under set semantics   otherwise.
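For example (predicates assumed), the disjunctive path below is rewritten, per Table 9, to a UNION pattern before bag evaluation; as a result, if a graph contains both (:a, :mother, :b) and (:a, :parent, :b), the solution {?x/:a, ?y/:b} is returned with multiplicity 2.

    # original pattern:
    SELECT ?x ?y WHERE { ?x :mother|:parent ?y }

    # rewriting used for bag evaluation:
    SELECT ?x ?y WHERE { { ?x :mother ?y } UNION { ?x :parent ?y } }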

3.7.Group-by patterns: Aggregation

Let (μ, M) denote a solution group, where μ is a solution mapping called the key of the solution group, and M is a bag of solution mappings called the bag of the solution group. Group-by patterns then return a set of solution groups ℳ := {(μ1, M1), …, (μn, Mn)} as their output. We now define their semantics along with the agg graph pattern, which allows for converting a set of solution groups to a set of solution mappings.

3.7.1.Aggregation algebra

We define an aggregation algebra in Table 10 under bag semantics with four operators that support the generation of a set of solution groups (aka. group by), the selection of solution groups (aka. having), the binding of new variables in the key of the solution group, as well as the flattening of a set of solution groups to a set of solution mappings by projecting their keys. Note that, analogously to the notation for built-in expressions, given an aggregation expression A, we denote by A(M) the result of evaluating A over M, and we denote by M ⊨ A the condition that M satisfies A, i.e., that A(M) returns a value interpreted as true. The aggregation algebra can also be defined under set semantics: letting M denote a set of solution mappings, we can evaluate γ_V(bag(M)) before other operators.

Table 10

Aggregation algebra under bag semantics, where M and M′ are bags of solution mappings; V is a set of variables and v is a variable; A is an aggregation expression; and ℳ is a set of solution groups; we recall that 𝐌 is the set of all solution mappings

γ_V(M) := {(μ, M′) | μ ∈ π_V(M) and ∀μ′ ∈ 𝐌 : M′(μ′) = M(μ′) · [π_V({μ′}) = {μ}]}   (Group (bag))
σ_A(ℳ) := {(μ, M) ∈ ℳ | M ⊨ A}   (Selection (aggregation))
β_{A,v}(ℳ) := {(μ ∪ {v/A(M)}, β_{A(M),v}(M)) | (μ, M) ∈ ℳ}   (Bind (aggregation))
ζ(ℳ) := {μ | (μ, M) ∈ ℳ}   (Flatten)

3.7.2.Aggregation evaluation

We can use the previously defined aggregation algebra to define the semantics of group-by patterns in terms of their evaluation, per Table 11.

Table 11

Evaluation of group-by patterns where D is a dataset, Q is a graph pattern or group-by pattern, V is a set of variables, v1, …, vn are variables, and A, A1, …, An are aggregation expressions

GROUP_V(Q)(D) := γ_V(Q(D))
HAVING_A(Q)(D) := σ_A(Q(D))
AGG_{{(A1,v1),…,(An,vn)}}(Q)(D) := ζ(β_{A1,v1}(⋯(β_{An,vn}(Q(D)))⋯)); note: AGG_∅(Q)(D) := ζ(Q(D))
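For example (IRIs assumed), the following query groups solutions by ?dept, keeps only groups whose average salary exceeds 1000, and counts the solutions in each remaining group; in the abstract syntax it takes (roughly) the form SELECT_{?dept,?n}(AGG_{{(count(·),?n)}}(HAVING_{avg(?sal)>1000}(GROUP_{{?dept}}(B)))), where B is the basic graph pattern of the WHERE clause.

    SELECT ?dept (COUNT(*) AS ?n)
    WHERE { ?x :dept ?dept ; :salary ?sal . }
    GROUP BY ?dept
    HAVING (AVG(?sal) > 1000)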

3.8.Sequence patterns and semantics

Sequence patterns return sequences of solutions as their output, which allow duplicates and also maintain an ordering. These sequence patterns in general refer to solution modifiers that allow for ordering solutions, slicing the set of solutions, and removing duplicate solutions. We will again first define an algebra before defining the evaluation of sequence patterns.

3.8.1.Sequence algebra

Sequences deal with some ordering over solutions. We assume a total order ⩽ over I ∪ B ∪ L ∪ {⊥} to be defined by the service (see Section 3.3), i.e., over the set of all RDF terms and unbounds. Given a non-empty sequence of order comparators Ω := ((R1,Δ1), …, (Rn,Δn)), we define the total ordering ⩽_Ω of solution mappings as follows:

  • μ1 =_Ω μ2 if and only if Ri(μ1) = Ri(μ2) for all 1 ≤ i ≤ n;

  • otherwise let j denote the least value 1 ≤ j ≤ n such that Rj(μ1) ≠ Rj(μ2); then:

    • Rj(μ1) < Rj(μ2) and Δj implies μ1 <_Ω μ2;

    • Rj(μ1) > Rj(μ2) and Δj implies μ1 >_Ω μ2;

    • Rj(μ1) < Rj(μ2) and not Δj implies μ1 >_Ω μ2;

    • Rj(μ1) > Rj(μ2) and not Δj implies μ1 <_Ω μ2.

In Table 12, we present an algebra composed of three operators for ordering sequences of solutions based on order comparators, and removing duplicates.

Table 12

Sequence algebra, where M and M′ are sequences of solutions, and Ω is a non-empty sequence of order comparators

order_Ω(M) := M′ such that bag(M′) = bag(M) and M′[i] ⩽_Ω M′[j] for all 1 ≤ i < j ≤ |M′|   (Order by)
distinct(M) := dist(M)   (Distinct)
reduced(M) := M′ such that M′ ⊆ M and set(M′) = set(M)   (Reduced)

3.8.2.Sequence evaluation

Using the sequence algebra, we can then define the evaluation of sequence patterns as shown in Table 13.

Table 13

Evaluation of sequence patterns where Ω is a non-empty sequence of order comparators, and k is a (non-zero) natural number

ORDER_Ω(Q)(D) := order_Ω(seq(Q(D)))
DISTINCT(Q)(D) := distinct(seq(Q(D)))
REDUCED(Q)(D) := reduced(seq(Q(D)))
OFFSET_k(Q)(D) := seq(Q(D))[(k+1)..|Q(D)|]
LIMIT_k(Q)(D) := seq(Q(D))[1..k]
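For example (IRIs assumed), the query below can be written as SELECT_{?x,?n}(LIMIT_10(OFFSET_5(ORDER_{((?n,true))}(B)))); per Table 13, the solutions of B are ordered by ?n ascending, the first 5 are dropped, and the first 10 of the remainder are returned.

    SELECT ?x ?n
    WHERE { ?x :name ?n }
    ORDER BY ASC(?n)
    LIMIT 10
    OFFSET 5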

3.9.Safe and possible variables

We now characterise different types of variables that may appear in a graph pattern in terms of being always bound, never bound, or sometimes bound in the solutions to the graph pattern. This characterisation will become important for rewriting algebraic expressions [53]. Specifically, letting Q denote a graph pattern, recall that we denote by vars(Q) all of the variables mentioned (possibly nested) in Q; furthermore:

  • we denote by svars(Q) the safe variables of Q, defined to be the set of variables v ∈ V such that, for all datasets D, if μ ∈ Q(D), then v ∈ dom(μ);

  • we denote by pvars(Q) the possible variables of Q, defined to be the set of variables v ∈ vars(Q) such that there exists a dataset D and a solution μ ∈ Q(D) where v ∈ dom(μ).

Put more simply, svars(Q) denotes variables that are never unbound in any solution, while pvars(Q) denotes variables that may be unbound, but that are bound in at least one solution over some dataset.

Example 3.1.

Consider a graph pattern Q (sketched below). Now:

  • vars(Q) = {?s, ?x, ?y, ?z};

  • svars(Q) = {?s};

  • pvars(Q) = {?s, ?x, ?y}.
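One hypothetical pattern (predicates assumed) consistent with the three sets above is the following: ?s is bound in every solution, ?x and ?y are bound only when the corresponding OPTIONAL matches, and ?z, which occurs only on the right of MINUS, is never bound in any solution.

    {
      ?s :p :o .
      OPTIONAL { ?s :q ?x }
      OPTIONAL { ?s :r ?y }
      MINUS { ?s :t ?z }
    }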

Unfortunately, given a graph pattern Q, deciding if v ∈ svars(Q) or v ∈ pvars(Q) is undecidable for the full SPARQL 1.1 language, as it can answer the question of the satisfiability of Q, i.e., whether or not Q has any solution over any dataset; specifically, v ∉ vars(Q) is safe in Q if and only if Q is unsatisfiable, while v ∉ vars(Q) is possible in BIND_{1,v}(Q) if and only if Q is satisfiable. For this reason we resort to syntactic approximations of the notions of safe and possible variables. In fact, when we say that Q1 is a graph pattern on V1, we can consider V1 to be a syntactic over-approximation of the possible variables of Q1 (called “in-scope” variables by the standard [24]), which ensures no clashes of variables in solutions (e.g., it prevents defining BIND_{R,v}(Q) when Q can already bind v). Later (in Table 21) we will define a syntactic over-approximation of safe variables to decide when it is okay to apply rewriting rules that are invalid in the presence of unbound variables.

3.10.Issues with (NOT) EXISTS

The observant reader may have noticed that in Table 6 and Table 8, in the definitions of [Q1 FE Q2] (FILTER EXISTS) and [Q1 FNE Q2] (FILTER NOT EXISTS), we have used the expression μ(Q2). While we have defined this for a basic graph pattern B and a navigational graph pattern N – replace any variable v ∈ dom(μ) appearing in B or N, respectively, with μ(v) – per the definition of the syntax, Q2 can be any graph pattern. This leaves us with the question of how μ(SELECT_V(Q)) or μ([Q1 MINUS Q2]), for example, is defined. Even in the case where Q2 is a basic graph pattern, it is unclear how we should handle variables that are replaced with blank nodes, or predicate variables that are replaced with literals. The standard does not define such cases unambiguously [27,32,42,55]. The precise semantics of [Q1 FE Q2] and [Q1 FNE Q2] thus cannot be defined until the meaning of μ(Q2) – which the standard calls substitution – is clarified. We provide an example illustrating one issue that arises.

Example 3.2.

We will start with a case that is not semantically ambiguous. Take a query that joins the triple pattern (?x, :sister, ?y) with a FILTER NOT EXISTS over the pattern (?y, :sister, ?z), as sketched at the end of this example. To evaluate it on a dataset D, we take each solution μ ∈ {(?x, :sister, ?y)}(D) from the left of the FILTER NOT EXISTS and keep the solution μ in the final results of the query if and only if the evaluation of the pattern {(μ(?y), :sister, ?z)}(D) is empty.

We next take an example of a query that is syntactically valid, but semantically ill-defined, in which the right-hand side of the FILTER NOT EXISTS is a sub-SELECT projecting ?y (also sketched at the end of this example). Given a solution μ from the left, if we follow the standard literally and replace “every occurrence of a variable v in [the right pattern] by μ(v) for each v in dom(μ)”, the result is a pattern SELECT_{μ(?y)}(μ(Q2)). For example, if μ(?y) = :a, then the right pattern, in concrete syntax, would become a sub-SELECT with the IRI :a in place of the projected variable ?y, which is syntactically invalid.
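For reference, the two queries can be written as follows; the first follows directly from the description above, while the second is one hypothetical form of the ill-defined case.

    # unambiguous case:
    SELECT ?x ?y
    WHERE {
      ?x :sister ?y .
      FILTER NOT EXISTS { ?y :sister ?z }
    }

    # ill-defined case: substituting ?y (e.g., by :a) inside the sub-SELECT
    # would yield a clause of the form SELECT :a, which is syntactically invalid
    SELECT ?x ?y
    WHERE {
      ?x :sister ?y .
      FILTER NOT EXISTS { SELECT ?y WHERE { ?y :sister ?z } }
    }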

A number of similar issues arise from ambiguities surrounding substitution, and while work is underway to clarify this issue, at the time of writing, various competing proposals are being discussed [32,42,55]. We thus postpone rewriting rules for negation until a standard semantics for substitution is agreed upon.

3.11.Queries

A query accepts a set, bag or sequence of solution mappings, depending on the semantics selected (and the features supported). In the case of a SELECT query, the output will likewise be a set, bag or sequence of solution mappings, potentially projecting away some variables. An ASK query rather outputs a boolean value. Finally, CONSTRUCT and DESCRIBE queries output an RDF graph. We will define the evaluation of these queries in terms of solution sequences, though the definitions generalise naturally to bags and sets (through bag(·) and set(·)). First we give some preliminary notation. Given a sequence of solution mappings M, we denote by π_V(M) a projection that preserves the input order of solution mappings (and such that bag(π_V(M)) = π_V(bag(M))). Given a dataset D and a set of RDF terms X, we assume a function describe(X,D), to be defined by the service (see Section 3.3), that returns an RDF graph “describing” each term x ∈ X; as a simple example, describe(X,D) may be defined as the set of triples in G_D that mention any x ∈ X. The evaluation of queries is then shown in Table 14.

Table 14

Evaluation of queries where D is a dataset, Q is a sequence pattern or graph pattern, V is a set of variables, B is a basic graph pattern, and X is a set of IRIs and/or variables

SELECT_V(Q)(D) := π_V(Q(D))
ASK(Q)(D) := true if set(Q(D)) ≠ ∅; false otherwise
CONSTRUCT_B(Q)(D) := ⋃_{μ ∈ Q(D)} {μ(s,p,o) ∈ (I∪B) × I × (I∪B∪L) | (s,p,o) ∈ B}
DESCRIBE_X(Q)(D) := describe(X′, D), where X′ := {x′ ∈ I∪B∪L | x′ ∈ X, or ∃x ∈ X ∩ V, ∃μ ∈ Q(D) : μ(x) = x′}

3.12.Dataset modifier

Queries are evaluated on a SPARQL dataset D, where dataset modifiers allow for changing the dataset considered for query evaluation. First, let X and X′ denote (possibly empty or overlapping) sets of IRIs and let D denote a SPARQL dataset. We then denote by D(X,X′) := (⋃_{x∈X} G_D[x], {(x′, G_D[x′]) | x′ ∈ X′}) a new dataset formed from D by merging all named graphs of D named by X to form the default graph of D(X,X′), and by selecting all named graphs of D named by X′ as the named graphs of D(X,X′). We define the semantics of dataset modifiers in Table 15.

Table 15

Evaluation of dataset modifiers where X and X′ are sets of IRIs

FROM_{X,X′}(Q)(D) := Q(D(X,X′))

3.13.Non-determinism

A number of features can lead to non-determinism in the evaluation of graph patterns as previously defined. When such features are used, there may be more than one possible valid result for the graph pattern on a dataset. These features are as follows:

  • Built-in expressions and aggregation expressions may rely on non-deterministic functions, such as rand() to generate a random number, SAMPLE to randomly sample solutions from a group, etc.

  • REDUCED(Q)(D) permits a range of multiplicities for solutions (between those for Q(D) under bag semantics and DISTINCT(Q)(D)).

  • The use of sequence patterns without an explicit ORDER_Ω(Q) gives a non-deterministic ordering (e.g., with OFFSET_k(Q) and/or LIMIT_k(Q)).

In non-deterministic cases, we can say that Q(D) returns a family of (potentially infinite) valid sets/bags/sequences of solutions, denoting the space of possible results for evaluating the graph pattern. In practice any such set/bag/sequence of solutions can be returned. If Q(D) returns a singleton family for all datasets D, we consider Q to be deterministic (even if using a non-deterministic feature), where it returns the set/bag/sequence of solutions rather than the singleton; for example, we assume that REDUCED(Q) is deterministic if Q cannot generate duplicates.

3.14.Relationships between the semantics

The SPARQL standard is defined in terms of sequence semantics, i.e., it is assumed that the solutions returned have an order. However, unless the query explicitly uses the sequence algebra (and in particular ORDER BY), then the semantics is analogous to bag semantics in the sense that the ordering of solutions in the results is arbitrary. Likewise when a query does not use the sequence algebra or the aggregation algebra, but invokes SELECT DISTINCT (in the outermost query), ASK, CONSTRUCT or DESCRIBE, then the semantics is analogous to set semantics. Note however that when the aggregate or sequence algebra is included, set semantics is not the same as bag semantics with DISTINCT. Under set semantics, intermediate results are treated as sets of solutions. Under bag semantics, intermediate results are treated as bags of solutions, where final results are deduplicated. If we apply a count aggregation, for example, then set semantics will disregard duplicate solutions, while bag semantics with distinct will consider duplicate solutions (the distinct is applied to the final count, with no effect).

3.15.Query containment and equivalence

Query containment states that the results of one graph pattern are contained in the results of the other. To begin, take two deterministic graph patterns Q1 and Q2. We say that Q1 is contained in Q2 under set, bag or sequence semantics, denoted Q1 ⊑ Q2, if and only if for every dataset D, it holds that Q1(D) ⊆ Q2(D).

If Q1 and Q2 are non-deterministic, then under set semantics we assume that Q1(D) and Q2(D) will each return a family of sets of solutions. If for every dataset D and for all M1 ∈ Q1(D), there exists an M2 ∈ Q2(D) such that M1 ⊆ M2, then we say that Q1 is contained in Q2 under set semantics, again denoted Q1 ⊑ Q2. On the other hand, if Q1 is deterministic and Q2 is non-deterministic, then Q1 ⊑ Q2 if and only if for every dataset D and for all M ∈ Q2(D), it holds that Q1(D) ⊆ M. Conversely, if Q2 is deterministic and Q1 is non-deterministic, then Q1 ⊑ Q2 if and only if for every dataset D and for all M1 ∈ Q1(D), it holds that M1 ⊆ Q2(D). Containment can be defined analogously for bags or sequences.

Query equivalence is a relation between graph patterns that states that the results of one graph pattern are equal to those of the other. Specifically, given two graph patterns Q1 and Q2 (be they deterministic or non-deterministic), we say that they are equivalent under set, bag or sequence semantics, denoted Q1 ≡ Q2, if and only if for every dataset D, it holds that Q1(D) = Q2(D). We remark that if Q1 and Q2 are deterministic, then Q1 ≡ Q2 if and only if Q1 ⊑ Q2 and Q2 ⊑ Q1 under the corresponding semantics. If they are non-deterministic, then Q1 ≡ Q2 implies Q1 ⊑ Q2 and Q2 ⊑ Q1, but Q1 ⊑ Q2 and Q2 ⊑ Q1 does not necessarily imply Q1 ≡ Q2.

Example 3.3.

In Fig. 1 we provide examples of query containment and equivalence. The leftmost query finds the maternal grandparents of :Bob while the latter three queries find both maternal and paternal grandparents. Hence the first query is contained in the latter three queries, which are themselves equivalent.
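Following the description of Fig. 1, the sketches below (predicates assumed) illustrate the containment: the first query finds only the maternal grandparents of :Bob, while a query taking the union over mothers and fathers, such as the second, finds both maternal and paternal grandparents, and hence contains the first.

    # contained query: maternal grandparents of :Bob
    SELECT ?gp
    WHERE { :Bob :mother ?m . ?m :parent ?gp . }

    # containing query: maternal and paternal grandparents of :Bob
    SELECT ?gp
    WHERE {
      { :Bob :mother ?p } UNION { :Bob :father ?p }
      ?p :parent ?gp .
    }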

Fig. 1. Examples of query containment and equivalence.

Fig. 2. Example of query congruence.

Regarding equivalence of non-deterministic graph patterns, we highlight that any change to the possible space of results leads to a non-equivalent graph pattern. For example, for a graph pattern Q, it holds that DISTINCT(Q) ⊑ REDUCED(Q) ⊑ Q, and if Q cannot return duplicates (e.g., Q is a basic graph pattern without blank nodes), then DISTINCT(Q) ≡ REDUCED(Q) ≡ Q. However, if Q may give duplicates, then DISTINCT(Q) ≢ REDUCED(Q) ≢ Q ≢ DISTINCT(Q) under bag or sequence semantics. Likewise, for example, replacing a function like RAND() in Q with a constant like 0.5 changes the semantics of Q, generating a non-equivalent graph pattern.

While the previous discussion refers to graph patterns (which may include the use of (sub)SELECT), we remark that containment and equivalence can be defined for ASK, CONSTRUCT and DESCRIBE queries in a natural way. For two deterministic ASK queries Q1 and Q2, we say that Q1 is contained in Q2, denoted Q1 ⊑ Q2, if and only if for any dataset D, it holds that Q1(D) implies Q2(D); i.e., for any dataset on which Q1 returns true, Q2 also returns true. For two deterministic CONSTRUCT queries or DESCRIBE queries Q1 and Q2, we say that Q1 is contained in Q2, denoted Q1 ⊑ Q2, if and only if for any dataset D, it holds that Q2(D) entails Q1(D) under simple entailment. Two queries Q1 and Q2 are then equivalent, denoted Q1 ≡ Q2, if and only if Q1 ⊑ Q2 and Q2 ⊑ Q1. Containment and equivalence of non-deterministic queries are then defined as before.

3.16.Query isomorphism and congruence

Many use-cases for canonicalisation prefer not to distinguish queries that are equivalent up to variable names. We call a one-to-one variable mapping ρ : V → V a variable renaming. We say that Q1 and Q2 are isomorphic, denoted Q1 ∼ Q2, if and only if there exists a variable renaming ρ such that ρ(Q1) = Q2. Note that in the case of SELECT queries, isomorphism does not imply equivalence, as variable naming matters to the solutions produced. For this reason we introduce the notion of query congruence. Formally, we say that two graph patterns Q1 and Q2 are congruent, denoted Q1 ≅ Q2, if and only if there exists a variable renaming ρ such that ρ(Q1) ≡ Q2. It is not difficult to see that isomorphism implies congruence.

Example 3.4.

We provide an example of non-equivalent but congruent queries in Fig. 2. If we rewrite the variable ?gp to ?x in the first query, we see that the two queries become equivalent.

Like equivalence and isomorphism, congruence is reflexive (Q ≅ Q), symmetric (Q1 ≅ Q2 implies Q2 ≅ Q1) and transitive (Q1 ≅ Q2 and Q2 ≅ Q3 imply Q1 ≅ Q3); in other words, congruence is an equivalence relation. We remark that congruence is the same as equivalence for ASK, CONSTRUCT and DESCRIBE queries, since the particular choice of variable names does not affect the output of such queries in any way.

3.17.Query classes

Based on a query of the form SELECT_V(Q), we define twelve syntactic query classes corresponding to classes that have been well-studied in the literature.

  • basic graph patterns (BGPs): Q is a BGP and V = vars(Q).

  • unions of basic graph patterns (UBGPs): Q is a graph pattern using BGPs and UNION, and V = vars(Q).

  • conjunctive queries (CQs): Q is a BGP.

  • unions of conjunctive queries (UCQs): Q is a graph pattern using BGPs and UNION.

  • monotone queries (MQs): Q is a graph pattern using BGPs, UNION and AND.

  • non-monotone queries (NMQs): Q is a graph pattern using BGPs, UNION, AND and MINUS.

  • navigational graph patterns (NGPs): Q is an NGP and V = vars(Q).

  • unions of navigational graph patterns (UNGPs): Q is a graph pattern using NGPs and UNION, and V = vars(Q).

  • conjunctive path queries (CPQs): Q is an NGP.

  • unions of conjunctive path queries (UCPQs): Q is a graph pattern using NGPs and UNION.

  • monotone path queries (MPQs): Q is a graph pattern using NGPs, UNION and AND.

  • non-monotone path queries (NMPQs): Q is a graph pattern using NGPs, UNION, AND and MINUS.

These query classes are evaluated on an RDF graph (the default graph) rather than an RDF dataset, though results extend naturally to the RDF dataset case. Likewise, since we do not consider the sequence algebra, we have the (meaningful) choice of set or bag semantics under which to consider the tasks; furthermore, since the aggregation algebra is not considered, set and distinct-bag semantics coincide.

Unlike UCQs, which are strictly unions of joins (expressed as basic graph patterns), MQs further permit joins over unions. As such, UCQs are analogous to a disjunctive normal form. Though any monotone query (under set semantics) can be rewritten to an equivalent UCQ, certain queries can be much more concisely expressed as MQs versus UCQs; or, put another way, there exist MQs that are exponentially longer when rewritten as UCQs. For example, the first three queries of Fig. 1 are MQs, but only the third is a UCQ; if we use a similar pattern as the third query to search back n generations, then we would require 2^n BGPs with n triple patterns each; if we rather use the most concise MQ, based on the second query, we would need only 2n BGPs with one triple pattern each.
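As a rough, self-contained illustration of this blow-up (our own sketch, not tied to the exact queries of Fig. 1), the following snippet compares the number of BGPs in the concise MQ form (a join of n binary unions) against its UCQ normal form.

```python
from itertools import product

def mq_vs_ucq(n):
    mq_bgps = 2 * n                                        # 2n singleton BGPs in the MQ
    ucq_bgps = sum(1 for _ in product((0, 1), repeat=n))   # 2^n BGPs of n patterns each
    return mq_bgps, ucq_bgps

print(mq_vs_ucq(10))   # (20, 1024)
```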

CPQs and UCPQs are closely related to the query fragments of conjunctions of 2-way regular path queries (C2RPQs) and unions of conjunctions of 2-way regular path queries (UC2RPQs), but additionally allow negated property sets and variables in the predicate position [34]. NMQs are semantically related to the fragment with BGPs, projection, UNION, AND, OPTIONAL and FILTER(!bound(·)) as studied by Pérez et al. [44].

3.18.Complexity

We here consider four decision problems:

Query Evaluation

Given a solution μ, a query Q and a graph G, is μ ∈ Q(G)?

Query Containment

Given two queries Q1 and Q2, does Q1 ⊑ Q2 hold?

Query Equivalence

Given two queries Q1 and Q2, does Q1 ≡ Q2 hold?

Query Congruence

Given two queries Q1 and Q2, does Q1 ≅ Q2 hold?

In Table 16, we summarise known complexity results for these four tasks considering both set and bag semantics, along with a reference for each result. The results refer to combined complexity, where the size of the queries and of the data (in the case of Evaluation) are included. The “Full” class refers to any SELECT query using any of the deterministic SPARQL features, while BGP′, UBGP′, NGP′ and UNGP′ refer to BGPs, UBGPs, NGPs and UNGPs without blank nodes, respectively. We do not present results for query classes allowing paths under bag semantics as we are not aware of work in this direction; lower bounds can of course be inferred from the analogous fragment without paths under bag semantics.

Table 16

Complexity of SPARQL tasks on core fragments (considering combined complexity for Evaluation)

Class | Evaluation | Containment | Equivalence | Congruence

Set semantics
BGP′ | PTime [44] | PTime [1]* | PTime [1]* | GI-complete [1,12]*
UBGP′ | PTime [44]* | PTime [1,49]* | PTime [1,49]* | GI-hard, in NP
CQ | NP-complete [44] | NP-complete [11]* | NP-complete [11]* | NP-complete
UCQ | NP-complete [44] | NP-complete [11]* | NP-complete [11]* | NP-complete
MQ | NP-complete | Π2P-complete [49]* | Π2P-complete [49]* | Π2P-complete
NMQ | PSPACE-complete [44] | Undecidable [60]* | Undecidable [60]* | Undecidable
NGP′ | PTime [34] | PSPACE-complete [34] | in PSpace | GI-hard, in EXPSpace
UNGP′ | PTime [34] | PSPACE-complete [34] | in PSpace | GI-hard, in EXPSpace
CPQ | NP-complete [34] | EXPSPACE-complete [34] | NP-hard, in EXPSpace | NP-hard, in EXPSpace
UCPQ | NP-complete [34] | EXPSPACE-complete [34] | NP-hard, in EXPSpace | NP-hard, in EXPSpace
MPQ | NP-hard [34]* | EXPSPACE-hard [34]* | Π2P-hard | Π2P-hard
NMPQ | PSPACE-hard | Undecidable | Undecidable | Undecidable
Full | PSPACE-hard | Undecidable | Undecidable | Undecidable

Bag semantics
BGP′ | PTime | PTime [1]* | PTime [1]* | GI-complete
CQ | NP-complete | NP-hard, decidability open | GI-complete [12]* | GI-complete
UCQ | NP-complete | Undecidable [30]* | GI-complete [17]* | GI-complete
MQ | NP-complete | Undecidable | GI-hard | GI-hard
NMQ | PSPACE-complete | Undecidable | Undecidable [45]* | Undecidable
Full | PSPACE-hard | Undecidable | Undecidable | Undecidable

An asterisk indicates that the result is not explicitly stated in the cited work, but follows trivially from a result or technique used therein; such cases include analogous results for relational settings, upper or lower bounds from tasks with obvious reductions to or from the stated problem, etc. We may omit references where a result follows directly from other results in the table. A less obvious case is that of Congruence, which has not been studied in detail. However, with the exception of queries without projection (nor blank nodes), the techniques used to prove Equivalence apply analogously to Congruence, which additionally requires resolving non-projected variables whose names may differ across the input queries without affecting the given relation. In the case of BGPs (without projection nor blank nodes), it suffices to find an isomorphism between the input queries; in fact, without projection, since the input graph is a set of triples, BGPs cannot produce duplicates, and thus the results for set and bag semantics coincide.

Some of the more notable results include:

  • The decidability of Containment of CQs under bag semantics is a long open problem [12].

  • Equivalence (and Congruence) of CQs and UCQs are potentially easier under bag semantics (GI-complete) than under set semantics (NP-complete) as the problem under bag semantics relates to isomorphism, rather than homomorphic equivalence under set semantics.

  • Although UCQ and MQ classes are semantically equivalent (each UCQ has an equivalent MQ and vice versa), under set semantics the problems of Containment and Equivalence (and Congruence) are potentially harder for MQs than UCQs; this is because MQs are more concise.

  • While Containment for NMQs is undecidable under set semantics (due to the undecidability of FOL satisfiability), the same problem for UCQs under bag semantics is already undecidable (it can be used to solve Hilbert’s tenth problem).

These results – in particular those of Congruence – form an important background for this work.

4.Problem

With these definitions in hand, we now state the problem we wish to address: given a query Q, we wish to compute a canonical form of the query, can(Q), such that can(Q) ≅ Q (sound), and such that for all queries Q′ with Q ≅ Q′, it holds that can(Q) = can(Q′) (complete). In other words, we aim to compute a syntactically canonical form for the class of queries congruent to Q, where the canonical query is also in that class.

With this canonicalisation procedure, we can decide the congruence Q ≅ Q′ by deciding the equality can(Q) = can(Q′). We can thus conclude from Table 16 that canonicalisation is not feasible for queries in NMQ, as it could be used to solve an undecidable problem. Rather, we aim to provide a sound and complete canonicalisation procedure for MQs (which can be used to decide a Π2P-complete problem, per Table 16) under both bag and set semantics, and a sound procedure for the full language under any semantics. This means that for two queries Q and Q′ that fall outside the MQ class, with a sound but incomplete canonicalisation procedure, can(Q) = can(Q′) implies Q ≅ Q′, but can(Q) ≠ can(Q′) does not necessarily imply that Q and Q′ are not congruent.

Indeed, even in the case of MQs, deciding can(Q) = can(Q′) is likely to be a rather inefficient way to decide Q ≅ Q′. Our intended use-case is rather to partition a set of queries Q = {Q1, …, Qn} into the quotient set Q/≅, i.e., to find all sets of congruent queries in Q. This is useful, for example, in the context of caching applications where Q represents a log or stream of queries: given Qj, we wish to know if there exists a query Qi (i < j) that is congruent to it, in order to reuse its results. Rather than applying pairwise congruence checks, we can canonicalise the queries and use their canonical forms as keys for partitioning. While avoiding pairwise checks does not change the worst-case computational complexity, in practice most queries are small and relatively inexpensive to canonicalise, and the O(|Q|²) cost of pairwise checks can otherwise dominate, particularly for a large set of queries Q. We will later analyse this experimentally. As per the introduction, canonicalisation is also potentially of interest for analysing logs, optimising queries, and/or learning over queries.
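A minimal sketch of this partitioning use-case follows; can(·) is a stub standing in for the canonicalisation procedure developed in the remainder of the paper.

```python
from collections import defaultdict

def can(query: str) -> str:
    # Stub: a real implementation would return the canonical form of the
    # query as described in Sections 6 and 7; here we only normalise
    # whitespace so that the partitioning pattern can be demonstrated.
    return " ".join(query.split())

def partition_by_canonical_form(queries):
    # Each bucket collects queries sharing a canonical form, i.e. (for a
    # complete procedure) one congruence class, with no pairwise checks.
    buckets = defaultdict(list)
    for q in queries:
        buckets[can(q)].append(q)
    return list(buckets.values())

log = ["SELECT ?x WHERE { ?x a :Person }",
       "SELECT  ?x  WHERE { ?x a :Person }"]
print(partition_by_canonical_form(log))   # one bucket containing both queries
```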

5.Related works

In this section, we discuss implementations of systems relating to containment, equivalence and canonicalisation of SPARQL queries.

A number of systems have been proposed to decide the containment of SPARQL queries. Among these, Letelier et al. [36] propose a normal form for quasi-well-designed pattern trees – a fragment of SPARQL allowing restricted use of OPTIONAL over BGPs – and implement a system called SPARQL Algebra for deciding containment and equivalence in this fragment based on the aforementioned normal form. The problem of determining equivalence of SPARQL queries can also be addressed by reductions to related problems. Chekol et al. [14] have used a μ-calculus solver and an XPath-equivalence checker to implement SPARQL containment/equivalence checks. These works implement pairwise checks.

Some systems have proposed isomorphism-based indexing of sub-queries. In the context of a caching system, Papailiou et al. [41] apply a canonical labelling algorithm (specifically Bliss [31]) on BGPs in order to later find isomorphic BGPs with answers available; their approach further includes methods for generalising BGPs such that it is more likely that they will be reused later. More recently, Stadler et al. [57] propose a system called JSAG for solving the containment of SPARQL queries. The system computes normal forms for queries, before representing them as a graph and applying subgraph isomorphism algorithms to detect containments. Such approaches do not discuss completeness, and would appear to miss containments for CQs under set semantics (and distinct-bag semantics), which require checking for homomorphisms rather than (sub-graph) isomorphisms.

We remark that in the context of relational database systems, there are likewise few implementations of query containment, equivalence, etc., as also observed by Chu et al. [15,16], who propose two systems for deciding the equivalence of SQL queries. Their first system, called Cosette [16], translates SQL into logical formulae, where a constraint solver is used to try to find counterexamples for equivalence; if not found, a proof assistant is used to prove equivalence. Chu et al. [15] later proposed the UDP system, which expresses SQL queries – as well as primary and foreign key constraints – in terms of unbounded semiring expressions, thereafter using a proof assistant to test the equivalence of those expressions; this approach is sound and complete for testing the equivalence of UCQs under both set and bag semantics. Zhou et al. [65] recently propose the EQUITAS system, which converts SQL queries into FOL-based formulae, reducing the equivalence problem to a satisfiability-modulo-theories (SMT) problem, which allows for capturing complex selection criteria (inequalities, boolean expressions, cases, etc.). Aside from targeting SQL, a key difference with our approach is that such systems apply pairwise checks.

In summary, while problems relating to containment and equivalence have been well-studied in the theoretical literature, relatively few practical implementations have emerged, perhaps because of the high computational costs, and indeed the undecidability results for the full SPARQL/SQL language. Of those that have emerged, they either offer sound and complete checks in a pairwise manner for query fragments, such as UCQs (e.g., [15]), or they offer sound but incomplete canonicalisation focused on isomorphic equivalence (e.g., [41]). To the best of our knowledge, the approach that we propose here, which we call QCan, is the only one that allows for canonicalising queries with respect to congruence, and that is sound and complete for monotone queries under both set and bag semantics. Our experiments will show that despite high theoretical computational complexity, QCan can be deployed in practice to detect congruent equivalence classes in large-scale, real-world query logs or streams, which are dominated by relatively small and simple queries.

6.Canonicalisation of monotone queries

In this section, we will first describe the different steps of our proposed canonicalisation process for monotone queries (MQs), i.e., queries with basic graph patterns, joins, unions, outer projection and distinct (see Section 3.17). In fact, we consider a slightly larger set of queries that we call extended monotone queries (EMQs), which are monotone queries that additionally support property paths using the (non-recursive) features “/” (followed by), “^” (inverse) and “|” (disjunction); property paths using only these features can be rewritten to joins and unions, and hence EMQs can be rewritten to MQs. We will cover the (sound but incomplete) canonicalisation of other features of SPARQL 1.1 later in Section 7.

As mentioned in the introduction, the canonicalisation process consists of: algebraic rewriting of parts of the query into normal forms, the representation of the query as a graph, the minimisation of the monotonic parts of the query by leaning and containment checks, the canonical labelling of the graph, and finally the mapping back to query syntax. We describe these steps in turn and then conclude the section by proving that canonicalisation is sound and complete for EMQs.

6.1.UCQ normalisation

In this section we describe the rules used to rewrite EMQs into a normal form based on unions of conjunctive queries (UCQs). We first describe the steps we apply for rewriting property paths into monotone features (where possible), thus converting EMQs into MQs. We then describe the rewriting of MQs into UCQ normal form. We subsequently describe some postprocessing of variables to ensure that those with the same name are correlated and that variables that are always unbound are removed. Finally we extend the normal form to take into account set vs. bag semantics.

6.1.1.Property path elimination

Per Table 9, property paths that can be rewritten to joins and unions are considered to be equivalent to their rewritten form under both bag and set semantics. We make these equivalences explicit by rewriting such property paths to joins and unions; i.e.:

(o, ^e, s) ≡ (s, e, o)
(s, e1/e2, o) ≡ {(s, e1, x), (x, e2, o)}
(s, e1|e2, o) ≡ [(s, e1, o) UNION (s, e2, o)]
where x denotes a fresh blank node. The exact rewriting is provided in Table 9. These rewritings may be applied recursively, as needed. If the input query is an EMQ, then the output of the recursive rewriting will be an MQ, i.e., a query without property paths.
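The following sketch (our own encoding of triple patterns as Python tuples; the exact rules are those of Table 9) applies the above rewriting recursively to eliminate ^, / and | from a property path.

```python
import itertools

_fresh = itertools.count()

def fresh_bnode():
    return f"_:b{next(_fresh)}"

def rewrite(tp):
    """Rewrite a triple pattern whose predicate may use ^, / or | (encoded
    as nested tuples) into an AND/UNION structure over plain triple patterns."""
    s, p, o = tp
    if isinstance(p, tuple):
        op, e1, *rest = p
        if op == "^":    # inverse: (s, ^e, o) corresponds to (o, e, s)
            return rewrite((o, e1, s))
        if op == "/":    # sequence: join the two steps via a fresh blank node
            x = fresh_bnode()
            return ("AND", rewrite((s, e1, x)), rewrite((x, rest[0], o)))
        if op == "|":    # alternative: union of the two branches
            return ("UNION", rewrite((s, e1, o)), rewrite((s, rest[0], o)))
    return ("TP", s, p, o)   # base case: a plain triple pattern

# ?x (:mother/:sister) ?z  ==>  AND((?x :mother _:b0), (_:b0 :sister ?z))
print(rewrite(("?x", ("/", ":mother", ":sister"), "?z")))
```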
Example 6.1.

Consider the following query based on Example 1.1, looking for the names of aunts: [query omitted] This query will be rewritten to: [query omitted] And then recursively to: [query omitted] In this case we succeed in removing all property paths; however, property paths with * or + cannot be rewritten to other query features in this way.

6.1.2.Union normalisation

Pérez et al. [44] establish that, under set semantics, joins and unions in SPARQL are commutative and associative, and that joins distribute over unions. We summarise these results in Table 17. Noting that under bag semantics the multiplicity of solutions produced by joins and unions is given by the multiplication and addition of natural numbers, respectively; that both multiplication and addition are commutative and associative; and that multiplication distributes over addition; the same results also apply under bag semantics.

Table 17

Equivalences given by Pérez et al. [44] for set semantics

Join is commutative: [Q1 AND Q2] ≡ [Q2 AND Q1]
Union is commutative: [Q1 UNION Q2] ≡ [Q2 UNION Q1]
Join is associative: [Q1 AND [Q2 AND Q3]] ≡ [[Q1 AND Q2] AND Q3]
Union is associative: [Q1 UNION [Q2 UNION Q3]] ≡ [[Q1 UNION Q2] UNION Q3]
Join distributes over union: [Q1 AND [Q2 UNION Q3]] ≡ [[Q1 AND Q2] UNION [Q1 AND Q3]]

Another (folklore) result of interest is that BGPs can be rewritten to equivalent joins of their triple patterns. However, care must be taken when considering blank nodes in BGPs; otherwise the same blank node in two different triple patterns might be matched to two different terms, breaking the equivalence. Along these lines, let η : B → V denote a one-to-one mapping of blank nodes to variables; we assume that η will rewrite blank nodes to fresh variables not appearing elsewhere in a query. Given a basic graph pattern B = {t1, …, tn} and a mapping η such that η(B) = {t1′, …, tn′} – where ti′ = η(ti) for 1 ≤ i ≤ n – the following holds:

B ≡ SELECT_{vars(B)}([{t1′} AND [⋯ AND {tn′}]])
i.e., a BGP B is equivalent to the result of rewriting its blank nodes to fresh variables, joining the individual triple patterns, and projecting only the variables originally in B. This equivalence again holds under bag semantics since the multiplicity of a solution μ ∈ B(D) under bag semantics is defined in SPARQL as the number of blank node mappings satisfying the solution μ.

These known results give rise to a UCQ normal form for MQs [44]. More specifically, given a pattern [Q1 AND [Q2 UNION Q3]], we can rewrite this to [[Q1 AND Q2] UNION [Q1 AND Q3]]; in other words, we translate joins of unions to (equivalent) patterns involving unions of joins. Also, as Schmidt et al. [53] observe, [Q AND Q] ≡ Q and [Q UNION Q] ≡ Q under set semantics. Hence we can abstract the commutativity and associativity of both joins and unions by introducing two new syntactic operators:

AND(Q1, …, Qn) := [Q1 AND [⋯ AND Qn]]
UNION(Q1, …, Qn) := [Q1 UNION [⋯ UNION Qn]]

Given the aforementioned equivalences, the arguments of AND(·) and UNION(·) can be considered as a set of operands under set semantics and a bag of operands under bag semantics (wherein duplicate operands may affect the multiplicities of results).

The UCQ normal form for MQs is then of the form SELECT_V(UNION(Q1, …, Qn)), where each Qi (1 ≤ i ≤ n) is of the form AND({ti,1}, …, {ti,m}), where each ti,j (1 ≤ j ≤ m) is a triple pattern. Given that duplicate triple patterns in a join do not affect the multiplicity of results, we can further remove these duplicates such that, in our normal form, ti,j ≠ ti,k for 1 ≤ j < k ≤ m. For this reason, in the case of UCQs, we can analogously consider each Q1, …, Qn to be a set of triple patterns without blank nodes.
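A sketch (ours) of this normalisation, consuming the same tuple encoding as the previous sketch: joins are distributed over unions via a cross product of operands, each BGP is modelled as a set of triple patterns (removing duplicate triple patterns within a BGP), and duplicate union operands are kept so that multiplicities are preserved under bag semantics.

```python
from itertools import product

def to_ucq(expr):
    """Return a list of BGPs (frozensets of triple patterns) whose union is
    equivalent to expr; expr is ('TP', s, p, o), ('AND', ...) or ('UNION', ...)."""
    op = expr[0]
    if op == "TP":
        return [frozenset([expr[1:]])]
    if op == "UNION":
        return [bgp for arg in expr[1:] for bgp in to_ucq(arg)]
    if op == "AND":
        operand_ucqs = [to_ucq(arg) for arg in expr[1:]]
        # One BGP per combination of operands, merging their triple patterns.
        return [frozenset().union(*combo) for combo in product(*operand_ucqs)]
    raise ValueError(op)

q = ("AND",
     ("UNION", ("TP", "?x", ":mother", "?y"), ("TP", "?x", ":father", "?y")),
     ("TP", "?y", ":sister", "?z"))
for bgp in to_ucq(q):
    print(sorted(bgp))   # two BGPs with two triple patterns each
```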

Example 6.2.

We show a case where the multiplicity of union operands changes the multiplicity of results under bag semantics. Consider the following MQ Q: [query omitted] Assume a dataset D with a default graph {(:s, :p, :o)}. Let μ = {?s/:s, ?o/:o}. Note that Q(D)(μ) = 4, since each union generates μ twice, where the multiplicity of the join is then the product of both. We can describe the multiplicity of μ as (1+1)·(1+1) = 4.

If we rewrite this query to a UCQ, in the first step, pushing the first join inside the union, we generate: [query omitted] We may now describe the multiplicity of μ as 1·(1+1) + 1·(1+1) = 4. In the next step, we have: [query omitted] The multiplicity of this query is described as (1·1 + 1·1) + (1·1 + 1·1) = 4. Since BGPs are sets of triple patterns, we should remove the duplicates. Subsequently unnesting the unions, the query then becomes: [query omitted] The multiplicity is then 1+1+1+1 = 4. In this case, the duplicate union operands are needed to preserve the original multiplicities of the query.

As was previously mentioned, the UCQ normal form may be exponentially larger than the original MQ; for example, a relatively concise EMQ of the form {(x, (p1|q1)/⋯/(pn|qn), y)} would be rewritten to an MQ with a join of n unions of two triple patterns each, and then to a UCQ with a union of 2^n BGPs (one for each combination of the pi and qi), with each BGP containing n triple patterns.

Example 6.3.

Let us take the output query of Example 6.1 and apply the UCQ normal form. [query omitted] Blank nodes are rewritten to variables, and then join is distributed over union, giving the following query: [query omitted] If we were to consider the names of aunts or uncles (:sister|:brother), then we would end up with a union of four BGPs with four triple patterns each. If we were to consider the names of children (:son|:daughter) of aunts or uncles, we would end up with a union of eight BGPs with five triple patterns each. In this way, the UCQ rewriting may result in a query that is exponentially larger than the input.

6.1.3.Unsatisfiability normalisation

We recall that a graph pattern Q is considered unsatisfiable if and only if there does not exist a dataset D such that Q(D) is non-empty; i.e., the graph pattern never generates solutions. There is one trivial case of unsatisfiability for UCQs that must be taken into account: when subjects are literals. Specifically, SPARQL allows literal subjects even though they are disallowed in RDF graphs; this was to enable forwards-compatibility with a possible future generalisation of RDF to allow literal subjects, which has not happened as of RDF 1.1. As such, BGPs with any literal subject are unsatisfiable.

Lemma 6.1.

Let Q denote a BGP. Q is unsatisfiable if and only if it contains a literal subject.

Please see Appendix A.1.1 for the proof.

Moving to UCQs, it is not difficult to see that a union is satisfiable if and only if one of its operands is satisfiable, or, equivalently, that it is unsatisfiable if and only if all of its operands are unsatisfiable.

Lemma 6.2.

Let Q = UNION(Q1, …, Qn). Q is unsatisfiable if and only if all of Q1, …, Qn are unsatisfiable. Further assume that Qk is unsatisfiable and let Q′ = UNION(Q1, …, Qk−1, Qk+1, …, Qn) denote Q removing the operand Qk. It holds that Q ≡ Q′.

Please see Appendix A.1.2 for the proof.

To deal with CQs of the form SELECT_V(Q), where Q is a BGP containing a triple pattern with a literal subject, we simply replace this CQ with an arbitrary but canonical unsatisfiable query Q⊥; for example, Q⊥ := SELECT_{?u}({("uns", ?u, ?u)}). In the case of UCQs, we remove operand BGPs that are unsatisfiable; if all operands are unsatisfiable, we replace the entire UCQ with the canonical query Q⊥. If Q⊥ is produced, canonicalisation can stop.
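A sketch (ours) of this unsatisfiability normalisation, assuming triple patterns are tuples of strings and that literals are serialised with surrounding quotes.

```python
def is_literal(term: str) -> bool:
    # Assumption: literals are serialised with surrounding quotes, e.g. "uns".
    return term.startswith('"')

def satisfiable(bgp) -> bool:
    # A BGP is unsatisfiable iff some triple pattern has a literal subject.
    return all(not is_literal(s) for (s, p, o) in bgp)

def drop_unsatisfiable(ucq):
    """Remove unsatisfiable BGPs from a union; return None if all operands
    are unsatisfiable (the caller then substitutes the canonical query Q⊥)."""
    kept = [bgp for bgp in ucq if satisfiable(bgp)]
    return kept or None

ucq = [[('"A"', ":p", "?x")], [("?x", ":p", "?y")]]
print(drop_unsatisfiable(ucq))   # keeps only the second BGP
```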

Example 6.4.

Take the CQ: [query omitted] We replace this with the canonical query: [query omitted]

Next take the UCQ: [query omitted] We will rewrite this to: [query omitted] by removing the unsatisfiable operand.

6.1.4.Variable normalisation

The same variable may sometimes occur in multiple query scopes such that replacing an occurrence of the variable in one scope with a fresh variable does not change the query results. We say that such variable occurrences are not correlated. There is one case where this issue may arise in UCQs. We call a variable v a union variable if it occurs in a union Q = UNION(Q1, …, Qn) (n > 1). An occurrence of v does not correlate with other occurrences of v in different operands of the same union unless v is correlated outside of Q in a query. In the particular case of UCQs, occurrences of non-projected union variables in different operands of the union do not correlate.

Lemma 6.3.

Let Q = SELECT_V(UNION(Q1, …, Qn)) denote a UCQ. Let λ1, …, λn denote variable-to-variable mappings such that for 1 ≤ i ≤ n, λi : V → V where dom(λi) = vars(Qi) ∖ V, and, for all v ∈ dom(λi), it holds that λi(v) ∉ vars(Q) and there does not exist λj (1 ≤ i < j ≤ n) and v′ ∈ dom(λj) such that λi(v) = λj(v′). In other words, each variable-to-variable mapping rewrites each non-projected variable of each union operand to a fresh variable. Then the following equivalence holds:

SELECT_V(UNION(Q1, …, Qn)) ≡ SELECT_V(UNION(λ1(Q1), …, λn(Qn)))

Please see Appendix A.1.3 for the proof.

These non-correlated variables give rise to non-trivial equivalences based on the “false” correspondences between variables with the same name that have no effect on each other. We address such cases by differentiating union variables that appear in multiple operands of the union but not outside the union.

Example 6.5.

We take the output of Example 6.3: [query omitted] The variable ?z correlates across both operands because both occurrences correlate with the same external appearance of ?z in the SELECT clause. Conversely, the variables ?w, ?x and ?y do not correlate across both operands of the union as they do not correlate with external occurrences of the same variable. Hence we differentiate ?w, ?x and ?y in both operands: [query omitted] The resulting query is equivalent to the original query, but avoids “false correspondences” of variables.
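A sketch (ours) of this differentiation step, renaming the non-projected variables of each union operand apart (cf. Lemma 6.3); the variable and operand encodings are our own.

```python
def differentiate(ucq_bgps, projected):
    """Rename non-projected variables so that occurrences in different
    union operands no longer share names."""
    out = []
    for i, bgp in enumerate(ucq_bgps, start=1):
        def rename(term):
            if term.startswith("?") and term not in projected:
                return f"{term}_{i}"        # fresh per-operand variable name
            return term
        out.append([(rename(s), rename(p), rename(o)) for (s, p, o) in bgp])
    return out

ucq = [[("?x", ":mother", "?y"), ("?y", ":sister", "?z")],
       [("?x", ":father", "?y"), ("?y", ":sister", "?z")]]
print(differentiate(ucq, projected={"?z"}))
```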

Next we apply a simple rule to remove variables that are always unbound in projections. Left unattended, such variables could otherwise constitute a trivial counterexample for the completeness of canonicalisation. We recall from Section 3.9 the notation pvars(Q) to denote the possible variables of a graph pattern, i.e., the variables that are bound to some RDF term in some solution of Q over some dataset D.

Lemma 6.4.

Let Q be a graph pattern, let V be a set of variables, and let V′ be a set of variables such that pvars(Q) ∩ V′ = ∅. It holds that:

SELECT_V(Q) ≡ SELECT_{V∪V′}(Q)

Please see Appendix A.1.4 for the proof.

We deal with such cases by removing variables that are always unbound from the projection.11

Example 6.6.

Take a query: [query omitted] We can remove the variable ?z without changing the semantics of the query, as it will always be unbound, no matter what dataset is considered. In practice, engines may return solution tables with blank columns for variables like ?z, but our definitions do not allow such columns (such columns can easily be added in practice if required).

6.1.5.Set vs. bag normalisation

The presence or absence of DISTINCT (or REDUCED) in certain queries does not affect the solutions that are generated because no duplicates can occur. In the case of UCQs, this can occur under two specific conditions. The first such case involves CQs.

Lemma 6.5.

Let Q denote a satisfiable BGP. It holds that:

DISTINCT(SELECT_V(Q)) ≡ SELECT_V(Q)
if and only if vars(Q) ⊆ V and bnodes(Q) = ∅.

Please see Appendix A.1.5 for the proof.

The second case involves unions.

Lemma 6.6.

Let Q1, …, Qn denote satisfiable BGPs and let Q = Q1 ∪ ⋯ ∪ Qn denote the set union of their triple patterns. It holds that:

DISTINCT(SELECT_V(UNION(Q1, …, Qn))) ≡ SELECT_V(UNION(Q1, …, Qn))
if and only if vars(Q) ⊆ V, bnodes(Q) = ∅ and vars(Qi) ≠ vars(Qj) for all 1 ≤ i < j ≤ n.

Please see Appendix A.1.6 for the proof.

The same equivalences trivially hold for REDUCED, which becomes deterministic when no duplicate solutions are returned. We deal with all such equivalences by simply adding DISTINCT in such cases (or replacing REDUCED with DISTINCT).12

Example 6.7.

Take a query such as: [query omitted] Since the query is a BGP with all variables projected and no blank nodes, no duplicates can be produced, and thus we can add a DISTINCT keyword to ensure that canonicalisation will detect equivalent or congruent queries irrespective of the inclusion or exclusion of DISTINCT or REDUCED in such queries. If the query were not to project one of its variables, such as ?z, then duplicates would become possible and adding DISTINCT would change the semantics of the query.

An example of a case involving union is as follows: [query omitted] First we note that the individual basic graph patterns forming the operands of the UNION do not contain blank nodes and have all of their variables projected; hence they cannot lead to duplicates by themselves. Regarding the union, the set of variables is different in each operand, and hence no duplicates can be given: the first operand will (always and only) produce unbounds for ?y, ?z in its solutions; the second will produce unbounds for ?x, ?z; and the third will produce unbounds for ?x, ?y. Hence no operand can possibly duplicate a solution from another operand. Since the query cannot produce duplicates, we can add DISTINCT without changing its semantics. If we were instead to project {?w, ?n}, then adding DISTINCT would change the semantics of the query, as the three operands may produce the same solution, and individual BGPs may duplicate solutions.
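A sketch (ours) of the syntactic conditions of Lemmas 6.5 and 6.6 for deciding when DISTINCT may be added (or REDUCED replaced by DISTINCT) without changing the query's semantics.

```python
def vars_of(bgp):
    return {t for triple in bgp for t in triple if t.startswith("?")}

def has_bnodes(bgp):
    return any(t.startswith("_:") for triple in bgp for t in triple)

def can_add_distinct(bgps, projected):
    """True if SELECT_projected(UNION(bgps)) can never return duplicates:
    all variables projected, no blank nodes, and pairwise distinct
    variable sets across the union operands (Lemmas 6.5/6.6)."""
    all_vars = set().union(*(vars_of(b) for b in bgps))
    if not all_vars <= set(projected) or any(has_bnodes(b) for b in bgps):
        return False
    var_sets = [frozenset(vars_of(b)) for b in bgps]
    return len(var_sets) == len(set(var_sets))

print(can_add_distinct([[("?x", ":p", "?y")], [("?x", ":q", "?z")]],
                       projected={"?x", "?y", "?z"}))   # True
```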

6.1.6.Summary

Given an EMQ Q, we denote by U(Q) the process described herein involving the application of:

  • 1. property path elimination (§6.1.1);

  • 2. union normalisation (§6.1.2);

  • 3. unsatisfiability normalisation (§6.1.3);

  • 4. variable normalisation (§6.1.4);

  • 5. set vs. bag normalisation (§6.1.5).

6.2.Graph representation

Given an EMQ as input, the previous steps either terminate with a canonical unsatisfiable query, or provide us with a satisfiable query in UCQ normal form, with blank nodes replaced by fresh variables, non-correlated variables differentiated, variables that are always unbound removed from the projection (while ensuring that the projection is non-empty), and the DISTINCT keyword invoked in cases where duplicate solutions can never be returned. Before continuing, we first review an example that illustrates the remaining syntactic variations and redundancies in congruent UCQs that are left to be canonicalised.

Example 6.8.

Consider the following UCQs: [query omitted] [query omitted] These queries are congruent, but differ in:

  • 1. the ordering of triple patterns within BGPs;

  • 2. the ordering of BGPs within the UCQ;

  • 3. the naming of variables;

  • 4. a redundant triple pattern in each BGP of the first query (those containing ?z and ?d, ?e);

  • 5. the redundant third BGP in the second query.

We are left to canonicalise such variations.

Table 18

Definitions for representational graphs R(Q) of a UCQ Q, where “a” abbreviates rdf:type, B is a basic graph pattern, Q1, …, Qn, Q′ are graph patterns, V is a set of variables, and (s, p, o) is a triple pattern

· | R(·)
AND(Q1, …, Qn) | {(ι(·), :arg, ι(Q1)), …, (ι(·), :arg, ι(Qn)), (ι(·), a, :And)} ∪ R(Q1) ∪ ⋯ ∪ R(Qn)
UNION(Q1, …, Qn) | {(ι(·), :arg, ι(Q1)), …, (ι(·), :arg, ι(Qn)), (ι(·), a, :Union)} ∪ R(Q1) ∪ ⋯ ∪ R(Qn)
SELECT_V(Q′) | {(ι(·), :arg, ι(Q′)), (ι(·), a, :Select)} ∪ ⋃_{v∈V} {(ι(·), :var, ι(v))} ∪ R(Q′)
DISTINCT(Q′) | {(ι(·), :arg, ι(Q′)), (ι(·), a, :Distinct)} ∪ R(Q′)
(s, p, o) | {(ι(·), :s, ι(s)), (ι(·), :p, ι(p)), (ι(·), :o, ι(o)), (ι(·), a, :TP)}

Our overall approach to address such variations is to encode queries as RDF graphs that we call representational graphs (r-graphs). This representation will allow for identifying and removing redundancies, and for canonically labelling variables such that elements of the query can be ordered deterministically.

We first establish some notation. Let λ() denote a function that returns a fresh blank node, and λ(x) denote a function that returns a fresh blank node unique to x. Let ι(·) denote an id function such that:

  • if x ∈ I ∪ L, then ι(x) = x;

  • if x ∈ V ∪ B, then ι(x) = λ(x);

  • if x is a natural number, then ι(x) = "x"^^xsd:integer;

  • if x is a boolean value, then ι(x) = "x"^^xsd:boolean;

  • otherwise ι(x) = λ().

We assume that natural numbers and boolean values produce datatype literals in canonical form (for example, we assume that ι(2)="2"^^xsd:integer rather than, say, "+02"^^xsd:integer).

Table 18 then provides formal definitions for transforming a UCQ Q in abstract syntax into its r-graph R(Q); we assume that a BGP is expressed as a join of its triple patterns. Note that, for brevity, when we write ι(·), we assume that the same blank node is used for the current expression as was assigned in the parent expression. The result is then deterministic modulo isomorphism. In order to capture the intuition, we provide a more visual depiction of this transformation of UCQs in Table 19, where a dashed node standing for a sub-expression x is replaced by the node ι(x) for that sub-expression, and the graph extended with R(x). We further provide the following example.

Example 6.9.

We present an example of a UCQ and its r-graph in Fig. 3. For clarity (in particular, to avoid non-planarity), we embed the types of nodes into the nodes themselves; e.g., the lowermost node abbreviates a node together with an explicit rdf:type edge to its class. Given an input query Q1′ that varies from Q1 in the naming of variables, applying the same process, the r-graphs for Q1 and Q1′ would be isomorphic, varying only in blank node labels.

Table 19

Mapping UCQs to r-graphs


Part of the benefit of this graph representation is that it abstracts away the ordering of the operands of query operators where such order does not affect the semantics of the operator. This representation further allows us to leverage existing tools to eliminate redundancy and later canonically label variables.
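A simplified sketch (ours) of the r-graph encoding of Table 18 for a single BGP, using Python tuples as RDF triples; ι(·) is approximated by mapping variables to per-variable blank nodes and leaving IRIs and literals untouched, and the SELECT/UNION/DISTINCT levels are omitted for brevity.

```python
import itertools

class RGraph:
    def __init__(self):
        self.triples = set()
        self._ids = {}                      # one blank node per variable (λ(x))
        self._count = itertools.count()

    def bnode(self, key=None):
        if key is None:                     # λ(): always a fresh blank node
            return f"_:n{next(self._count)}"
        if key not in self._ids:
            self._ids[key] = f"_:n{next(self._count)}"
        return self._ids[key]

    def iota(self, term):
        # ι(·): variables map to blank nodes, IRIs/literals to themselves.
        return self.bnode(term) if term.startswith("?") else term

    def encode_tp(self, tp):
        node = self.bnode()
        s, p, o = tp
        self.triples |= {(node, ":s", self.iota(s)), (node, ":p", self.iota(p)),
                         (node, ":o", self.iota(o)), (node, "a", ":TP")}
        return node

    def encode_and(self, tps):
        node = self.bnode()
        self.triples.add((node, "a", ":And"))
        for tp in tps:
            self.triples.add((node, ":arg", self.encode_tp(tp)))
        return node

g = RGraph()
g.encode_and([("?x", ":mother", "?y"), ("?y", ":sister", "?z")])
for t in sorted(g.triples):
    print(t)
```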

6.3.Minimisation

The minimisation step removes two types of redundancies: redundant triple patterns in BGPs, and redundant BGPs in unions. It is important to note that such redundancies only apply in the case of set semantics [49]; under bag semantics, these “redundancies” affect the multiplicity of results, and thus cannot be removed without changing the query’s semantics.

6.3.1.BGP minimisation

The first type of redundancy we consider stems from redundant triple patterns. Consider a BGP Q (without blank nodes, for simplicity). We denote by ρ : V → V ∪ I ∪ B ∪ L a partial mapping from variables to variables and RDF terms, whose domain is denoted by dom(ρ). Now, for a given mapping ρ such that dom(ρ) = vars(Q), it holds that ρ(Q) ≡ DISTINCT(SELECT_{vars(ρ(Q))}(Q)) if and only if ρ(Q) ⊆ Q. This is due to a classical result by Chandra and Merlin [11], where ρ is a homomorphism of Q onto itself: Q and ρ(Q) are homomorphically equivalent.

One may note a correspondence to RDF entailment (see Section 2.5), which is also based on homomorphisms, where the core of an RDF graph represents a redundancy-free (lean) version of the graph. We can exploit this correspondence to remove redundancies in BGPs by computing their core. However, care must be taken to ensure that we do not remove variables from the BGP that are projected; we achieve this by temporarily replacing them with IRIs so that they cannot be eliminated during the core computation.
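A brute-force sketch (ours) of BGP minimisation: we search for an endomorphism of the BGP with the smallest image, while keeping projected variables fixed (mimicking their temporary replacement by IRIs); real implementations would compute the core of the r-graph far more efficiently.

```python
from itertools import product

def core(bgp, protected):
    """Return a minimal homomorphic image of bgp; variables in `protected`
    (e.g. projected variables) may only map to themselves."""
    variables = sorted({t for tr in bgp for t in tr if t.startswith("?")})
    terms = sorted({t for tr in bgp for t in tr})
    best = set(bgp)
    for image in product(terms, repeat=len(variables)):
        rho = dict(zip(variables, image))
        if any(rho[v] != v for v in protected if v in rho):
            continue
        mapped = {tuple(rho.get(t, t) for t in tr) for tr in bgp}
        if mapped <= set(bgp) and len(mapped) < len(best):
            best = mapped                   # a smaller endomorphic image
    return best

bgp = [("?x", ":mother", "?y"), ("?x", ":mother", "?w"), ("?y", ":name", "?n")]
print(core(bgp, protected={"?n"}))   # drops the redundant ?w triple pattern
```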

Fig. 3.

UCQ (above) and its r-graph (below).

Fig. 4.

BGP (above) and its r-graph (below) with the sub-graph removed during the core computation shown dashed (and in blue).

Example 6.10.

Consider the following query, Q: [query omitted] Though perhaps not immediately obvious, this query is congruent with the three queries of Example 1.1. After applying the UCQ normal form and creating the base r-graph for Q, we end up with an r-graph analogous to the following query with a union of three BGPs: [query omitted] We then replace the blank node for the projected variable ?z with a fresh IRI, and compute the core of the sub-graph for each BGP (the graph induced by the BGP node with type :And and any node reachable from that node in the directed r-graph). Figure 4 depicts the sub-r-graph representing the third BGP (omitting the :And-typed node for clarity: it will not affect the core). Dashed nodes and edges are removed from the core per the blank node mapping:

{_:vx3/_:vd3, _:t35/_:t32, _:t33/_:t32, _:vp3/:sister, _:ve3/_:vy3, _:t34/_:t36, _:vf3/:vz, …}
with the other nodes mapped to themselves. Observe that the projected variable :vz is now an IRI, and hence it cannot be removed from the graph.

If we consider applying this core computation over all three conjunctive queries, we would end up with an r-graph corresponding to the following query: [query omitted] We see that the projected variable is preserved in all BGPs. However, we can still see (inter-BGP) redundancy with respect to the first and third BGPs (the first is contained in the third), which we address now.

6.3.2.Union minimisation

After removing redundancy from the individual BGPs, we may still be left with a union containing redundant BGPs, as highlighted by the output of Example 6.10, where the first BGP is contained in the third BGP: when we take the union, the names of :Jo’s aunts returned by the first BGP will already be contained in the third, and since we apply distinct/set semantics, the duplicates (if any) will be removed. Hence we must now apply a higher-level normalisation of unions of BGPs in order to remove such redundancy. Specifically, we must take into consideration the following equivalence [49]; let Q := UNION(Q1, …, Qn) and Q′ := UNION(Q1, …, Qk−1, Qk+1, …, Qn); then:

DISTINCT(SELECT_V(Q)) ≡ DISTINCT(SELECT_V(Q′))
if and only if SELECT_V(Qk) ⊑ SELECT_V(Qj) for some j such that 1 ≤ j ≤ n, 1 ≤ k ≤ n and j ≠ k. To resolve such equivalences, we remove from Q:
  • 1. all Qk (1 ≤ k ≤ n) such that there exists Qj (1 ≤ j < k ≤ n) such that SELECT_V(Qj) ≡ SELECT_V(Qk); and

  • 2. all Qk (1 ≤ k ≤ n) where there exists Qj (1 ≤ j ≤ n) such that SELECT_V(Qj) ⊒ SELECT_V(Qk) (and SELECT_V(Qj) ≢ SELECT_V(Qk));

i.e., we first remove all but one BGP from each group of equivalent BGPs and then remove all BGPs that are properly contained in another (considering in both cases the projected variables V).

To implement condition (1), let us first assume that all BGPs contain all projected variables. Note that in the previous step we have removed all redundancy from the CQs, and hence it is sufficient to check for isomorphism between them; we can thus take the current r-graph Gj for each Qj and apply iso-canonicalisation of Gj, removing any other Qk (k > j) whose Gk is isomorphic. Thereafter, to implement step (2), we can check the simple entailment Gk ⊨ Gj (j ≠ k), where if such an entailment holds, we can remove Gk (and thus Qk); more specifically, we can implement this entailment check using a boolean SPARQL ASK query encoding Gj and evaluated over Gk (which will return true if the entailment holds). Note that in both processes, we should preserve the projected variables in V, meaning they should only be mapped to each other; to ensure this, we can simply maintain the unique IRIs created for them earlier.
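A sketch (ours) of the containment check between redundancy-free BGPs that share the same projected variables: Qk can be dropped if some other operand Qj maps homomorphically into it while fixing the (grounded) projected variables, mirroring the ASK-based entailment check described above.

```python
from itertools import product

def maps_into(qj, qk, protected):
    """True if there is a homomorphism from BGP qj into BGP qk mapping
    protected (projected) variables to themselves; then qk ⊑ qj holds
    and qk can be dropped from the union under set semantics."""
    variables = sorted({t for tr in qj for t in tr if t.startswith("?")})
    terms = sorted({t for tr in qk for t in tr}) or [""]
    for image in product(terms, repeat=len(variables)):
        rho = dict(zip(variables, image))
        if any(rho[v] != v for v in protected if v in rho):
            continue
        if all(tuple(rho.get(t, t) for t in tr) in set(qk) for tr in qj):
            return True
    return False

q4 = [("?v", ":p", "?x")]                        # more general BGP
q1 = [("?v", ":p", "?y"), ("?y", ":q", ":c")]    # more specific BGP
print(maps_into(q4, q1, protected={"?v"}))       # True: q1 can be removed
```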

Per this process, the first BGP in the output of Example 6.10 is removed as it is contained in the third BGP, with the projected variable corresponding in both. We now take another example.

Example 6.11.

Consider the following UCQ, where each BGP contains the lone projected variable: [query omitted] If we consider the first two BGPs, they do not contribute the same results to ?n; however, had we left the blank node _:vn to represent ?n, their r-graphs would be isomorphic, whereas temporarily grounding :vn ensures they are no longer isomorphic. On the other hand, the r-graphs of the second and third BGP will remain isomorphic and thus one will be removed (for the purposes of the example, let’s arbitrarily say the third is removed). There are no further isomorphic CQs and thus we proceed to containment checks.

The fourth BGP maps to (i.e., contains) the first BGP, and thus the first BGP will be removed. This containment check is implemented by creating the following ASK query from the r-graph for the fourth BGP: [query omitted] and applying it to the r-graph of the first BGP: [r-graph omitted] This returns true and hence the first BGP is removed. Likewise the fourth BGP maps to the fifth BGP and also the sixth BGP, and hence the fifth and sixth BGPs will also be removed. This leaves us with an r-graph representing the following UCQ query: [query omitted] This UCQ query is redundancy-free.

Now we drop the assumption that all CQs contain the same projected variables in V, meaning that unbounds can be generated. To resolve such cases, we can partition the BGP operands Q = {Q1, …, Qn} of the union into sets of queries {Q1, …, Qm} (where each Qi here denotes a set of BGP operands) based on the projected variables they contain. More formally, given two graph patterns Q1 and Q2, let Q1 ∼V Q2 denote the equivalence relation such that pvars(Q1) ∩ V = pvars(Q2) ∩ V. Then {Q1, …, Qm} is the quotient set of Q by ∼V. We can then apply the same procedure as described previously, checking equivalence and containment within each such set Q1, …, Qm.

Example 6.12.

Take the following UCQ, where the BGPs now contain different projected variables: [query omitted] Let {Q1, …, Q6} denote the six BGPs, respectively. Further let V = {?v, ?w} denote the set of projected variables. Partitioning the BGPs by the projected variables they contain, we end up with three sets of BGPs: {{Q1, Q2}, {Q3, Q4}, {Q5, Q6}}, given by {?v, ?w}, {?v} and {}, respectively. Within each group we apply the previous conditions. Thus, for example, we do not remove Q1 even though it would be naively contained in, for example, Q3 (where ?x3 in Q3 would map to the IRI :vw in Q1). Rather, Q1, Q2, Q3 (or Q4), and Q6 would be maintained, resulting in the query: [query omitted] The first two BGPs can return multiple solutions, where none can have an unbound; the third BGP will return the same solutions for ?v as the first CQ, but ?w will be unbound each time; the fourth CQ will return a single tuple with an unbound for ?v and ?w if and only if the RDF graph is not empty.

The result of this process will be an r-graph for a redundancy-free UCQ. On this r-graph, we apply some minor post-processing: (i) we replace the temporary IRIs for projected variables with their original blank nodes to allow for canonical labelling in a subsequent phase; and (ii) we remove unary AND or UNION operators (i.e., those with a single operand) from the r-graph, reconnecting child and parent.

6.3.3.Summary

Given a UCQ Q being evaluated under set semantics (with distinct), we denote by M(Q) the result of minimising the UCQ, involving the two procedures:

  • 1. BGP minimisation (§6.3.1);

  • 2. union minimisation (§6.3.2).

Given a UCQ Q being evaluated under bag semantics (without distinct), we define that M(Q)=Q. If bag semantics is selected, the UCQ can only contain a syntactic form of redundancy: exact duplicate triple patterns in the same BGP, which are implicitly removed since we model BGPs as sets of triple patterns. Any other form of redundancy mentioned previously – be it within or across BGPs – will affect the multiplicity of results [12]. Hence if bag semantics is selected, we do not apply any redundancy elimination other than removing duplicate triple patterns in BGPs.

6.4.Canonical labelling

The second-last step of the canonicalisation process consists of applying a canonical labelling to the blank nodes of the RDF graph output by the previous process [28]. Specifically, given an RDF graph G, we apply a canonical labelling function L(·) such that L(G) ∼ G and, for all RDF graphs G′, it holds that G ∼ G′ if and only if L(G) = L(G′); in other words, L(·) bijectively relabels the blank nodes of G in a manner that is deterministic modulo isomorphism, meaning that any isomorphic graph will be assigned the same labels. This is used to assign a deterministic labelling to the query variables represented in the r-graph as blank nodes; other blank nodes representing query operators will also be labelled as part of the process, but their canonical labels are not used.
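As a stand-in for the canonical labelling function L(·), the following sketch uses the rdflib library's to_canonical_graph utility (our choice of tooling; the paper does not prescribe this specific implementation) to show that isomorphic r-graph fragments receive identical blank node labels.

```python
from rdflib import Graph, BNode, Namespace, RDF
from rdflib.compare import to_canonical_graph

EX = Namespace("http://example.org/")

# Two isomorphic r-graph fragments that differ only in blank node labels.
g1, g2 = Graph(), Graph()
for g, label in ((g1, "a"), (g2, "b")):
    tp = BNode(label + "_tp")
    g.add((tp, RDF.type, EX.TP))
    g.add((tp, EX.s, BNode(label + "_var")))

c1, c2 = to_canonical_graph(g1), to_canonical_graph(g2)
# After canonical labelling, the two graphs contain identical triples.
print(set(c1) == set(c2))   # True
```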

6.5.Inverse mapping

The final step of the canonicalisation process is to map from the canonically labelled r-graph back to query syntax. More specifically, we define an inverse r-mapping, denoted R⁻(G), to be a partial mapping from RDF graphs to query expressions such that R⁻(R(Q)) = Q; i.e., converting Q to its r-graph and then applying the inverse r-mapping yields the query Q again. We can achieve this by applying the inverse of Table 18, where canonical blank nodes in RDF term or variable positions (specifically, the objects of triples in the r-graph with predicate :s, :p, :o, or :el) are mapped to canonical variables or blank nodes using a fixed, total, one-to-one mapping ξ : B → V ∪ B [50].

To arrive at a canonical concrete syntax, we order the operands of commutative operators using a syntactic ordering on the canonicalised elements, and then serialise these operands in their lexicographical order. This then concludes the canonicalisation of EMQs.

6.6.Soundness and completeness

Given an EMQ as input, we prove soundness – i.e., that the output query is congruent to the input query – and completeness – i.e., that the output for two input queries is the same if and only if the input queries are congruent – for the proposed canonicalisation scheme.

6.6.1.Soundness

We begin the proof of soundness by showing that the UCQ normalisation preserves congruence.

Lemma 6.7.

For an EMQ Q, it holds that:

U(Q) ≅ Q.

Please see Appendix A.1.7 for the proof.

Next we prove that the canonical labelling of blank nodes in the r-graph does not affect the properties of the inverse r-mapping.

Lemma 6.8.

Given a UCQ Q, it holds that:

R⁻(L(R(Q))) ≅ Q.

Please see Appendix A.1.8 for the proof.

Finally we prove that the minimisation of UCQs through their r-graphs preserves congruence.

Lemma 6.9.

Given a UCQ Q, it holds that:

R⁻(M(R(Q))) ≅ Q.

Please see Appendix A.1.9 for the proof.

The following theorem then establishes soundness; i.e., that the proposed canonicalisation procedure preserves congruence of EMQs.

Theorem 6.1.

For an EMQ Q, it holds that:

R⁻(L(M(R(U(Q))))) ≅ Q.

Please see Appendix A.1.10 for the proof.

6.6.2.Completeness

We now establish completeness: that for any two EMQs, they are congruent if and only if their canonicalised queries are equal. We will prove this by proving lemmas for various cases.

We begin by stating the following remark, which will help us to abbreviate some proofs.

Remark 6.1.

The following hold:

  • 1. if Q1 ∼ Q2, then U(Q1) ∼ U(Q2);

  • 2. if U(Q1) ∼ U(Q2), then R(U(Q1)) ∼ R(U(Q2));

  • 3. if R(U(Q1)) ∼ R(U(Q2)), then M(R(U(Q1))) ∼ M(R(U(Q2)));

  • 4. if M(R(U(Q1))) ∼ M(R(U(Q2))), then L(M(R(U(Q1)))) = L(M(R(U(Q2))));

  • 5. if L(M(R(U(Q1)))) = L(M(R(U(Q2)))), then R⁻(L(M(R(U(Q1))))) = R⁻(L(M(R(U(Q2))))).

Thus, if any premise 1–5 is satisfied, it holds that R⁻(L(M(R(U(Q1))))) = R⁻(L(M(R(U(Q2))))).

In order to prove the result for various cases, our goal is thus to prove isomorphism of the input queries, the queries in UCQ normal form, the r-graphs of the queries, or the minimised r-graphs.

Our first lemma deals with unsatisfiable UCQs, which is a corner-case specific to SPARQL.

Lemma 6.10.

Let Q1 and Q2 denote UCQs. If Q1 and Q2 are unsatisfiable (which implies Q1 ≡ Q2), then:

R⁻(L(M(R(U(Q1))))) = R⁻(L(M(R(U(Q2))))).

Please see Appendix A.1.11 for the proof.

In practice, if a UCQ Q is unsatisfiable, then the canonicalisation process can stop after U(Q) yields Q⊥. We state the result in this way to align the process for both satisfiable and unsatisfiable cases. We can now focus on cases where both queries are satisfiable.

We will start with satisfiable CQs evaluated under set semantics (with distinct).

Lemma 6.11.

Let Q1 and Q2 denote satisfiable BGPs and V1 and V2 sets of variables. Further let Q1′ = DISTINCT(SELECT_{V1}(Q1)) and likewise let Q2′ = DISTINCT(SELECT_{V2}(Q2)). If Q1′ ≅ Q2′ then

R⁻(L(M(R(U(Q1′))))) = R⁻(L(M(R(U(Q2′))))).

Please see Appendix A.1.12 for the proof.

We move to CQs evaluated under bag semantics (without distinct; the result also considers cases where the CQ cannot return duplicates).

Lemma 6.12.

Let Q1 and Q2 denote satisfiable BGPs and V1 and V2 sets of variables. Further let Q1′ = SELECT_{V1}(Q1) and Q2′ = SELECT_{V2}(Q2). If Q1′ ≅ Q2′ then

R⁻(L(M(R(U(Q1′))))) = R⁻(L(M(R(U(Q2′))))).

Please see Appendix A.1.13 for the proof.

We now move to UCQs evaluated under set semantics (with distinct).

Lemma 6.13.

Let Q1 and Q2 denote satisfiable UCQs with distinct. If Q1 ≅ Q2 then

R⁻(L(M(R(U(Q1))))) = R⁻(L(M(R(U(Q2))))).

Please see Appendix A.1.14 for the proof.

We next consider UCQs under bag semantics (without distinct; again, this also holds in the case that the UCQs cannot return duplicates).

Lemma 6.14.

Let Q1 and Q2 denote satisfiable UCQs without distinct. If Q1 ≅ Q2 then

R⁻(L(M(R(U(Q1))))) = R⁻(L(M(R(U(Q2))))).

Please see Appendix A.1.15 for the proof.

Finally we consider what happens when one (U)CQ has distinct, and the other does not but is congruent to the first query.

Lemma 6.15.

Let Q denote a satisfiable UCQ without distinct. Let Q′ = DISTINCT(Q). If Q ≅ Q′, then:

R⁻(L(M(R(U(Q))))) = R⁻(L(M(R(U(Q′))))).

Please see Appendix A.1.16 for the proof.

Having stated all of the core results, we are left to make the final claim of completeness.

Theorem 6.2.

Given two EMQs Q1 and Q2, if Q1 ≅ Q2 then

R⁻(L(M(R(U(Q1))))) = R⁻(L(M(R(U(Q2))))).

Please see Appendix A.1.17 for the proof.

Finally we can leverage soundness and completeness for the following stronger claim.

Theorem 6.3.

Given two EMQs Q1 and Q2, it holds that Q1 ≅ Q2 if and only if

R⁻(L(M(R(U(Q1))))) = R⁻(L(M(R(U(Q2))))).

Please see Appendix A.1.18 for the proof.

6.6.3.Complexity

With respect to the complexity of the problem of computing the canonical form of (E)MQs in SPARQL, a solution to this problem can be trivially used to decide the equivalence of MQs, which is Π2P-complete.

With respect to the complexity of the algorithm R⁻(L(M(R(U(·))))), for simplicity we will assume as input an MQ Q such that all projected variables are contained in the query, which will allow us to consider the complexity at the level of triple patterns. We will denote by n the number of triple patterns in Q.

Letting n = km, the largest query that can be produced by U(Q) arises when we have as input:

Q = AND(UNION({t1,1}, …, {t1,k}), …, UNION({tm,1}, …, {tm,k}))
which will produce a query with a union of k^m BGPs, each of size m:
U(Q) = UNION({t1,1, …, t1,k} × ⋯ × {tm,1, …, tm,k})
(abusing notation to denote the union of all BGPs formed by choosing one triple pattern from each of the m original unions). Thus U(Q) may produce a UCQ with m·k^m triple patterns in total. Given n = km, when n > 2, k^m is maximised in the general case when k = 3 (the integer closest to e, Euler’s number) and m = n/k = n/3. We thus have at most O(m·k^m) = O((n/3)·3^(n/3)) = O(n·3^(n/3)) triple patterns for U(Q) in the worst case, with at most O(3^(n/3)) BGPs, and the largest BGP having at most O(n) triple patterns. We remark that the complexity of the other steps of U(·) is trivially upper-bounded by O(n·3^(n/3)).

With respect to R(·), the number of triples in the r-graph is in O(j), where j is the number of triple patterns in the input query, giving us O(n·3^(n/3)) for the R(·) step in R(U(·)), i.e., applying R(·) on the result of U(·).

With respect to M(·), first we consider BGP minimisation, which requires computing the core of each BGP’s r-graph G. Letting j denote the number of unique subjects and objects in the G being minimised, which is also an upper bound on the number of blank nodes, we will assume a brute-force O(j^j) algorithm that searches over every mapping of blank nodes to terms in G, looking for the one that maps to the fewest unique terms (this mapping indicates the core [28]). Note that the number of triples in the r-graph for each BGP is bounded by O(n), and so is the number of unique subjects and objects. Furthermore, the number of BGPs is bounded by O(3^(n/3)). Thus the cost of minimising all BGPs is O(3^(n/3)·n^(cn)) for some constant c > 1. We must also check containment between each pair of BGP r-graphs (G, G′) in order to apply UCQ minimisation. Again, assuming the number of subjects and objects in G ∪ G′ to be bounded by j, we can assume a brute-force O(j^j) algorithm that considers all mappings. Given O(3^(n/3)) BGPs, we have O((3^(n/3))²) = O(3^(2n/3)) pairs of BGPs to check, giving us a cost of O(3^(2n/3)·n^(cn)). Adding both BGP and UCQ minimisation costs, we have O(n^(cn)·(3^(n/3) + 3^(2n/3))) = O(n^(cn)·3^(2n/3)) for the M(·) step in M(R(U(·))). We can then reduce O(n^(cn)·3^(2n/3)) to O(2^(cn·log n)) by converting both bases to 2 and removing the constant factors.

With respect to L(·), letting j denote the number of triples in the input, we will assume a brute-force O((cj)!) algorithm, for some constant c > 0, that searches over all ways of canonically labelling blank nodes from the set {_:x1, …, _:xb}, where b is the number of blank nodes (in O(j)). We remark that the total size of the r-graph is still bounded by O(n·3^(n/3)), as the minimisation step does not add to the size of the r-graph. Since the number of blank nodes is bounded by O(n·3^(n/3)), the cost of the L(·) step in L(M(R(U(·)))) is O((c·n·3^(n/3))!) for some constant c > 0.

Finally, given a graph with j triples, R⁻(·) is possible in time O(j·log j), where some sorting is needed to ensure a canonical form. Given an input r-graph of size O(n·3^(n/3)), we have a cost of O(n·3^(n/3)·log(n·3^(n/3))) = O(n·3^(n/3)·(log n + (n/3)·log 3)) = O(n²·3^(n/3)) for the R⁻(·) step in R⁻(L(M(R(U(·))))).

Putting it all together, the complexity of canonicalising an MQ Q with n triple patterns using the procedure R⁻(L(M(R(U(Q))))) is as follows:

O(n·3^(n/3) + n·3^(n/3) + 2^(cn·log n) + (c·n·3^(n/3))! + n²·3^(n/3))
which we can reduce to O((c·n·3^(n/3))!), with the factorial canonical labelling of the complete, exponentially-sized UCQ r-graph yielding the dominant term.

Overall, this complexity assumes worst cases that we expect to be rare in practice, and our analysis assumes brute-force methods for finding homomorphisms, computing cores, labelling blank nodes, etc., whereas we use more optimised methods. For example, the exponentially-sized UCQ r-graphs form a tree-like structure connecting each BGP, where it would be possible to canonically label this structure in a more efficient manner than suggested by this worst-case analysis. Thus, though the method has a high computational cost, this does not necessarily imply that it will be impractical for real-world queries. Still, we can conclude that the difficult cases for canonicalisation are represented by input queries with joins of unions, and that minimisation and canonical labelling will likely have high overhead. We will discuss this further in the context of experiments presented in Section 8.

7.Canonicalisation of SPARQL 1.1 queries

While the previous section describes a sound and complete procedure for canonicalising EMQs, many SPARQL 1.1 queries in practice use features that fall outside of this fragment. Unfortunately we know from Table 16 that deciding equivalence for the full SPARQL 1.1 language is undecidable, and thus that an algorithm for sound and complete canonicalisation (that is guaranteed to halt) does not exist. Since completeness is not a strong requirement for certain use-cases (e.g., for caching, it would imply a “cache miss” that would otherwise happen without canonicalisation), we rather aim for a sound canonicalisation procedure that supports all features of SPARQL 1.1. Such a procedure supports all queries found in practice, preserving congruence, but may produce different canonicalised output for congruent queries.

7.1.Algebraic rewritings

We now describe the additional rewritings we apply in the case of SPARQL 1.1 queries that are not EMQs, in particular for filters, for distinguishing local variables, and for property paths (RPQs). We further describe how canonicalisation of monotone sub-queries is applied based on the previous techniques.

7.1.1.Filter normalisation

Schmidt et al. [53] propose a number of rules for filters, which form the basis for optimising queries by applying filters as early as possible so as to reduce the number of intermediate results. We implement the rules shown in Table 20. It is a folklore result that such rewritings further hold under bag semantics, where they are used by a wide range of SPARQL engines for optimisation purposes: intuitively, under both bag and set semantics, filters set to zero the multiplicity of solutions that do not pass the filter condition, and preserve the multiplicity of all other solutions.

Table 20

Equivalences given by Schmidt et al. [53] for filters under set semantics

Pushing filters inside/outside union: [FILTER_R(Q1) UNION FILTER_R(Q2)] ≡ FILTER_R([Q1 UNION Q2])
Filter conjunction: FILTER_{R1}(FILTER_{R2}(Q)) ≡ FILTER_{R1 ∧ R2}(Q)
Filter disjunction: [FILTER_{R1}(Q) UNION FILTER_{R2}(Q)] ≡ FILTER_{R1 ∨ R2}(Q)
Pushing filters inside/outside join: [FILTER_R(Q1) AND Q2] ≡ FILTER_R([Q1 AND Q2]), if vars(R) ⊆ svars(Q1)
Pushing filters inside/outside optional: [FILTER_R(Q1) OPT Q2] ≡ FILTER_R([Q1 OPT Q2]), if vars(R) ⊆ svars(Q1)

With respect to the latter two rules, we remark that they hold only if the variables of the filter expression R are contained within the safe variables of Q1 (vars(R) ⊆ svars(Q1)), i.e., the variables that must be bound in any solution of Q1 over any dataset. While we defined this notion semantically in Section 3.9, in order to apply such rules in practice, Schmidt et al. [53] define safe variables syntactically. We extend their syntactic definitions to cover more features of SPARQL 1.1, as shown in Table 21. These syntactic definitions do not cover all cases, but rather under-approximate the set of safe variables; as mentioned in Section 3.9, deciding whether a variable is (semantically) safe is undecidable. By conservatively under-approximating safe variables, we apply the rewritings in a subset of the cases in which they actually apply, choosing soundness over completeness in the face of undecidability.

Table 21

Syntactic approximation of safe variables, where B is a basic graph pattern; N is a navigational graph pattern; Q, Q′, Q1 and Q2 are graph patterns; x is an IRI; v is a variable; V is a set of variables; R is a built-in expression; A is an aggregate expression; M is a bag of solution mappings; Λ is a set of aggregation expression–variable pairs; and Q⋆ is a group-by pattern

Q = B ⟹ svars(Q) = vars(B)
Q = N ⟹ svars(Q) = vars(N)
Q ∈ {[Q1 AND Q2], [Q1 SERVICE_x^false Q2]} ⟹ svars(Q) = svars(Q1) ∪ svars(Q2)
Q = [Q1 UNION Q2] ⟹ svars(Q) = svars(Q1) ∩ svars(Q2)
Q ∈ {[Q1 FE Q2], [Q1 FNE Q2], [Q1 MINUS Q2], [Q1 OPT Q2], [Q1 SERVICE_x^true Q2]} ⟹ svars(Q) = svars(Q1)
Q ∈ {SELECT_V(Q′), GROUP_V(Q′)} ⟹ svars(Q) = svars(Q′) ∩ V
Q ∈ {FILTER_R(Q′), BIND_{R,v}(Q′), GRAPH_x(Q′)} ⟹ svars(Q) = svars(Q′)
Q = GRAPH_v(Q′) ⟹ svars(Q) = svars(Q′) ∪ {v}
Q = VALUES_M(Q′) ⟹ svars(Q) = svars(Q′) ∪ svars(M)
Q ∈ {HAVING_A(Q⋆), AGG_Λ(Q⋆)} ⟹ svars(Q) = svars(Q⋆)
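For illustration, the rows of Table 21 translate directly into a recursive function. The following sketch covers a representative subset of cases, assuming a simple tuple-based encoding of graph patterns (not the data structures of our implementation), and conservatively returns the empty set for features it does not handle:

```python
def svars(q):
    """Under-approximate the safe variables of a pattern, following a subset
    of the rows of Table 21. Patterns are tuples such as
    ('BGP', {triple patterns}), ('AND', q1, q2), ('UNION', q1, q2),
    ('OPT', q1, q2), ('MINUS', q1, q2), ('FILTER', expr, q1) and
    ('SELECT', vars, q1); variables are strings beginning with '?'."""
    op = q[0]
    if op == 'BGP':
        return {t for tp in q[1] for t in tp
                if isinstance(t, str) and t.startswith('?')}
    if op == 'AND':                    # join: both sides are safe
        return svars(q[1]) | svars(q[2])
    if op == 'UNION':                  # union: only variables safe on both sides
        return svars(q[1]) & svars(q[2])
    if op in ('OPT', 'MINUS'):         # only the left argument is safe
        return svars(q[1])
    if op == 'FILTER':
        return svars(q[2])
    if op == 'SELECT':                 # intersect with the projected variables
        return svars(q[2]) & set(q[1])
    return set()                       # conservative fallback for other features
```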

In our case, rather than decomposing filters with disjunction or conjunction, we join them together, creating larger filter expressions that can be normalised.
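A minimal sketch of this merging step, over the same assumed tuple-based encoding of patterns, applies the filter-conjunction rule of Table 20 bottom-up so that directly nested filters collapse into a single conjunctive expression:

```python
def merge_filters(q):
    """Rewrite FILTER_R1(FILTER_R2(Q)) into FILTER_{R1 && R2}(Q), bottom-up.
    Patterns are tuples such as ('FILTER', expr, pattern); filter expressions
    are left opaque apart from the ('&&', e1, e2) constructor."""
    if not isinstance(q, tuple) or not q:
        return q
    q = tuple(merge_filters(part) for part in q)
    if q[0] == 'FILTER' and isinstance(q[2], tuple) and q[2] and q[2][0] == 'FILTER':
        inner = q[2]
        return merge_filters(('FILTER', ('&&', q[1], inner[1]), inner[2]))
    return q

# For example:
#   ('FILTER', '?x > 5', ('FILTER', '?y < 3', ('BGP', bgp)))
# becomes
#   ('FILTER', ('&&', '?x > 5', '?y < 3'), ('BGP', bgp))
```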

Example 7.1.

Consider the following query: [query listing: sw-13-sw212871-g042.jpg] We will rewrite this as follows: [query listing: sw-13-sw212871-g043.jpg] Note that the FILTER inside the OPTIONAL cannot be moved: if there is a solution μ such that μ(?x) = μ(?z), having the filter inside the OPTIONAL may result in ?z being unbound, while having it outside would filter out the solution entirely.

7.1.2.Local variable normalisation

Like in the case of union variables, we identify another case where the correspondence between variables in different scopes is coincidental; i.e., where variables with the same name do not correlate. Specifically, we call a variable v local to a graph pattern Q on V (see Table 3) if v ∈ vars(Q) and v ∉ V. Much like in the case of union variables, we can simply rename local variables to distinguish them from variables with the same name in other scopes.
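The renaming itself is straightforward; the following sketch, again over an assumed tuple-based encoding with variables encoded as strings beginning with "?", maps each local variable to a fresh name:

```python
import itertools

_fresh = itertools.count(1)

def all_vars(node):
    """All variables (strings starting with '?') in a nested-tuple pattern."""
    if isinstance(node, str):
        return {node} if node.startswith('?') else set()
    if isinstance(node, (tuple, frozenset, set, list)):
        return set().union(set(), *(all_vars(p) for p in node))
    return set()

def rename_local_vars(q, V):
    """Rename every variable of pattern q that is local to it -- i.e., in
    vars(q) but not in V -- to a fresh name, so that it cannot coincide with
    equally named variables in other scopes. The fresh-name scheme is a
    placeholder for illustration only."""
    mapping = {v: f"?_local{next(_fresh)}" for v in sorted(all_vars(q) - set(V))}
    def rewrite(node):
        if isinstance(node, str):
            return mapping.get(node, node)
        if isinstance(node, tuple):
            return tuple(rewrite(p) for p in node)
        if isinstance(node, (frozenset, set)):
            return type(node)(rewrite(p) for p in node)
        return node
    return rewrite(q)
```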

Example 7.2.

Consider the following query looking for the names of aunts of people without a father or without a mother. [query listing: sw-13-sw212871-g044.jpg] In this case, we distinguish the union variables. However, the variables ?f and ?m are local to the first and second MINUS clauses, respectively, and thus we can also differentiate them as follows: [query listing: sw-13-sw212871-g045.jpg] The resulting query is equivalent to the original query, but avoids “false correspondences” of variables.

7.1.3.UCQ normalisation

We continue to apply many of the rules of UCQ normalisation described in Section 6.1. Most of these rules were stated in a general way, and thus apply when other (non-EMQ) features are used in the subqueries of the union and join operators (or outside such operators). There are, however, certain caveats to consider in the more general case:

  • In the case of variable normalisation, deciding the set of possible variables becomes undecidable for the full language. It suffices for soundness (but not completeness) to use an overapproximation; given a graph pattern Q on V, we can thus simply take V as the set of possible variables.

  • In the case of unsatisfiability normalisation, there are now many other possible causes of unsatisfiable graph patterns; unfortunately, deciding if a pattern is unsatisfiable or not for the full language is undecidable [60]. We currently only remove BGPs with literal subjects, as before for EMQs (see the sketch after this list).

  • In the case of set vs. bag normalisation, deciding whether or not a query can return duplicate solutions is undecidable (noting that UNION(Q,Q) cannot return duplicates if and only if Q is unsatisfiable). Currently we only apply this normalisation in EMQs, though this could be extended in future to consider other cases (such as C2RPQs).
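The following sketch illustrates the (currently limited) unsatisfiability check mentioned above, dropping UNION operands that are BGPs containing a triple pattern with a literal subject; the tuple-based encoding and the string encoding of literals are assumptions of the sketch:

```python
def drop_unsatisfiable_operands(union):
    """A BGP containing a triple pattern with a literal subject can never
    match an RDF graph, so such operands can be dropped from a UNION.
    A UNION is represented as ('UNION', q1, q2, ...), each BGP as
    ('BGP', frozenset_of_triple_patterns); literals are assumed to be strings
    starting with a double quote."""
    def is_literal(term):
        return isinstance(term, str) and term.startswith('"')
    def satisfiable(bgp):
        return not any(is_literal(s) for (s, p, o) in bgp[1])
    kept = [q for q in union[1:] if q[0] != 'BGP' or satisfiable(q)]
    return ('UNION', *kept)
```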

7.1.4.Well-designed pattern normalisation

As mentioned in Section 3, SPARQL allows the querying of optional data: values are returned if they are matched by a graph pattern, or are left unbound otherwise. Well-designed patterns denote a class of graph patterns where, for each sub-pattern of the form Q′ = [Q1 OPT Q2], all variables that appear both outside of Q′ and in Q2 must also appear in Q1. We can check for well-designedness in time linear in the size of the query pattern and the number of OPTIONAL sub-patterns it contains. A classical result for SPARQL is that well-designed patterns avoid the leap in computational complexity that query evaluation otherwise incurs when (unrestricted) OPTIONAL is added. Furthermore, well-designed patterns permit additional normal forms involving OPTIONAL to be applied [44], per Table 22. We exploit these rules to capture additional equivalences involving such patterns.
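For illustration, the well-designedness condition can be checked as sketched below over an assumed tuple-based encoding of patterns: for each OPT sub-pattern we compare the variables of its right argument with those occurring outside the sub-pattern.

```python
def pattern_vars(node):
    """All variables (strings starting with '?') appearing anywhere in a
    pattern represented as nested tuples/frozensets of strings."""
    if isinstance(node, str):
        return {node} if node.startswith('?') else set()
    if isinstance(node, (tuple, frozenset, set, list)):
        return set().union(set(), *(pattern_vars(p) for p in node))
    return set()

def well_designed(q):
    """Check that for every sub-pattern ('OPT', Q1, Q2) of q, each variable
    occurring both in Q2 and outside the sub-pattern also occurs in Q1."""
    def replace(node, target):
        # Replace the (identical) target sub-pattern by an empty pattern.
        if node is target:
            return ('BGP', frozenset())
        if isinstance(node, tuple):
            return tuple(replace(p, target) for p in node)
        return node
    def check(node):
        if not isinstance(node, tuple) or not node:
            return True
        if node[0] == 'OPT':
            q1, q2 = node[1], node[2]
            outside = pattern_vars(replace(q, node))
            if (pattern_vars(q2) & outside) - pattern_vars(q1):
                return False
        return all(check(p) for p in node)
    return check(q)
```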

Table 22

Equivalences given by Pérez et al. [44] for set semantics and well-designed patterns

Join can be pushed into optional: [Q1 AND [Q2 OPT Q3]] ≡ [[Q1 AND Q2] OPT Q3]
Join can be pushed into optional: [[Q1 OPT Q2] AND Q3] ≡ [[Q1 AND Q3] OPT Q2]
Filter expression can be pushed into optional: FILTER_R([Q1 OPT Q2]) ≡ [FILTER_R(Q1) OPT Q2]
Example 7.3.

Let us consider the query Q1: [query listing: sw-13-sw212871-g046.jpg] This query is not well-designed due to the variable ?n appearing on the right but not the left of an OPTIONAL, while also appearing outside the OPTIONAL.

On the other hand, the following query Q2 contains a well-designed pattern inside its WHERE clause: [query listing: sw-13-sw212871-g047.jpg] Thus we can rewrite it to: [query listing: sw-13-sw212871-g048.jpg] per the second rule of Table 22.

Note that if we were to apply the same rule to the non-well-designed pattern in Q1, we would obtain: [query listing: sw-13-sw212871-g049.jpg] This query is not equivalent to Q1: it returns the first names (?n) of children (?x) with some father (?y); in other words, the OPTIONAL is redundant. Q1, on the other hand, returns the first names (?n) of children (?x) with some father (?y) who either does not have a first name or has the same first name as the child (?n); this is because in the original query, the variable ?n is potentially bound to the father’s first name (if it exists) before the join on the child’s first name is applied.

We note that for queries with well-designed patterns, these rules are not sufficient for the purposes of completeness; we will show an example of incompleteness later in Section 7.5.4. Per the results of Pichler and Skritek [45], equivalence considering projection and well-designed patterns is already undecidable.

7.1.5.Summary

Given a SPARQL 1.1 query Q, we denote by A(Q) the process involving the application of:

  • 1. filter normalisation (§7.1.1);

  • 2. local variable normalisation (§7.1.2);

  • 3. UCQ normalisation (§7.1.3);

  • 4. well-designed pattern normalisation (§7.1.4).

7.2.Graph representation

We extend the graph representation previously defined for EMQs (Section 6) so as to cover all SPARQL 1.1 query features. This extension is provided in Table 23 (we again include the EMQ features for reference).
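Before turning to the table, we sketch how its rows can be implemented: the following emits r-graph triples for two representative operators (DISTINCT and FILTER), modelling ι(·) as a function that assigns a fresh blank node to each sub-expression; the encoding of terms and the stubbed treatment of other operators are simplifying assumptions of the sketch.

```python
import itertools

_counter = itertools.count(1)
_iota = {}

def iota(node):
    """Model iota(.): assign (and memoise) a fresh blank node per sub-expression."""
    if id(node) not in _iota:
        _iota[id(node)] = f"_:n{next(_counter)}"
    return _iota[id(node)]

def r_graph(q):
    """Emit r-graph triples for DISTINCT(Q) and FILTER_R(Q) per Table 23;
    other operators would be handled analogously (here stubbed)."""
    b = iota(q)
    if q[0] == 'DISTINCT':
        return {(b, 'a', ':Distinct'), (b, ':arg', iota(q[1]))} | r_graph(q[1])
    if q[0] == 'FILTER':
        expr, sub = q[1], q[2]
        # The expression's own representation (Section 7.2.1) is omitted here.
        return {(b, 'a', ':Filter'), (b, ':arg', iota(sub)),
                (b, ':exp', iota(expr))} | r_graph(sub)
    return {(b, 'a', ':And')}  # stub: treat anything else as an (empty) BGP
```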

Table 23

Definitions for representational graphs R(Q) of graph patterns Q, where “a” abbreviates rdf:type, B is a basic graph pattern, N is a navigational graph pattern, Q1, …, Qn, Q are graph patterns, c is an RDF term, e is a non-simple property path (not an IRI), k is a non-zero natural number, v is a variable, w is a variable or IRI, x is an IRI, y is a variable or property path, μ is a solution mapping, Δ is a boolean value, V is a set of variables, X is a set of variables and/or IRIs, Y and Y′ are sets of IRIs, R is a built-in expression, A is an aggregate expression, M is a bag of solution mappings, Λ is a set of aggregate expression–variable pairs, and Ω is a non-empty sequence of order comparators

· | R(·)
B | {(ι(·), a, :And)} ∪ ⋃_{(s,p,o) ∈ B} ({(ι(·), :arg, ι((s,p,o)))} ∪ R((s,p,o)))
N | {(ι(·), a, :And)} ∪ ⋃_{(s,y,o) ∈ N} ({(ι(·), :arg, ι((s,y,o)))} ∪ R((s,y,o)))
AND(Q1, …, Qn) | {(ι(·), :arg, ι(Q1)), …, (ι(·), :arg, ι(Qn)), (ι(·), a, :And)} ∪ R(Q1) ∪ ⋯ ∪ R(Qn)
UNION(Q1, …, Qn) | {(ι(·), :arg, ι(Q1)), …, (ι(·), :arg, ι(Qn)), (ι(·), a, :Union)} ∪ R(Q1) ∪ ⋯ ∪ R(Qn)
[Q1 MINUS Q2] | {(ι(·), :left, ι(Q1)), (ι(·), :right, ι(Q2)), (ι(·), a, :Minus)} ∪ R(Q1) ∪ R(Q2)
[Q1 OPT Q2] | {(ι(·), :left, ι(Q1)), (ι(·), :right, ι(Q2)), (ι(·), a, :Optional)} ∪ R(Q1) ∪ R(Q2)
FILTER_R(Q) | {(ι(·), :arg, ι(Q)), (ι(·), :exp, ι(R)), (ι(·), a, :Filter)} ∪ R(Q) ∪ R(R)
[Q1 FE Q2] | {(ι(·), :left, ι(Q1)), (ι(·), :right, ι(Q2)), (ι(·), a, :Exists)} ∪ R(Q1) ∪ R(Q2)
[Q1 FNE Q2] | {(ι(·), :left, ι(Q1)), (ι(·), :right, ι(Q2)), (ι(·), a, :NotExists)} ∪ R(Q1) ∪ R(Q2)
BIND_{R,v}(Q) | {(ι(·), :arg, ι(Q)), (ι(·), :exp, ι(R)), (ι(·), :var, ι(v)), (ι(·), a, :Bind)} ∪ R(Q) ∪ R(R)
VALUES_M(Q) | {(ι(·), :arg, ι(Q)), (ι(·), :vals, ι(M)), (ι(·), a, :Values)} ∪ R(Q) ∪ R(M)
GRAPH_w(Q) | {(ι(·), :arg, ι(Q)), (ι(·), :graph, ι(w))} ∪ R(Q)
[Q1 SERVICE_x^Δ Q2] | {(ι(·), :left, ι(Q1)), (ι(·), :right, ι(Q2)), (ι(·), :srv, x), (ι(·), :sil, ι(Δ)), (ι(·), a, :Service)} ∪ R(Q1) ∪ R(Q2)
GROUP_V(Q) | {(ι(·), :arg, ι(Q)), (ι(·), a, :GroupBy)} ∪ ⋃_{v ∈ V} {(ι(·), :var, ι(v))} ∪ R(Q)
HAVING_A(Q) | {(ι(·), :arg, ι(Q)), (ι(·), :exp, ι(A)), (ι(·), a, :Having)} ∪ R(Q) ∪ R(A)
AGG_Λ(Q) | {(ι(·), :arg, ι(Q)), (ι(·), :exp, ι(Λ)), (ι(·), a, :Aggregate)} ∪ R(Q) ∪ R(Λ)
ORDER_Ω(Q) | {(ι(·), :arg, ι(Q)), (ι(·), :exp, ι(Ω)), (ι(·), a, :OrderBy)} ∪ R(Q) ∪ R(Ω)
DISTINCT(Q) | {(ι(·), :arg, ι(Q)), (ι(·), a, :Distinct)} ∪ R(Q)
REDUCED(Q) | {(ι(·), :arg, ι(Q)), (ι(·), a, :Reduced)} ∪ R(Q)
OFFSET_k(Q) | {(ι(·), :arg, ι(Q)), (ι(·), :off, ι(k)), (ι(·), a, :Offset)} ∪ R(Q)
LIMIT_k(Q) | {(ι(·), :arg, ι(Q)), (ι(·), :limit, ι(k)), (ι(·), a, :Limit)} ∪ R(Q)
SELECT_V(Q) | {(ι(·), :arg, ι(Q)), (ι(·), a, :Select)} ∪ ⋃_{v ∈ V} {(ι(·), :var, ι(v))} ∪ R(Q)
ASK(Q) | {(ι(·), :arg, ι(Q)), (ι(·), a, :Ask)} ∪ R(Q)
CONSTRUCT_B(Q) | {(ι(·), :arg, ι(Q)), (ι(·), :cons, ι(B)), (ι(·), a, :Construct)} ∪ R(Q) ∪ R(B)
DESCRIBE_X(Q) | {(ι(·), :arg, ι(Q)), (ι(·), :desc, ι(X)), (ι(·), a, :Describe)} ∪ R(Q) ∪ R(X)
FROM_{Y,Y′}(Q) | {(ι(·), :arg, ι(Q)), (ι(·), :from, ι(Y)), (ι(·), :fromn, ι(Y′)), (ι(·), a, :From)} ∪ R(Q) ∪ R(Y) ∪ R(Y′)
(s, p, o) | {(ι(·), :s, ι(s)), (ι(·), :p, ι(p)), (ι(·), :o, ι(o)), (ι(·), a, :TP)}
(s, e, o) | {(ι(·), :s, ι(s)), (ι(·), :p, ι(e)), (ι(·), :o, ι(o)), (ι(·), a, :NP)} ∪ R(e)
e | minimal DFA with ι(e) as start node; see Section 7.2.2
X | {(ι(·), a, :IriVarSet)} ∪ ⋃_{x ∈ X} {(ι(·), :el, x)}
Y | {(ι(·), a, :IriSet)} ∪ ⋃_{y ∈ Y} {(ι(·), :el, y)}
M | {(ι(·), a, :SolBag)} ∪ (⋃_{μ ∈ M} ({(ι(·), :sol, ι((μ, M(μ))))} ∪ R((μ, M(μ)))))
(μ, k) | {(ι(·), a, :Binding), (ι(·), :num, ι(k))} ∪ ⋃_{v ∈ dom(μ)} ({(ι(·), :el, ι((v, μ(v))))} ∪ R((v, μ(v))))
(v, c) | {(ι(·), :var, ι(v)), (ι(·), :val, ι(c))}
R | see Section 7.2.1
A | see Section 7.2.1
Λ | {(ι(·)