Continuous multi-query optimization for subgraph matching over dynamic graphs
There is a growing need to perform real-time analytics on dynamic graphs in order to deliver the value of big data to users. An important problem in such applications is continuously identifying and monitoring critical patterns as fine-grained, high-velocity updates occur on the graphs. Considerable effort has been devoted to developing practical solutions for this problem. Despite these efforts, existing algorithms exhibit limited running time and scalability when dealing with large and/or many graphs. In this paper, we study the problem of continuous multi-query optimization for subgraph matching over dynamic graph data. (1) We propose the annotated query graph, which is obtained by merging multiple queries into one. (2) Based on the annotated query, we employ a concise auxiliary data structure to represent partial solutions in a compact form. (3) In addition, we propose an efficient maintenance strategy to detect the affected queries for each update and report the corresponding matches in one pass. (4) Extensive experiments over real-life and synthetic datasets verify the effectiveness and efficiency of our approach and confirm a two orders of magnitude improvement of the proposed solution.
Dynamic graphs emerge in many domains, such as financial transaction networks, mobile communication networks, data center networks [20–22], and uncertain networks. These graphs usually contain a very large number of vertices with different attributes and have complex relationships among vertices. In addition, they are highly dynamic, with frequent edge insertions and deletions.
Identifying and monitoring critical patterns in a dynamic graph is important in various application domains such as fraud detection, cyber security, and emergency response. For example, cyber security applications should detect cyber intrusions and attacks in computer network traffic as soon as they appear in the data graph. To identify and monitor such patterns, existing work [3,13,15] studies the continuous subgraph matching problem, which processes one query at a time. Given an initial graph G, a graph update stream consisting of edge insertions and deletions, and a query graph Q, the continuous subgraph matching problem is to report positive (resp. negative) matches for each edge insertion (resp. deletion) operation.
Figure 1 shows an example of continuous subgraph matching. Given a query pattern Q as shown in Fig. 1(a), and an initial graph G with an edge insertion operation as shown in Fig. 1(b), it is necessary to find all positive matches for each operation. When the edge insertion occurs, a positive match is reported.
However, these applications often deal with dynamic graphs in a setting where hundreds or thousands of continuous queries must be supported simultaneously. Optimizing and answering each query separately over the dynamic graph is not always the most efficient approach. Zervakis et al. first proposed a continuous multi-query processing engine, namely TRIC, over dynamic graphs. It decomposes the query graphs into minimum covering paths and constructs an index over them. Whenever an update occurs, it continuously evaluates queries by leveraging the shared restrictions present in the query set. Although TRIC achieves better performance than one-query-at-a-time approaches, it still has serious performance problems. (1) TRIC needs to maintain a large number of materialized views, leading to high storage cost. (2) Since TRIC decomposes each query graph Q in the query set into a set of path conjuncts, it inevitably incurs expensive join and exploration costs for large sets of query paths. (3) TRIC has an expensive maintenance cost for materialized results when updates occur on the graph.
These problems of existing methods motivated us to develop the novel concept of an annotated query graph (AQG), which is obtained by merging all the queries into one. Similar to prior multi-query optimization approaches, our technique relies on sharing computation to speed up query processing. Each edge e in the AQG is annotated with the queries that contain e. To avoid executing subgraph pattern matching repeatedly whenever edges expire or new edges arrive, we construct an auxiliary data structure that records intermediate query results. Note that a data-centric representation of intermediate results is claimed to have the best storage cost. It maintains candidate query vertices for each data vertex using a graph structure such that a data vertex appears at most once. In this paper, we adopt this solution and construct a new data-centric auxiliary data structure, namely MDCG, based on the equivalent query tree of the AQG. The purpose is to exploit the pruning power of all edges in the AQG and to execute fast query evaluation by leveraging the tree structure.
In summary, our contributions are:
– We propose an efficient continuous multi-query matching system, IncMQO, to resolve the problems of existing methods.
– We define the annotated query graph, from which corresponding matching results can be obtained in one pass instead of multiple passes.
– We construct a new data-centric auxiliary data structure based on the equivalent query tree of the annotated query graph to represent partial solutions in a compact form.
– We propose an incremental maintenance strategy to efficiently maintain the intermediate results in MDCG for each update and quickly detect the affected queries. Then we propose an efficient matching order for the annotated query to conduct subgraph pattern matching.
We experimentally evaluate the proposed solution using three different datasets and compare its performance against three baselines. The experimental results show that our solution can achieve up to two orders of magnitude improvement in query processing time over the sequential processing strategy.
2.Preliminaries and framework
In this section, we first introduce several essential notions and formalize the problem of continuous multi-query processing over dynamic graphs. Then, we give an overview of the proposed solution.
We focus on a labeled undirected graph G = (V, E, L). Here, V is the set of vertices, E is the set of edges, and L is a labeling function that assigns a label l to each vertex in V. Note that our techniques can be readily extended to handle directed graphs.
Definition 1(Graph Update Stream).
A graph update stream is a sequence of update operations, where each operation is a triple (op, v_i, v_j) such that op ∈ {I, D} is the type of operation, with I and D representing insertion and deletion of an edge (v_i, v_j), respectively.
A dynamic graph consists of an initial graph G and an update stream; G transforms into an updated graph after applying the stream's operations. Note that insertion of a vertex can be represented by a set of edge insertions; similarly, deletion of a vertex can be considered a set of edge deletions.
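As a concrete illustration, a graph update stream can be modeled as a sequence of (op, u, v) triples applied to an undirected edge set. The following sketch is illustrative only; the names `Update` and `apply_update` are not from the paper.

```python
from collections import namedtuple

# A hypothetical triple (op, u, v): op is 'I' (insert) or 'D' (delete).
Update = namedtuple("Update", ["op", "u", "v"])

def apply_update(edges, upd):
    """Apply one update operation to an undirected edge set."""
    e = frozenset((upd.u, upd.v))
    if upd.op == "I":
        edges.add(e)
    elif upd.op == "D":
        edges.discard(e)
    return edges

# An initial graph and a short update stream.
G = {frozenset(("a", "b"))}
stream = [Update("I", "b", "c"), Update("D", "a", "b")]
for o in stream:
    apply_update(G, o)
# G now contains only the edge (b, c).
```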
Definition 2(Subgraph homomorphism).
Given a query graph Q and a data graph G, Q is homomorphic to a subgraph of G if there is a mapping (or a match) f between them such that: (1) for each vertex v in Q, L(v) = L(f(v)); and (2) for each edge (v, v') in Q, (f(v), f(v')) is an edge in G, where f(v) is the vertex in G to which v is mapped.
Since subgraph isomorphism can be obtained by additionally checking the injectivity constraint on f, we use subgraph homomorphism as our default matching semantics. Note that we omit edge labels for ease of explanation, while the actual implementation of our solution and our experiments support edge labels.
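Definition 2 can be made concrete with a brute-force enumerator; this is only a minimal sketch of the homomorphism semantics (label-preserving, not necessarily injective), not the paper's matching algorithm, and all names are illustrative.

```python
from itertools import product

def homomorphisms(q_edges, q_labels, g_edges, g_labels):
    """Brute-force enumeration of label-preserving homomorphisms
    from an undirected query graph to an undirected data graph."""
    q_vs = sorted(q_labels)
    g_adj = {frozenset(e) for e in g_edges}
    # Condition (1): candidates must carry the same label.
    cands = {v: [u for u in g_labels if g_labels[u] == q_labels[v]]
             for v in q_vs}
    results = []
    for combo in product(*(cands[v] for v in q_vs)):
        f = dict(zip(q_vs, combo))
        # Condition (2): every query edge maps to a data edge.
        if all(frozenset((f[a], f[b])) in g_adj for a, b in q_edges):
            results.append(f)
    return results

# Single-edge query u1(A)-u2(B) over a small data graph.
q_labels = {"u1": "A", "u2": "B"}
q_edges = [("u1", "u2")]
g_labels = {1: "A", 2: "B", 3: "B"}
g_edges = [(1, 2), (1, 3)]
ms = homomorphisms(q_edges, q_labels, g_edges, g_labels)
# Two matches: u1 -> 1 with u2 -> 2 or u2 -> 3.
```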
Based on the above definitions, let us now define the problem of multi-query processing over dynamic graphs.
Problem statement Given a set of query graphs, an initial data graph G, and a graph update stream, the problem of continuous multi-query processing over a dynamic graph consists of continuously calculating the positive or negative matching results for each affected query graph when applying incoming updates.
2.2.Overview of solution
In this subsection, we give an overview of the proposed solution, referred to as IncMQO. Specifically, we address two technical challenges:
– The representation of intermediate results should be compact and usable to calculate the corresponding matches of affected queries in one pass.
– Update operations need to be efficient so that the intermediate results can be maintained incrementally and the affected queries detected quickly.
Algorithm 1 shows the outline of IncMQO, which takes an initial data graph G, a graph update stream, and a query set as input, and finds the matching results of affected queries when necessary. We first merge all queries into an annotated query graph (AQG) (Line 1). Then, we extract from the AQG an equivalent query tree by choosing a root vertex (Lines 2–3). The purpose is to exploit the pruning power of all edges in the AQG and to execute fast query evaluation by leveraging the tree structure. Based on the equivalent query tree, we construct an auxiliary data structure from each start vertex (see Section 3), namely MDCG, which provides guidance to identify the affected queries and generate the corresponding matches with light computational overhead (Lines 4–6); here a virtual vertex conveniently represents the parent of the root vertex. Finally, we perform continuous multi-query matching for each update operation. During a graph update stream, when an update arrives, we first check whether it can affect the query results, i.e., whether an edge e of the operation o matches a corresponding query edge. If so, we amend the auxiliary data structure MDCG depending on the update type of the operation o, and calculate the positive or negative matching results for the affected queries if necessary (Lines 7–11).
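The driver loop of Algorithm 1 can be sketched as follows. This is a skeleton only: every helper (`merge`, `extract_tree`, `build_mdcg`, `insert_eval`, `delete_eval`) is a hypothetical stub supplied by the caller, standing in for the components described in Sections 3 and 4.

```python
def inc_mqo(G, queries, stream,
            merge, extract_tree, build_mdcg, insert_eval, delete_eval):
    """Skeleton of the IncMQO driver loop (Algorithm 1)."""
    aqg = merge(queries)          # Line 1: annotated query graph
    etree = extract_tree(aqg)     # Lines 2-3: equivalent query tree
    mdcg = build_mdcg(G, etree)   # Lines 4-6: auxiliary structure
    results = []
    for op, u, v in stream:       # Lines 7-11: continuous evaluation
        if op == "I":
            results.append(insert_eval(mdcg, (u, v)))
        else:
            results.append(delete_eval(mdcg, (u, v)))
    return results
```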
3.Continuous multi-query processing model
When an edge update occurs, it is costly to process the queries sequentially. The central idea of multi-query handling is to employ a delicate data structure that can be used to compute the matches of affected queries in one pass.
3.1.Annotated query graph
Different from TRIC, which decomposes queries into covering paths and handles updates by finding affected paths, we propose the novel concept of an annotated query graph, namely AQG, which merges all queries into one. An AQG is a pair consisting of the merged query graph and a function δ that annotates each edge with a set of query IDs. Intuitively, for each edge e, the annotation δ(e) indicates the queries in which e is an edge. To construct the AQG, for each query graph, we first add a subscript to each vertex label to distinguish vertices that have the same label; that is, we use subscripts 1 through n for the n duplicate vertex labels in each query. Then, for each unvisited edge e in a query, we check whether it exists in other queries. If so, we add the corresponding query IDs to δ(e) and mark e and the corresponding edges in the other queries as visited. We repeat this process until all edges of every query have been visited. Finally, we construct the AQG by merging the annotated edges together. Based on the AQG, we can compute the matches of affected queries in one pass of enumeration instead of multiple passes.
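The merge step above can be sketched compactly. This is a minimal illustration, assuming vertices have already been renamed (subscripted) so that common sub-patterns share vertex names; the function and variable names are hypothetical.

```python
def build_aqg(queries):
    """Merge a set of query graphs into one annotated query graph.
    `queries` maps a query ID to its edge list; vertices are assumed
    to be pre-renamed so common sub-patterns share vertex names."""
    delta = {}  # edge -> set of query IDs annotated on it
    for qid, edges in queries.items():
        for e in edges:
            delta.setdefault(frozenset(e), set()).add(qid)
    vertices = {v for e in delta for v in e}
    return vertices, delta

queries = {
    "Q1": [("A", "B"), ("B", "C")],
    "Q2": [("B", "C"), ("C", "D")],
}
V, delta = build_aqg(queries)
# The shared edge (B, C) is stored once, annotated with both Q1 and Q2.
```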
The queries in Fig. 2(a) overlap and can be merged into an annotated query graph, which takes the union of the vertices and edges of the three query graphs. The edges are annotated by δ (labels omitted here). Observe that common sub-patterns are represented only once in the AQG, and the matches to all queries can be computed in a single enumeration of the matches of the AQG.
Note that there may be cases where the queries in the query set have no common component. In this situation, we process each query sequentially. In Section 6.11, we evaluate the performance of single-query processing against TurboFlux. The experimental results show that our algorithm is still more efficient.
3.2.Auxiliary data structure
Since continuous multi-query processing is triggered by each update operation on the data graph, it is more useful to maintain intermediate results for each vertex in the data graph, as TurboFlux does, rather than in the query graph. To this end, we propose a new data-centric auxiliary data structure based on the equivalent query tree of the AQG.
The equivalent query tree of a rooted AQG is defined as the tree ETree such that each edge in the AQG corresponds to a tree edge in ETree (e.g., Fig. 3(a) is the equivalent query tree of the AQG in Fig. 2(b)).
Note that since we transform all edges of the AQG into tree edges, there may be duplicate vertices in ETree (e.g., in Fig. 3(a)). To construct ETree, we need to choose a root vertex. We first adopt the core–forest decomposition strategy of prior work to determine the core part. Then, we use the edge overlapping factor and the candidate set to select the root vertex: we quantify a root degree for each core vertex and select the vertex with the minimum value as the root. Here, the overlapping factor of an edge is defined as the maximum number of queries annotated on that edge in the AQG (e.g., in Fig. 2(b)), and the candidate set of a query vertex u is the set of vertices in G that match u. After that, we traverse the AQG in BFS order from the root, and direct all edges from upper levels to lower levels to generate the equivalent query tree.
Suppose that the number of candidates of each vertex in the AQG is as shown in Fig. 2(c). We select the vertex with the lowest root degree as the root vertex and then generate the equivalent query tree as shown in Fig. 3(a).
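The root selection can be sketched as below. The paper combines the candidate-set size |C(u)| and the edge overlapping factor, but the exact formula is not reproduced here; as one plausible instantiation we rank by |C(u)| / f(u) and take the minimum (few candidates, heavily shared edges), which is an assumption of this sketch.

```python
def choose_root(core_vertices, cand_size, overlap):
    """Pick a root vertex for the equivalent query tree.
    Hypothetical root degree: |C(u)| / f(u); the vertex with the
    minimum value is chosen, favoring small candidate sets and
    edges shared by many queries."""
    return min(core_vertices, key=lambda u: cand_size[u] / overlap[u])

cands = {"u1": 100, "u2": 10, "u3": 10}
ovl = {"u1": 1, "u2": 1, "u3": 3}  # u3's edges are shared by 3 queries
root = choose_root(["u1", "u2", "u3"], cands, ovl)
# u3 wins: 10/3 < 10/1 < 100/1.
```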
Let e be an edge in ETree. For each query ID annotated on e, there must exist a path to e corresponding to that query, starting from a vertex that has no incoming edge annotated with the same query ID. This vertex is called the start vertex.
Consider the edge in Fig. 3(a); the start vertices corresponding to its two annotated queries differ.
Based on ETree and the AQG, we construct a novel data-centric auxiliary data structure called MDCG. For each vertex v in the data graph, we store the corresponding candidate query vertices as incoming edges to v in the intermediate results. The MDCG is a complete multigraph over the data vertices: each edge carries a query vertex ID in ETree as its label, and its state is one of Null/Incomplete/Complete. Each query vertex ID carries an annotation set of query IDs that conform to the corresponding state. Let u be a start vertex and v a data vertex to which it matches. An edge in the MDCG belongs to one of the following three types.
– Null edge: for each annotated query, there is no data path that matches the corresponding query path.
– Incomplete edge: v is a candidate such that for each annotated query, (1) there exists a data path that matches the corresponding query path; and (2) some subtree of the query vertex does not match any subtree of v.
– Complete edge: v is a candidate such that for each annotated query, (1) there exists a data path that matches the corresponding query path; and (2) every subtree of the query vertex matches a corresponding subtree of v.
Note that we do not store Null edges in the MDCG; they are hypothetical edges introduced to explain the incremental maintenance strategy (see Section 4). Furthermore, to reduce the storage cost, we use a bitmap for each vertex v in the MDCG, where the i-th bit indicates whether v has any incoming Incomplete edge whose label is the i-th query vertex.
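A minimal sketch of this index, with the convention that absence of an entry means state Null, is shown below. The class and method names are illustrative, and query-ID annotation sets are stored alongside each edge as described above.

```python
from enum import Enum

class State(Enum):
    INCOMPLETE = 1
    COMPLETE = 2
# Null edges are not stored: absence of an entry means state Null.

class MDCG:
    """Data-centric index sketch: for each data vertex v we keep
    incoming edges keyed by (source data vertex, query vertex ID),
    each holding a state and its set of annotated query IDs."""
    def __init__(self):
        self.inc = {}  # v -> {(u, qv_id): (state, query_ids)}

    def set_edge(self, u, v, qv_id, state, qids):
        self.inc.setdefault(v, {})[(u, qv_id)] = (state, set(qids))

    def bitmap(self, v):
        """i-th bit set iff v has an Incomplete incoming edge whose
        label is query vertex ID i (the storage-saving bitmap)."""
        bits = 0
        for (_, qv_id), (state, _) in self.inc.get(v, {}).items():
            if state is State.INCOMPLETE:
                bits |= 1 << qv_id
        return bits

m = MDCG()
m.set_edge("a", "b", 2, State.INCOMPLETE, {"Q1"})
m.set_edge("c", "b", 3, State.COMPLETE, {"Q2"})
# Only the Incomplete edge (label 2) contributes to b's bitmap.
```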
4.Continuous multi-query evaluation phase
We rely on an incremental maintenance strategy to efficiently maintain the MDCG for each edge update operation, and then propose an effective matching order to conduct subgraph pattern matching for the affected queries directly in a single pass of enumeration.
4.1.Incremental maintenance of intermediate results
We propose an edge state transition model to efficiently identify which update operations can affect the current intermediate results and/or contribute positive/negative matches for each affected query. The model consists of three states and six transition rules, which specify how one state transits to another.
Handling edge insertion When an edge insertion occurs, we have the following three edge transition rules.
From Null to Null. (1) Suppose that an inserted edge fails to match any query edge in the ETree; then its state is Null. (2) Suppose that an inserted edge matches a query edge in the ETree. If v in the MDCG has no Incomplete/Complete incoming edge with label u whose annotation set contains one of the relevant query IDs, then the state of the edge remains Null.
From Null to Incomplete. (1) Suppose that an inserted edge matches a query edge in the ETree. If v has an Incomplete/Complete incoming edge with label u whose annotation set shares a query ID with the edge, then we transit the state of the edge from Null to Incomplete and set its annotation set accordingly. (2) Suppose that the state of an edge in the MDCG is transited from Null to Incomplete; we need to propagate the update downwards. That is, for each adjacent vertex of the endpoint, we check whether it matches the corresponding child query vertex. If so, we transit its state in the MDCG from Null to Incomplete and set its annotation set accordingly.
Figure 4(b)–(c) gives an example of the edge transition rule from Null to Incomplete. In Fig. 4(a), when the edge insertion operation occurs, we find that it matches a query edge in the ETree. Since the endpoint has an incoming Incomplete edge with the matching label in Fig. 4(b), we transit the state of the new edge from Null to Incomplete and set its annotation. Next, the update is propagated downwards: an Incomplete edge is added into the MDCG, as shown in Fig. 4(c).
From Incomplete to Complete. (1) Suppose that the state of an edge is transited from Null to Incomplete. If the matched query vertex is a leaf in the ETree for some query, we transit the state of the edge for that query from Incomplete to Complete; for the remaining queries, the state remains Incomplete. (2) Suppose that the state of an edge in the MDCG is transited from Incomplete to Complete. If v has an outgoing Complete edge in the MDCG for every child query vertex with respect to a query, then we transit the state of every Incomplete incoming edge of v for that query from Incomplete to Complete; for the other queries, the state remains Incomplete.
Figure 4(d)–(e) gives an example of the edge transition rule from Incomplete to Complete. In Fig. 4(d), since the matched query vertex is a leaf in the ETree, the state of the corresponding edge is transited from Incomplete to Complete. Currently, the state of the edge in Fig. 4(c) is transited from Null to Incomplete; since its query vertex is also a leaf, its state is further transited from Incomplete to Complete. Note that there is another Complete edge with the same label, so we can merge them together. Then, we further check the state of the parent edge. Since the vertex has an outgoing Complete edge for every child query vertex, its state is transited from Incomplete to Complete. Next, the update is propagated upwards: the state of the upstream edge is transited from Incomplete to Complete, as shown in Fig. 4(e).
Handling edge deletion When an edge deletion occurs, we have the following three reversed edge transition rules.
From Complete to Null. (1) For each deleted edge such that v in the MDCG has an incoming Incomplete or Complete edge whose label is u, if the deleted edge matches the corresponding query edge and the state of the MDCG edge is Complete, then transit that state from Complete to Null. (2) Suppose that the state of an edge in the MDCG is transited from Incomplete or Complete to Null. For each annotated query, if the vertex no longer has any incoming edge whose label and annotation match that query, then transit the state of every outgoing Complete edge of the vertex for that query from Complete to Null.
Figure 4(f) gives an example of the edge transition rule from Complete to Null. In Fig. 4(a), when the edge deletion operation occurs, the state of the corresponding edge is transited from Complete to Null in the MDCG, as shown in Fig. 4(f), since the endpoint has an incoming Complete edge with the matching label and the deleted edge matches a query edge in the ETree.
From Complete to Incomplete. Suppose that the state of an edge in the MDCG is transited from Complete to Incomplete or Null. For each annotated query, if v in the MDCG no longer has an outgoing Complete edge whose label and annotation match that query, then we transit the state of every incoming Complete edge of v for that query from Complete to Incomplete.
Figure 4(g)–(h) gives an example of the edge transition rule from Complete to Incomplete. In Fig. 4(g), since the state of the edge is transited from Complete to Null, and the vertex no longer has an outgoing Complete edge for every child query vertex in the ETree, the state of the corresponding incoming edge is transited from Complete to Incomplete. Meanwhile, the state of the edge for the other query remains Complete, since it still meets the Complete requirement for that query. Next, the update is propagated upwards: the state of the upstream edge is transited from Complete to Incomplete, as shown in Fig. 4(h).
From Incomplete to Null. (1) If v in the MDCG has an incoming Incomplete or Complete edge whose label is u, and the state of the corresponding edge is Incomplete, then transit it from Incomplete to Null. (2) Suppose that the state of an edge in the MDCG is transited from Incomplete or Complete to Null. For each annotated query, if the vertex no longer has any incoming edge whose label and annotation match that query, then transit the state of every outgoing Incomplete edge of the vertex for that query from Incomplete to Null.
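The six rules above form a small state machine: three forward transitions for insertions and three reversed ones for deletions. The following sketch encodes just the allowed transitions; it abstracts away the annotation-set bookkeeping and propagation, and the names are illustrative.

```python
from enum import Enum

class S(Enum):
    NULL = 0
    INCOMPLETE = 1
    COMPLETE = 2

# The six transition rules of the model: three for insertion,
# three (reversed) for deletion.
INSERT_RULES = {(S.NULL, S.NULL), (S.NULL, S.INCOMPLETE),
                (S.INCOMPLETE, S.COMPLETE)}
DELETE_RULES = {(S.COMPLETE, S.NULL), (S.COMPLETE, S.INCOMPLETE),
                (S.INCOMPLETE, S.NULL)}

def transit(state, target, is_insertion):
    """Apply a transition only if the rule set of the current
    operation type permits it; otherwise keep the old state."""
    rules = INSERT_RULES if is_insertion else DELETE_RULES
    return target if (state, target) in rules else state

s = transit(S.NULL, S.INCOMPLETE, is_insertion=True)  # allowed
t = transit(S.NULL, S.COMPLETE, is_insertion=True)    # no such rule
```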
4.2.Subgraph search phase
If the state of an edge transits to Complete, we say the queries in its annotation set are affected queries caused by that edge. We then propose an efficient algorithm to calculate the corresponding positive matches for each affected query based on the MDCG, directly in a single pass of enumeration. Its main idea is as follows: (1) we derive a matching order based on the number of affected queries on each edge in the ETree; and then (2) we compute the positive matches for each affected query following the matching order.
To calculate the matching order, the algorithm first marks the initial vertex as visited. Subsequently, given the set of unvisited vertices adjacent to the visited ones in the ETree, the next vertex is the one whose edge covers the maximal number of affected queries. In case of a tie, it chooses the query vertex having the minimum number of candidate data vertices in the MDCG. After that, the chosen vertex is marked as visited. The matching order for the remaining query vertices is determined along the same lines.
Intuitively, this yields a "global" matching order for the query vertices in the AQG. It prioritizes vertices that are shared by more queries. Matching such vertices at an early stage helps avoid enumerating many unpromising intermediate results, since the corresponding pruning benefits multiple queries at the same time.
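The greedy order selection can be sketched as follows; `affected` and `cand_size` are hypothetical inputs standing for the per-vertex affected-query sets and MDCG candidate counts described above.

```python
def matching_order(start, adj, affected, cand_size):
    """Greedy matching order over the ETree: at each step pick the
    unvisited neighbor of the visited set that covers the most
    affected queries, breaking ties by fewest candidate vertices."""
    order = [start]
    visited = {start}
    frontier = set(adj[start])
    while frontier:
        nxt = max(frontier,
                  key=lambda v: (len(affected[v]), -cand_size[v]))
        order.append(nxt)
        visited.add(nxt)
        frontier = (frontier | set(adj[nxt])) - visited
    return order

adj = {"r": ["a", "b"], "a": ["r"], "b": ["r", "c"], "c": ["b"]}
affected = {"a": {"Q1"}, "b": {"Q1", "Q2"}, "c": {"Q2"}}
cand = {"a": 5, "b": 7, "c": 2}
order = matching_order("r", adj, affected, cand)
# "b" covers two affected queries, so it is matched before "a";
# the tie between "a" and "c" is broken by candidate count.
```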
In the next stage, the algorithm enumerates the positive matches for each affected query graph embedded in the MDCG following the proposed matching order. It adopts the generic backtracking approach. During the matching process, it enumerates and prunes adjacent Complete edges by inspecting whether the edge label contains the affected query. The same edge shared by different query graphs is enumerated only once.
As shown in Fig. 3(d), the state of the inserted edge transits to Complete, and its annotated queries become the affected query graphs. First, we mark the initial vertex as visited. Then, among its adjacent vertices in the ETree (see Fig. 2(a)), we choose as the next vertex the one contained in both affected queries. Finally, a matching order is deduced, based on which the algorithm enumerates all the positive matches of the affected query graphs in a single pass.
In this section, we present the detailed algorithms of IncMQO. To efficiently handle continuous multi-query processing, we first construct the auxiliary data structure MDCG, and then present two major functions, insertEval and deleteEval, which apply the necessary transition rules to efficiently maintain the intermediate results for each update. Finally, we describe the matching algorithm that reports the corresponding positive/negative matches in one pass based on the MDCG.
In this subsection, we explain BuildMDCG (Line 6 of Algorithm 1), which is invoked for every vertex v with an Incomplete incoming edge. It uses a depth-first traversal strategy to extend each v in the MDCG. First, we check whether there exists a data edge matching the query edge between a child of u and a neighbor of v; if so, we transit the state of the corresponding edge from Null to Incomplete (Lines 1–4). If the child query vertex is not a leaf, we call BuildMDCG recursively to match its children (Lines 5–6); otherwise, we transit the state of the edge from Incomplete to Complete (Line 8). After that, we check whether the subtree of v matches the corresponding subtree of u; if so, we transit the state of the edge from Incomplete to Complete (Lines 8–9).
The time complexity of BuildMDCG is bounded as follows. In the worst case, BuildMDCG is called for every query vertex u and every data vertex v. Thus, given u and v, the costs of Lines 1–2 and Lines 8–9 of Algorithm 2 together determine the overall time complexity of BuildMDCG. □
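The recursive extension described above can be sketched as follows. This is a simplified model, not the paper's Algorithm 2: states are abbreviated to "I"/"C", query-ID annotations are omitted, and all names are illustrative.

```python
def build_mdcg(u, v, children, data_adj, q_labels, g_labels, mdcg):
    """Recursive sketch of BuildMDCG: extend downwards from query
    vertex u matched at data vertex v. An edge becomes Complete
    once every subtree below it is matched."""
    if not children.get(u):           # leaf: nothing below to match
        return True
    all_matched = True
    for uc in children[u]:
        matched = False
        for vc in data_adj.get(v, []):
            if g_labels[vc] == q_labels[uc]:
                mdcg[(v, vc, uc)] = "I"       # Null -> Incomplete
                if build_mdcg(uc, vc, children, data_adj,
                              q_labels, g_labels, mdcg):
                    mdcg[(v, vc, uc)] = "C"   # Incomplete -> Complete
                    matched = True
        all_matched = all_matched and matched
    return all_matched

# Tiny example: query path u0(A)-u1(B) over data edge a(A)-b(B).
children = {"u0": ["u1"], "u1": []}
q_labels = {"u0": "A", "u1": "B"}
data_adj = {"a": ["b"], "b": ["a"]}
g_labels = {"a": "A", "b": "B"}
mdcg = {}
ok = build_mdcg("u0", "a", children, data_adj, q_labels, g_labels, mdcg)
# The edge (a, b) labeled u1 ends up Complete.
```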
insertEval (Algorithm 3) is invoked for each newly arriving edge. Its main idea is as follows: we try to match the new edge with the query edges in the ETree and update the MDCG based on the corresponding maintenance strategy. Then we build the MDCG downwards for the affected subtree and further update the MDCG upwards until reaching a start vertex. Finally, we execute the subgraph matching to report the matching results.
Note that not every inserted edge causes an update to the MDCG. Only when some data path matches the corresponding query path can the insertion cause an update. Thus, we collect all path-matched query vertices u into a vertex set U (Line 1); this guarantees that v has an Incomplete or Complete edge. Then, for each child query vertex of u, if the inserted edge matches the corresponding query edge, we set the state of the edge to Incomplete using the rule From Null to Incomplete, and execute BuildMDCG downwards to build the new part of the MDCG (Lines 2–6). If the state of the inserted edge finally transits to Complete, we execute updateMDCG to update the states of the edges along the path (Lines 7–8).
Here, updateMDCG (Algorithm 4) traverses the MDCG upwards and performs the transition rules if necessary. It is only called when v has an incoming edge with Incomplete state (Line 2). Then, when v's subtree matches u's subtree, we further transit the edge state to Complete with the rule From Incomplete to Complete (Lines 3–4). When a start data vertex is reached, updateMDCG terminates; otherwise, we continue to recurse upwards (Lines 5–6).
The deleteEval algorithms for edge deletion are very similar to those for edge insertion, except that they use the reversed transition rules; thus, we omit the details here.
In this section, we perform extensive experiments on both real and synthetic datasets to evaluate the performance of the IncMQO algorithm for continuous multi-query matching over dynamic graphs. The performance of IncMQO was evaluated while varying parameters such as the overlap rate of the query set, the average query size, the query database size, the edge update size, and the graph size. The proposed algorithms were implemented in C++ and run on a Linux machine with a two-core Intel Xeon CPU at 2.2 GHz and 32 GB of main memory.
6.1.Datasets and query generation
The SNB dataset. SNB is a synthetic benchmark designed to accurately simulate the evolution of a social network through time. This evolution is modeled using activities that occur inside a social network. Based on the SNB generator, we derived three datasets of increasing size, SNB0.1M, SNB1M, and SNB10M, and use SNB1M as the default.
The NYC dataset. NYC is a real-world set of taxi rides performed in New York City (TAXI) in 2013. TAXI contains more than 160M entries of taxi rides with information about the license, pickup and drop-off locations, the trip distance, the date and duration of the trip, and the fare. We utilized the available data to generate a data graph.
The BioGRID dataset. BioGRID is a real-world dataset that represents protein-to-protein interactions. We used BioGRID to generate a stream of updates that results in a graph.
To construct the set of query graph patterns, we identified two typical distinct query classes: trees and graphs. Each type of query graph pattern was chosen equiprobably during the generation of the query set. The default values for the query set are: (1) an average query size l of 5; (2) a query database size of 500 query graphs; and (3) a factor θ that denotes the percentage of overlap between the queries.
Our method, denoted IncMQO, is compared with a number of related works. TRIC is the state-of-the-art continuous multi-query processing method over dynamic graphs. It utilizes the common parts of minimum covering paths to amortize the costs of processing and answering queries. TurboFlux and GraphFlow are the state-of-the-art continuous subgraph matching methods for a single query. Both can be utilized in multi-query processing scenarios by adopting a sequential query processing strategy.
We measure and evaluate (1) the elapsed time and the size of intermediate results of IncMQO and its competitors when varying the percentage of overlap between the queries in the query set; (2) the same metrics when varying the average query size and the query database size; (3) the same metrics when varying the edge insertion size; (4) the same metrics when varying the edge deletion size; and (5) the scalability of IncMQO.
6.3.Evaluating the efficiency of IncMQO
In this subsection, we evaluate the performance of IncMQO against its competitors in terms of processing time and storage cost on three datasets: SNB1M, NYC, and BioGRID, with a default update stream.
Time efficiency comparison Fig. 5(1) shows the total processing time of IncMQO and its competitors over the different datasets. IncMQO is better than its competitors on all datasets. Notably, IncMQO outperforms TRIC, TurboFlux, and GraphFlow by up to 8.43, 28.93, and 385.21 times, respectively. The reason is that TRIC needs to maintain a large number of indexes to track the matching results, while TurboFlux and GraphFlow process the multiple queries sequentially, which incurs large time overhead. In particular, GraphFlow has the worst performance, since it does not store any intermediate results and recomputes matches from scratch for each update.
Space efficiency comparison Fig. 5(2) shows the size of the intermediate results on each dataset. We only evaluate IncMQO, TRIC, and TurboFlux, since GraphFlow does not maintain any intermediate results. IncMQO outperforms TRIC and TurboFlux by up to 9.03 and 29.07 times, respectively. This is because TRIC maintains a large number of materialized views and TurboFlux needs to construct an auxiliary data structure for each query in the query set, resulting in higher storage cost.
6.4.Varying percentage of query overlap
In Fig. 6(1) we give the results of the time efficiency evaluation when varying the parameter θ on the SNB1M dataset. In this setup, the algorithms are evaluated for varying percentages of query overlap, where θ = 0 means that the queries have no overlap. IncMQO significantly outperforms the other approaches for larger θ. A higher percentage of query overlap decreases the number of calculations performed by algorithms designed to exploit commonalities among the query set. The results show that IncMQO and TRIC behave in this manner, while TurboFlux and GraphFlow do not, since they process one query at a time. Note that in Fig. 6(1), for the smallest θ, IncMQO is slightly worse than TurboFlux but still better than TRIC by 1.25 times and GraphFlow by 5.07 times. Figure 6(2) plots the average size of the intermediate results. IncMQO achieves the smallest intermediate results since it merges all queries into one and builds a concise auxiliary data structure; in particular, IncMQO is superior to TurboFlux by up to 63.2 times.
6.5. Varying the average query size
In this subsection, we evaluate the impact of the average query size on the performance of IncMQO and its competitors. Figure 7(1)–(2) show the performance results on the SNB1M dataset. We set l from 3 to 9 in increments of 2, with the other parameters fixed. Note that the matching cost does not always increase as the average query size increases. Figure 7(1) shows the elapsed time: IncMQO significantly outperforms its competitors regardless of the average query size. Specifically, IncMQO outperforms TRIC by 6.40–11.72 times, TurboFlux by 39.06–49.24 times and GraphFlow by 139.90–172.33 times. Figure 7(2) gives the average size of intermediate results. The average size of the intermediate results of TRIC increases rapidly with the average query size; specifically, IncMQO outperforms TRIC by up to 12.31 times when l is 9. Since TRIC uses path join operations, the larger the query graph pattern, the more join operations it requires.
6.6. Varying query database size
In this subsection, we evaluate the impact of the size of the query database on the performance of IncMQO and its competitors. Figure 8(1)–(2) show the performance results using SNB for a varying query database size. More specifically, we fix the other parameters and vary the query database size from 250 to 1000 in increments of 250. Note that the y-axis is in logarithmic scale. Figure 8(1) shows the processing time for each algorithm. The results reveal that IncMQO significantly outperforms its competitors regardless of the query database size. Specifically, IncMQO outperforms TRIC by up to 8.92 times, TurboFlux by up to 82.73 times and GraphFlow by up to 287.77 times. IncMQO also outperforms its competitors in terms of the size of intermediate results, as shown in Fig. 8(2). The performance gap between IncMQO and TRIC widens as the query database grows. Specifically, the average size of the intermediate results of TRIC and TurboFlux is larger than that of IncMQO by up to 10.78 times and 50.47 times, respectively.
6.7. Varying the edge insertion size
Figure 9(1)–(2) show the performance results using SNB1M for a varying edge insertion size. Here, we vary the number of newly inserted edges, expressed as a percentage of the number of triples in the graph update stream. Figure 9(1) shows the total processing time for each algorithm. The results demonstrate that the behavior of all algorithms is aligned with our previous observations, and IncMQO consistently performs better than its competitors. Specifically, IncMQO outperforms TRIC by up to 6.43 times, TurboFlux by up to 24.04 times and GraphFlow by up to 152.33 times. In terms of the size of intermediate results, IncMQO also performs better than its competitors, as shown in Fig. 9(2). Specifically, at the largest insertion size, the average size of the intermediate results of TRIC and TurboFlux is larger than that of IncMQO by up to 9.96 times and 60.24 times, respectively.
6.8. Varying the edge deletion size
Figure 10(1)–(2) show the performance results using SNB1M for a varying edge deletion size. Here, we vary the number of expired edges, expressed as a percentage of the number of triples in the graph update stream. Figure 10(1) shows the total processing time for each algorithm. Note that the deletion of an edge may affect the auxiliary data structure. As the edge deletion size increases, the incremental subgraph matching times of IncMQO, TRIC, and TurboFlux increase, while that of GraphFlow decreases, because edge deletions directly reduce the input data size of GraphFlow. Nevertheless, IncMQO still consistently outperforms its competitors regardless of the edge deletion size. As shown in Fig. 10(2), at the largest deletion size, the average number of intermediate results of TRIC is larger than that of IncMQO by up to 6.4 times, and that of TurboFlux is larger by up to 23.4 times.
6.9. Varying dataset size
Figure 11(1)–(2) show the performance results using SNB for a varying dataset size. Here, we fixed the other parameters and varied the size of SNB from 0.1M to 10M. In Fig. 11(1), IncMQO consistently outperforms its competitors regardless of the dataset size, and its processing time grows non-exponentially as the dataset size increases. This scalability suggests that IncMQO can handle reasonably large real-life graphs. Figure 11(2) shows similar scalability of intermediate result sizes for IncMQO, TRIC and TurboFlux. Specifically, IncMQO outperforms TRIC by up to 10.70 times and TurboFlux by up to 53.86 times.
6.10. Comparison of different matching orders
In this set of experiments, we evaluate the performance of different matching orders on the SNB1M dataset. We compare our proposed matching order with a baseline that adopts the subgraph matching algorithm of TurboFlux, which uses the candidate set of each query path to determine the matching order. We set l from 3 to 9 in increments of 2, with the other parameters fixed. The results are within expectation. Figure 12(1) shows the elapsed time: IncMQO outperforms the baseline by 1.91–2.76 times. Figure 12(2) gives the average size of intermediate results; IncMQO again performs better than the baseline, since it preferentially matches the vertices shared by more queries, which helps avoid enumerating many unpromising intermediate results.
6.11. Performance evaluation of single query
In this set of experiments, we evaluate the performance on single queries to test the efficiency when there are no common components in the query set. Figure 13(1)–(2) show the performance results on the SNB1M dataset for IncMQO (without AQG) and TurboFlux. We set l from 3 to 9 in increments of 2, with the other parameters fixed. IncMQO outperforms TurboFlux in terms of both elapsed time and intermediate result size. This is because we translate the query graph into an equivalent query tree rather than a spanning tree. When there is a cycle in the query graph, the spanning tree ignores the non-tree edges, while the equivalent query tree takes both tree edges and non-tree edges into consideration. As a result, using an equivalent query tree achieves stronger pruning ability.
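To illustrate the distinction, the following minimal sketch (our illustration under hypothetical adjacency-dict representations, not the actual IncMQO code) classifies the edges of an undirected query graph into BFS tree edges and the cycle-closing non-tree edges that an equivalent query tree additionally retains:

```python
from collections import deque

def classify_edges(adj, root):
    """Split an undirected graph (adjacency dict) into BFS tree edges
    and cycle-closing non-tree edges, starting from `root`."""
    parent = {root: None}
    tree_edges, non_tree_edges = [], set()
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:          # first discovery: a tree edge
                parent[v] = u
                tree_edges.append((u, v))
                queue.append(v)
            elif v != parent[u]:         # already seen and not our parent:
                non_tree_edges.add(frozenset((u, v)))  # closes a cycle
    return tree_edges, non_tree_edges

# A triangle query a-b-c: a pure spanning tree keeps only two of the
# three edges, so a simple path in the data graph could wrongly survive
# tree-based pruning; retaining the non-tree edge (b, c) prevents this.
triangle = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b'}}
tree, non_tree = classify_edges(triangle, 'a')
```

Checking candidates against both edge sets is what gives the equivalent query tree its stronger pruning ability relative to a spanning tree.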
6.12. Comparison of incremental matching and recomputing algorithm
In this subsection, we further compare the incremental algorithms (IncMQO and TRIC) with the recomputing algorithm (MQO) to identify the limitations that the number of updates brings to our algorithm. The recomputing algorithm performs subgraph matching of the pattern graphs over the updated data graph in batch mode. We conduct experiments on the two synthetic and real-life data graphs by varying the number of newly inserted edges, expressed as a percentage of the number of triples in the graph update stream. Figure 14(1) and Fig. 14(2) show the total processing time for each algorithm on SNB1M and NYC. We see that IncMQO performs better than TRIC by 2.03–4.24 times and MQO by 1.34–14.30 times when the size of the updates is no more than 30%|G|. However, as the edge insertion size continues to rise, the auxiliary data structure used in incremental matching needs to maintain a large number of intermediate results, making it less effective than matching from scratch over the updated graph.
7. Related work
We categorize the related work as follows.
Subgraph isomorphism research Subgraph isomorphism is a fundamental requirement for graph databases and has been widely used in many fields. While this problem is NP-hard, many algorithms have been proposed in recent years to solve it in reasonable time on real datasets, such as VF2, GraphQL, TurboISO and QuickSI. Almost all of these algorithms follow the framework of Ullmann, with pruning strategies, heuristic matching orders and auxiliary neighborhood information to improve the performance of the subgraph matching search. Lee et al. compared these subgraph isomorphism algorithms in a common code base and gave an in-depth analysis. However, these techniques are designed for static graphs and are not suitable for processing continuous graph queries on evolving graphs.
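To make the shared Ullmann-style framework concrete, the following minimal sketch (our illustration; the graph representations and names are hypothetical, and real systems add far more aggressive filtering and ordering) combines label-based candidate filtering with recursive backtracking:

```python
def subgraph_isomorphisms(q_adj, q_labels, g_adj, g_labels):
    """Yield every injective mapping of query vertices to data vertices
    that preserves labels and all query edges (subgraph isomorphism)."""
    order = list(q_adj)  # fixed matching order; real algorithms optimize it

    def extend(mapping):
        if len(mapping) == len(order):
            yield dict(mapping)
            return
        u = order[len(mapping)]
        used = set(mapping.values())
        for v in g_adj:
            # Prune by label and injectivity first (Ullmann-style filtering).
            if v in used or g_labels[v] != q_labels[u]:
                continue
            # Every already-mapped neighbor of u must map to a neighbor of v.
            if all(mapping[w] in g_adj[v] for w in q_adj[u] if w in mapping):
                mapping[u] = v
                yield from extend(mapping)
                del mapping[u]           # backtrack

    yield from extend({})

# Triangle query against a data graph containing exactly one triangle;
# the six matches are the 3! orderings of that triangle's vertices.
q_adj = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b'}}
g_adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
matches = list(subgraph_isomorphisms(q_adj, {u: 'X' for u in q_adj},
                                     g_adj, {v: 'X' for v in g_adj}))
```

The surveyed algorithms differ mainly in how they shrink the candidate loop and choose the matching order, not in this overall recursive structure.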
Multiple query optimization for relational databases Multiple query optimization (MQO) has been well studied in relational databases [24,25]. Most work on relational databases extends existing statistics-based cost models and searches for a globally optimal access plan among all possible access plan combinations of the single queries; meanwhile, some intermediate access plans can be shared to accelerate multi-query evaluation. However, these relational MQO methods cannot be directly used for multi-query subgraph homomorphism search, because we do not assume that statistics or indexes exist on the data graph, and some relational optimization strategies (such as pushing down selections and projections) are not suitable for subgraph homomorphism search. In addition, the methods for identifying common relational subexpressions are also difficult to apply or inefficient for graph-based multi-query evaluation, since it is inefficient to evaluate graph pattern queries by converting them into relational queries.
One-time multiple query evaluation Le et al. studied the problem of evaluating one-time multiple SPARQL queries (SPARQL-MQO) over RDF graphs. Given a batch of graph queries, they clustered the queries into disjoint finer groups and then rewrote the patterns of each group into a single common pattern graph. However, the clustering technique (and selectivity information) becomes degenerate, and the construction of query sets ignores the cyclic structures in queries. Subsequently, Ren et al. extended one-time multiple queries to undirected labeled graphs. They organized the useful common subgraphs and the original queries in a data structure called the pattern containment map (PCM), and then cached the intermediate results based on the PCM to balance memory usage and execution time. Note that the PCM was designed for static graphs; using it on a dynamic graph would require reconstructing the PCM for each update operation, which is practically infeasible. In contrast, our proposed MDCG stores intermediate results at each data vertex of the data graph, which reduces the maintenance operations associated with data graph updates.
Continuous query processing Continuous query processing was first considered in the setting of continuously monitoring a set of graph streams and reporting the appearances of a set of pattern graphs at each timestamp [3,14]. However, that work offered approximate answers instead of using subgraph isomorphism verification to find the exact query answers. Later, Fan et al. proposed the concept of incremental subgraph matching to handle the continuous query problem: subgraph matching is executed only over the updated part, avoiding recomputation from scratch. In addition, Gillani et al. proposed continuous graph pattern matching over knowledge graph streams and used two different execution models with an automata-based model to guide the search process. Kim et al. proposed a novel data-centric representation to process continuous subgraph matching. However, all of the above algorithms evaluate each query separately and cannot be directly used for the multi-query problem.
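The incremental idea can be sketched as follows (a simplified illustration under our own hypothetical representations, not the code of any cited system): on an edge insertion, only matches that use the new data edge can be new, so the search is seeded at each query edge the inserted edge can play and extended from there, rather than re-matching the whole graph:

```python
def new_matches_on_insert(q_adj, q_labels, g_adj, g_labels, x, y):
    """Yield matches of the query that contain the newly inserted data
    edge (x, y); g_adj is assumed to already include that edge."""
    order = list(q_adj)

    def extend(mapping):
        unmapped = [u for u in order if u not in mapping]
        if not unmapped:
            yield dict(mapping)
            return
        u = unmapped[0]
        used = set(mapping.values())
        for v in g_adj:
            if v in used or g_labels[v] != q_labels[u]:
                continue
            if all(mapping[w] in g_adj[v] for w in q_adj[u] if w in mapping):
                mapping[u] = v
                yield from extend(mapping)
                del mapping[u]

    seen = set()
    # Seed: let the inserted edge play the role of each query edge.
    for u1 in q_adj:
        for u2 in q_adj[u1]:
            if q_labels[u1] == g_labels[x] and q_labels[u2] == g_labels[y]:
                for m in extend({u1: x, u2: y}):
                    key = tuple(sorted(m.items()))
                    if key not in seen:  # the same match can arise twice
                        seen.add(key)
                        yield m

# Inserting edge (1, 3) completes the triangle 1-2-3 in the data graph,
# so every new match of the triangle query must use that edge.
q_adj = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b'}}
g_adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}  # state after the insertion
new = list(new_matches_on_insert(q_adj, {u: 'X' for u in q_adj},
                                 g_adj, {v: 'X' for v in g_adj}, 1, 3))
```

Negative matches for an edge deletion follow the symmetric idea: only matches that used the removed edge can expire.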
Continuous multiple query evaluation In addition, there is some research on the topic of continuous multiple query evaluation. Pugliese et al. studied maintaining multiple subgraph query views under single-edge insertion workloads, taking advantage of common substructures and using an optimal merge strategy. However, it only focused on insertions, while in the real world deletions also occur. Kankanamge et al. studied the problem of optimizing and evaluating multiple subgraph queries continuously in a changing graph. They decomposed all the queries into multiple delta queries, which were then evaluated one query vertex at a time, without requiring any auxiliary data structures. Mhedhbi et al. optimized both one-time and continuous subgraph queries using worst-case optimal joins. Since the latter two methods do not store any intermediate results, they need to evaluate subgraph matching for each update, even if the update does not generate any positive/negative match. Recently, Zervakis et al. handled the continuous multi-query problem over graph streams by decomposing each query into covering paths and then constructing a tree-based data structure to index and cluster continuous graph queries. However, this data structure is not concise enough, leading to many expensive join operations. Compared to covering paths, our proposed MDCG uses a data-centric representation of intermediate results, which is more concise; as a result, MDCG consumes less memory and has lower maintenance costs for each update.
8. Conclusion and further work
In this paper, we proposed an efficient continuous multi-query processing engine, namely IncMQO, for dynamic graphs. We showed that IncMQO resolves the problems of existing methods and processes continuous multiple subgraph matching efficiently for each update operation. We first developed the novel concept of an annotated query graph that merges multiple queries into one. Then we constructed a data-centric auxiliary data structure based on the equivalent query tree of the annotated query graph to represent partial solutions in a concise form. For each update, we proposed an edge transition strategy to maintain the intermediate results incrementally and detect the affected queries quickly. Moreover, we proposed an efficient matching order to calculate the positive or negative matching results for each affected query in one pass. Finally, comprehensive experiments performed on real and benchmark datasets demonstrate that our proposed algorithm outperforms the alternatives.
A couple of issues need further study. We only consider the scenario in which all the queries are given at the start; in practice, however, the query set is updated according to users' demands. In future work, we plan to design an efficient algorithm to handle updates of the query set in multi-query processing over dynamic graphs.
This work is partially supported by the National Key Research and Development Program under Grant No. 2018YFB1800203, Tianjin Science and Technology Foundation under Grant No. 18ZXJMTG00290, National Natural Science Foundation of China under Grant No. 61872446, and National Natural Science Foundation of Hunan Province under Grant No. 2019JJ20024.
F. Bi, L. Chang, X. Lin, L. Qin and W. Zhang, Efficient subgraph matching by postponing Cartesian products, in: Proceedings of the 2016 International Conference on Management of Data, ACM, San Francisco, CA, 2016. doi:10.1145/2882903.2915236.
Y. Chen, X. Zhao, X. Lin, Y. Wang, D. Guo, B. Ren, G. Cheng and D. Guo, Efficient mining of frequent patterns on uncertain graphs, IEEE Trans. Knowl. Data Eng. 31(2) (2019), 287–300. doi:10.1109/TKDE.2018.2830336.
S. Choudhury, L.B. Holder, G. Chin, K. Agarwal and J. Feo, A selectivity based approach to continuous pattern detection in streaming graphs, in: Proceedings of the 18th International Conference on Extending Database Technology, OpenProceedings.org, Brussels, Belgium, 2015, pp. 157–168. doi:10.5441/002/edbt.2015.15.
L.P. Cordella, P. Foggia, C. Sansone and M. Vento, A (sub)graph isomorphism algorithm for matching large graphs, IEEE Trans. Pattern Anal. Mach. Intell. 26(10) (2004), 1367–1372. doi:10.1109/TPAMI.2004.75.
O. Erling, A. Averbuch, J. Larriba-Pey, H. Chafi, A. Gubichev, A. Prat-Pérez, M. Pham and P.A. Boncz, The LDBC social network benchmark: Interactive workload, in: Proceedings of the 2015 International Conference on Management of Data, ACM, Melbourne, Victoria, Australia, 2015, pp. 619–630. doi:10.1145/2723372.2742786.
W. Fan, J. Li, J. Luo, Z. Tan, X. Wang and Y. Wu, Incremental graph pattern matching, in: Proceedings of the 2011 International Conference on Management of Data, ACM, Athens, Greece, 2011, pp. 925–936. doi:10.1145/1989323.1989420.
S.J. Finkelstein, Common subexpression analysis in database applications, in: Proceedings of the 1982 International Conference on Management of Data, ACM Press, Orlando, Florida, 1982, pp. 235–245. doi:10.1145/582353.582400.
J. Gao, C. Zhou and J.X. Yu, Toward continuous pattern detection over evolving large graph with snapshot isolation, VLDB J. 25(2) (2016), 269–290. doi:10.1007/s00778-015-0416-z.
S. Gillani, G. Picard and F. Laforest, Continuous graph pattern matching over knowledge graph streams, in: Proceedings of the 10th International Conference on Distributed and Event-Based Systems, ACM, Irvine, CA, 2016, pp. 214–225. doi:10.1145/2933267.2933306.
W. Han, J. Lee and J. Lee, Turboiso: Towards ultrafast and robust subgraph isomorphism search in large graph databases, in: Proceedings of the 2013 International Conference on Management of Data, ACM, New York, NY, 2013, pp. 337–348. doi:10.1145/2463676.2465300.
H. He and A.K. Singh, Query language and access methods for graph databases, in: Managing and Mining Graph Data, Advances in Database Systems, Vol. 40, Springer, 2010, pp. 125–160. doi:10.1007/978-1-4419-6045-0_4.
C. Kankanamge, Multiple continuous subgraph query optimization using delta subgraph queries, Thesis, 2018.
C. Kankanamge, S. Sahu, A. Mhedbhi, J. Chen and S. Salihoglu, Graphflow: An active graph database, in: Proceedings of the 2017 International Conference on Management of Data, ACM, Chicago, IL, 2017, pp. 1695–1698. doi:10.1145/3035918.3056445.
U. Khurana and A. Deshpande, Efficient snapshot retrieval over historical graph data, in: Proceedings of the 29th International Conference on Data Engineering, IEEE Computer Society, Brisbane, Australia, 2013, pp. 997–1008. doi:10.1109/ICDE.2013.6544892.
K. Kim, I. Seo, W. Han, J. Lee, S. Hong, H. Chafi, H. Shin and G. Jeong, TurboFlux: A fast continuous subgraph matching system for streaming graph data, in: Proceedings of the 2018 International Conference on Management of Data, ACM, Houston, TX, 2018, pp. 411–426. doi:10.1145/3183713.3196917.
W. Le, A. Kementsietsidis, S. Duan and F. Li, Scalable multi-query optimization for SPARQL, in: Proceedings of the 28th International Conference on Data Engineering, IEEE Computer Society, Washington, DC, 2012, pp. 666–677. doi:10.1109/ICDE.2012.37.
J. Lee, W. Han, R. Kasperovics and J. Lee, An in-depth comparison of subgraph isomorphism algorithms in graph databases, Proc. VLDB Endow. 6(2) (2012), 133–144. doi:10.14778/2535568.2448946.
A. Mhedhbi, C. Kankanamge and S. Salihoglu, Optimizing one-time and continuous subgraph queries using worst-case optimal joins, ACM Trans. Database Syst. 46(2) (2021), 6:1–6:45. doi:10.1145/3446980.
A. Pugliese, M. Bröcheler, V.S. Subrahmanian and M. Ovelgönne, Efficient multiview maintenance under insertion in huge social networks, ACM Trans. Web 8(2) (2014), 10:1–10:32. doi:10.1145/2541290.
Y. Qin, D. Guo, X. Lin and G. Cheng, Design and optimization of VLC enabled data center network, Tsinghua Science and Technology 25(1) (2020), 82–92. doi:10.26599/TST.2018.9010105.
Y. Qin, D. Guo, X. Lin, G. Tang and B. Ren, TIO: A VLC-enabled hybrid data center network architecture, Tsinghua Science and Technology (2019). doi:10.26599/TST.2018.9010093.
B. Ren, G. Cheng and D. Guo, Minimum-cost forest for uncertain multicast with delay constraints, Tsinghua Science and Technology 24(2) (2019), 13. doi:10.26599/TST.2018.9010072.
X. Ren and J. Wang, Multi-query optimization for subgraph isomorphism search, Proc. VLDB Endow. 10(3) (2016), 121–132. doi:10.14778/3021924.3021929.
T.K. Sellis, Multiple-query optimization, ACM Trans. Database Syst. 13(1) (1988), 23–52. doi:10.1145/42201.42203.
T.K. Sellis and S. Ghosh, On the multiple-query optimization problem, IEEE Trans. Knowl. Data Eng. 2(2) (1990), 262–266. doi:10.1109/69.54724.
H. Shang, Y. Zhang, X. Lin and J.X. Yu, Taming verification hardness: An efficient algorithm for testing subgraph isomorphism, Proc. VLDB Endow. 1(1) (2008), 364–375. doi:10.14778/1453856.1453899.
C. Stark, B. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz and M. Tyers, BioGRID: A general repository for interaction datasets, Nucleic Acids Res. 34(Database issue) (2006), 535–539. doi:10.1093/nar/gkj109.
J.R. Ullmann, An algorithm for subgraph isomorphism, J. ACM 23(1) (1976), 31–42. doi:10.1145/321921.321925.
C. Wang and L. Chen, Continuous subgraph pattern search over graph streams, in: Proceedings of the 25th International Conference on Data Engineering, IEEE Computer Society, Shanghai, China, 2009, pp. 393–404. doi:10.1109/ICDE.2009.132.
L. Zervakis, V. Setty, C. Tryfonopoulos and K. Hose, Efficient continuous multi-query processing over graph streams, in: Proceedings of the 23rd International Conference on Extending Database Technology, OpenProceedings.org, Copenhagen, Denmark, 2020, pp. 13–24. doi:10.5441/002/edbt.2020.03.