Note: Current address: School of Computer Science, Queen's University
Belfast, Belfast, BT7 1NN, UK. E-mail: {w.liu, j.hong}
Abstract: Query processing over the Internet involving autonomous data sources
is a major task in data integration. It requires estimating the costs of
candidate queries in order to select the one with the minimum cost. In
this context, the cost of a query is affected by three factors: network
congestion, server contention state, and complexity of the query. In this
paper, we study the effects of both network congestion and server
contention state on the cost of a query. We refer to these two factors together
as system contention states. We present a new approach to determining the
system contention states by clustering the costs of a sample query. For each
system contention state, we construct two cost formulas, one for unary queries
and one for join queries, using multiple regression. When a new query is
submitted, its system contention state is estimated first using either the time
slides method or the statistical method. The cost of the query is then
calculated using the corresponding cost formulas. The estimated cost of the
query is further adjusted to improve its accuracy. Our experiments show that
our methods can produce quite accurate cost estimates for queries submitted
to remote data sources over the Internet.
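
The pipeline summarized above (cluster sample-query costs into contention states, fit one regression formula per state, then estimate) can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the implementation evaluated in this paper: one-dimensional k-means stands in for the clustering step, a plain least-squares fit stands in for the multiple regression process, and the function names (cluster_costs, state_of, fit_linear, estimate) and the choice of explanatory variables are hypothetical.

```python
from statistics import mean


def cluster_costs(costs, k, iters=50):
    """Group observed sample-query costs into k states (1-D k-means, an assumption)."""
    srt = sorted(costs)
    # Initialise the centres at evenly spaced positions in the sorted costs.
    centres = [srt[(2 * i + 1) * len(srt) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for c in costs:
            nearest = min(range(k), key=lambda j: abs(c - centres[j]))
            groups[nearest].append(c)
        # An empty group keeps its previous centre.
        new = [mean(g) if g else centres[j] for j, g in enumerate(groups)]
        if new == centres:
            break
        centres = new
    return sorted(centres)


def state_of(cost, centres):
    """Index of the contention state whose centre is closest to the observed cost."""
    return min(range(len(centres)), key=lambda j: abs(cost - centres[j]))


def fit_linear(xs, ys):
    """Least-squares fit of y = b0 + b1*x1 + ... + bn*xn via the normal equations."""
    rows = [[1.0] + list(x) for x in xs]
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, ys))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(n):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [u - f * v for u, v in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def estimate(coeffs, x):
    """Apply a fitted cost formula to the explanatory variables of a new query."""
    return coeffs[0] + sum(b * v for b, v in zip(coeffs[1:], x))
```

In use, one would call cluster_costs on the observed costs of the sample query, assign each training observation to a state with state_of, and call fit_linear once per state and per query class (unary or join). The estimate for a new query then comes from the formula of whichever state is judged current. How that state is judged (the time slides method or the statistical method) and how the estimate is subsequently adjusted are not covered by this sketch.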