Selecting queries from sample to crawl deep web data sources

Wang, Yan; Lu, Jianguo; Liang, Jie; Chen, Jessica; Liu, Jiming

doi:10.3233/WIA-2012-0232

Article type: Research Article

Authors: Wang, Yan | Lu, Jianguo | Liang, Jie | Chen, Jessica | Liu, Jiming

Affiliations: School of Computer Science, University of Windsor, Windsor, Ontario, Canada, E-mail: {jlu,wang16c,liangr,xjchen}@uwindsor.ca | Department of Computer Science, Hong Kong Baptist University, Hong Kong, China, E-mail: [email protected] | State Key Lab for Novel Software Technology, Nanjing University, Nanjing, China

Note: [] Corresponding author.

Abstract: This paper studies the problem of selecting queries to efficiently crawl a deep web data source using a set of sample documents. Crawling the deep web is the process of collecting data from search interfaces by issuing queries. One of the major challenges is selecting the queries so that most of the data can be retrieved at a low cost. We propose to learn a set of queries from a sample of the data source. To verify that queries selected from a sample also produce good results on the entire data source, we carried out a set of experiments on large corpora including Gov2, newsgroups, Wikipedia and Reuters. We show empirically that our sampling-based method is effective: 1) the queries selected from samples can harvest most of the data in the original database; 2) queries with a low overlapping rate in the sample also have a low overlapping rate in the original database; and 3) neither the sample nor the pool of terms from which the queries are selected needs to be very large. Unlike other query selection methods, which learn the next best query incrementally from all the documents matched by previous queries, our method obtains the queries by analyzing a small set of sample documents.
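The idea of choosing queries from a sample can be illustrated with a small set-covering sketch. The abstract does not specify the selection algorithm, so everything below is an assumption made for illustration: a greedy weighted set cover is used, the cost of a query is taken to be the number of sample documents it matches, and the overlapping rate is taken to be the total number of documents returned divided by the number of distinct documents retrieved. The function names `build_index`, `select_queries` and `overlap_rate` are hypothetical.

```python
from typing import Dict, List, Set


def build_index(sample_docs: List[str]) -> Dict[str, Set[int]]:
    """Map each term to the set of sample document ids that contain it."""
    index: Dict[str, Set[int]] = {}
    for doc_id, text in enumerate(sample_docs):
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def select_queries(index: Dict[str, Set[int]], n_docs: int,
                   coverage_target: float = 1.0) -> List[str]:
    """Greedy weighted set cover (an assumed heuristic, not the paper's method).

    At each step, pick the term with the highest ratio of newly covered
    documents to documents returned, so that overlap stays low.
    """
    uncovered = set(range(n_docs))
    queries: List[str] = []
    terms = sorted(index)  # fixed order makes tie-breaking deterministic
    while n_docs - len(uncovered) < coverage_target * n_docs:
        best = max(
            terms,
            key=lambda t: (len(index[t] & uncovered) / len(index[t]),
                           len(index[t] & uncovered)),
        )
        if not index[best] & uncovered:
            break  # remaining documents match no term (e.g. empty documents)
        queries.append(best)
        uncovered -= index[best]
    return queries


def overlap_rate(queries: List[str], index: Dict[str, Set[int]]) -> float:
    """Total documents returned divided by distinct documents retrieved."""
    total = sum(len(index[q]) for q in queries)
    unique = len(set().union(*(index[q] for q in queries)))
    return total / unique


# Toy sample; in practice the sample would be drawn from the data source.
sample = [
    "deep web crawling",
    "web service query",
    "hidden web data",
    "query selection sampling",
]
idx = build_index(sample)
selected = select_queries(idx, len(sample))
print(selected, overlap_rate(selected, idx))
```

The selected terms would then be issued as queries against the full data source. Whether coverage and overlapping rate carry over from the sample to the original database is the empirical question the paper investigates.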

Keywords: Deep web, hidden web, invisible web, crawling, query selection, sampling, set covering, web service

Journal: Web Intelligence and Agent Systems: An International Journal, vol. 10, no. 1, pp. 75-88, 2012

Published: 2012
