Affiliations: School of Computing, National University of Singapore,
Singapore, 117543. E-mail: {yesr,chuats,jkei}@comp.nus.edu.sg
Note: Corresponding author
Abstract: One of the most frequent Web surfing tasks is to search for persons
and organizations by their names. Such names are often common and
non-unique, so a single name may map to several distinct target
entities. This paper describes a new methodology to
cluster web pages returned by a search engine so that pages belonging to
different entities are placed in different groups. The algorithm uses
named entities together with link-based and structure-based information
as features, and a decision-tree model partitions the document set into
direct and indirect pages. It then selects distinctive direct pages as
seeds around which the document set is clustered. The
algorithm has been found to be effective for web-based information retrieval
applications.
Keywords: Web clustering, persons and organizations, machine learning, text classification, information retrieval, named entity