Abstract: The Web is evolving from a mere information-dissemination
platform into a distributed knowledge-based platform that supports complex
problem solving. However, the existing Web contains a large amount of knowledge
that is tagged only with layout-related markup, making it hard to
discover and use. In this paper, we propose to model semantically rich and
self-contained knowledge units embedded in a web site as a mixture of bipartite
sub-graphs and to extract these sub-graphs as an abstraction of the web site via
hyperlink structure and file hierarchy analysis. A recursive algorithm, named
ReHITS, is derived that identifies bipartite sub-graphs with a hierarchical
organization. Each identified sub-graph contains a set of associated
authorities and hubs as its summarized semantic description. The effectiveness
of the algorithm has been evaluated on three real web sites (containing
approximately 10,000 web pages), with promising results. A detailed
interpretation of the experimental results and a qualitative comparison with
related work are also included.
Keywords: Web structure mining, web site abstraction, HITS algorithm, knowledge discovery
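
For readers unfamiliar with the hub and authority notions used above, the following is a minimal sketch of the standard HITS iteration on a toy link graph. It is not the ReHITS algorithm of this paper: the recursion and the file hierarchy analysis are omitted, and the function names (hits, normalize), the page names, and the iteration count are illustrative assumptions.

```python
import math


def normalize(scores):
    """Scale a score dictionary to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}


def hits(edges, iterations=50):
    """Standard HITS power iteration over (source, target) hyperlink pairs.

    Illustrative only: this is plain HITS, not the ReHITS algorithm.
    """
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to the page.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub score: sum of the authority scores of the pages it links to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical toy site: two index pages (hubs) pointing at content pages.
toy_edges = [
    ("index_a", "page1"),
    ("index_a", "page2"),
    ("index_b", "page2"),
    ("index_b", "page3"),
]
hub_scores, auth_scores = hits(toy_edges)
top_authority = max(auth_scores, key=auth_scores.get)
```

In this toy graph the two index pages and the three content pages form a single bipartite sub-graph, and the page linked from both index pages receives the highest authority score.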