Abstract: Little is known about the content of the major search engines. We
present an automatic learning method that trains an ontology with world
knowledge of hundreds of subjects in a three-level taxonomy covering
the documents offered in our university library. We mine this ontology to
find important classification rules and use these rules to perform an
extensive analysis of the content of the largest general-purpose internet
search engines in use today. Instead of representing documents and collections
as sets of terms, we represent them as sets of subjects. This highly efficient
representation describes information more robustly and reduces the effect of
synonymy.
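
The difference between a term-based and a subject-based representation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the TERM_TO_SUBJECT table, the subject names, and the helper functions to_subjects and jaccard are hypothetical and are not the rules or the ontology described in this paper. It only shows why two documents that use synonymous terms may share no terms and still map to the same subject.

```python
# Hypothetical term -> subject rules (illustrative only; the real rules
# would be mined from the ontology, which is not reproduced here).
TERM_TO_SUBJECT = {
    "car": "Automotive Engineering",
    "automobile": "Automotive Engineering",
    "engine": "Automotive Engineering",
    "motor": "Automotive Engineering",
    "sonnet": "Poetry",
    "verse": "Poetry",
}


def to_subjects(terms):
    """Map a collection of terms to the set of subjects they indicate.

    Terms without a rule are ignored.
    """
    return {TERM_TO_SUBJECT[t] for t in terms if t in TERM_TO_SUBJECT}


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Two toy documents about the same topic, written with synonymous terms.
doc_a = ["car", "engine"]
doc_b = ["automobile", "motor"]

# As sets of terms the documents do not overlap at all.
term_similarity = jaccard(doc_a, doc_b)

# As sets of subjects they are identical.
subject_similarity = jaccard(to_subjects(doc_a), to_subjects(doc_b))

print(term_similarity, subject_similarity)
```

In this toy case the subject sets are also much smaller than the term sets, which hints at why a subject-based representation can be more compact.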