Abstract: Cloud computing offers elastic resources that alleviate the challenges of web crawling, and there is a growing need for crawlers that scale. This paper proposes a new Focused Crawler (FC) architecture that can be offered as a service over cloud computing. The proposed FC includes a Topic Filter Service (TFS), which filters retrieved pages before they are indexed and before their links are extracted and added to the crawling queue. TFS relies on a Deep Neural Network (DNN) classifier trained on a dataset that is first processed by an outlier-rejection step using a support vector machine…classifier. The proposed FC also includes a Concept Weight Handler (CWH) service, which treats keywords as concepts according to their meanings and calculates the weight of each concept. Experimental results show that cloud computing services provide a better environment for running the crawler and improve crawling speed. The proposed classifier has been tested against other classification techniques and has proved highly accurate. The overall accuracy of the architecture confirms that the effectiveness and performance of the proposed FC are high.
Keywords: Focused Crawler, deep neural network, cloud computing, topic filter service, concept web page
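
As an illustration of the filtering step described in the abstract, the following is a minimal sketch of a focused crawl loop in which a page is indexed, and its links enqueued, only if a topic filter accepts it. Everything here is an assumption made for illustration: the in-memory `PAGES` table stands in for fetched pages, `TOPIC_TERMS` and `is_on_topic` are hypothetical, and the filter is a simple term-overlap test rather than the DNN classifier the paper proposes.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a": ("cloud crawler architecture for focused crawling", ["b", "c"]),
    "b": ("recipes for chocolate cake", ["d"]),
    "c": ("deep neural network classifier for web pages", ["d"]),
    "d": ("crawler queue and cloud services", []),
}

# Hypothetical topic vocabulary (an assumption, not taken from the paper).
TOPIC_TERMS = {"crawler", "cloud", "classifier", "crawling", "neural"}


def is_on_topic(text, threshold=2):
    """Stand-in for the topic filter: a term-overlap test, NOT a DNN."""
    return len(TOPIC_TERMS & set(text.lower().split())) >= threshold


def focused_crawl(seeds, fetch, on_topic):
    """Index accepted pages only, and enqueue links only from accepted pages."""
    queue, seen, indexed = deque(seeds), set(seeds), []
    while queue:
        url = queue.popleft()
        text, links = fetch(url)
        if not on_topic(text):
            continue  # rejected pages are neither indexed nor expanded
        indexed.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed


result = focused_crawl(["a"], PAGES.__getitem__, is_on_topic)
print(result)
```

In this toy graph, page `b` is rejected by the filter, so its links are never followed; page `d` is still reached because the accepted page `c` also links to it.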