Article type: Research Article
Authors: Bipin Nair, B.J. [a] | Shobha Rani, N. [a], [*] | Khan, Mustaqeem [b]
Affiliations: [a] Department of Computer Science, School of Computing, Mysuru Campus, Amrita Vishwa Vidyapeetham, India | [b] Department of Computer Vision, Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Correspondence: [*] Corresponding author: N. Shobha Rani, Department of Computer Science, School of Computing, Mysuru Campus, Amrita Vishwa Vidyapeetham, India. E-mail: [email protected].
Abstract: The document image classification method presented in this paper focuses on six categories of Malayalam palm leaf manuscripts. The proposed approach consists of three phases: dataset analysis to build a bag of words repository, text recognition, and classification using a voting approach. In the dataset analysis phase, the palm leaf manuscripts are first subjected to pre-processing and subjective analysis techniques to create the bag of words repository. Next, the textual components of the manuscripts are extracted and recognized using Tesseract 4 OCR, with default and self-adapted training sets, and a deep learning algorithm. In the third phase, the Bag of Words approach categorizes the palm leaf manuscripts through a voting process based on the textual components recognized by OCR. Experiments evaluated the proposed approach with and without the voting technique, varying the size of the Bag of Words and using default and self-adapted training datasets with Tesseract OCR and a deep learning model. The results indicate that the bag of words technique with Tesseract OCR performs comparably with and without voting. For document classification, an overall accuracy of 83% without voting and 84.5% with voting is achieved, with an F-score of 0.90 in both cases using Tesseract OCR. Overall, trial-wise experiments with the Bag of Words suggest that the proposed approach is highly generalizable, offering a reliable way to classify deteriorated handwritten Malayalam palm leaf manuscripts.
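The voting step described in the abstract can be illustrated with a minimal sketch. It assumes that each category has its own set of keywords and that every OCR-recognized token found in a category's set casts one vote for that category; the abstract does not specify the actual voting rule, so this is an assumption. The category names, placeholder tokens, and the function name `classify_by_voting` are hypothetical and do not come from the paper.

```python
from collections import Counter

# Hypothetical bag-of-words repository: category -> set of keywords.
# Category names and tokens are placeholders, not data from the paper.
BAG_OF_WORDS = {
    "category_a": {"w1", "w2", "w3"},
    "category_b": {"w4", "w5"},
    "category_c": {"w6", "w7", "w8"},
}


def classify_by_voting(tokens, bag_of_words):
    """Return (best_category, votes); best_category is None if nothing matches."""
    votes = Counter()
    for token in tokens:
        for category, vocabulary in bag_of_words.items():
            # Each recognized token casts one vote per category that lists it.
            if token in vocabulary:
                votes[category] += 1
    if not votes:
        return None, votes
    # Ties are resolved in favor of the category that received a vote first.
    best_category, _ = votes.most_common(1)[0]
    return best_category, votes


# Tokens as they might come out of an OCR engine (unknown tokens are ignored).
ocr_tokens = ["w1", "w4", "w2", "zz", "w3"]
label, votes = classify_by_voting(ocr_tokens, BAG_OF_WORDS)
print(label, dict(votes))
```

In this sketch, OCR errors that turn a keyword into an unknown token simply cost one vote, so a document can still be assigned to the right category as long as enough of its keywords are recognized correctly.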
Keywords: Document image classification, palm leaf manuscripts, handwritten document analysis, Tesseract OCR, deep learning, ancient document images
DOI: 10.3233/JIFS-223713
Journal: Journal of Intelligent & Fuzzy Systems, vol. 45, no. 3, pp. 4031-4049, 2023