Affiliations: Center for Pharmacoinformatics, National Institute of
Pharmaceutical Education and Research, S.A.S. Nagar, Sector 67, S.A.S. Nagar,
Punjab 160 062, India | Department of Biotechnology, National Institute of
Pharmaceutical Education and Research, S.A.S. Nagar, India
Abstract: High-throughput genome sequencing projects continue to churn out
enormous amounts of raw sequence data. However, most of this data is
unannotated and hence of limited use. One of several approaches to deciphering
the function of a protein is to determine its subcellular localization.
Experimental approaches to proteome annotation, including determination of a
protein's subcellular localization, are costly and labor-intensive. In silico
methods offer an alternative to experimental methods for this task. Here, we
present two machine learning approaches for predicting the subcellular
localization of a protein from primary sequence information. The two
algorithms, k-Nearest Neighbor (k-NN) and Probabilistic Neural Network (PNN),
were used to classify an unknown protein into one of 11 subcellular
localizations. The final prediction is based on a consensus of the predictions
made by the two algorithms, and a probability is assigned to it. The results
indicate that features derived from the primary sequence, such as amino acid
composition, sequence order and physicochemical properties, can be used to
assign subcellular localization with a fair degree of accuracy. Moreover, with
the enhanced accuracy of our approach and the definition of a prediction
domain, this method can be used for proteome annotation in a high-throughput
manner. SubCellProt is available at www.databases.niper.ac.in/SubCellProt.
Keywords: Protein function, subcellular localization, machine learning, PNN, k-NN
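
The consensus idea summarized in the abstract can be illustrated with a minimal
sketch. It is not the SubCellProt implementation. It assumes that only amino
acid composition is used as the feature vector, that the PNN is a plain
Gaussian Parzen-window classifier with equal class priors, and that the
consensus is the average of the two class-probability estimates; the abstract
specifies none of these details. The function names, the parameters k and
sigma, the toy sequences and the two class labels are hypothetical.

```python
import math
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return the fraction of each of the 20 standard residues in seq."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [counts[a] / total for a in AMINO_ACIDS]


def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def knn_probs(train, x, k=3):
    """k-NN: class probabilities as vote fractions among the k nearest points."""
    nearest = sorted(train, key=lambda item: sq_dist(item[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return {label: n / len(nearest) for label, n in votes.items()}


def pnn_probs(train, x, sigma=0.1):
    """PNN (assumed form): mean Gaussian kernel per class, then normalized."""
    sums, sizes = defaultdict(float), defaultdict(int)
    for feats, label in train:
        sums[label] += math.exp(-sq_dist(feats, x) / (2 * sigma ** 2))
        sizes[label] += 1
    dens = {label: sums[label] / sizes[label] for label in sums}
    total = sum(dens.values())
    if total == 0.0:  # all kernels underflowed; fall back to a uniform guess
        return {label: 1.0 / len(dens) for label in dens}
    return {label: d / total for label, d in dens.items()}


def consensus(train, seq, k=3, sigma=0.1):
    """Assumed consensus rule: average both estimates and take the best class."""
    x = amino_acid_composition(seq)
    p_knn = knn_probs(train, x, k)
    p_pnn = pnn_probs(train, x, sigma)
    labels = set(p_knn) | set(p_pnn)
    avg = {l: (p_knn.get(l, 0.0) + p_pnn.get(l, 0.0)) / 2 for l in labels}
    best = max(avg, key=avg.get)
    return best, avg[best]


# Toy training set: invented sequences and labels, for illustration only.
train = [
    (amino_acid_composition("KKKKRRRRKK"), "nucleus"),
    (amino_acid_composition("KRKRKRKRAA"), "nucleus"),
    (amino_acid_composition("LLLLVVVVII"), "membrane"),
    (amino_acid_composition("LVLVLVIIAA"), "membrane"),
]
label, prob = consensus(train, "KKRRKKRRKA")
print(label, round(prob, 3))
```

Averaging the two probability estimates is only one way to form a consensus;
other rules, such as accepting a prediction only when both classifiers agree,
would fit the description equally well.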