Affiliations: Department of Computer Science, Human Genome (HuGe)
Lab, The University of Texas at San Antonio, 6900 N. Loop 1604 West, San
Antonio, TX 78249-0667, USA. E-mail:
Note:  Corresponding author
Abstract: In silico prediction of protein subcellular localization based on
amino acid sequence can reveal valuable information about the protein's innate
roles in the cell. Unfortunately, such prediction is made difficult because of
complex protein sorting signals. Some prediction methods are based on searching
for similar proteins with known localization, assuming that known homologs
exist. However, such methods may not perform well on proteins with no known homolog. In
contrast, machine learning-based approaches attempt to infer a predictive model
that describes the protein sorting signals. Alas, in doing so, they do not take
advantage of known homologs (if they exist) by doing a simple "table lookup".
Here, we capture the best of both worlds by combining both approaches. On a
dataset with 12 locations, the similarity-based and machine learning-based
approaches independently achieve accuracies of 83.8% and 72.6%, respectively.
Our hybrid approach yields an improved accuracy of 85.9%. We compared our
method with three other methods' published results. For two of the methods, we
used their published datasets for comparison. For the third, we used the
12-location dataset. The Error Correcting Output Code algorithm was used to
construct our predictive model. This algorithm gives attention to all the
classes regardless of their number of instances, and it led to high accuracy
for each of the classes and a high prediction rate overall. We also
illustrated how the machine learning classifier we use, built over a
meaningful set of features, can produce interpretable rules that
may provide valuable insights into complex protein sorting mechanisms.
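
The abstract does not spell out how the two approaches are combined. One plausible reading is a "table lookup" against proteins of known localization, with a fallback to the learned model when no sufficiently similar protein is found. The sketch below illustrates that idea only. The function names, the 0.8 threshold, and the use of Python's `difflib.SequenceMatcher` as a stand-in for a real sequence-similarity search are all assumptions, not the authors' implementation.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Optional, Tuple


def best_homolog(query: str, known: Dict[str, str]) -> Tuple[Optional[str], float]:
    """Return the location of the most similar known sequence and its score.

    SequenceMatcher.ratio() is only a toy similarity measure used here for
    illustration; a real pipeline would use a proper alignment tool.
    """
    best_label, best_score = None, 0.0
    for seq, label in known.items():
        score = SequenceMatcher(None, query, seq).ratio()
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


def hybrid_predict(
    query: str,
    known: Dict[str, str],
    fallback: Callable[[str], str],
    threshold: float = 0.8,  # hypothetical cutoff, not taken from the paper
) -> str:
    """Use the table lookup when a close match exists, else the classifier."""
    label, score = best_homolog(query, known)
    if label is not None and score >= threshold:
        return label
    return fallback(query)
```

Here `known` maps sequences to their annotated locations, and `fallback` is any callable that wraps a trained classifier (for example, one built with an error-correcting output code scheme).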