Protein Subcellular Localization Prediction Using a Hybrid of Similarity Search and Error-Correcting Output Code Techniques That Produces Interpretable Results

Doderer, Mark; Yoon, Kihoon; Salinas, John; Kwek, Stephen

Article type: Research Article

Authors: Doderer, Mark | Yoon, Kihoon | Salinas, John | Kwek, Stephen

Affiliations: Department of Computer Science, Human Genome (HuGe) Lab, The University of Texas at San Antonio, 6900 N. Loop 1604 West, San Antonio, TX 78249-0667, USA. E-mail: {mdoderer,kyoon,jsalinas,kwek}@cs.utsa.edu

Abstract: In silico prediction of protein subcellular localization from amino acid sequence can reveal valuable information about a protein's innate roles in the cell. Unfortunately, such prediction is made difficult by complex protein sorting signals. Some prediction methods search for similar proteins with known localization, assuming that known homologs exist; such methods may not perform well on proteins with no known homolog. In contrast, machine learning-based approaches attempt to infer a predictive model that describes the protein sorting signals. In doing so, however, they do not take advantage of known homologs (when they exist) through a simple "table lookup". Here, we capture the best of both worlds by combining the two approaches. On a dataset with 12 locations, the similarity-based and machine learning approaches independently achieve accuracies of 83.8% and 72.6%, respectively, while our hybrid approach yields an improved accuracy of 85.9%. We compared our method with the published results of three other methods. For two of them, we used their published datasets for comparison; for the third, we used the 12-location dataset. The Error-Correcting Output Code algorithm was used to construct our predictive model. This algorithm gives attention to all classes regardless of their number of instances, which led to high accuracy within each class and a high prediction rate overall. We also illustrate how the machine learning classifier we use, built over a meaningful set of features, can produce interpretable rules that may provide valuable insights into complex protein sorting mechanisms.
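
The abstract does not spell out how the similarity search and the Error-Correcting Output Code (ECOC) model are combined, so the following Python sketch is only one plausible reading, not the authors' implementation. It assumes that a sufficiently strong similarity hit is trusted as a "table lookup" and that the ECOC model is used otherwise. The four location names, the 7-bit code matrix, the `best_hit` tuple, and the `evalue_cutoff` value are all hypothetical and chosen purely for illustration.

```python
from typing import Dict, Optional, Sequence, Tuple

# Hypothetical ECOC code matrix: one 7-bit codeword per location.
# Each bit position stands for one binary classifier (for example, a
# decision tree) trained to separate the classes marked 1 from those marked 0.
# Any two codewords differ in 4 positions, so one wrong bit can be corrected.
CODEBOOK: Dict[str, Tuple[int, ...]] = {
    "nucleus":       (1, 1, 1, 1, 1, 1, 1),
    "cytoplasm":     (0, 0, 0, 0, 1, 1, 1),
    "mitochondrion": (0, 0, 1, 1, 0, 0, 1),
    "extracellular": (0, 1, 0, 1, 0, 1, 0),
}


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def ecoc_decode(bits: Sequence[int]) -> str:
    """Return the location whose codeword is nearest to the predicted bits."""
    return min(CODEBOOK, key=lambda loc: hamming(CODEBOOK[loc], bits))


def hybrid_predict(
    best_hit: Optional[Tuple[str, float]],
    bits: Sequence[int],
    evalue_cutoff: float = 1e-5,  # assumed threshold, not taken from the paper
) -> str:
    """Use the homolog's location if the hit is strong; otherwise fall back to ECOC.

    best_hit is a hypothetical (location, e-value) pair for the top
    similarity-search hit, or None when no hit was found.
    """
    if best_hit is not None and best_hit[1] <= evalue_cutoff:
        return best_hit[0]
    return ecoc_decode(bits)


if __name__ == "__main__":
    # One binary classifier erred (first bit flipped from the cytoplasm codeword).
    noisy_bits = (1, 0, 0, 0, 1, 1, 1)
    print(hybrid_predict(None, noisy_bits))
    print(hybrid_predict(("nucleus", 1e-30), noisy_bits))
```

In this sketch the error-correcting property comes from the spacing of the codewords: because every pair differs in four bits, a single mistaken binary classifier still leaves the true class as the nearest codeword.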

Keywords: Subcellular localization prediction, similarity search, BLAST, error-correcting output code, decision tree

Journal: In Silico Biology, vol. 6, no. 5, pp. 419-433, 2006

Received 21 April 2006

Accepted 21 July 2006

Published: 2006
