Abstract: Promoter prediction is an important and complex problem. Pattern
recognition algorithms typically require features that can capture this
complexity. Promoter sequences may be biased towards certain combinations of
base pairs. To detect such biases, n-grams are usually extracted and
analyzed. An n-gram is a run of n contiguous characters from a given
character stream, in this case a DNA sequence segment.
Here, a systematic study is made of the efficacy of n-grams for n = 2, 3, 4,
5 as features for a neural network classifier of E. coli and Drosophila
promoters. For E. coli, n = 3 and, for Drosophila, n = 4 appear to give the
best prediction results. Using the 3-gram features, promoters are then
predicted in the genome sequence of E. coli. The results are encouraging in
terms of positive identification of promoters in the genome when compared
with software packages such as BPROM, NNPP, and SAK. Whole-genome promoter
prediction was also performed for Drosophila, using 4-gram features.
Keywords: Biological data sets, machine learning method, neural networks, in silico method for promoter prediction, binary classification, cascaded classifiers
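
To make the n-gram definition in the abstract concrete, the following is a minimal sketch of how a DNA segment could be turned into a fixed-length feature vector. It is an illustration under stated assumptions, not the feature extraction used in this study: the function name ngram_features, the use of overlapping windows, the normalization to relative frequencies, and the skipping of n-grams containing characters other than A, C, G, T are all assumptions.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumed alphabet; ambiguous bases such as N are ignored


def ngram_features(sequence, n):
    """Return relative frequencies of all 4**n possible n-grams.

    Assumptions (not taken from the abstract): windows overlap with
    stride 1, counts are normalized to sum to 1, and features are
    ordered lexicographically (AA, AC, AG, AT, CA, ... for n = 2).
    """
    seq = sequence.upper()
    # Slide a window of width n across the sequence and count each n-gram.
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    # Enumerate every possible n-gram so the vector length is always 4**n.
    vocab = ["".join(p) for p in product(ALPHABET, repeat=n)]
    # Only n-grams made entirely of A, C, G, T contribute to the total.
    total = sum(counts[g] for g in vocab)
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[g] / total for g in vocab]


if __name__ == "__main__":
    features = ngram_features("ACGTAC", 2)
    print(len(features))  # number of 2-gram features
    print(features[1])    # relative frequency of "AC"
```

Under these assumptions a segment yields 16, 64, 256, or 1024 features for n = 2, 3, 4, 5 respectively, which is the kind of fixed-length input a neural network classifier expects.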