TR DCC-2007-01

Technical Report: DCC-2007-01

Fast Discovery of Statistically Interesting Words

Pedro Pereira, Nuno A. Fonseca and Fernando Silva

DCC-FC & LIACC, Universidade do Porto
R. do Campo Alegre 1021/1059, 4169-007 Porto, Portugal
Phone: +351 220402900, Fax: +351 220402950
E-mail: {pdr,nf,fds}@dcc.fc.up.pt

January 2007

Abstract

In biological sequences, statistically interesting words can lead us to detect subsequences that have a relevant functional significance (e.g. subsequences preserved through evolution or copies of genes). In this paper, we present an efficient method for discovering statistically interesting words in biological sequences. It is based on suffix arrays and for a sequence of size n it has a worst case time of O(n2) but it is faster in practice. It consumes 5n bytes of primary memory in practice but needs 9n bytes of secondary memory. The filter to classify statistical interesting words is based on the measure for their overrepresentation in the sequence or sets of sequences where they occur. We implemented the word discovery tool, evaluated its performance and validated its usefulness by running it on large and real biological sequences, and by checking if the words found were known motifs or patterns in the literature.