TR DCC-2007-01
In biological sequences, statistically interesting words can lead us to detect subsequences that have a relevant functional significance (e.g. subsequences preserved through evolution or copies of genes). In this paper, we present an efficient method for discovering statistically interesting words in biological sequences. It is based on suffix arrays and for a sequence of size n it has a worst case time of O(n2) but it is faster in practice. It consumes 5n bytes of primary memory in practice but needs 9n bytes of secondary memory. The filter to classify statistical interesting words is based on the measure for their overrepresentation in the sequence or sets of sequences where they occur. We implemented the word discovery tool, evaluated its performance and validated its usefulness by running it on large and real biological sequences, and by checking if the words found were known motifs or patterns in the literature.