TR DCC-2006-08

Technical Report: DCC-2006-08

A High Performance Distributed Tool for Mining Patterns in Biological Sequences

Pedro Pereira, Nuno A. Fonseca and Fernando Silva

DCC-FC & LIACC, Universidade do Porto
R. do Campo Alegre 1021/1059, 4169-007 Porto, Portugal
Phone: +351 220402900, Fax: +351 220402950
E-mail: {pdr,nf,fds}@dcc.fc.up.pt

December 2006

Abstract

The identification of interesting patterns (or subsequences) in biosequences has an important role in computational biology. Databases of genomic and proteomic sequences have grown exponentially, and therefore pattern discovery is a hard problem requiring clever strategies and powerful pattern languages to achieve manageable levels of efficiency. As far as we are aware of, known tools are either inefficient or can only handle limited amounts of data. We present a new pattern discovery tool called BIORED for mining patterns in proteomic and genomic sequences. It uses a genetic algorithm to find interesting patterns in the form of regular expressions, and a new efficient pattern matching procedure to count pattern occurrences. The counting operation is still the bottleneck of the matching algorithm, therefore we propose a parallel and distributed version of this algorithm. It adequately partitions the sequence being searched among the processing units so that these can perform in parallel, and independently, the counting for a set of occurring patterns. We validate the tool's usefulness and study the sequential and parallel performance of the base algorithm on several data sets. Our results indicate that BIORED scales well, achieving almost linear speedups, up to 22 processors, on a distributed memory computer.