Algorithmic Complexity of Protein Identification

Mark Cieliebak1, Thomas Erlebach2, Zsuzsanna Lipták1, Jens Stoye3, and Emo Welzl1

1 Institute for Theoretical Computer Science, ETH Zürich
2 Computer Engineering and Networks Laboratory (TIK), ETH Zürich
3 Max Planck Institute for Molecular Genetics, Berlin

We investigate a combinatorial problem that originates from computational biology: Given a string $\sigma$ of length n over a weighted alphabet, find a data structure and a query algorithm which, for a given weight M, decides whether $\sigma$ has a (contiguous) substring of weight M. Here, the weight of a string is the sum of the weights of its letters. There are two simple algorithms that solve the problem: one runs in time linear in n, the other has running time $O(\log n)$, but requires an additional data structure of size O(n2). We present an algorithm that uses preprocessing on $\sigma$ to construct a data structure of linear size that allows to perform queries in sublinear time, namely time $O(\frac{n}{\log n})$.



Mark Cieliebak, October, 08th, 2002.