1Algorithms in bioinformatics (CSI 5126)Marcel Turcotte(turcotte@site.uottawa.ca)School of Information Technology and EngineeringUniversity of OttawaCanadaOctober 2, 20091Please don’t print these lecture notes unless you really need to!Marcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsPlanI String algorithmsI MotivationI NotationI Glimpse of Boyer-Moore string matching algorithmI Su x treesI De nitionI Na ve construction algorithmI Daniela Cernea’s Java libraryMarcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsMotivationI The simplest model of a macromolecule is a string. Yet, thislevel of abstraction is su cient for a considerably largenumber of applicationsI The size of the biological databases has been doubling every12 to 18 months for the last few yearsI Millions of queries are made to (static) databasesI Instances of exact and approximate string matchingproblems are solved as sub-tasks of several bioinformaticsapplications, such as the DNA assembly processI Natural transition between approximate string matching andmolecular sequence alignmentsMarcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsExamples of Problems on Strings1. Exact string matching: nding all occurrences of a string in atext2. Approximate string matching: nd all positions in a text wherea pattern occurs, allowing for a certain number of mismatches3. Longest common ...
IThe simplest model of a macromolecule is a string. Yet, this level of abstraction is sufficient for a considerably large number of applications IThe size of the biological databases has been doubling every 12 to 18 months for the last few years IMillions of queries are made to (static) databases IInstances of exact and approximate string matching problems are solved as sub-tasks of several bioinformatics applications, such as the DNA assembly process INatural transition between approximate string matching and molecular sequence alignments
1. all occurrences of a string in a findingExact string matching: text 2. all positions in a text whereApproximate string matching: find a pattern occurs, allowing for a certain number of mismatches 3.Longest common substring 4.Sequence similarities and differencescomparison: highlight between twosequences 5.Regular expression matching 6. motif discovery, like repeats,Structural pattern matching: tandems and palindromes
⇒Efficient data structures, such as suffix trees, are often at the heart of efficient algorithms.
A string,S, is an ordered list of characters written contiguously from left to right. |S|denotes the length of the stringS. represents the empty string. S(i) denotes theith character ofS. The index 1 denotes the first character ofS.
Notation
scoibiinmstimaornf
The set of all characters is called thealphabet, often denoted Σ or A all our applications the alphabet is a finite set of symbols,. In although it varies in size from one application to the other: IDNA alphabet: {A,C,G,T} IProtein alphabet: {A, ..,Z} \ {B,J,O,U,X,Z} ISolvent accessibility, Inside (hydrophobic), Surface (hydrophilic): {I,S} Istructure states of proteins, Helix, Strand and Coil:Secondary {H,E,C}
S[i..j] denotes the (contiguous)substringofSthat starts at positioniand stops at positionj,S(i)S(i+ 1). . .S(j); also called afactor. S[1..i] is theprefixofS. S[i..|S|] is thesuffixofS. A substring, prefix or suffix isproperif it’s not the entire string (and it is not empty). We say thatSis asubsequenceofT, if there exists an increasing set of indices ofT,i1<i2 < . .< .im, such that S=T[i1]T[i2]. . .T[im other words, the string]. InScan be obtained by deleting zero or more characters ofT. E.g.tieis a subsequence ofotherwise. We say that two charactersmatchif they are the same; otherwise we say it’s amismatch.
Problem:Given a patternPand a textT, determine ifPoccurs inT. (boolean find(P, T)) Problem:Given a patternPand a textT, find all occurrences of PinT. (int[] findall(P, T)) P=string T= Algorithms on text (strings) have long been studied in computer science, and computation on molecular sequence data (strings) is at the heart of computational molecular biology. Present and potential algorithms forstringcomputation provide a significant intersection between computer science and molecular biology. How do you approach such problem?
Move a window of size|P|is moved along the text (T). In the worst case, for every starting location, 1. . .|T|, all the symbols of the pattern (|P|) must be considered. Therefore requiring|T| × |P|comparisons.