Algorithms in bioinformatics (CSI 5126)=1Please don t print these lecture notes unless you really need

82 pages

English

Algorithms in bioinformatics (CSI 5126)=1Please don't print these lecture notes unless you really need

Eftou - Marcel Turcotte (Turcotte@Site.Uottawa.Ca)

Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres

82 pages

English

Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres

A propos
Informations
Extrait

Description

1Algorithms in bioinformatics (CSI 5126)Marcel Turcotte(turcotte@site.uottawa.ca)School of Information Technology and EngineeringUniversity of OttawaCanadaOctober 2, 20091Please don’t print these lecture notes unless you really need to!Marcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsPlanI String algorithmsI MotivationI NotationI Glimpse of Boyer-Moore string matching algorithmI Su x treesI De nitionI Na ve construction algorithmI Daniela Cernea’s Java libraryMarcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsMotivationI The simplest model of a macromolecule is a string. Yet, thislevel of abstraction is su cient for a considerably largenumber of applicationsI The size of the biological databases has been doubling every12 to 18 months for the last few yearsI Millions of queries are made to (static) databasesI Instances of exact and approximate string matchingproblems are solved as sub-tasks of several bioinformaticsapplications, such as the DNA assembly processI Natural transition between approximate string matching andmolecular sequence alignmentsMarcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsExamples of Problems on Strings1. Exact string matching: nding all occurrences of a string in atext2. Approximate string matching: nd all positions in a text wherea pattern occurs, allowing for a certain number of mismatches3. Longest common ...

Informations

Publié par	Eftou
Nombre de lectures	30
Langue	English

Extrait

1Plirpthttnesae’nodnoresuteelestuecylenerlaysuolnseurcocelT!Maredtotis@ettocrut(ettSI)Ccaa.awttuoe.oibnofnitamrsci2651lg.Aitorsihm

Algorithms in bioinformatics (CSI 5126)

School of Information Technology and Engineering University of Ottawa Canada

October 2, 2009

Marcel Turcotte (turcotte@site.uottawa.ca)

oibiinmstimaornf

String algorithms IMotivation INotation IGlimpse of Boyer-Moore string matching algorithm ISuﬃx trees IDeﬁnition INa¨ıve construction algorithm IDaniela Cernea’s Java library

Plan

sc.uotsite.ca)tawa21.6SC5IirhtlAogtte@urcote(trcotleuTaMcr

Marutcrtt(eruocecTlawttuoe.it@steotglA.6215ISC)ac.arotimhisbnoiniofrmatics

Motivation

IThe simplest model of a macromolecule is a string. Yet, this level of abstraction is suﬃcient for a considerably large number of applications IThe size of the biological databases has been doubling every 12 to 18 months for the last few years IMillions of queries are made to (static) databases IInstances of exact and approximate string matching problems are solved as sub-tasks of several bioinformatics applications, such as the DNA assembly process INatural transition between approximate string matching and molecular sequence alignments

215IlA.6irogsmhtbiinnfoimaorcstitoett(ruoctt@eiste.uottawa.ca)CSMarcTuelrc

1. all occurrences of a string in a ﬁndingExact string matching: text 2. all positions in a text whereApproximate string matching: ﬁnd a pattern occurs, allowing for a certain number of mismatches 3.Longest common substring 4.Sequence similarities and diﬀerencescomparison: highlight between twosequences 5.Regular expression matching 6. motif discovery, like repeats,Structural pattern matching: tandems and palindromes

⇒Eﬃcient data structures, such as suﬃx trees, are often at the heart of eﬃcient algorithms.

Examples of Problems on Strings

tocrt(etocru@ettrcMaTuel21.6SC5IirhtlAog.uotsite.ca)tawa

A string,S, is an ordered list of characters written contiguously from left to right. |S|denotes the length of the stringS. represents the empty string. S(i) denotes theith character ofS. The index 1 denotes the ﬁrst character ofS.

Notation

scoibiinmstimaornf

The set of all characters is called thealphabet, often denoted Σ or A all our applications the alphabet is a ﬁnite set of symbols,. In although it varies in size from one application to the other: IDNA alphabet: {A,C,G,T} IProtein alphabet: {A, ..,Z} \ {B,J,O,U,X,Z} ISolvent accessibility, Inside (hydrophobic), Surface (hydrophilic): {I,S} Istructure states of proteins, Helix, Strand and Coil:Secondary {H,E,C}

Notation (cont.)

MruTlecrabnisnioimrofcita51SI.A26orlghmitti.eouttwa.aacC)cotte(turcotte@s

nismioibrofnitamcs

Notation

S[i..j] denotes the (contiguous)substringofSthat starts at positioniand stops at positionj,S(i)S(i+ 1). . .S(j); also called afactor. S[1..i] is thepreﬁxofS. S[i..|S|] is thesuﬃxofS. A substring, preﬁx or suﬃx isproperif it’s not the entire string (and it is not empty). We say thatSis asubsequenceofT, if there exists an increasing set of indices ofT,i1<i2 < . .< .im, such that S=T[i1]T[i2]. . .T[im other words, the string]. InScan be obtained by deleting zero or more characters ofT. E.g.tieis a subsequence ofotherwise. We say that two charactersmatchif they are the same; otherwise we say it’s amismatch.

TuelrcMa@ettocrut(ettocr.ca)tawa.uotsiteirhtlAog21.6SC5I

cismrta

Notation (cont.)

be a text

LetPdenote a pattern (query, for now a string) andT (think of it as a database), in general|P|<<|T|.

ISC)ac.aglA.6215sihmitorfoinionbtte(turccelTurco.eouttwatoets@tiraM

MraecTlmrtaniofbnoimhisorit.Alg5126)CSIac.awattou.etis@teotrctue(ttcour

Initial motivation

Problem:Given a patternPand a textT, determine ifPoccurs inT. (boolean find(P, T)) Problem:Given a patternPand a textT, ﬁnd all occurrences of PinT. (int[] findall(P, T)) P=string T= Algorithms on text (strings) have long been studied in computer science, and computation on molecular sequence data (strings) is at the heart of computational molecular biology. Present and potential algorithms forstringcomputation provide a signiﬁcant intersection between computer science and molecular biology. How do you approach such problem?

cis

arMlTce@site.uottawa.caruoctt(eutcrtoettamrofnioibnismhitorlg.A2651SI)C

Na¨ıve algorithm

Move a window of size|P|is moved along the text (T). In the worst case, for every starting location, 1. . .|T|, all the symbols of the pattern (|P|) must be considered. Therefore requiring|T| × |P|comparisons.

ics

rmatinfonbiohmsirotiA.gl1562C)IScaa.awttuoe.it@settocrut(ettocruMarcelTsic

} return count;

while ( ! done ) { offset++; if ( offset == lp ) { done = true; count++; } else { done = p.charAt( offset ) != t.charAt( pos+offset ); } }

Na¨ıve algorithm (cont.)

public static int findall( String p, String t ) { int lp = p.length(), lt = t.length(), count = 0; for ( int pos=0; pos <= lt-lp; pos++ ) {

}

int offset = 0; boolean done = p.charAt( offset ) != t.charAt( pos+offset );

Univers
Ebooks
Livres audio
Presse
Podcasts
BD
Documents

Algorithms in bioinformatics (CSI 5126)=1Please don't print these lecture notes unless you really need

YouScribe

Le catalogue

Le service

Les conditions