genome 373: genomic informatics - borenstein...
TRANSCRIPT
![Page 1: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/1.jpg)
Genome 373: Genomic Informatics
Elhanan Borenstein
![Page 2: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/2.jpg)
• This course is intended to introduce students to the breadth of problems and methods in computational analysis of genomes and biological systems, arguably the single most important new(?) area in biological research.
• The specific subjects will include: • Sequence alignment • Phylogenetic tree reconstruction • Clustering gene expression, annotation and enrichment • Network analysis • Gene finding • Machine learning • DNA sequencing and assembly
Genome 373
![Page 3: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/3.jpg)
Outline
• Course logistics
• Why Bioinformatics
• Introduction to sequence alignment
![Page 4: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/4.jpg)
Instructors
• Elhanan Borenstein: Weeks 1-5
• Doug Fowler: Weeks 6-10
• Office hours: Monday 2:20-3:00
![Page 5: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/5.jpg)
Who am I? Faculty at Genome Sciences & Computer Science
Training: CS, physics, hi-tech, biology
Interests: Metagenomics; Human Microbiome;
Complex networks; Computational systems biology
Emphasis Informatics: From sequence to systems
Algorithms !
Concepts !
http://elbo.gs.washington.edu
![Page 6: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/6.jpg)
Quiz Section
• Cecilia Noecker (TA) will review additional topics including programming and problem solving skills.
• Material covered in section is required, and will be on the exams.
![Page 7: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/7.jpg)
Webpage
• Web site:
• Page has links to
– Lecture notes (but please keep the class interactive)
– Handouts
– Many useful resources on:
• Bioinformatics
• Python
http://elbo.gs.washington.edu/courses/GS_373_17_sp/
![Page 8: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/8.jpg)
Programming
• Note: Historically, this course required prior programming experience.
• Understanding how programs work and how code is written is crucial for understanding algorithms (including bioinformatic algorithms)
• If you do not have any programming experience, that’s totally ok, but … you will need to catch up.
![Page 9: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/9.jpg)
Why Python?
• Python is
– easy to learn
– fast enough
– object-oriented
– widely used
– fairly portable
• C is much faster but much harder to learn and use.
• Java is somewhat faster but harder to learn and use.
• Perl is a little slower and a little harder to learn.
![Page 10: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/10.jpg)
Grading
• 50% homework
• 20% midterm exam (in class)
• 30% final exam
• Final exam is cumulative.
![Page 11: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/11.jpg)
Homework
• Posted through Catalyst each Wednesday and due the following Wednesday.
• Homework is a mix of (mostly) bioinformatics problems and (some) programming.
• Homework assignments are to be submitted through Catalyst
• Programming assignments should be implemented in Python.
• More on home assignment submission in the quiz section.
![Page 12: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/12.jpg)
Textbooks
![Page 13: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/13.jpg)
Let us know who you are ….
• Background survey 1. Major
2. Primary background (biology, computation, other)
3. Programming experience (how much, what language)
• Registered/not-registered/waiting-list
![Page 14: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/14.jpg)
Why Bioinformatics?
![Page 15: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/15.jpg)
![Page 16: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/16.jpg)
![Page 17: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/17.jpg)
tgcaagcatgcacatgtaccaggagaaaatgaagacaattgtggaaacttttagacttttcatcaactttctagtgtcacttttttgccgctttcct
atctgatagttgcgaagactccgaagaaaatgagaatggtgaaggctagcatgctgatgcttcatttctctggagcaattgtggatttctatctaag
cttcatttcgatcccagtgctcactttgcccgtttgctcaggtatccattgggattctcgttggtgttaggaattccaacgtctgttcaagtttata
tcggagtttcatgtatgggcggtgggtcgctctgttgcaggaggtcttgaatttcttttttgcagtaatcggtgtaactattcttatatttttcgaa
aatcgttactttcaactaatcaatggatcttctggtggtagaagttggaagcgaaaactatatgttttgtgtaattacgcgttctctgtaactttta
tagctccagcgtttttagacatttttagtgaagaacaaggaagagcgtgcacgtttgaagtaagttaggcaaaccaaactcgctagtgtgatgaaat
tttccagaaaattccgagtatccctatcgacgtgccttctcgctcaggatattttgtcctattaattgataacccagtctacagcatttgcgtaagc
ctcttggtaattaaagtgtgcccacaaattggtatagtcgttttgttcatattcccttatattgttcaaacgaaatcacattctcgagccacacttc
gtttacttcttcacttttttatcgcgatgtgtatccagctgtctattccatttttggtcatcttcttgccggctgcttttatagtgtacgcaattca
atatgactattataatcaaggtatgaatattaggccttccacgaaggcgctattctcgcccgcccgtaccacaccaacgctcttctcagttgcacgc
ggctatagtagcgcgagggcccgcgtagcgtcggccgccttcatagaaggtctaatgaatatatagtattaagtataatttaaataaagtttcagca
gcaaacaacttggcgatggcaacaatggcattccatggggtatgtactacactgaccatgatcatcgtgcatacaccgtatcgtaacgctactttga
gcattttacatctgaaatcggaaaaatcggcaaaaacagtgactgattcgaagattgtgtggaaaagtaacaagggagtacagatgacataaactat
gcccattgttaccctatattttatttttctctatggtgacaactttatcttaagaaaaacacgcatataaatcaagcagttcctggtcacaggacgt
ttacttccacctgtttctaatttcttataaaaccctatatctttcaagttttttccacaagactctgccactctgacacttatgtgctcgactagcc
tcagcttctttgcttccgagcaaacatatataaaacttctacatactcttaccatacttgaactttccactcactcttttggagcatacatcatcat
tacaaaaacaccgaaaaagttggaatccgtgaaggccagcatgctctatctacaatttgttggagcatttgtcgatgtctatttcagttggttagct
atgccgattctagtactacctttatgtgcaggacatgcgattggcttactttcattttttggggttccaagctcgttgcaagtttatgtaggtttct
gttcactagcaggttggttcttaagaatgatggagagcgtcacatgtattgtgttgtacagatacaatttgaaagcaatccaatacagcgtgtaaaa
gttttgcaattataaacatcattgcagttatggttatgacagtagtgatctttctggaagatcgtcgatatcggttggtgaacggtcaaaagtcaaa
caaaatgagaaaattgtatcggttactgtttgtcacagctaattatgtttatgctacattgtaccctgctcccatatactttttgcttcccgaccaa
gaatatggaagaattttatcgaaaagtgtacgtcttaaaaagtttgaaacatatacaatgaaatgtcttacttttaaagtttgcgtttcagaaaaat
ccgtgtattccgaacgaatatttaaaccatcctaatttctttttgcttgatctcgatggaaagtatacttcaatttgtatcctgcttatgttgagtt
ctctggtctctcaaatgttttggcaaattggactgattttccgtcagatgctcaaaaatccgtccgtttctcaaaatacgcaccgactacaatacca
gtttttaattgcaatgagcttgcaaggcaccattccaatgattatcattgtttttccagcttttttctatgttgtctcaattatgttaaattatcat
aatcaaggtattgtatctattcggaacaagacattaaacataattccaacttttcaggtgcaaataacttatcgtttcttatcatttccatgcatgg
agttctatcaacgttgacaatgctcatggcacacagaccgtatagacaatcgattgtcaaaatgttgaatctgaatttcaataaggcaggtggtggt
gttcaacgtatttggacgctttccagaagaaataattaatgatgaccttggaaaaggctaatcttcacaacaatcaaatcaaataatcataaaagtt
tttattgaagaaaaataaactatctgtgcacagaaatccaatgaattgctctatctacaatttgttggagcatttgtcgatgtctatttcagttggt
tagctatgccgattctagtactacctttatgtgcaggacatgcgattggcttactttcattttttggggttccaagctcgttgcaagtttatgtagg
tttctgttcactagcaggttggttcttaagaatgatggagagcgtcacatgtattgtgttgtacagatacaatttgaaagcaatccaatacagcgtg
taaaagttttgcaattataaacatcattgcagttatggttatgacagtagtgatctttctggaagatcgtcgatatcggttggtgaacggtcaaaag
Find the binding sequence: caattatgttaaa
![Page 18: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/18.jpg)
tgcaagcatgcacatgtaccaggagaaaatgaagacaattgtggaaacttttagacttttcatcaactttctagtgtcacttttttgccgctttcct
atctgatagttgcgaagactccgaagaaaatgagaatggtgaaggctagcatgctgatgcttcatttctctggagcaattgtggatttctatctaag
cttcatttcgatcccagtgctcactttgcccgtttgctcaggtatccattgggattctcgttggtgttaggaattccaacgtctgttcaagtttata
tcggagtttcatgtatgggcggtgggtcgctctgttgcaggaggtcttgaatttcttttttgcagtaatcggtgtaactattcttatatttttcgaa
aatcgttactttcaactaatcaatggatcttctggtggtagaagttggaagcgaaaactatatgttttgtgtaattacgcgttctctgtaactttta
tagctccagcgtttttagacatttttagtgaagaacaaggaagagcgtgcacgtttgaagtaagttaggcaaaccaaactcgctagtgtgatgaaat
tttccagaaaattccgagtatccctatcgacgtgccttctcgctcaggatattttgtcctattaattgataacccagtctacagcatttgcgtaagc
ctcttggtaattaaagtgtgcccacaaattggtatagtcgttttgttcatattcccttatattgttcaaacgaaatcacattctcgagccacacttc
gtttacttcttcacttttttatcgcgatgtgtatccagctgtctattccatttttggtcatcttcttgccggctgcttttatagtgtacgcaattca
atatgactattataatcaaggtatgaatattaggccttccacgaaggcgctattctcgcccgcccgtaccacaccaacgctcttctcagttgcacgc
ggctatagtagcgcgagggcccgcgtagcgtcggccgccttcatagaaggtctaatgaatatatagtattaagtataatttaaataaagtttcagca
gcaaacaacttggcgatggcaacaatggcattccatggggtatgtactacactgaccatgatcatcgtgcatacaccgtatcgtaacgctactttga
gcattttacatctgaaatcggaaaaatcggcaaaaacagtgactgattcgaagattgtgtggaaaagtaacaagggagtacagatgacataaactat
gcccattgttaccctatattttatttttctctatggtgacaactttatcttaagaaaaacacgcatataaatcaagcagttcctggtcacaggacgt
ttacttccacctgtttctaatttcttataaaaccctatatctttcaagttttttccacaagactctgccactctgacacttatgtgctcgactagcc
tcagcttctttgcttccgagcaaacatatataaaacttctacatactcttaccatacttgaactttccactcactcttttggagcatacatcatcat
tacaaaaacaccgaaaaagttggaatccgtgaaggccagcatgctctatctacaatttgttggagcatttgtcgatgtctatttcagttggttagct
atgccgattctagtactacctttatgtgcaggacatgcgattggcttactttcattttttggggttccaagctcgttgcaagtttatgtaggtttct
gttcactagcaggttggttcttaagaatgatggagagcgtcacatgtattgtgttgtacagatacaatttgaaagcaatccaatacagcgtgtaaaa
gttttgcaattataaacatcattgcagttatggttatgacagtagtgatctttctggaagatcgtcgatatcggttggtgaacggtcaaaagtcaaa
caaaatgagaaaattgtatcggttactgtttgtcacagctaattatgtttatgctacattgtaccctgctcccatatactttttgcttcccgaccaa
gaatatggaagaattttatcgaaaagtgtacgtcttaaaaagtttgaaacatatacaatgaaatgtcttacttttaaagtttgcgtttcagaaaaat
ccgtgtattccgaacgaatatttaaaccatcctaatttctttttgcttgatctcgatggaaagtatacttcaatttgtatcctgcttatgttgagtt
ctctggtctctcaaatgttttggcaaattggactgattttccgtcagatgctcaaaaatccgtccgtttctcaaaatacgcaccgactacaatacca
gtttttaattgcaatgagcttgcaaggcaccattccaatgattatcattgtttttccagcttttttctatgttgtctcaattatgttaaattatcat
aatcaaggtattgtatctattcggaacaagacattaaacataattccaacttttcaggtgcaaataacttatcgtttcttatcatttccatgcatgg
agttctatcaacgttgacaatgctcatggcacacagaccgtatagacaatcgattgtcaaaatgttgaatctgaatttcaataaggcaggtggtggt
gttcaacgtatttggacgctttccagaagaaataattaatgatgaccttggaaaaggctaatcttcacaacaatcaaatcaaataatcataaaagtt
tttattgaagaaaaataaactatctgtgcacagaaatccaatgaattgctctatctacaatttgttggagcatttgtcgatgtctatttcagttggt
tagctatgccgattctagtactacctttatgtgcaggacatgcgattggcttactttcattttttggggttccaagctcgttgcaagtttatgtagg
tttctgttcactagcaggttggttcttaagaatgatggagagcgtcacatgtattgtgttgtacagatacaatttgaaagcaatccaatacagcgtg
taaaagttttgcaattataaacatcattgcagttatggttatgacagtagtgatctttctggaagatcgtcgatatcggttggtgaacggtcaaaag
Find the binding sequence: caattatgttaaa
![Page 19: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/19.jpg)
Well, computers would definitely help …
but why bioinformatics?
![Page 20: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/20.jpg)
Moore’s law Computer
processing power doubles every ~2
years.
dotted line - 2 year doubling
![Page 21: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/21.jpg)
Sequencing cost decreasing much faster than computing cost
>2-fold drop per year
? - changing so fast hard to be specific
![Page 22: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/22.jpg)
Sequencing data acquisition is constantly accelerating
![Page 23: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/23.jpg)
~3100 bases (3.1Kb)
• Viruses ~3-1200 Kb • Bacteria ~1-5 Mb • Archaea ~1-5 Mb • Fungi ~10-50 Mb • Animals ~100-5,000 Mb • Plants ~100-10,000 Mb
![Page 24: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/24.jpg)
~3100 bases (3.1Kb)
• Viruses ~3-1200 Kb • Bacteria ~1-5 Mb • Archaea ~1-5 Mb • Fungi ~10-50 Mb • Animals ~100-5,000 Mb • Plants ~100-10,000 Mb
TOTAL NUMBER OF SPECIES
NUMBER OF SPECIES
IDENTIFIED
NUMBER OF SPECIES WITH SEQUENCED
BACTERIA, ARCHAEA
100,000 to 10 million 12,000 (460
cultured Archaea) 17,420 bacteria,
362 Archaea
FUNGI 1.5 million 100,000 356
INSECTS 10 million 1 million 98
PLANTS 435,000 (land plants and green algae)
300,000 150
TERRESTRIAL VERTEBRATES, FISH
80,500 (5,500 mammalian) 62,345 (5,487 mammalian)
235 (80 mammalian)
MARINE INVERTEBRATES 6.5 million 1.3 million 60
OTHER INVERTEBRATES
1 million nematode, several thousandDrosophila
23,000 nematode, 1,300 Drosophila
17 nematode, 21 Drosophila
The Scientist, April 2014
![Page 25: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/25.jpg)
A computational bottleneck
![Page 26: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/26.jpg)
Find the binding sequence: caattatgttaaa
… allowing for one mutation and one insertion
caattatgtta-aa
caatt-atgttaaa
catttatgttaaa
cagttatgttaa-a caattatgt-taaa
caattatgttaaa
caaatatgttaaa
ca-attatggtaaa
caattatattaaa
cagttat-gttaaa
caattatgttaga
cagttatgttaaa
caattatgttaaa
c-aattatgttata
caat-tatgttaaa
caattatgttaat
gaattatgttaaa
![Page 27: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/27.jpg)
How well can the string GAATTCAGTTA match the string
GGATCGA? (what is the best alignment between the two strings?)
G – A A T T C A G T T A | | | | | | G G – A – T C – G - - A
![Page 28: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/28.jpg)
Informatics Challenges: Examples
• Sequence comparison:
– Find the best match (alignment) of a given sequence in a large dataset of sequences
– Find the best alignment of two sequences
– Find the best alignment of multiple sequences
• Motif and gene finding
• Relationship between sequences
– Phylogeny
• Clustering and classification
• Many many many more …
![Page 29: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/29.jpg)
Sequence Comparison
![Page 30: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/30.jpg)
Informatic Challenges: Examples
• Sequence comparison:
– Find the best match (alignment) of a given sequence in a large dataset of sequences
– Find the best alignment of two sequences
– Find the best alignment of multiple sequences
• Motif and gene finding
• Relationship between sequences
– Phylogeny
• Clustering and classification
• Many many many more …
![Page 31: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/31.jpg)
One of many commonly used tools that depend on sequence alignment.
![Page 32: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/32.jpg)
Motivation
• Why compare/align two protein or DNA sequences?
![Page 33: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/33.jpg)
Motivation
• Why compare/align two protein or DNA sequences?
– Determine whether they are descended from a common ancestor (homologous).
– Infer a common function.
– Locate functional elements (motifs or domains).
– Infer protein or RNA structure, if the structure of one of the sequences is known.
– Analyze sequence evolution
![Page 34: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/34.jpg)
Sequence Alignment
![Page 35: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/35.jpg)
Mission: Find the best alignment
between two sequences.
![Page 36: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/36.jpg)
Sequence Alignment
GAATC
CATAC
GAATC-
CA-TAC
GAAT-C
C-ATAC
GAAT-C
CA-TAC
-GAAT-C
C-A-TAC
GA-ATC
CATA-C
(some of a very large number of possibilities)
• Find the best alignment of GAATC and CATAC.
![Page 37: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/37.jpg)
Mission: Find the best alignment
between two sequences.
This is an optimization problem!
What do we need to solve this problem?
![Page 38: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/38.jpg)
Mission: Find the best alignment
between two sequences.
A method for scoring
alignments
A “search” algorithm for finding the alignment
with the best score
Substitution matrix Gap penalties
Dynamic programming
![Page 39: Genome 373: Genomic Informatics - Borenstein Labelbo.gs.washington.edu/courses/GS_373_17_sp/slides/1A...Quiz Section •Cecilia Noecker (TA) will review additional topics including](https://reader033.vdocuments.us/reader033/viewer/2022052616/609a07aefdf0ca5cbe740c1b/html5/thumbnails/39.jpg)