Introduction to Bioinformatics
Roderic Guigó i Serra
[email protected]
Bioinformatics, UPF. Academic year 2011-2012
Van Leeuwenhoek and microscopy
In 1676 his credibility was questioned when he sent the Royal Society a copy of his first observations of microscopic single celled organisms. Heretofore, the existence of single celled organisms was entirely unknown … The Royal Society arranged to send an English vicar, as well as a team of respected jurists and doctors to Delft, Holland to determine whether it was in fact Van Leeuwenhoek's ability to observe and reason clearly (wikipedia)
The impact of technology on the practice of biology
gagttttatcgcttccatgacgcagaagttaacactttcggatatttctgatgagtcgaaaaattatcttgataaagcaggaattactactgcttgtttacgaattaaatcgaagtggactgctggcggaaaatgagaaaattcgacctatccttgcgcagctcgagaagctcttactttgcgacctttcgccatcaactaacgattctgtcaaaaactgacgcgttggatgaggagaagtggcttaatatgcttggcacgttcgtcaaggactggtttagatatgagtcacattttgttcatggtagagattctcttgttgacattttaaaagagcgtggattactatctgagtccgatgctgttcaaccactaataggtaagaaatcatgagtcaagttactgaacaatccgtacgtttccagaccgctttggcctctattaagctcattcaggcttctgccgttttggatttaaccgaagatgatttcgattttctgacgagtaacaaagtttggattgctactgaccgctctcgtgctcgtcgctgcgttgaggcttgcgtttatggtacgctggactttgtgggataccctcgctttcctgctcctgttgagtttattgctgccgtcattgcttattatgttcatcccgtcaacattcaaacggcctgtctcatcatggaaggcgctgaatttacggaaaacattattaatggcgtcgagcgtccggttaaagccgctgaattgttcgcgtttaccttgcgtgtacgcgcaggaaacactgacgttcttactgacgcagaagaaaacgtgcgtcaaaaattacgtgcggaaggagtgatgtaatgtctaaaggtaaaaaacgttctggcgctcgccctggtcgtccgcagccgttgcgaggtactaaaggcaagcgtaaaggcgctcgtctttggtatgtaggtggtcaacaattttaattgcaggggcttcggccccttacttgaggataaattatgtctaatattcaaactggcgccgagcgtatgccgcatgacctttcccatcttggcttccttgctggtcagattggtcgtcttattaccatttcaactactccggttatcgctggcgactccttcgagatggacgccgttggcgctctccgtctttctccattgcgtcgtggccttgctattgactctactgtagacatttttactttttatgtccctcatcgtcacgtttatggtgaacagtggattaagttcatgaaggatggtgttaatgccactcctctcccgactgttaacactactggttatattgaccatgccgcttttcttggcacgattaaccctgataccaataaaatccctaagcatttgtttcagggttatttgaatatctataacaactattttaaagcgccgtggatgcctgaccgtaccgaggctaaccctaatgagcttaatcaagatgatgctcgttatggtttccgttgctgccatctcaaaaacatttggactgctccgcttcctcctgagactgagctttctcgccaaatgacgacttctaccacatctattgacattatgggtctgcaagctgcttatgctaatttgcatactgaccaagaacgtgattacttcatgcagcgttaccatgatgttatttcttcatttggaggtaaaacctcttatgacgctgacaaccgtcctttacttgtcatgcgctctaatctctgggcatctggctatgatgttgatggaactgaccaaacgtcgttaggccagttttctggtcgtgttcaacagacctataaacattctgtgccgcgtttctttgttcctgagcatggcactatgtttactcttgcgcttgttcgttttccgcctactgcgactaaagagattcagtaccttaacgctaaaggtgctttgacttataccgatattgctggcgaccctgttttgtatggcaacttgccgccgcgtgaaatttctatgaaggatg
ttttccgttctggtgattcgtctaagaagtttaagattgctgagggtcagtggtatcgttatgcgccttcgtatgtttctcctgcttatcaccttcttgaaggcttcccattcattcaggaaccgccttctggtgatttgcaagaacgcgtacttattcgccaccatgattatgaccagtgtttccagtccgttcagttgttgcagtggaatagtcaggttaaatttaatgtgaccgtttatcgcaatctgccgaccactcgcgattcaatcatgacttcgtgataaaagattgagtgtgaggttataacgccgaagcggtaaaaattttaatttttgccgctgaggggttgaccaagcgaagcgcggtaggttttctgcttaggagtttaatcatgtttcagacttttatttctcgccataattcaaactttttttctgataagctggttctcacttctgttactccagcttcttcggcacctgttttacagacacctaaagctacatcgtcaacgttatattttgatagtttgacggttaatgctggtaatggtggttttcttcattgcattcagatggatacatctgtcaacgccgctaatcaggttgtttctgttggtgctgatattgcttttgatgccgaccctaaattttttgcctgtttggttcgctttgagtcttcttcggttccgactaccctcccgactgcctatgatgtttatcctttgaatggtcgccatgatggtggttattataccgtcaaggactgtgtgactattgacgtccttccccgtacgccgggcaataacgtttatgttggtttcatggtttggtctaactttaccgctactaaatgccgcggattggtttcgctgaatcaggttattaaagagattatttgtctccagccacttaagtgaggtgatttatgtttggtgctattgctggcggtattgcttctgctcttgctggtggcgccatgtctaaattgtttggaggcggtcaaaaagccgcctccggtggcattcaaggtgatgtgcttgctaccgataacaatactgtaggcatgggtgatgctggtattaaatctgccattcaaggctctaatgttcctaaccctgatgaggccgcccctagttttgtttctggtgctatggctaaagctggtaaaggacttcttgaaggtacgttgcaggctggcacttctgccgtttctgataagttgcttgatttggttggacttggtggcaagtctgccgctgataaaggaaaggatactcgtgattatcttgctgctgcatttcctgagcttaatgcttgggagcgtgctggtgctgatgcttcctctgctggtatggttgacgccggatttgagaatcaaaaagagcttactaaaatgcaactggacaatcagaaagagattgccgagatgcaaaatgagactcaaaaagagattgctggcattcagtcggcgacttcacgccagaatacgaaagaccaggtatatgcacaaaatgagatgcttgcttatcaacagaaggagtctactgctcgcgttgcgtctattatggaaaacaccaatcttcccaagcaacagcaggtttccgagattatgcgccaaatgcttactcaagctcaaacggctggtcagtattttaccaatgaccaaatcaaagaaatgactcgcaaggttagtgctgaggttgacttagttcatcagcaaacgcagaatcagcggtatggctcttctcatattggcgctactgcaaaggatatttctaatgtcgtcactgatgctgcttctggtgtggttgatatttttcatggtattgataaagctgttgccgatacttggaacaatttctggaaagacggtaaagctgatggtattggctctaatttgtctaggaaataaccgtcaggattgacaccctcccaattgtatgttttcatgcctccaaatcttggaggcttttttatggttcgttcttattaccc
ttctgaatgtcacgctgattattttgactttgag
1975: DNA sequencing. Sanger; Maxam and Gilbert
Human Genome Project Milestones
2001: the culmination of the project
Further Evolution of Large-scale Genome Sequencing
“What’s past is prologue” (Shakespeare, The Tempest)
• 2000: Human genome working drafts
• Data unit of approximately 10x coverage of human
– 10 years and cost about $3 billion
• 2008: Major genome centers can sequence the same number of base pairs every 4 days
• 1000 Genomes Project launched
• World-wide capacity dramatically increasing
• 2009: Every 4 hours ($25,000)
• 2010: Every 14 minutes ($5,000)
• Illumina HiSeq2000 machine produces 200 gigabases per 8-day run (BGI have ordered 128)
Slide from Paul Flicek. EBI Bioinformatics Advisory Council
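The throughput figures on this slide can be cross-checked with a little arithmetic. The sketch below is only a rough consistency check, not the calculation behind the slide: it assumes a human genome of about 3 gigabases (a value not stated on the slide), takes the "data unit" to be 10x that size, and asks how long 128 HiSeq2000 machines would need to produce one such unit. The function name `minutes_per_data_unit` is made up for the illustration.

```python
# Back-of-the-envelope check of the throughput figures quoted on the slide.
GENOME_SIZE_GB = 3.0   # assumption: human genome of roughly 3 gigabases
COVERAGE = 10          # "10x coverage", from the slide
GB_PER_RUN = 200.0     # HiSeq2000 output per run, from the slide
RUN_DAYS = 8.0         # run length in days, from the slide
MACHINES = 128         # machines ordered by BGI, from the slide


def minutes_per_data_unit(machines: int = MACHINES) -> float:
    """Minutes needed by `machines` instruments to produce one 10x human data unit."""
    data_unit_gb = GENOME_SIZE_GB * COVERAGE
    gb_per_minute = machines * GB_PER_RUN / (RUN_DAYS * 24 * 60)
    return data_unit_gb / gb_per_minute


if __name__ == "__main__":
    print(f"{minutes_per_data_unit():.1f} minutes per data unit with {MACHINES} machines")
    print(f"{minutes_per_data_unit(1) / (24 * 60):.1f} days per data unit with one machine")
```

Under these assumptions the 128 machines would need about 13.5 minutes per data unit, which is in the same range as the "every 14 minutes" quoted for 2010, while a single machine would need a little over a day.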
Moore’s Law
Fig. 1 A doubling of sequencing output every 9 months has outpaced and overtaken performance improvements within the disk storage and high-performance computation fields.
S D Kahn, Science 2011;331:728-729. Published by AAAS
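To get a feeling for what a 9-month doubling time means, it helps to compare it with a slower exponential. The sketch below uses the 9-month figure from the caption; the 18-month doubling time used for computing is an assumed, Moore's-law-style value chosen for illustration and does not come from the slide. The function name `fold_increase` is likewise hypothetical.

```python
def fold_increase(years: float, doubling_months: float) -> float:
    """Fold increase after `years` for a quantity that doubles every `doubling_months`."""
    return 2.0 ** (years * 12.0 / doubling_months)


if __name__ == "__main__":
    SEQUENCING_DOUBLING = 9.0   # months, from the figure caption
    COMPUTE_DOUBLING = 18.0     # months; assumed for illustration, not from the slide
    for years in (1, 3, 5):
        seq = fold_increase(years, SEQUENCING_DOUBLING)
        cpu = fold_increase(years, COMPUTE_DOUBLING)
        print(f"{years} y: sequencing x{seq:.1f}, computing x{cpu:.1f}, gap x{seq / cpu:.1f}")
```

With these two doubling times, three years give a 16-fold increase in sequencing output against a 4-fold increase in computing, and the gap itself keeps growing exponentially.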
Sequencing challenges
• Sequencing to survey dynamics of ecosystems
• Metagenomes
– Within individual ecosystems
• Other species genomes
• Reference Human Genome
• Individual genomes
• Individual meta-genomes
• Within individual genomic diversity
• Sequencing as the read-out of experiments
– ChIP-Seq and nucleosome positioning
• RNA sequencing as a proxy to the cell’s phenotype
Sequencing challenges
• Sequencing to survey dynamics of ecosystems
• Metagenomes
– Ecosystems (environmental, individual)
• Other species genomes
• Reference Human Genome
• Individual genomes
• Individual meta-genomes
• Within individual genomic diversity
• RNA sequencing as a proxy to the cell’s phenotype
• Sequencing to interrogate the activity of the genome, the epigenome
The ENCODE Project
[Figure: Experimental assays used by the ENCODE consortium. Labels: transcripts; epigenetic modifications; transcription factor binding sites]
In summary
Biology is transitioning (at least partially)
from an “analytic” science, in which the real world is dissected into its elemental components in order to be comprehended,
to a “synthetic” science, in which the challenge is the integration of global information on the living cell/individual/population/(eco)system.
From analytic to synthetic
Biology, a science in which the effort has traditionally been directed towards data acquisition, has in a very short time become a discipline in which data are obtained with almost no human intervention, and the effort is turning towards data analysis.
From data acquisition to data analysis
DNA microarrays
bioinformatics
Medline articles
year         # articles
To 1990      0
1990-1994    15
1995-1999    823
2000-2004    7827
2005-2008    18822
Bioinformatics, Genomics, Systems Biology in Medline
Engineering and biology: increasingly interconnected
• Improved technologies to survey Biological Systems
– NGS and the like
• Engineering of Biological Systems
– Medicine
– New and modified biological systems
• Using Biology to build non-biological systems
– DNA computing
Biology has changed and it is changing
• Quantitative thinking
• Ability to attack unanticipated problems
Biology requires quantitative thinking
• Statistics
• Mathematics
• Computer Science
• …
and programming skills (Unix)
• The ability to interrogate data, and to model systems
Bioinformatics
Google search: X-informatics (11 June 2007)
term                # results
bioinformatics      14,100,000
chemoinformatics    226,000
astroinformatics    195
neuroinformatics    364,000
socioinformatics    610
geoinformatics      506,000
meteoinformatics    48
econoinformatics    441
ecoinformatics      160,000
ACTCAGCCCCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGGTGAGAAGCGCAGTCGGGGGCACGGGGATGAGCTCAGGGGCCTCTAGAAAGATGTAGCTGGGACCTCGGGAAGCCCTGGCCTCCAGGTAGTCTCAGGAGAGCTACTCAGGGTCGGGCTTGGGGAGAGGAGGAGCGGGGGTGAGGCCAGCAGCAGGGGACTGGACCTGGGAAGGGCTGGGCAGCAGAGACGACCCGACCCGCTAGAAGGTGGGGTGGGGAGAGCATGTGGACTAGGAGCTAAGCCACAGCAGGACCCCCACGAGTTGTCACTGTCATTTATCGAGCACCTACTGGGTGTCCCCAGTGTCCTCAGATCTCCATAACTGGGAAGCCAGGGGCAGCGACACGGTAGCTAGCCGTCGATTGGAGAACTTTAAAATGAGGACTGAATTAGCTCATAAATGGAAAACGGCGCTTAAATGTGAGGTTAGAGCTTAGAATGTGAAGGGAGAATGAGGAATGCGAGACTGGGACTGAGATGGAACCGGCGGTGGGGAGGGGGAGGGGGTGTGGAATTTGAACCCCGGGAGAGAAAGATGGAATTTTGGCTATGGAGGCCGACCTGGGGATGGGGAAATAAGAGAAGACCAGGAGGGAGTTAAATAGGGAATGGGTTGGGGGCGGCTTGGTAACTGTTTGTGCTGGGATTAGGCTGTTGCAGATAATGGAGCAAGGCTTGGAAGGCTAACCTGGGGTGGGGCCGGGTTGGGGTCGGGCTGGGGGCGGGAGGAGTCCTCACTGGCGGTTGATTGACAGTTTCTCCTTCCCCAGACTGGCCAATCACAGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGGCAGGTATGGGGCGGGGCTTGCTCGGTTTTCCCCGCTTCTCCCCCTCTCATCCTCACCTCAACCTCCTGGCCCCATTCAAGCACACCCTGGGCCCCCTCTTCTTCTGCTGGTCTGTCCCCTGAGGGGAAAGCCCAGGTCTGAGGCTTCTATGCTGCTTTCTGGCTCAGAACAGCGATTTGACGCTCTGTGAGCCTCGGTTCCTCCCCCGCTTTTTTTTTTTCAGCCAGAGTCTCACTCTGTCGCCCAGGCTGGAGTGCAGTGGCGCAATCTCAGCTCACTGCAAGCTCCGCCTCCCGGGTTCACGCTATTCTCCCGCCTCAGCCTCCCGAGTAGCTGGGACTACAGGCGCCCGCCACCATGCCCGGCTAATTTTTTGTACTTTGAGTAGGGAAGGGGTTTCACTGTATTATCCAGGATGGTCTCTATCTCCTGACCTCGTGATCTGCCCGCCTGGCCTCCCAAAGTGCTGGAATTACAGGCGTGAGCCTCCGCGCCCGGCCTCCCCATCCTTAATATAGGAGTTAGAAGTTTTTGTTTGTTTGTTTTGTTTTGTTTTTGTTTTGTTTTGAGATGAAGTCCCTCTGTCGCCCAGGCTGGAGTGCAGTGGCTCCCAGGCTGGAGTTCAGTGGCTGGATCTCGGCTCACTGCAAGCTCCGCCTCCCAGGTTCACGCCATTCTCCTGCCTCAGCCTCCGGAGTAGCTGGGACTACAGGAACATGCCACCACACCCGACTAACTTTTTTTGTATTTTTAGTAGAGACGGGGTTTCACCATGTTGGCCAGGCTGGTCTGGAACTCCTGACCTCAGGTGATCTGCCTGCTTCAACCTCCCAAAGTGCTGGGATTACAGACGTGGGCCACCGCGCCCGGCTGGGAGTTAAGAGGTTTCTAATGCATTGCATTAGAATACCAGACACGGGACAGCTGTGATCTTTATTCTCCATCACCCCACACAGCCCTGCCTGGGGCACACAAGGACACTCAATACACGCTTTTCGGGCGCGGTGGCTCAAGCTGTAATCCCAGCACTTTGGGAGGCTGAGGCGGGTGGTACATGAGGTCAGGAGATCGAGACCATCCTGGCTAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAA
AACTAGCCCGGGCGTGGTGGCGGGCGCCTGTAGTCCCAGCTACTCGGAGGCTGAGGCAGGAGAATGGCGTGAACCTGGGAGGCGGAGCTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGGTGACACAGCGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATACACGCTTTTCCGCTAGGCACGGTGGCTCACCCCTGTAATCCCAGCATTTTGGGAGGCCAAGGTGGGAGGATCACTTGAGCCCAGGAGTTCAACACCAGACTCAGCAACATAGTGAGACTCTCTCTACTAAAAATACAAAAATTAGCCAGGCCTGGTGCCACACACCTGTGGTCCCAGCTACTCAGAAGGCTAAGGCAGGAGGATCGCTTAAGCCCAGAAGGTCAAGGTTGCAGTGAACCACGTTCAGGCCACTGCAGTCCAGCCTGGGTGACAGAGCAAGACCCTGTCTGTAAATAAATAACGCTTTTCAAGTGATTAAACAGACTCCCCCCTCACCCTGCCCACCATGGCTCCAAAGCAGCATTTGTGGAGCACCTTCTGTGTGCCCCTAGGTACTAGCTGCCTGGACGGGGTCAGAAGGAACCTGAACCACCTTCAACTTGTTCCACACAGGATGCCAGGCCAAGGTGGAGCAACCGGTGGAGCCAGAGACAGAACCCGACGTTCGCCAGCAGGCTGAGTGGCAGAGCGGCCAGCCCTGGGAGCTGGCACTGGGTCGCTTTTGGGATTACCTGCGCTGGGTGCAGACACTGTCTGAGCAGGTGCAGGAGGAGCTGCTCAGCCCCCAGGTCACCCAGGAACTGACGTGAGTGTCCCCATCCCGGCCCTTGACCCTCCTGGTGGGCGGCTATACCTCCCCAGGTCCAGGTTTCATTCTGCCCCTGCCACTAAGTCTTGGGGGCCTGGGTCTCTGCTGGTTCTAGCTTCCTCTTCCCATTTCTGACTCCTGGCTTTAGCTCTCTGGAATTCTCTCTCTCAGTTCTGTTTCTCCCTCTTCCCTTCTGACTCAGCCTGTCACACTCGTCCTGGCGCTGTCTCTGTCCTTCACTAGCTCTTTTATATAGAGACAGAGAGATGGGGTCTCACTGTGTTGCCCAGGCTGGTCTTGAACTTCTGGGCTCAAGCGATCCTCCCACCTCGCCTCCCAAAGTGCTGGGAATAGAGACATGAGCCACCTTGCTCGGCCTCCTAGCTCTTTCTTCGTCTCTGCCTCTGCTCTCTGCGTCTGTCTTTGTCTCCTCTCTGCCTCTGTCCCGTTCCTTCTCTCTTGGTTCACTGCCCTTCTGTCTCTCCCTGTTCTCCTTAGGAGACTCTCCTCTCTTCCTTCTCGAGTCTCTCTGGCTGATCCCCATCTCACCCACACCTATCC
Chromosomal matter is “an aperiodic crystal”, formed by the succession of a small number of isomeric elements*, whose specific sequence is responsible for its function.
(*) “the number of atoms in such a structure need not be very large to produce an almost unlimited number of possible arrangements. For illustration, think of the Morse code…”
1943: Schrödinger, “What is life?”
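The combinatorial point in the quotation can be made concrete with a short count. The sketch below is an illustration under stated assumptions: it treats a Morse-like code as 2 signs in groups of at most 4, and it applies the same counting to a 4-letter DNA alphabet (A, C, G, T), which is not part of the quoted text. The sequence lengths 10 and 100 are arbitrary choices, and both function names are hypothetical.

```python
def arrangements(alphabet_size: int, length: int) -> int:
    """Number of distinct sequences of exactly `length` symbols."""
    return alphabet_size ** length


def arrangements_up_to(alphabet_size: int, max_length: int) -> int:
    """Number of distinct sequences of length 1 up to `max_length`."""
    return sum(arrangements(alphabet_size, n) for n in range(1, max_length + 1))


if __name__ == "__main__":
    # Morse-like code: 2 signs (dot, dash) in groups of at most 4 -> 2 + 4 + 8 + 16
    print("Morse-like groups:", arrangements_up_to(2, 4))
    # DNA: 4 nucleotides; the lengths below are illustrative, not from the slides
    for n in (10, 100):
        print(f"DNA sequences of length {n}: {arrangements(4, n)}")
```

Two signs in groups of up to four already give 30 distinct specifications, and a DNA stretch of only 10 nucleotides has more than a million possible sequences, so a small set of elements is enough for an effectively unlimited number of arrangements.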