Post on 20-Dec-2015
The BIG Goal
“The greatest challenge, however, is analytical. … Deeper biological insight is likely to emerge from examining datasets with scores of samples.”
Eric Lander, “Array of hope”, Nat. Gen., volume 21 supplement, pp. 3-4, 1999.
Bio-informatics: Provide methodologies for elucidating biological knowledge from biological data.
Central Paradigm of Bio-informatics
Genetic Information, Molecular Structure, Biochemical Function, Symptoms
http://www.sanger.ac.uk/PostGenomics/S_pombe/presentations/EMBOCopenhagenWebsite.pdf
Computer Science Tools are Crucial
• New bio-technologies create huge amounts of data.
• It is impossible to analyze data by manual inspection.
• Novel mathematical, statistical, algorithmic and computational tools are necessary!
What is Bio-Informatics ?
• A field of science in which Biology, Computer Science and Information Technology merge into a single discipline.
• Computers (& software tools) are used to collect, analyze and interpret biological information at the molecular level.
• Goal: To enable the discovery of new biological insights and create a global perspective for biologists.
• Development of new algorithms and statistical methods to assess relationships among members of large data sets.
• Analysis and interpretation of various types of data.
• Development and implementation of tools to efficiently access and manage different types of information.
Disciplines
Why Use Bio-Informatics ?
An explosive growth in the amount of biological information necessitates the use of computers for cataloging and retrieval of data (> 3 billion bp, > 30,000 genes).
• The human genome project.
• Automated sequencing.
• GenBank has over 16 billion bases and is doubling every year!
New Types of Biological Data
• Micro arrays - gene expression.
• Multi-level maps: genetic, physical: sequence, annotation.
• Networks of protein-protein interactions.
• Cross-species relationships:
  • Homologous genes.
  • Chromosome organization.
http://www.the-scientist.com/yr2002/apr/research020415.html
• A more global view of experimental design (from the “one scientist = one gene/protein/disease” paradigm to whole-organism consideration).
• Data mining - functional/structural information is important for studying the molecular basis of diseases, diagnostics, developing drugs (personal medicine), evolutionary patterns, etc.
Why Bio Informatics ? (cont.)
http://www.library.csi.cuny.edu/~davis/Bioinfo_326/lectures/lect14/lect_14.html
Why Bio Informatics ? (cont.)
http://www.usgenomics.com/technology/index.shtml
Principal milestones in data mining and genome analysis:
• Sanger method for sequencing, invented in 1977 (winner of the Nobel Prize in 1980).
• Polymerase chain reaction (PCR), invented in 1983 (awarded the Nobel Prize in 1993).
Future of Genomic Research
The next step:
Locate all the genes and understand their function.
This will probably take another 15-20 years!
One can efficiently find information using databases and software on the web.
Question: How likely are you to use a free bio-informatics library of accessible software ?
http://www.cryst.bbk.ac.uk/classlib/BBSRC_poster/potential.html
The job of biologists is changing…
Broad Classification of Biological Databases
http://www.mrc-lmb.cam.ac.uk/genomes/madanm/pres/biodb.htm
Post-genomic terms (Oct. 2002)
Term            Google search   PubMed
Genome          2.1x10^6        76,566
Proteome        89,300          1,701
Transcriptome   9,960           229
Gene function   1.2x10^6        6.5x10^5
Metabolome      1,170           29
Glycome         138             6
From: Computational Proteomics, Mark B Gerstein, Yale U.
Similarity / Analogy
Examples:
If it looks like an elephant, and smells like an elephant, it’s an elephant.
If it walks like a duck, and quacks like a duck, it’s a duck.
http://cbms.st-and.ac.uk/academics/ryan/Teaching/molbiol/Bioinf_files/v3_document.htm
Similarity Search in Databanks
Find sequences similar to a working draft.
As databanks grow, homologies get harder to find, and quality is reduced.
Alignment Tools: BLAST & FASTA (time-saving heuristic approximations).
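One way to picture the time-saving idea behind heuristic search tools is word (k-mer) seeding: index every short word of the database sequence once, then look up the query's words instead of comparing every position with every other. The sketch below is a simplified illustration only, not the actual BLAST or FASTA algorithm; the function names, the toy sequences and the word length k=4 are assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_index(seq, k):
    """Map every length-k word in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_seeds(query, subject, k=4):
    """Return (query_pos, subject_pos) pairs where a length-k word matches exactly.

    Hypothetical helper: a real tool would go on to extend and score each seed.
    """
    index = build_kmer_index(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds


if __name__ == "__main__":
    # Toy sequences, invented for illustration.
    print(find_seeds("ACGTACGTTT", "TTACGTACGA", k=4))
```

Only the seed positions are then examined further, which is why such searches stay fast as databanks grow, at the price of possibly missing weak similarities that share no exact word.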
>gb|BE588357.1|BE588357 194087 BARC 5BOV Bos taurus cDNA 5'.
Length = 369
Score = 272 bits (137), Expect = 4e-71
Identities = 258/297 (86%), Gaps = 1/297 (0%)
Strand = Plus / Plus

Query: 17  aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca 76
           |||||||||||||||| | ||| | ||| || ||| | ||||  ||||| |||||||||
Sbjct: 1   aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc 59

Query: 77  agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca 136
           |||||||||||||||||||||||| | || ||||||||| | ||||||||||| ||| ||
Sbjct: 60  agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca 119

Query: 137 gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc 196
            |||||||| | || | ||||||||||||||| ||||||||||| || ||||||||||||
Sbjct: 120 tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc 179

Query: 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256
           ||||||||| | |||||||| |||||||||||||||||| ||||||||||||||||||||
Sbjct: 180 atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct 239

Query: 257 gacaccttctctgggattgggaaaaaattcggcctcctgaaatgacagcagggagac 313
           || || ||||| ||  ||||||||||| | |||||||||||||||||| ||||||||
Sbjct: 240 gagactttctcgggttttgggaaaaaacttggcctcctgaaatgacagaagggagac 296
Pairwise alignment:
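The summary figures reported with the BLAST hit (Identities = 258/297, Gaps = 1/297) can be recounted directly from the aligned rows. The sketch below concatenates the Query and Sbjct rows of the hit shown above and counts matching columns and gap columns; the function name alignment_stats is an assumption made for illustration and is not part of BLAST.

```python
def alignment_stats(query_row, subject_row):
    """Count identical columns, gap columns and total columns of a pairwise alignment."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    identities = 0
    gaps = 0
    for q, s in zip(query_row, subject_row):
        if q == "-" or s == "-":
            gaps += 1          # a column with a gap in either row
        elif q == s:
            identities += 1    # a column where both bases agree
    return identities, gaps, len(query_row)


# Aligned rows copied from the BLAST hit above (Query 17-313, Sbjct 1-296).
QUERY = (
    "aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca"
    "agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca"
    "gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc"
    "atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct"
    "gacaccttctctgggattgggaaaaaattcggcctcctgaaatgacagcagggagac"
)
SBJCT = (
    "aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc"
    "agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca"
    "tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc"
    "atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct"
    "gagactttctcgggttttgggaaaaaacttggcctcctgaaatgacagaagggagac"
)

if __name__ == "__main__":
    identities, gaps, length = alignment_stats(QUERY, SBJCT)
    # Integer division reproduces the percentage printed in the report.
    print(f"Identities = {identities}/{length} ({100 * identities // length}%)")
    print(f"Gaps = {gaps}/{length} ({100 * gaps // length}%)")
```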
Phylogeny
Evolution - a process in which small changes occur within species over time.
These changes can be monitored today using molecular techniques.
The Tree of Life: A classical, basic science problem, since Darwin’s 1859 “Origin of Species”.
Origin of the universe ?
Formation of the solar system
First self replicating systems
Prokaryotes/eukaryotes
Plant/animals
Invertebrates/vertebrates
Mammalian radiation
Tree of Life: Searching Protein Sequence Databases - How far can we see back ?
The Human Genome Project (HGP)
• Write down all of human DNA on a single CD (“completed” 2001).
• Identify all genes, their location and function (far from completion).
FISH - Fluorescence In-Situ Hybridization
• Fluorescent-labeled probes hybridize to specific chromosomal locations.
• Example application: low-resolution localization of a gene.
Gene Finding
• Only 2-3% of the human genome codes for functional genes.
• Genes are interspersed among large non-coding DNA regions.
• Repeats, pseudo-genes, introns and vector contamination are very confusing.
Gene Finding - cont.
Find special gene patterns:
• Translation start and stop sites (open reading frames - ORF).
• Transcription factors, promoters.
• Intron splice sites.
Etc…
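The first pattern in the list, translation start and stop sites, can be illustrated with a very small open reading frame scan. This is a minimal sketch under stated assumptions: it searches only the forward strand, treats ATG as the only start codon and TAA/TAG/TGA as stop codons, and ignores introns and splice sites entirely, so it is not a real gene finder. The names find_orfs and min_codons are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each ATG...stop stretch on the forward strand.

    start/end are 0-based, end-exclusive and include the stop codon.
    min_codons is the minimum number of codons before the stop (ATG included).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens the frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    # Toy sequence, invented for illustration: ATG AAA TTT TAG in frame 2.
    print(find_orfs("CCATGAAATTTTAGGG"))
```

On real genomic DNA such a scan returns many spurious candidates, which is one reason the other signals in the list (promoters, splice sites) are needed as well.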
Micro Arrays (“DNA Chips”)
New biotechnology breakthrough: measure RNA expression levels of thousands of genes (in one experiment).
Clustering Analysis of Gene Expression Data
DNA chips and personalized medicine (leading edge, future technologies).
Pharmaco-genomics
Use DNA information to measure and predict the reaction to drugs.
Personalized medicine.
Faster clinical trials: selected populations.
Fewer drug side-effects.
Protein and Other Arrays
Sequencing the human genome => finite problem.
Studying the proteome => endless possible variations, dynamic.
Protein array
Future fields of study:
Proteins + Genomics = Proteomics
Lipids + Genomics = Lipomics
Sugars + Genomics = Glycomics
Putting it all together: Bio-Informatics
SEQUENCE ALIGNMENT
ORTHOLOG GENES (Taxonomy)
CONSERVED DOMAINS
CODING REGIONS
3-D STRUCTURE
GENE FAMILIES
MUTATIONS & POLYMORPHISM
GENOME MAPS
CELLULAR LOCATION
SIGNAL PEPTIDE
SEQUENCES & LITERATURE