Introduction to Bioinformatics 234525-236523
Lecturer: Dr. Yael Mandel-Gutfreund
Teaching Assistants:
Martin Akerman
Sivan Bercovici
Course web site: http://webcourse.cs.technion.ac.il/234525
What is Bioinformatics?
Course Objectives
• To introduce the bioinformatics discipline
• To familiarize students with the major biological questions that can be addressed with bioinformatics tools
• To introduce the major tools used for sequence and structure analysis and explain in general terms how they work (including their limitations)
Course Structure and Requirements
1. Class structure
– 2-hour lecture
– 1-hour tutorial
2. Homework
– Homework projects will be given every second week
– Homework will be done in pairs
– All 5 homework projects must be submitted (5/5)
3. A final project will be conducted and submitted in pairs
Grading
• 30% Homework assignments
• 70% final project
Literature list
• Gibas, C., Jambeck, P. Developing Bioinformatics Computer Skills. O'Reilly, 2001.
• Lesk, A. M. Introduction to Bioinformatics. Oxford University Press, 2002.
• Mount, D. W. Bioinformatics: Sequence and Genome Analysis. 2nd ed., Cold Spring Harbor Laboratory Press, 2004.
Advanced Reading
• Jones, N. C., Pevzner, P. A. An Introduction to Bioinformatics Algorithms. MIT Press, 2004.
What is Bioinformatics?
“The field of science in which biology, computer science, and information technology merge to form a single discipline”
Ultimate goal: to enable the discovery of new biological insights and to create a global perspective from which unifying principles in biology can be discerned.
From a purely lab-based science to an information science
Bio + Informatics = Bioinformatics
Central Paradigm in Molecular Biology
Gene (DNA) → mRNA → Protein
21st century:
Genome → Transcriptome → Proteome
Genome
• Chromosomal DNA of an organism
• Coding and non-coding DNA
• Genome size and number of genes do not necessarily determine organism complexity
Transcriptome
• Complete collection of all possible mRNAs (including splice variants) of an organism.
• Regions of an organism’s genome that get transcribed into messenger RNA.
• Transcriptome can be extended to include all transcribed elements, including non-coding RNAs used for structural and regulatory purposes.
Proteome
• The complete collection of proteins that can be produced by an organism.
• Can be studied either as a static entity (the sum of all possible proteins) or as a dynamic entity (all proteins found at a specific time point)
From DNA to Genome
[Timeline, 1955-1985: Watson and Crick DNA model; first protein sequence; first protein structure]
[Timeline, 1990-2000: first bacterial genome (Haemophilus influenzae); yeast genome; first human genome draft (2000)]
Complete Genomes
              2008   2007
Archaea         50     29
Bacteria       578    383
Eukaryotes      78     43
Total          706    456
How human are chimps?
Comparison between the full drafts of the human and chimp genomes revealed that they differ by only 1.23%.
Perhaps not surprising!
What’s Next?
The “post-genomics” era
Goal: to understand the living cell
• Annotation
• Comparative genomics
• Structural genomics
• Functional genomics
Annotation
CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG AAT ......
.............. TGAAAAACGTA
Annotation
Identify the genes within a given sequence of DNA
Identify the sites which regulate the gene
Predict the function
CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG
AAT .................................
.............. TGAAAAACGTA
TF binding site
Promoter
Transcription start site
Ribosome binding site
ORF = Open Reading Frame
CDS = Coding Sequence
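The ORF label above can be made concrete with a naive scan for open reading frames. This is only a minimal sketch under stated assumptions: it looks at the forward strand only, treats ATG as the start codon and TAA, TAG, TGA as stop codons, and ignores every other signal (promoters, ribosome binding sites, the reverse strand). The names find_orfs and min_codons are illustrative and do not come from the course material.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each ATG...stop stretch on the forward strand.

    start is 0-based, end is exclusive and includes the stop codon.
    min_codons counts the start and stop codons as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # look for the next start codon
    return sorted(orfs)


print(find_orfs("CCATGAAATAGCC"))  # [(2, 11, 'ATGAAATAG')]
```

Real gene finders combine many more lines of evidence; a scan like this mainly shows what "open reading frame" means operationally.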
Comparative genomics
Human ATAGCGGGGGGATGCGGGCCCTATACCC
Chimp ATAGGGG--GGATGCGGGCCCTATACCC
Mouse ATAGCG---GGATGCGGCGC-TATACCA
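One simple way to quantify such an alignment is pairwise percent identity. The sketch below assumes the three rows are already aligned (equal length, "-" for gaps) and skips any column where either sequence has a gap; that is one convention among several, and percent_identity is an illustrative name.

```python
def percent_identity(row_a, row_b):
    """Percent of identical positions among columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue  # assumption: gap columns are not counted
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# The toy alignment from the slide.
alignment = {
    "Human": "ATAGCGGGGGGATGCGGGCCCTATACCC",
    "Chimp": "ATAGGGG--GGATGCGGGCCCTATACCC",
    "Mouse": "ATAGCG---GGATGCGGCGC-TATACCA",
}

for other in ("Chimp", "Mouse"):
    value = percent_identity(alignment["Human"], alignment[other])
    print("Human vs", other, round(value, 1))
```

On this toy example the human row is closer to the chimp row than to the mouse row. The numbers describe only these 28 columns and say nothing about genome-wide divergence.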
Researchers have learned a great deal about the function of human genes by examining their counterparts in simpler model organisms such as the mouse.
Conservation of the IGFALS gene (insulin-like growth factor binding protein, acid-labile subunit) between human and mouse.
Functional genomics
Understanding the function of genes and other parts of the genome
A network of interactions can be built for all proteins in an organism
Example: a large network of 8,184 interactions among 4,140 S. cerevisiae proteins
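An interaction network of this kind is commonly stored as an undirected graph. The sketch below assumes interactions arrive as pairs of protein names and uses made-up identifiers (P1 to P5), not real yeast data; build_network is an illustrative name.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected adjacency map from (protein_a, protein_b) pairs."""
    neighbors = defaultdict(set)
    for a, b in interactions:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


# Hypothetical toy interactions, for illustration only.
toy_interactions = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P5", "P4")]

network = build_network(toy_interactions)
# Degree = number of distinct interaction partners of each protein.
degree = {protein: len(partners) for protein, partners in network.items()}
hub = max(degree, key=degree.get)
print(len(network), "proteins; most connected:", hub, "with", degree[hub], "partners")
```

Highly connected proteins ("hubs") are a typical first thing to look for once the graph is built.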
Structural genomics
Assigning the structures of all proteins
• Fold
• Evolutionary relationship
• Shape and electrostatics
• Active sites
• Functional sites
• Protein-ligand complexes
• Protein complexes
• Biological processes
Resources and Databases
The different types of data are collected in databases
– Sequence databases
– Structural databases
– Databases of experimental results
All databases are connected
Sequence databases
• Gene database
• Genome database
• SNP database
• Disease-related mutation database
Gene database
• Provides information on gene function
• Alternative splicing of genes
– Alternative patterns of exons included to create the gene product
• ESTs (expressed sequence tags)
Genome Databases
• Data organized by species
• Clones assembled into contiguous pieces (‘contigs’) or whole chromosomes
• Information on non-coding regions
• Relativity
Genome Browsers
• Annotation adds value to sequence
• Easy “walk” through the genome
• Comparative genomics
Genome Browsers
• UCSC Genome Browser: http://genome.ucsc.edu/
• Ensembl Genome Browser: http://www.ensembl.org
• WormBase: http://www.wormbase.org/
• AceDB: http://www.acedb.org/
• Comprehensive Microbial Resource: http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl
• FlyBase: http://flybase.bio.indiana.edu/
SNP database
Single Nucleotide Polymorphisms (SNPs)
• A single-base difference at a given position between two individuals of the same species
• Play an important role in differentiation and disease
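The definition above suggests a direct check: compare two sequences position by position and report where they differ. This sketch assumes two equal-length sequences with no insertions or deletions, and the example sequences are made up. A difference between two individuals is only a candidate SNP; calling it a polymorphism requires population data.

```python
def find_snps(seq_a, seq_b):
    """Return (position, base_a, base_b) for every mismatch; positions are 1-based."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [
        (position, a, b)
        for position, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


# Hypothetical sequences from two individuals, for illustration only.
individual_1 = "ACGTTGCA"
individual_2 = "ACGTAGCA"
print(find_snps(individual_1, individual_2))  # [(5, 'T', 'A')]
```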
Sickle Cell Anemia
• Caused by a single-base substitution (an A replaced by a T) that places valine instead of glutamic acid in the beta chain of hemoglobin
Image source: http://www.cc.nih.gov/ccc/ccnews/nov99/
Healthy Individual
>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA
GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACTGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
Diseased Individual
>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGT
GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACTGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPVEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
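The effect of the substitution can be checked by translating the start of the coding sequence. The sketch below takes the first nine codons of the HBB coding region (starting at the ATG) in a healthy form and in a form carrying the A-to-T change, and translates both with the standard genetic code. The compact codon table and the name translate are illustrative choices, not part of the slides.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding DNA sequence codon by codon ('*' marks a stop)."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))


healthy = "ATGGTGCATCTGACTCCTGAGGAGAAG"   # ... CCT GAG GAG AAG
diseased = "ATGGTGCATCTGACTCCTGTGGAGAAG"  # ... CCT GTG GAG AAG (A replaced by T)

print(translate(healthy))   # MVHLTPEEK
print(translate(diseased))  # MVHLTPVEK

# Report which residues differ (1-based positions, counting the initial M).
changes = [
    (position, h, d)
    for position, (h, d) in enumerate(zip(translate(healthy), translate(diseased)), start=1)
    if h != d
]
print(changes)  # [(7, 'E', 'V')]
```

A single changed base turns a GAG codon (glutamic acid, E) into GTG (valine, V), which matches the difference between the two protein sequences shown above.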
Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored
Structure Databases
• 3-dimensional structures of proteins, nucleic acids, molecular complexes, etc.
• 3D data are available thanks to techniques such as NMR and X-ray crystallography
Databases of Experimental Results
• Data such as experimental microarray images (expression data)
• Proteomic data
• Metabolic pathways, protein-protein interaction data, regulatory networks
• Etc.
Literature Databases
PubMed
• MEDLINE publication database
– Over 17,000 journals
– 15 million citations since 1950
Service of the National Library of Medicine
http://www.ncbi.nlm.nih.gov/PubMed
Putting it all Together
• Each Database contains specific information
• Like biological systems themselves, these databases are interrelated
GENOMIC DATA: GenBank, DDBJ, EMBL
ASSEMBLED GENOMES: GoldenPath, WormBase, TIGR
PROTEIN: PIR, SWISS-PROT
STRUCTURE: PDB, MMDB, SCOP
LITERATURE: PubMed
PATHWAY: KEGG, COG
DISEASE: LocusLink, OMIM, OMIA
GENES: RefSeq, AllGenes, GDB
SNPs: dbSNP
ESTs: dbEST, UniGene
MOTIFS: BLOCKS, Pfam, Prosite
GENE EXPRESSION: Stanford MGDB, NetAffx, ArrayExpress