BIONF/BENG 203: Functional Genomics
Lecture TI 1
Trey Ideker
UCSD Department of Bioengineering
Sources of Functional Data
Lectures 1 and 2
Grading
40% Problem Sets (best 4 of 5)
30% Midterm
30% Final Project
Outline of the course
Biological data
sources (2)
Data pre-processing
(6)
Unsupervised:
Clustering
Inference
Supervised:
Classification
Population Genetics and
Linkage
Single-Source (3) (3) (1)
Multi-Source (2)
FINAL PROJECT
Project Presentations
(2)
Total of 17 lectures
Functional Genomics Data
– Expression: mRNA, protein
– Molecular interactions: protein, mRNA, small molecules
– Knockout phenotypes: 1st, 2nd, higher orders
– SNP sequence (polymorphism) data
– Imaging data: sub-cellular localization, cell morphology
– Gene ontology
Dividing the data into two classes of information: Biological Networks and Network States
1) Directly observe the network “wires” themselves
Protein-protein interactions: two-hybrid system, coIP, protein antibody arrays
Databases: BIND, DIP
Protein-DNA interactions: chromatin IP
Databases: BIND, Transfac, SCPD
Other types not yet possible: e.g., protein-small molecule
2) Observe molecular states that result from the interaction wiring
DNA/RNA gene expression: DNA microarrays, SAGE
Protein levels, locations, and modifications: mass spectrometry, fluorescence microscopy, protein arrays
Gross phenotypes: e.g., growth rates of single and double deletion strains
High-throughput methods for measuring cellular states
Gene expression levels: RT-PCR, arrays
Protein levels, modifications: mass spec
Protein locations: fluorescent tagging
Metabolite levels: NMR and mass spec
Systematic phenotyping
The transcriptome and proteome
The transcriptome is the full complement of RNA molecules produced by a genome
The proteome is the full complement of proteins enabled by the transcriptome
DNA → RNA → protein
Genome → transcriptome → proteome
30,000 genes → ??? RNAs → ??? proteins
For example, the Drosophila gene Dscam can generate about 40,000 distinct transcripts through alternative splicing.
What is the minimum number of exons that would be required?
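One way to approach this question is sketched below. It assumes a model that is not stated in the slides: every alternative exon is an independent cassette exon that is either included or skipped, so n such exons give 2^n transcripts. The second function shows the arrangement usually described for Dscam, four clusters of mutually exclusive exons with 12, 48, 33, and 2 alternatives; the function names are illustrative only.

```python
import math

def min_cassette_exons(n_transcripts: int) -> int:
    """Smallest n with 2**n >= n_transcripts, assuming each alternative
    exon is independently included or skipped (an assumed model)."""
    n = 0
    while 2 ** n < n_transcripts:
        n += 1
    return n

def mutually_exclusive_isoforms(cluster_sizes) -> int:
    """Number of transcripts when exactly one exon is chosen per cluster."""
    return math.prod(cluster_sizes)

# Dscam alternative exon clusters as commonly reported (12, 48, 33, 2).
DSCAM_CLUSTERS = [12, 48, 33, 2]

print(min_cassette_exons(40_000))                   # 16
print(mutually_exclusive_isoforms(DSCAM_CLUSTERS))  # 38016
```

Under the include-or-skip assumption, 16 alternative exons are enough (2^15 = 32,768 is too few). The real gene reaches a similar number with 95 alternative exons, because only one exon per cluster is used in any transcript.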
Expression: High-throughput approaches
RNA
– DNA microarrays
– cDNA / EST sequencing
– RT-PCR
– Differential display
– SAGE
– Massively parallel signature sequencing (MPSS)
Proteins
– 2D PAGE
– Mass spectrometry
Gene expression arrays
They are really, really, really, really, really, really, really, really, really, really, really, really, really important
Microarrays
Monitors the level of each gene:
Is it turned on or off in a particular biological condition?
Is this on/off state different between two biological conditions?
Microarray is a rectangular grid of spots printed on a glass microscope slide, where each spot contains DNA for a different gene
Two-color DNA microarray design
Reverse Transcription
cDNA-chip of brain glioblastoma
Types of microarrays
Spotted (cDNA)
– Robotic transfer of cDNA clones or PCR products
– Spotting on nylon membranes or glass slides coated with poly-lysine
Synthetic (oligo)
– Direct oligo synthesis on solid microarray substrate
– Uses photolithography (Affymetrix) or ink-jet printing (Agilent)
All configurations assume the DNA on the array is in excess of the hybridized sample, so the kinetics are linear and the spot intensity reflects the amount of hybridized sample.
Labeling can be radioactive, fluorescent (one-color), or two-color
Microarray Spotter
Affymetrix High Density Arrays
Microarrays (continued)
Imaging
– Radioactive 32P labeling: autoradiography or phosphorimager
– Fluorescent labeling: confocal microscope (invented by Marvin Minsky!!)
Feature density
– Nylon membrane macroarrays: 100-1000 features
– Glass slide spotted array: 5,000 features / cm2
– Synthesized arrays: 50,000 features / cm2
Microarray confocal scanner
Collects sharply defined optical sections from which 3D renderings can be created
The key is spatial filtering to eliminate out-of-focus light or glare in specimens whose thickness exceeds the immediate plane of focus.
Two lasers for excitation
Two color scan in less than 10 minutes
High resolution, 10 micron pixel size
cDNA / EST sequencing projects
cDNA = complementary or copy DNA
EST = Expressed Sequence Tag
The microarray could be described as a “closed system” because information about RNAs is limited by the targets available for hybridization. RNAs not represented on the array are not interrogated.
Direct sequencing of cDNAs (yielding ESTs) overcomes this problem by large-scale random sampling of sequences from a whole-cell RNA extract
Statistical counting of distinct sequences provides an estimate of expression level
Conversely, a cDNA library can be normalized to capture rare messages
Requires large scale sequencing to get statistical significance
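To make the sampling argument concrete, the sketch below estimates how many random clones must be sequenced before a transcript of a given abundance is likely to be seen at all. It assumes clones are drawn independently from an un-normalized library, so the count of a transcript is binomial; the function names and the 95% threshold are illustrative choices, not values from the lecture.

```python
import math

def p_detect(n_clones: int, fraction: float) -> float:
    """Probability that a transcript making up `fraction` of the library
    shows up at least once among n_clones randomly picked clones."""
    return 1.0 - (1.0 - fraction) ** n_clones

def clones_needed(fraction: float, confidence: float = 0.95) -> int:
    """Smallest number of clones giving at least `confidence` probability
    of seeing the transcript once (independent sampling assumed)."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fraction))

for fraction in (1e-3, 1e-4, 1e-5):
    n = clones_needed(fraction)
    print(f"abundance {fraction:g}: {n} clones, P(detect) = {p_detect(n, fraction):.3f}")
```

The required depth grows roughly in inverse proportion to abundance, which is why rare messages need either very deep sequencing or a normalized library.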
cDNA / EST Sequencing: Preparation of a cDNA library in phage vector
Serial Analysis of Gene Expression (SAGE)
Takes idea of sequence sampling to the extreme
Generates short ESTs (9-14nt) which are joined into long concatemers and then sequenced
4^9 is 262,144, ~5-fold the number of human genes
The count of each type of tag estimates RNA copy number
>50X more efficient than cDNA sequencing because many RNAs are represented in a single sequencing run
SAGE Technology
Steps to SAGE
– Copy mRNA to ds cDNA using biotinylated oligo(dT)
– Cleave with anchoring enzyme (AE), which cleaves within ~250 bp of the poly-A tail at the 3’ end
– Capture this segment on streptavidin beads
– Ligate to linkers containing a type IIs restriction site; the type IIs enzyme cleaves DNA 14 bp away from this site
– Ligate sequences to each other and PCR amplify
– Cleave with AE to remove linkers
– Concatenate, clone, and sequence
WHY DI-TAGS?
Ditags are used to detect bias in the PCR amplification step.
The probability of any two tags being coupled in the same ditag is small.
Biased amplification can be detected as many ditags always having the same 2 tags present.
Velculescu et al. Science (1995)
[Figure: ditags flanked by linkers A and B, amplified with Primer A and Primer B]
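The ditag argument can be sketched as a simple duplicate check. The example below is illustrative only: the ditag list is made up (the tag sequences are borrowed from the tag table in these slides, but the pairings are hypothetical), and treating any ditag seen more than once as suspect is an assumed threshold.

```python
from collections import Counter

def normalize(ditag):
    """Treat (a, b) and (b, a) as the same ditag."""
    return tuple(sorted(ditag))

def flag_repeated_ditags(ditags, max_copies=1):
    """Return ditags seen more often than max_copies; since two given tags
    rarely pair by chance, repeats suggest biased PCR amplification."""
    counts = Counter(normalize(d) for d in ditags)
    return {d: n for d, n in counts.items() if n > max_copies}

def tag_counts_after_collapse(ditags):
    """Count tags after keeping only one copy of each distinct ditag."""
    tags = Counter()
    for pair in {normalize(d) for d in ditags}:
        tags.update(pair)
    return tags

# Hypothetical ditags for illustration.
ditags = [
    ("ATCTGAGTTC", "GCGCAGACTT"),
    ("GCGCAGACTT", "ATCTGAGTTC"),  # same ditag in the other orientation
    ("ATCTGAGTTC", "TCCCCGTACA"),
    ("TAGGACGAGG", "GCGATGGCGG"),
]
repeated = flag_repeated_ditags(ditags)
collapsed = tag_counts_after_collapse(ditags)
print(repeated)
print(collapsed)
```

Collapsing repeated ditags before counting keeps a single over-amplified molecule from inflating the counts of both of its tags.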
SAGE (continued)
Example of a concatemer:
CATGACCCACGAGCAGGGTACGATGATACATGGAAACCTATGCACCTTGGGTAGCACATG
TAG1 TAG2 TAG3 TAG4
Counting the tags:
Tag Sequence    Count
ATCTGAGTTC      1075
GCGCAGACTT      125
TCCCCGTACA      112
TAGGACGAGG      92
GCGATGGCGG      91
TAGCCCAGAT      83
GCCTTGTTTA      80
GCGATATTGT      66
TACGTTTCCA      66
TCCCGTACAT      66
TCCCTATTAA      66
GGATCACAAT      55
AAGGTTCTGG      54
CAGAACCGCG      50
GGACCGCCCC      48
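A minimal sketch of the counting step, under several assumptions: the example concatemer is read as CATG + ditag + CATG + ditag + CATG, each ditag holds one tag at its left end and the reverse complement of a second tag at its right end, and tags are taken as 10 nt to match the tag table. Real SAGE software handles ambiguous cases (for example CATG occurring inside a tag) more carefully; the function names here are hypothetical.

```python
from collections import Counter

ANCHOR = "CATG"  # site that punctuates the example concatemer (NlaIII site)
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]

def ditags_from_concatemer(concatemer: str, anchor: str = ANCHOR):
    """Split a concatemer at the anchoring-enzyme site, dropping empty pieces."""
    return [piece for piece in concatemer.split(anchor) if piece]

def tags_from_ditag(ditag: str, tag_len: int = 10):
    """Left tag is read directly; right tag is read on the opposite strand."""
    return ditag[:tag_len], reverse_complement(ditag[-tag_len:])

def count_tags(concatemers, tag_len: int = 10) -> Counter:
    counts = Counter()
    for concatemer in concatemers:
        for ditag in ditags_from_concatemer(concatemer):
            if len(ditag) >= 2 * tag_len:  # skip pieces too short to hold two tags
                counts.update(tags_from_ditag(ditag, tag_len))
    return counts

example = ("CATG" "ACCCACGAGCAGGGTACGATGATA"
           "CATG" "GAAACCTATGCACCTTGGGTAGCA" "CATG")
tag_counts = count_tags([example])
for tag, n in tag_counts.most_common():
    print(tag, n)
```

With many concatemers as input, the resulting counter is the tag-count table: each tag's count estimates the copy number of the RNA it came from.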
Proteomics
SDS PAGE
2D PAGE
MS/MS
An example SDS-PAGE
Protein stains: Silver, Copper, Coomassie Blue
How many proteins are in a band?
2D-PAGE
Dimension 1: isoelectric focusing gel
Dimension 2: size
2D gel from macrophage phagosomes
Mass spectrometry
Mass spectrometers consist of three essential parts
– Ionization source: Converts peptides into gas-phase ions (MALDI + ESI)
– Mass analyzer: Separates ions by mass to charge (m/z) ratio (Ion trap, time of flight, quadrupole)
– Ion detector: Current over time indicates amount of signal at each m/z value
MS/MS Overview
A raw fragmentation spectrum
By calculating the molecular weight difference between ions of the same type, the sequence can be determined.
SEQUEST uses the fragmentation pattern to search through a complete protein database to identify the sequence which best fits the pattern.
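The mass-difference idea can be illustrated with an idealized b-ion ladder. The sketch below is not SEQUEST, which scores database peptides against the whole spectrum. It assumes a complete, noise-free series of singly charged b ions and uses standard monoisotopic residue masses. The peptide EACDPLR is borrowed from the ICAT slide, with unmodified cysteine mass for simplicity. Leucine and isoleucine have identical mass, so both are reported as L.

```python
PROTON = 1.00728  # mass of the charge-carrying proton (Da)

# Monoisotopic residue masses (Da); I is isobaric with L and is folded into L.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def b_ion_ladder(peptide: str):
    """Idealized singly charged b-ion m/z values for a peptide."""
    ladder, total = [], PROTON
    for aa in peptide.replace("I", "L"):
        total += RESIDUE_MASS[aa]
        ladder.append(total)
    return ladder

def read_sequence(ladder, tol: float = 0.01) -> str:
    """Assign a residue to each gap between consecutive ions of the same type."""
    sequence, previous = [], PROTON
    for mz in ladder:
        gap = mz - previous
        matches = [aa for aa, mass in RESIDUE_MASS.items() if abs(mass - gap) <= tol]
        sequence.append(matches[0] if matches else "?")
        previous = mz
    return "".join(sequence)

ladder = b_ion_ladder("EACDPLR")
print([round(mz, 3) for mz in ladder])
print(read_sequence(ladder))
```

Real spectra have missing and extra peaks and mixed ion types, which is why database search against the observed fragmentation pattern is used in practice.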
Tandem Mass Spec (MS/MS)
Typical nanoelectrospray source
Isotope Coded Affinity Tags (ICAT)
ICAT reagent structure: biotin tag, linker (d0 or d8), thiol-specific reactive group
ICAT reagents:
Heavy reagent: d8-ICAT (X = deuterium)
Normal reagent: d0-ICAT (X = hydrogen)
[Figure: chemical structure of the ICAT reagent, with eight X positions on the linker]
Mass spec based method for measuring relative protein abundances between two samples
Protein Quantification & Identification via ICAT Strategy
– Mixture 1 and Mixture 2: ICAT-labeled cysteines (light and heavy)
– Combine and proteolyze (trypsin)
– Affinity separation (avidin)
– Quantitation
– Protein identification
Example peptide: NH2-EACDPLR-COOH
ICAT Flash animation: http://occawlonline.pearsoned.com/bookbind/pubbooks/bc_mcampbell_genomics_1/medialib/method/ICAT/ICAT.html
ICAT continued
The heavy (blue) and light (gray) peptides are separated and quantified to produce a ratio for each peptide; here, a single peptide ratio is shown
Each peptide is subjected to CID fragmentation in the second MS stage in order to identify it
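The quantification step can be sketched as pairing peaks that differ by the heavy/light label mass. This is a rough illustration under stated assumptions: the d8 and d0 reagents differ by eight deuterium-for-hydrogen substitutions (about 8.05 Da per labeled cysteine), peaks arrive as (m/z, intensity) tuples with a known charge, and the peak list is invented for the example.

```python
# Mass added by replacing 8 hydrogens with deuterium (per labeled cysteine).
DEUTERIUM_SHIFT = 8 * (2.014102 - 1.007825)  # about 8.0502 Da

def icat_pairs(peaks, charge=1, n_cys=1, tol=0.02):
    """Find light/heavy peak pairs and return (light_mz, heavy_mz, light/heavy ratio).

    peaks: list of (mz, intensity); charge and n_cys are assumed known.
    """
    spacing = n_cys * DEUTERIUM_SHIFT / charge
    pairs = []
    for light_mz, light_intensity in peaks:
        for heavy_mz, heavy_intensity in peaks:
            if abs((heavy_mz - light_mz) - spacing) <= tol and heavy_intensity > 0:
                pairs.append((light_mz, heavy_mz, light_intensity / heavy_intensity))
    return pairs

# Hypothetical doubly charged peptide with one labeled cysteine.
peaks = [(560.300, 100.0), (564.325, 50.0), (571.000, 20.0)]
pairs = icat_pairs(peaks, charge=2)
print(pairs)
```

Here the light form is twice as intense as the heavy form, so the peptide (and by inference its protein) is estimated to be about two-fold more abundant in the sample labeled with the light reagent.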
Metabolomic measurements
2D NMR or mass spectrometry
Currently not global and in less widespread use than microarrays, but have tremendous potential
Gene knockout and RNAi libraries for model species
Example from yeast: replacement of yeast ORFs with kanMX gene flanked by unique oligo barcodes (Yeast Deletion Project Consortium)
YFP tagging for protein localization
NIC96 Nuclear Pore
YFP is green, transmitted light is red
TUB1 Tubulin cytoskeleton
HHF2 Histone Nucleus
BNI4 Bud neck
Images courtesy T. Davis lab
See also recent work by Weissman and O’Shea labs at UCSF
Systematic phenotyping
Deletion strain:    yfg1      yfg2      yfg3
Barcode (UPTAG):    CTAACTC   TCGCGCA   TCATAAT
Growth 6 hrs in minimal media (how many doublings?)
Rich media
…
Harvest and label genomic DNA
Systematic phenotyping with a barcode array
Ron Davis and friends…
These oligo barcodes are also spotted on a DNA microarray
Growth time in minimal media:
– Red: 0 hours
– Green: 6 hours
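The red/green comparison can be turned into a per-strain score. The sketch below is an assumed analysis, not the method from the lecture: it takes the log2 ratio of the 6-hour (green) to 0-hour (red) barcode intensity and centers it on the median strain. Intensities are hypothetical, and the pseudocount is an illustrative guard against zero signal.

```python
import math
from statistics import median

def barcode_log_ratios(red_0h, green_6h, pseudocount=1.0):
    """Median-centered log2(green/red) per deletion strain.

    Negative scores mean the strain's barcode was depleted after growth,
    i.e. the strain grew more slowly than a typical strain in the pool.
    """
    raw = {
        strain: math.log2((green_6h[strain] + pseudocount) / (red_0h[strain] + pseudocount))
        for strain in red_0h
    }
    center = median(raw.values())
    return {strain: value - center for strain, value in raw.items()}

# Hypothetical spot intensities for the three example strains.
red_0h = {"yfg1": 999.0, "yfg2": 999.0, "yfg3": 999.0}
green_6h = {"yfg1": 999.0, "yfg2": 249.0, "yfg3": 1999.0}

scores = barcode_log_ratios(red_0h, green_6h)
for strain, score in scores.items():
    print(strain, round(score, 2))
```

A score of -2 means the barcode is four-fold under-represented, roughly two fewer doublings than the median strain over the growth period.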
Molecular Interactions
Among proteins, mRNA, small molecules, and so on…
Network (interactions)          Measured by (▲)   State (levels)                    Measured by (▼)
Protein→DNA interactions        Chromatin IP      Gene levels (on/off)              DNA microarray
Protein-protein interactions    Protein coIP      Protein levels (present/absent)   Mass spectrometry
Biochemical reactions           Not yet!!!        Biochemical levels                Metabolic flux measurements
Also like sequence, protein interaction data are exponentially growing…
DIP Database Growth: total interactions
(As are the false positives!!!)
EMBL Database Growth: total nucleotides (gigabases)
[Plot axes: years 1980 to 2000; vertical scale 0 to 10]
High-throughput methods for measuring interaction networks
2-hybrid
co-immunoprecipitation w/ mass spec
chIP-on-chip
systematic genetic analysis
Yeast two-hybrid method
Fields and Song
Detection of protein interactions with antibody arrays
MacBeath and Schreiber
Kinase-target interactions
Mike Snyder and colleagues
Protein interactions by protein immunoprecipitation followed by mass spectrometry
Gavin / Cellzome
TEV = Tobacco Etch Virus proteolytic site
CBP = Calmodulin binding peptide
Protein A = IgG binding from Staphylococcus
TAP purification
Image courtesy of Bertrand Seraphin
ChIP-chip measurement of protein→DNA interactions
From Figure 1 of Simon et al. Cell 2001
Genetic interactions: synthetic lethals and suppressors
Adapted from Tong et al., Science 2001
Genetic Interactions:
Widespread method used by geneticists to discover pathways in yeast, fly, and worm
Implications for drug targeting and drug development for human disease
Thousands are now reported in literature and systematic studies
As with other types, the number of known genetic interactions is exponentially increasing…
Most recorded genetic interactions are synthetic lethal relationships
Adapted from Hartman, Garvik, and Hartwell, Science 2001
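The two interaction types can be stated as simple rules on mutant viability. The sketch below uses the standard textbook definitions rather than any scoring scheme from the cited papers: synthetic lethal means both single mutants are viable but the double mutant is not, and suppression means the double mutant is viable although at least one single mutant is not. Quantitative screens score fitness on a continuous scale instead of these yes/no calls.

```python
def classify_interaction(viable_a: bool, viable_b: bool, viable_ab: bool) -> str:
    """Classify a gene pair from the viability of mutants a, b, and the double ab."""
    if viable_a and viable_b and not viable_ab:
        return "synthetic lethal"
    if viable_ab and not (viable_a and viable_b):
        return "suppressor"
    return "no interaction scored"

# Hypothetical observations: (single a, single b, double ab) viability.
observations = {
    "pair 1": (True, True, False),
    "pair 2": (False, True, True),
    "pair 3": (True, True, True),
}
calls = {pair: classify_interaction(*obs) for pair, obs in observations.items()}
for pair, call in calls.items():
    print(pair, call)
```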
[Figure: suppressor protein interaction and synthetic-lethal protein interaction between proteins A and B]
Parallel Effects (Redundant or Additive)
Sequential Effects (Additive)
Single A or B mutations typically abolish their biochemical activities
Single A or B mutations typically reduce their biochemical activities
Interpretation of genetic interactions (Guarente T.I.G. 1990)
GOAL: Identify downstream physical pathways