BIONF/BENG 203: Functional Genomics
Lecture TI 1,2
Trey Ideker
UCSD Departments of Medicine & Bioengineering
Sources of Functional Data Lectures 1 and 2
Instructors
Trey Ideker
Vineet Bafna
Anand Patel (TA)
Grading
40% Problem Sets (best 4 of 5)
30% Midterm
30% Final Project
Topics Covered By This Course
① Signal detection in bioinformatics
② Large-scale data generation platforms
③ Understanding next-gen sequencing data
④ Understanding mass spectrometry data
⑤ Clustering and Classification
⑥ Genotype-phenotype association
⑦ Understanding physical & genetic networks
⑧ Gene network inference and evolution
Ideker, Dutkowski, Hood. Cell 2011
Bioinformatics as Signal Detection
Test Statistic t
Power, FDR, and all that...
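A minimal sketch of the two ideas named on this slide, a test statistic t and false discovery rate (FDR) control. The slide does not say which procedures are meant, so this assumes Welch's two-sample t statistic and the Benjamini-Hochberg step-up rule; the function names and example p-values are hypothetical.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Two-sample t statistic without assuming equal variances (Welch)."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per test: True if rejected at FDR level alpha."""
    m = len(p_values)
    # Rank the tests by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from ten tests (for example, ten genes).
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
example_calls = benjamini_hochberg(example_p, alpha=0.05)
example_t = welch_t([5, 6, 7], [1, 2, 3])
```

With these inputs only the two smallest p-values pass the step-up threshold, even though five of them are below 0.05 individually.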
An Example:
Pathway-Level Integration of
Genome-wide Association Studies
Segrè et al., 2010: A.V. Segrè, L. Groop, V.K. Mootha, M.J. Daly and D. Altshuler, PLoS Genet. 6 (2010), e1001058.
Classes of biological measurements
1) Molecular States
DNA sequence / genotype: Next-gen sequencing, SNP & CNV arrays
Gene expression: DNA microarrays, mRNA sequencing
Protein levels, locations, mods: Mass spectrometry, fluorescence microscopy, protein arrays
2) Molecular Networks
Protein-protein interactions: Two-hybrid system, coIP, protein antibody array
Protein-DNA interactions: Chromatin IP (ChIP) sequencing
Protein-compound
3) Phenotypic traits
Physiological or disease state, binary or quantitative
Growth rate, response to stimulus or stress
Behaviors
Sequencing By Synthesis
(Illumina GenomeAnalyzer or HiSeq)
Bridge Amplification
Pyrosequencing
Note: No actual houses
are burned down in
pyrosequencing
Pyrosequencing (454 Life Sciences / Roche)
A luciferase is an enzyme which emits light in
the presence of ATP.
Several organisms, such as the American firefly and the
poisonous Jack-o-lantern mushroom, produce luciferases.
Detecting polymerase activity
Recall: Pyrophosphate is also known as PPi,
also known as “two phosphate groups stuck
together”. During replication, each addition
of a dNTP releases pyrophosphate.
In the reaction mixture, ATP sulfurylase uses PPi to convert adenosine
phosphosulfate (APS) to ATP; this ATP allows luciferase to luciferate
(emit light).
Measures strand extension as it happens
Pyrosequencing cycle
Add dATP. If light is emitted, your sequence
starts with A. If not, the dATP is degraded
(or elutes past immobilized primer).
Add dGTP. If light is emitted, the next base
must be a G.
Then add T, then C. You now know at least
one (maybe more) base of the sequence.
Repeat!
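The cycle above can be sketched as a small simulation. This is only an illustration under ideal chemistry: it assumes the flow order A, G, T, C from the slide, one light unit per incorporated base, and complete extension through runs of identical bases. The function name `pyrosequence` is hypothetical.

```python
def pyrosequence(read, flow_order="AGTC", max_cycles=20):
    """Simulate nucleotide flows for the strand being synthesized.

    read: the bases that will be incorporated, in order of synthesis.
    Returns a list of (flowed_base, light_units), one entry per flow.
    """
    flows = []
    pos = 0
    for _ in range(max_cycles):
        for base in flow_order:
            # A flow extends through every consecutive position matching it,
            # so a run of n identical bases gives n light units in one flow.
            run = 0
            while pos < len(read) and read[pos] == base:
                run += 1
                pos += 1
            flows.append((base, run))
            if pos == len(read):
                return flows
    return flows


example_flows = pyrosequence("GGCCCTTG")
```

A flow with zero light units corresponds to the "if not, the dNTP is degraded" branch of the cycle.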
Pyrosequencing output
Runs of bases produce higher peaks – for instance, the sequence for (a)
is GGCCCTTG. Sample (c) comes from a heterozygous individual (hence
the heights in multiples of ½)
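Reading a sequence off the peak heights can be sketched as follows, assuming a homozygous sample in which each peak height is close to a whole number of incorporated bases. The peak heights below are made-up values, and the function name is hypothetical. Half-integer peaks from a heterozygous sample, as in (c), are not handled by this sketch.

```python
def decode_pyrogram(flows):
    """Turn (flowed_base, peak_height) pairs into a base sequence.

    Each peak height is rounded to the nearest whole number of bases.
    """
    return "".join(base * round(height) for base, height in flows)


# Hypothetical noisy peak heights for flows in the order A, G, T, C, ...
example_peaks = [
    ("A", 0.05), ("G", 2.1), ("T", 0.0), ("C", 2.9),
    ("A", 0.1), ("G", 0.0), ("T", 1.95), ("C", 0.02),
    ("A", 0.0), ("G", 1.0),
]
example_sequence = decode_pyrogram(example_peaks)
```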
The X Prize Foundation
In October 2006, the X Prize Foundation
established an initiative to promote the
development of full genome sequencing
technologies, called the Archon X Prize,
intending to award $10 million to “the first
Team that can build a device and use it to
sequence 100 human genomes within 10 days
or less, with an accuracy of no more than one
error in every 100,000 bases sequenced, with
sequences accurately covering at least 98% of
the genome, and at a recurring cost of no more
than $10,000 (US) per genome.”
http://genomics.xprize.org/
Gene and Protein Expression
The transcriptome is the full complement of RNA molecules produced by a genome
The proteome is the full complement of proteins enabled by the transcriptome
DNA → RNA → protein
Genome → transcriptome → proteome
30,000 genes → ??? RNAs → ??? proteins
For example, the Drosophila gene Dscam can generate about 40,000 distinct transcripts through alternative splicing.
What is the minimum number of exons that would be required?
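One way to approach the question, as a sketch rather than the intended answer: assume every alternative exon is a cassette exon that is independently included or skipped, so n exons give 2^n transcripts. The function name is hypothetical. For comparison, the real Dscam gene uses four clusters of mutually exclusive exons (12, 48, 33 and 2 alternatives), which is where the figure of roughly 40,000 comes from.

```python
def min_cassette_exons(n_transcripts):
    """Smallest n with 2**n >= n_transcripts (independent include/skip exons)."""
    n = 0
    while 2 ** n < n_transcripts:
        n += 1
    return n


# Minimum under the independent cassette-exon assumption.
minimum_exons = min_cassette_exons(40000)

# Actual Dscam arrangement: one choice from each mutually exclusive cluster.
dscam_isoforms = 12 * 48 * 33 * 2
```

Under this assumption the minimum is 16 alternative exons, since 2^15 = 32,768 falls short and 2^16 = 65,536 suffices.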
mRNA Expression: Two dominant approaches
RNA sequencing
DNA Microarrays
Others / older approaches:
EST sequencing
RT-PCR
Differential display
SAGE
Massively parallel signature sequencing (MPSS)
Microarrays
Monitors the level of each gene:
Is it turned on or off in a particular biological condition?
Is this on/off state different between two biological conditions?
A microarray is a rectangular grid of spots printed on a glass microscope slide, where each spot contains DNA for a different gene
Two-color DNA microarray design
Reverse Transcription
Types of microarrays
Spotted (cDNA)
– Robotic transfer of cDNA clones or PCR products
– Spotting on nylon membranes or glass slides coated with poly-lysine
Synthetic (oligo)
– Direct oligo synthesis on solid microarray substrate
– Uses photolithography (Affymetrix) or ink-jet printing (Agilent)
– 100,000 features per cm²
All configurations assume the DNA on the array is in excess of the hybridized sample; thus the kinetics are linear and the spot intensity reflects the amount of hybridized sample.
Labeling can be radioactive, fluorescent (one-color), or two-color
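For two-color labeling, the on/off comparison between two conditions is commonly summarized per spot as a log2 ratio of the two channel intensities. A minimal sketch, assuming background-subtracted intensities and a small floor to avoid taking the log of zero; the function name, the floor value and the example intensities are hypothetical.

```python
import math


def log2_ratio(red, green, background_red=0.0, background_green=0.0, floor=1.0):
    """log2(red / green) for one spot after background subtraction.

    Intensities at or below background are clipped to `floor`.
    """
    r = max(red - background_red, floor)
    g = max(green - background_green, floor)
    return math.log2(r / g)


induced = log2_ratio(4000.0, 1000.0)     # four-fold higher in the red channel
repressed = log2_ratio(500.0, 2000.0)    # four-fold lower in the red channel
unchanged = log2_ratio(1500.0, 1500.0)
```

The log scale makes a four-fold increase and a four-fold decrease symmetric around zero, which relies on the linear relationship between spot intensity and hybridized sample stated above.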
Microarray Spotter
Affymetrix High Density Arrays
Microarray confocal scanner
Collects sharply defined optical sections from which 3D renderings can be created
The key is spatial filtering to eliminate out-of-focus light or glare in specimens whose thickness exceeds the immediate plane of focus.
Two lasers for excitation
Two color scan in less than 10 minutes
High resolution, 10 micron pixel size
Next-Gen Sequencing of mRNAs
cDNA = complementary or copy DNA
EST = Expressed Sequence Tag
The microarray could be described as a “closed system” because information about RNAs is limited by the targets available for hybridization. RNAs not represented on the array are not interrogated.
Direct sequencing of cDNAs overcomes this problem by large-scale random sampling of sequences from a whole-cell RNA extract
Statistical counting of distinct sequences provides a precise estimate of expression level
cDNA library can be normalized to capture rare messages
Has been dramatically enabled by large scale sequencing
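The counting idea can be sketched as follows: assign each sequenced cDNA read to a gene, count reads per gene, and scale by the total so that libraries of different depth are comparable. This assumes the reads have already been mapped to genes; counts per million is one common scaling, not necessarily the one used in the lecture, and the gene labels are placeholders.

```python
from collections import Counter


def counts_per_million(read_gene_assignments):
    """Map each gene to its read count scaled to one million total reads."""
    counts = Counter(read_gene_assignments)
    total = sum(counts.values())
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


# Hypothetical library of ten reads, each already assigned to a gene.
example_reads = ["geneA"] * 6 + ["geneB"] * 3 + ["geneC"] * 1
example_cpm = counts_per_million(example_reads)
```

Because this is random sampling, the precision of the estimate for a rare message depends on sequencing depth, which is why large-scale sequencing matters here.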
mRNA Sequencing: Preparation of a cDNA library in phage vector
Proteomics
MS / MS
1D and 2D SDS PAGE
Mass spectrometry
Mass spectrometers consist of 3 essential parts
– Ionization source: Converts peptides into gas-phase ions (MALDI or ESI)
– Mass analyzer: Separates ions by mass-to-charge (m/z) ratio (ion trap, time of flight, quadrupole)
– Ion detector: Current over time indicates amount of signal at each m/z value
MS/MS Overview
A raw fragmentation spectrum
By calculating the molecular weight difference between consecutive ions of the same type, the sequence can be determined.
Algorithms like SEQUEST use the fragmentation pattern to search through a complete protein database to identify the sequence which best fits the pattern.
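The mass-difference idea can be sketched on an idealized ladder of singly charged b ions. This is not how SEQUEST works (it scores database peptides against the spectrum); it only illustrates reading residues from consecutive mass differences. The residue masses are standard monoisotopic values for unmodified residues, leucine and isoleucine are indistinguishable by mass and are reported as L, and the function names and tolerance are hypothetical.

```python
# Monoisotopic residue masses in daltons (unmodified residues).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728


def b_ion_ladder(peptide):
    """Ideal singly charged b-ion m/z values b1 .. b(n-1) for a peptide."""
    ladder = []
    total = PROTON
    for residue in peptide[:-1]:
        total += RESIDUE_MASS[residue]
        ladder.append(total)
    return ladder


def sequence_from_ladder(ion_mz, tolerance=0.02):
    """Read residues from differences between consecutive ions of one type."""
    residues = []
    for lighter, heavier in zip(ion_mz, ion_mz[1:]):
        delta = heavier - lighter
        matches = [aa for aa, mass in RESIDUE_MASS.items()
                   if abs(mass - delta) <= tolerance]
        residues.append(matches[0] if matches else "?")
    return "".join(residues)


# The ladder b1..b6 of EACDPLR yields the residues between consecutive ions.
example_partial_sequence = sequence_from_ladder(b_ion_ladder("EACDPLR"))
```

Real spectra have missing and extra peaks and mixed ion types, which is why database search against predicted fragmentation patterns is used in practice.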
Tandem Mass Spec (MS/MS)
Isotope Coded Affinity Tags (ICAT)
ICAT reagent structure: biotin tag, linker (d0 or d8), thiol-specific reactive group
ICAT Reagents:
Heavy reagent: d8-ICAT (X = deuterium)
Normal reagent: d0-ICAT (X = hydrogen)
[Figure: chemical structure of the ICAT reagent, with the eight X positions on the linker]
Mass spec based method for measuring relative protein abundances between two samples
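Protein Quantification & Identification via ICAT Strategy
[Figure: workflow. Mixture 1 and Mixture 2 carry ICAT-labeled cysteines (light and heavy reagent) → combine and proteolyze (trypsin) → affinity separation (avidin) → quantitation from the light/heavy peak pair (spectrum, m/z 550 to 580, y axis 0 to 100) → protein identification from the fragmentation spectrum (m/z 200 to 800, y axis 0 to 100; example peptide NH2-EACDPLR-COOH)]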
ICAT Flash animation: http://occawlonline.pearsoned.com/bookbind/pubbooks/bc_mcampbell_genomics_1/medialib/method/ICAT/ICAT.html
ICAT continued
The heavy (blue) and light (gray) peptides are separated and quantified to produce a ratio for each peptide – here, a single peptide ratio is shown
Each peptide is subjected to CID fragmentation in the second MS stage in order to identify it
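A small sketch of the quantification step, assuming the d0/d8 reagents described earlier: the heavy peptide is shifted by eight deuterium-for-hydrogen substitutions per labeled cysteine, so the pair appears about 8 Da apart divided by the charge. The function names and the example intensities are hypothetical.

```python
MASS_HYDROGEN = 1.007825    # monoisotopic mass of 1H, in daltons
MASS_DEUTERIUM = 2.014102   # monoisotopic mass of 2H, in daltons


def icat_pair_spacing(charge, n_cysteines=1, labels_per_tag=8):
    """Expected m/z gap between the light and heavy forms of one peptide."""
    mass_shift = n_cysteines * labels_per_tag * (MASS_DEUTERIUM - MASS_HYDROGEN)
    return mass_shift / charge


def heavy_to_light_ratio(light_intensity, heavy_intensity):
    """Relative abundance of the peptide in the heavy-labeled sample."""
    return heavy_intensity / light_intensity


spacing_z1 = icat_pair_spacing(charge=1)   # about 8.05 m/z units
spacing_z2 = icat_pair_spacing(charge=2)   # about 4.03 m/z units
example_ratio = heavy_to_light_ratio(light_intensity=100.0, heavy_intensity=50.0)
```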
Gene replacement for yeast & other model species
Using HR-based gene replacement, genes can be replaced with a drug
resistance cassette, tagged with GFP, epitope tagged, etc.
Systematic phenotyping
Deletion Strain:   yfg1     yfg2     yfg3
Barcode (UPTAG):   CTAACTC  TCGCGCA  TCATAAT
Growth 6 hrs in minimal media (how many doublings?)
Rich media
…
Harvest and label genomic DNA
Systematic phenotyping with a barcode array
Ron Davis and friends…
These oligo barcodes are also
spotted on a DNA microarray
Growth time in minimal media:
– Red: 0 hours
– Green: 6 hours
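The barcode array readout can be sketched as a per-strain log2 ratio of the 6-hour (green) to 0-hour (red) signal: strains that grow poorly in minimal media lose representation in the pool and get a negative value. The intensities are made-up numbers. The doublings helper answers the "how many doublings?" question only under an assumed doubling time, which the slide does not give.

```python
import math


def barcode_log_ratios(red_t0, green_t6):
    """log2(green / red) per deletion strain; negative means depleted."""
    return {strain: math.log2(green_t6[strain] / red_t0[strain])
            for strain in red_t0}


def doublings(hours, doubling_time_hours):
    """Number of population doublings in the given growth time."""
    return hours / doubling_time_hours


# Hypothetical spot intensities for the three example strains.
red = {"yfg1": 1000.0, "yfg2": 1000.0, "yfg3": 1000.0}
green = {"yfg1": 1000.0, "yfg2": 250.0, "yfg3": 2000.0}
example_ratios = barcode_log_ratios(red, green)

# Assumption for illustration only: a 2-hour doubling time in minimal media.
example_doublings = doublings(hours=6.0, doubling_time_hours=2.0)
```

Here yfg2 drops four-fold relative to the pool, suggesting its deleted gene matters for growth in minimal media.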
YFP tagging for protein localization
YFP is green, transmitted light is red
NIC96: Nuclear pore
TUB1: Tubulin cytoskeleton
HHF2: Histone, nucleus
BNI4: Bud neck
Images courtesy T. Davis lab
See also work by Weissman and O’Shea labs at UCSF
Molecular Interactions
Among proteins, mRNA, small molecules, and so on…
Protein→DNA interactions: Chromatin IP
Gene levels (on/off): DNA microarray
Protein-protein interactions: Protein coIP
Protein levels (present/absent): Mass spectrometry
Biochemical reactions: Not yet!!!
Biochemical levels: Metabolic flux measurements
Measurements of molecular interactions
Protein-protein interactions
Yeast-two-hybrid
Kinase-substrate assays
Co-immunoprecipitation w/ mass spec
Protein-DNA interactions
ChIP-on-chip and ChIP-seq
Genetic interactions
Systematic Genetic Analysis
Yeast two-hybrid method
Fields and Song
Kinase-target interactions
Mike Snyder and colleagues
Protein interactions by protein immunoprecipitation followed by mass spectrometry
Gavin / Cellzome
TEV = Tobacco Etch Virus proteolytic site
CBP = Calmodulin binding peptide
Protein A = IgG binding from Staphylococcus
ChIP measurement of protein→DNA interactions
From Figure 1 of Simon et al. Cell 2001
Genetic interactions: synthetic lethals and suppressors
Genetic Interactions:
Widespread method used by geneticists to discover pathways in yeast, fly, and worm
Implications for drug targeting and drug development for human disease
Thousands are now reported in literature and systematic studies
As with other types, the number of known genetic interactions is exponentially increasing…
Adapted from Tong et al., Science 2001
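One common way to score a genetic interaction from growth measurements, sketched here under an assumption the slides do not state: a multiplicative null model, where the expected double-mutant fitness is the product of the two single-mutant fitnesses. The function names, the threshold and the fitness values are hypothetical.

```python
def interaction_score(fitness_a, fitness_b, fitness_ab):
    """Deviation of the double mutant from the multiplicative expectation."""
    return fitness_ab - fitness_a * fitness_b


def classify_interaction(fitness_a, fitness_b, fitness_ab, threshold=0.2):
    """Label a gene pair by the sign and size of its interaction score."""
    score = interaction_score(fitness_a, fitness_b, fitness_ab)
    if score <= -threshold:
        return "negative (synthetic sick or lethal)"
    if score >= threshold:
        return "positive (for example, suppression)"
    return "no interaction"


# Fitness is relative to wild type (1.0 = wild-type growth, 0.0 = dead).
synthetic_lethal = classify_interaction(0.9, 0.9, 0.0)
suppressor = classify_interaction(0.5, 1.0, 0.9)
independent = classify_interaction(0.9, 0.8, 0.72)
```

A synthetic lethal pair is the extreme negative case: each single mutant grows nearly normally, but the double mutant does not grow at all.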
Most recorded genetic interactions are synthetic lethal relationships
Adapted from Hartman, Garvik, and Hartwell, Science 2001
Interpretation of genetic interactions (Guarente T.I.G. 1990)
[Figure: pathway diagrams of genes A and B acting in parallel or in sequence]
Parallel Effects (Redundant or Additive): Single A or B mutations typically abolish their biochemical activities
Sequential Effects (Additive): Single A or B mutations typically reduce their biochemical activities
GOAL: Identify downstream physical pathways