bionf/beng 203: functional genomics lecture ti 1 trey ideker ucsd department of bioengineering...

BIONF/BENG 203: Functional Genomics

Lecture TI 1
Trey Ideker
UCSD Department of Bioengineering

Sources of Functional Data
Lectures 1 and 2

Grading

40% Problem Sets (best 4 of 5)
30% Midterm
30% Final Project

Outline of the course

Biological data sources (2)

Data pre-processing

(6)

Unsupervised:

Clustering

Inference

Supervised:

Classification

Population Genetics and Linkage

Single Source (3) (3) (1)

Multi-Source (2)

FINAL PROJECT

Project Presentations (2)

Total of 17 lectures

Functional Genomics Data

– Expression: mRNA, protein

– Molecular interactions: protein, mRNA, small molecules

– Knockout phenotypes: 1st, 2nd, higher orders

– SNP sequence (polymorphism) data

– Imaging data: sub-cellular localization, cell morphology

– Gene ontology

Dividing the data into two classes of information: Biological Networks and Network States

1) Directly observe the network “wires” themselves

Protein-protein interactions: Two-hybrid system, coIP, protein antibody arrays

BIND, DIP

Protein-DNA interactions: Chromatin IP

BIND, Transfac, SCPD

Other types not yet possible: e.g., protein-small molecule

2) Observe molecular states that result from the interaction wiring

DNA/RNA gene expression: DNA microarrays, SAGE

Protein levels, locations, and modifications: Mass spectrometry, fluorescence microscopy, protein arrays

Gross phenotypes: e.g., growth rates of single and double deletion strains

High-throughput methods for measuring cellular states

Gene expression levels: RT-PCR, arrays

Protein levels, modifications: mass spec

Protein locations: fluorescent tagging

Metabolite levels: NMR and mass spec

Systematic phenotyping

The transcriptome and proteome

The transcriptome is the full complement of RNA molecules produced by a genome

The proteome is the full complement of proteins enabled by the transcriptome

DNA → RNA → protein
Genome → transcriptome → proteome
30,000 genes → ??? RNAs → ??? proteins?

For example, the Drosophila gene Dscam can generate approximately 40,000 distinct transcripts through alternative splicing.

What is the minimum number of exons that would be required?
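One way to approach this question, as a rough sketch: assume (this is an assumption, not something the slide states) that every alternative exon is an independent cassette exon that is either included or skipped, so n such exons give 2^n possible transcripts. The function name `min_cassette_exons` is hypothetical.

```python
def min_cassette_exons(n_transcripts: int) -> int:
    """Smallest n with 2**n >= n_transcripts.

    Assumes each alternative exon is independently included or skipped.
    """
    n = 0
    while 2 ** n < n_transcripts:
        n += 1
    return n


print(min_cassette_exons(40_000))
```

Under this model the answer is 16, because 2^15 = 32,768 falls short of 40,000 while 2^16 = 65,536 exceeds it. Dscam itself relies on clusters of mutually exclusive exons, in which case the transcript count is a product of cluster sizes rather than a power of two.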

Expression: High-throughput approaches

RNA
– DNA microarrays
– cDNA / EST sequencing
– RT-PCR
– Differential display
– SAGE
– Massively parallel signature sequencing (MPSS)

Proteins
– 2D PAGE
– Mass spectrometry

Gene expression arrays

They are really, really, really, really, really, really, really, really, really, really, really, really, really important

Microarrays

Monitors the level of each gene:

Is it turned on or off in a particular biological condition?

Is this on/off state different between two biological conditions?

A microarray is a rectangular grid of spots printed on a glass microscope slide, where each spot contains DNA for a different gene

Two-color DNA microarray design

Reverse Transcription

cDNA-chip of brain glioblastoma

Types of microarrays

Spotted (cDNA)
– Robotic transfer of cDNA clones or PCR products
– Spotting on nylon membranes or glass slides coated with poly-lysine

Synthetic (oligo)
– Direct oligo synthesis on solid microarray substrate
– Uses photolithography (Affymetrix) or ink-jet printing (Agilent)

All configurations assume the DNA on the array is in excess of the hybridized sample, so the kinetics are linear and the spot intensity reflects the amount of hybridized sample.

Labeling can be radioactive, fluorescent (one-color), or two-color
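Because spot intensity is assumed to be proportional to the amount of hybridized sample, a two-color experiment is usually summarized as the ratio of the two channel intensities at each spot. The sketch below shows that calculation on made-up numbers; the function name `log2_ratio`, the background parameters, and the intensities are all hypothetical.

```python
import math


def log2_ratio(red: float, green: float,
               red_bg: float = 0.0, green_bg: float = 0.0) -> float:
    """log2 of the background-corrected red/green intensity for one spot."""
    r = red - red_bg
    g = green - green_bg
    if r <= 0 or g <= 0:
        raise ValueError("background-corrected intensities must be positive")
    return math.log2(r / g)


# Hypothetical spot intensities (red, green) in arbitrary scanner units.
spots = {
    "geneA": (4000.0, 1000.0),
    "geneB": (500.0, 500.0),
    "geneC": (250.0, 1000.0),
}
for gene, (red, green) in spots.items():
    print(gene, log2_ratio(red, green))
```

A log scale is a convenient choice because a four-fold increase and a four-fold decrease become +2 and -2, symmetric around zero.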

Microarray Spotter

Affymetrix High Density Arrays

Microarrays (continued)

Imaging
– Radioactive 32P labeling: Autoradiography or phosphorimager
– Fluorescent labeling: Confocal microscope (invented by Marvin Minsky!!)

Feature density
– Nylon membrane macroarrays: 100-1000 features
– Glass slide spotted array: 5,000 features / cm^2
– Synthesized arrays: 50,000 features / cm^2

Microarray confocal scanner

Collects sharply defined optical sections from which 3D renderings can be created

The key is spatial filtering to eliminate out-of-focus light or glare in specimens whose thickness exceeds the immediate plane of focus.

– Two lasers for excitation
– Two color scan in less than 10 minutes
– High resolution, 10 micron pixel size

cDNA / EST sequencing projects

cDNA = complementary or copy DNA
EST = Expressed Sequence Tag

The microarray could be described as a “closed system” because information about RNAs is limited by the targets available for hybridization. RNAs not represented on the array are not interrogated.

Direct sequencing of cDNAs (yielding ESTs) overcomes this problem by large-scale random sampling of sequences from a whole-cell RNA extract

Statistical counting of distinct sequences provides an estimate of expression level

Conversely, a cDNA library can be normalized to capture rare messages

Requires large scale sequencing to get statistical significance
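To see why large-scale sequencing is needed, here is a minimal sketch that treats each sequenced clone as an independent random draw from the RNA pool (an idealization; real libraries have cloning biases). A transcript that makes up a fraction f of the pool is seen at least once in N reads with probability 1 - (1 - f)^N. The function names and the example abundance are hypothetical.

```python
import math


def prob_seen(fraction: float, n_reads: int) -> float:
    """Probability that a transcript at this fraction appears in >= 1 read."""
    return 1.0 - (1.0 - fraction) ** n_reads


def reads_needed(fraction: float, confidence: float = 0.95) -> int:
    """Reads required to see the transcript at least once with this confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fraction))


# A rare message at 1 copy per 100,000 transcripts (hypothetical abundance).
rare = 1e-5
print(prob_seen(rare, 10_000))
print(reads_needed(rare))
```

With these assumptions, 10,000 reads catch such a message less than 10% of the time, and roughly 300,000 reads are needed for a 95% chance of seeing it even once.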

cDNA / EST Sequencing: Preparation of a cDNA library in phage vector

Serial Analysis of Gene Expression (SAGE)

Takes the idea of sequence sampling to the extreme

Generates short ESTs (9-14 nt) which are joined into long concatemers and then sequenced

4^9 is 262,144, ~5-fold the number of human genes

The count of each type of tag estimates RNA copy number

>50X more efficient than cDNA sequencing because many RNAs are represented in a single sequencing run
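The tag-space arithmetic can be checked directly. This small sketch counts the distinct sequences available to a tag of a given length over the four-letter DNA alphabet; `tag_space` is a hypothetical helper name.

```python
def tag_space(length: int, alphabet_size: int = 4) -> int:
    """Number of distinct sequences of this length over the alphabet."""
    return alphabet_size ** length


# Tag lengths in the 9-14 nt range mentioned above.
for length in range(9, 15):
    print(length, tag_space(length))
```

A 9-nt tag gives the 262,144 possibilities quoted above. Each extra nucleotide multiplies the space by four, which lowers the chance that two different genes share a tag.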

SAGE Technology

Steps to SAGE

– Copy mRNA into ds cDNA using biotinylated oligo(dT)
– Cleave with anchoring enzyme (AE), which cleaves within ~250 bp of the poly-A tail at the 3’ end
– Capture this segment on streptavidin beads
– Ligate to linkers containing a type IIS restriction site; the corresponding enzyme cleaves DNA 14 bp away from this site
– Ligate sequences to each other and PCR amplify
– Cleave with AE to remove linkers
– Concatenate, clone, and sequence

WHY DI-TAGS?

Ditags are used to detect bias in the PCR amplification step.

The probability of any two tags being coupled in the same ditag is small.

Biased amplification can be detected as many ditags always having the same 2 tags present.
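The ditag check can be sketched as a simple counting step: any ditag that shows up more than once is treated as a likely PCR duplicate and contributes its tags only once. The data below are invented for illustration and `collapse_ditags` is a hypothetical name. For simplicity, the sketch ignores that the same ditag can be read in either orientation.

```python
from collections import Counter


def collapse_ditags(ditags):
    """Return (unique ditags in first-seen order, ditags seen more than once)."""
    counts = Counter(ditags)
    repeated = {ditag: n for ditag, n in counts.items() if n > 1}
    return list(counts), repeated


# Hypothetical ditags: each pair holds the two tags found in one ditag.
ditags = [
    ("ATCTGAGTTC", "GCGCAGACTT"),
    ("ATCTGAGTTC", "TCCCCGTACA"),
    ("TAGGACGAGG", "GCGATGGCGG"),
    ("TAGGACGAGG", "GCGATGGCGG"),
    ("TAGGACGAGG", "GCGATGGCGG"),
]

unique, repeated = collapse_ditags(ditags)
# Count tags from unique ditags only, so amplification bias is not counted.
tag_counts = Counter(tag for pair in unique for tag in pair)

print(repeated)
print(tag_counts)
```

Here the tag ATCTGAGTTC is counted twice because it appears in two different ditags, while the thrice-repeated ditag contributes each of its tags only once.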

Velculescu et al. Science (1995)

[Figure: ditags flanked by Primer A and Primer B]

SAGE (continued)

Example of a concatemer:

CATGACCCACGAGCAGGGTACGATGATACATGGAAACCTATGCACCTTGGGTAGCACATG

TAG1 TAG2 TAG3 TAG4

Counting the tags:

Tag Sequence    Count
ATCTGAGTTC      1075
GCGCAGACTT      125
TCCCCGTACA      112
TAGGACGAGG      92
GCGATGGCGG      91
TAGCCCAGAT      83
GCCTTGTTTA      80
GCGATATTGT      66
TACGTTTCCA      66
TCCCGTACAT      66
TCCCTATTAA      66
GGATCACAAT      55
AAGGTTCTGG      54
CAGAACCGCG      50
GGACCGCCCC      48
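As a sketch of how tags might be pulled out of a concatemer and counted: split the sequence at the CATG sites, cut each ditag in half, and reverse-complement the second half so both tags read in the same direction. The concatemer string is a de-duplicated reading of the example above, and the equal split into two halves is an assumption (real ditags vary slightly in length). The names `revcomp` and `tags_from_concatemer` are hypothetical.

```python
from collections import Counter

ANCHOR = "CATG"  # site separating ditags in the example concatemer

# De-duplicated reading of the slide's example (an assumption).
CONCATEMER = (
    "CATG" "ACCCACGAGCAGGGTACGATGATA"
    "CATG" "GAAACCTATGCACCTTGGGTAGCA"
    "CATG"
)


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def tags_from_concatemer(concatemer: str, anchor: str = ANCHOR):
    """Split a concatemer into ditags, then each ditag into two tags."""
    tags = []
    for ditag in concatemer.split(anchor):
        if not ditag:
            continue  # empty pieces before the first and after the last site
        half = len(ditag) // 2
        tags.append(ditag[:half])
        tags.append(revcomp(ditag[half:]))
    return tags


tags = tags_from_concatemer(CONCATEMER)
print(tags)
print(Counter(tags).most_common())
```

Pooling the tags from many sequenced concatemers and sorting the counter by count gives a table like the one above.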

Proteomics

SDS PAGE

2D PAGE

MS/MS

An example SDS-PAGE

Protein stains: Silver, Copper, Coomassie Blue

How many proteins are in a band?

2D-PAGE

Dimension 1: Isoelectric focusing gel

Dimension 2: size

2D gel from macrophage phagosomes

Mass spectrometry

Mass spectrometers consist of three essential parts

– Ionization source: Converts peptides into gas-phase ions (MALDI + ESI)

– Mass analyzer: Separates ions by mass to charge (m/z) ratio (Ion trap, time of flight, quadrupole)

– Ion detector: Current over time indicates amount of signal at each m/z value

MS/MS Overview

A raw fragmentation spectrum

By calculating the molecular weight difference between ions of the same type, the sequence can be determined.

SEQUEST uses the fragmentation pattern to search through a complete protein database to identify the sequence which best fits the pattern.
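The mass-difference idea can be illustrated with an idealized ladder of singly charged b-ions. This sketch builds the ladder for an unmodified peptide and then reads the sequence back from the gaps between consecutive ions. The residue masses are standard monoisotopic values; the function names, the tolerance, and the choice of example peptide are assumptions for illustration. Real spectra are noisy and incomplete, which is why database search tools such as SEQUEST are used in practice.

```python
# Monoisotopic residue masses in daltons. I and L have the same mass,
# so both are reported here as "L".
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728  # mass of the charge-carrying proton


def b_ion_ladder(peptide: str):
    """Singly charged b-ion m/z values b1..bn for an unmodified peptide."""
    ladder, total = [], PROTON
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        ladder.append(total)
    return ladder


def read_sequence(ladder, tol: float = 0.01) -> str:
    """Infer residues from mass differences between consecutive b-ions."""
    residues = []
    previous = PROTON  # b1 minus the proton is the first residue mass
    for mz in ladder:
        diff = mz - previous
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(m - diff) <= tol]
        residues.append(matches[0] if len(matches) == 1 else "?")
        previous = mz
    return "".join(residues)


ladder = b_ion_ladder("EACDPLR")
print([round(mz, 3) for mz in ladder])
print(read_sequence(ladder))
```

Note that Q and K differ by only about 0.036 Da, so a looser tolerance than the one assumed here would make them ambiguous.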

Tandem Mass Spec (MS/MS)

Typical nanoelectrospray source

Isotope Coded Affinity Tags (ICAT)

[Figure: ICAT reagent structure with three parts: biotin tag, linker (d0 or d8), thiol-specific reactive group]

ICAT Reagents:
Heavy reagent: d8-ICAT (X = deuterium)
Normal reagent: d0-ICAT (X = hydrogen)

Mass spec based method for measuring relative protein abundances between two samples

Protein Quantification & Identification via ICAT Strategy

[Figure: Mixture 1 and Mixture 2 are labeled at cysteines with light and heavy ICAT reagent; the samples are combined and proteolyzed (trypsin); labeled peptides are isolated by affinity separation (avidin); quantitation uses the light/heavy peak pair (m/z axis 550-580); protein identification uses the fragment spectrum (m/z axis 200-800) of the peptide NH2-EACDPLR-COOH]

ICAT Flash animation: http://occawlonline.pearsoned.com/bookbind/pubbooks/bc_mcampbell_genomics_1/medialib/method/ICAT/ICAT.html

ICAT continued

The heavy (blue) and light (gray) peptides are separated and quantified to produce a ratio for each peptide; here, a single peptide ratio is shown

Each peptide is subjected to CID fragmentation in the second MS stage in order to identify it
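A rough sketch of the two numbers behind ICAT quantitation: where to look for the heavy partner of a light peak, and the abundance ratio itself. It assumes the heavy reagent carries eight deuteriums in place of hydrogens per labeled cysteine, as described above; the function names and the peak areas are hypothetical.

```python
# Mass of deuterium minus mass of hydrogen, in daltons.
DEUTERIUM_SHIFT = 2.014102 - 1.007825


def heavy_light_spacing(n_cysteines: int, charge: int,
                        n_deuterium: int = 8) -> float:
    """Expected m/z gap between d8- and d0-labeled forms of one peptide."""
    return n_cysteines * n_deuterium * DEUTERIUM_SHIFT / charge


def abundance_ratio(heavy_area: float, light_area: float) -> float:
    """Relative abundance (heavy sample / light sample) from peak areas."""
    return heavy_area / light_area


# One labeled cysteine, observed at charge 1 and at charge 2.
print(round(heavy_light_spacing(1, 1), 3))
print(round(heavy_light_spacing(1, 2), 3))

# Hypothetical integrated peak areas for one peptide pair.
print(abundance_ratio(3.0e6, 1.5e6))
```

A peptide with a single cysteine therefore shows a pair of peaks about 8 m/z units apart at charge 1 and about 4 apart at charge 2.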

Metabolomic measurements

2D NMR or mass spectrometry

Currently not global and in less widespread use than microarrays, but have tremendous potential

Gene knockout and RNAi libraries for model species

Example from yeast: Replacement of yeast ORFs with kanMX gene flanked by unique oligo barcodes
– Yeast Deletion Project Consortium

YFP tagging for protein localization

NIC96 Nuclear Pore

YFP is green, transmitted light is red

TUB1 Tubulin cytoskeleton

HHF2 Histone Nucleus

BNI4 Bud neck

Images courtesy T. Davis lab

See also recent work by Weissman and O’Shea labs at UCSF

Systematic phenotyping

Deletion strain:    yfg1       yfg2       yfg3
Barcode (UPTAG):    CTAACTC    TCGCGCA    TCATAAT

Growth 6 hrs in minimal media (how many doublings?)

Rich media

…

Harvest and label genomic DNA

Systematic phenotyping with a barcode array
Ron Davis and friends…

These oligo barcodes are also spotted on a DNA microarray

Growth time in minimal media:
– Red: 0 hours
– Green: 6 hours
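To make the "how many doublings?" question concrete, here is a minimal sketch. It assumes exponential growth and that barcode signal is proportional to cell number, so a strain's log2(6 h / 0 h) signal relative to wild type in the same pool equals the difference in the number of doublings. The doubling times are hypothetical values, not figures from the lecture, and the function names are made up.

```python
def doublings(hours: float, doubling_time_min: float) -> float:
    """Number of population doublings in the given time."""
    return hours * 60.0 / doubling_time_min


def expected_log2_ratio(hours: float, strain_dt_min: float,
                        wildtype_dt_min: float) -> float:
    """Expected log2(green/red) of a strain's barcode relative to wild type."""
    return doublings(hours, strain_dt_min) - doublings(hours, wildtype_dt_min)


# Hypothetical doubling times in minimal media.
WILDTYPE_DT = 120.0  # minutes
SLOW_DT = 180.0      # minutes, a deletion strain with a growth defect

print(doublings(6, WILDTYPE_DT))
print(expected_log2_ratio(6, SLOW_DT, WILDTYPE_DT))
print(expected_log2_ratio(6, WILDTYPE_DT, WILDTYPE_DT))
```

With these assumed numbers the wild type doubles three times in 6 hours, and the slow strain falls one doubling behind, so its barcode spot would look about two-fold weaker in green than in red relative to wild type.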

Molecular Interactions

Among proteins, mRNA, small molecules, and so on…

Protein→DNA interactions          ▲ Chromatin IP
Gene levels (on/off)              ▼ DNA microarray

Protein-protein interactions      ▲ Protein coIP
Protein levels (present/absent)   ▼ Mass spectrometry

Biochemical reactions             ▲ Not yet!!!
Biochemical levels                ▼ Metabolic flux measurements

Like sequence data, protein interaction data are also growing exponentially…

[Figure: DIP Database Growth, total interactions]

(As are the false positives!!!)

[Figure: EMBL Database Growth, total nucleotides (gigabases), 1980 to 2000, vertical axis 0 to 10]

High-throughput methods for measuring interaction networks

2-hybrid

co-immunoprecipitation w/ mass spec

chIP-on-chip

systematic genetic analysis

Yeast two-hybrid method

Fields and Song

Detection of protein interactions with antibody arrays

MacBeath and Schreiber

Kinase-target interactions

Mike Snyder and colleagues

High-throughput methods for measuring networks

2-hybrid

co-immunoprecipitation w/ mass spec

chIP-on-chip

systematic genetic analysis

Protein interactions by protein immunoprecipitation followed by mass spectrometry

Gavin / Cellzome

TEV = Tobacco Etch Virus proteolytic site

CBP = Calmodulin binding peptide

Protein A = IgG-binding protein from Staphylococcus

TAP purification

Image courtesy of Bertrand Seraphin




ChIP-chip measurement of protein→DNA interactions

From Figure 1 of Simon et al. Cell 2001




Genetic interactions: synthetic lethals and suppressors

Adapted from Tong et al., Science 2001

Genetic Interactions:

Widespread method used by geneticists to discover pathways in yeast, fly, and worm

Implications for drug targeting and drug development for human disease

Thousands are now reported in literature and systematic studies

As with other types, the number of known genetic interactions is exponentially increasing…

Most recorded genetic interactions are synthetic lethal relationships

Adapted from Hartman, Garvik, and Hartwell, Science 2001
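The two interaction types named above can be expressed as a simple rule on viability. This is a deliberately simplified sketch that treats each mutant as either viable or not; real screens score growth quantitatively. The function name and the three boolean inputs are hypothetical.

```python
def classify_interaction(a_viable: bool, b_viable: bool, ab_viable: bool) -> str:
    """Classify a gene pair from single- and double-mutant viability."""
    if a_viable and b_viable and not ab_viable:
        # Each single mutant lives, but the double mutant dies.
        return "synthetic lethal"
    if ab_viable and not (a_viable and b_viable):
        # A single mutant dies, but adding the second mutation rescues it.
        return "suppressor"
    return "no interaction called"


print(classify_interaction(True, True, False))
print(classify_interaction(False, True, True))
print(classify_interaction(True, True, True))
```

The first call describes a synthetic-lethal pair and the second a suppressor relationship; when the double mutant behaves as the single mutants predict, no interaction is called.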

[Figure: diagrams of gene pairs A and B illustrating a suppressor protein interaction and a synthetic-lethal protein interaction]

Interpretation of genetic interactions (Guarente T.I.G. 1990)

Parallel Effects (Redundant or Additive): Single A or B mutations typically abolish their biochemical activities

Sequential Effects (Additive): Single A or B mutations typically reduce their biochemical activities

GOAL: Identify downstream physical pathways
