
Functional Genomics Introduction

Julie A Dickerson

Electrical and Computer Engineering

Iowa State University

Module Structure: Day 1

Introduction to Functional Genomics

Transcriptomics

Analysis and Experiment Design for Microarray Data (Dr. Peng Liu)

RNA-Seq Data (Mr. Kun Liang)

LAB: Using R for normalizing and processing microarray data, and clustering analysis of ‘omics data (John Van Hemert)

June 15, 2010

BBSI - 2010


Module Structure: Day 2

Metabolomics (Dr. Ann Perera)

Proteomics (Dr. Young-Jin Lee)

Pathways and data integration methods (Dr. Julie Dickerson and Erin Boggess)

Lab: Analyzing integrated sets of microarray, proteomics and metabolomics data (Erin Boggess)


F1: Outline

Module Structure

What is Functional Genomics?

Data Types Available

Transcriptomics

  Basic biology behind microarrays

  What can you learn from microarrays?

  Types of arrays

  Limitations of microarrays


Functional Genomics Definition

Functional genomics is a field of molecular biology that attempts to make use of the data produced by genomic projects to describe gene (and protein) functions and interactions. Functional genomics focuses on the dynamic aspects such as gene transcription, translation, and protein-protein interactions, as opposed to the static aspects such as DNA sequence or structures.

From Wikipedia, the free encyclopedia

Genome Wide View of Metabolism

Streptococcus pneumoniae

Explore capabilities of global network

How do we go from a pretty picture to a model we can manipulate?

Metabolic Pathways

Metabolites: glucose

Enzymes: phosphofructokinase

Reactions & Stoichiometry: 1 F6P => 1 FBP

Kinetics

Regulation: gene regulation, metabolite regulation

hexokinase

phosphoglucoisomerase

phosphofructokinase

aldolase

triosephosphate isomerase

G3P dehydrogenase

phosphoglycerate kinase

phosphoglycerate mutase

enolase

pyruvate kinase

Metabolic Modeling: The Dream

June 11, 2009 BBSI - 2009

Data Types Available for Determining Function

Genomes: Sequence

Genes: Microarrays, Nextgen sequencing

Proteins: Proteomics

Metabolites: Metabolomics

Phenotypes: Phenomics


A VERY Simplified Eukaryotic Cell

nucleus

chromosome

DNA strands

DNA contains thousands of genes.

cytoplasm

Dan Nettleton, Department of Statistics, IOWA STATE UNIVERSITY,Copyright © 2008 Dan Nettleton


Posttranscriptional Modifications to Primary Transcript

Primary transcript: 5’ UTR, coding portions of RNA sequence corresponding to exons, intervening sequences corresponding to introns that are removed through splicing, 3’ UTR

Primary transcript after modification, messenger RNA (mRNA): 5’ cap (G), 5’ UTR, exons, 3’ UTR, poly-A tail (AAAAAA...AAAA)


Transcription takes place inside the nucleus.

nucleus

chromosome

DNA strands cytoplasm

Translation takes place outside the nucleus.


Translation

mRNA

Ribosome

amino acid sequence

folds to become a protein


During translation transfer RNA (tRNA) translates the genetic code

Figure: mRNA codons pair with tRNA anticodons. The tRNA with anticodon AAU carries leu and the tRNA with anticodon UGC carries thr, delivering amino acids in codon order.


The Genetic Code

Each entry gives an mRNA codon followed by its amino acid. Blocks are grouped by first base; within a block, columns run over the second base (U, C, A, G).

First base U:
UUU phe   UCU ser   UAU tyr    UGU cys
UUC phe   UCC ser   UAC tyr    UGC cys
UUA leu   UCA ser   UAA STOP   UGA STOP
UUG leu   UCG ser   UAG STOP   UGG trp

First base C:
CUU leu   CCU pro   CAU his    CGU arg
CUC leu   CCC pro   CAC his    CGC arg
CUA leu   CCA pro   CAA gln    CGA arg
CUG leu   CCG pro   CAG gln    CGG arg

First base A:
AUU ile   ACU thr   AAU asn    AGU ser
AUC ile   ACC thr   AAC asn    AGC ser
AUA ile   ACA thr   AAA lys    AGA arg
AUG met   ACG thr   AAG lys    AGG arg

First base G:
GUU val   GCU ala   GAU asp    GGU gly
GUC val   GCC ala   GAC asp    GGC gly
GUA val   GCA ala   GAA glu    GGA gly
GUG val   GCG ala   GAG glu    GGG gly
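As a rough illustration of how the genetic code table is used, the sketch below stores the 64 codons in a Python dictionary and reads an mRNA string three bases at a time. The function name `translate` and the example sequence are made up for this illustration, and real translation involves details (start-site selection, reading frames) that are ignored here.

```python
from itertools import product

BASES = "UCAG"
# Amino acids in table order: first base, then second base, then third base,
# each running U, C, A, G (the same content as the genetic code table).
AMINO_ACIDS = (
    "phe phe leu leu ser ser ser ser tyr tyr STOP STOP cys cys STOP trp "
    "leu leu leu leu pro pro pro pro his his gln gln arg arg arg arg "
    "ile ile ile met thr thr thr thr asn asn lys lys ser ser arg arg "
    "val val val val ala ala ala ala asp asp glu glu gly gly gly gly"
).split()
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Read an mRNA string codon by codon until a STOP codon or the end."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "STOP":
            break
        peptide.append(amino_acid)
    return peptide


# Hypothetical transcript fragment: AUG UUA ACG UAA
print(translate("AUGUUAACGUAA"))
```

The codons UUA and ACG in the example are the ones that pair with the anticodons AAU and UGC from the tRNA slide.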


Miscellaneous Comments

The biology is more complicated than I described.

Humans have somewhere around 30,000 genes. (The exact number is a subject for debate.) Regulation of these genes seems to be more important than their number!

Much of the variation is created by differences in how cells use the genes they have.

Microarrays are a tool that can help us understand how cells of various types use their genes in response to varying conditions.

04/21/23 BCB570 Gene Expression Data Analysis

Microarrays

With only a few exceptions, every cell of the body contains a full set of chromosomes and identical genes.

Only a fraction of these genes are turned on, however, and it is the subset that is "expressed" that confers unique properties to each cell type.

"Gene expression" is the term used to describe the transcription of the information contained within the DNA, the repository of genetic information, into messenger RNA (mRNA) molecules that are then translated into the proteins that perform most of the critical functions of cells.


Microarrays

Microarrays work by exploiting the ability of a given mRNA molecule (target) to bind specifically to, or hybridize to, the DNA template (probe) from which it originated.

Regulation of gene expression acts as both an "on/off" switch that controls which genes are expressed in a cell and a "volume control" that increases or decreases the level of expression of particular genes as necessary.

Source: The Genetic Science Learning Center, University of Utah


DNA Microarrays

Small, solid supports onto which the sequences from thousands of different genes are immobilized, or attached, at fixed locations.

The DNA is printed, spotted, or actually synthesized directly onto the support.

The spots themselves can be DNA, complementary DNA (cDNA, DNA synthesized from an mRNA template), or oligonucleotides (oligos, short fragments of single-stranded DNA that are typically 5 to 50 nucleotides long).


Why do microarray experiments?

Comparing two conditions to find differentially expressed genes: control/treatment, disease/normal

Compare more than two conditions, some of which may interact: different treatments, different strains

Exploratory analysis: What genes are expressed under drought stress?


Why use microarrays (cont)?

What happens over time? Developmental stages

Predicting certain conditions (cancer vs. normal)

Patterns of gene expression that characterize a patient’s or organism’s response


Differentially Expressed Genes

Find genes that show a large difference in expression between groups and are similar within a group

Statistical tests check whether the groups have different means (t-test) or different variances (chi-squared, F-statistics)
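To make the t-test idea concrete, here is a minimal sketch of a two-sample t statistic for one gene, using only the Python standard library. It uses Welch's form (unequal variances allowed), which is one common choice and not necessarily the one used in the lab; the expression values are invented, and turning the statistic into a p-value is left out.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic: difference in means over its standard error."""
    n_a, n_b = len(group_a), len(group_b)
    std_err = sqrt(variance(group_a) / n_a + variance(group_b) / n_b)
    return (mean(group_a) - mean(group_b)) / std_err


# Hypothetical log expression values for one gene
control = [7.1, 6.9, 7.3, 7.0]
treated = [9.0, 8.7, 9.2, 8.9]
print(welch_t(treated, control))
```

A large absolute value means the group means differ by much more than the within-group spread, which matches the slide's description of a differentially expressed gene.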

Adapted from “Practical Microarray Analysis”, Presentation by Benedikt Brors, German Cancer Research Center


Multiple Conditions

Are there differences in expression level between the k conditions?

Analysis of Variance (ANOVA)

Mutant 1: Inoculated, Control

Mutant 2: Inoculated, Control
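A small sketch of the ANOVA idea for a single gene follows. It computes the one-way F statistic, treating each condition as one group; for the mutant-by-inoculation layout above that means four groups, whereas a two-way model would separate the mutant and inoculation effects. The function name and data are illustrative assumptions.

```python
from statistics import mean


def one_way_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical log expression of one gene under k = 4 conditions
conditions = [
    [5.1, 5.3, 5.0],  # mutant 1, inoculated
    [4.9, 5.0, 5.2],  # mutant 1, control
    [7.8, 8.1, 8.0],  # mutant 2, inoculated
    [5.0, 5.2, 4.8],  # mutant 2, control
]
print(one_way_f(conditions))
```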


Some Example Microarray Experiments from Iowa State University

Jim Reecy from Animal Science: muscle undergoing hypertrophy vs. normal muscle

David Putthoff, Steve Rodermel, Thomas Baum from Plant Pathology: roots infected with soybean cyst nematodes vs. uninfected roots

Anne Bronikowski in Genetics: wheel-running mice vs. non-runners

Roger Wise, Rico Caldo in Plant Pathology: interaction between multiple isolates of powdery mildew and multiple genotypes of barley.


Wild-type vs. Myostatin Knockout Mice

Belgian Blue cattle have a mutation in the myostatin gene.


Identifying Genes Involved in Pathways That Distinguish Compatible from Incompatible Interactions

Barley genotypes: Mla6, Mla13, Mla1

Bgh isolates: 5874, K1

Each genotype-isolate combination is classified as compatible or incompatible (four incompatible and two compatible combinations).

Caldo, Nettleton, Wise (2004). The Plant Cell. 16, 2514-2528.


An Example Gene of Interest

Figure: log expression versus hours after inoculation, with separate curves for incompatible and compatible interactions.


Exploratory Analysis

Find patterns in data to see what genes are expressed under different conditions

Analysis includes clustering methods

Used when little or no prior knowledge exists about the problem

Copyright ©1999 by the National Academy of Sciences

Perou, Charles M. et al. (1999) Proc. Natl. Acad. Sci. USA 96, 9212-9217

Fig. 5 (see Supplemental data at http://www.pnas.org for the full cluster diagram with all gene names)


Time Series

Goal: find patterns of co-expressed genes over time or partial time

Typical length is 3-10 time points

Cluster to find similar patterns (k-means, self-organizing maps)

Correlations to find genes that behave like a given gene of interest.

0 hours 4 hours 12 hours 24 hours
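The correlation approach can be sketched as follows: compute the Pearson correlation between the time profile of a gene of interest and every other gene, then rank the genes. The gene names and expression values are hypothetical, and Pearson correlation is only one of several similarity measures that could be used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def most_similar(target, profiles):
    """Rank all other genes by correlation with the target gene, highest first."""
    reference = profiles[target]
    scores = {
        gene: pearson(reference, profile)
        for gene, profile in profiles.items()
        if gene != target
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical log expression at 0, 4, 12 and 24 hours
profiles = {
    "geneA": [1.0, 2.0, 4.0, 3.0],
    "geneB": [0.5, 1.1, 2.1, 1.6],
    "geneC": [4.0, 3.0, 1.0, 2.0],
    "geneD": [2.0, 2.1, 1.9, 2.0],
}
for gene, r in most_similar("geneA", profiles):
    print(gene, round(r, 3))
```

With so few time points, high correlations can arise by chance, so rankings like this are a starting point for follow-up rather than evidence on their own.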


Classification

Learn characteristic patterns from a training set and evaluate with a test set.

Classify tumor types based on expression patterns

Predict disease susceptibility, stages, etc.
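The train-then-test workflow can be illustrated with a very simple classifier. The sketch below uses a nearest-centroid rule, chosen here only because it is short; the slides do not name a specific method, and the samples and labels are invented.

```python
def centroid(rows):
    """Per-gene mean across a list of expression vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def train(samples):
    """samples: list of (expression_vector, label). Returns label -> centroid."""
    by_label = {}
    for vector, label in samples:
        by_label.setdefault(label, []).append(vector)
    return {label: centroid(rows) for label, rows in by_label.items()}


def predict(model, vector):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(vector, center))
    return min(model, key=lambda label: squared_distance(model[label]))


def accuracy(model, samples):
    """Fraction of held-out samples whose predicted label matches the true one."""
    hits = sum(predict(model, vector) == label for vector, label in samples)
    return hits / len(samples)


# Hypothetical expression of three genes per sample
training_set = [
    ([8.0, 2.0, 5.0], "tumor"), ([7.5, 2.5, 5.2], "tumor"),
    ([3.0, 6.0, 5.1], "normal"), ([3.5, 6.5, 4.9], "normal"),
]
test_set = [([7.8, 2.2, 5.0], "tumor"), ([3.2, 6.1, 5.0], "normal")]

model = train(training_set)
print(accuracy(model, test_set))
```

Keeping the test set out of training is the point of the exercise: accuracy measured on the training samples themselves would be optimistic.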

Source: “Practical Microarray Analysis”, Presentation by Benedikt Brors, German Cancer Research Center

Some Commonly Used Tools for Microarray Analysis

Oligonucleotide arrays

Affymetrix GeneChips

Nimblegen

Agilent


Oligonucleotides

An oligonucleotide is a short sequence of nucleotides (oligonucleotide = oligo for short).

An oligonucleotide microarray is a microarray whose probes consist of synthetically created DNA oligonucleotides.

Probe sequences are chosen to have good and relatively uniform hybridization characteristics.

A probe is chosen to match a portion of its target mRNA transcript that is unique to that sequence.

Oligo probes can distinguish among multiple mRNA transcripts with similar sequences.
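One way to picture "a portion that is unique to that sequence" is to list the length-k substrings of a target transcript that do not occur in any other transcript. The sketch below does exactly that with plain string matching; the function name and toy sequences are assumptions, and real probe design also weighs hybridization behavior and near matches, which this ignores.

```python
def unique_probes(target, others, k=25):
    """Return the k-mers of `target` that occur in none of the `others`."""
    probes = []
    for start in range(len(target) - k + 1):
        kmer = target[start:start + k]
        if all(kmer not in sequence for sequence in others):
            probes.append(kmer)
    return probes


# Toy example with k = 4: the two sequences share their first eight bases
gene_1 = "AAAACCCCGGGG"
gene_2 = "AAAACCCCTTTT"
print(unique_probes(gene_1, [gene_2], k=4))
```

Only windows that reach into the region where the two sequences differ are reported, which is why oligo probes can separate transcripts that are similar over most of their length.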


Simplified Example

gene 1

gene 2

Shared green regions indicate a high degree of sequence similarity throughout much of the transcript.

Oligo probe for gene 1: ATTACTAAGCATAGATTGCCGTATA

Oligo probe for gene 2: GCGTATGGCATGCCCGGTAAACTGG

Source: Dan Nettleton Course Notes Statistics 416/516X


Oligo Microarray Fabrication

Oligos can be synthesized and stored in solution.

Oligo sequences can be synthesized on a slide or chip using various commercial technologies.

The company Affymetrix uses a photolithographic approach which we will describe briefly.


Affymetrix GeneChips

Affymetrix (www.affymetrix.com) manufactures GeneChips.

GeneChips are oligonucleotide arrays.

Each gene (more accurately sequence of interest or feature) is represented by multiple short (25-nucleotide) oligo probes.

Some GeneChips include probes for around 120,000 genes and gene variants.

mRNA that has been extracted from a biological sample can be labeled (dyed) and hybridized to a GeneChip.

Only one sample is hybridized to each GeneChip.


Different Probe Pairs Represent Different Parts of the Same Gene

gene sequence

Probes are selected to be specific to the target gene and have good hybridization characteristics.

Source: Dan Nettleton Course Notes Statistics 416/516X


Affymetrix Probe Sets

A probe set is used to measure mRNA levels of a single gene.

Each probe set consists of multiple probe cells.

Each probe cell contains millions of copies of one oligo.

Each oligo is intended to be 25 nucleotides in length.

Probe cells in a probe set are arranged in probe pairs.

Each probe pair contains a perfect match (PM) probe cell and a mismatch (MM) probe cell.

A PM oligo perfectly matches part of a gene sequence.

A MM oligo is identical to a PM oligo except that the middle nucleotide (13th of 25) is replaced by its complementary nucleotide.
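The PM-to-MM rule is simple enough to write down directly. The following sketch builds a mismatch oligo by complementing the 13th base of a 25-nucleotide perfect-match oligo; the function name is made up, and the example PM sequence is the one shown in the probe set figure of these slides.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm_oligo):
    """Return the MM oligo: the PM oligo with its middle base complemented."""
    if len(pm_oligo) != 25:
        raise ValueError("expected a 25-nucleotide PM oligo")
    middle = 12  # 0-based index of the 13th nucleotide
    return pm_oligo[:middle] + COMPLEMENT[pm_oligo[middle]] + pm_oligo[middle + 1:]


pm = "AATGGGTCAGAAGGACTCCTATGTG"
print(mismatch_probe(pm))
```

Because complementing is its own inverse, applying the function to the MM oligo gives back the PM oligo.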


A Probe Set for Measuring Expression Level of a Particular Gene

Gene sequence: ...TGCAATGGGTCAGAAGGACTCCTATGTGCCT...

Perfect match sequence: AATGGGTCAGAAGGACTCCTATGTG

Mismatch sequence: AATGGGTCAGAACGACTCCTATGTG

Figure labels: probe cell, probe pair, probe set


Affymetrix’s Photolithographic Approach

Figure: successive masks applied to the GeneChip surface control where each nucleotide (A, C, G, T) is added as the oligos are built up.


Source: www.affymetrix.com

Image from Hybridized GeneChip


Image Processing for Affymetrix GeneChips

Image processing for Affymetrix GeneChips is typically done using proprietary Affymetrix software.

The entire surface of a GeneChip is covered with square-shaped cells containing probes.

Probes are synthesized on the chip in precise locations.

Thus spot finding and image segmentation are not major issues.


Probe Cell

8 x 8 = 64 pixels; border pixels excluded

The 75th percentile of the 36 pixel intensities corresponding to the center 36 pixels is used to quantify fluorescence intensity for each probe cell.

These values are called PM values for perfect-match probe cells and MM values for mismatch probe cells.

The PM and MM values are used to compute expression measures for each probe set.
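The probe cell summary can be sketched in a few lines: drop the border of the 8 x 8 grid, sort the 36 center pixels, and take the 75th percentile. The percentile here is computed by linear interpolation between sorted values, which is an assumption; the proprietary Affymetrix software may define the percentile differently.

```python
def probe_cell_intensity(cell):
    """cell: 8 x 8 grid given as a list of 8 rows of 8 pixel intensities."""
    # Keep only the center 6 x 6 block, excluding the border pixels.
    center = sorted(cell[r][c] for r in range(1, 7) for c in range(1, 7))
    # 75th percentile by linear interpolation (assumed definition).
    position = 0.75 * (len(center) - 1)
    lower = int(position)
    upper = min(lower + 1, len(center) - 1)
    return center[lower] + (position - lower) * (center[upper] - center[lower])


# Hypothetical cell: bright border pixels (1000) around center values 1..36
values = list(range(1, 37))
cell = (
    [[1000] * 8]
    + [[1000] + values[6 * i:6 * i + 6] + [1000] for i in range(6)]
    + [[1000] * 8]
)
print(probe_cell_intensity(cell))
```

In the example the border pixels are deliberately extreme to show that they have no influence on the result.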


Normalization

Outputs from each individual probe pair are statistically combined to give an expression level for the gene represented by the probe set.

Normalization accounts for background noise on the chip, levels of control probes, etc.

Key methods are MAS5.0, RMA, GCRMA
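As one concrete example of a normalization step, the sketch below implements quantile normalization, which RMA applies across chips after background correction. This is a simplified stand-alone version (ties are broken by position, and no background correction or probe set summarization is done), so it should not be read as a reimplementation of RMA, MAS5.0 or GCRMA; the function name and data are illustrative.

```python
def quantile_normalize(arrays):
    """arrays: list of equal-length intensity lists, one list per chip."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean across chips at each rank.
    reference = [
        sum(s[rank] for s in sorted_arrays) / len(arrays) for rank in range(n)
    ]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # indices, lowest first
        out = [0.0] * n
        for rank, index in enumerate(order):
            out[index] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical intensities for three probes on two chips
chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(quantile_normalize(chips))
```

After this step every chip has the same set of intensity values, while each probe keeps its rank within its own chip.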

Summary of Microarrays

Positives: commercial chips are accurate and repeatable in experienced hands, and the statistics and modeling have been well explored.

Negatives: cost; you can only see what is on the chip; and chips are difficult to update with new knowledge.

June 11, 2007 BBSI - 2007

Short Read Sequencing

Sequencing technology has evolved in the last 15 years

Eventual goal is to be able to sequence a genome for $1000 (NIH).

Why not just sequence the transcriptome directly and see what is there?


Sequencing by synthesis (454)

Takes a single strand of DNA and synthesizes its complementary strand enzymatically one base at a time, detecting which base was actually added at each step.

Pyrosequencing detects the activity of DNA polymerase with a chemiluminescent enzyme.

Reads are about 400-500 bp


Other Technologies

Illumina Solexa: 40-100 bp, tag DNA or RNA at both ends

ABI SOLiD: around 50 bp

Digital Gene Expression

Sequence census methods for functional genomics, Barbara Wold & Richard M Myers

 
