bioinformatics - genetic programminggpbib.cs.ucl.ac.uk/gecco2007/docs/p3435.pdf · 2007-06-22 ·...


1

Jason H. Moore, Ph.D.
Frank Lane Research Scholar in Computational Genetics
Associate Professor of Genetics
Adjunct Associate Professor of Biological Sciences
Adjunct Associate Professor of Community and Family Medicine
Dartmouth College

Affiliate Associate Professor of Computer Science
University of New Hampshire

Adjunct Associate Professor of Computer Science
University of Vermont

www.epistasis.org
[email protected]

© Jason H. Moore

Bioinformatics

2

Genotype -> Phenotype

3

Nucleus

DNA

4

DNA
Alberts et al. 2002

GECCO 2007 Tutorial / Bioinformatics

Copyright is held by the author/owner(s). GECCO’07, July 7–11, 2007, London, England, United Kingdom. ACM 978-1-59593-698-1/07/0007. 3435


5

Chromosome Structure
Alberts et al. 2002

6

Transcription
Alberts et al. 2002

7

Regulation of Transcription
Alberts et al. 2002

8

Translation
Alberts et al. 2002


9

Alberts et al. 2002

10

Alberts et al. 2002

Control of Gene Expression

11

12

Alberts et al. 2002

Protein Folding


13

Alberts et al. 2002

Amino Acids

14

Brown 2002

Amino Acids → Peptides → Proteins

15 16


17 18

Gene Networks

19

Protein Networks
Barabasi, Scientific American (2003)

20

Genetic Architecture
Sing et al., Arter. Thromb. Vasc. Biol. (2003)


21

The Tree of Life
http://tolweb.org/tree/

22

Molecular Phylogenetics
Brown 2002

23

Measuring DNA

24

July 3, 2000


25 26

Transposon Repeat Polymorphism

27

SNPs: Single Nucleotide Polymorphisms

Subject #1:  -- AGGTCA --   -- AGGTCA --
Subject #2:  -- AGGTCA --   -- AGCTCA --
Subject #3:  -- AGCTCA --   -- AGCTCA --

Two alleles (G and C)

Three genotypes (GG, GC, CC)
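The three-subject example above can be worked through in a few lines of Python. This is only an illustrative sketch: the sequences are copied from the slide, while the helper name `variable_positions` and the data layout (two sequence copies per subject) are assumptions made for the example.

```python
from collections import Counter

# Each subject carries two copies of the locus (one per chromosome),
# copied from the slide's example.
subjects = {
    "Subject #1": ("AGGTCA", "AGGTCA"),
    "Subject #2": ("AGGTCA", "AGCTCA"),
    "Subject #3": ("AGCTCA", "AGCTCA"),
}

def variable_positions(seqs):
    """Return indices where the aligned sequences are not all identical."""
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]

all_copies = [seq for pair in subjects.values() for seq in pair]
snp_sites = variable_positions(all_copies)  # only one site varies in this example
site = snp_sites[0]

# A genotype is the unordered pair of alleles at the SNP site.
genotypes = {
    name: "".join(sorted((a[site], b[site]), reverse=True))
    for name, (a, b) in subjects.items()
}

# Count how often each allele appears across all six sequence copies.
allele_counts = Counter(seq[site] for seq in all_copies)

print(snp_sites, genotypes, dict(allele_counts))
```

The single variable site has two alleles (G and C), and the three subjects carry the three possible genotypes GG, GC, and CC.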

28

SNPs: Single Nucleotide Polymorphisms

• ~1 SNP every 100 bp
• ~30 million SNPs
• ~500,000 SNPs in coding DNA
  – Synonymous (silent)
  – Nonsynonymous
    • Deleterious effect
    • Beneficial effect
    • No effect


29 30

Measuring RNA

31

cDNA Microarrays
Mol Interv. 2002

32


33

Process Flow for Microarrays

34

Mol Interv. 2002

35

Measuring Proteins

36

2D Gels and Mass Spectrometry

Metabolic Engineering 4, 98–106 (2002)


37

Protein Profiling in Tissues
Am J Pathol. 2004

38

Protein Profiling in Tissues
Am J Pathol. 2004

39

Tissue Microarrays
Nat Rev Drug Discov. 2003

40

Emerging Technologies


41

Nanotechnology

42

Nanotechnology

Nature Reviews Cancer 2005

43

Lab-on-a-Chip
Current Opinion in Structural Biology 2003

44

Databases


45

Bioinformatics: Databases

101100010010

46

http://www.ncbi.nlm.nih.gov

47

http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi

48

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed


49

http://www.ncbi.nlm.nih.gov/geo/

50

www.pharmgkb.org

51

http://genome-www5.stanford.edu/

52

http://bioinformatics.icmb.utexas.edu/OPD/


53

Analysis

54

Mining Biomolecular Patterns

• Can we classify and/or predict biological and clinical endpoints using genetic, genomic, and/or proteomic data?

• Which biomolecules are important?

• What is their pattern or statistical relationship?

© Jason H. Moore

55

Objectives

[Figure: analysis workflow with the stages Data, Variable Selection, Statistical Modeling (l = a1x1 + a2x2), and Prediction/Classification (Tumor A vs. Tumor B), plus an Iterate loop]



57

Hypothesis Testing
“The truth is out there”

                       Truth
Decision       Ho False         Ho True
Reject Ho      Yes!             Type I Error
Accept Ho      Type II Error    Yes!

• Type I error
• Type II error
• Type III error

58

Cross-Validation (CV)

• Data-driven methods susceptible to overfitting
• Biological datasets often have more variables than observations (i.e. wide data)
• The value of any statistical model is its ability to make predictions in new data
• Cross-validation (CV) allows generalizability to be estimated in a single dataset

59

Cross-Validation (CV)

• CV uses independent portions of the data to estimate the testing accuracy

© Jason H. Moore

60

Cross-Validation (CV)
Hastie et al. The Elements of Statistical Learning (2001)
Ripley BD. Pattern Recognition and Neural Networks (1996)

• Leave One Out Cross Validation (LOOCV)
  – Better for small datasets
  – Unbiased estimate of prediction error
  – High variance due to similarity of training sets
• n-fold CV (e.g. 5-fold or 10-fold CV)
  – Better for larger datasets
  – Estimate of prediction error may be biased
  – Lower variance
  – May need to repeat several times and average results
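A minimal sketch of n-fold CV and LOOCV, written with only the Python standard library. The toy nearest-mean classifier, the data values, and all function names here are assumptions chosen for illustration; they are not taken from the tutorial. LOOCV falls out as the special case where the number of folds equals the number of observations.

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nearest_mean_fit(xs, ys):
    """Toy classifier: store the mean of x for each class label."""
    means = {}
    for label in set(ys):
        vals = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(vals) / len(vals)
    return means

def nearest_mean_predict(means, x):
    """Predict the label whose stored mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))

def cv_accuracy(xs, ys, k, seed=0):
    """Average held-out accuracy over k folds; k == len(xs) gives LOOCV."""
    accs = []
    for test in kfold_indices(len(xs), k, seed):
        held_out = set(test)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = nearest_mean_fit([xs[i] for i in train], [ys[i] for i in train])
        hits = sum(nearest_mean_predict(model, xs[i]) == ys[i] for i in test)
        accs.append(hits / len(test))
    return sum(accs) / len(accs)

# Hypothetical one-variable dataset with two well-separated classes.
xs = [0.1, 0.4, 0.35, 0.8, 1.9, 2.2, 2.05, 1.7, 0.2, 2.4]
ys = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]

acc_5fold = cv_accuracy(xs, ys, k=5)
acc_loocv = cv_accuracy(xs, ys, k=len(xs))
print(acc_5fold, acc_loocv)
```

Repeating `cv_accuracy` with different `seed` values and averaging corresponds to the slide's advice to repeat n-fold CV several times.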


61

Cross-Validation Consistency (CVC)
Ritchie et al., American Journal of Human Genetics 69:138-147 (2001)
Moore et al., Genetic Epidemiology 23:57-69 (2002)
Moore, Lecture Notes in Computer Science 2611, Springer-Verlag, Berlin (2003).

• CV can be difficult with data-driven methods
• Can find different models with each CV dataset
• CVC is a measure of how consistently particular variables or combinations of variables are identified in each CV interval.
• CVC can be used as a measure of association
• Once important variables are found, a model can be fit to the entire dataset using just those variables.

62

Cross-Validation Consistency (CVC)
Ritchie et al., American Journal of Human Genetics 69:138-147 (2001)
Moore et al., Genetic Epidemiology 23:57-69 (2002)
Moore, Lecture Notes in Computer Science 2611, Springer-Verlag, Berlin (2003).

CVC Example with 5-fold CV

Data Interval    Genes Identified in Best Model
1                A,C,E
2                A,C,H
3                A,B,C,F
4                A,C,D
5                A,G,H

Significant Genes: A and C
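The CVC tally for the 5-fold example can be reproduced by counting how many CV intervals select each gene. In this sketch the gene sets come from the slide, but the cutoff of 4 out of 5 intervals is an assumption picked so that the result matches the slide's "A and C"; the tutorial does not state a threshold.

```python
from collections import Counter

# Genes identified in the best model of each of the five CV intervals (from the slide).
best_models = [
    {"A", "C", "E"},
    {"A", "C", "H"},
    {"A", "B", "C", "F"},
    {"A", "C", "D"},
    {"A", "G", "H"},
]

# CVC of a gene = number of CV intervals whose best model includes it.
cvc = Counter(gene for model in best_models for gene in model)

# Assumed cutoff: keep genes found in at least 4 of the 5 intervals.
threshold = 4
consistent = sorted(gene for gene, count in cvc.items() if count >= threshold)

print(dict(cvc), consistent)
```

Gene A appears in all five intervals and gene C in four, while every other gene appears at most twice.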

63

Generalizability: The Three-Way Data Split
Rowland, Lecture Notes in Computer Science 2611, Springer-Verlag, Berlin (2003).

Train         Test          Generalize
Generalize    Train         Test
Test          Generalize    Train

1. Choose the model that minimizes abs( Etrain - Etest )
2. Evaluate generalizability of the final model
3. The model with the best prediction error may not generalize the best.
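A small sketch of the selection rule in step 1 and of the rotation of roles across the three data partitions. The error values, model names, and partition names are hypothetical placeholders, and the rotation order follows the grid as read from the slide.

```python
def three_way_rotations(parts):
    """Yield role assignments for three data partitions, rotating the roles each time."""
    roles = ["train", "test", "generalize"]
    for shift in range(3):
        yield {roles[(i - shift) % 3]: parts[i] for i in range(3)}

# Hypothetical training and testing errors for three candidate models.
candidates = [
    {"name": "model_1", "e_train": 0.10, "e_test": 0.30},
    {"name": "model_2", "e_train": 0.20, "e_test": 0.22},
    {"name": "model_3", "e_train": 0.05, "e_test": 0.40},
]

# Step 1: pick the model whose training and testing errors agree most closely.
chosen = min(candidates, key=lambda m: abs(m["e_train"] - m["e_test"]))

rotations = list(three_way_rotations(["part_1", "part_2", "part_3"]))
print(chosen["name"], rotations)
```

Note that `model_3` has the lowest training error but the largest gap between training and testing error, so it is not the one chosen. The chosen model would then be scored once on the held-back "generalize" partition (step 2).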

64

Permutation Testing
P. Good, Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses (2000)

• Many data-driven methods are nonparametric and model-free.
• Permutation testing can be used to assess statistical significance to allow formal hypothesis testing.
• Basic Idea: Randomize data so it is consistent with null hypothesis.


65

Permutation Testing
P. Good, Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses (2000)

Example

Original data        After permuting Y
Y   X1   X2          Y   X1   X2
1   AA   Bb          0   AA   Bb
1   Aa   BB          1   Aa   BB
1   aa   Bb          0   aa   Bb
0   AA   Bb          0   AA   Bb
0   Aa   bb          1   Aa   bb
0   aa   bb          1   aa   bb

Permute Y, calculate statistic Z, repeat.

66

Permutation Testing
P. Good, Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses (2000)

[Figure: distribution of the statistic Z under the null hypothesis from many permutations, with the critical region α marked]
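The permute, calculate, repeat loop can be sketched on the six-subject example from the previous slide. The choice of statistic (number of B alleles at X2 among subjects with Y = 1), the one-sided comparison, and the number of permutations are assumptions made for illustration; the tutorial leaves the statistic Z unspecified.

```python
import random

# X2 genotypes and outcomes Y from the slide's example table.
x2 = ["Bb", "BB", "Bb", "Bb", "bb", "bb"]
y = [1, 1, 1, 0, 0, 0]
b_count = [g.count("B") for g in x2]  # number of B alleles per subject

def statistic(labels, values):
    """Total B-allele count among subjects labeled 1 (an assumed test statistic)."""
    return sum(v for lab, v in zip(labels, values) if lab == 1)

def permutation_p_value(labels, values, n_perm=10000, seed=0):
    """Estimate a one-sided p-value by repeatedly shuffling the labels."""
    rng = random.Random(seed)
    observed = statistic(labels, values)
    shuffled = list(labels)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permuting Y breaks any real Y-X2 association
        if statistic(shuffled, values) >= observed:
            extreme += 1
    return observed, extreme / n_perm

observed, p_value = permutation_p_value(y, b_count)
print(observed, p_value)
```

With only six subjects there are 20 distinct ways to assign three cases, and 3 of them give a statistic at least as large as the observed one, so the estimate should settle near 0.15. A dataset this small cannot reach a conventional critical region such as α = 0.05.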

67

Bootstrapping
B. Efron, Annals of Statistics 7:1-26 (1979)
AC Davison and DV Hinkley, Bootstrap Methods and their Application (1997)
CE Lunneborg, Data Analysis by Resampling: Concepts and Applications (2000)

[Figure: two panels. Distribution Known: Population, Random Sample, Inference. Distribution Unknown: Population, Random Sample, Random Samples, Inference.]
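For the "distribution unknown" case, resampling the observed sample with replacement stands in for drawing new samples from the population. The following is a minimal percentile-bootstrap sketch; the data values, the choice of the mean as the statistic, and the function name `bootstrap_ci` are assumptions for illustration.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic of the sample."""
    rng = random.Random(seed)
    n = len(sample)
    # Each bootstrap replicate resamples n observations with replacement.
    replicates = sorted(stat(rng.choices(sample, k=n)) for _ in range(n_boot))
    lower = replicates[int((alpha / 2) * n_boot)]
    upper = replicates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical measurements.
sample = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.9, 6.2, 3.9, 5.0, 4.4]

lower, upper = bootstrap_ci(sample)
print(statistics.mean(sample), (lower, upper))
```

The percentile interval is the simplest variant; other bootstrap intervals exist and may behave better for small or skewed samples.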

68

Genetic Programming


69

Is GP an appropriate computational tool for solving complex biological problems?

70

Vanilla GP - NO!

GP

71

Modern GP - YES!

GP

72

Case Study: Symbolic Modeling


73

Genetic Architecture
Sing et al., Arter. Thromb. Vasc. Biol. (2003)

?

74

Genetic Analysis of Atrial Fibrillation
Tsai et al., Circulation (2004)
Moore et al., Journal of Theoretical Biology (2006)

• 250 consecutive patients admitted with AF from Taipei, Taiwan
• 250 age and gender matched controls
• 3 candidate genes
  – Angiotensin converting enzyme (ACE) gene
    • I/D polymorphism
  – Angiotensinogen (AGT)
    • T174M, M235T, G-6A, A-20C, G-152A, and G-217A polymorphisms
  – Angiotensin II type I receptor (AT-1)
    • A1166C polymorphism

75

Renin-Angiotensin System
Tsai et al., Circulation (2004)
Moore et al., Journal of Theoretical Biology (2006)

76

Symbolic Discriminant Analysis
Moore et al., In: De Raedt, L., Flach, P. (eds) Lecture Notes in Artificial Intelligence 2167 (2001)
Moore et al., Genetic Epidemiology 23, 57-69 (2002)
Moore, Lecture Notes in Computer Science 2611, Springer-Verlag, Berlin (2003)

• Supervised classification approach
• Can use GP to build discriminant functions
• Accuracy is fitness function

[Figure: expression tree with root +, whose children are X1 and *, where * has children X2 and X3, giving Y = X1 + X2 * X3; alongside, frequency distributions of symbolic discriminant scores for classes A and B]
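To make the discriminant-function idea concrete, the sketch below evaluates the slide's example tree Y = X1 + X2 * X3 on a few made-up observations and scores it by classification accuracy, the quantity a GP would use as fitness. The nested-tuple tree encoding, the data, and the midpoint-of-class-means threshold are all assumptions for illustration, not the published SDA procedure.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

# The slide's example tree, Y = X1 + X2 * X3, as nested tuples (an assumed encoding).
tree = ("+", "X1", ("*", "X2", "X3"))

def evaluate(node, row):
    """Recursively evaluate an expression tree on one observation."""
    if isinstance(node, str):
        return row[node]
    op, left, right = node
    return OPS[op](evaluate(left, row), evaluate(right, row))

def accuracy(tree, rows, labels):
    """Fitness: fraction classified correctly, thresholding at the midpoint of class means."""
    scores = [evaluate(tree, r) for r in rows]
    mean_a = sum(s for s, l in zip(scores, labels) if l == "A") / labels.count("A")
    mean_b = sum(s for s, l in zip(scores, labels) if l == "B") / labels.count("B")
    threshold = (mean_a + mean_b) / 2
    high, low = ("B", "A") if mean_b > mean_a else ("A", "B")
    predicted = [high if s > threshold else low for s in scores]
    return sum(p == t for p, t in zip(predicted, labels)) / len(labels)

# Hypothetical observations for classes A and B.
rows = [
    {"X1": 0, "X2": 1, "X3": 0},
    {"X1": 1, "X2": 0, "X3": 2},
    {"X1": 0, "X2": 1, "X3": 1},
    {"X1": 1, "X2": 2, "X3": 1},
    {"X1": 2, "X2": 1, "X3": 2},
    {"X1": 0, "X2": 2, "X3": 2},
]
labels = ["A", "A", "A", "B", "B", "B"]

fitness = accuracy(tree, rows, labels)
print([evaluate(tree, r) for r in rows], fitness)
```

In a GP run, many such trees would be generated and recombined, with `accuracy` guiding selection.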


77

SDA Modeling using Modern GP
Moore et al., Human Heredity (2007)

• Parameter Sweeps
  – STEP 1: Full factorial experimental design
  – STEP 2: ANOVA
• Coarse-Grained Search
  – STEP 3: 105 SDA runs
• Expert Knowledge
  – STEP 4: Statistical model of 100 best trees
• Fine-Grained Knowledge-Based Search
  – STEP 5: 109 SDA runs using EDA
• Model Interpretation
  – STEP 6: Function mapping
  – STEP 7: Interaction dendrogram

78

Cross-Validation Strategy
Moore et al., Human Heredity (2007)

[Figure: cross-validation strategy; surviving labels: 20, 20, 1]

79

Parameter Sweep
Moore et al., Human Heredity (2007)

• STEP 1: Full factorial experimental design
  – Population Size (P): {100, 500, 1000}
  – Generations (G): {100, 500, 1000}
  – Tree Depth (T): {1, 2, 3}
  – Function Set (F):
    • 1: {+, -, *, /}
    • 2: {<, >, <=, >=, =, !=, max, min}
    • 3: sets 1 and 2
    • 4: {AND, OR, NOT, NOR, XOR}
    • 5: sets 1 and 4
    • 6: sets 2 and 4
    • 7: sets 1, 2, and 4
  – 3*3*3*7 = 189 level combinations
  – 189 level combinations * 10 random seeds = 1890 SDA runs
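The full factorial design can be enumerated directly, which also confirms the run counts on the slide. The variable names below are assumptions; the function sets are referred to only by their index (1 to 7), and the seed values 0 to 9 are placeholders for the 10 random seeds.

```python
from itertools import product

population_sizes = [100, 500, 1000]
generations = [100, 500, 1000]
tree_depths = [1, 2, 3]
function_sets = [1, 2, 3, 4, 5, 6, 7]  # indices of the seven function sets
seeds = range(10)                      # placeholder seed values

# Every combination of the four factor levels.
grid = list(product(population_sizes, generations, tree_depths, function_sets))

# One SDA run per level combination per random seed.
runs = [(p, g, t, f, s) for (p, g, t, f) in grid for s in seeds]

print(len(grid), len(runs))
```

Each tuple in `runs` would parameterize one SDA run, whose accuracy then feeds the ANOVA in STEP 2.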

80

Parameter Sweep
Moore et al., Human Heredity (2007)

• STEP 2: 4-Way Analysis of Variance (ANOVA)

Treatment      DF    F         P-Value
Depth          2     715.6     <0.001
Function       6     1028.2    <0.001
Generations    2     1.8       0.161
Population     2     10.7      <0.001
D*F*P          24    2.4       0.001


81

Parameter Sweep
Moore et al., Human Heredity (2007)

[Figure: parameter sweep result; P = 0.001]

82

Coarse-Grained Search
Moore et al., Human Heredity (2007)

• STEP 3: 105 SDA runs

83

Generating Expert Knowledge
Moore et al., Human Heredity (2007)

• STEP 4: Statistical modeling of 100 best trees

84

Fine-Grained Search
Moore et al., Human Heredity (2007)

• STEP 5: 109 SDA runs using an EDA

Accuracy = 0.644


85

Fine-Grained Search
Moore et al., Human Heredity (2007)

• STEP 5: 109 SDA runs using an EDA

[Figure with node labels I-1, II-1, II-2, III-1, III-2, IV-1, IV-2, IV-3, and V-1; Accuracy = 0.644]

86

Interpretation
Moore et al., Human Heredity (2007)

• STEP 6: Function Mapping

87

Interpretation
Moore et al., Human Heredity (2007)
Moore et al., Journal of Theoretical Biology (2006)

• STEP 7: Interaction Dendrogram

88

Interpretation
Moore et al., Human Heredity (2007)


89

Interpretation
Moore, Nature Genetics (2005)
Moore and Williams, BioEssays (2005)

90

Modern GP - YES!

GP

91

Additional Examples

• Moore, J.H., White, B.C. Genome-wide genetic analysis using genetic programming: The critical need for expert knowledge. In: Genetic Programming Theory and Practice IV, in press, Springer (2006).

• Moore, J.H., White, B.C. Exploiting expert knowledge in genetic programming for genome-wide genetic analysis. Lecture Notes in Computer Science, in press (2006).

• Moore, J.H. Genome-wide analysis of epistasis using multifactor dimensionality reduction: feature selection and construction in the domain of human genetics. In: Knowledge Discovery and Data Mining: Challenges and Realities with Real World Data, IGI, in press (2006).

92

Acknowledgments

Nate Barney
Bill White

NIH R01 AI59694, LM009012

[email protected]
