biocomputational puzzles

54
Biocomputational Puzzles Dennis Shasha Courant Institute, New York Univ shasha @ cs . nyu . edu Collaborations with labs of Philip Benfey (bio, Duke) , Gloria Coruzzi (bio, NYU), Allen Mincer (physics, NYU), Ken Birnbaum, Laurence Lejay, Peter Palenchar (bio, NYU), Rodrigo

Upload: zarek

Post on 24-Jan-2016

39 views

Category:

Documents


0 download

DESCRIPTION

Biocomputational Puzzles. Dennis Shasha Courant Institute, New York Univ [email protected] - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Biocomputational Puzzles

Biocomputational Puzzles

Dennis Shasha

Courant Institute, New York Univ

[email protected]

Collaborations with labs of Philip Benfey (bio, Duke) , Gloria Coruzzi (bio, NYU),

Allen Mincer (physics, NYU), Ken Birnbaum, Laurence Lejay, Peter

Palenchar (bio, NYU), Rodrigo Gutierrez, Manny Kitari, Chris Poultny

Page 2: Biocomputational Puzzles

2

Overview

• Activist Data Mining

• Transcription factor-cis-element prediction

• Visualization tool for multi-factor data

• Time series for fun and profit.

Page 3: Biocomputational Puzzles

3

Classical Data Mining

• Wait for data to appear

• Find patterns in it.

• Hope they are actionable.

• Works well when data is pertinent, e.g. Amazon’s other books recommendation, extrapolation of trends.

Page 4: Biocomputational Puzzles

4

Activist Data Mining

• Propose initial experiments to explore subspace of some predefined search space

• Evaluate the results

• Propose new experiments, evaluate, propose, evaluate, propose ….

• Iterative and adaptive

Page 5: Biocomputational Puzzles

5

Which is Better for Natural Science?

• Classical is obviously right when you have no control over data generation.

• When you do, active data mining (active learning) may work much better.

Page 6: Biocomputational Puzzles

6

Natural Science Lesson 1

• Natural scientists, like most people, care about their own time. Computational time matters only if it costs people time.

• If you can get more insight out of their data, they like you.

• If you can save them experimental time, they love you.

• If you can lead them to new discoveries, they will give you cookies!

Page 7: Biocomputational Puzzles

7

Activist Data Mining Philosophy

Passive Approach: Natural scientists do experiments.

Computer Scientists help to glean something from it.

Activist Approach: Computer scientists help (1)Design experiments(2)Analyze results(3)Design new experiments based on results

Page 8: Biocomputational Puzzles

8

Activist Data Mining Philosophy (Reminder)

Passive Approach: Natural scientists do experiments.

Computer Scientists help to glean something from it.

Activist Approach: Computer scientists help (1)Design experiments (2)Analyze results(3)Design new experiments based on resultsOur particular methodology: Adaptive Combinatorial Design

Our innovation: applying this combinatorial design in an interative way.

Page 9: Biocomputational Puzzles

9

Safecracking Puzzle: you are a thief…

Combinatorial Safe: 10 switches with 3 settings each. Over 59,000 (3^10) possible configurations. However there is a certain pair of switches (you don’t know which pair) and a certain pair of values of those switches that will open the safe.

Illustration: S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 C A

Challenge: Open the safe in 15 configurations or fewer.

Page 10: Biocomputational Puzzles

10

Scientific Goal

Long-term goal: Virtual plant (and later … frankenfoods)

We want to describe the factors (e.g. light, carbon, nitrogen….) that determine whether plants will produce critical amino acids and how those factors interact.

Page 11: Biocomputational Puzzles

11

Light, Carbon and Amino acidsdifferentially regulate N-assimilation genes-- pathway diagram

Light Carbon

GS2

Gln

C:N

C5:N2

Light Carbon

AS1

Asn

C:N

C4:N2

Amino acids Amino acids

Page 12: Biocomputational Puzzles

12

Design Space

Inputs: *Light*Starvation to Various Nutrients*Carbon*Inorganic N (NO3/NH4)*Organic N (Glu)*Organic N (Gln)

If inputs are take binary values (first approximation)6 binary (+/-) inputs= 26 or 64 input combinations (or treatments)

Use 2-factor combinatorial design to reduce number of treatment combinations required to cover the experimental space, assuming that important interactions will have to do with two factors.

Page 13: Biocomputational Puzzles

13

EXPT 1NOPIVOT

LANE ILLUMIN STARVE CARBON

NO3NH4 GLU GLN

1 DARK Y 0 L H H

2 DARK N L 0 0 H

3 LIGHT N 0 0 H 0

4 LIGHT Y L L 0 0

5 LIGHT Y L 0 H H

6 DARK N 0 L 0 0

“Combinatorial design” finds six conditions to explore every pairwise interaction. Want to discover important factors.

Notice: for each pair of input factors and combination of valuesfrom those factors, some experiment has that combination, e.g. LightNo carbon; Starve No Glu.

After doing this experiment,certain factors jump out as worth further study: Illumination, Carbon (both have significant repressive correlations)

Page 14: Biocomputational Puzzles

14

Key to activist data mining is adapting to results of experiments already done. Many ways to do this, e.g.Tong, S. & Koller, D. Active Learning for Structure in Bayesian Networks. Seventeenth International Joint Conference on ArtificialIntelligence, 863-869 (2001).

Advocates pool-based active learning. Pool of unlabeled instances (don’t know output value). An active learner chooses which instance to query next in hope it will reduce set of possible answers.

Ideker, T. E., Thorsson, V. & Karp, R. M. Discovery of regulatory interactions through perturbation: inference and experimental design. Pac Symp Biocomput, 305-16 (2000).

Similar idea with Boolean circuits.

Adaptation Following No Pivot Design

Page 15: Biocomputational Puzzles

15

Find most important inputs in order to see their effects in more detail.

That is, we focus our search space on those inputs that are likely to exert the most influence over outputs of interest.

Our Use of Adaptation

Page 16: Biocomputational Puzzles

16

1. Is any single factor so important that its presence determines the outcome regardless of the other contexts? (e.g. Light in context X is repressive compared with Dark in context Y for all X, Y)

2. Is a factor important enough that it has an effect for any particular context? (e.g. for all X, Light in context X is repressive compared with Dark in X)

3. Is a factor consistently important when compared with a fixed background? (e.g. for all X, is Light in context X repressive compared with background?)

Three questions of particular interest

Page 17: Biocomputational Puzzles

17

EXPT 1NOPIVOT

LANE ILLUMIN STARVE CARBON

NO3NH4 GLU GLN

1 DARK Y 0 L H H

2 DARK N L 0 0 H

3 LIGHT N 0 0 H 0

4 LIGHT Y L L 0 0

5 LIGHT Y L 0 H H

6 DARK N 0 L 0 0

Pivot Design 1: Start with no pivot design

Create dark and light pairs by just setting Illumin to light and dark respectively.

Page 18: Biocomputational Puzzles

18

EXPT 1NOPIVOT

LANE ILLUMIN STARVE CARBON

NO3NH4 GLU GLN

1 DARK Y 0 L H H

2 DARK N L 0 0 H

3 DARK N 0 0 H 0

4 DARK Y L L 0 0

5 DARK Y L 0 H H

6 DARK N 0 L 0 0

Pivot Design 2: Dark Design

Exactly the same as no pivot tests but with DARK everywhere. Requires only three more experiments than in no pivot case.

Page 19: Biocomputational Puzzles

19

EXPT 1NOPIVOT

LANE ILLUMIN STARVE CARBON

NO3NH4 GLU GLN

1 LIGHT Y 0 L H H

2 LIGHT N L 0 0 H

3 LIGHT N 0 0 H 0

4 LIGHT Y L L 0 0

5 LIGHT Y L 0 H H

6 LIGHT N 0 L 0 0

Pivot Design 3: Light Design

Exactly the same as DARK tests but with Light everywhere. Again, three more experiments than in no pivot case.

Important: First experiment for light = First Experiment for Dark except for Illumination itself. Differs only in pivot. Minimal pair.

Page 20: Biocomputational Puzzles

20

A set of well-spaced minimal pairs, differing only in the pivot. Suggests answers for first two questions:

• Is any single factor so important that its presence determines the outcome regardless of the other contexts? (e.g. Light in context X is repressive compared with Dark in context Y for all X, Y).

No, for this biological system.• Is a factor important enough that it has an

effect for any particular context? (e.g. for all X, Light in context X is repressive compared with Dark in X)

Yes, for this biological system.

What Accomplished

Page 21: Biocomputational Puzzles

21

EXPT 1NOPIVOT

LANE ILLUMIN STARVE CARBON

NO3NH4 GLU GLN

1 LIGHT Y 0 L H H

2 LIGHT N L 0 0 H

3 LIGHT N 0 0 H 0

4 LIGHT Y L L 0 0

5 LIGHT Y L 0 H H

6 LIGHT N 0 L 0 0

7 DARK N 0 0 0 0

“Half-pivot” Light against a fixed background

Exactly the same as LIGHT tests but with one added background.

Allows us to create a circuit (binary in this case because inputs are binary.)

Page 22: Biocomputational Puzzles

22

ASN1

C L L C

S Q

AND

OR

E

AND

S N Q

AND

AND

S N E Q

AND

C L

AND

CL

AND

S

AND

OR

AND

N Q NS

AND

E

AND

S Q E

OR

AND

S E Q

AND

S N E

S N

E

AND

SQ

AND

N

OR

E

AND

N QE N

OR

Q

AND

QE

AND

S

QE

AND

Q

S N

OR

N N

AND AND

OR

AND

Fig. 3

B

Page 23: Biocomputational Puzzles

23

Circuits could consist of continuous functions.D`Haeseleer, P., Liang, S. & Somogyi, R.

Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics 16, 707-26. (2000).

Discuss pros and cons of such an approach. Major con is that feedback is hard to represent correctly. Major pro is that noise is less of an issue.

8.Huang, S. Genomics, complexity and drug discovery: insights from Boolean network models of cellular regulation. Pharmacogenomics 2, 203-22. (2001).

Argues that Boolean is very adept for genomics.

Boolean Not Only Representation

Page 24: Biocomputational Puzzles

24

Adaptive Experimental Design along “Borders”

Because combinatorial design explores only a (well spread) subset of possibilities, the apparent effects of factors may depend on other factors that haven’t been explored.

After constructing boolean circuits, software suggests “experiments to clarify border” between inductive and non-inductive, e.g.,

Starvation_Y, Carbon_N, NH4NO3_N, GLU_Y, GLN_Y

Page 25: Biocomputational Puzzles

25

Steps of Methodology

No Pivot: Small set of well-spaced experiments to find most important influences on a target. Also, a good method in genomics applications to find clusters because of good spacing. Small? 10 inputs with 4 values gives a no pivot of about 30 experiments.

Pivot: Can find out whether an input is likely to have an effect regardless of context (for all X, for all Y) or for every context (for all contexts X)

Half-pivot: For comparison with a fixed background

Border Adaptation: Study differences between repressinve case and non-repressive one to discover fine structure.

Page 26: Biocomputational Puzzles

26

Inspiration of this approach

Combinatorial design: Inspired by work in software testing by

David Cohen, Siddhartha Dalal, Michael Fredman and

Gardner Patton at Bellcore/Telcordia. Their problem: how to test a good set of inputs to

a program to discover whether there are any bugs. Not program coverage, but input coverage.Not all input combinations, but all combinations

of every pair of of input variables (“no pivot”

design).

Hypothesis: every input combination should give same output: no error.If true for designed subset, then program is ok.

Page 27: Biocomputational Puzzles

27

How This Could Help You

Use this approach: Pose an experimental setting of interest to you. (Names of input variables, possible values).

Describe a “no pivot” design for your setting.Based on that result, describe a pivot design to

isolate the exact effect of a specific input.Get a good sense of whether the pivot is decisive

by itself or has a consistent strong influence.

Theoretical Guarantee: For k-factor design, if there is a set of k values that dominates the result, you will find it.

Page 28: Biocomputational Puzzles

28

Combinatorial Design vs. Random Sampling

Practical Question: Adaptive Combinatorial Design is a sampling method. How well does it work compared to random sampling?

Simulation experiment: Create simulated data with a single important attribute and microarray-quality noise (factor of 2 to 5 change in biological replicates).

Empirical Conclusions: Random and Adaptive Combinatorial Design did equally well at identifying the important attribute, however Random falsely identified other attributes as important about 4 times more often than Adaptive Combinatorial Design. (see cdtables.doc)

Ref: Lejay et al. Systems Bio vol. 1.2 Dec. 2004

Page 29: Biocomputational Puzzles

29

Safecracking Solution

• Number S1 S2 S3 S4 S5 S6 S7 S8 S9 S10• 1: A A A A A A A A A A• 2: A B B B B B B B B B• 3: A C C C C C C C C C• 4: B A B C A B C A B C• 5: B B C A B C A B C A• 6: B C A B C A B C A B• 7: C A C B A C B A C B• 8: C B A C B A C B A C• 9: C C B A C B A C B A• 10: X A A A B B B C C C• 11: X A A A C C C B B B• 12: X B B B A A A C C C• 13: X B B B C C C A A A• 14: X C C C A A A B B B• 15: X C C C B B B A A A

Page 30: Biocomputational Puzzles

30

Finding the minimal no pivot design is NP hard, but heuristic solutions work well in practice (within small additive factor):

1. At first, each pair of input variables (n choose 2) is considered to be unfinished.

2. Basic step: choose a random set of disjoint unfinished pairs. Finish them by designing treatments to include all remaining value combinations for each pair, again at random.

3. If an input variable is already finished or not a member of a disjoint pair, put in a don’t care.

4. Fill up earlier don’t cares to complete unfinished pairs.

5. Repeat 2-4 with different random seeds.

Algorithmic (Heuristic) Idea

Page 31: Biocomputational Puzzles

31

S1: A B CS2: A B S3: A BS4: A B C

1. Choose disjoint unfinished pairs at random{ {S1, S3}, {S2, S4}}Generate 6 rows (experiments): S1 S2 S3 S4A B B CB B A BA A A AC A B CC A A BB B B A

Example (1)

Page 32: Biocomputational Puzzles

32

Not only is { {S1, S3}, {S2, S4}} finished but also:

{S2, S3}. Others are partly finished.S1 S2 S3 S4A B B CB B A BA A A AC A B CC A A BB B B A

Example (2)

Page 33: Biocomputational Puzzles

33

Try {S1, S4} next. Fill S2 and S3 with X (don’t care).

S1 S2 S3 S4A B B CB B A BA A A AC A B CC A A BB B B AA X X BB X X CC X X A

Example (3)

Page 34: Biocomputational Puzzles

34

Now replace X for unfinished pairs: {S1, S2}, {S3, S4}

S1 S2 S3 S4A B B CB B A BA A A AC A B CC A A BB B B AA X B BB A A CC B X A

Example (3)

Page 35: Biocomputational Puzzles

35

Further Reading: some other uses of combinatorial designin biology Universal DNA tag systems: a combinatorial design schemeRecomb 2000Amir Ben-Dor, Richard Karp, Benno Schwikowski and Zohar

Yakhini.

Experimental design for gene expression microarrays, Biostatistics, 2:183-201.Kerr and Churchill(2001),Normal: N microarrays will be used to test N conditions against a common reference.Authors propose to use the colors to compare N conditionsagainst one another in a looping fashion: 1 with 2, 2 with 3, … n with 1. Result: deconvolves certain effects (e.g. binding affinity ofreference dye.

Page 36: Biocomputational Puzzles

Sungear Multifactor Visualization

Joint work with Rodrigo Gutiérrez, Manny Katari, Brad Paley, Chris

Poultney, and Gloria Coruzzi

Page 37: Biocomputational Puzzles

37

Typical Genomic Questions

• Multiple experiments (multiple time points, multiple conditions), many Go categories, or other features of genes: want to know when certain Go categories are highly represented.

• Many species, want to know which genes have presence in many species and perhaps which GO categories

Page 38: Biocomputational Puzzles

38

Computational Desires

• Simple, responsive interface

• Visualize lots of data

• Many ways to query

• Many different data representations

Page 39: Biocomputational Puzzles

39

Sungear Design

• Generalizes Venn diagrams to more than three

• Visual outline is an ellipse having anchors on borders and vessels in the interior.

• Each vessel points to associated anchors.• Linked views to hierarchies, lists, and

graphs, so can simultaneously update data depending on user queries (selection events).

Page 40: Biocomputational Puzzles

40

Venn Diagram: great for three factors

Page 41: Biocomputational Puzzles

41

Page 42: Biocomputational Puzzles

42

Sungear Principle

• “Sungear is stupid”

• Doesn’t care which kind of data it is representing, though there is built-in support for genes (because of links to GO and to cytoscape).

• Basic Sungear representation could be used to describe anything from yachting gear to demographics.

Page 43: Biocomputational Puzzles

43

Seed Development Stages(an example)

ATGE_84isolated seeds144 - 192green cotyledons10

ATGE_83isolated seeds120 - 144curled cotyledons to early green cotyledons9

ATGE_82isolated seeds108 - 120walking-stick to early curled cotyledons8

ATGE_81isolated seeds96 - 108late torpedo to early walking-stick7

ATGE_79isolated seeds90 - 96mid torpedo to late torpedo6

ATGE_78siliques containing seeds84 - 90late heart to mid torpedo5

ATGE_77siliques containing seeds66 - 84early heart to late heart4

ATGE_76siliques containing seeds48 - 66mid globular to early heart3

-24 - 48early globular to mid globular2

- (Weigel)12 - 24up to four cells1

AtGE No.Sample harvestedHours after

flowering Description

(main stage of embryogenesis)Stage No.

Page 44: Biocomputational Puzzles

44

Page 45: Biocomputational Puzzles

45

Demos

• Growth stages showing when genes are transcribed (N-reg AtGenExpSeedDev)

• Blast comparison of Arabidopsis against most fully sequenced organisms.

• Nitrogen, carbon, light, organ showing regulation -- relative expression (cnlo)

• Interspecies comparisons that might show which kinds of genes are missing in gymnosperms, for example (Vicogenta)

Page 46: Biocomputational Puzzles

46

Genes that respond to N in leaves and C in roots form the largest group (cnlo)

Page 47: Biocomputational Puzzles

47

PII and other genes involved in N-metabolism are among these 566

Page 48: Biocomputational Puzzles

48

HYPOTHESIS: Most of the regulated genes are involved in metabolism.

Page 49: Biocomputational Puzzles

49

… this is not the case for other processes

Page 50: Biocomputational Puzzles

50

Genes that are regulated by N & L together

Page 51: Biocomputational Puzzles

51

Gene networks of NL-responsive genes

Page 52: Biocomputational Puzzles

52

Sungear Conclusion

• If you have lots of data about some common entity (genes, people, goods, whatever) and several factors or experiments whose interaction you want to visualize, Sungear is for you.

• Available late summer 2005. (sungear2/run.bat)

Page 53: Biocomputational Puzzles

53

Overall Conclusion

• Combinatorial design if you have a large search space to analyze and you want an intelligent sampling method.

• Sungear for visualizing experimental data.

Page 54: Biocomputational Puzzles

54

Other Projects

• Diabetes treatment (Harvard med school): given patient histories and interventions, try to find best practice interventions.

• Time series work: fusing data from different sources, detecting bursts over many window lengths, query by humming

• Graph search and matching: given a query graph find matching subgraphs.