TimeSearcher: Interactive Querying for Identification of Patterns in Genetic Data



Post on 07-Jan-2016


TRANSCRIPT

Page 1: TimeSearcher: Interactive Querying for Identification of Patterns in Genetic Data

TimeSearcher: Interactive Querying for Identification of Patterns in Genetic Data

Harry Hochheiser, Eric Baehrecke, Stephen Mount, Ben Shneiderman

Harry Hochheiser is supported by a fellowship from America Online.

Page 2: TimeSearcher: Interactive Querying for Identification of Patterns in Genetic Data

Time Series Data

• Real-valued function over time
• Goal: find patterns
  – “Starts Low, Ends High”
  – Outliers
  – Periodic Patterns
  – Laggards and Leaders
• Hypothesis generation

Page 3: TimeSearcher: Interactive Querying for Identification of Patterns in Genetic Data


Microarray Data

Chu, et al. The transcriptional program of sporulation in budding yeast, Science 1998 Oct 23; 282(5389): 699-705.

Page 4: TimeSearcher: Interactive Querying for Identification of Patterns in Genetic Data

Timeboxes

• Rectangular query regions
• Value must be in range for all time points in region
• Combine multiple timeboxes for conjunctive query
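The timebox semantics above can be sketched in Python. This is a minimal illustration, not TimeSearcher's actual code: the `Timebox` class, its field names, and the `run_query` helper are hypothetical, and time points are assumed to be integer indices into each series.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Timebox:
    """Hypothetical rectangular query region: a time range and a value range."""
    t_start: int   # first time index covered (inclusive)
    t_end: int     # last time index covered (inclusive)
    v_min: float   # lowest allowed value
    v_max: float   # highest allowed value

    def matches(self, series: Sequence[float]) -> bool:
        # The value must be in range at every time point inside the box.
        window = series[self.t_start:self.t_end + 1]
        return all(self.v_min <= v <= self.v_max for v in window)


def run_query(dataset: Dict[str, Sequence[float]],
              timeboxes: Sequence[Timebox]) -> List[str]:
    """Conjunctive query: keep items that satisfy every timebox."""
    return [name for name, series in dataset.items()
            if all(box.matches(series) for box in timeboxes)]


# "Starts low, ends high" expressed as two timeboxes.
dataset = {"a": [0, 1, 5, 6], "b": [5, 5, 1, 0]}
starts_low = Timebox(t_start=0, t_end=1, v_min=0, v_max=2)
ends_high = Timebox(t_start=2, t_end=3, v_min=4, v_max=7)
result = run_query(dataset, [starts_low, ends_high])
```

Here only item `a` stays low during the first two time points and high during the last two, so it is the only item returned.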

Sharp Rise Panic Reversal

Page 5: TimeSearcher: Interactive Querying for Identification of Patterns in Genetic Data

TimeSearcher/Microarray demo

Page 6: TimeSearcher: Interactive Querying for Identification of Patterns in Genetic Data

TimeSearcher

• Interactive exploration of time-series data
• Dynamic queries (<100ms)
• Linear display of individual items
• Create queries on graph area
• Move, scale timeboxes to modify query
• Drag-and-Drop for query-by-example
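Query-by-example can be pictured as using one item's own values as the query. The sketch below is one plausible reading, not a description of TimeSearcher's internals: the function names and the single `tolerance` parameter are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def within_tolerance(example: Sequence[float],
                     candidate: Sequence[float],
                     tolerance: float) -> bool:
    """True if the candidate stays within +/- tolerance of the example
    at every time point (series must have the same length)."""
    return (len(example) == len(candidate)
            and all(abs(e - c) <= tolerance
                    for e, c in zip(example, candidate)))


def query_by_example(dataset: Dict[str, Sequence[float]],
                     example_name: str,
                     tolerance: float) -> List[str]:
    """Return the names of all items that resemble the chosen example."""
    example = dataset[example_name]
    return [name for name, series in dataset.items()
            if within_tolerance(example, series, tolerance)]


dataset = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.2, 2.1, 2.8],
    "g3": [3.0, 2.0, 1.0],
}
similar = query_by_example(dataset, "g1", tolerance=0.5)
```

This is equivalent to placing one narrow value range around the example at each time point, so the result could then be loosened or tightened by adjusting `tolerance`.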

Page 7: TimeSearcher: Interactive Querying for Identification of Patterns in Genetic Data

Other Applications

• “Time”: linear ordered sequence
• Use TimeSearcher for general sequences
  – E.g., DNA

Page 8: TimeSearcher: Interactive Querying for Identification of Patterns in Genetic Data

TimeSearcher for analysis of weak signals in nucleotide sequences:

Application to the case of the Arabidopsis thaliana branch site consensus splicing signal.

Steve Mount, Cell Biology and Molecular Genetics

Harry Hochheiser and Ben Shneiderman, Human Computer Interaction Lab

Steven Salzberg, The Institute for Genomic Research

Splicing signals are recognized during early steps in the biochemical process of splicing.

[Figure labels: SF1, U2AF65, U2AF35, U1, Exon 1, Exon 2, Branch Site, (Y)n AG]

Page 9: TimeSearcher: Interactive Querying for Identification of Patterns in Genetic Data

Two-step pre-mRNA splicing mechanism with branched intermediate:

Diagram courtesy of Dr. Martinez Hewlett

Consensus sequences:

• Yeast (Saccharomyces cerevisiae) Invariant: TACTAAC
• Humans (Homo sapiens) Consensus: TNYTRAYY
• Fruit flies (Drosophila melanogaster) Invariant: WCTAATY
• Weeds (Arabidopsis thaliana) Invariant: CTRAY

Here we sought to verify and extend the experimentally determined branch site consensus CTRAY determined by Simpson et al. (2002). Our long-term goal is the characterization of an even weaker signal, the ‘exonic splicing enhancer.’

Y = C or T; W = A or T; R = A or G; N = A, C, G or T
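The degenerate-base legend above maps directly onto regular-expression character classes. The following sketch shows how a consensus such as CTRAY could be matched against a sequence; the `IUPAC` table covers only the codes defined on this slide, and the helper names are illustrative.

```python
import re
from typing import List

# Only the codes defined in the legend, plus the four plain bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "[CT]",    # pyrimidine
    "W": "[AT]",
    "R": "[AG]",    # purine
    "N": "[ACGT]",  # any base
}


def consensus_to_regex(consensus: str) -> "re.Pattern[str]":
    """Translate a consensus string into a compiled regular expression."""
    return re.compile("".join(IUPAC[code] for code in consensus))


def find_sites(consensus: str, sequence: str) -> List[int]:
    """Return start positions of all (possibly overlapping) matches."""
    body = consensus_to_regex(consensus).pattern
    # A lookahead lets overlapping occurrences be reported too.
    return [m.start() for m in re.finditer(f"(?=({body}))", sequence)]


positions = find_sites("CTRAY", "GGTACTAACAG")
```

In this example the only match is CTAAC starting at index 4, which is also the core of the yeast invariant TACTAAC.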

Page 10: TimeSearcher: Interactive Querying for Identification of Patterns in Genetic Data

Page 11: TimeSearcher: Interactive Querying for Identification of Patterns in Genetic Data

Page 12: TimeSearcher: Interactive Querying for Identification of Patterns in Genetic Data

Page 13: TimeSearcher: Interactive Querying for Identification of Patterns in Genetic Data

Page 14: TimeSearcher: Interactive Querying for Identification of Patterns in Genetic Data

Page 15: TimeSearcher: Interactive Querying for Identification of Patterns in Genetic Data

Page 16: TimeSearcher: Interactive Querying for Identification of Patterns in Genetic Data

ACTAA ACTGA ATAAC ATTGA CTAAA CTAAC CTAAT CTCAT CTGAC TAACG TAACT TCTAA TGACT TGATT TTAAC

WYTRAY

[Figure: number of over-represented words plotted against distance to 3’ splice site; labels: Branch site, Pyrimidines, one sigma, two sigma]

Y = C or T; W = A or T; R = A or G; N = A, C, G or T

Conclusions:

TimeSearcher can be used to identify weak signals in aligned nucleotide sequences.

Analysis of 8,550 exons from Arabidopsis supports the branch site consensus WYTRAY.
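One way to picture how aligned sequences become data that TimeSearcher can display is to count, for every short word, how often it occurs at each position relative to the 3’ splice site, giving one count profile per word. The sketch below is an assumption about that kind of preprocessing, not the authors' pipeline: the function names, the default word length of 5 (taken from the five-letter words listed on this slide), and the rule of flagging counts more than `n_sigma` standard deviations above a word's own mean are all illustrative.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple


def positional_word_counts(seqs: Sequence[str], k: int = 5) -> Dict[str, List[int]]:
    """Count each k-letter word at each position.

    Sequences are assumed to be aligned at their right end (the 3' splice
    site); longer sequences are trimmed on the left to the shortest length.
    """
    length = min(len(s) for s in seqs)
    n_pos = length - k + 1
    counts: Dict[str, List[int]] = {}
    for s in seqs:
        tail = s[-length:]
        for i in range(n_pos):
            word = tail[i:i + k]
            counts.setdefault(word, [0] * n_pos)[i] += 1
    return counts


def over_represented(counts: Dict[str, List[int]],
                     n_sigma: float = 2.0) -> List[Tuple[str, int, int]]:
    """Flag (word, position, count) where the count exceeds the word's
    mean across positions by more than n_sigma standard deviations."""
    hits = []
    for word, profile in counts.items():
        mu, sigma = mean(profile), pstdev(profile)
        if sigma == 0:
            continue  # flat profile: nothing stands out
        for pos, c in enumerate(profile):
            if c > mu + n_sigma * sigma:
                hits.append((word, pos, c))
    return hits


# Toy data: the same word sits at the same position in every sequence.
seqs = ["GGGGGCTAACGGGGGGGGGG"] * 10
hits = over_represented(positional_word_counts(seqs))
```

Each word's profile is a sequence of values over position, which is the "linear ordered sequence" form that timebox queries operate on.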

Page 17: TimeSearcher: Interactive Querying for Identification of Patterns in Genetic Data

Future Work: Extensions to query model

• Leaders and Laggards
  – Identification of regulatory genes
• Multiple time-varying values
• Variable-time timeboxes
• Collaborations with biologists inform design

What sort of queries are of interest?

Page 18: TimeSearcher: Interactive Querying for Identification of Patterns in Genetic Data

Conclusions

• TimeSearcher: interactive tool for graphical exploration of time series data
• Ongoing use for analyzing microarray data and sequence data

We’re interested in working with motivated users & real data sets

www.cs.umd.edu/hcil/timesearcher