PROGRESS REVIEW Mike Langston’s Research Team Department of Computer Science University of Tennessee with collaborative efforts at Oak Ridge National Laboratory November 22, 2005



Page 1: PROGRESS REVIEW Mike Langston’s Research Team

PROGRESS REVIEW

Mike Langston’s Research Team

Department of Computer Science
University of Tennessee

with collaborative efforts at

Oak Ridge National Laboratory

November 22, 2005

Page 2: PROGRESS REVIEW Mike Langston’s Research Team

Team Members in Attendance

Bhavesh Borate, Elissa Chesler, John Eblen, Roumyana Kirova, Mike Langston,

Andy Perkins, Yun Zhang

Team Members Absent

Xinxia Peng, Jon Scharff, Josh Steadmon

Page 3: PROGRESS REVIEW Mike Langston’s Research Team

Mike Langston’s Progress Report
Fall, 2005

• Team Changes
    New Students: Belma Ford (GST), Peter Shaw (Australia)
    New Colleagues: Elissa Chesler & Roumyana Kirova (ORNL)
    Graduating Soon: Xinxia Peng (December), Jon Scharff (May)
    Moved Collaborators: Jay Snoddy (Vanderbilt)

• Recent Conferences/Talks
    ACiD (England), Dagstuhl (Germany), COCOON (China), Purdue, Supercomputing (Seattle)

• Upcoming Visits/Talks
    RECOMB WS (San Diego), Texas A&M, Carleton (Canada), Göteborg (Sweden), AICCSA-06 (UAE), ACM SAC (France)

• Support
    NIH (John), ORNL (Yun), Science Alliance (Andy)
    Proposals Outstanding

• Sample Projects
    Eukaryotes: Allergy (Human), Diabetes (Mice), IR (Mice), Neuroscience (Mice), others
    Prokaryotes: Operon (R. palustris), Shock (Shewanella)

Page 4: PROGRESS REVIEW Mike Langston’s Research Team

Yun Zhang

• Recent conferences/talks
– Prepared slides for COCOON05, China
– Presented at SC05 (SuperComputing), Seattle

• Upcoming events
– Cray MTA (Multithreaded Architecture) Workshop, ORNL

• Projects: maximal clique enumeration
– Comparisons of multithreaded implementations on
    • Altix vs. Cray vs. IBM
    • Cray: Vectorization of for-loops
– Implementations on distributed-memory machines
    • Using MPI vs. Global Arrays
    • Load-balancing using master/slave vs. peer-to-peer model
– Comparison of MPI vs. Multithreaded

Page 5: PROGRESS REVIEW Mike Langston’s Research Team

Parallel Clique Enumeration

• Objective
– Minimize data communication vs. maximize balanced load

• Dynamic load balancing
– Data transfer: peer-to-peer
– DLB strategies: master/slave vs. peer-to-peer

[Figure: search tree with levels k = 1 through k = 5 and numbered nodes. Caption: A task needed to be transferred from slave1 to slave5.]

Page 6: PROGRESS REVIEW Mike Langston’s Research Team

Clique Enumeration

• Methods to speed up the computation core
• Bit compression to save memory, and corresponding bitwise operations on compressed bitmaps

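Adjacency matrix (one bitmap row per vertex):

    a b c d e f g
a   0 1 1 1 1 0 0
b   1 0 1 1 1 1 0
c   1 1 0 1 1 1 1
d   1 1 1 0 1 0 1
e   1 1 1 1 0 0 1
f   0 1 1 0 0 0 1
g   0 0 1 1 1 1 0

Bitmap for (a, b, c, d):

    0 0 0 0 1 0 0

[Figure: example graph on vertices a through g, and a bitmap diagram with axes labeled Vertices and Cliques and regions labeled sparse and dense.]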
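A minimal sketch of bitwise operations on bitmaps, using the 7-vertex adjacency matrix from this slide. Each row is packed into a Python integer, and the bitmap shown for (a, b, c, d) appears to be the bitwise AND of rows a through d. The basic Bron-Kerbosch recursion and all function names here are assumptions chosen for illustration; the slides do not say which enumeration algorithm or compression scheme the team's code uses.

```python
# Adjacency rows copied from the slide's 7-vertex example; vertex order is a..g.
NAMES = "abcdefg"
ROWS = [
    "0111100",  # a
    "1011110",  # b
    "1101111",  # c
    "1110101",  # d
    "1111001",  # e
    "0110001",  # f
    "0011110",  # g
]


def row_to_bitmap(row):
    """Pack a 0/1 string into an int; bit i is set when vertex i is adjacent."""
    bits = 0
    for i, ch in enumerate(row):
        if ch == "1":
            bits |= 1 << i
    return bits


ADJ = [row_to_bitmap(r) for r in ROWS]
ALL = (1 << len(ADJ)) - 1


def common_neighbors(vertices):
    """Bitwise AND of the adjacency bitmaps of the given vertex indices."""
    bits = ALL
    for v in vertices:
        bits &= ADJ[v]
    return bits


def members(bits):
    """Decode a bitmap back into vertex names."""
    return "".join(NAMES[i] for i in range(len(NAMES)) if (bits >> i) & 1)


def maximal_cliques(r=0, p=ALL, x=0, out=None):
    """Basic Bron-Kerbosch on bitmaps: r = clique so far, p = candidates, x = excluded."""
    if out is None:
        out = []
    if p == 0 and x == 0:
        out.append(members(r))  # nothing can extend r, so r is maximal
        return out
    while p:
        v = (p & -p).bit_length() - 1  # index of the lowest set bit
        bit = 1 << v
        maximal_cliques(r | bit, p & ADJ[v], x & ADJ[v], out)
        p &= ~bit  # v has been fully explored
        x |= bit
    return out


# Vertices adjacent to all of a, b, c, d (indices 0..3).
print(members(common_neighbors([0, 1, 2, 3])))
print(sorted(maximal_cliques()))
```

Intersecting candidate sets with a single AND is what makes the bitmap representation attractive: one machine word covers many vertices at once.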

Page 7: PROGRESS REVIEW Mike Langston’s Research Team

Andy Perkins
Projects

• Low dose
• Allergy
• Shewanella
• HRT

Page 8: PROGRESS REVIEW Mike Langston’s Research Team

Microarray Data

• Normalization
• Filtering low or unchanging expression values
• Control spots

Page 9: PROGRESS REVIEW Mike Langston’s Research Team

Differential Analysis

• Cliquification
– Present in a large percentage of cliques in one group and in few in the other.

• Expression
– 2-fold change in expression between groups

• Correlation
– Correlation value >= 0.85 in one group and <= 0.25 in the other.

Page 10: PROGRESS REVIEW Mike Langston’s Research Team

Differential Analysis

Red edge: >=0.85 in dose and <= 0.25 in control

Blue edge: >= 0.85 in control and <= 0.25 in dose
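The edge rule on this slide can be sketched as follows. This is an illustrative sketch, not the team's pipeline: the function names, the toy expression values, and the use of raw (not absolute) Pearson correlation are assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(u, v):
    """Pearson correlation of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def differential_edges(dose, control, high=0.85, low=0.25):
    """Color gene pairs that are strongly correlated in exactly one group.

    dose, control: dict mapping gene name -> list of expression values.
    Red:  >= high in dose and <= low in control.
    Blue: >= high in control and <= low in dose.
    """
    edges = []
    for g1, g2 in combinations(sorted(dose), 2):
        r_dose = pearson(dose[g1], dose[g2])
        r_ctrl = pearson(control[g1], control[g2])
        if r_dose >= high and r_ctrl <= low:
            edges.append((g1, g2, "red"))
        elif r_ctrl >= high and r_dose <= low:
            edges.append((g1, g2, "blue"))
    return edges


# Toy data (hypothetical values, five samples per group).
DOSE = {"g1": [1, 2, 3, 4, 5], "g2": [2, 4, 6, 8, 10], "g3": [3, 5, 1, 4, 2]}
CONTROL = {"g1": [1, 2, 3, 4, 5], "g2": [3, 5, 1, 4, 2], "g3": [1, 2, 3, 4, 6]}
print(differential_edges(DOSE, CONTROL))
```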

Page 11: PROGRESS REVIEW Mike Langston’s Research Team

Other research

• Thresholding
• Pearson’s vs Spearman’s
• Random graphs

Page 12: PROGRESS REVIEW Mike Langston’s Research Team

Papers

• “Computational Analysis of Mass Spectrometry Data Using Novel Combinatorial Methods,” Proceedings, ACS/IEEE International Conference on Computer Systems and Applications, Dubai, United Arab Emirates, March, 2006, with A. Fadiel, M. A. Langston, F. Naftolin, X. Peng, P. Pevsner, H. S. Taylor, O. Tuncalp, and D. Vitello.

• “Innovative Computational Methods for Transcriptomic Data Analysis,” Proceedings, ACM Symposium on Applied Computing, Dijon, France, April, 2006, with M. A. Langston, A. M. Saxton, J. A. Scharff and B. H. Voy.

Page 13: PROGRESS REVIEW Mike Langston’s Research Team

John Eblen
Clique Analysis Tool Chain

• Projects
– Gerling Data – NOD mice
– Shewanella Data

• Three Interesting Problems
– Aggregating Maximal Cliques
– Thresholding
– Biological Analysis of Clique Results

Page 14: PROGRESS REVIEW Mike Langston’s Research Team

Aggregating Maximal Cliques

• The Problem
– A great deal of overlap among maximal cliques
– Many cliques differ by only a few nodes

• Solutions
– Paraclique (Dr. Langston)
– Nucleated Clique (Jon Scharff)
– Clique Difference or “Nonoverlap”
– Others

Page 15: PROGRESS REVIEW Mike Langston’s Research Team

Direct Maximum Clique

• Parallel version scales well on Altix supercomputer, shared memory machines

• Currently working on base serial code efficiency

• Ultimate goal is speed
– Best algorithm possible
– Smart implementation(s)

Page 16: PROGRESS REVIEW Mike Langston’s Research Team

Keller 7 Conjecture

• Goal is to find, or prove the nonexistence of, a 128-clique in the Keller 7 graph

• Current approach
– Found a set of 128 nonoverlapping independent sets (ISs)
– Currently searching for more
– Should greatly reduce search space
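The slide does not say how the independent sets are used in the search. One standard fact that may be relevant is that a clique can contain at most one vertex from each independent set, so a family of k independent sets covering every vertex bounds the clique size by k. A small sketch on a toy 4-vertex graph (the graph and the function names are hypothetical, not the Keller graph):

```python
def is_independent(adj, vertices):
    """True when no two of the given vertices are adjacent."""
    vs = list(vertices)
    return all(v not in adj[u] for i, u in enumerate(vs) for v in vs[i + 1:])


def clique_bound_from_cover(adj, independent_sets):
    """Upper bound on clique size from independent sets that cover all vertices.

    A clique takes at most one vertex per independent set, so its size
    cannot exceed the number of sets in the cover.
    """
    covered = set()
    for s in independent_sets:
        if not is_independent(adj, s):
            raise ValueError(f"not an independent set: {sorted(s)}")
        covered |= set(s)
    if covered != set(adj):
        raise ValueError("independent sets do not cover every vertex")
    return len(independent_sets)


# Toy graph: triangle 1-2-3 with a pendant vertex 4 attached to 3.
TOY_ADJ = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
TOY_COVER = [{1, 4}, {2}, {3}]
print(clique_bound_from_cover(TOY_ADJ, TOY_COVER))
```

In the toy graph the bound is tight, since the triangle 1-2-3 is a clique with one vertex in each set.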

Page 17: PROGRESS REVIEW Mike Langston’s Research Team

Bhavesh Borate
Thresholding

• GO Pairwise Similarity Analysis
• Percentage of Cliques with Biological Meaning at each threshold
• Confidence Intervals
• Graph Properties (Edge Density, Maximal Cliques, Maximum Clique)
• Spectral Graph Theory
• Bayesian Statistics
• Control Spot Threshold verification
• Utilization of Info from Pathway Databases
• Combinatorial Strategy
• Kentucky Windage ;)

Page 18: PROGRESS REVIEW Mike Langston’s Research Team

Graph of GO-Pairwise Scores vs. Correlation Values

Shewanella data

[Chart: Avg functional similarity vs. Correlation. x-axis: Correlation (0 to 1.2); y-axis: Avg functional similarity (0 to 0.14).]

Page 19: PROGRESS REVIEW Mike Langston’s Research Team

GO Pairwise Similarity Analysis

For each pair of genes, we find a GO category X that covers both genes and has the minimum number of total genes.

Get a GO score for each pair of genes.

Accumulate correlation scores in bins 1, 0.99, 0.98, ..., 0.

Average the GO scores of pairs in each bin.

Plot.
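The steps on this slide could be sketched roughly as below. The slides do not give the formula that turns category X into a GO score, so the sketch stops at finding the smallest shared category and then bins whatever scores are supplied. The category names, gene names, and function names are hypothetical.

```python
from collections import defaultdict


def smallest_shared_category(gene_a, gene_b, categories):
    """Return (name, size) of the smallest category containing both genes, or None."""
    shared = [(len(members), name) for name, members in categories.items()
              if gene_a in members and gene_b in members]
    if not shared:
        return None
    size, name = min(shared)
    return name, size


def average_score_by_bin(pairs):
    """Average scores per correlation bin of width 0.01.

    pairs: iterable of (correlation, score), correlation assumed in [0, 1].
    Returns {bin_value: mean_score}, highest bin first.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for corr, score in pairs:
        b = round(corr * 100)  # integer key avoids float-key mismatches
        sums[b] += score
        counts[b] += 1
    return {b / 100: sums[b] / counts[b] for b in sorted(sums, reverse=True)}


# Toy categories and toy (correlation, score) pairs.
CATEGORIES = {"cat_large": {"g1", "g2", "g3", "g4"}, "cat_small": {"g1", "g2"}}
print(smallest_shared_category("g1", "g2", CATEGORIES))
BINNED = average_score_by_bin([(0.991, 0.10), (0.989, 0.20), (0.5, 0.04)])
print(BINNED)
```

The binned averages are what would be plotted against correlation.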

Page 20: PROGRESS REVIEW Mike Langston’s Research Team

Pairwise Scores → Score for each Clique

Get P-value for each Clique

For each threshold 0.8:0.01:0.95

At each threshold calculate % Cliques with P-value < 0.01
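A minimal sketch of the threshold sweep, assuming the cliques and their P-values have already been computed at each threshold. The input dictionary and the P-values in it are hypothetical; how the P-values are derived is not stated on the slide.

```python
def percent_significant(pvalues_by_threshold, alpha=0.01):
    """Map each threshold to the percentage of its cliques with P-value < alpha."""
    result = {}
    for threshold, pvals in pvalues_by_threshold.items():
        if pvals:
            result[threshold] = 100.0 * sum(p < alpha for p in pvals) / len(pvals)
        else:
            result[threshold] = 0.0  # no cliques at this threshold
    return result


# Thresholds 0.80, 0.81, ..., 0.95 (the 0.8:0.01:0.95 range on the slide).
THRESHOLDS = [round(0.80 + 0.01 * i, 2) for i in range(16)]

# Toy P-values for two of the thresholds.
TOY_PVALUES = {0.80: [0.001, 0.2, 0.005, 0.5], 0.95: [0.0001, 0.002]}
print(percent_significant(TOY_PVALUES))
```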

Page 21: PROGRESS REVIEW Mike Langston’s Research Team

Updates from Xinxia

• Kevin was born in May
• Defended in October
• Graduating in December
• Working on publications
• Starting a job in December

Thank you all and Keep in touch!

Page 22: PROGRESS REVIEW Mike Langston’s Research Team

Suman Duvvuru
Data analysis

• Effect of Strain: Currently working on Dr. Brynn's mouse strain data, writing SAS code to see which strain produces strong correlation in the data.

• The problem with microarray data

1. The number of variables is much higher than the number of observations. This causes many eigenvalues of the covariance matrix to be 0, so the correlation matrix is problematic.

2. Can be corrected using
• shrinkage based correlation
• Information criteria based methods (using smooth covariance estimators).
• (Implementation of these methods currently in progress)
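To illustrate the problem and the shrinkage idea, the sketch below builds a sample correlation matrix from 3 observations of 4 variables (so it is singular) and shrinks it toward the identity, R* = (1 - λ)R + λI. The data are toy values and λ = 0.2 is picked by hand; shrinkage estimators in the literature estimate λ from the data, and the slide does not say which estimator is being implemented.

```python
from math import sqrt


def correlation_matrix(data):
    """Sample correlation matrix; data is a list of observations (rows)."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    sd = [sqrt(cov[j][j]) for j in range(p)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(p)] for i in range(p)]


def shrink(corr, lam):
    """Shrink toward the identity: (1 - lam) * R + lam * I."""
    p = len(corr)
    return [[(1 - lam) * corr[i][j] + (lam if i == j else 0.0) for j in range(p)]
            for i in range(p)]


def determinant(m, tol=1e-12):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    det = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < tol:
            return 0.0  # numerically singular
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det
        det *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return det


# Toy data: 3 observations (rows) of 4 variables (columns), so p > n.
TOY = [[1.0, 2.0, 0.5, 3.0], [2.0, 1.0, 1.5, 2.5], [3.0, 4.0, 1.0, 1.0]]
R = correlation_matrix(TOY)
print(determinant(R))               # singular sample correlation matrix
print(determinant(shrink(R, 0.2)))  # shrunk matrix is invertible
```

Every eigenvalue of the shrunk matrix is at least λ, which is why it can be inverted even when the sample matrix cannot.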

Page 23: PROGRESS REVIEW Mike Langston’s Research Team

Roumyana Kirova
mRNA expressions and Linkage

Gene expression data: N genes, K strains

Probe   BXD1  BXD2  BXD3  BXD5  BXD8  …
1       4.46  5.30  5.80  5.51  4.90  ...
2       4.10  4.49  4.24  4.06  4.46  ...
3       5.15  4.74  5.04  6.10  5.20  ...
4       6.45  6.03  5.79  6.56  7.32  ...
5       4.06  5.06  4.35  4.09  4.09  ...
…
12000   4.16  4.06  5.37  5.28  5.31  …

[Histogram: series aa and AA; x-axis bins from 4 to 7.5 and "More"; y-axis 0 to 0.7.]

Polymorphisms

Marker  BXD1  BXD2  BXD3
M1      AA    aa    AA
M2      AA    AA    AA
M3      aa    AA    AA
M4      AA    aa    AA
M5      AA    AA    aa
…
M3000   aa    aa    AA

Page 24: PROGRESS REVIEW Mike Langston’s Research Team

Model: QTL mapping

$$y = \sum_{i=1}^{m} q_i x_i + \sum_{i,j}^{l} \gamma_{ij} x_i x_j + e$$

$y$: expression levels

$x_i$: 2 if AA, 0 if aa

$e$: error terms

$$\mathrm{LOD} = \log_{10} \frac{P(\mathrm{Model},\ q \neq 0)}{P(\mathrm{Model},\ q = 0)}$$
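As a worked example of the LOD idea, the sketch below handles only the simplest case: one marker, no interaction terms, normal errors, and an intercept (the intercept is an assumption). Under those assumptions the log10 likelihood ratio reduces to (n/2) log10(RSS0/RSS1), where RSS0 is the residual sum of squares with q = 0 and RSS1 is the residual sum of squares with q fitted. The data and the function name are hypothetical.

```python
from math import log10


def lod_single_marker(y, x):
    """Fit y = b0 + q*x + e by least squares and compare with the q = 0 model.

    y: expression levels, x: genotype codes (2 if AA, 0 if aa).
    Returns (q, LOD), where LOD = (n / 2) * log10(RSS0 / RSS1).
    Requires both genotypes to be present and a fit that is not exact.
    """
    n = len(y)
    my, mx = sum(y) / n, sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    q = sxy / sxx
    rss0 = sum((yi - my) ** 2 for yi in y)  # null model: mean only
    rss1 = rss0 - q * sxy                   # model with the marker effect
    return q, (n / 2) * log10(rss0 / rss1)


# Toy data: six strains, alternating genotypes.
X = [2, 0, 2, 0, 2, 0]
Y = [5.1, 4.0, 5.3, 4.2, 4.9, 4.1]
Q, LOD = lod_single_marker(Y, X)
print(Q, LOD)
```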

Page 25: PROGRESS REVIEW Mike Langston’s Research Team

[Diagram: C1, Paraclique1, Regulatory Model 1; C2 (Clique 2), Paraclique2, Regulatory Model 2. Expressions: 0.46 0.30 0.80 1.51 0.90]

Page 26: PROGRESS REVIEW Mike Langston’s Research Team

[Figure: for Regulatory ID 2840 and Regulatory ID 267, a correlation histogram (Frequency vs. res) and a plot of res vs. paraclique members.]

Page 27: PROGRESS REVIEW Mike Langston’s Research Team

[Diagram, two panels. QTL Model 1: C1, C2, Paraclique1, Principal components. QTL Model 2: C1, C2, Paraclique2, Principal components.]

Page 28: PROGRESS REVIEW Mike Langston’s Research Team

[Diagram: QTL Model 1 and QTL Model 2, each with Principal components; Common QTL; Meta component.]

Page 29: PROGRESS REVIEW Mike Langston’s Research Team

Open questions:

1. How stable are the paracliques and QTL models if we choose different samples (not the average of the replicates)?
• generate samples of the data by randomly choosing replicates and build confidence intervals.
• fit a multi variance model: Expression ~ Strain + Sample + Strain:Sample
• add covariates in the QTL model to adjust for the gender effect.

2. Power issues: how many strains, how many replicates, and how many terms in the model?
• simulate expression data and calculate power as a function of the sample size.

3. Parametric vs non-parametric analysis.

4. Multiple tests adjustments.
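For the power question, one way to "simulate expression data and calculate power as a function of the sample size" is a small Monte Carlo loop like the one below. It is a sketch under loud assumptions: two genotype groups of equal size, unit-variance normal noise, a fixed effect size, and a two-sided test that uses the normal critical value 1.96 (which is liberal for small samples). None of these choices come from the slides.

```python
import random
from math import sqrt


def simulated_power(n_per_group, effect, n_sims=500, z_crit=1.96, seed=0):
    """Fraction of simulated datasets in which the genotype effect is detected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        aa = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        big = [rng.gauss(effect, 1.0) for _ in range(n_per_group)]  # AA group
        m1, m2 = sum(aa) / n_per_group, sum(big) / n_per_group
        v1 = sum((v - m1) ** 2 for v in aa) / (n_per_group - 1)
        v2 = sum((v - m2) ** 2 for v in big) / (n_per_group - 1)
        z = (m2 - m1) / sqrt(v1 / n_per_group + v2 / n_per_group)
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sims


# Power as a function of sample size for a hypothetical effect of 1 SD.
for n in (5, 10, 20, 40):
    print(n, simulated_power(n, effect=1.0))
```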

Page 30: PROGRESS REVIEW Mike Langston’s Research Team

Ontological Discovery for Ethanol Research
(…the new acronym stinks)

Elissa J. Chesler
Department of Anatomy and Neurobiology
Center for Genomics and Bioinformatics
University of Tennessee-Memphis
Health Science Center

Page 31: PROGRESS REVIEW Mike Langston’s Research Team

Ontological Discovery for Ethanol Research

SPECIFIC AIMS

• Aim 1: To develop a data archive of ethanol, brain and behavior related gene sets that have been derived both empirically and through literature review.

• Aim 2: To develop a tool that allows cross-species, cross-molecule type gene set comparison.

• Aim 3: To develop a Web interface to the data archive and analysis that is aimed toward behavioral neuroscientists.

[Figure: The Seizure Related Phenotype Landscape, with regions labeled Pressure, Audiogenic, EtOH Withdrawal, Cocaine & PTZ, T4, ATPases. Caption: Highly related phenotypes share many common mRNA correlates.]

Page 32: PROGRESS REVIEW Mike Langston’s Research Team

Ontological Discovery from Phenotype Centered Gene Sets

ERGO:

Ethanol Related Gene Ontology

Phenotypes are operationally defined, based on phenomenology.

Gene sets can be empirically associated with phenotypes.

But what really “IS” the underlying construct? Can we identify it by examining shared biological substrates of related processes?

Page 33: PROGRESS REVIEW Mike Langston’s Research Team

AIM 1: Gene set assembly and archive

• Gene set is broadly defined.
– mRNA differential expression
– mRNA correlation
– Literature review
– KO, mutants with trait effects

• Search
– by gene
– by descriptor
– by set matching

• Attributes of each gene set include:
    • Type (mRNA, lit, protein)
    • Species
    • Free text description
    • Structured description, e.g. MPO
    • Source DB (GO, KEGG, WebQTL100)
    • Associated document (e.g. abstract, publication)

Page 34: PROGRESS REVIEW Mike Langston’s Research Team

Aim 2: Analytic tool

• Translates gene sets to a common reference species via homology.

• Similar to existing tools, but archives more information about each gene set.

• Allows multiple set comparisons (intersection analyses are not limited to two sets).

• Percent positive matching allows estimation of the relation of gene sets without specific regard to the identity of genes. This provides a basis for clustering phenotypes based on gene annotation.
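A rough sketch of the first and third bullets: translate each gene set to a reference species through a homology table, then intersect any number of sets. The translation table, the gene sets, and the function names are toy assumptions, and dropping genes that have no homologue is only one possible policy.

```python
def translate(gene_set, homology):
    """Map genes to the reference species; genes without a homologue are dropped."""
    return {homology[g] for g in gene_set if g in homology}


def intersect_all(gene_sets):
    """Intersection of any number of gene sets (not limited to two)."""
    sets = [set(s) for s in gene_sets]
    return set.intersection(*sets) if sets else set()


# Toy mouse-to-human translation table (hypothetical, for illustration only).
MOUSE_TO_HUMAN = {"Htr1b": "HTR1B", "Drd2": "DRD2", "Gabra2": "GABRA2"}

MOUSE_SET = {"Htr1b", "Drd2", "Xyz1"}      # Xyz1 has no entry in the table
HUMAN_SET_1 = {"HTR1B", "GABRA2"}
HUMAN_SET_2 = {"HTR1B", "DRD2", "GABRA2"}

TRANSLATED = translate(MOUSE_SET, MOUSE_TO_HUMAN)
print(sorted(TRANSLATED))
print(sorted(intersect_all([TRANSLATED, HUMAN_SET_1, HUMAN_SET_2])))
```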

Page 35: PROGRESS REVIEW Mike Langston’s Research Team

GeneKeyDB can be used to generate translation tables across species

Page 36: PROGRESS REVIEW Mike Langston’s Research Team

Aim 3: Behavioral Neuroscience Friendly interface

• Does the world need another boutique?
• Making genomics accessible to broader research community.
• Text searching to retrieve, e.g. all gene sets related to ‘stress’.
• Text mining
• Apparatus specific details
• OUR GOAL IS TO CREATE A TOOL FOR PHENOTYPIC ANALYSIS, GENES CAN BE A BLACK BOX THAT GETS US THERE!

Page 37: PROGRESS REVIEW Mike Langston’s Research Team

Future Directions
Bleeding Edge

From a matrix of set-set correlations estimated by Jaccard's positive match, can we draw and analyze graphs of gene set relations?

From a set of documents associated with overlapping gene sets, can we mine text for frequently occurring terms? e.g. to answer “What term most commonly occurs in the set of sets extracted by match to expression upregulation in response to handling stress?”
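The first question can be made concrete with a small sketch: compute the Jaccard coefficient (shared genes divided by genes in either set) for every pair of gene sets and keep the pairs above a cutoff as graph edges. The gene set names, their contents, and the 0.25 cutoff are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient: |A intersect B| / |A union B| (0.0 for two empty sets)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def gene_set_graph(gene_sets, min_similarity):
    """Return edges (set1, set2, similarity) for pairs at or above the cutoff."""
    edges = []
    for n1, n2 in combinations(sorted(gene_sets), 2):
        s = jaccard(gene_sets[n1], gene_sets[n2])
        if s >= min_similarity:
            edges.append((n1, n2, s))
    return edges


# Toy gene sets with placeholder gene names.
GENE_SETS = {
    "handling_stress": {"geneA", "geneB", "geneC"},
    "social_isolation": {"geneB", "geneC", "geneD"},
    "consumption": {"geneX"},
}
EDGES = gene_set_graph(GENE_SETS, min_similarity=0.25)
print(EDGES)
```

The resulting edge list is the kind of gene set graph the slide asks about, and could be passed to any graph analysis step.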

Page 38: PROGRESS REVIEW Mike Langston’s Research Team

Research challenges

• Translation of genes across species:
– Homology is not perfect; how do we match when no homologues are found?

• Reference Set
– What is the “reference set” for category representation analysis when gene sets are drawn from diverse sources?
– Reference sets are not comprehensive, e.g. a list of KO mice does not include all genes screened.

• Generation and curation of gene sets: establishing meaningful protocols and definitions to increase the quality and utility
– Use GenMAPP or Stanford models.

Page 39: PROGRESS REVIEW Mike Langston’s Research Team

Gene set overlap unites diverse phenomena

[Diagram: overlapping gene sets labeled Consumption Correlates in RI lines; Gene Expression Correlates of Htr1b; Upregulated in Social Isolation; Upregulated in P vs NP; Literature on Neuroactive Steroid Synthesis; ontology.]

Induction of a research question:

“If I antagonize the gene product of consumption correlate in socially isolated monkeys, consumption will decrease.”

“Hey, you put your social isolation in my NP mice!
Yeah, well you put your P mice in my binge drinking!”