
PROGRESS REVIEW

Mike Langston’s Research Team

Department of Computer Science
University of Tennessee

with collaborative efforts at

Oak Ridge National Laboratory

November 22, 2005

Team Members in Attendance

Bhavesh Borate, Elissa Chesler, John Eblen, Roumyana Kirova, Mike Langston, Andy Perkins, Yun Zhang

Team Members Absent

Xinxia Peng, Jon Scharff, Josh Steadmon

Mike Langston’s Progress Report
Fall, 2005

• Team Changes
– New Students: Belma Ford (GST), Peter Shaw (Australia)
– New Colleagues: Elissa Chesler & Roumyana Kirova (ORNL)
– Graduating Soon: Xinxia Peng (December), Jon Scharff (May)
– Moved Collaborators: Jay Snoddy (Vanderbilt)

• Recent Conferences/Talks
– ACiD (England), Dagstuhl (Germany), COCOON (China), Purdue, Supercomputing (Seattle)

• Upcoming Visits/Talks
– RECOMB WS (San Diego), Texas A&M, Carleton (Canada), Göteborg (Sweden), AICCSA-06 (UAE), ACM SAC (France)

• Support
– NIH (John), ORNL (Yun), Science Alliance (Andy)
– Proposals Outstanding

• Sample Projects
– Eukaryotes: Allergy (Human), Diabetes (Mice), IR (Mice), Neuroscience (Mice), others
– Prokaryotes: Operon (R. palustris), Shock (Shewanella)

Yun Zhang

• Recent conferences/talks
– Prepared slides for Cocoon05, China
– Presented at SC05 (SuperComputing), Seattle

• Upcoming events
– Cray MTA (Multithreaded Architecture) Workshop, ORNL

• Projects: maximal clique enumeration
– Comparisons of multithreaded implementations on
  • Altix vs. Cray vs. IBM
  • Cray: Vectorization of for-loops
– Implementations on distributed-memory machines
  • Using MPI vs. Global Arrays
  • Load-balancing using master/slave vs. peer-to-peer model
– Comparison of MPI vs. Multithreaded

Parallel Clique Enumeration

• Objective
– Minimize data communication vs. maximize balanced load

• Dynamic load balancing (DLB)
– Data transfer: peer-to-peer
– DLB strategies: master/slave vs. peer-to-peer

[Figure: search tree with levels k = 1 through k = 5. A task needed to be transferred from slave1 to slave5.]
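To illustrate why dynamic load balancing matters when search-tree branches differ widely in size, here is a minimal single-process simulation. It is a sketch under stated assumptions, not the team's MPI code: task costs are hypothetical, communication cost is ignored, and the names `static_makespan` and `dynamic_makespan` are invented for illustration. The "dynamic" variant mimics a master handing the next task to whichever slave becomes idle first.

```python
import heapq

def static_makespan(costs, workers):
    """Round-robin assignment fixed in advance; returns the busiest worker's load."""
    loads = [0] * workers
    for i, cost in enumerate(costs):
        loads[i % workers] += cost
    return max(loads)

def dynamic_makespan(costs, workers):
    """Master/slave style: each task goes to the slave that becomes idle first."""
    # Heap entries are (time at which the slave is free, slave id).
    heap = [(0, w) for w in range(workers)]
    heapq.heapify(heap)
    for cost in costs:
        free_at, w = heapq.heappop(heap)
        heapq.heappush(heap, (free_at + cost, w))
    return max(t for t, _ in heap)

# Hypothetical subtree sizes: two heavy branches among many light ones.
costs = [8, 1, 1, 1, 8, 1, 1, 1]
print(static_makespan(costs, 2), dynamic_makespan(costs, 2))  # prints "18 11"
```

In a real distributed run the gain has to be weighed against the data transferred per task, which is the trade-off named in the objective above.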

Clique Enumeration

• Methods to speed up the computation core
• Bit compression to save memory, and corresponding bitwise operations on compressed bitmaps

    a b c d e f g
a   0 1 1 1 1 0 0
b   1 0 1 1 1 1 0
c   1 1 0 1 1 1 1
d   1 1 1 0 1 0 1
e   1 1 1 1 0 0 1
f   0 1 1 0 0 0 1
g   0 0 1 1 1 1 0

(a, b, c, d): 0 0 0 0 1 0 0

[Figure: graph on vertices a to g; bitmap diagram with axes Vertices and Cliques, regions labeled sparse and dense.]
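The bitmap idea can be sketched with Python integers standing in for compressed bit vectors. This is an illustrative sketch only; the helper names are hypothetical and the bit layout (vertex a in the lowest bit) is an assumption. ANDing the adjacency rows of a clique's members gives the bitmap of their common neighbors, and a clique is maximal when that bitmap is zero. With the 7-vertex matrix from the slide, the rows of a, b, c and d AND down to 0 0 0 0 1 0 0, i.e. only e.

```python
VERTICES = "abcdefg"
ROWS = [
    "0111100",  # a
    "1011110",  # b
    "1101111",  # c
    "1110101",  # d
    "1111001",  # e
    "0110001",  # f
    "0011110",  # g
]

def row_to_bitmap(row):
    """Pack a 0/1 string into an int; column i is stored in bit i."""
    bits = 0
    for i, ch in enumerate(row):
        if ch == "1":
            bits |= 1 << i
    return bits

ADJ = {v: row_to_bitmap(r) for v, r in zip(VERTICES, ROWS)}

def common_neighbors(clique):
    """Bitwise AND of the adjacency bitmaps of every vertex in the clique."""
    bits = (1 << len(VERTICES)) - 1
    for v in clique:
        bits &= ADJ[v]
    return bits

def bitmap_to_vertices(bits):
    return [v for i, v in enumerate(VERTICES) if (bits >> i) & 1]

def is_maximal(clique):
    """A clique is maximal when no vertex is adjacent to all of its members."""
    return common_neighbors(clique) == 0

print(bitmap_to_vertices(common_neighbors("abcd")))  # ['e']
print(is_maximal("abcde"))                           # True
```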

Andy Perkins

Projects

• Low dose
• Allergy
• Shewanella
• HRT

Microarray Data

• Normalization
• Filtering low or unchanging expression values
• Control spots

Differential Analysis

• Cliquification
– In a large percent of cliques in one group and few in the other.

• Expression
– 2-fold change in expression between groups

• Correlation
– Correlation value >= 0.85 in one group and <= 0.25 in the other.

Differential Analysis

Red edge: >=0.85 in dose and <= 0.25 in control

Blue edge: >= 0.85 in control and <= 0.25 in dose
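The edge-coloring rule can be written down directly. The sketch below is a minimal illustration: the function name and the example correlation values are hypothetical, and it applies the cutoffs to the signed correlation value exactly as stated on the slide (whether absolute values were used is not stated in the source).

```python
def classify_edge(r_dose, r_control, high=0.85, low=0.25):
    """Return 'red', 'blue', or None for one gene pair.

    red:  correlation >= high in dose and <= low in control
    blue: correlation >= high in control and <= low in dose
    """
    if r_dose >= high and r_control <= low:
        return "red"
    if r_control >= high and r_dose <= low:
        return "blue"
    return None

# Hypothetical gene pairs: (pair, correlation in dose, correlation in control)
pairs = [
    (("g1", "g2"), 0.91, 0.10),
    (("g1", "g3"), 0.20, 0.88),
    (("g2", "g3"), 0.90, 0.86),
]
for pair, r_dose, r_control in pairs:
    print(pair, classify_edge(r_dose, r_control))
```

Only the first two pairs produce an edge; the third is highly correlated in both groups and therefore not differential.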

Other research

• Thresholding
• Pearson’s vs Spearman’s
• Random graphs

Papers

• “Computational Analysis of Mass Spectrometry Data Using Novel Combinatorial Methods,” Proceedings, ACS/IEEE International Conference on Computer Systems and Applications, Dubai, United Arab Emirates, March, 2006, with A. Fadiel, M. A. Langston, F. Naftolin, X. Peng, P. Pevsner, H.S. Talor, O. Tuncalp, and D. Vitello.

• “Innovative Computational Methods for Transcriptomic Data Analysis,” Proceedings, ACM Symposium on Applied Computing, Dijon, France, April, 2006, with M. A. Langston, A. M. Saxton, J. A. Scharff and B. H. Voy.

John Eblen

Clique Analysis Tool Chain

• Projects
– Gerling Data – NOD mice
– Shewanella Data

• Three Interesting Problems
– Aggregating Maximal Cliques
– Thresholding
– Biological Analysis of Clique Results

Aggregating Maximal Cliques

• The Problem
– A great deal of overlap among maximal cliques
– Many cliques differ by only a few nodes

• Solutions
– Paraclique (Dr. Langston)
– Nucleated Clique (Jon Scharff)
– Clique Difference or “Nonoverlap”
– Others

Direct Maximum Clique

• Parallel version scales well on Altix supercomputer, shared memory machines

• Currently working on base serial code efficiency

• Ultimate goal is speed
– Best algorithm possible
– Smart implementation(s)

Keller 7 Conjecture

• Goal is to find or prove nonexistence of 128-clique in Keller 7 graph

• Current approach
– Found set of 128 nonoverlapping independent sets (ISs)
– Currently searching for more
– Should greatly reduce search space
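For readers unfamiliar with the graph in question, here is a small brute-force construction, assuming the standard definition of a Keller graph: vertices are d-tuples over {0, 1, 2, 3}, and two vertices are adjacent when they differ in at least two coordinates and differ by exactly 2 (mod 4) in at least one. The function names are hypothetical. A clique can contain at most one vertex from each independent set, which is why a collection of independent sets can shrink the search.

```python
from itertools import combinations, product

def keller_adjacent(u, v):
    """Adjacent if >= 2 coordinates differ and some difference is 2 mod 4."""
    diffs = [(a - b) % 4 for a, b in zip(u, v)]
    return sum(d != 0 for d in diffs) >= 2 and 2 in diffs

def keller_graph(d):
    """Brute-force construction; only practical for small d."""
    vertices = list(product(range(4), repeat=d))
    edges = [(u, v) for u, v in combinations(vertices, 2) if keller_adjacent(u, v)]
    return vertices, edges

vertices, edges = keller_graph(2)
print(len(vertices), len(edges))  # prints "16 40"
```

The dimension-7 graph has 4^7 = 16384 vertices, so the all-pairs loop above is meant only to make the definition concrete, not to be used at that scale.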

Bhavesh Borate

Thresholding

• GO Pairwise Similarity Analysis
• Percentage of Cliques with Biological Meaning at each threshold
• Confidence Intervals
• Graph Properties (Edge Density, Maximal Cliques, Maximum Clique)
• Spectral Graph Theory
• Bayesian Statistics
• Control Spot Threshold verification
• Utilization of Info from Pathway Databases
• Combinatorial Strategy
• Kentucky Windage ;)

Graph of GO-Pairwise Scores vs. Correlation Values

Shewanella data

[Chart: Avg functional similarity (y-axis, 0 to 0.14) vs. Correlation (x-axis, 0 to 1.2)]

For each pair of genes, we find a GO category X that covers both the genes and has the minimum number of total genes

Get a GO score for each pair of genes

Accumulate pairs by correlation score in bins 1, 0.99, 0.98, …, 0

Average the GO scores of pairs in each bin.

Plot.
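The steps above can be sketched as follows. This is a minimal illustration with hypothetical names and toy data. How a GO score is derived from the smallest covering category is not stated on the slide, so the sketch only locates that category and then bins and averages scores that are supplied as input.

```python
from collections import defaultdict

def smallest_common_category(gene_a, gene_b, categories):
    """Return (name, size) of the smallest category containing both genes, or None."""
    covering = [(len(members), name) for name, members in categories.items()
                if gene_a in members and gene_b in members]
    if not covering:
        return None
    size, name = min(covering)
    return name, size

def average_score_by_bin(pairs):
    """pairs: iterable of (correlation, go_score); bins are 0.01 wide."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for corr, score in pairs:
        idx = round(corr * 100)  # nearest 0.01 bin
        sums[idx] += score
        counts[idx] += 1
    return {idx / 100: sums[idx] / counts[idx] for idx in sorted(counts)}

# Toy categories and scores, purely illustrative.
categories = {"X1": {"g1", "g2", "g3", "g4"}, "X2": {"g1", "g2"}, "X3": {"g3"}}
print(smallest_common_category("g1", "g2", categories))  # ('X2', 2)
print(average_score_by_bin([(0.991, 0.1), (0.99, 0.3), (0.5, 0.2)]))
```

The resulting dictionary maps each bin value to the mean score of the pairs that fell into it, which is the series to plot.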

GO Pairwise Similarity Analysis

Pairwise Scores → Score for each Clique

Get P-value for each Clique

For each threshold 0.8:0.01:0.95

At each threshold calculate % Cliques with P-value < 0.01
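A minimal sketch of the threshold sweep, assuming the clique P-values at each threshold have already been computed by some other step. The function names and the example P-values are hypothetical.

```python
def thresholds(start=0.80, stop=0.95, step=0.01):
    """The sweep written as 0.8:0.01:0.95 on the slide (16 values)."""
    n = round((stop - start) / step)
    return [round(start + i * step, 2) for i in range(n + 1)]

def percent_significant(p_values, alpha=0.01):
    """Percentage of cliques whose P-value is below alpha."""
    if not p_values:
        return 0.0
    return 100.0 * sum(p < alpha for p in p_values) / len(p_values)

# Hypothetical P-values for the cliques found at two thresholds.
p_values_by_threshold = {
    0.80: [0.001, 0.2, 0.005, 0.5],
    0.85: [0.001, 0.002, 0.3],
}
for t in thresholds():
    if t in p_values_by_threshold:
        print(t, percent_significant(p_values_by_threshold[t]))
```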

Updates from Xinxia

• Kevin was born in May
• Defended in October
• Graduating in December
• Working on publications
• Starting a job in December

Thank you all and Keep in touch!

Suman Duvvuru

Data analysis

• Effect of Strain: Currently working on Dr. Brynn’s mouse strain data; I am writing SAS code to see which strain produces strong correlation in the data.

• The problem with microarray data

1. The number of variables is much higher than the number of observations. This causes many eigenvalues of the covariance matrix to be 0, so the correlation matrix is problematic.

2. Can be corrected using
• shrinkage-based correlation
• information-criteria-based methods (using smooth covariance estimators)
• (Implementation of these methods is currently in progress)
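To make the problem concrete, the sketch below builds a correlation matrix from fewer observations than variables, shows that it is singular, and then shrinks it toward the identity. This is an illustration under assumptions, not the implementation referred to on the slide: the data are toy values, the shrinkage weight `lam` is fixed by hand (published shrinkage estimators choose it from the data), and all function names are hypothetical.

```python
import math

def correlation_matrix(rows):
    """rows: list of observations, each a list of p variable values."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    sd = [math.sqrt(cov[j][j]) for j in range(p)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(p)] for i in range(p)]

def shrink(corr, lam):
    """Shrink toward the identity: (1 - lam) * R + lam * I."""
    p = len(corr)
    return [[(1 - lam) * corr[i][j] + (lam if i == j else 0.0) for j in range(p)]
            for i in range(p)]

def determinant(m):
    """Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, det = len(a), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
    return det

# Toy data: 3 observations of 4 variables, so the sample matrix is rank-deficient.
rows = [[4.0, 2.0, 1.0, 3.0], [5.0, 1.0, 2.5, 3.5], [6.5, 2.5, 2.0, 5.0]]
raw = correlation_matrix(rows)
print(determinant(raw))               # numerically zero: singular
print(determinant(shrink(raw, 0.2)))  # strictly positive: invertible
```

Because the sample correlation matrix is positive semidefinite, every eigenvalue of the shrunk matrix is at least `lam`, which is what restores invertibility.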

Roumyana Kirova

mRNA expressions and Linkage

Gene expression data: N genes, K strains

Probe   BXD1  BXD2  BXD3  BXD5  BXD8  …
1       4.46  5.30  5.80  5.51  4.90  ...
2       4.10  4.49  4.24  4.06  4.46  ...
3       5.15  4.74  5.04  6.10  5.20  ...
4       6.45  6.03  5.79  6.56  7.32  ...
5       4.06  5.06  4.35  4.09  4.09  ...
…
12000   4.16  4.06  5.37  5.28  5.31  …

[Histogram with series aa and AA; x-axis bins from 4 to 7.5 and “More”, y-axis 0 to 0.7]

Marker  BXD1  BXD2  BXD3
M1      AA    aa    AA
M2      AA    AA    AA
M3      aa    AA    AA
M4      AA    aa    AA
M5      AA    AA    aa
…
M3000   aa    aa    AA

Polymorphisms

Model: QTL mapping

$$y = \sum_{i=1}^{m} q_i x_i + \sum_{i,j}^{l} q_{ij} x_i x_j + e$$

– $y$: expression levels
– $x_i$: 2 if AA, 0 if aa
– $e$: error terms

$$\mathrm{LOD} = \log_{10} \frac{P(\text{Model},\ q \neq 0)}{P(\text{Model},\ q = 0)}$$
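For the simplest case of a single marker and no interaction terms, the LOD score can be sketched as below. Assumptions are labeled: the model includes an intercept, errors are normal, and the LOD is computed from residual sums of squares as (n/2)·log10(RSS_null / RSS_full), the usual form for marker regression. The function name and the six-strain example are hypothetical.

```python
import math

def lod_single_marker(expression, genotypes):
    """LOD for y = mean + q*x + e versus q = 0, with x = 2 for AA and 0 for aa."""
    x = [2.0 if g == "AA" else 0.0 for g in genotypes]
    n = len(expression)
    mean_x = sum(x) / n
    mean_y = sum(expression) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, expression))
    rss_null = sum((yi - mean_y) ** 2 for yi in expression)
    if sxx == 0 or rss_null == 0:
        return 0.0  # marker or trait does not vary: no evidence either way
    q = sxy / sxx                     # least-squares estimate of the marker effect
    rss_full = rss_null - q * sxy     # residual sum of squares with the marker
    if rss_full <= 0:
        return math.inf               # perfect fit
    return (n / 2) * math.log10(rss_null / rss_full)

# Hypothetical expression values for one probe across six strains.
y = [4.0, 5.0, 4.2, 5.1, 3.9, 5.2]
g = ["aa", "AA", "aa", "AA", "aa", "AA"]
print(round(lod_single_marker(y, g), 2))
```

A marker whose genotype groups have identical mean expression gives a LOD of 0.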

[Diagram: C1 / Paraclique1 / Regulatory Model 1 and C2 (Clique 2) / Paraclique2 / Regulatory Model 2. Expressions: 0.46 0.30 0.80 1.51 0.90]

[Figure: Regulatory ID 2840. Panels: correlation histogram (Frequency vs. res) and res across paraclique members.]

[Figure: Regulatory ID 267. Panels: correlation histogram (Frequency vs. res) and res across paraclique members.]

[Diagram: QTL Model 1 (C1, Paraclique1, C2) and QTL Model 2 (C1, Paraclique2, C2); principal components; QTL Model 1 and QTL Model 2 lead to a Common QTL and a Meta component.]

Open questions:

1. How stable are the paracliques and QTL models if we choose different samples (not the average of the replicates)?
• Generate samples of the data by randomly choosing replicates, and build confidence intervals.
• Fit a multi variance model: Expression ~ Strain + Sample + Strain:Sample.
• Add covariates to the QTL model to adjust for the gender effect.

2. Power issues: how many strains, how many replicates, and how many terms in the model?
• Simulate expression data and calculate power as a function of the sample size.

3. Parametric vs. non-parametric analysis.

4. Multiple-test adjustments.
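The replicate-resampling idea in question 1 might look like the sketch below. It is one possible reading, not an established protocol of the team: each draw picks one replicate per strain at random instead of the replicate average, recomputes a statistic (here the Pearson correlation of two genes across strains), and a percentile interval is read off the draws. All names and data are hypothetical.

```python
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def resampled_interval(gene_a, gene_b, draws=1000, level=0.95, seed=0):
    """gene_a, gene_b: {strain: [replicate values]} measured on the same arrays.

    Returns a (low, high) percentile interval for the correlation.
    """
    rng = random.Random(seed)
    strains = sorted(gene_a)
    stats = []
    for _ in range(draws):
        # Same replicate index for both genes: they come from the same sample.
        pick = {s: rng.randrange(len(gene_a[s])) for s in strains}
        x = [gene_a[s][pick[s]] for s in strains]
        y = [gene_b[s][pick[s]] for s in strains]
        stats.append(pearson(x, y))
    stats.sort()
    low = stats[int((1 - level) / 2 * draws)]
    high = stats[int((1 + level) / 2 * draws) - 1]
    return low, high

# Hypothetical data: five strains, two replicates each.
gene_a = {"S1": [4.4, 4.6], "S2": [5.2, 5.4], "S3": [5.7, 5.9], "S4": [5.4, 5.6], "S5": [4.8, 5.0]}
gene_b = {"S1": [4.0, 4.3], "S2": [4.6, 4.4], "S3": [4.9, 5.1], "S4": [4.7, 4.5], "S5": [4.2, 4.5]}
print(resampled_interval(gene_a, gene_b))
```

A narrow interval would suggest the correlation, and hence clique membership at a given threshold, is stable under the choice of replicate.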

Ontological Discovery for Ethanol Research
(…the new acronym stinks)

Elissa J. Chesler
Department of Anatomy and Neurobiology
Center for Genomics and Bioinformatics
University of Tennessee-Memphis
Health Science Center

Ontological Discovery for Ethanol Research

SPECIFIC AIMS

• Aim 1: To develop a data archive of ethanol, brain and behavior related gene sets that have been derived both empirically and through literature review.

• Aim 2: To develop a tool that allows cross-species, cross-molecule type gene set comparison.

• Aim 3: To develop a Web interface to the data archive and analysis that is aimed toward behavioral neuroscientists.

The Seizure Related Phenotype Landscape

Highly related phenotypes share many common mRNA correlates

[Figure labels: Pressure, Audiogenic, EtOH Withdrawal, Cocaine & PTZ, T4, ATPases]

Ontological Discovery from Phenotype Centered Gene Sets

ERGO:

Ethanol Related Gene Ontology

Phenotypes are operationally defined, based on phenomenology.

Gene sets can be empirically associated with phenotypes.

But what is the underlying construct, really? Can we identify it by examining shared biological substrates of related processes?

AIM 1: Gene set assembly and archive

• Gene set is broadly defined.
– mRNA differential expression
– mRNA correlation
– Literature review
– KO, mutants with trait effects

• Search
– by gene
– by descriptor
– by set matching

• Attributes of each gene set include:
– Type (mRNA, lit, protein)
– Species
– Free text description
– Structured description, e.g. MPO
– Source DB (GO, KEGG, WebQTL100)
– Associated document (e.g. abstract, publication)
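One way to picture an archive record and the three kinds of search is the sketch below. It is purely illustrative: the class, field names, and example entries are hypothetical and do not describe the project's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GeneSet:
    name: str
    set_type: str                 # "mRNA", "lit", or "protein"
    species: str
    description: str              # free text
    genes: frozenset
    structured_terms: tuple = ()  # e.g. MPO terms
    source_db: str = ""           # e.g. "GO", "KEGG"
    document: str = ""            # e.g. abstract or publication reference

def search_by_gene(archive, gene):
    return [gs.name for gs in archive if gene in gs.genes]

def search_by_descriptor(archive, text):
    text = text.lower()
    return [gs.name for gs in archive if text in gs.description.lower()]

def search_by_set(archive, query_genes, min_shared=1):
    """Set matching: archive entries sharing at least min_shared genes with the query."""
    query = frozenset(query_genes)
    return [gs.name for gs in archive if len(gs.genes & query) >= min_shared]

# Hypothetical entries; the gene memberships are made up for illustration.
archive = [
    GeneSet("handling_stress_up", "mRNA", "mouse",
            "Upregulated in response to handling stress", frozenset({"g1", "g2", "g3"})),
    GeneSet("htr1b_correlates", "mRNA", "mouse",
            "Expression correlates of Htr1b", frozenset({"Htr1b", "g2", "g4"})),
]
print(search_by_gene(archive, "g2"))
print(search_by_descriptor(archive, "stress"))
print(search_by_set(archive, {"g2", "g4"}, min_shared=2))
```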

Aim 2: Analytic tool

• Translates gene sets to a common reference species via homology.

• Similar to existing tools, but archives more information about each gene set

• Allows multiple set comparisons (intersection analyses are not limited to two sets).

• Percent positive matching allows estimation of the relation of gene sets w/o specific regard to identity of genes. This allows a basis for clustering phenotypes based on gene annotation

GeneKeyDB can be used to generate translation tables across species
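The three operations in this aim (translation to a reference species, intersection of more than two sets, and percent positive matching) can be sketched as follows. Assumptions are labeled: the translation table is a plain dictionary with made-up identifiers rather than GeneKeyDB output, genes without a homologue are simply dropped, and percent positive matching is taken to be a Jaccard-style overlap, which the slides suggest but do not define.

```python
def translate(gene_set, table):
    """Map genes to reference-species identifiers; genes with no homologue are dropped."""
    return {table[g] for g in gene_set if g in table}

def intersect_all(gene_sets):
    """Intersection across any number of gene sets."""
    sets = [set(s) for s in gene_sets]
    return set.intersection(*sets) if sets else set()

def percent_positive_match(a, b):
    """Shared genes as a percentage of all genes in either set."""
    union = a | b
    return 100.0 * len(a & b) / len(union) if union else 0.0

# Hypothetical rat-to-mouse translation table and gene sets.
rat_to_mouse = {"r1": "m1", "r2": "m2", "r3": "m3"}
rat_set = {"r1", "r2", "r9"}        # r9 has no homologue in the table
mouse_set_a = {"m1", "m2", "m4"}
mouse_set_b = {"m2", "m5"}

rat_as_mouse = translate(rat_set, rat_to_mouse)
print(sorted(intersect_all([rat_as_mouse, mouse_set_a, mouse_set_b])))  # ['m2']
print(percent_positive_match(rat_as_mouse, mouse_set_a))
```

Because the match score depends only on set overlap, phenotypes can be compared without inspecting which specific genes are shared.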

Aim 3: Behavioral Neuroscience Friendly interface

• Does the world need another boutique?
• Making genomics accessible to the broader research community.
• Text searching to retrieve, e.g., all gene sets related to ‘stress’.
• Text mining
• Apparatus-specific details
• OUR GOAL IS TO CREATE A TOOL FOR PHENOTYPIC ANALYSIS; GENES CAN BE A BLACK BOX THAT GETS US THERE!

Future Directions

Bleeding Edge

From a matrix of set-set correlations estimated by Jaccard’s positive match, can we draw and analyze graphs of gene set relations?

From a set of documents associated with overlapping gene sets, can we mine text for frequently occurring terms? E.g., to answer “What term occurs most commonly in the set of sets extracted by match to expression upregulation in response to handling stress?”
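Both questions can be prototyped in a few lines. The sketch below is hypothetical throughout (names, threshold, gene sets, and document text are invented): it turns pairwise Jaccard similarities into graph edges above a cutoff, and counts terms across the documents attached to a group of gene sets.

```python
from collections import Counter
from itertools import combinations

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def gene_set_graph(gene_sets, threshold):
    """Edges (name_a, name_b, similarity) for every pair at or above the threshold."""
    edges = []
    for (name_a, a), (name_b, b) in combinations(sorted(gene_sets.items()), 2):
        s = jaccard(a, b)
        if s >= threshold:
            edges.append((name_a, name_b, s))
    return edges

def most_common_terms(documents, stopwords=frozenset(), top=3):
    """Naive whitespace tokenization; a real pipeline would normalize far more."""
    counts = Counter(w for doc in documents for w in doc.lower().split()
                     if w not in stopwords)
    return counts.most_common(top)

# Hypothetical gene sets and associated document text.
gene_sets = {
    "isolation_up": {"g1", "g2", "g3"},
    "consumption": {"g2", "g3", "g4"},
    "unrelated": {"g8", "g9"},
}
print(gene_set_graph(gene_sets, threshold=0.5))
docs = ["stress alters steroid synthesis", "handling stress and steroid levels"]
print(most_common_terms(docs, stopwords=frozenset({"and"}), top=2))
```

Once the edges exist, the same clique and paraclique tools used on gene correlation graphs could in principle be applied to the graph of gene sets.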

Research challenges

• Translation of genes across species:
– Homology is not perfect. How do we match when no homologues are found?

• Reference Set
– What is the “reference set” for category representation analysis when gene sets are drawn from diverse sources?
– Reference sets lack comprehensiveness, e.g. a list of KO mice does not include all genes screened.

• Generation and curation of gene sets: establishing meaningful protocols and definitions to increase quality and utility
– Use GenMAPP or Stanford models.

Gene set overlap unites diverse phenomena

[Diagram: overlapping gene sets: Consumption Correlates in RI lines; Gene Expression Correlates of Htr1b; Upregulated in Social Isolation; Upregulated in P vs NP; Literature on Neuroactive Steroid Synthesis; ontology]

Induction of a research question:

“If I antagonize the gene product of a consumption correlate in socially isolated monkeys, consumption will decrease.”

“Hey, you put your social isolation in my NP mice!
Yeah, well you put your P mice in my binge drinking!”
