Efficient Clustering of Large EST Data Sets on Parallel Computers CECS 694-04 Bioinformatics Journal Club September 17, 2003 Nucleic Acids Research, 2003, 31(11), 2963- 2974 Presented by Elizabeth Cha


Efficient Clustering of Large EST Data Sets on Parallel Computers

CECS 694-04 Bioinformatics Journal Club

September 17, 2003

Nucleic Acids Research, 2003, 31(11), 2963-2974

Presented by Elizabeth Cha


Problem Statement

We are given an EST database from a single species, where multiple EST sequences may belong to the same gene.

We want to find an efficient algorithm to cluster EST sequences, so that all EST sequences in a cluster belong to a single gene. (It’s possible to have more than one cluster for a gene.)


Efficient Algorithm Considerations

Memory efficiency: reduce the memory required so that it is linear in the size of the input

Computational efficiency without sacrificing the quality of clustering

Reduction of run-time of clustering large EST data sets by parallel processing (e.g. MPI)


EST database (dbEST)

Expressed Sequence Tag (EST) representations provide a dynamic view of genome content and expression

> 5 million human ESTs

> 3.5 million mouse ESTs

Reference information: dbEST

(ncbi.nlm.nih.gov/dbEST/dbEST_summary.html)


What is EST?

A unique DNA sequence derived from a cDNA library.

An EST is typically about 200 to 500 nucleotides long.

ESTs are generated by sequencing either one or both ends of an expressed gene.

The EST can be mapped, by a combination of genetic mapping procedures, to a unique locus in the genome and serves to identify that gene locus.


An overview of the process of protein synthesis

Image adopted from http://ncbi.nlm.nih.gov/About/primer/est.html


An overview of how ESTs are generated.

Image adopted from ncbi.nlm.nih.gov/About/primer/est.html


Current Problems in dbEST

Imposing size of EST database

Low sequence quality

Highly similar (but distinct) gene family members

Chimeric cDNA clones

Retained introns and alternatively spliced transcripts

Incomplete gene coverage

Other limitations


Types of alternative splicing

Skipped exons

Retained introns

Alternative donor or acceptor site

Image adopted from Trends in Genetics, 2002, 18(1), 53-57


How to solve the problems

Remove the redundancy by clustering ESTs representing the same native transcripts

Current software for clustering ESTs:

UniGene

STACK (Sequence Tag Alignment and Consensus Knowledgebase)

HGI (Human Gene Index)

TIGR Assembler

CAP3

Phrap


Goals of clustering ESTs

Each cluster represents a distinct gene, including all alternative transcript isoforms derived from the same gene (e.g. UniGene).

Each cluster is deemed to represent a distinct mRNA transcript (e.g. CAP3, TIGR Assembler, Phrap).

ESTs are first categorized by their RNA source and are subsequently clustered separately for each source sample (e.g. STACK).


Ideas to get evidential gene or transcript

1. Pairwise sequence alignment with dynamic programming algorithm

2. Fast identification of promising pairs with good quality overlap

3. Report pairs based on maximal common substrings


PaCE (Parallel Clustering of ESTs)

A software program for EST clustering on parallel computers

Two reasons this combination enables clustering and assembly of large-scale EST data sets:

Memory requirement grows linearly in the size of the input

The input size is reduced from the complete set of ESTs to the size of the biggest cluster


EST Clustering

Given:

ESTs drawn from multiple mRNAs

Partition:

The ESTs into clusters such that ESTs from the same gene are put together in a distinct cluster


EST Clustering (Cont’d)

Image adopted from the Summer Lecture of Dr. Aluru of Iowa State Univ.


EST Clustering Algorithm

Initially, treat each EST as a cluster by itself

If two ESTs from two different clusters show significant overlap, merge the clusters

Output the clusters once finished
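The three steps above can be sketched in a few lines of Python. This is only an illustration, not PaCE's implementation: `has_overlap` is a hypothetical stand-in for a real alignment test (here, an exact suffix-prefix match of at least `min_len` characters), and the sketch tests every pair, which PaCE is designed to avoid.

```python
from itertools import combinations

def has_overlap(a: str, b: str, min_len: int = 4) -> bool:
    """Toy overlap test (assumption): True if a suffix of one string
    equals a prefix of the other, with length >= min_len."""
    for x, y in ((a, b), (b, a)):
        for k in range(min(len(x), len(y)), min_len - 1, -1):
            if x[-k:] == y[:k]:
                return True
    return False

def cluster(ests, overlap=has_overlap):
    # Initially, treat each EST as a cluster by itself.
    label = list(range(len(ests)))
    for i, j in combinations(range(len(ests)), 2):
        if label[i] == label[j]:
            continue  # already in the same cluster, nothing to test
        if overlap(ests[i], ests[j]):
            # Merge: relabel every member of j's cluster with i's label.
            old, new = label[j], label[i]
            label = [new if lab == old else lab for lab in label]
    # Output the clusters once finished.
    clusters = {}
    for idx, lab in enumerate(label):
        clusters.setdefault(lab, []).append(idx)
    return sorted(clusters.values())

ests = ["ACGTACGGA", "CGGATTCA", "TTCAGGGT", "GGCCTTAA", "TTAACCGG"]
print(cluster(ests))  # [[0, 1, 2], [3, 4]]
```

ESTs 0 and 2 never overlap directly; they end up together because each overlaps EST 1.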


EST Clustering (Cont’d)

Merging Clusters

Successful overlap results in:

Image adopted from the Summer Lecture of Dr. Aluru of Iowa State Univ.


Determining Overlaps

Compute only lower and upper rectangles

Do banded dynamic programming
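As a rough illustration of banded dynamic programming, the sketch below computes an edit distance while filling only the cells within `band` of the main diagonal. The unit-cost edit distance and the function name are assumptions made for the example; the slides do not give the scoring scheme PaCE uses.

```python
def banded_edit_distance(a: str, b: str, band: int):
    """Edit distance restricted to cells with |i - j| <= band.
    Returns None if the final cell lies outside the band."""
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    INF = float("inf")
    # Row 0: only the first band + 1 cells are inside the band.
    prev = [j if j <= band else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        lo, hi = max(0, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            if j == 0:
                cur[j] = i  # delete the first i characters of a
                continue
            sub = prev[j - 1] + (a[i - 1] != b[j - 1])  # match or substitute
            dele = prev[j] + 1                          # delete from a
            ins = cur[j - 1] + 1                        # insert into a
            cur[j] = min(sub, dele, ins)
        prev = cur
    return prev[m]

print(banded_edit_distance("ACGTACGT", "ACGAACGT", 2))   # 1 (one substitution)
print(banded_edit_distance("ACGTACGT", "ACGTTACGT", 2))  # 1 (one insertion)
```

Each row touches at most 2 * band + 1 cells, so the work drops from O(nm) to O(n * band). The result equals the true edit distance whenever that distance is at most `band`.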

Image adopted from the Summer Lecture of Dr. Aluru of Iowa State Univ.


Maximal Common Substring

Given: a set of strings

Find: pairs of strings that have a maximal common substring of length ≥ a threshold φ
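A brute-force version of this pair-finding problem is easy to state. The sketch below is an assumption-laden illustration: it compares every pair with a quadratic dynamic program, which is exactly the cost a suffix-tree approach is meant to avoid. A pair has a maximal common substring of length ≥ φ exactly when its longest common substring has length ≥ φ, so the longest one is what gets measured. The function names are hypothetical.

```python
from itertools import combinations

def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest substring shared by a and b (O(|a||b|) DP)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common substring ending at a[i-2], b[j-2].
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def promising_pairs(strings, phi):
    """Return (length, i, j) for pairs whose longest common substring
    has length >= phi, longest first."""
    pairs = []
    for i, j in combinations(range(len(strings)), 2):
        length = longest_common_substring(strings[i], strings[j])
        if length >= phi:
            pairs.append((length, i, j))
    return sorted(pairs, reverse=True)

strings = ["GATTACAGG", "TTACAGGCC", "CCGGAATT", "ACGT"]
print(promising_pairs(strings, 3))  # [(7, 0, 1), (3, 0, 2)]
```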

Image adopted from the Summer Lecture of Dr. Aluru of Iowa State Univ.


Organization of PaCE

1. Build a distributed representation of the GST data structure in parallel

2. Use a single processor to handle maintaining and updating the EST clusters


Generalized Suffix Tree (GST)

A GST for a set of n sequences is a suffix tree constructed using all suffixes of the n sequences.
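To make the definition concrete, here is a minimal sketch that inserts all suffixes of all sequences into one uncompressed trie and records which sequences pass through each node. It is a simplification, not a true GST: a real generalized suffix tree compresses single-child paths so that its size is linear in the input, whereas this trie can be quadratic. The node layout and function names are assumptions for illustration.

```python
def build_generalized_suffix_trie(seqs):
    """Uncompressed trie over every suffix of every sequence.
    Each node stores the ids of the sequences that pass through it."""
    root = {"children": {}, "ids": set()}
    for sid, s in enumerate(seqs):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "ids": set()})
                node["ids"].add(sid)
    return root

def shared_substrings(root, min_len):
    """Yield (substring, ids) for every path of length >= min_len
    that occurs in at least two sequences."""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["children"].items():
            if len(child["ids"]) >= 2:
                label = path + ch
                if len(label) >= min_len:
                    yield label, frozenset(child["ids"])
                stack.append((child, label))

seqs = ["GATTACA", "TTACAGG", "CCGG"]
root = build_generalized_suffix_trie(seqs)
for sub, ids in sorted(shared_substrings(root, 4)):
    print(sub, sorted(ids))
```

Because every substring is a prefix of some suffix, a node visited by two sequences corresponds to a substring they share. Sequences 0 and 1 share "TTACA" and its substrings here.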


Basic Concept of Suffix Tree

A suffix tree is a data structure that exposes the internal structure of a string in a deeper way than does the fundamental preprocessing.


Definition of Suffix Tree

1. A suffix tree T for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m.

2. Each internal node has at least 2 children and each edge is labeled with a nonempty substring of S.

3. No 2 edges out of a node can have edge-labels beginning with the same character.


Definition of Suffix Tree (Cont’d)

4. Key feature:

for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i.
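The four properties can be checked on a small example. The sketch below builds a compressed suffix tree naively, in O(m^2) time, by inserting one suffix at a time and splitting edges on a mismatch. It assumes the string ends with a unique terminator `$` (counted as part of the string) so that no suffix is a prefix of another. The class and function names are hypothetical.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}  # first character -> (edge label, child Node)
        self.leaf = leaf    # 1-based suffix start for leaves, else None

def build_suffix_tree(s: str) -> Node:
    """Naive construction: insert each suffix, splitting edges as needed."""
    assert s.endswith("$") and s.count("$") == 1, "needs a unique terminator"
    root = Node()
    for start in range(len(s)):
        node, rest = root, s[start:]
        while True:
            first = rest[0]
            if first not in node.children:
                # No edge starts with this character: add a new leaf edge.
                node.children[first] = (rest, Node(leaf=start + 1))
                break
            label, child = node.children[first]
            k = 0  # length of the common prefix of label and rest
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                node, rest = child, rest[k:]  # edge fully matched, descend
                continue
            # Mismatch inside the edge: split it at offset k.
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            mid.children[rest[k]] = (rest[k:], Node(leaf=start + 1))
            node.children[first] = (label[:k], mid)
            break
    return root

def leaves(node, path=""):
    """Yield (suffix start, string spelled from the root) for each leaf."""
    if node.leaf is not None:
        yield node.leaf, path
    for label, child in node.children.values():
        yield from leaves(child, path + label)

s = "gaac$"
tree = build_suffix_tree(s)
for i, spelled in sorted(leaves(tree)):
    print(i, spelled)
```

For `gaac$` the tree has five leaves, and leaf i spells the suffix starting at position i, which is the key feature stated above.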


Ukkonen’s Algorithm to Construct a Suffix Tree

Construct tree I1 (it is just the single edge labeled by character S(1))
for i = 1 to m-1 do
begin {phase i+1}
    for j = 1 to i+1
    begin {extension j}
        Find the end of the path from the root labeled S[j..i] in the current tree.
        If needed, extend that path by adding character S(i+1).
    end;
end;
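The phase and extension loops above can be run directly on an uncompressed trie. This is a deliberately naive sketch: it walks each path from the root, so it costs O(m^3), whereas Ukkonen's actual algorithm reaches linear time with suffix links and related shortcuts that are omitted here. The nested-dict representation and function names are assumptions for illustration.

```python
def implicit_suffix_trie(s: str) -> dict:
    """High-level phase/extension scheme on an uncompressed trie.
    Nodes are nested dicts keyed by character."""
    root = {}
    if not s:
        return root
    root[s[0]] = {}                    # tree I1: the single edge S(1)
    for i in range(1, len(s)):         # phase i+1 adds character s[i]
        for j in range(0, i + 1):      # extension: 0-based start of the path
            node = root
            for ch in s[j:i]:          # find the end of the existing path
                node = node[ch]
            node.setdefault(s[i], {})  # extend by s[i] only if needed
    return root

def has_path(root: dict, sub: str) -> bool:
    """True if sub can be spelled starting from the root."""
    node = root
    for ch in sub:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = implicit_suffix_trie("gaac")
print(has_path(trie, "aac"), has_path(trie, "gc"))  # True False
```

After the last phase, every substring of the input is a path from the root, which is the invariant each phase maintains for the prefix processed so far.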


Suffix Tree

Construct a suffix tree of sequence gaac


Suffix Tree (Cont’d)

Image adopted from article (1999) Nucleic Acids Research, 27, 2369-2376


Main idea to use GST data structure

Image adopted from the Summer Lecture of Dr. Aluru of Iowa State Univ.

• Maximal Common Substring


Parallel Clustering

A master-slave paradigm is used.

Master processor: maintains and updates the clusters

Slave processors:

1. Generate pairs as demanded by the master processor

2. Perform pairwise alignments of the pairs dispatched by the master processor

Data structure for maintaining the clusters: union-find algorithm
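A union-find (disjoint-set) structure supports the two operations the master needs: find which cluster an EST belongs to, and merge two clusters. The sketch below is a standard textbook version with union by size and path compression, not PaCE's code. The pair list and the assumption that every alignment succeeds are made up for the example.

```python
class UnionFind:
    def __init__(self, n: int):
        self.parent = list(range(n))  # each element starts as its own root
        self.size = [1] * n

    def find(self, x: int) -> int:
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a: int, b: int) -> bool:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False               # already in the same cluster
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra            # attach the smaller tree to the larger
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True

uf = UnionFind(5)
pairs = [(0, 1), (1, 2), (0, 2), (3, 4)]  # hypothetical promising pairs
aligned = 0
for a, b in pairs:
    if uf.find(a) == uf.find(b):
        continue       # already clustered together, skip the alignment
    aligned += 1       # stand-in for dispatching a pairwise alignment
    uf.union(a, b)     # assumption: the alignment showed a good overlap
print(aligned)  # 3
```

The pair (0, 2) is never aligned because ESTs 0 and 2 were already joined through EST 1. Skipping such pairs is how cluster bookkeeping saves alignment work.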


Software availability

PaCE is freely available for non-profit, academic use.

To request source code and executables, contact: [email protected]


Quality Assessment

Benchmark data set:

Arabidopsis thaliana

168,200 ESTs

Small genome (114.5 Mb / 125 Mb total), sequenced in 2000

Reference information:

http://www.arabidopsis.org/info/aboutarabidopsis.html


Achievements of PaCE

Reduce the worst-case memory requirement from quadratic to linear

Generate promising pairs in decreasing order of maximal common substring length and cluster the ESTs such that the number of pairwise alignments is reduced by an order of magnitude without affecting the quality of clustering

Reduce the number of duplicates generated for each promising pair


Future Research

Extend PaCE to do assembly and build consensus sequences in parallel

Incorporate quality values available for ESTs as part of the input

Ensure quality clustering and assembly


System used to implement

IBM xSeries cluster

30 dual-processor nodes

1.26 GHz Intel Pentium III processors

Connected by Myrinet

2.25 GB memory at each node

512 MB of RAM


Quality Assessment of PaCE and CAP3