sequence clustering

Advancing Science with DNA Sequence

Sequence Clustering

MGM Workshop, September 26, 2011

Reducing Search Space in Protein and DNA/RNA Sequence Analysis

Denis Kaznadzey, GBP


Sequence clustering

To deal with a huge variety of individual 'objects':

- Classify into groups of essentially similar objects
- When new data arrives, assign objects to existing groups
- Classify 'leftovers'
- Occasionally review the entire classification

Problem: What is 'essentially similar'?

• Finding properties that are important (ontological relevancy)
• Does classification reflect reality in any way?


Sequence clustering

Taxonomical Classification vs. Continuity of the Great Chain of Being

Even if reductionist, classification is a tool to study the world, and biology in particular.

When data is incomplete, any classification is a convention. At the same time, it is an approximation of a “reality”.

(Portraits: Carl Linnaeus; Georges Buffon)


Sequence clustering

In modern biology, the most abundant type of data is sequence:

• Genomic DNA
• RNA (through RNA-Seq)
• Derived proteins

The primary feature is primary structure, but classification criteria depend on the application.


Sequence Clustering

Select Applications in Genomic Sciences:

Genome assembly: binning, scaffolding

Transcriptomics: EST (read) clustering

Protein function and evolution studies: protein families

Phylogenetic profiling: OTUs


Sequence Clustering

In metagenomics, the primary tasks are:

• Assess diversity
• Find genes
• Predict functions
• Predict pathways
• Estimate capabilities

All are based on sequence comparison.


Sequence Clustering

- Any clustering is based on a distance in some metric.

- Initial clustering is based on pair-wise distances.

- Subsequent classification is based on the distance from an object to a cluster, measured against:
  - A representative
  - A set of representatives (all members, in the extreme case)
  - Another measure, possibly unrelated to the initial one
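The difference between the first two options can be shown with a minimal sketch. The Hamming distance, the choice of the first member as representative, and all function names below are illustrative assumptions, not part of the talk.

```python
from typing import Callable, List

Dist = Callable[[str, str], int]

def hamming(a: str, b: str) -> int:
    """Toy distance: number of mismatched positions (equal-length strings)."""
    return sum(x != y for x, y in zip(a, b))

def dist_to_representative(obj: str, cluster: List[str], dist: Dist) -> int:
    """Distance to a single representative (assumed here: the first member)."""
    return dist(obj, cluster[0])

def dist_to_all_members(obj: str, cluster: List[str], dist: Dist) -> int:
    """Extreme case: every member is a representative; use the closest one."""
    return min(dist(obj, member) for member in cluster)

def assign(obj: str, clusters: List[List[str]], dist: Dist, cluster_dist) -> int:
    """Index of the cluster nearest to obj under the chosen cluster distance."""
    return min(range(len(clusters)),
               key=lambda i: cluster_dist(obj, clusters[i], dist))

clusters = [["AAAA", "AATT"], ["CCCC", "TTTT"]]
by_rep = assign("TTTA", clusters, hamming, dist_to_representative)
by_all = assign("TTTA", clusters, hamming, dist_to_all_members)
```

In this toy case the two rules disagree: comparing only against representatives places "TTTA" in the first cluster, while comparing against all members places it in the second. The all-members rule costs one comparison per member instead of one per cluster.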


Sequence Clustering

Once a distance measure is chosen and distances are obtained or computed:

• There is a HUGE variety of clustering methods (clustering / classification is a very elaborate methodology)
  • K-means, average linkage, complete linkage, single linkage, iterative, SOM, etc.
• However, options for large-volume clustering are limited by the performance of the algorithms.
  • Single linkage can be computed very efficiently.
  • (The method for pledging new sequences to clusters may be computationally more intense.)


Sequence clustering

Most efficient clustering: transitive-closure based.

• Requires 'boolean' distances (two sequences can be linked or not linked)
• Requires the number of nodes to be known
• Space ~ number of nodes
• Run-time (worst) ~ number of edges * average cluster size
• Run-time (average) ~ number of edges * log2 (average cluster size)


Sequence clustering

Practical Transitive Closure algorithm:

Allocate an array of sequence numbers A [0..N], initialized so that A [n] = n

Phase I: connect linked vertices through the vertex of smallest index

    For each edge (m, n):
        While A [n] != n:
            n = A [n]
        While A [m] != m:
            m = A [m]
        A [max (m, n)] = min (m, n)

Phase II: propagate smallest indices as cluster identifiers

    For each n from 0 to N:
        If A [n] != A [A [n]]:
            A [n] = A [A [n]]

Phase III: collect clusters (implementation dependent)

    Count the number of distinct cluster "id"s => M (1 pass)
    Allocate an array of sizes; count the size of each cluster (1 pass)
    Allocate an array of clusters; fill it in (1 pass)

Example (7 sequences, contents of A after each step):

    initial:    0 1 2 3 4 5 6
    + (1,3):    0 1 2 1 4 5 6
    + (5,6):    0 1 2 1 4 5 5
    + (6,1):    0 1 2 1 4 1 5
    Phase II:   0 1 2 1 4 1 1

    Clusters:   (0); (1,3,5,6); (2); (4)
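The three phases can be sketched in Python as follows. This is a direct, unoptimized transcription under the assumption that vertices are numbered 0..N-1 and edges arrive as pairs of indices; the function name and the use of a dictionary in Phase III are illustrative choices, since the slide leaves that phase implementation dependent.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

def transitive_closure_clusters(
    n_nodes: int, edges: Iterable[Tuple[int, int]]
) -> Tuple[List[int], List[List[int]]]:
    # A[n] == n means "n is currently the smallest index of its cluster".
    a = list(range(n_nodes))

    # Phase I: link the two chain heads through the smaller index.
    for m, n in edges:
        while a[n] != n:
            n = a[n]
        while a[m] != m:
            m = a[m]
        a[max(m, n)] = min(m, n)

    # Phase II: every pointer goes to a smaller index, so one ascending
    # pass is enough to make each entry point at its cluster identifier.
    for n in range(n_nodes):
        a[n] = a[a[n]]

    # Phase III: group vertices by cluster identifier.
    groups = defaultdict(list)
    for n, cluster_id in enumerate(a):
        groups[cluster_id].append(n)
    return a, sorted(groups.values())

ids, clusters = transitive_closure_clusters(7, [(1, 3), (5, 6), (6, 1)])
```

Run on the edges of the example above, `ids` holds the final row of the table and `clusters` holds the four groups listed under it.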



Sequence clustering

Computing 'boolean' distances:

• Threshold-based
• Additional rules (match arrangement)

Example: read/EST clustering uses % identity + length + arrangement.
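A boolean link rule of this kind might look like the sketch below. The thresholds, the `Hit` record, and the reading of 'arrangement' as "the match must run to an end of one of the two sequences on each side" are all assumptions made for illustration; the talk does not specify them.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """One pairwise alignment; coordinates are 0-based, end-exclusive."""
    identity: float  # percent identity over the aligned region
    aln_len: int     # alignment length
    q_start: int
    q_end: int
    q_len: int
    s_start: int
    s_end: int
    s_len: int

def linked(hit: Hit, min_identity: float = 95.0, min_len: int = 100,
           max_overhang: int = 10) -> bool:
    """True if the two sequences should be joined by an edge.

    All three default thresholds are hypothetical values.
    """
    if hit.identity < min_identity or hit.aln_len < min_len:
        return False
    # Arrangement: on each side, at least one sequence must end where the
    # alignment ends (overlap or containment, not an internal match).
    left_unaligned = min(hit.q_start, hit.s_start)
    right_unaligned = min(hit.q_len - hit.q_end, hit.s_len - hit.s_end)
    return left_unaligned <= max_overhang and right_unaligned <= max_overhang

overlap = Hit(97.0, 120, 380, 500, 500, 0, 120, 400)     # end-to-start overlap
internal = Hit(99.0, 120, 200, 320, 500, 150, 270, 400)  # match inside both
weak = Hit(80.0, 120, 380, 500, 500, 0, 120, 400)        # low identity
```

Here only `overlap` passes the rule. `internal` has high identity but the match sits in the middle of both sequences, which is the pattern a shared repeat or domain would produce.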


Computing similarity measure:

- Edit distance or (ungapped) statistics P-value: BLAST, FASTA, needle, water, etc.
- Adjusted edit distance through progressive alignment: Clustal, MUSCLE, T-Coffee
- K-mer statistics: CD-HIT, USEARCH, MUSCLE
- Suffix trees (and probabilistic suffix trees): MUMmer, REPuter, CLUSEQ
- Suffix arrays: Bowtie, BWT
- Position-specific scoring matrix: PSI-BLAST, IMPALA
- Hidden Markov Models: HMMER, HHsearch/HHpred, SAM
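As an illustration of the k-mer family of measures, the sketch below computes the Jaccard similarity of two k-mer sets. It is a generic textbook measure chosen for brevity, not the statistic actually used inside CD-HIT, USEARCH, or MUSCLE, and the default k = 3 is arbitrary.

```python
from typing import Set

def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_jaccard(a: str, b: str, k: int = 3) -> float:
    """Shared k-mers divided by total distinct k-mers (0.0 to 1.0)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0  # at least one sequence is shorter than k
    return len(ka & kb) / len(ka | kb)

same = kmer_jaccard("ACGTACGT", "ACGTACGT")
half = kmer_jaccard("ACGTA", "ACGTT")
none = kmer_jaccard("ACGT", "TTTT")
```

No alignment is computed, which is why measures of this kind scale to large data sets; the price is lower sensitivity than alignment-based or profile-based measures.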


Sequence clustering

Distance computing is harder than clustering. (Bundled solutions: BLASTCLUST, CD-HIT, UCLUST, CLUSEQ)

- For large data sets, only k-mer and suffix array measures are practical.

- However, incremental / greedy approaches can be used to avoid computing the entire distance matrix. This makes the use of sensitive similarity measures possible.

- For boolean distance, iterative similarity detection is possible: fast binning -> slow comparing. (No off-the-shelf implementations(?))
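A greedy incremental scheme can be sketched as below: each sequence is compared only against the current cluster representatives, so the full pairwise matrix is never built. Processing sequences longest-first follows the general approach of tools such as CD-HIT, but this is a simplified illustration, and `similar` stands for any boolean similarity test the caller supplies.

```python
from typing import Callable, Dict, List, Sequence

def greedy_cluster(
    seqs: Sequence[str], similar: Callable[[str, str], bool]
) -> Dict[int, List[int]]:
    """Map representative index -> indices of all members (itself included)."""
    reps: List[int] = []
    members: Dict[int, List[int]] = {}
    # Longest sequences first, so representatives tend to cover their members.
    order = sorted(range(len(seqs)), key=lambda i: -len(seqs[i]))
    for i in order:
        for r in reps:
            if similar(seqs[r], seqs[i]):
                members[r].append(i)  # join the first matching cluster
                break
        else:
            reps.append(i)            # no match: start a new cluster
            members[i] = [i]
    return members

# Toy predicate, for demonstration only: same first three letters.
toy_similar = lambda a, b: a[:3] == b[:3]
result = greedy_cluster(["ACGTAC", "ACGT", "TTGA", "TTG", "ACG"], toy_similar)
```

The number of comparisons is bounded by (sequences x representatives) rather than (sequences squared), which is what leaves room for a slower, more sensitive `similar` test. The result depends on input order, a known property of greedy schemes.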


Sequence clustering

Boolean distance clustering killer: CLUSTER AGGREGATION.

In large clusters, even a small number of random links leads to huge conglomerates.

Common causes:

1) Contamination with standard constructs
2) Repeats
3) Chimeras
4) Spurious similarities (low complexity zones, etc.)


Sequence clustering

Fighting aggregation

- Vector / adapter trimming:
  - Lucy, Figaro, etc. Integrated in many assembly suites (Newbler, Velvet, AMOS, CLCbio, etc.)
- Low complexity detection / masking:
  - SEG, DUST, FastQC, WindowMasker, etc.; often integrated in search tools.


Sequence clustering

- Repeat detection / masking:
  - Regular (tandem) repeats:
    - Pre-search masking: based on structure (IMEx, SRF) or on a database (TRDB)
    - Post-search detection based on similarity properties (multiple parallel threads)
  - Irregular (long) repeats:
    - Database-based: RepeatMasker
    - De novo: RepeatScout, orrb, PILER, etc. These require a genome as input and construct a database.


Sequence clustering

Detecting chimeric sequences:

• Abundance-based: Perseus, UCHIME
  • Chimeras undergo fewer amplification cycles, so chimera segments in their native arrangement are more frequent.
• Specific to 16S: ChimeraSlayer, Bellerophon
  • Chimera 'arms' are closer to their originating phyla than the entire chimera.


Sequence clustering

Detecting chimeric sequences:

• Similarity coverage based: Mira assembler


Sequence clustering

Detecting chimeric sequences:

• Similarity graph topology based: dchim

(Figure panels: alignment view; connectivity view)


Protein clusters: various criteria

- Primary structure similarity
- Close evolutionary relationship
- Similarity in physical properties
- 3-D structure similarity
- Similar fold arrangement
- Domain structure similarity
- Common or similar functions
- etc.


Sequence clustering

Functional and structural classifications in IMG


Sequence clustering

Direct similarity measure by edit distance is not sensitive enough for evolutionarily distant species.

Position-specific scoring matrices and profile HMMs provide better sensitivity, but are MUCH SLOWER.

For individual genomes (10^3 to 5x10^4 proteins), they could be used with massively parallel computations (while the number of genomes is within thousands).

For metagenomes, they cannot be used with foreseeable computing resources.


Sequence clustering

Functional annotation of metagenome genes through protein clusters (under development):

- Build a set of functionally homogeneous clusters of similar proteins for annotated genomes.
- Build HMMs for each cluster; compose a model database.
- Pledge metagenome proteins to clusters by matching them to models.
- Cluster unpledged proteins, build models, update the model database.
- Balance the model database by creating a model tree: aggregating small related clusters and dissecting large ones.
- Perform hierarchical searches through the profile tree.


Sequence clustering

Clustering reduces search space, but adds another level of indirection, which is a source of errors, and complexity, which consumes effort.

Clustering improves only searches within the parameter space used for clustering (structure-based clusters are not useful when searching for a certain codon usage, etc.).


However, for proteins, which form dense relationship networks, clustering is a great tool.


Thank you!
