sequence clustering
DESCRIPTION
Sequence Clustering: Reducing Search Space in Protein and DNA/RNA Sequence Analysis. Denis Kaznadzey, GBP. MGM Workshop, September 26, 2011.
TRANSCRIPT
![Page 1: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/1.jpg)
Advancing Science with DNA Sequence
Sequence Clustering
MGM Workshop, September 26, 2011
Reducing Search Space in Protein and
DNA/RNA Sequence Analysis
Denis Kaznadzey, GBP
![Page 2: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/2.jpg)
Sequence clustering
To deal with a huge variety of individual ‘objects’:
- Classify into groups of essentially similar objects
- When new data arrives, assign objects to existing groups
- Classify ‘leftovers’
- Occasionally review entire classification
Problem: What is ‘essentially similar’?
• Finding properties that are important (ontological relevancy)
• Does classification reflect reality in any way?
![Page 3: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/3.jpg)
Sequence clustering
Taxonomical Classification vs.
Continuity of Great Chain of Being
Even if reductionist, classification is a tool to study the world, and biology in particular.
When data is incomplete, any classification is a convention. At the same time, it is an approximation of a “reality”.
Carl Linnaeus Georges Buffon
![Page 4: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/4.jpg)
Sequence clustering
In Modern Biology:
Most abundant type of data is sequence:
• Genomic DNA
• RNA (through RNA-Seq)
• Derived proteins
Primary feature is primary structure, but
- Classification criteria depend on application.
![Page 5: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/5.jpg)
Sequence Clustering
Select Applications in Genomic Sciences:
Genome Assembly: Binning, Scaffolding
Transcriptomics: EST (read) clustering
Protein Function and Evolution studies: Protein families
Phylogenetic profiling: OTUs
![Page 6: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/6.jpg)
Sequence Clustering
In Metagenomics:
Primary tasks:
• Assess diversity
• Find genes
• Predict functions
• Predict pathways
• Estimate capabilities
Based on sequence comparison.
![Page 7: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/7.jpg)
Sequence Clustering
- Any clustering is based on the distance in some metric.
- Initial clustering is based on pair-wise distances.
- Subsequent classification is based on distances from object to clusters:
  - Representative
  - Set of representatives (all members, at the extreme)
  - Other measure, which may be unrelated to the initial one.
![Page 8: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/8.jpg)
Sequence Clustering
When distance measure is chosen, and distances are obtained / computed:
• There is a HUGE variety of clustering methods (clustering / classification is a very elaborate methodology)
  • K-means, average linkage, complete linkage, single linkage, iterative, SOM, etc.
• However, options for large-volume clustering are limited due to performance of algorithms.
  • Single-linkage can be computed very efficiently
  • (Method for pledging new sequences to clusters may be computationally more intense)
![Page 9: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/9.jpg)
Sequence clustering
Most efficient clustering: transitive-closure based.
• Requires ‘boolean’ distances (two sequences can be linked or not linked)
• Requires number of nodes to be known
• Space ~ NodesNo
• Run-time (worst) ~ EdgesNo * AveClustSize
• Run-time (average) ~ EdgesNo * log2(AveClustSize)
![Page 10: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/10.jpg)
Sequence clustering
Practical Transitive Closure algorithm:
Allocate array of sequence numbers A [0..N-1]; initialize A [n] = n

Phase I: connect linked vertices through vertex of smallest index
    For each edge (m, n):
        While A [n] != n:
            n = A [n]
        While A [m] != m:
            m = A [m]
        A [max (m, n)] = min (m, n)

Phase II: propagate smallest indices as cluster identifiers
    For each n from 0 to N-1:
        If A [n] != A [A [n]]:
            A [n] = A [A [n]]

Phase III: collect clusters. (Implementation dependent)
    Count number of distinct cluster “id”s => M (1 pass)
    Allocate array of sizes; count size of each cluster (1 pass)
    Allocate array of clusters; fill it in (1 pass)

Example (7 sequences, array A after each step):
    initial    0 1 2 3 4 5 6
    +(1,3)     0 1 2 1 4 5 6
    +(5,6)     0 1 2 1 4 5 5
    +(6,1)     0 1 2 1 4 1 5
    Phase II   0 1 2 1 4 1 1
    Clusters:  (0); (1,3,5,6); (2); (4)
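The three phases above can be sketched in Python as follows. This is a minimal illustration rather than the presenter's implementation; the function name `transitive_closure_clusters` and the dictionary used for Phase III are assumptions made for the example.

```python
def transitive_closure_clusters(n_nodes, edges):
    """Single-linkage clusters from boolean links, following the slide's scheme."""
    # A[i] == i marks a cluster root; any other entry points to a smaller index.
    A = list(range(n_nodes))

    # Phase I: find both roots and attach the larger root to the smaller one.
    for m, n in edges:
        while A[n] != n:
            n = A[n]
        while A[m] != m:
            m = A[m]
        A[max(m, n)] = min(m, n)

    # Phase II: parents always have smaller indices, so a single ascending
    # pass leaves every entry pointing directly at its root.
    for i in range(n_nodes):
        A[i] = A[A[i]]

    # Phase III (implementation dependent): group members by root id.
    clusters = {}
    for i, root in enumerate(A):
        clusters.setdefault(root, []).append(i)
    return A, list(clusters.values())


labels, clusters = transitive_closure_clusters(7, [(1, 3), (5, 6), (6, 1)])
print(labels)    # [0, 1, 2, 1, 4, 1, 1]
print(clusters)  # [[0], [1, 3, 5, 6], [2], [4]]
```

The printed array matches the last row of the worked example. Phase II needs only one pass because Phase I never lets an entry point to a larger index.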
![Page 11: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/11.jpg)
Sequence clustering
Computing ‘boolean’ distances:
• Threshold-based
• Additional rules (match arrangement)
Example: read/EST clustering by % identity + length + arrangement:
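A boolean link rule of this kind might look like the sketch below. Everything specific here is an assumption for illustration: the `Hit` record, its coordinate fields, the threshold values, and the reading of "arrangement" as "the match must run to the sequence ends (overlap or containment) rather than sit in the middle of both sequences". Same-strand matches are assumed.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical pairwise match record; coordinates are 0-based, end-exclusive."""
    identity: float  # percent identity over the aligned region
    q_start: int
    q_end: int
    q_len: int
    s_start: int
    s_end: int
    s_len: int


def is_linked(hit, min_identity=95.0, min_length=40, max_overhang=10):
    """Return True if the two sequences should be joined by a boolean link."""
    aligned = min(hit.q_end - hit.q_start, hit.s_end - hit.s_start)
    if hit.identity < min_identity or aligned < min_length:
        return False
    # Arrangement rule: on each side, at least one sequence must (nearly) end
    # where the match ends, otherwise this is only an internal local match.
    left_overhang = min(hit.q_start, hit.s_start)
    right_overhang = min(hit.q_len - hit.q_end, hit.s_len - hit.s_end)
    return left_overhang <= max_overhang and right_overhang <= max_overhang


# End-to-start overlap: tail of the query matches head of the subject.
dovetail = Hit(98.0, 60, 100, 100, 0, 40, 120)
# Match buried in the middle of both sequences (e.g. a shared repeat).
internal = Hit(98.0, 30, 70, 100, 40, 80, 120)
print(is_linked(dovetail), is_linked(internal))  # True False
```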
![Page 12: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/12.jpg)
Computing similarity measure:
- Edit distance or (ungapped) statistics P-value: BLAST, FASTA, needle, water, etc.
- Adjusted edit distance through progressive alignment: Clustal, MUSCLE, T-Coffee
- K-mer statistics: CD-HIT, USEARCH, MUSCLE
- Suffix trees (and probabilistic suffix trees): MUMmer, REPuter, CLUSEQ
- Suffix arrays: Bowtie, BWT
- Position-specific scoring matrix: PSI-BLAST, IMPALA
- Hidden Markov Models: HMMER, HHsearch/HHpred, SAM
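As a rough illustration of the k-mer family of measures, the sketch below computes the Jaccard index of two sequences' k-mer sets. This is a generic stand-in, not the statistic used by any of the tools listed; the function names and the default `k=4` are assumptions.

```python
def kmer_set(seq, k=4):
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a, b, k=4):
    """Shared k-mers divided by all k-mers seen in either sequence (0.0 to 1.0)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0  # at least one sequence is shorter than k
    return len(ka & kb) / len(ka | kb)


print(kmer_jaccard("ACGTAC", "ACGTTT"))  # 0.2 (1 shared k-mer out of 5)
```

No alignment is computed, which is why measures of this kind scale to large data sets; the price is lower sensitivity for diverged sequences.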
![Page 13: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/13.jpg)
Sequence clustering
Distance computing is harder than clustering. (Bundled solutions: BLASTCLUST, CD-HIT, UCLUST, CLUSEQ)
- For large data sets only k-mer and suffix array measures are practical.
- However: incremental / greedy approaches can be used to avoid computing the entire distance matrix. This makes the use of sensitive similarity measures possible.
- For boolean distance, iterative similarity detection is possible: fast binning -> slow comparing. (No off-the-shelf implementations(?))
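The incremental / greedy idea can be sketched as follows: each sequence is compared only against current cluster representatives, so the full pairwise matrix is never built. The longest-first ordering and the toy k-mer containment test are assumptions chosen for the example; in practice `similar` would be whatever (possibly slow and sensitive) measure is affordable.

```python
def kmer_set(seq, k=4):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_similar(rep, seq, k=4, min_fraction=0.8):
    """Toy stand-in measure: fraction of seq's k-mers that occur in rep."""
    ks = kmer_set(seq, k)
    return bool(ks) and len(ks & kmer_set(rep, k)) / len(ks) >= min_fraction


def greedy_cluster(seqs, similar=toy_similar):
    """Map representative index -> member indices, comparing only to representatives."""
    # Process longest sequences first so they become the representatives.
    order = sorted(range(len(seqs)), key=lambda i: -len(seqs[i]))
    clusters = {}
    for i in order:
        for rep in clusters:
            if similar(seqs[rep], seqs[i]):
                clusters[rep].append(i)
                break
        else:
            clusters[i] = [i]  # nothing matched: start a new cluster
    return clusters


seqs = ["ACGTACGTAC", "ACGTACGT", "TTTTGGGGCC", "TTTTGGGG"]
print(greedy_cluster(seqs))  # {0: [0, 1], 2: [2, 3]}
```

The number of comparisons is bounded by sequences times representatives rather than sequences squared, at the cost of results that depend on processing order.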
![Page 14: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/14.jpg)
Sequence clustering
Boolean distance clustering killer: CLUSTER AGGREGATION.
In large clusters, even a small number of random links lead to huge conglomerates.
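A small constructed example shows why transitive closure is so fragile here: two families of 500 sequences each stay separate until a single spurious link joins them. The data and the helper `cluster_sizes` are invented for illustration only.

```python
from collections import Counter


def cluster_sizes(n_nodes, edges):
    """Sizes of single-linkage clusters, largest first (simple union-find)."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # shorten the path as we walk up
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)
    sizes = Counter(find(i) for i in range(n_nodes))
    return sorted(sizes.values(), reverse=True)


# Two unrelated families, each internally connected as a chain of true links.
family_a = [(i, i + 1) for i in range(0, 499)]
family_b = [(i, i + 1) for i in range(500, 999)]
true_links = family_a + family_b

print(cluster_sizes(1000, true_links))                 # [500, 500]
print(cluster_sizes(1000, true_links + [(123, 877)]))  # [1000]
```

One false link out of 999 is enough to merge everything, and the larger the clusters are, the more likely it is that some member carries such a link.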
![Page 15: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/15.jpg)
Common causes:
1) Contamination with standard constructs
2) Repeats
3) Chimeras
4) Spurious similarities (low complexity zones etc.)
![Page 16: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/16.jpg)
Sequence clustering
Fighting aggregation
- Vector / adapter trimming:
  - Lucy, Figaro, etc. Integrated in many assembly suites (newbler, velvet, AMOS, CLCbio, etc.)
- Low complexity detection / masking:
  - SEG, DUST, FastQC, WindowMasker etc. – often integrated in search tools.
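To give a feel for low-complexity masking, the sketch below replaces windows with low Shannon entropy by `N`. It is a simplified stand-in and does not reproduce the scoring used by SEG, DUST or WindowMasker; the window size and entropy cutoff are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits per symbol; 0 for a homopolymer, 2 for uniform ACGT."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=12, min_entropy=1.0):
    """Replace every window whose entropy is below min_entropy with 'N'."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < min_entropy:
            masked[i:i + window] = "N" * window
    return "".join(masked)


read = "ACGTACGTACGT" + "A" * 12 + "ACGTACGTACGT"
print(mask_low_complexity(read))
```

Masked positions can then be excluded when seeding or scoring matches, so that a poly-A tail alone cannot link two otherwise unrelated reads.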
![Page 17: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/17.jpg)
Sequence clustering
- Repeat detection / masking:
  - Regular (tandem) repeats:
    - Pre-search masking: based on structure (IMEx, SRF) or on database (TRDB)
    - Post-search detection based on similarity properties (multiple parallel threads)
  - Irregular (long) repeats:
    - Database based: RepeatMasker
    - De-novo: RepeatScout, orrb, PILER, etc. Require genome as input, construct database.
![Page 18: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/18.jpg)
Sequence clustering
Detecting chimeric sequences:
• Abundance-based: Perseus, UCHIME
  • Chimeras undergo fewer amplification cycles, so chimera segments in their native arrangement are more frequent.
• Specific to 16S: ChimeraSlayer, Bellerophon
  • Chimera ‘arms’ are closer to their originating phyla than the entire chimera is
![Page 19: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/19.jpg)
Sequence clustering
Detecting chimeric sequences
• Similarity coverage based: Mira assembler
![Page 20: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/20.jpg)
Sequence clustering
Detecting chimeric sequences
• Similarity graph topology based: dchim
[Figure panels: Alignment view; Connectivity view]
![Page 21: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/21.jpg)
Protein Clusters: various criteria
- Primary structure similarity
- Close evolutionary relationship
- Similarity in physical properties
- 3-D structure similarity
- Similar fold arrangement
- Domain structure similarity
- Common or similar functions
- etc.
![Page 22: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/22.jpg)
Sequence clustering
Functional and structural classifications in IMG
![Page 23: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/23.jpg)
Sequence clustering
Direct similarity measure by edit distance is not sensitive enough for evolutionarily distant species.
Position-specific scoring matrices and profile HMMs provide better sensitivity, but are MUCH SLOWER.
For individual genomes (10^3 to 5x10^4 proteins) they could be used with massively parallel computations (while the number of genomes is within thousands).
For metagenomes they cannot be used with foreseeable computing resources.
![Page 24: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/24.jpg)
Sequence clustering
Functional annotation of metagenome genes through protein clusters (under development):
- Build set of functionally homogeneous clusters of similar proteins – for annotated genomes
- Build HMMs for each cluster, compose model database
- Pledge metagenome proteins to clusters by matching to models
- Cluster unpledged proteins, build models, update model database.
- Balance model database by creating model tree: aggregating small relative clusters and dissecting large ones.
- Perform hierarchical searches through profiles tree.
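The last step, a hierarchical search through a tree of models, might be sketched as below. Since the pipeline is described as under development, this is only a guess at its shape: the `ModelNode` class, the 0.5 threshold and the toy sequences are invented, and k-mer overlap stands in for real profile-HMM scoring.

```python
from dataclasses import dataclass, field


def kmers(seq, k=3):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


@dataclass
class ModelNode:
    name: str
    profile: set  # stand-in for a profile HMM: the k-mers of the members
    children: list = field(default_factory=list)


def score(node, query_kmers):
    """Toy score: fraction of the query's k-mers covered by the node's profile."""
    if not query_kmers:
        return 0.0
    return len(query_kmers & node.profile) / len(query_kmers)


def pledge(root, seq, threshold=0.5):
    """Descend to the best-scoring child at each level; None means unpledged."""
    q = kmers(seq)
    node = root
    while node.children:
        best = max(node.children, key=lambda child: score(child, q))
        if score(best, q) < threshold:
            return None  # would be clustered with other unpledged proteins
        node = best
    return node.name


fam_a1 = ModelNode("family_A1", kmers("MKVLAAG"))
fam_a2 = ModelNode("family_A2", kmers("MKVLSSG"))
fam_b = ModelNode("family_B", kmers("GHTWEEP"))
super_a = ModelNode("super_A", fam_a1.profile | fam_a2.profile, [fam_a1, fam_a2])
root = ModelNode("root", set(), [super_a, fam_b])

print(pledge(root, "MKVLAAG"))  # family_A1
print(pledge(root, "GHTWEEP"))  # family_B
print(pledge(root, "QQQQQQQ"))  # None
```

Each query is scored only against the children along one path, so the number of model comparisons grows with tree depth and branching rather than with the total number of models.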
![Page 25: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/25.jpg)
Sequence clustering
Clustering reduces search space, but adds another level of indirection, which is a source of errors, and complexity, which consumes effort.
Clustering improves only searches within the parameter space used for clustering (structure-based clusters are not useful when searching for a certain codon usage, etc.)
![Page 26: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/26.jpg)
However, for proteins, which form dense relationship networks, clustering is a great tool.
![Page 27: Sequence Clustering](https://reader036.vdocuments.us/reader036/viewer/2022062501/56816686550346895dda33c1/html5/thumbnails/27.jpg)
Thank you!