![Page 1: Mosaic exploring reticulate protein family evolution UQ, COMBIO AU, Brisbane 02-03-09 Maetschke/Kassahn](https://reader036.vdocuments.us/reader036/viewer/2022062618/5513d78c5503463a298b5435/html5/thumbnails/1.jpg)
mosaic
exploring reticulate protein family evolution
UQ, COMBIO AU, Brisbane, 02-03-09, Maetschke/Kassahn
2
motivation
the problem:
- evolution is complex (horizontal gene transfer, hybridization, genetic recombination, ...)
- describing reticulate (non-tree-like) phylogenetic relationships as trees may be an oversimplification
traditional methods:
- phylogenetic tree inference gets increasingly complex and is not suitable
- phylogenetic networks are even more complex, and visualization is difficult
mosaic:
- fast method to analyze and visualize (phylogenetic) sequence relationships
- applied to identify and study non-tree-like protein families
- aim: perform whole-proteome scans for reticulate proteins
3
n-grams & dot plots
- phylogenetics: alignment is the cornerstone
- classical alignment may fail for reticulate proteins
- "alignment-free" methods: split the sequence into overlapping subsequences of length n
- example sequence: MSKRRMSVGQQTW...
- its 4-grams: MSKR, SKRR, KRRM, RRMS, ...
[Figure: n-gram dot plot (residues M S K R R M Q Q V T Q on one axis); schematic of two sequences S1 and S2 with segments A and B in swapped order]
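The n-gram split and the dot plot described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the slides: the function names `ngrams` and `dot_plot` are made up for this example, and the two segments `A` and `B` are hypothetical stand-ins for the swapped blocks in the S1/S2 schematic.

```python
def ngrams(seq, n=4):
    """Return the overlapping n-grams of seq, in sequence order."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def dot_plot(seq1, seq2, n=4):
    """Return a 0/1 matrix: entry [i][j] is 1 when n-gram i of seq1 equals n-gram j of seq2."""
    g1, g2 = ngrams(seq1, n), ngrams(seq2, n)
    return [[1 if a == b else 0 for b in g2] for a in g1]


grams = ngrams("MSKRRMSVGQQTW", 4)
print(grams[:4])  # ['MSKR', 'SKRR', 'KRRM', 'RRMS']

# Hypothetical segments: S1 = A B, S2 = B A (same blocks, swapped order).
A, B = "MSKRRMSV", "GQQTWLPE"
plot = dot_plot(A + B, B + A, 4)
for row in plot:
    print("".join("#" if x else "." for x in row))
```

In the printed plot the shared segments show up as two short diagonals away from the main diagonal, which is the pattern a single global alignment cannot capture.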
4
some real n-gram dot plots
- 4-grams are "unique" for a sequence (we talk about the '4' later...)
[Figure: n-gram dot plot, panel labelled c=10, n=4]
>AR_Pt MEVQLGLGRVYPRPPSKTYRGAFQNLFQSVREVIQNPGPRHPEAASAAPPGASLLLQQQQQQQQQQQQQQQQQQQQQQETSPRQQQQQGEDGSPQAHRRGPTGYLVLDEEQQPSQPQSAPECHPERGCVPEPGAAVAASKGLPQQLPAPPDEDDSAAPSTLSLLGPTFPGLSSCSADLKDILSEASTMQLLQQQQQEAVSEGSSSGRAREASGAPTSSKDNYLGGTSTISDSAKELCKAV...
[Figure: n-gram dot plots, panels labelled c=10, n=4 and c=2, n=1]
5
another n-gram dot plot: nuclear receptors
- DBD: DNA binding, two zinc finger motifs
- LBD: ligand binding domain
- AF-1/AF-2: transcriptional activation domains
[Figure: n-gram dot plot of nuclear receptors with the DBD and LBD regions marked]
6
n-gram sequence similarity s
given two sequences and their n-gram sets $S_1$ and $S_2$:
$$s = \frac{|S_1 \cap S_2|}{\min(|S_1|, |S_2|)}, \qquad s \in [0, 1]$$
- $S$ = set of n-grams, e.g. {AAGR, AGRK, GRKQ, ...}
- $|S_1 \cap S_2|$ = number of shared n-grams
- min in the denominator corresponds to local alignment, max to global alignment
example:
$$\{AAG, AGQ, GQQ\} \cap \{GQQ, QQQ\} = \{GQQ\}, \qquad s = \frac{1}{\min(3, 2)} = 0.5$$
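The similarity s can be computed directly from two n-gram sets. The sketch below is one possible reading of the formula; the names `ngram_set` and `ngram_similarity` and the `norm` parameter (min for the local variant, max for the global one) are assumptions made for this example.

```python
def ngram_set(seq, n):
    """Return the set of overlapping n-grams of seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def ngram_similarity(seq1, seq2, n=4, norm=min):
    """Shared n-grams divided by the size of the smaller (min) or larger (max) set."""
    s1, s2 = ngram_set(seq1, n), ngram_set(seq2, n)
    if not s1 or not s2:
        return 0.0  # a sequence shorter than n has no n-grams
    return len(s1 & s2) / norm(len(s1), len(s2))


# Reproduces the slide example: {AAG, AGQ, GQQ} and {GQQ, QQQ} share {GQQ}.
s = ngram_similarity("AAGQQ", "GQQQ", n=3)
print(s)  # 0.5
```

Passing `norm=max` gives the stricter, global-style score (1/3 for the same pair).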
7
n-gram similarity
- fast: linear w.r.t. the size of the n-gram sets (classical alignment is quadratic w.r.t. sequence length)
- easy to interpret (0.5 = half of the n-grams in the smaller set are shared)
- no parameters (gap penalty, gap extension penalty, ...)
- can deal with shuffling of conserved segments and other "strange" cases (are they actually strange?)
- better or worse than BLAST/FASTA? Who knows? (Hoehl 2008: alignment-free can be as good as classical alignment for inference of phylogeny; Edgar 2004: MUSCLE, an n-gram-based alignment method)
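The claim about shuffled segments is easy to check on a toy case. The sketch below swaps two hypothetical segments `A` and `B` (made-up strings, not data from the talk) and compares the n-gram sets; the helper `ngram_similarity` is an illustrative name, not part of any published tool.

```python
def ngram_similarity(seq1, seq2, n=4):
    """Shared n-grams divided by the size of the smaller n-gram set."""
    s1 = {seq1[i:i + n] for i in range(len(seq1) - n + 1)}
    s2 = {seq2[i:i + n] for i in range(len(seq2) - n + 1)}
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / min(len(s1), len(s2))


# Two hypothetical conserved segments of 30 residues each.
A = "MEVQLGLGRVYPRPPSKTYRGAFQNLFQSV"
B = "REVIQNPGPRHPEAASAAPPGASLLLQETS"

s_same = ngram_similarity(A + B, A + B)     # identical sequences
s_swapped = ngram_similarity(A + B, B + A)  # same segments, swapped order
print(s_same, s_swapped)
```

Only the few n-grams that span the junction between the segments are lost, so the swapped pair still scores close to 1.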
8
why 4 and not 42
- Hoehl 2008: n = 3...5
- correlation between n-gram sequence similarity and species divergence times
- standard deviation of sequence similarities
- maximum AUC when distinguishing related and randomly shuffled sequences
[Figure: plot labelled MR, r=0.93, with n = 4 marked]
9
phylogenetic networks
- different node and edge types
- identification of reticulate events (e.g. recombination) is error-prone
- computationally expensive
- larger networks become messy
[Figure panels: T-Rex (Makarenkov et al. 2001); NeighborNet/SplitsTree (Bryant et al. 2004, Huson et al. 1998); Newick (Cardona et al. 2008)]
10
larger networks - example
[Figure: example networks from Huson et al. 2005 and Bryant et al. 2004]
11
graph = ridiculugram
- layout dependent
- distorted distances
- random initialization
- local minima
- slow
[Figure: spring layout of nuclear receptors; groups labelled GR, MR, PR, AR]
12
mosaic plot
- point size is similarity
- no distortions
- no random initialization
- preserve full information
- automatic clustering (spectral rearrangement)
- no hard decision about number of clusters
13
spectral clustering
- $s_{ij}$: n-gram similarity between sequences $i$ and $j$
- affinity matrix $A = (a_{ij})$: $a_{ij} = e^{-(1 - s_{ij})^2 / (2\sigma^2)}$, where $\sigma$ defines the neighborhood radius
- "degree" matrix: $D_{ii} = \sum_k a_{ik}$
- Laplacian matrix: $L = D - A$
- eigenvector decomposition: $e, v = \mathrm{eig}(L)$ (e: eigenvalues, v: eigenvectors)
- $v_2$: eigenvector for the 2nd smallest eigenvalue (Fiedler vector); indicates clusters and how well they are separated
in code (S is the similarity matrix; sig takes the place of $2\sigma^2$):
from numpy import exp, diag
from numpy.linalg import eigh
A = exp(-(1-S)**2/sig)
D = diag(A.sum(axis=0))
L = D-A
e,v = eigh(L)
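The four steps on this slide can be put together into a small runnable sketch. The function name `fiedler_vector`, the value `sigma=0.5`, and the 5x5 similarity matrix are assumptions chosen for illustration; only the affinity, degree, Laplacian, and eigendecomposition steps come from the slide.

```python
import numpy as np


def fiedler_vector(S, sigma=0.5):
    """Return (eigenvalues, Fiedler vector) for a symmetric similarity matrix S in [0, 1]."""
    A = np.exp(-(1.0 - S) ** 2 / (2.0 * sigma ** 2))  # affinity matrix
    D = np.diag(A.sum(axis=0))                         # "degree" matrix
    L = D - A                                          # Laplacian matrix
    e, v = np.linalg.eigh(L)                           # eigenvalues in ascending order
    return e, v[:, 1]                                  # eigenvector for the 2nd smallest eigenvalue


# Hypothetical similarities: sequences 0-2 form one family, 3-4 another.
S = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.8, 0.9, 1.0, 0.1, 0.2],
    [0.1, 0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.2, 0.9, 1.0],
])
e, v2 = fiedler_vector(S, sigma=0.5)
labels = v2 > 0  # the sign of the Fiedler vector splits the two families
print(np.round(v2, 2), labels)
```

The overall sign of an eigenvector is arbitrary, so which family gets `True` can differ between runs or platforms; only the grouping is meaningful.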
14
spectral rearrangement
15
recursive spectral rearrangement
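The slides show the rearrangement only as figures, so the following is a guess at the idea rather than the authors' procedure: split the sequences by the sign of the Fiedler vector, recurse on each half, and concatenate the resulting orders so that similar sequences end up next to each other in the matrix. The names `fiedler` and `rearrange`, the stopping size `min_size`, and the toy matrix are all assumptions of this sketch.

```python
import numpy as np


def fiedler(S, sigma):
    """Fiedler vector of the Laplacian built from similarity matrix S."""
    A = np.exp(-(1.0 - S) ** 2 / (2.0 * sigma ** 2))
    L = np.diag(A.sum(axis=0)) - A
    _, v = np.linalg.eigh(L)
    return v[:, 1]


def rearrange(S, idx=None, sigma=0.5, min_size=3):
    """Return an ordering of row/column indices that groups similar sequences."""
    if idx is None:
        idx = np.arange(S.shape[0])
    if len(idx) < min_size:
        return [int(i) for i in idx]
    v2 = fiedler(S[np.ix_(idx, idx)], sigma)
    left, right = idx[v2 < 0], idx[v2 >= 0]
    if len(left) == 0 or len(right) == 0:
        # No split found: fall back to sorting by the Fiedler vector.
        return [int(i) for i in idx[np.argsort(v2)]]
    return rearrange(S, left, sigma, min_size) + rearrange(S, right, sigma, min_size)


# Hypothetical data: even-numbered and odd-numbered sequences form two families.
n = 5
S = np.array([[1.0 if i == j else (0.9 if i % 2 == j % 2 else 0.1)
               for j in range(n)] for i in range(n)])
order = rearrange(S)
print(order)                       # members of each family are now adjacent
print(S[np.ix_(order, order)])     # block structure appears along the diagonal
```

The exact order inside each block (and which block comes first) depends on arbitrary eigenvector signs, so only the contiguity of the families should be relied on.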
16
spectral clustering
- takes "global" properties into account
- fast and scales well
- no random initialization => single run
- global minimum => single, unique solution
- few parameters: L, σ
- σ <= mean of distance matrix
- "better" than k-means (works for non-spherical clusters) or single linkage hierarchical clustering (no chaining problem)
- clustering is NP-hard and spectral clustering is "just another approximation"
- recursive spectral clustering to improve cluster quality
17
mosaic - demo
18
the end
- fast technique to visualize/analyze reticulate protein family evolution
- matrix representation
- spectral clustering
- n-gram similarity
- many other applications
- Perl, free!
19
questions?
20
SCOP
- SCOP: five families randomly selected
21
Nuclear receptors
- ligand binding domain
- N-terminal section
- zinc-finger domain
22
mosaic - examples
23
Full length sequence:
[Figure: groups labelled GR, MR, PR, AR]
MrBayes v3.1.2, $10^6$ generations, 4 chains, 240 CPU-hrs
24
Zinc finger domain
[Figure: groups labelled AR, GR, MR, PR]
MrBayes v3.1.2, $10^6$ generations, 4 chains, 9 CPU-hrs
25
Ligand-binding domain
[Figure: groups labelled PR, AR, MR, GR]
MrBayes v3.1.2, $10^6$ generations, 4 chains, 27 CPU-hrs
26
Upstream region
[Figure: panel marked with "?"]
MrBayes v3.1.2, $10^6$ generations, 4 chains, 87 CPU-hrs
27
quality q
given two sequences and their n-gram dot plot:
$$q = \frac{\max(\mathrm{diag}(S_1, S_2))}{\min(n_1, n_2)}, \qquad q \in [0, 1]$$
- diag = set of dot sums along diagonals
- n = length of sequence
- min in the denominator corresponds to local alignment, max to global alignment
example:
$$q = \frac{\max\{4, 2, 1, 0\}}{\min(6, 8)} = \frac{4}{6} \approx 0.67$$
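A sketch of how q could be computed from the dot plot, assuming that "diag" collects the number of dots on each diagonal and that n is the sequence length, as stated above. The names `diagonal_sums` and `quality` and the two toy sequences are made up for this example, which uses n-gram length 1 to keep it small.

```python
def ngrams(seq, n):
    """Return the overlapping n-grams of seq, in sequence order."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def diagonal_sums(seq1, seq2, n):
    """Count the dots of the n-gram dot plot on each diagonal, keyed by offset j - i."""
    sums = {}
    for i, a in enumerate(ngrams(seq1, n)):
        for j, b in enumerate(ngrams(seq2, n)):
            if a == b:
                sums[j - i] = sums.get(j - i, 0) + 1
    return sums


def quality(seq1, seq2, n=4, norm=min):
    """Largest diagonal sum divided by the shorter (min) or longer (max) sequence length."""
    if not seq1 or not seq2:
        return 0.0
    best = max(diagonal_sums(seq1, seq2, n).values(), default=0)
    return best / norm(len(seq1), len(seq2))


# Hypothetical sequences of length 6 and 8 sharing a run of 4 residues.
q = quality("MSKRQV", "GTMSKRLW", n=1)
print(round(q, 2))  # 0.67
```

Unlike s, this score rewards n-grams that match in the same relative order, so a high s with a low q hints at shuffled segments.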
28
q over s
29
q-spectrum
30
n-gram dot plots