computational searches of biological sequences
DESCRIPTION
Computational searches of biological sequences. Conceptos básicos. Homología y otras relaciones evolutivas (paralógos, ortólogos, xenólogos) Uso preferencial de codones, CAI y expresividad Microarreglos y aproximaciones estadísticas para su análisis. Descripción de programas existentes. - PowerPoint PPT PresentationTRANSCRIPT
Computational searches of biological sequences
Conceptos básicosConceptos básicos Homología y otras relaciones Homología y otras relaciones
evolutivas (paralógos, ortólogos, evolutivas (paralógos, ortólogos, xenólogos)xenólogos)
Uso preferencial de codones, CAI y Uso preferencial de codones, CAI y expresividadexpresividad
Microarreglos y aproximaciones Microarreglos y aproximaciones estadísticas para su análisisestadísticas para su análisis
Descripción de programas Descripción de programas existentesexistentes
BLAST (Comparación apareada de BLAST (Comparación apareada de secuencias)secuencias)
MEME/MAST (Identificación de MEME/MAST (Identificación de motivos sobre-representados)motivos sobre-representados)
Planteamiento de problemas Planteamiento de problemas para resolverpara resolver
1.1. Grupo mínimo de genes para la vidaGrupo mínimo de genes para la vida2.2. Predicción de operones bacterianosPredicción de operones bacterianos3.3. Expresividad en unidades Expresividad en unidades
transcripcionalestranscripcionales4.4. Conservación de expresividad entre Conservación de expresividad entre
organismosorganismos5.5. Identificación de genes transferidos Identificación de genes transferidos
horizontalmente horizontalmente H. pyloriH. pylori6.6. Regulación por glucosa en Regulación por glucosa en E. coliE. coli
C S T P A G N D E Q H R K M I L V F Y WC 12S 0 2T -1 2 3P -3 0 0 8A 1 1 1 0 2G -2 0 -1 -2 1 7N -2 1 1 -1 0 0 4D -3 1 0 -1 0 0 2 5E -3 0 0 -1 0 -1 1 3 4Q -2 0 0 0 0 -1 1 1 2 3H -1 0 0 -1 -1 -1 1 0 0 1 6R -2 0 0 -1 -1 -1 0 0 0 2 1 5K -3 0 0 -1 0 -1 1 1 1 2 1 3 3M -1 -1 -1 -2 -1 -4 -2 -3 -2 -1 -1 -2 -1 4I -1 -2 -1 -3 -1 -5 -3 -4 -3 -2 -2 -2 -2 3 4L -2 -2 -1 -2 -1 -4 -3 -4 -3 -2 -2 -2 -2 3 3 4V 0 -1 0 -2 0 -3 -2 -3 -2 -2 -2 -2 -2 2 3 2 3F -1 -3 -2 -4 -2 -5 -3 -5 -4 -3 0 -3 -3 2 1 2 0 7Y -1 -2 -2 -3 -2 -4 -1 -3 -3 -2 2 -2 -2 0 -1 0 -1 5 8W -1 -3 -4 -5 -4 -4 -4 -5 -4 -3 -1 -2 -4 -1 -2 -1 -3 4 4 14
PAM 250PAM 250
AGGIDGGHGFMG117137
Matriz de substitución para Matriz de substitución para aminoácidosaminoácidos
Unitary matrix for DNA Unitary matrix for DNA sequencessequences
A
A C G T
C
G
T
1
1
1
1
0 0 0
0
0
0
00
0
0
0
0
In any case, the values obtain in the comparison are the same along the entire alignment
It is well know that some residues in a protein or in a nucleotide sequence plays important roles and therefore are constrained to vary
These conserved regions constitutes motif which are sometimes These conserved regions constitutes motif which are sometimes recognized in a set of recognized in a set of aligned sequences. aligned sequences.
SCK1_CENEL/13-36 CLKPCKDLYGPHAGAKCMNGKCKCSCKM_CENMA/13-36 CLPPCKAQFGQSAGAKCMNGKCKCSCT2_ANDAU/35-57 CASVCRRVIGVAAG-KCINGRCVCSCK3_ANDMA/13-35 CASVCRKVIGVAAG-KCINGRCVCSCK4_MESMA/35-57 CASVCRREIGVAAG-KCINGKCVCSCKK_TITSE/35-57 CYSACKKLVGKATG-KCTNGRCDCSCK2_TITDI/14-36 CVKICIDRYNTRGA-KCINGRCTCSCKP3_TITSE/7-28 CNRKCCPG-GCRSG-KCINGKCQCSCBX_MESMA/8-29 CRVKCVAM-GFSSG-KCINSKCKCSCKL_LEIQH/8-28 CQLSCRSL-GL-LG-KCIGDKCECSCK5_ANDMA/8-28 CQLSCRSL-GL-LG-KCIGVKCECSCK1_CENNO/36-57 CDKDCKRR-GYRSG-KCINNACKCSCK2_CENNO/8-29 CDKDCTSR-KYRSG-KCINNACKC * * . ** . *
What is a motif in a biological sequence?What is a motif in a biological sequence?
Represents a conserved region of a sequence.
This conservation might be due to a functional constraint.
There are conserved structural domains in a family of proteins. Amino acid sequences can almost always represent such motifs.
Motif identification is useful to classify and understand protein or nucleotide function.
Example of a protein motif.
Motifs can be represented by Weight Matrices:Motifs can be represented by Weight Matrices:
Example of a RNA motif.
Motifs can be represented by Weight Matrices:Motifs can be represented by Weight Matrices:
Example of a DNA motif.
Motifs can be represented by Weight Matrices:Motifs can be represented by Weight Matrices:
How can we obtain a How can we obtain a Weight Weight Matrix for a specific motifMatrix for a specific motif??
……. by evaluating the relative frequency of its elements in a set of aligned sequences.
This frequency matrix contains relevantBiological Information about your protein and can be used to obtain a:
PPosition osition SSpecific pecific SScore core MMatrix atrix PSSMPSSM
PPosition osition SSpecific pecific SScore core MMatrix atrix PSSMPSSM
While PAM and Blosum matrices are used to compare two amino acids of a pair of sequences regardless of their position in the aligned sequences, a PSSM analysis uses a different matrix in which the score varies depending on the conservation of each position of the aligned sequences
Serin Protease
Actually, the frequencies are not used as such to score putative sites. The score assigned assigned to a piece of sequence, S, is calculated as the log-ratio of two probabilities:
P(S|M), the probability to observe sequence S given the motif model M (the matrix).
P(S|B), the probability to observe sequence S given the background model B (the genomic context).
The score of a sequence segment is WS=log[P(S|M)/P(S|B)]
PPosition osition SSpecific pecific SScore core MMatrix atrix PSSMPSSM
Different programs have been developed to find motifs
1 AKSJDFHLASUHERLAKSNBKAJNCLKJASHDKFJAHSEJ2 DLKTJNKHBHEASHRGHBDFASJGHBCLKUSHKLCSDHGK3 GNLKXDHKIASGCSDKJCSKHDGKJELHBHEAJFNLOIJS4 JHSLRCKJGHXBDKSLCFALSIZDNGJDFGNLCKJSDNSD5 LKSAJDHBFCKGLSHBHEAUABSXDJKFASODFHBHKAHS6 JSHGHAEKHKSDFJHKSJDFHKAJSEHRKAJHBHEAPERI7 QWHBHEACVLXMNCVKUIEHRMBDKFJAHLIDHRTRKKQP8 LICVUWJENOMNVIDFGKJERJSGFAHGSIUOPIAKHVIU9 OIEURTKSHOIUCVBSDFGUYWERKJHDFLIUHBHEAERT10 OIUWERMXCVKJHBHEAWIERUOIUVMBNAWIUEYRHASS
Different programs have been developed to find motifs
1 AKSJDFHLASUHERLAKSNBKAJNCLKJASHDKFJAHSEJ2 DLKTJNKHBHEASHRGHBDFASJGHBCLKUSHKLCSDHGK3 GNLKXDHKIASGCSDKJCSKHDGKJELHBHEAJFNLOIJS4 JHSLRCKJGHXBDKSLCFALSIZDNGJDFGNLCKJSDNSD5 LKSAJDHBFCKGLSHBHEAUABSXDJKFASODFHBHKAHS6 JSHGHAEKHKSDFJHKSJDFHKAJSEHRKAJHBHEAPERI7 QWHBHEACVLXMNCVKUIEHRMBDKFJAHLIDHRTRKKQP8 LICVUWJENOMNVIDFGKJERJSGFAHGSIUOPIAKHVIU9 OIEURTKSHOIUCVBSDFGUYWERKJHDFLIUHBHEAERT10 OIUWERMXCVKJHBHEAWIERUOIUVMBNAWIUEYRHASS
…..if the alignment is not an option?
How do they work?
A) Counting all the “words” of certain length and evaluating the more frequent and statistically significant.
B) In a aleatory fashion, taking fragments chosen randomly and evaluating if these fragments manage to generate a conserved representative motif
(Gibbs sampler algorithm)Gibbs sampler algorithm)
Gibbs sampler algorithmGibbs sampler algorithmMultiple Local Alignment (MLA)Multiple Local Alignment (MLA)
We mark a sequence into the motif site (occurrence), which is described by a probability-positional matrix q(i,r) , and the background, which is described by background symbol probabilities f(i). r is a nucleotide (a residue); r {A,T,G,C}i is a position in the site, i=1..s , s is the motif length
Positional-Probabilistic Model (PPM) and backgroundPositional-Probabilistic Model (PPM) and background
What is a motif
Two probabilistic models, foreground (the motif) and background, are formulated.
We classify (mark) all the input sequences into these two models-obtained parts.
A Gibbs sampling step
Motif and background bases counters are computed from all the sequence fragments except the current one.
The probability distribution of the new site position or its absence in the current sequence is derived from the statistical models and the current sequence content.
A new site location is sampled from the distribution.
Statistical models for the background and for the motif are formed using the counters.
The current sequence
…..if the alignment is not an option?
Gibbs sampler algorithmGibbs sampler algorithm
Gibbs sampler algorithmGibbs sampler algorithm
Motif site (occurrence), which is described by a probability-positional matrix q(i,r)
background, which is described by background symbol probabilities f(i).
Two probabilistic models are formulated: the foreground model (the motif) and the background model
Gibbs sampler algorithmGibbs sampler algorithm
A probability distribution (where the foreground and background models are different) can be evaluated
A com
plete s
tatis
tical
descr
iptio
n of th
e meth
od is
not in
the s
cope o
f this
talk
One of the sequences, chosen randomly,is removed from the alignment.
The main idea of the method …..The main idea of the method …..
A probability distribution profile is evaluated
and replaced by new sequence searched with the previous motif profile
A new probability distribution profile is evaluated again
The main idea of the method …..The main idea of the method …..
The main idea of the method …..The main idea of the method …..
After several cycles, the method tends to identify a significant motif After several cycles, the method tends to identify a significant motif
GLAM2 is a software package for finding motifs in sequences, typically amino-acid or nucleotide sequences.
The main innovation of GLAM2 is that it allows insertions and deletions in motifs.
The package includes these programs:
* glam2 - for discovering motifs shared by a set of sequences.
* glam2scan - for finding matches, in a sequence database, to a motif discovered by glam2.
* glam2format - for converting glam2 motifs to standard alignment formats.
* glam2mask - for masking glam2 motifs out of sequences, so that weaker motifs can be found.
* purge - for removing highly similar members of a set of sequences.
http://bioinformatics.org.au/glam2/doc/
http://meme.sdsc.edu/meme4/cgi-bin/glam2.cgi
Basic usageRunning glam2 without any arguments gives a usage message:
Usage: glam2 [options] alphabet my_seqs.fa Main alphabets: p = proteins, n = nucleotides Main options (default settings): -h: show all options and their default settings -o: output file (stdout) -r: number of alignment runs (10) -n: end each run after this many iterations without improvement (10000) -2: examine both strands forward and reverse complement --z: minimum number of sequences in the alignment (2) --a: minimum number of aligned columns (2) --b: maximum number of aligned columns (50) --w: initial number of aligned columns (20)
-The main input to glam2 is a file of sequences in FASTA format:>MyFirstSequence GHYWVVCTGGGACH >My2ndSequence LLIGGPWVWWADDDF (etc.)
You need to tell glam2 which alphabet to use:glam2 p my_prots.fa glam2 n my_nucs.fa
Use -o to write the output to a file rather than to the screen:glam2 -o my_prots.glam2 p my_prots.fa
How it works
To use glam2 starts from a random alignment, and makes many small, random changes to it, which are designed to find high-scoring alignments in the long run. The longer you let it run, the more likely it is to find a maximal-scoring alignment.
To check that a reproducible, high-scoring motif has been found, the whole procedure is run several (e.g. 10) times from different starting alignments. If all runs produce identical alignments, we have maximum confidence that this is the optimal motif.
If a few of the runs produce different, lower-scoring motifs, we still have high confidence.
If all the runs produce completely different alignments, we have low confidence, and the run-length needs to be increased.
MEMEMEME: Multiple Expectation maximization for Motif Elicitation
MAIN DIFFERENCES
Try many different initial segments in order to get one that converges to an optimum.
Try different window analysis sizes.
In order to generate a motif with gaps, more than one motif can be generated.
Motif discovery from unaligned sequences Optimal for Genomic or Protein sequences Especially if you do not know the size of the motif
Identifies profile motifs Simultaneously analyze Multiple motifs for any input
Flexible model of motif presence Motif can be absent in some sequences Motif can appear several times in one sequence
A very useful program to discoverA very useful program to discover Patterns Patterns
MEMEMEME: Multiple Expectation maximization for Motif Elicitation
The input to MEME contains the following The input to MEME contains the following
fields.fields. Sequences
Protein or DNA sequences (Do not merge) in fasta format.
Notice that sequence names must not be repeated
Valid examples are
>seq1 GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAK >seq2 GDMFCPGYCPDVKPVGDFDLSAFAGAWHELAK
>GdhA Glutamate dehydrogenase from Escherichia coli
RDMFCPGGCPDVKPVGDHDLSAFKGAWHELAL
Motif distribution
One per sequence (oops)
Zero or one per sequence (zoops)
Any number of repetitions (tcm)
Number of motifs [optional]
The program will stop the analysis after this number of motifs is found.
Number of sites (Minimum or Maximum sites) (<= 300)
Minimum sites = 5 Maximum sites = 8
Motif width .
MEME will find the optimum width of each motif within the limits you specify : Minimum or Maximum
MEME input. continued….MEME input. continued….
Text output format
By default, MEME output is in hypertext (HTML) format.
Shuffle letters in input sequences
Useful for further statistical analysis
Look for palindromes onlyAverage the letter frequencies in corresponding motif columns together
MEME input. continued….MEME input. continued….
http://meme.sdsc.edu/meme/meme.html
MEME OutputMEME Output
Motif length
# of motifs found
Expectation value
“Position-Specific Probability Matrix
Information content
Consensus
Sequence names
Strand (reverse or complement)
Position in sequence
Statistical significance
Motif within sequence
MEME Output
Overall Statistical significance
Sequence length
Motif in complement strand
MEME OutputMEME Output
Searches for motifs (one or more) in sequence databases: Like BLAST but motifs are used as input Similar to the matrices obtained by iteration in PSI-
BLAST
Profile defines statistical significance of a match Multiple motif matches per sequence Combined E value for all motifs
MEME uses MAST to summarize results: Each MEME result is accompanied by the MAST result
for searching the discovered motifs on the given sequences.
MASTMAST
Email address
Database (like BLAST)
Motif file (e.g. MEME output)
Consider matched sequence length
E value threshold
MAST inputMAST input
Matched accession
Match E value
Length of sequence
Link to GenBank
MAST outputMAST output
Motif diagram
MASTMAST output output
Position of each instance
P value of instance
Matched parts of sequence
Motif ‘consensus’
Motif and orientation
MAST outputMAST output
Affymetrix GeneChip arrays High density oligonucleotide
array technology as developed by Affymetrix www.affymetrix.com
Overview images courtesy of Affymetrix unless otherwise specified
Probes and Probesets
Sample Preparation
Hybridization to the Chip
The Chip is Scanned
The task of analysing a gene expression experimentfor differential genes falls into the following steps:
(1) Ranking: genes are ranked according to theirevidence of differential expression.
(2) Assigning significance: a statistical significance isbeing assigned to each gene.
(3) Cut-off value: to arrive at a limited number ofdifferentially expressed genes a cut-off value forthe statistical significance needs to be determined.
DIFFERENTIAL EXPRESSION Motivation
Planteamiento de problemas Planteamiento de problemas para resolverpara resolver
1.1. Grupo mínimo de genes para la vidaGrupo mínimo de genes para la vida2.2. Predicción de operones bacterianosPredicción de operones bacterianos3.3. Expresividad en unidades Expresividad en unidades
transcripcionalestranscripcionales4.4. Conservación de expresividad entre Conservación de expresividad entre
organismosorganismos5.5. Identificación de genes transferidos Identificación de genes transferidos
horizontalmente horizontalmente H. pyloriH. pylori6.6. Regulación por glucosa en Regulación por glucosa en E. coliE. coli
Datos de microarreglos sobre Datos de microarreglos sobre la reglulación por glucosa en la reglulación por glucosa en
E. coliE. coli ID B_number crp1 crp1 crp2 crp2 wt1 wt1 wt2 wt2 wtg1 wtg1 wtg2 wtg3
señal det señal det señal det señal det señal det señal det
1 aas_b2836_at 279.2 P 293.8 P 360.8 P 332.2 P 475.6 P 374.4 P
2 aat_b0885_at 344.9 A 332.5 A 110.7 A 256.8 A 237.1 A 35.2 A
3 abc_b0199_at 697 M 682.4 A 530.5 A 532 A 747.9 M 882.8 A
4 abrB_b0715_at 537.3 P 518.3 P 446.3 A 329.6 M 357.9 P 406.7 A
5 accA_b0185_at 1735.3 P 1083.1 P 1821.7 P 1798.6 P 2416.7 P 1794.5 P
6 accB_b3255_at 2314.2 P 1811.4 P 1598 P 1835.2 P 3182.8 P 2893.6 P
7 accC_b3256_at 2486.2 P 1845.6 P 1165.7 P 1146 P 2861.6 P 2883.3 P
8 accD_b2316_at 1770.3 P 1468.1 P 1592.1 P 1808.5 P 2257.6 P 2873.7 P
9 aceA_b4015_at 632.6 P 790.2 P 3005.5 P 2489 P 474.1 P 521.8 P
10 aceB_b4014_at 282.7 P 394.4 P 1610.5 P 1254.1 P 274.8 A 316.6 A
11 aceE_b0114_at 5060.1 P 3979.2 P 2093.5 P 1665.9 P 7943.2 P 8411.5 P
12 aceF_b0115_at 3981.4 P 2287.3 P 1998.1 P 1425.8 P 5515.2 P 4993.9 P
13 aceK_b4016_at 386.8 P 331 P 534.9 P 442.9 P 395.4 P 395.3 P
14 ackA_b2296_at 2624.8 P 2084.2 P 1225.1 P 1273 P 3624.6 P 2990.2 P
15 acnA_b1276_at 488 P 326.2 P 701.9 P 773.2 P 368.7 A 460.8 A
16 acnB_b0118_at 2417.2 P 2185.7 P 3693 P 2864.5 P 1401.4 P 1786 P
P-Confiable, A-Dudoso, M-no confiable
Enfoques de estudio Tecnología de microarreglos
Se pueden complementar con otros métodos (RT-PCR, por ejemplo)
Outliers Outliers iteration iteration methodmethod
1. Se obtienen los promedios de las diferentes repeticiones
2. Se comparan los resultados de los experimentos de estudio con los del experimento control. En nuestro ejemplo:
a) WTglucosa/WT b) CRP/WT
3. Se obtienen la media y desviación estandar de los datos.
4.- Para identificar a los genes con mayor significancia asumimos un corte arbitrario de 2 desviaciones estardar relativos a la media.
Outliers Outliers iteration iteration methodmethod
Genes inducidos
Genes reprimidos
5.- Se vuelve a calcular la media y desviación estandar de los datos que no fueron considerados como estadísticamente importantes.
Outliers Outliers iteration iteration methodmethod
6.- Nuevamente se considera que los genes por arriba o debajo de dos desviaciones 2 desviaciones estardar relativos a la media son estadísticamente relevantes
Outliers Outliers iteration iteration methodmethod
Genes inducidos
Genes reprimidos
Outliers Outliers iteration methoditeration methodSe itera hasta ya no encontrar valores por Se itera hasta ya no encontrar valores por
afuera de las 2 DS.afuera de las 2 DS.
Genes inducidos
Genes reprimidos
Genes inducidos
Genes reprimidos
Metodo de Metodo de ordenamiento (Rank)ordenamiento (Rank)
Destaca los genes más expresados. Por lo tanto, no importa el valor en términos absolutos si no
su valor relativo a los demás valores del arreglo Los genes en principio deben de conservar su posición dada
la naturaleza de su expresión.
Metodo de Metodo de ordenamiento (Rank)ordenamiento (Rank)
Gene Valor
A 0.3
B 0.6
C 1.4
D 3.2
E 1.1
F 4.3
G 0.9
H 0.8
I 0.4
J 0.7
Pos Gene Valor
1 F 4.3
2 D 3.2
3 C 1.4
4 E 1.1
5 G 0.9
6 H 0.8
7 J 0.7
8 B 0.6
9 I 0.4
10 A 0.3
Metodo de Metodo de ordenamiento (Rank)ordenamiento (Rank)
Pos Gene Valor
1 F 4.3
2 D 3.2
3 C 1.4
4 E 1.1
5 G 0.9
6 H 0.8
7 J 0.7
8 B 0.6
9 I 0.4
10 A 0.3
¿Cuál es la probabilidad de quedar en primer lugar?
P-primero = 1/n = 1/10 = 0.1
¿Cuál es la probabilidad de quedar en primer o segundo lugar?
p= p-primero + p-segundo = 0.2
¿Cuál es la probabilidad de quedar en segundo lugar?
P-segundo = 1/n = 1/10 = 0.1
Metodo de Metodo de ordenamiento (Rank)ordenamiento (Rank)
Pos Gene Valor
1 F 4.3
2 D 3.2
3 C 1.4
4 E 1.1
5 G 0.9
6 H 0.8
7 J 0.7
8 B 0.6
9 I 0.4
10 A 0.3
Pos Gene
Valor Prob
1 F 4.3 0.1
2 D 3.2 0.2
3 C 1.4 0.3
4 E 1.1 0.4
5 G 0.9 0.5
6 H 0.8 0.6
7 J 0.7 0.7
8 B 0.6 0.8
9 I 0.4 0.9
10 A 0.3 1.0
Metodo de Metodo de ordenamiento (Rank)ordenamiento (Rank)
¿Cómo consideramos la probabilidad si existen
repeticiones del experimento?Media geométrica
1er exp -> 3er lugar -> 0.32do exp -> 1er lugar -> 0.13er exp -> 2do lugar -> 0.2
P = (0.3 x 0.1 x 0.2)1/3
P = (0.06)1/3
P = 0.39
Problemas a resolver sobre Problemas a resolver sobre análisis de microarreglosanálisis de microarreglos
1. ¿Cuáles son los genes reprimidos e inducidos cuando E. coli es crecida en glucosa?
2. ¿Cuáles son los genes reprimidos e inducidos en la mutante CRP de E. coli?
3. ¿Cuál es la coincidencia entre los genes de los puntos 1 y 2?
4. ¿Qué tanto coinciden los resultados del análisis de Outliers y Rank?
5. Identifique el posible elemento de regulación común a estos genes utilizando Glim2 y MEME