computational searches of biological sequences

Computational searches of biological sequences

Conceptos básicosConceptos básicos Homología y otras relaciones Homología y otras relaciones

evolutivas (paralógos, ortólogos, evolutivas (paralógos, ortólogos, xenólogos)xenólogos)

Uso preferencial de codones, CAI y Uso preferencial de codones, CAI y expresividadexpresividad

Microarreglos y aproximaciones Microarreglos y aproximaciones estadísticas para su análisisestadísticas para su análisis

Descripción de programas Descripción de programas existentesexistentes

BLAST (Comparación apareada de BLAST (Comparación apareada de secuencias)secuencias)

MEME/MAST (Identificación de MEME/MAST (Identificación de motivos sobre-representados)motivos sobre-representados)

Planteamiento de problemas Planteamiento de problemas para resolverpara resolver

1.1. Grupo mínimo de genes para la vidaGrupo mínimo de genes para la vida2.2. Predicción de operones bacterianosPredicción de operones bacterianos3.3. Expresividad en unidades Expresividad en unidades

transcripcionalestranscripcionales4.4. Conservación de expresividad entre Conservación de expresividad entre

organismosorganismos5.5. Identificación de genes transferidos Identificación de genes transferidos

horizontalmente horizontalmente H. pyloriH. pylori6.6. Regulación por glucosa en Regulación por glucosa en E. coliE. coli

C S T P A G N D E Q H R K M I L V F Y WC 12S 0 2T -1 2 3P -3 0 0 8A 1 1 1 0 2G -2 0 -1 -2 1 7N -2 1 1 -1 0 0 4D -3 1 0 -1 0 0 2 5E -3 0 0 -1 0 -1 1 3 4Q -2 0 0 0 0 -1 1 1 2 3H -1 0 0 -1 -1 -1 1 0 0 1 6R -2 0 0 -1 -1 -1 0 0 0 2 1 5K -3 0 0 -1 0 -1 1 1 1 2 1 3 3M -1 -1 -1 -2 -1 -4 -2 -3 -2 -1 -1 -2 -1 4I -1 -2 -1 -3 -1 -5 -3 -4 -3 -2 -2 -2 -2 3 4L -2 -2 -1 -2 -1 -4 -3 -4 -3 -2 -2 -2 -2 3 3 4V 0 -1 0 -2 0 -3 -2 -3 -2 -2 -2 -2 -2 2 3 2 3F -1 -3 -2 -4 -2 -5 -3 -5 -4 -3 0 -3 -3 2 1 2 0 7Y -1 -2 -2 -3 -2 -4 -1 -3 -3 -2 2 -2 -2 0 -1 0 -1 5 8W -1 -3 -4 -5 -4 -4 -4 -5 -4 -3 -1 -2 -4 -1 -2 -1 -3 4 4 14

PAM 250PAM 250

AGGIDGGHGFMG117137

Matriz de substitución para Matriz de substitución para aminoácidosaminoácidos

Unitary matrix for DNA Unitary matrix for DNA sequencessequences

A

A C G T

C

G

T

1

1

1

1

0 0 0

0

0

0

00

0

0

0

0

In any case, the values obtain in the comparison are the same along the entire alignment

It is well know that some residues in a protein or in a nucleotide sequence plays important roles and therefore are constrained to vary

These conserved regions constitutes motif which are sometimes These conserved regions constitutes motif which are sometimes recognized in a set of recognized in a set of aligned sequences. aligned sequences.

SCK1_CENEL/13-36 CLKPCKDLYGPHAGAKCMNGKCKCSCKM_CENMA/13-36 CLPPCKAQFGQSAGAKCMNGKCKCSCT2_ANDAU/35-57 CASVCRRVIGVAAG-KCINGRCVCSCK3_ANDMA/13-35 CASVCRKVIGVAAG-KCINGRCVCSCK4_MESMA/35-57 CASVCRREIGVAAG-KCINGKCVCSCKK_TITSE/35-57 CYSACKKLVGKATG-KCTNGRCDCSCK2_TITDI/14-36 CVKICIDRYNTRGA-KCINGRCTCSCKP3_TITSE/7-28 CNRKCCPG-GCRSG-KCINGKCQCSCBX_MESMA/8-29 CRVKCVAM-GFSSG-KCINSKCKCSCKL_LEIQH/8-28 CQLSCRSL-GL-LG-KCIGDKCECSCK5_ANDMA/8-28 CQLSCRSL-GL-LG-KCIGVKCECSCK1_CENNO/36-57 CDKDCKRR-GYRSG-KCINNACKCSCK2_CENNO/8-29 CDKDCTSR-KYRSG-KCINNACKC * * . ** . *

What is a motif in a biological sequence?What is a motif in a biological sequence?

Represents a conserved region of a sequence.

This conservation might be due to a functional constraint.

There are conserved structural domains in a family of proteins. Amino acid sequences can almost always represent such motifs.

Motif identification is useful to classify and understand protein or nucleotide function.

Example of a protein motif.

Motifs can be represented by Weight Matrices:Motifs can be represented by Weight Matrices:

Example of a RNA motif.


Example of a DNA motif.


How can we obtain a How can we obtain a Weight Weight Matrix for a specific motifMatrix for a specific motif??

……. by evaluating the relative frequency of its elements in a set of aligned sequences.

This frequency matrix contains relevantBiological Information about your protein and can be used to obtain a:

PPosition osition SSpecific pecific SScore core MMatrix atrix PSSMPSSM


While PAM and Blosum matrices are used to compare two amino acids of a pair of sequences regardless of their position in the aligned sequences, a PSSM analysis uses a different matrix in which the score varies depending on the conservation of each position of the aligned sequences

Serin Protease

Actually, the frequencies are not used as such to score putative sites. The score assigned assigned to a piece of sequence, S, is calculated as the log-ratio of two probabilities:

P(S|M), the probability to observe sequence S given the motif model M (the matrix).

P(S|B), the probability to observe sequence S given the background model B (the genomic context).

The score of a sequence segment is WS=log[P(S|M)/P(S|B)]


Different programs have been developed to find motifs

1 AKSJDFHLASUHERLAKSNBKAJNCLKJASHDKFJAHSEJ2 DLKTJNKHBHEASHRGHBDFASJGHBCLKUSHKLCSDHGK3 GNLKXDHKIASGCSDKJCSKHDGKJELHBHEAJFNLOIJS4 JHSLRCKJGHXBDKSLCFALSIZDNGJDFGNLCKJSDNSD5 LKSAJDHBFCKGLSHBHEAUABSXDJKFASODFHBHKAHS6 JSHGHAEKHKSDFJHKSJDFHKAJSEHRKAJHBHEAPERI7 QWHBHEACVLXMNCVKUIEHRMBDKFJAHLIDHRTRKKQP8 LICVUWJENOMNVIDFGKJERJSGFAHGSIUOPIAKHVIU9 OIEURTKSHOIUCVBSDFGUYWERKJHDFLIUHBHEAERT10 OIUWERMXCVKJHBHEAWIERUOIUVMBNAWIUEYRHASS

Different programs have been developed to find motifs

1 AKSJDFHLASUHERLAKSNBKAJNCLKJASHDKFJAHSEJ2 DLKTJNKHBHEASHRGHBDFASJGHBCLKUSHKLCSDHGK3 GNLKXDHKIASGCSDKJCSKHDGKJELHBHEAJFNLOIJS4 JHSLRCKJGHXBDKSLCFALSIZDNGJDFGNLCKJSDNSD5 LKSAJDHBFCKGLSHBHEAUABSXDJKFASODFHBHKAHS6 JSHGHAEKHKSDFJHKSJDFHKAJSEHRKAJHBHEAPERI7 QWHBHEACVLXMNCVKUIEHRMBDKFJAHLIDHRTRKKQP8 LICVUWJENOMNVIDFGKJERJSGFAHGSIUOPIAKHVIU9 OIEURTKSHOIUCVBSDFGUYWERKJHDFLIUHBHEAERT10 OIUWERMXCVKJHBHEAWIERUOIUVMBNAWIUEYRHASS

…..if the alignment is not an option?

How do they work?

A) Counting all the “words” of certain length and evaluating the more frequent and statistically significant.

B) In a aleatory fashion, taking fragments chosen randomly and evaluating if these fragments manage to generate a conserved representative motif

(Gibbs sampler algorithm)Gibbs sampler algorithm)

Gibbs sampler algorithmGibbs sampler algorithmMultiple Local Alignment (MLA)Multiple Local Alignment (MLA)

We mark a sequence into the motif site (occurrence), which is described by a probability-positional matrix q(i,r) , and the background, which is described by background symbol probabilities f(i). r is a nucleotide (a residue); r {A,T,G,C}i is a position in the site, i=1..s , s is the motif length

Positional-Probabilistic Model (PPM) and backgroundPositional-Probabilistic Model (PPM) and background

What is a motif

Two probabilistic models, foreground (the motif) and background, are formulated.

We classify (mark) all the input sequences into these two models-obtained parts.

A Gibbs sampling step

Motif and background bases counters are computed from all the sequence fragments except the current one.

The probability distribution of the new site position or its absence in the current sequence is derived from the statistical models and the current sequence content.

A new site location is sampled from the distribution.

Statistical models for the background and for the motif are formed using the counters.

The current sequence

…..if the alignment is not an option?

Gibbs sampler algorithmGibbs sampler algorithm


Motif site (occurrence), which is described by a probability-positional matrix q(i,r)

background, which is described by background symbol probabilities f(i).

Two probabilistic models are formulated: the foreground model (the motif) and the background model


A probability distribution (where the foreground and background models are different) can be evaluated

A com

plete s

tatis

tical

descr

iptio

n of th

e meth

od is

not in

the s

cope o

f this

talk

One of the sequences, chosen randomly,is removed from the alignment.

The main idea of the method …..The main idea of the method …..

A probability distribution profile is evaluated

and replaced by new sequence searched with the previous motif profile

A new probability distribution profile is evaluated again



After several cycles, the method tends to identify a significant motif After several cycles, the method tends to identify a significant motif

GLAM2 is a software package for finding motifs in sequences, typically amino-acid or nucleotide sequences.

The main innovation of GLAM2 is that it allows insertions and deletions in motifs.

The package includes these programs:

* glam2 - for discovering motifs shared by a set of sequences.

* glam2scan - for finding matches, in a sequence database, to a motif discovered by glam2.

* glam2format - for converting glam2 motifs to standard alignment formats.

* glam2mask - for masking glam2 motifs out of sequences, so that weaker motifs can be found.

* purge - for removing highly similar members of a set of sequences.

http://bioinformatics.org.au/glam2/doc/

http://bioinformatics.org.au/glam2/doc/

http://meme.sdsc.edu/meme4/cgi-bin/glam2.cgi

http://meme.sdsc.edu/meme4/cgi-bin/glam2.cgi

Basic usageRunning glam2 without any arguments gives a usage message:

Usage: glam2 [options] alphabet my_seqs.fa Main alphabets: p = proteins, n = nucleotides Main options (default settings): -h: show all options and their default settings -o: output file (stdout) -r: number of alignment runs (10) -n: end each run after this many iterations without improvement (10000) -2: examine both strands forward and reverse complement --z: minimum number of sequences in the alignment (2) --a: minimum number of aligned columns (2) --b: maximum number of aligned columns (50) --w: initial number of aligned columns (20)

-The main input to glam2 is a file of sequences in FASTA format:>MyFirstSequence GHYWVVCTGGGACH >My2ndSequence LLIGGPWVWWADDDF (etc.)

You need to tell glam2 which alphabet to use:glam2 p my_prots.fa glam2 n my_nucs.fa

Use -o to write the output to a file rather than to the screen:glam2 -o my_prots.glam2 p my_prots.fa

How it works

To use glam2 starts from a random alignment, and makes many small, random changes to it, which are designed to find high-scoring alignments in the long run. The longer you let it run, the more likely it is to find a maximal-scoring alignment.

To check that a reproducible, high-scoring motif has been found, the whole procedure is run several (e.g. 10) times from different starting alignments. If all runs produce identical alignments, we have maximum confidence that this is the optimal motif.

If a few of the runs produce different, lower-scoring motifs, we still have high confidence.

If all the runs produce completely different alignments, we have low confidence, and the run-length needs to be increased.

MEMEMEME: Multiple Expectation maximization for Motif Elicitation

MAIN DIFFERENCES

Try many different initial segments in order to get one that converges to an optimum.

Try different window analysis sizes.

In order to generate a motif with gaps, more than one motif can be generated.

Motif discovery from unaligned sequences Optimal for Genomic or Protein sequences Especially if you do not know the size of the motif

Identifies profile motifs Simultaneously analyze Multiple motifs for any input

Flexible model of motif presence Motif can be absent in some sequences Motif can appear several times in one sequence

A very useful program to discoverA very useful program to discover Patterns Patterns

MEMEMEME: Multiple Expectation maximization for Motif Elicitation

The input to MEME contains the following The input to MEME contains the following

fields.fields. Sequences

Protein or DNA sequences (Do not merge) in fasta format.

Notice that sequence names must not be repeated

Valid examples are

>seq1 GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAK >seq2 GDMFCPGYCPDVKPVGDFDLSAFAGAWHELAK

>GdhA Glutamate dehydrogenase from Escherichia coli

RDMFCPGGCPDVKPVGDHDLSAFKGAWHELAL

Motif distribution

One per sequence (oops)

Zero or one per sequence (zoops)

Any number of repetitions (tcm)

Number of motifs [optional]

The program will stop the analysis after this number of motifs is found.

Number of sites (Minimum or Maximum sites) (<= 300)

Minimum sites = 5 Maximum sites = 8

Motif width .

MEME will find the optimum width of each motif within the limits you specify : Minimum or Maximum

MEME input. continued….MEME input. continued….

Text output format

By default, MEME output is in hypertext (HTML) format.

Shuffle letters in input sequences

Useful for further statistical analysis

Look for palindromes onlyAverage the letter frequencies in corresponding motif columns together

MEME input. continued….MEME input. continued….

http://meme.sdsc.edu/meme/meme.html

http://meme.sdsc.edu/meme/meme.html

MEME OutputMEME Output

Motif length

# of motifs found

Expectation value

“Position-Specific Probability Matrix

Information content

Consensus

Sequence names

Strand (reverse or complement)

Position in sequence

Statistical significance

Motif within sequence

MEME Output

Overall Statistical significance

Sequence length

Motif in complement strand

MEME OutputMEME Output

Searches for motifs (one or more) in sequence databases: Like BLAST but motifs are used as input Similar to the matrices obtained by iteration in PSI-

BLAST

Profile defines statistical significance of a match Multiple motif matches per sequence Combined E value for all motifs

MEME uses MAST to summarize results: Each MEME result is accompanied by the MAST result

for searching the discovered motifs on the given sequences.

MASTMAST

Email address

Database (like BLAST)

Motif file (e.g. MEME output)

Consider matched sequence length

E value threshold

MAST inputMAST input

Matched accession

Match E value

Length of sequence

Link to GenBank

MAST outputMAST output

Motif diagram

MASTMAST output output

Position of each instance

P value of instance

Matched parts of sequence

Motif ‘consensus’

Motif and orientation

MAST outputMAST output

Affymetrix GeneChip arrays High density oligonucleotide

array technology as developed by Affymetrix www.affymetrix.com

Overview images courtesy of Affymetrix unless otherwise specified

http://www.affymetrix.com/

Probes and Probesets

Sample Preparation

Hybridization to the Chip

The Chip is Scanned

The task of analysing a gene expression experimentfor differential genes falls into the following steps:

(1) Ranking: genes are ranked according to theirevidence of differential expression.

(2) Assigning significance: a statistical significance isbeing assigned to each gene.

(3) Cut-off value: to arrive at a limited number ofdifferentially expressed genes a cut-off value forthe statistical significance needs to be determined.

DIFFERENTIAL EXPRESSION Motivation

Planteamiento de problemas Planteamiento de problemas para resolverpara resolver

1.1. Grupo mínimo de genes para la vidaGrupo mínimo de genes para la vida2.2. Predicción de operones bacterianosPredicción de operones bacterianos3.3. Expresividad en unidades Expresividad en unidades

transcripcionalestranscripcionales4.4. Conservación de expresividad entre Conservación de expresividad entre

organismosorganismos5.5. Identificación de genes transferidos Identificación de genes transferidos

horizontalmente horizontalmente H. pyloriH. pylori6.6. Regulación por glucosa en Regulación por glucosa en E. coliE. coli

Datos de microarreglos sobre Datos de microarreglos sobre la reglulación por glucosa en la reglulación por glucosa en

E. coliE. coli ID B_number crp1 crp1 crp2 crp2 wt1 wt1 wt2 wt2 wtg1 wtg1 wtg2 wtg3

señal det señal det señal det señal det señal det señal det

1 aas_b2836_at 279.2 P 293.8 P 360.8 P 332.2 P 475.6 P 374.4 P

2 aat_b0885_at 344.9 A 332.5 A 110.7 A 256.8 A 237.1 A 35.2 A

3 abc_b0199_at 697 M 682.4 A 530.5 A 532 A 747.9 M 882.8 A

4 abrB_b0715_at 537.3 P 518.3 P 446.3 A 329.6 M 357.9 P 406.7 A

5 accA_b0185_at 1735.3 P 1083.1 P 1821.7 P 1798.6 P 2416.7 P 1794.5 P

6 accB_b3255_at 2314.2 P 1811.4 P 1598 P 1835.2 P 3182.8 P 2893.6 P

7 accC_b3256_at 2486.2 P 1845.6 P 1165.7 P 1146 P 2861.6 P 2883.3 P

8 accD_b2316_at 1770.3 P 1468.1 P 1592.1 P 1808.5 P 2257.6 P 2873.7 P

9 aceA_b4015_at 632.6 P 790.2 P 3005.5 P 2489 P 474.1 P 521.8 P

10 aceB_b4014_at 282.7 P 394.4 P 1610.5 P 1254.1 P 274.8 A 316.6 A

11 aceE_b0114_at 5060.1 P 3979.2 P 2093.5 P 1665.9 P 7943.2 P 8411.5 P

12 aceF_b0115_at 3981.4 P 2287.3 P 1998.1 P 1425.8 P 5515.2 P 4993.9 P

13 aceK_b4016_at 386.8 P 331 P 534.9 P 442.9 P 395.4 P 395.3 P

14 ackA_b2296_at 2624.8 P 2084.2 P 1225.1 P 1273 P 3624.6 P 2990.2 P

15 acnA_b1276_at 488 P 326.2 P 701.9 P 773.2 P 368.7 A 460.8 A

16 acnB_b0118_at 2417.2 P 2185.7 P 3693 P 2864.5 P 1401.4 P 1786 P

P-Confiable, A-Dudoso, M-no confiable

Enfoques de estudio Tecnología de microarreglos

Se pueden complementar con otros métodos (RT-PCR, por ejemplo)

Outliers Outliers iteration iteration methodmethod

1. Se obtienen los promedios de las diferentes repeticiones

2. Se comparan los resultados de los experimentos de estudio con los del experimento control. En nuestro ejemplo:

a) WTglucosa/WT b) CRP/WT

3. Se obtienen la media y desviación estandar de los datos.

4.- Para identificar a los genes con mayor significancia asumimos un corte arbitrario de 2 desviaciones estardar relativos a la media.


Genes inducidos

Genes reprimidos

5.- Se vuelve a calcular la media y desviación estandar de los datos que no fueron considerados como estadísticamente importantes.


6.- Nuevamente se considera que los genes por arriba o debajo de dos desviaciones 2 desviaciones estardar relativos a la media son estadísticamente relevantes


Genes inducidos

Genes reprimidos

Outliers Outliers iteration methoditeration methodSe itera hasta ya no encontrar valores por Se itera hasta ya no encontrar valores por

afuera de las 2 DS.afuera de las 2 DS.

Genes inducidos

Genes reprimidos

Genes inducidos

Genes reprimidos

Metodo de Metodo de ordenamiento (Rank)ordenamiento (Rank)

Destaca los genes más expresados. Por lo tanto, no importa el valor en términos absolutos si no

su valor relativo a los demás valores del arreglo Los genes en principio deben de conservar su posición dada

la naturaleza de su expresión.


Gene Valor

A 0.3

B 0.6

C 1.4

D 3.2

E 1.1

F 4.3

G 0.9

H 0.8

I 0.4

J 0.7

Pos Gene Valor

1 F 4.3

2 D 3.2

3 C 1.4

4 E 1.1

5 G 0.9

6 H 0.8

7 J 0.7

8 B 0.6

9 I 0.4

10 A 0.3


Pos Gene Valor

1 F 4.3

2 D 3.2

3 C 1.4

4 E 1.1

5 G 0.9

6 H 0.8

7 J 0.7

8 B 0.6

9 I 0.4

10 A 0.3

¿Cuál es la probabilidad de quedar en primer lugar?

P-primero = 1/n = 1/10 = 0.1

¿Cuál es la probabilidad de quedar en primer o segundo lugar?

p= p-primero + p-segundo = 0.2

¿Cuál es la probabilidad de quedar en segundo lugar?

P-segundo = 1/n = 1/10 = 0.1


Pos Gene Valor

1 F 4.3

2 D 3.2

3 C 1.4

4 E 1.1

5 G 0.9

6 H 0.8

7 J 0.7

8 B 0.6

9 I 0.4

10 A 0.3

Pos Gene

Valor Prob

1 F 4.3 0.1

2 D 3.2 0.2

3 C 1.4 0.3

4 E 1.1 0.4

5 G 0.9 0.5

6 H 0.8 0.6

7 J 0.7 0.7

8 B 0.6 0.8

9 I 0.4 0.9

10 A 0.3 1.0


¿Cómo consideramos la probabilidad si existen

repeticiones del experimento?Media geométrica

1er exp -> 3er lugar -> 0.32do exp -> 1er lugar -> 0.13er exp -> 2do lugar -> 0.2

P = (0.3 x 0.1 x 0.2)1/3

P = (0.06)1/3

P = 0.39

Problemas a resolver sobre Problemas a resolver sobre análisis de microarreglosanálisis de microarreglos

1. ¿Cuáles son los genes reprimidos e inducidos cuando E. coli es crecida en glucosa?

2. ¿Cuáles son los genes reprimidos e inducidos en la mutante CRP de E. coli?

3. ¿Cuál es la coincidencia entre los genes de los puntos 1 y 2?

4. ¿Qué tanto coinciden los resultados del análisis de Outliers y Rank?

5. Identifique el posible elemento de regulación común a estos genes utilizando Glim2 y MEME

computational searches of biological sequences

Documents

protein motif

weight matrix

specific motif

dna motif

set of aligned sequences

frequency matrix

pair of sequences

weight matrices