Comparative genomics and genome alignment (asa/courses/cs548/fall11/pdfs...)


Comparative genomics and genome alignment

Conservation detected by genome alignment

Conservation of ZFPM1 among human, mouse, and rat. The large introns contain several highly conserved regions. Those with conserved GATA-1 binding sites and high regulatory potential (predicted cis-regulatory modules, CRMs) are indicated.

http://genome.cshlp.org/content/15/1/184.full

Mulan phylogenetic tree (A) and sequence conservation profile (B) for the GATA3 gene locus from human, rat, mouse, chicken, frog, and three fish genomes.

Ovcharenko I et al. Genome Res. 2005;15:184-194

©2005 by Cold Spring Harbor Laboratory Press

What is comparative genomics?

•  Methods for characterizing DNA sequences using multiple genomes

•  Examples of problems that benefit from a comparative approach:
    •  Gene prediction
    •  Gene regulation
    •  Understanding evolutionary relationships between species according to their genome architecture

Comparing Genomes: Global Alignment of Long DNA Sequences

Motivation

•  Genomic sequences are very long

•  Aligning genomic regions is useful for revealing conserved elements
•  Want to compare regions longer than 1,000,000 bases


Alignment Uses

•  Whole genome alignment
    •  Synteny analysis
    •  Polymorphism detection
    •  Sequence mapping

•  Multiple genome alignment
    •  Identify conserved sequence, e.g. functional elements (annotation)
    •  Polymorphism detection

Alignment methods

•  Local alignment methods (BLAST-like):
    •  Do not provide the global picture
    •  Weakly conserved local regions can be missed

•  Global alignment methods (e.g. Needleman-Wunsch):
    •  Assume two regions are globally alignable
    •  Slow, memory-intensive algorithms that cannot be run on long genomic segments
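A rough calculation makes the memory problem concrete. The sketch below counts the cells of a full global-alignment DP table for two sequences of 1,000,000 bases each. The 4 bytes per cell is an assumption made for illustration; real implementations differ.

```python
def full_dp_cells(n: int, m: int) -> int:
    """Number of cells in a full (n + 1) x (m + 1) global-alignment DP table."""
    return (n + 1) * (m + 1)


BYTES_PER_CELL = 4  # assumption: one 32-bit score per cell

n = m = 1_000_000
cells = full_dp_cells(n, m)
terabytes = cells * BYTES_PER_CELL / 1e12
print(f"{cells:.3e} cells, about {terabytes:.1f} TB")
# prints: 1.000e+12 cells, about 4.0 TB
```

Under these assumptions the table alone needs terabytes of memory, which is why the full quadratic table is impractical at this scale.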

Alignment of long genomic regions

•  Input: two long genomic sequences that are homologous

Main Idea

Genomic regions of interest contain islands of similarity, such as genes

1.  Find local alignments/anchors
2.  Chain an optimal subset of them
3.  Refine/complete the alignment

Systems that use this idea to various degrees:
MUMmer, GLASS, DIALIGN, AVID, LAGAN, TBA
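Step 1 can be illustrated with a deliberately simple anchor finder. This sketch is not how any of the listed systems work internally: it assumes anchors are exact shared k-mers, and the function name `kmer_anchors` and the toy sequences are made up for the example.

```python
from collections import defaultdict


def kmer_anchors(a: str, b: str, k: int):
    """Return sorted (start_in_a, start_in_b, k) for every exact k-mer shared by a and b."""
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)          # positions of each k-mer in a

    anchors = []
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):  # .get avoids inserting new keys
            anchors.append((i, j, k))
    return sorted(anchors)


anchors = kmer_anchors("ACGTTTGACC", "GGACGTAAGACC", k=4)
print(anchors)  # [(0, 2, 4), (6, 8, 4)]
```

The two anchors found here (ACGT and GACC) are the "islands of similarity" that the chaining step then links together.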

Saving cells in DP

1.  Find local alignments
2.  Chain
3.  Restricted DP
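The saving can be estimated by counting cells. The sketch below assumes the chained anchors are given as hypothetical `(x_start, y_start, x_end, y_end)` boxes, already sorted and non-overlapping, and that restricted DP fills only the rectangles between consecutive anchors and the sequence ends. Real tools may restrict the search differently, for example with a band around the chain.

```python
def restricted_dp_cells(n: int, m: int, chain) -> int:
    """Cells filled when DP is run only in the gaps between chained anchors.

    chain: sorted, non-overlapping anchors as (x_start, y_start, x_end, y_end).
    """
    cells = 0
    prev_x, prev_y = 0, 0                      # end of the previous anchor
    for x_start, y_start, x_end, y_end in chain:
        cells += (x_start - prev_x + 1) * (y_start - prev_y + 1)
        prev_x, prev_y = x_end, y_end
    cells += (n - prev_x + 1) * (m - prev_y + 1)  # gap after the last anchor
    return cells


chain = [(10, 10, 40, 40), (50, 55, 90, 95)]   # made-up anchors
restricted = restricted_dp_cells(100, 100, chain)
full = (100 + 1) * (100 + 1)
print(restricted, full)  # 363 10201
```

In this toy case the restricted computation touches 363 cells instead of 10201.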

Chaining of Local Alignments

Each local alignment has a weight

Find the chain with the highest total weight


A Graph Structure

•  Build a Directed Acyclic Graph (DAG):
    •  Nodes: local alignments, each given by a start point (xa, ya), an end point (xb, yb), and a score
    •  Directed edges: pairs of local alignments that can be chained

•  There is an edge from ((xa, ya), (xb, yb)) to ((xc, yc), (xd, yd)) if:
    xa < xb < xc < xd
    ya < yb < yc < yd

Each local alignment is a node vi with alignment score si
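The edge rule can be sketched in a few lines. The names `LocalAlignment`, `can_chain`, and `build_dag` are illustrative, not taken from any tool. The fields `x_start, y_start` correspond to (xa, ya) and `x_end, y_end` to (xb, yb), and every alignment is assumed to satisfy start < end in both coordinates.

```python
from typing import NamedTuple


class LocalAlignment(NamedTuple):
    x_start: int
    y_start: int
    x_end: int
    y_end: int
    score: float


def can_chain(u: LocalAlignment, v: LocalAlignment) -> bool:
    """True if u ends strictly before v starts in both sequences."""
    return u.x_end < v.x_start and u.y_end < v.y_start


def build_dag(nodes):
    """Adjacency list: node index -> indices of nodes that may follow it."""
    return {
        i: [j for j, v in enumerate(nodes) if can_chain(u, v)]
        for i, u in enumerate(nodes)
    }


nodes = [
    LocalAlignment(0, 0, 10, 10, 5),
    LocalAlignment(20, 25, 30, 35, 4),
    LocalAlignment(15, 40, 18, 50, 3),
    LocalAlignment(40, 60, 50, 70, 6),
]
dag = build_dag(nodes)
print(dag)  # {0: [1, 2, 3], 1: [3], 2: [3], 3: []}
```

Nodes 1 and 2 are not connected in either direction because they are ordered differently along the two sequences. The graph is acyclic because every edge moves strictly forward in both coordinates.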

A Dynamic Programming Algorithm

Initialization:
    Find each node va such that there is no edge (u, va)
    Set the score V(a) to sa

Iteration:
    For each vi, the optimal path ending in vi has total score:

    $$V(i) = \max_{\text{edges } (v_j, v_i)} \bigl( s_i + V(j) \bigr)$$

Termination:
    Optimal global chain:
    $j = \arg\max_j V(j)$; trace the chain back from vj

Complexity: quadratic in the number of local alignments
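A minimal sketch of this recurrence follows. It assumes alignments are hypothetical `(x_start, y_start, x_end, y_end, score)` tuples with start < end, so that sorting by `x_start` visits every possible predecessor before its successors. The function name `best_chain` is made up for the example.

```python
def best_chain(alignments):
    """Return (best total score, indices of the chain in order).

    alignments: list of (x_start, y_start, x_end, y_end, score).
    """
    order = sorted(range(len(alignments)), key=lambda i: alignments[i][0])
    V, back = {}, {}
    for pos, i in enumerate(order):
        x_start, y_start, _, _, s = alignments[i]
        V[i], back[i] = s, None                  # chain consisting of v_i alone
        for j in order[:pos]:                    # candidate predecessors
            _, _, x_end_j, y_end_j, _ = alignments[j]
            if x_end_j < x_start and y_end_j < y_start and V[j] + s > V[i]:
                V[i], back[i] = V[j] + s, j      # V(i) = max(s_i + V(j))
    if not V:
        return 0, []
    node = max(V, key=V.get)                     # j = argmax V(j)
    best = V[node]
    chain = []
    while node is not None:                      # trace back from v_j
        chain.append(node)
        node = back[node]
    return best, chain[::-1]


alns = [(0, 0, 10, 10, 5), (20, 25, 30, 35, 4), (15, 40, 18, 50, 3), (40, 60, 50, 70, 6)]
score, chain = best_chain(alns)
print(score, chain)  # 15 [0, 1, 3]
```

The nested loop over earlier alignments is the source of the quadratic running time.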

Multiple Genome Alignment

•  Given a pairwise alignment method, it is possible to convert it into a progressive multiple sequence alignment (MSA) method.

•  Examples: LAGAN --> MLAGAN, AVID --> MAVID
•  In MAVID, progressive alignment is performed by inferring ancestral sequences and aligning them.
•  MLAGAN performs an optional step of iterative refinement
•  Both methods use anchors from the pairwise comparisons to avoid making the sort of errors that progressive aligners can make.
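The conversion from a pairwise method to a progressive one can be sketched as a post-order walk over a guide tree. Everything here is illustrative: the guide tree is a hypothetical nested tuple, and `record` is a stand-in that only logs which inputs would be aligned. A real system would call its pairwise aligner at that point.

```python
def progressive_align(tree, align_pair):
    """Align the leaves of a binary guide tree bottom-up.

    tree: a leaf (any non-tuple value) or a 2-tuple (left_subtree, right_subtree).
    align_pair: function combining two sequences or alignments into one.
    """
    if not isinstance(tree, tuple):
        return tree                               # a single sequence
    left, right = tree
    return align_pair(
        progressive_align(left, align_pair),
        progressive_align(right, align_pair),
    )


steps = []


def record(a, b):
    """Stand-in for a pairwise aligner: logs the pair and returns a label."""
    steps.append((a, b))
    return f"({a}+{b})"


guide_tree = (("human", "mouse"), "chicken")      # hypothetical guide tree
result = progressive_align(guide_tree, record)
print(steps)   # [('human', 'mouse'), ('(human+mouse)', 'chicken')]
print(result)  # ((human+mouse)+chicken)
```

The closest pair is aligned first, and its result is then aligned to the next sequence up the tree.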

MLAGAN vs. ClustalW

Sidetrack: visualization

•  How can we visualize genome alignments?

•  With an alignment dot plot
    •  N x M matrix
        •  Let i = position in genome A
        •  Let j = position in genome B
        •  Fill cell (i,j) if Ai shows similarity to Bj

•  A perfect alignment between A and B would completely fill the positive diagonal
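A toy version of the dot plot can be built directly from this definition. The sketch assumes "shows similarity" means an exact match of single characters, which is much cruder than what genome-scale tools use. The helper names `dot_plot` and `render` are made up.

```python
def dot_plot(a: str, b: str):
    """N x M boolean matrix: cell (i, j) is True when a[i] == b[j]."""
    return [[ai == bj for bj in b] for ai in a]


def render(matrix) -> str:
    """Draw filled cells as '*' and empty cells as '.'."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)


plot = dot_plot("ACGT", "ACGT")
print(render(plot))
# *...
# .*..
# ..*.
# ...*
```

Aligning a sequence against itself fills the main diagonal, matching the perfect-alignment case described above.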


[Figure: dot plots of genome A against genome B illustrating a translocation, an inversion, and an insertion]

http://mummer.sourceforge.net/manual/AlignmentTypes.pdf

Glocal Alignment Problem

Find the least-cost transformation of one sequence into another using new operations:

•  Sequence edits
•  Inversions
•  Translocations
•  Duplications

Implemented in Shuffle-LAGAN

Whole genome alignment

[Figure from the MUMmer manual]