Comparative genomics and genome alignment
Conservation detected by genome alignment
Conservation of ZFPM1 among human, mouse, rat, and mouse. The large introns have several highly conserved regions. Those with conserved GATA-1 binding sites and high regulatory potential (predicted CRMs) are indicated.
http://genome.cshlp.org/content/15/1/184.full
Mulan phylogenetic tree (A) and sequence conservation profile (B) for the GATA3 gene locus from human, rat, mouse, chicken, frog, and three fish genomes.
Ovcharenko I et al. Genome Res. 2005;15:184-194
©2005 by Cold Spring Harbor Laboratory Press
What is comparative genomics?
• Methods for characterizing DNA sequences using multiple genomes
• Examples of problems that benefit from a comparative approach:
  • Gene prediction
  • Gene regulation
  • Understanding evolutionary relationships between species according to their genome architecture
Comparing Genomes: Global Alignment of Long DNA Sequences
Motivation
• Genomic sequences are very long
• Aligning genomic regions is useful for revealing conserved elements
• Want to compare regions longer than 1,000,000 bases
Alignment Uses
• Whole genome alignment
  • Synteny analysis
  • Polymorphism detection
  • Sequence mapping
• Multiple genome alignment
  • Identify conserved sequence, e.g. functional elements (annotation)
  • Polymorphism detection
Alignment methods
• Local alignment methods (BLAST-like):
  • Do not provide the global picture
  • Weakly conserved local regions can be missed
• Global alignment methods (e.g. Needleman-Wunsch):
  • Assume two regions are globally alignable
  • Slow, memory-intensive algorithms that cannot be run on long genomic segments
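To make the cost of global alignment concrete, here is a minimal sketch of Needleman-Wunsch scoring. The scoring values (match +1, mismatch -1, linear gap -1) and the function name are assumptions chosen for illustration, not taken from the slides. The sketch keeps only two rows, so it returns the score without the traceback.

```python
def needleman_wunsch_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score of a and b with a linear gap penalty.

    Only two DP rows are stored, so memory is O(len(b)),
    but every one of the len(a) * len(b) cells is still computed.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr[j] = max(diag, up, left)
        prev = curr
    return prev[len(b)]
```

For two regions of 1,000,000 bases each, the double loop visits 10^12 cells, and recovering the alignment itself (not just the score) with the standard full table would also require storing that many cells.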
Alignment of long genomic regions
• Input: two long genomic sequences that are homologous
Main Idea
Genomic regions of interest contain islands of similarity, such as genes
1. Find local alignments/anchors
2. Chain an optimal subset of them
3. Refine/complete the alignment
Systems that use this idea to various degrees:
MUMmer, GLASS, DIALIGN, AVID, LAGAN, TBA
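Step 1 can be illustrated with the simplest possible anchor finder: exact shared k-mers. This is only a sketch under that assumption. The systems listed above each use their own, more sophisticated anchoring strategies, and the function name and the default `k` are hypothetical.

```python
from collections import defaultdict

def find_kmer_anchors(a: str, b: str, k: int = 4):
    """Return (start_in_a, start_in_b, k) for every k-mer of a that also occurs in b."""
    # Index every k-mer of b by its start positions.
    index = defaultdict(list)
    for j in range(len(b) - k + 1):
        index[b[j:j + k]].append(j)

    # Look up each k-mer of a in the index.
    anchors = []
    for i in range(len(a) - k + 1):
        for j in index.get(a[i:i + k], []):
            anchors.append((i, j, k))
    return anchors
```

Each returned triple is a candidate anchor. Repeated k-mers produce many spurious candidates, which is one reason the chaining step is needed to select a consistent subset.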
Saving cells in DP
1. Find local alignments
2. Chain
3. Restricted DP
Chaining of Local Alignments
Each local alignment has a weight
Find the chain with the highest total weight
A Graph Structure
• Build a Directed Acyclic Graph (DAG):
  • Nodes: local alignments, each given by its start point $(x_a, y_a)$, its end point $(x_b, y_b)$, and a score
  • Directed edges: pairs of local alignments that can be chained
• There is an edge from $((x_a, y_a), (x_b, y_b))$ to $((x_c, y_c), (x_d, y_d))$ if:
  $x_a < x_b < x_c < x_d$
  $y_a < y_b < y_c < y_d$
Each local alignment is a node $v_i$ with alignment score $s_i$
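The edge condition can be written down directly. The sketch below is one possible encoding, assuming each local alignment is stored with its start and end coordinates in both sequences. The `LocalAlignment` type and both function names are hypothetical.

```python
from typing import List, NamedTuple

class LocalAlignment(NamedTuple):
    x_start: int  # x_a: start in the first sequence
    x_end: int    # x_b: end in the first sequence
    y_start: int  # y_a: start in the second sequence
    y_end: int    # y_b: end in the second sequence
    score: float  # s_i

def can_chain(u: LocalAlignment, v: LocalAlignment) -> bool:
    """True if u ends strictly before v starts in both sequences (edge u -> v)."""
    return u.x_end < v.x_start and u.y_end < v.y_start

def build_chain_graph(alignments: List[LocalAlignment]) -> List[List[int]]:
    """predecessors[i] lists every j such that there is an edge v_j -> v_i."""
    return [[j for j, u in enumerate(alignments) if can_chain(u, v)]
            for v in alignments]
```

Because an edge requires strictly increasing coordinates in both sequences, no cycle is possible, so the result is a DAG. Testing all pairs takes time quadratic in the number of local alignments.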
A Dynamic Programming Algorithm
Initialization:
  Find each node $v_a$ such that there is no edge $(u, v_a)$
  Set the score $V(a)$ to be $s_a$
Iteration:
  For each $v_i$, the optimal path ending in $v_i$ has total score:
  $V(i) = \max_{j \,:\, (v_j, v_i) \text{ is an edge}} \left( s_i + V(j) \right)$
Termination:
  Optimal global chain:
  $j^* = \arg\max_j V(j)$; trace the chain back from $v_{j^*}$
Complexity: quadratic in the number of local alignments
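The recurrence can be sketched as a short quadratic-time routine. This is an illustrative sketch, not the code of any particular aligner. It assumes positive scores and alignments given as `(x_start, y_start, x_end, y_end, score)` tuples, and the function name is hypothetical.

```python
def best_chain(alignments):
    """Return (total_score, chain) for the highest-scoring chain.

    alignments: list of (x_start, y_start, x_end, y_end, score) tuples
                with x_start <= x_end and positive scores (assumed).
    chain: indices into `alignments`, in chain order.
    """
    # Any predecessor ends before its successor starts, so sorting by
    # x_start guarantees V[j] is final before node i reads it.
    order = sorted(range(len(alignments)), key=lambda i: alignments[i][0])
    V, back = {}, {}
    for pos, i in enumerate(order):
        xs, ys, _, _, s = alignments[i]
        V[i], back[i] = s, None  # initialization: chain consisting of v_i alone
        for j in order[:pos]:
            _, _, pxe, pye, _ = alignments[j]
            # Edge v_j -> v_i exists if v_j ends before v_i starts in both sequences.
            if pxe < xs and pye < ys and V[j] + s > V[i]:
                V[i], back[i] = V[j] + s, j
    if not V:
        return 0, []
    # Termination: start from the best-scoring end node and trace back.
    node = max(V, key=V.get)
    total, chain = V[node], []
    while node is not None:
        chain.append(node)
        node = back[node]
    return total, chain[::-1]
```

The nested loops examine every ordered pair of local alignments once, which gives the quadratic running time.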
Multiple Genome Alignment
• Given a pairwise alignment method, it is possible to convert it into a progressive MSA method.
• Examples: LAGAN --> MLAGAN, AVID --> MAVID
• In MAVID, progressive alignment is performed by inferring ancestor sequences and aligning them.
• MLAGAN performs an optional step of iterative refinement
• Both methods use anchors from the pairwise comparisons to avoid making the sort of errors that progressive aligners can make.
MLAGAN vs. ClustalW
Sidetrack: visualization
• How can we visualize genome alignments?
• With an alignment dot plot
  • N x M matrix
  • Let i = position in genome A
  • Let j = position in genome B
  • Fill cell (i, j) if $A_i$ shows similarity to $B_j$
• A perfect alignment between A and B would completely fill the positive diagonal
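A dot plot is easy to sketch for short sequences. In the example below, "shows similarity" is taken to mean an exact match of a short window, which is an assumption made for illustration. The function names and the `window` parameter are hypothetical.

```python
def dot_plot(a: str, b: str, window: int = 1):
    """Boolean matrix: cell (i, j) is True if a[i:i+window] == b[j:j+window]."""
    n = len(a) - window + 1
    m = len(b) - window + 1
    return [[a[i:i + window] == b[j:j + window] for j in range(m)]
            for i in range(n)]

def render(matrix) -> str:
    """Draw filled cells as '*' and empty cells as '.'."""
    return "\n".join("".join("*" if cell else "." for cell in row)
                     for row in matrix)
```

Calling `print(render(dot_plot("ACGT", "ACGT")))` draws a single filled diagonal. Larger windows suppress chance matches between individual bases.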
[Figure: dot plots of genome A against genome B; panels: Translocation, Inversion, Insertion]
http://mummer.sourceforge.net/manual/AlignmentTypes.pdf
Glocal Alignment Problem
Find the least-cost transformation of one sequence into another using new operations:
• Sequence edits
• Inversions
• Translocations
• Duplications
Implemented in Shuffle-LAGAN
Whole genome alignment
Figure from the MUMmer manual