VISTA family of computational tools for comparative genomics

• How can we leverage genome sequences from many species to learn about genome function?

• Microbial applications

Inna Dubchak, Genomics Division LBNL, JGI [email protected]@lbl.gov

Human Genome Annotation

Gene A

• only 1–2% coding

• efficient identification of regulatory sequences?

Sequence conservation implies function

Figure: two aligned sequences descended from a last common ancestor (80 million years). A block shared by both sequences (CTATAAATGC) is marked as a potential functional region.

divergence = non functional

conservation = functional region

Comparative Genomics Introduction

Human

Drosophila, Mouse, Urchin, Chimp

Similar Genes, Synteny

Sequence Alignment

http://genome.lbl.gov/vista

VISTA is an integrated system for global sequence alignment and visualization for comparative genomic analysis

Algorithm        Feature
AVID*            can handle draft sequence
LAGAN**          produces true multiple alignments
Shuffle-LAGAN**  handles rearrangements (inversions, translocations)

* Lior Pachter, UC Berkeley
** Michael Brudno, U. Toronto

How does VISTA Work: Global Genomic Alignments

sequence 1

sequence 2

1- anchoring: identify regions of strong similarity

2- chaining: join regions of weak or no similarity
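The two steps above can be sketched in a few lines of Python. This is only an illustration of the anchor-then-chain idea, not the AVID or LAGAN implementation: the function names, the use of exact k-mer matches as anchors, and the value of `k` are all assumptions made for the example.

```python
def find_anchors(seq1, seq2, k=8):
    """Anchoring: return sorted (pos1, pos2) pairs where an identical k-mer occurs in both sequences."""
    index = {}
    for i in range(len(seq1) - k + 1):
        index.setdefault(seq1[i:i + k], []).append(i)
    anchors = []
    for j in range(len(seq2) - k + 1):
        for i in index.get(seq2[j:j + k], []):
            anchors.append((i, j))
    return sorted(anchors)


def chain_anchors(anchors):
    """Chaining: keep the longest run of anchors that increases in both coordinates."""
    anchors = sorted(anchors)
    if not anchors:
        return []
    best = [1] * len(anchors)   # length of the best chain ending at each anchor
    prev = [-1] * len(anchors)  # predecessor of each anchor in that chain
    for b in range(len(anchors)):
        for a in range(b):
            colinear = anchors[a][0] < anchors[b][0] and anchors[a][1] < anchors[b][1]
            if colinear and best[a] + 1 > best[b]:
                best[b] = best[a] + 1
                prev[b] = a
    end = max(range(len(anchors)), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(anchors[end])
        end = prev[end]
    return chain[::-1]
```

The regions between consecutive anchors in the returned chain are the stretches of weak or no similarity that a global aligner would then fill in. The quadratic chaining loop is chosen for readability, not speed.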

104670599 TCCCCAACTATAAATGGATGAAATTGCAGGAAATGACAGGTA-----TGACCCCTTCTCT 104670653
>>>>>>>>> ||| ||| | |||||| | || || | | | ||||||| || <<<<<<<<<
052328645 TCCTCAATTCAGAATGGAGGGAAGCACACAGGACACAGAGATCCCTTTACCCCCTTCGCT 052328704

104670654 ACCAGAGGCTTGGATTTTTTTTCTTCTTCTCCTCCCTTAGCCCGTGTTGAGCTATTTCGG 104670713
>>>>>>>>> | | | || | | | <<<<<<<<<
052328705 ATGT----------------------------------------TATCAGGCCACTCAAG 052328724

104670714 AGTTTCCTGGCAGGGAAGAGCGAGTGAGGCTGCCTTACCTTCAGGATGACCACTAGCAGG 104670773
>>>>>>>>> |||| | || || | ||||| ||||||| | ||| ||||||| ||||||||| |||||| <<<<<<<<<
052328725 AGTTCCTTGTCAAG-AAGAGTGAGTGAGTCCACCTCACCTTCAAGATGACCACCAGCAGG 052328783

104670774 CCAGCGCTCACAAGAAGAGGAATGAGGCTACTAATGAACCAGCTAAACCAGAGGATGCTG 104670833
>>>>>>>>> |||||||||||||| ||||| |||||||| |||| |||||||||||||||||||||| <<<<<<<<<
052328784 CCAGCGCTCACAAGCAGAGGGATGAGGCTGCTAACAAACCAGCTAAACCAGAGGATGCCA 052328843

104670834 TTGTCCAGGCCCATGATCCGCATGGTCTCTTTCAGCCGTGCCTCCTTCTCATACACGATG 104670893
>>>>>>>>> |||||||| |||||||||||||||||||| |||||||| ||||||||||||||||| ||| <<<<<<<<<
052328844 TTGTCCAGACCCATGATCCGCATGGTCTCCTTCAGCCGAGCCTCCTTCTCATACACAATG 052328903

104670894 CCCTTGATGATCACAGCCACTGAGTAAATCCAGGCCAGCGTCATGAAGAGGGGCATTGAC 104670953
>>>>>>>>> | ||||||||||||||| || ||||| |||||||| || ||||||||||||||||||||| <<<<<<<<<
052328904 CTCTTGATGATCACAGCGACAGAGTAGATCCAGGCTAGAGTCATGAAGAGGGGCATTGAC 052328963

104670954 CGGCTCATCACCCGCAGAAAGCTGGAGGCCCCAAGGAAGGACAAGGGGAGAAAGAAAGAC 104671013
>>>>>>>>> |||||||| ||||||||||| |||||||| | || || | || ||| | || |||| <<<<<<<<<
052328964 CGGCTCATGACCCGCAGAAAACTGGAGGCACAGAGAAAAGGCATGGGAAAAATGAAAAGT 052329023

104671014 ACACGTGAGCCAGGGTGATGGGCCAAGGCCTCTGAGCCTGCATGCTAGAGGGAGCACCAC 104671073
>>>>>>>>> ||||||| || | ||||||||| |||| || |||| ||| | <<<<<<<<<
052329024 ----GTGAGCCCGG-CACCGATCCAAGGCCT-------TGCACACTGGAGGACAAACCTC 052329071

104671074 ATCTGGGCCACAGAAGGACAGGCCCTCTAGACTCTGAAATGTACGTATGATCCAATGCTT 104671133
>>>>>>>>> ||| ||| | | | | | |||||| || ||||| ||||| | | || | || <<<<<<<<<
052329072 ATCAGGGTCGCTTATGAA-AGGCCCACTGAACTCTCAAATG--------ACCAAAGGTTT 052329122

104671134 CACGAGCAATGCAATGTAGAGAGAAAAACGAGGCTAACAAAGTGTTGCCAAACCAAATTT 104671193
>>>>>>>>> || |||| || | ||||| ||| | || | | || | ||| | |||||| <<<<<<<<<
052329123 CATTAGCAGTGGA---CAGAGATGAAACCTGGGTTTCGAGGGTATGGCCGTGCAAAATTT 052329179

104671194 CTTTGGGGGCTTGCTTCAGTAACTAGGTAACTGTGAGCGATAC-TTAAACTAAAGGTAGA 104671252
>>>>>>>>> || |||||| ||| | || ||||| || | || | | |||| |||| || <<<<<<<<<
052329180 TTTCAGGGGCTCTCTTTAATAGCTAGGAAATGGATAGGGTAATATTAAGATAAATATAAG 052329239

104671253 TTATGTTA--AAGTACTAAAAACCAAAACA------AAAAAACAACTCATTCTCTCACAA 104671304
>>>>>>>>> ||| || |||||||||| || || | || ||||| ||| | | | <<<<<<<<<
052329240 TTACTCTACTAAGTACTAAACACAAAGGGCGGGGGCAGAATCCAACTTGGTCTTCCGCTA 052329299

Global Genomic Aligner Output

VISTA visualization

104637349 GTAGTGCCACTGAGTGTGACAGGGATGGCAAGAAAAGCATTAAGTTCCAAGGGGAAAGAA 104637408
>>>>>>>>> | || ||| ||| |||| |||||||||| | || || |||| | |||||||| <<<<<<<<<
052290302 GAGATGTCACCAAGTA-AACAGAGATGGCAAGAGGACCAATAGGTTCTAGTGGGAAAGAC 052290360

“Sliding window” to measure sequence conservation (default window size 100 bp)

Graphical presentation of sequence conservation as a “peaks-and-valleys” curve

>70% identity

base sequence coordinates

% identity
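A minimal sketch of the sliding-window measurement, assuming two rows of a pairwise alignment as input. The 100 bp window and the 70% level come from the slide; the function names are hypothetical, and sliding over alignment columns (rather than base-sequence coordinates) with gaps counted as mismatches is a simplification chosen for this example.

```python
def window_identity(aln1, aln2, window=100):
    """Percent identity for every window of aligned columns; gap columns count as mismatches."""
    if len(aln1) != len(aln2):
        raise ValueError("aligned sequences must have equal length")
    matches = [int(a == b and a != "-") for a, b in zip(aln1, aln2)]
    if len(matches) < window:
        return []
    total = sum(matches[:window])
    scores = [100.0 * total / window]
    for start in range(1, len(matches) - window + 1):
        # Slide by one column: add the column entering the window, drop the one leaving it.
        total += matches[start + window - 1] - matches[start - 1]
        scores.append(100.0 * total / window)
    return scores


def conserved_windows(scores, threshold=70.0):
    """Start columns of windows whose identity exceeds the threshold."""
    return [i for i, score in enumerate(scores) if score > threshold]
```

Plotting `scores` against the window start position gives the peaks-and-valleys curve; `conserved_windows` picks out the peaks above the chosen level.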

VISTA homepage: http://genome.lbl.gov/vista

VISTA Servers (submit your own data)

VISTA Browsers (precomputed alignments)

Other VISTA-related Projects

• Access servers, browsers, other information

VISTA Servers

wgVISTA
Align and compare sequences, including microbial assemblies

mVISTA
Align and compare sequences

rVISTA
Search for TFBS combined with a comparative sequence analysis

GenomeVISTA
Align DNA sequence to a genome

Precomputed Alignments

VISTA Browser
Browse through pre-computed whole-genome alignments

Whole Genome rVISTA
Whole genome analysis for conserved TFBS over-represented in upstream regions of genes

VISTA-Point
Browse and obtain sequence and alignment data

VISTA Browser: Access

VISTA Browser: Input Menu

genome position

visualization

Java 2, if needed

• Choose “base” genome
• Select location
• Determine visualization preference

VISTA Browser

VISTA tracks on UCSC Browser

VISTA-Point

VISTA Browser: Alignment Details

direction

exon

repeats

alignment

SNPs

gene

VISTA Browser: Result

Position on chromosome

Control Panel

Graphical display of genome alignments

Color Legend

Cursor Info

Menu & Icons

Curve annotation (species)

1 row

VISTA Browser: Zooming

vs. rhesus

vs. dog

VISTA Browser

VISTA Point: Access Overview

VISTA Point: Graphics Table

VISTA Point: Alignments Table

sequence

Google map-like Dot-Plot

BlockView – Synteny Plot tool

RegTransBase – experimental data

manually curated database of regulatory interactions captured from literature; 6000 papers

RegPrecise – computational predictions

manually curated database of regulons inferred by comparative genomics approach

RegPredict – web tool for regulon inference

integrated system for fast and accurate inference of regulons by comparative genomics

NAR database issue, 2010; Featured Article

NAR Web Server issue, 2010; Featured Article

Principal components

NAR database issue, 2007

mVISTA: Access

mVISTA: Interface

• Our example will show 3 sequences
• Align up to 100 sequences

mVISTA: Input of Sequences

• Provide your email address
• Upload your sequences
• Or enter GenBank ID

your email

upload file or GenBank ID

mVISTA: Input Parameters

AVID
– multiple pairwise alignments
– accepts finished or draft sequences

LAGAN
– true multiple alignments

Shuffle-LAGAN
– multiple pairwise alignments
– detects sequence rearrangements and inversions

mVISTA: Results

PDF

VISTA Browser

VISTA-Point

wgVISTA: Microbial Assemblies Comparison

• wgVISTA: whole genome VISTA
• Compares 2 sequences (up to 10 Mb)
• Draft or finished microbial assembly sequences can be used

rVISTA: Access

Regulatory VISTA (rVISTA): prediction of transcription factor binding sites

Simultaneous searches of the major transcription factor binding site database (Transfac) and the use of global sequence alignment to sieve through the data

rVISTA search is automatically run when submitting:
• mVISTA
• genomeVISTA

Human  TGATTTCTCGGCAGCAAGGGAGGGCCCCATGACAAAGCCATTTGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCTGTCTCTCCCTTCCCCTCTG
Mouse  TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCACTCGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCTCTCTCTTCCTCCCCCTCCA
Dog    TGATTTCTCGGCAGCAAGGGAGGGCCCCATGACGAAGCCATTTGAAATCCCAGAAGCGATTTTCTACCTACGACCTCACTTTCTGTTGCGCTCACTCCCTTCCCCTGCA
Rat    TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCACTCGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGTTCTCTCTTCCTCCCCCTCCA
Cow    TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCATTTGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCGTTCTCTCCCTTCCCCTCCT
Rabbit TGATTTCTCGGCAGCCAGGGAGGGCCCCACGAC-AAGCCATTCAAAATCCCAGAAGTGATTTTCTACTTACGACCTCACTTTCTGTTG----CTCTCTCCTTCCCTCCA

Ikaros-2 Ikaros-2 NFAT Ikaros-2

20 bp dynamic shifting window

>80% ID

1. Identify potential transcription factor binding sites for each sequence using library of matrices (TRANSFAC)

2. Identify aligned sites using VISTA

3. Identify conserved sites using dynamic shifting window
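The three steps can be illustrated with a small sketch. It is a stand-in, not the rVISTA implementation: step 1 here uses an exact match to a single example motif instead of TRANSFAC matrices, the function names are hypothetical, and the "dynamic shifting window" is interpreted as trying every 20-column window that fully covers the site. The 20 bp window and 80% identity level come from the slide.

```python
def find_sites(seq, motif):
    """Step 1 stand-in: start positions of exact motif matches in an ungapped sequence."""
    m = len(motif)
    return [i for i in range(len(seq) - m + 1) if seq[i:i + m] == motif]


def column_of_each_base(aligned):
    """Alignment column of every non-gap character, in sequence order."""
    return [col for col, ch in enumerate(aligned) if ch != "-"]


def rvista_like(aln1, aln2, motif, window=20, min_identity=80.0):
    """Return (aligned_sites, conserved_sites) as alignment start columns."""
    cols1, cols2 = column_of_each_base(aln1), column_of_each_base(aln2)
    sites1 = {cols1[i] for i in find_sites(aln1.replace("-", ""), motif)}
    sites2 = {cols2[i] for i in find_sites(aln2.replace("-", ""), motif)}
    # Step 2: a site is "aligned" when both sequences have it at the same column.
    aligned = sorted(sites1 & sites2)
    match = [a == b and a != "-" for a, b in zip(aln1, aln2)]
    conserved = []
    for col in aligned:
        end = col + len(motif)  # assumes no gaps inside the site
        first = max(0, end - window)
        last = min(col, len(match) - window)
        # Step 3: shift the window across every start position that covers the site.
        if any(100.0 * sum(match[s:s + window]) / window > min_identity
               for s in range(first, last + 1)):
            conserved.append(col)
    return aligned, conserved
```

For example, `rvista_like(row_a, row_b, "TGACTCA")` on two alignment rows returns the sites present in both rows and the subset that also sits in a window above 80% identity. The motif string is an arbitrary example.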

Regulatory VISTA (rVISTA)

rVISTA: Interface

your email

sequences

• rVISTA sequence submission: set number
• Submit email address, sequences, and set parameters
• Key step: click the box for: Find potential transcription factors

rVISTA: Select TRANSFAC Matrices

rVISTA: Mailed Results

• Emailed results will provide a link
• Choose which binding site matrices to display
• You can then choose visualization options

display

rVISTA: Results Graphic

• Blue: all transcription factor (TF) binding sites
• Red: TF sites which are aligned in both sequences
• Green: TF sites which are aligned and in conserved regions

sequences

sites

Whole Genome rVISTA: Access

Whole Genome rVISTA: Select Alignment

IDs or symbols

upstream range

Whole Genome rVISTA: Results

sites found

view genes
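One common way to quantify "over-represented" is a hypergeometric tail probability. The slides do not say which statistic Whole Genome rVISTA uses, so the sketch below is an assumption chosen for illustration, and the function and parameter names are hypothetical.

```python
from math import comb


def overrepresentation_pvalue(total_genes, genes_with_site, selected_genes, selected_with_site):
    """P(X >= selected_with_site) when selected_genes are drawn without replacement.

    total_genes:        genes in the genome-wide background
    genes_with_site:    background genes with the conserved site upstream
    selected_genes:     genes in the submitted list
    selected_with_site: submitted genes with the conserved site upstream
    """
    upper = min(genes_with_site, selected_genes)
    denominator = comb(total_genes, selected_genes)
    tail = sum(
        comb(genes_with_site, k) * comb(total_genes - genes_with_site, selected_genes - k)
        for k in range(selected_with_site, upper + 1)
    )
    return tail / denominator
```

A small value suggests the site occurs upstream of the submitted genes more often than expected by chance; when many binding sites are tested at once, a multiple-testing correction would be needed on top of this.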

Examples of VISTA usage

• Non-coding regulatory regions, for example enhancers

• Genes from the same gene families
• Alternative splicing
• Transcriptional regulation
• Genetic studies

References collected are available through the Publications link at the VISTA home page http://genome.lbl.gov/vista

VISTA-related Publications

http://www.openhelix.com

VISTA thanks

Genomics Division, LBNL, led by Dr. Edward Rubin

Biology

Dario Boffelli Kelly Frazer Gaby Loots

Len Pennacchio Marcelo Nobrega Axel Visel

Bioinformatics

Michael Brudno Olivier Couronne Simon Minovitsky

Igor Ratner Alexander Poliakov Lior Pachter (UCB)

Shyam Prabhakar Dmitriy Ryaboy Nameeta Shah

Inna Dubchak
