VISTA family of computational tools for comparative genomics
• How can we leverage genome sequences from many species to learn about genome function?
• Microbial applications
Inna Dubchak, Genomics Division, LBNL, JGI
Human Genome Annotation
Gene A
• only 1–2% coding
• efficient identification of regulatory sequences?
Sequence conservation implies function
AGTTGAAAC GGAGCTGATGGAGC GGTGGGC T
TACATTTCG ACTGTATCGCCTCG CAACCCT A
potential functional region
conservation
sequence
CTATAAATGC
CTATAAATGC
A C
A C
Last Common Ancestor
divergence = non-functional
functional region = conservation
80 million years
Comparative Genomics Introduction
Human
Drosophila, Mouse, Urchin, Chimp
Similar Genes, Synteny
Sequence Alignment
http://genome.lbl.gov/vista
VISTA is an integrated system for global sequence alignment and visualization for comparative genomic analysis
Algorithm          Feature
AVID*              can handle draft sequence
LAGAN**            produces true multiple alignments
Shuffle-LAGAN**    handles rearrangements (inversions, translocations)
*  Lior Pachter, UC Berkeley
** Michael Brudno, U. Toronto
How does VISTA Work: Global Genomic Alignments
sequence 1
sequence 2
1- anchoring: identify regions of strong similarity
2- chaining: join regions of weak or no similarity
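The two steps above can be illustrated with a toy sketch. It is not the AVID or LAGAN implementation: anchors here are simply exact k-mer matches that are unique in both sequences, and chaining keeps the longest collinear subset of them. The function names and the parameter k are assumptions made for this example.

```python
from bisect import bisect_left


def find_anchors(seq1, seq2, k=8):
    """Step 1 (toy version): exact k-mer matches that occur once in each sequence.

    Returns a list of (position_in_seq1, position_in_seq2), sorted by seq1 position.
    """
    def unique_kmers(seq):
        positions = {}
        for i in range(len(seq) - k + 1):
            positions.setdefault(seq[i:i + k], []).append(i)
        return {kmer: pos[0] for kmer, pos in positions.items() if len(pos) == 1}

    u1, u2 = unique_kmers(seq1), unique_kmers(seq2)
    return sorted((u1[kmer], u2[kmer]) for kmer in u1.keys() & u2.keys())


def chain_anchors(anchors):
    """Step 2 (toy version): longest chain increasing in both coordinates.

    `anchors` must be sorted by the first coordinate; the chain is a longest
    increasing subsequence of the second coordinate.
    """
    tails = []      # tails[n]: smallest end position of any chain of length n + 1
    tails_idx = []  # index in `anchors` of the anchor stored in tails[n]
    prev = [-1] * len(anchors)
    for idx, (_, j) in enumerate(anchors):
        pos = bisect_left(tails, j)
        if pos == len(tails):
            tails.append(j)
            tails_idx.append(idx)
        else:
            tails[pos] = j
            tails_idx[pos] = idx
        prev[idx] = tails_idx[pos - 1] if pos > 0 else -1

    # Walk back from the end of the longest chain.
    chain = []
    idx = tails_idx[-1] if tails_idx else -1
    while idx != -1:
        chain.append(anchors[idx])
        idx = prev[idx]
    return chain[::-1]


anchors = find_anchors("ACGTTGCA", "ACGTTGCA", k=4)
print(chain_anchors(anchors))
```

In a real global aligner the stretches between consecutive chained anchors (the regions of weak or no similarity) would then be aligned in detail; that part is left out here.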
104670599 TCCCCAACTATAAATGGATGAAATTGCAGGAAATGACAGGTA-----TGACCCCTTCTCT 104670653
052328645 TCCTCAATTCAGAATGGAGGGAAGCACACAGGACACAGAGATCCCTTTACCCCCTTCGCT 052328704

104670654 ACCAGAGGCTTGGATTTTTTTTCTTCTTCTCCTCCCTTAGCCCGTGTTGAGCTATTTCGG 104670713
052328705 ATGT----------------------------------------TATCAGGCCACTCAAG 052328724

104670714 AGTTTCCTGGCAGGGAAGAGCGAGTGAGGCTGCCTTACCTTCAGGATGACCACTAGCAGG 104670773
052328725 AGTTCCTTGTCAAG-AAGAGTGAGTGAGTCCACCTCACCTTCAAGATGACCACCAGCAGG 052328783

104670774 CCAGCGCTCACAAGAAGAGGAATGAGGCTACTAATGAACCAGCTAAACCAGAGGATGCTG 104670833
052328784 CCAGCGCTCACAAGCAGAGGGATGAGGCTGCTAACAAACCAGCTAAACCAGAGGATGCCA 052328843

104670834 TTGTCCAGGCCCATGATCCGCATGGTCTCTTTCAGCCGTGCCTCCTTCTCATACACGATG 104670893
052328844 TTGTCCAGACCCATGATCCGCATGGTCTCCTTCAGCCGAGCCTCCTTCTCATACACAATG 052328903

104670894 CCCTTGATGATCACAGCCACTGAGTAAATCCAGGCCAGCGTCATGAAGAGGGGCATTGAC 104670953
052328904 CTCTTGATGATCACAGCGACAGAGTAGATCCAGGCTAGAGTCATGAAGAGGGGCATTGAC 052328963

104670954 CGGCTCATCACCCGCAGAAAGCTGGAGGCCCCAAGGAAGGACAAGGGGAGAAAGAAAGAC 104671013
052328964 CGGCTCATGACCCGCAGAAAACTGGAGGCACAGAGAAAAGGCATGGGAAAAATGAAAAGT 052329023

104671014 ACACGTGAGCCAGGGTGATGGGCCAAGGCCTCTGAGCCTGCATGCTAGAGGGAGCACCAC 104671073
052329024 ----GTGAGCCCGG-CACCGATCCAAGGCCT-------TGCACACTGGAGGACAAACCTC 052329071

104671074 ATCTGGGCCACAGAAGGACAGGCCCTCTAGACTCTGAAATGTACGTATGATCCAATGCTT 104671133
052329072 ATCAGGGTCGCTTATGAA-AGGCCCACTGAACTCTCAAATG--------ACCAAAGGTTT 052329122

104671134 CACGAGCAATGCAATGTAGAGAGAAAAACGAGGCTAACAAAGTGTTGCCAAACCAAATTT 104671193
052329123 CATTAGCAGTGGA---CAGAGATGAAACCTGGGTTTCGAGGGTATGGCCGTGCAAAATTT 052329179

104671194 CTTTGGGGGCTTGCTTCAGTAACTAGGTAACTGTGAGCGATAC-TTAAACTAAAGGTAGA 104671252
052329180 TTTCAGGGGCTCTCTTTAATAGCTAGGAAATGGATAGGGTAATATTAAGATAAATATAAG 052329239

104671253 TTATGTTA--AAGTACTAAAAACCAAAACA------AAAAAACAACTCATTCTCTCACAA 104671304
052329240 TTACTCTACTAAGTACTAAACACAAAGGGCGGGGGCAGAATCCAACTTGGTCTTCCGCTA 052329299

Global Genomic Aligner Output
VISTA visualization

104637349 GTAGTGCCACTGAGTGTGACAGGGATGGCAAGAAAAGCATTAAGTTCCAAGGGGAAAGAA 104637408
052290302 GAGATGTCACCAAGTA-AACAGAGATGGCAAGAGGACCAATAGGTTCTAGTGGGAAAGAC 052290360
“sliding window” to measure sequence conservation (default window size 100 bp)
Graphical presentation of sequence conservation as “peaks-and-valleys” curve
>70% identity
base sequence coordinates
% identity
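A minimal sketch of the sliding-window calculation behind the curve, assuming a pairwise alignment given as two gapped strings of equal length. The slides give only the default window (100 bp) and the >70% level; how gaps and the sequence ends are handled below, and the function names, are assumptions for illustration.

```python
def conservation_profile(base_aln, other_aln, window=100):
    """Percent identity in a window around each position of the base sequence.

    Returns (base_coordinate, percent_identity) pairs, with 1-based coordinates
    counted on the base sequence only (gap columns in the base are skipped).
    """
    if len(base_aln) != len(other_aln):
        raise ValueError("aligned sequences must have the same length")
    if window < 1:
        raise ValueError("window must be at least 1")

    # 1 for a column with identical bases, 0 for a mismatch or a gap.
    match = [int(a == b and a != "-")
             for a, b in zip(base_aln.upper(), other_aln.upper())]
    prefix = [0]
    for m in match:
        prefix.append(prefix[-1] + m)

    n = len(match)
    half = window // 2
    profile = []
    coord = 0
    for col in range(n):
        if base_aln[col] == "-":
            continue
        coord += 1
        lo = max(0, col - half)              # window is clipped at the ends
        hi = min(n, col + window - half)
        identity = 100.0 * (prefix[hi] - prefix[lo]) / (hi - lo)
        profile.append((coord, identity))
    return profile


def conserved_regions(profile, threshold=70.0):
    """Runs of base coordinates whose window identity exceeds the threshold."""
    regions, start, last = [], None, None
    for coord, identity in profile:
        if identity > threshold:
            if start is None:
                start = coord
            last = coord
        elif start is not None:
            regions.append((start, last))
            start = None
    if start is not None:
        regions.append((start, last))
    return regions


profile = conservation_profile("AAAA", "AATT", window=2)
print(conserved_regions(profile))  # [(1, 2)]
```

Plotting the second element of each pair against the first gives the peaks-and-valleys curve; the regions returned by conserved_regions correspond to the peaks above the 70% line.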
VISTA homepage: http://genome.lbl.gov/vista
VISTA Servers (submit your own data)
VISTA Browsers (precomputed alignments)
Other VISTA-related Projects
• Access servers, browsers, other information
wgVISTA
Align and compare sequences, including microbial assemblies
mVISTA
Align and compare sequences
rVISTA
Search for TFBS combined with a comparative sequence analysis
VISTA Servers
GenomeVISTA
Align DNA sequence to a genome
VISTA Browser
Browse through pre-computed whole-genome alignments
Whole Genome rVISTA
Whole genome analysis for conserved TFBS over-represented in upstream regions of genes
Precomputed Alignments
VISTA-Point
Browse and obtain sequence and alignment data
VISTA Browser: Access
VISTA Browser: Input Menu
genome position
visualization
Java 2, if needed
• Choose “base” genome
• Select location
• Determine visualization preference
VISTA Browser
VISTA tracks on UCSC Browser
VISTA-Point
VISTA Browser: Alignment Details
direction
exon
repeats
alignment
SNPs
gene
VISTA Browser: Result
Position on chromosome
Control Panel
Graphical display of genome alignments
Color Legend
Cursor Info
Menu & Icons
Curve annotation (species)
1 row
VISTA Browser: Zooming
vs. rhesus
vs. dog
VISTA browser
VISTA Point: Access Overview
VISTA Point: Graphics Table
VISTA Point: Alignments Table
sequence
Google map-like Dot-Plot
BlockView – Synteny Plot tool
RegTransBase – experimental data
manually curated database of regulatory interactions captured from literature; 6000 papers
RegPrecise – computational predictions
manually curated database of regulons inferred by comparative genomics approach
RegPredict – web tool for regulon inference
integrated system for fast and accurate inference of regulons by comparative genomics
NAR database issue, 2010; Featured Article
NAR Web Server issue, 2010; Featured Article
Principal components
NAR database issue, 2007
mVISTA: Access
mVISTA: Interface
• Our example will show 3 sequences
• Align up to 100 sequences
mVISTA: Input of Sequences
• Provide your email address
• Upload your sequences
• Or enter GenBank ID
your email
upload file or GenBank ID
mVISTA: Input Parameters
AVID
– multiple pairwise alignments
– accepts finished or draft sequences
LAGAN
– true multiple alignments
Shuffle-LAGAN
– multiple pairwise alignments
– detects sequence rearrangements and inversions
mVISTA: Results
PDF
VISTA Browser
VISTA-Point
wgVISTA: Microbial Assemblies Comparison
• wgVISTA: whole genome VISTA
• Compares 2 sequences (up to 10 Mb)
• Draft or finished microbial assembly sequences can be used
rVISTA: Access
Regulatory VISTA (rVISTA): prediction of transcription factor binding sites
Simultaneous searches of the major transcription factor binding site database (TRANSFAC) and the use of global sequence alignment to sieve through the data
rVISTA search is automatically run when submitting:
• mVISTA
• GenomeVISTA
Human  TGATTTCTCGGCAGCAAGGGAGGGCCCCATGACAAAGCCATTTGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCTGTCTCTCCCTTCCCCTCTG
Mouse  TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCACTCGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCTCTCTCTTCCTCCCCCTCCA
Dog    TGATTTCTCGGCAGCAAGGGAGGGCCCCATGACGAAGCCATTTGAAATCCCAGAAGCGATTTTCTACCTACGACCTCACTTTCTGTTGCGCTCACTCCCTTCCCCTGCA
Rat    TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCACTCGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGTTCTCTCTTCCTCCCCCTCCA
Cow    TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCATTTGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCGTTCTCTCCCTTCCCCTCCT
Rabbit TGATTTCTCGGCAGCCAGGGAGGGCCCCACGAC-AAGCCATTCAAAATCCCAGAAGTGATTTTCTACTTACGACCTCACTTTCTGTTG----CTCTCTCCTTCCCTCCA
Ikaros-2   Ikaros-2   NFAT   Ikaros-2
20 bp dynamic shifting window
>80% ID
1. Identify potential transcription factor binding sites for each sequence using library of matrices (TRANSFAC)
2. Identify aligned sites using VISTA
3. Identify conserved sites using dynamic shifting window
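The three steps can be sketched on a pairwise alignment as follows. This is an illustration only, not the rVISTA code: step 1 uses an exact match to a made-up motif string in place of a TRANSFAC matrix scan, a site counts as aligned when both sequences have a hit in the same alignment columns, and the "dynamic shifting window" is read as "any 20-column window covering the site". All names and these interpretations are assumptions.

```python
def find_sites(aligned, motif):
    """Step 1 (stand-in): exact motif hits, reported as (first_col, last_col)
    in alignment columns. A real scan would score position weight matrices."""
    cols = [c for c, ch in enumerate(aligned) if ch != "-"]
    seq = aligned.replace("-", "").upper()
    hits = []
    start = seq.find(motif)
    while start != -1:
        hits.append((cols[start], cols[start + len(motif) - 1]))
        start = seq.find(motif, start + 1)
    return hits


def best_window_identity(aln1, aln2, first_col, last_col, window=20):
    """Step 3 helper: highest percent identity over windows covering the site."""
    n = len(aln1)
    window = min(window, n)
    best = 0.0
    lo_min = max(0, last_col - window + 1)
    lo_max = min(first_col, n - window)
    for lo in range(lo_min, lo_max + 1):
        hi = lo + window
        matches = sum(a == b and a != "-" for a, b in zip(aln1[lo:hi], aln2[lo:hi]))
        best = max(best, 100.0 * matches / window)
    return best


def classify_sites(aln1, aln2, motif, window=20, min_identity=80.0):
    """Label each site of the first sequence as predicted, aligned or conserved."""
    aln1, aln2 = aln1.upper(), aln2.upper()
    sites2 = set(find_sites(aln2, motif))
    labelled = []
    for first, last in find_sites(aln1, motif):
        if (first, last) not in sites2:
            label = "predicted"   # found in this sequence only
        elif best_window_identity(aln1, aln2, first, last, window) > min_identity:
            label = "conserved"   # aligned and inside a conserved window
        else:
            label = "aligned"     # step 2: present in both at the same columns
        labelled.append((first, last, label))
    return labelled


# "GGGA" is a hypothetical motif chosen for the example.
print(classify_sites("AAAAGGGATTTT", "AAAAGGGATTTT", "GGGA", window=8))
```

The three labels are nested: every conserved site is also aligned, and every aligned site is also a predicted site.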
Regulatory VISTA (rVISTA):
rVISTA: Interface
your email
sequences
• rVISTA sequence submission: set number
• Submit email address, sequences, and set parameters
• Key step: click the box for: Find potential transcription factors
rVISTA: Select TRANSFAC Matrices
rVISTA: Mailed Results
• Emailed results will provide a link
• Choose which binding site matrices to display
• You can then choose visualization options
display
rVISTA: Results Graphic
• Blue: all transcription factor (TF) binding sites
• Red: TF sites which are aligned in both sequences
• Green: TF sites which are aligned and in conserved regions
sequences
sites
Whole Genome rVISTA: Access
Whole Genome rVISTA: Select Alignment
IDs or symbols
upstream range
Whole Genome rVISTA: Results
sites found
view genes
Examples of VISTA usage
• Non-coding regulatory regions, for example enhancers
• Genes from the same gene families
• Alternative splicing
• Transcriptional regulation
• Genetic studies
References collected are available through the Publications link at the VISTA home page http://genome.lbl.gov/vista
VISTA-related Publications
http://www.openhelix.com
VISTA thanks
Biology
Genomics Division, LBNL, led by Dr. Edward Rubin
Dario Boffelli, Kelly Frazer, Gaby Loots,
Len Pennacchio, Marcelo Nobrega, Axel Visel
Bioinformatics
Michael Brudno, Olivier Couronne, Simon Minovitsky,
Igor Ratner, Alexander Poliakov, Lior Pachter (UCB),
Shyam Prabhakar, Dmitriy Ryaboy, Nameeta Shah,
Inna Dubchak