Page 1

VISTA family of computational tools for comparative genomics

• How can we leverage genome sequences from many species to learn about genome function?

• Microbial applications

Inna Dubchak, Genomics Division LBNL, JGI [email protected]

Page 2

Human Genome Annotation

Gene A

• only 1–2% coding

• efficient identification of regulatory sequences?

Page 3

Sequence conservation implies function

Figure: two sequences descended from a Last Common Ancestor (80 million years). Where the sequence is conserved (CTATAAATGC in both), the region is a potential functional region; where it has diverged (A vs. C), it is non-functional.

divergence = non-functional

conservation = functional region

Page 4

Comparative Genomics Introduction

Human

Drosophila, Mouse, Urchin, Chimp

Similar Genes Synteny

Sequence Alignment

Page 5

http://genome.lbl.gov/vista

VISTA is an integrated system for global sequence alignment and visualization for comparative genomic analysis

Page 6

How does VISTA Work: Global Genomic Alignments

Algorithm          Feature
AVID*              can handle draft sequence
LAGAN**            produces true multiple alignments
Shuffle-LAGAN**    handles rearrangements (inversions, translocations)

* Lior Pachter, UC Berkeley
** Michael Brudno, U. Toronto

sequence 1

sequence 2

1- anchoring: identify regions of strong similarity

2- chaining: join regions of weak or no similarity
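The two steps above can be illustrated with a small Python sketch. It is only an illustration of the general anchor-and-chain idea, not the AVID or LAGAN implementation: the function names, the use of exact unique k-mers as anchors, and the value of `k` are all assumptions made for this example.

```python
from bisect import bisect_left


def find_anchors(seq1, seq2, k=8):
    """Step 1 (anchoring): return sorted (i, j) pairs where a k-mer that is
    unique in both sequences starts at seq1[i] and at seq2[j]."""
    def unique_kmers(seq):
        positions = {}
        for i in range(len(seq) - k + 1):
            positions.setdefault(seq[i:i + k], []).append(i)
        # Keep only k-mers that occur exactly once.
        return {kmer: p[0] for kmer, p in positions.items() if len(p) == 1}

    u1 = unique_kmers(seq1)
    u2 = unique_kmers(seq2)
    return sorted((i, u2[kmer]) for kmer, i in u1.items() if kmer in u2)


def chain_anchors(anchors):
    """Step 2 (chaining): pick the longest subset of anchors that increases in
    both coordinates (longest increasing subsequence on j, anchors sorted by i).
    The gaps between chained anchors are the regions of weak or no similarity
    that a real aligner would then align in detail."""
    tails = []      # tails[n] = smallest j that ends a chain of length n + 1
    tail_idx = []   # index into anchors of the anchor stored in tails[n]
    prev = [None] * len(anchors)
    for idx, (_, j) in enumerate(anchors):
        pos = bisect_left(tails, j)
        if pos == len(tails):
            tails.append(j)
            tail_idx.append(idx)
        else:
            tails[pos] = j
            tail_idx[pos] = idx
        prev[idx] = tail_idx[pos - 1] if pos > 0 else None

    # Walk back from the end of the longest chain.
    chain = []
    idx = tail_idx[-1] if tail_idx else None
    while idx is not None:
        chain.append(anchors[idx])
        idx = prev[idx]
    return chain[::-1]
```

An anchor that would force the alignment to go backwards in the second sequence (for example `(5, 20)` followed by `(10, 8)`) is dropped by the chaining step, which is why plain global alignment needs a separate method such as Shuffle-LAGAN to handle rearrangements.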

Page 7

Global Genomic Aligner Output

104670599 TCCCCAACTATAAATGGATGAAATTGCAGGAAATGACAGGTA-----TGACCCCTTCTCT 104670653
052328645 TCCTCAATTCAGAATGGAGGGAAGCACACAGGACACAGAGATCCCTTTACCCCCTTCGCT 052328704

104670654 ACCAGAGGCTTGGATTTTTTTTCTTCTTCTCCTCCCTTAGCCCGTGTTGAGCTATTTCGG 104670713
052328705 ATGT----------------------------------------TATCAGGCCACTCAAG 052328724

104670714 AGTTTCCTGGCAGGGAAGAGCGAGTGAGGCTGCCTTACCTTCAGGATGACCACTAGCAGG 104670773
052328725 AGTTCCTTGTCAAG-AAGAGTGAGTGAGTCCACCTCACCTTCAAGATGACCACCAGCAGG 052328783

104670774 CCAGCGCTCACAAGAAGAGGAATGAGGCTACTAATGAACCAGCTAAACCAGAGGATGCTG 104670833
052328784 CCAGCGCTCACAAGCAGAGGGATGAGGCTGCTAACAAACCAGCTAAACCAGAGGATGCCA 052328843

104670834 TTGTCCAGGCCCATGATCCGCATGGTCTCTTTCAGCCGTGCCTCCTTCTCATACACGATG 104670893
052328844 TTGTCCAGACCCATGATCCGCATGGTCTCCTTCAGCCGAGCCTCCTTCTCATACACAATG 052328903

104670894 CCCTTGATGATCACAGCCACTGAGTAAATCCAGGCCAGCGTCATGAAGAGGGGCATTGAC 104670953
052328904 CTCTTGATGATCACAGCGACAGAGTAGATCCAGGCTAGAGTCATGAAGAGGGGCATTGAC 052328963

104670954 CGGCTCATCACCCGCAGAAAGCTGGAGGCCCCAAGGAAGGACAAGGGGAGAAAGAAAGAC 104671013
052328964 CGGCTCATGACCCGCAGAAAACTGGAGGCACAGAGAAAAGGCATGGGAAAAATGAAAAGT 052329023

104671014 ACACGTGAGCCAGGGTGATGGGCCAAGGCCTCTGAGCCTGCATGCTAGAGGGAGCACCAC 104671073
052329024 ----GTGAGCCCGG-CACCGATCCAAGGCCT-------TGCACACTGGAGGACAAACCTC 052329071

104671074 ATCTGGGCCACAGAAGGACAGGCCCTCTAGACTCTGAAATGTACGTATGATCCAATGCTT 104671133
052329072 ATCAGGGTCGCTTATGAA-AGGCCCACTGAACTCTCAAATG--------ACCAAAGGTTT 052329122

104671134 CACGAGCAATGCAATGTAGAGAGAAAAACGAGGCTAACAAAGTGTTGCCAAACCAAATTT 104671193
052329123 CATTAGCAGTGGA---CAGAGATGAAACCTGGGTTTCGAGGGTATGGCCGTGCAAAATTT 052329179

104671194 CTTTGGGGGCTTGCTTCAGTAACTAGGTAACTGTGAGCGATAC-TTAAACTAAAGGTAGA 104671252
052329180 TTTCAGGGGCTCTCTTTAATAGCTAGGAAATGGATAGGGTAATATTAAGATAAATATAAG 052329239

104671253 TTATGTTA--AAGTACTAAAAACCAAAACA------AAAAAACAACTCATTCTCTCACAA 104671304
052329240 TTACTCTACTAAGTACTAAACACAAAGGGCGGGGGCAGAATCCAACTTGGTCTTCCGCTA 052329299

Page 8

VISTA visualization

104637349 GTAGTGCCACTGAGTGTGACAGGGATGGCAAGAAAAGCATTAAGTTCCAAGGGGAAAGAA 104637408
052290302 GAGATGTCACCAAGTA-AACAGAGATGGCAAGAGGACCAATAGGTTCTAGTGGGAAAGAC 052290360

“sliding window” to measure sequence conservation (default window size 100 bp)

Graphical presentation of sequence conservation as a “peaks-and-valleys” curve

Figure: conservation curve. x-axis: base sequence coordinates; y-axis: % identity; label: >70% identity.
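The sliding-window measure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not VISTA's own code: the function names are hypothetical, windows are indexed by alignment column (VISTA plots against base sequence coordinates), a column containing a gap counts as a mismatch, and windows are clipped at the ends of the alignment. The 100 bp window and the 70% cutoff are the values shown on the slide.

```python
def window_identity(aln1, aln2, window=100):
    """Percent identity in a window centred on each alignment column.

    aln1 and aln2 are the two rows of a pairwise alignment (same length,
    '-' for gaps). Returns one score per column, i.e. the curve to plot.
    """
    if len(aln1) != len(aln2):
        raise ValueError("aligned sequences must have the same length")

    # 1 where the two rows carry the same base, 0 for mismatches and gaps.
    matches = [1 if a == b and a != "-" else 0 for a, b in zip(aln1, aln2)]

    # Prefix sums make every window sum an O(1) lookup.
    prefix = [0]
    for m in matches:
        prefix.append(prefix[-1] + m)

    n = len(matches)
    half = max(1, window // 2)
    scores = []
    for c in range(n):
        lo = max(0, c - half)
        hi = min(n, c + half)
        scores.append(100.0 * (prefix[hi] - prefix[lo]) / (hi - lo))
    return scores


def conserved_runs(scores, threshold=70.0):
    """Return half-open (start, end) column ranges where the curve stays
    above the threshold, i.e. the peaks of the curve."""
    runs = []
    start = None
    for i, s in enumerate(scores):
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(scores)))
    return runs
```

Plotting `scores` against position gives the peaks-and-valleys curve; `conserved_runs` picks out the stretches that exceed the identity cutoff.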

Page 9

VISTA homepage: http://genome.lbl.gov/vista

VISTA Servers (submit your own data)

VISTA Browsers (precomputed alignments)

Other VISTA-related Projects

• Access servers, browsers, other information

Page 10

VISTA Servers

wgVISTA

Align and compare sequences, including microbial assemblies

mVISTA

Align and compare sequences

rVISTA

Search for TFBS combined with a comparative sequence analysis

GenomeVISTA

Align DNA sequence to a genome

Page 11

Precomputed Alignments

VISTA Browser

Browse through pre-computed whole-genome alignments

Whole Genome rVISTA

Whole genome analysis for conserved TFBS over-represented in upstream regions of genes

VISTA-Point

Browse and obtain sequence and alignment data

Page 12

VISTA Browser: Access

Page 13

VISTA Browser: Input Menu

genome position

visualization

Java 2, if needed

• Choose “base” genome
• Select location
• Determine visualization preference

VISTA Browser

VISTA tracks on UCSC Browser

VISTA-Point

Page 14

VISTA Browser: Alignment Details

direction

exon

repeats

alignment

SNPs

gene

Page 15

VISTA Browser: Result

Position on chromosome

Control Panel

Graphical display of genome alignments

Color Legend

Cursor Info

Menu & Icons

Curve annotation (species)

1 row

Page 16

VISTA Browser: Zooming

vs. rhesus

vs. dog

Page 17

VISTA browser

Page 18

VISTA Point: Access Overview

Page 19

VISTA Point: Graphics Table

Page 20

VISTA Point: Alignments Table

sequence

Page 21

Page 22

Google map-like Dot-Plot

Page 23

Page 24

BlockView – Synteny Plot tool

Page 25

Page 26

Page 27

RegTransBase – experimental data

manually curated database of regulatory interactions captured from literature; 6000 papers

RegPrecise – computational predictions

manually curated database of regulons inferred by comparative genomics approach

RegPredict – web tool for regulon inference

integrated system for fast and accurate inference of regulons by comparative genomics

NAR database issue, 2010; Featured Article

NAR Web Server issue, 2010; Featured Article

Principal components

NAR database issue, 2007

Page 28

mVISTA: Access

Page 29

mVISTA: Interface

• Our example will show 3 sequences
• Align up to 100 sequences

Page 30

mVISTA: Input of Sequences

• Provide your email address
• Upload your sequences
• Or enter GenBank ID

your email

upload file or GenBank ID

Page 31

mVISTA: Input Parameters

AVID
– multiple pairwise alignments
– accepts finished or draft sequences

LAGAN
– true multiple alignments

Shuffle-LAGAN
– multiple pairwise alignments
– detects sequence rearrangements and inversions

Page 32

mVISTA: Results

PDF

VISTA Browser

VISTA-Point

Page 33

wgVISTA: Microbial Assemblies Comparison

• wgVISTA: whole genome VISTA
• Compares 2 sequences (up to 10 Mb)
• Draft or finished microbial assembly sequences can be used

Page 34

rVISTA: Access

Page 35

Regulatory VISTA (rVISTA): prediction of transcription factor binding sites

Simultaneous searches of the major transcription factor binding site database (TRANSFAC) and the use of global sequence alignment to sieve through the data

rVISTA search is automatically run when submitting:
• mVISTA
• genomeVISTA

Page 36

Regulatory VISTA (rVISTA):

Human  TGATTTCTCGGCAGCAAGGGAGGGCCCCATGACAAAGCCATTTGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCTGTCTCTCCCTTCCCCTCTG
Mouse  TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCACTCGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCTCTCTCTTCCTCCCCCTCCA
Dog    TGATTTCTCGGCAGCAAGGGAGGGCCCCATGACGAAGCCATTTGAAATCCCAGAAGCGATTTTCTACCTACGACCTCACTTTCTGTTGCGCTCACTCCCTTCCCCTGCA
Rat    TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCACTCGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGTTCTCTCTTCCTCCCCCTCCA
Cow    TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCATTTGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCGTTCTCTCCCTTCCCCTCCT
Rabbit TGATTTCTCGGCAGCCAGGGAGGGCCCCACGAC-AAGCCATTCAAAATCCCAGAAGTGATTTTCTACTTACGACCTCACTTTCTGTTG----CTCTCTCCTTCCCTCCA

Sites marked on the alignment: Ikaros-2, Ikaros-2, NFAT, Ikaros-2

20 bp dynamic shifting window

>80% ID

1. Identify potential transcription factor binding sites for each sequence using library of matrices (TRANSFAC)

2. Identify aligned sites using VISTA

3. Identify conserved sites using dynamic shifting window
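Steps 2 and 3 can be sketched in Python as below. This is a hedged illustration rather than the rVISTA implementation. Step 1 (matrix search against TRANSFAC) is assumed to have been done already, so predicted sites are passed in as `(factor, start, end)` tuples in alignment-column coordinates. The function names, the rule that "aligned" means the same factor at the same columns in both sequences, and the reading of "dynamic shifting window" as "any 20-column window that contains the site" are all assumptions made for this example.

```python
def best_window_identity(aln1, aln2, start, end, window=20):
    """Highest percent identity over all windows of `window` columns that
    fully contain the site [start, end). Returns 0.0 if no such window
    exists (site longer than the window, or alignment shorter than it)."""
    n = len(aln1)
    best = 0.0
    first = max(0, end - window)      # leftmost window still covering the site
    last = min(start, n - window)     # rightmost window still covering the site
    for lo in range(first, last + 1):
        hi = lo + window
        matches = sum(
            1 for a, b in zip(aln1[lo:hi], aln2[lo:hi]) if a == b and a != "-"
        )
        best = max(best, 100.0 * matches / window)
    return best


def classify_sites(aln1, aln2, sites1, sites2, window=20, threshold=80.0):
    """Label each predicted site of sequence 1 as 'unaligned', 'aligned'
    (same factor predicted at the same columns in sequence 2) or
    'conserved' (aligned and inside a window above the identity threshold)."""
    in_seq2 = set(sites2)
    result = []
    for name, start, end in sites1:
        label = "unaligned"
        if (name, start, end) in in_seq2:
            label = "aligned"
            if best_window_identity(aln1, aln2, start, end, window) > threshold:
                label = "conserved"
        result.append((name, start, end, label))
    return result
```

The three labels mirror the filtering order of the numbered steps: every predicted site starts as a candidate, alignment removes sites with no counterpart in the second sequence, and the identity window keeps only those that sit in well-conserved sequence.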

Page 37

rVISTA: Interface

your email

sequences

• rVISTA sequence submission: set number
• Submit email address, sequences, and set parameters
• Key step: click the box for: Find potential transcription factors

Page 38

rVISTA: Select TRANSFAC Matrices

Page 39

rVISTA: Mailed Results

• Emailed results will provide a link
• Choose which binding site matrices to display
• You can then choose visualization options

display

Page 40

rVISTA: Results Graphic

• Blue: all transcription factor (TF) binding sites
• Red: TF sites which are aligned in both sequences
• Green: TF sites which are aligned & in conserved regions

sequences

sites

Page 41

Whole Genome rVISTA: Access

Page 42

Whole Genome rVISTA: Select Alignment

IDs or symbols

upstream range

Page 43

Whole Genome rVISTA: Results

sites found

view genes

Page 44

Examples of VISTA usage

• Non-coding regulatory regions, for example enhancers
• Genes from the same gene families
• Alternative splicing
• Transcriptional regulation
• Genetic studies

Collected references are available through the Publications link at the VISTA home page http://genome.lbl.gov/vista

Page 45

VISTA-related Publications

Page 46

http://www.openhelix.com

Page 47

VISTA thanks

Biology Genomics Division, LBNL led by Dr. Edward Rubin

Dario Boffelli Kelly Frazer Gaby Loots

Len Pennacchio Marcelo Nobrega Axel Visel

Bioinformatics

Michael Brudno Olivier Couronne Simon Minovitsky

Igor Ratner Alexander Poliakov Lior Pachter (UCB)

Shyam Prabhakar Dmitriy Ryaboy Nameeta Shah

Inna Dubchak