Web Apollo tutorial for the medfly research community
Post on 10-May-2015
UNIVERSITY OF CALIFORNIA
An introduction to Web Apollo. A webinar for the Ceratitis capitata research community.
Monica Munoz-Torres, PhD | @monimunozto Berkeley Bioinformatics Open-Source Projects (BBOP)
Genomics Division, Lawrence Berkeley National Laboratory 15 July, 2014
Outline
1. What is Web Apollo?
• Definition & working concept.
2. Our Experience With Community Based Curation.
3. The Manual Annotation Process.
4. Becoming acquainted with Web Apollo.
During this webinar you will:
• Learn to identify homologs of known genes of interest in your newly sequenced genome.
• Become familiar with the environment and functionality of the Web Apollo genome annotation editing tool.
• Receive a brief introduction to the resources available for the Ceratitis capitata genome.
What is Web Apollo?
• Web Apollo is a web-based, collaborative genomic annotation editing platform.
We need annotation editing tools to modify and refine the precise location and structure of the genome elements that predictive algorithms cannot yet resolve automatically.
Find out more about Web Apollo at http://GenomeArchitect.org and in Genome Biol 14:R93 (2013).
Brief history of Apollo*:
a. Desktop: one person at a time editing a specific region, annotations saved in local files; slowed down collaboration.
b. Java Web Start: users saved annotations directly to a centralized database; potential issues with stale annotation data remained.
Biologists could finally visualize computational analyses and experimental evidence from genomic features and build manually-curated consensus gene structures. Apollo became a very popular, open source tool (insects, fish, mammals, birds, etc.).
Web Apollo
• Browser-based tool integrated with JBrowse.
• Two new tracks: “Annotation” and “DNA Sequence”
• Allows for intuitive annotation creation and editing, with gestures and pull-down menus to create and modify transcript and exon structures, insert comments (CV, freeform text), etc.
• Customizable look & feel.
• Edits in one client are instantly pushed to all other clients: Collaborative!
Working Concept
In the context of gene manual annotation, curation tries to find the best examples and/or eliminate most errors.
To conduct manual annotation efforts:
Gather and evaluate all available evidence using quality-control metrics to corroborate or modify automated annotation predictions.
Perform sequence similarity searches (phylogenetic framework) and use literature and public databases to:
• Predict functional assignments from experimental data.
• Distinguish orthologs from paralogs, and classify gene membership in families and networks.
2. In our experience.
Automated gene models
Evidence: cDNAs, HMM domain searches, alignments with assemblies or genes from other species.
Manual annotation & curation
Dispersed, community-based gene manual annotation efforts.
We continuously train and support hundreds of geographically dispersed scientists from many research communities to perform biologically supported manual annotations using Web Apollo.
– Gate keepers and monitoring.
– Written tutorials.
– Training workshops and geneborees.
– Personalized user support.
What we have learned.
By harvesting expertise from dispersed researchers who assigned functions to predicted and curated peptides, we have developed more interactive and responsive tools, as well as better visualization, editing, and analysis capabilities.
http://people.csail.mit.edu/fredo/PUBLI/Drawing/
Collaborative Efforts Improved Automated Annotations*
In many cases, automated annotations have been improved (e.g. Apis mellifera; Elsik et al. BMC Genomics 2014, 15:86).
We also learned of the challenges of newer sequencing technologies, e.g.:
– Frameshifts and indel errors
– Split genes across scaffolds
– Highly repetitive sequences
To face these challenges, we train annotators in recovering coding sequences in agreement with all available biological evidence.
It is helpful to work together. Scientific community efforts bring together domain-specific and natural history expertise that would otherwise remain disconnected.
Breaking down large amounts of data into manageable portions and mobilizing groups of researchers to extract the most accurate representation of the biology from all available data distills invaluable knowledge from genome analysis.
Understanding the evolution of sociality
Comparing the genomes of 7 species of ants contributed to a better understanding of the evolution and organization of insect societies at the molecular level.
Insights drawn mainly from six core aspects of ant biology:
1. Alternative morphological castes
2. Division of labor
3. Chemical Communication
4. Alternative social organization
5. Social immunity
6. Mutualism
Libbrecht et al. Genome Biology 2013, 14:212
Atta cephalotes (above) and Harpegnathos saltator. ©alexanderwild.com
Groups of communities continue to guide our efforts.
A little training goes a long way!
With the right tools, wet lab scientists make exceptional curators who can easily learn to maximize the generation of accurate, biologically supported gene models.
Manual Annotation
How do we get there?
In a genome sequencing project…
[Figure: project stages shown as Assembly, Automated Annotation, Manual annotation, Experimental validation]
3. How do we get there?
Gene Prediction
Identification of protein-coding genes, tRNAs, rRNAs, regulatory motifs, repetitive elements (masked), etc.
- Ab initio (DNA composition): Augustus, GENSCAN, geneid, fgenesh
- Homology-based: e.g. SGP2, fgenesh++
Nucleic Acids 2003 vol. 31 no. 13 3738-3741
Gene Annotation
Integration of data from prediction tools to generate a consensus set of predictions or gene models.
• Models may be organized using:
- automatic integration of predicted sets; e.g. GLEAN
- packaging necessary tools into pipeline; e.g. MAKER
• All available biological evidence (e.g. transcriptomes) further informs the annotation process.
In some cases algorithms and metrics used to generate consensus sets may actually reduce the accuracy of the gene’s representation; in such cases it is usually better to use an ab initio model to create a new annotation.
Manual Genome Annotation
• Identifies elements that best represent the underlying biology.
• Eliminates elements that reflect the systemic errors of automated genome analyses.
• Determines functional roles through comparative analysis of well-studied, phylogenetically similar genome elements using literature, databases, and the researcher’s experience.
Curation Process is Necessary
1. A computationally predicted consensus gene set is generated using multiple lines of evidence.
2. Manual annotation takes place.
3. Ideally consensus computational predictions will be integrated with manual annotations to produce an updated Official Gene Set (OGS).
Otherwise, “incorrect and incomplete genome annotations will poison every experiment that uses them”.
- M. Yandell.
The Collaborative Curation Process at i5K
1) A computationally predicted consensus gene set has been generated using multiple lines of evidence; e.g. JAMg Consensus Gene Set v1.
2) i5K Projects will integrate consensus computational predictions with manual annotations to produce an updated Official Gene Set (OGS):
» If it’s not on either track, it won’t make the OGS!
» If it’s there and it shouldn’t be, it will still make the OGS!
Consensus set: reference and start point
• In some cases algorithms and metrics used to generate consensus sets may actually reduce the accuracy of the gene’s representation; e.g. use Augustus model instead to create a new annotation.
• Isoforms: drag original and alternatively spliced form to ‘User-created Annotations’ area.
• If an annotation needs to be removed from the consensus set, drag it to the ‘User-created Annotations’ area and label as ‘Delete’ on Information Editor.
• Overlapping interests? Collaborate to reach agreement.
• Follow guidelines for i5K Pilot Species Projects as shown at http://goo.gl/LRu1VY and download the MedFly Annotation guide from http://goo.gl/YY0tNw
The Sequence Selection Window
4. Becoming Acquainted with Web Apollo.
Navigation tools: pan and zoom.
Search box: go to a scaffold or a gene model.
Grey bar of coordinates indicates location. You can also select here in order to zoom to a sub-region.
‘View’: change color by CDS, toggle strands, set highlight.
‘File’: Upload your own evidence: GFF3, BAM, BigWig, VCF*. Add combination and sequence search tracks.
‘Tools’: Use BLAT to query the genome with a protein or DNA sequence.
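For anyone preparing their own evidence for upload, it may help to check a GFF3 file before loading it. The sketch below is only an illustration and is not part of Web Apollo: it parses one feature line into the nine tab-separated GFF3 columns. The function name `parse_gff3_line`, the `Gff3Feature` type, and the example values (`scaffold1`, `myLab`, `HaGR1`) are hypothetical.

```python
from typing import Dict, NamedTuple
from urllib.parse import unquote


class Gff3Feature(NamedTuple):
    """One GFF3 feature line (hypothetical helper type for this sketch)."""
    seqid: str
    source: str
    type: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Gff3Feature:
    """Split a single GFF3 feature line into its nine columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attrs: Dict[str, str] = {}
    # Column 9 holds semicolon-separated key=value pairs, percent-encoded.
    for pair in cols[8].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attrs[unquote(key)] = unquote(value)
    return Gff3Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                       cols[5], cols[6], cols[7], attrs)


# Example with made-up values.
feature = parse_gff3_line("scaffold1\tmyLab\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=HaGR1")
print(feature.type, feature.start, feature.end, feature.attributes["Name"])
```

A line that does not have exactly nine columns raises `ValueError`, which is a quick way to catch space-delimited files before uploading them.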
Available Tracks
Evidence Tracks Area
‘User-created Annotations’ Track
Login
Graphical User Interface (GUI) for editing annotations
Flags non-canonical splice sites.
Selection of features and sub-features
Edge-matching
Evidence Tracks Area
‘User-created Annotations’ Track
The editing logic in the server:
§ selects longest ORF as CDS
§ flags non-canonical splice sites
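The two server-side rules can be illustrated with a short sketch. This is not Web Apollo's actual implementation; it is a simplified stand-in under stated assumptions: forward strand only, ORFs running from ATG to the first in-frame stop, 0-based half-open coordinates, and only GT..AG introns treated as canonical. The function names `longest_orf` and `noncanonical_introns` are hypothetical.

```python
from typing import List, Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(mrna: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the longest ATG..stop ORF, end exclusive, or None."""
    seq = mrna.upper()
    best: Optional[Tuple[int, int]] = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens an ORF in this frame
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None  # look for the next ORF in this frame
    return best


def noncanonical_introns(genome: str,
                         exons: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return introns between consecutive exons that do not begin GT and end AG."""
    seq = genome.upper()
    flagged = []
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        donor = seq[left_end:left_end + 2]            # first two intron bases
        acceptor = seq[right_start - 2:right_start]   # last two intron bases
        if donor != "GT" or acceptor != "AG":
            flagged.append((left_end, right_start))
    return flagged


# Toy example: two exons separated by an intron that starts with GC, not GT.
genome = "AAAA" + "GCCCCCAG" + "TTTT"
print(longest_orf("ATGAAATAA"))
print(noncanonical_introns(genome, [(0, 4), (12, 16)]))
```

In the editor itself, a flagged splice site is a prompt to check the exon boundary against the evidence tracks, not necessarily an error.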
DNA Track
‘User-created Annotations’ Track
§ There are two new kinds of tracks for:
§ annotation editing
§ sequence alteration editing
Annotations, annotation edits, and History: stored in a centralized database.
The Information Editor
• DBXRefs
• PubMed IDs
• GO terms
• Comments
Additional Functionality
In addition to the protein-coding gene annotation that you know and love:
• Non-coding genes: ncRNAs, miRNAs, repeat regions, and TEs
• Sequence alterations (less coverage = more fragmentation)
• Visualization of stage and cell-type specific transcription data as coverage plots, heat maps, and alignments
Webservices & additional tools
• Alignments - Jalview
• BLAST - blastp
• Signal Peptide – search using SignalP.
• Just_Annotate_My_proteins: pick a Gene Ontology, Enzyme, KEGG, etc. term and it gives you a list of genes that have a significant Hidden Markov Model alignment to a SwissProt protein (i.e. only real proteins that have been validated) that has real experimental evidence (i.e. from the literature) for that term.
The search is conservative: it does not allow IEA evidence codes, to avoid possibly propagating annotation errors. The search is run twice: first, every annotated gene is searched against SwissProt; then a profile alignment is created with the good matches and searched again.
General Process of Curation
1. Select a chromosomal region of interest, e.g. scaffold.
2. Select appropriate evidence tracks.
3. Determine whether a feature in an existing evidence track will provide a reasonable gene model to start working.
   - If yes: select and drag the feature to the ‘User-created Annotations’ area, creating an initial gene model. If necessary use editing functions to adjust the gene model.
   - If not: let’s talk.
4. Check your edited gene model for integrity and accuracy by comparing it with available homologs.
Always remember: when annotating gene models using Web Apollo, you are looking at a ‘frozen’ version of the genome assembly and you will not be able to modify the assembly itself.
So how do I start curating!?
There are a number of ways to find the gene region you wish to annotate. It depends on what you’re starting with:
a) The protein sequence from another species
b) Sequence from a similar gene
c) You provided Alexie with golden genes and he returned alignments
d) You provided Alexie with high quality proteins and/or gene family alignments (multi or single species) and he created domain annotations.
Option 1 – You have a sequence but don’t know where it is in this genome
1. You will need to BLAT it
2. If protein-based BLAT doesn’t find it, you can BLAST it
3. You can use the i5k BLAST server here: http://i5k.nal.usda.gov/blastn
4. Or you can use any other tool, for example Geneious
Option 2 – the genome has already been annotated with your sequences and you have an ID
1. In other words, someone has told you where to look: if you give Alexie profile alignments of your favorite gene family we could do that for you.
2. Type the ID in the Search box of Web Apollo
• Web Apollo autocompletes using a case-insensitive search anchored on the left-hand side of the word, e.g. HaGR will show all hagr objects (up to 10)
3. Choose one of the genes and click Go
You can do that with Domains, Alignments or Gene names provided to you.
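The autocomplete behaviour described in step 2 (case-insensitive, anchored on the left, at most 10 hits) can be sketched as follows. This is an illustration only, not Web Apollo's code; the function name `autocomplete`, the alphabetical ordering of results, and the example identifiers are assumptions.

```python
from typing import Iterable, List


def autocomplete(query: str, names: Iterable[str], limit: int = 10) -> List[str]:
    """Return up to `limit` names that start with `query`, ignoring case."""
    prefix = query.casefold()
    matches = [name for name in names if name.casefold().startswith(prefix)]
    # Ordering is an assumption made for this sketch: sort alphabetically, ignoring case.
    return sorted(matches, key=str.casefold)[:limit]


# Made-up identifiers: only names beginning with "hagr" (any case) are returned.
print(autocomplete("HaGR", ["hagr2", "HaOR1", "HaGR1", "xHaGR"]))
```

Because the match is anchored on the left, an identifier that merely contains the query (such as `xHaGR` above) is not suggested, so it pays to type the ID from its first character.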
Option 3 – Get genes based on a GO / EC etc. term
This is a fun, new tool Alexie has made, called Just_Annotate_My_proteins.
Example
Live demonstration using the Apis mellifera genome.
A public Honey Bee Web Apollo Demo is available at http://genomearchitect.org/WebApolloDemo
Arthropodcentric Thanks!
AgriPest Base, FlyBase, Hymenoptera Genome Database, VectorBase
Acromyrmex echinatior, Acyrthosiphon pisum, Apis mellifera, Atta cephalotes, Bombus terrestris, Camponotus floridanus, Helicoverpa armigera, Linepithema humile, Manduca sexta, Mayetiola destructor, Nasonia vitripennis, Pogonomyrmex barbatus, Solenopsis invicta, Tribolium castaneum… and you!
Thanks!
• Berkeley Bioinformatics Open-source Projects (BBOP), Berkeley Lab: Web Apollo and Gene Ontology teams. Suzanna E. Lewis (PI).
• Christine G. Elsik (PI). § University of Missouri.
• Ian Holmes (PI). * University of California Berkeley.
• Arthropod genomics community, i5K Steering Committee, Alexie Papanicolaou at CSIRO, Monica Poelchau at USDA/NAL, Stephen Richards at HGSC-BCM, Oliver Niehuis at 1KITE http://www.1kite.org/, BGI, and the Honey Bee Genome Sequencing Consortium.
• Web Apollo is supported by NIH grants 5R01GM080203 from NIGMS, and 5R01HG004483 from NHGRI, and by the Director, Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
• Insect images used with permission: http://AlexanderWild.com and O. Niehuis.
• For your attention, thank you!
Thank you.
Web Apollo
Gregg Helt
Ed Lee
Colin Diesh §
Deepak Unni §
Rob Buels *
Gene Ontology
Chris Mungall
Seth Carbon
Heiko Dietze
BBOP
Web Apollo: http://GenomeArchitect.org
GO: http://GeneOntology.org
i5K: http://arthropodgenomes.org/wiki/i5K