Post on 03-Oct-2020
juhana.kammonen@helsinki.fi; +358503785335; Slack IMS: http://mcinxiaannota-ctk7011.slack.com
Basic annotation guide
NOTE: This document presents an example case of a basic annotation workflow for a
gene. You do not have to follow the protocol to the letter, but it may be useful to go
through the steps if you are doing the annotation for the very first time.
The Apollo software authors have also made their own guide available at:
http://genomearchitect.github.io/users-guide/
1. Open your received list of genes and log in to the main annotation
server (Apollo)
Open the gene list you received in an appropriate program (e.g. Notepad, Excel). If
you do not have a gene list, follow the instructions at:
https://www.helsinki.fi/en/researchgroups/life-history-evolution/research/melitaea-cinxia-manual-annotation
in the section “Selecting genes to annotate”.
If you have received a list of genes, navigate (using Firefox!) to the Apollo
annotation server URL: http://dna-marker.biocenter.helsinki.fi:8018/apollo
A login window appears. Log in with your Apollo credentials. If you do not have
credentials or have lost them, contact daniel.blande@helsinki.fi to obtain or
recover them.
2. Locate a gene model on your gene list
Your gene list will show gene model locations like this:
M05_B06_H03:943048-955620
Each line of your gene list is a reference gene model related to the gene family
you selected. The line gives the contig (here M05_B06_H03) where your target gene is
located and, after the colon, the base-level location (in the example above, starting
base 943048 and ending base 955620). You can navigate to this location by entering
the contig name in the search field at the top right of the annotation view window
and selecting the auto-filled option below it (Figure 1a).
Figure 1a. Entering a contig from your gene list in the search field (red rectangle on the right). Click
the auto-filled option in the dropdown to proceed to the target contig.
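If you keep your gene list in a text file, the location format described above can also be split programmatically. The following is only a minimal sketch: the function and class names are our own for illustration, and the only assumption about the format is the CONTIG:START-END pattern shown in the example.

```python
from typing import NamedTuple


class GeneLocation(NamedTuple):
    contig: str
    start: int
    end: int


def parse_location(entry: str) -> GeneLocation:
    """Split a 'CONTIG:START-END' gene list entry into its three parts."""
    # rpartition splits on the LAST colon, so the contig name stays intact.
    contig, colon, span = entry.strip().rpartition(":")
    if not colon or not contig:
        raise ValueError(f"no contig name before ':' in {entry!r}")
    start_text, dash, end_text = span.partition("-")
    if not dash:
        raise ValueError(f"no START-END range in {entry!r}")
    return GeneLocation(contig, int(start_text), int(end_text))


location = parse_location("M05_B06_H03:943048-955620")
# The contig name is what you type into the Apollo search field.
print(location.contig, location.start, location.end)
```

The contig name goes into the search field, and the start and end values tell you roughly where to zoom with the navigation tools.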
Use the Apollo navigation tools to navigate to the approximate location of your gene
model (Figure 1b)
Figure 1b. Using Apollo navigation tools to find approximate location of the gene model on the gene
list.
3. Make evidence tracks visible in the Apollo view
If you have logged in to Apollo for the first time, the evidence tracks are most likely
hidden from view initially. Once you believe you are at the location of your gene, click
on the “Tracks” tab on the right-side panel (Figure 2).
Figure 2. Tracks tab selection (red rectangle) in the main annotation view.
Click on the “MAKER_genes_V3” gene models (Figure 3). You should find the gene
model with the name from your gene list in the location. Zoom in more if you cannot
see the names (Figure 1b).
Figure 3. Clicking on the MAKER_genes_V3 tracks in Apollo main annotation view (right red
rectangle) and locating the gene model in the annotation (center red rectangle).
You can identify the correct gene model based on the information on your gene list.
You will see something like maker-M05_B06_H03-augustus-gene-1.459-mRNA-1
in your gene list entry. Make absolutely sure that the name of the gene model on
your list matches the one you are viewing in Apollo! The number code of the
gene model name (here 1.459 ) is a very good identifier of the correct model.
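The name check can be illustrated with a short sketch. It assumes that gene model names follow the pattern of the single example above (…-gene-<number code>-mRNA-<n>). Other naming patterns may exist, and the function names are hypothetical.

```python
import re
from typing import Optional


def model_code(name: str) -> Optional[str]:
    """Return the number code (e.g. '1.459') of a gene model name, if present."""
    # Assumed pattern, taken from the one example in this guide.
    match = re.search(r"-gene-(\d+\.\d+)-mRNA-\d+$", name.strip())
    return match.group(1) if match else None


def same_model(list_name: str, apollo_name: str) -> bool:
    """True only if both names are identical after trimming whitespace."""
    return list_name.strip() == apollo_name.strip()


list_entry = "maker-M05_B06_H03-augustus-gene-1.459-mRNA-1"
print(model_code(list_entry))
print(same_model(list_entry, "maker-M05_B06_H03-augustus-gene-1.459-mRNA-1 "))
```

The number code is a quick first check, but the full name comparison is the one that counts.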
4. Add annotation to the User-created annotation area
Next we take a look at the user-created annotation area (the yellowish area on top of
the main annotation view in previous figures).
Set zoom of the main annotation view so that you can see the entire gene model that
overlaps the location in your gene list. Select the gene model from the appropriate
evidence track by left-clicking once from the tip of the gene model arrowhead (Figure
4).
Figure 4. Selecting the gene model on the EVM track by clicking the tip of the arrow.
NOTE: If there is an existing annotation in the user-created annotation area (colored
bars with an arrow) at this location, somebody has already annotated that gene or is
in the process of doing so. If this happens, hover the mouse over the existing
annotation for a moment and read the “owner:” field in the floating box that appears
(it should be an email address). In your gene list, note on this line that there was
an overlapping annotation and who owns it, e.g. add “annotated by test@localhost.com”
to the end of the line. After this, move on to the next gene on your list until you
find one that has no previous annotations.
If the user-created annotations area is empty at your location, as it should be,
proceed to drag and drop that annotation from the track to the User-created
annotations area by clicking at the tip of the arrow and dragging the model with the
mouse button pressed (Figure 5). Drop the model (release the mouse button) when
the model is on top of the user-created annotations area. The model appears in the
annotation area as a new annotation (Figure 6).
NOTE: If the annotation does not appear in the user-created annotation area, try to
log out from Apollo (logout button in the top-right corner of screen) and then re-login.
You will be returned into the same location and view where you were when you
logged out. Then try selecting the gene model and dragging and dropping again.
This is one of the Apollo glitches that we are currently sorting out.
Figure 5. Dragging a model from the MAKER_V3 track into the User-created annotations area.
Figure 6. Upon releasing the mouse button the model appears as a new annotation in the User-
created Annotations area.
5. Get the cDNA sequence of your annotation and perform
BLASTX alignment
At the very minimum you should check with BLAST whether the expected protein
domain of your gene of interest is found on the gene model. This step describes how
to do this.
Get the peptide sequence of your annotation. This is done by clicking the tip of the
arrow in the User-created annotations area so that the whole annotation becomes
active (Figure 7). Then right-click to get a menu of annotation edit options visible
(Figure 8). On top of the list is “Get sequence”, select that. A window with amino-acid
sequence becomes visible (Figure 9). This is the peptide sequence of this
annotation.
Figure 7. Clicking the target annotation active from the arrow-tip on the User-created annotation area.
Figure 8. Right-clicking the arrow-tip gives you annotation edit options. Select “Get sequence”.
Figure 9. Selecting the “Get sequence” option. The amino acid-sequence becomes visible in the
annotation window.
In this example, we use the cDNA for our alignment, so click the radio button for
“cDNA sequence” in the view (Figure 10). cDNA contains the coding sequences of
the gene accompanied by the 5’ and 3’ untranslated regions (UTRs) should there be
any.
Figure 10. cDNA sequence of the annotation selected (red rectangle) and visible in the “Get
sequence” window. We use cDNA for the most accurate possible hits in the sequence databases (see
the following steps)
Then open another tab in Firefox or use another browser and navigate to:
https://blast.ncbi.nlm.nih.gov/Blast.cgi
The NCBI BLAST tools main page opens (Figure 11). Select the BLASTX tool in the
middle of the page.
Figure 11. NCBI BLAST tools main page open in browser. The link to the BLASTX tool is highlighted
in the middle (red rectangle).
In the top field of the opening view (Figure 12), copy and paste the cDNA sequence
from the annotation view. Be sure to include all the text that is in the field including
the header line (easiest done by right-clicking the text in the annotation view and
choosing “Select all”). Then scroll to the bottom of the page and click “BLAST”.
Figure 12. Copy-pasting the cDNA into the top field of the search (top red rectangle) and clicking
“BLAST” on the bottom of the page (bottom red rectangle).
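For background: BLASTX takes a nucleotide query (here the cDNA), translates it in all six reading frames and compares the translations against a protein database. The sketch below shows only that translation step, using the standard genetic code. It is an illustration of the idea, not the NCBI implementation, and the function names are our own.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def clean_sequence(fasta_text: str) -> str:
    """Drop FASTA header lines and whitespace; return upper-case bases."""
    lines = (line.strip() for line in fasta_text.splitlines())
    return "".join(l for l in lines if l and not l.startswith(">")).upper()


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna: str) -> dict:
    """Return the three forward (+) and three reverse (-) frame translations."""
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


# Toy input, not a real Melitaea cinxia sequence.
cdna = clean_sequence(">example\nATGGCC\nTAA\n")
for frame, peptide in sorted(six_frame_translations(cdna).items()):
    print(frame, peptide)
```

This is why you can paste the cDNA with its UTRs: frames and regions that do not encode protein simply do not produce significant hits.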
The search starts and a new page opens showing the status of the search. It may
take several minutes before a result is returned, depending on the length of the
pasted sequence, other traffic on the NCBI BLAST service, etc.
Finally, a result page is shown (Figure 13). The example search took 5+ minutes.
Figure 13. A BLASTX result page showing an illustration of the alignments against the hits in the
database. This example looks rather good. The result shows significant, although gapped hits for the
length of the query sequence for at least the top 8 hits in the database.
Scroll down the result page until you see a text format list of the hits (Figure 14). The
contents of this list help verify that there is an actual butterfly gene in the location.
Usually, the species is reported, and if you see Latin names such as Vanessa
tameamea [= Kamehameha butterfly] or Bombyx mori [= silk moth], this is most likely a
real gene. Moreover, the result list should also give you some indication of your
gene of interest, most likely the protein domain name, e.g. “stress protein”.
Figure 14. Scrolling down the BLASTX result page and finding the top hits in the list. In this example,
the top hit is a Vanessa tameamea protein, a strong indication that the gene model sequence in
Apollo is that of an actual gene of Glanville fritillary.
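Hit descriptions in the result list typically end with the organism name in square brackets. If you copy several hit descriptions into a script, the protein name and the species can be separated as in the sketch below. The hit text used here is a made-up example, not an actual search result, and the function name is our own.

```python
import re
from typing import Optional, Tuple


def split_hit_title(title: str) -> Tuple[str, Optional[str]]:
    """Split 'protein name [Organism name]' into (protein name, organism)."""
    match = re.match(
        r"^(?P<name>.*?)\s*\[(?P<organism>[^\[\]]+)\]\s*$", title.strip()
    )
    if not match:
        # No trailing [organism] part: return the text unchanged.
        return title.strip(), None
    return match.group("name"), match.group("organism")


# Hypothetical hit description, for illustration only.
name, organism = split_hit_title("transketolase-like protein 2 [Vanessa tameamea]")
print(name)
print(organism)
```

The protein name without the organism is also what step 8 asks you to enter as the first comment of your annotation.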
If the protein domain of interest was not found and BLASTX gives you very few or
poor hits, it is possible that this gene model is a false positive, i.e. an
automated gene prediction where there is no actual gene. Add a note of the result,
e.g. “false positive”, to the end of the line of this gene in your gene list. Leave
the annotation in the user-created annotations area and add the suspected false
positive to the information field of this annotation (see step 8 of these
instructions).
NOTE: Another (and quicker) way of performing the initial alignment is to use the
SANSparallel tool (http://ekhidna2.biocenter.helsinki.fi/cgi-bin/sans/sans.cgi )
developed in Liisa Holm’s group at the Institute of Biotechnology. Step 6 of these
instructions can also be performed in SANSparallel.
6. Perform multiple alignment against the best BLASTX hits
Open a text editor. In Windows, WordPad is recommended, but you can also use e.g.
Notepad.
Copy and paste the peptide sequence from Apollo window to the text editor (Select
radio button Peptide sequence [Figure 9] and copy and paste the contents).
Get the FASTA peptide sequences of at least two of the best BLASTX hits of your
search on the previous step and append these into the text editor with their FASTA
headers (FASTA header is the line that begins with a “>” and always precedes the
actual sequence).
You can do this by checking the hits in the BLASTX result list and selecting the
“Download” menu (Figure 15). In the “Download” menu, select the top option
“FASTA (Complete sequence)” and click “Continue”. The sequences are saved
into a file called seqdump.txt by default. Edit the file if needed.
Figure 15. Selecting the best BLASTX hits and clicking the Download menu.
Open the downloaded file and copy and paste the contents after your peptide
sequence from Apollo in the text editor (Figure 16a).
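If you prefer not to combine the sequences by hand, the same step can be sketched in a few lines of Python. The sketch assumes that you have saved the Apollo peptide sequence (with its header line) and the downloaded seqdump.txt contents as text. The function names and the toy sequences are for illustration only.

```python
def read_fasta(text: str) -> list:
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def build_mafft_input(apollo_peptide: str, blast_hits: str, width: int = 60) -> str:
    """Put the Apollo peptide first, then the BLASTX hits, as one FASTA text."""
    records = read_fasta(apollo_peptide) + read_fasta(blast_hits)
    if len(records) < 3:
        raise ValueError("expected the Apollo peptide plus at least two BLASTX hits")
    lines = []
    for header, sequence in records:
        lines.append(">" + header)
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


# Toy sequences, not real data.
apollo = ">my_annotation\nMKLVTA\n"
hits = ">hit1 [Vanessa tameamea]\nMKLVSA\n>hit2 [Bombyx mori]\nMKIVSA\n"
print(build_mafft_input(apollo, hits))
```

The check for at least three records mirrors the instruction above to include at least two of the best BLASTX hits alongside your own peptide.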
Proceed to the online MAFFT tool (Multiple Alignment using Fast Fourier Transform)
at URL: http://mafft.cbrc.jp/alignment/server/
NOTE: An alternative for multiple alignment is to use the “MSA” postprocessing
option in the SANSparallel tool
(http://ekhidna2.biocenter.helsinki.fi/cgi-bin/sans/sans.cgi )
Figure 16a. The Apollo peptide sequence and the best BLASTX hits open in Windows WordPad in a
same text file.
Copy and paste all the sequences from the WordPad file into the MAFFT input field
(Figure 16b), then click the “Submit” button at the bottom of the MAFFT page.
Figure 16b. Uploading a sequence file using the upload tool of the MAFFT server (red rectangle). You
can also copy and paste the sequences directly into the text field above.
The multiple alignment is then processed by the server and a result page is given
(Figure 17). The alignment will show exactly where the input sequence aligns with
the top hits from the BLASTX search.
In the example case the alignment of the input sequence begins much later than the
database entries and the database sequences also continue further than the input
sequence (Figure 17).
Figure 17. An example MAFFT result. The last of the BLASTX hits has a shorter end compared to
the other three sequences, and the final ca. 15 amino acids show differences (end of the 6th and the 7th
line of the alignment).
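If you save the aligned sequences (with gaps written as “-”), the start and end of your sequence relative to the hits can also be checked with a small script. This is a rough sketch under several assumptions: the tolerance of 5 residues is arbitrary, the function names are our own, and the wording “End missing in MSA” is formed by analogy with the “Start missing in MSA” comment suggested in step 8.

```python
def terminal_gaps(aligned: str) -> tuple:
    """Count gap characters at the start and at the end of an aligned sequence."""
    leading = len(aligned) - len(aligned.lstrip("-"))
    trailing = len(aligned) - len(aligned.rstrip("-"))
    return leading, trailing


def msa_comments(query: str, hits: list, tolerance: int = 5) -> list:
    """Suggest MSA comments by comparing the query ends with the database hits."""
    query_lead, query_trail = terminal_gaps(query)
    hit_lead = min(terminal_gaps(hit)[0] for hit in hits)
    hit_trail = min(terminal_gaps(hit)[1] for hit in hits)
    comments = []
    if query_lead - hit_lead > tolerance:
        comments.append("Start missing in MSA")
    if query_trail - hit_trail > tolerance:
        comments.append("End missing in MSA")  # wording is an assumption
    return comments or ["Full length MSA"]


# Toy alignment: the query starts 8 columns later than the hits.
query = "--------MKLVTAGH"
hits = ["MSSPAAQRMKLVSAGH", "MSTPAAQRMKIVSAGH"]
print(msa_comments(query, hits))
```

A script like this only looks at the ends of the alignment. Internal gaps and mismatches still need to be judged by eye on the MAFFT result page.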
7. Set correct exon lengths and add missing exons into your
annotation
The multiple alignment gives you an idea of the missing exons or other gaps in the
alignment that need to be adjusted. In the example case, the multiple alignment was
quite good to begin with. If you see amber-colored circles with exclamation marks at
the exon boundaries, it means there is a non-canonical start or end of an exon. The
cano
Navigate back to Apollo and close the sequence view from the small “X” in the top
right corner of the sequence window.
Figure 18. Checking RNA-seq evidence for the gene. In this example, an RNA-seq track from a female
larva is clicked on (PLUS_STRAND, as the gene is on the (+) strand (arrow from left to right)).
8. Edit information of your annotation
Navigate back to the Apollo view to see your annotation. Activate some of the RNA
tracks from the “Tracks” tab: use the tracks with the suffix PLUS_STRAND if your
gene is on the (+) strand (arrow from left to right) and those with the suffix
MINUS_STRAND if your gene is on the (-) strand (arrow from right to left) (Figure
18). Typically, a single RNAseq evidence track fills the view on the screen, so you
may click the tracks on and off to compare them. This way you can see rudimentary
overlapping evidence in support of the gene model in this location.
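The suffix rule can be written out as a small sketch. The track names below are invented for illustration. Only the PLUS_STRAND and MINUS_STRAND suffixes come from this guide.

```python
def tracks_for_strand(track_names: list, strand: str) -> list:
    """Return the RNAseq tracks whose suffix matches the strand of the gene."""
    suffix = {"+": "PLUS_STRAND", "-": "MINUS_STRAND"}[strand]
    return [name for name in track_names if name.endswith(suffix)]


# Hypothetical track names; check the Tracks tab for the real ones.
tracks = [
    "female_larva_PLUS_STRAND",
    "female_larva_MINUS_STRAND",
    "male_adult_PLUS_STRAND",
]
print(tracks_for_strand(tracks, "+"))
```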
Select the whole annotation by clicking the tip of the arrow on the User-created
annotations area, then right-click the selection and select “Edit information” (Figure
19).
Scroll down to the bottom of the information window and add to the bottom-left (gene
side) “Comments” section at least the following information in separate comment
fields and in this order:
1. Name of the gene (e.g. copy-paste from the name of the best BLASTX hit
without the organism name)
2. Protein superfamilies returned by the BLASTX search (if any). These are
found at the top part of the BLASTX result page (Figure 13).
3. “RNAseq” if there’s overlapping evidence with overall good
concordance with the gene model exons on the RNAseq tracks. (Figure
18 is an example of good concordance). Type “vague RNAseq” if only a
couple of RNAseq reads overlap the gene. Type “no RNAseq” if there are no
RNAseq reads overlapping the gene on any of the RNAseq tracks.
4. Indication of the quality and length of the multiple alignment you
performed, e.g. “Full length MSA” for full length alignment. If some
sequence is missing from the start and cannot be found: “Start missing
in MSA” etc.
5. Other brief information you feel the annotation curators should know
about, e.g. “Frameshift in the 3rd exon”, “Stop codon in the 2nd exon”,
“Poor alignment”, “5-prime UTR” etc. Note: avoid special characters in
all comments, dash (-) and underscore (_) are OK but avoid all other
special chars.
6. Write “Apollo problem” if your adjustments of the gene model cause an
incorrect reading frame or other
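The character rule in item 5 can be checked before you paste a comment into Apollo. The sketch below assumes that letters, digits and spaces are acceptable in addition to dash and underscore, since the example comments in this guide contain them. The function name is our own.

```python
import re

# Assumed allowed set: letters, digits, space, dash and underscore.
ALLOWED_CHARACTER = re.compile(r"[A-Za-z0-9 _-]")


def comment_problems(comment: str) -> list:
    """Return the sorted list of characters that should be avoided."""
    return sorted({ch for ch in comment if not ALLOWED_CHARACTER.fullmatch(ch)})


for comment in ["5-prime UTR", "Frameshift in the 3rd exon", "Stop codon (2nd exon)"]:
    problems = comment_problems(comment)
    print(comment, "->", "OK" if not problems else f"avoid {problems}")
```

An empty list means the comment contains only the characters assumed to be safe.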
The example case (Figures 3-18) had overlapping RNAseq in good concordance as
well as a full alignment with differences towards the end of the gene. There was also
a Transketolase_C superfamily present in the BLASTX alignment (Figure 13). The
start of the gene has an untranslated (white) region which may well be a 5-prime
UTR. The appropriate comments for the example case are presented (Figure
20).
Figure 19. Selecting the annotation on the User-created annotations area by left-clicking from the tip
of the annotation arrowhead (red rectangle) and right-clicking will bring the list of options for this
annotation visible. The “Edit information” option is highlighted.
Figure 20. Scrolling down to the bottom-left comments field of the “Edit information” view and adding
the example case comments here. Note that the user interface is a bit tricky here: First click “Add” and
then click on the appearing field to edit the comment. Click outside the field to save the comment.
Reclick “Add” to add another comment. If you are unsure whether the information was saved after you
close the view, you can always return to this view by right-clicking the annotation in the User-created
annotations area and selecting “Edit information”.
Close the “Edit information” view from the small “x” in the top-right of the information
editor window.
Do not edit any other information than “Comments” in the information editor.
Back in your gene list, add a comment to the end of the line, e.g. “X” simply to
mark that the gene is annotated, “Needs revision” to indicate that the annotation curators
should look at this one, or “Annotated by: WHO?” in case there was already an
annotation in the location. Leave the annotation visible in the User-created
annotations area. Then continue to the next gene in your gene list (back to step 2 of
this document).
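If your gene list is a plain text file, the end-of-line comments can also be added with a short script. The sketch below is only an illustration. It assumes that each line begins with the location (e.g. M05_B06_H03:943048-955620) and that a tab is an acceptable separator. The function name and the second location in the example are hypothetical.

```python
def annotate_gene_list(lines: list, notes: dict, separator: str = "\t") -> list:
    """Append a note to every line whose first field has an entry in `notes`."""
    updated = []
    for line in lines:
        text = line.rstrip("\r\n")
        fields = text.split()
        key = fields[0] if fields else ""
        note = notes.get(key)
        # Lines without a note are kept unchanged.
        updated.append(text + separator + note if note else text)
    return updated


gene_list = ["M05_B06_H03:943048-955620\n", "M01_A01_A01:1-100\n"]
notes = {"M05_B06_H03:943048-955620": "X"}
for line in annotate_gene_list(gene_list, notes):
    print(line)
```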
NOTE: The minute you drag and drop an annotation into the User-created
annotations area, that annotation is saved into the database. This means that the
annotation is now part of what is to become the official gene set (Version 2) of the
Melitaea cinxia genome. You can delete, copy and undo steps of your annotations
by selecting the annotation from the arrow tip, right-clicking and selecting the
appropriate option. Deleting an annotation from the User-created annotations area
removes the annotation from the database (and from the forthcoming official gene set V2
of Melitaea cinxia).
Thank you for joining the Melitaea cinxia annotation effort!
9. Once you’re done with the annotations for the day, log out from
Apollo
When you are done with annotations for the day, log out from the Apollo service by
clicking the logout button in the top right corner of the screen (Figure 16). This closes
your connection to the database and prevents potential congestion of the database
while you are not annotating (but others participating in the effort may be). The Apollo
server tends to remember where you were when you logged out and will
return to this location upon your next login, even if you log in from a different
computer.
10. Once your whole gene list is finished and commented,
send it back to daniel.blande@helsinki.fi . You may also request
a new gene list from the same address if you like.
If you want to request a new gene list, select a gene family of interest by navigating
to https://www.helsinki.fi/en/researchgroups/life-history-evolution/research/melitaea-cinxia-manual-annotation
and to the section “Selecting genes to annotate” on that page.