
juhana.kammonen@helsinki.fi; +358503785335; Slack IMS: http://mcinxiaannota-ctk7011.slack.com

Basic annotation guide

NOTE: This document presents an example case of a basic annotation workflow for a gene. You do not have to follow the protocol to the letter, but it may be useful to go through the steps if you are annotating for the very first time.

The Apollo software authors have also made their own guide available at:

http://genomearchitect.github.io/users-guide/

1. Open your received list of genes and log in to the main annotation server (Apollo)

Open the gene list you received in an appropriate program (e.g. a text editor such as Notepad, or Excel). If you do not have a gene list, follow the instructions in the section “Selecting genes to annotate” at:

https://www.helsinki.fi/en/researchgroups/life-history-evolution/research/melitaea-cinxia-manual-annotation

If you have received a list of genes, navigate (using Firefox!) to the Apollo annotation server URL: http://dna-marker.biocenter.helsinki.fi:8018/apollo

A login window appears. Log in with your Apollo credentials. If you do not have credentials or have lost them, contact daniel.blande@helsinki.fi to obtain or recover them.

2. Locate a gene model on your gene list

Your gene list will show gene model locations like this:

M05_B06_H03:943048-955620

Each line of your gene list refers to a gene model related to the gene family you selected. The line gives the contig where your target gene is located (here M05_B06_H03) and, after the colon, the base-level location (in the example above, starting base 943048 and ending base 955620). You can navigate to this location by entering the contig name in the search field at the top right of the annotation view and selecting the auto-filled option in the dropdown below it (Figure 1a).

Figure 1a. Entering a contig from your gene list in the search field (red rectangle on the right). Click

the auto-filled option in the dropdown to proceed to the target contig.
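If you keep your gene list in a text file, the location format described above can also be split programmatically. The following is a minimal Python sketch, not part of the Apollo software. The function name `parse_location` and the assumption that each line starts with the `contig:start-end` entry (possibly followed by your own comments) are illustrative only.

```python
import re
from typing import NamedTuple


class GeneLocation(NamedTuple):
    contig: str
    start: int
    end: int


# Matches entries such as "M05_B06_H03:943048-955620" at the start of a line.
# Anything after the location (e.g. a comment you added) is ignored.
_LOCATION_RE = re.compile(r"^\s*(?P<contig>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)")


def parse_location(line: str) -> GeneLocation:
    """Split a gene-list entry into contig name, start base and end base."""
    match = _LOCATION_RE.match(line)
    if match is None:
        raise ValueError(f"not a contig:start-end location: {line!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"start {start} is after end {end}")
    return GeneLocation(match["contig"], start, end)


loc = parse_location("M05_B06_H03:943048-955620")
# The contig name is what you type into the Apollo search field.
print(loc.contig, loc.start, loc.end)
# Number of bases covered by the model, counting both end points.
print(loc.end - loc.start + 1)
```

For the example entry, the model spans 12573 bases, which gives an idea of how far to zoom out to see the whole gene model.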

Use the Apollo navigation tools to navigate to the approximate location of your gene model (Figure 1b).

Figure 1b. Using Apollo navigation tools to find approximate location of the gene model on the gene

3. Make evidence tracks visible in the Apollo view

If this is your first time logging in to Apollo, the evidence tracks are most likely hidden from view initially. Once you believe you are at the location of your gene, click on the “Tracks” tab on the right-side panel (Figure 2).

Figure 2. Tracks tab selection (red rectangle) in the main annotation view.

Click on the “MAKER_genes_V3” gene models (Figure 3). You should find the gene

model with the name from your gene list in the location. Zoom in more if you cannot

see the names (Figure 1b).

Figure 3. Clicking on the MAKER_genes_V3 tracks in Apollo main annotation view (right red

rectangle) and locating the gene model in the annotation (center red rectangle).

You can identify the correct gene model based on the information on your gene list.

You will see something like maker-M05_B06_H03-augustus-gene-1.459-mRNA-1

in your gene list entry. Make absolutely sure that the name of the gene model on

your list matches the one you are viewing in Apollo! The number code of the

gene model name (here 1.459 ) is a very good identifier of the correct model.
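The name comparison can also be checked mechanically. The sketch below splits a gene model name of the form shown above into its parts and reports which parts differ between two names. The pattern is an assumption based only on the single example name in this guide; names of another form will be rejected, and the function names are hypothetical.

```python
import re
from typing import Dict, List

# Assumed pattern, derived from the one example in this guide:
# maker-<contig>-<predictor>-gene-<number code>-mRNA-<isoform>
_MODEL_RE = re.compile(
    r"^maker-(?P<contig>.+)-(?P<predictor>[^-]+)"
    r"-gene-(?P<code>\d+\.\d+)-mRNA-(?P<isoform>\d+)$"
)


def split_model_name(name: str) -> Dict[str, str]:
    """Return the contig, predictor, number code and isoform of a model name."""
    match = _MODEL_RE.match(name.strip())
    if match is None:
        raise ValueError(f"unexpected gene model name: {name!r}")
    return match.groupdict()


def mismatched_fields(list_name: str, apollo_name: str) -> List[str]:
    """List the name parts that differ between the gene list and Apollo."""
    a, b = split_model_name(list_name), split_model_name(apollo_name)
    return [field for field in a if a[field] != b[field]]


parts = split_model_name("maker-M05_B06_H03-augustus-gene-1.459-mRNA-1")
print(parts["contig"], parts["code"])
```

An empty list from `mismatched_fields` means the two names agree in every part, including the number code.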

4. Add annotation to the User-created annotation area

Next we take a look at the user-created annotation area (the yellowish area on top of

the main annotation view in previous figures).

Set the zoom of the main annotation view so that you can see the entire gene model that overlaps the location in your gene list. Select the gene model from the appropriate evidence track by left-clicking once on the tip of the gene model arrowhead (Figure 4).

Figure 4. Selecting the gene model on the EVM track by clicking the tip of the arrow.

NOTE: If there is an existing annotation in the user-created annotation area (colored bars with an arrow) at this location, somebody has already annotated that gene or is in the process of doing so. If this happens, move the mouse pointer over the existing annotation and hold it there for a moment. Check the contents of the “owner:” field in the floating box that appears (it should be an email address). On the corresponding line of your gene list, record that there was an overlapping annotation and who owns it, e.g. by adding “annotated by test@localhost.com” to the end of the line. After this, move on to the next gene on your list until you find one that has no previous annotations.

If the user-created annotations area is empty at your location, as it should be,

proceed to drag and drop that annotation from the track to the User-created

annotations area by clicking at the tip of the arrow and dragging the model with the

mouse button pressed (Figure 5). Drop the model (release the mouse button) when

the model is on top of the user-created annotations area. The model appears in the

annotation area as a new annotation (Figure 6).

NOTE: If the annotation does not appear in the user-created annotation area, try to

log out from Apollo (logout button in the top-right corner of screen) and then re-login.

You will be returned into the same location and view where you were when you

logged out. Then try selecting the gene model and dragging and dropping again.

This is one of the Apollo glitches that we are currently sorting out.

Figure 5. Dragging a model from the MAKER_V3 track into the User-created annotations area.

Figure 6. Upon releasing the mouse button the model appears as a new annotation in the User-

created Annotations area.

5. Get the cDNA sequence of your annotation and perform BLASTX alignment

At the very minimum you should check with BLAST whether the expected protein

domain of your gene of interest is found on the gene model. This step describes how

to do this.

Get the peptide sequence of your annotation. To do this, click the tip of the arrow in the User-created annotations area so that the whole annotation becomes active (Figure 7). Then right-click to bring up a menu of annotation edit options (Figure 8). At the top of the list is “Get sequence”; select that. A window with an amino-acid sequence becomes visible (Figure 9). This is the peptide sequence of this annotation.

Figure 7. Clicking the target annotation active from the arrow-tip on the User-created annotation area.

Figure 8. Right-clicking the arrow-tip gives you annotation edit options. Select “Get sequence”.

Figure 9. Selecting the “Get sequence” option. The amino acid-sequence becomes visible in the

annotation window.

In this example, we use the cDNA for our alignment, so click the radio button for

“cDNA sequence” in the view (Figure 10). cDNA contains the coding sequences of

the gene accompanied by the 5’ and 3’ untranslated regions (UTRs) should there be

Figure 10. cDNA sequence of the annotation selected (red rectangle) and visible in the “Get

sequence” window. We use cDNA for the most accurate possible hits in the sequence databases (see

the following steps)

Then open another tab in Firefox or use another browser and navigate to:

https://blast.ncbi.nlm.nih.gov/Blast.cgi

The NCBI BLAST tools main page opens (Figure 11). Select the BLASTX tool in the

middle of the page.

Figure 11. NCBI BLAST tools main page open in browser. The link to the BLASTX tool is highlighted

in the middle (red rectangle).

In the top field of the opening view (Figure 12), copy and paste the cDNA sequence

from the annotation view. Be sure to include all the text that is in the field including

the header line (easiest done by right-clicking the text in the annotation view and

choosing “Select all”). Then scroll to the bottom of the page and click “BLAST”.

Figure 12. Copy-pasting the cDNA into the top field of the search (top red rectangle) and clicking

“BLAST” on the bottom of the page (bottom red rectangle).
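Before pasting, you can check that the copied text really contains the header line and only nucleotide letters. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the header `>example_cdna` and the short sequence are made up for illustration (the real header written by Apollo will differ), and only the letters A, C, G, T and N are accepted.

```python
def check_cdna_fasta(text: str) -> int:
    """Return the cDNA length if text is a single FASTA record with a header."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("the header line starting with '>' is missing")
    if any(line.startswith(">") for line in lines[1:]):
        raise ValueError("more than one FASTA record was pasted")
    sequence = "".join(lines[1:]).upper()
    if not sequence:
        raise ValueError("the header is present but the sequence is empty")
    unexpected = sorted(set(sequence) - set("ACGTN"))
    if unexpected:
        # Typical cause: the peptide view was copied instead of the cDNA view.
        raise ValueError(f"non-nucleotide characters found: {unexpected}")
    return len(sequence)


# Made-up record for illustration only.
pasted = ">example_cdna\nATGGCC\nTTTTAA\n"
print(check_cdna_fasta(pasted))
```

The returned length is also a rough hint of how long the search may take, since longer queries generally run longer.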

The search starts and a new page opens showing the status of the search. It may take several minutes until a result is given, depending on the length of the pasted sequence, other traffic on the NCBI BLAST service, etc.

Finally, a result page is shown (Figure 13). The example search took more than 5 minutes.

Figure 13. A BLASTX result page showing an illustration of the alignments against the hits in the

database. This example looks rather good. The result shows significant, although gapped hits for the

length of the query sequence for at least the top 8 hits in the database.

Scroll down the result page until you see a text-format list of the hits (Figure 14). The contents of this list help verify that there is an actual butterfly gene at this location. Usually the species is reported, and if you see Latin names such as Vanessa tameamea (Kamehameha butterfly) or Bombyx mori (silk moth), this is a strong indication of a real gene at this location. Moreover, the result list should also give you some indication of your gene of interest, most likely the protein domain name, e.g. “stress protein”.

Figure 14. Scrolling down the BLASTX result page and finding the top hits in the list. In this example,

the top hit is a Vanessa tameamea protein, a strong indication that the gene model sequence in

Apollo is that of an actual gene of Glanville fritillary.

If the protein domain of interest was not found and BLASTX gives you very few or poor hits, it is possible that this gene model is a false positive, i.e. an automated gene prediction where there is no actual gene. Add some indication of the result, e.g. “false positive”, to the end of the line for this gene in your gene list. Leave the annotation in the user-created annotations area and note the suspected false positive in the information field of this annotation (see step 8 of these instructions).

NOTE: Another (and quicker) way of performing the initial alignment is to use the SANSparallel tool (http://ekhidna2.biocenter.helsinki.fi/cgi-bin/sans/sans.cgi) developed in Liisa Holm’s group at the Institute of Biotechnology. Step 6 of these instructions can also be performed in SANSparallel.

6. Perform multiple alignment against the best BLASTX hits

Open a text editor. In Windows, WordPad is recommended, but you can also use e.g. Notepad.

Copy and paste the peptide sequence from Apollo window to the text editor (Select

radio button Peptide sequence [Figure 9] and copy and paste the contents).

Get the FASTA peptide sequences of at least two of the best BLASTX hits of your

search on the previous step and append these into the text editor with their FASTA

headers (FASTA header is the line that begins with a “>” and always precedes the

actual sequence).

You can do this by ticking the checkboxes of the hits in the BLASTX result list and opening the “Download” menu (Figure 15). In the “Download” menu, select the top option “FASTA (Complete sequence)” and click “Continue”. The sequences are saved into a file called seqdump.txt by default. Edit the file name if needed.

Figure 15. Selecting the best BLASTX hits and clicking the Download menu.

Open the downloaded file and copy and paste the contents after your peptide

sequence from Apollo in the text editor (Figure 16a).
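The same combination can be done with a few lines of Python instead of copy and paste. This is only a sketch: the function names are hypothetical, the sequences and headers in the example are made up, and the minimum of two hits simply mirrors the recommendation above.

```python
from typing import List, Tuple


def read_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs."""
    records: List[Tuple[str, List[str]]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append((line[1:], []))
        elif not records:
            raise ValueError("sequence found before any '>' header line")
        else:
            records[-1][1].append(line)
    return [(header, "".join(parts)) for header, parts in records]


def combine_for_alignment(query_text: str, hits_text: str, min_hits: int = 2) -> str:
    """Put the Apollo peptide first, followed by the downloaded BLASTX hits."""
    query, hits = read_fasta(query_text), read_fasta(hits_text)
    if len(query) != 1:
        raise ValueError("expected exactly one Apollo peptide record")
    if len(hits) < min_hits:
        raise ValueError(f"expected at least {min_hits} hit sequences")
    return "".join(f">{header}\n{seq}\n" for header, seq in query + hits)


# Made-up records; in practice hits_text would be the contents of seqdump.txt.
apollo_peptide = ">apollo_peptide\nMKT\nLLV\n"
hits_text = ">hit1 example protein\nMKTLLV\n>hit2 example protein\nMKSLLV\n"
combined = combine_for_alignment(apollo_peptide, hits_text)
print(combined)
```

The combined text can be pasted into the alignment tool in the same way as the contents of the text editor.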

Proceed to the online MAFFT tool (Multiple Alignment using Fast Fourier Transform)

at URL: http://mafft.cbrc.jp/alignment/server/

NOTE: An alternative for multiple alignment is to use the “MSA” postprocessing option in the SANSparallel tool (http://ekhidna2.biocenter.helsinki.fi/cgi-bin/sans/sans.cgi).

Figure 16a. The Apollo peptide sequence and the best BLASTX hits open in Windows WordPad in the same text file.

Copy and paste all the sequences from the WordPad file into the MAFFT input field (Figure 16b), then click the “Submit” button at the bottom of the MAFFT page.

Figure 16b. Uploading a sequence file using the upload tool of the MAFFT server (red rectangle). You

can also copy and paste the sequences directly into the text field above.

The multiple alignment is then processed by the server and a result page is given

(Figure 17). The alignment will show exactly where the input sequence aligns with

the top hits from the BLASTX search.

In the example case the alignment of the input sequence begins much later than the

database entries and the database sequences also continue further than the input

sequence (Figure 17).

Figure 17. An example MAFFT result. The last of the BLASTX hits has shorter end as compared to

the other three sequences and the final ca. 15 amino acids show differences (end of 6th and the 7th

line of the alignment).
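Whether your sequence starts later or ends earlier than the database hits can be read from the gap characters at the ends of its row in the alignment. The sketch below assumes you have the aligned row of your Apollo peptide as a single string in which gaps are written as "-". The function names are hypothetical, the comment texts “Full length MSA” and “Start missing in MSA” come from step 8 of this guide, and “End missing in MSA” is an assumed wording for the corresponding case at the end.

```python
from typing import List, Tuple


def end_gaps(aligned_row: str) -> Tuple[int, int]:
    """Count the gap columns before the first and after the last residue."""
    leading = len(aligned_row) - len(aligned_row.lstrip("-"))
    trailing = len(aligned_row) - len(aligned_row.rstrip("-"))
    return leading, trailing


def suggest_msa_comments(aligned_query: str) -> List[str]:
    """Suggest step 8 comments from the end gaps of the query row."""
    leading, trailing = end_gaps(aligned_query)
    comments = []
    if leading > 0:
        comments.append("Start missing in MSA")
    if trailing > 0:
        comments.append("End missing in MSA")  # assumed wording
    # Internal gaps are not considered here; inspect them by eye.
    return comments or ["Full length MSA"]


# Made-up aligned row: 3 gap columns at the start, 2 at the end.
print(end_gaps("---MKT-LLV--"))
print(suggest_msa_comments("---MKT-LLV--"))
```

This is only a first check. Differences inside the alignment, such as the final amino acids in the example of Figure 17, still need to be inspected in the alignment view.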

7. Set correct exon lengths and add missing exons into your annotation

The multiple alignment gives you an idea of the missing exons or other gaps in the

alignment that need to be adjusted. In the example case, the multiple alignment was

quite good to begin with. If you see amber-colored circles with exclamation marks at

the exon boundaries, it means there’s a non-canonic start or end of an exon. The

Navigate back to Apollo and close the sequence view from the small “X” in the top

right corner of the sequence window.

Figure 18. Checking RNA-seq evidence for the gene. In this example, an RNA-seq track from a female larva is switched on (PLUS_STRAND, as the gene is on the (+) strand, arrow from left to right).

8. Edit information of your annotation

Navigate back to the Apollo view to see your annotation. Activate some of the RNA tracks from the Tracks tab: use the tracks with the suffix PLUS_STRAND if your gene is on the (+) strand (arrow from left to right) and those with the suffix MINUS_STRAND if your gene is on the (-) strand (arrow from right to left) (Figure 18). Typically, a single RNAseq evidence track fills the view on the screen, so you may need to click the tracks on and off to compare them. This way you can see basic overlapping evidence in support of the gene model at this location.

Select the whole annotation by clicking the tip of the arrow on the User-created annotations area, then right-click the selection and select “Edit information” (Figure 19).

Scroll down to the bottom of the information window and add to the bottom-left (gene

side) “Comments” section at least the following information in separate comment

fields and in this order:

1. Name of the gene (e.g. copy-paste from the name of the best BLASTX hit

without the organism name)

2. Protein superfamilies returned by the BLASTX search (if any). These are found in the top part of the BLASTX result page (Figure 13).

3. “RNAseq” if there’s overlapping evidence with overall good

concordance with the gene model exons on the RNAseq tracks. (Figure

18 is an example of good concordance). Type “vague RNAseq” if only a

couple of RNAseq reads overlap the gene. Type “no RNAseq” if there are no

RNAseq reads overlapping the gene on any of the RNAseq tracks.

4. Indication of the quality and length of the multiple alignment you

performed, e.g. “Full length MSA” for full length alignment. If some

sequence is missing from the start and cannot be found: “Start missing

in MSA” etc.

5. Other brief information you feel the annotation curators should know

about, e.g. “Frameshift in the 3rd exon”, “Stop codon in the 2nd exon”,

“Poor alignment”, “5-prime UTR” etc. Note: avoid special characters in

all comments, dash (-) and underscore (_) are OK but avoid all other

special chars.

6. Write “Apollo problem” if your adjustments of the gene model cause an

incorrect reading frame or other
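The rule about special characters in item 5 can be checked before typing a comment into Apollo. The sketch below treats letters, digits, spaces, dash and underscore as allowed. That spaces are acceptable is an assumption based on the example comments above, and the function name is hypothetical.

```python
import re
from typing import List

# Letters, digits, space, dash and underscore are treated as allowed.
_ALLOWED_CHARS = re.compile(r"[A-Za-z0-9 _-]")


def disallowed_characters(comment: str) -> List[str]:
    """Return the characters in a comment that fall outside the allowed set."""
    return sorted(set(_ALLOWED_CHARS.sub("", comment)))


# Comments taken from the examples in this guide.
for comment in ["Transketolase_C", "RNAseq", "Full length MSA", "5-prime UTR"]:
    print(comment, disallowed_characters(comment))

# A comment with brackets and an exclamation mark is flagged.
print(disallowed_characters("Stop codon (2nd exon)!"))
```

An empty list means the comment can be entered as it is.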

The example case (Figures 3-18) had overlapping RNAseq in good concordance as well as a full alignment with differences towards the end of the gene. There was also a Transketolase_C superfamily present in the BLASTX alignment (Figure 13). The start of the gene has an untranslated (white) region which may well be a 5-prime UTR. The appropriate comments for the example case are shown in Figure 20.

Figure 19. Selecting the annotation on the User-created annotations area by left-clicking from the tip

of the annotation arrowhead (red rectangle) and right-clicking will bring the list of options for this

annotation visible. The “Edit information” option is highlighted.

Figure 20. Scrolling down to the bottom-left comments field of the “Edit information” view and adding

the example case comments here. Note that the user interface is a bit tricky here: First click “Add” and

then click on the appearing field to edit the comment. Click outside the field to save the comment.

Reclick “Add” to add another comment. If you are unsure whether the information was saved after you

close the view, you can always return to this view by right-clicking the annotation in the User-created

annotations area and selecting “Edit information”.

Close the “Edit information” view from the small “x” in the top-right of the information

editor window.

Do not edit any other information than “Comments” in the information editor.

Back in your gene list, add a comment to the end of the line, e.g. “X” to simply mark the gene as annotated, “Needs revision” to indicate that the annotation curators should look at this one, or “Annotated by: WHO?” in case there was already an annotation at the location. Leave the annotation visible in the User-created annotations area. Then continue to the next gene in your gene list (back to step 2 of this document).
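If your gene list is a plain text file, the end-of-line comments can also be added with a small script. This is a sketch only: the function name is hypothetical, the tab used as separator is an assumption, the second location in the example is made up, and each line is assumed to start with the location entry.

```python
from typing import Dict


def annotate_gene_list(text: str, notes: Dict[str, str]) -> str:
    """Append a note to the end of each gene-list line that has one."""
    out = []
    for line in text.splitlines():
        fields = line.split()
        # The first field of the line is taken to be the location entry.
        note = notes.get(fields[0]) if fields else None
        out.append(f"{line.rstrip()}\t{note}" if note else line)
    return "\n".join(out)


# The second location is made up for illustration.
gene_list = "M05_B06_H03:943048-955620\nM05_B06_H03:1000-2000"
notes = {"M05_B06_H03:943048-955620": "X"}
print(annotate_gene_list(gene_list, notes))
```

Lines without a note are left unchanged, so the script can be rerun as you work through the list.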

NOTE: The moment you drag and drop an annotation into the User-created annotation area, that annotation is saved into the database. This means that the annotation is now part of what is to become the official gene set (Version 2) of the Melitaea cinxia genome. You can delete, copy and undo steps of your annotations by selecting the annotation from the arrow tip, right-clicking and selecting the appropriate option. Deleting an annotation from the User-created annotations area removes the annotation from the database (and from the forthcoming official gene set V2 of Melitaea cinxia).

Thank you for joining the annotation effort of the Melitaea cinxia!

9. Once you’re done with the annotations for the day, log out of Apollo

When you are done with annotations for the day, log out of the Apollo service by clicking the logout button in the top right corner of the screen (Figure 16). This closes your connection to the database and prevents potential congestion of the database while you are not annotating (but others participating in the effort may be). The Apollo server tends to remember where you were when you logged out and will return to this location upon your next login, even if you log in from a different computer.

10. Once your whole gene list is finished and commented, send it back to daniel.blande@helsinki.fi. You may also request a new gene list in the same message if you like.

If you want to request a new gene list, select a gene family of interest by navigating to https://www.helsinki.fi/en/researchgroups/life-history-evolution/research/melitaea-cinxia-manual-annotation and to the section “Selecting genes to annotate” on that page.
