biological(databases(and(blast(...
Post on 29-May-2020
11 Views
Preview:
TRANSCRIPT
Joyce Njoki Nzioki BecA-‐ILRI Hub, Nairobi, Kenya h;p://hub.africabiosciences.org/ h;p://www.Ilri.org/ j.n.njuguna@cgiar.org
Biological Databases and BLAST ILRI /Ethiopia Training
27, August 2015
Biological Databases
• A biological database is a computerized archive used to store, organize and ease retrieval of sequence data
Biological Databases
• A biological database is a computerized archive used to store, organize and ease retrieval of sequence data
• A database typically supports the following opera=ons
ü Retrieval ü Inser=on ü Upda=ng ü Dele=on
Biological databases I. A database can be thought as a large table where row
represents record and columns represents fields. II. The organiza=on of records allows for querying on
the data to retrieve informa=on from the databases III. An ideal biological database has fields as shown
below
Accession number
Name Length Sequence Taxonomy Reference
NR235462.1 MTGA 268 ACTTGC... E.coli A.Kelly et. al
NR235463.1 HKY 350 TGAGTA… E.coli J.Jone et. al
NR235464.1 THY 289 TGACGT… S.Aurius K.Moy et. al
Types of Biological Databases
1. Primary databases: hold raw sequenced data • GenBank • EMBL (European Molecular Biology Laboratory) • DDBJ (DNA Data Bank of Japan) • PDB (Protein Data Bank)
Types of Biological Databases
1. Primary databases: hold raw sequenced data • GenBank • EMBL (European Molecular Biology Laboratory) • DDBJ (DNA Data Bank of Japan) • PDB (Protein Data Bank)
2. Secondary databases: they are curated and annotated • Swiss-‐Prot – detailed annota=on if proteins • RefSeq – non-‐redundant , curated sequenced data
Types of Biological Databases
1. Primary databases: hold raw sequenced data • GenBank • EMBL (European Molecular Biology Laboratory) • DDBJ (DNA Data Bank of Japan) • PDB (Protein Data Bank)
2. Secondary databases: they are curated and annotated • Swiss-‐Prot – detailed annota=on if proteins • RefSeq – non-‐redundant , curated sequenced data
3. Specialized databases: these focus on data of specific research interest • VectorBase • PlasmoDB
Where to look for Biological databases
ü Search Engines (Google) ü Journals related to bioinforma=cs
Nucleic Acid research NAR online database issue
ü Websites like; www.expasy.ch
Major Biological databases
ü There are many public resources but few key resources
ü Many databases are part of projects that have limited lifespan
ü Use major public resources: NCBI, EBI, Ensemble, PDB, KEGG
ü If your field is more specific, iden=fy the major resource in that area
Accession Numbers
ü Stable ways of iden=fying GenBank Entries. ü No biological meaning. ü Originally an uppercase lefer followed by 5 digits U00002
ü Now two uppercase lefers followed by six digits BC037153
ü Version of entry added later as a decimal BC037153.1
Sequence formats ü This is the required arrangement of characters symbols and keywords in a sequence record.
ü There many different sequence formats for the purpose of database integra=on and organizing sequenced data
ü When considering a format for retrieval: • What is easy to parse • What format do the tools need • What informa=on is needed
FASTA format
• Used by fasta tools • Comment line > then sequence data
GenBank format
• Flat file format used by GenBank • Has annota=on, author, version etc
Sequence formats
• Sequence are store in databases or in files as simple text (ASCI text)
• Microsok Word format is not a sequence format
• Save sequence files as text .txt file !!! • Use text editors like note-‐pad, text-‐pad to open such files
Problems of general sequence databases
i. Redundancy of sequence informa=on ii. Inadequate sequence iii. Old sequences iv. Par=ally annotated sequences v. Inconsistent and outdated annota=ons vi. Error sequences
Sequence comparison ü Newly sequenced DNA data is compared to that already available in biological databases.
Sequence comparison ü Newly sequenced DNA data is compared to that already available in biological databases.
ü Sequence comparison (of DNA / Protein data) is achieved through alignment, the process by which regions of similarity is searched between sequences.
Sequence comparison ü Newly sequenced DNA data is compared to that already available in biological databases.
ü Sequence comparison (of DNA / Protein data) is achieved through alignment, the process by which regions of similarity is searched between sequences.
ü This eases annota=on of new sequences as biological knowledge from well characterized homologs can be conferred
Some terminology 1. Sequence similarity; this is when two sequences are very alike in
base pair or amino acid sequence ü Sta=s=cal measures like E-‐value. P-‐Value and bit score ü Percentage iden=ty (% of iden=cal residues between sequences) ü The length of sequence stretch that is similar
Some terminology 1. Sequence similarity; this is when two sequences are very alike in
base pair or amino acid sequence ü Sta=s=cal measures like E-‐value. P-‐Value and bit score ü Percentage iden=ty (% of iden=cal residues between sequences) ü The length of sequence stretch that is similar
2. Homology; homologs diverse from a common ancestor and homology is inferred by sequence, structural and func=onal similarity ü Orthologs – arise due to a specia=on event ü Paralogs – arise due to gene duplica=on within the sequence
Some terminology 1. Sequence similarity; this is when two sequences are very alike in
base pair or amino acid sequence ü Sta=s=cal measures like E-‐value. P-‐Value and bit score ü Percentage iden=ty (% of iden=cal residues between sequences) ü The length of sequence stretch that is similar
2. Homology; homologs diverse from a common ancestor and homology is inferred by sequence, structural and func=onal similarity ü Orthologs – arise due to a specia=on event ü Paralogs – arise due to gene duplica=on within the sequence
If sequence similarity in high enough it can infer homology and consequently structural and func=onal similarity but not in all cases.
Sequence Alignment
Pairwise sequence alignment; comparison of two sequences to establish regions of residue similarity Why align sequence; ü Iden=fy sequences with significant similarity / homology ü Iden=fy the domains with sequence, structure & func=onal similarity.
ü Database similarity searching ü Confer func=on from the known to the unknown
Sequence Alignment There two types of alignment strategies 1. Global alignment; find op=mal alignment over the
en=re length of the two compared sequences
ü Best for highly similar sequence ü May miss important biological rela=onships in low similarity
Sequence Alignment 2. Local alignment; aligns short regions of similarity
between sequences.
ü Useful when looking for domains in proteins and gene finding in DNA
ü Best for sequences of low similarity and different length
Searching sequence database ü A query sequence is searched against a database to look for homologs.
ü The algorithm used aligns your query to those in the database and returns highly similar sequences.
ü A scoring procedure is implemented on searches to measure the degree of similarity.
ü Judgment needs to be made on whether the similar sequences are homologous to your query based on scien=fic knowledge
ü There 2 programs for this: 1. BLAST (Altschul et al. 1990) 2. FastA (Pearson and Lipman 1988)
Basic Local Alignment Search Tool
ü Blast is a heuris=c approach used to calculate similarity for biological sequences
ü It finds best local alignment ü Blast displays results as a list of sequence matches ordered by sta=s=cal significance (hits)
ü BLAST is useful in iden=fying unknown sequences as well as gene or protein func=on predic=on.
Blast Flavors Some Flavors of BLAST
ucleotide rotein N
NN
N
N
N
P
P
blastx
tblastn
tblastx
P P P P P P
P P P P P P P P P P P P
P P P P P P
Query Database Program
blastp
blastn
P P N N
PP
Graphical BLAST results
ü This is a graphical view of the distribu=on of BLAST hits on the query sequence
ü The length of the hits shows the query coverage and regions of similarity
Query Sequence
Blast Hits, A mouse over gives you the details. Click to view alignment
Hit List BLAST result Hit list gives the iden-fy of sequences similar to your query sequence ranked by similarity
Bit score values < 50 unreliable
% Query coverage
E-‐value
Sequence defini=on, click to view the pairwise alignment
Accession number, Link to the record in GenBank
Pairwise alignment (Protein)
Pairwise alignment (nucleo=de)
BLAST result interpreta=on • How do you make your conclusion on homology: • E-‐value = Expected value. (this indicates the probability that the blast hit may have occurred by random chance).
• The lower the E-‐value (or the closer it is to 0) the more significant the hit. To be certain of homology your E-‐value must be below 10-‐4 or 0.001.
• % iden=ty the higher the iden=ty the increasing likelihood of homology.
• Query coverage – if a hit has high query coverage and similarity in increases the chances of homology.
Summary for nucleo=de sequence
Length Database Purpose BLAST Program
20 bp or longer Nucleo=de Iden=fy the query sequence
blastn megablast
Find similar nucleo=de sequence.
blastn
Find similar proteins to translated query in a translated nucleo=de database
tblastx
Protein Find proteins coded in my query DNA sequence
blastx
Summary for protein sequences Length Database Purpose BLAST Program
15 residues or Longer
Protein Iden=fy your query sequence or find protein sequences similar to it
blastp
Find members of a protein family or build a custom posi=on specific scoring matrix(PSSMs)
PSI-‐blast
Find proteins similar to the query around a given pafern
PHI-‐blast
Conserved domains
Find conserved domains in your query and iden=fy other proteins with similar domains
CD-‐search
Nucleic Find similar sequences in a translated nucleo=de sequence database.
tblastn
The End Acknowledge E=enne for some of the slides on
biological databases
Thank you
top related