biological(databases(and(blast(...

Joyce Njoki Nzioki BecA-‐ILRI Hub, Nairobi, Kenya h;p://hub.africabiosciences.org/ h;p://www.Ilri.org/ j.n.njuguna@cgiar.org

Biological Databases and BLAST ILRI /Ethiopia Training

27, August 2015

Biological Databases

•  A biological database is a computerized archive used to store, organize and ease retrieval of sequence data

Biological Databases

•  A biological database is a computerized archive used to store, organize and ease retrieval of sequence data

•  A database typically supports the following opera=ons

ü Retrieval ü Inser=on ü Upda=ng ü Dele=on

Biological databases I.  A database can be thought as a large table where row

represents record and columns represents fields. II.  The organiza=on of records allows for querying on

the data to retrieve informa=on from the databases III.  An ideal biological database has fields as shown

Accession number

Name Length Sequence Taxonomy Reference

NR235462.1 MTGA 268 ACTTGC... E.coli A.Kelly et. al

NR235463.1 HKY 350 TGAGTA… E.coli J.Jone et. al

NR235464.1 THY 289 TGACGT… S.Aurius K.Moy et. al

Types of Biological Databases

1.  Primary databases: hold raw sequenced data •  GenBank •  EMBL (European Molecular Biology Laboratory) •  DDBJ (DNA Data Bank of Japan) •  PDB (Protein Data Bank)

2.  Secondary databases: they are curated and annotated •  Swiss-‐Prot – detailed annota=on if proteins •  RefSeq – non-‐redundant , curated sequenced data

3.  Specialized databases: these focus on data of specific research interest •  VectorBase •  PlasmoDB

Where to look for Biological databases

ü Search Engines (Google) ü Journals related to bioinforma=cs

Nucleic Acid research NAR online database issue

ü Websites like; www.expasy.ch

Major Biological databases

ü There are many public resources but few key resources

ü Many databases are part of projects that have limited lifespan

ü Use major public resources: NCBI, EBI, Ensemble, PDB, KEGG

ü If your field is more specific, iden=fy the major resource in that area

Accession Numbers

ü Stable ways of iden=fying GenBank Entries. ü No biological meaning. ü Originally an uppercase lefer followed by 5 digits U00002

ü Now two uppercase lefers followed by six digits BC037153

ü Version of entry added later as a decimal BC037153.1

Sequence formats ü This is the required arrangement of characters symbols and keywords in a sequence record.

ü There many different sequence formats for the purpose of database integra=on and organizing sequenced data

ü When considering a format for retrieval: •  What is easy to parse •  What format do the tools need •  What informa=on is needed

FASTA format

•  Used by fasta tools •  Comment line > then sequence data

GenBank format

•  Flat file format used by GenBank •  Has annota=on, author, version etc

Sequence formats

•  Sequence are store in databases or in files as simple text (ASCI text)

•  Microsok Word format is not a sequence format

•  Save sequence files as text .txt file !!! •  Use text editors like note-‐pad, text-‐pad to open such files

Problems of general sequence databases

i.  Redundancy of sequence informa=on ii.  Inadequate sequence iii.  Old sequences iv.  Par=ally annotated sequences v.  Inconsistent and outdated annota=ons vi.  Error sequences

Sequence comparison ü Newly sequenced DNA data is compared to that already available in biological databases.

ü Sequence comparison (of DNA / Protein data) is achieved through alignment, the process by which regions of similarity is searched between sequences.

Sequence comparison ü Newly sequenced DNA data is compared to that already available in biological databases.

ü Sequence comparison (of DNA / Protein data) is achieved through alignment, the process by which regions of similarity is searched between sequences.

ü This eases annota=on of new sequences as biological knowledge from well characterized homologs can be conferred

Some terminology 1.  Sequence similarity; this is when two sequences are very alike in

base pair or amino acid sequence ü  Sta=s=cal measures like E-‐value. P-‐Value and bit score ü  Percentage iden=ty (% of iden=cal residues between sequences) ü  The length of sequence stretch that is similar

2.  Homology; homologs diverse from a common ancestor and homology is inferred by sequence, structural and func=onal similarity ü  Orthologs – arise due to a specia=on event ü  Paralogs – arise due to gene duplica=on within the sequence

If sequence similarity in high enough it can infer homology and consequently structural and func=onal similarity but not in all cases.

Sequence Alignment

Pairwise sequence alignment; comparison of two sequences to establish regions of residue similarity Why align sequence; ü Iden=fy sequences with significant similarity / homology ü Iden=fy the domains with sequence, structure & func=onal similarity.

ü Database similarity searching ü Confer func=on from the known to the unknown

Sequence Alignment There two types of alignment strategies 1.  Global alignment; find op=mal alignment over the

en=re length of the two compared sequences

ü Best for highly similar sequence ü May miss important biological rela=onships in low similarity

Sequence Alignment 2.  Local alignment; aligns short regions of similarity

between sequences.

ü Useful when looking for domains in proteins and gene finding in DNA

ü Best for sequences of low similarity and different length

Searching sequence database ü A query sequence is searched against a database to look for homologs.

ü The algorithm used aligns your query to those in the database and returns highly similar sequences.

ü A scoring procedure is implemented on searches to measure the degree of similarity.

ü Judgment needs to be made on whether the similar sequences are homologous to your query based on scien=fic knowledge

ü There 2 programs for this: 1.  BLAST (Altschul et al. 1990) 2.  FastA (Pearson and Lipman 1988)

Basic Local Alignment Search Tool

ü Blast is a heuris=c approach used to calculate similarity for biological sequences

ü It finds best local alignment ü Blast displays results as a list of sequence matches ordered by sta=s=cal significance (hits)

ü BLAST is useful in iden=fying unknown sequences as well as gene or protein func=on predic=on.

Blast Flavors Some Flavors of BLAST

ucleotide rotein N

blastx

tblastn

tblastx

P P P P P P

P P P P P P P P P P P P

P P P P P P

Query Database Program

blastp

blastn

P P N N

Graphical BLAST results

ü This is a graphical view of the distribu=on of BLAST hits on the query sequence

ü The length of the hits shows the query coverage and regions of similarity

Query Sequence

Blast Hits, A mouse over gives you the details. Click to view alignment

Hit List BLAST result Hit list gives the iden-fy of sequences similar to your query sequence ranked by similarity

Bit score values < 50 unreliable

% Query coverage

E-‐value

Sequence defini=on, click to view the pairwise alignment

Accession number, Link to the record in GenBank

Pairwise alignment (Protein)

Pairwise alignment (nucleo=de)

BLAST result interpreta=on •  How do you make your conclusion on homology: •  E-‐value = Expected value. (this indicates the probability that the blast hit may have occurred by random chance).

•  The lower the E-‐value (or the closer it is to 0) the more significant the hit. To be certain of homology your E-‐value must be below 10-‐4 or 0.001.

•  % iden=ty the higher the iden=ty the increasing likelihood of homology.

•  Query coverage – if a hit has high query coverage and similarity in increases the chances of homology.

Summary for nucleo=de sequence

Length Database Purpose BLAST Program

20 bp or longer Nucleo=de Iden=fy the query sequence

blastn megablast

Find similar nucleo=de sequence.

blastn

Find similar proteins to translated query in a translated nucleo=de database

tblastx

Protein Find proteins coded in my query DNA sequence

blastx

Summary for protein sequences Length Database Purpose BLAST Program

15 residues or Longer

Protein Iden=fy your query sequence or find protein sequences similar to it

blastp

Find members of a protein family or build a custom posi=on specific scoring matrix(PSSMs)

PSI-‐blast

Find proteins similar to the query around a given pafern

PHI-‐blast

Conserved domains

Find conserved domains in your query and iden=fy other proteins with similar domains

CD-‐search

Nucleic Find similar sequences in a translated nucleo=de sequence database.

tblastn

The End Acknowledge E=enne for some of the slides on

biological databases

Thank you

biological(databases(and(blast(...

Documents

overview of current biological databases - cornell...

biomedical literature mining for biological databases...

a study of migrating biological data from relational ... ·...

introduction to biological databases · why biological...

biological databases

biological sequence databases

chemistry and biological sciences databases

an introduction to biological databases

biological databases an introduction

understanding and using biological databases

an introduction to biological databases august 2001

other biological databases

lecture 2 – biological databases filebioinformatics can be...

bodc’s oceanographic databases bodc biological data

designing biological databases

biological databases - bf528

designing databases for biological research

biological databases exercises. discovery of distinct...

exploring biological databases...

declarative querying for biological sequence databases