biological databasescompbio.ucdenver.edu/77112014/dowell database-14.pdf9/10/14 3...

9/10/14

1

Biological Databases

What will we discuss today?

• Types of biological data
• What is a database?
• Standardized data file formats
• GenBank, PubMed and NCBI
• Query strategies
• Other major databases

http://techcrunch.com/2012/11/25/the-big-data-fallacy-data-%E2%89%A0-information-%E2%89%A0-insights/


Biologists Collect Lots of Data

• Hundreds of thousands of species
• Millions of articles in scientific journals
• Genetic information:
  – gene names (thousands)
  – phenotype of mutants (infinite?)
  – location of genes/mutations on chromosomes
  – linkage (distances between genes)
• High-throughput technology
  – Rapid, inexpensive DNA sequencing
  – Many methods of collecting genotype data
    • Assays for specific polymorphisms
    • Genome-wide SNP chips

•  Must have data quality assessment prior to analysis

One sequencer => 1-2Tb/week !!


Curated Biological Data: DNA, nucleotide sequences

• Gene boundaries, topology
• Gene structure: introns, exons, ORFs, splicing
• Expression data
• Mass spectrometry (metabolomics, proteomics)
• Post-translational protein modification (PTM)

Curated Biological Data: Proteins, residue sequences

• MCTUYTCUYFSTYRCCTYFSCD
• Extended sequence information
• Secondary structure
• Hydrophobicity, motif data
• Protein-protein interaction


Curated Biological Data: 3D structures, folds

What is a database?

• A collection of data that needs to be:
  – Structured
  – Searchable
  – Updated (periodically)
  – Cross-referenced
• Challenge:
  – To change “meaningless” data into useful information that can be accessed and analysed in the best way possible.

For example: HOW would YOU organize all biological sequences so that the biological information is optimally accessible?

http://en.wikibooks.org/wiki/Data_Management_in_Bioinformatics


A Spreadsheet can be a Database

• Columns are Fields
• Rows are Records
• Can search for a term within just one field
• Or combine searches across several fields

SNP ID | SNPSeq ID | Gene | +primer | -primer | Hap A | Hap B | Hap C
D1Mit160_1 | 10.MMHAP67FLD1.seq | lymphocyte antigen 84 | AAGGTAAAAGGCAATCAGCACAGCC | TCAACCTGGAGTCAGAGGCT | C | — | A
M-05554_1 | 12.MMHAP31FLD3.seq | procollagen, type III, alpha | TGCGCAGAAGCTGAAGTCTA | TTTTGAGGTGTTAATGGTTCT | C | — | A
M-05554_2 | X60184 | complement component factor i | ACTTCCAGCCCTGGCTCT | ATATGCCACCAAGAAGCA | A | C | —
M-09947_3 | AF067835 | caspase 8 | TCACAGAGGGAAACATGAAG | CTCCACATTGAACCAAAGCA | G | C | T
M-11415_1 | U02023 | insulin-like growth factor binding protein | GGGAAAAGCCTGAAAGAAGC | AGCTGAAACCGGACATCAAT | T | G | —
D1Mit284_3 | J05234 | nucleolin | TGTTGGAACCGACTTCTTCA | AAGAGTCAAAGAATTTATGGAATGA | G | T | T
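To make the "fields and records" idea concrete, here is a minimal Python sketch, assuming three rows of the SNP table are held as dictionaries. The key names (`snp_id`, `hap_a`, ...) and the `search` helper are illustrative choices, not part of any spreadsheet or database product, and missing haplotype calls are stored as "-".

```python
# Three rows (records) of the SNP table; each key is a column (field).
records = [
    {"snp_id": "M-09947_3", "snpseq_id": "AF067835", "gene": "caspase 8",
     "hap_a": "G", "hap_b": "C", "hap_c": "T"},
    {"snp_id": "D1Mit284_3", "snpseq_id": "J05234", "gene": "nucleolin",
     "hap_a": "G", "hap_b": "T", "hap_c": "T"},
    {"snp_id": "M-05554_2", "snpseq_id": "X60184",
     "gene": "complement component factor i",
     "hap_a": "A", "hap_b": "C", "hap_c": "-"},
]


def search(rows, **criteria):
    """Return rows in which every named field contains its search term.

    Matching is a case-insensitive substring test, applied per field.
    """
    return [
        row for row in rows
        if all(term.lower() in row[field].lower()
               for field, term in criteria.items())
    ]


# Search within just one field.
one_field = search(records, gene="caspase")

# Combine searches across several fields (implicit AND).
combined = search(records, hap_a="G", hap_c="T")

print([row["snp_id"] for row in one_field])
print([row["snp_id"] for row in combined])
```

Restricting each term to a named field is what separates a database-style search from a plain text search over the whole sheet.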

DBMS

• Internal organization
  – Controls speed and flexibility
• A unity of programs that
  – Store
  – Extract
  – Modify

[Diagram: USER(S) store, extract, and modify data in the Database]


DBMS organisation types

• Flat file databases (flat DBMS)
  – Simple, restrictive, table
• Hierarchical databases (hierarchical DBMS)
  – Simple, restrictive, tables
• Relational databases (RDBMS)
  – Complex, versatile, tables
• Object-oriented databases (ODBMS)
  – Complex, versatile, objects

[Diagram labels: Information system; Query system; Storage System; Data]

Structured Data

• Repository of information
• Managed and accessed differently
• Flat-file (text)
• Relational (key)
• “talk” to each other


Relational databases

• Data is stored in multiple related tables
• Data relationships across tables can be either many-to-one or many-to-many
• A few rules allow the database to be viewed in many ways

Relational Databases

• What have we achieved?
  – No repeating information
  – Less storage space
  – Better representation of reality
  – Easy modification/management
  – Easy usage of any combination of records
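A small sketch of the many-to-one case, using Python's built-in `sqlite3` module. The two-table schema (`gene`, `snp`) is an assumption made for illustration, and the SNP `M-09947_4` is a made-up second marker so that two SNPs share one gene; neither comes from a real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (
    gene_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
-- Many SNPs can point at one gene (many-to-one) through gene_id.
CREATE TABLE snp (
    snp_id  TEXT PRIMARY KEY,
    gene_id INTEGER NOT NULL REFERENCES gene(gene_id)
);
INSERT INTO gene VALUES (1, 'caspase 8'), (2, 'nucleolin');
INSERT INTO snp VALUES ('M-09947_3', 1), ('M-09947_4', 1), ('D1Mit284_3', 2);
""")


def snps_with_gene_names(connection):
    """Join the two tables so each SNP is listed with its gene name."""
    return connection.execute(
        "SELECT snp.snp_id, gene.name "
        "FROM snp JOIN gene ON gene.gene_id = snp.gene_id "
        "ORDER BY snp.snp_id"
    ).fetchall()


rows = snps_with_gene_names(conn)

# The gene name is stored once, so one UPDATE changes it for every linked SNP.
conn.execute("UPDATE gene SET name = 'CASP8' WHERE gene_id = 1")
renamed = [row for row in snps_with_gene_names(conn) if row[1] == "CASP8"]

print(rows)
print(renamed)
```

Because the name lives in one row of `gene`, there is no repeated information to keep in sync, which is the "easy modification" point in the list above.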


Three reasons to care …

• Database proliferation
  – Dozens to hundreds at the moment
• More and more scientific discoveries result from inter-database analysis and mining
• Rising complexity of required data combinations
  – E.g. translational medicine: “from bench to bedside” (genomic data vs. clinical data)

Some Biological databases …

AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, Genbank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-lycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, etc.


Some statistics

• More than 1000 different databases
• Generally accessible through the web (useful link: www.expasy.ch/alinks.html)
• Variable size: <100 Kb to >10 Gb
  – DNA: >10 Gb
  – Protein: 1 Gb
  – 3D structure: 5 Gb
  – Other: smaller
• Update frequency: daily to annually

NAR Database Issue

• Online collection of biological databases: http://www.oxfordjournals.org/nar/database/c/


Standard Data Formats

• DNA sequence = ACGT, but what about gaps, unknown letters, etc.?
  – How many letters per line?
  – Spaces, numbers, headers, etc.?
  – Store as a string, code as binary numbers, etc.?
• Use a completely different format for proteins?

Need standard formats!!

FASTA Format

• William Pearson (1985)
• The FASTA format is now universal for all databases and software that handle DNA and protein sequences

>URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 ..
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACC
CACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTG
TGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA
GGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT
CTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA
TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG

• One header line: starts with > and ends with a [return].
• All other characters are part of the sequence.
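The two FASTA rules (a header line starting with >, everything else is sequence) are enough to write a reader. The following is a minimal sketch in plain Python, not a replacement for an established parser; `parse_fasta` and the toy records are names and data made up for this example.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text.

    A line starting with '>' opens a new record; every other non-blank
    line is appended to the current record's sequence.
    """
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy input: two short records, the first wrapped over two lines.
example = """>seq1 first toy record
ACGT
ACGT
>seq2 second toy record
MKLV
"""

fasta_records = list(parse_fasta(example))
for header, sequence in fasta_records:
    print(header, len(sequence))
```

The same function reads a multi-sequence file, because each new > line simply closes the previous record.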


Multi-Sequence FASTA file

>FBpp0074027 type=protein; loc=X:complement(16159413..16159860,16160061..16160497); ID=FBpp0074027; name=CG12507-PA; parent=FBgn0030729,FBtr0074248; dbxref=FlyBase:FBpp0074027,FlyBase_Annotation_IDs:CG12507-PA,GB_protein:AAF48569.1,GB_protein:AAF48569; MD5=123b97d79d04a06c66e12fa665e6d801; release=r5.1; species=Dmel; length=294;
MRCLMPLLLANCIAANPSFEDPDRSLDMEAKDSSVVDTMGMGMGVLDPTQ
PKQMNYQKPPLGYKDYDYYLGSRRMADPYGADNDLSASSAIKIHGEGNLA
SLNRPVSGVAHKPLPWYGDYSGKLLASAPPMYPSRSYDPYIRRYDRYDEQ
YHRNYPQYFEDMYMHRQRFDPYDSYSPRIPQYPEPYVMYPDRYPDAPPLR
DYPKLRRGYIGEPMAPIDSYSSSKYVSSKQSDLSFPVRNERIVYYAHLPE
IVRTPYDSGSPEDRNSAPYKLNKKKIKNIQRPLANNSTTYKMTL
>FBpp0082232 type=protein; loc=3R:complement(9207109..9207225,9207285..9207431); ID=FBpp0082232; name=mRpS21-PA; parent=FBgn0044511,FBtr0082764; dbxref=FlyBase:FBpp0082232,FlyBase_Annotation_IDs:CG32854-PA,GB_protein:AAN13563.1,GB_protein:AAN13563; MD5=dcf91821f75ffab320491d124a0d816c; release=r5.1; species=Dmel; length=87;
MRHVQFLARTVLVQNNNVEEACRLLNRVLGKEELLDQFRRTRFYEKPYQV
RRRINFEKCKAIYNEDMNRKIQFVLRKNRAEPFPGCS
>FBpp0091159 type=protein; loc=2R:complement(2511337..2511531,2511594..2511767,2511824..2511979,2512032..2512082); ID=FBpp0091159; name=CG33919-PA; parent=FBgn0053919,FBtr0091923; dbxref=FlyBase:FBpp0091159,FlyBase_Annotation_IDs:CG33919-PA,GB_protein:AAZ52801.1,GB_protein:AAZ52801; MD5=c91d880b654cd612d7292676f95038c5; release=r5.1; species=Dmel; length=191;
MKLVLVVLLGCCFIGQLTNTQLVYKLKKIECLVNRTRVSNVSCHVKAINW
NLAVVNMDCFMIVPLHNPIIRMQVFTKDYSNQYKPFLVDVKIRICEVIER
RNFIPYGVIMWKLFKRYTNVNHSCPFSGHLIARDGFLDTSLLPPFPQGFY
QVSLVVTDTNSTSTDYVGTMKFFLQAMEHIKSKKTHNLVHN
>FBpp0070770 type=protein; loc=X:join(5584802..5585021,5585925..5586137,5586198..5586342,5586410..5586605); ID=FBpp0070770; name=cv-PA; parent=FBgn0000394,FBtr0070804; dbxref=FlyBase:FBpp0070770,FlyBase_Annotation_IDs:CG12410-PA,GB_protein:AAF46063.1,GB_protein:AAF46063; MD5=0626ee34a518f248bbdda11a211f9b14; release=r5.1; species=Dmel; length=257;
MEIWRSLTVGTIVLLAIVCFYGTVESCNEVVCASIVSKCMLTQSCKCELK
NCSCCKECLKCLGKNYEECCSCVELCPKPNDTRNSLSKKSHVEDFDGVPE
LFNAVATPDEGDSFGYNWNVFTFQVDFDKYLKGPKLEKDGHYFLRTNDKN
LDEAIQERDNIVTVNCTVIYLDQCVSWNKCRTSCQTTGASSTRWFHDGCC
ECVGSTCINYGVNESRCRKCPESKGELGDELDDPMEEEMQDFGESMGPFD
GPVNNNY
…

Reformatting Data Files

• Much of the routine (yet annoying) work of bioinformatics involves messing around with data files to get them into formats that will work with various software

• Then messing around with the results produced by that software to create a useful summary…
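A typical small reformatting chore is cleaning a pasted sequence and re-wrapping it to a fixed line length. The sketch below assumes a 60-character default width and uppercases the residues; both are illustrative choices rather than requirements of the FASTA format, and `format_fasta` is a name invented here.

```python
def format_fasta(header, sequence, width=60):
    """Return one FASTA record with the sequence wrapped at `width` characters.

    Whitespace inside `sequence` (spaces, line breaks) is removed first, so
    text pasted from a differently wrapped file comes out consistently.
    """
    cleaned = "".join(sequence.split()).upper()
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


# Toy input with irregular spacing and line breaks, wrapped at 4 per line.
wrapped = format_fasta("toy_record", "acgt acgt\nac", width=4)
print(wrapped, end="")
```

Writing the cleaned record back out through one function keeps every file handed to downstream software in the same shape.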


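Public Sequence Databases

[Diagram: three databases exchange submissions and updates with one another: GenBank (NCBI, NIH; searched with Entrez), EMBL (EBI; searched with SRS), and DDBJ (CIB, NIG; searched with getentry)]

Same sequence information in all three, but different tools for searching and retrieval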

GenBank

• Contains all DNA and protein sequences described in the scientific literature or collected in publicly funded research
• Flat file: composed entirely of text
• Each submitted sequence is a record
• Has fields for Organism, Date, Author, etc.
• Unique identifier for each sequence
  – Locus and Accession #


Growth of Genbank


GenBank Flat File (GBFF)

LOCUS       MUSNGH       1803 bp    mRNA             ROD       29-AUG-1997
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            cell TA20 mRNA, complete cds.
ACCESSION   D25291
NID         g1850791
KEYWORDS    neurite extension activity; growth arrest; TA20.
SOURCE      Murinae gen. sp. mouse neuroblastma-rat glioma hybridoma
            cell_line:NG108-15 cDNA to mRNA.
  ORGANISM  Murinae gen. sp.
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae;
            Murinae.
REFERENCE   1  (sites)
  AUTHORS   Tohda,C., Nagai,S., Tohda,M. and Nomura,Y.
  TITLE     A novel factor, TA20, involved in neuronal differentiation: cDNA
            cloning and expression
  JOURNAL   Neurosci. Res. 23 (1), 21-27 (1995)
  MEDLINE   96064354
REFERENCE   3  (bases 1 to 1803)
  AUTHORS   Tohda,C.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-1993) to the DDBJ/EMBL/GenBank databases.
            Chihiro Tohda, Toyama Medical and Pharmaceutical University,
            Research Institute for Wakan-yaku, Analytical Research Center
            for Ethnomedicines; 2630 Sugitani, Toyama, Toyama 930-01, Japan
            (E-mail:[email protected], Tel:+81-764-34-2281(ex.2841),
            Fax:+81-764-34-5057)
COMMENT     On Feb 26, 1997 this sequence version replaced gi:793764.
FEATURES             Location/Qualifiers
     source          1..1803
                     /organism="Murinae gen. sp."
                     /note="source origin of sequence, either mouse or rat, has
                     not been identified"
                     /db_xref="taxon:39108"
                     /cell_line="NG108-15"
                     /cell_type="mouse neuroblastma-rat glioma hybridoma"
     misc_signal     156..163
                     /note="AP-2 binding site"
     GC_signal       647..655
                     /note="Sp1 binding site"
     TATA_signal     694..701
     gene            748..1311
                     /gene="TA20"
     CDS             748..1311
                     /gene="TA20"
                     /function="neurite extensiion activity and growth arrest
                     effect"
                     /codon_start=1
                     /db_xref="PID:d1005516"
                     /db_xref="PID:g793765"
                     /translation="MMKLWVPSRSLPNSPNHYRSFLSHTLHIRYNNSLFISNTHLSRR
                     KLRVTNPIYTRKRSLNIFYLLIPSCRTRLILWIIYIYRNLKHWSTSTVRSHSHSIYRL
                     RPSMRTNIILRCHSYYKPPISHPIYWNNPSRMNLRGLLSRQSHLDPILRFPLHLTIYY
                     RGPSNRSPPLPPRNRIKQPNRIKLRCR"
     polyA_site      1803
BASE COUNT      507 a    458 c    311 g    527 t
ORIGIN
        1 tcagtttttt tttttttttt tttttttttt tttttttttt tttttttttg ttgattcatg
       61 tccgtttaca tttggtaagt tcacaggcct cagtcaacac aattggactg ctcaggaaat
      121 cctccttggt gaccgcagta tacttggcct atgaacccaa gccacctatg gctaggtagg
      181 agaagctcaa ctgtagggct gactttggaa gagaatgcac atggctgtat cgacatttca
      241 catggtggac ctctggccag agtcagcagg ccgagggttc tcttccgggc tgctccctca
      301 ctgcttgact ctgcgtcagt gcgtccatac tgtgggcgga cgttattgct atttgccttc
      361 cattctgtac ggcattgcct ccatttagct ggagagggac agagcctggt tctctagggc
      421 gtttccattg gggcctggtg acaatccaaa agatgagggc tccaaacacc agaatcagaa
      481 ggcccagcgt atttgtaaaa acaccttctg gtgggaatga atggtacagg ggcgtttcag
      541 gacaaagaac agcttttctg tcactcccat gagaaccgtc gcaatcactg ttccgaagag
      601 gaggagtcca gaatacacgt gtatgggcat gacgattgcc cggagagagg cggagcccat
      661 ggaagcagaa agacgaaaaa cacacccatt atttaaaatt attaaccact cattcattga
      721 cctacctgcc ccatccaaca tttcatcatg atgaaacttt gggtcccttc taggagtctg
      781 cctaatagtc caaatcatta caggtctttt cttagccata cactacacat cagatacaat
      841 aacagccttt tcatcagtaa cacacatttg tcgagacgta aattacgggt gactaatccg
      901 atatatacac gcaaacggag cctcaatatt ttttatttgc ttattccttc atgtcggacg
      961 aggcttatat tatggatcat atacatttat agaaacctga aacattggag tacttctact
     1021 gttcgcagtc atagccacag catttatagg ctacgtcctt ccatgaggac aaatatcatt
     1081 ctgaggtgcc acagttatta caaacctcct atcagccatc ccatatattg gaacaaccct
     1141 agtcgaatga atttgagggg gcttctcagt agacaaagcc accttgaccc gattcttcgc
     1201 tttccacttc atcttaccat ttattatcgc ggccctagca atcgttcacc tcctcttcct
     1261 ccacgaaaca ggatcaaaca acccaacagg attaaactca gatgcagata aaattccatt
     1321 tcacccctac tatacatcaa agatatccta ggtatcctaa tcatattctt aattctcata
     1381 accctagtat tatttttccc agacatacta ggagacccag acaactacat accagctaat
     1441 ccactaaaca ccccacccca tattaaaccc gaatgatatt tcctatttgc atacgccatt
     1501 ctacgctcaa tccccaataa actaggaggt gtcctagcct taatcttatc tatcctaatt
     1561 ttagccctaa tacctttcct tcatacctca aagcaacgaa gcctaatatt ccgcccaatc
     1621 acacaaattt tgtactgaat cctagtagcc aacctactta tcttaacctg aattgggggc
     1681 caaccagtag acacccattt attatcattg gccaactagc ctccatctca tacttctcaa
     1741 tcatcttaat tcttatacca atctcaggaa ttatcgaaga caaaatacta aaattatatc
     1801 cat
//

[Callouts on the record: Header fields (Title, Taxonomy, Citation); Features (AA seq); DNA Sequence]
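Because a GenBank flat file is plain text with a keyword at the start of each field, a few lines of Python can pull out the parts needed for a FASTA file. This is a rough sketch that only looks at the ACCESSION and ORIGIN fields and assumes one record per string; the function name is invented for this example, and the input is a deliberately truncated excerpt (first 20 bases only) of record D25291.

```python
def accession_and_sequence(record_text):
    """Extract the accession number and the DNA sequence from one GBFF record."""
    accession = None
    bases = []
    in_origin = False
    for line in record_text.splitlines():
        if line.startswith("//"):        # end-of-record marker
            break
        if in_origin:
            # Sequence lines look like "       61 tccgtttaca tttggtaagt ..."
            # so drop the leading position number and keep the base groups.
            bases.extend(part for part in line.split() if not part.isdigit())
        elif line.startswith("ACCESSION"):
            accession = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
    return accession, "".join(bases)


# Truncated excerpt for illustration only.
excerpt = """LOCUS       MUSNGH       1803 bp    mRNA             ROD       29-AUG-1997
ACCESSION   D25291
ORIGIN
        1 tcagtttttt tttttttttt
//
"""

accession, sequence = accession_and_sequence(excerpt)
print(">" + accession)
print(sequence)
```

For real work an established parser is the safer choice, since full records contain many more fields and edge cases than this sketch handles.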


Accession Numbers!!

• Databases are designed to be searched by accession numbers (and locus IDs)
• These are guaranteed to be non-redundant, accurate, and not to change.
• Searching by gene names and keywords is doomed to frustration and probable failure

Neither scientists nor computers can be trusted to accurately and consistently annotate database entries!!

http://www.ncbi.nlm.nih.gov/Genbank

• Once upon a time, GenBank mailed out sequences on CD-ROM disks a few times per year.

• GenBank at least doubles in size every 18 months.

• As of August 2011, there are approximately 130,671,233,801 bases from 142,284,608 reported sequences in the traditional GenBank divisions.
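To get a feel for what "doubles every 18 months" implies, here is a back-of-the-envelope sketch assuming a constant doubling time, so that size(t) = size(0) × 2^(t / 18) with t in months. This is an extrapolation for illustration only, not a forecast; the function name and the 36-month horizon are arbitrary choices.

```python
def projected_bases(current_bases, months_ahead, doubling_months=18):
    """Project database size under a constant doubling time."""
    return current_bases * 2 ** (months_ahead / doubling_months)


bases_aug_2011 = 130_671_233_801  # figure quoted on the slide

# Three years is two doubling periods, i.e. a factor of four.
after_three_years = projected_bases(bases_aug_2011, months_ahead=36)
print(f"{after_three_years:,.0f}")
```

Since the slide says "at least", the result is best read as a lower bound under that assumption.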


Distribution of sequence databases

• Books, articles: 1968 -> 1985
• Computer tapes: 1982 -> 1992
• Floppy disks: 1984 -> 1990
• CD-ROM: 1989 -> ?
• FTP: 1989 -> ?
• On-line services: 1982 -> 1994
• WWW: 1993 -> ?
• DVD: 2001 -> ?
• Mailing hard drives: 2009 -> ?

• Many sequences in GenBank correspond to the same gene
  – genomic clones, full-length mRNA, various kinds of ESTs, submitted by different investigators
• RefSeq is the “Reference Sequence” for a gene, as determined by GenBank curators
  – best guess given the current evidence; can change
  – usually based on the longest mRNA
  – usually has both 5’ and 3’ UTR
• Not necessarily reliable
  – A lot is not yet known, e.g. alternative splicing


Last thoughts on GenBank ...

• Often only the FASTA files are used (e.g. for BLAST)
• GBFF are simply human-readable versions of these records
• GBFF have become a vehicle for a lot more information than they were meant to carry
• Keep in mind that GenBank is DNA-centric and is a poor vehicle for protein and mRNA expression/interaction information

Many Datasets at NCBI

• The NCBI hosts a huge interconnected database system that, in addition to DNA and protein, includes:
  – Journal Articles (PubMed)
  – Genetic Diseases (OMIM)
  – Polymorphisms (dbSNP)
  – Cytogenetics (CGH/SKY/FISH & CGAP)
  – Gene Expression (GEO)
  – Taxonomy
  – Chemistry (PubChem)


Accessing database information

• A request for data from a database is called a query
• Queries can take three forms:
  – Choose from a list of parameters
  – Query by example (QBE)
  – Query language

Web Query

• Most databases have a web-based query tool
• It may be simple…


… or complex

Query Languages

• The standard
  – SQL (Structured Query Language), originally called SEQUEL (Structured English QUEry Language)
  – Developed by IBM in 1974; introduced commercially in 1979 by Oracle Corp.
  – Standard interactive and programming language for getting information from and updating a database.
  – RDBMS (SQL), ODBMS (Java, C++, OQL, etc.)
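As an illustration of "getting information from and updating a database", here is a short SQL session run through Python's built-in `sqlite3` module. The `sequence` table and its columns are invented for this sketch; D25291 and its length are taken from the GenBank example record, while the `TOY...` rows are made-up placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequence ("
    "accession TEXT PRIMARY KEY, organism TEXT, length_bp INTEGER)"
)
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?)",
    [
        ("D25291", "Murinae gen. sp.", 1803),
        ("TOY0001", "Homo sapiens", 626),    # placeholder row
        ("TOY0002", "Homo sapiens", 2018),   # placeholder row
    ],
)

# Getting information: a SELECT with conditions on two fields.
long_human = conn.execute(
    "SELECT accession FROM sequence "
    "WHERE organism = ? AND length_bp > ? ORDER BY accession",
    ("Homo sapiens", 1000),
).fetchall()

# Updating: change one field of one record, found by its accession.
conn.execute(
    "UPDATE sequence SET length_bp = ? WHERE accession = ?", (700, "TOY0001")
)
updated_length = conn.execute(
    "SELECT length_bp FROM sequence WHERE accession = ?", ("TOY0001",)
).fetchone()[0]

print(long_human, updated_length)
```

The `?` placeholders let the database driver handle quoting, which is safer than pasting search terms into the SQL string.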


ENTREZ is the GenBank web query tool

Advanced query interface:


Database Searching

A database can only be searched in ways that it was designed to be searched

Boolean: "AND" and "OR" searches

Bad: searching for "human hemoglobin" in a 'Description' field

Much better: searching for "homo sapiens" in 'Organism' AND "HBB" in 'gene name'
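The fielded search can be written as an Entrez query string, where a term is restricted to a field by a tag in square brackets. The sketch below only builds the query and a request URL for NCBI's E-utilities `esearch` endpoint and sends nothing over the network; the helper name `build_query` and the choice of the `nucleotide` database are assumptions for illustration, so check the current Entrez help for the field names each database supports.

```python
from urllib.parse import urlencode


def build_query(terms):
    """Join field-restricted terms with AND, e.g. 'HBB[Gene Name]'."""
    return " AND ".join(f"{value}[{field}]" for field, value in terms.items())


query = build_query({"Organism": "homo sapiens", "Gene Name": "HBB"})

# URL for an E-utilities search request (constructed only, not fetched).
url = (
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
    + urlencode({"db": "nucleotide", "term": query})
)

print(query)
print(url)
```

Changing AND to OR in the join broadens the search instead of narrowing it.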

Strategies

• Use accession numbers whenever possible
• Start with broad keywords and narrow the search using more specific terms
• Try variants of spelling, numbers, etc.
• Search all relevant databases
• Be persistent!!


ENTREZ has pre-computed links between Tables

• Relationships between sequences are computed with BLAST
• Relationships between articles are computed with "MeSH" terms (shared keywords)
• Relationships between DNA and protein sequences rely on accession numbers
• Relationships between sequences and PubMed articles rely on both shared keywords and the mention of accession numbers in the articles.

UCSC Genome Browser

Search by gene name:

or by sequence:


Lots of additional data can be added as optional "tracks"

- anything that can be mapped to locations on the genome


Ensembl at EBI/EMBL

http://genome.cshlp.org/content/14/5/971.full


KEGG: Kyoto Encyclopedia of Genes and Genomes

• Enzymatic and regulatory pathways
• Mapped out by EC number and cross-referenced to genes in all known organisms (wherever sequence information exists)
• Parallel maps of regulatory pathways


http://www.wwpdb.org


Gene Ontology

• Biology is a messy science
• Assortment of names, mutants, odd phenotypes
  – “sonic hedgehog”
• Gene Ontology
  – Molecular function (specific tasks)
  – Biological process (broad biological goal)
  – Cellular component (location)


Golden Rules

• Use published databases and methods
  – Supported, maintained, trusted by community
• Document what you have done !!!
  – Sequence identification numbers
  – Server, database, program VERSION
  – Program parameters
• Assess reliability of results
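One lightweight way to "document what you have done" is to write a small machine-readable record next to every result file. The sketch below is only one possible layout: the function name and field names are invented here, and the server, version, and parameter values are explicit placeholders to be replaced with what was actually used.

```python
import json


def provenance_record(accessions, server, database, program, version, parameters):
    """Bundle the items the slide lists into one JSON-serialisable dict."""
    return {
        "accessions": list(accessions),
        "server": server,
        "database": database,
        "program": program,
        "version": version,
        "parameters": dict(parameters),
    }


record = provenance_record(
    accessions=["D25291"],
    server="example.org (placeholder)",
    database="GenBank",
    program="blastn",
    version="placeholder-version",
    parameters={"evalue": "placeholder"},
)

# Sorted keys and indentation keep the log readable and easy to diff.
log_entry = json.dumps(record, indent=2, sort_keys=True)
print(log_entry)
```

Adding the run date and the database release to the same record would make a later re-analysis easier to compare.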

Bio-databases: A short word on problems

• Even today we face some key limitations
  – There is no standard format
    • Every database or program has its own format
  – There is no standard nomenclature
    • Every database has its own names
  – Data is not fully optimized
    • Some datasets have missing information without indications of it
  – Data errors
    • Data is sometimes of poor quality, erroneous, misspelled
    • Error propagation resulting from computer annotation


What to take home

• Databases are collections of data
  – Need to access and maintain easily and flexibly
• Biological information is vast and sometimes very redundant
• Computers can only create data; they do not give answers
• Learn to use the big reliable databases (e.g. NCBI)
• Open access to sequences is essential for all of the work we do: without it there would be no bioinformatics, no BLAST, no Computational Bioscience Program
• Open access to sequence information is not all that needs to be open. We also need open access to the literature.


RECOMMENDED EXERCISES

http://mibiol.biol.lu.se.webbhotell.ldc.lu.se/Bioinformatics/Exercises/databases.html

http://wiki.bio.dtu.dk/teaching/index.php/Exercise:_Searching_the_GenBank_database

http://biocourse.sanbi.ac.za/wp-content/uploads/2013/02/Biological-Databases-Exercises.pdf
