Biological Databases (compbio.ucdenver.edu/77112014/Dowell database-14.pdf, 9/10/14)


9/10/14


Biological Databases

What will we discuss today?

• Types of biological data
• What is a database?
• Standardized data file formats
• GenBank, PubMed and NCBI
• Query strategies
• Other major databases

http://techcrunch.com/2012/11/25/the-big-data-fallacy-data-%E2%89%A0-information-%E2%89%A0-insights/


Biologists Collect Lots of Data

• Hundreds of thousands of species
• Millions of articles in scientific journals
• Genetic information:
  – gene names (thousands)
  – phenotype of mutants (infinite?)
  – location of genes/mutations on chromosomes
  – linkage (distances between genes)
• High-throughput technology
  – Rapid, inexpensive DNA sequencing
  – Many methods of collecting genotype data
    • Assays for specific polymorphisms
    • Genome-wide SNP chips
• Must have data quality assessment prior to analysis

One sequencer => 1-2Tb/week !!


Curated Biological Data

DNA, nucleotide sequences
• Gene boundaries, topology
• Gene structure: introns, exons, ORFs, splicing
• Expression data
• Mass spectrometry (metabolomics, proteomics)
• Post-translational protein modification (PTM)

Proteins, residue sequences (e.g. MCTUYTCUYFSTYRCCTYFSCD)
• Extended sequence information
• Secondary structure
• Hydrophobicity, motif data
• Protein-protein interaction


Curated Biological Data

3D structures, folds

WHAT is a database?

• A collection of data that needs to be:
  – Structured
  – Searchable
  – Updated (periodically)
  – Cross-referenced
• Challenge:
  – To change “meaningless” data into useful information that can be accessed and analysed in the best way possible.

For example: HOW would YOU organize all biological sequences so that the biological information is optimally accessible?

http://en.wikibooks.org/wiki/Data_Management_in_Bioinformatics


A Spreadsheet can be a Database

• Columns are Fields
• Rows are Records
• Can search for a term within just one field
• Or combine searches across several fields

SNP ID | SNPSeq ID | Gene | +primer | -primer | Hap A | Hap B | Hap C
D1Mit160_1 | 10.MMHAP67FLD1.seq | lymphocyte antigen 84 | AAGGTAAAAGGCAATCAGCACAGCC | TCAACCTGGAGTCAGAGGCT | C | — | A
M-05554_1 | 12.MMHAP31FLD3.seq | procollagen, type III, alpha | TGCGCAGAAGCTGAAGTCTA | TTTTGAGGTGTTAATGGTTCT | C | — | A
M-05554_2 | X60184 | complement component factor i | ACTTCCAGCCCTGGCTCT | ATATGCCACCAAGAAGCA | A | C | —
M-09947_3 | AF067835 | caspase 8 | TCACAGAGGGAAACATGAAG | CTCCACATTGAACCAAAGCA | G | C | T
M-11415_1 | U02023 | insulin-like growth factor binding protein | GGGAAAAGCCTGAAAGAAGC | AGCTGAAACCGGACATCAAT | T | G | —
D1Mit284_3 | J05234 | nucleolin | TGTTGGAACCGACTTCTTCA | AAGAGTCAAAGAATTTATGGAATGA | G | T | T
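The "fields and records" idea can be sketched in a few lines of Python. This is only an illustration, not how any particular spreadsheet program works: each record is a dict, each key is a field, and the hypothetical helper `search` matches a term within one field or combines terms across several fields. The rows are a subset of the SNP table on the slide, with the primer columns left out for brevity.

```python
# Each record (row) is a dict; each key is a field (column).
# Field names (snp_id, seq_id, gene, hap_a) are illustrative choices.
records = [
    {"snp_id": "D1Mit160_1", "seq_id": "10.MMHAP67FLD1.seq",
     "gene": "lymphocyte antigen 84", "hap_a": "C"},
    {"snp_id": "M-05554_2", "seq_id": "X60184",
     "gene": "complement component factor i", "hap_a": "A"},
    {"snp_id": "M-09947_3", "seq_id": "AF067835",
     "gene": "caspase 8", "hap_a": "G"},
    {"snp_id": "D1Mit284_3", "seq_id": "J05234",
     "gene": "nucleolin", "hap_a": "G"},
]


def search(rows, **criteria):
    """Return rows in which every named field contains its search term."""
    def matches(row):
        return all(term.lower() in str(row[field]).lower()
                   for field, term in criteria.items())
    return [row for row in rows if matches(row)]


# Search within just one field.
one_field = search(records, gene="caspase")

# Combine searches across several fields (all terms must match).
combined = search(records, hap_a="G", gene="nucleolin")

print([r["snp_id"] for r in one_field])
print([r["snp_id"] for r in combined])
```

A linear scan like this is fine for a few hundred rows; the DBMS described next exists largely because it stops being fine for millions.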

DBMS

• Internal organization
  – Controls speed and flexibility
• A unified set of programs that
  – Store
  – Extract
  – Modify

[Diagram: USER(S) <-> Store / Extract / Modify <-> Database]


DBMS organisation types

• Flat file databases (flat DBMS)
  – Simple, restrictive, table
• Hierarchical databases (hierarchical DBMS)
  – Simple, restrictive, tables
• Relational databases (RDBMS)
  – Complex, versatile, tables
• Object-oriented databases (ODBMS)
  – Complex, versatile, objects

[Diagram labels: Information system; Query system; Storage system; Data]

Structured Data

• Repository of information
• Managed and accessed differently
• Flat-file (text)
• Relational (key)
• “talk” to each other


Relational databases

• Data is stored in multiple related tables
• Data relationships across tables can be either many-to-one or many-to-many
• A few rules allow the database to be viewed in many ways

Relational Databases

• What have we achieved?
  – No repeating information
  – Less storage space
  – Better reality representation
  – Easy modification/management
  – Easy usage of any combination of records
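A minimal sketch of the relational idea, using Python's built-in `sqlite3` module with an in-memory database. The table and column names (`organism`, `gene`, `organism_id`) and the rows are assumptions chosen for illustration. Many genes point at one organism (a many-to-one relationship), so the organism name is stored once instead of being repeated in every gene row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two related tables: each gene row refers to one organism row by key.
conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE gene (
    gene_id     INTEGER PRIMARY KEY,
    symbol      TEXT NOT NULL,
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id)
);
""")

conn.executemany("INSERT INTO organism VALUES (?, ?)",
                 [(1, "Homo sapiens"), (2, "Mus musculus")])
conn.executemany("INSERT INTO gene VALUES (?, ?, ?)",
                 [(1, "HBB", 1), (2, "CASP8", 1), (3, "Ncl", 2)])

# A join recombines the tables into whatever view is needed.
rows = conn.execute("""
    SELECT gene.symbol, organism.name
    FROM gene
    JOIN organism ON gene.organism_id = organism.organism_id
    WHERE organism.name = ?
    ORDER BY gene.symbol
""", ("Homo sapiens",)).fetchall()

print(rows)
```

Renaming an organism now means changing one row in `organism`, which is the "easy modification" point from the list above.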


Three reasons to care …

• Database proliferation
  – Dozens to hundreds at the moment
• More and more scientific discoveries result from inter-database analysis and mining
• Rising complexity of required data combinations
  – E.g. translational medicine: “from bench to bedside” (genomic data vs. clinical data)

Some biological databases …

AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, Genbank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GLYCBASE, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, etc.


Some statistics

• More than 1000 different databases
• Generally accessible through the web (useful link: www.expasy.ch/alinks.html)
• Variable size: <100Kb to >10Gb
  – DNA: > 10 Gb
  – Protein: 1 Gb
  – 3D structure: 5 Gb
  – Other: smaller
• Update frequency: daily to annually

NAR Database Issue

• Online collection of biological databases: http://www.oxfordjournals.org/nar/database/c/


Standard Data Formats

• DNA sequence = ACGT, but what about gaps, unknown letters, etc.
  – How many letters per line?
  – Spaces, numbers, headers, etc.?
  – Store as a string, code as binary numbers, etc.
• Use a completely different format for proteins?

Need standard formats!!

FASTA Format

• William Pearson (1985)
• The FASTA format is now universal for all databases and software that handle DNA and protein sequences

>URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 ..
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACC
CACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTG
TGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA
GGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT
CTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA
TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG

One header line, which starts with > and ends with a [return]. All other characters are part of the sequence.
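The rule above (one ">" header line, everything else is sequence) is simple enough to sketch directly. The function name `parse_fasta` and the toy records are assumptions for illustration; in real work an established library parser is the safer choice. The sketch also handles several records in one file, since each new ">" line starts a new record.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Every non-header line is part of the current sequence.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 first toy record
ACGTAC
GTAC
>seq2 second toy record
TTGA
"""

parsed = parse_fasta(example)
for name, seq in parsed:
    print(name, len(seq))
```

Joining the lines means the number of letters per line in the file does not matter, which is exactly the ambiguity a standard format is meant to remove.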


Multi-Sequence FASTA file

>FBpp0074027 type=protein; loc=X:complement(16159413..16159860,16160061..16160497); ID=FBpp0074027; name=CG12507-PA; parent=FBgn0030729,FBtr0074248; dbxref=FlyBase:FBpp0074027,FlyBase_Annotation_IDs:CG12507-PA,GB_protein:AAF48569.1,GB_protein:AAF48569; MD5=123b97d79d04a06c66e12fa665e6d801; release=r5.1; species=Dmel; length=294;
MRCLMPLLLANCIAANPSFEDPDRSLDMEAKDSSVVDTMGMGMGVLDPTQ
PKQMNYQKPPLGYKDYDYYLGSRRMADPYGADNDLSASSAIKIHGEGNLA
SLNRPVSGVAHKPLPWYGDYSGKLLASAPPMYPSRSYDPYIRRYDRYDEQ
YHRNYPQYFEDMYMHRQRFDPYDSYSPRIPQYPEPYVMYPDRYPDAPPLR
DYPKLRRGYIGEPMAPIDSYSSSKYVSSKQSDLSFPVRNERIVYYAHLPE
IVRTPYDSGSPEDRNSAPYKLNKKKIKNIQRPLANNSTTYKMTL
>FBpp0082232 type=protein; loc=3R:complement(9207109..9207225,9207285..9207431); ID=FBpp0082232; name=mRpS21-PA; parent=FBgn0044511,FBtr0082764; dbxref=FlyBase:FBpp0082232,FlyBase_Annotation_IDs:CG32854-PA,GB_protein:AAN13563.1,GB_protein:AAN13563; MD5=dcf91821f75ffab320491d124a0d816c; release=r5.1; species=Dmel; length=87;
MRHVQFLARTVLVQNNNVEEACRLLNRVLGKEELLDQFRRTRFYEKPYQV
RRRINFEKCKAIYNEDMNRKIQFVLRKNRAEPFPGCS
>FBpp0091159 type=protein; loc=2R:complement(2511337..2511531,2511594..2511767,2511824..2511979,2512032..2512082); ID=FBpp0091159; name=CG33919-PA; parent=FBgn0053919,FBtr0091923; dbxref=FlyBase:FBpp0091159,FlyBase_Annotation_IDs:CG33919-PA,GB_protein:AAZ52801.1,GB_protein:AAZ52801; MD5=c91d880b654cd612d7292676f95038c5; release=r5.1; species=Dmel; length=191;
MKLVLVVLLGCCFIGQLTNTQLVYKLKKIECLVNRTRVSNVSCHVKAINW
NLAVVNMDCFMIVPLHNPIIRMQVFTKDYSNQYKPFLVDVKIRICEVIER
RNFIPYGVIMWKLFKRYTNVNHSCPFSGHLIARDGFLDTSLLPPFPQGFY
QVSLVVTDTNSTSTDYVGTMKFFLQAMEHIKSKKTHNLVHN
>FBpp0070770 type=protein; loc=X:join(5584802..5585021,5585925..5586137,5586198..5586342,5586410..5586605); ID=FBpp0070770; name=cv-PA; parent=FBgn0000394,FBtr0070804; dbxref=FlyBase:FBpp0070770,FlyBase_Annotation_IDs:CG12410-PA,GB_protein:AAF46063.1,GB_protein:AAF46063; MD5=0626ee34a518f248bbdda11a211f9b14; release=r5.1; species=Dmel; length=257;
MEIWRSLTVGTIVLLAIVCFYGTVESCNEVVCASIVSKCMLTQSCKCELK
NCSCCKECLKCLGKNYEECCSCVELCPKPNDTRNSLSKKSHVEDFDGVPE
LFNAVATPDEGDSFGYNWNVFTFQVDFDKYLKGPKLEKDGHYFLRTNDKN
LDEAIQERDNIVTVNCTVIYLDQCVSWNKCRTSCQTTGASSTRWFHDGCC
ECVGSTCINYGVNESRCRKCPESKGELGDELDDPMEEEMQDFGESMGPFD
GPVNNNY
…

Reformatting Data Files

• Much of the routine (yet annoying) work of bioinformatics involves messing around with data files to get them into formats that will work with various software
• Then messing around with the results produced by that software to create a useful summary…
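A typical small reformatting job, sketched under assumptions: take a FlyBase-style FASTA header of the `key=value;` kind shown in the multi-sequence example and turn it into one tab-separated summary line. The helper name `parse_annotations` is hypothetical, and the header below is an abbreviated version of one record from the slide.

```python
def parse_annotations(header):
    """Split '>ID key=value; key=value; ...' into (identifier, fields dict)."""
    identifier, _, rest = header.lstrip(">").partition(" ")
    fields = {}
    for item in rest.split(";"):
        key, sep, value = item.strip().partition("=")
        if sep:  # skip empty pieces such as the one after the final ';'
            fields[key] = value
    return identifier, fields


# Abbreviated header; the full record on the slide carries more fields.
header = ">FBpp0082232 type=protein; name=mRpS21-PA; species=Dmel; length=87;"

identifier, fields = parse_annotations(header)

# Reformat into a tab-separated line that a spreadsheet or script can read.
summary = "\t".join([identifier, fields["name"], fields["species"], fields["length"]])
print(summary)
```

The sketch assumes that no value contains a semicolon, which holds for the headers shown here but is not guaranteed for other sources.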


Public Sequence Databases

Same sequence information in all three, but different tools for searching and retrieval.

[Diagram: GenBank (NCBI, NIH; searched with Entrez), EMBL (EBI; searched with SRS), DDBJ (CIB, NIG; searched with getentry). Each database receives submissions and updates.]

GenBank

• Contains all DNA and protein sequences described in the scientific literature or collected in publicly funded research
• Flat file: composed entirely of text
• Each submitted sequence is a record
• Has fields for Organism, Date, Author, etc.
• Unique identifier for each sequence
  – Locus and Accession #


Growth  of  Genbank  


GenBank Flat File (GBFF)

LOCUS       MUSNGH       1803 bp    mRNA            ROD       29-AUG-1997
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            cell TA20 mRNA, complete cds.
ACCESSION   D25291
NID         g1850791
KEYWORDS    neurite extension activity; growth arrest; TA20.
SOURCE      Murinae gen. sp. mouse neuroblastma-rat glioma hybridoma
            cell_line:NG108-15 cDNA to mRNA.
  ORGANISM  Murinae gen. sp.
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae;
            Murinae.
REFERENCE   1  (sites)
  AUTHORS   Tohda,C., Nagai,S., Tohda,M. and Nomura,Y.
  TITLE     A novel factor, TA20, involved in neuronal differentiation: cDNA
            cloning and expression
  JOURNAL   Neurosci. Res. 23 (1), 21-27 (1995)
  MEDLINE   96064354
REFERENCE   3  (bases 1 to 1803)
  AUTHORS   Tohda,C.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-1993) to the DDBJ/EMBL/GenBank databases.
            Chihiro Tohda, Toyama Medical and Pharmaceutical University,
            Research Institute for Wakan-yaku, Analytical Research Center
            for Ethnomedicines; 2630 Sugitani, Toyama, Toyama 930-01, Japan
            (E-mail:[email protected], Tel:+81-764-34-2281(ex.2841),
            Fax:+81-764-34-5057)
COMMENT     On Feb 26, 1997 this sequence version replaced gi:793764.
FEATURES             Location/Qualifiers
     source          1..1803
                     /organism="Murinae gen. sp."
                     /note="source origin of sequence, either mouse or rat, has
                     not been identified"
                     /db_xref="taxon:39108"
                     /cell_line="NG108-15"
                     /cell_type="mouse neuroblastma-rat glioma hybridoma"
     misc_signal     156..163
                     /note="AP-2 binding site"
     GC_signal       647..655
                     /note="Sp1 binding site"
     TATA_signal     694..701
     gene            748..1311
                     /gene="TA20"
     CDS             748..1311
                     /gene="TA20"
                     /function="neurite extensiion activity and growth arrest
                     effect"
                     /codon_start=1
                     /db_xref="PID:d1005516"
                     /db_xref="PID:g793765"
                     /translation="MMKLWVPSRSLPNSPNHYRSFLSHTLHIRYNNSLFISNTHLSRR
                     KLRVTNPIYTRKRSLNIFYLLIPSCRTRLILWIIYIYRNLKHWSTSTVRSHSHSIYRL
                     RPSMRTNIILRCHSYYKPPISHPIYWNNPSRMNLRGLLSRQSHLDPILRFPLHLTIYY
                     RGPSNRSPPLPPRNRIKQPNRIKLRCR"
     polyA_site      1803
BASE COUNT      507 a    458 c    311 g    527 t
ORIGIN
        1 tcagtttttt tttttttttt tttttttttt tttttttttt tttttttttg ttgattcatg
       61 tccgtttaca tttggtaagt tcacaggcct cagtcaacac aattggactg ctcaggaaat
      121 cctccttggt gaccgcagta tacttggcct atgaacccaa gccacctatg gctaggtagg
      181 agaagctcaa ctgtagggct gactttggaa gagaatgcac atggctgtat cgacatttca
      241 catggtggac ctctggccag agtcagcagg ccgagggttc tcttccgggc tgctccctca
      301 ctgcttgact ctgcgtcagt gcgtccatac tgtgggcgga cgttattgct atttgccttc
      361 cattctgtac ggcattgcct ccatttagct ggagagggac agagcctggt tctctagggc
      421 gtttccattg gggcctggtg acaatccaaa agatgagggc tccaaacacc agaatcagaa
      481 ggcccagcgt atttgtaaaa acaccttctg gtgggaatga atggtacagg ggcgtttcag
      541 gacaaagaac agcttttctg tcactcccat gagaaccgtc gcaatcactg ttccgaagag
      601 gaggagtcca gaatacacgt gtatgggcat gacgattgcc cggagagagg cggagcccat
      661 ggaagcagaa agacgaaaaa cacacccatt atttaaaatt attaaccact cattcattga
      721 cctacctgcc ccatccaaca tttcatcatg atgaaacttt gggtcccttc taggagtctg
      781 cctaatagtc caaatcatta caggtctttt cttagccata cactacacat cagatacaat
      841 aacagccttt tcatcagtaa cacacatttg tcgagacgta aattacgggt gactaatccg
      901 atatatacac gcaaacggag cctcaatatt ttttatttgc ttattccttc atgtcggacg
      961 aggcttatat tatggatcat atacatttat agaaacctga aacattggag tacttctact
     1021 gttcgcagtc atagccacag catttatagg ctacgtcctt ccatgaggac aaatatcatt
     1081 ctgaggtgcc acagttatta caaacctcct atcagccatc ccatatattg gaacaaccct
     1141 agtcgaatga atttgagggg gcttctcagt agacaaagcc accttgaccc gattcttcgc
     1201 tttccacttc atcttaccat ttattatcgc ggccctagca atcgttcacc tcctcttcct
     1261 ccacgaaaca ggatcaaaca acccaacagg attaaactca gatgcagata aaattccatt
     1321 tcacccctac tatacatcaa agatatccta ggtatcctaa tcatattctt aattctcata
     1381 accctagtat tatttttccc agacatacta ggagacccag acaactacat accagctaat
     1441 ccactaaaca ccccacccca tattaaaccc gaatgatatt tcctatttgc atacgccatt
     1501 ctacgctcaa tccccaataa actaggaggt gtcctagcct taatcttatc tatcctaatt
     1561 ttagccctaa tacctttcct tcatacctca aagcaacgaa gcctaatatt ccgcccaatc
     1621 acacaaattt tgtactgaat cctagtagcc aacctactta tcttaacctg aattgggggc
     1681 caaccagtag acacccattt attatcattg gccaactagc ctccatctca tacttctcaa
     1741 tcatcttaat tcttatacca atctcaggaa ttatcgaaga caaaatacta aaattatatc
     1801 cat
//

Slide annotations: Header (title, taxonomy, citation); Features (including the AA sequence); DNA sequence. The keywords at the start of the lines are the fields.
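Because every field in a GenBank flat file starts with a keyword, a few lines of Python are enough to pull out the accession number and the sequence. This is a rough sketch, not a full GBFF parser: the function name `parse_genbank` is an assumption, it reads a single record, and the sample is the slide's record truncated to the LOCUS and ACCESSION lines plus the first two sequence lines.

```python
def parse_genbank(text):
    """Return (accession, sequence) for one GenBank flat-file record."""
    accession = None
    blocks = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if in_origin:
            # Sequence lines: a base number, then blocks of ten bases.
            blocks.extend(line.split()[1:])
        elif line.startswith("ACCESSION"):
            accession = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
    return accession, "".join(blocks).upper()


# Truncated copy of the record above, for illustration only.
record = """LOCUS       MUSNGH       1803 bp    mRNA            ROD       29-AUG-1997
ACCESSION   D25291
ORIGIN
        1 tcagtttttt tttttttttt tttttttttt tttttttttt tttttttttg ttgattcatg
       61 tccgtttaca tttggtaagt tcacaggcct cagtcaacac aattggactg ctcaggaaat
//
"""

accession, seq = parse_genbank(record)

# Write the result out as a FASTA record, the form most tools expect.
print(f">{accession}")
print(seq)
```

Dropping the base numbers and spaces is all it takes to get from the human-readable ORIGIN block to a plain sequence string.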


Accession Numbers!!

• Databases are designed to be searched by accession numbers (and locus IDs)
• These are guaranteed to be non-redundant, accurate, and not to change.
• Searching by gene names and keywords is doomed to frustration and probable failure

Neither scientists nor computers can be trusted to accurately and consistently annotate database entries!!

http://www.ncbi.nlm.nih.gov/Genbank

• Once upon a time, GenBank mailed out sequences on CD-ROM disks a few times per year.
• At least doubles in size every 18 months
• There are approximately 130,671,233,801 bases, from 142,284,608 reported sequences in the traditional GenBank divisions as of August 2011.
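Taken literally, "doubles every 18 months" is the formula size(t) = size(0) × 2^(t / 18), with t in months. The sketch below applies it to the August 2011 figure quoted above. It is a back-of-the-envelope extrapolation under that single assumption, not a forecast, and the function name `projected_bases` is made up for the example.

```python
def projected_bases(current_bases, months_ahead, doubling_months=18):
    """Extrapolate database size assuming a constant doubling time."""
    return current_bases * 2 ** (months_ahead / doubling_months)


AUG_2011_BASES = 130_671_233_801  # figure quoted on the slide

# Three years is two doubling periods, so the size quadruples.
after_three_years = projected_bases(AUG_2011_BASES, 36)

print(f"{after_three_years:,.0f} bases after 36 months")
```

Since the slide says "at least" doubles, any number produced this way is a lower bound under the stated rule.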


Distribution of sequence databases

• Books, articles: 1968 -> 1985
• Computer tapes: 1982 -> 1992
• Floppy disks: 1984 -> 1990
• CD-ROM: 1989 -> ?
• FTP: 1989 -> ?
• On-line services: 1982 -> 1994
• WWW: 1993 -> ?
• DVD: 2001 -> ?
• Mailing hard drives: 2009 -> ?

• Many sequences in GenBank correspond to the same gene
• Genomic clones, full-length mRNA, various kinds of ESTs, submitted by different investigators
• RefSeq is the “Reference Sequence” for a gene, as determined by GenBank curators
  – best guess given the current evidence, can change
  – usually based on the longest mRNA
  – usually has both 5’ and 3’ UTR
• Not necessarily reliable
  – A lot is not yet known… e.g., alternative splicing


Last thoughts on GenBank ...

• Often only FASTA files are used (e.g. for BLAST)
• GBFF are simply human-readable versions of these records
• GBFF have become a vehicle for a lot more information than they were meant to carry
• Keep in mind that GenBank is DNA-centric and is a poor vehicle for protein and mRNA expression/interaction information

Many Datasets at NCBI

• The NCBI hosts a huge interconnected database system that, in addition to DNA and protein, includes:
  – Journal Articles (PubMed)
  – Genetic Diseases (OMIM)
  – Polymorphisms (dbSNP)
  – Cytogenetics (CGH/SKY/FISH & CGAP)
  – Gene Expression (GEO)
  – Taxonomy
  – Chemistry (PubChem)


Accessing database information

• A request for data from a database is called a query
• Queries can be of three forms:
  – Choose from a list of parameters
  – Query by example (QBE)
  – Query language

Web Query

• Most databases have a web-based query tool
• It may be simple…


… or complex

Query Languages

• The standard
  – SQL (Structured Query Language), originally called SEQUEL (Structured English QUEry Language)
  – Developed by IBM in 1974; introduced commercially in 1979 by Oracle Corp.
  – Standard interactive and programming language for getting information from and updating a database.
  – RDBMS (SQL), ODBMS (Java, C++, OQL etc.)
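"Getting information from and updating a database" corresponds to SQL's SELECT and UPDATE statements. A small sketch using Python's built-in `sqlite3` module follows; the table name `sequence`, its columns, and the accession values are placeholders invented for the example, not real database entries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequence (accession TEXT PRIMARY KEY, organism TEXT, length INTEGER)"
)
conn.executemany("INSERT INTO sequence VALUES (?, ?, ?)", [
    ("ACC0001", "Homo sapiens", 1200),
    ("ACC0002", "Mus musculus", 1803),
    ("ACC0003", "Homo sapiens", 450),
])

# Getting information: a SELECT with two conditions joined by AND.
long_human = conn.execute(
    "SELECT accession FROM sequence "
    "WHERE organism = ? AND length > ? ORDER BY accession",
    ("Homo sapiens", 1000),
).fetchall()

# Updating: change one field of one record, identified by its key.
conn.execute("UPDATE sequence SET length = ? WHERE accession = ?", (1810, "ACC0002"))
updated = conn.execute(
    "SELECT length FROM sequence WHERE accession = ?", ("ACC0002",)
).fetchone()[0]

print(long_human, updated)
```

The `?` placeholders keep user-supplied values out of the SQL text itself, which is the usual way to pass parameters to a query.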


ENTREZ is the GenBank web query tool

Advanced query interface:


Database Searching

A database can only be searched in ways that it was designed to be searched

Boolean: "AND" and "OR" searches

Bad to search for "human hemoglobin" in a 'Description' field

Much better to search for "homo sapiens" in 'Organism' AND "HBB" in 'gene name'
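The fielded search above can be written as an Entrez query string, where each term is tagged with its field in square brackets and the terms are combined with AND. The sketch below only builds the string and a URL for the NCBI E-utilities `esearch` endpoint; it does not send a request. The helper name `fielded_query` is an assumption, and the choice of `db=nucleotide` is just one possible target database.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded_query(fields):
    """Combine {field: term} pairs into one Boolean AND query string."""
    return " AND ".join(f'"{term}"[{field}]' for field, term in fields.items())


# 'Organism' and 'Gene Name' are Entrez search field names.
term = fielded_query({"Organism": "Homo sapiens", "Gene Name": "HBB"})

# urlencode takes care of the spaces, quotes, and brackets in the term.
url = ESEARCH + "?" + urlencode({"db": "nucleotide", "term": term})

print(term)
print(url)
```

Restricting each term to its own field is what separates this from typing "human hemoglobin" into a free-text box.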

Strategies

• Use accession numbers whenever possible
• Start with broad keywords and narrow the search using more specific terms
• Try variants of spelling, numbers, etc.
• Search all relevant databases
• Be persistent!!


ENTREZ has pre-computed links between Tables

• Relationships between sequences are computed with BLAST
• Relationships between articles are computed with "MeSH" terms (shared keywords)
• Relationships between DNA and protein sequences rely on accession numbers
• Relationships between sequences and PubMed articles rely on both shared keywords and the mention of accession numbers in the articles.

UCSC Genome Browser

Search by gene name, or by sequence.


Lots of additional data can be added as optional "tracks"

- anything that can be mapped to locations on the genome


Ensembl at EBI/EMBL

http://genome.cshlp.org/content/14/5/971.full


KEGG: Kyoto Encyclopedia of Genes and Genomes

• Enzymatic and regulatory pathways
• Mapped out by EC number and cross-referenced to genes in all known organisms (wherever sequence information exists)
• Parallel maps of regulatory pathways


http://www.wwpdb.org


Gene Ontology

• Biology is a messy science
• Assortment of names, mutants, odd phenotypes
  – “sonic hedgehog”
• Gene Ontology
  – Molecular function (specific tasks)
  – Biological process (broad biological goal)
  – Cellular component (location)

 


Golden Rules

• Use published databases and methods
  – Supported, maintained, trusted by community
• Document what you have done !!!
  – Sequence identification numbers
  – Server, database, program VERSION
  – Program parameters
• Assess reliability of results
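One lightweight way to follow the "document what you have done" rule is to write a small machine-readable record next to every result. The sketch below is only a suggestion: the function name `analysis_record`, the field names, and all the values (server, program, version string, parameter) are placeholders, not recommendations for any specific tool.

```python
import json
from datetime import date


def analysis_record(sequence_ids, server, database, program, version,
                    parameters, run_date=None):
    """Collect the items from the checklist above into one dictionary."""
    return {
        "sequence_ids": sorted(sequence_ids),
        "server": server,
        "database": database,
        "program": program,
        "version": version,
        "parameters": parameters,
        "date": (run_date or date.today()).isoformat(),
    }


# Placeholder values for illustration only.
record = analysis_record(
    ["D25291"],
    server="example-server",
    database="GenBank",
    program="example-program",
    version="x.y.z",
    parameters={"evalue": 1e-5},
    run_date=date(2014, 9, 10),
)

# One JSON line per analysis is easy to append to a log file and to search later.
log_line = json.dumps(record, sort_keys=True)
print(log_line)
```

Recording the date matters because databases are updated daily to annually, so the same query can return different results later.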

Bio-databases: A short word on problems

• Even today we face some key limitations
  – There is no standard format
    • Every database or program has its own format
  – There is no standard nomenclature
    • Every database has its own names
  – Data is not fully optimized
    • Some datasets have missing information without indications of it
  – Data errors
    • Data is sometimes of poor quality, erroneous, misspelled
    • Error propagation resulting from computer annotation


What to take home

• Databases are a collection of data
  – Need to access and maintain easily and flexibly
• Biological information is vast and sometimes very redundant
• Computers can only create data; they do not give answers
• Learn to use the big reliable databases (e.g. NCBI)
• Open access to sequences is essential for all of the work we do; if it were not there, there would be no bioinformatics, no BLAST, no Computational Bioscience Program
• Open access to sequence information is not all that needs to be open. We also need open access to the literature.


RECOMMENDED EXERCISES

http://mibiol.biol.lu.se.webbhotell.ldc.lu.se/Bioinformatics/Exercises/databases.html

http://wiki.bio.dtu.dk/teaching/index.php/Exercise:_Searching_the_GenBank_database

http://biocourse.sanbi.ac.za/wp-content/uploads/2013/02/Biological-Databases-Exercises.pdf