Biological Databases (compbio.ucdenver.edu/77112014/Dowell database-14.pdf, 9/10/14)
TRANSCRIPT
9/10/14
Biological Databases
What will we discuss today?
• Types of biological data
• What is a database?
• Standardized data file formats
• GenBank, PubMed and NCBI
• Query strategies
• Other major databases
http://techcrunch.com/2012/11/25/the-big-data-fallacy-data-%E2%89%A0-information-%E2%89%A0-insights/
Biologists Collect Lots of Data
• Hundreds of thousands of species
• Millions of articles in scientific journals
• Genetic information:
  – gene names (thousands)
  – phenotype of mutants (infinite?)
  – location of genes/mutations on chromosomes
  – linkage (distances between genes)
• High-throughput technology
  – Rapid, inexpensive DNA sequencing
  – Many methods of collecting genotype data
    • Assays for specific polymorphisms
    • Genome-wide SNP chips
• Must have data quality assessment prior to analysis
One sequencer => 1-2Tb/week !!
Curated Biological Data DNA, nucleotide sequences
Gene boundaries, topology Gene structure
Introns, exons, ORFs, splicing
Expression data
Mass spectrometry (metabolomics, proteomics)
Post-Translational protein Modification (PTM)
Curated Biological Data Proteins, residue sequences
MCTUYTCUYFSTYRCCTYFSCD Extended sequence information
Secondary structure
Hydrophobicity, motif data
Protein-protein interaction
Curated Biological Data
3D structures, folds
WHAT is a database?
• A collection of data that needs to be:
  – Structured
  – Searchable
  – Updated (periodically)
  – Cross-referenced
• Challenge:
  – To change “meaningless” data into useful information that can be accessed and analysed in the best way possible.
For example: HOW would YOU organize all biological sequences so that the biological information is optimally accessible?
http://en.wikibooks.org/wiki/Data_Management_in_Bioinformatics
A Spreadsheet can be a Database
• columns are Fields • Rows are Records • Can search for a term within just one field
• Or combine searches across several fields
| SNP ID | SNPSeq ID | Gene | +primer | -primer | Hap A | Hap B | Hap C |
| --- | --- | --- | --- | --- | --- | --- | --- |
| D1Mit160_1 | 10.MMHAP67FLD1.seq | lymphocyte antigen 84 | AAGGTAAAAGGCAATCAGCACAGCC | TCAACCTGGAGTCAGAGGCT | C | - | A |
| M-05554_1 | 12.MMHAP31FLD3.seq | procollagen, type III, alpha | TGCGCAGAAGCTGAAGTCTA | TTTTGAGGTGTTAATGGTTCT | C | - | A |
| M-05554_2 | X60184 | complement component factor i | ACTTCCAGCCCTGGCTCT | ATATGCCACCAAGAAGCA | A | C | - |
| M-09947_3 | AF067835 | caspase 8 | TCACAGAGGGAAACATGAAG | CTCCACATTGAACCAAAGCA | G | C | T |
| M-11415_1 | U02023 | insulin-like growth factor binding protein | GGGAAAAGCCTGAAAGAAGC | AGCTGAAACCGGACATCAAT | T | G | - |
| D1Mit284_3 | J05234 | nucleolin | TGTTGGAACCGACTTCTTCA | AAGAGTCAAAGAATTTATGGAATGA | G | T | T |
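The idea that columns are fields, rows are records, and searches can target one field or combine several can be sketched in Python. This is only an illustration: the list-of-dictionaries layout, the lower-case field names, and the helper `search` are assumptions made for the example, and only three rows of the SNP table are included.

```python
# Each record (row) is a dict; each key is a field (column).
records = [
    {"snp_id": "M-05554_2", "seq_id": "X60184",
     "gene": "complement component factor i",
     "hap_a": "A", "hap_b": "C", "hap_c": None},
    {"snp_id": "M-09947_3", "seq_id": "AF067835",
     "gene": "caspase 8",
     "hap_a": "G", "hap_b": "C", "hap_c": "T"},
    {"snp_id": "D1Mit284_3", "seq_id": "J05234",
     "gene": "nucleolin",
     "hap_a": "G", "hap_b": "T", "hap_c": "T"},
]


def search(rows, **criteria):
    """Return the rows whose fields match every criterion (an AND search)."""
    return [r for r in rows
            if all(r.get(field) == value for field, value in criteria.items())]


# Search within just one field.
one_field = search(records, hap_b="C")
# Combine searches across two fields.
two_fields = search(records, hap_a="G", hap_b="C")

print([r["snp_id"] for r in one_field])
print([r["snp_id"] for r in two_fields])
```

A spreadsheet filter does the same thing interactively; the point is that every search is phrased in terms of named fields.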
DBMS (database management system)
• Internal organization
  – Controls speed and flexibility
• A set of programs that
  – Store
  – Extract
  – Modify
Database
Store Extract Modify
USER(S)
DBMS organisation types
• Flat file databases (flat DBMS)
  – Simple, restrictive, table
• Hierarchical databases (hierarchical DBMS)
  – Simple, restrictive, tables
• Relational databases (RDBMS)
  – Complex, versatile, tables
• Object-oriented databases (ODBMS)
  – Complex, versatile, objects
Information system
Query system
Storage System
Data
Structured Data
• Repository of information
• Managed and accessed differently
• Flat-file (text)
• Relational (key)
• “Talk” to each other
Relational databases
• Data is stored in multiple related tables
• Data relationships across tables can be either many-to-one or many-to-many
• A few rules allow the database to be viewed in many ways
Relational Databases
• What have we achieved?
  – No repeating information
  – Less storage space
  – Better representation of reality
  – Easy modification/management
  – Easy usage of any combination of records
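A minimal sketch of a many-to-one relationship, using Python's built-in `sqlite3` module. The table names, column names, and accession values are invented for illustration and do not come from any real database. Many sequence rows point at one organism row, so the organism name is stored only once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE sequence (
    accession   TEXT PRIMARY KEY,
    gene        TEXT NOT NULL,
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id)
);
""")

# The organism is stored once ...
conn.execute("INSERT INTO organism VALUES (1, 'Homo sapiens')")
# ... and many sequences refer to it by key (hypothetical accessions).
conn.executemany("INSERT INTO sequence VALUES (?, ?, ?)", [
    ("EX000001", "HBB", 1),
    ("EX000002", "HBA1", 1),
])

# A join recombines the two tables into one view.
rows = conn.execute("""
    SELECT s.accession, s.gene, o.name
    FROM sequence AS s
    JOIN organism AS o ON s.organism_id = o.organism_id
    ORDER BY s.accession
""").fetchall()

organism_count = conn.execute("SELECT COUNT(*) FROM organism").fetchone()[0]
print(rows)
print(organism_count)
```

Correcting the organism name now means changing one row, which is the "easy modification" benefit in practice.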
Three reasons to care …
• Database proliferation
  – Dozens to hundreds at the moment
• More and more scientific discoveries result from inter-database analysis and mining
• Rising complexity of required data combinations
  – E.g. translational medicine: “from bench to bedside” (genomic data vs. clinical data)
AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE,
BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP,
ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView,
GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS,
GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD,
HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat,
KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP,
Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GlycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD,
PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE,
SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D,
SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD,
YPM, etc.
Some Biological databases …
Some statistics
• More than 1000 different databases
• Generally accessible through the web (useful link: www.expasy.ch/alinks.html)
• Variable size: <100 Kb to >10 Gb
  – DNA: >10 Gb
  – Protein: 1 Gb
  – 3D structure: 5 Gb
  – Other: smaller
• Update frequency: daily to annually
NAR Database Issue
• Online collection of biological databases: http://www.oxfordjournals.org/nar/database/c/
Standard Data Formats
• DNA sequence = ACGT, but what about gaps, unknown letters, etc.?
  – How many letters per line?
  – Spaces, numbers, headers, etc.?
  – Store as a string, code as binary numbers, etc.?
• Use a completely different format for proteins?
Need standard formats!!
FASTA Format
• William Pearson (1985)
• The FASTA format is now universal for all databases and software that handle DNA and protein sequences
>URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 ..
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACC
CACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTG
TGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA
GGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT
CTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA
TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG
One header line: it starts with > and ends with a [return]. All other characters are part of the sequence.
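The rule above (one header line starting with `>`, everything else is sequence) is simple enough to parse by hand. The sketch below is a minimal reader, not a replacement for an established library. The function name `parse_fasta` is made up for this example, and the sample text uses a shortened header with only the first two sequence lines of the URO1 record.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:        # finish the previous record
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span many lines
    if header is not None:                # emit the last record
        yield header, "".join(chunks)


example = """>URO1 uro1.seq
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
"""

fasta_records = list(parse_fasta(example))
for header, sequence in fasta_records:
    print(header, len(sequence))
```

Because each `>` starts a new record, the same function also reads files that contain many sequences.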
Multi-Sequence FASTA file
>FBpp0074027 type=protein; loc=X:complement(16159413..16159860,16160061..16160497); ID=FBpp0074027; name=CG12507-PA; parent=FBgn0030729,FBtr0074248; dbxref=FlyBase:FBpp0074027,FlyBase_Annotation_IDs:CG12507-PA,GB_protein:AAF48569.1,GB_protein:AAF48569; MD5=123b97d79d04a06c66e12fa665e6d801; release=r5.1; species=Dmel; length=294;
MRCLMPLLLANCIAANPSFEDPDRSLDMEAKDSSVVDTMGMGMGVLDPTQ
PKQMNYQKPPLGYKDYDYYLGSRRMADPYGADNDLSASSAIKIHGEGNLA
SLNRPVSGVAHKPLPWYGDYSGKLLASAPPMYPSRSYDPYIRRYDRYDEQ
YHRNYPQYFEDMYMHRQRFDPYDSYSPRIPQYPEPYVMYPDRYPDAPPLR
DYPKLRRGYIGEPMAPIDSYSSSKYVSSKQSDLSFPVRNERIVYYAHLPE
IVRTPYDSGSPEDRNSAPYKLNKKKIKNIQRPLANNSTTYKMTL
>FBpp0082232 type=protein; loc=3R:complement(9207109..9207225,9207285..9207431); ID=FBpp0082232; name=mRpS21-PA; parent=FBgn0044511,FBtr0082764; dbxref=FlyBase:FBpp0082232,FlyBase_Annotation_IDs:CG32854-PA,GB_protein:AAN13563.1,GB_protein:AAN13563; MD5=dcf91821f75ffab320491d124a0d816c; release=r5.1; species=Dmel; length=87;
MRHVQFLARTVLVQNNNVEEACRLLNRVLGKEELLDQFRRTRFYEKPYQV
RRRINFEKCKAIYNEDMNRKIQFVLRKNRAEPFPGCS
>FBpp0091159 type=protein; loc=2R:complement(2511337..2511531,2511594..2511767,2511824..2511979,2512032..2512082); ID=FBpp0091159; name=CG33919-PA; parent=FBgn0053919,FBtr0091923; dbxref=FlyBase:FBpp0091159,FlyBase_Annotation_IDs:CG33919-PA,GB_protein:AAZ52801.1,GB_protein:AAZ52801; MD5=c91d880b654cd612d7292676f95038c5; release=r5.1; species=Dmel; length=191;
MKLVLVVLLGCCFIGQLTNTQLVYKLKKIECLVNRTRVSNVSCHVKAINW
NLAVVNMDCFMIVPLHNPIIRMQVFTKDYSNQYKPFLVDVKIRICEVIER
RNFIPYGVIMWKLFKRYTNVNHSCPFSGHLIARDGFLDTSLLPPFPQGFY
QVSLVVTDTNSTSTDYVGTMKFFLQAMEHIKSKKTHNLVHN
>FBpp0070770 type=protein; loc=X:join(5584802..5585021,5585925..5586137,5586198..5586342,5586410..5586605); ID=FBpp0070770; name=cv-PA; parent=FBgn0000394,FBtr0070804; dbxref=FlyBase:FBpp0070770,FlyBase_Annotation_IDs:CG12410-PA,GB_protein:AAF46063.1,GB_protein:AAF46063; MD5=0626ee34a518f248bbdda11a211f9b14; release=r5.1; species=Dmel; length=257;
MEIWRSLTVGTIVLLAIVCFYGTVESCNEVVCASIVSKCMLTQSCKCELK
NCSCCKECLKCLGKNYEECCSCVELCPKPNDTRNSLSKKSHVEDFDGVPE
LFNAVATPDEGDSFGYNWNVFTFQVDFDKYLKGPKLEKDGHYFLRTNDKN
LDEAIQERDNIVTVNCTVIYLDQCVSWNKCRTSCQTTGASSTRWFHDGCC
ECVGSTCINYGVNESRCRKCPESKGELGDELDDPMEEEMQDFGESMGPFD
GPVNNNY
…
Reformatting Data Files
• Much of the routine (yet annoying) work of bioinformatics involves messing around with data files to get them into formats that will work with various software
• Then messing around with the results produced by that software to create a useful summary…
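As one small example of this kind of reformatting, the sketch below turns FASTA text into tab-delimited lines (one record per line) and re-wraps a sequence to a chosen line width. The function names `fasta_to_tsv` and `rewrap` are invented for this illustration. The sample input is the mRpS21-PA protein from the multi-sequence FlyBase file, with its header shortened.

```python
def fasta_to_tsv(fasta_text):
    """Convert FASTA text to 'id<TAB>sequence' lines, one per record."""
    rows, seq_id, chunks = [], None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                rows.append(seq_id + "\t" + "".join(chunks))
            words = line[1:].split()
            seq_id = words[0] if words else ""   # ID = first word of header
            chunks = []
        elif line and seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        rows.append(seq_id + "\t" + "".join(chunks))
    return "\n".join(rows)


def rewrap(sequence, width=60):
    """Break a sequence into lines of at most `width` characters."""
    return "\n".join(sequence[i:i + width]
                     for i in range(0, len(sequence), width))


fasta_text = """>FBpp0082232 type=protein; species=Dmel; length=87;
MRHVQFLARTVLVQNNNVEEACRLLNRVLGKEELLDQFRRTRFYEKPYQV
RRRINFEKCKAIYNEDMNRKIQFVLRKNRAEPFPGCS
"""

tsv = fasta_to_tsv(fasta_text)
seq_id, protein = tsv.split("\t")
print(seq_id, len(protein))
print(rewrap(protein, width=60))
```

Checking that the parsed length matches the `length=87` stated in the header is a cheap sanity test after any reformatting step.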
Public Sequence Databases
[Figure: GenBank (NCBI, NIH; queried with Entrez), EMBL (EBI; queried with SRS), and DDBJ (CIB, NIG; queried with getentry) each receive submissions and updates.]
Same sequence information in all three, but different tools for searching and retrieval
GenBank
• Contains all DNA and protein sequences described in the scientific literature or collected in publicly funded research
• Flat file: composed entirely of text
• Each submitted sequence is a record
• Has fields for Organism, Date, Author, etc.
• Unique identifier for each sequence
  – Locus and Accession #
Growth of Genbank
GenBank Flat File (GBFF)
LOCUS       MUSNGH       1803 bp    mRNA            ROD       29-AUG-1997
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            cell TA20 mRNA, complete cds.
ACCESSION   D25291
NID         g1850791
KEYWORDS    neurite extension activity; growth arrest; TA20.
SOURCE      Murinae gen. sp. mouse neuroblastma-rat glioma hybridoma
            cell_line:NG108-15 cDNA to mRNA.
  ORGANISM  Murinae gen. sp.
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae;
            Murinae.
REFERENCE   1  (sites)
  AUTHORS   Tohda,C., Nagai,S., Tohda,M. and Nomura,Y.
  TITLE     A novel factor, TA20, involved in neuronal differentiation: cDNA
            cloning and expression
  JOURNAL   Neurosci. Res. 23 (1), 21-27 (1995)
  MEDLINE   96064354
REFERENCE   3  (bases 1 to 1803)
  AUTHORS   Tohda,C.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-1993) to the DDBJ/EMBL/GenBank databases.
            Chihiro Tohda, Toyama Medical and Pharmaceutical University,
            Research Institute for Wakan-yaku, Analytical Research Center for
            Ethnomedicines; 2630 Sugitani, Toyama, Toyama 930-01, Japan
            (E-mail:[email protected], Tel:+81-764-34-2281(ex.2841),
            Fax:+81-764-34-5057)
COMMENT     On Feb 26, 1997 this sequence version replaced gi:793764.
FEATURES             Location/Qualifiers
     source          1..1803
                     /organism="Murinae gen. sp."
                     /note="source origin of sequence, either mouse or rat, has
                     not been identified"
                     /db_xref="taxon:39108"
                     /cell_line="NG108-15"
                     /cell_type="mouse neuroblastma-rat glioma hybridoma"
     misc_signal     156..163
                     /note="AP-2 binding site"
     GC_signal       647..655
                     /note="Sp1 binding site"
     TATA_signal     694..701
     gene            748..1311
                     /gene="TA20"
     CDS             748..1311
                     /gene="TA20"
                     /function="neurite extensiion activity and growth arrest
                     effect"
                     /codon_start=1
                     /db_xref="PID:d1005516"
                     /db_xref="PID:g793765"
                     /translation="MMKLWVPSRSLPNSPNHYRSFLSHTLHIRYNNSLFISNTHLSRR
                     KLRVTNPIYTRKRSLNIFYLLIPSCRTRLILWIIYIYRNLKHWSTSTVRSHSHSIYRL
                     RPSMRTNIILRCHSYYKPPISHPIYWNNPSRMNLRGLLSRQSHLDPILRFPLHLTIYY
                     RGPSNRSPPLPPRNRIKQPNRIKLRCR"
     polyA_site      1803
BASE COUNT      507 a    458 c    311 g    527 t
ORIGIN
        1 tcagtttttt tttttttttt tttttttttt tttttttttt tttttttttg ttgattcatg
       61 tccgtttaca tttggtaagt tcacaggcct cagtcaacac aattggactg ctcaggaaat
      121 cctccttggt gaccgcagta tacttggcct atgaacccaa gccacctatg gctaggtagg
      181 agaagctcaa ctgtagggct gactttggaa gagaatgcac atggctgtat cgacatttca
      241 catggtggac ctctggccag agtcagcagg ccgagggttc tcttccgggc tgctccctca
      301 ctgcttgact ctgcgtcagt gcgtccatac tgtgggcgga cgttattgct atttgccttc
      361 cattctgtac ggcattgcct ccatttagct ggagagggac agagcctggt tctctagggc
      421 gtttccattg gggcctggtg acaatccaaa agatgagggc tccaaacacc agaatcagaa
      481 ggcccagcgt atttgtaaaa acaccttctg gtgggaatga atggtacagg ggcgtttcag
      541 gacaaagaac agcttttctg tcactcccat gagaaccgtc gcaatcactg ttccgaagag
      601 gaggagtcca gaatacacgt gtatgggcat gacgattgcc cggagagagg cggagcccat
      661 ggaagcagaa agacgaaaaa cacacccatt atttaaaatt attaaccact cattcattga
      721 cctacctgcc ccatccaaca tttcatcatg atgaaacttt gggtcccttc taggagtctg
      781 cctaatagtc caaatcatta caggtctttt cttagccata cactacacat cagatacaat
      841 aacagccttt tcatcagtaa cacacatttg tcgagacgta aattacgggt gactaatccg
      901 atatatacac gcaaacggag cctcaatatt ttttatttgc ttattccttc atgtcggacg
      961 aggcttatat tatggatcat atacatttat agaaacctga aacattggag tacttctact
     1021 gttcgcagtc atagccacag catttatagg ctacgtcctt ccatgaggac aaatatcatt
     1081 ctgaggtgcc acagttatta caaacctcct atcagccatc ccatatattg gaacaaccct
     1141 agtcgaatga atttgagggg gcttctcagt agacaaagcc accttgaccc gattcttcgc
     1201 tttccacttc atcttaccat ttattatcgc ggccctagca atcgttcacc tcctcttcct
     1261 ccacgaaaca ggatcaaaca acccaacagg attaaactca gatgcagata aaattccatt
     1321 tcacccctac tatacatcaa agatatccta ggtatcctaa tcatattctt aattctcata
     1381 accctagtat tatttttccc agacatacta ggagacccag acaactacat accagctaat
     1441 ccactaaaca ccccacccca tattaaaccc gaatgatatt tcctatttgc atacgccatt
     1501 ctacgctcaa tccccaataa actaggaggt gtcctagcct taatcttatc tatcctaatt
     1561 ttagccctaa tacctttcct tcatacctca aagcaacgaa gcctaatatt ccgcccaatc
     1621 acacaaattt tgtactgaat cctagtagcc aacctactta tcttaacctg aattgggggc
     1681 caaccagtag acacccattt attatcattg gccaactagc ctccatctca tacttctcaa
     1741 tcatcttaat tcttatacca atctcaggaa ttatcgaaga caaaatacta aaattatatc
     1801 cat
//
Features (AA seq)
DNA Sequence
Header • Title • Taxonomy • Citation
Fields
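Because a GenBank flat file is plain text with keywords at the start of the line, its header fields can be pulled out with very little code. The sketch below is a deliberately simplified reader and does not cover the full format. It assumes that top-level keywords begin in the first column and that indented lines continue the previous field, and it stops at FEATURES. The function name `parse_genbank_header` and the line breaks in the sample excerpt are assumptions made for the example.

```python
def parse_genbank_header(text):
    """Collect top-level header fields of a GenBank flat-file record."""
    fields, key = {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith(("FEATURES", "ORIGIN", "//")):
            break                              # header section is over
        if not line[0].isspace():              # keyword in column 1
            key, _, value = line.partition(" ")
            fields[key] = value.strip()
        elif key is not None:                  # indented continuation line
            fields[key] += " " + line.strip()
    return fields


record = """\
LOCUS       MUSNGH       1803 bp    mRNA            ROD       29-AUG-1997
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            cell TA20 mRNA, complete cds.
ACCESSION   D25291
KEYWORDS    neurite extension activity; growth arrest; TA20.
FEATURES             Location/Qualifiers
"""

header = parse_genbank_header(record)
print(header["ACCESSION"])
print(header["DEFINITION"])
```

For real work an established parser is the safer choice, since the feature table and sub-keywords such as ORGANISM need more careful handling than this sketch provides.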
Accession Numbers!!
• Databases are designed to be searched by accession numbers (and locus IDs)
• These are guaranteed to be non-redundant, accurate, and not to change.
• Searching by gene names and keywords is doomed to frustration and probable failure
Neither scientists nor computers can be trusted to accurately and consistently annotate database entries!!
http://www.ncbi.nlm.nih.gov/Genbank
• Once upon a time, GenBank mailed out sequences on CD-ROM disks a few times per year.
• At least doubles in size every 18 months
• There are approximately 130,671,233,801 bases, from 142,284,608 reported sequences, in the traditional GenBank divisions as of August 2011.
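To get a feel for what "doubles every 18 months" implies, the rule of thumb can be written as N(t) = N0 × 2^(t/18), with t in months. The sketch below applies it to the August 2011 base count. The result is a projection under that doubling assumption only, not a reported GenBank statistic, and the function name `projected_bases` is made up for the example.

```python
def projected_bases(start_bases, months_elapsed, doubling_months=18):
    """Project a base count forward, assuming steady doubling."""
    return start_bases * 2 ** (months_elapsed / doubling_months)


bases_aug_2011 = 130_671_233_801

# 36 months is two doubling periods, i.e. a four-fold increase.
after_three_years = projected_bases(bases_aug_2011, 36)
print(f"{after_three_years:,.0f}")
```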
Distribution of sequence databases
• Books, articles 1968 -> 1985
• Computer tapes 1982 -> 1992
• Floppy disks 1984 -> 1990
• CD-ROM 1989 -> ?
• FTP 1989 -> ?
• On-line services 1982 -> 1994
• WWW 1993 -> ?
• DVD 2001 -> ?
• Mailing hard drives 2009 -> ?
• Many sequences in GenBank correspond to the same gene
• Genomic clones, full-length mRNA, various kinds of ESTs, submitted by different investigators
• RefSeq is the “Reference Sequence” for a gene, as determined by GenBank curators
  – best guess given the current evidence, can change
  – usually based on the longest mRNA
  – usually has both 5’ and 3’ UTR
• Not necessarily reliable
  – A lot is not yet known… e.g. alternative splicing
Last thoughts on GenBank ...
• Often only use FASTA files (e.g. for BLAST)
• GBFF are simply human-readable versions of these records
• GBFF have become a vehicle for a lot more information than they were meant to carry
• Keep in mind that GenBank is DNA-centric and is a poor vehicle for protein and mRNA expression/interaction information
Many Datasets at NCBI
• The NCBI hosts a huge interconnected database system that, in addition to DNA and protein, includes:
  – Journal Articles (PubMed)
  – Genetic Diseases (OMIM)
  – Polymorphisms (dbSNP)
  – Cytogenetics (CGH/SKY/FISH & CGAP)
  – Gene Expression (GEO)
  – Taxonomy
  – Chemistry (PubChem)
Accessing database information
• A request for data from a database is called a query
• Queries can be of three forms:
  – Choose from a list of parameters
  – Query by example (QBE)
  – Query language
Web Query
• Most databases have a web-based query tool
• It may be simple…
… or complex
Query Languages
• The standard
  – SQL (Structured Query Language), originally called SEQUEL (Structured English QUEry Language)
  – Developed by IBM in 1974; introduced commercially in 1979 by Oracle Corp.
  – Standard interactive and programming language for getting information from and updating a database.
  – RDBMS (SQL), ODBMS (Java, C++, OQL etc.)
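A short sketch of the two jobs named above, getting information from a database and updating it, using SQL through Python's built-in `sqlite3` module. The `gene` table, its columns, and its rows are toy data made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE gene (symbol TEXT PRIMARY KEY, organism TEXT, description TEXT)"
)
conn.executemany("INSERT INTO gene VALUES (?, ?, ?)", [
    ("HBB", "Homo sapiens", "hemoglobin subunit beta"),
    ("HBA1", "Homo sapiens", "hemoglobin subunit alpha 1"),
    ("Hbb-b1", "Mus musculus", "mouse beta hemoglobin (toy entry)"),
])

# Getting information: SELECT ... WHERE restricts the search to one field.
human_genes = conn.execute(
    "SELECT symbol FROM gene WHERE organism = ? ORDER BY symbol",
    ("Homo sapiens",),
).fetchall()

# Updating: UPDATE changes a field of the matching record.
conn.execute(
    "UPDATE gene SET description = ? WHERE symbol = ?",
    ("beta-globin", "HBB"),
)
new_description = conn.execute(
    "SELECT description FROM gene WHERE symbol = 'HBB'"
).fetchone()[0]

print(human_genes)
print(new_description)
```

The `?` placeholders pass values separately from the SQL text, which avoids quoting problems when gene names contain punctuation.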
ENTREZ is the GenBank web query tool
Advanced query interface:
Database Searching
A database can only be searched in ways that it was designed to be searched
Boolean: "AND" and "OR" searches
Bad to search for "human hemoglobin" in a 'Description' field
Much better to search for "Homo sapiens" in 'Organism' AND "HBB" in 'Gene name'
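In Entrez, a fielded Boolean search like the one above is written by attaching a field tag in square brackets to each term. The sketch below builds such a term and the matching NCBI E-utilities `esearch` URL without sending any request. The helper `build_term` is invented for this example, and the choice of the `nucleotide` database is an assumption.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(conditions):
    """Join (value, field) pairs with AND, e.g. "Homo sapiens"[Organism]."""
    return " AND ".join(f'"{value}"[{field}]' for value, field in conditions)


term = build_term([("Homo sapiens", "Organism"), ("HBB", "Gene Name")])
url = ESEARCH + "?" + urlencode({"db": "nucleotide", "term": term})

print(term)
print(url)
```

The same term can be pasted into the Entrez search box, so the query is easy to check by hand before scripting it.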
Strategies
• Use accession numbers whenever possible
• Start with broad keywords and narrow the search using more specific terms
• Try variants of spelling, numbers, etc.
• Search all relevant databases
• Be persistent!!
ENTREZ has pre-computed links between Tables
• Relationships between sequences are computed with BLAST
• Relationships between articles are computed with MeSH terms (shared keywords)
• Relationships between DNA and protein sequences rely on accession numbers
• Relationships between sequences and PubMed articles rely on both shared keywords and the mention of accession numbers in the articles.
UCSC Genome Browser Search by gene name:
or by sequence:
Lots of additional data can be added as optional "tracks"
- anything that can be mapped to locations on the genome
Ensembl at EBI/EMBL
http://genome.cshlp.org/content/14/5/971.full
KEGG: Kyoto Encyclopedia of Genes and Genomes
• Enzymatic and regulatory pathways
• Mapped out by EC number and cross-referenced to genes in all known organisms (wherever sequence information exists)
• Parallel maps of regulatory pathways
http://www.wwpdb.org
Gene Ontology
• Biology is a messy science
• Assortment of names, mutants, odd phenotypes
  – “sonic hedgehog”
• Gene Ontology
  – Molecular function (specific tasks)
  – Biological process (broad biological goal)
  – Cellular component (location)
Golden Rules
• Use published databases and methods
  – Supported, maintained, trusted by community
• Document what you have done !!!
  – Sequence identification numbers
  – Server, database, program VERSION
  – Program parameters
• Assess reliability of results
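One lightweight way to follow the "document what you have done" rule is to write the listed details to a small structured log next to the results. The sketch below is only a suggestion. The function `make_run_record`, its field names, and every value except the accession number are placeholders made up for the example.

```python
import json


def make_run_record(sequence_ids, server, database, program, version, parameters):
    """Bundle the details worth documenting into one JSON string."""
    record = {
        "sequence_ids": list(sequence_ids),
        "server": server,
        "database": database,
        "program": program,
        "program_version": version,
        "parameters": dict(parameters),
    }
    return json.dumps(record, indent=2, sort_keys=True)


log_entry = make_run_record(
    sequence_ids=["D25291"],
    server="example-server",        # placeholder
    database="GenBank",
    program="example-program",      # placeholder
    version="0.0.0",                # placeholder
    parameters={"word_size": 11},   # placeholder
)
print(log_entry)
```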
Bio-databases: A short word on problems
• Even today we face some key limitations
  – There is no standard format
    • Every database or program has its own format
  – There is no standard nomenclature
    • Every database has its own names
  – Data is not fully optimized
    • Some datasets have missing information without indications of it
  – Data errors
    • Data is sometimes of poor quality, erroneous, misspelled
    • Error propagation resulting from computer annotation
What to take home
• Databases are a collection of data
  – Need to access and maintain easily and flexibly
• Biological information is vast and sometimes very redundant
• Computers can only create data, they do not give answers
• Learn to use the big reliable databases (e.g. NCBI)
• Open access to sequences is essential for all of the work we do: if it were not there, there would be no bioinformatics, no BLAST, no Computational Bioscience Program
• Open access to sequence information is not all that needs to be open. We also need open access to the literature.
RECOMMENDED EXERCISES
http://mibiol.biol.lu.se.webbhotell.ldc.lu.se/Bioinformatics/Exercises/databases.html
http://wiki.bio.dtu.dk/teaching/index.php/Exercise:_Searching_the_GenBank_database
http://biocourse.sanbi.ac.za/wp-content/uploads/2013/02/Biological-Databases-Exercises.pdf