b.sc biochem i bobi u 2 database
TRANSCRIPT
Biological Databases
Course: B.Sc Biochemistry
Subject: Basics of Bioinformatics
Unit: II
What can be discovered about a gene by a database search?
A little or a lot, depending on the gene:
• Evolutionary information: homologous genes, taxonomic distributions, allele frequencies, synteny, etc.
• Genomic information: chromosomal location, introns, UTRs, regulatory regions, shared domains, etc.
• Structural information: associated protein structures, fold types, structural domains
• Expression information: expression specific to particular tissues, developmental stages, phenotypes, diseases, etc.
• Functional information: enzymatic/molecular function, pathway/cellular role, localization, role in diseases
Using a database
How to get information out of a database:
• Browsing: no targeted information to retrieve
• Search: looking for particular information
Searching a database:
• Must have a key that identifies the element(s) of the database that are of interest:
- Name of gene
- Sequence of gene
- Other information
• Helps to have particular informational goals
Searching for information about genes and their products
Gene and gene product databases are often organized by sequence:
• Genomic sequence encodes all traits of an organism.
• Gene products are uniquely described by their sequences.
• Similar sequences among biomolecules indicate both similar function and an evolutionary relationship.
Macromolecular sequences provide biologically meaningful keys for searching databases.
Searching sequence databases
Start from a sequence, find information about it.
Many kinds of input sequences:
• Could be amino acid or nucleotide sequence
• Genomic or mRNA/cDNA or protein sequence
• Complete or fragmentary sequences
Exact matches are rare (even uninteresting in many cases), so often the goal is to retrieve a set of similar sequences.
• Both small (mutations) and large (required for function) differences within “similar” can be interesting.
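One simple way to put a number on "similar" is percent identity. The sketch below assumes the two sequences are already aligned and of equal length; real search tools align the sequences first and also report statistical significance, which this illustration does not attempt. The function name and the example sequences are hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length (align them first)")
    if not seq_a:
        return 0.0
    # Compare position by position, ignoring letter case.
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)


query = "ATTCCGAGA"
hit = "ATTCCGACA"  # differs from the query at one position
print(f"{percent_identity(query, hit):.1f}% identical")
```

A single substitution in a nine-base sequence already lowers the identity noticeably, which is why short fragments need careful interpretation.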
What might we want to know about a sequence?
• Is this sequence similar to any known genes? How close is the best match? Significance?
• What do we know about that gene?
- Genomic (chromosomal location, allelic information, regulatory regions, etc.)
- Structural (known structure? structural domains? etc.)
- Functional (molecular, cellular & disease)
• Evolutionary information:
- Is this gene found in other organisms?
- What is its taxonomic tree?
A historical perspective
The 1960s: the birth of bioinformatics
• High-level computer languages
• Protein sequence data
• Academic access to computers
Margaret Oakley Dayhoff
• First protein database
• First program for sequence assembly
IBM 7090 computer
Benfey and Protopapas, "Genomics" © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey 07458
By way of comparison…
IBM 7090 computer
32 Kbytes RAM
2.18 µs memory cycle time
$2,900,000 in 1960
20” Apple iMac
1 GB RAM
2.4 GHz
$1199 in 2008
Solving problems in computer science
Necessary parameters for assessing the difficulty of a computer science problem:
• Algorithmic complexity
- Is the problem theoretically solvable?
- If so, what is the most efficient solution?
• Current state of computer technology
- Memory
- CPU speed
- Cost
Algorithms
An algorithm is a sequence of instructions that one must perform in order to solve a well-formulated problem.
• First you must identify exactly what the problem is!
• A problem describes a class of computational tasks. A problem instance is one particular input from that class.
• In general, you should design your algorithms to work for any instance of a problem (although there are cases in which this is not possible).
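The distinction between a problem and a problem instance can be made concrete with a small sketch. Here the problem is assumed to be "compute the GC fraction of a DNA sequence" (a hypothetical choice made only for illustration), and each string passed in is one instance. The function is written to work for any instance, including the empty sequence.

```python
def gc_content(sequence: str) -> float:
    """Fraction of bases that are G or C; works for any DNA string."""
    sequence = sequence.upper()
    if not sequence:
        # The empty sequence is still a valid instance of the problem.
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# Three different instances of the same problem.
for instance in ["ATTCCGAGA", "TATAGCCG", ""]:
    print(repr(instance), gc_content(instance))
```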
Computer technology: memory, CPU speed, cost
• Dramatic improvements on yearly basis
• We do a lot of our work using desktop Macs out of the box
- 2 quad core 2.8 GHz processors, 500 GB disk space, 4 GB RAM for ~$3000
- 2 quad core 3.0 GHz processors, 2.5 TB disk space, 8 GB RAM for ~$6000
• CPU speed vs. memory: which is more important?
- for protein structure, might need many calculations but limited memory
- for genome searches, might have few calculations but huge amounts to store in memory
• Reading from memory is several orders of magnitude faster than reading from disk
Databases
What is a database?
• A collection of related data elements
- tables
- columns (fields)
- rows (records)
• Records retrieved using a query language
• Database technology is well established
Databases are a fundamental part of the bioinformatics revolution. Much of the conceptual framework for databases had already been developed by the 1960s.
By the 1970s, database technology had already permeated much of the government and corporate sectors.
Modern databases can be described as well-organized collections of data that can be accessed through the use of a query language.
Two databases of particular importance to biologists are GenBank®, which encompasses publicly available nucleotide sequences and their protein translations, and the Protein Data Bank, which contains high quality 3-D structures of proteins, nucleic acids, and carbohydrates.
The entire sequence of a single human could fit on one or two CD-ROMs. As we shall see shortly, it is the comparison of sequences that presents algorithmic challenges.
Tables (entities)
• basic elements of information to track, e.g., gene, organism, sequence, citation
Columns (fields)
• attributes of tables, e.g., for a citation table: title, journal, volume, author
Rows (records)
• actual data
• whereas fields describe what data is stored, the rows of a table are where the actual data is stored
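The table, column and row vocabulary can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: the citation table uses the fields named above, but the rows are made-up placeholder values, not real citations.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The table (entity) and its columns (fields).
conn.execute(
    "CREATE TABLE citation (title TEXT, journal TEXT, volume INTEGER, author TEXT)"
)

# Rows (records): placeholder data for illustration only.
rows = [
    ("Example title one", "Example Journal A", 12, "Author A"),
    ("Example title two", "Example Journal B", 7, "Author B"),
]
conn.executemany("INSERT INTO citation VALUES (?, ?, ?, ?)", rows)

# A query in SQL: retrieve records whose journal field has a given value.
result = conn.execute(
    "SELECT title, volume FROM citation WHERE journal = ?",
    ("Example Journal A",),
).fetchall()
print(result)
conn.close()
```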
What is a database?
A database is a computerized archive used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria. Databases are composed of computer hardware and software for data management.
Each record, also called an entry, should contain a number of fields that hold the actual data items, for example, fields for names, phone numbers, addresses, dates.
To retrieve a particular record from the database, a user can specify a particular piece of information, called a value, to be found in a particular field and expect the computer to retrieve the whole data record. This process is called making a query.
A biological database is a collection of both experimental and theoretical data that is organized so that its contents can be easily:
• accessed
• managed
• updated
• retrieved
The activity of preparing a database can be divided into:
• Collection of data in a form which can be easily accessed
• Making it available to a multi-user system
Types of database
Flat file database
A flat file database describes any of various means to encode a database model (most commonly a table) as a single file. A flat file can be a plain text file or a binary file. There are usually no structural relationships between the records.
"Flat file database" may be defined very narrowly, or more broadly.
• Strictly, a flat file database should consist of nothing but data and, if records vary in length, delimiters.
• More broadly, the term refers to any database which exists in a single file in the form of rows and columns, with no relationships or links between records and fields except the table structure.
Terms used to describe different aspects of a database and its tools differ from one implementation to the next, but the concepts remain the same. FileMaker uses the term "Find", while MySQL uses the term "Query"; FileMaker "files", in version 7 and above, are equivalent to MySQL "databases", and so forth.
However, the basic terms "record" and "field" are used in nearly every flat file database implementation.
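A flat file in the strict sense (data plus delimiters) can be read with Python's csv module. The sketch below assumes a tab-delimited file with a header row; the column names are hypothetical, and the two records reuse accession numbers, organisms and lengths that appear in the example records later in these slides.

```python
import csv
import io

# A tab-delimited flat file: one record per line, one field per column.
FLAT_FILE = """accession\torganism\tlength_bp
U03518\tAspergillus awamori\t237
U40282\tHomo sapiens\t1789
"""


def find_records(text: str, field: str, value: str) -> list:
    """Return every record whose given field equals the given value."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader if row[field] == value]


print(find_records(FLAT_FILE, "accession", "U40282"))
```

Note that every lookup scans the whole file, because a flat file carries no index or relationships beyond its table structure.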
Relational database
Relational databases are both created and queried by Database Management Systems (DBMSs).
Relational databases displaced hierarchical databases because the ability to add new relations made it possible to add new information that was valuable but "broke" a database's original hierarchical conception.
The trend continues as a networked planet and social media create the world of "big data", which is larger and less structured than the datasets and tasks that relational databases handle well (it is instructive to compare Hadoop).
Object oriented database
An object database (also object-oriented database management system) is a database management system in which information is represented in the form of objects as used in object-oriented programming.
Object databases are different from relational databases, which are table-oriented.
Biological database
Online Databases
When you query an online database, your query is translated into SQL, the database is interrogated, and the answer is displayed on your web browser.
Your computer and browser (the “client”)
Software to receive and translate the instructions you enter into your browser (on the “server”)
The database itself
Image source: David Lane and Hugh E. Williams. Web Database Applications with PHP & MySQL. O’Reilly (2002).
Biological Databases
•Over 1000 biological databases
•Vary in size, quality, coverage, level of interest
•Many of the major ones covered in the annual Database Issue of Nucleic Acids Research
•What makes a good database?
•comprehensiveness
•accuracy
•is up-to-date
•good interface
•batch search/download
•API (web services, DAS, etc.)
“Ten Important Bioinformatics Databases”
Database      URL                    Content
GenBank       www.ncbi.nlm.nih.gov   nucleotide sequences
Ensembl       www.ensembl.org        human/mouse genome (and others)
PubMed        www.ncbi.nlm.nih.gov   literature references
NR            www.ncbi.nlm.nih.gov   protein sequences
SWISS-PROT    www.expasy.ch          protein sequences
InterPro      www.ebi.ac.uk          protein domains
OMIM          www.ncbi.nlm.nih.gov   genetic diseases
Enzymes       www.chem.qmul.ac.uk    enzymes
PDB           www.rcsb.org/pdb/      protein structures
KEGG          www.genome.ad.jp       metabolic pathways
Source: Bioinformatics for Dummies
NCBI (National Center for Biotechnology Information)
• over 30 databases including GenBank, PubMed, OMIM, and GEO
• Access all NCBI resources via Entrez (www.ncbi.nlm.nih.gov/Entrez/)
PubMed
INFORMATION RETRIEVAL FROM BIOLOGICAL DATABASES
• NCBI-Entrez
• SRS (Sequence Retrieval System)
NCBI and Entrez
The Central Dogma & Biological Data
• Original DNA sequences (genomes)
• Expressed DNA sequences (= mRNA sequences = cDNA sequences), Expressed Sequence Tags (ESTs)
• Protein sequences: inferred, or from direct sequencing
• Protein structures: from experiments, or from models (homologues)
• Literature information
NCBI Databases and Services
• GenBank primary sequence database
• Free public access to biomedical literature
- PubMed free Medline (3 million searches per day)
- PubMed Central full text online access
• Entrez integrated molecular and literature databases
PRIMARY VS. DERIVATIVE SEQUENCE DATABASES
[Figure: data flow between primary and derivative sequence databases. Sequencing centers and labs submit sequence reads to GenBank, which is updated ONLY by submitters. UniGene (built by algorithms) and RefSeq (built by curators, with genome assembly) are updated continually by NCBI.]
Sequence Databases at NCBI
Primary
• GenBank: NCBI’s primary sequence database
• Trace Archive: reads from capillary sequencers
• Sequence Read Archive: next generation data
Derivative
• GenPept (GenBank translations)
• Outside Protein (UniProt/Swiss-Prot, PDB)
• NCBI Reference Sequences (RefSeq)
GENBANK - PRIMARY SEQUENCE DB
• Nucleotide only sequence database
• Archival (records) in nature
- Historical
- Reflective of submitter point of view (subjective)
- Redundant
• Data
- Direct submissions (traditional records)
- Batch submissions
- FTP accounts (genome data)
GENBANK - PRIMARY SEQUENCE DB (2)
Three collaborating databases
1. GenBank
2. DNA Database of Japan (DDBJ)
3. European Molecular Biology Laboratory (EMBL) Database
Traditional GenBank Record
ACCESSION U07418
VERSION U07418.1 GI:466461
• Accession: stable, reportable, universal
• Version: tracks changes in sequence
• GI number: NCBI internal use
• well annotated
• the sequence is the data
NCBI and Entrez
• One of the most useful and comprehensive sources of databases is the NCBI, part of the National Library of Medicine.
• NCBI provides interesting summaries, browsers for genome data, and search tools
• Entrez is their database search interface: http://www.ncbi.nlm.nih.gov/Entrez
• Can search on gene names, sequences, chromosomal location, diseases, keywords, ...
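Entrez searches can also be issued from a program. The sketch below is not part of the slides: it assumes NCBI's E-utilities ESearch endpoint and its db, term and retmax parameters, and it only builds the request URL rather than contacting the server. The helper name build_esearch_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database: str, term: str, max_results: int = 20) -> str:
    """Build an ESearch URL for a keyword search in one Entrez database."""
    params = {"db": database, "term": term, "retmax": max_results}
    # urlencode escapes spaces and other special characters in the term.
    return ESEARCH_URL + "?" + urlencode(params)


url = build_esearch_url("gene", "alcohol dehydrogenase")
print(url)
```

Opening the printed URL in a browser (or fetching it with urllib.request) should return a list of matching record identifiers; check NCBI's usage guidelines before sending many requests.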
What did we just do?
• Identify loci (genes) associated with the sequence. Input was Alcohol Dehydrogenase.
• For each particular “hit”, we can look at that sequence and its alignment in more detail.
• See similar sequences, and the organisms in which they are found.
• But there’s much more that can be found on these genes, even just inside NCBI…
More from Entrez Gene
And more…
Sequence Retrieval System
The Sequence Retrieval System (SRS) is a database system that works with flat files. In addition, many bioinformatics tools are incorporated and can be combined with the database searches.
NCBI is not all there is...
• Links to non-NCBI databases
- Reactome & KEGG for pathways
- HGNC for nomenclature
- UCSC Human Genome Browser
• Other important gene/protein resources not linked to:
- UniProt (most carefully annotated)
- PDB (main macromolecular structure repository)
• Other key biological data sources
- Gene Ontology/Open Biological Ontologies
- Enzyme
• Scientific society: iscb.org
- Journals, Conferences…
Take home messages
• There are a lot of molecular biology databases, containing a lot of valuable information
• Not even the best databases have everything (or the best of everything)
• These databases are moderately well cross-linked, and there are “linker” databases
• Sequence is a good identifier, maybe even better than gene name!
FILE FORMATS
IG/Stanford    Fitch           Plain/Raw
GenBank/GB     Fasta/Pearson   PIR/CODATA
NBRF           Zuker           MSF
EMBL           Olsen           ASN.1
GCG            Phylip 3.2      PAUP/NEXUS
DNAStrider     Phylip          Pretty
LOCUS, Accession, NID and protein_id
• LOCUS: Unique string of 10 letters and numbers in the database. Not maintained amongst databases, and is therefore a poor sequence identifier.
• ACCESSION: A unique identifier for the record and a citable entity; does not change when the record is updated. A good record identifier, ideal for citation in publication.
• VERSION: New system in which the accession and version together (Accession.version) play the same role as the accession and gi number.
• Nucleotide gi: GenInfo identifier (gi), a unique integer which will change every time the sequence changes.
• PID: Protein identifier: g, e or d prefix to gi number. Can have one or two on one CDS.
• Protein gi: GenInfo identifier (gi), a unique integer which will change every time the sequence changes.
• protein_id: Identifier which has the same structure and function as the nucleotide Accession.version numbers, but a slightly different format.
LOCUS, Accession, gi and PID
LOCUS       HSU40282     1789 bp    mRNA    PRI    21-MAY-1998
DEFINITION  Homo sapiens integrin-linked kinase (ILK) mRNA, complete cds.
ACCESSION   U40282
VERSION     U40282.1  GI:3150001
     CDS    157..1515
            /gene="ILK"
            /note="protein serine/threonine kinase"
            /codon_start=1
            /product="integrin-linked kinase"
            /protein_id="AAC16892.1"
            /db_xref="PID:g3150002"
            /db_xref="GI:3150002"
Identifiers in this record:
• LOCUS: HSU40282
• ACCESSION: U40282
• VERSION: U40282.1
• GI: 3150001
• PID: g3150002
• Protein gi: 3150002
• protein_id: AAC16892.1
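The Accession.version scheme is easy to take apart in code. The sketch below assumes a VERSION line laid out like the one in the record above (keyword, Accession.version, then an optional GI field); the function name and the regular expression are illustrative, not an official parser.

```python
import re

# Keyword, then accession (letters/underscore followed by digits),
# a dot, the version number, and optionally "GI:" plus an integer.
VERSION_PATTERN = re.compile(r"VERSION\s+([A-Z_]+\d+)\.(\d+)(?:\s+GI:(\d+))?")


def parse_version_line(line: str) -> dict:
    """Split a GenBank VERSION line into accession, version and gi."""
    match = VERSION_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a VERSION line: {line!r}")
    accession, version, gi = match.groups()
    return {"accession": accession, "version": int(version), "gi": gi}


print(parse_version_line("VERSION     U40282.1  GI:3150001"))
```

Because the accession stays fixed while the version number is incremented on each sequence change, comparing the two parts separately tells you whether two records are the same entry at different revisions.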
PLAIN SEQUENCE FORMAT
A sequence in plain format may contain only IUPAC characters and
spaces (no numbers!).
Note: A file in plain sequence format may only contain one sequence,
while most other formats accept several sequences in one file.
An example sequence in plain format is:
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCCTGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGCCGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACTTTCAACAATGGATCT
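The rule "only IUPAC characters and spaces, no numbers" can be checked mechanically. This sketch assumes a nucleotide sequence and uses the standard IUPAC nucleotide codes (the four bases plus U and the ambiguity codes); the function name is hypothetical, and a protein sequence would need a different alphabet.

```python
# IUPAC nucleotide codes: A, C, G, T, U plus ambiguity codes.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def is_plain_nucleotide_sequence(text: str) -> bool:
    """True if text holds only IUPAC nucleotide letters and whitespace."""
    residues = "".join(text.split())  # drop spaces and line breaks
    if not residues:
        return False
    return all(ch in IUPAC_NUCLEOTIDES for ch in residues.upper())


print(is_plain_nucleotide_sequence("AACCTGCGGA AGGATCATTA"))  # letters and spaces only
print(is_plain_nucleotide_sequence("1 AACCTGCGGA"))           # contains a number
```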
FASTA FORMAT
A sequence in Fasta format begins with a single-line description,
followed by lines of sequence data.
The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column.
It is recommended that all lines of text be shorter than 80 characters in length.
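The FASTA rules just described (a ">" description line followed by sequence lines) translate directly into a small reader. This is a minimal sketch that assumes well-formed input and returns plain tuples; established libraries handle many more edge cases. The function name is illustrative.

```python
def parse_fasta(text: str) -> list:
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTC
TATTGTACCCTGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTG
"""
for description, sequence in parse_fasta(example):
    print(description, len(sequence))
```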
An example sequence in FASTA format is:
>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTC
TATTGTACCCTGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTG
CCCCCCGGGCCCGTGCCCGCCGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTC
TGAGTTGATTGAATGCAATCAGTTAAAACTTTCAACAATGGATCTCTTGGTTCCGGC
• The first line of each sequence entry is the ID line, which contains entry name, dataclass, molecule, division and sequence length.
• XX lines contain no data; they are just separators.
• The AC line lists the accession number.
• The DE line gives a description of the sequence.
• FT lines give precise annotation for the sequence.
• The SQ line, with the characters SQ in the first two spaces, marks the start of the sequence information.
• The last line of each sequence entry in the file is a terminator line which has the two characters // in the first two spaces.
ID   AA03518    standard; DNA; FUN; 237 BP.
XX
AC   U03518;
XX
DE   Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE   rRNA and 5.8S rRNA genes, partial sequence.
RX   MEDLINE; 94303342.
RX   PUBMED; 8030378.
XX
FT   rRNA            <1..20
FT                   /product="18S ribosomal RNA"
FT   misc_RNA        21..205
FT                   /standard_name="Internal transcribed spacer 1 (ITS1)"
FT   rRNA            206..>237
FT                   /product="5.8S ribosomal RNA"
SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
     aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc        60
     tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg       120
     ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc       180
     tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc          237
//
EMBL/Swiss Prot (http://www.ebi.ac.uk/help/formats_frame.html)
EMBL FORMAT
A sequence file in EMBL format can contain several sequences.
One sequence entry starts with an identifier line ("ID "), followed by further annotation lines. The start of the sequence is marked by a line starting with "SQ" and the end of the sequence is marked by two slashes ("//").
An example sequence in EMBL format is:
ID   AA03518    standard; DNA; FUN; 237 BP.
XX
AC   U03518;
XX
DE   Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE   rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
     aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc        60
     tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg       120
     ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc       180
     tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc          237
//
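The two-letter line codes make an EMBL entry straightforward to read line by line. The sketch below handles a single entry and only the ID, AC, DE and SQ codes described above; it assumes each code starts in the first column. The function name and the returned dictionary layout are illustrative, and the example entry is abbreviated to its first sequence line.

```python
def parse_embl_entry(text: str) -> dict:
    """Extract ID, accession(s), description and sequence from one EMBL entry."""
    entry = {"id": None, "accessions": [], "description": "", "sequence": ""}
    description_parts, sequence_parts = [], []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # terminator line: end of the entry
        if in_sequence:
            # Keep letters only: drops spacing and the running base counts.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ID"):
            entry["id"] = line[2:].split()[0]
        elif line.startswith("AC"):
            entry["accessions"] += line[2:].replace(";", " ").split()
        elif line.startswith("DE"):
            description_parts.append(line[2:].strip())
        elif line.startswith("SQ"):
            in_sequence = True  # sequence data starts on the next line
    entry["description"] = " ".join(description_parts)
    entry["sequence"] = "".join(sequence_parts)
    return entry


# Abbreviated example: only the first of the four sequence lines is included.
EMBL_EXAMPLE = """ID   AA03518    standard; DNA; FUN; 237 BP.
XX
AC   U03518;
XX
DE   Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE   rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
     aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc        60
//
"""
entry = parse_embl_entry(EMBL_EXAMPLE)
print(entry["id"], entry["accessions"], len(entry["sequence"]))
```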
GENBANK FORMAT
A sequence file in GenBank format can contain several sequences. One sequence in GenBank format starts with a line containing the word LOCUS and a number of annotation lines. The start of the sequence is marked by a line containing "ORIGIN" and the end of the sequence is marked by two slashes ("//").
• Can contain several sequences
• One sequence starts with: "LOCUS"
• The sequence starts with: "ORIGIN"
• The sequence ends with: "//"
An example sequence in GenBank format is:
LOCUS       AAU03518      237 bp    DNA             PLN       04-FEB-1995
DEFINITION  Aspergillus awamori internal transcribed spacer 1 (ITS1) and
            18S rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION   U03518
BASE COUNT       41 a     77 c     67 g     52 t
ORIGIN
        1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
       61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
      121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
      181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
//
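Since a GenBank file can hold several records, a reader has to track the three markers named above: LOCUS, ORIGIN and //. The sketch below is a simplified illustration that keeps only the locus name and the sequence of each record and assumes the markers start in the first column; the function name is hypothetical and the second record in the example is made up.

```python
def parse_genbank(text: str) -> list:
    """Return (locus_name, sequence) for every record in GenBank-format text."""
    records = []
    locus, in_origin, parts = None, False, []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            # Start of a new record; the locus name is the second token.
            locus, in_origin, parts = line.split()[1], False, []
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            if locus is not None:
                records.append((locus, "".join(parts)))
            locus, in_origin, parts = None, False, []
        elif in_origin:
            # Drop the position numbers and spaces, keep the bases.
            parts.append("".join(ch for ch in line if ch.isalpha()))
    return records


GENBANK_EXAMPLE = """LOCUS       AAU03518      237 bp    DNA             PLN       04-FEB-1995
ACCESSION   U03518
ORIGIN
        1 aacctgcgga aggatcatta
//
LOCUS       EXAMPLE2        4 bp    DNA
ORIGIN
        1 acgt
//
"""
genbank_records = parse_genbank(GENBANK_EXAMPLE)
print(genbank_records)
```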
PIR - PROTEIN SEQUENCE DB
PIR was established in 1984 by the National Biomedical Research Foundation (NBRF) as a resource to assist researchers in the identification and interpretation of protein sequence information.
Prior to that, the NBRF compiled the first comprehensive collection of macromolecular sequences in the Atlas of Protein Sequence and Structure, published from 1965-1978 under the editorship of Margaret O. Dayhoff. Dr. Dayhoff and her research group pioneered the development of computer methods for the comparison of protein sequences, for the detection of distantly related sequences and duplications within sequences, and for the inference of evolutionary histories from alignments of protein sequences.
STRUCTURAL DB-PDB
Protein Data Bank (PDB)
The Protein Data Bank (PDB) is a repository for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids.
The data, typically obtained by X-ray crystallography or NMR spectroscopy and submitted by biologists and biochemists from around the world, are freely accessible on the Internet via the websites of its member organizations.
The PDB is overseen by an organization called the Worldwide Protein Data Bank (wwPDB).
The PDB is a key resource in areas of structural biology, such as structural genomics.
Most major scientific journals, and some funding agencies, now require scientists to submit their structure data to the PDB.
If the contents of the PDB are thought of as primary data, then there are hundreds of derived (i.e., secondary) databases that categorize the data differently.
For example, both SCOP and CATH categorize structures according to type of structure and assumed evolutionary relations.
HEADER, TITLE and AUTHOR records provide information about the researchers who defined the structure; numerous other types of records are available to provide other types of information.
REMARK records can contain free-form annotation, but they also accommodate standardized information; for example, the REMARK 350 BIOMT records describe how to compute the coordinates of the experimentally observed multimer from the explicitly specified coordinates of a single repeating unit.
SEQRES records give the sequences of the three peptide chains (named A, B and C), which are very short in this example but usually span multiple lines.
ATOM records describe the coordinates of the atoms that are part of the protein. For example, the first ATOM line above describes the alpha-N atom of the first residue of peptide chain A, which is a proline residue; the first three floating point numbers are its x, y and z coordinates and are in units of Ångströms.
HETATM records describe the coordinates of hetero-atoms, that is, those atoms which are not part of the protein molecule.
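The record types described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete PDB reader: the function names are hypothetical, the two coordinate lines are illustrative records written in the fixed-column PDB layout, and the BIOMT operator (a two-fold rotation about the z axis) is an assumed example rather than data from a specific entry. ATOM and HETATM lines are sliced by column position; a BIOMT operator is applied as new = R * old + t.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    record: str    # "ATOM" or "HETATM"
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_coordinate_line(line: str) -> Atom:
    """Slice one ATOM/HETATM line using the fixed PDB column layout."""
    return Atom(
        record=line[0:6].strip(),
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain=line[21].strip(),
        res_seq=int(line[22:26]),
        x=float(line[30:38]),      # coordinates are in Angstroms
        y=float(line[38:46]),
        z=float(line[46:54]),
    )


def parse_biomt(lines):
    """Collect the BIOMT1..3 rows of one operator as (rotation, translation)."""
    rows = []
    for line in lines:
        fields = line.split()
        # Expected fields: REMARK 350 BIOMTn <operator no.> r1 r2 r3 t
        if (len(fields) == 8 and fields[:2] == ["REMARK", "350"]
                and fields[2].startswith("BIOMT")):
            rows.append([float(v) for v in fields[4:8]])
    rotation = [row[:3] for row in rows]
    translation = [row[3] for row in rows]
    return rotation, translation


def apply_biomt(atom, rotation, translation):
    """Return the transformed coordinates R * (x, y, z) + t."""
    xyz = (atom.x, atom.y, atom.z)
    return tuple(
        sum(r * c for r, c in zip(row, xyz)) + t
        for row, t in zip(rotation, translation)
    )


# Illustrative coordinate records (values are for demonstration only).
SAMPLE_COORDS = (
    "ATOM      1  N   PRO A   1       8.316  21.206  21.530  1.00 17.44           N",
    "HETATM  130  C   ACY   401       3.682  22.541  11.236  1.00 21.19           C",
)
# Hypothetical operator: two-fold rotation about z, no translation.
SAMPLE_BIOMT = (
    "REMARK 350   BIOMT1   2 -1.000000  0.000000  0.000000        0.00000",
    "REMARK 350   BIOMT2   2  0.000000 -1.000000  0.000000        0.00000",
    "REMARK 350   BIOMT3   2  0.000000  0.000000  1.000000        0.00000",
)

first_atom = parse_coordinate_line(SAMPLE_COORDS[0])
rotation, translation = parse_biomt(SAMPLE_BIOMT)
print(first_atom)
print(apply_biomt(first_atom, rotation, translation))
```

Slicing by column rather than splitting on whitespace matters here because fields such as the chain identifier can be blank, as in the HETATM line, and adjacent numeric fields can touch when values are large.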
PUBCHEM
PubChem is a database of chemical molecules and their activities against biological assays. The system is maintained by the National Center for Biotechnology Information (NCBI), a component of the National Library of Medicine, which is part of the United States National Institutes of Health (NIH). PubChem can be accessed for free through a web user interface. Millions of compound structures and descriptive datasets can be freely downloaded via FTP. PubChem contains substance descriptions and small molecules with fewer than 1000 atoms and 1000 bonds. More than 80 database vendors contribute to the growing PubChem database.
Books and Web References
Book Names:
1. Introduction to Bioinformatics by T. K. Attwood
2. BioInformatics by Sangita
3. Basic Bioinformatics by S. Ignacimuthu, s.j.
http://en.wikipedia.org/wiki/Biological_database
http://bioinformaticsweb.net/data.html
http://www.apbionet.org/s-star/downloads/tutorial/t1b.pdf
Image References
1. & 2. https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR39w90rTM4wRcS2WE4I0zbjV7R6KE8JMVZz4QF0qY6A8W1qti_QQaeDx5X
3. & 4. Book: Basic Bioinformatics by S. Ignacimuthu, s.j.
5. to 18. http://www.ncbi.nlm.nih.gov/
19. https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRGBmCsguJs4geE45YMjE_O80bbqD9dFtCE9fgZYySwzYSIDbIp
21. to 29. http://www.ncbi.nlm.nih.gov/
30. & 31. http://www.rcsb.org/pdb/home/home.do