b.sc biochem i bobi u 2 database

Biological Databases

Course: B.Sc Biochemistry
Subject: Basic of Bioinformatics

Unit: II

What can be discovered about a gene by a database search?

A little or a lot, depending on the gene:
• Evolutionary information: homologous genes, taxonomic distributions, allele frequencies, synteny, etc.
• Genomic information: chromosomal location, introns, UTRs, regulatory regions, shared domains, etc.
• Structural information: associated protein structures, fold types, structural domains
• Expression information: expression specific to particular tissues, developmental stages, phenotypes, diseases, etc.
• Functional information: enzymatic/molecular function, pathway/cellular role, localization, role in diseases

Using a database

How to get information out of a database:
• Browsing: no targeted information to retrieve
• Search: looking for particular information

Searching a database:
• Must have a key that identifies the element(s) of the database that are of interest.
- Name of gene
- Sequence of gene
- Other information
• Helps to have particular informational goals

Searching for information about genes and their products

Gene and gene product databases are often organized by sequence:
• Genomic sequence encodes all traits of an organism.
• Gene products are uniquely described by their sequences.
• Similar sequences among biomolecules often indicate both similar function and an evolutionary relationship.

Macromolecular sequences provide biologically meaningful keys for searching databases.

Searching sequence databases

Start from sequence, find information about it.

Many kinds of input sequences:
• Could be amino acid or nucleotide sequence
• Genomic or mRNA/cDNA or protein sequence
• Complete or fragmentary sequences

Exact matches are rare (even uninteresting in many cases), so often the goal is to retrieve a set of similar sequences.
• Both small (mutations) and large (required for function) differences within “similar” can be interesting.
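As a rough illustration of what "similar" can mean, the sketch below ranks candidate sequences by percent identity to a query. It assumes equal-length, already aligned sequences; real database searches use alignment tools, and the function name and candidate sequences here are made up for the example.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this sketch only compares sequences of equal length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a == b)
    return 100.0 * matches / len(seq_a)


# Hypothetical query and candidates, for illustration only.
query = "ATTCCGAGA"
candidates = {
    "exact": "ATTCCGAGA",
    "one_mutation": "ATTCCGTGA",
    "unrelated": "GGGAAATTT",
}

# Rank candidates from most to least similar to the query.
ranked = sorted(
    candidates,
    key=lambda name: percent_identity(query, candidates[name]),
    reverse=True,
)
```

The single-mutation candidate still scores close to the exact match, which is the kind of near hit a similarity search is meant to surface.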

What might we want to know about a sequence?

• Is this sequence similar to any known genes? How close is the best match? Significance?
• What do we know about that gene?
- Genomic (chromosomal location, allelic information, regulatory regions, etc.)
- Structural (known structure? structural domains? etc.)
- Functional (molecular, cellular & disease)
• Evolutionary information:
- Is this gene found in other organisms?
- What is its taxonomic tree?

A historical perspective

The 1960s: the birth of bioinformatics
• High-level computer languages
• Protein sequence data
• Academic access to computers

Margaret Oakley Dayhoff
• First protein database
• First program for sequence assembly

IBM 7090 computer

Benfey and Protopapas, "Genomics" © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey 07458


By way of comparison…

IBM 7090 computer
• 32 Kbytes RAM
• 2.18 µs memory cycle time
• $2,900,000 in 1960

20” Apple iMac
• 1 GB RAM
• 2.4 GHz
• $1199 in 2008

Solving problems in computer science

Necessary parameters for assessing the difficulty of a computer science problem:
• Algorithmic complexity
- Is the problem theoretically solvable?
- If so, what is the most efficient solution?
• Current state of computer technology
- Memory
- CPU speed
- Cost


Algorithms

• An algorithm is a sequence of instructions that one must perform in order to solve a well-formulated problem.
• First you must identify exactly what the problem is!
• A problem describes a class of computational tasks. A problem instance is one particular input from that class.
• In general, you should design your algorithms to work for any instance of a problem (although there are cases in which this is not possible).

Computer technology: memory, CPU speed, cost

• Dramatic improvements on yearly basis

• We do a lot of our work using desktop Macs out of the box

- 2 quad core 2.8 GHz processors, 500 GB disk space, 4 GB RAM for ~$3000

- 2 quad core 3.0 GHz processors, 2.5 TB disk space, 8 GB RAM for ~$6000

• CPU speed vs. memory: which is more important?

- for protein structure, might need many calculations but limited memory

- for genome searches, might have few calculations but huge amounts to store in memory

• Reading from memory is several orders of magnitude faster than reading from disk

Databases

What is a database?
• A collection of related data elements
- tables
- columns (fields)
- rows (records)
• Records retrieved using a query language
• Database technology is well established

Databases are a fundamental part of the bioinformatics revolution. Much of the conceptual framework for databases had already been developed by the 1960s.

By the 1970s, database technology had already permeated much of the government and corporate sectors.

Modern databases can be described as well-organized collections of data that can be accessed through the use of a query language.

Two databases of particular importance to biologists are GenBank®, which encompasses all publicly available protein and nucleotide sequences, and the Protein Data Bank, which contains high quality 3-D structures of proteins, nucleic acids, and carbohydrates.

The entire sequence of a single human could fit on one or two CD-ROMs. As we shall see shortly, it is the comparison of sequences that presents algorithmic challenges.
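The CD-ROM estimate can be checked with a little arithmetic. The sketch below assumes a haploid genome of about 3.2 billion bases and a 700 MB disc; neither figure comes from the slides, so treat both as round-number assumptions.

```python
import math

GENOME_BASES = 3_200_000_000   # assumption: approximate haploid human genome size
BITS_PER_BASE = 2              # A, C, G and T can each be encoded in 2 bits
CD_ROM_BYTES = 700 * 10**6     # assumption: capacity of one 700 MB CD-ROM

# Packed storage: four bases per byte.
genome_bytes = GENOME_BASES * BITS_PER_BASE // 8
discs_packed = math.ceil(genome_bytes / CD_ROM_BYTES)

# Plain-text storage: one byte (one character) per base.
discs_text = math.ceil(GENOME_BASES / CD_ROM_BYTES)
```

Under these assumptions the packed genome takes 800 MB, so it needs two discs; stored as plain text it would need five. Storage is the easy part, which is the point of the slide.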

Tables (entities)
• basic elements of information to track, e.g., gene, organism, sequence, citation

Columns (fields)
• attributes of tables, e.g., for a citation table: title, journal, volume, author

Rows (records)
• actual data
• whereas fields describe what data is stored, the rows of a table are where the actual data is stored
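The table, field and record vocabulary above can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: the citation table follows the example fields named on the slide, while the rows themselves are invented placeholders.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The table is the entity; its columns are the fields.
conn.execute(
    "CREATE TABLE citation ("
    "id INTEGER PRIMARY KEY, title TEXT, journal TEXT, volume INTEGER, author TEXT)"
)

# Each row is one record. These values are hypothetical.
conn.executemany(
    "INSERT INTO citation (title, journal, volume, author) VALUES (?, ?, ?, ?)",
    [
        ("Hypothetical title one", "Journal A", 12, "Author X"),
        ("Hypothetical title two", "Journal B", 3, "Author Y"),
        ("Hypothetical title three", "Journal A", 15, "Author Z"),
    ],
)

# A query in the query language (SQL): give a value for one field, get records back.
rows = conn.execute(
    "SELECT title, author FROM citation WHERE journal = ? ORDER BY id",
    ("Journal A",),
).fetchall()
```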

Databases

What is a database?

A database is a computerized collection of records used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria. Databases are composed of computer hardware and software for data management.

Each record, also called an entry, should contain a number of fields that hold the actual data items, for example, fields for names, phone numbers, addresses, dates.

To retrieve a particular record from the database, a user can specify a particular piece of information, called a value, to be found in a particular field and expect the computer to retrieve the whole data record.

This process is called making a query.

A biological database is a collection of both experimental and theoretical data that is organized so that its contents can be easily
• accessed
• managed
• updated
• retrieved

The activity of preparing a database can be divided into:
• Collection of data in a form which can be easily accessed
• Making it available to a multi-user system

Types of database

Flat file database

A flat file database describes any of various means to encode a database model (most commonly a table) as a single file. A flat file can be a plain text file or a binary file. There are usually no structural relationships between the records.

"Flat file database" may be defined very narrowly, or more broadly.
• Strictly, a flat file database should consist of nothing but data and, if records vary in length, delimiters.
• More broadly, the term refers to any database which exists in a single file in the form of rows and columns, with no relationships or links between records and fields except the table structure.

Terms used to describe different aspects of a database and its tools differ from one implementation to the next, but the concepts remain the same. FileMaker uses the term "Find", while MySQL uses the term "Query"; but the concept is the same. FileMaker "files", in version 7 and above, are equivalent to MySQL "databases", and so forth.

However, the basic terms "record" and "field" are used in nearly every flat file database implementation.
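A flat file in the strict sense (data plus delimiters) can be read with Python's csv module. The sketch below assumes a tab-delimited file whose first line names the fields; the contents and the find helper are hypothetical.

```python
import csv
import io

# A tiny tab-delimited flat file: one header line, then one record per line.
# The names and numbers are made up for illustration.
FLAT_FILE = (
    "name\tphone\tcity\n"
    "Ada\t555-0101\tLondon\n"
    "Grace\t555-0102\tNew York\n"
)


def find(flat_text: str, field: str, value: str) -> list:
    """Return every record whose given field holds the given value."""
    reader = csv.DictReader(io.StringIO(flat_text), delimiter="\t")
    return [dict(record) for record in reader if record[field] == value]


matches = find(FLAT_FILE, "city", "London")
```

Every search scans the whole file, because nothing links one record to another; that is the trade-off of the flat layout.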

Relational database

Relational databases are both created and queried by DataBase Management Systems (DBMSs).

Relational databases displaced hierarchical databases because the ability to add new relations made it possible to add new information that was valuable but "broke" a database's original hierarchical conception.

The trend continues as a networked planet and social media create the world of "big data", which is larger and less structured than the datasets and tasks that relational databases handle well (it is instructive to compare Hadoop).
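The point about adding new relations can be sketched with sqlite3. Here a gene table exists first and a citation table is added later and linked to it by a key; the table names, gene symbols and titles are illustrative assumptions, not a real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Original design: genes only.
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT);
    INSERT INTO gene VALUES (1, 'ILK'), (2, 'ADH1');

    -- Added later: a new relation that points back at gene by its key.
    CREATE TABLE citation (
        citation_id INTEGER PRIMARY KEY,
        gene_id INTEGER REFERENCES gene(gene_id),
        title TEXT
    );
    INSERT INTO citation VALUES
        (1, 1, 'Hypothetical paper A'),
        (2, 1, 'Hypothetical paper B');
    """
)

# A join combines the two relations without changing the gene table.
rows = conn.execute(
    "SELECT g.symbol, COUNT(c.citation_id) "
    "FROM gene g LEFT JOIN citation c ON c.gene_id = g.gene_id "
    "GROUP BY g.gene_id ORDER BY g.gene_id"
).fetchall()
```

The gene table is untouched by the addition, which is what a fixed hierarchy makes difficult.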


Object oriented database

An object database (also object-oriented database management system) is a database management system in which information is represented in the form of objects as used in object-oriented programming.

Object databases are different from relational databases, which are table-oriented.

Biological database

Online Databases

When you query an online database, your query is translated into SQL, the database is interrogated, and the answer displayed on your web browser.

Your computer and browser (the “client”)

Software to receive and translate the instructions you enter into your browser (on the “server”)

The database itself

Image source: David Lane and Hugh E. Williams. Web Database Applications with PHP & MySQL. O’Reilly (2002).


Biological Databases

• Over 1000 biological databases
• Vary in size, quality, coverage, level of interest
• Many of the major ones covered in the annual Database Issue of Nucleic Acids Research
• What makes a good database?
- comprehensiveness
- accuracy
- is up-to-date
- good interface
- batch search/download
- API (web services, DAS, etc.)

“Ten Important Bioinformatics Databases”

Database     URL                     Content
GenBank      www.ncbi.nlm.nih.gov    nucleotide sequences
Ensembl      www.ensembl.org         human/mouse genome (and others)
PubMed       www.ncbi.nlm.nih.gov    literature references
NR           www.ncbi.nlm.nih.gov    protein sequences
SWISS-PROT   www.expasy.ch           protein sequences
InterPro     www.ebi.ac.uk           protein domains
OMIM         www.ncbi.nlm.nih.gov    genetic diseases
Enzymes      www.chem.qmul.ac.uk     enzymes
PDB          www.rcsb.org/pdb/       protein structures
KEGG         www.genome.ad.jp        metabolic pathways

Source: Bioinformatics for Dummies

NCBI (National Center for Biotechnology Information)

• over 30 databases including GenBank, PubMed, OMIM, and GEO

• Access all NCBI resources via Entrez (www.ncbi.nlm.nih.gov/Entrez/)

PubMed

INFORMATION RETRIEVAL FROM BIOLOGICAL DATABASES
• NCBI-Entrez
• SRS (Sequence Retrieval System)

NCBI and Entrez

The Central Dogma & Biological Data

[Figure: kinds of biological data along the central dogma]
• Original DNA sequences (genomes)
• Expressed DNA sequences (= mRNA sequences = cDNA sequences); Expressed Sequence Tags (ESTs)
• Protein sequences: inferred, or from direct sequencing
• Protein structures: from experiments, or models (homologues)
• Literature information

NCBI Databases and Services

• GenBank: primary sequence database
• Free public access to biomedical literature
- PubMed: free Medline (3 million searches per day)
- PubMed Central: full text online access
• Entrez: integrated molecular and literature databases

PRIMARY VS. DERIVATIVE SEQUENCE DATABASES

[Figure: sequencing centers and labs submit sequences to GenBank, which is updated ONLY by submitters. Derivative databases are built from it by algorithms (UniGene) and by curators (RefSeq, genome assembly), and are updated continually by NCBI.]

Sequence Databases at NCBI

Primary
• GenBank: NCBI’s primary sequence database
• Trace Archive: reads from capillary sequencers
• Sequence Read Archive: next generation data

Derivative
• GenPept (GenBank translations)
• Outside Protein (UniProt/Swiss-Prot, PDB)
• NCBI Reference Sequences (RefSeq)

GENBANK - PRIMARY SEQUENCE DB

Nucleotide only sequence database

Archival (records) in nature
• Historical
• Reflective of submitter point of view (subjective)
• Redundant

Data
• Direct submissions (traditional records)
• Batch submissions
• FTP accounts (genome data)

GENBANK - PRIMARY SEQUENCE DB (2)

Three collaborating databases
1. GenBank
2. DNA Database of Japan (DDBJ)
3. European Molecular Biology Laboratory (EMBL) Database

Traditional GenBank Record

ACCESSION U07418
VERSION U07418.1 GI:466461

Accession
• Stable
• Reportable
• Universal

Version
• Tracks changes in sequence

GI number
• NCBI internal use

well annotated

the sequence is the data

NCBI and Entrez

• One of the most useful and comprehensive sources of databases is the NCBI, part of the National Library of Medicine.
• NCBI provides interesting summaries, browsers for genome data, and search tools.
• Entrez is their database search interface: http://www.ncbi.nlm.nih.gov/Entrez
• Can search on gene names, sequences, chromosomal location, diseases, keywords, ...
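Entrez can also be searched from a program through NCBI's E-utilities web service, which the slides do not cover. The sketch below only builds the search URL and does not contact NCBI; the helper name and the choice of parameters are assumptions for illustration.

```python
from urllib.parse import urlencode

# ESearch endpoint of NCBI E-utilities (not mentioned in the slides).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for one Entrez database and one search term."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return EUTILS_ESEARCH + "?" + query


# Search the Gene database with a keyword, as one would in the web form.
url = build_esearch_url("gene", "alcohol dehydrogenase")
```

Opening the resulting URL returns a list of matching record identifiers, which is roughly what the web interface does before it shows a results page.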

What did we just do?

• Identify loci (genes) associated with the sequence. Input was Alcohol Dehydrogenase.
• For each particular “hit”, we can look at that sequence and its alignment in more detail.
• See similar sequences, and the organisms in which they are found.
• But there’s much more that can be found on these genes, even just inside NCBI…

More from Entrez Gene

And more…

Sequence Retrieval System

The Sequence Retrieval System is a database system that works with flat files. In addition, many bioinformatics tools are incorporated and can be combined with the database searches.

NCBI is not all there is...

Links to non-NCBI databases
• Reactome & KEGG for pathways
• HGNC for nomenclature
• UCSC Human Genome Browser

Other important gene/protein resources not linked to:
• UniProt (most carefully annotated)
• PDB (main macromolecular structure repository)

Other key biological data sources
• Gene Ontology/Open Biological Ontologies
• Enzyme

Scientific society: iscb.org
• Journals, Conferences…

Take home messages

• There are a lot of molecular biology databases, containing a lot of valuable information
• Not even the best databases have everything (or the best of everything)
• These databases are moderately well cross-linked, and there are “linker” databases
• Sequence is a good identifier, maybe even better than gene name!

FILE FORMATS

IG/Stanford    Fitch           Plain/Raw
GenBank/GB     Fasta/Pearson   PIR/CODATA
NBRF           Zuker           MSF
EMBL           Olsen           ASN.1
GCG            Phylip 3.2      PAUP/NEXUS
DNAStrider     Phylip          Pretty

LOCUS, Accession, NID and protein_id

• LOCUS: Unique string of 10 letters and numbers in the database. Not maintained amongst databases, and is therefore a poor sequence identifier.
• ACCESSION: A unique identifier for the record and a citable entity; does not change when the record is updated. A good record identifier, ideal for citation in publication.
• VERSION: New system where the accession and version play the same function as the accession and gi number.
• Nucleotide gi: GenInfo identifier (gi), a unique integer which will change every time the sequence changes.
• PID: Protein identifier: g, e or d prefix to gi number. Can have one or two on one CDS.
• Protein gi: GenInfo identifier (gi), a unique integer which will change every time the sequence changes.
• protein_id: Identifier which has the same structure and function as the nucleotide Accession.version numbers, but a slightly different format.

Accession.version

LOCUS, Accession, gi and PID

LOCUS       HSU40282     1789 bp    mRNA    PRI    21-MAY-1998
DEFINITION  Homo sapiens integrin-linked kinase (ILK) mRNA, complete cds.
ACCESSION   U40282
VERSION     U40282.1  GI:3150001

     CDS             157..1515
                     /gene="ILK"
                     /note="protein serine/threonine kinase"
                     /codon_start=1
                     /product="integrin-linked kinase"
                     /protein_id="AAC16892.1"
                     /db_xref="PID:g3150002"
                     /db_xref="GI:3150002"

LOCUS: HSU40282
ACCESSION: U40282
VERSION: U40282.1
GI: 3150001
PID: g3150002
Protein gi: 3150002
protein_id: AAC16892.1
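The Accession.version scheme is easy to take apart in code. The sketch below splits an identifier such as U40282.1 into its stable accession and its numeric version; the function name is hypothetical and the check is deliberately simple.

```python
def split_accession_version(identifier: str) -> tuple:
    """Split 'ACCESSION.VERSION' into (accession, version number)."""
    accession, dot, version = identifier.partition(".")
    if not accession or not dot or not version.isdigit():
        raise ValueError("expected ACCESSION.VERSION, got " + repr(identifier))
    return accession, int(version)


# The nucleotide record and its protein_id from the example record.
nucleotide = split_accession_version("U40282.1")
protein = split_accession_version("AAC16892.1")
```

The accession part stays the same across updates, while the version number is what changes when the sequence changes.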

PLAIN SEQUENCE FORMAT

A sequence in plain format may contain only IUPAC characters and spaces (no numbers!).

Note: A file in plain sequence format may only contain one sequence, while most other formats accept several sequences in one file.

An example sequence in plain format is:

AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCCTGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGCCGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACTTTCAACAATGGATCT
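The plain-format rule can be checked mechanically. The sketch below assumes a nucleotide sequence and the standard IUPAC nucleotide codes, and it treats any whitespace (including line breaks) as an allowed space; the function name is hypothetical.

```python
# IUPAC nucleotide codes: the four bases, U, and the ambiguity symbols.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def is_plain_nucleotide(text: str) -> bool:
    """True if text holds only IUPAC nucleotide letters and whitespace."""
    residues = [ch for ch in text.upper() if not ch.isspace()]
    return bool(residues) and all(ch in IUPAC_NUCLEOTIDES for ch in residues)


ok = is_plain_nucleotide("AACCTGCGGA AGGATCATTA")
has_numbers = is_plain_nucleotide("1 aacctgcgga aggatcatta")
has_header = is_plain_nucleotide(">U03518 AACCTGCGGA")
```

Position numbers or a FASTA-style header line make the check fail, which matches the "no numbers!" rule above.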


FASTA FORMAT

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data.

The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column.

It is recommended that all lines of text be shorter than 80 characters in length.

An example sequence in FASTA format is:

>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTC
TATTGTACCCTGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTG
CCCCCCGGGCCCGTGCCCGCCGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTC
TGAGTTGATTGAATGCAATCAGTTAAAACTTTCAACAATGGATCTCTTGGTTCCGGC
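Because the rules are so simple, a FASTA reader fits in a few lines. This is a minimal sketch rather than a full parser: it keeps the description line as one string and joins the sequence lines, and the second record in the sample is a hypothetical addition to show a multi-sequence file.

```python
def parse_fasta(text: str) -> list:
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# First 20 bases of the example entry, split over two lines.
SAMPLE = (
    ">U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)\n"
    "AACCTGCGGA\n"
    "AGGATCATTA\n"
    ">second hypothetical record\n"
    "ACGT\n"
)
records = parse_fasta(SAMPLE)
```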

• The first line of each sequence entry is the ID line, which contains the entry name, data class, molecule, division and sequence length.
• The XX line contains no data; it is just a separator.
• The AC line lists the accession number.
• The DE line gives a description of the sequence.
• The FT lines give precise annotation for the sequence.
• The SQ line, which has SQ in the first two spaces, introduces the sequence information; the sequence data begin on the line after it.
• The last line of each sequence entry in the file is a terminator line which has the two characters // in the first two spaces.

ID   AA03518 standard; DNA; FUN; 237 BP.
XX
AC   U03518;
XX
DE   Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE   rRNA and 5.8S rRNA genes, partial sequence.
RX   MEDLINE; 94303342.
RX   PUBMED; 8030378.
XX
FT   rRNA            <1..20
FT                   /product="18S ribosomal RNA"
FT   misc_RNA        21..205
FT                   /standard_name="Internal transcribed spacer 1 (ITS1)"
FT   rRNA            206..>237
FT                   /product="5.8S ribosomal RNA"
SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
     aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc        60
     tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg       120
     ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc       180
     tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc          237
//

EMBL/Swiss Prot (http://www.ebi.ac.uk/help/formats_frame.html)

EMBL FORMAT

A sequence file in EMBL format can contain several sequences.

One sequence entry starts with an identifier line ("ID "), followed by further annotation lines. The start of the sequence is marked by a line starting with "SQ" and the end of the sequence is marked by two slashes ("//").

An example sequence in EMBL format is:

ID   AA03518 standard; DNA; FUN; 237 BP.
XX
AC   U03518;
XX
DE   Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE   rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
     aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc        60
     tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg       120
     ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc       180
     tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc          237
//
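The ID, SQ and // markers are enough to pull entries out of an EMBL file. The sketch below reads only the entry name, the first accession number and the sequence, and ignores every other line type; the sample is the entry above with the sequence truncated to its first 20 bases.

```python
def parse_embl(text: str) -> list:
    """Return a list of dicts with id, accession and sequence per entry."""
    entries, entry, in_sequence = [], None, False
    for line in text.splitlines():
        if entry is None:
            # Outside an entry: wait for the next ID line.
            if line.startswith("ID"):
                entry = {"id": line[2:].split()[0], "accession": None, "sequence": ""}
            continue
        if line.startswith("//"):
            entries.append(entry)
            entry, in_sequence = None, False
        elif in_sequence:
            # Keep the letters; drop spaces and the position counts.
            entry["sequence"] += "".join(ch for ch in line if ch.isalpha())
        elif line.startswith("AC"):
            entry["accession"] = line[2:].split(";")[0].strip()
        elif line.startswith("SQ"):
            in_sequence = True
    return entries


SAMPLE = "\n".join([
    "ID   AA03518 standard; DNA; FUN; 237 BP.",
    "XX",
    "AC   U03518;",
    "XX",
    "DE   Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S",
    "SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;",
    "     aacctgcgga aggatcatta        20",  # truncated for the example
    "//",
])
entries = parse_embl(SAMPLE)
```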

GENBANK FORMAT

A sequence file in GenBank format can contain several sequences. One sequence in GenBank format starts with a line containing the word LOCUS and a number of annotation lines. The start of the sequence is marked by a line containing "ORIGIN" and the end of the sequence is marked by two slashes ("//").

• Can contain several sequences
• One sequence starts with: "LOCUS"
• The sequence starts with: "ORIGIN"
• The sequence ends with: "//"

An example sequence in GenBank format is:

LOCUS       AAU03518     237 bp    DNA    PLN    04-FEB-1995
DEFINITION  Aspergillus awamori internal transcribed spacer 1 (ITS1) and
            18S rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION   U03518
BASE COUNT       41 a     77 c     67 g     52 t
ORIGIN
        1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
       61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
      121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
      181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
//
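The same marker-driven approach works for GenBank flat files, with LOCUS, ORIGIN and // as the landmarks. This is a sketch under the same limits as any hand-written reader: it extracts only the locus name, the accession and the sequence, and the sample sequence is truncated to 20 bases.

```python
def parse_genbank(text: str) -> list:
    """Return a list of dicts with locus, accession and sequence per record."""
    records, record, in_origin = [], None, False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            record = {"locus": line.split()[1], "accession": None, "sequence": ""}
            in_origin = False
        elif record is None:
            continue
        elif line.startswith("//"):
            records.append(record)
            record, in_origin = None, False
        elif in_origin:
            # Sequence lines start with a position number; keep letters only.
            record["sequence"] += "".join(ch for ch in line if ch.isalpha())
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
    return records


SAMPLE = "\n".join([
    "LOCUS       AAU03518     237 bp    DNA    PLN    04-FEB-1995",
    "ACCESSION   U03518",
    "ORIGIN",
    "        1 aacctgcgga aggatcatta",  # truncated for the example
    "//",
])
records = parse_genbank(SAMPLE)
```

For real work an established library is the safer choice, since GenBank records carry many more line types than this sketch handles.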

PIR - PROTEIN SEQUENCE DB

PIR was established in 1984 by the National Biomedical Research Foundation (NBRF) as a resource to assist researchers in the identification and interpretation of protein sequence information.

Prior to that, the NBRF compiled the first comprehensive collection of macromolecular sequences in the Atlas of Protein Sequence and Structure, published from 1965-1978 under the editorship of Margaret O. Dayhoff. Dr. Dayhoff and her research group pioneered the development of computer methods for the comparison of protein sequences, for the detection of distantly related sequences and duplications within sequences, and for the inference of evolutionary histories from alignments of protein sequences.

STRUCTURAL DB - PDB

30.

Protein Data Bank (PDB)

31.

The Protein Data Bank (PDB) is a repository for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids.

The data, typically obtained by X-ray crystallography or NMR spectroscopy and submitted by biologists and biochemists from around the world, are freely accessible on the Internet via the websites of its member organisations.

The PDB is overseen by an organization called the Worldwide Protein Data Bank, wwPDB.

The PDB is a key resource in areas of structural biology, such as structural genomics.

Most major scientific journals, and some funding agencies, now require scientists to submit their structure data to the PDB.

If the contents of the PDB are thought of as primary data, then there are hundreds of derived (i.e., secondary) databases that categorize the data differently.

For example, both SCOP and CATH categorize structures according to type of structure and assumed evolutionary relations.

HEADER, TITLE and AUTHOR records provide information about the researchers who defined the structure; numerous other types of records are available to provide other types of information.

REMARK records can contain free-form annotation, but they also accommodate standardized information; for example, the REMARK 350 BIOMT records describe how to compute the coordinates of the experimentally observed multimer from those of the explicitly specified ones of a single repeating unit.

SEQRES records give the sequences of the three peptide chains (named A, B and C), which are very short in this example but usually span multiple lines.

ATOM records describe the coordinates of the atoms that are part of the protein. For example, the first ATOM line above describes the alpha-N atom of the first residue of peptide chain A, which is a proline residue; the first three floating point numbers are its x, y and z coordinates and are in units of Ångströms.

HETATM records describe coordinates of hetero-atoms, that is, those atoms which are not part of the protein molecule.
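The ATOM and HETATM records are fixed-width lines, so their fields can be read by column position. The Python sketch below illustrates this under stated assumptions: the column ranges follow the standard PDB coordinate-record layout (atom name in columns 13-16, residue name in 18-20, chain in 22, x, y and z in 31-54), the function name `parse_atom_line` is hypothetical, and the sample lines use made-up coordinate values rather than data from a real PDB entry.

```python
# Made-up sample lines in PDB fixed-column layout (not from a real entry).
SAMPLE_LINES = [
    "REMARK   2 HYPOTHETICAL EXAMPLE",
    "ATOM      1  N   PRO A   1       1.000   2.500  -3.250  1.00 10.00           N",
    "HETATM    2  O   HOH A 101       4.000   5.000   6.000  1.00 20.00           O",
]


def parse_atom_line(line):
    """Read the fixed-width fields of one ATOM or HETATM record.

    Returns None for any other record type.
    """
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        return None
    return {
        "record": record,                 # ATOM or HETATM
        "serial": int(line[6:11]),        # atom serial number
        "atom": line[12:16].strip(),      # atom name, e.g. N, CA, O
        "residue": line[17:20].strip(),   # residue name, e.g. PRO
        "chain": line[21],                # chain identifier
        "res_seq": int(line[22:26]),      # residue sequence number
        # x, y and z coordinates in Angstroms
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


atoms = [a for a in map(parse_atom_line, SAMPLE_LINES) if a is not None]
for a in atoms:
    print(a["record"], a["atom"], a["residue"], a["chain"], a["xyz"])
```

Slicing by column rather than splitting on whitespace is the safer choice here, because neighbouring fields in a PDB line may touch each other with no space in between.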

PUBCHEM

PubChem is a database of chemical molecules and their activities against biological assays. The system is maintained by the National Center for Biotechnology Information (NCBI), a component of the National Library of Medicine, which is part of the United States National Institutes of Health (NIH). PubChem can be accessed for free through a web user interface. Millions of compound structures and descriptive datasets can be freely downloaded via FTP. PubChem contains substance descriptions and small molecules with fewer than 1000 atoms and 1000 bonds. More than 80 database vendors contribute to the growing PubChem database.

Books and Web References

Books Name:

1. Introduction To Bioinformatics by T. K. Attwood

2. BioInformatics by Sangita

3. Basic Bioinformatics by S. Ignacimuthu, s.j.

http://en.wikipedia.org/wiki/Biological_database
http://bioinformaticsweb.net/data.html
http://www.apbionet.org/s-star/downloads/tutorial/t1b.pdf


Image References

1. & 2. https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR39w90rTM4wRcS2WE4I0zbjV7R6KE8JMVZz4QF0qY6A8W1qti_QQaeDx5X

3. & 4. Book: Basic Bioinformatics by S. Ignacimuthu, s.j.

5. to 18. http://www.ncbi.nlm.nih.gov/

19. https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRGBmCsguJs4geE45YMjE_O80bbqD9dFtCE9fgZYySwzYSIDbIp

21. to 29. http://www.ncbi.nlm.nih.gov/

30. & 31. http://www.rcsb.org/pdb/home/home.do