identification of gene from databases.core

Upload: drhmpatel

Post on 30-May-2018


  • 8/14/2019 Identification of Gene From Databases.core

    1/32

    Identification of Genes from Databases

    Presented with respect to
    Dr. C.G. Joshi, Professor
    Department of Animal Biotechnology, College of Veterinary Science & A.H., Anand

    Presented by
    Patel Hiren M.
    M.V.Sc. (Anim. Biotechnology)


    Introduction

    Evolution is somewhat conservative.
    Evolution seems to have often involved the duplication and divergence of genes.
    A certain sequence may indicate a certain function.
    A database is a structured set of data held in a computer.
    The structures of new genes are constantly being added.


    Databases Require Some Basic Knowledge

    What is a gene?
    What is gene structure?
    - Prokaryotes
    - Eukaryotes
    ORFs (Open Reading Frames)
    cDNA library
    Human Genome Project
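
    To make the ORF idea concrete, here is a minimal sketch of an open reading frame scan. It is an illustration only: it looks at the forward strand, treats ATG as the only start codon, and the function name `find_orfs` and the `min_codons` cutoff are assumptions, not something from the slides.

```python
# Minimal ORF scan on the forward strand only (an illustrative simplification).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end) index pairs of ORFs (ATG ... stop codon, end exclusive)."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # look for the next ORF in this frame
    return sorted(orfs)

print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14)], i.e. ATGAAATTTTAG
```

    A real gene finder would also scan the reverse complement and, for eukaryotes, account for introns.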


    Eukaryotic Genes

    Much more complex than in prokaryotes.

    Large genomes (0.1 to 3 billion bases)

    A typical mammalian cell has 1,500 times as much DNA as a cell of E. coli.

    Low coding density (


    Gene Structure: Eukaryotes


    Data Mining

    Development of new tools for data mining:
    Sequence alignment
    Genome sequencing
    Genome comparison
    Microarray data analysis
    Proteomics data analysis
    Small-molecule array analysis


    What is a database?

    A database is a collection of information stored in a computer in a systematic way, such that a computer program can consult it to answer questions.
    The software used to manage and query a database is known as a database management system (DBMS).
    The properties of database systems
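
    As a small illustration of a program consulting a database to answer a question, the sketch below uses SQLite (a DBMS bundled with Python). The table layout, the gene symbols and the helper name `genes_on_chromosome` are made up for the example and do not correspond to any real gene database.

```python
import sqlite3

# In-memory database; the table layout and the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (symbol TEXT PRIMARY KEY, organism TEXT, chromosome TEXT)")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [
        ("GENE_A", "Bos taurus", "1"),
        ("GENE_B", "Bos taurus", "6"),
        ("GENE_C", "Homo sapiens", "6"),
    ],
)

def genes_on_chromosome(connection, organism, chromosome):
    """Ask the DBMS which genes of one organism lie on one chromosome."""
    rows = connection.execute(
        "SELECT symbol FROM gene WHERE organism = ? AND chromosome = ? ORDER BY symbol",
        (organism, chromosome),
    )
    return [symbol for (symbol,) in rows]

print(genes_on_chromosome(conn, "Bos taurus", "6"))  # ['GENE_B']
```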


    for Gene Identification

    Find candidate genes for the trait (time and cost!)
    - What genes are there?
    - How are the genes expressed?
    - What do they do?
    - How could they play a role in the disease?
    - Gene synonyms
    - Gene location


    DATA SOURCES

    PubMed
    Conserved Domain Database
    GeneAtlas
    dbSNP

    Links to above-mentioned databases:
    Gene: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene
    PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
    CDD: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd
    Homologene: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=homologene
    GeneAtlas: http://wombat.gnf.org/
    dbSNP: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=snp


    Gene Databases

    Once a genome is in place, it is desirable to study the regions that make a particular organism what it is.
    One such resource is located in the genetic regions of the organism.
    Several databases of genes and related structures exist.
    One such database is the RefSeq database curated at NCBI.


    Genomic Database Resource

    Ensembl - http://www.ensembl.org
    19 species
    UCSC Genome Browser - http://genome.ucsc.edu/
    28 species (Insects!)
    NCBI MapViewer - http://www.ncbi.nlm.nih.gov/mapview/
    38 species (Plants, Fungi!)


    Comparison of Sequence against Sequence Database

    The most commonly used programs for comparing an unknown sequence against the sequences in a database are BLAST and FASTA.
    BLAST and FASTA are heuristic derivatives of the Smith-Waterman algorithm.
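
    For reference, the Smith-Waterman local alignment recurrence can be sketched in a few lines. This is a minimal version that returns only the best score, with a linear gap penalty; the scoring values (match 2, mismatch -1, gap -2) are arbitrary choices for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)           # previous row of the dynamic-programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # 6, from the shared segment ACG
```

    The full matrix costs time proportional to the product of the two sequence lengths, which is why faster heuristics are preferred for whole-database searches.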


    The FASTA algorithm

    Developed by Lipman and Pearson (1985).
    First program to search sequence databases for gapped local alignments.
    The best-scoring local region is given as output.
    It is an approximate heuristic algorithm used to compute sub-optimal pairwise similarity.
    http://www-nbrf.georgetown.edu/pirwww/search/fasta.html
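
    The heuristic in FASTA starts from short identical words (k-tuples) shared by the query and a database sequence and looks for the diagonal on which most of them fall. The sketch below covers only that first stage, in simplified form; the function name `best_diagonal` and the tie-breaking rule are assumptions made for the example, and the later rescoring and gapped-alignment stages are omitted.

```python
from collections import defaultdict

def best_diagonal(query, subject, ktup=2):
    """Return (diagonal offset, shared k-tuple count) for the densest diagonal, or None."""
    # Index every k-tuple of the subject by its start positions.
    positions = defaultdict(list)
    for j in range(len(subject) - ktup + 1):
        positions[subject[j:j + ktup]].append(j)
    # Each shared k-tuple votes for the diagonal (subject position - query position).
    hits = defaultdict(int)
    for i in range(len(query) - ktup + 1):
        for j in positions.get(query[i:i + ktup], []):
            hits[j - i] += 1
    if not hits:
        return None
    # Most hits wins; ties go to the diagonal closest to the main one.
    return max(hits.items(), key=lambda item: (item[1], -abs(item[0])))

print(best_diagonal("ACGTAC", "TTACGTACTT"))  # (2, 5)
```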

    What is BLAST?

    BLAST (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases, regardless of whether the query is protein or DNA.
    "Local" means it searches and aligns sequence segments rather than aligning the entire sequence.
    It is able to detect relationships among sequences which share only isolated regions of similarity.
    Currently, it is the most popular and most widely accepted sequence analysis tool.
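
    The "local" behaviour can be illustrated with a simplified seed-and-extend sketch: find exact word matches, then extend each one without gaps until the score drops too far. This is not the actual BLAST implementation; the word size, scores, drop-off value and function names are all assumptions chosen for the example.

```python
def ungapped_extend(query, subject, q_pos, s_pos, word=3, match=1, mismatch=-2, x_drop=3):
    """Extend an exact word hit in both directions; return (score, q_start, q_end)."""
    def walk(qi, si, step):
        score = best = 0
        length = best_len = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            score += match if query[qi] == subject[si] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= x_drop:
                break                       # score fell too far below its best: stop
            qi += step
            si += step
        return best, best_len

    right, right_len = walk(q_pos + word, s_pos + word, 1)
    left, left_len = walk(q_pos - 1, s_pos - 1, -1)
    return word * match + left + right, q_pos - left_len, q_pos + word + right_len

def local_hits(query, subject, word=3):
    """Yield one extended segment for every exact word match (seed)."""
    index = {}
    for j in range(len(subject) - word + 1):
        index.setdefault(subject[j:j + word], []).append(j)
    for i in range(len(query) - word + 1):
        for j in index.get(query[i:i + word], []):
            yield ungapped_extend(query, subject, i, j, word)

print(max(local_hits("GGGACGTACCC", "TTTACGTATTT")))  # (5, 3, 8): query[3:8] is ACGTA
```

    Only the shared segment is reported, even though the flanking regions of the two sequences are unrelated.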


    Why BLAST?

    Identify unknown sequences: the best way to identify an unknown sequence is to see if that sequence already exists in a public database. If the database sequence is well characterized, then you may have access to a wealth of biological information.
    Help gene/protein function and structure prediction: genes with similar sequences tend to share similar functions or structures.
    Identify protein families: group related (paralogous or orthologous) genes and their proteins into a family.
    Prepare sequences for multiple alignments.


    Go to BLAST

    Go to nucleotide BLAST

    SOFTWARE FOR IDENTIFICATION OF GENES

    Some computational tools that are most commonly used for gene prediction:
    Gene Mark
    Glimmer M
    GRAIL
    GenScan
    Genebuilder


    Gene Mark

    Used for finding prokaryotic genes.
    This software employs a non-homogeneous Markov model to classify DNA regions into protein-coding and non-coding sequences.
    Limitation: the query sequence must be more than 100 kbp.
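
    The idea of a non-homogeneous Markov model can be sketched as follows: in coding DNA the transition probabilities depend on the codon position, so one table is kept per position, while a single table is used for non-coding DNA. This is a toy first-order version with add-one smoothing, not Gene Mark's actual model; the training strings are artificial and the sequence is assumed to start in frame.

```python
import math
from collections import defaultdict

BASES = "ACGT"

def train(seqs, periodic):
    """Count first-order transitions; with periodic=True keep one table per codon position."""
    counts = defaultdict(lambda: 1.0)           # every transition starts at 1 (add-one smoothing)
    for seq in seqs:
        for i in range(1, len(seq)):
            phase = i % 3 if periodic else 0
            counts[(phase, seq[i - 1], seq[i])] += 1
    return counts

def log_likelihood(seq, counts, periodic):
    total = 0.0
    for i in range(1, len(seq)):
        phase = i % 3 if periodic else 0
        row = sum(counts[(phase, seq[i - 1], b)] for b in BASES)
        total += math.log(counts[(phase, seq[i - 1], seq[i])] / row)
    return total

def looks_coding(seq, coding, noncoding):
    """Classify by comparing the likelihood under the two models."""
    return log_likelihood(seq, coding, True) > log_likelihood(seq, noncoding, False)

# Artificial training data, for illustration only.
coding_model = train(["GCT" * 8], periodic=True)
noncoding_model = train(["AT" * 12], periodic=False)
print(looks_coding("GCT" * 4, coding_model, noncoding_model))  # True
print(looks_coding("AT" * 6, coding_model, noncoding_model))   # False
```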


    Glimmer

    Glimmer uses interpolated Markov models to identify coding regions and distinguish them from non-coding DNA.
    Glimmer is used as the primary gene finder tool at TIGR.
    The computation consists of two steps, namely model building and gene prediction. Model building involves training on the input sequence, which optimizes the parameters of the model.
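
    The two steps can be sketched with a much simplified interpolated Markov model: model building counts contexts of several lengths, and prediction blends the estimates so that well-observed longer contexts get more weight. The weighting rule used here, n / (n + pseudo), is an assumption chosen for simplicity and is not the weighting scheme Glimmer itself uses.

```python
from collections import Counter

def build_counts(seq, max_order):
    """Model building: count every context of length 0..max_order followed by a base."""
    counts = Counter()
    for i in range(len(seq)):
        for k in range(min(i, max_order) + 1):
            counts[(seq[i - k:i], seq[i])] += 1
    return counts

def imm_prob(counts, context, base, max_order, pseudo=5.0):
    """Interpolate estimates from order 0 up to max_order for P(base | context)."""
    prob = 0.25                                  # uniform prior over A, C, G, T
    for k in range(min(len(context), max_order) + 1):
        ctx = context[len(context) - k:]         # last k bases of the context
        n = sum(counts[(ctx, b)] for b in "ACGT")
        if n == 0:
            break                                # context never seen: keep the shorter-order estimate
        weight = n / (n + pseudo)                # frequent contexts are trusted more
        prob = weight * counts[(ctx, base)] / n + (1 - weight) * prob
    return prob

model = build_counts("ACG" * 4, max_order=2)
print(imm_prob(model, "AC", "G", 2) > imm_prob(model, "AC", "T", 2))  # True
```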


    GRAIL

    Used for eukaryotes.
    This tool identifies exons, polyA sites, promoters, CpG islands, repetitive elements and frameshift errors in DNA sequences by comparing them to a database of known human and mouse sequence elements.
    Based on a neural network algorithm.
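
    As an illustration of one of these features, a CpG island is usually described by its GC fraction and by the ratio of observed to expected CpG dinucleotides. The sketch below computes both for a single window. It is not GRAIL's neural network method, and the thresholds (GC above 0.5, ratio above 0.6) are commonly quoted values used here as assumptions.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for one window."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count under independence is c * g / n.
    ratio = cpg * n / (c * g) if c and g else 0.0
    return gc_fraction, ratio

def is_cpg_island_like(seq, min_gc=0.5, min_ratio=0.6):
    gc_fraction, ratio = cpg_stats(seq)
    return gc_fraction > min_gc and ratio > min_ratio

print(cpg_stats("CGCGCGCG"))           # (1.0, 2.0)
print(is_cpg_island_like("ATATGCAT"))  # False
```

    In practice the test is applied to sliding windows of a few hundred bases rather than to such short strings.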


    GenScan

    The program uses a probabilistic model of gene structure that is based on actual biological information about the transcriptional, translational and splicing signals.
    Its high speed and accuracy make GenScan the method of choice for the initial analysis of large stretches of eukaryotic genomic DNA.
    GenScan has been used as the principal tool for gene prediction in the international Human Genome Project.


    Cont.

    Makes predictions based on 5th-order HMMs.
    It combines hexamer frequencies with coding signals (initiation codons, TATA box, cap site, polyA, etc.) in prediction.
    Exons are assigned a probability score (P) of being a true exon. Only predictions with P > 0.5 are deemed reliable.
    This program is trained for sequences from vertebrates, Arabidopsis, and maize. It has been used extensively in annotating the human genome.
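
    Two of these points are easy to show in code: counting hexamer frequencies in a sequence, and keeping only exon predictions with P > 0.5. The sketch below is a minimal illustration; the record layout (dictionaries with `start`, `end` and `p` keys) is hypothetical and is not GenScan's output format.

```python
from collections import Counter

def hexamer_frequencies(seq):
    """Relative frequency of each overlapping 6-mer in seq."""
    counts = Counter(seq[i:i + 6] for i in range(len(seq) - 5))
    total = sum(counts.values())
    return {hexamer: n / total for hexamer, n in counts.items()} if total else {}

def reliable_exons(predictions, threshold=0.5):
    """Keep predicted exons whose probability score P exceeds the threshold."""
    return [exon for exon in predictions if exon["p"] > threshold]

# Hypothetical predictions, for illustration only.
predicted = [
    {"start": 120, "end": 310, "p": 0.93},
    {"start": 800, "end": 905, "p": 0.50},
    {"start": 1500, "end": 1620, "p": 0.21},
]
print(reliable_exons(predicted))  # only the first record has P > 0.5
```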


    Genebuilder

    Genebuilder performs ab initio gene prediction using numerous parameters, such as GC content, di-codon frequencies, splicing site data, CpG islands, repetitive elements and others. It also performs BLAST searches of predicted genes against protein and EST databases.
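
    Two of the listed parameters, GC content and di-codon frequencies, can be computed as in the sketch below. This only shows how the raw features might be obtained; how Genebuilder weights and combines them is not covered, and the function names are assumptions. The sequence passed to `dicodon_counts` is assumed to be in frame.

```python
from collections import Counter

def gc_content(seq):
    """Fraction of G and C bases in seq."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def dicodon_counts(cds):
    """Count adjacent codon pairs (6 bases, stepping one codon at a time)."""
    cds = cds.upper()
    return Counter(cds[i:i + 6] for i in range(0, len(cds) - 5, 3))

print(gc_content("GGCCAATT"))                       # 0.5
print(dicodon_counts("ATGGCTGCTGCTTAA")["GCTGCT"])  # 2
```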
