identification of gene from databases.core
TRANSCRIPT
-
8/14/2019 Identification of Gene From Databases.core
1/32
Identification of Genes from Databases
Presented with Respect to
Dr. C.G. Joshi, Professor
Department of Animal Biotechnology, College of Veterinary Science & A.H., Anand
Presented by Patel Hiren M.
M.V.Sc. (Anim. Biotechnology)
-
Introduction
Evolution is somewhat conservative.
Evolution often seems to have involved the duplication and divergence of genes.
A certain sequence may indicate a certain function.
Database: a structured set of data held in a computer.
The structures of new genes are constantly being added.
-
Databases Require Some Basic Knowledge
What is a gene?
What is gene structure?
- Prokaryotes
- Eukaryotes
ORFs (Open Reading Frames)
cDNA library
Human Genome Project
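As a rough illustration of what an ORF is in practice, the sketch below scans the three forward reading frames of a DNA string for a start codon followed by an in-frame stop codon. It assumes the standard start codon ATG and the stop codons TAA, TAG and TGA, ignores the reverse strand, and uses a hypothetical function name (find_orfs) and parameter (min_codons) that do not come from the slides.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs of forward-strand ORFs, stop codon included.

    min_codons is a hypothetical cut-off: the number of codons required
    before the stop codon, counting the ATG itself.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

print(find_orfs("CCATGAAATTTTAGCC"))
```

Real gene finders add much more (both strands, alternative start codons, length statistics), so this is only a picture of the basic idea.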
-
Eukaryotic Genes
Much more complex than in prokaryotes.
Large genomes (0.1 to 3 billion bases)
A typical mammalian cell has 1,500 times as much DNA as an E. coli cell.
Low coding density (
-
Gene Structure Eukaryotes
-
Data Mining
Development of new tools for data mining
Sequence alignment
Genome sequencing
Genome comparison
Microarray data analysis
Proteomics data analysis
Small-molecule array analysis
-
What is a database?
A database is a collection of information stored in a computer in a systematic way, such that a computer program can consult it to answer questions.
The software used to manage and query a database is known as a database management system (DBMS).
The properties of database systems
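To make the idea of a program consulting a database concrete, here is a minimal sketch using SQLite through Python's standard sqlite3 module as the DBMS. The table layout, gene symbols and organism names are invented placeholders, not records from any real gene database.

```python
import sqlite3

# In-memory toy database; schema and rows are hypothetical placeholders.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (symbol TEXT PRIMARY KEY, organism TEXT, chromosome TEXT)"
)
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [
        ("GENE_A", "organism_1", "1"),
        ("GENE_B", "organism_1", "2"),
        ("GENE_C", "organism_2", "1"),
    ],
)

def genes_on_chromosome(connection, organism, chromosome):
    """Ask the DBMS which genes of an organism lie on a given chromosome."""
    rows = connection.execute(
        "SELECT symbol FROM gene WHERE organism = ? AND chromosome = ? "
        "ORDER BY symbol",
        (organism, chromosome),
    )
    return [symbol for (symbol,) in rows]

print(genes_on_chromosome(conn, "organism_1", "1"))
```

The program never reads the stored data directly; it hands a question (the SELECT statement) to the DBMS and receives the answer.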
-
for Gene Identification
Find candidate genes for the trait (time and cost!)
- What genes are there?
- How are the genes expressed?
- What do they do?
- How could they play a role in the disease?
- Gene synonyms
- Gene location
-
DATA SOURCES
PubMed
Conserved Domain Database (CDD)
GeneAtlas
dbSNP
Links to the above-mentioned databases:
Gene: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene
PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
CDD: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd
Homologene: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=homologene
GeneAtlas: http://wombat.gnf.org/
dbSNP: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=snp
-
Gene Databases
Once a genome is in place, it is desirable to study the regions that make a particular organism what it is.
One such resource is located in the genetic regions of the organism.
Several databases of genes and related structures exist.
One such database is the RefSeq database, curated at NCBI.
-
Genomic Database Resources
Ensembl - http://www.ensembl.org
19 species
UCSC Genome Browser - http://genome.ucsc.edu/
28 species (Insects!)
NCBI MapViewer - http://www.ncbi.nlm.nih.gov/mapview/
38 species (Plants, Fungi!)
-
Comparison of Sequence against Sequence Database
The most commonly used programs for comparing an unknown sequence against the sequences in a database are BLAST and FASTA.
BLAST and FASTA are faster heuristic approximations of the Smith-Waterman local alignment algorithm.
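For reference, the Smith-Waterman recurrence that both programs approximate can be sketched in a few lines. The version below returns only the best local alignment score and assumes a simple scoring scheme (match +2, mismatch -1, linear gap penalty -2); these values and the function name are illustrative choices, not the defaults of BLAST or FASTA.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (score only, no traceback)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this is what makes it local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("GGACGTCC", "TTACGTAA"))
```

The cost is proportional to the product of the two sequence lengths, which is why database-scale searches rely on heuristics instead of running this against every entry.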
-
The FASTA algorithm
Developed by Lipman and Pearson, 1985
First program to search sequence databases for gapped local alignment
The best scoring local region is given as output
It is an approximate heuristic algorithm used to compute sub-optimal pairwise similarity.
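One way to picture the heuristic is the word-lookup step that FASTA-style searches start from: index the short words (k-tuples) of the query, look up each word of the database sequence, and see which diagonal collects the most hits. The sketch below is a simplified illustration of that first step only, not Lipman and Pearson's implementation; the function name is hypothetical and ktup=3 is an arbitrary choice.

```python
from collections import Counter, defaultdict

def best_diagonal(query, subject, ktup=3):
    """Return (diagonal, hits) for the diagonal sharing the most k-tuples.

    A diagonal is subject position minus query position. Returns None
    when the two sequences share no k-tuple at all.
    """
    # Index every k-tuple of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)

    # Look up each k-tuple of the subject and count hits per diagonal.
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            diagonals[j - i] += 1

    if not diagonals:
        return None
    return diagonals.most_common(1)[0]

print(best_diagonal("ACGTAC", "TTACGTACTT"))
```

Only the regions around the best diagonals would then be rescored and aligned with gaps, which is where the speed-up over a full dynamic-programming comparison comes from.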
http://www-nbrf.georgetown.edu/pirwww/search/fasta.html
-
What is BLAST?
BLAST (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA.
"Local" means it searches and aligns sequence segments, rather than aligning the entire sequence.
It is able to detect relationships among sequences which share only isolated regions of similarity.
Currently, it is the most popular and most accepted sequence analysis tool.
-
Why BLAST?
Identify unknown sequences - The best way to identify an unknown sequence is to see if that sequence already exists in a public database. If the database sequence is a well-characterized sequence, then you may have access to a wealth of biological information.
Help gene/protein function and structure prediction - Genes with similar sequences tend to share similar functions or structure.
Identify protein families - Group related (paralog or ortholog) genes and their proteins into a family.
Prepare sequences for multiple alignments
-
Go to BLAST
-
Go to nucleotide BLAST
-
SOFTWARE FOR IDENTIFICATION OF GENES
Some of the computational tools most commonly used for gene prediction:
GeneMark
Glimmer M
GRAIL
GenScan
Genebuilder
-
GeneMark
Used for finding prokaryotic genes.
This software employs a non-homogeneous Markov model to classify DNA regions into protein-coding and non-coding sequences.
Limitation: the query sequence must be more than 100 kbp long.
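The classification idea can be sketched with a toy model. Below, a "coding" model keeps a separate first-order transition table for each codon position (one simple way to make a Markov chain non-homogeneous), a "non-coding" model keeps a single table, and a sequence is scored by the log-odds between the two. The training strings, the first-order choice and all function names are assumptions made for illustration; this is not GeneMark's actual model or parameters.

```python
import math

BASES = "ACGT"

def train(sequences, period):
    """Estimate first-order transition probabilities, one table per phase."""
    # counts[phase][previous][next], started at 1 (add-one smoothing).
    counts = [{p: {n: 1.0 for n in BASES} for p in BASES} for _ in range(period)]
    for seq in sequences:
        for i in range(1, len(seq)):
            counts[i % period][seq[i - 1]][seq[i]] += 1
    model = []
    for table in counts:
        probs = {}
        for prev, row in table.items():
            total = sum(row.values())
            probs[prev] = {nxt: c / total for nxt, c in row.items()}
        model.append(probs)
    return model

def log_likelihood(model, seq):
    """Log-probability of seq, assuming it starts at phase 0."""
    period = len(model)
    return sum(
        math.log(model[i % period][seq[i - 1]][seq[i]]) for i in range(1, len(seq))
    )

def coding_log_odds(coding_model, noncoding_model, seq):
    """Positive values favour the coding model, negative the non-coding one."""
    return log_likelihood(coding_model, seq) - log_likelihood(noncoding_model, seq)

# Toy training data, invented for illustration only.
coding_model = train(["ATGGCCGCCGCCGCCGCCTAA"], period=3)
noncoding_model = train(["TATATATATTTATAAATATAT"], period=1)

print(coding_log_odds(coding_model, noncoding_model, "GCCGCCGCC") > 0)
print(coding_log_odds(coding_model, noncoding_model, "ATATATAT") > 0)
```

A practical gene finder would slide such a score along the genome in all six reading frames and use far larger training sets and higher-order models.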
-
Glimmer
Glimmer uses interpolated Markov models to identify coding regions and distinguish them from non-coding DNA.
Glimmer is used as the primary gene finder tool at TIGR.
The computation consists of two steps, namely model building and gene prediction. Model building involves training on the input sequence, which optimizes the parameters of the model.
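The two steps can be illustrated with a much-simplified interpolation scheme. In the sketch below, model building counts which base follows every context of length 0 up to max_order, and prediction blends the estimates from short to long contexts, trusting a context more the more often it was seen. The weighting rule (count / (count + pseudo)) is a generic smoothing choice assumed here for illustration; Glimmer's own weighting is more elaborate, and all names are hypothetical.

```python
from collections import defaultdict

def build_counts(training_seqs, max_order):
    """Model building: count the next base after every context up to max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in training_seqs:
        for i in range(len(seq)):
            for k in range(max_order + 1):
                if i - k >= 0:
                    counts[seq[i - k:i]][seq[i]] += 1
    return counts

def interpolated_prob(counts, context, base, pseudo=4.0):
    """Prediction: blend estimates from order 0 up to len(context)."""
    prob = 0.25  # uniform fallback over A, C, G, T
    for k in range(len(context) + 1):
        ctx = context[len(context) - k:]
        row = counts.get(ctx)
        if not row:
            break  # longer contexts were never seen either
        total = sum(row.values())
        weight = total / (total + pseudo)  # more observations, more trust
        prob = weight * (row.get(base, 0) / total) + (1 - weight) * prob
    return prob

# Toy training sequence, invented for illustration only.
counts = build_counts(["ACGACGACGACG"], max_order=2)
print(interpolated_prob(counts, "AC", "G"))
print(interpolated_prob(counts, "AC", "T"))
```

Because rarely seen long contexts get little weight, the model falls back gracefully to shorter contexts instead of overfitting sparse counts.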
-
GRAIL
Used for eukaryotes
This tool identifies exons, polyA sites, promoters, CpG islands, repetitive elements and frame shift errors in DNA sequences by comparing them to a database of known human and mouse sequence elements.
Based on a neural network algorithm
-
GenScan
The program uses a probabilistic model of gene structure that is based on actual biological information about the transcriptional, translational and splicing signals.
Its high speed and accuracy make GenScan the method of choice for the initial analysis of large stretches of eukaryotic genomic DNA.
GenScan has been used as the principal tool for gene prediction in the international Human Genome Project.
-
Cont.
Makes predictions based on 5th-order HMMs.
It combines hexamer frequencies with coding signals (initiation codons, TATA box, cap site, polyA, etc.) in prediction.
Exons are assigned a probability score (P) of being a true exon. Only predictions with P > 0.5 are deemed reliable.
This program is trained on sequences from vertebrates, Arabidopsis, and maize. It has been used extensively in annotating the human genome.
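Two of the ingredients mentioned above are easy to picture in code: counting hexamer (6-base word) frequencies in a sequence, and keeping only exon predictions whose probability exceeds 0.5. The sketch below assumes predictions arrive as (name, P) pairs; that layout and both function names are hypothetical and are not GenScan's output format.

```python
from collections import Counter

def hexamer_frequencies(seq):
    """Relative frequency of every overlapping 6-base word in seq."""
    seq = seq.upper()
    windows = [seq[i:i + 6] for i in range(len(seq) - 5)]
    counts = Counter(windows)
    total = len(windows)
    return {hexamer: n / total for hexamer, n in counts.items()}

def reliable_exons(predictions, threshold=0.5):
    """Keep (name, P) pairs whose probability is strictly above the threshold."""
    return [(name, p) for name, p in predictions if p > threshold]

print(hexamer_frequencies("ATGATGATG"))
print(reliable_exons([("exon_1", 0.9), ("exon_2", 0.5), ("exon_3", 0.2)]))
```

In a real gene finder the hexamer table would be estimated from known coding and non-coding sequences and compared between the two, rather than computed on the query alone.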
-
Genebuilder
Genebuilder performs ab initio gene prediction using numerous parameters, such as GC content, di-codon frequencies, splicing site data, CpG islands, repetitive elements and others. It also performs BLAST searches of predicted genes against protein and EST databases.
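Two of the listed parameters, GC content and the CpG signal, reduce to simple counts. The sketch below computes the GC fraction of a sequence and the commonly used observed/expected CpG ratio, (number of CG dinucleotides x length) / (number of C x number of G). The function names are hypothetical, and how Genebuilder itself computes or weights these quantities is not stated in the slides.

```python
def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def cpg_obs_exp(seq):
    """Observed/expected CpG ratio; 0.0 when the sequence lacks C or G."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)

print(gc_content("ATGC"))
print(cpg_obs_exp("CGCGCGCG"))
```

A CpG island scan would apply both measures in a sliding window and report windows where the values stay high over a sufficient length.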