identification of gene from databases.core

8/14/2019 Identification of Gene From Databases.core

Identification of Genes from Databases

Presented with respect to
Dr. C.G. Joshi, Professor
Department of Animal Biotechnology, College of Veterinary Science & A.H., Anand

Presented by
Patel Hiren M.
M.V.Sc. (Anim. Biotechnology)


Introduction

Evolution is somewhat conservative.
Evolution seems to have often involved the duplication and divergence of genes.
A certain sequence may indicate a certain function.
A database is a structured set of data held in a computer.
The structures of new genes are constantly being added.


Databases Require Some Basic Knowledge

What is a gene?
What is gene structure?
- Prokaryotes
- Eukaryotes
ORFs (Open Reading Frames)
cDNA library
Human Genome Project
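As a small illustration of the ORF concept listed above, the sketch below scans the three forward reading frames of a DNA string for a start codon (ATG) followed, in the same frame, by a stop codon (TAA, TAG or TGA). The function name `find_orfs` and the `min_codons` parameter are illustrative assumptions and do not come from any particular tool; real gene finders also scan the reverse strand and apply statistical models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start/end are 0-based, end-exclusive; the stop codon is included.
    min_codons counts every codon from ATG through the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTTAGGG"))
```

Only the first ATG in each open stretch is used as the start, which is a simplifying choice for this sketch.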




Eukaryotic Genes

Much more complex than in prokaryotes.

Large genomes (0.1 to 3 billion bases)

A typical mammalian cell has 1,500 times as much DNA as a cell of E. coli.

Low coding density (


Gene Structure: Eukaryotes


Data Mining

Development of new tools for data mining:
- Sequence alignment
- Genome sequencing
- Genome comparison
- Microarray data analysis
- Proteomics data analysis
- Small molecule array analysis


What is a database?

A database is a collection of information stored in a computer in a systematic way, such that a computer program can consult it to answer questions.
The software used to manage and query a database is known as a database management system (DBMS).
The properties of database systems
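To make the idea of a program consulting a database concrete, here is a minimal sketch using SQLite, a small DBMS bundled with Python. The table layout and the gene records (`GENE_A`, `GENE_B`, `GENE_C`) are made-up placeholders for illustration only; they do not reflect the schema or contents of any real biological database.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (symbol TEXT PRIMARY KEY, organism TEXT, chromosome TEXT)"
)

# Hypothetical records, invented for this example.
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [
        ("GENE_A", "Bos taurus", "1"),
        ("GENE_B", "Gallus gallus", "4"),
        ("GENE_C", "Bos taurus", "7"),
    ],
)

# The "question": which genes are recorded for a given organism?
rows = conn.execute(
    "SELECT symbol FROM gene WHERE organism = ? ORDER BY symbol",
    ("Bos taurus",),
).fetchall()
print(rows)
```

The query is parameterized (`?`) so that the DBMS, not string formatting, handles the search value.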


for Gene Identification

Find candidate genes for the trait (time and cost!)
- What genes are there?
- How are the genes expressed?
- What do they do?
- How could they play a role in the disease?
- Gene synonyms
- Gene location


DATA SOURCES

PubMed
Conserved Domain Database
GeneAtlas
dbSNP

Links to the above-mentioned databases:
Gene: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene
PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
CDD: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd
Homologene: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=homologene
GeneAtlas: http://wombat.gnf.org/
dbSNP: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=snp


Gene Databases

Once a genome is in place, it is desirable to study the regions that make a particular organism what it is.
Chief among these are the genic regions of the organism.
Several databases of genes and related structures exist.
One such database is the RefSeq database curated at NCBI.




Genomic Database Resources

Ensembl - http://www.ensembl.org
19 species

UCSC Genome Browser - http://genome.ucsc.edu/
28 species (Insects!)

NCBI MapViewer - http://www.ncbi.nlm.nih.gov/mapview/
38 species (Plants, Fungi!)


Comparison of Sequence against Sequence Database

The most commonly used programs for comparing an unknown sequence against the sequences in a database are BLAST and FASTA.
BLAST and FASTA are faster, heuristic derivatives of the Smith-Waterman local alignment algorithm.
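For reference, the Smith-Waterman algorithm that both programs approximate can be sketched in a few lines of dynamic programming. The scoring values below (match +2, mismatch -1, linear gap -2) are arbitrary assumptions chosen for illustration; real searches use substitution matrices and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this is what makes it local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("XXACGTXX", "YYACGTYY"))
```

The full matrix costs time proportional to the product of the two sequence lengths, which is why heuristic tools are preferred for searching whole databases.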


The FASTA algorithm

Developed by Lipman and Pearson (1985).
First program to search sequence databases for gapped local alignments.
The best-scoring local region is given as output.
It is an approximate heuristic algorithm used to compute sub-optimal pairwise similarity.
http://www-nbrf.georgetown.edu/pirwww/search/fasta.html
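The heuristic character of FASTA comes from its first step: instead of filling a full alignment matrix, it looks up short exact word matches (k-tuples) and finds the diagonal on which most of them fall. The sketch below illustrates only that first step under simplifying assumptions; the function name `best_diagonal` is hypothetical, and the later rescoring and gapped-alignment stages of FASTA are not shown.

```python
from collections import Counter, defaultdict


def best_diagonal(query, subject, ktup=2):
    """Return (diagonal, hit_count) for the diagonal with most k-tuple hits.

    A diagonal is the offset (subject position - query position) shared by
    word matches that belong to the same ungapped region. Returns None if
    the two sequences share no word of length ktup.
    """
    # Index every word of length ktup in the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)

    # Look up each subject word and count hits per diagonal.
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            diagonals[j - i] += 1

    if not diagonals:
        return None
    return diagonals.most_common(1)[0]


if __name__ == "__main__":
    print(best_diagonal("ACGT", "TTACGTTT"))
```

A larger `ktup` makes the lookup faster and more selective but less sensitive to weak similarity.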




What is BLAST?

BLAST (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases, regardless of whether the query is protein or DNA.
"Local" means that it searches for and aligns sequence segments, rather than aligning the entire sequence.
It is able to detect relationships among sequences which share only isolated regions of similarity.
It is currently among the most popular and most widely accepted sequence analysis tools.


Why BLAST?

Identify unknown sequences: the best way to identify an unknown sequence is to see whether that sequence already exists in a public database. If the database sequence is well characterized, then you may have access to a wealth of biological information.
Help gene/protein function and structure prediction: genes with similar sequences tend to share similar functions or structures.
Identify protein families: group related (paralogous or orthologous) genes and their proteins into a family.
Prepare sequences for multiple alignments.


Go to BLAST

Go to nucleotide BLAST


SOFTWARE FOR IDENTIFICATION OF GENES

Some computational tools that are most commonly used for gene prediction:
GeneMark
GlimmerM
GRAIL
GenScan
Genebuilder


GeneMark

Used for finding prokaryotic genes.
This software employs an inhomogeneous Markov model to classify DNA regions into protein-coding and non-coding sequences.
Limitation: the query sequence must be longer than 100 kbp.


Glimmer

Glimmer uses interpolated Markov models to identify coding regions and distinguish them from non-coding DNA.
Glimmer is used as the primary gene finder tool at TIGR.
The computation consists of two steps, namely model building and gene prediction. The model building involves training by the input sequence, which optimizes the parameters of the model.
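The two steps, model building and prediction, can be illustrated with a toy interpolated Markov model. This is a simplified sketch and not Glimmer's actual implementation: the interpolation weight used here (context count divided by an assumed `threshold`, capped at 1) and the add-one smoothing are stand-ins chosen for brevity, and all function names are hypothetical.

```python
import math
from collections import defaultdict


def train(seqs, max_order=2):
    """Model building: count the bases that follow every context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(len(s)):
            for k in range(max_order + 1):
                if i - k >= 0:
                    counts[s[i - k:i]][s[i]] += 1
    return counts


def prob(counts, context, base, threshold=10):
    """Blend estimates from short to long contexts, trusting well-observed contexts more."""
    p = 0.25  # uniform prior over A, C, G, T
    for k in range(len(context) + 1):
        seen = counts.get(context[len(context) - k:])
        if not seen:
            break  # longer contexts were never observed in training
        total = sum(seen.values())
        lam = min(1.0, total / threshold)            # assumed weighting rule
        estimate = (seen.get(base, 0) + 1) / (total + 4)  # add-one smoothing
        p = lam * estimate + (1 - lam) * p
    return p


def log_odds(seq, coding, noncoding, max_order=2):
    """Prediction: positive scores favour the coding model, negative the non-coding one."""
    score = 0.0
    for i, base in enumerate(seq):
        ctx = seq[max(0, i - max_order):i]
        score += math.log(prob(coding, ctx, base) / prob(noncoding, ctx, base))
    return score


if __name__ == "__main__":
    coding = train(["ATGGCCGCCGCCGCC" * 5])      # toy "coding" training data
    noncoding = train(["ATATATATTATATAAT" * 5])  # toy "non-coding" training data
    print(log_odds("GCCGCCGCC", coding, noncoding))
    print(log_odds("ATATATAT", coding, noncoding))
```

The training strings above are invented toy data; a real model would be trained on long ORFs taken from the genome being annotated.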


GRAIL

Used for eukaryotes.
This tool identifies exons, polyA sites, promoters, CpG islands, repetitive elements and frame shift errors in DNA sequences by comparing them to a database of known human and mouse sequence elements.
Based on a neural network algorithm.
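Of the features listed, CpG islands are the simplest to illustrate. The sketch below applies a commonly cited rule of thumb (length of at least 200 bp, GC fraction above 0.5, observed/expected CpG ratio above 0.6) to a single window. It is an assumption-laden illustration, not GRAIL's method, which relies on a neural network; the function names are hypothetical.

```python
def cpg_stats(window):
    """Return (gc_fraction, observed/expected CpG ratio) for a DNA window."""
    window = window.upper()
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")  # observed CpG dinucleotides
    gc_fraction = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0  # expected CpG count if C and G were independent
    ratio = cpg / expected if expected else 0.0
    return gc_fraction, ratio


def looks_like_cpg_island(window, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the rule-of-thumb thresholds (assumed defaults) to one window."""
    gc_fraction, ratio = cpg_stats(window)
    return len(window) >= min_len and gc_fraction > min_gc and ratio > min_ratio


if __name__ == "__main__":
    print(looks_like_cpg_island("CG" * 100))
    print(looks_like_cpg_island("AT" * 100))
```

In practice the test would be applied to sliding windows along a sequence and adjacent positive windows merged.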


GenScan

The program uses a probabilistic model of gene structure that is based on actual biological information about the transcriptional, translational and splicing signals.
Its high speed and accuracy make GenScan the method of choice for the initial analysis of large stretches of eukaryotic genomic DNA.
GenScan has been used as the principal tool for gene prediction in the international Human Genome Project.


GenScan (cont.)

Makes predictions based on 5th-order HMMs.
It combines hexamer frequencies with coding signals (initiation codons, TATA box, cap site, polyA, etc.) in prediction.
Exons are assigned a probability score (P) of being a true exon. Only predictions with P > 0.5 are deemed reliable.
This program is trained for sequences from vertebrates, Arabidopsis, and maize. It has been used extensively in annotating the human genome.
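Two of the ideas above are easy to show in code: tabulating hexamer frequencies from training sequences, and keeping only exon predictions whose probability exceeds 0.5. This is a hedged sketch; the function names and the dictionary layout of a prediction (`start`, `end`, `p`) are hypothetical and do not correspond to GenScan's actual output format.

```python
from collections import Counter


def hexamer_frequencies(seqs):
    """Return the relative frequency of every overlapping 6-mer in the sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        counts.update(s[i:i + 6] for i in range(len(s) - 5))
    total = sum(counts.values())
    return {hexamer: n / total for hexamer, n in counts.items()} if total else {}


def reliable_exons(predictions, min_p=0.5):
    """Keep predictions whose probability score is strictly greater than min_p."""
    return [exon for exon in predictions if exon["p"] > min_p]


if __name__ == "__main__":
    print(hexamer_frequencies(["ATGATGATG"]))
    # Hypothetical predictions, invented for this example.
    predicted = [
        {"start": 120, "end": 310, "p": 0.93},
        {"start": 800, "end": 870, "p": 0.41},
    ]
    print(reliable_exons(predicted))
```

Comparing hexamer frequencies from coding and non-coding training sets is one way such statistics feed into a coding-potential score.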


Genebuilder

Genebuilder performs ab initio gene prediction using numerous parameters, such as GC content, di-codon frequencies, splicing site data, CpG islands, repetitive elements and others. It also performs BLAST searches of predicted genes against protein and EST databases.

