http://www.iaeme.com/IJCET/index.asp

International Journal of Computer Engineering & Technology (IJCET)

Volume 8, Issue 5, Sep-Oct 2017, pp. 54–66, Article ID: IJCET_08_05_007

Available online at

http://www.iaeme.com/ijcet/issues.asp?JType=IJCET&VType=8&IType=5

Journal Impact Factor (2016): 9.3590(Calculated by GISI) www.jifactor.com

ISSN Print: 0976-6367 and ISSN Online: 0976–6375

© IAEME Publication

A REVIEW ON BIOCOMPUTING APPROACHES

AND TOOLS FOR IDENTIFICATION OF

SINGLE NUCLEOTIDE POLYMORPHISMS

Neelofar Sohi

Assistant Professor, Department of Computer Engineering,

Punjabi University, Patiala, India

Amardeep Singh

Professor, Department of Computer Engineering,

Punjabi University, Patiala, India

ABSTRACT

Single Nucleotide Polymorphisms (SNPs) are the most common source of genetic

variations. There has been enormous research in the area of Biocomputing and

Bioinformatics on identification and analysis of SNPs. A large number of methods

have been developed for their identification ever since the importance of SNPs in

understanding of diseases emerged with the completion of Human Genome Project.

This paper reviews Single Nucleotide Polymorphisms, their importance, their

association to diseases, Biocomputing approaches and tools available for their

identification up to 2017.

Keywords: Single Nucleotide Polymorphisms, SNPs, Biocomputing, Genetic Variations, SNP identification.

Cite this Article: Neelofar Sohi and Amardeep Singh, A Review on Biocomputing

Approaches and Tools for Identification of Single Nucleotide Polymorphisms.

International Journal of Computer Engineering & Technology, 8(5), 2017, pp. 54–66.

http://www.iaeme.com/ijcet/issues.asp?JType=IJCET&VType=8&IType=5

1. INTRODUCTION

The completion of the Human Genome Project (HGP) was a major breakthrough in the history of genetics. HGP, the world’s largest international collaborative research project, was launched in 1990 by the US Department of Energy and the National Institutes of Health (NIH) with the aim of sequencing the complete human genome. The project was completed in 2003, having sequenced the 3.3 billion base pairs of the human genome, and revealed that there are about 20,500 human genes. The information furnished by HGP opened new avenues for understanding diseases, their genetic basis and the genetic variants responsible for them. Understanding the connection between sequence variations and phenotype can lead to better diagnosis, prevention and treatment of diseases [1].

Sequence analyses show that 99% of the genome sequences of different individuals are identical [2]. The remaining 1% difference is due to genetic variations, some of which lead to inherited diseases. Genetic variations are of various types, such as Single Nucleotide Polymorphisms (SNPs), insertions/deletions, block substitutions, inversions, variable number of tandem repeat sequences (VNTRs) and copy number variations (CNVs). Single nucleotide polymorphisms are the most common source of genetic variation, accounting for 90% of the sequence differences and about half of the known human inherited diseases [3]. About 349,313,504 human SNPs, of which 24,983,387 are validated, are listed in the Database of Short Genetic Variations, dbSNP [4], hosted by the National Center for Biotechnology Information (NCBI) (dbSNP build 150, the latest as of February 2017, available at ftp://ftp.ncbi.nlm.nih.gov/snp/).

1.1. Motivation for Work

Research on SNPs and their identification has grown explosively, and many review papers on SNPs are available. However, a systematic review covering SNPs from various aspects has been lacking in the last five years. This paper reviews SNPs, their importance, their association to diseases, and the Biocomputing approaches and tools available for their identification up to 2017.

1.2. Single Nucleotide Polymorphisms

Single Nucleotide Polymorphisms are genetic variations that occur when a single nucleotide, i.e. Adenine (A), Guanine (G), Cytosine (C) or Thymine (T), in the genome sequence is altered. A single nucleotide polymorphism thus results in the substitution of one nucleotide for another in a DNA sequence.

Figure 1 Single Nucleotide Polymorphism

SNPs are mostly biallelic polymorphisms, that is, the nucleotide identity at these positions is constrained to one of two possibilities in humans [5]. There are two types of single nucleotide substitution: transitions, where the substitution is between purines (A, G) or between pyrimidines (C, T), and transversions, which involve substitution of a purine by a pyrimidine or vice versa.
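The transition/transversion distinction described above can be sketched in a few lines of Python. This is only an illustration; the function name `classify_substitution` is a hypothetical helper and does not come from any of the tools reviewed in this paper.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("expected one of A, C, G, T")
    if ref == alt:
        raise ValueError("ref and alt are identical, so this is not a substitution")
    # A transition stays within one chemical class (purine or pyrimidine);
    # a transversion crosses from one class to the other.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


print(classify_substitution("A", "G"))  # purine to purine
print(classify_substitution("A", "C"))  # purine to pyrimidine
```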

1.2.1. Classification of Single Nucleotide Polymorphisms

SNPs are classified by genomic location into coding and non-coding SNPs. Coding SNPs occur in the coding region of a gene, which takes part in protein formation. Coding SNPs are of two types: synonymous SNPs and non-synonymous SNPs. Synonymous SNPs change a codon specifying an amino acid into another codon that codes for the same amino acid, so the amino acid sequence of the protein does not change. Non-synonymous SNPs change the amino acid sequence of the protein and are either missense or nonsense. Missense SNPs change a codon specifying an amino acid into one that specifies a different amino acid, whereas nonsense SNPs change the codon into a stop codon, which terminates protein formation and leads to an incomplete or non-functional protein. Non-coding SNPs may occur in introns, promoter regions, 5’ and 3’ untranslated regions and intergenic regions.
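As a minimal sketch of the coding-SNP classes described above, the following Python snippet compares a reference codon with its altered form using the standard genetic code. The helper name `classify_coding_snp` and the "other" label for a change that removes a stop codon are assumptions made for this illustration; they are not taken from the reviewed tools.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-base change inside a codon."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    if len(ref_codon) != 3 or len(alt_codon) != 3:
        raise ValueError("codons must be three bases long")
    if sum(a != b for a, b in zip(ref_codon, alt_codon)) != 1:
        raise ValueError("codons must differ at exactly one position")
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        # Loss of a stop codon is not one of the classes discussed in the text.
        return "other"
    return "missense"


print(classify_coding_snp("GAA", "GAG"))  # Glu -> Glu
print(classify_coding_snp("GAG", "GTG"))  # Glu -> Val
print(classify_coding_snp("TGG", "TGA"))  # Trp -> stop
```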

Figure 2 Classification of Single Nucleotide Polymorphisms

1.3. Functional Effects of Single Nucleotide Polymorphisms

SNPs may occur in the coding region of genes, non-coding regions of genes or in the

intergenic regions (regions between genes). A SNP is said to be functional if it affects factors

such as splicing, transcription and protein structure hence causing a phenotypic difference

between members of the species. A variant may affect the expression or translation of a gene

product, either by interrupting a regulatory region or by interfering with normal splicing and

mRNA function [6]. About 3% to 5% of human SNPs are functional.

1.3.1. Effect of Coding region SNPs

Coding region SNPs are of more interest to scientists as they are more likely to alter the

biological function of a protein. An altered protein in many cases is responsible for disorders

and abnormalities in humans. Synonymous changes in coding regions are more common than

non-synonymous changes. Non-synonymous SNPs account for more than half of all genetic

polymorphisms known to cause inherited diseases (as per 2009 release of HGMD) [7].

1.3.2. Effect of Non Coding region SNPs

The role of noncoding SNPs is much less studied. Many noncoding SNPs that reside in the

noncoding sequences (e.g. introns, promoter region, within 5’ and 3’ untranslated region)

surrounding protein coding genes have been shown to have profound effects on the

expression of neighbouring genes and may cause disease phenotypes. These are called

regulatory SNPs. Such SNPs may affect the gene expression by interrupting a regulatory

region [8]. Non-coding SNPs are also linked to a higher risk of cancer [9]. Some studies suggest that coding region SNPs, specifically non-synonymous SNPs, show greater functional effect and association with diseases, whereas other studies associate non-coding SNPs with these effects.

1.4. Association of Genetic Variations and Diseases

There are two classes of genetic variations, viz. mutations and polymorphisms. DNA sequence variations that do not lead to diseases are termed polymorphisms. To be termed ‘normal’ they must occur in at least 1% of the population. Many polymorphisms are found in genes and influence characteristics like hair colour, eye colour and height. DNA sequence variations that may lead to diseases are termed mutations. More than 99% of human genome sequences are identical; only 1% is different, which may be responsible for predisposition to diseases and variability in response to drugs [5]. A study to establish the relationship between a disease and regions of the genome is termed an association study. If a variation is causative of a disease, then its occurrence would be higher in the group of cases (individuals with the disease) than in the group of controls (individuals not affected by the disease). Penetrance and expressivity are the two parameters employed to find association. Penetrance is defined as the percentage of individuals having a particular mutation who exhibit clinical signs or the phenotype of the associated disease. Expressivity is defined as the extent to which a genotype is phenotypically expressed in individuals [10].
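The case-control comparison and the penetrance definition above can be illustrated with a small worked example. The odds ratio used here is a standard epidemiological measure that the text does not name, so it is an assumption of this sketch; the function names and all counts are hypothetical.

```python
def penetrance(affected_carriers: int, total_carriers: int) -> float:
    """Percentage of mutation carriers who show the disease phenotype."""
    if total_carriers <= 0:
        raise ValueError("total_carriers must be positive")
    return 100.0 * affected_carriers / total_carriers


def odds_ratio(case_carriers: int, case_total: int,
               control_carriers: int, control_total: int) -> float:
    """Odds of carrying the variant among cases relative to controls."""
    case_non_carriers = case_total - case_carriers
    control_non_carriers = control_total - control_carriers
    if case_non_carriers == 0 or control_carriers == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return (case_carriers * control_non_carriers) / (case_non_carriers * control_carriers)


# Hypothetical counts: 30 of 40 carriers are affected.
print(penetrance(30, 40))
# Hypothetical study: 60 of 100 cases and 30 of 100 controls carry the variant.
print(odds_ratio(60, 100, 30, 100))
```

An odds ratio well above 1 is consistent with the variant occurring more often in cases than in controls, which is the pattern the association study looks for.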

1.5. Importance of SNP Identification

Genetic and environmental factors are the key factors responsible for diseases. Environmental exposures play a larger role in human phenotypic variation than genetic variation, but they are fundamentally more difficult to measure. DNA is stable throughout life, with a single physical chemistry that enables generic approaches for measurement. Single nucleotide polymorphisms are the most common source of genetic variation and underlie differences in a person’s susceptibility and predisposition to inherited diseases. Hence, identification of SNPs may lead to early prevention, diagnosis and treatment of diseases. Knowledge of SNPs also helps in understanding the variation among individuals in drug response. Pharmacogenomics is the study of how variations in genes are related to an individual’s response to a particular drug; identification of SNPs can therefore be useful for personalised medicine [11].

Genetic diseases, a major class of human diseases, are broadly classified into monogenic and polygenic diseases. Monogenic diseases are those where a single gene mutation is responsible for the disease. In the 1990s, the focus of research shifted from monogenic diseases towards analysis of complex multifactorial diseases like osteoporosis, diabetes, cardiovascular diseases, inflammatory diseases, psychiatric disorders and most cancers. These diseases are polygenic, with multiple gene variants each contributing a small effect to the disease. It is very hard to analyse them and harder still to locate the genes involved (Collins et al., 1998). Complex polygenic diseases occur at a much higher frequency and consequently are a great social burden [12]. Therefore, a highly reliable marker is required to identify the genes involved. Markers are genetic variations associated with phenotypic conditions which can be used to locate genes associated with a disease [11]. SNPs emerged as an appropriate marker system because they are polymorphic, stable and abundant, are amenable to automation, are easy and inexpensive to genotype, and can have a direct functional consequence besides being surrogate markers. By using SNP markers, it is therefore often possible to test for association between a functional variant and a phenotype directly [12]. They are used in genome-wide association studies (GWAS) for gene to phenotype mapping [11].

SNPs have been used for human identification. They can be used to reconstruct the history of the genome in population studies, for which they are suitable because they are abundant, evolutionarily stable and inherited from one generation to the next. SNP detection has also been used in forensic genetics, where it can be used to evaluate rare, degraded and nearly fossilized nucleic acid evidence. Studying the frequency and distribution of SNPs can yield information on the evolution of the species [5]. SNPs have therefore been used in the study of evolution, race, migration and lineage [13]. SNPs have been used for population classification besides genes [14, 15]. A classification procedure was proposed for two populations using eight SNP markers selected from a sample of 641 collected from an Epidemic Society of Shanghai [13]. SNPs with a high minor allele frequency (0.249 < MAF < 0.355) were selected randomly from eight different chromosomes. The accuracy of the classification procedure depends upon the number of SNP markers.
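To make the marker-selection step concrete, the sketch below computes a minor allele frequency (MAF) from genotype calls and keeps the SNPs that fall inside the 0.249 to 0.355 window quoted above. It assumes biallelic SNPs with genotypes written as two-letter strings; the function names, SNP identifiers and genotype data are hypothetical and are not taken from the cited study.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """MAF of a biallelic SNP from diploid genotype strings such as 'AG'."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    if not counts:
        raise ValueError("no genotypes supplied")
    if len(counts) > 2:
        raise ValueError("more than two alleles observed; SNP is not biallelic")
    if len(counts) == 1:
        return 0.0  # monomorphic in this sample
    return min(counts.values()) / sum(counts.values())


def select_by_maf(snps, low=0.249, high=0.355):
    """Names of SNPs whose MAF lies strictly between low and high."""
    return [name for name, genotypes in snps.items()
            if low < minor_allele_frequency(genotypes) < high]


# Hypothetical genotype calls for three SNPs in five individuals.
sample = {
    "snp_1": ["AA", "AG", "AG", "AA", "AG"],  # G: 3 of 10 alleles
    "snp_2": ["CC", "CT", "TT", "CT", "CC"],  # T: 4 of 10 alleles
    "snp_3": ["GG", "GG", "GG", "GG", "GG"],  # monomorphic
}
print(select_by_maf(sample))
```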

1.6. Challenges in SNP Identification

SNP identification, also termed characterisation, aims at identifying, analysing and annotating SNPs. SNP annotation aims at predicting the effect and function of SNPs [6]. There is a lack of SNP identification methods which also provide annotation. Several methods exist for identifying SNPs, though there is no global approach to identify all types of SNPs. Most methods focus on coding region SNPs; SNPs lying in non-coding regions are ignored owing to poor understanding of their effects, although such SNPs may affect gene expression by interrupting a regulatory region [8]. Another issue is that a SNP may be present in an individual without necessarily having a deleterious effect. A SNP is more or less deleterious depending upon whether the allele containing it is more or less expressed. Penetrance and expressivity are the parameters used to analyse this aspect. Penetrance is defined as the percentage of individuals having a particular mutation who exhibit clinical signs (phenotype) of the associated disease. A SNP may lead to an inherited predisposition to a disease yet have reduced or incomplete penetrance in a particular individual. Penetrance of a mutation may be affected by age and gender in addition to other factors [10]. A SNP identification method must therefore also distinguish between disease-causing and functionally neutral SNPs. SNPs are excellent markers for locating candidate genes associated with a disease [11]. Hence, a SNP identification method can be enriched with the capability to identify candidate genes containing causal variants associated with a disease.

2. BIOCOMPUTING APPROACHES FOR IDENTIFICATION OF SNPS

A general approach followed for identification of SNPs is to create a catalogue of variants in human genes and test them for association with diseases. The 1000 Genomes Project, which set out to provide a comprehensive description of common human genetic variation, reconstructed the genomes of 2,504 individuals from 26 populations and characterized a broad spectrum of genetic variation: over 88 million variants in total (84.7 million SNPs, 3.6 million short insertions/deletions and 60,000 structural variants). This project provides a benchmark for the distribution and understanding of human genetic variation and of the processes that shape genetic diversity and disease biology [16].

Another approach is to generate a genome-wide high resolution map of known polymorphisms and test them for association with diseases. The goal of the International HapMap Project is to develop a map and determine the common patterns of DNA sequence variation in the human genome in order to carry out candidate-gene, linkage-based and genome-wide association studies. The International HapMap Consortium is developing a map of these patterns across the genome by determining the genotypes of one million or more sequence variants, their frequencies and the degree of association between them, in DNA samples from populations with ancestry from parts of Africa, Asia and Europe. The HapMap will allow the discovery of sequence variants that affect common disease and the development of diagnostic tools [17].

Complex polygenic diseases are believed to have multiple gene variants contributing to disease, and it remains an open question whether mutations at any one gene are necessary and sufficient to lead to the phenotype. One prominent approach to locating SNPs underlying complex traits is ‘linkage disequilibrium’, where the physical distance between two polymorphisms is treated as the arbiter of the degree of association between them. Each SNP has a unique history; therefore the time of origin and the relative frequencies of two SNPs also affect the linkage disequilibrium between them [8].
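The degree of association between two SNPs is commonly quantified with the linkage disequilibrium coefficient D and the derived measures D' and r². These measures are standard in population genetics but are not spelled out in the text, so the sketch below is an illustrative assumption about how the association could be computed from haplotype and allele frequencies; the function name and the example frequencies are hypothetical.

```python
def linkage_disequilibrium(p_ab: float, p_a: float, p_b: float):
    """Return (D, |D'|, r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and allele B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    variance_term = p_a * (1 - p_a) * p_b * (1 - p_b)
    if variance_term == 0:
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b  # deviation from independent assortment
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r_squared = d * d / variance_term
    return d, d_prime, r_squared


# Two alleles that always travel together (complete association).
print(linkage_disequilibrium(0.5, 0.5, 0.5))
# Two alleles that assort independently (no association).
print(linkage_disequilibrium(0.25, 0.5, 0.5))
```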

3. BIOCOMPUTING TOOLS FOR IDENTIFICATION OF SNPS

With SNPs being the most common form of genetic variation, there is great interest in SNP discovery. To serve this need, a large number of public databases are available online which offer data about single nucleotide polymorphisms (SNPs), other forms of genetic variations, diseases and genome annotations. Table 1 provides a summary of some of these popular databases. A number of Biocomputing methods and tools for identification of SNPs exist. Many reviews on various aspects of SNP identification, SNP identification software and sequencing technologies are available [6, 18-24]. Some of the prominent SNP identification tools and their links are listed in Table 2. A few state-of-the-art techniques for identification of SNPs are presented in the following sections:

3.1. PolyPhred

Figure 3 Flowchart for PolyPhred

Phred is a base-calling computer program with improved accuracy and lower error rates than ABI sequencing software [25]. It assigns an error probability to each called base. It uses a four-phase procedure to determine a sequence of base calls from the processed trace. In the first phase, the idealized peak locations (predicted peaks) are determined. In the second phase, observed peaks are identified in the trace. In the third phase, observed peak locations are matched to the predicted peak locations, omitting some peaks and splitting others. In the final phase, unmatched observed peaks are checked to see whether they represent a base that could not be assigned to a predicted peak; if so, the corresponding base is inserted into the read sequence. Phrap is the sequence alignment program used to align the subject sequence against the reference sequence. PolyPhred is an automated program for identification of SNPs used in conjunction with Phred, Phrap and Consed. It reads the normalised peak areas and quality values obtained from Phred for each position in the sequence. If a second peak is detected at a base and there is a reduction in the peak height, PolyPhred calls it a heterozygous site. Consed is an editing and viewing tool that an analyst can use for editing and evaluating the traces [26]. As part of the Japanese Millennium Genome Project, Haga et al. (2002) identified a total of 190,562 genetic variations, consisting of 174,269 SNPs and 16,293 insertions/deletions, from DNA samples of 24 Japanese individuals using PolyPhred [26] for SNP identification [27]. Data and methods of the study are available at the web site (http://snp.ims.u-tokyo.ac.jp).
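The error probability that Phred assigns to each called base is conventionally reported as a Phred-scaled quality score, Q = −10 log10(P). The text does not state this formula, so it is included here only as the widely used convention; the function names below are hypothetical helpers for illustration.

```python
import math


def error_probability_to_phred(p_error: float) -> float:
    """Phred-scaled quality score Q = -10 * log10(P) for an error probability P."""
    if not 0 < p_error <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def phred_to_error_probability(quality: float) -> float:
    """Inverse conversion: P = 10 ** (-Q / 10)."""
    return 10 ** (-quality / 10)


# A base with a 1-in-1000 chance of being wrong has a quality of about 30.
print(error_probability_to_phred(0.001))
# A quality of 20 corresponds to a 1-in-100 chance of error.
print(phred_to_error_probability(20))
```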

3.2. Newberg’s Technique

Figure 4 Flowchart for Newberg’s Technique

Phred is used for base-calling to identify the bases in the sequence. Next, the sequence is aligned against the reference sequences using Phrap, which classifies sequences that do not have sufficient similarity information as singlets and excludes them from the set of assembled contigs. The SNP identification procedure consists of a series of four filters. Filter 1 eliminates clusters of mismatches that occur in regions of low quality trace data; it searches window sizes of 5, 10 or 20 bp around a single base-pair mismatch position. Filter 2 identifies the type of sequence mismatch as base substitution or insertion/deletion. Filters 3 and 4 check the quality of each base call relative to its position and frequency in a contig: Filter 3 ignores mismatches in the first 100 bases, and Filter 4 requires a mismatch to occur in more than one sequence in a contig to be considered a high quality candidate SNP. This eliminates the mismatches that could arise from copying errors [28].
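Filters 3 and 4 lend themselves to a short sketch. The data model below (a `Mismatch` record with a contig position, read identifier, position within the read and observed base) is hypothetical, and the code assumes that "first 100 bases" refers to the position within the read, which the text does not specify. It is an illustration of the filtering logic, not the published implementation.

```python
from collections import defaultdict
from typing import NamedTuple


class Mismatch(NamedTuple):
    contig_pos: int  # position of the mismatch in the contig
    read_id: str     # read (sequence) in which it was observed
    read_pos: int    # 0-based position within that read (assumed meaning)
    base: str        # mismatching base that was called


def candidate_snps(mismatches, skip_first=100, min_reads=2):
    """Apply sketches of filter 3 (position) and filter 4 (frequency)."""
    support = defaultdict(set)
    for m in mismatches:
        if m.read_pos < skip_first:
            continue  # filter 3: ignore mismatches in the first 100 bases
        support[(m.contig_pos, m.base)].add(m.read_id)
    # Filter 4: keep mismatches seen in more than one sequence of the contig.
    return sorted(site for site, reads in support.items() if len(reads) >= min_reads)


observed = [
    Mismatch(500, "read_1", 150, "T"),
    Mismatch(500, "read_2", 220, "T"),
    Mismatch(640, "read_1", 290, "G"),  # seen in one read only
    Mismatch(710, "read_3", 40, "A"),   # too close to the read start
    Mismatch(710, "read_4", 60, "A"),   # too close to the read start
]
print(candidate_snps(observed))
```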

3.3. SNP Detector

Figure 5 Flowchart for SNP Detector

First, Phred is used to obtain base calls, quality scores and primary and secondary peak information for each trace file. SIM, based on the Smith-Waterman algorithm, is used to align the subject sequence to a reference sequence. The Neighbourhood Quality Standard is used to check that the variation site and each base in its flanking window exceed a user-defined quality threshold. A variation is considered a true variation if this base and each base in its 4 bp flanking region have a Phred quality score ≥ 25 and the sequence similarity is ≥ 95%. The height of the secondary peak corresponding to a heterozygous allele must be at least 30% of the height of the primary peak. A peak with a height less than 20% is considered noise [29].
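The quality and peak-height thresholds above can be expressed as two small checks. The sketch assumes that the 4 bp flanking region extends 4 bases on each side of the variation site and that the 20% noise cut-off, like the 30% cut-off, is measured relative to the primary peak; neither detail is stated explicitly in the text. The function names and the "ambiguous" label for the gap between the two cut-offs are hypothetical.

```python
def passes_nqs(qualities, site, flank=4, min_quality=25):
    """Neighbourhood quality check: the site and its flanking bases must all meet the threshold."""
    start, end = site - flank, site + flank
    if start < 0 or end >= len(qualities):
        return False  # not enough flanking sequence to evaluate the site
    return all(q >= min_quality for q in qualities[start:end + 1])


def classify_secondary_peak(primary_height, secondary_height,
                            het_ratio=0.30, noise_ratio=0.20):
    """Label a secondary peak by its height relative to the primary peak."""
    if primary_height <= 0:
        raise ValueError("primary peak height must be positive")
    ratio = secondary_height / primary_height
    if ratio >= het_ratio:
        return "heterozygous"
    if ratio < noise_ratio:
        return "noise"
    return "ambiguous"  # between the two thresholds; not covered by the text


good_read = [30, 30, 30, 30, 40, 30, 30, 30, 30]
print(passes_nqs(good_read, site=4))
print(classify_secondary_peak(1000, 450))
print(classify_secondary_peak(1000, 150))
```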

3.4. PolyScan

Figure 6 Flowchart for PolyScan

Phred is used for base-calling to identify the bases in the sequence. Next, the sequence is aligned against the reference sequences using the cross-match program. SNPs are identified as doublet peaks whose heights are half of those for homozygous individuals; a drop in the peak height is the indicator of the presence of a variation. Two procedures, namely Horizontal Scan and Vertical Scan, are performed: in Horizontal Scan, distance metrics are computed within reads, and in Vertical Scan, peak height is computed [30].

3.5. VarDetect

Figure 7 Flowchart for VarDetect

A partitioning and re-sampling technique improves the base calling procedure. The main idea is to detect the primary peak corresponding to the primary allele and the secondary peak corresponding to the secondary allele. The presence of multiple peaks at a location indicates the presence of a SNP. Next, the observed peak intensity ratio $Q_o^i$ is calculated from the intensities of the peaks at the $i$-th location as

$$Q_o^i = \frac{\text{highest peak intensity}}{\text{sum of all peak intensities}}.$$

The vicinity peak intensity ratio $Q_v^i$ is calculated relative to the two bases to the left and the two bases to the right of the base call location; for example, for location 3,

$$Q_v^3 = \frac{I_1 + I_2 + I_4 + I_5}{4}.$$

SNPs can be predicted from the difference between the vicinity and observed peak intensity ratios, called the detection value:

$$\delta = Q_v^i - Q_o^i.$$

If this difference is significant, that is, above a defined threshold value, it is indicative of a SNP. Next, the CodeMap technique is used to convert chromatogram traces to numeric codes. Homozygous bases are converted into codes 0 and 2, while a heterozygous base is converted to 1 [31].
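The detection value can be illustrated with a short sketch. It assumes that the quantities I1, I2, I4 and I5 in the vicinity ratio are the observed peak intensity ratios of the four neighbouring positions; the text does not define them precisely, so this is an interpretation. The threshold is left as a parameter because no value is given, and the function names and trace data are hypothetical.

```python
def observed_ratio(intensities):
    """Qo: highest peak intensity divided by the sum of all peak intensities at one location."""
    total = sum(intensities)
    if total <= 0:
        raise ValueError("peak intensities must sum to a positive value")
    return max(intensities) / total


def detection_values(trace, threshold):
    """Return (location, delta, is_candidate) for every location with two neighbours on each side.

    trace: list of per-location (A, C, G, T) peak intensities.
    """
    qo = [observed_ratio(peaks) for peaks in trace]
    results = []
    for i in range(2, len(trace) - 2):
        # Assumed reading of Qv: mean observed ratio of the two left and two right neighbours.
        qv = (qo[i - 2] + qo[i - 1] + qo[i + 1] + qo[i + 2]) / 4
        delta = qv - qo[i]
        results.append((i, delta, delta > threshold))
    return results


# Hypothetical trace: clean single peaks around a location with two equal peaks.
trace = [
    (100, 0, 0, 0),
    (0, 100, 0, 0),
    (50, 0, 50, 0),   # two peaks of equal height
    (0, 0, 0, 100),
    (0, 100, 0, 0),
]
print(detection_values(trace, threshold=0.3))
```

With clean neighbours the vicinity ratio stays close to 1, so a location with two comparable peaks produces a large δ and is flagged as a candidate.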

3.6. PineSAP

Figure 8 Flowchart for PineSAP

Phred is used for base-calling, and sequence alignment is done using Phrap and ProbconsRNA. Next, the PolyBayes and PolyPhred techniques are used for SNP identification [32].

Table 1 Summary of some prominent Single Nucleotide Polymorphism Databases

| Database | URL of web resource | Information in the database |
|---|---|---|
| Database of Expressed Sequence Tags (dbEST) | http://www.ncbi.nlm.nih.gov/dbEST | Short-pass reads of cDNA (transcript) sequences [33] |
| Entrez | - | Integrated database retrieval system that provides access to a diverse set of 40 databases that together contain 1.3 billion records [34] |
| Human Genic Bi-Allelic Sequence (HGBASE) | http://hgbase.interactiva.de/ | Database of polymorphisms located in human intra-genic sequences [35] |
| Online Mendelian Inheritance in Man (OMIM) | http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim | Human genes and genetic phenotypes; relationship between phenotype and genotype [36] |
| Database of Short Genetic Variations (dbSNP) | http://www.ncbi.nlm.nih.gov/SNP | Primary resource for single nucleotide variations, microsatellites, small-scale insertions and deletions [4] |
| JSNP database | http://snp.ims.u-tokyo.ac.jp/ | Catalog of SNPs responsible for different genetic related disorders in the Japanese population [37] |
| University of California Santa Cruz (UCSC) Genome Browser | http://genome.ucsc.edu/ | Popular web-based tool for displaying a portion of a genome including gene predictions, mRNA and expressed sequence tag alignments, SNPs [38] |
| SWISS-PROT | www.expasy.org/sprot/ | High level of annotation of function, domain structure, post-translational modifications, variants of protein [39] |
| Human Gene Mutation Database (HGMD) | http://www.hgmd.cf.ac.uk/ac/index.php | Germline mutations underlying human inherited diseases [7] |
| HapMap | http://www.hapmap.org | Catalog of common genetic variants [17] |
| Database of Genomic Variants (DGV) | http://projects.tcag.ca/variation | Catalog of structural variation in healthy control samples for studying correlation to genomic variation with phenotypic data [40] |
| CASCAD | http://cascad.niob.knaw.nl | Catalog of candidate single nucleotide polymorphisms predicted using a computational approach from publicly available sequence data of the rat and zebrafish [41] |
| SNP Function Portal | - | SNP functional annotations for genomic elements, transcription regulation, protein function, pathway, disease and population genetics [42] |
| Database of Genotype and Phenotype (dbGAP) | http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap | Genotype-phenotype association, results of GWAS, medical resequencing, molecular diagnostic assays [43] |
| Functional Single Nucleotide Polymorphism (F-SNP) | http://compbio.cs.queensu.ca/F-SNP/ | Integrates information obtained from 16 bioinformatics tools and databases about the functional effects of SNPs [44] |
| Database of Genomic Structural Variation (dbVar) | http://www.ncbi.nlm.nih.gov/dbvar | Large-scale copy number variants (CNV), insertions, deletions, inversions and translocations longer than 50 base pairs (bp) [45] |
| SNPedia | https://www.snpedia.com/ | Wiki investigating human genetics; provides personal genome annotation, interpretation and analysis [46] |
| BioProject | www.ncbi.nlm.nih.gov/bioproject/ | Sequence variants, genotype-phenotype association, nucleotide sequence sets and epigenetic information [47] |
| GWAS db | http://jjwanglab.org/gwasdb | Genetic variants with functional annotations, genomic mapping, gene expression and disease associations [48] |
| Pharmacogenetics Knowledge Base (PharmGKB) | http://www.pharmgkb.org/ | Annotation of genetic variants and gene-drug-disease relationship [49] |
| Dog Genome SNP Database (DoGSD) | http://dogsd.big.ac.cn/ | Information about already identified SNPs in dog/wolf related genetic diseases [50] |
| Ensembl | http://www.ensembl.org | Finds SNPs and other variants for a gene and association with diseases [51] |

Table 2 Single Nucleotide Polymorphism identification tools

Tool | URL of Tool

SNPeffect | http://snpeffect.vib.be/

novoSNP | http://www.molgen.ua.ac.be/bioinfo/novosnp/

PupaSuite | http://pupasuite.bioinfo.cipf.es/

Sorting Intolerant from Tolerant (SIFT) | http://blocks.fhcrc.org/sift/SIFT.html

Polymorphism Phenotyping (PolyPhen) | http://genetics.bwh.harvard.edu/pph/

SAP prediction method | http://sapred.cbi.pku.edu.cn/

PMut | http://mmb2.pcb.ub.es:8080/PMut/

Screening for Nonacceptable Polymorphisms (SNAP) | http://cubic.bioc.columbia.edu/services/SNAP/

SNPSeek | http://snp.wustl.edu/cgi-bin/SNPseek/index.cgi

SNP@Promoter | http://variome.kobic.re.kr/SNPatPromoter/

SNPper | http://snpper.chip.org/

Genewindow | http://www.genewindow.nci.nih.gov/

SIFT | http://blocks.fhcrc.org/sift/SIFT.html

PolyPhen | http://www.bork.embl-heidelberg.de/PolyPhen/

SNP3D | http://www.snps3d.org/

SNPeffect | http://snpeffect.vib.be/index.php

PicSNP | http://plaza.umin.ac.jp/_hchang/picsnp/

Pupa SNP Finder | http://pupasnp.bioinfo.cnio.es/

BioPerl | http://bio.perl.org/

HaploSNPer | http://www.bioinformatics.nl/tools/haplosnper/

Human chromosome 21 cSNP database | http://csnp.unige.ch/

QualitySNPng | http://www.bioinformatics.nl/QualitySNPng/
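
To make the idea of computational SNP identification concrete, the following is a minimal Python sketch that reports positions where already-aligned sample sequences differ from a reference by a single base. It is only an illustration under stated assumptions and is not the method of any tool listed in Table 2: the names `SNP` and `find_snps` are hypothetical, the input sequences are assumed to be pre-aligned and of equal length, and gaps or ambiguous bases are simply ignored. Real pipelines additionally have to deal with alignment and sequencing errors.

```python
from typing import List, NamedTuple, Tuple


class SNP(NamedTuple):
    """One candidate SNP site (hypothetical record type for this sketch)."""
    position: int             # 1-based position in the alignment
    reference: str            # base found in the reference sequence
    alleles: Tuple[str, ...]  # distinct alternate bases seen in the samples


def find_snps(reference: str, samples: List[str]) -> List[SNP]:
    """Report positions where any sample differs from the reference.

    Assumes all sequences are already aligned and have the same length.
    Characters other than A, C, G and T (gaps, N, ...) are ignored.
    """
    valid = set("ACGT")
    for sample in samples:
        if len(sample) != len(reference):
            raise ValueError("all sequences must be aligned to equal length")

    snps = []
    for i, ref_base in enumerate(reference.upper()):
        if ref_base not in valid:
            continue  # skip gaps or ambiguous bases in the reference
        # Distinct sample bases at this column that differ from the reference
        alts = {sample[i].upper() for sample in samples} - {ref_base}
        alts &= valid  # drop gaps and ambiguous bases in the samples
        if alts:
            snps.append(SNP(i + 1, ref_base, tuple(sorted(alts))))
    return snps


if __name__ == "__main__":
    ref = "ACGTACGT"
    reads = ["ACGTACGT", "ACATACGT", "ACGTACGA", "ACAT-CGT"]
    for snp in find_snps(ref, reads):
        print(snp.position, snp.reference, "/".join(snp.alleles))
```

In this toy input, the second and fourth samples carry an A instead of G at position 3 and the third sample carries an A instead of T at position 8; the gap in the fourth sample is skipped rather than reported as a variant.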

4. CONCLUSIONS

There has been an explosion of research on SNPs and SNP identification ever since the association between genetic variations and diseases emerged. SNP identification can reveal candidate genes containing causal variants, as well as candidate causal SNPs that predispose individuals to disease. This in turn enables early prevention, diagnosis and treatment of diseases. There is a need for identification methods that can process a large number of bases at low cost and in less time, which paves the way for future research on computational approaches. Bioinformatics aims at developing and organising the large number of available databases that store information on SNPs, their functions and associated diseases. It also aims to develop effective tools that can mine the relevant data from these databases to enable analysis of disease-causing SNPs. This review aims to guide researchers on Single Nucleotide Polymorphisms (SNPs) and on the Biocomputing approaches and tools available for SNP identification.

REFERENCES

[1] An Overview of the Human Genome Project (2015). National Human Genome Research Institute (NHGRI) homepage. Available: http://www.genome.gov/12011238

[2] Sachidanandam, R. et al. (2001). A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature, 409: 928-933.

[3] Collins, F.S. et al. (1998). A DNA polymorphism discovery resource for research on human genetic variation. Genome Research, 8: 1229-1231.

[4] Sherry, S.T. et al. (2001). dbSNP: the NCBI database of genetic variation. Nucleic Acids Research, 29(1): 308-311.

[5] Sripichai, O. and Fucharoen, S. (2007). Genetic Polymorphisms and Implications for Human Diseases. Journal of the Medical Association of Thailand, 90(2): 394-398. Available: http://www.medassocthai.org/journal

[6] Mooney, S. (2005). Bioinformatics approaches and resources for single nucleotide polymorphism functional analysis. Briefings in Bioinformatics, 6(1): 44-56.

[7] Krawczak, M. et al. (2000). Human Gene Mutation Database-A Biomedical Information and Research Resource. Human Mutation, 15: 45-51.

[8] Chakravarti, A. (1999). Population Genetics-making sense out of sequence. Nature Genetics (supplement), 21: 56-60.

[9] Li, G. et al. (2014). Regulatory Variants and Disease: The E-Cadherin −160C/A SNP. Molecular Biology International, 2014. Available: http://www.hindawi.com/journals/mbi/2014/967565/

[10] Shawky, R.M. (2014). Reduced penetrance in human inherited disease. The Egyptian Journal of Medical Human Genetics, 15: 103-111.

[11] Altshuler, D. et al. (2008). Genetic Mapping in Human Disease. Science, 322(5903): 881-888.

[12] Gray, I.C. et al. (2000). Single nucleotide polymorphisms as tools in human genetics. Human Molecular Genetics, 9(16): 2403-2408.

[13] Hu, P. et al. (2016). A Simple Algorithm for Population Classification. Scientific Reports, 6, article no.: 23491.

[14] Zhou, N. and Wang, L. (2007). Effective selection of informative SNPs and classification on the HapMap genotype data. BMC Bioinformatics, 8: 484.

[15] Kohnemann, S. and Pfeiffer, H. (2011). Application of mtDNA SNP analysis in forensic casework. Forensic Sci. Int. Genet., 5(3): 216-221.

[16] The 1000 Genomes Project Consortium (2015). A global reference for human genetic variation. Nature, 526: 68-87.

[17] International HapMap Consortium (2005). A haplotype map of the human genome. Nature, 437: 1299-1320.

[18] Mooney, S.D. et al. (2010). Bioinformatic Tools for Identifying Disease Gene and SNP Candidates. Methods in Molecular Biology, 628: 307-319.

[19] Johnson, A.D. et al. (2009). SNP bioinformatics: a comprehensive review of resources. Circulation: Cardiovascular Genetics, 2(5): 530-536.

[20] Medvedev, P. et al. (2009). Computational methods for discovering structural variation with next-generation sequencing. Nature Methods, 6: S13-S18.

[21] Nielsen, R. et al. (2011). Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics, 12(6): 443-451.

[22] Kumar, S. et al. (2012). SNP Discovery through Next-Generation Sequencing and Its Applications. International Journal of Plant Genomics, 2012, article id: 831460.

[23] Bianco et al. (2013). Database tools in genetic diseases research. Genomics, 101: 75-85.

[24] Ghorbani, M. and Karimi, H. (2014). Ten Bioinformatics Tools for Single Nucleotide Polymorphisms Detection. American Journal of Bioinformatics, 3(2): 45-48.

[25] Ewing, B. et al. (1998). Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Research, 8: 175-185.

[26] Nickerson, D.A. et al. (1997). PolyPhred: Automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing. Nucleic Acids Research, 25: 2745-2751.

[27] Haga, H. et al. (2002). Gene-based SNP discovery as part of the Japanese Millennium Genome Project: identification of 190 562 genetic variations in the human genome. Journal of Human Genetics, 47: 605-610.

[28] Picoult-Newberg, L. et al. (1999). Mining SNPs from EST Databases. Genome Research, 9: 167-174.

[29] Zhang, J. et al. (2005). SNPdetector: A Software Tool for Sensitive and Accurate SNP Detection. PLoS Computational Biology, 1(5): 0395-0404.

[30] Chen, K. et al. (2007). PolyScan: An automatic indel and SNP detection approach to the analysis of human resequencing data. Genome Research, 17: 659-666.

[31] Ngamphiw, C. et al. (2008). VarDetect: a nucleotide sequence variation exploratory tool. BMC Bioinformatics, 9(Suppl 12): S9.

[32] Wegrzyn, J.L. et al. (2009). PineSAP-sequence alignment and SNP identification pipeline. Bioinformatics, 25(19): 2609-2610.

[33] Boguski, M.S. et al. (1993). dbEST–database for expressed sequence tags. Nature Genetics, 4: 332-333.

[34] Schuler, G.D. et al. (1996). Entrez: molecular biology database and retrieval system. Methods in Enzymology, 266: 141-162.

[35] Sarkar, C. et al. (1998). Human genetic bi-allelic sequences (HGBASE), a database of intra-genic polymorphisms. Memorias do Instituto Oswaldo Cruz, 93(5): 693-694.

[36] Hamosh, A. et al. (2000). Online Mendelian Inheritance in Man (OMIM). Human Mutation, 15: 57-61.

[37] Hirakawa, M. et al. (2002). JSNP: a database of common gene variations in the Japanese population. Nucleic Acids Research, 30(1): 158-162.

[38] Kent, W.J. et al. (2002). The Human Genome Browser at UCSC. Genome Research, 12: 996-1006.

[39] Boeckmann, B. et al. (2003). The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research, 31(1): 365-370.

[40] Iafrate, A.J. et al. (2004). Detection of large-scale variation in the human genome. Nature Genetics, 36(9): 949-951.

[41] Guryev et al. (2005). CASCAD: a database of annotated candidate single nucleotide polymorphisms associated with expressed sequences. BMC Genomics, 6: 10.

[42] Wang, P. et al. (2006). SNP Function Portal: a web database for exploring the function implication of SNP alleles. Bioinformatics, 22(14): e523-e529.

[43] Mailman, M.D. et al. (2007). The NCBI dbGaP database of genotypes and phenotypes. Nature Genetics, 39: 1181-1186.

[44] Lee, P.H. and Shatkay, H. (2008). F-SNP: computationally predicted functional SNPs for disease association studies. Nucleic Acids Research, 36(Database issue): D820-D824.

[45] Church, D.M. et al. (2010). Public data archives for genomic structural variation. Nature Genetics, 42: 813-814.

[46] Cariaso, M. and Lennon, G. (2012). SNPedia: a wiki supporting personal genome annotation, interpretation and analysis. Nucleic Acids Research, 40(Database issue): D1308-D1312.

[47] Barrett, T. et al. (2012). BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Research, 40(Database issue): D57-D63.

[48] Li, M.J. et al. (2012). GWASdb: a database for human genetic variants identified by genome-wide association studies. Nucleic Acids Research, 40(Database issue): D1047-D1054.

[49] Whirl-Carrillo, M. et al. (2012). Pharmacogenomics Knowledge for Personalized Medicine. Clinical Pharmacology and Therapeutics, 92(4): 414-417.

[50] Bai, B. et al. (2015). DoGSD: the dog and wolf genome SNP database. Nucleic Acids Research, 43(Database issue): D777-D783.

[51] Aken, B.L. et al. (2016). The Ensembl gene annotation system. Database, 2016: baw093.