bioinformatics review - december 2015 issue

DECEMBER 2015 VOL 1 ISSUE 3


Contents

December 2015

Editorial

Tools

Roary: Analysis of Prokaryote Pan Genome on a large-scale

Do you HYPHY with (Data) Monkey !!

T-Coffee: A tool that combines both local and global alignments

Proteomics

Disulphide Connectivity in Protein Tertiary Structure Prediction

Programming

Perl one-liners for bioinformaticians

Meta Analysis

Venice Criteria: Overview

Systems Biology

Two Components System: Potential Drug Target in Mycobacterium tuberculosis

Genomics

Mycobacteriophages & their potentials as source against Mycobacterial active molecules

Software

TIN: R package to analyze Transcriptome Instability

EDITOR

Dr. PRASHANT PANT

EDITORIAL

SECTION EDITORS

ALTAF ABDUL KALAM, MANISH KUMAR MISHRA, SANJAY KUMAR, PRAKASH JHA, NABAJIT DAS

REPRINTS AND PERMISSIONS

You must have permission before reproducing any material from Bioinformatics Review. Send e-mail requests to [email protected]. Please include contact details in your message.

BACK ISSUE

Bioinformatics Review back issues can be downloaded in digital format from bioinformaticsreview.com at $5 per issue. Back issues in print format cost $2 for India delivery and $11 for international delivery, subject to availability. Pre-payment is required.

CONTACT

PHONE +91. 991 1942-428 / 852 7572-667

MAIL Editorial: 101 FF Main Road Zakir Nagar, Okhla New Delhi IN 110025

STAFF ADDRESS To contact any of the Bioinformatics Review staff members, simply format the address as [email protected]

PUBLICATION INFORMATION

Volume 1, Number 1, Bioinformatics Review™ is published monthly for one year (12 issues) by Social and Educational Welfare Association (SEWA) Trust (Registered under Trust Act 1882). Copyright 2015 Sewa Trust. All rights reserved. Bioinformatics Review is a trademark of Idea Quotient Labs and used under license by SEWA Trust. Published in India.

EXECUTIVE EDITOR: FOZAIL AHMAD

FOUNDING EDITOR: MUNIBA FAIZA

EDITORIAL

Pursuing PhD in Sciences?

All the low-hanging fruit in the sciences has been plucked to kingdom come, and it is time to review the education system. It is about time to put certain very important and fundamental questions to young minds, so that they can take an informed decision regarding their careers.

An article in Nature by Mark Taylor, “Reform the PhD system or close it down”, emphasised that most doctoral programmes keep producing PhDs in large numbers, yet these graduates are poorly absorbed by universities, institutions and the corporate world because of deficiencies in the system and/or the degree itself. The article also describes the medieval nature of most doctoral programmes, which has made them irrelevant and unsustainable as ever more PhDs are churned out by universities all over the world. Two questions come to mind. First, why did this happen? Second, where did things go wrong, and who is to blame? The probable answer to the first question lies in the opening statement of this editorial: this was bound to happen. The second question is more intriguing and needs discussion. Most doctoral programmes are designed to train students to perform research and analyses in a stereotyped manner. That is not wrong in itself, but it makes scholars see the PhD as a lucrative way to get “Dr.” prefixed to their names without putting much pressure on their grey matter.

PhDs are not about filling pages under five chapters after a couple of years. PhDs are not about stereotyped work done by scholar after scholar in a laboratory to fulfil the mentor’s desire to become a self-declared expert on a topic. PhDs are (and should be) about questions, and genuine ones at that. They are (and should be) about beautiful experimental designs that attack the question from all corners and try to answer it. They are about thinking what, why and how something happens, and how that piece of information can be taken further to serve another research question.

Our education system will wake up one day and introduce revolutionary modifications, crashing the dreams of many PhD aspirants. We should not wait for that day; rather, we should prepare our young minds to start thinking. So the question you should be asking yourself during your graduation or post-graduation is: “Do you have it in you to do research and earn a PhD degree?” If you don’t ask, you are going to repent heavily. If you question everything around you and can work relentlessly to try and answer a question, then PhDs are for you. If you can connect radically different aspects and weave them together into simpler forms, then PhDs are for you.

Science can take you to any part of the world, good or bad; if you are ready for that, PhDs are for you. If you are ready to explore again and again, PhDs are for you. If you can give up other things in life for the sake of the questions, PhDs are for you. One should always remember that it is a (doctoral) degree in philosophy and not in sciences, and therefore the question (that you ask) and the meaning/interpretation of the answer are more important than plain science. So, start thinking!!

Dr. Prashant Pant, Editor-in-Chief

Letters and responses: [email protected]


TOOLS

Roary: Analysis of Prokaryote Pan Genome on a large-scale

Muniba Faiza


The microbial pan genome is the union of the genes found in a set of genomes of interest. The term was first used by Medini et al. in 2005.

Since then, the amount of microbial genome data has increased enormously, so studying processes such as selection and evolution requires constructing the pan genome of a species. Constructing a pan genome from real data is difficult, however, and the result may be inaccurate because of fragmented assemblies, poor annotation and contamination; in addition, microbial organisms can rapidly acquire genes from other organisms. Andrew J. Page et al. have therefore developed a new method to generate the pan genome of a set of related prokaryotic isolates and named the tool ‘Roary’. It can handle thousands of isolates in a feasible time.

How does Roary work?

Roary takes one annotated assembly per sample as input. Coding regions are extracted and converted into protein sequences, partial sequences are removed, and the remaining sequences are pre-clustered using CD-HIT (a fast program for clustering and comparing sequences). This produces a reduced set of protein sequences. These are compared all-against-all using BLASTP with a user-defined percentage sequence identity (default 95%). Next, conserved gene neighbourhood information is used to split homologous groups into groups of true orthologs. Finally, a graph is constructed showing the relationships of the clusters based on their order of occurrence in the input sequences.
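To make the idea of a pan genome concrete, here is a minimal Python sketch that starts from the end product of such a pipeline: for each isolate, the set of ortholog clusters found in it. It is not Roary's implementation. The isolate names and gene names are hypothetical, and the terms "core" (clusters present in every isolate) and "accessory" (the remainder) are introduced here for illustration.

```python
from typing import Dict, Set, Tuple


def summarise_pan_genome(
    isolates: Dict[str, Set[str]],
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (pan, core, accessory) cluster sets for the given isolates."""
    gene_sets = list(isolates.values())
    if not gene_sets:
        return set(), set(), set()
    pan = set().union(*gene_sets)        # union: seen in at least one isolate
    core = set.intersection(*gene_sets)  # intersection: seen in every isolate
    accessory = pan - core               # missing from at least one isolate
    return pan, core, accessory


# Hypothetical isolates, each mapped to the ortholog clusters found in it.
isolates = {
    "isolate_1": {"dnaA", "gyrB", "recA", "blaZ"},
    "isolate_2": {"dnaA", "gyrB", "recA"},
    "isolate_3": {"dnaA", "gyrB", "mecA"},
}

pan, core, accessory = summarise_pan_genome(isolates)
print("pan:", sorted(pan))
print("core:", sorted(core))
print("accessory:", sorted(accessory))
```

The hard part that Roary addresses is everything before this step, namely deciding which genes from thousands of isolates belong to the same cluster.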

That is how the orthologous genes of prokaryotes can be identified and microbial evolution studied on a large scale, across data sets covering many isolates. Tools with the same purpose, namely PanOCT and PGAP, were available before Roary, but Roary is faster and, thanks to its heuristics, the most feasible of them for large data sets.


BIOINFORMATICS PROGRAMMING

Perl one-liners for bioinformaticians

Muniba Faiza



Perl one-liners are extremely short Perl scripts written in the form of a string of commands that fits onto one line. That would amount to a bit less than 80 characters for most purposes. Here’s the obligatory “Hello World!” one-liner in Perl and its output:

$ perl -e 'print "Hello World!\n";'

Hello World!

Try it! (Of course, Perl must be installed on your computer for the “perl” command to work.)

The most common and useful way to use such one-liners is as stream processors on the command line, sometimes connected by pipes to other utilities typical of a Linux command-line environment. To process the stream one would commonly use Perl regular expression syntax to match (m/string/) or substitute (s/string1/string2/). Let us use “echo” to generate an empty line of input to act upon and “-p” to tell Perl to print the $_ variable (the entire line) at the end:

$ echo | perl -pe 's/$_/Hello World!\n/;'

Hello World!



Notice that Perl iterates over all lines of the input (first create a file test with 3 empty lines):

$ cat test | perl -pe 's/$_/Hello World!\n/;'

Hello World!

Hello World!

Hello World!

Finally, let us introduce the “-i” switch to make Perl apply the changes directly to a supplied file:

$ perl -pi -e 's/$_/Hello World!\n/;' test2

This will result in the contents of test2 getting overwritten, with “Hello World!” now present on every line! Needless to say, the “-i” switch can be quite dangerous because of its ability to completely overwrite files. Giving it an extension, as in “-i.bak”, makes Perl keep a backup copy of the original file.

Suppose you have a file whose lines you would like to number. This is a no-brainer with Perl one-liners! Just replace the beginning of each line with its number (add the “-i” switch to make the change directly in the file):

$ cat test2 | perl -pe '$i++; s/^/$i: /;'

1: Hello World!

2: Hello World!

3: Hello World!

The “^” symbol denotes the beginning of the line in Perl regular expressions. Notice that the one-liner actually contains two Perl statements separated by a semicolon (;).

Bioinformaticians often process FASTA files with nucleotide or amino-acid sequences. Suppose you have a FASTA file you would like to convert to a format where every sequence occupies only one line, so that you can apply “grep” to look for a specific k-mer in the sequence (say TATATAA for the TATA-box). This can be easily done by removing every end-of-line symbol on non-header lines:

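$ cat test2 | perl -pe 's/^([^>]+)\n/$1/; s/^>/\n>/ if $. > 1; END{print "\n"}' | grep -B1 TATATAA

The “$1” is a special Perl variable created in regular expressions whenever you enclose something in parentheses. Here we do that with entire lines that do not begin with a “>” character (“^” in brackets, as in “[^>]+”, means NOT “>”; in this case we choose non-header lines). The second substitution puts a line break in front of every header except the first (“$.” holds the current line number), so that each header starts on a line of its own instead of being glued to the end of the previous sequence. In this example test2 stands for your FASTA file.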

Perl one-liners can be very useful in ad-hoc processing or parsing of files and streams from a plethora of sources. Additional examples of clever Perl one-liners can be found at the following pages:

http://bioinformaticsonline.com/pages/view/11181/perl-one-liner-for-bioinformatician

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/oneliners.html


SYSTEMS BIOLOGY

Two Components System: Potential Drug Target in Mycobacterium tuberculosis

Fozail Ahmad


The genomic complexity of Mycobacterium tuberculosis (Mt) and the unknown functions of many of its proteins/genes have triggered in-depth study of the entire genome to explore the factors that influence Mt’s behaviour at the molecular level. To set the stage for infection, to establish itself in the host’s defensive environment, to cause pathogenicity by overcoming the immune system and to escape host attack, this TB-causing pathogen has developed a well-organised system known as the two-component system (TCS), which consists of two proteins, universally designated the sensor protein and the response regulator protein.

The basic function of these proteins is to sense environmental signals and respond accordingly. After interacting with a suitable stimulating ligand, the sensor protein, a histidine kinase, binds and hydrolyses ATP, catalysing the auto-phosphorylation of a conserved histidine residue and producing a high-energy phosphoryl group. The phosphate is then transferred to a conserved aspartic acid residue of the associated receiver protein, the response regulator, generating a high-energy acyl phosphate. Once this phosphotransfer reaction has taken place, the response regulator is activated and can carry out its specific function. In most cases, the activated response regulator modulates transcription by binding DNA at a specific site in the promoter region of its target genes. The overall effect is a change in global gene expression that helps the pathogen respond to the initial signal sensed by the histidine kinase. There are eleven such TCSs in the pathogen. Their primary task is to control the expression of specific genes at specific times in response to environmental conditions, thereby contributing to the growth of the pathogen inside the host. Since each TCS has a distinct function, together they orchestrate most of the gene regulatory processes. Of the eleven, only eight have been studied comprehensively, leaving the others to be explored by further genomic analysis of Mt.

Interdisciplinary relevance: The systematic understanding of biological phenomena at the microscopic scale has relied on a number of sophisticated experimental procedures, which in turn have supported deterministic and stochastic approaches capable of describing real molecular systems. Biological modelling and simulation are among the methodologies that use wet-lab data to annotate biochemical processes and to understand how a real biological mechanism works.

Systems biology opens up a new way to analyse the raw data generated by wet-lab experiments, characterising and evaluating them through mathematical modelling, simulation and network analysis. Two-component systems, given their critical contribution to bacterial pathogenicity, have provided new concepts for understanding molecular mechanisms that are yet to be explored. What is known about their behaviour and activation is still limited as far as the exact regulatory mechanism is concerned. Applying mathematical models and simulations to this regulatory behaviour could test the global association of TCSs with genome-wide expression and show how this pathogen becomes so virulent. Another important question is at what level of gene activation pathogenicity becomes rampant and overwhelms the host’s immunity. Exploring all the two-component systems would bring molecular biology, chemistry, mathematics and network biology together to unfold the gene regulatory scenario of Mycobacterium tuberculosis.


PROTEOMICS

Disulphide Connectivity in Protein Tertiary Structure Prediction

Muniba Faiza



Approaches to protein structure prediction have multiplied and have been successful in many cases, but the problem remains a big challenge. To handle it, protein structure prediction is divided into separate sub-problems, each of which contributes information about the whole system (i.e., the protein structure). One of these sub-problems is disulphide connectivity: identifying which non-adjacent cysteine residues are cross-linked by disulphide bonds, out of all the possible pairings.

Since disulphide bridges/bonds play an important role in the folding process, stability and function of a protein, predicting disulphide connectivity can help in predicting protein structure. Disulphide connectivity can be studied in two steps: first, disulphide bonding state prediction, and second, disulphide connectivity prediction (DCP). The first step classifies each cysteine as either bonded to another cysteine or free. DCP then identifies which pairs of cysteines are bonded to each other in a protein sequence. Various predictors are available for these tasks, based mainly on Neural Networks (NN), Support Vector Machines (SVMs) and other predictive methods.

An Artificial Neural Network (ANN) is a computing system of interconnected elements that processes external inputs through the dynamic responses of the system. Here, the ANN provides a likelihood of forming a disulphide bond for each cysteine pair. A graph-matching algorithm such as Gabow’s algorithm is then applied to these pairwise likelihoods to choose the connectivity pattern. SVMs are machine learning tools that are likewise used to predict bonding from features of the primary sequence of proteins; this approach uses the Edmonds-Gabow algorithm and position-specific scoring matrices (PSSMs). The accuracy of the predicted connectivity patterns is then assessed with two parameters, Rb and Qb. Rb is the ratio of the number of correctly predicted bonds to the total number of disulphide bonds (Nb) in the test proteins. Qb is the ratio of the number of proteins whose connectivity patterns are correctly predicted (Nprot) to the total number of proteins (Nt) in the test set.
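The two accuracy measures are simple ratios, and a small Python sketch may make them easier to compare. It assumes each protein's connectivity pattern is given as a set of bonds, with a bond written as a pair of cysteine positions (smaller position first). The function name, the data layout and the example positions are all hypothetical.

```python
from typing import List, Set, Tuple

Bond = Tuple[int, int]  # positions of the two bonded cysteines, smaller first


def rb_qb(predicted: List[Set[Bond]], actual: List[Set[Bond]]) -> Tuple[float, float]:
    """Return (Rb, Qb) for per-protein predicted and true bond sets."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need one predicted pattern per test protein")
    total_bonds = sum(len(bonds) for bonds in actual)  # Nb
    if total_bonds == 0:
        raise ValueError("test set contains no disulphide bonds")
    # Bonds that appear in both the prediction and the true pattern.
    correct_bonds = sum(len(p & a) for p, a in zip(predicted, actual))
    # Proteins whose whole pattern is predicted exactly (Nprot).
    correct_proteins = sum(1 for p, a in zip(predicted, actual) if p == a)
    rb = correct_bonds / total_bonds
    qb = correct_proteins / len(actual)                # Nprot / Nt
    return rb, qb


# Two hypothetical test proteins: the first is predicted perfectly,
# the second has its single bond predicted wrongly.
actual = [{(3, 40), (16, 26)}, {(5, 55)}]
predicted = [{(3, 40), (16, 26)}, {(5, 30)}]

rb, qb = rb_qb(predicted, actual)
print(f"Rb = {rb:.3f}, Qb = {qb:.3f}")
```

Qb is the stricter of the two: a protein counts only if every one of its bonds is right, whereas Rb gives credit bond by bond.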



SOFTWARE

TIN: R package to analyze Transcriptome Instability

Muniba Faiza



Alternative splicing plays an essential role in the proper functioning of eukaryotic cells. It acts as a regulatory mechanism for gene expression, and alternative splicing of pre-mRNA is a major source of genetic variation in human beings. Disruption of the splicing process may lead to human diseases such as cancer. Cancer-associated variation may arise at different levels of gene regulation, particularly during the processing of pre-mRNA into mature mRNA, so a better understanding of these mechanisms may provide insights into the causes and development of disease.

TIN is an R package for analysing transcriptome instability (TIN) from expression data. It is a collection of R modules that provides a framework for analysing expression-level data.

WORKFLOW:

TIN takes raw expression data (cell intensity files, CEL files) as input and applies the FIRMA method (a method for the detection of alternative splicing) to estimate the expression levels of the transcriptome and the alternative splicing patterns between samples. FIRMA assigns a score to each exon-sample combination, based on the deviation of the probes for that exon from the expected gene expression level. The FIRMA score thus reflects the ratio between the exon expression level and the corresponding gene expression level. A strongly positive score indicates that the exon is differentially included, and a strongly negative score indicates that the exon is skipped.

Alternative splicing is mediated by several splicing factors and proteins, which remove the introns from the pre-mRNA and then join the exons together. TIN therefore tests the association between splicing factor expression levels and the amount of aberrant exon usage among the samples. For this, the correlation between the amount of aberrant exon usage and the expression level of each splicing factor is calculated across all samples. A strong correlation indicates that the aberrant exon usage may be related to splicing factor expression. The same test is then repeated with random gene sets: if these correlate poorly with aberrant exon usage, the association can be attributed to the expression levels of the splicing factor genes rather than to chance.
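The correlation test described above can be illustrated with a short Python sketch. This is not code from the TIN package (which is written in R) and does not use its functions. It only shows a plain Pearson correlation between a per-sample count of aberrant exons and the expression level of one splicing factor. The choice of Pearson correlation is an assumption for illustration, and all numbers are made up.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length (at least 2 values)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return covariance / (spread_x * spread_y)


# Hypothetical data for five samples: number of exons flagged as aberrant
# (e.g. by extreme FIRMA scores) and expression of one splicing factor gene.
aberrant_exons = [120, 340, 560, 800, 950]
splicing_factor_expression = [9.1, 8.2, 7.4, 6.0, 5.8]

r = pearson(aberrant_exons, splicing_factor_expression)
print(f"correlation across samples: {r:.2f}")
```

In a full analysis the same calculation would be repeated for every splicing factor gene and for many random gene sets of the same size, so that the two distributions of correlations can be compared.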

In this way, analysing gene expression levels together with alternative splicing patterns may help to monitor a developing disease or to detect it at a very early stage.

Note: An exhaustive list of references for this article is available from the author on request; for more details write to [email protected]

Fig.1 Workflow of TIN

http://i1.wp.com/bioinformaticsreview.com/wp-content/uploads/2015/12/Screenshot1.jpg


META ANALYSIS

Venice Criteria: Overview

Manish Kumar Mishra



The plethora of research literature available to modern-day biologists provides the luxury of conducting a unique procedure: meta-analysis, an analysis of data about data. Genome Wide Association Studies (GWAS) help the researcher narrow down to a specific biomolecule to target, whether for a curative purpose or for a more exploratory analytical procedure, for any particular trait. To make a meta-analysis realistic and closer to the truth, one needs to scrutinize every individual study against some benchmark, and this is where the Venice criteria come in handy.

Venice criteria can be understood as a set of three scores which are used to grade the evidence produced by a study. Each of these three scores can attain a maximum of an ‘A’ grade, followed by ‘B’ and ‘C’, based on how meticulous the study was.

The first score is given for “Amount”, the second for “Replication”, and the final score for “Protection from bias”.

Elaborating on each of these three grading criteria involves some numerical quantities; the details follow.

Amount

‘A’ grade is awarded for large-scale evidence: more than 1000 subjects (case:control = 1:1) in the least common genetic group of interest.

‘B’ grade is awarded for moderate evidence: 100-1000 subjects in the least common genetic group of interest.

‘C’ grade is awarded for little evidence: fewer than 100 subjects in the least common genetic group of interest.



Replication

‘A’: an extensively replicated finding, supported by at least one well-conducted meta-analysis.

‘B’: a well-conducted meta-analysis that faces some methodological limitations, or whose studies show moderate inconsistency.

‘C’: no association or no independently replicated study, a flawed meta-analysis, or no consistency between studies.

Protection from bias

Biases creep into studies from researchers’ preconceived notions and affect the compilation of data and the declaration of results. As with the previous two criteria, a study must also be scrutinized for biases that may have crept in.

‘A’: biases have been minimized; any that remain could affect the magnitude, but probably not the presence, of the association.

‘B’: considerable information on how the evidence was generated is missing, but there is no bias that clearly affects the presence of the association.

‘C’: the evidence for bias is so strong that it may affect the very existence of the association.

Thus the grades may be combined as follows:

AAA: strong evidence

AAB, ABA, ABB, BAA, BAB, BBA, BBB: moderate evidence

All other combinations, that is, any with a ‘C’, are treated as poor, unreliable evidence.
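The grading scheme lends itself to a few lines of code. The Python sketch below encodes the sample-size thresholds for the “Amount” score and the rule for combining the three letters. The function names are hypothetical, and the handling of the exact boundary values (100 and 1000 subjects both fall into ‘B’) is an assumption made here, since the criteria as summarised above do not spell it out.

```python
def amount_grade(subjects_in_least_common_group: int) -> str:
    """Grade the amount of evidence from the size of the least common genetic group."""
    if subjects_in_least_common_group > 1000:
        return "A"  # large-scale evidence
    if subjects_in_least_common_group >= 100:
        return "B"  # moderate evidence (boundary handling is an assumption)
    return "C"      # little evidence


def overall_evidence(amount: str, replication: str, bias: str) -> str:
    """Combine the three Venice grades into an overall strength of evidence."""
    grades = (amount, replication, bias)
    if any(grade not in ("A", "B", "C") for grade in grades):
        raise ValueError("each grade must be 'A', 'B' or 'C'")
    if grades == ("A", "A", "A"):
        return "strong"
    if "C" in grades:
        return "poor"
    return "moderate"  # only A and B grades, but not all A


# A hypothetical association: 1450 subjects in the least common group,
# replicated with some inconsistency, and well protected from bias.
amount = amount_grade(1450)
print(amount + "BA", "->", overall_evidence(amount, "B", "A"))
```

A single ‘C’ on any criterion is enough to make the overall evidence poor, however good the other two grades are.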


GENOMICS

Mycobacteriophages and their potential as source against Mycobacterial active biomolecules

Sanjay Kumar



We are all aware of the epidemic threat created by Mycobacterium tuberculosis and other related species. In this article we show how nature itself provides a solution against it.

As we know, a bacteriophage (bacterio = bacteria, phage = eater) is a virus that infects bacteria. A mycobacteriophage, more specifically, is a member of a group of bacteriophages that infect mycobacterial species as their hosts, e.g., Mycobacterium smegmatis and Mycobacterium tuberculosis, the causative agent of tuberculosis.

The rising incidence of tuberculosis, the emergence of multi-drug resistance in Mycobacterium tuberculosis and slow progress in finding new drugs make mycobacteriophages potential candidates for use as diagnostic and therapeutic tools against TB.

All the characterized mycobacteriophages are double-stranded DNA (dsDNA) tailed phages belonging to the order Caudovirales. Most are of the family Siphoviridae, characterized by long, flexible, non-contractile tails, whereas the others belong to the family Myoviridae and have contractile tails. There is a notable absence of mycobacteriophages from the family Podoviridae (which have short stubby tails), raising the question of whether long tails are needed to traverse the relatively thick mycobacterial cell envelope. dsDNA tailed phages are either temperate, forming stable lysogens and turbid plaques, or lytic, forming clear plaques in which the host cells are killed. Mycobacteriophages can also be studied through the morphology of their plaques, which vary in size and shape. Plaque morphology also depends on the burst size, which is the number of phage particles released on lysis of the infected bacterium.

GENOMETRICS OF 70 SEQUENCED

MYCOBACTERIOPHAGES

Since the mycobacterial cell wall consists of a mycolic acid-rich mycobacterial outer membrane attached to an arabinogalactan layer, which is in turn linked to the peptidoglycan, it poses a significant challenge to the phages. This challenge is met by a set of proteins: Lysin B proteins that cleave the linkage of mycolic acids to the arabinogalactan layer, holins that regulate the timing of lysis, and endolysins (Lysin As) that hydrolyze peptidoglycan.

Phages lyse their hosts with a holin-endolysin system that is essential for programmed lysis. Endolysin activity has also been found associated with a protein component of the phage tail, where it facilitates penetration of the murein during injection of the genome into the host. Holins are small membrane proteins that form holes in the membrane through which the endolysin can pass. Holins control the length of the infective cycle of lytic phages so as to achieve lysis at an optimal time.

Endolysins are a potential source of antibacterials, and possible replacements for antibiotics (which have a more wide-ranging effect), because of their specificity (targeting only a few strains of bacteria), the low probability of Mycobacterium developing resistance to them, and their novel mode of action.

Bioinformatics can assist this field of research by finding other naturally occurring proteins, or by helping to design alternatives, with similar pharmacophore properties (physical and chemical attributes). By using the natural options provided to us, we can counter various disease threats and remain healthy on this planet. The only point to be remembered is: NATURE CAN SATISFY OUR NEEDS, BUT IT CANNOT SUSTAIN OUR GREED. JUST AS A HEALTHY BODY HOLDS A HEALTHY MIND, A CONSERVED PLANET CONSERVES ITS SPECIES TOO.

Hatfull, Graham F. “Mycobacteriophages: genes and genomes.” Annual Review of Microbiology 64 (2010): 331-


TOOLS

Do you HYPHY with (Data) Monkey !!

Prashant Pant


HyPhy, an acronym for Hypothesis Testing Using Phylogenies (www.hyphy.org), was written and designed by Kosakovsky Pond and co-workers to provide likelihood-based analyses of molecular evolutionary data sets and to help detect differential rates of variability within coding sequence data sets. It is freely available, has a Graphical User Interface and can be used by anyone, with or without much exposure to computer languages or programming.

It was earlier presumed that substitution rates were uniform over an alignment of homologous DNA/protein sequences, but many workers studying the molecular evolutionary processes that influence rates and patterns of evolution have refuted this presumption with a large amount of data. This is especially true for rapidly evolving gene family data sets and for viral genomes. Natural selection acts on different domains/regions/sites, which may be under positive, negative or neutral selection pressure. Positive selection shows up as an excess of non-synonymous substitutions in a protein-coding sequence, which change the protein and can confer a fitness advantage (through protein structure and function) on the organism, whereas negative selection shows up as an excess of synonymous substitutions, which leave the amino acid sequence, and hence protein structure and function, unchanged. Neutral evolution is said to be taking place when non-synonymous substitutions do not affect protein structure and function, so that they accumulate at about the same rate as synonymous ones. The rates of synonymous and non-synonymous substitutions are denoted dS and dN respectively. In the case of neutral evolution, dS and dN are observed to be in equilibrium.

Accordingly, the ratio ω = β/α (also referred to as dN/dS), where β is the non-synonymous rate and α the synonymous rate, has become a standard measure of selective pressure. The total ω for a sequence alignment is referred to as the global ω. A global ω of approximately 1 signifies neutral evolution, a value below 1 suggests negative selection, and a value above 1 implies positive selection. To start the analyses, all one needs is a suitable substitution model (for instance, one selected with the MODELTEST program, available online), a NEXUS-formatted sequence alignment file (which must be a codon data file) and a Maximum Likelihood tree of the data.
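The interpretation of ω given above amounts to a three-way decision, sketched here in Python. The function is hypothetical and is not part of HyPhy or Datamonkey. The tolerance that decides what counts as “approximately 1” is an arbitrary value chosen for illustration; in practice these programs use likelihood-based statistical tests rather than a fixed cut-off.

```python
from typing import Tuple


def classify_omega(dn: float, ds: float, tolerance: float = 0.05) -> Tuple[float, str]:
    """Return omega = dN/dS and the selection regime it suggests."""
    if ds <= 0:
        raise ValueError("dS must be positive to form the ratio dN/dS")
    if dn < 0:
        raise ValueError("dN cannot be negative")
    omega = dn / ds
    if abs(omega - 1.0) <= tolerance:  # tolerance is an illustrative assumption
        regime = "neutral evolution"
    elif omega < 1.0:
        regime = "negative selection"
    else:
        regime = "positive selection"
    return omega, regime


# Hypothetical global rate estimates for three alignments.
for dn, ds in [(0.2, 1.0), (1.0, 1.0), (1.5, 0.5)]:
    omega, regime = classify_omega(dn, ds)
    print(f"dN={dn}, dS={ds}: omega={omega:.2f} -> {regime}")
```

A global ω averages over the whole alignment, so a gene with a few positively selected codons among many conserved ones can still show ω well below 1, which is why site-by-site and lineage-by-lineage analyses are useful.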



Datamonkey (http://www.datamonkey.org) is a web interface which uses HyPhy batch files to run most of its tools and packages for computational analyses. It can be used for estimating dS and dN over an alignment of coding sequences and for identifying codons and lineages under selection. It also provides “state of the art” tests based on codon models to infer signatures of positive Darwinian selection by comparing the rates of synonymous (dS) and non-synonymous (dN) substitutions, even in the presence of recombination. It reports ω (= dN/dS) using a variety of evolutionary models. Apart from this, Datamonkey also offers a number of packages such as GARD, SLAC, REL, FEL, EVOBLAST etc. These will be discussed in the next issue. Keep reading!!

A comprehensive list of references for this article is available upon request to the author ([email protected]).


TOOLS

T-Coffee: A tool that combines both local and global alignments

Muniba Faiza



T-Coffee is a multiple sequence alignment tool whose name stands for Tree-based Consistency Objective Function for alignment Evaluation. It combines the best properties of local and global alignment, and for the local part it also uses the Smith-Waterman algorithm. T-Coffee is an advancement over other multiple alignment tools such as ClustalW, MUSCLE (discussed in an earlier article), etc.

It has two main features. First, it builds the multiple alignment from various data sources, collected in a library of pairwise alignments (global + local). Second, its optimization method finds the multiple alignment that best fits the input library.

Fig.1 Layout of the T-Coffee strategy; the main steps required to compute a multiple sequence alignment using the T-Coffee method. Square blocks designate procedures while rounded blocks indicate data structures.

http://i1.wp.com/bioinformaticsreview.com/wp-content/uploads/2015/12/Screenshot-1.jpg

How does T-Coffee work?

1. Generate primary library of alignments:

The primary library consists of a set of pairwise alignments between all of the sequences to be aligned. It may include two or more different alignments of the same pair of sequences: here, each pair is aligned locally and also globally, the global alignment being done using ClustalW.

2. Derive primary library weights:

In this step a weighting scheme is used to identify the most reliable residue pairs. A weight is assigned to each pair of aligned residues in the library. Sequence identity is the criterion used to measure accuracy, which is reasonable for sequences with more than 30% identity. For each set of sequences, two libraries are constructed along with their weights, one using ClustalW and the other using Lalign (a program of the FASTA package).

3. Combine libraries:

In this step, a pair that is duplicated between the two libraries is merged into a single entry whose weight is the sum of the two weights; otherwise a new entry is created for the pair being considered.

4. Extend library:

A triplet approach based on intermediate sequences is used. For example, given four sequences A, B, C and D, the alignment of A with B is also examined through C and through D, to check how consistent the A-B alignment is with the rest of the library.

5. Progressive alignment strategy:

A distance matrix is constructed from the pairwise alignments between all the sequences, and from it a guide tree is built using the Neighbor Joining (NJ) method. The two closest sequences on the tree are aligned first, and the gaps placed in that alignment are kept. Then the next two closest sequences are aligned, or a sequence is added to the existing alignment. This continues until all the sequences have been aligned.
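Steps 3 and 4 above can be sketched in a few lines of Python. This is an illustration, not T-Coffee's code. A library is modelled here as a dictionary from an unordered pair of residues to a weight, and a residue is a hypothetical (sequence name, position) tuple. The extension rule used below, where the contribution of an intermediate residue is the smaller of its two link weights and the contributions are summed, follows the published description of T-Coffee as best understood; it is not stated in this article, so treat it as an assumption. All weights are made up.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Tuple

Residue = Tuple[str, int]                  # (sequence name, residue position)
Library = Dict[FrozenSet[Residue], float]  # unordered residue pair -> weight


def pair(a: Residue, b: Residue) -> FrozenSet[Residue]:
    """Key for an aligned residue pair, independent of order."""
    return frozenset((a, b))


def combine(*libraries: Library) -> Library:
    """Step 3: merge libraries, summing the weights of duplicated pairs."""
    combined: Library = defaultdict(float)
    for library in libraries:
        for residue_pair, weight in library.items():
            combined[residue_pair] += weight
    return dict(combined)


def extend(library: Library) -> Library:
    """Step 4: add the support each pair receives through intermediate residues."""
    partners = defaultdict(dict)  # residue -> {aligned residue: weight}
    for residue_pair, weight in library.items():
        a, b = tuple(residue_pair)
        partners[a][b] = weight
        partners[b][a] = weight
    extended = dict(library)
    for linked in partners.values():  # one intermediate residue at a time
        residues = sorted(linked)
        for i, x in enumerate(residues):
            for y in residues[i + 1:]:
                if x[0] == y[0]:
                    continue          # both residues are in the same sequence
                key = pair(x, y)
                extended[key] = extended.get(key, 0.0) + min(linked[x], linked[y])
    return extended


# One residue from each of four hypothetical sequences.
A, B, C, D = ("A", 1), ("B", 1), ("C", 1), ("D", 1)
global_library = {pair(A, B): 50.0, pair(A, C): 77.0, pair(B, C): 100.0,
                  pair(A, D): 100.0, pair(B, D): 100.0}
local_library = {pair(A, B): 38.0}

primary = combine(global_library, local_library)
extended = extend(primary)
print("A-B weight after combining:", primary[pair(A, B)])
print("A-B weight after extension:", extended[pair(A, B)])
```

After extension, a residue pair that is consistent with many other alignments in the library carries a high weight, and it is these weights, rather than a fixed substitution matrix, that guide the progressive alignment in step 5.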

Fig.2 The library extension. (a) Progressive alignment. Four sequences have been designed. The tree indicates the order in which the sequences are aligned when using a progressive method such as ClustalW. The resulting alignment is shown, with the word CAT misaligned. (b) Primary library. Each pair of sequences is aligned using ClustalW. In these alignments, each pair of aligned residues is associated with a weight equal to the average identity among matched residues within the complete alignment (mismatches are indicated in bold type). (c) Library extension for a pair of sequences. The three possible alignments of sequence A and B are shown (A and B, A and B through C, A and B through D). These alignments are combined, as explained in the text, to produce the position-specific library. This library is resolved by dynamic programming to give the correct alignment. The thickness of the lines indicates the strength of the weight.

