bioinformatics review - december 2015 issue
December issue of Bioinformatics Review. Available via http://bioinformaticsreview.com
![Page 1: BIOINFORMATICS REVIEW - DECEMBER 2015 ISSUE](https://reader033.vdocuments.us/reader033/viewer/2022052705/579075cd1a28ab6874b64a02/html5/thumbnails/1.jpg)
DECEMBER 2015 VOL 1 ISSUE 3
Do you HYPHY with
(Data) Monkey!!
Perl one-liners for Bioinformaticians
- L.L Gatlin
Contents
December 2015

Editorial

Tools
Roary: Analysis of Prokaryote Pan Genome on a large-scale
Do you HYPHY with (Data) Monkey !!
T-Coffee: A tool that combines both local and global alignments

Proteomics
Disulphide Connectivity in Protein Tertiary Structure Prediction

Programming
Perl one-liners for bioinformaticians

Meta Analysis
Venice Criteria: Overview

Systems Biology
Two Components System: Potential Drug Target in Mycobacterium tuberculosis

Genomics
Mycobacteriophages & their potentials as source against Mycobacterial active molecules

Software
TIN: R package to analyze Transcriptome Instability
EDITORIAL
EDITOR
Dr. PRASHANT PANT
SECTION EDITORS
ALTAF ABDUL KALAM
MANISH KUMAR MISHRA
SANJAY KUMAR
PRAKASH JHA
NABAJIT DAS
REPRINTS AND PERMISSIONS
You must have permission before reproducing any material from Bioinformatics Review. Send e-mail
requests to [email protected]. Please include contact details in your message.
BACK ISSUE
Bioinformatics Review back issues can be downloaded in digital format from bioinformaticsreview.com
at $5 per issue. Back issues in print format cost $2 for India delivery and $11 for international delivery,
subject to availability. Pre-payment is required.
CONTACT
PHONE +91. 991 1942-428 / 852 7572-667
MAIL Editorial: 101 FF Main Road Zakir Nagar, Okhla New Delhi IN 110025
STAFF ADDRESS To contact any of the Bioinformatics Review staff members, simply format the
address as [email protected]
PUBLICATION INFORMATION
Volume 1, Number 1, Bioinformatics Review™ is published monthly for one year (12 issues) by Social
and Educational Welfare Association (SEWA) Trust (registered under Trust Act 1882). Copyright 2015
SEWA Trust. All rights reserved. Bioinformatics Review is a trademark of Idea Quotient Labs and used
under license by SEWA Trust. Published in India.
EXECUTIVE EDITOR
FOZAIL AHMAD
FOUNDING EDITOR
MUNIBA FAIZA
EDITORIAL
Pursuing PhD in Sciences?
Dr. Prashant Pant, Editor-in-Chief

All the low-hanging fruits in the sciences have been plucked to kingdom come, and it is time to review the education system. It is also time to put certain fundamental questions before young minds so that they can take an informed decision about their careers.

Very recently, an article by Mark Taylor appeared in Nature under the title “Reform the PhD system or close it down”. The article emphasised that most doctoral programmes simply keep producing PhDs, who are then poorly absorbed by universities, institutions and the corporate world because of deficiencies in the system, in the degree, or in both. The article also talks about the medieval nature of most doctoral programmes, which has made them irrelevant and unsustainable as universities all over the world churn out a growing number of PhDs. Two questions come to mind. First, why did this happen? Second, where did things go wrong, and who is to blame? The probable answer to the first question lies in the opening statement of this editorial: this was bound to happen. The second question is more intriguing and needs discussion. Most doctoral programmes are designed to train students to perform research and analyses along a stereotyped path. That is not wrong in itself, but it lets scholars see a PhD as a lucrative way to get “Dr.” prefixed to their names without putting much pressure on their grey matter.

PhDs are not about filling pages under five chapters after a couple of years. PhDs are not about stereotyped work done by scholar after scholar in a laboratory to fulfil the mentor’s desire to become a self-declared expert on a topic. PhDs are (and should be) about questions, and genuine ones at that. They are (and should be) about beautiful experimental designs that attack the question from all corners and try to answer it. They are about thinking what, why and how something happens, and how that piece of information can be taken further to serve another research question.

Our education system will wake up one day and introduce revolutionary modifications, crashing the dreams of many PhD aspirants. We should not wait for that day to dawn upon us; rather, we should prepare our young minds to start thinking. So the question you should be asking yourself during your graduation or post-graduation is “Do you have it in you to do research and earn a PhD degree?” If you don’t ask, you are going to repent heavily. If you question everything around you and can work relentlessly to try and answer a question, then PhDs are for you. If you can connect radically different aspects and weave them together into simpler forms, then PhDs are for you.

Science can take you to any part of the world, good or bad; if you are ready for that, PhDs are for you. If you are ready to explore again and again, PhDs are for you. If you can give up other things in life for the sake of the questions, PhDs are for you. One should always remember that it is a (doctoral) degree in philosophy and not in sciences; therefore, the question (that you ask) and the meaning/interpretation of the answer are more important than plain science. So, start thinking!!
Roary: Analysis of Prokaryote Pan Genome on a large-scale
Muniba Faiza
Image Credit: Google Images
“A new method to generate the pan genome of a set of related prokaryotic isolates and named the tool as ‘Roary’.”
The microbial pan genome is the union of all the genes found across the genomes of interest. The term dates to 2005 (Medini et al.).
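To make the definition concrete, here is a minimal Python sketch that computes a pan genome as a set union, with the core genome (genes present in every isolate) shown for contrast. The isolate names and gene names are made up for illustration, and the sketch assumes genes have already been assigned comparable labels; real pan genome tools work on clustered protein sequences instead.

```python
def pan_and_core(genomes):
    """Return (pan, core, accessory) gene sets.

    genomes: dict mapping an isolate name to the set of gene names found in it.
    """
    if not genomes:
        raise ValueError("at least one genome is required")
    gene_sets = list(genomes.values())
    pan = set().union(*gene_sets)        # genes seen in at least one isolate
    core = set.intersection(*gene_sets)  # genes seen in every isolate
    accessory = pan - core               # genes missing from some isolates
    return pan, core, accessory


# Hypothetical toy input: three isolates described by gene names.
genomes = {
    "isolate_1": {"dnaA", "gyrB", "recA", "blaZ"},
    "isolate_2": {"dnaA", "gyrB", "recA"},
    "isolate_3": {"dnaA", "gyrB", "mecA"},
}
pan, core, accessory = pan_and_core(genomes)
print(sorted(pan))        # all five gene names
print(sorted(core))       # ['dnaA', 'gyrB']
print(sorted(accessory))  # the three genes that are not in every isolate
```

Adding a new isolate can only grow the pan genome and shrink the core genome, which is why both are usually reported together.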
Since then, the amount of microbial genome data has grown enormously, and studying processes such as selection and evolution requires constructing the pan genome of a species. Constructing a pan genome from real data is difficult, however, and the result may be inaccurate because of fragmented assemblies, poor annotation and contamination. In addition, microbial organisms can rapidly acquire genes from other organisms. Andrew J. Page et al. have therefore developed a new method to generate the pan genome of a set of related prokaryotic isolates and named the tool ‘Roary’. It deals with thousands of isolates in a feasible time.
How does Roary work?
One annotated assembly per sample is given to Roary as input. Coding regions are extracted and converted into protein sequences, all partial sequences are removed, and the remainder is pre-clustered using CD-HIT (a fast program for clustering and comparing sequences). This produces a reduced set of protein sequences. These reduced sequences are compared all-against-all using BLASTP with a user-defined percentage sequence identity (default 95%). Next, conserved gene neighbourhood information is used to split homologous groups into true orthologs. Finally, a graph is constructed showing the relationships between the clusters, based on their order of occurrence in the input sequences.
In this way the orthologous genes of prokaryotes can be identified easily and microbial evolution can be studied in detail, on a scale large enough to analyse the pan genomes of prokaryotes across big data sets. Other tools for the same purpose, namely PanOCT and PGAP, were developed before Roary, but Roary, which relies on heuristics, is the fastest and most feasible tool among them.
Perl one-liners for bioinformaticians
Muniba Faiza
Image Credit: Google Images
“Perl one-liners are extremely short Perl scripts written in the form of a string of commands that fits onto one line. Perl one-liners can be very useful in ad-hoc processing or parsing of files and streams from a plethora of sources.”
Perl one-liners are extremely short Perl scripts written in the form of a string of commands that fits onto one line. That amounts to a bit less than 80 characters for most purposes. Here’s the obligatory “Hello World!” one-liner in Perl and its output:
$ perl -e 'print "Hello World!\n";'
Hello World!
Try it! (of course, Perl must be
installed on your computer for the
“perl” command to work).
The most common and useful way to use such one-liners is as stream processors on the command line, sometimes connected by pipes to other utilities typical of a Linux command-line environment. To process the stream one would commonly use Perl regular expression syntax to match (m/string/) or substitute (s/string1/string2/). Let us use “echo” to generate an empty input line to act upon and “-p” to tell Perl to print the $_ variable (the current line) after processing each line:
$ echo | perl -pe 's/$_/Hello World!\n/;'
Hello World!
Notice that Perl iterates over all lines of the input (first create a file test with 3 empty lines):
$ cat test | perl -pe 's/$_/Hello World!\n/;'
Hello World!
Hello World!
Hello World!
Finally, let us introduce the “-i” switch to make Perl apply the changes directly to a supplied file:
$ perl -pi -e 's/$_/Hello World!\n/;' test2
This will result in the contents of test2 getting overwritten, with “Hello World!” now present on every line! Needless to say, the “-i” switch can be quite dangerous because of its ability to completely overwrite files.
Suppose you have a file whose lines you would like to number. This is a no-brainer with Perl one-liners! Just replace the beginning of each line with its number:
$ cat test2 | perl -pe '$i++; s/^/$i: /;'
1: Hello World!
2: Hello World!
3: Hello World!
The “^” symbol denotes the beginning of the line in Perl regular expressions. Notice that the one-liner actually contains two Perl statements separated by a semicolon (;).
Bioinformaticians often process FASTA files with nucleotide or amino-acid sequences. Suppose you have a FASTA file you would like to convert to a format where every sequence occupies only one line, so that you can apply “grep” to look for a specific k-mer in the sequence (say TATATAA for a TATA-box). This can be easily done by removing every end-of-line symbol on non-header lines:
$ cat test2 | perl -pe 's/^([^>]+)\n/$1/; s/^>/\n>/ if $. > 1; END{print "\n"}' | grep -B1 TATATAA
The “$1” is a special Perl variable created in regular expressions whenever you enclose something in parentheses. Here we do that with entire lines that do not begin with a “>” character (“^” inside brackets, as in “[^>]+”, means NOT “>”; in this case it selects non-header lines). The second substitution puts a newline back in front of every header after the first one (“$.” holds the current input line number), so that a header is not glued to the end of the preceding sequence.
Perl one-liners can be very useful
in ad-hoc processing or parsing of
files and streams from a plethora
of sources. Additional examples of
clever Perl one-liners can be
found here or here.
Two Components System: Potential Drug Target in Mycobacterium tuberculosis
Fozail Ahmad
Image Credit: Google Images
“To set the stage of infection, to establish itself in the host’s defending environment, to cause the pathogenicity by overcoming the immune system and to escape out from any assailable host attack, this TB causing pathogen has developed a well-embodied system known as two-component system (TCS).”
The genomic complexity of Mycobacterium tuberculosis (Mt), and the many proteins and genes whose functions remain unknown, have triggered in-depth study of the entire genome to explore the factors that influence Mt’s behaviour at the molecular level. To set the stage for infection, to establish itself in the host’s defensive environment, to cause pathogenicity by overcoming the immune system and to escape any host attack, this TB-causing pathogen has developed a well-organised system known as the two-component system (TCS). It consists of two proteins, universally designated the sensor protein and the response regulator protein.

The basic function of these proteins is to sense environmental signals and respond accordingly. After interacting with a suitable stimulating ligand, the sensor protein, a histidine kinase, binds and hydrolyzes ATP, catalysing the autophosphorylation of a conserved histidine residue and producing a high-energy phosphoryl group. The phosphate is then transferred to a conserved aspartic acid residue on the associated receiver protein, known as the response regulator, generating a high-energy acyl phosphate. Once the phosphotransfer reaction has taken place, the response regulator is activated and can carry out its specific function. In most cases, the activated response regulator modulates transcription by binding DNA at a specific site in the promoter region of its target genes. The net effect is a change in global gene expression that helps the pathogen respond to the initial signal sensed by the histidine kinase. There are eleven such TCSs in the pathogen. Their primary task is to control the expression of specific genes at specific times in response to environmental conditions, thereby contributing to the growth of the pathogen inside the host. Since each TCS has a distinct function, together they orchestrate most of the gene regulatory processes. Of the eleven, only eight have been studied comprehensively; the others remain to be characterised by further genomic analysis of Mt.
Interdisciplinary relevance: Understanding such microscopic biological processes systematically has required a number of sophisticated experimental procedures, which in turn support deterministic or stochastic approaches capable of describing the real molecular system. Biological modelling and simulation are among these approaches: they use wet-lab data to annotate biochemical processes and to understand how the real biological mechanism behaves.
Systems biology opens a new way to analyse the raw data generated through wet-lab experiments, characterising and evaluating them through mathematical modelling, simulation and network analysis, and drawing out their implications for a biological question. Because of their critical contribution to bacterial pathogenicity, two-component systems have provided new concepts for understanding molecular mechanisms that are yet to be explored. So far, their behaviour and activation are only partly understood as far as the exact regulatory mechanism is concerned. Applying mathematical models and simulations to this regulatory behaviour could test the global association of TCSs with genome-wide expression and show how this pathogen becomes so virulent. Another important question is at what level of gene activation pathogenicity becomes rampant and overwhelms host immunity. An initiative to characterise all the two-component systems would bring molecular biology, chemistry, mathematics and network biology together to unravel the gene regulatory landscape of Mycobacterium tuberculosis.
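As a rough illustration of what a deterministic model of a two-component system can look like, the Python sketch below integrates a minimal mass-action scheme with three reactions: autophosphorylation of the histidine kinase, phosphotransfer to the response regulator, and dephosphorylation of the response regulator. The reaction scheme is deliberately simplified, and every rate constant and concentration is a hypothetical placeholder rather than a measured value for any Mt system.

```python
def simulate_tcs(signal, hk_total=1.0, rr_total=1.0,
                 k_auto=1.0, k_transfer=5.0, k_dephos=0.5,
                 dt=0.001, t_end=50.0):
    """Return the fraction of phosphorylated response regulator at t_end.

    Simple forward-Euler integration; all parameters are illustrative.
    """
    hk_p = 0.0  # phosphorylated histidine kinase
    rr_p = 0.0  # phosphorylated response regulator
    for _ in range(int(t_end / dt)):
        hk = hk_total - hk_p
        rr = rr_total - rr_p
        auto = k_auto * signal * hk        # HK + ATP -> HK~P (signal dependent)
        transfer = k_transfer * hk_p * rr  # HK~P + RR -> HK + RR~P
        dephos = k_dephos * rr_p           # RR~P -> RR + Pi
        hk_p += dt * (auto - transfer)
        rr_p += dt * (transfer - dephos)
    return rr_p / rr_total


# Stronger signals should leave more of the regulator phosphorylated.
for s in (0.0, 0.5, 2.0):
    print(s, round(simulate_tcs(s), 3))
```

A stochastic version of the same scheme would replace the rate equations with random reaction events, which matters when only a few copies of each protein are present in the cell.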
Disulphide Connectivity in Protein Tertiary Structure Prediction
Muniba Faiza
Image Credit: Google Images
“The disulphide bonds formed between non-adjacent Cysteine residues are identified that would be cross-linked from other possible residues.”
Approaches to protein structure prediction have multiplied and have been successful in many cases, but the problem remains a big challenge. To handle it, protein structure prediction is divided into separate sub-problems, each of which provides information about the whole system (i.e., the protein structure). One of these sub-problems is disulphide connectivity: identifying which non-adjacent cysteine residues are cross-linked by disulphide bonds, out of all the possible cysteine pairs.
Since disulphide bridges play an important role in the folding, stability and function of a protein, predicting disulphide connectivity can help in predicting protein structure. Disulphide connectivity can be studied in two steps: first, disulphide bonding state prediction, and second, disulphide connectivity prediction (DCP). The first step classifies each cysteine as either bonded to another cysteine or free. DCP then identifies which pairs of cysteines are bonded to each other in a protein sequence. Various predictors are available for these tasks, mainly based on Neural Networks (NN), Support Vector Machines (SVMs) and other predictive methods.
An Artificial Neural Network (ANN) is a computing system of interconnected elements that processes external inputs through its dynamic response to them. An ANN provides a likelihood of forming a disulphide bond for each cysteine pair. Algorithms such as Gabow’s algorithm are then typically applied to these pairwise likelihoods to select a connectivity pattern. SVMs are machine learning tools used here to predict tertiary-structure features from the primary sequence of proteins. This approach uses the Edmonds-Gabow algorithm and position-specific scoring matrices (PSSMs). The accuracy of the predicted connectivity patterns is then assessed with two parameters, Rb and Qb. Rb is the ratio of the number of correctly predicted bonds to the total number of disulphide bonds (Nb) in the test proteins. Qb is the ratio of the number of proteins whose connectivity patterns are correctly predicted (Nprot) to the total number of proteins (Nt) in the test set.
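The two ideas in this section, choosing a connectivity pattern from pairwise bond likelihoods and scoring predictions with Rb and Qb, can be sketched in a few lines of Python. The sketch below picks the pairing by brute-force enumeration, which is a stand-in for the Edmonds-Gabow matching algorithm and is practical only for a handful of cysteines. It also assumes that every cysteine passed in has already been predicted as bonded. The residue positions and scores are invented for illustration.

```python
def perfect_matchings(items):
    """Yield every way to split an even-length list into pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in perfect_matchings(remaining):
            yield [(first, partner)] + tail


def best_connectivity(cysteines, pair_score):
    """Return the pairing with the highest summed pair score (brute force)."""
    if len(cysteines) % 2:
        raise ValueError("an even number of bonded cysteines is required")
    return max(perfect_matchings(sorted(cysteines)),
               key=lambda pairs: sum(pair_score[p] for p in pairs))


def rb_qb(predicted, observed):
    """predicted/observed: one set of (i, j) bonds per protein, in the same order."""
    correct_bonds = sum(len(p & o) for p, o in zip(predicted, observed))
    total_bonds = sum(len(o) for o in observed)                          # Nb
    correct_proteins = sum(p == o for p, o in zip(predicted, observed))  # Nprot
    return correct_bonds / total_bonds, correct_proteins / len(observed)


# Hypothetical bond likelihoods for four cysteines, keyed by (lower, higher) position.
scores = {(5, 20): 0.9, (33, 48): 0.8, (5, 33): 0.3,
          (20, 48): 0.2, (5, 48): 0.1, (20, 33): 0.4}
print(best_connectivity([5, 20, 33, 48], scores))  # [(5, 20), (33, 48)]

predicted = [{(5, 20), (33, 48)}, {(3, 9), (12, 30)}]
observed = [{(5, 20), (33, 48)}, {(3, 12), (9, 30)}]
print(rb_qb(predicted, observed))  # (0.5, 0.5)
```

In the toy test set, the first protein is predicted perfectly and the second has no correct bonds, so half of the bonds and half of the proteins are correct.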
TIN: R package to analyze Transcriptome Instability
Muniba Faiza
Image Credit: Google Images
“TIN is a new R package which enables to analyze TIN from the expression data. TIN is a software package of R modules that uses a framework to analyze expression level data.”
Alternative splicing plays an essential role in the proper functioning of eukaryotic cells. It acts as a regulatory mechanism for gene expression, and any disruption of this mechanism may lead to human diseases such as cancer. Alternative splicing of pre-mRNA is also a major source of transcript variation in human beings. Cancer-associated variation may occur at different levels of gene regulation, particularly during the processing of pre-mRNA into mature mRNA. A better understanding of these mechanisms may therefore provide insights into the causes and development of disease.
TIN is a new R package for analysing transcriptome instability (TIN) from expression data. It is a set of R modules that provides a framework for analysing expression-level data.
WORKFLOW:
TIN takes raw expression data (cell intensity CEL files) as input and applies the FIRMA method (a method for detecting alternative splicing) to estimate transcript expression levels and the alternative splicing patterns between samples. FIRMA assigns a score to each exon-sample combination, based on how far the probes for that exon deviate from the expected gene expression level. The FIRMA score thus reflects the ratio between the expression level of an exon and that of the corresponding gene. A strongly positive score suggests that the exon is differentially included, and a strongly negative score implies that the exon is skipped.
Alternative splicing is mediated by several splicing factors and proteins, which remove introns from the pre-mRNA and join the exons together. TIN therefore tests the association between splicing factor expression levels and the amount of abnormal exon usage among the samples. To do this, it calculates the correlation between aberrant exon usage amounts and splicing factor expression levels across all samples. A pronounced correlation indicates that the aberrant exon usage may be related to splicing factor expression. The same correlation is then tested using random gene sets; if it is poor for the random sets, the abnormal exon usage can be attributed to the expression levels of the splicing factor genes.
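The correlation step can be sketched in Python as follows. This is not the TIN package’s own interface: the function names, the score threshold used to call an exon “aberrant” and the toy numbers are all assumptions made for illustration. Each sample’s aberrant exon usage amount is taken here as the number of exons whose FIRMA score is extreme in either direction, and that amount is then correlated with the expression of one splicing factor.

```python
from math import sqrt


def aberrant_exon_counts(firma_scores, threshold=2.0):
    """Count, per sample, the exons whose |FIRMA score| exceeds the threshold.

    firma_scores: list of rows, one row per exon, one column per sample.
    """
    n_samples = len(firma_scores[0])
    return [sum(abs(exon[s]) > threshold for exon in firma_scores)
            for s in range(n_samples)]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long, non-constant lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical FIRMA scores: 3 exons (rows) across 4 samples (columns).
firma_scores = [
    [0.1, 2.5, -3.0, 0.2],
    [-0.3, 2.2, 2.8, 0.0],
    [0.4, -0.1, -2.6, 0.3],
]
counts = aberrant_exon_counts(firma_scores)
print(counts)  # [0, 2, 3, 0]

# Hypothetical expression levels of one splicing factor in the same 4 samples.
splicing_factor = [8.0, 6.0, 5.0, 8.0]
print(pearson(counts, splicing_factor))
```

Repeating the `pearson` call for many randomly chosen genes gives the background against which the splicing factor correlations are judged.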
In this way, analysing gene expression levels and alternative splicing patterns may help to monitor a developing disease or to predict it at a very early stage.
For further reading, click here.
Note:
An exhaustive list of references for this article is available from the author on personal request. For more details, write to the author.
Fig.1 Workflow of TIN
Venice Criteria: Overview
Manish Kumar Mishra
Image Credit: Google Images
“Venice criteria can be understood as a set of three scores which are used to grade the evidence produced by the study.”
The plethora of research literature available to modern-day biologists provides the luxury of conducting a unique procedure: meta-analysis, an analysis of data about data.
Genome-wide association studies (GWAS) help the researcher narrow down to a specific biomolecule to target, whether for a curative purpose or for a broader analytical study of a particular trait.
To make a meta-analysis realistic and closer to the truth, every individual study needs to be scrutinized against some benchmark. This is where the Venice criteria come in handy.
Venice criteria can be understood as a set of three scores which are used to grade the evidence produced by the study. Each of these three scores can be graded ‘A’ at best, followed by ‘B’ and ‘C’, based on how meticulous the study was.
The first score is given for “Amount”, the second for “Replication”, and the final score for “Protection from bias”.
Each of these three grading criteria is defined in concrete, partly numerical terms, and the details follow.
Amount
Grade ‘A’, large-scale evidence: more than 1000 subjects in the least common genetic group of interest (cases and controls combined, assuming a case:control ratio of 1:1)
Grade ‘B’, moderate evidence: 100-1000 subjects in the least common genetic group of interest
Grade ‘C’, little evidence: fewer than 100 subjects in the least common genetic group of interest
Replication
Grade ‘A’: the association has been extensively replicated and is supported by at least one well-conducted meta-analysis.
Grade ‘B’: there is a well-conducted meta-analysis, but it has some methodological limitations or the studies show moderate inconsistency.
Grade ‘C’: there is no association or no independent replication, the meta-analysis is flawed, or the studies are not consistent with one another.
Protection from bias
Biases creep into studies from researchers’ preconceived notions and affect the compilation of data and the declaration of results. As with the previous two criteria, a study must also be scrutinized for biases that may have crept in.
Grade ‘A’: bias is minimized; if present, it can affect the magnitude but probably not the presence of an association.
Grade ‘B’: a considerable amount of information on how the evidence was generated is missing, but no bias is evident that would clearly undermine the association.
Grade ‘C’: evidence of bias is so heavy that it may affect the very existence of the association.
Thus the grades may be combined as follows:
AAA: strong evidence
AAB, ABA, ABB, BAA, BAB, BBA, BBB: moderate evidence
All other combinations (those containing a ‘C’) are treated as poor, unreliable evidence.
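The combination rule above is simple enough to express as a short Python sketch. The function names are invented for illustration, and the exact handling of the boundary values (treating exactly 1000 subjects as grade ‘B’ and exactly 100 as grade ‘B’) is an assumption about how the ranges should be read.

```python
VALID_GRADES = {"A", "B", "C"}


def amount_grade(n_subjects):
    """Grade 'Amount' from the number of subjects in the least common genetic group."""
    if n_subjects > 1000:
        return "A"
    if n_subjects >= 100:
        return "B"
    return "C"


def overall_evidence(amount, replication, bias):
    """Combine the three Venice grades into an overall evidence label."""
    grades = (amount, replication, bias)
    if any(g not in VALID_GRADES for g in grades):
        raise ValueError("each grade must be 'A', 'B' or 'C'")
    if grades == ("A", "A", "A"):
        return "strong"
    if "C" in grades:
        return "poor"
    return "moderate"  # only A and B grades, with at least one B


print(overall_evidence(amount_grade(1500), "A", "A"))  # strong
print(overall_evidence(amount_grade(400), "A", "B"))   # moderate
print(overall_evidence(amount_grade(60), "A", "A"))    # poor
```

Only the “Amount” grade can be derived from a number; the replication and bias grades still have to come from a reviewer’s judgement of the studies.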
Mycobacteriophages and their potential as source against Mycobacterial active biomolecules
Sanjay Kumar
Image Credit: Google Images
“There is a notable absence of mycobacteriophages from the family Podoviridae (containing short stubby tails), raising the question whether long tails are needed to traverse the relatively thick mycobacterial cell envelope.”
We are all aware of the epidemic threat posed by Mycobacterium tuberculosis and other related species. In this article we show how nature provides a solution against it.
Bacteriophages (bacterio = bacteria, phage = eater) are viruses that infect bacteria. A mycobacteriophage is a member of the group of bacteriophages that infect mycobacterial species as their hosts, e.g., Mycobacterium smegmatis and Mycobacterium tuberculosis, the causative agent of tuberculosis.
The rising incidence of tuberculosis, the emergence of multi-drug resistance in Mycobacterium tuberculosis and slow progress in finding new drugs make mycobacteriophages potential candidates for use as diagnostic and therapeutic tools against TB.
All the characterized mycobacteriophages are double-stranded DNA (dsDNA) tailed phages belonging to the order Caudovirales. Most are of the family Siphoviridae, characterized by long, flexible, non-contractile tails, whereas the rest belong to the family Myoviridae, which have contractile tails. There is a notable absence of mycobacteriophages from the family Podoviridae (which have short stubby tails), raising the question of whether long tails are needed to traverse the relatively thick mycobacterial cell envelope. dsDNA tailed phages are either temperate, forming stable lysogens and turbid plaques, or lytic, forming clear plaques in which the host cells are killed. Mycobacteriophages can also be studied through the morphology of their plaques, which vary in size and shape. Plaque morphology also depends on the burst size, which is the number of phage particles released on lysis of the infected bacterium.
GENOMETRICS OF 70 SEQUENCED
MYCOBACTERIOPHAGES
Since the mycobacterial cell wall consists of a mycolic acid-rich mycobacterial outer membrane attached to an arabinogalactan layer, which is in turn linked to the peptidoglycan, it poses a significant challenge to the phages. This challenge is met by a set of proteins: Lysin B proteins, which cleave the linkage of mycolic acids to the arabinogalactan layer; holins, which regulate lysis timing; and endolysins (Lysin As), which hydrolyze peptidoglycan.
Phages lyse their hosts with a holin-endolysin system that is essential for programmed lysis. Endolysin is found to be associated with a protein component of the phage tail that facilitates penetration of the murein during injection of the genome into the host. Holins are small membrane proteins that form holes in the membrane through which the endolysin can pass. Holins control the length of the infective cycle for lytic phages so as to achieve lysis at an optimal time.
Endolysins are a potential source of antibacterials because of their specificity (they target only a few strains of bacteria, whereas antibiotics have a more wide-ranging effect), the low probability of Mycobacterium developing resistance to them, and their novel mode of action.
Bioinformatics can assist this field of research by finding other such proteins in nature, or by helping to design alternatives with similar pharmacophore properties (physical and chemical attributes). By using the options nature provides, we can counter various disease threats and remain healthy on this planet. The only point to remember is: NATURE CAN SATISFY OUR NEEDS, BUT IT CANNOT SUSTAIN OUR GREED. JUST AS A HEALTHY BODY HOLDS A HEALTHY MIND, A CONSERVED PLANET CONSERVES ITS SPECIES TOO.
Hatfull, Graham F.
“Mycobacteriophages: genes and
genomes.” Annual review of
microbiology 64 (2010): 331-
Do you HYPHY with (Data) Monkey !!
Prashant Pant
Image Credit: Google Images
“Datamonkey is a web interface (http://www.datamonkey.org) which uses HyPhy batch files to execute most of its tools and packages for the computational analyses.”
HyPhy, an acronym for Hypothesis Testing Using Phylogenies (www.hyphy.org), was written and designed by Kosakovsky Pond and co-workers to provide likelihood-based analyses of molecular evolutionary data sets and to help detect differential rates of variability within coding sequence data sets. It is freely available, has a Graphical User Interface and can be used by anyone, with or without much exposure to computer languages or programming.
It was earlier presumed that substitution rates were uniform over an alignment of homologous DNA/protein sequences, but many workers studying the molecular evolutionary processes that influence rates and patterns of evolution have refuted this presumption with a large body of data. This is especially true for rapidly evolving gene families and for viral genomes. Natural selection acts on different domains, regions and sites, which may be under positive, negative or neutral selection pressure. Positive selection shows up as an excess of non-synonymous substitutions in a protein-coding sequence, which alter protein structure and function and can confer a fitness advantage on the organism. Negative selection shows up as an excess of synonymous substitutions, which leave the amino acid sequence, and hence protein structure and function, unchanged.
A neutral evolution is said to be
taking place when the non-
synonymous substitutions does not
affect the protein structure and
function and rate of non-
synonymous substitutions. The rate
of synonymous and non-
synonymous substitutions is given
by dS and dN respectively. In the
case of neutral evolution, dS and dN
are observed to be in equilibrium.
Accordingly, the ratio of dN/dS
given by ω=β/α (also referred to as
dN/dS) has become a standard
measure of selective pressure. The
total ω for a sequence alignment is
referred to as Global ω. Global ω
with a value of approximately 1
signifies neutral evolution, below 1
suggests negative selection whereas
ω more than 1 implies positive
selection. To start with the analyses,
all one needs is, a suitable codon
substitution model as detected by
MODELTEST program (available
online), a nexus formatted
sequence alignment file (must be
codon data file) and a Maximum
Liklihood tree of the data.
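The interpretation of ω described above can be sketched in a few lines of Python. This is only an illustration, not how HyPhy estimates rates: the function names and the `tolerance` band around 1 are assumptions made for this example, and a real analysis would test statistically whether ω differs from 1. The codon table is the standard genetic code.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def is_synonymous(codon_before, codon_after):
    """True if the substitution leaves the encoded amino acid unchanged."""
    return CODON_TABLE[codon_before.upper()] == CODON_TABLE[codon_after.upper()]


def omega(dn, ds):
    """Ratio of non-synonymous (dN) to synonymous (dS) substitution rates."""
    if ds <= 0:
        raise ValueError("dS must be positive to compute dN/dS")
    return dn / ds


def classify_omega(w, tolerance=0.05):
    """Label a global omega value; `tolerance` is an arbitrary illustrative band."""
    if abs(w - 1.0) <= tolerance:
        return "neutral"
    return "positive" if w > 1.0 else "negative"


# CTT -> CTC keeps leucine (synonymous); GAA -> GTA changes Glu to Val.
print(is_synonymous("CTT", "CTC"), is_synonymous("GAA", "GTA"))
print(classify_omega(omega(dn=0.02, ds=0.04)))
```

The hard cut-off is the weakest part of this sketch; it only mirrors the rule of thumb "below 1, about 1, above 1" given in the text.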
Datamonkey is a web interface (http://www.datamonkey.org) which uses HyPhy batch files to execute most of its tools and packages for computational analyses. It can be used to estimate dS and dN over an alignment of coding sequences and to identify codons and lineages under selection. It also provides “state of the art” tests based on codon models to infer signatures of positive Darwinian selection by comparing the rates of synonymous (dS) and non-synonymous (dN) substitutions, even in the presence of recombination, and it reports ω (= dN/dS) under a variety of evolutionary models. Apart from this, Datamonkey also offers a number of packages such as GARD, SLAC, REL, FEL, EVOBLAST etc. These will be discussed in the next issue. Keep reading!!
A comprehensive list of references for this article is available upon request to the author.
T-Coffee : A tool that combines both local and global alignments
Muniba Faiza
Image Credit: Google Images
“T-Coffee is a multiple sequence alignment tool which stands for Tree-based Consistency Objective Function for alignment Evaluation. It is a simultaneous alignment which combines the best properties of local and global alignment and for this it also uses the Smith-Waterman algorithm.”
T-Coffee is a multiple sequence alignment tool; the name stands for Tree-based Consistency Objective Function for alignment Evaluation. It performs a simultaneous alignment that combines the best properties of local and global alignment, and for the local part it uses the Smith-Waterman algorithm. T-Coffee is an advancement over other multiple alignment tools such as ClustalW, MUSCLE (discussed in an earlier article), etc.
It has two main features. First, it builds the multiple alignment from various data sources, collected in a library of pairwise alignments (global + local). Second, its optimization method finds the multiple alignment that best fits the input library.
Fig.1 Layout of the T-Coffee strategy; the main steps required to compute a multiple sequence alignment using the T-Coffee method. Square blocks designate procedures while rounded blocks indicate data structures.
How does T-Coffee work?
1. Generate the primary library of alignments:
It consists of a set of pairwise alignments between all of the sequences to be aligned, and may include two or more different alignments of the same pair of sequences. Here the local alignments come from Lalign (a program of the FASTA package) and the global alignments are made using ClustalW.
2. Derive primary library weights:
A weight is assigned to each pair of aligned residues in the library, so that the most reliable residue pairs can be identified. The weight is based on sequence identity, which is a reasonable indicator of accuracy for sequences with more than 30% identity. For each set of sequences, two libraries are constructed along with their weights, one using ClustalW and the other using Lalign.
3. Combine libraries:
If a residue pair occurs in both libraries, the duplicates are merged into a single entry whose weight is the sum of the two weights; otherwise a new entry is created for the pair being considered.
4. Extend library:
A triplet approach based on intermediate sequences is used. For example, given four sequences A, B, C and D, the alignment of A and B is also checked through C and through D, i.e. against the alignments of A and B with each of the other sequences.
5. Progressive alignment strategy:
A distance matrix is computed from the pairwise alignments between all the sequences and used to build a guide tree with the Neighbor Joining (NJ) method. Following the tree, the two closest sequences are aligned first and the gaps introduced in this pair are fixed; then the next closest two sequences are aligned. This continues until all the sequences have been aligned.
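Steps 1 to 3 above can be sketched as follows. This is a minimal illustration under stated assumptions, not T-Coffee's actual code: the function names are hypothetical, the two input alignments are toy strings rather than real ClustalW and Lalign output, and the weight of every residue pair is taken to be the percent identity of the pairwise alignment it comes from.

```python
from collections import defaultdict


def residue_pairs(aln_a, aln_b):
    """Return the aligned residue index pairs and the percent identity
    of one gapped pairwise alignment (gaps written as '-')."""
    pairs, matches = [], 0
    i = j = 0  # positions in the ungapped sequences
    for x, y in zip(aln_a, aln_b):
        if x != "-" and y != "-":
            pairs.append((i, j))
            matches += x == y
        i += x != "-"
        j += y != "-"
    identity = 100.0 * matches / len(pairs) if pairs else 0.0
    return pairs, identity


def primary_library(name_a, name_b, aln_a, aln_b):
    """Build a weighted library: every aligned residue pair receives
    the identity of the alignment it was taken from."""
    pairs, identity = residue_pairs(aln_a, aln_b)
    return {((name_a, i), (name_b, j)): identity for i, j in pairs}


def combine_libraries(libraries):
    """Merge libraries: duplicated pairs get the sum of their weights,
    pairs seen only once simply become new entries."""
    combined = defaultdict(float)
    for library in libraries:
        for pair, weight in library.items():
            combined[pair] += weight
    return dict(combined)


# Two different toy alignments of the same pair of sequences.
lib_global = primary_library("A", "B", "ACGT", "AC-T")
lib_local = primary_library("A", "B", "ACGT", "A-CT")
for pair, weight in sorted(combine_libraries([lib_global, lib_local]).items()):
    print(pair, round(weight, 1))
```

Residue pairs supported by both alignments end up with the highest weights, which is what makes them the "most reliable" pairs when the library is later used.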
Fig.2 The library extension. (a) Progressive alignment. Four sequences have been designed. The tree indicates the order in which the sequences are aligned when using a progressive method such as ClustalW. The resulting alignment is shown, with the word CAT misaligned. (b) Primary library. Each pair of sequences is aligned using ClustalW. In these alignments, each pair of aligned residues is associated with a weight equal to the average identity among matched residues within the complete alignment (mismatches are indicated in bold type). (c) Library extension for a pair of sequences. The three possible alignments of sequence A and B are shown (A and B, A and B through C, A and B through D). These alignments are combined, as explained in the text, to produce the position-specific library. This library is resolved by dynamic programming to give the correct alignment. The thickness of the lines indicates the strength of the weight.
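The library extension in panel (c) can be illustrated with a small sketch. The article does not spell out how the alignments through C and D are combined, so the rule used here is an assumption taken from the usual description of the T-Coffee method: the support that an intermediate residue lends to a pair is the minimum of the two weights along the path, and this is added to the pair's direct weight. The function names and the weights in the example are made up for illustration.

```python
from collections import defaultdict


def pair_key(x, y):
    """Order-independent key for a pair of (sequence, position) residues."""
    return tuple(sorted((x, y)))


def extend_library(library):
    """Add to each residue pair the support it receives through
    intermediate residues (assumed rule: minimum of the two path weights)."""
    neighbours = defaultdict(dict)
    extended = defaultdict(float)
    for (x, y), weight in library.items():
        neighbours[x][y] = weight
        neighbours[y][x] = weight
        extended[pair_key(x, y)] += weight  # direct weight
    for z, linked in neighbours.items():
        residues = sorted(linked)
        for a in range(len(residues)):
            for b in range(a + 1, len(residues)):
                x, y = residues[a], residues[b]
                if x[0] == y[0]:
                    continue  # two residues of the same sequence are never paired
                # z is aligned to both x and y, so it supports the pair x-y.
                extended[pair_key(x, y)] += min(linked[x], linked[y])
    return dict(extended)


# Illustrative weights: A-B aligned directly, and both aligned to C.
library = {
    (("A", 0), ("B", 0)): 88.0,
    (("A", 0), ("C", 0)): 77.0,
    (("B", 0), ("C", 0)): 100.0,
}
extended = extend_library(library)
print(extended[pair_key(("A", 0), ("B", 0))])  # 88 direct + min(77, 100) through C
```

A pair of residues that is consistent with many other pairwise alignments accumulates weight this way, which corresponds to the thicker lines in the figure.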