bbc 2015
December 7 - 8, 2015
Antwerp, Belgium
www.bbc2015.be
10th Benelux Bioinformatics Conference
10th Benelux Bioinformatics Conference
bbc 2015

PROCEEDINGS

December 7 and 8, 2015
Elzenveld, Lange Gasthuisstraat 45, 2000 Antwerp, Belgium
Welcome to the 10th Benelux Bioinformatics Conference!
Dear attendee,

It is our great pleasure to welcome you to the 10th Benelux Bioinformatics Conference in Antwerp, Belgium! We are especially proud to host this conference, for the first time ever, in Antwerp, the diamond city.

Ten years of BBC is worth celebrating. The meeting has always struck the right balance between strengthening the regional network and offering a scientifically strong program. From its inception ten years ago, the BBC has been a prominent platform for the thriving regional bioinformatics community to present its latest research. Many young bioinformatics scientists gave their first poster or oral presentation at a BBC edition, and the conference has always attracted a healthy mix of presenters and attendees from all career stages and with diverse backgrounds.

The program of this year's edition again demonstrates the wide range of life science disciplines in which bioinformatics now plays a key role. First, we are delighted to introduce two eminent keynote speakers: Cedric Notredame (Center for Genomic Regulation) and Lars Juhl Jensen (Novo Nordisk Foundation Center for Protein Research). Second, a program committee of 36 scientists has critically reviewed a large number of submissions and selected 24 authors to deliver an oral presentation. In addition, we have two special corporate talks. Furthermore, we again have a large number of poster presentations that promise a very interactive poster session, and our corporate sponsors will present their activities at their respective booths. Last but not least, our special guest Pierre Rouzé will bring us a perspective on the history of bioinformatics and ten years of Benelux Bioinformatics Conferences.

For this edition, we would also like to congratulate the 10 (mostly master's) students who were selected from a large pool of submissions to enjoy a student fellowship.
For many of them it is their first chance to actively participate in a scientific conference, and we hope that it inspires them in their future bioinformatics careers.

The program also includes a healthy mix of opportunities for social interaction and networking. The conference dinner, the coffee and lunch breaks, and the farewell drink are perfect occasions to strengthen the network even further.

We cannot close this foreword without a very warm word of thanks to the many people who made this event possible. Thanks to the sponsors for their crucial support, to the keynote speakers and all other presenters for sharing their work, to the program committee for reviewing the many abstracts, and to the many volunteers and administrative staff of the University of Antwerp for lending a helping hand in many different ways. Last but not least, thank you for being here and being part of yet another great BBC edition.

We wish you an enjoyable and very illuminating meeting.

On behalf of the organizing committee,
Kris Laukens & Pieter Meysman
BBC2015 chairs
University of Antwerp
Special thanks to the BBC 2015 sponsors!
Gold sponsors:
Silver sponsors:
Bronze sponsors:
Affiliations:
Organizing committee
Kris Laukens, University of Antwerp, Belgium
Pieter Meysman, University of Antwerp, Belgium
Geert Vandeweyer, University of Antwerp, Belgium
Yvan Saeys, Ghent University, Belgium
Thomas Abeel, Delft University of Technology, The Netherlands
Programme committee
Thomas Abeel, Delft University of Technology, The Netherlands
Stein Aerts, University of Leuven, Belgium
Francisco Azuaje, Luxembourg Institute of Health, Luxembourg
Gianluca Bontempi, Université libre de Bruxelles, Belgium
Tomasz Burzykowski, Hasselt University, Belgium
Susan Coort, Maastricht University, The Netherlands
Tim De Meyer, Ghent University, Belgium
Jeroen De Ridder, Delft University of Technology, The Netherlands
Dick De Ridder, Delft University of Technology, The Netherlands
Peter De Rijk, University of Antwerp, Belgium
Pierre Dupont, Université catholique de Louvain, Belgium
Pierre Geurts, University of Liège, Belgium
Peter Horvatovich, University of Groningen, The Netherlands
Jan Ramon, University of Leuven, Belgium
Rob Jelier, University of Leuven, Belgium
Gunnar Klau, Centrum Wiskunde & Informatica, The Netherlands
Andreas Kremer, ITTM S.A., Luxembourg
Kris Laukens, University of Antwerp, Belgium
Tom Lenaerts, Université libre de Bruxelles, Belgium
Steven Maere, Ghent University / VIB, Belgium
Lennart Martens, Ghent University / VIB, Belgium
Pieter Meysman, University of Antwerp, Belgium
Perry Moerland, University of Amsterdam, The Netherlands
Pieter Monsieurs, SCK-CEN, Belgium
Yves Moreau, University of Leuven, Belgium
Yvan Saeys, Ghent University / VIB, Belgium
Thomas Sauter, University of Luxembourg, Luxembourg
Alexander Schoenhuth, Centrum Wiskunde & Informatica, The Netherlands
Berend Snel, Utrecht University, The Netherlands
Dirk Valkenborg, VITO, Belgium
Raf Van de Plas, Delft University of Technology, The Netherlands
Vera van Noort, University of Leuven, Belgium
Natal van Riel, Eindhoven University of Technology, The Netherlands
Klaas Vandepoele, Ghent University / VIB, Belgium
Geert Vandeweyer, University of Antwerp, Belgium
Wim Vrancken, Vrije Universiteit Brussel, Belgium
Local Organizing Committee
Charlie Beirnaert, University of Antwerp
Wout Bittremieux, University of Antwerp
Bart Cuypers, University of Antwerp
Nicolas De Neuter, University of Antwerp
Aida Mrzic, University of Antwerp
Stefan Naulaerts, University of Antwerp
The results published in this book of abstracts are under the full responsibility of the authors. The organizing committee cannot be held responsible for any errors in this publication or potential consequences thereof.
Conference agenda 1/2
December 6, 2015: Satellite events
12.30 – 19.00 Student-run satellite meeting at the Institute of Tropical Medicine, Antwerp.
19.00 - … Guided sightseeing tour of Antwerp for early arrivals.
December 7, 2015: Main conference
8.30 - 9.30 Registration and welcome coffee.
9.30 - 9.50 Welcome and conference opening, with a foreword by UAntwerpen Rector Prof. Alain Verschoren.
9.50 - 10.50 K1 Invited keynote: Lars Juhl Jensen. Medical data and text mining: Linking diseases, drugs, and adverse reactions.
10.50 - 11.10 Coffee break.

Selected talks session 1
11.10 - 11.25 O1 Mafalda Galhardo, Philipp Berninger, Thanh-Phuong Nguyen, Thomas Sauter and Lasse Sinkkonen. Cell type-selective disease association of genes under high regulatory load.
11.25 - 11.40 O2 Andrea M. Gazzo, Dorien Daneels, Maryse Bonduelle, Sonia Van Dooren, Guillaume Smits and Tom Lenaerts. Predicting oligogenic effects using digenic disease data.
11.40 - 11.55 O3 Wouter Saelens, Robrecht Cannoodt, Bart N. Lambrecht and Yvan Saeys. A comprehensive comparison of module detection methods for gene expression data.
11.55 - 12.10 O4 Joana P. Gonçalves and Sara C. Madeira. LateBiclustering: Efficient discovery of temporal local patterns with potential delays.
12.10 - 12.30 C1 Nicolas Goffard. Illumina software platforms to transform the path to knowledge and discovery. (Corporate presentation: Illumina)
12.30 - 15.00 Lunch break & poster session.

Selected talks session 2
15.00 - 15.15 O5 Robrecht Cannoodt, Katleen De Preter and Yvan Saeys. Inferring developmental chronologies from single cell RNA.
15.15 - 15.30 O6 Vân Anh Huynh-Thu and Guido Sanguinetti. Combining tree-based and dynamical systems for the inference of gene regulatory networks.
15.30 - 15.45 O7 Annika Jacobsen, Nika Heijmans, Renée van Amerongen, Martine Smit, Jaap Heringa and K. Anton Feenstra. Modeling the regulation of β-catenin signalling by WNT stimulation and GSK3 inhibition.
15.45 - 16.00 O8 Thanh Le Van, Jimmy Van den Eynden, Dries De Maeyer, Ana Carolina Fierro, Lieven Verbeke, Matthijs van Leeuwen, Siegfried Nijssen, Luc De Raedt and Kathleen Marchal. Ranked tiling based approach to discovering patient subtypes.
16.00 - 16.15 O9 Martin Bizet, Jana Jeschke, Matthieu Defrance, François Fuks and Gianluca Bontempi. Development of a DNA methylation-based score reflecting tumour infiltrating lymphocytes.
16.15 - 16.30 O10 Aliaksei Vasilevich, Shantanu Singh, Aurélie Carlier and Jan de Boer. Prediction of cell responses to surface topographies using machine learning techniques.
16.30 - 17.00 Coffee break.

Selected talks session 3
17.00 - 17.15 O11 Wout Bittremieux, Pieter Meysman, Lennart Martens, Bart Goethals, Dirk Valkenborg and Kris Laukens. Analysis of mass spectrometry quality control metrics.
17.15 - 17.30 O12 Şule Yılmaz, Masa Cernic, Friedel Drepper, Bettina Warscheid, Lennart Martens and Elien Vandermarliere. Xilmass: A cross-linked peptide identification algorithm.
17.30 - 17.45 O13 Nico Verbeeck, Jeffrey Spraggins, Yousef El Aalamat, Junhai Yang, Richard M. Caprioli, Bart De Moor, Etienne Waelkens and Raf Van de Plas. Automated anatomical interpretation of differences between imaging mass spectrometry experiments.
17.45 - 18.00 O14 Yousef El Aalamat, Xian Mao, Nico Verbeeck, Junhai Yang, Bart De Moor, Richard M. Caprioli, Etienne Waelkens and Raf Van de Plas. Enhancement of imaging mass spectrometry data through removal of sparse intensity variations.
18.10 - 18.30 Walk to the gala dinner, leaving from the conference venue.
18.30 - 22.00 Gala dinner at Pelgrom - Pelgrimstraat 15, Antwerpen.
Conference agenda 2/2

December 8, 2015: Main conference
8.30 - 9.30 Welcome coffee.
9.30 - 9.40 Opening and announcements.

Selected talks session 4
9.40 - 9.55 O15 Gipsi Lima Mendez, Karoline Faust, Nicolas Henry, Johan Decelle, Sébastien Colin, Fabrizio Carcillo, Simon Roux, Gianluca Bontempi, Matthew B. Sullivan, Chris Bowler, Eric Karsenti, Colomban de Vargas and Jeroen Raes. Determinants of community structure in the plankton interactome.
9.55 - 10.10 O16 Mohamed Mysara, Yvan Saeys, Natalie Leys, Jeroen Raes and Pieter Monsieurs. Bioinformatics tools for accurate analysis of amplicon sequencing data for biodiversity analysis.
10.10 - 10.25 O17 Sjoerd M. H. Huisman, Else Eising, Ahmed Mahfouz, Boudewijn P.F. Lelieveldt, Arn van den Maagdenberg and Marcel Reinders. Gene co-expression analysis identifies brain regions and cell types involved in migraine pathophysiology: a GWAS-based study using the Allen Human Brain Atlas.
10.25 - 10.40 O18 Ahmed Mahfouz, Boudewijn P.F. Lelieveldt, Aldo Grefhorst, Isabel Mol, Hetty Sips, Jose van den Heuvel, Jenny Visser, Marcel Reinders and Onno Meijer. Spatial co-expression analysis of steroid receptors in the mouse brain identifies region-specific regulation mechanisms.
10.40 - 11.10 Coffee break.

Selected talks session 5
11.10 - 11.25 O19 Bart Cuypers, Pieter Meysman, Manu Vanaerschot, Maya Berg, Malgorzata Domagalska, Jean-Claude Dujardin and Kris Laukens. A systems biology compendium for Leishmania donovani.
11.25 - 11.40 O20 Volodimir Olexiouk, Elvis Ndah, Sandra Steyaert, Steven Verbruggen, Eline De Schutter, Alexander Koch, Daria Gawron, Wim Van Criekinge, Petra Van Damme and Gerben Menschaert. Multi-omics integration: Ribosome profiling applications.
11.40 - 11.55 O21 Qingzhen Hou, Kamil Krystian Belau, Marc Lensink, Jaap Heringa and K. Anton Feenstra. CLUB-MARTINI: Selecting favorable interactions amongst available candidates: A coarse-grained simulation approach to scoring docking decoys.
11.55 - 12.10 O22 Elien Vandermarliere, Davy Maddelein, Niels Hulstaert, Elisabeth Stes, Michela Di Michele, Kris Gevaert, Edgar Jacoby, Dirk Brehmer and Lennart Martens. Pepshell: Visualization of conformational proteomics data.
12.10 - 12.30 C2 Carine Poussin. The systems toxicology computational challenge: Identification of exposure response markers. (Corporate presentation: sbv IMPROVER)
12.30 - 13.30 Lunch break.
13.30 - 14.30 K2 Invited keynote: Cedric Notredame. Multiple survival strategies to deal with the multiplication of multiple sequence alignment methods.

Selected talks session 6
14.30 - 14.45 O23 Thomas Moerman, Dries Decap and Toni Verbeiren. Interactive VCF comparison using Spark Notebook.
14.45 - 15.00 O24 Sepideh Babaei, Waseem Akhtar, Johann de Jong, Marcel Reinders and Jeroen de Ridder. 3D hotspots of recurrent retroviral insertions reveal long-range interactions with cancer genes.
15.00 - 15.30 Coffee break.
15.30 - 16.00 K3 Invited keynote: Pierre Rouzé. Thirty years in bioinformatics.
16.00 - 16.30 Closing and awards.
16.30 - 17.00 Closing reception.
Gala dinner

The gala dinner will take place at the Pelgrom, a Medieval-style restaurant within walking distance of the Elzenveld conference venue, on the evening of Monday December 7th, after the conference programme, from 18h30 until 22h00. Gala dinner participation is optional, although highly recommended!

The Pelgrom is one of Antwerp's most historic eating and drinking places, situated in authentic 15th-century cellars that merchants used for temporary storage during the two big annual Antwerp fairs. Prepare to feast on a Medieval buffet in the style of Antwerp's Golden Century!

For people using public transportation, after the end of the gala dinner the Antwerp-Central train station can easily be reached by tram from the Groenplaats stop (10 minutes), or on foot (20 minutes).

Where? Restaurant Pelgrom, Pelgrimstraat 15, 2000 Antwerp
When? Monday December 7th, 2015; 18h30 - 22h00
List of abstracts

Keynotes
K1 MEDICAL DATA AND TEXT MINING: LINKING DISEASES, DRUGS, AND ADVERSE REACTIONS
K2 MULTIPLE SURVIVAL STRATEGIES TO DEAL WITH THE MULTIPLICATION OF MULTIPLE SEQUENCE ALIGNMENT METHODS

Corporate presentations
C1 ILLUMINA SOFTWARE PLATFORMS TO TRANSFORM THE PATH TO KNOWLEDGE AND DISCOVERY
C2 THE SYSTEMS TOXICOLOGY COMPUTATIONAL CHALLENGE: IDENTIFICATION OF EXPOSURE RESPONSE MARKERS

Selected oral presentations
O1 CELL TYPE-SELECTIVE DISEASE ASSOCIATION OF GENES UNDER HIGH REGULATORY LOAD
O2 PREDICTING OLIGOGENIC EFFECTS USING DIGENIC DISEASE DATA
O3 A COMPREHENSIVE COMPARISON OF MODULE DETECTION METHODS FOR GENE EXPRESSION DATA
O4 LATEBICLUSTERING: EFFICIENT DISCOVERY OF TEMPORAL LOCAL PATTERNS WITH POTENTIAL DELAYS
O5 INFERRING DEVELOPMENTAL CHRONOLOGIES FROM SINGLE CELL RNA
O6 COMBINING TREE-BASED AND DYNAMICAL SYSTEMS FOR THE INFERENCE OF GENE REGULATORY NETWORKS
O7 MODELING THE REGULATION OF β-CATENIN SIGNALLING BY WNT STIMULATION AND GSK3 INHIBITION
O8 RANKED TILING BASED APPROACH TO DISCOVERING PATIENT SUBTYPES
O9 DEVELOPMENT OF A DNA METHYLATION-BASED SCORE REFLECTING TUMOUR INFILTRATING LYMPHOCYTES
O10 PREDICTION OF CELL RESPONSES TO SURFACE TOPOGRAPHIES USING MACHINE LEARNING TECHNIQUES
O11 ANALYSIS OF MASS SPECTROMETRY QUALITY CONTROL METRICS
O12 XILMASS: A CROSS-LINKED PEPTIDE IDENTIFICATION ALGORITHM
O13 AUTOMATED ANATOMICAL INTERPRETATION OF DIFFERENCES BETWEEN IMAGING MASS SPECTROMETRY EXPERIMENTS
O14 ENHANCEMENT OF IMAGING MASS SPECTROMETRY DATA THROUGH REMOVAL OF SPARSE INTENSITY VARIATIONS
O15 DETERMINANTS OF COMMUNITY STRUCTURE IN THE PLANKTON INTERACTOME
O16 BIOINFORMATICS TOOLS FOR ACCURATE ANALYSIS OF AMPLICON SEQUENCING DATA FOR BIODIVERSITY ANALYSIS
O17 GENE CO-EXPRESSION ANALYSIS IDENTIFIES BRAIN REGIONS AND CELL TYPES INVOLVED IN MIGRAINE PATHOPHYSIOLOGY: A GWAS-BASED STUDY USING THE ALLEN HUMAN BRAIN ATLAS
O18 SPATIAL CO-EXPRESSION ANALYSIS OF STEROID RECEPTORS IN THE MOUSE BRAIN IDENTIFIES REGION-SPECIFIC REGULATION MECHANISMS
O19 A SYSTEMS BIOLOGY COMPENDIUM FOR LEISHMANIA DONOVANI
O20 MULTI-OMICS INTEGRATION: RIBOSOME PROFILING APPLICATIONS
O21 CLUB-MARTINI: SELECTING FAVORABLE INTERACTIONS AMONGST AVAILABLE CANDIDATES: A COARSE-GRAINED SIMULATION APPROACH TO SCORING DOCKING DECOYS
O22 PEPSHELL: VISUALIZATION OF CONFORMATIONAL PROTEOMICS DATA
O23 INTERACTIVE VCF COMPARISON USING SPARK NOTEBOOK
O24 3D HOTSPOTS OF RECURRENT RETROVIRAL INSERTIONS REVEAL LONG-RANGE INTERACTIONS WITH CANCER GENES

Poster presentations
P1 KNN-MDR APPROACH FOR DETECTING GENE-GENE INTERACTIONS
P2 CONSERVATION AND DIVERSITY OF SUGAR-RELATED CATABOLIC PATHWAYS IN FUNGI
P3 VISUALIZING BIOLOGICAL DATA THROUGH WEB COMPONENTS USING POLIMERO AND POLIMERO-BIO
P4 DISEASE-SPECIFIC NETWORK CONSTRUCTION BY SEED-AND-EXTEND
P5 BIG DATA SOLUTIONS FOR VARIANT DISCOVERY FROM LOW COVERAGE SEQUENCING DATA, BY INTEGRATION OF HADOOP, HBASE AND HIVE
P6 ENTEROCOCCUS FAECIUM GENOME DYNAMICS DURING LONG-TERM PATIENT GUT COLONIZATION
P7 XCMS OPTIMISATION IN HIGH-THROUGHPUT LC-MS QC
P8 IDENTIFICATION OF NUMTS THROUGH NGS DATA
P9 MICROBIAL SEMANTICS: GENOME-WIDE HIGH-PRECISION NAMING SCHEMES FOR BACTERIA
P10 FROM SNPS TO PATHWAYS: AN APPROACH TO STRENGTHEN BIOLOGICAL INTERPRETATION OF GWAS RESULTS
P11 IDENTIFICATION OF TRANSCRIPTION FACTOR CO-ASSOCIATIONS IN SETS OF FUNCTIONALLY RELATED GENES
P12 PHENETIC: MULTI-OMICS DATA INTERPRETATION USING INTERACTION NETWORKS
P13 THE ROLE OF HLA ALLELES UNDERLYING CYTOMEGALOVIRUS SUSCEPTIBILITY IN ALLOGENEIC TRANSPLANT POPULATIONS
P14 NOVOPLASTY: IN SILICO ASSEMBLY OF PLASTID GENOMES FROM WHOLE GENOME NGS DATA
P15 ENANOMAPPER - ONTOLOGY, DATABASE AND TOOLS FOR NANOMATERIAL SAFETY EVALUATION
P16 BIOMEDICAL TEXT MINING FOR DISEASE-GENE DISCOVERY: SOMETIMES LESS IS MORE
P17 TUNESIM - TUNABLE VARIANT SET SIMULATOR FOR NGS READS
P18 RNA-SEQ REVEALS ALTERNATIVE SPLICING WITH ALTERNATIVE FUNCTIONALITY IN MUSHROOMS
P19 MSQROB: AN R/BIOCONDUCTOR PACKAGE FOR ROBUST RELATIVE QUANTIFICATION IN LABEL-FREE MASS SPECTROMETRY-BASED QUANTITATIVE PROTEOMICS
P20 A MIXTURE MODEL FOR THE OMICS BASED IDENTIFICATION OF MONOALLELICALLY EXPRESSED LOCI AND THEIR DEREGULATION IN CANCER
P21 GEVACT: GENOMIC VARIANT CLASSIFIER TOOL
P22 MAPPI-DAT: MANAGEMENT AND ANALYSIS FOR HIGH THROUGHPUT INTERACTOMICS DATA FROM ARRAY-MAPPIT EXPERIMENTS
P23 HIGHLANDER: VARIANT FILTERING MADE EASIER
P24 DOSE-TIME NETWORK IDENTIFICATION: A NEW METHOD FOR GENE REGULATORY NETWORK INFERENCE FROM GENE EXPRESSION DATA WITH MULTIPLE DOSES AND TIME POINTS
P25 IDENTIFICATION OF NOVEL ALLOSTERIC DRUG TARGETS USING A “DUMMY” LIGAND APPROACH
P26 PASSENGER MUTATIONS CONFOUND INTERPRETATION OF ALL GENETICALLY MODIFIED CONGENIC MICE
P27 DETECTING MIXED MYCOBACTERIUM TUBERCULOSIS INFECTION AND DIFFERENCES IN DRUG SUSCEPTIBILITY WITH WGS DATA
P28 APPLICATION OF HIGH-THROUGHPUT SEQUENCING TO CIRCULATING MICRORNAS REVEALS NOVEL BIOMARKERS FOR DRUG-INDUCED LIVER INJURY
P29 INFORMATION THEORETIC MODEL FOR GENE PRIORITIZATION
P30 GALAHAD: A WEB SERVER FOR THE ANALYSIS OF DRUG EFFECTS FROM GENE EXPRESSION DATA
P31 KMAD: KNOWLEDGE BASED MULTIPLE SEQUENCE ALIGNMENT FOR INTRINSICALLY DISORDERED PROTEINS
P32 ON THE LZ DISTANCE FOR DEREPLICATING REDUNDANT PROKARYOTIC GENOMES
P33 THE ROLE OF MIRNAS IN ALZHEIMER’S DISEASE
P34 FUNCTIONAL SUBGRAPH ENRICHMENTS FOR NODE SETS IN REGULATORY NETWORKS
P35 HUMANS DROVE THE INTRODUCTION & SPREAD OF MYCOBACTERIUM ULCERANS IN AFRICA
P36 LEVERAGING AGO-SRNA AFFINITY TO IMPROVE IN SILICO SRNA DETECTION AND CLASSIFICATION IN PLANTS
P37 ANALYSIS OF RELATIONSHIP PATTERNS IN UNASSIGNED MS/MS SPECTRA
P38 MINING ACROSS “OMICS” DATA FOR DRUG PRIORITIZATION
P39 ABUNDANT TRANS-SPECIFIC POLYMORPHISM AND A COMPLEX HISTORY OF NON-BIFURCATING SPECIATION IN THE GENUS ARABIDOPSIS
P40 RIBOSOME PROFILING ENABLES THE DISCOVERY OF SMALL OPEN READING FRAMES (SORFS), A NEW SOURCE OF BIOACTIVE PEPTIDES
P41 RIGAPOLLO, A HMM-SVM BASED APPROACH TO SEQUENCE ALIGNMENT
P42 EARLY FOLDING AND LOCAL INTERACTIONS
P43 BINDING SITE SIMILARITY DRUG REPOSITIONING: A GENERAL AND SYSTEMATIC METHOD FOR DRUG DISCOVERY AND SIDE EFFECTS DETECTION
P44 ASSESSMENT OF THE CONTRIBUTION OF COCOA-DERIVED STRAINS OF ACETOBACTER GHANENSIS AND ACETOBACTER SENEGALENSIS TO THE COCOA BEAN FERMENTATION PROCESS THROUGH A GENOMIC APPROACH
P45 REPRESENTATIONAL POWER OF GENE FEATURES FOR FUNCTION PREDICTION
P46 ANALYSIS OF BIAS AND ASYMMETRY IN THE PROTEIN STABILITY PREDICTION
P47 MULTI-LEVEL BIOLOGICAL CHARACTERIZATION OF EXOMIC VARIANTS AT THE PROTEIN LEVEL IMPROVES THE IDENTIFICATION OF THEIR DELETERIOUS EFFECTS
P48 NGOME: PREDICTION OF NON-ENZYMATIC PROTEIN DEAMIDATION FROM SEQUENCE-DERIVED SECONDARY STRUCTURE AND INTRINSIC DISORDER
P49 OPTIMAL DESIGN OF SRM ASSAYS USING MODULAR EMPIRICAL MODELS
P50 EVALUATING THE ROBUSTNESS OF LARGE INDEL IDENTIFICATION ACROSS MULTIPLE MICROBIAL GENOMES
P51 INTEGRATING STRUCTURED AND UNSTRUCTURED DATA SOURCES FOR PREDICTING CLINICAL CODES
P52 SUPERVISED TEXT MINING FOR DISEASE AND GENE LINKS
P53 FLOWSOM WEB: A SCALABLE ALGORITHM TO VISUALIZE AND COMPARE CYTOMETRY DATA IN THE BROWSER
P54 TOWARDS A BELGIAN REFERENCE SET
P55 MANAGING BIG IMAGING DATA FROM MICROSCOPY: A DEPARTMENTAL-WIDE APPROACH
P56 ESTIMATING THE IMPACT OF CIS-REGULATORY VARIATION IN CANCER GENOMES USING ENHANCER PREDICTION MODELS AND MATCHED GENOME-EPIGENOME-TRANSCRIPTOME DATA
P57 I-PV: A CIRCOS MODULE FOR INTERACTIVE PROTEIN SEQUENCE VISUALIZATION
P58 SFINX: STRAIGHTFORWARD FILTERING INDEX FOR AFFINITY PURIFICATION-MASS SPECTROMETRY DATA ANALYSIS
P59 MAPREDUCE APPROACHES FOR CONTACT MAP PREDICTION: AN EXTREMELY IMBALANCED BIG DATA PROBLEM
P60 COEXPNETVIZ: THE CONSTRUCTION AND VISUALISATION OF CO-EXPRESSION NETWORKS
P61 THE DETECTION OF PURIFYING SELECTION DURING TUMOUR EVOLUTION UNVEILS CANCER VULNERABILITIES
P62 FLOREMI: SURVIVAL TIME PREDICTION BASED ON FLOW CYTOMETRY DATA
P63 STUDYING BET PROTEIN-CHROMATIN OCCUPATION TO UNDERSTAND GENOTOXICITY OF MLV-BASED GENE THERAPY VECTORS
P64 THE COMPLETE GENOME SEQUENCE OF LACTOBACILLUS FERMENTUM IMDO 130101 AND ITS METABOLIC TRAITS RELATED TO THE SOURDOUGH FERMENTATION PROCESS
P65 ORTHOLOGICAL ANALYSIS OF AN EBOLA VIRUS – HUMAN PPIN SUGGESTS REDUCED INTERFERENCE OF EBOLA VIRUS WITH EPIGENETIC PROCESSES IN ITS SUSPECTED BAT RESERVOIR HOST
P66 PLADIPUS EMPOWERS UNIVERSAL DISTRIBUTED COMPUTING
P67 IDENTIFICATION OF ANTIBIOTIC RESISTANCE MECHANISMS USING A NETWORK-BASED APPROACH
P68 DEFINING THE MICROBIAL COMMUNITY OF DIFFERENT LACTOBACILLUS NICHES USING METAGENOMIC SEQUENCING
P69 HUNTING HUMAN PHENOTYPE-ASSOCIATED GENES USING MATRIX FACTORIZATION
P70 THE IMPACT OF HMGA PROTEINS ON REPLICATION ORIGINS DISTRIBUTION

Corporate poster presentations
C2 THE SYSTEMS TOXICOLOGY COMPUTATIONAL CHALLENGE: IDENTIFICATION OF EXPOSURE RESPONSE MARKERS
K1. MEDICAL DATA AND TEXT MINING: LINKING DISEASES, DRUGS, AND ADVERSE REACTIONS
Lars Juhl Jensen

Clinical data describing the phenotypes and treatment of patients is an underused data source that has much greater research potential than is currently realized. Mining of electronic health records (EHRs) has the potential for revealing unknown disease correlations and for improving post-approval monitoring of drugs. In my presentation I will introduce the centralized Danish health registries and show how we use them for identification of temporal disease correlations and discovery of common diagnosis trajectories of patients. I will also describe how we perform text mining of the clinical narrative from electronic health records and use this for identification of new adverse reactions of drugs.
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: K2 Keynote
K2. MULTIPLE SURVIVAL STRATEGIES TO DEAL WITH THE MULTIPLICATION OF MULTIPLE SEQUENCE ALIGNMENT METHODS
Cedric Notredame

In this seminar I will introduce some of the latest developments in the field of multiple sequence alignment construction, including some of the work from my group. I will briefly review the main challenges and the latest work in the field, including ClustalO and phylogeny-aware aligners like SATé, and how these aligners relate to consistency-based methods like T-Coffee. I will also look at the complex relationship between multiple sequence alignment accuracy, structural modeling and phylogenetic tree reconstruction, and introduce the notion of a reliability index while reviewing some of the latest advances in this field, including the TCS (Transitive Consistency Score). I will show how this index can be used to identify both structurally correct positions in an alignment and evolutionarily informative sites, thus suggesting more unity than initially thought between these two parameters. I will then introduce the structure-based clustering method we recently developed to further test these hypotheses. I will finish with some considerations on the main challenges that need to be confronted for the accurate modeling of biological sequence relationships, with special attention to genomic and RNA sequences. All methods are available from www.tcoffee.org.
REFERENCES
Chang JM, Di Tommaso P, Notredame C. TCS: a new multiple sequence alignment reliability measure to estimate alignment accuracy and improve phylogenetic tree reconstruction. Mol Biol Evol. 2014;31(6):1625-37. doi:10.1093/molbev/msu117.
Kemena C, Bussotti G, Capriotti E, Marti-Renom MA, Notredame C. Using tertiary structure for the computation of highly accurate multiple RNA alignments with the SARA-Coffee package. Bioinformatics. 2013;29(9):1112-9. doi:10.1093/bioinformatics/btt096.
Earl D, Nguyen N, Hickey G, Harris RS, Fitzgerald S, Beal K, Seledtsov I, Molodtsov V, Raney BJ, Clawson H, Kim J, Kemena C, Chang JM, Erb I, Poliakov A, Hou M, Herrero J, Kent WJ, Solovyev V, Darling AE, Ma J, Notredame C, Brudno M, Dubchak I, Haussler D, Paten B. Alignathon: a competitive assessment of whole-genome alignment methods. Genome Res. 2014;24(12):2077-89. doi:10.1101/gr.174920.114.
Breen MS, Kemena C, Vlasov PK, Notredame C, Kondrashov FA. Epistasis as the primary factor in molecular evolution. Nature. 2012;490(7421):535-8. doi:10.1038/nature11510.
10th Benelux Bioinformatics Conference bbc 2015
19
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: C1 Corporate presentation
C1. ILLUMINA SOFTWARE PLATFORMS TO TRANSFORM THE PATH TO
KNOWLEDGE AND DISCOVERY
Nicolas Goffard
Illumina, Inc. [email protected]
The next big bottleneck in the biological sample-to-answer workflow has undoubtedly moved beyond the generation of the raw data towards its initial processing and analysis, and even more so its biological and medical interpretation. There are two main reasons why this is particularly challenging for research organisations to accomplish successfully. Firstly, there is a need to easily and securely analyse, archive and share sequencing data, as well as to simplify and accelerate the data analysis with push-button tools using widely validated and scientifically accepted algorithms. Secondly, there is a requirement to normalize, standardize and curate not just proprietary data from multiple studies, but to do so in a way that allows comparing it in real time to data produced by public-domain studies. Illumina provides two integrated software platforms, BaseSpace and NextBio, to overcome these challenges, and this presentation provides an overview of the capabilities found within both to empower biologists and informaticians to interactively explore the data.
Abstract ID: C2 Corporate presentation
C2. THE SYSTEMS TOXICOLOGY COMPUTATIONAL CHALLENGE:
IDENTIFICATION OF EXPOSURE RESPONSE MARKERS
Carine Poussin, Vincenzo Belcastro, Stéphanie Boué, Florian Martin,
Alain Sewer, Bjoern Titz, Manuel C. Peitsch & Julia Hoeng.
Philip Morris International Research and Development, Philip Morris Product SA,
Quai Jeanrenaud 5, CH-2000 Neuchâtel, Switzerland
INTRODUCTION
Risk assessment in the context of 21st century
toxicology relies on the identification of specific
exposure response markers and the elucidation of
mechanisms of toxicity, which can lead to adverse
events. As a foundation for this future predictive risk
assessment, diverse sets of chemicals or mixtures are
tested in different biological systems, and datasets are
generated using high-throughput technologies.
However, the development of effective computational
approaches for the analysis and integration of these data
sets remains challenging.
METHODS
The sbv IMPROVER (Industrial Methodology for
Process Verification in Research;
http://sbvimprover.com/) project aims to verify methods
and concepts in systems biology research via challenges
posed to the scientific community. In fall 2015, the 4th sbv IMPROVER computational challenge will be launched, aimed at evaluating algorithms for the identification of specific markers of chemical mixture exposure response in the blood of humans or rodents. Blood is an easily accessible matrix, but remains a complex biofluid to analyze. This
computational challenge will address questions related
to the classification of samples based on transcriptomics
profiles from well-defined sample cohorts. Moreover, it
will address whether gene expression data derived from
human or rodent whole blood are sufficiently
informative to identify human-specific or species-
independent blood gene signatures predictive of the
exposure status of a subject to chemical mixtures
(current/former/non-exposure).
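To give a flavour of the classification task posed by the challenge, the sketch below shows a deliberately minimal nearest-centroid classifier assigning an exposure status to a blood expression profile. The gene values, class labels and the nearest-centroid approach itself are purely illustrative assumptions on our part; they are not part of the challenge data or its scoring.

```python
# Illustrative sketch only: nearest-centroid classification of exposure
# status from (toy) blood gene-expression profiles.
from math import dist

def centroids(profiles, labels):
    """Mean expression vector per exposure class."""
    by_class = {}
    for x, y in zip(profiles, labels):
        by_class.setdefault(y, []).append(x)
    return {y: [sum(col) / len(xs) for col in zip(*xs)]
            for y, xs in by_class.items()}

def predict(x, cents):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(cents, key=lambda y: dist(x, cents[y]))

# Toy 3-gene profiles for the three exposure classes
train = [([5.0, 1.0, 0.5], "current"), ([4.8, 1.2, 0.4], "current"),
         ([2.0, 3.0, 1.0], "former"),  ([2.2, 2.8, 1.1], "former"),
         ([0.5, 0.6, 4.0], "non"),     ([0.6, 0.5, 4.2], "non")]
cents = centroids([x for x, _ in train], [y for _, y in train])
print(predict([4.9, 1.1, 0.6], cents))  # a profile resembling "current"
```

Real submissions would of course use properly validated classifiers and the challenge's own scoring protocol.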
RESULTS & DISCUSSION
Participants will be provided with high quality datasets
to develop predictive models/classifiers and the
predictions will be scored by an independent scoring
panel. The results and post-challenge analyses will be
shared with the scientific community, and will open
new avenues in the field of systems toxicology.
REFERENCES
Meyer et al. Industrial methodology for process verification in research (IMPROVER): toward systems biology verification. Bioinformatics, 2012.
Meyer et al. Verification of systems biology research in the age of collaborative competition. Nat Biotechnol, 2011.
Tarca et al. Strengths and limitations of microarray-based phenotype prediction: lessons learned from the IMPROVER Diagnostic Signature Challenge. Bioinformatics, 2013.
Hartung T. Lessons learned from alternative methods and their validation for a new toxicology in the 21st century. J Toxicol Environ Health, 2010.
Hoeng et al. A network-based approach to quantifying the impact of biologically active substances. Drug Discov Today, 2012.
Abstract ID: O1 Oral presentation
O1. CELL TYPE-SELECTIVE DISEASE ASSOCIATION
OF GENES UNDER HIGH REGULATORY LOAD
Mafalda Galhardo1, Philipp Berninger2, Thanh-Phuong Nguyen1, Thomas Sauter1 & Lasse Sinkkonen1*.
Life Sciences Research Unit, University of Luxembourg, Luxembourg, Luxembourg1; Biozentrum, University of Basel and Swiss Institute of Bioinformatics, Basel, Switzerland2.
Identification of biomarkers and drug targets is a key task of biomedical research. We previously showed that disease-
linked metabolic genes are often under combinatorial regulation (Galhardo et al. 2014). Here we extend this analysis to
include almost 100 transcription factors (TFs) and key histone modifications from over 100 samples to show that genes
under high regulatory load (HRL) are enriched for disease association across cell types. Network and pathway analysis suggests that the central role of HRL genes in biological networks, under heavy regulation at both the transcriptional and post-transcriptional levels, is a possible explanation for the observed enrichment. Thus, epigenomic mapping of enhancers presents an unbiased approach for the identification of novel disease-associated genes.
INTRODUCTION
Identification of disease-relevant genes and gene products as biomarkers and drug targets is one of the key tasks of biomedical research. Still, a great majority of research is
focused on a small minority of genes while many remain
unstudied (Pandey et al. 2014). Unbiased prioritization
within these ignored genes would be important to harvest
the full potential of genomics in understanding diseases.
Many databases to catalog disease-associated genes have
been created, including DisGeNET that draws from
multiple sources (Bauer-Mehren et al. 2010). In addition,
large amounts of publicly available epigenomic data on
the cell type-selective regulation of these genes have been
produced. The importance of epigenetic regulation for
disease development is increasingly recognized, for
example in the analysis of GWAS, where causal SNPs
are mostly located within gene regulatory regions
(Maurano et al. 2012).
METHODS
Public ChIP-seq data produced by the ENCODE project
(Dunham et al. 2012), the BLUEPRINT Epigenome
project (Martens et al. 2013) and the NIH Epigenomic
Roadmap project (Kundaje et al. 2015) were downloaded
in May 2014. The data were used to rank active protein
coding genes (based on NCBI Entrez and marked by
H3K4me3) by their regulatory load based on the number
of associated TFs or enhancer (H3K27ac) regions using
the GREAT tool. The enrichment of disease genes from DisGeNET among HRL genes was tested using either the Matlab® hypergeometric cumulative distribution function, adjusted for multiple testing with the Benjamini-Hochberg procedure, or a normalized enrichment score.
Enriched diseases were clustered using R package
“blockcluster”. Peak calling for super-enhancers was done
using HOMER. A liver disease gene network was
constructed from HPRD based on liver disease genes from MeSH and genes from CTD, and had 8278
interactions. Statistical analysis of KEGG pathway
enrichments and betweenness centrality was done using
random sampling tests. miRNA target predictions were
obtained from TargetScan6.2. Further details of the used
methods can be found in Galhardo et al. 2015.
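The enrichment step described above can be sketched as follows. This is a plain-Python reconstruction of a hypergeometric test with Benjamini-Hochberg adjustment, not the authors' Matlab® code, and the gene counts are illustrative toy numbers.

```python
# Sketch (our reconstruction, illustrative numbers): hypergeometric
# enrichment of disease genes among high-regulatory-load (HRL) genes,
# with Benjamini-Hochberg multiple-testing adjustment across diseases.
from math import comb

def hypergeom_sf(k, M, n, N):
    """P(X >= k): chance of at least k disease genes when sampling N HRL
    genes from M active genes, n of which are disease genes."""
    return sum(comb(n, i) * comb(M - n, N - i)
               for i in range(k, min(n, N) + 1)) / comb(M, N)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj, running = [0.0] * m, 1.0
    for rank, i in reversed(list(enumerate(order, start=1))):
        running = min(running, pvals[i] * m / rank)
        adj[i] = running
    return adj

# Toy numbers: 2000 active genes, 100 disease genes, 150 HRL genes,
# of which 20 are disease genes (expected by chance: 7.5)
p = hypergeom_sf(20, 2000, 100, 150)
print(p < 0.001, benjamini_hochberg([0.01, 0.04, 0.03]))
```

The survival function is the "hygecdf"-style upper tail; in practice one would use a vetted statistics library rather than this illustration.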
RESULTS & DISCUSSION
Using ENCODE ChIP-Seq profiles for 93 transcription
factors (TFs) in nine cell lines, we show that HRL genes
are enriched for disease-association across cell types
(Figure 1). TF load correlates with the enhancer load of
the genes, allowing the identification of HRL genes by
epigenomic mapping of active enhancers marked by
H3K27ac modifications. Identification of the HRL genes
across 139 samples from 96 different cell and tissue types
reveals a consistent enrichment for disease-associated
genes in a cell type-selective manner.
The HRL genes are involved in more pathways than
expected by chance, exhibit increased betweenness
centrality in the interaction network of liver disease genes,
and carry longer 3’UTRs with more microRNA binding
sites than genes on average, suggesting a role as hubs
within regulatory networks.
Thus, epigenomic mapping of enhancers presents an
unbiased approach for identification of novel disease-
associated genes (Galhardo et al. 2015).
FIGURE 1. Workflow of the disease-gene enrichment analysis.
REFERENCES
Pandey AK et al. PLoS One, 9:e88889 (2014).
Bauer-Mehren A et al. Nucleic Acids Res., 33:D514-D517 (2010).
Maurano et al. Science, 337:1190-1195 (2012).
Galhardo et al. Nucleic Acids Res., 42:1474-1496 (2014).
Dunham et al. Nature, 489:57-74 (2012).
Martens et al. Haematologica, 98:1487-1489 (2013).
Kundaje et al. Nature, 518:317-330 (2015).
Galhardo et al. Nucleic Acids Res., 10.1093/nar/gkv863 (2015).
[Figure 1 content: human ChIP-seq data; gene ranking by regulatory load (number of TFs or enhancers per gene); high regulatory load genes are enriched for disease association. Inputs: transcription factor binding sites (93 TFs) from 9 ENCODE cell lines (A549, GM12878, H1hESC, HCT116, HeLaS3, HepG2, HUVEC, K562, MCF7); active enhancers (H3K27ac) from 139 samples comprising 96 tissue or cell types; disease genes (min score 0.08).]
Abstract ID: O2 Oral presentation
O2. PREDICTING OLIGOGENIC EFFECTS USING DIGENIC DISEASE DATA
Andrea M. Gazzo1,2,3*, Dorien Daneels1,3, Maryse Bonduelle3, Sonia Van Dooren1,3, Guillaume Smits1,4 & Tom Lenaerts1,2,5.
Interuniversity Institute of Bioinformatics in Brussels, Brussels, Belgium1; MLG, Departement d'Informatique, Universite Libre de Bruxelles, Brussels, Belgium2; Center for Medical Genetics, Reproduction and Genetics, Reproduction Genetics and Regenerative Medicine, Vrije Universiteit Brussel, UZ Brussel, Brussels, Belgium3; Genetics, Hopital Universitaire des Enfants Reine Fabiola, Universite Libre de Bruxelles, Brussels, Belgium4; Computerwetenschappen, Vrije Universiteit Brussel, Brussels, Belgium5.
Recent research has shown that some disorders may be better described by more complex inheritance mechanisms, suggesting that diseases classically considered monogenic may in fact be oligogenic. Understanding how the combined interplay and weight of variants leads to disease may provide improved and novel insights into these diseases. Here we present a classification method that separates two types of digenic diseases: those that require variants in both genes to induce the disease, and those where one variant is causative and the second increases the severity. Our results show that a clear separation can be made between both classes using gene-level and variant-level features extracted from DIDA.
INTRODUCTION
DIDA is a novel database that provides for the first time
detailed information on genes and associated genetic
variants involved in digenic diseases, the simplest form of
oligogenic inheritance1. The database is accessible via
http://dida.ibsquare.be and currently includes 213 digenic
combinations involved in 44 different digenic diseases2.
These combinations are composed of 364 distinct variants,
which are distributed over 136 distinct genes. Creating this
new repository was essential, as current databases do not
allow one to retrieve detailed records regarding digenic
combinations. Genes, variants, diseases and digenic
combinations in DIDA are annotated with manually
curated information and information mined from other
online resources. Each digenic combination was categorized into one of two effect classes: either "on/off", in which variant combinations in both genes are required to develop the disease, or "severity", where variants in one gene are enough to develop the disease and carrying variant combinations in two genes increases the severity or affects the age of onset. In this work we present a predictor
capable of distinguishing between the digenic effect
classes. We analyse the result of this predictor in relation
to specific features collected for the different digenic
combinations in DIDA, as for instance the
haploinsufficiency of the genes, their zygosity and the
relationship between them, providing insight into the
biological meaning of the result.
METHODS
We used a machine learning approach to determine the
classes, i.e. "severity" or "on/off", of a digenic
combination. Starting with feature selection, we chose the most informative features to classify a digenic combination into one of the two classes. For each of the two genes involved in a digenic combination, the following are used as features in the predictor: zygosity (heterozygous, homozygous, etc.), recessiveness probability, haploinsufficiency score, known recessive information, and whether the gene is essential (based on mouse knockout experimental data). At the variant level, we used the pathogenicity predictions from the SIFT and PolyPhen-2 tools as features. Finally, we also encode the relationship between the two genes, defining the relations "similar function", "directly interacting" and "pathway membership". After different tests we decided to use a random forest algorithm, as this approach gave the best results.
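The gene-pair-aware stratification that a fair evaluation of this predictor requires can be sketched as below. The greedy balancing heuristic is our own simplification, not necessarily the authors' exact procedure, and the gene pairs shown are illustrative.

```python
# Sketch (our simplification) of a gene-pair-aware split: all instances
# sharing a gene pair go into the same subset, so a pair never appears
# in both the training and the test set. Gene names are illustrative.
def gene_pair_folds(instances, n_folds=5):
    """Greedily assign whole gene-pair groups to the smallest fold."""
    by_pair = {}
    for inst in instances:                 # inst = (gene_a, gene_b, label)
        by_pair.setdefault(frozenset(inst[:2]), []).append(inst)
    folds = [[] for _ in range(n_folds)]
    for group in sorted(by_pair.values(), key=len, reverse=True):
        min(folds, key=len).extend(group)  # largest groups placed first
    return folds

data = [("GJB2", "GJB6", "on/off"), ("GJB6", "GJB2", "on/off"),
        ("BBS1", "BBS10", "severity"), ("LMNA", "EMD", "on/off"),
        ("ABCA4", "ROM1", "severity")]
folds = gene_pair_folds(data, n_folds=3)
# Both GJB2/GJB6 instances land in the same fold, regardless of gene order
```

Using `frozenset` makes the pair key order-independent, which is essential since a digenic combination (A, B) is the same combination as (B, A).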
RESULTS & DISCUSSION
After a 10-fold cross-validation we obtained promising performances, with an MCC of 0.67 and an AUROC of 0.92. Regretfully, this performance is an overestimation: since the gene-based features are the most important, many examples with mutations mapped to the same gene pair lead to the same oligogenic effect class. A stratification ensuring that the same pair of genes never appears in both the training and the test set was therefore required. We manually created 5 subsets, where instances with the same gene pair belong to the same subset. After this procedure we assessed the performances again, obtaining an MCC of 0.36 and an AUROC of 0.78. To verify the significance of these performances, we retrained the random forest on a randomization of the data, obtained by shuffling all the features of each instance while keeping the class unchanged. This reshuffling resulted in an MCC close to zero and an AUROC near 0.5, as expected. This additional test confirms the significance of the stratified results.
In a next stage we are analysing the relationship between the oligogenic effect and the features used, particularly in terms of biological and molecular interpretation. As a future perspective, the benefit at the clinical level is very promising: one goal of medical genetics is to assign predictive value to the genotype, in order for it to assist in diagnosis and disease management. If we can infer, based on the genotype, what the digenic/oligogenic effect will be, we can potentially anticipate the treatment.
REFERENCES
[1] Gazzo A et al. DIDA: a curated and annotated digenic diseases database. Under review, NAR database issue (2016).
[2] Schäffer AA. Digenic inheritance in medical genetics. J Med Genet, 50:641-652 (2013).
Abstract ID: O3 Oral presentation
O3. A COMPREHENSIVE COMPARISON OF MODULE DETECTION METHODS
FOR GENE EXPRESSION DATA
Wouter Saelens1,2*, Robrecht Cannoodt1,2,3, Bart N. Lambrecht1,2 & Yvan Saeys1,2.
VIB Inflammation Research Center1; Department of Respiratory Medicine, Ghent University2; Center for Medical Genetics, Ghent University Hospital3.
Module detection is central in every analysis of large-scale gene expression data. While numerous methods have been developed, the relative merits and drawbacks of these different approaches are still unclear. In this work we use known gene regulatory networks to perform an unbiased comparison of 41 module detection methods, spanning clustering, biclustering, decomposition, direct network inference and iterative network inference. This analysis showed that decomposition methods outperform current clustering methods. Our work provides a first comprehensive evaluation to guide biologists in their choice, but also serves as a protocol for the evaluation of novel module detection methods.
INTRODUCTION
Module detection methods form a cornerstone in the
analysis of genome wide gene expression compendia.
Modules in this context are defined as groups of genes
with a similar expression profile, and therefore frequently
share certain functions, are co-regulated and cooperate to
produce a certain phenotype.
Over recent years, dozens of module detection methods have been developed, which can be classified into five different categories. The most popular approach is undoubtedly clustering, which groups genes into modules based on global similarity in expression profiles.
Within the transcriptomics community these methods have
received a considerable amount of criticism. This is
mainly due to three drawbacks: (i) clustering cannot detect
so called local co-expression effects, (ii) most clustering
methods are unable to detect overlapping modules and (iii)
clustering methods do not model the underlying gene
regulatory network. Alternative approaches have therefore
been developed which either handle both overlap and local
co-expression (biclustering and decomposition) or model
the gene regulatory network (direct network inference and
iterative network inference).
Given this methodological diversity, it is important that
existing and new approaches are evaluated on robust and
objective benchmarks. However, past evaluation studies were limited in the number of methods, used synthetic data, or did not correctly assess the balance between false positives and false negatives. In this study we therefore
provide a novel unbiased and comprehensive evaluation
strategy (Figure 1), and used it to evaluate 41 state-of-the-
art module detection methods.
METHODS
The key to our approach is that we use gold-standard regulatory networks to define sets of known modules.
These can be used to directly assess the sensitivity and
specificity of the different module detection methods. We
used four different large scale gene expression compendia,
two from E. coli and two from S. cerevisiae. For each of
these organisms a substantial part of the regulatory
network is already known, either based on the integration
of small-scale experiments or based on large, genome
wide datasets. We use these networks to define sets of known modules by looking at genes which either share one regulator, share all regulators, or are strongly
interconnected. We used four different metrics to compare
a set of observed modules with known modules: recovery
and recall control the type II errors, while the relevance
and specificity control the type I errors.
Parameter tuning is a necessary but often overlooked
challenge of module detection methods. As default
parameters of a tool are usually optimized for some
specific test cases by the authors, they do not necessarily
reflect general good performance on other datasets. On the
other hand, one should be careful of overfitting parameters
on specific characteristics of the data, as such parameters
will lead to suboptimal results when using the same
parameter settings on other datasets. In this study we first
optimized parameters using a grid-based approach. Next,
to avoid overfitting we used the optimal parameters on one
dataset to score the performance on another dataset, in an
approach akin to cross-validation.
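To make the four metrics concrete, here is a small sketch using Jaccard-based best-match scores between known and observed module sets. These definitions are an assumed, simplified illustration following common usage; the exact formulas used in the study may differ.

```python
# Sketch (assumed simplified definitions): Jaccard best-match scores
# between a set of known modules and a set of observed modules.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def recovery(known, observed):
    """Mean, over known modules, of the best-matching observed module.
    Low values signal type II errors: known modules that were missed."""
    return sum(max(jaccard(k, o) for o in observed) for k in known) / len(known)

def relevance(known, observed):
    """Mean, over observed modules, of the best-matching known module.
    Low values signal type I errors: spurious observed modules."""
    return sum(max(jaccard(o, k) for k in known) for o in observed) / len(observed)

known = [{"g1", "g2", "g3"}, {"g4", "g5"}]
observed = [{"g1", "g2", "g3"}, {"g6", "g7"}]
print(recovery(known, observed), relevance(known, observed))  # 0.5 0.5
```

The asymmetry is the point: recovery averages over known modules, relevance over observed ones, so the pair controls the two error types separately.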
RESULTS & DISCUSSION
We evaluated 41 different module detection methods
covering all five approaches. Overall, our analysis showed
that certain decomposition methods, namely those based on independent component analysis, outperform current state-of-the-art clustering methods. However, despite their
theoretical advantages, neither biclustering nor network
inference methods are able to outperform clustering
methods. Importantly, our results are stable across datasets,
module definitions and scoring metrics, demonstrating the
robustness of our evaluation methodology.
FIGURE 1. Overview of our evaluation methodology.
The applications of our work are twofold. First, if local co-expression and overlap are of interest, we discourage the use of biclustering methods and suggest the use of decomposition instead. Second, we provide a new
comprehensive evaluation methodology which can be used
to compare novel methods with the current state-of-the-art.
Abstract ID: O4 Oral presentation
O4. LATEBICLUSTERING: EFFICIENT DISCOVERY OF TEMPORAL LOCAL
PATTERNS WITH POTENTIAL DELAYS
Joana P. Gonçalves1,2* & Sara C. Madeira3,4.
Pattern Recognition and Bioinformatics Group, Department of Intelligent Systems, Delft University of Technology1; Division of Molecular Carcinogenesis, The Netherlands Cancer Institute2; Department of Computer Science and Engineering, Instituto Superior Técnico, Universidade de Lisboa3; INESC-ID4.
Temporal transcriptomes can provide valuable insight into the dynamics of transcriptional response and gene regulation.
In particular, many studies seek to uncover functional biological units by identifying and grouping genes with common
expression patterns. Nevertheless, most analytical tools available for this purpose fall short in their ability to consider
biologically reasonable models and adequately incorporate the temporal dimension. Each biological task is likely to
occur within a time period that does not necessarily span the whole time course of the experiment, and genes involved in
such a task are expected to coordinate only while the task is ongoing. LateBiclustering is an efficient algorithm to
identify this type of coordinated activity, while allowing genes to participate in distinct biological tasks with multiple
partners over time. Additionally, LateBiclustering is able to capture temporal delays suggestive of transcriptional
cascades: one of the hallmarks of gene expression and regulation.
INTRODUCTION
The discovery of patterns in temporal transcriptomes
exposes gene expression dynamics and contributes to understanding the machinery involved in its modulation.
Various analytical tools are employed in this regard.
Differential expression summarizes an entire time course
into one feature, thus lacking detail. Clustering respects the chronological order, but focuses on global
similarities and tends to identify rather broad patterns,
associated with unspecific functions. Biclustering offers
increased granularity by additionally searching for local
patterns, but allows for arbitrary jumps in time, eventually
leading to patterns that are incoherent from a temporal
perspective.
METHODS
LateBiclustering is an efficient algorithm for the
identification of transcriptional modules, here termed
LateBiclusters. Each LateBicluster is a group of genes
showing a similar expression pattern with potential delays,
within a particular time frame that does not necessarily span the whole time course of the transcriptome.
LateBiclustering only reports maximal LateBiclusters, that
is, those that cannot be extended and are not fully
contained in any other LateBicluster.
LateBiclustering takes as input a gene-time expression
matrix of real values. Each gene expression profile is first
normalized to zero mean and unit standard deviation. A
discretization is further applied to discern variations
between consecutive time points into three levels: down-
trend, no-change and up-trend. Upon discretization each
gene profile can be seen as a string.
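The normalization and discretization steps above can be sketched as follows; the no-change threshold is our assumption, not a value taken from the paper.

```python
# Sketch of the preprocessing: z-normalize each profile, then encode each
# consecutive-time-point change as D (down-trend), N (no-change) or
# U (up-trend). The threshold t = 0.3 is an illustrative assumption.
from statistics import mean, stdev

def discretize(profile, t=0.3):
    """Turn a real-valued time series into a trend string."""
    m, s = mean(profile), stdev(profile)
    z = [(x - m) / s for x in profile]          # zero mean, unit std
    out = []
    for prev, curr in zip(z, z[1:]):
        d = curr - prev
        out.append("U" if d > t else "D" if d < -t else "N")
    return "".join(out)

print(discretize([1.0, 1.2, 3.5, 3.4, 1.0]))  # → "NUND"
```

After this step each gene profile is a string over {D, N, U}, which is what makes the generalized-suffix-tree search for shared (possibly delayed) patterns applicable.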
A generalized suffix tree is built to find common
patterns in the gene profiles. Internal nodes
satisfying certain properties are marked for their
potential to denote LateBiclusters.
When an internal node does not satisfy the basic
conditions for LateBicluster maximality, a
procedure is applied to remove occurrences
leading to non-maximal LateBiclusters. For this
purpose, LateBiclustering uses a bit array
representing the occurrences underlying each
internal node. During the maximality update
procedure, the bit array of the inspected node is
compared against those of internal child nodes
(right-max) and nodes from which the inspected
node receives suffix links (left-max).
Finally, LateBiclustering comes with different
heuristics to report a single pattern occurrence per
gene in each maximal LateBicluster. A heuristic
is necessary because there may be multiple
occurrences of a pattern in the profile of a given
gene, which is a direct consequence of allowing
the discovery of delayed patterns.
RESULTS & DISCUSSION
LateBiclustering is the first efficient algorithm suitable for
the discovery of biclusters with temporal delays. It runs in
polynomial time, whereas previous methods had exponential time complexity. LateBiclustering was able to
find planted biclusters in synthetic data. It also identified
biologically relevant LateBiclusters associated with
Saccharomyces cerevisiae’s response to heat stress, and
interesting time-lagged responses.
FIGURE 1. Schematic of the LateBiclustering algorithm.
REFERENCES
Gonçalves JP & Madeira SC. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11(5), 801-813 (2014).
Abstract ID: O5 Oral presentation
O5. INFERRING DEVELOPMENTAL CHRONOLOGIES FROM SINGLE CELL
RNA
Robrecht Cannoodt1,2,3*, Katleen De Preter3 & Yvan Saeys1,2.
Data Mining and Modelling for Biomedicine group, VIB Inflammation Research Center, Ghent1; Department of Respiratory Medicine, Ghent University Hospital, Ghent2; Center of Medical Genetics, Ghent University Hospital, Ghent3.
With the advent of single cell RNA sequencing, it is now possible to analyse the transcriptomes of hundreds of individual
cells in an unbiased manner. Reconstructing the developmental chronology of differentiating cells is a challenging task,
and doing so in an unsupervised and robust manner is a hitherto untackled problem. We developed a truly unsupervised
developmental chronology inference technique, and evaluated its performance and robustness using multiple datasets.
INTRODUCTION
Early attempts at inferring the chronologies of single cells include MONOCLE (Trapnell et al., 2014) and NBOR (Schlitzer et al., 2015). However, these techniques are not
unsupervised as they require knowledge of the cell type of
each cell prior to analysis, which biases the results to prior
knowledge and possibly obstructs the discovery of novel
subpopulations.
METHODS
Our approach consists of four steps.
In the first step, the feature space (~30000 genes) is
reduced to three dimensions.
Secondly, outliers are detected and removed, using a K-
nearest neighbour approach. After outlier removal, the
original feature space is again reduced to three dimensions.
Next, a nonparametric nonlinear curve is iteratively fitted
to the data.
Finally, each cell is projected onto the curve, thus
resulting in a cell chronology.
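A minimal sketch of the outlier-removal step (the second step above) is given below; the choice of k and the cutoff are illustrative assumptions, not the settings used in the method itself.

```python
# Sketch (illustrative parameters): flag cells whose mean distance to
# their k nearest neighbours is unusually large relative to the dataset.
from math import dist
from statistics import mean, stdev

def knn_outliers(points, k=3, z_cut=1.5):
    """Return indices of points whose k-NN distance score is more than
    z_cut standard deviations above the average score."""
    scores = []
    for i, p in enumerate(points):
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(mean(d[:k]))            # mean distance to k nearest
    m, s = mean(scores), stdev(scores)
    return [i for i, sc in enumerate(scores) if (sc - m) / s > z_cut]

# A tight cluster of cells (in the reduced 3-D space) plus one outlier
cells = [(0, 0, 0), (0.1, 0, 0), (0, 0.1, 0), (0.1, 0.1, 0.1),
         (0, 0, 0.1), (10, 10, 10)]
print(knn_outliers(cells))  # → [5]
```

In the actual pipeline this step operates on the three reduced dimensions, and the feature space is re-reduced after the flagged cells are removed.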
RESULTS & DISCUSSION
A single-cell RNA-seq dataset (Schlitzer et al., 2015) contains profiles of DC progenitor cells. These cells are
expected to differentiate from MDP to CDP to PreDC. Our
method is able to intuitively visualise known population
groups (Figure 1), as well as infer the developmental
chronology of the individual cells (Figure 2).
We evaluated our method on four datasets (Shalek et al.,
2014; Trapnell et al., 2014; Buettner et al., 2015 and
Schlitzer et al., 2015), and found it to perform better and
more robustly than existing methods MONOCLE and
NBOR.
This approach opens opportunities to further study known
mechanisms or investigate unknown key regulatory
structures in cell differentiation, or detect novel
subpopulations in a truly unsupervised manner.
REFERENCES
Buettner F et al. Nature Biotechnology 33, 155-160 (2015).
Schlitzer A et al. Nature Immunology 16, 718-726 (2015).
Shalek A et al. Nature 509, 363-369 (2014).
Trapnell C et al. Nature Biotechnology 32, 381-386 (2014).
FIGURE 1. After feature space reduction and outlier detection of 244 DC
progenitor cells (Schlitzer et al., 2015), our method can intuitively
visualise known populations.
FIGURE 2. An iterative curve fitting results in a smooth curve reflecting
the developmental chronology. After projecting each cell to the curve,
regulatory patterns in expression which correlate with this timeline can be investigated.
Abstract ID: O6 Oral presentation
O6. COMBINING TREE-BASED AND DYNAMICAL SYSTEMS
FOR THE INFERENCE OF GENE REGULATORY NETWORKS
Vân Anh Huynh-Thu1* & Guido Sanguinetti2,3.
GIGA-R & Department of Electrical Engineering and Computer Science, University of Liège1; School of Informatics, University of Edinburgh2; SynthSys – Systems and Synthetic Biology, University of Edinburgh3.
INTRODUCTION
Reconstructing the topology of gene regulatory networks
(GRNs) from time series of gene expression data remains
an important open problem in computational systems
biology. Current approaches can be broadly divided into model-based and model-free approaches, each facing a limitation: model-free methods are scalable but lack interpretability and cannot, in general, be used for out-of-sample predictions. Model-based methods, on the other hand, focus on identifying a dynamical model of the system; these are clearly interpretable and can be used for predictions, but they rely on strong assumptions and are typically very demanding computationally. Here, we aim to bridge the gap between
model-based and model-free methods by proposing a
hybrid approach to the GRN inference problem, called
Jump3 (Huynh-Thu & Sanguinetti, 2015). Our approach
combines formal dynamical modelling with the efficiency
of a nonparametric, tree-based method, allowing the
reconstruction of GRNs of hundreds of genes.
METHODS
Gene expression model. At the heart of the Jump3
framework, we use the on/off model of gene expression
(Ptashne & Gann, 2002), where the rate of transcription of
a gene can vary between two levels depending on the
activity state μ of the promoter of the gene. The expression
x of a gene is modelled through the following stochastic
differential equation:
dxᵢ = (Aᵢμᵢ(t) + bᵢ − λᵢxᵢ)dt + σdω(t),
where subscript i refers to the i-th target gene. Here, the promoter state μᵢ(t) is a binary variable (the promoter is either active or inactive) that depends on the expression levels of the transcription factors (TFs) that bind to the promoter. Aᵢ, bᵢ and λᵢ are kinetic parameters, and the term σdω(t) represents a white-noise driving process with variance σ².
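The on/off model above can be simulated directly with an Euler–Maruyama scheme. The sketch below uses made-up kinetic parameters and an illustrative promoter that switches on at t = 5; it is not part of Jump3 itself, just the generative model it builds on.

```python
# Euler-Maruyama simulation of the on/off expression model
# dx = (A*mu(t) + b - lam*x) dt + sigma dW(t).
# Parameter values and the switching time are illustrative.
import numpy as np

def simulate_onoff(A=2.0, b=0.1, lam=0.5, sigma=0.02,
                   t_end=20.0, dt=0.01, switch_on=5.0, seed=1):
    rng = np.random.default_rng(seed)
    n = int(t_end / dt)
    x = np.empty(n + 1)
    x[0] = b / lam                       # start at the "off" steady state
    for k in range(n):
        t = k * dt
        mu = 1.0 if t >= switch_on else 0.0     # binary promoter state
        drift = A * mu + b - lam * x[k]
        x[k + 1] = x[k] + drift * dt + sigma * np.sqrt(dt) * rng.normal()
    return x

x = simulate_onoff()
# After the promoter switches on, x relaxes toward (A + b) / lam.
```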
Network reconstruction with jump trees. Recovering
the regulatory links pointing to gene i amounts to finding
the genes whose expression is predictive of the promoter
state μi. To achieve this goal, we propose a procedure that
learns, for each target gene i, an ensemble of decision trees
predicting the promoter state μi at any time t from the
expression levels of the candidate regulators at the same
time t. However, standard tree-based methods cannot be
applied here since the output μi(t) is a latent variable. We
therefore propose a new decision tree algorithm called
“jump tree”, which splits the observations by maximising
the marginal likelihood of the dynamical on/off model.
The learned tree-based model is then used to derive an
importance score for each candidate regulator, computed
as the sum of the likelihood gains that are obtained at all
the tree nodes where this regulator was selected to split the
observations. The importance of a candidate regulator j is
used as weight for the putative regulatory link of the
network that is directed from gene j to gene i.
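The importance aggregation described above amounts to summing, per candidate regulator, the likelihood gains recorded at the tree nodes where that regulator was chosen. A minimal sketch (the node list is toy data, not Jump3 output):

```python
# A regulator's importance is the sum of the likelihood gains obtained
# at all tree nodes where it was selected to split the observations.
from collections import defaultdict

def regulator_importances(nodes):
    """nodes: iterable of (regulator, likelihood_gain) pairs collected
    from all nodes of all trees in the ensemble for one target gene."""
    scores = defaultdict(float)
    for regulator, gain in nodes:
        scores[regulator] += gain
    return dict(scores)

# Toy node records; the resulting weights score links TFj -> gene i.
toy_nodes = [("TF1", 3.2), ("TF2", 1.1), ("TF1", 0.7), ("TF3", 0.4)]
w = regulator_importances(toy_nodes)
```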
RESULTS & DISCUSSION
We evaluated Jump3 on the networks of the DREAM4 In
Silico Network challenge (Prill et al., 2010). For each
network topology, two types of simulated expression data
were used: data simulated using the on/off model (toy
data) and the time series data that was provided in the
context of the DREAM4 challenge. We compared Jump3
to other GRN inference methods: two model-free methods,
which are time-lagged variants of GENIE3 (Huynh-Thu et
al., 2010) and CLR (Faith et al., 2007) respectively; two
model-based methods, namely Inferelator (Greenfield et
al., 2010) and TSNI (Bansal et al., 2006), and G1DBN
(Lèbre, 2009), a method based on dynamic Bayesian
networks. Areas Under the Precision-Recall curves
(AUPRs) obtained for size-100 networks are shown in
Table 1. Jump3 yields the highest AUPR in the case of the
toy data. As expected, its performance decreases when the
networks are inferred from the DREAM4 data, due to the
mismatch between the on/off model and the one used to
simulate the data. However, Jump3 still outperforms the
other methods.
             Toy             DREAM4
Jump3        0.272 ± 0.060   0.187 ± 0.058
GENIE3-lag   0.114 ± 0.010   0.176 ± 0.056
CLR-lag      0.088 ± 0.008   0.169 ± 0.047
Inferelator  0.069 ± 0.006   0.144 ± 0.036
TSNI         0.020 ± 0.003   0.042 ± 0.010
G1DBN        0.104 ± 0.024   0.114 ± 0.043

TABLE 1. Comparison of network inference methods (mean AUPR ± standard deviation).
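An AUPR like those in Table 1 is obtained by ranking all candidate edges by their predicted weight and computing the average precision against the gold-standard network. A self-contained sketch with toy edge scores (not Jump3 output):

```python
# Average precision over ranked candidate edges, equivalent to the
# area under the precision-recall curve for distinct scores.
import numpy as np

def aupr(y_true, scores):
    """Mean of the precision values at the rank of each true edge."""
    order = np.argsort(-scores)          # rank edges by predicted weight
    y = y_true[order]
    precision = np.cumsum(y) / np.arange(1, len(y) + 1)
    return float(precision[y == 1].mean())

gold   = np.array([1, 1, 0, 1, 0, 0, 0, 0])          # true links
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
area = aupr(gold, scores)
```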
We also applied Jump3 to gene expression data from
murine bone marrow-derived macrophages treated with
interferon gamma (Blanc et al., 2011). Several of the hub
TFs in the predicted network have biologically relevant
annotations. They include interferon genes, one gene
associated with cytomegalovirus infection, and cancer-
associated genes, showing the potential of Jump3 for
biologically meaningful hypothesis generation.
REFERENCES
Bansal M et al. Bioinformatics 22, 815-822 (2006).
Blanc M et al. PLoS Biol 9, e1000598 (2011).
Faith JJ et al. PLoS Biol 5, e8 (2007).
Greenfield A et al. PLoS ONE 5, e13397 (2010).
Huynh-Thu VA & Sanguinetti G. Bioinformatics 31, 1614-1622 (2015).
Huynh-Thu VA et al. PLoS ONE 5, e12776 (2010).
Lèbre S. Stat Appl Genet Mol Biol 8, Article 9 (2009).
Prill RJ et al. PLoS ONE 5, e9202 (2010).
Ptashne M & Gann A. Genes and Signals. Cold Spring Harbor Laboratory Press (2002).
Abstract ID: O7 Oral presentation
O7. MODELING THE REGULATION OF β-CATENIN SIGNALLING BY WNT STIMULATION AND GSK3 INHIBITION
Annika Jacobsen1, Nika Heijmans2, Reneé van Amerongen2, Folkert Verkaar3, Martine J. Smit3, Jaap Heringa1 & K. Anton Feenstra1*.
1Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, The Netherlands; 2Van Leeuwenhoek Centre for Advanced Microscopy and Section of Molecular Cytology, Swammerdam Institute for Life Sciences, University of Amsterdam, The Netherlands; 3Division of Medicinal Chemistry, VU University Amsterdam, The Netherlands.
The Wnt/β-catenin signaling pathway is crucial for stem cell self-renewal, proliferation and differentiation. Hyperactive
Wnt/β-catenin signaling caused by genetic alterations plays an important role in oncogenesis. In our newly developed
Petri net model, GSK3 inhibition leads to significantly higher pathway activation (high β-catenin levels) than WNT stimulation, which we confirm experimentally by TCF/LEF luciferase reporter assays. Using this validated model
we can now simulate changes in Wnt/β-catenin signaling resulting from different mutations found in breast and
colorectal cancer. We propose that this model can be used further to investigate different players affecting Wnt/β-catenin
signaling during oncogenic transformation and the effect of drug treatment.
WNT/β-CATENIN
Wnt/β-catenin signaling is important for stem cell
maintenance and developmental processes and is highly
conserved in all multicellular organisms (1, 2). The
pathway regulates the expression of specific target genes
by changing the levels of the transcriptional co-activator β-catenin, which activates the TCF/LEF transcription factors. Wnt/β-catenin signaling is active in stem cells located in Wnt-rich environments.
APC and AXIN are key proteins of the destruction
complex, which targets β-catenin for destruction.
Mutations in APC, AXIN and β-catenin play important
roles in oncogenesis (2, 3). To better understand its role in
oncogenesis, we here create a Petri net (PN) model of the Wnt/β-catenin signaling pathway that uses available
coarse-grained data, such as binary interactions and semi-
quantitative protein levels. Using this model and
validating experiments we show how different strengths of
Wnt stimulation and GSK3 inhibition activate signaling
over time.
PETRI-NET MODELLING
We built a PN model of Wnt/β-catenin signaling describing the logic of known (inter)actions, cf. our previous
work (5). In a PN, a place represents an entity (e.g. gene),
a transition indicates the activity occurring between the
places (e.g. gene expression), and these are connected by
directed edges called arcs that represent their interactions
(e.g., activation of gene expression by a protein).
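The place/transition/arc mechanics just described can be sketched in a few lines: a transition is enabled when all its input places hold enough tokens, and firing it consumes input tokens and produces output tokens. The marking and arc weights below are illustrative, not the Wnt model itself.

```python
# Minimal Petri-net firing rule: places hold tokens; a transition is
# enabled if every input place holds at least its arc weight in tokens.
def fire(marking, inputs, outputs):
    """marking: {place: tokens}; inputs/outputs: {place: arc weight}.
    Returns the new marking, or None if the transition is not enabled."""
    if any(marking.get(p, 0) < w for p, w in inputs.items()):
        return None
    new = dict(marking)
    for p, w in inputs.items():
        new[p] -= w
    for p, w in outputs.items():
        new[p] = new.get(p, 0) + w
    return new

# "Gene expression" transition: a gene plus an activating protein
# produce mRNA; the gene and protein are read and returned.
m0 = {"gene": 1, "protein": 1, "mRNA": 0}
m1 = fire(m0, inputs={"gene": 1, "protein": 1},
              outputs={"gene": 1, "protein": 1, "mRNA": 1})
```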
TRANSCRIPTION AND PROTEIN ASSAYS
TCF/LEF transcription was measured by TOPFLASH
reporter activity at several time points and at different
concentrations of Wnt3a stimulation and GSK3 inhibition
by CHIR99021. Active and total β-catenin (CTNNB1)
levels were measured by Western blot.
VALIDATED ACTIVATION & INHIBITION
We simulate the model with initial Wnt and GSK3 token
levels ranging from 0 to 5 to represent addition of Wnt and
inhibition of GSK3. Figure 1 shows the four different β-
catenin responses for Wnt addition (purple) and GSK3
inhibition (green). At low GSK3 levels, β-catenin increases linearly, but at high GSK3 levels β-catenin remains low.
At high Wnt levels, β-catenin shows a transient response,
with the peak height increasing with Wnt levels. The
increase of β-catenin is due to sequestration of AXIN to
the cell membrane, which inactivates the destruction
complex. Increase in β-catenin activates transcription of
AXIN2 which triggers the negative feedback.
FIGURE 1. Pathway response for different levels of Wnt and activity of
GSK3. When adding Wnt, the pathway transiently activates but GSK3 inhibition permanently activates.
TCF/LEF reporter assay validation experiments for both
perturbations show that transcriptional activity of
TCF/LEF is both dosage and time dependent,
corresponding well for GSK3 inhibition. Wnt3a stimulation, on the other hand, does activate expression, but we
do not observe the β-catenin dosage or time effect
predicted by our model. Measuring β-catenin by Western
blot reveals a consistent increase upon pathway activation,
however protein levels and changes are on the border of
experimental sensitivity.
In conclusion, our Petri net model recapitulates much of
the known behavior of the Wnt/β-catenin pathway upon
Wnt stimulation and GSK3 inhibition, and hints at
subtleties in the mechanism that will help us gain further
understanding in the role of this pathway in development
and oncogenesis.
REFERENCES
1. Clevers & Nusse (2012) Cell. 149:1192-1205
2. Holstein (2012) Cold Spring Harb Perspect Biol. 4:a007922
3. MacDonald, Tamai & He (2009) Dev Cell. 17:9-26
4. Klaus & Birchmeier (2008) Nat. Rev. Cancer. 8:387-398
5. Bonzanni et al., (2009) Bioinformatics. 25:2049-2056
Abstract ID: O8 Oral presentation
O8. RANKED TILING BASED APPROACH TO DISCOVERING PATIENT
SUBTYPES
Thanh Le Van1*, Jimmy Van den Eynden3, Dries De Maeyer2, Ana Carolina Fierro5, Lieven Verbeke5, Matthijs van Leeuwen4, Siegfried Nijssen1,4, Luc De Raedt1 & Kathleen Marchal5,6.
Department of Computer Science1, Centre of Microbial and Plant Genetics2, KULeuven, Belgium; Department of Medical Biochemistry, University of Gothenburg3, Sweden; Leiden Institute for Advanced Computer Science4, Universiteit Leiden, The Netherlands; Department of Plant Biotechnology and Bioinformatics5, Department of Information Technology, iMinds6, Ghent University, Belgium.
Cancer is a heterogeneous disease consisting of many subtypes that usually have both shared and distinguishing
mechanisms. To derive good subtypes, it is essential to have a computational model that can score their homogeneity
from different angles, for example, mutated pathways and gene expression. In this paper, we introduce our ongoing work
which studies a constraint-based optimisation model to discover patient subtypes as well as their perturbed pathways
from mutation, transcription and interaction data. We propose a way to solve the optimisation problem based on
constraint programming principles. Experiments on a TCGA breast cancer dataset demonstrate the promise of the
approach.
INTRODUCTION
Discovering patient subtypes and understanding their
mechanisms are essential to provide precise treatments to
patients. There have been efforts to understand how mutations cause subtypes, such as the work by Hofree et al. (2013). However, to the best of the authors' knowledge, how to combine mutation and expression data to derive good subtypes remains an open question. We therefore study a new computational model that can discover subtypes, as well as their specific mutated and expressed genes, from mutation, transcription and interaction data.
METHODS
We conjecture that a subtype consists of a number of
patients who have the same set of differentially expressed
genes and a set of mutated genes that hit the same
pathways.
To find both the mutations and the expression patterns of patient subtypes,
we extend our recent ranked tiling method (Le Van et al.,
2014). Ranked tiling is a data mining method proposed to
mine regions with high average rank values in a rank
matrix. In this type of matrix, each row is a complete
ranking of the columns. We find that rank matrices are a
good abstraction for numeric data and are useful to
integrate datasets that are at different scales.
To apply the ranked tiling method, we first transform the
given numeric expression matrix, where rows are
expressed genes and columns are patients, into a ranked
expression matrix. Then, we search for a region in the
transformed matrix that has high average rank scores.
However, unlike the original ranked tiling method, we impose the further constraint that the columns (patients) of the region should also share a number of mutated genes that rank highly with respect to a network model. We formalise this as a constraint optimisation problem and use a constraint solver to solve it.
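The rank transform and region scoring at the core of ranked tiling can be sketched as follows. This shows only the scoring of a given region on toy data; the search itself, done with constraint programming in our method, is omitted.

```python
# Rank transform: each row (gene) of the expression matrix becomes a
# complete ranking of the columns (patients); a candidate region is
# then scored by its average rank. Toy data, not the actual solver.
import numpy as np

def to_rank_matrix(X):
    """Rank each row's columns (1 = lowest value)."""
    return np.argsort(np.argsort(X, axis=1), axis=1) + 1

def region_score(R, rows, cols):
    """Average rank inside the region defined by row and column sets."""
    return R[np.ix_(rows, cols)].mean()

X = np.array([[0.1, 5.0, 4.0],
              [0.2, 9.0, 7.0],
              [3.0, 0.5, 0.4]])
R = to_rank_matrix(X)
high = region_score(R, rows=[0, 1], cols=[1, 2])   # highly ranked block
```

Because every row is a complete ranking, matrices measured on different scales (e.g. expression and network scores) become directly comparable.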
RESULTS & DISCUSSION
We apply our method to a TCGA breast cancer dataset and discover eight subtypes. Compared to the PAM50 annotations, our method divides the Basal subtype into three sub-groups, named S2, S3 and S6. The LumA subtype is divided into four smaller groups, namely S1, S4, S7 and S8. Finally, our method recovers the Her2 subtype in S5.
To validate the mined subtypes in the patient dimension,
we assume PAM50 annotations are true labels for them.
Then, grouping patients into subtypes can be seen as a
multi-class prediction problem, for which we can calculate
F1 score to measure the average accuracy. We also
compare our scores with state-of-the-art, including
iCluster+ (Mo, Q. et al., 2013), NBS (Hofree et al., 2013)
and SNF (Wang B. et al., 2014). The result (not shown)
illustrates that our subtypes are more homogeneous than
the ones produced by iCluster+ and NBS and are
comparable to those by SNF.
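The validation scheme just described, treating the PAM50 annotations as ground truth and scoring a subtype assignment as a multi-class prediction, can be sketched with a macro-averaged F1 score. The labels below are toy data; we do not claim this exact averaging variant is the one used in the paper.

```python
# Macro-averaged F1 over classes: per-class F1 from true/false
# positives and false negatives, then the unweighted mean.
import numpy as np

def macro_f1(y_true, y_pred):
    f1s = []
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return float(np.mean(f1s))

y_true = ["Basal", "Basal", "LumA", "LumA", "Her2"]   # PAM50 labels
y_pred = ["Basal", "LumA",  "LumA", "LumA", "Her2"]   # mined subtypes
score = macro_f1(y_true, y_pred)
```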
To validate the mined subtypes in the gene dimension, we
perform hypergeometric tests to see how their mutated genes
and expressed genes are related to cancer pathways. Figure 1 below is a heatmap showing the log10 p-values of these tests; it shows that the discovered subtypes have specific perturbed pathways.
FIGURE 1. Cancer pathway enrichment analysis using mined mutated genes and expressed genes of subtypes
REFERENCES
Hofree et al., Nat Methods 10(11), 1108–15 (2013).
Le Van et al., ECML/PKDD 2014 (2), 98–113 (2014)
Mo, Q. et al., PNAS 110(11), 4245–50 (2013)
Wang, B. et al., Nature methods, 11(3), 333–7 (2014)
Abstract ID: O9 Oral presentation
O9. DEVELOPMENT OF A DNA METHYLATION-BASED SCORE
REFLECTING TUMOUR INFILTRATING LYMPHOCYTES
Martin Bizet1,2,3*#, Jana Jeschke1#, Christine Desmedt4, Emilie Calonne1, Sarah Dedeurwaerder1, Gianluca Bontempi2,3, Matthieu Defrance1,2, Christos Sotiriou4 & Francois Fuks1.
Laboratory of Cancer Epigenetics, Faculty of Medicine, Université Libre de Bruxelles1; Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles & Vrije Universiteit Brussel2; Machine Learning Group, Computer Science Department, Université Libre de Bruxelles, Brussels3; Breast Cancer Translational Research Laboratory, Jules Bordet Institute, Université Libre de Bruxelles4. #These authors contributed equally to this work.
Tumour infiltrating lymphocytes (TIL) are increasingly recognised as one of the key features to predict outcome and therapy response in malignancies. However, measuring quantities of TIL remains challenging since it relies on subjective
and spatially-restricted measurements from a pathologist. In this study we used genome-scale DNA-methylation profiles
from breast tumours to develop a so-called MeTIL score, which reflects TIL level within whole-tumour samples. We
demonstrate, using simulated data, the robustness of the MeTIL score to noise, as well as its ability to sensitively measure TIL in patient samples and to improve prediction of outcome.
INTRODUCTION
Breast cancer (BC) is one of the most common and
deadliest diseases in women from Western countries.
Tumour infiltrating lymphocytes (TIL) emerged as one of the key features to predict outcome and response to treatment in this disease [1]. However, the measurement of TIL levels remains challenging because it relies on manual readings of a tumour slide by a pathologist, which is subjective by nature and does not necessarily reflect the whole-tumour TIL content. In this study we took
advantage of the high tissue-specificity of DNA-
methylation patterns [2] to develop a so-called MeTIL
score, which predicts the amount of lymphocytes within
the tumour.
METHODS
The MeTIL score was developed in three key steps:
1. We first used genome-scale DNA-methylation profiles from 11 cell lines (8 normal or cancerous breast epithelial lines and 3 T-lymphocyte lines) to extract 29 cytosines specifically unmethylated in T-lymphocytes (delta-beta < -0.8 and standard deviation between groups < 0.1).
2. We then applied a cross-validated pipeline, combining mRMR feature selection and a random-forest algorithm, on 118 BC samples to extract a minimal set of cytosines whose methylation level is predictive of TIL quantities.
3. Finally we used a “normalised PCA” approach to compute a single MeTIL score from the individual methylation values.
The robustness of the relation between the MeTIL score and TIL levels was also assessed using Spearman correlation computed from 10,000 simulations with varying proportions of TIL (Fig. 1B&C). The simulated data took two sources of noise into account:
- technical noise, modeled as Gaussian noise;
- perturbations due to the presence of other cell types within the tumour microenvironment that are neither lymphocytic nor epithelial, modeled by a methylation value sampled randomly from the array.
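The simulation scheme above (mixture of cell-type methylation levels plus Gaussian noise, as in Fig. 1C) can be sketched for a single marker. All numeric values and the fixed split of non-TIL cells are illustrative assumptions, not the study's actual parameters.

```python
# One simulated marker value: a proportion-weighted mixture of the
# lymphocyte level (M1), the epithelial level (M2) and a random
# other-cell-type level (M3), plus Gaussian technical noise (e).
# Levels, noise scale, and the 10% "unknown" split are illustrative.
import numpy as np

def simulate_marker(f1, M1=0.05, M2=0.9, noise_sd=0.05, seed=0):
    """f1: TIL proportion; the remainder is split between epithelial
    cells and an unknown perturbing cell type."""
    rng = np.random.default_rng(seed)
    f3 = 0.1 * (1 - f1)                  # unknown cell types
    f2 = 1 - f1 - f3                     # epithelial cells
    M3 = rng.uniform()                   # random other-cell-type level
    e = rng.normal(scale=noise_sd)       # technical noise
    return float(np.clip(f1 * M1 + f2 * M2 + f3 * M3 + e, 0.0, 1.0))

# MeTIL markers are unmethylated in T-lymphocytes, so the simulated
# value falls as the TIL fraction f1 rises.
low_til, high_til = simulate_marker(0.05), simulate_marker(0.8)
```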
Lastly, we measured TIL quantities with the MeTIL score
in three independent BC cohorts and applied Cox
regression models to evaluate the prognostic value of the
MeTIL score.
RESULTS & DISCUSSION
We first applied a hierarchical clustering analysis and
observed that BC samples with high TIL infiltration show
a hypomethylated pattern for all MeTIL markers (Fig.1A).
Furthermore we demonstrated, using simulations, a strong correlation between the MeTIL score and TIL levels, even when a high level of noise (0.7 times the standard deviation) and a high proportion of perturbing unknown cell types (70%) were included in the model (Fig. 1B).
FIGURE 1. The MeTIL score reflects TIL levels. (A) Heatmap showing the methylation values of the 5 MeTIL markers; a ‘TIL high’ group with a hypomethylated pattern (orange) appears. (B) Colour map of the Spearman correlation between MeTIL score and TIL level for increasing noise (y-axis) and abundance of unknown cell types (x-axis), based on simulations. (C) The methylation value of each MeTIL marker was simulated as the sum of the methylation levels in lymphocytes (M1), epithelial cells (M2) and other cell types (random value M3), weighted by their proportions in the tissue (f1, f2, f3), plus a Gaussian noise term (e).
Finally, we observed consistent patterns of TIL levels
within BC subtypes in independent cohorts suggesting the
robust nature of our score to evaluate TIL levels.
Furthermore, Cox regression analysis revealed a prognostic value for the MeTIL score in triple-negative and luminal BC (p-value < 0.05).
REFERENCES
[1] Loi, S., et al. Annals of Oncology 25, 1544-1550 (2014).
[2] Jeschke, J., Collignon, E., Fuks, F. FEBS J. 282(9), 1801-14 (2015).
Abstract ID: O10 Oral presentation
O10. PREDICTION OF CELL RESPONSES TO SURFACE TOPOGRAPHIES
USING MACHINE LEARNING TECHNIQUES
Aliaksei S Vasilevich1*, Shantanu Singh2, Aurélie Carlier1 & Jan de Boer1.
Laboratory for Cell Biology-inspired Tissue Engineering, MERLN Institute, Maastricht University1; Imaging Platform, Broad Institute of MIT and Harvard2. *[email protected]
Topographical cues have been repeatedly shown to influence cell fate dramatically (Bettinger et al., 2009). This phenomenon opens new opportunities to design the interaction between biomaterials and biological tissues in a predictable manner. Unfortunately, the exact mechanism of topographical control of cell behavior remains largely unknown. We have therefore developed a technology in our laboratory to determine an optimal surface topography for virtually any application in the biomedical field. Previously we reported that we can control cell shape with our surfaces in a predictable manner (Hulsman et al., 2015). Here we demonstrate that we can successfully predict not only cell shape, but also cell response at the protein level, based on the properties of our topographies. The results of our study show that we are able to design materials for biomedical applications that require a particular cell behavior.
INTRODUCTION
The TopoChip, a micro topography screening platform,
enables the assessment of cell response to 2176 unique
topographies in a single high-throughput screen. The
topographical features were randomly selected from an in
silico library of more than 150 million topographies, which were designed by an algorithm that synthesizes patterns based on simple geometric elements – circles, triangles and rectangles (Unadkat et al., 2011). In our
previous studies, we have demonstrated that these surface topographies exert a mitogenic effect on hMSCs (Unadkat et al., 2011) and affect cell shape (Hulsman et al., 2015). In this paper, we show that these topographies can also be used to modulate ALP expression in human mesenchymal stromal cells, as well as pluripotency in human induced pluripotent stem (iPS) cells. We further show that computational models can be built to predict these protein levels from surface topography parameters.
METHODS
Cell response to topography was captured by high-content
imaging. Using image analysis and data mining methods
described previously (Hulsman et.al., 2015),
multiparametric “profiles” of cellular response were
obtained. Multiple replicates of each topography were
used to estimate the median level of a cellular response of
interest – either ALP in human mesenchymal stromal cells
(hMSCs), or the median number of Oct4-positive cells in a population of human induced pluripotent stem cells (hIPSCs). We aimed to predict the cellular response based
on surface topography parameters using machine learning
methods. To train and validate these methods (specifically, classifiers), the data were split into training and testing sets in a 3:1 proportion. In the training step,
we performed a 10-fold cross-validation to obtain optimal
parameters for each classifier. The caret package (Kuhn
M., 2008) in R (R core team, 2015) was used to perform
the analysis.
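The train/validate scheme described above can be sketched with scikit-learn standing in for R's caret package (the data below are synthetic stand-ins for topography parameters and binarised responses, not the TopoChip measurements):

```python
# 3:1 train/test split, then 10-fold cross-validation on the training
# set to tune a classifier; accuracy is reported on the held-out set.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                 # "topography parameters"
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # "high/low response"

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0)      # 3:1 split
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100]},
    cv=10)                                     # 10-fold CV on train set
search.fit(X_tr, y_tr)
test_accuracy = search.score(X_te, y_te)       # held-out accuracy
```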
RESULTS & DISCUSSION
In the first project, we conducted a screening on the
TopoChip with hMSCs in order to find topographies that
would be able to increase the ALP level, a protein that is
an early marker of osteogenesis. We successfully found such surfaces and confirmed the results experimentally (publication in preparation). To move further, we assessed how accurately the ALP level in hMSCs can be predicted from topographical features. Focussing only on extreme examples, we selected 100 high- and low-scoring topographies and used the model validation scheme described in Methods to find the most accurate binary classifier for our data set. We tested several classifiers and identified random forest as the most accurate, achieving an accuracy of 96% on the held-out test set.
In a second project, we aim to find a topography that will
increase proliferation and pluripotency of hIPSCs. We
used Oct4 as a marker of pluripotency. The screening was
performed on one half of the Topochip (1000+ surfaces),
which were then ranked based on the number of Oct4
positive cells. One hundred high- and low-scoring surfaces
were chosen to train a classifier. Using logistic regression, we obtained 72% accuracy on a held-out test set. We used this model to predict surfaces that would increase pluripotency in hIPSCs among surfaces that were not included in the initial screening. Topographies were ranked according to their predicted probability score and the top 30 surfaces were chosen for experimental validation. We found that 79% of the selected surfaces were predicted accurately.
In summary, the combination of our screening methods and machine learning algorithms opens new avenues to design surfaces with desired properties for a variety of applications. Our next step will be to find a surface with maximal ALP level from our virtual library based on our screening data.
REFERENCES
Bettinger C J, Langer R & Borenstein J T. “Engineering Substrate Micro- and Nanotopography to Control Cell Function.” Angewandte Chemie (International ed. in English) 48.30 (2009).
Hulsman M et al. “Analysis of high-throughput screening reveals the effect of surface topographies on cellular morphology.” Acta Biomaterialia 15 (2015).
Kuhn M. “Building Predictive Models in R Using the caret Package.” Journal of Statistical Software 28 (2008).
R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/ (2015).
Unadkat H V et al. “An Algorithm-Based Topographical Biomaterials Library to Instruct Cell Fate.” Proceedings of the National Academy of Sciences of the United States of America 108.40 (2011).
Abstract ID: O11 Oral presentation
O11. ANALYSIS OF MASS SPECTROMETRY QUALITY CONTROL METRICS
Wout Bittremieux1, Pieter Meysman1, Lennart Martens2, Bart Goethals1, Dirk Valkenborg3 & Kris Laukens1.
Advanced Database Research and Modeling (ADReM) & Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp / Antwerp University Hospital1; Department of Biochemistry & Department of Medical Protein Research, Ghent University / VIB2; Flemish Institute for Technological Research (VITO)3.
Mass-spectrometry-based proteomics is a powerful analytical technique to identify complex protein samples; however, its results are still subject to large variability. Recently, several quality control metrics have been introduced to assess the performance of a mass spectrometry experiment. Unfortunately, these metrics are generally not sufficiently well understood. For this reason, we present a few powerful techniques to analyse multiple experiments based on quality control metrics, identify low-performance experiments, and provide an interpretation of outlying experiments.
INTRODUCTION
Mass-spectrometry-based proteomics is a powerful
analytical technique that can be used to identify complex
protein samples. Despite many technological and
computational advances, performing a mass spectrometry
experiment is still a highly complicated task and its results
are subject to large variability. To understand and evaluate how technical variability affects the results of an experiment, several quality control (QC) and performance metrics have recently been introduced. Unfortunately,
despite the availability of such QC metrics covering a
wide range of qualitative information, a systematic
approach to quality control is often still lacking.
As most quality control tools are able to generate several dozen metrics, any single experiment can be characterized by many QC metrics. Therefore, it is often not clear which metrics are most interesting in
general, or even which metrics are relevant in a specific
situation. To take into account the multidimensional data
space formed by the numerous metrics, we have applied
advanced techniques to visualize, analyze, and interpret
the QC metrics.
METHODS
Outlier detection can be used to detect deviating
experiments with a low performance or a high level of
(unexplained) variability. These outlying experiments can
subsequently be analyzed to discover the source of the
reduced performance and to enhance the quality of future
experiments.
However, it is insufficient to know that a specific
experiment is an outlier; it is also of vital importance to
know the reason. To understand why an experiment is an
outlier, we have used the subspace of QC metrics in which
the outlying experiment can be differentiated from the
other experiments. This provides crucial information on
how to interpret an outlier, which can be used by domain
experts to increase interpretability and investigate the
performance of the experiment.
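The outlier-interpretation idea above can be sketched as follows: flag the experiment whose QC metrics deviate most, then rank the metrics by how much each contributes to the deviation (the explanatory subspace). Robust z-scores stand in for the actual subspace method, and the metric names and values are toy data.

```python
# Rank QC metrics by their contribution to an experiment's deviation,
# using robust z-scores (median / MAD) as a stand-in for the actual
# explanatory-subspace method. Metric names and values are toy data.
import numpy as np

def metric_contributions(Q, metric_names):
    """Q: experiments x metrics. Returns the most deviating experiment
    and its metrics sorted by absolute robust z-score."""
    med = np.median(Q, axis=0)
    mad = np.median(np.abs(Q - med), axis=0) + 1e-9   # avoid div by 0
    z = (Q - med) / mad
    worst = np.abs(z).max(axis=1).argmax()            # outlying run
    order = np.argsort(-np.abs(z[worst]))
    return worst, [(metric_names[i], float(z[worst, i])) for i in order]

Q = np.array([[10.0, 1.0], [11.0, 1.1], [10.5, 0.9],
              [10.2, 1.0], [30.0, 1.05]])             # last run deviates
worst, ranking = metric_contributions(Q, ["rt_drift", "tic_cv"])
```

Here the first entry of `ranking` names the metric that best explains why the flagged experiment is an outlier, which is the information a domain expert would inspect.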
RESULTS & DISCUSSION
Figure 1 shows an example of interpreting a specific
experiment that has been identified as an outlier. As can
be seen, two QC metrics mainly contribute to this
experiment being an outlier. The explanatory subspace
formed by these QC metrics can be extracted, which can
then be interpreted by domain experts, resulting in insights
in relationships between various QC metrics.
FIGURE 1. QC metric importances for interpreting an outlying experiment.
Next, by combining the explanatory subspaces of all
individual outliers, it is possible to get a general view of
which QC metrics are most relevant when detecting
deviating experiments. Taking the explanatory subspaces
of all outliers into account, several of the outliers can be
distinguished by their number of identified spectra
(peptide-to-spectrum matches, PSMs). As can be seen in
Figure 2, for some specific QC metrics (highlighted in
italics) the outliers yield a notably lower number of PSMs
than the non-outlying experiments.
Because monitoring a large number of QC metrics on a
regular basis is often impractical, it is more convenient to
focus on a small number of user-friendly, well-understood,
and discriminating metrics. As the QC metrics highlighted
in Figure 2 are shown to indicate low-performance
experiments, they are prime candidates to monitor
on a continuous basis to quickly detect faulty experiments.
FIGURE 2. Comparison of the number of PSMs between the non-outlying
and the outlying experiments.
10th Benelux Bioinformatics Conference bbc 2015
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: O12 Oral presentation
O12. XILMASS: A CROSS-LINKED PEPTIDE IDENTIFICATION ALGORITHM
Şule Yılmaz1,2,3*, Masa Cernic4, Friedel Drepper5, Bettina Warscheid5, Lennart Martens1,2,3 & Elien Vandermarliere1,2,3.
Medical Biotechnology Center, VIB, Ghent, Belgium1; Department of Biochemistry, Ghent University, Ghent, Belgium2; Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium3; Department of Biochemistry, Molecular and Structural Biology, Jožef Stefan Institute, Ljubljana, Slovenia4; Functional Proteomics and Biochemistry, Department of Biochemistry and Functional Proteomics, Institute for Biology II and BIOSS Centre for Biological Signaling Studies, University of Freiburg, Freiburg, Germany5. *[email protected]
Chemical cross-linking coupled with mass spectrometry (XL-MS) facilitates the determination of protein structure and
the understanding of protein interactions. Current computational approaches rely on different strategies, with a limited
number of open-source and easy-to-use search algorithms. We therefore built a novel cross-linked peptide identification
algorithm, called Xilmass, which features a novel database construction and a new scoring function adapted from traditional
database search algorithms. We compared the performance of Xilmass against pLink, one of the most popular publicly
available algorithms, and the recently published Kojak. Xilmass identified 140 spectra, whereas Kojak and pLink
identified 119 and 35, respectively. Mapping the cross-linking sites on the structure resulted in the identification of 20
possible cross-linking sites. These findings show that Xilmass allows the identification of cross-linking sites.
INTRODUCTION
The structure of a protein is crucial for its functionality.
Protein structure is commonly determined by X-ray
crystallography or nuclear magnetic resonance (NMR). X-
ray crystallography is only feasible for crystallizable
proteins and NMR has a protein size limitation. Due to
these restrictions, protein complexes are much more
difficult to approach with these classical methods.
However, chemical cross-linking of the complex coupled
with mass spectrometry (XL-MS) allows the study of these
protein complexes. The identification of the measured
fragmentation spectra is a challenging task. One approach
to identify cross-linked peptides is to linearize cross-linked
peptide pairs in order to generate a database that can be
searched with traditional search engines (Maiolica et al., 2007).
However, a traditional search engine is not directly
applicable to identifying cross-linked peptides. Another
approach is to rely on the use of labeled cross-linkers,
but this shows decreased performance when unlabeled
cross-linkers are used. We therefore built an algorithm,
Xilmass, which is designed for the identification of XL-
MS fragmentation spectra without linearization of peptides
or the requirement of labeled cross-linkers. We also
introduced a new representation of the cross-linked
peptide database and directly implemented a new scoring
function.
METHODS
The data sets were derived from human calmodulin (CaM)
and the actin binding domain of plectin (plectin-ABD)
which were cross-linked by DSS. The data sets were
analyzed on a Velos Orbitrap Elite.
Cross-linked peptides were identified by Xilmass, pLink
(Yang et al., 2012) and Kojak (Hoopmann et al., 2015).
The identifications of both Xilmass and Kojak were
validated by Percolator (Käll et al., 2007) at q-value=0.05.
pLink returned a validated list at FDR=0.05.
The findings on cross-linking sites were validated with the
aid of the available structures (Plectin PDB-entry: 4Q57
and calmodulin PDB-entry: 2F3Y). The cross-linking sites
were predicted by X-Walk (Kahraman et al., 2011) and
PyMOL was used for the visualization.
RESULTS & DISCUSSION
We compared the number of identified spectra and cross-
linking sites from Xilmass, pLink and Kojak. Xilmass
identified 140 spectra whereas Kojak and pLink identified
119 and 35 spectra, respectively (at FDR=0.05). Xilmass
identified 53 cross-linking sites from the 140 spectra with
37 obtained from at least 2 peptide-to-spectrum matches
(PSMs). Kojak identified more cross-linking sites (60);
however, only 26 of these have at least 2 PSMs.
The cross-linking sites identified by Xilmass were
manually verified on the structure (Figure 1). We classified
sites as possible (Cα-Cα distance within 30 Å; orange) or
not predicted (Cα-Cα distance exceeding 30 Å; blue),
yielding 20 possible cross-linking sites. These findings
show that Xilmass allows the identification of cross-linking sites.
FIGURE 1. The identified cross-linking sites mapped on the plectin
protein structure for manual verification (PDB entry: 4Q57).
REFERENCES
Hoopmann, M.R. et al. Journal of Proteome Research, 14, 2190–2198 (2015).
Kahraman, A. et al. Bioinformatics, 27, 2163–2164 (2011).
Käll, L. et al. Nature Methods, 4, 923–925 (2007).
Maiolica, A. et al. Molecular & Cellular Proteomics: MCP, 6, 2200–2211 (2007).
Yang, B. et al. Nature Methods, 9, 904–906 (2012).
Abstract ID: O13 Oral presentation
O13. AUTOMATED ANATOMICAL INTERPRETATION OF DIFFERENCES
BETWEEN IMAGING MASS SPECTROMETRY EXPERIMENTS
Nico Verbeeck1*, Jeffrey Spraggins2, Yousef El Aalamat3,4, Junhai Yang2, Richard M. Caprioli2, Bart De Moor3,4, Etienne Waelkens5,6 & Raf Van de Plas1,2.
Delft Center for Systems and Control (DCSC), Delft University of Technology1; Mass Spectrometry Research Center (MSRC), Vanderbilt University2; STADIUS Center for Dynamical Systems, Signal Processing, and Data Analytics, Dept. of Electrical Engineering (ESAT), KU Leuven3; iMinds Medical IT, KU Leuven4; Dept. of Cellular and Molecular Medicine, KU Leuven5; Sybioma, KU Leuven6.
Imaging mass spectrometry (IMS) is a powerful molecular imaging technology that generates large amounts of data,
often making manual analysis practically infeasible. In this work we aid the differential analysis of multiple IMS datasets
by linking these data to an anatomical atlas. Using matrix factorization based multivariate analysis techniques, we
identify differential biomolecular signals between individual tissue samples in an obesity case study on mouse brain.
The resulting differential signals are then automatically interpreted in terms of anatomical structures using a convex
optimization approach and the Allen Mouse Brain Atlas. This automated anatomical interpretation enables much deeper
exploration of these very rich data sets by the biomedical expert.
INTRODUCTION
Imaging Mass Spectrometry (IMS) is a relatively new
molecular imaging technology that enables a user to
monitor the spatial distributions of hundreds of
biomolecules in a tissue slice simultaneously. This unique
property makes IMS an immensely valuable technology in
biomedical research. However, it also leads to very large
amounts of data in a single analysis (e.g. >1 TB), making
manual analysis of these data increasingly impractical. In
order to aid the exploration of these data, we have recently
developed a framework that integrates IMS data with an
anatomical atlas. The framework uses the anatomical data
in the atlas to automatically interpret the IMS data in terms
of anatomical structures, and guides the user towards
relevant findings within a single tissue section. In this
work, we extend this framework towards the automated
interpretation of biomolecular differences between
multiple IMS datasets.
METHODS
We demonstrate our method on IMS data of multiple
mouse brain sections, and use the Allen Mouse Brain
Atlas as the curated anatomical data source that is linked
to the MALDI-based IMS measurements. We spatially
map the data of each individual IMS dataset to the
anatomical atlas using both rigid and non-rigid registration
techniques. This establishes a common reference space
and allows for direct comparison of spatial locations
between the different IMS datasets. Group Independent
Component Analysis (GICA) is then used to automatically
extract the differentially expressed biomolecular patterns,
after which convex optimization is used to automatically
interpret the differential components in terms of known
anatomical structures (Verbeeck et al, 2014), directly
listing the anatomical areas in which changes occur.
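The interpretation step can be illustrated with a toy version of this optimization: expressing a differential spatial component as a nonnegative combination of atlas structure masks. This is only a minimal sketch; the published approach (Verbeeck et al., 2014) uses a more elaborate convex formulation, and the projected-gradient NNLS solver and toy atlas below are our own illustrative assumptions.

```python
import numpy as np

def nnls_pg(A, y, lr=0.1, n_iter=2000):
    """Nonnegative least squares via projected gradient descent:
    minimize ||A x - y||^2 subject to x >= 0 (a simple convex fit)."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = np.maximum(x - lr * A.T @ (A @ x - y), 0.0)
    return x

# Toy atlas: 3 "anatomical structures" over 6 pixels (binary masks as
# columns of A); y is a differential pattern mixing structures 0 and 2.
A = np.array([[1, 0, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 0],
              [0, 0, 1],
              [0, 0, 1]], dtype=float)
y = 2.0 * A[:, 0] + 0.5 * A[:, 2]
coef = nnls_pg(A, y)   # recovers roughly [2.0, 0.0, 0.5]
```

The nonzero coefficients directly list the anatomical structures in which the differential pattern is expressed, which is the essence of the automated interpretation.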
RESULTS & DISCUSSION
We demonstrate our approach in an obesity case study on
mouse brain. All tissue sections are cryosectioned at 10
μm and thaw-mounted onto ITO coated glass slides after
which they are sublimated with CMBT matrix. MALDI
IMS images are collected using the Bruker 15T solariX
FTICR MS with a spatial resolution of 50 μm, collecting
approximately 35,000 pixels per experiment.
The IMS data of the different experiments are registered to
the anatomical reference space provided by the Allen
Mouse Brain Atlas, establishing an inter-experiment
study-wide reference space. Analysis of the IMS
measurements using GICA reveals multiple biomolecular
patterns that differentiate between the various dietary
conditions examined by the study. The retrieved
differentially expressed biomolecular patterns are then
translated to combinations of anatomical structures using
our convex optimization approach, similar to what a
human investigator would do. This automated
interpretation of inter-experiment differences can greatly
accelerate the exploration of IMS data, as it
avoids the time- and resource-intensive step of having a
histological expert manually interpret the differential
patterns.
FIGURE 1. Automated anatomical interpretation of a biomolecular pattern that is differentially expressed in coronal mouse brain sections
between a high fat and a low fat diet in our obesity case study.
REFERENCES
Verbeeck, N. et al. Automated anatomical interpretation of ion distributions in tissue: linking imaging mass spectrometry to curated atlases. Anal. Chem. 86, 8974–8982 (2014).
Abstract ID: O14 Oral presentation
O14. ENHANCEMENT OF IMAGING MASS SPECTROMETRY DATA
THROUGH REMOVAL OF SPARSE INTENSITY VARIATIONS
Yousef El Aalamat1,2*, Xian Mao1,2, Nico Verbeeck3, Junhai Yang4, Bart De Moor1,2, Richard M. Caprioli4, Etienne Waelkens5,6 & Raf Van de Plas3,4.
Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing, and Data Analytics, KU Leuven1; iMinds Medical IT, KU Leuven2; Delft Center for Systems and Control, Delft University of Technology3; Mass Spectrometry Research Center (MSRC), Vanderbilt University4; Department of Cellular and Molecular Medicine, KU Leuven5; Sybioma, KU Leuven6.
Imaging mass spectrometry (IMS) is rapidly evolving as a label-free, spatially resolved molecular imaging tool for the
direct analysis of biological samples. However, mass spectrometry (MS) measurements are subject to different types of
noise. In IMS, one of the most abundant noise types in ion images is the presence of localized intensity spikes, also
known as sparse intensity variations, which occur on top of the biological ion distribution pattern. In this study, we
develop a method that addresses sparse intensity noise. We use low-rank approximations of the IMS data to separate
and filter sparse intensity variations from the MS signals. The efficiency of the method is tested on MS measurements
of coronal sections of mouse brain, and strong de-noising performance is demonstrated in both the spatial and the
spectral domain.
INTRODUCTION
Imaging mass spectrometry (IMS) provides unique
capabilities for biomedical and biological research.
However, its measurements tend to be subject to different
types of noise. One of the more abundant noise types in
IMS is localized intensity spikes, which can be seen as
sparse intensity variations on top of the true biological ion
patterns. This kind of noise can have a substantial impact,
particularly on low-intensity ion measurements, where the
signal-to-noise ratio (SNR) can be significantly affected.
We present a method to filter sparse intensity variations
from IMS data, and demonstrate its use to de-noise IMS
measurements both along the spatial and the spectral
domain.
METHODS
We introduce a de-noising algorithm based on low-rank
approximation, a concept from linear algebra. The method
separates sparse intensity variations from biological and
tissue sample patterns, which persist across multiple ions
and pixels. The approach decomposes the IMS data into
two parts: a structured data matrix and a sparse data
matrix. Since the noise tends to be sparse in nature, it has
a propensity to be collected into the sparse part, while the
structured part tends to capture the de-noised IMS signals,
effectively de-noising the ion images and the spectral
profiles in the process. This method allows us to
automatically filter sparse intensity variations from the
underlying tissue signal without requiring any parameter
tuning.
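A low-rank-plus-sparse split of this kind can be sketched with a simple alternating scheme: a truncated SVD for the structured part and entrywise soft-thresholding for the sparse part. Note the abstract stresses that the actual method needs no parameter tuning, whereas the sketch below does use `rank` and `lam`; it is an assumption-laden illustration of the general idea, not the published algorithm.

```python
import numpy as np

def lowrank_sparse_split(D, rank=1, lam=1.0, n_iter=50):
    """Split D into a low-rank part L (structured tissue signal) and a
    sparse part S (spike noise) by alternating a truncated SVD with
    entrywise soft-thresholding of the residual."""
    S = np.zeros_like(D)
    for _ in range(n_iter):
        # Low-rank update: best rank-`rank` fit to D - S.
        U, s, Vt = np.linalg.svd(D - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Sparse update: keep only large residuals (soft threshold).
        R = D - L
        S = np.sign(R) * np.maximum(np.abs(R) - lam, 0.0)
    return L, S

# Toy "IMS" matrix: a rank-1 tissue pattern plus two intensity spikes.
rng = np.random.default_rng(1)
clean = np.outer(rng.random(30), rng.random(20))
noisy = clean.copy()
noisy[3, 4] += 5.0
noisy[10, 7] += 4.0          # localized intensity spikes
L, S = lowrank_sparse_split(noisy, rank=1, lam=1.0)
```

The spikes end up in `S` while `L` recovers the smooth tissue pattern, mirroring the structured/sparse matrix split described above.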
RESULTS & DISCUSSION
The filtering method is demonstrated on two IMS
experiments (one lipid-focused and one protein-focused)
acquired from coronal sections of mouse brain. For the
protein experiment, the tissue section was coated with
sinapinic acid, and measurements were acquired using a
Bruker AutoFlex MALDI-TOF/TOF in positive linear
mode at a spatial resolution of 100 μm and with a mass
range extending from m/z 3000 to 22000. For the lipid
experiment, the tissue section was sublimated with 1,5-
diaminonaphthalene, and the measurements were acquired
using a Bruker AutoFlex MALDI-TOF/TOF in negative
reflectron mode at a spatial resolution of 80 μm and with a
mass range extending from m/z 400 to 1000. The case
studies demonstrate robust de-noising performance,
retrieving the underlying tissue signal efficiently and
consistently using the structured data matrix. On the
spatial side, we observe a clean-up effect in the spatial
distributions of both high- and low-intensity ions. The
effect is especially impactful for low-intensity ions,
showing a strong increase in the amount of spatial
structure that can be retrieved from low SNR
measurements and revealing patterns that would have
gone unnoticed otherwise. On the spectral side, we
observe an improved SNR after applying the method.
Thus, at the cost of computational analysis, the de-noising
method described here provides a means of increasing the
amount of information that can be extracted from an IMS
experiment, without requiring user interaction or
additional measurement.
FIGURE 1. Impact on both spatial and spectral domain. Top: example of
de-noised ion image. Bottom: plot of a spectrum before (blue) and after (red) removal of sparse intensity variations.
Abstract ID: O15 Oral presentation
O15. DETERMINANTS OF COMMUNITY STRUCTURE
IN THE PLANKTON INTERACTOME
Gipsi Lima-Mendez1,2*, Karoline Faust1,2,3, Nicolas Henry4, Johan Decelle4, Sébastien Colin4, Fabrizio Carcillo2,3,5, Simon Roux6, Gianluca Bontempi5, Matthew B. Sullivan6, Chris Bowler7, Eric Karsenti7,8, Colomban de Vargas4 & Jeroen Raes1,2.
Department of Microbiology and Immunology, Rega Institute, KU Leuven1; VIB Center for the Biology of Disease2; Laboratory of Microbiology, Vrije Universiteit Brussel, Belgium3; CNRS, UMR 7144, Station Biologique de Roscoff4; Interuniversity Institute of Bioinformatics in Brussels (IB)2, Machine Learning Group, Université Libre de Bruxelles5; Department of Ecology and Evolutionary Biology, University of Arizona, USA6; Ecole Normale Supérieure, Institut de Biologie (IBENS), France7; European Molecular Biology Laboratory8.
Identifying the abiotic and biotic factors that shape species interactions is a fundamental yet unsolved goal in ecology.
Here, we integrate organismal abundances and environmental measures from Tara Oceans to reconstruct the first global
photic-zone co-occurrence network. Environmental factors are incomplete predictors of community structure. Putative
biotic interactions are non-randomly distributed across phylogenetic groups and show both local and global patterns.
Known and novel interactions were identified among grazers, primary producers, viruses and symbionts. The high
prevalence of parasitism suggests that parasites are important regulators in the ocean food web. Together, this effort
provides a foundational resource for ocean food web research and for integrating biological components into ocean models.
INTRODUCTION
Determining the relative importance of both biotic and
abiotic processes represents a grand challenge in ecology.
Here we analyze sequence data on plankton organisms and
environmental data from the Tara Oceans project. We
applied network inference methods to construct a global-
ocean, cross-kingdom species interaction network and
disentangled the biotic and abiotic signals shaping this
interactome (Lima-Mendez, et al., 2015).
METHODS
Methods are described in detail in (Lima-Mendez, et al., 2015). Briefly:
Network inference. Taxon-taxon networks were constructed as in (Faust, et al., 2012), selecting Spearman correlation and Kullback-Leibler dissimilarity. Edges with merged multiple-test-corrected p-values below 0.05 were kept. Taxon-environment networks were computed with the same procedure and merged with the taxon-taxon networks for environmental triplet detection.
Indirect taxon edge detection. For each triplet consisting of two taxa and one environmental parameter, we computed the interaction information (II); taxon edges were considered indirect when II < 0 and within the 0.05 quantile of the random II distribution obtained by shuffling the environmental vectors.
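The triplet test can be illustrated on discretized toy data. The sign convention below (II = I(X;Y|Z) - I(X;Y), negative when the environmental variable explains the taxon-taxon link) follows the description above; the discretization and implementation details are illustrative assumptions, not the study's exact code.

```python
import numpy as np
from collections import Counter

def entropy(*cols):
    """Shannon entropy (bits) of the joint distribution of the columns."""
    counts = Counter(zip(*cols))
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log2(p)).sum())

def interaction_information(x, y, z):
    """II = I(X;Y|Z) - I(X;Y) for discretized vectors; II < 0 suggests
    the X-Y association is (partly) explained by Z, e.g. an
    environmental parameter."""
    i_xy = entropy(x) + entropy(y) - entropy(x, y)
    i_xy_z = entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)
    return i_xy_z - i_xy

# Toy triplet: two "taxa" whose abundances are both driven by z.
z = np.array([0, 0, 0, 0, 1, 1, 1, 1])
x, y = z.copy(), z.copy()
ii = interaction_information(x, y, z)   # -1.0: fully environment-driven
```

Comparing such II values to a null distribution obtained by shuffling the environmental vector then flags environmentally driven edges, as described above.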
RESULTS & DISCUSSION
Comparison of the taxon co-occurrence and environmental
profiles led to the inference of a network featuring
127,995 unique edges, of which 92,633 are taxon-taxon
edges and 35,362 are taxon-environment edges.
We identified 27,868 taxon-taxon edges that were affected
by the environment (30% of the total), of which 11,043 were
driven solely by abiotic factors and 18,869 resulted from
biotic-abiotic synergistic effects. Among environmental
factors, we found that PO4, temperature, NO2 and mixed
layer depth were frequent drivers of network connections.
In the network containing 81,590 predicted biotic
interactions (after removal of environmentally driven
edges), copresences (positive associations) outnumbered
mutual exclusions (anticorrelations; 73% versus 27%),
with most copresences derived from Syndiniales parasites
and most exclusions involving arthropods. Associations
between Bacteria and Archaea were limited to 24 mutual
exclusions. Virus-bacteria networks revealed 1,869
positive associations between viral populations and seven
of the 54 known bacterial phyla and one archaeal phylum.
The virus-host interaction data suggest that viruses are
host-range-limited across large sections of host space
(network modularity), but that specialist and generalist
phages prey on specific groups within sub-sections of this
available host space (network nestedness).
These analyses highlight the importance of top-down
effects, and specifically that of broad-range parasites such
as Syndiniales, which control the most abundant species
and ensure carbon recycling between the different
compartments of the trophic web. Finally, we show how
network-generated hypotheses guide the discovery of
symbiotic relationships (Figure 1).
Additional material is available at
http://www.raeslab.org/companion/ocean-interactome.html
FIGURE 1. Confocal microscopy confirmed the predicted interaction
between acoel flatworms (Symsagittifera sp.) and their photosynthetic
green microalgal endosymbionts (Tetraselmis sp.).
REFERENCES
Faust, K. et al. Microbial co-occurrence relationships in the human microbiome. PLoS Comput Biol 8(7), e1002606 (2012).
Lima-Mendez, G. et al. Ocean plankton. Determinants of community structure in the global plankton interactome. Science 348(6237), 1262073 (2015).
Abstract ID: O16 Oral presentation
O16. BIOINFORMATICS TOOLS FOR ACCURATE ANALYSIS OF AMPLICON
SEQUENCING DATA FOR BIODIVERSITY ANALYSIS
Mohamed Mysara1-3, Yvan Saeys4,5, Natalie Leys1, Jeroen Raes2,6 & Pieter Monsieurs1*.
Unit of Microbiology, Belgian Nuclear Research Centre SCK•CEN, Mol, Belgium1; Department of Bioscience Engineering, Vrije Universiteit Brussel VUB, Brussels, Belgium2; Department of Structural Biology, Vlaams Instituut voor Biotechnologie VIB, Brussels, Belgium3; Data Mining and Modeling Group, VIB Inflammation Research Center, Ghent, Belgium4; Department of Respiratory Medicine, Ghent University Hospital, Ghent, Belgium5; Department of Microbiology and Immunology, REGA institute, KU Leuven, Belgium6.
High-throughput sequencing technologies have created a wide range of new applications, also in the field of microbial
ecology. Yet when used in 16S rRNA biodiversity studies, these technologies suffer from two important problems: the
presence of PCR artefacts (called chimeras) and sequencing errors introduced by the sequencing technology itself. In this
work three artificial intelligence-based algorithms are proposed, CATCh, NoDe and IPED, to handle these two problems.
A benchmarking study comparing CATCh/NoDe (for 454 pyrosequencing) or CATCh/IPED (for Illumina MiSeq
sequencing) with other state-of-the-art tools shows a clear improvement in chimera detection and a reduction of
sequencing errors, respectively, in general leading to more accurate clustering of the sequencing reads into Operational
Taxonomic Units (OTUs). All algorithms are available via http://science.sckcen.be/en/Institutes/EHS/MCB/MIC/Bioinformatics/.
INTRODUCTION
The revolution in new sequencing technologies has led to
an explosion of possible applications, including new
opportunities for microbial ecological studies via 16S
rDNA amplicon sequencing. However, within such
studies, all sequencing technologies suffer from the
presence of erroneous sequences: (i) chimeras, introduced
by incorrect target amplification during PCR, and (ii)
sequencing errors originating from different factors during
the sequencing process. As such, effective algorithms are
needed to remove these erroneous sequences in order to
accurately assess microbial diversity.
METHODS
First, a new algorithm called CATCh (Combining
Algorithms to Track Chimeras) was developed by
integrating the output of existing chimera detection tools
into a new more powerful method. Second, NoDe (Noise
Detector) was introduced, an algorithm that identifies and
corrects erroneous positions in 454-pyrosequencing reads.
Third, IPED (Illumina Paired End Denoiser) was
developed as the first tool in the field to handle error
correction in Illumina MiSeq sequencing data. After
identifying the positions likely to contain an error, the
affected sequencing reads are clustered with correct reads,
resulting in error-free consensus reads. The three
algorithms were benchmarked against state-of-the-art tools.
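The idea behind CATCh, combining the outputs of several chimera detection tools into a stronger classifier, can be sketched as a simple logistic-regression stacker over per-tool scores. CATCh's actual feature set and classifier differ from this toy, so everything below (data, solver, names) is an illustrative assumption.

```python
import numpy as np

def train_stacker(scores, labels, lr=0.5, n_iter=2000):
    """Logistic-regression stacker over per-tool chimera scores.
    `scores`: (n_reads, n_tools) matrix of tool outputs;
    `labels`: 1 = chimera, 0 = genuine read."""
    X = np.hstack([scores, np.ones((scores.shape[0], 1))])  # add bias
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - labels) / len(labels)   # gradient step
    return w

def predict(scores, w):
    X = np.hstack([scores, np.ones((scores.shape[0], 1))])
    return (1.0 / (1.0 + np.exp(-X @ w)) > 0.5).astype(int)

# Toy data: tool 1's score is informative, tool 2's is pure noise.
rng = np.random.default_rng(2)
labels = rng.integers(0, 2, 200)
scores = np.column_stack([labels + 0.3 * rng.normal(size=200),
                          rng.normal(size=200)])
w = train_stacker(scores, labels)
acc = (predict(scores, w) == labels).mean()
```

The stacker learns to weight the informative tool heavily and ignore the noisy one, which is the mechanism by which an ensemble can outperform its individual constituent tools.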
RESULTS & DISCUSSION
Via a comparative study with other chimera detection
tools, CATCh was shown to outperform all other tools,
increasing the sensitivity by up to 14% (see Figure 1).
FIGURE 1. Effect of applying 5% indels (left) and 5% mismatches
(right) on the performance of different chimera detection tools. CATCh
was found to outperform the other existing tools.
Similarly, NoDe and IPED were benchmarked against
other denoising algorithms, showing a reduction of the
error rate by up to 55% and 75%, respectively (see Figure
2). The combined effect of our algorithms for chimera
removal and error correction also had a positive effect on
the clustering of reads into operational taxonomic units
(OTUs), with an almost perfect correlation between the
number of OTUs and the number of species present in the
mock communities. Indeed, when applying our improved
pipeline containing CATCh and NoDe to a 454
pyrosequencing mock dataset, the number of OTUs was
reduced to 28 (i.e. close to 18, the correct number of
species). In contrast, running the straightforward pipeline
without our algorithms inflated the number of OTUs to 98.
Similarly, when tested on Illumina MiSeq sequencing data
obtained for a mock community, a pipeline integrating
CATCh and IPED returned 33 OTUs (i.e. close to the real
number of 21 species), while 86 OTUs were obtained
using the default mothur pipeline.
REFERENCES
Mysara, M., Leys, N., Raes, J. & Monsieurs, P. NoDe: a fast error-correction algorithm for pyrosequencing amplicon reads. BMC Bioinformatics 16:88, 1-15 (2015). ISSN 1471-2105.
Mysara, M., Saeys, Y., Leys, N., Raes, J. & Monsieurs, P. CATCh, an Ensemble Classifier for Chimera Detection in 16S rRNA Sequencing Studies. Applied and Environmental Microbiology 81(5), 1573-1584 (2015). ISSN 0099-2240.
Abstract ID: O17 Oral presentation
O17. GENE CO-EXPRESSION ANALYSIS IDENTIFIES BRAIN REGIONS AND
CELL TYPES INVOLVED IN MIGRAINE PATHOPHYSIOLOGY: A GWAS-
BASED STUDY USING THE ALLEN HUMAN BRAIN ATLAS
Sjoerd M.H. Huisman1,2*, Else Eising3, Ahmed Mahfouz1,2, Lisanne Vijfhuizen3, International Headache Genetics Consortium, Boudewijn P.F. Lelieveldt2, Arn M.J.M. van den Maagdenberg3,4 & Marcel J.T. Reinders1.
DBL, Dept. of Intelligent Systems, Delft University of Technology, The Netherlands1; LKEB, Dept. of Radiology, Leiden University Medical Center, The Netherlands2; Dept. of Human Genetics, Leiden University Medical Center, The Netherlands3; Dept. of Neurology, Leiden University Medical Center, The Netherlands4.
Migraine is a common brain disorder with a heritability of around 50%. To understand the genetic component of this
disease, a large genome-wide association study has been carried out. Several loci were identified, but their interpretation
remained challenging. We integrated the GWAS results with gene expression data from healthy human brains to
identify anatomical regions and biological pathways implicated in migraine pathophysiology.
INTRODUCTION
Genome Wide Association Studies (GWAS) are
frequently used to find common variants with small effect
sizes. However, they often provide researchers with short
lists of single nucleotide polymorphisms (SNPs) with
uncertain connections to biological functions.
We present an analysis of GWAS data for migraine in
which the full list of SNP statistics is used to find groups
of functionally related migraine-associated genes. To this
end, we make use of gene co-expression in the healthy
human brain.
We performed genome wide clustering of genes, followed
by enrichment analysis for migraine candidate genes. In
addition, we constructed local co-expression networks
around high-confidence genes. Both approaches converge
on distinct biological functions and brain regions of
interest.
METHODS
Migraine GWAS data was obtained from the International
Headache Genetics Consortium, with 23,285 cases and
95,425 controls (Anttila et al., 2013). Genes were scored
by SNP load and divided into high-confidence genes,
migraine candidate genes, and non-migraine genes.
Spatial gene expression data in the healthy adult human
brain was obtained from the Allen Brain Institute
(Hawrylycz et al., 2012). It contains microarray
expression values of 3702 samples from 6 donors. Robust
gene co-expressions were used to cluster genes into 18
modules, which were then tested for enrichment of
migraine candidate genes, and functionally characterized.
In a second approach, local co-expression networks were
built around the high-confidence migraine genes. These
local networks were then compared to the modules of the
first approach.
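Testing modules for enrichment of candidate genes is commonly done with a hypergeometric test; the abstract does not state the exact test used, so the following is a generic sketch, and the counts below are invented for illustration rather than the study's values.

```python
from math import comb

def module_enrichment_p(n_genes, n_candidates, module_size, overlap):
    """Hypergeometric tail P(X >= overlap): the chance of seeing at
    least `overlap` candidate genes in a module of `module_size`
    genes, given `n_candidates` candidates among `n_genes` total."""
    total = comb(n_genes, module_size)
    upper = min(n_candidates, module_size)
    return sum(comb(n_candidates, k)
               * comb(n_genes - n_candidates, module_size - k)
               for k in range(overlap, upper + 1)) / total

# Illustrative numbers only: 25 candidates in a 500-gene module, where
# about 7.5 would be expected by chance.
p = module_enrichment_p(n_genes=20000, n_candidates=300,
                        module_size=500, overlap=25)
```

A small p-value flags the module as enriched in migraine candidate genes; in practice such p-values would also be corrected for the number of modules tested.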
RESULTS & DISCUSSION
The genome wide analysis revealed several modules of
genes enriched in migraine candidates. Two modules have
preferential expression in the cerebral cortex and are
enriched in synapse related annotations and neuron
specific genes. A third module contains oligodendrocytes
and genes preferentially expressed in subcortical regions.
The local co-expression networks of the second approach
converge on the same pathways and expression patterns,
even though the high-confidence genes lie mostly outside
of the modules of interest. This provides a control for the
results of the first approach.
FIGURE 1. The co-expression network around high confidence migraine genes of the second approach. Genes (and links between them) of the
migraine modules of the first approach are coloured in red, yellow, blue,
and green.
The analyses confirm the previously observed link
between migraine and cortical neurotransmission. They
also point to the involvement of subcortical myelination,
which is in line with recent tentative findings. These
results show that more relevant information can be
extracted from GWAS results using (publicly available)
tissue-specific expression patterns.
REFERENCES
Anttila V. et al. Genome-wide meta-analysis identifies new susceptibility loci for migraine. Nat. Genet. 45, 912–917 (2013).
Hawrylycz M.J. et al. An anatomically comprehensive atlas of the adult human brain transcriptome. Nature 489, 391–399 (2012).
10th Benelux Bioinformatics Conference bbc 2015
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: O18 Oral presentation
O18. SPATIAL CO-EXPRESSION ANALYSIS OF STEROID RECEPTORS IN
THE MOUSE BRAIN IDENTIFIES REGION-SPECIFIC REGULATION
MECHANISMS
Ahmed Mahfouz1,2*, Boudewijn P.F. Lelieveldt1,2, Aldo Grefhorst3, Isabel M. Mol4, Hetty C.M. Sips4, José K. van den Heuvel4, Jenny A. Visser3, Marcel J.T. Reinders2 & Onno C. Meijer4.
Department of Radiology, Leiden University Medical Center1; Delft Bioinformatics Lab, Delft University of Technology2; Department of Internal Medicine, Erasmus University Medical Center3; Department of Internal Medicine, Leiden University Medical Center4.
Steroid hormones coordinate the activity of many brain regions by binding to nuclear receptors that act as transcription
factors. This study uses genome wide correlation of gene expression in the mouse brain to discover 1) brain regions that
respond in a similar manner to particular steroids, 2) signaling pathways that are used in a steroid receptor and brain
region-specific manner, and 3) potential target genes and relationships between groups of target genes. The data
constitute a rich repository for the research community to support new insights in neuroendocrine relationships, and to
develop novel ways to manipulate brain activity in research or clinical settings.
INTRODUCTION
Steroid receptors are pleiotropic transcription factors that
coordinate adaptation to different physiological states. An
important target organ is the brain, but its complexity
hampers the understanding of their modulation.
METHODS
We used the Allen Brain Atlas (ABA) (Lein et al., 2007),
the most comprehensive repository of in situ
hybridization-based gene expression in the adult mouse
brain, to identify genes that have three dimensional (3D)
spatial gene expression profiles similar to steroid receptors.
To validate the functional relevance of this approach, we
analyzed the co-expression relationship of the
glucocorticoid receptor (Gr) and estrogen receptor alpha
(Esr1) and their known transcriptional targets in their
brain regions of action. Next, we studied the region-
specific co-expression of nuclear receptors and their co-
regulators to identify potential partners mediating the
hormonal effects on dopaminergic transmission. Finally,
to illustrate the potential of using spatial co-expression to
predict region-specific steroid receptor targets in the brain,
we identified and validated genes that responded to
changes in estrogen in the arcuate nucleus and medial
preoptic area of the mouse hypothalamus.
RESULTS & DISCUSSION
For each steroid receptor, we ranked genes based on their
spatial co-expression across the whole brain as well as in
each of 12 major brain structures separately.
For each steroid receptor, strongly co-expressed genes
within a brain region are likely related to the localized
functional role of the receptor. For example, out of the top
10 genes co-expressed with Esr1 across the whole brain, 4
were previously shown to be regulated by Esr1 and/or
estrogens in various tissues (Gpr101, Calcr, Ngb, and
Gpx3).
We assessed the extent of co-expression of glucocorticoid
(GC)-responsive genes (Datson et al., 2012) with Gr in the
whole brain, the hippocampus and its substructures the
dentate gyrus (DG) and the different subregions of the
cornu ammonis (CA). GC-responsive genes were
significantly co-expressed with Gr in the DG, but
interestingly also in the whole brain and in the CA3 region
(FDR-corrected p < 1.8×10⁻³; Mann-Whitney U test).
Similarly, a Mann-Whitney U test showed that a set of 15
genes that are sensitive to gonadal steroids (Xu et al.,
2012) is significantly correlated to Esr1 across the whole
brain (FDR-corrected p = 8.69×10⁻¹⁴), as well as in the
hypothalamus (p = 3.85×10⁻¹⁰), the brain region
responsible for sexual behavior in animals.
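A gene-set test of this kind can be sketched with SciPy; the correlation values below are synthetic stand-ins for the real co-expression measurements:

```python
# Mann-Whitney U test of whether a gene set is more strongly co-expressed
# with a receptor than background genes. Correlation values are synthetic.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
background = rng.normal(0.0, 0.2, size=1000)  # co-expression of all genes with the receptor
gene_set = rng.normal(0.4, 0.2, size=15)      # a 15-gene hormone-sensitive set

# one-sided test: is the gene set shifted toward higher co-expression?
stat, p = mannwhitneyu(gene_set, background, alternative="greater")
print(f"U = {stat:.0f}, p = {p:.2e}")
```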
In order to identify putative region-dependent co-regulators
of steroid receptors, we analyzed the co-expression
relationships of each steroid receptor and a set of 62
nuclear receptor co-regulators as present on a peptide array
(Nwachukwu et al., 2014). We focused our analysis on
well-established target regions of steroid hormone action,
the dopaminergic brain regions (ventral tegmental area,
VTA, and substantia nigra, SN). We found three
co-regulators significantly co-expressed with the androgen
receptor (Ar): Pnrc2, Pak6 and Trerf1, suggesting that
these co-regulators may be involved in mediating Ar
effects on dopaminergic transmission.
In order to validate the predictive value of highly correlated
expression with a steroid receptor, we analyzed the
response of the top 10 genes that are strongly co-expressed
with Esr1 in the hypothalamus to the estrogen
diethylstilbestrol (DES) in castrated male mice using
qPCR. We performed quantitative double in situ
hybridization (dISH) for Esr1 and the six mRNAs (Irs4,
Magel2, Adck4, Unc5, Ngb, and Gdpd2) that showed more
than 1.3-fold enrichment in qPCR. We found that Irs4 and
Magel2 mRNA were both significantly upregulated by
DES treatment (1.9- and 2.4-fold, respectively).
REFERENCES
Lein E. et al. Nature 445, 168–176 (2007).
Datson N. et al. Hippocampus 22, 359–371 (2012).
Xu X. et al. Cell 3, 596–607 (2012).
Nwachukwu J. et al. eLife 3, e02057 (2014).
Abstract ID: O19 Oral presentation
O19. A SYSTEMS BIOLOGY COMPENDIUM FOR LEISHMANIA DONOVANI
Bart Cuypers1,2,3*, Pieter Meysman1,2, Manu Vanaerschot3, Maya Berg3, Malgorzata Domagalska3, Jean-Claude Dujardin3,4# & Kris Laukens1,2#.
Advanced Database Research and Modeling (ADReM), University of Antwerp1; Biomedical informatics research center Antwerpen (biomina)2; Molecular Parasitology Unit, Department of Biomedical Sciences, Institute of Tropical Medicine, Antwerp3; Department of Biomedical Sciences, University of Antwerp4.
#shared senior authors
Leishmania donovani is the cause of visceral leishmaniasis in the Indian subcontinent and poses a threat to public health
due to increasing drug resistance. Little is known about its very peculiar molecular biology, and there has been little
‘omics integration effort so far. Here we present an integrative database, or ‘omics compendium, that contains all
genomics, transcriptomics, proteomics and metabolomics experiments that are currently publicly available for
Leishmania donovani. Additionally, the user interface contains analysis tools for new datasets that use data mining
strategies such as frequent itemset mining to link results from different ‘omics layers.
INTRODUCTION
The protozoan parasite Leishmania donovani causes
visceral leishmaniasis (VL), a life threatening disease
which affects 500 000 people each year. With only four
drugs available and rapidly emerging drug resistance,
knowledge about the parasite’s resistance mechanisms is
essential to boost the development of new drugs. However,
little is known about gene regulation in Leishmania, and
the few findings indicate major differences from known
gene expression systems. Indeed, no polymerase II
promoters have ever been found in
Leishmania1. Genes are constitutively transcribed in large
polycistronic units and subsequently spliced into
individual mRNAs (trans-splicing)1. A modified thymine,
Base J, marks the end of transcription units and functions
as a stop signal for the RNA polymerase2. Gene
expression is then assumed to be regulated at the post-
transcriptional level (mRNA stability, translation
efficiency, epigenetic factors, etc.), but evidence to
support this is scarce1. Integration of different ‘omics
could shed light on these gene regulatory mechanisms, but
there has been little integration effort so far.
METHODS
We developed an easy-to-use tool, able to import and
connect all existing L. donovani ‘omics experiments.
Genomics, epigenomics, transcriptomics, proteomics,
metabolomics and phenotypic data was collected and
added to a MySQL database compendium, further
complemented with publicly available data. Relations
between different ‘omics layers were explicitly defined
and provided with a level of confidence. Python scripts
were developed to preprocess, analyse and import the data.
To allow comparability between different experiments,
platforms and labs, the three integration principles of the
COLOMBOS bacterial expression compendium were
adapted3. 1) Use the same data-analysis pipeline for all
data. 2) Work with contrasts to a control condition instead
of expression values. 3) Annotate these contrasts in a
unified and structured manner.
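Principle 2, working with contrasts to a control condition rather than raw expression values, can be illustrated with a minimal sketch; the gene names and expression values are hypothetical:

```python
# Sketch of integration principle 2: store log-ratio contrasts against a
# control condition instead of raw expression values (data hypothetical).
import math

control = {"geneA": 120.0, "geneB": 45.0}
treatment = {"geneA": 480.0, "geneB": 11.25}

# log2 ratio of treatment over control makes experiments comparable
contrasts = {g: math.log2(treatment[g] / control[g]) for g in control}
print(contrasts)  # {'geneA': 2.0, 'geneB': -2.0}
```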
In addition to this vast data source, a set of integrative
data-analysis tools was developed based on data mining
strategies. For example, one tool uses frequent itemset
mining algorithms to detect which proteins and
metabolites frequently exhibit the same behaviour under
different conditions. Another tool converts several ‘omics
layers to a network format that can be opened in
Cytoscape and can thus serve as the basis for network analysis.
The Django and Twitter Bootstrap frameworks were used
to create a web portal to make the tools accessible to any
Leishmania researcher.
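The frequent itemset idea behind the first tool can be caricatured in a few lines; each "transaction" is one experimental condition, and the entities and support threshold below are hypothetical, not the compendium's actual data:

```python
# Toy frequent itemset mining: find pairs of entities (proteins/metabolites)
# that are upregulated together in many conditions. Data are hypothetical.
from itertools import combinations
from collections import Counter

# Each transaction: the set of entities upregulated in one condition
conditions = [
    {"protA", "protB", "metX"},
    {"protA", "protB", "metX", "metY"},
    {"protA", "protB"},
    {"protB", "metY"},
]

min_support = 3  # a pair must recur in at least 3 conditions
pair_counts = Counter()
for cond in conditions:
    for pair in combinations(sorted(cond), 2):
        pair_counts[pair] += 1

frequent_pairs = {p for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)  # {('protA', 'protB')}
```

Real miners (Apriori, FP-growth) prune the search space instead of enumerating all pairs, but the support-counting principle is the same.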
RESULTS & DISCUSSION
Excellent public gene, protein, metabolite annotation
databases for Leishmania and related species are already
available (e.g. TriTrypDB and GeneDB). However, the
strength of our tool is that it links these annotation data to
‘omics experiments that are either provided by the user, or
that are publicly available. New experiments can quickly
be preprocessed, analysed and integrated in the database
via its Python back end. The compendium is therefore not
only a look-up tool (e.g. under which conditions is this
gene or metabolite upregulated?), but has tools available
to also analyse the user-provided data with intelligent data
mining tools (e.g. which metabolites/genes are typically
upregulated in drug-resistant strains?). These new
experiments provide additional confidence and
information about the biological entities in the database.
Unlike many other databases, the compendium has an
elaborate quality control system. Every result provided by
the tools can be traced back to the experimental data,
which contains the necessary quality control plots to
support the experiment’s validity. Additionally, it contains
all relevant information about the extractions and the
origin of the biological material.
Using the compendium and its tools, we characterized the
development and drug resistance of Leishmania donovani
in a systems biology context. The genomes of more
than 200 strains were examined for associations with
phenotypical features and a subset was linked to
transcriptomics, proteomics and metabolomics results. The
compendium and its scripts were designed to be generic
and can therefore be used for other organisms with only
minor changes.
REFERENCES
1. Donelson, J. (1999) PNAS 96, 2579–258.
2. Van Luenen, H. G. A. M. et al. (2012) Cell 150, 909–921.
3. Meysman et al. (2014) Nucleic Acids Research 42, D649–D653.
Abstract ID: O20 Oral presentation
O20. MULTI-OMICS INTEGRATION: RIBOSOME PROFILING
APPLICATIONS
Volodimir Olexiouk1, Elvis Ndah1, Sandra Steyaert1, Steven Verbruggen1, Eline De Schutter1, Alexander Koch1, Daria Gawron2, Wim Van Criekinge1, Petra Van Damme2, Gerben Menschaert1,*.
Lab of Bioinformatics and Computational Genomics (BioBix), Department of Mathematical Modelling, Statistics and Bioinformatics, Faculty of Bioscience Engineering, Ghent University1; Dept. Medical Protein Research, VIB-Ghent University2.
Ribosome profiling is a relatively new NGS technology that enables the monitoring of the in vivo synthesis of mRNA-
encoded translation products measured at the genome-wide level. The technique, also sometimes referred to as RIBOseq,
uses the property of translating ribosomes to protect mRNA fragments from nuclease digestion, making it possible to
determine the genomic positions of translating ribosomes with sub-codon to single-nucleotide precision. Since the advent of the
technology, several bioinformatics solutions have been devised to investigate this type of data. Here we will present
several solutions to detect novel proteoforms by combining RIBOseq and mass spectrometry data, to detect putatively
coding small open reading frames (sORFs), and to evaluate the impact of DNA and RNA methylation on the translation
level.
INTRODUCTION
Integration of different ‘omics technologies is routinely
adopted to investigate biological systems. Our lab focuses
on high-throughput data analysis and the development of
novel data integration methodologies. Currently our focus
goes to ribosome profiling (Ingolia et al., 2011), an NGS
based technique to measure the so-called translatome (i.e.
the mRNA that shows ribosome occupancy). This
technique is applied in combination with other sequencing
based protocols to measure expression (RNAseq),
translation (mass spectrometry) and to chart maps of
regulatory elements such as DNA methylation (reduced
representation bisulfite sequencing, RRBS) and RNA
methylation (m6Aseq) to address several biological
questions.
METHODS
For the integration of RIBOseq and mass spectrometry
(MS), we devised a tool called PROTEOFORMER
(www.biobix.be/proteoformer). This proteogenomics tool
consists of several steps. It starts with the mapping of
ribosome-protected fragments (RPFs) and quality control
of subsequent alignments. It further includes modules for
identification of transcripts undergoing protein synthesis,
positions of translation initiation with sub-codon
specificity and single nucleotide polymorphisms (SNPs).
We used PROTEOFORMER to create protein sequence
search databases from publicly available mouse and in-
house performed human RIBOseq experiments and
evaluated these with matching proteomics data (Crappé et
al., 2015).
Another pipeline based on RIBOseq data is built around
the discovery of putatively coding small open reading
frames (sORFs). Herein, the first step is to delineate
sORFs based on RPF coverage throughout the coding
sequence and at the translation initiation site. Afterwards,
state-of-the-art tools and metrics assessing the coding
potential of sORFs are implemented and a list of candidate
sORFs for downstream analysis is compiled (e.g. MS-
based identification).
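The first delineation step can be caricatured as a coverage filter; the threshold and per-codon counts below are hypothetical, not the pipeline's actual criteria:

```python
# Sketch: keep candidate sORFs whose codons are sufficiently covered by
# ribosome-protected fragments (RPFs). Data and threshold are hypothetical.
def covered_fraction(rpf_counts):
    """Fraction of codon positions with at least one RPF read."""
    return sum(1 for c in rpf_counts if c > 0) / len(rpf_counts)

candidates = {
    "sORF_1": [5, 3, 0, 7, 2, 4],  # per-codon RPF counts
    "sORF_2": [1, 0, 0, 0, 0, 0],
}
kept = [s for s, counts in candidates.items() if covered_fraction(counts) >= 0.5]
print(kept)  # ['sORF_1']
```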
To assess the impact of DNA-methylation at the
translation level a double knockout DNMT model was
studied (WT and DNMT1 + 3B knockout HCT116 cell
line). Genome-wide DNA methylation profiling was
performed using RRBS, while ribosome profiling,
quantitative shotgun and positional proteomics (N-
terminal COFRADIC) were used to obtain protein
expression data.
An initial experiment to integrate m6Aseq (measuring the
m6A epitranscriptome) and ribosome profiling has also
been executed on HCT116 cells.
RESULTS & DISCUSSION
The RIBOseq-MS integration (through
PROTEOFORMER) increases the overall protein
identification rates by 3% and 11% (improved and new
identifications) for human and mouse, respectively, and
enables proteome-wide detection of 5’-extended
proteoforms, upstream ORF (uORF) translation and near-
cognate translation start sites. The PROTEOFORMER
tool is available as a stand-alone pipeline and has been
implemented in the Galaxy framework for ease of use.
The sORF pipeline was tested and curated on three
different cell-lines (HCT116: human, E14 mESC: mouse,
and S2: fruitfly). The public repository has been made
available at www.sorfs.org (Olexiouk V. et al., in review),
and so far includes the datasets mentioned above.
In the study for the effect of DNA methylation at the
proteome level in the DNMT double knock-out we found
that the knockout cells show more significantly up-
regulated than down-regulated genes and that these up-
regulated genes were characterized by higher levels of
promoter methylation in the wild type cells. Both the MS
and RIBOseq analyses corroborated these findings.
Preliminary results based on the m6A sequencing confirm
previous findings on known m6A sequence motifs and
enrichment of m6A sites in specific functional regions
(around translation start sites and in 3’UTR regions);
moreover, after integrating m6A- and RIBOseq data, some
examples hint at an effect of m6A on ribosomal pausing.
REFERENCES
Ingolia N. et al. Cell 147(4), 789–802 (2011).
Crappé J., Ndah E. et al. NAR 43(5), e29 (2015).
Abstract ID: O21 Oral presentation
O21. CLUB-MARTINI: SELECTING FAVORABLE INTERACTIONS AMONGST AVAILABLE CANDIDATES: A COARSE-GRAINED SIMULATION APPROACH TO SCORING DOCKING DECOYS
Qingzhen Hou1*, Kamil K. Belau2, Marc F. Lensink3, Jaap Heringa1 & K. Anton Feenstra1*.
Center for Integrative Bioinformatics VU (IBIVU), VU University Amsterdam, De Boelelaan 1081A, 1081 HV Amsterdam, The Netherlands1; Intercollegiate Faculty of Biotechnology, University of Gdańsk - Medical University of Gdańsk, Kładki 24, 80-822 Gdańsk, Poland2; Institute for Structural and Functional Glycobiology (UGSF), CNRS UMR8576, FRABio FR3688, University Lille, 59000 Lille, France3.
Protein-protein Interactions (PPIs) play a central role in all cellular processes. Large-scale identification of native binding
orientations is essential to understand the role of particular protein-protein interactions in their biological context. We
estimate the binding free energy using coarse-grained simulations with the MARTINI forcefield, and use those to rank
decoys for 15 CAPRI benchmark targets. In our top 100 and top 10 ranked structures, for the 'easier' targets that have
many near-native conformations, we obtain a strong enrichment of acceptable or better quality structures; for the 'hard'
targets with very few near-native complexes in the decoys, our method is still able to retain structures which have native
interface contacts. Moreover, CLUB-MARTINI is rather precise for some targets and able to pinpoint near-native
binding modes in top 1, 5, 10 and 20 selections.
INTRODUCTION
Measuring binding free energy is essential to understand the
relevance of particular protein-protein interactions in their
biological context. Moreover, at the atomic scale, molecular
simulations give us insight into the physically realistic details
of these interactions. In our recent study, we successfully
applied coarse-grained molecular dynamics simulations to
estimate binding free energies with accuracy similar to full
atomistic simulations, but 500-fold less time consuming
(May et al., 2014). The approach relied on the availability of
crystal structures of the protein complex of interest. Here, we
investigate the effectiveness of this approach as a scoring
method to identify stable binding conformations out of
docking decoys from protein docking.
We apply our method to rank more than 19 000 docked
protein conformations, or ‘decoys’, for
15 benchmark targets from the Critical Assessment of
PRedicted Interactions (CAPRI) (Lensink & Wodak, 2014).
METHODS
For each target, the binding free energy of all decoys was
calculated, using the MARTINI forcefield as introduced
before (May et al., 2014). In short, for a set of closely spaced
separation distances, we calculate the constraint force applied
to maintain the set distance. Integrating this force yields a
potential of mean force (PMF), from which the binding free
energy is extracted as the highest minus the lowest value.
Previously, for accuracy, we used up to 20 replicate
simulations for each distance in the PMF, but for efficiency,
here we use only a single replicate initially. We then selected
the lowest-scoring half to run an additional four replicates to
obtain better sampling and more accurate estimates of the
binding free energy. In total, we used approximately 800 000
core-hours of compute time.
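The PMF construction described above can be sketched numerically: integrate the mean constraint force over the separation coordinate, then take the highest minus the lowest PMF value. The force profile below is a synthetic stand-in for the MARTINI simulation output:

```python
# Numerical sketch of the PMF-based binding free energy estimate.
# The force profile is synthetic, standing in for mean constraint forces.
import numpy as np

distance = np.linspace(0.5, 2.5, 41)  # nm, closely spaced separations
force = -8.0 * (distance - 1.0) * np.exp(-((distance - 1.0) ** 2))  # kJ/mol/nm

# PMF = minus the integral of the mean force along the separation
# coordinate (trapezoid rule), anchored at zero at the smallest separation
increments = 0.5 * (force[1:] + force[:-1]) * np.diff(distance)
pmf = -np.concatenate(([0.0], np.cumsum(increments)))

binding_free_energy = pmf.max() - pmf.min()  # kJ/mol
print(f"estimated binding free energy: {binding_free_energy:.2f} kJ/mol")
```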
RESULTS & DISCUSSION
We obtained strong enrichment of acceptable and high
quality structures in the TOP 100 based on our PMF free
energies, as shown in Figure 1. We estimate the error of our
energies to be significant. This can be improved by increased
sampling, but that remains very expensive.
Moreover, for several targets, we can select near-native
structures in top 1, top 5 and top 10 as shown in Table 1,
which means that, overall, our method is rather precise. From
estimates of the error, we expect we can improve accuracy by
extending the amount of sampling done at each distance. In
conclusion, our approach can find favorable interactions from
available candidates produced by docking programs. To the
best of our knowledge, this is the first time interaction free
energy from a coarse-grained force field is used as a scoring
method to rank docking solutions at a large scale.
FIG. 1. Enrichment in percentage of acceptable or better structures. For each
of the 13 targets with acceptable or better decoys, the two columns (from left
to right) stand for the CAPRI Score_set and the top 100 in our ranking by
calculated binding free energy. Red, orange and yellow represent the fractions
of high, medium and acceptable quality structures over the number of all or
selected docking decoys. The order (left to right) is based on the fraction of
acceptable structures in each target (easy to difficult).
TABLE 1. Successful selections of top-ranked structures.

Selection  Target  High  Medium  Acceptable  Total (%)
TOP 1      T47     1     0       0           100
           T53     0     0       1           100
TOP 5      T47     3     2       0           100
           T41     0     0       4           80
           T53     0     0       3           60
           T37     0     2       0           40
TOP 10     T47     7     3       0           100
           T41     0     1       7           80
           T53     0     1       5           60
           T37     0     3       0           30
           T50     0     0       1           10
TOP 20     T47     14    6       0           100
           T41     0     4       13          85
           T53     0     3       9           60
           T37     0     4       2           30
           T50     0     0       3           15
           T40     1     2       0           15
           T46     0     0       1           5
REFERENCES
May, Pool, Van Dijk, Bijlard, Abeln, Heringa & Feenstra. Coarse-grained versus atomistic simulations: realistic interaction free energies for real proteins. Bioinformatics 30, 326–334 (2014).
Lensink & Wodak. Score_set: A CAPRI benchmark for scoring protein complexes. Proteins 82, 3163–3169 (2014).
Abstract ID: O22 Oral presentation
O22. PEPSHELL: VISUALIZATION OF CONFORMATIONAL PROTEOMICS
DATA
Elien Vandermarliere1,2*, Davy Maddelein1,2, Niels Hulstaert1,2, Elisabeth Stes1,2, Michela Di Michele1,2, Kris Gevaert1,2, Edgar Jacoby3, Dirk Brehmer3 & Lennart Martens1,2.
Department of Medical Protein Research, VIB1; Department of Biochemistry, Ghent University2; Oncology Discovery, Janssen Research and Development – Janssen Pharmaceutica, Beerse3.
Proteins are dynamic molecules; they undergo crucial conformational changes induced by post-translational
modifications and by binding of cofactors or other molecules. The characterization of these conformational changes and
their relation to protein function is a central goal of structural biology. Unfortunately, most conventional methods to
obtain structural information do not provide information on protein dynamics. Therefore, mass spectrometry-based
approaches, such as limited proteolysis, hydrogen-deuterium exchange, and stable-isotope labelling, are frequently used
to characterize protein conformation and dynamics, yet the interpretation of these data can be cumbersome and time
consuming. Here, we present PepShell, a tool that allows interactive data analysis of mass spectrometry-based
conformational proteomics studies by visualization of the identified peptides both at the sequence and structure levels.
Moreover, PepShell allows the comparison of experiments under different conditions, such as different proteolysis times
or binding of the protein to different substrates or inhibitors.
INTRODUCTION
The study of protein structure with mass spectrometry,
called conformational proteomics, is frequently used to
characterize protein conformations and dynamics. Most of
these methods exploit the surface accessibility of amino
acids within the native protein conformation or more
specifically, the differences in protein surface accessibility
in different situations within a protein structure.
The experimental setup and subsequent workflow of a
conformational proteomics experiment do not deviate
drastically from that of a classic mass spectrometry-based
experiment in which peptides present in a complex peptide
mixture are identified. The final outcome of a
conformational proteomics experiment is a list of peptides.
These peptide lists typically span multiple experimental
conditions across which the structural observations are to
be compared; the peptide lists have to be combined and, if
available, mapped onto the structure of the protein.
To fulfill these latter steps, we developed PepShell
(Vandermarliere et al., 2015), to guide the interpretation
of mass spectrometry-based proteomics data in the context
of protein structure and dynamics.
TOOL DESCRIPTION
PepShell aids the user in the interpretation of the outcome
of conformational proteomics experiments and is
composed of three panels: the experiment comparison
panel, the PDB view panel, and the statistics panel.
The data to analyze
PepShell allows the input from limited proteolysis,
hydrogen-deuterium exchange, MS footprinting and
stable-isotope labelling experiments. The data have to
be present in a comma-separated text file format. The
project selection interface allows the user to select a
reference project and to indicate which setups need to
be compared with each other.
Experiment comparison
This panel allows the comparison of the selected
experimental setups at the sequence level. For each
experimental condition, the identified and quantified
peptides are mapped onto the sequence of the protein
of interest.
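The sequence-level mapping can be sketched as a simple substring search; the protein and peptide sequences below are hypothetical, and PepShell's actual implementation may differ:

```python
# Sketch: locate identified peptides within a protein sequence, as the
# comparison panel's mapping does. Sequences are hypothetical.
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
peptides = ["AYIAKQR", "SHFSRQ", "NOTFOUND"]

coverage = {}
for pep in peptides:
    pos = protein.find(pep)
    if pos >= 0:
        coverage[pep] = (pos, pos + len(pep))  # 0-based, half-open interval
print(coverage)
```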
The PDB view panel
Here, the detected peptides are mapped on the protein
structure. The main requirement is the availability of a
3D structure of the protein of interest.
Statistics within PepShell
In this panel, the peptides of interest can be analyzed
in more detail. The outcome from CP-DT (Fannes et
al., 2013), giving the tryptic cleavage probability at
each tryptic cleavage position, is shown. Detailed
comparison of the peptide ratios across the different
experimental setups is also possible.
CONCLUSIONS
The increasing popularity of structural proteomics is in
stark contrast with the availability of efficient tools to
visualize this multitude of data. There are, however, some
tools available that aid data interpretation, but these are
approach-specific and aimed primarily at mass
spectrometrists, with a specific focus on the experimental
mass spectrometry data and their processing and
interpretation. PepShell, on the other hand, is intended to
support downstream users to interpret the results obtained
from a variety of conformational proteomics approaches.
PepShell uses the peptide lists to compare different
experimental conditions and allows the visualization of
these differences onto the structure of the protein. As such,
PepShell bridges the gap between mass spectrometry-
based proteomics data and their interpretation in the
context of protein structure and dynamics.
PepShell is an open source Java application. Its binaries,
source code and documentation can be found at:
compomics.github.io/projects/pepshell.html
REFERENCES
Fannes T. et al. J Proteome Res 12, 2253–2259 (2013).
Vandermarliere E. et al. J Proteome Res 14, 1987–1990 (2015).
Abstract ID: O23 Oral presentation
O23. INTERACTIVE VCF COMPARISON USING SPARK NOTEBOOK
Thomas Moerman1,2,5*, Dries Decap3,5, Toni Verbeiren2,5, Jan Fostier3,5, Joke Reumers4,5, Jan Aerts2,5.
Advanced Database Research and Modeling (ADReM), University of Antwerp1; Visual Data Analysis Lab, ESAT – STADIUS, Dept. of Electrical Engineering, KU Leuven – iMinds Medical IT2; Department of Information Technology, Ghent University – iMinds, Gaston Crommenlaan 8 bus 201, 9050 Ghent, Belgium3; Janssen Research & Development, a division of Janssen Pharmaceutica N.V., 2340 Beerse, Belgium4; ExaScience Life Lab, Kapeldreef 75, 3001 Leuven, Belgium5.
Researchers benefit greatly from tools that allow hands-on, interactive and visual experimentation with data, unimpeded
by setup complexities or by scaling issues resulting from large data sizes. In our contribution we present an implementation
of an interactive VCF comparison tool, making use of a technology stack based on Apache Spark [1], Big Data
Genomics Adam [2] and Spark Notebook [3].
INTRODUCTION
Current genomics data formats and processing pipelines
are not designed to scale well to large datasets [1]. They
were also not conceived to be used in an interactive
environment. The bioinformatics field typically struggles
with these difficulties as high-throughput, next-generation
sequencing jobs produce large data files. Although many
high-quality bioinformatics processing tools exist, it is
often hard to express analyses in a consolidated and
reproducible fashion. These tools typically do not allow to
interactively iterate on an analysis while visualizing
results.
OBJECTIVE
Analysis tools preferably provide the expressive power to
define ad hoc queries on data. Biologists or clinical
researchers, when dealing with genomic variants encoded
in VCF files, typically perform queries comparing one
protocol to another, tumor to normal, treated to untreated
cell lines and so on. Ideally these comparisons make use
of all quality-related metrics stored in VCF files (e.g.
coverage depth, quality score) as well as the actual region
annotations (e.g. repeat regions, exonic regions) and
generate visual output. We aim to implement a tool that
provides the necessary expressiveness as well as the
computational power needed for making these types of
analyses practical and interactive.
APPROACH
Recent advances in computation platform technology
(Spark) and notebook technologies (Spark Notebook)
enable orchestration of distributed jobs on cluster
infrastructure from a programmable environment running
in a browser. These technologies, combined with Adam
[2], a library specifically designed for processing next-
generation sequencing data, provide the necessary
architectural bedrock for our purposes.
Analyses are expressed in a high-level programming
language (Scala), operating on specialized data structures
(Spark resilient distributed datasets, or RDDs [1]) that
abstract away the complexity of defining distributed computations on data sets too large for single-node processing. Adam meets the need for an explicit data
schema for abstraction of the different bioinformatics file
formats.
RESULTS & CONTRIBUTIONS
Our work focuses on the pairwise comparison of annotated
VCF files. Our contributions consist of two open-source
Scala libraries: VCF-comp [4] and Adam-FX [5]. VCF-
comp implements the concordance by variant position
algorithm, which segregates the variants from two VCF
inputs (A, B) into 5 categories: A/B-unique, concordant
(equal variants on position) and A/B-discordant (different
variants on position). This results in a distributed data
structure from which we project visualizations, presented
to the user by means of the Spark Notebook interface.
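The categorization step can be sketched in plain Python (the authors' implementation is in Scala on Spark RDDs; here ordinary dicts keyed by position stand in for the distributed data, and the category names follow the abstract):

```python
def categorize(vcf_a, vcf_b):
    """Segregate two position->variant maps into the five concordance
    categories: A/B-unique, concordant, A/B-discordant."""
    categories = {"A-unique": [], "B-unique": [], "concordant": [],
                  "A-discordant": [], "B-discordant": []}
    for pos in set(vcf_a) | set(vcf_b):
        a, b = vcf_a.get(pos), vcf_b.get(pos)
        if b is None:
            categories["A-unique"].append((pos, a))
        elif a is None:
            categories["B-unique"].append((pos, b))
        elif a == b:
            categories["concordant"].append((pos, a))
        else:  # both inputs call a variant here, but the alleles differ
            categories["A-discordant"].append((pos, a))
            categories["B-discordant"].append((pos, b))
    return categories

# Toy example: variants keyed by (chromosome, position), value = (ref, alt)
tumor  = {("20", 100): ("A", "T"), ("20", 200): ("G", "C"), ("20", 300): ("T", "G")}
normal = {("20", 100): ("A", "T"), ("20", 200): ("G", "A")}
cats = categorize(tumor, normal)
```

From such a categorized structure, summaries like the allele-frequency and functional-impact histograms of Figures 1 and 2 can be projected per category.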
FIGURE 1 Allele frequency distribution for concordant and unique
variants in a tumor vs. normal VCF comparison.
FIGURE 2 Functional impact (SnpEff annotation) histogram for
concordant, unique and discordant variants in a tumor vs. normal VCF
comparison.
Adam-FX extends the Adam data structures and file
parsing logic in order to support queries on SnpEff [6],
SnpSift [7], dbSNP and Clinvar annotations.
We believe our tool facilitates the comparison of
annotated VCF files in an interactive manner while
reducing runtime by leveraging the Spark platform.
REFERENCES
[1] Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing."
[2] Massie, Matt, et al. "Adam: Genomics formats and processing patterns for cloud scale computing."
[3] https://github.com/andypetrella/spark-notebook
[4] https://github.com/tmoerman/vcf-comp
[5] https://github.com/tmoerman/adam-fx
[6] Cingolani, P., et al. "A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3." Fly (Austin). 2012 Apr-Jun;6(2):80-92. PMID: 22728672
[7] Cingolani, P., et al. "Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift." Front Genet. 3:35, 2012.
10th Benelux Bioinformatics Conference bbc 2015
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: O24 Oral presentation
O24. 3D HOTSPOTS OF RECURRENT RETROVIRAL INSERTIONS REVEAL
LONG-RANGE INTERACTIONS WITH CANCER GENES
Sepideh Babaei1, Waseem Akhtar2, Johann de Jong3, Marcel Reinders1 & Jeroen de Ridder1*.
Delft Bioinformatics Lab, Delft University of Technology1; Division of Molecular Genetics2; Division of Molecular Carcinogenesis, The Netherlands Cancer Institute3.
Genomically distal mutations can contribute to deregulation of cancer genes by engaging in chromatin interactions. To
study this, we overlay viral cancer-causing insertions obtained in a murine retroviral insertional mutagenesis screen with
genome-wide chromatin conformation capture data. In this talk, we show that insertions tend to cluster in 3D hotspots
within the nucleus. The identified hotspots are significantly enriched for known cancer genes, and bear the expected
characteristics of bona-fide regulatory interactions, such as enrichment for transcription factor binding sites.
Additionally, we observe a striking pattern of mutually exclusive integration, indicating that insertions in these loci target the same gene, either in their linear genomic vicinity or in their 3D spatial vicinity. Our findings shed new light on the repertoire of targets obtained from insertional mutagenesis screening and underline the importance of considering the genome as a 3D structure when studying the effects of genomic perturbations.
Evidence is mounting that the organization of the genome
in the cell nucleus is extremely important for gene
regulation. This insight has been facilitated by recent technological advances (e.g. Hi-C) that enable researchers to accurately capture the 3D conformation of chromosomes in the cell nucleus at high resolution.
We have exploited a large existing Hi-C dataset to take 3D
chromosome conformation into account while determining
hotspots of viral cancer-causing mutations. These
identified hotspots are significantly enriched for known
cancer genes, and bear the expected characteristics of
bona-fide regulatory interactions, such as enrichment for
transcription factor binding sites. Additionally, we observe
a striking pattern of mutually exclusive integration, indicating that insertions in these loci target the same gene through long-range interactions (1).
In a second study (2), we performed a similar analysis that
shows a striking relation between genome conformation
and expression correlation in the brain. Although recent
studies have shown that a strong correlation exists between chromatin interactions and gene co-expression,
predicting gene co-expression from frequent long-range
chromatin interactions remains challenging. We address
this by characterizing the topology of the cortical
chromatin interaction network using scale-aware
topological measures. We demonstrate that based on these
characterizations it is possible to accurately predict spatial
co-expression between genes in the mouse cortex.
Consistent with previous findings, we find that the
chromatin interaction profile of a gene-pair is a good
predictor of their spatial co-expression. However, the
accuracy of the prediction can be substantially improved
when chromatin interactions are described using scale-
aware topological measures of the multi-resolution
chromatin interaction network. We conclude that, for co-
expression prediction, it is necessary to take into account
different levels of chromatin interactions ranging from
direct interaction between genes (i.e. small-scale) to
chromatin compartment interactions (i.e. large-scale).
In this talk, I will focus on the computational and statistical methods that are required for an insightful overlay of high-resolution conformation maps obtained using Hi-C with ~20,000 cancer-causing retroviral mutations and with expression maps from the Allen Brain Atlas.
FIGURE 1. Circos visualization of the insertion clusters that co-localize
with the Notch1 locus.
REFERENCES
(1) Babaei, S. et al. Nature Communications (2015).
(2) Babaei and Mahfouz et al. PLoS Computational Biology (2015).
Abstract ID: P Poster
P1. KNN-MDR APPROACH FOR DETECTING GENE-GENE
INTERACTIONS
Sinan Abo alchamlat1 & Frédéric Farnir1,*.
Fundamental and Applied Research for Animals & Health (FARAH), Sustainable Animal Production, University of
Liège1.
Recent years have seen the emergence of a wealth of biological information. Facilitated access to genome sequences, along with massive data on gene expression and on proteins, has revolutionized research in many fields of biology. For example, the identification of up to several million SNPs in many species and the development of chips allowing for effective genotyping of these SNPs in large cohorts have triggered the need for statistical models able to identify the effects of individual and of interacting SNPs on phenotypic traits in this new high-dimensional landscape. Our work is a contribution to this field.
INTRODUCTION
Genome-wide association studies (GWAS) have allowed the identification of hundreds of genetic variants associated with complex diseases and traits, and have provided valuable insight into their genetic architecture (Wu M et al., 2010). Nevertheless, most variants identified so far provide relatively little information about the relationship between changes at the genomic level and phenotypes, because of the lack of reproducibility of the findings or because these variants usually explain only a small proportion of the underlying genetic variation (Fang G et al., 2012). This observation, known as the ‘missing heritability’ problem (Manolio T et al., 2009), raises the question: where does the unexplained genetic variation come from? A tentative explanation is that genes do not work in isolation, leading to the idea that sets of genes (or gene networks) could have a major effect on the tested traits while almost no marginal (i.e. individual gene) effect is detectable. Consequently, an important question concerns the exact relationship between the genomic configuration, including the interactions between the involved genes, and the phenotypic expression.
METHODS
To tackle this subject, different statistical methods such as MDR (Multifactor Dimensionality Reduction) have been proposed for detecting gene-gene interactions (Ritchie, D., et al., 2001); their relative performances remain largely unclear, and their extension to situations combining many variants turns out to be challenging. We therefore propose a novel MDR approach using the K-Nearest Neighbors (KNN) methodology (KNN-MDR) as a possible alternative for detecting gene-gene interactions, especially when the number of involved determinants is potentially high. The idea behind our method is to replace the status allocation used in classical MDR methods by a KNN approach: the majority vote occurs among the k nearest neighbors (k being a parameter that must be tuned and depends on the scenario) instead of within the (potentially empty) cell determined by the tested attributes of the individual to be classified. The steps other than classification are identical in both methods (i.e. cross-validation, attribute selection, training and test balanced-accuracy computations, best-model selection procedure).
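The classification step described above can be sketched as follows (a minimal illustration, not the authors' code; genotypes are assumed to be coded 0/1/2 at the SNPs selected as MDR attributes, and distance is a simple genotype mismatch count):

```python
from collections import Counter

def knn_status(train, genotypes, k):
    """KNN replacement for the MDR cell vote: classify an individual by
    the majority case/control status among its k nearest neighbours.
    `train` is a list of (genotype_tuple, status) pairs; status 1 = case."""
    # Distance = number of mismatching genotypes at the selected SNPs
    dist = lambda g: sum(a != b for a, b in zip(g, genotypes))
    neighbours = sorted(train, key=lambda item: dist(item[0]))[:k]
    votes = Counter(status for _, status in neighbours)
    return votes.most_common(1)[0][0]

# Toy data: genotypes coded 0/1/2 at the two SNPs chosen as MDR attributes
train = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0),
         ((2, 2), 1), ((2, 1), 1), ((1, 2), 1)]
```

Unlike the classical MDR vote, this never encounters an empty genotype cell: an individual always has k nearest neighbours, which is the motivation for the approach in sparse high-dimensional settings.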
RESULTS & DISCUSSION
Experimental results on both simulated data and real genome-wide data from the Wellcome Trust Case Control Consortium (WTCCC) (Wellcome Trust Case Control C., 2007) show that KNN-MDR has interesting properties in
terms of accuracy and power, and that, in many cases, it
significantly outperforms its recent competitors.
FIGURE 1. Comparison of the inter-chromosomal interactions detected on the WTCCC dataset by KNN-MDR and by other interaction methods that used this same dataset (Shchetynsky et al., 2015; Zhang et al., 2012).
The results of this study allow us to draw some conclusions about the performance of KNN-MDR: on the one hand, its ability to detect gene-gene interactions is similar to that of MDR for small problems; on the other hand, KNN-MDR has significant advantages for large samples and large numbers of markers (such as in GWAS) when detecting gene effects. KNN-MDR can therefore be seen as a new and more comprehensive method than MDR and other competitors for detecting gene-gene interactions.
REFERENCES
Wu M et al. American Journal of Human Genetics 86, 929-942 (2010).
Fang G et al. PLoS ONE 7, 1932-6203 (2012).
Manolio T et al. Nature 461, 747-753 (2009).
Ritchie, D., et al. Am J Hum Genet,69, 138-147 (2001).
Wellcome Trust Case Control C. Nature, 447(7145):661-678 (2007).
Shchetynsky K et al. Clinical immunology 158(1):19-28 (2015).
Zhang J et al. American Medical Journal 3(1) (2015).
Abstract ID: P Poster
P2. CONSERVATION AND DIVERSITY OF SUGAR-RELATED CATABOLIC
PATHWAYS IN FUNGI
Maria Victoria Aguilar Pontes*, Eline Majoor, Claire Khosravi, Ronald P. de Vries, Miaomiao Zhou
Fungal Physiology, CBS-KNAW Fungal Biodiversity Centre, Utrecht, The Netherlands; Fungal Molecular Physiology,
Utrecht University, The Netherlands.*[email protected], [email protected], [email protected],
[email protected], [email protected]
INTRODUCTION
Plant polysaccharides are among the major substrates for
many fungi. After extracellular degradation, the
monomeric components (mainly monosaccharides) are
taken up by the cells and used as carbon sources to enable
the fungus to grow. This implies that the range of catabolic pathways of a fungus may be correlated with the range of polysaccharides it can degrade. Several carbon catabolic pathways have been studied in different fungi able to grow on plant biomass, such as Aspergillus niger (De Vries, et al., 2012).
In this study we tested this hypothesis by identifying the presence of genes of a number of catabolic pathways in selected fungi from the Ascomycota and the Basidiomycota.
METHODS
A total of 104 fungal genomes were obtained from the JGI fungal program (Grigoriev IV, et al., 2011), the Broad Institute of Harvard and MIT, AspGD (Arnaud, et al., 2012) and NCBI GenBank (Benson, et al., 2012) (data version March 2013).
We identified A. niger genes involved in the individual pathways from the literature. Genome-scale protein ortholog clusters were detected with OrthoMCL (Li, et al., 2003), using inflation factor 1, an E-value cutoff of 1E-3 and a percentage-match cutoff of 60%, as appropriate for the identification of distant homologs (Boekhorst, et al., 2007). The all-vs-all BlastP search required by OrthoMCL was carried out in parallel on a grid of 500 computers. The ortholog clusters were then curated manually using expert knowledge and literature
search. Manual curation was aided by aligning the amino
acid sequences of the hits for each query together with a
suitable outgroup by MAFFT (Katoh, et al., 2009; Katoh,
et al., 2005), after which neighbor joining trees were
generated using MEGA5 with 1000 bootstraps. Genes that
were clearly separated from the query branch in the trees
were removed from the results.
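The final tabulation implied by these methods can be sketched as follows (a hypothetical illustration, not the authors' pipeline: gene and species names are invented, and the criterion chosen here marks a pathway present only when every one of its A. niger query genes has an ortholog in the species):

```python
def pathway_presence(pathways, clusters, species):
    """Tabulate pathway presence/absence per species from curated
    ortholog clusters (gene -> set of species containing an ortholog)."""
    table = {}
    for name, genes in pathways.items():
        table[name] = {
            # present only if all query genes have an ortholog in sp
            sp: all(sp in clusters.get(g, set()) for g in genes)
            for sp in species
        }
    return table

# Hypothetical toy input: two pathways, clusters as gene -> species set
pathways = {"rhamnose": ["gene1", "gene2"], "pentose": ["gene3"]}
clusters = {"gene1": {"A.niger", "T.reesei"},
            "gene2": {"A.niger"},
            "gene3": {"A.niger", "T.reesei", "S.cerevisiae"}}
species = ["A.niger", "T.reesei", "S.cerevisiae"]
table = pathway_presence(pathways, clusters, species)
```

Such a matrix, ordered by taxonomy, makes the clade-level conservation patterns discussed below directly visible.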
RESULTS & DISCUSSION
Patterns of pathway gene presence are conserved among clades. The galacturonic acid and rhamnose pathways are missing in yeasts. The pentose pathway is conserved in Pezizomycetes and Basidiomycota, which explains their ability to grow on pentoses as carbon source (www.fung-growth.org).
These results may indicate that different evolutionary
tracks have led to different metabolic strategies.
The expression of metabolic genes will be evaluated for
those species for which transcriptome data are available.
The results will be compared to growth profiling data of
the species on a set of plant-related poly- and
monosaccharides to determine to which extent the genome
content fits the physiological ability of the species.
ACKNOWLEDGEMENTS
The comparative genomics analysis was carried out on the
Dutch national e-infrastructure with the support of SURF
Foundation (e-infra1300787).
REFERENCES
Arnaud, M.B., et al., Nucleic Acids Res, 40, 653-659 (2012).
Benson, D.A., et al., Nucleic Acids Res, 40, 48-53 (2012).
Boekhorst, J., et al., BMC Bioinformatics, 8, 356-363 (2007).
De Vries, R.P., et al., Pan Stanford Publishing Pte. Ltd, Singapore (2012).
Grigoriev IV, et al., Mycology, 2, 192-209 (2011).
Katoh, K., et al., Methods Mol Biol, 537, 39-64 (2009).
Katoh, K., et al., Nucleic Acids Res, 33, 511-518 (2005).
Li, L., et al., Genome Res, 13, 2178-2189 (2003).
Abstract ID: P Poster
P3. VISUALIZING BIOLOGICAL DATA THROUGH WEB COMPONENTS
USING POLIMERO AND POLIMERO-BIO
Daniel Alcaide1,2*, Ryo Sakai1,2, Raf Winand1,2, Toni Verbeiren1,2, Thomas Moerman1,2, Jansi Thiyagarajan & Jan Aerts.
KU Leuven Department of Electrical Engineering-ESAT, STADIUS, VDA-lab, Belgium1; iMinds Medical IT, Leuven, Belgium2. *[email protected]
Although there are currently several tools for fast prototyping in data visualization, the specifics of the biological domain
often require the development of custom visuals. This leads to the issue that we end up re-implementing the base visuals
over and over if we want to build them into a specific analysis tool. This work presents a proof-of-principle library for
creating composable linked data visualizations, including an initial collection of parsers and visuals with an emphasis on
biology. With Polimero and Polimero-bio, we want to create a library to build scalable domain-specific visual data
exploration tools using a collection of D3-based reusable web components.
INTRODUCTION
As a visual data analysis lab, we often combine
(brush/link) well-known data visualization techniques
(scatterplots, barcharts, etc.). Although it is possible to use general-purpose tools like Tableau or Excel, the particular needs of the biological field usually demand custom data visualizations that are not included in these commercial solutions (Figure 1).
These visuals currently need to be re-implemented for each new tool created. The solution presented here is an alternative that allows creating composable linked data visualizations.
FIGURE 1. Klaudia-plot - Visualization created with Polimero that shows
the read pairs mapped around a deletion in the NA12878 genome on
chromosome 20.
METHODS
Polimero is a library that uses the Polymer implementation (www.polymer-project.org) for creating visual web components.
Web components are an emerging W3C standard for
extending the HTML platform to create web-based apps.
This new technology includes custom elements, HTML
templates, shadow DOM, and HTML imports (Figure 2).
The D3-based custom elements that Polimero and Polimero-bio offer allow us to create a scalable
framework for building domain-specific visual data
exploration tools.
Leveraging the web components concepts, the main
characteristics of Polimero library are:
Modular: Each element is an independent module
that has a specific purpose (data, visualization,
computation)
Composable: The elements can be combined to set up new functionalities (linking, filtering, reading different data sources)
Encapsulated: Web components aim to provide the user with a simple element interface, so that the user does not have to deal with the underlying code.
Reusable: The same element can be used in the
same project for different objectives.
Linkable: Polimero elements can speak to each
other, allowing the use of events for brushing and
linking.
Embeddable: The elements can be added to any existing framework that uses HTML (e.g. the IPython notebook).
FIGURE 2. HTML example – Representing Polimero elements to create
visualization.
RESULTS & DISCUSSION
This library makes it possible to create applications that
are composable, encapsulated, and reusable. This is
valuable both for the developer/designer, who can easily create and plug in custom visual encodings, and for the
end-user who can create linked visualizations by dragging
existing components onto a canvas using the Polimero-
designer.
Polimero and Polimero-bio are still in development but
they are available at www.bitbucket.org/vda-lab/polimero.
Abstract ID: P Poster
P4. DISEASE-SPECIFIC NETWORK CONSTRUCTION BY SEED-AND-EXTEND
Ganna Androsova1*, Reinhard Schneider1 & Roland Krause1.
Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Belval, Luxembourg 1.
INTRODUCTION
Molecular interaction networks are dense structures of
protein interactions, from which we would like to extract
relevant sub-networks specific to the disease of interest.
Such a disease-specific network is often constructed by the
seed-and-extend algorithm, which extracts the relevant
genes from an organism-wide, weighted interaction
network, typically as its first neighbourhood. Seed-and-extend is suitable when disease biomarkers are poorly investigated and knowledge about their interaction partners is missing, or when the interacting partners are established but the connections between them are missing.
Our syndrome of interest is the postoperative cognitive
impairment frequently experienced by elderly patients,
characterized by progressive cognitive and sensory decline.
The acute phase of cognitive impairment is postoperative
delirium (POD). The underlying pathophysiological
mechanisms have not been studied in depth due to the multifactorial pathogenesis of this postoperative cognitive
impairment. The known POD-related genes can be
integrated into the draft network for exploration on a
systems level.
Here, we investigate how stable the results of such
analysis are when the input set of seed genes is varied, and
what is the role of stringency in the initial selection of the
networks. Ideally, we would like to find the “sweet spot”
that provides a biologically meaningful trade-off between
false-positives and -negatives to be used for such analyses.
METHODS
The list of disease-related genes/proteins was retrieved
from literature studies in the PubMed database.
We extended the seed list with directly linked interactors
by seed-and-extend from protein-protein interaction
network databases. We extracted all interactions between
seeds and connected neighbours, which resulted in the
first-degree network.
Next, we evaluated the biological enrichment of the extracted network and its topological parameters, assessed its overlap with other diseases, and clustered the network into smaller sub-networks.
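The seed-and-extend extraction described above can be sketched with plain sets (a minimal illustration, not the authors' code; the protein names in the toy interactome are hypothetical):

```python
def seed_and_extend(edges, seeds):
    """Extract the first-degree network: the seeds, their direct
    interactors, and every edge with both endpoints in that set."""
    nodes = set(seeds)
    for a, b in edges:
        if a in seeds:
            nodes.add(b)
        if b in seeds:
            nodes.add(a)
    # Induced subgraph on seeds plus first neighbours
    sub = [(a, b) for a, b in edges if a in nodes and b in nodes]
    return nodes, sub

# Hypothetical interactome with two seed proteins
edges = [("GCR", "HSP90"), ("HSP90", "P53"), ("P53", "MDM2"), ("ACTB", "MYH9")]
nodes, subnet = seed_and_extend(edges, {"GCR", "P53"})
```

Varying the `seeds` set, as done in the stability analysis above, directly changes which neighbours and edges survive into the extracted network.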
RESULTS & DISCUSSION
The POD network (Figure 1) has a scale-free degree distribution and consists of 541 proteins with 5,242 interactions between them.
FIGURE 1. Postoperative delirium molecular network.
The network was evaluated topologically by degree
assortativity, density, shortest path, eccentricity and other
measures. Pathway enrichment analysis identified glucocorticoid receptor signalling, immune response, and dopamine signalling as relevant to POD (Figure 2).
FIGURE 2. Postoperative delirium pathway enrichment analysis.
The top 5 hub proteins included UBC_HUMAN, GCR_HUMAN, P53_HUMAN, HS90A_HUMAN and EGFR_HUMAN. The appearance of p53 and other very frequent genes among the top 5 hubs in our study, but also in several others, motivated us to investigate their relevance to the disease and to question possible data bias. We compare how the size, specificity and completeness of the input seed list affect the resulting network and the retrieval of other disease-related proteins.
Abstract ID: P Poster
P5. BIG DATA SOLUTIONS FOR VARIANT DISCOVERY FROM LOW
COVERAGE SEQUENCING DATA, BY INTEGRATION OF HADOOP, HBASE
AND HIVE
Amin Ardeshirdavani1*, Erika Souche2, Martijn Oldenhof3 & Yves Moreau1.
KU Leuven ESAT-STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics1; KU Leuven Department of Human Genetics2; KU Leuven Facilities for Research3. *[email protected]
Next Generation Sequencing (NGS) technologies allow the sequencing of whole human genomes and thereby, among other applications, the efficient study of human genetic disorders. However, the resulting flood of sequencing data requires high computational power and an optimized programming structure for its analysis. Many researchers use scale-out networks to approximate a supercomputer. In many use cases Apache Hadoop and HBase have been used to coordinate distributed computation and to act as a storage platform, respectively. However, scale-out networks have rarely been used to handle gene variation data from NGS, except for sequencing read assembly. In this study, we propose a Big Data solution that integrates Apache Hadoop, HBase and Hive to efficiently analyze NGS output such as VCF files.
INTRODUCTION
The goal of this project is to overcome the gap between massive NGS data volumes and limited data-processing capability. We propose a data processing and storage model specifically for NGS data and develop an application based on this model to test whether processing capability is substantially increased. The target users of this application are researchers with intermediate-level computer skills. The new model should be scalable, fault-tolerant and highly available. The data import procedure should be fast and occupy as little storage volume as possible. It should also make querying the data fast, including from remote locations. To achieve these demands, three open-source projects (Apache Hadoop, HBase and Hive) are integrated as the backbone, and on top of them an application with a user-friendly interface is developed to make this integration more straightforward.
METHODS
Generally, Hadoop provides distributed MapReduce data processing, HBase serves as the platform for storing complex structured data, and Hive retrieves data from HBase using Structured Query Language (SQL) syntax. Although Hadoop and HBase have become popular recently, the combination of Hadoop, HBase and Hive has rarely been implemented in the bioinformatics field.
Here we mainly discuss gene variation data analysis, so the application development focuses on parsing and storing VCF (Variant Call Format) files. The application is designed to dynamically adapt to the VCF file structures produced by different variant callers. For example, UnifiedGenotyper calls SNPs and InDels separately, considering each variant to be independent, whereas HaplotypeCaller calls variants using local assembly. For gene variation analysis, the VCF files of different samples need to be queried, and it should be possible to export the results for further use. A VCF file for a sample or a group of samples is typically large, so processing efficiency is crucial.
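The parsing step can be sketched as follows (a minimal illustration, not the H3 VCF code; it flattens one VCF data line into a record that could be loaded into an HBase/Hive table, and ignores header handling):

```python
def parse_vcf_line(line, sample_names):
    """Parse one VCF data line into a flat record (minimal sketch)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags
    info_dict = dict(kv.split("=", 1) if "=" in kv else (kv, True)
                     for kv in info.split(";"))
    record = {"chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
              "alt": alt.split(","), "qual": float(qual), "filter": flt,
              "info": info_dict}
    if len(fields) > 9:  # per-sample genotype columns, if present
        keys = fields[8].split(":")
        record["samples"] = {name: dict(zip(keys, col.split(":")))
                             for name, col in zip(sample_names, fields[9:])}
    return record

line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tDP=14;AF=0.5\tGT:DP\t0/1:14"
rec = parse_vcf_line(line, ["NA00001"])
```

Because the INFO and FORMAT keys are parsed dynamically rather than hard-coded, the same routine accommodates the differing column layouts of callers such as UnifiedGenotyper and HaplotypeCaller.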
The model we have chosen is the integration of Hadoop, HBase and Hive: Hadoop is used for data processing, HBase for storage and Hive for querying. Since all of these projects need a distributed cluster for optimal performance, it is crucial to choose a suitable architecture for our application. The cluster is the major processing and storage platform, while a single server outside the cluster acts as a client for users. Our application can connect remotely to the Hive server on behalf of researchers.
RESULTS & DISCUSSION
Our tests clearly show that the Apache integration performs much better than an SQL model when dealing with large VCF files, and its performance remains acceptable for small VCF files. We therefore conclude that the Apache integration is a good solution for this kind of file management. Our newly developed application, H3 VCF, offers a user-friendly interface so that users without advanced IT knowledge can conveniently use the integration to handle VCF files. Users can either build their own local computer cluster or use Amazon EMR to easily create a cluster with the Apache projects for a few dollars.
Abstract ID: P Poster
P6. ENTEROCOCCUS FAECIUM GENOME DYNAMICS DURING
LONG-TERM PATIENT GUT COLONIZATION
Jumamurat R. Bayjanov1*, Jery Baan1, Mark de Been1, Mick Watson2 & Willem van Schaik1.
Department of Medical Microbiology, University Medical Center Utrecht, Utrecht, The Netherlands1; Edinburgh
Genomics, The University of Edinburgh, Edinburgh, Scotland2.
Enterococcus faecium, a recently evolved multi-drug resistant nosocomial pathogen, is able to rapidly colonize the human gut. Previous work on animal, healthy-human and clinical E. faecium strains has shown that clinical isolates form a distinct lineage. However, these studies lack a detailed niche-specific and longitudinal analysis of the evolutionary dynamics of this organism. Here we present a longitudinal analysis of the within-host evolutionary dynamics of E. faecium gut isolates sampled from five patients over a period of 8 years. Whole-genome sequencing showed that the rapid diversification of E. faecium clones in the patient gut is mainly driven by recombination and phages. This high diversification allows E. faecium clones to acquire new genes, including antibiotic resistance genes, which enables this bacterium to rapidly colonize hostile environments.
INTRODUCTION
In recent decades, Enterococcus faecium, normally a
harmless gut commensal, has emerged as an important
multi-drug resistant nosocomial pathogen. Previous work
has shown that clinical isolates of E. faecium form a sub-
population that is distinct from strains isolated from
animals and healthy humans (Lebreton et al., 2013). We
used whole-genome sequencing to characterize how
clinical E. faecium strains evolve during long-term patient
gut colonization.
METHODS
The genomes of 96 E. faecium gut isolates, obtained over
8 years from 5 different patients, were sequenced using
Illumina HiSeq 2x100bp paired-end sequencing. Quality
filtering of sequence reads was performed using Nesoni
(version 0.117) (Nesoni, 2014) and high-quality reads
were assembled into contiguous sequences using Spades
assembler (version 3.1.0) (Bankevich et al., 2012).
Subsequently, assembled sequences were annotated using
Prokka (v 1.10) (Seeman T, 2014). In addition to these 96
genomes, we also included publicly available genome
sequences of 70 E. faecium strains, which were
downloaded from NCBI Genbank database. In the set of
166 strains, orthology between genes were identified using
orthAgogue (Ekseth et al., 2014) and orthologous genes
were clustered into ortholog groups using MCL algorithm
(Enright et al., 2002). Core genome alignments were then
constructed by concatenating core gene sequences and
were filtered for recombinations using Gubbins (Croucher
et al., 2015). Subsequently, recombination-filtered core
genome alignments were used to construct a phylogenetic
tree. In addition to core-genome based analyses, we have
also studied gene gain and loss across time.
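The core-genome construction step above can be sketched as follows. This is a minimal illustration, not the exact orthAgogue/MCL output format: ortholog groups are assumed to be given as {group: {strain: aligned sequence}} dictionaries, and "core" is taken to mean one sequence per strain in every isolate.

```python
# Sketch of core-genome concatenation from per-gene alignments.
# The toy ortholog groups below stand in for real orthAgogue/MCL output.

def core_genes(ortholog_groups, strains):
    """Keep only groups with a sequence for every strain (the core genome)."""
    return {name: seqs for name, seqs in ortholog_groups.items()
            if set(seqs) == set(strains)}

def concatenate_core(ortholog_groups, strains):
    """Concatenate core gene sequences per strain in a fixed gene order."""
    core = core_genes(ortholog_groups, strains)
    order = sorted(core)
    return {s: "".join(core[g][s] for g in order) for s in strains}

strains = ["s1", "s2", "s3"]
groups = {
    "geneA": {"s1": "ATG", "s2": "ATG", "s3": "ATA"},
    "geneB": {"s1": "CCT", "s2": "CCT", "s3": "CCT"},
    "geneC": {"s1": "GGG", "s2": "GGG"},  # absent in s3: not core
}
alignment = concatenate_core(groups, strains)
print(alignment["s3"])  # ATACCT
```

The concatenated per-strain sequences form the core genome alignment that is then passed to Gubbins for recombination filtering.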
RESULTS & DISCUSSION
As expected, all 96 isolates grouped in E. faecium clade A; only one strain clustered in clade A-2, which mainly contains animal isolates. The remaining 95 strains were assigned to clade A-1, which consists almost exclusively of clinical isolates. The phylogenetic tree showed five patient-specific clusters of closely related strains, revealing the microevolution of E. faecium strains during gut colonization. Our data also suggest that direct transfer of strains occurred between patients hospitalized in the same ward.
Additionally, analysis of gene gain and loss across time
showed that loss and gain of prophages is an important
factor in generating genetic diversity during gut
colonization.
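The gain/loss analysis above amounts to comparing the gene content of consecutive isolates from one patient. A minimal sketch, using hypothetical toy gene sets rather than real annotation output:

```python
# Sketch: count gene gain and loss events between consecutive isolates of
# one patient, given each timepoint's gene content as a set of gene names.

def gain_loss(timepoints):
    """(gained, lost) gene sets between each pair of consecutive samples."""
    return [(curr - prev, prev - curr)
            for prev, curr in zip(timepoints, timepoints[1:])]

t0 = {"core1", "core2", "phageA"}
t1 = {"core1", "core2", "phageB"}  # prophage A lost, prophage B gained
events = gain_loss([t0, t1])
print(events[0])  # ({'phageB'}, {'phageA'})
```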
This study highlights the ability of E. faecium clones to
rapidly diversify, which may contribute to the ability of
this bacterium to efficiently colonize new environments
and rapidly acquire antibiotic resistance determinants.
REFERENCES
Lebreton F et al. "Emergence of epidemic multidrug-resistant Enterococcus faecium from animal and commensal strains". MBio 4(4):e00534-13, 2013.
Nesoni. https://github.com/Victorian-Bioinformatics-Consortium/nesoni
Bankevich A et al. "SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing". Journal of Computational Biology 19(5):455-477, 2012.
Seemann T. "Prokka: rapid prokaryotic genome annotation". Bioinformatics 30(14):2068-2069, 2014.
Ekseth OK et al. "orthAgogue: an agile tool for the rapid prediction of orthology relations". Bioinformatics 30(5):734-736, 2014.
Enright AJ et al. "An efficient algorithm for large-scale detection of protein families". Nucleic Acids Res 30(7):1575-1584, 2002.
Croucher NJ et al. "Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins". Nucleic Acids Res 43(3):e15, 2015.
10th Benelux Bioinformatics Conference bbc 2015
51
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: P Poster
P7. XCMS OPTIMISATION IN HIGH-THROUGHPUT LC-MS QC
Charlie Beirnaert1,2*, Matthias Cuykx3, Adrian Covaci3 & Kris Laukens1,2.
Advanced Database Research and Modeling (ADReM), University of Antwerp1; Biomedical Informatics Research Centre Antwerp (biomina)2; Toxicological Centre, University of Antwerp3.
In high-throughput untargeted metabolomics studies, quality control remains a prominent bottleneck. In analogy to a recently developed QC tool for proteomics, our research group aims to develop a QC environment specific to metabolomics. One component of this work is the XCMS analysis software for LC-MS data, which is very sensitive to its input parameters. The presented work deals with the automatic optimisation of the XCMS parameters, building on an existing framework for XCMS optimisation. The additions to this framework are the inclusion of quantified resolution data, obtained from the otherwise ignored profile data, and intelligent use of the isotopic profile of measured compounds.
INTRODUCTION
Metabolomics is the study of small molecules, or metabolites. These metabolites have an enormous chemical diversity and are only now starting to be identified in a high-throughput fashion, thanks to the adoption of high-performance liquid chromatography coupled to mass spectrometry (LC-MS) and nuclear magnetic resonance spectroscopy. However, analysing the resulting large datasets is not trivial; for LC-MS specifically, there are almost more ways of analysing the data than there are researchers. Arguably the most commonly used software platform for the initial analysis is XCMS (Smith et al., 2006). However, the output of XCMS depends strongly on its input parameters. Often the default parameters are used, or they are adapted to the intuition of the researcher, without accounting for the introduction of false positives. Optimization algorithms have been constructed using a dilution series (Eliasson et al., 2012) and using the carbon isotope (Libiseller et al., 2015). In this work, we build further upon the latter by including quantified information from the profile m/z domain (the continuous data in the m/z dimension), where accurate resolutions can be obtained for the mono-isotopic peaks and other isotopes. The developed optimisation can be used both for data analysis and for the quality control framework that is under development.
METHODS
The proposed work uses XCMS to find the peaks of interest in the data. To optimise this process, the results from XCMS are analysed for the occurrence of peaks and their isotopes. In this step, the raw profile data around the peaks identified by XCMS is inspected to quantify the peak resolution and to detect missed isotopes.
Centroid vs Profile data: Modern-day MS specialists use centroid data because the file size is considerably smaller. The mass spectrometer converts the continuous data in the m/z dimension into a collection of spikes: each approximately Gaussian peak is reduced to a single spike (a delta function with the same height as the original peak), and all other data is discarded. The result is a huge reduction in file size, but the peak shape is lost and, as a result, the resolution can no longer be quantified.
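The loss described above can be made concrete with a toy example: from profile data the peak width (FWHM), and hence the resolution m/Δm, is measurable, whereas centroiding keeps only the apex. All values below are illustrative.

```python
# Sketch of centroiding and why it discards resolution information.
import math

def gaussian_profile(mz0, sigma, height, mz_axis):
    """Profile-mode intensities for one Gaussian peak over an m/z axis."""
    return [height * math.exp(-((mz - mz0) ** 2) / (2 * sigma ** 2))
            for mz in mz_axis]

def centroid(mz_axis, intensities):
    """Reduce a profile peak to a single spike at its apex: shape is gone."""
    i = max(range(len(intensities)), key=intensities.__getitem__)
    return mz_axis[i], intensities[i]

def fwhm_resolution(mz0, sigma):
    """Resolution m/dm, with dm the FWHM of a Gaussian peak."""
    fwhm = 2 * math.sqrt(2 * math.log(2)) * sigma
    return mz0 / fwhm

mz_axis = [200.0 + k * 0.001 for k in range(101)]
profile = gaussian_profile(200.05, 0.005, 1e6, mz_axis)
print(centroid(mz_axis, profile))             # apex (m/z, intensity) only
print(fwhm_resolution(200.05, 0.005))         # ~1.7e4, needs profile data
```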
Optimization parameter: The peaks and their isotopes are characterized by a Gaussian in the chromatographic dimension and are spaced 1.0063 Da apart in the m/z dimension. When an isotope is missing, or the extracted peak does not appear in enough samples (for example in 50% of the samples in the sample group), the peak is categorized as "unreliable". When a peak is present in all samples or has a clear isotopic distribution, it is considered "reliable". From these measures a so-called peak picking score can be calculated, which in turn can be optimised by a variety of methods. This results in an increase in reliable peaks without increasing false positives.
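The reliability classification above can be sketched as a small scoring function. The field names, the 50% threshold and the toy peak list are illustrative assumptions, not the actual implementation:

```python
# Sketch of a "peak picking score": the fraction of reliable peaks, to be
# maximised over XCMS parameter settings.

ISOTOPE_SPACING = 1.0063  # Da, spacing in the m/z dimension as stated above

def is_reliable(peak, n_samples, min_fraction=0.5):
    """Reliable if found in enough samples or backed by an isotope partner."""
    found_fraction = peak["n_detected"] / n_samples
    return found_fraction >= min_fraction or peak["has_isotope"]

def peak_picking_score(peaks, n_samples):
    """Fraction of reliable peaks in the peak list."""
    reliable = sum(is_reliable(p, n_samples) for p in peaks)
    return reliable / len(peaks)

peaks = [
    {"mz": 200.05, "n_detected": 6, "has_isotope": True},
    {"mz": 200.05 + ISOTOPE_SPACING, "n_detected": 2, "has_isotope": False},
    {"mz": 350.10, "n_detected": 5, "has_isotope": False},
]
print(peak_picking_score(peaks, n_samples=6))  # 2/3: one peak is unreliable
```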
Analysis & Quality control: The optimisation of the XCMS parameters is useful in the analysis of the data itself, but it is also applicable to quality control for large-scale LC-MS experiments. By quantifying the resolutions of all relevant peaks in a dataset corresponding to a control sample, it is possible to monitor the quality of spectra; combined with other QC frameworks, such as iMonDB (Bittremieux et al., 2015), this makes it possible to assure the quality of all experiments in a long-running study.
RESULTS & DISCUSSION
The aim is to use the profile data to improve the available optimization algorithms. It remains to be seen whether the extra information in this data (compared to centroid data) justifies the increased demand on computational resources. Nonetheless, profile data provides a valuable contribution to LC-MS optimization, because it enables researchers to quantitatively evaluate, and improve, the m/z resolution.
REFERENCES
Smith CA et al. Anal. Chem. 78(3):779-789, 2006.
Eliasson M et al. Anal. Chem. 84(15):6869-6876, 2012.
Libiseller G et al. BMC Bioinformatics 16:118, 2015.
Bittremieux W et al. J. Proteome Res. 14(5):2360-2366, 2015.
P8. IDENTIFICATION OF NUMTS THROUGH NGS DATA
Vincent Branders1,2*, Chedly Kastally2 & Patrick Mardulyn2.
Machine Learning Group, Institute of Information and Communication Technologies, Electronics and Applied Mathematics (ICTEAM), Université catholique de Louvain1; Evolutionary Biology and Ecology, Université libre de Bruxelles2.
Numts are copies of mitochondrial DNA sequences that have been transferred into the nuclear genome. Due to their similarity to mitochondrial DNA sequences, numts have led to many misinterpretations, from overestimation of diversity to a spurious association between cystic fibrosis and mitochondrial genome variation. To avoid the bias induced by numts, these sequences have to be identified. Current methodologies compare existing nuclear and mitochondrial sequences and search for similarities. The new Pacific Biosciences (PacBio) technology generates sequencing reads that span thousands of base pairs, which offers the opportunity to identify numts by looking for reads with regions similar to mitochondrial sequences surrounded by regions highly different from them. This should allow the systematic identification of numts without a complete known nuclear reference.
INTRODUCTION
The transfer of DNA from mitochondria to the nucleus
generates nuclear copies of mitochondrial DNA (numts).
Numts have been found in many species including yeasts,
rodents and plants. Due to their similarity to mitochondrial
DNA, numts are responsible for many misinterpretations,
both in mitochondrial disease studies and phylogenetic
reconstructions (Hazkani-Covo et al., 2010). Numt variation has commonly been misreported as mitochondrial mutations in patients (Yao et al., 2008).
Moreover, DNA barcoding was found to overestimate the
number of species when numts are coamplified (Song et
al., 2008). Current methods identify such sequences by
aligning mitochondrial sequences against the nuclear
genome and identifying similar regions (Figure 1, left).
The PacBio technology allows the sequencing of DNA fragments spanning thousands of base pairs. This read length should allow the identification of numts in species without a complete nuclear reference, such as the insect Gonioctena intermedia. Indeed, it should be
possible to use a mitochondrial assembly to identify
PacBio reads with a central region similar to the
mitochondrial sequence enclosed by nuclear regions that
are dissimilar to it (Figure 1, right).
FIGURE 1. Identification of numts – Existing methods (left) and proposed
method (right). Comparison of mitochondrial sequence to nuclear sequence (left) or long reads (right).
METHODS
The proposed approach aligns PacBio reads to a mitochondrial genome (here, de novo assemblies of PacBio and Illumina HiSeq 2000 reads are used). In these long reads, numts are identified as having one region similar to the mitochondrial genome surrounded by regions that are not similar. We introduce different criteria to distinguish reads that are presumably numts from reads of mitochondrial origin (Figure 2). The DNA sequences come from an insect (Gonioctena intermedia) without a reference genome.
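The read-classification criterion above can be sketched from alignment coordinates alone: a read is a candidate numt when its mitochondrial alignment covers only a central region, leaving substantial unaligned (presumably nuclear) flanks. The flank threshold here is an illustrative assumption:

```python
# Sketch: classify a long read from its mitochondrial alignment coordinates.

def classify_read(read_len, aln_start, aln_end, min_flank=500):
    """Candidate numt if both flanks outside the mitochondrial hit are long."""
    left_flank = aln_start
    right_flank = read_len - aln_end
    if left_flank >= min_flank and right_flank >= min_flank:
        return "potential numt"
    return "mitochondrial"

print(classify_read(10000, 3000, 6000))  # potential numt: long nuclear flanks
print(classify_read(10000, 50, 9980))    # mitochondrial: hit spans the read
```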
FIGURE 2. Mitochondrial reads and numts with nuclear borders.
RESULTS & DISCUSSION
A systematic identification of potential numts is proposed: through alignments, we identified 10 mitochondrial reads and 34 reads containing a potential numt for one particular mitochondrial region (the widely studied cytochrome oxidase I gene). As exploratory research, we highlight the usefulness of Pacific Biosciences data for the identification of numts when no nuclear reference is available: the approach only requires PacBio reads and a mitochondrial assembly. It is more efficient than identifying numts through short reads, which would require the complete reconstruction of both the mitochondrial and nuclear genomes. A systematic identification of numts in non-model organisms should avoid misinterpretations in studies where numts could be sources of bias. Our current distinction between numts and mitochondrial reads is quite simple; a more detailed analysis of this distinction is a perspective for improvement.
REFERENCES
Hazkani-Covo E et al. PLOS Genetics 6:1-11, 2010.
Song H et al. PNAS 105:13486-13491, 2008.
Yao YG et al. Journal of Medical Genetics 45:769-772, 2008.
P9. MICROBIAL SEMANTICS: GENOME-WIDE HIGH-PRECISION NAMING
SCHEMES FOR BACTERIA
Esther Camilo dos Reis, Dolf Michielsen, Hannes Pouseele*.
Applied Maths NV, Keistraat 120, 9830 Sint-Martens-Latem, Belgium.
INTRODUCTION
As next-generation sequencing in general, and whole
genome sequencing (WGS) in particular, is increasingly
adopted in public health for routine surveillance tasks,
there is a clear need to incorporate this new technology in
the day-to-day operational workflow of a public health
institute. As cluster detection based on WGS data is
evolving into a commodity, thanks to technologies such as
whole genome multi-locus sequence typing (wgMLST),
the question remains as to how WGS-based data analysis
can be used to build up a human-friendly but high-
precision and epidemiologically consistent naming
strategy for communication purposes.
METHODS
For various organisms, the use of so-called ‘SNP
addresses’ (based on single nucleotide polymorphisms or
SNPs) has been proposed to build up a hierarchical
naming scheme (see [1], [2]). This idea relies on single
linkage clustering of isolates at different levels of
similarity or distance, hence leading to a hierarchical name.
However, the main difficulty here is to define the
appropriate levels of similarity to cluster on, and the
dependence of the naming scheme on the samples at hand.
Moreover, the SNP approach might not provide the best
type of data for this due to its relatively large volatility.
In this work, we present a mathematical framework to
define the levels of similarity upon which single linkage
clustering makes sense. For this, we model the observed
multimodal distribution of pairwise similarities between
samples to obtain a theoretical model of the similarity
distribution, and from there infer the most likely breaking
points for stable similarity cutoffs. This is done in a data-
independent manner, and is therefore applicable to SNP
data, but also to wgMLST data and even gene presence-
absence data. We assess the stability of the naming
scheme by using a cross-validation approach.
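The naming construction described above can be sketched without the model-fitting step: single-linkage clusters at a distance cutoff are exactly the connected components of the graph linking isolates closer than that cutoff, and stacking several cutoffs yields a hierarchical, SNP-address-style name. The distance matrix and cutoffs below are toy values; in the actual method the cutoffs come from the fitted similarity distribution:

```python
# Sketch: hierarchical names from single-linkage clustering at several cutoffs.

def clusters_at(dist, cutoff):
    """Single-linkage clusters at a cutoff = connected components (union-find)."""
    n = len(dist)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= cutoff:
                parent[find(i)] = find(j)
    roots = {}
    return [roots.setdefault(find(i), len(roots) + 1) for i in range(n)]

def addresses(dist, cutoffs):
    """One dot-separated address per isolate, coarsest level first."""
    levels = [clusters_at(dist, c) for c in sorted(cutoffs, reverse=True)]
    return [".".join(str(lv[i]) for lv in levels) for i in range(len(dist))]

# Toy distances: isolates 0 and 1 are nearly identical, isolate 2 is distant.
D = [[0, 2, 50],
     [2, 0, 50],
     [50, 50, 0]]
print(addresses(D, cutoffs=[5, 25]))  # ['1.1', '1.1', '2.2']
```

Because each level is computed independently, adding a new isolate can only extend, never reshuffle, existing names as long as the cutoffs stay fixed, which is the stability the cross-validation assesses.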
RESULTS & DISCUSSION
We apply our methods to propose a wgMLST-based naming scheme for Listeria monocytogenes. Using a reference dataset capturing the diversity within Listeria monocytogenes and an extensive dataset of over 4000 isolates from real-time surveillance, we show the stability of the naming scheme and its epidemiological concordance.
REFERENCES
[1] Dallman T et al. "Applying phylogenomics to understand the emergence of Shiga toxin-producing Escherichia coli O157:H7 strains causing severe human disease in the United Kingdom". Microbial Genomics. doi: 10.1099/mgen.0.000029.
[2] Coll F et al. "PolyTB: A genomic variation map for Mycobacterium tuberculosis". Tuberculosis (Edinb). 2014 May;94(3):346-354. doi: 10.1016/j.tube.2014.02.005.
P10. FROM SNPS TO PATHWAYS: AN APPROACH TO STRENGTHEN
BIOLOGICAL INTERPRETATION OF GWAS RESULTS
Elisa Cirillo1,*, Michiel Adriaens2 & Chris T Evelo1,2.
1Department of Bioinformatics – BiGCaT, Maastricht University, The Netherlands
2Maastricht Centre for Systems Biology (MaCSBio), Maastricht University, The Netherlands
Pathway and network analysis are established and powerful methods for providing a biological context for a variety of
omics data, including transcriptomics, proteomics and metabolomics. These approaches could in theory also be a boon
for the interpretation of genetic variation data, for instance in the context of Genome Wide Association Studies (GWAS),
as it would allow the study of genetic variants in the context of the biological processes in which the implicated genes
and proteins are involved. However, currently genetic variation data cannot easily be integrated into pathways.
Additionally, it is not clear how to visualise and interpret genetic variation data once connected to pathway content. In
this project we take up that challenge and aim to (i) visualise SNPs from a Type 2 Diabetes Mellitus (T2DM) GWAS
dataset on pathways and (ii) generate and analyze a network of all associated genes and pathways. Together, this could
enable a comprehensive pathway and network interpretation of genetic variations in the context of T2DM.
INTRODUCTION
GWAS has become a common approach for discovery of
gene disease relationships, in particular for complex
diseases like T2DM (Wellcome Trust Case Control Consortium, 2007). However, biological interpretation remains a
challenge, especially when it concerns connecting genetic
findings with known biological processes. We wish to
improve the interpretation of GWAS results, using a
meaningful network representation that links SNPs to
biological processes.
METHODS
We selected a GWAS data set related to T2DM from a meta-GWAS resource for diseases created by Johnson et al. (2009), and extracted 1971 SNPs associated with T2DM.
We identified the location of each SNP using the Variant Effect Predictor (VEP) (http://www.ensembl.org) and classified them into 5 categories (Figure 1): exonic, 3' UTR, 5' UTR, intronic and intergenic. SNPs in the first three categories are easily connected to genes using Ensembl BioMart (http://www.ensembl.org/). Pathways
related with these genes are identified from the curated
collection of WikiPathways (Kutmon et al., 2015). SNPs,
genes and pathways are visualized in networks using
Cytoscape (Shannon et al., 2003).
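The SNP-to-gene-to-pathway assembly described above amounts to building a tripartite edge list that can then be loaded into Cytoscape. A minimal sketch with toy mappings standing in for the VEP/BioMart and WikiPathways lookups:

```python
# Sketch: assemble SNP -> gene and gene -> pathway edges for a network view.

def build_network(snp_to_gene, gene_to_pathways):
    """Return a sorted edge list for a tripartite SNP-gene-pathway network."""
    edges = []
    for snp, gene in snp_to_gene.items():
        edges.append((snp, gene))
        for pw in gene_to_pathways.get(gene, []):
            edges.append((gene, pw))
    return sorted(set(edges))

# Illustrative entries; real mappings come from VEP/BioMart and WikiPathways.
snp_to_gene = {"rs7903146": "TCF7L2", "rs5219": "KCNJ11"}
gene_to_pathways = {"TCF7L2": ["Wnt signaling"], "KCNJ11": []}
print(build_network(snp_to_gene, gene_to_pathways))
```

Genes with an empty pathway list, like KCNJ11 above, illustrate the observation in the Results that disease genes are not always covered by pathway content.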
RESULTS & DISCUSSION
We analysed four gene-related SNP categories: 3' and 5' UTR, intronic and exonic. The exonic category was divided into 8 SNP sub-categories based on sequence interpretation: up- and downstream, splice region, synonymous, missense, stop/gain, transcription factor binding, and non-coding transcript. For each of the 11 resulting categories we created a SNP-disease gene-pathway network. Disease-related genes are not always included in pathways, and this is also the case for disease genes in which GWAS SNPs were found. For the SNPs related to genes in pathways, we performed a pathway gene set enrichment analysis and evaluated whether the resulting pathways were already known to be related to T2DM.
SNPs in intergenic regions need to be analysed and visualized differently. A possible approach is to use expression quantitative trait locus (eQTL) data, which relate SNPs in intergenic regions to the distal modulation of gene expression. Such datasets are available for many different human tissues and can provide additional regulatory information for pathways and the genes they comprise.
FIGURE 1. Pie chart of the 5 SNP categories. The total number of SNPs is 2767.
REFERENCES
Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447(7145):661-78.
Johnson A, O'Donnell C. An Open Access Database of Genome-wide Association Results. BMC Medical Genetics. 2009;10(1):6.
Kutmon M, Riutta A, Nunes N, Hanspers K, Willighagen E, Bohler A, Mélius J, Waagmeester A, Sinha S, Miller R, Coort S, Cirillo E, Smeets B, Evelo C, Pico A. WikiPathways: Capturing the Full Diversity of Pathway Knowledge. Nucleic Acids Res, Database issue, 2016.
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Research. 2003;13(11):2498-504.
P11. IDENTIFICATION OF TRANSCRIPTION FACTOR CO-ASSOCIATIONS
IN SETS OF FUNCTIONALLY RELATED GENES
Pieter De Bleser1,2,4*, Arne Soetens1,2,4 & Yvan Saeys1,3,4.
VIB Inflammation Research Center1; Department of Biomedical Molecular Biology2; Department of Respiratory Medicine3; Ghent University4.
Co-associations between transcription factors (TFs) have been studied genome-wide and resulted in the identification of
frequently co-associated pairs of TFs. Co-association of TFs at distinct binding sites is contextual: different combinations
of TFs co-associate at different genomic locations, producing a condition-dependent gene expression profile for a cell.
Here, we present a novel method to identify these condition-dependent co-associations of TFs in sets of functionally
related genes.
INTRODUCTION
The functional expression of genes is achieved by
particular interactions of regulatory transcription factors
(TFs) operating at specific DNA binding sites of their
target genes. Dissecting the specific co-associations of TFs
that bind each target gene represents a difficult challenge.
Co-associations of transcription factor pairs have been
studied genome-wide and resulted in the identification of
frequently co-associated pairs of TFs (ENCODE Project
Consortium, 2012). It was found that TFs co-associate in a
context-specific fashion: different combinations of TFs
bind different target sites and the binding of one TF might
influence the preferred binding partners of other TFs. Here,
we present a tool to identify these condition-dependent co-
associations of TFs in sets of functionally related genes
(e.g. metabolic pathways, tissues, sets of TF target genes,
sets of differentially regulated genes).
METHODS
In a first step, we determine the set of regulatory TFs for
each gene (Tang et al., 2011) in the set using the ChIP-Seq
binding data for 237 TFs from the ReMap database
(Griffon et al., 2015). This results in a number of
regulatory ChIP-Seq binding regions per TF per gene, represented as a matrix in which each row corresponds to a gene and each column to one of the TFs. In a
next step, this matrix is used as input to the distance
difference matrix (DDM) algorithm, modified to
accommodate this data. The DDM algorithm is a method
that simultaneously integrates statistical over-representation and co-association of TFs (De Bleser et al.,
2007). The result matrix is subsequently reduced, retaining
only the columns of over-represented and co-associated
TFs. Visualization is done by (1) hierarchical clustering of
the reduced result matrix and reordering of the columns
and (2) conversion of the reduced result matrix into a SIF
(simple interaction file format) file, summarizing the
regulator-regulated relationships between transcription
factors and target genes. This SIF file can be imported into Cytoscape for visualization of the regulatory network.
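The gene-by-TF matrix and its SIF export can be sketched as follows; the counts, gene and TF names below are toy data, and the "binds" relation label is an illustrative assumption rather than the tool's actual output vocabulary:

```python
# Sketch: export a gene-by-TF binding-count matrix as SIF
# (one "regulator <tab> relation <tab> target" line per non-zero entry).

def to_sif(matrix, genes, tfs, relation="binds"):
    """One SIF line per TF with at least one binding region on a gene."""
    lines = []
    for gene, row in zip(genes, matrix):
        for tf, count in zip(tfs, row):
            if count > 0:
                lines.append(f"{tf}\t{relation}\t{gene}")
    return "\n".join(lines)

genes = ["FOXF1", "TBX3"]
tfs = ["EZH2", "SUZ12"]
counts = [[3, 2],   # FOXF1: 3 EZH2 regions, 2 SUZ12 regions
          [1, 0]]   # TBX3: EZH2 only
print(to_sif(counts, genes, tfs))
```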
RESULTS & DISCUSSION
FOXF1, TBX3, GATA6, IRX3, PITX2, DLL1 and
NKX2-5 are experimentally verified target genes of the
EZH2 transcription factor (Grote et al., 2013).
Running the transcription factor co-association analysis
method on this data set results in the clustering solution
plot shown in Figure 1.
The strongest associations between TFs are found between
EZH2, POU5F1, SUZ12 and CTBP2. A secondary cluster
of transcription factor associations is composed of
EOMES, SMAD2+3 and NANOG.
The finding of SUZ12 as a cofactor can be accounted for:
EZH2 and SUZ12 are subunits of Polycomb repressive
complex 2 (PRC2), which is responsible for the repressive
histone 3 lysine 27 trimethylation (H3K27me3) chromatin
modification (Yoo and Hennighausen, 2012). CTBP2 is a
known transcriptional repressor (Turner and Crossley,
2001).
The method has been applied previously for the
identification of TFs associated with both high tissue-
specificity and high gene expression levels (Rincon et al.,
2015). The method will be made available as a web tool.
FIGURE 1. Transcription factor co-associations in the EZH2 data set.
Note the tendency of EZH2 to co-localize with POU5F1, SUZ12 and
CTBP2.
REFERENCES
De Bleser,P. et al. (2007) A distance difference matrix approach to identifying transcription factors that regulate differential gene expression. Genome Biol., 8, R83.
ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements
in the human genome. Nature, 489, 57–74.
Griffon,A. et al. (2015) Integrative analysis of public ChIP-seq experiments reveals
a complex multi-cell regulatory landscape. Nucleic Acids Res., 43, e27.
Grote,P. et al. (2013) The tissue-specific lncRNA Fendrr is an essential regulator of
heart and body wall development in the mouse. Dev. Cell, 24, 206–214.
Rincon,M.Y. et al. (2015) Genome-wide computational analysis reveals
cardiomyocyte-specific transcriptional Cis-regulatory motifs that enable
efficient cardiac gene therapy. Mol. Ther. J. Am. Soc. Gene Ther., 23, 43–52.
Tang,Q. et al. (2011) A comprehensive view of nuclear receptor cancer cistromes.
Cancer Res., 71, 6940–6947.
Turner,J. and Crossley,M. (2001) The CtBP family: enigmatic and enzymatic
transcriptional co-repressors. BioEssays News Rev. Mol. Cell. Dev. Biol., 23,
683–690.
Yoo,K.H. and Hennighausen,L. (2012) EZH2 methyltransferase and H3K27
methylation in breast cancer. Int. J. Biol. Sci., 8, 59–65.
P12. PHENETIC: MULTI-OMICS DATA INTERPRETATION USING INTERACTION NETWORKS
Dries De Maeyer1,2,3*, Bram Weytjens1,2,3, Luc De Raedt4 & Kathleen Marchal2,3.
Centre for Microbial and Plant Genetics, KULeuven1; Department for Information Sciences (INTEC, iMinds), UGent2; Department for Plant Biotechnology and Bioinformatics, UGent3; Department of Computer Science, KULeuven4.
The omics revolution has introduced new challenges when studying interesting phenotypes. High-throughput omics technologies such as next-generation sequencing and microarrays generate large amounts of data. Interpreting the results of these experiments is not trivial due to the data's size and the inherent noise of the underlying technologies. In addition, the omics technologies have produced an ever-expanding body of biological knowledge that has to be taken into account when interpreting new experimental results. Interaction networks in combination with subnetwork inference methods provide a solution to this problem: they mine the current public interactomics knowledge using experimental omics data to better understand the molecular mechanisms driving the phenotypes under study.
INTRODUCTION
Computational methods are becoming essential for analyzing large-scale omics datasets in the light of current knowledge. By representing publicly available interactomics knowledge as interaction networks, subnetwork inference methods can extract the actual molecular mechanisms that drive an interesting phenotype. The PheNetic framework is such a method: it allows interaction networks to be mined with multi-omics datasets. Using this framework, different types of biological applications have been addressed, such as KO-transcriptomics interpretation (De Maeyer, 2013), expression analysis (De Maeyer, 2015) and distinguishing driver from passenger mutations in eQTL experiments (De Maeyer, submitted).
METHODS
Interaction networks provide a flexible representation of
public biological interactomics knowledge. These
networks represent the physical interactions between
genes and their corresponding gene products in the
interactome of the organism under research (Cloots, 2011).
The interaction network integrates different layers of
homogeneous interactomics data, e.g. signalling, protein-
protein, (post)transcriptional and metabolic interactomics
data, into a single heterogeneous network representation.
The PheNetic framework uses interaction networks to find
biologically valid paths which connect (in)activated genes
selected from multi-omics data sets. These paths provide a
biological explanation of how the genes from these data
sets can trigger each other. Finding the best explanations
or paths in the interaction network corresponds to finding
that subnetwork that best explains the observed results and
provides an insight into the molecular mechanisms that
drive the interesting phenotype. Depending on the type of biological application and the provided data, different types of paths can be used to infer the subnetwork, as in KO-transcriptomics interpretation (De Maeyer, 2013), expression analysis (De Maeyer, 2015) and the interpretation of eQTL experiments (De Maeyer, submitted).
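The path-based idea above can be illustrated with a deliberately simplified toy: connect the selected (in)activated genes through the network and keep the union of connecting paths as the subnetwork. PheNetic itself scores probabilistic paths over a weighted, typed network; this sketch uses plain unweighted shortest paths and invented node names:

```python
# Sketch: a subnetwork as the union of shortest paths between seed genes.
from collections import deque

def shortest_path(adj, src, dst):
    """Unweighted shortest path from src to dst via BFS, or None."""
    prev, seen = {}, {src}
    q = deque([src])
    while q:
        node = q.popleft()
        if node == dst:
            path = [node]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return path[::-1]
        for nb in adj.get(node, []):
            if nb not in seen:
                seen.add(nb)
                prev[nb] = node
                q.append(nb)
    return None

def subnetwork(adj, seeds):
    """Union of nodes on shortest paths between every pair of seed genes."""
    nodes = set()
    for i, a in enumerate(seeds):
        for b in seeds[i + 1:]:
            path = shortest_path(adj, a, b)
            if path:
                nodes.update(path)
    return nodes

# Toy interaction network: a mutated gene linked to a differentially
# expressed gene through a regulator (all names hypothetical).
adj = {"mutX": ["regA"], "regA": ["mutX", "geneB"],
       "geneB": ["regA", "geneC"], "geneC": ["geneB"]}
print(subnetwork(adj, ["mutX", "geneC"]))  # all four nodes on the path
```

Intermediate nodes pulled in this way ("regA", "geneB" above) are the mechanistic explanation the method reports: genes not in the input data that connect the observations.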
RESULTS & DISCUSSION
In a first setup PheNetic was used to study the pathways
and processes involved in acid resistance in Escherichia
coli (De Maeyer, 2013). Using our framework we were
able to determine the different molecular pathways that
drive acid resistance and identify the regulators that
underlie this phenotype. It was shown that subnetwork inference methods outperform naïve gene rankings in identifying the biological pathways associated with the phenotype under research.
In a second setup PheNetic was used to interpret expression data (De Maeyer, 2015) and to extract those parts of the interaction network that show differences in expression. This method is provided as a web server, accessible at http://bioinformatics.intec.ugent.be/phenetic, that allows an intuitive and visual interpretation of the inferred subnetworks.
In a third setup PheNetic was used to separate driver mutations from passenger mutations in coupled genetic-transcriptomics data sets from evolution experiments (De Maeyer, submitted). Evolved strains with the same phenotype are expected to show consistent changes in the same pathways. Therefore, finding the subnetwork that best connects the mutations to the differentially expressed genes over all strains is expected to single out driver mutations from passenger mutations, while also identifying the molecular mechanisms that induce the observed change in phenotype. This approach provides a systemic insight into both the biological processes and the genetic background that induce the phenotype.
Based on the different approaches it can be concluded that
PheNetic is a flexible framework for subnetwork selection
that allows for solving a large variety of biological
applications using multi-omics data sets.
REFERENCES
Cloots L & Marchal K (2011). Curr Opin Microbiol, 14(5), 599-607.
De Maeyer D, Renkens J, Cloots L, De Raedt L & Marchal K (2013). Mol Biosyst, 9(7), 1594-1603.
De Maeyer D, Weytjens B, Renkens J, De Raedt L & Marchal K (2015). Nucleic Acids Res, 43(W1), W244-250.
De Maeyer D, Weytjens B, De Raedt L & Marchal K. Molecular Biology and Evolution. Submitted.
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: P Poster
P13. THE ROLE OF HLA ALLELES UNDERLYING CYTOMEGALOVIRUS
SUSCEPTIBILITY IN ALLOGENEIC TRANSPLANT POPULATIONS
Nicolas De Neuter1,2*, Benson Ogunjimi3, Anke Verlinden4, Kris Laukens1,2 & Pieter Meysman1,2.
Advanced Database Research and Modeling (ADReM), University of Antwerp1; Biomedical informatics research center
Antwerpen (biomina)2; Centre for Health Economics Research and Modeling Infectious Diseases (CHERMID), Vaccine
and Infectious Disease Institute, University of Antwerp3; Antwerp University Hospital4.
In this study, we aim to characterize those HLA alleles that increase or decrease the risk of cytomegalovirus infections
following tissue or organ transplants. This HLA-dependent susceptibility will then be explained using state-of-the-art
HLA peptide affinity prediction methods to identify the underlying molecular mechanism. This insight can greatly
aid the prediction of which transplantation patients are most at risk of cytomegalovirus infection.
INTRODUCTION
Patients suffering from disorders of the hematopoietic
system or with chemo-, radio-, or immuno- sensitive
malignancies such as leukemia often receive
hematopoietic stem cell transplantation therapy (HSCT).
The transplantation is preceded by a conditioning regimen
that eradicates the recipient’s malignant cell population
through intensive chemotherapy and irradiation,
simultaneously ablating the recipient’s bone marrow. Self
(autologous) or non-self (allogeneic) hematopoietic stem
cells are then reintroduced into the recipient after which
they are allowed to reestablish hematopoietic functions.
HSCT is associated with high morbidity and mortality and
requires careful monitoring of patients during the weeks
following transplantation. Opportunistic cytomegalovirus
(CMV) infections are one of the major causes of this high
morbidity and mortality and can occur in up to 80% of
HSCT patients, depending on the use of prophylactic
treatment or pre-emptive therapy and the serological CMV
status of donor and recipient. CMV disease can manifest
itself as life-threatening pneumonia, gastrointestinal
disease, retinitis, encephalitis or hepatitis.
The relevance of HLA alleles in varicella zoster virus
associated disease has recently been demonstrated by our
group (Meysman et al., 2015) and similar insights might
be gained in CMV related disease. Several studies have
already shown a correlation between the incidence of
CMV infection and the presence of certain human
leukocyte antigen (HLA) alleles in the transplant
recipient. However, the alleles identified in previous
studies are highly inconsistent, likely due to small sample
sizes and type I errors arising from multiple testing.
METHODS
Anonymized patient records on the HLA alleles, CMV
infection and serological status of 1284 transplant
recipients were collected from the Antwerp University
Hospital (UZA). This data set was further extended with
publicly available HLA data from transplant patients, and
the counts for the HLA alleles present at each locus were
combined. A hypergeometric distribution was used to test
HLA loci (A, B, C, DRB1, DQB1 and DPB1) for
statistical over- or underrepresentation of their respective
alleles. HLA alleles were tested for over- or
underrepresentation in two test populations: recipients
who were seropositive for CMV before transplantation
and recipients who developed a CMV infection post-
transplantation. In the latter case, we also examined
whether donor seropositivity had an influence on the
CMV infection status. The P value cutoff used was 0.05,
Bonferroni-corrected for multiple testing, in this case the
number of alleles tested per locus.
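The allele test amounts to a hypergeometric tail probability compared against a Bonferroni-adjusted threshold. A minimal standard-library sketch follows; all counts and allele names are invented for the example and are not the UZA data.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) for a hypergeometric draw: N allele observations in
    total, K copies of the tested allele, n observations drawn (the
    test population)."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Hypothetical counts at one locus: total allele observations, copies
# of each allele overall, and copies within the CMV+ test population.
N, n = 2000, 400
alleles = {"A*01:01": (180, 60), "A*02:01": (500, 95), "A*03:01": (120, 20)}

alpha = 0.05 / len(alleles)            # Bonferroni: alleles tested per locus
for name, (K, k) in alleles.items():
    p_over = hypergeom_sf(k, N, K, n)  # overrepresentation tail
    flag = "enriched" if p_over < alpha else "n.s."
    print(f"{name}: P(over) = {p_over:.2e} ({flag})")
```

Underrepresentation would use the lower tail in the same way; `scipy.stats.hypergeom` offers the same quantities for larger analyses.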
Putative nonameric peptides were generated in silico from
CMV protein sequences available in online protein
sequence repositories such as the UniProt Knowledgebase.
Three complementary methods were employed to predict
the affinity of each putative nonameric peptide to the
significantly enriched or depleted HLA alleles. The
methods used were: NetCTLpan, the stabilized matrix
method (SMM) and an in-house-developed approach
called CRFMHC. Peptide-binding affinity results of each
predictor were normalized against the affinity of a
restricted panel of human proteins and used to compare
results between predictors. Additionally, each CMV
protein was assessed for depletion of high-affinity
peptides using a hypergeometric distribution.
RESULTS
Preliminary results on a small portion of the UZA data
reveal HLA alleles underlying either CMV seropositivity
or CMV infection with a trend towards significance, but
these do not reach the Bonferroni-corrected threshold. We
the additional data to increase the power of the analysis.
REFERENCES
Meysman, P. et al. (2015) Varicella-Zoster Virus-Derived Major
Histocompatibility Complex Class I-Restricted Peptide Affinity Is
a Determining Factor in the HLA Risk Profile for the
Development of Postherpetic Neuralgia. J. Virol., 89, 962-969.
P14. NOVOPLASTY: IN SILICO ASSEMBLY OF PLASTID GENOMES FROM
WHOLE GENOME NGS DATA
Nicolas Dierckxsens1,2*, Olivier Hardy2, Ludwig Triest3, Patrick Mardulyn2 & Guillaume Smits1,4.
Interuniversity Institute of Bioinformatics Brussels (IB2), ULB-VUB, Triomflaan CP 263, 1050 Brussels, Belgium1;
Evolutionary Biology and Ecology Unit, CP 160/12, Faculté des Sciences, Université Libre de Bruxelles, Av. F. D.
Roosevelt 50, B-1050 Brussels, Belgium2; Plant Biology and Nature Management, Vrije Universiteit Brussel, Brussels,
Belgium3; Department of Paediatrics, Hôpital Universitaire des Enfants Reine Fabiola (HUDERF), Université Libre de
Bruxelles (ULB), Brussels, Belgium4.
Thanks to advances in next-generation sequencing (NGS) technology, whole genome data can be readily obtained
from a variety of samples. Many algorithms are available to assemble these reads, but few focus on assembling
plastid genomes. We therefore developed a new algorithm that assembles only the plastid genomes from whole
genome data, starting from a single seed. The algorithm takes full advantage of very high coverage, which even
enables assembly through problematic (AT-rich) regions. It has been tested on several whole genome Illumina
datasets and outperformed other assemblers in runtime and specificity. Every assembly resulted in a single contig for
each chloroplast or mitochondrial genome, always within a timeframe of 30 minutes.
INTRODUCTION
Chloroplasts and mitochondria are both responsible for
generating metabolic energy within eukaryotic cells. Both
plastids are maternally inherited and have a persistent gene
organization, which makes them ideal for phylogenetic
studies or as a barcode in plant and food identification
(Brozynska et al., 2014). But assembling these plastid
genomes is not always straightforward with the currently
available tools. We therefore developed a new algorithm,
specifically for the assembly of plastid genomes from
whole genome data.
METHODS
The algorithm is written in Perl. All assemblies were
executed on an Intel Xeon machine with 24 cores at
2.93 GHz and a total of 96.8 GB of RAM. All non-human
samples were sequenced on the Illumina HiSeq platform
(101 bp paired-end reads). The human mitochondrial
samples (PCR-free) were sequenced on the Illumina
HiSeq X platform (150 bp paired-end reads). The
Gonioctena intermedia sample was also sequenced on the
PacBio platform.
RESULTS & DISCUSSION
Algorithm. The algorithm is similar to string overlap
algorithms like SSAKE (Warren et al., 2007) and VCAKE
(Jeck et al., 2007). It starts by reading the sequences into a
hash table, which enables fast access. The assembly is
initiated by a seed that is then extended bidirectionally in
iterations. The seed input is quite flexible: it can be a
single sequence read, a conserved gene or even a complete
mitochondrial genome from a distant species. Every base
extension is determined by a consensus between the
overlapping reads. Unlike most assemblers, NOVOPlasty
does not try to assemble every read, but extends the given
seed until the circular plastid genome is formed.
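The seed-extension loop can be illustrated with a toy sketch. This is not NOVOPlasty's Perl code: read indexing and consensus are reduced to plain substring matching over a made-up 30 bp "genome", and only rightward extension is shown.

```python
from collections import Counter

def extend_seed(seed, reads, overlap=8, max_steps=200):
    """Greedily extend a seed to the right: at each step, find reads
    whose sequence contains the current contig tail and take the
    consensus of the bases they propose next."""
    contig = seed
    for _ in range(max_steps):
        tail = contig[-overlap:]
        votes = Counter()
        for r in reads:
            i = r.find(tail)
            if i != -1 and i + overlap < len(r):
                votes[r[i + overlap]] += 1   # base proposed after the overlap
        if not votes:
            break                            # no read extends the contig
        contig += votes.most_common(1)[0][0]
    return contig

# Toy example: reads tile a 30 bp "genome"; a 10 bp seed is extended.
genome = "ATGCGTACGTTAGCCGATCGATTACGGCAT"
reads = [genome[i:i + 15] for i in range(0, 16)]
print(extend_seed(genome[:10], reads))   # reconstructs the full toy genome
```

A real implementation hashes fixed-length read prefixes for speed and checks for circularity (the contig ends overlapping its own start) to terminate.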
Assemblies. NOVOPlasty has so far been tested on the
assembly of 8 chloroplasts and 6 mitochondria. Since
chloroplasts contain an inverted repeat, two versions of the
assembly are generated. They differ only in the orientation
of the region between the two repeats; the correct one has
to be resolved manually. Except for the mitochondrion of
the leaf beetle Gonioctena intermedia, all assemblies
resulted in a complete circular genome. A comparative
study of four assemblers for the mitochondrial genome of
G. intermedia clearly shows the speed and specificity of
NOVOPlasty (Table 1).
                          NOVOPlasty    MIRA    MITObim     ARC
Duration (min)                    12     536      4777*     586
Memory (GB)                       15    57.6       63.4     1.9
Storage (GB)                       0     144        418      12
Total contigs                      1    3434       2221    2502
Mitochondrial contigs              1       1          4      48
Coverage (%)                      98      94         94      84
Mismatches                        10      25         26       2
Unidentified nucleotides          43     194        197       0
TABLE 1. Benchmarking results for four assemblies of the
mitochondrial genome of Gonioctena intermedia. The assemblies were
constructed with NOVOPlasty, MIRA (Chevreux et al., 1999), MITObim
(Hahn et al., 2013) and ARC (Hunter et al., 2015). *Manually terminated.
Discussion. Despite the many available assemblers, many
researchers still struggle to find a good assembler for
plastid genomes. NOVOPlasty offers an assembler
specifically designed for plastids that delivers the
complete genome within 30 minutes. The algorithm will
be tested on more datasets, and a comparative study with
other assemblers is in progress.
REFERENCES
Brozynska et al. PLoS One 9 (2014).
Chevreux et al. Computer Science and Biology: Proceedings of the
German Conference on Bioinformatics (GCB) (1999).
Hahn et al. Nucleic Acids Research, 1-9 (2013).
Hunter et al. http://dx.doi.org/10.1101/014662 (2015).
Jeck et al. Bioinformatics 23, 2942-2944 (2007).
Warren et al. Bioinformatics 23, 500-501 (2007).
P15. ENANOMAPPER - ONTOLOGY, DATABASE AND TOOLS FOR
NANOMATERIAL SAFETY EVALUATION
Friederike Ehrhart1, Linda Rieswijk1, Chris T. Evelo1, Haralambos Sarimveis2, Philip Doganis2, Georgios Drakakis2,
Bengt Fadeel3, Barry Hardy4, Janna Hastings5, Christoph Helma6, Nina Jeliazkova7, Vedrin Jeliazkov7,
Pekka Kohonen8,9, Roland Grafström9, Pantelis Sopasakis10, Georgia Tsiliki2 & Egon Willighagen1.
Department of Bioinformatics - BiGCaT, Maastricht University1; National Technical University of Athens2; Karolinska
Institutet3; Douglas Connect4; European Molecular Biology Laboratory – European Bioinformatics Institute5; In silico
toxicology6; Ideaconsult Ltd.7; VTT Technical Research Centre of Finland8; Misvik Biology9; IMT Institute for Advanced
Studies10.
eNanoMapper is an open computational infrastructure for engineered nanomaterial data: it comprises a semantic-web-
supported database, an ontology, user applications for the upload and download of experimental data, and tools for
modelling.
INTRODUCTION
Nanomaterials are defined by size: between 1 nm and 100
nm in at least one dimension. The properties of these
materials do not always resemble those of the bulk
material, i.e. micro-sized and bigger particles, or solutions.
Depending on their size and surface properties,
nanomaterials can differ in reactivity and in toxicity to
biological organisms and ecosystems, and there is the
possibility of "leakage" of the material they are made of.
That is why it is so difficult to assess the safety of
nanomaterials, and why the NanoSafety Cluster defined
the need for a new computational infrastructure in 2012.
eNanoMapper is a European project with partners from
eight European countries. The project has been developing
a computational infrastructure consisting of a semantic
web assisted database, a modular ontology, and tools to
use them for nanomaterial safety assessment. Data sharing,
data storage, data analysis tools, and web services are at
various stages of development, testing, and production
use. The project website can be found at
www.enanomapper.net.
PROBLEM
The eNanoMapper platform is designed to support hosting
of data on nanomaterial properties relevant for nanosafety
assessment as found in existing databases like the
NanoMaterial Registry, DaNa Knowledge Base,
Nanoparticle Information Library NIL, Nanomaterial-
Biological Interactions Knowledgebase, caNanoLab,
InterNano, Nano-EHS Database Analysis Tool, nanoHUB,
etc. Each of them has different data formats and
descriptors, like CODATA-VAMAS’ Universal
Description System, ISO-Tab(-Nano), OECD templates,
custom spreadsheets, and images. Interoperability is a
main aim: semi-automatic import or upload of
information, and its integration into the eNanoMapper
data structure, is being enabled. Vice versa, retrieval or
download of experimental data from the database for
(re-)analysis is provided too, using programmable
interfaces to the data and the ontology. Database and
search functionality should be semantic web compatible:
the project developed and maintains a nanosafety ontology
to support this. This eNanoMapper ontology was
developed using the Web Ontology Language and the
challenge is to map nanomaterial terms to their multiple
ontology terms, namely physico-chemical properties,
biological and ecological impact, experimental assay
description, and known safety aspects.
RESULTS & DISCUSSION
The current eNanoMapper demo database instance,
available at https://data.enanomapper.net/, contains the
physico-chemical, biological and environmental properties
of 465 different nanomaterials1. Loading data into the
database supports various formats, including the OECD
Harmonized Templates and the data structure used by the
NanoWiki2. A web interface is designed to support the
typical interactions with the database, including uploading
of experimental data as well as querying data to support
analysis and modelling of nanoparticle properties. The
eNanoMapper ontology is available at
http://purl.enanomapper.net/onto/enanomapper.owl and is
based on a multi-faceted description of nanoparticles
concerning nanoparticle types, physico-chemical
description, life cycle, biological and environmental
characterisation including experimental methods and
protocols, and safety information3. The terms are verified
against the definitions of REACH, ISO, or common
practices used in science in general. The often-confused
meanings of endpoints and assays were distinguished in
the definitions, e.g. size versus size measurement assay. It
was partly possible to use existing ontologies as a basis,
e.g. NPO, ChEBI, GO, etc., but many terms had to be
added manually. Currently, there are 4592
classes defined. Users can access and download the
ontology from the U.S. National Center for Biomedical
Ontology BioPortal platform,
http://bioportal.bioontology.org/ontologies/ENM.
REFERENCES
1 Jeliazkova, N. et al. The eNanoMapper database for nanomaterial
safety information. Beilstein Journal of Nanotechnology 6,
1609-1634, doi:10.3762/bjnano.6.165 (2015).
2 Willighagen, E.; doi:10.6084/m9.figshare.1330208
3 Hastings, J. et al. eNanoMapper: harnessing ontologies to enable
data integration for nanomaterial risk assessment. J Biomed
Semantics 6, 10, doi:10.1186/s13326-015-0005-5 (2015).
P16. BIOMEDICAL TEXT MINING FOR DISEASE-GENE DISCOVERY:
SOMETIMES LESS IS MORE
Sarah ElShal1,2*, Jesse Davis3 & Yves Moreau1,2.
Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data
Analytics, KU Leuven1; iMinds Future Health Department, KU Leuven2; Department of Computer Science,
KU Leuven3.
Biomedical text is increasingly being made available online, in either abstract or full-article format. In parallel
grows the desire to extract knowledge from such text (e.g. finding links between diseases and genes). Consequently,
text mining is very popular in the biomedical domain, as it makes it possible to automatically analyze these texts in
order to extract knowledge. One of the big challenges in text mining is recognizing named entities (e.g. disease and
gene entities) inside a given text, widely known as Named Entity Recognition (NER). We studied two biomedical
taggers that apply different NER methods to MEDLINE abstracts. Here, we compare the contribution of each of the
two taggers to associating genes with diseases. We show that with fewer recognized entities we gain more knowledge
and better associate genes with diseases.
INTRODUCTION
MEDLINE currently has more than 25 million biomedical
citations from different journals all over the world. With
this vast amount of text available, it is increasingly
important to mine such data and find the best ways to
extract relevant knowledge out of it. One example of such
knowledge is links between diseases and genes. However,
it is very challenging and time-consuming to recognize
biomedical entities inside a given text, given the evolving
number of dictionaries and tagging strategies. Different
taggers exist that map MEDLINE abstracts to biomedical
entities. Such tagged entities can be used to generate
disease and gene profiles, and by applying certain
similarity measures we can extract knowledge and
generate disease-gene hypotheses.
METHODS
We compare two MEDLINE taggers that map the whole
set of MEDLINE abstracts to biomedical entities (e.g.
genes, diseases, GO and MeSH terms …). The first one is
MetaMap (Aronson et al., 2010), and the second one has
been used as a text mining pipeline in many resources,
latest in DISEASES (Pletscher-Frankild et al., 2015). For
the sake of simplicity, we refer to the second tagger as
m_tagger throughout the rest of the abstract. For each
MEDLINE abstract we thus obtain two sets of mapped
entities: (1) the metamap set and (2) the m_tagger set.
Over all abstracts, the metamap set corresponds to 78,298
distinct entities vs. 29,536 for the m_tagger set.
In order to compare the contribution of each tagger to the
disease-gene association process, we proceeded as follows.
First, we generated a validation set from the OMIM
database to acquire a list of experimentally-validated
disease-gene pairs. Second, we generated an entity profile
for every gene in our database and for every disease in our
validation set. Each profile holds the TF-IDF score of
every entity, calculated over the set of abstracts found to
be linked with that disease or gene. Then for every
disease, we computed the
cosine similarity between its profile and all the gene
profiles. This yields a similarity score for each
disease-gene pair, which we used to rank the genes for
a given disease. We computed the average recall at the top
10, 25, 50, and 100 ranked genes. We ran this analysis
once according to the metamap set and once according to
the m_tagger set. We also tried another association
measure where we filtered the profiles such that they only
contain gene entities. Then we ranked the genes according
to their TF-IDF scores in a given disease profile. This
corresponds to 9,290 gene entities in the metamap set, and
10,003 entities in the m_tagger set. Again we measured
the average recall at the different rank thresholds, and we
repeated the analysis using the metamap and m_tagger
profiles.
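The profile construction and ranking can be sketched as follows. This is a standard-library-only illustration; the entity counts, gene and disease names are made up and bear no relation to the actual MEDLINE-scale profiles.

```python
from math import log, sqrt

def tfidf_profiles(raw_counts):
    """raw_counts: {name: {entity: count}} -> TF-IDF weighted profiles.
    IDF is computed over the given set of profiles."""
    df = {}
    for counts in raw_counts.values():
        for e in counts:
            df[e] = df.get(e, 0) + 1
    n = len(raw_counts)
    return {name: {e: c * log(n / df[e]) for e, c in counts.items()}
            for name, counts in raw_counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse profiles (dicts)."""
    num = sum(a[e] * b[e] for e in set(a) & set(b))
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

# Invented entity counts from abstracts linked to each gene/disease.
genes = {"BRCA2": {"repair": 5, "cancer": 3},
         "INS": {"glucose": 6, "cancer": 1},
         "HBB": {"anemia": 4}}
disease = {"breast carcinoma": {"repair": 2, "cancer": 4}}

profiles = tfidf_profiles({**genes, **disease})
d = profiles["breast carcinoma"]
ranking = sorted(genes, key=lambda g: cosine(profiles[g], d), reverse=True)
print(ranking)   # → ['BRCA2', 'INS', 'HBB']
```

Recall at a rank cutoff then simply counts how many validated genes for the disease appear in the top of `ranking`.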
RESULTS & DISCUSSION
Figure 1 presents the recall results on the OMIM
validation set. We observe that MetaMap and m_tagger
result in comparable recall when ranking the genes
according to their cosine similarity with the disease
profiles. We also observe that m_tagger results in the best
recall when simply ranking the genes according to their
TF-IDF scores inside the disease profile.
FIGURE 1. Recall results on the OMIM validation set: comparing the
contribution of MetaMap and M_tagger, once with cosine similarity and once with TF-IDF ranks.
Even though the m_tagger set contains fewer entities than
the metamap one, we gained the same knowledge to
associate genes with diseases. Moreover, when we further
reduced this set of entities to genes only, we gained even
more knowledge and better associated genes with
diseases.
REFERENCES
Aronson A.R. et al. An overview of MetaMap: historical perspective and
recent advances. J. Am. Med. Inform. Assoc. 17, 229-236 (2010).
Pletscher-Frankild S. et al. DISEASES: text mining and data integration
of disease-gene associations. Methods 74, 83-89 (2015).
P17. TUNESIM - TUNABLE VARIANT SET SIMULATOR FOR NGS READS
Bertrand Escaliere1,2, Nicolas Simonis1,3, Gianluca Bontempi1,2 & Guillaume Smits1,4.
Interuniversity Institute of Bioinformatics in Brussels1; Machine Learning Group, Université Libre de Bruxelles2;
Institut de Pathologie et de Génétique3; Hopital Universitaire des Enfants Reine Fabiola, Université Libre de
Bruxelles4.
Optimization of NGS analysis software and pipelines is crucial in order to improve the discovery of (new)
disease-causing variants. A better combination of existing tools and the right choice of parameters can lead to more
specific and sensitive calling. Simulated datasets allow the step-by-step evaluation of new alignment or calling
software. Creating a simulator able to insert known human variants at a realistic minor allele frequency, and artificial
variants in a tunable, controlled way, would make it possible to overcome three optimization limits: complete
knowledge of the input dataset, allowing exact calling sensitivity and accuracy to be determined; optimization on the
appropriate population; and the capacity to dynamically test a pipeline one variable at a time.
INTRODUCTION
Identification of the anomalies causing genetic disorders is
difficult. It can be limited by the rarity of the affliction
concerned, by the genetic heterogeneity of the disorder, or
by the phenotypic pleiotropy associated with anomalies in
a single gene. Exome and genome sequencing have
allowed the identification of the causes of many genetic
diseases whose origin had remained inaccessible to the
usual techniques of genetics research (Ng et al., 2009),
(Gilissen et al., 2012), (Yang et al., 2013), (Gilissen et al.,
2014). Exome and genome sequencing data analysis
pipelines consist of several steps (roughly: alignment,
quality filters, variant calling), and several software tools
are available for those steps. Evaluation and comparison
of those tools are crucial in order to improve pipeline
accuracy. Exome and genome sequencing simulations
make it possible to determine the veracity of called
variants (false positives and false negatives).
METHODS
We implemented TuneSIM, a wrapper around the NGS
read simulator dwgsim
(http://sourceforge.net/projects/dnaa/) that adds realistic
mutations. Generated reads contain real mutations from
the 1KG project and dbSNP 138. We use the existing tool
dwgsim for read generation. In order to generate data that
are as realistic as possible, we decided to keep the
haplotype block structure. We computed blocks with Plink
(Purcell et al., 2007), using VCF files from 1KG project
phase 3 for European individuals. For each block, we
obtained the frequency of each combination of variants,
and we used these frequencies for block selection. We also
insert variants independently, using their frequencies in
dbSNP (Smigielski et al., 2000). Using 33 in-house
samples, we computed global allele frequency
distributions of variants in coding and non-coding regions,
and we select variants according to those frequencies. A
similar procedure was applied for CNV insertion using
1KG data. We are developing a web interface allowing
users to download existing generated datasets. After
running their pipelines, users can upload their output and
see the accuracy of their pipelines.
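The frequency-weighted block selection can be sketched as follows. This is a toy illustration, not TuneSIM's code; the blocks, variant IDs and frequencies are invented for the example.

```python
import random

# Invented haplotype blocks: each maps a variant combination (tuple of
# variant IDs) to the frequency observed for that combination; an empty
# tuple means the reference haplotype for that block.
blocks = [
    {("rs1", "rs2"): 0.6, ("rs1",): 0.3, (): 0.1},
    {("rs9",): 0.5, (): 0.5},
]

def sample_haplotype(blocks, rng=random):
    """Pick one variant combination per block, weighted by its observed
    frequency, and return the union of the selected variants."""
    chosen = []
    for block in blocks:
        combos = list(block)
        freqs = [block[c] for c in combos]
        chosen.extend(rng.choices(combos, weights=freqs, k=1)[0])
    return chosen

random.seed(7)
print(sample_haplotype(blocks))
```

Sampling whole combinations per block, rather than each variant independently, is what preserves the linkage structure the abstract describes.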
RESULTS & DISCUSSION
Simulations with different coverages and indel rates have
been performed and analysed with different pipelines. The
results will be presented.
REFERENCES
Gilissen, et al. (2012). Disease gene identification strategies for exome
sequencing. Eur J Hum Genet, 20, 490-497.
Gilissen, et al. (2014). Genome sequencing identifies major causes of
severe intellectual disability. Nature, 511, 344-347.
Ng, S. B., et al. (2009). Exome sequencing identifies the cause of a
mendelian disorder. Nature Genetics, 42, 30-35.
Purcell, et al. (2007). PLINK: a tool set for whole-genome association
and population-based linkage analyses. American Journal of Human
Genetics, 81, 559-575.
Smigielski, E. M., Sirotkin, K., Ward, M., & Sherry, S. T. (2000). dbSNP:
a database of single nucleotide polymorphisms. Nucleic Acids
Research, 28, 352-355.
Yang, et al. (2013). Clinical Whole-Exome Sequencing for the Diagnosis
of Mendelian Disorders. N Engl J Med, 369, 1502-1511.
P18. RNA-SEQ REVEALS ALTERNATIVE SPLICING WITH
ALTERNATIVE FUNCTIONALITY IN MUSHROOMS
Thies Gehrmann1, Jordi F. Pelkmans2, Han Wösten2, Marcel J.T. Reinders1 & Thomas Abeel1*.
Delft Bioinformatics Lab, Delft University of Technology1; Fungal Microbiology, Science Faculty, Utrecht University2.
Alternative splicing is well studied in mammalian genomes; alternative transcripts are often associated with disease,
and their role in regulation is gradually being unveiled. In fungi, the study of alternative splicing has only scratched
the surface. Using RNA-Seq data, we predict alternative transcripts based on existing gene predictions in two
mushroom-forming fungi. We study the alternative functionality of genes through functional domains, developmental
stages, tissue and time. This analysis reveals an extent of alternative functionality induced by alternative splicing that
was previously unknown in fungi, and asserts the need for further research.
INTRODUCTION
Transcript reconstruction algorithms rely on the sparsity
(intergenic regions) of the genome in order to distinguish
between genes. In fungi, due to the density of the genome,
transcripts overlap in the up- and downstream untranslated
regions (UTRs), which prevents the use of existing tools
for transcript prediction (Roberts et al. 2011). Previous
studies (Xie et al. 2015, Zhao et al. 2013) were limited to
the study of splice junctions, without more advanced
functional analyses. We transform the genomes of S.
commune and A. bisporus in order to enable the prediction
of alternative transcripts, applying existing transcript
reconstruction algorithms to RNA-Seq data from different
tissue types and developmental stages. We present a
functional analysis of the resulting transcripts.
METHODS
We apply a transformation to our fungal genomes in order
to reduce the impact of the overlapping UTRs that prevent
the prediction of alternative transcripts. We split the
genome into chunks, with each chunk being defined by an
existing gene annotation. The transformation thus
essentially removes the intergenic regions (which contain
the UTRs). Each chunk is then analyzed separately by
Cufflinks (Roberts et al. 2011). Predicted transcripts are
filtered based on read information and ORF sanity. Protein
domain annotations are predicted for each transcript using
InterPro (Zdobnov & Apweiler 2001).
For each gene with multiple alternative transcripts, we
construct a consensus sequence which allows us to call
specific splicing events without the influence of erroneous
reference annotations.
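The chunking step described above can be sketched as follows. This is a toy illustration, not the authors' pipeline; the margin parameter and the gene coordinates are invented for the example.

```python
def chunk_genome(sequence, annotations, margin=50):
    """Cut one chunk per annotated gene, extended by a small margin so
    reads spanning the UTRs still map, and drop the intergenic rest.
    annotations: iterable of (gene_id, start, end), 0-based half-open."""
    chunks = {}
    for gene_id, start, end in annotations:
        lo = max(0, start - margin)
        hi = min(len(sequence), end + margin)
        chunks[gene_id] = sequence[lo:hi]
    return chunks

# Toy genome: two short "genes" embedded in long intergenic runs.
genome = "A" * 100 + "ATGAAATTTGGGTAA" + "C" * 100 + "ATGCCCGGGTAA" + "T" * 100
annos = [("gene1", 100, 115), ("gene2", 215, 227)]
for gid, seq in chunk_genome(genome, annos, margin=10).items():
    print(gid, len(seq))   # prints "gene1 35" then "gene2 32"
```

Running a transcript assembler per chunk then sidesteps the overlapping-UTR problem, at the cost of remapping the predicted coordinates back onto the original genome afterwards.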
RESULTS & DISCUSSION
For both fungi, we find that alternative splicing is
prevalent and many genes have multiple alternative
transcripts (see Table 1).

              # Orig. genes   # Filt. genes   # Transcripts
S. commune        16,319          14,615          20,077
A. bisporus       10,438           9,612          14,320

TABLE 1. The number of originally annotated genes in S. commune and
A. bisporus decreases after RNA-Seq-based prediction filters some of
them out. The number of newly predicted transcripts indicates that
alternative splicing is not a rare event in these fungi.
The frequencies of specific events in the two fungi are
similar and match what is seen in humans (Sammeth et al.
2008). However, there are significant differences in event
usage. While most transcripts in S. commune have only
one event associated with them, most transcripts in A.
bisporus have at least two events. We show that this is a
result of co-operative events.
As our dataset consists of multiple developmental time-
points and tissue types, we are able to observe the
alternative use of transcripts through time. If a gene swaps
transcript usage at a certain time point, this is indicative of
a functional involvement of that particular transcript (Lees
et al. 2015). We find multiple transcripts in both S.
commune and A. bisporus that are activated in specific
developmental stages of the mushroom. Furthermore, in A.
bisporus, we are able to identify transcripts that are
activated specifically for certain tissue types through
development.
Using protein domain predictions for each transcript in a
gene, we can measure how gene functionality changes
across its transcripts. Figure 1 shows that functional
annotations are not always preserved across all transcripts,
indicating alternative functionality.
FIGURE 1. Many genes in S. commune demonstrate alternative functionality through alternative splicing.
This is the first genome-wide functional analysis of
alternative splicing in fungi from RNA-Seq data. We find
a wealth of alternative splicing events in two fungi,
resulting in many newly discovered transcripts. Although
their functional influence is not yet demonstrated, we
present evidence to suggest that they are relevant to
mushroom development.
REFERENCES
Lees JG et al. BMC Genomics 16:1 (2015).
Roberts A et al. Bioinformatics 27:17, 2325-2329 (2011).
Sammeth M et al. PLoS Computational Biology 4:8 (2008).
Xie B-B et al. BMC Genomics 16:54 (2015).
Zdobnov EM & Apweiler R. Bioinformatics 17:9 (2001).
Zhao C et al. BMC Genomics 14:21 (2013).
10th Benelux Bioinformatics Conference bbc 2015
63
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: P Poster
P19. MSQROB: AN R/BIOCONDUCTOR PACKAGE FOR ROBUST RELATIVE
QUANTIFICATION IN LABEL-FREE MASS SPECTROMETRY-BASED
QUANTITATIVE PROTEOMICS
Ludger Goeminne1,2,3*, Kris Gevaert2,3 & Lieven Clement1.
Department of Applied Mathematics, Computer Science and Statistics, Ghent University1; VIB Medical Biotechnology
Center2; Department of Biochemistry, Ghent University3.
MSqRob is an R/Bioconductor package that uses robust ridge regression on peptide-level data for robust relative
quantification of proteins in label-free data-dependent acquisition (DDA) mass spectrometry (MS)-based proteomic
experiments. It has been shown that statistical methods inferring at the peptide-level outperform workflows that
summarize peptide intensities prior to inference. MSqRob improves upon existing peptide-level methods by three
modular extensions: (1) ridge regression, (2) empirical Bayes variance estimation and (3) M-estimation with Huber
weights. These extensions make MSqRob less sensitive to outliers and missing peptides, enabling more proteins to be
processed. Our software provides streamlined data analysis pipelines for experiments with simple layouts as well as for
more complex multi-factorial designs. Using a spike-in dataset, we illustrate that MSqRob yields more stable protein fold
change estimates and improves the differential abundance (DA) ranking.
INTRODUCTION
In a typical label-free DDA LC-MS/MS-based proteomic
workflow, proteins are digested to peptides, separated by
RP-HPLC and analyzed by a mass spectrometer. However,
several issues inherent to the protocol make data analysis
non-trivial. Most of the common data analysis procedures
use summarization-based workflows. We have previously
shown that inference at the peptide level outperforms these
summarization-based approaches (Goeminne et al., 2015).
However, even these pipelines are sensitive to outliers and
suffer from overfitting. Here, we present MSqRob, an
R/Bioconductor package that starts from peptide-level data
and provides robust inference on DA at the protein level.
METHODS
Dataset. To demonstrate the performance of our package,
we use the CPTAC dataset, in which 48 known human
proteins were spiked in at different concentrations in a
yeast proteome background. Ideally, when comparing
different spike-in conditions, only the human proteins
should be flagged as differentially abundant.
Competing analytical methods. We benchmark against
MaxLFQ+Perseus, which summarizes peptide data and then
applies pairwise t-tests.
LM model. Generally, peptide-based models are
constructed as follows:
y_ijklmn = treat_ij + pep_ik + biorep_il + techrep_im + ε_ijklmn

with y_ijklmn the nth log2-transformed normalized feature
intensity for the ith protein under the jth treatment treat_ij,
the kth peptide sequence pep_ik, the lth biological repeat
biorep_il and the mth technical repeat techrep_im, and
ε_ijklmn a normally distributed error term with mean zero
and variance σ_i².
MSqRob. MSqRob adds the following improvements to
the LM model:
1. Ridge regression: shrink parameter estimates
towards 0 by adding a ridge penalty term to the
loss function.
2. Stabilize variance estimation by borrowing
information across proteins with empirical
Bayes (EB): shrink individual variances towards
the pooled variance.
3. M estimation with Huber weights: weigh down
observations with large errors.
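Extensions (1) and (3) can be sketched together as iteratively reweighted ridge regression. This is a minimal illustration, not MSqRob's actual implementation: the tuning constant, MAD scale estimate and weight normalization are assumptions, and the empirical Bayes step (2) is omitted:

```python
import numpy as np

def huber_weights(resid, scale, k=1.345):
    """Huber weights: 1 for small residuals, k*scale/|resid| beyond the cut-off."""
    return np.minimum(1.0, k * scale / np.maximum(np.abs(resid), 1e-12))

def robust_ridge(X, y, lam=1.0, n_iter=20):
    """Iteratively reweighted ridge regression with Huber weights.

    The ridge penalty `lam` shrinks parameter estimates towards 0, and
    the Huber weights down-weight outlying peptide intensities.
    """
    n, p = X.shape
    w = np.ones(n)
    for _ in range(n_iter):
        WX = X * w[:, None]
        # Weighted ridge solution: (X'WX + lam*I)^(-1) X'Wy
        beta = np.linalg.solve(X.T @ WX + lam * np.eye(p), WX.T @ y)
        resid = y - X @ beta
        # Robust scale estimate (MAD), floored to avoid division by zero
        scale = max(1.4826 * np.median(np.abs(resid - np.median(resid))), 1e-8)
        w = huber_weights(resid, scale)
        w = w / w.mean()  # keep the penalty comparable across iterations
    return beta
```

On a toy regression with one gross outlier, the reweighting pulls the fit back towards the trend of the clean observations, which is the behaviour the abstract describes for outlying peptide intensities.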
RESULTS & DISCUSSION
MSqRob uses MaxQuant or Mascot peptide-level data as
input. It performs preprocessing, robust model fitting and
returns log2 fold change estimates and FDR corrected p-
values for all model parameters and/or (user specified)
contrasts. Advanced users have the flexibility to (a) adopt
their own preprocessing pipeline (e.g. transformation,
normalization, drop contaminants…) and (b) specify the
appropriate model structure. Compared to competing
methods, MSqRob returns more stable log2 fold change
estimates, improves DA ranking (Figure 1) and is able to
discern between consistently strong DA and an accidental
hit caused by outliers or a small variance due to random
chance in low-abundant proteins.
FIGURE 1. Receiver operating characteristic (ROC) curves showing the
superior performance of MSqRob compared to a simple linear model (LM)
and a summarization-based approach (MaxLFQ+Perseus) when comparing
the lowest spike-in concentration 6A with the second lowest spike-in
concentration 6B. Stars denote the methods' cut-off at an estimated 5% FDR.
REFERENCES
Goeminne LJE et al. Journal of Proteome Research 14, 2457-2465 (2015).
P20. A MIXTURE MODEL FOR THE OMICS BASED IDENTIFICATION OF
MONOALLELICALLY EXPRESSED LOCI AND THEIR DEREGULATION IN
CANCER
Tine Goovaerts1, Sandra Steyaert1, Jeroen Galle1, Wim Van Criekinge1 & Tim De Meyer1*.
BIOBIX lab of Bioinformatics and Computational Genomics, Department of Mathematical Modelling,
Statistics and Bioinformatics, Ghent University1.
Imprinting is a phenomenon featured by parent-specific monoallelic gene expression. Its deregulation has been
associated with non-Mendelian inherited genetic diseases but is also a common feature of cancer. As imprinting does not
alter the genome yet is mitotically inherited, epigenetics is deemed to be a key regulator. Current knowledge in the field
is particularly hampered by a lack of accurate computational techniques suitable for omics data. Here we introduce a
mixture model for the identification of monoallelically expressed loci based on large scale omics data that can also be
exploited to identify samples and loci featured by loss of imprinting / monoallelic expression.
INTRODUCTION
The genome-wide identification of mono-allelically
expressed or epigenetically modified loci typically
requires the presence of SNPs to discriminate both alleles.
Current methods predominantly rely on genotyping for the
identification of heterozygous loci in a limited sample set,
followed by testing whether the expression/epigenetic
modification levels for both alleles deviate from a 1:1 ratio
for those loci (Wang et al., 2014). This approach is limited
by the genotyping step and the required presence of
heterozygous individuals. As large scale omics data is
becoming increasingly available, an alternative strategy
may be to screen larger numbers (e.g. hundreds) of
samples, ensuring the presence of heterozygous
individuals at predictable rates, thereby also avoiding the
need for and limitations of a prior genotyping step.
Based on this concept, a previous strategy (Steyaert et al.,
2014) enabled us to identify and validate approximately 80
loci featured by monoallelic DNA methylation, but had
several drawbacks, such as computational inefficiency,
heavy reliance on Hardy-Weinberg equilibrium (HWE),
need for 100% imprinting and low power, which limited
its practical use. Here we present a novel mixture model
for the identification of monoallelically modified or
expressed loci from large-scale omics data (without
known genotypes) that largely circumvents previous
drawbacks.
METHODS
The rationale of the methodology is that RNA-seq and
ChIP-seq(-like) derived SNP data for monoallelic loci are
featured by a general lack of apparent heterozygosity.
More specifically, under the null-hypothesis (no
imprinting) the homozygous and heterozygous sample
fractions can be modelled as a mixture of (beta-)binomial
distributions, with weights according to HWE or
empirically derived. For imprinted loci however, the
heterozygous fraction is split and shifted towards the two
homozygous fractions (Figure 1), which can be evaluated
with a likelihood ratio test. The model does not require but
can incorporate prior genotyping data and allows for
deviation from HWE, sequencing errors and efficiency
differences and partial monoallelic events. Once loci
featured by monoallelic events have been identified in
control data, a loss of imprinting index can be calculated
for each non-normal sample based on the mixture model
likelihoods and loci generally featured by loss of
imprinting in the pathology under study can be identified.
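The null/imprinted contrast can be sketched as two binomial mixtures and a likelihood ratio. The sequencing-error rate, the plain binomial components (rather than beta-binomial) and the full-imprinting alternative below are illustrative simplifications of the actual model:

```python
import numpy as np
from scipy.stats import binom

def mixture_loglik(ref_counts, totals, p, eps=0.01, imprinted=False):
    """Mixture log-likelihood of reference-allele counts at one SNP.

    Under the null, genotypes AA, AB and BB occur at HWE weights
    (p^2, 2pq, q^2) and generate reference-allele fractions close to
    1-eps, 0.5 and eps (eps = sequencing error rate).  Under full
    imprinting, heterozygotes express a single allele, so the AB weight
    is reallocated to the two homozygous-looking components.
    """
    q = 1.0 - p
    if imprinted:
        weights = [p**2 + p * q, 0.0, q**2 + p * q]
    else:
        weights = [p**2, 2 * p * q, q**2]
    fracs = [1.0 - eps, 0.5, eps]  # expected reference-allele fractions
    lik = sum(w * binom.pmf(ref_counts, totals, f)
              for w, f in zip(weights, fracs))
    return np.log(np.maximum(lik, 1e-300)).sum()

def lrt_imprinting(ref_counts, totals, p):
    """Likelihood ratio statistic; large positive values favour imprinting."""
    return 2 * (mixture_loglik(ref_counts, totals, p, imprinted=True)
                - mixture_loglik(ref_counts, totals, p, imprinted=False))
```

A locus where apparent heterozygotes are missing scores positively; a locus with the expected heterozygous fraction scores negatively.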
RESULTS & DISCUSSION
We demonstrate the applicability of the novel mixture
model with simulations and a proof of concept study using
breast cancer and control RNA-seq data from The Cancer
Genome Atlas (TCGA Research Network, 2008). Well
known imprinted loci such as IGF2 (Figure 1) and H19
were indeed identified. Ongoing efforts are directed
towards artefact-free RNA/ChIP-seq data based allele
frequency inference and the efficient implementation of a
beta-binomial based mixture.
FIGURE 1. Observed (red) and modelled (green) allele frequencies for a
100% (right, no observable heterozygotes) and a partially imprinted
(left) SNP of the IGF2 gene
In conclusion, we introduce a novel mixture model for the
identification of loci featured by monoallelic events which
can subsequently be exploited to determine their
deregulation in the pathology of interest.
REFERENCES
Steyaert S et al. Nucleic Acids Research 42, e157 (2014).
TCGA Research Network. Nature 455, 1061-1068 (2008).
Wang X & Clark AG. Heredity 113, 156-166 (2014).
P21. GEVACT: GENOMIC VARIANT CLASSIFIER TOOL
Isel Grau1,4, Dorien Daneels2,3, Sonia Van Dooren2,3, Maryse Bonduelle2,
Dewan Md. Farid1,3, Didier Croes2,3, Ann Nowé1,3 & Dipankar Sengupta1,3*.
Como - Artificial Intelligence Lab, Vrije Universiteit Brussel1; Centre for Medical Genetics, Reproduction and Genetics,
Reproduction Genetics and Regenerative Medicine, Vrije Universiteit Brussel, UZ Brussel2; Interuniversity Institute of
Bioinformatics in Brussels, ULB-VUB3; Department of Computer Sciences, Universidad Central de Las Villas4.
High throughput screening (HTS) techniques, such as genome or exome screening, are becoming the norm in conventional
clinical analysis. However, classifying the identified variants as pathogenic, potentially pathogenic or non-pathogenic
is still a manual, tedious and time-consuming process for clinicians and geneticists. Thus, to facilitate the variant
classification process, we have developed GEVACT, a Java-based tool built on an algorithm derived from the existing
literature and the knowledge of clinical geneticists. GEVACT can classify variants annotated by Alamut Batch, with
support for input from other annotation software planned.
INTRODUCTION
With the emergence of new screening techniques, targeted
or whole exome and genome screening are becoming
standard diagnostic norms in clinical settings to identify
the variants for a genetic disease (Ng et al., 2010;
Saunders et al., 2012). However, the development of
bioinformatics solutions for pathogenic classification of
variants remains a major challenge, making the process
cumbersome for geneticists and clinicians. In this work,
we describe GEVACT (Genomic
Variant Classifier Tool), a tool for classification of
genomic single nucleotide and short insertion/deletion
variants. The aim of this study was to design and
implement a variant classification algorithm, based on a
literature review of cardiac arrhythmia syndromes
(Hofman et al., 2013; Schulze-Bahr et al., 2000; Wilde &
Tan, 2007) and existing knowledge of clinical geneticists.
METHODS
The algorithm we propose for GEVACT is based on a
published variant classification schema for cardiac
arrhythmia syndromes. This approach is based on the yield
of DNA testing over a time span of 15 years (1996-2011),
between probands with isolated/familial cases, and also
between probands with or without clear disease-specific
clinical characteristics (Hofman et al., 2013). It proposes
two varying approaches: one to classify missense variants
and another to classify nonsense and frameshift variants.
The algorithm is implemented in two phases: pre-
processing and classification. In the pre-processing phase,
the annotated tab-delimited variant file (vcf.ann) from
Alamut Batch is refined based on the gene list for the
disease of interest, so as to reduce the number of variants
for the analysis. Filters are applied to look for variants that
have already been reported in the Human Gene
Mutation Database (Stenson et al., 2003) and in ClinVar
(Landrum et al., 2014), or that have previously been
detected and classified in an internal patient population.
Lastly, the variants are filtered based on their location
in the genome and their coding effect, followed by a
check for the minor allele frequency of the variant in a
control population (Sherry et al., 2001). Thereafter, in the
classification phase, the filtered variants are classified as
missense or nonsense and frameshift variants. For
missense variants the classification is based on the
parameters: amino acid substitution and its impact on
protein function (Adzhubei et al., 2010; Kumar et al.,
2009), biochemical variation (Mathe et al., 2006),
conservation (Pollard et al., 2010), frequency of variant
alleles in a control population (ExAC, 2015), effects on
splicing (Desmet et al., 2009), family and phenotype
information and functional analysis. For nonsense and
frameshift variants, by contrast, it is based on: effects on
splicing, frequency of variant alleles in a control
population, family and phenotype information and
functional analysis. For each parameter, a score is given to
the variant, and these scores are summed. Finally, based
on the cumulative score, each variant is classified into one
of five categories: Class I - Non-Pathogenic; Class II -
VUS1 (unlikely pathogenic); Class III - VUS2 (unclear);
Class IV - VUS3 (likely pathogenic); Class V - Pathogenic
(Sharon et al., 2008).
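The scoring step can be sketched as follows. The parameter names, per-parameter scores and class cut-offs below are invented placeholders; the actual values come from the published schema (Hofman et al., 2013):

```python
# Hypothetical cut-offs for illustration only.
CLASS_LABELS = [
    "Class I - Non-Pathogenic",
    "Class II - VUS1 (unlikely pathogenic)",
    "Class III - VUS2 (unclear)",
    "Class IV - VUS3 (likely pathogenic)",
    "Class V - Pathogenic",
]

def classify_variant(parameter_scores, cutoffs=(2, 4, 6, 8)):
    """Sum per-parameter scores and map the total to one of five classes.

    `parameter_scores` maps each assessed parameter (e.g. conservation,
    splicing effect) to its score; totals below cutoffs[i] fall into
    class i, and anything at or above the last cut-off is Class V.
    """
    total = sum(parameter_scores.values())
    for cutoff, label in zip(cutoffs, CLASS_LABELS):
        if total < cutoff:
            return total, label
    return total, CLASS_LABELS[-1]
```

The tool's output described below (cumulative score plus class label) corresponds to the returned pair.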
RESULTS & DISCUSSION
In this study, we report a Java based tool called GEVACT,
developed for classification of genomic variants. Input for
the tool is an annotated vcf file, while the output depicts
the cumulative classification score along with the class
label for a variant. The tool was tested on a dataset of 130
cardiac arrhythmia syndrome patients, available at UZ
Brussel. The results of the variant classification made by
the tool were cross-validated by manual curation,
performed by a clinical geneticist. The study indicates
that the tool is promising but needs to be further
validated on datasets from other diseases. In addition,
we are working on making the tool adaptable to file
inputs from other annotation software.
REFERENCES
Adzhubei IA et al. Nat Methods 7(4), 248-249 (2010).
Desmet et al. Nucleic Acids Res 37(9), e67 (2009).
Exome Aggregation Consortium (ExAC), Cambridge, MA (2015).
Hofman N et al. Circulation 128(14), 1513-21 (2013).
Kumar P et al. Nat Protoc 4(7), 1073-1081 (2009).
Landrum MJ et al. Nucleic Acids Res 42(1), D980-5 (2014).
Mathe E et al. Nucleic Acids Res 34(5), 1317-25 (2006).
Ng SB et al. Nat Genetics 42, 30-35 (2010).
Pollard K et al. Genome Res 20, 110-121 (2010).
Saunders CJ et al. Sci Transl Med 4, 154ra135 (2012).
Sharon EP et al. Hum Mutat 29(11), 1282-1291 (2008).
Sherry ST et al. Nucleic Acids Res 29(1), 308-11 (2001).
Schulze-Bahr E et al. Z Kardiol 89 Suppl 4, IV12-22 (2000).
Stenson et al. Hum Mutat 21, 577-581 (2003).
Wilde AA & Tan HL. Circ J 71 Suppl A, A12-9 (2007).
P22. MAPPI-DAT: MANAGEMENT AND ANALYSIS FOR HIGH
THROUGHPUT INTERACTOMICS DATA FROM ARRAY-MAPPIT
EXPERIMENTS
Surya Gupta1,2,3, Jan Tavernier1,2 & Lennart Martens1,2,3.
Medical Biotechnology Center, VIB, Ghent, Belgium1; Department of Biochemistry, Ghent University, Ghent, Belgium2;
Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium3.
INTRODUCTION
Proteins are highly interesting objects of study, involved
in different cellular and molecular functions. Identification
and quantification of these proteins along with their
interacting proteins, nucleic acids and molecules can
provide insight into development and disease mechanisms
at the systems level. Yet studying these interactions is not
trivial. In vivo methods exist to determine these
interactions, but these suffer from several drawbacks [4].
To overcome existing problems, an innovative approach
called MAPPIT (Mammalian Protein-Protein Interaction
Trap) [2] has been established in the Cytokine Receptor
Lab to determine interacting partners of proteins in
mammalian cells. To allow screening of thousands of
interactors simultaneously, MAPPIT has been parallelized
in the array MAPPIT system [3].
AIM
However, no effective pipeline existed to process the
high-throughput data generated from array MAPPIT. We
therefore established an automated high-throughput data
analysis system called MAPPI-DAT (Mappit Array
Protein Protein Interaction- Database & Analysis Tool).
METHODS
In the array-MAPPIT platform, the interaction of two
proteins (bait and prey) restores a mutated JAK-STAT
signaling pathway, which leads to the expression of
fluorescence-emitting reporter genes. To rank the positive
interactions based on fluorescence intensity, RankProd [1]
is used. This method was originally developed to
determine differentially expressed genes in microarray
experiments and is available as an R package. To minimize
false positive hits from the RankProd output, quartile-based
filtration was applied. MySQL was used to build the data
management system for the array-MAPPIT system.
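The rank-product statistic and a quartile filter can be sketched in a few lines. This is a simplified stand-in for the RankProd package, which additionally assesses significance by permutation:

```python
import numpy as np

def rank_product(intensities):
    """Rank-product score per candidate interaction.

    `intensities` is an (n_interactions, n_replicates) matrix of
    fluorescence readouts.  Within each replicate column, interactions
    are ranked with rank 1 = strongest signal; the geometric mean of a
    row's ranks is its rank product, so small values flag consistently
    strong hits.
    """
    order = np.argsort(-intensities, axis=0)   # descending signal
    ranks = np.argsort(order, axis=0) + 1      # 1-based ranks per replicate
    return np.exp(np.log(ranks).mean(axis=1))

def quartile_filter(scores):
    """Indices of hits in the best (lowest) quartile of rank products."""
    return np.flatnonzero(scores <= np.percentile(scores, 25))
```

An interaction ranked near the top in every replicate keeps a rank product close to 1, while a one-off spike in a single replicate is pushed out by its poor ranks elsewhere.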
RESULTS
To extend and ease the usage of the analysis pipeline and
database system, an interface called MAPPI-DAT has been
developed. MAPPI-DAT is capable of processing many
thousands of data points per experiment, and comprises a
data storage system that stores the experimental data in a
structured way for meta-analysis.
REFERENCES
[1] Breitling R, Armengaud P, Amtmann A & Herzyk P. Rank products:
a simple, yet powerful, new method to detect differentially regulated
genes in replicated microarray experiments. FEBS Letters 573(1-3),
83-92 (2004).
[2] Lievens S, Peelman F, De Bosscher K, Lemmens I & Tavernier J.
MAPPIT: a protein interaction toolbox built on insights in cytokine
receptor signaling. Cytokine and Growth Factor Reviews 22(5-6),
321-329 (2011).
[3] Lievens S, Vanderroost N, Van der Heyden J, Gesellchen V, Vidal M
& Tavernier J. Array MAPPIT: high-throughput interactome analysis
in mammalian cells. 877-886 (2009).
[4] Gopichandran S & Ranganathan S. Protein-protein interactions and
prediction: a comprehensive overview. Protein and Peptide Letters,
779-789 (2013).
P23. HIGHLANDER: VARIANT FILTERING MADE EASIER
Raphael Helaers1* & Miikka Vikkula1.
Human Molecular Genetics (GEHU), de Duve Institute, Université catholique de Louvain1.
The field of human genetics is being revolutionized by exome and genome sequencing. A massive amount of data is
being produced at ever-increasing rates. Targeted exome sequencing can be completed in a few days using NGS,
allowing for new variant discovery in a matter of weeks. The technology generates considerable numbers of false
positives, and the differentiation of sequencing errors from true mutations is not a straightforward task. Moreover, the
identification of changes-of-interest from amongst tens of thousands of variants requires annotation drawn from various
sources, as well as advanced filtering capabilities. We have developed Highlander, a Java software coupled to a MySQL
database, in order to centralize all variant data and annotations from the lab, and to provide powerful filtering tools that
are easily accessible to the biologist. Data can be generated by any NGS machine (such as Illumina’s HiSeq, or Life
Technologies’ Solid or Ion Torrent) and most variant callers (such as Broad Institute’s GATK or Life Technologies’
LifeScope). Variant calls are annotated using dbNSFP (providing predictions from 6 different programs, and MAF from
1000G and ESP), GoNL and SnpEff, and subsequently imported into the database. The database is used to compute global
statistics, allowing for the discrimination of variants based on their representation in the database. The Highlander GUI
easily allows for complex queries to this database, using shortcuts for certain standard criteria, such as “sample-specific
variants”, “variants common to specific samples” or “combined-heterozygous genes”. Users can browse through query
results using sorting, masking and highlighting of information. Highlander also gives access to useful additional tools,
including direct access to IGV, and an algorithm that checks all available alignments for allele-calls at specific positions.
P24. DOSE-TIME NETWORK IDENTIFICATION: A NEW METHOD FOR
GENE REGULATORY NETWORK INFERENCE FROM GENE EXPRESSION
DATA WITH MULTIPLE DOSES AND TIME POINTS
Diana M Hendrickx1*, Danyel G J Jennen1 & Jos C S Kleinjans1.
Department of Toxicogenomics, Maastricht University, The Netherlands1.
Toxicogenomics, the application of ‘omics’ technologies to toxicology, is a rapidly growing field due to the need for
alternatives to animal experiments for toxicity testing of compounds. Identification of gene regulatory networks affected
by compounds is important to gain more insight into the mode of action of a toxic compound. The response to a toxic
compound is both time and dose dependent. Therefore, toxicogenomics data are often measured across several time
points and doses. However, to our knowledge, no method exists for gene regulatory network inference that
takes into account both time and dose dependencies. Here we present Dose-Time Network Identification (DTNI), a novel
gene regulatory network inference algorithm that takes into account both dose and time dependencies in the data. We
show that DTNI can be used to infer gene regulatory networks affected by a group of compounds with the same mode of
action. This is illustrated with gene expression (microarray) data from COX inhibitors, measured in human hepatocytes.
INTRODUCTION
Identifying and understanding gene regulatory networks
(GRN) influenced by chemical compounds is one of the
main challenges of systems toxicology. A GRN affected
by one or more compounds evolves over time and with
dose. The analysis of gene expression data measured at
multiple time points and for multiple doses can provide
more insight in the effects of compounds. Therefore, there
is a need for mathematical approaches for GRN
identification from this type of data.
METHODS
One of the mathematical approaches currently used for
GRN inference is based on ordinary differential equations
(ODE), where changes in gene expression over time are
related to each other and to the external perturbation (i.e.
the dose of the compound). Because gene expression data
usually have less data points than variables (genes), ODE
approaches are often combined with interpolation and/or
dimension reduction techniques (PCA). A current method
that combines ODE with both interpolation and dimension
reduction techniques is Time Series Network
Identification (TSNI) (Bansal et al., 2006).
Here, we present Dose-Time Network Identification
(DTNI), a method that extends TSNI by including ODE
that describe changes in gene expression over dose in
relation to each other and to time. We also adapted the
original method so that it can include data from multiple
perturbations (compounds).
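The ODE backbone shared by TSNI and DTNI can be illustrated with a plain least-squares fit of a linear ODE system. Interpolation, PCA-based dimension reduction and the additional dose-direction ODE of the actual methods are omitted in this sketch:

```python
import numpy as np

def infer_grn(expr, times, dose):
    """Least-squares fit of the linear ODE dx/dt = A x + b u that
    underlies TSNI-style network inference.

    expr  : (n_timepoints, n_genes) expression matrix for one dose series
    times : sampling times of the rows
    dose  : scalar perturbation strength u
    Returns the gene-gene interaction matrix A and the direct
    perturbation effects b.
    """
    dxdt = np.gradient(expr, times, axis=0)  # numerical time derivatives
    n_genes = expr.shape[1]
    # Design matrix: expression values plus a constant dose column.
    X = np.hstack([expr, np.full((expr.shape[0], 1), float(dose))])
    coef, *_ = np.linalg.lstsq(X, dxdt, rcond=None)
    return coef[:n_genes].T, coef[n_genes]
```

Nonzero entries of A are the inferred regulatory edges, and large entries of b mark genes directly hit by the compound; DTNI adds an analogous regression in the dose direction.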
RESULTS & DISCUSSION
Using simulated data, we show that including ODE for expression changes over dose leads to improved
GRN identification compared with including only ODE
that describe changes over time. Furthermore, we show
that DTNI performs better when including data from
multiple perturbations (compounds) than when applying
DTNI to data from a single perturbation. This suggests
that the method is suitable to infer a GRN affected by
compounds with the same mode of action. As an example,
we infer the network affected by COX inhibitors from
public microarray data of 6 COX inhibitors, measured in
human hepatocytes, available from Open TG-Gates
(http://toxico.nibio.go.jp/english/index.html) (Noriyuki et
al., 2012). The interactions in the inferred network were
compared to interactions from ConsensusPathDB, a
database including interactions from 32 different sources
(Kamburov et al., 2013). The inferred network was
validated by leave-one-out cross-validation (LOOCV). Six
datasets were created from the original data by leaving out
the data of one compound. The network constructed from
the whole data set showed large overlap with the networks
constructed from each of the LOOCV datasets. Edges in
the network constructed from the whole data set, but not in
the networks constructed from the LOOCV datasets were
removed from the network. The remaining novel
interactions, i.e. those that are not in ConsensusPathDB,
have to be validated experimentally, e.g. by gene-
knockdown experiments.
FIGURE 1. Workflow for identifying a gene regulatory network affected
by a group of compounds with the same mode of action.
REFERENCES
Bansal M et al. Bioinformatics 22, 815-822 (2006).
Noriyuki N et al. J Toxicol Sci 37, 791-801 (2012).
Kamburov A et al. Nucl Acids Res 41, D793-D800 (2013).
P25. IDENTIFICATION OF NOVEL ALLOSTERIC DRUG TARGETS
USING A “DUMMY” LIGAND APPROACH
Susanne M.A. Hermans, Christopher Pfleger & Holger Gohlke*.
Department of Mathematics and Natural Sciences, Institute for Pharmaceutical and Medicinal Chemistry, Heinrich-
Heine-University, Düsseldorf, Germany. *[email protected]
Targeting allosteric sites is a promising strategy in drug discovery due to their regulatory role in almost all cellular
processes. Currently, there is no standard method to identify novel pockets and to detect whether a pocket has a
regulatory effect on the protein. Here, we present a new and efficient approach to probe information transfer through
proteins in the context of dynamically dominated allostery that exploits “dummy” ligands as surrogates for allosteric
modulators.
INTRODUCTION
Allosteric regulation is the coupling between separated
sites in biomacromolecules such that an action at one site
changes the function at a distant site. Allosteric drugs are
popular because they often have fewer side effects than
orthosteric drugs, as allosteric sites are less conserved. The
identification of novel allosteric pockets is complicated by
the large variation in allosteric regulation, ranging from
rigid body motions to disorder/order transitions, with
dynamically dominated allostery in between (Motlagh et
al., 2014). Here we focus on dynamically dominated
allostery with minimal or no conformational changes.
Novel pockets do not have a known ligand; therefore, we
generate “dummy” ligands to function as surrogates for
allosteric ligands. We have developed an efficient
approach to probe information transfer through proteins
using “dummy” ligands and detect if allosteric coupling is
present between the novel pocket and the orthosteric site.
METHODS
In a preliminary study to test the general feasibility, the
approach was applied to conformations extracted from an
MD trajectory of the holo and apo structures of LFA1.
The grid-based PocketAnalyzer program (Craig et al.,
2011) is used to detect putative binding sites. “Dummy”
ligands were generated for each detected pocket along the
ensemble. Finally, the Constraint Network Analysis
(CNA) software, which links biomacromolecular structure,
(thermo-)stability, and function, is used to probe the
allosteric response by monitoring altered stability
characteristics of the protein due to the presence of the
“dummy” ligand (Pfleger et al., 2013; Krüger et al., 2013;
Pfleger, 2014). The results were compared to those of the
holo structure with the bound allosteric ligand to validate
the “dummy” ligand approach.
RESULTS & DISCUSSION
Remarkably, the usage of “dummy” ligands almost
perfectly reproduced the results obtained from the known
allosteric effector. Although it turned out that the intrinsic
rigidity of the “dummy” ligands over-stabilizes the LFA1
structure, these results are already encouraging. Even for
the LFA1 apo structures, where the allosteric pocket is
partially closed, the results are in agreement with known
allosteric effectors. Overall, the results obtained from the
validation of the “dummy” ligand approach are
encouraging. This suggests that our “dummy” ligand
approach for the characterization of unexplored allosteric
pockets is a promising step towards identifying novel drug
targets.
REFERENCES
Craig IR et al. J Chem Inf Model 51, 2666-2679 (2011).
Krüger DM et al. Nucleic Acids Res 41, 340-348 (2013).
Motlagh HN et al. Nature 508(7496), 331-339 (2014).
Pfleger C et al. J Chem Inf Model 53, 1007-1015 (2013).
Pfleger C. Doctoral Thesis, Heinrich Heine University, Düsseldorf, Germany (2014).
P26. PASSENGER MUTATIONS CONFOUND INTERPRETATION OF ALL GENETICALLY MODIFIED CONGENIC MICE
Paco Hulpiau1,2,3*, Liesbet Martens1,2,3*, Yvan Saeys1,2,3, Peter Vandenabeele1,2,4 & Tom Vanden Berghe1,2.
Inflammation Research Center, VIB, Ghent, Belgium1; Department of Biomedical Molecular Biology, Ghent University, Ghent, Belgium2; Data Mining and Modelling for Biomedicine (DaMBi), Ghent, Belgium3; Methusalem Program, Ghent University, Belgium4. *[email protected], [email protected]
Targeted mutagenesis in mice is a powerful tool for functional analysis of genes. However, genetic variation between
embryonic stem cells (ESCs) used for targeting (previously almost exclusively 129-derived) and recipient strains (often
C57BL/6J) typically results in congenic mice in which the targeted gene is flanked by ESC-derived passenger DNA
potentially containing mutations. Comparative genomic analysis of 129 and C57BL/6J mouse strains revealed indels and
single nucleotide polymorphisms resulting in alternative or aberrant amino acid sequences in 1,084 genes in the 129-strain genome.
INTRODUCTION
Annotating the passenger mutations to the reported
genetically modified congenic mice that were generated
using 129-strain ESCs revealed that nearly all these mice
possess multiple passenger mutations potentially
influencing the phenotypic outcome. We illustrated this
phenotypic interference of 129-derived passenger
mutations with several case studies and developed the Me-PaMuFind-It web tool to estimate the number and possible
effect of passenger mutations in transgenic mice of interest.
METHODS
We analyzed the SNP data release v3 from the Mouse
Genome Project available at the Sanger Institute (Keane et al.,
2011). The data in the indel vcf file and SNP vcf file were
filtered to retrieve indels and SNPs present in at least one
of the three 129 strains (129P2/OlaH, 129S1/SvIm and
129S5SvEvB) and affecting the protein coding sequence
of the genes. These so-called protein coding variants are
based on the following sequence ontology (SO) terms:
stop gained, stop lost, inframe insertion, inframe deletion,
frameshift variant, splice donor variant, splice acceptor
variant, and coding sequence variant. In total, 949 indels
and 446 SNPs affecting 1,084 mouse genes were retained.
We gathered chromosome and gene start and end positions
for 1,084 genes covering 1,395 variations. The Ensembl
gene ID was used to find the most upstream and
downstream start and stop in all Ensembl transcripts for
that gene. Next these genome coordinates were used to
search for flanking genes within 2, 10, and 20 Mbps
upstream and downstream. We then downloaded all mouse
phenotypic allele data from the MGI resource and
extracted the data of genetically modified mouse lines.
Information on 5,322 genes (corresponding to 7,979 129-
derived genetically modified mouse lines) was connected
to genes with passenger mutations and affected genes.
Additionally we filtered the data to identify putative
regulatory variants. All data were stored in a MySQL
database and can be queried using the publicly available
web tool Me-PaMuFind-It:
http://me-pamufind-it.org/
FIGURE 1. Passenger genome mutations in gene-targeted mice (Nechanitzky and Mak, 2015).
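The SO-term filtering step described above can be sketched as follows. This is an illustrative reconstruction, not the actual Me-PaMuFind-It code: the assumption that consequence terms appear in the VCF INFO column, and the toy records used for illustration, are hypothetical.

```python
# Sketch of the variant-filtering step: keep VCF records whose annotated
# consequence matches one of the protein-coding SO terms listed above
# (written here in their underscored SO form). Not the authors' code; the
# assumption that consequences appear in the INFO column is hypothetical.

PROTEIN_CODING_SO_TERMS = {
    "stop_gained", "stop_lost", "inframe_insertion", "inframe_deletion",
    "frameshift_variant", "splice_donor_variant", "splice_acceptor_variant",
    "coding_sequence_variant",
}

def protein_coding_records(vcf_lines):
    """Yield VCF data lines whose INFO field mentions a protein-coding SO term."""
    for line in vcf_lines:
        if line.startswith("#"):          # skip meta and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        info = fields[7] if len(fields) > 7 else ""
        if any(term in info for term in PROTEIN_CODING_SO_TERMS):
            yield line
```

Running this over the indel and SNP VCF files, keeping only records seen in at least one of the three 129 strains, would yield the retained protein-coding variants.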
RESULTS & DISCUSSION
The vast majority of existing and well-characterized
genetically engineered congenic mice have been created
using 129 ESCs. 99.5% of these mouse lines are affected
by a median number of 20 passenger mutations within a
10 cM flanking region. This implies that nearly all
genetically modified congenic mice contain multiple
passenger mutations despite intensive backcrossing.
Consequently, the phenotypes observed in these mice
might be due to flanking passenger mutations rather than a
defect in the targeted gene (Vanden Berghe et al, 2015).
REFERENCES
Keane, T.M., Goodstadt, L., Danecek, P., White, M.A., Wong, K., Yalcin, B., Heger, A., Agam, A., Slater, G., Goodson, M., et al. (2011). Mouse genomic variation and its effect on phenotypes and gene regulation. Nature 477, 289–294.
Nechanitzky, R. and Mak, T.W. (2015). Passenger Mutations Identified in the Blink of an Eye. Immunity 43(1), 9–11.
Vanden Berghe, T., Hulpiau, P., Martens, L. et al. (2015). Passenger Mutations Confound Interpretation of All Genetically Modified Congenic Mice. Immunity 43(1), 200–209.
P27. DETECTING MIXED MYCOBACTERIUM TUBERCULOSIS INFECTION AND DIFFERENCES IN DRUG SUSCEPTIBILITY WITH WGS DATA
Arlin Keo1 & Thomas Abeel1,2,*.
Delft Bioinformatics Lab, Delft University of Technology, Delft, the Netherlands1; Broad Institute of MIT and Harvard, Cambridge, MA, USA2. *[email protected]
Mycobacterium tuberculosis is a bacterial pathogen that causes tuberculosis and infects millions of people. When a
person is infected with more than one distinct strain type of tuberculosis (TB), referred to as a mixed infection, diagnosis and treatment are complicated. Due to the difficulty of diagnosis, the prevalence of mixed infections among TB patients remains uncertain. Whole genome sequencing (WGS) yields a great number of single nucleotide polymorphisms (SNPs)
and offers increased resolution to distinguish distinct strains. Here, we present a tool that maps sample reads against 21
bp cluster specific SNP markers to detect putative mixed infections and estimate the frequencies of the present
subpopulations.
INTRODUCTION
Mycobacterium tuberculosis is a clonal, bacterial pathogen
that causes the pulmonary disease tuberculosis (TB), and it
infects and kills millions of people worldwide [1]. The
study of genetic diversity within the M. tuberculosis
complex (MTBC) is complicated by mixed TB infections,
which happens when a person is infected with more than
one distinct strain type of MTBC. This often results in
poor diagnosis and treatment of patients as the bacterial
subpopulation may have undetected differences in drug
susceptibility [2]. A strain typing method should be able to
distinguish closely related strains, to also allow the
detection of a mixed infection at finer resolutions [3]. This
study aims to detect a possible mixed TB infection at
different levels in MTBC and to determine the frequencies
of the present strains based on established tree paths in the
MTBC phylogenetic tree.
METHODS
A global comprehensive dataset of 5992 MTBC strains
was used for analysis, and 226570 SNPs were extracted
from this set to construct a SNP-based phylogenetic tree
with RAxML. In this bifurcating tree, each branch
represents a cluster of strains and splits into two new
monophyletic subclusters of genetically more closely
related strains. These “splits” were used to define clusters and subclusters that contain more than 10 strains. Global SNP
association was done for each cluster to get cluster-
specific SNPs, those for which the true positive rate, true
negative rate, positive predictive value, and negative
predictive value were >0.95. Markers were generated from
these SNPs by extending them with 10 bp sequence on
each side based on reference genome H37Rv. Each
hierarchical cluster now has a set of specific SNP markers.
By mapping sample reads against these 21 bp cluster-
specific SNP markers the tool determines the presence of
paths in the phylogenetic tree that start at the MTBC root
node. Paths that split indicate the presence of multiple
strains and thus a mixed infection.
The read depth at the root node represents a frequency of 1
of the present MTBC species. If the path splits further in
the tree, the total read depth is divided over the two
subpaths and determines the frequencies of those present
subclusters (Figure 1).
FIGURE 1. Detection of mixed TB infection with hierarchical clusters.
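The frequency estimation described above can be sketched as a recursive split of read depth over paths in the tree. This is an illustrative reconstruction under assumed data structures (a child-list tree and per-cluster marker read depths), not the authors' tool.

```python
# Sketch of the frequency estimation: the read depth entering the root is
# treated as frequency 1, and wherever the path splits, the incoming frequency
# is divided over the subpaths in proportion to their marker read depth.
# The tree representation and depth values are hypothetical, not the authors' code.

def subpopulation_frequencies(tree, depths, node="root", freq=1.0, out=None):
    """tree: dict node -> list of child clusters; depths: dict cluster -> read
    depth at that cluster's specific SNP markers.
    Returns {terminal_cluster: frequency} for the detected subpopulations."""
    if out is None:
        out = {}
    # children of this node whose cluster-specific markers are covered by reads
    present = [c for c in tree.get(node, []) if depths.get(c, 0) > 0]
    if not present:                       # path ends here: one subpopulation
        out[node] = freq
        return out
    total = sum(depths[c] for c in present)
    for child in present:
        # split the incoming frequency proportionally to read depth
        subpopulation_frequencies(tree, depths, child,
                                  freq * depths[child] / total, out)
    return out
```

A sample whose path splits (more than one terminal cluster in the result) would be flagged as a putative mixed infection.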
The detected strains are combined with detected drug
susceptibility profiles. A minimized reference genome
consisting of drug resistance genes and 1000 bp flanking
regions is used to map sample reads with BWA, and call
variants with Pilon. Ambiguous variation calls may
indicate that present strains in a mixed infection sample
also have differences in drug susceptibility.
RESULTS & DISCUSSION
In the phylogenetic tree, 308 clusters (MTBC root excluded) were defined, yielding 14,823 SNP markers in total that are specific to a cluster and unique within that cluster. The known MTBC lineages 1 to 6 each have between 355 and 614 markers.
7,661 TB samples were tested; present strain(s) and frequencies could be predicted for 7,495 samples, of which 914 (~12%) are mixed infections (Table 1).

# of subpopulations    1     2    3    >3
# of samples           6581  798  95   21

TABLE 1. 914 out of 7,495 samples are mixed infections.
REFERENCES
1. World Health Organization. Global Tuberculosis Report. World Health Organization, Geneva, Switzerland, 2014.
2. Zetola et al. Mixed Mycobacterium tuberculosis complex infections and false-negative results for rifampicin resistance by GeneXpert MTB/RIF are associated with poor clinical outcomes. J. Clin. Microbiol. 52:2422–2429, 2014.
3. Plazzotta, G., Cohen, T., and Colijn, C. Magnitude and sources of bias in the detection of mixed strain M. tuberculosis infection. J. Theor. Biol. 368:67–73, 2015.
P28. APPLICATION OF HIGH-THROUGHPUT SEQUENCING TO CIRCULATING MICRORNAS REVEALS NOVEL BIOMARKERS FOR DRUG-INDUCED LIVER INJURY
Julian Krauskopf1*, Florian Caiment1, Sandra Claessen1, Kent J. Johnson2, Roscoe L. Warner2, Shelli J. Schomaker3, Deborah A. Burt3, Jiri Aubrecht3, Jos C. Kleinjans1.
Department of Toxicogenomics, Maastricht University, Maastricht 6200 MD, The Netherlands1; Pathology Department, University of Michigan, Ann Arbor, MI 48109, USA2; Drug Safety Research and Development, Pfizer, Inc., Groton, CT 06340, USA3. *[email protected]
Drug-induced liver-injury (DILI) is a leading cause of acute liver failure and the major reason for withdrawal of drugs
from the market. Preclinical evaluation of drug candidates has failed to detect about 40% of potentially hepatotoxic
compounds in humans. At the onset of liver injury in humans, currently used biomarkers have difficulty differentiating
severe DILI from mild DILI and/or predicting the outcome of injury for individual subjects. Therefore, new biomarker
approaches for predicting and diagnosing DILI in humans are urgently needed. Recently, circulating microRNAs
(miRNAs) such as miR-122 and miR-192 have emerged as promising biomarkers of liver injury in preclinical species
and in DILI patients. In this study, we focused on examining global circulating miRNA profiles in serum samples from
subjects with liver injury caused by accidental acetaminophen (APAP)-overdose. Upon applying next generation high-
throughput sequencing of small RNA libraries, we identified 36 miRNAs, including three novel miRNA-like small
nuclear RNAs, which were enriched in serum of APAP overdosed subjects. The set comprised miRNAs that are
functionally associated with liver-specific biological processes and relevant to APAP toxic mechanisms. Although more
patients need to be investigated, our study suggests that profiles of circulating miRNAs in human serum might provide
additional biomarker candidates and possibly mechanistic information relevant to liver injury.
P29. INFORMATION THEORETIC MODEL FOR GENE PRIORITIZATION
Ajay Anand Kumar1,2*, Geert Vandeweyer1,2, Lut Van Laer1,2 & Bart Loeys1,2.
Department of Medical Genetics, University of Antwerp1; Biomedical Informatics, Antwerp University Hospital2.
The identification of top candidate genes involved in human diseases from a list of candidate genes remains computationally challenging. Many tools exist for this computational prioritization, and their core typically utilizes fusion or integration of various genomic annotation data sources. However, due to the rapid generation of novel data by high-throughput experiments, annotation sources often become outdated, leading to annotation errors. Hence, predictions based on these computational tools are not reliable. To tackle this, we propose an information theoretic model that effectively fuses annotation sources with a regression model under a Bayesian framework to prioritize candidate genes. Our method is fast and performs better than four existing tools on their own benchmark dataset.
INTRODUCTION
Gene prioritization has become a central research problem in the bioinformatics domain. With the advent of exome sequencing in clinical genetics, it has become necessary to automate the identification of the genes most likely involved in the disease from a given pool of affected genes. Various annotation sources can be integrated or fused to learn multiple functionalities of genes and then design a classifier/regressor for prioritization. We propose here an early data integration method that implements an information retrieval model to fuse the data at the functional feature level and then designs a discriminative regression model in a Bayesian framework to prioritize candidate genes.
METHODS
The principle behind our approach is guilt-by-association: genes that are known to be disease associated might also share similar functions. The idea is that a classifier or regressor can be trained on the linear mapping between the functional proximity profiles of genes and their phenotypic proximity profiles. We implemented a Bayesian regressor to infer the degree of association of the test genes with the query disease. The workflow is shown in Figure 1. The details are:
1. Functional annotation: text, ontologies (GO, MPO), sequence similarity, pathways, interactions. Phenotype annotation: Human Phenotype Ontology (HPO), Disease Ontology (DO), HuGE/MeSH terms and GAD.
2. TF-IDF (term frequency – inverse document frequency) methodology is used to assign statistical weights to the functional attributes of genes from these annotation sources. TF-IDF is a data-driven model traditionally used for information retrieval; we apply the same methodology for weighting features. Together, this gives gene-by-gene functional and phenotypic proximity profiles.
3. Finally, for a given set of query disease or training genes, the Bayesian linear regression model learns the linear mapping between the functional and phenotypic proximity profiles: Y = βX + η, where η is Gaussian distributed. We have incorporated traditional non-informative Normal-Inverse Gamma (NIG) priors for estimating the unknowns β and σ.
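The TF-IDF weighting in step 2 can be sketched as follows. This is a generic illustration of the technique, not the authors' implementation, and the toy annotation terms in the example are invented.

```python
import math

# Generic TF-IDF sketch for weighting gene annotation features (step 2 above).
# Each gene is treated as a "document" whose annotation terms (GO terms,
# pathways, ...) are the "words". Not the authors' code.

def tf_idf(gene_terms):
    """gene_terms: dict gene -> list of annotation terms.
    Returns dict gene -> {term: tf-idf weight}."""
    n_genes = len(gene_terms)
    # document frequency: in how many genes does each term occur?
    df = {}
    for terms in gene_terms.values():
        for t in set(terms):
            df[t] = df.get(t, 0) + 1
    weights = {}
    for gene, terms in gene_terms.items():
        weights[gene] = {}
        for t in set(terms):
            tf = terms.count(t) / len(terms)      # term frequency within the gene
            idf = math.log(n_genes / df[t])       # inverse document frequency
            weights[gene][t] = tf * idf
    return weights
```

Note that a term annotated to every gene receives idf = log(1) = 0, so uninformative annotations are automatically down-weighted before the proximity profiles are built.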
RESULTS & DISCUSSION
We performed a leave-one-out cross-validation experiment on the benchmark dataset that was used to compare four other tools whose design principles are similar to our method [1]. Our dataset consisted of 1,040 disease genes manually curated into 12 different disease classes [2]. In our preliminary results for 1,154 prioritizations, at cut-offs of the top 5%, 10% and 30% of genes ranked in a random control dataset, we achieved an AUROC of 86.31% against their best achieved score of 83.0%. This indicates that our method compares favorably with the other tools in the comparative analysis.
FIGURE 1. Workflow of the Bayesian regression model for gene prioritization.
Currently, we are performing a large-scale cross-validation with 6,762 manually curated disease–gene associations, a larger number of tools, and additional benchmark data [3]. Additionally, we plan to develop a probabilistic generative approach that models co-occurrences and dependencies of features for effective data fusion, which can help in finding novel disease-causing genes.
REFERENCES
1. Chen, B. et al. BMC Med Genomics 8(Suppl 3), S2 (2015).
2. Goh et al. Proc Natl Acad Sci USA 104(21), 8685–8690 (2007).
3. Börnigen, D. et al. Bioinformatics 28(23), 3081–3088 (2012).
P30. GALAHAD: A WEB SERVER FOR THE ANALYSIS OF DRUG EFFECTS FROM GENE EXPRESSION DATA
Griet Laenen1,2,*, Amin Ardeshirdavani1,2, Yves Moreau1,2 & Lieven Thorrez1,3.
Dept. of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven1; iMinds Medical IT Dept., KU Leuven2; Dept. of Development and Regeneration @ Kulak, KU Leuven3.
Galahad (https://galahad.esat.kuleuven.be) is a web-based application for the analysis of gene expression data from drug
treatment versus control experiments, aimed at predicting a drug’s molecular targets and biological effects. Galahad
provides data quality assessment and exploratory analysis, as well as computation of differential expression. Based on
the obtained differential expression values, drug target prioritization and both pathway and disease enrichment can be
calculated and visualized. Drug target prioritization is based on the integration of the gene expression data with a
functional protein association network.
INTRODUCTION
Gene expression analysis is frequently employed to study
the effects of drug compounds on cells. The observed
transcriptional patterns can provide valuable information
for identifying compound–protein interactions as well as
resulting biological effects. To facilitate the analysis of
this particular data type and enable an in-depth exploration
of a drug’s mode of effect, we have developed Galahad1.
INPUT
The main input for Galahad consists of raw Affymetrix human,
mouse or rat DNA microarray data derived from both
untreated control samples and samples treated with a drug
of interest. In addition, Galahad provides the possibility to
start from differential expression data derived with other
platforms to perform drug target prioritization and
enrichment analysis.
METHODS
The different analyses are depicted in Figure 1 and include:
- preprocessing of the raw data with RMA or MAS5.0, as indicated by the user;
- quality assessment and exploratory analysis to ascertain data quality, uncover experimental issues, and help in deciding whether certain arrays need to be considered as outlying;
- differential expression analysis to determine the significance of gene up- and downregulation following drug treatment;
- genome-wide drug target prioritization by means of an in-house developed algorithm for network neighborhood analysis integrating the expression data with functional protein association information2;
- prediction of molecular pathways involved in the drug's mode of effect;
- identification of associated disease phenotypes enabling side effect prediction and drug repositioning.
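As a minimal illustration of what the differential expression step computes per gene, a plain log2 ratio with a Welch-style statistic might look like the sketch below. This is not Galahad's actual implementation (which operates on Affymetrix arrays with established statistical methods); the expression values and the choice of a simple t-like statistic are illustrative assumptions.

```python
import math
from statistics import mean, stdev

# Minimal sketch of treated-vs-control differential expression for one gene.
# Real pipelines use moderated statistics; this only shows where a log2 ratio
# and a significance score conceptually come from. Not Galahad's code.

def differential_expression(control, treated):
    """control, treated: lists of log2-scale expression values for one gene.
    Returns (log2 ratio, Welch t statistic); a larger |t| means stronger
    evidence of differential expression."""
    log2_ratio = mean(treated) - mean(control)   # difference of log2 means
    se = math.sqrt(stdev(control) ** 2 / len(control)
                   + stdev(treated) ** 2 / len(treated))
    return log2_ratio, log2_ratio / se
```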
OUTPUT
The output is displayed in a series of tabs corresponding to the different analyses selected by the user:
- in the Quality Control and Data Exploration tabs, several diagnostic plots are displayed along with a short explanation;
- the Differential Expression tab contains a sorted table listing all genes together with their log2 ratios and P-values for differential expression, as well as links to the corresponding GeneCards sections;
- in the Drug Target Prioritization tab, a ranked list of genes as potential targets of the drug can be found, together with the network diffusion-based scores and P-values for prioritization, and links to the corresponding GeneCards section; in addition, a network-based visualization is available for each gene, showing the 10 interaction partners contributing most to the gene's ranking;
- the tabs summarizing the results for Pathway and Disease Enrichment contain a sorted table with pathway or disease ontology IDs, names, and database links, together with the number of differentially expressed genes in the corresponding gene sets and the accompanying P-values; in addition, network graphs are available, consisting of the top 10 most significant pathways or disease phenotypes, along with their associated genes colored according to fold change.
FIGURE 1. Overview of the Galahad analysis steps.
REFERENCES
1. Laenen, G. et al. Nucl Acids Res 43, W208–W212 (2015).
2. Laenen, G. et al. Mol BioSyst 9, 1676–1685 (2013).
P31. KMAD: KNOWLEDGE BASED MULTIPLE SEQUENCE ALIGNMENT FOR INTRINSICALLY DISORDERED PROTEINS
Joanna Lange1,2, Lucjan S. Wyrwicz1 & Gert Vriend2*.
Laboratory of Bioinformatics and Biostatistics, M. Sklodowska-Curie Memorial Cancer Center and Institute of Oncology1; CMBI, Radboud University Nijmegen2.
INTRODUCTION
Intrinsically disordered proteins (IDPs) lack tertiary
structure and thus differ from globular proteins in terms of
their sequence – structure – function relations. IDPs have a
lower sequence conservation, different types of active
sites, and a different distribution of functionally important
regions, which altogether makes their multiple sequence
alignment (MSA) difficult.
Algorithms underlying existing MSA programs are
directly or indirectly based on knowledge obtained from
studying three-dimensional protein structures. Here we introduce a tool for Knowledge-based Multiple sequence Alignment for intrinsically Disordered proteins, KMAD, that incorporates SLiM, domain, and PTM annotations to improve the alignments.
The KMAD web server is accessible at http://www.cmbi.ru.nl/kmad/. A standalone version is freely available.
METHODS
A dataset of proteins experimentally proven to be disordered was obtained from DisProt (Sickmeier et al., 2007). For each IDP, all homologous sequences were extracted from SwissProt (The UniProt Consortium, 2014) using BLAST.
The sequence sets were aligned with several MSA tools.
Apart from manual validation we also performed a
benchmark validation on reference sets from BAliBASE
(Thompson et al., 2005) and PREFAB holding structure-
based 'gold standard' sequence alignments. For this
purpose we used KMAD and a modified version of
KMAD, which performs a ’refinement’ of Clustal Omega
(Sievers et al., 2011) alignments.
RESULTS & DISCUSSION
Manual validation showed that KMAD avoids many of the mistakes made by Clustal Omega. An example of such an alignment mistake is shown in Figure 1.
FIGURE 1. Excerpts from (a) Clustal Omega and (b) KMAD alignments of human sialoprotein (SIAL_HUMAN) with four homologues. Various PTM kinds are highlighted with bright colours.
In the field of sequence alignment research it is common
practice to compare the sequence alignments obtained with
MSA software with those that are obtained from structure
superpositions. IDPs do not possess a static 3D structure
so that this method is not applicable to KMAD alignments.
Both of the validation methods that we used have their
disadvantages, but so far there is no alternative. Validation
on benchmark alignments of structured proteins is biased
towards Clustal Omega, because it was optimized to work
with structured proteins. On the other hand, the manual
inspection based on the same features that influence the
alignment is not a very elegant method, but given the
nature of IDPs probably the best we can do.
REFERENCES
Edgar, R.C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 32(5), 1792–1797.
Sievers, F., Wilm, A., Dineen, D., Gibson, T.J., Karplus, K., Li, W., Lopez, R., McWilliam, H., Remmert, M., Söding, J., Thompson, J.D., and Higgins, D.G. (2011). Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology 7, 539.
Sickmeier, M., Hamilton, J.A., LeGall, T., Vacic, V., Cortese, M.S., Tantos, A., Szabo, B., Tompa, P., Chen, J., Uversky, V.N., Obradovic, Z., and Dunker, A.K. (2007). DisProt: the Database of Disordered Proteins. Nucleic Acids Research 35(Database issue), D786–D793.
The UniProt Consortium (2014). Activities at the Universal Protein Resource (UniProt). Nucleic Acids Research 42(Database issue), D191–D198.
Thompson, J.D., Koehl, P., Ripp, R., and Poch, O. (2005). BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins: Structure, Function, and Bioinformatics 61(1), 127–136.
P32. ON THE LZ DISTANCE FOR DEREPLICATING REDUNDANT PROKARYOTIC GENOMES
Raphaël R. Léonard1,2*, Damien Sirjacobs2, Eric Sauvage1, Frédéric Kerff1 & Denis Baurain2.
Centre for Protein Engineering, University of Liège1; PhytoSYSTEMS, University of Liège2.
The fast-growing number of available prokaryotic genomes, along with their uneven taxonomic distribution, is a problem
when trying to assemble broadly sampled genome sets for phylogenomics and comparative genomics. Indeed, most of
the new genomes belong to the same subset of hyper-sampled phyla, such as Proteobacteria and Firmicutes, or even to
single species, such as Escherichia coli (almost 2000 genomes as of Sept 2015), while the continuous flow of newly
discovered phyla prompts for regular updates. This situation makes it difficult to maintain sets of representative genomes
combining lesser known phyla, for which only few species are available, and sound subsets of highly abundant phyla. An
automated straightforward method is required but none are publicly available. The LZ distance, in conjunction with the
quality of the annotations, can be used to create an automated approach for selecting a subset of representative genomes
without redundancy. We are planning to release this tool on a website that will be made publicly available.
INTRODUCTION
The LZ distance (Lempel and Ziv, 1977; Otu and Sayood,
2003) is inspired by compression algorithms, such as gzip
or WinRAR. This distance, amongst others, has already
been used in attempts to produce alignment-free
phylogenetic trees (Bacha and Baurain, 2005; Hohl et al.
2007), though the results were disappointing in such a
context (due to the heterogeneity of the substitution
process at large evolutionary scales). However, the LZ
distance is likely to provide enough resolving power to
identify groups of redundant genomes and to keep only
one representative for each group.
METHODS
For each pair of genomes A and B, the LZ distance is
computed from the gzip-compressed file lengths of the
corresponding nucleotide assemblies s(A) and s(B) and of
their concatenations s(A+B) and s(B+A). These distances,
along with taxonomic information, are stored in a
database.
A clustering method is then applied to regroup the similar
genomes into a user-specified number of groups. For each
of these groups, a representative is chosen based on the
quality of the genomic assemblies (chromosomes rather
than scaffolds) and of the protein annotations (e.g., few
rather than many “unknown proteins”).
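A gzip-based distance of this kind can be sketched in Python with zlib. The normalization below is borrowed from the normalized compression distance of Cilibrasi and Vitányi and is an assumption: the authors' exact combination of s(A), s(B), s(A+B) and s(B+A) may differ, though both concatenation orders are used as the methods describe.

```python
import zlib

def csize(data: bytes) -> int:
    """Length of the DEFLATE-compressed data, a proxy for LZ complexity."""
    return len(zlib.compress(data, 9))

def compression_distance(a: bytes, b: bytes) -> float:
    """Compression-based distance between two nucleotide assemblies.
    Uses both concatenation orders as in the methods; the normalization
    (Cilibrasi-Vitanyi NCD) is an assumption, not the authors' formula."""
    ca, cb = csize(a), csize(b)
    cab = min(csize(a + b), csize(b + a))
    return (cab - min(ca, cb)) / max(ca, cb)
```

Identical or near-identical genomes add almost nothing to the compressed size of the concatenation, so their distance approaches 0, while unrelated genomes approach 1; clustering on this matrix then groups redundant assemblies.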
RESULTS & DISCUSSION
Our method using the LZ distance is currently under
development using the genomes from the release 28 of
Ensembl Bacteria (ftp://ftp.ensemblgenomes.org/pub/
bacteria/release-28/). It contains 20,950 unique
prokaryotic genomes, composed of 286 Archaea and
20,664 Bacteria. The three most represented phyla are the
Proteobacteria (8642, of which 1980 E. coli), the
Firmicutes (7766) and the Actinobacteria (2673). These
genomes are already the result of a pre-processing step
designed to remove extra assemblies for strains present in
multiple copies (due to parallel sequencing or
resequencing in different labs).
We are working on different approaches for validating our
dereplication method, based on (1) current taxonomy, (2)
16S rRNA phylogeny, and (3) clustering using genomic
signatures (Moreno-Hagelsieb et al. 2013).
First, we compute a central measure of the taxonomic
“purity” of all genome clusters, which reflects the amount
of “mixture” at different taxonomic levels (phylum, class,
order etc). A good clustering should regroup different
genera (or species) without amalgamating distinct classes
(or phyla). Second, we cut the branches of a large 16S
rRNA tree based on the same genome collection to
produce an equal number of groups to compare with our
clustering method. We then compute a statistic of the
overlap between the 16S subtrees and the LZ clusters. A
good clustering should have a reasonable overlap with the
gold standard that is the 16S rRNA tree. Third, using the
same overlap metric, we compare the LZ clusters to
clusters obtained using the genomic signature.
Finally, an interactive tool will be made available through
a website. It will allow the users to download pre-
computed sets of representative genomes for either the
complete database or for taxonomic subsets. We are also
planning to allow users to upload their own genomes to
cluster them with the LZ method.
REFERENCES
Ziv, J. and Lempel, A. 1977. ‘A Universal Algorithm for Sequential Data Compression.’ IEEE Transactions on Information Theory 23.3. doi:10.1109/TIT.1977.1055714.
Otu, H.H. and Sayood, K. 2003. ‘A New Sequence Distance Measure for Phylogenetic Tree Construction.’ Bioinformatics 19.16: 2122–2130. doi:10.1093/bioinformatics/btg295.
Moreno-Hagelsieb, G., Wang, Z., Walsh, S. and Elsherbiny, A. 2013. ‘Phylogenomic Clustering for Selecting Non-Redundant Genomes for Comparative Genomics.’ Bioinformatics 29.1: 947–949. doi:10.1093/bioinformatics/btt064.
Höhl, M. and Ragan, M.A. 2007. ‘Is Multiple-Sequence Alignment Required for Accurate Inference of Phylogeny?’ Systematic Biology 56.2: 206–221. doi:10.1080/10635150701294741.
Bacha, S. and Baurain, D. 2005. ‘Application of Lempel-Ziv complexity to alignment-free sequence comparison of protein families.’ Benelux Bioinformatics Conference 2005. http://hdl.handle.net/2268/80179
10th Benelux Bioinformatics Conference bbc 2015
77
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: P Poster
P33. THE ROLE OF MIRNAS IN ALZHEIMER’S DISEASE
Ashley Lu1,2*, Annerieke Sierksma1,2, Bart De Strooper1,2 & Mark Fiers1,2.
VIB Center for the Biology of Disease1; KU Leuven Center for Human Genetics2.
MicroRNAs (miRNA) play an important role in post-transcriptional regulation and were shown to be dysregulated in
Alzheimer’s disease. By analysing the hippocampal miRNA and mRNA expression of two mouse models of Alzheimer’s
disease, we identify a set of miRNAs that are dysregulated with the onset of cognitive impairments. Using GO
enrichment analysis we aim to identify miRNAs that likely play a role in learning and memory.
INTRODUCTION
MiRNAs are small non-coding RNAs involved in post-
transcriptional regulation through mRNA inhibition or
degradation. Past studies have suggested miRNAs to play
a direct role in Alzheimer’s disease (AD), e.g. by
modulating the expression of genes involved in the
formation of neuropathological protein aggregates (Lau P
& De Strooper B, 2010). In this study, we investigated the
changes in miRNA and mRNA expression in two AD
mouse models: APPswe/PS1L166P
(Radde R, 2006) and
Thy-Tau22 (Schindowski K, 2006), which have similar
patterns of cognitive impairment, but different pathology.
We aim to better understand the functional role of
miRNAs in AD-related cognitive impairments.
METHODS
RNA was extracted from the left hippocampus of 96 mice.
The experiment covers the two models (APPswe/PS1L166P
& Thy-Tau22), with wild type controls for each. All
genotypes are tested at two ages (4 and 10 months); before
and after onset of cognitive impairment. This yields eight
experimental groups with twelve mice each.
Expression profiles of miRNAs and mRNAs were
generated using Illumina single-end sequencing.
Differential Expression (DE) analysis was performed
using the limma package of R/Bioconductor with a linear
model to test the effects of age, genotype and their
interaction.
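The linear model described above (effects of age, genotype and their interaction) corresponds to a design matrix of the following shape. The study itself used limma in R; this is a generic numpy sketch with toy group sizes, shown only to make the model structure concrete.

```python
import numpy as np

# Toy design: 2 ages x 2 genotypes, 2 mice per group (the study used 12).
ages = np.array([4, 4, 4, 4, 10, 10, 10, 10])
genotypes = np.array([0, 0, 1, 1, 0, 0, 1, 1])  # 0 = wild type, 1 = transgenic

age10 = (ages == 10).astype(float)
tg = genotypes.astype(float)
# Columns: intercept, age effect, genotype effect, age x genotype interaction.
X = np.column_stack([np.ones(len(ages)), age10, tg, age10 * tg])

# A gene whose expression rises only in old transgenic mice:
y = 2.0 + 1.5 * age10 * tg
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta[3])  # the interaction coefficient recovers the effect, ~1.5
```

Testing the interaction coefficient is what isolates expression changes that track the onset of cognitive impairment rather than age or genotype alone.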
Functional analysis of the mRNAs and miRNAs was conducted separately. For mRNAs, gene ontology analysis was applied to sets of the most up- and down-regulated genes.
To determine the functional impact of dysregulated
miRNAs we determined which mRNAs are the most likely
direct targets of each miRNA using the following
approach: 1) for each miRNA we calculated the Pearson’s
correlation coefficient to each mRNA based on the
miRNA and mRNA expression data. 2) For each miRNA
we extracted the predicted set of targets from Targetscan
(Lewis BP & Burge CB & Bartel DP, 2005), with Diana
(Maragkakis M et al. 2011) as backup when Targetscan
had no record. 3) We filtered the miRNA target genes by
determining the leading edge set in a GSEA PreRanked
analysis (Subramanian A. et al, 2005) using the predicted
target mRNAs of each miRNA against the mRNAs ranked
according to the Pearson’s scores generated in step 1. We
additionally investigated target sets based on a Pearson’s
correlation coefficient cut-off of -0.2, -0.3, and -0.4. 4)
Gene-ontology analysis was then applied to these
candidate target sets to infer the likely biological function
of each miRNA.
RESULTS & DISCUSSION
DE analysis showed that the direction of expression level changes in mRNAs is similar between APPswe/PS1L166P and Thy-Tau22 in terms of age*genotype interaction effects. However, for the miRNAs the expression pattern is less obvious. Overall, the effect size is more pronounced in the APPswe/PS1L166P mouse than in Thy-Tau22 for both miRNAs and mRNAs.
Functional analyses of the down-regulated mRNAs show a
clear enrichment in cognition and neural development
related categories, whereas up-regulated genes show a
clear inflammatory signature.
Combining miRNA target prediction with miRNA/mRNA
correlation analysis shows a marked increase of GO
enrichment scores. This analysis strongly suggests a
regulatory role for miRNAs in the down regulation of
genes involved in learning, cognition and related
categories.
This analysis workflow has allowed us to focus on a list of miRNAs that likely play a direct role in the observed learning and memory deficits in AD mouse models, and has been used to select candidate miRNAs for downstream in vivo experiments, which will hopefully provide a deeper understanding of the impact of AD on learning and cognition.
REFERENCES
Lau P & De Strooper B. Seminars in Cell & Developmental Biology 21(7), 768–773 (2010).
Radde R. EMBO Reports 7(9), 940–946 (2006).
Schindowski K. The American Journal of Pathology 169(2), 599–616 (2006).
Lewis BP, Burge CB & Bartel DP. Cell 120, 15–20 (2005).
Maragkakis M et al. Nucleic Acids Research (2011).
Subramanian A et al. Proceedings of the National Academy of Sciences of the United States of America 102(43), 15545–15550 (2005).
P34. FUNCTIONAL SUBGRAPH ENRICHMENTS FOR NODE SETS IN REGULATORY NETWORKS
Pieter Meysman1,2*, Yvan Saeys3,4, Ehsan Sabaghian5,6, Wout Bittremieux1,2, Yves van de Peer5,6, Bart Goethals1 & Kris Laukens1,2.
Advanced Database Research and Modeling (ADReM), University of Antwerp1; Biomedical informatics research center Antwerpen (biomina)2; VIB Inflammation Research Center3; Department of Respiratory Medicine, Ghent University4; Department of Plant Biotechnology and Bioinformatics, Ghent University5; Department of Plant Systems Biology, VIB/Ghent University6.
We have developed a subgroup discovery algorithm to find subgraphs in a single graph that are associated with a given
set of nodes. The association between a subgraph pattern and a set of vertices is defined by its significant enrichment
based on a Bonferroni-corrected hypergeometric probability value, and can therefore be considered as a network-focused
extension of traditional gene ontology enrichment analysis. We demonstrate the operation of this algorithm by applying it
on two transcriptional regulatory networks and show that we can find relevant functional subgraphs enriched for the
selected nodes.
INTRODUCTION
Frequent subgraph mining (FSM) is a common but
complex problem within the data mining field that has
gained in importance as more graph data has become
available. However, traditional FSM finds all frequent
subgraphs within the graph dataset, while often a more
interesting query is to find the subgraphs that are most
associated with a specific set of nodes. Nodes of interest
might be those that are associated with a specific disease,
or those that are differentially expressed in an omics
experiment.
METHODS
To address this issue, we developed a novel subgraph
mining algorithm that can efficiently construct, match and
test candidate subgraphs against the given graph for
enrichment within a specific set of nodes (Meysman et al.
2015). To allow the enrichment testing, each candidate
subgraph is built around a ‘source’ node. A subgraph
match where the source node corresponds to a node of
interest is counted as a ‘hit’. If the source node is not a
node of interest, it is counted as a background hit. In this
manner the problem of enrichment can be easily tested
using a hypergeometric test. Furthermore, we show that
this definition of enrichment allows us to drastically prune
the search space that the algorithm must traverse to find all
enriched subgraphs.
An implementation of the algorithm is available at
http://adrem.ua.ac.be/sigsubgraph.
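The hit/background counting described above reduces to a one-sided hypergeometric test per candidate subgraph. A minimal sketch with a hand-rolled survival function; the function names and toy counts are assumptions for illustration, not the published implementation.

```python
from math import comb

def hypergeom_sf(hits, matches, interest, total):
    """P(X >= hits) for X ~ Hypergeometric(total, interest, matches):
    drawing `matches` source nodes from `total` nodes, of which
    `interest` are nodes of interest."""
    return sum(comb(interest, k) * comb(total - interest, matches - k)
               for k in range(hits, min(matches, interest) + 1)) / comb(total, matches)

def subgraph_enrichment_p(hits, matches, interest, total, tests=1):
    """Bonferroni-corrected one-sided enrichment p-value for a candidate
    subgraph with `matches` source-node matches in the graph, `hits` of
    which fall inside the node set of interest."""
    return min(1.0, hypergeom_sf(hits, matches, interest, total) * tests)

# Toy graph: 1000 nodes, 50 of interest; a subgraph matches 20 source
# nodes, 10 of them nodes of interest: a strong enrichment.
print(subgraph_enrichment_p(10, 20, 50, 1000))
```

Because the statistic depends only on the hit count among source-node matches, an upper bound on achievable significance can be computed before extending a candidate subgraph, which is what enables the search-space pruning mentioned above.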
RESULTS & DISCUSSION
The first data set concerned the yeast genes that have
remained in duplicate following the most recent whole
genome duplication. Within the yeast transcriptional
network, we found that these duplicate genes were
enriched for self-regulating motifs (e.g. feedback loops,
self edges, etc.), which matches the duplicated nature of
these genes (Figure 1).
FIGURE 1. Enriched subgraphs for yeast duplicated genes
The second data set concerned mining the subgraphs
associated with the homologs of the PhoR transcription
factor across seven different inferred bacterial regulatory
networks from Colombos expression data (Meysman et al.
2014). These PhoR homologs were found to be
significantly associated with several complex regulatory
motifs.
REFERENCES
Meysman P et al. Discovery of Significantly Enriched Subgraphs Associated with Selected Vertices in a Single Graph. Proceedings of the 14th International Workshop on Data Mining in Bioinformatics (2015).
Meysman P et al. COLOMBOS v2.0: an ever expanding collection of bacterial expression compendia. Nucleic Acids Research 42(D1), D649–D653 (2014).
P35. HUMANS DROVE THE INTRODUCTION & SPREAD OF MYCOBACTERIUM ULCERANS IN AFRICA
Koen Vandelannoote1,2,*, Conor Meehan1*, Miriam Eddyani1, Dissou Affolabi3, Delphin Mavinga Phanzu4, Sara Eyangoh5, Kurt Jordaens6, Françoise Portaels1, Kirstie Mangas7, Torsten Seemann7, Herwig Leirs2, Tim Stinear7 & Bouke C. de Jong1.
Institute of Tropical Medicine, Antwerp, Belgium1; Evolutionary Ecology Group, University of Antwerp, Antwerp, Belgium2; Laboratoire de Référence des Mycobactéries, Cotonou, Benin3; Institut Médical Evangélique, Kimpese, Democratic Republic of Congo4; Centre Pasteur du Cameroun, Yaoundé, Cameroun5; Joint Experimental Molecular Unit, Royal Museum for Central Africa, Tervuren, Belgium6; Department of Microbiology and Immunology, University of Melbourne, Melbourne, Australia7. *[email protected]
Buruli ulcer (BU) is an insidious neglected tropical disease. BU is reported around the world but the rural regions of
West and Central Africa are most affected. How BU is transmitted and spreads has remained a mystery, even though the
causative agent, Mycobacterium ulcerans, has been known for more than 70 years. Here, using the tools of population
genomics, we reconstruct the evolutionary history of M. ulcerans by comparing 167 isolates spanning 48 years and
representing 11 endemic countries across Africa. The genetic diversity of African M. ulcerans proved very limited
because of its slow substitution rate coupled with its recent origin. We show for the first time how M. ulcerans has existed in Africa for several hundred years but was recently re-introduced during the period of Neo-imperialism. We also provide evidence of the role that the so-called “Scramble for Africa” played in the spread of the disease.
INTRODUCTION
The clonal population structure of M. ulcerans has meant
that conventional genetic fingerprinting methods have
largely failed to differentiate clinical disease isolates,
complicating molecular analyses on the elucidation of the
population structure, and the evolutionary history of the
pathogen. Whole genome sequencing (WGS) is currently
replacing conventional genotyping methods for M.
ulcerans.
METHODS
We analyzed a panel of 165 M. ulcerans disease isolates
originating from disease foci in 11 different African
countries that had been cultured between 1964 and 2012.
Index-tagged paired-end sequencing-ready libraries were
prepared from gDNA extracts. Genome sequencing was
performed on the Illumina HiSeq 2000 DNA sequencer or
the Illumina MiSeq sequencing platform with respectively
2x150bp and 2x250bp paired-end sequencing chemistry.
Read mapping and SNP detection were performed using
the Snippy v.2.6 pipeline. Bayesian model-based inference
of the genetic population structure was performed using
BAPS v.6.0 [1]. Evidence for recombination between different BAPS clusters was assessed using BRAT-NextGen [2]. We used BEAST2 v2.2.1 [3] to date evolutionary events, determine the substitution rate and produce a time-tree of African M. ulcerans. A permutation test was used to assess the validity of the temporal signal in the data. To assess the geospatial distribution of African M. ulcerans through time, an additional BEAST2 analysis was performed with a discrete BSSVS geospatial model [4].
RESULTS & DISCUSSION
Resulting sequence reads were mapped to the Ghanaian M.
ulcerans Agy99 reference genome and, after excluding
mobile repetitive elements and small indels, we detected a
total of 9,193 SNPs randomly distributed across the M.
ulcerans chromosome with approximately 1 SNP per 613
bp (0.15% nucleotide divergence). We explored the
distribution of DNA chromosomal deletions and identified
differential genome reduction that strongly supports the
existence of two specific M. ulcerans lineages within the
African continent, hereafter referred to as Lineage Africa I
(Mu_A1) and Lineage Africa II (Mu_A2). Subsequent
SNP-based exploration of the genetic population structure
agreed with the above deletion analysis and subdivided the
African M. ulcerans population into four major clusters.
BRAT-NextGen did not detect any recombined segments
in any isolate, supporting a strongly clonal population
structure for M. ulcerans that is evolving by vertically
inherited mutations. Within the phylogenetic tree, isolates
formed tight, shallow-rooted phylogenetic clusters which
are suggestive of contemporary dispersal. We estimated a
very slow mean genome-wide substitution rate of 6.32E-8 per site per year. The Bayesian analysis demonstrated that Mu_A1 has existed in Africa for several hundred years and that Mu_A2 was recently introduced on the continent. The re-introduction event coincides well with a historical event of particular interest: the period of Neo-imperialism (1881-1914). Since tMRCA(Mu_A2) did not predate colonization, it seems very likely that lineage Mu_A2 was introduced after the instigation of colonial rule through an influx of BU-infected humans. The time-tree of African M.
ulcerans also reveals evidence of the likely role that the
so-called “Scramble for Africa” played in the spread of
endemic Mu_A1 clones in three hydrological basins
(Congo, Oueme & Nyong) that are particularly well
covered by our isolate panel.
REFERENCES
1. Corander J et al. BMC Bioinformatics 9: 539 (2008).
2. Marttinen P et al. Nucleic Acids Research 40(1): e6 (2012).
3. Bouckaert R et al. PLoS Computational Biology 10(4): e1003537 (2014).
4. Lemey P et al. PLoS Computational Biology 5(9): e1000520 (2009).
P36. LEVERAGING AGO-SRNA AFFINITY TO IMPROVE IN SILICO SRNA
DETECTION AND CLASSIFICATION IN PLANTS
Lionel Morgado1* & Frank Johannes2,3.
Groningen Bioinformatics Centre (GBiC), University of Groningen1; Department of Plant Sciences, Center of Life and Food Sciences Weihenstephan, Technical University Munich2; Institute of Advanced Studies, Technical University Munich3.
Small RNAs (sRNA) have an important role in the regulation of gene expression, either through post-transcriptional
silencing or the recruitment of repressive epigenetic marks such as DNA methylation. In plants, the mode of action of a
given sRNA is tightly related with the Argonaute protein (AGO) to which it binds. High throughput sequencing in
combination with immunoprecipitation techniques have made it possible to determine the sequences of sRNA that are
bound to different families of AGO. Here we apply Support Vector Machines (SVM) to recent AGO-sRNA sequencing
data of A. thaliana to learn which sRNA sequence features govern their differential association with certain AGOs. Our
SVM classifiers show good sensitivity and specificity and provide a framework for accurate in silico sRNA detection and
classification in plants.
INTRODUCTION
Small RNA molecules are known to have an important
role in gene expression control. It is therefore of extreme
interest to be able to detect them and determine the
regulatory pathways in which they are involved. With the
current laboratory methods it is unfeasible to test the large number of sRNA candidates, but computational methods can greatly narrow down the list.
Nevertheless, sRNA activity is still far from being fully
understood and that is reflected in the very high false
positive rate of the prediction tools currently available.
High-throughput sequencing in combination with immunoprecipitation (IP) techniques now makes it possible to access the sRNA sequences associated with a specific AGO. AGO-sRNA binding is a fundamental step in the activation of specific silencing pathways. Here,
AGO-sRNA data acquired from A. thaliana is explored
with SVM-based algorithms to learn which sequence
features drive different AGO-sRNA associations. Using
this knowledge, a framework for in silico sRNA detection
and classification in plants is presented.
METHODS
A system with 3 layers of classifiers (see Figure 1) was designed to identify different kinds of sRNA: the 1st layer includes a binary SVM model that filters out sequences that do not bind to AGO and are therefore most probably inactive; the 2nd layer is composed of an ensemble of binary classifiers, each trained to explore the differences between sRNA bound to a specific AGO and all others; and finally, the 3rd layer comprises a multiclass linear model that assigns the most likely AGO to a given sRNA, using the scores produced in the previous layer.
Diverse AGO-sRNA libraries from A. thaliana were
explored, namely from AGO: 1, 2, 4, 5, 6, 7, 9 and 10.
After the typical RNA-seq library preprocessing, quality
check and genome mapping, several features were
extracted from the remaining sequences, namely: position
specific base composition, sequence length, k-mer
composition and entropy scores. The different feature sets
were explored separately and in different combinations.
Initially, highly correlated features (Pearson correlation > 0.75) were removed, and the remaining ones were further subjected to selection using SVM-RFE (Guyon et al., 2002) with a linear kernel to handle the large data set size. A 10-fold cross-validation procedure was used to account for variation in the data; in each round, the best features were determined as those with the highest average weight across the models with the best ROC-AUC score in each cross-validation subset. In each round, the worst-performing third of the remaining features was eliminated, and the process was repeated until no features remained. The best features found were then used to train the final classifiers using RBF kernels with optimal parameters. This was repeated for all models in layers 1 and 2.
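The screening loop described above (train a linear SVM, drop the worst-scoring third of the remaining features, repeat) can be sketched with scikit-learn's generic RFE on synthetic data; this is an assumed reimplementation of the general SVM-RFE idea, not the authors' pipeline or data.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
# Only the first three features carry signal for the binary label.
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

# Recursive feature elimination with a linear SVM; step=1/3 removes a
# third of the remaining features per round, as in the screening above.
selector = RFE(LinearSVC(max_iter=10000), n_features_to_select=3, step=1 / 3)
selector.fit(X, y)
print(np.where(selector.support_)[0])  # indices of the retained features
```

In the real setting, per-round feature weights would additionally be averaged across the best cross-validation models before each elimination step.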
FIGURE 1. Proposed architecture for the SVM-based framework.
RESULTS & DISCUSSION
Although the classifiers are still being optimized, preliminary results from the 2nd layer of the framework (see Figure 1) show that the features top-ranked by SVM-RFE indeed reflect significant biological patterns of AGO-sRNA association. Among others, the relevance of the 5’ terminal nucleotide was observed, in agreement with findings from previous work (Mi et al., 2008). Additionally, the accuracies of the trained models range from 71% to 86%, showing their capacity to recognize specific AGO-binding patterns.
REFERENCES
Guyon I et al. Gene selection for cancer classification using support vector machines. Mach Learn 46: 389–422 (2002).
Mi S et al. Sorting of small RNAs into Arabidopsis argonaute complexes is directed by the 5’ terminal nucleotide. Cell 133(1): 116–127 (2008).
Zhou A & Pawlowski WP. Regulation of meiotic gene expression in plants. Front Plant Sci 5: 413 (2014).
P37. ANALYSIS OF RELATIONSHIP PATTERNS
IN UNASSIGNED MS/MS SPECTRA
Aida Mrzic1,2*, Wout Bittremieux1,2, Trung Nghia Vu4, Dirk Valkenborg3,5,6, Bart Goethals1 & Kris Laukens1,2.
Advanced Database Research and Modeling (ADReM), University of Antwerp1; Biomedical informatics research center Antwerpen (biomina)2; Flemish Institute for Technological Research (VITO), Mol3; Karolinska Institutet, Stockholm4; CFP, University of Antwerp5; I-BioStat, Hasselt University6.
Tandem mass spectrometry (MS/MS) spectra generated in proteomics experiments often contain a large portion of unexplained peaks, despite continuous search engine improvements. Here we use a pattern mining technique to determine the origin of these unassigned spectra. We discover patterns that indicate the presence of chimeric spectra and missed post-translational modifications (PTMs).
INTRODUCTION
Despite being a rich source of information, mass spectra acquired in mass spectrometry proteomics experiments often contain a significant number of unexplained peaks, or even remain completely unidentified. The unexplained fraction of mass spectra may come from low-quality or chimeric MS/MS spectra, or unexpected PTMs. To interpret the unexplained data, we propose a structured analysis of the peaks occurring in MS/MS spectra. We employ an unsupervised pattern mining technique (Naulaerts et al., 2015) to discover which peaks are associated with each other, and are therefore likely to have a common origin.
METHODS
Frequent itemset mining
The technique we used to discover relationships between
frequently co-occurring peaks in MS/MS data is frequent
itemset mining, a class of data mining techniques that is
specifically designed to discover co-occurring items in
transactional datasets. The typical example of frequent
itemset mining is the discovery of sets of products that are
frequently bought together. Here, every set of products
purchased together represents a single transaction, which
results in a dataset consisting of a large number of
supermarket basket transactions that can be mined for
frequent patterns (Figure 1). In our approach a transaction
consists of the mass differences between relevant peaks in
the MS/MS spectrum.
FIGURE 1. Frequent itemset mining principle.
Mass differences associations
In order to detect relationships between different types of
mass spectrometry peaks, a distinction is made between
peaks that were relevant for spectrum identification
(assigned peaks) and peaks that were not used for the
identification (unassigned peaks) (Vu et al., 2014). The
mass differences between peaks (either assigned,
unassigned, or both) are then calculated so that for each
MS/MS spectrum in the dataset there is a single
transaction consisting of all its mass differences.
After obtaining these transactions for all MS/MS spectra
in the dataset, frequent itemset mining can be employed to
detect relationship patterns (Figure 2). These patterns can
indicate previously unknown characteristics of the spectra,
or even detect novel PTMs.
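The construction of transactions from binned pairwise mass differences, followed by frequent itemset mining, can be sketched as follows. The bin width, toy peak lists and the naive miner are illustrative assumptions; real spectra require an optimized itemset miner.

```python
from itertools import combinations

def transactions_from_spectra(spectra, bin_width=0.01):
    """One transaction per spectrum: the set of binned pairwise
    mass differences between its peaks."""
    txns = []
    for peaks in spectra:
        diffs = {round(abs(a - b) / bin_width) * bin_width
                 for a, b in combinations(peaks, 2)}
        txns.append(diffs)
    return txns

def frequent_itemsets(txns, min_support, max_size=2):
    """Naive enumeration of frequently co-occurring mass differences.
    Fine for toy data; real datasets need an optimized miner."""
    items = {i for t in txns for i in t}
    frequent = {}
    for size in range(1, max_size + 1):
        for cand in combinations(sorted(items), size):
            support = sum(1 for t in txns if set(cand) <= t)
            if support >= min_support:
                frequent[cand] = support
    return frequent

# Toy peak lists in which an 18.01 Da (water-loss-like) difference
# recurs and twice co-occurs with a 79.97 Da (phosphate-related) one.
spectra = [
    [100.0, 118.01, 197.98],
    [200.0, 218.01, 297.98],
    [150.0, 168.01],
]
fi = frequent_itemsets(transactions_from_spectra(spectra), min_support=2)
```

Recurring itemsets of mass differences like these are exactly the signal that points to shared PTMs or co-fragmenting peptides across spectra.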
FIGURE 2. Outline of the approach.
RESULTS & DISCUSSION
In order to evaluate our approach, we used MS/MS
datasets from the PRoteomics IDEntifications (PRIDE)
database (Vizcaino et al., 2013). This database contains a
large number of publicly available datasets from mass-
spectrometry-based proteomics experiments. However, the
quality of the submitted datasets can be subject to large variability, which makes the database a suitable candidate for our pattern mining approach.
Preliminary results show that the detected patterns are able
to capture valid information in a spectrum. The obtained
patterns indicate peaks originating from the same peptide
in case of chimeric spectra and mass differences
originating from common PTMs.
REFERENCES
Naulaerts et al. Brief Bioinform 16(2): 216–231 (2015).
Vizcaino et al. Nucleic Acids Res 41(D1): D1063–D1069 (2013).
Vu et al. Proteome Science 12: 54 (2014).
P38. MINING ACROSS “OMICS” DATA FOR DRUG PRIORITIZATION
Stefan Naulaerts1,2*, Pieter Meysman1,2, Bart Goethals1, Wim Vanden Berghe3 & Kris Laukens1,2.
Advanced Database Research and Modeling (ADReM), University of Antwerp1; Biomedical informatics research center Antwerpen (biomina)2; Department for Biomedical Sciences, University of Antwerp3.
Drug resistance and response have traditionally been investigated by means of case-by-case studies. The process of profiling drug compounds is time and resource intensive. Large-scale information on gene expression and protein abundance, protein interactions, and functional and pathway annotations is now available, along with freely accessible repositories of drug targets. Structural evidence for selected drug compounds is also publicly available. These data offer an enormous opportunity for data integration and pattern mining efforts across each of these levels. Here, we apply frequent itemset mining to identify structurally similar compounds, and to detect patterns within the biological effect profiles of these chemical compound families. Next, we explore how we can link both types of patterns to meta-information (such as drug interactions) in a bid to identify promising compounds and speed up the drug discovery process by means of candidate prioritization.
INTRODUCTION
In the last decades, several widely used databases have
emerged. These vary from gene expression data and mass-
spectrometric protein identifications to resources covering
interaction graphs or functional annotations of proteins
and chemicals.
The presence of these resources offers interesting
opportunities to gain deeper insight in drug mode of action,
as well as help reduce important bottlenecks with regards
to the speed of novel drug discovery or drug repurposing,
by intelligently prioritizing potentially interesting
compounds.
METHODS
To integrate the listed kinds of data, we use pattern mining methods that are collectively known as "frequent itemset mining". This set of techniques uses clever heuristics to efficiently find sets of items that co-occur more often than a minimal support threshold. In this work, we identified several pattern types based on their source:
Expression itemsets
Metadata itemsets
Graph patterns (protein-protein, protein-drug and chemical structures)
For subgraph mining, we used GASTON [1]. All other data sources were analysed with Apriori [2].
To deal with the extreme numbers of patterns that result from mining this kind of data, we used a filter that incorporates several quality measures based on objective data mining measures (e.g. lift), as well as more biologically inspired methods (e.g. functional coherence in the Gene Ontology [3] tree).
Simple classification based on the patterns was performed with CBA [4].
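One of the objective quality measures mentioned above, lift, compares a pattern's observed co-occurrence with the frequency expected under independence; patterns with lift near 1 are uninformative and can be filtered out. A minimal sketch, with illustrative counts:

```python
def lift(support_ab, support_a, support_b, n):
    """Lift of pattern {A, B}: observed co-occurrence frequency divided
    by the frequency expected if A and B were independent, computed
    from raw support counts over n transactions."""
    return (support_ab * n) / (support_a * support_b)

# Toy counts: in 100 transactions, A occurs 20x, B 10x, together 8x.
print(lift(8, 20, 10, 100))  # 4.0: four times more co-occurrence than chance
```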
RESULTS & DISCUSSION
We were able to identify several backbone patterns within
the chemical structures studied and used these to define
“chemical compound families”. Next, we used this
classification as starting point to group experimental
evidence (bio-assays, interactions and metadata). After
applying cut-offs based on the quality measures, all
patterns remaining were significant and made sense
biologically.
Unsurprisingly, structurally similar compound families
show significant pattern overlaps in drug-drug interactions,
gene expression, term co-occurrence and conserved
protein-protein interactions. We found that specific
patterns in the biological profile often correlate with
specific discriminative structural patterns. Moreover, these
collections of structural frequent subgraphs seemed highly
relevant for the mode in which a compound connects to
the “core” proteome. This central proteome performs
essential functions of the cell (e.g. energy metabolism) and
it is known to be conserved across cell types. Structurally distinct compound families converge on the same “core” proteins much later (if at all) than more similar chemicals do. This observation corresponds to currently known pathway knowledge and tissue biology.
We were further able to associate previously unseen
compounds to chemicals present in the database, based on
the subgraph collection and by extension to the biological
profile patterns. Manual survey of literature indicated that
several compounds not covered by our database have
recently been approved or are in testing as alternative
drugs to the compounds we hypothesized as being
substantially similar.
FIGURE 1. Visualizing the dexamethasone environment. Both predictions
and experimental evidence (drug-target and protein-protein interactions) are shown.
REFERENCES
1. Nijssen S & Kok J. ENTCS 127, 77–87 (2005).
2. Agrawal R & Srikant R. Proc 20th Int Conf on Very Large Databases (1994).
3. Ashburner M et al. Nat Genet 25, 25–29 (2000).
4. Liu B et al. KDD (1998).
P39. ABUNDANT TRANS-SPECIFIC POLYMORPHISM AND A COMPLEX HISTORY OF NON-BIFURCATING SPECIATION IN THE GENUS ARABIDOPSIS
Polina Novikova1, Nora Hohmann2, Marcus Koch2 & Magnus Nordborg1.
Gregor Mendel Institute, Austrian Academy of Sciences, Vienna Biocenter (VBC), A-1030 Vienna, Austria1; Centre for Organismal Studies Heidelberg, University of Heidelberg, D-69120 Heidelberg, Germany2.
The prevailing notion of species rests on the concept of reproductive isolation. Under this model, sister taxa should not
share genetic variation unless they still hybridize, or diverged too recently for genetic drift to have eliminated shared
ancestral polymorphism, and gene trees should generally agree with species trees. Advances in sequencing technology
are finally making it possible to evaluate this model. We sequenced (Illumina 100bp paired reads) multiple individuals
from 26 proposed taxa in the genus Arabidopsis. Cluster analysis identified seven distinct groups, corresponding to four
common species — the model species A. thaliana, plus A. arenosa, A. halleri and A. lyrata — and three species with
very limited geographical distribution. However, at the level of gene trees, only the separation of A. thaliana from the
remaining taxa was universally supported, and even in this case there was abundant sharing of ancestral polymorphism
with the other taxa, demonstrating that reproductive isolation must be fairly recent. By considering the distribution of
derived alleles, we were also able to reject a bifurcating species tree because there is clear evidence for asymmetrical
gene flow between taxa. Finally, we show that the pattern of sharing and divergence between taxa differs between gene
ontologies, suggesting a role for selection.
P40. RIBOSOME PROFILING ENABLES THE DISCOVERY OF SMALL OPEN READING FRAMES (SORFS), A NEW SOURCE OF BIOACTIVE PEPTIDES
Volodimir Olexiouk1,*, Jeroen Crappé1, Steven Verbruggen1 & Gerben Menschaert1,*.
Lab of Bioinformatics and Computational Genomics (BioBix), Department of Mathematical Modelling, Statistics and Bioinformatics, Faculty of Bioscience Engineering, Ghent University1.
INTRODUCTION
Evidence for micropeptides, defined as translation products of small open
reading frames (sORFs), has recently emerged. Limitations of sequencing
technologies as well as of proteomics had long stalled the discovery of
micropeptides. It is the advent of ribosome profiling (RIBO-SEQ), a
next-generation sequencing technique revealing the translation machinery
at sub-codon resolution, that provided evidence in favor of translated
sORFs. RIBO-SEQ captures and subsequently sequences the ~30 nt mRNA
fragments protected within ribosomes, providing a means to identify
translated sORFs that possibly encode functional micropeptides. Since the
advent of ribosome profiling, several micropeptides with important
cellular functions have been described (e.g. Toddler, Pri-peptides,
Sarcolipin and Myoregulin).
METHODS
RIBO-SEQ allows the identification of sORFs with ribosomal activity;
however, to further assess the coding potential (i.e. whether a sORF
truly encodes a functional micropeptide), downstream analysis is
necessary. Here we propose a pipeline that starts from RIBO-SEQ data,
implements state-of-the-art tools and metrics assessing the coding
potential of sORFs, and creates a list of candidate sORFs for downstream
analysis (e.g. proteomic identification). In summary, assessment of the
coding potential includes: PhyloCSF (conservation analysis), the FLOSS
score (ribosome-protected fragment (RPF) length distribution analysis),
the ORFscore (analysis of the distribution of RPFs over the three frames
of a coding sequence (CDS)), BLASTp (sequence similarity) and VarAn
(genetic variation analysis). In an attempt to set a community standard,
and to make sORFs accessible to a larger audience, a public database
(www.sorfs.org) is provided in which publicly available datasets were
processed by this pipeline, allowing users to browse, query and export
identified sORFs. Furthermore, a PRIDE-respin pipeline was developed to
periodically search the PRIDE database for proteomic evidence.
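Of the metrics listed above, the ORFscore lends itself to a compact illustration. The sketch below follows the published definition by Bazzini et al. (2014); the pipeline's actual implementation may differ in details such as read filtering:

```python
import math

def orf_score(frame_counts):
    """ORFscore (Bazzini et al., 2014): a chi-squared-like statistic on the
    distribution of ribosome-protected fragment (RPF) 5' ends over the three
    reading frames, negated when frame 1 does not dominate."""
    f1, f2, f3 = frame_counts
    total = f1 + f2 + f3
    if total == 0:
        return 0.0
    mean = total / 3.0
    chi2 = sum((f - mean) ** 2 / mean for f in frame_counts)
    score = math.log2(chi2 + 1.0)
    # A translated sORF should accumulate RPFs in its own frame;
    # penalise ORFs where another frame dominates.
    return score if f1 >= f2 and f1 >= f3 else -score

print(orf_score([90, 5, 5]))    # strong frame-1 bias: clearly positive
print(orf_score([34, 33, 33]))  # near-uniform coverage: close to zero
```

A high positive ORFscore thus supports translation of the sORF in the annotated frame.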
RESULTS & DISCUSSION
The pipeline has been tested and curated on three different cell lines:
HCT116 (human), E14 mESC (mouse) and S2 (fruit fly). The results were
similar to those reported in recent literature, supporting its relevance.
All metrics stated above were carefully inspected for their biological
relevance and contributed significantly to the detection of sORFs. The
pipeline is currently being finalized but is available upon request. The
public repository is accessible at http://www.sorfs.org and includes the
datasets mentioned above, resulting in 263,354 sORFs. Two querying
interfaces were implemented: a default query interface intended for
browsing sORFs, and a BioMart query interface for advanced querying and
export functions. Each sORF has its own detail page visualizing the
metrics discussed above together with the ribosome profiling data, and a
link to the UCSC browser is provided to visualize the RIBO-SEQ data.
REFERENCES
Pauli, A., Norris, M.L., Valen, E., Chew, G.-L., Gagnon, J.A., Zimmerman, S., Mitchell, A., Ma, J., Dubrulle, J., Reyon, D., et al. (2014) Toddler: an embryonic signal that promotes cell movement via Apelin receptors. Science, 343, 1248636.
Crappé, J., Ndah, E., Koch, A., Steyaert, S., Gawron, D., De Keulenaer, S., De Meester, E., De Meyer, T., Van Criekinge, W., Van Damme, P., et al. (2014) PROTEOFORMER: deep proteome coverage through ribosome profiling and MS integration. Nucleic Acids Res., 10.1093/nar/gku1283.
Ingolia, N.T. (2014) Ribosome profiling: new views of translation, from single codons to genome scale. Nat. Rev. Genet., 15, 205-213.
Crappé, J., Van Criekinge, W., Trooskens, G., Hayakawa, E., Luyten, W., Baggerman, G. and Menschaert, G. (2013) Combining in silico prediction and ribosome profiling in a genome-wide search for novel putatively coding sORFs. BMC Genomics, 14, 648.
Chanut-Delalande, H., Hashimoto, Y., Pelissier-Monier, A., Spokony, R., Dib, A., Kondo, T., Bohère, J., Niimi, K., Latapie, Y., Inagaki, S., et al. (2014) Pri peptides are mediators of ecdysone for the temporal control of development. Nat. Cell Biol., 16.
P41. RIGAPOLLO, A HMM-SVM BASED APPROACH TO SEQUENCE ALIGNMENT
Gabriele Orlando1,2,3,4, Wim Vranken1,2,3 & Tom Lenaerts1,4,5.
Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, La Plaine Campus, Triomflaan, CP 2631; Structural Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 22; Structural Biology Research Center, VIB, 1050 Brussels, Belgium3; Machine Learning Group, Université Libre de Bruxelles, Brussels, 1050, Belgium4; Artificial Intelligence Lab, Vrije Universiteit Brussel, Brussels, 1050, Belgium5.
INTRODUCTION
Reliable protein sequence alignments are a central requirement for many
bioinformatics tools, such as homology modelling. Over the years many
different algorithms have been developed, and different kinds of
information have been used to align very divergent sequences [1]. Here we
present a pairwise alignment tool, called Rigapollo, based on a pairwise
HMM-SVM, which includes backbone dynamics predictions [2] in the
alignment process: recent work suggests that protein backbone dynamics is
often evolutionarily conserved and contains information orthogonal to
amino acid conservation.
METHODS
Rigapollo uses a pairwise HMM-SVM alignment approach to infer the optimal
alignment between two proteins, taking into consideration both sequence
and dynamics information. The model (described in Figure 1) is composed
of three states: M (match), G1 (gap in the first sequence) and G2 (gap in
the second sequence). The transition probabilities are defined in the
same way as in a standard HMM. This new alignment tool is further
designed in the following manner:
Defining the N-dimensional feature vectors:
Each amino acid in the sequences is described by an N-dimensional feature
vector. This vector can be defined using any kind of information, ranging
from evolutionary information (i.e. PSSMs calculated with HHblits [3]) to
dynamics predictions (using the DynaMine predictor [2]). While standard
pairwise HMMs require the definition of a finite and discrete alphabet of
observable states, our model works directly on these feature vectors
(which need not be orthonormal), evaluating the emission probability with
a support vector machine (SVM).
Defining the emission probability:
We define the emission probability using an SVM trained to discriminate
matches from mismatches. We define as matches all positions in the
reference pairwise alignments that do not contain gaps, and we use the
concatenation of the previously defined feature vectors to describe them.
These matches are considered positive hits. For the mismatches, we
perform the same procedure, but pair positions that are shifted by 5 to
10 amino acids in the reference alignment. After training, the predicted
emission probability for the M state, given the concatenation of two
feature vectors, is a function of the distance from the decision
hyperplane of the SVM (called f(D)). The corresponding emission
probabilities for the states G1 and G2 are modeled as 1-f(D).
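The emission model above can be sketched in a few lines. The logistic form chosen for f(D), and all variable names, are our assumptions: the abstract only states that the emission probability is a function of the hyperplane distance, and that in Rigapollo (w, b) come from an SVM trained on match vs. mismatch pairs.

```python
import numpy as np

def svm_distance(x_pair, w, b):
    """Signed distance of a concatenated feature-vector pair from the SVM
    decision hyperplane defined by weight vector w and offset b."""
    return (np.dot(w, x_pair) + b) / np.linalg.norm(w)

def emission_probs(x_pair, w, b):
    """Map the hyperplane distance D to emission probabilities:
    f(D) for the match state M, and 1 - f(D) for the gap states G1, G2."""
    d = svm_distance(x_pair, w, b)
    f = 1.0 / (1.0 + np.exp(-d))  # logistic squashing keeps f(D) in (0, 1)
    return {"M": f, "G1": 1.0 - f, "G2": 1.0 - f}

# Toy example: two 2-dimensional feature vectors concatenated into one pair.
w = np.array([1.0, -0.5, 0.25, 0.0])
p = emission_probs(np.array([2.0, 1.0, 0.0, 3.0]), w, b=-0.5)
```

A pair on the "match" side of the hyperplane (positive D) thus receives an M-state emission above 0.5, and the two gap states share the complement.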
RESULTS & DISCUSSION
To evaluate the performance of Rigapollo, we adopted two publicly
available subsets of the BAliBASE and SABmark alignment datasets, already
used to evaluate other pairwise alignment tools [1]. From the MSAs, all
pairwise alignments were extracted, and those sharing a percentage of
sequence identity equal to the median of the full database were put in
the subset. The datasets consist of 38 and 123 manually curated,
structure-based pairwise alignments, respectively, and they share very
low sequence identity. For the evaluation we performed a 10-fold
randomized cross-validation. Rigapollo increases the quality of
low-sequence-identity pairwise alignments by 5 to 10% with respect to
state-of-the-art methods, and the increase in performance appears more
marked for very divergent sequences, such as those in the SABmark
dataset, where the dynamics information significantly increases the
quality of the alignment. This is probably because dynamics are often
well conserved in functional patterns, even when the sequence is not
preserved [2].
REFERENCES
[1] Do, Chuong B., et al. Research in Computational Molecular Biology. Springer Berlin Heidelberg, 2006.
[2] Cilia, Elisa, et al. Nucleic Acids Research 42.W1 (2014): W264-W270.
[3] Remmert, Michael, et al. Nature Methods 9.2 (2012): 173-175.
Figure 1: Structure of the pairwise HMM-SVM model
P42. EARLY FOLDING AND LOCAL INTERACTIONS
R. Pancsa1, M. Varadi1, E. Cilia2,3, D. Raimondi1,2,3 & W. F. Vranken1,3,*.
Structural Biology Research Centre, VIB and Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium1; Machine Learning Group, Université Libre de Bruxelles, Brussels, Belgium2; Interuniversity Institute of Bioinformatics in Brussels (IB2), Brussels, Belgium3.
INTRODUCTION
Protein folding is in its early stages largely determined by
the protein sequence and complex local interactions
between amino acids, resulting in the formation of foldons
that provide the context for further folding into the native
state. These early folding processes are therefore
important to understand subsequent folding steps and their
influence on, for example, aggregation, but they are
difficult to study experimentally. We here address this
issue computationally by assembling and analysing a
dataset on early folding residues from hydrogen deuterium
exchange (HDX) data from NMR and MS, and analyse
how they relate to the sequence-based backbone dynamics
predictions from DynaMine (Cilia et al. 2013, 2014) and
evolutionary information from multiple sequence
alignments.
METHODS
We assembled a dataset of HDX experimental data from NMR and MS from the
literature for 57 proteins, totalling 4172 residues. The data were
classified into early, intermediate and late classes depending on the
folding time at which protection of the backbone NH was observed, and
into strong, medium and weak classes depending on how long the amides
remain protected upon unfolding of the native state. This resulted in 219
residue sets that are organised in XML files and loaded into a database
that is made available online via http://start2fold.eu.
The DynaMine predictions were run locally with a new version of the
software that handles C- and N-terminal effects. These original
predictions were then normalised by shifting them so that the maximum
prediction value for each protein is always 1.0, which does not affect
the relative differences between the prediction values within each
protein but effectively normalises the values between different proteins.
MSAs were generated for each sequence in the dataset using HHblits and
Jackhmmer with 3 iterations and an E-value threshold of 10⁻⁴. All
retrieved homologs have at least 90% coverage of the query sequence.
Using HHfilter, a post-processing tool provided in the HHblits package,
we built two different sets of MSAs by varying the maximum pairwise
sequence identity threshold between the collected homologs in each MSA.
The (ungapped) sequences in the MSAs were predicted without normalisation
in order to preserve the differences within a protein family, and mapped
back to the full (gapped) MSA.
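The per-protein normalisation described above is a simple shift; a minimal sketch (function and variable names are ours):

```python
import numpy as np

def normalise_per_protein(preds):
    """Shift DynaMine predictions so that the per-protein maximum is 1.0.
    A shift (rather than a rescale) leaves all pairwise differences within
    the protein untouched while making values comparable across proteins."""
    preds = np.asarray(preds, dtype=float)
    return preds + (1.0 - preds.max())

raw = [0.62, 0.81, 0.75, 0.90]
norm = normalise_per_protein(raw)  # max is now exactly 1.0
```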
RESULTS & DISCUSSION
Our analysis shows that the DynaMine-predicted rigidity
of the protein backbone represents where the protein is
likely to adopt specific lower free energy conformations
based on sequence-encoded local interactions, as
evidenced by the HDX data on early folding (Figure 1).
This effect is also present on a per-residue basis.
FIGURE 1. Distribution of DynaMine predictions for early folding residues (green) and non-early folding residues (brown) for the original
(left) and normalized (right) values.
When relating the secondary structure elements as
observed in the native fold to the early folding residues,
we observe that the ‘early folding’ secondary structure
elements also tend to be more rigid overall. Finally, we
examined whether early folding is conserved in evolution
on the basis of multiple sequence alignments. Although
there is no conservation of individual amino acids, the
physical characteristic of a rigid backbone seems to be
conserved.
We therefore propose that the backbone dynamics of the
protein is a fundamental physical feature conserved by
proteins that can provide important insights into their
folding mechanisms and stability.
REFERENCES
Cilia, E., Pancsa, R., Tompa, P., Lenaerts, T., & Vranken, W. F. (2013). From protein sequence to dynamics and disorder with DynaMine. Nature Communications, 4, 2741. http://doi.org/10.1038/ncomms3741
Cilia, E., Pancsa, R., Tompa, P., Lenaerts, T., & Vranken, W. F. (2014). The DynaMine webserver: predicting protein dynamics from sequence. Nucleic Acids Research, 42(W1), W264-W270. http://doi.org/10.1093/nar/gku270
P43. BINDING SITE SIMILARITY DRUG REPOSITIONING:
A GENERAL AND SYSTEMATIC METHOD FOR DRUG DISCOVERY
AND SIDE EFFECTS DETECTION
Daniele Parisi & Yves Moreau.
I developed a protocol based on the prediction of druggable cavities, the
comparison of these putative binding sites, and cross-docking of bound
ligands into binding sites detected to be similar to that of the complex,
in order to study the cross-reactivity of known compounds. The method is
general because it finds applications both in drug repositioning and in
the study of adverse effects, and it is systematic because it consists of
several subsequent steps. It can indicate which ligands to screen,
reducing the number of candidates and allowing companies or universities
to save the money and time of unnecessary tests.
INTRODUCTION
The ability of small molecules to interact with multiple proteins is
referred to as polypharmacology [1], and the strategy that aims to
exploit the positive aspects of polypharmacology is drug repositioning,
whereby existing drugs are investigated for efficacy against targets for
other indications. Existing drugs are privileged structures with verified
bioavailability and compatibility. Furthermore, virtual screening allows
one to reposition existing drugs against novel disease targets without
the expense of purchasing thousands of compounds [2]. The combination of
structure-based virtual screening (such as estimation of the similarity
of protein-ligand binding sites and subsequent cross-docking) with drug
repositioning represents a highly efficient and fast methodology for
predicting cross-reactivity and putative side effects of drug
candidates [3].
METHODS
Each step of the protocol maps to a bioinformatics technique or tool, so
the protocol couples several pieces of software:
1. choice of the query (a single protein as a PDB file) and the
   templates (a set of PDB structures); at least one of the two
   categories has to present a ligand bound in a cavity;
2. prediction of druggable cavities in all the protein structures using
   a geometry-based or an energy-based algorithm (here Fpocket, a
   geometry-based tool);
3. comparison of the query binding sites to the binding sites of the
   templates to assess their similarity, carried out by an alignment or
   alignment-free algorithm (here Apoc, an alignment-based tool);
4. cross-docking of the ligand available in a pair of similar binding
   sites into the other cavity, in order to study the binding with a
   different target for toxicity or new therapeutic indications
   (AutoDock Vina);
5. fingerprinting of the new ligand-cavity complex to score the docking
   poses.
I applied this protocol to two different queries (thrombin and
dihydrofolate reductase), using a dataset of 1067 druggable proteins as
templates (Druggable Cavity Directory).
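The steps above can be sketched as a thin orchestration layer. Tool flags, pocket-file paths and all function names are illustrative assumptions, not the author's actual scripts; the sketch only builds the command lines (execution, output parsing and the step-5 fingerprint scoring are omitted):

```python
from pathlib import Path

def fpocket_cmd(pdb):  # step 2: druggable cavity prediction
    return ["fpocket", "-f", str(pdb)]

def apoc_cmd(query_pocket, template_pocket):  # step 3: pocket comparison
    return ["apoc", str(query_pocket), str(template_pocket)]

def vina_cmd(receptor, ligand, out):
    # step 4: cross-docking, run only on pairs whose pockets
    # score as similar in step 3
    return ["vina", "--receptor", str(receptor),
            "--ligand", str(ligand), "--out", str(out)]

def build_pipeline(query_pdb, template_pdbs):
    """Return the ordered command lines for one query vs. a template set."""
    cmds = [fpocket_cmd(query_pdb)]
    for t in template_pdbs:
        cmds.append(fpocket_cmd(t))
        # Hypothetical locations of the pocket files written in step 2:
        cmds.append(apoc_cmd(
            Path("query_out/pockets/pocket1_atm.pdb"),
            Path(f"{Path(t).stem}_out/pockets/pocket1_atm.pdb")))
    return cmds

cmds = build_pipeline("thrombin.pdb", ["1abc.pdb", "2xyz.pdb"])
```

Keeping the steps as composable command builders makes it easy to swap in an energy-based cavity predictor or an alignment-free comparison tool, as the protocol allows.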
RESULTS & DISCUSSION
The method works well in repositioning ligands among proteins of the same
family (intraprotein), but it is not able to detect interprotein
similarities (among unrelated proteins). This happens because of the
large size of the predicted cavities (larger than the mere space occupied
by the ligand) coupled with the alignment-based algorithm used, which
makes it difficult to reach a sufficient similarity score and greatly
increases the false negatives. In further work I will divide the cavity
space into subpockets, decouple the similarity from the sequence by using
pharmacophoric maps, and couple the structure-based similarity with
ligand-based and network-based similarity. All the information will be
fused with data-integration algorithms.
REFERENCES
1. Jalencas, X. & Mestres, J. On the origins of drug polypharmacology. Med. Chem. Commun. 4, 80 (2013).
2. Ma, D.-L., Chan, D.S.-H. & Leung, C.-H. Drug repositioning by structure-based virtual screening. Chem. Soc. Rev. 42, 2130 (2013).
3. Desaphy, J., Azdimousa, K., Kellenberger, E. & Rognan, D. Comparison and druggability prediction of protein-ligand binding sites from pharmacophore-annotated cavity shapes. J. Chem. Inf. Model. 52, 2287-2299 (2012).
P44. ASSESSMENT OF THE CONTRIBUTION OF COCOA-DERIVED STRAINS
OF ACETOBACTER GHANENSIS AND ACETOBACTER SENEGALENSIS TO
THE COCOA BEAN FERMENTATION PROCESS THROUGH A GENOMIC
APPROACH
Rudy Pelicaen, Koen Illeghems, Luc De Vuyst, and Stefan Weckx*.
Research Group of Industrial Microbiology and Food Biotechnology (IMDO), Faculty of Sciences and Bioengineering
Sciences, Vrije Universiteit Brussel, Brussels, Belgium; Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB,
Brussels, Belgium. *[email protected]
Acetobacter ghanensis LMG 23848T and Acetobacter senegalensis 108B are acetic acid bacteria species that originate
from a spontaneous cocoa bean heap fermentation process. They have been indicated as strains with interesting
functionalities through extensive metabolic and kinetic studies. Whole-genome sequencing of A. ghanensis LMG 23848T
and A. senegalensis 108B allowed us to unravel their genetic adaptations to the cocoa bean fermentation ecosystem.
INTRODUCTION
Fermented dry cocoa beans are the basic raw material for
chocolate production. The cocoa pulp-bean mass contents
of the cocoa pods undergo, once taken out of the pods, a
spontaneous fermentation process that lasts four to six
days. This process is characterised by a succession of
yeasts, lactic acid bacteria (LAB), and acetic acid bacteria
(AAB) coming from the environment (De Vuyst et al.,
2015).
METHODS
Total genomic DNA isolation and purification of A.
ghanensis LMG 23848T and A. senegalensis 108B was
followed by the construction of an 8-kb paired-end library,
454 pyrosequencing, and assembly of the sequence reads
using the GS De Novo Assembler version 2.5.3 with
default parameters. Genome finishing was performed by
PCR assays to close gaps in the draft assembly using
CONSED 23.0. Automated gene prediction and annotation
of the assembled genome sequences were carried out using
the bacterial genome sequence annotation platform
GenDB v2.2 (Meyer et al., 2003). The predicted genes
were functionally characterised using searches in public
databases and bioinformatics tools, and annotations were
manually curated. Comparative analysis of the genome
sequences of the cocoa-derived strains A. ghanensis LMG
23848T (this study), A. senegalensis 108B (this study), and
A. pasteurianus 386B (Illeghems et al., 2013) was
accomplished by the EDGAR framework (Blom et al.,
2009).
RESULTS & DISCUSSION
The genomes of the strains investigated consisted of a
circular chromosomal DNA sequence with a size of 2.7
Mbp and two plasmids for A. ghanensis LMG 23848T and
a circular chromosomal DNA sequence with a size of 3.9
Mbp and one plasmid for A. senegalensis 108B (Figure 1).
Comparative analysis revealed that the order of
orthologous genes was highly conserved between the
genome sequences of A. pasteurianus 386B and A.
ghanensis LMG 23848T. Evidence was found that both
species possessed the genetic ability to be involved in
citrate assimilation and they displayed adaptations in their
respiratory chain. As is the case for many AAB, the
missing gene encoding phosphofructokinase in the
genome sequences of both A. ghanensis LMG 23848T and
A. senegalensis 108B resulted in a non-functional upper
part of the Embden–Meyerhof–Parnas pathway. However,
the presence of genes coding for membrane-bound PQQ-
dependent dehydrogenases enabled the AAB strains
examined to rapidly oxidise ethanol into acetic acid.
Furthermore, an alternative TCA cycle, characterised by
genes coding for a succinyl-CoA:acetate-CoA transferase
and a malate:quinone oxidoreductase, was present.
Furthermore, evidence was found in both genome
sequences that glycerol, mannitol and lactate could be
used as energy sources. Thus, although both species displayed genetic
adaptations to the cocoa bean fermentation process, their dependence on
glycerol, mannitol and lactate may partly explain their low
competitiveness during cocoa bean fermentation, as these substrates have
to be formed through yeast (glycerol, mannitol) or LAB (lactate)
activities.
FIGURE 1. Graphical representation of the genomes of A. ghanensis
LMG 23848T (A) and A. senegalensis 108B (B).
REFERENCES
Blom, J., Albaum, S., Doppmeier, D., Pühler, A., Vorhölter, F.-J., Zakrzewski, M., Goesmann, A., 2009. EDGAR: a software framework for the comparative analysis of prokaryotic genomes. BMC Bioinformatics 10, 1-14.
De Vuyst, L., Weckx, S., 2015. The functional role of lactic acid bacteria in cocoa bean fermentation. In: Mozzi, F., Raya, R.R., Vignolo, G.M. (Eds.), Biotechnology of Lactic Acid Bacteria: Novel Applications. Wiley-Blackwell, Ames, IA, USA. In press.
Illeghems, K., De Vuyst, L., Weckx, S., 2013. Complete genome sequence and comparative analysis of Acetobacter pasteurianus 386B, a strain well-adapted to the cocoa bean fermentation ecosystem. BMC Genomics 14, 526.
Meyer, F., Goesmann, A., McHardy, A. C., Bartels, D., Bekel, T., et al., 2003. GenDB - an open source genome annotation system for prokaryote genomes. Nucleic Acids Res. 31, 2187-2195.
P45. REPRESENTATIONAL POWER OF GENE FEATURES FOR FUNCTION PREDICTION
Konstantinos Pliakos1,*, Isaac Triguero2,3, Dragi Kocev4 & Celine Vens1.
Department of Public Health and Primary Care, KU Leuven Kulak1; Department of Respiratory Medicine, Ghent University2; Data Mining and Modelling for Biomedicine group, VIB Inflammation Research Center3; Department of Knowledge Technologies, Jožef Stefan Institute4.
We present a short study on gene function prediction datasets, revealing an existing issue of non-unique feature
representation, as well as the effect of this issue on hierarchical multi-label classification algorithms.
INTRODUCTION
This study focuses on hierarchical multi-label classification (HMC). HMC
is a variant of classification in which one sample can be assigned to
several classes simultaneously. It differs from flat multi-label
classification in that the classes are organized in a hierarchy: a sample
belonging to a class automatically belongs to all its super-classes.
Typical HMC tasks include gene function prediction and text
classification; here, we focus on the former.
A typical characteristic of genes is that they can be described in
several ways: using information about their sequence, homology to
well-characterized genes, expression profiles, secondary structure of
their derived proteins, etc. The HMC community has multiple research
datasets on gene functions at its disposal (e.g., (Vens et al., 2008) or
(Schietgat et al., 2010)), each representing genes by one type of
features. Researchers should take advantage of this amount of data, but
the question arises how "good" these datasets are: how discriminant are
the features describing a gene? This short study displays existing
data-related problems and gives answers to these questions.
DATA STUDY & RESULTS
After careful experimentation on various publicly available datasets, it
was noted that some of them suffer from a large number of duplicate
feature vectors. The reason for this is that there are genes which,
despite having different functions, have exactly the same feature
representation. The table below shows the extent of this problem in the
20 gene function prediction datasets described in (Vens et al., 2008) and
(Schietgat et al., 2010).
| Organism | Dataset | Nb of genes | Nb of unique gene representations |
|---|---|---|---|
| S. cerevisiae | church | 3755 | 2352 |
| S. cerevisiae | pheno | 1591 | 514 |
| S. cerevisiae | hom | 3854 | 3646 |
| S. cerevisiae | seq | 3919 | 3913 |
| S. cerevisiae | struc | 3838 | 3785 |
| A. thaliana | scop | 9843 | 9415 |
| A. thaliana | struc | 11763 | 11689 |

TABLE 1. Datasets, the number of genes and their unique representations.
As displayed, the church (microarray expression) and the pheno (phenotype
features) datasets suffer the most. More specifically, in the pheno
dataset 67.7% of the gene representations are duplicates. The most
frequent feature vector appears 315 times: 197 times in the training set
and 118 times in the test set. Due to this, 20% of the 582 test examples
will give the same feature vector as input for prediction. In a decision
tree model, for example, these genes will end up in the same leaf and
receive the same prediction (the average class vector of 197 training
examples), but receive a different error term, as they are a priori
associated with different class label-sets. In the training phase, there
may still be a lot of variation in the class vectors of the 197 genes,
but no split exists to separate them. In the church dataset, the 3755
genes correspond to only 2352 unique feature descriptors. In the hom and
struc datasets the number of duplicates is lower but still impressive,
considering the enormous size of the feature vectors in these datasets.
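The duplicate counts reported above can be reproduced for any genes × features matrix with a few lines (function and field names are ours):

```python
import numpy as np

def unique_representation_stats(X):
    """Summarise duplicate feature vectors in a genes x features matrix,
    mirroring Table 1 (number of genes vs. unique representations)."""
    uniq, counts = np.unique(X, axis=0, return_counts=True)
    return {"genes": X.shape[0],
            "unique": uniq.shape[0],
            "duplicate_fraction": 1.0 - uniq.shape[0] / X.shape[0],
            "most_frequent": int(counts.max())}

# Toy matrix: 5 genes, but rows 0, 2 and 4 share one representation.
X = np.array([[1, 0], [0, 1], [1, 0], [1, 1], [1, 0]])
stats = unique_representation_stats(X)  # 5 genes, 3 unique, max count 3
```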
For evaluation purposes, ML-KNN (Zhang et al., 2007) was employed to
demonstrate the effect of the studied problem on the average precision
for the FunCat-annotated datasets. Here, "unique" refers to the datasets
obtained after removing all duplicates, so any feature vector can be
included only once in a gene's neighbour set. We report the average of 10
"unique" versions, each one using a different gene's class label as
ground truth for the feature vector.
| Dataset | | K=1 Train | K=1 Test (5cv) | K=5 Train | K=5 Test (5cv) | K=17 Train | K=17 Test (5cv) |
|---|---|---|---|---|---|---|---|
| pheno | initial | 51.59 | 23.62 | 39.55 | 24.14 | 32.76 | 23.59 |
| pheno | unique | 100 | 24.21 | 55.62 | 24.90 | 39.70 | 25.01 |
| hom | initial | 98.30 | 39.32 | 63.64 | 39.45 | 48.96 | 37.28 |
| hom | unique | 100 | 39.14 | 64.64 | 39.67 | 49.28 | 37.53 |

TABLE 2. Average precision rates (%) using ML-KNN.
The table shows that the less discriminant feature representation can
affect ML-KNN and decrease the precision of multi-label classification.
The same problem would presumably be even more pronounced, or even
disastrous, for two-class or multi-class classification problems.
CONCLUSION
The major point of this study was to inform the research community of the
relatively low representational power of the features in some widely used
gene function prediction datasets, which makes them even more difficult
and challenging from a machine learning perspective. We observed the same
issue in datasets from other HMC application domains, such as text
categorization.
REFERENCES
Zhang, M.L. & Zhou, Z.H. ML-KNN: a lazy learning approach to multi-label learning. Pattern Recognition 40, 2038-2048 (2007).
Vens, C. et al. Decision trees for hierarchical multi-label classification. Machine Learning 73, 185-214 (2008).
Schietgat, L. et al. Predicting gene function using hierarchical multi-label decision tree ensembles. BMC Bioinformatics 11 (2010).
P46. ANALYSIS OF BIAS AND ASYMMETRY IN THE PROTEIN STABILITY
PREDICTION
Fabrizio Pucci1,*, Katrien Bernaerts1,2, Fabian Teheux1, Dimitri Gilis1 & Marianne Rooman1.
Department of BioModeling, BioInformatics & BioProcesses1, Université Libre de Bruxelles, 1050 Brussels, Belgium;
BioBased Materials, Faculty of Humanities and Sciences2, Maastricht University, 6200 Maastricht, The Netherlands.
In many bioinformatics analyses, avoiding biases towards the training dataset is one of the most intricate issues. Here we
focus on the specific case of the prediction of protein thermodynamic stability changes upon point mutations (ΔΔG). We
first measure the bias towards destabilizing mutations of some widely used ΔΔG-prediction algorithms described in the
literature. We then show how important the use of symmetry in the model is to avoid such biases. In the last step we
briefly discuss the distribution of the ΔΔG values for all possible point mutations in a series of proteins, with the aim of
understanding whether the distribution is universal and how much it is biased towards the training dataset.
INTRODUCTION
The accurate prediction of the stability changes on a large
scale is still a challenge in protein science. Despite the
large amount of work done in recent years, the results
frequently suffer from hidden biases towards the training
dataset, and this makes the evaluation of the real
performance a difficult task.
Here we study the “bias problem” in the case of the
prediction of protein thermodynamic stability changes
upon point mutations, and more precisely of their best
descriptor ΔΔG, the change in folding free energy
upon mutation from the wild-type protein W to the mutant
M. In principle, the predicted ΔΔG value of the inverse
mutation (M to W) has to be exactly equal to minus the
ΔΔG of the direct mutation (W to M), since the free energy
is a state function.
Unfortunately the asymmetry of the training dataset
towards the destabilizing mutations (reflecting the
evolutionary optimization of protein stability) makes the
prediction of inverse mutations less accurate with respect
to the direct ones. This introduces a series of distortions in
the prediction model that we will analyze here.
METHODS
We computed the ΔΔG value for a set of almost 200
mutations in which the structures of both the wild-type
protein and the mutant are known, using a series of prediction
tools, i.e. PoPMuSiC [1], I-Mutant, FoldX, Duet,
AutoMute, CupSat, Eris and ProSMS. We then computed
the ratio (RID) of the standard deviation between the
predicted and the experimental values of ΔΔG for the
Inverse mutations to that for the Direct mutations (which
should be one in the case of a perfectly symmetric
prediction) and compared the results of the different
programs.
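The RID computation described above can be sketched as follows (the function name and toy numbers are ours, not from the study):

```python
import numpy as np

def rid(pred_direct, exp_direct, pred_inverse, exp_inverse):
    """Ratio of the standard deviation of the prediction errors for the
    inverse mutations to that for the direct mutations; a perfectly
    symmetric predictor gives RID = 1."""
    err_dir = np.asarray(pred_direct) - np.asarray(exp_direct)
    err_inv = np.asarray(pred_inverse) - np.asarray(exp_inverse)
    return float(np.std(err_inv) / np.std(err_dir))

# Toy numbers (ours): the inverse-mutation errors are twice as spread
# out as the direct-mutation errors, giving RID = 2.
print(rid([1.0, 2.0], [0.9, 2.1], [-0.8, -2.2], [-1.0, -2.0]))  # 2.0
```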
If the functional structure of the model is known, as in the
case of the artificial neural network of PoPMuSiC, one
can further understand which terms contribute more than
others to deviating the RID from unity, and thus propose new
model structures in which the biases are correctly avoided
[2].
In more blind machine learning approaches (such as
methods based on Random Forests or Support Vector
Machines), in which the functional form is not explicitly
known, the asymmetry correction is less obvious.
In a second part, we investigated how the symmetry of the
ΔΔG distribution in the training dataset influences
the prediction of the ΔΔG distribution for all possible
mutations in a series of proteins with known structures.
RESULTS & DISCUSSION
The estimation of the asymmetry computed for a
series of available prediction methods gives RID
values between 1 for bias-corrected methods and
about 3 for the most biased programs. From these
results we have shown that the correct use of
symmetry in setting up the model structure helps to
avoid unwanted biases towards destabilizing
mutations.
Furthermore, the distribution of the ΔΔG values for all
point mutations in some proteins has been analyzed
and shows a dependence on the ΔΔG distribution
of the training dataset when the RID deviates
significantly from one. Understanding the
relation between the two distributions is an
important step towards comprehending the universality of the
distribution [3] and how much proteins are
optimized to minimize the impact of single-site
amino acid substitutions.
REFERENCES
[1] Dehouck Y., Kwasigroch J. M., Gilis D. & Rooman M. PoPMuSiC 2.1: a web server for the estimation of protein stability changes upon mutation and sequence optimality. BMC Bioinformatics 12, 151 (2011).
[2] Pucci F., Bernaerts K., Teheux F., Gilis D. & Rooman M. Symmetry principles in optimization problems: an application to protein stability prediction. IFAC-PapersOnLine 48-1, 458-463 (2015).
[3] Tokuriki N., Stricher F., Schymkowitz J., Serrano L. & Tawfik D. S. The stability effects of protein mutations appear to be universally distributed. J Mol Biol 356, 1318-1332 (2007).
P47. MULTI-LEVEL BIOLOGICAL CHARACTERIZATION OF EXOMIC
VARIANTS AT THE PROTEIN LEVEL IMPROVES THE IDENTIFICATION OF
THEIR DELETERIOUS EFFECTS
Daniele Raimondi1,2,3,4, Andrea Gazzo1,2, Marianne Rooman1,6, Tom Lenaerts1,2,5 & Wim Vranken1,2,3,4.
Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, 1050, Belgium1; Machine Learning group,
Université Libre de Bruxelles, Brussels, 1050, Belgium2; Structural Biology Brussels, Vrije Universiteit Brussel,
Brussels, 1050, Belgium3; Structural Biology Research Centre, VIB, Brussels, 1050, Belgium4; Artificial Intelligence lab,
Vrije Universiteit Brussel, Brussels, 1050, Belgium5; 3BIO-BioInfo group, Université Libre de Bruxelles, Brussels, 1050,
Belgium6.
The increasing availability of genome sequence data led to the development of predictors that are capable of identifying
the likely phenotypic effects of Single Nucleotide Variants (SNVs) or short inframe Insertions or Deletions (INDELs).
Most of these predictors focus on SNVs and use a combination of features related to sequence conservation, biophysical
and/or structural properties to link the observed variant to either a neutral or a disease phenotype. Despite notable
successes, the mapping between genetic alterations and phenotypic effects is riddled with levels of complexity that are
not yet fully understood and that are often not taken into account in the predictions. A better multi-level molecular and
functional contextualization of both the variant and the protein may therefore significantly improve the predictive quality
of variant-effect predictors.
INTRODUCTION
The phenotypical interpretation at the organism level of
protein-level alterations is the ultimate goal of the variant-
effect prediction field. This causal relationship is still far
from being completely understood and is confounded by
many aspects related to the intrinsic complexity of cell life. A
crucial restriction of variant-effect prediction is that an
alteration of the protein’s molecular phenotype, even if it is a
sine qua non condition for the disease phenotype in the
carrier individual, may not constitute in itself a sufficient
cause for the disease: this also depends on the particular role
that the affected protein plays in the well-being of the
organism. Even the most commonly used features, which
relate evolutionary constraints with likely functional damage,
offer only a partial correlation with the pathogenicity of the
variant. Consequently, additional information that bridges the
variant-phenotype gap is crucial to improve variant-effect
predictions.
METHODS
We address the inherently complex variant-effect prediction
problem through the integration of different sources of
information. By describing each (protein, variant) pair from
different perspectives corresponding to different levels of
contextualisation, we assembled the most relevant and
accessible pieces of information that are currently available,
with the aim to elucidate the fuzzy and complex mapping
between molecular-level alterations and the individual-level
phenotypic outcome. We use three variant-oriented features
with different characteristics: the log-odd ratio (LOR) score
and Conservation index (CI) [1], which are column-wise
measures of the conservation of a mutated column within a
multiple-sequence alignment (MSA), and the PROVEAN [2]
predictions (PROV), which provide a sequence-wide measure
of the change in evolutionary distance between the mutated
target protein and close functional homologs that correlates
with the deleteriousness of variants. The protein-oriented
features use pathway [4] and protein-protein interaction
networks information [5] (DGR) as well as genetic and
clinical information, for instance an evaluation of how
tolerant the affected genes are to homozygous loss-of-
function mutations (REC) [3].
RESULTS & DISCUSSION
DEOGEN is our novel variant effect predictor that can
natively handle both SNVs and inframe INDELs. By
integrating information from different biological scales and
mimicking the complex mixture of effects that lead from the
variant to the phenotype, we obtain significant improvements
in the variant-effect prediction results. Next to the typical
variant-oriented features based on the evolutionary
conservation of the mutated positions, we added a collection
of protein-oriented features that are based on functional
aspects of the gene affected. We cross-validated DEOGEN on
36825 polymorphisms, 20821 deleterious SNVs and 1038
INDELs from SwissProt.
Method               Missing SNVs  Sen  Spe  Pre  Bac  MCC
PROVEAN               0.0          78   79   68   79   56
SIFT                  2.0          85   69   61   77   52
Mutation Assessor     0.6          85   71   63   78   54
PolyPhen2 (HumDiv)    4.0          89   63   57   76   50
CADD                  7.0          82   75   66   78   55
EFIN                  0.0          86   80   87   83   64
MutationTaster       20.7          86   75   69   81   60
GERP++               20.7          97   24   45   61   28
DEOGEN                4.4          77   92   85   84   71
FIGURE 1. Comparison of the performance of 8 variant-effect predictors with DEOGEN on the Humsavar 2013 dataset.
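For reference, the columns of the comparison can be recomputed from a 2×2 confusion matrix. The sketch below assumes Sen/Spe/Pre/Bac/MCC denote sensitivity, specificity, precision, balanced accuracy and the Matthews correlation coefficient; the counts are hypothetical, not taken from the Humsavar benchmark.

```python
import math

def metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, precision, balanced accuracy and MCC
    (as percentages), matching the columns of the comparison table."""
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    pre = tp / (tp + fp)
    bac = (sen + spe) / 2
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {k: round(100 * v, 1)
            for k, v in dict(Sen=sen, Spe=spe, Pre=pre,
                             Bac=bac, MCC=mcc).items()}

# Hypothetical counts, not from the paper:
print(metrics(tp=80, fp=10, tn=90, fn=20))
```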
REFERENCES
[1] Calabrese R. et al. Functional annotations improve the predictive score of human disease-related mutations in proteins. Hum. Mutat. 30, 1237-1244 (2009).
[2] Choi Y. et al. Predicting the functional effect of amino acid substitutions and indels. PLoS One 7, e46688 (2012).
[3] MacArthur D. G. et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science 335 (6070), 823-828 (2012).
[4] Kamburov A. et al. ConsensusPathDB: toward a more complete picture of cell biology. Nucleic Acids Research 39, D712-717 (2011).
P48. NGOME: PREDICTION OF NON-ENZYMATIC PROTEIN
DEAMIDATION FROM SEQUENCE-DERIVED SECONDARY STRUCTURE AND
INTRINSIC DISORDER
J. Ramiro Lorenzo1, Leonardo G. Alonso2 & Ignacio E. Sánchez1,*.
Protein Physiology Laboratory, Facultad de Ciencias Exactas y Naturales and IQUIBICEN - CONICET, Universidad de
Buenos Aires, Argentina1; Protein Structure-Function and Engineering Laboratory, Fundación Instituto Leloir and
IIBBA - CONICET, Buenos Aires, Argentina2. *[email protected]
Asparagine residues in proteins undergo spontaneous deamidation, a post-translational modification that may act as a
molecular clock for the regulation of protein function and turnover. Asparagine deamidation is modulated by protein
local sequence, secondary structure and hydrogen bonding. We present NGOME, an algorithm able to predict non-enzymatic
deamidation of internal asparagine residues in proteins, in the absence of structural data, from sequence-based
predictions of secondary structure and intrinsic disorder. NGOME may help the user identify deamidation-prone
asparagine residues, often related to protein gain of function, protein degradation or protein misfolding in pathological
processes.
INTRODUCTION
Protein deamidation is a post-translational modification in
which the side chain amide group of a glutamine or
asparagine (Asn) residue is transformed into an acidic
carboxylate group. Deamidation often, but not always,
leads to loss of protein function1,2. Deamidation rates in
proteins vary widely, with halftimes for particular Asn
residues ranging from several days to years. In contrast
with the ubiquity and importance of Asn deamidation,
there is currently no publicly available algorithm for the
prediction of Asn deamidation. A structure-based
algorithm was published3, but it is no longer available online
and is not useful for proteins of unknown structure or
those that are intrinsically disordered.
METHODS
Dataset. We collected from the literature experimental
reports of deamidation of Asn residues in proteins using
mass spectrometry or Edman sequencing. Since
deamidation rates depend strongly on pH and temperature,
we only included experiments at neutral or slightly basic
pH and up to 313 K. An Asn residue was considered a
positive if an unequivocal change to an aspartic or isoaspartic
residue was observed. Asn residues for which direct
experimental evidence was not obtained were not taken
into account.
NGOME training. We trained the algorithm by randomly
splitting the dataset into training and test sets 100 times,
while keeping a similar number of positive and negative
Asn-Xaa dipeptides in the two sets. For each splitting, we
selected the weights for disorder4 and alpha helix
prediction5 in the NGOME algorithm to maximize the area
under the ROC curve for the training set. For the test set,
the area under the ROC curve for NGOME was larger than
for sequence-based prediction 97 out of 100 times. Finally,
we selected the average values of weights for NGOME.
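The training loop described above can be sketched as follows; the features, labels and weight grid are hypothetical stand-ins for NGOME's actual disorder and helix predictors, and only one of the 100 random splits is shown.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Hypothetical data, not NGOME's real features: a per-Asn sequence
# score plus sequence-based disorder and helix predictions, with a
# 0/1 deamidation label for each residue.
n = 200
seq, dis, hel = rng.normal(size=(3, n))
y = (seq + 0.5 * dis - 0.5 * hel + rng.normal(size=n) > 0).astype(int)

def auc(scores, labels):
    """Area under the ROC curve, P(score_pos > score_neg), ties half."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

# One random split into training and test halves.
idx = rng.permutation(n)
tr, te = idx[: n // 2], idx[n // 2:]

# Grid-search the two weights to maximise the training-set AUC.
grid = np.linspace(-1, 1, 21)
best = max(product(grid, grid),
           key=lambda w: auc(seq[tr] + w[0] * dis[tr] + w[1] * hel[tr],
                             y[tr]))
w_dis, w_hel = best
test_auc = auc(seq[te] + w_dis * dis[te] + w_hel * hel[te], y[te])
print(round(float(test_auc), 3))
```

In the abstract's procedure this split is repeated 100 times and the selected weights are averaged.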
RESULTS & DISCUSSION
Both protein sequence and structure can influence Asn
deamidation kinetics. In the absence of secondary and
tertiary structure, Asn deamidation rates are governed by
the identity of the N+1 amino acid3. In model peptides, the
Asn-Gly dipeptide is by far the fastest to deamidate, with
bulky N+1 side chains generally slowing down the
reaction. Several structural features decreasing Asn
deamidation rates have also been identified, including
alpha helix formation and hydrogen bond formation by the
Asn side chain, the N+1 backbone amide and the
neighbouring residues3.
We compiled a database of 281 Asn residues (67 positives
and 214 negatives) in 39 proteins to train NGOME. We
computed t50 for all Asn in the dataset and generated a
ROC curve by considering as positives Asn residues with
different values of t50. The area under the ROC curve is
larger for the NGOME predictions (0.9640) than for the
sequence-based predictions (0.9270) (p-value 6×10⁻³).
NGOME also performs better for threshold values
yielding few false positives. NGOME can also
discriminate between positive and negative Asn-Gly
dipeptides, whereas sequence-based prediction cannot.
The area under the ROC curve is 0.7051 for the NGOME
predictions, larger than the random value of 0.5 for
sequence-based prediction (p-value 9×10⁻³). Since
NGOME requires only a protein sequence as an input and
not a three-dimensional structure, we envision that
NGOME will be useful to systematically evaluate whole-proteome
data and in the study of intrinsically disordered
proteins for which structural data is scarce. NGOME is
freely available as a webserver at the National EMBnet
node Argentina, URL: http://www.embnet.qb.fcen.uba.ar/
in the subpage “Protein and nucleic acid structure and
sequence analysis”.
REFERENCES
1. Curnis, F., et al. J Biol Chem 281:36466-36476 (2006).
2. Reissner, K.J. and Aswad, D.W. Cell Mol Life Sci 60:1281-1295 (2003).
3. Robinson, N.E. and Robinson, A.B. Proc Natl Acad Sci U S A 98:4367-4372 (2001).
4. Dosztanyi, Z., et al. Bioinformatics 21:3433-3434 (2005).
5. Cole, C., et al. Nucleic Acids Res 36:W197-201 (2008).
P49. OPTIMAL DESIGN OF SRM ASSAYS USING MODULAR EMPIRICAL
MODELS
Jérôme Renaux1,*, Alexandros Sarafianos1, Kurt De Grave1 & Jan Ramon1.
Department of Computer Science, KU Leuven1. *[email protected]
Targeted proteomics techniques such as Selected Reaction Monitoring (SRM) have become very popular for protein
quantification due to their high sensitivity and reproducibility. However, these rely on the selection of optimal transitions,
which are not always known in advance and may require expensive and time-consuming discovery experiments to
identify. We propose a computer program for the automated identification of optimal transitions using machine learning
and show encouraging results when compared to a widely used spectral library.
INTRODUCTION
A major issue with SRM is knowing which transitions
to monitor in order to maximally detect a specific protein,
these being different from one protein to another. Good
candidates are transitions whose chemical properties will
make them likely to occur and easy to detect by the mass
spectrometer, while being sufficiently specific indicators
of their parent protein.
Traditionally, targeted proteomics assays, which consist of
lists of ions or transitions to monitor, are designed through
costly exploratory experiments. Recently, attempts have
been made to produce software to help design optimal
assays. These efforts rely to some extent on collaborative
databases of mass spectra which are mined to identify the
best possible peptides to include in the assays. While
successful, these approaches still depend on past
exploratory analyses and on the coverage of the exploited
databases. Therefore, their performance decreases in cases
where such databases cannot be leveraged, such as when
dealing with little-studied organisms or rare, low-
abundance proteins.
We propose an approach called SIMPOPE (Sequence of
Inductive Models for the Prediction and Optimization of
Proteomics Experiments) that models all the steps of the
typical tandem mass spectrometry (MS/MS) workflow in
order to accurately predict the properties of peptide and
fragment ions within a given proteome, and subsequently
identify optimal assays among them.
METHODS
SIMPOPE consists of a sequential suite of predictive
models for each step of the MS/MS workflow. It exploits
knowledge from public databases and combines it with the
generalizing power of machine learning models to
compensate for noisy or missing data. All models are
probabilistic, allowing us to keep track of the inherent
uncertainty of the successive predictions and to weight the
results accordingly for the assay prediction.
Enzymatic cleavage is modelled using CP-DT (Fannes et
al., 2013), which models the behaviour of the trypsin
enzyme using random forests. Retention time prediction is
achieved using the Elude tool from the Percolator suite
(Moruz et al., 2010). The charge distribution of
electrospray precursor ions is also modelled using random
forests trained on experimental data mined from PRIDE
(Vizcaino et al., 2013). Fragmentation patterns and
product ion intensity are predicted with the help of random
forest models trained on MS-LIMS data (Degroeve &
Martens 2013; De Grave et al., 2014). Finally, prior
knowledge about the abundance of proteins within a given
proteome is incorporated as prior probabilities, obtained
when available from PaxDB.
On the human proteome, these steps yield a total of
321,000,000 transitions together with their relevant chemical
properties. We then compute a score for every single
transition, based on these properties and on their aliasing
with other transitions in terms of Q1 and Q3 m/z.
RESULTS & DISCUSSION
We validated our approach by computing scores for 2000
reference transitions from the SRMAtlas database (Picotti
et al., 2014). Based on these scores, we can rank the
reference transitions among all possible transitions.
Intuitively, reference transitions should rank high, and
therefore have a low rank (ideally, in the top five). Based
on the average number of transitions per protein in our
reference set, a perfect median rank would be 3.2, while a
totally random scoring system should yield a median rank
of 151. The approach we propose achieved a median rank
of 15, signifying that using our scoring method, 50% of
the reference transitions are ranked in the top 15. This
result is encouraging as it shows that the scores predicted
by SIMPOPE do correlate with the quality of the
transitions. We can subsequently use that score as a
feature to train an additional model on top of the ones
described here to refine the assay prediction process
(further results on the poster).
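The median-rank evaluation can be sketched as follows (the scores and reference indices are hypothetical; SIMPOPE's actual scoring model is more elaborate):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical scores for 300 candidate transitions of one protein;
# the indices of the "reference" transitions are invented for the demo.
all_scores = rng.random(300)
reference_idx = [3, 7, 42, 99, 120]
all_scores[reference_idx] += 0.5  # good references should score higher

# Rank 1 = best-scoring transition; reference transitions should rank low.
order = np.argsort(-all_scores)
rank_of = {int(t): r + 1 for r, t in enumerate(order)}
ranks = sorted(rank_of[t] for t in reference_idx)
print("reference ranks:", ranks, "median:", float(np.median(ranks)))
```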
REFERENCES
Degroeve, S. & Martens, L. MS2PIP: a tool for MS/MS peak intensity prediction. Bioinformatics 29, 3199-3203 (2013).
Fannes, T. et al. Journal of Proteome Research 12(5), 2253-2259 (2013).
De Grave, K. et al. Prediction of peptide fragment ion intensity: a priori partitioning reconsidered. International Mass Spectrometry Conference 2014 (2014).
Moruz, L., Tomazela, D. & Käll, L. Training, selection, and robust calibration of retention time models for targeted proteomics. Journal of Proteome Research 9(10), 5209-5216 (2010).
Picotti, P. et al. A complete mass-spectrometric map of the yeast proteome applied to quantitative trait analysis. Nature 494(7436), 266-270 (2014).
Vizcaino, J. A. et al. The Proteomics Identifications (PRIDE) database and associated tools: status in 2013. Nucleic Acids Research 41(D1), D1063-D1069 (2013).
P50. EVALUATING THE ROBUSTNESS OF LARGE INDEL IDENTIFICATION
ACROSS MULTIPLE MICROBIAL GENOMES
Alex Salazar1,2 & Thomas Abeel1,2,*.
Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands1; Genome Sequencing and Analysis
Program, Broad Institute of MIT and Harvard2.
Comparing large structural variants—such as large insertions and deletions (indels)—across multiple genomes can reveal
important insights in microbial organisms. Unfortunately, most studies that compare sequence variants only focus on
single nucleotide variants and small indels. In this study, we investigated whether currently available variant callers are
robust when identifying the same large indel across multiple genomes—an important criterion for accurately associating
large variants. By simulating over 8,000 large indels of various sizes across 161 bacterial strains, we found that
breakpoint detection is precise when identifying both deletions and insertions. We suggest that left-most-overlap
normalization across all samples will ensure uniform breakpoint coordinates of identical large variants, which can then be
incorporated into existing association pipelines.
INTRODUCTION
Structural sequence variants—such as large insertion and
deletions (indels)—along with small sequence variants (e.g.
single nucleotide variants and small indels) can enable more
robust comparisons of microbial populations. Unfortunately,
limitations in variant calling methods restrict investigations to
compare only small variants across multiple microbial
genomes—thereby ignoring larger variants (e.g. indels of size
greater than 50nt). The recent development of structural
variant detecting tools now provide an opportunity to
compare and associate large indels with phenotype and
population structure across a collection of samples. However,
these tools have only been benchmarked against a single
genome and their ability to consistently call large events
across multiple genomes remains uncharacterized.
METHODS
In this study, we systematically benchmarked the robustness
of large indel identification across multiple genomes using
four recently developed structural variant detection tools:
Pilon (Walker et al., 2014), Breseq (Barrick et al., 2014),
BreakSeek (Zhao et al., 2015), and MindTheGap (Rizk et al.,
2014). Using a manually-curated reference genome for
M. tuberculosis (H37Rv), we simulated nearly 10,000
deletions and 8,000 insertions—ranging from 50nt
to 550nt. Overall, the simulation experiment resulted in a
total 1.6 million expected deletions and 1.3 million expected
insertions when we aligned short-reads from a data set of 161
clinical strains of M. tuberculosis (Zhang et al., 2013).
After identifying the simulated indels using the variant
detecting tools, we used a distance test to investigate each
tool’s robustness in breakpoint and genotype prediction. For
each simulated indel prediction, we computed the distance of
the predicted breakpoint coordinate to the expected
breakpoint coordinate. We also calculated a genotype
similarity score using the Damerau-Levenshtein distance.
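The two robustness measures can be sketched as follows. The similarity normalization by the longer sequence length is our assumption; the abstract specifies only that the Damerau-Levenshtein distance is used.

```python
def dl_distance(a, b):
    """Restricted Damerau-Levenshtein distance between two genotype
    strings: edits plus adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def genotype_similarity(predicted, expected):
    """Similarity in [0, 1]: 1 means identical sequences (our scaling)."""
    if not predicted and not expected:
        return 1.0
    return 1 - dl_distance(predicted, expected) / max(len(predicted),
                                                      len(expected))

# Hypothetical predicted vs. expected breakpoint and sequences:
print(abs(5031 - 5028))                         # breakpoint distance: 3 nt
print(genotype_similarity("ACGTAC", "ACGTTC"))  # one substitution
```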
RESULTS & DISCUSSION
We found that all tools are able to precisely predict the
breakpoint coordinate of the same large event present across
multiple genomes. For deletions, Breseq and Breakseek
consistently identified more than 96% of all simulated
deletions regardless of size. This number ranged from 87% to
93% in Pilon and correlated with decreasing deletion size.
Breseq and Pilon correctly predicted the exact breakpoint
coordinate for about two-thirds of all identified simulated
indels. This number ranged from 1% to 7% in Breakseek calls
and inversely correlated with increasing deletion size.
For insertions, MindTheGap consistently identified
approximately 97% of all simulated insertions, but Pilon’s
performance worsened as the number of insertions that it
identified ranged from 69% to 93%. Again, we observed a
direct correlation of missed calls as the insertion size
increased. Both tools correctly predicted the exact breakpoint
coordinate for about two-thirds of all identified simulated
indels. Nevertheless, we found 99% of the predicted
breakpoint coordinates made by the four tools were within
10nt of the expected breakpoint coordinate.
Our results also indicate that Pilon, Breseq, Breakseek, and
MindTheGap are robust when predicting the genotype of
large indels across multiple samples. The large majority of
identified simulated deletions had a size and genotype
similarity of more than 98%. In insertions, the size similarity
of insertions varied widely in both MindTheGap and Pilon
calls indicating that both tools have a difficult time
determining the exact length of an insertion sequence.
Overall, these results show that breakpoint detection is
precise when identifying deletion and insertions of any size.
Therefore, a simple normalization procedure—such as left-
most-overlap normalization across samples—will ensure
consistent breakpoint location for identical large events. This
will enable researchers to incorporate large variants into
existing association pipelines, opening novel opportunities to
associate large variants with phenotype and population
structure.
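The left-most normalization suggested above can be sketched for a single deletion (a minimal illustration for one sample; real pipelines must also handle insertions and multi-allelic sites):

```python
def left_normalize_deletion(ref, start, length):
    """Shift a deletion of `length` bases beginning at 0-based `start`
    as far left as possible while the resulting haplotype is unchanged.
    This happens whenever the base before the deletion equals the last
    deleted base, so the window can slide one position left."""
    while start > 0 and ref[start - 1] == ref[start + length - 1]:
        start -= 1
    return start

# Deleting "TA" at position 3 of this reference yields the same
# haplotype ("CTATG") as deleting "TA" at position 1, so both calls
# normalize to the same left-most coordinate.
ref = "CTATATG"
print(left_normalize_deletion(ref, 3, 2))  # 1
print(left_normalize_deletion(ref, 1, 2))  # 1
```

Applying this to every call across all samples gives identical events identical breakpoint coordinates, which is what downstream association requires.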
REFERENCES Barrick,J.E. et al. (2014) Identifying structural variation in haploid
microbial genomes from short-read resequencing data using breseq.
BMC Genomics, 15, 1039.
Rizk,G. et al. (2014) MindTheGap: integrated detection and assembly of short and long insertions. Bioinformatics, 30, 1–7.
Walker,B.J. et al. (2014) Pilon: an integrated tool for comprehensive
microbial variant detection and genome assembly improvement. PLoS One, 9, e112963.
Zhang,H. et al. (2013) Genome sequencing of 161 Mycobacterium
tuberculosis isolates from China identifies genes and intergenic regions associated with drug resistance. Nat. Genet., 45, 1255–60.
Zhao,H. and Zhao,F. (2015) BreakSeek: a breakpoint-based algorithm for
full spectral range INDEL detection. Nucleic Acids Res., 1–13.
P51. INTEGRATING STRUCTURED AND UNSTRUCTURED DATA SOURCES
FOR PREDICTING CLINICAL CODES
Elyne Scheurwegs1,3,*, Kim Luyckx2, Léon Luyten2, Walter Daelemans3 & Tim Van den Bulcke1.
Advanced Database Research and Modeling (ADReM), University of Antwerp1; Antwerp University Hospital2; Center
for Computational Linguistics and Psycholinguistics (CLiPS), University of Antwerp3.
Automated clinical coding is a task in medical informatics, in which information found in patient files is translated to
various types of coding systems (e.g. ICD-9-CM). The information in patient files consists of multiple data sources, both
in structured (e.g. lab test results) and unstructured form (e.g. a text describing the progress of a patient over multiple
days during the stay). This work studies the complementarity of information derived from these different sources to
enhance clinical code prediction.
INTRODUCTION
The increased accessibility of healthcare data through the
large-scale adoption of electronic health records stimulates
the development of algorithms that monitor hospital
activities, such as clinical coding applications.
Clinical coding consists of the translation of information
found in a patient file into diagnostic and procedural codes
originating from a medical ontology.
In our work, we investigate if unstructured (textual) and
structured data sources, present in electronic health
records, can be combined to assign clinical diagnostic and
procedural codes (specifically ICD-9-CM) to patient stays.
Our main objective is to evaluate if integrating these
heterogeneous data types improves prediction strength
compared to using the data types in isolation.
METHODS
Several datasets were collected from the clinical data
warehouse of the Antwerp University Hospital (UZA).
The resulting dataset consists of a randomized subset of
anonymized data of patient stays, in 14 different medical
specialties. Two separate data integration approaches were
evaluated on each dataset from a medical specialty.
With early data integration, multiple sources are combined
prior to training a model. This is achieved by using a
single bag of features that are given to the prediction
pipeline. Feature selection is performed with tf-idf for
unstructured sources, and gain ratio and minimal
redundancy, maximum relevance (mRMR) for structured
source filtering.
The late data integration method trains a separate model
on each data source, and then combines the prediction
output for each code in a meta-learner. This meta-learner
is mainly used to find which sources perform best for a
certain code.
The prediction task in both approaches was cast as a multi-label classification task, in which an array of binary predictions was made (one for each clinical code).
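The two integration schemes can be sketched as follows. This is a minimal illustration only: the source names, the feature prefixing, and the weighted-average meta-learner are assumptions, not the authors' implementation.

```python
def early_integration(sources):
    """Early integration: merge all sources into one bag of features
    before training, prefixing feature names to keep sources distinct."""
    bag = {}
    for name, features in sources.items():
        for feat, value in features.items():
            bag[f"{name}:{feat}"] = value
    return bag

def late_integration(per_source_scores, weights):
    """Late integration: combine per-source prediction scores for one
    code; here the meta-learner is simply a weighted average."""
    norm = sum(weights.get(s, 0.0) for s in per_source_scores) or 1.0
    return sum(weights.get(s, 0.0) * p
               for s, p in per_source_scores.items()) / norm

# Example: a text source and a structured lab source for one patient stay
bag = early_integration({"text": {"fever": 2.0}, "lab": {"crp_high": 1.0}})
score = late_integration({"text": 0.8, "lab": 0.2},
                         {"text": 1.0, "lab": 1.0})
```

In the actual setup the meta-learner is itself trained per code, which is how it learns which sources perform best for that code.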
RESULTS & DISCUSSION
Late data integration improves the prediction of ICD-9-CM diagnostic codes compared to the best individual prediction source (overall F-measure increased from 30.6% to 38.3%). Early data integration
does not show this trend and only performs well with a
limited number of combinations of sources. ICD-9-CM
procedure codes also show this trend, with the exception
of the RIZIV data source, which shows a better prediction
when used individually. The predictive strength of the
models varies strongly between different medical
specialties.
The results show that the data sources, independent of
their structured or unstructured nature, are able to provide
complementary information when predicting ICD-9-CM
codes, particularly when combined within the late data
integration approach. This approach also allows as many sources as possible to be included, since adding a source that contains no additional information barely influences the end result. This is an
advantage when the information content of a data source is
not previously known. A disadvantage is the loss of
information due to the strong generalisation as each data
source is effectively reduced to a single feature for the
meta-learner.
Early data integration seems to suffer when combining
sources that have features with a largely differing
information content and different numbers of features. An
unstructured data source typically renders 30,000
different, weak features, while a structured source often
contains only 500 different features.
CONCLUSIONS
Models using multiple electronic health record data
sources systematically outperform models using data
sources in isolation in the task of predicting ICD-9-CM
codes over a broad range of medical specialties.
ACKNOWLEDGEMENT
This work is supported by a doctoral research grant (nr.
131137) by the Agency for Innovation by Science and
Technology in Flanders (IWT). The datasets used in this
research were made available by the Antwerp University
Hospital (UZA) for restricted use.
REFERENCES
Scheurwegs E et al. (2015). Data integration of structured and unstructured sources for assigning clinical codes to patient stays. Journal of the American Medical Informatics Association, ocv115.
10th Benelux Bioinformatics Conference bbc 2015
96
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: P Poster
P52. SUPERVISED TEXT MINING FOR DISEASE AND GENE LINKS
Jaak Simm1,2,3*, Adam Arany1,2, Sarah ElShal1,2 & Yves Moreau1,2.
Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing, and Data Analytics, KU Leuven, Kasteelpark Arenberg 10, box 2446, 3001 Leuven, Belgium1; iMinds Medical IT, Kasteelpark Arenberg 10, box 2446, 3001 Leuven, Belgium2; Institute of Gene Technology, Tallinn University of Technology, Akadeemia tee 15A, Estonia3.
Scientific publications contain rich information about genetic disorders. Text mining these publications provides an
automatic way to quickly query and summarize the information. We propose a supervised learning approach that builds on the well-known unsupervised TF-IDF (term frequency–inverse document frequency) representation and integrates it with a supervised approach using a logistic loss. Preliminary results on the OMIM dataset look promising.
INTRODUCTION
Scientific publications contain rich information about
genetic disorders. Text mining these publications provides
an automatic way to quickly query and summarize the
information.
Traditional approaches employ unsupervised text mining techniques like TF-IDF (term frequency–inverse
document frequency) or Latent Dirichlet Allocation
(LDA) by Blei et al. (2003) for linking terms to genes and
diseases. Beegle (ElShal et al., 2015), a recent text mining tool developed for linking diseases and genes, has taken this approach using TF-IDF as its similarity metric.
PROPOSED METHOD
Our work proposes a supervised learning of the
importance of the textual terms, which can automatically
filter out many terms that are unnecessary for the task at
hand. We formulate it as a prediction of supervised values y given the terms for all genes g and all diseases d:

ŷ(g,d) = σ( Σ_i w_i · x_{g,i} · x_{d,i} )

where i is the index of the term, x_{g,i} and x_{d,i} are the TF-IDF scores of term i for gene g and disease d, w_i is the weight for term i, and σ is the sigmoid function. The main idea is to learn the weight vector w that minimizes the difference between the known values y and the predictions. The minimization can be transformed into a logistic regression.
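A toy sketch of this setup is given below: the prediction for one gene-disease pair and one logistic-loss gradient step on the term weights. The feature dictionaries, term names and learning rate are illustrative assumptions, not the authors' data or code.

```python
import math

def predict(w, x_gene, x_disease):
    """Sigmoid of the weighted sum of term scores shared by a pair."""
    z = sum(wi * x_gene.get(i, 0.0) * x_disease.get(i, 0.0)
            for i, wi in w.items())
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(w, x_gene, x_disease, y, lr=0.1):
    """One stochastic gradient step on the logistic loss."""
    g = predict(w, x_gene, x_disease) - y   # dLoss/dz for logistic loss
    for i in w:
        w[i] -= lr * g * x_gene.get(i, 0.0) * x_disease.get(i, 0.0)
    return w

# A known OMIM link (y=1) pulls up the weights of the shared terms only
w = {"seizure": 0.0, "ataxia": 0.0}
x_g = {"seizure": 1.2, "ataxia": 0.5}    # TF-IDF scores for a gene
x_d = {"seizure": 0.9}                   # TF-IDF scores for a disease
before = predict(w, x_g, x_d)            # 0.5 with zero weights
sgd_step(w, x_g, x_d, y=1)
after = predict(w, x_g, x_d)
```

Note that a term absent from either side of the pair receives no gradient, which is what lets the method filter out terms that are unnecessary for the task.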
For the supervised values we use the OMIM database (Hamosh et al., 2005). More specifically, y corresponds to
1 if there is a link between the given gene-disease pair and
0 if there is no link. Intuitively, in this setup the text
mining is transformed into a classification problem. We
use a dataset of 330 OMIM terms and their linked genes, and
randomly sample genes as negatives for each disease.
For the textual terms we use MEDLINE abstracts as the
source of biomedical text. We employ MetaMap (Aronson
et al. 2010) to link terms with abstracts. We use geneRIF
to link genes with abstracts, and PubMed to link diseases
with abstracts. We apply a TF-IDF transformation to score
a term with a given disease or gene based on the abstracts
linked to each entity. We only use the terms linked to
abstracts that belong to genes. Hence our vocabulary
consists of 66,883 terms.
RESULTS & DISCUSSION
The preliminary results show that supervised learning allows the informative keywords to be picked up automatically, improving the recall of the genes that are related to genetic disorders. We will present more detailed results in the poster.
We are also investigating how to integrate the supervised approach so that answers to online queries can be provided by Beegle.
REFERENCES
Blei DM, Ng AY & Jordan MI (2003). Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993-1022.
Hamosh A, Scott AF, Amberger JS, Bocchini CA & McKusick VA (2005). Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Research 33(suppl 1), D514-D517.
ElShal S, Tranchevent LC, Sifrim A, Ardeshirdavani A, Davis J & Moreau Y (2015). Beegle: from literature mining to disease-gene discovery. Nucleic Acids Research, gkv905.
Aronson AR & Lang FM (2010). An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association 17(3), 229-236.
P53. FLOWSOM WEB: A SCALABLE ALGORITHM TO VISUALIZE AND
COMPARE CYTOMETRY DATA IN THE BROWSER
Arne Soete2, Sofie Van Gassen1,2,3, Tom Dhaene1, Bart N. Lambrecht2,3 & Yvan Saeys2,3.
Department of Information Technology, Ghent University-iMinds, Ghent, Belgium1; Inflammation Research Center, VIB, Ghent, Belgium2; Department of Respiratory Medicine, Ghent University Hospital, Ghent, Belgium3.
We developed FlowSOM Web, a web-tool which visualizes cytometry data based on Self-Organizing Maps. Similar cells
are clustered and visualized via star charts. This allows us to process and display millions of cells efficiently.
Additionally, different biological samples (e.g. healthy versus diseased mice) can be compared.
INTRODUCTION
Cytometry data describes cell characteristics in
biological samples. Cells are labeled with fluorescent
antibodies and a flow cytometer measures the properties
of millions of cells one by one. Biologists use this
information to get more insight into diseases and to
diagnose patients. Most of them still analyse this data
manually to differentiate between the different cell types
present. This is done by plotting the data in 2D scatter
plots and selecting groups of cells in a hierarchical way.
This process is called `gating'. Recently, the number of
properties that can be measured simultaneously has
strongly increased. As the number of possible 2D scatter
plots increases exponentially with the number of
properties measured, it becomes infeasible to analyze
them all and relevant information that is present in the
data might be missed.
METHODS
We present FlowSOM, a new algorithm for the visualization and interpretation of cytometry data (Van Gassen et al., 2015). Using a two-level clustering and star charts, our algorithm helps to obtain a clear overview of how all markers behave on all cells, and to detect subsets that might otherwise be missed.
Our algorithm consists of 4 steps: pre-processing the
data, building a self-organizing map, building a minimal
spanning tree and computing a meta-clustering result.
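Steps 2 and 3 of this pipeline can be illustrated with a deliberately tiny, pure-Python sketch. The grid size, learning rate, toy data and the use of Prim's algorithm are illustrative assumptions; the actual FlowSOM implementation is the published R package.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two marker vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_som(cells, n_nodes=4, epochs=30, lr=0.3, seed=1):
    """Step 2: for each cell, pull the best-matching node's weight
    vector toward that cell, so similar cells cluster on one node."""
    rng = random.Random(seed)
    dim = len(cells[0])
    nodes = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for _ in range(epochs):
        for cell in cells:
            best = min(range(n_nodes), key=lambda n: dist2(nodes[n], cell))
            nodes[best] = [w + lr * (c - w) for w, c in zip(nodes[best], cell)]
    return nodes

def minimum_spanning_tree(nodes):
    """Step 3: Prim's algorithm over pairwise node distances."""
    in_tree, edges = {0}, []
    while len(in_tree) < len(nodes):
        u, v = min(((u, v) for u in in_tree
                    for v in range(len(nodes)) if v not in in_tree),
                   key=lambda e: dist2(nodes[e[0]], nodes[e[1]]))
        edges.append((u, v))
        in_tree.add(v)
    return edges

cells = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]  # two cell types
nodes = train_som(cells)
tree = minimum_spanning_tree(nodes)   # always n_nodes - 1 edges
```

In FlowSOM the tree layout is then used to draw the star charts, and a meta-clustering step groups the SOM nodes into larger cell populations.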
RESULTS & DISCUSSION
Although our results are quite similar to those of SPADE, another state-of-the-art algorithm for the visualization of cytometry data, they can be computed much faster and with less memory. By providing star charts and an automatic meta-clustering step, much more information can be visualised in a single tree than by the SPADE algorithm.
Additionally, multiple states can be compared (e.g.
healthy versus diseased mice) with one another and the
differences between the two states can be visualized via
star-charts.
At this conference, we would like to demonstrate a recently developed web interface to the underlying R functionality. This interface lets users upload cytometry data, run the aforementioned analysis, compare different cell states and explore the results via interactive visualizations, all from the comfort of the browser.
FIGURE 1. Example of a FlowSOM star chart.
REFERENCES
Van Gassen S et al. (2015). FlowSOM: Using self-organizing maps for visualization and interpretation of cytometry data. Cytometry 87, 636–645.
P54. TOWARDS A BELGIAN REFERENCE SET
Erika Souche1*, Amin Ardeshirdavani2, Yves Moreau2, Gert Matthijs1 & Joris Vermeesch1.
Department of Human Genetics, KU Leuven1; ESAT-STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven2.
Next-Generation Sequencing (NGS) is increasingly used to study and diagnose human disorders. Because the simultaneous sequencing of a large number of genes leads to the detection of a large number of variants, the bottleneck has moved from sequencing to variant interpretation and classification. Although publicly available databases of variant frequencies help distinguish causative mutations from common variants, they often lack population-specific variant frequencies. To circumvent this shortage of population-specific information, most genetic centers exploit their sequence data of unrelated and unaffected individuals to filter out common local variants. However, the resulting files/databases are rarely shared, and they are mainly based on whole exome data. In this project we demonstrate the utility of a local variant database generated from whole exome data, describe a procedure allowing the sharing of information between genetic centers, and mine low coverage whole genome data for common variants.
INTRODUCTION
Next-Generation Sequencing (NGS) is increasingly used
to study and diagnose human disorders. Because the simultaneous sequencing of a large number of genes leads to the detection of a large number of variants, the bottleneck has moved from sequencing to variant interpretation and classification. Publicly available databases of variant frequencies, provided by, among others, the Exome Sequencing Project (ESP), the 1000 Genomes Project (McVean et al., 2012) and dbSNP (Sherry et al., 2001), help distinguish causative mutations from common variants, identifying up to 78% of variants as common for a Belgian
exome. However, these data sets often lack population
specific variant frequencies and are outperformed by
databases of local variants. For example, using GoNL
(The Genome of the Netherlands Consortium, 2014) alone
allowed the identification of up to 85% of variants as
common for the same Belgian exome. The fact that the
GoNL is based on only 498 individuals further highlights
the importance of building and using population specific
databases.
Such population specific data can be retrieved from locally
sequenced individuals that underwent Whole Exome
Sequencing (WES) or Whole Genome Sequencing (WGS).
Storing only the frequencies and genotype counts of the
variants provides a valuable tool for variant classification
while no sensitive information on the individuals is
included.
METHODS
WES data of 350 unrelated and unaffected individuals
have been parsed. All samples were analysed in a similar way, i.e. reads were aligned to the reference genome with
BWA (Li & Durbin, 2009) and genotyping was performed
according to GATK best practices (McKenna et al., 2010;
DePristo et al., 2011). All samples were genotyped at all
polymorphic positions using GATK HaplotypeCaller and
GenotypeGVCFs. For each position, samples with low
quality genotype were considered as not genotyped and
excluded from the genotype counts. The number of
alternate alleles, allele counts and genotypes were
compiled in a population VCF file, in which individual
genotypes are not accessible.
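The aggregation into a population VCF can be sketched as follows. This is a minimal illustration only: it assumes GT and GQ fields are present in the FORMAT column of each record, and the GQ threshold is an assumption; the actual pipeline works on GATK HaplotypeCaller/GenotypeGVCFs output.

```python
def population_counts(vcf_line, min_gq=20):
    """Aggregate the per-sample genotypes of one VCF record into allele
    counts (AC) and the total number of genotyped alleles (AN), so that
    no individual genotype needs to be retained.  Samples with a low
    genotype quality are treated as not genotyped and excluded."""
    fields = vcf_line.rstrip("\n").split("\t")
    fmt = fields[8].split(":")
    gt_i, gq_i = fmt.index("GT"), fmt.index("GQ")
    ac, an = {}, 0
    for sample in fields[9:]:
        parts = sample.split(":")
        if int(parts[gq_i]) < min_gq:
            continue                     # excluded from the counts
        for allele in parts[gt_i].replace("|", "/").split("/"):
            if allele == ".":
                continue
            an += 1
            if allele != "0":
                ac[allele] = ac.get(allele, 0) + 1
    return ac, an

# Three samples; the third is dropped because of its low GQ
record = "\t".join(["21", "9411239", ".", "A", "G", "50", "PASS", ".",
                    "GT:GQ", "0/1:99", "1/1:99", "0/0:5"])
ac, an = population_counts(record)   # ac == {"1": 3}, an == 4
```

The resulting AC/AN pairs are what a shareable population VCF would carry in its INFO column.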
Variant frequencies can also be extracted from low
coverage WGS. As a pilot we processed the data of
chromosome 21 of about 4,000 WGS. The mapping was
performed with BWA (Li & Durbin, 2009) and the BAM
files were merged per 200 samples. All positions were
genotyped using freebayes (Garrison & Marth, 2012).
Genotype information for all locations outside low complexity regions was then compiled for all samples
using the integration of Apache Hadoop, HBase and Hive
(see poster “Big data solutions for variant discovery from
low coverage sequencing data, by integration of Hadoop,
Hbase and Hive”). Several models were then used to
distinguish real variants from sequencing errors: the Minor
Allele Frequency (MAF), the transition/transversion ratio,
the expected number of loci with a MAF of 5%, etc.
RESULTS & DISCUSSION
We demonstrated the effect of our reference set on several
exomes. The inclusion of only 350 individuals allowed the
identification of about 3% additional common variants,
not listed as common by ESP, dbSNP (Sherry et al., 2001),
1000 Genomes (McVean et al., 2012) and GoNL (The
Genome of the Netherlands Consortium, 2014). Since only
the frequencies of the variants in the screened populations
are reported, this file can easily be shared between
laboratories. Moreover, the procedure used to generate the
population VCF file can easily be applied to several
genetic centers in order to generate a common population
VCF file, as planned within the BeMGI project.
Finally we expect that the data from WGS will further
increase the performance of our reference set. A genome-wide variant frequency file from the local population will become worthwhile once WGS is routinely used in diagnostics.
REFERENCES
DePristo M et al. Nature Genetics 43, 491-498 (2011).
Exome Variant Server, NHLBI Exome Sequencing Project (ESP), Seattle, WA (URL: http://evs.gs.washington.edu/EVS/).
Garrison E & Marth G. http://arxiv.org/abs/1207.3907 (2012).
Li H & Durbin R. Bioinformatics 25, 1754-60 (2009).
McKenna A et al. Genome Research 20, 1297-303 (2010).
McVean et al. Nature 491, 56-65 (2012).
Sherry ST et al. Nucleic Acids Res. 29, 308-11 (2001).
The Genome of the Netherlands Consortium. Nature Genetics 46, 818-825 (2014).
P55. MANAGING BIG IMAGING DATA FROM MICROSCOPY:
A DEPARTMENTAL-WIDE APPROACH
Yves Sucaet1*, Silke Smeets1, Stijn Piessens1, Sabrina D’Haese1, Chris Groven1, Wim Waelput1 & Peter In’t Veld1.
Department of Pathology1, Faculty of Medicine, Vrije Universiteit Brussel, Laarbeeklaan 103, 1090 Brussels, Belgium.
With recent breakthroughs in whole slide imaging (WSI), almost any microscopic material can be digitized in an
efficient manner. In order to mine these data efficiently, a top-down approach was employed to manage various imaging
platforms. At Brussels Free University (VUB), we built a centralized infrastructure that integrates a variety of imaging
platforms (brightfield, fluorescence, multi-vendor formats). With the help of the Pathomation software platform for
digital microscopy, various datastores and image repositories were integrated. Custom coding was used to interact with
various vendor-software and server applications, where needed. The end-result is an interconnected network of
heterogeneous scalable information silos. We currently have two main use cases for WSI: education and biobanking.
These applications are available to the public via http://www.diabetesbiobank.org.
INTRODUCTION
Too often, image analysis and data/image mining projects
remain stuck in micro-environments because they are
limited by vendor-specific solutions that neither scale nor
interact with material from other departments or
institutions. Successful roll-out of digital histopathology
therefore requires more than a whole slide scanner.
If the goal is for an imaging facility to allow a researcher
to conduct a (microscopic) experiment, then that
researcher should not be hindered by the imaging platform
used. Similarly, an instructor integrating digital content
into his or her course, should be able to make their
materials as accessible as possible to as many students as
possible.
At Brussels Free University (VUB), we currently have two
main use cases for whole slide imaging: education and
biobanking. We have set these up in such a way that they
are both scalable and expandable.
METHODS
Whole slide imaging (WSI) has recently provided a boost
to digital capturing of microscopic content (and an explosion of data, a veritable digital treasure trove waiting to be explored by bioinformatics). But
researchers have been digitizing content for a long time
already through various technologies (mounted cameras,
inverted fluorescent microscopes with low magnification,
…).
We envisioned an environment whereby a researcher can
manage and view all of the material related to an
experiment or observation from a single interface,
irrespective of origin or technology used.
The following steps were taken to accomplish this:
- Set up a central server (50TB storage)
- Centrally store all imaging data and provide mapped drives on the individual workstations to facilitate a smooth transition for end-users
- Install the Pathomation platform for digital microscopy (PMA.core, PMA.view, PMA.zui) for universal viewing of digital content and to provide a uniform end-user experience
- Install Pydio (open source) for easy sharing of digital imaging content (integrated with Pathomation’s PMA.core so no duplicate user directories need to be maintained)
- Build custom portals to highlight specific collections of microscopic content and/or serve specific target audiences
RESULTS & DISCUSSION
The centralized digital imaging infrastructure is used by
various researchers and graduate students. Recently over
3,000 images were processed and hosted in the course of
one month.
Two use cases are worth highlighting:
- For undergraduate students (Medicine, BMS) we built custom portal websites to supplement their courses in histology and pathology. These sites are available at http://histology.vub.ac.be and http://pathology.vub.ac.be and provide students with (guided) virtual microscopy without the need to install any additional software.
- We also provide access portals to different specialized biobanks. The Willy Gepts collection represents a historic milestone in diabetes research (http://gepts.vub.ac.be) and is complementary to the Alan Foulis collection (http://foulis.vub.ac.be). Furthermore, the clinical diabetes biobank can now be consulted online, too, via http://www.diabetesbiobank.org.
CONCLUSION
Digital histopathology has been around for some time now,
but often results in heterogeneous data collections. It is only now that we are starting to look at integrated approaches to how this varied data can best be handled. Digital pathology involves much more than the acquisition of a slide scanner. We have brought five different imaging platforms onto a
single architecture. We are storing data from all modalities
in a single storage facility, and manage it through a single
access point. The resulting environment assists in
rendering content to any type of display device, without
the need for extra software or background information
concerning the content’s origin.
P56. ESTIMATING THE IMPACT OF CIS-REGULATORY VARIATION IN
CANCER GENOMES USING ENHANCER PREDICTION MODELS AND
MATCHED GENOME-EPIGENOME-TRANSCRIPTOME DATA
Dmitry Svetlichnyy1*, Hana Imrichova1, Zeynep Kalender Atak1 & Stein Aerts1.
Laboratory of Computational Biology, University of Leuven1. *[email protected]
The prioritization of candidate driver mutations in the non-coding part of the genome is a key challenge in cancer
genomics. Whereas driver mutations in protein-coding genes can be distinguished from passenger mutations based on
their recurrence, non-coding mutations are usually not recurrent at the same position. We aim to tackle this problem by applying machine-learning methods that predict regulatory regions to cancer genome sequences, in combination with sample-specific chromatin profiles obtained using ChIP-seq against H3K27Ac.
INTRODUCTION
Perturbations of gene regulatory networks in cancer cells
can arise from mutations in transcription factors or co-
factors, but also from mutations in regulatory regions.
Prioritizing candidate driver mutations that have a
significant impact on the activity of a regulatory region is
a key challenge in cancer genomics.
METHODS
We have developed enhancer prediction methods using
Random Forest classifiers to estimate the Predicted
Regulatory Impact of a Mutation in an Enhancer
(PRIME). We find that the recently identified driver
mutation in the TAL1 enhancer has a high PRIME score,
representing a “gain-of-target” for the oncogenic
transcription factor MYB [1]. We trained enhancer models
for 45 cancer-related transcription factors, and used these
to score somatic mutations across more than five hundred
breast cancer genomes. Next, we re-sequenced the genome
of ten cancer cell lines representing six different cancer
types (breast, lung, melanoma, ovarian, and colon) and
profiled their active chromatin by ChIP-seq against
H3K27Ac.
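The underlying scoring idea is the difference between the model's score of the mutant and the wild-type enhancer sequence. The sketch below uses a toy linear model over k-mer counts in place of the trained Random Forest; the k-mer length, weights and sequences are illustrative assumptions.

```python
def kmer_counts(seq, k=3):
    """Count overlapping k-mers, a simple sequence feature encoding."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts

def delta_score(model, wild_type, mutant, k=3):
    """Score both alleles with the same model; a large positive delta
    suggests the mutation creates regulatory activity (gain-of-target)."""
    def score(seq):
        c = kmer_counts(seq, k)
        return sum(model.get(kmer, 0.0) * n for kmer, n in c.items())
    return score(mutant) - score(wild_type)

# A model that rewards one hypothetical binding-site k-mer
model = {"TAA": 1.0}
delta = delta_score(model, "GGGGG", "GTAAG")   # mutation creates "TAA"
```

A per-mutation delta of this kind, computed with trained enhancer models, is what allows non-recurrent non-coding mutations to be prioritized from sequence alone.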
RESULTS & DISCUSSION
Then we integrated these data with matched expression
data and with the Random Forest model predictions for
sets of oncogenic transcription factors per cancer type.
This resulted in surprisingly few high-impact mutations
that generate de novo regulatory (oncogenic) activity at
the chromatin and gene expression level. Our framework
can be applied to identify candidate cis-regulatory
mutations using sequence information alone, and to
samples with combined genome-epigenome-transcriptome
data. Our results suggest the presence of only a few cis-regulatory driver mutations per cancer genome that may alter the expression levels of specific oncogenes and tumor suppressor genes.
REFERENCES
1. Mansour MR, Abraham BJ, Anders L, Berezovskaya A, Gutierrez A, Durbin AD, et al. An oncogenic super-enhancer formed through somatic mutation of a noncoding intergenic element. Science. 2014;346:1373-1377. doi:10.1126/science.1259037
P57. I-PV: A CIRCOS MODULE FOR INTERACTIVE PROTEIN
SEQUENCE VISUALIZATION
Ibrahim Tanyalcin1,2*, Carla Al Assaf3, Alexander Gheldof1, Katrien Stouffs1,4, Willy Lissens1,4 & Anna C. Jansen5,2.
Center for Medical Genetics, UZ Brussel, Brussels, Belgium1; Neurogenetics Research Group, Vrije Universiteit Brussel, Brussels, Belgium2; Center for Human Genetics, KU Leuven and University Hospitals Leuven, 3000 Leuven, Belgium3; Reproduction, Genetics and Regenerative Medicine, Vrije Universiteit Brussel, Brussels, Belgium4; Pediatric Neurology Unit, Department of Pediatrics, UZ Brussel, Brussels, Belgium5. *[email protected] or [email protected]
Summary: Today’s genome browsers and protein databanks supply vast amounts of information about proteins. The challenge is to concisely bring together this information in an interactive and easy to generate format.
Availability and Implementation: We have developed an interactive CIRCOS module called i-PV to visualize user supplied protein sequence, conservation and SNV data in a live presentable format. i-PV can be downloaded from http://www.i-pv.org.
INTRODUCTION
Today’s genome browsers and protein databanks supply
vast amounts of information about both the structural
annotation and the single nucleotide variants (SNV) in
genes. The challenge is to concisely bring together this
information in an interactive and easy to generate format.
Thus, we have developed an interactive CIRCOS
(Krzywinski et al.) module combined with D3 (Bostock et
al.) and plain javascript called i-PV to visualize user
supplied protein sequence, conservation and SNV data
while significantly easing and automating input file
requirements and generation.
METHODS
To use i-PV, only 4 text files (with “.txt” extension) have
to be supplied to the software: conservation scores,
protein and cDNA sequences, and SNVs/Indels files.
Protein and cDNA (or mRNA) sequence files are supplied
in fasta format, whereas SNV/Indel files are provided as an annotated VCF file (Variant Call Format). The conservation scores are simply an array of numbers separated by newline characters. The input files are supplied to i-PV, data are
automatically checked for errors or duplicates and
matched against the user provided fasta files, and then an
interactive html file containing the graph is automatically
generated as shown in Fig.1.
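An input check of this kind might look like the following sketch. The exact validations i-PV performs are not specified here; this one-score-per-residue consistency check is an assumption for illustration.

```python
def load_conservation(text):
    """Parse a conservation-score file: one number per line."""
    return [float(line) for line in text.splitlines() if line.strip()]

def check_lengths(protein_seq, scores):
    """Require exactly one conservation score per amino acid residue,
    so the conservation track aligns with the sequence track."""
    if len(scores) != len(protein_seq):
        raise ValueError(
            f"{len(scores)} scores for {len(protein_seq)} residues")
    return True

scores = load_conservation("0.9\n0.4\n1.0\n")
ok = check_lengths("MKV", scores)   # hypothetical 3-residue protein
```

Checks like this catch mismatched input files before the interactive HTML graph is generated.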
RESULTS & DISCUSSION
Many sequence visualization tools focus on certain aspects
of proteins such as conservation, variations, sequence
alignments or topology. While all these tools are very
useful in their own right, we pursued a more interactivity-based design. Therefore, i-PV is not solely designed for
visualization but also for live presentable graphs and
information that can selectively be displayed and
customized. I-PV combines major sources of information
under one html file that is easy to generate and share on
both desktop and mobile environments.
Last but not least, many visualization tools are based on
rectangular-scroll based representation of information
which does not deliver a “wide angle” view of the
sequence data, unlike circular visualization. However, as with all other types of visualization, circular graphs also have limitations when it comes to conveniently zooming in to a particular region or visually aligning tracks with different radii. We intend to further
develop this software with several other features based on
end user needs. The current version of i-PV can be
downloaded from http://www.i-pv.org.
FIGURE 1. Overview of i-PV features. (A) SNVs with mouse over
explanation and automatic generated dbSNP links (red: Non-
synonymous, green: Synonymous, gray: Not validated). (B) Console can be hidden for publication quality image. (C) Domains are colored based
on user preference. (D) Conservation data from user generated
alignment with mouse over information. (E) The user can define which amino acids to be shown on the sequence track. (F) Switch the color of
the background to black. (G) Amino acids are plotted and split into 5
main categories (nonpolar: gray circle, polar: magenta circle, negative: blue triangle, positive: red triangle, aromatic: green hexagon). (H)
Adjustable conservation score threshold to display regions above a
certain percentage of maximum conservation score. (I) Font-size of chosen amino acids can be adjusted. (J) User selectable amino acids to
be displayed. (K) Up to 17 different amino acid properties can be chosen
to be displayed from a drop-down menu. (L) Tile track showing SNVs and
collapsed due to over display). (M) Gene Name. (N) Buttons for mass
selection of amino acids. (O) User defined regions are marked with custom name tag and mouse over information. (P) Meta-analysis of
amino acid distributions. This information is only displayed in case of
single amino acid comparisons. The log2 ratios are capped between -3 and 3. The maximum and the minimum blosum62 scores are -4 and 11.
Since the blosum62 matrix is diagonally symmetric, the absolute value of the log ratios is mapped to this range and a p-value is indicated based on how close the two scores are.
REFERENCES
Bostock M et al. (2011). D3: Data-Driven Documents. IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis).
Krzywinski M et al. (2009). Circos: an information aesthetic for comparative genomics. Genome Res 19(9), 1639-45.
P58. SFINX: STRAIGHTFORWARD FILTERING INDEX FOR AFFINITY
PURIFICATION-MASS SPECTROMETRY DATA ANALYSIS
Kevin Titeca1,2, Pieter Meysman3,4, Kris Gevaert1,2, Jan Tavernier1,2, Kris Laukens3,4, Lennart Martens1,2 & Sven Eyckerman1,2*.
Medical Biotechnology Center, VIB, B-9000 Ghent, Belgium1; Department of Biochemistry, Ghent University, B-9000 Ghent, Belgium2; Advanced Database Research and Modeling (ADReM), University of Antwerp, Belgium3; Biomedical informatics research center Antwerpen (biomina), Belgium4. [email protected]
Affinity purification-mass spectrometry (AP-MS) is one of the most common techniques for the analysis of protein-
protein interactions, but inferring bona fide interactions from the resulting datasets remains notoriously difficult because
of the many false positives. The ideal filter technique for these data is highly accurate, fast and user-friendly, without the need to rely on extensive parameter optimization or external databases, which also makes it reproducible and unbiased.
Because none of the existing filter techniques combines all these features, we developed SFINX, the Straightforward
Filtering INdeX.
We here describe the SFINX algorithm and its performance on two independent AP-MS benchmark datasets. SFINX shows superior performance over the other approaches, with accuracy increases of up to 20%, and is extremely fast. It does not require parameter optimization and is entirely independent of external resources. Both the algorithm and its
website interface are highly intuitive with limited need for user input and the possibility of immediate network
visualization and interpretation at http://sfinx.ugent.be/. SFINX might become essential in the toolbox of any scientist
interested in user-friendly and highly accurate filtering of AP-MS data.
P59. MAPREDUCE APPROACHES FOR CONTACT MAP PREDICTION:
AN EXTREMELY IMBALANCED BIG DATA PROBLEM
Isaac Triguero1,2*, Sara del Río3, Victoria López3, Jaume Bacardit4, José M. Benítez3 & Francisco Herrera3.
VIB Inflammation Research Center1; Department of Respiratory Medicine, Ghent University2; Department of Computer Science and Artificial Intelligence3; School of Computing Science, Newcastle University4.
The application of data mining and machine learning techniques to biological and biomedical data continues to be a ubiquitous research theme in current bioinformatics. The rapid advances in biotechnology allow us to obtain and store large quantities of data about cells, proteins, genes, etc., that need to be processed. Moreover, in many of these problems, such as contact map prediction, it is difficult to collect representative positive examples. Learning under these circumstances, known as imbalanced big data classification, may not be straightforward for most standard machine learning methods. In this work we describe the methodology that won the ECBDL'14 big data competition, which concerned the prediction of contact maps. Our methodology is composed of several MapReduce approaches for dealing with large amounts of data. The results show that this model is well suited to tackling large-scale bioinformatics classification problems.
INTRODUCTION
The prediction of a protein’s contact map is a crucial step
for the prediction of the complete 3D structure of a protein.
This is one of the most challenging bioinformatics tasks
within the field of protein structure prediction because of
the sparseness of the contacts (i.e. few positive examples)
and the great amount of data extracted (i.e. millions of instances, gigabytes of disk space) from a few thousand proteins.
This problem is an imbalanced bioinformatics big data application, in which traditional machine learning techniques become ineffective and inefficient due to the scale of the problem. However, using emerging cloud-based technologies, these techniques can be redesigned to extract valuable knowledge from such amounts of data.
The ECBDL'14 competition (http://cruncher.ncl.ac.uk/bdcomp/) provided a data set that modeled the contact map prediction problem as a classification task. Concretely, the training data set comprised 32 million instances with 631 attributes and 2 classes, 98% of the examples being negative, and occupied about 56 GB of disk space.
In this work we describe the methodology with which we participated, under the name 'Efdamis', which ranked as the winning algorithm (Triguero et al, 2015).
METHODS
In the proposed methodology, we relied on the MapReduce (Dean et al, 2008) paradigm to manage this voluminous data set. We extended the applicability of several pre-processing and classification models to large-scale problems. The methodology is composed of four main parts:
An oversampling approach: The goal is to balance the
highly skewed class distribution of the problem by
replicating randomly the instances of the minority
class (del Rio et al, 2014).
An evolutionary feature weighting method: Due to the relatively high number of features in the given problem, we developed a feature weighting scheme for large-scale problems that improves the classification performance by detecting the most significant features (Triguero et al, 2012).
Building a learning model: As classifier, we focused
on a scalable RandomForest algorithm.
Testing the model: Even the test data can be considered big data (2.9 million instances), so the testing phase was also deployed within a parallel approach.
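The actual entry ran on a Hadoop-style MapReduce platform; purely as an illustration, the per-partition random oversampling idea can be sketched in plain Python, with `map` standing in for the mappers and `reduce` for the reducer (the data, labels and partition count below are invented):

```python
import random
from functools import reduce

def map_oversample(partition, seed=0):
    """Mapper: replicate random minority-class rows until the partition
    holds as many minority as majority examples."""
    rng = random.Random(seed)
    pos = [r for r in partition if r["label"] == 1]
    neg = [r for r in partition if r["label"] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return partition + extra

def reduce_concat(a, b):
    """Reducer: merge the balanced partitions into one training set."""
    return a + b

# Toy imbalanced data (label 1 is rare), split across 4 partitions.
data = [{"x": i, "label": 1 if i % 7 == 0 else 0} for i in range(400)]
partitions = [data[i::4] for i in range(4)]

balanced = reduce(reduce_concat, map(map_oversample, partitions, range(4)))
n_pos = sum(r["label"] for r in balanced)
print(n_pos, len(balanced) - n_pos)  # → 342 342
```

After the map phase every partition is locally balanced, so the merged training set contains equally many positive and negative rows, which is the property the Random Forest classifier needs.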
RESULTS & DISCUSSION
Table 1 presents the final results of the top 5 participants
in terms of True Positive Rate (TPR) and True Negative
Rate (TNR). In this particular problem, balancing the TPR and TNR ratios emerged as a difficult challenge for most of the participants in the competition. In this sense, the use of scalable preprocessing techniques played an important role in improving the results of the Random Forest classifier. First, the designed oversampling approach prevented the Random Forest from being biased towards the negative class. Second, our feature weighting approach made it possible to reduce the dimensionality of the problem by selecting the most relevant features. This resulted in better performance as well as a notable reduction of the time requirements.

Team          TPR       TNR       TPR * TNR
Efdamis       0.73043   0.73018   0.53335
ICOS          0.70321   0.73016   0.51345
UNSW          0.69916   0.72763   0.50873
HyperEns      0.64003   0.76338   0.48858
PUC-Rio_ICA   0.65709   0.71460   0.46956

TABLE 1: Comparison with the top 5 of the competition.
REFERENCES
Dean J., Ghemawat S., MapReduce: simplified data processing on large clusters, Commun. ACM 51 (1), 107–113 (2008).
del Río S., et al., On the use of MapReduce for imbalanced big data using random forest, Inf. Sci. 285, 112–137 (2014).
Triguero I. et al., Integrating a differential evolution feature weighting scheme into prototype generation, Neurocomputing 97, 332–343 (2012).
P60. COEXPNETVIZ: THE CONSTRUCTION AND VISUALISATION OF CO-EXPRESSION NETWORKS
Oren Tzfadia1,2, Tim Diels1,2,4, Sam De Meyer1,2, Klaas Vandepoele1,2, Yves Van de Peer1,2,3,5,* & Asaph Aharoni6.
Department of Plant Systems Biology, VIB, 9052 Ghent, Belgium1; Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium2; Genomics Research Institute (GRI), University of Pretoria, 0028 Pretoria, South Africa3; Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium4; Bioinformatics Institute Ghent, Ghent University, 9052 Ghent, Belgium5; Department of Plant Sciences and the Environment, Weizmann Institute of Science, Rehovot6.
INTRODUCTION
Comparative transcriptomics is a common approach in
functional gene discovery efforts. It allows for finding
conserved co-expression patterns between orthologous
genes in closely related plant species, suggesting that these
genes potentially share similar function and regulation.
Several efficient co-expression-based tools have been commonly used in plant research, but most of these pipelines are limited to data from model systems, which greatly limits their utility. Moreover, none of the existing pipelines allows plant researchers to use their own unpublished gene expression data to perform a comparative co-expression analysis and generate multi-species co-expression networks.
RESULTS
We introduce CoExpNetViz, a computational tool that takes as input a set of bait genes (chosen by the user) and at least one pre-processed gene expression dataset. The CoExpNetViz algorithm proceeds in three main steps: (i) for every bait gene submitted, co-expression values are calculated using Pearson correlation coefficients; (ii) non-bait (or target) genes are grouped based on cross-species orthology; and (iii) output files are generated, and results can be visualized as network graphs in Cytoscape.
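Step (i) can be sketched as follows; the expression profiles, gene names and correlation cutoff below are invented for illustration (the real tool implements this in C++/Java):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy expression matrix: gene -> expression across 5 conditions.
expression = {
    "bait1":   [1.0, 2.0, 3.0, 4.0, 5.0],
    "target1": [2.1, 3.9, 6.2, 8.0, 9.9],   # strongly co-expressed
    "target2": [5.0, 1.0, 4.0, 2.0, 3.0],   # uncorrelated
}

baits = ["bait1"]
cutoff = 0.8  # hypothetical correlation cutoff
edges = [(b, g) for b in baits for g in expression
         if g not in baits and pearson(expression[b], expression[g]) >= cutoff]
print(edges)  # [('bait1', 'target1')]
```

Each surviving (bait, target) pair becomes an edge of the co-expression network that is later grouped by orthology and exported for Cytoscape.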
AVAILABILITY AND IMPLEMENTATION
The CoExpNetViz tool is freely available both as a PHP web server (http://bioinformatics.psb.ugent.be/webtools/coexpr/) with a C++ back end and as a Cytoscape plugin (implemented in Java). Both versions of the CoExpNetViz tool support Linux and Windows platforms.
P61. THE DETECTION OF PURIFYING SELECTION DURING TUMOUR EVOLUTION UNVEILS CANCER VULNERABILITIES
Jimmy Van den Eynden1* & Erik Larsson1.
Department of Medical Biochemistry and Cell Biology, Institute of Biomedicine, The Sahlgrenska Academy, University
of Gothenburg, Sweden. *[email protected]
Identification of somatic mutation patterns indicative of positive selection has arguably become the major goal of cancer genomics. This is motivated by the search for cancer driver genes and pathways that are recurrently activated in tumours
but not normal cells, thus providing possible therapeutic windows. However, cancer cells additionally depend on a large
number of basic cellular processes, and elevated sensitivity to inhibition of certain essential non-driver genes has been
demonstrated in some cases. While such vulnerability genes should in theory be identifiable based on strong purifying
(negative) selection in tumors, these patterns have been elusive and purifying selection remains underexplored in cancer.
We established a new methodology and, using mutational data from 25 TCGA tumor types, we show for the first time
that negative selection in candidate vulnerability genes can be detected.
INTRODUCTION
Recently it was shown that a hemizygous deletion of the
well–known tumour suppressor gene TP53 creates
therapeutic vulnerability in colorectal cancer due to
concomitant loss of the neighbouring gene POLR2A (Liu
et al., 2015).
As any damaging mutation occurring in the single remaining allele of a hemizygously deleted essential gene, such as POLR2A, is expected to lead to cell death, we hypothesized that purifying selection in these genes could be unveiled by demonstrating a lower number of damaging mutations than would be expected in the absence of any selection.
Therefore we used the POLR2A case as a proof-of-
concept to develop a methodology to detect purifying
selection in large genome sequencing datasets.
METHODS
Mutation and copy number data from 25 different cancer types and 7,871 samples were downloaded from the TCGA data portal and pooled into a large pan-cancer dataset. Different mutational functional impact scores were calculated using ANNOVAR. Copy number data were analyzed using GISTIC 2.0 to differentiate POLR2A copy-number-neutral from hemizygously deleted samples.
RESULTS & DISCUSSION
POLR2A was found to be hemizygously deleted in 29% of
all samples. As expected, in over 99% this deletion was
part of the TP53 (driving) deletion on chromosome 17.
POLR2A was mutated 228 times in 2.3% of all samples. While 14 nonsense mutations and small out-of-frame insertions or deletions occurred in the copy-number-neutral group, none of these damaging mutations were found in the deletion group (p=0.03, Fisher's exact test), suggesting purifying selection against this type of mutation.
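The reported p=0.03 comes from the authors' full contingency table, which the abstract does not give. The sketch below only shows the form of such a one-sided Fisher's exact test: the 14 and 0 damaging-mutation counts are taken from the text, but the group sizes are hypothetical.

```python
from math import comb

def fisher_exact_less(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    probability of observing a or fewer events in the first row, from
    the hypergeometric distribution."""
    n_total = a + b + c + d          # all mutations
    k_event = a + c                  # damaging mutations overall
    n_row = a + b                    # mutations in the deletion group
    num = sum(comb(k_event, i) * comb(n_total - k_event, n_row - i)
              for i in range(a + 1))
    return num / comb(n_total, n_row)

# Hypothetical group sizes: 0 of 60 damaging mutations in the deletion
# group versus 14 of 168 in the copy-number-neutral group.
p = fisher_exact_less(0, 60, 14, 154)
print(round(p, 3))
```

With 0 observed events the tail reduces to a single hypergeometric term, which is why such a depletion can be significant even at modest sample sizes.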
Besides these truncating mutations, missense mutations that have a damaging effect on the protein's function are also expected to be selected against. We therefore predicted the functional impact of all mutations using different functional impact scores. The median (PolyPhen-2) functional impact score was found to be significantly lower in the deletion group compared to the copy-number-neutral group (p=0.002, Wilcoxon test, Fig. 1), further confirming that purifying selection has acted on POLR2A during tumour evolution.
These preliminary findings confirm that purifying selection is detectable in vulnerability genes like POLR2A, and that this approach could be used to detect new candidate vulnerability genes.
FIGURE 1. Negative selection against POLR2A high impact mutations in
hemizygously deleted tumour samples.
REFERENCES
Liu, Y., Zhang, X., Han, C., Wan, G., Huang, X., Ivan, C., … Lu, X. (2015). TP53 loss creates therapeutic vulnerability in colorectal cancer. Nature, 520(7549), 697–701. http://doi.org/10.1038/nature14418
P62. FLOREMI: SURVIVAL TIME PREDICTION
BASED ON FLOW CYTOMETRY DATA
Sofie Van Gassen1,2,3*, Celine Vens2,3,4, Tom Dhaene1, Bart N. Lambrecht2,3 & Yvan Saeys2,3.
Department of Information Technology, Ghent University - iMinds1; VIB Inflammation Research Center2; Department of Respiratory Medicine, Ghent University3; Department of Public Health and Primary Care, KU Leuven Kulak4.
Flow cytometry is a high-throughput technique for single cell analysis. It enables researchers and pathologists to study
blood and tissue samples by measuring several cell properties, such as cell size, granularity and the presence of cellular
markers. While this technique provides a wealth of information, it becomes hard to analyze all data manually. To
investigate alternative automatic analysis methods, the FlowCAP challenges were organized. We will present an
algorithm that obtained the best results on the FlowCAP IV challenge, predicting the time of progression to AIDS for
HIV patients.
INTRODUCTION
The main task of the most recent FlowCAP IV challenge
was a survival modeling challenge: participants had to
predict the time of progression to AIDS for HIV patients,
based on flow cytometry data of an unstimulated and a
stimulated blood sample. Additionally, a secondary task
was the identification of cell populations that could be
indicative of this progression rate. Several challenges
needed to be taken into account: the raw dataset was about 20 GB in size, and about eighty percent of the survival times were censored.
METHODS
We developed a new algorithm, FloReMi, which
combined several preprocessing steps with a density based
clustering algorithm, a feature selection step and a random
survival forest (Van Gassen et al., 2015).
The input for our algorithm consisted of two flow cytometry samples per patient: one unstimulated PBMC sample and one PBMC sample stimulated with HIV antigens. For each of these samples, 16 parameters were measured for hundreds of thousands of cells.
First, we included quality control to remove erroneous
measurements from the samples. We also made an
automatic selection of live T cells to focus on the cells of
interest in this specific flow cytometry staining.
Once the dataset was cleaned up, we extracted features for
each patient. This was done by clustering the cells using
the flowDensity (Malek et al., 2015) and flowType
algorithms (Aghaeepour et al., 2012). These algorithms divide the values for each feature into either “high” or “low” and use all combinatorial options of “high”, “low” or “neutral” marker values to group the cells. This resulted in 3^10 different cell subsets.
For each of these subsets, we computed the number of
cells assigned to it and the mean fluorescence intensity for
13 markers. Per patient, we collected these numbers for
both samples and also computed the differences between
the two. This resulted in a total of 2,480,058 features per
patient.
Because traditional machine learning algorithms cannot handle this number of features, we then applied a feature selection step. To estimate the usefulness of a feature, we applied a Cox proportional hazards model to each feature; the resulting p-value indicates how well the feature corresponds with the known survival times of the training set. We ordered the features based on these scores and picked only those that were uncorrelated with the already selected ones. This resulted in a final selection of 13 features, to which we applied several machine learning techniques. We compared the results of the Cox Proportional Hazards model, the Additive Hazards model and the Random Survival Forest.
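A minimal sketch of the ranking-plus-decorrelation step described above; the feature values, p-values and correlation threshold are invented (in the real pipeline the p-values come from per-feature Cox models):

```python
def pearson(x, y):
    """Pearson correlation between two feature vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def select_uncorrelated(features, pvalues, r_max=0.2):
    """Walk the features from best (lowest p-value) to worst and keep a
    feature only if it is weakly correlated with every kept one."""
    order = sorted(features, key=lambda f: pvalues[f])
    kept = []
    for f in order:
        if all(abs(pearson(features[f], features[k])) < r_max for k in kept):
            kept.append(f)
    return kept

# Toy features over 6 patients; f2 duplicates f1 and must be dropped.
features = {
    "f1": [1, 2, 3, 4, 5, 6],
    "f2": [2, 4, 6, 8, 10, 12],   # perfectly correlated with f1
    "f3": [5, 1, 4, 2, 6, 3],     # roughly independent of f1
}
pvalues = {"f1": 1e-6, "f2": 1e-5, "f3": 1e-3}  # hypothetical Cox p-values
print(select_uncorrelated(features, pvalues))   # ['f1', 'f3']
```

The greedy pass keeps the strongest survival-associated feature in each correlated group, which is how millions of redundant subset features can shrink to a short list.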
RESULTS & DISCUSSION
All three methods performed well on the training dataset. However, on the test dataset, both the Cox Proportional Hazards model and the Additive Hazards model performed poorly, probably due to overfitting on the training data. Only the Random Survival Forest obtained good results on the test dataset (Figure 1), outperforming all other methods submitted to the challenge.
FIGURE 1. On the training dataset, there was a strong correlation
between the scores and the actual survival times for all models. On the test dataset, only the Random Survival Forest performed well.
One important challenge remains: the biological interpretation of our final features. Although they correlate with the progression times from HIV infection to AIDS, it is hard to interpret them as known cell types due to our unsupervised feature extraction. Our method delivers a first step towards new insights into the progression from HIV infection to AIDS.
REFERENCES
Malek M et al. Bioinformatics 31.4, 606-607 (2015).
Aghaeepour N et al. Bioinformatics 28, 1009-1016 (2012).
Van Gassen S et al. Cytometry A, DOI 10.1002/cyto.a.22734
P63. STUDYING BET PROTEIN-CHROMATIN OCCUPATION TO
UNDERSTAND GENOTOXICITY OF MLV-BASED GENE THERAPY VECTORS
Sebastiaan Vanuytven1*, Jonas Demeulemeester1, Zeger Debyser1 & Rik Gijsbers1,2.
Laboratory for Molecular Virology and Gene Therapy, KU Leuven1; Leuven Viral Vector Core, KU Leuven2.
Integrating retroviral vectors are used to treat genetic and acquired disorders that, theoretically, can be cured by
introducing specific gene expression cassettes into patient cells. Clinical trials held over the past two decades have
proven that this approach is effective in curing genetic disorders and can produce better results than the standard therapy
(Touzot, F et al., 2015). Nevertheless, adverse events in a limited number of patients treated with gamma-retroviral
vectors have deterred their widespread application. Specifically, vector integration occurring in proximity of proto-
oncogenes resulted in insertional mutagenesis and clonal expansion of the cells (Hacein-Bey-Abina S et al., 2003).
INTRODUCTION
Retroviruses and their derived viral vectors do not
integrate at random. Their overall integration pattern is
dictated by cellular cofactors that are co-opted by the
invading viral complex. For gammaretroviral vectors
(prototype MLV) the cellular bromo- and extraterminal
domain (BET) family of proteins (BRD2, BRD3 and
BRD4) tethers the viral integrase to the host cell
chromatin (De Rijck J et al., 2013). At the moment the
only available ChIP-seq data derives from HEK-293T
cells exogenously overexpressing FLAG-tagged versions
of the BET proteins (LeRoy G et al., 2012). Yet, the
detailed chromatin binding profile of endogenous BET
proteins in human cells is currently unknown. Here we
report on the chromatin occupation of the endogenous
BET proteins in K562 and human primary CD4+ T cells.
METHODS
Following fixation, all three BET proteins were pulled down with specific antibodies (Bethyl Laboratories, α-BRD2: A302-583A; α-BRD3: A302-368A; α-BRD4: A301-985A, or Abcam ab84776). Subsequently, 1×10^7 cells per sample were processed for ChIP as previously described (Pradeepa MM et al., 2012). ChIPed DNA was amplified with WGA2 using the manufacturer's protocol (Sigma-Aldrich). All ChIP experiments were done with at least two biological replicates in K562 and CD4+ T cells.
After processing of the ChIP-seq data, we compared the
obtained BET protein-binding sites with MLV integration
sites, histone modifications and other genetic features.
Furthermore, we used motif discovery in the
neighbourhood of BET binding sites and MLV integration
sites to try and discover potential new players in the MLV
integration process.
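The comparison of integration sites with binding sites boils down to interval overlap. A minimal sketch with invented coordinates on a single chromosome (real data would be per-chromosome BED-style intervals):

```python
import bisect

def overlap_fraction(sites, intervals):
    """Fraction of integration sites falling inside any binding-site
    interval. `intervals` must be sorted, non-overlapping (start, end)
    pairs with inclusive ends."""
    starts = [s for s, _ in intervals]
    hits = 0
    for pos in sites:
        i = bisect.bisect_right(starts, pos) - 1  # last interval starting <= pos
        if i >= 0 and pos <= intervals[i][1]:
            hits += 1
    return hits / len(sites)

# Hypothetical BET-binding intervals and MLV integration positions.
bet_sites = [(100, 200), (500, 650), (900, 1000)]
mlv_sites = [150, 400, 600, 951, 700]
print(overlap_fraction(mlv_sites, bet_sites))  # 0.6
```

Binary search over the sorted interval starts keeps the per-site lookup at O(log n), which matters when scanning genome-wide ChIP-seq peak sets.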
RESULTS & DISCUSSION
Analysis showed that 24% of the MLV integration sites
overlap with a BET-binding site in K562 cells, the
majority of which are BRD4 sites. In addition, BET
binding sites located in promoter and enhancer regions are
preferred for MLV integration. Further evaluation demonstrated a strong correlation between MLV integration in these sites and the occurrence of transcription factor recognition motifs for MAX, GATA2, EGR1, GABPA and YY1, suggesting a role for these proteins or the underlying chromatin structures in targeting MLV integration to these locations in the genome via interaction with BET proteins and/or the MLV long terminal repeat sequences. Recently, we generated MLV-based vectors that no longer recognize BET proteins: BET-independent MLV-based (BinMLV) vectors (El Ashkar S et al., 2014). Integration preferences of BinMLV vectors are shifted away from epigenetic marks associated with enhancers and promoters, as shown by PCA analysis, and they also associate less with BET and MAX binding sites. Even though BinMLV vectors still did not integrate at random, their distribution can overall be described as safer, with 3% more integration sites in so-called genomic "safe harbor" regions (Sadelain M et al., 2012).
REFERENCES
De Rijck J et al. The BET family of proteins targets moloney murine
leukemia virus integration near transcription start sites, Cell Rep, 5, 886-894, (2013).
El Ashkar S et al. BET-independent MLV-based Vectors Target Away
From Promoters and Regulatory Elements, Mol Ther Nucleic Acids, 3, e179, (2014).
Hacein-Bey-Abina S et al. LMO2-associated clonal T cell proliferation in
two patients after gene therapy for SCID-X1, Science, 302, 415-419, (2003).
LeRoy G et al. Proteogenomic characterization and mapping of nucleosomes decoded by Brd and HP1 proteins, Genome Biol, 13,
R68, (2012).
Pradeepa MM et al. Psip1/Ledgf p52 binds methylated histone H3K36 and splicing factors and contributes to the regulation of alternative
splicing, PLoS Genet, 8, e1002717, (2012).
Sadelain M, Papapetrou EP and Bushman FD. Safe harbours for the integration of new DNA in the human genome, Nat Rev Cancer, 12,
51-58, (2012).
Touzot, F et al. Faster T-cell development following gene therapy compared with haploidentical HSCT in the treatment of SCID-X1,
Blood, 125, 3563-3569, (2015).
P64. THE COMPLETE GENOME SEQUENCE OF LACTOBACILLUS
FERMENTUM IMDO 130101 AND ITS METABOLIC TRAITS RELATED TO
THE SOURDOUGH FERMENTATION PROCESS
Marko Verce, Koen Illeghems, Luc De Vuyst & Stefan Weckx*.
Research Group of Industrial Microbiology and Food Biotechnology (IMDO), Faculty of Sciences and Bioengineering
Sciences, Vrije Universiteit Brussel, Brussels, Belgium. *[email protected]
The genome of the lactic acid bacterium species Lactobacillus fermentum IMDO 130101, capable of dominating
sourdough fermentation processes, was sequenced, annotated, and curated. Further, this genome sequence of 2.09 Mbp
was compared to other complete genomes of different strains of L. fermentum to elucidate the potential of L. fermentum
IMDO 130101 as a sourdough starter culture strain. As opposed to the other strains, L. fermentum IMDO 130101 contained unique genes related to carbohydrate import and metabolism, as well as a gene coding for a phenolic acid decarboxylase and a gene encoding a 4,6-α-glucanotransferase. The latter enzyme activity may result in the production of isomalto/malto-polysaccharides. All these features make L. fermentum IMDO 130101 attractive for further study as a
candidate sourdough starter culture strain.
INTRODUCTION
Lactobacillus fermentum is a heterofermentative lactic
acid bacterium often found in fermented food products,
including sourdough. Strain L. fermentum IMDO 130101,
a dominant sourdough strain originally isolated from a rye
sourdough (Weckx et al., 2010) and extensively described
previously (e.g., Vrancken et al., 2008), was sequenced
and compared to other L. fermentum strains with
completed genomes to elucidate unique adaptations of the
strain studied to the sourdough environment.
METHODS
High-quality genomic DNA was used to construct an 8-kb
paired-end library for 454 pyrosequencing. The
pyrosequencing reads were assembled using the GS De
Novo Assembler version 2.5.3 with default parameters.
Primers for gap closure were designed using CONSED
23.0, the gaps amplified with polymerase chain reaction
(PCR) assays and the amplicons sequenced using Sanger
sequencing. The sequences were imported into CONSED
23.0 and used to close the gaps. The genome was
annotated using the automated genome annotation
platform GenDB v2.2 (Meyer et al., 2003), followed by
extensive manual curation. Publicly available genome
sequences of L. fermentum F-6 (Sun et al., 2015), L.
fermentum IFO 3956 (Morita et al., 2008), and L.
fermentum CECT 5716 (Jiménez et al., 2010) were
acquired from RefSeq. Whole-genome comparisons with
the other three L. fermentum strains and ortholog findings
were performed using the progressiveMauve algorithm
(Darling et al., 2010).
RESULTS & DISCUSSION
The 2.09 Mbp genome was assembled from 403,466 reads,
resulting in 74 contigs. No plasmids were found. The
comparative genome analysis with other strains showed
that 477 coding sequences were found in L. fermentum
IMDO 130101 solely (Figure 1).
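The strain-specific coding sequences behind Figure 1 amount to set differences over ortholog groups. A toy sketch with invented group identifiers (real input would come from the progressiveMauve ortholog table):

```python
# Hypothetical ortholog-group identifiers per strain.
strains = {
    "IMDO 130101": {"og1", "og2", "og3", "og4", "og7"},
    "F-6":         {"og1", "og2", "og5"},
    "IFO 3956":    {"og1", "og3", "og5"},
    "CECT 5716":   {"og1", "og2", "og3", "og6"},
}

def unique_to(strain, strains):
    """Coding sequences found in `strain` and in none of the others."""
    others = set().union(*(s for name, s in strains.items() if name != strain))
    return strains[strain] - others

print(sorted(unique_to("IMDO 130101", strains)))  # ['og4', 'og7']
```

Applying the same set difference to the real ortholog table yields the 477 coding sequences reported for L. fermentum IMDO 130101, and the pairwise intersections fill in the remaining Venn diagram regions.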
L. fermentum IMDO 130101 was predicted to be able to
import and utilise glucose, fructose, xylose, mannose, N-
acetylglucosamine, maltose, sucrose, lactose and gluconic
acid via the heterolactic fermentation pathway. Also, the
ability to degrade raffinose and arabinose was predicted.
Consumption of glucose, fructose, maltose and sucrose
was shown in previous research, although growth with
sucrose as the sole energy source was impaired (Vrancken
et al., 2008). The strain possibly imports isomaltose and maltodextrins, thereby obtaining additional glucose subunits. The α-glucosidase-encoding gene was not found in the
genomes of the other three strains considered, and neither
were the putative maltodextrin import-related genes, the
trehalose-6-phosphate phosphorylase-encoding gene and a
putative -glucanase-encoding gene, which all may be
adaptations of L. fermentum IMDO 130101 to the
sourdough environment. The presence of the arginine
deiminase gene cluster was confirmed. Also, L. fermentum
IMDO 130101 contained a gene for a phenolic acid
decarboxylase, which may have an impact on sourdough
aroma. Further, a 4,6-α-glucanotransferase-encoding gene
was present in strain IMDO 130101 solely, which could
result in isomalto/malto-polysaccharide production, a
soluble dietary fibre with prebiotic properties.
Overall, comparative genome analysis revealed metabolic
traits that are of interest for the use of L. fermentum IMDO
130101 as a functional starter culture for sourdough
fermentation processes.
FIGURE 1. Venn diagram of shared coding sequences between four
different strains of Lactobacillus fermentum.
REFERENCES
Darling et al. PLoS ONE 5, e11147 (2010).
Jiménez E. et al. J. Bacteriol. 192, 4800-4800 (2010).
Meyer et al. Nucleic Acids Res. 31, 2187-2195 (2003).
Morita et al. DNA Res. 15, 151-161 (2008).
Sun et al. J. Biotechnol. 194, 110-111 (2015).
Vrancken et al. Int. J. Food Microbiol. 128, 58-66 (2008).
Weckx et al. Food Microbiol. 27, 1000-1008 (2010).
P65. ORTHOLOGICAL ANALYSIS OF AN EBOLA VIRUS – HUMAN PPIN
SUGGESTS REDUCED INTERFERENCE OF EBOLA VIRUS WITH EPIGENETIC
PROCESSES IN ITS SUSPECTED BAT RESERVOIR HOST
Ben Verhees1*, Kris Laukens1,2, Stefan Naulaerts1,2, Pieter Meysman1,2 & Xaveer Van Ostade3.
Biomedical informatics research center Antwerpen (biomina)1; Advanced Database Research and Modeling (ADReM),
University of Antwerp2; Laboratory of Protein Science, Proteomics and Epigenetic Signalling (PPES) and Centre for
Proteomics and Mass spectrometry (CFP-CeProMa), University of Antwerp3.
Ebola virus is a zoonosis, but its reservoir host has not yet been identified. Recent findings suggest, however, that Mops condylurus, an insect-eating bat, is a likely candidate. Studying the interactions between Ebola virus and its reservoir
host could prove highly informative, as reservoir hosts of zoonotic pathogens often appear to tolerate infections with
these pathogens with little evidence of disease. In this study, a protein-protein interaction network (PPIN) was created
between Ebola virus and human proteins. Orthology data in Myotis lucifugus – a model organism often used for bat
studies – was employed to determine which of the human first neighbors of Ebola virus proteins do not possess an
orthologue in M. lucifugus. Subsequent GO enrichment analysis suggested that these proteins are mostly involved in
epigenetic processes, and thus we hypothesize that Ebola virus displays reduced interference with epigenetic processes in
its reservoir host.
INTRODUCTION
The idea that bats serve as reservoirs for a wide range of zoonotic pathogens has been the topic of much recent research. Previous studies on human and bat orthology in this context have mainly focused on specific genes important in fighting off viral infection.
Our study differs, however, in that it focuses on the proteins with which Ebola virus directly interacts in humans, and on the existence of orthologues of these proteins in bats.
METHODS
Construction of an Ebola virus – human PPIN
An Ebola virus – human PPIN was constructed from in
silico data. All network analysis was done using
Cytoscape v. 3.2.1.
Orthology analysis
Identification of orthologues was performed using the
OMA orthology database, release: September 2015.
Statistics
For the statistical analysis, the hypergeometric test was
performed.
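The abstract does not report the counts that went into the test, but the hypergeometric upper-tail calculation itself is standard. A minimal pure-Python sketch, with made-up counts (N, K, n and k below are hypothetical, not the study's numbers):

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n proteins from a population of N,
    of which K carry an orthologue (hypergeometric upper tail)."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

# Hypothetical counts for illustration: 10,000 human proteins in the network,
# 8,000 with an M. lucifugus orthologue, 150 first neighbors of Ebola virus
# proteins, 135 of which have an orthologue.
p_value = hypergeom_sf(135, 10_000, 8_000, 150)
print(f"P(X >= 135) = {p_value:.4g}")
```

The same tail probability is what libraries such as SciPy compute with `hypergeom.sf`.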
GO enrichment
GO enrichment analysis was performed using ClueGO v.
1.2.7, a Cytoscape plug-in. Default settings were used, and
all ontologies/pathways were examined.
RESULTS & DISCUSSION
Myotis lucifugus as a model for Mops condylurus
In this study, Myotis lucifugus was used as a model to
study interactions between Ebola virus and Mops
condylurus, its suspected reservoir.
Ebola virus – human PPIN and orthology in M.
lucifugus
An Ebola virus – human PPIN was created, and human
first neighbors of Ebola virus proteins were examined for
existence of orthologues in M. lucifugus. Statistical analysis revealed an overrepresentation of human proteins with orthologues in M. lucifugus (p=0.019).
GO enrichment suggests reduced interference of Ebola
virus with epigenetic processes in its reservoir host
Gene ontology (GO) enrichment analysis was performed on the human first neighbors of Ebola virus proteins that do not possess an orthologue in M. lucifugus. The analysis revealed that these proteins are mostly involved in epigenetic processes (Figure 1).
FIGURE 1. GO enrichment analysis of human first neighbors of Ebola
virus proteins which do not possess an orthologue in M. lucifugus.
Discussion
Using this novel approach, we have shown, first, that Ebola virus is likely able to interfere with epigenetic processes in humans and, second, that its ability to interfere with host epigenetics is likely reduced or altered in its reservoir host.
While the idea that viruses are able to interact with host
epigenetic mechanisms is fairly recent, over the past few
years significant research has been done exploring this
topic. In a comprehensive review, Li et al. (2014) describe
how specific viral proteins are able to modulate the
activity of chromatin modification complexes, e.g. HATs,
HDACs, HMTs, and HDMTs, and even directly bind
histone proteins. These findings lend support to the results of our study, which suggest that Ebola virus is also able to interact with HDACs, HMTs and several histone proteins in humans.
REFERENCES
Li S et al. Rev Med Virol 24, 223-241 (2014).
P66. PLADIPUS EMPOWERS UNIVERSAL DISTRIBUTED COMPUTING
Kenneth Verheggen1,2,3*, Harald Barsnes4,5, Lennart Martens1,2,3 & Marc Vaudel4.
Medical Biotechnology Center, VIB, Ghent, Belgium1; Department of Biochemistry, Ghent University, Ghent, Belgium2; Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium3; Proteomics Unit, Department of Biomedicine, University of Bergen, Norway4; KG Jebsen Center for Diabetes Research, Department of Clinical Science, University of Bergen, Norway5. *[email protected]
The use of proteomics bioinformatics substantially contributes to an improved understanding of proteomes, but this novel
and in-depth knowledge comes at the cost of increased computational complexity. Parallelization across multiple
computers, a strategy termed distributed computing, can be used to handle this increased complexity. However, setting
up and maintaining a distributed computing infrastructure requires resources and skills that are not readily available to
most research groups.
Here, we propose a free and open source framework named Pladipus that greatly facilitates the establishment of
distributed computing networks for proteomics bioinformatics tools.
INTRODUCTION
Various modern-day bioinformatics-related fields have a growing focus on large-scale data processing. This inevitably leads to increased complexity, as illustrated by the recent efforts to elaborate a comprehensive MS-based characterization of the human proteome (Kim et al., 2014; Wilhelm et al., 2014). Such high-throughput, complex studies are becoming increasingly popular, but require high-performance computational setups in order to be analyzed swiftly.
METHODS
Here, we present a generic platform for distributed
proteomics software, called Pladipus. It provides an
end-user-oriented solution to distribute
bioinformatics tasks over a network of computers,
managed through an intuitive graphical user interface
(GUI).
Pladipus comes with several modules that work out
of the box. They include SearchGUI (Vaudel et al.,
2011), PeptideShaker (Vaudel et al., 2015),
DeNovoGUI (Muth et al., 2014), MsConvert (part of
Proteowizard (Kessner et al., 2008)) and three
common forms of the BLAST (Altschul et al., 1990)
algorithm (blastn, blastp and blastx). It is possible to link these together to set up tailored pipelines for specific needs, including custom in-house algorithms, and to execute the whole on an inexpensive, scalable cluster infrastructure without additional cost or expert maintenance requirements. It can even be set up to allow existing (idle) hardware to hook into the network and participate in the processing.
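Pladipus manages its workers through its own GUI-configured network, but the underlying idea of farming independent runs out to a pool of workers can be sketched in a few lines of Python. The `run_job` function and task list below are hypothetical stand-ins, not Pladipus's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

def run_job(task):
    """Hypothetical stand-in for one processing run; a real worker would
    launch the configured tool (SearchGUI, DeNovoGUI, BLAST, ...) here."""
    name, payload = task
    return name, payload.upper()  # dummy "result"

# Independent tasks: each could be one input file pushed through a pipeline.
tasks = [(f"run_{i:02d}", f"spectrum file {i}") for i in range(8)]

# Farm the tasks out to a pool of workers; because the jobs are independent,
# wall time shrinks roughly with the number of available workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(run_job, tasks))
```

A framework like Pladipus adds what this sketch omits: job persistence, failure recovery, and dispatch across physically separate machines.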
RESULTS & DISCUSSION
To numerically assess the benefits of using a distributed computing framework, 52 CPTAC experiments (LTQ-Study6 : Orbitrap@86) (Paulovich et al., 2010) were searched three times against a protein sequence database (UniProtKB/SwissProt, release-2015_05) on Pladipus networks of various sizes. A selection of three search engines was applied: X!Tandem, Tide and MS-GF+. As expected for a distributed system, the wall time is very reproducible and decreased nearly exponentially with the number of workers.
FIGURE 1. Benchmarking of a Pladipus network (16 GB RAM, 12 cores, 250 GB disk space, Ubuntu precise).
Pladipus is freely available as open source under the permissive Apache2 license. Documentation, including example files, an installer and a video tutorial, can be found at https://compomics.github.io/projects/pladipus.html.
REFERENCES
Altschul, S.F. et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403-10.
Kessner, D. et al. (2008) ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics, 24, 2534-6.
Kim, M.-S. et al. (2014) A draft map of the human proteome. Nature, 509, 575-81.
Muth, T. et al. (2014) DeNovoGUI: an open source graphical user interface for de novo sequencing of tandem mass spectra. J. Proteome Res., 13, 1143-6.
Paulovich, A.G. et al. (2010) Interlaboratory study characterizing a yeast performance standard for benchmarking LC-MS platform performance. Mol. Cell. Proteomics, 9, 242-54.
Vaudel, M. et al. (2011) SearchGUI: An open-source graphical user interface for simultaneous OMSSA and X!Tandem searches. Proteomics, 11, 996-9.
Vaudel, M. et al. (2015) PeptideShaker enables reanalysis of MS-derived proteomics data sets. Nat. Biotechnol., 33, 22-24.
Wilhelm, M. et al. (2014) Mass-spectrometry-based draft of the human proteome. Nature, 509, 582-7.
P67. IDENTIFICATION OF ANTIBIOTIC RESISTANCE MECHANISMS USING
A NETWORK-BASED APPROACH
Bram Weytjens1,2,3,4, Dries De Maeyer1,2,3,4 & Kathleen Marchal1,2,4*.
Dept. of Information Technology (INTEC, iMINDS), UGent, Ghent, 9052, Belgium1; Dept. of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 927, 9052 Gent, Belgium2; Dept. of Microbial and Molecular Systems, KU Leuven, Kasteelpark Arenberg 20, B-3001 Leuven, Belgium3; Bioinformatics Institute Ghent, Ghent University, Ghent B-9000, Belgium4. *[email protected]
Antibiotic resistance is a growing public health concern as the effectiveness of multiple types of antibiotics is decreasing. To prevent and combat the further spread of antibiotic resistance in bacteria, there is a need to better understand the relationship between genetic alterations and the (molecular) phenotype of antibiotic-resistant strains. As several (-omics) experiments regarding the acquisition of antibiotic resistance by bacteria have already been performed and are publicly available, we re-analysed a laboratory evolution experiment by Suzuki et al. (Suzuki, 2014) in order to demonstrate the power of a network-based approach in identifying mutations and molecular pathways driving the resistance phenotype.
INTRODUCTION
While network-based approaches are no longer new in high-throughput (-omics) analysis, they are not yet widely used in standard analysis pipelines. We analysed a dataset consisting of multiple E. coli MDS42 strains, each independently evolved in the presence of a specific antibiotic (10 in total). By adapting PheNetic (De Maeyer, 2013), an algorithm which connects genetic alterations to differentially expressed genes over a genome-wide interaction network, we were able to automatically identify mutations in genes which are known to induce antibiotic resistance.
METHODS
For every strain, whole-genome sequencing data and microarray data (eQTL data) were available. By finding the most probable connections between the mutations of every strain and the strain's respective expression data over a biological network, PheNetic was able not only to uncover potential driver genes and molecular pathways for the resistance phenotype, but also to prioritize the identified mutations based on the likelihood that they truly drive the resistance phenotype. Such a network-based approach has the following advantages:
- Integration of interactomics (network), genomics and transcriptomics data
- Multiple related datasets can be analyzed together
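PheNetic itself scores sub-networks probabilistically, but the core idea of linking mutated genes to differentially expressed genes over an interaction network can be illustrated with a plain breadth-first search. The toy network and gene names below are hypothetical, chosen only to echo the amikacin example:

```python
from collections import deque

# Toy undirected interaction network (hypothetical edges for illustration).
edges = [("cpxA", "cpxR"), ("cpxR", "geneX"), ("geneX", "deg1"),
         ("cpxA", "deg2"), ("cyoB", "geneY"), ("geneY", "deg3")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def shortest_path(src, dst):
    """Plain BFS; PheNetic instead weighs paths probabilistically."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Connect each mutated gene to each differentially expressed gene; the union
# of these paths is a candidate sub-network explaining the phenotype.
mutated, diff_expr = ["cpxA", "cyoB"], ["deg1", "deg3"]
subnetwork = [shortest_path(m, d) for m in mutated for d in diff_expr
              if shortest_path(m, d) is not None]
```

Genes that appear on many such paths are natural candidates for drivers of the phenotype, which is the intuition behind PheNetic's ranking.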
FIGURE 1. Part of the amikacin resistance network.
RESULTS & DISCUSSION
In the case of amikacin resistance (Figure 1), for two strains out of four we were able to uncover a gain-of-function mutation in cpxA, a gene of a two-component signal transduction system which is known to be involved in amikacin resistance. For the other two strains, deleterious cyoB mutations were found, which are known to lead to intracellular oxidized copper and eventually multidrug resistance. These genes were furthermore ranked highest by PheNetic.
REFERENCES
Suzuki S et al. Nat Commun 5, 5792 (2014).
De Maeyer D et al. Mol Biosyst 9, 1594-1603 (2013).
P68. DEFINING THE MICROBIAL COMMUNITY OF DIFFERENT
LACTOBACILLUS NICHES USING METAGENOMIC SEQUENCING
Sander Wuyts1,2*, Eline Oerlemans1, Ilke De Boeck1, Wenke Smets1, Dieter Vandenheuvel, Ingmar Claes1 & Sarah Lebeer1.
Laboratory of Applied Microbiology and Biotechnology, University of Antwerp1; Research Group of Industrial Microbiology and Food Biotechnology (IMDO), Vrije Universiteit Brussel2. *[email protected]
Next-Generation Sequencing (NGS) has revolutionized the field of microbial community analysis. Thanks to these high-throughput DNA technologies, microbiologists are now able to perform more in-depth analyses of various microbial communities than is possible with culture-dependent methods. In our lab, we have successfully deployed 16S rDNA amplicon sequencing on the Illumina MiSeq. A bioinformatic pipeline has been built based on mothur (Schloss et al. 2009), UPARSE (Edgar 2013) and Phyloseq (McMurdie & Holmes 2013) to analyse different microbial community datasets. The focus is on functional analysis of lactobacilli and other lactic acid bacteria in different ecological niches, ranging from the human upper respiratory tract to naturally fermented plant-based foods.
INTRODUCTION
16S metagenomics is a technique that makes use of the
highly conserved bacterial 16S rRNA gene. This gene
codes for an RNA-molecule which is a component of the
30S small subunit of bacterial ribosomes. It consists of 9
hypervariable regions, flanked by conserved regions for
which primer pairs for PCR/sequencing can be designed.
Due to these characteristics and due to the slow rate of
evolution, this gene has been widely used in bacterial
phylogeny and taxonomy. NGS technologies like Illumina
MiSeq have made it possible to study all the different
16S rRNA gene copies from an environmental sample and
use these to identify the bacteria present in the sample. But the use of these high-throughput technologies comes at a cost: the need for a more in-depth bioinformatic analysis.
METHODS
Wetlab:
DNA is extracted using sample dependent extraction
protocols. A barcoded PCR is performed on the V4 region
of the 16S rRNA gene as described in Kozich et al. 2013.
For each sample a different set of primers is used; each
primerset contains a unique combination of barcodes. The
PCR-products are cleaned using AMPure XP (Agencourt)
bead purification and quantified using Qubit (Life
technologies). All samples are equimolary pooled into one
single library. A negative control (= “empty” DNA-
extraction) and a positive control (= “Mock” communities
HM-276D and HM-782D) are always processed together
with the samples. The library is sequenced using a dual
index sequencing strategy (Kozich et al. 2013) and a
2 x 250 bp kit on the Illumina MiSeq.
Bio-informatic analysis:
Samples are demultiplexed on the MiSeq itself, allowing 1
bp difference in the barcodes. The general quality of the
reads is checked using FastQC (Babraham Bioinformatics).
The paired end reads are merged using mothur’s
make.contigs command. Quality control in mothur is
performed using screen.seqs, alignment to the SILVA
database and removal of sequences that do not map to the
database, removal of chimeras using chimera.uchime and
removal of sequences that classify to the lineages
“Mitochondria” and “Chloroplast”.
The distances between sequences are calculated using mothur's dist.seqs command, and sequences are clustered at 97% sequence similarity using mothur's cluster command. Alternatively, the UPARSE clustering algorithm can be used for these last two steps. Sequences are classified using the RDP database and the complete dataset is exported as a .biom file.
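The 97% clustering step can be illustrated with a toy greedy centroid clusterer, a much-simplified stand-in for mothur's cluster command or UPARSE; the identity function below compares positions directly and ignores alignment entirely, and the reads are invented:

```python
def identity(a, b):
    """Fraction of matching positions; a toy stand-in for a real alignment."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_cluster(seqs, threshold=0.97):
    """Each sequence joins the first centroid it matches at >= threshold
    identity, otherwise it founds a new OTU (UPARSE-style greedy pass)."""
    otus = []  # list of (centroid, members) pairs
    for s in seqs:
        for centroid, members in otus:
            if identity(s, centroid) >= threshold:
                members.append(s)
                break
        else:
            otus.append((s, [s]))
    return otus

# Hypothetical 100 bp reads: one centroid, a 1-mismatch variant (99% identity,
# same OTU) and an unrelated read (25% identity, new OTU).
base = "ACGT" * 25
reads = [base, "T" + base[1:], "G" * 100]
otus = greedy_cluster(reads)
```

Real OTU pickers add abundance-sorted input, chimera checks and proper pairwise alignment, but the clustering decision is the same threshold test.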
Visualisation and statistical analysis are performed using the R package Phyloseq. This analysis depends on the experimental design but generally consists of a normalisation step (using rarefying, proportions or a statistical mixture model (McMurdie & Holmes 2014)), a calculation of alpha diversity measures, and a calculation and visualisation of beta diversity.
RESULTS & DISCUSSION
The method described above was optimised and proved to work. We successfully used this technique to obtain better insights into the role of lactobacilli in different ecological niches, e.g. the murine gastrointestinal tract, vegetable fermentations and the human upper respiratory tract.
REFERENCES
Edgar, R.C., 2013. UPARSE: highly accurate OTU sequences from microbial amplicon reads. Nature Methods, 10(10), pp.996-8.
Kozich, J.J. et al., 2013. Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Applied and Environmental Microbiology, 79(17), pp.5112-20.
McMurdie, P.J. & Holmes, S., 2013. Phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLoS ONE, 8(4).
McMurdie, P.J. & Holmes, S., 2014. Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Computational Biology, 10(4), p.e1003531.
Schloss, P.D. et al., 2009. Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Applied and Environmental Microbiology, 75(23), pp.7537-7541.
P69. HUNTING HUMAN PHENOTYPE-ASSOCIATED GENES
USING MATRIX FACTORIZATION
Pooya Zakeri1,2*, Jaak Simm1,2, Adam Arany1,2, Sarah Elshal1,2 & Yves Moreau1,2.
Department of Electrical Engineering, STADIUS, KU Leuven, Leuven 3001, Belgium1; iMinds Medical IT, Leuven 3001, Belgium2. *[email protected]
In the last decade, the identification of phenotype-associated genes has received growing attention, yet it remains one of the most challenging problems in biology. In particular, determining disease-associated genes is a demanding process and plays a crucial role in understanding the relationship between disease phenotypes and genes. Typical approaches for gene prioritization model each disease individually, which fails to capture the common patterns in the data. This motivated us to formulate the hunt for phenotype-associated genes as the factorization of an incompletely filled gene-phenotype matrix, where the objective is to predict the unknown values. Experimental results on the updated version of the Endeavour benchmark demonstrate that our proposed model effectively improves on the accuracy of the state-of-the-art gene prioritization models.
INTRODUCTION
In biology, there is often a need to single out, from a large list of candidates, the most promising genes for further investigation. While a single data source might not be informative enough, fusing several complementary genomic data sources results in more accurate predictions. Moreover, fusing the phenotypic similarity of diseases and sharing information about known disease genes across both diseases and genes through a multi-task approach enables us to handle gene prioritization for diseases with very few known genes and for genes with limited available information. Typical strategies for hunting phenotype-associated genes model each phenotype individually [1, 2, 3, 4], which fails to capture the common patterns in the data. This motivated us to formulate the hunt for phenotype-associated genes as the factorization of an incompletely filled gene-phenotype matrix, where the objective is to predict the unknown values.
METHODS
We consider the OMIM database, which catalogues associations between human phenotypes (diseases) and genes. OMIM focuses on the relationship between human genotypes and associated diseases, and can be seen as an incomplete matrix in which each row is a gene and each column is a phenotype (disease).
The idea behind factorizing the M×N OMIM matrix is to represent each row and each column by a latent vector of size D. The OMIM matrix can then be modeled as the product of an M×D gene matrix G and the transpose of an N×D phenotype matrix P.
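The latent-factor idea can be sketched with a tiny stochastic-gradient matrix completion in pure Python. The matrix values and dimensions below are hypothetical, and plain SGD is only a stand-in for the authors' Bayesian model with side information:

```python
import random

random.seed(0)

# Toy 4x3 gene-by-phenotype matrix; None marks unknown associations.
R = [[1, 0, None],
     [1, None, 0],
     [None, 1, 1],
     [0, 1, None]]
M, N, D = len(R), len(R[0]), 2  # genes, phenotypes, latent size

G = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(M)]
P = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(N)]

def predict(i, j):
    """Score of gene i for phenotype j: inner product of latent vectors."""
    return sum(G[i][d] * P[j][d] for d in range(D))

# Plain SGD on the observed entries only; BPMF instead samples G and P from
# their posterior distribution rather than fitting point estimates.
lr, reg = 0.05, 0.01
for _ in range(3000):
    for i in range(M):
        for j in range(N):
            if R[i][j] is None:
                continue
            err = R[i][j] - predict(i, j)
            for d in range(D):
                g, p = G[i][d], P[j][d]
                G[i][d] += lr * (err * p - reg * g)
                P[j][d] += lr * (err * g - reg * p)

# Unknown entries are scored by the same inner product, which yields a
# ranking of candidate gene-phenotype associations.
scores = {(i, j): predict(i, j) for i in range(M) for j in range(N)
          if R[i][j] is None}
```

Because genes and phenotypes share the same latent space, information about one disease's known genes propagates to related diseases, which is the multi-task benefit described above.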
Bayesian probabilistic matrix factorization (BPMF) [5] is a well-known method for filling in such an incomplete matrix, but it uses no side information, which limits the accuracy of gene-phenotype matrix completion.
We propose an extended version of BPMF that can exploit multiple sources of side information when completing the gene-phenotype matrix [6], which also allows ranking genes outside the matrix. In our proposed framework we are able to integrate both genomic data sources and phenotype information, whereas earlier approaches for hunting phenotype-associated genes only fuse genomic information. This is achieved by adding genomic and phenotypic features to the corresponding latent variables [6]. In this study, we consider several genomic data sources, including annotation-based sources such as UniProt annotations and literature-based sources for each gene, as well as literature-based phenotypic information for each disease, just as in [1, 4]. The framework of our Bayesian data fusion model for gene prioritization is illustrated in Figure 1.
FIGURE 1. The framework of our Bayesian data fusion model for gene prioritization.
RESULTS & DISCUSSION
We report the average TPR when considering the top 1%, 5%, 10%, and 30% of the ranked genes. Experimental results on the updated version of the Endeavour [3] benchmark demonstrate that our proposed model effectively improves on the accuracy of the state-of-the-art gene prioritization models.
REFERENCES
Aerts, S. et al. Nat Biotech, 24(5), 537-544 (2006).
De Bie, T., Tranchevent, L.C., van Oeffelen, L.M.M. & Moreau, Y. Bioinformatics, 23(13), i125-i132 (2007).
Tranchevent, L.C. et al. NAR, 36, W377-W384 (2008).
ElShal, S. et al. NAR (2015).
Salakhutdinov, R. & Mnih, A. 25th ICML, 880-887, ACM (2008).
Simm, J. et al. arXiv:1509.04610 [stat.ML] (2016).
P70. THE IMPACT OF HMGA PROTEINS ON REPLICATION ORIGIN DISTRIBUTION
A. Zouaoui1, M. Kahli2, E. Besnard3, R. Desprat1, N. Kirsten4, P. Ben-sadoun1 & J.M. Lemaitre1.
Institute for Regenerative Medicine and Biotherapy, France1; Institut de Biologie de l'École Normale Supérieure (ENS), France2; The Gladstone Institutes, University of California San Francisco (UCSF), United States3; Helmholtz Zentrum München, Research Unit Gene Vectors, Munich, Germany4.
Proliferative cells can undergo an irreversible cell-cycle arrest called cellular senescence, which can induce the development of cancer and ageing. Senescence is characterized by the formation of Senescence-Associated Heterochromatic Foci (SAHF) and a decline in DNA replication. High-Mobility Group A (HMGA) proteins promote SAHF formation and a proliferative arrest, and stabilize senescence when overexpressed.
In a cell, DNA replication is initiated at several genomic sites called replication origins ("Oris"). A pre-replication protein complex is required for DNA replication to occur; within this complex, the ORC1 protein is involved in recognition of the replication origin. DNA autoradiography of eukaryotic cells showed that human replication origins are bidirectional and spaced at 20-400 kb intervals (Huberman and Riggs, 1968). At each origin, replication forks are formed and new short nascent strands are synthesized. A popular method to map replication origins is therefore the purification of Short Nascent Strands (SNS). Several laboratories have identified up to 50,000 origins using microarray and sequencing techniques. Our laboratory has developed an origin mapping method applied to four cell types: IMR90, H9, iPSC and HeLa (Besnard et al., 2012). The short nascent strands were isolated, sequenced and analyzed, and 250,000 origin peaks were identified with a peak detection tool named Sole-Search (Blahnik KR, Dou L, O'Geen H, et al. 2010).
The objective is to find the most sensitive method for analyzing origin distribution in proliferative and senescent cells, in order to observe whether senescence has an impact on that distribution, and to investigate the implication of HMGA proteins in DNA replication. Two new, more sensitive analysis methods are in development. In the first, origin peaks are called with the MACS2 tool (Zhang et al., 2008), which uses a new statistical model and algorithm. In the second, origin enrichment is assessed with the Homer tool (Heinz S et al., 2010).
Both methods identify replication origin sites from Illumina GAII sequencing of short nascent strands. Human SNS-seq reads of 36 bp were mapped to human genome build GRCh38 with the BWA tool (ref). Origin peaks were called by MACS2 and origin enrichment by Homer. To compare the two methods, active origins in HeLa cells were detected with each, and the correlation between ORC1 peaks and the identified origins is calculated to choose the most sensitive method. The impact of pre-senescence is assessed by comparing the origin distributions observed in proliferative and senescent cells. Origin distributions are also compared before and after induction of HMGA proteins to investigate the implication of these proteins in DNA replication during senescence.
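One simple way to quantify the agreement between ORC1 peaks and origins called by either tool is a base-pair overlap (Jaccard) statistic over the two interval sets. The intervals below are hypothetical, and this sweep is a much-simplified stand-in for dedicated tools such as bedtools jaccard:

```python
def merge(intervals):
    """Merge overlapping (start, end) intervals on one chromosome."""
    out = []
    for s, e in sorted(intervals):
        if out and s <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], e))
        else:
            out.append((s, e))
    return out

def covered(intervals):
    """Total bases covered by a merged interval list."""
    return sum(e - s for s, e in intervals)

def intersection(a, b):
    """Total bases shared by two merged, sorted interval lists (linear sweep)."""
    total, i, j = 0, 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total

# Hypothetical peak sets: origins called from SNS-seq vs ORC1 ChIP-seq peaks.
origins = merge([(100, 200), (300, 400), (800, 900)])
orc1 = merge([(150, 250), (300, 350), (600, 700)])
shared = intersection(origins, orc1)
jaccard = shared / (covered(origins) + covered(orc1) - shared)
```

The peak caller whose origins reach the higher Jaccard score against ORC1 binding sites would, under this criterion, be judged the more sensitive method.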
REFERENCES
Besnard et al. Best practices for mapping replication origins in eukaryotic chromosomes. Curr Protoc Cell Biol. 2014 Sep 2; 64:22.18.1-22.18.13.
Besnard et al. Unraveling cell type-specific and reprogrammable human replication origin signatures associated with G-quadruplex consensus motifs. Nat Struct Mol Biol. 2012 Aug; 19, 837-44.
Blahnik KR, Dou L, O'Geen H, et al. Sole-Search: an integrated analysis program for peak detection and functional annotation using ChIP-seq data. Nucleic Acids Res. 2010; 38:e13.
Fu H et al. Mapping replication origin sequences in eukaryotic chromosomes. Curr Protoc Cell Biol. 2014 Dec 1; 65:22.20.1-22.20.17.
Heinz S, Benner C, Spann N, Bertolino E, et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell 2010 May 28; 38, 576-589.
Huberman JA et al. On the mechanism of DNA replication in mammalian chromosomes. J Mol Biol 1968 Mar 14; 32, 327-41.
Zhang et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol 2008; 9, R137.
bbc 2015
December 7 - 8, 2015 Antwerp, Belgium
www.bbc2015.be