bbc 2015
December 7 - 8, 2015
Antwerp, Belgium
www.bbc2015.be
10th Benelux Bioinformatics Conference
10th Benelux Bioinformatics Conference
bbc 2015

PROCEEDINGS

December 7 and 8, 2015
Elzenveld, Lange Gasthuisstraat 45, 2000 Antwerp, Belgium
Welcome to the 10th Benelux Bioinformatics Conference!
Dear attendee,

It is our great pleasure to welcome you to the 10th Benelux Bioinformatics Conference in Antwerp, Belgium! We are especially proud to host this conference, for the first time ever, in Antwerp, the diamond city.

Ten years of BBC is worth celebrating. The meeting has always struck the right balance between strengthening the regional network and offering a scientifically strong program. From its inception ten years ago, the BBC has been a prominent platform for the thriving regional bioinformatics community to present its latest research. Many young bioinformatics scientists gave their first poster or oral presentation at a BBC edition, and the conference has always attracted a healthy mix of presenters and attendees from all career stages and with diverse backgrounds.

The program of this year's edition again demonstrates the wide range of life science disciplines in which bioinformatics now plays a key role. First, we are delighted to introduce two eminent keynote speakers: Cedric Notredame (Center for Genomic Regulation) and Lars Juhl Jensen (Novo Nordisk Foundation Center for Protein Research). Second, a program committee of 36 scientists has critically reviewed a large number of submissions and selected 24 authors to deliver an oral presentation. In addition, we have two special corporate talks. Furthermore, we again have a large number of poster presentations that promise a very interactive poster session, and our corporate sponsors will present their activities at their respective booths. Last but not least, our special guest Pierre Rouzé will bring us a perspective on the history of bioinformatics and ten years of Benelux Bioinformatics Conferences.

For this edition, we would also like to congratulate the 10 (mostly master's) students who were selected from a large pool of submissions to enjoy a student fellowship.
For many of them it is their first chance to actively participate in a scientific conference, and we hope that it inspires them in their future bioinformatics careers.

The program also includes a healthy mix of opportunities for social interaction and networking. The conference dinner, the coffee and lunch breaks, and the farewell drink are perfect occasions to strengthen the network even further.

We cannot close this foreword without a very warm word of thanks to the many people who made this event possible. Thanks to the sponsors for their crucial support, to the keynote speakers and all other presenters for sharing their work, to the program committee for reviewing the many abstracts, and to the many volunteers and administrative staff of the University of Antwerp for lending a helping hand in many different ways. Last but not least, thank you for being here and being part of yet another great BBC edition.

We wish you an enjoyable and very illuminating meeting.

On behalf of the organizing committee,
Kris Laukens & Pieter Meysman
BBC2015 chairs
University of Antwerp
Special thanks to the BBC 2015 sponsors!
Gold sponsors:
Silver sponsors:
Bronze sponsors:
Affiliations:
Organizing committee
Kris Laukens, University of Antwerp, Belgium
Pieter Meysman, University of Antwerp, Belgium
Geert Vandeweyer, University of Antwerp, Belgium
Yvan Saeys, Ghent University, Belgium
Thomas Abeel, Delft University of Technology, The Netherlands
Programme committee
Thomas Abeel, Delft University of Technology, The Netherlands
Stein Aerts, University of Leuven, Belgium
Francisco Azuaje, Luxembourg Institute of Health, Luxembourg
Gianluca Bontempi, Université libre de Bruxelles, Belgium
Tomasz Burzykowski, Hasselt University, Belgium
Susan Coort, Maastricht University, The Netherlands
Tim De Meyer, Ghent University, Belgium
Jeroen De Ridder, Delft University of Technology, The Netherlands
Dick De Ridder, Delft University of Technology, The Netherlands
Peter De Rijk, University of Antwerp, Belgium
Pierre Dupont, Université catholique de Louvain, Belgium
Pierre Geurts, University of Liège, Belgium
Peter Horvatovich, University of Groningen, The Netherlands
Jan Ramon, University of Leuven, Belgium
Rob Jelier, University of Leuven, Belgium
Gunnar Klau, Centrum Wiskunde & Informatica, The Netherlands
Andreas Kremer, ITTM S.A., Luxembourg
Kris Laukens, University of Antwerp, Belgium
Tom Lenaerts, Université libre de Bruxelles, Belgium
Steven Maere, Ghent University / VIB, Belgium
Lennart Martens, Ghent University / VIB, Belgium
Pieter Meysman, University of Antwerp, Belgium
Perry Moerland, University of Amsterdam, The Netherlands
Pieter Monsieurs, SCK-CEN, Belgium
Yves Moreau, University of Leuven, Belgium
Yvan Saeys, Ghent University / VIB, Belgium
Thomas Sauter, University of Luxembourg, Luxembourg
Alexander Schoenhuth, Centrum Wiskunde & Informatica, The Netherlands
Berend Snel, Utrecht University, The Netherlands
Dirk Valkenborg, VITO, Belgium
Raf Van de Plas, Delft University of Technology, The Netherlands
Vera van Noort, University of Leuven, Belgium
Natal van Riel, Eindhoven University of Technology, The Netherlands
Klaas Vandepoele, Ghent University / VIB, Belgium
Geert Vandeweyer, University of Antwerp, Belgium
Wim Vrancken, Vrije Universiteit Brussel, Belgium
Local Organizing Committee
Charlie Beirnaert, University of Antwerp
Wout Bittremieux, University of Antwerp
Bart Cuypers, University of Antwerp
Nicolas De Neuter, University of Antwerp
Aida Mrzic, University of Antwerp
Stefan Naulaerts, University of Antwerp
The results published in this book of abstracts are under the full responsibility of the authors. The organizing committee cannot be held responsible for any errors in this publication or potential consequences thereof.
Conference agenda 1/2
December 6, 2015: Satellite events
12.30 – 19.00 Student-run satellite meeting at the Institute of Tropical Medicine, Antwerp.
19.00 - … Guided sightseeing tour of Antwerp for early arrivals.
December 7, 2015: Main conference
8.30 - 9.30 Registration and welcome coffee.
9.30 - 9.50 Welcome and conference opening, with a foreword by UAntwerpen Rector Prof. Alain Verschoren.
9.50 - 10.50 K1 Invited keynote: Lars Juhl Jensen. Medical data and text mining: Linking diseases, drugs, and adverse reactions.
10.50 - 11.10 Coffee break.

Selected talks session 1
11.10 - 11.25 O1 Mafalda Galhardo, Philipp Berninger, Thanh-Phuong Nguyen, Thomas Sauter and Lasse Sinkkonen. Cell type-selective disease association of genes under high regulatory load.
11.25 - 11.40 O2 Andrea M. Gazzo, Dorien Daneels, Maryse Bonduelle, Sonia Van Dooren, Guillaume Smits and Tom Lenaerts. Predicting oligogenic effects using digenic disease data.
11.40 - 11.55 O3 Wouter Saelens, Robrecht Cannoodt, Bart N. Lambrecht and Yvan Saeys. A comprehensive comparison of module detection methods for gene expression data.
11.55 - 12.10 O4 Joana P. Gonçalves and Sara C. Madeira. LateBiclustering: Efficient discovery of temporal local patterns with potential delays.
12.10 - 12.30 C1 Nicolas Goffard. Illumina software platforms to transform the path to knowledge and discovery. (Corporate presentation: Illumina)
12.30 - 15.00 Lunch break & poster session.

Selected talks session 2
15.00 - 15.15 O5 Robrecht Cannoodt, Katleen De Preter and Yvan Saeys. Inferring developmental chronologies from single cell RNA.
15.15 - 15.30 O6 Vân Anh Huynh-Thu and Guido Sanguinetti. Combining tree-based and dynamical systems for the inference of gene regulatory networks.
15.30 - 15.45 O7 Annika Jacobsen, Nika Heijmans, Renée van Amerongen, Martine Smit, Jaap Heringa and K. Anton Feenstra. Modeling the regulation of β-catenin signalling by WNT stimulation and GSK3 inhibition.
15.45 - 16.00 O8 Thanh Le Van, Jimmy Van den Eynden, Dries De Maeyer, Ana Carolina Fierro, Lieven Verbeke, Matthijs van Leeuwen, Siegfried Nijssen, Luc De Raedt and Kathleen Marchal. Ranked tiling based approach to discovering patient subtypes.
16.00 - 16.15 O9 Martin Bizet, Jana Jeschke, Matthieu Defrance, François Fuks and Gianluca Bontempi. Development of a DNA methylation-based score reflecting tumour infiltrating lymphocytes.
16.15 - 16.30 O10 Aliaksei Vasilevich, Shantanu Singh, Aurélie Carlier and Jan de Boer. Prediction of cell responses to surface topographies using machine learning techniques.
16.30 - 17.00 Coffee break.

Selected talks session 3
17.00 - 17.15 O11 Wout Bittremieux, Pieter Meysman, Lennart Martens, Bart Goethals, Dirk Valkenborg and Kris Laukens. Analysis of mass spectrometry quality control metrics.
17.15 - 17.30 O12 Şule Yılmaz, Masa Cernic, Friedel Drepper, Bettina Warscheid, Lennart Martens and Elien Vandermarliere. Xilmass: A cross-linked peptide identification algorithm.
17.30 - 17.45 O13 Nico Verbeeck, Jeffrey Spraggins, Yousef El Aalamat, Junhai Yang, Richard M. Caprioli, Bart De Moor, Etienne Waelkens and Raf Van de Plas. Automated anatomical interpretation of differences between imaging mass spectrometry experiments.
17.45 - 18.00 O14 Yousef El Aalamat, Xian Mao, Nico Verbeeck, Junhai Yang, Bart De Moor, Richard M. Caprioli, Etienne Waelkens and Raf Van de Plas. Enhancement of imaging mass spectrometry data through removal of sparse intensity variations.
18.10 - 18.30 Walk to the gala dinner, leaving from the conference venue.
18.30 - 22.00 Gala dinner at Pelgrom - Pelgrimstraat 15, Antwerpen.
Conference agenda 2/2

December 8, 2015: Main conference
8.30 - 9.30 Welcome coffee.
9.30 - 9.40 Opening and announcements.

Selected talks session 4
9.40 - 9.55 O15 Gipsi Lima Mendez, Karoline Faust, Nicolas Henry, Johan Decelle, Sébastien Colin, Fabrizio Carcillo, Simon Roux, Gianluca Bontempi, Matthew B. Sullivan, Chris Bowler, Eric Karsenti, Colomban de Vargas and Jeroen Raes. Determinants of community structure in the plankton interactome.
9.55 - 10.10 O16 Mohamed Mysara, Yvan Saeys, Natalie Leys, Jeroen Raes and Pieter Monsieurs. Bioinformatics tools for accurate analysis of amplicon sequencing data for biodiversity analysis.
10.10 - 10.25 O17 Sjoerd M. H. Huisman, Else Eising, Ahmed Mahfouz, Boudewijn P.F. Lelieveldt, Arn van den Maagdenberg and Marcel Reinders. Gene co-expression analysis identifies brain regions and cell types involved in migraine pathophysiology: a GWAS-based study using the Allen Human Brain Atlas.
10.25 - 10.40 O18 Ahmed Mahfouz, Boudewijn P.F. Lelieveldt, Aldo Grefhorst, Isabel Mol, Hetty Sips, Jose van den Heuvel, Jenny Visser, Marcel Reinders and Onno Meijer. Spatial co-expression analysis of steroid receptors in the mouse brain identifies region-specific regulation mechanisms.
10.40 - 11.10 Coffee break.

Selected talks session 5
11.10 - 11.25 O19 Bart Cuypers, Pieter Meysman, Manu Vanaerschot, Maya Berg, Malgorzata Domagalska, Jean-Claude Dujardin and Kris Laukens. A systems biology compendium for Leishmania donovani.
11.25 - 11.40 O20 Volodimir Olexiouk, Elvis Ndah, Sandra Steyaert, Steven Verbruggen, Eline De Schutter, Alexander Koch, Daria Gawron, Wim Van Criekinge, Petra Van Damme and Gerben Menschaert. Multi-omics integration: Ribosome profiling applications.
11.40 - 11.55 O21 Qingzhen Hou, Kamil Krystian Belau, Marc Lensink, Jaap Heringa and K. Anton Feenstra. CLUB-MARTINI: Selecting favorable interactions amongst available candidates: A coarse-grained simulation approach to scoring docking decoys.
11.55 - 12.10 O22 Elien Vandermarliere, Davy Maddelein, Niels Hulstaert, Elisabeth Stes, Michela Di Michele, Kris Gevaert, Edgar Jacoby, Dirk Brehmer and Lennart Martens. Pepshell: Visualization of conformational proteomics data.
12.10 - 12.30 C2 Carine Poussin. The systems toxicology computational challenge: Identification of exposure response markers. (Corporate presentation: sbv IMPROVER)
12.30 - 13.30 Lunch break.
13.30 - 14.30 K2 Invited keynote: Cedric Notredame. Multiple survival strategies to deal with the multiplication of multiple sequence alignment methods.

Selected talks session 6
14.30 - 14.45 O23 Thomas Moerman, Dries Decap and Toni Verbeiren. Interactive VCF comparison using Spark Notebook.
14.45 - 15.00 O24 Sepideh Babaei, Waseem Akhtar, Johann de Jong, Marcel Reinders and Jeroen de Ridder. 3D hotspots of recurrent retroviral insertions reveal long-range interactions with cancer genes.
15.00 - 15.30 Coffee break.
15.30 - 16.00 K3 Invited keynote: Pierre Rouzé. Thirty years in bioinformatics.
16.00 - 16.30 Closing and awards.
16.30 - 17.00 Closing reception.
Gala dinner

The gala dinner will take place at the Pelgrom, a Medieval-style restaurant within walking distance of the Elzenveld conference venue, on the evening of Monday December 7th, after the conference programme, from 18h30 until 22h00. Gala dinner participation is optional, although highly recommended!

The Pelgrom is one of Antwerp's most historic eating and drinking places, situated in authentic 15th-century cellars that merchants used for temporary storage during the two big annual Antwerp fairs. Prepare to feast on a Medieval buffet in the style of Antwerp's Golden Century!

For people using public transportation, after the end of the gala dinner the Antwerp-Central train station can easily be reached by tram from the Groenplaats stop (10 minutes), or on foot (20 minutes).

Where? Restaurant Pelgrom, Pelgrimstraat 15, 2000 Antwerp
When? Monday December 7th, 2015; 18h30 - 22h00
List of abstracts

Keynotes
K1 MEDICAL DATA AND TEXT MINING: LINKING DISEASES, DRUGS, AND ADVERSE REACTIONS
K2 MULTIPLE SURVIVAL STRATEGIES TO DEAL WITH THE MULTIPLICATION OF MULTIPLE SEQUENCE ALIGNMENT METHODS

Corporate presentations
C1 ILLUMINA SOFTWARE PLATFORMS TO TRANSFORM THE PATH TO KNOWLEDGE AND DISCOVERY
C2 THE SYSTEMS TOXICOLOGY COMPUTATIONAL CHALLENGE: IDENTIFICATION OF EXPOSURE RESPONSE MARKERS

Selected oral presentations
O1 CELL TYPE-SELECTIVE DISEASE ASSOCIATION OF GENES UNDER HIGH REGULATORY LOAD
O2 PREDICTING OLIGOGENIC EFFECTS USING DIGENIC DISEASE DATA
O3 A COMPREHENSIVE COMPARISON OF MODULE DETECTION METHODS FOR GENE EXPRESSION DATA
O4 LATEBICLUSTERING: EFFICIENT DISCOVERY OF TEMPORAL LOCAL PATTERNS WITH POTENTIAL DELAYS
O5 INFERRING DEVELOPMENTAL CHRONOLOGIES FROM SINGLE CELL RNA
O6 COMBINING TREE-BASED AND DYNAMICAL SYSTEMS FOR THE INFERENCE OF GENE REGULATORY NETWORKS
O7 MODELING THE REGULATION OF β-CATENIN SIGNALLING BY WNT STIMULATION AND GSK3 INHIBITION
O8 RANKED TILING BASED APPROACH TO DISCOVERING PATIENT SUBTYPES
O9 DEVELOPMENT OF A DNA METHYLATION-BASED SCORE REFLECTING TUMOUR INFILTRATING LYMPHOCYTES
O10 PREDICTION OF CELL RESPONSES TO SURFACE TOPOGRAPHIES USING MACHINE LEARNING TECHNIQUES
O11 ANALYSIS OF MASS SPECTROMETRY QUALITY CONTROL METRICS
O12 XILMASS: A CROSS-LINKED PEPTIDE IDENTIFICATION ALGORITHM
O13 AUTOMATED ANATOMICAL INTERPRETATION OF DIFFERENCES BETWEEN IMAGING MASS SPECTROMETRY EXPERIMENTS
O14 ENHANCEMENT OF IMAGING MASS SPECTROMETRY DATA THROUGH REMOVAL OF SPARSE INTENSITY VARIATIONS
O15 DETERMINANTS OF COMMUNITY STRUCTURE IN THE PLANKTON INTERACTOME
O16 BIOINFORMATICS TOOLS FOR ACCURATE ANALYSIS OF AMPLICON SEQUENCING DATA FOR BIODIVERSITY ANALYSIS
O17 GENE CO-EXPRESSION ANALYSIS IDENTIFIES BRAIN REGIONS AND CELL TYPES INVOLVED IN MIGRAINE PATHOPHYSIOLOGY: A GWAS-BASED STUDY USING THE ALLEN HUMAN BRAIN ATLAS
O18 SPATIAL CO-EXPRESSION ANALYSIS OF STEROID RECEPTORS IN THE MOUSE BRAIN IDENTIFIES REGION-SPECIFIC REGULATION MECHANISMS
O19 A SYSTEMS BIOLOGY COMPENDIUM FOR LEISHMANIA DONOVANI
O20 MULTI-OMICS INTEGRATION: RIBOSOME PROFILING APPLICATIONS
O21 CLUB-MARTINI: SELECTING FAVORABLE INTERACTIONS AMONGST AVAILABLE CANDIDATES: A COARSE-GRAINED SIMULATION APPROACH TO SCORING DOCKING DECOYS
O22 PEPSHELL: VISUALIZATION OF CONFORMATIONAL PROTEOMICS DATA
O23 INTERACTIVE VCF COMPARISON USING SPARK NOTEBOOK
O24 3D HOTSPOTS OF RECURRENT RETROVIRAL INSERTIONS REVEAL LONG-RANGE INTERACTIONS WITH CANCER GENES

Poster presentations
P1 KNN-MDR APPROACH FOR DETECTING GENE-GENE INTERACTIONS
P2 CONSERVATION AND DIVERSITY OF SUGAR-RELATED CATABOLIC PATHWAYS IN FUNGI
P3 VISUALIZING BIOLOGICAL DATA THROUGH WEB COMPONENTS USING POLIMERO AND POLIMERO-BIO
P4 DISEASE-SPECIFIC NETWORK CONSTRUCTION BY SEED-AND-EXTEND
P5 BIG DATA SOLUTIONS FOR VARIANT DISCOVERY FROM LOW COVERAGE SEQUENCING DATA, BY INTEGRATION OF HADOOP, HBASE AND HIVE
P6 ENTEROCOCCUS FAECIUM GENOME DYNAMICS DURING LONG-TERM PATIENT GUT COLONIZATION
P7 XCMS OPTIMISATION IN HIGH-THROUGHPUT LC-MS QC
P8 IDENTIFICATION OF NUMTS THROUGH NGS DATA
P9 MICROBIAL SEMANTICS: GENOME-WIDE HIGH-PRECISION NAMING SCHEMES FOR BACTERIA
P10 FROM SNPS TO PATHWAYS: AN APPROACH TO STRENGTHEN BIOLOGICAL INTERPRETATION OF GWAS RESULTS
P11 IDENTIFICATION OF TRANSCRIPTION FACTOR CO-ASSOCIATIONS IN SETS OF FUNCTIONALLY RELATED GENES
P12 PHENETIC: MULTI-OMICS DATA INTERPRETATION USING INTERACTION NETWORKS
P13 THE ROLE OF HLA ALLELES UNDERLYING CYTOMEGALOVIRUS SUSCEPTIBILITY IN ALLOGENEIC TRANSPLANT POPULATIONS
P14 NOVOPLASTY: IN SILICO ASSEMBLY OF PLASTID GENOMES FROM WHOLE GENOME NGS DATA
P15 ENANOMAPPER - ONTOLOGY, DATABASE AND TOOLS FOR NANOMATERIAL SAFETY EVALUATION
P16 BIOMEDICAL TEXT MINING FOR DISEASE-GENE DISCOVERY: SOMETIMES LESS IS MORE
P17 TUNESIM - TUNABLE VARIANT SET SIMULATOR FOR NGS READS
P18 RNA-SEQ REVEALS ALTERNATIVE SPLICING WITH ALTERNATIVE FUNCTIONALITY IN MUSHROOMS
P19 MSQROB: AN R/BIOCONDUCTOR PACKAGE FOR ROBUST RELATIVE QUANTIFICATION IN LABEL-FREE MASS SPECTROMETRY-BASED QUANTITATIVE PROTEOMICS
P20 A MIXTURE MODEL FOR THE OMICS BASED IDENTIFICATION OF MONOALLELICALLY EXPRESSED LOCI AND THEIR DEREGULATION IN CANCER
P21 GEVACT: GENOMIC VARIANT CLASSIFIER TOOL
P22 MAPPI-DAT: MANAGEMENT AND ANALYSIS FOR HIGH THROUGHPUT INTERACTOMICS DATA FROM ARRAY-MAPPIT EXPERIMENTS
P23 HIGHLANDER: VARIANT FILTERING MADE EASIER
P24 DOSE-TIME NETWORK IDENTIFICATION: A NEW METHOD FOR GENE REGULATORY NETWORK INFERENCE FROM GENE EXPRESSION DATA WITH MULTIPLE DOSES AND TIME POINTS
P25 IDENTIFICATION OF NOVEL ALLOSTERIC DRUG TARGETS USING A “DUMMY” LIGAND APPROACH
P26 PASSENGER MUTATIONS CONFOUND INTERPRETATION OF ALL GENETICALLY MODIFIED CONGENIC MICE
P27 DETECTING MIXED MYCOBACTERIUM TUBERCULOSIS INFECTION AND DIFFERENCES IN DRUG SUSCEPTIBILITY WITH WGS DATA
P28 APPLICATION OF HIGH-THROUGHPUT SEQUENCING TO CIRCULATING MICRORNAS REVEALS NOVEL BIOMARKERS FOR DRUG-INDUCED LIVER INJURY
P29 INFORMATION THEORETIC MODEL FOR GENE PRIORITIZATION
P30 GALAHAD: A WEB SERVER FOR THE ANALYSIS OF DRUG EFFECTS FROM GENE EXPRESSION DATA
P31 KMAD: KNOWLEDGE BASED MULTIPLE SEQUENCE ALIGNMENT FOR INTRINSICALLY DISORDERED PROTEINS
P32 ON THE LZ DISTANCE FOR DEREPLICATING REDUNDANT PROKARYOTIC GENOMES
P33 THE ROLE OF MIRNAS IN ALZHEIMER’S DISEASE
P34 FUNCTIONAL SUBGRAPH ENRICHMENTS FOR NODE SETS IN REGULATORY NETWORKS
P35 HUMANS DROVE THE INTRODUCTION & SPREAD OF MYCOBACTERIUM ULCERANS IN AFRICA
P36 LEVERAGING AGO-SRNA AFFINITY TO IMPROVE IN SILICO SRNA DETECTION AND CLASSIFICATION IN PLANTS
P37 ANALYSIS OF RELATIONSHIP PATTERNS IN UNASSIGNED MS/MS SPECTRA
P38 MINING ACROSS “OMICS” DATA FOR DRUG PRIORITIZATION
P39 ABUNDANT TRANS-SPECIFIC POLYMORPHISM AND A COMPLEX HISTORY OF NON-BIFURCATING SPECIATION IN THE GENUS ARABIDOPSIS
P40 RIBOSOME PROFILING ENABLES THE DISCOVERY OF SMALL OPEN READING FRAMES (SORFS), A NEW SOURCE OF BIOACTIVE PEPTIDES
P41 RIGAPOLLO, A HMM-SVM BASED APPROACH TO SEQUENCE ALIGNMENT
P42 EARLY FOLDING AND LOCAL INTERACTIONS
P43 BINDING SITE SIMILARITY DRUG REPOSITIONING: A GENERAL AND SYSTEMATIC METHOD FOR DRUG DISCOVERY AND SIDE EFFECTS DETECTION
P44 ASSESSMENT OF THE CONTRIBUTION OF COCOA-DERIVED STRAINS OF ACETOBACTER GHANENSIS AND ACETOBACTER SENEGALENSIS TO THE COCOA BEAN FERMENTATION PROCESS THROUGH A GENOMIC APPROACH
P45 REPRESENTATIONAL POWER OF GENE FEATURES FOR FUNCTION PREDICTION
P46 ANALYSIS OF BIAS AND ASYMMETRY IN THE PROTEIN STABILITY PREDICTION
P47 MULTI-LEVEL BIOLOGICAL CHARACTERIZATION OF EXOMIC VARIANTS AT THE PROTEIN LEVEL IMPROVES THE IDENTIFICATION OF THEIR DELETERIOUS EFFECTS
P48 NGOME: PREDICTION OF NON-ENZYMATIC PROTEIN DEAMIDATION FROM SEQUENCE-DERIVED SECONDARY STRUCTURE AND INTRINSIC DISORDER
P49 OPTIMAL DESIGN OF SRM ASSAYS USING MODULAR EMPIRICAL MODELS
P50 EVALUATING THE ROBUSTNESS OF LARGE INDEL IDENTIFICATION ACROSS MULTIPLE MICROBIAL GENOMES
P51 INTEGRATING STRUCTURED AND UNSTRUCTURED DATA SOURCES FOR PREDICTING CLINICAL CODES
P52 SUPERVISED TEXT MINING FOR DISEASE AND GENE LINKS
P53 FLOWSOM WEB: A SCALABLE ALGORITHM TO VISUALIZE AND COMPARE CYTOMETRY DATA IN THE BROWSER
P54 TOWARDS A BELGIAN REFERENCE SET
P55 MANAGING BIG IMAGING DATA FROM MICROSCOPY: A DEPARTMENTAL-WIDE APPROACH
P56 ESTIMATING THE IMPACT OF CIS-REGULATORY VARIATION IN CANCER GENOMES USING ENHANCER PREDICTION MODELS AND MATCHED GENOME-EPIGENOME-TRANSCRIPTOME DATA
P57 I-PV: A CIRCOS MODULE FOR INTERACTIVE PROTEIN SEQUENCE VISUALIZATION
P58 SFINX: STRAIGHTFORWARD FILTERING INDEX FOR AFFINITY PURIFICATION-MASS SPECTROMETRY DATA ANALYSIS
P59 MAPREDUCE APPROACHES FOR CONTACT MAP PREDICTION: AN EXTREMELY IMBALANCED BIG DATA PROBLEM
P60 COEXPNETVIZ: THE CONSTRUCTION AND VISUALISATION OF CO-EXPRESSION NETWORKS
P61 THE DETECTION OF PURIFYING SELECTION DURING TUMOUR EVOLUTION UNVEILS CANCER VULNERABILITIES
P62 FLOREMI: SURVIVAL TIME PREDICTION BASED ON FLOW CYTOMETRY DATA
P63 STUDYING BET PROTEIN-CHROMATIN OCCUPATION TO UNDERSTAND GENOTOXICITY OF MLV-BASED GENE THERAPY VECTORS
P64 THE COMPLETE GENOME SEQUENCE OF LACTOBACILLUS FERMENTUM IMDO 130101 AND ITS METABOLIC TRAITS RELATED TO THE SOURDOUGH FERMENTATION PROCESS
P65 ORTHOLOGICAL ANALYSIS OF AN EBOLA VIRUS – HUMAN PPIN SUGGESTS REDUCED INTERFERENCE OF EBOLA VIRUS WITH EPIGENETIC PROCESSES IN ITS SUSPECTED BAT RESERVOIR HOST
P66 PLADIPUS EMPOWERS UNIVERSAL DISTRIBUTED COMPUTING
P67 IDENTIFICATION OF ANTIBIOTIC RESISTANCE MECHANISMS USING A NETWORK-BASED APPROACH
P68 DEFINING THE MICROBIAL COMMUNITY OF DIFFERENT LACTOBACILLUS NICHES USING METAGENOMIC SEQUENCING
P69 HUNTING HUMAN PHENOTYPE-ASSOCIATED GENES USING MATRIX FACTORIZATION
P70 THE IMPACT OF HMGA PROTEINS ON REPLICATION ORIGINS DISTRIBUTION

Corporate poster presentations
C2 THE SYSTEMS TOXICOLOGY COMPUTATIONAL CHALLENGE: IDENTIFICATION OF EXPOSURE RESPONSE MARKERS
K1. MEDICAL DATA AND TEXT MINING: LINKING DISEASES, DRUGS, AND ADVERSE REACTIONS
Lars Juhl Jensen

Clinical data describing the phenotypes and treatment of patients is an underused data source that has much greater research potential than is currently realized. Mining of electronic health records (EHRs) has the potential for revealing unknown disease correlations and for improving post-approval monitoring of drugs. In my presentation I will introduce the centralized Danish health registries and show how we use them for identification of temporal disease correlations and discovery of common diagnosis trajectories of patients. I will also describe how we perform text mining of the clinical narrative from electronic health records and use this for identification of new adverse reactions of drugs.
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: K2 Keynote
K2. MULTIPLE SURVIVAL STRATEGIES TO DEAL WITH THE MULTIPLICATION OF MULTIPLE SEQUENCE ALIGNMENT METHODS
Cedric Notredame

In this seminar I will introduce some of the latest developments in the field of multiple sequence alignment construction, including some of the work from my group. I will briefly review the main challenges and the latest work in the field, including ClustalO and phylogeny-aware aligners like SATé, and how these aligners relate to consistency-based methods like T-Coffee. I will also look at the complex relationship between multiple sequence alignment accuracy, structural modeling and phylogenetic tree reconstruction, and introduce the notion of a reliability index while reviewing some of the latest advances in this field, including the TCS (Transitive Consistency Score). I will show how this index can be used to identify both structurally correct positions in an alignment and evolutionarily informative sites, thus suggesting more unity than initially thought between these two parameters. I will then introduce the structure-based clustering method we recently developed to further test these hypotheses. I will finish with some considerations on the main challenges that need to be confronted for the accurate modeling of biological sequence relationships, with special attention to genomic and RNA sequences. All methods are available from www.tcoffee.org.
REFERENCES
Chang JM, Di Tommaso P, Notredame C. TCS: a new multiple sequence alignment reliability measure to estimate alignment accuracy and improve phylogenetic tree reconstruction. Mol Biol Evol. 2014;31(6):1625-37. doi:10.1093/molbev/msu117.
Kemena C, Bussotti G, Capriotti E, Marti-Renom MA, Notredame C. Using tertiary structure for the computation of highly accurate multiple RNA alignments with the SARA-Coffee package. Bioinformatics. 2013;29(9):1112-9. doi:10.1093/bioinformatics/btt096.
Earl D, Nguyen N, Hickey G, Harris RS, Fitzgerald S, Beal K, Seledtsov I, Molodtsov V, Raney BJ, Clawson H, Kim J, Kemena C, Chang JM, Erb I, Poliakov A, Hou M, Herrero J, Kent WJ, Solovyev V, Darling AE, Ma J, Notredame C, Brudno M, Dubchak I, Haussler D, Paten B. Alignathon: a competitive assessment of whole-genome alignment methods. Genome Res. 2014;24(12):2077-89. doi:10.1101/gr.174920.114.
Breen MS, Kemena C, Vlasov PK, Notredame C, Kondrashov FA. Epistasis as the primary factor in molecular evolution. Nature. 2012;490(7421):535-8. doi:10.1038/nature11510.
10th Benelux Bioinformatics Conference bbc 2015
19
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: C1 Corporate presentation
C1. ILLUMINA SOFTWARE PLATFORMS TO TRANSFORM THE PATH TO
KNOWLEDGE AND DISCOVERY
Nicolas Goffard
Illumina, Inc. [email protected]
The next big bottleneck in the biological sample-to-answer workflow has undoubtedly moved beyond the generation of the raw data towards its initial processing and analysis, and even more so its biological and medical interpretation. There are two main reasons why this is particularly challenging for research organisations to accomplish successfully. Firstly, there is a need to easily and securely analyse, archive and share sequencing data, as well as to simplify and accelerate the data analysis with push-button tools using widely validated and scientifically accepted algorithms. Secondly, there is a requirement to normalize, standardize and curate not just proprietary data from multiple studies, but to do so in a way that allows comparing it in real time to data produced by public-domain studies. Illumina provides two integrated software platforms, BaseSpace and NextBio, to overcome these challenges, and this presentation provides an overview of the capabilities found within both to empower biologists and informaticians to interactively explore the data.
Abstract ID: C2 Corporate presentation
C2. THE SYSTEMS TOXICOLOGY COMPUTATIONAL CHALLENGE:
IDENTIFICATION OF EXPOSURE RESPONSE MARKERS
Carine Poussin, Vincenzo Belcastro, Stéphanie Boué, Florian Martin,
Alain Sewer, Bjoern Titz, Manuel C. Peitsch & Julia Hoeng.
Philip Morris International Research and Development, Philip Morris Product SA,
Quai Jeanrenaud 5, CH-2000 Neuchâtel, Switzerland
INTRODUCTION
Risk assessment in the context of 21st century
toxicology relies on the identification of specific
exposure response markers and the elucidation of
mechanisms of toxicity, which can lead to adverse
events. As a foundation for this future predictive risk
assessment, diverse sets of chemicals or mixtures are
tested in different biological systems, and datasets are
generated using high-throughput technologies.
However, the development of effective computational
approaches for the analysis and integration of these data
sets remains challenging.
METHODS
The sbv IMPROVER (Industrial Methodology for
Process Verification in Research;
http://sbvimprover.com/) project aims to verify methods
and concepts in systems biology research via challenges
posed to the scientific community. In fall 2015, the 4th sbv IMPROVER computational challenge will be launched, aimed at evaluating algorithms for the identification of specific markers of chemical mixture exposure response in the blood of humans or rodents. Blood is an easily accessible matrix, but remains a complex biofluid to analyze. This
computational challenge will address questions related
to the classification of samples based on transcriptomics
profiles from well-defined sample cohorts. Moreover, it
will address whether gene expression data derived from
human or rodent whole blood are sufficiently
informative to identify human-specific or species-
independent blood gene signatures predictive of the
exposure status of a subject to chemical mixtures
(current/former/non-exposure).
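To give a flavour of the classification task posed by the challenge, the sketch below shows a deliberately minimal nearest-centroid classifier assigning an exposure status to a blood expression profile. The gene values, class labels and the nearest-centroid approach itself are purely illustrative assumptions on our part; they are not part of the challenge data or its scoring.

```python
# Illustrative sketch only: nearest-centroid classification of exposure
# status from (toy) blood gene-expression profiles.
from math import dist

def centroids(profiles, labels):
    """Mean expression vector per exposure class."""
    by_class = {}
    for x, y in zip(profiles, labels):
        by_class.setdefault(y, []).append(x)
    return {y: [sum(col) / len(xs) for col in zip(*xs)]
            for y, xs in by_class.items()}

def predict(x, cents):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(cents, key=lambda y: dist(x, cents[y]))

# Toy 3-gene profiles for the three exposure classes
train = [([5.0, 1.0, 0.5], "current"), ([4.8, 1.2, 0.4], "current"),
         ([2.0, 3.0, 1.0], "former"),  ([2.2, 2.8, 1.1], "former"),
         ([0.5, 0.6, 4.0], "non"),     ([0.6, 0.5, 4.2], "non")]
cents = centroids([x for x, _ in train], [y for _, y in train])
print(predict([4.9, 1.1, 0.6], cents))  # a profile resembling "current"
```

Real submissions would of course use properly validated classifiers and the challenge's own scoring protocol.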
RESULTS & DISCUSSION
Participants will be provided with high quality datasets
to develop predictive models/classifiers and the
predictions will be scored by an independent scoring
panel. The results and post-challenge analyses will be
shared with the scientific community, and will open
new avenues in the field of systems toxicology.
REFERENCES
Meyer et al. Industrial methodology for process verification in research (IMPROVER): toward systems biology verification. Bioinformatics, 2012.
Meyer et al. Verification of systems biology research in the age of collaborative competition. Nat Biotechnol, 2011.
Tarca et al. Strengths and limitations of microarray-based phenotype prediction: lessons learned from the IMPROVER Diagnostic Signature Challenge. Bioinformatics, 2013.
Hartung T. Lessons learned from alternative methods and their validation for a new toxicology in the 21st century. J Toxicol Environ Health, 2010.
Hoeng et al. A network-based approach to quantifying the impact of biologically active substances. Drug Discov Today, 2012.
Abstract ID: O1 Oral presentation
O1. CELL TYPE-SELECTIVE DISEASE ASSOCIATION
OF GENES UNDER HIGH REGULATORY LOAD
Mafalda Galhardo1, Philipp Berninger2, Thanh-Phuong Nguyen1, Thomas Sauter1 & Lasse Sinkkonen1*.
Life Sciences Research Unit, University of Luxembourg, Luxembourg, Luxembourg1; Biozentrum, University of Basel and Swiss Institute of Bioinformatics, Basel, Switzerland2.
Identification of biomarkers and drug targets is a key task of biomedical research. We previously showed that disease-
linked metabolic genes are often under combinatorial regulation (Galhardo et al. 2014). Here we extend this analysis to
include almost 100 transcription factors (TFs) and key histone modifications from over 100 samples to show that genes
under high regulatory load (HRL) are enriched for disease association across cell types. Network and pathway analysis suggests that the central role of HRL genes in biological networks, under heavy regulation at both the transcriptional and post-transcriptional levels, is a possible explanation for the observed enrichment. Thus, epigenomic mapping of enhancers presents an unbiased approach for the identification of novel disease-associated genes.
INTRODUCTION
Identification of disease-relevant genes and gene products as biomarkers and drug targets is one of the key tasks of biomedical research. Still, a great majority of research is
focused on a small minority of genes while many remain
unstudied (Pandey et al. 2014). Unbiased prioritization
within these ignored genes would be important to harvest
the full potential of genomics in understanding diseases.
Many databases to catalog disease-associated genes have
been created, including DisGeNET that draws from
multiple sources (Bauer-Mehren et al. 2010). In addition,
large amounts of publicly available epigenomic data on
the cell type-selective regulation of these genes have been
produced. The importance of epigenetic regulation for
disease development is increasingly recognized, for
example in the analysis of GWAS, where causal SNPs
are mostly located within gene regulatory regions
(Maurano et al. 2012).
METHODS
Public ChIP-seq data produced by the ENCODE project
(Dunham et al. 2012), the BLUEPRINT Epigenome
project (Martens et al. 2013) and the NIH Epigenomic
Roadmap project (Kundaje et al. 2015) were downloaded
in May 2014. The data were used to rank active protein
coding genes (based on NCBI Entrez and marked by
H3K4me3) by their regulatory load based on the number
of associated TFs or enhancer (H3K27ac) regions using
the GREAT tool. The enrichment of disease genes from DisGeNET among HRL genes was tested using either the Matlab® hypergeometric cumulative distribution function, adjusted for multiple testing with the Benjamini-Hochberg procedure, or a normalized enrichment score.
Enriched diseases were clustered using R package
“blockcluster”. Peak calling for super-enhancers was done
using HOMER. A liver disease gene network was
constructed from HPRD based on liver disease genes from MeSH and genes from CTD, and had 8278
interactions. Statistical analysis of KEGG pathway
enrichments and betweenness centrality was done using
random sampling tests. miRNA target predictions were
obtained from TargetScan6.2. Further details of the used
methods can be found in Galhardo et al. 2015.
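The enrichment step described above can be sketched as follows. This is a plain-Python reconstruction of a hypergeometric test with Benjamini-Hochberg adjustment, not the authors' Matlab® code, and the gene counts are illustrative toy numbers.

```python
# Sketch (our reconstruction, illustrative numbers): hypergeometric
# enrichment of disease genes among high-regulatory-load (HRL) genes,
# with Benjamini-Hochberg multiple-testing adjustment across diseases.
from math import comb

def hypergeom_sf(k, M, n, N):
    """P(X >= k): chance of at least k disease genes when sampling N HRL
    genes from M active genes, n of which are disease genes."""
    return sum(comb(n, i) * comb(M - n, N - i)
               for i in range(k, min(n, N) + 1)) / comb(M, N)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj, running = [0.0] * m, 1.0
    for rank, i in reversed(list(enumerate(order, start=1))):
        running = min(running, pvals[i] * m / rank)
        adj[i] = running
    return adj

# Toy numbers: 2000 active genes, 100 disease genes, 150 HRL genes,
# of which 20 are disease genes (expected by chance: 7.5)
p = hypergeom_sf(20, 2000, 100, 150)
print(p < 0.001, benjamini_hochberg([0.01, 0.04, 0.03]))
```

The survival function is the "hygecdf"-style upper tail; in practice one would use a vetted statistics library rather than this illustration.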
RESULTS & DISCUSSION
Using ENCODE ChIP-Seq profiles for 93 transcription
factors (TFs) in nine cell lines, we show that HRL genes
are enriched for disease-association across cell types
(Figure 1). TF load correlates with the enhancer load of
the genes, allowing the identification of HRL genes by
epigenomic mapping of active enhancers marked by
H3K27ac modifications. Identification of the HRL genes
across 139 samples from 96 different cell and tissue types
reveals a consistent enrichment for disease-associated
genes in a cell type-selective manner.
The HRL genes are involved in more pathways than
expected by chance, exhibit increased betweenness
centrality in the interaction network of liver disease genes,
and carry longer 3’UTRs with more microRNA binding
sites than genes on average, suggesting a role as hubs
within regulatory networks.
Thus, epigenomic mapping of enhancers presents an
unbiased approach for identification of novel disease-
associated genes (Galhardo et al. 2015).
FIGURE 1. Workflow of the disease-gene enrichment analysis.
REFERENCES
Pandey AK et al. PLoS One, 9:e88889 (2014).
Bauer-Mehren A et al. Nucleic Acids Res., 33:D514-D517 (2010).
Maurano et al. Science, 337:1190-1195 (2012).
Galhardo et al. Nucleic Acids Res., 42:1474-1496 (2014).
Dunham et al. Nature, 489:57-74 (2012).
Martens et al. Haematologica, 98:1487-1489 (2013).
Kundaje et al. Nature, 518:317-330 (2015).
Galhardo et al. Nucleic Acids Res., 10.1093/nar/gkv863 (2015).
[Figure 1 content: human ChIP-seq data; gene ranking by regulatory load (number of TFs or enhancers per gene); high regulatory load genes are enriched for disease association. Inputs: transcription factor binding sites (93 TFs) from 9 ENCODE cell lines (A549, GM12878, H1hESC, HCT116, HeLaS3, HepG2, HUVEC, K562, MCF7); active enhancers (H3K27ac) from 139 samples comprising 96 tissue or cell types; disease genes (min score 0.08).]
Abstract ID: O2 Oral presentation
O2. PREDICTING OLIGOGENIC EFFECTS USING DIGENIC DISEASE DATA
Andrea M. Gazzo1,2,3*, Dorien Daneels1,3, Maryse Bonduelle3, Sonia Van Dooren1,3, Guillaume Smits1,4 & Tom Lenaerts1,2,5.
Interuniversity Institute of Bioinformatics in Brussels, Brussels, Belgium1; MLG, Departement d'Informatique, Universite Libre de Bruxelles, Brussels, Belgium2; Center for Medical Genetics, Reproduction and Genetics, Reproduction Genetics and Regenerative Medicine, Vrije Universiteit Brussel, UZ Brussel, Brussels, Belgium3; Genetics, Hopital Universitaire des Enfants Reine Fabiola, Universite Libre de Bruxelles, Brussels, Belgium4; Computerwetenschappen, Vrije Universiteit Brussel, Brussels, Belgium5.
Recent research has shown that some disorders may be better described by more complex inheritance mechanisms, suggesting that diseases classically considered monogenic may in fact be oligogenic. Understanding how the combined interplay and weight of variants leads to disease may provide improved and novel insights into these diseases. Here we present a classification method that separates two types of digenic diseases: those that require variants in both genes to induce the disease, and those where one variant is causative and the second increases the severity. Our results show that a clear separation can be made between both classes using gene-level and variant-level features extracted from DIDA.
INTRODUCTION
DIDA is a novel database that provides for the first time
detailed information on genes and associated genetic
variants involved in digenic diseases, the simplest form of
oligogenic inheritance1. The database is accessible via
http://dida.ibsquare.be and currently includes 213 digenic
combinations involved in 44 different digenic diseases2.
These combinations are composed of 364 distinct variants,
which are distributed over 136 distinct genes. Creating this
new repository was essential, as current databases do not
allow one to retrieve detailed records regarding digenic
combinations. Genes, variants, diseases and digenic
combinations in DIDA are annotated with manually
curated information and information mined from other
online resources. Each digenic combination was categorized into one of two effect classes: either "on/off", in which variant combinations in both genes are required to develop the disease, or "severity", where variants in one gene are enough to develop the disease and carrying variant combinations in two genes increases the severity or affects the age of onset. In this work we present a predictor
capable of distinguishing between the digenic effect
classes. We analyse the result of this predictor in relation
to specific features collected for the different digenic
combinations in DIDA, as for instance the
haploinsufficiency of the genes, their zygosity and the
relationship between them, providing insight into the
biological meaning of the result.
METHODS
We used a machine learning approach to determine the
classes, i.e. "severity" or "on/off", of a digenic
combination. Starting with feature selection, we chose the most informative features to classify a digenic combination into one of the two classes. For each of the two genes involved in a digenic combination, the following are used as features in the predictor: zygosity (heterozygous, homozygous, etc.), recessiveness probability, haploinsufficiency score, known recessive information, and whether the gene is essential (based on mouse knockout experimental data). At the variant level, we used the pathogenicity predictions from the SIFT and PolyPhen-2 tools as features. Finally, we also encode the relationship between the two genes, defining the relations "similar function", "directly interacting" and "pathway membership". After different tests we decided to use a random forest algorithm, as this approach gave the best results.
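The gene-pair-aware stratification that a fair evaluation of this predictor requires can be sketched as below. The greedy balancing heuristic is our own simplification, not necessarily the authors' exact procedure, and the gene pairs shown are illustrative.

```python
# Sketch (our simplification) of a gene-pair-aware split: all instances
# sharing a gene pair go into the same subset, so a pair never appears
# in both the training and the test set. Gene names are illustrative.
def gene_pair_folds(instances, n_folds=5):
    """Greedily assign whole gene-pair groups to the smallest fold."""
    by_pair = {}
    for inst in instances:                 # inst = (gene_a, gene_b, label)
        by_pair.setdefault(frozenset(inst[:2]), []).append(inst)
    folds = [[] for _ in range(n_folds)]
    for group in sorted(by_pair.values(), key=len, reverse=True):
        min(folds, key=len).extend(group)  # largest groups placed first
    return folds

data = [("GJB2", "GJB6", "on/off"), ("GJB6", "GJB2", "on/off"),
        ("BBS1", "BBS10", "severity"), ("LMNA", "EMD", "on/off"),
        ("ABCA4", "ROM1", "severity")]
folds = gene_pair_folds(data, n_folds=3)
# Both GJB2/GJB6 instances land in the same fold, regardless of gene order
```

Using `frozenset` makes the pair key order-independent, which is essential since a digenic combination (A, B) is the same combination as (B, A).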
RESULTS & DISCUSSION
After a 10-fold cross-validation we obtained promising performances, with an MCC of 0.67 and an AUROC of 0.92. Regretfully, this performance is an overestimation: since the gene-based features are the most important, many examples with mutations mapped to the same gene pair lead to the same oligogenic effect class. A stratification ensuring that the same pair of genes never appears in both the training and the test set was therefore required. We manually created 5 subsets, where instances with the same gene pair belong to the same subset. After this procedure we assessed the performances again, obtaining an MCC of 0.36 and an AUROC of 0.78. To verify the significance of these performances, we retrained the random forest on a randomization of the data, obtained by shuffling all the features of each instance while keeping the class unchanged. This reshuffling resulted in an MCC close to zero and an AUROC near 0.5, as expected. This additional test confirms the significance of the stratified results.
In a next stage we are analysing the relationship between the oligogenic effect and the features used, particularly in terms of biological and molecular interpretation. As a future perspective, the benefit at the clinical level is very promising: one goal of medical genetics is to assign predictive value to the genotype, in order for it to assist in diagnosis and disease management. If we can infer, based on the genotype, what the digenic/oligogenic effect will be, we can potentially anticipate the treatment.
REFERENCES
[1] Gazzo A et al. DIDA: a curated and annotated digenic diseases database. Under review, NAR database issue (2016).
[2] Schäffer AA. Digenic inheritance in medical genetics. J Med Genet, 50:641-652 (2013).
Abstract ID: O3 Oral presentation
O3. A COMPREHENSIVE COMPARISON OF MODULE DETECTION METHODS
FOR GENE EXPRESSION DATA
Wouter Saelens1,2*, Robrecht Cannoodt1,2,3, Bart N. Lambrecht1,2 & Yvan Saeys1,2.
VIB Inflammation Research Center1; Department of Respiratory Medicine, Ghent University2; Center for Medical Genetics, Ghent University Hospital3.
Module detection is central in every analysis of large-scale gene expression data. While numerous methods have been developed, the relative merits and drawbacks of these different approaches are still unclear. In this work we use known gene regulatory networks to perform an unbiased comparison of 41 module detection methods, spanning clustering, biclustering, decomposition, direct network inference and iterative network inference. This analysis showed that decomposition methods outperform current clustering methods. Our work provides a first comprehensive evaluation to guide biologists in their choice, but also serves as a protocol for the evaluation of novel module detection methods.
INTRODUCTION
Module detection methods form a cornerstone in the
analysis of genome wide gene expression compendia.
Modules in this context are defined as groups of genes
with a similar expression profile, and therefore frequently
share certain functions, are co-regulated and cooperate to
produce a certain phenotype.
Over recent years, dozens of module detection methods have been developed, which can be classified into five different categories. The most popular approach is undoubtedly clustering, which groups genes into modules based on global similarity in expression profiles.
Within the transcriptomics community these methods have
received a considerable amount of criticism. This is
mainly due to three drawbacks: (i) clustering cannot detect
so called local co-expression effects, (ii) most clustering
methods are unable to detect overlapping modules and (iii)
clustering methods do not model the underlying gene
regulatory network. Alternative approaches have therefore
been developed which either handle both overlap and local
co-expression (biclustering and decomposition) or model
the gene regulatory network (direct network inference and
iterative network inference).
Given this methodological diversity, it is important that
existing and new approaches are evaluated on robust and
objective benchmarks. However, past evaluation studies were limited in the number of methods, used synthetic data, or did not correctly assess the balance between false positives and false negatives. In this study we therefore
provide a novel unbiased and comprehensive evaluation
strategy (Figure 1), and used it to evaluate 41 state-of-the-
art module detection methods.
METHODS
The key to our approach is that we use gold-standard regulatory networks to define sets of known modules.
These can be used to directly assess the sensitivity and
specificity of the different module detection methods. We
used four different large scale gene expression compendia,
two from E. coli and two from S. cerevisiae. For each of
these organisms a substantial part of the regulatory
network is already known, either based on the integration
of small-scale experiments or based on large, genome
wide datasets. We use these networks to define sets of known modules by looking at genes which either share one regulator, share all regulators, or are strongly
interconnected. We used four different metrics to compare
a set of observed modules with known modules: recovery
and recall control the type II errors, while the relevance
and specificity control the type I errors.
Parameter tuning is a necessary but often overlooked
challenge of module detection methods. As default
parameters of a tool are usually optimized for some
specific test cases by the authors, they do not necessarily
reflect general good performance on other datasets. On the
other hand, one should be careful of overfitting parameters
on specific characteristics of the data, as such parameters
will lead to suboptimal results when using the same
parameter settings on other datasets. In this study we first
optimized parameters using a grid-based approach. Next,
to avoid overfitting we used the optimal parameters on one
dataset to score the performance on another dataset, in an
approach akin to cross-validation.
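To make the four metrics concrete, here is a small sketch using Jaccard-based best-match scores between known and observed module sets. These definitions are an assumed, simplified illustration following common usage; the exact formulas used in the study may differ.

```python
# Sketch (assumed simplified definitions): Jaccard best-match scores
# between a set of known modules and a set of observed modules.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def recovery(known, observed):
    """Mean, over known modules, of the best-matching observed module.
    Low values signal type II errors: known modules that were missed."""
    return sum(max(jaccard(k, o) for o in observed) for k in known) / len(known)

def relevance(known, observed):
    """Mean, over observed modules, of the best-matching known module.
    Low values signal type I errors: spurious observed modules."""
    return sum(max(jaccard(o, k) for k in known) for o in observed) / len(observed)

known = [{"g1", "g2", "g3"}, {"g4", "g5"}]
observed = [{"g1", "g2", "g3"}, {"g6", "g7"}]
print(recovery(known, observed), relevance(known, observed))  # 0.5 0.5
```

The asymmetry is the point: recovery averages over known modules, relevance over observed ones, so the pair controls the two error types separately.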
RESULTS & DISCUSSION
We evaluated 41 different module detection methods
covering all five approaches. Overall, our analysis showed
that certain decomposition methods, namely those based on independent component analysis, outperform current state-of-the-art clustering methods. However, despite their
theoretical advantages, neither biclustering nor network
inference methods are able to outperform clustering
methods. Importantly, our results are stable across datasets,
module definitions and scoring metrics, demonstrating the
robustness of our evaluation methodology.
FIGURE 1. Overview of our evaluation methodology.
The applications of our work are twofold. First, if local co-expression and overlap are of interest, we discourage the use of biclustering methods and suggest the use of decomposition instead. Second, we provide a new
comprehensive evaluation methodology which can be used
to compare novel methods with the current state-of-the-art.
Abstract ID: O4 Oral presentation
O4. LATEBICLUSTERING: EFFICIENT DISCOVERY OF TEMPORAL LOCAL
PATTERNS WITH POTENTIAL DELAYS
Joana P. Gonçalves1,2* & Sara C. Madeira3,4.
Pattern Recognition and Bioinformatics Group, Department of Intelligent Systems, Delft University of Technology1; Division of Molecular Carcinogenesis, The Netherlands Cancer Institute2; Department of Computer Science and Engineering, Instituto Superior Técnico, Universidade de Lisboa3; INESC-ID4.
Temporal transcriptomes can provide valuable insight into the dynamics of transcriptional response and gene regulation.
In particular, many studies seek to uncover functional biological units by identifying and grouping genes with common
expression patterns. Nevertheless, most analytical tools available for this purpose fall short in their ability to consider
biologically reasonable models and adequately incorporate the temporal dimension. Each biological task is likely to
occur within a time period that does not necessarily span the whole time course of the experiment, and genes involved in
such a task are expected to coordinate only while the task is ongoing. LateBiclustering is an efficient algorithm to
identify this type of coordinated activity, while allowing genes to participate in distinct biological tasks with multiple
partners over time. Additionally, LateBiclustering is able to capture temporal delays suggestive of transcriptional
cascades: one of the hallmarks of gene expression and regulation.
INTRODUCTION
The discovery of patterns in temporal transcriptomes
exposes gene expression dynamics and contributes to understanding the machinery involved in its modulation.
Various analytical tools are employed in this regard.
Differential expression summarizes an entire time course
into one feature, thus lacking detail. Clustering respects the chronological order, but focuses on global
similarities and tends to identify rather broad patterns,
associated with unspecific functions. Biclustering offers
increased granularity by additionally searching for local
patterns, but allows for arbitrary jumps in time, eventually
leading to patterns that are incoherent from a temporal
perspective.
METHODS
LateBiclustering is an efficient algorithm for the
identification of transcriptional modules, here termed
LateBiclusters. Each LateBicluster is a group of genes
showing a similar expression pattern with potential delays,
within a particular time frame that does not necessarily span the whole time course of the transcriptome.
LateBiclustering only reports maximal LateBiclusters, that
is, those that cannot be extended and are not fully
contained in any other LateBicluster.
LateBiclustering takes as input a gene-time expression
matrix of real values. Each gene expression profile is first
normalized to zero mean and unit standard deviation. A
discretization is further applied to discern variations
between consecutive time points into three levels: down-
trend, no-change and up-trend. Upon discretization each
gene profile can be seen as a string.
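The normalization and discretization steps above can be sketched as follows; the no-change threshold is our assumption, not a value taken from the paper.

```python
# Sketch of the preprocessing: z-normalize each profile, then encode each
# consecutive-time-point change as D (down-trend), N (no-change) or
# U (up-trend). The threshold t = 0.3 is an illustrative assumption.
from statistics import mean, stdev

def discretize(profile, t=0.3):
    """Turn a real-valued time series into a trend string."""
    m, s = mean(profile), stdev(profile)
    z = [(x - m) / s for x in profile]          # zero mean, unit std
    out = []
    for prev, curr in zip(z, z[1:]):
        d = curr - prev
        out.append("U" if d > t else "D" if d < -t else "N")
    return "".join(out)

print(discretize([1.0, 1.2, 3.5, 3.4, 1.0]))  # → "NUND"
```

After this step each gene profile is a string over {D, N, U}, which is what makes the generalized-suffix-tree search for shared (possibly delayed) patterns applicable.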
A generalized suffix tree is built to find common
patterns in the gene profiles. Internal nodes
satisfying certain properties are marked for their
potential to denote LateBiclusters.
When an internal node does not satisfy the basic
conditions for LateBicluster maximality, a
procedure is applied to remove occurrences
leading to non-maximal LateBiclusters. For this
purpose, LateBiclustering uses a bit array
representing the occurrences underlying each
internal node. During the maximality update
procedure, the bit array of the inspected node is
compared against those of internal child nodes
(right-max) and nodes from which the inspected
node receives suffix links (left-max).
Finally, LateBiclustering comes with different
heuristics to report a single pattern occurrence per
gene in each maximal LateBicluster. A heuristic
is necessary because there may be multiple
occurrences of a pattern in the profile of a given
gene, which is a direct consequence of allowing
the discovery of delayed patterns.
RESULTS & DISCUSSION
LateBiclustering is the first efficient algorithm suitable for
the discovery of biclusters with temporal delays. It runs in
polynomial time, whereas previous methods had exponential time complexity. LateBiclustering was able to
find planted biclusters in synthetic data. It also identified
biologically relevant LateBiclusters associated with
Saccharomyces cerevisiae’s response to heat stress, and
interesting time-lagged responses.
FIGURE 1. Schematic of the LateBiclustering algorithm.
REFERENCES
Gonçalves JP & Madeira SC. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11(5), 801-813 (2014).
Abstract ID: O5 Oral presentation
O5. INFERRING DEVELOPMENTAL CHRONOLOGIES FROM SINGLE CELL
RNA
Robrecht Cannoodt1,2,3*, Katleen De Preter3 & Yvan Saeys1,2.
Data Mining and Modelling for Biomedicine group, VIB Inflammation Research Center, Ghent1; Department of Respiratory Medicine, Ghent University Hospital, Ghent2; Center of Medical Genetics, Ghent University Hospital, Ghent3.
With the advent of single cell RNA sequencing, it is now possible to analyse the transcriptomes of hundreds of individual
cells in an unbiased manner. Reconstructing the developmental chronology of differentiating cells is a challenging task,
and doing so in an unsupervised and robust manner is a hitherto untackled problem. We developed a truly unsupervised
developmental chronology inference technique, and evaluated its performance and robustness using multiple datasets.
INTRODUCTION
Early attempts at inferring the chronologies of single cells include MONOCLE (Trapnell et al., 2014) and NBOR (Schlitzer et al., 2015). However, these techniques are not
unsupervised as they require knowledge of the cell type of
each cell prior to analysis, which biases the results to prior
knowledge and possibly obstructs the discovery of novel
subpopulations.
METHODS
Our approach consists of four steps.
In the first step, the feature space (~30000 genes) is
reduced to three dimensions.
Secondly, outliers are detected and removed, using a K-
nearest neighbour approach. After outlier removal, the
original feature space is again reduced to three dimensions.
Next, a nonparametric nonlinear curve is iteratively fitted
to the data.
Finally, each cell is projected onto the curve, thus
resulting in a cell chronology.
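A minimal sketch of the outlier-removal step (the second step above) is given below; the choice of k and the cutoff are illustrative assumptions, not the settings used in the method itself.

```python
# Sketch (illustrative parameters): flag cells whose mean distance to
# their k nearest neighbours is unusually large relative to the dataset.
from math import dist
from statistics import mean, stdev

def knn_outliers(points, k=3, z_cut=1.5):
    """Return indices of points whose k-NN distance score is more than
    z_cut standard deviations above the average score."""
    scores = []
    for i, p in enumerate(points):
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(mean(d[:k]))            # mean distance to k nearest
    m, s = mean(scores), stdev(scores)
    return [i for i, sc in enumerate(scores) if (sc - m) / s > z_cut]

# A tight cluster of cells (in the reduced 3-D space) plus one outlier
cells = [(0, 0, 0), (0.1, 0, 0), (0, 0.1, 0), (0.1, 0.1, 0.1),
         (0, 0, 0.1), (10, 10, 10)]
print(knn_outliers(cells))  # → [5]
```

In the actual pipeline this step operates on the three reduced dimensions, and the feature space is re-reduced after the flagged cells are removed.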
RESULTS & DISCUSSION
A single-cell RNA-seq dataset (Schlitzer et al., 2015) contains profiles of DC progenitor cells. These cells are
expected to differentiate from MDP to CDP to PreDC. Our
method is able to intuitively visualise known population
groups (Figure 1), as well as infer the developmental
chronology of the individual cells (Figure 2).
We evaluated our method on four datasets (Shalek et al.,
2014; Trapnell et al., 2014; Buettner et al., 2015 and
Schlitzer et al., 2015), and found it to perform better and
more robustly than existing methods MONOCLE and
NBOR.
This approach opens opportunities to further study known
mechanisms or investigate unknown key regulatory
structures in cell differentiation, or detect novel
subpopulations in a truly unsupervised manner.
REFERENCES
Buettner F et al. Nature Biotechnology 33, 155-160 (2015).
Schlitzer A et al. Nature Immunology 16, 718-726 (2015).
Shalek A et al. Nature 509, 363-369 (2014).
Trapnell C et al. Nature Biotechnology 32, 381-386 (2014).
FIGURE 1. After feature space reduction and outlier detection of 244 DC
progenitor cells (Schlitzer et al., 2015), our method can intuitively
visualise known populations.
FIGURE 2. An iterative curve fitting results in a smooth curve reflecting
the developmental chronology. After projecting each cell to the curve,
regulatory patterns in expression which correlate with this timeline can be investigated.
Abstract ID: O6 Oral presentation
O6. COMBINING TREE-BASED AND DYNAMICAL SYSTEMS
FOR THE INFERENCE OF GENE REGULATORY NETWORKS
Vân Anh Huynh-Thu1* & Guido Sanguinetti2,3.
GIGA-R & Department of Electrical Engineering and Computer Science, University of Liège1; School of Informatics, University of Edinburgh2; SynthSys – Systems and Synthetic Biology, University of Edinburgh3.
INTRODUCTION
Reconstructing the topology of gene regulatory networks
(GRNs) from time series of gene expression data remains
an important open problem in computational systems
biology. Current approaches can be broadly divided into model-based and model-free approaches, each facing a limitation: model-free methods are scalable but lack interpretability and cannot, in general, be used for out-of-sample predictions. Model-based methods, on the other hand, focus on identifying a dynamical model of the system; these are clearly interpretable and can be used for predictions, but they rely on strong assumptions and are typically very demanding computationally. Here, we aim to bridge the gap between
model-based and model-free methods by proposing a
hybrid approach to the GRN inference problem, called
Jump3 (Huynh-Thu & Sanguinetti, 2015). Our approach
combines formal dynamical modelling with the efficiency
of a nonparametric, tree-based method, allowing the
reconstruction of GRNs of hundreds of genes.
METHODS
Gene expression model. At the heart of the Jump3
framework, we use the on/off model of gene expression
(Ptashne & Gann, 2002), where the rate of transcription of
a gene can vary between two levels depending on the
activity state μ of the promoter of the gene. The expression
x of a gene is modelled through the following stochastic
differential equation:
dxᵢ = (Aᵢμᵢ(t) + bᵢ − λᵢxᵢ)dt + σdω(t),
where subscript i refers to the i-th target gene. Here, the promoter state μᵢ(t) is a binary variable (the promoter is either active or inactive) that depends on the expression levels of the transcription factors (TFs) that bind to the promoter. Aᵢ, bᵢ and λᵢ are kinetic parameters, and the term σdω(t) represents a white-noise driving process with variance σ².
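The on/off model above can be simulated directly with an Euler–Maruyama scheme. The sketch below uses made-up kinetic parameters and an illustrative promoter that switches on at t = 5; it is not part of Jump3 itself, just the generative model it builds on.

```python
# Euler-Maruyama simulation of the on/off expression model
# dx = (A*mu(t) + b - lam*x) dt + sigma dW(t).
# Parameter values and the switching time are illustrative.
import numpy as np

def simulate_onoff(A=2.0, b=0.1, lam=0.5, sigma=0.02,
                   t_end=20.0, dt=0.01, switch_on=5.0, seed=1):
    rng = np.random.default_rng(seed)
    n = int(t_end / dt)
    x = np.empty(n + 1)
    x[0] = b / lam                       # start at the "off" steady state
    for k in range(n):
        t = k * dt
        mu = 1.0 if t >= switch_on else 0.0     # binary promoter state
        drift = A * mu + b - lam * x[k]
        x[k + 1] = x[k] + drift * dt + sigma * np.sqrt(dt) * rng.normal()
    return x

x = simulate_onoff()
# After the promoter switches on, x relaxes toward (A + b) / lam.
```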
Network reconstruction with jump trees. Recovering
the regulatory links pointing to gene i amounts to finding
the genes whose expression is predictive of the promoter
state μi. To achieve this goal, we propose a procedure that
learns, for each target gene i, an ensemble of decision trees
predicting the promoter state μi at any time t from the
expression levels of the candidate regulators at the same
time t. However, standard tree-based methods cannot be
applied here since the output μi(t) is a latent variable. We
therefore propose a new decision tree algorithm called
“jump tree”, which splits the observations by maximising
the marginal likelihood of the dynamical on/off model.
The learned tree-based model is then used to derive an
importance score for each candidate regulator, computed
as the sum of the likelihood gains that are obtained at all
the tree nodes where this regulator was selected to split the
observations. The importance of a candidate regulator j is
used as weight for the putative regulatory link of the
network that is directed from gene j to gene i.
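The importance aggregation described above amounts to summing, per candidate regulator, the likelihood gains recorded at the tree nodes where that regulator was chosen. A minimal sketch (the node list is toy data, not Jump3 output):

```python
# A regulator's importance is the sum of the likelihood gains obtained
# at all tree nodes where it was selected to split the observations.
from collections import defaultdict

def regulator_importances(nodes):
    """nodes: iterable of (regulator, likelihood_gain) pairs collected
    from all nodes of all trees in the ensemble for one target gene."""
    scores = defaultdict(float)
    for regulator, gain in nodes:
        scores[regulator] += gain
    return dict(scores)

# Toy node records; the resulting weights score links TFj -> gene i.
toy_nodes = [("TF1", 3.2), ("TF2", 1.1), ("TF1", 0.7), ("TF3", 0.4)]
w = regulator_importances(toy_nodes)
```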
RESULTS & DISCUSSION
We evaluated Jump3 on the networks of the DREAM4 In
Silico Network challenge (Prill et al., 2010). For each
network topology, two types of simulated expression data
were used: data simulated using the on/off model (toy
data) and the time series data that was provided in the
context of the DREAM4 challenge. We compared Jump3
to other GRN inference methods: two model-free methods,
which are time-lagged variants of GENIE3 (Huynh-Thu et
al., 2010) and CLR (Faith et al., 2007) respectively; two
model-based methods, namely Inferelator (Greenfield et
al., 2010) and TSNI (Bansal et al., 2006), and G1DBN
(Lèbre, 2009), a method based on dynamic Bayesian
networks. Areas Under the Precision-Recall curves
(AUPRs) obtained for size-100 networks are shown in
Table 1. Jump3 yields the highest AUPR in the case of the
toy data. As expected, its performance decreases when the
networks are inferred from the DREAM4 data, due to the
mismatch between the on/off model and the one used to
simulate the data. However, Jump3 still outperforms the
other methods.
             Toy             DREAM4
Jump3        0.272 ± 0.060   0.187 ± 0.058
GENIE3-lag   0.114 ± 0.010   0.176 ± 0.056
CLR-lag      0.088 ± 0.008   0.169 ± 0.047
Inferelator  0.069 ± 0.006   0.144 ± 0.036
TSNI         0.020 ± 0.003   0.042 ± 0.010
G1DBN        0.104 ± 0.024   0.114 ± 0.043

TABLE 1. Comparison of network inference methods (mean AUPR ± standard deviation).
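An AUPR like those in Table 1 is obtained by ranking all candidate edges by their predicted weight and computing the average precision against the gold-standard network. A self-contained sketch with toy edge scores (not Jump3 output):

```python
# Average precision over ranked candidate edges, equivalent to the
# area under the precision-recall curve for distinct scores.
import numpy as np

def aupr(y_true, scores):
    """Mean of the precision values at the rank of each true edge."""
    order = np.argsort(-scores)          # rank edges by predicted weight
    y = y_true[order]
    precision = np.cumsum(y) / np.arange(1, len(y) + 1)
    return float(precision[y == 1].mean())

gold   = np.array([1, 1, 0, 1, 0, 0, 0, 0])          # true links
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
area = aupr(gold, scores)
```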
We also applied Jump3 to gene expression data from
murine bone marrow-derived macrophages treated with
interferon gamma (Blanc et al., 2011). Several of the hub
TFs in the predicted network have biologically relevant
annotations. They include interferon genes, one gene
associated with cytomegalovirus infection, and cancer-
associated genes, showing the potential of Jump3 for
biologically meaningful hypothesis generation.
REFERENCES
Bansal M et al. Bioinformatics 22, 815-822 (2006).
Blanc M et al. PLoS Biol 9, e1000598 (2011).
Faith JJ et al. PLoS Biol 5, e8 (2007).
Greenfield A et al. PLoS ONE 5, e13397 (2010).
Huynh-Thu VA & Sanguinetti G. Bioinformatics 31, 1614-1622 (2015).
Huynh-Thu VA et al. PLoS ONE 5, e12776 (2010).
Lèbre S. Stat Appl Genet Mol Biol 8, Article 9 (2009).
Prill RJ et al. PLoS ONE 5, e9202 (2010).
Ptashne M & Gann A. Genes and Signals. Cold Spring Harbor Laboratory Press (2002).
Abstract ID: O7 Oral presentation
O7. MODELING THE REGULATION OF β-CATENIN SIGNALLING BY WNT STIMULATION AND GSK3 INHIBITION
Annika Jacobsen1, Nika Heijmans2, Reneé van Amerongen2, Folkert Verkaar3, Martine J. Smit3, Jaap Heringa1 & K. Anton Feenstra1*.
1Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, The Netherlands; 2Van Leeuwenhoek Centre for Advanced Microscopy and Section of Molecular Cytology, Swammerdam Institute for Life Sciences, University of Amsterdam, The Netherlands; 3Division of Medicinal Chemistry, VU University Amsterdam, The Netherlands.
The Wnt/β-catenin signaling pathway is crucial for stem cell self-renewal, proliferation and differentiation. Hyperactive
Wnt/β-catenin signaling caused by genetic alterations plays an important role in oncogenesis. In our newly developed
Petri net model, GSK3 inhibition leads to significantly higher pathway activation (high β-catenin levels) than WNT stimulation, which we confirm experimentally by TCF/LEF luciferase reporter assays. Using this validated model
we can now simulate changes in Wnt/β-catenin signaling resulting from different mutations found in breast and
colorectal cancer. We propose that this model can be used further to investigate different players affecting Wnt/β-catenin
signaling during oncogenic transformation and the effect of drug treatment.
WNT/β-CATENIN
Wnt/β-catenin signaling is important for stem cell
maintenance and developmental processes and is highly
conserved in all multicellular organisms (1, 2). The
pathway regulates the expression of specific target genes
by changing the levels of the transcriptional co-activator β-catenin, which activates the TCF/LEF transcription factors. Wnt/β-catenin signaling is active in stem cells located in Wnt-rich environments.
APC and AXIN are key proteins of the destruction
complex, which targets β-catenin for destruction.
Mutations in APC, AXIN and β-catenin play important
roles in oncogenesis (2, 3). To better understand its role in
oncogenesis, we here create a Petri net (PN) model of the Wnt/β-catenin signaling pathway that uses available
coarse-grained data, such as binary interactions and semi-
quantitative protein levels. Using this model and
validating experiments we show how different strengths of
Wnt stimulation and GSK3 inhibition activate signaling
over time.
PETRI-NET MODELLING
We built a PN model of Wnt/β-catenin signaling describing the logic of known (inter)actions, cf. our previous
work (5). In a PN, a place represents an entity (e.g. gene),
a transition indicates the activity occurring between the
places (e.g. gene expression), and these are connected by
directed edges called arcs that represent their interactions
(e.g., activation of gene expression by a protein).
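The place/transition/arc mechanics just described can be sketched in a few lines: a transition is enabled when all its input places hold enough tokens, and firing it consumes input tokens and produces output tokens. The marking and arc weights below are illustrative, not the Wnt model itself.

```python
# Minimal Petri-net firing rule: places hold tokens; a transition is
# enabled if every input place holds at least its arc weight in tokens.
def fire(marking, inputs, outputs):
    """marking: {place: tokens}; inputs/outputs: {place: arc weight}.
    Returns the new marking, or None if the transition is not enabled."""
    if any(marking.get(p, 0) < w for p, w in inputs.items()):
        return None
    new = dict(marking)
    for p, w in inputs.items():
        new[p] -= w
    for p, w in outputs.items():
        new[p] = new.get(p, 0) + w
    return new

# "Gene expression" transition: a gene plus an activating protein
# produce mRNA; the gene and protein are read and returned.
m0 = {"gene": 1, "protein": 1, "mRNA": 0}
m1 = fire(m0, inputs={"gene": 1, "protein": 1},
              outputs={"gene": 1, "protein": 1, "mRNA": 1})
```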
TRANSCRIPTION AND PROTEIN ASSAYS
TCF/LEF transcription was measured by TOPFLASH
reporter activity at several time points and at different
concentrations of Wnt3a stimulation and GSK3 inhibition
by CHIR99021. Active and total β-catenin (CTNNB1)
levels were measured by Western blot.
VALIDATED ACTIVATION & INHIBITION
We simulate the model with initial Wnt and GSK3 token
levels ranging from 0 to 5 to represent addition of Wnt and
inhibition of GSK3. Figure 1 shows the four different β-
catenin responses for Wnt addition (purple) and GSK3
inhibition (green). At low GSK3 levels, β-catenin increases linearly, but at high GSK3 levels β-catenin remains low.
At high Wnt levels, β-catenin shows a transient response,
with the peak height increasing with Wnt levels. The
increase of β-catenin is due to sequestration of AXIN to
the cell membrane, which inactivates the destruction
complex. Increase in β-catenin activates transcription of
AXIN2 which triggers the negative feedback.
FIGURE 1. Pathway response for different levels of Wnt and activity of
GSK3. When adding Wnt, the pathway transiently activates but GSK3 inhibition permanently activates.
TCF/LEF reporter assay validation experiments for both
perturbations show that transcriptional activity of
TCF/LEF is both dosage and time dependent,
corresponding well for GSK3 inhibition. Wnt3a stimulation, on the other hand, does activate expression, but we
do not observe the β-catenin dosage or time effect
predicted by our model. Measuring β-catenin by Western
blot reveals a consistent increase upon pathway activation,
however protein levels and changes are on the border of
experimental sensitivity.
In conclusion, our Petri net model recapitulates much of
the known behavior of the Wnt/β-catenin pathway upon
Wnt stimulation and GSK3 inhibition, and hints at
subtleties in the mechanism that will help us gain further
understanding in the role of this pathway in development
and oncogenesis.
REFERENCES
1. Clevers & Nusse (2012) Cell. 149:1192-1205
2. Holstein (2012) Cold Spring Harb Perspect Biol. 4:a007922
3. MacDonald, Tamai & He (2009) Dev Cell. 17:9-26
4. Klaus & Birchmeier (2008) Nat. Rev. Cancer. 8:387-398
5. Bonzanni et al., (2009) Bioinformatics. 25:2049-2056
Abstract ID: O8 Oral presentation
O8. RANKED TILING BASED APPROACH TO DISCOVERING PATIENT
SUBTYPES
Thanh Le Van1*, Jimmy Van den Eynden3, Dries De Maeyer2, Ana Carolina Fierro5, Lieven Verbeke5, Matthijs van Leeuwen4, Siegfried Nijssen1,4, Luc De Raedt1 & Kathleen Marchal5,6.
Department of Computer Science1, Centre of Microbial and Plant Genetics2, KULeuven, Belgium; Department of Medical Biochemistry, University of Gothenburg3, Sweden; Leiden Institute for Advanced Computer Science4, Universiteit Leiden, The Netherlands; Department of Plant Biotechnology and Bioinformatics5, Department of Information Technology, iMinds6, Ghent University, Belgium.
Cancer is a heterogeneous disease consisting of many subtypes that usually have both shared and distinguishing
mechanisms. To derive good subtypes, it is essential to have a computational model that can score their homogeneity
from different angles, for example, mutated pathways and gene expression. In this paper, we introduce our ongoing work
which studies a constraint-based optimisation model to discover patient subtypes as well as their perturbed pathways
from mutation, transcription and interaction data. We propose a way to solve the optimisation problem based on
constraint programming principles. Experiments on a TCGA breast cancer dataset demonstrate the promise of the
approach.
INTRODUCTION
Discovering patient subtypes and understanding their
mechanisms are essential to provide precise treatments to
patients. There have been efforts to understand how mutations cause subtypes, such as the work by Hofree et al. (2013). However, to the best of the authors' knowledge, how to combine mutation and expression data to derive good subtypes remains an open question. We therefore study a new computational model that can discover subtypes, as well as their specific mutated and expressed genes, from mutation, transcription and interaction data.
METHODS
We conjecture that a subtype consists of a number of
patients who have the same set of differentially expressed
genes and a set of mutated genes that hit the same
pathways.
To find both the mutations and the expression patterns of patient subtypes,
we extend our recent ranked tiling method (Le Van et al.,
2014). Ranked tiling is a data mining method proposed to
mine regions with high average rank values in a rank
matrix. In this type of matrix, each row is a complete
ranking of the columns. We find that rank matrices are a
good abstraction for numeric data and are useful to
integrate datasets that are at different scales.
To apply the ranked tiling method, we first transform the
given numeric expression matrix, where rows are
expressed genes and columns are patients, into a ranked
expression matrix. Then, we search for a region in the
transformed matrix that has high average rank scores.
However, unlike the original ranked tiling method, we impose the further constraint that the columns (patients) of the region should also share a number of mutated genes that rank highly with respect to a network model. We formalise this as a constraint optimisation problem and use a constraint solver to solve it.
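The rank transform and region scoring at the core of ranked tiling can be sketched as follows. This shows only the scoring of a given region on toy data; the search itself, done with constraint programming in our method, is omitted.

```python
# Rank transform: each row (gene) of the expression matrix becomes a
# complete ranking of the columns (patients); a candidate region is
# then scored by its average rank. Toy data, not the actual solver.
import numpy as np

def to_rank_matrix(X):
    """Rank each row's columns (1 = lowest value)."""
    return np.argsort(np.argsort(X, axis=1), axis=1) + 1

def region_score(R, rows, cols):
    """Average rank inside the region defined by row and column sets."""
    return R[np.ix_(rows, cols)].mean()

X = np.array([[0.1, 5.0, 4.0],
              [0.2, 9.0, 7.0],
              [3.0, 0.5, 0.4]])
R = to_rank_matrix(X)
high = region_score(R, rows=[0, 1], cols=[1, 2])   # highly ranked block
```

Because every row is a complete ranking, matrices measured on different scales (e.g. expression and network scores) become directly comparable.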
RESULTS & DISCUSSION
We apply our method to a TCGA breast cancer dataset and discover eight subtypes. Compared to the PAM50 annotations, our method divides the Basal subtype into three sub-groups, named S2, S3 and S6. The LumA subtype is divided into four smaller groups, namely S1, S4, S7 and S8. Finally, our method recovers the Her2 subtype in S5.
To validate the mined subtypes in the patient dimension,
we assume PAM50 annotations are true labels for them.
Then, grouping patients into subtypes can be seen as a
multi-class prediction problem, for which we can calculate
F1 score to measure the average accuracy. We also
compare our scores with state-of-the-art, including
iCluster+ (Mo, Q. et al., 2013), NBS (Hofree et al., 2013)
and SNF (Wang B. et al., 2014). The result (not shown)
illustrates that our subtypes are more homogeneous than
the ones produced by iCluster+ and NBS and are
comparable to those by SNF.
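The validation scheme just described, treating the PAM50 annotations as ground truth and scoring a subtype assignment as a multi-class prediction, can be sketched with a macro-averaged F1 score. The labels below are toy data; we do not claim this exact averaging variant is the one used in the paper.

```python
# Macro-averaged F1 over classes: per-class F1 from true/false
# positives and false negatives, then the unweighted mean.
import numpy as np

def macro_f1(y_true, y_pred):
    f1s = []
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return float(np.mean(f1s))

y_true = ["Basal", "Basal", "LumA", "LumA", "Her2"]   # PAM50 labels
y_pred = ["Basal", "LumA",  "LumA", "LumA", "Her2"]   # mined subtypes
score = macro_f1(y_true, y_pred)
```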
To validate the mined subtypes in the gene dimension, we
perform hypergeometric tests to see how their mutated genes
and expressed genes are related to cancer pathways. Figure 1 below is a heatmap showing the log10 p-values of these tests; it shows that the discovered subtypes have specific perturbed pathways.
FIGURE 1. Cancer pathway enrichment analysis using mined mutated genes and expressed genes of subtypes
REFERENCES
Hofree et al., Nat Methods 10(11), 1108–15 (2013).
Le Van et al., ECML/PKDD 2014 (2), 98–113 (2014)
Mo, Q. et al., PNAS 110(11), 4245–50 (2013)
Wang, B. et al., Nature methods, 11(3), 333–7 (2014)
Abstract ID: O9 Oral presentation
O9. DEVELOPMENT OF A DNA METHYLATION-BASED SCORE
REFLECTING TUMOUR INFILTRATING LYMPHOCYTES
Martin Bizet1,2,3*#, Jana Jeschke1#, Christine Desmedt4, Emilie Calonne1, Sarah Dedeurwaerder1, Gianluca Bontempi2,3, Matthieu Defrance1,2, Christos Sotiriou4 & Francois Fuks1.
Laboratory of Cancer Epigenetics, Faculty of Medicine, Université Libre de Bruxelles1; Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles & Vrije Universiteit Brussel2; Machine Learning Group, Computer Science Department, Université Libre de Bruxelles, Brussels3; Breast Cancer Translational Research Laboratory, Jules Bordet Institute, Université Libre de Bruxelles4. #These authors contributed equally to this work.
Tumour infiltrating lymphocytes (TIL) are increasingly recognised as one of the key features to predict outcome and therapy response in malignancies. However, measuring quantities of TIL remains challenging since it relies on subjective
and spatially-restricted measurements from a pathologist. In this study we used genome-scale DNA-methylation profiles
from breast tumours to develop a so-called MeTIL score, which reflects TIL level within whole-tumour samples. We
demonstrate, using simulated data, the robustness of the MeTIL score to noise, as well as its ability to sensitively measure TIL in patient samples and to improve prediction of outcome.
INTRODUCTION
Breast cancer (BC) is one of the most common and
deadliest diseases in women from Western countries.
Tumour infiltrating lymphocytes (TIL) emerged as one of the key features to predict outcome and response to treatment in this disease [1]. However, the measurement of TIL levels remains challenging because it relies on manual readings of a tumour slide by a pathologist, which is subjective by nature and does not necessarily reflect the whole-tumour TIL content. In this study we took
advantage of the high tissue-specificity of DNA-
methylation patterns [2] to develop a so-called MeTIL
score, which predicts the amount of lymphocytes within
the tumour.
METHODS
The MeTIL score was developed in three key steps:
1. We first used genome-scale DNA-methylation profiles from 11 cell lines (8 normal or cancerous breast epithelial lines and 3 T-lymphocyte lines) to extract 29 cytosines specifically unmethylated in T-lymphocytes (delta-beta < -0.8 and standard deviation between groups < 0.1).
2. We then applied a cross-validated pipeline, combining mRMR feature selection and a random-forest algorithm, on 118 BC samples to extract a minimal set of cytosines whose methylation level is predictive of TIL quantities.
3. Finally we used a “normalised PCA” approach to compute a single MeTIL score from the individual methylation values.
The robustness of the relation between the MeTIL score and TIL levels was also assessed using Spearman correlation computed from 10,000 simulations with varying proportions of TIL (Fig. 1B&C). The simulated data took two sources of noise into account:
- technical noise, modeled as Gaussian noise;
- perturbations due to the presence of other cell types within the tumour microenvironment that are neither lymphocytic nor epithelial, modeled by a methylation value sampled randomly from the array.
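The simulation scheme above (mixture of cell-type methylation levels plus Gaussian noise, as in Fig. 1C) can be sketched for a single marker. All numeric values and the fixed split of non-TIL cells are illustrative assumptions, not the study's actual parameters.

```python
# One simulated marker value: a proportion-weighted mixture of the
# lymphocyte level (M1), the epithelial level (M2) and a random
# other-cell-type level (M3), plus Gaussian technical noise (e).
# Levels, noise scale, and the 10% "unknown" split are illustrative.
import numpy as np

def simulate_marker(f1, M1=0.05, M2=0.9, noise_sd=0.05, seed=0):
    """f1: TIL proportion; the remainder is split between epithelial
    cells and an unknown perturbing cell type."""
    rng = np.random.default_rng(seed)
    f3 = 0.1 * (1 - f1)                  # unknown cell types
    f2 = 1 - f1 - f3                     # epithelial cells
    M3 = rng.uniform()                   # random other-cell-type level
    e = rng.normal(scale=noise_sd)       # technical noise
    return float(np.clip(f1 * M1 + f2 * M2 + f3 * M3 + e, 0.0, 1.0))

# MeTIL markers are unmethylated in T-lymphocytes, so the simulated
# value falls as the TIL fraction f1 rises.
low_til, high_til = simulate_marker(0.05), simulate_marker(0.8)
```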
Lastly, we measured TIL quantities with the MeTIL score
in three independent BC cohorts and applied Cox
regression models to evaluate the prognostic value of the
MeTIL score.
RESULTS & DISCUSSION
We first applied a hierarchical clustering analysis and
observed that BC samples with high TIL infiltration show
a hypomethylated pattern for all MeTIL markers (Fig.1A).
Furthermore we demonstrated, using simulations, a strong correlation between the MeTIL score and TIL levels, even when a high level of noise (0.7 times the standard deviation) and a high proportion of perturbing unknown cell types (70%) were included in the model (Fig. 1B).
FIGURE 1. The MeTIL score reflects TIL levels. (A) Heatmap showing the methylation values of the 5 MeTIL markers; a ‘TIL high’ group with a hypomethylated pattern (orange) appears. (B) Colour map of the Spearman correlation between MeTIL score and TIL level for increasing noise (y-axis) and abundance of unknown cell types (x-axis), based on simulations. (C) The methylation value of each MeTIL marker was simulated as the sum of the methylation levels in lymphocytes (M1), epithelial cells (M2) and other cell types (random value M3), weighted by their proportions in the tissue (f1, f2, f3), plus a Gaussian noise term (e).
Finally, we observed consistent patterns of TIL levels
within BC subtypes in independent cohorts suggesting the
robust nature of our score to evaluate TIL levels.
Furthermore, Cox regression analysis revealed a prognostic value for the MeTIL score in triple-negative and luminal BC (p-value < 0.05).
REFERENCES
[1] Loi, S., et al. Annals of Oncology 25, 1544-1550 (2014).
[2] Jeschke, J., Collignon, E., Fuks, F. FEBS J. 282(9), 1801-14 (2015).
Abstract ID: O10 Oral presentation
O10. PREDICTION OF CELL RESPONSES TO SURFACE TOPOGRAPHIES
USING MACHINE LEARNING TECHNIQUES
Aliaksei S Vasilevich1*, Shantanu Singh2, Aurélie Carlier1 & Jan de Boer1.
Laboratory for Cell Biology-inspired Tissue Engineering, MERLN Institute, Maastricht University1; Imaging Platform, Broad Institute of MIT and Harvard2. *[email protected]
Topographical cues have been repeatedly shown to influence cell fate dramatically (Bettinger et al., 2009). This phenomenon opens new opportunities to design the interaction between biomaterials and biological tissues in a predictable manner. Unfortunately, the exact mechanism of topographical control of cell behavior remains largely unknown. We have therefore developed a technology in our laboratory to determine an optimal surface topography for virtually any application in the biomedical field. Previously we reported that we can control cell shape with our surfaces in a predictable manner (Hulsman et al., 2015). Here we demonstrate that we can successfully predict not only cell shape, but also cell response at the protein level, based on the properties of our topographies. The results of our study show that we are able to design materials for biomedical applications that require a particular cell behavior.
INTRODUCTION
The TopoChip, a micro topography screening platform,
enables the assessment of cell response to 2176 unique
topographies in a single high-throughput screen. The
topographical features were randomly selected from an in
silico library of more than 150 million topographies, which were designed by an algorithm that synthesizes patterns based on simple geometric elements – circles, triangles and rectangles (Unadkat et al., 2011). In our
previous studies, we have demonstrated that these surface topographies exert a mitogenic effect on hMSCs (Unadkat et al., 2011) and affect cell shape (Hulsman et al., 2015). In this paper, we show that these topographies can also be used to modulate ALP expression in human mesenchymal stromal cells, as well as pluripotency in human induced pluripotent stem (iPS) cells. We further show that computational models can be built to predict these protein levels from surface topography parameters.
METHODS
Cell response to topography was captured by high-content
imaging. Using image analysis and data mining methods
described previously (Hulsman et.al., 2015),
multiparametric “profiles” of cellular response were
obtained. Multiple replicates of each topography were
used to estimate the median level of a cellular response of
interest – either ALP in human mesenchymal stromal cells
(hMSCs), or the median number of Oct4-positive cells in a population of human induced pluripotent stem cells (hIPSCs). We aimed to predict the cellular response based
on surface topography parameters using machine learning
methods. To train and validate these methods (specifically, classifiers), the data were split into training and testing sets in a 3:1 proportion. In the training step,
we performed a 10-fold cross-validation to obtain optimal
parameters for each classifier. The caret package (Kuhn
M., 2008) in R (R core team, 2015) was used to perform
the analysis.
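The train/validate scheme described above can be sketched with scikit-learn standing in for R's caret package (the data below are synthetic stand-ins for topography parameters and binarised responses, not the TopoChip measurements):

```python
# 3:1 train/test split, then 10-fold cross-validation on the training
# set to tune a classifier; accuracy is reported on the held-out set.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                 # "topography parameters"
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # "high/low response"

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0)      # 3:1 split
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100]},
    cv=10)                                     # 10-fold CV on train set
search.fit(X_tr, y_tr)
test_accuracy = search.score(X_te, y_te)       # held-out accuracy
```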
RESULTS & DISCUSSION
In the first project, we conducted a screening on the
TopoChip with hMSCs in order to find topographies that
would be able to increase the ALP level, a protein that is
an early marker of osteogenesis. We successfully found such surfaces and confirmed the results experimentally (publication in preparation). To move further, we assessed how accurately the ALP level in hMSCs can be predicted from topographical features. Focussing only on extreme examples, we selected 100 high- and low-scoring topographies and used the model validation scheme described in Methods to find the most accurate binary classifier for our data set. We tested several classifiers and identified random forest as the most accurate, achieving an accuracy of 96% on the held-out test set.
In a second project, we aim to find a topography that will
increase proliferation and pluripotency of hIPSCs. We
used Oct4 as a marker of pluripotency. The screening was
performed on one half of the Topochip (1000+ surfaces),
which were then ranked based on the number of Oct4
positive cells. One hundred high- and low-scoring surfaces
were chosen to train a classifier. Using logistic regression, we obtained 72% accuracy on a held-out test set. We used this model to predict surfaces that would increase pluripotency in hIPSCs among surfaces that were not included in the initial screening. Topographies were ranked according to their predicted probability score and the top 30 surfaces were chosen for experimental validation. We found that 79% of the selected surfaces were predicted accurately.
In summary, the combination of our screening methods and machine learning algorithms opens new avenues to design surfaces with desired properties for a variety of applications. Our next step will be to find a surface with maximal ALP level from our virtual library based on our screening data.
REFERENCES
Bettinger C J, Langer R & Borenstein J T. “Engineering Substrate Micro- and Nanotopography to Control Cell Function.” Angewandte Chemie (International ed. in English) 48.30 (2009).
Hulsman M et al. “Analysis of high-throughput screening reveals the effect of surface topographies on cellular morphology.” Acta Biomaterialia 15 (2015).
Kuhn M. “Building Predictive Models in R Using the caret Package.” Journal of Statistical Software 28 (2008).
R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/ (2015).
Unadkat H V et al. “An Algorithm-Based Topographical Biomaterials Library to Instruct Cell Fate.” Proceedings of the National Academy of Sciences of the United States of America 108.40 (2011).
Abstract ID: O11 Oral presentation
O11. ANALYSIS OF MASS SPECTROMETRY QUALITY CONTROL METRICS
Wout Bittremieux1, Pieter Meysman1, Lennart Martens2, Bart Goethals1, Dirk Valkenborg3 & Kris Laukens1.
Advanced Database Research and Modeling (ADReM) & Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp / Antwerp University Hospital1; Department of Biochemistry & Department of Medical Protein Research, Ghent University / VIB2; Flemish Institute for Technological Research (VITO)3.
Mass-spectrometry-based proteomics is a powerful analytical technique to identify complex protein samples; however, its results are still subject to large variability. Recently, several quality control metrics have been introduced to assess the performance of a mass spectrometry experiment. Unfortunately, these metrics are generally not sufficiently well understood. For this reason, we present a few powerful techniques to analyse multiple experiments based on quality control metrics, identify low-performance experiments, and provide an interpretation of outlying experiments.
INTRODUCTION
Mass-spectrometry-based proteomics is a powerful
analytical technique that can be used to identify complex
protein samples. Despite many technological and
computational advances, performing a mass spectrometry
experiment is still a highly complicated task and its results
are subject to large variability. To understand and evaluate how technical variability affects the results of an experiment, several quality control (QC) and performance metrics have recently been introduced. Unfortunately,
despite the availability of such QC metrics covering a
wide range of qualitative information, a systematic
approach to quality control is often still lacking.
As most quality control tools are able to generate several dozen metrics, any single experiment can be characterized by many QC metrics. Therefore, it is often not clear which metrics are most interesting in
general, or even which metrics are relevant in a specific
situation. To take into account the multidimensional data
space formed by the numerous metrics, we have applied
advanced techniques to visualize, analyze, and interpret
the QC metrics.
METHODS
Outlier detection can be used to detect deviating
experiments with a low performance or a high level of
(unexplained) variability. These outlying experiments can
subsequently be analyzed to discover the source of the
reduced performance and to enhance the quality of future
experiments.
However, it is insufficient to know that a specific
experiment is an outlier; it is also of vital importance to
know the reason. To understand why an experiment is an
outlier, we have used the subspace of QC metrics in which
the outlying experiment can be differentiated from the
other experiments. This provides crucial information on
how to interpret an outlier, which can be used by domain
experts to increase interpretability and investigate the
performance of the experiment.
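The outlier-interpretation idea above can be sketched as follows: flag the experiment whose QC metrics deviate most, then rank the metrics by how much each contributes to the deviation (the explanatory subspace). Robust z-scores stand in for the actual subspace method, and the metric names and values are toy data.

```python
# Rank QC metrics by their contribution to an experiment's deviation,
# using robust z-scores (median / MAD) as a stand-in for the actual
# explanatory-subspace method. Metric names and values are toy data.
import numpy as np

def metric_contributions(Q, metric_names):
    """Q: experiments x metrics. Returns the most deviating experiment
    and its metrics sorted by absolute robust z-score."""
    med = np.median(Q, axis=0)
    mad = np.median(np.abs(Q - med), axis=0) + 1e-9   # avoid div by 0
    z = (Q - med) / mad
    worst = np.abs(z).max(axis=1).argmax()            # outlying run
    order = np.argsort(-np.abs(z[worst]))
    return worst, [(metric_names[i], float(z[worst, i])) for i in order]

Q = np.array([[10.0, 1.0], [11.0, 1.1], [10.5, 0.9],
              [10.2, 1.0], [30.0, 1.05]])             # last run deviates
worst, ranking = metric_contributions(Q, ["rt_drift", "tic_cv"])
```

Here the first entry of `ranking` names the metric that best explains why the flagged experiment is an outlier, which is the information a domain expert would inspect.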
RESULTS & DISCUSSION
Figure 1 shows an example of interpreting a specific
experiment that has been identified as an outlier. As can
be seen, two QC metrics mainly contribute to this
experiment being an outlier. The explanatory subspace
formed by these QC metrics can be extracted, which can
then be interpreted by domain experts, resulting in insights
in relationships between various QC metrics.
FIGURE 1. QC metric importances for interpreting an outlying experiment.
Next, by combining the explanatory subspaces of all
individual outliers, it is possible to get a general view of
which QC metrics are most relevant when detecting
deviating experiments. Taking the explanatory subspaces
of all outliers into account, several of the outliers can be
distinguished by their number of identified spectra
(peptide-to-spectrum matches, PSMs). As can be seen in
Figure 2, for some specific QC metrics (highlighted in
italics) the outliers yield a notably lower number of PSMs
than the non-outlying experiments.
Because monitoring a large number of QC metrics on a
regular basis is often impractical, it is more convenient to
focus on a small number of user-friendly, well-understood,
and discriminating metrics. As the QC metrics highlighted
in Figure 2 are shown to indicate low-performance
experiments, they are prime candidates to monitor
on a continuous basis to quickly detect faulty experiments.
FIGURE 2. Comparison of the number of PSMs between the non-outlying
and the outlying experiments.
10th Benelux Bioinformatics Conference bbc 2015
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: O12 Oral presentation
O12. XILMASS: A CROSS-LINKED PEPTIDE IDENTIFICATION ALGORITHM
Şule Yılmaz1,2,3*, Masa Cernic4, Friedel Drepper5, Bettina Warscheid5, Lennart Martens1,2,3 & Elien Vandermarliere1,2,3.
Medical Biotechnology Center, VIB, Ghent, Belgium1; Department of Biochemistry, Ghent University, Ghent, Belgium2; Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium3; Department of Biochemistry, Molecular and Structural Biology, Jožef Stefan Institute, Ljubljana, Slovenia4; Functional Proteomics and Biochemistry, Department of Biochemistry and Functional Proteomics, Institute for Biology II and BIOSS Centre for Biological Signaling Studies, University of Freiburg, Freiburg, Germany5. *[email protected]
Chemical cross-linking coupled with mass spectrometry (XL-MS) facilitates the determination of protein structure and
the understanding of protein interactions. Current computational approaches rely on different strategies, with a limited
number of open-source and easy-to-use search algorithms. We therefore built a novel cross-linked peptide identification
algorithm, called Xilmass, which features a novel database construction and a new scoring function adapted from traditional
database search algorithms. We compared the performance of Xilmass against pLink, one of the most popular publicly
available algorithms, and the recently published Kojak. Xilmass identified 140 spectra, whereas Kojak and pLink
identified 119 and 35, respectively. Mapping the cross-linking sites on the structure resulted in the identification of 20
possible cross-linking sites. These findings show that Xilmass allows the identification of cross-linking sites.
INTRODUCTION
The structure of a protein is crucial for its functionality.
Protein structure is commonly determined by X-ray
crystallography or nuclear magnetic resonance (NMR). X-
ray crystallography is only feasible for crystallizable
proteins and NMR has a protein size limitation. Due to
these restrictions, protein complexes are much more
difficult to approach with these classical methods.
However, chemical cross-linking of the complex coupled
with mass spectrometry (XL-MS) allows the study of these
protein complexes. The identification of the measured
fragmentation spectra is a challenging task. One approach
to identify cross-linked peptides is to linearize cross-linked
peptide pairs in order to generate a database that can be
searched with traditional search engines (Maiolica et al., 2007).
However, a traditional search engine is not directly
applicable to identifying cross-linked peptides. Another
approach is to rely on the use of labeled cross-linkers,
but this shows decreased performance when unlabeled
cross-linkers are used. We therefore built an algorithm,
Xilmass, which is designed for the identification of XL-
MS fragmentation spectra without linearization of peptides
or the requirement of labeled cross-linkers. We also
introduced a new representation of the cross-linked
peptide database and directly implemented a new scoring
function.
METHODS
The data sets were derived from human calmodulin (CaM)
and the actin binding domain of plectin (plectin-ABD)
which were cross-linked by DSS. The data sets were
analyzed on a Velos Orbitrap Elite.
Cross-linked peptides were identified by Xilmass, pLink
(Yang et al., 2012) and Kojak (Hoopmann et al., 2015).
The identifications of both Xilmass and Kojak were
validated by Percolator (Käll et al., 2007) at q-value=0.05.
pLink returned a validated list at FDR=0.05.
The findings on cross-linking sites were validated with the
aid of the available structures (Plectin PDB-entry: 4Q57
and calmodulin PDB-entry: 2F3Y). The cross-linking sites
were predicted by X-Walk (Kahraman et al., 2011) and
PyMOL was used for the visualization.
RESULTS & DISCUSSION
We compared the number of identified spectra and cross-
linking sites from Xilmass, pLink and Kojak. Xilmass
identified 140 spectra whereas Kojak and pLink identified
119 and 35 spectra, respectively (at FDR=0.05). Xilmass
identified 53 cross-linking sites from the 140 spectra with
37 obtained from at least 2 peptide-to-spectrum matches
(PSMs). Kojak identified more cross-linking sites (60);
however, only 26 of these have at least 2 PSMs.
The cross-linking sites identified by Xilmass were
manually verified on the structure (Figure 1). We classified
sites as possible (Cα-Cα distance within 30 Å; orange) or
not predicted (Cα-Cα distance exceeding 30 Å; blue),
yielding 20 possible cross-linking sites. These findings
show that Xilmass allows the identification of cross-linking sites.
FIGURE 1. The identified cross-linking sites mapped on the plectin
protein structure for manual verification (PDB entry: 4Q57).
REFERENCES
Hoopmann, M.R. et al. Journal of Proteome Research, 14, 2190–2198 (2015).
Kahraman, A. et al. Bioinformatics, 27, 2163–2164 (2011).
Käll, L. et al. Nature Methods, 4, 923–925 (2007).
Maiolica, A. et al. Molecular & Cellular Proteomics: MCP, 6, 2200–2211 (2007).
Yang, B. et al. Nature Methods, 9, 904–906 (2012).
Abstract ID: O13 Oral presentation
O13. AUTOMATED ANATOMICAL INTERPRETATION OF DIFFERENCES
BETWEEN IMAGING MASS SPECTROMETRY EXPERIMENTS
Nico Verbeeck1*, Jeffrey Spraggins2, Yousef El Aalamat3,4, Junhai Yang2, Richard M. Caprioli2, Bart De Moor3,4, Etienne Waelkens5,6 & Raf Van de Plas1,2.
Delft Center for Systems and Control (DCSC), Delft University of Technology1; Mass Spectrometry Research Center (MSRC), Vanderbilt University2; STADIUS Center for Dynamical Systems, Signal Processing, and Data Analytics, Dept. of Electrical Engineering (ESAT), KU Leuven3; iMinds Medical IT, KU Leuven4; Dept. of Cellular and Molecular Medicine, KU Leuven5; Sybioma, KU Leuven6.
Imaging mass spectrometry (IMS) is a powerful molecular imaging technology that generates large amounts of data,
often making manual analysis practically infeasible. In this work we aid the differential analysis of multiple IMS datasets
by linking these data to an anatomical atlas. Using matrix factorization based multivariate analysis techniques, we
identify differential biomolecular signals between individual tissue samples in an obesity case study on mouse brain.
The resulting differential signals are then automatically interpreted in terms of anatomical structures using a convex
optimization approach and the Allen Mouse Brain Atlas. This automated anatomical interpretation enables much deeper
exploration of these very rich data sets by the biomedical expert.
INTRODUCTION
Imaging Mass Spectrometry (IMS) is a relatively new
molecular imaging technology that enables a user to
monitor the spatial distributions of hundreds of
biomolecules in a tissue slice simultaneously. This unique
property makes IMS an immensely valuable technology in
biomedical research. However, it also leads to very large
amounts of data in a single analysis (e.g. >1 TB), making
manual analysis of these data increasingly impractical. In
order to aid the exploration of these data, we have recently
developed a framework that integrates IMS data with an
anatomical atlas. The framework uses the anatomical data
in the atlas to automatically interpret the IMS data in terms
of anatomical structures, and guides the user towards
relevant findings within a single tissue section. In this
work, we extend this framework towards the automated
interpretation of biomolecular differences between
multiple IMS datasets.
METHODS
We demonstrate our method on IMS data of multiple
mouse brain sections, and use the Allen Mouse Brain
Atlas as the curated anatomical data source that is linked
to the MALDI-based IMS measurements. We spatially
map the data of each individual IMS dataset to the
anatomical atlas using both rigid and non-rigid registration
techniques. This establishes a common reference space
and allows for direct comparison of spatial locations
between the different IMS datasets. Group Independent
Component Analysis (GICA) is then used to automatically
extract the differentially expressed biomolecular patterns,
after which convex optimization is used to automatically
interpret the differential components in terms of known
anatomical structures (Verbeeck et al, 2014), directly
listing the anatomical areas in which changes occur.
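The interpretation step can be illustrated with a toy version of this optimization: expressing a differential spatial component as a nonnegative combination of atlas structure masks. This is only a minimal sketch; the published approach (Verbeeck et al., 2014) uses a more elaborate convex formulation, and the projected-gradient NNLS solver and toy atlas below are our own illustrative assumptions.

```python
import numpy as np

def nnls_pg(A, y, lr=0.1, n_iter=2000):
    """Nonnegative least squares via projected gradient descent:
    minimize ||A x - y||^2 subject to x >= 0 (a simple convex fit)."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = np.maximum(x - lr * A.T @ (A @ x - y), 0.0)
    return x

# Toy atlas: 3 "anatomical structures" over 6 pixels (binary masks as
# columns of A); y is a differential pattern mixing structures 0 and 2.
A = np.array([[1, 0, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 0],
              [0, 0, 1],
              [0, 0, 1]], dtype=float)
y = 2.0 * A[:, 0] + 0.5 * A[:, 2]
coef = nnls_pg(A, y)   # recovers roughly [2.0, 0.0, 0.5]
```

The nonzero coefficients directly list the anatomical structures in which the differential pattern is expressed, which is the essence of the automated interpretation.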
RESULTS & DISCUSSION
We demonstrate our approach in an obesity case study on
mouse brain. All tissue sections are cryosectioned at 10
μm and thaw-mounted onto ITO coated glass slides after
which they are sublimated with CMBT matrix. MALDI
IMS images are collected using the Bruker 15T solariX
FTICR MS with a spatial resolution of 50 μm, collecting
approximately 35,000 pixels per experiment.
The IMS data of the different experiments are registered to
the anatomical reference space provided by the Allen
Mouse Brain Atlas, establishing an inter-experiment
study-wide reference space. Analysis of the IMS
measurements using GICA reveals multiple biomolecular
patterns that differentiate between the various dietary
conditions examined by the study. The retrieved
differentially expressed biomolecular patterns are then
translated to combinations of anatomical structures using
our convex optimization approach, similar to what a
human investigator would do. This automated
interpretation of inter-experiment differences can greatly
accelerate the exploration of IMS data, as it
avoids the time- and resource-intensive step of having a
histological expert manually interpret the differential
patterns.
FIGURE 1. Automated anatomical interpretation of a biomolecular pattern that is differentially expressed in coronal mouse brain sections
between a high fat and a low fat diet in our obesity case study.
REFERENCES
Verbeeck, N. et al. Automated anatomical interpretation of ion distributions in tissue: linking imaging mass spectrometry to curated atlases. Anal. Chem. 86, 8974–8982 (2014).
Abstract ID: O14 Oral presentation
O14. ENHANCEMENT OF IMAGING MASS SPECTROMETRY DATA
THROUGH REMOVAL OF SPARSE INTENSITY VARIATIONS
Yousef El Aalamat1,2*, Xian Mao1,2, Nico Verbeeck3, Junhai Yang4, Bart De Moor1,2, Richard M. Caprioli4, Etienne Waelkens5,6 & Raf Van de Plas3,4.
Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing, and Data Analytics, KU Leuven1; iMinds Medical IT, KU Leuven2; Delft Center for Systems and Control, Delft University of Technology3; Mass Spectrometry Research Center (MSRC), Vanderbilt University4; Department of Cellular and Molecular Medicine, KU Leuven5; Sybioma, KU Leuven6.
Imaging mass spectrometry (IMS) is rapidly evolving as a label-free, spatially resolved molecular imaging tool for the
direct analysis of biological samples. However, mass spectrometry (MS) measurements are subject to different types of
noise. In IMS, one of the most abundant noise types in ion images is the presence of localized intensity spikes, also
known as sparse intensity variations, which occur on top of the biological ion distribution pattern. In this study, we
develop a method that addresses sparse intensity noise. We use low-rank approximations of the IMS data to separate
and filter sparse intensity variations from the MS signals. The efficiency of the method is tested on MS measurements
of coronal sections of mouse brain, and strong de-noising performance is demonstrated in both the spatial and the
spectral domain.
INTRODUCTION
Imaging mass spectrometry (IMS) provides unique
capabilities for biomedical and biological research.
However, its measurements tend to be subject to different
types of noise. One of the more abundant noise types in
IMS is localized intensity spikes, which can be seen as
sparse intensity variations on top of the true biological ion
patterns. This kind of noise can have a substantial impact,
particularly on low-intensity ion measurements, where the
signal-to-noise ratio (SNR) can be significantly affected.
We present a method to filter sparse intensity variations
from IMS data, and demonstrate its use to de-noise IMS
measurements both along the spatial and the spectral
domain.
METHODS
We introduce a de-noising algorithm based on low-rank
approximation, a concept from linear algebra. The method
separates sparse intensity variations from biological and
tissue sample patterns, which persist across multiple ions
and pixels. The approach decomposes the IMS data into
two parts: a structured data matrix and a sparse data
matrix. Since the noise tends to be sparse in nature, it has
a propensity to be collected into the sparse part, while the
structured part tends to capture the de-noised IMS signals,
effectively de-noising the ion images and the spectral
profiles in the process. This method allows us to
automatically filter sparse intensity variations from the
underlying tissue signal without requiring any parameter
tuning.
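A low-rank-plus-sparse split of this kind can be sketched with a simple alternating scheme: a truncated SVD for the structured part and entrywise soft-thresholding for the sparse part. Note the abstract stresses that the actual method needs no parameter tuning, whereas the sketch below does use `rank` and `lam`; it is an assumption-laden illustration of the general idea, not the published algorithm.

```python
import numpy as np

def lowrank_sparse_split(D, rank=1, lam=1.0, n_iter=50):
    """Split D into a low-rank part L (structured tissue signal) and a
    sparse part S (spike noise) by alternating a truncated SVD with
    entrywise soft-thresholding of the residual."""
    S = np.zeros_like(D)
    for _ in range(n_iter):
        # Low-rank update: best rank-`rank` fit to D - S.
        U, s, Vt = np.linalg.svd(D - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Sparse update: keep only large residuals (soft threshold).
        R = D - L
        S = np.sign(R) * np.maximum(np.abs(R) - lam, 0.0)
    return L, S

# Toy "IMS" matrix: a rank-1 tissue pattern plus two intensity spikes.
rng = np.random.default_rng(1)
clean = np.outer(rng.random(30), rng.random(20))
noisy = clean.copy()
noisy[3, 4] += 5.0
noisy[10, 7] += 4.0          # localized intensity spikes
L, S = lowrank_sparse_split(noisy, rank=1, lam=1.0)
```

The spikes end up in `S` while `L` recovers the smooth tissue pattern, mirroring the structured/sparse matrix split described above.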
RESULTS & DISCUSSION
The filtering method is demonstrated on two IMS
experiments (one lipid-focused and one protein-focused)
acquired from coronal sections of mouse brain. For the
protein experiment, the tissue section was coated with
sinapinic acid, and measurements were acquired using a
Bruker AutoFlex MALDI-TOF/TOF in positive linear
mode at a spatial resolution of 100 μm and with a mass
range extending from m/z 3000 to 22000. For the lipid
experiment, the tissue section was sublimated with 1,5-
diaminonaphthalene, and the measurements were acquired
using a Bruker AutoFlex MALDI-TOF/TOF in negative
reflectron mode at a spatial resolution of 80 μm and with a
mass range extending from m/z 400 to 1000. The case
studies demonstrate robust de-noising performance,
retrieving the underlying tissue signal efficiently and
consistently using the structured data matrix. On the
spatial side, we observe a clean-up effect in the spatial
distributions of both high- and low-intensity ions. The
effect is especially impactful for low-intensity ions,
showing a strong increase in the amount of spatial
structure that can be retrieved from low SNR
measurements and revealing patterns that would have
gone unnoticed otherwise. On the spectral side, we
observe an improved SNR after applying the method.
Thus, at the cost of computational analysis, the de-noising
method described here provides a means of increasing the
amount of information that can be extracted from an IMS
experiment, without requiring user interaction or
additional measurement.
FIGURE 1. Impact on both spatial and spectral domain. Top: example of
de-noised ion image. Bottom: plot of a spectrum before (blue) and after (red) removal of sparse intensity variations.
Abstract ID: O15 Oral presentation
O15. DETERMINANTS OF COMMUNITY STRUCTURE
IN THE PLANKTON INTERACTOME
Gipsi Lima-Mendez1,2*, Karoline Faust1,2,3, Nicolas Henry4, Johan Decelle4, Sébastien Colin4, Fabrizio Carcillo2,3,5, Simon Roux6, Gianluca Bontempi5, Matthew B. Sullivan6, Chris Bowler7, Eric Karsenti7,8, Colomban de Vargas4 & Jeroen Raes1,2.
Department of Microbiology and Immunology, Rega Institute, KU Leuven1; VIB Center for the Biology of Disease2; Laboratory of Microbiology, Vrije Universiteit Brussel, Belgium3; CNRS, UMR 7144, Station Biologique de Roscoff4; Interuniversity Institute of Bioinformatics in Brussels (IB)2, Machine Learning Group, Université Libre de Bruxelles5; Department of Ecology and Evolutionary Biology, University of Arizona, USA6; Ecole Normale Supérieure, Institut de Biologie (IBENS), France7; European Molecular Biology Laboratory8.
Identifying the abiotic and biotic factors that shape species interactions is a fundamental yet unsolved goal in ecology.
Here, we integrate organismal abundances and environmental measures from Tara Oceans to reconstruct the first global
photic-zone co-occurrence network. Environmental factors are incomplete predictors of community structure. Putative
biotic interactions are non-randomly distributed across phylogenetic groups and show both local and global patterns.
Known and novel interactions were identified among grazers, primary producers, viruses and symbionts. The high
prevalence of parasitism suggests that parasites are important regulators in the ocean food web. Together, this effort
provides a foundational resource for ocean food web research and for integrating biological components into ocean models.
INTRODUCTION
Determining the relative importance of both biotic and
abiotic processes represents a grand challenge in ecology.
Here we analyze sequence data on plankton organisms and
environmental data from the Tara Oceans project. We
applied network inference methods to construct a global-
ocean, cross-kingdom species interaction network and
disentangled the biotic and abiotic signals shaping this
interactome (Lima-Mendez, et al., 2015).
METHODS
Methods are described in detail in (Lima-Mendez, et al., 2015). Briefly:
Network inference. Taxon-taxon networks were constructed as in (Faust, et al., 2012), selecting Spearman correlation and Kullback-Leibler dissimilarity. Edges with merged multiple-test-corrected p-values below 0.05 were kept. Taxon-environment networks were computed with the same procedure and merged with the taxon-taxon networks for environmental triplet detection.
Indirect taxon edge detection. For each triplet consisting of two taxa and one environmental parameter, we computed the interaction information (II); taxon edges were considered indirect when II < 0 and within the 0.05 quantile of the random II distribution obtained by shuffling the environmental vectors.
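The triplet test can be illustrated on discretized toy data. The sign convention below (II = I(X;Y|Z) - I(X;Y), negative when the environmental variable explains the taxon-taxon link) follows the description above; the discretization and implementation details are illustrative assumptions, not the study's exact code.

```python
import numpy as np
from collections import Counter

def entropy(*cols):
    """Shannon entropy (bits) of the joint distribution of the columns."""
    counts = Counter(zip(*cols))
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log2(p)).sum())

def interaction_information(x, y, z):
    """II = I(X;Y|Z) - I(X;Y) for discretized vectors; II < 0 suggests
    the X-Y association is (partly) explained by Z, e.g. an
    environmental parameter."""
    i_xy = entropy(x) + entropy(y) - entropy(x, y)
    i_xy_z = entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)
    return i_xy_z - i_xy

# Toy triplet: two "taxa" whose abundances are both driven by z.
z = np.array([0, 0, 0, 0, 1, 1, 1, 1])
x, y = z.copy(), z.copy()
ii = interaction_information(x, y, z)   # -1.0: fully environment-driven
```

Comparing such II values to a null distribution obtained by shuffling the environmental vector then flags environmentally driven edges, as described above.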
RESULTS & DISCUSSION
Comparison of the taxon co-occurrence and environmental
profiles led to the inference of a network featuring
127,995 unique edges, of which 92,633 are taxon-taxon
edges and 35,362 are taxon-environment edges.
We identified 27,868 taxon-taxon edges that were affected
by the environment (30% of the total), of which 11,043 were
driven solely by abiotic factors and 18,869 resulted from
biotic-abiotic synergistic effects. Among environmental
factors, we found that PO4, temperature, NO2 and mixed
layer depth were frequent drivers of network connections.
In the network containing 81,590 predicted biotic
interactions (after removal of environmentally driven
edges), copresences (positive associations) outnumbered
mutual exclusions (anticorrelations; 73% versus 27%),
with most copresences derived from Syndiniales parasites
and most exclusions involving arthropods. Associations
between Bacteria and Archaea were limited to 24 mutual
exclusions. Virus-bacteria networks revealed 1,869
positive associations between viral populations and seven
of the 54 known bacterial phyla and one archaeal phylum.
The virus-host interaction data suggest that viruses are
host-range-limited across large sections of host space
(network modularity), but that specialist and generalist
phages prey on specific groups within sub-sections of this
available host space (network nestedness).
These analyses highlight the importance of top-down
effects, and specifically that of broad-range parasites such
as Syndiniales, which control the most abundant species
and ensure carbon recycling between the different
compartments of the trophic web. Finally, we show how
network-generated hypotheses guide the discovery of
symbiotic relationships (Figure 1).
Additional material is available at
http://www.raeslab.org/companion/ocean-interactome.html
FIGURE 1. Confocal microscopy confirmed the predicted interaction
between acoel flatworms (Symsagittifera sp.) and their photosynthetic
green microalgal endosymbionts (Tetraselmis sp.).
REFERENCES
Faust, K. et al. Microbial co-occurrence relationships in the human microbiome. PLoS Comput Biol 8(7), e1002606 (2012).
Lima-Mendez, G. et al. Ocean plankton. Determinants of community structure in the global plankton interactome. Science 348(6237), 1262073 (2015).
Abstract ID: O16 Oral presentation
O16. BIOINFORMATICS TOOLS FOR ACCURATE ANALYSIS OF AMPLICON
SEQUENCING DATA FOR BIODIVERSITY ANALYSIS
Mohamed Mysara1-3, Yvan Saeys4,5, Natalie Leys1, Jeroen Raes2,6 & Pieter Monsieurs1*.
Unit of Microbiology, Belgian Nuclear Research Centre SCK•CEN, Mol, Belgium1; Department of Bioscience Engineering, Vrije Universiteit Brussel VUB, Brussels, Belgium2; Department of Structural Biology, Vlaams Instituut voor Biotechnologie VIB, Brussels, Belgium3; Data Mining and Modeling Group, VIB Inflammation Research Center, Ghent, Belgium4; Department of Respiratory Medicine, Ghent University Hospital, Ghent, Belgium5; Department of Microbiology and Immunology, REGA institute, KU Leuven, Belgium6.
High-throughput sequencing technologies have created a wide range of new applications, also in the field of microbial
ecology. Yet when used in 16S rRNA biodiversity studies, these technologies suffer from two important problems: the
presence of PCR artefacts (called chimeras) and sequencing errors introduced by the sequencing technology itself. In this
work three artificial intelligence-based algorithms are proposed, CATCh, NoDe and IPED, to handle these two problems.
A benchmarking study comparing CATCh/NoDe (for 454 pyrosequencing) or CATCh/IPED (for Illumina MiSeq
sequencing) with other state-of-the-art tools shows a clear improvement in chimera detection and a reduction of
sequencing errors, respectively, in general leading to more accurate clustering of the sequencing reads into Operational
Taxonomic Units (OTUs). All algorithms are available via http://science.sckcen.be/en/Institutes/EHS/MCB/MIC/Bioinformatics/.
INTRODUCTION
The revolution in new sequencing technologies has led to
an explosion of possible applications, including new
opportunities for microbial ecological studies via 16S
rDNA amplicon sequencing. However, within such
studies, all sequencing technologies suffer from the
presence of erroneous sequences: (i) chimeras, introduced
by incorrect target amplification during PCR, and (ii)
sequencing errors originating from different factors during
the sequencing process. As such, effective algorithms are
needed to remove these erroneous sequences in order to
accurately assess microbial diversity.
METHODS
First, a new algorithm called CATCh (Combining
Algorithms to Track Chimeras) was developed by
integrating the output of existing chimera detection tools
into a new more powerful method. Second, NoDe (Noise
Detector) was introduced, an algorithm that identifies and
corrects erroneous positions in 454-pyrosequencing reads.
Third, IPED (Illumina Paired End Denoiser) was
developed as the first tool in the field to handle error
correction in Illumina MiSeq sequencing data. After
identifying the positions likely to contain an error, the
affected sequencing reads are clustered with correct reads,
resulting in error-free consensus reads. The three
algorithms were benchmarked against state-of-the-art tools.
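The idea behind CATCh, combining the outputs of several chimera detection tools into a stronger classifier, can be sketched as a simple logistic-regression stacker over per-tool scores. CATCh's actual feature set and classifier differ from this toy, so everything below (data, solver, names) is an illustrative assumption.

```python
import numpy as np

def train_stacker(scores, labels, lr=0.5, n_iter=2000):
    """Logistic-regression stacker over per-tool chimera scores.
    `scores`: (n_reads, n_tools) matrix of tool outputs;
    `labels`: 1 = chimera, 0 = genuine read."""
    X = np.hstack([scores, np.ones((scores.shape[0], 1))])  # add bias
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - labels) / len(labels)   # gradient step
    return w

def predict(scores, w):
    X = np.hstack([scores, np.ones((scores.shape[0], 1))])
    return (1.0 / (1.0 + np.exp(-X @ w)) > 0.5).astype(int)

# Toy data: tool 1's score is informative, tool 2's is pure noise.
rng = np.random.default_rng(2)
labels = rng.integers(0, 2, 200)
scores = np.column_stack([labels + 0.3 * rng.normal(size=200),
                          rng.normal(size=200)])
w = train_stacker(scores, labels)
acc = (predict(scores, w) == labels).mean()
```

The stacker learns to weight the informative tool heavily and ignore the noisy one, which is the mechanism by which an ensemble can outperform its individual constituent tools.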
RESULTS & DISCUSSION
Via a comparative study with other chimera detection
tools, CATCh was shown to outperform all other tools,
increasing the sensitivity by up to 14% (see Figure 1).
FIGURE 1. Effect of applying 5% indels (left) and 5% mismatches
(right) on the performance of different chimera detection tools. CATCh
was found to outperform the other existing tools.
Similarly, NoDe and IPED were benchmarked against
other denoising algorithms, showing a reduction of the
error rate by up to 55% and 75%, respectively (see Figure
2). The combined effect of our algorithms for chimera
removal and error correction also had a positive effect on
the clustering of reads into operational taxonomic units
(OTUs), with an almost perfect correlation between the
number of OTUs and the number of species present in the
mock communities. Indeed, when applying our improved
pipeline containing CATCh and NoDe to a 454
pyrosequencing mock dataset, the number of OTUs was
reduced to 28 (i.e. close to 18, the correct number of
species). In contrast, running the straightforward pipeline
without our algorithms inflated the number of OTUs to 98.
Similarly, when tested on Illumina MiSeq sequencing data
obtained for a mock community, a pipeline integrating
CATCh and IPED returned 33 OTUs (i.e. close to the real
number of 21 species), while 86 OTUs were obtained
using the default mothur pipeline.
REFERENCES
Mysara, M., Leys, N., Raes, J. & Monsieurs, P. NoDe: a fast error-correction algorithm for pyrosequencing amplicon reads. BMC Bioinformatics 16:88, 1-15 (2015). ISSN 1471-2105.
Mysara, M., Saeys, Y., Leys, N., Raes, J. & Monsieurs, P. CATCh, an Ensemble Classifier for Chimera Detection in 16S rRNA Sequencing Studies. Applied and Environmental Microbiology 81(5), 1573-1584 (2015). ISSN 0099-2240.
Abstract ID: O17 Oral presentation
O17. GENE CO-EXPRESSION ANALYSIS IDENTIFIES BRAIN REGIONS AND
CELL TYPES INVOLVED IN MIGRAINE PATHOPHYSIOLOGY: A GWAS-
BASED STUDY USING THE ALLEN HUMAN BRAIN ATLAS
Sjoerd M.H. Huisman1,2*, Else Eising3, Ahmed Mahfouz1,2, Lisanne Vijfhuizen3, International Headache Genetics Consortium, Boudewijn P.F. Lelieveldt2, Arn M.J.M. van den Maagdenberg3,4 & Marcel J.T. Reinders1.
DBL, Dept. of Intelligent Systems, Delft University of Technology, The Netherlands1; LKEB, Dept. of Radiology, Leiden University Medical Center, The Netherlands2; Dept. of Human Genetics, Leiden University Medical Center, The Netherlands3; Dept. of Neurology, Leiden University Medical Center, The Netherlands4.
Migraine is a common brain disorder with a heritability of around 50%. To understand the genetic component of this
disease, a large genome-wide association study has been carried out. Several loci were identified, but their interpretation
remained challenging. We integrated the GWAS results with gene expression data from healthy human brains to
identify anatomical regions and biological pathways implicated in migraine pathophysiology.
INTRODUCTION
Genome Wide Association Studies (GWAS) are
frequently used to find common variants with small effect
sizes. However, they often provide researchers with short
lists of single nucleotide polymorphisms (SNPs) with
uncertain connections to biological functions.
We present an analysis of GWAS data for migraine in
which the full list of SNP statistics is used to find groups
of functionally related migraine-associated genes. To this
end, we make use of gene co-expression in the healthy
human brain.
We performed genome wide clustering of genes, followed
by enrichment analysis for migraine candidate genes. In
addition, we constructed local co-expression networks
around high-confidence genes. Both approaches converge
on distinct biological functions and brain regions of
interest.
METHODS
Migraine GWAS data was obtained from the International
Headache Genetics Consortium, with 23,285 cases and
95,425 controls (Anttila et al., 2013). Genes were scored
by SNP load and divided into high-confidence genes,
migraine candidate genes, and non-migraine genes.
Spatial gene expression data in the healthy adult human
brain was obtained from the Allen Brain Institute
(Hawrylycz et al., 2012). It contains microarray
expression values of 3702 samples from 6 donors. Robust
gene co-expressions were used to cluster genes into 18
modules, which were then tested for enrichment of
migraine candidate genes, and functionally characterized.
In a second approach, local co-expression networks were
built around the high-confidence migraine genes. These
local networks were then compared to the modules of the
first approach.
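Testing modules for enrichment of candidate genes is commonly done with a hypergeometric test; the abstract does not state the exact test used, so the following is a generic sketch, and the counts below are invented for illustration rather than the study's values.

```python
from math import comb

def module_enrichment_p(n_genes, n_candidates, module_size, overlap):
    """Hypergeometric tail P(X >= overlap): the chance of seeing at
    least `overlap` candidate genes in a module of `module_size`
    genes, given `n_candidates` candidates among `n_genes` total."""
    total = comb(n_genes, module_size)
    upper = min(n_candidates, module_size)
    return sum(comb(n_candidates, k)
               * comb(n_genes - n_candidates, module_size - k)
               for k in range(overlap, upper + 1)) / total

# Illustrative numbers only: 25 candidates in a 500-gene module, where
# about 7.5 would be expected by chance.
p = module_enrichment_p(n_genes=20000, n_candidates=300,
                        module_size=500, overlap=25)
```

A small p-value flags the module as enriched in migraine candidate genes; in practice such p-values would also be corrected for the number of modules tested.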
RESULTS & DISCUSSION
The genome wide analysis revealed several modules of
genes enriched in migraine candidates. Two modules have
preferential expression in the cerebral cortex and are
enriched in synapse related annotations and neuron
specific genes. A third module contains oligodendrocytes
and genes preferentially expressed in subcortical regions.
The local co-expression networks of the second approach
converge on the same pathways and expression patterns,
even though the high-confidence genes lie mostly outside
of the modules of interest. This provides a control for the
results of the first approach.
FIGURE 1. The co-expression network around high confidence migraine genes of the second approach. Genes (and links between them) of the
migraine modules of the first approach are coloured in red, yellow, blue,
and green.
The analyses confirm the previously observed link
between migraine and cortical neurotransmission. They
also point to the involvement of subcortical myelination,
which is in line with recent tentative findings. These
results show that more relevant information can be
extracted from GWAS results using (publicly available)
tissue-specific expression patterns.
REFERENCES
Anttila V. et al. Genome-wide meta-analysis identifies new susceptibility loci for migraine. Nat. Genet. 45, 912–917 (2013).
Hawrylycz M.J. et al. An anatomically comprehensive atlas of the adult human brain transcriptome. Nature 489, 391–399 (2012).
10th Benelux Bioinformatics Conference bbc 2015
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: O18 Oral presentation
O18. SPATIAL CO-EXPRESSION ANALYSIS OF STEROID RECEPTORS IN
THE MOUSE BRAIN IDENTIFIES REGION-SPECIFIC REGULATION
MECHANISMS
Ahmed Mahfouz1,2*, Boudewijn P.F. Lelieveldt1,2, Aldo Grefhorst3, Isabel M. Mol4, Hetty C.M. Sips4, José K. van den Heuvel4, Jenny A. Visser3, Marcel J.T. Reinders2 & Onno C. Meijer4.
Department of Radiology, Leiden University Medical Center1; Delft Bioinformatics Lab, Delft University of Technology2; Department of Internal Medicine, Erasmus University Medical Center3; Department of Internal Medicine, Leiden University Medical Center4.
Steroid hormones coordinate the activity of many brain regions by binding to nuclear receptors that act as transcription
factors. This study uses genome wide correlation of gene expression in the mouse brain to discover 1) brain regions that
respond in a similar manner to particular steroids, 2) signaling pathways that are used in a steroid receptor and brain
region-specific manner, and 3) potential target genes and relationships between groups of target genes. The data
constitute a rich repository for the research community to support new insights in neuroendocrine relationships, and to
develop novel ways to manipulate brain activity in research or clinical settings.
INTRODUCTION
Steroid receptors are pleiotropic transcription factors that
coordinate adaptation to different physiological states. An
important target organ is the brain, but its complexity
hampers the understanding of their modulation.
METHODS
We used the Allen Brain Atlas (ABA) (Lein et al., 2007),
the most comprehensive repository of in situ
hybridization-based gene expression in the adult mouse
brain, to identify genes that have three dimensional (3D)
spatial gene expression profiles similar to steroid receptors.
To validate the functional relevance of this approach, we
analyzed the co-expression relationship of the
glucocorticoid receptor (Gr) and estrogen receptor alpha
(Esr1) and their known transcriptional targets in their
brain regions of action. Next, we studied the region-
specific co-expression of nuclear receptors and their co-
regulators to identify potential partners mediating the
hormonal effects on dopaminergic transmission. Finally,
to illustrate the potential of using spatial co-expression to
predict region-specific steroid receptor targets in the brain,
we identified and validated genes that responded to
changes in estrogen in the arcuate nucleus and medial
preoptic area of the mouse hypothalamus.
RESULTS & DISCUSSION
For each steroid receptor, we ranked genes based on their
spatial co-expression across the whole brain as well as in
each of 12 major brain structures separately.
For each steroid receptor, strongly co-expressed genes
within a brain region are likely related to the localized
functional role of the receptor. For example, out of the top
10 genes co-expressed with Esr1 across the whole brain, 4
were previously shown to be regulated by Esr1 and/or
estrogens in various tissues (Gpr101, Calcr, Ngb, and
Gpx3).
We assessed the extent of co-expression of glucocorticoid
(GC)-responsive genes (Datson et al., 2012) with Gr in the
whole brain, the hippocampus and its substructures the
dentate gyrus (DG) and the different subregions of the
cornu ammonis (CA). GC-responsive genes were
significantly co-expressed with Gr in the DG, but
interestingly also in the whole brain and in the CA3 region
(FDR-corrected p < 1.8×10⁻³; Mann-Whitney U test).
Similarly, a Mann-Whitney U test showed that a set of 15
genes that are sensitive to gonadal steroids (Xu et al.,
2012) is significantly correlated to Esr1 across the whole
brain (FDR-corrected p = 8.69×10⁻¹⁴), as well as in the
hypothalamus (p = 3.85×10⁻¹⁰), the brain region
responsible for sexual behavior in animals.
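A gene-set test of this kind can be sketched with SciPy; the correlation values below are synthetic stand-ins for the real co-expression measurements:

```python
# Mann-Whitney U test of whether a gene set is more strongly co-expressed
# with a receptor than background genes. Correlation values are synthetic.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
background = rng.normal(0.0, 0.2, size=1000)  # co-expression of all genes with the receptor
gene_set = rng.normal(0.4, 0.2, size=15)      # a 15-gene hormone-sensitive set

# one-sided test: is the gene set shifted toward higher co-expression?
stat, p = mannwhitneyu(gene_set, background, alternative="greater")
print(f"U = {stat:.0f}, p = {p:.2e}")
```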
In order to identify putative region-dependent co-regulators
of steroid receptors, we analyzed the co-expression
relationships of each steroid receptor and a set of 62
nuclear receptor co-regulators as present on a peptide array
(Nwachukwu et al., 2014). We focused our analysis on
well-established target regions of steroid hormone action,
the dopaminergic brain regions (ventral tegmental area,
VTA, and substantia nigra, SN). We found three
co-regulators significantly co-expressed with the androgen
receptor (Ar): Pnrc2, Pak6 and Trerf1, suggesting that
these co-regulators may be involved in mediating Ar
effects on dopaminergic transmission.
In order to validate the predictive value of highly correlated
expression with a steroid receptor, we analyzed the
response of the top 10 genes that are strongly co-expressed
with Esr1 in the hypothalamus to the estrogen
diethylstilbestrol (DES) in castrated male mice using
qPCR. We performed quantitative double in situ
hybridization (dISH) for Esr1 and the six mRNAs (Irs4,
Magel2, Adck4, Unc5, Ngb, and Gdpd2) that showed more
than 1.3-fold enrichment in qPCR. We found that Irs4 and
Magel2 mRNA were both significantly upregulated by
DES treatment (1.9- and 2.4-fold, respectively).
REFERENCES
Lein E. et al. Nature 445, 168–176 (2007).
Datson N. et al. Hippocampus 22, 359–371 (2012).
Xu X. et al. Cell 3, 596–607 (2012).
Nwachukwu J. et al. eLife 3, e02057 (2014).
Abstract ID: O19 Oral presentation
O19. A SYSTEMS BIOLOGY COMPENDIUM FOR LEISHMANIA DONOVANI
Bart Cuypers1,2,3*, Pieter Meysman1,2, Manu Vanaerschot3, Maya Berg3, Malgorzata Domagalska3, Jean-Claude Dujardin3,4# & Kris Laukens1,2#.
Advanced Database Research and Modeling (ADReM), University of Antwerp1; Biomedical informatics research center Antwerpen (biomina)2; Molecular Parasitology Unit, Department of Biomedical Sciences, Institute of Tropical Medicine, Antwerp3; Department of Biomedical Sciences, University of Antwerp4.
#shared senior authors
Leishmania donovani is the cause of visceral leishmaniasis in the Indian subcontinent and poses a threat to public health
due to increasing drug resistance. Little is known about its very peculiar molecular biology, and there has been little
‘omics integration effort so far. Here we present an integrative database, or ‘omics compendium, that contains all
genomics, transcriptomics, proteomics and metabolomics experiments that are currently publicly available for
Leishmania donovani. Additionally, the user interface contains analysis tools for new datasets that use data mining
strategies such as frequent itemset mining to link results from different ‘omics layers.
INTRODUCTION
The protozoan parasite Leishmania donovani causes
visceral leishmaniasis (VL), a life threatening disease
which affects 500 000 people each year. With only four
drugs available and rapidly emerging drug resistance,
knowledge about the parasite’s resistance mechanisms is
essential to boost the development of new drugs. However,
little is known about gene regulation in Leishmania, and
the few findings indicate major differences from known
gene expression systems. Indeed, no polymerase II
promoters have ever been found in
Leishmania1. Genes are constitutively transcribed in large
polycistronic units and subsequently spliced into
individual mRNAs (trans-splicing)1. A modified thymine,
Base J, marks the end of transcription units and functions
as a stop signal for the RNA polymerase2. Gene
expression is then assumed to be regulated at the post-
transcriptional level (mRNA stability, translation
efficiency, epigenetic factors, etc.), but evidence to
support this is scarce1. Integration of different ‘omics
could shed light on these gene regulatory mechanisms, but
there has been little integration effort so far.
METHODS
We developed an easy-to-use tool, able to import and
connect all existing L. donovani ‘omics experiments.
Genomics, epigenomics, transcriptomics, proteomics,
metabolomics and phenotypic data was collected and
added to a MySQL database compendium, further
complemented with publicly available data. Relations
between different ‘omics layers were explicitly defined
and provided with a level of confidence. Python scripts
were developed to preprocess, analyse and import the data.
To allow comparability between different experiments,
platforms and labs, the three integration principles of the
COLOMBOS bacterial expression compendium were
adapted3. 1) Use the same data-analysis pipeline for all
data. 2) Work with contrasts to a control condition instead
of expression values. 3) Annotate these contrasts in a
unified and structured manner.
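Principle 2, working with contrasts to a control condition rather than raw expression values, can be illustrated with a minimal sketch; the gene names and expression values are hypothetical:

```python
# Sketch of integration principle 2: store log-ratio contrasts against a
# control condition instead of raw expression values (data hypothetical).
import math

control = {"geneA": 120.0, "geneB": 45.0}
treatment = {"geneA": 480.0, "geneB": 11.25}

# log2 ratio of treatment over control makes experiments comparable
contrasts = {g: math.log2(treatment[g] / control[g]) for g in control}
print(contrasts)  # {'geneA': 2.0, 'geneB': -2.0}
```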
In addition to this vast data source, a set of integrative
data-analysis tools was developed based on data mining
strategies. For example, one tool uses frequent itemset
mining algorithms to detect which proteins and
metabolites frequently exhibit the same behaviour under
different conditions. Another tool converts several ‘omics
layers to a network format that can be opened in
Cytoscape and can thus serve as the basis for network analysis.
The Django and Twitter Bootstrap frameworks were used
to create a web portal to make the tools accessible to any
Leishmania researcher.
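The frequent itemset idea behind the first tool can be caricatured in a few lines; each "transaction" is one experimental condition, and the entities and support threshold below are hypothetical, not the compendium's actual data:

```python
# Toy frequent itemset mining: find pairs of entities (proteins/metabolites)
# that are upregulated together in many conditions. Data are hypothetical.
from itertools import combinations
from collections import Counter

# Each transaction: the set of entities upregulated in one condition
conditions = [
    {"protA", "protB", "metX"},
    {"protA", "protB", "metX", "metY"},
    {"protA", "protB"},
    {"protB", "metY"},
]

min_support = 3  # a pair must recur in at least 3 conditions
pair_counts = Counter()
for cond in conditions:
    for pair in combinations(sorted(cond), 2):
        pair_counts[pair] += 1

frequent_pairs = {p for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)  # {('protA', 'protB')}
```

Real miners (Apriori, FP-growth) prune the search space instead of enumerating all pairs, but the support-counting principle is the same.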
RESULTS & DISCUSSION
Excellent public gene, protein, metabolite annotation
databases for Leishmania and related species are already
available (e.g. TriTrypDB and GeneDB). However, the
strength of our tool is that it links these annotation data to
‘omics experiments that are either provided by the user, or
that are publicly available. New experiments can quickly
be preprocessed, analysed and integrated in the database
via its Python back end. The compendium is therefore not
only a look-up tool (e.g. under which conditions is this
gene or metabolite upregulated?), but has tools available
to also analyse the user-provided data with intelligent data
mining tools (e.g. which metabolites/genes are typically
upregulated in drug-resistant strains?). These new
experiments provide additional confidence and
information about the biological entities in the database.
Unlike many other databases, the compendium has an
elaborate quality control system. Every result provided by
the tools can be traced back to the experimental data,
which contains the necessary quality control plots to
support the experiment’s validity. Additionally, it contains
all relevant information about the extractions and the
origin of the biological material.
Using the compendium and its tools, we characterized the
development and drug resistance of Leishmania donovani
in a systems biology context. The genomes of more
than 200 strains were examined for associations with
phenotypical features and a subset was linked to
transcriptomics, proteomics and metabolomics results. The
compendium and its scripts were designed to be generic
and can therefore be used for other organisms with only
minor changes.
REFERENCES
1. Donelson, J. (1999) PNAS 96, 2579–258.
2. Van Luenen, H. G. A. M. et al. (2012) Cell 150, 909–921.
3. Meysman et al. (2014) Nucleic Acids Research 42, D649–D653.
Abstract ID: O20 Oral presentation
O20. MULTI-OMICS INTEGRATION: RIBOSOME PROFILING
APPLICATIONS
Volodimir Olexiouk1, Elvis Ndah1, Sandra Steyaert1, Steven Verbruggen1, Eline De Schutter1, Alexander Koch1, Daria Gawron2, Wim Van Criekinge1, Petra Van Damme2, Gerben Menschaert1,*.
Lab of Bioinformatics and Computational Genomics (BioBix), Department of Mathematical Modelling, Statistics and Bioinformatics, Faculty of Bioscience Engineering, Ghent University1; Dept. Medical Protein Research, VIB-Ghent University2.
Ribosome profiling is a relatively new NGS technology that enables the monitoring of the in vivo synthesis of mRNA-
encoded translation products measured at the genome-wide level. The technique, also sometimes referred to as RIBOseq,
uses the property of translating ribosomes to protect mRNA fragments from nuclease digestion, making it possible to
determine the genomic positions of translating ribosomes with sub-codon to single-nucleotide precision. Since the advent of the
technology, several bioinformatics solutions have been devised to investigate this type of data. Here we will present
several solutions to detect novel proteoforms by combining RIBOseq and mass spectrometry data, to detect putatively
coding small open reading frames (sORFs), and to evaluate the impact of DNA and RNA methylation on the translation
level.
INTRODUCTION
Integration of different ‘omics technologies is routinely
adopted to investigate biological systems. Our lab focuses
on high-throughput data analysis and the development of
novel data integration methodologies. Currently our focus
goes to ribosome profiling (Ingolia et al., 2011), an NGS
based technique to measure the so-called translatome (i.e.
the mRNA that shows ribosome occupancy). This
technique is applied in combination with other sequencing
based protocols to measure expression (RNAseq),
translation (mass spectrometry) and to chart maps of
regulatory elements such as DNA methylation (reduced
representation bisulfite sequencing, RRBS) and RNA
methylation (m6Aseq) to address several biological
questions.
METHODS
For the integration of RIBOseq and mass spectrometry
(MS), we devised a tool called PROTEOFORMER
(www.biobix.be/proteoformer). This proteogenomics tool
consists of several steps. It starts with the mapping of
ribosome-protected fragments (RPFs) and quality control
of subsequent alignments. It further includes modules for
identification of transcripts undergoing protein synthesis,
positions of translation initiation with sub-codon
specificity and single nucleotide polymorphisms (SNPs).
We used PROTEOFORMER to create protein sequence
search databases from publicly available mouse and in-
house performed human RIBOseq experiments and
evaluated these with matching proteomics data (Crappé et
al., 2015).
Another pipeline based on RIBOseq data is built around
the discovery of putatively coding small open reading
frames (sORFs). Herein, the first step is to delineate
sORFs based on RPF coverage throughout the coding
sequence and at the translation initiation site. Afterwards,
state-of-the-art tools and metrics assessing the coding
potential of sORFs are implemented and a list of candidate
sORFs for downstream analysis is compiled (e.g. MS-
based identification).
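The first delineation step can be caricatured as a coverage filter; the threshold and per-codon counts below are hypothetical, not the pipeline's actual criteria:

```python
# Sketch: keep candidate sORFs whose codons are sufficiently covered by
# ribosome-protected fragments (RPFs). Data and threshold are hypothetical.
def covered_fraction(rpf_counts):
    """Fraction of codon positions with at least one RPF read."""
    return sum(1 for c in rpf_counts if c > 0) / len(rpf_counts)

candidates = {
    "sORF_1": [5, 3, 0, 7, 2, 4],  # per-codon RPF counts
    "sORF_2": [1, 0, 0, 0, 0, 0],
}
kept = [s for s, counts in candidates.items() if covered_fraction(counts) >= 0.5]
print(kept)  # ['sORF_1']
```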
To assess the impact of DNA-methylation at the
translation level a double knockout DNMT model was
studied (WT and DNMT1 + 3B knockout HCT116 cell
line). Genome-wide DNA methylation profiling was
performed using RRBS, while ribosome profiling,
quantitative shotgun and positional proteomics (N-
terminal COFRADIC) were used to obtain protein
expression data.
An initial experiment to integrate m6Aseq (measuring the
m6A epitranscriptome) and ribosome profiling has also
been executed on HCT116 cells.
RESULTS & DISCUSSION
The RIBOseq-MS integration (through
PROTEOFORMER) increases the overall protein
identification rates by 3% and 11% (improved and new
identifications) for human and mouse, respectively, and
enables proteome-wide detection of 5’-extended
proteoforms, upstream ORF (uORF) translation and near-
cognate translation start sites. The PROTEOFORMER
tool is available as a stand-alone pipeline and has been
implemented in the Galaxy framework for ease of use.
The sORF pipeline was tested and curated on three
different cell-lines (HCT116: human, E14 mESC: mouse,
and S2: fruitfly). The public repository has been made
available at www.sorfs.org (Olexiouk V. et al., in review),
and so far includes the datasets mentioned above.
In the study for the effect of DNA methylation at the
proteome level in the DNMT double knock-out we found
that the knockout cells show more significantly up-
regulated than down-regulated genes and that these up-
regulated genes were characterized by higher levels of
promoter methylation in the wild type cells. Both the MS
and RIBOseq analyses corroborated these findings.
Preliminary results based on the m6A sequencing confirm
previous findings on known m6A sequence motifs and
enrichment of m6A sites in specific functional regions
(around translation start sites and in 3’UTR regions);
moreover, after integrating m6A- and RIBOseq data, some
examples hint at an effect of m6A on ribosomal pausing.
REFERENCES
Ingolia N. et al. Cell 147(4), 789–802 (2011).
Crappé J., Ndah E. et al. NAR 43(5), e29 (2015).
Abstract ID: O21 Oral presentation
O21. CLUB-MARTINI: SELECTING FAVORABLE INTERACTIONS AMONGST AVAILABLE CANDIDATES: A COARSE-GRAINED SIMULATION APPROACH TO SCORING DOCKING DECOYS
Qingzhen Hou1*, Kamil K. Belau2, Marc F. Lensink3, Jaap Heringa1 & K. Anton Feenstra1*.
Center for Integrative Bioinformatics VU (IBIVU), VU University Amsterdam, De Boelelaan 1081A, 1081 HV Amsterdam, The Netherlands1; Intercollegiate Faculty of Biotechnology, University of Gdańsk - Medical University of Gdańsk, Kładki 24, 80-822 Gdańsk, Poland2; Institute for Structural and Functional Glycobiology (UGSF), CNRS UMR8576, FRABio FR3688, University Lille, 59000 Lille, France3.
Protein-protein Interactions (PPIs) play a central role in all cellular processes. Large-scale identification of native binding
orientations is essential to understand the role of particular protein-protein interactions in their biological context. We
estimate the binding free energy using coarse-grained simulations with the MARTINI forcefield, and use those to rank
decoys for 15 CAPRI benchmark targets. In our top 100 and top 10 ranked structures, for the 'easier' targets that have
many near-native conformations, we obtain a strong enrichment of acceptable or better quality structures; for the 'hard'
targets with very few near-native complexes in the decoys, our method is still able to retain structures which have native
interface contacts. Moreover, CLUB-MARTINI is rather precise for some targets and able to pinpoint near-native
binding modes in top 1, 5, 10 and 20 selections.
INTRODUCTION
Measuring binding free energy is essential to understand the
relevance of particular protein-protein interactions in their
biological context. Moreover, at the atomic scale, molecular
simulations give us insight into the physically realistic details
of these interactions. In our recent study, we successfully
applied coarse-grained molecular dynamics simulations to
estimate binding free energies with accuracy similar to full
atomistic simulations, but 500-fold less time consuming
(May et al., 2014). The approach relied on the availability of
crystal structures of the protein complex of interest. Here, we
investigate the effectiveness of this approach as a scoring
method to identify stable binding conformations out of
docking decoys from protein docking.
We apply our method to rank more than 19 000 docked
protein conformations, or ‘decoys’, for
15 benchmark targets from the Critical Assessment of
PRedicted Interactions (CAPRI) (Lensink & Wodak, 2014).
METHODS
For each target, the binding free energy of all decoys was
calculated, using the MARTINI forcefield as introduced
before (May et al., 2014). In short, for a set of closely spaced
separation distances, we calculate the constraint force applied
to maintain the set distance. Integrating this force yields a
potential of mean force (PMF), from which the binding free
energy is extracted as the highest minus the lowest value.
Previously, for accuracy, we used up to 20 replicate
simulations for each distance in the PMF, but for efficiency,
here we use only a single replicate initially. We then selected
the lowest-scoring half to run an additional four replicates to
obtain better sampling and more accurate estimates of the
binding free energy. In total, we used approximately 800 000
core-hours of compute time.
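The PMF construction described above can be sketched numerically: integrate the mean constraint force over the separation coordinate, then take the highest minus the lowest PMF value. The force profile below is a synthetic stand-in for the MARTINI simulation output:

```python
# Numerical sketch of the PMF-based binding free energy estimate.
# The force profile is synthetic, standing in for mean constraint forces.
import numpy as np

distance = np.linspace(0.5, 2.5, 41)  # nm, closely spaced separations
force = -8.0 * (distance - 1.0) * np.exp(-((distance - 1.0) ** 2))  # kJ/mol/nm

# PMF = minus the integral of the mean force along the separation
# coordinate (trapezoid rule), anchored at zero at the smallest separation
increments = 0.5 * (force[1:] + force[:-1]) * np.diff(distance)
pmf = -np.concatenate(([0.0], np.cumsum(increments)))

binding_free_energy = pmf.max() - pmf.min()  # kJ/mol
print(f"estimated binding free energy: {binding_free_energy:.2f} kJ/mol")
```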
RESULTS & DISCUSSION
We obtained strong enrichment of acceptable and high
quality structures in the TOP 100 based on our PMF free
energies, as shown in Figure 1. We estimate the error of our
energies to be significant. This can be improved by increased
sampling, but that remains very expensive.
Moreover, for several targets, we can select near-native
structures in top 1, top 5 and top 10 as shown in Table 1,
which means that, overall, our method is rather precise. From
estimates of the error, we expect we can improve accuracy by
extending the amount of sampling done at each distance. In
conclusion, our approach can find favorable interactions from
available candidates produced by docking programs. To the
best of our knowledge, this is the first time interaction free
energy from a coarse-grained force field is used as a scoring
method to rank docking solutions at a large scale.
FIG. 1. Enrichment in percentage of acceptable or better structures. For each
of the 13 targets with acceptable or better decoys, the two columns (from left
to right) stand for the CAPRI Score_set and the top 100 in our ranking by
calculated binding free energy. Red, orange and yellow represent the fractions
of high, medium and acceptable quality structures over the number of all or
selected docking decoys. The order (left to right) is based on the fraction of
acceptable structures in each target (easy to difficult).
TABLE 1. Successful selections of top-ranked structures.

Selection  Target  High  Medium  Acceptable  Total (%)
TOP 1      T47     1     0       0           100
           T53     0     0       1           100
TOP 5      T47     3     2       0           100
           T41     0     0       4           80
           T53     0     0       3           60
           T37     0     2       0           40
TOP 10     T47     7     3       0           100
           T41     0     1       7           80
           T53     0     1       5           60
           T37     0     3       0           30
           T50     0     0       1           10
TOP 20     T47     14    6       0           100
           T41     0     4       13          85
           T53     0     3       9           60
           T37     0     4       2           30
           T50     0     0       3           15
           T40     1     2       0           15
           T46     0     0       1           5
REFERENCES
May, Pool, Van Dijk, Bijlard, Abeln, Heringa & Feenstra. Coarse-grained versus atomistic simulations: realistic interaction free energies for real proteins. Bioinformatics 30, 326–334 (2014).
Lensink & Wodak. Score_set: A CAPRI benchmark for scoring protein complexes. Proteins 82, 3163–3169 (2014).
Abstract ID: O22 Oral presentation
O22. PEPSHELL: VISUALIZATION OF CONFORMATIONAL PROTEOMICS
DATA
Elien Vandermarliere1,2*, Davy Maddelein1,2, Niels Hulstaert1,2, Elisabeth Stes1,2, Michela Di Michele1,2, Kris Gevaert1,2, Edgar Jacoby3, Dirk Brehmer3 & Lennart Martens1,2.
Department of Medical Protein Research, VIB1; Department of Biochemistry, Ghent University2; Oncology Discovery, Janssen Research and Development – Janssen Pharmaceutica, Beerse3.
Proteins are dynamic molecules; they undergo crucial conformational changes induced by post-translational
modifications and by binding of cofactors or other molecules. The characterization of these conformational changes and
their relation to protein function is a central goal of structural biology. Unfortunately, most conventional methods to
obtain structural information do not provide information on protein dynamics. Therefore, mass spectrometry-based
approaches, such as limited proteolysis, hydrogen-deuterium exchange, and stable-isotope labelling, are frequently used
to characterize protein conformation and dynamics, yet the interpretation of these data can be cumbersome and time
consuming. Here, we present PepShell, a tool that allows interactive data analysis of mass spectrometry-based
conformational proteomics studies by visualization of the identified peptides both at the sequence and structure levels.
Moreover, PepShell allows the comparison of experiments under different conditions, such as different proteolysis times
or binding of the protein to different substrates or inhibitors.
INTRODUCTION
The study of protein structure with mass spectrometry,
called conformational proteomics, is frequently used to
characterize protein conformations and dynamics. Most of
these methods exploit the surface accessibility of amino
acids within the native protein conformation or more
specifically, the differences in protein surface accessibility
in different situations within a protein structure.
The experimental setup and subsequent workflow of a
conformational proteomics experiment do not deviate
drastically from that of a classic mass spectrometry-based
experiment in which peptides present in a complex peptide
mixture are identified. The final outcome of a
conformational proteomics experiment is a list of peptides.
These peptide lists typically span multiple experimental
conditions across which the structural observations are to
be compared; the peptide lists have to be combined and, if
available, mapped onto the structure of the protein.
To fulfill these latter steps, we developed PepShell
(Vandermarliere et al., 2015), to guide the interpretation
of mass spectrometry-based proteomics data in the context
of protein structure and dynamics.
TOOL DESCRIPTION
PepShell aids the user in the interpretation of the outcome
of conformational proteomics experiments and is
composed of three panels: the experiment comparison
panel, the PDB view panel, and the statistics panel.
The data to analyze
PepShell allows the input from limited proteolysis,
hydrogen-deuterium exchange, MS footprinting and
stable-isotope labelling experiments. The data have to
be present in a comma-separated text file format. The
project selection interface allows the user to select a
reference project and to indicate which setups need to
be compared with each other.
Experiment comparison
This panel allows the comparison of the selected
experimental setups at the sequence level. For each
experimental condition, the identified and quantified
peptides are mapped onto the sequence of the protein
of interest.
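The sequence-level mapping can be sketched as a simple substring search; the protein and peptide sequences below are hypothetical, and PepShell's actual implementation may differ:

```python
# Sketch: locate identified peptides within a protein sequence, as the
# comparison panel's mapping does. Sequences are hypothetical.
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
peptides = ["AYIAKQR", "SHFSRQ", "NOTFOUND"]

coverage = {}
for pep in peptides:
    pos = protein.find(pep)
    if pos >= 0:
        coverage[pep] = (pos, pos + len(pep))  # 0-based, half-open interval
print(coverage)
```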
The PDB view panel
Here, the detected peptides are mapped on the protein
structure. The main requirement is the availability of a
3D structure of the protein of interest.
Statistics within PepShell
In this panel, the peptides of interest can be analyzed
in more detail. The outcome from CP-DT (Fannes et
al., 2013), giving the tryptic cleavage probability at
each tryptic cleavage position, is shown. Detailed
comparison of the peptide ratios across the different
experimental setups is also possible.
CONCLUSIONS
The increasing popularity of structural proteomics is in
stark contrast with the availability of efficient tools to
visualize this multitude of data. There are, however, some
tools available that aid data interpretation, but these are
approach-specific and aimed primarily at mass
spectrometrists, with a specific focus on the experimental
mass spectrometry data and their processing and
interpretation. PepShell, on the other hand, is intended to
support downstream users to interpret the results obtained
from a variety of conformational proteomics approaches.
PepShell uses the peptide lists to compare different
experimental conditions and allows the visualization of
these differences onto the structure of the protein. As such,
PepShell bridges the gap between mass spectrometry-
based proteomics data and their interpretation in the
context of protein structure and dynamics.
PepShell is an open source Java application. Its binaries,
source code and documentation can be found at:
compomics.github.io/projects/pepshell.html
REFERENCES
Fannes T. et al. J Proteome Res 12, 2253–2259 (2013).
Vandermarliere E. et al. J Proteome Res 14, 1987–1990 (2015).
Abstract ID: O23 Oral presentation
O23. INTERACTIVE VCF COMPARISON USING SPARK NOTEBOOK
Thomas Moerman1,2,5*, Dries Decap3,5, Toni Verbeiren2,5, Jan Fostier3,5, Joke Reumers4,5, Jan Aerts2,5.
Advanced Database Research and Modeling (ADReM), University of Antwerp1; Visual Data Analysis Lab, ESAT – STADIUS, Dept. of Electrical Engineering, KU Leuven – iMinds Medical IT2; Department of Information Technology, Ghent University – iMinds, Gaston Crommenlaan 8 bus 201, 9050 Ghent, Belgium3; Janssen Research & Development, a division of Janssen Pharmaceutica N.V., 2340 Beerse, Belgium4; ExaScience Life Lab, Kapeldreef 75, 3001 Leuven, Belgium5.
Researchers benefit greatly from tools that allow hands-on, interactive and visual experimentation with data, unimpeded
by setup complexities or by scaling issues resulting from large data sizes. In our contribution we present an implementation
of an interactive VCF comparison tool, making use of a technology stack based on Apache Spark [1], Big Data
Genomics Adam [2] and Spark Notebook [3].
INTRODUCTION
Current genomics data formats and processing pipelines
are not designed to scale well to large datasets [1]. They
were also not conceived to be used in an interactive
environment. The bioinformatics field typically struggles
with these difficulties as high-throughput, next-generation
sequencing jobs produce large data files. Although many
high-quality bioinformatics processing tools exist, it is
often hard to express analyses in a consolidated and
reproducible fashion. These tools typically do not allow to
interactively iterate on an analysis while visualizing
results.
OBJECTIVE
Analysis tools preferably provide the expressive power to
define ad hoc queries on data. Biologists or clinical
researchers, when dealing with genomic variants encoded
in VCF files, typically perform queries comparing one
protocol to another, tumor to normal, treated to untreated
cell lines and so on. Ideally these comparisons make use
of all quality-related metrics stored in VCF files (e.g.
coverage depth, quality score) as well as the actual region
annotations (e.g. repeat regions, exonic regions) and
generate visual output. We aim to implement a tool that
provides the necessary expressiveness as well as the
computational power needed for making these types of
analyses practical and interactive.
APPROACH
Recent advances in computation platform technology
(Spark) and notebook technologies (Spark Notebook)
enable orchestration of distributed jobs on cluster
infrastructure from a programmable environment running
in a browser. These technologies, combined with Adam
[2], a library specifically designed for processing next-
generation sequencing data, provide the necessary
architectural bedrock for our purposes.
Analyses are expressed in a high-level programming
language (Scala), operating on specialized data structures
(Spark resilient distributed datasets, or RDDs [1]) that
abstract away the complexity of defining distributed computations on data sets too large for single-node processing. Adam meets the need for an explicit data
schema for abstraction of the different bioinformatics file
formats.
RESULTS & CONTRIBUTIONS
Our work focuses on the pairwise comparison of annotated
VCF files. Our contributions consist of two open-source
Scala libraries: VCF-comp [4] and Adam-FX [5]. VCF-
comp implements the concordance by variant position
algorithm, which segregates the variants from two VCF
inputs (A, B) into 5 categories: A/B-unique, concordant
(equal variants on position) and A/B-discordant (different
variants on position). This results in a distributed data
structure from which we project visualizations, presented
to the user by means of the Spark Notebook interface.
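The categorization step can be sketched in plain Python (the authors' implementation is in Scala on Spark RDDs; here ordinary dicts keyed by position stand in for the distributed data, and the category names follow the abstract):

```python
def categorize(vcf_a, vcf_b):
    """Segregate two position->variant maps into the five concordance
    categories: A/B-unique, concordant, A/B-discordant."""
    categories = {"A-unique": [], "B-unique": [], "concordant": [],
                  "A-discordant": [], "B-discordant": []}
    for pos in set(vcf_a) | set(vcf_b):
        a, b = vcf_a.get(pos), vcf_b.get(pos)
        if b is None:
            categories["A-unique"].append((pos, a))
        elif a is None:
            categories["B-unique"].append((pos, b))
        elif a == b:
            categories["concordant"].append((pos, a))
        else:  # both inputs call a variant here, but the alleles differ
            categories["A-discordant"].append((pos, a))
            categories["B-discordant"].append((pos, b))
    return categories

# Toy example: variants keyed by (chromosome, position), value = (ref, alt)
tumor  = {("20", 100): ("A", "T"), ("20", 200): ("G", "C"), ("20", 300): ("T", "G")}
normal = {("20", 100): ("A", "T"), ("20", 200): ("G", "A")}
cats = categorize(tumor, normal)
```

From such a categorized structure, summaries like the allele-frequency and functional-impact histograms of Figures 1 and 2 can be projected per category.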
FIGURE 1 Allele frequency distribution for concordant and unique
variants in a tumor vs. normal VCF comparison.
FIGURE 2 Functional impact (SnpEff annotation) histogram for
concordant, unique and discordant variants in a tumor vs. normal VCF
comparison.
Adam-FX extends the Adam data structures and file
parsing logic in order to support queries on SnpEff [6],
SnpSift [7], dbSNP and Clinvar annotations.
We believe our tool facilitates the comparison of
annotated VCF files in an interactive manner while
reducing runtime by leveraging the Spark platform.
REFERENCES
[1] Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing."
[2] Massie, Matt, et al. "Adam: Genomics formats and processing patterns for cloud scale computing."
[3] https://github.com/andypetrella/spark-notebook
[4] https://github.com/tmoerman/vcf-comp
[5] https://github.com/tmoerman/adam-fx
[6] Cingolani, P., et al. "A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3." Fly (Austin). 2012 Apr-Jun;6(2):80-92. PMID: 22728672
[7] Cingolani, P., et al. "Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift." Front Genet. 3:35, 2012.
10th Benelux Bioinformatics Conference bbc 2015
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: O24 Oral presentation
O24. 3D HOTSPOTS OF RECURRENT RETROVIRAL INSERTIONS REVEAL
LONG-RANGE INTERACTIONS WITH CANCER GENES
Sepideh Babaei1, Waseem Akhtar2, Johann de Jong3, Marcel Reinders1 & Jeroen de Ridder1*.
Delft Bioinformatics Lab, Delft University of Technology1; Division of Molecular Genetics2; Division of Molecular Carcinogenesis, The Netherlands Cancer Institute3.
Genomically distal mutations can contribute to deregulation of cancer genes by engaging in chromatin interactions. To
study this, we overlay viral cancer-causing insertions obtained in a murine retroviral insertional mutagenesis screen with
genome-wide chromatin conformation capture data. In this talk, we show that insertions tend to cluster in 3D hotspots
within the nucleus. The identified hotspots are significantly enriched for known cancer genes, and bear the expected
characteristics of bona-fide regulatory interactions, such as enrichment for transcription factor binding sites.
Additionally, we observe a striking pattern of mutually exclusive integration, indicating that insertions in these loci target the same gene, either in their linear genomic vicinity or in their 3D spatial vicinity. Our findings shed new light on the repertoire of targets obtained from insertional mutagenesis screening and underline the importance of considering the genome as a 3D structure when studying the effects of genomic perturbations.
Evidence is mounting that the organization of the genome
in the cell nucleus is extremely important for gene
regulation. This insight has been facilitated by recent technological advances (e.g. Hi-C) that enable researchers to accurately capture the 3D conformation of chromosomes in the cell nucleus at high resolution.
We have exploited a large existing Hi-C dataset to take 3D
chromosome conformation into account while determining
hotspots of viral cancer-causing mutations. These
identified hotspots are significantly enriched for known
cancer genes, and bear the expected characteristics of
bona-fide regulatory interactions, such as enrichment for
transcription factor binding sites. Additionally, we observe
a striking pattern of mutually exclusive integration, indicating that insertions in these loci target the same gene through long-range interactions (1).
In a second study (2), we performed a similar analysis that
shows a striking relation between genome conformation
and expression correlation in the brain. Although recent
studies have shown that a strong correlation exists between chromatin interactions and gene co-expression,
predicting gene co-expression from frequent long-range
chromatin interactions remains challenging. We address
this by characterizing the topology of the cortical
chromatin interaction network using scale-aware
topological measures. We demonstrate that based on these
characterizations it is possible to accurately predict spatial
co-expression between genes in the mouse cortex.
Consistent with previous findings, we find that the
chromatin interaction profile of a gene-pair is a good
predictor of their spatial co-expression. However, the
accuracy of the prediction can be substantially improved
when chromatin interactions are described using scale-
aware topological measures of the multi-resolution
chromatin interaction network. We conclude that, for co-
expression prediction, it is necessary to take into account
different levels of chromatin interactions ranging from
direct interaction between genes (i.e. small-scale) to
chromatin compartment interactions (i.e. large-scale).
In this talk, I will focus on the computational and statistical methods that are required for an insightful overlay of high-resolution conformation maps obtained using Hi-C with ~20,000 cancer-causing retroviral mutations and with expression maps from the Allen Brain Atlas.
FIGURE 1. Circos visualization of the insertion clusters that co-localize
with the Notch1 locus.
REFERENCES
(1) Babaei, S. et al. Nature Communications (2015).
(2) Babaei and Mahfouz et al. PLoS Computational Biology (2015).
Abstract ID: P Poster
P1. KNN-MDR APPROACH FOR DETECTING GENE-GENE
INTERACTIONS
Sinan Abo alchamlat1 & Frédéric Farnir1,*.
Fundamental and Applied Research for Animals & Health (FARAH), Sustainable Animal Production, University of
Liège1.
Recent years have seen the emergence of a wealth of biological information. Facilitated access to genome sequences, along with massive data on gene expression and on proteins, has revolutionized research in many fields of biology. For example, the identification of up to several million SNPs in many species and the development of chips allowing for effective genotyping of these SNPs in large cohorts have triggered the need for statistical models able to identify the effects of individual and of interacting SNPs on phenotypic traits in this new high-dimensional landscape. Our work is a contribution to this field.
INTRODUCTION
Genome-wide association studies (GWAS) have allowed the identification of hundreds of genetic variants associated with complex diseases and traits, and have provided valuable insight into their genetic architecture (Wu M et al., 2010). Nevertheless, most variants identified so far provide relatively little information about the relationship between changes at the genomic level and phenotypes, because of the lack of reproducibility of the findings or because these variants usually explain only a small proportion of the underlying genetic variation (Fang G et al., 2012). This observation, known as the ‘missing heritability’ problem (Manolio T et al., 2009), raises the question: where does the unexplained genetic variation come from? A tentative explanation is that genes do not work in isolation, leading to the idea that sets of genes (or gene networks) could have a major effect on the tested traits while almost no marginal (i.e. individual gene) effect is detectable. Consequently, an important question concerns the exact relationship between the genomic configuration, including the interactions between the involved genes, and the phenotypic expression.
METHODS
To tackle this subject, different statistical methods such as MDR (Multifactor Dimensionality Reduction) have been proposed for detecting gene-gene interactions (Ritchie, D., et al., 2001); their relative performances remain largely unclear, and their extension to situations combining many variants turns out to be challenging. We therefore propose a novel MDR approach using the K-Nearest Neighbors (KNN) methodology (KNN-MDR) as a possible alternative for detecting gene-gene interactions, especially when the number of involved determinants is potentially high. The idea behind our method is to replace the status allocation used in classical MDR methods by a KNN approach: the majority vote occurs among the k nearest neighbors (k being a parameter that must be tuned and depends on the scenario) instead of within the (potentially empty) cell determined by the tested attributes of the individual to be classified. The steps other than classification are identical in both methods (i.e. cross-validation, attribute selection, training and test balanced-accuracy computations, best-model selection procedure).
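The classification step described above can be sketched as follows (a minimal illustration, not the authors' code; genotypes are assumed to be coded 0/1/2 at the SNPs selected as MDR attributes, and distance is a simple genotype mismatch count):

```python
from collections import Counter

def knn_status(train, genotypes, k):
    """KNN replacement for the MDR cell vote: classify an individual by
    the majority case/control status among its k nearest neighbours.
    `train` is a list of (genotype_tuple, status) pairs; status 1 = case."""
    # Distance = number of mismatching genotypes at the selected SNPs
    dist = lambda g: sum(a != b for a, b in zip(g, genotypes))
    neighbours = sorted(train, key=lambda item: dist(item[0]))[:k]
    votes = Counter(status for _, status in neighbours)
    return votes.most_common(1)[0][0]

# Toy data: genotypes coded 0/1/2 at the two SNPs chosen as MDR attributes
train = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0),
         ((2, 2), 1), ((2, 1), 1), ((1, 2), 1)]
```

Unlike the classical MDR vote, this never encounters an empty genotype cell: an individual always has k nearest neighbours, which is the motivation for the approach in sparse high-dimensional settings.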
RESULTS & DISCUSSION
Experimental results on both simulated data and real genome-wide data from the Wellcome Trust Case Control Consortium (WTCCC) (Wellcome Trust Case Control C., 2007) show that KNN-MDR has interesting properties in
terms of accuracy and power, and that, in many cases, it
significantly outperforms its recent competitors.
FIGURE 1. Comparison of the inter-chromosomal interactions detected on the WTCCC dataset by KNN-MDR and by other interaction methods that used this same dataset (Shchetynsky et al., 2015; Zhang et al., 2012).
The results of this study allow us to draw some conclusions about the performance of KNN-MDR: on the one hand, its ability to detect gene-gene interactions is similar to that of MDR for small problems; on the other hand, KNN-MDR has significant advantages for large samples and large numbers of markers (such as in GWAS) when detecting gene effects. KNN-MDR can therefore be seen as a new and more comprehensive method than MDR and other competitors for detecting gene-gene interactions.
REFERENCES
Wu M et al. American Journal of Human Genetics 86, 929-942 (2010).
Fang G et al. PLoS ONE 7, 1932-6203 (2012).
Manolio T et al. Nature 461, 747-753 (2009).
Ritchie, D., et al. Am J Hum Genet,69, 138-147 (2001).
Wellcome Trust Case Control C. Nature, 447(7145):661-678 (2007).
Shchetynsky K et al. Clinical immunology 158(1):19-28 (2015).
Zhang J et al. American Medical Journal 3(1) (2015).
Abstract ID: P Poster
P2. CONSERVATION AND DIVERSITY OF SUGAR-RELATED CATABOLIC
PATHWAYS IN FUNGI
Maria Victoria Aguilar Pontes*, Eline Majoor, Claire Khosravi, Ronald P. de Vries, Miaomiao Zhou
Fungal Physiology, CBS-KNAW Fungal Biodiversity Centre, Utrecht, The Netherlands; Fungal Molecular Physiology,
Utrecht University, The Netherlands.*[email protected], [email protected], [email protected],
[email protected], [email protected]
INTRODUCTION
Plant polysaccharides are among the major substrates for
many fungi. After extracellular degradation, the
monomeric components (mainly monosaccharides) are
taken up by the cells and used as carbon sources to enable
the fungus to grow. This implies that the range of catabolic pathways of a fungus may be correlated with the range of polysaccharides it can degrade. Several carbon catabolic pathways have been studied in different fungi able to grow on plant biomass, such as Aspergillus niger (De Vries, et al., 2012).
In this study we tested this hypothesis by identifying the presence of genes of a number of catabolic pathways in selected fungi from the Ascomycota and the Basidiomycota.
METHODS
A total of 104 fungal genomes were obtained from the JGI fungal program (Grigoriev IV, et al., 2011), the Broad Institute of Harvard and MIT, AspGD (Arnaud, et al., 2012) and NCBI GenBank (Benson, et al., 2012) (data version March 2013).
We identified A. niger genes involved in the individual pathways from the literature. Genome-scale protein ortholog clusters were detected with OrthoMCL (Li, et al., 2003), using inflation factor 1, an E-value cutoff of 1E-3 and a percentage-match cutoff of 60%, as appropriate for the identification of distant homologs (Boekhorst, et al., 2007). The all-vs-all BlastP search required by OrthoMCL was carried out in parallel on a grid of 500 computers. The ortholog clusters were then curated manually using expert knowledge and literature
search. Manual curation was aided by aligning the amino
acid sequences of the hits for each query together with a
suitable outgroup by MAFFT (Katoh, et al., 2009; Katoh,
et al., 2005), after which neighbor joining trees were
generated using MEGA5 with 1000 bootstraps. Genes that
were clearly separated from the query branch in the trees
were removed from the results.
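The final tabulation implied by these methods can be sketched as follows (a hypothetical illustration, not the authors' pipeline: gene and species names are invented, and the criterion chosen here marks a pathway present only when every one of its A. niger query genes has an ortholog in the species):

```python
def pathway_presence(pathways, clusters, species):
    """Tabulate pathway presence/absence per species from curated
    ortholog clusters (gene -> set of species containing an ortholog)."""
    table = {}
    for name, genes in pathways.items():
        table[name] = {
            # present only if all query genes have an ortholog in sp
            sp: all(sp in clusters.get(g, set()) for g in genes)
            for sp in species
        }
    return table

# Hypothetical toy input: two pathways, clusters as gene -> species set
pathways = {"rhamnose": ["gene1", "gene2"], "pentose": ["gene3"]}
clusters = {"gene1": {"A.niger", "T.reesei"},
            "gene2": {"A.niger"},
            "gene3": {"A.niger", "T.reesei", "S.cerevisiae"}}
species = ["A.niger", "T.reesei", "S.cerevisiae"]
table = pathway_presence(pathways, clusters, species)
```

Such a matrix, ordered by taxonomy, makes the clade-level conservation patterns discussed below directly visible.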
RESULTS & DISCUSSION
Patterns of pathway gene presence are conserved among clades. The galacturonic acid and rhamnose pathways are missing in yeasts. The pentose pathway is conserved in Pezizomycetes and Basidiomycota, which explains their ability to grow on pentoses as carbon source (www.fung-growth.org).
These results may indicate that different evolutionary
tracks have led to different metabolic strategies.
The expression of metabolic genes will be evaluated for
those species for which transcriptome data are available.
The results will be compared to growth profiling data of
the species on a set of plant-related poly- and
monosaccharides to determine to which extent the genome
content fits the physiological ability of the species.
ACKNOWLEDGEMENTS
The comparative genomics analysis was carried out on the
Dutch national e-infrastructure with the support of SURF
Foundation (e-infra1300787).
REFERENCES
Arnaud, M.B., et al., Nucleic Acids Res, 40, 653-659 (2012).
Benson, D.A., et al., Nucleic Acids Res, 40, 48-53 (2012).
Boekhorst, J., et al., BMC Bioinformatics, 8, 356-363 (2007).
De Vries, R.P., et al., Pan Stanford Publishing Pte. Ltd, Singapore (2012).
Grigoriev IV, et al., Mycology, 2, 192-209 (2011).
Katoh, K., et al., Methods Mol Biol, 537, 39-64 (2009).
Katoh, K., et al., Nucleic Acids Res, 33, 511-518 (2005).
Li, L., et al., Genome Res, 13, 2178-2189 (2003).
Abstract ID: P Poster
P3. VISUALIZING BIOLOGICAL DATA THROUGH WEB COMPONENTS
USING POLIMERO AND POLIMERO-BIO
Daniel Alcaide1,2*, Ryo Sakai1,2, Raf Winand1,2, Toni Verbeiren1,2, Thomas Moerman1,2, Jansi Thiyagarajan & Jan Aerts.
KU Leuven Department of Electrical Engineering-ESAT, STADIUS, VDA-lab, Belgium1; iMinds Medical IT, Leuven, Belgium2. *[email protected]
Although there are currently several tools for fast prototyping in data visualization, the specifics of the biological domain
often require the development of custom visuals. This leads to the issue that we end up re-implementing the base visuals
over and over if we want to build them into a specific analysis tool. This work presents a proof-of-principle library for
creating composable linked data visualizations, including an initial collection of parsers and visuals with an emphasis on
biology. With Polimero and Polimero-bio, we want to create a library to build scalable domain-specific visual data
exploration tools using a collection of D3-based reusable web components.
INTRODUCTION
As a visual data analysis lab, we often combine
(brush/link) well-known data visualization techniques
(scatterplots, barcharts, etc.). Although it is possible to use general-purpose tools like Tableau or Excel, the particular needs of the biological field usually demand custom data visualizations that are not included in these commercial solutions (Figure 1).
These visuals currently need to be re-implemented for each new tool created. The solution presented here is an alternative that allows creating composable linked data visualizations.
FIGURE 1. Klaudia-plot - Visualization created with Polimero that shows
the read pairs mapped around a deletion in the NA12878 genome on
chromosome 20.
METHODS
Polimero is a library that uses the Polymer implementation (www.polymer-project.org) for creating visual web components.
Web components are an emerging W3C standard for
extending the HTML platform to create web-based apps.
This new technology includes custom elements, HTML
templates, shadow DOM, and HTML imports (Figure 2).
The D3-based custom elements that Polimero and Polimero-bio offer allow us to create a scalable
framework for building domain-specific visual data
exploration tools.
Leveraging the web components concepts, the main
characteristics of Polimero library are:
Modular: Each element is an independent module
that has a specific purpose (data, visualization,
computation)
Composable: The elements can be combined to set up new functionalities (linking, filtering, reading different data sources)
Encapsulated: Web components aim to provide the user with a simple element interface, so that the user does not have to deal with the underlying code.
Reusable: The same element can be used in the
same project for different objectives.
Linkable: Polimero elements can speak to each
other, allowing the use of events for brushing and
linking.
Embeddable: The elements can be added to any existing framework that uses HTML (e.g. the IPython notebook).
FIGURE 2. HTML example – Representing Polimero elements to create
visualization.
RESULTS & DISCUSSION
This library makes it possible to create applications that
are composable, encapsulated, and reusable. This is
valuable both for the developer/designer, who can easily create and plug in custom visual encodings, and for the
end-user who can create linked visualizations by dragging
existing components onto a canvas using the Polimero-
designer.
Polimero and Polimero-bio are still in development but
they are available at www.bitbucket.org/vda-lab/polimero.
Abstract ID: P Poster
P4. DISEASE-SPECIFIC NETWORK CONSTRUCTION BY SEED-AND-EXTEND
Ganna Androsova1*, Reinhard Schneider1 & Roland Krause1.
Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Belval, Luxembourg 1.
INTRODUCTION
Molecular interaction networks are dense structures of
protein interactions, from which we would like to extract
relevant sub-networks specific to the disease of interest.
Such a disease-specific network is often constructed by the
seed-and-extend algorithm, which extracts the relevant
genes from an organism-wide, weighted interaction
network, typically as its first neighbourhood. Seed-and-extend is suitable when disease biomarkers are poorly investigated and knowledge about their interaction partners is missing, or when the interacting partners are established but the connections between them are missing.
Our syndrome of interest is the postoperative cognitive
impairment frequently experienced by elderly patients,
characterized by progressive cognitive and sensory decline.
The acute phase of cognitive impairment is postoperative
delirium (POD). The underlying pathophysiological
mechanisms have not been studied in depth due to the multifactorial pathogenesis of this postoperative cognitive
impairment. The known POD-related genes can be
integrated into the draft network for exploration on a
systems level.
Here, we investigate how stable the results of such
analysis are when the input set of seed genes is varied, and
what is the role of stringency in the initial selection of the
networks. Ideally, we would like to find the “sweet spot”
that provides a biologically meaningful trade-off between
false-positives and -negatives to be used for such analyses.
METHODS
The list of disease-related genes/proteins was retrieved
from literature studies in the PubMed database.
We extended the seed list with directly linked interactors
by seed-and-extend from protein-protein interaction
network databases. We extracted all interactions between
seeds and connected neighbours, which resulted in the
first-degree network.
Next, we evaluated the biological enrichment of the extracted network and its topological parameters, assessed its overlap with other diseases, and clustered the network into smaller sub-networks.
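The seed-and-extend extraction described above can be sketched with plain sets (a minimal illustration, not the authors' code; the protein names in the toy interactome are hypothetical):

```python
def seed_and_extend(edges, seeds):
    """Extract the first-degree network: the seeds, their direct
    interactors, and every edge with both endpoints in that set."""
    nodes = set(seeds)
    for a, b in edges:
        if a in seeds:
            nodes.add(b)
        if b in seeds:
            nodes.add(a)
    # Induced subgraph on seeds plus first neighbours
    sub = [(a, b) for a, b in edges if a in nodes and b in nodes]
    return nodes, sub

# Hypothetical interactome with two seed proteins
edges = [("GCR", "HSP90"), ("HSP90", "P53"), ("P53", "MDM2"), ("ACTB", "MYH9")]
nodes, subnet = seed_and_extend(edges, {"GCR", "P53"})
```

Varying the `seeds` set, as done in the stability analysis above, directly changes which neighbours and edges survive into the extracted network.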
RESULTS & DISCUSSION
The POD network (Figure 1) has a scale-free degree distribution and consists of 541 proteins with 5,242 interactions between them.
FIGURE 1. Postoperative delirium molecular network.
The network was evaluated topologically by degree
assortativity, density, shortest path, eccentricity and other
measures. Pathway enrichment analysis identified glucocorticoid receptor signalling, immune response, and dopamine signalling as relevant to POD (Figure 2).
FIGURE 2. Postoperative delirium pathway enrichment analysis.
The top 5 hub proteins included UBC_HUMAN, GCR_HUMAN, P53_HUMAN, HS90A_HUMAN and EGFR_HUMAN. The appearance of p53 and other very frequent genes among the top 5 hubs in our study, but also in several others, motivated us to investigate their relevance to the disease and to question possible data bias. We compare how the size, specificity and completeness of the input seed list affect the resulting network and the retrieval of other disease-related proteins.
Abstract ID: P Poster
P5. BIG DATA SOLUTIONS FOR VARIANT DISCOVERY FROM LOW
COVERAGE SEQUENCING DATA, BY INTEGRATION OF HADOOP, HBASE
AND HIVE
Amin Ardeshirdavani1*, Erika Souche2, Martijn Oldenhof3 & Yves Moreau1.
KU Leuven ESAT-STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics1; KU Leuven Department of Human Genetics2; KU Leuven Facilities for Research3. *[email protected]
Next Generation Sequencing (NGS) technologies allow the sequencing of whole human genomes and thereby, among other applications, the efficient study of human genetic disorders. However, the resulting flood of sequencing data requires high computational power and an optimized programming structure for its analysis. Many researchers use scale-out networks to approximate a supercomputer. In many use cases Apache Hadoop and HBase have been used to coordinate distributed computation and to act as a storage platform, respectively. However, scale-out networks have rarely been used to handle gene variation data from NGS, except for sequencing read assembly. In this study, we propose a Big Data solution that integrates Apache Hadoop, HBase and Hive to efficiently analyze NGS output such as VCF files.
INTRODUCTION
The goal of this project is to overcome the gap between massive NGS data volumes and limited data-processing capability. We propose a data processing and storage model specifically for NGS data and develop an application based on this model to test whether processing capability is substantially increased. The target users of this application are researchers with intermediate-level computer skills. The new model should be scalable, fault-tolerant and highly available. The data import procedure should be fast and occupy as little storage volume as possible. It should also make querying the data fast, including from remote locations. To achieve these demands, three open-source projects (Apache Hadoop, HBase and Hive) are integrated as the backbone, and on top of them an application with a user-friendly interface is developed to make this integration more straightforward.
METHODS
Generally, Hadoop provides distributed MapReduce data processing, HBase serves as the platform for storing complex structured data, and Hive retrieves data from HBase using Structured Query Language (SQL) syntax. Although Hadoop and HBase have become popular recently, the combination of Hadoop, HBase and Hive has rarely been implemented in the bioinformatics field.
Here we mainly discuss gene variation data analysis, so the application development focuses on parsing and storing VCF (Variant Call Format) files. The application is designed to dynamically adapt to the VCF file structures produced by different variant callers. For example, UnifiedGenotyper calls SNPs and InDels separately, considering each variant to be independent, whereas HaplotypeCaller calls variants using local assembly. For gene variation analysis, the VCF files of different samples need to be queried, and it should be possible to export the results for further use. A VCF file for a sample or a group of samples is typically large, so processing efficiency is crucial.
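The parsing step can be sketched as follows (a minimal illustration, not the H3 VCF code; it flattens one VCF data line into a record that could be loaded into an HBase/Hive table, and ignores header handling):

```python
def parse_vcf_line(line, sample_names):
    """Parse one VCF data line into a flat record (minimal sketch)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags
    info_dict = dict(kv.split("=", 1) if "=" in kv else (kv, True)
                     for kv in info.split(";"))
    record = {"chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
              "alt": alt.split(","), "qual": float(qual), "filter": flt,
              "info": info_dict}
    if len(fields) > 9:  # per-sample genotype columns, if present
        keys = fields[8].split(":")
        record["samples"] = {name: dict(zip(keys, col.split(":")))
                             for name, col in zip(sample_names, fields[9:])}
    return record

line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tDP=14;AF=0.5\tGT:DP\t0/1:14"
rec = parse_vcf_line(line, ["NA00001"])
```

Because the INFO and FORMAT keys are parsed dynamically rather than hard-coded, the same routine accommodates the differing column layouts of callers such as UnifiedGenotyper and HaplotypeCaller.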
The model we have chosen is the integration of Hadoop, HBase and Hive: Hadoop is used for data processing, HBase for storage and Hive for querying. Since all of these projects need a distributed cluster for optimal performance, it is crucial to choose a suitable architecture for our application. The cluster is the major processing and storage platform, while a single server outside the cluster acts as a client for users. Our application can connect remotely to the Hive server on behalf of researchers.
RESULTS & DISCUSSION
Our tests clearly show that the Apache integration performs much better than an SQL model when dealing with large VCF files, and its performance remains acceptable for small VCF files. We therefore conclude that the Apache integration is a good solution for this kind of file management. Our newly developed application, H3 VCF, offers a user-friendly interface so that users without advanced IT knowledge can conveniently use the integration to handle VCF files. Users can either build their own local computer cluster or use Amazon EMR to easily create a cluster with the Apache projects for a few dollars.
Abstract ID: P Poster
P6. ENTEROCOCCUS FAECIUM GENOME DYNAMICS DURING
LONG-TERM PATIENT GUT COLONIZATION
Jumamurat R. Bayjanov1*, Jery Baan1, Mark de Been1, Mick Watson2 & Willem van Schaik1.
Department of Medical Microbiology, University Medical Center Utrecht, Utrecht, The Netherlands1; Edinburgh
Genomics, The University of Edinburgh, Edinburgh, Scotland2.
Enterococcus faecium, a recently evolved multi-drug resistant nosocomial pathogen, is able to rapidly colonize the human gut. Previous work on animal, healthy-human and clinical E. faecium strains has shown that clinical isolates form a distinct lineage. However, these studies lack a detailed niche-specific and longitudinal analysis of the evolutionary dynamics of this organism. Here we present a longitudinal analysis of the within-host evolutionary dynamics of E. faecium gut isolates sampled from five patients over a period of 8 years. Whole-genome sequencing showed that the rapid diversification of E. faecium clones in the patient gut is mainly driven by recombination and phages. This high diversification allows E. faecium clones to acquire new genes, including antibiotic resistance genes, which enables this bacterium to rapidly colonize hostile environments.
INTRODUCTION
In recent decades, Enterococcus faecium, normally a
harmless gut commensal, has emerged as an important
multi-drug resistant nosocomial pathogen. Previous work
has shown that clinical isolates of E. faecium form a sub-
population that is distinct from strains isolated from
animals and healthy humans (Lebreton et al., 2013). We
used whole-genome sequencing to characterize how
clinical E. faecium strains evolve during long-term patient
gut colonization.
METHODS
The genomes of 96 E. faecium gut isolates, obtained over
8 years from 5 different patients, were sequenced using
Illumina HiSeq 2x100bp paired-end sequencing. Quality
filtering of sequence reads was performed using Nesoni
(version 0.117) (Nesoni, 2014) and high-quality reads
were assembled into contiguous sequences using Spades
assembler (version 3.1.0) (Bankevich et al., 2012).
Subsequently, assembled sequences were annotated using
Prokka (v 1.10) (Seeman T, 2014). In addition to these 96
genomes, we also included publicly available genome
sequences of 70 E. faecium strains, which were
downloaded from NCBI Genbank database. In the set of
166 strains, orthology between genes were identified using
orthAgogue (Ekseth et al., 2014) and orthologous genes
were clustered into ortholog groups using MCL algorithm
(Enright et al., 2002). Core genome alignments were then
constructed by concatenating core gene sequences and
were filtered for recombinations using Gubbins (Croucher
et al., 2015). Subsequently, recombination-filtered core
genome alignments were used to construct a phylogenetic
tree. In addition to core-genome based analyses, we have
also studied gene gain and loss across time.
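The core-genome construction step above can be sketched as follows. This is a minimal illustration, not the exact orthAgogue/MCL output format: ortholog groups are assumed to be given as {group: {strain: aligned sequence}} dictionaries, and "core" is taken to mean one sequence per strain in every isolate.

```python
# Sketch of core-genome concatenation from per-gene alignments.
# The toy ortholog groups below stand in for real orthAgogue/MCL output.

def core_genes(ortholog_groups, strains):
    """Keep only groups with a sequence for every strain (the core genome)."""
    return {name: seqs for name, seqs in ortholog_groups.items()
            if set(seqs) == set(strains)}

def concatenate_core(ortholog_groups, strains):
    """Concatenate core gene sequences per strain in a fixed gene order."""
    core = core_genes(ortholog_groups, strains)
    order = sorted(core)
    return {s: "".join(core[g][s] for g in order) for s in strains}

strains = ["s1", "s2", "s3"]
groups = {
    "geneA": {"s1": "ATG", "s2": "ATG", "s3": "ATA"},
    "geneB": {"s1": "CCT", "s2": "CCT", "s3": "CCT"},
    "geneC": {"s1": "GGG", "s2": "GGG"},  # absent in s3: not core
}
alignment = concatenate_core(groups, strains)
print(alignment["s3"])  # ATACCT
```

The concatenated per-strain sequences form the core genome alignment that is then passed to Gubbins for recombination filtering.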
RESULTS & DISCUSSION
As expected, all 96 isolates grouped in E. faecium clade A; only one strain clustered in clade A-2, which mainly contains animal isolates. The remaining 95 strains were assigned to clade A-1, which consists almost exclusively of clinical isolates. The phylogenetic tree showed five patient-specific clusters of closely related strains, revealing the microevolution of E. faecium strains during gut colonization. Our data also suggest that direct transfer of strains occurred between patients hospitalized in the same ward.
Additionally, analysis of gene gain and loss across time
showed that loss and gain of prophages is an important
factor in generating genetic diversity during gut
colonization.
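The gain/loss analysis above amounts to comparing the gene content of consecutive isolates from one patient. A minimal sketch, using hypothetical toy gene sets rather than real annotation output:

```python
# Sketch: count gene gain and loss events between consecutive isolates of
# one patient, given each timepoint's gene content as a set of gene names.

def gain_loss(timepoints):
    """(gained, lost) gene sets between each pair of consecutive samples."""
    return [(curr - prev, prev - curr)
            for prev, curr in zip(timepoints, timepoints[1:])]

t0 = {"core1", "core2", "phageA"}
t1 = {"core1", "core2", "phageB"}  # prophage A lost, prophage B gained
events = gain_loss([t0, t1])
print(events[0])  # ({'phageB'}, {'phageA'})
```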
This study highlights the ability of E. faecium clones to
rapidly diversify, which may contribute to the ability of
this bacterium to efficiently colonize new environments
and rapidly acquire antibiotic resistance determinants.
REFERENCES
Lebreton F et al. "Emergence of epidemic multidrug-resistant Enterococcus faecium from animal and commensal strains". MBio 4(4):e00534-13, 2013.
Nesoni. https://github.com/Victorian-Bioinformatics-Consortium/nesoni
Bankevich A et al. "SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing". Journal of Computational Biology 19(5):455-477, 2012.
Seemann T. "Prokka: rapid prokaryotic genome annotation". Bioinformatics 30(14):2068-2069, 2014.
Ekseth OK et al. "orthAgogue: an agile tool for the rapid prediction of orthology relations". Bioinformatics 30(5):734-736, 2014.
Enright AJ et al. "An efficient algorithm for large-scale detection of protein families". Nucleic Acids Res 30(7):1575-1584, 2002.
Croucher NJ et al. "Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins". Nucleic Acids Res 43(3):e15, 2015.
10th Benelux Bioinformatics Conference bbc 2015
51
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: P Poster
P7. XCMS OPTIMISATION IN HIGH-THROUGHPUT LC-MS QC
Charlie Beirnaert1,2*, Matthias Cuykx3, Adrian Covaci3 & Kris Laukens1,2.
Advanced Database Research and Modeling (ADReM), University of Antwerp1; Biomedical Informatics Research Centre Antwerp (biomina)2; Toxicological Centre, University of Antwerp3.
In high-throughput untargeted metabolomics studies, quality control remains a prominent bottleneck. In analogy to a recently developed QC tool for proteomics, our research group aims to develop a QC environment specific to metabolomics. One component of this work is the XCMS analysis software for LC-MS data, which is very sensitive to its input parameters. The presented work deals with the automatic optimisation of the XCMS parameters, building on an existing framework for XCMS optimisation. The additions to this framework are the inclusion of quantified resolution data, obtained from the otherwise ignored profile data, and intelligent use of the isotopic profile of measured compounds.
INTRODUCTION
Metabolomics is the study of small molecules, or metabolites. These metabolites have an enormous chemical diversity and are only now starting to be identified in a high-throughput fashion, thanks to the adoption of high-performance liquid chromatography coupled to mass spectrometry (LC-MS) and nuclear magnetic resonance spectroscopy. However, analysing the resulting large datasets is not trivial; for LC-MS specifically, there are almost more ways of analysing the data than there are researchers. Arguably the most commonly used software platform for the initial analysis is XCMS (Smith et al., 2006). However, the output of XCMS depends strongly on its input parameters. Often the default parameters are used, or they are adapted to the intuition of the researcher, without accounting for the introduction of false positives. Optimization algorithms have been constructed using a dilution series (Eliasson et al., 2012) and using the carbon isotope (Libiseller et al., 2015). In this work, we build further upon the latter by including quantified information from the profile m/z domain (the continuous data in the m/z dimension), where accurate resolutions can be obtained for the mono-isotopic peaks and other isotopes. The developed optimisation can be used both for data analysis and for the quality control framework that is under development.
METHODS
The proposed work uses XCMS to find the peaks of interest in the data. To optimise this process, the results from XCMS are analysed for the occurrence of peaks and their isotopes. In this step, the raw profile data around the peaks identified by XCMS is inspected to quantify the peak resolution and to detect missed isotopes.
Centroid vs Profile data: Modern-day MS specialists use centroid data because the file size is considerably smaller. The mass spectrometer converts the continuous data in the m/z dimension into a collection of spikes: each approximately Gaussian peak is reduced to a single spike (a delta function with the same height as the original peak), and all other data is discarded. The result is a huge reduction in file size, but the peak shape is lost and, as a result, the resolution can no longer be quantified.
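The loss described above can be made concrete with a toy example: from profile data the peak width (FWHM), and hence the resolution m/Δm, is measurable, whereas centroiding keeps only the apex. All values below are illustrative.

```python
# Sketch of centroiding and why it discards resolution information.
import math

def gaussian_profile(mz0, sigma, height, mz_axis):
    """Profile-mode intensities for one Gaussian peak over an m/z axis."""
    return [height * math.exp(-((mz - mz0) ** 2) / (2 * sigma ** 2))
            for mz in mz_axis]

def centroid(mz_axis, intensities):
    """Reduce a profile peak to a single spike at its apex: shape is gone."""
    i = max(range(len(intensities)), key=intensities.__getitem__)
    return mz_axis[i], intensities[i]

def fwhm_resolution(mz0, sigma):
    """Resolution m/dm, with dm the FWHM of a Gaussian peak."""
    fwhm = 2 * math.sqrt(2 * math.log(2)) * sigma
    return mz0 / fwhm

mz_axis = [200.0 + k * 0.001 for k in range(101)]
profile = gaussian_profile(200.05, 0.005, 1e6, mz_axis)
print(centroid(mz_axis, profile))             # apex (m/z, intensity) only
print(fwhm_resolution(200.05, 0.005))         # ~1.7e4, needs profile data
```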
Optimization parameter: The peaks and their isotopes are characterized by a Gaussian in the chromatographic dimension and are spaced 1.0063 Da apart in the m/z dimension. When an isotope is missing, or the extracted peak does not appear in enough samples (for example in 50% of the samples in the sample group), the peak is categorized as "unreliable". When a peak is present in all samples or has a clear isotopic distribution, it is considered "reliable". From these measures a so-called peak picking score can be calculated, which in turn can be optimised by a variety of methods. This results in an increase in reliable peaks without increasing false positives.
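The reliability classification above can be sketched as a small scoring function. The field names, the 50% threshold and the toy peak list are illustrative assumptions, not the actual implementation:

```python
# Sketch of a "peak picking score": the fraction of reliable peaks, to be
# maximised over XCMS parameter settings.

ISOTOPE_SPACING = 1.0063  # Da, spacing in the m/z dimension as stated above

def is_reliable(peak, n_samples, min_fraction=0.5):
    """Reliable if found in enough samples or backed by an isotope partner."""
    found_fraction = peak["n_detected"] / n_samples
    return found_fraction >= min_fraction or peak["has_isotope"]

def peak_picking_score(peaks, n_samples):
    """Fraction of reliable peaks in the peak list."""
    reliable = sum(is_reliable(p, n_samples) for p in peaks)
    return reliable / len(peaks)

peaks = [
    {"mz": 200.05, "n_detected": 6, "has_isotope": True},
    {"mz": 200.05 + ISOTOPE_SPACING, "n_detected": 2, "has_isotope": False},
    {"mz": 350.10, "n_detected": 5, "has_isotope": False},
]
print(peak_picking_score(peaks, n_samples=6))  # 2/3: one peak is unreliable
```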
Analysis & Quality control: The optimisation of the XCMS parameters is useful in the analysis of the data itself, but it is also applicable to quality control for large-scale LC-MS experiments. By quantifying the resolutions of all relevant peaks in a dataset corresponding to a control sample, it is possible to monitor the quality of spectra; combined with other QC frameworks, such as iMonDB (Bittremieux et al., 2015), this makes it possible to assure the quality of all experiments in a long-running study.
RESULTS & DISCUSSION
The aim is to use the profile data to improve the available optimization algorithms. It remains to be seen whether the extra information in this data (compared to centroid data) justifies the increased demand on computational resources. Nonetheless, profile data provides a valuable contribution to LC-MS optimization, because it enables researchers to quantitatively evaluate, and improve, the m/z resolution.
REFERENCES
Smith CA et al. Anal. Chem. 78(3):779-789, 2006.
Eliasson M et al. Anal. Chem. 84(15):6869-6876, 2012.
Libiseller G et al. BMC Bioinformatics 16:118, 2015.
Bittremieux W et al. J. Proteome Res. 14(5):2360-2366, 2015.
P8. IDENTIFICATION OF NUMTS THROUGH NGS DATA
Vincent Branders1,2*, Chedly Kastally2 & Patrick Mardulyn2.
Machine Learning Group, Institute of Information and Communication Technologies, Electronics and Applied Mathematics (ICTEAM), Université catholique de Louvain1; Evolutionary Biology and Ecology, Université libre de Bruxelles2.
Numts are copies of mitochondrial DNA sequences that have been transferred into the nuclear genome. Due to their similarity to mitochondrial DNA sequences, numts have led to many misinterpretations, from overestimation of diversity to a spurious association between cystic fibrosis and mitochondrial genome variation. To avoid the bias induced by numts, these sequences have to be identified. Current methodologies compare existing nuclear and mitochondrial sequences and search for similarities. The new Pacific Biosciences (PacBio) technology generates sequencing reads that span thousands of base pairs, which offers the opportunity to identify numts by looking for reads with regions similar to mitochondrial sequences surrounded by regions highly different from them. This should allow the systematic identification of numts without a complete known nuclear reference.
INTRODUCTION
The transfer of DNA from mitochondria to the nucleus
generates nuclear copies of mitochondrial DNA (numts).
Numts have been found in many species including yeasts,
rodents and plants. Due to their similarity to mitochondrial
DNA, numts are responsible for many misinterpretations,
both in mitochondrial disease studies and phylogenetic
reconstructions (Hazkani-Covo et al., 2010). Numt variation has commonly been misreported as mitochondrial mutations in patients (Yao et al., 2008).
Moreover, DNA barcoding was found to overestimate the
number of species when numts are coamplified (Song et
al., 2008). Current methods identify such sequences by
aligning mitochondrial sequences against the nuclear
genome and identifying similar regions (Figure 1, left).
The PacBio technology allows the sequencing of DNA fragments spanning thousands of base pairs. This read length should allow the identification of numts in species without a complete nuclear reference, such as the insect Gonioctena intermedia. Indeed, it should be
possible to use a mitochondrial assembly to identify
PacBio reads with a central region similar to the
mitochondrial sequence enclosed by nuclear regions that
are dissimilar to it (Figure 1, right).
FIGURE 1. Identification of numts – Existing methods (left) and proposed
method (right). Comparison of mitochondrial sequence to nuclear sequence (left) or long reads (right).
METHODS
The proposed approach aligns PacBio reads to a mitochondrial genome (here, de novo assemblies of PacBio and Illumina HiSeq 2000 reads are used). In these long reads, numts are identified as having one region similar to the mitochondrial genome surrounded by regions that are not similar. We introduce different criteria to distinguish reads that are presumably numts from reads of mitochondrial origin (Figure 2). The DNA sequences come from an insect (Gonioctena intermedia) without a reference genome.
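The read-classification criterion above can be sketched from alignment coordinates alone: a read is a candidate numt when its mitochondrial alignment covers only a central region, leaving substantial unaligned (presumably nuclear) flanks. The flank threshold here is an illustrative assumption:

```python
# Sketch: classify a long read from its mitochondrial alignment coordinates.

def classify_read(read_len, aln_start, aln_end, min_flank=500):
    """Candidate numt if both flanks outside the mitochondrial hit are long."""
    left_flank = aln_start
    right_flank = read_len - aln_end
    if left_flank >= min_flank and right_flank >= min_flank:
        return "potential numt"
    return "mitochondrial"

print(classify_read(10000, 3000, 6000))  # potential numt: long nuclear flanks
print(classify_read(10000, 50, 9980))    # mitochondrial: hit spans the read
```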
FIGURE 2. Mitochondrial reads and numts with nuclear borders.
RESULTS & DISCUSSION
A systematic identification of potential numts is proposed: through alignments, we identified 10 mitochondrial reads and 34 reads containing a potential numt for one particular mitochondrial region (the widely studied cytochrome oxidase I gene). As exploratory research, we highlight the usefulness of Pacific Biosciences data for the identification of numts when no nuclear reference is available: the approach only requires PacBio reads and a mitochondrial assembly. It is more efficient than identifying numts through short reads, which would require the complete reconstruction of both the mitochondrial and nuclear genomes. A systematic identification of numts in non-model organisms should avoid misinterpretations in studies where numts could be sources of bias. Our current distinction between numts and mitochondrial reads is quite simple; a more detailed analysis of this distinction is a perspective for improvement.
REFERENCES
Hazkani-Covo E et al. PLOS Genetics 6:1-11, 2010.
Song H et al. PNAS 105:13486-13491, 2008.
Yao YG et al. Journal of Medical Genetics 45:769-772, 2008.
P9. MICROBIAL SEMANTICS: GENOME-WIDE HIGH-PRECISION NAMING
SCHEMES FOR BACTERIA
Esther Camilo dos Reis, Dolf Michielsen, Hannes Pouseele*.
Applied Maths NV, Keistraat 120, 9830 Sint-Martens-Latem, Belgium.
INTRODUCTION
As next-generation sequencing in general, and whole
genome sequencing (WGS) in particular, is increasingly
adopted in public health for routine surveillance tasks,
there is a clear need to incorporate this new technology in
the day-to-day operational workflow of a public health
institute. As cluster detection based on WGS data is
evolving into a commodity, thanks to technologies such as
whole genome multi-locus sequence typing (wgMLST),
the question remains as to how WGS-based data analysis
can be used to build up a human-friendly but high-
precision and epidemiologically consistent naming
strategy for communication purposes.
METHODS
For various organisms, the use of so-called ‘SNP
addresses’ (based on single nucleotide polymorphisms or
SNPs) has been proposed to build up a hierarchical
naming scheme (see [1], [2]). This idea relies on single
linkage clustering of isolates at different levels of
similarity or distance, hence leading to a hierarchical name.
However, the main difficulty here is to define the
appropriate levels of similarity to cluster on, and the
dependence of the naming scheme on the samples at hand.
Moreover, the SNP approach might not provide the best
type of data for this due to its relatively large volatility.
In this work, we present a mathematical framework to
define the levels of similarity upon which single linkage
clustering makes sense. For this, we model the observed
multimodal distribution of pairwise similarities between
samples to obtain a theoretical model of the similarity
distribution, and from there infer the most likely breaking
points for stable similarity cutoffs. This is done in a data-
independent manner, and is therefore applicable to SNP
data, but also to wgMLST data and even gene presence-
absence data. We assess the stability of the naming
scheme by using a cross-validation approach.
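The naming construction described above can be sketched without the model-fitting step: single-linkage clusters at a distance cutoff are exactly the connected components of the graph linking isolates closer than that cutoff, and stacking several cutoffs yields a hierarchical, SNP-address-style name. The distance matrix and cutoffs below are toy values; in the actual method the cutoffs come from the fitted similarity distribution:

```python
# Sketch: hierarchical names from single-linkage clustering at several cutoffs.

def clusters_at(dist, cutoff):
    """Single-linkage clusters at a cutoff = connected components (union-find)."""
    n = len(dist)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= cutoff:
                parent[find(i)] = find(j)
    roots = {}
    return [roots.setdefault(find(i), len(roots) + 1) for i in range(n)]

def addresses(dist, cutoffs):
    """One dot-separated address per isolate, coarsest level first."""
    levels = [clusters_at(dist, c) for c in sorted(cutoffs, reverse=True)]
    return [".".join(str(lv[i]) for lv in levels) for i in range(len(dist))]

# Toy distances: isolates 0 and 1 are nearly identical, isolate 2 is distant.
D = [[0, 2, 50],
     [2, 0, 50],
     [50, 50, 0]]
print(addresses(D, cutoffs=[5, 25]))  # ['1.1', '1.1', '2.2']
```

Because each level is computed independently, adding a new isolate can only extend, never reshuffle, existing names as long as the cutoffs stay fixed, which is the stability the cross-validation assesses.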
RESULTS & DISCUSSION
We apply our methods to propose a wgMLST-based naming scheme for Listeria monocytogenes. Using a reference dataset capturing the diversity within Listeria monocytogenes and an extensive dataset of over 4000 isolates from real-time surveillance, we show the stability of the naming scheme and its epidemiological concordance.
REFERENCES
[1] Dallman T et al. "Applying phylogenomics to understand the emergence of Shiga toxin-producing Escherichia coli O157:H7 strains causing severe human disease in the United Kingdom". Microbial Genomics. doi: 10.1099/mgen.0.000029.
[2] Coll F et al. "PolyTB: A genomic variation map for Mycobacterium tuberculosis". Tuberculosis (Edinb). 2014 May;94(3):346-354. doi: 10.1016/j.tube.2014.02.005.
P10. FROM SNPS TO PATHWAYS: AN APPROACH TO STRENGTHEN
BIOLOGICAL INTERPRETATION OF GWAS RESULTS
Elisa Cirillo1,*, Michiel Adriaens2 & Chris T Evelo1,2.
1Department of Bioinformatics – BiGCaT, Maastricht University, The Netherlands
2Maastricht Centre for Systems Biology (MaCSBio), Maastricht University, The Netherlands
Pathway and network analysis are established and powerful methods for providing a biological context for a variety of
omics data, including transcriptomics, proteomics and metabolomics. These approaches could in theory also be a boon
for the interpretation of genetic variation data, for instance in the context of Genome Wide Association Studies (GWAS),
as it would allow the study of genetic variants in the context of the biological processes in which the implicated genes
and proteins are involved. However, currently genetic variation data cannot easily be integrated into pathways.
Additionally, it is not clear how to visualise and interpret genetic variation data once connected to pathway content. In
this project we take up that challenge and aim to (i) visualise SNPs from a Type 2 Diabetes Mellitus (T2DM) GWAS
dataset on pathways and (ii) generate and analyze a network of all associated genes and pathways. Together, this could
enable a comprehensive pathway and network interpretation of genetic variations in the context of T2DM.
INTRODUCTION
GWAS has become a common approach for discovery of
gene disease relationships, in particular for complex
diseases like T2DM (Wellcome Trust Case Control Consortium, 2007). However, biological interpretation remains a
challenge, especially when it concerns connecting genetic
findings with known biological processes. We wish to
improve the interpretation of GWAS results, using a
meaningful network representation that links SNPs to
biological processes.
METHODS
We selected a GWAS data set related to T2DM from a meta-GWAS resource for diseases created by Johnson et al. (2009), and extracted 1971 SNPs associated with T2DM.
We identified the location of each SNP using the Variant Effect Predictor (VEP) (http://www.ensembl.org) and classified them into 5 categories (Figure 1): exonic, 3' UTR, 5' UTR, intronic and intergenic. SNPs in the first three categories are easily connected to genes using Ensembl BioMart (http://www.ensembl.org/). Pathways
related with these genes are identified from the curated
collection of WikiPathways (Kutmon et al., 2015). SNPs,
genes and pathways are visualized in networks using
Cytoscape (Shannon et al., 2003).
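The SNP-to-gene-to-pathway assembly described above amounts to building a tripartite edge list that can then be loaded into Cytoscape. A minimal sketch with toy mappings standing in for the VEP/BioMart and WikiPathways lookups:

```python
# Sketch: assemble SNP -> gene and gene -> pathway edges for a network view.

def build_network(snp_to_gene, gene_to_pathways):
    """Return a sorted edge list for a tripartite SNP-gene-pathway network."""
    edges = []
    for snp, gene in snp_to_gene.items():
        edges.append((snp, gene))
        for pw in gene_to_pathways.get(gene, []):
            edges.append((gene, pw))
    return sorted(set(edges))

# Illustrative entries; real mappings come from VEP/BioMart and WikiPathways.
snp_to_gene = {"rs7903146": "TCF7L2", "rs5219": "KCNJ11"}
gene_to_pathways = {"TCF7L2": ["Wnt signaling"], "KCNJ11": []}
print(build_network(snp_to_gene, gene_to_pathways))
```

Genes with an empty pathway list, like KCNJ11 above, illustrate the observation in the Results that disease genes are not always covered by pathway content.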
RESULTS & DISCUSSION
We analysed four gene-related SNP categories: 3' and 5' UTR, intronic and exonic. The exonic category was divided into 8 SNP sub-categories based on sequence interpretation: up- and downstream, splice region, synonymous, missense, stop/gain, transcription factor binding, and non-coding transcript. For each of the 11 resulting categories we created a SNP-disease gene-pathway network. Disease-related genes are not always included in pathways, and this is also the case for disease genes in which GWAS SNPs were found. For the SNPs related to genes in pathways, we performed a pathway gene set enrichment analysis and evaluated whether the resulting pathways were already known to be related to T2DM.
SNPs in intergenic regions need to be analysed and visualized differently. A possible approach is to use expression quantitative trait locus (eQTL) data, which relate SNPs in intergenic regions to the distal modulation of gene expression. Such datasets are available for many different human tissues and can provide additional regulatory information for pathways and the genes they comprise.
FIGURE 1. Pie chart of the 5 SNP categories. The total number of SNPs is 2767.
REFERENCES
Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447(7145):661-78.
Johnson A, O'Donnell C. An Open Access Database of Genome-wide Association Results. BMC Medical Genetics. 2009;10(1):6.
Kutmon M, Riutta A, Nunes N, Hanspers K, Willighagen E, Bohler A, Mélius J, Waagmeester A, Sinha S, Miller R, Coort S, Cirillo E, Smeets B, Evelo C, Pico A. WikiPathways: Capturing the Full Diversity of Pathway Knowledge. Nucleic Acids Res, Database issue, 2016.
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Research. 2003;13(11):2498-504.
P11. IDENTIFICATION OF TRANSCRIPTION FACTOR CO-ASSOCIATIONS
IN SETS OF FUNCTIONALLY RELATED GENES
Pieter De Bleser1,2,4*, Arne Soetens1,2,4 & Yvan Saeys1,3,4.
VIB Inflammation Research Center1; Department of Biomedical Molecular Biology2; Department of Respiratory Medicine3; Ghent University4.
Co-associations between transcription factors (TFs) have been studied genome-wide and resulted in the identification of
frequently co-associated pairs of TFs. Co-association of TFs at distinct binding sites is contextual: different combinations
of TFs co-associate at different genomic locations, producing a condition-dependent gene expression profile for a cell.
Here, we present a novel method to identify these condition-dependent co-associations of TFs in sets of functionally
related genes.
INTRODUCTION
The functional expression of genes is achieved by
particular interactions of regulatory transcription factors
(TFs) operating at specific DNA binding sites of their
target genes. Dissecting the specific co-associations of TFs
that bind each target gene represents a difficult challenge.
Co-associations of transcription factor pairs have been
studied genome-wide and resulted in the identification of
frequently co-associated pairs of TFs (ENCODE Project
Consortium, 2012). It was found that TFs co-associate in a
context-specific fashion: different combinations of TFs
bind different target sites and the binding of one TF might
influence the preferred binding partners of other TFs. Here,
we present a tool to identify these condition-dependent co-
associations of TFs in sets of functionally related genes
(e.g. metabolic pathways, tissues, sets of TF target genes,
sets of differentially regulated genes).
METHODS
In a first step, we determine the set of regulatory TFs for
each gene (Tang et al., 2011) in the set using the ChIP-Seq
binding data for 237 TFs from the ReMap database
(Griffon et al., 2015). This results in a number of
regulatory ChIP-Seq binding regions per TF per gene, represented as a matrix in which each row corresponds to a gene and each column to one of the TFs. In a
next step, this matrix is used as input to the distance
difference matrix (DDM) algorithm, modified to
accommodate this data. The DDM algorithm is a method
that simultaneously integrates statistical over-representation and co-association of TFs (De Bleser et al.,
2007). The result matrix is subsequently reduced, retaining
only the columns of over-represented and co-associated
TFs. Visualization is done by (1) hierarchical clustering of
the reduced result matrix and reordering of the columns
and (2) conversion of the reduced result matrix into a SIF
(simple interaction file format) file, summarizing the
regulator-regulated relationships between transcription
factors and target genes. This SIF file can be imported into Cytoscape for visualization of the regulatory network.
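The gene-by-TF matrix and its SIF export can be sketched as follows; the counts, gene and TF names below are toy data, and the "binds" relation label is an illustrative assumption rather than the tool's actual output vocabulary:

```python
# Sketch: export a gene-by-TF binding-count matrix as SIF
# (one "regulator <tab> relation <tab> target" line per non-zero entry).

def to_sif(matrix, genes, tfs, relation="binds"):
    """One SIF line per TF with at least one binding region on a gene."""
    lines = []
    for gene, row in zip(genes, matrix):
        for tf, count in zip(tfs, row):
            if count > 0:
                lines.append(f"{tf}\t{relation}\t{gene}")
    return "\n".join(lines)

genes = ["FOXF1", "TBX3"]
tfs = ["EZH2", "SUZ12"]
counts = [[3, 2],   # FOXF1: 3 EZH2 regions, 2 SUZ12 regions
          [1, 0]]   # TBX3: EZH2 only
print(to_sif(counts, genes, tfs))
```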
RESULTS & DISCUSSION
FOXF1, TBX3, GATA6, IRX3, PITX2, DLL1 and
NKX2-5 are experimentally verified target genes of the
EZH2 transcription factor (Grote et al., 2013).
Running the transcription factor co-association analysis
method on this data set results in the clustering solution
plot shown in Figure 1.
The strongest associations between TFs are found between
EZH2, POU5F1, SUZ12 and CTBP2. A secondary cluster
of transcription factor associations is composed of
EOMES, SMAD2+3 and NANOG.
The finding of SUZ12 as a cofactor can be accounted for:
EZH2 and SUZ12 are subunits of Polycomb repressive
complex 2 (PRC2), which is responsible for the repressive
histone 3 lysine 27 trimethylation (H3K27me3) chromatin
modification (Yoo and Hennighausen, 2012). CTBP2 is a
known transcriptional repressor (Turner and Crossley,
2001).
The method has been applied previously for the
identification of TFs associated with both high tissue-
specificity and high gene expression levels (Rincon et al.,
2015). The method will be made available as a web tool.
FIGURE 1. Transcription factor co-associations in the EZH2 data set.
Note the tendency of EZH2 to co-localize with POU5F1, SUZ12 and
CTBP2.
REFERENCES
De Bleser,P. et al. (2007) A distance difference matrix approach to identifying transcription factors that regulate differential gene expression. Genome Biol., 8, R83.
ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements
in the human genome. Nature, 489, 57–74.
Griffon,A. et al. (2015) Integrative analysis of public ChIP-seq experiments reveals
a complex multi-cell regulatory landscape. Nucleic Acids Res., 43, e27.
Grote,P. et al. (2013) The tissue-specific lncRNA Fendrr is an essential regulator of
heart and body wall development in the mouse. Dev. Cell, 24, 206–214.
Rincon,M.Y. et al. (2015) Genome-wide computational analysis reveals
cardiomyocyte-specific transcriptional Cis-regulatory motifs that enable
efficient cardiac gene therapy. Mol. Ther. J. Am. Soc. Gene Ther., 23, 43–52.
Tang,Q. et al. (2011) A comprehensive view of nuclear receptor cancer cistromes.
Cancer Res., 71, 6940–6947.
Turner,J. and Crossley,M. (2001) The CtBP family: enigmatic and enzymatic
transcriptional co-repressors. BioEssays News Rev. Mol. Cell. Dev. Biol., 23,
683–690.
Yoo,K.H. and Hennighausen,L. (2012) EZH2 methyltransferase and H3K27
methylation in breast cancer. Int. J. Biol. Sci., 8, 59–65.
P12. PHENETIC: MULTI-OMICS DATA INTERPRETATION USING INTERACTION NETWORKS
Dries De Maeyer1,2,3*, Bram Weytjens1,2,3, Luc De Raedt4 & Kathleen Marchal2,3.
Centre for Microbial and Plant Genetics, KULeuven1; Department for Information Sciences (INTEC, iMinds), UGent2; Department for Plant Biotechnology and Bioinformatics, UGent3; Department of Computer Science, KULeuven4.
The omics revolution has introduced new challenges when studying interesting phenotypes. High-throughput omics technologies such as next-generation sequencing and microarrays generate large amounts of data. Interpreting the results of these experiments is not trivial due to the data's size and the inherent noise of the underlying technologies. In addition, the omics technologies have produced an ever-expanding body of biological knowledge that has to be taken into account when interpreting new experimental results. Interaction networks in combination with subnetwork inference methods provide a solution to this problem: they mine the current public interactomics knowledge using experimental omics data to better understand the molecular mechanisms driving the phenotypes under study.
INTRODUCTION
Computational methods are becoming essential for analyzing large-scale omics datasets in the light of current knowledge. By representing publicly available interactomics knowledge as interaction networks, subnetwork inference methods can extract the actual molecular mechanisms that drive an interesting phenotype. The PheNetic framework is such a method: it allows interaction networks to be mined with multi-omics datasets. Using this framework, different types of biological applications have been addressed, such as KO-transcriptomics interpretation (De Maeyer, 2013), expression analysis (De Maeyer, 2015) and distinguishing driver from passenger mutations in eQTL experiments (De Maeyer, submitted).
METHODS
Interaction networks provide a flexible representation of
public biological interactomics knowledge. These
networks represent the physical interactions between
genes and their corresponding gene products in the
interactome of the organism under research (Cloots, 2011).
The interaction network integrates different layers of
homogeneous interactomics data, e.g. signalling, protein-
protein, (post)transcriptional and metabolic interactomics
data, into a single heterogeneous network representation.
The PheNetic framework uses interaction networks to find
biologically valid paths which connect (in)activated genes
selected from multi-omics data sets. These paths provide a
biological explanation of how the genes from these data
sets can trigger each other. Finding the best explanations
or paths in the interaction network corresponds to finding
that subnetwork that best explains the observed results and
provides an insight into the molecular mechanisms that
drive the interesting phenotype. Depending on the type of biological application and the provided data, different types of paths can be used to infer the subnetwork, as in KO-transcriptomics interpretation (De Maeyer, 2013), expression analysis (De Maeyer, 2015) and the interpretation of eQTL experiments (De Maeyer, submitted).
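The path-based idea above can be illustrated with a deliberately simplified toy: connect the selected (in)activated genes through the network and keep the union of connecting paths as the subnetwork. PheNetic itself scores probabilistic paths over a weighted, typed network; this sketch uses plain unweighted shortest paths and invented node names:

```python
# Sketch: a subnetwork as the union of shortest paths between seed genes.
from collections import deque

def shortest_path(adj, src, dst):
    """Unweighted shortest path from src to dst via BFS, or None."""
    prev, seen = {}, {src}
    q = deque([src])
    while q:
        node = q.popleft()
        if node == dst:
            path = [node]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return path[::-1]
        for nb in adj.get(node, []):
            if nb not in seen:
                seen.add(nb)
                prev[nb] = node
                q.append(nb)
    return None

def subnetwork(adj, seeds):
    """Union of nodes on shortest paths between every pair of seed genes."""
    nodes = set()
    for i, a in enumerate(seeds):
        for b in seeds[i + 1:]:
            path = shortest_path(adj, a, b)
            if path:
                nodes.update(path)
    return nodes

# Toy interaction network: a mutated gene linked to a differentially
# expressed gene through a regulator (all names hypothetical).
adj = {"mutX": ["regA"], "regA": ["mutX", "geneB"],
       "geneB": ["regA", "geneC"], "geneC": ["geneB"]}
print(subnetwork(adj, ["mutX", "geneC"]))  # all four nodes on the path
```

Intermediate nodes pulled in this way ("regA", "geneB" above) are the mechanistic explanation the method reports: genes not in the input data that connect the observations.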
RESULTS & DISCUSSION
In a first setup PheNetic was used to study the pathways
and processes involved in acid resistance in Escherichia
coli (De Maeyer, 2013). Using our framework we were
able to determine the different molecular pathways that
drive acid resistance and identify the regulators that
underlie this phenotype. It was shown that subnetwork inference methods outperform naïve gene rankings in identifying the biological pathways associated with the phenotype under research.
In a second setup PheNetic was used to interpret expression data (De Maeyer, 2015) and to extract those parts of the interaction network that show differences in expression. This method is provided as a web server, accessible at http://bioinformatics.intec.ugent.be/phenetic, that allows an intuitive and visual interpretation of the inferred subnetworks.
In a third setup PheNetic was used to separate driver mutations from passenger mutations in coupled genetic-transcriptomics data sets from evolution experiments (De Maeyer, submitted). Evolved strains with the same phenotype are expected to show consistent changes in the same pathways. Therefore, finding the subnetwork that best connects the mutations to the differentially expressed genes over all strains is expected to single out driver mutations from passenger mutations, while also identifying the molecular mechanisms that induce the observed change in phenotype. This approach provides a systemic insight into both the biological processes and the genetic background that induce the phenotype.
Based on the different approaches it can be concluded that
PheNetic is a flexible framework for subnetwork selection
that allows for solving a large variety of biological
applications using multi-omics data sets.
REFERENCES
Cloots L & Marchal K (2011). Curr Opin Microbiol, 14(5), 599-607.
De Maeyer D, Renkens J, Cloots L, De Raedt L & Marchal K (2013). Mol Biosyst, 9(7), 1594-1603.
De Maeyer D, Weytjens B, Renkens J, De Raedt L & Marchal K (2015). Nucleic Acids Res, 43(W1), W244-250.
De Maeyer D, Weytjens B, De Raedt L & Marchal K. Molecular Biology and Evolution. Submitted.
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: P Poster
P13. THE ROLE OF HLA ALLELES UNDERLYING CYTOMEGALOVIRUS
SUSCEPTIBILITY IN ALLOGENEIC TRANSPLANT POPULATIONS
Nicolas De Neuter1,2*, Benson Ogunjimi3, Anke Verlinden4, Kris Laukens1,2 & Pieter Meysman1,2.
Advanced Database Research and Modeling (ADReM), University of Antwerp1; Biomedical informatics research center
Antwerpen (biomina)2; Centre for Health Economics Research and Modeling Infectious Diseases (CHERMID), Vaccine
and Infectious Disease Institute, University of Antwerp3; Antwerp University Hospital4.
In this study, we aim to characterize those HLA alleles that increase or decrease the risk of cytomegalovirus infections
following tissue or organ transplants. This HLA-dependent susceptibility will then be explained using state-of-the-art
HLA peptide affinity prediction methods to identify the underlying molecular mechanism. This insight can greatly
aid the prediction of which transplantation patients are most at risk of cytomegalovirus infection.
INTRODUCTION
Patients suffering from disorders of the hematopoietic
system or with chemo-, radio-, or immuno- sensitive
malignancies such as leukemia often receive
hematopoietic stem cell transplantation therapy (HSCT).
The transplantation is preceded by a conditioning regimen
that eradicates the recipient’s malignant cell population
through intensive chemotherapy and irradiation,
simultaneously ablating the recipient’s bone marrow. Self
(autologous) or non-self (allogeneic) hematopoietic stem
cells are then reintroduced into the recipient after which
they are allowed to reestablish hematopoietic functions.
HSCT is associated with high morbidity and mortality and
requires careful monitoring of patients during the weeks
following transplantation. Opportunistic cytomegalovirus
(CMV) infections are one of the major causes of this high
morbidity and mortality and can occur in up to 80% of
HSCT patients, depending on the use of prophylactic
treatment or pre-emptive therapy and the serological CMV
status of donor and recipient. CMV disease can manifest
itself as life-threatening pneumonia, gastrointestinal
disease, retinitis, encephalitis or hepatitis.
The relevance of HLA alleles in varicella zoster virus
associated disease has recently been demonstrated by our
group (Meysman et al., 2015) and similar insights might
be gained in CMV related disease. Several studies have
already shown a correlation between the incidence of
CMV infection and the presence of certain human
leukocyte antigen (HLA) alleles in the transplant
recipient. However, the alleles identified in previous
studies are highly inconsistent, likely due to small sample
sizes and type I errors arising from multiple testing.
METHODS
Anonymized patient records on the HLA alleles, CMV
infection and serological status of 1284 transplant
recipients were collected from the Antwerp University
Hospital (UZA). This data set was further extended with
publicly available HLA data from transplant patients, and
the counts for the HLA alleles present at each locus were
combined. A hypergeometric distribution was used to test
HLA loci (A, B, C, DRB1, DQB1 and DPB1) for
statistical over- or underrepresentation of their respective
alleles. HLA alleles were tested for over- or
underrepresentation in two test populations: recipients
who were seropositive for CMV before transplantation
and recipients who developed a CMV infection post-
transplantation. In the latter case, we also examined
whether donor seropositivity had an influence on the
CMV infection status. The P value cutoff used was 0.05,
Bonferroni-corrected for multiple testing, in this case the
number of alleles tested per locus.
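The allele test amounts to a hypergeometric tail probability compared against a Bonferroni-adjusted threshold. A minimal standard-library sketch follows; all counts and allele names are invented for the example and are not the UZA data.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) for a hypergeometric draw: N allele observations in
    total, K copies of the tested allele, n observations drawn (the
    test population)."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Hypothetical counts at one locus: total allele observations, copies
# of each allele overall, and copies within the CMV+ test population.
N, n = 2000, 400
alleles = {"A*01:01": (180, 60), "A*02:01": (500, 95), "A*03:01": (120, 20)}

alpha = 0.05 / len(alleles)            # Bonferroni: alleles tested per locus
for name, (K, k) in alleles.items():
    p_over = hypergeom_sf(k, N, K, n)  # overrepresentation tail
    flag = "enriched" if p_over < alpha else "n.s."
    print(f"{name}: P(over) = {p_over:.2e} ({flag})")
```

Underrepresentation would use the lower tail in the same way; `scipy.stats.hypergeom` offers the same quantities for larger analyses.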
Putative nonameric peptides were generated in silico from
CMV protein sequences available in online protein
sequence repositories such as the UniProt Knowledgebase.
Three complementary methods were employed to predict
the affinity of each putative nonameric peptide to the
significantly enriched or depleted HLA alleles. The
methods used were: NetCTLpan, the stabilized matrix
method (SMM) and an in-house-developed approach
called CRFMHC. Peptide-binding affinity results of each
predictor were normalized against the affinity of a
restricted panel of human proteins and used to compare
results between predictors. Additionally, each CMV
protein was assessed for depletion of high-affinity
peptides using a hypergeometric distribution.
RESULTS
Preliminary results on a small portion of the UZA data
reveal HLA alleles underlying either CMV seropositivity
or CMV infection with a trend towards significance, but
these do not reach the Bonferroni-corrected threshold. We
the additional data to increase the power of the analysis.
REFERENCES
Meysman, P. et al. (2015) Varicella-Zoster Virus-Derived Major
Histocompatibility Complex Class I-Restricted Peptide Affinity Is
a Determining Factor in the HLA Risk Profile for the
Development of Postherpetic Neuralgia. J. Virol., 89, 962-969.
P14. NOVOPLASTY: IN SILICO ASSEMBLY OF PLASTID GENOMES FROM
WHOLE GENOME NGS DATA
Nicolas Dierckxsens1,2*, Olivier Hardy2, Ludwig Triest3, Patrick Mardulyn2 & Guillaume Smits1,4.
Interuniversity Institute of Bioinformatics Brussels (IB2), ULB-VUB, Triomflaan CP 263, 1050 Brussels, Belgium1;
Evolutionary Biology and Ecology Unit, CP 160/12, Faculté des Sciences, Université Libre de Bruxelles, Av. F. D.
Roosevelt 50, B-1050 Brussels, Belgium2; Plant Biology and Nature Management, Vrije Universiteit Brussel, Brussels,
Belgium3; Department of Paediatrics, Hôpital Universitaire des Enfants Reine Fabiola (HUDERF), Université Libre de
Bruxelles (ULB), Brussels, Belgium4.
Thanks to advances in next-generation sequencing (NGS) technology, whole genome data can be readily obtained
from a variety of samples. Many algorithms are available to assemble these reads, but few focus on assembling
plastid genomes. We therefore developed a new algorithm that assembles only the plastid genomes from whole
genome data, starting from a single seed. The algorithm takes full advantage of very high coverage, which even
enables assembly through problematic (AT-rich) regions. It has been tested on several whole genome Illumina
datasets and outperformed other assemblers in runtime and specificity. Every assembly resulted in a single contig for
each chloroplast or mitochondrial genome, always within a timeframe of 30 minutes.
INTRODUCTION
Chloroplasts and mitochondria are both responsible for
generating metabolic energy within eukaryotic cells. Both
plastids are maternally inherited and have a persistent gene
organization, which makes them ideal for phylogenetic
studies or as a barcode in plant and food identification
(Brozynska et al., 2014). But assembling these plastid
genomes is not always straightforward with the currently
available tools. We therefore developed a new algorithm,
specifically for the assembly of plastid genomes from
whole genome data.
METHODS
The algorithm is written in Perl. All assemblies were
executed on an Intel Xeon machine with 24 cores at
2.93 GHz and a total of 96.8 GB of RAM. All non-human
samples were sequenced on the Illumina HiSeq platform
(101 bp paired-end reads). The human mitochondrial
samples (PCR-free) were sequenced on the Illumina
HiSeq X platform (150 bp paired-end reads). The
Gonioctena intermedia sample was also sequenced on the
PacBio platform.
RESULTS & DISCUSSION
Algorithm. The algorithm is similar to string overlap
algorithms like SSAKE (Warren et al., 2007) and VCAKE
(Jeck et al., 2007). It starts by reading the sequences into a
hash table, which enables fast access. The assembly is
initiated by a seed that is then extended bidirectionally in
iterations. The seed input is quite flexible: it can be a
single sequence read, a conserved gene or even a complete
mitochondrial genome from a distant species. Every base
extension is determined by a consensus between the
overlapping reads. Unlike most assemblers, NOVOPlasty
does not try to assemble every read, but extends the given
seed until the circular plastid genome is formed.
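The seed-extension loop can be illustrated with a toy sketch. This is not NOVOPlasty's Perl code: read indexing and consensus are reduced to plain substring matching over a made-up 30 bp "genome", and only rightward extension is shown.

```python
from collections import Counter

def extend_seed(seed, reads, overlap=8, max_steps=200):
    """Greedily extend a seed to the right: at each step, find reads
    whose sequence contains the current contig tail and take the
    consensus of the bases they propose next."""
    contig = seed
    for _ in range(max_steps):
        tail = contig[-overlap:]
        votes = Counter()
        for r in reads:
            i = r.find(tail)
            if i != -1 and i + overlap < len(r):
                votes[r[i + overlap]] += 1   # base proposed after the overlap
        if not votes:
            break                            # no read extends the contig
        contig += votes.most_common(1)[0][0]
    return contig

# Toy example: reads tile a 30 bp "genome"; a 10 bp seed is extended.
genome = "ATGCGTACGTTAGCCGATCGATTACGGCAT"
reads = [genome[i:i + 15] for i in range(0, 16)]
print(extend_seed(genome[:10], reads))   # reconstructs the full toy genome
```

A real implementation hashes fixed-length read prefixes for speed and checks for circularity (the contig ends overlapping its own start) to terminate.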
Assemblies. NOVOPlasty has so far been tested on the
assembly of 8 chloroplasts and 6 mitochondria. Since
chloroplasts contain an inverted repeat, two versions of the
assembly are generated. They differ only in the orientation
of the region between the two repeats; the correct one has
to be resolved manually. Except for the mitochondrion of
the leaf beetle Gonioctena intermedia, all assemblies
resulted in a complete circular genome. A comparative
study of four assemblers for the mitochondrial genome of
G. intermedia clearly shows the speed and specificity of
NOVOPlasty (Table 1).
                          NOVOPlasty    MIRA    MITObim     ARC
Duration (min)                    12     536      4777*     586
Memory (GB)                       15    57.6       63.4     1.9
Storage (GB)                       0     144        418      12
Total contigs                      1    3434       2221    2502
Mitochondrial contigs              1       1          4      48
Coverage (%)                      98      94         94      84
Mismatches                        10      25         26       2
Unidentified nucleotides          43     194        197       0
TABLE 1. Benchmarking results for four assemblies of the
mitochondrial genome of Gonioctena intermedia. The assemblies were
constructed with NOVOPlasty, MIRA (Chevreux et al., 1999), MITObim
(Hahn et al., 2013) and ARC (Hunter et al., 2015). *Manually terminated.
Discussion. Despite the many available assemblers, many
researchers still struggle to find a good assembler for
plastid genomes. NOVOPlasty offers an assembler
specifically designed for plastids that delivers the
complete genome within 30 minutes. The algorithm will
be tested on more datasets, and a comparative study with
other assemblers is in progress.
REFERENCES
Brozynska et al. PLoS One 9 (2014).
Chevreux et al. Computer Science and Biology: Proceedings of the
German Conference on Bioinformatics (GCB) (1999).
Hahn et al. Nucleic Acids Research, 1-9 (2013).
Hunter et al. http://dx.doi.org/10.1101/014662 (2015).
Jeck et al. Bioinformatics 23, 2942-2944 (2007).
Warren et al. Bioinformatics 23, 500-501 (2007).
P15. ENANOMAPPER - ONTOLOGY, DATABASE AND TOOLS FOR
NANOMATERIAL SAFETY EVALUATION
Friederike Ehrhart1, Linda Rieswijk1, Chris T. Evelo1, Haralambos Sarimveis2, Philip Doganis2, Georgios Drakakis2,
Bengt Fadeel3, Barry Hardy4, Janna Hastings5, Christoph Helma6, Nina Jeliazkova7, Vedrin Jeliazkov7,
Pekka Kohonen8,9, Roland Grafström9, Pantelis Sopasakis10, Georgia Tsiliki2 & Egon Willighagen1.
Department of Bioinformatics - BiGCaT, Maastricht University1; National Technical University of Athens2; Karolinska
Institutet3; Douglas Connect4; European Molecular Biology Laboratory – European Bioinformatics Institute5; In silico
toxicology6; Ideaconsult Ltd.7; VTT Technical Research Centre of Finland8; Misvik Biology9; IMT Institute for Advanced
Studies10.
eNanoMapper is an open computational infrastructure for engineered nanomaterial data: it comprises a semantic-web-
supported database, an ontology, user applications for the upload and download of experimental data, and tools for
modelling.
INTRODUCTION
Nanomaterials are defined by size: between 1 nm and 100
nm in at least one dimension. The properties of these
materials do not always resemble those of the bulk
material, i.e. micro-sized and bigger particles, or solutions.
Depending on their size and surface properties,
nanomaterials can differ in reactivity and in toxicity to
biological organisms and ecosystems, and there is the
possibility of "leakage" of the material they are made of.
That is why it is so difficult to assess the safety of
nanomaterials, and why the NanoSafety Cluster defined
the need for a new computational infrastructure in 2012.
eNanoMapper is a European project with partners from
eight European countries. The project has been developing
a computational infrastructure consisting of a semantic
web assisted database, a modular ontology, and tools to
use them for nanomaterial safety assessment. Data sharing,
data storage, data analysis tools, and web services are at
various stages of development, testing, and production
use. The project website can be found at
www.enanomapper.net.
PROBLEM
The eNanoMapper platform is designed to support hosting
of data on nanomaterial properties relevant for nanosafety
assessment as found in existing databases like the
NanoMaterial Registry, DaNa Knowledge Base,
Nanoparticle Information Library NIL, Nanomaterial-
Biological Interactions Knowledgebase, caNanoLab,
InterNano, Nano-EHS Database Analysis Tool, nanoHUB,
etc. Each of them has different data formats and
descriptors, like CODATA-VAMAS’ Universal
Description System, ISO-Tab(-Nano), OECD templates,
custom spreadsheets, and images. Interoperability is a
main aim: semi-automatic import or upload of
information, and its integration into the eNanoMapper
data structure, is being enabled. Vice versa, retrieval or
download of experimental data from the database for
(re-)analysis is provided too, using programmable
interfaces to the data and the ontology. Database and
search functionality should be semantic web compatible:
the project developed and maintains a nanosafety ontology
to support this. This eNanoMapper ontology was
developed using the Web Ontology Language and the
challenge is to map nanomaterial terms to their multiple
ontology terms, namely physico-chemical properties,
biological and ecological impact, experimental assay
description, and known safety aspects.
RESULTS & DISCUSSION
The current eNanoMapper demo database instance,
available at https://data.enanomapper.net/, contains the
physico-chemical, biological and environmental properties
of 465 different nanomaterials1. Loading data into the
database supports various formats, including the OECD
Harmonized Templates and the data structure used by the
NanoWiki2. A web interface is designed to support the
typical interactions with the database, including uploading
of experimental data as well as querying data to support
analysis and modelling of nanoparticle properties. The
eNanoMapper ontology is available at
http://purl.enanomapper.net/onto/enanomapper.owl and is
based on a multi-faceted description of nanoparticles
concerning nanoparticle types, physico-chemical
description, life cycle, biological and environmental
characterisation including experimental methods and
protocols, and safety information3. The terms are verified
against the definitions of REACH, ISO, or common
practices used in science in general. The often-confused
meanings of endpoints and assays were distinguished in
the definitions, e.g. size versus size measurement assay. It
was partly possible to use existing ontologies as a basis,
e.g. NPO, ChEBI, GO, etc., but many terms had to be
added manually. Currently, there are 4592
classes defined. Users can access and download the
ontology from the U.S. National Center for Biomedical
Ontology BioPortal platform,
http://bioportal.bioontology.org/ontologies/ENM.
REFERENCES
1 Jeliazkova, N. et al. The eNanoMapper database for nanomaterial
safety information. Beilstein Journal of Nanotechnology 6,
1609-1634, doi:10.3762/bjnano.6.165 (2015).
2 Willighagen, E.; doi:10.6084/m9.figshare.1330208
3 Hastings, J. et al. eNanoMapper: harnessing ontologies to enable
data integration for nanomaterial risk assessment. J Biomed
Semantics 6, 10, doi:10.1186/s13326-015-0005-5 (2015).
P16. BIOMEDICAL TEXT MINING FOR DISEASE-GENE DISCOVERY:
SOMETIMES LESS IS MORE
Sarah ElShal1,2*, Jesse Davis3 & Yves Moreau1,2.
Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data
Analytics, KU Leuven1; iMinds Future Health Department, KU Leuven2; Department of Computer Science,
KU Leuven3.
Biomedical text is increasingly being made available online, in either abstract or full-article format. In parallel
grows the desire to extract knowledge from such text (e.g. finding links between diseases and genes). Consequently,
text mining is very popular in the biomedical domain, as it makes it possible to automatically analyze these texts in
order to extract knowledge. One of the big challenges in text mining is recognizing named entities (e.g. disease and
gene entities) inside a given text, widely known as Named Entity Recognition (NER). We studied two biomedical
taggers that apply different NER methods to MEDLINE abstracts. Here, we compare the contribution of each of the
two taggers to associating genes with diseases. We show that with fewer recognized entities we gain more knowledge
and better associate genes with diseases.
INTRODUCTION
MEDLINE currently has more than 25 million biomedical
citations from different journals all over the world. With
this vast amount of text available, it is increasingly
important to mine such data and find the best ways to
extract relevant knowledge out of it. One example of such
knowledge is links between diseases and genes. However,
it is very challenging and time-consuming to recognize
biomedical entities inside a given text, given the evolving
number of dictionaries and tagging strategies. Different
taggers exist that map MEDLINE abstracts to biomedical
entities. Such tagged entities can be used to generate
disease and gene profiles, and by applying certain
similarity measures we can extract knowledge and
generate disease-gene hypotheses.
METHODS
We compare two MEDLINE taggers that map the whole
set of MEDLINE abstracts to biomedical entities (e.g.
genes, diseases, GO and MeSH terms …). The first one is
MetaMap (Aronson et al., 2010), and the second one has
been used as a text mining pipeline in many resources,
latest in DISEASES (Pletscher-Frankild et al., 2015). For
the sake of simplicity, we refer to the second tagger as
m_tagger throughout the rest of the abstract. For each
MEDLINE abstract we thus obtain two sets of mapped
entities: (1) the metamap set and (2) the m_tagger set.
Over all abstracts, the metamap set corresponds to 78,298
distinct entities vs. 29,536 for the m_tagger set.
In order to compare the contribution of each tagger to the
disease-gene association process, we proceeded as follows.
First, we generated a validation set from the OMIM
database to acquire a list of experimentally-validated
disease-gene pairs. Second, we generated an entity profile
for every gene in our database and for every disease in our
validation set. Each profile holds the TF-IDF score of
every entity, calculated over the set of abstracts found to
be linked with that disease or gene. Then for every
disease, we computed the
cosine similarity between its profile and all the gene
profiles. This yields a similarity score for each
disease-gene pair, which we used to rank the genes for
a given disease. We computed the average recall at the top
10, 25, 50, and 100 ranked genes. We ran this analysis
once according to the metamap set and once according to
the m_tagger set. We also tried another association
measure where we filtered the profiles such that they only
contain gene entities. Then we ranked the genes according
to their TF-IDF scores in a given disease profile. This
corresponds to 9,290 gene entities in the metamap set, and
10,003 entities in the m_tagger set. Again we measured
the average recall at the different rank thresholds, and we
repeated the analysis using the metamap and m_tagger
profiles.
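The profile construction and ranking can be sketched as follows. This is a standard-library-only illustration; the entity counts, gene and disease names are made up and bear no relation to the actual MEDLINE-scale profiles.

```python
from math import log, sqrt

def tfidf_profiles(raw_counts):
    """raw_counts: {name: {entity: count}} -> TF-IDF weighted profiles.
    IDF is computed over the given set of profiles."""
    df = {}
    for counts in raw_counts.values():
        for e in counts:
            df[e] = df.get(e, 0) + 1
    n = len(raw_counts)
    return {name: {e: c * log(n / df[e]) for e, c in counts.items()}
            for name, counts in raw_counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse profiles (dicts)."""
    num = sum(a[e] * b[e] for e in set(a) & set(b))
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

# Invented entity counts from abstracts linked to each gene/disease.
genes = {"BRCA2": {"repair": 5, "cancer": 3},
         "INS": {"glucose": 6, "cancer": 1},
         "HBB": {"anemia": 4}}
disease = {"breast carcinoma": {"repair": 2, "cancer": 4}}

profiles = tfidf_profiles({**genes, **disease})
d = profiles["breast carcinoma"]
ranking = sorted(genes, key=lambda g: cosine(profiles[g], d), reverse=True)
print(ranking)   # → ['BRCA2', 'INS', 'HBB']
```

Recall at a rank cutoff then simply counts how many validated genes for the disease appear in the top of `ranking`.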
RESULTS & DISCUSSION
Figure 1 presents the recall results on the OMIM
validation set. We observe that MetaMap and m_tagger
result in comparable recall when ranking the genes
according to their cosine similarity with the disease
profiles. We also observe that m_tagger results in the best
recall when simply ranking the genes according to their
TF-IDF scores inside the disease profile.
FIGURE 1. Recall results on the OMIM validation set: comparing the
contribution of MetaMap and M_tagger, once with cosine similarity and once with TF-IDF ranks.
Even though the m_tagger set contains fewer entities than
the metamap one, we gained the same knowledge to
associate genes with diseases. Moreover, when we further
reduced this set of entities to genes only, we gained even
more knowledge and better associated genes with
diseases.
REFERENCES
Aronson A.R. et al. An overview of MetaMap: historical perspective and
recent advances. J. Am. Med. Inform. Assoc. 17, 229-236 (2010).
Pletscher-Frankild S. et al. DISEASES: text mining and data integration
of disease-gene associations. Methods 74, 83-89 (2015).
P17. TUNESIM - TUNABLE VARIANT SET SIMULATOR FOR NGS READS
Bertrand Escaliere1,2, Nicolas Simonis1,3, Gianluca Bontempi1,2 & Guillaume Smits1,4.
Interuniversity Institute of Bioinformatics in Brussels1; Machine Learning Group, Université Libre de Bruxelles2;
Institut de Pathologie et de Génétique3; Hopital Universitaire des Enfants Reine Fabiola, Université Libre de
Bruxelles4.
Optimization of NGS analysis software and pipelines is crucial in order to improve the discovery of (new)
disease-causing variants. A better combination of existing tools and the right choice of parameters can lead to more
specific and sensitive calling. Simulated datasets allow the step-by-step evaluation of new alignment or calling
software. Creating a simulator able to insert known human variants at a realistic minor allele frequency, and artificial
variants in a tunable, controlled way, would make it possible to overcome three optimization limits: complete
knowledge of the input dataset, allowing exact calling sensitivity and accuracy to be determined; optimization on the
appropriate population; and the capacity to dynamically test a pipeline one variable at a time.
INTRODUCTION
Identification of the anomalies causing genetic disorders is
difficult. It can be limited by the rarity of the affliction
concerned, by the genetic heterogeneity of the disorder, or
by the phenotypic pleiotropy associated with anomalies in
a single gene. Exome and genome sequencing have
allowed the identification of the causes of many genetic
diseases whose origin had remained inaccessible to the
usual techniques of genetics research (Ng et al., 2009),
(Gilissen et al., 2012), (Yang et al., 2013), (Gilissen et al.,
2014). Exome and genome sequencing data analysis
pipelines consist of several steps (roughly: alignment,
quality filters, variant calling), and several software tools
are available for those steps. Evaluation and comparison
of those tools are crucial in order to improve pipeline
accuracy. Exome and genome sequencing simulations
make it possible to determine the veracity of called
variants (false positives and false negatives).
METHODS
We implemented TuneSIM, a wrapper around the NGS
read simulator dwgsim
(http://sourceforge.net/projects/dnaa/) that adds realistic
mutations. Generated reads contain real mutations from
the 1KG project and dbSNP 138. We use the existing tool
dwgsim for read generation. In order to generate data that
are as realistic as possible, we decided to keep the
haplotype block structure. We computed blocks with Plink
(Purcell et al., 2007), using VCF files from 1KG project
phase 3 for European individuals. For each block, we
obtained the frequency of each combination of variants,
and we used these frequencies for block selection. We also
insert variants independently, using their frequencies in
dbSNP (Smigielski et al., 2000). Using 33 in-house
samples, we computed global allele frequency
distributions of variants in coding and non-coding regions,
and we select variants according to those frequencies. A
similar procedure was applied for CNV insertion using
1KG data. We are developing a web interface allowing
users to download existing generated datasets. After
running their pipelines, users can upload their output and
see the accuracy of their pipelines.
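The frequency-weighted block selection can be sketched as follows. This is a toy illustration, not TuneSIM's code; the blocks, variant IDs and frequencies are invented for the example.

```python
import random

# Invented haplotype blocks: each maps a variant combination (tuple of
# variant IDs) to the frequency observed for that combination; an empty
# tuple means the reference haplotype for that block.
blocks = [
    {("rs1", "rs2"): 0.6, ("rs1",): 0.3, (): 0.1},
    {("rs9",): 0.5, (): 0.5},
]

def sample_haplotype(blocks, rng=random):
    """Pick one variant combination per block, weighted by its observed
    frequency, and return the union of the selected variants."""
    chosen = []
    for block in blocks:
        combos = list(block)
        freqs = [block[c] for c in combos]
        chosen.extend(rng.choices(combos, weights=freqs, k=1)[0])
    return chosen

random.seed(7)
print(sample_haplotype(blocks))
```

Sampling whole combinations per block, rather than each variant independently, is what preserves the linkage structure the abstract describes.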
RESULTS & DISCUSSION
Simulations with different coverages and indel rates have
been performed and analysed with different pipelines. The
results will be presented.
REFERENCES
Gilissen, et al. (2012). Disease gene identification strategies for exome
sequencing. Eur J Hum Genet, 20, 490-497.
Gilissen, et al. (2014). Genome sequencing identifies major causes of
severe intellectual disability. Nature, 511, 344-347.
Ng, S. B., et al. (2009). Exome sequencing identifies the cause of a
mendelian disorder. Nature Genetics, 42, 30-35.
Purcell, et al. (2007). PLINK: a tool set for whole-genome association
and population-based linkage analyses. American Journal of Human
Genetics, 81, 559-575.
Smigielski, E. M., Sirotkin, K., Ward, M., & Sherry, S. T. (2000). dbSNP:
a database of single nucleotide polymorphisms. Nucleic Acids
Research, 28, 352-355.
Yang, et al. (2013). Clinical Whole-Exome Sequencing for the Diagnosis
of Mendelian Disorders. N Engl J Med, 369, 1502-1511.
P18. RNA-SEQ REVEALS ALTERNATIVE SPLICING WITH
ALTERNATIVE FUNCTIONALITY IN MUSHROOMS
Thies Gehrmann1, Jordi F. Pelkmans2, Han Wösten2, Marcel J.T. Reinders1 & Thomas Abeel1*.
Delft Bioinformatics Lab, Delft University of Technology1; Fungal Microbiology, Science Faculty, Utrecht University2.
Alternative splicing is well studied in mammalian genomes; alternative transcripts are often associated with disease,
and their role in regulation is gradually being unveiled. In fungi, the study of alternative splicing has only scratched
the surface. Using RNA-Seq data, we predict alternative transcripts based on existing gene predictions in two
mushroom-forming fungi. We study the alternative functionality of genes through functional domains, developmental
stages, tissue and time. This analysis reveals an extent of alternative functionality induced by alternative splicing that
was previously unknown in fungi, and asserts the need for further research.
INTRODUCTION
Transcript reconstruction algorithms rely on the sparsity
(intergenic regions) of the genome in order to distinguish
between genes. In fungi, due to the density of the genome,
transcripts overlap in the up- and downstream untranslated
regions (UTRs), which prevents the use of existing tools
for transcript prediction (Roberts et al. 2011). Previous
studies (Xie et al. 2015, Zhao et al. 2013) were limited to
the study of splice junctions, without more advanced
functional analyses. We transform the genomes of S.
commune and A. bisporus in order to enable the prediction
of alternative transcripts, applying existing transcript
reconstruction algorithms to RNA-Seq data from different
tissue types and developmental stages. We present a
functional analysis of the resulting transcripts.
METHODS
We apply a transformation to our fungal genomes in order
to reduce the impact of the overlapping UTRs that prevent
the prediction of alternative transcripts. We split the
genome into chunks, with each chunk being defined by an
existing gene annotation. The transformation thus
essentially removes the intergenic regions (which contain
the UTRs). Each chunk is then analyzed separately by
Cufflinks (Roberts et al. 2011). Predicted transcripts are
filtered based on read information and ORF sanity. Protein
domain annotations are predicted for each transcript using
InterPro (Zdobnov & Apweiler 2001).
For each gene with multiple alternative transcripts, we
construct a consensus sequence which allows us to call
specific splicing events without the influence of erroneous
reference annotations.
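The chunking step described above can be sketched as follows. This is a toy illustration, not the authors' pipeline; the margin parameter and the gene coordinates are invented for the example.

```python
def chunk_genome(sequence, annotations, margin=50):
    """Cut one chunk per annotated gene, extended by a small margin so
    reads spanning the UTRs still map, and drop the intergenic rest.
    annotations: iterable of (gene_id, start, end), 0-based half-open."""
    chunks = {}
    for gene_id, start, end in annotations:
        lo = max(0, start - margin)
        hi = min(len(sequence), end + margin)
        chunks[gene_id] = sequence[lo:hi]
    return chunks

# Toy genome: two short "genes" embedded in long intergenic runs.
genome = "A" * 100 + "ATGAAATTTGGGTAA" + "C" * 100 + "ATGCCCGGGTAA" + "T" * 100
annos = [("gene1", 100, 115), ("gene2", 215, 227)]
for gid, seq in chunk_genome(genome, annos, margin=10).items():
    print(gid, len(seq))   # prints "gene1 35" then "gene2 32"
```

Running a transcript assembler per chunk then sidesteps the overlapping-UTR problem, at the cost of remapping the predicted coordinates back onto the original genome afterwards.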
RESULTS & DISCUSSION
For both fungi, we find that alternative splicing is
prevalent and many genes have multiple alternative
transcripts (see Table 1).

              # Orig. genes   # Filt. genes   # Transcripts
S. commune        16,319          14,615          20,077
A. bisporus       10,438           9,612          14,320

TABLE 1. The number of originally annotated genes in S. commune and
A. bisporus decreases after RNA-Seq-based prediction filters some of
them out. The number of newly predicted transcripts indicates that
alternative splicing is not a rare event in these fungi.
The frequencies of specific events in the two fungi are
similar and match what is seen in humans (Sammeth et al.
2008). However, there are significant differences in event
usage. While most transcripts in S. commune have only
one event associated with them, most transcripts in A.
bisporus have at least two events. We show that this is a
result of co-operative events.
As our dataset consists of multiple developmental time-
points and tissue types, we are able to observe the
alternative use of transcripts through time. If a gene swaps
transcript usage at a certain time point, this is indicative of
a functional involvement of that particular transcript (Lees
et al. 2015). We find multiple transcripts in both S.
commune and A. bisporus that are activated in specific
developmental stages of the mushroom. Furthermore, in A.
bisporus, we are able to identify transcripts that are
activated specifically for certain tissue types through
development.
Using protein domain predictions for each transcript in a
gene, we can measure how gene functionality changes
across its transcripts. Figure 1 shows that functional
annotations are not always preserved across all transcripts,
indicating alternative functionality.
FIGURE 1. Many genes in S. commune demonstrate alternative functionality through alternative splicing.
This is the first genome-wide functional analysis of
alternative splicing in fungi from RNA-Seq data. We find
a wealth of alternative splicing events in two fungi,
resulting in many newly discovered transcripts. Although
their functional influence is not yet demonstrated, we
present evidence to suggest that they are relevant to
mushroom development.
REFERENCES
Lees JG et al. BMC Genomics 16:1 (2015).
Roberts A et al. Bioinformatics 27:17, 2325-2329 (2011).
Sammeth M et al. PLoS Computational Biology 4:8 (2008).
Xie B-B et al. BMC Genomics 16:54 (2015).
Zdobnov EM & Apweiler R. Bioinformatics 17:9 (2001).
Zhao C et al. BMC Genomics 14:21 (2013).
10th Benelux Bioinformatics Conference bbc 2015
63
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: P Poster
P19. MSQROB: AN R/BIOCONDUCTOR PACKAGE FOR ROBUST RELATIVE
QUANTIFICATION IN LABEL-FREE MASS SPECTROMETRY-BASED
QUANTITATIVE PROTEOMICS
Ludger Goeminne1,2,3*, Kris Gevaert2,3 & Lieven Clement1.
Department of Applied Mathematics, Computer Science and Statistics, Ghent University1; VIB Medical Biotechnology
Center2; Department of Biochemistry, Ghent University3.
MSqRob is an R/Bioconductor package that uses robust ridge regression on peptide-level data for robust relative
quantification of proteins in label-free data-dependent acquisition (DDA) mass spectrometry (MS)-based proteomic
experiments. It has been shown that statistical methods inferring at the peptide-level outperform workflows that
summarize peptide intensities prior to inference. MSqRob improves upon existing peptide-level methods by three
modular extensions: (1) ridge regression, (2) empirical Bayes variance estimation and (3) M-estimation with Huber
weights. These extensions make MSqRob less sensitive to outliers and missing peptides, enabling more proteins to be
processed. Our software provides streamlined data analysis pipelines for experiments with simple layouts as well as for
more complex multi-factorial designs. Using a spike-in dataset, we illustrate that MSqRob yields more stable protein fold
change estimates and improves the differential abundance (DA) ranking.
INTRODUCTION
In a typical label-free DDA LC-MS/MS-based proteomic
workflow, proteins are digested to peptides, separated by
RP-HPLC and analyzed by a mass spectrometer. However,
several issues inherent to the protocol make data analysis
non-trivial. Most of the common data analysis procedures
use summarization-based workflows. We have previously
shown that inference at the peptide level outperforms these
summarization-based approaches (Goeminne et al., 2015).
However, even these pipelines are sensitive to outliers and
suffer from overfitting. Here, we present MSqRob, an
R/Bioconductor package that starts from peptide-level data
and provides robust inference on DA at the protein level.
METHODS
Dataset. To demonstrate the performance of our package,
we use the CPTAC dataset, in which 48 known human
proteins were spiked in at different concentrations in a
yeast proteome background. Ideally, when comparing
different spike-in conditions, only the human proteins
should be flagged as differentially abundant.
Competing analytical methods. We benchmark against
MaxLFQ+Perseus, which summarizes peptide data and then
applies pairwise t-tests.
LM model. Generally, peptide-based models are
constructed as follows:
y_ijklmn = treat_ij + pep_ik + biorep_il + techrep_im + ε_ijklmn

with y_ijklmn the nth log2-transformed normalized feature
intensity for the ith protein under the jth treatment treat_ij,
the kth peptide sequence pep_ik, the lth biological repeat
biorep_il and the mth technical repeat techrep_im, and
ε_ijklmn a normally distributed error term with mean zero
and variance σ_i².
MSqRob. MSqRob adds the following improvements to
the LM model:
1. Ridge regression: shrink parameter estimates
towards 0 by adding a ridge penalty term to the
loss function.
2. Stabilize variance estimation by borrowing
information across proteins with empirical
Bayes (EB): shrink individual variances towards
the pooled variance.
3. M estimation with Huber weights: weigh down
observations with large errors.
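Extensions (1) and (3) can be sketched together as iteratively reweighted ridge regression. This is a minimal illustration, not MSqRob's actual implementation: the tuning constant, MAD scale estimate and weight normalization are assumptions, and the empirical Bayes step (2) is omitted:

```python
import numpy as np

def huber_weights(resid, scale, k=1.345):
    """Huber weights: 1 for small residuals, k*scale/|resid| beyond the cut-off."""
    return np.minimum(1.0, k * scale / np.maximum(np.abs(resid), 1e-12))

def robust_ridge(X, y, lam=1.0, n_iter=20):
    """Iteratively reweighted ridge regression with Huber weights.

    The ridge penalty `lam` shrinks parameter estimates towards 0, and
    the Huber weights down-weight outlying peptide intensities.
    """
    n, p = X.shape
    w = np.ones(n)
    for _ in range(n_iter):
        WX = X * w[:, None]
        # Weighted ridge solution: (X'WX + lam*I)^(-1) X'Wy
        beta = np.linalg.solve(X.T @ WX + lam * np.eye(p), WX.T @ y)
        resid = y - X @ beta
        # Robust scale estimate (MAD), floored to avoid division by zero
        scale = max(1.4826 * np.median(np.abs(resid - np.median(resid))), 1e-8)
        w = huber_weights(resid, scale)
        w = w / w.mean()  # keep the penalty comparable across iterations
    return beta
```

On a toy regression with one gross outlier, the reweighting pulls the fit back towards the trend of the clean observations, which is the behaviour the abstract describes for outlying peptide intensities.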
RESULTS & DISCUSSION
MSqRob uses MaxQuant or Mascot peptide-level data as
input. It performs preprocessing, robust model fitting and
returns log2 fold change estimates and FDR corrected p-
values for all model parameters and/or (user specified)
contrasts. Advanced users have the flexibility to (a) adopt
their own preprocessing pipeline (e.g. transformation,
normalization, drop contaminants…) and (b) specify the
appropriate model structure. Compared to competing
methods, MSqRob returns more stable log2 fold change
estimates, improves DA ranking (Figure 1) and is able to
discern between consistently strong DA and an accidental
hit caused by outliers or a small variance due to random
chance in low-abundant proteins.
FIGURE 1. Receiver operating characteristic (ROC) curves showing the
superior performance of MSqRob compared to a simple linear model (LM)
and a summarization-based approach (MaxLFQ+Perseus) when comparing
the lowest spike-in concentration 6A with the second lowest spike-in
concentration 6B. Stars denote the methods' cut-off at an estimated 5% FDR.
REFERENCES
Goeminne LJE et al. Journal of Proteome Research 14, 2457-2465 (2015).
P20. A MIXTURE MODEL FOR THE OMICS BASED IDENTIFICATION OF
MONOALLELICALLY EXPRESSED LOCI AND THEIR DEREGULATION IN
CANCER
Tine Goovaerts1, Sandra Steyaert1, Jeroen Galle1, Wim Van Criekinge1 & Tim De Meyer1*.
BIOBIX lab of Bioinformatics and Computational Genomics, Department of Mathematical Modelling,
Statistics and Bioinformatics, Ghent University1.
Imprinting is a phenomenon featured by parent-specific monoallelic gene expression. Its deregulation has been
associated with non-Mendelian inherited genetic diseases but is also a common feature of cancer. As imprinting does not
alter the genome yet is mitotically inherited, epigenetics is deemed to be a key regulator. Current knowledge in the field
is particularly hampered by a lack of accurate computational techniques suitable for omics data. Here we introduce a
mixture model for the identification of monoallelically expressed loci based on large scale omics data that can also be
exploited to identify samples and loci featured by loss of imprinting / monoallelic expression.
INTRODUCTION
The genome-wide identification of mono-allelically
expressed or epigenetically modified loci typically
requires the presence of SNPs to discriminate both alleles.
Current methods predominantly rely on genotyping for the
identification of heterozygous loci in a limited sample set,
followed by testing whether the expression/epigenetic
modification levels for both alleles deviate from a 1:1 ratio
for those loci (Wang et al., 2014). This approach is limited
by the genotyping step and the required presence of
heterozygous individuals. As large scale omics data is
becoming increasingly available, an alternative strategy
may be to screen larger numbers (e.g. hundreds) of
samples, ensuring the presence of heterozygous
individuals at predictable rates, thereby also avoiding the
need for and limitations of a prior genotyping step.
Based on this concept, a previous strategy (Steyaert et al.,
2014) enabled us to identify and validate approximately 80
loci featured by monoallelic DNA methylation, but had
several drawbacks, such as computational inefficiency,
heavy reliance on Hardy-Weinberg equilibrium (HWE),
need for 100% imprinting and low power, which limited
its practical use. Here we present a novel mixture model
for the identification of monoallelically modified or
expressed loci from large-scale omics data (without
known genotypes) that largely circumvents previous
drawbacks.
METHODS
The rationale of the methodology is that RNA-seq and
ChIP-seq(-like) derived SNP data for monoallelic loci are
featured by a general lack of apparent heterozygosity.
More specifically, under the null-hypothesis (no
imprinting) the homozygous and heterozygous sample
fractions can be modelled as a mixture of (beta-)binomial
distributions, with weights according to HWE or
empirically derived. For imprinted loci however, the
heterozygous fraction is split and shifted towards the two
homozygous fractions (Figure 1), which can be evaluated
with a likelihood ratio test. The model does not require but
can incorporate prior genotyping data and allows for
deviation from HWE, sequencing errors and efficiency
differences and partial monoallelic events. Once loci
featured by monoallelic events have been identified in
control data, a loss of imprinting index can be calculated
for each non-normal sample based on the mixture model
likelihoods and loci generally featured by loss of
imprinting in the pathology under study can be identified.
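The null/imprinted contrast can be sketched as two binomial mixtures and a likelihood ratio. The sequencing-error rate, the plain binomial components (rather than beta-binomial) and the full-imprinting alternative below are illustrative simplifications of the actual model:

```python
import numpy as np
from scipy.stats import binom

def mixture_loglik(ref_counts, totals, p, eps=0.01, imprinted=False):
    """Mixture log-likelihood of reference-allele counts at one SNP.

    Under the null, genotypes AA, AB and BB occur at HWE weights
    (p^2, 2pq, q^2) and generate reference-allele fractions close to
    1-eps, 0.5 and eps (eps = sequencing error rate).  Under full
    imprinting, heterozygotes express a single allele, so the AB weight
    is reallocated to the two homozygous-looking components.
    """
    q = 1.0 - p
    if imprinted:
        weights = [p**2 + p * q, 0.0, q**2 + p * q]
    else:
        weights = [p**2, 2 * p * q, q**2]
    fracs = [1.0 - eps, 0.5, eps]  # expected reference-allele fractions
    lik = sum(w * binom.pmf(ref_counts, totals, f)
              for w, f in zip(weights, fracs))
    return np.log(np.maximum(lik, 1e-300)).sum()

def lrt_imprinting(ref_counts, totals, p):
    """Likelihood ratio statistic; large positive values favour imprinting."""
    return 2 * (mixture_loglik(ref_counts, totals, p, imprinted=True)
                - mixture_loglik(ref_counts, totals, p, imprinted=False))
```

A locus where apparent heterozygotes are missing scores positively; a locus with the expected heterozygous fraction scores negatively.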
RESULTS & DISCUSSION
We demonstrate the applicability of the novel mixture
model with simulations and a proof of concept study using
breast cancer and control RNA-seq data from The Cancer
Genome Atlas (TCGA Research Network, 2008). Well
known imprinted loci such as IGF2 (Figure 1) and H19
were indeed identified. Ongoing efforts are directed
towards artefact-free RNA/ChIP-seq data based allele
frequency inference and the efficient implementation of a
beta-binomial based mixture.
FIGURE 1. Observed (red) and modelled (green) allele frequencies for a
100% (right, no observable heterozygotes) and a partially imprinted
(left) SNP of the IGF2 gene
In conclusion, we introduce a novel mixture model for the
identification of loci featured by monoallelic events which
can subsequently be exploited to determine their
deregulation in the pathology of interest.
REFERENCES
Steyaert S et al. Nucleic Acids Research 42, e157 (2014).
TCGA Research Network. Nature 455, 1061-1068 (2008).
Wang X & Clark AG. Heredity 113, 156-166 (2014).
P21. GEVACT: GENOMIC VARIANT CLASSIFIER TOOL
Isel Grau1,4, Dorien Daneels2,3, Sonia Van Dooren2,3, Maryse Bonduelle2,
Dewan Md. Farid1,3, Didier Croes2,3, Ann Nowé1,3 & Dipankar Sengupta1,3*.
Como - Artificial Intelligence Lab, Vrije Universiteit Brussel1; Centre for Medical Genetics, Reproduction and Genetics,
Reproduction Genetics and Regenerative Medicine, Vrije Universiteit Brussel, UZ Brussel2; Interuniversity Institute of
Bioinformatics in Brussels, ULB-VUB3; Department of Computer Sciences, Universidad Central de Las Villas4.
High throughput screening (HTS) techniques, such as genome or exome screening, are becoming the norm in conventional
clinical analysis. However, classifying the identified variants as pathogenic, potentially pathogenic or non-pathogenic
is still a manual, tedious and time-consuming process for clinicians and geneticists. Thus, to facilitate the variant
classification process, we have developed GEVACT, a Java-based tool built on an algorithm derived from the existing
literature and the knowledge of clinical geneticists. GEVACT can classify variants annotated by Alamut Batch, with
support for input from other annotation software planned.
INTRODUCTION
With the emergence of new screening techniques, targeted
or whole exome and genome screening are becoming
standard diagnostic norms in clinical settings to identify
the variants for a genetic disease (Ng et al., 2010;
Saunders et al., 2012). However, the development of
bioinformatics solutions for pathogenic classification of
variants remains a major challenge, making the process
cumbersome for geneticists and clinicians. In this work,
we describe GEVACT (Genomic
Variant Classifier Tool), a tool for classification of
genomic single nucleotide and short insertion/deletion
variants. The aim of this study was to design and
implement a variant classification algorithm, based on a
literature review of cardiac arrhythmia syndromes
(Hofman et al., 2013; Schulze-Bahr et al., 2000; Wilde &
Tan, 2007) and existing knowledge of clinical geneticists.
METHODS
The algorithm we propose for GEVACT is based on a
published variant classification schema for cardiac
arrhythmia syndromes. This approach is based on the yield
of DNA testing over a time span of 15 years (1996-2011),
between probands with isolated/familial cases, and also
between probands with or without clear disease-specific
clinical characteristics (Hofman et al., 2013). It proposes
two varying approaches: one to classify missense variants
and another to classify nonsense and frameshift variants.
The algorithm is implemented in two phases: pre-
processing and classification. In the pre-processing phase,
the annotated tab-delimited variant file (vcf.ann) from
Alamut Batch is refined based on the gene list for the
disease of interest, so as to reduce the number of variants
for the analysis. Filters are applied to look for variants that
have already been reported in the Human Gene
Mutation Database (Stenson et al., 2003) and in ClinVar
(Landrum et al., 2014), or that have previously been
detected and classified in an internal patient population.
Lastly, the variants are filtered based on their location
in the genome and their coding effect, followed by a
check for the minor allele frequency of the variant in a
control population (Sherry et al., 2001). Thereafter, in the
classification phase, the filtered variants are classified as
missense or nonsense and frameshift variants. For
missense variants the classification is based on the
parameters: amino acid substitution and its impact on
protein function (Adzhubei et al., 2010; Kumar et al.,
2009), biochemical variation (Mathe et al., 2006),
conservation (Pollard et al., 2010), frequency of variant
alleles in a control population (ExAC, 2015), effects on
splicing (Desmet et al., 2009), family and phenotype
information and functional analysis. For nonsense and
frameshift variants, by contrast, it is based on: effects on
splicing, frequency of variant alleles in a control
population, family and phenotype information and
functional analysis. For each parameter, a score is given to
the variant, and these scores are summed. Finally, based
on the cumulative score, each variant is classified into one
of five categories: Class I - Non-Pathogenic; Class II -
VUS1 (unlikely pathogenic); Class III - VUS2 (unclear);
Class IV - VUS3 (likely pathogenic); Class V - Pathogenic
(Sharon et al., 2008).
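The scoring step can be sketched as follows. The parameter names, per-parameter scores and class cut-offs below are invented placeholders; the actual values come from the published schema (Hofman et al., 2013):

```python
# Hypothetical cut-offs for illustration only.
CLASS_LABELS = [
    "Class I - Non-Pathogenic",
    "Class II - VUS1 (unlikely pathogenic)",
    "Class III - VUS2 (unclear)",
    "Class IV - VUS3 (likely pathogenic)",
    "Class V - Pathogenic",
]

def classify_variant(parameter_scores, cutoffs=(2, 4, 6, 8)):
    """Sum per-parameter scores and map the total to one of five classes.

    `parameter_scores` maps each assessed parameter (e.g. conservation,
    splicing effect) to its score; totals below cutoffs[i] fall into
    class i, and anything at or above the last cut-off is Class V.
    """
    total = sum(parameter_scores.values())
    for cutoff, label in zip(cutoffs, CLASS_LABELS):
        if total < cutoff:
            return total, label
    return total, CLASS_LABELS[-1]
```

The tool's output described below (cumulative score plus class label) corresponds to the returned pair.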
RESULTS & DISCUSSION
In this study, we report a Java based tool called GEVACT,
developed for classification of genomic variants. Input for
the tool is an annotated vcf file, while the output depicts
the cumulative classification score along with the class
label for a variant. The tool was tested on a dataset of 130
cardiac arrhythmia syndrome patients, available at UZ
Brussel. The results of the variant classification made by
the tool were cross-validated by manual curation,
performed by a clinical geneticist. The study indicates
that the tool is promising but needs to be further
validated on datasets from other diseases. In addition,
we are working on making the tool adaptable to file
inputs from other annotation software.
REFERENCES
Adzhubei IA et al. Nat Methods 7(4), 248-249 (2010).
Desmet et al. Nucleic Acids Res 37(9), e67 (2009).
Exome Aggregation Consortium (ExAC), Cambridge, MA (2015).
Hofman N et al. Circulation 128(14), 1513-21 (2013).
Kumar P et al. Nat Protoc 4(7), 1073-1081 (2009).
Landrum MJ et al. Nucleic Acids Res 42(1), D980-5 (2014).
Mathe E et al. Nucleic Acids Res 34(5), 1317-25 (2006).
Ng SB et al. Nat Genetics 42, 30-35 (2010).
Pollard K et al. Genome Res 20, 110-121 (2010).
Saunders CJ et al. Sci Transl Med 4, 154ra135 (2012).
Sharon EP et al. Hum Mutat 29(11), 1282-1291 (2008).
Sherry ST et al. Nucleic Acids Res 29(1), 308-11 (2001).
Schulze-Bahr E et al. Z Kardiol 89 Suppl 4, IV12-22 (2000).
Stenson et al. Hum Mutat 21, 577-581 (2003).
Wilde AA & Tan HL. Circ J 71 Suppl A, A12-9 (2007).
P22. MAPPI-DAT: MANAGEMENT AND ANALYSIS FOR HIGH
THROUGHPUT INTERACTOMICS DATA FROM ARRAY-MAPPIT
EXPERIMENTS
Surya Gupta1,2,3, Jan Tavernier1,2 & Lennart Martens1,2,3.
Medical Biotechnology Center, VIB, Ghent, Belgium1; Department of Biochemistry, Ghent University, Ghent, Belgium2;
Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium3.
INTRODUCTION
Proteins are highly interesting objects of study, involved
in different cellular and molecular functions. Identification
and quantification of these proteins along with their
interacting proteins, nucleic acids and molecules can
provide insight into development and disease mechanisms
at the systems level. Yet studying these interactions is not
trivial. In vivo methods exist to determine these
interactions, but these suffer from several drawbacks [4].
To overcome existing problems, an innovative approach
called MAPPIT (Mammalian Protein-Protein Interaction
Trap) [2] has been established in the Cytokine Receptor
Lab to determine interacting partners of proteins in
mammalian cells. To allow screening of thousands of
interactors simultaneously, MAPPIT has been parallelized
in the array MAPPIT system [3].
AIM
However, no effective pipeline existed to process the
high-throughput data generated from array MAPPIT. We
therefore established an automated high-throughput data
analysis system called MAPPI-DAT (Mappit Array
Protein Protein Interaction- Database & Analysis Tool).
METHODS
In the array-MAPPIT platform, the interaction of two
proteins (bait and prey) restores a mutated JAK-STAT
signaling pathway, which leads to the expression of
fluorescence-emitting reporter genes. To rank the positive
interactions based on fluorescence intensity, RankProd [1]
is used. This method was originally developed to
determine differentially expressed genes in microarray
experiments and is available as an R package. To minimize
false positive hits from the RankProd output, quartile-based
filtration was applied. MySQL was used to build the data
management system for the array-MAPPIT system.
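The rank-product statistic and a quartile filter can be sketched in a few lines. This is a simplified stand-in for the RankProd package, which additionally assesses significance by permutation:

```python
import numpy as np

def rank_product(intensities):
    """Rank-product score per candidate interaction.

    `intensities` is an (n_interactions, n_replicates) matrix of
    fluorescence readouts.  Within each replicate column, interactions
    are ranked with rank 1 = strongest signal; the geometric mean of a
    row's ranks is its rank product, so small values flag consistently
    strong hits.
    """
    order = np.argsort(-intensities, axis=0)   # descending signal
    ranks = np.argsort(order, axis=0) + 1      # 1-based ranks per replicate
    return np.exp(np.log(ranks).mean(axis=1))

def quartile_filter(scores):
    """Indices of hits in the best (lowest) quartile of rank products."""
    return np.flatnonzero(scores <= np.percentile(scores, 25))
```

An interaction ranked near the top in every replicate keeps a rank product close to 1, while a one-off spike in a single replicate is pushed out by its poor ranks elsewhere.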
RESULTS
To extend and ease the usage of the analysis pipeline and
database system, an interface called MAPPI-DAT has been
developed. MAPPI-DAT is capable of processing many
thousands of data points per experiment, and comprises a
data storage system that stores the experimental data in a
structured way for meta-analysis.
REFERENCES
[1] Breitling R, Armengaud P, Amtmann A & Herzyk P. Rank products:
a simple, yet powerful, new method to detect differentially regulated
genes in replicated microarray experiments. FEBS Letters 573(1-3),
83-92 (2004).
[2] Lievens S, Peelman F, De Bosscher K, Lemmens I & Tavernier J.
MAPPIT: a protein interaction toolbox built on insights in cytokine
receptor signaling. Cytokine and Growth Factor Reviews 22(5-6),
321-329 (2011).
[3] Lievens S, Vanderroost N, Van der Heyden J, Gesellchen V, Vidal M
& Tavernier J. Array MAPPIT: high-throughput interactome analysis
in mammalian cells. 877-886 (2009).
[4] Gopichandran S & Ranganathan S. Protein-protein interactions and
prediction: a comprehensive overview. Protein and Peptide Letters,
779-789 (2013).
P23. HIGHLANDER: VARIANT FILTERING MADE EASIER
Raphael Helaers1* & Miikka Vikkula1.
Human Molecular Genetics (GEHU), de Duve Institute, Université catholique de Louvain1.
The field of human genetics is being revolutionized by exome and genome sequencing. A massive amount of data is
being produced at ever-increasing rates. Targeted exome sequencing can be completed in a few days using NGS,
allowing for new variant discovery in a matter of weeks. The technology generates considerable numbers of false
positives, and the differentiation of sequencing errors from true mutations is not a straightforward task. Moreover, the
identification of changes-of-interest from amongst tens of thousands of variants requires annotation drawn from various
sources, as well as advanced filtering capabilities. We have developed Highlander, a Java software coupled to a MySQL
database, in order to centralize all variant data and annotations from the lab, and to provide powerful filtering tools that
are easily accessible to the biologist. Data can be generated by any NGS machine (such as Illumina’s HiSeq, or Life
Technologies’ Solid or Ion Torrent) and most variant callers (such as Broad Institute’s GATK or Life Technologies’
LifeScope). Variant calls are annotated using dbNSFP (providing predictions from 6 different programs, and MAF from
1000G and ESP), GoNL and SnpEff, and subsequently imported into the database. The database is used to compute global
statistics, allowing for the discrimination of variants based on their representation in the database. The Highlander GUI
easily allows for complex queries to this database, using shortcuts for certain standard criteria, such as “sample-specific
variants”, “variants common to specific samples” or “combined-heterozygous genes”. Users can browse through query
results using sorting, masking and highlighting of information. Highlander also gives access to useful additional tools,
including direct access to IGV, and an algorithm that checks all available alignments for allele-calls at specific positions.
P24. DOSE-TIME NETWORK IDENTIFICATION: A NEW METHOD FOR
GENE REGULATORY NETWORK INFERENCE FROM GENE EXPRESSION
DATA WITH MULTIPLE DOSES AND TIME POINTS
Diana M Hendrickx1*, Danyel G J Jennen1 & Jos C S Kleinjans1.
Department of Toxicogenomics, Maastricht University, The Netherlands1.
Toxicogenomics, the application of ‘omics’ technologies to toxicology, is a rapidly growing field due to the need for
alternatives to animal experiments for toxicity testing of compounds. Identification of gene regulatory networks affected
by compounds is important to gain more insight into the mode of action of a toxic compound. The response to a toxic
compound is both time and dose dependent. Therefore, toxicogenomics data are often measured across several time
points and doses. However, to our knowledge, no method exists for gene regulatory network inference that
takes into account both time and dose dependencies. Here we present Dose-Time Network Identification (DTNI), a novel
gene regulatory network inference algorithm that takes into account both dose and time dependencies in the data. We
show that DTNI can be used to infer gene regulatory networks affected by a group of compounds with the same mode of
action. This is illustrated with gene expression (microarray) data from COX inhibitors, measured in human hepatocytes.
INTRODUCTION
Identifying and understanding gene regulatory networks
(GRN) influenced by chemical compounds is one of the
main challenges of systems toxicology. A GRN affected
by one or more compounds evolves over time and with
dose. The analysis of gene expression data measured at
multiple time points and for multiple doses can provide
more insight in the effects of compounds. Therefore, there
is a need for mathematical approaches for GRN
identification from this type of data.
METHODS
One of the mathematical approaches currently used for
GRN inference is based on ordinary differential equations
(ODE), where changes in gene expression over time are
related to each other and to the external perturbation (i.e.
the dose of the compound). Because gene expression data
usually have less data points than variables (genes), ODE
approaches are often combined with interpolation and/or
dimension reduction techniques (PCA). A current method
that combines ODE with both interpolation and dimension
reduction techniques is Time Series Network
Identification (TSNI) (Bansal et al., 2006).
Here, we present Dose-Time Network Identification
(DTNI), a method that extends TSNI by including ODE
that describe changes in gene expression over dose in
relation to each other and to time. We also adapted the
original method so that it can include data from multiple
perturbations (compounds).
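The ODE backbone shared by TSNI and DTNI can be illustrated with a plain least-squares fit of a linear ODE system. Interpolation, PCA-based dimension reduction and the additional dose-direction ODE of the actual methods are omitted in this sketch:

```python
import numpy as np

def infer_grn(expr, times, dose):
    """Least-squares fit of the linear ODE dx/dt = A x + b u that
    underlies TSNI-style network inference.

    expr  : (n_timepoints, n_genes) expression matrix for one dose series
    times : sampling times of the rows
    dose  : scalar perturbation strength u
    Returns the gene-gene interaction matrix A and the direct
    perturbation effects b.
    """
    dxdt = np.gradient(expr, times, axis=0)  # numerical time derivatives
    n_genes = expr.shape[1]
    # Design matrix: expression values plus a constant dose column.
    X = np.hstack([expr, np.full((expr.shape[0], 1), float(dose))])
    coef, *_ = np.linalg.lstsq(X, dxdt, rcond=None)
    return coef[:n_genes].T, coef[n_genes]
```

Nonzero entries of A are the inferred regulatory edges, and large entries of b mark genes directly hit by the compound; DTNI adds an analogous regression in the dose direction.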
RESULTS & DISCUSSION
Using simulated data, we show that including ODE for expression changes over dose leads to improved
GRN identification compared with including only ODE
that describe changes over time. Furthermore, we show
that DTNI performs better when including data from
multiple perturbations (compounds) than when applying
DTNI to data from a single perturbation. This suggests
that the method is suitable to infer a GRN affected by
compounds with the same mode of action. As an example,
we infer the network affected by COX inhibitors from
public microarray data of 6 COX inhibitors, measured in
human hepatocytes, available from Open TG-Gates
(http://toxico.nibio.go.jp/english/index.html) (Noriyuki et
al., 2012). The interactions in the inferred network were
compared to interactions from ConsensusPathDB, a
database including interactions from 32 different sources
(Kamburov et al., 2013). The inferred network was
validated by leave-one-out cross-validation (LOOCV). Six
datasets were created from the original data by leaving out
the data of one compound. The network constructed from
the whole data set showed large overlap with the networks
constructed from each of the LOOCV datasets. Edges in
the network constructed from the whole data set, but not in
the networks constructed from the LOOCV datasets were
removed from the network. The remaining novel
interactions, i.e. those that are not in ConsensusPathDB,
have to be validated experimentally, e.g. by gene-
knockdown experiments.
FIGURE 1. Workflow for identifying a gene regulatory network affected
by a group of compounds with the same mode of action.
REFERENCES
Bansal M et al. Bioinformatics 22, 815-822 (2006).
Noriyuki N et al. J Toxicol Sci 37, 791-801 (2012).
Kamburov A et al. Nucl Acids Res 41, D793-D800 (2013).
P25. IDENTIFICATION OF NOVEL ALLOSTERIC DRUG TARGETS
USING A “DUMMY” LIGAND APPROACH
Susanne M.A. Hermans, Christopher Pfleger & Holger Gohlke*.
Department of Mathematics and Natural Sciences, Institute for Pharmaceutical and Medicinal Chemistry, Heinrich-
Heine-University, Düsseldorf, Germany. *[email protected]
Targeting allosteric sites is a promising strategy in drug discovery due to their regulatory role in almost all cellular
processes. Currently, there is no standard method to identify novel pockets and to detect whether a pocket has a
regulatory effect on the protein. Here, we present a new and efficient approach to probe information transfer through
proteins in the context of dynamically dominated allostery that exploits “dummy” ligands as surrogates for allosteric
modulators.
INTRODUCTION
Allosteric regulation is the coupling between separated
sites in biomacromolecules such that an action at one site
changes the function at a distant site. Allosteric drugs are
popular because they often have fewer side effects than
orthosteric drugs, as allosteric sites are less conserved. The
identification of novel allosteric pockets is complicated by
the large variation in allosteric regulation, ranging from
rigid body motions to disorder/order transitions, with
dynamically dominated allostery in between (Motlagh et
al., 2014). Here we focus on dynamically dominated
allostery with minimal or no conformational changes.
Novel pockets do not have a known ligand; therefore, we
generate “dummy” ligands to function as surrogates for
allosteric ligands. We have developed an efficient
approach to probe information transfer through proteins
using “dummy” ligands and detect if allosteric coupling is
present between the novel pocket and the orthosteric site.
METHODS
In a preliminary study to test the general feasibility, the
approach was applied to conformations extracted from an
MD trajectory of the holo and apo structures of LFA1.
The grid-based PocketAnalyzer program (Craig et al.,
2011) is used to detect putative binding sites. “Dummy”
ligands were generated for each detected pocket along the
ensemble. Finally, the Constraint Network Analysis
(CNA) software, which links biomacromolecular structure,
(thermo-)stability, and function, is used to probe the
allosteric response by monitoring altered stability
characteristics of the protein due to the presence of the
“dummy” ligand (Pfleger et al., 2013; Krüger et al., 2013;
Pfleger, 2014). The results were compared to those of the
holo structure with the bound allosteric ligand to validate
the “dummy” ligand approach.
RESULTS & DISCUSSION
Remarkably, the usage of “dummy” ligands almost
perfectly reproduced the results obtained from the known
allosteric effector. Although it turned out that the intrinsic
rigidity of the “dummy” ligands over-stabilizes the LFA1
structure, these results are already encouraging. Even for
the LFA1 apo structures, where the allosteric pocket is
partially closed, the results are in agreement with known
allosteric effectors. Overall, the results obtained from the
validation of the “dummy” ligand approach are
encouraging. This suggests that our “dummy” ligand
approach for the characterization of unexplored allosteric
pockets is a promising step towards identifying novel drug
targets.
REFERENCES
Craig IR et al. J Chem Inf Model 51, 2666-2679 (2011).
Krüger DM et al. Nucleic Acids Res 41, 340-348 (2013).
Motlagh HN et al. Nature 508(7496), 331-339 (2014).
Pfleger C et al. J Chem Inf Model 53, 1007-1015 (2013).
Pfleger C. Doctoral Thesis, Heinrich Heine University, Düsseldorf, Germany (2014).
P26. PASSENGER MUTATIONS CONFOUND INTERPRETATION OF ALL GENETICALLY MODIFIED CONGENIC MICE
Paco Hulpiau1,2,3*, Liesbet Martens1,2,3*, Yvan Saeys1,2,3, Peter Vandenabeele1,2,4 & Tom Vanden Berghe1,2.
Inflammation Research Center, VIB, Ghent, Belgium1; Department of Biomedical Molecular Biology, Ghent University, Ghent, Belgium2; Data Mining and Modelling for Biomedicine (DaMBi), Ghent, Belgium3; Methusalem Program, Ghent University, Belgium4. *[email protected], [email protected]
Targeted mutagenesis in mice is a powerful tool for functional analysis of genes. However, genetic variation between
embryonic stem cells (ESCs) used for targeting (previously almost exclusively 129-derived) and recipient strains (often
C57BL/6J) typically results in congenic mice in which the targeted gene is flanked by ESC-derived passenger DNA
potentially containing mutations. Comparative genomic analysis of 129 and C57BL/6J mouse strains revealed indels and
single nucleotide polymorphisms resulting in alternative or aberrant amino acid sequences in 1,084 genes in the 129-strain genome.
INTRODUCTION
Annotating the passenger mutations to the reported
genetically modified congenic mice that were generated
using 129-strain ESCs revealed that nearly all these mice
possess multiple passenger mutations potentially
influencing the phenotypic outcome. We illustrated this
phenotypic interference of 129-derived passenger
mutations with several case studies and developed the Me-PaMuFind-It web tool to estimate the number and possible
effect of passenger mutations in transgenic mice of interest.
METHODS
We analyzed the SNP data release v3 from the Mouse
Genome Project available at the Sanger Institute (Keane et al.,
2011). The data in the indel vcf file and SNP vcf file were
filtered to retrieve indels and SNPs present in at least one
of the three 129 strains (129P2/OlaH, 129S1/SvIm and
129S5SvEvB) and affecting the protein coding sequence
of the genes. These so-called protein coding variants are
based on the following sequence ontology (SO) terms:
stop gained, stop lost, inframe insertion, inframe deletion,
frameshift variant, splice donor variant, splice acceptor
variant, and coding sequence variant. In total, 949 indels
and 446 SNPs affecting 1,084 mouse genes were retained.
We gathered chromosome and gene start and end positions
for 1,084 genes covering 1,395 variations. The Ensembl
gene ID was used to find the most upstream and
downstream start and stop in all Ensembl transcripts for
that gene. Next these genome coordinates were used to
search for flanking genes within 2, 10, and 20 Mbps
upstream and downstream. We then downloaded all mouse
phenotypic allele data from the MGI resource and
extracted the data of genetically modified mouse lines.
Information on 5,322 genes (corresponding to 7,979 129-
derived genetically modified mouse lines) was connected
to genes with passenger mutations and affected genes.
Additionally we filtered the data to identify putative
regulatory variants. All data were stored in a MySQL
database and can be queried using the publicly available
web tool Me-PaMuFind-It:
http://me-pamufind-it.org/
FIGURE 1. Passenger genome mutations in gene-targeted mice (Nechanitzky and Mak, 2015).
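The SO-term filtering step described above can be sketched as follows. This is an illustrative reconstruction, not the actual Me-PaMuFind-It code: the assumption that consequence terms appear in the VCF INFO column, and the toy records used for illustration, are hypothetical.

```python
# Sketch of the variant-filtering step: keep VCF records whose annotated
# consequence matches one of the protein-coding SO terms listed above
# (written here in their underscored SO form). Not the authors' code; the
# assumption that consequences appear in the INFO column is hypothetical.

PROTEIN_CODING_SO_TERMS = {
    "stop_gained", "stop_lost", "inframe_insertion", "inframe_deletion",
    "frameshift_variant", "splice_donor_variant", "splice_acceptor_variant",
    "coding_sequence_variant",
}

def protein_coding_records(vcf_lines):
    """Yield VCF data lines whose INFO field mentions a protein-coding SO term."""
    for line in vcf_lines:
        if line.startswith("#"):          # skip meta and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        info = fields[7] if len(fields) > 7 else ""
        if any(term in info for term in PROTEIN_CODING_SO_TERMS):
            yield line
```

Running this over the indel and SNP VCF files, keeping only records seen in at least one of the three 129 strains, would yield the retained protein-coding variants.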
RESULTS & DISCUSSION
The vast majority of existing and well-characterized
genetically engineered congenic mice have been created
using 129 ESCs. 99.5% of these mouse lines are affected
by a median number of 20 passenger mutations within a
10 cM flanking region. This implies that nearly all
genetically modified congenic mice contain multiple
passenger mutations despite intensive backcrossing.
Consequently, the phenotypes observed in these mice
might be due to flanking passenger mutations rather than a
defect in the targeted gene (Vanden Berghe et al, 2015).
REFERENCES
Keane, T.M., Goodstadt, L., Danecek, P., White, M.A., Wong, K., Yalcin, B., Heger, A., Agam, A., Slater, G., Goodson, M., et al. (2011). Mouse genomic variation and its effect on phenotypes and gene regulation. Nature 477, 289–294.
Nechanitzky, R. and Mak, T.W. (2015). Passenger Mutations Identified in the Blink of an Eye. Immunity 43(1), 9–11.
Vanden Berghe, T., Hulpiau, P., Martens, L. et al. (2015). Passenger Mutations Confound Interpretation of All Genetically Modified Congenic Mice. Immunity 43(1), 200–209.
P27. DETECTING MIXED MYCOBACTERIUM TUBERCULOSIS INFECTION AND DIFFERENCES IN DRUG SUSCEPTIBILITY WITH WGS DATA
Arlin Keo1 & Thomas Abeel1,2,*.
Delft Bioinformatics Lab, Delft University of Technology, Delft, the Netherlands1; Broad Institute of MIT and Harvard, Cambridge, MA, USA2. *[email protected]
Mycobacterium tuberculosis is a bacterial pathogen that causes tuberculosis and infects millions of people. When a
person is infected with more than one distinct strain type of tuberculosis (TB), referred to as a mixed infection, diagnosis and treatment are complicated. Due to the difficulty of diagnosis, the prevalence of mixed infections among TB patients remains uncertain. Whole genome sequencing (WGS) yields a great number of single nucleotide polymorphisms (SNPs)
and offers increased resolution to distinguish distinct strains. Here, we present a tool that maps sample reads against 21
bp cluster specific SNP markers to detect putative mixed infections and estimate the frequencies of the present
subpopulations.
INTRODUCTION
Mycobacterium tuberculosis is a clonal, bacterial pathogen
that causes the pulmonary disease tuberculosis (TB), and it
infects and kills millions of people worldwide [1]. The
study of genetic diversity within the M. tuberculosis
complex (MTBC) is complicated by mixed TB infections,
which happens when a person is infected with more than
one distinct strain type of MTBC. This often results in
poor diagnosis and treatment of patients as the bacterial
subpopulation may have undetected differences in drug
susceptibility [2]. A strain typing method should be able to
distinguish closely related strains, to also allow the
detection of a mixed infection at finer resolutions [3]. This
study aims to detect a possible mixed TB infection at
different levels in MTBC and to determine the frequencies
of the present strains based on established tree paths in the
MTBC phylogenetic tree.
METHODS
A global comprehensive dataset of 5992 MTBC strains
was used for analysis, and 226570 SNPs were extracted
from this set to construct a SNP-based phylogenetic tree
with RAxML. In this bifurcating tree, each branch
represents a cluster of strains and splits into two new
monophyletic subclusters of genetically more closely
related strains. These “splits” were used to define clusters and subclusters that contain more than 10 strains. Global SNP
association was done for each cluster to get cluster-
specific SNPs, those for which the true positive rate, true
negative rate, positive predictive value, and negative
predictive value were >0.95. Markers were generated from
these SNPs by extending them with 10 bp sequence on
each side based on reference genome H37Rv. Each
hierarchical cluster now has a set of specific SNP markers.
By mapping sample reads against these 21 bp cluster-
specific SNP markers the tool determines the presence of
paths in the phylogenetic tree that start at the MTBC root
node. Paths that split indicate the presence of multiple
strains and thus a mixed infection.
The read depth at the root node represents a frequency of 1
of the present MTBC species. If the path splits further in
the tree, the total read depth is divided over the two
subpaths and determines the frequencies of those present
subclusters (Figure 1).
FIGURE 1. Detection of mixed TB infection with hierarchical clusters.
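The frequency estimation described above can be sketched as a recursive split of read depth over paths in the tree. This is an illustrative reconstruction under assumed data structures (a child-list tree and per-cluster marker read depths), not the authors' tool.

```python
# Sketch of the frequency estimation: the read depth entering the root is
# treated as frequency 1, and wherever the path splits, the incoming frequency
# is divided over the subpaths in proportion to their marker read depth.
# The tree representation and depth values are hypothetical, not the authors' code.

def subpopulation_frequencies(tree, depths, node="root", freq=1.0, out=None):
    """tree: dict node -> list of child clusters; depths: dict cluster -> read
    depth at that cluster's specific SNP markers.
    Returns {terminal_cluster: frequency} for the detected subpopulations."""
    if out is None:
        out = {}
    # children of this node whose cluster-specific markers are covered by reads
    present = [c for c in tree.get(node, []) if depths.get(c, 0) > 0]
    if not present:                       # path ends here: one subpopulation
        out[node] = freq
        return out
    total = sum(depths[c] for c in present)
    for child in present:
        # split the incoming frequency proportionally to read depth
        subpopulation_frequencies(tree, depths, child,
                                  freq * depths[child] / total, out)
    return out
```

A sample whose path splits (more than one terminal cluster in the result) would be flagged as a putative mixed infection.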
The detected strains are combined with detected drug
susceptibility profiles. A minimized reference genome
consisting of drug resistance genes and 1000 bp flanking
regions is used to map sample reads with BWA, and call
variants with Pilon. Ambiguous variation calls may
indicate that present strains in a mixed infection sample
also have differences in drug susceptibility.
RESULTS & DISCUSSION
In the phylogenetic tree, 308 clusters (MTBC root excluded) were defined, yielding 14,823 SNP markers in total that are specific to a cluster and unique within that cluster. The known MTBC lineages 1 to 6 each have between 355 and 614 markers.
7,661 TB samples were tested; present strain(s) and frequencies could be predicted for 7,495 samples, of which 914 (~12%) are mixed infections (Table 1).

# of subpopulations    1     2    3    >3
# of samples           6581  798  95   21

TABLE 1. 914 out of 7,495 samples are mixed infections.
REFERENCES
1. World Health Organization. Global Tuberculosis Report. World Health Organization, Geneva, Switzerland, 2014.
2. Zetola et al. Mixed Mycobacterium tuberculosis complex infections and false-negative results for rifampicin resistance by GeneXpert MTB/RIF are associated with poor clinical outcomes. J. Clin. Microbiol. 52:2422–2429, 2014.
3. Plazzotta, G., Cohen, T., and Colijn, C. Magnitude and sources of bias in the detection of mixed strain M. tuberculosis infection. J. Theor. Biol. 368:67–73, 2015.
P28. APPLICATION OF HIGH-THROUGHPUT SEQUENCING TO CIRCULATING MICRORNAS REVEALS NOVEL BIOMARKERS FOR DRUG-INDUCED LIVER INJURY
Julian Krauskopf1*, Florian Caiment1, Sandra Claessen1, Kent J. Johnson2, Roscoe L. Warner2, Shelli J. Schomaker3, Deborah A. Burt3, Jiri Aubrecht3, Jos C. Kleinjans1.
Department of Toxicogenomics, Maastricht University, Maastricht 6200 MD, The Netherlands1; Pathology Department, University of Michigan, Ann Arbor, MI 48109, USA2; Drug Safety Research and Development, Pfizer, Inc., Groton, CT 06340, USA3. *[email protected]
Drug-induced liver-injury (DILI) is a leading cause of acute liver failure and the major reason for withdrawal of drugs
from the market. Preclinical evaluation of drug candidates has failed to detect about 40% of potentially hepatotoxic
compounds in humans. At the onset of liver injury in humans, currently used biomarkers have difficulty differentiating
severe DILI from mild DILI and/or predicting the outcome of injury for individual subjects. Therefore, new biomarker
approaches for predicting and diagnosing DILI in humans are urgently needed. Recently, circulating microRNAs
(miRNAs) such as miR-122 and miR-192 have emerged as promising biomarkers of liver injury in preclinical species
and in DILI patients. In this study, we focused on examining global circulating miRNA profiles in serum samples from
subjects with liver injury caused by accidental acetaminophen (APAP)-overdose. Upon applying next generation high-
throughput sequencing of small RNA libraries, we identified 36 miRNAs, including three novel miRNA-like small
nuclear RNAs, which were enriched in serum of APAP overdosed subjects. The set comprised miRNAs that are
functionally associated with liver-specific biological processes and relevant to APAP toxic mechanisms. Although more
patients need to be investigated, our study suggests that profiles of circulating miRNAs in human serum might provide
additional biomarker candidates and possibly mechanistic information relevant to liver injury.
P29. INFORMATION THEORETIC MODEL FOR GENE PRIORITIZATION
Ajay Anand Kumar1,2*, Geert Vandeweyer1,2, Lut Van Laer1,2 & Bart Loeys1,2.
Department of Medical Genetics, University of Antwerp1; Biomedical Informatics, Antwerp University Hospital2.
The identification of top candidate genes involved in human diseases from a list of candidate genes remains computationally challenging. Many tools exist for this computational prioritization, and their core typically utilizes fusion or integration of various genomic annotation data sources. However, due to the rapid generation of novel data by high-throughput experiments, annotation sources often become outdated, leading to annotation errors. Hence, predictions based on these computational tools are not reliable. To tackle this, we propose an information theoretic model that effectively fuses annotation sources with a regression model under a Bayesian framework to prioritize candidate genes. Our method is fast and performs better than four existing tools on their own benchmark dataset.
INTRODUCTION
Gene prioritization has become a central research problem in the bioinformatics domain. With the advent of exome sequencing in clinical genetics, it has become necessary to automate the identification of the genes most likely involved in the disease from a given pool of affected genes. Various annotation sources can be integrated or fused to learn multiple functionalities of genes and then design a classifier/regressor for prioritization. We propose here an early data integration method that implements an information retrieval model to fuse the data at the functional feature level and then designs a discriminative regression model in a Bayesian framework to prioritize candidate genes.
METHODS
The principle behind our approach is guilt-by-association: genes that are known to be disease associated might also share similar functions. The idea is that a classifier or regressor can be trained on the linear mapping between the functional proximity profiles of genes and their phenotypic proximity profiles. We implemented a Bayesian regressor to infer the degree of association of the test genes with the query disease. The workflow is shown in Figure 1. The details are:
1. Functional annotation: text, ontologies (GO, MPO), sequence similarity, pathways, interactions. Phenotype annotation: Human Phenotype Ontology (HPO), Disease Ontology (DO), HuGE/MeSH terms and GAD.
2. TF-IDF (term frequency – inverse document frequency) methodology is used to assign statistical weights to the functional attributes of genes from these annotation sources. TF-IDF is a data-driven model traditionally used for information retrieval; we apply the same methodology for weighting features. Together, this gives gene-by-gene functional and phenotypic proximity profiles.
3. Finally, for a given set of query disease or training genes, the Bayesian linear regression model learns the linear mapping between the functional and phenotypic proximity profiles: Y = βX + η, where η is Gaussian distributed. We have incorporated traditional non-informative Normal-Inverse Gamma (NIG) priors for estimating the unknowns β and σ.
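The TF-IDF weighting in step 2 can be sketched as follows. This is a generic illustration of the technique, not the authors' implementation, and the toy annotation terms in the example are invented.

```python
import math

# Generic TF-IDF sketch for weighting gene annotation features (step 2 above).
# Each gene is treated as a "document" whose annotation terms (GO terms,
# pathways, ...) are the "words". Not the authors' code.

def tf_idf(gene_terms):
    """gene_terms: dict gene -> list of annotation terms.
    Returns dict gene -> {term: tf-idf weight}."""
    n_genes = len(gene_terms)
    # document frequency: in how many genes does each term occur?
    df = {}
    for terms in gene_terms.values():
        for t in set(terms):
            df[t] = df.get(t, 0) + 1
    weights = {}
    for gene, terms in gene_terms.items():
        weights[gene] = {}
        for t in set(terms):
            tf = terms.count(t) / len(terms)      # term frequency within the gene
            idf = math.log(n_genes / df[t])       # inverse document frequency
            weights[gene][t] = tf * idf
    return weights
```

Note that a term annotated to every gene receives idf = log(1) = 0, so uninformative annotations are automatically down-weighted before the proximity profiles are built.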
RESULTS & DISCUSSION
We performed a leave-one-out cross-validation experiment on the benchmark dataset that was used to compare four other tools whose design principles are similar to our method [1]. Our dataset consisted of 1,040 disease genes manually curated into 12 different disease classes [2]. In our preliminary results for 1,154 prioritizations, at cut-offs of the top 5%, 10% and 30% of genes ranked in a random control dataset, we achieved an AUROC of 86.31% against their best achieved score of 83.0%. This indicates that our method compares favorably with the other tools in the comparative analysis.
FIGURE 1. Workflow of the Bayesian regression model for gene prioritization.
Currently, we are performing a large-scale cross-validation with 6,762 manually curated disease–gene associations, a larger number of tools, and additional benchmark data [3]. Additionally, we plan to develop a probabilistic generative approach that models co-occurrences and dependencies of features for effective data fusion, which can help in finding novel disease-causing genes.
REFERENCES
1. Chen, B. et al. BMC Med Genomics 8(Suppl 3), S2 (2015).
2. Goh et al. Proc Natl Acad Sci USA 104(21), 8685–8690 (2007).
3. Börnigen, D. et al. Bioinformatics 28(23), 3081–3088 (2012).
P30. GALAHAD: A WEB SERVER FOR THE ANALYSIS OF DRUG EFFECTS FROM GENE EXPRESSION DATA
Griet Laenen1,2,*, Amin Ardeshirdavani1,2, Yves Moreau1,2 & Lieven Thorrez1,3.
Dept. of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven1; iMinds Medical IT Dept., KU Leuven2; Dept. of Development and Regeneration @ Kulak, KU Leuven3.
Galahad (https://galahad.esat.kuleuven.be) is a web-based application for the analysis of gene expression data from drug
treatment versus control experiments, aimed at predicting a drug’s molecular targets and biological effects. Galahad
provides data quality assessment and exploratory analysis, as well as computation of differential expression. Based on
the obtained differential expression values, drug target prioritization and both pathway and disease enrichment can be
calculated and visualized. Drug target prioritization is based on the integration of the gene expression data with a
functional protein association network.
INTRODUCTION
Gene expression analysis is frequently employed to study
the effects of drug compounds on cells. The observed
transcriptional patterns can provide valuable information
for identifying compound–protein interactions as well as
resulting biological effects. To facilitate the analysis of
this particular data type and enable an in-depth exploration
of a drug’s mode of effect, we have developed Galahad1.
INPUT
The main input for Galahad consists of raw Affymetrix human,
mouse or rat DNA microarray data derived from both
untreated control samples and samples treated with a drug
of interest. In addition, Galahad provides the possibility to
start from differential expression data derived with other
platforms to perform drug target prioritization and
enrichment analysis.
METHODS
The different analyses are depicted in Figure 1 and include:
- preprocessing of the raw data with RMA or MAS5.0, as indicated by the user;
- quality assessment and exploratory analysis to ascertain data quality, uncover experimental issues, and help in deciding whether certain arrays need to be considered as outlying;
- differential expression analysis to determine the significance of gene up- and downregulation following drug treatment;
- genome-wide drug target prioritization by means of an in-house developed algorithm for network neighborhood analysis integrating the expression data with functional protein association information2;
- prediction of molecular pathways involved in the drug's mode of effect;
- identification of associated disease phenotypes enabling side effect prediction and drug repositioning.
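As a minimal illustration of what the differential expression step computes per gene, a plain log2 ratio with a Welch-style statistic might look like the sketch below. This is not Galahad's actual implementation (which operates on Affymetrix arrays with established statistical methods); the expression values and the choice of a simple t-like statistic are illustrative assumptions.

```python
import math
from statistics import mean, stdev

# Minimal sketch of treated-vs-control differential expression for one gene.
# Real pipelines use moderated statistics; this only shows where a log2 ratio
# and a significance score conceptually come from. Not Galahad's code.

def differential_expression(control, treated):
    """control, treated: lists of log2-scale expression values for one gene.
    Returns (log2 ratio, Welch t statistic); a larger |t| means stronger
    evidence of differential expression."""
    log2_ratio = mean(treated) - mean(control)   # difference of log2 means
    se = math.sqrt(stdev(control) ** 2 / len(control)
                   + stdev(treated) ** 2 / len(treated))
    return log2_ratio, log2_ratio / se
```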
OUTPUT
The output is displayed in a series of tabs corresponding to the different analyses selected by the user:
- in the Quality Control and Data Exploration tabs, several diagnostic plots are displayed along with a short explanation;
- the Differential Expression tab contains a sorted table listing all genes together with their log2 ratios and P-values for differential expression, as well as links to the corresponding GeneCards sections;
- in the Drug Target Prioritization tab, a ranked list of genes as potential targets of the drug can be found, together with the network diffusion-based scores and P-values for prioritization, and links to the corresponding GeneCards section; in addition, a network-based visualization is available for each gene, showing the 10 interaction partners contributing most to the gene's ranking;
- the tabs summarizing the results for Pathway and Disease Enrichment contain a sorted table with pathway or disease ontology IDs, names, and database links, together with the number of differentially expressed genes in the corresponding gene sets and the accompanying P-values; in addition, network graphs are available, consisting of the top 10 most significant pathways or disease phenotypes, along with their associated genes colored according to fold change.
FIGURE 1. Overview of the Galahad analysis steps.
REFERENCES
1. Laenen, G. et al. Nucl Acids Res 43, W208–W212 (2015).
2. Laenen, G. et al. Mol BioSyst 9, 1676–1685 (2013).
P31. KMAD: KNOWLEDGE BASED MULTIPLE SEQUENCE ALIGNMENT FOR INTRINSICALLY DISORDERED PROTEINS
Joanna Lange1,2, Lucjan S. Wyrwicz1 & Gert Vriend2*.
Laboratory of Bioinformatics and Biostatistics, M. Sklodowska-Curie Memorial Cancer Center and Institute of Oncology1; CMBI, Radboud University Nijmegen2.
INTRODUCTION
Intrinsically disordered proteins (IDPs) lack tertiary
structure and thus differ from globular proteins in terms of
their sequence – structure – function relations. IDPs have a
lower sequence conservation, different types of active
sites, and a different distribution of functionally important
regions, which altogether makes their multiple sequence
alignment (MSA) difficult.
Algorithms underlying existing MSA programs are
directly or indirectly based on knowledge obtained from
studying three-dimensional protein structures. Here we introduce a tool for Knowledge-based Multiple sequence Alignment for intrinsically Disordered proteins, KMAD, that incorporates SLiM, domain, and PTM annotations to improve the alignments.
The KMAD web server is accessible at http://www.cmbi.ru.nl/kmad/. A standalone version is freely available.
METHODS
A dataset of proteins experimentally proven to be disordered was obtained from DisProt (Sickmeier et al., 2007). For each IDP, all homologous sequences were extracted from SwissProt (The UniProt Consortium, 2014) using BLAST.
The sequence sets were aligned with several MSA tools.
Apart from manual validation we also performed a
benchmark validation on reference sets from BAliBASE
(Thompson et al., 2005) and PREFAB holding structure-
based 'gold standard' sequence alignments. For this
purpose we used KMAD and a modified version of
KMAD, which performs a ’refinement’ of Clustal Omega
(Sievers et al., 2011) alignments.
RESULTS & DISCUSSION
Manual validation showed that KMAD avoids many of the mistakes made by Clustal Omega. An example of such an alignment mistake is shown in Figure 1.
FIGURE 1. Excerpts from (a) Clustal Omega and (b) KMAD alignments of human sialoprotein (SIAL_HUMAN) with four homologues. Various PTM kinds are highlighted with bright colours.
In the field of sequence alignment research it is common
practice to compare the sequence alignments obtained with
MSA software with those that are obtained from structure
superpositions. IDPs do not possess a static 3D structure
so that this method is not applicable to KMAD alignments.
Both of the validation methods that we used have their
disadvantages, but so far there is no alternative. Validation
on benchmark alignments of structured proteins is biased
towards Clustal Omega, because it was optimized to work
with structured proteins. On the other hand, the manual
inspection based on the same features that influence the
alignment is not a very elegant method, but given the
nature of IDPs probably the best we can do.
REFERENCES
Edgar, R.C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 32(5), 1792–1797.
Sievers, F., Wilm, A., Dineen, D., Gibson, T.J., Karplus, K., Li, W., Lopez, R., McWilliam, H., Remmert, M., Söding, J., Thompson, J.D., and Higgins, D.G. (2011). Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology 7, 539.
Sickmeier, M., Hamilton, J.A., LeGall, T., Vacic, V., Cortese, M.S., Tantos, A., Szabo, B., Tompa, P., Chen, J., Uversky, V.N., Obradovic, Z., and Dunker, A.K. (2007). DisProt: the Database of Disordered Proteins. Nucleic Acids Research 35(Database issue), D786–D793.
The UniProt Consortium (2014). Activities at the Universal Protein Resource (UniProt). Nucleic Acids Research 42(Database issue), D191–D198.
Thompson, J.D., Koehl, P., Ripp, R., and Poch, O. (2005). BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins: Structure, Function, and Bioinformatics 61(1), 127–136.
P32. ON THE LZ DISTANCE FOR DEREPLICATING REDUNDANT PROKARYOTIC GENOMES
Raphaël R. Léonard1,2*, Damien Sirjacobs2, Eric Sauvage1, Frédéric Kerff1 & Denis Baurain2.
Centre for Protein Engineering, University of Liège1; PhytoSYSTEMS, University of Liège2.
The fast-growing number of available prokaryotic genomes, along with their uneven taxonomic distribution, is a problem
when trying to assemble broadly sampled genome sets for phylogenomics and comparative genomics. Indeed, most of
the new genomes belong to the same subset of hyper-sampled phyla, such as Proteobacteria and Firmicutes, or even to
single species, such as Escherichia coli (almost 2000 genomes as of Sept 2015), while the continuous flow of newly
discovered phyla prompts for regular updates. This situation makes it difficult to maintain sets of representative genomes
combining lesser known phyla, for which only few species are available, and sound subsets of highly abundant phyla. An
automated straightforward method is required but none are publicly available. The LZ distance, in conjunction with the
quality of the annotations, can be used to create an automated approach for selecting a subset of representative genomes
without redundancy. We are planning to release this tool on a website that will be made publicly available.
INTRODUCTION
The LZ distance (Lempel and Ziv, 1977; Otu and Sayood,
2003) is inspired by compression algorithms, such as gzip
or WinRAR. This distance, amongst others, has already
been used in attempts to produce alignment-free
phylogenetic trees (Bacha and Baurain, 2005; Hohl et al.
2007), though the results were disappointing in such a
context (due to the heterogeneity of the substitution
process at large evolutionary scales). However, the LZ
distance is likely to provide enough resolving power to
identify groups of redundant genomes and to keep only
one representative for each group.
METHODS
For each pair of genomes A and B, the LZ distance is
computed from the gzip-compressed file lengths of the
corresponding nucleotide assemblies s(A) and s(B) and of
their concatenations s(A+B) and s(B+A). These distances,
along with taxonomic information, are stored in a
database.
A clustering method is then applied to regroup the similar
genomes into a user-specified number of groups. For each
of these groups, a representative is chosen based on the
quality of the genomic assemblies (chromosomes rather
than scaffolds) and of the protein annotations (e.g., few
rather than many “unknown proteins”).
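A gzip-based distance of this kind can be sketched in Python with zlib. The normalization below is borrowed from the normalized compression distance of Cilibrasi and Vitányi and is an assumption: the authors' exact combination of s(A), s(B), s(A+B) and s(B+A) may differ, though both concatenation orders are used as the methods describe.

```python
import zlib

def csize(data: bytes) -> int:
    """Length of the DEFLATE-compressed data, a proxy for LZ complexity."""
    return len(zlib.compress(data, 9))

def compression_distance(a: bytes, b: bytes) -> float:
    """Compression-based distance between two nucleotide assemblies.
    Uses both concatenation orders as in the methods; the normalization
    (Cilibrasi-Vitanyi NCD) is an assumption, not the authors' formula."""
    ca, cb = csize(a), csize(b)
    cab = min(csize(a + b), csize(b + a))
    return (cab - min(ca, cb)) / max(ca, cb)
```

Identical or near-identical genomes add almost nothing to the compressed size of the concatenation, so their distance approaches 0, while unrelated genomes approach 1; clustering on this matrix then groups redundant assemblies.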
RESULTS & DISCUSSION
Our method using the LZ distance is currently under
development using the genomes from the release 28 of
Ensembl Bacteria (ftp://ftp.ensemblgenomes.org/pub/
bacteria/release-28/). It contains 20,950 unique
prokaryotic genomes, composed of 286 Archaea and
20,664 Bacteria. The three most represented phyla are the
Proteobacteria (8642, of which 1980 E. coli), the
Firmicutes (7766) and the Actinobacteria (2673). These
genomes are already the result of a pre-processing step
designed to remove extra assemblies for strains present in
multiple copies (due to parallel sequencing or
resequencing in different labs).
We are working on different approaches for validating our
dereplication method, based on (1) current taxonomy, (2)
16S rRNA phylogeny, and (3) clustering using genomic
signatures (Moreno-Hagelsieb et al. 2013).
First, we compute a central measure of the taxonomic
“purity” of all genome clusters, which reflects the amount
of “mixture” at different taxonomic levels (phylum, class,
order etc). A good clustering should regroup different
genera (or species) without amalgamating distinct classes
(or phyla). Second, we cut the branches of a large 16S
rRNA tree based on the same genome collection to
produce an equal number of groups to compare with our
clustering method. We then compute a statistic of the
overlap between the 16S subtrees and the LZ clusters. A
good clustering should have a reasonable overlap with the
gold standard that is the 16S rRNA tree. Third, using the
same overlap metric, we compare the LZ clusters to
clusters obtained using the genomic signature.
Finally, an interactive tool will be made available through
a website. It will allow the users to download pre-
computed sets of representative genomes for either the
complete database or for taxonomic subsets. We are also
planning to allow users to upload their own genomes to
cluster them with the LZ method.
REFERENCES
Ziv, J. and Lempel, A. 1977. ‘A Universal Algorithm for Sequential Data Compression.’ IEEE Transactions on Information Theory 23.3. doi:10.1109/TIT.1977.1055714.
Otu, H.H. and Sayood, K. 2003. ‘A New Sequence Distance Measure for Phylogenetic Tree Construction.’ Bioinformatics 19.16: 2122–2130. doi:10.1093/bioinformatics/btg295.
Moreno-Hagelsieb, G., Wang, Z., Walsh, S. and Elsherbiny, A. 2013. ‘Phylogenomic Clustering for Selecting Non-Redundant Genomes for Comparative Genomics.’ Bioinformatics 29.1: 947–949. doi:10.1093/bioinformatics/btt064.
Höhl, M. and Ragan, M.A. 2007. ‘Is Multiple-Sequence Alignment Required for Accurate Inference of Phylogeny?’ Systematic Biology 56.2: 206–221. doi:10.1080/10635150701294741.
Bacha, S. and Baurain, D. 2005. ‘Application of Lempel-Ziv complexity to alignment-free sequence comparison of protein families.’ Benelux Bioinformatics Conference 2005. http://hdl.handle.net/2268/80179
10th Benelux Bioinformatics Conference bbc 2015
77
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: P Poster
P33. THE ROLE OF MIRNAS IN ALZHEIMER’S DISEASE
Ashley Lu1,2*, Annerieke Sierksma1,2, Bart De Strooper1,2 & Mark Fiers1,2.
VIB Center for the Biology of Disease1; KU Leuven Center for Human Genetics2.
MicroRNAs (miRNA) play an important role in post-transcriptional regulation and were shown to be dysregulated in
Alzheimer’s disease. By analysing the hippocampal miRNA and mRNA expression of two mouse models of Alzheimer’s
disease, we identify a set of miRNAs that are dysregulated with the onset of cognitive impairments. Using GO
enrichment analysis we aim to identify miRNAs that likely play a role in learning and memory.
INTRODUCTION
MiRNAs are small non-coding RNAs involved in post-
transcriptional regulation through mRNA inhibition or
degradation. Past studies have suggested miRNAs to play
a direct role in Alzheimer’s disease (AD), e.g. by
modulating the expression of genes involved in the
formation of neuropathological protein aggregates (Lau P
& De Strooper B, 2010). In this study, we investigated the
changes in miRNA and mRNA expression in two AD
mouse models: APPswe/PS1L166P
(Radde R, 2006) and
Thy-Tau22 (Schindowski K, 2006), which have similar
patterns of cognitive impairment, but different pathology.
We aim to better understand the functional role of
miRNAs in AD-related cognitive impairments.
METHODS
RNA was extracted from the left hippocampus of 96 mice.
The experiment covers the two models (APPswe/PS1L166P
& Thy-Tau22), with wild type controls for each. All
genotypes are tested at two ages (4 and 10 months); before
and after onset of cognitive impairment. This yields eight
experimental groups with twelve mice each.
Expression profiles of miRNAs and mRNAs were
generated using Illumina single-end sequencing.
Differential Expression (DE) analysis was performed
using the limma package of R/Bioconductor with a linear
model to test the effects of age, genotype and their
interaction.
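The linear model described above (effects of age, genotype and their interaction) corresponds to a design matrix of the following shape. The study itself used limma in R; this is a generic numpy sketch with toy group sizes, shown only to make the model structure concrete.

```python
import numpy as np

# Toy design: 2 ages x 2 genotypes, 2 mice per group (the study used 12).
ages = np.array([4, 4, 4, 4, 10, 10, 10, 10])
genotypes = np.array([0, 0, 1, 1, 0, 0, 1, 1])  # 0 = wild type, 1 = transgenic

age10 = (ages == 10).astype(float)
tg = genotypes.astype(float)
# Columns: intercept, age effect, genotype effect, age x genotype interaction.
X = np.column_stack([np.ones(len(ages)), age10, tg, age10 * tg])

# A gene whose expression rises only in old transgenic mice:
y = 2.0 + 1.5 * age10 * tg
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta[3])  # the interaction coefficient recovers the effect, ~1.5
```

Testing the interaction coefficient is what isolates expression changes that track the onset of cognitive impairment rather than age or genotype alone.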
Functional analysis of the mRNAs and miRNAs was conducted separately. For mRNAs, gene ontology analysis was applied to sets of the most up- and down-regulated genes.
To determine the functional impact of dysregulated
miRNAs we determined which mRNAs are the most likely
direct targets of each miRNA using the following
approach: 1) for each miRNA we calculated the Pearson’s
correlation coefficient to each mRNA based on the
miRNA and mRNA expression data. 2) For each miRNA
we extracted the predicted set of targets from Targetscan
(Lewis BP & Burge CB & Bartel DP, 2005), with Diana
(Maragkakis M et al. 2011) as backup when Targetscan
had no record. 3) We filtered the miRNA target genes by
determining the leading edge set in a GSEA PreRanked
analysis (Subramanian A. et al, 2005) using the predicted
target mRNAs of each miRNA against the mRNAs ranked
according to the Pearson’s scores generated in step 1. We
additionally investigated target sets based on a Pearson’s
correlation coefficient cut-off of -0.2, -0.3, and -0.4. 4)
Gene-ontology analysis was then applied to these
candidate target sets to infer the likely biological function
of each miRNA.
RESULTS & DISCUSSION
DE analysis showed that the direction of expression level changes in mRNAs is similar between APPswe/PS1L166P and Thy-Tau22 in terms of age*genotype interaction effects. However, for the miRNAs the expression pattern is less obvious. Overall, the effect size is more pronounced in the APPswe/PS1L166P mouse than in Thy-Tau22 for both miRNAs and mRNAs.
Functional analyses of the down-regulated mRNAs show a
clear enrichment in cognition and neural development
related categories, whereas up-regulated genes show a
clear inflammatory signature.
Combining miRNA target prediction with miRNA/mRNA
correlation analysis shows a marked increase of GO
enrichment scores. This analysis strongly suggests a
regulatory role for miRNAs in the down regulation of
genes involved in learning, cognition and related
categories.
This analysis workflow has allowed us to focus on a list of miRNAs that likely play a direct role in the observed learning and memory deficits in AD mouse models, and has been used to select candidate miRNAs for downstream in vivo experiments, which will hopefully provide a deeper understanding of the impact of AD on learning and cognition.
REFERENCES
Lau P & De Strooper B. Seminars in Cell & Developmental Biology 21(7), 768–773 (2010).
Radde R. EMBO Reports 7(9), 940–946 (2006).
Schindowski K. The American Journal of Pathology 169(2), 599–616 (2006).
Lewis BP, Burge CB & Bartel DP. Cell 120, 15–20 (2005).
Maragkakis M et al. Nucleic Acids Research (2011).
Subramanian A et al. Proceedings of the National Academy of Sciences of the United States of America 102(43), 15545–15550 (2005).
P34. FUNCTIONAL SUBGRAPH ENRICHMENTS FOR NODE SETS IN REGULATORY NETWORKS
Pieter Meysman1,2*, Yvan Saeys3,4, Ehsan Sabaghian5,6, Wout Bittremieux1,2, Yves van de Peer5,6, Bart Goethals1 & Kris Laukens1,2.
Advanced Database Research and Modeling (ADReM), University of Antwerp1; Biomedical informatics research center Antwerpen (biomina)2; VIB Inflammation Research Center3; Department of Respiratory Medicine, Ghent University4; Department of Plant Biotechnology and Bioinformatics, Ghent University5; Department of Plant Systems Biology, VIB/Ghent University6.
We have developed a subgroup discovery algorithm to find subgraphs in a single graph that are associated with a given
set of nodes. The association between a subgraph pattern and a set of vertices is defined by its significant enrichment
based on a Bonferroni-corrected hypergeometric probability value, and can therefore be considered as a network-focused
extension of traditional gene ontology enrichment analysis. We demonstrate the operation of this algorithm by applying it
on two transcriptional regulatory networks and show that we can find relevant functional subgraphs enriched for the
selected nodes.
INTRODUCTION
Frequent subgraph mining (FSM) is a common but
complex problem within the data mining field that has
gained in importance as more graph data has become
available. However, traditional FSM finds all frequent
subgraphs within the graph dataset, while often a more
interesting query is to find the subgraphs that are most
associated with a specific set of nodes. Nodes of interest
might be those that are associated with a specific disease,
or those that are differentially expressed in an omics
experiment.
METHODS
To address this issue, we developed a novel subgraph
mining algorithm that can efficiently construct, match and
test candidate subgraphs against the given graph for
enrichment within a specific set of nodes (Meysman et al.
2015). To allow the enrichment testing, each candidate
subgraph is built around a ‘source’ node. A subgraph
match where the source node corresponds to a node of
interest is counted as a ‘hit’. If the source node is not a
node of interest, it is counted as a background hit. In this
manner the problem of enrichment can be easily tested
using a hypergeometric test. Furthermore, we show that
this definition of enrichment allows us to drastically prune
the search space that the algorithm must traverse to find all
enriched subgraphs.
An implementation of the algorithm is available at
http://adrem.ua.ac.be/sigsubgraph.
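The hit/background counting described above reduces to a one-sided hypergeometric test per candidate subgraph. A minimal sketch with a hand-rolled survival function; the function names and toy counts are assumptions for illustration, not the published implementation.

```python
from math import comb

def hypergeom_sf(hits, matches, interest, total):
    """P(X >= hits) for X ~ Hypergeometric(total, interest, matches):
    drawing `matches` source nodes from `total` nodes, of which
    `interest` are nodes of interest."""
    return sum(comb(interest, k) * comb(total - interest, matches - k)
               for k in range(hits, min(matches, interest) + 1)) / comb(total, matches)

def subgraph_enrichment_p(hits, matches, interest, total, tests=1):
    """Bonferroni-corrected one-sided enrichment p-value for a candidate
    subgraph with `matches` source-node matches in the graph, `hits` of
    which fall inside the node set of interest."""
    return min(1.0, hypergeom_sf(hits, matches, interest, total) * tests)

# Toy graph: 1000 nodes, 50 of interest; a subgraph matches 20 source
# nodes, 10 of them nodes of interest: a strong enrichment.
print(subgraph_enrichment_p(10, 20, 50, 1000))
```

Because the statistic depends only on the hit count among source-node matches, an upper bound on achievable significance can be computed before extending a candidate subgraph, which is what enables the search-space pruning mentioned above.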
RESULTS & DISCUSSION
The first data set concerned the yeast genes that have
remained in duplicate following the most recent whole
genome duplication. Within the yeast transcriptional
network, we found that these duplicate genes were
enriched for self-regulating motifs (e.g. feedback loops,
self edges, etc.), which matches the duplicated nature of
these genes (Figure 1).
FIGURE 1. Enriched subgraphs for yeast duplicated genes
The second data set concerned mining the subgraphs
associated with the homologs of the PhoR transcription
factor across seven different inferred bacterial regulatory
networks from Colombos expression data (Meysman et al.
2014). These PhoR homologs were found to be
significantly associated with several complex regulatory
motifs.
REFERENCES
Meysman P et al. Discovery of Significantly Enriched Subgraphs Associated with Selected Vertices in a Single Graph. Proceedings of the 14th International Workshop on Data Mining in Bioinformatics (2015).
Meysman P et al. COLOMBOS v2.0: an ever expanding collection of bacterial expression compendia. Nucleic Acids Research 42(D1), D649–D653 (2014).
P35. HUMANS DROVE THE INTRODUCTION & SPREAD OF MYCOBACTERIUM ULCERANS IN AFRICA
Koen Vandelannoote1,2,*, Conor Meehan1*, Miriam Eddyani1, Dissou Affolabi3, Delphin Mavinga Phanzu4, Sara Eyangoh5, Kurt Jordaens6, Françoise Portaels1, Kirstie Mangas7, Torsten Seemann7, Herwig Leirs2, Tim Stinear7 & Bouke C. de Jong1.
Institute of Tropical Medicine, Antwerp, Belgium1; Evolutionary Ecology Group, University of Antwerp, Antwerp, Belgium2; Laboratoire de Référence des Mycobactéries, Cotonou, Benin3; Institut Médical Evangélique, Kimpese, Democratic Republic of Congo4; Centre Pasteur du Cameroun, Yaoundé, Cameroun5; Joint Experimental Molecular Unit, Royal Museum for Central Africa, Tervuren, Belgium6; Department of Microbiology and Immunology, University of Melbourne, Melbourne, Australia7. *[email protected]
Buruli ulcer (BU) is an insidious neglected tropical disease. BU is reported around the world but the rural regions of
West and Central Africa are most affected. How BU is transmitted and spreads has remained a mystery, even though the
causative agent, Mycobacterium ulcerans, has been known for more than 70 years. Here, using the tools of population
genomics, we reconstruct the evolutionary history of M. ulcerans by comparing 167 isolates spanning 48 years and
representing 11 endemic countries across Africa. The genetic diversity of African M. ulcerans proved very limited
because of its slow substitution rate coupled with its recent origin. We show for the first time how M. ulcerans has existed in Africa for several hundred years but was recently re-introduced during the period of Neo-imperialism. We also provide evidence of the role that the so-called “Scramble for Africa” played in the spread of the disease.
INTRODUCTION
The clonal population structure of M. ulcerans has meant
that conventional genetic fingerprinting methods have
largely failed to differentiate clinical disease isolates,
complicating molecular analyses on the elucidation of the
population structure, and the evolutionary history of the
pathogen. Whole genome sequencing (WGS) is currently
replacing conventional genotyping methods for M.
ulcerans.
METHODS
We analyzed a panel of 165 M. ulcerans disease isolates
originating from disease foci in 11 different African
countries that had been cultured between 1964 and 2012.
Index-tagged paired-end sequencing-ready libraries were
prepared from gDNA extracts. Genome sequencing was
performed on the Illumina HiSeq 2000 DNA sequencer or
the Illumina MiSeq sequencing platform with respectively
2x150bp and 2x250bp paired-end sequencing chemistry.
Read mapping and SNP detection were performed using
the Snippy v.2.6 pipeline. Bayesian model-based inference
of the genetic population structure was performed using
BAPS v.6.0 [1]. Evidence for recombination between different BAPS clusters was assessed using BRAT-NextGen [2]. We used BEAST2 v2.2.1 [3] to date evolutionary events, determine the substitution rate and produce a time-tree of African M. ulcerans. A permutation test was used to assess the validity of the temporal signal in the data. To assess the geospatial distribution of African M. ulcerans through time, an additional BEAST2 analysis was performed with a discrete BSSVS geospatial model [4].
RESULTS & DISCUSSION
Resulting sequence reads were mapped to the Ghanaian M.
ulcerans Agy99 reference genome and, after excluding
mobile repetitive elements and small indels, we detected a
total of 9,193 SNPs randomly distributed across the M.
ulcerans chromosome with approximately 1 SNP per 613
bp (0.15% nucleotide divergence). We explored the
distribution of DNA chromosomal deletions and identified
differential genome reduction that strongly supports the
existence of two specific M. ulcerans lineages within the
African continent, hereafter referred to as Lineage Africa I
(Mu_A1) and Lineage Africa II (Mu_A2). Subsequent
SNP-based exploration of the genetic population structure
agreed with the above deletion analysis and subdivided the
African M. ulcerans population into four major clusters.
BRAT-NextGen did not detect any recombined segments
in any isolate, supporting a strongly clonal population
structure for M. ulcerans that is evolving by vertically
inherited mutations. Within the phylogenetic tree, isolates
formed tight, shallow-rooted phylogenetic clusters which
are suggestive of contemporary dispersal. We estimated a
very slow mean genome-wide substitution rate of 6.32E-8 per site per year. The Bayesian analysis demonstrated that Mu_A1 has existed in Africa for several hundred years and that Mu_A2 was recently introduced on the continent. The re-introduction event coincides well with a historical event of particular interest: the period of Neo-imperialism (1881-1914). Since tMRCA(Mu_A2) did not predate colonization, it seems very likely that lineage Mu_A2 was introduced after the instigation of colonial rule through an influx of BU-infected humans. The time-tree of African M.
ulcerans also reveals evidence of the likely role that the
so-called “Scramble for Africa” played in the spread of
endemic Mu_A1 clones in three hydrological basins
(Congo, Oueme & Nyong) that are particularly well
covered by our isolate panel.
REFERENCES
1. Corander J et al. BMC Bioinformatics 9: 539 (2008).
2. Marttinen P et al. Nucleic Acids Research 40(1): e6 (2012).
3. Bouckaert R et al. PLoS Computational Biology 10(4): e1003537 (2014).
4. Lemey P et al. PLoS Computational Biology 5(9): e1000520 (2009).
P36. LEVERAGING AGO-SRNA AFFINITY TO IMPROVE IN SILICO SRNA
DETECTION AND CLASSIFICATION IN PLANTS
Lionel Morgado1* & Frank Johannes2,3.
Groningen Bioinformatics Centre (GBiC), University of Groningen1; Department of Plant Sciences, Center of Life and Food Sciences Weihenstephan, Technical University Munich2; Institute of Advanced Studies, Technical University Munich3.
Small RNAs (sRNA) have an important role in the regulation of gene expression, either through post-transcriptional
silencing or the recruitment of repressive epigenetic marks such as DNA methylation. In plants, the mode of action of a
given sRNA is tightly related with the Argonaute protein (AGO) to which it binds. High throughput sequencing in
combination with immunoprecipitation techniques have made it possible to determine the sequences of sRNA that are
bound to different families of AGO. Here we apply Support Vector Machines (SVM) to recent AGO-sRNA sequencing
data of A. thaliana to learn which sRNA sequence features govern their differential association with certain AGOs. Our
SVM classifiers show good sensitivity and specificity and provide a framework for accurate in silico sRNA detection and
classification in plants.
INTRODUCTION
Small RNA molecules are known to have an important
role in gene expression control. It is therefore of extreme
interest to be able to detect them and determine the
regulatory pathways in which they are involved. With the
current laboratory methods it is unfeasible to test the large number of sRNA candidates, but computational methods can greatly narrow down the list.
Nevertheless, sRNA activity is still far from being fully
understood and that is reflected in the very high false
positive rate of the prediction tools currently available.
High-throughput sequencing in combination with immunoprecipitation (IP) techniques now makes it possible to access the sRNA sequences associated with a specific AGO. AGO-sRNA binding is a fundamental step in the activation of specific silencing pathways. Here,
AGO-sRNA data acquired from A. thaliana is explored
with SVM-based algorithms to learn which sequence
features drive different AGO-sRNA associations. Using
this knowledge, a framework for in silico sRNA detection
and classification in plants is presented.
METHODS
A system with 3 layers of classifiers (see Figure 1) was designed to identify different kinds of sRNA: the 1st layer includes a binary SVM model that filters out sequences that do not bind to AGO and are therefore most probably inactive; the 2nd layer is composed of an ensemble of binary classifiers, each trained to explore the differences between sRNA bound to a specific AGO and all others; and finally, the 3rd layer comprises a multiclass linear model that assigns the most likely AGO to a given sRNA, using the scores produced in the previous layer.
Diverse AGO-sRNA libraries from A. thaliana were
explored, namely from AGO: 1, 2, 4, 5, 6, 7, 9 and 10.
After the typical RNA-seq library preprocessing, quality
check and genome mapping, several features were
extracted from the remaining sequences, namely: position
specific base composition, sequence length, k-mer
composition and entropy scores. The different feature sets
were explored separately and in different combinations.
Initially, highly correlated features (Pearson correlation > 0.75) were removed, and the remaining ones were further subjected to selection using SVM-RFE (Guyon et al., 2002) with a linear kernel to handle the large data set size. A 10-fold cross-validation procedure was used to account for variation in the data; in each round, the best features were determined as those with the highest average weight across the models with the best ROC-AUC score in each cross-validation subset. In each round, the worst-performing third of the remaining features was eliminated, and the process was repeated until no features remained. The best features found were then used to train the final classifiers using RBF kernels with optimal parameters. This was repeated for all models in layers 1 and 2.
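The screening loop described above (train a linear SVM, drop the worst-scoring third of the remaining features, repeat) can be sketched with scikit-learn's generic RFE on synthetic data; this is an assumed reimplementation of the general SVM-RFE idea, not the authors' pipeline or data.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
# Only the first three features carry signal for the binary label.
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

# Recursive feature elimination with a linear SVM; step=1/3 removes a
# third of the remaining features per round, as in the screening above.
selector = RFE(LinearSVC(max_iter=10000), n_features_to_select=3, step=1 / 3)
selector.fit(X, y)
print(np.where(selector.support_)[0])  # indices of the retained features
```

In the real setting, per-round feature weights would additionally be averaged across the best cross-validation models before each elimination step.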
FIGURE 1. Proposed architecture for the SVM-based framework.
RESULTS & DISCUSSION
Although the classifiers are still being optimized, preliminary results from the 2nd layer of the framework (see Figure 1) show that the features top-ranked by SVM-RFE indeed reflect significant biological patterns of AGO-sRNA association. Among others, the relevance of the 5’ terminal nucleotide was observed, in agreement with findings from previous work (Mi et al., 2008). Additionally, the accuracies of the trained models range from 71% to 86%, showing their capacity to recognize specific AGO-binding patterns.
REFERENCES
Guyon I et al. Gene selection for cancer classification using support vector machines. Mach Learn 46: 389–422 (2002).
Mi S et al. Sorting of small RNAs into Arabidopsis argonaute complexes is directed by the 5’ terminal nucleotide. Cell 133(1): 116–127 (2008).
Zhou A & Pawlowski WP. Regulation of meiotic gene expression in plants. Front Plant Sci 5: 413 (2014).
P37. ANALYSIS OF RELATIONSHIP PATTERNS
IN UNASSIGNED MS/MS SPECTRA
Aida Mrzic1,2*, Wout Bittremieux1,2, Trung Nghia Vu4, Dirk Valkenborg3,5,6, Bart Goethals1 & Kris Laukens1,2.
Advanced Database Research and Modeling (ADReM), University of Antwerp1; Biomedical informatics research center Antwerpen (biomina)2; Flemish Institute for Technological Research (VITO), Mol3; Karolinska Institutet, Stockholm4; CFP, University of Antwerp5; I-BioStat, Hasselt University6.
Tandem mass spectrometry (MS/MS) spectra generated in proteomics experiments often contain a large portion of unexplained peaks, despite continuous search engine improvements. Here we use a pattern mining technique to determine the origin of these unassigned spectra. We discover patterns that indicate the presence of chimeric spectra and missed post-translational modifications (PTMs).
INTRODUCTION
Despite being a rich source of information, mass spectra acquired in mass spectrometry proteomics experiments often contain a significant number of unexplained peaks, or even remain completely unidentified. The unexplained fraction of mass spectra may come from low-quality or chimeric MS/MS spectra, or unexpected PTMs. To interpret the unexplained data, we propose a structured analysis of the peaks occurring in MS/MS spectra. We employ an unsupervised pattern mining technique (Naulaerts et al., 2015) to discover which peaks are associated with each other, and are therefore likely to have a common origin.
METHODS
Frequent itemset mining
The technique we used to discover relationships between
frequently co-occurring peaks in MS/MS data is frequent
itemset mining, a class of data mining techniques that is
specifically designed to discover co-occurring items in
transactional datasets. The typical example of frequent
itemset mining is the discovery of sets of products that are
frequently bought together. Here, every set of products
purchased together represents a single transaction, which
results in a dataset consisting of a large number of
supermarket basket transactions that can be mined for
frequent patterns (Figure 1). In our approach a transaction
consists of the mass differences between relevant peaks in
the MS/MS spectrum.
FIGURE 1. Frequent itemset mining principle.
Mass differences associations
In order to detect relationships between different types of
mass spectrometry peaks, a distinction is made between
peaks that were relevant for spectrum identification
(assigned peaks) and peaks that were not used for the
identification (unassigned peaks) (Vu et al., 2014). The
mass differences between peaks (either assigned,
unassigned, or both) are then calculated so that for each
MS/MS spectrum in the dataset there is a single
transaction consisting of all its mass differences.
After obtaining these transactions for all MS/MS spectra
in the dataset, frequent itemset mining can be employed to
detect relationship patterns (Figure 2). These patterns can
indicate previously unknown characteristics of the spectra,
or even detect novel PTMs.
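The construction of transactions from binned pairwise mass differences, followed by frequent itemset mining, can be sketched as follows. The bin width, toy peak lists and the naive miner are illustrative assumptions; real spectra require an optimized itemset miner.

```python
from itertools import combinations

def transactions_from_spectra(spectra, bin_width=0.01):
    """One transaction per spectrum: the set of binned pairwise
    mass differences between its peaks."""
    txns = []
    for peaks in spectra:
        diffs = {round(abs(a - b) / bin_width) * bin_width
                 for a, b in combinations(peaks, 2)}
        txns.append(diffs)
    return txns

def frequent_itemsets(txns, min_support, max_size=2):
    """Naive enumeration of frequently co-occurring mass differences.
    Fine for toy data; real datasets need an optimized miner."""
    items = {i for t in txns for i in t}
    frequent = {}
    for size in range(1, max_size + 1):
        for cand in combinations(sorted(items), size):
            support = sum(1 for t in txns if set(cand) <= t)
            if support >= min_support:
                frequent[cand] = support
    return frequent

# Toy peak lists in which an 18.01 Da (water-loss-like) difference
# recurs and twice co-occurs with a 79.97 Da (phosphate-related) one.
spectra = [
    [100.0, 118.01, 197.98],
    [200.0, 218.01, 297.98],
    [150.0, 168.01],
]
fi = frequent_itemsets(transactions_from_spectra(spectra), min_support=2)
```

Recurring itemsets of mass differences like these are exactly the signal that points to shared PTMs or co-fragmenting peptides across spectra.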
FIGURE 2. Outline of the approach.
RESULTS & DISCUSSION
In order to evaluate our approach, we used MS/MS
datasets from the PRoteomics IDEntifications (PRIDE)
database (Vizcaino et al., 2013). This database contains a
large number of publicly available datasets from mass-
spectrometry-based proteomics experiments. However, the
quality of the submitted datasets can be subject to large variability, which makes the database a suitable candidate for our pattern mining approach.
Preliminary results show that the detected patterns are able
to capture valid information in a spectrum. The obtained
patterns indicate peaks originating from the same peptide
in case of chimeric spectra and mass differences
originating from common PTMs.
REFERENCES
Naulaerts et al. Brief Bioinform 16(2): 216–231 (2015).
Vizcaino et al. Nucleic Acids Res 41(D1): D1063–D1069 (2013).
Vu et al. Proteome Science 12: 54 (2014).
P38. MINING ACROSS “OMICS” DATA FOR DRUG PRIORITIZATION
Stefan Naulaerts1,2*, Pieter Meysman1,2, Bart Goethals1, Wim Vanden Berghe3 & Kris Laukens1,2.
Advanced Database Research and Modeling (ADReM), University of Antwerp1; Biomedical informatics research center Antwerpen (biomina)2; Department for Biomedical Sciences, University of Antwerp3.
Drug resistance and response have traditionally been investigated by means of case-by-case studies. The process of profiling drug compounds is time and resource intensive. Large-scale information on gene expression and protein abundance, protein interactions, and functional and pathway annotations is now available, along with freely accessible repositories of drug targets. Structural evidence for selected drug compounds is also publicly available. These data offer an enormous opportunity for data integration and pattern mining efforts across each of these levels. Here, we apply frequent itemset mining to identify structurally similar compounds, and to detect patterns within the biological effect profiles of these chemical compound families. Next, we explore how we can link both types of patterns to meta-information (such as drug interactions) in a bid to identify promising compounds and speed up the drug discovery process by means of candidate prioritization.
INTRODUCTION
In the last decades, several widely used databases have
emerged. These vary from gene expression data and mass-
spectrometric protein identifications to resources covering
interaction graphs or functional annotations of proteins
and chemicals.
The presence of these resources offers interesting
opportunities to gain deeper insight in drug mode of action,
as well as help reduce important bottlenecks with regards
to the speed of novel drug discovery or drug repurposing,
by intelligently prioritizing potentially interesting
compounds.
METHODS
To integrate the listed kinds of data, we use pattern mining methods that are collectively known as "frequent itemset mining". This set of techniques uses clever heuristics to efficiently find sets of items that co-occur more often than a minimal support threshold. In this work, we identified several pattern types based on their source:
Expression itemsets
Metadata itemsets
Graph patterns (protein-protein, protein-drug and chemical structures)
For subgraph mining, we used GASTON [1]. All other data sources were analysed with Apriori [2].
To deal with the extreme numbers of patterns that result from mining this kind of data, we used a filter that incorporates several quality measures based on objective data mining measures (e.g. lift), as well as more biologically inspired methods (e.g. functional coherence in the Gene Ontology [3] tree).
Simple classification based on the patterns was performed with CBA [4].
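One of the objective quality measures mentioned above, lift, compares a pattern's observed co-occurrence with the frequency expected under independence; patterns with lift near 1 are uninformative and can be filtered out. A minimal sketch, with illustrative counts:

```python
def lift(support_ab, support_a, support_b, n):
    """Lift of pattern {A, B}: observed co-occurrence frequency divided
    by the frequency expected if A and B were independent, computed
    from raw support counts over n transactions."""
    return (support_ab * n) / (support_a * support_b)

# Toy counts: in 100 transactions, A occurs 20x, B 10x, together 8x.
print(lift(8, 20, 10, 100))  # 4.0: four times more co-occurrence than chance
```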
RESULTS & DISCUSSION
We were able to identify several backbone patterns within
the chemical structures studied and used these to define
“chemical compound families”. Next, we used this
classification as starting point to group experimental
evidence (bio-assays, interactions and metadata). After
applying cut-offs based on the quality measures, all
patterns remaining were significant and made sense
biologically.
Unsurprisingly, structurally similar compound families
show significant pattern overlaps in drug-drug interactions,
gene expression, term co-occurrence and conserved
protein-protein interactions. We found that specific
patterns in the biological profile often correlate with
specific discriminative structural patterns. Moreover, these
collections of structural frequent subgraphs seemed highly
relevant for the mode in which a compound connects to
the “core” proteome. This central proteome performs
essential functions of the cell (e.g. energy metabolism) and
it is known to be conserved across cell types. Structurally distinct compound families converge on the same “core” proteins much later (if at all) than more similar chemicals do. This observation corresponds to currently known pathway knowledge and tissue biology.
We were further able to associate previously unseen
compounds to chemicals present in the database, based on
the subgraph collection and by extension to the biological
profile patterns. Manual survey of literature indicated that
several compounds not covered by our database have
recently been approved or are in testing as alternative
drugs to the compounds we hypothesized as being
substantially similar.
FIGURE 1. Visualizing the dexamethasone environment. Both predictions
and experimental evidence (drug-target and protein-protein interactions) are shown.
REFERENCES
1. Nijssen S & Kok J. ENTCS 127, 77–87 (2005).
2. Agrawal R & Srikant R. Proc 20th Int Conf on Very Large Databases (1994).
3. Ashburner M et al. Nat Genet 25, 25–29 (2000).
4. Liu B et al. KDD (1998).
P39. ABUNDANT TRANS-SPECIFIC POLYMORPHISM AND A COMPLEX HISTORY OF NON-BIFURCATING SPECIATION IN THE GENUS ARABIDOPSIS
Polina Novikova1, Nora Hohmann2, Marcus Koch2 & Magnus Nordborg1.
Gregor Mendel Institute, Austrian Academy of Sciences, Vienna Biocenter (VBC), A-1030 Vienna, Austria1; Centre for Organismal Studies Heidelberg, University of Heidelberg, D-69120 Heidelberg, Germany2.
The prevailing notion of species rests on the concept of reproductive isolation. Under this model, sister taxa should not
share genetic variation unless they still hybridize, or diverged too recently for genetic drift to have eliminated shared
ancestral polymorphism, and gene trees should generally agree with species trees. Advances in sequencing technology
are finally making it possible to evaluate this model. We sequenced (Illumina 100bp paired reads) multiple individuals
from 26 proposed taxa in the genus Arabidopsis. Cluster analysis identified seven distinct groups, corresponding to four
common species — the model species A. thaliana, plus A. arenosa, A. halleri and A. lyrata — and three species with
very limited geographical distribution. However, at the level of gene trees, only the separation of A. thaliana from the
remaining taxa was universally supported, and even in this case there was abundant sharing of ancestral polymorphism
with the other taxa, demonstrating that reproductive isolation must be fairly recent. By considering the distribution of
derived alleles, we were also able to reject a bifurcating species tree because there is clear evidence for asymmetrical
gene flow between taxa. Finally, we show that the pattern of sharing and divergence between taxa differs between gene
ontologies, suggesting a role for selection.
P40. RIBOSOME PROFILING ENABLES THE DISCOVERY OF SMALL OPEN READING FRAMES (SORFS), A NEW SOURCE OF BIOACTIVE PEPTIDES
Volodimir Olexiouk1,*, Jeroen Crappé1, Steven Verbruggen1 & Gerben Menschaert1,*.
Lab of Bioinformatics and Computational Genomics (BioBix), Department of Mathematical Modelling, Statistics and Bioinformatics, Faculty of Bioscience Engineering, Ghent University1.
INTRODUCTION
Evidence for micropeptides, defined as translation products of small open
reading frames (sORFs), has recently emerged. Limitations of sequencing
technologies as well as of proteomics had long stalled the discovery of
micropeptides. It is the advent of ribosome profiling (RIBO-SEQ), a
next-generation sequencing technique revealing the translation machinery
at sub-codon resolution, that provided evidence in favor of translated
sORFs. RIBO-SEQ captures and subsequently sequences the ~30 nt mRNA
fragments protected within ribosomes, providing a means to identify
translated sORFs that possibly encode functional micropeptides. Since the
advent of ribosome profiling, several micropeptides with important
cellular functions have been described (e.g. Toddler, Pri-peptides,
Sarcolipin and Myoregulin).
METHODS
RIBO-SEQ allows the identification of sORFs with ribosomal activity;
however, to further assess the coding potential (i.e. whether a sORF
truly encodes a functional micropeptide), downstream analysis is
necessary. Here we propose a pipeline that starts from RIBO-SEQ data,
implements state-of-the-art tools and metrics assessing the coding
potential of sORFs, and creates a list of candidate sORFs for downstream
analysis (e.g. proteomic identification). In summary, assessment of the
coding potential includes: PhyloCSF (conservation analysis), the FLOSS
score (ribosome-protected fragment (RPF) length distribution analysis),
the ORFscore (analysis of the distribution of RPFs over the three frames
of a coding sequence (CDS)), BLASTp (sequence similarity) and VarAn
(genetic variation analysis). In an attempt to set a community standard,
and to make sORFs accessible to a larger audience, a public database
(www.sorfs.org) is provided in which publicly available datasets were
processed by this pipeline, allowing users to browse, query and export
identified sORFs. Furthermore, a PRIDE-respin pipeline was developed to
periodically search the PRIDE database for proteomic evidence.
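Of the metrics listed above, the ORFscore lends itself to a compact illustration. The sketch below follows the published definition by Bazzini et al. (2014); the pipeline's actual implementation may differ in details such as read filtering:

```python
import math

def orf_score(frame_counts):
    """ORFscore (Bazzini et al., 2014): a chi-squared-like statistic on the
    distribution of ribosome-protected fragment (RPF) 5' ends over the three
    reading frames, negated when frame 1 does not dominate."""
    f1, f2, f3 = frame_counts
    total = f1 + f2 + f3
    if total == 0:
        return 0.0
    mean = total / 3.0
    chi2 = sum((f - mean) ** 2 / mean for f in frame_counts)
    score = math.log2(chi2 + 1.0)
    # A translated sORF should accumulate RPFs in its own frame;
    # penalise ORFs where another frame dominates.
    return score if f1 >= f2 and f1 >= f3 else -score

print(orf_score([90, 5, 5]))    # strong frame-1 bias: clearly positive
print(orf_score([34, 33, 33]))  # near-uniform coverage: close to zero
```

A high positive ORFscore thus supports translation of the sORF in the annotated frame.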
RESULTS & DISCUSSION
The pipeline has been tested and curated on three different cell lines:
HCT116 (human), E14 mESC (mouse) and S2 (fruit fly). The results were
similar to those reported in recent literature, supporting its relevance.
All metrics stated above were carefully inspected for their biological
relevance and contributed significantly to the detection of sORFs. The
pipeline is currently being finalized but is available upon request. The
public repository is accessible at http://www.sorfs.org and includes the
datasets mentioned above, resulting in 263,354 sORFs. Two querying
interfaces were implemented: a default query interface intended for
browsing sORFs, and a BioMart query interface for advanced querying and
export functions. Each sORF has its own detail page visualizing the
metrics discussed above together with the ribosome profiling data, and a
link to the UCSC browser is provided to visualize the RIBO-SEQ data.
REFERENCES
Pauli, A., Norris, M.L., Valen, E., Chew, G.-L., Gagnon, J.A., Zimmerman, S., Mitchell, A., Ma, J., Dubrulle, J., Reyon, D., et al. (2014) Toddler: an embryonic signal that promotes cell movement via Apelin receptors. Science, 343, 1248636.
Crappé, J., Ndah, E., Koch, A., Steyaert, S., Gawron, D., De Keulenaer, S., De Meester, E., De Meyer, T., Van Criekinge, W., Van Damme, P., et al. (2014) PROTEOFORMER: deep proteome coverage through ribosome profiling and MS integration. Nucleic Acids Res., 10.1093/nar/gku1283.
Ingolia, N.T. (2014) Ribosome profiling: new views of translation, from single codons to genome scale. Nat. Rev. Genet., 15, 205-213.
Crappé, J., Van Criekinge, W., Trooskens, G., Hayakawa, E., Luyten, W., Baggerman, G. and Menschaert, G. (2013) Combining in silico prediction and ribosome profiling in a genome-wide search for novel putatively coding sORFs. BMC Genomics, 14, 648.
Chanut-Delalande, H., Hashimoto, Y., Pelissier-Monier, A., Spokony, R., Dib, A., Kondo, T., Bohère, J., Niimi, K., Latapie, Y., Inagaki, S., et al. (2014) Pri peptides are mediators of ecdysone for the temporal control of development. Nat. Cell Biol., 16.
P41. RIGAPOLLO, A HMM-SVM BASED APPROACH TO SEQUENCE ALIGNMENT
Gabriele Orlando1,2,3,4, Wim Vranken1,2,3 & Tom Lenaerts1,4,5.
Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, La Plaine Campus, Triomflaan, CP 2631; Structural Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 22; Structural Biology Research Center, VIB, 1050 Brussels, Belgium3; Machine Learning Group, Université Libre de Bruxelles, Brussels, 1050, Belgium4; Artificial Intelligence Lab, Vrije Universiteit Brussel, Brussels, 1050, Belgium5.
INTRODUCTION
Reliable protein sequence alignments are a central requirement for many
bioinformatics tools, such as homology modelling. Over the years many
different algorithms have been developed, and different kinds of
information have been used to align very divergent sequences [1]. Here we
present a pairwise alignment tool, called Rigapollo, based on a pairwise
HMM-SVM, which includes backbone dynamics predictions [2] in the
alignment process: recent work suggests that protein backbone dynamics is
often evolutionarily conserved and contains information orthogonal to
amino acid conservation.
METHODS
Rigapollo uses a pairwise HMM-SVM alignment approach to infer the optimal
alignment between two proteins, taking into consideration both sequence
and dynamics information. The model (described in Figure 1) is composed
of three states: M (match), G1 (gap in the first sequence) and G2 (gap in
the second sequence). The transition probabilities are defined in the
same way as in a standard HMM. This new alignment tool is further
designed in the following manner:
Defining the N-dimensional feature vectors:
Each amino acid in the sequences is described by an N-dimensional feature
vector. This vector can be defined using any kind of information, ranging
from evolutionary information (i.e. PSSMs calculated with HHblits [3]) to
dynamics predictions (using the DynaMine predictor [2]). While standard
pairwise HMMs require the definition of a finite and discrete alphabet of
observable states, our model works directly on these feature vectors
(which need not be orthonormal), evaluating the emission probability with
a support vector machine (SVM).
Defining the emission probability:
We define the emission probability using an SVM trained to discriminate
matches from mismatches. We define as matches all positions in the
reference pairwise alignments that do not contain gaps, and we use the
concatenation of the previously defined feature vectors to describe them.
These matches are considered positive hits. For the mismatches, we
perform the same procedure, but pair positions that are shifted by 5 to
10 amino acids in the reference alignment. After training, the predicted
emission probability for the M state, given the concatenation of two
feature vectors, is a function of the distance from the decision
hyperplane of the SVM (called f(D)). The corresponding emission
probabilities for the states G1 and G2 are modeled as 1-f(D).
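The emission model above can be sketched in a few lines. The logistic form chosen for f(D), and all variable names, are our assumptions: the abstract only states that the emission probability is a function of the hyperplane distance, and that in Rigapollo (w, b) come from an SVM trained on match vs. mismatch pairs.

```python
import numpy as np

def svm_distance(x_pair, w, b):
    """Signed distance of a concatenated feature-vector pair from the SVM
    decision hyperplane defined by weight vector w and offset b."""
    return (np.dot(w, x_pair) + b) / np.linalg.norm(w)

def emission_probs(x_pair, w, b):
    """Map the hyperplane distance D to emission probabilities:
    f(D) for the match state M, and 1 - f(D) for the gap states G1, G2."""
    d = svm_distance(x_pair, w, b)
    f = 1.0 / (1.0 + np.exp(-d))  # logistic squashing keeps f(D) in (0, 1)
    return {"M": f, "G1": 1.0 - f, "G2": 1.0 - f}

# Toy example: two 2-dimensional feature vectors concatenated into one pair.
w = np.array([1.0, -0.5, 0.25, 0.0])
p = emission_probs(np.array([2.0, 1.0, 0.0, 3.0]), w, b=-0.5)
```

A pair on the "match" side of the hyperplane (positive D) thus receives an M-state emission above 0.5, and the two gap states share the complement.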
RESULTS & DISCUSSION
To evaluate the performance of Rigapollo, we adopted two publicly
available subsets of the BAliBASE and SABmark alignment datasets, already
used to evaluate other pairwise alignment tools [1]. From the MSAs, all
pairwise alignments were extracted, and those sharing a percentage of
sequence identity equal to the median of the full database were put in
the subset. The datasets consist of 38 and 123 manually curated,
structure-based pairwise alignments, respectively, and they share very
low sequence identity. For the evaluation we performed a 10-fold
randomized cross-validation. Rigapollo increases the quality of
low-sequence-identity pairwise alignments by 5 to 10% with respect to
state-of-the-art methods, and the increase in performance appears more
marked for very divergent sequences, such as those in the SABmark
dataset, where the dynamics information significantly increases the
quality of the alignment. This is probably because dynamics are often
well conserved in functional patterns, even when the sequence is not
preserved [2].
REFERENCES
[1] Do, Chuong B., et al. Research in Computational Molecular Biology. Springer Berlin Heidelberg, 2006.
[2] Cilia, Elisa, et al. Nucleic Acids Research 42.W1 (2014): W264-W270.
[3] Remmert, Michael, et al. Nature Methods 9.2 (2012): 173-175.
Figure 1: Structure of the pairwise HMM-SVM model
P42. EARLY FOLDING AND LOCAL INTERACTIONS
R. Pancsa1, M. Varadi1, E. Cilia2,3, D. Raimondi1,2,3 & W. F. Vranken1,3,*.
Structural Biology Research Centre, VIB and Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium1; Machine Learning Group, Université Libre de Bruxelles, Brussels, Belgium2; Interuniversity Institute of Bioinformatics in Brussels (IB2), Brussels, Belgium3.
INTRODUCTION
Protein folding is in its early stages largely determined by
the protein sequence and complex local interactions
between amino acids, resulting in the formation of foldons
that provide the context for further folding into the native
state. These early folding processes are therefore
important to understand subsequent folding steps and their
influence on, for example, aggregation, but they are
difficult to study experimentally. We here address this
issue computationally by assembling and analysing a
dataset on early folding residues from hydrogen deuterium
exchange (HDX) data from NMR and MS, and analyse
how they relate to the sequence-based backbone dynamics
predictions from DynaMine (Cilia et al. 2013, 2014) and
evolutionary information from multiple sequence
alignments.
METHODS
We assembled a dataset of HDX experimental data from NMR and MS from the
literature for 57 proteins, totalling 4172 residues. The data were
classified into early, intermediate and late classes depending on the
folding time at which protection of the backbone NH was observed, and
into strong, medium and weak classes depending on how long the amides
remain protected upon unfolding of the native state. This resulted in 219
residue sets that are organised in XML files and loaded into a database
that is made available online via http://start2fold.eu.
The DynaMine predictions were run locally with a new version of the
software that handles C- and N-terminal effects. These original
predictions were then normalised by shifting them so that the maximum
prediction value for each protein is always 1.0, which does not affect
the relative differences between the prediction values within each
protein but effectively normalises the values between different proteins.
MSAs were generated for each sequence in the dataset using HHblits and
Jackhmmer with 3 iterations and an E-value threshold of 10⁻⁴. All
retrieved homologs have at least 90% coverage of the query sequence.
Using HHfilter, a post-processing tool provided in the HHblits package,
we built two different sets of MSAs by varying the maximum pairwise
sequence identity threshold between the collected homologs in each MSA.
The (ungapped) sequences in the MSAs were predicted without normalisation
in order to preserve the differences within a protein family, and mapped
back to the full (gapped) MSA.
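The per-protein normalisation described above is a simple shift; a minimal sketch (function and variable names are ours):

```python
import numpy as np

def normalise_per_protein(preds):
    """Shift DynaMine predictions so that the per-protein maximum is 1.0.
    A shift (rather than a rescale) leaves all pairwise differences within
    the protein untouched while making values comparable across proteins."""
    preds = np.asarray(preds, dtype=float)
    return preds + (1.0 - preds.max())

raw = [0.62, 0.81, 0.75, 0.90]
norm = normalise_per_protein(raw)  # max is now exactly 1.0
```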
RESULTS & DISCUSSION
Our analysis shows that the DynaMine-predicted rigidity
of the protein backbone represents where the protein is
likely to adopt specific lower free energy conformations
based on sequence-encoded local interactions, as
evidenced by the HDX data on early folding (Figure 1).
This effect is also present on a per-residue basis.
FIGURE 1. Distribution of DynaMine predictions for early folding residues (green) and non-early folding residues (brown) for the original
(left) and normalized (right) values.
When relating the secondary structure elements as
observed in the native fold to the early folding residues,
we observe that the ‘early folding’ secondary structure
elements also tend to be more rigid overall. Finally, we
examined whether early folding is conserved in evolution
on the basis of multiple sequence alignments. Although
there is no conservation of individual amino acids, the
physical characteristic of a rigid backbone seems to be
conserved.
We therefore propose that the backbone dynamics of the
protein is a fundamental physical feature conserved by
proteins that can provide important insights into their
folding mechanisms and stability.
REFERENCES
Cilia, E., Pancsa, R., Tompa, P., Lenaerts, T., & Vranken, W. F. (2013). From protein sequence to dynamics and disorder with DynaMine. Nature Communications, 4, 2741. http://doi.org/10.1038/ncomms3741
Cilia, E., Pancsa, R., Tompa, P., Lenaerts, T., & Vranken, W. F. (2014). The DynaMine webserver: predicting protein dynamics from sequence. Nucleic Acids Research, 42(W1), W264-W270. http://doi.org/10.1093/nar/gku270
P43. BINDING SITE SIMILARITY DRUG REPOSITIONING:
A GENERAL AND SYSTEMATIC METHOD FOR DRUG DISCOVERY
AND SIDE EFFECTS DETECTION
Daniele Parisi & Yves Moreau.
I developed a protocol based on the prediction of druggable cavities, the
comparison of these putative binding sites, and cross-docking of bound
ligands into binding sites detected to be similar to that of the complex,
in order to study the cross-reactivity of known compounds. The method is
general because it finds applications both in drug repositioning and in
the study of adverse effects, and it is systematic because it consists of
several subsequent steps. It can indicate which ligands to screen,
reducing the number of candidates and allowing companies or universities
to save the money and time of unnecessary tests.
INTRODUCTION
The ability of small molecules to interact with multiple proteins is
referred to as polypharmacology [1], and the strategy that aims to
exploit the positive aspects of polypharmacology is drug repositioning,
whereby existing drugs are investigated for efficacy against targets for
other indications. Existing drugs are privileged structures with verified
bioavailability and compatibility. Furthermore, virtual screening allows
one to reposition existing drugs against novel disease targets without
the expense of purchasing thousands of compounds [2]. The combination of
structure-based virtual screening (such as estimation of the similarity
of protein-ligand binding sites and subsequent cross-docking) with drug
repositioning represents a highly efficient and fast methodology for
predicting cross-reactivity and putative side effects of drug
candidates [3].
METHODS
Each step of the protocol maps to a bioinformatics technique or tool, so
the protocol couples several pieces of software:
1. choice of the query (a single protein as a PDB file) and the
   templates (a set of PDB structures); at least one of the two
   categories has to present a ligand bound in a cavity;
2. prediction of druggable cavities in all the protein structures using
   a geometry-based or an energy-based algorithm (here Fpocket, a
   geometry-based tool);
3. comparison of the query binding sites to the binding sites of the
   templates to assess their similarity, carried out by an alignment or
   alignment-free algorithm (here Apoc, an alignment-based tool);
4. cross-docking of the ligand available in a pair of similar binding
   sites into the other cavity, in order to study the binding with a
   different target for toxicity or new therapeutic indications
   (AutoDock Vina);
5. fingerprinting of the new ligand-cavity complex to score the docking
   poses.
I applied this protocol to two different queries (thrombin and
dihydrofolate reductase), using a dataset of 1067 druggable proteins as
templates (Druggable Cavity Directory).
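The steps above can be sketched as a thin orchestration layer. Tool flags, pocket-file paths and all function names are illustrative assumptions, not the author's actual scripts; the sketch only builds the command lines (execution, output parsing and the step-5 fingerprint scoring are omitted):

```python
from pathlib import Path

def fpocket_cmd(pdb):  # step 2: druggable cavity prediction
    return ["fpocket", "-f", str(pdb)]

def apoc_cmd(query_pocket, template_pocket):  # step 3: pocket comparison
    return ["apoc", str(query_pocket), str(template_pocket)]

def vina_cmd(receptor, ligand, out):
    # step 4: cross-docking, run only on pairs whose pockets
    # score as similar in step 3
    return ["vina", "--receptor", str(receptor),
            "--ligand", str(ligand), "--out", str(out)]

def build_pipeline(query_pdb, template_pdbs):
    """Return the ordered command lines for one query vs. a template set."""
    cmds = [fpocket_cmd(query_pdb)]
    for t in template_pdbs:
        cmds.append(fpocket_cmd(t))
        # Hypothetical locations of the pocket files written in step 2:
        cmds.append(apoc_cmd(
            Path("query_out/pockets/pocket1_atm.pdb"),
            Path(f"{Path(t).stem}_out/pockets/pocket1_atm.pdb")))
    return cmds

cmds = build_pipeline("thrombin.pdb", ["1abc.pdb", "2xyz.pdb"])
```

Keeping the steps as composable command builders makes it easy to swap in an energy-based cavity predictor or an alignment-free comparison tool, as the protocol allows.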
RESULTS & DISCUSSION
The method works well in repositioning ligands among proteins of the same
family (intraprotein), but it is not able to detect interprotein
similarities (among unrelated proteins). This happens because of the
large size of the predicted cavities (larger than the mere space occupied
by the ligand) coupled with the alignment-based algorithm used, which
makes it difficult to reach a sufficient similarity score and greatly
increases the false negatives. In further work I will divide the cavity
space into subpockets, decouple the similarity from the sequence by using
pharmacophoric maps, and couple the structure-based similarity with
ligand-based and network-based similarity. All the information will be
fused with data-integration algorithms.
REFERENCES
1. Jalencas, X. & Mestres, J. On the origins of drug polypharmacology. Med. Chem. Commun. 4, 80 (2013).
2. Ma, D.-L., Chan, D.S.-H. & Leung, C.-H. Drug repositioning by structure-based virtual screening. Chem. Soc. Rev. 42, 2130 (2013).
3. Desaphy, J., Azdimousa, K., Kellenberger, E. & Rognan, D. Comparison and druggability prediction of protein-ligand binding sites from pharmacophore-annotated cavity shapes. J. Chem. Inf. Model. 52, 2287-2299 (2012).
P44. ASSESSMENT OF THE CONTRIBUTION OF COCOA-DERIVED STRAINS
OF ACETOBACTER GHANENSIS AND ACETOBACTER SENEGALENSIS TO
THE COCOA BEAN FERMENTATION PROCESS THROUGH A GENOMIC
APPROACH
Rudy Pelicaen, Koen Illeghems, Luc De Vuyst, and Stefan Weckx*.
Research Group of Industrial Microbiology and Food Biotechnology (IMDO), Faculty of Sciences and Bioengineering
Sciences, Vrije Universiteit Brussel, Brussels, Belgium; Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB,
Brussels, Belgium. *[email protected]
Acetobacter ghanensis LMG 23848T and Acetobacter senegalensis 108B are acetic acid bacteria species that originate
from a spontaneous cocoa bean heap fermentation process. They have been indicated as strains with interesting
functionalities through extensive metabolic and kinetic studies. Whole-genome sequencing of A. ghanensis LMG 23848T
and A. senegalensis 108B allowed us to unravel their genetic adaptations to the cocoa bean fermentation ecosystem.
INTRODUCTION
Fermented dry cocoa beans are the basic raw material for
chocolate production. The cocoa pulp-bean mass contents
of the cocoa pods undergo, once taken out of the pods, a
spontaneous fermentation process that lasts four to six
days. This process is characterised by a succession of
yeasts, lactic acid bacteria (LAB), and acetic acid bacteria
(AAB) coming from the environment (De Vuyst et al.,
2015).
METHODS
Total genomic DNA isolation and purification of A.
ghanensis LMG 23848T and A. senegalensis 108B was
followed by the construction of an 8-kb paired-end library,
454 pyrosequencing, and assembly of the sequence reads
using the GS De Novo Assembler version 2.5.3 with
default parameters. Genome finishing was performed by
PCR assays to close gaps in the draft assembly using
CONSED 23.0. Automated gene prediction and annotation
of the assembled genome sequences were carried out using
the bacterial genome sequence annotation platform
GenDB v2.2 (Meyer et al., 2003). The predicted genes
were functionally characterised using searches in public
databases and bioinformatics tools, and annotations were
manually curated. Comparative analysis of the genome
sequences of the cocoa-derived strains A. ghanensis LMG
23848T (this study), A. senegalensis 108B (this study), and
A. pasteurianus 386B (Illeghems et al., 2013) was
accomplished by the EDGAR framework (Blom et al.,
2009).
RESULTS & DISCUSSION
The genomes of the strains investigated consisted of a
circular chromosomal DNA sequence with a size of 2.7
Mbp and two plasmids for A. ghanensis LMG 23848T and
a circular chromosomal DNA sequence with a size of 3.9
Mbp and one plasmid for A. senegalensis 108B (Figure 1).
Comparative analysis revealed that the order of
orthologous genes was highly conserved between the
genome sequences of A. pasteurianus 386B and A.
ghanensis LMG 23848T. Evidence was found that both
species possessed the genetic ability to be involved in
citrate assimilation and they displayed adaptations in their
respiratory chain. As is the case for many AAB, the
missing gene encoding phosphofructokinase in the
genome sequences of both A. ghanensis LMG 23848T and
A. senegalensis 108B resulted in a non-functional upper
part of the Embden–Meyerhof–Parnas pathway. However,
the presence of genes coding for membrane-bound PQQ-
dependent dehydrogenases enabled the AAB strains
examined to rapidly oxidise ethanol into acetic acid.
Furthermore, an alternative TCA cycle, characterised by
genes coding for a succinyl-CoA:acetate-CoA transferase
and a malate:quinone oxidoreductase, was present.
Furthermore, evidence was found in both genome
sequences that glycerol, mannitol and lactate could be
used as energy sources. Thus, although both species displayed genetic
adaptations to the cocoa bean fermentation process, their dependence on
glycerol, mannitol and lactate may partly explain their low
competitiveness during cocoa bean fermentation, as these substrates have
to be formed through yeast (glycerol, mannitol) or LAB (lactate)
activities.
FIGURE 1. Graphical representation of the genomes of A. ghanensis
LMG 23848T (A) and A. senegalensis 108B (B).
REFERENCES
Blom, J., Albaum, S., Doppmeier, D., Pühler, A., Vorhölter, F.-J., Zakrzewski, M., Goesmann, A., 2009. EDGAR: a software framework for the comparative analysis of prokaryotic genomes. BMC Bioinformatics 10, 1-14.
De Vuyst, L., Weckx, S., 2015. The functional role of lactic acid bacteria in cocoa bean fermentation. In: Mozzi, F., Raya, R.R., Vignolo, G.M. (Eds.), Biotechnology of Lactic Acid Bacteria: Novel Applications. Wiley-Blackwell, Ames, IA, USA. In press.
Illeghems, K., De Vuyst, L., Weckx, S., 2013. Complete genome sequence and comparative analysis of Acetobacter pasteurianus 386B, a strain well-adapted to the cocoa bean fermentation ecosystem. BMC Genomics 14, 526.
Meyer, F., Goesmann, A., McHardy, A. C., Bartels, D., Bekel, T., et al., 2003. GenDB - an open source genome annotation system for prokaryote genomes. Nucleic Acids Res. 31, 2187-2195.
P45. REPRESENTATIONAL POWER OF GENE FEATURES FOR FUNCTION PREDICTION
Konstantinos Pliakos1,*, Isaac Triguero2,3, Dragi Kocev4 & Celine Vens1.
Department of Public Health and Primary Care, KU Leuven Kulak1; Department of Respiratory Medicine, Ghent University2; Data Mining and Modelling for Biomedicine group, VIB Inflammation Research Center3; Department of Knowledge Technologies, Jožef Stefan Institute4.
We present a short study on gene function prediction datasets, revealing an existing issue of non-unique feature
representation, as well as the effect of this issue on hierarchical multi-label classification algorithms.
INTRODUCTION
This study focuses on hierarchical multi-label classification (HMC). HMC
is a variant of classification in which one sample can be assigned to
several classes simultaneously. It differs from flat multi-label
classification in that the classes are organized in a hierarchy: a sample
belonging to a class automatically belongs to all its super-classes.
Typical HMC tasks include gene function prediction and text
classification; here, we focus on the former.
A typical characteristic of genes is that they can be described in
several ways: using information about their sequence, homology to
well-characterized genes, expression profiles, secondary structure of
their derived proteins, etc. The HMC community has multiple research
datasets on gene functions at its disposal (e.g., (Vens et al., 2008) or
(Schietgat et al., 2010)), each representing genes by one type of
features. Researchers should take advantage of this amount of data, but
the question arises how "good" these datasets are: how discriminant are
the features describing a gene? This short study displays existing
data-related problems and gives answers to these questions.
DATA STUDY & RESULTS
After careful experimentation on various publicly available datasets, it
was noted that some of them suffer from a large number of duplicate
feature vectors. The reason for this is that there are genes which,
despite having different functions, have exactly the same feature
representation. The table below shows the extent of this problem in the
20 gene function prediction datasets described in (Vens et al., 2008) and
(Schietgat et al., 2010).
| Organism | Dataset | Nb of genes | Nb of unique gene representations |
|---|---|---|---|
| S. cerevisiae | church | 3755 | 2352 |
| S. cerevisiae | pheno | 1591 | 514 |
| S. cerevisiae | hom | 3854 | 3646 |
| S. cerevisiae | seq | 3919 | 3913 |
| S. cerevisiae | struc | 3838 | 3785 |
| A. thaliana | scop | 9843 | 9415 |
| A. thaliana | struc | 11763 | 11689 |

TABLE 1. Datasets, the number of genes and their unique representations.
As displayed, the church (microarray expression) and the pheno (phenotype
features) datasets suffer the most. More specifically, in the pheno
dataset 67.7% of the gene representations are duplicates. The most
frequent feature vector appears 315 times: 197 times in the training set
and 118 times in the test set. Due to this, 20% of the 582 test examples
will give the same feature vector as input for prediction. In a decision
tree model, for example, these genes will end up in the same leaf and
receive the same prediction (the average class vector of 197 training
examples), but receive a different error term, as they are a priori
associated with different class label-sets. In the training phase, there
may still be a lot of variation in the class vectors of the 197 genes,
but no split exists to separate them. In the church dataset, the 3755
genes correspond to only 2352 unique feature descriptors. In the hom and
struc datasets the number of duplicates is lower but still impressive,
considering the enormous size of the feature vectors in these datasets.
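The duplicate counts reported above can be reproduced for any genes × features matrix with a few lines (function and field names are ours):

```python
import numpy as np

def unique_representation_stats(X):
    """Summarise duplicate feature vectors in a genes x features matrix,
    mirroring Table 1 (number of genes vs. unique representations)."""
    uniq, counts = np.unique(X, axis=0, return_counts=True)
    return {"genes": X.shape[0],
            "unique": uniq.shape[0],
            "duplicate_fraction": 1.0 - uniq.shape[0] / X.shape[0],
            "most_frequent": int(counts.max())}

# Toy matrix: 5 genes, but rows 0, 2 and 4 share one representation.
X = np.array([[1, 0], [0, 1], [1, 0], [1, 1], [1, 0]])
stats = unique_representation_stats(X)  # 5 genes, 3 unique, max count 3
```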
For evaluation purposes, ML-KNN (Zhang et al., 2007) was employed to
demonstrate the effect of the studied problem on the average precision
for the FunCat-annotated datasets. Here, "unique" refers to the datasets
obtained after removing all duplicates, so any feature vector can be
included only once in a gene's neighbour set. We report the average of 10
"unique" versions, each one using a different gene's class label as
ground truth for the feature vector.
| Dataset | | K=1 Train | K=1 Test (5cv) | K=5 Train | K=5 Test (5cv) | K=17 Train | K=17 Test (5cv) |
|---|---|---|---|---|---|---|---|
| pheno | initial | 51.59 | 23.62 | 39.55 | 24.14 | 32.76 | 23.59 |
| pheno | unique | 100 | 24.21 | 55.62 | 24.90 | 39.70 | 25.01 |
| hom | initial | 98.30 | 39.32 | 63.64 | 39.45 | 48.96 | 37.28 |
| hom | unique | 100 | 39.14 | 64.64 | 39.67 | 49.28 | 37.53 |

TABLE 2. Average precision rates (%) using ML-KNN.
The table shows that the less discriminant feature representation can
affect ML-KNN and decrease the precision of multi-label classification.
The same problem would presumably be even more pronounced, or even
disastrous, for two-class or multi-class classification problems.
CONCLUSION
The major point of this study was to inform the research community of the
relatively low representational power of the features in some widely used
gene function prediction datasets, which makes them even more difficult
and challenging from a machine learning perspective. We observed the same
issue in datasets from other HMC application domains, such as text
categorization.
REFERENCES
Zhang, M.L. & Zhou, Z.H. ML-KNN: a lazy learning approach to multi-label learning. Pattern Recognition 40, 2038-2048 (2007).
Vens, C. et al. Decision trees for hierarchical multi-label classification. Machine Learning 73, 185-214 (2008).
Schietgat, L. et al. Predicting gene function using hierarchical multi-label decision tree ensembles. BMC Bioinformatics 11 (2010).
P46. ANALYSIS OF BIAS AND ASYMMETRY IN THE PROTEIN STABILITY
PREDICTION
Fabrizio Pucci1,*, Katrien Bernaerts1,2, Fabian Teheux1, Dimitri Gilis1 & Marianne Rooman1.
Department of BioModeling, BioInformatics & BioProcesses1, Université Libre de Bruxelles, 1050 Brussels, Belgium;
BioBased Materials, Faculty of Humanities and Sciences2, Maastricht University, 6200 Maastricht, The Netherlands.
In many bioinformatics analyses, avoiding biases towards the training dataset is one of the most intricate issues. Here we
focus on the specific case of the prediction of protein thermodynamic stability changes upon point mutations (ΔΔG). We
first measure the bias towards destabilizing mutations of some widely used ΔΔG-prediction algorithms described in the
literature. We then show how important the use of symmetry in the model is to avoid such biases. In the last step we
briefly discuss the distribution of the ΔΔG values for all possible point mutations in a series of proteins, with the aim of
understanding whether the distribution is universal and how much it is biased towards the training dataset.
INTRODUCTION
The accurate prediction of the stability changes on a large
scale is still a challenge in protein science. Despite the
large amount of work done in recent years, the results
frequently suffer from hidden biases towards the training
dataset, and this makes the evaluation of the real
performance a difficult task.
Here we study the “bias problem” in the case of the
prediction of protein thermodynamic stability changes
upon point mutations, and more precisely of their best
descriptor ΔΔG, the change in folding free energy
upon mutation from the wild-type protein W to the mutant
M. In principle, the predicted ΔΔG value of the inverse
mutation (M to W) has to be exactly equal to minus the
ΔΔG of the direct mutation (W to M), since the free energy
is a state function.
Unfortunately the asymmetry of the training dataset
towards the destabilizing mutations (reflecting the
evolutionary optimization of protein stability) makes the
prediction of inverse mutations less accurate with respect
to the direct ones. This introduces a series of distortions in
the prediction model that we will analyze here.
METHODS
We computed the ΔΔG value for a set of almost 200
mutations in which the structures of both the wild-type
protein and the mutant are known, using a series of prediction
tools, i.e. PoPMuSiC [1], I-Mutant, FoldX, Duet,
AutoMute, CupSat, Eris and ProSMS. We then computed
the ratio (RID) of the standard deviation between the
predicted and the experimental values of ΔΔG for the
Inverse mutations to that for the Direct mutations (which
should be one in the case of a perfectly symmetric
prediction) and compared the results of the different
programs.
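The RID computation described above can be sketched as follows (the function name and toy numbers are ours, not from the study):

```python
import numpy as np

def rid(pred_direct, exp_direct, pred_inverse, exp_inverse):
    """Ratio of the standard deviation of the prediction errors for the
    inverse mutations to that for the direct mutations; a perfectly
    symmetric predictor gives RID = 1."""
    err_dir = np.asarray(pred_direct) - np.asarray(exp_direct)
    err_inv = np.asarray(pred_inverse) - np.asarray(exp_inverse)
    return float(np.std(err_inv) / np.std(err_dir))

# Toy numbers (ours): the inverse-mutation errors are twice as spread
# out as the direct-mutation errors, giving RID = 2.
print(rid([1.0, 2.0], [0.9, 2.1], [-0.8, -2.2], [-1.0, -2.0]))  # 2.0
```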
If the functional structure of the model is known, as in the
case of the artificial neural network of PoPMuSiC, one
can further understand which terms contribute more than
others to deviating the RID from unity, and thus propose new
model structures in which the biases are correctly avoided
[2].
In more blind machine learning approaches (such as
methods based on Random Forests or Support Vector
Machines), in which the functional form is not explicitly
known, the asymmetry correction is less obvious.
In a second part, we investigated how the symmetry of the
ΔΔG distribution in the training dataset influences
the prediction of the ΔΔG distribution for all possible
mutations in a series of proteins with known structures.
RESULTS & DISCUSSION
The estimation of the asymmetry computed for a
series of available prediction methods gives RID
values between 1 for bias-corrected methods and
about 3 for the most biased programs. From these
results we have shown that the correct use of
symmetry in setting up the model structure helps to
avoid unwanted biases towards destabilizing
mutations.
Furthermore, the distribution of the ΔΔG values for all
point mutations in some proteins has been analyzed
and shows a dependence on the ΔΔG distribution
of the training dataset when the RID deviates
significantly from one. Understanding the
relation between the two distributions is an
important step towards comprehending the universality of the
distribution [3] and how much proteins are
optimized to minimize the impact of single-site
amino acid substitutions.
REFERENCES
[1] Dehouck Y., Kwasigroch J. M., Gilis D. & Rooman M. PoPMuSiC 2.1: a web server for the estimation of protein stability changes upon mutation and sequence optimality. BMC Bioinformatics 12, 151 (2011).
[2] Pucci F., Bernaerts K., Teheux F., Gilis D. & Rooman M. Symmetry principles in optimization problems: an application to protein stability prediction. IFAC-PapersOnLine 48-1, 458-463 (2015).
[3] Tokuriki N., Stricher F., Schymkowitz J., Serrano L. & Tawfik D. S. The stability effects of protein mutations appear to be universally distributed. J Mol Biol 356, 1318-1332 (2007).
P47. MULTI-LEVEL BIOLOGICAL CHARACTERIZATION OF EXOMIC
VARIANTS AT THE PROTEIN LEVEL IMPROVES THE IDENTIFICATION OF
THEIR DELETERIOUS EFFECTS
Daniele Raimondi1,2,3,4, Andrea Gazzo1,2, Marianne Rooman1,6, Tom Lenaerts1,2,5 & Wim Vranken1,2,3,4.
Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, 1050, Belgium1; Machine Learning group,
Université Libre de Bruxelles, Brussels, 1050, Belgium2; Structural Biology Brussels, Vrije Universiteit Brussel,
Brussels, 1050, Belgium3; Structural Biology Research Centre, VIB, Brussels, 1050, Belgium4; Artificial Intelligence lab,
Vrije Universiteit Brussel, Brussels, 1050, Belgium5; 3BIO-BioInfo group, Université Libre de Bruxelles, Brussels, 1050,
Belgium6.
The increasing availability of genome sequence data led to the development of predictors that are capable of identifying
the likely phenotypic effects of Single Nucleotide Variants (SNVs) or short inframe Insertions or Deletions (INDELs).
Most of these predictors focus on SNVs and use a combination of features related to sequence conservation, biophysical
and/or structural properties to link the observed variant to either a neutral or a disease phenotype. Despite notable
successes, the mapping between genetic alterations and phenotypic effects is riddled with levels of complexity that are
not yet fully understood and that are often not taken into account in the predictions. A better multi-level molecular and
functional contextualization of both the variant and the protein may therefore significantly improve the predictive quality
of variant-effect predictors.
INTRODUCTION
The phenotypical interpretation at the organism level of
protein-level alterations is the ultimate goal of the variant-
effect prediction field. This causal relationship is still far
from being completely understood and is confounded by
many aspects related to the intrinsic complexity of cell life. A
crucial restriction of variant-effect prediction is that an
alteration of the protein’s molecular phenotype, even if it is a
sine qua non condition for the disease phenotype in the
carrier individual, may not constitute in itself a sufficient
cause for the disease: this also depends on the particular role
that the affected protein plays in the well-being of the
organism. Even the most commonly used features, which
relate evolutionary constraints with likely functional damage,
offer only a partial correlation with the pathogenicity of the
variant. Consequently, additional information that bridges the
variant-phenotype gap is crucial to improve variant-effect
predictions.
METHODS
We address the inherently complex variant-effect prediction
problem through the integration of different sources of
information. By describing each (protein, variant) pair from
different perspectives corresponding to different levels of
contextualisation, we assembled the most relevant and
accessible pieces of information that are currently available,
with the aim to elucidate the fuzzy and complex mapping
between molecular-level alterations and the individual-level
phenotypic outcome. We use three variant-oriented features
with different characteristics: the log-odd ratio (LOR) score
and Conservation index (CI) [1], which are column-wise
measures of the conservation of a mutated column within a
multiple-sequence alignment (MSA), and the PROVEAN [2]
predictions (PROV), which provide a sequence-wide measure
of the change in evolutionary distance between the mutated
target protein and close functional homologs that correlates
with the deleteriousness of variants. The protein-oriented
features use pathway [4] and protein-protein interaction
networks information [5] (DGR) as well as genetic and
clinical information, for instance an evaluation of how
tolerant the affected genes are to homozygous loss-of-
function mutations (REC) [3].
RESULTS & DISCUSSION
DEOGEN is our novel variant effect predictor that can
natively handle both SNVs and inframe INDELs. By
integrating information from different biological scales and
mimicking the complex mixture of effects that lead from the
variant to the phenotype, we obtain significant improvements
in the variant-effect prediction results. Next to the typical
variant-oriented features based on the evolutionary
conservation of the mutated positions, we added a collection
of protein-oriented features that are based on functional
aspects of the gene affected. We cross-validated DEOGEN on
36825 polymorphisms, 20821 deleterious SNVs and 1038
INDELs from SwissProt.
Method               Missing SNVs  Sen  Spe  Pre  Bac  MCC
PROVEAN               0.0          78   79   68   79   56
SIFT                  2.0          85   69   61   77   52
Mutation Assessor     0.6          85   71   63   78   54
PolyPhen2 (HumDiv)    4.0          89   63   57   76   50
CADD                  7.0          82   75   66   78   55
EFIN                  0.0          86   80   87   83   64
MutationTaster       20.7          86   75   69   81   60
GERP++               20.7          97   24   45   61   28
DEOGEN                4.4          77   92   85   84   71
FIGURE 1. Comparison of the performance of 8 variant-effect predictors with DEOGEN on the Humsavar 2013 dataset.
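For reference, the columns of the comparison can be recomputed from a 2×2 confusion matrix. The sketch below assumes Sen/Spe/Pre/Bac/MCC denote sensitivity, specificity, precision, balanced accuracy and the Matthews correlation coefficient; the counts are hypothetical, not taken from the Humsavar benchmark.

```python
import math

def metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, precision, balanced accuracy and MCC
    (as percentages), matching the columns of the comparison table."""
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    pre = tp / (tp + fp)
    bac = (sen + spe) / 2
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {k: round(100 * v, 1)
            for k, v in dict(Sen=sen, Spe=spe, Pre=pre,
                             Bac=bac, MCC=mcc).items()}

# Hypothetical counts, not from the paper:
print(metrics(tp=80, fp=10, tn=90, fn=20))
```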
REFERENCES
[1] Calabrese R. et al. Functional annotations improve the predictive score of human disease-related mutations in proteins. Hum. Mutat. 30, 1237-1244 (2009).
[2] Choi Y. et al. Predicting the functional effect of amino acid substitutions and indels. PLoS One 7, e46688 (2012).
[3] MacArthur D. G. et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science 335 (6070), 823-828 (2012).
[4] Kamburov A. et al. ConsensusPathDB: toward a more complete picture of cell biology. Nucleic Acids Research 39, D712-717 (2011).
P48. NGOME: PREDICTION OF NON-ENZYMATIC PROTEIN
DEAMIDATION FROM SEQUENCE-DERIVED SECONDARY STRUCTURE AND
INTRINSIC DISORDER
J. Ramiro Lorenzo1, Leonardo G. Alonso2 & Ignacio E. Sánchez1,*.
Protein Physiology Laboratory, Facultad de Ciencias Exactas y Naturales and IQUIBICEN - CONICET, Universidad de
Buenos Aires, Argentina1; Protein Structure-Function and Engineering Laboratory, Fundación Instituto Leloir and
IIBBA - CONICET, Buenos Aires, Argentina2. *[email protected]
Asparagine residues in proteins undergo spontaneous deamidation, a post-translational modification that may act as a
molecular clock for the regulation of protein function and turnover. Asparagine deamidation is modulated by protein
local sequence, secondary structure and hydrogen bonding. We present NGOME, an algorithm able to predict non-enzymatic
deamidation of internal asparagine residues in proteins, in the absence of structural data, from sequence-based
predictions of secondary structure and intrinsic disorder. NGOME may help the user identify deamidation-prone
asparagine residues, often related to protein gain of function, protein degradation or protein misfolding in pathological
processes.
INTRODUCTION
Protein deamidation is a post-translational modification in
which the side chain amide group of a glutamine or
asparagine (Asn) residue is transformed into an acidic
carboxylate group. Deamidation often, but not always,
leads to loss of protein function1,2. Deamidation rates in
proteins vary widely, with halftimes for particular Asn
residues ranging from several days to years. In contrast
with the ubiquity and importance of Asn deamidation,
there is currently no publicly available algorithm for the
prediction of Asn deamidation. A structure-based
algorithm was published3, but it is no longer available online
and is not useful for proteins of unknown structure or
those that are intrinsically disordered.
METHODS
Dataset. We collected from the literature experimental
reports of deamidation of Asn residues in proteins using
mass spectrometry or Edman sequencing. Since
deamidation rates depend strongly on pH and temperature,
we only included experiments at neutral or slightly basic
pH and up to 313 K. An Asn residue was considered a
positive if an unequivocal change to an aspartic or isoaspartic
residue was observed. Asn residues for which direct
experimental evidence was not obtained were not taken
into account.
NGOME training. We trained the algorithm by randomly
splitting the dataset into training and test sets 100 times,
while keeping a similar number of positive and negative
Asn-Xaa dipeptides in the two sets. For each splitting, we
selected the weights for disorder4 and alpha helix
prediction5 in the NGOME algorithm to maximize the area
under the ROC curve for the training set. For the test set,
the area under the ROC curve for NGOME was larger than
for sequence-based prediction 97 out of 100 times. Finally,
we selected the average values of weights for NGOME.
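The training loop described above can be sketched as follows; the features, labels and weight grid are hypothetical stand-ins for NGOME's actual disorder and helix predictors, and only one of the 100 random splits is shown.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Hypothetical data, not NGOME's real features: a per-Asn sequence
# score plus sequence-based disorder and helix predictions, with a
# 0/1 deamidation label for each residue.
n = 200
seq, dis, hel = rng.normal(size=(3, n))
y = (seq + 0.5 * dis - 0.5 * hel + rng.normal(size=n) > 0).astype(int)

def auc(scores, labels):
    """Area under the ROC curve, P(score_pos > score_neg), ties half."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

# One random split into training and test halves.
idx = rng.permutation(n)
tr, te = idx[: n // 2], idx[n // 2:]

# Grid-search the two weights to maximise the training-set AUC.
grid = np.linspace(-1, 1, 21)
best = max(product(grid, grid),
           key=lambda w: auc(seq[tr] + w[0] * dis[tr] + w[1] * hel[tr],
                             y[tr]))
w_dis, w_hel = best
test_auc = auc(seq[te] + w_dis * dis[te] + w_hel * hel[te], y[te])
print(round(float(test_auc), 3))
```

In the abstract's procedure this split is repeated 100 times and the selected weights are averaged.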
RESULTS & DISCUSSION
Both protein sequence and structure can influence Asn
deamidation kinetics. In the absence of secondary and
tertiary structure, Asn deamidation rates are governed by
the identity of the N+1 amino acid3. In model peptides, the
Asn-Gly dipeptide is by far the fastest to deamidate, with
bulky N+1 side chains generally slowing down the
reaction. Several structural features decreasing Asn
deamidation rates have also been identified, including
alpha helix formation and hydrogen bond formation by the
Asn side chain, the N+1 backbone amide and the
neighbouring residues3.
We compiled a database of 281 Asn residues (67 positives
and 214 negatives) in 39 proteins to train NGOME. We
computed t50 for all Asn in the dataset and generated a
ROC curve by considering as positives Asn residues with
different values of t50. The area under the ROC curve is
larger for the NGOME predictions (0.9640) than for the
sequence-based predictions (0.9270) (p-value 6×10⁻³).
NGOME also performs better for threshold values
yielding few false positives. NGOME can also
discriminate between positive and negative Asn-Gly
dipeptides, whereas sequence-based prediction cannot.
The area under the ROC curve is 0.7051 for the NGOME
predictions, larger than the random value of 0.5 for
sequence-based prediction (p-value 9×10⁻³). Since
NGOME requires only a protein sequence as an input and
not a three-dimensional structure, we envision that
NGOME will be useful to systematically evaluate whole-proteome
data and in the study of intrinsically disordered
proteins for which structural data is scarce. NGOME is
freely available as a webserver at the National EMBnet
node Argentina, URL: http://www.embnet.qb.fcen.uba.ar/
in the subpage “Protein and nucleic acid structure and
sequence analysis”.
REFERENCES
1. Curnis, F., et al. J Biol Chem 281:36466-36476 (2006).
2. Reissner, K.J. and Aswad, D.W. Cell Mol Life Sci 60:1281-1295 (2003).
3. Robinson, N.E. and Robinson, A.B. Proc Natl Acad Sci U S A 98:4367-4372 (2001).
4. Dosztanyi, Z., et al. Bioinformatics 21:3433-3434 (2005).
5. Cole, C., et al. Nucleic Acids Res 36:W197-201 (2008).
P49. OPTIMAL DESIGN OF SRM ASSAYS USING MODULAR EMPIRICAL
MODELS
Jérôme Renaux1,*, Alexandros Sarafianos1, Kurt De Grave1 & Jan Ramon1.
Department of Computer Science, KU Leuven1. *[email protected]
Targeted proteomics techniques such as Selected Reaction Monitoring (SRM) have become very popular for protein
quantification due to their high sensitivity and reproducibility. However, these rely on the selection of optimal transitions,
which are not always known in advance and may require expensive and time-consuming discovery experiments to
identify. We propose a computer program for the automated identification of optimal transitions using machine learning
and show encouraging results when compared to a widely used spectral library.
INTRODUCTION
A major issue with SRM is knowing which transitions
to monitor in order to maximally detect a specific protein,
these being different from one protein to another. Good
candidates are transitions whose chemical properties will
make them likely to occur and easy to detect by the mass
spectrometer, while being sufficiently specific indicators
of their parent protein.
Traditionally, targeted proteomics assays, which consist of
lists of ions or transitions to monitor, are designed through
costly exploratory experiments. Recently, attempts have
been made to produce software to help design optimal
assays. These efforts rely to some extent on collaborative
databases of mass spectra which are mined to identify the
best possible peptides to include in the assays. While
successful, these approaches still depend on past
exploratory analyses and on the coverage of the exploited
databases. Therefore, their performance decreases in cases
where such databases cannot be leveraged, such as when
dealing with little-studied organisms or rare, low-
abundance proteins.
We propose an approach called SIMPOPE (Sequence of
Inductive Models for the Prediction and Optimization of
Proteomics Experiments) that models all the steps of the
typical tandem mass spectrometry (MS/MS) workflow in
order to accurately predict the properties of peptide and
fragment ions within a given proteome, and subsequently
identify optimal assays among them.
METHODS
SIMPOPE consists of a sequential suite of predictive
models for each step of the MS/MS workflow. It exploits
knowledge from public databases and combines it with the
generalizing power of machine learning models to
compensate for noisy or missing data. All models are
probabilistic, allowing us to keep track of the inherent
uncertainty of the successive predictions and to weight the
results accordingly for the assay prediction.
Enzymatic cleavage is modelled using CP-DT (Fannes et
al., 2013), which models the behaviour of the trypsin
enzyme using random forests. Retention time prediction is
achieved using the Elude tool from the Percolator suite
(Moruz et al., 2010). The charge distribution of
electrospray precursor ions is also modelled using random
forests trained on experimental data mined from PRIDE
(Vizcaino et al., 2013). Fragmentation patterns and
product ion intensity are predicted with the help of random
forest models trained on MS-LIMS data (Degroeve &
Martens 2013; De Grave et al., 2014). Finally, prior
knowledge about the abundance of proteins within a given
proteome is incorporated as prior probabilities, obtained
when available from PaxDB.
On the human proteome, these steps yield a total of
321,000,000 transitions together with their relevant chemical
properties. We then compute a score for every single
transition, based on these properties and on their aliasing
with other transitions in terms of Q1 and Q3 m/z.
RESULTS & DISCUSSION
We validated our approach by computing scores for 2000
reference transitions from the SRMAtlas database (Picotti
et al., 2014). Based on these scores, we can rank the
reference transitions among all possible transitions.
Intuitively, reference transitions should rank high, and
therefore have a low rank (ideally, in the top five). Based
on the average number of transitions per protein in our
reference set, a perfect median rank would be 3.2, while a
totally random scoring system should yield a median rank
of 151. The approach we propose achieved a median rank
of 15, signifying that using our scoring method, 50% of
the reference transitions are ranked in the top 15. This
result is encouraging as it shows that the scores predicted
by SIMPOPE do correlate with the quality of the
transitions. We can subsequently use that score as a
feature to train an additional model on top of the ones
described here to refine the assay prediction process
(further results on the poster).
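The median-rank evaluation can be sketched as follows (the scores and reference indices are hypothetical; SIMPOPE's actual scoring model is more elaborate):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical scores for 300 candidate transitions of one protein;
# the indices of the "reference" transitions are invented for the demo.
all_scores = rng.random(300)
reference_idx = [3, 7, 42, 99, 120]
all_scores[reference_idx] += 0.5  # good references should score higher

# Rank 1 = best-scoring transition; reference transitions should rank low.
order = np.argsort(-all_scores)
rank_of = {int(t): r + 1 for r, t in enumerate(order)}
ranks = sorted(rank_of[t] for t in reference_idx)
print("reference ranks:", ranks, "median:", float(np.median(ranks)))
```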
REFERENCES
Degroeve, S. & Martens, L. MS2PIP: a tool for MS/MS peak intensity prediction. Bioinformatics 29, 3199-3203 (2013).
Fannes, T. et al. Journal of Proteome Research 12(5), 2253-2259 (2013).
De Grave, K. et al. Prediction of peptide fragment ion intensity: a priori partitioning reconsidered. International Mass Spectrometry Conference 2014 (2014).
Moruz, L., Tomazela, D. & Käll, L. Training, selection, and robust calibration of retention time models for targeted proteomics. Journal of Proteome Research 9(10), 5209-5216 (2010).
Picotti, P. et al. A complete mass-spectrometric map of the yeast proteome applied to quantitative trait analysis. Nature 494(7436), 266-270 (2014).
Vizcaino, J. A. et al. The Proteomics Identifications (PRIDE) database and associated tools: status in 2013. Nucleic Acids Research 41(D1), D1063-D1069 (2013).
P50. EVALUATING THE ROBUSTNESS OF LARGE INDEL IDENTIFICATION
ACROSS MULTIPLE MICROBIAL GENOMES
Alex Salazar1,2 & Thomas Abeel1,2,*.
Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands1; Genome Sequencing and Analysis
Program, Broad Institute of MIT and Harvard2.
Comparing large structural variants—such as large insertions and deletions (indels)—across multiple genomes can reveal
important insights in microbial organisms. Unfortunately, most studies that compare sequence variants only focus on
single nucleotide variants and small indels. In this study, we investigated whether currently available variant callers are
robust when identifying the same large indel across multiple genomes—an important criterion for accurately associating
large variants. By simulating over 8,000 large indels of various sizes across 161 bacterial strains, we found that
breakpoint detection is precise when identifying both deletions and insertions. We suggest that left-most-overlap
normalization across all samples will ensure uniform breakpoint coordinates of identical large variants, which can then be
incorporated into existing association pipelines.
INTRODUCTION
Structural sequence variants—such as large insertion and
deletions (indels)—along with small sequence variants (e.g.
single nucleotide variants and small indels) can enable more
robust comparisons of microbial populations. Unfortunately,
limitations in variant calling methods restrict investigations to
compare only small variants across multiple microbial
genomes—thereby ignoring larger variants (e.g. indels of size
greater than 50nt). The recent development of structural
variant detecting tools now provide an opportunity to
compare and associate large indels with phenotype and
population structure across a collection of samples. However,
these tools have only been benchmarked against a single
genome and their ability to consistently call large events
across multiple genomes remains uncharacterized.
METHODS
In this study, we systematically benchmarked the robustness
of large indel identification across multiple genomes using
four recently developed structural variant detection tools:
Pilon (Walker et al., 2014), Breseq (Barrick et al., 2014),
BreakSeek (Zhao et al., 2015), and MindTheGap (Rizk et al.,
2014). Using a manually-curated reference genome for
M. tuberculosis (H37Rv), we simulated nearly 10,000
deletions and 8,000 insertions—ranging from 50nt
to 550nt. Overall, the simulation experiment resulted in a
total 1.6 million expected deletions and 1.3 million expected
insertions when we aligned short-reads from a data set of 161
clinical strains of M. tuberculosis (Zhang et al., 2013).
After identifying the simulated indels using the variant
detecting tools, we used a distance test to investigate each
tool’s robustness in breakpoint and genotype prediction. For
each simulated indel prediction, we computed the distance of
the predicted breakpoint coordinate to the expected
breakpoint coordinate. We also calculated a genotype
similarity score using the Damerau-Levenshtein distance.
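The two robustness measures can be sketched as follows. The similarity normalization by the longer sequence length is our assumption; the abstract specifies only that the Damerau-Levenshtein distance is used.

```python
def dl_distance(a, b):
    """Restricted Damerau-Levenshtein distance between two genotype
    strings: edits plus adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def genotype_similarity(predicted, expected):
    """Similarity in [0, 1]: 1 means identical sequences (our scaling)."""
    if not predicted and not expected:
        return 1.0
    return 1 - dl_distance(predicted, expected) / max(len(predicted),
                                                      len(expected))

# Hypothetical predicted vs. expected breakpoint and sequences:
print(abs(5031 - 5028))                         # breakpoint distance: 3 nt
print(genotype_similarity("ACGTAC", "ACGTTC"))  # one substitution
```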
RESULTS & DISCUSSION
We found that all tools are able to precisely predict the
breakpoint coordinate of the same large event present across
multiple genomes. For deletions, Breseq and Breakseek
consistently identified more than 96% of all simulated
deletions regardless of size. This number ranged from 87% to
93% in Pilon and correlated with decreasing deletion size.
Breseq and Pilon correctly predicted the exact breakpoint
coordinate for about two-thirds of all identified simulated
indels. This number ranged from 1% to 7% in Breakseek calls
and inversely correlated with increasing deletion size.
For insertions, MindTheGap consistently identified
approximately 97% of all simulated insertions, but Pilon’s
performance worsened as the number of insertions that it
identified ranged from 69% to 93%. Again, we observed a
direct correlation of missed calls as the insertion size
increased. Both tools correctly predicted the exact breakpoint
coordinate for about two-thirds of all identified simulated
indels. Nevertheless, we found 99% of the predicted
breakpoint coordinates made by the four tools were within
10nt of the expected breakpoint coordinate.
Our results also indicate that Pilon, Breseq, Breakseek, and
MindTheGap are robust when predicting the genotype of
large indels across multiple samples. The large majority of
identified simulated deletions had a size and genotype
similarity of more than 98%. In insertions, the size similarity
of insertions varied widely in both MindTheGap and Pilon
calls indicating that both tools have a difficult time
determining the exact length of an insertion sequence.
Overall, these results show that breakpoint detection is
precise when identifying deletion and insertions of any size.
Therefore, a simple normalization procedure—such as left-
most-overlap normalization across samples—will ensure
consistent breakpoint location for identical large events. This
will enable researchers to incorporate large variants into
existing association pipelines, opening novel opportunities to
associate large variants with phenotype and population
structure.
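The left-most normalization suggested above can be sketched for a single deletion (a minimal illustration for one sample; real pipelines must also handle insertions and multi-allelic sites):

```python
def left_normalize_deletion(ref, start, length):
    """Shift a deletion of `length` bases beginning at 0-based `start`
    as far left as possible while the resulting haplotype is unchanged.
    This happens whenever the base before the deletion equals the last
    deleted base, so the window can slide one position left."""
    while start > 0 and ref[start - 1] == ref[start + length - 1]:
        start -= 1
    return start

# Deleting "TA" at position 3 of this reference yields the same
# haplotype ("CTATG") as deleting "TA" at position 1, so both calls
# normalize to the same left-most coordinate.
ref = "CTATATG"
print(left_normalize_deletion(ref, 3, 2))  # 1
print(left_normalize_deletion(ref, 1, 2))  # 1
```

Applying this to every call across all samples gives identical events identical breakpoint coordinates, which is what downstream association requires.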
REFERENCES Barrick,J.E. et al. (2014) Identifying structural variation in haploid
microbial genomes from short-read resequencing data using breseq.
BMC Genomics, 15, 1039.
Rizk,G. et al. (2014) MindTheGap: integrated detection and assembly of short and long insertions. Bioinformatics, 30, 1–7.
Walker,B.J. et al. (2014) Pilon: an integrated tool for comprehensive
microbial variant detection and genome assembly improvement. PLoS One, 9, e112963.
Zhang,H. et al. (2013) Genome sequencing of 161 Mycobacterium
tuberculosis isolates from China identifies genes and intergenic regions associated with drug resistance. Nat. Genet., 45, 1255–60.
Zhao,H. and Zhao,F. (2015) BreakSeek: a breakpoint-based algorithm for
full spectral range INDEL detection. Nucleic Acids Res., 1–13.
P51. INTEGRATING STRUCTURED AND UNSTRUCTURED DATA SOURCES
FOR PREDICTING CLINICAL CODES
Elyne Scheurwegs1,3,*, Kim Luyckx2, Léon Luyten2, Walter Daelemans3 & Tim Van den Bulcke1.
Advanced Database Research and Modeling (ADReM), University of Antwerp1; Antwerp University Hospital2; Center
for Computational Linguistics and Psycholinguistics (CLiPS), University of Antwerp3.
Automated clinical coding is a task in medical informatics, in which information found in patient files is translated to
various types of coding systems (e.g. ICD-9-CM). The information in patient files consists of multiple data sources, both
in structured (e.g. lab test results) and unstructured form (e.g. a text describing the progress of a patient over multiple
days during the stay). This work studies the complementarity of information derived from these different sources to
enhance clinical code prediction.
INTRODUCTION
The increased accessibility of healthcare data through the
large-scale adoption of electronic health records stimulates
the development of algorithms that monitor hospital
activities, such as clinical coding applications.
Clinical coding consists of the translation of information
found in a patient file into diagnostic and procedural codes
originating from a medical ontology.
In our work, we investigate if unstructured (textual) and
structured data sources, present in electronic health
records, can be combined to assign clinical diagnostic and
procedural codes (specifically ICD-9-CM) to patient stays.
Our main objective is to evaluate if integrating these
heterogeneous data types improves prediction strength
compared to using the data types in isolation.
METHODS
Several datasets were collected from the clinical data
warehouse of the Antwerp University Hospital (UZA).
The resulting dataset consists of a randomized subset of
anonymized data of patient stays, in 14 different medical
specialties. Two separate data integration approaches were
evaluated on each dataset from a medical specialty.
With early data integration, multiple sources are combined
prior to training a model. This is achieved by using a
single bag of features that are given to the prediction
pipeline. Feature selection is performed with tf-idf for
unstructured sources, and gain ratio and minimal
redundancy, maximum relevance (mRMR) for structured
source filtering.
The late data integration method trains a separate model
on each data source, and then combines the prediction
output for each code in a meta-learner. This meta-learner
is mainly used to find which sources perform best for a
certain code.
The prediction task in both approaches was cast as a multi-label classification task, in which an array of binary predictions was made (one for each clinical code).
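The two integration schemes can be sketched as follows. This is a minimal illustration only: the source names, the feature prefixing, and the weighted-average meta-learner are assumptions, not the authors' implementation.

```python
def early_integration(sources):
    """Early integration: merge all sources into one bag of features
    before training, prefixing feature names to keep sources distinct."""
    bag = {}
    for name, features in sources.items():
        for feat, value in features.items():
            bag[f"{name}:{feat}"] = value
    return bag

def late_integration(per_source_scores, weights):
    """Late integration: combine per-source prediction scores for one
    code; here the meta-learner is simply a weighted average."""
    norm = sum(weights.get(s, 0.0) for s in per_source_scores) or 1.0
    return sum(weights.get(s, 0.0) * p
               for s, p in per_source_scores.items()) / norm

# Example: a text source and a structured lab source for one patient stay
bag = early_integration({"text": {"fever": 2.0}, "lab": {"crp_high": 1.0}})
score = late_integration({"text": 0.8, "lab": 0.2},
                         {"text": 1.0, "lab": 1.0})
```

In the actual setup the meta-learner is itself trained per code, which is how it learns which sources perform best for that code.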
RESULTS & DISCUSSION
Late data integration improves the prediction of ICD-9-CM diagnostic codes compared to the best individual prediction source (overall F-measure increased from 30.6% to 38.3%). Early data integration
does not show this trend and only performs well with a
limited number of combinations of sources. ICD-9-CM
procedure codes also show this trend, with the exception
of the RIZIV data source, which shows a better prediction
when used individually. The predictive strength of the
models varies strongly between different medical
specialties.
The results show that the data sources, independent of
their structured or unstructured nature, are able to provide
complementary information when predicting ICD-9-CM
codes, particularly when combined within the late data
integration approach. This approach also allows as many sources as possible to be included, since adding a source that contains no additional information barely influences the end result. This is an
advantage when the information content of a data source is
not previously known. A disadvantage is the loss of
information due to the strong generalisation as each data
source is effectively reduced to a single feature for the
meta-learner.
Early data integration seems to suffer when combining
sources that have features with a largely differing
information content and different numbers of features. An
unstructured data source typically renders 30,000
different, weak features, while a structured source often
contains only 500 different features.
CONCLUSIONS
Models using multiple electronic health record data
sources systematically outperform models using data
sources in isolation in the task of predicting ICD-9-CM
codes over a broad range of medical specialties.
ACKNOWLEDGEMENT
This work is supported by a doctoral research grant (nr.
131137) by the Agency for Innovation by Science and
Technology in Flanders (IWT). The datasets used in this
research were made available by the Antwerp University
Hospital (UZA) for restricted use.
REFERENCES
Scheurwegs E et al. (2015). Data integration of structured and unstructured sources for assigning clinical codes to patient stays. Journal of the American Medical Informatics Association, ocv115.
10th Benelux Bioinformatics Conference bbc 2015
96
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: P Poster
P52. SUPERVISED TEXT MINING FOR DISEASE AND GENE LINKS
Jaak Simm1,2,3*, Adam Arany1,2, Sarah ElShal1,2 & Yves Moreau1,2.
Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing, and Data Analytics, KU Leuven, Kasteelpark Arenberg 10, box 2446, 3001 Leuven, Belgium1; iMinds Medical IT, Kasteelpark Arenberg 10, box 2446, 3001 Leuven, Belgium2; Institute of Gene Technology, Tallinn University of Technology, Akadeemia tee 15A, Estonia3.
Scientific publications contain rich information about genetic disorders. Text mining these publications provides an
automatic way to quickly query and summarize the information. We propose a supervised learning approach that builds on the well-known unsupervised TF-IDF (term frequency–inverse document frequency) representation and integrates it with a supervised approach using a logistic loss. Preliminary results on the OMIM dataset look promising.
INTRODUCTION
Scientific publications contain rich information about
genetic disorders. Text mining these publications provides
an automatic way to quickly query and summarize the
information.
Traditional approaches employ unsupervised text mining techniques like TF-IDF (term frequency–inverse
document frequency) or Latent Dirichlet Allocation
(LDA) by Blei et al. (2003) for linking terms to genes and
diseases. Beegle (ElShal et al., 2015), a recent text mining tool developed for linking diseases and genes, has taken this approach using TF-IDF as its similarity metric.
PROPOSED METHOD
Our work proposes a supervised learning of the
importance of the textual terms, which can automatically
filter out many terms that are unnecessary for the task at
hand. We formulate it as a prediction of supervised values y given the terms for all genes g and all diseases d:

ŷ(g,d) = σ( Σ_i w_i · x_{g,i} · x_{d,i} )

where i is the index of the term, x_{g,i} and x_{d,i} are the TF-IDF scores of term i for gene g and disease d, w_i is the weight for term i, and σ is the sigmoid function. The main idea is to learn the weight vector w that minimizes the difference between the known values y and the predictions. The minimization can be transformed into a logistic regression.
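A toy sketch of this setup is given below: the prediction for one gene-disease pair and one logistic-loss gradient step on the term weights. The feature dictionaries, term names and learning rate are illustrative assumptions, not the authors' data or code.

```python
import math

def predict(w, x_gene, x_disease):
    """Sigmoid of the weighted sum of term scores shared by a pair."""
    z = sum(wi * x_gene.get(i, 0.0) * x_disease.get(i, 0.0)
            for i, wi in w.items())
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(w, x_gene, x_disease, y, lr=0.1):
    """One stochastic gradient step on the logistic loss."""
    g = predict(w, x_gene, x_disease) - y   # dLoss/dz for logistic loss
    for i in w:
        w[i] -= lr * g * x_gene.get(i, 0.0) * x_disease.get(i, 0.0)
    return w

# A known OMIM link (y=1) pulls up the weights of the shared terms only
w = {"seizure": 0.0, "ataxia": 0.0}
x_g = {"seizure": 1.2, "ataxia": 0.5}    # TF-IDF scores for a gene
x_d = {"seizure": 0.9}                   # TF-IDF scores for a disease
before = predict(w, x_g, x_d)            # 0.5 with zero weights
sgd_step(w, x_g, x_d, y=1)
after = predict(w, x_g, x_d)
```

Note that a term absent from either side of the pair receives no gradient, which is what lets the method filter out terms that are unnecessary for the task.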
For the supervised values we use the OMIM database (Hamosh et al., 2005). More specifically, y corresponds to
1 if there is a link between the given gene-disease pair and
0 if there is no link. Intuitively, in this setup the text
mining is transformed into a classification problem. We
use a dataset of 330 OMIM terms and their linked genes, and
randomly sample genes as negatives for each disease.
For the textual terms we use MEDLINE abstracts as the
source of biomedical text. We employ MetaMap (Aronson
et al. 2010) to link terms with abstracts. We use geneRIF
to link genes with abstracts, and PubMed to link diseases
with abstracts. We apply a TF-IDF transformation to score
a term with a given disease or gene based on the abstracts
linked to each entity. We only use the terms linked to
abstracts that belong to genes. Hence our vocabulary
consists of 66,883 terms.
RESULTS & DISCUSSION
The preliminary results show that supervised learning allows the informative keywords to be picked up automatically, improving the recall of the genes that are related to genetic disorders. We will present more detailed results in the poster.
We are also investigating how to integrate the supervised approach so that answers to online queries can be provided by Beegle.
REFERENCES
Blei DM, Ng AY & Jordan MI (2003). Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993-1022.
Hamosh A, Scott AF, Amberger JS, Bocchini CA & McKusick VA (2005). Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Research 33(suppl 1), D514-D517.
ElShal S, Tranchevent LC, Sifrim A, Ardeshirdavani A, Davis J & Moreau Y (2015). Beegle: from literature mining to disease-gene discovery. Nucleic Acids Research, gkv905.
Aronson AR & Lang FM (2010). An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association 17(3), 229-236.
P53. FLOWSOM WEB: A SCALABLE ALGORITHM TO VISUALIZE AND
COMPARE CYTOMETRY DATA IN THE BROWSER
Arne Soete2, Sofie Van Gassen1,2,3, Tom Dhaene1, Bart N. Lambrecht2,3 & Yvan Saeys2,3.
Department of Information Technology, Ghent University-iMinds, Ghent, Belgium1; Inflammation Research Center, VIB, Ghent, Belgium2; Department of Respiratory Medicine, Ghent University Hospital, Ghent, Belgium3.
We developed FlowSOM Web, a web-tool which visualizes cytometry data based on Self-Organizing Maps. Similar cells
are clustered and visualized via star charts. This allows us to process and display millions of cells efficiently.
Additionally, different biological samples (e.g. healthy versus diseased mice) can be compared.
INTRODUCTION
Cytometry data describes cell characteristics in
biological samples. Cells are labeled with fluorescent
antibodies and a flow cytometer measures the properties
of millions of cells one by one. Biologists use this
information to get more insight into diseases and to
diagnose patients. Most of them still analyse this data
manually to differentiate between the different cell types
present. This is done by plotting the data in 2D scatter
plots and selecting groups of cells in a hierarchical way.
This process is called `gating'. Recently, the number of
properties that can be measured simultaneously has
strongly increased. As the number of possible 2D scatter
plots increases exponentially with the number of
properties measured, it becomes infeasible to analyze
them all and relevant information that is present in the
data might be missed.
METHODS
We present FlowSOM, a new algorithm for the visualization and interpretation of cytometry data (Van Gassen et al., 2015). Using a two-level clustering and star charts, our algorithm helps to obtain a clear overview of how all markers behave on all cells, and to detect subsets that might otherwise be missed.
Our algorithm consists of 4 steps: pre-processing the
data, building a self-organizing map, building a minimal
spanning tree and computing a meta-clustering result.
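Steps 2 and 3 of this pipeline can be illustrated with a deliberately tiny, pure-Python sketch. The grid size, learning rate, toy data and the use of Prim's algorithm are illustrative assumptions; the actual FlowSOM implementation is the published R package.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two marker vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_som(cells, n_nodes=4, epochs=30, lr=0.3, seed=1):
    """Step 2: for each cell, pull the best-matching node's weight
    vector toward that cell, so similar cells cluster on one node."""
    rng = random.Random(seed)
    dim = len(cells[0])
    nodes = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for _ in range(epochs):
        for cell in cells:
            best = min(range(n_nodes), key=lambda n: dist2(nodes[n], cell))
            nodes[best] = [w + lr * (c - w) for w, c in zip(nodes[best], cell)]
    return nodes

def minimum_spanning_tree(nodes):
    """Step 3: Prim's algorithm over pairwise node distances."""
    in_tree, edges = {0}, []
    while len(in_tree) < len(nodes):
        u, v = min(((u, v) for u in in_tree
                    for v in range(len(nodes)) if v not in in_tree),
                   key=lambda e: dist2(nodes[e[0]], nodes[e[1]]))
        edges.append((u, v))
        in_tree.add(v)
    return edges

cells = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]  # two cell types
nodes = train_som(cells)
tree = minimum_spanning_tree(nodes)   # always n_nodes - 1 edges
```

In FlowSOM the tree layout is then used to draw the star charts, and a meta-clustering step groups the SOM nodes into larger cell populations.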
RESULTS & DISCUSSION
Although our results are quite similar to those of SPADE, another state-of-the-art algorithm for the visualization of cytometry data, they can be computed much faster and with less memory. By providing star charts and an automatic meta-clustering step, much more information can be visualised in a single tree than by the SPADE algorithm.
Additionally, multiple states can be compared (e.g.
healthy versus diseased mice) with one another and the
differences between the two states can be visualized via
star-charts.
At this conference, we would like to demonstrate a recently developed web interface to the underlying R functionality. This interface lets users upload cytometry data, run the aforementioned analysis, compare different cell states and explore the results via interactive visualizations, all from the comfort of the browser.
FIGURE 1. Example of a FlowSOM star chart.
REFERENCES
Van Gassen S et al. (2015). FlowSOM: Using self-organizing maps for visualization and interpretation of cytometry data. Cytometry 87, 636–645.
P54. TOWARDS A BELGIAN REFERENCE SET
Erika Souche1*, Amin Ardeshirdavani2, Yves Moreau2, Gert Matthijs1 & Joris Vermeesch1.
Department of Human Genetics, KU Leuven1; ESAT-STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven2.
Next-Generation Sequencing (NGS) is increasingly used to study and diagnose human disorders. Because the simultaneous sequencing of a large number of genes leads to the detection of a large number of variants, the bottleneck has moved from sequencing to variant interpretation and classification. Although publicly available databases of variant frequencies help distinguish causative mutations from common variants, they often lack population-specific variant frequencies. To circumvent this shortage of population-specific information, most genetic centers exploit their sequence data of unrelated and unaffected individuals to filter out common local variants. However, the resulting files/databases are rarely shared, and they are mainly based on whole exome data. In this project we demonstrate the utility of a local variant database generated from whole exome data, describe a procedure allowing the sharing of information between genetic centers, and mine low coverage whole genome data for common variants.
INTRODUCTION
Next-Generation Sequencing (NGS) is increasingly used
to study and diagnose human disorders. Because the simultaneous sequencing of a large number of genes leads to the detection of a large number of variants, the bottleneck has moved from sequencing to variant interpretation and classification. Publicly available databases of variant frequencies, provided by, among others, the Exome Sequencing Project (ESP), the 1000 Genomes Project (McVean et al., 2012) and dbSNP (Sherry et al., 2001), help distinguish causative mutations from common variants, identifying up to 78% of variants as common for a Belgian
exome. However, these data sets often lack population
specific variant frequencies and are outperformed by
databases of local variants. For example, using GoNL
(The Genome of the Netherlands Consortium, 2014) alone
allowed the identification of up to 85% of variants as
common for the same Belgian exome. The fact that the
GoNL is based on only 498 individuals further highlights
the importance of building and using population specific
databases.
Such population specific data can be retrieved from locally
sequenced individuals that underwent Whole Exome
Sequencing (WES) or Whole Genome Sequencing (WGS).
Storing only the frequencies and genotype counts of the
variants provides a valuable tool for variant classification
while no sensitive information on the individuals is
included.
METHODS
WES data of 350 unrelated and unaffected individuals
have been parsed. All samples were analysed in a similar way, i.e. reads were aligned to the reference genome with
BWA (Li & Durbin, 2009) and genotyping was performed
according to GATK best practices (McKenna et al., 2010;
DePristo et al., 2011). All samples were genotyped at all
polymorphic positions using GATK HaplotypeCaller and
GenotypeGVCFs. For each position, samples with low
quality genotype were considered as not genotyped and
excluded from the genotype counts. The number of
alternate alleles, allele counts and genotypes were
compiled in a population VCF file, in which individual
genotypes are not accessible.
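The aggregation into a population VCF can be sketched as follows. This is a minimal illustration only: it assumes GT and GQ fields are present in the FORMAT column of each record, and the GQ threshold is an assumption; the actual pipeline works on GATK HaplotypeCaller/GenotypeGVCFs output.

```python
def population_counts(vcf_line, min_gq=20):
    """Aggregate the per-sample genotypes of one VCF record into allele
    counts (AC) and the total number of genotyped alleles (AN), so that
    no individual genotype needs to be retained.  Samples with a low
    genotype quality are treated as not genotyped and excluded."""
    fields = vcf_line.rstrip("\n").split("\t")
    fmt = fields[8].split(":")
    gt_i, gq_i = fmt.index("GT"), fmt.index("GQ")
    ac, an = {}, 0
    for sample in fields[9:]:
        parts = sample.split(":")
        if int(parts[gq_i]) < min_gq:
            continue                     # excluded from the counts
        for allele in parts[gt_i].replace("|", "/").split("/"):
            if allele == ".":
                continue
            an += 1
            if allele != "0":
                ac[allele] = ac.get(allele, 0) + 1
    return ac, an

# Three samples; the third is dropped because of its low GQ
record = "\t".join(["21", "9411239", ".", "A", "G", "50", "PASS", ".",
                    "GT:GQ", "0/1:99", "1/1:99", "0/0:5"])
ac, an = population_counts(record)   # ac == {"1": 3}, an == 4
```

The resulting AC/AN pairs are what a shareable population VCF would carry in its INFO column.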
Variant frequencies can also be extracted from low
coverage WGS. As a pilot we processed the data of
chromosome 21 of about 4,000 WGS. The mapping was
performed with BWA (Li & Durbin, 2009) and the BAM
files were merged per 200 samples. All positions were
genotyped using freebayes (Garrison & Marth, 2012).
Genotype information for all locations outside low complexity regions was then compiled for all samples
using the integration of Apache Hadoop, HBase and Hive
(see poster “Big data solutions for variant discovery from
low coverage sequencing data, by integration of Hadoop,
Hbase and Hive”). Several models were then used to
distinguish real variants from sequencing errors: the Minor
Allele Frequency (MAF), the transition/transversion ratio,
the expected number of loci with a MAF of 5%, etc.
RESULTS & DISCUSSION
We demonstrated the effect of our reference set on several
exomes. The inclusion of only 350 individuals allowed the
identification of about 3% additional common variants,
not listed as common by ESP, dbSNP (Sherry et al., 2001),
1000 Genomes (McVean et al., 2012) and GoNL (The
Genome of the Netherlands Consortium, 2014). Since only
the frequencies of the variants in the screened populations
are reported, this file can easily be shared between
laboratories. Moreover, the procedure used to generate the
population VCF file can easily be applied to several
genetic centers in order to generate a common population
VCF file, as planned within the BeMGI project.
Finally we expect that the data from WGS will further
increase the performance of our reference set. A genome-wide variant frequency file from the local population will become worthwhile once WGS is routinely used in diagnostics.
REFERENCES
DePristo M et al. Nature Genetics 43, 491-498 (2011).
Exome Variant Server, NHLBI Exome Sequencing Project (ESP), Seattle, WA (URL: http://evs.gs.washington.edu/EVS/).
Garrison E & Marth G. http://arxiv.org/abs/1207.3907 (2012).
Li H & Durbin R. Bioinformatics 25, 1754-60 (2009).
McKenna A et al. Genome Research 20, 1297-303 (2010).
McVean et al. Nature 491, 56-65 (2012).
Sherry ST et al. Nucleic Acids Res. 29, 308-11 (2001).
The Genome of the Netherlands Consortium. Nature Genetics 46, 818-825 (2014).
P55. MANAGING BIG IMAGING DATA FROM MICROSCOPY:
A DEPARTMENTAL-WIDE APPROACH
Yves Sucaet1*, Silke Smeets1, Stijn Piessens1, Sabrina D’Haese1, Chris Groven1, Wim Waelput1 & Peter In’t Veld1.
Department of Pathology1, Faculty of Medicine, Vrije Universiteit Brussel, Laarbeeklaan 103, 1090 Brussels, Belgium.
With recent breakthroughs in whole slide imaging (WSI), almost any microscopic material can be digitized in an
efficient manner. In order to mine these data efficiently, a top-down approach was employed to manage various imaging
platforms. At Brussels Free University (VUB), we built a centralized infrastructure that integrates a variety of imaging
platforms (brightfield, fluorescence, multi-vendor formats). With the help of the Pathomation software platform for
digital microscopy, various datastores and image repositories were integrated. Custom coding was used to interact with
various vendor-software and server applications, where needed. The end-result is an interconnected network of
heterogeneous scalable information silos. We currently have two main use cases for WSI: education and biobanking.
These applications are available to the public via http://www.diabetesbiobank.org.
INTRODUCTION
Too often, image analysis and data/image mining projects
remain stuck in micro-environments because they are
limited by vendor-specific solutions that neither scale nor
interact with material from other departments or
institutions. Successful roll-out of digital histopathology
therefore requires more than a whole slide scanner.
If the goal is for an imaging facility to allow a researcher
to conduct a (microscopic) experiment, then that
researcher should not be hindered by the imaging platform
used. Similarly, an instructor integrating digital content
into his or her course, should be able to make their
materials as accessible as possible to as many students as
possible.
At Brussels Free University (VUB), we currently have two
main use cases for whole slide imaging: education and
biobanking. We have set these up in such a way that they
are both scalable and expandable.
METHODS
Whole slide imaging (WSI) has recently provided a boost
to digital capturing of microscopic content (and an explosion of data, a veritable digital treasure trove waiting to be explored by bioinformatics). But
researchers have been digitizing content for a long time
already through various technologies (mounted cameras,
inverted fluorescent microscopes with low magnification,
…).
We envisioned an environment whereby a researcher can
manage and view all of the material related to an
experiment or observation from a single interface,
irrespective of origin or technology used.
The following steps were taken to accomplish this:
- Set up a central server (50TB storage)
- Centrally store all imaging data and provide mapped drives on the individual workstations to facilitate a smooth transition for end-users
- Install the Pathomation platform for digital microscopy (PMA.core, PMA.view, PMA.zui) for universal viewing of digital content and to provide a uniform end-user experience
- Install Pydio (open source) for easy sharing of digital imaging content (integrated with Pathomation’s PMA.core so no duplicate user directories need to be maintained)
- Build custom portals to highlight specific collections of microscopic content and/or serve specific target audiences
RESULTS & DISCUSSION
The centralized digital imaging infrastructure is used by
various researchers and graduate students. Recently over
3,000 images were processed and hosted in the course of
one month.
Two use cases are worth highlighting:
- For undergraduate students (Medicine, BMS) we built custom portal websites to supplement their courses in histology and pathology. These sites are available at http://histology.vub.ac.be and http://pathology.vub.ac.be and provide students with (guided) virtual microscopy without the need to install any additional software.
- We also provide access portals to different specialized biobanks. The Willy Gepts collection represents a historic milestone in diabetes research (http://gepts.vub.ac.be) and is complementary to the Alan Foulis collection (http://foulis.vub.ac.be). Furthermore, the clinical diabetes biobank can now be consulted online, too, via http://www.diabetesbiobank.org.
CONCLUSION
Digital histopathology has been around for some time now,
but often results in heterogeneous data collections. It is only now that we are starting to look at integrated approaches to how this varied data can best be handled. Digital pathology involves much more than the acquisition of a slide scanner. We have brought five different imaging platforms onto a
single architecture. We are storing data from all modalities
in a single storage facility, and manage it through a single
access point. The resulting environment assists in
rendering content to any type of display device, without
the need for extra software or background information
concerning the content’s origin.
P56. ESTIMATING THE IMPACT OF CIS-REGULATORY VARIATION IN
CANCER GENOMES USING ENHANCER PREDICTION MODELS AND
MATCHED GENOME-EPIGENOME-TRANSCRIPTOME DATA
Dmitry Svetlichnyy1*, Hana Imrichova1, Zeynep Kalender Atak1 & Stein Aerts1.
Laboratory of Computational Biology, University of Leuven1. *[email protected]
The prioritization of candidate driver mutations in the non-coding part of the genome is a key challenge in cancer
genomics. Whereas driver mutations in protein-coding genes can be distinguished from passenger mutations based on
their recurrence, non-coding mutations are usually not recurrent at the same position. We aim to tackle this problem by applying machine-learning methods that predict regulatory regions to cancer genome sequences, in combination with sample-specific chromatin profiles obtained using ChIP-seq against H3K27Ac.
INTRODUCTION
Perturbations of gene regulatory networks in cancer cells
can arise from mutations in transcription factors or co-
factors, but also from mutations in regulatory regions.
Prioritizing candidate driver mutations that have a
significant impact on the activity of a regulatory region is
a key challenge in cancer genomics.
METHODS
We have developed enhancer prediction methods using
Random Forest classifiers to estimate the Predicted
Regulatory Impact of a Mutation in an Enhancer
(PRIME). We find that the recently identified driver
mutation in the TAL1 enhancer has a high PRIME score,
representing a “gain-of-target” for the oncogenic
transcription factor MYB [1]. We trained enhancer models
for 45 cancer-related transcription factors, and used these
to score somatic mutations across more than five hundred
breast cancer genomes. Next, we re-sequenced the genome
of ten cancer cell lines representing six different cancer
types (breast, lung, melanoma, ovarian, and colon) and
profiled their active chromatin by ChIP-seq against
H3K27Ac.
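The underlying scoring idea is the difference between the model's score of the mutant and the wild-type enhancer sequence. The sketch below uses a toy linear model over k-mer counts in place of the trained Random Forest; the k-mer length, weights and sequences are illustrative assumptions.

```python
def kmer_counts(seq, k=3):
    """Count overlapping k-mers, a simple sequence feature encoding."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts

def delta_score(model, wild_type, mutant, k=3):
    """Score both alleles with the same model; a large positive delta
    suggests the mutation creates regulatory activity (gain-of-target)."""
    def score(seq):
        c = kmer_counts(seq, k)
        return sum(model.get(kmer, 0.0) * n for kmer, n in c.items())
    return score(mutant) - score(wild_type)

# A model that rewards one hypothetical binding-site k-mer
model = {"TAA": 1.0}
delta = delta_score(model, "GGGGG", "GTAAG")   # mutation creates "TAA"
```

A per-mutation delta of this kind, computed with trained enhancer models, is what allows non-recurrent non-coding mutations to be prioritized from sequence alone.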
RESULTS & DISCUSSION
Then we integrated these data with matched expression
data and with the Random Forest model predictions for
sets of oncogenic transcription factors per cancer type.
This resulted in surprisingly few high-impact mutations
that generate de novo regulatory (oncogenic) activity at
the chromatin and gene expression level. Our framework
can be applied to identify candidate cis-regulatory
mutations using sequence information alone, and to
samples with combined genome-epigenome-transcriptome
data. Our results suggest the presence of only a few cis-regulatory driver mutations per cancer genome that may alter the expression levels of specific oncogenes and tumor suppressor genes.
REFERENCES
1. Mansour MR, Abraham BJ, Anders L, Berezovskaya A, Gutierrez A, Durbin AD, et al. An oncogenic super-enhancer formed through somatic mutation of a noncoding intergenic element. Science. 2014;346:1373-1377. doi:10.1126/science.1259037
P57. I-PV: A CIRCOS MODULE FOR INTERACTIVE PROTEIN
SEQUENCE VISUALIZATION
Ibrahim Tanyalcin1,2*, Carla Al Assaf3, Alexander Gheldof1, Katrien Stouffs1,4, Willy Lissens1,4 & Anna C. Jansen5,2.
Center for Medical Genetics, UZ Brussel, Brussels, Belgium1; Neurogenetics Research Group, Vrije Universiteit Brussel, Brussels, Belgium2; Center for Human Genetics, KU Leuven and University Hospitals Leuven, 3000 Leuven, Belgium3; Reproduction, Genetics and Regenerative Medicine, Vrije Universiteit Brussel, Brussels, Belgium4; Pediatric Neurology Unit, Department of Pediatrics, UZ Brussel, Brussels, Belgium5. *[email protected] or [email protected]
Summary: Today’s genome browsers and protein databanks supply vast amounts of information about proteins. The challenge is to concisely bring together this information in an interactive and easy to generate format.
Availability and Implementation: We have developed an interactive CIRCOS module called i-PV to visualize user supplied protein sequence, conservation and SNV data in a live presentable format. i-PV can be downloaded from http://www.i-pv.org.
INTRODUCTION
Today’s genome browsers and protein databanks supply
vast amounts of information about both the structural
annotation and the single nucleotide variants (SNV) in
genes. The challenge is to concisely bring together this
information in an interactive and easy to generate format.
Thus, we have developed an interactive CIRCOS
(Krzywinski et al.) module combined with D3 (Bostock et
al.) and plain javascript called i-PV to visualize user
supplied protein sequence, conservation and SNV data
while significantly easing and automating input file
requirements and generation.
METHODS
To use i-PV, only 4 text files (with “.txt” extension) have
to be supplied to the software: conservation scores,
protein and cDNA sequences, and SNVs/Indels files.
Protein and cDNA (or mRNA) sequence files are supplied
in fasta format, whereas SNV/Indel files are provided as an annotated VCF file (Variant Call Format). The conservation scores are simply an array of numbers separated by newline characters. The input files are supplied to i-PV, data are
automatically checked for errors or duplicates and
matched against the user provided fasta files, and then an
interactive html file containing the graph is automatically
generated as shown in Fig.1.
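An input check of this kind might look like the following sketch. The exact validations i-PV performs are not specified here; this one-score-per-residue consistency check is an assumption for illustration.

```python
def load_conservation(text):
    """Parse a conservation-score file: one number per line."""
    return [float(line) for line in text.splitlines() if line.strip()]

def check_lengths(protein_seq, scores):
    """Require exactly one conservation score per amino acid residue,
    so the conservation track aligns with the sequence track."""
    if len(scores) != len(protein_seq):
        raise ValueError(
            f"{len(scores)} scores for {len(protein_seq)} residues")
    return True

scores = load_conservation("0.9\n0.4\n1.0\n")
ok = check_lengths("MKV", scores)   # hypothetical 3-residue protein
```

Checks like this catch mismatched input files before the interactive HTML graph is generated.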
RESULTS & DISCUSSION
Many sequence visualization tools focus on certain aspects
of proteins such as conservation, variations, sequence
alignments or topology. While all these tools are very
useful in their own right, we pursued a more interactivity-based design. Therefore, i-PV is not solely designed for
visualization but also for live presentable graphs and
information that can selectively be displayed and
customized. I-PV combines major sources of information
under one html file that is easy to generate and share on
both desktop and mobile environments.
Last but not least, many visualization tools are based on
rectangular-scroll based representation of information
which does not deliver a “wide angle” view of the
sequence data, unlike circular visualization. However, as with all other types of visualization, circular graphs also have limitations when it comes to conveniently zooming in to a particular region or visually aligning tracks with different radii. We intend to further
develop this software with several other features based on
end user needs. The current version of i-PV can be
downloaded from http://www.i-pv.org.
FIGURE 1. Overview of i-PV features. (A) SNVs with mouse over
explanation and automatic generated dbSNP links (red: Non-
synonymous, green: Synonymous, gray: Not validated). (B) Console can be hidden for publication quality image. (C) Domains are colored based
on user preference. (D) Conservation data from user generated
alignment with mouse over information. (E) The user can define which amino acids to be shown on the sequence track. (F) Switch the color of
the background to black. (G) Amino acids are plotted and split into 5
main categories (nonpolar: gray circle, polar: magenta circle, negative: blue triangle, positive: red triangle, aromatic: green hexagon). (H)
Adjustable conservation score threshold to display regions above a
certain percentage of maximum conservation score. (I) Font-size of chosen amino acids can be adjusted. (J) User selectable amino acids to
be displayed. (K) Up to 17 different amino acid properties can be chosen
to be displayed from a drop-down menu. (L) Tile track showing SNVs and
collapsed due to over display). (M) Gene Name. (N) Buttons for mass
selection of amino acids. (O) User defined regions are marked with custom name tag and mouse over information. (P) Meta-analysis of
amino acid distributions. This information is only displayed in case of
single amino acid comparisons. The log2 ratios are capped between -3 and 3. The maximum and the minimum blosum62 scores are -4 and 11.
Since the blosum62 matrix is diagonally symmetric, the absolute value of the log ratios is mapped to this range and a p-value is indicated based on how close the two scores are.
REFERENCES
Bostock M et al. (2011). D3: Data-Driven Documents. IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis).
Krzywinski M et al. (2009). Circos: an information aesthetic for comparative genomics. Genome Res 19(9), 1639-45.
P58. SFINX: STRAIGHTFORWARD FILTERING INDEX FOR AFFINITY
PURIFICATION-MASS SPECTROMETRY DATA ANALYSIS
Kevin Titeca1,2, Pieter Meysman3,4, Kris Gevaert1,2, Jan Tavernier1,2, Kris Laukens3,4, Lennart Martens1,2 & Sven Eyckerman1,2*.
Medical Biotechnology Center, VIB, B-9000 Ghent, Belgium1; Department of Biochemistry, Ghent University, B-9000 Ghent, Belgium2; Advanced Database Research and Modeling (ADReM), University of Antwerp, Belgium3; Biomedical informatics research center Antwerpen (biomina), Belgium4. [email protected]
Affinity purification-mass spectrometry (AP-MS) is one of the most common techniques for the analysis of protein-
protein interactions, but inferring bona fide interactions from the resulting datasets remains notoriously difficult because
of the many false positives. The ideal filter technique for these data is highly accurate, fast and user-friendly, without the need to rely on extensive parameter optimization or external databases, which also makes it reproducible and unbiased.
Because none of the existing filter techniques combines all these features, we developed SFINX, the Straightforward
Filtering INdeX.
We here describe the SFINX algorithm and its performance on two independent AP-MS benchmark datasets. SFINX shows superior performance over the other approaches, with accuracy increases of up to 20%, and is extremely fast. It does not require parameter optimization and is entirely independent of external resources. Both the algorithm and its
website interface are highly intuitive with limited need for user input and the possibility of immediate network
visualization and interpretation at http://sfinx.ugent.be/. SFINX might become essential in the toolbox of any scientist
interested in user-friendly and highly accurate filtering of AP-MS data.
P59. MAPREDUCE APPROACHES FOR CONTACT MAP PREDICTION:
AN EXTREMELY IMBALANCED BIG DATA PROBLEM
Isaac Triguero1,2*, Sara del Río3, Victoria López3, Jaume Bacardit4, José M. Benítez3 & Francisco Herrera3.
VIB Inflammation Research Center1; Department of Respiratory Medicine, Ghent University2; Department of Computer Science and Artificial Intelligence3; School of Computing Science, Newcastle University4.
The application of data mining and machine learning techniques to biological and biomedical data continues to be a ubiquitous research theme in current bioinformatics. The rapid advances in biotechnology allow us to obtain and store large quantities of data about cells, proteins, genes, etc., that need to be processed. Moreover, in many of these problems, such as contact map prediction, it is difficult to collect representative positive examples. Learning under these circumstances, known as imbalanced big data classification, may not be straightforward for most standard machine learning methods. In this work we describe the methodology that won the ECBDL'14 big data competition, which concerned the prediction of contact maps. Our methodology is composed of several MapReduce approaches for dealing with large amounts of data. The results show that this model is well suited to tackling large-scale bioinformatics classification problems.
INTRODUCTION
The prediction of a protein’s contact map is a crucial step
for the prediction of the complete 3D structure of a protein.
This is one of the most challenging bioinformatics tasks
within the field of protein structure prediction because of
the sparseness of the contacts (i.e. few positive examples)
and the great amount of data extracted (i.e. millions of instances, gigabytes of disk space) from a few thousand proteins.
This problem is an imbalanced bioinformatics big data application, in which traditional machine learning techniques become ineffective and inefficient due to the scale of the problem. However, using emerging cloud-based technologies, these techniques can be redesigned to extract valuable knowledge from such amounts of data.
The ECBDL'14 competition (http://cruncher.ncl.ac.uk/bdcomp/) provided a data set that modeled the contact map prediction problem as a classification task. Concretely, the training data set comprised 32 million instances with 631 attributes and 2 classes, 98% of the examples being negative, and occupied about 56 GB of disk space.
In this work we describe the methodology with which we participated, under the name 'Efdamis', which ranked as the winning algorithm (Triguero et al, 2015).
METHODS
In the proposed methodology, we relied on the MapReduce (Dean et al, 2008) paradigm to manage this voluminous data set. We extended the applicability of several pre-processing and classification models to large-scale problems. The methodology is composed of four main parts:
An oversampling approach: The goal is to balance the
highly skewed class distribution of the problem by
replicating randomly the instances of the minority
class (del Rio et al, 2014).
An evolutionary feature weighting method: Due to the relatively high number of features in the given problem, we developed a feature weighting scheme for large-scale problems that improves the classification performance by detecting the most significant features (Triguero et al, 2012).
Building a learning model: As classifier, we focused
on a scalable RandomForest algorithm.
Testing the model: Even the test data can be considered big data (2.9 million instances), so the testing phase was also deployed within a parallel approach.
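The actual entry ran on a Hadoop-style MapReduce platform; purely as an illustration, the per-partition random oversampling idea can be sketched in plain Python, with `map` standing in for the mappers and `reduce` for the reducer (the data, labels and partition count below are invented):

```python
import random
from functools import reduce

def map_oversample(partition, seed=0):
    """Mapper: replicate random minority-class rows until the partition
    holds as many minority as majority examples."""
    rng = random.Random(seed)
    pos = [r for r in partition if r["label"] == 1]
    neg = [r for r in partition if r["label"] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return partition + extra

def reduce_concat(a, b):
    """Reducer: merge the balanced partitions into one training set."""
    return a + b

# Toy imbalanced data (label 1 is rare), split across 4 partitions.
data = [{"x": i, "label": 1 if i % 7 == 0 else 0} for i in range(400)]
partitions = [data[i::4] for i in range(4)]

balanced = reduce(reduce_concat, map(map_oversample, partitions, range(4)))
n_pos = sum(r["label"] for r in balanced)
print(n_pos, len(balanced) - n_pos)  # → 342 342
```

After the map phase every partition is locally balanced, so the merged training set contains equally many positive and negative rows, which is the property the Random Forest classifier needs.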
RESULTS & DISCUSSION
Table 1 presents the final results of the top 5 participants
in terms of True Positive Rate (TPR) and True Negative
Rate (TNR). In this particular problem, balancing the TPR and TNR ratios emerged as a difficult challenge for most of the participants in the competition. In this sense, the use of scalable preprocessing techniques played an important role in improving the results of the Random Forest classifier. First, the designed oversampling approach prevented the Random Forest from being biased towards the negative class. Second, our feature weighting approach made it possible to reduce the dimensionality of the problem by selecting the most relevant features. This resulted in better performance as well as a notable reduction of the time requirements.

Team          TPR       TNR       TPR * TNR
Efdamis       0.73043   0.73018   0.53335
ICOS          0.70321   0.73016   0.51345
UNSW          0.69916   0.72763   0.50873
HyperEns      0.64003   0.76338   0.48858
PUC-Rio_ICA   0.65709   0.71460   0.46956

TABLE 1: Comparison with the top 5 of the competition.
REFERENCES
Dean J., Ghemawat S., MapReduce: simplified data processing on large clusters, Commun. ACM 51 (1), 107–113 (2008).
del Río S., et al., On the use of MapReduce for imbalanced big data using random forest, Inf. Sci. 285, 112–137 (2014).
Triguero I. et al., Integrating a differential evolution feature weighting scheme into prototype generation, Neurocomputing 97, 332–343 (2012).
P60. COEXPNETVIZ: THE CONSTRUCTION AND VISUALISATION OF CO-EXPRESSION NETWORKS
Oren Tzfadia1,2, Tim Diels1,2,4, Sam De Meyer1,2, Klaas Vandepoele1,2, Yves Van de Peer1,2,3,5,* & Asaph Aharoni6.
Department of Plant Systems Biology, VIB, 9052 Ghent, Belgium1; Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium2; Genomics Research Institute (GRI), University of Pretoria, 0028 Pretoria, South Africa3; Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium4; Bioinformatics Institute Ghent, Ghent University, 9052 Ghent, Belgium5; Department of Plant Sciences and the Environment, Weizmann Institute of Science, Rehovot6.
INTRODUCTION
Comparative transcriptomics is a common approach in
functional gene discovery efforts. It allows for finding
conserved co-expression patterns between orthologous
genes in closely related plant species, suggesting that these
genes potentially share similar function and regulation.
Several efficient co-expression-based tools have been commonly used in plant research, but most of these pipelines are limited to data from model systems, which greatly limits their utility. Moreover, none of the existing pipelines allows plant researchers to use their own unpublished gene expression data to perform a comparative co-expression analysis and generate multi-species co-expression networks.
RESULTS
We introduce CoExpNetViz, a computational tool that takes as input a set of bait genes (chosen by the user) and at least one pre-processed gene expression dataset. The CoExpNetViz algorithm proceeds in three main steps: (i) for every bait gene submitted, co-expression values are calculated using Pearson correlation coefficients; (ii) non-bait (or target) genes are grouped based on cross-species orthology; and (iii) output files are generated, and results can be visualized as network graphs in Cytoscape.
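Step (i) can be sketched as follows; the expression profiles, gene names and correlation cutoff below are invented for illustration (the real tool implements this in C++/Java):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy expression matrix: gene -> expression across 5 conditions.
expression = {
    "bait1":   [1.0, 2.0, 3.0, 4.0, 5.0],
    "target1": [2.1, 3.9, 6.2, 8.0, 9.9],   # strongly co-expressed
    "target2": [5.0, 1.0, 4.0, 2.0, 3.0],   # uncorrelated
}

baits = ["bait1"]
cutoff = 0.8  # hypothetical correlation cutoff
edges = [(b, g) for b in baits for g in expression
         if g not in baits and pearson(expression[b], expression[g]) >= cutoff]
print(edges)  # [('bait1', 'target1')]
```

Each surviving (bait, target) pair becomes an edge of the co-expression network that is later grouped by orthology and exported for Cytoscape.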
AVAILABILITY AND IMPLEMENTATION
The CoExpNetViz tool is freely available both as a PHP web server (http://bioinformatics.psb.ugent.be/webtools/coexpr/) with a C++ back end and as a Cytoscape plugin (implemented in Java). Both versions of the CoExpNetViz tool support Linux and Windows platforms.
P61. THE DETECTION OF PURIFYING SELECTION DURING TUMOUR EVOLUTION UNVEILS CANCER VULNERABILITIES
Jimmy Van den Eynden1* & Erik Larsson1.
Department of Medical Biochemistry and Cell Biology, Institute of Biomedicine, The Sahlgrenska Academy, University
of Gothenburg, Sweden. *[email protected]
Identification of somatic mutation patterns indicative of positive selection has arguably become the major goal of cancer genomics. This is motivated by the search for cancer driver genes and pathways that are recurrently activated in tumours
but not normal cells, thus providing possible therapeutic windows. However, cancer cells additionally depend on a large
number of basic cellular processes, and elevated sensitivity to inhibition of certain essential non-driver genes has been
demonstrated in some cases. While such vulnerability genes should in theory be identifiable based on strong purifying
(negative) selection in tumors, these patterns have been elusive and purifying selection remains underexplored in cancer.
We established a new methodology and, using mutational data from 25 TCGA tumor types, we show for the first time
that negative selection in candidate vulnerability genes can be detected.
INTRODUCTION
Recently it was shown that a hemizygous deletion of the
well–known tumour suppressor gene TP53 creates
therapeutic vulnerability in colorectal cancer due to
concomitant loss of the neighbouring gene POLR2A (Liu
et al., 2015).
As any damaging mutation occurring in the single remaining allele of a hemizygously deleted essential gene, such as POLR2A, is expected to lead to cell death, we hypothesized that purifying selection in these genes could be unveiled by demonstrating a lower number of damaging mutations than would be expected in the absence of any selection.
Therefore we used the POLR2A case as a proof-of-
concept to develop a methodology to detect purifying
selection in large genome sequencing datasets.
METHODS
Mutation and copy number data from 25 different cancer types and 7,871 samples were downloaded from the TCGA data portal and pooled into a large pan-cancer dataset. Different mutational functional impact scores were calculated using ANNOVAR. Copy number data were analyzed using GISTIC 2.0 to differentiate POLR2A copy-number-neutral from hemizygously deleted samples.
RESULTS & DISCUSSION
POLR2A was found to be hemizygously deleted in 29% of
all samples. As expected, in over 99% this deletion was
part of the TP53 (driving) deletion on chromosome 17.
POLR2A was mutated 228 times in 2.3% of all samples. While 14 nonsense mutations and small out-of-frame insertions or deletions occurred in the copy-number-neutral group, none of these damaging mutations were found in the deletion group (p=0.03, Fisher's exact test), suggesting purifying selection against this type of mutation.
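The reported p=0.03 comes from the authors' full contingency table, which the abstract does not give. The sketch below only shows the form of such a one-sided Fisher's exact test: the 14 and 0 damaging-mutation counts are taken from the text, but the group sizes are hypothetical.

```python
from math import comb

def fisher_exact_less(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    probability of observing a or fewer events in the first row, from
    the hypergeometric distribution."""
    n_total = a + b + c + d          # all mutations
    k_event = a + c                  # damaging mutations overall
    n_row = a + b                    # mutations in the deletion group
    num = sum(comb(k_event, i) * comb(n_total - k_event, n_row - i)
              for i in range(a + 1))
    return num / comb(n_total, n_row)

# Hypothetical group sizes: 0 of 60 damaging mutations in the deletion
# group versus 14 of 168 in the copy-number-neutral group.
p = fisher_exact_less(0, 60, 14, 154)
print(round(p, 3))
```

With 0 observed events the tail reduces to a single hypergeometric term, which is why such a depletion can be significant even at modest sample sizes.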
Besides these truncating mutations, missense mutations that have a damaging effect on the protein's function are also expected to be selected against. We therefore predicted the functional impact of all mutations using different functional impact scores. The median (PolyPhen-2) functional impact score was found to be significantly lower in the deletion group compared to the copy-number-neutral group (p=0.002, Wilcoxon test, Fig. 1), further confirming that purifying selection has acted on POLR2A during tumour evolution.
These preliminary findings confirm that purifying selection is detectable in vulnerability genes like POLR2A, and that this approach could be used to detect new candidate vulnerability genes.
FIGURE 1. Negative selection against POLR2A high impact mutations in
hemizygously deleted tumour samples.
REFERENCES
Liu, Y., Zhang, X., Han, C., Wan, G., Huang, X., Ivan, C., … Lu, X. (2015). TP53 loss creates therapeutic vulnerability in colorectal cancer. Nature, 520(7549), 697–701. http://doi.org/10.1038/nature14418
P62. FLOREMI: SURVIVAL TIME PREDICTION
BASED ON FLOW CYTOMETRY DATA
Sofie Van Gassen1,2,3*, Celine Vens2,3,4, Tom Dhaene1, Bart N. Lambrecht2,3 & Yvan Saeys2,3.
Department of Information Technology, Ghent University - iMinds1; VIB Inflammation Research Center2; Department of Respiratory Medicine, Ghent University3; Department of Public Health and Primary Care, KU Leuven Kulak4.
Flow cytometry is a high-throughput technique for single cell analysis. It enables researchers and pathologists to study
blood and tissue samples by measuring several cell properties, such as cell size, granularity and the presence of cellular
markers. While this technique provides a wealth of information, it becomes hard to analyze all data manually. To
investigate alternative automatic analysis methods, the FlowCAP challenges were organized. We will present an
algorithm that obtained the best results on the FlowCAP IV challenge, predicting the time of progression to AIDS for
HIV patients.
INTRODUCTION
The main task of the most recent FlowCAP IV challenge
was a survival modeling challenge: participants had to
predict the time of progression to AIDS for HIV patients,
based on flow cytometry data of an unstimulated and a
stimulated blood sample. Additionally, a secondary task
was the identification of cell populations that could be
indicative of this progression rate. Several challenges
needed to be taken into account: the raw dataset was about 20 GB in size, and about eighty percent of the survival times were censored.
METHODS
We developed a new algorithm, FloReMi, which
combined several preprocessing steps with a density based
clustering algorithm, a feature selection step and a random
survival forest (Van Gassen et al., 2015).
The input for our algorithm consisted of two flow cytometry samples per patient: one unstimulated PBMC sample and one PBMC sample stimulated with HIV antigens. For each of these samples, 16 parameters were measured for hundreds of thousands of cells.
First, we included quality control to remove erroneous
measurements from the samples. We also made an
automatic selection of live T cells to focus on the cells of
interest in this specific flow cytometry staining.
Once the dataset was cleaned up, we extracted features for
each patient. This was done by clustering the cells using
the flowDensity (Malek et al., 2015) and flowType
algorithms (Aghaeepour et al., 2012). These algorithms divide the values for each feature into either “high” or “low” and use all combinatorial options of “high”, “low” or “neutral” marker values to group the cells. This resulted in 3^10 different cell subsets.
For each of these subsets, we computed the number of
cells assigned to it and the mean fluorescence intensity for
13 markers. Per patient, we collected these numbers for
both samples and also computed the differences between
the two. This resulted in a total of 2,480,058 features per
patient.
Because traditional machine learning algorithms cannot handle this number of features, we then applied a feature selection step. To estimate the usefulness of a feature, we applied a Cox proportional hazards model to each feature; the resulting p-value indicates how well the feature corresponds with the known survival times of the training set. We ordered the features based on these scores and picked only those that were uncorrelated with the already selected ones. This resulted in a final selection of 13 features, to which we applied several machine learning techniques. We compared the results of the Cox Proportional Hazards model, the Additive Hazards model and the Random Survival Forest.
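A minimal sketch of the ranking-plus-decorrelation step described above; the feature values, p-values and correlation threshold are invented (in the real pipeline the p-values come from per-feature Cox models):

```python
def pearson(x, y):
    """Pearson correlation between two feature vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def select_uncorrelated(features, pvalues, r_max=0.2):
    """Walk the features from best (lowest p-value) to worst and keep a
    feature only if it is weakly correlated with every kept one."""
    order = sorted(features, key=lambda f: pvalues[f])
    kept = []
    for f in order:
        if all(abs(pearson(features[f], features[k])) < r_max for k in kept):
            kept.append(f)
    return kept

# Toy features over 6 patients; f2 duplicates f1 and must be dropped.
features = {
    "f1": [1, 2, 3, 4, 5, 6],
    "f2": [2, 4, 6, 8, 10, 12],   # perfectly correlated with f1
    "f3": [5, 1, 4, 2, 6, 3],     # roughly independent of f1
}
pvalues = {"f1": 1e-6, "f2": 1e-5, "f3": 1e-3}  # hypothetical Cox p-values
print(select_uncorrelated(features, pvalues))   # ['f1', 'f3']
```

The greedy pass keeps the strongest survival-associated feature in each correlated group, which is how millions of redundant subset features can shrink to a short list.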
RESULTS & DISCUSSION
All three methods performed well on the training dataset. However, on the test dataset, both the Cox Proportional Hazards model and the Additive Hazards model performed poorly, probably due to overfitting on the training data. Only the Random Survival Forest obtained good results on the test dataset (Figure 1), outperforming all other methods submitted to the challenge.
FIGURE 1. On the training dataset, there was a strong correlation
between the scores and the actual survival times for all models. On the test dataset, only the Random Survival Forest performed well.
One important challenge remains: the biological interpretation of our final features. Although they correlate with the progression times from HIV infection to AIDS, it is hard to interpret them as known cell types due to our unsupervised feature extraction. Our method delivers a first step towards new insights into the progression from HIV infection to AIDS.
REFERENCES
Malek M et al. Bioinformatics 31.4, 606-607 (2015).
Aghaeepour N et al. Bioinformatics 28, 1009-1016 (2012).
Van Gassen S et al. Cytometry A, DOI 10.1002/cyto.a.22734
P63. STUDYING BET PROTEIN-CHROMATIN OCCUPATION TO
UNDERSTAND GENOTOXICITY OF MLV-BASED GENE THERAPY VECTORS
Sebastiaan Vanuytven1*, Jonas Demeulemeester1, Zeger Debyser1 & Rik Gijsbers1,2.
Laboratory for Molecular Virology and Gene Therapy, KU Leuven1; Leuven Viral Vector Core, KU Leuven2.
Integrating retroviral vectors are used to treat genetic and acquired disorders that, theoretically, can be cured by
introducing specific gene expression cassettes into patient cells. Clinical trials held over the past two decades have
proven that this approach is effective in curing genetic disorders and can produce better results than the standard therapy
(Touzot, F et al., 2015). Nevertheless, adverse events in a limited number of patients treated with gamma-retroviral
vectors have deterred their widespread application. Specifically, vector integration occurring in proximity of proto-
oncogenes resulted in insertional mutagenesis and clonal expansion of the cells (Hacein-Bey-Abina S et al., 2003).
INTRODUCTION
Retroviruses and their derived viral vectors do not
integrate at random. Their overall integration pattern is
dictated by cellular cofactors that are co-opted by the
invading viral complex. For gammaretroviral vectors
(prototype MLV) the cellular bromo- and extraterminal
domain (BET) family of proteins (BRD2, BRD3 and
BRD4) tethers the viral integrase to the host cell
chromatin (De Rijck J et al., 2013). At the moment the
only available ChIP-seq data derives from HEK-293T
cells exogenously overexpressing FLAG-tagged versions
of the BET proteins (LeRoy G et al., 2012). Yet, the
detailed chromatin binding profile of endogenous BET
proteins in human cells is currently unknown. Here we
report on the chromatin occupation of the endogenous
BET proteins in K562 and human primary CD4+ T cells.
METHODS
Following fixation, all three BET proteins were pulled down with specific antibodies (Bethyl Laboratories, α-BRD2: A302-583A; α-BRD3: A302-368A; α-BRD4: A301-985A, or Abcam ab84776). Subsequently, 1×10^7 cells per sample were processed for ChIP as previously described (Pradeepa MM et al., 2012). ChIPed DNA was amplified with WGA2 using the manufacturer's protocol (Sigma-Aldrich). All ChIP experiments were done with at least two biological replicates in K562 and CD4+ T cells.
After processing of the ChIP-seq data, we compared the
obtained BET protein-binding sites with MLV integration
sites, histone modifications and other genetic features.
Furthermore, we used motif discovery in the
neighbourhood of BET binding sites and MLV integration
sites to try and discover potential new players in the MLV
integration process.
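The comparison of integration sites with binding sites boils down to interval overlap. A minimal sketch with invented coordinates on a single chromosome (real data would be per-chromosome BED-style intervals):

```python
import bisect

def overlap_fraction(sites, intervals):
    """Fraction of integration sites falling inside any binding-site
    interval. `intervals` must be sorted, non-overlapping (start, end)
    pairs with inclusive ends."""
    starts = [s for s, _ in intervals]
    hits = 0
    for pos in sites:
        i = bisect.bisect_right(starts, pos) - 1  # last interval starting <= pos
        if i >= 0 and pos <= intervals[i][1]:
            hits += 1
    return hits / len(sites)

# Hypothetical BET-binding intervals and MLV integration positions.
bet_sites = [(100, 200), (500, 650), (900, 1000)]
mlv_sites = [150, 400, 600, 951, 700]
print(overlap_fraction(mlv_sites, bet_sites))  # 0.6
```

Binary search over the sorted interval starts keeps the per-site lookup at O(log n), which matters when scanning genome-wide ChIP-seq peak sets.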
RESULTS & DISCUSSION
Analysis showed that 24% of the MLV integration sites
overlap with a BET-binding site in K562 cells, the
majority of which are BRD4 sites. In addition, BET
binding sites located in promoter and enhancer regions are
preferred for MLV integration. Further evaluation demonstrated a strong correlation between MLV integration in these sites and the occurrence of transcription factor recognition motifs for MAX, GATA2, EGR1, GABPA and YY1, suggesting a role for these proteins or the underlying chromatin structures in targeting MLV integration to these locations in the genome via interaction with BET proteins and/or the MLV long terminal repeat sequences. Recently, we generated MLV-based vectors that no longer recognize BET proteins: BET-independent MLV-based (BinMLV) vectors (El Ashkar S et al., 2014). Integration preferences of BinMLV vectors are shifted away from epigenetic marks associated with enhancers and promoters, as shown by PCA analysis, and they also associate less with BET and MAX binding sites. Even though BinMLV vectors still did not integrate at random, their distribution can overall be described as safer, with 3% more integration sites in so-called genomic "safe harbor" regions (Sadelain M et al., 2012).
REFERENCES
De Rijck J et al. The BET family of proteins targets moloney murine
leukemia virus integration near transcription start sites, Cell Rep, 5, 886-894, (2013).
El Ashkar S et al. BET-independent MLV-based Vectors Target Away
From Promoters and Regulatory Elements, Mol Ther Nucleic Acids, 3, e179, (2014).
Hacein-Bey-Abina S et al. LMO2-associated clonal T cell proliferation in
two patients after gene therapy for SCID-X1, Science, 302, 415-419, (2003).
LeRoy G et al. Proteogenomic characterization and mapping of nucleosomes decoded by Brd and HP1 proteins, Genome Biol, 13,
R68, (2012).
Pradeepa MM et al. Psip1/Ledgf p52 binds methylated histone H3K36 and splicing factors and contributes to the regulation of alternative
splicing, PLoS Genet, 8, e1002717, (2012).
Sadelain M, Papapetrou EP and Bushman FD. Safe harbours for the integration of new DNA in the human genome, Nat Rev Cancer, 12,
51-58, (2012).
Touzot, F et al. Faster T-cell development following gene therapy compared with haploidentical HSCT in the treatment of SCID-X1,
Blood, 125, 3563-3569, (2015).
P64. THE COMPLETE GENOME SEQUENCE OF LACTOBACILLUS
FERMENTUM IMDO 130101 AND ITS METABOLIC TRAITS RELATED TO
THE SOURDOUGH FERMENTATION PROCESS
Marko Verce, Koen Illeghems, Luc De Vuyst & Stefan Weckx*.
Research Group of Industrial Microbiology and Food Biotechnology (IMDO), Faculty of Sciences and Bioengineering
Sciences, Vrije Universiteit Brussel, Brussels, Belgium. *[email protected]
The genome of the lactic acid bacterium species Lactobacillus fermentum IMDO 130101, capable of dominating
sourdough fermentation processes, was sequenced, annotated, and curated. Further, this genome sequence of 2.09 Mbp
was compared to other complete genomes of different strains of L. fermentum to elucidate the potential of L. fermentum
IMDO 130101 as a sourdough starter culture strain. As opposed to the other strains, L. fermentum IMDO 130101 contained unique genes related to carbohydrate import and metabolism, as well as a gene coding for a phenolic acid decarboxylase and a gene encoding a 4,6-α-glucanotransferase. The latter enzyme activity may result in the production of isomalto/malto-polysaccharides. All these features make L. fermentum IMDO 130101 attractive for further study as a
candidate sourdough starter culture strain.
INTRODUCTION
Lactobacillus fermentum is a heterofermentative lactic
acid bacterium often found in fermented food products,
including sourdough. Strain L. fermentum IMDO 130101,
a dominant sourdough strain originally isolated from a rye
sourdough (Weckx et al., 2010) and extensively described
previously (e.g., Vrancken et al., 2008), was sequenced
and compared to other L. fermentum strains with
completed genomes to elucidate unique adaptations of the
strain studied to the sourdough environment.
METHODS
High-quality genomic DNA was used to construct an 8-kb
paired-end library for 454 pyrosequencing. The
pyrosequencing reads were assembled using the GS De
Novo Assembler version 2.5.3 with default parameters.
Primers for gap closure were designed using CONSED
23.0, the gaps amplified with polymerase chain reaction
(PCR) assays and the amplicons sequenced using Sanger
sequencing. The sequences were imported into CONSED
23.0 and used to close the gaps. The genome was
annotated using the automated genome annotation
platform GenDB v2.2 (Meyer et al., 2003), followed by
extensive manual curation. Publicly available genome
sequences of L. fermentum F-6 (Sun et al., 2015), L.
fermentum IFO 3956 (Morita et al., 2008), and L.
fermentum CECT 5716 (Jiménez et al., 2010) were
acquired from RefSeq. Whole-genome comparisons with
the other three L. fermentum strains and ortholog findings
were performed using the progressiveMauve algorithm
(Darling et al., 2010).
RESULTS & DISCUSSION
The 2.09 Mbp genome was assembled from 403,466 reads,
resulting in 74 contigs. No plasmids were found. The
comparative genome analysis with other strains showed
that 477 coding sequences were found in L. fermentum
IMDO 130101 solely (Figure 1).
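The strain-specific coding sequences behind Figure 1 amount to set differences over ortholog groups. A toy sketch with invented group identifiers (real input would come from the progressiveMauve ortholog table):

```python
# Hypothetical ortholog-group identifiers per strain.
strains = {
    "IMDO 130101": {"og1", "og2", "og3", "og4", "og7"},
    "F-6":         {"og1", "og2", "og5"},
    "IFO 3956":    {"og1", "og3", "og5"},
    "CECT 5716":   {"og1", "og2", "og3", "og6"},
}

def unique_to(strain, strains):
    """Coding sequences found in `strain` and in none of the others."""
    others = set().union(*(s for name, s in strains.items() if name != strain))
    return strains[strain] - others

print(sorted(unique_to("IMDO 130101", strains)))  # ['og4', 'og7']
```

Applying the same set difference to the real ortholog table yields the 477 coding sequences reported for L. fermentum IMDO 130101, and the pairwise intersections fill in the remaining Venn diagram regions.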
L. fermentum IMDO 130101 was predicted to be able to
import and utilise glucose, fructose, xylose, mannose, N-
acetylglucosamine, maltose, sucrose, lactose and gluconic
acid via the heterolactic fermentation pathway. Also, the
ability to degrade raffinose and arabinose was predicted.
Consumption of glucose, fructose, maltose and sucrose
was shown in previous research, although growth with
sucrose as the sole energy source was impaired (Vrancken
et al., 2008). The strain possibly imports isomaltose and maltodextrins, thereby obtaining additional glucose subunits. The α-glucosidase-encoding gene was not found in the
genomes of the other three strains considered, and neither
were the putative maltodextrin import-related genes, the
trehalose-6-phosphate phosphorylase-encoding gene and a
putative -glucanase-encoding gene, which all may be
adaptations of L. fermentum IMDO 130101 to the
sourdough environment. The presence of the arginine
deiminase gene cluster was confirmed. Also, L. fermentum
IMDO 130101 contained a gene for a phenolic acid
decarboxylase, which may have an impact on sourdough
aroma. Further, a 4,6-α-glucanotransferase-encoding gene
was present in strain IMDO 130101 solely, which could
result in isomalto/malto-polysaccharide production, a
soluble dietary fibre with prebiotic properties.
Overall, comparative genome analysis revealed metabolic
traits that are of interest for the use of L. fermentum IMDO
130101 as a functional starter culture for sourdough
fermentation processes.
FIGURE 1. Venn diagram of shared coding sequences between four
different strains of Lactobacillus fermentum.
REFERENCES
Darling et al. PLoS ONE 5, e11147 (2010).
Jiménez E. et al. J. Bacteriol. 192, 4800-4800 (2010).
Meyer et al. Nucleic Acids Res. 31, 2187-2195 (2003).
Morita et al. DNA Res. 15, 151-161 (2008).
Sun et al. J. Biotechnol. 194, 110-111 (2015).
Vrancken et al. Int. J. Food Microbiol. 128, 58-66 (2008).
Weckx et al. Food Microbiol. 27, 1000-1008 (2010).
P65. ORTHOLOGICAL ANALYSIS OF AN EBOLA VIRUS – HUMAN PPIN
SUGGESTS REDUCED INTERFERENCE OF EBOLA VIRUS WITH EPIGENETIC
PROCESSES IN ITS SUSPECTED BAT RESERVOIR HOST
Ben Verhees1*, Kris Laukens1,2, Stefan Naulaerts1,2, Pieter Meysman1,2 & Xaveer Van Ostade3.
Biomedical informatics research center Antwerpen (biomina)1; Advanced Database Research and Modeling (ADReM),
University of Antwerp2; Laboratory of Protein Science, Proteomics and Epigenetic Signalling (PPES) and Centre for
Proteomics and Mass spectrometry (CFP-CeProMa), University of Antwerp3.
Ebola virus is a zoonosis, but its reservoir host has not yet been identified. Recent findings suggest, however, that Mops condylurus, an insect-eating bat, is a likely candidate. Studying the interactions between Ebola virus and its reservoir
host could prove highly informative, as reservoir hosts of zoonotic pathogens often appear to tolerate infections with
these pathogens with little evidence of disease. In this study, a protein-protein interaction network (PPIN) was created
between Ebola virus and human proteins. Orthology data in Myotis lucifugus – a model organism often used for bat
studies – was employed to determine which of the human first neighbors of Ebola virus proteins do not possess an
orthologue in M. lucifugus. Subsequent GO enrichment analysis suggested that these proteins are mostly involved in
epigenetic processes, and thus we hypothesize that Ebola virus displays reduced interference with epigenetic processes in
its reservoir host.
INTRODUCTION
The idea that bats serve as reservoirs for a wide range of zoonotic pathogens has been the topic of much recent research. Previous studies on human and bat orthology in this context have mainly focused on specific genes important in fighting off viral infection.
Our study differs, however, in that it focuses on the proteins with which Ebola virus directly interacts in humans, and on the existence of orthologues of these proteins in bats.
METHODS
Construction of an Ebola virus – human PPIN
An Ebola virus – human PPIN was constructed from in
silico data. All network analysis was done using
Cytoscape v. 3.2.1.
Orthology analysis
Identification of orthologues was performed using the
OMA orthology database, release: September 2015.
Statistics
For the statistical analysis, the hypergeometric test was
performed.
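The abstract does not report the counts that went into the test, but the hypergeometric upper-tail calculation itself is standard. A minimal pure-Python sketch, with made-up counts (N, K, n and k below are hypothetical, not the study's numbers):

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n proteins from a population of N,
    of which K carry an orthologue (hypergeometric upper tail)."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

# Hypothetical counts for illustration: 10,000 human proteins in the network,
# 8,000 with an M. lucifugus orthologue, 150 first neighbors of Ebola virus
# proteins, 135 of which have an orthologue.
p_value = hypergeom_sf(135, 10_000, 8_000, 150)
print(f"P(X >= 135) = {p_value:.4g}")
```

The same tail probability is what libraries such as SciPy compute with `hypergeom.sf`.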
GO enrichment
GO enrichment analysis was performed using ClueGO v.
1.2.7, a Cytoscape plug-in. Default settings were used, and
all ontologies/pathways were examined.
RESULTS & DISCUSSION
Myotis lucifugus as a model for Mops condylurus
In this study, Myotis lucifugus was used as a model to
study interactions between Ebola virus and Mops
condylurus, its suspected reservoir.
Ebola virus – human PPIN and orthology in M.
lucifugus
An Ebola virus – human PPIN was created, and human
first neighbors of Ebola virus proteins were examined for
existence of orthologues in M. lucifugus. Statistical analysis revealed an overrepresentation of human proteins with orthologues in M. lucifugus (p=0.019).
GO enrichment suggests reduced interference of Ebola
virus with epigenetic processes in its reservoir host
Gene ontology (GO) enrichment analysis was performed on the human first neighbors of Ebola virus proteins that do not possess an orthologue in M. lucifugus. The analysis revealed that these proteins are mostly involved in epigenetic processes (Figure 1).
FIGURE 1. GO enrichment analysis of human first neighbors of Ebola
virus proteins which do not possess an orthologue in M. lucifugus.
Discussion
Using this novel approach, we have shown, first, that Ebola virus is likely able to interfere with epigenetic processes in humans and, second, that its ability to interfere with host epigenetics is likely reduced or altered in its reservoir host.
While the idea that viruses are able to interact with host
epigenetic mechanisms is fairly recent, over the past few
years significant research has been done exploring this
topic. In a comprehensive review, Li et al. (2014) describe
how specific viral proteins are able to modulate the
activity of chromatin modification complexes, e.g. HATs,
HDACs, HMTs, and HDMTs, and even directly bind
histone proteins. These findings lend support to the results of our study, which suggest that Ebola virus is also able to interact with HDACs, HMTs and several histone proteins in humans.
REFERENCES
Li S et al. Rev Med Virol 24, 223-241 (2014).
P66. PLADIPUS EMPOWERS UNIVERSAL DISTRIBUTED COMPUTING
Kenneth Verheggen1,2,3*, Harald Barsnes4,5, Lennart Martens1,2,3 & Marc Vaudel4.
Medical Biotechnology Center, VIB, Ghent, Belgium1; Department of Biochemistry, Ghent University, Ghent, Belgium2; Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium3; Proteomics Unit, Department of Biomedicine, University of Bergen, Norway4; KG Jebsen Center for Diabetes Research, Department of Clinical Science, University of Bergen, Norway5. *[email protected]
The use of proteomics bioinformatics substantially contributes to an improved understanding of proteomes, but this novel
and in-depth knowledge comes at the cost of increased computational complexity. Parallelization across multiple
computers, a strategy termed distributed computing, can be used to handle this increased complexity. However, setting
up and maintaining a distributed computing infrastructure requires resources and skills that are not readily available to
most research groups.
Here, we propose a free and open source framework named Pladipus that greatly facilitates the establishment of
distributed computing networks for proteomics bioinformatics tools.
INTRODUCTION
Various modern-day bioinformatics-related fields have a growing focus on large-scale data processing. This inevitably leads to increased complexity, as illustrated by the recent efforts to elaborate a comprehensive MS-based characterization of the human proteome (Kim et al., 2014; Wilhelm et al., 2014). Such high-throughput, complex studies are becoming increasingly popular, but require high-performance computational setups in order to be analyzed swiftly.
METHODS
Here, we present a generic platform for distributed
proteomics software, called Pladipus. It provides an
end-user-oriented solution to distribute
bioinformatics tasks over a network of computers,
managed through an intuitive graphical user interface
(GUI).
Pladipus comes with several modules that work out
of the box. They include SearchGUI (Vaudel et al.,
2011), PeptideShaker (Vaudel et al., 2015),
DeNovoGUI (Muth et al., 2014), MsConvert (part of
Proteowizard (Kessner et al., 2008)) and three
common forms of the BLAST (Altschul et al., 1990)
algorithm (blastn, blastp and blastx). It is possible to link these together to set up tailored pipelines for specific needs, including custom in-house algorithms, and to execute the whole on an inexpensive, scalable cluster infrastructure without additional cost or expert maintenance requirements. It can even be set up to allow existing (idle) hardware to hook into the network and participate in the processing.
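Pladipus manages its workers through its own GUI-configured network, but the underlying idea of farming independent runs out to a pool of workers can be sketched in a few lines of Python. The `run_job` function and task list below are hypothetical stand-ins, not Pladipus's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

def run_job(task):
    """Hypothetical stand-in for one processing run; a real worker would
    launch the configured tool (SearchGUI, DeNovoGUI, BLAST, ...) here."""
    name, payload = task
    return name, payload.upper()  # dummy "result"

# Independent tasks: each could be one input file pushed through a pipeline.
tasks = [(f"run_{i:02d}", f"spectrum file {i}") for i in range(8)]

# Farm the tasks out to a pool of workers; because the jobs are independent,
# wall time shrinks roughly with the number of available workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(run_job, tasks))
```

A framework like Pladipus adds what this sketch omits: job persistence, failure recovery, and dispatch across physically separate machines.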
RESULTS & DISCUSSION
To numerically assess the benefits of using a distributed computing framework, 52 CPTAC experiments (LTQ-Study6 : Orbitrap@86) (Paulovich et al., 2010) were searched three times against a protein sequence database (UniProtKB/SwissProt, release-2015_05) on Pladipus networks of various sizes. A selection of three search engines was applied: X!Tandem, Tide and MS-GF+. As expected for a distributed system, the wall time is very reproducible and decreased nearly exponentially with the number of workers.
FIGURE 1. Benchmarking of a Pladipus network (16 GB RAM, 12 cores, 250 GB disk space, Ubuntu precise).
Pladipus is freely available as open source under the permissive Apache2 license. Documentation, including example files, an installer and a video tutorial, can be found at https://compomics.github.io/projects/pladipus.html.
REFERENCES
Altschul, S.F. et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403-10.
Kessner, D. et al. (2008) ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics, 24, 2534-6.
Kim, M.-S. et al. (2014) A draft map of the human proteome. Nature, 509, 575-81.
Muth, T. et al. (2014) DeNovoGUI: an open source graphical user interface for de novo sequencing of tandem mass spectra. J. Proteome Res., 13, 1143-6.
Paulovich, A.G. et al. (2010) Interlaboratory study characterizing a yeast performance standard for benchmarking LC-MS platform performance. Mol. Cell. Proteomics, 9, 242-54.
Vaudel, M. et al. (2011) SearchGUI: An open-source graphical user interface for simultaneous OMSSA and X!Tandem searches. Proteomics, 11, 996-9.
Vaudel, M. et al. (2015) PeptideShaker enables reanalysis of MS-derived proteomics data sets. Nat. Biotechnol., 33, 22-24.
Wilhelm, M. et al. (2014) Mass-spectrometry-based draft of the human proteome. Nature, 509, 582-7.
P67. IDENTIFICATION OF ANTIBIOTIC RESISTANCE MECHANISMS USING
A NETWORK-BASED APPROACH
Bram Weytjens1,2,3,4, Dries De Maeyer1,2,3,4 & Kathleen Marchal1,2,4*.
Dept. of Information Technology (INTEC, iMINDS), UGent, Ghent, 9052, Belgium1; Dept. of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 927, 9052 Gent, Belgium2; Dept. of Microbial and Molecular Systems, KU Leuven, Kasteelpark Arenberg 20, B-3001 Leuven, Belgium3; Bioinformatics Institute Ghent, Ghent University, Ghent B-9000, Belgium4. *[email protected]
Antibiotic resistance is a growing public health concern as the effectiveness of multiple types of antibiotics is decreasing. To prevent and combat the further spread of antibiotic resistance in bacteria, there is a need to better understand the relationship between genetic alterations and the (molecular) phenotype of antibiotic-resistant strains. As several (-omics) experiments regarding the acquisition of antibiotic resistance by bacteria have already been performed and are publicly available, we re-analysed a laboratory evolution experiment by Suzuki et al. (Suzuki, 2014) in order to demonstrate the power of a network-based approach in identifying mutations and molecular pathways driving the resistance phenotype.
INTRODUCTION
While network-based approaches are no longer new in high-throughput (-omics) analysis, they are not yet widely used in standard analysis pipelines. We analysed a dataset consisting of multiple E. coli MDS42 strains, each independently evolved in the presence of a specific antibiotic (10 in total). By adapting PheNetic (De Maeyer, 2013), an algorithm which connects genetic alterations to differentially expressed genes over a genome-wide interaction network, we were able to automatically identify mutations in genes which are known to induce antibiotic resistance.
METHODS
For every strain, whole-genome sequencing data and microarray data (eQTL data) were available. By finding the most probable connections between the mutations of every strain and the strain's respective expression data over a biological network, PheNetic was able not only to uncover potential driver genes and molecular pathways for the resistance phenotype, but also to prioritize the identified mutations based on the likelihood that they truly drive the resistance phenotype. Such a network-based approach has the following advantages:
- Integration of interactomics (network), genomics and transcriptomics data
- Multiple related datasets can be analyzed together
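PheNetic itself scores sub-networks probabilistically, but the core idea of linking mutated genes to differentially expressed genes over an interaction network can be illustrated with a plain breadth-first search. The toy network and gene names below are hypothetical, chosen only to echo the amikacin example:

```python
from collections import deque

# Toy undirected interaction network (hypothetical edges for illustration).
edges = [("cpxA", "cpxR"), ("cpxR", "geneX"), ("geneX", "deg1"),
         ("cpxA", "deg2"), ("cyoB", "geneY"), ("geneY", "deg3")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def shortest_path(src, dst):
    """Plain BFS; PheNetic instead weighs paths probabilistically."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Connect each mutated gene to each differentially expressed gene; the union
# of these paths is a candidate sub-network explaining the phenotype.
mutated, diff_expr = ["cpxA", "cyoB"], ["deg1", "deg3"]
subnetwork = [shortest_path(m, d) for m in mutated for d in diff_expr
              if shortest_path(m, d) is not None]
```

Genes that appear on many such paths are natural candidates for drivers of the phenotype, which is the intuition behind PheNetic's ranking.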
FIGURE 1. Part of the amikacin resistance network.
RESULTS & DISCUSSION
In the case of amikacin resistance (Figure 1), for two strains out of four we were able to uncover a gain-of-function mutation in cpxA, a gene of a two-component signal transduction system which is known to be involved in amikacin resistance. For the other two strains, deleterious cyoB mutations were found, which are known to lead to intracellular oxidized copper and eventually multidrug resistance. These genes were furthermore ranked highest by PheNetic.
REFERENCES
Suzuki S et al. Nat Commun 5, 5792 (2014).
De Maeyer D et al. Mol Biosyst 9, 1594-1603 (2013).
P68. DEFINING THE MICROBIAL COMMUNITY OF DIFFERENT
LACTOBACILLUS NICHES USING METAGENOMIC SEQUENCING
Sander Wuyts1,2*, Eline Oerlemans1, Ilke De Boeck1, Wenke Smets1, Dieter Vandenheuvel, Ingmar Claes1 & Sarah Lebeer1.
Laboratory of Applied Microbiology and Biotechnology, University of Antwerp1; Research Group of Industrial Microbiology and Food Biotechnology (IMDO), Vrije Universiteit Brussel2. *[email protected]
Next-Generation Sequencing (NGS) has revolutionized the field of microbial community analysis. Thanks to these high-throughput DNA technologies, microbiologists are now able to perform more in-depth analyses of various microbial communities than is possible with culture-dependent methods. In our lab, we have successfully deployed 16S rDNA amplicon sequencing on the Illumina MiSeq. A bioinformatic pipeline has been built based on mothur (Schloss et al. 2009), UPARSE (Edgar 2013) and Phyloseq (McMurdie & Holmes 2013) to analyse different microbial community datasets. The focus is on functional analysis of lactobacilli and other lactic acid bacteria in different ecological niches, ranging from the human upper respiratory tract to naturally fermented plant-based foods.
INTRODUCTION
16S metagenomics is a technique that makes use of the
highly conserved bacterial 16S rRNA gene. This gene
codes for an RNA-molecule which is a component of the
30S small subunit of bacterial ribosomes. It consists of 9
hypervariable regions, flanked by conserved regions for
which primer pairs for PCR/sequencing can be designed.
Due to these characteristics and due to the slow rate of
evolution, this gene has been widely used in bacterial
phylogeny and taxonomy. NGS technologies like Illumina
MiSeq have made it possible to study all the different
16S rRNA gene copies from an environmental sample and
use these to identify the bacteria present in the sample. But the use of these high-throughput technologies comes at a cost: the need for a more in-depth bioinformatic analysis.
METHODS
Wetlab:
DNA is extracted using sample dependent extraction
protocols. A barcoded PCR is performed on the V4 region
of the 16S rRNA gene as described in Kozich et al. 2013.
For each sample a different set of primers is used; each
primerset contains a unique combination of barcodes. The
PCR-products are cleaned using AMPure XP (Agencourt)
bead purification and quantified using Qubit (Life
technologies). All samples are equimolary pooled into one
single library. A negative control (= “empty” DNA-
extraction) and a positive control (= “Mock” communities
HM-276D and HM-782D) are always processed together
with the samples. The library is sequenced using a dual
index sequencing strategy (Kozich et al. 2013) and a
2 x 250 bp kit on the Illumina MiSeq.
Bio-informatic analysis:
Samples are demultiplexed on the MiSeq itself, allowing 1
bp difference in the barcodes. The general quality of the
reads is checked using FastQC (Babraham Bioinformatics).
The paired end reads are merged using mothur’s
make.contigs command. Quality control in mothur is
performed using screen.seqs, alignment to the SILVA
database and removal of sequences that do not map to the
database, removal of chimeras using chimera.uchime and
removal of sequences that classify to the lineages
“Mitochondria” and “Chloroplast”.
The distances between sequences are calculated using mothur's dist.seqs command, and sequences are clustered at 97% sequence similarity using mothur's cluster command. Alternatively, the UPARSE clustering algorithm can be used for these last two steps. Sequences are classified using the RDP database and the complete dataset is exported as a .biom file.
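The 97% clustering step can be illustrated with a toy greedy centroid clusterer, a much-simplified stand-in for mothur's cluster command or UPARSE; the identity function below compares positions directly and ignores alignment entirely, and the reads are invented:

```python
def identity(a, b):
    """Fraction of matching positions; a toy stand-in for a real alignment."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_cluster(seqs, threshold=0.97):
    """Each sequence joins the first centroid it matches at >= threshold
    identity, otherwise it founds a new OTU (UPARSE-style greedy pass)."""
    otus = []  # list of (centroid, members) pairs
    for s in seqs:
        for centroid, members in otus:
            if identity(s, centroid) >= threshold:
                members.append(s)
                break
        else:
            otus.append((s, [s]))
    return otus

# Hypothetical 100 bp reads: one centroid, a 1-mismatch variant (99% identity,
# same OTU) and an unrelated read (25% identity, new OTU).
base = "ACGT" * 25
reads = [base, "T" + base[1:], "G" * 100]
otus = greedy_cluster(reads)
```

Real OTU pickers add abundance-sorted input, chimera checks and proper pairwise alignment, but the clustering decision is the same threshold test.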
Visualisation and statistical analysis are performed using the R package Phyloseq. This analysis depends on the experimental design but generally consists of a normalisation step (using rarefying, proportions or a statistical mixture model (McMurdie & Holmes 2014)), a calculation of alpha diversity measures, and a calculation and visualisation of beta diversity.
RESULTS & DISCUSSION
The method described above was optimised and proved to work. We successfully used this technique to obtain better insights into the role of lactobacilli in different ecological niches, e.g. the murine gastrointestinal tract, vegetable fermentations and the human upper respiratory tract.
REFERENCES
Edgar, R.C., 2013. UPARSE: highly accurate OTU sequences from microbial amplicon reads. Nature Methods, 10(10), pp.996-8.
Kozich, J.J. et al., 2013. Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Applied and Environmental Microbiology, 79(17), pp.5112-20.
McMurdie, P.J. & Holmes, S., 2013. Phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLoS ONE, 8(4).
McMurdie, P.J. & Holmes, S., 2014. Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Computational Biology, 10(4), p.e1003531.
Schloss, P.D. et al., 2009. Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Applied and Environmental Microbiology, 75(23), pp.7537-7541.
P69. HUNTING HUMAN PHENOTYPE-ASSOCIATED GENES
USING MATRIX FACTORIZATION
Pooya Zakeri1,2*, Jaak Simm1,2, Adam Arany1,2, Sarah Elshal1,2 & Yves Moreau1,2.
Department of Electrical Engineering, STADIUS, KU Leuven, Leuven 3001, Belgium1; iMinds Medical IT, Leuven 3001, Belgium2. *[email protected]
In the last decade, the identification of phenotype-associated genes has received growing attention, yet it remains one of the most challenging problems in biology. In particular, determining disease-associated genes is a demanding process and plays a crucial role in understanding the relationship between disease phenotypes and genes. Typical approaches for gene prioritization model each disease individually, which fails to capture the common patterns in the data. This motivated us to formulate the hunt for phenotype-associated genes as the factorization of an incompletely filled gene-phenotype matrix, where the objective is to predict the unknown values. Experimental results on the updated version of the Endeavour benchmark demonstrate that our proposed model effectively improves on the accuracy of the state-of-the-art gene prioritization models.
INTRODUCTION
In biology, there is often a need to single out, from a large list of candidates, the most promising genes for further investigation. While a single data source might not be informative enough, fusing several complementary genomic data sources results in more accurate predictions. Moreover, fusing the phenotypic similarity of diseases and sharing information about known disease genes across both diseases and genes through a multi-task approach enables us to handle gene prioritization for diseases with very few known genes and for genes with limited available information. Typical strategies for hunting phenotype-associated genes model each phenotype individually [1, 2, 3, 4], which fails to capture the common patterns in the data. This motivated us to formulate the hunt for phenotype-associated genes as the factorization of an incompletely filled gene-phenotype matrix, where the objective is to predict the unknown values.
METHODS
We consider the OMIM database, which catalogues associations between human phenotypes (diseases) and genes. OMIM focuses on the relationship between human genotypes and associated diseases, and can be seen as an incomplete matrix in which each row is a gene and each column is a phenotype (disease).
The idea behind factorizing the M×N OMIM matrix is to represent each row and each column by a latent vector of size D. The OMIM matrix can then be modeled as the product of an M×D gene matrix G and the transpose of an N×D phenotype matrix P.
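The latent-factor idea can be sketched with a tiny stochastic-gradient matrix completion in pure Python. The matrix values and dimensions below are hypothetical, and plain SGD is only a stand-in for the authors' Bayesian model with side information:

```python
import random

random.seed(0)

# Toy 4x3 gene-by-phenotype matrix; None marks unknown associations.
R = [[1, 0, None],
     [1, None, 0],
     [None, 1, 1],
     [0, 1, None]]
M, N, D = len(R), len(R[0]), 2  # genes, phenotypes, latent size

G = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(M)]
P = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(N)]

def predict(i, j):
    """Score of gene i for phenotype j: inner product of latent vectors."""
    return sum(G[i][d] * P[j][d] for d in range(D))

# Plain SGD on the observed entries only; BPMF instead samples G and P from
# their posterior distribution rather than fitting point estimates.
lr, reg = 0.05, 0.01
for _ in range(3000):
    for i in range(M):
        for j in range(N):
            if R[i][j] is None:
                continue
            err = R[i][j] - predict(i, j)
            for d in range(D):
                g, p = G[i][d], P[j][d]
                G[i][d] += lr * (err * p - reg * g)
                P[j][d] += lr * (err * g - reg * p)

# Unknown entries are scored by the same inner product, which yields a
# ranking of candidate gene-phenotype associations.
scores = {(i, j): predict(i, j) for i in range(M) for j in range(N)
          if R[i][j] is None}
```

Because genes and phenotypes share the same latent space, information about one disease's known genes propagates to related diseases, which is the multi-task benefit described above.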
Bayesian probabilistic matrix factorization (BPMF) [5] is a well-known method for filling in such an incomplete matrix, but it uses no side information, which limits the accuracy of gene-phenotype matrix completion.
We propose an extended version of BPMF that can exploit multiple sources of side information when completing the gene-phenotype matrix [6], which also allows ranking genes outside the matrix. In our proposed framework we are able to integrate both genomic data sources and phenotype information, whereas earlier approaches for hunting phenotype-associated genes only fuse genomic information. This is achieved by adding genomic and phenotypic features to the corresponding latent variables [6]. In this study, we consider several genomic data sources, including annotation-based sources such as UniProt annotations and literature-based sources for each gene, as well as literature-based phenotypic information for each disease, just as in [1, 4]. The framework of our Bayesian data fusion model for gene prioritization is illustrated in Figure 1.
FIGURE 1. The framework of our Bayesian data fusion model for gene prioritization.
RESULTS & DISCUSSION
We report the average TPR when considering the top 1%, 5%, 10%, and 30% of the ranked genes. Experimental results on the updated version of the Endeavour [3] benchmark demonstrate that our proposed model effectively improves on the accuracy of the state-of-the-art gene prioritization models.
REFERENCES
Aerts, S. et al. Nat Biotech, 24(5), 537-544 (2006).
De Bie, T., Tranchevent, L.C., van Oeffelen, L.M.M. & Moreau, Y. Bioinformatics, 23(13), i125-i132 (2007).
Tranchevent, L.C. et al. NAR, 36, W377-W384 (2008).
ElShal, S. et al. NAR (2015).
Salakhutdinov, R. & Mnih, A. 25th ICML, 880-887, ACM (2008).
Simm, J. et al. arXiv:1509.04610 [stat.ML] (2016).
P70. THE IMPACT OF HMGA PROTEINS ON REPLICATION ORIGIN DISTRIBUTION
A. Zouaoui1, M. Kahli2, E. Besnard3, R. Desprat1, N. Kirsten4, P. Ben-sadoun1 & J.M. Lemaitre1.
Institute for Regenerative Medicine and Biotherapy, France1; Institut de Biologie de l'École Normale Supérieure (ENS), France2; The Gladstone Institutes, University of California San Francisco (UCSF), United States3; Helmholtz Zentrum München, Research Unit Gene Vectors, Munich, Germany4.
Proliferative cells can undergo an irreversible cell-cycle arrest called cellular senescence, which can induce the development of cancer and ageing. Senescence is characterized by the formation of Senescence-Associated Heterochromatic Foci (SAHF) and a decline in DNA replication. High-Mobility Group A (HMGA) proteins promote SAHF formation and a proliferative arrest, and stabilize senescence when overexpressed.
In a cell, DNA replication is initiated at several genomic sites called replication origins ("Oris"). A pre-replication protein complex is required for DNA replication to occur; within this complex, the ORC1 protein is involved in recognition of the replication origin. DNA autoradiography of eukaryotic cells showed that human replication origins are bidirectional and spaced at 20-400 kb intervals (Huberman and Riggs, 1968). At each origin, replication forks are formed and new short nascent strands are synthesized. A popular method to map replication origins is therefore the purification of Short Nascent Strands (SNS). Several laboratories have identified up to 50,000 origins using microarray and sequencing techniques. Our laboratory has developed an origin mapping method applied to four cell types: IMR90, H9, iPSC and HeLa (Besnard et al., 2012). The short nascent strands were isolated, sequenced and analyzed, and 250,000 origin peaks were identified with a peak detection tool named Sole-Search (Blahnik KR, Dou L, O'Geen H, et al. 2010).
The objective is to find the most sensitive method for analyzing origin distribution in proliferative and senescent cells, in order to observe whether senescence has an impact on that distribution, and to investigate the implication of HMGA proteins in DNA replication. Two new, more sensitive analysis methods are in development. In the first, origin peaks are called with the MACS2 tool (Zhang et al., 2008), which uses a new statistical model and algorithm. In the second, origin enrichment is assessed with the Homer tool (Heinz S et al., 2010).
Both methods identify replication origin sites from Illumina GAII sequencing of short nascent strands. Human SNS-seq reads of 36 bp were mapped to human genome build GRCh38 with the BWA tool (ref). Origin peaks were called by MACS2 and origin enrichment by Homer. To compare the two methods, active origins in HeLa cells were detected with each, and the correlation between ORC1 peaks and the identified origins is calculated to choose the most sensitive method. The impact of pre-senescence is assessed by comparing the origin distributions observed in proliferative and senescent cells. Origin distributions are also compared before and after induction of HMGA proteins to investigate the implication of these proteins in DNA replication during senescence.
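One simple way to quantify the agreement between ORC1 peaks and origins called by either tool is a base-pair overlap (Jaccard) statistic over the two interval sets. The intervals below are hypothetical, and this sweep is a much-simplified stand-in for dedicated tools such as bedtools jaccard:

```python
def merge(intervals):
    """Merge overlapping (start, end) intervals on one chromosome."""
    out = []
    for s, e in sorted(intervals):
        if out and s <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], e))
        else:
            out.append((s, e))
    return out

def covered(intervals):
    """Total bases covered by a merged interval list."""
    return sum(e - s for s, e in intervals)

def intersection(a, b):
    """Total bases shared by two merged, sorted interval lists (linear sweep)."""
    total, i, j = 0, 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total

# Hypothetical peak sets: origins called from SNS-seq vs ORC1 ChIP-seq peaks.
origins = merge([(100, 200), (300, 400), (800, 900)])
orc1 = merge([(150, 250), (300, 350), (600, 700)])
shared = intersection(origins, orc1)
jaccard = shared / (covered(origins) + covered(orc1) - shared)
```

The peak caller whose origins reach the higher Jaccard score against ORC1 binding sites would, under this criterion, be judged the more sensitive method.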
REFERENCES
Besnard et al. Best practices for mapping replication origins in eukaryotic chromosomes. Curr Protoc Cell Biol. 2014 Sep 2; 64:22.18.1-22.18.13.
Besnard et al. Unraveling cell type-specific and reprogrammable human replication origin signatures associated with G-quadruplex consensus motifs. Nat Struct Mol Biol. 2012 Aug; 19, 837-44.
Blahnik KR, Dou L, O'Geen H, et al. Sole-Search: an integrated analysis program for peak detection and functional annotation using ChIP-seq data. Nucleic Acids Res. 2010; 38:e13.
Fu H et al. Mapping replication origin sequences in eukaryotic chromosomes. Curr Protoc Cell Biol. 2014 Dec 1; 65:22.20.1-22.20.17.
Heinz S, Benner C, Spann N, Bertolino E, et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell 2010 May 28; 38, 576-589.
Huberman JA et al. On the mechanism of DNA replication in mammalian chromosomes. J Mol Biol 1968 Mar 14; 32, 327-41.
Zhang et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol 2008; 9, R137.
bbc 2015
December 7 - 8, 2015 Antwerp, Belgium
www.bbc2015.be