
MÁSTER EN BIOINFORMÁTICA Y BIOLOGÍA COMPUTACIONAL ESCUELA NACIONAL DE SALUD- INSTITUTO DE SALUD CARLOS III

NutVar 2: Sequence-based functional annotation of truncating variants from Genome data

Manuel Tardáguila Sancho

2013-2014

Venue: Lab Telenti, Centre Hospitalier Universitaire Vaudois (CHUV), Institut de Microbiologie (IMUL) - Swiss Institute of Bioinformatics (SIB)

Supervisors: Antonio Rausell (SIB), Amalio Telenti (J. Craig Venter Institute) and Ioannis Xenarios (SIB)

Master's supervisor: Michael Tress (CNIO)

Date: January, 2015

INDEX

OBJECTIVES
INTRODUCTION
MATERIALS & METHODS
RESULTS
  Assembly of the training and test sets and variant effect prediction
  Construction of Pre-Calculated tables (Pre-C tables) and obtention of files necessary to annotate features of molecular damage
  Annotation of gene and sequence based features
  Machine Learning
DISCUSSION
  NutVar2 versus implementations in SnpEff and VEP
  NutVar2 versus NutVar1
  Future Perspectives
CONCLUSIONS
REFERENCES
ANNEXES


OBJECTIVES

• Obtain a training set for the classifier using up-to-date data from large sequencing projects and from the ClinVar database
• Extend the analysis of NutVar to variants disrupting splice donor/acceptor sites
• Add new functionalities to broaden the range of potential users
• Extend protein feature analysis to functional sites
• Develop an annotation pipeline for all the features
• Train and test a classifier using the Naïve Bayes algorithm


INTRODUCTION

Recent large sequencing projects report that stop-gains, frameshifts and splice donor/acceptor disruptions giving rise to truncations are collectively prevalent in humans. Current computational approaches evaluating their impact on health rely on gene-level characteristics such as evolutionary conservation and functional redundancy across the genome. However, sequence-based features such as loss of functional domains, isoform-specific truncation and onset of nonsense-mediated decay provide cues that improve the ranking of potentially pathogenic truncations. Here, we present NutVar2, the second release of a classifier that ranks truncating variants by severity using sequence-based features. New characteristics in NutVar2 include an expanded training set comprising common truncations from ESP, the 1000 Genomes Project and the 10GK project together with pathogenic truncations extracted from ClinVar, evaluation of the loss of protein functional sites, and ENSEMBL as the basis for transcript annotation. The sequence-based score ranks variants according to their potential to cause disease, and complements existing gene-based pathogenicity scores.

The amount of data coming out of large-scale human exome sequencing projects is growing exponentially [1] as we enter the era of personal genomics. Even though humans differ in only 0.1% of their DNA sequence, the study of these nucleotide changes is the basis for ascertaining genomic predisposition to disease. Therefore, tools are required to prioritize, according to their pathogenic potential, the clinical study of variants that are new or not previously associated with disease.

Setting aside large structural variations (SVs) and copy number variations (CNVs), variants that have a profound effect on protein functionality are often associated with disease [2-4]. Among these are the so-called 'Loss-of-Function' (LoF) variants, mainly due to a truncation arising from a stop-gain, a frameshift or the destruction of a splice donor/acceptor site. However, truncating variants have proven to be surprisingly prevalent in the general population, with the average individual carrying between 100 and 200 of them [2, 5]. Furthermore, 20 of these truncations are estimated to appear in homozygosis [2].

Until the advent of NutVar [4], the assessment of the pathogenicity of truncations remained focused on features derived from the gene affected by the variant. These 'gene-based' features reflect evolutionary constraints, such as sequence conservation or tolerance to functional variation, and are used to evaluate the pathogenicity potential of the gene. Under this approach, truncations are assumed to be severe irrespective of their position in the gene, so the outcome depends solely on the gene being affected. A general conclusion of these studies is that the more conserved the affected gene, the greater the likelihood of the truncation being pathogenic [2].

Two studies pioneered the use of gene-based features and showed their classifying power. First, MacArthur et al [2] identified 2,951 LoF candidate variants from the 1000 Genomes Project [5] and validated them using arrays whenever possible (n=1,877). The authors focused on producing a high-confidence set of truncations leading to LoF, eliminating up to 25% of candidates that were likely sequencing/mapping errors, annotation/reference sequence errors, or variants unlikely to cause genuine LoF. Interestingly, the authors pointed out that 'accurate functional interpretation requires integrating multiple variants on the same haplotype'. This means that the sequence context of an a priori truncation needs to be evaluated carefully to determine whether or not it actually produces a truncation. For instance, SNPs leading to stop-gains are prone to associate with SNPs in adjacent nucleotides that revert their effect, frameshifts are prone to associate with frameshifts in adjacent sequence that restore the original reading frame, and indels close to or spanning exon splice sites can be rescued by alternative splice sites. With the surviving 1,285 high-confidence LoF variants, MacArthur et al built a gene score based on the observation that the genes in which these variants were more prevalent shared a series of characteristics: they (i) were relatively less evolutionarily conserved, (ii) had more paralogs and (iii) had lower connectivity in both protein-protein interaction and gene interaction networks. All of these pointed to the redundancy and/or dispensability of these genes, the typical example being the human olfactory receptors.

Shortly afterwards, Petrovski et al provided a different but related gene-based score [3]. The authors developed 'an intolerance scoring system (RVIS) that assesses whether genes have relatively more or less functional genetic variation than expected based on the apparently neutral variation found in the gene' [3]. Their primary motivation was to distinguish between two types of genes: genes for which variants affecting functionality are rarely observed and, when observed, tend to accumulate in patients with a certain disease, and genes which carry functional variants at high frequencies. Using a logistic regression model they observed that genes causing Mendelian diseases are more intolerant to functional variants.

Both of these scores share two central caveats: they rely solely on gene-based features, and they tend to misclassify variants affecting genes that are under positive selection, such as genes of the Innate Immune Response (IIR). First of all, the assumption that a truncation will lead to LoF irrespective of its position is too simplistic; depending on its position, the truncation will affect a greater or smaller number of protein domains and functional sites. In addition, the genomic coordinate of the truncation dictates the isoform(s) it affects. This is of special importance as most genes have a principal isoform [6]. Furthermore, the Nonsense-Mediated-Decay (NMD) machinery abrogates the expression of transcripts with premature termination codons located more than 50 bp upstream of the last exon-exon junction. In summary, the overall effect of the truncation needs to be inferred from a series of sequence-based features inherent to the position of the variant. Both groups use preliminary strategies to deal with this caveat: MacArthur et al discard every truncation lying in the last 5% of the sequence, assuming it will not entail LoF, and Petrovski et al combine their intolerance score with PolyPhen2, a sequence-based estimator of the potential impact of missense variants.

Second, variants affecting genes that undergo positive selection (for example, IIR genes) tend to be misclassified. The gene score of MacArthur et al tends to classify functional variants in these genes as pathogenic when some of them are not. The authors of the RVIS score suggest an alternative to deal with this problem: depending on the nature of the disease being studied (according to the classification of diseases by Goh et al [7]), the association between the intolerance score and disease should be interpreted differently. For example, developmental diseases are caused by genes with the most intolerant scores to functional variation (low RVIS). On the other hand, immunological diseases are caused by genes very tolerant to functional variation (high RVIS).

All these caveats were addressed by the first release of NutVar, developed by Rausell et al [4]. The software focuses on truncations, extracting the aforementioned sequence features and providing a classifier that can be blended with either the MacArthur or the RVIS gene score. The authors showed the gain in accuracy when both scores were combined, and validated the decrease in transcript levels for variants eliciting NMD using correlated transcriptomics and genomics data from the 1000 Genomes Project and the Geuvadis Project [4]. The authors analyzed in detail genes of the IIR and showed that NutVar complemented gene-based scores and in some instances classified the truncations in these genes better than gene-based scores alone.

Here, we present NutVar2, the second release of the classifier. New characteristics in NutVar2 include an expanded training set comprising common truncations from ESP, the 1000 Genomes Project and the 10GK project together with pathogenic truncations extracted from ClinVar, evaluation of the loss of protein functional sites, and ENSEMBL as the basis for transcript annotation. The improvements and extended functionalities over NutVar1 account for a gain in classification accuracy.


MATERIALS & METHODS

Scripts and original files

All the scripts and files needed to run the NutVar2 annotation phase are provided along with this report as a .gz file. In the NutVar2 directory, scripts are located in the bin/ folder, whereas data/ contains the files needed to run the pipeline. The bin/ directory is divided in turn into scripts needed to build the Pre-C tables and the training set (build_table/ and Training_set/), scripts used exclusively to run VEP (VEP/), scripts used exclusively to run SnpEff (snpeff/) and scripts common to both programs in the pipeline (shared/). The data/ folder includes files generated in the construction of the Pre-C tables (build_tables/) and the training set (Training_set/), intermediate files generated in the annotation process (intermediate/), files obtained from external sources (external/) and a folder with the final annotation matrix (final/). Finally, a test folder is provided with an example that can be used to run the whole pipeline.

The map of dependencies provided in the annexes includes the commands needed to run the different scripts. The paths to the external ftp sites are also displayed in the map of dependencies.

Calculation of the Allele Frequency (AF) in the Training Set

An R script (Beta_distribution.R) was implemented to calculate the 95% Credible Intervals for the AF of each variant (HOM = homozygous for the variant, HET = heterozygous for the variant, WT = wild type). Briefly, the count of the allele of interest was calculated as (2*HOM+HET) and the total number of alleles as (2*(HOM+HET+WT)). For each variant, the qbeta function of R was used to calculate the quantiles for the probabilities 0.025 and 0.975, using 1 + allele of interest as shape parameter 1 and 1 + total alleles - allele of interest as shape parameter 2. The ncp was set to 0.

Only variants for which the lower limit of the Credible Interval was greater than 0.01 were accepted as common.
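The calculation above can be sketched in Python as follows. This is an illustrative re-implementation, not the Beta_distribution.R script itself: instead of R's qbeta it obtains the quantiles by bisection on the Beta CDF, which for integer shape parameters can be written as a binomial tail. The function names are hypothetical.

```python
import math

def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) at x for positive integer a and b.

    Uses the identity I_x(a, b) = P(Bin(a + b - 1, x) >= a).
    """
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    n = a + b - 1
    log_x, log_1mx = math.log(x), math.log1p(-x)
    lower_tail = 0.0  # P(Bin(n, x) <= a - 1), summed in log space
    for i in range(a):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * log_x + (n - i) * log_1mx)
        lower_tail += math.exp(log_term)
    return min(1.0, max(0.0, 1.0 - lower_tail))

def beta_quantile(p, a, b, iterations=60):
    """Quantile of Beta(a, b) by bisection (stand-in for R's qbeta)."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if beta_cdf_int(mid, a, b) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def credible_interval(hom, het, wt):
    """95% credible interval for the AF from genotype counts."""
    allele_count = 2 * hom + het            # allele of interest
    total_alleles = 2 * (hom + het + wt)    # total alleles
    shape1 = 1 + allele_count
    shape2 = 1 + total_alleles - allele_count
    return (beta_quantile(0.025, shape1, shape2),
            beta_quantile(0.975, shape1, shape2))

def is_common(hom, het, wt, threshold=0.01):
    """A variant is accepted as common if the lower limit exceeds the threshold."""
    lower, _ = credible_interval(hom, het, wt)
    return lower > threshold
```

With these definitions, a variant observed in very few individuals is rejected even if its raw frequency exceeds 1%, because its credible interval is wide; for example `is_common(0, 1, 20)` is false although 1 of 42 alleles carries the variant.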

Minimal Representation

The Minimal Representation of each variant was calculated as explained in the guidelines established in [8].
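As an illustration, the sketch below assumes that the guidelines in [8] amount to splitting joint calls and then trimming nucleotides shared by REF and ALT, first from the right and then from the left (shifting the position accordingly). The function names are hypothetical, and symbolic alleles such as <DEL> are not handled here.

```python
def split_joint_call(chrom, pos, ref, alt_field):
    """Split a joint call (comma-separated ALT alleles) into one record per allele."""
    return [(chrom, pos, ref, alt.strip()) for alt in alt_field.split(",")]

def minimal_representation(pos, ref, alt):
    """Trim padding nucleotides shared by REF and ALT.

    Assumption: trailing shared bases are removed first, then leading ones,
    always keeping at least one base in each allele.
    """
    # Remove shared trailing nucleotides (position is unchanged).
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Remove shared leading nucleotides (position moves to the right).
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

# Joint call taken from the example in Figure 5.
records = split_joint_call("1", 1231299, "CGG", "CTT, AGG")
minimal = [(chrom,) + minimal_representation(pos, ref, alt)
           for chrom, pos, ref, alt in records]
```

Under these assumptions the joint call CGG > CTT, AGG at position 1231299 becomes GG > TT at position 1231300 and C > A at position 1231299, matching the example in Figure 5.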

Selection of Protein features

For functional sites, sites annotated in UniProt as 'by similarity' or 'probable', or lacking a label (the highest level of confidence), were accepted as valid. Sites labelled as 'potential' were discarded.

Naive Bayesian Classifier

Cross-validation was performed as leave-one-out (k equal to the number of examples), as there were few examples in the final training set. No variant filtering was performed.
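The following minimal sketch shows what leave-one-out cross-validation of a naive Bayes classifier looks like. It is not the NutVar2 classifier: it assumes binary features with Laplace smoothing (a Bernoulli naive Bayes), and the two features and the toy data are hypothetical.

```python
import math

def train_nb(rows, labels, alpha=1.0):
    """Fit a Bernoulli naive Bayes model on rows of 0/1 features."""
    model = {}
    n_features = len(rows[0])
    for c in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == c]
        prior = len(members) / len(rows)
        # Laplace-smoothed estimate of P(feature_j = 1 | class c).
        probs = [(sum(r[j] for r in members) + alpha) / (len(members) + 2 * alpha)
                 for j in range(n_features)]
        model[c] = (prior, probs)
    return model

def predict_nb(model, row):
    """Return the class with the highest log posterior for one example."""
    best, best_score = None, -math.inf
    for c, (prior, probs) in model.items():
        score = math.log(prior)
        for x, p in zip(row, probs):
            score += math.log(p if x else 1.0 - p)
        if score > best_score:
            best, best_score = c, score
    return best

def leave_one_out_accuracy(rows, labels):
    """Train on all examples but one, test on the held-out one, and repeat."""
    hits = 0
    for i in range(len(rows)):
        model = train_nb(rows[:i] + rows[i + 1:], labels[:i] + labels[i + 1:])
        hits += predict_nb(model, rows[i]) == labels[i]
    return hits / len(rows)

# Hypothetical toy data: [affects principal isoform, triggers NMD].
toy_rows = [[1, 1], [1, 1], [1, 1], [0, 0], [0, 0], [0, 0]]
toy_labels = ["Pathogenic"] * 3 + ["non-Pathogenic"] * 3
accuracy = leave_one_out_accuracy(toy_rows, toy_labels)
```

Each example is predicted by a model that never saw it, which makes the most of a small training set at the cost of fitting the model once per example.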

Plotting NutVar2 features against class variable

The mean value of each feature was plotted for the two labels of the class variable (Pathogenic and non-Pathogenic). Error bars represent the standard deviation (STD) around the mean. Of note, the STD was assumed to be equal between the two classes.


[Figure 1: workflow diagram. User vcf file → Minimal representation → Variant effect (SnpEff, VEP or both; e.g. stop_gained) → Annotation of sequence features (molecular damage) and annotation of gene features (gene essentiality, innate immune gene), drawing on Pre-C tables and external files → Classifier → Biological readout: rank of truncating variants according to Pathogenicity Score]

Figure 1. Workflow of NutVar2. The variants from the user are converted to minimal representation and their effect is assessed using SnpEff, VEP or both. Next, sequence features of molecular damage are annotated using Pre-C tables. Gene based features assessing gene essentiality and affiliation to the innate immune response are added afterwards. Finally, the probability of the truncating variants being pathogenic is estimated by means of a classifier trained and tested for a known set of common and pathogenic truncating variants. Truncating variants are ranked according to their score of potential pathogenicity.

RESULTS

NutVar2 has been designed to work as a standalone tool to be downloaded and run on the user's own set of .vcf files. Figure 1 depicts an overview of the NutVar2 workflow: data from the user are first converted from the original vcf file to minimal representation [8], and the effect of each variant is determined using SnpEff [12], VEP [13] or both. Then, sequence features of molecular damage are annotated for the variant based on Pre-Calculated (Pre-C) tables built by the software. Afterwards, a layer of gene-based features assessing gene essentiality and membership of the innate immune response is added to the annotation. Finally, variants are passed to a classifier previously trained with a set of known pathogenic and common (Minor Allele Frequency, MAF > 1%, assumed to be non-Pathogenic) truncating variants that have been annotated with NutVar2. The result is a probability of the variant being pathogenic and a ranking of the truncating variants according to this score.

[Figure 2: diagram of the development stages. (1) TRAINING SET: ES Project, 1000 G Project, 10 GK Project, ClinVar. (2) VARIANT EFFECT: SnpEff, VEP. (3) TRANSCRIPT FEATURES (ENSEMBL, CCDS, APPRIS, Pervasive): % sequence affected; isoforms affected (constitutive or alternative); principal isoform?; pervasive isoform?; Nonsense-Mediated-Decay degradation (NMD + NMD down frameshift). (4) PROTEIN FEATURES (UniProt, InterPro): max percentage of a domain affected; functional sites. (5) ANNOTATION. (6) MACHINE LEARNING.]

Figure 2. Development of NutVar2. The training set (1) is constructed with a set of truncating variants classified as pathogenic (obtained from ClinVar) and truncating variants known to be common, obtained from large sequencing projects. The effect of the variant is assessed using SnpEff, VEP or both (2), and a series of features of molecular damage at the transcript level (3) or at the protein level (4) are annotated (5). For each pre-classified variant, features of molecular damage and features of gene essentiality [2, 3] are used to train a classifier (6) under a naive Bayes paradigm. The sensitivity and specificity of the classifier are assessed afterwards by ROC curves.

ASN.1 terms                     ClinVar and VCF
0 - unknown                     Uncertain significance
1 - untested                    not provided (includes the cases where data are not available or unknown)
2 - non-Pathogenic              Benign
3 - probable-non-Pathogenic     Likely benign
4 - probable-pathogenic         Likely pathogenic
5 - pathogenic                  Pathogenic
6 - drug-response               drug response
7 - histocompatibility          histocompatibility
255 - other                     other

Table 1. ClinVar code for the clinical significance of variants. Obtained from [11].

The development of NutVar2 can be divided into four main stages. First, assembling a large set of pathogenic and common truncating variants to train the classifier (steps 1 and 2 in Figure 2). Second, building a series of Pre-C tables that speed up the annotation process (steps 3 and 4 in Figure 2). Third, developing a series of scripts to annotate the set of truncating variants assembled in the first stage (step 5 in Figure 2). Finally, building the code for the classifier, and training and testing it with the data obtained from all the previous steps (step 6 in Figure 2).

We analyze these four stages in detail below.

1. Assembly of the training and test sets and variant effect prediction

Selection of non–Pathogenic and Pathogenic variants for the Training set

First, a file comprising 1,523,770 variants from large sequencing projects (ESP, 1000 Genomes, 10GK, in-house projects) was obtained from the project supervisor, Antonio Rausell. The file was in vcf format and the INFO field included allele counts for every variant, indicating the number of individuals observed to be homozygous (HOM), heterozygous (HET) and wild type (WT) (see Figure 3).

From this large pool of variants we wanted to select truncating variants that are common (Allele Frequency, AF, > 1%). We had to account for statistical uncertainty, as the size of the population sample used to calculate AF differed widely from variant to variant, and for genetic uncertainty, which arises from the underlying stochastic evolutionary process that gave rise to the population sampled, so we applied a hierarchical Bayesian model to estimate AF [9]. First, to account for statistical uncertainty, we assumed that alleles are sampled independently within populations and that the samples are drawn independently across loci and populations [9]. Second, to account for genetic uncertainty, we assumed a parametric form for the among-population allele frequency distribution. It is natural to assume that population allele frequencies follow a Beta distribution [9]. Using the Beta distribution (see Materials & Methods) we obtained 95% credible intervals for each variant.

Consequently, the AF of a given variant lies with 95% probability between the limits of the credible interval we calculated. Next, only variants for which the lower limit of the credible interval was greater than 0.01 (1%) were accepted as common. By means of this stringent criterion we eliminated variants with high statistical and genetic uncertainty.

[Figure 3: flowchart. ES Project, 1000 G Project, 10 GK Project: 1,523,770 variants → Assume a Beta distribution for the AF of each variant → Calculate upper and lower limits of the 95% Credible Intervals (Lower limit - Upper limit) → Lower limit > 0.01 → 70,042 variants: common variants (AF > 1%) assumed to be non-pathogenic]

Figure 3. Selection of non-Pathogenic variants from the pool of variants from large sequencing projects. An initial pool of 1,523,770 variants, along with their allele counts, was assembled from different sequencing projects. The AF of each variant was assumed to be Beta distributed. 95% Credible Intervals were calculated for each variant. Variants with the lower limit of the Credible Interval greater than 0.01 (1%) were accepted as common (AF > 1%) and therefore non-Pathogenic.

[Figure 4: flowchart. ClinVar: 63,060 variants → Select variants reported as 'likely Pathogenic' (4) or 'Pathogenic' (5) → Discard variants with conflicting reports (variants also classified as 'benign' (2) or 'likely benign' (3)) → Restrict to variants at least once reported as 'Pathogenic' → Pathogenic variants: 14,571 variants]

Figure 4. Selection of pathogenic variants from the ClinVar set. From the initial 63,060 variants in the pool, only those reported as Pathogenic or likely Pathogenic were selected. Variants showing conflicting reports were then removed from this subset. Finally, we restricted our analysis to variants classified at least once as Pathogenic. The final number of selected Pathogenic variants was 14,571.

A total of 70,042 variants were found to be common and were therefore assumed to be non-Pathogenic.

Next, a set of 63,060 variants was obtained from ClinVar [10]. ClinVar classifies the clinical significance of variants according to a numeric code (Table 1). Different submissions may assign different clinical significances to a given variant. Consequently, some variants exhibit conflicting reports: they are classified by some authors as Pathogenic and by others as Benign.

In order to discard these conflicting variants, we first selected from the original set the variants that have been classified at least once as Pathogenic (5) or likely Pathogenic (4) (Figure 4). From this subset, variants that have been classified at least once as Benign (2) or likely Benign (3) were excluded. Finally, we restricted ourselves to variants that have been reported at least once as Pathogenic. The final subset of ClinVar Pathogenic variants amounted to 14,571 variants.
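The three selection steps can be expressed compactly, as in the sketch below. It assumes that the clinical significance codes of all submissions for a variant have already been parsed into a list per variant; the variant identifiers and the function name are hypothetical.

```python
PATHOGENIC, LIKELY_PATHOGENIC = 5, 4
BENIGN, LIKELY_BENIGN = 2, 3

def select_pathogenic(submissions):
    """Apply the three filters to a dict of variant id -> list of ClinVar codes."""
    selected = []
    for variant_id, codes in submissions.items():
        codes = set(codes)
        # 1. Keep variants reported at least once as (likely) Pathogenic.
        if not codes & {PATHOGENIC, LIKELY_PATHOGENIC}:
            continue
        # 2. Discard variants with conflicting (likely) Benign reports.
        if codes & {BENIGN, LIKELY_BENIGN}:
            continue
        # 3. Require at least one report as Pathogenic.
        if PATHOGENIC not in codes:
            continue
        selected.append(variant_id)
    return selected

# Hypothetical submissions, using the codes of Table 1.
example = {
    "var_a": [5, 5],   # consistently Pathogenic: kept
    "var_b": [4],      # only likely Pathogenic: dropped at step 3
    "var_c": [5, 2],   # conflicting reports: dropped at step 2
    "var_d": [4, 5],   # kept
    "var_e": [0, 3],   # never (likely) Pathogenic: dropped at step 1
}
kept = select_pathogenic(example)
```

Steps 1 and 3 together are equivalent to requiring code 5, but they are kept separate here to mirror the order of the description.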

Conversion to minimal representation and variant effect prediction

Once the Pathogenic and Non-Pathogenic variants of the training set had been selected, they were converted from standard vcf format to minimal representation [8] (Figure 5). This conversion decomposes joint calls, eliminates padding nucleotides and modifies genomic coordinates if necessary. It also converts deletion alternative alleles annotated as <DEL> or < . > to their minimal representation form using the ENSEMBL API. This transformation is a convenient way to avoid errors in variant identification due to the addition of padding nucleotides in the joint calling of vcf files [8].

[Figure 5: flowchart of the conversion to minimal representation (MR) form. Original .vcf file, e.g. "1 1231299 rs19800 CGG CTT, AGG" → Undo joint calls: "1 1231299 rs19800 CGG CTT" + "1 1231299 rs19800 CGG AGG" → Eliminate padding nucleotides: "1 1231300 rs19800 GG TT" and "1 1231299 rs19800 C A" → Conversion of ALT alleles <DEL> or < . > to MR via the ENSEMBL API]

Figure 5. Modifications to reach minimal representation. Joint allele calls, if present in the original vcf file, are separated into simple calls (2), and padding nucleotides added to make joint calls coherent are eliminated (3). If any cryptic alternative allele (<DEL> or < . >) is present, it is converted to minimal representation via the ENSEMBL API.

Once the variants were in their minimal representation form, their effect was assessed using SnpEff [12] and/or VEP [13]. Offering two variant effect predictors is an improvement over the first release of NutVar, in which only SnpEff was used. SnpEff is much faster than VEP but less comprehensive. VEP is becoming a standard in the field of clinical genomics, so we wanted to include it in order to reach a wider audience of users. Besides, using both predictors on the same set of variants and restricting the results to variants predicted to have the same effect by both could provide a way to avoid false positives.

Comparison of SnpEff and Variant Effect Predictor (VEP)

                               SnpEff                               VEP
Version                        3.6                                  Release 75
Genome assembly                GRCh37.75                            GRCh37.75
Running time                   2 hours                              overnight, parallelized in 10 chunks
Run features                   Loss-of-Function (LOF and NMD        Pathogenicity probability; protein domain
                               prediction); NextProt information    information; SIFT and PolyPhen scores
Terminology of variant effect  Sequence Ontology (S.O.)             Sequence Ontology (S.O.)

Table 2. Characteristics of SnpEff and VEP runs on 1000G phase 1 set.

                                            Same in both   SnpEff exclusive   VEP exclusive
Total variants                              91.13%         1.77%              8%
Stop_gained                                 99%            0.15%              0.75%
Frameshift_variant                          77.32%         16.77%             5.91%
Splice_variants (acceptor, donor, region)   88.80%         8.77%              5.34%
Missense_variants                           >99%           --                 --
Synonymous_variants                         >99%           --                 --

Table 3. Results of SnpEff and VEP runs on 1000G phase 1 set. SnpEff and VEP agreed on the consequence of 91.13% of the variants in the set. Variants with a consequence reported exclusively by SnpEff or by VEP amounted to 1.77% and 8%, respectively. Agreement between both predictors was greater for synonymous and missense variants (>99%), while truncating variants showed in general a lower degree of agreement (99% for stop gains, 88% for splice variants and 77% for frameshifts).

Before deciding to use both of them in our pipeline, we ran a comparative analysis using the 1000 Genomes phase 1 vcf [14]. As shown in Table 2, SnpEff requires much less time to process the data than VEP (2 hours versus an overnight process parallelized in ten chunks). Both programs were run with the option to obtain protein information for the effect of the variant: NextProt in the case of SnpEff and ENSEMBL information in the case of VEP. In addition, VEP provides SIFT and PolyPhen scores to evaluate the impact of missense variants. The user can later mine all these pieces of information from the resulting files.

Importantly, we also obtained the output of the SnpEff loss-of-function predictor and of the VEP pathogenicity predictor. These two scores point in the same direction as NutVar2, that is, evaluating when the molecular damage is severe (loss-of-function, SnpEff) and providing a probability score of pathogenicity (VEP). The SnpEff prediction of loss-of-function is based on MacArthur et al [2] (Pablo Cingolani, personal communication [15]), while the basis for the VEP pathogenicity probability could not be ascertained. These scores can later be compared with NutVar2 results.

As shown in Table 3, the overall agreement between VEP and SnpEff in assessing the effect of a variant is 91.13%. For missense and synonymous variants both programs show greater agreement (>99%) than for truncating variants: 99%, 88% and 77% for stop gains, splice variants and frameshift variants, respectively.
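A comparison of this kind can be sketched as follows. The code assumes that each predictor's output has been reduced to one consequence term per variant, and it reads the columns of Table 3 as percentages over the variants assigned a given term by either predictor; both points are assumptions, and the variant keys are hypothetical.

```python
def term_agreement(snpeff, vep, term):
    """Percentage of variants given `term` by both predictors or by only one."""
    by_snpeff = {v for v, t in snpeff.items() if t == term}
    by_vep = {v for v, t in vep.items() if t == term}
    union = by_snpeff | by_vep
    if not union:
        return None
    n = len(union)
    return {
        "both": 100.0 * len(by_snpeff & by_vep) / n,
        "snpeff_only": 100.0 * len(by_snpeff - by_vep) / n,
        "vep_only": 100.0 * len(by_vep - by_snpeff) / n,
    }

def concordant_variants(snpeff, vep):
    """Keep only variants for which both predictors report the same term."""
    return {v: t for v, t in snpeff.items() if vep.get(v) == t}

# Hypothetical predictions keyed by variant.
snpeff_calls = {"v1": "stop_gained", "v2": "stop_gained",
                "v3": "frameshift_variant", "v4": "stop_gained"}
vep_calls = {"v1": "stop_gained", "v2": "stop_gained",
             "v3": "stop_gained", "v4": "missense_variant"}

stats = term_agreement(snpeff_calls, vep_calls, "stop_gained")
agreed = concordant_variants(snpeff_calls, vep_calls)
```

Restricting downstream analysis to `agreed` corresponds to the conservative option of keeping only variants on which both predictors concur.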

The Pathogenic and Non-Pathogenic variants of the training set were annotated separately using SnpEff and VEP. Consequently, the following stages in the development of NutVar2 were carried out separately, depending on whether SnpEff or VEP had been used to assess the effect of the variants.

2. Construction of Pre-Calculated tables (Pre-C tables) and obtention of files necessary to annotate features of molecular damage

For NutVar2 to run fast, a series of Pre-C tables and external files need to be built in advance. These Pre-C tables and files (listed in Table 5) contain transcript and protein annotations that allow NutVar2 to rapidly calculate the features of molecular damage for the Training Set obtained in the previous section. The construction of the Pre-C tables for transcript features and protein features was therefore the second main stage of NutVar2 development.

Transcript Features Pre-C tables and files

A major improvement in NutVar2 is the use of ENSEMBL as the source of transcript annotation, whereas NutVar1 was restricted to transcripts belonging to the smaller CCDS subset. From the whole set of human transcripts of the GRCh37.75 release, we restricted ourselves to protein coding transcripts (ENSTs) from protein coding genes (ENSGs) that lie on autosomes and sex chromosomes (ENSGs in "patch chromosomes" were discarded).

A series of Pre-C tables containing sequence characteristics of every transcript was first derived from the ENSEMBL gtf file (gtf_tabladef_sorted_by_SYMBOL.txt, gtf_output_ENSG.txt and gtf_output_ENST.txt, see Table 5). These Pre-C tables were instrumental in building the first main Pre-C table of transcript features: a table of intervals of genomic coordinates occupied by coding positions of all the ENSTs for every ENSG (Figure 6). This Pre-C table, called ENST_table_full_condensed.txt (see Table 5), is central to the annotation process explained later.
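One way to build such a table of intervals is sketched below. The actual layout of ENST_table_full_condensed.txt is not described here, so the structure used (disjoint genomic intervals, each listing the transcripts whose coding sequence covers it) is an assumption, as are the function names and the transcript identifiers.

```python
def condense_intervals(cds_by_transcript):
    """Build disjoint intervals annotated with the transcripts covering them.

    cds_by_transcript: dict of transcript id -> list of (start, end) coding
    intervals in genomic coordinates, both ends inclusive.
    """
    # Every interval start and every position after an interval end is a cut point.
    cuts = set()
    for intervals in cds_by_transcript.values():
        for start, end in intervals:
            cuts.add(start)
            cuts.add(end + 1)
    cuts = sorted(cuts)

    table = []
    for left, right in zip(cuts, cuts[1:]):
        seg_start, seg_end = left, right - 1
        covering = sorted(
            t for t, intervals in cds_by_transcript.items()
            if any(s <= seg_start and seg_end <= e for s, e in intervals)
        )
        if covering:  # skip intronic or intergenic gaps
            table.append((seg_start, seg_end, covering))
    return table

def transcripts_at(table, position):
    """Return the transcripts whose coding sequence includes a genomic position."""
    for start, end, transcripts in table:
        if start <= position <= end:
            return transcripts
    return []

# Hypothetical gene with two protein coding transcripts.
gene = {
    "ENST_A": [(100, 200), (300, 400)],
    "ENST_B": [(150, 200), (300, 350)],
}
table = condense_intervals(gene)
```

With a table of this form, the isoforms affected by a variant follow from a single lookup of its genomic coordinate, which is what makes pre-calculation worthwhile.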

In addition, a second major Pre-C table for the annotation process, a table of Nonsense-Mediated-Decay (NMD) window regions, was developed at this stage (Figure 6; called NMD_table, see Table 5). This Pre-C table allows NutVar2 to calculate swiftly whether or not a truncating variant leads to NMD and therefore to transcript downregulation.

[Figure 6: diagram linking the ENSEMBL, CCDS, APPRIS and Pervasive data sets to the three steps below.]

1. Create a pre-calculated table of intervals of genomic coordinates occupied by coding positions of all the protein coding transcripts for every gene in ENSEMBL.
2. Create a pre-calculated table of Nonsense-Mediated-Decay window regions, from the start codon to 50 bp (non-inclusive) from the last exon-exon junction.
3. Selection of the longest isoform, the principal isoform (APPRIS) and the isoform most expressed across different tissues (Pervasive isoform).

| Source | Version | Range and coverage |
|---|---|---|
| ENSEMBL | release 75 (GRCh37.75) | 20,314 protein coding genes in autosomal and sex chromosomes |
| ENSEMBL | release 75 (GRCh37.75) | 81,732 protein coding transcripts |
| APPRIS | Gencode 19 | 17,902 genes for which there is a Principal Isoform |
| Pervasive | GRCh37.75 | 5,227 genes for which there is a Pervasive isoform |

Figure 6. Transcript features annotated from the ENSEMBL set of protein coding transcripts. The set of protein coding transcripts of ENSEMBL is used to create two Pre-C tables. First, a table of intervals of genomic coordinates covering all the coding positions of every protein-coding transcript in ENSEMBL. Second, a table of Nonsense-Mediated-Decay (NMD) susceptible regions, covering all coding positions up to 50 bp (not included) from the last exon-exon junction. The APPRIS server is used to select the principal isoform of each gene, while the longest isoform is calculated from ENSEMBL data. To select the pervasive isoform, the isoform most expressed across different tissues, we use the table provided by Gonzalez-Porta et al [16]. The Consensus Coding Sequence (CCDS) set of transcripts is displayed because the software includes the option to limit the results to CCDS transcripts.
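To illustrate how an NMD window of this kind can be queried, the following minimal sketch works in CDS coordinates (0-based positions counted from the start codon). The input format (a list of per-exon coding lengths), the function names and the exact boundary convention are assumptions made for illustration; they do not reproduce the layout of NMD_table.txt, and junctions located in the 3' UTR are ignored here.

```python
from typing import List

NMD_DISTANCE = 50  # bp from the last exon-exon junction, as stated in the text


def nmd_window_end(cds_exon_lengths: List[int]) -> int:
    """Number of coding positions, counted from the start codon, inside the window.

    cds_exon_lengths holds the coding length of each exon in transcript order
    (hypothetical input format).
    """
    if len(cds_exon_lengths) < 2:
        return 0  # no exon-exon junction within the CDS: no window in this sketch
    last_junction = sum(cds_exon_lengths[:-1])  # coding bases before the last exon
    return max(0, last_junction - NMD_DISTANCE)


def triggers_nmd(stop_index: int, cds_exon_lengths: List[int]) -> bool:
    """True when a premature stop at 0-based CDS index stop_index lies in the window."""
    return 0 <= stop_index < nmd_window_end(cds_exon_lengths)
```

With coding exons of 100, 200 and 150 bp the last junction sits after coding base 300, so under these assumptions the window covers CDS indexes 0 to 249.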

Finally, two key files were obtained from external sources: the APPRIS selection of principal isoforms in Gencode 19 [6] (appris_principal_isoform in Table 5) and the table of pervasive isoforms (the isoforms most expressed across different tissues) obtained from Gonzalez-Porta et al [16]. These files are vital to ascertain the importance of an ENST in the context of its gene: truncating variants affecting the principal or pervasive isoform of a given gene have a greater potential to be pathogenic.

Protein Features Pre-C tables and files

By building a Pre-C table of protein functional sites and domains we aimed at quantifying the loss in overall protein features produced by a truncating variant. In order to do so, we mapped functional sites and protein domains to genomic coordinates.

We chose UniProt-SwissProt [17], which comprises manually curated annotations, as the basis for protein features, though we plan to add TrEMBL (automatic annotation without manual curation) in the near future.

The process of data extraction from the complete UniProt-SwissProt file is detailed in Figure 7 and Table 4. For each ENSEMBL gene, SwissProt annotates protein features for only one isoform, the “displayed isoform”, amounting to a total of 18,955 displayed isoforms. For each displayed isoform we extracted the UniProt identifier (AC number), the InterPro identifiers of its protein domains, and the protein coordinates and types of selected functional sites (Figure 7). InterPro identifiers were later used to extract domain protein coordinates from InterPro [18]. The types of functional sites selected are listed in Table 4. The selection contains small functional sites whose disruption may impair protein function because they are active catalytic sites, sites of post-translational modification, or small motifs needed to bind substrates and cofactors.

Domain Transformations

InterPro provides protein domain coordinates from different domain annotators (Pfam, G3DSA, etc.) for each combination of AC number (identifying the protein) and InterPro identifier (identifying the domain). As a result, the same protein domain can be defined by different boundaries depending on the domain annotator selected (Figure 8B). In addition, repetitions of the same domain can be found along the same polypeptide (Figure 8A). Consequently, to correctly assess the number and extent of domains lost after a truncation we needed to establish domain boundaries and to number domain repetitions (Figure 8). Domain repetitions obtained from the same domain annotator were numbered first (Figure 8A). Secondly, when different domain annotators displayed overlapping coordinates delimiting a protein domain, the outermost coordinates were used to define a “collapsed form” of the domain resulting from the predictions of all the domain annotators (Figure 8B). These two sequential transformations may take place for the same domain in the same polypeptide, as exemplified in Figure 8C for the IPR003961 domain of the human Tenascin-C protein.
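The two transformations can be sketched as an interval-merging problem. The snippet below is a simplified illustration and not the NutVar2 code: it merges overlapping hits of the same InterPro identifier first and numbers the resulting repetitions afterwards, which may differ from the order used by the software. The tuple layout and the `IPR..._n` labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (InterPro ID, annotator, start, end) in 1-based inclusive protein coordinates
Hit = Tuple[str, str, int, int]


def collapse_and_number(hits: List[Hit]) -> Dict[str, Tuple[int, int]]:
    """Collapse overlapping hits per InterPro ID and number the repetitions."""
    by_ipr = defaultdict(list)
    for ipr, _annotator, start, end in hits:
        by_ipr[ipr].append((start, end))

    result = {}
    for ipr, intervals in by_ipr.items():
        intervals.sort()
        merged = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= merged[-1][1]:
                # Overlap: keep the outermost coordinates (collapsed form).
                merged[-1][1] = max(merged[-1][1], end)
            else:
                # No overlap: a new repetition of the same domain.
                merged.append([start, end])
        for n, (start, end) in enumerate(merged, start=1):
            result[f"{ipr}_{n}"] = (start, end)
    return result
```

For example, two overlapping predictions at 10-90 and 5-95 plus a separate one at 120-200 for the same identifier would yield two numbered domains, 5-95 and 120-200.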

Mapping protein coordinates to genomic coordinates

Once the types and coordinates of the selected functional sites and the coordinates and IDs of the collapsed and numbered InterPro domains had been obtained, we proceeded to map them to genomic coordinates.

This process was divided into two steps. First, we selected the ENSEMBL transcript (ENST) whose peptide sequence matched that of the displayed isoform whose sites and domains we wanted to map. Second, we obtained the genomic coding positions of the resulting ENST (using the Pre-C tables carrying transcript features) and retrieved the positions covered by the protein features.

A)

[Figure 7A: diagram of the flow of information between ENSEMBL, UniProt-SwissProt and InterPro, from protein coordinates to genomic coordinates. Labels: UniProt AC number; InterPro domain IDs; domain coordinates; site info and coordinates. Example shown: Ser 27 (P) mapped to T 1:495112, T 1:495113 and C 1:495200, across an intron.]

B)

| Site UniProt code | Description |
|---|---|
| BINDING | Describes the interaction between a single amino acid and another chemical entity |
| METAL | Indicates at which position the protein binds a given metal ion. By definition each ‘Metal binding’ subsection refers to a single amino acid |
| NP_BIND | Describes a region in the protein which binds nucleotide phosphates. It always involves more than one amino acid and includes all residues involved in nucleotide-binding |
| DNA_BIND | Specifies the position and type of each DNA-binding domain present within the protein |
| CA_BIND | Specifies the position(s) of the calcium-binding region(s) within the protein. One common calcium-binding motif is the EF-hand, but other calcium-binding motifs also exist |
| MOD_RES | Position and type of each modified residue excluding lipids, glycans and protein cross-links. For example: phosphorylation, methylation, acetylation... |
| LIPID | Specifies the position(s) and the type of covalently attached lipid group(s) |
| CARBOHYD | Specifies the position and type of each covalently attached glycan group (mono-, di-, or polysaccharide) |
| NON_STD | Describes the occurrence of non-standard amino acids selenocysteine (Sec) or pyrrolysine (Pyl) in the protein sequence |
| ACT_SITE | Indicates the residues directly involved in catalysis, used for enzymes |

Figure 7 and Table 4. Scheme of the flow of protein information to genomic coordinates and types of functional sites selected. A) UniProt-SwissProt data describing the InterPro domain IDs present in the peptide and the type and protein coordinates of functional sites are extracted for each protein from the complete database files obtained from the web ftp server. The UniProt AC number and the InterPro IDs are used to retrieve domain protein coordinates based on InterPro information. Site and domain coordinates are then mapped to genomic coordinates as described below. B) Types of functional sites selected by the software; functional sites deemed ‘Potential’ are not used. Adapted from UniProt documentation [19].

[Figure 8: diagram. A) “Domains numbered”: three repetitions of IPR # A (PS51406) along the peptide, numbered IPR # A1, IPR # A2 and IPR # A3. B) “Domain collapsed”: IPR # B as delimited by PS51406, PF00147, SSF56496 and SM00186. C) Human Tenascin IPR003961, repetitions # 1 to # 10.]

Figure 8. Domain transformations undertaken by the software. Based on domain coordinates, the software performs the following transformations. First, it numbers the repetitions of the same domain along the peptide sequence (A). Second, when different domain predictors define the same InterPro domain, the outermost coordinates are used to delimit a “collapsed” domain resulting from all the overlapping predictions (B). C) shows an example of both transformations on the same InterPro domain (IPR003961 of the human Tenascin protein, UniProt AC P24821).

The process is detailed in Figure 9. Matching the peptide sequence of the displayed isoform against all the ENSEMBL peptide sequences of the same gene gave rise to three types of outcome. The displayed isoform matched at least one of the ENSEMBL peptide sequences exactly in 18,074 cases, whereas in 137 cases at least one of the ENSEMBL peptide sequences completely contained the displayed isoform and showed a longer N-terminus (Figure 9). For 712 displayed isoforms there was no match among the corresponding ENSEMBL peptides. This is due to differences in sequence composition between ENSEMBL and UniProt; for instance, there can be single-residue changes or intrasequence loops (Figure 9). These 712 unmatched isoforms were later aligned against their cognate ENSEMBL peptide sequences using BLAST.

To convert protein coordinates to genomic coordinates we first calculated the offset between the displayed isoform and the longer ENSEMBL peptide sequence in the 137 instances where the ENSEMBL peptide N-terminus was longer. A longer N-terminus in the ENSEMBL peptide sequence implies that the offset between both forms has to be added to the protein coordinates of every feature in the displayed isoform to allow its correct relocation in ENSEMBL (Figure 9). Once this had been done we carried on with the process, this time also including the 18,074 displayed isoforms with an exactly matching ENSEMBL peptide sequence. The protein coordinates of the domains and sites were converted to array indexes indicating the beginning of the feature and its length, accounting for the equivalence between one residue and a three-bp codon (Figure 9). Then, the coding genomic coordinates of the ENST giving rise to the matching ENSEMBL peptide sequence were retrieved from the Pre-C tables of transcript features obtained previously. Next, we used the indexes to extract the genomic coordinates occupied by the feature from the array of genomic coding positions of the ENST.
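The matching step can be sketched as follows. This is an illustrative reconstruction, not the original script: the function name and the dictionary of peptide sequences are hypothetical, and the N-terminal case is approximated by requiring the ENSEMBL peptide to end with the displayed isoform, which is an assumption about how "completely contained" was tested.

```python
from typing import Dict, Optional, Tuple


def find_correspondence(displayed: str,
                        ensembl_peptides: Dict[str, str]) -> Optional[Tuple[str, int]]:
    """Return (ENSEMBL id, N-terminal offset in residues), or None if unmatched."""
    # Outcome 1: exact correspondence, no offset needed.
    for pid, seq in ensembl_peptides.items():
        if seq == displayed:
            return pid, 0
    # Outcome 2: the ENSEMBL peptide has a longer N-terminus; the offset is the
    # number of extra residues that precede the displayed isoform.
    for pid, seq in ensembl_peptides.items():
        if len(seq) > len(displayed) and seq.endswith(displayed):
            return pid, len(seq) - len(displayed)
    # Outcome 3: no correspondence; such isoforms go to the BLAST realignment.
    return None
```

The returned offset is the value that would be added to every feature coordinate of the displayed isoform before the conversion to genomic coordinates.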

[Figure 9: flow diagram. The peptide sequence of the UniProt displayed isoform for a given gene is compared with the peptide sequences of all the ENSEMBL transcripts of that gene, taken from the ftp file with all the sequences of all the ENSEMBL transcripts (“Find Correspondence”). Outcomes: exact correspondence (18,074/18,955); ENSEMBL sequence is N-ter bigger (137/18,955, offset); no correspondence (712/18,955: residue changes, UniProt N-ter bigger, internal loops; re-align using BLAST). Output: feature with genomic coordinates, e.g. Ser 27 (P) at T 1:495112, T 1:495113 and C 1:495200, across an intron.]

• Apply the offset to the feature (domain or site) coordinates when the ENSEMBL sequence is N-ter bigger
• Transform the feature coordinates to 3-based array index references encoding the beginning of the feature and its length
• Retrieve the coding genomic positions of the transcript from ENSEMBL
• Select from the array of transcript coding positions the ones corresponding to the beginning and length of the feature

Figure 9. Mapping of protein features to genomic coordinates. The peptide sequence of the displayed isoform is retrieved from UniProt-SwissProt and matched against the peptide sequences of all the ENSEMBL proteins corresponding to the same gene. Out of 18,955 displayed isoforms, 18,074 have at least one exactly corresponding ENSEMBL peptide sequence, 137 have at least one corresponding ENSEMBL peptide with a longer N-terminus (denoted in the figure as “offset”) and 712 do not match any of the ENSEMBL peptide sequences of the same gene. Some reasons for this lack of matching are listed in the figure: single-residue differences along the sequences, internal loops, a longer UniProt N-terminus, etc. The 712 unmatched displayed isoforms and the corresponding ENSEMBL peptide sequences are aligned afterwards using BLAST (see Figure 10). For the remaining isoforms, the offset is first applied to the protein feature coordinates of the 137 displayed isoforms with an N-terminus offset. Then, for all of them, feature coordinates are converted to array indexes indicating feature start and length, multiplied by three to account for the amino acid-codon equivalence. Next, the coding genomic coordinates of the corresponding ENSEMBL peptide sequence are extracted from ENSEMBL. Finally, the array indexes denoting feature start and length are used to select the corresponding coding positions from the array of total coding positions.

For instance, a Serine amenable to phosphorylation at position 1 of a given displayed isoform was converted to two indexes, both scaled by three: the beginning of the feature ((1-1)*3 = 0) and its length ((1-1+1)*3 = 3). This means that, in the array of genomic coordinates of the ENST giving rise to the ENSEMBL peptide, the Serine starts at position 0 (the first coding position) and extends for two more positions, for a total of 3 bp. Should the Serine be at position 3 instead, the indexes (begin 6, length 3) would select the genomic coordinates occupying positions 6, 7 and 8 of the array of coding positions of the ENST, correctly locating the protein feature in genomic coordinates.
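The index arithmetic of this example can be written down directly. The sketch below assumes that the coding positions of the ENST are available as a flat list in transcript order (so introns simply appear as jumps between consecutive values); the function names are hypothetical.

```python
from typing import List, Tuple


def feature_to_indexes(start_aa: int, end_aa: int, offset_aa: int = 0) -> Tuple[int, int]:
    """Convert 1-based inclusive protein coordinates to (begin, length) array indexes."""
    begin = (start_aa + offset_aa - 1) * 3   # first coding base of the feature
    length = (end_aa - start_aa + 1) * 3     # three bases per residue
    return begin, length


def feature_genomic_positions(coding_positions: List[int], start_aa: int,
                              end_aa: int, offset_aa: int = 0) -> List[int]:
    """Select the genomic coordinates covered by the feature."""
    begin, length = feature_to_indexes(start_aa, end_aa, offset_aa)
    return coding_positions[begin:begin + length]
```

A Serine at position 1 gives (0, 3) and a Serine at position 3 gives (6, 3), as in the text. If the feature spans an exon boundary, the slice returns positions from both exons.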

For the 712 UniProt displayed isoforms without a matching ENSEMBL peptide sequence we aligned each one against the ENSEMBL peptide sequences of the same gene using BLAST. The results of these alignments were complex: there were mismatches due to different residue composition, displayed isoforms with a longer N-terminus or C-terminus, and even intrasequence loops both in displayed isoforms and in ENSEMBL peptide sequences (Figure 10). Furthermore, for some displayed isoforms these mismatches and loops appeared combined.

To overcome the difficulties posed by these alignments we pre-processed the protein feature coordinates of the displayed isoform in three stages. First, we selected as the ENSEMBL peptide sequence matching the displayed isoform the one with the highest BLAST bit score. The bit score, S’, is derived from the raw alignment score, S, taking the statistical properties of the scoring system into account. Because bit scores are normalized with respect to the scoring system, they can be used to compare alignment scores from different searches [20].
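For reference, the standard Karlin-Altschul normalization used by BLAST relates the two scores as shown below. The symbols λ and K, the statistical parameters of the scoring system, are not used elsewhere in this text and are introduced here only for the formula.

```latex
S' = \frac{\lambda S - \ln K}{\ln 2}
```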

Secondly, we excluded every protein domain or site lying entirely in regions present in the UniProt displayed isoform but absent from the selected ENSEMBL peptide sequence. For instance, features lying entirely in an N-terminal region of a displayed isoform that was absent from the selected ENSEMBL peptide sequence were discarded. Features lying partially in such regions (for example, protein domains) were adapted: their boundaries were adjusted to the regions present in the ENSEMBL peptide sequence (Figure 10).

Thirdly, as intrasequence loops introduce offsets that affect only the features C-terminal to them, we calculated a compound offset (resulting from the general offset and the loop offsets N-terminal to the feature) for every feature and applied it to its coordinates (Figure 10).
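One way to picture the second and third stages is to describe each alignment as a list of gap-free aligned blocks. This representation, and the function below, are assumptions made for illustration and not the data structure used by NutVar2. Within a block the compound offset is constant, so discarding, clipping and offsetting a feature reduce to intersecting it with the blocks.

```python
from typing import List, Optional, Tuple

# One gap-free aligned segment: (uniprot_start, ensembl_start, length), 1-based.
Block = Tuple[int, int, int]


def map_feature(start: int, end: int, blocks: List[Block]) -> Optional[Tuple[int, int]]:
    """Map a UniProt feature (1-based, inclusive) onto ENSEMBL peptide coordinates.

    Returns None when the feature lies entirely in regions absent from ENSEMBL.
    """
    mapped = []
    for u_start, e_start, length in blocks:
        u_end = u_start + length - 1
        lo, hi = max(start, u_start), min(end, u_end)
        if lo > hi:
            continue  # the feature does not touch this block
        offset = e_start - u_start  # compound offset valid inside this block
        mapped.append((lo + offset, hi + offset))
    if not mapped:
        return None  # feature discarded
    # Partially covered features are clipped to the aligned regions.
    return min(s for s, _ in mapped), max(e for _, e in mapped)
```

As a usage example, take a displayed isoform with 10 extra N-terminal residues and an ENSEMBL peptide carrying a 5-residue loop after its residue 40: the blocks are (11, 1, 40) and (51, 46, 50). A feature at 2-8 is discarded, a domain at 5-20 is clipped to ENSEMBL 1-10, and a site at 60-70 maps to 55-65.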

Finally, we proceeded to convert protein coordinates to genomic coordinates as described above.

[Figure 10: flow diagram. The peptide sequence of the UniProt displayed isoform for a given gene is compared with the peptide sequences of all the ENSEMBL transcripts of that gene (“Find Correspondence”). For the 712 of 18,955 displayed isoforms with no correspondence, realignment using BLAST shows residue mismatches, bigger UniProt N-terminus, bigger UniProt C-terminus and intrasequence loops, each introducing an offset. 685 displayed isoforms are rescued. Output: feature with genomic coordinates.]

• Discard features that lie entirely in regions not present in the ENSEMBL sequence. Adjust those that lie partially.
• Calculate a compound offset from all the partial offsets that lie N-terminal to the protein feature
• Apply the compound offset to the feature (domain or site) coordinates
• Transform the feature coordinates to 3-based array index references encoding the beginning of the feature and its length
• Retrieve the coding genomic positions of the transcript from ENSEMBL
• Select from the array of transcript coding positions the ones corresponding to the beginning and length of the feature

Figure 10. BLAST alignment of the 712 UniProt displayed isoforms that do not match any of the ENSEMBL peptide sequences of their gene. Each displayed isoform is aligned against all the ENSEMBL peptide sequences of its gene using BLAST. Alignments show residue mismatches, a longer N- or C-terminus in the displayed isoform, intrasequence loops, and combinations of some or all of them. For every alignment the best candidate ENSEMBL peptide is selected based on its bit score. For 685 of the initial 712 displayed isoforms a candidate ENSEMBL peptide is selected. To map their protein features (sites and domains) to genomic coordinates, features lying entirely in regions absent from the ENSEMBL peptide are first discarded, and features lying partially in them are adapted so that their boundaries lie within the limits of the ENSEMBL peptide sequence. Second, for every protein feature a compound offset resulting from all the partial offsets N-terminal to the feature is calculated and applied to the feature coordinates. Finally, we proceed as explained in Figure 9 to obtain the genomic coordinates of the feature.

Addition of gene based features to the variants

As shown in Figure 1, in addition to features of molecular damage NutVar2 requires a series of gene based features. These provide two lines of information: the essentiality of the gene and its membership of the Innate Immune Response set. All of these gene features were obtained from external files (Table 6).

Gene essentiality prioritizes genes: genes with a low tolerance to functional variation [3], or rarely observed to carry loss-of-function variants [2], are presumed to be essential genes under active purifying selection. Therefore, functional variants appearing in these genes have a greater potential to be pathogenic.

Membership of the Innate Immune Response set is added as a gene feature because these genes are thought to undergo positive selection, so the amount of functional variation is expected to be higher than that observed under neutral evolution [4, 21]. Therefore, functional variants appearing in these genes may have a lower potential to be pathogenic.

Finally, a summary of all the Pre-C tables and files described in this section is displayed in Table 5 (Pre-C tables and external files carrying sequence based features) and Table 6 (external files carrying gene based features).

3. Annotation of gene and sequence based features

Once stages 1 and 2 of NutVar2 development were completed, that is, once the training set had been divided into Pathogenic and Non-Pathogenic variants and the Pre-C tables and files needed for a swift annotation of sequence and gene features had been built, the annotation phase started.

We developed a series of scripts to calculate the features displayed in Table 7 for all the variants of the training set. These features can be broadly divided into five categories. The first category comprises informative features (first 7 rows of Table 7) that describe the overall effect of the variant on the gene. As the effect of a variant depends on the transcript(s) it affects, a variant can lead to truncations in certain isoforms and to other effects in other isoforms of the same gene. That information can be retrieved from this set of informative features.

Secondly, transcript features of molecular damage (rows 8 to 14), which have been explained before. Of note, for frameshift variants NutVar2 calculates the onset of derived stop-gains in the new reading frame elicited by the variant. Should they exist, NutVar2 calculates whether they entail NMD. The result is the feature “ratioAffectedIsoformsTargetedby_derived_NMD”.
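A derived stop-gain can be located by scanning the mutated coding sequence codon by codon from the codon that contains the variant. The following is a minimal sketch under the assumption that the mutated CDS (with the insertion or deletion already applied) is available as a string; the function name and its inputs are hypothetical and do not describe the NutVar2 scripts.

```python
from typing import Optional

STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_derived_stop(mutated_cds: str, variant_index: int) -> Optional[int]:
    """Return the 0-based CDS index of the first stop codon found at or after
    the codon containing variant_index, or None if the new frame has no stop.

    Codons are read in triplets from the start codon, so everything downstream
    of the frameshift is read in the new reading frame.
    """
    codon_start = (variant_index // 3) * 3
    for i in range(codon_start, len(mutated_cds) - 2, 3):
        if mutated_cds[i:i + 3].upper() in STOP_CODONS:
            return i
    return None
```

The returned index could then be tested against the NMD window of the transcript to decide whether the derived stop-gain entails NMD.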

Thirdly, protein features of molecular damage (rows 15 to 24), which have been explained before. These features are intended to measure protein damage globally (percentage of total domain and site positions affected by the truncation) and functionally (maximum percentage of a domain or site affected, and number of domains and sites completely affected by the truncation). Whether the variant lies within the boundaries of a domain or site (the “domain matched” and “site matched” features) is of special importance in the case of sites, where partial disruption may completely abrogate function.

Fourthly, the class variable, Pathogenic or Non-Pathogenic (rows 25 and 26). The Pathogenicity Tag counts the number of times a variant has been annotated as Pathogenic in ClinVar (as explained in the first section of the results, conflicting variants have been left out). Only variants with a Pathogenicity Tag equal to or greater than 1 are accepted as Pathogenic in the training of the classifier. The Credible Tag displays the result of comparing the limits of the 95% credible interval with the threshold of 0.01 (1%) to ascertain whether a variant is rare (highCI <= 0.01), unknown (lowCI < 0.01 and highCI > 0.01) or common (lowCI > 0.01). Only common variants (Credible Tag = 1) are assumed to be ‘Non-Pathogenic’ and used to train the classifier.

Finally, the fifth subgroup of features comprises gene-based features (rows 27 to 33, see Table 6).

The resulting data matrix (summarized in Table 8) was the substrate for training the classifier.

| Name of File | Description | Main use | Origin |
|---|---|---|---|
| gtf_tabladef_sorted_by_SYMBOL.txt | Pre-C table of intervals of genomic coordinates listing all the protein coding ENSTs and the character of the interval of coordinates (UTR_5_prime, START, CDS, STOP and UTR_3_prime). Includes equivalencies ENST-CCDS (typed “NaNein” when absent) | Source of equivalence ENST-CCDS | Pre-C from ENSEMBL gtf file. See *FigureXX* |
| gtf_output_ENSG.txt | Pre-C table showing the equivalence of protein coding ENSGs that lie in autosomal or sex chromosomes (assembly patches are discarded) to HGNC symbols | Source of equivalence ENSG-HGNC symbol | Pre-C from ENSEMBL gtf file. See *FigureXX* |
| gtf_output_ENST.txt | Pre-C table of all the protein coding ENSTs belonging to every ENSG present in gtf_output_ENSG.txt | Source of all protein coding ENSTs | Pre-C from ENSEMBL gtf file. See *FigureXX* |
| NMD_table.txt | Pre-C table of Nonsense-Mediated-Decay (NMD) window regions, from the start codon to 50 bp (non-inclusive) from the last exon-exon junction | Calculate NMD from variant coordinate | Pre-C from ENSEMBL gtf file. See *FigureXX* |
| ENST_table_full_condensed.txt | Pre-C table of intervals of genomic coordinates occupied by coding positions of all the protein coding transcripts for every ENSG | Locate variant coordinate within ENST context | Pre-C from ENSEMBL gtf file. See *FigureXX* |
| Pervasive.txt | Table indicating the pervasive isoform for every ENSG | Select Pervasive isoform | Gonzalez-Porta et al* |
| appris_principal_isoform.txt | Table indicating the APPRIS Principal isoform for every ENSG | Select Principal isoform | Appris server |
| ALL_ISOFORMS_PROTEIN_table_full.txt | Pre-C table of intervals of genomic coordinates occupied by every InterPro domain and UniProt functional site, previously transformed (numbered and collapsed) | Calculate protein features affected by a truncation | Pre-C from ENSEMBL, UniProt and InterPro |
| ALL_ISOFORMS_DOMAIN_table_full.txt | Pre-C table of intervals of genomic coordinates occupied by all InterPro domains, labelled as DOMAIN, and UniProt functional sites, labelled as SITE, previously transformed (numbered and collapsed) | Calculate the total amount of DOMAIN and SITE positions affected | Pre-C from ENSEMBL, UniProt and InterPro |

Table 5. Files and Pre-C tables needed to run the analysis. ENSG = ENSEMBL Gene, ENST = ENSEMBL Transcript, HGNC = HUGO Gene Nomenclature Committee, CCDS = Consensus Coding Sequence, UTR_5_prime = 5 prime UTR, START = start codon, CDS = coding sequence, STOP = stop codon, UTR_3_prime = 3 prime UTR. *Gonzalez-Porta et al [16], Appris server [6].

| Name of File | Description | Main use | Origin |
|---|---|---|---|
| pRDG2.txt | Probability of recessive disease causation as provided by MacArthur et al | Assess gene essentiality | MacArthur et al |
| RVIS2.txt | Intolerance scoring system that assesses whether genes have relatively more or less functional genetic variation than expected based on the apparently neutral variation found in the gene | Assess gene essentiality | Petrovski et al |
| Genes_AllInnateImmunity.txt | Genes involved in Innate Immunity | Selection of genes involved in Innate Response | Lab Telenti (in-house) |
| Genes_Antiviral.txt | Genes involved in antiviral response | Selection of genes involved in Innate Response | Lab Telenti (in-house) |
| Genes_ISGs.txt | Genes involved in Innate Immunity | Selection of genes involved in Innate Response | Lab Telenti (in-house) |
| Genes_OMIMrecessive.txt | Genes known to cause disease only through autosomal recessive inheritance | Assess gene essentiality | Lab Telenti (in-house) |

Table 6. Files carrying gene based features. MacArthur et al [2], Petrovski et al [3].

| # | Feature | Description |
|---|---|---|
| 1 | NumIsoformsInQueryGene | Number of total isoforms in the gene |
| 2 | ratioIsoformsBearingTheVariant | Ratio of isoforms bearing the variant from all the isoforms of the gene. Ranges from 0 to 1 |
| 3 | ratioAffectedIsoforms_stop-gained | Ratio of isoforms bearing the variant that give rise to a stop gained from all the isoforms of the gene. Ranges from 0 to 1 |
| 4 | ratioAffectedIsoforms_frameshift | Ratio of isoforms bearing the variant that give rise to a frameshift from all the isoforms of the gene. Ranges from 0 to 1 |
| 5 | ratioAffectedIsoforms_splice | Ratio of isoforms bearing the variant that give rise to a splice donor or splice acceptor abrogation from all the isoforms of the gene. Ranges from 0 to 1 |
| 6 | ratioAffectedIsoforms_coding-synonymous | Ratio of isoforms bearing the variant that give rise to a synonymous change from all the isoforms of the gene. Ranges from 0 to 1 |
| 7 | ratioAffectedIsoforms_missense | Ratio of isoforms bearing the variant that give rise to a missense mutation from all the isoforms of the gene. Ranges from 0 to 1 |
| 8 | ratioAffectedIsoformsTargetedbyNMD | Ratio of isoforms bearing the variant that are targeted by Nonsense-Mediated-Decay. Meaningful only for stop-gains (where it ranges from 0 to 1); for the rest it is set to “NaN” |
| 9 | ratioAffectedIsoformsTargetedby_derived_NMD | Ratio of isoforms bearing the variant for which a downstream stop-gain is created and which are targeted by Nonsense-Mediated-Decay. Meaningful only for frameshifts (where it ranges from 0 to 1); for the rest it is set to “NaN” |
| 10 | IsPrincipalIsoformAffected | Indicates whether the Principal Isoform predicted by Appris is affected by the variant. 0 -> not affected, 1 -> affected, NaN -> no principal isoform predicted for the gene |
| 11 | IsWithinLongestCCDS | Indicates whether the variant affects the longest isoform of the gene. 1 -> affected, 0 -> not affected |
| 12 | IsWithinPervasiveIsoform | Indicates whether the variant affects the pervasive isoform of the gene. 1 -> affected, 0 -> not affected, NaN -> no pervasive isoform for the gene |
| 13 | LongestCCDSLength | Length of the longest transcript affected |
| 14 | PercentagePrincipalOrLongestCCDSAffected | Percentage of the longest transcript that lies 3’ downstream of the coordinate of the variant |
| 15 | DomainINFOAvailable | Indicates whether the longest transcript has associated InterPro domain info. 0 -> no domain info, 1 -> domain info present |
| 16 | PercentageOfDomainPositionsAffected | Percentage of the total positions with associated InterPro domain info that lie 3’ downstream of the coordinate of the variant. Ranges from 0 to 100 unless DomainINFOAvailable equals 0, when it is set to NaN |
| 17 | maxPercDomainAffected | Maximum percentage of a domain potentially affected by the mutation (mapping 3’ downstream of the variant). Ranges from 0 to 100 unless DomainINFOAvailable equals 0, when it is set to NaN |
| 18 | NumberOfDomains100Damage | Number of complete domains that lie 3’ downstream of the variant and are potentially affected by it. Integer unless DomainINFOAvailable equals 0, when it is set to NaN |
| 19 | DomainMatched | Is the variant within the boundaries of an InterPro domain? YES = 1; NO = 0 unless DomainINFOAvailable equals 0, when it is set to NaN |
| 20 | SiteINFOAvailable | Indicates whether the longest transcript has associated SwissProt site info. 0 -> no site info, 1 -> site info present |
| 21 | PercentageOfSitePositionsAffected | Percentage of the total positions with associated SwissProt site info that lie 3’ downstream of the coordinate of the variant. Ranges from 0 to 100 unless SiteINFOAvailable equals 0, when it is set to NaN |
| 22 | maxPercSiteAffected | Maximum percentage of a site potentially affected by the mutation (mapping 3’ downstream of the variant). Ranges from 0 to 100 unless SiteINFOAvailable equals 0, when it is set to NaN |
| 23 | NumberOfSites100Damage | Number of complete sites that lie 3’ downstream of the variant and are potentially affected by it. Integer unless SiteINFOAvailable equals 0, when it is set to NaN |
| 24 | SiteMatched | Is the variant within the boundaries of a site? YES = 1; NO = 0 unless SiteINFOAvailable equals 0, when it is set to NaN |
| 25 | Pathogenicity_Tag | Number of times a variant has been reported as Pathogenic in ClinVar |
| 26 | Credible_Tag(-1,rare, 0,not_credible,1,common) | Tag indicating the result of comparing the limits of the 95% credible interval obtained from the Beta distribution with the threshold of 0.01. -1 = rare (highCI <= 0.01), 0 = unknown (lowCI < 0.01, highCI > 0.01), 1 = common (lowCI > 0.01) |
| 27 | Location_Tag | Internal annotation reference |
| 28 | IsInnateImmunity | Gene based feature. See Table 6 |
| 29 | IsAntiviral | Gene based feature. See Table 6 |
| 30 | IsISG | Gene based feature. See Table 6 |
| 31 | IsOMIMrecessive | Gene based feature. See Table 6 |
| 32 | pRDG_score | Gene based feature. See Table 6 |
| 33 | RVIS_score | Gene based feature. See Table 6 |

Table 7. Features calculated by NutVar2 for each variant. lowCI = lower limit of the credible interval, highCI = higher limit of the credible interval.
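As an illustration of how the domain features of Table 7 (rows 15 to 19) relate to one another, the sketch below computes them from domain intervals expressed as (begin, length) indexes on the array of coding positions of a transcript. The function name and input layout are hypothetical, and two conventions are assumptions rather than documented NutVar2 behaviour: the variant position itself is counted as affected, and overlapping domains are counted once in the global percentage.

```python
from typing import Dict, Tuple

NAN = float("nan")


def domain_damage_features(domains: Dict[str, Tuple[int, int]],
                           variant_index: int) -> Dict[str, float]:
    """domains maps a domain label to (begin, length) on the coding array."""
    if not domains:
        return {"DomainINFOAvailable": 0,
                "PercentageOfDomainPositionsAffected": NAN,
                "maxPercDomainAffected": NAN,
                "NumberOfDomains100Damage": NAN,
                "DomainMatched": NAN}
    annotated = set()   # union of all coding indexes covered by a domain
    max_perc = 0.0
    fully_lost = 0
    matched = 0
    for begin, length in domains.values():
        end = begin + length  # exclusive
        annotated.update(range(begin, end))
        lost = max(0, end - max(begin, variant_index))  # positions 3' of the variant
        max_perc = max(max_perc, 100.0 * lost / length)
        if lost == length:
            fully_lost += 1
        if begin <= variant_index < end:
            matched = 1
    lost_total = sum(1 for p in annotated if p >= variant_index)
    return {"DomainINFOAvailable": 1,
            "PercentageOfDomainPositionsAffected": 100.0 * lost_total / len(annotated),
            "maxPercDomainAffected": max_perc,
            "NumberOfDomains100Damage": fully_lost,
            "DomainMatched": matched}
```

The same calculation would apply to sites by swapping the input intervals and the feature names.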

4. Machine Learning

Distribution of feature values among the Pathogenic and non-Pathogenic classes

Figure 11 depicts the mean of every annotated feature for the two possible values of the class variable, Pathogenic and non-Pathogenic. It is divided into stop-gains (A), frameshifts (B) and splice donor/acceptor disruptions (C).

As expected, in all three types of truncation, Pathogenic variants lie more often in genes with longer isoforms than in genes with shorter isoforms (feature “LongestCCDSLength”). When a truncation occurs in a long isoform, which is prone to contain multiple domains, it is more likely to entail severe functional consequences than in a shorter isoform with fewer domains.

| Variant | Pathogenic / Non Pathogenic | At least Domain INFO | At least Site INFO | Target of NMD |
|---|---|---|---|---|
| Stop gains (initial 26,070) | 2,848 Pathogenic | 2,550 | 1,434 | 2,393 |
| Stop gains (initial 26,070) | 407 Non Pathogenic | 319 | 121 | 249 |
| Frameshifts (initial 31,145) | 2,222 Pathogenic | 1,783 | 1,100 | 1,016 |
| Frameshifts (initial 31,145) | 2,891 Non Pathogenic | 2,125 | 942 | 803 |
| Splice acceptor/donor abrogations (initial 1,211) | 577 Pathogenic | 522 | 285 | 529 |
| Splice acceptor/donor abrogations (initial 1,211) | 29 Non Pathogenic | 27 | 5 | 22 |

Table 8. Statistics of the truncating variants used to train the classifier. From the initial large pools, the stringent criteria leave much reduced subsets of Pathogenic and Non Pathogenic variants. To display the coverage of the features, we show the number of variants having at least domain info (column 3) or site info (column 4) and the number eliciting NMD in at least one transcript (column 5; in the case of frameshift variants the number refers to the feature “ratioAffectedIsoformsTargetedby_derived_NMD”).

In the case of stop-gains (figure 11 A), the mean values of most features are greater in Pathogenic variants. This is the case for NMD, principal isoform affected, longest CCDS affected, percentage of sequence affected, percentage of domain positions affected, maximum percentage of a domain affected, number of sites with 100% damage and site matched. That Pathogenic variants score higher than non-Pathogenic ones for these features is consistent with our a priori expectations and was also observed in NutVar1 [4]. As for the remaining features, derived NMD does not apply to stop-gains and is therefore set to 0. “Pervasive isoform” seems to be targeted more often in non-Pathogenic variants, but this might be due to its lack of comprehensiveness (Pervasive information is available for only 5,227 genes). The features number of domains with 100% damage, domain matched, percentage of site positions affected and maximum percentage of a site affected associate with non-pathogenicity, contrary to what we would have expected.

In the case of frameshifts (figure 11 B), higher means in the features involving NMD (NMD and derived NMD) and in those reflecting the importance of the targeted isoform (principal isoform, longest isoform and Pervasive isoform) associate with the Pathogenic label. NMD applies to frameshifts and splice truncations because in some instances these truncations are also classified as stop-gains. However, all the features reflecting the percentage of sequence affected and the damage in protein domains (except the domain matched feature) behave inversely to how they behave in stop-gains: the mean is greater in non-Pathogenic variants. This behaviour is also observed for the percentage of site positions affected and the maximum percentage of a domain affected.

In the case of splice disruptions (figure 11 C), the means of the features behave more like those of stop-gains, but the small number of common splice disruptions analyzed (29) might be distorting our results.

Evaluation of the classifier performance

NutVar2 was built using the Naïve Bayes classification paradigm. Its performance was evaluated by constructing ROC curves for the held-out test subsets of the training data (leave-one-out cross-validation). The ROC curves are displayed in figure 12.

The results show the combination of sequence-based features with gene-based features from MacArthur et al. (panels A, B and C) or RVIS (panels D, E and F).
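The evaluation scheme described above (a Naïve Bayes classifier scored by leave-one-out cross-validation and summarized as the area under the ROC curve, plus the randomization control denoted SB(r) in figure 12) can be sketched as follows. This is an illustrative sketch and not the NutVar2 implementation: it assumes Gaussian feature likelihoods with a variance shared by both classes, it interprets the column-wise shuffle as an independent permutation of each feature column across variants, and the toy feature matrix is hypothetical.

```python
import math
import random

def fit_gaussian_nb(rows, labels):
    """Per-class feature means, a pooled per-feature variance and class priors."""
    n_features = len(rows[0])
    means, priors = {}, {}
    for c in sorted(set(labels)):
        members = [r for r, lab in zip(rows, labels) if lab == c]
        priors[c] = len(members) / len(rows)
        means[c] = [sum(r[j] for r in members) / len(members) for j in range(n_features)]
    pooled = []
    for j in range(n_features):
        ss = sum((r[j] - means[lab][j]) ** 2 for r, lab in zip(rows, labels))
        pooled.append(max(ss / len(rows), 1e-9))  # floor avoids a zero variance
    return means, pooled, priors

def log_odds(model, row, positive=1, negative=0):
    """Log posterior odds of the positive class under feature independence."""
    means, pooled, priors = model
    score = math.log(priors[positive]) - math.log(priors[negative])
    for j, x in enumerate(row):
        score += ((x - means[negative][j]) ** 2 - (x - means[positive][j]) ** 2) / (2 * pooled[j])
    return score

def leave_one_out_scores(rows, labels):
    """Score each variant with a model trained on all the other variants."""
    scores = []
    for i in range(len(rows)):
        model = fit_gaussian_nb(rows[:i] + rows[i + 1:], labels[:i] + labels[i + 1:])
        scores.append(log_odds(model, rows[i]))
    return scores

def auc(scores, labels, positive=1):
    """Area under the ROC curve as the fraction of correctly ordered pos/neg pairs."""
    pos = [s for s, lab in zip(scores, labels) if lab == positive]
    neg = [s for s, lab in zip(scores, labels) if lab != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def shuffle_columns(rows, seed=0):
    """Randomization control: permute every feature column independently."""
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)
    return [list(r) for r in zip(*columns)]

# Hypothetical toy data: two features, 1 = Pathogenic, 0 = Non-Pathogenic.
toy_rows = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.6], [0.95, 0.7],
            [0.2, 0.3], [0.1, 0.4], [0.3, 0.2], [0.25, 0.1]]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
print("AUC, real features:", auc(leave_one_out_scores(toy_rows, toy_labels), toy_labels))
print("AUC, shuffled features:",
      auc(leave_one_out_scores(shuffle_columns(toy_rows), toy_labels), toy_labels))
```

If the features carry signal, the AUC obtained from the shuffled matrix should fall towards 0.5, which is the behaviour reported for SB(r) in figure 12.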

For stop-gains, the classifier using sequence-based features alone showed an accuracy, measured as the area under the ROC curve (AUC), of 68% (figure 12 A and D, SB), while its combination with gene-based features raised the accuracy to 80% and 76% (figure 12 A and D, SB+GB, with MacArthur et al. and RVIS respectively). These results show no increase in overall performance for stop-gains with respect to the first release of NutVar [4].

For frameshifts, the accuracy was 77% and 84% for sequence-based features and for the combination of sequence- and gene-based features, respectively (figure 12 B and E; SB, and SB+GB with MacArthur et al. and RVIS respectively). This represents an 11% increase in the Area Under the Curve (AUC) with respect to the first release of NutVar [4].

Splice acceptor/donor disruptions show an accuracy of 91% and 88% for sequence-based features and for the combination of sequence- and gene-based features, respectively (figure 12 C and F; SB, and SB+GB with MacArthur et al. and RVIS respectively). These variants were not analyzed in the first release of NutVar [4].

[Figure 11: bar plots of the class-specific mean (y axis) of each sequence-based feature (x axis) for Pathogenic and Non Pathogenic variants, in panels A), B) and C). Features on the x axis: ratioAffectedIsoformsTargetedbyNMD, ratioAffectedIsoformsTargetedby_derived_NMD, IsPrincipalIsoformAffected, IsWithinLongestCCDS, IsWithinPervasiveIsoform, LongestCCDSLength, PercentagePrincipalOrLongestCCDSAffected, PercentageOfDomainPositionsAffected, maxPercDomainAffected, NumberOfDomains100Damage, DomainMatched, PercentageOfSitePositionsAffected, maxPercSiteAffected, NumberOfSites100Damage, SiteMatched.]

Figure 11. Mean values of sequence-based features in Pathogenic and non-Pathogenic variants. The fifteen sequence-based features were plotted against the values of the class variable (Pathogenic and ‘Non-Pathogenic’) for all the truncations in the training set. A) Stop-gains, B) Frameshifts and C) Splice donor/acceptor disruptions. The mean value of every feature in the Pathogenic and ‘Non-Pathogenic’ series is shown. Error bars depict the standard deviation of the mean; note, however, that the standard deviation is assumed to be equal between the two values of the class variable.

[Figure 12: ROC curves, True Positive Rate TP/(TP+FN) (y axis) against False Positive Rate FP/(TN+FP) (x axis), in panels A) to F). AUC values legible in the panel legends, in the order in which the panels appear: A) SB 0.68, SB(r) 0.50, GB 0.78, GB+SB 0.80, GB+SB(r) 0.78; B) GB 0.81, GB+SB 0.84, GB+SB(r) 0.81; C) GB 0.71, GB+SB 0.88, GB+SB(r) 0.72; D) RV 0.73, RV+SB 0.76, RV+SB(r) 0.73; E) RV 0.55, RV+SB 0.67, RV+SB(r) 0.55; F) RV 0.71, RV+SB 0.87, RV+SB(r) 0.69.]

Figure 12. ROC curves evaluating the performance of NutVar2. A) Stop-gains, B) Frameshifts and C) Splice disruptions using the gene-based features of MacArthur et al. D) Stop-gains, E) Frameshifts and F) Splice disruptions using the RVIS score. SB = Sequence Based, GB = Gene Based. Dashed curves correspond to a randomization test in which the rows of the sequence features are shuffled column-wise (denoted by SB(r)).

DISCUSSION

NutVar2 versus implementations in SnpEff and VEP

Currently, large sequencing initiatives report that stop-gains, frameshifts and splice donor/acceptor disruptions that give rise to truncations are collectively prevalent in humans. The need to prioritize the study of these truncations according to their possible impact on human health is growing with the influx of genomic Big Data. NutVar2 provides a ranking of the severity of truncations that relies on sequence-based features, which can be combined with gene-based features to produce an overall score. Other software has been developed with the same objective. Two of the most widely used predictors of variant effects, SnpEff [12] and VEP [13], currently offer as a feature the prediction of the loss of function and of the pathogenicity of a variant, respectively. As mentioned earlier, the prediction of the loss of function (LoF) of a variant in SnpEff is based on MacArthur et al. [2, 15]. Therefore, the main criterion for the prediction of LoF in SnpEff is the amount of sequence affected by the truncation: variants affecting the last 5% of the target isoform are discarded [15], and the rest are labelled as LoF positive. Given the different criteria involved in the NutVar2 score (importance of the isoform targeted, amount of sequence truncated, protein domains and functional sites lost), it is a much more refined score in terms of the subtleties that LoF may depend upon.

Although it was not possible to ascertain the basis of the VEP prediction of pathogenicity, it must rely heavily on pre-existing databases of variants with known pathogenic consequences, such as OMIM and ClinVar [11]. VEP predictions for the training set of NutVar2 display 99% and 91% agreement with the ClinVar classification for Pathogenic and ‘Likely-Pathogenic’ variants respectively (n=15,370 and n=3,192, respectively). Furthermore, this prediction showed a lack of comprehensiveness, as most of the truncations in the training set without a ClinVar classification also lacked a VEP prediction (n=154,919, of which only 267 were classified in ClinVar). In summary, in our hands the VEP predictor of pathogenicity largely reflects pre-existing data.

To further demonstrate the advantages of NutVar2 over the SnpEff and VEP implementations, and as part of a practical exercise for a job offer, a series of Roche 454 human DNA reads were trimmed, mapped to GRCh37.75 and subjected to VEP and SnpEff, and the potential pathogenicity of the resulting variants was evaluated with NutVar2. The reads corresponded to frameshift variants of the BRCA1 and BRCA2 genes, which are implicated in breast cancer.

POS | REF>ALT | GENE | ClinVar CLNSIG | SnpEff LoF | VEP Pathog. | NutVar Rank Per. (SB) | NutVar Rank Per. (SB + GB)
13:32906565 | CA>C | BRCA2 | --- | True(50%) | --- | 55.10 | 96.39
13:32906576 | CA>C | BRCA2 | --- | True(50%) | --- | 55.09 | 96.39
13:32906602 | GA>G | BRCA2 | Pathog. | True(50%) | --- | 55.03 | 96.37
13:32907202 | TA>T | BRCA2 | --- | True(50%) | --- | 52.83 | 96.12
13:32911357 | CA>C | BRCA2 | --- | True(33%) | --- | 48.63 | 95.79
13:32911442 | GA>G | BRCA2 | --- | True(33%) | --- | 48.36 | 95.74
13:32912345 | GA>G | BRCA2 | Pathog. | True(33%) | --- | 45.27 | 95.43
13:32912655 | CT>C | BRCA2 | Pathog. | True(33%) | --- | 44.25 | 95.35
13:32913079 | GA>G | BRCA2 | --- | --- | Pathog. | 42.75 | 95.27
13:32913422 | GA>G | BRCA2 | Pathog. | True(33%) | --- | 41.64 | 95.21
13:32913836 | CA>C | BRCA2 | Pathog. | True(33%) | --- | 40.18 | 95.04
13:32953632 | CA>C | BRCA2 | --- | True(33%) | --- | 28.79 | 93.66
13:32953640 | GA>G | BRCA2 | --- | --- | --- | 28.76 | 93.66
13:32954022 | CA>C | BRCA2 | Pathog. | True(33%) | Pathog. | 28.44 | 93.60
13:32972589 | GA>G | BRCA2 | Likely Pathogenic | --- | --- | 5.53 | 78.05
17:41245118 | G>GGT | BRCA1 | --- | True(19%) | --- | 95.11 | 99.93
17:41245586 | CT>C | BRCA1 | Pathog. | True(19%) | Pathog. | 96.10 | 99.93
17:41246531 | C>CG | BRCA1 | --- | True(32%) | --- | 97.83 | 99.95

Table 9. Human frameshift variants in the genes BRCA1 and BRCA2. POS = position in the format Chromosome:Genomic coordinate, REF>ALT = change in nucleotides between the reference allele (REF) and the observed variant (ALT), ClinVar CLNSIG = clinical significance obtained from ClinVar, SnpEff LoF = result of the prediction of loss of function obtained from SnpEff, VEP Pathog. = result of the VEP prediction of pathogenicity, NutVar Rank Per. (SB) = percentiles of the NutVar2 rank based only on sequence-based features, NutVar Rank Per. (SB + GB) = percentiles of the NutVar2 rank combining sequence- and gene-based features.

As shown in table 9, the sequence-based features of NutVar2 produce a rank, shown as percentiles (table 9, NutVar Rank Per. (SB)), that reflects the molecular impact of the truncations in BRCA1 and BRCA2. While the percentiles occupied by the BRCA2 variants range from 5 to 55, those of BRCA1 stay at 95 or above. This means that the variants affecting BRCA2 have a lower molecular impact. However, when gene-based features are added (NutVar Rank Per. (SB + GB)), the percentiles for both genes rise to 93 or above (except for a ‘likely pathogenic’ variant in BRCA2, 13:32972589 GA>G). The greater likelihood of being pathogenic when sequence and gene features are combined comes as no surprise given the essentiality of both genes and their widely acknowledged role in breast cancer.
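As an illustration of how a score can be turned into a rank percentile, the following sketch computes the percentage of scores in a reference set that are less than or equal to a given score. The exact percentile definition and the reference population used by NutVar2 are not stated here, so both are assumptions, and the reference scores are hypothetical.

```python
from bisect import bisect_right

def percentile_rank(score, reference_scores):
    """Percentage of reference scores that are less than or equal to `score`."""
    ordered = sorted(reference_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)

# Hypothetical reference scores, for illustration only.
reference = [10, 20, 30, 40]
print(percentile_rank(25, reference))
```

Under this definition, a variant in the 95th percentile scores at least as high as 95% of the reference variants.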

The predicted pathogenicity of these BRCA1 and BRCA2 frameshift variants according to VEP is also displayed in table 9 (VEP Pathog.). It is absent for most of the variants and, when present, it matches the ClinVar classification (ClinVar CLNSIG) in almost all cases.

The LoF prediction of SnpEff is also shown in table 9. The SnpEff estimate indicates whether or not there is LoF (True or False) and the percentage of the total transcripts of the gene affected by it [15]. It is difficult to appraise from this indicator the possible outcome of the variants in terms of impact on health: for BRCA2, LoF affecting more transcripts generally correlates with a higher NutVar2 percentile, whereas for BRCA1 LoF affects few transcripts and the NutVar2 percentile is still very high. Even for variants whose implication in breast cancer is established in ClinVar (17:41245586 CT>C), the LoF prediction of SnpEff is difficult to interpret (True (19%)).
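The LoF criterion as described in this section (truncations falling in the last 5% of a transcript are discarded, the rest are LoF positive, and the call is reported with the percentage of the gene's transcripts affected) can be sketched as follows. This paraphrases the description in the text and is not SnpEff's actual code; the function name, the input layout and the transcript identifiers are hypothetical.

```python
def lof_call(truncation_positions, tail_fraction=0.05):
    """Return (is_lof, percent_of_transcripts_affected) for one variant.

    truncation_positions maps each transcript of the gene to the relative
    position (0.0 to 1.0) at which the variant truncates it, or to None when
    the transcript is not truncated by the variant.
    """
    cutoff = 1.0 - tail_fraction
    # Truncations inside the final `tail_fraction` of a transcript are discarded.
    affected = [t for t, rel in truncation_positions.items()
                if rel is not None and rel < cutoff]
    percent = 100.0 * len(affected) / len(truncation_positions)
    return bool(affected), percent

# Hypothetical gene with three transcripts, one truncated at 40% of its length.
print(lof_call({"T1": 0.40, "T2": None, "T3": None}))
```

In this toy case one of three transcripts is affected, which corresponds to the kind of value reported as True(33%) in table 9.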

In summary, the implementations developed for SnpEff and VEP lack the comprehensiveness or the pathogenicity-focused scope of NutVar2 and provide worse indicators for prioritizing the study of truncations.

NutVar2 versus NutVar1

The main upgrades from NutVar1 to NutVar2 are: the inclusion of splice acceptor/donor variants, a bigger training set, the adaptation to the minimal representation format, the dual use of SnpEff and VEP, the use of ENSEMBL as the basis for transcript annotation, new transcript features (Pervasive isoform) and new protein features (functional sites). All these features and improvements allow for the use of new classification algorithms, which are currently being developed for NutVar2, and broaden its range of potential users.
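The minimal representation mentioned above reduces each REF/ALT pair to its shortest equivalent form. A common way to do this, sketched below, trims shared trailing bases and then shared leading bases while keeping at least one base in each allele. This is a sketch of that general procedure and not necessarily the exact NutVar2 code; passing symbolic or missing alternative alleles through untouched is an assumption made here.

```python
def minimal_representation(pos, ref, alt):
    """Trim bases shared by REF and ALT; return the adjusted (pos, ref, alt)."""
    # Assumption: symbolic (<DEL>) or missing (.) alleles are left as they are.
    if alt.startswith("<") or alt == ".":
        return pos, ref, alt
    # Drop identical trailing bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Drop identical leading bases and shift the position accordingly.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

# Illustrative coordinates, not taken from the thesis data.
print(minimal_representation(1001, "CTCC", "CCC"))
print(minimal_representation(1001, "CTCC", "CCCC"))
```

Trimming the suffix before the prefix keeps the leftmost position for deletions and insertions, which matches the usual VCF convention.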

NutVar2 is devised as a standalone tool that can be downloaded from the web server and applied to the user's own VCF files without the need to perform any transformation. It is designed to be robust (allowing for single or joint variant calling) and to include as many variants as possible (inclusion of <DEL> and < . > alternative alleles). Importantly, all of the files and Pre-C tables used are built from files downloaded directly from the FTP servers of ENSEMBL, UniProt and InterPro. This has been done to make it easier to update to changes such as the recent new assembly of the human genome (GRCh38). Again, our goal has been to increase the adaptability of our software to the user's needs.

The inclusion of VEP as an alternative to SnpEff, or even the use of both together, seeks to increase the number of potential users of NutVar2. This functionality also offers the possibility of restricting the analysis to variants predicted to have the same effect by both programs, thus decreasing the rate of false positives due to annotation errors.
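The restriction to concordant annotations can be sketched as a simple intersection. The sketch assumes that the effects reported by the two programs have already been mapped to a shared vocabulary (the labels used below are hypothetical, since SnpEff and VEP use their own terms) and that variants are keyed by chromosome, position, REF and ALT.

```python
def concordant_variants(snpeff_calls, vep_calls):
    """Keep the variants to which both annotators assign the same effect."""
    return {key: effect for key, effect in snpeff_calls.items()
            if vep_calls.get(key) == effect}

# Keys are (chromosome, position, REF, ALT); the effect labels are illustrative.
snpeff_calls = {("13", 32906565, "CA", "C"): "frameshift",
                ("1", 1000, "G", "T"): "stop_gain"}
vep_calls = {("13", 32906565, "CA", "C"): "frameshift",
             ("1", 1000, "G", "T"): "splice"}
print(concordant_variants(snpeff_calls, vep_calls))
```

Variants annotated by only one program, or annotated differently by the two, are dropped from the restricted analysis.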

The use of ENSEMBL instead of CCDS as the basis for transcript annotation reflects our effort to extend the reach of NutVar prediction to as many protein-coding transcripts as possible. Nevertheless, users wanting to restrict the analysis to CCDS transcripts can do so in the pipeline.


The mining of new features with respect to NutVar1 (Pervasive isoform, number of domains/sites with 100% damage and total of domain/site positions affected) was undertaken to increase the number of feature variables for classification. Our goal has been to offer more cues for the training of the classifier. Some of them, such as “number of domains with a 100% damage” (figure 11), may have predictive value and will be instrumental in the development of a new classification algorithm that we are currently evaluating. In this sense, we are planning to move to a TAN-Bayes or a decision tree (C4.5) paradigm, since both of them exploit the mutual information that we think some of our features share (figure 11).
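The quantity that a TAN classifier uses to decide which features to link is the mutual information between two features conditioned on the class. A minimal sketch for discrete features follows; it assumes the features have been discretized beforehand (for instance, DomainMatched as 0/1), and the toy vectors are hypothetical.

```python
import math
from collections import Counter

def conditional_mutual_information(x, y, c):
    """I(X;Y|C) in bits for three equally long sequences of discrete values."""
    n = len(c)
    n_xyc = Counter(zip(x, y, c))
    n_xc = Counter(zip(x, c))
    n_yc = Counter(zip(y, c))
    n_c = Counter(c)
    total = 0.0
    for (xi, yi, ci), count in n_xyc.items():
        # p(x,y,c) * log2[ p(x,y,c) p(c) / (p(x,c) p(y,c)) ], written with counts.
        total += (count / n) * math.log2(count * n_c[ci] / (n_xc[(xi, ci)] * n_yc[(yi, ci)]))
    return total

# Hypothetical binary features for four variants in two classes.
feature_a = [0, 1, 0, 1]
feature_b = [0, 1, 0, 1]
classes = [0, 0, 1, 1]
print(conditional_mutual_information(feature_a, feature_b, classes))
```

A value close to zero means that, once the class is known, one feature says nothing about the other, which is the independence that Naïve Bayes assumes; larger values flag feature pairs that a TAN model would connect.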

The results of NutVar2 performance (figure 12) show a great improvement for frameshifts, no increase in AUC for stop-gains and a high accuracy in the classification of splice truncations (around 90%).

It should be noted, however, that to compare two classifiers both must use the same learning/test set. NutVar1 was trained with a smaller training set comprising the variants available at the time. In fact, when the NutVar2 training set was used to train NutVar1, a similar increase in the AUC for frameshifts was observed (data not shown), which argues against the capacity of the features we have selected to increase the accuracy of the classifier under the Naïve Bayes paradigm. As mentioned before, we are planning to change the classification algorithm to try to profit from the new features we have mined.

[Figure 13: two panels of mirrored bars on a 0 to 100 scale over the same genes (x axis). One panel plots sequence-based pathogenicity against gene-based pathogenicity (MacArthur), the other sequence-based pathogenicity against gene-based pathogenicity (RVIS). Genes on the x axis: VPS13B, MOB3C, PDE4DIP, IL17RB, CYP2A7, CPN2, OR2L8, DISP1, IL34, KRTAP4−5, C17orf107, FAM187B, C5orf20, TRPM1, AKR1E2, OR4D1, RFPL1, OR10X1, INTS4, CACNA2D4, PRAMEF10, FAM187B, SLC6A18, OR5AR1, NOL4, PCNXL2, ATP5G2, SLC6A9, FCGR1A, TTN, EFCAB13, COL4A4, LAMA3, ENPEP, CASP12, SLC22A10, FCRL3, MROH2B, TRPM4, ALDH1L1. Legend: Homozygous stop-gain (MAF>=0.1), Heterozygous stop-gain (MAF>=0.1), Heterozygous stop-gain (MAF<0.1).]

Figure 13. The genome of J. Craig Venter analyzed with NutVar1. MAF = Minor Allele Frequency in the general population. Orange = homozygous truncations; green and gray = heterozygous truncations.

With respect to the impressive accuracy of the classifier for splice variants, we believe the classifier is overfitted due to the paucity of non-Pathogenic instances (27 versus 522 Pathogenic). We plan to increase the set of splice variants by using new releases of large sequencing projects and/or by re-assessing the effect derived from SnpEff and VEP.
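One way to see how lopsided the splice set is: a trivial rule that always answers "Pathogenic" would already be right for most of these variants. The sketch below only does this arithmetic with the counts quoted above; it says nothing about the AUC itself, whose estimate nonetheless rests on the same 27 negative instances.

```python
def majority_baseline_accuracy(n_pathogenic, n_non_pathogenic):
    """Accuracy of a trivial classifier that always predicts the larger class."""
    return max(n_pathogenic, n_non_pathogenic) / (n_pathogenic + n_non_pathogenic)

# Counts quoted in the text for splice variants.
print(majority_baseline_accuracy(522, 27))
```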

Future Perspectives

With the volume of genomics data coming from NGS technologies expected to keep growing exponentially every year, NutVar2 provides a tool for end users to rank and prioritize the study of the variants in their own VCF files. For the time being NutVar2 limits its scope to truncating variants; however, some of its features, such as the principal isoform feature or the gene-based features, can also be applied to missense variants. These features can be combined with current scores that assess the functional consequences of amino acid substitutions, such as PolyPhen, SIFT or Condel, in an effort to extend the reach of our tool to missense variants.

Restricting ourselves to truncations, we envisage that NutVar will produce results such as the one displayed in figure 13. A public genome (in this case J. Craig Venter's) with phased information that allows haplotypes to be constructed was used by Antonio Rausell in combination with NutVar1 to create a graph of Venter's truncations, plotting the sequence-based score against the gene-based score. In addition, as haplotypes could be constructed, the truncations are colored according to whether they are homozygous (orange) or heterozygous (green or gray depending on the MAF). Figure 13 provides a very visual and direct way to ascertain the outcome of the truncations present in Venter's genome: most of the homozygous truncations affecting important genes have low molecular impact, consistent with few functional consequences. As the molecular impact of the homozygous truncations grows (bars to the right in figure 13), the essentiality of the genes involved decreases; in living individuals, severe homozygous truncations lie in “dispensable” genes.

Heterozygous truncations (green and gray) behave differently, with some severe truncations affecting essential genes (the two rightmost bars). The reason for this tolerance is the existence of a backup copy of the affected gene in the diploid genome of the individual. However, there are genes that so far have never been observed to tolerate severe heterozygous truncations. These haploinsufficient genes, which require a high degree of functionality in both alleles of the locus, comprise a special set of ‘essential genes’. Rausell et al. have extensively studied these genes and have recently submitted a manuscript (currently under evaluation) to which I contributed [23].

Last but not least, we envisage that the progressive completion of systems biology networks will lead to their inclusion as a third class of features in the classifier. Currently, we evaluate variants as single entities unrelated to other variants. It is true that essential genes tend to be central genes in protein-protein networks and thus that this information is partially captured in gene-based features. However, the combinations and interactions between genes are not completely captured by evolutionary parameters. Therefore, network features are very likely to play a future role in functional genomics as a means to evaluate the overall effect of variants on the biological network.


CONCLUSIONS

• NutVar2 works as a standalone tool that can be distributed and downloaded.
• The implementations and extended functionalities in NutVar2 increase its reach among potential users and provide new cues that can be used to change its classification paradigm.


REFERENCES

1. Thayer AM: Next-Gen Sequencing Is A Numbers Game. In.; 2014.

2. MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, Jostins L, Habegger L, Pickrell JK, Montgomery SB, Albers CA, Zhang ZD, Conrad DF, Lunter G, Zheng H, Ayub Q, DePristo MA, Banks E, Hu M, Handsaker RE, Rosenfeld JA, Fromer M, Jin M, Mu XJ, Khurana E, Ye K, Kay M, Saunders GI, Suner MM, Hunt T, Barnes IH, Amid C, Carvalho-Silva DR, Bignell AH, Snow C, Yngvadottir B, Bumpstead S, Cooper DN, Xue Y, Romero IG, Wang J, Li Y, Gibbs RA, McCarroll SA, Dermitzakis ET, Pritchard JK, Barrett JC, Harrow J, Hurles ME, Gerstein MB, Tyler-Smith C: A systematic survey of loss-of-function variants in human protein-coding genes. Science 2012, 335(6070):823-828.

3. Petrovski S, Wang Q, Heinzen EL, Allen AS, Goldstein DB: Genic intolerance to functional variation and the interpretation of personal genomes. PLoS genetics 2013, 9(8):e1003709.

4. Rausell A, Mohammadi P, McLaren PJ, Bartha I, Xenarios I, Fellay J, Telenti A: Analysis of stop-gain and frameshift variants in human innate immunity genes. PLoS computational biology 2014, 10(7):e1003757.

5. Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA: An integrated map of genetic variation from 1,092 human genomes. Nature 2012, 491(7422):56-65.

6. Rodriguez JM, Maietta P, Ezkurdia I, Pietrelli A, Wesselink JJ, Lopez G, Valencia A, Tress ML: APPRIS: annotation of principal and alternative splice isoforms. Nucleic acids research 2013, 41(Database issue):D110-117.

7. Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabasi AL: The human disease network. Proceedings of the National Academy of Sciences of the United States of America 2007, 104(21):8685-8690.

8. Minikel E: Converting variants to their minimal representation. In.; 2014.

9. [http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=7&ved=0CFMQFjAG&url=http%3A%2F%2Fwww.tiem.utk.edu%2F~gross%2FMath589Spring2010%2FChapter2.ppt&ei=XcGmVNGTG4XqUtybhKgL&usg=AFQjCNHkrVWlEGpTacpsBcP_ht-KgjC3-g&sig2=8BxusLAJ0379cpw0vL1cwA&bvm=bv.82001339,d.d24]

10. Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, Maglott DR: ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic acids research 2014, 42(Database issue):D980-985.

11. Clinical Significance [http://www.ncbi.nlm.nih.gov/clinvar/docs/clinsig/]

12. snpEff [http://snpeff.sourceforge.net/]

13. Variant Effect Predictor (VEP) [http://www.ensembl.org/info/docs/tools/vep/index.html]

14. ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf [ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz]

15. Cingolani P: We adopted a definition for LoF variants expected to correlate with complete loss of function of the affected transcripts. In. Edited by Tardaguila M; 2014.

16. Gonzalez-Porta M, Frankish A, Rung J, Harrow J, Brazma A: Transcriptome analysis of human tissues and cell lines reveals one dominant transcript per gene. Genome biology 2013, 14(7):R70.

17. UniProt main [http://www.uniprot.org/]

18. InterPro main [http://www.ebi.ac.uk/interpro/]

19. [http://www.uniprot.org/help/function_section]

20. BLAST Glossary [http://www.ncbi.nlm.nih.gov/books/NBK62051/]

21. Schaffner SS, P.: Evolutionary adaptation in the human lineage. In: Nature Education. 2008.

22. Larrañaga P: Supervised Classification Lesson. In. Edited by Tardáguila M; 2014.

23. István Bartha AR, Paul McLaren, Manuel Tardaguila, Pejman Mohammadi, Jacques Fellay, Amalio Telenti: Heterozygous gene truncation delineates the human haploinsufficient genome. In.; 2014.


ANNEXES:

• MAP OF DEPENDENCIES OF THE CONSTRUCTION OF THE PRE-CALCULATED TABLE (I)
• MAP OF DEPENDENCIES OF THE ANNOTATION PHASE OF NUTVAR2 (II)
• MAP OF DEPENDENCIES OF THE CONSTRUCTION AND ANNOTATION OF THE TRAINING SET (III)

BLAST SOFTWARE

ENSEMBL release 75 gtf

DOWNLOAD

Homo_sapiens.GRCh37.75.gtf

3_parseo_del_gtf_8.0.pl

gtf_output_ENSG.txt

gtf_output_ENST.txt

gtf_output_EXON.txt

gtf_output_CDS.txt

gtf_output_start_codons.txt

gtf_output_UTR.txt

gtf_output_Selenocysteine.txt

4_def_conversor_de_ficheros_UTR_CDS_etc_múltiples_en_individuales.pl

gtf_output_stop_codons.txt

gtf_output_JOINED_CDS.txt

gtf_output_JOINED_start_codons.txt

gtf_output_JOINED_UTR.txt

gtf_output_JOINED_Selenocysteine.txt

gtf_output_JOINED_stop_codons.txt

5_gtf_ENSG_ENST_EON.pl

gtf_ENSG_ENST_EXON.txt

6_gtf_tabladef_3.0.pl

gtf_tabladef.txt

sort -t $'\t' -k3,3 data/build_tables/gtf_tabladef.txt > data/build_tables/gtf_tabladef_sorted_by_SYMBOL.txt

gtf_tabladef_sorted_by_SYMBOL.txt

7_Position_ENST_converter.pl

TRANSCRIPTS_table.txt

8_TRANSCRIPT_TABLE_condenser_1_4.0.pl

ENST_table_midC.txt

ENST_table_full_condensed.txt

ENSEMBL

UNIPROT & INTERPRO

UNIPROT release-2014_07

DOWNLOAD

uniprot_sprot.dat

INTERPRO release-2014_07

DOWNLOAD

protein2ipr.dat

HUMAN.fasta

14_parse_del_ID_UNIPROT_human_16_prueba.pl

Equiv_ENSG_seq.txt

15_new_mapping.pl

ENSEMBL_includes_UNIPROT.txt

19_NEW_PROTEIN_POSITION_CONVERTER_DEF_corregida_dfeature.pl

PROTEIN_condensed.txt

18_NEW_2_PROTEINAS_CONDENSER.pl

PROTEIN_Re_mapping.txt

18_NEW_3_full_condensed_per_domain.pl

PROTEIN_full_condensed_feature.txt

22_MAPEO_ISOFORMAS_NO_DISPLAYED.pl

ALL_ISOFORMS_PROTEIN_table.txt

23_ISOFORMAS_NO_DISPLAYED_CONDENSER_midC.pl

ALL_ISOFORMS_PROTEIN_table_midC.txt

ALL_ISOFORMS_PROTEIN_table_full.txt

24_TABLA_DOMAIN.pl

ALL_ISOFORMS_DOMAIN_table_full.txt

25_NMD_table.pl

NMD_table.txt

OTHERS

appris_principal_isoform_gencode_19_15_10_2014.txt

Pervasive.txt

pRDG2.txt

Genes_AllInnateImmunity.txt

Genes_Antiviral.txt

Genes_ISGs.txt

Genes_OMIMrecessive.txt

RVIS2.txt

ALL_ISOFORMS_DOMAIN_table_midC.txt

db.fasta

no_aligned.fasta

scp -r data/build_tables/db.fasta /home/bioinfo/SOFTWARE/BLAST/ncbi-blast-2.2.30+/db/db.fasta

makeblastdb -in db.fasta -parse_seqids -dbtype prot

blastp -db db.fasta -query /home/bioinfo/Dropbox/nutvar2/data/build_tables/no_aligned.fasta -outfmt 3 > resultsfmt3.out


resultsfmt3.out resultsfmt7.out

16_Jalview_multiple_alignment.pl

selected_alignments.txt

16_Jalview_multiple_alignment_parteII.pl

aligned_features_coordinates.txt

16_NEW_SELECT_HUMAN_FROM_INTERPRO.pl

protein2ipr_human.dat

17_INTERPRO_UNIPROT_SITES_DOMAINS_COORDINATES_MULTIDOMAIN.pl

site_coordinates.txt multidomain.txt

17_MULTIDOMAIN_parte_II_2.0.pl

multidomain_midC.txt

sort -k3,3 -k4,4 -k 1n data/build_tables/multidomain_midC.txt > data/build_tables/multidomain_midC_sorted.txt

17_MULTIDOMAIN_parte_II_2.0.pl

multidomain_midC_sorted.txt

17_MULTIDOMAIN_parte_III.pl

domain_cordinates.txt

18_Realocator_coordinates.pl

aligned_features_coordinates_midC.txt

aligned_no_display_offset.txt

18_Realocator_coordinates_parteII.pl

UNIPROT_PLUS.txt

ENSEMBL release 75 gtf

DOWNLOAD

Homo_sapiens.GRCh37.75.cdna.all.fa

Homo_sapiens.GRCh37.75.cds.all.fa

26_cDNA_genomic_coordinate.pl

cDNA_genomic_coordinates.txt

27_CDS_genomic_coordinate.pl

CDS_genomic_coordinates.txt

28_compresor_CDS.pl

CDS_genomic_coordinates_full_compresed.txt

ORDERS

wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.75.cdna.all.fa.gz

perl bin/build_tables/26_cDNA_genomic_coordinate.pl data/build_tables/NMD_table.txt ~/../../media/bioinfo/9631bd0c-4897-4987-8651-994c6463aa73/external_files_high_volume/Human_genome/Homo_sapiens.GRCh37.75.cdna.all.fa data/build_tables/cDNA_genomic_coordinates.txt

perl bin/build_tables/27_CDS_genomic_coordinate.pl ~/../../media/bioinfo/9631bd0c-4897-4987-8651-994c6463aa73/external_files_high_volume/Human_genome/Homo_sapiens.GRCh37.75.cds.all.fa data/build_tables/cDNA_genomic_coordinates.txt ~/Escritorio/Proyecto_clasificador/Raw_Data/CDS_genomic_coordinates.txt

perl bin/build_tables/28_compresor_CDS.pl ~/Escritorio/Proyecto_clasificador/Raw_Data/CDS_genomic_coordinates.txt ~/Escritorio/Proyecto_clasificador/Raw_Data/CDS_genomic_coordinates_full_compresed.txt

wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cds/Homo_sapiens.GRCh37.75.cds.all.fa.gz

wget ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz

perl bin/build_tables/5_gtf_ENSG_ENST_EON.pl data/build_tables/gtf_output_ENSG.txt data/build_tables/gtf_output_ENST.txt data/build_tables/gtf_output_EXON.txt data/build_tables/gtf_ENSG_ENST_EXON.txt

perl bin/build_tables/6_gtf_tabladef_3.0.pl data/build_tables/gtf_output_ENSG.txt data/build_tables/gtf_output_ENST.txt data/build_tables/gtf_output_JOINED_CDS.txt data/build_tables/gtf_output_JOINED_UTR.txt data/build_tables/gtf_output_JOINED_START.txt data/build_tables/gtf_output_JOINED_STOP.txt data/build_tables/gtf_tabladef.txt

perl bin/build_tables/8_TRANSCRIPT_TABLE_condenser_1_4.0.pl ~/Escritorio/Proyecto_clasificador/Raw_Data/TRANSCRIPTS_table.txt data/build_tables/ENST_table_midC.txt data/build_tables/ENST_table_full_condensed.txt

perl bin/build_tables/New_mapping_scripts/14_parse_del_ID_UNIPROT_human_16_prueba_API_independent.pl data/build_tables/gtf_output_ENSG.txt data/external/HUMAN.fasta data/external/uniprot_sprot_human.dat data/external/Homo_sapiens.GRCh37.75.pep.all.fa data/build_tables/Equiv_ENSG_seq.txt

perl bin/build_tables/New_mapping_scripts/15_new_mapping.pl data/build_tables/Equiv_ENSG_seq.txt data/build_tables/ENSEMBL_includes_UNIPROT.txt data/build_tables/db.fasta data/build_tables/no_aligned.fasta

scp -r data/build_tables/db.fasta /home/bioinfo/SOFTWARE/BLAST/ncbi-blast-2.2.30+/db/db.fasta

makeblastdb -in db.fasta -parse_seqids -dbtype prot

perl bin/build_tables/3_parseo_del_gtf_8.0.pl data/external/Homo_sapiens.GRCh37.75.gtf data/build_tables/gtf_output_ENSG.txt data/build_tables/gtf_output_ENST.txt data/build_tables/gtf_output_EXON.txt data/build_tables/gtf_output_CDS.txt data/build_tables/gtf_output_start_codons.txt data/build_tables/gtf_output_stop_codons.txt data/build_tables/gtf_output_UTR.txt data/build_tables/gtf_output_Selenocysteine.txt

perl bin/build_tables/4_def_conversor_de_ficheros_UTR_CDS_etc_múltiples_en_individuales.pl data/build_tables/gtf_output_UTR.txt data/build_tables/gtf_output_start_codons.txt data/build_tables/gtf_output_CDS.txt data/build_tables/gtf_output_Selenocysteine.txt data/build_tables/gtf_output_stop_codons.txt data/build_tables/gtf_output_JOINED_UTR.txt data/build_tables/gtf_output_JOINED_START.txt data/build_tables/gtf_output_JOINED_CDS.txt data/build_tables/gtf_output_JOINED_Seleno.txt data/build_tables/gtf_output_JOINED_STOP.txt

perl bin/build_tables/New_mapping_scripts/16_Jalview_multiple_alignment_parteII.pl data/build_tables/selected_alignments.txt data/build_tables/no_aligned.fasta data/build_tables/db.fasta ~/SOFTWARE/BLAST/ncbi-blast-2.2.30+/db/resultsfmt3.out data/build_tables/aligned_features_coordinates.txt

perl bin/build_tables/New_mapping_scripts/16_NEW_SELECT_HUMAN_FROM_INTERPRO.pl ~/../../media/bioinfo/Elements/Proyecto_clasificador_28_07_2014_NO_TOCAR/Documentos/UNIPROT/HUMAN.fasta ~/../../media/bioinfo/Elements/Proyecto_clasificador_28_07_2014_NO_TOCAR/Documentos/INTERPRO/protein2ipr.dat data/external/protein2ipr_human.dat

sort -t $'\t' -k3,3 data/build_tables/gtf_tabladef.txt > data/build_tables/gtf_tabladef_sorted_by_SYMBOL.txt
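The sort call above passes a literal tab as the field separator; `$'\t'` is Bash ANSI-C quoting. The following is a minimal sketch on toy data, assuming a three-column, tab-separated table with the gene symbol in column 3. The rows are invented for illustration and are not taken from gtf_tabladef.txt, and `printf '\t'` is used here only as a portable stand-in for `$'\t'`.

```shell
# Toy table (hypothetical rows): gene ID, transcript ID, symbol, tab-separated.
toy_table=$(printf 'ENSG03\tENST03\tTP53\nENSG01\tENST01\tBRCA2\nENSG02\tENST02\tCFTR\n')

# Portable way to obtain a literal tab; in Bash, $'\t' is equivalent.
tab=$(printf '\t')

# -k3,3 restricts the key to column 3 only; LC_ALL=C gives a byte-wise,
# locale-independent order.
sorted_by_symbol=$(printf '%s\n' "$toy_table" | LC_ALL=C sort -t "$tab" -k3,3)

# Show the resulting order of the symbol column on one line.
symbols=$(printf '%s\n' "$sorted_by_symbol" | cut -f3 | paste -sd' ' -)
echo "$symbols"
```

Bounding the key as `-k3,3` rather than `-k3` matters when the table has more than three columns, because an unbounded key extends to the end of the line.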

perl bin/build_tables/7_Position_ENST_converter.pl data/build_tables/gtf_tabladef_sorted_by_SYMBOL.txt data/build_tables/TRANSCRIPTS_table.txt

wget ftp://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2014_07/ + protein2ipr.dat + HUMAN.fasta

perl bin/build_tables/New_mapping_scripts/17_INTERPRO_UNIPROT_SITES_DOMAINS_COORDINATES_MULTIDOMAIN.pl ~/../../media/bioinfo/Elements/Proyecto_clasificador_28_07_2014_NO_TOCAR/Documentos/UNIPROT/HUMAN.fasta ~/../../media/bioinfo/Elements/Proyecto_clasificador_28_07_2014_NO_TOCAR/Documentos/UNIPROT/uniprot_sprot.dat data/external/protein2ipr_human.dat data/build_tables/site_coordinates.txt data/build_tables/multidomain.txt

perl bin/build_tables/18_NEW_3_full_condensed_per_domain.pl data/build_tables/PROTEIN_condensed.txt data/build_tables/PROTEIN_full_condensed_feature.txt

perl bin/build_tables/22_MAPEO_ISOFORMAS_NO_DISPLAYED.pl data/build_tables/gtf_output_ENSG.txt data/build_tables/PROTEIN_full_condensed_feature.txt data/build_tables/ENST_table_full_condensed.txt ~/Escritorio/Proyecto_clasificador/Raw_Data/ALL_ISOFORMS_PROTEIN_table.txt

perl bin/build_tables/23_ISOFORMAS_NO_DISPLAYED_CONDENSER_midC.pl ~/Escritorio/Proyecto_clasificador/Raw_Data/ALL_ISOFORMS_PROTEIN_table.txt data/build_tables/ALL_ISOFORMS_PROTEIN_table_midC.txt data/build_tables/ALL_ISOFORMS_PROTEIN_table_full.txt

perl bin/build_tables/25_NMD_table.pl data/build_tables/gtf_output_ENSG.txt data/build_tables/gtf_output_ENST.txt data/build_tables/gtf_ENSG_ENST_EXON.txt data/build_tables/NMD_table.txt

perl bin/build_tables/24_TABLA_DOMAIN_2.0.pl data/build_tables/ALL_ISOFORMS_PROTEIN_table_full.txt data/build_tables/ALL_ISOFORMS_DOMAIN_table_midC.txt data/build_tables/ALL_ISOFORMS_DOMAIN_table_full.txt



perl bin/build_tables/New_mapping_scripts/16_Jalview_multiple_alignment.pl ~/SOFTWARE/BLAST/ncbi-blast-2.2.30+/db/resultsfmt7.out data/build_tables/selected_alignments.txt

From: /home/bioinfo/SOFTWARE/BLAST/ncbi-blast-2.2.30+/db/db.fasta

perl bin/build_tables/New_mapping_scripts/17_MULTIDOMAIN_parte_II_2.0.pl data/build_tables/multidomain.txt data/build_tables/multidomain_midC.txt

sort -k3,3 -k4,4 -k 1n data/build_tables/multidomain_midC.txt > data/build_tables/multidomain_midC_sorted.txt
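The sort call above combines two lexical keys (fields 3 and 4) with a numeric key on field 1 as the final tie-breaker. The following is a minimal sketch of that ordering, assuming whitespace-separated rows; the column meanings (start, end, protein, domain) and the values are invented for illustration and are not taken from multidomain_midC.txt. The sketch writes the numeric key as `-k1,1n`; with `-k 1n` the key runs to the end of the line, but numeric comparison reads only the leading number, so the result should be the same for rows like these.

```shell
# Toy rows (hypothetical): start end protein domain
rows=$(printf '%s\n' \
  '120 180 P2 DomB' \
  '30 90 P1 DomA' \
  '5 25 P1 DomA' \
  '200 260 P1 DomB')

# Order by protein (field 3), then domain (field 4), then numerically by start.
ordered=$(printf '%s\n' "$rows" | LC_ALL=C sort -k3,3 -k4,4 -k1,1n)

# Show the start coordinates in their final order on one line.
first_col=$(printf '%s\n' "$ordered" | awk '{print $1}' | paste -sd' ' -)
echo "$first_col"
```

Without the `n` modifier the start coordinates would compare as text, placing 30 before 5.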

perl bin/build_tables/New_mapping_scripts/17_MULTIDOMAIN_parte_III.pl data/build_tables/multidomain_midC_sorted.txt data/build_tables/domain_coordinates.txt

perl bin/build_tables/18_NEW_2_PROTEINAS_CONDENSER.pl ~/Escritorio/Proyecto_clasificador/Raw_Data/PROTEIN.txt data/build_tables/PROTEIN_condensed.txt

perl bin/build_tables/New_mapping_scripts/18_Realocator_coordinates.pl data/build_tables/aligned_features_coordinates.txt data/build_tables/aligned_features_coordinates_midC.txt data/build_tables/aligned_no_display_offset.txt

perl bin/build_tables/New_mapping_scripts/18_Realocator_coordinates_parteII.pl data/build_tables/site_coordinates.txt data/build_tables/domain_coordinates.txt data/build_tables/gtf_output_ENSG.txt data/build_tables/ENSEMBL_includes_UNIPROT.txt data/build_tables/aligned_no_display_offset.txt data/build_tables/UNIPROT_PLUS.txt

perl bin/build_tables/New_mapping_scripts/19_NEW_PROTEIN_POSITION_CONVERTER_DEF_corregida_dfeature.pl data/build_tables/UNIPROT_PLUS.txt ~/Escritorio/Proyecto_clasificador/Raw_Data/TRANSCRIPTS_table.txt ~/Escritorio/Proyecto_clasificador/Raw_Data/PROTEIN_Re_mapping.txt
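The table-building commands are listed here without an explicit execution order, and several of them read files that other commands appear to write (for example, gtf_output_ENSG.txt is among the files named in the 3_parseo_del_gtf_8.0.pl call and is read by several later-numbered scripts). The following is a minimal sketch of a guard that refuses to run a step when an input file is missing. The helper name `run_step` is hypothetical and not part of NutVar 2, and the sketch assumes paths without spaces.

```shell
# Hypothetical helper: run a command only if every listed input file exists.
# Usage: run_step "input1 input2 ..." command [args...]
run_step() {
    inputs=$1
    shift
    for f in $inputs; do
        if [ ! -e "$f" ]; then
            echo "missing input: $f" >&2
            return 1
        fi
    done
    "$@"
}

# Demonstration in a scratch directory (file names are placeholders).
workdir=$(mktemp -d)
touch "$workdir/PROTEIN.txt"

# Input present: the wrapped command runs.
run_step "$workdir/PROTEIN.txt" echo "step can run"

# Input absent: the wrapped command is skipped and the status is non-zero.
run_step "$workdir/not_built_yet.txt" echo "never printed" || echo "step skipped"

rm -rf "$workdir"
```

In practice each perl call would be wrapped with its input files as the first argument, so that a step launched too early stops with a clear message instead of producing an empty table.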

[Figure: flowchart of the annotation pipeline. Labels recoverable from the figure: input ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf; 2_script_minimal_representation_vcf_7.0.pl producing OUT_FILE_minimal_representation.vcf; snpEff-exclusive and VEP-exclusive scripts (10X variant_effec_output_second_round.txt); pre-calculated tables gtf_tabladef_4_sorted_by_SYMBOL.txt, ENST_table_full_condensed.txt, ALL_ISOFORMS_PROTEIN_table_full.txt, ALL_ISOFORMS_DOMAIN_table_full.txt and CDS_genomic_coordinates_full_compresed.txt; shared steps 32_key_PROTEINS_8.0_GLOBAL_3.0.pl, SORT and 32_key_PROTEINS_8.0_GLOBAL_ParteII.pl, giving the Pre_step, Pre_step_ordered and Post_step files for ProtAndSite and DOMAINS; gene-based inputs pRDG2.txt, Genes_Antiviral.txt, Genes_ISGs.txt and RVIS2.txt; 53BIS_Fuse_Matrix\&Gene_based.pl producing Matrix_vep_added_gene_based_scores.txt and, in the CCDS version, Matrix_vep_CCDS_added_gene_based_scores.txt.]

ORDERS

perl bin/shared/2_Script_minimal_representation_vcf_7.0.pl test/example.vcf data/intermediate/example_mr.vcf

Takes about 2 h 40 min for the 1KG set and requires internet access.


perl ~/Escritorio/Proyecto_clasificador/SOFTWARE/ensembl-tools-release-75/scripts/variant_effect_predictor/variant_effect_predictor.pl -i data/intermediate/example_mr.vcf --offline --output_file data/intermediate/vep_example_mr.vcf --everything --vcf --cache --dir ~/Escritorio/Proyecto_clasificador/SOFTWARE/./.vep/tmp/

Runs overnight (o.n.) for 10 chunks of the 1KG set.

perl bin/VEP/24_VEP_parser_def_minus_heather.pl data/intermediate/vep_example_mr.vcf data/intermediate/out_vep_parsed.txt

perl bin/shared/25_Downstream_frameshift_API_independent_5.0.pl data/intermediate/out_snpeff_parsed.txt ~/Escritorio/Proyecto_clasificador/Raw_Data/CDS_genomic_coordinates_full_compresed.txt data/intermediate/snpeff_derived_PTCS_API_independent.txt

perl bin/shared/26_NEW_EXTRA_key_%_sequence_2.0.pl data/intermediate/out_snpeff_parsed.txt data/build_tables/ENST_table_full_condensed.txt data/intermediate/snpeff_percentage.txt

perl bin/shared/32_key_PROTEINS_8.0_GLOBAL_ParteII.pl data/intermediate/snpeff_detailed_ProtAndSite_Pre_step_ordered.txt data/intermediate/snpeff_detailed_ProtAndSite_Post_step.txt

sort -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 -k7,7 -k8,8 data/intermediate/snpeff_DOMAINS_Pre_step.txt > data/intermediate/snpeff_DOMAINS_Pre_step_ordered.txt

perl bin/shared/27_key_NMD_5.0_DERIVED_STOPS_2.0.pl data/intermediate/snpeff_derived_PTCS.txt data/intermediate/out_snpeff_parsed.txt data/build_tables/NMD_table.txt data/build_tables/ENST_table_full_condensed.txt data/intermediate/snpeff_derived_NMD.txt

perl bin/shared/38_global_feature_table_1_4_paralell.pl data/intermediate/snpeff_NMD.txt data/intermediate/snpeff_derived_NMD.txt data/intermediate/snpeff_detailed_ProtAndSite_Post_step.txt data/intermediate/snpeff_DOMAINS_Post_step.txt data/intermediate/snpeff_percentage.txt data/intermediate/snpeff_first_table.txt

perl bin/shared/40_tabla_PEJMAN_16_def_2.0.pl data/intermediate/snpeff_first_table.txt data/intermediate/snpeff_NMD.txt data/intermediate/snpeff_derived_NMD.txt data/intermediate/gtf_output_ENST.txt data/intermediate/gtf_output_ENSG.txt data/build_tables/ENST_table_full_condensed.txt data/external/appris_principal_isoform_gencode_19_15_10_2014.txt data/external/Pervasive.txt data/intermediate/Matrix_snpeff.txt

perl bin/shared/41_CCDS_collapser_3.0.pl data/build_tables/gtf_tabladef_sorted_by_SYMBOL.txt data/intermediate/snpeff_NMD.txt data/intermediate/snpeff_NMD_CCDS.txt data/intermediate/snpeff_derived_NMD.txt data/intermediate/snpeff_derived_NMD_CCDS.txt data/build_tables/gtf_output_ENSG.txt data/build_tables/gtf_output_ENSG_CCDS.txt data/build_tables/gtf_output_ENST.txt data/build_tables/gtf_output_ENST_CCDS.txt data/build_tables/ENST_table_full_condensed.txt data/build_tables/ENST_table_full_condensed_CCDS.txt data/external/appris_principal_isoform_gencode_19_15_10_2014.txt data/external/appris_principal_isoform_gencode_19_15_10_2014_CCDS.txt data/external/Pervasive.txt data/external/Pervasive_CCDS.txt data/intermediate/snpeff_first_table.txt data/intermediate/snpeff_first_table_CCDS.txt

perl bin/shared/42_tabla_PEJMAN_15.0_version_paralel_4.0.pl data/intermediate/snpeff_first_table_CCDS.txt data/intermediate/snpeff_NMD_CCDS.txt data/intermediate/snpeff_derived_NMD_CCDS.txt data/build_tables/gtf_output_ENST_CCDS.txt data/build_tables/gtf_output_ENSG.txt data/build_tables/ENST_table_full_condensed_CCDS.txt data/external/appris_principal_isoform_gencode_19_15_10_2014_CCDS.txt data/external/Pervasive_CCDS.txt data/intermediate/Matrix_snpeff_CCDS.txt

java -Xmx4g -jar ~/Escritorio/Proyecto_clasificador/SOFTWARE/snpEff/snpEff.jar eff -c ~/Escritorio/Proyecto_clasificador/SOFTWARE/snpEff/snpEff.config -v GRCh37.75 -lof -csvStats -nextProt -sequenceOntology data/intermediate/example_mr.vcf > data/intermediate/example_mr.eff.vcf

perl bin/snpEff/24_snpEff_parser_def_minus_heather.pl data/intermediate/example_mr.eff.vcf data/intermediate/out_snpeff_parsed.txt

perl bin/shared/53BIS_Fuse_Matrix\&Gene_based.pl data/intermediate/Matrix_snpeff.txt data/external/pRDG2.txt data/external/Genes_AllInnateImmunity.txt data/external/Genes_Antiviral.txt data/external/Genes_ISGs.txt data/external/Genes_OMIMrecessive.txt data/external/RVIS2.txt data/final/Matrix_snpeff_added_gene_based_scores.txt

perl bin/shared/53BIS_Fuse_Matrix\&Gene_based.pl data/intermediate/Matrix_snpeff_CCDS.txt data/external/pRDG2.txt data/external/Genes_AllInnateImmunity.txt data/external/Genes_Antiviral.txt data/external/Genes_ISGs.txt data/external/Genes_OMIMrecessive.txt data/external/RVIS2.txt data/final/Matrix_snpeff_CCDS_added_gene_based_scores.txt

perl bin/shared/27_key_NMD_5.0_3.0.pl data/intermediate/out_snpeff_parsed.txt data/build_tables/NMD_table.txt data/build_tables/ENST_table_full_condensed.txt data/intermediate/snpeff_NMD.txt

perl bin/shared/32_key_PROTEINS_8.0_GLOBAL_3.0.pl data/intermediate/out_snpeff_parsed.txt data/build_tables/ENST_table_full_condensed.txt data/build_tables/ALL_ISOFORMS_PROTEIN_table_full.txt data/intermediate/snpeff_detailed_ProtAndSite_Pre_step.txt

perl bin/shared/32_key_PROTEINS_8.0_GLOBAL_3.0.pl data/intermediate/out_snpeff_parsed.txt data/build_tables/ENST_table_full_condensed.txt data/build_tables/ALL_ISOFORMS_DOMAIN_table_full.txt data/intermediate/snpeff_DOMAINS_Pre_step.txt

sort -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 -k7,7 -k8,8 data/intermediate/snpeff_detailed_ProtAndSite_Pre_step.txt > data/intermediate/snpeff_detailed_ProtAndSite_Pre_step_ordered.txt

perl bin/shared/32_key_PROTEINS_8.0_GLOBAL_ParteII.pl data/intermediate/snpeff_DOMAINS_Pre_step_ordered.txt data/intermediate/snpeff_DOMAINS_Post_step.txt

perl bin/shared/25_Downstream_frameshift_6.0.pl data/intermediate/out_vep_parsed.txt data/intermediate/vep_derived_PTCS.txt

perl bin/shared/26_NEW_EXTRA_key_%_sequence_2.0.pl data/intermediate/out_vep_parsed.txt data/build_tables/ENST_table_full_condensed.txt data/intermediate/vep_percentage.txt

perl bin/shared/27_key_NMD_5.0_3.0.pl data/intermediate/out_vep_parsed.txt data/build_tables/NMD_table.txt data/build_tables/ENST_table_full_condensed.txt data/intermediate/vep_NMD.txt

perl bin/shared/32_key_PROTEINS_8.0_GLOBAL_3.0.pl data/intermediate/out_vep_parsed.txt data/build_tables/ENST_table_full_condensed.txt data/build_tables/ALL_ISOFORMS_PROTEIN_table_full.txt data/intermediate/vep_detailed_ProtAndSite_Pre_step.txt

sort -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 -k7,7 -k8,8 data/intermediate/vep_detailed_ProtAndSite_Pre_step.txt > data/intermediate/vep_detailed_ProtAndSite_Pre_step_ordered.txt

perl bin/shared/32_key_PROTEINS_8.0_GLOBAL_ParteII.pl data/intermediate/vep_detailed_ProtAndSite_Pre_step_ordered.txt data/intermediate/vep_detailed_ProtAndSite_Post_step.txt

perl bin/shared/32_key_PROTEINS_8.0_GLOBAL_3.0.pl data/intermediate/out_vep_parsed.txt data/build_tables/ENST_table_full_condensed.txt data/build_tables/ALL_ISOFORMS_DOMAIN_table_full.txt data/intermediate/vep_DOMAINS_Pre_step.txt

sort -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 -k7,7 -k8,8 data/intermediate/vep_DOMAINS_Pre_step.txt > data/intermediate/vep_DOMAINS_Pre_step_ordered.txt

perl bin/shared/32_key_PROTEINS_8.0_GLOBAL_ParteII.pl data/intermediate/vep_DOMAINS_Pre_step_ordered.txt data/intermediate/vep_DOMAINS_Post_step.txt

perl bin/shared/27_key_NMD_5.0_DERIVED_STOPS_2.0.pl data/intermediate/vep_derived_PTCS.txt data/intermediate/out_vep_parsed.txt data/build_tables/NMD_table.txt data/build_tables/ENST_table_full_condensed.txt data/intermediate/vep_derived_NMD.txt

perl bin/shared/38_global_feature_table_1_4_paralell.pl data/intermediate/vep_NMD.txt data/intermediate/vep_derived_NMD.txt data/intermediate/vep_detailed_ProtAndSite_Post_step.txt data/intermediate/vep_DOMAINS_Post_step.txt data/intermediate/vep_percentage.txt data/intermediate/vep_first_table.txt

perl bin/shared/40_tabla_PEJMAN_16_def_2.0.pl data/intermediate/vep_first_table.txt data/intermediate/vep_NMD.txt data/intermediate/vep_derived_NMD.txt data/intermediate/gtf_output_ENST.txt data/intermediate/gtf_output_ENSG.txt data/build_tables/ENST_table_full_condensed.txt data/external/appris_principal_isoform_gencode_19_15_10_2014.txt data/external/Pervasive.txt data/intermediate/Matrix_vep.txt

perl bin/shared/41_CCDS_collapser_3.0.pl data/build_tables/gtf_tabladef_sorted_by_SYMBOL.txt data/intermediate/vep_NMD.txt data/intermediate/vep_NMD_CCDS.txt data/intermediate/vep_derived_NMD.txt data/intermediate/vep_derived_NMD_CCDS.txt data/build_tables/gtf_output_ENSG.txt data/build_tables/gtf_output_ENSG_CCDS.txt data/build_tables/gtf_output_ENST.txt data/build_tables/gtf_output_ENST_CCDS.txt data/build_tables/ENST_table_full_condensed.txt data/build_tables/ENST_table_full_condensed_CCDS.txt data/external/appris_principal_isoform_gencode_19_15_10_2014.txt data/external/appris_principal_isoform_gencode_19_15_10_2014_CCDS.txt data/external/Pervasive.txt data/external/Pervasive_CCDS.txt data/intermediate/vep_first_table.txt data/intermediate/vep_first_table_CCDS.txt

perl bin/shared/42_tabla_PEJMAN_15.0_version_paralel_4.0.pl data/intermediate/vep_first_table_CCDS.txt data/intermediate/vep_NMD_CCDS.txt data/intermediate/vep_derived_NMD_CCDS.txt data/build_tables/gtf_output_ENST_CCDS.txt data/build_tables/gtf_output_ENSG.txt data/build_tables/ENST_table_full_condensed_CCDS.txt data/external/appris_principal_isoform_gencode_19_15_10_2014_CCDS.txt data/external/Pervasive_CCDS.txt data/intermediate/Matrix_vep_CCDS.txt

perl bin/shared/53BIS_Fuse_Matrix\&Gene_based.pl data/intermediate/Matrix_vep.txt data/external/pRDG2.txt data/external/Genes_AllInnateImmunity.txt data/external/Genes_Antiviral.txt data/external/Genes_ISGs.txt data/external/Genes_OMIMrecessive.txt data/external/RVIS2.txt data/final/Matrix_vep_added_gene_based_scores.txt

perl bin/shared/53BIS_Fuse_Matrix\&Gene_based.pl data/intermediate/Matrix_vep_CCDS.txt data/external/pRDG2.txt data/external/Genes_AllInnateImmunity.txt data/external/Genes_Antiviral.txt data/external/Genes_ISGs.txt data/external/Genes_OMIMrecessive.txt data/external/RVIS2.txt data/final/Matrix_vep_CCDS_added_gene_based_scores.txt
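The snpEff and VEP branches listed above call the same shared scripts and differ only in the file prefix (snpeff_ or vep_). The following is a minimal sketch of how the repeated sort step could be generated from the prefix and the table name. The helper `sort_step` is hypothetical and not part of NutVar 2; it only prints the command line, so it can be inspected before anything is run.

```shell
# Hypothetical helper: print the sort step for one tool prefix and one table.
sort_step() {
    prefix=$1   # snpeff or vep
    table=$2    # detailed_ProtAndSite or DOMAINS
    src="data/intermediate/${prefix}_${table}_Pre_step.txt"
    dst="data/intermediate/${prefix}_${table}_Pre_step_ordered.txt"
    printf 'sort -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 -k7,7 -k8,8 %s > %s\n' "$src" "$dst"
}

# Print the four sort commands used in the two branches.
for prefix in snpeff vep; do
    for table in detailed_ProtAndSite DOMAINS; do
        sort_step "$prefix" "$table"
    done
done
```

Generating the paths from one template avoids the kind of slip where a single path separator or letter is lost in one of the copies.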


[Figure: flowchart of the ClinVar training and test set assembly. Labels recoverable from the figure: download of ClinVarFullRelease_2014-08.xml and clinvar_20140807.vcf; 20_Script_minimal_representation_vcf_ClinVAR_9.0.pl producing prueba_clinvar2.txt; 21_Clinvar_no_repetitions_2.0.pl producing clinvar_Antonio_25_08_2014.vcf; 50_Antonio\&Istvan_Counts_parser.pl joining it with the in-house file OUTPUTVCF_CollapsedPerGene_Variants_PlusCountsFromIstvan.txt into JOIN_vcf.vcf; 51_Fuse_features_Permisiveness1.0.pl combining it with OUT_plus_Intervals_of_confidence_plus_TAGS.txt into Train\&Test.txt; 52_Fuse_Matrix_feaures_Permisiveness_2.0.pl adding tags to Matrix_snpeff_CCDS.txt (annotation, see map II) to give Matrix_snpeff_CCDS_added_tags.txt; 53_Fuse_Matrix\&Permisiveness_Gene_based.pl adding the gene-based scores (pRDG2.txt, Genes_Antiviral.txt, Genes_ISGs.txt, RVIS2.txt) to give Matrix_snpeff_CCDS_added_tag_added_scores.txt.]

ORDERS (ClinVar)

wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/archive/2014/clinvar_20140807.vcf

wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/archive/2014/ClinVarFullRelease_2014-08.xml

perl ../../Scripts_Ad_Hoc/Perl/20_Script_minimal_representation_vcf_ClinVAR_9.0.pl ClinVarFullRelease_2014-08.xml clinvar_20140807.vcf prueba_clinvar2.txt 1>lista_xml1.txt

perl ~/Escritorio/Proyecto_clasificador/Scripts_Ad_Hoc/Perl/21_Clinvar_no_repetitions_2.0.pl prueba_clinvar2.txt clinvar_Antonio_25_08_2014.vcf

perl ../../Scripts_Ad_Hoc/Perl/50_Antonio\&Istvan_Counts_parser.pl OUTPUTVCF_CollapsedPerGene_Variants_PlusCountsFromIstvan.txt clinvar_Antonio_25_08_2014.vcf JOIN_vcf.vcf

perl ../../Scripts_Ad_Hoc/Perl/51_Fuse_features_Permisiveness1.0.pl JOIN_vcf.vcf OUT_plus_Intervals_of_confidence_plus_TAGS.txt Train\&Test.txt

perl ../../../Scripts_Ad_Hoc/Perl/52_Fuse_Matrix_feaures_Permisiveness_2.0.pl Matrix_snpeff_CCDS.txt ../Train\&Test.txt Matrix_snpeff_CCDS_added_tags.txt 1>lista_52.txt

perl ../../../Scripts_Ad_Hoc/Perl/53_Fuse_Matrix\&Permisiveness_Gene_based.pl Matrix_snpeff_CCDS_added_tags.txt ../Gene_based/pRDG2.txt ../Gene_based/Genes_AllInnateImmunity.txt ../Gene_based/Genes_Antiviral.txt ../Gene_based/Genes_ISGs.txt ../Gene_based/Genes_OMIMrecessive.txt ../Gene_based/RVIS2.txt Matrix_snpeff_CCDS_added_tag_added_scores.txt
