emanuel weitschek emanueltorlone/bigdata/biomedicaldata...‒ today next generation high throughput...
Post on 19-Aug-2020
1 Views
Preview:
TRANSCRIPT
BIOMEDICAL DATA MANAGEMENT AND ANALYSIS Emanuel Weitschek
www.dia.uniroma3.it/~emanuel
emanuel@dia.uniroma3.it
2
Outline
• Growth of biological data
• DNA Sequencing
• Growth of clinical data
• Analysis of biomedical data
• Bioinformatics
• GenData 2020
• The Cancer Genome Atlas and GenData
• Data Mining: classification and clustering
• Data mining systems: Weka and DMB
• Application to biological data sets
− Clinical patients
− Gene Expression Profiles
− Sequences: DNA Barcode, Polyomaviruses, Bacteria and CNEs (with alignment free methods)
− EEG signal processing
• Projects
3
Growth of biological data
• Advances in molecular biology lead to an exponential growth of biological data thanks to the support of computer science
‒ originated by the DNA sequencing method invented by Sanger in early eighties
‒ late nineties significant advances in sequence generation, e.g. Human Genome Project
‒ actually the genomic sequences are doubling every 18 months
‒ GenBank: collection of all publicly available nucleotide sequences (160 M seq)
the era of “Big Data”
4
Growth of biological data
• Advances in molecular biology lead to an exponential growth of biological data thanks to the support of computer science
‒ Today next generation high throughput data from modern parallel sequencing machines, are collected and huge amounts of biological data are currently available on public and private sources
‒ 10000 Human Genomes project (3000 Mbp)
‒ The future: 1000$ genome
• Very large data sets, that are generated by several different biological experiments, need to be automatically processed and analyzed with computer science methods
5
DNA Sequencing
• DNA (deoxyribonucleic acid) is the hereditary material in almost all organisms
• DNA sequencing is the process of determining the order of nucleotides within a DNA molecule
• It includes any method or technology that is used to determine the order of the four bases—adenine (A), cytosine (C), guanine (G), and thymine (T)
• Originated by the DNA sequencing method invented by Sanger in early eighties
• In late nineties significant advances in sequence generation techniques, largely inspired by massive projects such as the Human Genome Project
• High costs and time, e.g., for the Human Genome Project 5 billions $ and 13 years
6
Next Generation Sequencing (NGS)
• Today: next generation high throughput data from modern parallel sequencing machines ‒ Roche 454, Illumina, Applied Biosystems SOLiD,
Helicos Heliscope, Complete Genomics, Pacific Biosciences SMRT, ION Torrent
‒ Next generation sequencing (NGS) machines output a large amount of short DNA sequences, called reads (in fastq format)
‒ Cannot read entire genome one nucleotides at a time from beginning to end
‒ shred the genome and generate shorts reads
‒ Low cost per base (5000$ for a whole human genome)
‒ High speed (24h to sequence a whole human genome )
‒ Large number of reads
‒ Problems: data storage and analysis, high costs for IT infrastructure
7
How do we assemble the genome?
8
How do we assemble the genome?
• Take small tissue or blood sample containing millions of cells with identical DNA
• Use biochemical methods to break the DNA into fragments
• Sequence these fragments to produce reads
• Genome assembly: putting a genome back together from its reads (it is just like reassembling a newspaper)
• Biologists can easily generate enough reads to analyze a large genome, but assembling these reads still presents a major computational challenge
• The difficulty is that researchers do not know where in the genome these reads came from, and so they must use overlapping reads to reconstruct the genome
9
NGS experiments
DNA-seq • Sequence the DNA (whole genome) of an organism/sample
• De Novo assembly and Short read mapping
• Align the sequence to a reference genome (mutations analysis)
RNA-seq • Transcriptome sequencing to detect the gene expression
profiles
• Count the reads that map on a particular gene region (quantitative measures)
Chip-seq • Detection of genome-wide protein-DNA interactions
• Determines if a protein is bound to potential target regions
DNA methylation • Determines epigenetic modifications which influence gene
expression and cell phenotypes
• Epigenetics: study of changes in gene expression or cellular phenotype, caused by mechanisms other than changes in the underlying DNA sequence
10
Growth of clinical data
• The wide spread of electronic data collection in medical environments lead to an exponential growth of clinical data from heterogeneous patient samples
‒ Electronic health records
‒ European and national projects
‒ Integration with genomics
• Very large data sets, that are generated by several different clinical experiments, need
to be automatically processed and analyzed with computer science methods
the era of “Big Data”
11
Growth of clinical data
• Collection of different electronic health patient records
• Data set obtained often from different health care facilities and hospitals
• Data collected by medical doctors or assistans: manual insertion of the clinical trials values or copy from non electronic medical records
• Normally no integrated IT system
• Therefore clinical data sets are often noisy, full of missing data and outliers
• To perform a data mining analysis the records have to be integrated in an unique data set
– by joining on common attributes
– by leaving all the non shared attributes as supplementary data
• Privacy issues: every patient must remain anonymous by mapping it with a private id, that has not to be visible to the data analyst
12
Analysis of biomedical data
• Analyzing these enormous amount of data is becoming very important in order to shed light on biological and medical questions
• Challenges:
‒ data collection
‒ data integration
‒ managing this huge amount of data
‒ discovering the interactions
‒ the integration of the biological know-how
13
Analysis of biomedical data
• Major issues:
– incompleteness (missing values)
– different adopted measure scales
– integration of the disparate collection procedures
• Major problems in clinical databases
– Complexity
– Inaccuracy
– Frequent missing values
– Different adopted measure scales
– Mixed (numeric – textual) variables, e.g. BUN = {12, 16, high, 9, 11 …}
• Final goal: extract relevant information from huge amounts of biomedical data
14
Bioinformatics
• New methods are demanded able to extract relevant information from biological data sets
• Effective and efficient computer science methods are needed to support the analysis of complex biological data sets
• Modern biology is frequently combined with computer science, leading to Bioinformatics
• Bioinformatics is a discipline where biology and computer science merge together in order to design and develop efficient methods for analyzing biological data, for supporting in vivo, in vitro and in silicio experiments and for automatically solving complex life science problems
• Bioinformatician: a computer scientist and biology domain expert, who is able to deal with the computer aided resolution of life science problems
15
Analysis of genomic sequences
• Massive sequencing experiments lead to an increasing availability of biological sequences
• Sequence analysis
− Alignment based methods
Exact methods
Heuristics
− Alignment free methods
Substrings frequencies
Sequence compression
16
GenData 2020
• Data standard definition for genomics and biomedical data • http://www.bioinformatics.deib.polimi.it/genomic_computing
• Build the abstractions, models, and protocols for supporting a network of genomic data
• Genome servers located in the major biologist laboratories in the world
• Distributed computing
• Definition of methods for querying, searching, and analyzing genomic data
• Problems: data management
• huge amount of data
• diversity of the platforms
• diversity of the formats
• How to model and store genetic data so as to gain their integrated accessibility
• Data definition based on the standards proposed by the Functional Genomics Data Society
• Use of Distributed Annotation System (DAS) that defines a communication protocol used to exchange data on genomic or protein sequences
17
GenData 2020
• Final aim: Definition of an unique and global platform for effectively storing, searching and retrieving genomic data via distributed computing
• Focus NGS experiments:
– DNA-seq, RNA-seq, ChIP-seq, DNA methylation
• GenData Format:
– Within the same dataset, two kinds of data:
• Region values aligned w.r.t. a given reference, with specific left-right ends within a chromosome
• Metadata, with free-format attributes, storing all the knowledge about the dataset
18
GenData 2020
19
GenData 2020
• Genometric Query Language (GMQL) is defined as a sequence of algebraic operations following the structure: < variable > = < operator > (< parameters >) < variable >
• Every variable is a collection of datasets describing experiments of the same level
• Offers high-level, declarative operations based on:
– regions and their properties
– general-purpose meta-data
• Inspired by Pig Latin and targeted towards cloud computing
• Will be further developed by adding genometric operations and statistical operations (e.g. clustering) in cooperation with biologists.
• The language allows for queries on the genome involving large datasets describing:
– genomic signals
– reference regions
– distance rules
• http://www.bioinformatics.deib.polimi.it/genomic_computing/GMQL/
20
GenData 2020
21
GenData 2020
22
GenData 2020
23
The Cancer Genome Atlas (TCGA)
• Comprehensive genomic characterization and analysis of more than 30 cancer type tissues to accelerate the understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing
• (2006) Coordinated joint effort of the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) both of the National Institute of Health (NIH) - USA
• Aim: improve the ability to diagnose, treat and prevent cancer
• A free-available platform to search, download, and analyze data sets containing clinical information, genomic characterization data, and high level sequence analysis of the tumor genomes
• Open access
24
Data types
• Tissue pathology data
• Images
• Pathology reports
• Copy-number alterations for non-genetic platforms
• Epigenetic data
• Data summaries, such as genotype frequencies
• Clinical data (biotab and xml)
• DNA Sequencing (whole genome, whole exome, mutations)
• DNA Methylation
• Gene expression data (miRNA, mRNA, Total RNA, Micro- and Protein-arrays)
• Copy numbers
• https://tcga-data.nci.nih.gov/tcga/tcgaDataType.jsp
25
Data archive description
• https://wiki.nci.nih.gov/display/TCGA/Data+archive
26
Metadata (Clinical Data)
• Clinical information about the participant
• Information about how participant samples (biospecimens) were processed by the TCGA Biospecimen Core Resource Center (BCR)
• Available in XML and as a flatfile biotab format
• Example of TCGA clinical data:
− vital status at time of report disease-specific diagnostic information, and initial treatment regimens
− additional clinical follow up information for some or all participants
• Data elements and fields must follow the rules of XML Schema Document (XSD) http://tcga-data.nci.nih.gov/docs/xsd/BCR/
27
TCGA: some statistics
• 30 different tumor types
• 9404 patients
• 13.45 TB experimental data (clinical and genomic)
• 300 different metadata attributes − clinical reports (pdf), images (dcm), blood tests (biotab)
0
200
400
600
800
1000
1200
LAM
L
BLC
A
BR
CA
CH
OL
ESC
A
HN
SC
KIR
C
LIH
C
LUSC
MES
O
PA
AD
PR
AD
SAR
C
STA
D
THYM UC
S
UV
M
Number of patients
Patients
0
1000
2000
3000
4000
5000
6000
7000
LAM
L
BLC
A
BR
CA
CH
OL
ESC
A
HN
SC
KIR
C
LIH
C
LUSC
MES
O
PA
AD
PR
AD
SAR
C
STA
D
THYM UC
S
UV
M
Data quantity (GB)
Data (GB)
28
RNA-Seq, DNA-Methylation, Mutations in Cancer
GELA (Gene Expression Logic Analyzer)
1. Data extraction from The Cancer Genome
Atlas (TCGA), the major repository for
cancer data
2. Focus on breast cancer (1000 patients)
3. Transformation in GenData format
4. Matrix extraction in GMQL
5. Knowledge extraction with GELA and
supervised machine learning techniques
In collaboration with:
Masseroli M., Fiscon G., Cumbo F., Ceri S.
29
Data Mining
• The interdisciplinary field of data mining, which guides the automated knowledge discovery process, is a natural way to approach the complex task of biological data analysis… with the aid of a good bioinformatician
• Data Mining: extraction of knowledge with computer science algorithms for discovering hidden information in heterogeneous data
• The knowledge discovery process:
• Main Applications:
– Classification
– Clustering
• Classification is the action of assigning an unknown object into a predefined class after examining its characteristics
• Clustering partitions objects into groups, such that similar or each other related objects are in same clusters
30
Classification and supervised machine learning
• Classification is the action of assigning an unknown object into a predefined class after examining its characteristics
• Goal: assign an unknown sample to a known class starting from its attributes
• The classification problem may be formulated in the following way:
− given a reference library (training set) composed of samples of known class and
− a collection of unknown samples (query set or test set)
− recognize the latter into the class that are present in the library
• To obtain reliable results
− the query set has to contain only samples from the same class that are present in the reference library
− the reference set has to contain a sufficient number of samples for each class
31
Classification and supervised machine learning
• The user has to provide as input a training set (reference library) containing samples with a priori known class membership
• Based on this training set, the software computes the classification model
• Subsequently, the classification model can be applied to a test set (query set) which contains samples that require classification
• The test set can contain query samples with unknown class membership or, alternatively, samples that also have a priori known species membership, allowing verification of the classifications
32
• Also called logic data mining or classification with logic formulas
• The classifier uses logic propositional formulas in disjunctive or conjuctive normal form (”if then rules”) for classifying the given records, this classification method is also called ruled based
• A logic classifier is a technique for classifying records using a collection of logic propositional formulas (in CNF or DNF), “if… then rules”:
− Antecedent Consequent
− (Condition1) or (Condition2) or … or (Conditionn) Class
− Conditioni: (A1 op v1) and (A2 op v2) and … and (Am op vm)
− A: attribute
− v: value
− op: operator {=, ≠, <, >, ≤, ≥}
Ruled based classification
If pos13 = A and pos400 = G then the virus is RhinoA
33
• WEKA (Waikato Environment for Knowledge Analysis) machine learning software is widely adopted for classification
• WEKA contains several methods to perform supervised classification of general problems
• Input a reference library in arff format
− Data and sequences have to be converted
− sequences have to be of the same region or pre-aligned to the same region
• Weka computes the classification model
• The classification model can be applied to a query set
• For using Weka in biomedical data classification reference and query set have to be converted in arff format or csv
Supervised machine learning: Weka classification software
34
Supervised machine learning: Weka classification software
Mining Big Data using Weka 3 http://www.cs.waikato.ac.nz/ml/weka/bigdata.html
• command-line interface (CLI) to interact with Weka
• use Weka's Knowledge Flow graphical user interface
• write code directly in Java or a Java-based scripting
• train in an incremental fashion
• MOA data stream software containing state-of-the-art algorithms
• Reservoir sampling: incremental sampling
• new packages for distributed data mining:
− distributedWekaBase (“map" and "reduce" tasks)
− distributedWekaHadoop (Hadoop-specific wrappers)
− in the future, other wrappers - for example Spark
35
DMB: a logic data mining system
DMB is based on several computational steps, which have been integrated in the software:
1) Discretization
2) Normalization
3) Feature clustering
4) Feature selection
5) Formulas computation
6) Classification
7) Supplementary Analysis
36
DMB: Discretization
Aim of Discretization :
to transform gene expressions into discrete values
• Converts each gene expression into a new discrete variable that is suitable for treatment into a logic framework
–Determines a set of cut points over the range of values that each variable may assume
–Defines a number of cut points over which the original variable may be considered discrete
• DMB supports two types of discretization, that differ on the rule adopted to select the first set of intervals:
– Unsupervised discretization (unsupervised rules)
– Supervised discretization (supervised rules)
0
0.5
1
1.5
2
2.5
A A A A A A B B B B B B B
Values
Cutpoints
37
DMB: Feature Clustering
Aims of feature clustering:
– To group together similar features
– To reduce the data to be analyzed
• Methods to extract a subset of genes able to characterize the classification model
• Discrete Cluster Analysis (DCA) :
– Discretization is applied to numeric features
– An integer mapping is computed for every feature, that represents the interval in which an experiment falls
– This integer mapping can be represented in a binary form
– Two or more features are merged into the same cluster when their binary representation over the intervals is equal
– Features with the same discretized profile over the samples are clustered
– Finally, a feature for each cluster is elected as its representative
– Clusters composed of a single feature may also be present and are considered as non clustered features
38
DMB: Feature selection
• Aim: identify and remove irrelevant and redundant information from the data set
• Feature selection: selection of the relevant attributes (or features) of the data samples; the identification of a small subset of important attributes or features in a large data set
• The size reduction allows the knowledge extraction algorithm to work faster and more effectively
• Feature selection as a combinatorial problem: Variation of the Set Covering
• To solve the feature selection problem DMB uses an efficient
Greedy Randomized Adaptive Search Procedure (GRASP)
}1,0{
,, ,
max
k
k k
k
k
ijk
x
x
jijixd
otherwise 0
)( 1}1,0{
jkikk
ijij
aaifda
− Data Matrix A (exp. X features)
− xi = 1 if fi is chosen and 0 otherwise
− each constraint is associated with a
pair of items belonging to different
classes
− α is a variable that measures the degree of information redundancy
39
DMB: Formula Extraction
Aim of Formula Extraction :
to extract logic relations that explain the data
• After the output of FS, DMB extracts the logic classification formulas
• The Lsquare method is part of DMB for computing the model of the data, (e.g., the separating ”if-then” formulas or rules)
• Lsquare approaches it as a sequence of Minimum Cost Satisfiability Problems (MinSat), a combinatorial NP-Hard optimization problem
• The problem is solved with an algorithm based on decomposition techniques
• DMB computes for every class of the experimental samples the logic classification formulas in Disjunctive Normal Form
• Inequalities in the form of ”IF Aph1b<0.50” are conjucted in ”AND” and ”OR” clauses
40
DMB: Classification
A logic classifier is a technique for classifying records using a collection of “if… then rules”, named logic formulas:
– Antecedent Consequent
– (Condition1) or (Condition2) or … or (Conditionn) Class
– Conditioni: (A1 op v1) and (A2 op v2) and … and (Am op vm)
– A = attribute; v = value; op = operator {=, ≠, <, >, ≤, ≥}
• Example of logic classification formula is
• The evaluation of the logic formulas and the classification of the samples to the right class is performed with the following steps:
– The formulas are firstly weighted with the Laplace Score on the training set and then applied on the test set for performing the classification assignments
– Additional cut offs of logic formulas with sub-optimal coverage are done by considering the false positive and true positive rates
“IF Aph1b<0.507 then the experimental sample is CONTROL”
41
Logic Data Mining software
DMB software tools:
http://dmb.iasi.cnr.it
Train
files
(dmb)
Test file
fasta
Conversion to
dmb file format
<<Fasta2Dmb.java>>
Cutpoints
Definition
<<makeclasses.c>>
Cutpoints
file
Discretization
<<discretize.c>>
Binary
Test
(dmb)
Creation of the
FS Problem
<<creasc.c>>
Lsc
files FS Problem
Resoluton
<<grasp2.c>> Grasp /
features
files
Feature
Selection
<<filterfeatures.c>>
Creation and
Resolution of the
logic Problem
<<crelogic.c>>
Logic
files
Evaluation
<<evaluate.c>>
Output
FIles
Test
files
(dmb)
Train
files
(dmb)
Binary
Train
(dmb)
Filtered
Test
(dmb)
Filtered
Train
(dmb)
Train file
(fasta)
42
Multiple solutions extraction
• Feature elimination techniques: CAMUR – For logic formulas with just one literal,
e.g., if pos122=C then Polyomavirus WU, eliminate feature 122 and run the analysis again, iterate until the accuracy is lower then a given threshold
– For logic formulas with more conjuncts, e.g., if pos266=T OR pos78=T AND pos80=A then RhinovirusA,
eliminate all the possible combinations of features and each time run the analysis again; iterate until the accuracy is lower then a given threshold
• Feature selection techniques: MISSAL – Explore the solution space by computing equivalent solutions:
Modification of the DMB FS algorithm Take into account adjacent features and
prefer solutions with adjacent features
43
Biological applications of supervised learning and rule-based classification:
Aim: To extract relevant features from the ever-increasing amount of
biological data and to apply supervised learning to classify them
Biology Issue Features Software Data source
Clinical patient classification
Clinical variables (blood, imaging, psicosometric tests…)
DMB, Weka Heterogeneous health care facilities
Gene Expression Analysis
Discretize gene expression profiles
Gela, CAMUR TCGA, EBRI
DNA barcoding Nucleotide sequences of DNA-barcode
Blog, Fasta2Weka Barcode of Life Consortium
Polyoma/Rhyno Viruses
Nucleotide sequences of Polyoma/Rhyno viruses
DMB, MISSAL Istituto Superiore di Sanità
EEG signals processing
Fourier Coefficients extracted from EEG recordings
Matlab, Weka, DMB IRCCS Centro di Neurolesi “Bonino-Pulejo” of Messina
44
Clinical Patients
• The aim of the project is to design a new diagnostic model and possibily a work-flow for the early diagnosis of dementias
• Clinical patients trial samples were collected in different Italian health structures
• Collected data included demographic characteristics, medical history, pharmacological treatments, clinical and neurological examination, psychometric tests, laboratory blood tests, imaging… (700 features x 5000 experiments)
• The data set:
45
Preprocessing and missing values
• Fundamental in clinical data analysis
• Produces clean data and is crucial for obtaining reliable knowledge by the data mining algorithms
• Perform common statistical analysis for every trial: mean, standard deviation, minimum, maximum, number of missing values, modal value, etc.
• Trials filtering:
– exclusion of the trials with a high number of missing values, e.g. more than 20%
– filling of missing values. e.g. by the class mean
– outliers detection by mean - standard deviation - minimum - maximum values analysis and records removal
– correction of the mixed attributes types
46
Preprocessing and missing values
• Missing values treatment:
– Most common attribute value
– Assigning all possible values of the attribute
– Ignoring examples with unknown attribute values
– Event-Covering method
– Treating missing attribute values as special values
– Filling of missing values with the class mean (for numeric variables )
– Filling of missing values with the class modal value (for categorical variables)
• Special scripts and programs are necessary
47
Clinical Patients
• Results:
• Classification Results
Clinical data mining: problems, pitfalls and solutions. E. Weitschek, G. Felici and P. Bertolazzi. IEEE DEXA 2013
48
Gene Expression Profiling
• Major advances in microarray and Next Generation Sequencing technologies lead to an ever-increasing amount of gene expression data available to biological scientists and bioinformaticians
• RNA-seq:
− High-throughput Sequencing and large-scale transcriptome analysis
− Family of methods for detecting the transcriptome expression level
− RNA-Seq technology performs mapping and transcriptome quantification relying on NGS
− It can be used for
• Annotation
• Quantification Determination of abundance of transcripts under different conditions
49
GELA: Gene Expression Logic Analyzer
http://dmb.iasi.cnr.it/gela.php
50
GELA: Software Engineering
• GELA relies on the software architectural pattern Pipes and Filters for computing streams of data.
• Component and flow diagrams:
51
GELA: GUI Release
• Graphic user interface [dmb.iasi.cnr.it/gela.php]
• Guides the user in the RNA-seq clustering and classification process
• Java Swing framework for visualizing the data set, running the software, performing the additional analysis and viewing the clusters, the classification results and the logic separating formulas
• A complete user manual is available on the GELA website.
52
GELA: Command line
• Command line version [dmb.iasi.cnr.it/gela.php]
– Dedicated to the users who want to perform long experiments and batch the entire analysis process
– Compiled and tested on Linux and Windows operating systems
– Source code released for compiling on alternative systems
– Complete user guide is published
53
Study N. 1
• To characterize the gene expression profile of CONTROL versus ALZHEIMER DISEASED mice in early stage (1-3 months) and late stage (6-15 months)
• 119 experimental samples and 16,515 gene expression profiles
• To identify a limited set of genes able to discriminate between the neuro-degeneration and the healthy state
• To identify clusters of corr- and contra-regulated genes
Data Analysis steps
1) Discretization
2) Gene clustering
3) Feature selection
4) Formula extraction
5) Post processing and correlation analysis
GELA: Alzheimer Data Analysis Results
Raw data
54
GELA: Alzheimer Data Analysis Results
• Discrete Clustering method DCA allow to reduce the whole gene set to
‐ 3656 for 1-3 months
‐ 3615 for 6-15 months
• 8 genes are highly co-regulated or contra-regulated ,and able to separate exactly all the healthy from the sick mice (iterative application of the method) in 30-fold cross validation [13]
• More genes are strongly co-regulated with the 8 genes network
If Aph1b<0.50 then the individual is control
55
GELA vs other methods
• Classification Analysis We tested the following widespread algorithms (WEKA [23] implementations) : K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Random Forest (RF) , and C4.5 Decision Tree (C4.5)
• Results:
• GELA outperforms other methods
• Advantages of GELA:
₋ to extract meaningful and compact models
₋ clustering capabilities
₋ availability as an integrated tool
56
GELA: Data Analysis Results
Study N. 2 • Data sets downloaded from public repositories ArrayExpress and GEO
• Files normalized using the standard Affymetrix Expression Console software (ver 1.2), by the MAS5 algorithm
• To classify into diseased and healthy status samples of:
- 54,613 gene expression profiles of 176 Psoriasis human subjects
(85 control and 91 diseased)
- 22,215 gene expression profiles of 178 Multiple Sclerosis human subject (44 control and 134 diseased)
Results:
• GELA outperforms the other methods
57
GELA: Data Analysis Results
Study N. 3
• Data sets downloaded from the public repository The Cancer Genome Atlas (TCGA)
• To characterize the gene expression profile of CONTROL versus BREAST CANCER human samples
• To classify samples of 20531 gene expression profiles of 783 subjects into the tumor
and the healthy state
Results:
• GELA achieves comparable results of SVM and RF, that unfortunately produce classification models whose interpretation is very difficult for human beings
58
CAMUR: cancer analysis
Classification of RNA-seq cancer data through multiple rule-based models
• RNA-seq gene expression analysis and specifically on case-control studies for cancer with rule-based classification algorithms designed to build models, that discriminate cases from controls
• State of the art algorithms typically extract a single classification model that contains few features (genes)
• Goal: elicit a higher amount of knowledge by computing more alternative classification models and therefore to discover several features that are related to the predicted class (cancer)
Database of rules
59
CAMUR: cancer analysis
CAMUR (Classifier with Alternative and MUltiple Rule-based models): a new method and software package able to extract multiple, alternative, and equivalent classification models
– iteratively computes a rule-based classification model
– calculates the power set (or a partial combination) of the features present in the rules
– iteratively eliminates those combinations from the data set
– performs again the classification procedure until a stopping criterion
60
DNA Barcoding: alignment based analysis
• When analyzing biological sequences with a classical approach an overlapping gene region is necessary, due to the fact that an analysis of the characteristics nucleotides present in a determined position for every class is performed
• This is obtained through an alignment between the sequences to analyze, i.e. align areas of the sequences sharing common properties
IF BASE IN POSITION 466 IS C AND
BASE IN POSITION 595 IS T THEN SPECIES IS …1
IF BASE IN POSITION 340 IS G AND
BASE IN POSITION 451 IS A AND
BASE IN POSITION 493 IS C THEN SPECIES IS … 2
IF BASE IN POSITION 340 IS T AND
BASE IN POSITION 466 IS A AND
BASE IN POSITION 625 IS G THEN SPECIES IS … 3
Species 123456789
Species1 ACGTTAACA
Species1 GTCAGTCCA
Species2 GTCGTATAT
Species2 AGCTACGAT
61
DNA Barcoding
• DNA is used for classifying species
• Barcode: small sequence of DNA containing all the information necessary to identify a species of an individual (classification), or to identify new species:
‒ Gene region COI for animals (approx. 650 bp)
‒ Gene regions rbcl and matk for plants (approx. 1000 bp)
‒ Gene regions ITS for fungi (approx 400 bp)
• The sequences are from the same gene region and aligned
• A positional analysis is performed for distinguishing the different species
IF BASE IN POSITION 466 IS C AND
BASE IN POSITION 595 IS T THEN SPECIES IS …1
IF BASE IN POSITION 340 IS G AND
BASE IN POSITION 451 IS A AND
BASE IN POSITION 493 IS C THEN SPECIES IS … 2
IF BASE IN POSITION 340 IS T AND
BASE IN POSITION 466 IS A AND
BASE IN POSITION 625 IS G THEN SPECIES IS … 3
If pos3 = A and pos458 = C then the specimen is a
62
DNA Barcoding
BLOG 2.0
63
BLOG 2.0 vs other methods
Data set NJ PAR NN BLAST BAR BLOG
1,000 83.69 73.31 86.18 86.18 86.25 85.96
10,000 85.53 79.79 86.11 86.09 86.83 88.15
50,000 84.20 79.53 84.76 84.56 85.24 84.58
Overall 84.47 77.54 85.68 85.61 86.11 86.23
70.00
72.00
74.00
76.00
78.00
80.00
82.00
84.00
86.00
88.00
90.00
Ne = 1,000 Ne = 10,000 Ne = 50,000 overall
NJ
PAR
NN
BLAST
DNA-BAR
BLOG
Data set NJ PAR NN BLAST BAR BLOG
Drosophila 83.90 83.90 82.20 83.90 83.90 96.61
Inga 91.28 81.40 88.37 82.56 94.19 90.12
Cypraeidae 91.53 85.31 91.24 92.66 93.22 92.66
Overall 88.90 83.53 87.27 86.37 90.43 93.13
70.00
75.00
80.00
85.00
90.00
95.00
100.00
Inga Cypraeidae Drosophila Overall
NJ
PAR
NN
BLAST
DNA-BAR
BLOG
64
LAF: Logic Alignment Free
• Alignment free k-mer frequency count analysis: – oligomers frequencies: computation of the substring frequencies of a given length k – the k-mers are computed by counting the occurrences of the substrings with a
sliding window of length k, starting at position 1 and ending at position n−k+1 – frequency vector, where each component of the vector is associated to the
frequency of a particular k-mer – coordinate space, that is mathematically tractable
• Combination of alignment free methods and logic data mining:
• http://dmb.iasi.cnr.it/laf.php
65
LAF: Logic Alignment Free
• The LAF analysis is based on the supervised machine learning paradigm: class a priori assigned; training and test set
• Considering each sequence g in the data set, the following steps are performed by LAF
1. The reverse complement g’ of the sequence g is calculated 2. The sequence g is concatenated with its reverse complement g’, obtaining the
sequence G = g + g’ 3. The k-mer counts are computed on G with k = 3 … 6 and
stored in the feature vector F 4. The frequency vector F’ is obtained from F 5. The frequency vectors are collected in a
matrix, whose rows correspond to k-mers, and whose columns correspond to sequences
6. A discretization of the frequencies is performed
7. The data matrix is then processed by a rule-based classifier
66
LAF and Bacteria classification
• Whole genome bacteria classification: – Whole genomes are very difficult to align, because of their length – The method is applied for classifying bacteria in the different levels of the
phylogenetic tree (phylum, class, family, genus, species) – 1964 bacteria whole genome sequences downloaded from GenBank
– Very promising results, especially at the phylum level (98% correct classification rates)
67
Virus genomic sequences analysis
• Viruses are organisms that are hard to identify
• Cause of several diseases which affect human and animal species
• Bioterrorism
68
Rhinoviruses identification
• Rhino from the Greek "nose"
• Most common viral infective agents in humans and are the predominant cause of the common cold
• 1316 Rhinovirus A, B, C sequences of VP42 gene region (~370 bp)
• 116 complete genomes of the three different rhinoviruses (~7200 bp)
69
Rhinoviruses identification
• Positional analysis on VP42 gene region
– High recognition rates (accuracy, recall, precision, f-measure > 90%)
• LAF analysis on complete genomes
– High recognition rates (accuracy, recall, precision, f-measure > 90%)
RHINOVIRUSA RHINOVIRUSB RHINOVIRUSC
pos165=C OR pos165=T pos364=T pos185=C AND pos226=C
pos120=G pos363=T pos70=T
pos190=A pos226=G pos269=T
pos293!=A pos163=G AND pos293=A pos163!=G
pos98=T pos73=C AND pos98=A pos73!=C AND pos98=A
pos50=A AND pos334=G pos334=A pos50=T
pos266=T OR pos78=T AND
pos80=A pos80!=A AND pos 266 =C
pos78!=T AND pos80=A AND
pos266=C
pos53=C AND pos317!=T OR
pos314=C
pos53=T AND pos314=A OR
pos53=A AND pos317=A
pos53=A AND pos317!=A OR
pos53=C AND pos317=T
RHINOVIRUSA RHINOVIRUSB RHINOVIRUSC
GCA<31.92 AND
GCC<16.16
18.71<=CCC<24.18 AND
GCA>=31.92 OR
GCA>=31.92 AND
CC<16.16
18.71<=CCC<24.18 AND
GCA>=31.92
OR
GCA>=31.92 AND GCC<16.16
70
Polyomaviruses identification
• Polyomaviruses are oncogenic (tumor-causing) viruses
• They often persist as latent infections in a host without causing disease, but may produce tumors in a host of a different species
• The name polyoma refers to the viruses' ability to produce multiple (poly-) tumors (-oma)
• Aim of this study :
– find nucleotide positions that allow to distinguish the five human polyomaviruses known to date
– examine and explain them for an effective biological analysis
71
Polyomaviruses identification
• Genome length of each polyomavirus (1200 individuals):
– about 5 Kb in length (in term of nucleotide string from 700 to 2500)
•
• Results: The logic data mining method
– identifies a subset of positions that characterize the polyomaviruses and the gene regions
– exhibits high recognition and classification rates
72
Polyomaviruses identification
73
Polyomaviruses identification
1 ---------- --TTATCT-- -GAAGAG--- AAGCAGCTA- --CAA-----
----->
51 -TGTTTGGAC ---AGT---- ---------- ---------- ---AGACAA-
101 --TTAGTAAA G--------- TAT---CCA- -----GGA-- ----------
151 AGC---AAGC TT-------- ---------- ---------G ACTCA---AG
201 CAGTGTACAT GATTTA---- ---------- -AATATA--- CAA---TCT-
251 -----CCT-- -TAC---ACA CCT------- --GAG----- -TCA---TTC
301 AACACC---- --------GA GTTA------ ---GAATCC- --CCATCT--
351 -CCAAAAAGA AGTGCACCA- --GAGGAGCC TAGCTGTTCT CAGGCAACCC
<----
401 CTCCTAAGAA AAAACATGCA TTTGATGCTT CTTTAGAATT TCCTAAAGAG
----->
451 TTGTTAGAGT TTGTTTCACA TGCTGTATTT AGTAATAAGT GTATAACGTG
<-------
501 C---GTAGTA CAT---ACTA GAGAAAAA-- -GAAGTACTT TAT---AAGT
---->
...
74
Supervised Learning and Logic Data Mining for Virus sequences analysis
Recognize the unknown samples into the virus classes of the training
set of virus sequences with a priori known class (training set)
set of unknown samples (test set)
• Classification
to assign an unknown object to a predefined class after examining its features
• Classification with logic formulas
to classify the unknown object by using logic propositional formulas in disjunctive or conjunctive normal form (”if then rules”)
If pos13 = A and pos400 = G then the virus is RhinoA
Aim: To extract relevant features from the ever-increasing amount of
biological data and applying supervised learning to classify them
75
Multiple solutions extraction
• Feature selection technique
− Explore the solution space by computing equivalent solutions
− Each solution evaluated in term of 𝜶, 𝜷, 𝝈
− Find the smallest set of adjacent features within a segment
that are able to classify the given objects
• Advantages:
− reduce sequencing costs
− point to the most compact meaningful portions of sequences
− more reliable classification models
all the solutions equivalent in term of 𝜶, 𝜷 but
differ for 𝝈 value
𝛽
𝜎
𝛽
Starting point 𝜎 Segment size 𝛽
Aim: to identify set of adjacent equivalent interesting solutions in different biological issues
76
• Results: Compact Adjacent Solutions Extraction
Aim:
- find nucleotide positions that allow to distinguish and characterize gene regions of
Polyoma-, Rhino-, and Influenza viruses
⁻ identify set of multiple compact and locally adjacent interesting solutions
⁻ efficiently extract a large number of highly reliable and compact classification rules
Influenza
Polyoma
Rhino
Viruses sequences identification: MISSAL (MultIple Solutions Sequence AnaLyzer with proximity control)
77
• Results: Solutions Map onto the sequence
Aim:
- find nucleotide positions that allow to distinguish and characterize gene regions of
Polyoma-, Rhino-, and Influenza viruses
⁻ identify set of multiple compact and locally adjacent interesting solutions
⁻ efficiently extract a large number of highly reliable and compact classification rules
Viruses sequences identification: MISSAL (MultIple Solutions Sequence AnaLyzer with proximity control)
78
• Results: Classification results on training set
Aim:
- find nucleotide positions that allow to distinguish and characterize gene regions of
Polyoma-, Rhino-, and Influenza viruses
⁻ identify set of multiple compact and locally adjacent interesting solutions
⁻ efficiently extract a large number of highly reliable and compact classification rules
Viruses sequences identification: MISSAL (MultIple Solutions Sequence AnaLyzer with proximity control)
If pos13 = A and pos400 = G → 𝑉𝑖𝑟𝑢𝑠 = PolyomaBK
79
EEG signal processing and patients classification
• Electro Encefalo Grams (EEG) of 102 patients: – 51 Alzheimer Diseased – 37 Mild Cognitive Impairment – 14 Control
• 19 electrodes • Signal duration of 300 seconds • sampling frequency of 1024 or 256 sp/s • 9 GB of raw signals
• Features extraction: – extraction of 180-seconds signals – transform each one at 256 sp/s – frequency analysis by using Fast Fourier Transform (FFT)
80
EEG signal processing and patients classification
• Analysis: We apply the FFT functions to the collected data
₋ taking into account the 180-seconds signal and extracting N Fourier Coefficients (N equal to 16 and 32);
₋ dividing the signal in 6 epochs of 30 seconds and extracting for each one 16(32) Fourier Coefficients
• Classification:
Classification performances [%] for 180-sec EEG signals
Classification performances [%] for EEG epochs of 30-sec
81
Collaborations
82
Contacts
• Emanuel Weitschek University Roma Tre Department of Computer Science and Automation Rome, Italy www.dia.uniroma3.it/~emanuel emanuel@dia.uniroma3.it
• Giulia Fiscon University Sapienza Department of Computer, Control, and Management Engineering Rome, Italy http://www.iasi.cnr.it/~gfiscon/ fiscon@dis.uniroma1.it
top related