Monday, November 2, 2015 main data mining...
TRANSCRIPT
April 20, 2023 Main Data Mining Techniques for Biomedical Informatics 1
Data Analysis (by DM Techniques)
for Biomedical Informatics
Chen, Chun-Hsien
Department of Information Management
Chang Gung University
Outline
Motivation for data mining in biomedical informatics
What is data mining?
Applications of data mining
Data mining process
Main data mining techniques
Motivation
Data explosion problem
  Tremendous number of Web pages
  40 billion photos on Facebook
  1 million new transactions/hour in the Walmart database
  Big data in clouds
  National Health Insurance Research Database (全民健康保險研究資料庫): inpatient prescription and treatment orders, ~17 × 10^6 records as of 2008/12
We are drowning in data, but starving for knowledge!
Solution: data mining (a KDD technology)
  One of the 10 emerging technologies that will change the world in the near future (MIT Technology Review)
What Is Data Mining?
Data mining
  Automatic extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) knowledge (rules, regularities, patterns, trends, associations) from large amounts of data
What is not data mining?
  Google/database query processing
  Expert systems or simple statistical programs
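Example: Mining a Concept Hierarchy
[Figure: a concept hierarchy with four levels, from general to specific]
  all: all
  region: Europe, ..., North_America
  country: Germany, ..., Spain (under Europe); Canada, ..., Mexico (under North_America)
  city: Frankfurt, ... (under Germany); Vancouver, ..., Toronto (under Canada)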
Part of International Sales Data
Region          Country  City         Office
North American  USA      New York     Queen
North American  Canada   Vancouver    L. Chan
North American  USA      L.A.         Bay Area
North American  USA      Boston       Northern Area
North American  Canada   Toronto      Central
North American  USA      Boston       Southern Area
North American  USA      New York     Queen
North American  USA      L.A.         Bay Area
North American  Mexico   Mexico City  Empire
North American  Canada   Toronto      Central
North American  USA      New York     Manhattan
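One common heuristic for deriving a concept hierarchy from records like these is to order the attributes by how many distinct values each one takes: the fewer the distinct values, the higher the level. The slides do not say which method was used, so the sketch below is only an illustration of that heuristic on the sales records above; the names `ROWS`, `COLUMNS`, `distinct_counts`, and `order_hierarchy` are made up for this example.

```python
# Rows copied from the sales records above: (region, country, city, office).
ROWS = [
    ("North American", "USA", "New York", "Queen"),
    ("North American", "Canada", "Vancouver", "L. Chan"),
    ("North American", "USA", "L.A.", "Bay Area"),
    ("North American", "USA", "Boston", "Northern Area"),
    ("North American", "Canada", "Toronto", "Central"),
    ("North American", "USA", "Boston", "Southern Area"),
    ("North American", "USA", "New York", "Queen"),
    ("North American", "USA", "L.A.", "Bay Area"),
    ("North American", "Mexico", "Mexico City", "Empire"),
    ("North American", "Canada", "Toronto", "Central"),
    ("North American", "USA", "New York", "Manhattan"),
]
COLUMNS = ["region", "country", "city", "office"]


def distinct_counts(rows, columns):
    """Count the distinct values that appear in each column."""
    return {name: len({row[i] for row in rows}) for i, name in enumerate(columns)}


def order_hierarchy(rows, columns):
    """Order attributes from most general (fewest distinct values) to most specific."""
    counts = distinct_counts(rows, columns)
    return sorted(columns, key=lambda name: counts[name])


counts = distinct_counts(ROWS, COLUMNS)
hierarchy = order_hierarchy(ROWS, COLUMNS)
print(counts)
print(" < ".join(hierarchy))
```

The heuristic can fail when a specific attribute happens to have few values in the sample, so the resulting order is best treated as a suggestion for a domain expert to review.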
General Applications of Data Mining
Decision support
Biomedical decision support
Fraud detection and management
Market analysis and management
Risk analysis and management
Specific Applications of Data Mining (Related to the Biomedical Domain)
Using data mining techniques
  Help disease screening, diagnosis, and treatment
  Help identify related genes of genetic diseases
  Help drug design and discovery
Using text mining and data mining techniques
  Help find related genes of genetic diseases from medical literature
Steps in a KDD Process (Technically)
(KDD: Knowledge Discovery in Databases)
[Figure: Raw data -> Data Preprocessing -> Clean, Relevant Data -> Data Mining -> Pattern -> Evaluation/Presentation -> Knowledge]
Main Steps of a KDD Process (Fully)
Domain knowledge acquisition
  Learning important, relevant knowledge and goals of the application
Data collection and preprocessing (may take 60% of effort)
  Data generation, cleaning, and selection
  Data integration, reduction, and transformation
Data mining (searching for interesting patterns)
  Choosing function types of data mining: classification, association, clustering, summarization, regression
  Choosing the mining algorithm(s)
Pattern evaluation and knowledge presentation
  Visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
Steps in a KDD Process (Step 1)
[Figure: Raw data -> Data Preprocessing -> Clean, Relevant Data]
Why Data Preprocessing?
Data in the real world is dirty
  incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  noisy: containing errors or outliers
  inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!
  Quality decisions must be based on quality data
Major Tasks in Data Preprocessing
Data cleaning
  Fill in missing values, smooth noisy data, identify or remove outliers
Data integration
  Integration of multiple databases, data sources, or files
Data transformation
  Normalization and aggregation
Data reduction
  Variable reduction, data set reduction, data representation reduction
Data discretization
  Reduce the number of values for variables, especially for numerical variables
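Three of the tasks above (cleaning, transformation, and discretization) can be sketched in a few lines of plain Python. This is a minimal illustration, not the procedure used in the lecture: filling with the mean, min-max scaling, and the age cut points are all assumptions chosen for the example, and the `ages` values are made up.

```python
from statistics import mean


def fill_missing(values):
    """Data cleaning: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, cut_points, labels):
    """Data discretization: map each number to the label of its interval."""
    result = []
    for v in values:
        # The interval index is the number of cut points strictly below v.
        index = sum(1 for cut in cut_points if v > cut)
        result.append(labels[index])
    return result


ages = [25, None, 35, 45, 70]  # hypothetical ages, one of them missing
cleaned = fill_missing(ages)
scaled = min_max_normalize(cleaned)
binned = discretize(cleaned, cut_points=[30, 40], labels=["<=30", "31..40", ">40"])
print(cleaned)
print(scaled)
print(binned)
```

Mean imputation is only one option; for skewed clinical measurements a median or a model-based fill may be more appropriate.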
Steps in a KDD Process
[Figure: Raw data -> Data Preprocessing -> Relevant Data -> Data Mining -> Pattern]
Main Data Mining Techniques
Association Rule Mining
Classification
Cluster Analysis
Outlier Analysis
Trend Analysis
Linear Regression
Main Data Mining Techniques: Association Rule Mining (1/5)
Association rule mining
  Goal: find associations and/or correlations
  Finding strong rules:
    sales(T, "computer") ⇒ sales(T, "software") [support = 1%, confidence = 75%]
    sales(T, "beer") ⇒ sales(T, "diaper") [support = 2%, confidence = 70%]
    age(X, "20..29") ^ income(X, "30..39K") ⇒ buys(X, "PC") [support = 2%, confidence = 60%]
  [Annotations on the rules: data context, data item]
Support and Confidence
Association rule mining
  Find all the rules X ⇒ Y with minimum support S and confidence C
  support S: the probability that a transaction contains both X and Y
  confidence C: the conditional probability that a transaction containing X also contains Y

Transaction ID  Items Bought
10001           A, B, C
20002           A, C
30003           A, D
40004           B, E, F

A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)
[Figure: Venn diagram of customers who buy beer (X), customers who buy diapers (Y), and customers who buy both]
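The two figures quoted for the rules between A and C can be checked directly from the four transactions. The sketch below is a minimal illustration; the function names `support` and `confidence` are chosen for this example, and the transaction IDs are read from the flattened table as 10001 to 40004, which is an assumption that does not affect the result.

```python
# Transactions from the table above: ID -> set of items bought.
TRANSACTIONS = {
    10001: {"A", "B", "C"},
    20002: {"A", "C"},
    30003: {"A", "D"},
    40004: {"B", "E", "F"},
}


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """Estimate P(rhs | lhs) as support(lhs and rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)


print(support({"A", "C"}, TRANSACTIONS))       # support of A => C and of C => A
print(confidence({"A"}, {"C"}, TRANSACTIONS))  # confidence of A => C
print(confidence({"C"}, {"A"}, TRANSACTIONS))  # confidence of C => A
```

Support is symmetric in X and Y, while confidence is not, which is why the two rules share a support of 50% but differ in confidence.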
Main Data Mining Techniques: Classification (2/5)
Classification
  Finding models that describe and distinguish classes for future forecasting
  Representation models: decision tree, neural network
Typical applications
  disease screening, diagnosis & treatment
  credit card/loan approval
  target marketing
  pattern recognition
An Example of Classification (Fruit Classifier)
[Figure: input features -> Classifier -> output (class label)]
  shape = round, color = red -> Apple
  shape = round, color = orange -> Orange
  Other labels shown in the figure: oval, red, orange, yellow; Mango
A General Classifier
[Figure: input features -> Classifier -> output (class label)]
Model of Supervised Learning
The model is in the form y = f(x1, x2, ..., xn)
[Figure: input features x1, x2, ..., xn -> Classifier f -> output y]
Main issues:
  What are x1, ..., xn?
  How to get the model f?
  How to collect training data with output y?
Classification: A 2-Step Process
Model construction
  [Figure: Training Data (I, O) -> Classification Learning Algorithms -> Classifier Model]
Model usage
  [Figure: input features -> Classifier Model -> output (class label)]
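The two steps map naturally onto a "fit, then predict" interface. The sketch below uses a 1-nearest-neighbor classifier over categorical features purely as an illustration; the class name `NearestNeighborClassifier` is invented for this example, and the training rows are toy fruit data in which the Mango row is a hypothetical example rather than something stated in the slides.

```python
class NearestNeighborClassifier:
    """A 1-nearest-neighbor classifier over categorical feature tuples."""

    def fit(self, features, labels):
        # Step 1, model construction: this simple model just stores
        # the training pairs (input I, output O).
        self.examples = list(zip(features, labels))
        return self

    def predict(self, x):
        # Step 2, model usage: return the label of the stored example
        # that differs from x in the fewest feature positions.
        def mismatches(example):
            stored, _ = example
            return sum(a != b for a, b in zip(stored, x))

        _, label = min(self.examples, key=mismatches)
        return label


# (shape, color) -> fruit. The Mango row is a hypothetical training example.
train_x = [("round", "red"), ("round", "orange"), ("oval", "yellow")]
train_y = ["Apple", "Orange", "Mango"]

model = NearestNeighborClassifier().fit(train_x, train_y)
prediction = model.predict(("oval", "yellow"))
print(prediction)
```

For most other methods (decision trees, neural networks, naive Bayes) the construction step does real work, but the same split between building the model and using it applies.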
Main Classification Methods
Decision tree
Artificial neural networks
Naïve Bayesian classification
k-nearest neighbor classifier
Training Dataset Example for buys_PC (An Example from Quinlan's ID3)
age     income  student  credit_rating  buys_PC
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Example: A Decision Tree for “buys_PC”
age?
  <=30: student?
    no: no
    yes: yes
  31..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
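ID3 chooses the attribute to test at each node by information gain, and on the buys_PC training data that criterion puts age at the root. The sketch below recomputes the gains from the 14 training records; the helper names `entropy` and `information_gain` are chosen for this example, and the code shows only the root selection, not a full tree builder.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["age", "income", "student", "credit_rating"]

# The 14 training records: (age, income, student, credit_rating, buys_PC).
DATA = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of the class labels minus the weighted entropy after splitting."""
    labels = [row[-1] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


gains = {name: information_gain(DATA, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("root attribute:", best)
```

Applying the same selection recursively to each branch (the records with age <=30, 31..40, and >40) yields the student and credit rating tests shown in the tree.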
Extracting Classification Rules from a Decision Tree
Rules are easier for humans to understand
Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rule examples
  IF age = “<=30” AND student = “no” THEN buys_computer = “no”
  IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
  IF age = “31…40” THEN buys_computer = “yes”
  IF age = “>40” AND rating = “fair” THEN buys_computer = “yes”
  IF age = “>40” AND rating = “excellent” THEN buys_computer = “no”
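The "one rule per root-to-leaf path" idea can be sketched as a short recursive walk over the tree. The nested-tuple representation and the name `extract_rules` are assumptions made for this example; the leaves under age ">40" follow the training dataset, in which a fair credit rating leads to "yes" and an excellent one to "no".

```python
# The buys_PC tree as nested data: an internal node is
# (attribute, {value: subtree}); a leaf is the predicted class label.
TREE = ("age", {
    "<=30": ("student", {"no": "no", "yes": "yes"}),
    "31..40": "yes",
    ">40": ("credit_rating", {"excellent": "no", "fair": "yes"}),
})


def extract_rules(node, conditions=()):
    """Return one IF-THEN rule string for each path from the root to a leaf."""
    if isinstance(node, str):
        # Leaf: the attribute-value pairs collected along the path form a conjunction.
        body = " AND ".join(f'{attr} = "{value}"' for attr, value in conditions)
        return [f'IF {body} THEN buys_PC = "{node}"']
    attribute, branches = node
    rules = []
    for value, subtree in branches.items():
        rules.extend(extract_rules(subtree, conditions + ((attribute, value),)))
    return rules


rules = extract_rules(TREE)
for rule in rules:
    print(rule)
```

The tree has five leaves, so the walk produces five rules, one per leaf.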
A Decision Tree for CAD Screening (Constructed from ~500 Records)
Main Data Mining Techniques: Cluster Analysis (3/5)
Clustering
  Class label is unknown: group data to form new classes
    e.g., disease profiling, patient profiling
  Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
Example of Cluster Analysis
[Figure: scatter plot with labeled points A, B, C and X, Y, Z]
  Points A and B are in the same cluster
  Points X, Y, and Z are outliers
Clustering Example in High Dimension (Cluster Analysis of CAD Data)
[Figure: data matrix for visualization with a clustering dendrogram, showing the profile of CAD patients and the profile of healthy people]
Major Clustering Approaches
Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
Hierarchical algorithms: Create a hierarchical clustering structure for the set of data records using some criterion
Density-based: Based on connectivity and density functions
Grid-based: Quantize the data space into a finite number of cells that form a grid structure on which clustering is performed
Model-based: A model is hypothesized for each of the clusters, and the best fit of the records to the given models is found
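As an illustration of the partitioning approach, the sketch below implements k-means, a widely used partitioning algorithm that the slides do not name explicitly. The criterion it improves is the squared distance from each point to its cluster center. The function names, the toy points, and the choice of initial centers are all assumptions made for this example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, initial_centers, max_iterations=100):
    """Alternate between assigning points to centers and recomputing centers."""
    centers = list(initial_centers)
    for _ in range(max_iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [centroid(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return centers, clusters


# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, clusters = k_means(points, initial_centers=[points[0], points[3]])
print(centers)
print(clusters)
```

The result depends on the initial centers, so in practice the algorithm is usually run several times from different starting points and the best partition is kept.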
Hierarchical Clustering
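Use a distance matrix as the clustering criterion.
[Figure: hierarchical clustering of objects a, b, c, d, e]
  Agglomerative (AGNES), Step 0 to Step 4: {a}, {b}, {c}, {d}, {e} -> {a, b} -> {d, e} -> {c, d, e} -> {a, b, c, d, e}
  Divisive (DIANA), Step 0 to Step 4: the same sequence in reverse order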
Dendrogram: Showing Hierarchically Merged Clusters
Organize the data objects into a tree of clusters with several levels, called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
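Bottom-up merging and cutting the dendrogram can be sketched together: keep merging the two closest clusters, record each merge, and stop once the next merge would lie above the cut level. The sketch below uses single-link distance on one-dimensional points as a simplifying assumption; the names `single_link` and `agglomerate` and the five sample positions are made up for this example.

```python
def single_link(c1, c2):
    """Single-link distance: the smallest gap between members of two clusters."""
    return min(abs(a - b) for a in c1 for b in c2)


def agglomerate(points, cut_distance):
    """Merge the two closest clusters repeatedly; stop at the cut level."""
    clusters = [[p] for p in points]
    merges = []  # record of (left, right, distance): the dendrogram
    while len(clusters) > 1:
        distance, i, j = min(
            (single_link(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if distance > cut_distance:
            break  # every remaining merge lies above the cut
        merges.append((clusters[i], clusters[j], distance))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical 1-D positions for five objects a, b, c, d, e.
positions = [1.0, 1.5, 5.0, 6.0, 6.2]
clusters, merges = agglomerate(positions, cut_distance=2.0)
print(clusters)
for left, right, distance in merges:
    print(left, "+", right, "at distance", round(distance, 2))
```

Raising `cut_distance` above the largest merge distance yields a single cluster, and lowering it below the smallest leaves every object on its own, which corresponds to cutting the dendrogram near the root or near the leaves.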
Gene Expression Analysis by Clustering
Analyze gene behavior from gene microarray data
[Figure: Microarrays -> Clustering]
Profile of Stroke Patients (Diagnosis Indices of Chinese Medicine)
Examples of SOM Feature Map
Other Data Mining Techniques (4/5)
Outlier analysis
  Outlier: a data object that does not comply with the general behavior of the given data set
  It can be considered noise or an exception, but is quite useful in fraud detection and rare event (disease) analysis
Trend analysis
  Trend and deviation: regression analysis
  Sequential pattern mining, periodicity analysis
Other pattern-directed or statistical analyses
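One simple statistical way to flag objects that do not comply with the general behavior of a data set is a z-score test: mark values that lie more than a chosen number of standard deviations from the mean. The slides do not prescribe a particular method, so this is only an illustrative sketch; the threshold of 2.0 and the sample readings are assumptions.

```python
from statistics import mean, pstdev


def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical measurements with one value far from the rest.
readings = [98, 102, 100, 101, 99, 100, 103, 97, 100, 180]
outliers = z_score_outliers(readings)
print(outliers)
```

Because the mean and standard deviation are themselves pulled by extreme values, this test can miss outliers in small samples; median-based variants are a common alternative.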
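Main Data Mining Techniques: Linear Regression (5/5)
Regression example
[Figure: scatter plot of y against x with the fitted line y = x + 1; at X1 the observed value is Y1 and the value on the line is Y1']
Predict Y's value at X1 using linear regression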
Regression Analysis and Log-Linear Models
Linear regression: Y = α + β X
  Two parameters, α and β, specify the line and are to be estimated from the data at hand,
  using the least squares criterion on the known values of Y1, Y2, …, X1, X2, …
Multiple regression: Y = b0 + b1 X1 + b2 X2 + … + bn Xn
  Many nonlinear functions can be transformed into the above.
Log-linear models:
  The joint probabilities of a multi-variable table are approximated by a product of single-variable tables.
  Probability: p(a, b, c, d) = p(a) p(b) p(c) p(d)
  log p(a, b, c, d) = log p(a) + log p(b) + log p(c) + log p(d)
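For a single predictor, the least squares estimates have a closed form: the slope is the ratio of the covariance of X and Y to the variance of X, and the intercept makes the line pass through the point of means. The sketch below writes the intercept and slope as `alpha` and `beta` (names chosen here) and uses toy points that lie exactly on y = x + 1, the line from the regression example; the function name `fit_line` and the query point 5.0 are assumptions.

```python
def fit_line(xs, ys):
    """Least squares estimates (alpha, beta) for the line Y = alpha + beta * X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta = sxy / sxx                # slope
    alpha = y_bar - beta * x_bar    # intercept
    return alpha, beta


# Toy data lying exactly on y = x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0, 4.0]
alpha, beta = fit_line(xs, ys)

# Predict Y at a new X value using the fitted line.
y_predicted = alpha + beta * 5.0
print(alpha, beta, y_predicted)
```

With noisy data the fitted line no longer passes through every point, and the difference between an observed value and the value on the line is the residual that least squares minimizes in aggregate.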
Are All the “Discovered” Patterns Interesting?
A data mining system may generate thousands of patterns; not all of them are interesting.
  Suggested approach: human-centered, query-based, focused mining
Interestingness measures: a pattern is interesting if it is easily understood, potentially useful, novel, valid on new or test data with some degree of certainty, or if it validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures:
  Objective: based on statistics and structures of data patterns, e.g., support, confidence, etc.
  Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.
Can We Find All and Only Interesting Patterns?
Find all the interesting patterns: completeness
  Can a data mining system find all the interesting patterns?
Search for only interesting patterns: optimization
  Can a data mining system find only the interesting patterns?
  Approaches
    First generate all the relevant patterns and then filter out the uninteresting ones.
    Generate only the interesting patterns: mining query optimization
Classification of Data Mining Systems
General functionality
  Descriptive data mining
  Predictive data mining
Different views → different classifications
  Kinds of databases to be mined
  Kinds of knowledge to be discovered
  Kinds of disciplines utilized
  Kinds of applications adapted
A Multi-Dimensional View of Data Mining Systems
Kinds of databases to be mined
  Relational, transactional, WWW, spatial, time-series, text, multi-media, object-oriented, object-relational, heterogeneous, legacy
Kinds of knowledge to be mined
  Association, classification, clustering, trend, characterization, and outlier analysis, etc.
Kinds of disciplines utilized
  Machine learning, statistics, visualization, database-oriented, data warehouse (OLAP), etc.
Kinds of applications adapted
  Biomedical informatics, retail, telecommunication, financing, fraud analysis, stock market analysis, Web mining, etc.
Summary
Data mining: automatically discovering interesting knowledge from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes domain knowledge acquisition, data preprocessing, data mining, pattern evaluation, and knowledge presentation.
Main data mining techniques: association rule mining, classification, clustering, outlier analysis, trend analysis, linear regression, etc.
Thank You !!!!
Have a Nice Day !