concept description and data generalization (baseado nos slides do livro: data mining: c & t)
Post on 20-Dec-2015
215 views
TRANSCRIPT
Concept Description and Concept Description and Data GeneralizationData Generalization
(baseado nos slides do livro: Data (baseado nos slides do livro: Data Mining: C & T)Mining: C & T)
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Two categories of data Two categories of data miningmining
Descriptive miningDescriptive mining: describes concepts or task-: describes concepts or task-relevant data sets in concise, summarative, relevant data sets in concise, summarative, informative, discriminative formsinformative, discriminative forms
Predictive miningPredictive mining: Based on data and analysis, : Based on data and analysis, constructs models for the database, and predicts constructs models for the database, and predicts the trend and properties of unknown datathe trend and properties of unknown data
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
What is Concept What is Concept Description?Description?
Concept description (or class description)Concept description (or class description): : generates descriptions for characterization and generates descriptions for characterization and comparison of datacomparison of data
CharacterizationCharacterization: provides a concise and succinct : provides a concise and succinct summarization of the given collection of datasummarization of the given collection of data
Class comparison (or discrimination)Class comparison (or discrimination): provides : provides descriptions comparing two or more collections of descriptions comparing two or more collections of datadata
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Data GeneralizationData Generalization
A process which A process which abstractsabstracts a large set of task- a large set of task-relevant data in a database from a low relevant data in a database from a low conceptual levels to higher ones.conceptual levels to higher ones.
1
2
3
4
5Conceptual levels
ApproachesApproaches::• Data cube approach(OLAP approach)• Attribute-oriented induction approach
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Concept DescriptionConcept Description vs OLAP vs OLAP
SimilaritiesSimilarities: :
Data generalization
Presentation of data summarization at multiple levels of abstraction.
Interactive drilling, pivoting, slicing and dicing.
DifferencesDifferences::
Complex data types of the attributes and their aggregations
Automated process to find relevant attributes and generalization degree
Dimension relevance analysis and ranking when there are many relevant
dimensions.
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Attribute-Oriented Attribute-Oriented InductionInduction
Proposed in 1989 (KDD ‘89 workshop)Proposed in 1989 (KDD ‘89 workshop) Not confined to categorical data nor particular measures.Not confined to categorical data nor particular measures. How it is done?How it is done?
Collect the task-relevant data (initial relation) using a relational database query
Perform data generalization by attribute removal or attribute generalization, based on the nb. of distinct values of each attribute.
Apply aggregation by merging identical, generalized tuples and accumulating their respective counts
Interactive presentation with users
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Basic Principles (1) Basic Principles (1)
Data focusingData focusing: task-relevant data, including dimensions, : task-relevant data, including dimensions, and the result is the and the result is the initial (working) relationinitial (working) relation..
Attribute-removalAttribute-removal: remove attribute: remove attribute A A if there is a large set if there is a large set of distinct values for of distinct values for AA but: but: (1) there is no generalization operator on A, or
(2) A’s higher level concepts are expressed in terms of other attributes.
Attribute-generalizationAttribute-generalization: If there is a large set of distinct : If there is a large set of distinct values for values for AA, and there exists a set of generalization , and there exists a set of generalization operators onoperators on A A, then select an operator and generalize, then select an operator and generalize AA. .
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Basic Principles (2) Basic Principles (2)
Two methods to control a generalization process:Two methods to control a generalization process:
Attribute-threshold controlAttribute-threshold control: typical 2-8, specified/default: typical 2-8, specified/default if the number of distinct values in an attribute is greater than
the att. threshold, then removal or generalization applies
Generalized relation threshold controlGeneralized relation threshold control: sets a threshold : sets a threshold for the generalized (final) relation/rule sizefor the generalized (final) relation/rule size If the number of distinct tuples in the generalized relation is
greater than the threshold, then further generalization applies
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Basic Principles (3) Basic Principles (3)
Acummulate count or other aggregate values Acummulate count or other aggregate values : to : to provide statistical information about the data at diff. provide statistical information about the data at diff. levels of abstractionlevels of abstraction
Ex: Count value for a tuple in the initial relation is 1,
When generalizing data, n tuples in the initial relation result in groups of identical tuples merged into a single generalized tuple (count is n)
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Basic Algorithm Basic Algorithm
1.1. InitialRelInitialRel: Query processing of task-relevant data, deriving : Query processing of task-relevant data, deriving the the initial relationinitial relation..
2.2. PreGenPreGen:: Based on the analysis of the number of distinct Based on the analysis of the number of distinct values in each attribute, determine generalization plan for values in each attribute, determine generalization plan for each attribute: removal? or how high to generalize?each attribute: removal? or how high to generalize?
3.3. PrimeGenPrimeGen: Based on the PreGen plan, perform : Based on the PreGen plan, perform generalization to the right level to derive a “prime generalization to the right level to derive a “prime generalized relation”, accumulating the counts.generalized relation”, accumulating the counts.
4.4. PresentationPresentation: User interaction: (1) adjust levels by drilling, : User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations.visualization presentations.
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Class Characterization:Class Characterization: Example (1) Example (1)
Describe general characteristics of graduate students Describe general characteristics of graduate students in the Big-University database (in DMQL)in the Big-University database (in DMQL)
use Big_University_DBmine characteristics as “Science_Students”in relevance to name, gender, major, birth_place, birth_date,
residence, phone#, gpafrom studentwhere status in “graduate”
Corresponding SQL statement:Corresponding SQL statement:select name, gender, major, birth_place, birth_date, residence, phone#,
gpafrom studentwhere status in {“Msc”, “MBA”, “PhD” }
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Class Characterization: An Class Characterization: An Example (2)Example (2)
Name Gender Major Birth-Place Birth_date Residence Phone # GPA
JimWoodman
M CS Vancouver,BC,Canada
8-12-76 3511 Main St.,Richmond
687-4598 3.67
ScottLachance
M CS Montreal, Que,Canada
28-7-75 345 1st Ave.,Richmond
253-9106 3.70
Laura Lee…
F…
Physics…
Seattle, WA, USA…
25-8-70…
125 Austin Ave.,Burnaby…
420-5232…
3.83…
Removed Retained Sci,Eng,Bus
Country Age range City Removed Excl,VG,..
Initial Relation
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Class Characterization: An Class Characterization: An Example (3)Example (3)
Gender Major Birth_region Age_range Residence GPA Count M Science Canada 20-25 Richmond Very-good 16 F Science Foreign 25-30 Burnaby Excellent 22 … … … … … … …
Prime Generalized Relation
Birth_Region
GenderCanada Foreign Total
M 16 14 30
F 10 22 32
Total 26 36 62
Cross-tab
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Presentation of Presentation of Generalized Results (1)Generalized Results (1)
Generalized relationGeneralized relation: : Relations where some or all attributes are
generalized, with counts or other aggregation values accumulated.
Cross tabulationCross tabulation::Mapping results into cross tabulation form
(similar to contingency tables). Visualization techniques: Pie charts, bar charts,
curves, cubes, and other visual forms.
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
PresentationPresentation——Generalized RelationGeneralized Relation
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
PresentationPresentation——CrosstabCrosstab
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Presentation of Generalized Presentation of Generalized Results (2)Results (2)
A generalized relation may also be represented in the form of A generalized relation may also be represented in the form of logic ruleslogic rules
Cj = target classCj = target classqqaa = a generalized tuple describing the target class = a generalized tuple describing the target class
t-weight for qt-weight for qaa: percentage of tuples of the target class from the initial : percentage of tuples of the target class from the initial working class that are covered by qworking class that are covered by qaa
range: [0, 1]
n
i
i
at
1
)count(q
)count(qweight
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Presentation of Generalized Presentation of Generalized Results (3)Results (3)
Quantitative characteristic rulesQuantitative characteristic rules: Mapping generalized result : Mapping generalized result into characteristic rules with quantitative information into characteristic rules with quantitative information associated with itassociated with it
The disjunction of the conditions forms a The disjunction of the conditions forms a necessary necessary conditioncondition of the target class, i.e., all tuples of the target of the target class, i.e., all tuples of the target class must satisfy the conditionclass must satisfy the condition
Not a sufficient conditionNot a sufficient condition of the target class, since a tuple of the target class, since a tuple satisfying the same condition could belong to another satisfying the same condition could belong to another classclass
.%]47:["")(_%]53:["")(_)()(
tforeignxregionbirthtCanadaxregionbirthxmalexgrad
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Attribute Relevance Attribute Relevance Analysis (1)Analysis (1)
Why?Why? Which dimensions should be included? How high level of generalization? Automatic vs. interactive Reduce # attributes; easy to understand patterns
What?What? statistical method for preprocessing data
filter out irrelevant or weakly relevant attributes retain or rank the relevant attributes
relevance related to dimensions and levels analytical characterization, analytical comparison
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Attribute relevance Attribute relevance analysis (2)analysis (2)
How?How?1. Data Collection
2. Preliminary relevance analysis using conservative AOI
3. Analytical Generalization Use information gain analysis (e.g., entropy or other
measures) to identify highly relevant dimensions and levels.
Sort and select the most relevant dimensions and levels.
4. Attribute-oriented Induction for class description Using a less conservative threshold for AOI
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Relevance Measures Relevance Measures
Quantitative relevance measureQuantitative relevance measure: : determines the classifying power of an determines the classifying power of an attribute within a set of data.attribute within a set of data.
MethodsMethods:: information gain (ID3)gain ratio (C4.5)gini index2 contingency table statisticsuncertainty coefficient
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Entropy and Information Entropy and Information GainGain
S contains sS contains sii tuples of class C tuples of class Cii for i = {1, …, m} for i = {1, …, m}
Entropy or expected informationEntropy or expected information measures info measures info required to classify any arbitrary tuplerequired to classify any arbitrary tuple
EntropyEntropy of attribute A with values {a of attribute A with values {a11,a,a22,…,a,…,avv}}
Information gainedInformation gained by branching on attribute A by branching on attribute A
s
s
s
s,...,s,ssSE
im
i
im21 2
1
log)I()(
),...,(...
E(A) 1
1
mjj
v
j
mjjssI
s
ss
1
E(A))s,...,s,I(sGain(A) m 21
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Example of Analytical Example of Analytical Characterization (1)Characterization (1)
TaskTask Mine general characteristics describing graduate
students using analytical characterization
GivenGiven attributes name, gender, major, birth_place,
birth_date, phone#, and gpa Gen(ai) = concept hierarchies on ai
Ui = attribute analytical thresholds for ai
Ti = attribute generalization thresholds for ai
R = attribute relevance threshold
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Example of Analytical Example of Analytical Characterization (2)Characterization (2)
1.1. Data collectionData collection target class: graduate student contrasting class: undergraduate student
2.2. Analytical generalization using UAnalytical generalization using U ii attribute removal
remove name and phone# attribute generalization
generalize major, birth_place, birth_date and gpa accumulate counts
candidate relation: gender, major, birth_country, age_range and gpa
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Example: Analytical Example: Analytical characterization (3)characterization (3)
gender major birth_country age_range gpa count
M Science Canada 20-25 Very_good 16
F Science Foreign 25-30 Excellent 22
M Engineering Foreign 25-30 Excellent 18
F Science Foreign 25-30 Excellent 25
M Science Canada 20-25 Excellent 21
F Engineering Canada 20-25 Excellent 18
Candidate relation for Target class: Graduate students (=120)gender major birth_country age_range gpa count
M Science Foreign <20 Very_good 18
F Business Canada <20 Fair 20
M Business Canada <20 Fair 22
F Science Canada 20-25 Fair 24
M Engineering Foreign 20-25 Very_good 22
F Engineering Canada <20 Excellent 24
Candidate relation for Contrasting class: Undergraduate students (=130)
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Example: Analytical Example: Analytical characterization (4)characterization (4)
3. Relevance analysis3. Relevance analysis Calculate expected info required to classify an
arbitrary tuple
Calculate entropy of each attribute: e.g. major
99880250
130
250
130
250
120
250
120130120 2221 .loglog),I()s,I(s
For major=”Science”: S11=84 S21=42 I(s11,s21)=0.9183 For major=”Engineering”: S12=36 S22=46 I(s12,s22)=0.9892 For major=”Business”: S13=0 S23=42 I(s13,s23)=0
Number of grad students in “Science”
Number of undergrad students in “Science”
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Example: Analytical Example: Analytical Characterization (5)Characterization (5)
Calculate expected info required to classify a Calculate expected info required to classify a given sample if S is partitioned according to the given sample if S is partitioned according to the attributeattribute
Calculate information gain for each attributeCalculate information gain for each attribute
Information gain for all attributes
78730250
42
250
82
250
126231322122111 .)s,s(I)s,s(I)s,s(IE(major)
2115021 .E(major))s,I(s)Gain(major
Gain(gender) = 0.0003
Gain(birth_country) = 0.0407
Gain(major) = 0.2115
Gain(gpa) = 0.4490
Gain(age_range) = 0.5971
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Example: Analytical Example: Analytical characterization (5)characterization (5)
4.4. Initial working relation (WInitial working relation (W00) derivation) derivation R = 0.1 remove irrelevant/weakly relevant attributes from
candidate relation => drop gender, birth_country remove contrasting class candidate relation
5.5. Perform attribute-oriented induction on WPerform attribute-oriented induction on W00 using Tusing Tii
major age_range gpa count
Science 20-25 Very_good 16
Science 25-30 Excellent 47
Science 20-25 Excellent 21
Engineering 20-25 Excellent 18
Engineering 25-30 Excellent 18
Initial target class working relation W0: Graduate students
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Mining Class ComparisonsMining Class Comparisons ComparisonComparison: Comparing two or more classes: Comparing two or more classes MethodMethod: :
Partition the set of relevant data into the target class and the contrasting class(es)
Generalize both classes to the same high level concepts Compare tuples with the same high level descriptions Present for every tuple its description and two measures
support - distribution within single class comparison - distribution between classes
Highlight the tuples with strong discriminant features
Relevance AnalysisRelevance Analysis:: Find attributes (features) which best distinguish different classes
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Quantitative Quantitative Discriminant RulesDiscriminant Rules
Cj = target classCj = target class qqaa = a generalized tuple covers some tuples of = a generalized tuple covers some tuples of
target classtarget class but can also cover some tuples of contrasting class
d-weightd-weight range: [0, 1]
quantitative discriminant rule formquantitative discriminant rule form
m
i
ia
ja
)Ccount(q
)Ccount(qweightd
1
d_weight]:[dX)condition(ss(X)target_claX,
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Example (1)Example (1) Compare the general properties between the graduate Compare the general properties between the graduate
students and the undergraduate students at the Big-students and the undergraduate students at the Big-University database, given the attributes: name, gender, University database, given the attributes: name, gender, etc (in DMQL)etc (in DMQL)
use Big_University_DBmine comparison as “Grad-vs-Undergrad”in relevance to name, gender, major, birth_place, birth_date, residence,
phone#, gpafrom “graduate_students”where status in “graduate”versus “undergraduate_students”where status in “undergraduate”analyze count%from student
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Example (2)Example (2)
Quantitative discriminant ruleQuantitative discriminant rule
where 90/(90+210) = 30%
Status Birth_country Age_range Gpa Count
Graduate Canada 25-30 Good 90
Undergraduate Canada 25-30 Good 210
Count distribution between graduate and undergraduate students for a generalized tuple
%]30:["")("3025")(_"")(_
)(_,
dgoodXgpaXrangeageCanadaXcountrybirth
XstudentgraduateX
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Class Description Class Description Quantitative characteristic ruleQuantitative characteristic rule
necessary Quantitative discriminant ruleQuantitative discriminant rule
sufficient Quantitative description ruleQuantitative description rule
necessary and sufficient ]w:d,w:[t...]w:d,w:[t nn111
(X)condition(X)condition
ss(X)target_claX,
n
d_weight]:[dX)condition(ss(X)target_claX,
t_weight]:[tX)condition(ss(X)target_claX,
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Example: Quantitative Example: Quantitative Description RuleDescription Rule
Quantitative description rule for target class Quantitative description rule for target class EuropeEurope
Loc./item TV Computer Both_items
Count t-wt d-wt Count t-wt d-wt Count t-wt d-wt
Europe 80 25% 40% 240 75% 30% 320 100% 32%
N_Am 120 17.65% 60% 560 82.35% 70% 680 100% 68%
Both_ regions
200 20% 100% 800 80% 100% 1000 100% 100%
Crosstab showing associated t-weight, d-weight values and total number
(in thousands) of TVs and computers sold at AllElectronics in 1998
30%]:d75%,:[t40%]:d25%,:[t )computer""(item(X))TV""(item(X)
Europe(X)X,
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
BibliografiaBibliografia
(Livro) (Livro) Data Mining: Concepts and Data Mining: Concepts and TechniquesTechniques, J. Han & M. Kamber, Morgan , J. Han & M. Kamber, Morgan Kaufmann, 2001 (Capítulo 5 – livro 2001, Kaufmann, 2001 (Capítulo 5 – livro 2001, Secção 3.7 – draft)Secção 3.7 – draft)
(Livro) (Livro) Machine LearningMachine Learning, T. Mitchell, , T. Mitchell, McGraw-Hill, 1997 (Secção 3.4)McGraw-Hill, 1997 (Secção 3.4)
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Information-Theoretic Information-Theoretic ApproachApproach
Decision treeDecision tree each internal node tests an attribute each branch corresponds to an attribute value each leaf node assigns a classification
ID3 algorithmID3 algorithm build decision tree based on training objects with
known class labels to classify testing objects rank attributes with information gain measure minimal height
the least number of tests to classify an object
2003/04Sistemas de Apoio à Decisão
(LEIC Tagus)
Top-Down Induction of Top-Down Induction of Decision TreeDecision Tree
Attributes = {Outlook, Temperature, Humidity, Wind}
Outlook
Humidity Wind
sunny rainovercast
yes
no yes
high normal
no
strong weak
yes
PlayTennis = {yes, no}