![Page 1: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/1.jpg)
Data Mining Query Languages
Kristen LeFevre
April 19, 2004
With Thanks to Zheng Huang and Lei Chen
![Page 2: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/2.jpg)
Outline
Introduce the problem of querying data mining models
Overview of three different solutions and their contributions
Topic for Discussion: What would an ideal solution support?
![Page 3: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/3.jpg)
Problem Description
You guys are armed with two powerful tools:
  Database management systems
  Efficient and effective data mining algorithms and frameworks
Generally, this work asks:
  “How can we merge the two?”
  “How can we integrate data mining more closely with traditional database systems, particularly querying?”
![Page 4: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/4.jpg)
Three Different Answers
DMQL: A Data Mining Query Language for Relational Databases (Han et al., Simon Fraser University)
Integrating Data Mining with SQL Databases: OLE DB for Data Mining (Netz et al., Microsoft)
MSQL: A Query Language for Database Mining (Imielinski & Virmani, Rutgers University)
![Page 5: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/5.jpg)
Some Common Ground
Create and manipulate data mining models through a SQL-based interface (“Command-driven” data mining)
Abstract away the data mining particulars
Data mining should be performed on data in the database (should not need to export to a special-purpose environment)
Approaches differ on what kinds of models should be created, and what operations we should be able to perform
![Page 6: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/6.jpg)
DMQL
Commands specify the following:
  The set of data relevant to the data mining task (the training set)
  The kinds of knowledge to be discovered
    • Generalized relation
    • Characteristic rules
    • Discriminant rules
    • Classification rules
    • Association rules
![Page 7: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/7.jpg)
DMQL
Commands specify the following:
  Background knowledge
    • Concept hierarchies based on attribute relationships, etc.
  Various thresholds
    • Minimum support, confidence, etc.
![Page 8: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/8.jpg)
DMQL
Syntax
use database <database_name>
{use hierarchy <hierarchy_name> for <attribute>}
<rule_spec>
related to <attr_or_agg_list>
from <relation(s)>
[where <conditions>]
[order by <order list>]
{with [<kinds of>] threshold = <threshold_value> [for <attribute(s)>]}
Clause by clause:
  use hierarchy: specify background knowledge
  <rule_spec>: specify rules to be discovered
  related to: relevant attributes or aggregations
  from / where / order by: collect the set of relevant data to mine
  with ... threshold: specify threshold parameters
![Page 9: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/9.jpg)
DMQL
Syntax <rule_spec>
find classification rules [as <rule_name>] [according to <attributes>]
find association rules [as <rule_name>]
generalize data [into <relation_name>]
others
![Page 10: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/10.jpg)
DMQL
use database Hospital
find association rules as Heart_Health
related to Salary, Age, Smoker, Heart_Disease
from Patient_Financial f, Patient_Medical m
where f.ID = m.ID and m.age >= 18
with support threshold = .05
with confidence threshold = .7
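The two thresholds in the query above can be made concrete with a small Python sketch. It is not part of DMQL. The toy patient records, the string encoding of attribute values, and the function names are all assumptions made for illustration. Support is taken to be the fraction of records containing every item of the rule, and confidence the support of the whole rule divided by the support of its body.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], items: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `items`."""
    wanted = frozenset(items)
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if wanted <= t) / len(transactions)


def confidence(transactions: List[Transaction],
               body: Iterable[str],
               consequent: Iterable[str]) -> float:
    """support(body and consequent) / support(body); 0.0 if the body never occurs."""
    body = frozenset(body)
    body_support = support(transactions, body)
    if body_support == 0.0:
        return 0.0
    return support(transactions, body | frozenset(consequent)) / body_support


def passes_thresholds(transactions: List[Transaction],
                      body: Iterable[str],
                      consequent: Iterable[str],
                      min_support: float = 0.05,
                      min_confidence: float = 0.7) -> bool:
    """True if the rule body => consequent meets both thresholds."""
    body = frozenset(body)
    consequent = frozenset(consequent)
    return (support(transactions, body | consequent) >= min_support
            and confidence(transactions, body, consequent) >= min_confidence)


# Hypothetical toy data: each record is a set of "Attribute=value" descriptors.
patients = [
    frozenset({"Smoker=yes", "Age=60+", "Heart_Disease=yes"}),
    frozenset({"Smoker=yes", "Age=60+", "Heart_Disease=yes"}),
    frozenset({"Smoker=yes", "Age=40-59", "Heart_Disease=no"}),
    frozenset({"Smoker=no", "Age=60+", "Heart_Disease=no"}),
    frozenset({"Smoker=no", "Age=18-39", "Heart_Disease=no"}),
]

strong = passes_thresholds(patients, {"Smoker=yes", "Age=60+"}, {"Heart_Disease=yes"})
weak = passes_thresholds(patients, {"Smoker=yes"}, {"Heart_Disease=yes"})
```

On this toy data the first rule clears both thresholds, while the second has enough support but a confidence of about 0.67, so it would be dropped by the confidence threshold of .7.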
![Page 11: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/11.jpg)
DMQL
DMQL provides a “display in” command to view the resulting rules, but no advanced way to query them
Suggests that a GUI might aid in presenting these results in different forms (charts, graphs, etc.)
![Page 12: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/12.jpg)
MSQL
Focus on Association Rules
Seeks to provide a language both to selectively generate rules, and separately to query the rule base
Expressive rule generation language, and techniques for optimizing some commands
![Page 13: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/13.jpg)
MSQL
Get-Rules and Select-Rules Queries
  The Get-Rules operator generates rules over elements of argument class C that satisfy the conditions described in the “where” clause
[Project Body, Consequent, confidence, support]
GetRules(C) [as R1]
[into <rulebase_name>]
[where <conds>]
[sql-group-by clause]
[using-clause]
![Page 14: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/14.jpg)
MSQL
<conds> may contain a number of conditions, including:
  Restrictions on the attributes in the body or consequent
    • “rule.body HAS {(Job = ‘Doctor’)}”
    • “rule1.consequent IN rule2.body”
    • “rule.consequent IS {Age = *}”
  Pruning conditions (restrict by support, confidence, or size)
  Stratified or correlated subqueries
in, has, and is are rule subset, superset, and equality respectively
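The subset, superset, and equality reading of in, has, and is can be sketched in Python. This is one interpretation of the slide, not MSQL's definition. The descriptor representation as (attribute, value) pairs, the treatment of `*` as matching any value of that attribute, and the function names are assumptions.

```python
from typing import FrozenSet, Tuple

Descriptor = Tuple[str, str]  # (attribute, value); value "*" is a wildcard
WILDCARD = "*"


def matches(descriptor: Descriptor, pattern: Descriptor) -> bool:
    """A descriptor matches a pattern on the same attribute if the pattern
    value is the wildcard or the two values are equal."""
    return descriptor[0] == pattern[0] and pattern[1] in (WILDCARD, descriptor[1])


def has(part: FrozenSet[Descriptor], pattern: FrozenSet[Descriptor]) -> bool:
    """Superset test: every pattern descriptor is matched somewhere in `part`."""
    return all(any(matches(d, p) for d in part) for p in pattern)


def is_in(part: FrozenSet[Descriptor], pattern: FrozenSet[Descriptor]) -> bool:
    """Subset test: every descriptor of `part` is matched by some pattern descriptor."""
    return all(any(matches(d, p) for p in pattern) for d in part)


def is_exactly(part: FrozenSet[Descriptor], pattern: FrozenSet[Descriptor]) -> bool:
    """Equality test: subset and superset at the same time."""
    return has(part, pattern) and is_in(part, pattern)


# Hypothetical rule body and patterns.
body = frozenset({("Job", "Doctor"), ("Age", "30")})
doctor = frozenset({("Job", "Doctor")})
any_age = frozenset({("Age", "*")})
```

With these definitions, `has(body, doctor)` holds because the body contains the Job descriptor, while `is_in(body, doctor)` does not because the body also carries an Age descriptor.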
![Page 15: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/15.jpg)
MSQL
GetRules(Patients) R1
where Body has {Age = *}
and Support > .05 and Confidence > .7
and not exists ( GetRules(Patients) R2
where Support > .05 and Confidence > .7
and R2.Body HAS R1.Body)
Retrieve all rules with descriptors of the form “Age = x” in the body, except when another rule that meets the same support and confidence thresholds has a body containing a superset of those descriptors
![Page 16: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/16.jpg)
MSQL
Correlated:
GetRules(C) R1
where <pruning-conds>
and not exists ( GetRules(C) R2
where <same pruning-conds>
and R2.Body HAS R1.Body)
Stratified:
GetRules(C) R1
where <pruning-conds>
and consequent is {(X=*)}
and consequent in (SelectRules(R2)
where consequent is {(X=*)})
![Page 17: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/17.jpg)
MSQL
Nested Get-Rules queries and their optimization
  Stratified (non-correlated) queries are evaluated “bottom-up.” The subquery is evaluated first, and replaced with its results in the outer query.
  Correlated queries are evaluated either top-down or bottom-up (like “loop-unfolding”), and there are rules for choosing between the two options
![Page 18: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/18.jpg)
MSQL
GetRules(Patients) R1
where Body has {Age = *}
and Support > .05 and Confidence > .7
and not exists ( GetRules(Patients) R2
where Support > .05 and Confidence > .7
and R2.Body HAS R1.Body)
![Page 19: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/19.jpg)
MSQL
GetRules(Patients) R1
where Body has {Age = *}
and Support > .05 and Confidence > .7
Top-Down Evaluation
For each rule produced by the outer query, evaluate the inner query
not exists ( GetRules(Patients) R2
where Support > .05 and Confidence > .7
and R2.Body HAS R1.Body)
![Page 20: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/20.jpg)
MSQL
not exists ( GetRules(Patients) R2
where Support > .05 and Confidence > .7
and R2.Body HAS R1.Body)
Bottom-Up Evaluation
For each rule produced by the inner query, evaluate the outer query
GetRules(Patients) R1
where Body has {Age = *}
and Support > .05 and Confidence > .7
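The two evaluation orders can be compared on an already generated rule set with a short Python sketch. This is not how MSQL is implemented. The `Rule` class, the toy rules, and the function names are hypothetical, and HAS is treated here as a strict superset test so that a rule does not eliminate itself, which is an assumption about the intended semantics.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Rule:
    body: FrozenSet[str]
    consequent: str
    support: float
    confidence: float


def outer_ok(r: Rule) -> bool:
    """Outer query: an Age descriptor in the body, plus the pruning conditions."""
    return (any(d.startswith("Age=") for d in r.body)
            and r.support > 0.05 and r.confidence > 0.7)


def inner_ok(r: Rule) -> bool:
    """Inner query: the pruning conditions only."""
    return r.support > 0.05 and r.confidence > 0.7


def top_down(rules: List[Rule]) -> List[Rule]:
    """For each rule produced by the outer query, evaluate the inner query."""
    result = []
    for r1 in rules:
        if not outer_ok(r1):
            continue
        blocked = any(inner_ok(r2) and r2.body > r1.body for r2 in rules)
        if not blocked:
            result.append(r1)
    return result


def bottom_up(rules: List[Rule]) -> List[Rule]:
    """For each rule produced by the inner query, mark the outer rules it blocks."""
    outer = [r1 for r1 in rules if outer_ok(r1)]
    blocked = set()
    for r2 in (r for r in rules if inner_ok(r)):
        for r1 in outer:
            if r2.body > r1.body:  # strict superset (assumed reading of HAS)
                blocked.add(r1)
    return [r1 for r1 in outer if r1 not in blocked]


# Hypothetical rule base.
r_a = Rule(frozenset({"Age=60+"}), "Heart_Disease=yes", 0.20, 0.80)
r_b = Rule(frozenset({"Age=60+", "Smoker=yes"}), "Heart_Disease=yes", 0.10, 0.90)
r_c = Rule(frozenset({"Age=18-39"}), "Heart_Disease=no", 0.30, 0.75)
r_d = Rule(frozenset({"Age=18-39", "Smoker=yes"}), "Heart_Disease=no", 0.02, 0.90)
r_e = Rule(frozenset({"Smoker=no"}), "Heart_Disease=no", 0.40, 0.80)
rules = [r_a, r_b, r_c, r_d, r_e]
```

Both functions return the same rules. They differ only in which side drives the loop, which is why the more restrictive side is the better one to evaluate first.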
![Page 21: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/21.jpg)
MSQL
Choosing between the two
  In general, evaluate the expression with more restrictive conditions first
  Heuristic rules
    • Evaluate the query with higher support threshold first
    • Next consider confidence threshold
    • A (length = x) expression is in general more restrictive than (length > x), which is more restrictive than (length < x)
    • “Body IS (constant expression)” is more restrictive than “Body HAS”, which is more restrictive than “Body IN”
    • Next consider “Consequent IN” expressions
    • Descriptors of the form (A = a) are more restrictive than wildcards such as (A = *)
  These heuristics are meant to prevent unconstrained queries from being evaluated first
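One way to read the first four heuristics is as a lexicographic ordering, sketched below in Python. The `QuerySpec` class and its fields are hypothetical. Treating the heuristics as strict tie-breakers in the listed order, and ranking a missing condition as least restrictive, are both assumptions rather than something the slides state.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Lower rank means more restrictive; None means the query has no such condition.
LENGTH_RANK = {"=": 0, ">": 1, "<": 2, None: 3}
BODY_RANK = {"IS": 0, "HAS": 1, "IN": 2, None: 3}


@dataclass(frozen=True)
class QuerySpec:
    name: str
    min_support: float
    min_confidence: float
    length_op: Optional[str] = None  # "=", ">", "<" or None
    body_op: Optional[str] = None    # "IS", "HAS", "IN" or None


def restrictiveness_key(q: QuerySpec) -> Tuple[float, float, int, int]:
    """Sort key: higher support first, then higher confidence,
    then the length operator, then the body operator."""
    return (-q.min_support, -q.min_confidence,
            LENGTH_RANK[q.length_op], BODY_RANK[q.body_op])


def evaluate_first(a: QuerySpec, b: QuerySpec) -> QuerySpec:
    """Return the query judged more restrictive (ties go to `a`)."""
    return min(a, b, key=restrictiveness_key)


# The running example: both sides share the thresholds, but only the outer
# query constrains the body with a constant pattern ({Age = *}).
outer = QuerySpec("outer", 0.05, 0.7, body_op="HAS")
inner = QuerySpec("inner", 0.05, 0.7)
first = evaluate_first(outer, inner)
```

Under these assumptions the outer query is chosen first for the running example, which corresponds to top-down evaluation.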
![Page 22: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/22.jpg)
OLE DB for DM
An extension to the OLE DB interface for Microsoft SQL Server
Seeks to support the following ideas:
  Define a model by specifying the set of attributes to be predicted, the attributes used for the prediction, and the algorithm
  Populate the model using the training data
  Predict attributes for new data using the populated model
  Browse the mining model (not fully addressed because it varies a lot by model type)
None of the others seemed to support this
![Page 23: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/23.jpg)
OLE DB for DM
Defining a Mining Model
  Identify the set of data attributes to be predicted, the set of attributes to be used for prediction, and the algorithm to be used for building the model
Populating the Model
  Pull the information into a single rowset using views, and train the model using the data and algorithm specified
  Supports complex objects, so rowset may be hierarchical (see paper for more complex examples)
![Page 24: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/24.jpg)
OLE DB for DM
Using the mining model to predict
  Defines a new operator, prediction join.
  A model may be used to make predictions on datasets by taking the prediction join of the mining model and the data set.
![Page 25: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/25.jpg)
OLE DB for DM
CREATE MINING MODEL [Heart_Health Prediction]
(
[ID] Int Key,
[Age] Int,
[Smoker] Int,
[Salary] Double discretized,
[HeartAttack] Int PREDICT %Prediction column
)
USING [Decision_Trees_101]
Identifies the source columns for the training data, the column to be predicted, and the data mining algorithm.
![Page 26: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/26.jpg)
OLE DB for DM
INSERT INTO [Heart_Health Prediction]
([ID], [Age], [Smoker], [Salary])
SELECT [ID], [Age], [Smoker], [Salary] FROM Patient_Medical M, Patient_Financial F
WHERE M.ID = F.ID
The INSERT does not store tuples in a rowset; each selected tuple is used as a training case for the model.
![Page 27: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/27.jpg)
OLE DB for DM
SELECT t.[ID], [Heart_Health Prediction].[HeartAttack]
FROM [Heart_Health Prediction]
PREDICTION JOIN (
SELECT [ID], [Age], [Smoker], [Salary]
FROM Patient_Medical M, Patient_Financial F
WHERE M.ID = F.ID) as t
ON [Heart_Health Prediction].Age = t.Age AND
[Heart_Health Prediction].Smoker = t.Smoker AND
[Heart_Health Prediction].Salary = t.Salary
Prediction join connects the model and an actual data table to make predictions
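The define, populate, and predict steps can be mimicked in plain Python to show how the pieces fit together. This is only an analogy for the statements above, not the OLE DB for DM API. The `MiningModel` class, its method names, and the toy rows are hypothetical, and a simple majority-vote lookup table stands in for the `Decision_Trees_101` algorithm. The training rows are assumed to carry the predicted column.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

Row = Dict[str, Hashable]


class MiningModel:
    """Stand-in for CREATE MINING MODEL: input columns plus one PREDICT column."""

    def __init__(self, input_columns: Iterable[str], predict_column: str):
        self.input_columns = list(input_columns)
        self.predict_column = predict_column
        self._votes = defaultdict(Counter)  # input values -> label counts
        self._overall = Counter()           # label counts over all training rows

    def insert_into(self, rows: Iterable[Row]) -> None:
        """Stand-in for INSERT INTO: rows train the model and are not stored."""
        for row in rows:
            key = tuple(row[c] for c in self.input_columns)
            label = row[self.predict_column]
            self._votes[key][label] += 1
            self._overall[label] += 1

    def predict(self, row: Row) -> Hashable:
        """Most common label for matching inputs, else the overall majority."""
        if not self._overall:
            raise ValueError("model has not been trained")
        key = tuple(row[c] for c in self.input_columns)
        votes = self._votes.get(key) or self._overall
        return votes.most_common(1)[0][0]


def prediction_join(model: MiningModel, rows: Iterable[Row],
                    key_column: str) -> List[Tuple[Hashable, Hashable]]:
    """Stand-in for PREDICTION JOIN: pair each new row's key with its prediction."""
    return [(row[key_column], model.predict(row)) for row in rows]


# Hypothetical training cases and new patients.
model = MiningModel(input_columns=["Age", "Smoker"], predict_column="HeartAttack")
model.insert_into([
    {"ID": 1, "Age": "60+", "Smoker": 1, "HeartAttack": 1},
    {"ID": 2, "Age": "60+", "Smoker": 1, "HeartAttack": 1},
    {"ID": 3, "Age": "60+", "Smoker": 0, "HeartAttack": 0},
    {"ID": 4, "Age": "18-39", "Smoker": 0, "HeartAttack": 0},
    {"ID": 5, "Age": "18-39", "Smoker": 1, "HeartAttack": 0},
])
new_patients = [
    {"ID": 101, "Age": "60+", "Smoker": 1},
    {"ID": 102, "Age": "40-59", "Smoker": 0},
]
predictions = prediction_join(model, new_patients, key_column="ID")
```

The point of the analogy is that the model behaves like a table on one side of a join: new rows are matched to it on the input columns, and the predicted column comes out as an ordinary result column.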
![Page 28: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/28.jpg)
Key Ideas
Important to have an API for creating and manipulating data mining models
The data is already in the DBMS, so it makes sense to do the data mining where the data is
Applications already use SQL, so a SQL extension seems logical
![Page 29: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/29.jpg)
Key Ideas
Need a method for defining data mining models, including algorithm specification, specification of various parameters, and training set specification (DMQL, MSQL, OLE DB for DM)
Need a method of querying the models (MSQL)
Need a way of using the data mining model to interact with other data in the database, for purposes such as prediction (OLE DB for DM)
![Page 30: Data mining query languages](https://reader033.vdocuments.us/reader033/viewer/2022061213/5496edefb479596a4d8b5081/html5/thumbnails/30.jpg)
Discussion Topic: What Functionality Would an Ideal Solution Support?