Post on 01-Apr-2015
Prof. Carolina Ruiz
Department of Computer Science
Worcester Polytechnic Institute
INTRODUCTION TO
KNOWLEDGE DISCOVERY IN DATABASES AND DATA MINING
WHAT IS DATA MINING? OR MORE GENERALLY, KNOWLEDGE DISCOVERY IN DATABASES (KDD)
“Non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” [Fayyad et al. 1996]
• Raw Data → Data Mining → Patterns
  • Analytical Patterns (rules, decision trees)
  • Statistical Patterns (data distribution)
  • Visual Patterns
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases". AI Magazine, pp. 37-54. Fall 1996.
NEED FOR DATA MINING
• Data are being gathered and stored extremely fast
• Computational tools and techniques are needed to help humans in summarizing, understanding, and taking advantage of accumulated data
[Figure: bar chart of values by quarter (1st Qtr to 4th Qtr) for East, West, and North; vertical axis from 0 to 90]
DATA ANALYSIS (KDD) PROCESS
• data sources
• data management
  • databases
  • data warehouses
• data “pre”-processing (data → clean data)
  • noisy/missing data
  • dim. reduction
• data analysis / data mining (clean data → models)
  • analytical
  • statistical
  • visual
• model/pattern evaluation (models → “good” model)
  • quantitative
  • qualitative
• model/patterns deployment (“good” model applied to new data)
  • prediction
  • decision support
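The stages of this process can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method taken from the slides: the toy records, the function names (`preprocess`, `mine`, `evaluate`, `deploy`), the mean-fill strategy for missing values, and the single-threshold classifier used as the "model" are all hypothetical choices made for the example.

```python
from statistics import mean

# Hypothetical toy records: (hours_studied, passed). None marks a missing value.
RAW = [(1.0, 0), (2.0, 0), (None, 0), (4.0, 1),
       (5.0, 1), (6.0, 1), (3.0, 0), (7.0, 1)]


def preprocess(records):
    """Pre-processing: replace missing inputs with the mean of the observed ones."""
    observed = [x for x, _ in records if x is not None]
    fill = mean(observed)
    return [(fill if x is None else x, y) for x, y in records]


def accuracy(threshold, records):
    """Fraction of records for which 'predict 1 if x >= threshold' is correct."""
    return sum((x >= threshold) == bool(y) for x, y in records) / len(records)


def mine(train):
    """Data mining: pick the threshold with the highest training accuracy."""
    candidates = sorted({x for x, _ in train})
    return max(candidates, key=lambda t: accuracy(t, train))


def evaluate(threshold, test):
    """Quantitative evaluation: accuracy on records not used for mining."""
    return accuracy(threshold, test)


def deploy(threshold, new_x):
    """Deployment: predict the class of a newly encountered entity."""
    return int(new_x >= threshold)


clean = preprocess(RAW)               # data -> clean data
train, test = clean[:6], clean[6:]    # hold out the last two records
model = mine(train)                   # clean data -> model
print("threshold:", model)
print("held-out accuracy:", evaluate(model, test))
print("prediction for 5.5:", deploy(model, 5.5))
```

Keeping each stage as a separate function mirrors the diagram: a weak evaluation result sends you back to pre-processing or mining rather than forward to deployment.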
KDD IS INTERDISCIPLINARY: TECHNIQUES COME FROM MULTIPLE FIELDS
• Machine Learning (AI)
  • Contributes (semi-)automatic induction of empirical laws from observations & experimentation
• Statistics
  • Contributes language, framework, and techniques
• Pattern Recognition
  • Contributes pattern extraction and pattern matching techniques
• Databases
  • Contributes efficient data storage, data cleansing, and data access techniques
• Data Visualization
  • Contributes visual data displays and data exploration
• High Performance Comp.
  • Contributes techniques for efficiently handling complexity
• Application Domain
  • Contributes domain knowledge
DATA MINING MODES
• Confirmatory (verification)
  • Given a hypothesis, verify its validity against the data
• Exploratory (discovery)
  • Prescriptive patterns
    • Patterns for predicting behavior of newly encountered entities
  • Descriptive patterns
    • Patterns for presenting the behavior of observed entities in a human-understandable format
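The difference between the confirmatory and exploratory modes can be made concrete with a small sketch. Everything here is an assumption for illustration: the toy customer records, the hypothesis being checked, the `min_rate` cutoff, and the function names `confirm` and `explore`. A real confirmatory analysis would normally use a statistical test rather than a bare comparison of rates.

```python
from collections import defaultdict

# Hypothetical toy observations.
DATA = [
    {"age": "young", "region": "east", "bought": True},
    {"age": "young", "region": "west", "bought": True},
    {"age": "young", "region": "east", "bought": False},
    {"age": "old", "region": "east", "bought": False},
    {"age": "old", "region": "west", "bought": True},
    {"age": "old", "region": "west", "bought": False},
]


def purchase_rate(rows):
    """Fraction of rows in which a purchase was made."""
    return sum(r["bought"] for r in rows) / len(rows)


def confirm(data):
    """Confirmatory mode: check a hypothesis stated in advance.

    Assumed hypothesis: young customers buy more often than old ones.
    """
    young = [r for r in data if r["age"] == "young"]
    old = [r for r in data if r["age"] == "old"]
    return purchase_rate(young) > purchase_rate(old)


def explore(data, min_rate=0.6):
    """Exploratory mode: no hypothesis is supplied.

    Scan every attribute=value condition and keep those whose
    purchase rate reaches min_rate.
    """
    groups = defaultdict(list)
    for row in data:
        for attr in ("age", "region"):
            groups[(attr, row[attr])].append(row)
    rates = {cond: purchase_rate(rows) for cond, rows in groups.items()}
    return {cond: rate for cond, rate in rates.items() if rate >= min_rate}


print("hypothesis holds:", confirm(DATA))
print("discovered conditions:", explore(DATA))
```

In `confirm` the analyst supplies the pattern and the data only answer yes or no. In `explore` the data supply the candidate patterns, which then still need evaluation.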
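WHAT DO YOU WANT TO LEARN FROM YOUR DATA? KDD APPROACHES
• classification
• regression
• clustering
• summarization
• dependency/assoc. analysis
• change/deviation detection
[Figure: data at the center, surrounded by example outputs of these approaches: a bar chart (1st Qtr to 4th Qtr; East, West, North); rules such as "IF a & b & c THEN d & k" and "IF k & a THEN e"; a diagram with nodes A, B, C, D and the labels blue, blue, orange, with the partial rules "IF A & B THEN" and "IF A & D THEN"; a graph over A, B, C, D with weights 0.5, 0.75, 0.3; and association rules "A, B -> C 80%" and "C, D -> A 22%"]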
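An association rule such as "A, B -> C 80%" is usually read as: among the records that contain A and B, 80% also contain C. Treating the percentage on the slide as rule confidence is an assumption, as are the toy transactions and the function names below. This is a minimal sketch of the two standard measures, support and confidence.

```python
# Hypothetical transactions, chosen so that {A, B} -> {C} has confidence 0.8.
TRANSACTIONS = [
    {"A", "B", "C"},
    {"A", "B", "C", "D"},
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"C", "D"},
    {"B", "D"},
    {"A", "D"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) as support(both) / support(antecedent)."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the rule never applies in this data
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


print("support(A, B):", support({"A", "B"}, TRANSACTIONS))
print("confidence(A, B -> C):", confidence({"A", "B"}, {"C"}, TRANSACTIONS))
```

Association-rule miners typically report both numbers, because a rule with high confidence but very low support describes only a handful of records.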
COMMERCIAL DATA MINING SYSTEMS
• Matlab
• Oracle data mining
• and lots more ...
ACADEMIC DATA MINING SYSTEMS
• WEKA: Frank et al., University of Waikato, New Zealand
• RapidMiner: Klinkenberg et al., Univ. of Dortmund, Germany
• R Programming Language: Ross Ihaka and Robert Gentleman, Univ. of Auckland, New Zealand
• and many more ...
DATA MINING RESOURCES – JOURNALS
• Data Mining and Knowledge Discovery Journal
Newsletters:
• ACM SIGKDD Explorations Newsletter
Related Journals:
• TKDE: IEEE Transactions on Knowledge and Data Engineering
• TODS: ACM Transactions on Database Systems
• JACM: Journal of the ACM
• Data and Knowledge Engineering
• JIIS: Intl. Journal of Intelligent Information Systems
DATA MINING RESOURCES – CONFERENCES
• KDD: ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining
• ICDM: IEEE International Conference on Data Mining
• SIAM International Conference on Data Mining
• PKDD: European Conference on Principles and Practice of Knowledge Discovery in Databases
• PAKDD: Pacific-Asia Conference on Knowledge Discovery and Data Mining
• DaWaK: Intl. Conference on Data Warehousing and Knowledge Discovery
Related Conferences:
• ICML: Intl. Conf. on Machine Learning
• IDEAL: Intl. Conf. on Intelligent Data Engineering and Automated Learning
• IJCAI: International Joint Conference on Artificial Intelligence
• AAAI: American Association for Artificial Intelligence Conference
• SIGMOD/PODS: ACM Intl. Conference on Data Management
• ICDE: International Conference on Data Engineering
• VLDB: International Conference on Very Large Data Bases
DATA MINING RESOURCES – BOOKS, DATASETS, …
See resources webpage at:
• http://web.cs.wpi.edu/~ruiz/KDDRG/resources.html
SUMMARY
• KDD is the “non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”
• The KDD process includes data collection and pre-processing, data mining, and evaluation and validation of those patterns
• Data mining is the discovery and extraction of patterns from data, not the extraction of data
• Important challenges in data mining: privacy, security, scalability, real-time, and handling non-conventional data
KDDRG: KNOWLEDGE DISCOVERY AND DATA MINING RESEARCH GROUP
• KDDRG Meetings
• WHEN? Fridays at 1 pm
• WHERE? Beckett Conference Room in Fuller Labs
• To receive announcements of the talks, please subscribe to the KDDRG mailing list
• I’ll send you an email with instructions on how to do so