data mining-current status and research directions
Post on 16-May-2015
April 12, 2023 Data Mining: Status and Directions 1
Data Mining: Current Status and Research Directions
Jiawei Han
Intelligent Database Systems Research Lab
School of Computing Science
Simon Fraser University, Canada
http://www.cs.sfu.ca/~han
Outline
  Why is data mining hot?
  Current status: major technical progress
  Is data mining flying high, or not?
  How to fly data mining high? Research directions on data mining
Why Is Data Mining Hot?
  Data mining (knowledge discovery in databases)
    Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information (knowledge) or patterns from data in large databases or other information repositories
  Necessity is the mother of invention
    Data is everywhere, so data mining should be everywhere, too!
    Understand and use data: an imminent task!
Data, Data, Everywhere!!
  Relational databases: a commodity of every enterprise
  Huge data warehouses are under construction
  POS (point of sale): transactional DBs in terabytes
  Object-relational, distributed, heterogeneous, and legacy databases
  Spatial databases (GIS), remote sensing databases (EOS), and scientific/engineering databases
  Time-series data (e.g., stock trading) and temporal data
  Text (documents, emails) and multimedia databases
  WWW: a huge, hyper-linked, dynamic, global information system
Data Mining Is Everywhere, Too! A Multi-Dimensional View of Data Mining
  Databases to be mined
    Relational, transactional, object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW, etc.
  Knowledge to be mined
    Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
  Techniques utilized
    Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
  Applications adapted
    Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.
Data Mining: Confluence of Multiple Disciplines
  [Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning (AI), Visualization, Information Science, and Other Disciplines]
Data Mining: One Can Trace It Back to Early Civilization
  Most scientific discoveries involve “data mining”
    Kepler’s laws, Newton’s laws, the periodic table of chemical elements, …, from the “big bang” to DNA
  Statistics: a discipline dedicated to data analysis
  Then why data mining? What are the differences?
    Huge amounts of data: gigabytes to terabytes
    Fast computers: quick response, interactive analysis
    Multi-dimensional, powerful, thorough analysis
    High-level, “declarative”: user’s ease and control
    Automated or semi-automated: mining functions hidden or built into many systems
A Brief History of Data Mining Activities
  1989 IJCAI Workshop on Knowledge Discovery in Databases
    Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
  1991-1994 Workshops on Knowledge Discovery in Databases
    Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
  1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
    Journal of Data Mining and Knowledge Discovery (1997)
  1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations
  More conferences on data mining
    PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, DaWaK, SPIE-DM, etc.
Research Progress in the Last Decade
  Multi-dimensional data analysis: data warehouse and OLAP (on-line analytical processing)
  Association, correlation, and causality analysis
  Classification: scalability and new approaches
  Clustering and outlier analysis
  Sequential patterns and time-series analysis
  Similarity analysis: curves, trends, images, texts, etc.
  Text mining, Web mining, and Weblog analysis
  Spatial, multimedia, and scientific data analysis
  Data preprocessing and database compression
  Data visualization and visual data mining
  Many others, e.g., collaborative filtering
Multi-Dimensional Data Analysis
  Data warehousing: integration from heterogeneous or semi-structured databases
  Multi-dimensional modeling of data: star & snowflake schemas
  Efficient and scalable computation of data cubes or iceberg cubes
  OLAP (on-line analytical processing): drilling, dicing, slicing, etc.
  Discovery-driven exploration of data cubes
  From OLAP to OLAM: a multi-dimensional view for on-line analytical mining
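As a rough illustration of the OLAP operations named on this slide, the sketch below rolls up and slices a tiny in-memory fact table. The table contents, the dimension names, and the helpers `roll_up` and `slice_cube` are all hypothetical and chosen only for this example; a real data cube engine would precompute and index these aggregates.

```python
from collections import defaultdict

# Hypothetical fact table: (region, city, quarter, product, sales).
DIMS = ("region", "city", "quarter", "product")
FACTS = [
    ("West", "Vancouver", "Q1", "milk", 120),
    ("West", "Vancouver", "Q2", "milk", 80),
    ("West", "Victoria", "Q1", "bread", 50),
    ("East", "Toronto", "Q1", "milk", 200),
    ("East", "Toronto", "Q2", "bread", 70),
]


def roll_up(facts, keep):
    """Sum sales over every dimension that is not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    cube = defaultdict(int)
    for row in facts:
        cube[tuple(row[i] for i in idx)] += row[-1]
    return dict(cube)


def slice_cube(facts, dim, value):
    """Keep only the facts whose dimension `dim` equals `value`."""
    i = DIMS.index(dim)
    return [row for row in facts if row[i] == value]


# Roll up to the region level, then drill into one region by city.
by_region = roll_up(FACTS, ["region"])
west_by_city = roll_up(slice_cube(FACTS, "region", "West"), ["city"])
print(by_region)
print(west_by_city)
```

Drilling down corresponds to calling `roll_up` with a finer list of dimensions, for example `["region", "city"]`.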
Association and Frequent Pattern Analysis
  Efficient mining of frequent patterns and association rules: Apriori and FP-growth algorithms
  Multi-level, multi-dimensional, quantitative association mining
  From association to correlation, sequential patterns, partial periodicity, cyclic rules, ratio rules, etc.
  Query and constraint-based association analysis
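The Apriori approach mentioned above can be sketched in a few lines. This is a minimal, unoptimized version written for illustration, assuming an absolute support count as the threshold; the sample baskets and the name `apriori` are hypothetical, and FP-growth is not shown.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset.

    `min_support` is an absolute count (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


BASKETS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
frequent_itemsets = apriori(BASKETS, min_support=3)
for itemset, count in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step relies on the fact that no superset of an infrequent itemset can be frequent, which is what keeps the candidate set small.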
Classification: Scalable Methods and Handling of Complex Types of Data
  Classification has been an essential theme in machine learning and statistics research
    Decision trees, Bayesian classification, neural networks, k-nearest neighbors, etc.
    Tree pruning, boosting, and bagging techniques
  Efficient and scalable classification methods
    Exploration of attribute-class pairs
    SLIQ, SPRINT, RainForest, BOAT, etc.
  Classification of semi-structured and non-structured data
    Classification by clustering association rules (ARCS)
    Association-based classification
    Web document classification
Clustering and Outlier Analysis
  Partitioning methods: k-means, k-medoids, CLARANS
  Hierarchical methods (micro-clusters): BIRCH, CURE, Chameleon
  Density-based methods: DBSCAN, OPTICS, DENCLUE
  Grid-based methods: STING, CLIQUE, WaveCluster
  Outlier analysis: statistics-based, distance-based, deviation-based
  Constraint-based clustering
    COD (Clustering with Obstructed Distance)
    User-specified constraints
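To make the partitioning family concrete, here is a minimal k-means sketch (Lloyd's iteration) on 2-D points. The sample points and the caller-supplied starting centroids are assumptions made to keep the example deterministic; the other methods listed on the slide are not shown.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points with caller-supplied initial centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign the point to its nearest centroid (squared Euclidean distance).
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


POINTS = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centroids, final_clusters = kmeans(POINTS, centroids=[(0, 0), (10, 10)])
print(final_centroids)
print(final_clusters)
```

k-means is sensitive to the starting centroids and to outliers, which is one motivation for the medoid-based and density-based alternatives named above.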
Sequential Patterns and Time-Series Analysis
  Trend analysis
    Trend movement vs. cyclic variations, seasonal variations, and random fluctuations
  Similarity search in time-series databases
    Handling gaps, scaling, etc.
    Indexing methods and query languages for time series
  Sequential pattern mining
    Various kinds of sequences, various methods
    From GSP to PrefixSpan
  Periodicity analysis
    Full periodicity, partial periodicity, cyclic association rules
Similarity Search: Similar Curves, Trends, Images, and Texts
  Various kinds of data, various similarity mining methods
  Discovery of similar trends in time-series data
    Data transformation & high-dimensional structures
  Finding similar images based on color, texture, etc.
    Content-based vs. keyword-based retrieval
    Color histogram-based signature
    Multi-feature composed signature
  Finding documents with similar texts
    Similar keywords (synonymy & polysemy)
    Term frequency matrix
    Latent semantic indexing
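For the text case, a term-frequency representation plus cosine similarity is one common way to compare documents. The sketch below is only an illustration under simple assumptions: whitespace tokenization, raw counts, and made-up sample sentences. It does not implement latent semantic indexing, so synonymy and polysemy are not handled.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a term-frequency vector using naive whitespace tokenization."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


DOCS = [
    "data mining finds patterns in data",
    "mining patterns in large data",
    "stock prices fell today",
]
VECTORS = [term_vector(d) for d in DOCS]
sim_01 = cosine(VECTORS[0], VECTORS[1])
sim_02 = cosine(VECTORS[0], VECTORS[2])
print(round(sim_01, 3), round(sim_02, 3))
```

Stacking such vectors row by row gives the term frequency matrix referred to on the slide.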
Spatial, Multimedia, Scientific Data Analysis
  Multi-dimensional analysis of spatial, multimedia, and scientific data
    Geo-spatial data cube and spatial OLAP
    The curse of dimensionality problem
  Association analysis
    A progressive refinement methodology
    Micro-clustering can be used for preprocessing in the analysis of complex types of data
  Classification
    Association-based for handling high-dimensionality and sparse data
Data Mining Industry and Applications
  From research prototypes to data mining products, languages, and standards
    IBM Intelligent Miner, SAS Enterprise Miner, SGI MineSet, Clementine, MS/SQLServer 2000, DBMiner, BlueMartini, MineIt, DigiMine, etc.
    A few data mining languages and standards (esp. MS OLEDB for Data Mining).
  Application achievements in many domains
    Market analysis, trend analysis, fraud detection, outlier analysis, Web mining, etc.
Is Data Mining Flying? Or Not??
  Data mining is flying
    R&D has been striding forward greatly
    Applications have been broadened substantially
  But not as high as some may have hoped. Why not?
    Hope to see billions of $’s within years?
      A young and upcoming technology, not hype!
    Not bread-and-butter but a value-added service
      DBMS, WWW, and other information systems will still be the “data mining” aircraft carrier
    Not off-the-shelf in nature
      Needs training, understanding, and customizing (re-development)
    Young technology: needs much R&D to fly high
      Much research, development, and real problem solving!
How to Fly Data Mining High? Research Directions
  Web mining
  Towards integrated data mining environments and tools
    “Vertical” (or application-specific) data mining
    Invisible data mining
  Towards intelligent, efficient, and scalable data mining methods
Web Mining: A Fast Expanding Frontier in Data Mining
  Mine what Web search engine finds
  Automatic classification of Web documents
  Discovery of authoritative Web pages, Web structures, and Web communities
  Meta-Web warehousing: Web yellow page service
  Web usage mining
Mine What Web Search Engine Finds
  Current Web search engines: a convenient source for mining
    Keyword-based; return too many, often low-quality answers; still miss a lot; not customized; etc.
  Data mining will help:
    Coverage: “enlarge and then shrink,” using synonyms and conceptual hierarchies
    Better search primitives: user preferences/hints
    Linkage analysis: authoritative pages and clusters
    Web-based languages: XML + WebSQL + WebML
    Customization: home page + Weblog + user profiles
Discovery of Authoritative Pages in WWW
  Page-rank method (Brin and Page, 1998):
    Rank the "importance" of Web pages, based on a model of a "random browser."
  Hub/authority method (Kleinberg, 1998):
    Prominent authorities often do not endorse one another directly on the Web.
    Hub pages have a large number of links to many relevant authorities.
    Thus hubs and authorities exhibit a mutually reinforcing relationship.
  Both the page-rank and hub/authority methodologies have been shown to provide qualitatively good search results for broad query topics on the WWW.
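The mutually reinforcing relationship between hubs and authorities can be sketched as a simple power iteration. This is a minimal illustration in the spirit of the hub/authority method, not a reproduction of the published algorithm's details; the toy link graph, the page names, and the fixed iteration count are assumptions.

```python
def hits(links, iterations=50):
    """links: {page: [pages it links to]}. Returns (hub, authority) score dicts."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages that link to the page.
        auth = {p: sum(hub[s] for s, targets in links.items() if p in targets) for p in pages}
        # Hub score: sum of the authority scores of the pages the page links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical toy Web: three hub-like pages pointing at two authority-like pages.
WEB = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2"],
    "h3": ["a1"],
    "a1": [],
    "a2": [],
}
hub_scores, auth_scores = hits(WEB)
print(max(auth_scores, key=auth_scores.get))
print(sorted(hub_scores, key=hub_scores.get, reverse=True)[:2])
```

Note that the two authority pages never link to each other, yet both score highly because the same hubs point at them.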
Automatic Classification of Web Documents
  Web document classification:
    Good human classification: Yahoo!, CS term hierarchies
    These classifications can be used as training sets to build learning models
    Keyword-based classification is different from multi-dimensional classification
    Association- or clustering-based classification is often more effective
    Multi-level classification is important
A Multiple Layered Meta-Web Architecture
  [Figure: stacked layers Layer0, Layer1, ..., Layern, annotated "Generalized Descriptions" and "More Generalized Descriptions"]
Web Yellow Page Service: A Multi-Layer, Meta-Web Approach
  XML: facilitates structured and meta-information extraction
  Automatic classification of Web documents:
    Based on Yahoo!, etc. as training set + keyword-based correlation/classification analysis (IR/AI assistance)
  Automatic ranking of important Web pages
    Authoritative site recognition and clustering of Web pages
  Generalization-based multi-layer meta-Web construction
    With the assistance of clustering and classification analysis
  Meta-Web can be warehoused and incrementally updated
  Querying and mining can be performed on or assisted by the meta-Web
Importance of Constructing Multi-Layer Meta Web
  Benefits of multi-layer meta-Web:
    Multi-dimensional Web info summary analysis
    Approximate and intelligent query answering
    Web high-level query answering (WebSQL, WebML)
    Web content and structure mining
    Observing the dynamics/evolution of the Web
  Is it realistic to construct such a meta-Web?
    It is beneficial even if only partially constructed
    The benefit may justify the cost of tool development, standardization, and partial restructuring
Web Usage (Click-Stream) Mining
  Weblog provides rich information about Web dynamics
  Multidimensional Weblog analysis:
    Disclose potential customers, users, markets, etc.
  Plan mining (mining general Web accessing regularities):
    Web linkage adjustment, performance improvements
  Web accessing association/sequential pattern analysis:
    Web caching, prefetching, swapping
  Trend analysis:
    Dynamics of the Web: what has been changing?
  Customized to individual users
Towards Integrated Data Mining Environments and Tools
  OLAP Mining: Integration of Data Warehousing and Data Mining
  Querying and Mining: An Integrated Information Analysis Environment
  Basic Mining Operations and Mining Query Optimization
  “Vertical” (or application-specific) data mining
  Invisible data mining
OLAP Mining: An Integration of Data Mining and Data Warehousing
  Coupling of data mining systems, DBMS, and data warehouse systems
    No coupling, loose coupling, semi-tight coupling, tight coupling
  On-line analytical mining of data
    Integration of mining and OLAP technologies
  Interactive mining of multi-level knowledge
    Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
  Integration of multiple mining functions
    Characterized classification, first clustering and then association
An OLAM Architecture
  [Figure: four-layer OLAM architecture. Layer 1: Data Repository; Layer 2: MDDB; Layer 3: OLAP/OLAM; Layer 4: User Interface. Labeled components: Databases, Data Warehouse, Meta Data, MDDB, OLAP Engine, OLAM Engine, Database API, Data Cube API, User GUI API, data cleaning, data integration, filtering & integration, filtering, mining query, mining result.]
Querying and Mining: An Integrated Information Analysis Environment
  Data mining as a component of DBMS, data warehouse, or Web information system
    Integrated information processing environment
      MS/SQLServer-2000 (Analysis service)
      IBM IntelligentMiner on DB2
      SAS EnterpriseMiner: data warehousing + mining
  Query-based mining
    Querying database/DW/Web knowledge
    Efficiency and flexibility: preprocessing, on-line processing, optimization, integration, etc.
Basic Mining Operations and Mining Query Optimization
  Relational databases: there is a set of basic relational operations and a standard query language, SQL
    E.g., selection, projection, join, set difference, intersection, Cartesian product, etc.
  Is there a set of standard data mining operations on which optimizations can be done?
    Difficulty: different definitions of operations
    Importance: optimization can be performed on them systematically; standardization facilitates information exchange and system interoperability
“Vertical” Data Mining
  Generic data mining tools? Too simple to match domain-specific, sophisticated applications
    Expert knowledge and business logic represent many years of work in their own fields!
    Data mining + business logic + domain experts
  A multi-dimensional view of data miners
    Complexity of data: Web, sequence, spatial, multimedia, …
    Complexity of domains: DNA, astronomy, market, telecom, …
  Domain-specific data mining tools
    Provide concrete, killer solutions to specific problems
    Feedback to build more powerful tools
Invisible Data Mining
  Build mining functions into daily information services
    Web search engine (link analysis, authoritative pages, user profiles), adaptive Web sites, etc.
    Improvement of query processing: history + data
    Making service smart and efficient
  Benefits from/to data mining research
    Data mining research has produced many scalable, efficient, novel mining solutions
    Applications feed new challenge problems to research
Towards Intelligent Tools for Data Mining
  Integration paves the way to intelligent mining
  Smart interface brings intelligence
    Easy to use, understand, and manipulate
  One picture may be worth 1,000 words
    Visual and audio data mining
  Human-centered data mining
  Towards self-tuning, self-managing, self-triggering data mining
Integrated Mining: A Booster for Intelligent Mining
  Integration paves the way to intelligent mining
  Data mining integrates with DBMS, DW, WebDB, etc.
    Integration inherits the power of up-to-date information technology: querying, MD analysis, similarity search, etc.
    Mining can be viewed as querying database knowledge
    Integration leads to standard interface/language, function/process standardization, utility, and reachability
  Efficiency and scalability bring intelligent mining to reality
One Picture May Be Worth 1,000 Words!
  Visual data mining
    Visualization of data
    Visualization of data mining results
    Visualization of data mining processes
    Interactive data mining: visual classification
  One melody may be worth 1,000 words, too!
    Audio data mining: turn data into music and melody!
    Uses audio signals to indicate the patterns of data or the features of data mining results
[Figure: Visualization of data mining results in SAS Enterprise Miner: scatter plots]
[Figure: Visualization of association rules in MineSet 3.0]
[Figure: Visualization of a decision tree in MineSet 3.0]
[Figure: Visualization of data mining processes by Clementine]
[Figure: Interactive visual mining by Perception-Based Classification (PBC)]
Human-Centered Data Mining
  Finding all the patterns autonomously in a database? Unrealistic, because the patterns could be too many yet uninteresting
  Data mining should be an interactive process
    User directs what is to be mined
  Users must be provided with a set of primitives for communicating with the data mining system, using a data mining query language
  User should provide constraints on what is to be mined
  System should use such constraints to guide the mining process (constraint-based mining or mining query optimization)
Constraint-Based Mining
  What kinds of constraints can be used in mining?
    Knowledge type constraint: classification, association, etc.
    Data constraint: SQL-like queries
      Find products sold together in Vancouver in Feb. ’01.
    Dimension/level constraints:
      In relevance to region, price, brand, customer category.
    Rule constraints:
      Small sales (price < $10) trigger big sales (sum > $200).
    Interestingness constraints:
      E.g., strong rules (min_support 3%, min_confidence 60%, min_lift > 3.0).
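The interestingness thresholds above can be checked with a short calculation of support, confidence, and lift for a candidate rule. The sketch below uses the standard textbook definitions; the transactions are made up, and treating the support and confidence thresholds as "at least" comparisons is an assumption, since the slide does not state the operators.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    n_a = sum(1 for t in transactions if a <= t)
    n_c = sum(1 for t in transactions if c <= t)
    n_ac = sum(1 for t in transactions if (a | c) <= t)
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


def is_strong(metrics, min_support=0.03, min_confidence=0.60, min_lift=3.0):
    """Apply the slide's example thresholds (>= for the first two is assumed)."""
    support, confidence, lift = metrics
    return support >= min_support and confidence >= min_confidence and lift > min_lift


# Hypothetical data: 10 transactions.
SALES = [{"pen", "ink"}] * 2 + [{"milk", "bread"}] * 6 + [{"milk"}] * 2
pen_ink = rule_metrics(SALES, {"pen"}, {"ink"})
milk_bread = rule_metrics(SALES, {"milk"}, {"bread"})
print(pen_ink, is_strong(pen_ink))
print(milk_bread, is_strong(milk_bread))
```

The second rule has high support and confidence but a lift close to 1, so it fails the lift threshold even though it would pass a support/confidence test alone.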
Rule Constraints: A Classification
  Succinctness
  Anti-monotonicity
  Monotonicity
  Convertible constraints
  Inconvertible constraints
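Two of these categories are easy to illustrate in code. The sketch below assumes the usual definitions: a constraint is anti-monotone if, once an itemset violates it, every superset violates it too, and monotone if, once an itemset satisfies it, every superset does too. The price table, the thresholds, and the function names are hypothetical; succinct and convertible constraints are not shown.

```python
# Hypothetical, non-negative item prices.
PRICE = {"pen": 2, "ink": 5, "radio": 40, "tv": 300}


def total(itemset):
    return sum(PRICE[i] for i in itemset)


def within_budget(itemset, budget=50):
    """Anti-monotone: sum(price) <= budget can only get worse as items are added."""
    return total(itemset) <= budget


def big_sale(itemset, floor=50):
    """Monotone: sum(price) >= floor stays true as items are added."""
    return total(itemset) >= floor


def grow(items, constraint):
    """Level-wise enumeration that stops extending an itemset once the
    anti-monotone `constraint` fails, since no superset can satisfy it."""
    level = [frozenset([i]) for i in items if constraint(frozenset([i]))]
    kept = []
    while level:
        kept.extend(level)
        extended = {s | frozenset([i]) for s in level for i in items if i not in s}
        level = [s for s in extended if constraint(s)]
    return kept


affordable = grow(PRICE, within_budget)
print(sorted(sorted(s) for s in affordable))
```

Pushing an anti-monotone constraint into the level-wise search this way avoids generating supersets that are known to fail.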
Constraint-Based Clustering Analysis
  User-specified constraints: no cluster has fewer than 1,000 gold customers
  Resource allocation (clustering) with obstacles
Towards Automated Data Mining?
  It is not realistic to automatically find all the knowledge in a large database
  Thus we promote human-centered, constraint-based mining
  However, to achieve genuinely intelligent data mining, the data mining process should be self-tuning, self-managing, and self-triggering
  Functions should be developed to achieve such performance
Conclusions
  Data mining: a promising research frontier
  Data mining research has been striding forward greatly in the last decade
  However, data mining, as an industry, has not been flying as high as expected
  Much research and application exploration are needed
    Web mining
    Towards integrated data mining environments and tools
    Towards intelligent, efficient, and scalable data mining methods
Thank you!!!
  http://www.cs.sfu.ca/~han
  http://db.cs.sfu.ca
References
  J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
  J. Han, L. V. S. Lakshmanan, and R. T. Ng, "Constraint-Based, Multidimensional Data Mining", COMPUTER (special issue on Data Mining), 32(8): 46-50, 1999.