Data Mining
Lecture # 1: Introduction & Fundamentals
Intro & Affiliations
Area of research: Analysis of medical images/signals using image/signal processing and machine learning techniques
www.biomisa.org/usman
www.biomisa.org
www.risetech.pk
www.albasr.com
www.ekko.pk
Reference Material
Text Book:
Data Mining --- Concepts and Techniques, by Han and Kamber, Morgan Kaufmann, 3rd Edition. (ISBN:1-55860-489-8)
Ref Books:
• Introduction to Data Mining – Pang-Ning Tan, Michael Steinbach, and Vipin Kumar,
Addison Wesley
• Principles of Data Mining, by Hand, Mannila, and Smyth, MIT Press, 2001. (ISBN:0-262-08290-X)
• The Elements of Statistical Learning --- Data Mining, Inference, and Prediction, by
Hastie, Tibshirani, and Friedman, Springer, 2001. (ISBN:0-387-95284-5)
• Mining the Web --- Discovering Knowledge from Hypertext Data, by Chakrabarti,
Morgan Kaufmann, 2003. (ISBN:1-55860-754-4)
Reference Material II
• Software:
– Weka: Data Mining Software in Java, by University of Waikato, New Zealand
– RapidMiner
– GeNIe & SMILE, developed at the Decision Systems Laboratory, University of Pittsburgh
– bnlearn - an R package for Bayesian network learning and inference
– . . .
• Website:
– http://www.kdnuggets.com/
– ….
Topics
• Scope: Data Mining
• Topics:
– Introduction to Data Mining
– Data Understanding
– Data Preprocessing
– Data Warehousing
– Data Cube Technology
– Mining Frequent Patterns
– Advanced Pattern Mining
– Classification
– Advanced Classification Methods
– Clustering
– Outlier Detection
Grading
• Assignments 10%
• Quizzes 10%
• Project 10%
• Mid-Term Exam 30%
• Final Exam 40%
Assignment and Project
• Assignments
– No assignments will be accepted after the due date.
– Programming assignments should be well documented.
– Students are not allowed to copy each other’s work. Any such work will be marked zero.
– No tolerance for cheating. If you are not able to explain your assignment, it will be considered cheating.
• Projects
– Applying data mining techniques to solve actual problems.
DATA MINING
Definition
“Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.”
Valid: The patterns hold in general.
Novel: We did not know the pattern beforehand.
Useful: We can devise actions from the patterns.
Understandable: We can interpret and comprehend the patterns.
Alternative names
– Knowledge discovery (mining) in databases (KDD)
– Knowledge extraction
– Knowledge engineering
– Data Science
– Data/pattern analysis
– Data archeology
– Data dredging
– Information harvesting
– Business intelligence
– etc.
We will return to the actual topic in two minutes. In the meantime, we are going to play a quick game.
I am going to show you some problems which were shown to pigeons!
Let us see if you are as smart as a pigeon!
Pigeon Problem 1
Each object is a pair of bars, written here as (left bar, right bar).
Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)
What class is this object? (8, 1.5)
What about this one, A or B? (4.5, 7)
(8, 1.5): This is a B!
Here is the rule. If the left bar is smaller than the right bar, it is an A, otherwise it is a B.
Pigeon Problem 2
Examples of class A (left bar, right bar): (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B (left bar, right bar): (5, 2.5), (2, 5), (5, 3), (2.5, 3)
(8, 1.5): Even I know this one.
(7, 7): Oh! This one's hard!
So (7, 7) is an A.
The rule is as follows: if the two bars are equal sizes, it is an A. Otherwise it is a B.
Pigeon Problem 3
Examples of class A (left bar, right bar): (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B (left bar, right bar): (5, 6), (7, 5), (4, 8), (7, 7)
(6, 6): This one is really hard! What is this, A or B?
It is a B!
The rule is as follows: if the sum of the two bars is less than or equal to 10, it is an A. Otherwise it is a B.
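The three pigeon rules can be written as small functions and checked against the labeled examples. This is only an illustrative sketch: the function names (rule_1, rule_2, rule_3, training_accuracy) are made up here, and the (left bar, right bar) pairs are copied from the slides.

```python
# Each rule maps a (left bar, right bar) pair to a class label.

def rule_1(left, right):
    """Problem 1: A if the left bar is smaller than the right bar."""
    return "A" if left < right else "B"

def rule_2(left, right):
    """Problem 2: A if the two bars are equal sizes."""
    return "A" if left == right else "B"

def rule_3(left, right):
    """Problem 3: A if the sum of the two bars is at most 10."""
    return "A" if left + right <= 10 else "B"

# Labeled examples from the slides: (rule, class A pairs, class B pairs).
problems = {
    "Problem 1": (rule_1,
                  [(3, 4), (1.5, 5), (6, 8), (2.5, 5)],
                  [(5, 2.5), (5, 2), (8, 3), (4.5, 3)]),
    "Problem 2": (rule_2,
                  [(4, 4), (5, 5), (6, 6), (3, 3)],
                  [(5, 2.5), (2, 5), (5, 3), (2.5, 3)]),
    "Problem 3": (rule_3,
                  [(4, 4), (1, 5), (6, 3), (3, 7)],
                  [(5, 6), (7, 5), (4, 8), (7, 7)]),
}

def training_accuracy(rule, class_a, class_b):
    """Fraction of labeled examples that the rule classifies correctly."""
    labeled = [(p, "A") for p in class_a] + [(p, "B") for p in class_b]
    correct = sum(rule(*pair) == label for pair, label in labeled)
    return correct / len(labeled)

for name, (rule, class_a, class_b) in problems.items():
    print(name, training_accuracy(rule, class_a, class_b))

# The unlabeled objects asked about in the slides.
print(rule_1(8, 1.5), rule_2(7, 7), rule_3(6, 6))
```

Each rule agrees with every labeled example, which is what "valid" means in the definition of data mining given earlier: the pattern holds on the data.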
Pigeon Problem 1
[Figure: the class A and class B examples plotted as points, axes labeled Left Bar and Right Bar, each running from 1 to 10]
Here is the rule again. If the left bar is smaller than the right bar, it is an A, otherwise it is a B.
Pigeon Problem 2
[Figure: the class A and class B examples plotted as points, axes labeled Left Bar and Right Bar, each running from 1 to 10]
Let me look it up… here it is. The rule is: if the two bars are equal sizes, it is an A. Otherwise it is a B.
Pigeon Problem 3
[Figure: the class A and class B examples plotted as points, axes labeled Left Bar and Right Bar, each running from 10 to 100]
The rule again: if the square of the sum of the two bars is less than or equal to 100, it is an A. Otherwise it is a B. For non-negative bar heights this is the same as the sum being less than or equal to 10.
Why Mine Data? Commercial Viewpoint
• Lots of data is being collected and warehoused
– Web data, e-commerce
– purchases at department/grocery stores
– Bank/Credit Card transactions
A Single View to the Customer
[Figure: Customer, surrounded by Social Media, Gaming, Entertain, Banking, Finance, Our Known History, Purchase]
Variety (Complexity)
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
– Social Network, Semantic Web (RDF), …
• Streaming Data
– You can only scan the data once
• A single application can be generating/collecting many types of data
• Big Public Data (online, weather, finance, etc.)
To extract knowledge, all these types of data need to be linked together.
Evolution of Sciences
• Before 1600, empirical science
• 1600-1950s, theoretical science
– Each discipline has grown a theoretical component. Theoretical models often
motivate experiments and generalize our understanding.
• 1950s-1990s, computational science
– Over the last 50 years, most disciplines have grown a third, computational branch
(e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)
– Computational Science traditionally meant simulation. It grew out of our inability
to find closed-form solutions for complex mathematical models.
• 1990-now, data science
– The flood of data from new scientific instruments and simulations
– The ability to economically store and manage petabytes of data online
– The Internet and computing Grid that makes all these archives universally
accessible
– Scientific info. management, acquisition, organization, query, and visualization
tasks scale almost linearly with data volumes. Data mining is a major new
challenge!
Evolution of Database Technology
What is (not) Data Mining?
What is Data Mining?
– Certain names are more prevalent in certain locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
– Identify customers with similar buying habits
– Find all credit applicants who are poor credit risks.
What is not Data Mining?
– Look up phone number in phone directory
– Identify customers who have purchased more than $10,000 in the last month.
– Find all credit applicants with last name of Smith.
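The contrast above can be sketched in a few lines. The purchase records, the customer names, and the 0.5 similarity cutoff below are all hypothetical. The first part is a plain query whose condition is fixed in advance. The second part looks for structure that was not specified beforehand (which customers resemble each other), using Jaccard similarity of purchased item sets as one simple stand-in for "similar buying habits".

```python
from itertools import combinations

# Hypothetical purchase records: (customer, item, amount in dollars).
purchases = [
    ("ann", "laptop", 9000), ("ann", "camera", 2500),
    ("bob", "laptop", 8000), ("bob", "camera", 1500), ("bob", "tripod", 300),
    ("cat", "novel", 20), ("cat", "tea", 15),
    ("dan", "tea", 12), ("dan", "novel", 25), ("dan", "mug", 8),
]

# Not data mining: a query that filters on a known condition.
totals = {}
for customer, _item, amount in purchases:
    totals[customer] = totals.get(customer, 0) + amount
big_spenders = sorted(c for c, total in totals.items() if total > 10_000)

# Closer to data mining: discover which customers have similar baskets.
baskets = {}
for customer, item, _amount in purchases:
    baskets.setdefault(customer, set()).add(item)

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    return len(a & b) / len(a | b)

similar_pairs = sorted(
    (x, y)
    for x, y in combinations(sorted(baskets), 2)
    if jaccard(baskets[x], baskets[y]) >= 0.5  # assumed cutoff
)

print(big_spenders)
print(similar_pairs)
```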
Knowledge Discovery (KDD) Process
• This is a view from typical database systems and data warehousing communities
• Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process flow: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
A Mining Framework
• Mining usually involves
– Data cleaning
– Data integration from multiple sources
– Warehousing the data
– Data cube construction
– Data selection for data mining
– Data mining
– Presentation of the mining results
– Patterns and knowledge to be used or stored into
knowledge-base
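The steps listed above can be chained as a toy pipeline. This is a minimal sketch under stated assumptions: the two record sources, the field names, and the helper functions (clean, integrate, select, mine) are hypothetical, and the "mining" step is just a frequency count standing in for a real algorithm. Warehousing and data cube construction are not modeled.

```python
from collections import Counter

# Two hypothetical sources with overlapping transaction records.
store_sales = [
    {"id": 1, "item": "milk", "qty": 2},
    {"id": 2, "item": "bread", "qty": None},  # missing value
    {"id": 3, "item": "milk", "qty": 1},
]
web_sales = [
    {"id": 3, "item": "milk", "qty": 1},      # duplicate of a store record
    {"id": 4, "item": "eggs", "qty": 12},
    {"id": 5, "item": "milk", "qty": 3},
]

def clean(records):
    """Data cleaning: drop records that have a missing field."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: merge sources, keeping one record per id."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(record["id"], record)
    return list(merged.values())

def select(records, field):
    """Data selection: keep only the task-relevant attribute."""
    return [record[field] for record in records]

def mine(values, min_count):
    """Toy mining step: values that occur at least min_count times."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n >= min_count}

warehouse = integrate(clean(store_sales), clean(web_sales))
patterns = mine(select(warehouse, "item"), min_count=2)

# Presentation of the mining results.
for item, n in patterns.items():
    print(f"{item} appears in {n} transactions")
```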
Data Mining in Business Intelligence
[Figure: pyramid of layers; the potential to support business decisions increases toward the top. Layers, top to bottom:]
• Decision Making
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles alongside the layers, top to bottom: End User, Business Analyst, Data Analyst, DBA
Mining vs. Data Exploration
• Business intelligence view
– Warehouse, data cube, reporting but not much mining
• Business objects vs. data mining tools
• Supply chain example: tools
• Data presentation
• Exploration
KDD Process: A Typical View from ML and Statistics
• This is a view from typical machine learning and statistics communities
[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing]
• Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
• Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
• Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
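Two of the pre-processing steps named in this view, normalization and feature selection, can be sketched as follows. The data values are hypothetical. Min-max scaling and a variance filter are only one simple choice for each step; the slides do not prescribe a specific method.

```python
from statistics import pvariance

# Hypothetical data: rows are objects, columns are features.
rows = [
    [170.0, 65.0, 1.0],
    [180.0, 80.0, 1.0],
    [160.0, 50.0, 1.0],
    [175.0, 72.0, 1.0],
]

def columns(data):
    """Transpose rows into one list per feature."""
    return [list(col) for col in zip(*data)]

def min_max_normalize(col):
    """Normalization: rescale a feature to the range [0, 1]."""
    lo, hi = min(col), max(col)
    if hi == lo:                      # constant feature, avoid dividing by zero
        return [0.0 for _ in col]
    return [(x - lo) / (hi - lo) for x in col]

def select_features(cols, min_variance):
    """Feature selection: indices of features whose variance exceeds a floor."""
    return [i for i, col in enumerate(cols) if pvariance(col) > min_variance]

normalized = [min_max_normalize(col) for col in columns(rows)]
kept = select_features(normalized, min_variance=0.0)

print(normalized[0])
print(kept)
```

The third feature is constant, so it carries no information for telling objects apart and is dropped.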
Example: Medical Data Mining
• Health care & medical data mining often adopt this view from statistics and machine learning
• Preprocessing of the data (including feature extraction and dimension reduction)
• Classification and/or clustering processes
• Post-processing for presentation
Origins of Data Mining
• Draws ideas from machine learning/AI, statistics, database systems, etc.
[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, and Other Disciplines]
What is Machine Learning?
• Machine Learning
– Study of algorithms that
– improve their performance
– at some task
– with experience
• Optimize a performance criterion using example data or past experience.
• Role of Statistics: Inference from a sample
• Role of Computer science: Efficient algorithms to
– Solve the optimization problem
– Represent and evaluate the model for inference
Machine Learning
• According to Herbert Simon, learning is “Any change in a System that allows it to perform better the second time on repetition of the same task or on another task drawn from the same population.” [G. F. Luger and W. A. Stubblefield, Artificial Intelligence: Structures and Strategies for Complex Problem Solving, The Benjamin/Cummings Publishing Company, Inc., 1989.]
Why “Learn”?
• Machine learning is programming computers to optimize a performance criterion using example data or past experience.
• Learning is used when:
– Human expertise does not exist (navigating on Mars)
– Humans are unable to explain their expertise (speech recognition)
– Solution changes in time (routing on a computer network)
– Solution needs to be adapted to particular cases (user biometrics)
The machine learning pipeline
ML Methods
• Supervised Learning
– Classification
– Regression/Prediction
• Unsupervised Learning
• Association Analysis
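As a small illustration of supervised classification, the sketch below labels a new object with the class of its closest labeled example (a 1-nearest-neighbor classifier). The training pairs are the Pigeon Problem 1 examples. The choice of nearest neighbor and the function name are assumptions made for this example, not a method taken from the slides.

```python
import math

# Labeled training data: ((left bar, right bar), class) from Pigeon Problem 1.
training = [
    ((3, 4), "A"), ((1.5, 5), "A"), ((6, 8), "A"), ((2.5, 5), "A"),
    ((5, 2.5), "B"), ((5, 2), "B"), ((8, 3), "B"), ((4.5, 3), "B"),
]

def nearest_neighbor(query, examples):
    """Return the label of the training example closest to the query."""
    _point, label = min(examples, key=lambda ex: math.dist(ex[0], query))
    return label

# The two unlabeled objects from Pigeon Problem 1.
print(nearest_neighbor((8, 1.5), training))
print(nearest_neighbor((4.5, 7), training))
```

Supervised learning needs the class labels in the training data. An unsupervised method would receive only the pairs and would have to group them without being told which are A and which are B.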
Predicting house prices
Sentiment analysis
Document retrieval
Product recommendation
Visual Product recommender
Model Choice
– What type of classifier shall we use? How shall we select its parameters? Is there a best classifier...?
– How do we train...? How do we adjust the parameters of the model (classifier) we picked so that the model fits the data?
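One way to picture "adjusting the parameters so that the model fits the data" is a model with a single parameter. In the sketch below the model says A when (right bar minus left bar) exceeds a threshold, and training means trying candidate thresholds and keeping the one with the fewest errors on the labeled examples. The model form and the search procedure are assumptions chosen for illustration. The data are the Pigeon Problem 1 examples.

```python
# Pigeon Problem 1 examples as (left bar, right bar, class).
examples = [
    (3, 4, "A"), (1.5, 5, "A"), (6, 8, "A"), (2.5, 5, "A"),
    (5, 2.5, "B"), (5, 2, "B"), (8, 3, "B"), (4.5, 3, "B"),
]

def predict(left, right, threshold):
    """One-parameter model: A when (right - left) exceeds the threshold."""
    return "A" if right - left > threshold else "B"

def training_errors(threshold):
    """Number of labeled examples the model gets wrong for this threshold."""
    return sum(predict(l, r, threshold) != label for l, r, label in examples)

# Candidate parameter values: midpoints between consecutive sorted differences.
diffs = sorted(r - l for l, r, _label in examples)
candidates = [(a + b) / 2 for a, b in zip(diffs, diffs[1:])]

# "Training": keep the candidate with the fewest training errors.
best_threshold = min(candidates, key=training_errors)
print(best_threshold, training_errors(best_threshold))
```

The fitted threshold lands between the largest class B difference and the smallest class A difference, close to the hand-written rule "left bar smaller than right bar", which corresponds to a threshold of 0.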
Features
• Features: a set of variables believed to carry discriminating and characterizing information about the objects under consideration
• Feature vector: A collection of d features, ordered in some meaningful way into a d-dimensional column vector, that represents the signature of the object to be identified.
• Feature space: The d-dimensional space in which the feature vectors lie. A d-dimensional vector in a d-dimensional space constitutes a point in that space.
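The definitions above can be made concrete with a short sketch. The objects and feature names below (weight_g, diameter_cm, redness) are hypothetical. The point is that a fixed feature order turns every object into a point in the same d-dimensional space.

```python
# Hypothetical raw objects described by named measurements.
objects = [
    {"name": "apple", "weight_g": 150.0, "diameter_cm": 7.0, "redness": 0.9},
    {"name": "lemon", "weight_g": 100.0, "diameter_cm": 5.5, "redness": 0.1},
]

# Features: the variables believed to carry discriminating information.
# Fixing the order once makes position i mean the same thing in every vector.
FEATURE_ORDER = ("weight_g", "diameter_cm", "redness")

def to_feature_vector(obj):
    """Feature vector: the object's feature values in the agreed order."""
    return tuple(obj[name] for name in FEATURE_ORDER)

vectors = [to_feature_vector(obj) for obj in objects]
d = len(FEATURE_ORDER)  # dimension of the feature space

print(d)
for obj, vec in zip(objects, vectors):
    print(obj["name"], vec)  # each vector is one point in the d-dimensional space
```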
Features
Feature space (3D)
Features
• Feature Choice
– Good Features
• Ideally, for a given group of patterns coming from the same class, feature values should all be similar
• For patterns coming from different classes, the feature values should be different.
– Bad Features
• irrelevant, noisy, outlier?
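The two criteria for a good feature (similar values within a class, different values across classes) can be combined into a single score. The sketch below uses a Fisher-style ratio: squared difference of class means divided by the sum of within-class variances. That score is one common choice, not something the slides specify, and the data are hypothetical.

```python
from statistics import mean, pvariance

# Hypothetical two-feature data: feature 0 differs between the classes,
# feature 1 has the same average in both.
class_a = [(1.0, 5.0), (1.2, 3.0), (0.8, 7.0)]
class_b = [(3.0, 4.0), (3.2, 6.0), (2.8, 5.0)]

def separation(feature_index):
    """Between-class distance relative to within-class spread for one feature.

    Assumes the feature is not constant within both classes, otherwise the
    denominator would be zero.
    """
    a = [x[feature_index] for x in class_a]
    b = [x[feature_index] for x in class_b]
    within = pvariance(a) + pvariance(b)
    return (mean(a) - mean(b)) ** 2 / within

scores = [separation(i) for i in range(2)]
print(scores)
```

A high score suggests a good feature in the sense above. A score near zero suggests the feature does not help to tell the classes apart on its own.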
Features
[Figure: example plots of “Good” features and “Bad” features, with panels titled Linear separability, Non-linear separability, Highly correlated features, Multi-modal]
Readings from Book (3rd Edn.)
• Chapter – 1
Acknowledgments
• Lecture slides are adapted from Data Mining: Concepts and Techniques by Han, Kamber and Pei https://hanj.cs.illinois.edu/bk3/bk3_slidesindex.htm
• Lecture slides are adapted from lectures of Dr. Aman Ullah, SS CASE IT, Islamabad
• Lecture series https://www.youtube.com/watch?v=h-q582wpb4Q&list=PLYwpaL_SFmcChP0xiW3KK9elNuhfCLVVi
• Lecture series https://www.youtube.com/watch?v=wAbyG4M2gns&t=1751s
• http://www.cs.uoi.gr/~tsap/teaching/2012f-cs059/slides-en.html