data mining - lecture 1
Data mining at IIT Mandi
-
CS 660: Data Mining for Decision Making
Lecture 1 (Week 1)
Varun Dutt
School of Computing and Electrical Engineering
School of Humanities and Social Sciences
Indian Institute of Technology Mandi, India
Scaling the Heights
-
Course Instructor
Prof. Varun Dutt
School of Computing and Electrical Engineering
School of Humanities and Social Sciences
Indian Institute of Technology, Mandi
PWD Rest House 2nd Floor, Mandi - 175 001, H.P., India
Phone: +91-1905-267041
Email: [email protected]
Office Hours: Only with a prior appointment
-
A Little About Me! In the office
Qualifications
M.S. degrees in Software Engineering, Engineering and Public Policy, and Rational Simulation (cognitive modeling) from Carnegie Mellon University
Ph.D. in Engineering and Public Policy from Carnegie Mellon University
Post-doctoral fellowship from Carnegie Mellon University
Since 2012 at Indian Institute of Technology, Mandi, India
Research interests
Artificial intelligence and cognitive modeling, Human-Computer Interaction, Environmental decision making, Judgment and Decision Making
Professional Experience
Served as a Software Engineer in Tata Consultancy Services (TCS) and in MothersonSumi INfotech and Designs Ltd.
Serves as Knowledge Editor of a financial daily, Financial Chronicle
Serves as Lead Author on Chapter 2 of the UN IPCC's AR5 (WG III) report
-
A Little About Me! At home
ABBA Fan
Married to Dr. Rajeshwari Dutt with a cute little daughter
Get no sleep!
Do a lot of writing and have a back problem
I have a TA to help!
x5
-
Teaching Assistants
- Sanjay Rathee, Ph.D. student, SCEE, IIT Mandi. Email: [email protected] (Has recently been working on parallelizing the Apriori algorithm.)
- Akash Porwal, Ph.D. student, SCEE, IIT Mandi. Email: [email protected] (Has recently joined and is working on electrical problems concerning solar photovoltaics.)
-
What about you folks?
Please introduce yourselves
-
Announcements
Syllabus
Your Grade:
30% Final exam
20% Surprise Quizzes
10% Class Participation
20% Class Assignments
20% Class Project
-
Course Logistics
- Please don't copy or plagiarize! Being an AI researcher, I know how to catch it
- If found, consequences will be catastrophic!
- If you do use someone else's material, please cite the sources as (author, date), e.g., (Dutt, 2012)
-
An Example (Witten, Frank, & Hall, 2011)
-
Data Mining: What is it? (Witten, Frank, & Hall, 2011)
Data mining is defined as the process of discovering structural patterns in data.
The process must be automatic or (more usually) semiautomatic.
The patterns discovered must be meaningful in that they lead to some advantage, usually an economic one.
The data is invariably present in substantial quantities.
-
Example
-
Structural Description (Pattern) in Data
If tear production rate = reduced then recommendation = none
Otherwise, if age = young and astigmatic = no then recommendation = soft
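The two rules on this slide can be read as a tiny program. A minimal Python sketch follows; the function name `recommend_lenses`, the argument names, and the `None` fallback for cases the two rules do not cover are illustrative assumptions, not part of the slide.

```python
def recommend_lenses(tear_production_rate, age, astigmatic):
    """Apply the two rules from the slide, in order.

    Returns None when neither rule fires; the slide shows only
    these two rules, so the remaining cases are left undecided here.
    """
    if tear_production_rate == "reduced":
        return "none"
    if age == "young" and astigmatic == "no":
        return "soft"
    return None  # assumption: case not covered by the two rules shown


print(recommend_lenses("reduced", "young", "no"))  # first rule fires
print(recommend_lenses("normal", "young", "no"))   # second rule fires
```

The "Otherwise" in the second rule is what the early `return` expresses: the second test is reached only when the first rule did not apply.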
-
Weather Dataset
In this case there are four attributes: outlook, temperature,
humidity, and windy. The outcome is whether to play or not.
-
Structural Description (Pattern) in Data (also called a Decision List)
A set of rules learned from this information might look like this:
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
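Written as code, the five rules become a chain of tests in which the first match decides. This is only a sketch: the function name `play` and the use of a Python boolean for `windy` are assumptions made for illustration. Temperature is omitted because none of the rules tests it.

```python
def play(outlook, humidity, windy):
    """Evaluate the five rules in order; the first one that applies decides."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # "none of the above"


print(play("sunny", "high", False))
print(play("rainy", "normal", True))
```

The second call shows why order matters: the day is rainy and windy with normal humidity, so the second rule answers before the humidity rule is ever consulted.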
-
Decision List
These rules are meant to be interpreted in order: the first one; then, if it doesn't apply, the second; and so on. A set of rules that are intended to be interpreted in sequence is called a decision list.
Interpreted as a decision list, the rules correctly classify all of the examples in the table, whereas taken individually, out of context, some of the rules are incorrect. For example, the rule "if humidity = normal then play = yes" gets one of the examples wrong (check which one).
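Both claims can be checked mechanically. The sketch below assumes the table on the slide is the standard 14-row nominal weather dataset from Witten, Frank, & Hall (2011); the table itself is not part of this transcript, so the rows are reproduced from the textbook. The names `WEATHER`, `RULES`, and `classify` are illustrative.

```python
# Nominal weather data (assumed to match the slide's table):
# (outlook, temperature, humidity, windy, play)
WEATHER = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]

# Decision list: (condition, outcome) pairs, tried in order.
RULES = [
    (lambda o, h, w: o == "sunny" and h == "high", "no"),
    (lambda o, h, w: o == "rainy" and w, "no"),
    (lambda o, h, w: o == "overcast", "yes"),
    (lambda o, h, w: h == "normal", "yes"),
    (lambda o, h, w: True, "yes"),  # "none of the above"
]


def classify(outlook, humidity, windy):
    """Return the outcome of the first rule whose condition holds."""
    for condition, outcome in RULES:
        if condition(outlook, humidity, windy):
            return outcome


# Rows the full decision list gets wrong.
list_errors = [r for r in WEATHER if classify(r[0], r[2], r[3]) != r[4]]

# The humidity rule taken alone: every humidity = normal row is predicted "yes".
rule4_errors = [r for r in WEATHER if r[2] == "normal" and r[4] != "yes"]

print("decision list errors:", len(list_errors))
print("humidity rule alone, errors:", len(rule4_errors))
```

Printing `rule4_errors` reveals which example the humidity rule misclassifies when it is taken out of context.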
-
Weather Dataset: Two of the attributes (temperature and humidity) have numeric values
-
Structural Description (Classification Rules)
For this example, there must be inequalities involving these attributes rather than simple equality tests as in the former case.
This is called a numeric-attribute problem; in this case, a mixed-attribute problem, because not all attributes are numeric.
Now the first rule given earlier might take the form:
If outlook = sunny and humidity > 83 then play = no
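In code, the only change from the nominal case is that the humidity test becomes a comparison against a threshold. A minimal sketch, in which the function name `first_rule` and the sample humidity readings are hypothetical:

```python
def first_rule(outlook, humidity):
    """Numeric form of the first rule: an inequality on humidity.

    Returns "no" when the rule fires and None when it does not apply
    (the remaining rules of the list are omitted in this sketch).
    """
    if outlook == "sunny" and humidity > 83:
        return "no"
    return None


print(first_rule("sunny", 85))  # humidity above the threshold
print(first_rule("sunny", 70))  # humidity below the threshold
```

Note that the test is strict: a sunny day with humidity exactly 83 does not trigger the rule.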
-
Association Rules
-
Association Rules
-
Preparing Input Data for Data Mining
Data Cleaning (also called data scrubbing or data cleansing) is the process of amending or removing data in a database that is incorrect, incomplete, improperly formatted, or duplicated. It is a time-consuming activity, often done in a semi-automated manner.
Missing Values: Missing values are frequently indicated by out-of-range entries. Example: a negative number (e.g., -1) in a numeric field that is normally only positive, or a 0 in a numeric field that can never normally be 0. For nominal attributes, missing values may be indicated by blanks or dashes.
Inaccurate Values: The same item may appear as Pepsi in one place and Pepsi-Cola in another. Typographical errors. Example: a supermarket cashier uses her own loyalty card to give discounts to customers who forgot theirs, so those purchases are recorded against the wrong person.
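The two checks described on this slide (out-of-range entries standing in for missing values, and inconsistent spellings of the same item) might be sketched as follows. The record layout, the helper names, and the variant table `CANONICAL` are all hypothetical; a real cleaning step would be driven by knowledge of the specific database.

```python
MISSING = None  # marker used here for a missing value

# Hypothetical spelling variants mapped to one canonical label.
CANONICAL = {"pepsi": "Pepsi", "pepsi-cola": "Pepsi", "pepsi cola": "Pepsi"}


def clean_quantity(value):
    """Treat an out-of-range entry (a negative number) as missing."""
    return MISSING if value is None or value < 0 else value


def clean_product(name):
    """A blank or a dash means missing; otherwise normalize known variants."""
    stripped = name.strip()
    if stripped in ("", "-"):
        return MISSING
    return CANONICAL.get(stripped.lower(), stripped)


records = [
    {"product": "Pepsi", "quantity": 2},
    {"product": "Pepsi-Cola", "quantity": -1},
    {"product": "-", "quantity": 3},
]

cleaned = [
    {"product": clean_product(r["product"]),
     "quantity": clean_quantity(r["quantity"])}
    for r in records
]

for row in cleaned:
    print(row)
```

Mapping the sentinel to an explicit missing marker, rather than leaving -1 in place, keeps it from being averaged or counted as a real quantity later on.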
-
Applications of Data Mining in Real World
Web-mining: Prestige of a web-page based upon how many pages link to it (PageRank)
Decisions involving judgments (banks use data-mining when deciding on loans: accept or reject cases)
Screening images (detecting oil slicks at sea using satellite data)
Load forecasting in Electricity Industry
Diagnosing faults in machines in Industry
Marketing and Sales (Pharmaceutical Industry Patient Journeys; Market-Basket Analysis, e.g., Pepsi and Diapers on Thursdays; Discount or Loyalty Cards to Collect Data)
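As a small illustration of the market-basket item, the strength of a rule such as "Pepsi implies Diapers" is commonly summarized by its support and confidence. The sketch below uses invented baskets; the data and the function names are assumptions for illustration only.

```python
# Hypothetical transactions; each set is the contents of one basket.
baskets = [
    {"pepsi", "diapers", "chips"},
    {"pepsi", "diapers"},
    {"pepsi", "bread"},
    {"diapers", "milk"},
    {"bread", "milk"},
]


def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent):
    """Among baskets containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent) / support(antecedent)


print(support({"pepsi", "diapers"}))
print(confidence({"pepsi"}, {"diapers"}))
```

Here two of the five baskets contain both items, and two of the three Pepsi baskets also contain diapers.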
-
Activities
Read Witten, Frank, and Hall, 2011: Chapter 1 (up to page 15, before CPU performance; pages 21-29, 51-52, 58-60):
http://www.cse.hcmut.edu.vn/~chauvtn/data_mining/Texts/[7]%20Data%20Mining%20-%20Practical%20Machine%20Learning%20Tools%20and%20Techniques%20(3rd%20Ed).pdf
Read Singhal, 2011
-
Thank you!
Comments and Questions most welcome!