![Page 1: Introduction. Instructor: Fatme El-Moukaddem Email: elmoukad@egr.msu.edu Office: Room 2312 Engineering Building Office Hours: Tuesdays: 10:00am-11:00am](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649ca25503460f949617e8/html5/thumbnails/1.jpg)
Introduction

Contact Information
•Instructor: Fatme El-Moukaddem
•Email: elmoukad@egr.msu.edu
•Office: Room 2312 Engineering Building
•Office Hours:
  • Tuesdays: 10:00am-11:00am
  • By appointment (Tue/Thu)

Books
•Introduction to Data Mining
  • Pang-Ning Tan, Michael Steinbach, Vipin Kumar
•Data Mining: Practical Machine Learning Tools and Techniques
  • Ian Witten, Eibe Frank, Mark Hall

Assessment
•Homework: 30%
•Exam 1: 20%
•Exam 2: 20%
•Project (Paper & Presentation): 30%

Make-Up Exam Policy
•Only in case of an emergency
•Documentation needed
•If you know ahead of time, let me know

Important Dates
•Exam 1: Sept 30th
•Exam 2: Nov 18th
•Last date to drop with full refund: Sept 22nd
•Last date to drop with no grade: Oct 15th

Programming Assignments
•Weka software
•Matlab
•GNU Octave

Why Data Mining?
•Large amounts of data are collected daily
  ◦ Business: sales transactions, customer feedback, stock trading records, product descriptions
  ◦ Telecommunication networks: carry terabytes of data every day
  ◦ Medical fields: generate huge amounts of medical records and patient monitoring data
  ◦ Engineering: scientific experiments, environment monitoring, process measuring
•Non-traditional nature of data
•Difficult to analyze manually; important decisions are made based on intuition, not on data
•Powerful tools are needed to automatically uncover valuable information

Why Data Mining?
•The gap between data and information calls for the development of data mining tools
•Natural evolution of information technology
  ◦ Data collection
  ◦ Database creation and management
  ◦ Advanced data analysis
  ◦ Data mining

Applications - Business
◦ Collect all information about customers' purchases and interests
  ◦ Point of sale data collection
  ◦ Web logs from e-commerce
◦ Make informed business decisions
  ◦ Customer profiling
  ◦ Targeted marketing
  ◦ Workflow management
  ◦ Store layout
  ◦ Fraud detection

Questions - Banking
◦ What potential factors will draw investors to the bank?
◦ What are the main factors that leave customers unsatisfied?
◦ What are the potential types of loans that might bring profit?
◦ What methods are commonly used to commit fraud?
◦ What incentives will leave customers satisfied?

Questions - Supermarket
◦ What items in the store are popular among teenagers?
◦ How likely is it that a vegetarian customer will buy non-vegetarian products?
◦ If an item is purchased by a customer, what other items are likely to be purchased at the same time?
◦ What kind of items should be stocked during the holiday seasons?

Applications - Healthcare
◦ Predicting patient outcomes
◦ Infection control
◦ Pharmaceutical research
◦ Treatment effectiveness
◦ Sample questions
  ◦ How likely is it that an adult who is over 70 and has had a stroke will have a heart attack?
  ◦ What are the characteristics of patients with a history of at least one occurrence of stroke?
  ◦ What hospitals provide patients the best recovery rate?

What is Data Mining?
•The process of discovering interesting patterns and knowledge from large amounts of data
•Blends traditional data analysis methods with sophisticated algorithms
•Part of the Knowledge Discovery in Databases (KDD) process: converting raw data into useful information

What kind of Data?
•Any data, as long as it is meaningful for the target application
  ◦ Database data
  ◦ Data warehouse data
  ◦ Data streams
  ◦ Sequence data
  ◦ Graph data
  ◦ Spatial data
  ◦ Text data

Challenges
•Scalability: terabytes of data
  • Need for efficient algorithms
•High dimensionality:
  • Data with hundreds or thousands of attributes
•Heterogeneous and complex data:
  • Web pages, DNA data, data with temporal and spatial correlation

Challenges
•Data ownership and distribution: data at different physical locations
  • Reduce communication
  • Consolidate results from multiple sources
  • Address data security issues
•Data analysis: hypothesis generation and tests
  • Thousands of hypotheses

Origins
•Build upon methodology from existing fields:
  ◦ Statistics: sampling, estimation, modeling techniques, hypothesis testing
  ◦ Machine learning and pattern recognition: search algorithms, modeling techniques, and learning theory
  ◦ Database systems
  ◦ Parallel and distributed computing

Data Mining Tasks
•Two major categories:
  ◦ Predictive tasks: predict the value of a particular attribute (target or dependent variable) based on the values of other attributes (explanatory or independent variables)
  ◦ Descriptive tasks: derive patterns that summarize relationships in the data
    ◦ Correlations, trends, clusters, anomalies

Predictive Modeling
•Build a model for the target variable as a function of the explanatory variables
•Classification: discrete target variables
  ◦ Example: Predict whether a customer will renew a contract (yes/no)
•Regression: continuous target variables
  ◦ Example: Predict the future price of a stock
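
As a rough illustration of the two kinds of target variable, the sketch below uses a nearest-neighbour predictor. That technique is chosen only because it is short; the slides do not name an algorithm. The single feature, the data values, and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: (months as a customer, renewed contract?)
RENEWALS = [(2, "no"), (3, "no"), (5, "no"), (12, "yes"), (18, "yes"), (24, "yes")]
# Hypothetical toy data: (day number, closing price)
PRICES = [(1, 10.0), (2, 11.0), (3, 12.0), (4, 13.0)]


def nearest(train, x, k):
    """Return the k training rows whose feature value is closest to x."""
    return sorted(train, key=lambda row: abs(row[0] - x))[:k]


def knn_classify(train, x, k=3):
    """Classification: the target is discrete, so take a majority vote."""
    votes = Counter(target for _, target in nearest(train, x, k))
    return votes.most_common(1)[0][0]


def knn_regress(train, x, k=3):
    """Regression: the target is continuous, so average the neighbours."""
    neighbours = nearest(train, x, k)
    return sum(target for _, target in neighbours) / len(neighbours)


print(knn_classify(RENEWALS, 20))   # a label: "yes" or "no"
print(knn_regress(PRICES, 5, k=2))  # a number
```

The only difference between the two functions is how the neighbours' targets are combined: a vote yields a class label, an average yields a numeric value.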

Classification Example
•Goal: classify an Iris flower into one of three Iris species
•Data: Iris data set (sepal width, sepal length, petal width, petal length, class)

Classification Example
•Divide the petal width and petal length attributes into categories (low, medium, high) to simplify
•Rules:
  ◦ Petal width low and petal length low => Setosa
  ◦ Petal width medium and petal length medium => Versicolour
  ◦ Petal width high and petal length high => Virginica
•Good classification but not perfect
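
The three rules above can be sketched as a small lookup. The slides do not give numeric thresholds, so the cut points below are assumptions picked for illustration, as are the function names and the sample measurements.

```python
# Assumed cut points in centimetres; the slides give no numeric thresholds.
WIDTH_CUTS = (0.75, 1.75)
LENGTH_CUTS = (2.5, 5.0)

# The three rules from the slide, keyed by (petal width, petal length) category.
RULES = {
    ("low", "low"): "Setosa",
    ("medium", "medium"): "Versicolour",
    ("high", "high"): "Virginica",
}


def discretize(value, cuts):
    """Map a numeric measurement to low, medium, or high."""
    low_cut, high_cut = cuts
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "medium"
    return "high"


def classify(petal_width, petal_length):
    """Return the species named by the matching rule, or None if no rule fires."""
    key = (discretize(petal_width, WIDTH_CUTS), discretize(petal_length, LENGTH_CUTS))
    return RULES.get(key)


print(classify(0.2, 1.4))  # Setosa
print(classify(1.8, 4.8))  # None: width is high but length is medium
```

The second call shows one way the rules fall short: a flower whose two categories disagree matches no rule at all.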

Association Analysis
•Used to discover patterns that describe strongly associated features in the data
•Discovered patterns are represented as implication rules
•Search space is exponential
•Goal is to extract the most interesting patterns
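
To make the "exponential search space" point concrete, the sketch below counts candidate patterns for d distinct items. The closed forms are standard counting results rather than something stated on the slides: each item is either in or out of an itemset, giving 2^d - 1 non-empty itemsets, and each item goes on the left side of a rule, on the right side, or on neither, giving 3^d - 2^(d+1) + 1 rules once empty sides are excluded. The function names are hypothetical.

```python
from itertools import combinations


def count_itemsets(d):
    """Number of non-empty itemsets over d distinct items."""
    return 2 ** d - 1


def count_rules(d):
    """Number of rules X -> Y with X and Y non-empty and disjoint."""
    return 3 ** d - 2 ** (d + 1) + 1


def enumerate_rules(items):
    """Brute-force enumeration, used here only to check the closed form."""
    items = sorted(items)
    rules = []
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            for k in range(1, size):
                for lhs in combinations(itemset, k):
                    rhs = tuple(i for i in itemset if i not in lhs)
                    rules.append((lhs, rhs))
    return rules


print(len(enumerate_rules(["bread", "butter", "milk"])))  # 12
print(count_rules(10))  # 57002
```

Three items already allow 12 rules and ten items allow 57,002, which is why association algorithms prune the search rather than enumerate every rule.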

Association Example
•Goal: find items that are frequently bought together
•Rules:
  ◦ {Diapers} -> {Milk}
  ◦ {Bread} -> {Butter, Milk}

| Trans. ID | Items |
|---|---|
| 1 | {bread, butter, diapers, milk} |
| 2 | {coffee, sugar, cookies, salmon} |
| 3 | {bread, butter, tea, eggs, milk} |
| 4 | {butter, diapers, milk, eggs, cookies} |
| … | … |
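
One common way to judge rules like these is by support (the fraction of transactions containing every item in the rule) and confidence (the fraction of transactions containing the left side that also contain the right side). The slides do not name these measures, so treat them as an assumption. The sketch uses only the four transactions listed; the elided rows are not guessed at, and the function names are hypothetical.

```python
# The four transactions shown on the slide; the remaining rows are elided there.
TRANSACTIONS = [
    {"bread", "butter", "diapers", "milk"},
    {"coffee", "sugar", "cookies", "salmon"},
    {"bread", "butter", "tea", "eggs", "milk"},
    {"butter", "diapers", "milk", "eggs", "cookies"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing antecedent, the fraction also containing consequent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return support(antecedent | consequent, transactions) / base


for lhs, rhs in [({"diapers"}, {"milk"}), ({"bread"}, {"butter", "milk"})]:
    print(sorted(lhs), "->", sorted(rhs),
          support(lhs | rhs, TRANSACTIONS),
          confidence(lhs, rhs, TRANSACTIONS))
```

On these four rows both rules have support 0.5 and confidence 1.0: diapers appear in transactions 1 and 4, bread in 1 and 3, and the right-hand items are present each time.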

Clustering
•Finds groups of closely related observations such that observations in the same group are more similar to one another than to those in other groups
•Applications:
  • Astronomy: aggregation of stars, galaxies, …
  • Biology: plant and animal ecology
  • Medical imaging
  • Market research

Clustering Example
•Goal: group related documents together
•Each document is represented by a list of pairs (w, c), where w is a word and c is its number of occurrences
  1: (dollar, 1), (industry, 4), (country, 2), (labor, 2), (death, 1)
  2: (machinery, 2), (labor, 3), (market, 4), (country, 1)
  3: (death, 2), (cancer, 1), (health, 3)
  …
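
Grouping documents requires some measure of how alike two word-count lists are. The sketch below uses cosine similarity, a common choice for text; the slides do not say which measure is intended, so this is an assumption, and the function name is hypothetical. Only the three documents listed on the slide are used.

```python
import math

# Word counts for the three documents shown on the slide.
DOCS = {
    1: {"dollar": 1, "industry": 4, "country": 2, "labor": 2, "death": 1},
    2: {"machinery": 2, "labor": 3, "market": 4, "country": 1},
    3: {"death": 2, "cancer": 1, "health": 3},
}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


for i, j in [(1, 2), (1, 3), (2, 3)]:
    print(i, j, round(cosine_similarity(DOCS[i], DOCS[j]), 2))
```

Documents 1 and 2 share "labor" and "country" and score about 0.29; documents 1 and 3 share only "death" and score about 0.10; documents 2 and 3 share no words and score 0. A clustering algorithm working from these scores would tend to place 1 and 2 together before either joins 3.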

Anomaly Detection
•Identifies observations whose characteristics are significantly different from the rest of the data => anomalies or outliers
•Applications:
  • Fraud detection
  • Network intrusions
  • Unusual patterns of disease
  • Ecosystem disturbances
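
A very simple way to flag observations that differ from the rest is a z-score test: measure how many standard deviations each value lies from the mean. This is only one of many possible techniques and is not named on the slides. The threshold of 2.0, the function name, and the transaction amounts are hypothetical.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical card transaction amounts with one unusually large purchase.
amounts = [20, 22, 19, 21, 23, 20, 500]
print(zscore_outliers(amounts))  # [500]
```

Note that a single extreme value also inflates the mean and standard deviation it is judged against, which is one reason more robust methods are usually preferred for real fraud or intrusion data.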

Course Outline
•Preprocessing techniques
•Classification
•Association
•Clustering
•Anomaly detection
•Case studies