data mining 1 - linköping universitystaffjimjo/courses/tnm048/lectures/... · 2019-01-28 ·...
TRANSCRIPT
![Page 1: Data Mining 1 - Linköping Universitystaffjimjo/courses/TNM048/lectures/... · 2019-01-28 · Information Visualization: Data Mining - 1 Matt Cooper Big Data 2 •“Principles of](https://reader034.vdocuments.us/reader034/viewer/2022050602/5fa95afa2a6d0d04c8463899/html5/thumbnails/1.jpg)
Information Visualization: Data Mining - 1
Matt Cooper Big Data
• “Principles of Data Mining”
• David Hand
• Heikki Mannila
• Padhraic Smyth
• Mostly about data mining algorithms
• “Data Preparation for Data Mining”
• Dorian Pyle
• Concentrates on data preparation
Books
• Part 1: What is the problem?
• Motivation: what is the goal of data mining?
• What is data mining?
• How is it used?
• How does data mining relate to:
• InfoViz
• Knowledge discovery
• VDM – Visual Data Mining
Part 1
• Q. What is Visualization?
• A. Using some medium/media to convey a representation of some data so that the user can form a cognitive understanding of the data
• It is *not* making pictures!
What is InfoViz
• Often displayed like this
• Transform=data filtering
• Mapping?
• Representation?
[Diagram: visualization pipeline with boxes labelled Data, New data, Transform, Mapping, Representation, Perception, Display]
Visualization
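The stages named on this slide can be sketched as three small functions. This is only an illustration: the function names, the toy records and the text-bar output are assumptions, not part of the lecture.

```python
# Toy records: (label, temperature in degrees C). Invented for illustration.
data = [("A", 12.0), ("B", -3.0), ("C", 25.0), ("D", 18.0)]

def transform(records, min_temp=0.0):
    """Transform = data filtering: keep only records at or above min_temp."""
    return [r for r in records if r[1] >= min_temp]

def mapping(records, max_width=20):
    """Mapping: turn each data value into a visual attribute (a bar length)."""
    top = max(t for _, t in records)
    return [(name, round(max_width * t / top)) for name, t in records]

def representation(mapped):
    """Representation: draw the visual attributes as text bars."""
    return [f"{name} {'#' * width}" for name, width in mapped]

new_data = transform(data)                 # Data -> New data
bars = representation(mapping(new_data))   # New data -> Mapping -> Representation
print("\n".join(bars))
```

Perception is the one stage that cannot be coded: it happens in the viewer.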
• Representation: false ‘picture’ of physical qualities
• Molecules
• Fluid flows
• Body bits
• Primarily 3D -> volume displays
• Sometimes with time -> ‘animation’
• Very occasionally higher dimensionality
For Scientific Visualization:
• Data has no ‘real’ representation
• Data isn’t 3D - it’s often quite abstract
• No ‘spatial’ relationships at all
• Data items comprise many different fields
• Imagine characterizing a person
• SciViz – 3D or maybe 4D
• InfoViz – A zillion dimensions
• What representation?
For InfoViz
• Having an (enormous) amount of data
• Wonder what it can tell us
• Isolate (unexpected) relationships
• (Hopefully) find some which are
• Interesting
• Novel
• Informative
• Helpful
• “Secondary data analysis”
Data Mining
• We generate enormous amounts of data.
• Every time we:
• Bank
• Shop
• Vote
• Drive
• Fly
• Phone…
• This data is collected.
Data gathering
• All this data is collectable!
• Easy to collect and believed to have value
• We never throw anything away!
• Easy to keep and believed to have value.
• Technologies to gather new information are growing rapidly.
Data gathering (2)
• 2011 UK census
• ~63 Million people
• ~35 questions each
• more than three pages
• ~2+ Billion data items
e.g. census data
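The "~2+ Billion" figure follows directly from the two numbers above it. A quick check, using the approximate values from the slide:

```python
people = 63_000_000   # ~63 million people (from the slide)
questions = 35        # ~35 questions each (from the slide)

items = people * questions
print(f"{items:,} data items")
```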
• ‘Statistics’ versus ‘data mining’
• Statistics
• Want to know the answer to a question
• Gather suitable data (ask the question)
• Analyse the answers
• Gain (probabilistic?) insight into the answer
What is ‘Data Mining’
• Given a database of shoe-buyers…
• Database: What size shoes do people in the income bracket 20000Kr-25000Kr buy?
• Data mining: What common factors (if any) affect the size of shoes people buy?
Database Query & Data mining
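The difference between the two questions can be sketched in code. The records, the field choice and the use of Pearson correlation as a stand-in for "common factors" are all assumptions made for illustration; real data mining uses far richer methods.

```python
from math import sqrt

# Hypothetical shoe-buyer records: (income in kr, height in cm, age, shoe size).
buyers = [
    (21000, 165, 34, 38),
    (24000, 180, 51, 43),
    (30000, 172, 45, 41),
    (22000, 158, 29, 37),
    (41000, 190, 38, 46),
    (18000, 176, 62, 42),
]

# Database query: the question is fixed in advance.
sizes_in_bracket = [size for income, _, _, size in buyers
                    if 20000 <= income <= 25000]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Data mining (very crudely): ask which factors relate to shoe size at all.
sizes = [b[3] for b in buyers]
factors = {"income": 0, "height": 1, "age": 2}
strength = {name: abs(pearson([b[i] for b in buyers], sizes))
            for name, i in factors.items()}
ranked = sorted(strength, key=strength.get, reverse=True)
print(sizes_in_bracket, ranked)
```

The query returns exactly what was asked for; the mining step returns a ranking of factors that nobody specified beforehand.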
• “Everyone spoke of an information overload but what there was in fact was a non-information overload”
• Richard Saul Wurman, “What-If, Could-be”, Philadelphia, 1976.
• (Wrote the book “Information Anxiety”)
Motivation
• Extraction of interesting (non-trivial), previously unknown (and potentially useful) information or patterns from data in ((very) large) databases.
• Inmon (slightly paraphrased)
What is data mining?
• Knowledge discovery in databases (KDD)
• Knowledge extraction
• Data/pattern analysis
• Data archeology
• Information harvesting
• Business intelligence
Alternative names
• (Deductive) query processing.
• Expert systems
• Statistical analysis
What is not data mining?
• Relational databases
• Transactional databases
• Advanced DB and information repositories:
• Object-oriented and object-relational databases
• Time-series data and temporal data
• Text databases and multimedia databases
• Heterogeneous and legacy databases
• WWW
• Security data (images? video?...)
• Data warehouses
Data Mining: What Data?
• Each of a (large) number (n) of data items is a ‘tuple’
• Sometimes called a ‘feature vector’
• Tuple: a (large?) number (p) of items
• Each item may be:
• Numeric
• Textual
• Another tuple (e.g. fingerprints, images, etc.)
• May be discrete or continuous
• Result is n points in a p-dimensional space
What are the characteristics of the data?
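A minimal sketch of this view of the data, assuming a small invented data set held as a list of Python tuples (the `kind` helper is hypothetical):

```python
# Each datum is a tuple (feature vector); the values here are invented.
dataset = [
    (54, "M", "School", 100000.0),
    (85, "F", "PhD", 56348.0),
    (32, "F", "Degree", 48326.0),
]

n = len(dataset)      # number of data items
p = len(dataset[0])   # number of fields in each tuple

def kind(value):
    """Crude field typing: numeric versus textual."""
    return "numeric" if isinstance(value, (int, float)) else "textual"

field_kinds = [kind(v) for v in dataset[0]]
print(f"{n} points in a {p}-dimensional space: {field_kinds}")
```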
ID   AGE  SEX  Education   Income
248  54   M    School      100 000
249  ??   F    Degree      127 831
250  9    M    Incomplete  0
251  85   F    PhD         56 348
252  32   ??   Degree      48 326
253  45   M    ??          ??
Example data set
• Holes
• Missing data values
• Errors and ‘estimates’
• Income of *exactly* 100000?
• Sample inconsistencies:
• E.g. medical records with different numbers of readings for the same person
Problems with data
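Two of these problems can be checked mechanically on the example data set. The sketch below assumes the table is held as tuples of strings with `??` marking a hole; the "round number" test for estimates is a heuristic chosen for illustration, not a rule from the lecture.

```python
MISSING = "??"  # marker used for holes in the example table

fields = ("ID", "AGE", "SEX", "Education", "Income")
rows = [
    (248, "54", "M", "School", "100000"),
    (249, MISSING, "F", "Degree", "127831"),
    (250, "9", "M", "Incomplete", "0"),
    (251, "85", "F", "PhD", "56348"),
    (252, "32", MISSING, "Degree", "48326"),
    (253, "45", "M", MISSING, MISSING),
]

# Holes: which fields are missing in which record?
holes = {row[0]: [f for f, v in zip(fields, row) if v == MISSING]
         for row in rows if MISSING in row}

def looks_estimated(income, step=10000):
    """Heuristic (an assumption): a non-zero multiple of `step` may be a guess."""
    return income != MISSING and int(income) > 0 and int(income) % step == 0

suspect = [row[0] for row in rows if looks_estimated(row[4])]
print(holes, suspect)
```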
Objectives of DM
• Identifying patterns in data:
• For representation
• Because they are ‘interesting’
• Unexpected!
1. Exploratory Data Analysis
2. Descriptive Modelling
3. Predictive Modelling
• Classification and Regression
4. Discovering Patterns and Rules
5. Retrieval by content
Data Mining tasks
• Model:
• A global summary of an entire data set.
• Makes statements about any point in the full measurement space.
• Pattern:
• Makes statements about relationships between variables only in localized regions of the measurement space.
Aside: Models and Patterns
• Pure data mining
• “Explore the data with no clear idea of what we are looking for”
• Typically very visual approach
• Very tied to ‘Visual Data Mining’
• Problems with:
• Large number of data points
• Large numbers of dimensions in data
1. Exploratory Data Analysis
• Attempt to describe all of the data
• Perhaps use:
• Model of overall probability distribution in the p-dimensional space
• Partitioning into groups e.g.:
• Cluster analysis for natural grouping
• Segmentation for user-desired groups
2. Descriptive Modelling
Descriptive modelling (2)
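Cluster analysis for natural grouping can be illustrated with k-means. This is one common clustering method, used here as an assumption; the slides do not name a specific algorithm. The points and starting centres are invented.

```python
def kmeans(points, centres, iterations=10):
    """Plain k-means on 2-D points, starting from the given centres."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centre.
        groups = [[] for _ in centres]
        for x, y in points:
            d = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centres]
            groups[d.index(min(d))].append((x, y))
        # Update step: move each centre to the mean of its group.
        centres = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
            if g else c
            for g, c in zip(groups, centres)
        ]
    return centres, groups

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centres, groups = kmeans(points, centres=[(0, 0), (10, 10)])
print(centres)
```

Segmentation differs only in where the groups come from: the user supplies the boundaries instead of letting the data suggest them.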
• Form a model of the data set which allows prediction of a variable based on the known values of the others
• Classification
• Prediction of a discrete variable
• Regression analysis
• Prediction of a continuous variable
• (Prediction does not mean future here)
3. Predictive modelling
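The two kinds of prediction can be sketched side by side. The methods chosen here (a least-squares line and a 1-nearest-neighbour classifier) and the data are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Regression: least-squares fit of y = a + b*x (continuous target)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

def nearest_label(train, x):
    """Classification: 1-nearest-neighbour prediction of a discrete target."""
    return min(train, key=lambda item: abs(item[0] - x))[1]

# Invented data: height (cm) against shoe size, and against a size class.
heights = [160, 170, 180, 190]
sizes = [38.0, 40.0, 42.0, 44.0]
a, b = fit_line(heights, sizes)
predicted_size = a + b * 175   # continuous prediction for an unseen height

labelled = [(160, "S"), (170, "S"), (180, "L"), (190, "L")]
predicted_class = nearest_label(labelled, 186)   # discrete prediction
print(predicted_size, predicted_class)
```

In both cases "prediction" means filling in an unknown value from known ones, not forecasting the future.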
Predictive modelling (2)
• Q: “Why is predictive modelling (PM) not the same as descriptive modelling (DM)?”
• Strong similarities, some similar methods
• A: The goals are subtly different:
• DM is about grouping in the variable space and identifying the groups.
• PM is about predicting one variable.
Descriptive and Predictive Modelling
• Concerned with the identification of local patterns in sub-sets of the space.
• Examples:
• Frequently occurring sets of transactions
• Finding patterns of action indicating fraud
4. Discovering Rules and Patterns
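Frequently occurring sets of transactions can be found by simple counting. The sketch below looks only at pairs of items and uses invented shopping baskets; the function name and the `min_support` threshold are assumptions.

```python
from collections import Counter
from itertools import combinations

# Invented shopping baskets; each set is one transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "crisps"},
    {"bread", "butter"},
    {"bread", "milk", "crisps"},
]

def frequent_pairs(baskets, min_support):
    """Return item pairs that occur together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical order before counting.
        counts.update(combinations(sorted(basket), 2))
    return {pair: c for pair, c in counts.items() if c >= min_support}

pairs = frequent_pairs(transactions, min_support=2)
print(pairs)
```

The result is a local pattern: it says something about the baskets containing bread, and nothing about the rest.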
• Using a pattern of interest to locate similar patterns
• Examples: Automatically…
• Finding images with similar content
• Finding text documents with similar content
5. Retrieval by content
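For text documents, one simple way to find "similar content" is to compare word counts. The sketch assumes cosine similarity over bags of words, a standard technique chosen here for illustration; the documents are invented.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, documents):
    """Rank documents by similarity of their content to the query text."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(d.lower().split())), d) for d in documents]
    return [d for _, d in sorted(scored, reverse=True)]

docs = [
    "data mining finds patterns in data",
    "the cat sat on the mat",
    "visual data mining uses pictures",
]
ranking = retrieve("mining patterns in data", docs)
print(ranking[0])
```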
• All of the preceding classes of task share a common feature:
• The notion of “is like” or “similarity”
• Or difference (dissimilarity)
• Defined through a ‘scoring function’
• In numerical or categorical data this is often easy
• In general it is not…
Score functions
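For numerical and categorical data, two textbook scoring functions are easy to write down. Both are shown as examples only; the slides do not prescribe a particular function.

```python
from math import sqrt

def euclidean(a, b):
    """Dissimilarity for numeric tuples: straight-line distance."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def simple_matching(a, b):
    """Similarity for categorical tuples: fraction of fields that agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

print(euclidean((0, 0), (3, 4)))  # 5.0
print(simple_matching(("M", "Degree", "Yes"), ("M", "PhD", "Yes")))
```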
• Is an orange like an apple?
• Yes:
• Both are fruit.
• Both grow on trees.
• No:
• One is citrus, one isn’t.
• One is orange, one is green/red
Scoring functions (2)
• Is this picture
• Like this one?
Scoring functions (3)
• Specification of the scoring function(s) is crucial to the effectiveness of the system.
• One of the biggest contributions the user has to make!
Scoring functions (4)
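The apple and orange question shows why the user's choice matters: the same two items can score as identical or as completely different depending on which features are weighted. A minimal sketch, with hypothetical feature names and weights:

```python
# Hypothetical feature descriptions of the two fruits.
apple = {"is_fruit": True, "grows_on_tree": True,
         "citrus": False, "colour": "green/red"}
orange = {"is_fruit": True, "grows_on_tree": True,
          "citrus": True, "colour": "orange"}

def similarity(a, b, weights):
    """Weighted fraction of matching features; the weights are the user's choice."""
    total = sum(weights.values())
    agree = sum(w for f, w in weights.items() if a[f] == b[f])
    return agree / total

# Two users who care about different features.
coarse_view = {"is_fruit": 1, "grows_on_tree": 1, "citrus": 0, "colour": 0}
colour_view = {"is_fruit": 0, "grows_on_tree": 0, "citrus": 0, "colour": 1}

print(similarity(apple, orange, coarse_view))  # 1.0
print(similarity(apple, orange, colour_view))  # 0.0
```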
• Segmentation of sales data is extensively used to classify customers by purchasing patterns and demographic data (age, income etc.)
• Used to target marketing
• Example of descriptive modelling
Example applications (1)
• The Advanced Scout system
• Analyses Basketball game logs
• Identifies features of players’ behaviour
• Circumstances when they play well/badly
• Which opposing players they are good or bad against
• An example of discovering rules and patterns
Example applications (2)
• Dr. John Snow’s cholera diagram
• Example of Exploratory Data Analysis
• Also Visual Data Mining
• Done without knowing what caused Cholera!
Example applications (3)
• SKICAT
• Classifies stars and galaxies automatically from digital image data
• Uses a 40-dimensional feature vector
• Works as well as human experts
• Predictive modelling
Example applications (4)
• Image searching on the web
• Both Altavista and Google had such functions ~2000
• Both removed them
• Google now has one again (2014)
• Face recognition for security (spotting terrorists)
• Has been trialled at several airports in various countries
• Some limited success to date
• Both examples of retrieval by content.
Example Applications (5)
[Screenshots: Altavista Image Search (2000) and Google Image Search (2015); the Google screenshots are labelled 2nd, 5th and 15th]
• Searching text documents for lies on CVs
• Example of a retrieval-by-content method
Example applications (6)
• Detecting inappropriate medical treatment
• Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (saved Australia $1m/yr).
• Example of Descriptive/Predictive modelling
Fraud Detection and Management
• Data mining: discovering interesting models and patterns in data
• ‘Simplifications’ enabling understanding!
• A natural evolution of database technology, in great demand, with wide applications
• Mining can be performed in a variety of information repositories
Summary (1)
• Information expert’s input still vital
• Defining methods
• Defining scoring functions
Summary (2)
• End of Part 1