Download - Haim Lecture2 Data And Data Modeling 2ppg
Lecture 2
Part 1 – Data and data modeling
Introduction
• Data has many sources
it can be
gathered from sensors or surveys, or
generated by simulations and computations
• Data can be
raw (untreated) or
derived from raw data via some process,
such as
smoothing, noise-removal, scaling, or interpolation
Data Models and Management
• Data
• Data Objects and Models
• Visualization Objects
• Metadata
• Data Retrieval
• DBMS
Data, tasks and simple visualizations
• Data
1D, 2D, 3D, …, nD
Structured and unstructured
• Tasks
Present, Confirm, Explore
Query
Summarize
Analyze
• Simple Visualizations
Points
Line and curves
Charts and graphs
• Very large number of parameters
more than 10^5
• Very large data sets
more than 10^7
• Multiple data types
discrete and continuous
• Noisy data
often not uniform
• Missing values
could be important
• Lots of different tasks
• Many visualizations
Some Key Data Factors?
Data sets
• List of records
• Each record consists of one or more observations
• Each observation or variable may be
A single number or symbol
A more complex structure
• Each variable may be independent or dependent
• Data may be generated by a process or function
independent variables define function’s domain
dependent variables define function’s range
Types or Categories of Data
• Categorical
• Continuous
• Nominal
• Ordinal
• Interval
• Ratio
• Qualitative
(nominal, ordinal, interval and ratio are the levels of measurement proposed by
Stanley Smith Stevens in his 1946 article "On the theory of scales of measurement")
Categorical Data
• Values each having label or category
may / may not have
ordering relationship
could have implied / imposed partial order
distance metric
absolute zero
• Marital status
Single, married, divorced, widowed, …
• Profession
Teacher, student, janitor, tailor, pilot, …
• Weapon used
Machine gun, rifle, gun, knife, paper clip, …
• Insurance company
AAA, Hanover, Aetna, …
• Employed
Yes, no, not sure
Continuous Data
• Numeric values each belonging to some interval
almost all interval values possible
• Weight -- pounds
• Height -- meters
• Age -- years
• Gene expression
• Salary
• Distance from pump -- feet
• Temperature -- degrees
Nominal Data (symbolic, categorical)
• Categorical data
order of categories arbitrary, but
numerals assigned as labels or names
• Variables placed into mutually exclusive
categories
• members of one category qualitatively different
from members of any other category
• ==> classification without ordering
• Examples
hair color: brown, black, blond, red
gender: male, female
genomic base pairs (A, C, T, G)
marital status: yes, no
Nominal Data (symbolic, categorical)
• Mapping of numbers to labels possible
(e.g. male = 0, female = 1)
• One value not necessarily greater than another
• Statistical computations typically have no
meaning
(although mode can)
Nominal Data (symbolic, categorical)
• may be only way to measure qualitative variable
(religion, gender)
• Operations
equality / inequality (same / different)
• does not establish quantitative relationship
between categories
Ordinal Data
• order of categories relevant, and
• numerals / labels having interpretation assigned
to labels
• Categorization of data with ordering
order information available, but
no information about magnitude of distance
between adjacent categories
• Some statistical computations may not have any
meaning
• Examples
• Perceptual difficulty scale
very difficult = 10, moderately difficult = 8,
average difficulty = 5, …, easy = 0
• Weapon used by severity
Machine gun = 1, rifle = 2, gun = 3, knife = 4,
paper clip = 5, …
• Likert scale of agreement
5 = strongly agree, to 1 = strongly disagree
Ordinal Data
• Operations
Equality / inequality
less than / more than (order)
• Example: students 1st, 2nd, 3rd
1st better than 2nd -- by how much?
cannot compare differences between
categories
Numeric (discrete vs. continuous)
• Discrete values: Integer
Numerical distance between adjacent units is equal
• Continuous values: Real
Any value with arbitrary precision is possible
no gaps in scale
• May lack absolute zero
an absolute zero represents complete absence of the characteristic being
measured
without one, the zero value is an arbitrary starting point
that could be replaced by any other value
Numeric (discrete vs. continuous)
• Operations
equality / inequality
less than / more than (order)
addition / subtraction
distance metrics
Interval Data
• Continuous data where
data falls in range of numbers
data differences meaningful, but
ratios may have no meaning, since
ranges can be linearly transformed to other
scales
changing interpretation of zero
• Distance differences have meaning
90-100 and 80-90 are similar
• Ratios of differences can have meaning
• mean and median have meaning
Examples
• Temperature -- Celsius / Fahrenheit
Twice the temperature depends on scale used
• IQ measure
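The scale dependence of "twice the temperature" can be checked numerically. The following is a minimal Python sketch; the sample temperatures are made up for illustration. It shows that a plain ratio changes when Celsius is converted to Fahrenheit, while a ratio of differences does not.

```python
def celsius_to_fahrenheit(c: float) -> float:
    """Linear change of scale: F = 9/5 * C + 32 (moves the zero point)."""
    return c * 9.0 / 5.0 + 32.0


# Illustrative values only.
t1_c, t2_c, t3_c, t4_c = 10.0, 20.0, 30.0, 50.0
t1_f, t2_f, t3_f, t4_f = (celsius_to_fahrenheit(t) for t in (t1_c, t2_c, t3_c, t4_c))

# A plain ratio is not preserved: 20 C is "twice" 10 C, but 68 F is not twice 50 F.
ratio_c = t2_c / t1_c
ratio_f = t2_f / t1_f

# A ratio of differences is preserved by the linear transformation.
diff_ratio_c = (t4_c - t3_c) / (t2_c - t1_c)
diff_ratio_f = (t4_f - t3_f) / (t2_f - t1_f)
```

This is why differences (and ratios of differences) are meaningful for interval data, while ratios of the values themselves are not.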
Ratio Data
• Continuous data, where
both differences and ratios meaningful
zero has meaning
• can be classified as Interval data
• ==> can often be classified ratio data
• Geometric mean can only be applied to ratio
data, and
• arithmetic mean extremely meaningful
Examples
• Temperature, mass, energy, ...
• Age, weight
• Number of students at colloquia
Relationship among categories
• Each category provides more computational
possibilities ==>
Ratio more meaningful than interval
Interval more meaningful than ordinal
Ordinal more meaningful than nominal
References
• Babbie, E., 'The Practice of Social Research', 10th edition,
Wadsworth, Thomson Learning Inc., ISBN 0534620299
• Michell, J. (1986). Measurement scales and statistics: a clash of
paradigms. Psychological Bulletin, 3, 398-407.
• Stevens, S.S. (1946). On the theory of scales of measurement.
Science, 103, 677-680.
• Stevens, S.S. (1951). Mathematics, measurement and
psychophysics. In S.S. Stevens (Ed.), Handbook of experimental
psychology (pp. 1-49). New York: Wiley.
• Velleman, P. F. & Wilkinson, L. (1993). Nominal, ordinal, interval,
and ratio typologies are misleading. The American Statistician, 47(1),
65-72. [On line]
http://www.spss.com/research/wilkinson/Publications/Stevens.pdf
Typical Data
• Cars
make
model
year
miles per gallon
cost
number of cylinders
weight
...
Typical Data Classes
• 2D scalar
• 3D scalar
• vector data
• organizational data
• complex data models
2D Scalar
• Sequence of ordered pairs v_i = (x, y)
with x and y in some scalar set
• Where indices are, for example
i ∈ {1, 2, 3, ..., n}
i ∈ {a, b, c, ..., z}
i ∈ a subset of R
• Examples
time series
set of points in (x, y) plane
3D Scalar
• Sequence of ordered triplets v_i = (x, y, z)
with x, y and z in some scalar set
• Where indices are, for example
i ∈ {1, 2, 3, ..., n}
i ∈ {a, b, c, ..., z}
i ∈ a subset of R
• Examples
time series of 2D points
set of points in (x, y, z) space
Vector Data
• generalization of the above
• n-dim vectors v_k = (x_1, x_2, x_3, ..., x_n)
where x_i in some scalar set
• indices are, for example
k ∈ {1, 2, 3, ..., n}
k ∈ {a, b, c, ..., z}
k ∈ a subset of R
• Examples
time series of n - 1 dim points
set of points in n dim space
Time Series Data
• generalization of the above
• n-dim vectors v_k = (x_1, x_2, x_3, ..., x_n)
where x_i in some scalar set
• And index set based on time
0 ≤ t_1 < t_2 < t_3 < ... < t_n
• index set often included as parameter in n-dim vector
• but brought out here as special case
because of its importance
• This identifies time as “special” variable
Multidimensional Data
• generalization of the above
• n-dimensional vector v_k = (x_1, x_2, x_3, ..., x_n)
where each x_i is of some possibly varying data type,
not necessarily all the same
==> Each record consists of number of variables
each having own data type
• index set {k} as before
• extends concept of vector
w/ each coordinate of same data type
to one w/ different data types
• Examples
patient records
census data
Structured Data
• Data Tables
Often, raw data is transformed into a more workable form
• Main idea
Individual items called “cases”
Cases have variables (attributes)
Table View for simple records
Can think of as vector valued function:
f (Record 1) = (190, Red, 2.6)

            Attribute 1   Attribute 2   Attribute 3
Record 1    190           Red           2.6
Record 2    200           Blue          -7.4
Record 3    70            Green         9.0
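The "vector valued function" view of a table can be written down directly. The sketch below is one possible Python rendering; the class name `Case` and the field names are illustrative choices, not part of the original slide.

```python
from typing import NamedTuple


class Case(NamedTuple):
    """One case (record) with three attributes of different data types."""
    attribute_1: int
    attribute_2: str
    attribute_3: float


# The table from the slide, keyed by record name.
table = {
    "Record 1": Case(190, "Red", 2.6),
    "Record 2": Case(200, "Blue", -7.4),
    "Record 3": Case(70, "Green", 9.0),
}


def f(record_id: str) -> Case:
    """Vector valued function: maps a record to its tuple of attribute values."""
    return table[record_id]
```

For example, `f("Record 1")` returns the tuple `(190, "Red", 2.6)`, matching the slide.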
Billing System /
Patient Information /
Claims and numerous others
Pharmacy System
Adverse Drug Reactions
Patient drug history
Laboratory Results /
DNA
Tests
Medical Records/
Demographics data
Dictation Transcripts
Other notes
Data provided varies /
project
can include hospital,
outpatient, medical, drug, lab
and other results
Redundant + wide varieties
of data input enriches
outcome hypotheses
Note: data consists of large
varieties of 2D, 3D, & nD
scalar, vector, & structured
data
Sources of Patient Information
Data Models
• Data Objects consist of three parts
Data
Geometry (physical)
Topology (relational)
• Any visualization pipeline includes
data objects
data mappings
data displays
geometry and topology
• Geometry represents spatial (physical) layout
(embedding) of data in Rn
• Topology represents interconnections
(relationships) between data elements in physical
space.
geometry and topology 2
• Points in space (no connection with each other)
…
• Lines in space (points may have connection) …
• Surfaces in space …
Points in space (no connection with
each other)
• Relationship is topology
• Distances may / may not have meaning
• Relative position may / may not have meaning
Lines in space (points may have
connection)
• Line lengths may / may not have meaning
• Above / below may / may not have meaning
Surfaces in space
• Points / lines may lie on surface
• Distance may have absolute / relative / no
meaning
• Above / below may / may not have meaning
Data databases and subsets
• origin for data objects
• persistent aspect of data objects
• Any visualization pipeline today will deal with
databases
• may be distributed and their schemas different
• ==> great deal of visualization preprocessing
activity is really database work
A Simple Visualization Pipeline
[Figure: pipeline diagram]
• Data stages
simulated or sampled data (from the DBMS)
derived or massaged data
logical data representation
image
• Processing steps
data transformations - interpolation, filtering, etc.
representation mappings - geometry, color, sound, etc.
rendering - viewing, shading, device transforms, etc.
• The user interacts with the pipeline through queries and probes
Data Objects: The Role of Metadata
• Advanced system function
provides data description
supports rule-based operation
addresses data import problems
(file format standardization)
• Knowledge engineering and data mining
for determining structure in data automatically
• Metadata complexity
descriptive -- frames
active -- production rules
A Conceptual Meta-Model for Metadata (Trivedi and Smith, 1991)
Metadata
• Metadata Entities
data quality, data dictionary, data directory, conceptual schema,
data content, data location, data access
• Metaprocess Entities
operators, transactions, modeling primitives, procedures / functions,
host language programs
• Metaenvironment Entities
user profiles, maintenance, physical devices, miscellany
Database Management Systems
• Commercial databases
• Example database /visualization application
(GIS)
• Database models
Relational
Object Relational
Object-Oriented
• Database queries
Commercial Databases
• Relational
Sybase
Oracle
MySQL
DB2
• Object-Oriented
Objectivity
ObjectDB
Versant
Gemstone
Db4o
Example: Geographic Information Systems
[Figure: two data layers and their relationships]
Layer 1: Country, with intra-layer relationship Adjacent_To
Layer 2: City, with intra-layer relationship Neighbors
Inter-layer relationship: Contains
Example: Relational Model
2 entity tables, 3 relationship tables

Country (entity table)
Name          Language
Switzerland   French
Switzerland   German

City (entity table)
Name     Population
Geneva   210,000
Zurich   135,000

Contains (relationship table)
Country Name   City
Switzerland    Geneva
Germany        Bonn

Adjacent_to (relationship table)
Country       Adjacent Country
Switzerland   Germany

Neighbors (relationship table)
City     Neighbor
Geneva   Lausanne
Geneva   Zurich
Example : Object Relational Model
CREATE TYPE Country {
Name CharString
REQUIRED;
Languages CharString
MANY;
Contains City MANY;
}
CREATE TYPE City {
............
}
Retains relational notions (e.g., key)
Support for aggregation in data structure
More language-like
Example: Object-Oriented Model
class Country : public Region {
String Name;
Set<String> Languages;
Set<Ref<City>> contains
inverse contained_by;
public: boolean isSpoken (string language);
};
class City {
........
}
almost identical to programming language
supports programming language data structures
encapsulates behavior
Query Language Issues
Formal Visualization Interactions
• Expressive power
What queries can be expressed in query language?
• Extensibility
Can user-defined types be queried like built-in types?
• Result model
How is query result returned to application level?
• Result management
How much of result can be seen at a time ?
Example: SQL Relational Queries
• Typical query has this form:
select A1, A2, ... , An
from r1, r2, ... , rm
where P
a simple example...
select temp, press
from measures
where temp > 35 and press = 10.5 and region = 'southwest'
• Predicate can be complex
can include any number of logical connectives
can include sub-queries that perform an initial selection
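The example query can be tried out against a small in-memory database. The following is a sketch using Python's standard `sqlite3` module; the table layout and the four sample rows are invented for illustration and are not taken from the lecture.

```python
import sqlite3

# In-memory database with a hypothetical "measures" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measures (region TEXT, temp REAL, press REAL)")
conn.executemany(
    "INSERT INTO measures VALUES (?, ?, ?)",
    [
        ("southwest", 38.0, 10.5),   # satisfies all three conditions
        ("southwest", 30.0, 10.5),   # temp too low
        ("northeast", 40.0, 10.5),   # wrong region
        ("southwest", 41.0, 9.8),    # wrong pressure
    ],
)

# select A1, ..., An from r1, ..., rm where P, with the predicate values
# passed as parameters rather than pasted into the SQL string.
rows = conn.execute(
    "SELECT temp, press FROM measures "
    "WHERE temp > ? AND press = ? AND region = ?",
    (35, 10.5, "southwest"),
).fetchall()
conn.close()
```

Only the first sample row satisfies the whole predicate, so `rows` holds a single `(temp, press)` pair.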
Example: Object Oriented Query Language
• Query Interpretation:
retrieve countries whose names start with “Ge” into an
os_list.
os_Set country_extent;
os_list ge_list =
country_extent->query
( “Country”,
this->name==“Ge*”,
...
);
Particular data structure
within database can be
queried. Queries can also
apply to entire database
Result is a complex
data structure
No impedance mismatch
between query language
and programming language
Trees, Graphs and Networks
Lots of examples
Hierarchical data
File systems are typical hierarchical data
Hierarchical data
• hierarchical data can be represented through
Graphs
Example of a Web Site Hierarchy
Hierarchical data
Relational database model
Website ID Parent ID Child ID Name of the Website
0 NULL 1 Index
0 0 2 Index
0 0 3 Index
0 0 4 Index
0 0 5 Index
0 0 6 Index
0 0 7 Index
0 0 8 Index
0 0 9 Index
1 0 10 About Me
2 0 NULL Resume
3 0 NULL GuestBook
… … … …
10 1 NULL Boot
11 5 NULL 2001
12 5 NULL Dune
13 5 NULL Multiplicity
14 5 18 Star Wars
14 5 19 Star Wars
… … … …
18 14 NULL Books
19 14 NULL Lucas
Hierarchical data
• In most cases data not given in hierarchical form,
but
• stored in multi-dim variables
• Goal: Transform data into hierarchical form
• Algorithm …
Algorithm:
Repeat
(1) Select dim - sequence of dim selection
important,
always select most important dim
(2) Segment attributes into some classes
provided chosen attributes not categorical
until maximum hierarchy level reached
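The algorithm above can be sketched in Python as a recursive grouping. This is only one possible reading of the slide: the function names, the use of fixed bin edges for segmenting numeric attributes, and the sample car records are all assumptions made for illustration.

```python
from collections import defaultdict


def segment(value, edges):
    """Map a numeric value to a class index given ascending bin edges."""
    for index, edge in enumerate(edges):
        if value < edge:
            return index
    return len(edges)


def build_hierarchy(records, dims, bin_edges=None, max_level=None):
    """Nest records by the given dimensions, most important dimension first."""
    bin_edges = bin_edges or {}
    if max_level is None:
        max_level = len(dims)
    if not dims or max_level == 0:
        return list(records)          # leaf: the records that reached this node

    dim, remaining = dims[0], dims[1:]            # step (1): select dimension
    groups = defaultdict(list)
    for record in records:
        value = record[dim]
        # Step (2): only non-categorical dimensions are segmented into classes.
        key = segment(value, bin_edges[dim]) if dim in bin_edges else value
        groups[key].append(record)

    return {
        key: build_hierarchy(members, remaining, bin_edges, max_level - 1)
        for key, members in groups.items()
    }


# Hypothetical multi-dimensional records.
cars = [
    {"make": "A", "cylinders": 4, "mpg": 32.0},
    {"make": "A", "cylinders": 6, "mpg": 21.0},
    {"make": "B", "cylinders": 4, "mpg": 29.0},
    {"make": "B", "cylinders": 4, "mpg": 18.0},
]
# "make" is categorical; "mpg" is split into two classes at 25.
tree = build_hierarchy(cars, ["make", "mpg"], bin_edges={"mpg": [25.0]})
```

The order of `dims` decides the shape of the hierarchy, which is why the slide stresses selecting the most important dimension first.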
Complex structured data (graph)
• graph G = (V, E) consists of
set V, (vertices / nodes)
set E, (edges)
• Each edge assigned in unique way to
ordered / unordered pair of (not necessarily
different) nodes
• edges connect vertices
• …
Complex structured data (graph)
• directed graph …
• undirected graph …
• Network …
• properties, metrics …
directed graph
• If every e in E assigned to ordered pair of nodes
e = (v, w),
• then graph called directed graph
undirected graph
• If every e in E assigned to unordered pair of
nodes
e = {v, w},
• then graph called undirected graph
network
• Edges may have additional meanings (weights)
• ==> graph often called network
properties, metrics
• can define
cyclic, acyclic, DAG, tree,
various metrics
such as
in-number
out-number
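A few of these properties are easy to compute for a directed graph given as a list of ordered pairs. The sketch below assumes that "in-number" and "out-number" mean the in-degree and out-degree of a node; the acyclicity test uses Kahn's algorithm (repeatedly removing nodes with no incoming edges), which is a standard technique and not something the slide prescribes.

```python
from collections import defaultdict


def degrees(edges):
    """Return (in_degree, out_degree) dictionaries for ordered pairs (v, w)."""
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    for v, w in edges:
        out_deg[v] += 1
        in_deg[w] += 1
    return dict(in_deg), dict(out_deg)


def is_acyclic(nodes, edges):
    """Kahn's algorithm: a directed graph is acyclic iff all nodes can be removed."""
    in_deg = {v: 0 for v in nodes}
    successors = {v: [] for v in nodes}
    for v, w in edges:
        successors[v].append(w)
        in_deg[w] += 1
    ready = [v for v in nodes if in_deg[v] == 0]
    removed = 0
    while ready:
        v = ready.pop()
        removed += 1
        for w in successors[v]:
            in_deg[w] -= 1
            if in_deg[w] == 0:
                ready.append(w)
    return removed == len(nodes)


# Illustrative graph: a -> b, a -> c, b -> c.
nodes = ["a", "b", "c"]
dag_edges = [("a", "b"), ("a", "c"), ("b", "c")]
in_deg, out_deg = degrees(dag_edges)
dag_ok = is_acyclic(nodes, dag_edges)
cyclic_ok = is_acyclic(nodes, dag_edges + [("c", "a")])   # closes a cycle
```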
Complex structured data (graph)
Example of
directed graph
(Paper flow in
government system)
Complex structured data (graph)
Example of
undirected graph
(Social network
of 9-11 Terrorists)
Data preprocessing
• Metadata and statistics
• Missing values
• Data Cleansing
• Normalization
• Segmentation
• Sampling and subsetting
• Dimensional reduction
• Aggregation and summarization
• Smoothing and filtering
Metadata and Statistics
• “Data about Data”
• describes content, quality, condition, other characteristics
of data
e.g. min, max, avg, …
• not actual data itself
• may include
Identification (name of dataset, …)
Data Quality (completeness, attribute accuracy,…)
Distribution (formats, media, who holds the data,…)
• Important for correct / useful visualization
Missing Values and Data Cleaning
• Missing and empty values …
• Problem definition …
• Approximation vs interpolation …
• Linear regression …
• Piecewise polynomial (spline) interpolation …
Missing and empty values
• missing value of variable
actual value exists in real world measurement
made, but
not entered into data set
• empty value in variable
no real world value exists
Missing and empty values
• Example: Sandwich Shop
sells sandwich
with
turkey
Swiss / American cheese
to determine customer preferences + control inventory,
keeps records of customer purchases
Data structure contains “gender”, “cheese type”
gender: “M”: Male ; “F”: Female
cheese: “S”: Swiss ; “A”: American
Missing and Empty Values
Suppose during recording of sale,
customer requests sandwich with no cheese
salesperson forgets to enter customer’s gender
transaction generates record with
both “gender” and “cheese” w/ no entry
Missing and empty values
“gender” can be Male or Female (missing value)
“cheese” not measured because no value exists
(empty value)
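One way to keep the two cases apart in a data structure is to store an explicit marker instead of a blank entry. The following Python sketch is an illustration only; the `Absent` enum and the helper name are hypothetical.

```python
from enum import Enum


class Absent(Enum):
    """Why a field has no entry."""
    MISSING = "missing"   # a real-world value exists but was not recorded
    EMPTY = "empty"       # no real-world value exists


# The sandwich shop transaction from the example.
record = {"gender": Absent.MISSING, "cheese": Absent.EMPTY}


def needs_replacement(value) -> bool:
    """Only missing values are candidates for a replacement value."""
    return value is Absent.MISSING


to_fill = [field for field, value in record.items() if needs_replacement(value)]
```

With the markers in place, a later replacement step can be restricted to the "gender" field and leave "cheese" alone.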
Problem Definition
• General Problem with missing values twofold
some information content may be missing
Example: in a credit application, the fact that an applicant
left certain fields uncompleted may itself be useful information
worth flagging
missing value necessary for computation
Example: Age important for estimating reliability
Problem Definition
• Create and insert some replacement value for missing
value
objective is to insert value that neither adds nor
subtracts information from data set
Note that for age this is tricky (older typically increases
reliability) and we might decide not to fill in values
• Solution
Use approximation or interpolation to find
missing values
Problem Definition
• General Problem:
given set of n points {xi, yi} with i = 1, ..., n
find function y = f (x) for which yi = f (xi) for all i = 1, ..., n
There may be
several such functions, or even
no simple ones that can deal with
Problem Definition
• Information carried in
relationship between
values within single variable (its distribution)
and
relationship to other variables
Approximation vs Interpolation
• For approximation want
$| f(x_i) - y_i | < \varepsilon$ for small $\varepsilon > 0$
• For interpolation want
$| f(x_i) - y_i | = 0$
Note that approximation is less stringent
Approximation vs Interpolation
• Approximation
Regression (linear, quadratic, …)
• Interpolation
Polynomial (Lagrange basis, Newton form)
Piecewise polynomial (cubic splines, …)
Orthogonal polynomials (Legendre, …)
Trigonometric functions
Approximation
Climate data approximation based on triangulation
(a), (f) temperature
(b) air pressure
(c) humidity
(d) sea surface temperature
(e) vapor pressure
Linear Regression
• Concept
determine the value of one variable given the value of
another variable
• It assumes that the variables’ values change, one with
the other, in some mathematically defined way
• The technique involved simply fits a manifold through the
two-dimensional state space formed by the two variables
• Example
in case of linear regression a straight line is fit through a
two-dimensional point set
Linear Regression
• Linear regression techniques involve discovering the joint
variability of two or more variables
• Linear regression determines which values of the
predicted variable match values of the predictor variable
• Joint variability
measure how one variable varies as another one varies
Linear Regression
• Linear regression tries to discover the parameters of the
straight line equation that best fits the data points
• The expression describing a straight line is
y = a x + b
where b is a constant that indicates where the straight
line crosses the y-axis in state space (the y-intercept)
and a represents the slope of the line
Linear Regression
Linear regression minimizes the least-squares error
$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \to \min$
$y_i$: actual value
$\hat{y}_i$: estimated value
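Linear Regression Solution
(with $y = a x + b$, so a is the slope and b is the y-intercept)
Step 1
Determine the slope a
$a = \frac{n \sum x y - (\sum x)(\sum y)}{n \sum x^2 - (\sum x)^2}$
Step 2
Determine the intercept b once a is known
$b = \bar{y} - a \bar{x}$
$\bar{x}$ is the mean value of x
$\bar{y}$ is the mean value of y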
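The two-step least-squares solution can be written in a few lines of Python. The sketch below uses the straight-line form y = a x + b given earlier, with a as the slope and b as the y-intercept; the function name and the sample points are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept to paired samples."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    denominator = n * sum_xx - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are identical; slope is undefined")
    # Step 1: slope from the sums.
    slope = (n * sum_xy - sum_x * sum_y) / denominator
    # Step 2: intercept from the means once the slope is known.
    intercept = sum_y / n - slope * sum_x / n
    return slope, intercept


# Points lying exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

Because the sample points lie exactly on a line, the fit recovers that line; with noisy data the result is the line that minimizes the sum of squared vertical errors.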
Piecewise Polynomial (Spline) Interpolation
• Set of data points $\{(x_0, y_0), \ldots, (x_n, y_n)\}$
• Passing a single polynomial through many data points
can sometimes lead to oscillations in the interpolating
polynomial
• The interpolating polynomial linking the data points
$(x_k, y_k)$ and $(x_{k+1}, y_{k+1})$ is most often selected to be a cubic
• Cubics are differentiable and provide second order
continuity (the derivatives of neighbouring cubics can be
matched)
Piecewise Polynomial (Spline) Interpolation
$s_k(x) = s_{k,0} + s_{k,1}(x - x_k) + s_{k,2}(x - x_k)^2 + s_{k,3}(x - x_k)^3$, $x \in [x_k, x_{k+1}]$
subject to the following constraints:
Interpolation of $(x_k, y_k)$: $s_k(x_k) = y_k$
Continuity of interpolant: $s_k(x_{k+1}) = s_{k+1}(x_{k+1})$
Continuity of first derivatives: $s_k'(x_{k+1}) = s_{k+1}'(x_{k+1})$
Continuity of second derivatives: $s_k''(x_{k+1}) = s_{k+1}''(x_{k+1})$
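The constraints for the cubic pieces leave two degrees of freedom open. A common way to close them, assumed here and not stated on the slide, is the "natural" boundary condition: the second derivative is zero at both end points. Under that assumption the following pure-Python sketch solves the resulting tridiagonal system and returns an interpolating function; all names are illustrative.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return s(x), a natural cubic spline through (xs[k], ys[k]); xs ascending."""
    n = len(xs) - 1                       # number of cubic pieces
    if n < 1 or len(ys) != len(xs):
        raise ValueError("need at least two points of equal length")
    h = [xs[k + 1] - xs[k] for k in range(n)]
    if any(step <= 0 for step in h):
        raise ValueError("xs must be strictly increasing")

    # Second derivatives m[0..n] at the knots; natural ends: m[0] = m[n] = 0.
    m = [0.0] * (n + 1)
    if n > 1:
        diag = [2.0 * (h[k - 1] + h[k]) for k in range(1, n)]
        rhs = [
            6.0 * ((ys[k + 1] - ys[k]) / h[k] - (ys[k] - ys[k - 1]) / h[k - 1])
            for k in range(1, n)
        ]
        # Thomas algorithm: forward elimination ...
        for i in range(1, n - 1):
            w = h[i] / diag[i - 1]
            diag[i] -= w * h[i]
            rhs[i] -= w * rhs[i - 1]
        # ... and back substitution.
        m[n - 1] = rhs[n - 2] / diag[n - 2]
        for i in range(n - 3, -1, -1):
            m[i + 1] = (rhs[i] - h[i + 1] * m[i + 2]) / diag[i]

    # Coefficients (s_k0, s_k1, s_k2, s_k3) of each piece in powers of (x - x_k).
    coeffs = []
    for k in range(n):
        c0 = ys[k]
        c1 = (ys[k + 1] - ys[k]) / h[k] - h[k] * (2.0 * m[k] + m[k + 1]) / 6.0
        c2 = m[k] / 2.0
        c3 = (m[k + 1] - m[k]) / (6.0 * h[k])
        coeffs.append((c0, c1, c2, c3))

    def s(x):
        # Pick the piece whose interval contains x (clamped at both ends).
        k = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[k]
        c0, c1, c2, c3 = coeffs[k]
        return c0 + dx * (c1 + dx * (c2 + dx * c3))

    return s


knots_x, knots_y = [0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 0.0, 1.0]
spline = natural_cubic_spline(knots_x, knots_y)
line = natural_cubic_spline([0.0, 1.0, 2.0, 3.0], [0.0, 2.0, 4.0, 6.0])
```

The spline passes through every data point, and for data that already lies on a straight line all second derivatives vanish, so the spline reduces to that line.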
Normalization
Normalization Example
[Figures: Chicago-County and LA-County; the remaining normalization slides contain figures only]
Segmentation
• Manual/Automatic Segmentation
• Problem Definition
• k-Means
• Linkage -based Methods
• Kernel Density Estimation
Manual/Automatic Segmentation
• Manual Segmentation
based upon
• Attribute values/ranges
• Topological properties
• Automatic Segmentation Algorithms (Clustering
Algorithms)
k-Means,
Kernel Density Estimation, …
Problem Definition
Given:
A data set with N d-dimensional data items
Task:
Determine a (natural) partitioning of the
data set into a number of clusters (k) and
a noise parameter
Problem Definition
• Effective and efficient clustering algorithms for large high-
dimensional data sets with high noise level
• Requires Scalability with respect to
the number of data points (N)
the number of dimensions (d)
the noise level
k-Means
• k-Means
Determine k prototypes (p) of a given data set
Assign data points to nearest prototype
Minimize distance criterion:
$\sum_{i=1}^{k} \sum_{j=1}^{N} d(p_i, x_j^{i}) \to \min$
where $x_j^{i}$ denotes a data point assigned to prototype $p_i$
k-Means 2
• Iterative Algorithm
Shift the prototypes towards the mean of their point set
Re-assign the data points to the nearest prototype
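The iteration can be sketched in pure Python as follows. Two choices here are assumptions, not part of the slide: the distance d is taken to be Euclidean, and the initial prototypes are simply the first k data points (practical implementations usually pick them at random).

```python
import math


def k_means(points, k, iterations=100):
    """Alternate between assigning points and moving prototypes to the means."""
    prototypes = [tuple(p) for p in points[:k]]     # assumed initialisation
    assignment = None
    for _ in range(iterations):
        # Assign every data point to its nearest prototype.
        new_assignment = [
            min(range(k), key=lambda i: math.dist(p, prototypes[i]))
            for p in points
        ]
        if new_assignment == assignment:            # nothing changed: converged
            break
        assignment = new_assignment
        # Shift each prototype to the mean of the points assigned to it.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                prototypes[i] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return prototypes, assignment


# Two well-separated groups of illustrative 2D points.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
prototypes, assignment = k_means(points, k=2)
```

The result depends on the initial prototypes; the procedure finds a local minimum of the distance criterion, not necessarily the global one.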
k-Means
Linkage-based Methods
• Single Linkage
Connected components for distance d
Linkage-based Methods
• Method of Wishart
Reduce data set
Apply Single Linkage
Kernel Density Estimation
Influence Function: Influence of a data point in its neighborhood
Density Function: Sum of the influences of all data points
Kernel Density Estimation
• Influence Function
The influence of a data point y at a point x in the data
space is modeled by a function
e.g.,
y x
Kernel Density Estimation
• Density Function
The density at a point x in the data space is defined as
the sum of the influences of all data points xi, i.e.
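The relationship between influence function and density function can be illustrated with a short Python sketch. The concrete influence function did not survive in these notes, so a Gaussian kernel with width `sigma` is assumed here purely as an example; the function names are illustrative as well.

```python
import math


def gaussian_influence(x, y, sigma=1.0):
    """Assumed influence of data point y at location x (Gaussian kernel)."""
    return math.exp(-math.dist(x, y) ** 2 / (2.0 * sigma ** 2))


def density(x, data, sigma=1.0):
    """Density at x: the sum of the influences of all data points."""
    return sum(gaussian_influence(x, xi, sigma) for xi in data)


# Illustrative 1D data: two points at 0 and one point at 5.
data = [(0.0,), (0.0,), (5.0,)]
density_at_cluster = density((0.0,), data)
density_between = density((2.5,), data)
```

The density is high where data points crowd together and low in the gaps between them, which is what density-based clustering exploits.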
Kernel Density Estimation
Subsetting
Start with a set of data items and generate a subset
of these data items
• Sampling
Random sampling
…
• Querying
SQL
Sampling
• Motivation: data set is much larger than possible (time-
and/or space-wise) to work on
• Example: voters of an election
( too large to study all of them, so use a representative
sample)
• Important:
The subset must be selected such that it
represents some well-defined characteristics of the whole
data set, especially those we're interested in
Sampling
• Types of sampling
Non-probabilistic samples
Sample selected on some non-random basis (such as
volunteers, accidental, convenience, self-selected, etc.)
Probabilistic samples
Sample selected on the basis of random selection so
that every element of the data set has an equal chance of being
selected
Sampling
• Types of probabilistic sampling
Simple random sampling
Systematic random sampling
Stratified random sampling
Cluster random sampling
Biased sampling
Simple Random Sampling
A random sampling strategy is the least biased sampling method. Using
this method, the locations were determined by generating a list of random
coordinates and placing the points at those coordinates.
Systematic random sampling
• Elements are numbered 1 to N in some order
• Every k-th element is selected
starting with a randomly chosen
number between 1 and k
k = N / n, where n is the sample size
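A minimal Python sketch of systematic random sampling follows. It assumes the step is k = N / n rounded down to an integer; the function name and the `rng` parameter (which makes the random start reproducible) are illustrative.

```python
import random


def systematic_sample(items, n, rng=random):
    """Select every k-th element, starting at a random position among the first k."""
    total = len(items)
    if not 0 < n <= total:
        raise ValueError("sample size must be between 1 and len(items)")
    k = total // n                   # sampling interval
    start = rng.randrange(k)         # 0-based index of the randomly chosen start
    return [items[start + j * k] for j in range(n)]


# Elements numbered 1 to N = 100, sample size n = 10, so k = 10.
population = list(range(1, 101))
sample = systematic_sample(population, 10, random.Random(0))
```

Only the starting point is random; once it is chosen, the rest of the sample is fully determined.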
Stratified random sampling
• The data set is divided into non-overlapping subsets
called strata
• Sampling from the strata is simple random
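Stratified sampling can be sketched in Python as follows. The slide does not say how many elements to draw per stratum; proportional allocation (the same fraction from every stratum, at least one element) is assumed here, and all names are illustrative.

```python
import random
from collections import defaultdict


def stratified_sample(items, stratum_of, fraction, rng=random):
    """Draw a simple random sample of the given fraction from every stratum."""
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)   # non-overlapping subsets
    sample = []
    for members in strata.values():
        size = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, size))
    return sample


# Hypothetical population: 80 elements in stratum "north", 20 in "south".
population = [("north", i) for i in range(80)] + [("south", i) for i in range(20)]
sample = stratified_sample(population, lambda item: item[0], 0.1, random.Random(1))
north_count = sum(1 for stratum, _ in sample if stratum == "north")
south_count = sum(1 for stratum, _ in sample if stratum == "south")
```

Compared with simple random sampling, this guarantees that small strata are represented in the sample.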
Cluster random sampling
• The sample consists of a selection from randomly chosen groups of
neighbouring elements (clusters)
• Clusters need not necessarily be natural aggregates, but can simply
be artificial
divide the population into
population clusters based on
geographical location (districts,
counties, states, ...)
Biased sampling
Dimensional Reduction
What is the problem?
• Large number of features represent an object
• The data is difficult to visualize, especially when some of
the features are not discriminatory
• Irrelevant features may cause a reduction in the accuracy
of the analysis algorithms
Problem Definition
Concept
• Identify the most important features of an object
to simplify the processing without loss of quality
to directly visualize the two/three most important
features
Problem Definition
• Solution
The simplest approach is to identify important attributes
based on input from domain experts
Another common approach is Principal Component
Analysis (PCA) which defines new attributes (principal
components or PCs) as mutually-orthogonal linear
combinations of the original attributes
• Goal
to discover the key hidden factors that explain the data
to reduce the dimensionality of the data
• Similar to cluster centroids
Principal Component Analysis
PCA (con‘t)
Computing the Eigenvalues
The Eigenvalues
Eigenvalues (con‘t)
Eigenvalues (con‘t)
PCA – Dimension Reduction
Data can be projected onto a subspace spanned by the
most important eigenvectors
$X_{PCA} = C X$
where the $k \times m$ matrix $C$ contains the k eigenvectors
corresponding to the k largest eigenvalues
PCA – Dimension Reduction
• PCA is the optimal way to project data in the mean-square sense:
the squared error introduced in the projection is minimized over all projections onto a k-dimensional space
• But the eigenvalue decomposition of the data covariance matrix (size m × m for m-dimensional data) is very expensive to compute
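For a small example the leading principal component can be computed without any library. The sketch below builds the covariance matrix and finds its dominant eigenvector by power iteration; this is a swapped-in technique chosen to keep the example dependency-free, not the full eigenvalue decomposition mentioned above, and it only yields the first component (k = 1). Function names and data are illustrative.

```python
import math


def covariance(rows):
    """Return the sample covariance matrix and the mean-centered data."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(m)]
        for a in range(m)
    ]
    return cov, centered


def leading_eigenpair(matrix, iterations=200):
    """Power iteration; fails if the start vector is orthogonal to the answer."""
    m = len(matrix)
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    eigenvalue = sum(
        v[a] * sum(matrix[a][b] * v[b] for b in range(m)) for a in range(m)
    )
    return eigenvalue, v


# Illustrative 2D data lying on the line y = 2x.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
cov, centered = covariance(rows)
eigenvalue, pc1 = leading_eigenpair(cov)
# Project the centered data onto the first principal component (2D -> 1D).
scores = [sum(c[j] * pc1[j] for j in range(len(pc1))) for c in centered]
```

Since the sample data is perfectly one-dimensional, the single component captures all of its variance: the variance of `scores` equals the largest eigenvalue.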
SVD – Dimension Reduction
• Singular value decomposition:
$X = U S V^T$
where the orthogonal matrices U and V contain the
left and right singular vectors of X, and the diagonal
matrix S contains the singular values of X
SVD – Dimension Reduction
Data can be projected onto a subspace spanned by the left singular vectors corresponding to the k largest singular values,
where the $k \times m$ matrix $U_k$ contains these k
singular vectors
3.2.7 Aggregation / Summarization
• Aggregation Functions
count the items in a data set
• For example, the count of the items in (1, 3, 6, 4) is 4
sum the items in a list
• For example, the sum of the list (1, 3, 6, 4) is 14
average (avg) of all items in a data set
• For example, the avg of the items in (1, 3, 6, 4) is 3.5
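The three aggregation functions map directly onto Python built-ins. A minimal sketch using the example data from the list above:

```python
items = (1, 3, 6, 4)

count = len(items)          # number of items in the data set
total = sum(items)          # sum of the items
average = total / count     # arithmetic mean of the items
```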
Effect of Display Resolution
• Data sets are large (large result sets)
millions of data points/results for a query
• Limitation of screen resolution
1 - 3 M pixel on average currently
• This forces coding in space, sound, time, and
creativity
adaptive system and user interfaces
abstracts and level-of-detail
Simple Visualizations
Simple
• Tables
• 2D and 3D Scatterplots
• Statistical Charts
• Line and Multi-line Graphs
• Polar Chart
• Images
Complex
• Matrix of Scatterplots
• Heatmaps
• Surfaces
• Volumes
Interactions
• Select
• Probe
• Query
• Analyze
• Explore
The Base Visualization Techniques
Visualization Operations (queries)
• Data selection operations
sampling, association, etc.
• Data manipulation operations
functions, filters, interpolations, etc.
• Representation operations
attribute mappings, color maps, etc.
• Image orientation / viewing operations
pan, zoom, rotate, lighting, etc.
• Visualization interactions
direct manipulation, data selection, etc.
Univariate
Bivariate (or scatterplots)
Scatterplot Matrix
Trivariate
Alternatives
10th Century Timeline
Lines and curves
The same curve under different scalings …
[Figure: line chart "Trends in fruit sales" for bananas, apples and pears; x-axis: year (83 to 98), y-axis: 0 to 14]
Excel Graphs and Charts
Images, Surfaces and Volumes
ExVis MRI
1988
Electron density of C-60
HIV Reverse Transcriptase Inhibitor
[Figure: Van der Waals surface colored by electrostatic potential (ESP); color scale from -0.05 to 0.25]
Maps
Maps
Surfaces
Volumes
Surfaces and Volumes
Graphs and Networks
Graph Layout Optimization
[Figure: pathway graph "Taurine and hypotaurine metabolism" (linked to cysteine, cyanoamino acid and glutathione metabolism), shown before and after layout optimization]
Query-based Layouts
adding simple interaction
to static visualization
is surprisingly powerful
Starfield (Early Spotfire)
Starfield (Early Spotfire) 2
add interaction to everything?
Trains from Paris to Lyon
Trains from Paris to Lyon
Trains from Paris to Lyon
[Figure: number of hotels by region (tourist board areas), shown collapsed and then expanded at North England]

hotels   region
3444     UK
33       Channel Islands
811      Midlands
639      North England
10       Northern Ireland
531      Scotland
1200     South England
220      Wales

North England expanded:
hotels   tourist board area
89       Cumbria
7        Isle of Man
209      North West
111      Northumbria
223      Yorkshire & Humberside
interaction is the key
add interaction to everything.
Visualizations
• Which visualization to use?
requires taxonomy and more understanding: plots, surfaces, volumes, iconographic displays, ...
research issue - domain dependent
• How best to support the human data explorer?
provide user interactions (selections, navigation)
• What can be automated?
data mining, kdd, user advice, ...