
Statistical Data Mining: A Short Course for the Army Conference on Applied Statistics

Edward J. Wegman, George Mason University
Jeffrey L. Solka, Naval Surface Warfare Center

Statistical Data Mining Agenda

Introduction and Complexity
Data Preparation and Compression
Databases and Data Mining via Association Rules
Clustering, Classification, and Discrimination
Pattern Recognition and Intrusion Detection
Color Theory and Design
Visual Data Mining
CrystalVision Installation and Practice

Introduction to Data Mining

What is Data Mining All About
Hierarchy of Data Set Size
Computational Complexity and Feasibility
Data Mining Defined & Contrasted with EDA
Examples

Introduction to Data Mining

Why Data Mining
What is Knowledge Discovery in Databases
Potential Applications:
Fraud Detection
Manufacturing Processes
Targeting Markets
Scientific Data Analysis
Risk Management
Web Intelligence

Introduction to Data Mining

Data Mining: On what kind of data?
Relational Databases
Data Warehouses
Transactional Databases
Advanced:
Object-relational
Spatial, Temporal, Spatiotemporal
Text, WWW
Heterogeneous, Legacy, Distributed

Introduction to Data Mining

Data Mining: Why now?
Confluence of multiple disciplines:
Database systems, data warehouses, OLAP
Machine learning
Statistical and data analysis methods
Visualization
Mathematical programming
High performance computing

Introduction to Data Mining

Why do we need data mining?
Large number of records (cases): 10^8 to 10^12 bytes
High dimensional data (variables): 10 to 10^4 attributes
How do you explore millions of records, tens or hundreds of fields, and find patterns?

Introduction to Data Mining

Why do we need data mining?
Only a small portion, typically 5% to 10%, of the collected data is ever analyzed.
Data that may never be explored continues to be collected, out of fear that something that may prove important in the future would otherwise be missed.
The magnitude of the data precludes most traditional analysis (more on complexity later).

Introduction to Data Mining

KDD and data mining have roots in traditional database technology.

As databases grow, the ability of the decision support process to exploit traditional (i.e., Boolean) query languages becomes limited.

• Many queries of interest are difficult or impossible to state in traditional query languages:
• “Find all cases of fraud in IRS tax returns.”
• “Find all individuals likely to ignore Census questionnaires.”
• “Find all documents relating to this customer’s problem.”

Complexity

The Huber-Wegman Taxonomy of Data Set Sizes

Descriptor    Data Set Size in Bytes  Storage Mode
Tiny          10^2                    Piece of Paper
Small         10^4                    A Few Pieces of Paper
Medium        10^6                    A Floppy Disk
Large         10^8                    Hard Disk
Huge          10^10                   Multiple Hard Disks
Massive       10^12                   Robotic Magnetic Tape Storage Silos
Supermassive  10^15                   Distributed Data Archives

Complexity

Algorithmic Complexity

O(n)        Calculate Means, Variances, Kernel Density Estimates
O(n log(n)) Calculate Fast Fourier Transforms
O(nc)       Calculate Singular Value Decomposition of an r x c Matrix; Solve a Multiple Linear Regression
O(n^2)      Solve Most Clustering Algorithms
O(a^n)      Detect Multivariate Outliers

Complexity

Table 2: Number of Operations for Algorithms of Various Computational Complexities and Various Data Set Sizes

n        n^(1/2)  n      n log(n)  n^(3/2)  n^2
tiny     10       10^2   2x10^2    10^3     10^4
small    10^2     10^4   4x10^4    10^6     10^8
medium   10^3     10^6   6x10^6    10^9     10^12
large    10^4     10^8   8x10^8    10^12    10^16
huge     10^5     10^10  10^11     10^15    10^20
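The entries in Table 2 follow directly from the complexity formulas applied to the taxonomy sizes. A minimal sketch that reproduces them (note the table uses base-10 logarithms):

```python
import math

# Data set sizes from the Huber-Wegman taxonomy (number of data items n)
SIZES = {"tiny": 1e2, "small": 1e4, "medium": 1e6, "large": 1e8, "huge": 1e10}

# Complexity classes from the Algorithmic Complexity slide
COMPLEXITIES = {
    "n^(1/2)":  lambda n: n ** 0.5,
    "n":        lambda n: n,
    "n log(n)": lambda n: n * math.log10(n),  # base-10 log matches the table entries
    "n^(3/2)":  lambda n: n ** 1.5,
    "n^2":      lambda n: n ** 2,
}

for size, n in SIZES.items():
    row = "  ".join(f"{label}={f(n):.0e}" for label, f in COMPLEXITIES.items())
    print(f"{size:7s} {row}")
```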

Complexity

Table 4: Computational Feasibility on a Pentium PC (10 megaflop performance assumed)

n        n^(1/2)    n          n log(n)     n^(3/2)     n^2
tiny     10^-6 sec  10^-5 sec  2x10^-5 sec  .0001 sec   .001 sec
small    10^-5 sec  .001 sec   .004 sec     .1 sec      10 sec
medium   .0001 sec  .1 sec     .6 sec       1.67 min    1.16 days
large    .001 sec   10 sec     1.3 min      1.16 days   31.7 years
huge     .01 sec    16.7 min   2.78 hours   3.17 years  317,000 years
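Tables 4 through 7 all come from the same computation: operation count divided by assumed machine speed. A minimal sketch, with the speeds taken from the table captions:

```python
import math

# Machine speeds assumed in the captions of Tables 4-7 (operations per second)
FLOPS = {
    "Pentium PC": 1e7,               # 10 megaflops
    "SGI Onyx": 3e8,                 # 300 megaflops
    "Intel Paragon XP/S A4": 4.2e9,  # 4.2 gigaflops
    "Teraflop computer": 1e12,       # 1000 gigaflops
}

SIZES = {"tiny": 1e2, "small": 1e4, "medium": 1e6, "large": 1e8, "huge": 1e10}

def ops(n: float, kind: str) -> float:
    """Operation count for a complexity class (base-10 log, as in Table 2)."""
    return {"n^(1/2)": n ** 0.5, "n": n, "n log(n)": n * math.log10(n),
            "n^(3/2)": n ** 1.5, "n^2": n ** 2}[kind]

def feasibility_seconds(machine: str, size: str, kind: str) -> float:
    """Wall-clock time = operations / machine speed."""
    return ops(SIZES[size], kind) / FLOPS[machine]

# The worst cell of Table 4: an O(n^2) algorithm on a huge set, Pentium PC
print(feasibility_seconds("Pentium PC", "huge", "n^2") / 3.15e7, "years")  # ~317,000
```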

Complexity

Table 5: Computational Feasibility on a Silicon Graphics Onyx Workstation (300 megaflop performance assumed)

n        n^(1/2)        n              n log(n)       n^(3/2)        n^2
tiny     3.3x10^-8 sec  3.3x10^-7 sec  6.7x10^-7 sec  3.3x10^-6 sec  3.3x10^-5 sec
small    3.3x10^-7 sec  3.3x10^-5 sec  1.3x10^-4 sec  3.3x10^-3 sec  .33 sec
medium   3.3x10^-6 sec  3.3x10^-3 sec  .02 sec        3.3 sec        55 min
large    3.3x10^-5 sec  .33 sec        2.7 sec        55 min         1.04 years
huge     3.3x10^-4 sec  33 sec         5.5 min        38.2 days      10,464 years

Complexity

Table 6: Computational Feasibility on an Intel Paragon XP/S A4 (4.2 gigaflop performance assumed)

n        n^(1/2)        n              n log(n)       n^(3/2)        n^2
tiny     2.4x10^-9 sec  2.4x10^-8 sec  4.8x10^-8 sec  2.4x10^-7 sec  2.4x10^-6 sec
small    2.4x10^-8 sec  2.4x10^-6 sec  9.5x10^-6 sec  2.4x10^-4 sec  .024 sec
medium   2.4x10^-7 sec  2.4x10^-4 sec  .0014 sec      .24 sec        4.0 min
large    2.4x10^-6 sec  .024 sec       .19 sec        4.0 min        27.8 days
huge     2.4x10^-5 sec  2.4 sec        24 sec         66.7 hours     761 years

Complexity

Table 7: Computational Feasibility on a Teraflop Grand Challenge Computer (1000 gigaflop performance assumed)

n        n^(1/2)     n           n log(n)      n^(3/2)    n^2
tiny     10^-11 sec  10^-10 sec  2x10^-10 sec  10^-9 sec  10^-8 sec
small    10^-10 sec  10^-8 sec   4x10^-8 sec   10^-6 sec  10^-4 sec
medium   10^-9 sec   10^-6 sec   6x10^-6 sec   .001 sec   1 sec
large    10^-8 sec   10^-4 sec   8x10^-4 sec   1 sec      2.8 hours
huge     10^-7 sec   .01 sec     .1 sec        16.7 min   3.2 years

Complexity

Table 8: Types of Computers for Interactive Feasibility (Response Time < 1 second)

n        n^(1/2)            n                  n log(n)           n^(3/2)            n^2
tiny     Personal Computer  Personal Computer  Personal Computer  Personal Computer  Personal Computer
small    Personal Computer  Personal Computer  Personal Computer  Personal Computer  Supercomputer
medium   Personal Computer  Personal Computer  Personal Computer  Supercomputer      Teraflop Computer
large    Personal Computer  Workstation        Supercomputer      Teraflop Computer  ---
huge     Personal Computer  Supercomputer      Teraflop Computer  ---                ---

Complexity

Table 9: Types of Computers for Feasibility (Response Time < 1 week)

n        n^(1/2)            n                  n log(n)           n^(3/2)            n^2
tiny     Personal Computer  Personal Computer  Personal Computer  Personal Computer  Personal Computer
small    Personal Computer  Personal Computer  Personal Computer  Personal Computer  Personal Computer
medium   Personal Computer  Personal Computer  Personal Computer  Personal Computer  Personal Computer
large    Personal Computer  Personal Computer  Personal Computer  Personal Computer  Teraflop Computer
huge     Personal Computer  Personal Computer  Personal Computer  Supercomputer      ---
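Tables 8 and 9 can be read as the output of a simple selection rule: the cheapest machine class whose feasibility time fits the response budget. A sketch of that rule, assuming the machine speeds of Tables 4 through 7 stand in for the four classes:

```python
import math

# Machine classes ordered cheapest first; identifying each class with the
# machines of Tables 4-7 is an assumption here, not stated on the slides.
FLOPS = {"Personal Computer": 1e7, "Workstation": 3e8,
         "Supercomputer": 4.2e9, "Teraflop Computer": 1e12}

SIZES = {"tiny": 1e2, "small": 1e4, "medium": 1e6, "large": 1e8, "huge": 1e10}

def ops(n: float, kind: str) -> float:
    return {"n^(1/2)": n ** 0.5, "n": n, "n log(n)": n * math.log10(n),
            "n^(3/2)": n ** 1.5, "n^2": n ** 2}[kind]

def machine_for(size: str, kind: str, budget_seconds: float) -> str:
    """Cheapest machine class finishing within the response-time budget."""
    for machine, flops in FLOPS.items():  # dicts preserve insertion order
        if ops(SIZES[size], kind) / flops <= budget_seconds:
            return machine
    return "---"

print(machine_for("large", "n log(n)", 1))         # Supercomputer (Table 8)
print(machine_for("large", "n^2", 7 * 24 * 3600))  # Teraflop Computer (Table 9)
```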

Complexity

Table 10: Transfer Rates for a Variety of Data Transfer Regimes

n        standard ethernet    fast ethernet        hard disk transfer    cache transfer
         10 megabits/sec      100 megabits/sec     2027 kilobytes/sec    @ 200 megahertz
         1.25x10^6 bytes/sec  1.25x10^7 bytes/sec  2.027x10^6 bytes/sec  2x10^8 bytes/sec
tiny     8x10^-5 sec          8x10^-6 sec          4.9x10^-5 sec         5x10^-7 sec
small    8x10^-3 sec          8x10^-4 sec          4.9x10^-3 sec         5x10^-5 sec
medium   .8 sec               .08 sec              .49 sec               5x10^-3 sec
large    1.3 min              8 sec                49 sec                .5 sec
huge     2.2 hours            13.3 min             1.36 hours            50 sec
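Each entry in Table 10 is simply data volume divided by channel rate; a minimal sketch:

```python
# Channel rates from the Table 10 column headers (bytes per second)
RATES = {
    "standard ethernet": 1.25e6,
    "fast ethernet": 1.25e7,
    "hard disk": 2.027e6,
    "cache @ 200 MHz": 2e8,
}

SIZES_BYTES = {"tiny": 1e2, "small": 1e4, "medium": 1e6, "large": 1e8, "huge": 1e10}

# Transfer time is data volume divided by channel rate
for size, nbytes in SIZES_BYTES.items():
    row = ", ".join(f"{name}: {nbytes / rate:.2g} s" for name, rate in RATES.items())
    print(f"{size}: {row}")
```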

Complexity

Table 11: Resolvable Number of Pixels Across Screen for Several Viewing Scenarios

                                          19 inch monitor  25 inch TV  15 foot screen  immersion
                                          @ 24 inches      @ 12 feet   @ 20 feet
Angle                                     39.005°          9.922°      41.112°         140°
5 seconds of arc resolution (Valyus)      28,084           7,144       29,601          100,800
1 minute of arc resolution                2,340            595         2,467           8,400
3.6 minutes of arc resolution (Wegman)    650              165         685             2,333
4.38 minutes of arc resolution (Maar 1)   534              136         563             1,918
.486 minutes of arc/foveal cone (Maar 2)  4,815            1,225       5,076           17,284

Complexity

Scenarios:
Typical high resolution workstation: 1280x1024 = 1.31x10^6 pixels
Realistic, using Wegman, immersion, 4:5 aspect ratio: 2333x1866 = 4.35x10^6 pixels
Very optimistic, using 1 minute of arc, immersion, 4:5 aspect ratio: 8400x6720 = 5.64x10^7 pixels
Wildly optimistic, using Maar (2), immersion, 4:5 aspect ratio: 17,284x13,828 = 2.39x10^8 pixels
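The Table 11 entries and the scenario pixel counts reduce to one formula: resolvable pixels across the screen = subtended angle / angular resolution. A minimal sketch of that arithmetic:

```python
# Subtended angles from Table 11 (degrees) and angular resolutions (arc minutes)
ANGLES_DEG = {"19in monitor @ 24in": 39.005, "25in TV @ 12ft": 9.922,
              "15ft screen @ 20ft": 41.112, "immersion": 140.0}

RESOLUTIONS_ARCMIN = {"Valyus (5 arcsec)": 5 / 60, "1 arcmin": 1.0,
                      "Wegman": 3.6, "Maar 1": 4.38, "Maar 2 (foveal cone)": 0.486}

def resolvable_pixels(angle_deg: float, res_arcmin: float) -> int:
    """Pixels across = subtended angle / angular resolution (60 arcmin per degree)."""
    return round(angle_deg * 60 / res_arcmin)

# The "realistic" scenario: Wegman resolution, immersion, 4:5 aspect ratio
w = resolvable_pixels(140.0, 3.6)  # 2333 pixels across
h = round(w * 4 / 5)               # 1866 pixels down
print(w, h, w * h)                 # 2333 x 1866 ~ 4.35x10^6 pixels
```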

Massive Data Sets

One Terabyte Data Set
vs.
One Million Megabyte Data Sets

Both are difficult to analyze, but for different reasons.

Massive Data Sets: Commonly Used Language

Data Mining = DM
Knowledge Discovery in Databases = KDD
Massive Data Sets = MD
Data Analysis = DA

Massive Data Sets

DM ≠ MD
DM ≠ DA
Even DA + MD ≠ DM

Data mining additionally requires:
1. Computationally Feasible Algorithms
2. Little or No Human Intervention

Data Mining of Massive Datasets

Data Mining is a kind of Exploratory Data Analysis with little or no human interaction, using computationally feasible techniques, i.e., the attempt to find interesting structure unknown a priori.

Massive Data Sets

Major Issues:
Complexity
Non-homogeneity

Examples:
Huber’s Air Traffic Control
Highway Maintenance
Ultrasonic NDE

Massive Data Sets

Air Traffic Control
6 to 12 radar stations, several hundred aircraft, a 64-byte record per radar per aircraft per antenna turn
Yields on the order of a megabyte of data per minute
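A rough back-of-envelope check of that rate; the slide gives only ranges, so the specific counts and the antenna turn period below are illustrative assumptions:

```python
# Back-of-envelope data rate for Huber's air traffic control example.
# Assumed: 10 radars (slide says 6-12), 500 aircraft ("several hundred"),
# and a 5-second antenna turn (the turn period is not given on the slide).
radars, aircraft, bytes_per_record = 10, 500, 64
turns_per_minute = 60 / 5

bytes_per_minute = radars * aircraft * bytes_per_record * turns_per_minute
print(f"{bytes_per_minute / 1e6:.1f} megabytes per minute")  # ~3.8 MB/min
```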

Massive Data Sets

Highway Maintenance
Maintenance records and measurements of road quality spanning several decades
Records of uneven quality
Missing records

Massive Data Sets

NDE using Ultrasound
Inspection of cast iron projectiles
Time series of length 256, at 360 degrees and 550 levels = 50,688,000 observations per projectile
Several thousand projectiles per day
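The observation count is just the product of the three sampling dimensions; the daily figure below assumes 3,000 projectiles for "several thousand":

```python
# Observations per projectile in the ultrasonic NDE example:
# a length-256 time series at each of 360 angular positions and 550 levels.
series_length, angles, levels = 256, 360, 550
per_projectile = series_length * angles * levels
print(per_projectile)  # 50,688,000, matching the slide

# "Several thousand projectiles per day" -- 3,000 is an assumed figure
print(per_projectile * 3000)  # ~1.5x10^11 observations per day
```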

Massive Data Sets: A Distinction

Human analysis of the structure of the data and its pitfalls
vs.
Human analysis of the data itself

Limits of the human visual system (HVS) and computational complexity constrain the latter.
The former is the basis for the design of the analysis engine.
