Machine Learning General Concepts


An ebook on basic machine learning concepts, compiled from Wikipedia articles.


See more at http://ml.memect.com

Contents

1 Machine learning
  1.1 Overview
    1.1.1 Types of problems and tasks
  1.2 History and relationships to other fields
    1.2.1 Relation to statistics
  1.3 Theory
  1.4 Approaches
    1.4.1 Decision tree learning
    1.4.2 Association rule learning
    1.4.3 Artificial neural networks
    1.4.4 Inductive logic programming
    1.4.5 Support vector machines
    1.4.6 Clustering
    1.4.7 Bayesian networks
    1.4.8 Reinforcement learning
    1.4.9 Representation learning
    1.4.10 Similarity and metric learning
    1.4.11 Sparse dictionary learning
    1.4.12 Genetic algorithms
  1.5 Applications
  1.6 Software
    1.6.1 Open-source software
    1.6.2 Commercial software with open-source editions
    1.6.3 Commercial software
  1.7 Journals
  1.8 Conferences
  1.9 See also
  1.10 References
  1.11 Further reading
  1.12 External links

2 Data mining
  2.1 Etymology
  2.2 Background
    2.2.1 Research and evolution
  2.3 Process
    2.3.1 Pre-processing
    2.3.2 Data mining
    2.3.3 Results validation
  2.4 Standards
  2.5 Notable uses
    2.5.1 Games
    2.5.2 Business
    2.5.3 Science and engineering
    2.5.4 Human rights
    2.5.5 Medical data mining
    2.5.6 Spatial data mining
    2.5.7 Temporal data mining
    2.5.8 Sensor data mining
    2.5.9 Visual data mining
    2.5.10 Music data mining
    2.5.11 Surveillance
    2.5.12 Pattern mining
    2.5.13 Subject-based data mining
    2.5.14 Knowledge grid
  2.6 Privacy concerns and ethics
    2.6.1 Situation in Europe
    2.6.2 Situation in the United States
  2.7 Copyright Law
    2.7.1 Situation in Europe
    2.7.2 Situation in the United States
  2.8 Software
    2.8.1 Free open-source data mining software and applications
    2.8.2 Commercial data-mining software and applications
    2.8.3 Marketplace surveys
  2.9 See also
  2.10 References
  2.11 Further reading
  2.12 External links

3 Statistical classification
  3.1 Relation to other problems
  3.2 Frequentist procedures
  3.3 Bayesian procedures
  3.4 Binary and multiclass classification
  3.5 Feature vectors
  3.6 Linear classifiers
  3.7 Algorithms
  3.8 Evaluation
  3.9 Application domains
  3.10 See also
  3.11 References
  3.12 External links

4 Cluster analysis
  4.1 Definition
  4.2 Algorithms
    4.2.1 Connectivity based clustering (hierarchical clustering)
    4.2.2 Centroid-based clustering
    4.2.3 Distribution-based clustering
    4.2.4 Density-based clustering
    4.2.5 Recent developments
    4.2.6 Other methods
  4.3 Evaluation and assessment
    4.3.1 Internal evaluation
    4.3.2 External evaluation
  4.4 Applications
  4.5 See also
    4.5.1 Specialized types of cluster analysis
    4.5.2 Techniques used in cluster analysis
    4.5.3 Data projection and preprocessing
    4.5.4 Other
  4.6 References
  4.7 External links

5 Anomaly detection
  5.1 Applications
  5.2 Popular techniques
  5.3 Application to data security
  5.4 Software
  5.5 See also
  5.6 References

6 Association rule learning
  6.1 Definition
  6.2 Useful Concepts
  6.3 Process
  6.4 History
  6.5 Alternative measures of interestingness
  6.6 Statistically sound associations
  6.7 Algorithms
    6.7.1 Apriori algorithm
    6.7.2 Eclat algorithm
    6.7.3 FP-growth algorithm
    6.7.4 Others
  6.8 Lore
  6.9 Other types of association mining
  6.10 See also
  6.11 References
  6.12 External links
    6.12.1 Bibliographies
    6.12.2 Implementations

7 Reinforcement learning
  7.1 Introduction
  7.2 Exploration
  7.3 Algorithms for control learning
    7.3.1 Criterion of optimality
    7.3.2 Brute force
    7.3.3 Value function approaches
    7.3.4 Direct policy search
  7.4 Theory
  7.5 Current research
  7.6 Literature
    7.6.1 Conferences, journals
  7.7 See also
  7.8 Implementations
  7.9 References
  7.10 External links

8 Structured prediction
  8.1 Example: sequence tagging
  8.2 Structured perceptron
  8.3 See also
  8.4 References
  8.5 External links

9 Feature learning
  9.1 Supervised feature learning
    9.1.1 Supervised dictionary learning
    9.1.2 Neural networks
  9.2 Unsupervised feature learning
    9.2.1 K-means clustering
    9.2.2 Principal component analysis
    9.2.3 Local linear embedding
    9.2.4 Independent component analysis
    9.2.5 Unsupervised dictionary learning
  9.3 Multilayer/Deep architectures
    9.3.1 Restricted Boltzmann machine
    9.3.2 Autoencoder
  9.4 See also
  9.5 References

10 Online machine learning
  10.1 A prototypical online supervised learning algorithm
    10.1.1 The algorithm and its interpretations
  10.2 Example: Complexity in the Case of Linear Least Squares
    10.2.1 Batch Learning
    10.2.2 Online Learning
  10.3 Books with substantial treatment of online machine learning
  10.4 See also
  10.5 References
  10.6 External links

11 Semi-supervised learning
  11.1 Assumptions used in semi-supervised learning
    11.1.1 Smoothness assumption
    11.1.2 Cluster assumption
    11.1.3 Manifold assumption
  11.2 History
  11.3 Methods for semi-supervised learning
    11.3.1 Generative models
    11.3.2 Low-density separation
    11.3.3 Graph-based methods
    11.3.4 Heuristic approaches
  11.4 Semi-supervised learning in human cognition
  11.5 See also
  11.6 References
  11.7 External links

12 Grammar induction
  12.1 Grammar Classes
  12.2 Learning Models
  12.3 Methodologies
    12.3.1 Grammatical inference by trial-and-error
    12.3.2 Grammatical inference by genetic algorithms
    12.3.3 Grammatical inference by greedy algorithms
    12.3.4 Distributional Learning
    12.3.5 Learning of Pattern languages
    12.3.6 Pattern theory
  12.4 Applications
  12.5 See also
  12.6 Notes
  12.7 References
  12.8 Text and image sources, contributors, and licenses
    12.8.1 Text
    12.8.2 Images
    12.8.3 Content license

Chapter 1

    Machine learning

    For the journal, see Machine Learning (journal).

Machine learning is a subfield of computer science[1] that evolved from the study of pattern recognition and computational learning theory in artificial intelligence.[1] Machine learning explores the construction and study of algorithms that can learn from and make predictions on data.[2] Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions,[3]:2 rather than following strictly static program instructions.

Machine learning is closely related to and often overlaps with computational statistics, a discipline that also specializes in prediction-making. It has strong ties to mathematical optimization, which delivers methods, theory and application domains to the field. Machine learning is employed in a range of computing tasks where designing and programming explicit, rule-based algorithms is infeasible. Example applications include spam filtering, optical character recognition (OCR),[4] search engines and computer vision. Machine learning is sometimes conflated with data mining,[5] although that focuses more on exploratory data analysis.[6] Machine learning and pattern recognition can be viewed as two facets of the same field.[3]:vii

When employed in industrial contexts, machine learning methods may be referred to as predictive analytics or predictive modelling.

1.1 Overview

In 1959, Arthur Samuel defined machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed".[7]

Tom M. Mitchell provided a widely quoted, more formal definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."[8] This definition is notable for defining machine learning in fundamentally operational rather than cognitive terms, thus following Alan Turing's proposal in his paper "Computing Machinery and Intelligence" that the question "Can machines think?" be replaced with the question "Can machines do what we (as thinking entities) can do?"[9]

1.1.1 Types of problems and tasks

Machine learning tasks are typically classified into three broad categories, depending on the nature of the learning signal or feedback available to a learning system. These are:[10]

Supervised learning: The computer is presented with example inputs and their desired outputs, given by a teacher, and the goal is to learn a general rule that maps inputs to outputs.

Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end.

In reinforcement learning, a computer program interacts with a dynamic environment in which it must perform a certain goal (such as driving a vehicle), without a teacher explicitly telling it whether it has come close to its goal or not. Another example is learning to play a game by playing against an opponent.[3]:3

Between supervised and unsupervised learning is semi-supervised learning, where the teacher gives an incomplete training signal: a training set with some (often many) of the target outputs missing. Transduction is a special case of this principle where the entire set of problem instances is known at learning time, except that part of the targets are missing.

Among other categories of machine learning problems, learning to learn learns its own inductive bias based on previous experience. Developmental learning, elaborated for robot learning, generates its own sequences (also called curriculum) of learning situations to cumulatively acquire repertoires of novel skills through autonomous self-exploration and social interaction with human teachers, and using guidance mechanisms such as active learning, maturation, motor synergies, and imitation.

[Figure: A support vector machine is a classifier that divides its input space into two regions, separated by a linear boundary. Here, it has learned to distinguish black and white circles.]

Another categorization of machine learning tasks arises when one considers the desired output of a machine-learned system:[3]:3

In classification, inputs are divided into two or more classes, and the learner must produce a model that assigns unseen inputs to one of these classes (or, in multi-label classification, to more than one). This is typically tackled in a supervised way. Spam filtering is an example of classification, where the inputs are email (or other) messages and the classes are "spam" and "not spam".

In regression, also a supervised problem, the outputs are continuous rather than discrete.

In clustering, a set of inputs is to be divided into groups. Unlike in classification, the groups are not known beforehand, making this typically an unsupervised task.

Density estimation finds the distribution of inputs in some space.

Dimensionality reduction simplifies inputs by mapping them into a lower-dimensional space. Topic modeling is a related problem, where a program is given a list of human language documents and is tasked to find out which documents cover similar topics.

1.2 History and relationships to other fields

As a scientific endeavour, machine learning grew out of the quest for artificial intelligence. Already in the early days of AI as an academic discipline, some researchers were interested in having machines learn from data. They attempted to approach the problem with various symbolic methods, as well as what were then termed "neural networks"; these were mostly perceptrons and other models that were later found to be reinventions of the generalized linear models of statistics. Probabilistic reasoning was also employed, especially in automated medical diagnosis.[10]:488

However, an increasing emphasis on the logical, knowledge-based approach caused a rift between AI and machine learning. Probabilistic systems were plagued by theoretical and practical problems of data acquisition and representation.[10]:488 By 1980, expert systems had come to dominate AI, and statistics was out of favor.[11] Work on symbolic/knowledge-based learning did continue within AI, leading to inductive logic programming, but the more statistical line of research was now outside the field of AI proper, in pattern recognition and information retrieval.[10]:708–710; 755 Neural networks research had been abandoned by AI and computer science around the same time. This line, too, was continued outside the AI/CS field, as "connectionism", by researchers from other disciplines including Hopfield, Rumelhart and Hinton. Their main success came in the mid-1980s with the reinvention of backpropagation.[10]:25

Machine learning, reorganized as a separate field, started to flourish in the 1990s. The field changed its goal from achieving artificial intelligence to tackling solvable problems of a practical nature. It shifted focus away from the symbolic approaches it had inherited from AI, and toward methods and models borrowed from statistics and probability theory.[11] It also benefited from the increasing availability of digitized information, and the possibility to distribute it via the Internet.

Machine learning and data mining often employ the same methods and overlap significantly. They can be roughly distinguished as follows:

Machine learning focuses on prediction, based on known properties learned from the training data.

Data mining focuses on the discovery of (previously) unknown properties in the data. This is the analysis step of Knowledge Discovery in Databases.

The two areas overlap in many ways: data mining uses many machine learning methods, but often with a slightly different goal in mind. On the other hand, machine learning also employs data mining methods as "unsupervised learning" or as a preprocessing step to improve learner accuracy. Much of the confusion between these two research communities (which do often have separate conferences and separate journals, ECML PKDD being a major exception) comes from the basic assumptions they work with: in machine learning, performance is usually evaluated with respect to the ability to reproduce known knowledge, while in Knowledge Discovery and Data Mining (KDD) the key task is the discovery of previously unknown knowledge. Evaluated with respect to known knowledge, an uninformed (unsupervised) method will easily be outperformed by supervised methods, while in a typical KDD task, supervised methods cannot be used due to the unavailability of training data.

Machine learning also has intimate ties to optimization: many learning problems are formulated as minimization of some loss function on a training set of examples. Loss functions express the discrepancy between the predictions of the model being trained and the actual problem instances (for example, in classification, one wants to assign a label to instances, and models are trained to correctly predict the pre-assigned labels of a set of examples). The difference between the two fields arises from the goal of generalization: while optimization algorithms can minimize the loss on a training set, machine learning is concerned with minimizing the loss on unseen samples.[12]
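To make this formulation concrete, here is the standard empirical-risk-minimization statement (the notation $\mathcal{F}$, $L$, $x_i$, $y_i$ is supplied here for illustration, not taken from the text): the learner picks the model that minimizes average training loss,

$$ \hat{f} = \arg\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} L\big(f(x_i), y_i\big), $$

while generalization asks that the expected loss of $\hat{f}$ on unseen samples drawn from the same distribution also be small.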

    1.2.1 Relation to statistics

Machine learning and statistics are closely related fields. According to Michael I. Jordan, the ideas of machine learning, from methodological principles to theoretical tools, have had a long pre-history in statistics.[13] He also suggested the term data science as a placeholder to call the overall field.[13]

Leo Breiman distinguished two statistical modelling paradigms: data model and algorithmic model,[14] wherein "algorithmic model" means more or less the machine learning algorithms like Random forest.

Some statisticians have adopted methods from machine learning, leading to a combined field that they call statistical learning.[15]

1.3 Theory

Main article: Computational learning theory

A core objective of a learner is to generalize from its experience.[3][16] Generalization in this context is the ability of a learning machine to perform accurately on new, unseen examples/tasks after having experienced a learning data set. The training examples come from some generally unknown probability distribution (considered representative of the space of occurrences) and the learner has to build a general model about this space that enables it to produce sufficiently accurate predictions in new cases.

The computational analysis of machine learning algorithms and their performance is a branch of theoretical computer science known as computational learning theory. Because training sets are finite and the future is uncertain, learning theory usually does not yield guarantees of the performance of algorithms. Instead, probabilistic bounds on the performance are quite common. The bias–variance decomposition is one way to quantify generalization error.

In addition to performance bounds, computational learning theorists study the time complexity and feasibility of learning. In computational learning theory, a computation is considered feasible if it can be done in polynomial time. There are two kinds of time complexity results. Positive results show that a certain class of functions can be learned in polynomial time. Negative results show that certain classes cannot be learned in polynomial time.

There are many similarities between machine learning theory and statistical inference, although they use different terms.
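The bias–variance decomposition mentioned above has a standard closed form, reproduced here for reference (the notation is supplied here, not by the text): for squared error, a target $y = f(x) + \varepsilon$ with noise variance $\sigma^2$, and a predictor $\hat{f}$ that depends on a random training set,

$$ \mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2 . $$

The three terms attribute generalization error to systematic mis-fit, sensitivity to the particular training sample, and irreducible noise.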

1.4 Approaches

Main article: List of machine learning algorithms

1.4.1 Decision tree learning

Main article: Decision tree learning

Decision tree learning uses a decision tree as a predictive model, which maps observations about an item to conclusions about the item's target value.
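A minimal sketch of this idea, using scikit-learn (one of the open-source packages listed in section 1.6.1); the toy data, feature encoding, and class labels are invented for illustration:

```python
# Decision tree learning: map observations about an item (features)
# to conclusions about the item's target value (class label).
from sklearn.tree import DecisionTreeClassifier

# Each row is [weight_in_grams, has_smooth_skin]; labels: 1 = apple, 0 = lemon
X = [[150, 1], [170, 1], [130, 0], [120, 0]]
y = [1, 1, 0, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(tree.predict([[160, 1]]))  # predicted target value for a new item
```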

1.4.2 Association rule learning

Main article: Association rule learning

Association rule learning is a method for discovering interesting relations between variables in large databases.

1.4.3 Artificial neural networks

Main article: Artificial neural network

An artificial neural network (ANN) learning algorithm, usually called a neural network (NN), is a learning algorithm that is inspired by the structure and functional aspects of biological neural networks. Computations are structured in terms of an interconnected group of artificial neurons, processing information using a connectionist approach to computation. Modern neural networks are non-linear statistical data modeling tools. They are usually used to model complex relationships between inputs and outputs, to find patterns in data, or to capture the statistical structure in an unknown joint probability distribution between observed variables.
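As a sketch of the connectionist computation described here, the following forward pass through one hidden layer uses made-up weights (no training step is shown; NumPy is assumed):

```python
# An interconnected group of artificial neurons processing information:
# input -> hidden layer -> output, each neuron applying a non-linearity.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2])                 # input signals
W1 = np.array([[0.1, 0.4], [-0.3, 0.8]])  # input-to-hidden connection weights
W2 = np.array([0.7, -0.5])                # hidden-to-output connection weights

hidden = sigmoid(W1 @ x)  # each hidden neuron combines its inputs non-linearly
output = sigmoid(W2 @ hidden)
print(output)             # a non-linear statistical model of the inputs
```

Training would adjust W1 and W2, for example by the backpropagation mentioned in section 1.2.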

    1.4.4 Inductive logic programming

    Main article: Inductive logic programming

Inductive logic programming (ILP) is an approach to rule learning using logic programming as a uniform representation for input examples, background knowledge, and hypotheses. Given an encoding of the known background knowledge and a set of examples represented as a logical database of facts, an ILP system will derive a hypothesized logic program that entails all positive and no negative examples. Inductive programming is a related field that considers any kind of programming language for representing hypotheses (and not only logic programming), such as functional programs.

    1.4.5 Support vector machines

    Main article: Support vector machines

Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.
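A minimal sketch of the two-category training just described, assuming scikit-learn; the training examples are invented:

```python
# SVM: build a model from examples marked as one of two categories,
# then predict which category a new example falls into.
from sklearn import svm

X = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]]  # training examples
y = [0, 0, 1, 1]                                      # category of each example
clf = svm.SVC(kernel="linear").fit(X, y)
print(clf.predict([[2.6, 2.6]]))  # the linear boundary assigns the new point
```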

    1.4.6 Clustering

    Main article: Cluster analysis

Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that observations within the same cluster are similar according to some predesignated criterion or criteria, while observations drawn from different clusters are dissimilar. Different clustering techniques make different assumptions on the structure of the data, often defined by some similarity metric and evaluated, for example, by internal compactness (similarity between members of the same cluster) and separation between different clusters. Other methods are based on estimated density and graph connectivity. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis.
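A minimal sketch, assuming k-means as the clustering technique (the text above stays method-agnostic) and scikit-learn; the observations are invented:

```python
# Assign observations to clusters so that members of the same cluster
# are similar (here: close in Euclidean distance).
import numpy as np
from sklearn.cluster import KMeans

obs = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.2, 7.9]])
km = KMeans(n_clusters=2, n_init=10).fit(obs)
print(km.labels_)           # cluster assignment per observation
print(km.cluster_centers_)  # compactness is judged around these centers
```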

    1.4.7 Bayesian networks

    Main article: Bayesian network

A Bayesian network, belief network or directed acyclic graphical model is a probabilistic graphical model that represents a set of random variables and their conditional independencies via a directed acyclic graph (DAG). For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases. Efficient algorithms exist that perform inference and learning.
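The disease/symptom example can be worked by hand for the smallest possible network, a single Disease node with one Symptom child; all probabilities below are invented for illustration:

```python
# Two-node Bayesian network: Disease -> Symptom.
p_disease = 0.01                  # prior P(disease)
p_sym_given_dis = 0.9             # P(symptom | disease)
p_sym_given_healthy = 0.1         # P(symptom | no disease)

# Marginalize over the parent node to get P(symptom)
p_symptom = p_sym_given_dis * p_disease + p_sym_given_healthy * (1 - p_disease)

# Bayes' rule: probability of the disease given the observed symptom
p_dis_given_sym = p_sym_given_dis * p_disease / p_symptom
print(round(p_dis_given_sym, 3))  # ~0.083: the symptom alone is weak evidence
```

Real networks have many variables; the efficient inference algorithms the text mentions avoid enumerating all of them explicitly.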

    1.4.8 Reinforcement learning

    Main article: Reinforcement learning

Reinforcement learning is concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward. Reinforcement learning algorithms attempt to find a policy that maps states of the world to the actions the agent ought to take in those states. Reinforcement learning differs from the supervised learning problem in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected.
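A toy sketch of learning such a policy, assuming Q-learning (one particular algorithm; the text stays general) on an invented five-state chain with reward only at the right end:

```python
# Learn a policy mapping states to actions from reward signals alone,
# with no correct input/output pairs ever presented.
import random

n_states, actions = 5, [-1, +1]  # states 0..4; actions: step left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, eps = 0.5, 0.9, 0.1

for _ in range(500):
    s = 0
    while s != n_states - 1:  # episode ends at the rewarding state
        if random.random() < eps:
            a = random.choice(actions)                       # explore
        else:
            a = max(actions, key=lambda act: Q[(s, act)])    # exploit
        s2 = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s2 == n_states - 1 else 0.0
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
        s = s2

# The learned policy: the best action in each state (move right)
print({s: max(actions, key=lambda act: Q[(s, act)]) for s in range(n_states)})
```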

    1.4.9 Representation learning

    Main article: Representation learning

Several learning algorithms, mostly unsupervised learning algorithms, aim at discovering better representations of the inputs provided during training. Classical examples include principal components analysis and cluster analysis. Representation learning algorithms often attempt to preserve the information in their input but transform it in a way that makes it useful, often as a pre-processing step before performing classification or predictions. This allows reconstruction of the inputs coming from the unknown data-generating distribution, while not being necessarily faithful for configurations that are implausible under that distribution.

Manifold learning algorithms attempt to do so under the constraint that the learned representation is low-dimensional. Sparse coding algorithms attempt to do so under the constraint that the learned representation is sparse (has many zeros). Multilinear subspace learning algorithms aim to learn low-dimensional representations directly from tensor representations for multidimensional data, without reshaping them into (high-dimensional) vectors.[17] Deep learning algorithms discover multiple levels of representation, or a hierarchy of features, with higher-level, more abstract features defined in terms of (or generating) lower-level features. It has been argued that an intelligent machine is one that learns a representation that disentangles the underlying factors of variation that explain the observed data.[18]
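A minimal sketch of the first classical example, principal components analysis, assuming scikit-learn; the data are invented:

```python
# Learn a lower-dimensional representation that preserves most of the
# information in the inputs, then reconstruct the inputs from it.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
pca = PCA(n_components=1)
Z = pca.fit_transform(X)          # 1-D representation of the 2-D inputs
X_hat = pca.inverse_transform(Z)  # approximate reconstruction
print(np.abs(X - X_hat).mean())   # small residual: most information preserved
```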

    1.4.10 Similarity and metric learning

    Main article: Similarity learning

In this problem, the learning machine is given pairs of examples that are considered similar and pairs of less similar objects. It then needs to learn a similarity function (or a distance metric function) that can predict if new objects are similar. It is sometimes used in recommendation systems.

    1.4.11 Sparse dictionary learning

In this method, a datum is represented as a linear combination of basis functions, and the coefficients are assumed to be sparse. Let x be a d-dimensional datum and D a d-by-n matrix, where each column of D represents a basis function, and let r be the coefficient vector representing x using D. Mathematically, sparse dictionary learning means finding a representation x ≈ Dr where r is sparse. Generally speaking, n is assumed to be larger than d to allow the freedom for a sparse representation.

Learning a dictionary along with sparse representations is strongly NP-hard and also difficult to solve approximately.[19] A popular heuristic method for sparse dictionary learning is K-SVD.

Sparse dictionary learning has been applied in several contexts. In classification, the problem is to determine which classes a previously unseen datum belongs to. Suppose a dictionary for each class has already been built. Then a new datum is associated with the class such that it is best sparsely represented by the corresponding dictionary. Sparse dictionary learning has also been applied in image de-noising. The key idea is that a clean image patch can be sparsely represented by an image dictionary, but the noise cannot.[20]
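The approximation x ≈ Dr above is typically found by solving a penalized optimization problem; one standard lasso-style formulation (not spelled out in the text) is

$$ \min_{D,\, r} \; \lVert x - D r \rVert_2^2 + \lambda \lVert r \rVert_1 , $$

where the $\ell_1$ penalty pushes most entries of $r$ to zero and $\lambda$ trades reconstruction fidelity against sparsity.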

    1.4.12 Genetic algorithms

    Main article: Genetic algorithm

A genetic algorithm (GA) is a search heuristic that mimics the process of natural selection, and uses methods such as mutation and crossover to generate new genotypes in the hope of finding good solutions to a given problem. In machine learning, genetic algorithms found some uses in the 1980s and 1990s.[21][22] Vice versa, machine learning techniques have been used to improve the performance of genetic and evolutionary algorithms.[23]
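A toy sketch of the mutation and crossover operators just mentioned, applied to the classic invented task of maximizing the number of 1-bits in a genotype:

```python
# Genetic algorithm: evolve a population of genotypes by selection,
# crossover, and mutation, keeping the fittest candidates.
import random

def fitness(g):
    return sum(g)  # number of 1-bits

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(g, rate=0.05):
    return [1 - bit if random.random() < rate else bit for bit in g]

pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
for _ in range(50):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]  # selection: the fittest genotypes survive
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(20)]
    pop = parents + children

print(fitness(max(pop, key=fitness)))  # approaches 20 after a few generations
```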

1.5 Applications

Applications for machine learning include:

    Adaptive websites

Affective computing

    Bioinformatics

    Brain-machine interfaces

    Cheminformatics

    Classifying DNA sequences

    Computational advertising

Computational finance

    Computer vision, including object recognition

    Detecting credit card fraud

    Game playing[24]

    Information retrieval

    Internet fraud detection

    Machine perception

    Medical diagnosis

    Natural language processing[25]

    Optimization and metaheuristic

    Recommender systems

    Robot locomotion

    Search engines

    Sentiment analysis (or opinion mining)

    Sequence mining

    Software engineering

    Speech and handwriting recognition

    Stock market analysis

    Structural health monitoring

    Syntactic pattern recognition


In 2006, the online movie company Netflix held the first "Netflix Prize" competition to find a program to better predict user preferences and improve the accuracy on its existing Cinematch movie recommendation algorithm by at least 10%. A joint team made up of researchers from AT&T Labs-Research in collaboration with the teams Big Chaos and Pragmatic Theory built an ensemble model to win the Grand Prize in 2009 for $1 million.[26] Shortly after the prize was awarded, Netflix realized that viewers' ratings were not the best indicators of their viewing patterns ("everything is a recommendation") and they changed their recommendation engine accordingly.[27]

In 2010, The Wall Street Journal wrote about money management firm Rebellion Research's use of machine learning to predict economic movements. The article describes Rebellion Research's prediction of the financial crisis and economic recovery.[28]

In 2014, it was reported that a machine learning algorithm had been applied in art history to study fine art paintings, and that it may have revealed previously unrecognized influences between artists.[29]

1.6 Software

Software suites containing a variety of machine learning algorithms include the following:

1.6.1 Open-source software

dlib

ELKI

Encog

H2O

Mahout

mlpy

MLPACK

MOA (Massive Online Analysis)

ND4J with Deeplearning4j

OpenCV

OpenNN

Orange

R

scikit-learn

Shogun

Spark

Yooreeka

Weka

    1.6.2 Commercial software with open-source editions

    KNIME

    RapidMiner

    1.6.3 Commercial software

    Amazon Machine Learning

    Angoss KnowledgeSTUDIO

    Databricks

    IBM SPSS Modeler

    KXEN Modeler

    LIONsolver

    Mathematica

    MATLAB

    Microsoft Azure

    NeuroSolutions

    Oracle Data Mining

    RCASE

    SAS Enterprise Miner

    STATISTICA Data Miner

    1.7 Journals

    Journal of Machine Learning Research

    Machine Learning

    Neural Computation

    1.8 Conferences

Conference on Neural Information Processing Systems

    International Conference on Machine Learning


1.9 See also

Adaptive control

Adversarial machine learning

Automatic reasoning

Cache language model

Cognitive model

Cognitive science

Computational intelligence

Computational neuroscience

Ethics of artificial intelligence

Existential risk of artificial general intelligence

Explanation-based learning

Hidden Markov model

Important publications in machine learning

List of machine learning algorithms

1.10 References

[1] http://www.britannica.com/EBchecked/topic/1116194/machine-learning This is a tertiary source that clearly includes information from other sources but does not name them.

[2] Ron Kohavi; Foster Provost (1998). "Glossary of terms". Machine Learning 30: 271–274.

[3] C. M. Bishop (2006). Pattern Recognition and Machine Learning. Springer. ISBN 0-387-31073-8.

[4] Wernick, Yang, Brankov, Yourganov and Strother, "Machine Learning in Medical Imaging", IEEE Signal Processing Magazine, vol. 27, no. 4, July 2010, pp. 25-38.

[5] Mannila, Heikki (1996). "Data mining: machine learning, statistics, and databases". Int'l Conf. Scientific and Statistical Database Management. IEEE Computer Society.

[6] Friedman, Jerome H. (1998). "Data Mining and Statistics: What's the connection?". Computing Science and Statistics 29 (1): 3–9.

[7] Phil Simon (March 18, 2013). Too Big to Ignore: The Business Case for Big Data. Wiley. p. 89. ISBN 978-1118638170.

[8] Mitchell, T. (1997). Machine Learning, McGraw Hill. ISBN 0-07-042807-7, p. 2.

[9] Harnad, Stevan (2008), "The Annotation Game: On Turing (1950) on Computing, Machinery, and Intelligence", in Epstein, Robert; Peters, Grace, The Turing Test Sourcebook: Philosophical and Methodological Issues in the Quest for the Thinking Computer, Kluwer.

[10] Russell, Stuart; Norvig, Peter (2003) [1995]. Artificial Intelligence: A Modern Approach (2nd ed.). Prentice Hall. ISBN 978-0137903955.

[11] Langley, Pat (2011). "The changing science of machine learning". Machine Learning 82 (3): 275–279. doi:10.1007/s10994-011-5242-y.

[12] Le Roux, Nicolas; Bengio, Yoshua; Fitzgibbon, Andrew (2012). "Improving First and Second-Order Methods by Modeling Uncertainty". In Sra, Suvrit; Nowozin, Sebastian; Wright, Stephen J. Optimization for Machine Learning. MIT Press. p. 404.

[13] MI Jordan (2014-09-10). "statistics and machine learning". reddit. Retrieved 2014-10-01.

    [14] http://projecteuclid.org/download/pdf_1/euclid.ss/1009213726

[15] Gareth James; Daniela Witten; Trevor Hastie; Robert Tibshirani (2013). An Introduction to Statistical Learning. Springer. p. vii.

[16] Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar (2012). Foundations of Machine Learning, MIT Press. ISBN 9780262018258.

[17] Lu, Haiping; Plataniotis, K.N.; Venetsanopoulos, A.N. (2011). "A Survey of Multilinear Subspace Learning for Tensor Data" (PDF). Pattern Recognition 44 (7): 1540–1551. doi:10.1016/j.patcog.2011.01.004.

[18] Yoshua Bengio (2009). Learning Deep Architectures for AI. Now Publishers Inc. pp. 1–3. ISBN 978-1-60198-294-0.

[19] A. M. Tillmann, "On the Computational Intractability of Exact and Approximate Dictionary Learning", IEEE Signal Processing Letters 22(1), 2015: 45–49.

[20] Aharon, M., M. Elad, and A. Bruckstein. 2006. "K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation". Signal Processing, IEEE Transactions on 54 (11): 4311–4322.

[21] Goldberg, David E.; Holland, John H. (1988). "Genetic algorithms and machine learning". Machine Learning 3 (2): 95–99.

[22] Michie, D.; Spiegelhalter, D. J.; Taylor, C. C. (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood.

[23] Zhang, Jun; Zhan, Zhi-hui; Lin, Ying; Chen, Ni; Gong, Yue-jiao; Zhong, Jing-hui; Chung, Henry S.H.; Li, Yun; Shi, Yu-hui (2011). "Evolutionary Computation Meets Machine Learning: A Survey" (PDF). Computational Intelligence Magazine (IEEE) 6 (4): 68–75.

[24] Tesauro, Gerald (March 1995). "Temporal Difference Learning and TD-Gammon". Communications of the ACM 38 (3).

[25] Daniel Jurafsky and James H. Martin (2009). Speech and Language Processing. Pearson Education. pp. 207 ff.

    [26] BelKor Home Page research.att.com


    [27]

    [28]

[29] "When A Machine Learning Algorithm Studied Fine Art Paintings, It Saw Things Art Historians Had Never Noticed", The Physics at ArXiv blog.

1.11 Further reading

Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar (2012). Foundations of Machine Learning, The MIT Press. ISBN 9780262018258.

Ian H. Witten and Eibe Frank (2011). Data Mining: Practical machine learning tools and techniques, Morgan Kaufmann, 664 pp., ISBN 978-0123748560.

Sergios Theodoridis, Konstantinos Koutroumbas (2009). Pattern Recognition, 4th Edition, Academic Press, ISBN 978-1-59749-272-0.

Mierswa, Ingo and Wurst, Michael and Klinkenberg, Ralf and Scholz, Martin and Euler, Timm: YALE: Rapid Prototyping for Complex Data Mining Tasks, in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-06), 2006.

Bing Liu (2007), Web Data Mining: Exploring Hyperlinks, Contents and Usage Data. Springer, ISBN 3-540-37881-2.

Toby Segaran (2007), Programming Collective Intelligence, O'Reilly, ISBN 0-596-52932-5.

Huang T.-M., Kecman V., Kopriva I. (2006), Kernel Based Algorithms for Mining Huge Data Sets, Supervised, Semi-supervised, and Unsupervised Learning, Springer-Verlag, Berlin, Heidelberg, 260 pp., 96 illus., Hardcover, ISBN 3-540-31681-7.

Ethem Alpaydın (2004) Introduction to Machine Learning (Adaptive Computation and Machine Learning), MIT Press, ISBN 0-262-01211-1.

MacKay, D.J.C. (2003). Information Theory, Inference, and Learning Algorithms, Cambridge University Press. ISBN 0-521-64298-1.

KECMAN Vojislav (2001), Learning and Soft Computing, Support Vector Machines, Neural Networks and Fuzzy Logic Models, The MIT Press, Cambridge, MA, 608 pp., 268 illus., ISBN 0-262-11255-8.

Trevor Hastie, Robert Tibshirani and Jerome Friedman (2001). The Elements of Statistical Learning, Springer. ISBN 0-387-95284-5.

Richard O. Duda, Peter E. Hart, David G. Stork (2001) Pattern classification (2nd edition), Wiley, New York, ISBN 0-471-05669-3.

Bishop, C.M. (1995). Neural Networks for Pattern Recognition, Oxford University Press. ISBN 0-19-853864-2.

Ryszard S. Michalski, George Tecuci (1994), Machine Learning: A Multistrategy Approach, Volume IV, Morgan Kaufmann, ISBN 1-55860-251-8.

Sholom Weiss and Casimir Kulikowski (1991). Computer Systems That Learn, Morgan Kaufmann. ISBN 1-55860-065-5.

Yves Kodratoff, Ryszard S. Michalski (1990), Machine Learning: An Artificial Intelligence Approach, Volume III, Morgan Kaufmann, ISBN 1-55860-119-8.

Ryszard S. Michalski, Jaime G. Carbonell, Tom M. Mitchell (1986), Machine Learning: An Artificial Intelligence Approach, Volume II, Morgan Kaufmann, ISBN 0-934613-00-1.

Ryszard S. Michalski, Jaime G. Carbonell, Tom M. Mitchell (1983), Machine Learning: An Artificial Intelligence Approach, Tioga Publishing Company, ISBN 0-935382-05-4.

Vladimir Vapnik (1998). Statistical Learning Theory. Wiley-Interscience, ISBN 0-471-03003-1.

Ray Solomonoff, An Inductive Inference Machine, IRE Convention Record, Section on Information Theory, Part 2, pp. 56–62, 1957.

Ray Solomonoff, "An Inductive Inference Machine", a privately circulated report from the 1956 Dartmouth Summer Research Conference on AI.

1.12 External links

International Machine Learning Society

Popular online course by Andrew Ng, at Coursera. It uses GNU Octave. The course is a free version of Stanford University's actual course taught by Ng, whose lectures are also available for free.

Machine Learning Video Lectures

mloss is an academic database of open-source machine learning software.

Chapter 2

    Data mining

Not to be confused with analytics, information extraction, or data analysis.

Data mining (the analysis step of the Knowledge Discovery in Databases process, or KDD),[1] an interdisciplinary subfield of computer science,[2][3][4] is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.[2] The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.[2] Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.[2]

The term is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction of data itself.[5] It also is a buzzword[6] and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application of computer decision support systems, including artificial intelligence, machine learning, and business intelligence. The popular book Data mining: Practical machine learning tools and techniques with Java[7] (which covers mostly machine learning material) was originally to be named just Practical machine learning, and the term data mining was only added for marketing reasons.[8] Often the more general terms "(large scale) data analysis" or "analytics", or, when referring to actual methods, artificial intelligence and machine learning, are more appropriate.

The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection) and dependencies (association rule mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection, data preparation, nor result interpretation and reporting are part of the data mining step, but they do belong to the overall KDD process as additional steps.

The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.

2.1 Etymology

In the 1960s, statisticians used terms like "Data Fishing" or "Data Dredging" to refer to what they considered the bad practice of analyzing data without an a-priori hypothesis. The term "Data Mining" appeared around 1990 in the database community. For a short time in the 1980s, the phrase "database mining" was used, but since it was trademarked by HNC, a San Diego-based company, to pitch their Database Mining Workstation,[9] researchers consequently turned to "data mining". Other terms used include Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, etc. Gregory Piatetsky-Shapiro coined the term "Knowledge Discovery in Databases" for the first workshop on the same topic (KDD-1989), and this term became more popular in the AI and Machine Learning communities. However, the term data mining became more popular in the business and press communities.[10] Currently, Data Mining and Knowledge Discovery are used interchangeably. Since about 2007, "Predictive Analytics", and since 2011, "Data Science", have also been used to describe this field.

2.2 Background

The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have dramatically increased data collection, storage, and manipulation ability. As data sets have grown in size and complexity, direct hands-on data analysis has increasingly been augmented with indirect, automated data processing, aided by other discoveries in computer science, such as neural networks, cluster analysis, genetic algorithms (1950s), decision trees and decision rules (1960s), and support vector machines (1990s). Data mining is the process of applying these methods with the intention of uncovering hidden patterns[11] in large data sets. It bridges the gap from applied statistics and artificial intelligence (which usually provide the mathematical background) to database management by exploiting the way data is stored and indexed in databases to execute the actual learning and discovery algorithms more efficiently, allowing such methods to be applied to ever larger data sets.

    2.2.1 Research and evolution

The premier professional body in the field is the Association for Computing Machinery's (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining (SIGKDD).[12][13] Since 1989 this ACM SIG has hosted an annual international conference and published its proceedings,[14] and since 1999 it has published a biannual academic journal titled SIGKDD Explorations.[15]

    Computer science conferences on data mining include:

CIKM Conference – ACM Conference on Information and Knowledge Management

DMIN Conference – International Conference on Data Mining

DMKD Conference – Research Issues on Data Mining and Knowledge Discovery

ECDM Conference – European Conference on Data Mining

ECML-PKDD Conference – European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases

EDM Conference – International Conference on Educational Data Mining

ICDM Conference – IEEE International Conference on Data Mining

KDD Conference – ACM SIGKDD Conference on Knowledge Discovery and Data Mining

MLDM Conference – Machine Learning and Data Mining in Pattern Recognition

PAKDD Conference – The annual Pacific-Asia Conference on Knowledge Discovery and Data Mining

PAW Conference – Predictive Analytics World

SDM Conference – SIAM International Conference on Data Mining (SIAM)

SSTD Symposium – Symposium on Spatial and Temporal Databases

WSDM Conference – ACM Conference on Web Search and Data Mining

Data mining topics are also present at many data management/database conferences such as the ICDE Conference, the SIGMOD Conference, and the International Conference on Very Large Data Bases.

2.3 Process

The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages:

(1) Selection
(2) Pre-processing
(3) Transformation
(4) Data Mining
(5) Interpretation/Evaluation.[1]

Many variations on this theme exist, however, such as the Cross Industry Standard Process for Data Mining (CRISP-DM), which defines six phases:

(1) Business Understanding
(2) Data Understanding
(3) Data Preparation
(4) Modeling
(5) Evaluation
(6) Deployment

or a simplified process such as (1) pre-processing, (2) data mining, and (3) results validation.

Polls conducted in 2002, 2004, and 2007 show that the CRISP-DM methodology is the leading methodology used by data miners.[16][17][18] The only other data mining standard named in these polls was SEMMA; however, three to four times as many people reported using CRISP-DM. Several teams of researchers have published reviews of data mining process models,[19][20] and Azevedo and Santos conducted a comparison of CRISP-DM and SEMMA in 2008.[21]
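For illustration only, the simplified three-step process could be sketched in Python as follows; scikit-learn, the toy data set, and the decision-tree model are assumptions of this sketch, not prescribed by any of the standards discussed here:

# A minimal sketch of the simplified process:
# (1) pre-processing, (2) data mining, (3) results validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# (1) Pre-processing: hold out a test set and scale the features.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)

# (2) Data mining: learn patterns from the training data only.
model = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X_train), y_train)

# (3) Results validation: check the learned patterns on unseen data.
print(accuracy_score(y_test, model.predict(scaler.transform(X_test))))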


2.3.1 Pre-processing

Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined within an acceptable time limit. A common source for data is a data mart or data warehouse. Pre-processing is essential to analyze the multivariate data sets before data mining. The target set is then cleaned. Data cleaning removes the observations containing noise and those with missing data.
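A minimal cleaning sketch, assuming pandas and a small hypothetical target set (the column names and the plausibility bound are invented for illustration):

import pandas as pd

# Hypothetical target data set assembled from a data mart or warehouse.
df = pd.DataFrame({
    "age":    [23, 45, None, 31, 250],        # 250 is implausible noise
    "income": [40000, 85000, 52000, None, 61000],
})

# Remove observations with missing data...
cleaned = df.dropna()

# ...and observations containing obvious noise (assumed bound on age).
cleaned = cleaned[cleaned["age"].between(0, 120)]
print(cleaned)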

2.3.2 Data mining

Data mining involves six common classes of tasks,[1] a few of which are illustrated in the sketch after this list:

Anomaly detection (outlier/change/deviation detection) – the identification of unusual data records that might be interesting, or data errors that require further investigation.

Association rule learning (dependency modelling) – searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.

Clustering – the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.

Classification – the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as legitimate or as spam.

Regression – attempts to find a function which models the data with the least error.

Summarization – providing a more compact representation of the data set, including visualization and report generation.
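A brief sketch of three of these task classes on synthetic numeric data; scikit-learn, NumPy, and the toy data are assumptions made purely for illustration:

import numpy as np
from sklearn.cluster import KMeans                 # clustering
from sklearn.ensemble import IsolationForest       # anomaly detection
from sklearn.linear_model import LinearRegression  # regression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# Clustering: discover groups without using known structures.
groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Anomaly detection: flag unusual records (-1 marks an outlier).
outliers = IsolationForest(random_state=0).fit_predict(X)

# Regression: find a function modeling the data with least error.
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, y)

print(groups[:5], outliers[:5], reg.coef_)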

2.3.3 Results validation

Data mining can unintentionally be misused and can then produce results which appear to be significant but which do not actually predict future behavior, cannot be reproduced on a new sample of data, and are therefore of little use. Often this results from investigating too many hypotheses and not performing proper statistical hypothesis testing.

A simple version of this problem in machine learning is known as overfitting, but the same problem can arise at different phases of the process, and thus a train/test split – when applicable at all – may not be sufficient to prevent this from happening.

The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by the data mining algorithms are necessarily valid. It is common for the data mining algorithms to find patterns in the training set which are not present in the general data set. This is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set, and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish "spam" from "legitimate" e-mails would be trained on a training set of sample e-mails. Once trained, the learned patterns would be applied to the test set of e-mails on which it had not been trained. The accuracy of the patterns can then be measured from how many e-mails they correctly classify. A number of statistical methods may be used to evaluate the algorithm, such as ROC curves.

If the learned patterns do not meet the desired standards, it is necessary to re-evaluate and change the pre-processing and data mining steps. If the learned patterns do meet the desired standards, then the final step is to interpret the learned patterns and turn them into knowledge.
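As a sketch of this validation step (scikit-learn and the synthetic "spam" data are assumptions; any classifier and scoring method could stand in):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for labeled e-mails (1 = "spam").
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train on the training set only.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate on e-mails the model was not trained on; an AUC near 0.5
# would suggest the learned patterns do not generalize (overfitting).
scores = clf.predict_proba(X_test)[:, 1]
print("test AUC:", roc_auc_score(y_test, scores))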

    2.4 Standards

There have been some efforts to define standards for the data mining process, for example the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). Development on successors to these processes (CRISP-DM 2.0 and JDM 2.0) was active in 2006 but has stalled since. JDM 2.0 was withdrawn without reaching a final draft.

For exchanging the extracted models – in particular for use in predictive analytics – the key standard is the Predictive Model Markup Language (PMML), which is an XML-based language developed by the Data Mining Group (DMG) and supported as an exchange format by many data mining applications. As the name suggests, it only covers prediction models, a particular data mining task of high importance to business applications. However, extensions to cover (for example) subspace clustering have been proposed independently of the DMG.[22]


2.5 Notable uses

See also: Category:Applied data mining.

2.5.1 Games

Since the early 1960s, with the availability of oracles for certain combinatorial games, also called tablebases (e.g. for 3x3-chess with any beginning configuration, small-board dots-and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes, and hex), a new area for data mining has been opened. This is the extraction of human-usable strategies from these oracles. Current pattern recognition approaches do not seem to fully acquire the high level of abstraction required to be applied successfully. Instead, extensive experimentation with the tablebases – combined with an intensive study of tablebase-answers to well designed problems, and with knowledge of prior art (i.e., pre-tablebase knowledge) – is used to yield insightful patterns. Berlekamp (in dots-and-boxes, etc.) and John Nunn (in chess endgames) are notable examples of researchers doing this work, though they were not and are not involved in tablebase generation.

2.5.2 Business

In business, data mining is the analysis of historical business activities, stored as static data in data warehouse databases. The goal is to reveal hidden patterns and trends. Data mining software uses advanced pattern recognition algorithms to sift through large amounts of data to assist in discovering previously unknown strategic business information. Examples of what businesses use data mining for include performing market analysis to identify new product bundles, finding the root cause of manufacturing problems, preventing customer attrition and acquiring new customers, cross-selling to existing customers, and profiling customers with more accuracy.[23]

In today's world raw data is being collected by companies at an exploding rate. For example, Walmart processes over 20 million point-of-sale transactions every day. This information is stored in a centralized database but would be useless without some type of data mining software to analyze it. If Walmart analyzed their point-of-sale data with data mining techniques they would be able to determine sales trends, develop marketing campaigns, and more accurately predict customer loyalty.[24]

Every time a credit card or a store loyalty card is used, or a warranty card is filled in, data is being collected about the user's behavior. Many people find the amount of information stored about us by companies such as Google, Facebook, and Amazon disturbing and are concerned about privacy. Although there is the potential for our personal data to be used in harmful, or unwanted, ways it is also being used to make our lives better. For example, Ford and Audi hope to one day collect information about customer driving patterns so they can recommend safer routes and warn drivers about dangerous road conditions.[25]

Data mining in customer relationship management applications can contribute significantly to the bottom line. Rather than randomly contacting a prospect or customer through a call center or sending mail, a company can concentrate its efforts on prospects that are predicted to have a high likelihood of responding to an offer. More sophisticated methods may be used to optimize resources across campaigns so that one may predict to which channel and to which offer an individual is most likely to respond (across all potential offers). Additionally, sophisticated applications could be used to automate mailing. Once the results from data mining (potential prospect/customer and channel/offer) are determined, this sophisticated application can automatically send an e-mail or regular mail. Finally, in cases where many people will take an action without an offer, "uplift modeling" can be used to determine which people have the greatest increase in response if given an offer. Uplift modeling thereby enables marketers to focus mailings and offers on persuadable people, and not to send offers to people who will buy the product without an offer. Data clustering can also be used to automatically discover the segments or groups within a customer data set.

Businesses employing data mining may see a return on investment, but they also recognize that the number of predictive models can quickly become very large. For example, rather than using one model to predict how many customers will churn, a business may choose to build a separate model for each region and customer type. In situations where a large number of models need to be maintained, some businesses turn to more automated data mining methodologies.

Data mining can be helpful to human resources (HR) departments in identifying the characteristics of their most successful employees. Information obtained – such as the universities attended by highly successful employees – can help HR focus recruiting efforts accordingly. Additionally, Strategic Enterprise Management applications help a company translate corporate-level goals, such as profit and margin share targets, into operational decisions, such as production plans and workforce levels.[26]


Market basket analysis relates to data-mining use in retail sales. If a clothing store records the purchases of customers, a data mining system could identify those customers who favor silk shirts over cotton ones. Although some explanations of relationships may be difficult, taking advantage of them is easier. The example deals with association rules within transaction-based data. Not all data are transaction-based, and logical or inexact rules may also be present within a database.

Market basket analysis has been used to identify the purchase patterns of the Alpha Consumer. Analyzing the data collected on this type of user has allowed companies to predict future buying trends and forecast supply demands.

Data mining is a highly effective tool in the catalog marketing industry. Catalogers have a rich database of the history of their customer transactions for millions of customers, dating back a number of years. Data mining tools can identify patterns among customers and help identify the most likely customers to respond to upcoming mailing campaigns.

Data mining for business applications can be integrated into a complex modeling and decision making process.[27] Reactive business intelligence (RBI) advocates a holistic approach that integrates data mining, modeling, and interactive visualization into an end-to-end discovery and continuous innovation process powered by human and automated learning.[28]

In the area of decision making, the RBI approach has been used to mine knowledge that is progressively acquired from the decision maker, and then self-tune the decision method accordingly.[29] The relation between the quality of a data mining system and the amount of investment that the decision maker is willing to make was formalized by providing an economic perspective on the value of extracted knowledge in terms of its payoff to the organization.[27] This decision-theoretic classification framework[27] was applied to a real-world semiconductor wafer manufacturing line, where decision rules for effectively monitoring and controlling the semiconductor wafer fabrication line were developed.[30]

An example of data mining related to an integrated-circuit (IC) production line is described in the paper "Mining IC Test Data to Optimize VLSI Testing".[31] In this paper, the application of data mining and decision analysis to the problem of die-level functional testing is described. Experiments mentioned demonstrate the ability to apply a system of mining historical die-test data to create a probabilistic model of patterns of die failure. These patterns are then utilized to decide, in real time, which die to test next and when to stop testing. This system has been shown, based on experiments with historical test data, to have the potential to improve profits on mature IC products. Other examples[32][33] of the application of data mining methodologies in semiconductor manufacturing environments suggest that data mining methodologies may be particularly useful when data is scarce, and the various physical and chemical parameters that affect the process exhibit highly complex interactions. Another implication is that on-line monitoring of the semiconductor manufacturing process using data mining may be highly effective.

2.5.3 Science and engineering

In recent years, data mining has been used widely in the areas of science and engineering, such as bioinformatics, genetics, medicine, education and electrical power engineering.

In the study of human genetics, sequence mining helps address the important goal of understanding the mapping relationship between the inter-individual variations in human DNA sequence and the variability in disease susceptibility. In simple terms, it aims to find out how changes in an individual's DNA sequence affect the risk of developing common diseases such as cancer, which is of great importance to improving methods of diagnosing, preventing, and treating these diseases. One data mining method that is used to perform this task is known as multifactor dimensionality reduction.[34]

In the area of electrical power engineering, data mining methods have been widely used for condition monitoring of high voltage electrical equipment. The purpose of condition monitoring is to obtain valuable information on, for example, the status of the insulation (or other important safety-related parameters). Data clustering techniques such as the self-organizing map (SOM) have been applied to vibration monitoring and analysis of transformer on-load tap-changers (OLTCs). Using vibration monitoring, it can be observed that each tap change operation generates a signal that contains information about the condition of the tap changer contacts and the drive mechanisms. Obviously, different tap positions will generate different signals. However, there was considerable variability amongst normal condition signals for exactly the same tap position. SOM has been applied to detect abnormal conditions and to hypothesize about the nature of the abnormalities.[35]


Data mining methods have been applied to dissolved gas analysis (DGA) in power transformers. DGA, as a diagnostics for power transformers, has been available for many years. Methods such as SOM have been applied to analyze generated data and to determine trends which are not obvious to the standard DGA ratio methods (such as Duval Triangle).[35]

In educational research, data mining has been used to study the factors leading students to choose to engage in behaviors which reduce their learning,[36] and to understand the factors influencing university student retention.[37] A similar example of the social application of data mining is its use in expertise finding systems, whereby descriptors of human expertise are extracted, normalized, and classified so as to facilitate the finding of experts, particularly in scientific and technical fields. In this way, data mining can facilitate institutional memory.

Other examples include data mining methods for biomedical data facilitated by domain ontologies,[38] mining clinical trial data,[39] and traffic analysis using SOM.[40]

In adverse drug reaction surveillance, the Uppsala Monitoring Centre has, since 1998, used data mining methods to routinely screen for reporting patterns indicative of emerging drug safety issues in the WHO global database of 4.6 million suspected adverse drug reaction incidents.[41] Recently, similar methodology has been developed to mine large collections of electronic health records for temporal patterns associating drug prescriptions to medical diagnoses.[42]

Data mining has been applied to software artifacts within the realm of software engineering: Mining Software Repositories.

2.5.4 Human rights

Data mining of government records – particularly records of the justice system (i.e., courts, prisons) – enables the discovery of systemic human rights violations in connection to the generation and publication of invalid or fraudulent legal records by various government agencies.[43][44]

2.5.5 Medical data mining

In 2011, the case of Sorrell v. IMS Health, Inc., decided by the Supreme Court of the United States, ruled that pharmacies may share information with outside companies. This practice was authorized under the 1st Amendment of the Constitution, protecting the "freedom of speech".[45] However, the passage of the Health Information Technology for Economic and Clinical Health Act

(HITECH Act) helped to initiate the adoption of the electronic health record (EHR) and supporting technology in the United States.[46] The HITECH Act was signed into law on February 17, 2009 as part of the American Recovery and Reinvestment Act (ARRA) and helped to open the door to medical data mining.[47] Prior to the signing of this law, an estimated 20% of United States-based physicians were utilizing electronic patient records.[46] Søren Brunak notes that "the patient record becomes as information-rich as possible" and thereby "maximizes the data mining opportunities".[46] Hence, electronic patient records further expand the possibilities regarding medical data mining, thereby opening the door to a vast source of medical data analysis.

2.5.6 Spatial data mining

Spatial data mining is the application of data mining methods to spatial data. The end objective of spatial data mining is to find patterns in data with respect to geography. So far, data mining and Geographic Information Systems (GIS) have existed as two separate technologies, each with its own methods, traditions, and approaches to visualization and data analysis. Particularly, most contemporary GIS have only very basic spatial analysis functionality. The immense explosion in geographically referenced data occasioned by developments in IT, digital mapping, remote sensing, and the global diffusion of GIS emphasizes the importance of developing data-driven inductive approaches to geographical analysis and modeling.

Data mining offers great potential benefits for GIS-based applied decision-making. Recently, the task of integrating these two technologies has become of critical importance, especially as various public and private sector organizations possessing huge databases with thematic and geographically referenced data begin to realize the huge potential of the information contained therein. Among those organizations are:

offices requiring analysis or dissemination of geo-referenced statistical data

public health services searching for explanations of disease clustering

environmental agencies assessing the impact of changing land-use patterns on climate change

geo-marketing companies doing customer segmentation based on spatial location.

Challenges in spatial mining: Geospatial data repositories tend to be very large. Moreover, existing GIS datasets are often splintered into feature and attribute components that are conventionally archived in hybrid data management systems. Algorithmic requirements differ substantially for relational (attribute) data management and


for topological (feature) data management.[48] Related to this is the range and diversity of geographic data formats, which present unique challenges. The digital geographic data revolution is creating new types of data formats beyond the traditional "vector" and "raster" formats. Geographic data repositories increasingly include ill-structured data, such as imagery and geo-referenced multi-media.[49]

There are several critical research challenges in geographic knowledge discovery and data mining. Miller and Han[50] offer the following list of emerging research topics in the field:

Developing and supporting geographic data warehouses (GDWs): Spatial properties are often reduced to simple aspatial attributes in mainstream data warehouses. Creating an integrated GDW requires solving issues of spatial and temporal data interoperability, including differences in semantics, referencing systems, geometry, accuracy, and position.

Better spatio-temporal representations in geographic knowledge discovery: Current geographic knowledge discovery (GKD) methods generally use very simple representations of geographic objects and spatial relationships. Geographic data mining methods should recognize more complex geographic objects (i.e., lines and polygons) and relationships (i.e., non-Euclidean distances, direction, connectivity, and interaction through attributed geographic space such as terrain). Furthermore, the time dimension needs to be more fully integrated into these geographic representations and relationships.

Geographic knowledge discovery using diverse data types: GKD methods should be developed that can handle diverse data types beyond the traditional raster and vector models, including imagery and geo-referenced multimedia, as well as dynamic data types (video streams, animation).

    2.5.7 Temporal data mining

Data may contain attributes generated and recorded at different times. In this case, finding meaningful relationships in the data may require considering the temporal order of the attributes. A temporal relationship may indicate a causal relationship, or simply an association.

    2.5.8 Sensor data mining

Wireless sensor networks can be used for facilitating the collection of data for spatial data mining for a variety of applications such as air pollution monitoring.[51] A characteristic of such networks is that nearby sensor nodes monitoring an environmental feature typically register similar values. This kind of data redundancy due to the spatial correlation between sensor observations inspires techniques for in-network data aggregation and mining. By measuring the spatial correlation between data sampled by different sensors, a wide class of specialized algorithms can be developed for more efficient spatial data mining.[52]
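A toy sketch of this idea, assuming NumPy and synthetic sensor readings (the correlation threshold is an invented parameter):

import numpy as np

rng = np.random.default_rng(1)
feature = rng.normal(size=200)  # the environmental feature itself

# Three nearby sensors observe the same feature plus local noise.
readings = np.stack([feature + rng.normal(scale=0.1, size=200)
                     for _ in range(3)])

# Pairwise (spatial) correlation between the sensors' readings.
corr = np.corrcoef(readings)

# If the readings are highly redundant, aggregate them in-network
# and transmit a single value instead of three.
if (corr[np.triu_indices(3, k=1)] > 0.9).all():
    aggregated = readings.mean(axis=0)
    print("aggregated sample:", aggregated[:3].round(2))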

2.5.9 Visual data mining

In the process of turning from analog into digital, large data sets have been generated, collected, and stored, discovering statistical patterns, trends and information which is hidden in data, in order to build predictive patterns. Studies suggest visual data mining is faster and much more intuitive than traditional data mining.[53][54][55] See also Computer vision.

2.5.10 Music data mining

Data mining techniques, and in particular co-occurrence analysis, have been used to discover relevant similarities among music corpora (radio lists, CD databases) for purposes including classifying music into genres in a more objective manner.[56]

2.5.11 Surveillance

Data mining has been used by the U.S. government. Programs include the Total Information Awareness (TIA) program, Secure Flight (formerly known as the Computer-Assisted Passenger Prescreening System (CAPPS II)), Analysis, Dissemination, Visualization, Insight, Semantic Enhancement (ADVISE),[57] and the Multi-state Anti-Terrorism Information Exchange (MATRIX).[58] These programs have been discontinued due to controversy over whether they violate the 4th Amendment to the United States Constitution, although many programs that were formed under them continue to be funded by different organizations or under different names.[59]

In the context of combating terrorism, two particularly plausible methods of data mining are "pattern mining" and "subject-based data mining".

2.5.12 Pattern mining

"Pattern mining" is a data mining method that involves finding existing patterns in data. In this context, patterns often means association rules. The original motivation for searching association rules came from the desire to analyze supermarket transaction data, that is, to examine customer behavior in terms of the purchased products.


For example, an association rule "beer → potato chips (80%)" states that four out of five customers that bought beer also bought potato chips.

In the context of pattern mining as a tool to identify terrorist activity, the National Research Council provides the following definition: "Pattern-based data mining looks for patterns (including anomalous data patterns) that might be associated with terrorist activity – these patterns might be regarded as small signals in a large ocean of noise."[60][61][62] Pattern mining includes new areas such as Music Information Retrieval (MIR), where patterns seen both in the temporal and non-temporal domains are imported to classical knowledge discovery search methods.
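A worked version of the beer/potato-chips rule above: the confidence of the rule is the share of beer-containing transactions that also contain chips. The tiny transaction list is invented for illustration:

# Five transactions; four of the five beer buyers also bought chips.
transactions = [
    {"beer", "chips", "bread"},
    {"beer", "chips"},
    {"beer", "chips", "milk"},
    {"beer", "milk"},
    {"beer", "chips"},
]

with_beer = [t for t in transactions if "beer" in t]
with_both = [t for t in with_beer if "chips" in t]

confidence = len(with_both) / len(with_beer)
print(f"confidence(beer -> chips) = {confidence:.0%}")  # prints 80%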

    2.5.13 Subject-based data mining

"Subject-based data mining" is a data mining method involving the search for associations between individuals in data. In the context of combating terrorism, the National Research Council provides the following definition: "Subject-based data mining uses an initiating individual or other datum that is considered, based on other information, to be of high interest, and the goal is to determine what other persons or financial transactions or movements, etc., are related to that initiating datum."[61]

    2.5.14 Knowledge grid

Knowledge discovery "on the grid" generally refers to conducting knowledge discovery in an open environment using grid computing concepts, allowing users to integrate data from various online data sources, as well as to make use of remote resources, for executing their data mining tasks. The earliest example was the Discovery Net,[63][64] developed at Imperial College London, which won the "Most Innovative Data-Intensive Application Award" at the ACM SC02 (Supercomputing 2002) conference and exhibition, based on a demonstration of a fully interactive distributed knowledge discovery application for a bioinformatics application. Other examples include work conducted by researchers at the University of Calabria, who developed a Knowledge Grid architecture for distributed knowledge discovery, based on grid computing.[65][66]

2.6 Privacy concerns and ethics

While the term "data mining" itself has no ethical implications, it is often associated with the mining of information in relation to people's behavior (ethical and otherwise).[67]

The ways in which data mining can be used can in some cases and contexts raise questions regarding privacy, legality, and ethics.[68] In particular, data mining government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness Program or in ADVISE, has raised privacy concerns.[69][70]

Data mining requires data preparation which can uncover information or patterns which may compromise confidentiality and privacy obligations. A common way for this to occur is through data aggregation. Data aggregation involves combining data together (possibly from various sources) in a way that facilitates analysis (but that also might make identification of private, individual-level data deducible or otherwise apparent).[71] This is not data mining per se, but a result of the preparation of data before – and for the purposes of – the analysis. The threat to an individual's privacy comes into play when the data, once compiled, cause the data miner, or anyone who has access to the newly compiled data set, to be able to identify specific individuals, especially when the data were originally anonymous.[72][73][74]

It is recommended that an individual is made aware of the following before data are collected:[71]

the purpose of the data collection and any (known) data mining projects;

how the data will be used;

who will be able to mine the data and use the data and their derivatives;

the status of security surrounding access to the data;

how collected data can be updated.

Data may also be modified so as to become anonymous, so that individuals may not readily be identified.[71] However, even "de-identified"/"anonymized" data sets can potentially contain enough information to allow identification of individuals, as occurred when journalists were able to find several individuals based on a set of search histories that were inadvertently released by AOL.[75]

2.6.1 Situation in Europe

Europe has rather strong privacy laws, and efforts are underway to further strengthen the rights of the consumers. However, the U.S.-E.U. Safe Harbor Principles currently effectively expose European users to privacy exploitation by U.S. companies. As a consequence of Edward Snowden's global surveillance disclosure, there has been increased discussion to revoke this agreement, as in particular the data will be fully exposed to the National Security Agency, and attempts to reach an agreement have failed.

2.6.2 Situation in the United States

In the United States, privacy concerns have been addressed by the US Congress via the pa