knowledge and patterns
How data mining could (should!) be done!
Copyright 2010, Ontonix S.r.l. All rights reserved. No part of this document may be reproduced in any form without the written consent of Ontonix S.r.l.
KNOWLEDGE FROM PATTERNS
ON COMPLEXITY-BASED DATA MINING
ONTONIX
Disclaimer
The concepts and methods presented in this document are for illustrative purposes only, and are not intended to be exhaustive. Ontonix assumes no liability or responsibility to any person or company for direct or indirect damages resulting from the use of any information contained herein.
Any reproduction or distribution of this document, in whole or in part, without the prior written consent of Ontonix is prohibited.
Reverse-engineering of the concepts, methods or ideas contained in this document is strictly forbidden. The methods described in the present document are protected by US patents.
OntoSpace is a trademark of Ontonix. All other trademarks are the property of their respective owners.
Copyright 2010, Ontonix S.r.l. All Rights Reserved.
The (F)utility of Correlations
Detecting and extracting correlations within data is a key step in data mining
Let's see some examples …
Good correlation – can derive useful information from a simple linear fit
Can still get useful information – but now requires a non-linear fit
What about this case? 10% of data messes up what could have been a nice linear fit!
The meaning of R²
R² is the fraction of the variation in y that is accounted for by the variation in x
R² = 0.31 means that 31% of the variability in y is explained by the variability in x
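As a minimal sketch of this definition, R² of a straight-line fit can be computed as one minus the ratio of unexplained to total variation in y. The data below are synthetic and the function name r_squared is an illustrative choice, not something taken from the slides.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of an ordinary least-squares straight-line fit of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)      # least-squares line
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)             # variation left unexplained by the line
    ss_tot = np.sum((y - y.mean()) ** 2)        # total variation in y
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)                  # fixed seed; synthetic data (assumption)
x = rng.uniform(0.0, 1.0, 500)
y = 2.0 * x + rng.normal(0.0, 0.5, 500)         # hypothetical noisy linear relation
print(f"R^2 = {r_squared(x, y):.2f}")
```

For a straight-line fit with an intercept, this value equals the square of the Pearson correlation coefficient between x and y.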
How useful is simply knowing the value of R² here?
Or here?
And here, where R² is still ≈ 0.6, but the pattern is totally different?
Or here! (where R² ≈ 0, but the information content is not!)
Each plot has a distinct pattern
which conveys specific and useful information that cannot be captured by a single number like R²
Within each plot there are many local effects and gradients which cannot be ignored
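A small synthetic illustration of the point above: in the sketch below y is completely determined by x, yet the linear R² is essentially zero because the pattern is symmetric. The parabola is a stand-in chosen for illustration; it is not one of the plots from the slides.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)   # symmetric grid (synthetic data, an assumption)
y = x ** 2                        # y is fully determined by x

r = np.corrcoef(x, y)[0, 1]       # Pearson correlation coefficient
r2 = r ** 2                       # equals R^2 of a straight-line fit
print(f"R^2 = {r2:.3f}")          # effectively zero despite the clear pattern
```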
“Many systems we need to model are like herds of cattle – they tend to move together but in irregular ways”
Doug Hubbard, The Failure of Risk Management
Complexity Analysis is
A means to capture this information and use it to rank the data (or system)
3 steps in Complexity Analysis
Step 1: Measure the information content in each plot
Step 2: Search for relationships within each plot
Step 3: Repeat for all pair-wise combinations of variables
Data mining using complexity analysis
• Step 1: Measure information content in plots
Step 1: Information content in Plots
• Scatter plots can be “pixelized” into images
– Information in the image is measured via Image Entropy
What is Image Entropy?
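The slide's own definition did not survive in this transcript. One common definition, assumed here, is the Shannon entropy of the pixel occupancy: pixelize the scatter plot into a grid, treat the fraction of points in each cell as a probability, and sum -p log2 p over the cells. The function name image_entropy and the 16 x 16 grid are illustrative choices, not the authors' settings.

```python
import numpy as np

def image_entropy(x, y, bins=16):
    """Shannon entropy (in bits) of a scatter plot pixelized into bins x bins cells."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)  # "pixelize" the scatter plot
    p = counts.ravel() / counts.sum()               # cell occupancy probabilities
    p = p[p > 0]                                    # 0 * log(0) is taken as 0
    return float(-np.sum(p * np.log2(p)))

rng = np.random.default_rng(0)                      # fixed seed; synthetic data
x = rng.uniform(0.0, 1.0, 2000)
scattered = image_entropy(x, rng.uniform(0.0, 1.0, 2000))  # points fill the image
on_a_line = image_entropy(x, x)                            # points fill only the diagonal
print(f"scattered: {scattered:.2f} bits, on a line: {on_a_line:.2f} bits")
```

Under this assumed definition the entropy is bounded by log2 of the number of cells, and a plot whose points concentrate in few cells scores lower than one whose points spread over the whole image.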
Data mining using complexity analysis
• Step 2: For each plot, identify relationships that exist between the two variables
Step 2: From Data to Images and Structure
[Figure: three scatter-plot images, each with axes x and y]
• Chaotic image with little or no structure: no information is exchanged between x and y.
• Image with evident structure (two of the three images): much information is exchanged between x and y.
How can we measure exchanged information?
Using “Mutual Information”
I(X; Y) = H(X) - H(X|Y)
can be read as "the amount of uncertainty in X, minus the amount of uncertainty in X which remains after Y is known", which is equivalent to "the amount of uncertainty in X which is removed by knowing Y".
H is the entropy, i.e. the measure of uncertainty, of each variable
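The reading above (uncertainty in X minus the uncertainty that remains once Y is known) can be estimated from a pixelized scatter plot. The sketch below is a simple plug-in estimate from a 2D histogram, using the chain rule H(X|Y) = H(X,Y) - H(Y). The bin count and the synthetic data are assumptions, and nothing here is claimed to be the estimator used by the authors.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array (zeros are skipped)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(x, y, bins=16):
    """Plug-in estimate of I(X;Y) = H(X) - H(X|Y) from a 2D histogram."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()            # joint distribution over the pixels
    h_x = entropy(p_xy.sum(axis=1))         # H(X), from the marginal of x
    h_y = entropy(p_xy.sum(axis=0))         # H(Y), from the marginal of y
    h_xy = entropy(p_xy.ravel())            # joint entropy H(X,Y)
    h_x_given_y = h_xy - h_y                # chain rule: H(X|Y) = H(X,Y) - H(Y)
    return h_x - h_x_given_y

rng = np.random.default_rng(1)              # fixed seed; synthetic data
x = rng.uniform(-1.0, 1.0, 5000)
y_indep = rng.uniform(-1.0, 1.0, 5000)      # unrelated to x
y_parab = x ** 2                            # structured, but linear R^2 is near 0
print(f"independent: {mutual_information(x, y_indep):.2f} bits")
print(f"parabola:    {mutual_information(x, y_parab):.2f} bits")
```

With a finite sample the estimate for independent variables is slightly above zero rather than exactly zero, which is one reason a threshold is needed before declaring a relationship.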
Data mining using complexity analysis
• Step 3: Repeat Step 1 and Step 2 for every pair of variables (or plots) within the data set
For 3 parameters, we can have 3 plots
For 4 parameters, we can have 6 plots …
For 50 parameters, we can have 1225 plots!
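These counts follow from the number of unordered pairs, n(n - 1)/2. A short sketch that enumerates the pairs and checks the three figures quoted above (the variable names are placeholders):

```python
from itertools import combinations
from math import comb

def pairwise_plots(variable_names):
    """All unordered pairs of variables, i.e. one scatter plot per pair."""
    return list(combinations(variable_names, 2))

for n in (3, 4, 50):
    names = [f"Variable {i}" for i in range(1, n + 1)]   # placeholder names
    assert len(pairwise_plots(names)) == comb(n, 2) == n * (n - 1) // 2
    print(n, "parameters ->", comb(n, 2), "plots")
```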
Step 3: Identifying Structure In Data
[Figure: data table with samples (Sample 1, Sample 2, Sample 3, ...) as rows and variables (Variable 1 to Variable 8, ...) as columns; plot axis labels shown: Variable 3, Variable 5, Variable 2, Variable 1]
IF an image contains sufficient information exchange
THEN create a link between the two variables
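The IF/THEN rule above can be sketched as a loop over all variable pairs that records a link whenever an information measure exceeds a threshold. The slides do not say what counts as "sufficient", so the 0.5-bit threshold, the mutual-information estimator and the function names below are all assumptions made for illustration.

```python
import numpy as np
from itertools import combinations

def mutual_information(x, y, bins=16):
    """Plug-in estimate of I(X;Y) in bits from a 2D histogram (an assumed estimator)."""
    p_xy, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = p_xy / p_xy.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)       # column vector of x marginals
    p_y = p_xy.sum(axis=0, keepdims=True)       # row vector of y marginals
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])))

def link_matrix(data, threshold=0.5, bins=16):
    """data has shape (n_samples, n_variables); returns a symmetric boolean matrix."""
    n_vars = data.shape[1]
    links = np.zeros((n_vars, n_vars), dtype=bool)
    for i, j in combinations(range(n_vars), 2):
        # hypothetical rule: "sufficient information exchange" means MI >= threshold
        if mutual_information(data[:, i], data[:, j], bins) >= threshold:
            links[i, j] = links[j, i] = True
    return links

rng = np.random.default_rng(2)                  # fixed seed; synthetic data
v0 = rng.uniform(0.0, 1.0, 2000)
v1 = v0 + 0.05 * rng.normal(size=2000)          # strongly related to v0
v2 = rng.uniform(0.0, 1.0, 2000)                # unrelated to both
links = link_matrix(np.column_stack([v0, v1, v2]))
print(links.astype(int))
```

The resulting matrix has the variables along the diagonal and one off-diagonal entry per plot, so it can be read directly as a connectivity map.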
Step 3: Repeat for all variables
We can now “map” all the dependencies
– Each parameter is shown along the diagonal
– If a pair of parameters exhibits any relationship, they are connected by a link
– Thus each “plot” is now an off-diagonal link
To develop a global connectivity map that can identify …
• Variables that are inter-related (i.e. have relationships)
• Variables that are related to more than one variable
– Y3 is related to X5, X9 and Y2
• Variables which are NOT involved in any relationships
– X1, X2, X4, X6, X7, Y4
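Once the links are known, the three questions above reduce to simple graph queries. The sketch below uses only the variables and links named on this slide; the complete map from the original figure is not available in this transcript, so the variable list is partial and the function name is hypothetical.

```python
def summarize_map(variables, links):
    """Return neighbours per variable, variables with several links, and isolated ones."""
    neighbours = {v: set() for v in variables}
    for a, b in links:
        neighbours[a].add(b)          # links are undirected,
        neighbours[b].add(a)          # so record both directions
    multi = sorted(v for v, n in neighbours.items() if len(n) > 1)
    isolated = sorted(v for v, n in neighbours.items() if not n)
    return neighbours, multi, isolated

# Partial example taken from the slide text; other variables of the map are unknown.
variables = ["X1", "X2", "X4", "X5", "X6", "X7", "X9", "Y2", "Y3", "Y4"]
links = [("Y3", "X5"), ("Y3", "X9"), ("Y3", "Y2")]

neighbours, multi, isolated = summarize_map(variables, links)
print("related to more than one variable:", multi)
print("not involved in any relationship:", isolated)
```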
Data mining using complexity analysis
Complexity is now defined as a function of
– Number of relationships identified in steps 2 and 3
• More relationships imply
– More intricacy within the data
– Lower controllability
Data mining using complexity analysis
and as a function of
– “Nature” of these relationships
• Some relationships may be well structured, some nearly chaotic, but most are usually in between
• Well structured relationships yield lower complexity than chaotic ones
Results of Complexity Analysis
• Number of relationships
• Total uncertainty
• Current complexity
• Level of complexity at which the system becomes chaotic
• Robustness – the distance to the critical complexity level
Consider this …
• As the number of relationships increases, your ability to control the system decreases
• As the uncertainty within the relationships increases, your ability to predict the system decreases
• In other words, any increase in complexity impacts your ability to control and predict
We just finished Data Mining!
Remember, the objectives of data mining are
• Explore large amounts of data to detect patterns and relationships
• Extract key insights that deliver business value
Complexity Analysis has
• Demonstrated that relationships detected by statistical methods can be of limited utility
– Especially when data is turbulent, chaotic, highly uncertain – in other words complex!
• Transformed the simple and sometimes futile exercise of extracting correlations into
– A comprehensive intelligence-gathering activity
As a next step
• You may
– Identify which parameters are the key drivers of complexity
– Set up early alert triggers if your system (a business, a process, a product…) is becoming less robust