knowledge and patterns

Copyright 2010, Ontonix S.r.l. All rights reserved. No part of this document may be reproduced in any form without the written consent of Ontonix S.r.l.

Upload: david-wilson

Post on 20-May-2015

DESCRIPTION

How data mining could (should!) be done!

TRANSCRIPT

Page 1: Knowledge And Patterns

KNOWLEDGE FROM PATTERNS
ON COMPLEXITY-BASED DATA MINING

ONTONIX

Page 2: Knowledge And Patterns

Disclaimer

The concepts and methods presented in this document are for illustrative purposes only, and are not intended to be exhaustive. Ontonix assumes no liability or responsibility to any person or company for direct or indirect damages resulting from the use of any information contained herein.

Any reproduction or distribution of this document, in whole or in part, without the prior written consent of Ontonix is prohibited.

Reverse-engineering of the concepts, methods or ideas contained in this document is strictly forbidden. The methods described in the present document are protected by US patents.

OntoSpace is a trademark of Ontonix. All other trademarks are the property of their respective owners.

Copyright 2010, Ontonix S.r.l. All Rights Reserved.

Page 3: Knowledge And Patterns

The (F)utility of Correlations

Detecting and extracting correlations within data is a key step in data mining

Let's see some examples …

Page 4: Knowledge And Patterns

Good correlation – can derive useful information from a simple linear fit

Page 5: Knowledge And Patterns

Can still get useful information – but now requires a non-linear fit

Page 6: Knowledge And Patterns

What about this case? 10% of data messes up what could have been a nice linear fit!

Page 7: Knowledge And Patterns

The meaning of R²

R² is the fraction of the variation in y that is accounted for by the variation in x

R² = 0.31 means that 31% of the variability in y is explained by the variability in x
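To make the definition concrete, here is a minimal sketch of how R² can be computed for a straight-line fit. It assumes an ordinary least-squares fit via NumPy; the function name `r_squared` and the synthetic data are illustrative and do not come from the slides.

```python
import numpy as np

def r_squared(x, y):
    """Coefficient of determination of a least-squares straight-line fit of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)      # fit y = slope * x + intercept
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)             # variation left unexplained by the fit
    ss_tot = np.sum((y - y.mean()) ** 2)        # total variation in y
    return 1.0 - ss_res / ss_tot

# Synthetic data (assumption): the same line, without and with heavy noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 500)
y_clean = 2.0 * x + 1.0
y_noisy = y_clean + rng.normal(0.0, 8.0, x.size)

print("R^2, clean line:", r_squared(x, y_clean))
print("R^2, noisy line:", r_squared(x, y_noisy))
```

The noisy case returns a value well below 1 even though the underlying relationship is the same line, which is the sense in which R² reports only the explained share of the variability.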

Page 8: Knowledge And Patterns

How useful is simply knowing the value of R² here?

Page 9: Knowledge And Patterns

Or here?

Page 10: Knowledge And Patterns

And here, where R² is still ≈ 0.6, but the pattern is totally different?

Page 11: Knowledge And Patterns

Or here, where R² ≈ 0 but the information content is not!

Page 12: Knowledge And Patterns

Each plot has a distinct pattern, which conveys specific and useful information that cannot be captured by a single number like R²

Within each plot there are many local effects and gradients which cannot be ignored
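The point that structure can be invisible to R² is easy to reproduce. The sketch below uses two synthetic patterns (a circle and a parabola, both chosen here for illustration and not taken from the slides) and computes R² as the squared Pearson correlation, which equals the R² of a straight-line fit.

```python
import numpy as np

def linear_r_squared(x, y):
    """R^2 of a straight-line fit, computed as the squared Pearson correlation."""
    return float(np.corrcoef(x, y)[0, 1] ** 2)

# Pattern 1 (assumption): points evenly spaced on a circle.
t = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
circle_x, circle_y = np.cos(t), np.sin(t)

# Pattern 2 (assumption): a parabola sampled symmetrically around zero.
parab_x = np.linspace(-1.0, 1.0, 401)
parab_y = parab_x ** 2

print("circle   R^2:", linear_r_squared(circle_x, circle_y))
print("parabola R^2:", linear_r_squared(parab_x, parab_y))
```

In both cases y is strongly constrained by x, yet the linear R² is essentially zero.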

Page 13: Knowledge And Patterns

“Many systems we need to model are like herds of cattle – they tend to move together but in irregular ways”

Doug Hubbard, The Failure of Risk Management

Page 14: Knowledge And Patterns

Complexity Analysis is

A means to capture this information and use it to rank the data (or system)

Page 15: Knowledge And Patterns

3 steps in Complexity Analysis

Step 1: Measure the information content in each plot
Step 2: Search for relationships within each plot
Step 3: Repeat for all pair-wise combinations of variables

Page 16: Knowledge And Patterns

Data mining using complexity analysis

• Step 1: Measure information content in plots

Page 17: Knowledge And Patterns

Step 1: Information content in Plots

• Scatter plots can be “pixelized” into images
– Information in the image is measured via Image Entropy
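A minimal sketch of this step, under stated assumptions: the scatter plot is rasterized with a 2-D histogram (one cell per pixel), and "image entropy" is taken to be the Shannon entropy of the normalized pixel counts. The slides do not give the exact definition used, so the grid size, the function names `pixelize` and `image_entropy`, and the test data are all illustrative.

```python
import numpy as np

def pixelize(x, y, bins=32):
    """Rasterize a scatter plot: count the points that fall in each pixel."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    return counts

def image_entropy(image):
    """Shannon entropy, in bits, of the normalized pixel counts (assumed definition)."""
    p = image[image > 0] / image.sum()          # empty pixels contribute nothing
    return float(-np.sum(p * np.log2(p)))

# Synthetic data (assumption): a structureless cloud versus points on the line y = x.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 5000)
cloud = pixelize(x, rng.uniform(0.0, 1.0, 5000))
line = pixelize(x, x)

print("entropy of cloud:", image_entropy(cloud))
print("entropy of line: ", image_entropy(line))
```

With this definition the diffuse cloud spreads its points over many pixels and scores a higher entropy than the sharply structured line.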

Page 18: Knowledge And Patterns

What is Image Entropy?
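The body of this slide did not survive in the transcript. As a hedged placeholder, one common definition (an assumption here, not necessarily the one used in the deck) treats the image as N pixels, lets p_i be the share of the total intensity carried by pixel i, and takes the Shannon entropy:

```latex
H = -\sum_{i=1}^{N} p_i \log_2 p_i ,
\qquad p_i \ge 0 ,
\qquad \sum_{i=1}^{N} p_i = 1
```

Under this definition H is 0 when all the intensity sits in one pixel and reaches its maximum, \log_2 N, when the intensity is spread evenly over all pixels.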

Page 19: Knowledge And Patterns

Data mining using complexity analysis

• Step 2: For each plot, identify relationships that exist between the two variables

Page 20: Knowledge And Patterns

Step 2: From Data to Images and Structure

[Figure: three pixelized scatter plots of y against x]

Chaotic image with little or no structure: no information is exchanged between x and y.

Image with evident structure: much information is exchanged between x and y. (Caption shown on two of the plots.)

How can we measure exchanged information?

Page 21: Knowledge And Patterns

Using “Mutual Information”

I(X;Y) = H(X) − H(X|Y) can be read as "the amount of uncertainty in X, minus the amount of uncertainty in X which remains after Y is known", which is equivalent to "the amount of uncertainty in X which is removed by knowing Y".

H is the entropy, i.e. the measure of uncertainty, of each variable
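As an illustration, mutual information can be estimated from the same pixelized plot. The sketch below is an assumption-laden example, not the deck's implementation: it uses a plain histogram estimator with a hypothetical 16 × 16 grid and the identity I(X;Y) = H(X) + H(Y) − H(X,Y), which is equivalent to H(X) − H(X|Y).

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability array (zero cells are skipped)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(x, y, bins=16):
    """Histogram estimate of I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()                # joint distribution over pixels
    p_x = p_xy.sum(axis=1)                      # marginal of x
    p_y = p_xy.sum(axis=0)                      # marginal of y
    return entropy_bits(p_x) + entropy_bits(p_y) - entropy_bits(p_xy)

# Synthetic data (assumption): y independent of x versus y fully determined by x.
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, 20000)
mi_independent = mutual_information(x, rng.uniform(-1.0, 1.0, 20000))
mi_dependent = mutual_information(x, x ** 2)

print("independent:", mi_independent)
print("y = x^2:    ", mi_dependent)
```

Note that the parabola, whose linear R² is close to zero, shows a large mutual information, while the independent pair stays near zero (the small positive value is estimator bias from finite samples).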

Page 22: Knowledge And Patterns

Data mining using complexity analysis

• Step 3: Repeat Step 1 and Step 2 for every pair of variables (or plots) within the data set

For 3 parameters, we can have 3 plots
For 4 parameters, we can have 6 plots …
For 50 parameters, we can have 1225 plots!
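These counts follow from the number of unordered pairs of n variables, n(n − 1)/2. A short sketch that checks the figures and enumerates the pairs (the helper name `variable_pairs` is illustrative):

```python
from itertools import combinations
from math import comb

def variable_pairs(names):
    """All unordered pairs of variables, i.e. one entry per scatter plot."""
    return list(combinations(names, 2))

for n in (3, 4, 50):
    print(n, "parameters ->", comb(n, 2), "plots")

print(variable_pairs(["A", "B", "C"]))
```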

Page 23: Knowledge And Patterns

Step 3: Identifying Structure In Data

[Figure: a data table with one row per sample (Sample 1, Sample 2, Sample 3, ...) and one column per variable (Variable 1 to Variable 8, ...), and a diagram labelled Variable 3, Variable 5, Variable 2 and Variable 1]

IF an image contains sufficient information exchange
THEN create a link between two variables
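The IF/THEN rule can be sketched as a loop over all variable pairs. The slides do not say what counts as "sufficient" information exchange, so the 0.5-bit threshold, the histogram estimator and the function names below are assumptions made purely for illustration.

```python
from itertools import combinations

import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram estimate of the mutual information I(X;Y), in bits."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)       # column vector of x marginals
    p_y = p_xy.sum(axis=0, keepdims=True)       # row vector of y marginals
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])))

def link_matrix(data, threshold=0.5, bins=16):
    """Boolean matrix with True where a pair of columns exchanges enough information.

    `data` has one row per sample and one column per variable.
    `threshold` (bits) is a hypothetical cut-off, not a value from the slides.
    """
    n_vars = data.shape[1]
    links = np.zeros((n_vars, n_vars), dtype=bool)
    for i, j in combinations(range(n_vars), 2):
        if mutual_information(data[:, i], data[:, j], bins) >= threshold:
            links[i, j] = links[j, i] = True
    return links

# Synthetic data (assumption): column 1 depends on column 0, column 2 is unrelated.
rng = np.random.default_rng(3)
a = rng.uniform(-1.0, 1.0, 5000)
data = np.column_stack([a, a ** 2, rng.uniform(-1.0, 1.0, 5000)])
links = link_matrix(data)
print(links)
```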

Page 24: Knowledge And Patterns

Step 3: Repeat for all variables

We can now “map” all the dependencies
– Each parameter is shown along the diagonal
– If a pair of parameters exhibits any relationship, they are connected by a link
– Thus each “plot” is now an off-diagonal link

Page 25: Knowledge And Patterns

To develop a global connectivity map that can identify …

• Variables that are inter-related (i.e. have relationships)

• Variables that are related to more than one variable
– Y3 is related to X5, X9 and Y2

• Variables which are NOT involved in any relationships
– X1, X2, X4, X6, X7, Y4
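Once the links are known, these three questions reduce to counting neighbours. The sketch below encodes only the variables and links named on this slide; the full map in the original figure is not recoverable from the transcript, so the lists are deliberately partial.

```python
# Variables and links named on the slide; anything else in the figure is unknown.
variables = ["X1", "X2", "X4", "X5", "X6", "X7", "X9", "Y2", "Y3", "Y4"]
links = [("Y3", "X5"), ("Y3", "X9"), ("Y3", "Y2")]

# Build an undirected neighbour table from the link list.
neighbours = {name: set() for name in variables}
for a, b in links:
    neighbours[a].add(b)
    neighbours[b].add(a)

related = sorted(v for v, n in neighbours.items() if n)           # in at least one link
hubs = sorted(v for v, n in neighbours.items() if len(n) > 1)      # in several links
isolated = sorted(v for v, n in neighbours.items() if not n)       # in no link

print("related: ", related)
print("hubs:    ", hubs)
print("isolated:", isolated)
```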

Page 26: Knowledge And Patterns

Data mining using complexity analysis

Complexity is now defined as a function of
– Number of relationships identified in steps 2 and 3

• More relationships imply
– More intricacy within the data
– Lower controllability

Page 27: Knowledge And Patterns

Data mining using complexity analysis

and as a function of
– “Nature” of these relationships

• Some relationships may be well structured, some nearly chaotic, but most are usually in between

• Well structured relationships yield lower complexity than chaotic ones
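The slides do not give the formula that combines these two ingredients. Purely to illustrate the stated behaviour (more links and more chaotic links both raise complexity), here is a toy stand-in that sums a per-link entropy value. It is an assumption made for this example and is not Ontonix's complexity measure.

```python
def toy_complexity(link_entropies):
    """Toy score: sum of per-link entropies in bits (illustrative stand-in only)."""
    return float(sum(link_entropies))

# Hypothetical per-link entropies: low values = well structured, high = near chaotic.
few_structured = toy_complexity([2.0, 2.5])
many_structured = toy_complexity([2.0, 2.5, 2.2, 1.8])
few_chaotic = toy_complexity([6.0, 6.5])

print(few_structured, many_structured, few_chaotic)
```

Any score that grows with both the number of links and their disorder would reproduce the qualitative claims on these two slides; the sum is just the simplest such choice.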

Page 28: Knowledge And Patterns

Results of Complexity Analysis

• Number of relationships
• Total uncertainty
• Current complexity
• Level of complexity at which the system becomes chaotic
• Robustness – the distance to the critical complexity level
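One simple way to read "distance to the critical complexity level" is as a relative margin. The sketch below assumes that reading; the normalization by the critical value and the function name `robustness` are choices made here, not taken from the slides.

```python
def robustness(current, critical):
    """Relative distance from the current complexity to the critical level.

    Returns 1.0 for zero complexity and 0.0 when the critical level is reached.
    Assumed definition, for illustration only.
    """
    if critical <= 0:
        raise ValueError("critical complexity must be positive")
    return (critical - current) / critical

# Hypothetical values: current complexity 6.0 against a critical level of 10.0.
print(robustness(6.0, 10.0))
```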

Page 29: Knowledge And Patterns

Consider this …
• As the number of relationships increases, your ability to control the system decreases
• As the uncertainty within the relationships increases, your ability to predict the system decreases
• In other words, any increase in complexity impacts your ability to control and predict

Page 30: Knowledge And Patterns

We just finished Data Mining!

Remember, the objectives of data mining are

• Explore large amounts of data to detect patterns and relationships
• Extract key insights that deliver business value

Page 31: Knowledge And Patterns

Complexity Analysis has

• Demonstrated that relationships detected by statistical methods can be of limited utility
– Especially when data is turbulent, chaotic, highly uncertain – in other words, complex!
• Transformed the simple and sometimes futile exercise of extracting correlations into
– A comprehensive intelligence-gathering activity

Page 32: Knowledge And Patterns

As a next step

• You may
– Identify which parameters are the key drivers of complexity
– Set up early alert triggers if your system (a business, a process, a product…) is becoming less robust