TRANSCRIPT
Data analysis and regular decomposition: background on an EU proposal
by Hannu Reittu, VTT (Self-organizing networks, BA 1144)
ULTRA: “ULTimate Regularity – Applications of laws on large structures”
Starting point: Szemerédi’s Regularity Lemma (SRL)
Random bipartite graph: draw links independently, each with probability p.
Very simple object!
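This "very simple object" can be sketched in a few lines. A minimal NumPy illustration (the function name and side sizes are my own, not from the slides):

```python
import numpy as np

def random_bipartite(n_left, n_right, p, rng=None):
    """Draw each of the n_left x n_right possible links independently
    with probability p, returned as a 0/1 biadjacency matrix."""
    rng = np.random.default_rng(rng)
    return (rng.random((n_left, n_right)) < p).astype(int)

A = random_bipartite(200, 300, 0.1, rng=0)
# For large sides, the empirical link density concentrates around p.
density = A.mean()
```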
SRL: any (large enough) graph can be decomposed into a bounded number of subgraphs, such that pairs of subgraphs are almost like random bipartite graphs.
SRL is central for graph theory (and not only there). What about large matrices in data analysis?
It indicates a 'clustering' that reveals the structure of a large matrix, and it is computable, even at extremely large scales (from a sample).
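"Almost like a random bipartite graph" means the edge density between a pair of parts stays nearly the same on large sub-pairs. A crude sampling probe of this (my own sketch, not the proposal's algorithm; names and parameters are assumptions):

```python
import numpy as np

def pair_density(A, rows, cols):
    """Edge density of the bipartite pair (rows, cols) in a 0/1 matrix A."""
    return A[np.ix_(rows, cols)].mean()

def sampled_regularity_gap(A, rows, cols, frac=0.5, trials=20, rng=None):
    """Compare the densities of random sub-pairs (each taking a `frac`
    fraction of the rows/cols) against the full pair density; return
    the largest deviation observed. Small gap = 'random-like' pair."""
    rng = np.random.default_rng(rng)
    d = pair_density(A, rows, cols)
    gap = 0.0
    for _ in range(trials):
        r = rng.choice(rows, size=max(1, int(frac * len(rows))), replace=False)
        c = rng.choice(cols, size=max(1, int(frac * len(cols))), replace=False)
        gap = max(gap, abs(pair_density(A, r, c) - d))
    return gap
```

For a genuinely random bipartite block the gap stays small; a structured block (e.g. a hidden dense corner) produces a large gap.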
• We have started such a program:
• A p2p network
• A real matrix (synthetic)
• Working algorithms: 'regular decomposition'
Example: segmentation of households based on electric smart-meter readings (per half hour).
Columns: half hours of the week. Rows: different households. Elements: average (over several months) power consumption.
Regular decomposition of the rows into 10 groups (1, 2, ..., 10).
[Figure: the household × half-hour matrix with its rows reordered into the 10 regular-decomposition groups]
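As an illustrative stand-in for the row segmentation above (an assumption: the slides do not show VTT's regular-decomposition algorithm, which optimises an information-theoretic criterion; here a plain Lloyd-style k-means over the rows is used instead):

```python
import numpy as np

def cluster_rows(X, k=10, n_iter=50, rng=None):
    """Segment the rows of X (households x half-hours of average
    consumption) into k groups with Lloyd-style k-means.
    This is only a stand-in for 'regular decomposition'."""
    rng = np.random.default_rng(rng)
    # Initialise group centres from k distinct random rows.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each row (household) to its nearest group centre.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centres; keep the old centre if a group went empty.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```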
Status:
1. An employee
2. Self-employed with employees
3. Self-employed with no employees
4. Unemployed, actively seeking work
5. Unemployed, not actively seeking work
6. Retired
7. Carer: looking after relative/family
[Figure: bar chart of P(status) over the seven status categories]
A meaningful segmentation.
ULTRA?
Budget of the call: 36 M Eur. ULTRA: 3-4 M Eur over 3 years.
Big Data analytics
• Graph-based approach
• Regular decomposition + other graph compression techniques
• Associations with pattern recognition and machine learning
• Companies play a central role: problems, business-grade implementations of algorithms, end users of the methodology
• Work with real data, creating real applications and business opportunities
Some research tasks:
– Segmentation based on data
– Quantitative division between bulk data and borderline cases
– Temporal aspects of data: detecting and predicting the changes
– Simple models from data -> planning and generating possible future scenarios
– Implementing algorithms on parallel computation platforms
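The "bulk data vs. borderline cases" task admits a simple quantitative reading (my own interpretation, not a definition from the proposal): given a segmentation, flag rows whose distance to their own group centre exceeds a chosen quantile.

```python
import numpy as np

def split_bulk_borderline(X, labels, centers, q=0.9):
    """One possible reading of 'bulk vs. borderline' (an assumption,
    not the proposal's definition): rows farther from their own group
    centre than the q-quantile distance are flagged as borderline."""
    dist = np.linalg.norm(X - centers[labels], axis=1)
    threshold = np.quantile(dist, q)
    borderline = dist > threshold
    return ~borderline, borderline
```

The quantile q is a tunable knob: raising it shrinks the borderline set toward the most atypical rows.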