interactive exploration of complex relational data sets in a web - semweb.pro 2012

27
Interactive exploration of complex relational data sets in a web interface Vincent Michel - Logilab SemWeb.pro 2012 - 2 mai 2012 1/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 1/2

Upload: logilab

Post on 30-Jun-2015

301 views

Category:

Documents


0 download

DESCRIPTION

Présentation de Vincent Michel à SemWeb.Pro 2012 http://www.semweb.pro/

TRANSCRIPT

Page 1: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Interactive exploration of complex relational data sets in aweb interface

Vincent Michel - Logilab

SemWeb.pro 2012 - 2 mai 2012

1/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 1 / 27

Page 2: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Table des matières

1 Introduction

2 How to store and query the data ?

3 How to visualize the data ?

4 Conclusion

2/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 2 / 27

Page 3: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Context - Number

Increasing number of databases

Wikipedia - Semantic Web

3/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 3 / 27

Page 4: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Context - Size

Increasing number of HUGE databases

Few examples

Data.bnf.fr : 6.106 rdf triples / 65 Mo (April 2012).

Geonames : 8.106 features / 150.106 rdf triples / 2Gb (March 2012).

Dbpedia : 3.106 things / 19 rdf triples / > 30 Gb (July 2011).

Related issuesStoring data : lots of space required.

Updating data : massive amount of operations.

Querying/Visualization : Scalable tools.

4/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 4 / 27

Page 5: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Context - Format

Increasing number of HUGE databases with DIFFERENT formats

Few examplesGeonames : CSV file.

Data.bnf.fr : separated nt/n3/rdf files.

Dbpedia : (numerous) separated nt/n3 files.

NosDeputes : SQL dumps.

Related issuesStoring data : lots of conversions are required to push data within thesame RDBMS.

Querying data : some links have to be reconstructed from the dumps.

5/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 5 / 27

Page 6: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Context - THE Question

What can we do with THAT ? ? !

How can we query the data ? How can we visualize the data ?

6/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 6 / 27

Page 7: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Table des matières

1 Introduction

2 How to store and query the data ?

3 How to visualize the data ?

4 Conclusion

7/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 7 / 27

Page 8: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Cubicweb framework

A semantic open-source web framework written in Python

An efficient knowledge management systemEntity-relationship data-model : query directly the concepts whileabstracting the physical structure of the underlying database.

RQL (Relational Query Language) : High-level query languageoperating over a relation database (PostgresSQL, MySQL, ...)

Separate query and display : visualize the results in the standard webframework, download them in different formats (JSON, RDF, CSV,...)

Structured database : reconstruct the relations that may exist betweendifferent entities and entity classes, so that queries will be easier (and fasterthan a triple-store).

8/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 8 / 27

Page 9: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Data Model

The schema defines different entity classes with their attributes, as wellas the relationships between the different entity classes

9/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 9 / 27

Page 10: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Querying data

RQL (Relational Query Language)Similar to SPARQL.

Higher-level than SQL, abstracts the details of the tables and the joins.

Example :

Give me 10 locations that have a population greater than 1000000, and thatare in a country named "France"

Any X LIMIT 10 WHERE X is Location,X population > 1000000, X country C, C name "France"

→ A query returns a result set (a list of results), that can be displayed usingviews.

10/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 10 / 27

Page 11: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Viewing dataFull process

Request

Give me 10 locations that have a population greater than 1000000, andthat are in a country named "France"

RQL

Any X LIMIT 10 WHERE X is Location,X population > 1000000, X country C, C name "France"

Standard views : Web framework, JSON, RDF, CSV, ...

http://domain:8080/?rql=my rql&vid=json

11/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 11 / 27

Page 12: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Table des matières

1 Introduction

2 How to store and query the data ?

3 How to visualize the data ?

4 Conclusion

12/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 12 / 27

Page 13: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Goals

We have databases with complex structure

→ Building an interface to visualize queries/perform datamining over arelational database

Why ?Each visualization has a corresponding url → results are addressable,linkable and shareable.

Using Javascript-based visualization : d3.js, protovis, mapstraction, ...

Provide more complex views, based on datamining procedures.

13/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 13 / 27

Page 14: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

A two steps process

Pivot data structure : numerical arrays, JSON

ProcessorConvert result set → pivot structure.

The processor defines some criteria for the mathematical processing.

Array viewFor numerical array, we construct Array views, associated to somevisualizations or datamining processes.

Display the arrays as graphs, histograms, maps...

14/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 14 / 27

Page 15: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Visualization pathway

PostgreSQLdatabase

Result Set

HTML, RDF,JSON, CSV

...

ViewQuery

Result Set

DATAMININGRESULT:

HTML, RDF,JSON, CSV, ...

Array-View

Processor

Query Numpy array

Standard CubicWeb pathway

Dataminingcube

pathway

15/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 15 / 27

Page 16: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Overview of some possibilities

One can combine almost any Processor with any Array view

Processorsattr-asfloat turns all Int, BigInt and Float attributes in the result set tofloats, and returns the corresponding array.

undirected-rel creates a binary numpy array that represents the relations(or corelations) between entities.

attr-astoken turns all String attributes in the result set in a numpy array,based on a Word-n-gram analyze.

Array viewsHistogram, scatterplot, matrix, force-directed graph, dendrogram, ... → eachview has a default processor.

16/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 16 / 27

Page 17: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Example of combination

Query : 20 couples (name, id) of parliamentarians, in alphabetical order

Create tokens from names (processor : attr-astoken), and view tokens in table(array-view : array-table)

http://mondomain.vm:port/?rql= Any X, N ORDERBY N ASCLIMIT 20 WHERE X is Parlementaire, X nom N&arid=attr-astoken&vid=array-table

17/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 17 / 27

Page 18: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Changing the view...

Create tokens from names (processor : attr-astoken), and now, view thenumber of occurence of tokens using a histogram (array-view : protovis-hist)

http://mondomain.vm:port/?rql= Any X, N ORDERBY N ASCLIMIT 20 WHERE X is Parlementaire, X nom N&arid=attr-astoken&vid=protovis-hist

18/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 18 / 27

Page 19: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Histogram visualization - NosDeputes

Number of comments by parliamentary

Processor : attr-asfloat, Array view : protovis-hist

19/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 19 / 27

Page 20: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Scatterplot visualization 1 - Geonames

All couples (latitude, longitude) of the locations in France, with an elevation notnull

Processor : attr-asfloat, Array view : protovis-scatterplot

20/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 20 / 27

Page 21: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Scatterplot visualization 2 - Geonames

All couples (latitude, longitude) of 50000 locations in France, with a populationhigher than 100 (inhabitants)

Processor : attr-asfloat, Array view : protovis-scatterplot

21/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 21 / 27

Page 22: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Matrix visualization - Geonames

All neighbour countries in North America

Processor : undirected-rel, Array view : protovis-binarymap

22/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 22 / 27

Page 23: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Graph visualization - Geonames

All neighbour countries in the World, and their corresponding continent

Processor : undirected-rel, Array view : protovis-forcedirected

23/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 23 / 27

Page 24: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Applying datamining processes

Another kind of processor based on the Numpy Array

All couples (latitude, longitude) of the locations in France, with an elevationhigher than 1

Processor : attr-asfloat, Array view : protovis-scatterplot, Dataminingprocess : Kmeans

24/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 24 / 27

Page 25: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Table des matières

1 Introduction

2 How to store and query the data ?

3 How to visualize the data ?

4 Conclusion

25/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 25 / 27

Page 26: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Visualization in CubicWeb

Easy to interactively process and visualize datasets.

Visualizations directly applied on queries.

All visualizations are defined by an unique URL !

Featuresquick interactive exploratory visualization/datamining processings.

different processors and views for more flexibility.

based on the pivto structure (JSON, numerical arrays).

Use jointly with facets to filter results.

26/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 26 / 27

Page 27: Interactive exploration of complex relational data sets in a web - SemWeb.Pro 2012

Conclusion

Future developmentsMore datamining procedures (unsupervised learning, clustering, ...).

Including more Javascript libraries.

Use sparse arrays for better performance.

Questions ?

27/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 27 / 27