semnews - euroscipy 2011

A semantic news aggregator in Python using Dbpedia, Cubicweb and Scikits-learn. Vincent Michel - Logilab, 27 August 2011

Upload: vincent-michel

Post on 17-Jan-2017


Page 1: Semnews - Euroscipy 2011

A semantic news aggregator in Python using Dbpedia, Cubicweb and Scikits-learn

Vincent Michel - Logilab

27 August 2011

Context

With a bunch of RSS news

Google Borrows Apple Strategy: Google's deal underscores the allure of a business model pioneered ...

Japan disaster plant cold shutdown could face delay: TOKYO (Reuters) - Tokyo Electric Power Co said on Wednesday ...

Libya shows signs of slipping from Muammar Gaddafi's grasp: Supply lines to capital in peril as coastal cities fall ...

how can we analyze them in Python?

→ Clustering (grouping) RSS (e.g. Google News)

→ Extracting/synthesizing information

→ Providing semantic, useful/original visualisation and analytics tools

Overview

1. Introduction

2. Extracting information from RSS news

3. Analyzing RSS news

Semantic

[Figure: Linking Open Data cloud diagram, as of September 2010. Datasets are grouped by colour: Media, Geographic, Publications, Government, Cross-domain, Life sciences, User-generated content.]

"Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/"

Tools

Fetching, storing and querying the data → CubicWeb (*)

✓ Semantic CMS: high-level database management with metadata
✓ Multiple sources: RSS, micro-blogging, SQL database
✓ Based on PostgreSQL: deals with (very) large databases

Data-mining and machine learning → Scikits-learn (*)

✓ Easy-to-use and general-purpose machine learning in Python
✓ Unsupervised learning, supervised learning, model selection

Semantic information database → Dbpedia (*)

✓ ~8 × 10^6 articles with abstracts and images from Wikipedia
✓ ~0.6 × 10^6 categories, 273 types (e.g. person, place, ...)
✓ ~100 × 10^6 links between articles, categories and types

(*) Open Source/Creative Commons

Storing RSS information with Cubicweb

Each object (or entity) is defined in a schema and may be displayed using different views

Storing RSS

Define (in schema.py) how RSS should be stored in the database

    from yams.buildobjs import EntityType, String

    class RSSArticle(EntityType):
        title = String()            # Title of the feed
        uri = String(unique=True)   # Uri of the rss feed
        content = String()          # Content of the feed

Fetching RSS (based on feedparser and BeautifulSoup)

Simply construct a source by giving a URL and a parser

    # `session` is the CubicWeb session available in the calling context
    url = u'http://feeds.bbci.co.uk/news/world/rss.xml'
    s = session.create_entity('CWSource', name=u'BBCNews-World', url=url,
                              type=u'datafeed', parser=u'rss-parser',
                              config=u'synchronization-interval=240min')
    s.pull_data(session)

7 English/American journals (The New York Times, ...)
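As a rough illustration of what an RSS parser has to produce for the RSSArticle schema above, the sketch below reads a feed with the standard library only. It is a stand-in, not the feedparser/BeautifulSoup-based parser used by the application; the sample feed, its URLs and the helper name parse_rss are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed, inlined so the example needs no network access.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Google Borrows Apple Strategy</title>
      <link>http://example.org/news/1</link>
      <description>First example summary.</description>
    </item>
    <item>
      <title>Libya rebels fight</title>
      <link>http://example.org/news/2</link>
      <description>Second example summary.</description>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Return one dict per <item>, keyed by the RSSArticle attribute names."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "title": item.findtext("title", default=""),
            "uri": item.findtext("link", default=""),
            "content": item.findtext("description", default=""),
        })
    return articles


articles = parse_rss(SAMPLE_FEED)
for article in articles:
    print(article["uri"], "-", article["title"])
```

Each dict maps one-to-one onto the title, uri and content attributes, so the unique constraint on uri is what would prevent the same item from being stored twice on the next synchronization.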

Storing Dbpedia information with Cubicweb

Storing Dbpedia page

Define (in schema.py) how Dbpedia pages should be stored in the database

    class DbpediaPage(EntityType):
        uri = String(unique=True, indexed=True)  # Uri of the resource
        label = String(indexed=True)  # http://www.w3.org/2000/01/rdf-schema#
        pageid = String()             # http://dbpedia.org/ontology/wikiPageID
        abstract = String()           # http://dbpedia.org/ontology/abstract
        homepage = String()           # http://xmlns.com/foaf/0.1/homepage
        thumbnail = String()          # http://dbpedia.org/ontology/thumbnail
        depiction = String()          # http://xmlns.com/foaf/0.1/depiction
        wikipage = String()           # http://xmlns.com/foaf/0.1/page
        latitude = String()           # http://www.w3.org/2003/01/geo/wgs84_pos#
        longitude = String()          # http://www.w3.org/2003/01/geo/wgs84_pos#

Storing all Dbpedia information (~9 × 10^6 pages, ~100 × 10^6 links, 20 GB) in Cubicweb takes less than 24 hours

→ See the full schema

Analyzing RSS news

What can we do with the RSS news stored in the database?

1. extract relevant features of the data

2. construct a usable (i.e. matrix) representation of the data

3. cluster (group) RSS together

4. deeper semantic analysis and visualization of the information

Example sentence

"Google is to buy mobile phone manufacturer Motorola Mobility, allowing it to mount a serious challenge to Apple Inc."


Extracting information: Classical approaches (Scikits-learn)

Char-N-gram

Extracts features of N characters from a text

    from scikits.learn.feature_extraction.text import CharNGramAnalyzer
    analyzer = CharNGramAnalyzer(min_n=3, max_n=6)
    features = analyzer.analyze(sentence)

450 features, e.g. 'goo', 'oog', 'ogl', 'mob', 'obi', 'bil'

Word-N-gram

Extracts features of N words from a text

    from scikits.learn.feature_extraction.text import WordNGramAnalyzer
    analyzer = WordNGramAnalyzer(min_n=3, max_n=6)
    features = analyzer.analyze(sentence)

58 features, e.g. 'google is to', 'is to buy', 'serious challenge to'

✗ Many irrelevant features (tokens)
✗ Features do not carry much contextual information (i.e. understandable by humans)
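The two classical extractors above can be approximated in a few lines of plain Python, which makes it easier to see why they produce so many features. This is only a sketch: the tokenisation rule (lowercase, keep tokens of two or more word characters) is an assumption, and the helper names char_ngrams and word_ngrams are not part of Scikits-learn.

```python
import re

sentence = ("Google is to buy mobile phone manufacturer Motorola Mobility, "
            "allowing it to mount a serious challenge to Apple Inc.")


def char_ngrams(text, min_n=3, max_n=6):
    """All character substrings of length min_n..max_n of the lowercased text."""
    text = text.lower()
    return {text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)}


def word_ngrams(text, min_n=3, max_n=6):
    """All runs of min_n..max_n consecutive tokens, joined by single spaces."""
    # Assumption: a token is a run of two or more word characters.
    tokens = re.findall(r"\w\w+", text.lower())
    return {" ".join(tokens[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(tokens) - n + 1)}


char_features = char_ngrams(sentence)
word_features = word_ngrams(sentence)
print(len(char_features), "character n-grams")
print(len(word_features), "word n-grams")
```

Under this tokenisation the example sentence yields 58 word n-grams, the figure quoted above. The character n-gram count depends on preprocessing details the slides do not give, so it should not be expected to match 450 exactly.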

Feature extraction: Dbpedia - Context

Main hypothesis: only things that exist in Dbpedia (i.e. Wikipedia) have some interest in news analysis → Named Entities Recognition (NER)

Dbpedia feature extraction (Cubicweb/Dbpedia)

    from cubes.semnews.views.nertools import DbpediaEntitiesAnalyzer
    analyzer = DbpediaEntitiesAnalyzer(session, lang='en')
    tokens = analyzer.analyze(sentence)

→ 3 features: 'Apple Inc', 'Google', 'Motorola'

<ENAMEX TYPE="ORGANIZATION">Google</ENAMEX> is to buy mobile phone manufacturer <ENAMEX TYPE="ORGANIZATION">Motorola</ENAMEX> Mobility, allowing it to mount a serious challenge to <ENAMEX TYPE="ORGANIZATION">Apple Inc</ENAMEX>

→ Try it

"DBpedia Spotlight: Shedding Light on the Web of Documents", Pablo N. Mendes et al., I-Semantics 2011

"Learning Named Entity Recognition from Wikipedia", Joel Nothman, 2008

"Large-Scale Named Entity Disambiguation Based on Wikipedia Data", Silviu Cucerzan, 2007
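The idea behind the main hypothesis, keeping only word sequences that match a known label, can be sketched as a greedy longest-match lookup. This is not the implementation of DbpediaEntitiesAnalyzer; the label set below is a tiny hypothetical stand-in for the indexed Dbpedia label table, and extract_entities is a made-up helper.

```python
import re

# Hypothetical stand-in for the indexed Dbpedia label table.
KNOWN_LABELS = {"Google", "Motorola", "Apple Inc", "Larry Page"}


def extract_entities(sentence, labels, max_words=3):
    """Greedy longest-match lookup of word sequences in a set of labels."""
    words = re.findall(r"\w+", sentence)
    found = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in labels:
                found.append(candidate)
                i += n
                break
        else:
            i += 1  # no label starts at this word
    return sorted(set(found))


sentence = ("Google is to buy mobile phone manufacturer Motorola Mobility, "
            "allowing it to mount a serious challenge to Apple Inc.")
print(extract_entities(sentence, KNOWN_LABELS))
```

On the example sentence this returns the same three features as above, and every word that matches no label is simply dropped, which is where the reduction in dimensionality comes from.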

Feature extraction: Dbpedia - Properties

Efficient and robust feature extraction

Keep the meaning of a text → interpretable features
✗ 'said former soldier Larry'
✗ 'said student Larry'
✓ 'said Larry Page'

Robust features based on redirections, e.g. 'Obama', 'Barak Obamba', 'Pres. Obama' redirect to 'Barack Obama'

Simple RQL (Relation Query Language) queries

Fast: based on indexed SQL tables and regular expressions

    rset = rql('Any E WHERE E is DbpediaPage, E label %(token)s',
               {'token': token})

e.g. 19 entities extracted in 4 s from 765 words, among ~8 × 10^6 Dbpedia entries

Different labels but same URI → cross-language feature extraction, e.g. Grenada/Grenade → http://dbpedia.org/resource/Grenada
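The redirection and cross-language properties both come down to mapping several surface forms onto a single URI. A minimal sketch, assuming two small in-memory tables in place of the database (the table contents, the language tags and the resolve helper are illustrative only):

```python
# Hypothetical miniature of the redirect and label tables.
REDIRECTS = {
    "Obama": "Barack Obama",
    "Barak Obamba": "Barack Obama",
    "Pres. Obama": "Barack Obama",
}
LABEL_TO_URI = {
    ("en", "Barack Obama"): "http://dbpedia.org/resource/Barack_Obama",
    ("en", "Grenada"): "http://dbpedia.org/resource/Grenada",
    ("fr", "Grenade"): "http://dbpedia.org/resource/Grenada",
}


def resolve(token, lang="en"):
    """Map a surface form to a URI, following one redirect if there is one."""
    label = REDIRECTS.get(token, token)
    return LABEL_TO_URI.get((lang, label))


print(resolve("Pres. Obama"))
print(resolve("Grenade", lang="fr") == resolve("Grenada", lang="en"))
```

Because the feature is the URI rather than the label, an English and a French news item about the same entity end up with the same feature.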


Feature extraction: Matrix representation and storage

    Obama's approval     →  0 0 1 0
    Protest as Spain     →  0 0 0 1
    Libya rebels fight   →  0 1 0 0
    Obama in Spain       →  0 0 1 1

E.g. the Obama-Merkel-Sarkozy space

Results stored in a relation (appears_in_rss) in Cubicweb

    rql('Any X WHERE X appears_in_rss Y')
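Building the occurrence matrix from the extracted entities is a simple indexing exercise: one row per news item, one column per entity, and a 1 wherever the entity appears. The sketch below uses plain lists; the headline-to-entity assignment is an assumed reading of the example above, and occurrence_matrix is a hypothetical helper, not the OccurrenceMatrix class of the application.

```python
# Hypothetical output of the entity extraction step: headline -> entities.
news_entities = {
    "Obama's approval": ["Obama"],
    "Protest as Spain": ["Spain"],
    "Libya rebels fight": ["Libya"],
    "Obama in Spain": ["Obama", "Spain"],
}


def occurrence_matrix(news_entities):
    """Return (row labels, column labels, binary matrix as lists of 0/1)."""
    rows = list(news_entities)
    columns = sorted({e for ents in news_entities.values() for e in ents})
    index = {entity: j for j, entity in enumerate(columns)}
    matrix = [[0] * len(columns) for _ in rows]
    for i, headline in enumerate(rows):
        for entity in news_entities[headline]:
            matrix[i][index[entity]] = 1
    return rows, columns, matrix


rows, columns, matrix = occurrence_matrix(news_entities)
print(columns)
for headline, row in zip(rows, matrix):
    print(row, headline)
```

Two news items that mention the same entities get identical rows, which is what the clustering step relies on.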

Exploiting information: Clustering news

Creating clusters (groups) of news using the matrix representation of the data

Meanshift algorithm (Scikits-learn)

Based on locating the maxima of a density function

Automatically tunes the number of clusters

    from scikits.learn.cluster import MeanShift, estimate_bandwidth
    bandwidth = estimate_bandwidth(X, quantile=0.005)
    clustering = MeanShift(bandwidth=bandwidth)
    clustering.fit(X)
    labels = clustering.labels_

"Mean shift, mode seeking, and clustering", Yizong Cheng, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1995

→ Try it

Other possible alternatives: Ward's algorithm, K-means
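To make "locating the maxima of a density function" concrete, here is a small pure-Python mean shift with a flat kernel. It is a teaching sketch, not the Scikits-learn implementation: the bandwidth is given by hand rather than estimated, and the rule used to merge nearby modes (closer than half the bandwidth) is an assumption.

```python
import math


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift. Returns (modes, labels)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def shift(p):
        # Move p to the mean of its neighbours until it stops moving.
        for _ in range(max_iter):
            neighbours = [q for q in points if dist(p, q) <= bandwidth]
            new_p = tuple(sum(c) / len(neighbours) for c in zip(*neighbours))
            if dist(p, new_p) < tol:
                return new_p
            p = new_p
        return p

    modes, labels = [], []
    for p in points:
        m = shift(tuple(p))
        for i, existing in enumerate(modes):
            if dist(m, existing) < bandwidth / 2:  # same density maximum
                labels.append(i)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return modes, labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
modes, labels = mean_shift(points, bandwidth=2.0)
print(labels)
```

The number of clusters is never passed in: it falls out of how many distinct modes the points converge to, which is the property the slide highlights.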

Exploiting information: Queries and Views

Each query returns a result set (rset). A view is called on a result set → defines the representation rules

Defining a view (short version)

    from cubicweb.view import View

    class MyEntitiesView(View):
        __regid__ = 'example-view'

        def call(self):
            rset = self.cw_rset
            for entity in rset.entities():
                # do whatever you want, e.g. write some HTML
                pass

You just have to plug in a rset from a query

    rset = rql('Any X WHERE ...')
    self.wview('example-view', rset)

or within a URL

    http://myapplication/?rql=Any X WHERE ...&vid=example-view
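When the query is passed in a URL it has to be percent-encoded, since RQL contains spaces and commas. A small sketch using the standard library; the base URL comes from the example above, while the query text and the view_url helper are assumptions made for illustration.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def view_url(base, rql, vid):
    """Build a URL carrying an RQL query and a view identifier."""
    return base + "?" + urlencode({"rql": rql, "vid": vid})


url = view_url("http://myapplication/",
               "Any X WHERE X is RSSArticle",  # hypothetical query
               "example-view")
print(url)

# Round trip: the query string decodes back to the original values.
params = parse_qs(urlsplit(url).query)
print(params["rql"][0], "|", params["vid"][0])
```

The same result set can then be rendered differently just by changing the vid parameter.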

Plugging Scikits-learn in a view

Combine machine learning tools and database query system in pure Python

Defining the view

    class RSSClustering(View):
        __regid__ = 'rss-clustering'

        def call(self):
            # Get the results set and create the data
            rset = self.cw_rset
            occurrence_matrix = OccurrenceMatrix()
            occurrence_matrix.construct_matrix(rset=rset)
            # Scale the data
            X = scale(occurrence_matrix, axis=0, with_std=True, copy=True)
            # Compute bandwidth for clustering
            bandwidth = estimate_bandwidth(X)
            # Clustering
            cluster = MeanShift(bandwidth=bandwidth)
            cluster.fit(X)
            labels = cluster.labels_
            # Perform some HTML rendering

→ Try it
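The "scale the data" step in the view standardises each column of the occurrence matrix before clustering. A pure-Python sketch of that operation, assuming the population standard deviation and leaving constant columns at zero (scale_columns is a hypothetical helper, and the input matrix is a made-up example):

```python
import math


def scale_columns(matrix):
    """Centre each column on 0 and divide it by its standard deviation."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    scaled = [[0.0] * n_cols for _ in matrix]
    for j in range(n_cols):
        column = [row[j] for row in matrix]
        mean = sum(column) / n_rows
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / n_rows)
        for i, v in enumerate(column):
            # Assumption: constant columns stay at 0 instead of dividing by 0.
            scaled[i][j] = (v - mean) / std if std > 0 else 0.0
    return scaled


X = scale_columns([[0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 1]])
for row in X:
    print([round(v, 2) for v in row])
```

After scaling, a rarely mentioned entity weighs as much in the distance computation as a frequently mentioned one.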

A new approach for querying information from RSS

All musical artists in the news

    rql('DISTINCT Any E, R WHERE E appears_in_rss R, '
        'E has_type T, T label "musical artist"')

All living office holder persons in the news

    rql('DISTINCT Any E WHERE E appears_in_rss R, '
        'E has_type T, T label "office holder", '
        'E has_subject C, C label "Living people"')

All news that talk about Barack Obama and any scientist

    rql('DISTINCT Any R WHERE E1 label "Barack Obama", '
        'E1 appears_in_rss R, E2 appears_in_rss R, '
        'E2 has_type T, T label "scientist"')

All news that talk about a drug

    rql('Any X, R WHERE X appears_in_rss R, '
        'X has_type T, T label "drug"')

Try it with an xml view or a thumbnail view
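Each of these queries is a join over two relations, appears_in_rss and has_type. The sketch below reproduces the logic of the "Barack Obama and any scientist" query with plain Python sets, to show what the shared variable R expresses. The data is entirely made up and the helper names are hypothetical; in the application the database does this work.

```python
# Hypothetical toy data standing in for the database relations.
appears_in_rss = {           # entity label -> set of news items
    "Barack Obama": {"news1", "news2"},
    "Stephen Hawking": {"news2", "news3"},
    "Aspirin": {"news3"},
}
has_type = {                 # entity label -> set of type labels
    "Barack Obama": {"office holder"},
    "Stephen Hawking": {"scientist"},
    "Aspirin": {"drug"},
}


def news_about(label):
    """News items in which the entity with this label appears."""
    return appears_in_rss.get(label, set())


def news_about_type(type_label):
    """News items in which at least one entity of this type appears."""
    return {news
            for entity, types in has_type.items() if type_label in types
            for news in appears_in_rss.get(entity, set())}


# "All news that talk about Barack Obama and any scientist"
both = news_about("Barack Obama") & news_about_type("scientist")
print(sorted(both))
```

Using the same variable R for both entities in the RQL query corresponds to the set intersection here: a news item qualifies only if both conditions hold for it.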

Visualization: Mapping information

View

    class EntityMapView(View):
        __regid__ = 'map'

        def call(self):
            rset = self.cw_rset
            self.init_map()
            for entity in rset.entities():
                self.add_marker(entity.latitude, entity.longitude,
                                entity.dc_title())
            self.center_and_zoom(0, 0, 1, 5)
            self.finish_map()

Based on mapstraction (JavaScript): http://mapstraction.com

✓ Automatically locate information from RSS news

Try it

Conclusion

Cubicweb and Scikits-learn

Efficient and easy-to-use tools for storing, querying and mining data

Easy to plug together: only Python tools, all Open Source

Using Dbpedia makes it possible to extract a small number of highly relevant features

✓ Decreases the dimensionality of the data

✓ Links features to millions of pages of information

A new semantic way of querying information

✓ Simple information queries using RQL expressions

✓ Use Dbpedia types and categories to refine the selection

Future Improvements

Named Entities Recognition

→ use disambiguation links, more refined regular expressions

→ add new databases (MusicBrainz, diseasome, ...)

→ closely follow Wikipedia with Dbpedia live update

Motorola Mobility (11:46, 15 August 2011): 'On the 15 August, Google announced that it agreed to acquire the company'

RSS news analysis

→ explore new algorithms: bi-clustering

→ add new data sources: Twitter, blogs

→ 'from scikits.learn predict __future__' → use matrix completion to predict new edges in the correlation graph

Thank you for your attention

Questions?

Page 3: Semnews - Euroscipy 2011

Overview

1 Introduction

2 Extracting information from RSS news

3 Analyzing RSS news

Semantic

As of September 2010

MusicBrainz

(zitgist)

P20

YAGO

World Fact-book (FUB)

WordNet (W3C)

WordNet(VUA)

VIVO UFVIVO

Indiana

VIVO Cornell

VIAF

URIBurner

Sussex Reading

Lists

Plymouth Reading

Lists

UMBEL

UK Post-codes

legislationgovuk

Uberblic

UB Mann-heim

TWC LOGD

Twarql

transportdatagov

uk

totlnet

Tele-graphis

TCMGeneDIT

TaxonConcept

The Open Library (Talis)

t4gm

Surge Radio

STW

RAMEAU SH

statisticsdatagov

uk

St Andrews Resource

Lists

ECS South-ampton EPrints

Semantic CrunchBase

semanticweborg

SemanticXBRL

SWDog Food

rdfabout US SEC

Wiki

UNLOCODE

Ulm

ECS (RKB

Explorer)

Roma

RISKS

RESEX

RAE2001

Pisa

OS

OAI

NSF

New-castle

LAAS

KISTIJISC

IRIT

IEEE

IBM

Eureacutecom

ERA

ePrints

dotAC

DEPLOY

DBLP (RKB

Explorer)

Course-ware

CORDIS

CiteSeer

Budapest

ACM

riese

Revyu

researchdatagov

uk

referencedatagov

uk

Recht-spraak

nl

RDFohloh

LastFM (rdfize)

RDF Book

Mashup

PSH

ProductDB

PBAC

Pokeacute-peacutedia

Ord-nance Survey

Openly Local

The Open Library

OpenCyc

OpenCalais

OpenEI

New York

Times

NTU Resource

Lists

NDL subjects

MARC Codes List

Man-chesterReading

Lists

Lotico

The London Gazette

LOIUS

lobidResources

lobidOrgani-sations

LinkedMDB

LinkedLCCN

LinkedGeoData

LinkedCT

Linked Open

Numbers

lingvoj

LIBRIS

Lexvo

LCSH

DBLP (L3S)

Linked Sensor Data (Knoesis)

Good-win

Family

Jamendo

iServe

NSZL Catalog

GovTrack

GESIS

GeoSpecies

GeoNames

GeoLinkedData(es)

GTAA

STITCHSIDER

Project Guten-berg (FUB)

MediCare

Euro-stat

(FUB)

DrugBank

Disea-some

DBLP (FU

Berlin)

DailyMed

Freebase

flickr wrappr

Fishes of Texas

FanHubz

Event-Media

EUTC Produc-

tions

Eurostat

EUNIS

ESD stan-dards

Popula-tion (En-AKTing)

NHS (EnAKTing)

Mortality (En-

AKTing)Energy

(En-AKTing)

CO2(En-

AKTing)

educationdatagov

uk

ECS South-ampton

Gem Norm-datei

datadcs

MySpace(DBTune)

MusicBrainz

(DBTune)

Magna-tune

John Peel(DB

Tune)

classical(DB

Tune)

Audio-scrobbler (DBTune)

LastfmArtists

(DBTune)

DBTropes

dbpedia lite

DBpedia

Pokedex

Airports

NASA (Data Incu-bator)

MusicBrainz(Data

Incubator)

Moseley Folk

Discogs(Data In-cubator)

Climbing

Linked Data for Intervals

Cornetto

Chronic-ling

America

Chem2Bio2RDF

bizdata

govuk

UniSTS

UniRef

UniPath-way

UniParc

Taxo-nomy

UniProt

SGD

Reactome

PubMed

PubChem

PRO-SITE

ProDom

Pfam PDB

OMIM

OBO

MGI

KEGG Reaction

KEGG Pathway

KEGG Glycan

KEGG Enzyme

KEGG Drug

KEGG Cpd

InterPro

HomoloGene

HGNC

Gene Ontology

GeneID

GenBank

ChEBI

CAS

Affy-metrix

BibBaseBBC

Wildlife Finder

BBC Program

mesBBC

Music

rdfaboutUS Census

Media

Geographic

Publications

Government

Cross-domain

Life sciences

User-generated content

ldquoLinking Open Data cloud diagram by Richard Cyganiak and Anja Jentzschhttp lod-cloudnetrdquo

Tools

Fetching, storing and querying the data → CubicWeb (*)

✓ Semantic CMS: high-level database management with metadata
✓ Multiple sources: RSS, micro-blogging, SQL databases
✓ Based on PostgreSQL: deals with (very) large databases

Data-mining and machine learning → Scikits-learn (*)

✓ Easy-to-use and general-purpose machine learning in Python
✓ Unsupervised learning, supervised learning, model selection

Semantic information database → Dbpedia (*)

✓ ~8×10^6 articles with abstracts and images from Wikipedia
✓ ~0.6×10^6 categories, 273 types (e.g. person, place, ...)
✓ ~100×10^6 links between articles, categories and types

(*) Open Source / Creative Commons

Storing RSS information with Cubicweb

Each object (or entity) is defined in a schema and may be displayed using different views.

Storing RSS

Define (in schema.py) how RSS should be stored in the database:

    from yams.buildobjs import EntityType, String

    class RSSArticle(EntityType):
        title = String()            # Title of the feed
        uri = String(unique=True)   # URI of the RSS feed
        content = String()          # Content of the feed

Fetching RSS (based on feedparser and BeautifulSoup)

Simply construct a source by giving a URL and a parser:

    url = u'http://feeds.bbci.co.uk/news/world/rss.xml'
    s = session.create_entity('CWSource', name=u'BBCNews-World', url=url,
                              type=u'datafeed', parser=u'rss-parser',
                              config=u'synchronization-interval=240min')
    s.pull_data(session)

7 English and American news sources (The New York Times, ...)
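To make the mapping from a feed to the RSSArticle attributes concrete, here is a minimal sketch using only the standard library. It is not the CubicWeb rss-parser (which relies on feedparser and BeautifulSoup); the sample feed, its example.org links and the `parse_rss` helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed; real feeds carry more fields per item.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Google Borrows Apple Strategy</title>
      <link>http://example.org/news/1</link>
      <description>Google's deal underscores the allure of a business model</description>
    </item>
    <item>
      <title>Japan disaster plant cold shutdown could face delay</title>
      <link>http://example.org/news/2</link>
      <description>Tokyo Electric Power Co said on Wednesday</description>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Return one dict per <item>, keyed like the RSSArticle attributes."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "title": item.findtext("title", default=""),
            "uri": item.findtext("link", default=""),
            "content": item.findtext("description", default=""),
        })
    return articles


articles = parse_rss(SAMPLE_RSS)
for article in articles:
    print(article["uri"], "-", article["title"])
```

Each dictionary could then be passed to something like `session.create_entity('RSSArticle', **article)`, assuming the schema shown above.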

Storing Dbpedia information with Cubicweb

Storing Dbpedia pages

Define (in schema.py) how Dbpedia pages should be stored in the database:

    class DbpediaPage(EntityType):
        uri = String(unique=True, indexed=True)  # URI of the resource
        label = String(indexed=True)  # http://www.w3.org/2000/01/rdf-schema#
        pageid = String()             # http://dbpedia.org/ontology/wikiPageID
        abstract = String()           # http://dbpedia.org/ontology/abstract
        homepage = String()           # http://xmlns.com/foaf/0.1/homepage
        thumbnail = String()          # http://dbpedia.org/ontology/thumbnail
        depiction = String()          # http://xmlns.com/foaf/0.1/depiction
        wikipage = String()           # http://xmlns.com/foaf/0.1/page
        latitude = String()           # http://www.w3.org/2003/01/geo/wgs84_pos#
        longitude = String()          # http://www.w3.org/2003/01/geo/wgs84_pos#

Storing all Dbpedia information (~9×10^6 pages, ~100×10^6 links, 20 GB) in Cubicweb takes less than 24 hours.

→ See the full schema

Analyzing RSS news

What can we do with the RSS news stored in the database?

1. extract relevant features from the data
2. construct a usable (i.e. matrix) representation of the data
3. cluster (group) RSS news together
4. perform deeper semantic analysis and visualization of the information

Example sentence

"Google is to buy mobile phone manufacturer Motorola Mobility, allowing it to mount a serious challenge to Apple Inc"

Overview

1. Introduction
2. Extracting information from RSS news
3. Analyzing RSS news

Extracting information: Classical approaches (Scikits-learn)

Char-N-gram

Extracts features of N characters from a text:

    from scikits.learn.feature_extraction.text import CharNGramAnalyzer
    analyzer = CharNGramAnalyzer(min_n=3, max_n=6)
    features = analyzer.analyze(sentence)

→ 450 features: 'goo', 'oog', 'ogl', 'mob', 'obi', 'bil', ...

Word-N-gram

Extracts features of N words from a text:

    from scikits.learn.feature_extraction.text import WordNGramAnalyzer
    analyzer = WordNGramAnalyzer(min_n=3, max_n=6)
    features = analyzer.analyze(sentence)

→ 58 features: 'google is to', 'is to buy', 'serious challenge to', ...

✗ Many irrelevant features (tokens)
✗ Features carry little contextual information (i.e. they are not understandable by humans)
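The two analyzers can be approximated in a few lines of plain Python, which makes it easier to see where the feature counts come from. This is a sketch, not the Scikits-learn implementation: the helper names are hypothetical, and the tokenization rule (lowercase, drop one-character words) is an assumption. With that assumption the example sentence yields 58 word n-grams, the count quoted above.

```python
import re


def char_ngrams(text, min_n=3, max_n=6):
    """All character n-grams of the lowercased text, for n in [min_n, max_n]."""
    text = text.lower()
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams


def word_ngrams(text, min_n=3, max_n=6):
    """All word n-grams, after dropping one-character tokens (an assumption)."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


sentence = ("Google is to buy mobile phone manufacturer Motorola Mobility, "
            "allowing it to mount a serious challenge to Apple Inc")

words = word_ngrams(sentence)
print(len(words))   # 58 with this tokenization
print(words[:2])    # ['google is to', 'is to buy']
print(char_ngrams("Google", 3, 3))  # ['goo', 'oog', 'ogl', 'gle']
```

Most of these n-grams ('is to buy', 'it to mount') say nothing about the topic of the news item, which is the weakness the Dbpedia-based extraction addresses.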

Feature extraction: Dbpedia - Context

Main hypothesis: only things that exist in Dbpedia (i.e. Wikipedia) are of interest for news analysis → Named Entity Recognition (NER)

Dbpedia feature extraction (Cubicweb/Dbpedia)

    from cubes.semnews.views.nertools import DbpediaEntitiesAnalyzer
    analyzer = DbpediaEntitiesAnalyzer(session, lang='en')
    tokens = analyzer.analyze(sentence)

→ 3 features: 'Apple Inc', 'Google', 'Motorola'

    <ENAMEX TYPE="ORGANIZATION">Google</ENAMEX> is to buy mobile phone
    manufacturer <ENAMEX TYPE="ORGANIZATION">Motorola</ENAMEX> Mobility, allowing it to
    mount a serious challenge to <ENAMEX TYPE="ORGANIZATION">Apple Inc</ENAMEX>

→ Try it

"DBpedia Spotlight: Shedding Light on the Web of Documents", Pablo N. Mendes et al., I-Semantics 2011.
"Learning Named Entity Recognition from Wikipedia", Joel Nothman, 2008.
"Large-Scale Named Entity Disambiguation Based on Wikipedia Data", Silviu Cucerzan, 2007.

Feature extraction: Dbpedia - Properties

Efficient and robust feature extraction

Keeps the meaning of a text → interpretable features
✗ 'said former soldier Larry'
✗ 'said student Larry'
✓ 'said Larry Page'

Robust features based on redirections, e.g. 'Obama', 'Barak Obamba', 'Pres. Obama' redirect to 'Barack Obama'

Simple RQL (Relation Query Language) queries

Fast: based on indexed SQL tables and regular expressions

    rset = rql('Any E WHERE E is DbpediaPage, E label %(token)s',
               {'token': token})

e.g. 19 entities extracted in 4 s from a 765-word text, among ~8×10^6 Dbpedia entries

Different labels but same URI → cross-language feature extraction, e.g. Grenada / Grenade → http://dbpedia.org/resource/Grenada
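The lookup idea can be sketched without a database: scan the text for the longest word sequence that matches a known label, and map every label (including redirections and labels in other languages) to a single URI. The code below is an illustration under stated assumptions, not the semnews DbpediaEntitiesAnalyzer: the `LABEL_TO_URI` table is a hypothetical hand-written excerpt, and `extract_entities` is a hypothetical helper that replaces the indexed RQL query with a dictionary lookup.

```python
import re

# Hypothetical excerpt of a label -> URI table. Several labels (redirections,
# labels in other languages) may point to the same URI.
LABEL_TO_URI = {
    "google": "http://dbpedia.org/resource/Google",
    "motorola": "http://dbpedia.org/resource/Motorola",
    "apple inc": "http://dbpedia.org/resource/Apple_Inc.",
    "barack obama": "http://dbpedia.org/resource/Barack_Obama",
    "obama": "http://dbpedia.org/resource/Barack_Obama",
    "grenada": "http://dbpedia.org/resource/Grenada",
    "grenade": "http://dbpedia.org/resource/Grenada",  # French label
}


def extract_entities(text, labels, max_words=3):
    """Greedy longest-match lookup of word sequences in the label table."""
    words = re.findall(r"\w+", text.lower())
    found = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in labels:
                found.append(labels[candidate])
                i += n
                break
        else:
            i += 1  # no label starts at this word
    return found


sentence = ("Google is to buy mobile phone manufacturer Motorola Mobility, "
            "allowing it to mount a serious challenge to Apple Inc")
print(extract_entities(sentence, LABEL_TO_URI))
print(extract_entities("Pres. Obama visits Grenade", LABEL_TO_URI))
```

Because features are URIs rather than surface strings, 'Obama' and 'Barack Obama' end up in the same column of the occurrence matrix.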

Overview

1. Introduction
2. Extracting information from RSS news
3. Analyzing RSS news

Feature extraction: Matrix representation and storage

Each news item becomes one row of an occurrence matrix, with one column per extracted entity:

    Obama's approval      →   0 0 1 0
    Protest as Spain      →   0 0 0 1
    Libya rebels fight    →   0 1 0 0
    Obama in Spain        →   0 0 1 1

E.g. the Obama-Merkel-Sarkozy space

Results are stored in a relation (appears_in_rss) in Cubicweb:

    rql('Any X WHERE X appears_in_rss Y')
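As an illustration of the matrix representation, the sketch below builds a binary occurrence matrix from a list of entity sets. The column vocabulary and the entity sets assigned to each headline are assumptions chosen for the example, and `occurrence_matrix` is a hypothetical helper, not the semnews OccurrenceMatrix class.

```python
def occurrence_matrix(news_entities, vocabulary):
    """One row per news item, one column per entity; 1 if the entity occurs."""
    column = {entity: j for j, entity in enumerate(vocabulary)}
    matrix = []
    for entities in news_entities:
        row = [0] * len(vocabulary)
        for entity in entities:
            if entity in column:  # ignore entities outside the vocabulary
                row[column[entity]] = 1
        matrix.append(row)
    return matrix


# Assumed column order and assumed entities per headline.
vocabulary = ["Merkel", "Libya", "Obama", "Spain"]
news_entities = [
    {"Obama"},           # "Obama's approval"
    {"Spain"},           # "Protest as Spain"
    {"Libya"},           # "Libya rebels fight"
    {"Obama", "Spain"},  # "Obama in Spain"
]

X = occurrence_matrix(news_entities, vocabulary)
for row in X:
    print(row)
```

In the application the entity sets would come from the appears_in_rss relation rather than from hand-written sets.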

Exploiting information: Clustering news

Creating clusters (groups) of news using the matrix representation of the data

Meanshift algorithm (Scikits-learn)

Based on locating the maxima of a density function

Automatically tunes the number of clusters

    from scikits.learn.cluster import MeanShift, estimate_bandwidth
    bandwidth = estimate_bandwidth(X, quantile=0.005)
    clustering = MeanShift(bandwidth=bandwidth)
    clustering.fit(X)
    labels = clustering.labels_

"Mean shift, mode seeking, and clustering", Yizong Cheng, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1995.

→ Try it

Other possible alternatives: Ward's algorithm, K-means
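To show what "locating the maxima of a density function" means in practice, here is a small pure-Python mean shift with a flat kernel. It is a teaching sketch, not the Scikits-learn implementation: the `mean_shift` function, the hand-picked bandwidth and the toy points are all assumptions. Each point is repeatedly moved to the mean of its neighbours until it stops moving, and points that end up at the same mode share a cluster label, so the number of clusters is never given explicitly.

```python
import math


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift. Returns (modes, labels)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def shift(p):
        # Mean of all data points within `bandwidth` of p.
        neighbours = [q for q in points if dist(p, q) <= bandwidth]
        return tuple(sum(c) / len(neighbours) for c in zip(*neighbours))

    # Move every point uphill until it converges to a mode.
    converged = []
    for p in points:
        current = tuple(p)
        for _ in range(max_iter):
            nxt = shift(current)
            if dist(current, nxt) < tol:
                break
            current = nxt
        converged.append(current)

    # Merge converged positions that are close to each other.
    modes, labels = [], []
    for c in converged:
        for k, m in enumerate(modes):
            if dist(c, m) < bandwidth / 2:
                labels.append(k)
                break
        else:
            modes.append(c)
            labels.append(len(modes) - 1)
    return modes, labels


# Two well-separated toy groups of points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
modes, labels = mean_shift(points, bandwidth=1.0)
print(len(modes), labels)
```

The bandwidth plays the role that `estimate_bandwidth` fills in the snippet above: a larger value merges groups, a smaller one splits them.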

Exploiting information: Queries and Views

Each query returns a result set (rset). A view is called on a result set → it defines the representation rules.

Defining a view (short version)

    from cubicweb.view import View

    class MyEntitiesView(View):
        __regid__ = 'example-view'

        def call(self):
            rset = self.cw_rset
            for entity in rset.entities():
                pass  # do whatever you want here, e.g. write some HTML

You just have to plug in a rset from a query:

    rset = rql('Any X WHERE ...')
    self.wview('example-view', rset)

or use it within a URL:

    http://myapplication/?rql=Any X WHERE ...&vid=example-view
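When the query is passed in a URL, it has to be percent-encoded. A minimal sketch with the standard library follows; the `view_url` helper and the concrete query `Any X WHERE X is RSSArticle` are assumptions made for illustration, while the `rql` and `vid` parameter names come from the URL form shown above.

```python
from urllib.parse import urlencode


def view_url(base_url, rql, vid):
    """Build a URL that applies the view `vid` to the result of `rql`."""
    return base_url + "?" + urlencode({"rql": rql, "vid": vid})


url = view_url("http://myapplication/",
               "Any X WHERE X is RSSArticle",
               "example-view")
print(url)
# http://myapplication/?rql=Any+X+WHERE+X+is+RSSArticle&vid=example-view
```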

Plugging Scikits-learn into a view

Combine machine learning tools and the database query system in pure Python

Defining the view

    class RSSClustering(View):
        __regid__ = 'rss-clustering'

        def call(self):
            # Get the result set and create the data
            rset = self.cw_rset
            occurrence_matrix = OccurrenceMatrix()
            occurrence_matrix.construct_matrix(rset=rset)
            # Scale the data
            X = scale(occurrence_matrix, axis=0, with_std=True, copy=True)
            # Compute bandwidth for clustering
            bandwidth = estimate_bandwidth(X)
            # Clustering
            cluster = MeanShift(bandwidth=bandwidth)
            cluster.fit(X)
            labels = cluster.labels_
            # Perform some HTML rendering

→ Try it

A new approach for querying information from RSS

All musical artists in the news

    rql('DISTINCT Any E, R WHERE E appears_in_rss R, '
        'E has_type T, T label "musical artist"')

All living office holders in the news

    rql('DISTINCT Any E WHERE E appears_in_rss R, '
        'E has_type T, T label "office holder", '
        'E has_subject C, C label "Living people"')

All news that talk about Barack Obama and any scientist

    rql('DISTINCT Any R WHERE E1 label "Barack Obama", '
        'E1 appears_in_rss R, E2 appears_in_rss R, '
        'E2 has_type T, T label "scientist"')

All news that talk about a drug

    rql('Any X, R WHERE X appears_in_rss R, '
        'X has_type T, T label "drug"')

Try it with an xml view or a thumbnail view
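The semantics of these queries can be illustrated with plain Python sets standing in for the two relations. This is only a toy model under stated assumptions: the relation contents, the news identifiers and the helper functions are invented for the example and say nothing about how CubicWeb evaluates RQL.

```python
# Hypothetical contents of the two relations used by the queries.
appears_in_rss = {
    ("Barack Obama", "news-1"),
    ("Stephen Hawking", "news-1"),
    ("Lady Gaga", "news-2"),
    ("Barack Obama", "news-3"),
}
has_type = {
    ("Barack Obama", "office holder"),
    ("Stephen Hawking", "scientist"),
    ("Lady Gaga", "musical artist"),
}


def entities_of_type(type_label):
    """Entities E such that: E has_type T, T label type_label."""
    return {e for e, t in has_type if t == type_label}


def news_mentioning(entities):
    """News R such that: E appears_in_rss R for some E in `entities`."""
    return {r for e, r in appears_in_rss if e in entities}


def news_with_entity_and_type(label, type_label):
    """News mentioning the entity `label` AND any entity of `type_label`."""
    return (news_mentioning({label})
            & news_mentioning(entities_of_type(type_label)))


print(news_with_entity_and_type("Barack Obama", "scientist"))
print(news_mentioning(entities_of_type("musical artist")))
```

The intersection mirrors the two `appears_in_rss R` constraints sharing the same variable R in the third query.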

Visualization: Mapping information

View

    class EntityMapView(View):
        __regid__ = 'map'

        def call(self):
            rset = self.cw_rset
            self.init_map()
            for entity in rset.entities():
                self.add_marker(entity.latitude, entity.longitude,
                                entity.dc_title())
            self.center_and_zoom(0, 0, 1.5)
            self.finish_map()

Based on mapstraction (JavaScript): http://mapstraction.com

✓ Automatically locates information from RSS news

Try it

Conclusion

Cubicweb and Scikits-learn

Efficient and easy-to-use tools for storing, querying and mining data

Easy to plug together: Python-only tools, all Open Source

Using Dbpedia makes it possible to extract a small number of highly relevant features

✓ Decreases the dimensionality of the data
✓ Links features to millions of pages of information

A new semantic way of querying information

✓ Simple information queries using RQL expressions
✓ Uses Dbpedia types and categories to refine the selection

Future Improvements

Named Entity Recognition

→ use disambiguation links, more refined regular expressions

→ add new databases (MusicBrainz, Diseasome, ...)

→ closely follow Wikipedia with Dbpedia live update

Motorola Mobility (11:46, 15 August 2011): 'On the 15 August, Google announced that it agreed to acquire the company'

RSS news analysis

→ explore new algorithms: bi-clustering

→ add new data sources: Twitter, blogs

→ 'from scikits.learn predict __future__' → use matrix completion to predict new edges in the correlation graph

Thank you for your attention

Questions?

Page 6: Semnews - Euroscipy 2011

Storing RSS information with Cubicweb

Each object (or entity) is defined in a schema and may be displayed using different views.

Storing RSS

Define (in schema.py) how RSS should be stored in the database:

    from yams.buildobjs import EntityType, String

    class RSSArticle(EntityType):
        title = String()  # Title of the feed
        uri = String(unique=True)  # Uri of the rss feed
        content = String()  # Content of the feed

Fetching RSS (based on feedparser and BeautifulSoup)

Simply construct a source by giving a URL and a parser:

    url = u'http://feeds.bbci.co.uk/news/world/rss.xml'
    s = session.create_entity('CWSource', name=u'BBCNews-World', url=url,
                              type=u'datafeed', parser=u'rss-parser',
                              config=u'synchronization-interval=240min')
    s.pull_data(session)

7 English/American journals (The New York Times, ...)
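For illustration only, the sketch below parses a small RSS 2.0 document with the Python standard library and maps each item to the three attributes of RSSArticle (title, uri, content). The sample feed, the example.org links and the parse_rss helper are assumptions made for this example; the application itself relies on feedparser and BeautifulSoup through a Cubicweb datafeed source.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content, standing in for a downloaded RSS 2.0 document.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Google to buy Motorola Mobility</title>
      <link>http://example.org/news/1</link>
      <description>Google is to buy mobile phone manufacturer Motorola Mobility.</description>
    </item>
    <item>
      <title>Libya rebels fight</title>
      <link>http://example.org/news/2</link>
      <description>Supply lines to the capital are in peril.</description>
    </item>
  </channel>
</rss>
"""


def parse_rss(text):
    """Return one dict per <item>, keyed by the RSSArticle attribute names."""
    root = ET.fromstring(text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "title": item.findtext("title", default=""),
            "uri": item.findtext("link", default=""),
            "content": item.findtext("description", default=""),
        })
    return articles


articles = parse_rss(SAMPLE_FEED)
```

Each resulting dict could then be handed to something like session.create_entity to store one article per item.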

Storing Dbpedia information with Cubicweb

Storing Dbpedia page

Define (in schema.py) how Dbpedia pages should be stored in the database:

    class DbpediaPage(EntityType):
        uri = String(unique=True, indexed=True)  # Uri of the resource
        label = String(indexed=True)  # http://www.w3.org/2000/01/rdf-schema#
        pageid = String()  # http://dbpedia.org/ontology/wikiPageID
        abstract = String()  # http://dbpedia.org/ontology/abstract
        homepage = String()  # http://xmlns.com/foaf/0.1/homepage
        thumbnail = String()  # http://dbpedia.org/ontology/thumbnail
        depiction = String()  # http://xmlns.com/foaf/0.1/depiction
        wikipage = String()  # http://xmlns.com/foaf/0.1/page
        latitude = String()  # http://www.w3.org/2003/01/geo/wgs84_pos#
        longitude = String()  # http://www.w3.org/2003/01/geo/wgs84_pos#

Storing all Dbpedia information (~9·10^6 pages, ~100·10^6 links, 20 GB) in Cubicweb takes less than 24 hours.

→ See the full schema

Analyzing RSS news

What can we do with the RSS news stored in the database?

1. Extract relevant features of the data

2. Construct a usable (i.e. matrix) representation of the data

3. Cluster (group) RSS together

4. Deeper semantic analysis and visualization of the information

Example sentence

“Google is to buy mobile phone manufacturer Motorola Mobility, allowing it to mount a serious challenge to Apple Inc.”

Extracting information: Classical approaches (Scikits-learn)

Char-N-gram

Extracts features of N characters from a text

    from scikits.learn.feature_extraction.text import CharNGramAnalyzer
    analyzer = CharNGramAnalyzer(min_n=3, max_n=6)
    features = analyzer.analyze(sentence)

450 features: 'goo', 'oog', 'ogl', 'mob', 'obi', 'bil', ...

Word-N-gram

Extracts features of N words from a text

    from scikits.learn.feature_extraction.text import WordNGramAnalyzer
    analyzer = WordNGramAnalyzer(min_n=3, max_n=6)
    features = analyzer.analyze(sentence)

58 features: 'google is to', 'is to buy', 'serious challenge to', ...

✗ Many irrelevant features (tokens)
✗ Features do not carry much contextual information (i.e. understandable by humans)
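To make the two classical extractions concrete, here is a minimal pure-Python sketch of character and word n-grams. It is not the Scikits-learn implementation: the function names and the tokenization (lower-cased words of at least two characters) are assumptions made for this example. With that tokenization, the word variant happens to yield 58 n-grams for the example sentence, the same count as reported above.

```python
import re


def char_ngrams(text, min_n=3, max_n=6):
    """All character substrings of length min_n..max_n, in lower case."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)]


def word_ngrams(text, min_n=3, max_n=6):
    """All runs of min_n..max_n consecutive words, joined by spaces."""
    # Assumption: tokens are lower-cased words of at least two characters.
    words = re.findall(r"\b\w\w+\b", text.lower())
    return [" ".join(words[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(words) - n + 1)]


sentence = ("Google is to buy mobile phone manufacturer Motorola Mobility, "
            "allowing it to mount a serious challenge to Apple Inc.")
char_features = set(char_ngrams(sentence))
word_features = set(word_ngrams(sentence))
```

Most of these features ('is to buy', 'oog', ...) say nothing about the topic of the news item, which is the drawback listed above.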

Feature extraction: Dbpedia - Context

Main hypothesis: only things that exist in Dbpedia (i.e. Wikipedia) are of interest for news analysis → Named Entities Recognition (NER)

Dbpedia feature extraction (Cubicweb/Dbpedia)

    from cubes.semnews.views.nertools import DbpediaEntitiesAnalyzer
    analyzer = DbpediaEntitiesAnalyzer(session, lang='en')
    tokens = analyzer.analyze(sentence)

→ 3 features: 'Apple Inc', 'Google', 'Motorola'

<ENAMEX TYPE="ORGANIZATION">Google</ENAMEX> is to buy mobile phone manufacturer <ENAMEX TYPE="ORGANIZATION">Motorola</ENAMEX> Mobility, allowing it to mount a serious challenge to <ENAMEX TYPE="ORGANIZATION">Apple Inc</ENAMEX>

→ Try it
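The idea of keeping only tokens that match a known label can be sketched with a toy dictionary lookup. This is an assumption-laden illustration, not the DbpediaEntitiesAnalyzer code: KNOWN_LABELS is a hypothetical three-entry stand-in for the Dbpedia labels, and extract_entities is a hypothetical helper doing a greedy longest match over the words of the sentence.

```python
import re

# Hypothetical label set; the real application looks labels up in Dbpedia.
KNOWN_LABELS = {"Google", "Motorola", "Apple Inc"}
MAX_WORDS = max(len(label.split()) for label in KNOWN_LABELS)


def extract_entities(sentence, labels=KNOWN_LABELS, max_words=MAX_WORDS):
    """Greedy longest-match lookup of known labels in a sentence."""
    words = re.findall(r"\w+", sentence)
    found, i = [], 0
    while i < len(words):
        # Try the longest candidate starting at word i first.
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in labels:
                found.append(candidate)
                i += n
                break
        else:
            i += 1  # no known label starts at this word
    return found


sentence = ("Google is to buy mobile phone manufacturer Motorola Mobility, "
            "allowing it to mount a serious challenge to Apple Inc.")
entities = sorted(extract_entities(sentence))
```

With this toy label set the sentence reduces to three features, against hundreds for the n-gram approaches.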

“DBpedia Spotlight: Shedding Light on the Web of Documents”, Pablo N. Mendes et al., I-Semantics 2011

“Learning Named Entity Recognition from Wikipedia”, Joel Nothman, 2008

“Large-Scale Named Entity Disambiguation Based on Wikipedia Data”, Silviu Cucerzan, 2007

Feature extraction: Dbpedia - Properties

Efficient and robust feature extraction

Keep the meaning of a text → interpretable features
  ✗ '... said former soldier Larry ...'
  ✗ '... said student Larry ...'
  ✓ '... said Larry Page ...'

Robust features based on redirections, e.g. 'Obama', 'Barak Obamba', 'Pres. Obama' redirect to 'Barack Obama'

Simple RQL (Relation Query Language) queries

Fast: based on indexed SQL tables and regular expressions

    rset = rql('Any E WHERE E is DbpediaPage, E label %(token)s', {'token': token})

e.g. 19 entities extracted in 4 s from 765 words, among ~8·10^6 Dbpedia entries

Different labels but same URI → cross-language feature extraction, e.g. Grenada/Grenade → http://dbpedia.org/resource/Grenada
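As a rough sketch of why an indexed label column and a redirection table make the lookup both fast and robust, the example below uses an in-memory SQLite database. The table names, the resolve helper and the handful of rows are assumptions for illustration only; the actual storage is handled by Cubicweb, and treating the French label 'Grenade' as a redirect is a simplification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (uri TEXT UNIQUE, label TEXT);
    CREATE INDEX page_label_idx ON page (label);
    CREATE TABLE redirect (label TEXT PRIMARY KEY, target_uri TEXT);
""")

OBAMA = "http://dbpedia.org/resource/Barack_Obama"
GRENADA = "http://dbpedia.org/resource/Grenada"

# Hypothetical rows: two pages and a few alternative labels pointing to them.
conn.executemany("INSERT INTO page VALUES (?, ?)",
                 [(OBAMA, "Barack Obama"), (GRENADA, "Grenada")])
conn.executemany("INSERT INTO redirect VALUES (?, ?)",
                 [("Obama", OBAMA), ("Barak Obamba", OBAMA),
                  ("Pres. Obama", OBAMA), ("Grenade", GRENADA)])


def resolve(token):
    """Return the URI whose label or redirect label equals the token."""
    row = conn.execute(
        "SELECT uri FROM page WHERE label = ? "
        "UNION SELECT target_uri FROM redirect WHERE label = ?",
        (token, token)).fetchone()
    return row[0] if row else None


uris = {resolve(t) for t in ("Obama", "Barak Obamba", "Pres. Obama")}
```

All three spellings collapse to a single URI, so they end up as one feature instead of three.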

Feature extraction: Matrix representation and storage

Figure (news titles → occurrence matrix):

    Obama's approval      0 0 1 0
    Protest as Spain      0 0 0 1
    Libya rebels fight    0 1 0 0
    Obama in Spain        0 0 1 1

E.g. the Obama-Merkel-Sarkozy space

Results are stored in a relation (appears_in_rss) in Cubicweb:

    rql('Any X WHERE X appears_in_rss Y')
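A minimal sketch of how such a matrix could be built from the stored relation: each (entity, article) pair sets one cell to 1. The pairs below are hypothetical, loosely following the truncated titles above, and occurrence_matrix is an illustrative helper rather than the application's OccurrenceMatrix class.

```python
# Hypothetical (entity, article) pairs, as an appears_in_rss query might return.
pairs = [
    ("Obama", "Obama's approval"),
    ("Spain", "Protest as Spain"),
    ("Libya", "Libya rebels fight"),
    ("Obama", "Obama in Spain"),
    ("Spain", "Obama in Spain"),
]


def occurrence_matrix(pairs):
    """Build a binary article x entity matrix from (entity, article) pairs."""
    # Articles keep their first-seen order; entities are sorted by name.
    articles = list(dict.fromkeys(article for _, article in pairs))
    entities = sorted({entity for entity, _ in pairs})
    column = {entity: j for j, entity in enumerate(entities)}
    row = {article: i for i, article in enumerate(articles)}
    matrix = [[0] * len(entities) for _ in articles]
    for entity, article in pairs:
        matrix[row[article]][column[entity]] = 1
    return articles, entities, matrix


articles, entities, X = occurrence_matrix(pairs)
```

Each row of X is then the feature vector of one news item, ready for clustering.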

Exploiting information: Clustering news

Creating clusters (groups) of news using the matrix representation of the data

Meanshift algorithm (Scikits-learn)

Based on locating the maxima of a density function

Automatically tunes the number of clusters

    from scikits.learn.cluster import MeanShift, estimate_bandwidth
    bandwidth = estimate_bandwidth(X, quantile=0.005)
    clustering = MeanShift(bandwidth=bandwidth)
    clustering.fit(X)
    labels = clustering.labels_

“Mean shift, mode seeking, and clustering”, Yizong Cheng, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1995

→ Try it

Other possible alternatives: Ward's algorithm, K-means
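To show what "locating the maxima of a density function" means in practice, here is a small pure-Python mean shift with a flat kernel. It is a simplified sketch, not the Scikits-learn implementation: the mean_shift function, its parameters and the merging rule (modes closer than half the bandwidth are treated as one cluster) are assumptions made for this example.

```python
import math


def mean_shift(points, bandwidth, n_iter=50, tol=1e-6):
    """Flat-kernel mean shift: move each mode to the mean of its neighbours."""
    modes = [list(p) for p in points]
    for _ in range(n_iter):
        moved = 0.0
        for i, m in enumerate(modes):
            neigh = [p for p in points if math.dist(p, m) <= bandwidth]
            if not neigh:
                continue  # guard against rounding at the ball boundary
            new = [sum(c) / len(neigh) for c in zip(*neigh)]
            moved = max(moved, math.dist(new, m))
            modes[i] = new
        if moved < tol:
            break
    # Merge modes that converged to (almost) the same location.
    centers, labels = [], []
    for m in modes:
        for k, c in enumerate(centers):
            if math.dist(m, c) < bandwidth / 2:
                labels.append(k)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels


# Two well-separated groups of points; the number of clusters is not given.
points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
centers, labels = mean_shift(points, bandwidth=1.0)
```

The number of clusters is never passed in; it falls out of the number of distinct modes, which is why only the bandwidth has to be estimated.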

Exploiting information: Queries and Views

Each query returns a result set (rset). A view is called on a result set → define the representation rules.

Defining a view (short version):

    from cubicweb.view import View

    class MyEntitiesView(View):
        __regid__ = 'example-view'

        def call(self):
            rset = self.cw_rset
            for entity in rset.entities():
                # do_whatever_you_want_, write_some_html_
                pass

Then you just have to plug in an rset from a query:

    rset = rql('Any X WHERE ...')
    self.wview('example-view', rset)

or within a URL:

    http://myapplication/?rql=Any X WHERE ...&vid=example-view
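The mechanism can be pictured with a toy model: views are registered under an identifier, and rendering means looking up that identifier and calling the view on a result set. Everything in the sketch below (VIEW_REGISTRY, register, ToyView, render) is hypothetical and only mimics the idea; it is not how Cubicweb implements its registry.

```python
VIEW_REGISTRY = {}


def register(cls):
    """Class decorator: index a view class by its __regid__."""
    VIEW_REGISTRY[cls.__regid__] = cls
    return cls


class ToyView:
    """Hypothetical stand-in for a view: holds an rset and an output buffer."""

    def __init__(self, rset):
        self.cw_rset = rset
        self.out = []

    def w(self, text):
        """Append a chunk of HTML to the output."""
        self.out.append(text)


@register
class ListView(ToyView):
    __regid__ = "example-view"

    def call(self):
        self.w("<ul>")
        for (label,) in self.cw_rset:
            self.w("<li>%s</li>" % label)
        self.w("</ul>")


def render(vid, rset):
    """Look up the view named `vid` and apply it to the result set."""
    view = VIEW_REGISTRY[vid](rset)
    view.call()
    return "".join(view.out)


html = render("example-view", [("Google",), ("Motorola",)])
```

The same result set could be passed to any other registered identifier, which is what the vid parameter of the URL selects.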

Plugging Scikits-learn into a view

Combine machine learning tools and database query system in pure Python

Defining the view

    class RSSClustering(View):
        __regid__ = 'rss-clustering'

        def call(self):
            # Get the results set and create the data
            rset = self.cw_rset
            occurrence_matrix = OccurrenceMatrix()
            occurrence_matrix.construct_matrix(rset=rset)
            # Scale the data
            X = scale(occurrence_matrix, axis=0, with_std=True, copy=True)
            # Compute bandwidth for clustering
            bandwidth = estimate_bandwidth(X)
            # Clustering
            cluster = MeanShift(bandwidth=bandwidth)
            cluster.fit(X)
            labels = cluster.labels_
            # Perform some HTML rendering

→ Try it
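The "Scale the data" step standardizes each column (each entity) before clustering. A minimal sketch of that operation, assuming the population standard deviation (division by n) and leaving constant columns unscaled; scale_columns is an illustrative helper, not the library function:

```python
import math


def scale_columns(rows):
    """Centre each column to zero mean and divide it by its standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n)
            for col, m in zip(columns, means)]
    # Constant columns have zero deviation; leave them centred but unscaled.
    stds = [s if s > 0 else 1.0 for s in stds]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)]
            for row in rows]


# Small binary occurrence matrix: 4 news items, 3 entities (toy values).
X = scale_columns([[0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 1]])
```

After scaling, a rarely mentioned entity weighs as much in the distance computation as a frequently mentioned one.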

A new approach for querying information from RSS

All musical artists in the news:

    rql('DISTINCT Any E, R WHERE E appears_in_rss R, '
        'E has_type T, T label "musical artist"')

All living office holder persons in the news:

    rql('DISTINCT Any E WHERE E appears_in_rss R, '
        'E has_type T, T label "office holder", '
        'E has_subject C, C label "Living people"')

All news that talk about Barack Obama and any scientist:

    rql('DISTINCT Any R WHERE E1 label "Barack Obama", '
        'E1 appears_in_rss R, E2 appears_in_rss R, '
        'E2 has_type T, T label "scientist"')

All news that talk about a drug:

    rql('Any X, R WHERE X appears_in_rss R, '
        'X has_type T, T label "drug"')

Try it with an xml view or a thumbnail view
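To make the semantics of such a query explicit, the sketch below evaluates the "Barack Obama and any scientist" query over toy relations held in plain Python sets. The relation contents, the news identifiers and the news_about helper are hypothetical; a real RQL query is translated to SQL by Cubicweb rather than evaluated this way.

```python
# Hypothetical toy relations, written as sets of (subject, object) pairs.
appears_in_rss = {
    ("Barack Obama", "news-1"), ("Stephen Hawking", "news-1"),
    ("Barack Obama", "news-2"), ("Angela Merkel", "news-2"),
    ("Stephen Hawking", "news-3"),
}
has_type = {
    ("Barack Obama", "office holder"),
    ("Angela Merkel", "office holder"),
    ("Stephen Hawking", "scientist"),
}


def news_about(entity_label, type_label):
    """News items mentioning `entity_label` and any entity of `type_label`."""
    typed = {e for e, t in has_type if t == type_label}
    with_entity = {r for e, r in appears_in_rss if e == entity_label}
    with_type = {r for e, r in appears_in_rss if e in typed}
    # Both conditions must hold for the same news item R.
    return with_entity & with_type


result = news_about("Barack Obama", "scientist")
```

The shared variable R in the RQL expression plays the role of the set intersection here.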

Visualisation: Mapping information

View

    class EntityMapView(View):
        __regid__ = 'map'

        def call(self):
            rset = self.cw_rset
            self.init_map()
            for entity in rset.entities():
                self.add_marker(entity.latitude, entity.longitude, entity.dc_title())
            self.center_and_zoom(0, 0, 1.5)
            self.finish_map()

Based on mapstraction (JavaScript): http://mapstraction.com

✓ Automatically locate information from RSS news

Try it
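Since latitude and longitude are declared as String in the schema, a map view has to convert them and skip entities without coordinates. The sketch below illustrates that step under stated assumptions: to_markers and center are hypothetical helpers, and the entities and their approximate coordinates are toy data.

```python
def to_markers(entities):
    """Convert (title, latitude, longitude) string triples into float markers."""
    markers = []
    for title, lat, lon in entities:
        try:
            markers.append((float(lat), float(lon), title))
        except (TypeError, ValueError):
            continue  # entity without usable coordinates
    return markers


def center(markers):
    """Centre of the bounding box enclosing all markers."""
    lats = [m[0] for m in markers]
    lons = [m[1] for m in markers]
    return ((min(lats) + max(lats)) / 2, (min(lons) + max(lons)) / 2)


# Hypothetical entities; coordinates are approximate and stored as strings.
entities = [("Tokyo", "35.68", "139.69"),
            ("Tripoli", "32.89", "13.19"),
            ("Google", None, None)]
markers = to_markers(entities)
map_center = center(markers)
```

The computed centre could replace the fixed (0, 0) passed to center_and_zoom when the result set covers only one region.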

Conclusion

Cubicweb and Scikits-learn

Efficient and easy-to-use tools for storing, querying and mining data

Easy to plug together: Python-only tools, all Open Source

Using Dbpedia makes it possible to extract a small number of highly relevant features

✓ Decrease the dimensionality of the data

✓ Link features to millions of pages of information

A new semantic way of querying information

✓ Simple information queries using RQL expressions

✓ Use Dbpedia types and categories to refine the selection

Future Improvements

Named Entities Recognition

→ use disambiguation links, more refined regular expressions

→ add new databases (MusicBrainz, diseasome, ...)

→ closely follow Wikipedia with Dbpedia live update

Motorola Mobility (11:46, 15 August 2011): 'On the 15 August Google announced that it agreed to acquire the company'

RSS news analysis

→ explore new algorithms: bi-clustering

→ add new data sources: Twitter, blogs

→ 'from scikits.learn predict __future__' → use matrix completion to predict new edges in the correlation graph

Thank you for your attention

Questions?

Page 9: Semnews - Euroscipy 2011

Overview

1 Introduction

2 Extracting information from RSS news

3 Analyzing RSS news

Extracting information Classical approaches (Scikits-learn)

Char-N-gram

Extracts features of N characters from a text

from s c i k i t s l ea rn f e a t u r e _ e x t r a c t i o n t e x t import CharNGramAnalyzeranalyzer = CharNGramAnalyzer ( min_n=3 max_n=6)fea tu res = analyzer analyze ( sentence )

450 features rsquogoorsquo rsquooogrsquo rsquooglrsquo rsquomobrsquo rsquoobirsquo rsquobilrsquo

Word-N-gram

Extracts features of N words from a text

from s c i k i t s l ea rn f e a t u r e _ e x t r a c t i o n t e x t import WordNGramAnalyzeranalyzer = WordNGramAnalyzer ( min_n=3 max_n=6)fea tu res = analyzer analyze ( sentence )

58 features rsquogoogle is torsquo rsquois to buyrsquo rsquoserious challenge torsquo

8 Many irrelevant features (tokens)8 Features do not carry lots of contextual information (ie

understandable by humans)

Feature extraction Dbpedia - Context

Main Hypothesis Only things that exist in Dbpedia (ieWikipedia) have some interest in news analysisrarr Named

Entities Recognition (NER)

Dbpedia feature extraction (CubicwebDbpedia)

from cubes semnews views ne r t oo l s import Dbped iaEnt i t iesAna lyzeranalyzer = Dbped iaEnt i t iesAna lyzer ( session lang= rsquo en rsquo )tokens = analyzer analyze ( sentence )

rarr 3 features rsquoApple Incrsquo rsquoGooglersquo rsquoMotorolarsquo

ltENAMEX TYPE=ORGANIZATIONgt GoogleltENAMEXgt is to buy mobile phone

manufacturer ltENAMEX TYPE=ORGANIZATIONgt MotorolaltENAMEXgt Mobility allowing it to

mount a serious challenge to ltENAMEX TYPE=ORGANIZATIONgt Apple IncltENAMEXgt

rarr Try it

ldquoDBpedia Spotlight Shedding Light on the Web of Documentsrdquo Pablo N Mendes et al I-Semantics 2011ldquoLearning Named Entity Recognition from Wikipediardquo Joel Nothman 2008

ldquoLarge-Scale Named Entity Disambiguation Based on Wikipedia Datardquo Silviu Cucerzan 2007

Feature extraction Dbpedia - Properties

Efficient and robust feature extraction

Keep the meaning of a textrarr interpretable features8 rsquo said former soldier Larry rsquo8 rsquo said student Larry rsquo4 rsquo said Larry Page rsquo

Robust features based on redirectionseg rsquoObamarsquo rsquoBarak Obambarsquo rsquoPres Obamarsquo redirect to rsquoBarack Obamarsquo

Simple RQL (Relation Query Language) queries

Fast based on indexed SQL tables and regular expressions rset = rql(rsquoAny E WHERE E is DbpediaPage

E label (token)srsquo rsquotokenrsquo token)

eg 19 entities extracted in 4s in 765 words among sim 8106 dbpedia entries

Different labels but same URIrarr cross-language featureextraction eg GrenadaGrenaderarr http dbpediaorgresourceGrenada

Overview

1 Introduction

2 Extracting information from RSS news

3 Analyzing RSS news

Feature extraction Matrix representation and storage

Obamarsquos approval Protest as Spain

Libya rebels fight Obama in Spain

rarr

0 0 1 00 0 0 1 0 1 0 00 0 1 1

Eg The Obama-Merkel-Sarkozy

space

Results stored in a relation (appears in rss) in Cubicweb

rql(rsquoAny X WHERE X appears_in_rss Yrsquo)

Exploiting information: Clustering news

Creating clusters (groups) of news using the matrix representation of the data

Meanshift algorithm (Scikits-learn)

Based on locating the maxima of a density function

Automatically tunes the number of clusters

    from scikits.learn.cluster import MeanShift, estimate_bandwidth
    bandwidth = estimate_bandwidth(X, quantile=0.005)
    clustering = MeanShift(bandwidth=bandwidth)
    clustering.fit(X)
    labels = clustering.labels_

“Mean shift, mode seeking, and clustering”, Yizong Cheng, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1995

→ Try it

Other possible alternatives: Ward's algorithm, K-means
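To make "locating the maxima of a density function" concrete, here is a toy flat-kernel mean shift in pure Python. It is a teaching sketch, not the Scikits-learn implementation: the `mean_shift` function, its parameters and the sample points are all assumptions made for this example. Each point is moved to the mean of its neighbours until it stops moving, and points that converge to the same place share a label.

```python
import math

def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Toy flat-kernel mean shift: returns (modes, labels)."""
    def neighbours_mean(center):
        near = [p for p in points if math.dist(p, center) <= bandwidth]
        return tuple(sum(coord) / len(near) for coord in zip(*near))

    modes, labels = [], []
    for point in points:
        center = tuple(point)
        for _ in range(max_iter):
            shifted = neighbours_mean(center)
            moved = math.dist(shifted, center)
            center = shifted
            if moved < tol:
                break
        # Reuse an existing mode if this point converged close to it.
        for index, mode in enumerate(modes):
            if math.dist(mode, center) < bandwidth:
                labels.append(index)
                break
        else:
            modes.append(center)
            labels.append(len(modes) - 1)
    return modes, labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
modes, labels = mean_shift(points, bandwidth=3)
print(labels)
```

The number of clusters is never passed in: it falls out of the bandwidth, which is why the slide code estimates the bandwidth first.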

Exploiting information: Queries and Views

Each query returns a result set (rset). A view is called on a result set → define the representation rules

Defining a view (short version)

    from cubicweb.view import View

    class MyEntitiesView(View):
        __regid__ = 'example-view'

        def call(self):
            rset = self.cw_rset
            for entity in rset.entities():
                # do whatever you want, write some HTML
                pass

You just have to plug in a rset from a query:

    rset = rql('Any X WHERE ...')
    self.wview('example-view', rset)

or within a URL:

    http://myapplication/?rql=Any X WHERE ...&vid=example-view
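When such a URL is generated from code, the query has to be percent-encoded. A small sketch with the standard library follows. The `view_url` helper is hypothetical, and the base URL and query are placeholders built from names used on the slides.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def view_url(base_url, rql, vid):
    """Build a URL that runs `rql` and renders the result with view `vid`."""
    return base_url + "?" + urlencode({"rql": rql, "vid": vid})

url = view_url("http://myapplication/", "Any X WHERE X is DbpediaPage", "example-view")
print(url)

# Decoding the query string gives back the original parameters.
params = parse_qs(urlsplit(url).query)
print(params)
```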

Plugging Scikits-learn into a view

Combine machine learning tools and a database query system in pure Python

Defining the view

    class RSSClustering(View):
        __regid__ = 'rss-clustering'

        def call(self):
            # Get the result set and create the data
            rset = self.cw_rset
            occurrence_matrix = OccurrenceMatrix()
            occurrence_matrix.construct_matrix(rset=rset)
            # Scale the data
            X = scale(occurrence_matrix, axis=0, with_std=True, copy=True)
            # Compute bandwidth for clustering
            bandwidth = estimate_bandwidth(X)
            # Clustering
            cluster = MeanShift(bandwidth=bandwidth)
            cluster.fit(X)
            labels = cluster.labels_
            # Perform some HTML rendering

→ Try it
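The "scale the data" step in the view standardizes each column (each entity) before clustering. The sketch below shows the arithmetic in pure Python, assuming the population standard deviation is used. `scale_columns` is a made-up helper for illustration, not the library function called in the view.

```python
import math

def scale_columns(matrix):
    """Centre each column and divide it by its standard deviation."""
    columns = list(zip(*matrix))
    means = [sum(col) / len(col) for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col))
            for col, m in zip(columns, means)]
    # Constant columns have zero deviation; leave them centred but unscaled.
    stds = [s if s > 0 else 1.0 for s in stds]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in matrix]

X = scale_columns([[0, 1, 0], [1, 0, 1], [0, 0, 1], [1, 1, 1]])
for row in X:
    print([round(v, 3) for v in row])
```

After scaling, every column has zero mean and unit variance, so a frequently mentioned entity does not dominate the distances used by the clustering.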

A new approach for querying information from RSS

All musical artists in the news

    rql('DISTINCT Any E, R WHERE E appears_in_rss R, '
        'E has_type T, T label "musical artist"')

All living office holder persons in the news

    rql('DISTINCT Any E WHERE E appears_in_rss R, '
        'E has_type T, T label "office holder", '
        'E has_subject C, C label "Living people"')

All news that talk about Barack Obama and any scientist

    rql('DISTINCT Any R WHERE E1 label "Barack Obama", '
        'E1 appears_in_rss R, E2 appears_in_rss R, '
        'E2 has_type T, T label "scientist"')

All news that talk about a drug

    rql('Any X, R WHERE X appears_in_rss R, '
        'X has_type T, T label "drug"')

Try it with an XML view or a thumbnail view
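For readers unfamiliar with RQL, the third query above is essentially a join between two relations. The sketch below replays it on in-memory sets of pairs. The relation names follow the slides, but the facts and the helper functions are invented for the example.

```python
# Toy relations mirroring the schema used in the queries above;
# the facts themselves are made up.
appears_in_rss = {
    ("Barack Obama", "news-1"),
    ("Albert Einstein", "news-1"),
    ("Barack Obama", "news-2"),
    ("Aspirin", "news-3"),
}
has_type = {
    ("Barack Obama", "office holder"),
    ("Albert Einstein", "scientist"),
    ("Aspirin", "drug"),
}

def news_about(entity):
    """News items in which the given entity appears."""
    return {news for e, news in appears_in_rss if e == entity}

def news_about_type(type_label):
    """News items in which any entity of the given type appears."""
    typed = {e for e, t in has_type if t == type_label}
    return {news for e, news in appears_in_rss if e in typed}

# "All news that talk about Barack Obama and any scientist"
result = news_about("Barack Obama") & news_about_type("scientist")
print(sorted(result))
```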

Visualisation: Mapping information

View

    class EntityMapView(View):
        __regid__ = 'map'

        def call(self):
            rset = self.cw_rset
            self.init_map()
            for entity in rset.entities():
                self.add_marker(entity.latitude, entity.longitude, entity.dc_title())
            self.center_and_zoom(0, 0, 1, 5)
            self.finish_map()

Based on mapstraction (javascript): http://mapstraction.com

✔ Automatically locate information from RSS news

→ Try it

Conclusion

Cubicweb and Scikits-learn

Efficient and easy-to-use tools for data storing, querying and mining

Easy to plug together: only Python tools, all Open Source

Using Dbpedia makes it possible to extract very few, highly relevant features

✔ Decrease the dimensionality of the data

✔ Link features to millions of pages of information

A new semantic way of querying information

✔ Simple information queries using RQL expressions

✔ Use Dbpedia types and categories to refine the selection

Future Improvements

Named Entities Recognition

→ use disambiguation links, more refined regular expressions

→ add new databases (MusicBrainz, diseasome, ...)

→ closely follow Wikipedia with Dbpedia live update

Motorola Mobility (11:46, 15 August 2011): 'On the 15 August, Google announced that it agreed to acquire the company'

RSS news analyzing

→ explore new algorithms: bi-clustering

→ add new data sources: Twitter, blogs

→ 'from scikits.learn predict __future__' → use Matrix Completion to predict new edges in the correlation graph

Thank you for your attention

Questions?

Page 10: Semnews - Euroscipy 2011

Extracting information: Classical approaches (Scikits-learn)

Char-N-gram

Extracts features of N characters from a text

    from scikits.learn.feature_extraction.text import CharNGramAnalyzer
    analyzer = CharNGramAnalyzer(min_n=3, max_n=6)
    features = analyzer.analyze(sentence)

450 features: 'goo', 'oog', 'ogl', 'mob', 'obi', 'bil', ...

Word-N-gram

Extracts features of N words from a text

    from scikits.learn.feature_extraction.text import WordNGramAnalyzer
    analyzer = WordNGramAnalyzer(min_n=3, max_n=6)
    features = analyzer.analyze(sentence)

58 features: 'google is to', 'is to buy', 'serious challenge to', ...

✘ Many irrelevant features (tokens)

✘ Features do not carry much contextual information (i.e. understandable by humans)
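What the two analyzers compute can be written in a few lines of plain Python. This is a sketch of the n-gram idea only: the function names are made up, and the real analyzers may preprocess the text differently, so the feature counts on the slide are not reproduced here.

```python
def char_ngrams(text, min_n=3, max_n=6):
    """All character n-grams of the lower-cased text, for min_n <= n <= max_n."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)]

def word_ngrams(text, min_n=3, max_n=6):
    """All word n-grams of the lower-cased text, joined with single spaces."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(words) - n + 1)]

sentence = "Google is to buy mobile phone manufacturer Motorola Mobility"
chars = char_ngrams("Google")
words = word_ngrams(sentence)
print(chars[:3])
print(words[:2])
```

Even a single six-letter word yields ten character n-grams, which shows how quickly the feature count grows compared with a handful of named entities.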

Feature extraction: Dbpedia - Context

Main hypothesis: only things that exist in Dbpedia (i.e. Wikipedia) are of interest for news analysis → Named Entities Recognition (NER)

Page 15: Semnews - Euroscipy 2011

Exploiting information: Clustering news

Creating clusters (groups) of news using the matrix representation of the data

Mean shift algorithm (Scikits-learn)

✓ Based on locating the maxima of a density function

✓ Automatically tunes the number of clusters

    from scikits.learn.cluster import MeanShift, estimate_bandwidth
    bandwidth = estimate_bandwidth(X, quantile=0.005)
    clustering = MeanShift(bandwidth=bandwidth)
    clustering.fit(X)
    labels = clustering.labels_

"Mean shift, mode seeking, and clustering", Yizong Cheng, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1995

→ Try it

Other possible alternatives: Ward's algorithm, K-means
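
To make "locating the maxima of a density function" concrete, here is a minimal pure-Python sketch of flat-kernel mean shift on a handful of 2-D points. It is a simplified illustration, not the Scikits-learn implementation: the function name `mean_shift`, the toy points and the bandwidth of 1.0 are all assumptions chosen for the example.

```python
import math


def mean_shift(points, bandwidth, n_iter=50):
    """Simplified flat-kernel mean shift (illustrative, not the library code)."""
    # Every data point starts as a candidate mode.
    modes = [tuple(p) for p in points]
    for _ in range(n_iter):
        shifted = []
        for m in modes:
            # Neighbours inside the window of radius `bandwidth`.
            neigh = [p for p in points if math.dist(m, p) <= bandwidth]
            if not neigh:
                shifted.append(m)  # empty window: leave the mode where it is
                continue
            # Move the candidate to the mean of its neighbours.
            shifted.append(tuple(sum(c) / len(neigh) for c in zip(*neigh)))
        modes = shifted

    # Candidates that ended up within one bandwidth share a cluster.
    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if math.dist(m, c) <= bandwidth:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels


# Toy data: two tight groups, far apart from each other.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0), (5.1, 5.2)]
centers, labels = mean_shift(points, bandwidth=1.0)
print(len(centers), labels)
```

The number of clusters is never passed in: it falls out of how many distinct modes survive the merge step, which is the property the slide highlights. K-means and Ward's algorithm, by contrast, need the number of clusters (or a cut level) chosen beforehand.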

Page 16: Semnews - Euroscipy 2011

Exploiting information: Queries and Views

Each query returns a result set (rset). A view is called on a result set → it defines the representation rules

Defining a view (short version)

    from cubicweb.view import View

    class MyEntitiesView(View):
        __regid__ = 'example-view'

        def call(self):
            rset = self.cw_rset
            for entity in rset.entities():
                # do_whatever_you_want_ ...
                # write_some_html_ ...
                pass

... you just have to plug a rset from a query

    rset = rql('Any X WHERE ...')
    self.wview('example-view', rset)

... or within a URL

    http://myapplication/?rql=Any X WHERE ...&vid=example-view
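
When the query goes into a URL, the RQL text has to be percent-encoded, since it contains spaces, commas and quotes. The sketch below builds such a URL with the standard library and parses it back. The host `http://myapplication/` is the slide's placeholder, and the RQL string is only an illustrative query in the style of this talk, not a fixed part of the application.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder host taken from the slide; replace with a real instance URL.
base_url = "http://myapplication/"

# Illustrative query; any RQL string works the same way.
rql_query = 'Any X WHERE X has_type T, T label "drug"'

# `rql` carries the query and `vid` names the view applied to its result set.
url = base_url + "?" + urlencode({"rql": rql_query, "vid": "example-view"})
print(url)

# Decoding the query string recovers the original RQL unchanged.
params = parse_qs(urlsplit(url).query)
print(params["rql"][0])
```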

Page 17: Semnews - Euroscipy 2011

Plugging Scikits-learn in a view

Combine machine learning tools and database query system in pure Python

Defining the view

    class RSSClustering(View):
        __regid__ = 'rss-clustering'

        def call(self):
            # Get the results set and create the data
            rset = self.cw_rset
            occurrence_matrix = OccurrenceMatrix()
            occurrence_matrix.construct_matrix(rset=rset)
            # Scale the data
            X = scale(occurrence_matrix, axis=0, with_std=True, copy=True)
            # Compute bandwidth for clustering
            bandwidth = estimate_bandwidth(X)
            # Clustering
            cluster = MeanShift(bandwidth=bandwidth)
            cluster.fit(X)
            labels = cluster.labels_
            # Perform some HTML rendering

→ Try it
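
The first two steps of the view (building the occurrence matrix and scaling it) can be sketched without any framework. The code below is an assumption-laden illustration: `build_occurrence_matrix` and `scale_columns` are hypothetical helpers, not the `OccurrenceMatrix` class from the talk, and the news/entity pairs are made-up toy data. The scaling centres each column and divides by its population standard deviation, which is the usual meaning of column-wise standardisation.

```python
import math

# Made-up toy data: which named entities each news item mentions.
news_entities = {
    "news-1": ["Google", "Motorola Mobility"],
    "news-2": ["Google", "Apple Inc."],
    "news-3": ["Tokyo Electric Power Company", "Japan"],
}


def build_occurrence_matrix(news_entities):
    """One row per news item, one column per entity, 1.0 where it occurs."""
    entities = sorted({e for ents in news_entities.values() for e in ents})
    rows = [[1.0 if e in ents else 0.0 for e in entities]
            for ents in news_entities.values()]
    return entities, rows


def scale_columns(rows):
    """Centre each column and divide by its standard deviation."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by 0.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


entities, rows = build_occurrence_matrix(news_entities)
X = scale_columns(rows)
print(entities)
print(rows)
```

The scaled matrix `X` is what a clustering step would then consume, one row per news item.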

Page 18: Semnews - Euroscipy 2011

A new approach for querying information from RSS

All musical artists in the news

    rql('DISTINCT Any E, R WHERE E appears_in_rss R, '
        'E has_type T, T label "musical artist"')

All living office holders in the news

    rql('DISTINCT Any E WHERE E appears_in_rss R, '
        'E has_type T, T label "office holder", '
        'E has_subject C, C label "Living people"')

All news that talk about Barack Obama and any scientist

    rql('DISTINCT Any R WHERE E1 label "Barack Obama", '
        'E1 appears_in_rss R, E2 appears_in_rss R, '
        'E2 has_type T, T label "scientist"')

All news that talk about a drug

    rql('Any X, R WHERE X appears_in_rss R, '
        'X has_type T, T label "drug"')

Try it with an XML view or a thumbnail view
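
To show what the "Barack Obama and any scientist" query asks for, here is a plain-Python sketch of the same join over two small relation tables. Everything in it is a made-up stand-in: the pairs in `appears_in_rss` and `has_type` are invented toy data, and the helper functions are hypothetical, not part of Cubicweb or RQL.

```python
# Invented toy relations: (entity, news item) and (entity, Dbpedia type label).
appears_in_rss = {
    ("Barack Obama", "news-1"),
    ("Stephen Hawking", "news-1"),
    ("Barack Obama", "news-2"),
    ("Lady Gaga", "news-2"),
    ("Stephen Hawking", "news-3"),
}
has_type = {
    ("Barack Obama", "office holder"),
    ("Stephen Hawking", "scientist"),
    ("Lady Gaga", "musical artist"),
}


def news_mentioning(entity):
    """News items in which the given entity appears."""
    return {news for e, news in appears_in_rss if e == entity}


def entities_of_type(label):
    """Entities whose type carries the given label."""
    return {e for e, t in has_type if t == label}


# E1 label "Barack Obama", E1 appears_in_rss R
obama_news = news_mentioning("Barack Obama")

# E2 has_type T, T label "scientist", E2 appears_in_rss R
scientist_news = set().union(
    *(news_mentioning(e) for e in entities_of_type("scientist")))

# Both conditions share the variable R, so the answer is the intersection.
result = obama_news & scientist_news
print(sorted(result))
```

Sharing the variable `R` between the two `appears_in_rss` clauses is what turns the RQL expression into an intersection; the type label only narrows which entities may play the role of `E2`.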

Page 19: Semnews - Euroscipy 2011

Visualisation: Mapping information

View

    class EntityMapView(View):
        __regid__ = 'map'

        def call(self):
            rset = self.cw_rset
            self.init_map()
            for entity in rset.entities():
                self.add_marker(entity.latitude, entity.longitude, entity.dc_title())
            self.center_and_zoom(0, 0, 1.5)
            self.finish_map()

Based on mapstraction (javascript): http://mapstraction.com

✓ Automatically locate information from RSS news

Try it

Page 20: Semnews - Euroscipy 2011

Conclusion

Cubicweb and Scikits-learn

Efficient and easy-to-use tools for storing, querying and mining data

Easy to plug together: only Python tools, all Open Source

Using Dbpedia makes it possible to extract a small number of highly relevant features

✓ Decreases the dimensionality of the data

✓ Links features to millions of pages of information

A new semantic way of querying information

✓ Simple information queries using RQL expressions

✓ Use Dbpedia types and categories to refine the selection

Page 21: Semnews - Euroscipy 2011

Future Improvements

Named Entity Recognition

→ use disambiguation links, more refined regular expressions

→ add new databases (MusicBrainz, diseasome, ...)

→ closely follow Wikipedia with Dbpedia live update

Motorola Mobility (11:46, 15 August 2011): 'On 15 August, Google announced that it agreed to acquire the company'

RSS news analysis

→ explore new algorithms: bi-clustering, ...

→ add new data sources: Twitter, blogs, ...

→ 'from scikits.learn predict __future__' → use matrix completion to predict new edges in the correlation graph

Page 22: Semnews - Euroscipy 2011

Thank you for your attention

Questions?
