Intelligent Database Systems Presenter : YAN-SHOU SIE Authors : JEROEN DE KNIJFF, FLAVIUS FRASINCAR, FREDERIK HOGENBOOM 2012. DKE Data & Knowledge Engineering

Intelligent Database Systems Lab

Presenter : YAN-SHOU SIE

Authors : JEROEN DE KNIJFF, FLAVIUS FRASINCAR, FREDERIK HOGENBOOM

2012. DKE Data & Knowledge Engineering

Outlines

• Motivation

• Objectives

• Methodology

• Experiments

• Conclusions

• Comments

Motivation

• In the past, data were stored physically, not digitally, and were often structured manually so that the desired information could be found easily.

• Today, data are often stored digitally and are usually unstructured, as in text documents. Structuring documents manually is time-consuming.

Objectives

• Because manual structuring is time-consuming, it is interesting to investigate ways to organize documents automatically. One option is to generate a concept taxonomy automatically from a document corpus.

• In our current work, we present a framework for automatically constructing a domain taxonomy from text corpora. We call this framework Automatic Domain Taxonomy Construction from Text (ADTCT).

Methodology-ADTCT Framework

Methodology-ADTCT Framework

• Term Extraction: use a part-of-speech parser

• Term Filtering:

– domain pertinence DP

– lexical cohesion LC

Methodology-ADTCT Framework

– domain consensus DC, defined in terms of the normalized frequency norm_freq

– final domain score
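
The filter formulas themselves are not given on these slides, so the following is only a rough sketch of how such term filters are often computed. It assumes that domain pertinence compares a term's frequency in the target corpus with its highest frequency in any contrastive corpus, that domain consensus is the entropy of the term's normalized frequency (norm_freq) over the target documents, and that the final domain score is a weighted sum. The function names, the equal weights, and the sample counts are all hypothetical, and lexical cohesion is passed in as a ready-made number.

```python
import math

def domain_pertinence(term, target_freq, contrast_freqs):
    # Assumed definition: raw count in the target corpus divided by the
    # highest raw count in any contrastive corpus (at least 1 to avoid
    # division by zero).
    highest = max((freqs.get(term, 0) for freqs in contrast_freqs), default=0)
    return target_freq.get(term, 0) / max(highest, 1)

def domain_consensus(term, doc_counts):
    # Assumed definition: entropy of the term's normalized frequency
    # across the documents of the target corpus.
    counts = [doc.get(term, 0) for doc in doc_counts]
    total = sum(counts)
    if total == 0:
        return 0.0
    norm_freq = [n / total for n in counts if n > 0]
    return -sum(p * math.log(p) for p in norm_freq)

def domain_score(dp, dc, lc, w_dp=1/3, w_dc=1/3, w_lc=1/3):
    # Assumed combination: weighted sum of the three filters.
    return w_dp * dp + w_dc * dc + w_lc * lc

# Hypothetical toy counts, for illustration only.
docs = [{"taxonomy": 2, "market": 1}, {"taxonomy": 2}, {"market": 4}]
target = {"taxonomy": 4, "market": 5}
contrast = [{"taxonomy": 1, "market": 10}]

dp = domain_pertinence("taxonomy", target, contrast)
dc = domain_consensus("taxonomy", docs)
print(dp, dc, domain_score(dp, dc, lc=0.5))
```

A term spread evenly over many documents gets a high consensus value under this assumption, while a term concentrated in a single document gets zero.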

Methodology-ADTCT Framework

• Concept hierarchy creation

– subsumption method

– hierarchical clustering algorithm

Methodology-ADTCT Framework

• subsumption method

– Concept x potentially subsumes concept y if:

– A score is calculated for each potential parent
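
As a minimal sketch of the subsumption idea, the code below estimates P(x|y) from document co-occurrence and keeps x as a potential parent of y when P(x|y) >= t and P(y|x) < 1. These two conditions follow the classic co-occurrence subsumption test and are an assumption here, since the slide does not spell out the condition. Scoring a parent by P(p|x) alone is also a simplification, and the document sets are hypothetical.

```python
def cond_prob(x, y, docs_of):
    # P(x|y): share of the documents containing y that also contain x.
    docs_y = docs_of[y]
    return len(docs_of[x] & docs_y) / len(docs_y) if docs_y else 0.0

def potential_parents(y, docs_of, t=0.4):
    # Assumed test: x is a candidate parent of y when P(x|y) >= t
    # and P(y|x) < 1. Returns {candidate: P(candidate|y)}.
    parents = {}
    for x in docs_of:
        if x == y:
            continue
        p_x_given_y = cond_prob(x, y, docs_of)
        if p_x_given_y >= t and cond_prob(y, x, docs_of) < 1.0:
            parents[x] = p_x_given_y
    return parents

def best_parent(y, docs_of, t=0.4):
    # Simplified score: pick the candidate with the highest P(p|y).
    parents = potential_parents(y, docs_of, t)
    return max(parents, key=parents.get) if parents else None

# Hypothetical document sets per concept.
docs_of = {
    "technology": {1, 2, 3, 4, 5},
    "adaptation": {2, 3, 6},
    "technology adaptation": {2, 3, 4},
}
print(potential_parents("technology adaptation", docs_of))
print(best_parent("technology adaptation", docs_of))
```

Raising t keeps fewer candidate parents, so fewer concepts can be attached below the top level.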

Methodology

– Explained

• P(p|x), ex : ‘Technology adaptation’, potential parents : Technology, Technological, Adaptation

[Figure: worked example for ‘Technological’ and ‘Technology adaptation’ with threshold t = 0.4; values shown: 0.4, 0.2, 0.05, 0.32, 0.6]

Methodology-ADTCT Framework

• Hierarchical clustering method

– Algorithm:

1. Start with n clusters (each term is a cluster).

2. Compute the distances between clusters.

3. Merge the two nearest clusters into one cluster.

4. Return to step 2 if more than one cluster remains; otherwise, the algorithm has finished.

– Distance measures:

» document co-occurrence similarity

» window-based similarity
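
The clustering steps can be sketched roughly as follows. Average linkage is used to measure the distance between clusters; that choice, the function names, and the toy distance function are assumptions made for illustration, not details taken from the slides. The sketch expects at least one term.

```python
from itertools import combinations

def average_linkage(a, b, distance):
    # Mean pairwise distance between the members of two clusters.
    return sum(distance(x, y) for x in a for y in b) / (len(a) * len(b))

def agglomerate(terms, distance):
    # Step 1: every term starts as its own cluster, stored as (tree, members).
    clusters = [(term, (term,)) for term in terms]
    while len(clusters) > 1:
        # Step 2: compute the distances between all pairs of clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]][1], clusters[p[1]][1], distance),
        )
        # Step 3: merge the two nearest clusters into one.
        (tree_i, members_i), (tree_j, members_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j), members_i + members_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        # Step 4: the loop repeats while more than one cluster remains.
    return clusters[0][0]

# Hypothetical 1-D positions standing in for term distances.
position = {"a": 0, "b": 1, "c": 5, "d": 6}
tree = agglomerate(list(position), lambda x, y: abs(position[x] - position[y]))
print(tree)
```

The nested tuples record the merge order, so the result can be read directly as a binary hierarchy over the terms.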

Methodology-ADTCT Framework

– document co-occurrence similarity

– window-based similarity

» Suppose that we have a document with four concepts: ‘Ad,’ ‘Bert,’ ‘Cees,’ and ‘Dirk.’ If the window size is 2, the following windows are created for this document: {Ad}, {Ad, Bert}, {Bert, Cees}, {Cees, Dirk}, and {Dirk}.

Methodology-Implementation

– hierarchical clustering algorithm, ex : ‘System’ appears in documents {1,3,6,8} and windows {1,5,10,14,18,20,28}; ‘Process’ appears in documents {1,3,6,12} and windows {1,5,12,14,18,25,30}.

– window similarity :

– document similarity :

– the similarities are converted to distances:

[Figure: Min, Max, and Avg values; Avg = 0.15]
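
To make the ‘System’ / ‘Process’ example concrete, the sketch below computes the overlap of their document sets and window sets. The slide does not state the similarity formula, so the Jaccard coefficient and the conversion distance = 1 - similarity are assumptions chosen purely for illustration; they are not meant to reproduce the values on the slide.

```python
def jaccard(a, b):
    # Assumed similarity: shared items divided by all items of either set.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

system = {"documents": {1, 3, 6, 8}, "windows": {1, 5, 10, 14, 18, 20, 28}}
process = {"documents": {1, 3, 6, 12}, "windows": {1, 5, 12, 14, 18, 25, 30}}

shared_documents = system["documents"] & process["documents"]
shared_windows = system["windows"] & process["windows"]

document_similarity = jaccard(system["documents"], process["documents"])
window_similarity = jaccard(system["windows"], process["windows"])

# Assumed conversion from similarity to distance.
document_distance = 1 - document_similarity
window_distance = 1 - window_similarity

print(sorted(shared_documents), sorted(shared_windows))
print(document_similarity, window_similarity)
```

The two concepts share documents 1, 3, and 6 and windows 1, 5, 14, and 18, whichever similarity formula is applied on top of those overlaps.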

Methodology

• ADTCT Implementation

Experiments

• Experimental setup

– lexical precision :

– common semantic cotopy :

– local taxonomic precision :

– taxonomic precision and recall :

– taxonomic F-measure (TF):
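
The measures listed above can be sketched as follows, using the definitions commonly applied in gold-standard taxonomy evaluation; the slide's own formulas are not shown, so treat these as assumed definitions. A taxonomy is represented as a child-to-parent dictionary, the two toy taxonomies are hypothetical, and scoring a concept with an empty cotopy as 1.0 only when both cotopies are empty is a convention chosen for this sketch.

```python
def ancestors(concept, parent):
    # All concepts above `concept` in a child -> parent mapping (a tree).
    found = set()
    while parent.get(concept) is not None:
        concept = parent[concept]
        found.add(concept)
    return found

def common_cotopy(concept, parent, shared):
    # Shared concepts that are ancestors or descendants of `concept`.
    above = ancestors(concept, parent)
    below = {c for c in parent if concept in ancestors(c, parent)}
    return (above | below) & shared

def lexical_precision(learned, reference):
    # Share of learned concepts that also occur in the reference.
    return len(set(learned) & set(reference)) / len(learned)

def taxonomic_precision(learned, reference):
    # Mean local taxonomic precision over the shared concepts.
    shared = set(learned) & set(reference)
    scores = []
    for concept in shared:
        a = common_cotopy(concept, learned, shared)
        b = common_cotopy(concept, reference, shared)
        if a:
            scores.append(len(a & b) / len(a))
        else:
            scores.append(1.0 if not b else 0.0)  # convention for empty cotopy
    return sum(scores) / len(scores) if scores else 0.0

def taxonomic_f(learned, reference):
    # Harmonic mean of taxonomic precision and recall; recall is
    # precision with the roles of the two taxonomies swapped.
    tp = taxonomic_precision(learned, reference)
    tr = taxonomic_precision(reference, learned)
    return 2 * tp * tr / (tp + tr) if tp + tr else 0.0

# Hypothetical taxonomies: concept "c" hangs under a different parent.
reference = {"root": None, "a": "root", "b": "root", "c": "a"}
learned = {"root": None, "a": "root", "b": "root", "c": "b"}

print(lexical_precision(learned, reference))
print(taxonomic_precision(learned, reference), taxonomic_f(learned, reference))
```

Here every learned concept exists in the reference, so lexical precision is perfect, while the misplaced concept lowers the taxonomic scores.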

Experiments

• Experimental results

Experiments

– making the trade-off decision mathematically

Suppose the minimal average depth is 3 and the minimal quality is 0.60. The thresholds t=0.20, t=0.25, and t=0.30 obey these constraints. Take γ=0.40 and λ=0.60.

[Equations: values computed for t=0.20, t=0.25, and t=0.30]

Experiments

Conclusions

• Our evaluation in the field of management and economics indicates that a trade-off between taxonomy quality and depth must be made when choosing one of these methods.

• The subsumption method is preferable for shallow taxonomies, whereas the hierarchical clustering algorithm is recommended for deep taxonomies.

Comments

• Advantages

- Automatically creates taxonomies that approach the quality of manually created taxonomies while saving time

• Applications

- Clustering, Classification, etc.