Treatment of Semantic Heterogeneity using Meta-Data Extraction and Query Translation. Robert Strötgen, GESIS Social Science Information Centre, Bonn. euroCRIS 2002, 29th August 2002

Page 1

Treatment of Semantic Heterogeneity ...
... using Meta-Data Extraction and Query Translation

Robert Strötgen
Social Science Information Centre (GESIS), Bonn

euroCRIS 2002, 29th August 2002

Page 2

Outline

- What is semantic heterogeneity?
- Meta-Data extraction
- Semantic relations
- Query translation
- Outlook

Page 3

Project CARMEN

- Metadata (Dublin Core Element Set in RDF, “Meta-Maker”, digital signatures)
- Retrieval on structured documents and heterogeneous data types (search engine and gatherer for XML documents)
- Methods for the treatment of the remaining semantic heterogeneity in CARMEN

Page 4

Semantic Heterogeneity

- Technical heterogeneity (different platforms, databases, formats) is not the concern of CARMEN
- Semantic heterogeneity appears in different data collections using
  - different thesauri or classifications for content description
  - varying metadata or no metadata at all
- or when intellectually indexed documents meet completely un-indexed Internet pages

Page 5

Material: Social Sciences

- SOLIS/FORIS vs. Internet documents from the social sciences
- Specialized documentation databases with high-quality content description such as abstracts, controlled keywords and classification
- Internet documents: in the majority of cases without any metadata, with high semantic and formal heterogeneity

Page 6

Extraction of Meta-Data

[Figure: extractor heuristics process an unstructured PostScript document or a structured HTML document (example source: www.tum.de/preprints/...) and produce metadata. Example values shown: dc:creator “Schmid, Werner”; dcq:abstract “Safety analysis of nuclear reactors strongly relies on numerical simulation of the reactor core. ...”; dc:subject (Keyword) “Multigrid Methods, Eigenvalue Problems, Multigroup Diffusion”; dc:subject (Classification) node with rdf:type, rdf:value and rdfs:label for Math. Subject Classification 65N55, “Multigrid methods; domain decomposition”; further MSC entries.]

Page 7

Meta-Data in Test Corpus

- Size: 3,661 documents
- File format: only HTML documents
- TITLE:
  - Correct title tags: 96 %
  - Title, but incorrectly coded: 17.7 % of the rest
- KEYWORD:
  - Correct keyword tags: 25.5 %
- ABSTRACT:
  - Correct description tags: 21 %
  - Abstract, but incorrectly coded: 39.4 % of the rest

Page 8

Extraction from HTML files - Some Problems

- Missing or irregular use of meta tags (author, keywords, DC tags)
- Inconsistent use of semantic HTML tags (title, h1, h2, address etc.)
- Irregular formatting style for context information (type size, type style, horizontal orientation etc.)
- Missing context information (date, author, institution etc.)
- Use of HTML that does not conform to the specification!

Page 9

Converting HTML → XML

- Advantages:
  - (Syntactical) homogenisation of HTML files
  - XML allows the use of many existing tools for document analysis, particularly the query language XPath
- Disadvantage:
  - Poor performance of the conversion process (not a big issue: extraction runs during the gathering process, not at retrieval time)
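To illustrate why the HTML to XML step helps, here is a minimal sketch using only the Python standard library. It is not the CARMEN converter: the class name `SoupToXml`, the helper `html_to_xml`, the list of void tags and the sample page are all assumptions made for illustration. It turns loosely written HTML into a well-formed tree and then queries it with the XPath subset that `xml.etree.ElementTree` supports.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Tags that never carry a closing tag in HTML (assumed subset, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class SoupToXml(HTMLParser):
    """Builds a well-formed ElementTree from loosely written HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lower-cases tag and attribute names.
        attributes = {name: value or "" for name, value in attrs}
        element = ET.SubElement(self.stack[-1], tag, attributes)
        if tag not in VOID_TAGS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for index in range(len(self.stack) - 1, 0, -1):
            if self.stack[index].tag == tag:
                del self.stack[index:]
                break

    def handle_data(self, data):
        current = self.stack[-1]
        if len(current):
            # Text after a child element belongs to that child's tail.
            last = current[-1]
            last.tail = (last.tail or "") + data
        else:
            current.text = (current.text or "") + data


def html_to_xml(html: str) -> ET.Element:
    parser = SoupToXml()
    parser.feed(html)
    parser.close()
    return parser.root


# Hypothetical page with mixed-case tags and unclosed elements.
sample = """<html><head><TITLE>Multigrid Methods</title>
<meta name="keywords" content="eigenvalue problems">
<body><h1>Multigrid Methods<p>Unclosed paragraph<br>second line</body>"""

tree = html_to_xml(sample)
title = tree.find(".//title").text
keywords = tree.find(".//meta[@name='keywords']").get("content")
headline = tree.find(".//h1").text
```

After the conversion, every later extraction rule can be written as a path expression instead of ad hoc string matching, which is the homogenisation benefit named on the slide.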

Page 10

HTML Heuristic: Title (part)

    /* "contains" always means a case-insensitive substring match */
    /* the number in Title[...] is the weight of the rule, e.g. Title[0.8] */
    if (<title> exists && <title> does not contain "untitled" && HMAX exists) {
        if (<title> == HMAX) {
            <1> Title[1] = <title>
        } elsif (<title> contains HMAX) {
            <2> Title[0.8] = <title>
        } elsif (HMAX contains <title>) {
            <3> Title[0.8] = HMAX
        } else {
            <4> Title[0.8] = <title> + HMAX
        }
    } elsif (<title> exists && S exists) {
        /* i.e. <title> exists AND an item //p/b, //i/p etc. exists */
        <5> Title[0.5] = <title> + S
    } elsif (<title> exists) {
        <6> Title[0.5] = <title>
    } elsif (<Hx> exists) {
        <7> Title[0.3] = HMAX
    } elsif (S exists) {
        <8> Title[0.1] = S
    }
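The heuristic on this slide can be sketched in Python. This is only an illustration under stated assumptions: `title`, `hmax` (text of the highest-ranking heading) and `s` (text of an emphasised item such as //p/b) are assumed to be already-extracted strings or `None`; the bracketed numbers are read as rule weights, with the decimal comma of the original slide; and `+` is read as concatenation with a space. The function name and return shape are hypothetical.

```python
from typing import Optional, Tuple


def contains(haystack: str, needle: str) -> bool:
    """Case-insensitive substring test, as the slide's comments require."""
    return needle.lower() in haystack.lower()


def title_heuristic(
    title: Optional[str], hmax: Optional[str], s: Optional[str]
) -> Optional[Tuple[str, float, int]]:
    """Returns (candidate title, weight, rule number), or None if no rule applies."""
    if title and not contains(title, "untitled") and hmax:
        if title == hmax:
            return title, 1.0, 1
        if contains(title, hmax):
            return title, 0.8, 2
        if contains(hmax, title):
            return hmax, 0.8, 3
        return f"{title} {hmax}", 0.8, 4
    if title and s:
        return f"{title} {s}", 0.5, 5
    if title:
        return title, 0.5, 6
    if hmax:  # some <Hx> heading exists, so HMAX is defined
        return hmax, 0.3, 7
    if s:
        return s, 0.1, 8
    return None


# Hypothetical inputs: the <title> text contains the main heading (rule 2).
candidate = title_heuristic("GESIS: Home", "home", None)
```

Returning the rule number alongside the weight makes it easy to check afterwards which rules fire most often on a corpus, which matters given the maintenance cost of such heuristics.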

Page 11

Results and Outlook

- Extraction of Meta-Data
  - TITLE: 80 % extracted with medium or high quality
  - KEYWORDS: nearly 100 % extracted with high quality
  - ABSTRACTS: 90 % extracted with medium/high quality
- Conclusion
  - In principle transferable to other domains
  - Expensive maintenance
  - Only a compromise solution until builders of web pages use Dublin Core or another Meta-Data standard

Page 12

Semantic Relations

- Intellectual transfer relations (Cross-Concordances)
  - Tools for creation: SIS-TMS for thesauri, CarmenX for classifications
- Statistical transfer relations (Co-occurrence analysis)

Page 13

Cross-Concordances in SIS-TMS

Page 14

SIS-TMS Correlation Editor

Page 15

Parallel Corpus

[Figure: document set A (doc. A1, A2, A3) is indexed with terms a, b, c, d from one thesaurus or classification; document set B (doc. B1, B2, B3) is indexed with terms x, a, y, z from another thesaurus or classification. The known relation of documents yields a derived relation of terms.]

Page 16

Corpus with Internet Documents

[Figure: Internet full-text documents are linked to terms from the prob. indexer (x, a, y, z, ...), but not to the classification/thesaurus (a, b, c, d, ...).]

Internet documents from the social sciences are not indexed using a thesaurus or classification

Page 17

Simulating a Parallel Corpus

[Figure: Internet full-text documents with terms from the prob. indexer (x, a, y, z, ...); a probabilistic search links the classification/thesaurus entries (a, b, c, d, ...) to the documents with weights such as 0.5, 0.1 and 0.8.]

A probabilistic search results in weighted relations of classification classes or thesaurus terms to documents

Page 18

Result: Simulated Parallel Corpus

[Figure: each Internet full-text document carries weighted links (values between 0.1 and 0.9) both to terms from the prob. indexer (x, a, y, z, ...) and to classification/thesaurus entries (a, b, c, d, ...).]

Page 19

Term-Term-Matrix

classification/thesaurus entry: related terms from prob. indexer (weight)
a: x (0.8); y (0.4)
b: x (0.3); z (0.3)
c: a (0.2); y (0.4)
d: x (0.6); y (0.7)
...
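A term-term matrix of this kind could be derived from a simulated parallel corpus roughly as follows. The slides do not state which association measure is used, so this sketch assumes a simple weighted co-occurrence score: for a thesaurus entry t and a free term f, the sum over documents of (weight of t) × (weight of f), divided by the total weight of t. The function name, the data layout and the sample weights are hypothetical.

```python
from collections import defaultdict
from typing import Dict

Weighted = Dict[str, Dict[str, float]]  # outer key -> {inner key: weight}


def term_term_matrix(
    thesaurus_side: Weighted, free_side: Weighted, threshold: float = 0.0
) -> Weighted:
    """Relates thesaurus entries to free terms via documents that carry both.

    Both inputs map a document id to its weighted terms. The score is an
    assumed measure, not necessarily the one used in CARMEN.
    """
    joint = defaultdict(lambda: defaultdict(float))
    mass = defaultdict(float)
    for doc, entry_weights in thesaurus_side.items():
        term_weights = free_side.get(doc, {})
        for entry, entry_weight in entry_weights.items():
            mass[entry] += entry_weight
            for term, term_weight in term_weights.items():
                joint[entry][term] += entry_weight * term_weight

    matrix: Weighted = {}
    for entry, row in joint.items():
        if mass[entry] <= 0:
            continue
        scores = {term: value / mass[entry] for term, value in row.items()}
        matrix[entry] = {
            term: round(score, 3)
            for term, score in scores.items()
            if score >= threshold
        }
    return matrix


# Invented sample weights, for illustration only.
thesaurus_side = {"doc1": {"a": 0.8, "b": 0.1}, "doc2": {"a": 0.5}}
free_side = {"doc1": {"x": 0.9, "y": 0.4}, "doc2": {"x": 0.6}}

matrix = term_term_matrix(thesaurus_side, free_side)
strong_only = term_term_matrix(thesaurus_side, free_side, threshold=0.3)
```

The `threshold` parameter is one simple way to keep weakly related terms out of the matrix.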

Page 20

Tool: Jester

Java Environment for Statistical TransfERs: Support and assistance for creating statistical transfer relations from a parallel corpus

Page 21

Query Transformation

[Figure: a query is sent to databases A, B and C. The original Query goes to one database unchanged, while transformations produce a translated Query' for each of the others.]

Page 22

Binding of Query Languages

Pluggable QueryParsers and QueryPrinters for different query languages make exploitation in other contexts easy.
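The idea of pluggable parsers and printers can be sketched as follows. This is not the CARMEN API (which the slides describe only as Java classes): the `Query` structure, the interface and method names, and both toy query syntaxes are assumptions for illustration. The sample relation reuses two terms from the “Leiharbeit” search as example data.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass
class Query:
    """Language-neutral form: an AND of OR-groups of terms (a simplification)."""
    groups: List[List[str]]


class QueryParser(Protocol):
    def parse(self, text: str) -> Query: ...


class QueryPrinter(Protocol):
    def render(self, query: Query) -> str: ...


class AndParser:
    """Hypothetical input syntax: terms joined by ' AND '."""

    def parse(self, text: str) -> Query:
        return Query([[t.strip()] for t in text.split(" AND ") if t.strip()])


class ParenPrinter:
    """Hypothetical output syntax: '(x OR y) AND (z)'."""

    def render(self, query: Query) -> str:
        return " AND ".join("(" + " OR ".join(g) + ")" for g in query.groups)


def translate(
    text: str,
    parser: QueryParser,
    printer: QueryPrinter,
    relations: Dict[str, List[str]],
) -> str:
    """Parse, replace each term by its transfer terms (or keep it), print."""
    query = parser.parse(text)
    groups = [
        [new for term in group for new in relations.get(term, [term])]
        for group in query.groups
    ]
    return printer.render(Query(groups))


# Sample relation table, for illustration only.
relations = {"Leiharbeit": ["Leiharbeit", "Organisationsmodell"]}
translated = translate(
    "Leiharbeit AND Migration", AndParser(), ParenPrinter(), relations
)
```

Because `translate` depends only on the two interfaces, supporting another query language means adding one parser or printer class without touching the transfer step.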

Page 23

CARMEN Transfer Architecture

- Retrieval server (HyRex) identifies transferable parts of a query and sends them to the transfer service
- Exchange of partial queries using XML/XIRQL
- Transfer service runs as Tomcat servlet server

[Figure: HyRex receives an XIRQL query, sends XIRQL partial queries via http (text/xirql) to the transfer module (CGI/Servlets) for query transfer, receives the translated XIRQL partial queries' via http (text/xirql), and assembles the XIRQL query'.]

Page 24

Evaluation of Transfer Modules

- Retrieval tests using transfer modules (on a corpus of Internet documents indexed with Fulcrum SearchServer)
- Limitation: the weight information of the transfer relations is not used
- Tested transfer: SOLIS/IZ-Thesaurus → SoWi Internet documents/free terms
- Comparison: search using IZ-Thesaurus terms vs. search using free terms from the transfer
- 2 exemplary searches for each of 3 domains (women's studies, migration, sociology of industry)

Page 25

Exemplary Search: “Dominanz”

- “Dominanz” (“dominance”): 16 relevant documents
- 10 transfer terms (Dominanz, Messen, Mongolei, Nichtregierungsorganisation, Flugzeug, Datenaustausch, Kommunikationsraum, Kommunikationstechnologie, Medienpädagogik, Wüste):
  - 14 additional documents, 7 of them relevant (50%, increase 44%)
- Precision: 77%

Page 26

Exemplary Search: “Leiharbeit”

- “Leiharbeit” (“temporary work”): 10 relevant documents
- 4 transfer terms (Leiharbeit, Arbeitsphysiologie, Organisationsmodell, Risikoabschätzung):
  - 10 additional documents, 2 of them relevant (20%, increase 20%)
- Precision: 60%
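The slides do not spell out how the percentages for the two exemplary searches were computed. The short sketch below reproduces all of the reported figures under one assumption: the documents found without transfer are all counted as relevant, so precision is taken over the combined result set. The function name is hypothetical.

```python
from typing import Tuple


def transfer_figures(
    relevant_before: int, added: int, added_relevant: int
) -> Tuple[int, int, int]:
    """Returns rounded percentages:
    (relevant share of the added documents,
     increase in relevant documents,
     precision of the combined result set)."""
    share = 100 * added_relevant / added
    increase = 100 * added_relevant / relevant_before
    precision = 100 * (relevant_before + added_relevant) / (relevant_before + added)
    return round(share), round(increase), round(precision)


dominanz = transfer_figures(relevant_before=16, added=14, added_relevant=7)
leiharbeit = transfer_figures(relevant_before=10, added=10, added_relevant=2)
```

For “Dominanz” this gives 7/14 = 50%, 7/16 ≈ 44% and 23/30 ≈ 77%; for “Leiharbeit” it gives 2/10 = 20%, 2/10 = 20% and 12/20 = 60%, matching the slides.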

Page 27

Results

- All exemplary searches using transfers lead to additional relevant documents compared with a search without transfer
- Share of relevant documents among all new documents: between 13% and 55%
- Transfer terms are not always self-evident (example: “Wüste” (“desert”))
- Sometimes a very large number of transfer terms (user parametrization or better algorithms needed)

Page 28

Outlook (What needs to be done?)

- Improvement of parallel corpora:
  - Kind of documents
  - Diversity of document types
  - Diversity of institutions / web sites
  - Domain
  - Corpus size
- Comparison of transfers using statistical relations and intellectual relations
- Improvement of algorithms
- Effect of interactive, repeated retrieval and user parametrization / adjustment
- User tests

Page 29

Exploitation

- Services (transfer)
- Software (Java classes)
- Projects:
  - Virtuelle Fachbibliothek Sozialwissenschaften (ViBSoz)
  - European Schools Treasury Browser (ETB)
  - Informationsverbund Bildung – Sozialwissenschaften – Psychologie (InfoConnex)
- Contact: [email protected]