1

Current research information as part of digital libraries and the heterogeneity problem.

Integrated searches in the context of databases with different content analyses.

CRIS2002, Kassel

Jürgen Krause

University of Koblenz-Landau and Social Science Information Centre (IZ-Bonn)

Lennéstr. 30, 53113 Bonn, Germany, mailto:Krause@bonn.iz-soz.de

2

jk

Heterogeneity

[Figure: decentralized/polycentric document space. Labels: transfer and coordination; users; application area n; document sets M 1 to M 6 with differing content analysis (high relevance, high quality content analysis; less relevance, no abstracts, high quality indexing; M 5: low relevance, WWW documents, search access by search engines; M 3: only titles, simple automatic indexing; M 2: high relevance, probabilistic automatic indexing of full text); WWW documents c.a. by scientists; providers: information service centers, publishers, scientists, libraries (library catalogues), WWW (electronic publishing).]

3

“Scientists are increasingly using search engines to locate research of interest; some rarely use libraries, locating research articles primarily online ... About 85% of users use search engines to locate information.”

Lawrence/Giles (1999:107)

“It doesn’t matter what you want to know, there are people in the Internet who already have this knowledge and want to help you”

(Hahn, 1999: 107)

4

Weizenbaum 2000/2001 Germany

„Das Internet ist ein großer Misthaufen ...“ (“The internet is a large dunghill ...”)

Nov. 2000: congress „Gutenbergs Folgen“ (“Gutenberg's Consequences”), Mainz

May 2001: specialist seminar, Hamburg

www.heise.de/newsticker/data/wst-03.05.01-001/

5

WWW today:

a) “Worst case” of the ambiguity problem

Out of the estimated 800 million pages on around three million servers, only 6% relate to the fields of science and education (by comparison: 1.5% relate to pornography).

NEC 1999 (NEC 2000: a thousand million pages)

No reasonable results can be obtained without additional conceptual components.

6

Summary and consequences

When used for specialist information retrieval (IR), general WWW search engines run counter to nearly every criterion which actually permits a successful search based on IR knowledge. This involves all the main components of an IR system: the database and its selection, the use of research logic, and user expectations. Based on his/her knowledge of these aspects, the user should develop the best possible research strategy, something which is impossible with WWW search engines.

7

Nevertheless, WWW search engines have one advantage over current specialist databases: embedded in an enormous volume of irrelevant data are data which are not found in specialist databases and which may be of value to experts. It is therefore not possible simply to return to the recommendation to narrow down the search to the original specialist databases. New ways have to be found to make research that includes WWW sources more satisfactory than is the case at present with general WWW search engines.

8

Heterogeneity

[Figure repeated from slide 2: decentralized/polycentric document space with the document sets M 1 to M 6, transfer and coordination, and the users of application area n.]

9

Conceptual gaps

In addition to technological integration:

• Different stages of content analysis of textual data:
  • an intelligently indexed term in a library classification
  • ……
  • automatic full text indexing in fully unstructured data pools

Descriptor A in one such system: wide range of meanings

10

Research Projects IZ Bonn

ViBSoz “Social Science Virtual Library”, Virtual Library Project of the German Research Association (DFG)

CARMEN “Content Analysis, Retrieval and Metadata: Effective Networking”, special support program of the German Ministry of Education and Research (BMBF)

ELVIRA “Electronic Retrieval and Analysis System for Industrial Associations”, funded by the German Federal Ministry of Economics and Technology

ETB “The European Schools Treasury Browser”, funded by the European Commission

11

Metadata

U.S. Bureau of the Census: Integrated Information solutions – The future of census bureau data access and dissemination, Sept. 1999. Working paper

“Recent surveys of Census Bureau customers show that two out of three use multiple data sets. ... If we continue to saddle data users with the burden of putting data from disparate sources into digestible forms, we do it at the risk of our own peril.” (p. 2)

“Solutions of these issues ... will revolve around the further development of standards, metadata ...” (p. 3)

“IIS will help minimize data user burden, data uncertainty and maximize data quality and usefulness through the use of metadata” (p. 2)

12

CARMEN: Remaining heterogeneity

[Figure labels: retrieval; metadata; heterogeneity; documents.]

13

DIN SICT paper: German position

„Strategie für die Standardisierung der Informations- und Kommunikationstechnik (ICT)“ (“Strategy for the standardization of information and communication technology (ICT)”), DIN Berlin 2002, draft

... It is ... necessary to find a new concept relating to the still existing demand for consistency retention and interoperability. This concept can be described by means of the following premise: standardization must be considered in terms of the remaining heterogeneity. Only joint interaction between intellectual and automatic processes for the treatment of heterogeneity and standardization will produce a solution strategy which also ensures, under present-day marginal conditions, usable consistency and interoperability conditions.

(translation from German)

14

CARMEN and ViBSoz: Coping with heterogeneity by transfer components

Retrieval

Metadata

Coping with heterogeneity

• Cross-concordances
• Statistical transformation and neural networks
• Deductive methods

Documents

extract metadata from various document formats algorithmically

15

Mathematics – Physics: MSC and PACS

statistical:

PACS 62.30.+d Mechanical and elastic waves; vibrations

MSC 74S15 Boundary element methods

intellectual:

PACS 62. Not connected

16

Example: semantic-pragmatic relation

Simple search

Search term: Dominanz (dominance)

Number of relevant hits: 16

G. Binder

17

Extended search

Transfer terms: Dominanz (dominance), Messen (measuring), Mongolei (Mongolia), Nichtregierungsorganisation (non-governmental organization), Flugzeug (aircraft), Datenaustausch (data exchange), Kommunikationsraum (communication space), Kommunikationstechnologie (communication technology), Medienpädagogik (media education), ...

Number of additional relevant hits: 7

Share of additional relevant hits among the additional hits: 50%

G. Binder

• Members of the association wom@n travelled to the UN Women's Conference in Beijing. On the journey through Mongolia and the desert ...
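
The figures on the two search slides can be related with a little arithmetic. The sketch below is illustrative only: it assumes that the 50% share is exact and that the extended search keeps all hits of the simple search, neither of which the slides state. All variable names are hypothetical.

```python
# Figures quoted on the slides (G. Binder).
simple_relevant = 16          # relevant hits of the simple search for "Dominanz"
additional_relevant = 7       # additional relevant hits found via the transfer terms
additional_precision = 0.5    # share of relevant hits among the additional hits

# Assumption: the 50% share is exact, so the transfer terms retrieved
# about 7 / 0.5 = 14 additional hits in total.
additional_hits = additional_relevant / additional_precision

# Assumption: the extended search keeps every hit of the simple search.
total_relevant = simple_relevant + additional_relevant
relative_gain = additional_relevant / simple_relevant

print(f"additional hits (estimated): {additional_hits:.0f}")
print(f"relevant hits after transfer: {total_relevant}")
print(f"relative gain in relevant hits: {relative_gain:.0%}")
```

Under these assumptions the transfer terms add roughly 14 hits, half of them relevant, which raises the number of relevant hits from 16 to 23.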

18

Text fact integration: simple directed transfer in ELVIRA

[Figure labels: information need; formalization (texts? facts?); text query; fact query; transformations; texts; facts; direct links; iterative search.]

19

Standard method: one step transformation

non-differentiated handling of vagueness

A B C

document term sets

question

20

Two step transformation

V1: Handling of vagueness between questions and terms

V2/V3: Bilateral handling of vagueness

[Figure labels: question; V1; document term sets A, B, C; V2; V3.]
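
The two step idea can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the IZ projects. It assumes that V1 links the question to term set A and that V2 and V3 are separate bilateral modules from A to B and from A to C; the slide only names the modules. All mappings, names, and example terms are hypothetical.

```python
from typing import Dict, Set

# Hypothetical toy data: none of these mappings come from the slides.
# V1: vagueness between the user's question and the terms of vocabulary A.
V1_QUERY_TO_A: Dict[str, Set[str]] = {
    "young people without jobs": {"youth unemployment"},
}

# V2 / V3: bilateral transfer modules from vocabulary A to the term sets
# of collections B and C. Each module is maintained separately.
V2_A_TO_B: Dict[str, Set[str]] = {
    "youth unemployment": {"adolescent", "unemployment"},
}
V3_A_TO_C: Dict[str, Set[str]] = {
    "youth unemployment": {"jobless youth"},
}


def transfer(terms: Set[str], module: Dict[str, Set[str]]) -> Set[str]:
    """Map each term through one module; unmapped terms pass through unchanged."""
    result: Set[str] = set()
    for term in terms:
        result |= module.get(term, {term})
    return result


def two_step(question: str) -> Dict[str, Set[str]]:
    """Step 1 resolves the question to A; step 2 applies one module per target."""
    terms_a = V1_QUERY_TO_A.get(question, {question})
    return {
        "A": terms_a,
        "B": transfer(terms_a, V2_A_TO_B),
        "C": transfer(terms_a, V3_A_TO_C),
    }


queries = two_step("young people without jobs")
for collection, terms in sorted(queries.items()):
    print(collection, sorted(terms))
```

Keeping V2 and V3 as separate tables mirrors the bilateral handling named on the slide: each pair of term sets gets its own module instead of one undifferentiated match between the question and all documents.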

21

[Figure: a broker connects Thesaurus A, Thesaurus B and Thesaurus C (collections A, B, C): X from A, Y from B, Z from C. Example labels: Jugendlicher + Arbeitslosigkeit (adolescent + unemployment); Jugendarbeitslosigkeit (youth unemployment); IZ Thesaurus (IZ Soz.); SWD (USB Köln).]
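
A broker of this kind can be pictured as a lookup over cross-concordance relations. The sketch below is a hypothetical illustration, not the ViBSoz or CARMEN code. It assumes one reading of the example on the slide, namely that the SWD descriptor "Jugendarbeitslosigkeit" corresponds to the combination "Jugendlicher" AND "Arbeitslosigkeit" in the IZ Thesaurus. The class, the table layout, and the fallback behaviour are all assumptions.

```python
from typing import Dict, FrozenSet, Tuple

# Hypothetical cross-concordance table. Key: (source thesaurus, target
# thesaurus, source descriptor). Value: descriptors to combine with AND.
# The single entry is an assumed reading of the example on the slide.
CrossConcordance = Dict[Tuple[str, str, str], FrozenSet[str]]

CONCORDANCE: CrossConcordance = {
    ("SWD", "IZ", "Jugendarbeitslosigkeit"): frozenset(
        {"Jugendlicher", "Arbeitslosigkeit"}
    ),
}


class Broker:
    """Translates a descriptor from one thesaurus into another."""

    def __init__(self, table: CrossConcordance) -> None:
        self.table = table

    def translate(self, source: str, target: str, descriptor: str) -> FrozenSet[str]:
        if source == target:
            return frozenset({descriptor})
        # Fall back to the unchanged descriptor when no relation is recorded.
        return self.table.get((source, target, descriptor), frozenset({descriptor}))


broker = Broker(CONCORDANCE)
iz_terms = broker.translate("SWD", "IZ", "Jugendarbeitslosigkeit")
print(" AND ".join(sorted(iz_terms)))
```

A real cross-concordance would also record the type of each relation (for example exact, broader, or narrower); the flat table here leaves that out for brevity.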

22

Statistical and neural network transformation

• Co-occurrence-based similarity
  • in ViBSoz: statistical crosswalk between two different thesauri (SWD as a universal thesaurus and the IZ thesaurus for the social sciences as a special thesaurus)
  • in ELVIRA: between a thesaurus for data and free text terms
• Transformation networks
  • USB Thesaurus to the IZ Thesaurus
  • the USB Thesaurus or IZ Thesaurus to the IZ

23

[Figure: precision-recall plot comparing LSI and transformation network with statistical methods. Fig. 3: Transformation network USB Thesaurus to IZ Thesaurus (Fig. 7-12 from Mandl 2000:206)]
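
A co-occurrence-based crosswalk can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used in ViBSoz or ELVIRA. It assumes a corpus in which every document is indexed with both vocabularies, and it uses the conditional probability P(target term | source term) as one simple co-occurrence measure. The sample records, the threshold, and the function name are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Hypothetical parallel corpus: each document is indexed with terms from
# both vocabularies. The records are invented for illustration.
DOCS: List[Tuple[Set[str], Set[str]]] = [
    ({"Jugendarbeitslosigkeit"}, {"Jugendlicher", "Arbeitslosigkeit"}),
    ({"Jugendarbeitslosigkeit"}, {"Jugendlicher", "Arbeitsmarkt"}),
    ({"Arbeitsmarkt"}, {"Arbeitsmarkt"}),
]


def crosswalk(
    docs: List[Tuple[Set[str], Set[str]]], threshold: float = 0.5
) -> Dict[str, Dict[str, float]]:
    """Estimate P(target term | source term) from co-occurrence counts."""
    source_count: Counter = Counter()
    pair_count: Counter = Counter()
    for source_terms, target_terms in docs:
        for s in source_terms:
            source_count[s] += 1
            for t in target_terms:
                pair_count[(s, t)] += 1

    # Keep only term pairs whose estimated probability reaches the threshold.
    table: Dict[str, Dict[str, float]] = {}
    for (s, t), n in pair_count.items():
        weight = n / source_count[s]
        if weight >= threshold:
            table.setdefault(s, {})[t] = weight
    return table


table = crosswalk(DOCS)
for source, targets in sorted(table.items()):
    print(source, sorted(targets.items()))
```

Raising the threshold trades recall for precision: fewer target terms are proposed per source term, but each is supported by more of the jointly indexed documents.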

24

Layer model

25

Conclusion

Today's search engines do not adequately solve the problem of a worldwide search for relevant documents and data in a special scientific community. They represent only an incomplete, albeit valuable, first step. Users want to interlink literature and research project databases with the catalogues of virtual libraries, the WWW homepages of science institutions and fact sources, e.g. data archives with their survey data. In this case integration should not be performed only on a technical level or solely through intellectually created links, as is the case at present. A key role is played here by automatic transfer between the different content analysis methods and standardizations of the document sets to be integrated. Based on the initial empirical results of different IZ projects, the proposed strategy appears to be highly promising: vagueness problems are not treated non-specifically as a single transfer between all documents and the query, but are handled in a cognitively plausible way by individual bilateral modules.