Eric Sieverts University Library Utrecht IT Department Institute for Media & Information Management (Hogeschool van Amsterdam)

Post on 21-Dec-2015

Google and/or/not databases

• why use search engines?
• functionality of search engines (including the latest technology)
• what is hidden from search engines?
• search engines vs. databases
• why would people prefer google?
• what is up for us, librarians?

Eric Sieverts | [email protected] | http://www.library.uu.nl/medew/it/eric | Bielefeld 2002 Conference, 7 febr 2002

why use search engines?

• easy-to-use best-match technique
• such good relevance ranking (at least for some of them)
• still a lot of additional (hidden) functionality
• recent language-technology methods
• such large collections

why use search engines?

some common document ranking parameters

• the more terms from your query in a document, the better (most engines now only return documents with "all the terms")
• the more prominent a term in a document, the better (in <title>, in the first few sentences, in a <meta> tag)
• the more frequently a search term is repeated, the better
• the closer together the terms in a document, the better
• the more uncommon a search term, the higher its weight
• the more "popular" a web-page, the better (more hyperlinks pointing to it, more people visiting it, ..): google's strong point
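
The ranking parameters listed above can be combined into a toy scoring function. This is only a minimal sketch and not the formula of any real engine: the `Page` fields, the weights, and the use of an inbound-link count as the "popularity" signal are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical page record; the field names are illustrative only."""
    title: str
    body: str
    inlinks: int  # number of hyperlinks pointing to the page ("popularity")


def tokens(text):
    return text.lower().split()


def score(query, page, collection):
    """Toy relevance score built from the ranking parameters listed above."""
    q = tokens(query)
    if not q:
        return 0.0
    body, title = tokens(page.body), tokens(page.title)
    total = 0.0
    for term in q:
        tf = body.count(term)
        if tf == 0 and term not in title:
            continue
        # the more uncommon a term in the collection, the higher its weight
        df = sum(1 for p in collection
                 if term in tokens(p.body) or term in tokens(p.title))
        idf = math.log(1 + len(collection) / max(1, df))
        # the more frequently repeated, the better (dampened with a log)
        weight = idf * (1 + math.log(1 + tf))
        # the more prominent (here: present in the title), the better
        if term in title:
            weight *= 2.0
        total += weight
    # the more query terms present in the document, the better
    matched = sum(1 for t in q if t in body or t in title)
    total *= matched / len(q)
    # the closer together the terms (first occurrences), the better
    positions = [body.index(t) for t in q if t in body]
    if len(positions) > 1:
        total *= 1 + 1 / (1 + max(positions) - min(positions))
    # the more "popular" the page, the better
    return total * (1 + math.log(1 + page.inlinks))
```

Sorting a collection with `sorted(pages, key=lambda p: score(query, p, pages), reverse=True)` then gives a best-match result list in which documents matching none of the terms score zero.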

why use search engines?

google offers a lot of additional functionality

• boolean search (if you really want to - I do occasionally!)
• "citation" search (other web-pages linking to "this" site)
• similarity search (here meaning: similar linking patterns; not really better than word-based similarity search)
• documents in the result set that have since disappeared can be retrieved from the archive cache
• many other document types than just plain html
• also image search, usenet archives, integration of the open directory subject tree

see google, google advanced search

why use search engines?

modern language technology on board

categorisation of result sets

• (formerly) northernlight's custom search folders (rule-based method)
• teoma (statistics-based method)
• wisenut (statistics-based method)
• fast-alltheweb (statistics-based method)

see teoma, wisenut

why use search engines?

search engine “sizes”

see for instance “search engine watch” (december 2001)

what is hidden from (most) search engines? (and consequently from their users!)

non-HTML documents: flash, office-files, pdf (not fundamentally impossible, as google demonstrates)

"real-time" data (too difficult to keep track of)

dynamically generated, database-driven pages (for fear of spider traps; but google seems to index them)

all information hidden in searchable databases (spiders cannot fill out database search forms)

information that has to be paid for or is licensed (bibliographic databases, full-text scientific journals, ....)

all information that is not (yet) on the web

search engines vs. databases

besides the differences in content, which are obvious to us:

differences in functionality

database                           search engine
field searching                    modern retrieval technology
boolean, proximity, truncation     relevance ranking
controlled vocabulary              ease of use
  - categories
  - thesauri
  - etc.

but do users use all of this?? (despite its importance!!)
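
The database-side functions in the comparison above (field searching, boolean operators, truncation) can be illustrated with a small sketch. The records, the field names and the `*` truncation symbol are assumptions made for this example and do not describe any particular database system.

```python
# Hypothetical bibliographic records with separately searchable fields.
records = [
    {"title": "Library automation", "author": "Jansen", "subject": "libraries"},
    {"title": "Web search engines", "author": "Smith", "subject": "information retrieval"},
    {"title": "Searching databases", "author": "Jansen", "subject": "information retrieval"},
]


def field_match(record, field, pattern):
    """Field search on whole words, with right-hand truncation:
    'librar*' matches 'library' as well as 'libraries'."""
    words = record[field].lower().split()
    pattern = pattern.lower()
    if pattern.endswith("*"):
        return any(w.startswith(pattern[:-1]) for w in words)
    return pattern in words


def search(records, all_of=(), any_of=(), none_of=()):
    """Boolean combination of (field, pattern) conditions:
    AND over all_of, OR over any_of, NOT over none_of."""
    hits = []
    for r in records:
        if not all(field_match(r, f, p) for f, p in all_of):
            continue
        if any_of and not any(field_match(r, f, p) for f, p in any_of):
            continue
        if any(field_match(r, f, p) for f, p in none_of):
            continue
        hits.append(r)
    return hits
```

For instance, `search(records, all_of=[("author", "jansen"), ("title", "search*")])` expresses "author = jansen AND title = search*", a query that a single search-engine box cannot state this precisely.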

why do students "graduate on google"?

why do so many users prefer to use search engines?

apparent simplicity of the search engine interface

too many separate other search systems to address
- overwhelming choice of databases (example)
- overwhelming choice of digital primary sources (example)

plethora of different database system interfaces

interfaces crowded with "functionality"

what would you use?
- if you didn't know what the difference is
- if you didn't know what you'd miss

do you miss so much with only google?

• google also indexes .PDF, .DOC, .PPT, .XLS, .RTF

• the web also contains preprints, reports, projects etc. that are NOT in databases

• many scientists (and others) put copies of their published articles on their personal websites

that seems fine, but you still get low recall, because:

• the web remains a very fragmented, incomplete mess (behind that simple google screen)

• it is not indexed consistently and in a controlled way

but for many users lousy recall is no problem at all .....
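
Recall, as used above, is the fraction of all relevant documents that a search actually retrieves; precision is the fraction of retrieved documents that are relevant. A minimal sketch with made-up document identifiers:

```python
def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)


def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)


# Illustrative numbers only: 4 relevant documents exist, the search finds 2 of them.
found = ["a", "b", "x"]
relevant_docs = ["a", "b", "c", "d"]
print(recall(found, relevant_docs), precision(found, relevant_docs))
```

A user who only needs one good answer is served well by high precision at the top of the list, which is why low recall often goes unnoticed.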

what is up for libraries ?

• realise better integrated access to all our precious (and expensive) information sources

• realise more advanced retrieval possibilities while keeping the advantages of controlled indexing as well

central index solution

- our own choice of advanced local search engine / retrieval software
- problems with indexing remotely stored data
- problems with non-uniform controlled indexing

meta-search / portal solution

- many remote and locally available retrieval systems addressed in a single query (via Z39.50, http, etc.)
- restricted to the common denominator of classical boolean functionality
- problems with non-uniform controlled indexing

[diagram: integrated search system, local central index solution; labelled components: internet, document text files, indexer, indexing rules for targets, central index, search, full-text links]
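
The central index solution can be sketched as a small inverted index that is filled by an indexer and answers searches with full-text links. This is a minimal illustration under stated assumptions: the class and method names are hypothetical, the example.org links are placeholders, and real retrieval software would add ranking, fields and controlled vocabulary.

```python
from collections import defaultdict


class CentralIndex:
    """Toy inverted index: term -> set of document identifiers."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.links = {}  # document id -> full-text link

    def add(self, doc_id, text, link):
        """Indexer step: register the words of one document text file."""
        self.links[doc_id] = link
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return the full-text links of documents containing all query terms."""
        terms = query.lower().split()
        if not terms:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return [self.links[i] for i in sorted(ids)]


index = CentralIndex()
index.add("d1", "digital library services", "http://example.org/d1")
index.add("d2", "library catalogue", "http://example.org/d2")
print(index.search("digital library"))
```

Because all documents end up in one locally controlled index, the library can choose its own retrieval software, at the price of having to fetch and index remotely stored data first.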

[diagram: integrated search system, metasearch / portal solution; labelled components: search, query-generator / result-collector, configuration data for targets, internet, and several target systems, each with its own search, index and files, reached via Z39.50, http, xml or an internal api]
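
The query-generator / result-collector of the metasearch solution can be sketched as a function that sends one query to every configured target and merges the answers. This is only an illustration: the `targets` mapping and the (record_id, title) result shape are assumptions, and in a real portal each callable would wrap a Z39.50, http or xml client for its target.

```python
from concurrent.futures import ThreadPoolExecutor


def metasearch(query, targets):
    """Send one query to every target and collect the results.

    `targets` maps a target name to a callable that takes the query string
    and returns a list of (record_id, title) pairs.
    """
    # Query all targets in parallel; leaving the with-block waits for all of them.
    with ThreadPoolExecutor(max_workers=max(1, len(targets))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in targets.items()}

    merged, seen = [], set()
    for name, future in futures.items():
        try:
            records = future.result()
        except Exception:
            continue  # a failing target must not break the whole search
        for record_id, title in records:
            key = title.lower()
            if key not in seen:  # naive de-duplication on the title
                seen.add(key)
                merged.append((name, record_id, title))
    return merged
```

The sketch also shows the limitation of this approach: the collector can only merge what every target is able to return, so functionality is restricted to the common denominator of the targets.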

and a look into the (near) future ....

competition between "google" and "our databases" will continue

library based search systems will improve

- automatic methods of uniform classification and controlled keyword indexing
- more flexible xml-based methods for metasearch-solutions (srw, sru)
- improved access to remote data to be locally indexed

performance of web search engines will improve as well

- xml, rdf metadata & the semantic web will improve concept- and meaning-based retrieval on the web
- ever more information will be available on the web
- newest technologies will continue to be tested on the web first
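
As an illustration of the xml-based metasearch methods mentioned in this outlook (srw, sru): an sru request is an ordinary http url carrying a cql query, and the answer comes back as xml. The sketch below only builds such a url. The base url is a hypothetical placeholder, and the parameter names are those of an sru 1.1 searchRetrieve request as I understand the specification; check them against the target's own documentation.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, start=1, maximum=10):
    """Build an sru searchRetrieve request url for a cql query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "startRecord": start,
        "maximumRecords": maximum,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical target; no request is actually sent here.
print(sru_search_url("http://sru.example.org/catalogue", 'title = "search engines"'))
```

Since the request is plain http and the response plain xml, a portal can address such targets with standard web tooling instead of a dedicated Z39.50 client.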
