
© Tefko Saracevic, Rutgers University 1

The Invisible Web
- finding things that are hard to find -

Tefko Saracevic, PhD
Rutgers University
http://www.scils.rutgers.edu/~tefko
(also contains a list of sites relevant to the topic and this presentation)


What is the “Invisible Web”?

• Materials that general search engines cannot or WILL not include in their collection of web pages (indexes)
• You cannot find them through general search engines
• Contains a vast amount of information
  – much of it authoritative, of high quality
  – much of it specialized


Why do search engines miss things?

• Size: the Web is huge; engines cannot cover it all
• Economics: associated costs are high
  – also pay per crawl & rank
• Technical: capabilities are still limited
• Spam: eliminating the bad also loses some of the good
• Restrictions: some sites do not let crawlers in
• Deep structure: some sites are complex
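
The "Restrictions" point can be illustrated with the Robots Exclusion Protocol, one common way a site asks crawlers to stay out of certain areas. The following is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt text, the crawler name `ExampleBot` and the example.org URLs are made-up assumptions, not taken from any site in this presentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration).
# A real crawler would download it from the root of the site it visits.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""


def allowed(url: str, agent: str = "ExampleBot") -> bool:
    """Return True if the rules above permit `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


# Pages under a disallowed path never reach the engine's index,
# so they stay "invisible" to anyone searching that engine.
for url in ("http://example.org/index.html",
            "http://example.org/private/report.html"):
    print(url, "->", "crawl" if allowed(url) else "skip")
```

A well-behaved crawler runs a check like this before every fetch, which is one reason parts of otherwise public sites never show up in general search engines.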


Web size - who knows?

• Web Characterization Project - OCLC
  – provides statistics about the web
  – 1998: 2.8 million, 2002: 9.04 million web sites (IP addresses)
• In 2002: 35% public, 29% private, 36% provisional sites
  – Public sites (2002):
    • 55% US, 7% German, 6% Japanese, 3% each French, Spanish, 2% each Italian, Dutch, Chinese, 1% each Korean, Russian, Polish, Portuguese
  – Adult sites (2002): 3.3%
  – IP address volatility - all sites (disappearance pattern):
    • 13% of sites in 2002 were also in 1998; 51% in 2001
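
To put the OCLC figures in perspective, the arithmetic below works out the growth factor between 1998 and 2002 and the approximate number of sites of each type in 2002. It is only a sketch: the input numbers are the ones quoted on the slide, while the function and variable names are illustrative assumptions.

```python
# Figures as quoted on the slide (OCLC Web Characterization Project).
SITES_1998_MILLION = 2.8
SITES_2002_MILLION = 9.04
SHARES_2002 = {"public": 0.35, "private": 0.29, "provisional": 0.36}


def growth_factor(old: float, new: float) -> float:
    """How many times larger the new count is than the old one."""
    return new / old


def sites_by_type(total_million: float, shares: dict) -> dict:
    """Split a total (in millions of sites) according to the given shares."""
    return {kind: total_million * share for kind, share in shares.items()}


factor = growth_factor(SITES_1998_MILLION, SITES_2002_MILLION)
print(f"Growth 1998 to 2002: about {factor:.2f} times")
for kind, millions in sites_by_type(SITES_2002_MILLION, SHARES_2002).items():
    print(f"{kind}: about {millions:.2f} million sites")
```

On these numbers the count of sites roughly tripled in four years, and only about a third of the 2002 sites (a little over 3 million) were public.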


How do search engines work?

• Crawlers, spiders: go out to find
  – new & changed sites; periodic, not for each query
• Databases, caches:
  – gather content; could be submitted, bought
• Indexing: creating appropriate entries
  – various, mostly proprietary algorithms
• Retrieval engine: searching on basis of query
• Interface: gathers query, displays results
  – could be ordered by pay
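
The indexing and retrieval steps can be sketched with a toy inverted index. This is a deliberately simple illustration, not how any particular engine works (the slide notes that real algorithms are mostly proprietary); the sample pages, URLs and function names are assumptions made up for the example.

```python
import re
from collections import defaultdict

# Stand-in for content gathered by a crawler (made-up pages).
PAGES = {
    "http://example.org/a": "Invisible web: databases hidden from crawlers",
    "http://example.org/b": "Search engines index the visible web",
    "http://example.org/c": "Library catalogs and subject databases",
}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Indexing: map each term to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Retrieval: return the URLs that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


INDEX = build_index(PAGES)
print(search(INDEX, "web"))
print(search(INDEX, "subject databases"))
```

The index is built once, ahead of time, from whatever the crawler gathered. A query only consults the index, so a page that was never crawled cannot be found, however relevant it is.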


Search engines differ

• Substantial differences among search engines on each aspect
• Information about search engines:
  Search Engine Watch – ratings, news, statistics, charts
  Search Engine Showdown – run by a librarian, news links, ratings
  Extreme Searcher – update of a popular book


Search engine coverage

• No engine covers more than 16% of the WWW
• Hard to discern & compare coverage
• Many national search engines – own coverage
• Many topical search engines – own coverage
• Many comprehensive sources independent of search engines


Specialized sources

• Meta search engines
• Specialized engines & catalogs
• Domain (subject) engines & catalogs
• Reference sources
• Libraries as web sources
• Virtual libraries
• Subject databases
• Societies, organizations


Meta search engines

• Search engines that cover search engines
  Search Engine Colossus – international meta engine
  Dogpile – results from a number of search engines
  Surfwax – gives statistics and text sources
  Search Engine Guide – categorized by topic; other engine information


meta engines … (cont.)

Vivisimo – clusters results; innovative
Complete Planet – over 100,000 databases & search engines
Webbrain – results in tree structure; fun to use
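
What a meta search engine does with "results from a number of search engines" can be sketched as a merge of several ranked lists. The round-robin interleaving below is just one simple strategy, chosen for illustration; it is not a description of how Dogpile, Vivisimo or any other engine named here actually combines results, and the result lists are invented.

```python
from itertools import zip_longest


def merge_results(result_lists):
    """Interleave ranked result lists round-robin, dropping duplicate URLs."""
    merged, seen = [], set()
    # zip_longest walks rank 1 of every engine, then rank 2, and so on,
    # padding the shorter lists with None.
    for same_rank in zip_longest(*result_lists):
        for url in same_rank:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Hypothetical ranked results from two underlying engines.
engine_a = ["http://example.org/1", "http://example.org/2", "http://example.org/3"]
engine_b = ["http://example.org/2", "http://example.org/4"]

for rank, url in enumerate(merge_results([engine_a, engine_b]), start=1):
    print(rank, url)
```

Because each underlying engine covers a different part of the Web, the merged list can be broader than any single engine's, but it still contains only what at least one of them has indexed.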


Domain engines & catalogs

• Cover general & specific areas
  Open Directory Project – large edited catalog of the web; global, run by volunteers
  BUBL LINK – selected Internet resources covering all academic subject areas; UK
  Profusion – search in categories


domain engines …

• Exist in many domains & subjects – rich!
  Psychcrawler – American Psychological Association; web index for psychology
  Entrez PubMed – National Library of Medicine
  CiteSeer – NEC Research Center; scientific literature, citations index; free
  Think Quest – an international organization; education resources, programs


domain engines …

KIRKE – Katalog der Internetressourcen für die Klassische Philologie aus Erlangen (catalog of Internet resources for classical philology, from Erlangen); a variety of resources
Perseus Digital Library – Tufts University; covers antiquity to Renaissance
School of Slavonic & East European Studies, University College London – includes country resources, e.g. Croatia
U Mich Document Center – official documents from all over the world


Reference services

• Reference services - several models – Q&A, directories, email answers etc.
  Ask Jeeves! – most popular, commercial
  Information Please – almanac type questions


reference …

• Digital reference - new service area for libraries
  QuestionPoint – Library of Congress & OCLC project for a global reference network
  Virtual Reference Desk – Library of Congress compilation of web reference sites
  LiveRef – maintained at Iowa State U; a registry of real-time digital reference services


Libraries as web sources

• Academic libraries providing open collections & services; models vary
  Rutgers libraries – big long-term effort
  University of California, Berkeley – a most elaborate effort together with Sun Corporation
  Bibliothèque Nationale de France – includes virtual exhibitions, among others


Virtual libraries on the Web

• Libraries emerging only on the Web
  Virtual Library – Switzerland, US, UK & other countries; ‘oldest virtual library on the Web’
  Internet Public Library – Michigan; also a long-term effort
  Librarians Index to the Internet – very popular and comprehensive


virtual libraries …

Academic Info Digital Library – many links to digital collections & resources in various subjects
Gabriel – Gateway to European National Libraries
Museum of online museums – a delight


Subject databases

• Many subject-specific sites
  – rich & often unique coverage & services
  – different approaches & requirements
• Examples in health-related domains:
  WebMD Health – news, medical information
  RxList – The Internet Drug Index
  Mayo Clinic HealthOasis – health advice


Societies, organizations

• Great many rich sources for searching
  – differences in requirements, depth, richness
• Examples from a variety of organizations:
  Assoc. for Computing Machinery – Digital Library; subscription or registration
  US State Department – about the U.S. & other countries
  Genealogy – Church of Latter-day Saints; most comprehensive historical list of records


Language barriers on the Web

• English still the major language
  – but declining, now slightly over 50%
• Multilingual retrieval search engines
  Euroseek – searches in a number of languages
  All the Web – results in 45 languages


Language barriers: translations

• A number of translation sites
  – machine-aided, i.e. plug in terms, phrases, sentences in one language & review them in the other, but effectiveness???
  Free Translations – from and to English & 8 other languages
  Babel Fish – from and to English and 9 languages; translates URLs
  Travlang – great for travelers, but annoying commercials


Web news; keeping up

• What is going on on the Web? Some major sources of news and evaluations:
  Free Pint – newsletter, articles, links
  Internet Resources Newsletter – UK based
  ResearchBuzz – daily updates; many aspects
  About.com Web Search – tools, Web Search Forum
  Resource Shelf – newsletter with archive


keeping up …

• Information Today
  – trade & professional monthly newspaper & web site
  – industry news
  – searcher columns
  – general analyses of trends


Evaluations, ratings

• Many sources evaluate web sites:
  The Scout Report – librarians’ BIBLE! Annotations. Comprehensive.
  Medical Library Assoc. – ten most useful sites; MLA user guide for health information, recommendations
  Web 100 – commercial, user ratings, news
  Evaluating web pages – UC Berkeley; tutorial and guide


Archiving the web

• Internet Archive – a large undertaking
  – includes web archive & lots more, publicly available & free
  – 10 billion web pages archived from 1996 to a few months ago
  – Wayback Machine – search to look at old versions of web pages
• But there is more, e.g.:
  – Million Book Project
  – International Children’s Digital Library
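
Old versions of a page in the Wayback Machine are commonly addressed by a URL that embeds a timestamp and the original address. The helper below only builds such a URL; it is a sketch that assumes the widely used `web.archive.org/web/<timestamp>/<url>` pattern and makes no network request, and the function name and example address are illustrative.

```python
from datetime import datetime

# Assumed URL prefix for Wayback Machine snapshots.
WAYBACK_PREFIX = "http://web.archive.org/web"


def wayback_url(page_url: str, when: datetime) -> str:
    """Build a snapshot URL for `page_url` at (or near) the time `when`."""
    stamp = when.strftime("%Y%m%d%H%M%S")  # 14-digit YYYYMMDDhhmmss timestamp
    return f"{WAYBACK_PREFIX}/{stamp}/{page_url}"


print(wayback_url("http://example.org/", datetime(2003, 1, 15)))
```

Whether a snapshot exists for a given page and date depends on what the archive's crawlers collected, so a URL built this way may lead to a different capture date or to no capture at all.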


Needed for Web searching

• Knowledge & competencies on
  – variety of web sources & their organization
  – search engines
  – web search strategies
  – search dynamics, feedback
• Keeping up & up & up
  – constant updates, changes, innovations
  – many domain/subject specific


Needed for Web searching by professionals

• Knowledge of SOURCES in area of interest
  – search engines are not enough
  – they are not too helpful in finding these other sources; structure hard to discern
• Evaluation of sources – a key professional skill!
  – standard criteria & Web criteria: authority; accuracy; currency (timeliness); objectivity; coverage; persistence; usability


Needed competencies …

• Knowledge of users & use
• Knowledge of searching
• Use of technology
• Adaptability, flexibility
• Integration with other resources
• Teaching others
• Constant learning & update
  – keeping up, keeping up, keeping up


But now really: How to do it?

[Figure: diagram with the labels “information” and “WWW”]


P.S. a few weird sites…

• SelectSmart.com – all kinds of quizzes for you
• James Dean official web site
• Deaducated
  – Dead Librarians’ Society
• Livejournal
  – blogs & authoring tools


Sources

• About.com Web Search http://websearch.about.com
• Academic Info Digital Library http://www.academicinfo.net/digital.html
• All the Web http://www.alltheweb.com/
• Ask Jeeves! http://www.ask.com/
• Assoc. for Computing Machinery http://www.acm.org/
• Babel Fish http://babelfish.altavista.com/tr
• Bibliothèque Nationale de France http://www.bnf.fr/
• BUBL LINK http://bubl.ac.uk/link/
• CNET Search.com http://www.search.com/
• CiteSeer http://citeseer.nj.nec.com/
• CompletePlanet http://completeplanet.com
• Deaducated http://www.geocities.com/deadlibrarians/
• Dogpile http://www.dogpile.com/
• Entrez PubMed http://www.ncbi.nlm.nih.gov/PubMed/
• Extreme Searcher http://www.extremesearcher.com/
• Free Pint http://www.freepint.com/


sources …

• Free Translations http://www.freetranslations.com
• Gabriel http://www.kb.nl/gabriel/
• Genealogy http://www.familysearch.org/
• Information Please http://www.infoplease.com/
• International Children’s Digital Library http://www.icdlbooks.org/
• Internet Archive http://www.archive.org/
• Internet Public Library, Michigan http://www.ipl.org/
• Internet Resources Newsletter http://www.hw.ac.uk/libwww/irn/
• James Dean http://www.jamesdean.com/
• KIRKE http://www.phil.uni-erlangen.de/~p2latein/ressourc/ressourc.html
• Librarians Index to the Internet http://lii.org/
• Live Journal http://www.livejournal.com/
• LiveRef http://www.public.iastate.edu/~CYBERSTACKS/LiveRef.htm
• Mayo Clinic http://www.mayohealth.org/


sources …

• Medical Library Assoc. ten top sites http://www.mlanet.org/resources/medspeak/topten.html
• Medical Library Assoc. user guide for health inf. http://www.mlanet.org/resources/userguide.html
• Medscape http://www.medscape.com/
• Million Book Project http://www.archive.org/texts/collection.php?collection=millionbooks
• Museum of online museums http://www.coudal.com/moom.php
• OCLC Web Characterization Project http://wcp.oclc.org/
• Open Directory Project http://dmoz.org
• Perseus Digital Library http://www.perseus.tufts.edu/
• Profusion http://www.profusion.com/
• Psychcrawler http://www.psychcrawler.com/
• QuestionPoint http://www.questionpoint.org/
• ResearchBuzz http://www.researchbuzz.com/index.shtml
• Resource Shelf http://resourceshelf.blogspot.com/
• Rutgers Libraries http://www.libraries.rutgers.edu/
• RxList http://www.rxlist.com/


sources …

• Sch of East Eur & Slavonic Studies http://www.ssees.ac.uk/dirctory.htm
• Search Engine Colossus http://www.searchenginecolossus.com/
• Search Engine Guide http://www.searchengineguide.com/
• Search Engine Showdown http://searchengineshowdown.com/
• Search Engine Watch http://searchenginewatch.com/
• SelectSmart.com http://www.selectsmart.com/home.html
• Surfwax http://www.surfwax.com/
• The Scout Report http://scout.cs.wisc.edu/
• Think Quest http://www.thinkquest.org/
• Travlang http://www.travlang.com
• U California Berkeley http://sunsite.berkeley.edu/
• U Mich Documents Center http://www.lib.umich.edu/govdocs/
• US State Department http://www.state.gov/
• Virtual Library http://vlib.org
• Virtual Reference Desk http://www.loc.gov/rr/askalib/virtualref.html
• Vivisimo http://vivisimo.com
• Web 100 http://www.web100.com
• Webbrain http://www.webbrain.com/html/default_win.html
• WebMD http://my.webmd.com/webmd_today/home/default
