Post on 21-Dec-2015
© Tefko Saracevic, Rutgers University 1
The Invisible Web - finding things that are hard to find
Tefko Saracevic, PhD
Rutgers University
http://www.scils.rutgers.edu/~tefko
(also contains a list of sites relevant to the topic and this presentation)
What is the “Invisible Web”?
• Materials that general search engines cannot or WILL not include in their collections of web pages (indexes)
• Materials you cannot find through general search engines
• Contains a vast amount of information
– much of it authoritative and of high quality
– much of it specialized
Why do search engines miss things?
• Size: the Web is huge; no engine can cover it all
• Economics: associated costs are high
– also pay per crawl & pay per rank
• Technical: capabilities are still limited
• Spam: eliminating the bad also loses some of the good
• Restrictions: some sites do not let crawlers in
• Deep structure: some sites are complex
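One common mechanism behind the "Restrictions" point is a site's robots.txt file, which well-behaved crawlers consult before fetching pages. The following is a minimal sketch using Python's standard `urllib.robotparser`; the rules, the `example.org` host and the `ExampleCrawler` agent name are made up for illustration and do not come from the presentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for an imaginary site (an assumption, not from the slides).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""

def allowed(path: str, agent: str = "ExampleCrawler") -> bool:
    """Return True if the rules above let `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())  # parse rules offline, no network access
    return parser.can_fetch(agent, "http://example.org" + path)

if __name__ == "__main__":
    for p in ("/index.html", "/private/report.pdf", "/search?q=web"):
        print(p, allowed(p))
```

Pages under the disallowed paths stay out of a compliant engine's index, which is one way content ends up in the invisible web.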
Web size - who knows?
• Web Characterization Project - OCLC
– provides statistics about the web
– 1998: 2.8 million; 2002: 9.04 million web sites (counted by IP address)
• In 2002: 35% public, 29% private, 36% provisional sites
– Public sites (2002):
• 55% US, 7% German, 6% Japanese, 3% each French, Spanish; 2% each Italian, Dutch, Chinese; 1% each Korean, Russian, Polish, Portuguese
– Adult sites (2002): 3.3%
– IP address volatility - all sites (disappearance pattern):
• 13% of sites in 2002 were also in 1998; 51% in 2001
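To make the percentages above concrete, the short calculation below turns the 2002 shares into approximate site counts and computes the growth factor from 1998 to 2002. It uses only the figures quoted on the slide; the variable and function names are illustrative.

```python
# Figures quoted on the slide (millions of web sites, counted by IP address).
TOTAL_1998_M = 2.8
TOTAL_2002_M = 9.04
SHARES_2002 = {"public": 0.35, "private": 0.29, "provisional": 0.36}

def sites_by_type(total_millions: float, shares: dict) -> dict:
    """Split a total (in millions) according to fractional shares."""
    return {kind: total_millions * share for kind, share in shares.items()}

counts_2002 = sites_by_type(TOTAL_2002_M, SHARES_2002)
growth_factor = TOTAL_2002_M / TOTAL_1998_M

if __name__ == "__main__":
    for kind, millions in counts_2002.items():
        print(f"{kind}: about {millions:.2f} million sites")
    print(f"growth 1998 to 2002: about {growth_factor:.2f}x")
```

So roughly 3.2 million of the 9.04 million sites were public, and the number of sites a little more than tripled in four years.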
How do search engines work?
• Crawlers, spiders: go out to find
– new & changed sites; periodic, not for each query
• Databases, caches:
– gather content; could be submitted, bought
• Indexing: creating appropriate entries
– various, mostly proprietary algorithms
• Retrieval engine: searching on basis of query
• Interface: gathers query, displays results
– could be ordered by pay
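The indexing and retrieval steps above can be sketched with a toy inverted index. This is only an illustration of the general idea: real engines use proprietary and far more elaborate algorithms, and the three "pages" below are invented stand-ins for crawled content.

```python
from collections import defaultdict

# Hypothetical mini-collection standing in for crawled pages (an assumption).
PAGES = {
    "http://example.org/a": "invisible web resources for librarians",
    "http://example.org/b": "search engines crawl the visible web",
    "http://example.org/c": "digital reference services in libraries",
}

def build_index(pages: dict) -> dict:
    """Indexing: map every term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index: dict, query: str) -> list:
    """Retrieval: return URLs containing all query terms (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)

if __name__ == "__main__":
    idx = build_index(PAGES)
    print(search(idx, "web"))
    print(search(idx, "invisible web"))
```

Note that the query is answered from the index alone; a page that was never crawled and indexed cannot be found, whatever the query.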
Search engines differ
• Substantial differences among search engines on each aspect
• Information about search engines:
– Search Engine Watch: ratings, news, statistics, charts
– Search Engine Showdown: run by a librarian; news links, ratings
– Extreme Searcher: update of a popular book
Search engine coverage
• No engine covers more than 16% of WWW
• Hard to discern & compare coverage
• Many national search engines - own coverage
• Many topical search engines - own coverage
• Many comprehensive sources independent of search engines
Specialized sources
• Meta search engines
• Specialized engines & catalogs
• Domain (subject) engines & catalogs
• Reference sources
• Libraries as web sources
• Virtual libraries
• Subject databases
• Societies, organizations
Meta search engines
• Search engines that cover search engines
– Search Engine Colossus: international meta engine
– Dogpile: results from a number of search engines
– Surfwax: gives statistics and text sources
– Search Engine Guide: categorized by topic; other engine information
meta engines … (cont.)
– Vivisimo: clusters results; innovative
– Complete Planet: over 100,000 databases & search engines
– Webbrain: results in tree structure; fun to use
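A meta search engine has to combine the ranked lists it gets back from several engines. The sketch below shows one simple way to do that, a round-robin interleave with duplicate removal; it is an illustrative assumption, not a description of how Dogpile, Vivisimo or any other named service actually merges results, and the URLs are placeholders.

```python
from itertools import zip_longest

def merge_results(*ranked_lists):
    """Round-robin interleave of ranked lists, keeping the first copy of each URL."""
    merged, seen = [], set()
    for tier in zip_longest(*ranked_lists):  # rank 1 from each engine, then rank 2, ...
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

if __name__ == "__main__":
    # Placeholder result lists from two hypothetical engines.
    engine_a = ["http://example.org/1", "http://example.org/2", "http://example.org/3"]
    engine_b = ["http://example.org/2", "http://example.org/4"]
    print(merge_results(engine_a, engine_b))
```

Because each engine covers a different slice of the Web, the merged list is usually broader than any single engine's results.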
Domain engines & catalogs
• Cover general & specific areas
– Open Directory Project: large edited catalog of the web; global, run by volunteers
– BUBL LINK: selected Internet resources covering all academic subject areas; UK
– Profusion: search in categories
domain engines …
• Exist in many domains & subjects - rich!
– Psychcrawler (Amer. Psychological Association): web index for psychology
– Entrez PubMed: Nat. Library of Medicine
– CiteSeer (NEC Research Center): scientific literature, citations index; free
– Think Quest (an international organization): education resources, programs
domain engines …
– KIRKE (Katalog der Internetressourcen für die Klassische Philologie aus Erlangen): a variety of resources
– Perseus Digital Library (Tufts University): covers antiquity to renaissance
– Sch. of Slavonic & East European Studies (University College London): includes country resources, e.g. Croatia
– U Mich Document Center: official documents from all over the world
Reference services
• Reference services - several models
– Q&A, directories, email answers etc.
– Ask Jeeves!: most popular, commercial
– Information Please: almanac type questions
reference …
• Digital reference - new service area for libraries
– QuestionPoint: L of Congress & OCLC project for a global reference network
– Virtual Reference Desk: L of Congress compilation of web reference sites
– LiveRef: maintained at Iowa State U; a registry of real time digital reference services
Libraries as web sources
• Academic libraries providing open collections & services; models vary
– Rutgers libraries: big long-term effort
– University of California, Berkeley: a most elaborate effort, together with Sun Microsystems
– Bibliothèque Nationale de France: includes virtual exhibitions, among others
Virtual libraries on the Web
• Libraries emerging only on the Web
– Virtual Library: Switzerland, US, UK & other countries; ‘oldest virtual library on the Web’
– Internet Public Library (Michigan): also a long term effort
– Librarians Index to the Internet: very popular and comprehensive
virtual libraries …
– Academic Info Digital Library: many links to digital collections & resources in various subjects
– Gabriel: gateway to European national libraries
– Museum of online museums: a delight
Subject databases
• Many subject specific sites
– rich & often unique coverage & services
– different approaches & requirements
• Examples in health related domains:
– WebMDHealth: news, medical information
– Rxlist: The Internet Drug Index
– Mayo Clinic HealthOasis: health advice
Societies, organizations
• Great many rich sources for searching
– differences in requirements, depth, richness
• Examples from a variety of organizations:
– Assoc. for Computing Machinery: Digital Library; subscription or registration
– US State Department: about the U.S. & other countries
– Genealogy (Church of Jesus Christ of Latter-day Saints): most comprehensive historical list of records
Language barriers on the Web
• English still the major language
– but declining, now slightly over 50%
• Multilingual retrieval search engines
– Euroseek: searches in a number of languages
– All the Web: results in 45 languages
Language barriers: translations
• A number of translation sites
– machine aided, i.e. plug in terms, phrases, sentences in one language & review in the other, but effectiveness???
– Free Translations: from and to English & 8 other languages
– Babel Fish: from and to English and 9 languages; translates URLs
– Travlang: great for travelers, but annoying commercials
Web news; keeping up
• What is going on on the Web? Some major sources of news and evaluations:
– Free Pint: newsletter, articles, links
– Internet Resources Newsletter: UK based
– ResearchBuzz: daily updates; many aspects
– About.com Web Search: tools, Web Search Forum
– Resource Shelf: newsletter with archive
keeping up …
• Information Today
– trade & professional monthly newspaper & web site
– industry news
– searcher columns
– general analyses of trends
Evaluations, ratings
• Many sources evaluate web sites:
– The Scout Report: librarians’ BIBLE! Annotations. Comprehensive.
– Medical Library Assoc.: ten most useful sites; MLA user guide for health inf., recommendations
– Web 100: commercial, user ratings, news
– Evaluating web pages (UC Berkeley): tutorial and guide
Archiving the web
• Internet Archive - a large undertaking
– includes web archive & lots more; publicly available & free
– 10 billion web pages archived from 1996 to a few months ago
– Wayback Machine: search to look at old versions of web pages
• But there is more, e.g.:
– Million Book Project
– International Children’s Digital Library
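Old versions of a page in the Wayback Machine can be addressed directly by URL. The helper below is a small sketch that builds such an address using the `https://web.archive.org/web/<timestamp>/<url>` form the service uses today; the function name is illustrative, and whether a capture actually exists near the chosen date is not something this code checks.

```python
from datetime import date

def wayback_url(page_url: str, when: date) -> str:
    """Build a Wayback Machine address for `page_url` near the date `when`."""
    stamp = when.strftime("%Y%m%d")  # timestamp in YYYYMMDD form
    return f"https://web.archive.org/web/{stamp}/{page_url}"

if __name__ == "__main__":
    # Example: the presenter's site as it may have looked in early 2003.
    print(wayback_url("http://www.scils.rutgers.edu/~tefko", date(2003, 1, 15)))
```

When such an address is opened, the archive serves the capture closest to the requested date, if it has one.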
Needed for Web searching
• Knowledge & competencies on
– variety of web sources & their organization
– search engines
– web search strategies
– search dynamics, feedback
• Keeping up & up & up
– constant updates, changes, innovations
– many domain/subject specific
Needed for Web searching by professionals
• Knowledge of SOURCES in area of interest
– search engines are not enough
– they are not too helpful in finding these other sources; structure hard to discern
• Evaluation of sources - a key professional skill!
– standard criteria & Web criteria: authority; accuracy; currency (timeliness); objectivity; coverage; persistence; usability
Needed competencies …
• Knowledge of users & use
• Knowledge of searching
• Use of technology
• Adaptability, flexibility
• Integration with other resources
• Teaching others
• Constant learning & update
– keeping up, keeping up, keeping up
P.S. a few weird sites…
• SelectSmart.com
– all kinds of quizzes for you
• James Dean official web site
• Deaducated
– Dead Librarians’ Society
• Livejournal
– blogs & authoring tools
Sources
• About.com Web Search http://websearch.about.com
• Academic Info Digital Library http://www.academicinfo.net/digital.html
• All the Web http://www.alltheweb.com/
• Ask Jeeves! http://www.ask.com/
• Assoc. for Computing Machinery http://www.acm.org/
• Babelfish http://babelfish.altavista.com/tr
• Bibliothèque Nationale de France http://www.bnf.fr/
• BUBL LINK http://bubl.ac.uk/link/
• CNET Search.com http://www.search.com/
• CiteSeer http://citeseer.nj.nec.com/
• CompletePlanet http://completeplanet.com
• Deaducated http://www.geocities.com/deadlibrarians/
• Dogpile http://www.dogpile.com/
• Entrez PubMed http://www.ncbi.nlm.nih.gov/PubMed/
• Extreme Searcher http://www.extremesearcher.com/
• Free Pint http://www.freepint.com/
sources …
• Free Translations http://www.freetranslations.com
• Gabriel http://www.kb.nl/gabriel/
• Genealogy http://www.familysearch.org/
• Information Please http://www.infoplease.com/
• International Children’s Digital Library http://www.icdlbooks.org/
• Internet Archive http://www.archive.org/
• Internet Public Library, Michigan http://www.ipl.org/
• Internet Resources Newsletter http://www.hw.ac.uk/libwww/irn/
• James Dean http://www.jamesdean.com/
• KIRKE http://www.phil.uni-erlangen.de/~p2latein/ressourc/ressourc.html
• Librarians Index to the Internet http://lii.org/
• Live Journal http://www.livejournal.com/
• LiveRef http://www.public.iastate.edu/~CYBERSTACKS/LiveRef.htm
• Mayo Clinic http://www.mayohealth.org/
sources …
• Medical Library Assoc. ten top sites http://www.mlanet.org/resources/medspeak/topten.html
• Medical Library Assoc. user guide for health inf. http://www.mlanet.org/resources/userguide.html
• Medscape http://www.medscape.com/
• Million Book Project http://www.archive.org/texts/collection.php?collection=millionbooks
• Museum of online museums http://www.coudal.com/moom.php
• OCLC Web Characterization Project http://wcp.oclc.org/
• Open Directory Project http://dmoz.org
• Perseus Digital Library http://www.perseus.tufts.edu/
• Profusion http://www.profusion.com/
• Psychcrawler http://www.psychcrawler.com/
• QuestionPoint http://www.questionpoint.org/
• ResearchBuzz http://www.researchbuzz.com/index.shtml
• Resource Shelf http://resourceshelf.blogspot.com/
• Rutgers Libraries http://www.libraries.rutgers.edu/
• RxList http://www.rxlist.com/
sources …
• Sch of East Eur & Slavonic Studies http://www.ssees.ac.uk/dirctory.htm
• Search Engine Colossus http://www.searchenginecolossus.com/
• Search Engine Guide http://www.searchengineguide.com/
• Search Engine Showdown http://searchengineshowdown.com/
• Search Engine Watch http://searchenginewatch.com/
• Select Smart.com http://www.selectsmart.com/home.html
• Surfwax http://www.surfwax.com/
• The Scout Report http://scout.cs.wisc.edu/
• Think Quest http://www.thinkquest.org/
• Travlang http://www.travlang.com
• U California Berkeley http://sunsite.berkeley.edu/
• U Mich Documents Center http://www.lib.umich.edu/govdocs/
• US State Department http://www.state.gov/
• Virtual Library http://vlib.org
• Virtual Reference Desk http://www.loc.gov/rr/askalib/virtualref.html
• Vivisimo http://vivisimo.com
• Web 100 http://www.web100.com
• Webbrain http://www.webbrain.com/html/default_win.html
• WebMD http://my.webmd.com/webmd_today/home/default