Searching the Web
Or “If there’s so much out there, why can’t I find it?”
Presented by: Allen Brown IS/SE
Date: 2003-05-12
Outline - Searching the Web
1. Information Cartography
2. Visible and Invisible Web Information
3. Information Finding Strategies
4. Reference Tools, Pathfinders, Specialized Information Repositories, Subject Directories, and Search Engines
5. Information Search Strategies
6. Information Evaluation Strategies
7. Information Finding Summary
8. Search Engines and their Characteristics
Information Cartography
Imagine a physical map of an ocean basin
• identifiable areas of the sea floor
• large abyssal plain
• many undulating hills above the plain
• occasional higher elevations or plateaus
• sparse atolls and seamounts
Imagine the Web
• some information content identifiable by subject
• vast amounts of very low value information
• some good stuff distributed across many sites
• occasional high quality site with quality and quantity
• sparse stunningly useful sites (to die for)
Information Cartography - 2
In searching for information we need to adjust the:
• breadth of search, to find all that is relevant in an “ocean” of information
• quality level, to find only “atolls” of information quality
The aim is to find everything that is important and useful.
Information issues: quality, completeness + location!
Visible and Invisible Information
Visible = indexed by search engine
Invisible = not indexed but accessible
[Diagram: the information space, showing engines 1 to 4 alongside sites 3, 5, 7 and databases db 1, db 2, db 4, db 6]
Search Engines Won’t Do It All!
According to a recent study reported in Nature (1), no search engine indexes more than about 16% of the Web. Even though search engine databases are enormous, they cover very little of what's actually available on the Web.
(1) Steve Lawrence and C. Lee Giles (July 8, 1999). Accessibility of Information on the Web. Nature, 400, 107-109.
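One rough way to picture the coverage point: the visible Web is the union of what the engines index, and everything outside that union is invisible. The sketch below is only an illustration; the toy web of 100 pages and the three engine indexes are invented numbers, not data from the study.

```python
# Hypothetical toy web of 100 pages; the index contents are invented for illustration.
web = set(range(100))
engine_indexes = {
    "engine 1": set(range(0, 16)),
    "engine 2": set(range(10, 24)),
    "engine 3": set(range(40, 50)),
}

def coverage(indexed, universe):
    """Fraction of the universe that appears in the indexed set."""
    return len(indexed & universe) / len(universe)

# Coverage of each engine on its own.
per_engine = {name: coverage(idx, web) for name, idx in engine_indexes.items()}

# Visible = indexed by at least one engine; invisible = everything else.
visible = set().union(*engine_indexes.values())
combined = coverage(visible, web)
invisible = web - visible
```

Because the indexes overlap, the combined coverage is less than the sum of the individual figures, which is one reason to try several engines and still expect gaps.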
Information Finding Strategies
Identify Starting Points based on your question:
• What type of information do you need?
– Facts, statistics, government document, scholarly articles, popular opinion, music, picture, multimedia, news, …
• What form do you want the information in?
– Dictionary definition, encyclopedia entry, journal article, elementary school project, video file, audio file, …
• What type of site would offer this information?
– Academic, commercial, government, non-government organization
• How much information do you need?
– Introduction, in-depth, references, …
Information Finding
Reference Materials (Often invisible)
– dictionaries, thesauri, encyclopedia, newspapers
Information Pathfinders (Sometimes invisible) / Portals / Vortals
– subject specific, highly relevant, sometimes bizarre
– usually high quality
– managed by dedicated enthusiasts, possibly amateur
– e.g., Web design, Perl, micro cars, Curta calculators, …
Specialized Information Repositories (Often invisible) / Portals
– institution-based, sometimes obscure
– usually high quality
– managed by information professionals
– e.g., government documents, archives, …
Information Finding - 2
Subject Indices (Often invisible – but this is changing)
– subject-based
– e.g., Yahoo
Search Engines and Search Brokers (Visible web)
– e.g., Google, Alta Vista, Hot Bot, Lycos, Vivisimo, dogpile
Reference Tools - Dictionaries http://www.yourdictionary.com/
Reference Tools - Thesauri http://www.visualthesaurus.com/index.jsp
Reference Tools - Encyclopedia http://www.britannica.com/
Pathfinders
A pathfinder site provides an information map of what is available within a fairly narrow area of interest; usually compiled by domain experts. These sites are often called “vortals” (vertical portals).
Specialized Information Repositories - National Library of Canada
A specialized information repository often collects and catalogues relatively specific information; usually compiled by information experts. Some are considered to be vortals.
Subject Directories
Subject directories are lists compiled by people. They are organized in a hierarchy where each subject includes a list of sub-topics. These sites are often called “portals” - a one-site starting location for general information seeking.
www.yahoo.com
Subject Directories
Subject lists are usually evaluated, but sites are not presented in order of relevancy. In other words, the best sites on a topic are not necessarily listed first. Sites are compiled through submission of URLs by site creators, followed by human evaluation and selection.
One advantage of subject directories is their browsability, although this feature is only suitable for fairly general topics. A disadvantage is their relatively small size.
Other examples of subject directories:
Infomine: http://infomine.ucr.edu
Scout Report Signpost: http://www.signpost.org/signpost
Invisible Web Directories
Look at http://www.invisible-web.net/
Search Engines
Search engines use computer programs called "spiders" or "robots" that automatically collect web pages. The pages are indexed and stored in an index database.
To query a search engine, type topic keywords and Boolean connectors into a search "box." The search engine scans its index and returns links to websites containing the specified keyword relationships.
Size matters - an advantage of using search engines is their coverage (though size is relative), but this can also be a disadvantage if relevance ranking is poor.
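The index-and-query idea can be sketched as an inverted index: each word maps to the set of pages containing it, and Boolean connectors become set operations. This is a minimal illustration, not how any particular engine works; the page names and texts are hypothetical.

```python
from collections import defaultdict

# Hypothetical pages standing in for what a spider might have collected.
pages = {
    "site1/curta.html": "curta calculator repair and cleaning",
    "site2/perl.html": "perl web design tutorial",
    "site3/calc.html": "mechanical calculator history",
}

def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search_and(index, *words):
    """URLs containing every keyword (Boolean AND)."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

def search_or(index, *words):
    """URLs containing at least one keyword (Boolean OR)."""
    return set().union(*(index.get(w.lower(), set()) for w in words))

def search_not(index, include, exclude):
    """URLs containing `include` but not `exclude` (Boolean NOT)."""
    return index.get(include.lower(), set()) - index.get(exclude.lower(), set())

index = build_index(pages)
```

For example, `search_and(index, "calculator", "repair")` keeps only the first page, while `search_or(index, "perl", "history")` widens the result to two pages.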
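Search Engines: Operational Concepts
[Diagram: the User sends a query to the Search Engine and receives query results. The Search Engine, which holds an index database, is labelled with two groups of functions: crawling and page contents extraction and indexing (towards the World Wide Web), and query parsing, index lookup, results ranking and management (towards the User).]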
Search Engines - Does Size Matter?
Size
If you are looking for unusual or hard-to-find information, you should try one or more of the search engines with a large index to check more web content. This improves the likelihood of finding what you seek. However, for general searches or when looking for information about popular topics, a large index does not necessarily equal better results. Also, large indexes may have longer re-visit intervals.
Search Engines: Search Scoping and Ranking / Results Management
It is essential to learn and apply each engine's specialized search formats to narrow results, filter them, and push the most relevant pages to the top of the results list. Use Boolean operators, proximity connectors, stems, wild cards, sounds-like, media-type and metadata filters.
Result relevancy ranking also depends on the size of the search index and how the search engine interprets and uses your query.
Each engine determines result relevancy ranking in unique ways. Consult the help file of each engine to learn about these.
Some engines offer search refinement and conceptual clustering for better focus (tighter “hit cluster”) or greater accuracy / validity (centred on the “right stuff”).
Search Engines - Search Scoping
+ expands the scope, - reduces the scope
• Exact phrase (-): quotes, e.g., “We hold these truths to be self-evident”
• Boolean operators: and (-, the default), or (+, caution!), not (-, extreme caution!), e.g., large male dog; large or male or dog; not cat
• Proximity connectors (-): near (depends on engine), e.g., spring near flower
• Stemming and wildcards (+): e.g., swim* matches swim, swimming, swimmer, swimmers, swimmingly, …
• Sounds-like (+): e.g., table matches cable, able, fable, …
• Media type (-): e.g., image, audio file, …
• Concept-based (+ or -): e.g., synonym thesaurus, antonym, homonym, …
• Metadata-based (-): in some systems
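Three of the scoping operators above can be sketched as simple tests on a page's word list. This is a simplified illustration under stated assumptions: the function names are hypothetical, the wildcard is handled as a plain prefix match, and the proximity window of 5 words is an arbitrary default, since each engine defines "near" in its own way.

```python
import re

def tokens(text):
    """Lower-case word list for a piece of text."""
    return re.findall(r"[a-z']+", text.lower())

def matches_phrase(text, phrase):
    """Exact phrase: the words must appear consecutively and in order."""
    words, target = tokens(text), tokens(phrase)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

def matches_wildcard(text, pattern):
    """Trailing-* wildcard, e.g. swim* matches swim, swimming, swimmer."""
    stem = pattern.rstrip("*").lower()
    return any(w.startswith(stem) for w in tokens(text))

def matches_near(text, a, b, window=5):
    """Proximity: a and b occur within `window` words of each other."""
    words = tokens(text)
    pos_a = [i for i, w in enumerate(words) if w == a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == b.lower()]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)
```

The phrase and proximity tests narrow the result set, while the wildcard test widens it, which matches the + and - markings in the list.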
Search Engines - Ranking
Result relevancy ranking (=“usefulness”) can be done according to two techniques (or some combination):
• Conventional - using intra-page information
• Relative - using extra-page information
Search Engines - Conventional Ranking
Conventional (intra-page):
• frequency of words (number and density)
• phrases (exact word sequences)
• hierarchy (e.g., closer to the top of the document)
• adjacency (proximity of words)
• metadata (keywords provided by content owners)
• font size and style (relative intra-page)
Jack Christensen repairs CURTA calculators. I've known Jack for many years and can highly recommend him. Here are a few questions I asked Jack: What do you charge to clean a Curta? Typically $65 to $95, depending on the work involved. More often than not, the upper carriage needs a complete disassembly, whereas the main body can be cleaned without a complete disassembly. If the main body needs to be completely disassembled, something is usually bent, out of adjustment, or broken. What do you charge when repairing a Curta? I charge $20 per hour of my time. It seems my hours are about 90 minutes long, however, because I rarely finish in the time I originally quoted. Extended repair time is absorbed by me. What spare parts do you have? Are they expensive? I actually have many hundreds of new original Curta parts. Most are for inside the instrument, though. I use them when I do general cleaning and repairs. Outer body pieces, replacement cannisters, and external parts that are easily damaged or broken due to abuse are not generally available, although I do occasionally locate some these items. Sometimes I have to fabricate a part, or repair an item as best I can. Obviously, this takes time, and the cost is high. Parts costs are charged as the traffic will bear. I usually try to be blunt about this to the Curta owner, often telling them that a severely damaged unit is best sold as a "parts Curta". Unfortunately, I've sometimes had to tell this to someone who wanted to repair a Curta looked upon as an heirloom. What to them appears to be a minor issue actually turns out to be a major problem (e.g., a crank handle tilted downward is due to a broken main shaft). I think the most I ever charged for a repair was about $375. There were many severe problems with the unit. Generally, when the price gets to be above $175 most people simply decide to keep the damaged Curta as a memento. Can you replace a clearing ring? What costs are involved? 
The plastic clearing rings are easy to install. I have several new ones, but I typically do not sell them separately as a spare part. Rather, I install them during a general cleaning and repair. Metal rings are more difficult to replace. As with the plastic clearing rings, I will only install a metal clearing ring during a general cleaning and repair. It takes a special tool to properly swage the rivet in place. [Editor's note: Very old Type I clearing rings were held on with a screw and nut. The nut was also crimped to the screw threads.] I used all the new metal clearing rings I had about five years ago, but I do have a few used ones that were removed from other damaged Curtas. I have these for both the Type I and
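A toy score built from a few of the intra-page signals listed above (word density, exact phrase, position in the document) might look like the sketch below. The function name and the weights 0.5 and 1.0 are assumptions chosen for illustration; real engines combine many more signals and do not publish their formulas.

```python
import re

def tokens(text):
    """Lower-case word list for a piece of text."""
    return re.findall(r"[a-z0-9']+", text.lower())

def conventional_score(text, query):
    """Toy intra-page score: density, plus bonuses for an early hit and an exact phrase."""
    words = tokens(text)
    terms = tokens(query)
    if not words or not terms:
        return 0.0
    hits = [i for i, w in enumerate(words) if w in terms]
    if not hits:
        return 0.0
    density = len(hits) / len(words)        # frequency relative to page length
    position = 1.0 - hits[0] / len(words)   # hierarchy: an earlier first hit scores higher
    n = len(terms)
    phrase = any(words[i:i + n] == terms for i in range(len(words) - n + 1))
    # Arbitrary illustrative weights: 0.5 for position, 1.0 for an exact phrase.
    return density + 0.5 * position + (1.0 if phrase else 0.0)

# Two hypothetical page texts scored against the same query.
page_a = "Jack Christensen repairs Curta calculators and cleans each Curta"
page_b = "A history of mechanical calculators, with one mention of the Curta"
score_a = conventional_score(page_a, "curta calculators")
score_b = conventional_score(page_b, "curta calculators")
```

Here `page_a` outranks `page_b` because it repeats the query words, uses them earlier, and contains them as an exact phrase.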
Search Engines - Relative Ranking
Relative (extra-page):
• popularity (page visits - from the search engine)
• citation (links pointing to the item)
• relevance of the pages containing the links pointing to the item (!)
[Diagram labels: Yahoo, Web Pages]
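The citation idea, where a link counts for more when it comes from a page that is itself well cited, can be sketched with a PageRank-style power iteration. This is a swapped-in, simplified technique used for illustration only; the function name, the damping value of 0.85, the iteration count, and the link graph are all assumptions.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Power-iteration sketch of a PageRank-style score.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly over its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new[p] += damping * rank[page] / n
        rank = new
    return rank

# Hypothetical link graph: three pages cite "curta-faq"; it cites one page back.
links = {
    "hobby-page": ["curta-faq"],
    "museum": ["curta-faq"],
    "directory": ["curta-faq", "museum"],
    "curta-faq": ["museum"],
}
ranks = link_rank(links)
```

In this graph the heavily cited page ends up well above the pages nobody links to, and "museum" benefits from being cited by that well-cited page.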
Search Engines: Keys to Success
• World Wide Web size: large index and / or several engines
• Scoped query: “wide net” but appropriate “sieve”, carefully constructed for your needs
• Ranked and manageable results: query construction and search engine features
Meta Search Engines
“Meta” search tools are able to search the index databases of multiple engines “simultaneously”, via a single interface.
“Meta” search tools don’t really search metadata. They are simply brokers that reformulate a query and hand it off to a set of search engines, then combine the results.
“Meta” engines are very fast but they do not offer the same level of control over the relationship between keywords as do individual search engines.
Also, meta search engines may produce poor ranking of combined results.
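The "combine the results" step is where the ranking difficulty arises, because each engine scores pages on its own scale. One common way to merge lists using ranks alone is reciprocal rank fusion, sketched below. This is an illustrative technique, not a description of how Vivisimo or dogpile work; the function name, the constant k=60, and the `.example` URLs are assumptions.

```python
from collections import defaultdict

def fuse_results(result_lists, k=60):
    """Merge ranked URL lists with reciprocal rank fusion: score = sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical result lists returned by three engines for the same query.
engine_a = ["curta.example/faq", "curta.example/repair", "calc.example/history"]
engine_b = ["curta.example/repair", "curta.example/faq"]
engine_c = ["curta.example/repair", "museum.example/curta"]
merged = fuse_results([engine_a, engine_b, engine_c])
```

A page returned near the top by several engines rises to the top of the merged list, and duplicates collapse into a single entry.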
Search Engines
Examples of popular search engines include:
Google: http://www.google.com
Alta Vista: http://www.altavista.com
All the Web: http://www.alltheweb.com
Northern Light: http://www.northernlight.com
Also see the KartOO clustering visual engine: http://www.kartoo.com/
For meta engines, try Vivisimo at http://vivisimo.com/
Information Search Strategies
• Think hard about what you are looking for!
• Use a Reference Tool, if appropriate
• Use a Pathfinder, if you know one
• Use a Specialized Information Repository, if appropriate
• Use Subject Indexes, if it is a common topic
• Use several Search Engines, if needed, especially for the obscure or academic topic, but learn how they work
• Use keywords - be narrow, and specific (and technical)
• Use phrases - try synonyms or related concepts
• Use Boolean connectors - but find out if / how the engine uses them
• Use stemming and wildcards - but find out if / how the engine uses them
• Use media-type filters or metadata, if appropriate
Information Search Tools - Use
• Reference Tool: generic, simple lookup; created by professionals; contains “invisible” content
• Pathfinder: focused content; pre-selected by domain experts
• Specialized Information Repository: related or themed; pre-selected by professionals; contains “invisible” content
• Subject Indexes: popular or common; pre-selected by interested people
• Search Engines and Meta-engines: obscure or academic; caveat emptor!
[Diagram: the tools are placed in the information space along breadth and depth axes, on a scale running between “easy to use” and “hard to use well”.]
Information Evaluation Strategies: CARS
CARS checklist: http://library.queensu.ca./inforef/guides/evalchart.htm
• Credibility
– author credentials stated with email contact
– evidence of quality control (site location)
• Accuracy
– timeliness
– comprehensiveness
– audience & purpose
• Reasonableness
– fairness
– objectivity
– consistency
– world view
• Support
– source documentation or bibliography
Summary
There is much information on the Web, but it’s not:
– all there
– all good (or all bad)
– always easy to locate
Use an information search strategy that:
– matches the information sought
– uses the appropriate tools
– uses them in the correct ways
Use an information evaluation strategy, e.g., CARS methodology.
Choose and use search engines wisely, knowing their strengths, features, and their limitations.
How Do Search Engines Work?
Three Activities Occur:
1. Crawling
– fetch pages
– compile URL list (a db)
– re-visit pages
2. Page harvesting
– parse page
– add to index db and establish ranking
3. Responding to search requests
– parse query
– apply to index
– present and rank results
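The three activities can be sketched end to end on a tiny in-memory "web". Everything here is hypothetical: the `WEB` dictionary stands in for real page fetches (no network access), the function names are invented, and ranking is reduced to counting matching query terms. The re-visit step is left out for brevity.

```python
from collections import defaultdict, deque
import re

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "a.example": ("curta calculator repair", ["b.example", "c.example"]),
    "b.example": ("curta calculator history", ["a.example"]),
    "c.example": ("perl web design", []),
    "d.example": ("unlinked curta archive", []),  # nothing links here
}

def crawl(seed):
    """1. Crawling: fetch pages breadth-first and compile the URL list."""
    seen, queue = [seed], deque([seed])
    while queue:
        url = queue.popleft()
        _, links = WEB[url]  # stands in for fetching the page
        for link in links:
            if link in WEB and link not in seen:
                seen.append(link)
                queue.append(link)
    return seen

def harvest(urls):
    """2. Page harvesting: parse each page and add its words to the index."""
    index = defaultdict(set)
    for url in urls:
        text, _ = WEB[url]
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(url)
    return index

def respond(index, query):
    """3. Responding: parse the query, apply it to the index, rank the results."""
    terms = re.findall(r"[a-z]+", query.lower())
    counts = defaultdict(int)
    for term in terms:
        for url in index.get(term, ()):
            counts[url] += 1
    # Rank by number of matching terms, then by URL for a stable order.
    return sorted(counts, key=lambda url: (-counts[url], url))

urls = crawl("a.example")
index = harvest(urls)
results = respond(index, "curta repair")
```

Note that `d.example` mentions "curta" but never appears in the results: no crawled page links to it, so it stays in the invisible part of this toy web.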
Search Engines: Operation
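[Diagram: inside the Search Engine are a Crawler Robot, a Harvester Robot, a URL database, an Index database, and a Query Processor. The robots fetch from the World Wide Web, exchanging URLs and page contents and re-visiting pages; the User sends a query and receives query results. Two areas are annotated “Really clever stuff in here” and “Fairly clever stuff in here”.]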
Search Engine - Hardware
(not really …)
How Do Search Engines Work?
• See “The Anatomy of a Large-Scale Hypertextual Web Search Engine” at http://www-db.stanford.edu/~backrub/google.html
References
• Information Search Strategies:
<http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/FindInfo.html>
• Information Evaluation Strategies:
<http://www.vuw.ac.nz/~agsmith/evaln/evaln.htm>
• Search Engines:
<http://www.library.arizona.edu/search.htm>
<http://www.brightplanet.com/deepcontent/tutorials/search/index.asp>
<http://www.searchenginewatch.com/>
• Susan Maze, David Moxley, Donna Smith: Authoritative Guide to Web Search Engines, Neal Schuman Pub, 1997, ISBN 1555703054