CP3024 Lecture 10: Search Engines
What is the main WWW problem?
With an estimated 800 million web pages finding the one you want is difficult!
What is a Search Engine?
A page on the web connected to a backend program
Allows a user to enter words which characterise a required page
Returns links to pages which match the query
Types of Search Engine
Automatic search engine, e.g. AltaVista, Lycos
Classified directory, e.g. Yahoo!
Meta-search engine, e.g. Dogpile
Components of a Search Engine
Robot (or Worm or Spider)
– collects pages
– checks for page changes
Indexer
– constructs a sophisticated file structure to enable fast page retrieval
Searcher
– satisfies user queries
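The indexer and searcher can be illustrated with a small inverted index. This is only a sketch under assumptions: the slides do not say which file structure real engines use, and the page texts, URLs and function names below are hypothetical.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Indexer: map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Searcher: return the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical pages standing in for what a robot might have collected.
pages = {
    "http://example.org/a": "Search engines index web pages",
    "http://example.org/b": "Robots collect pages by following links",
}
index = build_index(pages)
print(sorted(search(index, "pages")))         # both URLs
print(sorted(search(index, "robots pages")))  # only .../b
```

Looking a word up in the index avoids rescanning every page for every query, which is the point of building the structure ahead of time.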
Query Interface
Usually a Boolean interface
– (Fred and Jean) or (Bill and Sam)
Normally allows phrase searches
– "Fred Smith"
Also proximity searches
Not generally understood by users
May have extra 'friendlier' features
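The three query styles can be sketched against a single document as follows. The helper names and the sample sentence are hypothetical, and a real engine would evaluate these against its index rather than raw text.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_word(text, word):
    """Boolean term: True if the word occurs anywhere in the text."""
    return word.lower() in tokenize(text)


def contains_phrase(text, phrase):
    """Phrase search: True if the phrase's words occur consecutively."""
    words, target = tokenize(text), tokenize(phrase)
    n = len(target)
    return n > 0 and any(
        words[i:i + n] == target for i in range(len(words) - n + 1)
    )


def within(text, a, b, distance):
    """Proximity search: True if a and b occur at most `distance` words apart."""
    words = tokenize(text)
    pos_a = [i for i, w in enumerate(words) if w == a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == b.lower()]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)


doc = "Fred Smith met Jean at the station while Bill waited"

# (Fred and Jean) or (Bill and Sam)
boolean_match = (
    (contains_word(doc, "Fred") and contains_word(doc, "Jean"))
    or (contains_word(doc, "Bill") and contains_word(doc, "Sam"))
)
print(boolean_match)                       # True: Fred and Jean both occur
print(contains_phrase(doc, "Fred Smith"))  # True: the words are adjacent
print(within(doc, "Fred", "Bill", 3))      # False: they are 8 words apart
```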
Search Results
Presented as links
Supposedly ordered in terms of relevancy to the query
Some Search Engines score results
Normally organised in groups of ten per page
Problems
Links are often out of date
Usually too many links are returned
Returned links are not very relevant
The Engines don't know about enough pages
Different engines return different results
U.S. bias
Improving query results
To look for a particular page use an unusual phrase you know is on that page
Use phrase queries where possible
Check your spelling!
Progressively use more terms
If you don't find what you want, use another Search Engine!
Who operates Search Engines?
People who can get money from venture capitalists!
Many search engines originate from U.S. universities
Often paid for by advertisements
Engines monitor carefully what else interests you (paid by the click)
Robot Discovery
Robots visit sites while following links
The more links the more visits
Make sure you don't exclude Robots from visiting public pages
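One way to check that public pages are not excluded is to test the site's robots exclusion rules before publishing them. The sketch below assumes the site uses a robots.txt file; the rules, the robot name "ExampleBot" and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everything is public except /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

public_ok = parser.can_fetch("ExampleBot", "http://example.org/index.html")
private_ok = parser.can_fetch("ExampleBot", "http://example.org/private/a.html")
print(public_ok)   # True: robots may visit the public page
print(private_ok)  # False: the private area is excluded
```

A rule such as "Disallow: /" under "User-agent: *" would make the first check fail, which is the mistake the slide warns against.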
Payments
Some search engines only index paying customers
The more you pay the higher you appear on answers to queries
Self submission
Register your page with a search engine
Pay for a company to register you with many search engines
Get registration with many search engines for free!
Getting to the top
Only relevant queries should be ranked highly
Search engines only look at text
Search engine operators try to stop "search engine spamming"
Some queries are pre-answered
Get where you should be!
Put more than graphics on a page
Don't use frames
Use the ALT attribute on <IMG> tags
Make good use of <TITLE> and <H1>
Consider using the <META> tag
Get people to link to your page
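Because search engines only look at text, it can help to see what text a page actually exposes. The sketch below pulls out the title, H1 headings, META description and keywords, and image ALT text. The class name and the sample page are hypothetical, and real engines weight these elements in ways the slides do not specify.

```python
from html.parser import HTMLParser


class IndexableText(HTMLParser):
    """Collect the text a text-only indexer could read from a page."""

    def __init__(self):
        super().__init__()
        self.found = {"title": [], "h1": [], "meta": [], "alt": []}
        self._open = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "h1"):
            self._open = tag
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.found["meta"].append(attrs.get("content") or "")
        elif tag == "img" and attrs.get("alt"):
            self.found["alt"].append(attrs["alt"])

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None

    def handle_data(self, data):
        if self._open and data.strip():
            self.found[self._open].append(data.strip())


# Hypothetical page: one image has ALT text, the other has none.
page = """<html><head><title>Acme Widgets</title>
<meta name="description" content="Hand-made widgets">
</head><body><h1>Widgets for sale</h1>
<img src="logo.gif" alt="Acme logo"><img src="spacer.gif">
</body></html>"""

parser = IndexableText()
parser.feed(page)
parser.close()
print(parser.found)
```

The second image contributes nothing to the result, which is why a page made only of graphics without ALT text gives an engine very little to index.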
Summary
Search Engines are vital to the Web user
Search Engines are not perfect by a long way
There are tactics for better searching
Page design can bring more visitors via Search Engines
The more links the better!
In the beginning
WWLib-TOS
– Manually constructed directory
– Classified on Dewey Decimal
– Simple data structure
– Proof of concept
Motive - Why Generate Metadata Automatically?
Meta tags are not compulsory
Old pages are less likely to have meta tags
Available data can be unreliable
The Web of Trust requires comprehensive resource description
An essential prerequisite for widespread deployment of RDF applications
Method - How can Metadata be Generated Automatically?
Using an automatic classifier
The classifier classifies Web pages according to Dewey Decimal Classification
Other useful metadata can be extracted during the process of automatic classification
Automatic Classification
Intended to combine the intuitive accuracy of manually maintained classified directories with the speed and comprehensive coverage of automated search engines
DDC has been adopted because of its universal coverage, multilingual scope and hierarchical nature
Automatic Classifier - How does it work?
Firstly, the page is retrieved from a URL or local file and parsed to produce a document object
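A minimal sketch of this first step is shown below, assuming the document object is simply a table of words and weights and that a word's weight is its frequency in the page. The slides do not say how the real classifier weights words. Retrieval from a URL is left out so the example runs offline; all names are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collect the text between tags, discarding the markup itself."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def make_document(html):
    """Parse HTML into a document object: a mapping of word -> weight.

    Assumption: the weight is the raw word count.
    """
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    words = re.findall(r"[a-z]+", " ".join(extractor.chunks).lower())
    return Counter(words)


doc = make_document(
    "<html><body><h1>Solar System</h1><p>The solar planets</p></body></html>"
)
print(doc["solar"], doc["planets"])  # 2 1
```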
The document object is then compared with DDC objects representing the ten top-level DDC classes
Each time a word in the document matches a word in the DDC object, the two associated weights are added to a total score
A measure of similarity is then calculated using a similarity coefficient
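The scoring step can be sketched as follows. The slides do not name the similarity coefficient, so this example assumes a Dice-style normalisation (the matched score divided by the total weight of both objects); the words and weights are invented for illustration.

```python
def match_score(doc_weights, class_weights):
    """Add the two associated weights for every word found in both objects."""
    return sum(
        doc_weights[word] + class_weights[word]
        for word in doc_weights
        if word in class_weights
    )


def similarity(doc_weights, class_weights):
    """Assumed Dice-style coefficient in [0, 1]; not confirmed by the slides."""
    total = sum(doc_weights.values()) + sum(class_weights.values())
    if total == 0:
        return 0.0
    return match_score(doc_weights, class_weights) / total


# Hypothetical document object and DDC object.
doc = {"planet": 2, "orbit": 1, "recipe": 1}
astronomy = {"planet": 3, "orbit": 2, "star": 3}

print(match_score(doc, astronomy))           # 8  (planet: 2+3, orbit: 1+2)
print(round(similarity(doc, astronomy), 3))  # 0.667  (8 / 12)
```

With this normalisation, two identical objects score 1.0 and two objects with no words in common score 0.0, which makes a fixed significance threshold easy to state.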
If there is a significant measure of similarity the document will be compared with any subclasses of that DDC class
If there are no subclasses (i.e. the DDC class is a leaf node) the document is assigned the classmark
If the result is not significant, the comparison process will proceed no further through that particular branch of the DDC object hierarchy
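The descent through the DDC hierarchy can be sketched as a recursive walk that prunes any branch whose similarity is not significant. The threshold of 0.3, the Dice-style coefficient and the word weights are assumptions made for illustration; only the classmarks (500 science, 520 astronomy, 540 chemistry, 600 technology) come from DDC itself.

```python
from dataclasses import dataclass, field


@dataclass
class DDCNode:
    classmark: str
    weights: dict
    children: list = field(default_factory=list)


def similarity(doc, cls):
    """Assumed Dice-style coefficient; the slides do not name the real one."""
    total = sum(doc.values()) + sum(cls.values())
    matched = sum(doc[w] + cls[w] for w in doc if w in cls)
    return matched / total if total else 0.0


def classify(doc, nodes, threshold=0.3):
    """Return the classmarks of every leaf reached through significant matches."""
    classmarks = []
    for node in nodes:
        if similarity(doc, node.weights) < threshold:
            continue  # not significant: go no further down this branch
        if node.children:
            classmarks.extend(classify(doc, node.children, threshold))
        else:
            classmarks.append(node.classmark)  # leaf node: assign the classmark
    return classmarks


# Hypothetical fragment of the hierarchy with invented word weights.
astronomy = DDCNode("520", {"planet": 3, "orbit": 2})
chemistry = DDCNode("540", {"chemical": 3, "acid": 2})
science = DDCNode(
    "500", {"science": 2, "planet": 1, "chemical": 1}, [astronomy, chemistry]
)
technology = DDCNode("600", {"engine": 3, "machine": 2})

doc = {"planet": 2, "orbit": 1, "science": 1}
print(classify(doc, [science, technology]))  # ['520']
```

Pruning at each level means the document is never compared with the subclasses of a class it does not resemble, so most of the hierarchy is skipped.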
Metadata elements
The automatic classification process can also be used to extract useful metadata elements in addition to the classmarks:
Keywords
Classmarks
Word count
Title
URL
Abstract
A unique accession number and associated dates can be obtained and supplied by the system
Metadata elements - Wolverhampton Core
No.  Wolverhampton Core        Dublin Core
1    Unique accession number   Identifier
2    Title                     Title
3    URL                       Identifier
4    Abstract                  Description
5    Keywords                  Subject
6    Classmarks                Subject
7    Word count                (none listed)
8    Classification date       (none listed)
9    Last modified date        Date
RDF Schema
There is a significant overlap with the Dublin Core element set
Requirement for implementation clarity
Those that have Dublin Core equivalents are declared as sub-properties
Maintain interoperability with Dublin Core applications
<rdf:Description ID="Keyword">
  <rdf:type rdf:resource="http://www.w3.org/TR/WD-rdf-syntax#Property"/>
  <rdfs:subPropertyOf resource="http://purl.org/metadata/dublin_core#Subject"/>
  <rdfs:label>Keyword</rdfs:label>
</rdf:Description>
<rdf:Description ID="Classmark">
  <rdf:type rdf:resource="http://www.w3.org/TR/WD-rdf-syntax#Property"/>
  <rdfs:subPropertyOf resource="http://purl.org/metadata/dublin_core#Subject"/>
  <rdfs:label>Classmark</rdfs:label>
</rdf:Description>
Classifier Evaluation
Automatic metadata generation will become important for the widespread deployment of RDF based applications
Documents created before the invention of RDF-generating authoring tools also need to be described
RDF utilised in this manner may encourage interoperability between search engines
More info: http://www.scit.wlv.ac.uk/~ex1253/