Quad Search: A novel metasearch engine
(http://cheetah.csd.auth.gr/~lakritid)
Leonidas Akritidis (1), George Voutsakelis (2), Dimitrios Katsaros (1,2), Panayiotis Bozanis (2)
(1) Data Engineering Lab, Dept. of Informatics, Aristotle Univ., Thessaloniki, Hellas
(2) Computer & Communication Engineering Dept., Univ. of Thessaly, Volos, Hellas
11th Panhellenic Conference of Informatics, Patras, Hellas, 18-20/05/2007
Introduction
Single Search Engines
• Maintenance of a document database
• Low Web coverage
• Medium scalability
• Paid listings
Metasearch Engines
• Effortless invocation of multiple search engines
• No document database
• Increased Web coverage
• Improved retrieval effectiveness
Outline: Introduction, Metasearch Engines, Rank Aggregation, Rank Aggregation Methods, KE Method, Antispam Version
Metasearch Engines
Metasearch engines use the document databases that the component search engines maintain.
[Diagram: User, Metasearch Engine, Component Engine 1, Component Engine 2, …, Component Engine N, each with its own Document Database 1, 2, …, N]
Rank Aggregation
What is Rank Aggregation?
[Diagram: the ranked lists returned by the component engines (A B D C F E; B D C A; B D C A F E; …) pass through Rank Aggregation, which produces a single merged list (B D C A F E)]
Rank Aggregation Methods
• Unweighted Borda Count
• Spearman’s Footrule
• Kendall’s Tau
• Markov Chains
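As an illustration of the first method in the list, here is a minimal sketch of an unweighted Borda count. The slides only name the method, so the scoring convention used here (an item at 0-based position p earns N - p points, where N is the number of distinct items, and items missing from a list earn nothing) is an assumption, as are the function name `borda_count` and the sample lists.

```python
from collections import defaultdict


def borda_count(rankings):
    """Merge ranked lists with an unweighted Borda count (assumed convention)."""
    # Universe of candidates: every item that appears in at least one list.
    universe = {item for ranking in rankings for item in ranking}
    size = len(universe)
    scores = defaultdict(int)
    for ranking in rankings:
        for position, item in enumerate(ranking):
            # Best position earns `size` points, the next one size - 1, and so on.
            scores[item] += size - position
    # Highest total first; ties are broken alphabetically to keep the output stable.
    return sorted(universe, key=lambda item: (-scores[item], item))


# Hypothetical top-k lists from three component engines.
lists = [
    ["A", "B", "C", "D"],
    ["B", "A", "D"],
    ["B", "C", "A", "E"],
]
print(borda_count(lists))  # ['B', 'A', 'C', 'D', 'E']
```

Other conventions for partial lists exist (for example, sharing the leftover points among the missing items); the choice affects the merged order.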
KE Method
Description
Each result is called a candidate.
Each candidate receives a score (weight), according to the formula below:
$$ w = \frac{\sum_{i=1}^{m} r(i)}{n^{m} \left( \frac{k}{10} + 1 \right)^{n}} $$
• r(i): the candidate’s rank in the i-th engine
• n: the number of the candidate’s appearances
• m: the number of the invoked search engines
• k: the length of the top-k list
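The weight defined above can be sketched in a few lines of Python. Two points are assumptions rather than statements from the slides: the sum runs only over the engines that actually returned the candidate, and candidates are ordered by ascending weight (a low rank sum and many appearances both shrink the weight, so a smaller weight is taken to mean a better position). The names `ke_weight` and `ke_rank` and the sample lists are hypothetical.

```python
def ke_weight(ranks, m, k):
    """KE weight of one candidate.

    ranks: the candidate's 1-based ranks in the engines that returned it
    m:     number of invoked search engines
    k:     length of the top-k lists
    """
    n = len(ranks)  # number of appearances
    return sum(ranks) / (n ** m * (k / 10 + 1) ** n)


def ke_rank(top_k_lists, k):
    """Merge the engines' top-k lists, smallest KE weight first (assumed order)."""
    m = len(top_k_lists)
    ranks = {}
    for results in top_k_lists:
        for rank, url in enumerate(results[:k], start=1):
            ranks.setdefault(url, []).append(rank)
    weights = {url: ke_weight(r, m, k) for url, r in ranks.items()}
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(weights, key=lambda url: (weights[url], url))


# Hypothetical results from m = 3 engines, with k = 10.
engines = [["a", "b", "c"], ["b", "a"], ["b", "d"]]
print(ke_rank(engines, k=10))  # ['b', 'a', 'd', 'c']
```

Here "b" appears in all three lists with ranks 2, 1, 1, so its weight is 4 / (3^3 · 2^3) ≈ 0.019, while "c" appears once at rank 3 and gets 3 / (1 · 2) = 1.5.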
Antispam Version of the KE Method
We say that a search engine has been spammed by a page when it ranks the page too highly with respect to the other pages, according to the view of a typical user.
We try to constrain this phenomenon by proposing the antispam version of the KE Method, which is best described by the following pseudocode:
1. Find which items appear in more than half of the engines’ result pages (let the number of these items be c)
2. Apply the KE Method to these items
3. Position them in the results list, starting at rank 1
4. Apply the KE Method to the rest of the items
5. Position them in the results list, starting at rank c+1
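The five steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: "more than half" is read as appearing in more than m/2 of the m component engines' lists, smaller KE weights are placed first, and the helper names and sample data are hypothetical.

```python
def ke_order(ranks_by_item, m, k):
    """Order items by ascending KE weight (assumed direction), ties alphabetical."""
    def weight(ranks):
        n = len(ranks)
        return sum(ranks) / (n ** m * (k / 10 + 1) ** n)
    return sorted(ranks_by_item, key=lambda item: (weight(ranks_by_item[item]), item))


def antispam_ke(top_k_lists, k):
    m = len(top_k_lists)
    ranks = {}
    for results in top_k_lists:
        for rank, item in enumerate(results[:k], start=1):
            ranks.setdefault(item, []).append(rank)
    # Step 1: items returned by more than half of the engines (there are c of them).
    majority = {item: r for item, r in ranks.items() if len(r) > m / 2}
    rest = {item: r for item, r in ranks.items() if item not in majority}
    # Steps 2-3: the majority items fill ranks 1..c.
    # Steps 4-5: the remaining items follow from rank c + 1.
    return ke_order(majority, m, k) + ke_order(rest, m, k)


# Hypothetical lists: "spam" is ranked first by a single engine only.
engines = [["spam", "a", "b"], ["a", "b", "c"], ["b", "a", "d"]]
print(antispam_ke(engines, k=10))  # ['a', 'b', 'spam', 'c', 'd']
```

Because the two groups are ranked separately, an item promoted by one engine alone can never overtake an item that most engines agree on, whatever its weight.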
Quad Search’s Architecture
Outline: Existing Engines, Quad Search, Web Platform, Architecture, User Interface, Quad Bot, Web Search APIs, Engine Bombing, Results Filtering, Advanced Search, Search Options, Result Presentation, Extra Features
Schematic diagram of Quad Search’s Architecture
[Diagram components: User, User Interface, Database Selector, Quad Bot, Object Builder, Classification Module, Presentation Module; labels: Query Terms, Ranking Algorithm, Results Page]
User Interface
Features
Quad Search’s user interface is friendly and simple in order to ensure:
• Short download times
• Compatibility with all major browsers
• Convenient usage
For this reason, we avoided using:
• Large graphics files
• JavaScript and AJAX
• Flash presentations
User Interface (Search Hints)
Search Hints
We developed this part of Quad Search to provide:
• Detailed information about all its features
• Explanations of simple and complex operations
• Many helpful examples
Quad Bot (1)
Description
Quad Bot is responsible for result retrieval. It consists of the following sub-modules:
• Input Validator: performs security checks on the input
• Query Dispatcher: submits the query to the component search engines simultaneously
• Result Collector: gathers the engines’ responses
• Result Validator: performs multiple conversions on the collected data
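The four sub-modules can be pictured as a small pipeline. The sketch below is only an illustration: the slides do not say which security checks or conversions Quad Bot applies, so the length check, the URL normalization and all function names are assumptions, and the component engines are replaced by stub functions instead of real HTTP requests.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit, urlunsplit


def validate_input(query, max_length=256):
    """Input Validator: a stand-in security check (the length limit is an assumption)."""
    query = query.strip()
    if not query or len(query) > max_length:
        raise ValueError("empty or oversized query")
    return query


def dispatch_and_collect(query, engines):
    """Query Dispatcher + Result Collector: call every engine in parallel."""
    collected = {}
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        # Submit all requests first so the engines work concurrently.
        futures = {name: pool.submit(fetch, query) for name, fetch in engines.items()}
        for name, future in futures.items():
            try:
                collected[name] = future.result()
            except Exception:
                collected[name] = []  # a failing engine contributes no results
    return collected


def validate_results(collected):
    """Result Validator: normalize URLs and drop per-engine duplicates (assumed conversions)."""
    cleaned = {}
    for name, urls in collected.items():
        seen = []
        for url in urls:
            parts = urlsplit(url.strip())
            normalized = urlunsplit(
                (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
            )
            if normalized not in seen:
                seen.append(normalized)
        cleaned[name] = seen
    return cleaned


def quad_bot(query, engines):
    return validate_results(dispatch_and_collect(validate_input(query), engines))


def broken_engine(query):
    raise RuntimeError("engine unavailable")


# Stub engines standing in for the real component search engines.
stub_engines = {
    "engine1": lambda q: ["HTTP://Example.com", "http://example.org/a#top"],
    "engine2": lambda q: ["http://example.org/a", "http://example.org/a"],
    "engine3": broken_engine,
}
print(quad_bot("metasearch", stub_engines))
```

Submitting every request before reading any response is what makes the dispatch simultaneous; the total waiting time is then bounded by the slowest engine rather than by the sum of all response times.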
Quad Bot (2 - Architecture)
Architecture
[Diagram components: DB Selector - User, Parameter Receiver - Validator, Query Dispatcher, Engine 1, Engine 2, Engine 3, Engine 4, Result Collector, Result Validator, Object Builder]
Web Search APIs
What is a Web Search API?
API stands for Application Programming Interface.
It is a programming tool supplied by the vendor of a large-scale application.
A Web Search API is used to retrieve results from major search engines.
Disadvantages
• Inaccurate results compared to the “mother” engine
• Limit on queries per day
• Registration IDs required
• Limit on queries per registration ID
Quad Search does not make use of Search APIs.
Engine Bombing
Definition
Engine bombing occurs when multiple results from the same domain enter the presented results list.
Many metasearch engines suffer from engine bombing.
Engine Bombing Protection
Quad Search supports a feature that limits the number of different results coming from the same domain.
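A per-domain cap of this kind could look like the sketch below. The slides do not describe how Quad Search implements the limit, so the function name `limit_per_domain`, the default cap of 2 and the use of the URL's host name as the "domain" are assumptions.

```python
from urllib.parse import urlsplit


def limit_per_domain(ranked_urls, max_per_domain=2):
    """Keep at most `max_per_domain` results per host, preserving rank order."""
    kept = []
    counts = {}
    for url in ranked_urls:
        # The host name stands in for the domain; "www.a.com" and "a.com"
        # are therefore counted separately in this simplified version.
        domain = urlsplit(url).hostname or ""
        if counts.get(domain, 0) < max_per_domain:
            kept.append(url)
            counts[domain] = counts.get(domain, 0) + 1
    return kept


# Hypothetical merged result list, best result first.
ranked = [
    "http://a.com/1",
    "http://a.com/2",
    "http://b.org/",
    "http://a.com/3",
    "http://b.org/x",
]
print(limit_per_domain(ranked))
# ['http://a.com/1', 'http://a.com/2', 'http://b.org/', 'http://b.org/x']
```

Applying the cap after rank aggregation keeps the best-ranked results of each domain and drops only the surplus ones.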
Results Filtering
Provided Filters
• Antispam Filter: applies the antispam version of the KE Method
• Ranking Algorithm Selector: Quad Search provides an option to determine how the collected results will be ranked
• Engine Bombing Protection
Advanced Web Search
Advanced Search Filters
• File Type Selector: the user can search for files of a specific format (PDF, DOC, XLS and PPT)
• Language Filter: Quad Search can return documents written in a specified language
• Domain Filter: the user can search within a given domain, or exclude a domain from a search
• Date Filter: returns results updated in the past 3, 6, or 12 months
Web Search Options
Quad Search lets the user set options that will be used in future searches.
Some of these options are:
1. Connection timeout: how long Quad Search should wait for a search engine to respond
2. The number of candidates to be collected per component engine
3. The number of results to be displayed per result page
4. Whether the results will be opened in a new browser window
Results Presentation (1)
Classic View: the results are displayed in the classic way.
Array View: the results are displayed in a ranked array. The user can see the results and their rankings more easily.
Results Presentation (2)
Results Page
The results page is highly customizable. A relevant screenshot is depicted below.
[Screenshot: Quad Search results page]
Scientific Search
Outline: Scientific Search, Related Work, H-Index, Search Options, Advanced Search, Cache, Extra Features
General Features
Quad Search is capable of searching for scientists, authors and/or published articles.
Google Scholar provides the required data.
Quad Search collects the data and produces statistics and charts.
H-Index
Definition
The h-index is an index for quantifying the scientific productivity of physicists and other scientists based on their publication record.
A scientist has index h if h of their N_p papers have at least h citations each, and the other (N_p - h) papers have no more than h citations each.
Quad Search computes the h-index when the user performs a search for authors.
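The definition translates directly into a short computation. The sketch below is not taken from Quad Search; the function name `h_index` and the sample citation counts are hypothetical.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ordered = sorted(citations, reverse=True)
    h = 0
    for position, count in enumerate(ordered, start=1):
        if count >= position:
            h = position  # the top `position` papers all have >= position citations
        else:
            break
    return h


# Hypothetical citation counts for one author's papers.
print(h_index([10, 8, 5, 4, 3]))  # 4
```

In this example four papers have at least 4 citations each, and the remaining paper has 3, which is no more than 4, so h = 4.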
Scientific Search Options
The scientific search part of Quad Search offers a variety of options that can be stored and used in future searches.
The user can define:
• The results’ language
• The results’ subject area (biology, chemistry, physics, engineering, medicine, etc.)
• The number of results to be displayed per page
• Whether the results will be opened in the current window or in a new one
Extra Features - Charts
The user can visually check the number of citations per paper of a specified author. This feature is available for “Author Searches”.
Extra Features – Excluding Papers
When a user performs an “Author Search”, Quad Search transfers all results from Google Scholar (or its cache).
Some of these articles possibly should not participate in the calculations (e.g. the h-index).
The user can exclude such papers by deselecting the appropriate checkbox.
Future Work
Outline: Future Work, Concluding Remarks
Our plans for Quad Search
• Support for extra ranking algorithms (e.g. Markov chains)
• Geography-aware search for news
• News search with RSS feeds
• Wide personalization (users, profiles, topics of interest, stored multimedia and user-defined customization)
• Image and video searches
• Searches in P2P networks (eDonkey, Gnutella, etc.)
• Torrent searches
Concluding Remarks
Conclusions
• In this session, we presented a pair of rank aggregation algorithms: the KE Method and its antispam version
• We introduced some new parameters, such as the number of top-k lists in which a page appears and the total number of exploited search engines
• We also presented a novel metasearch engine, Quad Search
• Quad Search offers a wide variety of new features for web search, such as the ranking algorithm selector and engine bombing protection
• Quad Search also provides options for searching scientific articles, and computes statistics such as the h-index