May 2007

Search Engines: Challenges & Trends

David Rashty

david.rashty@gmail.com

Search Challenges

• Where web search fails?
• Search engines characteristics
• Search engines user interface
• Search engines trends
• Q & A

About Myself

• One of the first WWW developers (1992). Created the 10th website of its kind.
• Founder of two start-ups/ventures: Addwise.com (1999-2005), which deals with software application and information architecture, and Snunit (1994-1999), which develops e-learning activities.
• Currently involved in a new venture called "ResearchTrail.com".

Where Web Search Fails?

How People Search

• Navigational (find out the address of a website): 'How do I find the website of CNN'
• Factual (find exact information): 'Population of China; President Bush's email; Flights from NY to Detroit'
• Comprehensive (build a picture of a new world): 'I need to understand the market around wireless networking', 'I need to know more about Leukemia'

Search Skills Vary Significantly between People

Some may succeed and some may fail in locating what they are looking for

Web +/- refers to Web expertise, Econo +/- refers to domain knowledge

From (Christoph Hölscher & Gerhard Strube, 2000), http://www9.org/w9cdrom/81/81.html

Only users who could rely both on high web expertise and high domain knowledge ("double experts") were able to solve an average of 3.2 out of the 5 tasks

(Christoph Hölscher & Gerhard Strube, 2000)

Gap in Web Search

• Despite the existence of huge websites and powerful search engines, novice users have difficulty finding comprehensive information about even common topics.

Searching for relevant information on the World Wide Web is often a laborious and frustrating task for casual and experienced users

(Christoph Hölscher, Gerhard Strube, 2000)

Search Challenge: Effectiveness

If users don't find the result with their first query, they are progressively less and less likely to succeed with additional searches. Many users don't even bother…

(source: Nielsen, 2002)

JupiterResearch found that 71% of online consumers use search engines to find health-related information, but only 16% find the information they are looking for

(ZDNet Research, June 2006)

Scattered Nature of Information

• Users often retrieve incomplete information because of the complex scatter of relevant facts about a topic across web pages (source: Bhavnani 2006)

Information Density

• General pages contained information on many subjects with a medium amount of detail (portals)
• Specific pages contained information on a few subjects with a high amount of detail (articles, expert sites)
• Sparse pages contained information on a few subjects with little detail (references)

(source: Bhavnani 2006)

Search Challenge

• Searching for comprehensive information requires knowledge and skills
• Novice users lack advanced search skills
• Information scatter is not addressed by search engines, and novice users are usually unaware of it.

Search Engines Characteristics

Search Engines Overlap

• A study looked at search results from more than 12,500 random queries on Ask Jeeves, Google, MSN Search and Yahoo, and found that the overlap in first page results for these four engines was a scant 1.1% on average for a given query
• 84.9% of total results were unique to one engine
• 11.4% of total results were shared by any two engines
• 2.6% of total results were shared by any three engines
• 1.1% of total results were shared by all four engines

Source: Search Engine Watch
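The overlap percentages above can be reproduced in principle by counting, for each distinct first-page result, how many engines returned it. The following is a minimal sketch under stated assumptions, not the method used in the cited study: the function name `overlap_distribution`, the engine labels and the sample URLs are all hypothetical, and results are compared by exact URL match only.

```python
from collections import Counter


def overlap_distribution(results_by_engine):
    """Return {k: share of distinct results returned by exactly k engines}.

    results_by_engine maps an engine name to its list of first-page URLs.
    URLs are compared by exact string match (an assumption; a real study
    would need to normalize URLs first).
    """
    engines_per_url = Counter()
    for urls in results_by_engine.values():
        for url in set(urls):  # count each engine at most once per URL
            engines_per_url[url] += 1

    total = len(engines_per_url)
    if total == 0:
        return {}

    by_count = Counter(engines_per_url.values())
    return {k: by_count[k] / total for k in range(1, len(results_by_engine) + 1)}


# Hypothetical first-page results for a single query.
sample = {
    "engine_a": ["u1", "u2", "u3"],
    "engine_b": ["u1", "u4"],
    "engine_c": ["u1", "u2", "u5"],
    "engine_d": ["u1", "u6"],
}

for k, share in overlap_distribution(sample).items():
    print(f"returned by exactly {k} engine(s): {share:.1%}")
```

Averaging such a distribution over many queries would give figures of the same kind as those on the slide; the four shares always sum to 100%, as 84.9 + 11.4 + 2.6 + 1.1 do.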

Users' Attention (source: Checkit)

Search Query Syntax

Old SE / Professional

Modern SE / Novice

Is it more advanced, or is it helping define the query better?

Search Query Problems

• Vocabulary - Two people are unlikely to use the same word to describe the same thing… (source: Google)
• Operators - most people don't know how to use search engine operators ("Only 16% of participants used quotation marks, many incorrectly", source: Hargittai)
• Query length - Average query was 2.6 words long (in 2001), up from 2.4 words in 1997 (source: Google)
• Boolean operators - People don't understand Boolean logic (AND, OR)! (source: Google)
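Figures such as average query length or the share of queries using quotation marks are typically derived from a query log. The sketch below shows one way such numbers could be computed; it is an illustration only, with a hypothetical function name (`query_stats`) and made-up sample queries, and it does not reproduce how Google or Hargittai measured them.

```python
def query_stats(queries):
    """Compute simple statistics over a list of query strings.

    Words are split on whitespace; a query counts as "quoted" if it
    contains a double quote, and as "boolean" if it contains AND or OR
    as a standalone upper-case token. These rules are assumptions.
    """
    if not queries:
        raise ValueError("at least one query is required")

    n = len(queries)
    word_counts = [len(q.split()) for q in queries]
    quoted = sum(1 for q in queries if '"' in q)
    boolean = sum(
        1 for q in queries if any(tok in ("AND", "OR") for tok in q.split())
    )
    return {
        "avg_words": sum(word_counts) / n,
        "quoted_share": quoted / n,
        "boolean_share": boolean / n,
    }


# Hypothetical query log.
log = [
    "cnn",
    "population of china",
    '"wireless networking" market',
    "flights NY OR Detroit",
]
print(query_stats(log))
```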

Search Query Length (Source: Yahoo)

SE Commands, Operators & Shortcuts

• Google: more than 60 (http://www.google.com/intl/en/help/features.html)
• Yahoo: more than 60 (http://help.yahoo.com/help/us/ysearch/basics/basics-04.html)
• How many people use them?
• How many people are aware of them?
• Are they useful?

Advanced Search?

Used by less than 10%!!

Invisible Web / Hidden Web

• Deep Web (or Deepnet, invisible Web or hidden Web) refers to World Wide Web content not part of the surface Web indexed by search engines. (source: Wikipedia)
• Includes: dynamic content, unlinked content, limited access content, scripted content, non-text content
• More than 500 times as much information as traditional search engines "know about" is available in the deep Web (source: Computerworld)

Misleading & Spam Content

• Spam, adware
• People add unrelated terms, use multiple domains, link farms, guestbook bots
• Cloaking - Also known as stealth, a technique used by some Web sites to deliver one page to a search engine for indexing while serving an entirely different page to everyone else
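Since cloaking means serving one page to a crawler and a different one to everyone else, one rough way to flag it is to compare the two versions of the same URL. The sketch below assumes both page texts have already been fetched (one with a crawler-like identity, one with a browser-like identity) and compares their word sets with Jaccard similarity. The function names and the 0.5 threshold are hypothetical, and legitimately dynamic pages can also differ between requests, so this is a heuristic illustration rather than a real detector.

```python
def jaccard(text_a, text_b):
    """Jaccard similarity of the lower-cased word sets of two texts."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty pages are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def looks_cloaked(crawler_page, browser_page, threshold=0.5):
    """Flag a URL when the two versions share too few words.

    threshold is an arbitrary illustrative value, not a tuned one.
    """
    return jaccard(crawler_page, browser_page) < threshold


# Hypothetical page texts for the same URL.
served_to_crawler = "cheap flights hotels travel deals"
served_to_browser = "casino poker bonus"

print(looks_cloaked(served_to_crawler, served_to_browser))
print(looks_cloaked(served_to_crawler, served_to_crawler))
```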

Spam Content (1)

Source: http://www.yr-bcn.es/webspam/

Black nodes are spam, white nodes are non-spam

Corpus consists of 77 million pages from 12,000 hosts. These pages have been annotated at the level of hosts. Over 3,000 hosts have been manually labelled by at least two human judges as "Spam", "Not Spam" or "Borderline".

Spam Content (2)

Source: http://www.yr-bcn.es/webspam/

Red nodes are spam, blue nodes are normal, and green nodes are normal pages with spam content.

It is composed of a connected graph of 5,000 Web pages and is labeled at the page level. Each Web page is labeled as "Spam", "Not Spam" or "Borderline"; the last category corresponds to Web pages where the content is only partially spam, blog spam pages for example.

Misleading Content Agency

Unsafe Content (2006)

1 - "Red" rated sites failed SiteAdvisor's safety tests. Examples are sites that distribute adware, send a high volume of spam, or make unauthorized changes to a user's computer.

2 - "Yellow" rated sites engage in practices that warrant important advisory information based on SiteAdvisor's safety tests. Examples are sites which send a high volume of "non-spammy" email, display many popup ads, or prompt a user to change browser settings.

Source: http://www.siteadvisor.com/studies/search_safety_dec2006

Search Engines Challenge

• Can we trust SE ranking?
• How to handle misleading & spam information?
• How to use advanced features? NLP is not enough

Search User Interface

AltaVista 1995

Google 1998

Google 2007

KartOO 2007

Advanced UI ???

Search UI Challenge

• Search engine UIs didn't change much in the last 10 years (the web did change…).
• Search engine UIs do not reflect what is known about user behavior.
• 1,000,000… results, but only 30 are currently useful.
• Too much noise!!

Search Engines Trends

Search Engines Statistics

Total 97.71%

2.29% left for all the others

Search Trends

Trends shown in the diagram: Vertical, Content, Clustering, Visualization, Improved UI, Tailored, Assisted Search, Community, Factual QA, Experts, Meta, Language

Clusty 2007 (clustering)

Grokker 2007 (clustering + visualization)

Rollyo 2007 (tailor made search)

Yahoo Search Builder 2007 (tailor made search)

MetaCrawler 2007 (combined search)

ChaCha 2007 (expert/community search)

Trexy 2007 (strategies)

Snap 2007 (improved UI)

SearchMash 2007 (Google playground)

Swiki 2007 (social search)

Pure Video 2007 (content oriented)

Pure Video 2007 (subject oriented/vertical)

Yahoo Answers (Q&A)

Kasamba (experts)

Baidu (language)

Search Trends Challenge

• How do we combine all the relevant features together without complicating the user interface?
• Will Google add more advanced features?

ResearchTrail

ResearchTrail Alpha Application

http://www.ResearchTrail.com

References & QA

Reference

• Article in YNET - http://www.ynet.co.il/articles/0,7340,L-3338735,00.html
• Pandia - http://www.pandia.com/
• ResearchBuzz - http://www.researchbuzz.org/wp/
• Prolog (Hebrew) - http://www.i-zm.info/

More Examples…

Thank You!

David Rashty

david.rashty@gmail.com
