introduction to web search and advanced internet services

Download Introduction to Web Search and Advanced Internet Services

If you can't read please download the document

Upload: tanika

Post on 05-Jan-2016

47 views

Category:

Documents


0 download

DESCRIPTION

Introduction to Web Search and Advanced Internet Services. Tao Yang UCSB CS290N, Spring 2014. Table of Content. Search Engine Architecture and Process Web Content and Size Users Behavior in Search Sponsored Search: Advertisement Impact to Business and Search Engine Optimization - PowerPoint PPT Presentation

TRANSCRIPT

  • Introduction to Web Search and Advanced Internet ServicesTao YangUCSB CS290N, Spring 2014

  • Table of ContentSearch Engine Architecture and ProcessWeb Content and SizeUsers Behavior in SearchSponsored Search: AdvertisementImpact to Business and Search Engine OptimizationSearch engine history and related fields

  • Web search basics

    Web

    Results 1 - 10 of about 7,310,000 for miele. (0.12 seconds)

    Sponsored Links

    CG Appliance ExpressDiscount Appliances (650) 756-3931Same Day Certified Installationwww.cgappliance.comSan Francisco-Oakland-San Jose, CA

    Miele Vacuum CleanersMiele Vacuums- Complete SelectionFree Shipping!www.vacuums.com

    Miele Vacuum CleanersMiele-Free Air shipping!All models. Helpful advice.www.best-vacuum.com

    Miele, Inc -- Anything else is a compromiseAt the heart of your home, Appliances by Miele. ... USA. to miele.com. Residential Appliances.Vacuum Cleaners. Dishwashers. Cooking Appliances. Steam Oven. Coffee System ... www.miele.com/ - 20k - Cached - Similarpages

    MieleWelcome to Miele, the home of the very best appliances and kitchens in the world. www.miele.co.uk/ - 3k - Cached - Similarpages

    Miele - Deutscher Hersteller von Einbaugerten, Hausgerten ... - [ Translate this page ]Das Portal zum Thema Essen & Geniessen online unter www.zu-tisch.de. Miele weltweit...ein Leben lang. ... Whlen Sie die Miele Vertretung Ihres Landes. www.miele.de/ - 10k - Cached - Similarpages

    Herzlich willkommen bei Miele sterreich - [ Translate this page ]Herzlich willkommen bei Miele sterreich Wenn Sie nicht automatischweitergeleitet werden, klicken Sie bitte hier! HAUSHALTSGERTE ... www.miele.at/ - 3k - Cached - Similarpages

  • Search engine architecture: key piecesSpider (a.k.a. crawler/robot) builds corpusCollects web pages recursivelyFor each known URL, fetch the page, parse it, and extract new URLsRepeatAdditional pages from direct submissions & other sourcesIndexer creates inverted indexes so online system can searchOnline query process serves query resultsFront end query reformulation, word processingBack end finds matching documents and ranks them

  • *Inverted indexLinked lists generally preferred to arraysDynamic space allocationInsertion of terms into documents easySpace overhead of pointers248163264128235813213413161Sorted by docID (more later on why).

  • Indexing Process

  • Indexing ProcessText acquisitionidentifies and stores documents for indexingText transformationtransforms documents into index terms or featuresIndex creationtakes index terms and creates data structures (indexes) to support fast searching

  • Indexing and Mining at Ask.comInternet Web documentsContent classificationSpammerremovalDuplicateremovalInverted indexgenerationLink graph generationOnlineDatabaseDocument respositoryDocument respositoryDocument respositoryClick data analysis

  • Query Process

  • Query ProcessUser interactionsupports creation and refinement of query, display of resultsRankinguses query and indexes to generate ranked list of documentsEvaluationmonitors and measures effectiveness and efficiency (primarily offline)

  • Ask.com Online Engine Architecture Neptune Clustering MiddlewareDocumentAbstractCacheFrontendClient queriesTraffic load balancerCacheCacheFrontendFrontendFrontend

    DocumentAbstractDocumentAbstractDocumentdescriptionPageInfoPage InfoHierarchicalCacheStructuredDBWeb page index

  • User InteractionQuery transformationImproves initial query, both before and after initial searchIncludes text transformation techniques used for documentsSpell checking and query suggestion provide alternatives to original queryQuery expansion and relevance feedback modify the original query with additional terms

  • User InteractionResults outputConstructs the display of ranked documents for a queryGenerates snippets to show how queries match documentsHighlights important words and passagesRetrieves appropriate advertising in many applicationsMay provide clustering and other visualization tools

  • Online System SupportPerformance optimizationDesigning ranking algorithms for efficient processingTerm-at-a time vs. document-at-a-time processingSafe vs. unsafe optimizationsDistributionProcessing queries in a distributed environmentQuery broker distributes queries and assembles resultsCaching is a form of distributed searching

  • EvaluationLoggingLogging user queries and interaction is crucial for improving search effectiveness and efficiencyQuery logs and clickthrough data used for query suggestion, spell checking, query caching, ranking, advertising search, and other componentsRanking analysisMeasuring and tuning ranking effectivenessPerformance analysisMeasuring and tuning system efficiency

  • General Search: identify relevant information with a horizontal/exhaustive view of the world.Vertical Search:Focus on specific segment of web contentIntegrate domain knowledge (e.g. taxonomies /ontology), & deep web Examples: travel in Expedia, products in Amazon.General Search vs. Vertical Search

  • Example of Vertical Search: Question Answering

  • Table of ContentSearch Engine Architecture and ProcessWeb Content and SizeUsers Behavior in SearchSponsored Search: AdvertisementImpact to Business and Search Engine OptimizationSearch Engine History/Related Fields

  • Characteristics of Web ContentNo design/co-ordinationDistributed content creation, linkingContent includes truth, lies, obsolete information, contradictions Structured (databases), semi-structured Scale -- hugeGrowth slowed down from initial volume doubling every few monthsContent can be dynamically generated

  • Dynamic Web ContentA page without a static html versionE.g., current status of flight AA129Current availability of rooms at a hotelUsually, assembled at the time of a request from a browserTypically, URL has a ? character in itMost dynamic content is ignored by web spidersMany reasons including malicious spider trapsAcquired for some content (e.g. news stores)Application-specific spidering

  • The web: sizeWhat is being measured?Number of hostsNumber of (static) html pagesVolume of dataNumber of hosts netcraft surveyhttp://news.netcraft.com/archives/web_server_survey.htmlGives monthly report on how many web servers are out thereNumber of pages numerous estimatesMore to follow later in this courseFor a Web engine: how big its index is

  • The web: the number of hosts

  • The web: web server vendors

  • Static pages: rate of changeFetterly et al. study: several views of data, 150 million pages over 11 weekly crawlsBucketed into 85 groups by extent of change

  • DiversityLanguages/EncodingsHundreds (thousands ?) of languages, W3C encodings: 55 (Jul01) [W3C01]Google (mid 2001): English: 53%, JGCFSKRIP: 30%Document & query topicPopular Query Topics (from 1 million Google queries, Apr 2000)

    Arts14.6%Arts: Music6.1%Computers13.8%Regional: North America5.3%Regional10.3%Adult: Image Galleries4.4%Society8.7%Computers: Software3.4%Adult8%Computers: Internet3.2%Recreation7.3%Business: Industries2.3%Business7.2%Regional: Europe1.8%

  • Table of ContentSearch Engine Architecture and ProcessWeb Content and SizeUsers Behavior in SearchSponsored Search: AdvertisementImpact to Business and Search Engine OptimizationSearch Engine History/Related Fields

  • The userDiverse in access methodologyIncreasingly, high bandwidth connectivityGrowing segment of mobile users: limitations of form factor keyboard, displayDiverse in search methodologySearch, search + browse, filter by attribute Average query length ~ 2.5 termsHas to do with what theyre searching forPoor comprehension of syntaxEarly engines surfaced rich syntax Boolean, phrase, etc.Current engines hide these

  • *Web Search: How do users find content?Informational (~25%) want to learn about something

    Navigational (~40%) want to go to that page

    Transactional (~35%) want to do something (web-mediated)Access a serviceDownloads ShopGray areasFind a good hubExploratory search see whats there autismUnited AirlinesCar rental FinlandBroder 2002, A Taxomony of web search

  • Users evaluation of enginesRelevance and validity of resultsUI Simple, no clutter, error tolerantTrust Results are objective, the engine wants to help mePre/Post process tools providedMitigate user errors (auto spell check)Explicit: Search within results, more like this, refine ...Anticipative: related searches

  • Users evaluationQuality of pages varies widelyRelevance is not enoughDuplicate eliminationPrecision vs. recallOn the web, recall seldom mattersWhat mattersPrecision at 1? Precision above the fold?Comprehensiveness must be able to deal with obscure queriesRecall matters when the number of matches is very smallUser perceptions may be unscientific, but are significant over a large aggregate

  • *What about on MobileQuery characteristics:Best known studies by Kamvar and Baluja (2006 and 2007) and by Yi, Maghoul, and Pedersen (2008)Have a different distribution than the query distribution for PC usersBias towards shorter queriesData contradicts that: 2.6 words per query, same # chars as PCDifficulty of query entry is a significant hurdleMuch higher location-based activityMore notification-driven tasks

  • *Implications and ChallengesTask-orientationSpecialized content packagingLocality Inference from queries and from devicesMinimize typing and round-trips: get results, not just linksLess room to display SERP + other accessoriesUse of mobile in social settings and leveraging notification abilities

  • Table of ContentSearch Engine Architecture and ProcessWeb Content and SizeUsers Behavior in SearchSponsored Search: AdvertisementImpact to Business and Search Engine Optimization

  • *Search queryAd

  • *QuestionsDo you think an average user, knows the difference between sponsored search links and algorithmic search results?

  • *How it worksAdvertiserLanding pageSponsoredsearch engineEngine decides when/where to show this ad.Engine decides how much to charge advertiser on a click.Ad Index

  • Three sub-problemsMatch ads to query/contextOrder the adsPricing on a click-through

  • Paid placementGives consumer opportunity to click through to an advertiserCompensated by advertiser for click throughEach consumer reveals clues about her information need at handThe keyword(s) he types (e.g., miele)Keyword(s) in his email (gmail)Personal profile information (Yahoo! )Complex logistical problems: selling contracts, scheduling ads supply chain optimization

  • Table of ContentSearch Engine Architecture and ProcessWeb Content and SizeUsers Behavior in SearchSponsored Search: AdvertisementImpact to Business and Search Engine OptimizationSearch Engine History/Related Fields

  • Search Traffic is Important for Business:

    Example of Site Traffic Analysis

  • The trouble with paid placementIt costs money. Whats the alternative?Search Engine Optimization:Tuning your web page to rank highly in the search results for select keywordsAlternative to paying for placementThus, intrinsically a marketing functionAlso known as Search Engine MarketingPerformed by companies, webmasters and consultants (Search engine optimizers) for their clients

  • Search engine optimizationMotivesCommercial, political, religious, lobbiesPromotion funded by advertising budgetOperatorsContractors (Search Engine Optimizers) for lobbies, companiesWeb mastersHosting servicesForumWeb master world ( www.webmasterworld.com )Search engine specific tricks Discussions about academic papers More pointers in the Resources

  • The spam industry

  • Simplest formsEarly engines relied on the density of termsThe top-ranked pages for the query maui resort were the ones containing the most mauis and resortsSEOs responded with dense repetitions of chosen termse.g., maui resort maui resort maui resort Often, the repetitions would be in the same color as the background of the web pageRepeated terms got indexed by crawlersBut not visible to humans on browsersCant trust the words on a web page, for ranking.

  • Keyword stuffing

  • Invisible textauctions.hitsoffice.com/Pornographic Content

  • Cloaking:

  • Link FarmsBoost pagerank of a website

  • Table of ContentSearch Engine Architecture and ProcessWeb Content and SizeUsers Behavior in SearchSponsored Search: AdvertisementImpact to Business and Search Engine OptimizationSearch Engine History/Related Fields

  • *Information Retrieval (IR) SystemIRSystem

  • *From Information Retrieval to Web SearchChallenging due to Large-scale and noisy data.retrieving relevant documents to a query.retrieving from large sets of documents efficiently.Relevance is a subjective judgment and may include:Simplest notion of relevance is that the query string appears verbatim in the document.More:Being on the proper subject.Being timely (recent information).Being authoritative (from a trusted source).Satisfying the goals of the user and his/her intended use of the information (information need).

  • *Problems with KeywordsMay not retrieve relevant documents that include synonymous terms.restaurant vs. cafPRC vs. ChinaMay retrieve irrelevant documents that include ambiguous terms.bat (baseball vs. mammal)Apple (company vs. fruit)bit (unit of data vs. act of eating)

  • *Search Intent AnalysisTaking into account the meaning of the words used.Taking into account the order of words in the query.Adapting to the user based on direct or indirect feedback.Taking into account the authority of the source.

  • Topics: Text miningText mining is a cover-all marketing termA lot of what weve already talked about is actually the bread and butter of text mining:Text classification, clustering, and retrievalBut we will focus in on some of the higher-level text applications:Extracting document metadataTopic tracking and new story detectionCross document entity and event coreferenceText summarizationQuestion answering

  • Topics: Information extractionGetting semantic information out of textual dataFilling the fields of a database recordE.g., looking at an event web page:What is the name of the event?What date/time is it?How much does it cost to attendOther applications: resumes, health data, A limited but practical form of natural language understanding

  • Topics: Recommendation systemsUsing statistics about the past actions of a group to give advice to an individualE.g., Amazon book suggestions or NetFlix movie suggestionsA matrix problem: but now instead of words and documents, its users and documents

  • *History of IR and Web Search1960-70s: Initial exploration of text retrieval systems for small corpora of scientific abstracts, and law and business documents.Development of the basic Boolean and vector-space models of retrieval.1980s:Larger document database systems, many run by companies:Lexis-NexisDialogMEDLINE

  • *From IR to Web Search1990s:Organized CompetitionsNIST TRECSearching FTPable documents on the InternetArchieWAISSearching the World Wide WebLycosYahooAltavista

  • *IR/Web Search History in 2000s2000sLink analysis for Web SearchGoogleInktomi TeomaFeedback based engine:DirectHit (Ask.com)Automated Information ExtractionWhizbangFetchBurning GlassQuestion AnsweringTREC Q/A trackAsk.com/Ask Jeeves

  • *IR/Web Search Activities in 2000s2000s continued:Multimedia IRImageVideoAudiomusicCross-Language IRDocument SummarizationMobile search

  • *Related AreasInformation Management and Data AnalysisInformation Science &CHIMachine Learning and data miningNatural Language ProcessingLarge-scale systemsDatabase/data storesOperating systems/networking supportWeb language analysisCompression/fast algorithms.Fault tolerance/paralle+distributed systems

    *