relevancy and search quality analysis - search technologies
TRANSCRIPT
1
The Manifold Path to Search Quality
Enterprise Search & Analytics Meetup
Mark David – Architect, Data Scientist
Avi Rappaport – Senior Search Quality Analyst
19 March 2015
2
“manifold”
• adjective
– having numerous different parts, elements, features, forms, etc.
- dictionary.com
3
Search Technologies: Who We Are
The leading independent IT services firm specializing in the design,
implementation, and management of enterprise search and big data
search solutions.
4
Solutions
Corporate Wide Search – “Google for the Enterprise.” A single, secure point of search for all
users and all content. Strategic initiative for corporate wide information distribution and search.
Data Warehouse Search – A Big Data search solution that enables interactive query and analytics
with extremely large data sets for business intelligence and fraud detection.
E-Commerce Search – Leverages machine learning and accuracy metrics to deliver a better
online user experience and maximize revenues from visitor search activity.
Search & Match – Increase recruiter productivity and fill rates in the staffing industry. Provides a
better search experience followed by automated candidate-to-job matching.
Search for Media & Publishing – Improve user search experience for publishers of large amounts
of content such as government organizations, research firms, and media publications.
Government Search – A solution focused on the design and development of search for government
information portals and archiving systems.
5
Search Technologies: Background
San Diego
London UK
San Jose, CR
Cincinnati
Prague, CZ
Washington (HQ)
Frankfurt DE
• Founded 2005
• 150+ employees
• 600+ customers worldwide
• Deep enterprise search expertise
• Consistent revenue growth
• Consistent profitability
6
600+ Customers
7
Search Technologies: What We Do
• All aspects of search application implementation
– Content access and processing, search system architecture, configuration, deployment
– Accuracy analysis, metrics, engine scoring, relevancy ranking, query enhancement
– User interface, analytics, visualization
• Technology assets to support implementation
– Aspire high performance content processing
– Content Connectors (Document, Jive, SharePoint, Salesforce, Box.com, etc.)
• Engagement models
– Most projects start with an “assessment”
– Fully project-managed solutions, designed, delivered, and supported
– Experts for hire, supporting in-house teams or as a subcontractor
9
[Diagram: Content sources → Connectors → Aspire Content Processing Pipelines → Indexes → Search Engine → Web Browser; also shown: Staging Repository, Publishers]
Technology Assets
1. Aspire Framework – High Performance Content Processing
– Ingests and processes content and publishes to a variety of indexes for commercial and open source search engines
2. Aspire Data Connectors – API-level access to content repositories
3. Query Processing Language (QPL) – Advanced query processing
Complements to commercial and open source search technologies
11
Understand Your Data
• Data Analysis
– Access patterns & rates, sources, schemas, field typing,
duplicates, near-duplicates, term frequencies, etc.
• Content Processing
– Source connection, format conversion, sub-document
separation, field boundaries, multiple-source assembly, etc.
• Text Processing
– Character decoding, tag stripping, tokenization, sentence
boundaries, normalization, entity extraction, pattern
recognition, disambiguation, filtering, etc.
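A minimal sketch of a few of the text-processing steps listed above (tag stripping, tokenization, normalization); the function names and the toy document are illustrative, not from any Search Technologies product.

```python
import html
import re

def strip_tags(text):
    # Unescape HTML entities, then drop simple HTML/XML tags.
    return re.sub(r"<[^>]+>", " ", html.unescape(text))

def tokenize(text):
    # Split on non-word characters; a real pipeline would also handle
    # sentence boundaries, entity extraction, and language-specific rules.
    return re.findall(r"\w+", text)

def normalize(tokens):
    # Case-fold each token; real systems may also stem, fold accents, etc.
    return [t.lower() for t in tokens]

doc = "<p>Enterprise Search &amp; Big Data</p>"
tokens = normalize(tokenize(strip_tags(doc)))
```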
12
Understand Your Users
• Search Scope
– Interviews
– Log Analysis
– Scenarios
– Wireframes & mockups
• Search Quality
• Improvements
– Relevance
– Coverage
– UX
13
Understand Your Search Engine
• How does it score results?
• How accurate is it for the short head?
• How accurate is it for the long tail?
• When you change it to improve a particular type of query,
how do you know that the overall accuracy improved?
14
Regression Testing of Search
• Step 1: Gather a Set of Judgments
• If you already have lots of user data:
– Use click log analysis to gather sets of clearly good and clearly
bad results
– Ignore unclear tracks
• If user data not yet available:
– Manual judgments
• End up with a set of queries with associated “good” and
“bad” documents
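The click-log route above can be sketched as follows: label documents "good" or "bad" per query from impression and click counts, and skip the unclear cases. The thresholds and the toy events are illustrative assumptions, not prescribed values.

```python
from collections import defaultdict

events = [
    # (query, doc_id, clicked) -- made-up click-log entries
    ("laptop", "d1", True), ("laptop", "d1", True), ("laptop", "d1", False),
    ("laptop", "d2", False), ("laptop", "d2", False), ("laptop", "d2", False),
    ("laptop", "d3", True), ("laptop", "d3", False),
]

stats = defaultdict(lambda: [0, 0])  # (query, doc) -> [impressions, clicks]
for q, d, clicked in events:
    stats[(q, d)][0] += 1
    stats[(q, d)][1] += clicked

judgments = {}
for (q, d), (imp, clk) in stats.items():
    ctr = clk / imp
    if imp >= 3 and ctr >= 0.5:
        judgments[(q, d)] = "good"   # clearly good: shown often, clicked often
    elif imp >= 3 and ctr == 0.0:
        judgments[(q, d)] = "bad"    # clearly bad: shown often, never clicked
    # otherwise: unclear track -> no judgment
```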
15
Regression Testing of Search
• Step 2: Instrument the Search Results
• Periodically execute all those queries, and score the results
• How to score:
– Every good document adds a position-based amount
– Every bad document subtracts the same amount
– Unknown documents don’t affect the score (except by
occupying a position)
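The scoring rule above can be sketched in a few lines; the 1/rank position weighting is an illustrative choice, not the talk's specified formula.

```python
def score_results(results, good, bad):
    # Each good document adds a position-weighted amount, each bad one
    # subtracts the same amount; unknown documents contribute nothing
    # beyond occupying a position (which lowers the weight of later hits).
    score = 0.0
    for rank, doc in enumerate(results, start=1):
        weight = 1.0 / rank
        if doc in good:
            score += weight
        elif doc in bad:
            score -= weight
    return score

s = score_results(["d1", "dX", "d9"], good={"d1"}, bad={"d9"})
```

Running all stored queries through this periodically gives a single trend line per query set, so a change that helps one query type but hurts overall accuracy shows up immediately.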
17
Relevancy Improvements from Data
• Text Processing
– Typos
– Entity Extraction
– Breaks
– Parts of Speech
• Data Analysis
– TF-IDF
– Phrase Dictionary
– Boilerplate
18
To Correct or Not To Correct
• Should typos be “fixed”?
• This goes back to knowing your audience
• Example: Haircutz
• In document-to-document situations, generally yes.
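One simple way to "fix" a typo like this is edit-distance matching against the index vocabulary; the vocabulary and cutoff below are illustrative, and whether the correction should actually be applied depends on your audience, as the slide notes.

```python
import difflib

# Made-up index vocabulary for illustration.
vocabulary = ["haircuts", "haircare", "hairstyle", "barber"]

def correct(term, vocab, cutoff=0.8):
    # Suggest a correction only when a query term is close enough to a
    # known index term; otherwise leave the user's spelling alone.
    matches = difflib.get_close_matches(term.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else term

fixed = correct("haircutz", vocabulary)
```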
19
Bigger Needles in the Haystack
• Entity Extraction: How big a chunk?
• Example: [email protected]
– Is that 1, 2, 3, 4, or 5 tokens?
• Multi-indexing is a key component of accuracy
– Different people think differently, so the indexes need to have
different ways of representing the data.
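Multi-indexing can be sketched as producing several tokenizations of the same raw value, so that each mental model of the data has an index view that matches it; the email address below is a made-up example.

```python
import re

def tokenizations(raw):
    # Index the same raw value several ways: as one whole token, as
    # user/domain halves, and split to the finest parts.
    return {
        "whole": [raw],                    # 1 token: exact-match searches
        "user_domain": raw.split("@"),     # 2 tokens: user or domain searches
        "parts": re.split(r"[@.]", raw),   # finest split: partial matches
    }

views = tokenizations("jane.doe@example.com")
```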
20
Breaker, Breaker
• Don’t match across boundaries
– Paragraph
– Sentence
– Phrase
• Whitespace does have meaning!
• Punctuation does have meaning!
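One way to keep matches from crossing boundaries is to leave a large gap in token positions at each sentence break, so a phrase query (which requires adjacent positions) cannot span two sentences; this mirrors the position-increment-gap idea used by some search engines, and the gap size here is an arbitrary illustrative constant.

```python
import re

GAP = 100  # illustrative; any value larger than the longest phrase works

def positioned_tokens(text):
    # Assign token positions, jumping by GAP at every sentence boundary
    # so tokens in different sentences are never "adjacent".
    pos, out = 0, []
    for sentence in re.split(r"[.!?]+", text):
        for tok in re.findall(r"\w+", sentence.lower()):
            out.append((tok, pos))
            pos += 1
        pos += GAP
    return out

toks = positioned_tokens("Search works. Engines rank.")
```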
21
Parts is Parts
• Figuring out the part of speech (noun vs. verb vs. adjective)
would seem to clearly help
– We avoid matching on the incorrect version
• Study after study shows that it does not!
• Why not?
– Closely related (in English)
• Example: to go on a run
– Prevalence of noun phrases in the group of “important” terms
22
How Common Are Tokens/Terms?
• Term Frequency (not “Token Frequency”)
– Example: The West, West London, The Wild West
• Do your full text processing when you’re gathering statistics
– And adjust it and re-run it when the data changes
• Inverse Document Frequency
– In how many docs does this term occur?
– NOT: How many times does this term occur across all docs?
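The distinction can be shown directly: document frequency counts the documents a term occurs in, not its total occurrences, and IDF then down-weights terms that appear everywhere. The toy corpus and the smoothed IDF variant below are illustrative; the exact formula varies by engine.

```python
import math

docs = [
    "the west is wild",
    "west london weather",
    "the the the quarterly report",
]

def doc_freq(term, docs):
    # In how many docs does this term occur? (NOT total occurrences.)
    return sum(1 for d in docs if term in d.split())

def idf(term, docs):
    # Smoothed inverse document frequency.
    return math.log((1 + len(docs)) / (1 + doc_freq(term, docs)))

df_the = doc_freq("the", docs)  # 2 documents, despite 4 occurrences
```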
23
Let Me Re-Phrase That
• Some general dictionaries are freely available
– Example: locations (geonames.org)
• Others can be derived
– Example: Company names from stock markets, business registries, Wikipedia, etc.
• More useful are terms from your industry
– Can you think of lists that are available internally?
– Example: Job titles in a recruiting company
• Most useful are terms from your data
– Statistical generation of common 2-shingles and 3-shingles
– Query log analysis
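The statistical route above can be sketched as generating word 2-shingles and 3-shingles over the corpus and keeping the frequent ones as phrase-dictionary candidates; the `min_count` threshold and the toy documents are illustrative.

```python
from collections import Counter

def shingles(tokens, n):
    # Contiguous word n-grams ("shingles") over a token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def phrase_candidates(docs, min_count=2):
    # Count all 2- and 3-shingles; frequent ones are phrase candidates.
    counts = Counter()
    for doc in docs:
        toks = doc.lower().split()
        for n in (2, 3):
            counts.update(shingles(toks, n))
    return {p for p, c in counts.items() if c >= min_count}

cands = phrase_candidates([
    "senior java developer wanted",
    "hiring a senior java developer",
])
```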
24
Lorem ipsum…
• Boilerplate text recognition
• Pre-process:
– Simple text processing this time
– Split by paragraphs
– Calculate hash signatures for paragraphs
– Count occurrences
• Find the cliff
• Filter out early in the main pipeline
– Early steps must match the entire pre-processing pipeline
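The pre-processing steps above can be sketched as hashing each paragraph, counting how often each signature recurs across documents, and treating signatures past the frequency "cliff" as boilerplate. The fixed threshold below is a stand-in for the cliff you would find by inspecting the count distribution, and the documents are made up.

```python
import hashlib
from collections import Counter

def para_hash(paragraph):
    # Same simple text processing as the main pipeline's early steps.
    normalized = " ".join(paragraph.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

docs = [
    "Quarterly results improved.\n\nCopyright 2015 Acme Corp.",
    "New office opened in Prague.\n\nCopyright 2015 Acme Corp.",
    "Hiring update.\n\nCopyright 2015 Acme Corp.",
]

counts = Counter()
for doc in docs:
    for para in doc.split("\n\n"):
        counts[para_hash(para)] += 1

CLIFF = 3  # stand-in for the cliff seen in the count distribution
boilerplate = {h for h, c in counts.items() if c >= CLIFF}

def strip_boilerplate(doc):
    # Filter boilerplate paragraphs out early in the main pipeline.
    return "\n\n".join(p for p in doc.split("\n\n")
                       if para_hash(p) not in boilerplate)

clean = strip_boilerplate(docs[0])
```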
26
Search Quality
• Best possible results
– Given the searchable data
– For the primary users and their primary tasks
• Simple query term matching - relevance
• And beyond
– Enriched content
– Query enhancement
• Results presentation
– Clarity
– Context
27
Short Head & Long Tail
• Query Frequency
– Short Head
• A few frequent queries
– Short Middle
• Often up to 50% by traffic
– Long Tail
• Rare to unique queries
• Can be up to 75% distinct
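The head/tail split can be sketched by sorting queries by frequency and taking the head as the smallest set covering a target share of traffic; the 50% boundary and the toy log below are illustrative.

```python
from collections import Counter

def head_queries(log, share=0.5):
    # Smallest set of most-frequent queries covering `share` of traffic.
    counts = Counter(log)
    total = sum(counts.values())
    head, cum = [], 0
    for q, c in counts.most_common():
        if cum / total >= share:
            break
        head.append(q)
        cum += c
    return head

log = ["login"] * 5 + ["vpn"] * 3 + ["a", "b", "c", "d"]
head = head_queries(log)
```

Here two queries carry the first half of the traffic while four distinct queries make up the long tail.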
28
But What Do They Really Want?
• Query log reports show what users think they’re looking for
– Domain research reveals more about why
• Behavior shows more about whether they’re finding it
– Session ending
• Frequent for zero matches
– CTR - click-through rate
• Results (with bounce rates)
– Query refinement
• Typing, facets
• Navigation via search
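Two of those behavior signals can be computed directly from a search log: the zero-results rate (a common cause of session-ending) and per-query click-through rate. The log entries and field names below are illustrative.

```python
log = [
    # made-up search-log entries
    {"query": "vpn", "results": 12, "clicked": True},
    {"query": "vpn", "results": 12, "clicked": False},
    {"query": "xyzzy", "results": 0, "clicked": False},
    {"query": "payroll", "results": 4, "clicked": True},
]

# Share of searches returning zero matches (frequent session-enders).
zero_rate = sum(1 for e in log if e["results"] == 0) / len(log)

def ctr(query, log):
    # Click-through rate over searches that actually showed results.
    shown = [e for e in log if e["query"] == query and e["results"] > 0]
    return sum(e["clicked"] for e in shown) / len(shown) if shown else 0.0

vpn_ctr = ctr("vpn", log)
```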
29
You say “tomay-toe”
• Users’ vocabulary is not content vocabulary
– Consistent problems from small to web-scale search
• Create synonyms
• Scalable automated disambiguation
– Data analysis
• Using dictionaries and co-occurrence
– Search log behavior analysis
• Query refinement and reformulation, click tracks
– Language ambiguity - even Netflix has a hard time past 85%
– Human domain expertise, editorial oversight
30
Scope (aka, this is not Google)
• User confusion
– Is this a location box?
– Is it Google?
• Design for clarity
– UI and graphic design
– Watch out for default to subscope searches
• Improve content coverage
• Add Best Bets for internal and external locations
• Link to other search engines
• Federate search
32
Best Bang for the Buck
• Concentrate on the short head
– Top 10% by traffic
• Simple relevance test
– Perform query
– Evaluate results
• Are there any results?
• Are they the most useful available? (Domain expertise)
• Validate against user behavior
– Store judgments
– Easy fixes
– Re-test (easy to miss this)
34
Context and Navigation
• Facets
• Results grouping / diversity
– options for ambiguous queries
• Integrate with collaboration tools
– Allow user comments, reviews
35
Relevance and Ranking
• Best results patterns
– Part or serial number queries
• Tuned boosting
– Feedback on clicks and other signals
– Freshness
• de-duplication!