relevancy and search quality analysis - search technologies
TRANSCRIPT
1
The Manifold Path to Search Quality
Enterprise Search & Analytics Meetup
Mark David – Architect, Data Scientist
Avi Rappaport – Senior Search Quality Analyst
19 March 2015
2
“manifold”
• adjective
– having numerous different parts, elements, features, forms, etc.
- dictionary.com
3
Search Technologies: Who We Are
The leading independent IT services firm specializing in the design,
implementation, and management of enterprise search and big data
search solutions.
4
Solutions
Corporate Wide Search – “Google for the Enterprise.” A single, secure point of search for all
users and all content. Strategic initiative for corporate wide information distribution and search.
Data Warehouse Search – A Big Data search solution that enables interactive query and analytics
with extremely large data sets for business intelligence and fraud detection.
E-Commerce Search – Leverages machine learning and accuracy metrics to deliver a better
online user experience and maximize revenues from visitor search activity.
Search & Match – Increase recruiter productivity and fill rates in the staffing industry. Provides a
better search experience followed by automated candidate-to-job matching.
Search for Media & Publishing – Improve user search experience for publishers of large amounts
of content such as government organizations, research firms, and media publications.
Government Search – A solution focused on the design and development of search for government
information portals and archiving systems.
5
Search Technologies: Background
San Diego
London UK
San Jose, CR
Cincinnati
Prague, CZ
Washington (HQ)
Frankfurt DE
• Founded 2005
• 150+ employees
• 600+ customers worldwide
• Deep enterprise search expertise
• Consistent revenue growth
• Consistent profitability
6
600+ Customers
7
Search Technologies: What We Do
• All aspects of search application implementation
– Content access and processing, search system architecture, configuration, deployment
– Accuracy analysis, metrics, engine scoring, relevancy ranking, query enhancement
– User interface, analytics, visualization
• Technology assets to support implementation
– Aspire high performance content processing
– Content Connectors (Document, Jive, SharePoint, Salesforce, Box.com, etc.)
• Engagement models
– Most projects start with an “assessment”
– Fully project-managed solutions, designed, delivered, and supported
– Experts for hire, supporting in-house teams or as a subcontractor
9
[Diagram: Content sources → Connectors → Aspire Content Processing Pipelines → Indexes → Search Engine → Web Browser; also shown: Staging Repository, Publishers]
Technology Assets
1. Aspire Framework – High Performance Content Processing
– Ingests and processes content and publishes to a variety of indexes for commercial and open source search engines
2. Aspire Data Connectors – API-level access to content repositories
3. Query Processing Language (QPL) – Advanced query processing
Complements to commercial and open source search technologies
11
Understand Your Data
• Data Analysis
– Access patterns & rates, sources, schemas, field typing,
duplicates, near-duplicates, term frequencies, etc.
• Content Processing
– Source connection, format conversion, sub-document
separation, field boundaries, multiple-source assembly, etc.
• Text Processing
– Character decoding, tag stripping, tokenization, sentence
boundaries, normalization, entity extraction, pattern
recognition, disambiguation, filtering, etc.
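A minimal sketch of a few of the text-processing steps listed above (tag stripping, tokenization, normalization); the function names and the toy document are illustrative, not from any Search Technologies product.

```python
import html
import re

def strip_tags(text):
    # Unescape HTML entities, then drop simple HTML/XML tags.
    return re.sub(r"<[^>]+>", " ", html.unescape(text))

def tokenize(text):
    # Split on non-word characters; a real pipeline would also handle
    # sentence boundaries, entity extraction, and language-specific rules.
    return re.findall(r"\w+", text)

def normalize(tokens):
    # Case-fold each token; real systems may also stem, fold accents, etc.
    return [t.lower() for t in tokens]

doc = "<p>Enterprise Search &amp; Big Data</p>"
tokens = normalize(tokenize(strip_tags(doc)))
```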
12
Understand Your Users
• Search Scope
– Interviews
– Log Analysis
– Scenarios
– Wireframes & mockups
• Search Quality
• Improvements
– Relevance
– Coverage
– UX
13
Understand Your Search Engine
• How does it score results?
• How accurate is it for the short head?
• How accurate is it for the long tail?
• When you change it to improve a particular type of query,
how do you know that the overall accuracy improved?
14
Regression Testing of Search
• Step 1: Gather a Set of Judgments
• If you already have lots of user data:
– Use click log analysis to gather sets of clearly good and clearly
bad results
– Ignore unclear tracks
• If user data not yet available:
– Manual judgments
• End up with a set of queries with associated “good” and
“bad” documents
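The click-log route above can be sketched as follows: label documents "good" or "bad" per query from impression and click counts, and skip the unclear cases. The thresholds and the toy events are illustrative assumptions, not prescribed values.

```python
from collections import defaultdict

events = [
    # (query, doc_id, clicked) -- made-up click-log entries
    ("laptop", "d1", True), ("laptop", "d1", True), ("laptop", "d1", False),
    ("laptop", "d2", False), ("laptop", "d2", False), ("laptop", "d2", False),
    ("laptop", "d3", True), ("laptop", "d3", False),
]

stats = defaultdict(lambda: [0, 0])  # (query, doc) -> [impressions, clicks]
for q, d, clicked in events:
    stats[(q, d)][0] += 1
    stats[(q, d)][1] += clicked

judgments = {}
for (q, d), (imp, clk) in stats.items():
    ctr = clk / imp
    if imp >= 3 and ctr >= 0.5:
        judgments[(q, d)] = "good"   # clearly good: shown often, clicked often
    elif imp >= 3 and ctr == 0.0:
        judgments[(q, d)] = "bad"    # clearly bad: shown often, never clicked
    # otherwise: unclear track -> no judgment
```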
15
Regression Testing of Search
• Step 2: Instrument the Search Results
• Periodically execute all those queries, and score the results
• How to score:
– Every good document adds a position-based amount
– Every bad document subtracts the same amount
– Unknown documents don’t affect the score (except by
occupying a position)
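The scoring rule above can be sketched in a few lines; the 1/rank position weighting is an illustrative choice, not the talk's specified formula.

```python
def score_results(results, good, bad):
    # Each good document adds a position-weighted amount, each bad one
    # subtracts the same amount; unknown documents contribute nothing
    # beyond occupying a position (which lowers the weight of later hits).
    score = 0.0
    for rank, doc in enumerate(results, start=1):
        weight = 1.0 / rank
        if doc in good:
            score += weight
        elif doc in bad:
            score -= weight
    return score

s = score_results(["d1", "dX", "d9"], good={"d1"}, bad={"d9"})
```

Running all stored queries through this periodically gives a single trend line per query set, so a change that helps one query type but hurts overall accuracy shows up immediately.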
17
Relevancy Improvements from Data
• Text Processing
– Typos
– Entity Extraction
– Breaks
– Parts of Speech
• Data Analysis
– TF-IDF
– Phrase Dictionary
– Boilerplate
18
To Correct or Not To Correct
• Should typos be “fixed”?
• This goes back to knowing your audience
• Example: Haircutz
• In document-to-document situations, generally yes.
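One simple way to "fix" a typo like this is edit-distance matching against the index vocabulary; the vocabulary and cutoff below are illustrative, and whether the correction should actually be applied depends on your audience, as the slide notes.

```python
import difflib

# Made-up index vocabulary for illustration.
vocabulary = ["haircuts", "haircare", "hairstyle", "barber"]

def correct(term, vocab, cutoff=0.8):
    # Suggest a correction only when a query term is close enough to a
    # known index term; otherwise leave the user's spelling alone.
    matches = difflib.get_close_matches(term.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else term

fixed = correct("haircutz", vocabulary)
```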
19
Bigger Needles in the Haystack
• Entity Extraction: How big a chunk?
• Example: [email protected]
– Is that 1, 2, 3, 4, or 5 tokens?
• Multi-indexing is a key component of accuracy
– Different people think differently, so the indexes need to have
different ways of representing the data.
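Multi-indexing can be sketched as producing several tokenizations of the same raw value, so that each mental model of the data has an index view that matches it; the email address below is a made-up example.

```python
import re

def tokenizations(raw):
    # Index the same raw value several ways: as one whole token, as
    # user/domain halves, and split to the finest parts.
    return {
        "whole": [raw],                    # 1 token: exact-match searches
        "user_domain": raw.split("@"),     # 2 tokens: user or domain searches
        "parts": re.split(r"[@.]", raw),   # finest split: partial matches
    }

views = tokenizations("jane.doe@example.com")
```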
20
Breaker, Breaker
• Don’t match across boundaries
– Paragraph
– Sentence
– Phrase
• Whitespace does have meaning!
• Punctuation does have meaning!
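One way to keep matches from crossing boundaries is to leave a large gap in token positions at each sentence break, so a phrase query (which requires adjacent positions) cannot span two sentences; this mirrors the position-increment-gap idea used by some search engines, and the gap size here is an arbitrary illustrative constant.

```python
import re

GAP = 100  # illustrative; any value larger than the longest phrase works

def positioned_tokens(text):
    # Assign token positions, jumping by GAP at every sentence boundary
    # so tokens in different sentences are never "adjacent".
    pos, out = 0, []
    for sentence in re.split(r"[.!?]+", text):
        for tok in re.findall(r"\w+", sentence.lower()):
            out.append((tok, pos))
            pos += 1
        pos += GAP
    return out

toks = positioned_tokens("Search works. Engines rank.")
```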
21
Parts is Parts
• Figuring out the part of speech (noun vs. verb vs. adjective)
would seem to clearly help
– We avoid matching on the incorrect version
• Study after study shows that it does not!
• Why not?
– Closely related (in English)
• Example: to go on a run
– Prevalence of noun phrases in the group of “important” terms
22
How Common Are Tokens/Terms?
• Term Frequency (not “Token Frequency”)
– Example: The West, West London, The Wild West
• Do your full text processing when you’re gathering statistics
– And adjust it and re-run it when the data changes
• Inverse Document Frequency
– In how many docs does this term occur?
– NOT: How many times does this term occur across all docs?
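The distinction can be shown directly: document frequency counts the documents a term occurs in, not its total occurrences, and IDF then down-weights terms that appear everywhere. The toy corpus and the smoothed IDF variant below are illustrative; the exact formula varies by engine.

```python
import math

docs = [
    "the west is wild",
    "west london weather",
    "the the the quarterly report",
]

def doc_freq(term, docs):
    # In how many docs does this term occur? (NOT total occurrences.)
    return sum(1 for d in docs if term in d.split())

def idf(term, docs):
    # Smoothed inverse document frequency.
    return math.log((1 + len(docs)) / (1 + doc_freq(term, docs)))

df_the = doc_freq("the", docs)  # 2 documents, despite 4 occurrences
```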
23
Let Me Re-Phrase That
• Some general dictionaries are freely available
– Example: locations (geonames.org)
• Others can be derived
– Example: Company names from stock markets, business registries, Wikipedia, etc.
• More useful are terms from your industry
– Can you think of lists that are available internally?
– Example: Job titles in a recruiting company
• Most useful are terms from your data
– Statistical generation of common 2-shingles and 3-shingles
– Query log analysis
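The statistical route above can be sketched as generating word 2-shingles and 3-shingles over the corpus and keeping the frequent ones as phrase-dictionary candidates; the `min_count` threshold and the toy documents are illustrative.

```python
from collections import Counter

def shingles(tokens, n):
    # Contiguous word n-grams ("shingles") over a token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def phrase_candidates(docs, min_count=2):
    # Count all 2- and 3-shingles; frequent ones are phrase candidates.
    counts = Counter()
    for doc in docs:
        toks = doc.lower().split()
        for n in (2, 3):
            counts.update(shingles(toks, n))
    return {p for p, c in counts.items() if c >= min_count}

cands = phrase_candidates([
    "senior java developer wanted",
    "hiring a senior java developer",
])
```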
24
Lorem ipsum…
• Boilerplate text recognition
• Pre-process:
– Simple text processing this time
– Split by paragraphs
– Calculate hash signatures for paragraphs
– Count occurrences
• Find the cliff
• Filter out early in the main pipeline
– Early steps must match the entire pre-processing pipeline
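The pre-processing steps above can be sketched as hashing each paragraph, counting how often each signature recurs across documents, and treating signatures past the frequency "cliff" as boilerplate. The fixed threshold below is a stand-in for the cliff you would find by inspecting the count distribution, and the documents are made up.

```python
import hashlib
from collections import Counter

def para_hash(paragraph):
    # Same simple text processing as the main pipeline's early steps.
    normalized = " ".join(paragraph.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

docs = [
    "Quarterly results improved.\n\nCopyright 2015 Acme Corp.",
    "New office opened in Prague.\n\nCopyright 2015 Acme Corp.",
    "Hiring update.\n\nCopyright 2015 Acme Corp.",
]

counts = Counter()
for doc in docs:
    for para in doc.split("\n\n"):
        counts[para_hash(para)] += 1

CLIFF = 3  # stand-in for the cliff seen in the count distribution
boilerplate = {h for h, c in counts.items() if c >= CLIFF}

def strip_boilerplate(doc):
    # Filter boilerplate paragraphs out early in the main pipeline.
    return "\n\n".join(p for p in doc.split("\n\n")
                       if para_hash(p) not in boilerplate)

clean = strip_boilerplate(docs[0])
```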
26
Search Quality
• Best possible results
– Given the searchable data
– For the primary users and their primary tasks
• Simple query term matching - relevance
• And beyond
– Enriched content
– Query enhancement
• Results presentation
– Clarity
– Context
27
Short Head & Long Tail
• Query Frequency
– Short Head
• A few frequent queries
– Short Middle
• Often up to 50% by traffic
– Long Tail
• Rare to unique queries
• Can be up to 75% distinct
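The head/tail split can be sketched by sorting queries by frequency and taking the head as the smallest set covering a target share of traffic; the 50% boundary and the toy log below are illustrative.

```python
from collections import Counter

def head_queries(log, share=0.5):
    # Smallest set of most-frequent queries covering `share` of traffic.
    counts = Counter(log)
    total = sum(counts.values())
    head, cum = [], 0
    for q, c in counts.most_common():
        if cum / total >= share:
            break
        head.append(q)
        cum += c
    return head

log = ["login"] * 5 + ["vpn"] * 3 + ["a", "b", "c", "d"]
head = head_queries(log)
```

Here two queries carry the first half of the traffic while four distinct queries make up the long tail.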
28
But What Do They Really Want?
• Query log reports show what users think they’re looking for
– Domain research reveals more about why
• Behavior shows more about whether they’re finding it
– Session ending
• Frequent for zero matches
– CTR - click-through rate
• Results (with bounce rates)
– Query refinement
• Typing, facets
• Navigation via search
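Two of those behavior signals can be computed directly from a search log: the zero-results rate (a common cause of session-ending) and per-query click-through rate. The log entries and field names below are illustrative.

```python
log = [
    # made-up search-log entries
    {"query": "vpn", "results": 12, "clicked": True},
    {"query": "vpn", "results": 12, "clicked": False},
    {"query": "xyzzy", "results": 0, "clicked": False},
    {"query": "payroll", "results": 4, "clicked": True},
]

# Share of searches returning zero matches (frequent session-enders).
zero_rate = sum(1 for e in log if e["results"] == 0) / len(log)

def ctr(query, log):
    # Click-through rate over searches that actually showed results.
    shown = [e for e in log if e["query"] == query and e["results"] > 0]
    return sum(e["clicked"] for e in shown) / len(shown) if shown else 0.0

vpn_ctr = ctr("vpn", log)
```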
29
You say “tomay-toe”
• Users’ vocabulary is not content vocabulary
– Consistent problems from small to web-scale search
• Create synonyms
• Scalable automated disambiguation
– Data analysis
• Using dictionaries and co-occurrence
– Search log behavior analysis
• Query refinement and reformulation, click tracks
– Language ambiguity - even Netflix has a hard time past 85%
– Human domain expertise, editorial oversight
30
Scope (aka, this is not Google)
• User confusion
– Is this a location box?
– Is it Google?
• Design for clarity
– UI and graphic design
– Watch out for default to subscope searches
• Improve content coverage
• Add Best Bets for internal and external locations
• Link to other search engines
• Federate search
32
Best Bang for the Buck
• Concentrate on the short head
– Top 10% by traffic
• Simple relevance test
– Perform query
– Evaluate results
• Are there any results?
• Are they the most useful available? (Domain expertise)
• Validate against user behavior
– Store judgments
– Easy fixes
– Re-test (easy to miss this)
34
Context and Navigation
• Facets
• Results grouping / diversity
– options for ambiguous queries
• Integrate with collaboration tools
– Allow user comments, reviews
35
Relevance and Ranking
• Best results patterns
– Part or serial number queries
• Tuned boosting
– Feedback on clicks and other signals
– Freshness
• de-duplication!