South Big Data Hub: Text Data Analysis Panel

Text Data Analysis Panel: South Big Data Hub Trey Grainger SVP of Engineering, Lucidworks

Upload: trey-grainger

Post on 17-Feb-2017


Page 1: South Big Data Hub: Text Data Analysis Panel

Text Data Analysis Panel: South Big Data Hub

Trey Grainger
SVP of Engineering, Lucidworks

Page 2: South Big Data Hub: Text Data Analysis Panel

About Me

Trey Grainger
SVP of Engineering

• Previously Director of Engineering @ CareerBuilder
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Information Retrieval & Web Search - Stanford University

Other fun projects:
• Co-author of Solr in Action, plus numerous research papers
• Frequent conference speaker
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene/Solr contributor

Page 3: South Big Data Hub: Text Data Analysis Panel

what do you do?

Page 5: South Big Data Hub: Text Data Analysis Panel

Search-Driven Everything

• Customer Service
• Customer Insights
• Fraud Surveillance
• Research Portal
• Online Retail
• Digital Content

Page 6: South Big Data Hub: Text Data Analysis Panel

Lucidworks enables Search-Driven Everything

• Data Acquisition
• Indexing & Streaming
• Smart Access API
• Recommendations & Alerts
• Analytics & Insights
• Extreme Relevancy

CUSTOMER SERVICE, RESEARCH PORTAL, DIGITAL CONTENT, CUSTOMER INSIGHTS, FRAUD SURVEILLANCE, ONLINE RETAIL

• Access all your data in a number of ways from one place.
• Secure storage and processing from Solr and Spark.
• Acquire data from any source with pre-built connectors and adapters.
• Machine learning and advanced analytics turn all of your apps into intelligent data-driven applications.

Page 12: South Big Data Hub: Text Data Analysis Panel

how do you do it?

Page 13: South Big Data Hub: Text Data Analysis Panel

Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.

Page 14: South Big Data Hub: Text Data Analysis Panel

Key Solr Features:
● Multilingual Keyword search
● Relevancy Ranking of results
● Faceting & Analytics (nested / relational)
● Highlighting
● Spelling Correction
● Autocomplete/Type-ahead Prediction
● Sorting, Grouping, Deduplication
● Distributed, Fault-tolerant, Scalable
● Geospatial search
● Complex Function queries
● Recommendations (More Like This)
● Graph Queries and Traversals
● SQL Query Support
● Streaming Aggregations
● Batch and Streaming processing
● Highly Configurable / Plugins
● Learning to Rank
● Building machine-learning models
● … many more

*source: Solr in Action, chapter 2
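As a rough illustration of how several of the features listed above combine in one request, the sketch below builds a Solr `/select` URL that asks for keyword search, faceting, and highlighting together. The host, the `jobs` collection, and the `city` and `description` field names are assumptions made up for this example; only the query parameters (`q`, `rows`, `facet`, `facet.field`, `hl`, `hl.fl`, `wt`) are standard Solr parameters.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint and collection name; adjust for a real deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/jobs/select"


def build_search_url(keywords, facet_field, highlight_field, rows=10):
    """Build a /select URL combining keyword search, faceting, and highlighting."""
    params = [
        ("q", keywords),               # keyword query
        ("rows", rows),                # number of results to return
        ("facet", "true"),             # enable faceting
        ("facet.field", facet_field),  # field whose values are counted
        ("hl", "true"),                # enable highlighting
        ("hl.fl", highlight_field),    # field in which matches are highlighted
        ("wt", "json"),                # response format
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


# Field names here are illustrative assumptions, not from the slides.
url = build_search_url("software engineer", "city", "description")
print(url)
```

The URL can then be fetched with any HTTP client; the response would carry the ranked documents plus facet counts and highlighted snippets in one round trip.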

Page 15: South Big Data Hub: Text Data Analysis Panel

The standard for enterprise search.

90% of Fortune 500 uses Solr.

Page 16: South Big Data Hub: Text Data Analysis Panel

Reference Architecture (Lucidworks Fusion)

Page 17: South Big Data Hub: Text Data Analysis Panel

Building an Intent Engine

[Diagram; labels recovered from the slide:]
• Search Box
• Bay Area Search
• Type-ahead Prediction
• Intent Engine
• Semantic Query Parsing
• Spelling Correction
• Entity / Entity Type Resolution
• Machine-learned Ranking
• Relevancy Engine (“re-expressing intent”)
• User Feedback (Clarifying Intent)
• Query Re-writing
• Search Results
• Query Augmentation
• Knowledge Graph
• Contextual Disambiguation

Page 18: South Big Data Hub: Text Data Analysis Panel

Additional References:

Page 19: South Big Data Hub: Text Data Analysis Panel

what’s next?

Page 20: South Big Data Hub: Text Data Analysis Panel

Basic Keyword Search (inverted index, TF-IDF, BM25, query formulation, etc.)

Taxonomies / Entity Extraction (entity recognition, ontologies, synonyms, etc.)

Query Intent (query classification, semantic query parsing, concept expansion, rules, clustering, classification)

Relevancy Tuning (signals, A/B testing / genetic algorithms, Learning to Rank, neural networks)

Self-learning
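The first stage listed above (inverted index plus BM25 scoring) can be sketched in a few lines. This is a minimal toy version, not how Lucene/Solr implements it: tokenization is a plain whitespace split, the scoring uses the textbook Okapi BM25 form with a non-negative idf variant, and the `k1` and `b` defaults and the three sample documents are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {doc_id: term frequency}; also record document lengths."""
    postings = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        terms = text.lower().split()
        lengths[doc_id] = len(terms)
        for term, tf in Counter(terms).items():
            postings[term][doc_id] = tf
    return postings, lengths


def bm25_search(query, postings, lengths, k1=1.2, b=0.75):
    """Score documents against the query with Okapi BM25, best match first."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in query.lower().split():
        matches = postings.get(term, {})
        if not matches:
            continue
        # Non-negative idf variant: log(1 + (N - n + 0.5) / (n + 0.5))
        idf = math.log(1 + (n_docs - len(matches) + 0.5) / (len(matches) + 0.5))
        for doc_id, tf in matches.items():
            # Length normalization: longer-than-average documents are penalized.
            norm = k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + norm)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Illustrative documents, not from the slides.
docs = {
    "d1": "java developer with hibernate experience",
    "d2": "data scientist with python and scala",
    "d3": "senior java engineer java and scala",
}
postings, lengths = build_index(docs)
results = bm25_search("java scala", postings, lengths)
print(results)
```

Here `d3` ranks first because it matches both query terms; the later stages on the slide build on top of this kind of baseline ranking.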

Page 21: South Big Data Hub: Text Data Analysis Panel

The Three C’s

Content: Keywords and other features in your documents

Collaboration: How others have chosen to interact with your system

Context: Available information about your users and their intent

Reflected Intelligence: “Leveraging previous data and interactions to improve how new data and interactions should be interpreted”

Page 22: South Big Data Hub: Text Data Analysis Panel

Feedback Loops

• User searches
• User sees results
• User takes an action
• Users’ actions inform system improvements
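One simple way to close the loop described above is to aggregate logged actions into per-query signals. The sketch below assumes a click log of `(query, doc_id)` pairs, which is a hypothetical format invented for this example, and counts which documents users clicked for each normalized query.

```python
from collections import Counter, defaultdict


def aggregate_clicks(click_log):
    """Count clicks per (query, document) pair from (query, doc_id) events."""
    signals = defaultdict(Counter)
    for query, doc_id in click_log:
        # Normalize case and whitespace so query variants share one bucket.
        signals[query.strip().lower()][doc_id] += 1
    return signals


def top_clicked(signals, query, n=3):
    """Return the most-clicked documents for a query, most clicked first."""
    return signals.get(query.strip().lower(), Counter()).most_common(n)


# Hypothetical log entries for illustration only.
click_log = [
    ("java developer", "doc7"),
    ("Java Developer", "doc7"),
    ("java developer", "doc2"),
    ("data scientist", "doc9"),
]
signals = aggregate_clicks(click_log)
print(top_clicked(signals, "java developer"))  # [('doc7', 2), ('doc2', 1)]
```

Counts like these could later feed boosts, recommendations, or relevancy judgements; how they are weighted is a design decision this sketch leaves open.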

Page 23: South Big Data Hub: Text Data Analysis Panel

Examples of Reflected Intelligence

● Recommendation Algorithms
● Building user profiles from past searches, clicks, and other actions
● Identifying correlations between keywords/phrases
● Building out automatically-generated ontologies from content and queries
● Determining relevancy judgements (precision, recall, nDCG, etc.) from click logs
● Learning to Rank - using relevancy judgements and machine learning to train a relevance model
● Discovering misspellings, synonyms, acronyms, and related keywords
● Disambiguation of keyword phrases with multiple meanings
● Learning what’s important in your content
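Of the metrics named above, nDCG is the one least obvious from its name, so a worked sketch may help. It assumes graded relevance judgements (for example 0 to 3) have already been derived for each result in the order it was shown; how those grades come out of click logs is not covered here, and the exponential gain `2^rel - 1` is one common convention rather than the only one.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: sum of (2^rel - 1) / log2(rank + 1)."""
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg_at_k(relevances, k):
    """nDCG@k for graded judgements listed in the order results were shown."""
    ideal = sorted(relevances, reverse=True)  # best possible ordering
    ideal_dcg = dcg(ideal[:k])
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgements: a perfectly ordered list and a swapped one.
perfect = ndcg_at_k([3, 2, 1], k=3)
swapped = ndcg_at_k([1, 3], k=2)
print(perfect, swapped)
```

A perfectly ordered list scores 1.0, and any ordering that places a less relevant result above a more relevant one scores lower, which is what makes the metric usable as a training or evaluation target for Learning to Rank.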

Page 24: South Big Data Hub: Text Data Analysis Panel

Key Technologies

• Keyword Search
  - Lucene/Solr
• Taxonomies / Entity Extraction
  - Solr Text Tagger
  - Word2Vec / Dice Conceptual Search
  - SolrRDF
• Query Intent
  - Probabilistic Query Parser (SOLR-9418)
  - Semantic Knowledge Graph (SOLR-9480)
• Relevancy Tuning
  - Solr Learning to Rank Plugin (SOLR-8542)
• General Needs: a solid log processing framework (Apache Spark, Lucidworks Fusion, or Solr Daemon Expression)

Page 26: South Big Data Hub: Text Data Analysis Panel

Knowledge Graph

Semantic Knowledge Graph Traversal

[Figure: traversal starting from the node "software engineer*" (materialized node). Column labels: Starting Node, Skill Nodes, Skill Nodes, Job Title Nodes. Edge types: has_related_skill, has_related_job_title. Skill nodes shown: Java, C#, .NET, Hibernate, Scala, VB.NET. Job title nodes shown: .NET Developer, Java Developer, Software Engineer, Data Scientist. Edges carry numeric weights ranging from 0.34 to 0.93.]

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
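To make the idea of a typed, weighted traversal concrete, here is a toy sketch. It is not the Semantic Knowledge Graph implementation: it uses a small static dictionary as a stand-in for the graph, and every edge weight in it is a made-up illustrative value, not a number taken from the figure. The `traverse` function and `min_score` parameter are likewise hypothetical names.

```python
# Hypothetical weighted edges; scores are illustrative, NOT the figure's values.
GRAPH = {
    "software engineer": {"has_related_skill": {"java": 0.9, "c#": 0.8}},
    "java": {"has_related_job_title": {"java developer": 0.9,
                                       "software engineer": 0.7}},
    "c#": {"has_related_job_title": {".net developer": 0.9}},
}


def traverse(graph, start, edge_types, min_score=0.0):
    """Follow one edge type per hop; return (node path, per-hop scores) tuples."""
    paths = [((start,), ())]
    for edge_type in edge_types:
        next_paths = []
        for nodes, scores in paths:
            neighbors = graph.get(nodes[-1], {}).get(edge_type, {})
            for node, score in neighbors.items():
                if score >= min_score:  # prune weakly related nodes
                    next_paths.append((nodes + (node,), scores + (score,)))
        paths = next_paths
    return paths


paths = traverse(GRAPH, "software engineer",
                 ["has_related_skill", "has_related_job_title"],
                 min_score=0.8)
for nodes, scores in paths:
    print(" -> ".join(nodes), scores)
```

With the threshold at 0.8, the weaker "java" to "software engineer" edge is pruned and two paths remain, ending at "java developer" and ".net developer".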

Page 27: South Big Data Hub: Text Data Analysis Panel

Knowledge Graph

Page 28: South Big Data Hub: Text Data Analysis Panel

Knowledge Graph

Page 29: South Big Data Hub: Text Data Analysis Panel

Traditional Keyword Search

Recommendations

Semantic Search

User Intent

Personalized Search

Augmented Search

Domain-aware Matching

Page 30: South Big Data Hub: Text Data Analysis Panel

Contact Info

Trey Grainger

[email protected] @treygrainger

http://solrinaction.com
Other presentations: http://www.treygrainger.com