Text Data Analysis Panel: South Big Data Hub
Trey Grainger
SVP of Engineering, Lucidworks
About Me
Trey Grainger, SVP of Engineering
• Previously Director of Engineering @ CareerBuilder
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Information Retrieval & Web Search - Stanford University
Other fun projects:
• Co-author of Solr in Action, plus numerous research papers
• Frequent conference speaker
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene/Solr contributor
what do you do?
Search-Driven Everything
Customer Service
Customer Insights
Fraud Surveillance
Research Portal
Online Retail
Digital Content
Lucidworks enables Search-Driven Everything
Data Acquisition
Indexing & Streaming
Smart Access API
Recommendations & Alerts
Analytics & Insights
Extreme Relevancy
CUSTOMER SERVICE
RESEARCH PORTAL
DIGITAL CONTENT
CUSTOMER INSIGHTS
FRAUD SURVEILLANCE
ONLINE RETAIL
• Access all your data in a number of ways from one place.
• Secure storage and processing from Solr and Spark.
• Acquire data from any source with pre-built connectors and adapters.
Machine learning and advanced analytics turn all of your apps into intelligent data-driven applications.
how do you do it?
Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.
Key Solr Features:
● Multilingual Keyword search
● Relevancy Ranking of results
● Faceting & Analytics (nested / relational)
● Highlighting
● Spelling Correction
● Autocomplete/Type-ahead Prediction
● Sorting, Grouping, Deduplication
● Distributed, Fault-tolerant, Scalable
● Geospatial search
● Complex Function queries
● Recommendations (More Like This)
● Graph Queries and Traversals
● SQL Query Support
● Streaming Aggregations
● Batch and Streaming processing
● Highly Configurable / Plugins
● Learning to Rank
● Building machine-learning models
● … many more
*source: Solr in Action, chapter 2
The standard for enterprise search.
90% of the Fortune 500 uses Solr.
Reference Architecture (Lucidworks Fusion)
Bay Area Search
Building an Intent Engine
Search Box
Type-ahead Prediction
Semantic Query Parsing
Intent Engine
Spelling Correction
Entity / Entity Type Resolution
Machine-learned Ranking
Relevancy Engine (“re-expressing intent”)
User Feedback (Clarifying Intent)
Query Re-writing
Search Results
Query Augmentation
Knowledge Graph
Contextual Disambiguation
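As a rough illustration of how a few of the stages named above might chain together, here is a toy sketch that runs spelling correction, then entity / entity type resolution, then query re-writing. The stage order, the lookup dictionaries, and the fielded output format are all assumptions made for this example; they are not taken from the slides or from any Lucidworks product.

```python
# Hypothetical lookup tables; a real system would learn these from content and logs.
SPELLING = {"engneer": "engineer", "jva": "java"}
ENTITIES = {"java": "skill", "software engineer": "job_title", "atlanta": "location"}

def correct_spelling(query):
    """Lowercase the query and replace known misspellings token by token."""
    return " ".join(SPELLING.get(tok, tok) for tok in query.lower().split())

def resolve_entities(query):
    """Greedy longest-match tagging of known phrases; unknown tokens become keywords."""
    tokens, tagged, i = query.split(), [], 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):
            phrase = " ".join(tokens[i:j])
            if phrase in ENTITIES:
                tagged.append((phrase, ENTITIES[phrase]))
                i = j
                break
        else:
            # No known phrase starts at this token.
            tagged.append((tokens[i], "keyword"))
            i += 1
    return tagged

def rewrite(tagged):
    """Emit a fielded query string; the field names are illustrative only."""
    return " AND ".join(f'{etype}:"{text}"' for text, etype in tagged)

rewritten = rewrite(resolve_entities(correct_spelling("software engneer jva Atlanta")))
print(rewritten)
```

The point of the sketch is only that each stage re-expresses the user's input in a form closer to their intent before the query reaches the relevancy engine.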
what’s next?
Basic Keyword Search (inverted index, tf-idf, bm25, query formulation, etc.)
Taxonomies / Entity Extraction (entity recognition, ontologies, synonyms, etc.)
Query Intent (query classification, semantic query parsing, concept expansion, rules, clustering, classification)
Relevancy Tuning (signals, A/B testing/genetic algorithms, Learning to Rank, Neural Networks)
Self-learning
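To make the first stage concrete, here is a minimal sketch of keyword search over an inverted index with BM25 scoring. The toy documents, the whitespace tokenizer, and the parameter values k1=1.2 and b=0.75 are assumptions for illustration; Lucene's production implementation differs in many details.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus.
DOCS = {
    1: "java developer with hibernate experience",
    2: "senior java engineer",
    3: "data scientist with python and spark",
}

def build_index(docs):
    """Map each term to {doc_id: term frequency} and record document lengths."""
    index, lengths = defaultdict(dict), {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, lengths

def bm25_search(query, index, lengths, k1=1.2, b=0.75):
    """Score every document that matches at least one query term."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        df = len(postings)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings.items():
            # Length normalization: longer documents need more matches to score as well.
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

index, lengths = build_index(DOCS)
results = bm25_search("java developer", index, lengths)
print(results)
```

Document 1 matches both query terms and ranks first; document 3 matches neither and is not returned at all, which is the behavior the later stages (taxonomies, query intent) are meant to improve on.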
The Three C’s
Content: Keywords and other features in your documents
Collaboration: How others have chosen to interact with your system
Context: Available information about your users and their intent
Reflected Intelligence “Leveraging previous data and interactions to improve how new data and interactions should be interpreted”
Feedback Loops
User searches
User sees results
User takes an action
Users’ actions inform system improvements
Examples of Reflected Intelligence
● Recommendation Algorithms
● Building user profiles from past searches, clicks, and other actions
● Identifying correlations between keywords/phrases
● Building out automatically-generated ontologies from content and queries
● Determining relevancy judgements (precision, recall, nDCG, etc.) from click logs
● Learning to Rank - using relevancy judgements and machine learning to train a relevance model
● Discovering misspellings, synonyms, acronyms, and related keywords
● Disambiguation of keyword phrases with multiple meanings
● Learning what’s important in your content
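One of the items above, deriving relevancy judgements from click logs, can be sketched in a few lines. This example assumes a hypothetical log format of (query, document, clicked) impressions, uses raw click-through rate as the graded judgement, and then scores two rankings with nDCG. Treating click-through rate as relevance is a simplification that ignores position bias, among other things.

```python
import math
from collections import defaultdict

# Hypothetical click log: one (query, doc_id, clicked) tuple per impression.
CLICK_LOG = [
    ("java developer", "d1", True),
    ("java developer", "d1", True),
    ("java developer", "d2", False),
    ("java developer", "d2", True),
    ("java developer", "d3", False),
    ("java developer", "d3", False),
]

def judgements_from_clicks(log):
    """Use click-through rate per (query, doc) pair as a graded relevance judgement."""
    shown, clicked = defaultdict(int), defaultdict(int)
    for query, doc, was_clicked in log:
        shown[(query, doc)] += 1
        clicked[(query, doc)] += int(was_clicked)
    return {key: clicked[key] / shown[key] for key in shown}

def dcg(gains):
    """Discounted cumulative gain: gains lower in the ranking count for less."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))

def ndcg(query, ranking, judgements):
    """DCG of the ranking divided by the DCG of the ideal ordering of the same docs."""
    gains = [judgements.get((query, doc), 0.0) for doc in ranking]
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

judgements = judgements_from_clicks(CLICK_LOG)
good = ndcg("java developer", ["d1", "d2", "d3"], judgements)
bad = ndcg("java developer", ["d3", "d2", "d1"], judgements)
print(good, bad)
```

Judgements produced this way are also the kind of training signal a Learning to Rank model would consume.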
Key Technologies
• Keyword Search
  - Lucene/Solr
• Taxonomies / Entity Extraction
  - Solr Text Tagger
  - Word2Vec / Dice Conceptual Search
  - SolrRDF
• Query Intent
  - Probabilistic Query Parser (SOLR-9418)
  - Semantic Knowledge Graph (SOLR-9480)
• Relevancy Tuning
  - Solr Learning to Rank Plugin (SOLR-8542)
• General Needs: a solid log processing framework (Apache Spark, Lucidworks Fusion, or Solr Daemon Expression)
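For the Relevancy Tuning item, the sketch below builds the request parameters for a re-ranking query using the `{!ltr ...}` re-rank syntax from Solr's Learning to Rank module. The model name `myModel`, the re-rank depth, and the field list are placeholders, and the example assumes the LTR plugin is enabled and a model with that name has already been uploaded; nothing is sent over the network.

```python
from urllib.parse import urlencode, parse_qs

def ltr_query_params(user_query, model="myModel", rerank_docs=100):
    """Build Solr request parameters that re-rank the top results with an LTR model.

    `model` and `rerank_docs` are placeholder values for this sketch.
    """
    return {
        "q": user_query,
        # Re-rank query: apply the named model to the top `rerank_docs` results.
        "rq": f"{{!ltr model={model} reRankDocs={rerank_docs}}}",
        "fl": "id,score",
    }

params = ltr_query_params("java developer")
query_string = urlencode(params)
print(query_string)
```

The resulting query string would be appended to a collection's select handler URL, which depends on the deployment and is left out here.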
Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
Knowledge Graph
Semantic Knowledge Graph Traversal
[Figure: graph traversal from the starting node "software engineer*" (materialized node) along has_related_skill edges to skill nodes, then along has_related_skill edges to further skill nodes, then along has_related_job_title edges to job title nodes. Skills shown: Java, C#, .NET, Hibernate, Scala, VB.NET. Job titles shown: .NET Developer, Java Developer, Software Engineer, Data Scientist. Each edge carries a relatedness score between 0.34 and 0.93.]
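The traversal pattern in this figure can be sketched with a hard-coded adjacency table. This is a deliberate simplification: the cited paper describes computing and scoring edges at query time from the search index, whereas this example looks them up in a dictionary. The node names are borrowed from the slide, but the edge scores below are invented for illustration and do not correspond to the values in the figure.

```python
# Hypothetical edges: (node, relationship) -> {neighbor: relatedness score}.
# The scores are made up for this sketch.
EDGES = {
    ("software engineer", "has_related_skill"): {"Java": 0.9, "C#": 0.8, ".NET": 0.7},
    ("Java", "has_related_job_title"): {"Java Developer": 0.9, "Software Engineer": 0.8},
    ("C#", "has_related_job_title"): {".NET Developer": 0.9, "Software Engineer": 0.7},
    (".NET", "has_related_job_title"): {".NET Developer": 0.9, "Data Scientist": 0.3},
}

def traverse(start, relationships, min_score=0.5):
    """Follow each relationship in turn, keeping edges scoring at least min_score.

    Returns a list of (nodes on the path, scores of the edges taken).
    """
    paths = [([start], [])]
    for rel in relationships:
        next_paths = []
        for nodes, scores in paths:
            for neighbor, score in EDGES.get((nodes[-1], rel), {}).items():
                if score >= min_score:
                    next_paths.append((nodes + [neighbor], scores + [score]))
        paths = next_paths
    return paths

paths = traverse("software engineer", ["has_related_skill", "has_related_job_title"])
job_titles = sorted({nodes[-1] for nodes, _ in paths})
print(job_titles)
```

With the threshold at 0.5, the weak ".NET" to "Data Scientist" edge is pruned, so only the three strongly related job titles survive the two-hop traversal.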
Knowledge Graph
Traditional Keyword Search
Recommendations
Semantic Search
User Intent
Personalized Search
Augmented Search
Domain-aware Matching
Contact Info
Trey Grainger
[email protected] @treygrainger
http://solrinaction.com
Other presentations: http://www.treygrainger.com