south big data hub: text data analysis panel

Text Data Analysis Panel: South Big Data Hub

Trey Grainger, SVP of Engineering, Lucidworks

About Me

Trey Grainger, SVP of Engineering

• Previously Director of Engineering @ CareerBuilder
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Information Retrieval & Web Search – Stanford University

Other fun projects:
• Co-author of Solr in Action, plus numerous research papers
• Frequent conference speaker
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene/Solr contributor

http://solrinaction.com/

http://www.celiaccess.com/

what do you do?

Search-Driven Everything

Customer Service

Customer Insights

Fraud Surveillance

Research Portal

Online Retail

Digital Content

Lucidworks enables Search-Driven Everything

Data Acquisition

Indexing & Streaming

Smart Access API

Recommendations & Alerts

Analytics & Insights

Extreme Relevancy

CUSTOMER SERVICE

RESEARCH PORTAL

DIGITAL CONTENT

CUSTOMER INSIGHTS

FRAUD SURVEILLANCE

ONLINE RETAIL

• Access all your data in a number of ways from one place.

• Secure storage and processing from Solr and Spark.

• Acquire data from any source with pre-built connectors and adapters.

Machine learning and advanced analytics turn all of your apps into intelligent data-driven applications.

how do you do it?

Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.

Key Solr Features:
● Multilingual Keyword search
● Relevancy Ranking of results
● Faceting & Analytics (nested / relational)
● Highlighting
● Spelling Correction
● Autocomplete/Type-ahead Prediction
● Sorting, Grouping, Deduplication
● Distributed, Fault-tolerant, Scalable
● Geospatial search
● Complex Function queries
● Recommendations (More Like This)
● Graph Queries and Traversals
● SQL Query Support
● Streaming Aggregations
● Batch and Streaming processing
● Highly Configurable / Plugins
● Learning to Rank
● Building machine-learning models
● … many more

*source: Solr in Action, chapter 2
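Several of the features listed above (keyword search, faceting, highlighting) are typically switched on through request parameters on Solr's /select handler. The following is a minimal sketch that only builds such a request URL and does not contact a server. The host, the collection name "jobs", and the field names "job_title" and "description" are hypothetical and not taken from the slides.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_solr_query(base_url, collection, keywords,
                     facet_field=None, highlight_field=None, rows=10):
    """Build a URL for Solr's /select handler. No request is sent."""
    params = [("q", keywords), ("rows", rows), ("wt", "json")]
    if facet_field:
        # Faceting: count matching documents per value of a field.
        params += [("facet", "true"), ("facet.field", facet_field)]
    if highlight_field:
        # Highlighting: return snippets showing where the query matched.
        params += [("hl", "true"), ("hl.fl", highlight_field)]
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical host, collection, and field names, for illustration only.
url = build_solr_query(
    "http://localhost:8983/solr", "jobs", "software engineer",
    facet_field="job_title", highlight_field="description",
)
parsed = parse_qs(urlsplit(url).query)
print(url)
```

Building the URL separately from sending it keeps the sketch runnable without a Solr instance; in practice the same parameters would be passed to whichever HTTP client or Solr client library is in use.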

The standard for enterprise search.

90% of Fortune 500 uses Solr.

Reference Architecture (Lucidworks Fusion)

Bay Area Search

Type-ahead Prediction

Building an Intent Engine

Search Box

Semantic Query Parsing

Intent Engine

Spelling Correction

Entity / Entity Type Resolution

Machine-learned Ranking

Relevancy Engine (“re-expressing intent”)

User Feedback (Clarifying Intent)

Query Re-writing

Search Results

Query Augmentation

Knowledge Graph

Contextual Disambiguation

Additional References:

http://www.treygrainger.com/posts/presentations/crowdsourced-query-augmentation-through-the-semantic-discovery-of-domain-specific-jargon/

http://www.treygrainger.com/posts/presentations/building-a-real-time-big-data-analytics-platform-with-solr/

http://www.treygrainger.com/posts/presentations/scaling-recommendations-semantic-search-data-analytics-with-solr/

http://www.treygrainger.com/posts/presentations/enhancing-relevancy-through-personalization-semantic-search/

http://www.treygrainger.com/posts/presentations/semantic-multilingual-strategies-in-lucenesolr/

http://www.treygrainger.com/posts/presentations/leveraging-lucene-solr-as-a-knowledge-graph-and-intent-engine/

http://www.treygrainger.com/posts/presentations/reflected-intelligence-evolving-self-learning-data-systems/

http://www.treygrainger.com/posts/presentations/searching-on-intent-knowledge-graphs-personalization-and-contextual-disambiguation/

http://www.treygrainger.com/posts/presentations/building-a-cloud-like-knowledge-discovery-platform/

http://www.treygrainger.com/posts/presentations/reflected-intelligence-lucene-solr-as-a-self-learning-data-system/

http://www.treygrainger.com/posts/presentations/building-a-real-time-solr-powered-recommendation-engine/

http://www.treygrainger.com/posts/presentations/the-semantic-knowledge-graph/

https://www.researchgate.net/publication/265512095_Augmenting_recommendation_systems_using_a_model_of_semantically-related_terms_extracted_from_user_behavior

https://www.researchgate.net/publication/264160850_PGMHD_A_Scalable_Probabilistic_Graphical_Model_for_Massive_Hierarchical_Data_Problems

https://www.researchgate.net/publication/283329737_Query_Sense_Disambiguation_Leveraging_Large_Scale_User_Behavioral_Data

https://www.researchgate.net/publication/283980991_Improving_the_Quality_of_Semantic_Relationships_Extracted_from_Massive_User_Behavioral_Data

https://www.researchgate.net/publication/282816550_Crowdsourced_Query_Augmentation_through_Semantic_Discovery_of_Domain-specific_Jargon

https://www.researchgate.net/publication/306926620_Entity_Type_Recognition_Using_an_Ensemble_of_Distributional_Semantic_Models_to_Enhance_Query_Understanding

https://www.researchgate.net/publication/304859620_Application_of_Statistical_Relational_Learning_to_Hybrid_Recommendation_Systems

https://www.researchgate.net/publication/288529613_Mining_Massive_Hierarchical_Data_Using_a_Scalable_Probabilistic_Graphical_Model

https://www.researchgate.net/publication/308368512_Macro-optimization_of_email_recommendation_response_rates_harnessing_individual_activity_levels_and_group_affinity_trends

https://www.researchgate.net/publication/307604163_The_Semantic_Knowledge_Graph_A_compact_auto-generated_model_for_real-time_traversal_and_ranking_of_any_relationship_within_a_domain

what’s next?

Basic Keyword Search (inverted index, tf-idf, bm25, query formulation, etc.)

Taxonomies / Entity Extraction (entity recognition, ontologies, synonyms, etc.)

Query Intent (query classification, semantic query parsing, concept expansion, rules, clustering, classification)

Relevancy Tuning (signals, AB testing/genetic algorithms, Learning to Rank, Neural Networks)

Self-learning
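The first layer of this stack, basic keyword search, rests on an inverted index plus a scoring function such as BM25. The following is a minimal sketch of both, assuming whitespace tokenization, the commonly used defaults k1=1.2 and b=0.75, and a non-negative IDF variant. The class name TinyIndex and the sample documents are hypothetical, and this is not Lucene's or Solr's actual implementation.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for real analysis.
    return text.lower().split()


class TinyIndex:
    """Hypothetical in-memory inverted index with BM25 scoring."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.doc_len = {}
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        for doc_id, text in docs.items():
            counts = Counter(tokenize(text))
            self.doc_len[doc_id] = sum(counts.values())
            for term, tf in counts.items():
                self.postings[term][doc_id] = tf
        self.avg_len = sum(self.doc_len.values()) / len(self.doc_len)

    def idf(self, term):
        # Rare terms get a higher weight; log(1 + ...) keeps it non-negative.
        n = len(self.doc_len)
        df = len(self.postings.get(term, {}))
        return math.log(1 + (n - df + 0.5) / (df + 0.5))

    def search(self, query):
        scores = defaultdict(float)
        for term in tokenize(query):
            idf = self.idf(term)
            for doc_id, tf in self.postings.get(term, {}).items():
                # Length normalization: long documents are penalized.
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / self.avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up documents for illustration.
docs = {
    "d1": "java developer with hibernate experience",
    "d2": "data scientist python",
    "d3": "senior java java engineer",
}
index = TinyIndex(docs)
results = index.search("java")
print(results)
```

Only documents that appear in a query term's postings list are scored, which is what makes the inverted index fast: documents without any query term are never touched.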

The Three C’s

Content: Keywords and other features in your documents

Collaboration: How others have chosen to interact with your system

Context: Available information about your users and their intent

Reflected Intelligence: “Leveraging previous data and interactions to improve how new data and interactions should be interpreted”

Feedback Loops

User Searches

User Sees Results

User takes an action

Users’ actions inform system improvements

Examples of Reflected Intelligence

● Recommendation Algorithms
● Building user profiles from past searches, clicks, and other actions
● Identifying correlations between keywords/phrases
● Building out automatically-generated ontologies from content and queries
● Determining relevancy judgements (precision, recall, nDCG, etc.) from click logs
● Learning to Rank - using relevancy judgements and machine learning to train a relevance model
● Discovering misspellings, synonyms, acronyms, and related keywords
● Disambiguation of keyword phrases with multiple meanings
● Learning what’s important in your content
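One of the examples above, determining relevancy judgements from click logs, can be sketched in a few lines. This is a simplified illustration rather than the method used in the talk: it assumes the click log is a list of (query, doc_id, was_clicked) impressions, uses raw click-through rate as the relevance grade, ignores position bias, and computes the ideal ordering only from the documents in the ranking being evaluated. All function names and the sample log are hypothetical.

```python
import math
from collections import defaultdict


def judgments_from_clicks(click_log, min_impressions=2):
    """Estimate a relevance grade per (query, doc) as its click-through rate."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for query, doc_id, was_clicked in click_log:
        shown[(query, doc_id)] += 1
        clicked[(query, doc_id)] += int(was_clicked)
    # Skip pairs with too few impressions to say anything meaningful.
    return {key: clicked[key] / n for key, n in shown.items()
            if n >= min_impressions}


def dcg(gains):
    # Discounted cumulative gain: later ranks contribute less.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(query, ranking, judgments):
    """nDCG of a ranking, normalized by the best ordering of the same docs."""
    gains = [judgments.get((query, doc_id), 0.0) for doc_id in ranking]
    best = dcg(sorted(gains, reverse=True))
    return dcg(gains) / best if best > 0 else 0.0


# Made-up impressions: d1 always clicked, d2 half the time, d3 never.
log = [
    ("java", "d1", True), ("java", "d1", True),
    ("java", "d2", False), ("java", "d2", True),
    ("java", "d3", False), ("java", "d3", False),
]
judgments = judgments_from_clicks(log)
good = ndcg("java", ["d1", "d2", "d3"], judgments)
bad = ndcg("java", ["d3", "d2", "d1"], judgments)
print(good, bad)
```

Judgments produced this way can serve as training labels for a Learning to Rank model or as the yardstick for comparing ranking changes, which is the feedback loop the slide describes.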

Key Technologies

• Keyword Search
  - Lucene/Solr
• Taxonomies / Entity Extraction
  - Solr Text Tagger
  - Word2Vec / Dice Conceptual Search
  - SolrRDF
• Query Intent
  - Probabilistic Query Parser (SOLR-9418)
  - Semantic Knowledge Graph (SOLR-9480)
• Relevancy Tuning
  - Solr Learning to Rank Plugin (SOLR-8542)
• General Needs: a solid log processing framework (Apache Spark, Lucidworks Fusion, or Solr Daemon Expression)

https://issues.apache.org/jira/browse/SOLR-9418

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Knowledge Graph

Semantic Knowledge Graph Traversal

[Figure: graph traversal from the starting node "software engineer*" (materialized node) to skill nodes via has_related_skill, on to further skill nodes via has_related_skill, and then to job title nodes via has_related_job_title. Skill nodes shown: Java, C#, .NET, Hibernate, Scala, VB.NET. Job title nodes shown: .NET Developer, Java Developer, Software Engineer, Data Scientist. Edges carry weights between 0.34 and 0.93.]
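The traversal above scores edges between terms that are not stored explicitly but derived from the documents the terms share. The following is a rough sketch of that idea under stated assumptions: each document is a set of terms, an edge is materialized by intersecting postings sets, and the score is a z-score comparing how often the target term occurs in the source term's documents against its rate in the whole corpus. The class TermGraph, the sample documents, and the unnormalized scoring are illustrative choices, not the SOLR-9480 implementation, and the scores are on a different scale from the weights in the figure.

```python
import math
from collections import defaultdict


class TermGraph:
    """Hypothetical term graph whose edges are computed from an inverted index."""

    def __init__(self, docs):
        self.n_docs = len(docs)
        self.postings = defaultdict(set)  # term -> ids of docs containing it
        for doc_id, terms in docs.items():
            for term in terms:
                self.postings[term].add(doc_id)

    def relatedness(self, source, target):
        """z-score of target's frequency among source's docs vs. the corpus."""
        foreground = self.postings.get(source, set())
        target_docs = self.postings.get(target, set())
        n = len(foreground)
        p = len(target_docs) / self.n_docs  # background probability
        if n == 0 or p in (0.0, 1.0):
            return 0.0
        observed = len(foreground & target_docs)
        return (observed - n * p) / math.sqrt(n * p * (1 - p))

    def related_terms(self, source, limit=3):
        """One traversal hop: best-scoring terms that co-occur with source."""
        foreground = self.postings.get(source, set())
        candidates = [t for t, ids in self.postings.items()
                      if t != source and ids & foreground]
        scored = [(t, self.relatedness(source, t)) for t in candidates]
        return sorted(scored, key=lambda ts: ts[1], reverse=True)[:limit]


# Made-up job postings, each reduced to a set of terms.
docs = {
    1: {"software engineer", "java", "hibernate"},
    2: {"software engineer", "java", "scala"},
    3: {"software engineer", "c#", ".net"},
    4: {"data scientist", "python", "statistics"},
    5: {"data scientist", "python", "java"},
    6: {"nurse", "patient care"},
}
graph = TermGraph(docs)
java_score = graph.relatedness("software engineer", "java")
python_score = graph.relatedness("software engineer", "python")
print(java_score, python_score, graph.related_terms("software engineer"))
```

A multi-hop traversal like the one in the figure would repeat related_terms from each node reached, optionally restricting each hop to a node type such as skills or job titles.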

Knowledge Graph

Traditional Keyword Search

Recommendations

Semantic Search

User Intent

Personalized Search

Augmented Search

Domain-aware Matching

Contact Info

Trey Grainger

trey.grainger@lucidworks.com @treygrainger

http://solrinaction.com

Other presentations: http://www.treygrainger.com
