crowd sourced intelligence built into search over hadoop

30
1 ©MapR Technologies - Confidential Crowd Sourcing Reflected Intelligence Using Search and Big Data Ted Dunning, MapR Grant Ingersoll, LucidWorks

Upload: lucenerevolution

Post on 10-May-2015

965 views

Category:

Education


0 download

DESCRIPTION

Presented by Ted Dunning, Chief Application Architect, MapR & Grant Ingersoll, Chief Technology Officer, LucidWorks Search has quickly evolved from being an extension of the data warehouse to being run as a real time decision processing system. Search is increasingly being used to gather intelligence on multi-structured data leveraging distributed platforms such as Hadoop in the background. This session will provide details on how search engines can be abused to use not text, but mathematically derived tokens to build models that implement reflected intelligence. In such a system, intelligent or trend-setting behavior of some users is reflected back at other users. More importantly, the mathematics of evaluating these models can be hidden in a conventional search engine like SolR, making the system easy to build and deploy. The session will describe how to integrate Apache Solr/Lucene with Hadoop. Then we will show how crowd-sourced search behavior can be looped back into analysis and how constantly self-correcting models can be created and deployed. Finally, we will show how these models can respond with intelligent behavior in realtime.

TRANSCRIPT

Page 1: Crowd sourced intelligence built into search over hadoop

1 ©MapR Technologies - Confidential

Crowd Sourcing Reflected Intelligence Using Search and Big Data Ted Dunning, MapR Grant Ingersoll, LucidWorks

Page 2: Crowd sourced intelligence built into search over hadoop

2 ©MapR Technologies - Confidential

Is Search Enough?

• Keyword search is a commodity

• Holistic view of the data and the user interactions with that data are critical

• Search, Discovery and Analytics are the key to unlocking this view of users and data

Search

MapR and LucidWorks

Page 3: Crowd sourced intelligence built into search over hadoop

3 ©MapR Technologies - Confidential

Grant’s Background

Co-founder: – LucidWorks – CTO

– Apache Mahout

Long time Lucene/Solr committer

Author: Taming Text

Background in IR and NLP – Built CLIR, QA and a variety of other search-based apps

Page 4: Crowd sourced intelligence built into search over hadoop

4 ©MapR Technologies - Confidential

Ted’s Background

Academia, Startups – MapR, MusicMatch, ID Analytics, Veoh Networks

– Big data since before big

Open source – since the dark ages before the internet

– Mahout, Zookeeper, Drill

– bought the beer at first HUG

Co-Author – Mahout in Action

MapR – Chief Application Architect

Founding member of Apache Drill

Page 5: Crowd sourced intelligence built into search over hadoop

5 ©MapR Technologies - Confidential

Agenda

Intro

Search (R)evolution

Reflected Intelligence Use Cases

Building a Next Generation Search and Discovery Platform – MapR

– LucidWorks

1+1=3

Page 6: Crowd sourced intelligence built into search over hadoop

6 ©MapR Technologies - Confidential

User Interactions With Big Data

Data

Data

Data

DFS

Key Value Store

Index

Command Line

Query Language

Keyword Search

System Administrator

Engineer

End User

Page 7: Crowd sourced intelligence built into search over hadoop

7 ©MapR Technologies - Confidential

User Interactions With Big Data

Data

Data

Data

DFS

Key Value Store

Index

Command Line

Query Language

Keyword Search

System Administrator

Engineer

End User

Reflected Intelligence

Page 8: Crowd sourced intelligence built into search over hadoop

8 ©MapR Technologies - Confidential

Search (R)evolution

Search use leads to search abuse – denormalization frees your mind – scoring is just a sparse matrix multiply

Lucene/Solr evolution – non free text usages abound – many DB-like features – noSQL before NoSQL was cool – flexible indexing – finite State Transducers FTW!

Scale

“This ain’t your father’s relevance anymore”

Page 9: Crowd sourced intelligence built into search over hadoop

9 ©MapR Technologies - Confidential

Search, Discovery and Analytics

Large-scale analysis is key to reflected intelligence – correlation analysis

• based on queries, clicks, mouse tracks,

even explicit feedback

• produce clusters, trends, topics, SIP’s

– start with engineered knowledge,

refine with user feedback

Large-scale discovery features

encourage experimentation

Always test, always enrich!

Search

Discovery Analytics

Page 10: Crowd sourced intelligence built into search over hadoop

10 ©MapR Technologies - Confidential

Social Media Analysis in Telecom

Goal – Detect flash-mob traffic events

– Provision additional resources before failures

Method: Correlate mobile traffic analysis with social media analysis – events cause traffic micro-bursts

– participants tweet the events ahead of time

– tweet locations converge on burst location

Deploy operations faster to predict outages and better handle emergency situations – high cost bandwidth augmentation can be marshaled as the traffic appears

– anticipation beats reaction

Page 11: Crowd sourced intelligence built into search over hadoop

11 ©MapR Technologies - Confidential

Provenance is 80% of Value

Problem – Broadcasters don’t know what audiences really like at a micro level

Method: – Analysis of social media to determine advertising reach and response

– Time resolution of social traffic can provide detailed response metrics

Results: – In one case the untargeted advertising was worth 5x more if with

supporting response data

Page 12: Crowd sourced intelligence built into search over hadoop

12 ©MapR Technologies - Confidential

Claims Analysis

Goal – Insurance claims processing and analysis

– fraud analysis

Method – Combine free text search with metadata analysis to identify high risk

activities across the country

– Integrate with corporate workflows to detect and fix outliers in customer relations

Results – Questions that took 24-48 hours now take seconds to answer

Page 13: Crowd sourced intelligence built into search over hadoop

13 ©MapR Technologies - Confidential

Virginia Tech - Help the World

Grab data around crisis

Search immediately

Large-scale analysis enriches data to find ways to improve responses and understanding

http://www.ctrnet.net

Page 14: Crowd sourced intelligence built into search over hadoop

14 ©MapR Technologies - Confidential

Can Search Catch the Bad Guys?

Online Drug Counterfeit detection

Identify commonly used language indicating counterfeits – you know it when you see it

– and you know you have seen it

Feed to analyst via search-driven application – enrich based on analysts feedback

Page 15: Crowd sourced intelligence built into search over hadoop

15 ©MapR Technologies - Confidential

Veoh - Cross Recommendations

Cross recommendation as search – with search used to build cross recommendation!

Recommend content to people who exhibit certain behaviors (clicks, query terms, other)

(Ab)use of a search engine – but not as a search engine for content

– more like a search engine for behavior

Page 16: Crowd sourced intelligence built into search over hadoop

16 ©MapR Technologies - Confidential

Recommendation Basics

History:

User Thing

1 3

2 4

3 4

2 3

3 2

1 1

2 1

Page 17: Crowd sourced intelligence built into search over hadoop

17 ©MapR Technologies - Confidential

Recommendation Basics

History as matrix:

t1+t3 cooccur 2 times, t1+t4 once, t2+t4 once

t1 t2 t3 t4

u1 1 0 1 0

u2 1 0 1 1

u3 0 1 0 1

Page 18: Crowd sourced intelligence built into search over hadoop

18 ©MapR Technologies - Confidential

Recommendation Basics

Coocurrence

t1 t2 t3 t4

t1 2 0 2 1

t2 0 1 0 1

t3 2 0 1 1

t4 1 1 1 2

t3 not t3

t1 2 1

not t1 1 1

Page 19: Crowd sourced intelligence built into search over hadoop

21 ©MapR Technologies - Confidential

Spot the Anomaly

Root LLR is roughly like standard deviations

A not A

B 13 1000

not B 1000 100,000

A not A

B 1 0

not B 0 2

A not A

B 1 0

not B 0 10,000

A not A

B 10 0

not B 0 100,000

0.44 0.98

2.26 7.15

Page 20: Crowd sourced intelligence built into search over hadoop

22 ©MapR Technologies - Confidential

Threshold by Score

Coocurrence

t1 t2 t3 t4

t1 2 0 2 1

t2 0 1 0 1

t3 2 0 1 1

t4 1 1 1 2

Page 21: Crowd sourced intelligence built into search over hadoop

23 ©MapR Technologies - Confidential

Threshold by Score

Indicators

t1 t2 t3 t4

t1 1 0 0 1

t2 0 1 0 1

t3 0 0 1 1

t4 1 0 0 1

indicator: t2, t4

Page 22: Crowd sourced intelligence built into search over hadoop

24 ©MapR Technologies - Confidential

Search Engine for Reflected Intelligence

Map-reduce “big data” part – Logs record user + item occurrence

– Group by user to get rows of occurrence matrix

– Self-join to get cooccurrence

– Log-likelihood test to find anomalies

Search part – Anomalous cooccurrences are indicators

(or use statistical scores to provide fancy boosts)

– Indicator fields and other meta-data are indexed

– Recommendation implemented using a single search

– Boosts, functions, similarity also can reflect learned behavior

Page 23: Crowd sourced intelligence built into search over hadoop

25 ©MapR Technologies - Confidential

What Platform Do You Need?

Fast, efficient, scalable search – bulk and near real-time indexing

– handle billions of records with sub-second search and faceting

Large scale, cost effective storage and processing capabilities

NLP and machine learning tools that scale to enhance discovery and analysis

Integrated log analysis workflows that close the loop between the raw data and user interactions

Page 24: Crowd sourced intelligence built into search over hadoop

26 ©MapR Technologies - Confidential

Shards

1 2

3 N

Search View

•Documents •Users •Logs

Document Store

Analytic Services

View into numeric/historic data

Classification Recommendation

Personalization & Machine Learning

Services

Classification Models In memory Replicated Multi-tenant

Discovery & Enrichment Clustering, classification, NLP, topic identification, search log analysis, user behavior

Content Acquisition ETL, batch or near real-time

Access APIs

Data • LucidWorks Search

connectors • Push

Reference Architecture

Page 25: Crowd sourced intelligence built into search over hadoop

27 ©MapR Technologies - Confidential

MapR

MapR provides the technology leading Hadoop distribution – full eco-system distribution

– integrated data platform

– complete solution for data integrity

MapR clusters also provide tight integration with search technologies like LucidWorks – integration is key for effective ops

Page 26: Crowd sourced intelligence built into search over hadoop

28 ©MapR Technologies - Confidential

LucidWorks

LucidWorks provides the leading packaging of Apache Lucene and Solr – build your own, we support

– founded by the most prominent Lucene/Solr experts

LucidWorks Search – “Solr++”

• UI, REST API, MapR connectors, relevance tools, much more

LucidWorks Big Data – Big Data as a Service

– Integrated LucidWorks Search, Hadoop, machine learning with prebuilt workflows for many of these tasks

Page 27: Crowd sourced intelligence built into search over hadoop

29 ©MapR Technologies - Confidential

LucidWorks Big Data

Inputs

API

Mgmt Search, Discovery, Analytics

Processing & Storage

Analytics Service Document Service

Big Data LucidWorks Web HDFS

Admin

Service Mgmt

Data Mgmt

Provisioning, Monitoring & Configuration

Page 28: Crowd sourced intelligence built into search over hadoop

30 ©MapR Technologies - Confidential

Easy Wins

Analyze logs from application stored in MapR

Seamlessly store search indexes in MapR – and feed to Pig, Mahout and others

– use mirrors + NFS to directly deploy indexes

Snapshots make backups a snap – A lot like version control for data

LucidWorks 2.5.2 easily connects with MapR

Page 29: Crowd sourced intelligence built into search over hadoop

31 ©MapR Technologies - Confidential

1 + 1 = 3

Page 30: Crowd sourced intelligence built into search over hadoop

32 ©MapR Technologies - Confidential

Learn More

Talk to Ted @ted_dunning

[email protected]

Talk to Grant @gsingers

[email protected]

MapR and Lucid Works http://www.mapr.com

http://www.lucidworks.com

Hash Tags

#mapr #lucenesolr #lucidworks