Leveraging Solr and Mahout
DESCRIPTION
My talk from last night's Big Data Warehouse meetup in NYC on using Solr and Mahout to build next-generation data access tools.
TRANSCRIPT
Confidential © Copyright 2012
Leveraging Solr and Mahout for Next Gen Data Access and Insight
Grant Ingersoll
Chief Scientist
Confidential and Proprietary © 2012 LucidWorks
Search is Dead, Long Live Search
Content
Users
Access
Content Relationships
• Modern Data Challenges are multi-structured
• Search is a system building block
- Text is only a part of the story
• If the algorithms fit, use them!
• Embrace fuzziness!
• Scoring features are everywhere
Topics
• Intros
• Search (R)Evolution
• Apache Solr
• Apache Mahout
• Search and Machine Learning
• Scaling
Grant’s Background
• Co-founder:
- LucidWorks – Chief Scientist
- Apache Mahout
• Long time Lucene/Solr committer
• Author: Taming Text
- www.manning.com/ingersoll
• Background in IR and NLP
- Built CLIR, QA and a variety of other search-based apps
Search (R)evolution
• Search use leads to search abuse
- Denormalization frees your mind
- Scoring is just a sparse matrix multiply
• Lucene/Solr evolution
- Non-free text usages abound
- Many DB-like features
- NoSQL before NoSQL was cool
- Flexible indexing
- Finite State Transducers FTW!
• Scale
• “This ain’t your father’s relevance anymore”
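The remark that scoring is just a sparse matrix multiply can be illustrated with a minimal sketch. It assumes a toy inverted index held as nested dictionaries and made-up weights standing in for tf-idf values; the names `INDEX` and `score` are hypothetical, and this is not how Lucene implements scoring internally.

```python
from collections import defaultdict

# Toy term-document matrix stored sparsely: term -> {doc_id: weight}.
# The weights are invented numbers standing in for tf-idf values.
INDEX = {
    "solr":   {0: 1.2, 2: 0.4},
    "mahout": {1: 1.5, 2: 0.9},
    "search": {0: 0.3, 1: 0.2, 2: 0.5},
}


def score(query_weights, index):
    """Multiply a sparse query vector by a sparse term-document matrix.

    Only the non-zero entries (the postings of the query terms) are visited,
    which is the access pattern an inverted index gives you.
    """
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight
    # Highest score first, as a search engine would return results.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranked = score({"solr": 1.0, "search": 0.5}, INDEX)
for doc_id, value in ranked:
    print(doc_id, round(value, 2))
```

The query vector does not have to come from free text: any sparse feature vector (user preferences, categories, numeric buckets) can be scored the same way.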
Apache Solr?
• “Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface and many more features. It runs in a Java servlet container such as Tomcat.”
- http://lucene.apache.org/solr
• Did I mention free?
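As a rough illustration of the HTTP and JSON APIs mentioned in the quote, the sketch below only builds a request URL for Solr's select handler, with faceting and highlighting switched on. The host, the core name `collection1`, the field names `title` and `category`, and the helper `build_solr_query_url` are assumptions made for the example, not details from the talk.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, query, facet_field, rows=10):
    """Build a URL for Solr's select handler (hypothetical helper).

    base_url is assumed to point at a Solr core, for example
    http://localhost:8983/solr/collection1.
    """
    params = [
        ("q", query),                  # the query itself
        ("wt", "json"),                # ask for a JSON response
        ("rows", rows),                # number of hits to return
        ("facet", "true"),             # turn faceting on
        ("facet.field", facet_field),  # field to compute facet counts for
        ("hl", "true"),                # turn hit highlighting on
    ]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


url = build_solr_query_url(
    "http://localhost:8983/solr/collection1", "title:mahout", "category"
)
print(url)
```

The URL can then be fetched with any HTTP client; the sketch stops short of sending the request so that it runs without a Solr server.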
Apache Mahout
• Goal: create library of scalable machine learning algorithms
• Mahout’s 3 “C”s provide tools for helping across many aspects of discovery
- Collaborative Filtering
- Classification
- Clustering
• Also:
- Collocations (Statistically Interesting Phrases)
- SVD
- Java math, primitives libraries and more
Search + Machine Learning
• Search-driven applications present multiple opportunities for leveraging machine learning
- Clustering – Enhance Discovery, outlier detection
- Classification – Queries, Documents, Users
- Content Recommendation – Collab. Filtering and personalization
- NLP – phrases, named entities, co-reference, much more
• Many of these can also power faceted navigation
• Aside: Search can also often be used effectively to implement many machine learning algorithms
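One way to read the aside about implementing machine learning with search is a k-nearest-neighbor classifier: index labeled documents, run the new text as a query, and let the top hits vote on a label. The sketch below assumes a tiny in-memory list in place of a real index and plain term overlap in place of a real relevance score; `LABELED_DOCS`, `search` and `classify` are hypothetical names.

```python
from collections import Counter

# Hypothetical labeled "index": (label, text) pairs standing in for indexed docs.
LABELED_DOCS = [
    ("search", "solr lucene index query facet"),
    ("search", "query relevance ranking index"),
    ("ml", "mahout clustering classification hadoop"),
    ("ml", "recommendation collaborative filtering mahout"),
]


def search(query, docs, k):
    """Return up to k (overlap, label) hits, best first.

    Term overlap is a stand-in for the score a search engine would compute.
    """
    q_terms = set(query.split())
    hits = [(len(q_terms & set(text.split())), label) for label, text in docs]
    hits = [hit for hit in hits if hit[0] > 0]
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits[:k]


def classify(query, docs, k=3):
    """k-NN classification: the top-k hits vote, weighted by their score."""
    votes = Counter()
    for overlap, label in search(query, docs, k):
        votes[label] += overlap
    return votes.most_common(1)[0][0] if votes else None


print(classify("mahout clustering on hadoop", LABELED_DOCS))
print(classify("solr query index", LABELED_DOCS))
```

With a real engine, the `search` function would be replaced by a query against the index, and the rest of the voting logic would stay the same.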
How and When
(Architecture diagram; components as labeled on the slide)
• Shards: 1, 2, 3, N
• Search View
- Documents
- Users
- Logs
• Document Store
• Analytic Services
- View into numeric/historic data
• Personalization & Machine Learning Services
- Classification
- Recommendation
• Classification Models
- In memory
- Replicated
- Multi-tenant
• Discovery & Enrichment
- Clustering, classification, NLP, topic identification, search log analysis, user behavior
• Content Acquisition
- ETL, batch or near real-time
• Access APIs
• Data
- LucidWorks Search connectors
- Push
Scaling
• Search
- SolrCloud = Large scale, distributed search and faceting
» http://wiki.apache.org/solr/SolrCloud
• Machine Learning
- Mahout is built on Hadoop for most things
- SGD is sequential and really fast
• Sometimes all you can do is make an educated guess
- Storm, Kafka, etc. can help by allowing you to make estimates in near real time
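To show why SGD is described as sequential and fast, here is a minimal sketch of logistic regression trained by stochastic gradient descent: each example updates the weights immediately, so a single pass over the data needs no distributed job. This is an illustration in plain Python, not Mahout's implementation; the feature names, learning rate and epoch count are arbitrary assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, features):
    """Probability of label 1 for a sparse feature dict {name: value}."""
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return sigmoid(z)


def sgd_logistic(examples, epochs=50, learning_rate=0.5, seed=0):
    """Train logistic regression one example at a time.

    examples: list of (features, label) with label in {0, 1}.
    """
    rng = random.Random(seed)
    weights = {}
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit examples in a fresh random order each epoch
        for features, label in data:
            error = label - predict(weights, features)
            # Only the features present in this example are touched.
            for name, value in features.items():
                weights[name] = weights.get(name, 0.0) + learning_rate * error * value
    return weights


# Made-up training set: 1 = search-related, 0 = not.
TRAIN = [
    ({"bias": 1.0, "solr": 1.0}, 1),
    ({"bias": 1.0, "query": 1.0}, 1),
    ({"bias": 1.0, "hadoop": 1.0}, 0),
    ({"bias": 1.0, "cluster": 1.0}, 0),
]

model = sgd_logistic(TRAIN)
print(predict(model, {"bias": 1.0, "solr": 1.0}) > 0.5)
print(predict(model, {"bias": 1.0, "hadoop": 1.0}) < 0.5)
```

Because each update touches only the non-zero features of one example, the cost per example stays small even when the feature space is large.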
Wrap
• Search, Discovery and Analytics, when combined into a single, coherent system, provide powerful insight into both your content and your users
• LucidWorks has combined many of these things into LucidWorks Big Data
- http://www.lucidworks.com/products/lucidworks-big-data
• Design for the big picture when building search-based applications
Resources
• LucidWorks
- http://www.lucidworks.com
- http://www.lucidworks.com/products/lucidworks-big-data
- @LucidImagineer
• Me
- [email protected]
- @gsingers
• Taming Text
- http://www.manning.com/ingersoll
- http://www.tamingtext.com
- @tamingtext