TRANSCRIPT
Data Science with Solr and Spark
Grant Ingersoll
@gsingers
CTO, Lucidworks
Get Started
ASF Mailing Lists, lucidworks.com, docs.lucidworks.com
Solr for Data Science
http://www.slideshare.net/gsingers/solr-for-data-science
Fusion for Data Science
https://www.youtube.com/watch?v=oSs5DNTVxv4
Lucidworks Fusion Is Search-Driven Everything
• Drive next generation relevance via Content, Collaboration and Context
• Harness best in class Open Source: Apache Solr + Spark
• Simplify application development and reduce ongoing maintenance
Fusion is built on three core principles:
Fusion Architecture
[Architecture diagram] A REST API fronts the stack. Apache Spark provides the compute layer (cluster manager plus workers); Apache Solr provides storage and search (shards), with HDFS as optional storage underneath. An Apache ZooKeeper ensemble (ZK 1 … ZK N) handles shared config management, leader election, and load balancing. Connectors pull in data from database, web, file, log, Hadoop, and cloud sources. Core services include alerting/messaging, NLP, pipelines, blob storage, scheduling, recommenders/signals, and more, surfaced through an Admin UI, with security built in.
Lucidworks View
Let’s Do This
• Why Search for Data Engineering?
• Quick intro to Solr
• Quick intro to Spark
• Solr + Spark
• Demo!
The Importance of Importance
Search-Driven Everything
Customer Service
Customer Insights
Fraud Surveillance
Research Portal
Online Retail
Digital Content
Solr Key Features
• Full text search (Information Retrieval)
• SQL and JDBC! (v6)
• Facets/Guided Nav galore
• Lots of data types
• Spelling, auto-complete, highlighting
• Cursors
• More Like This
• De-duplication
• Graph Traversal
• Grouping and Joins
• Stats, expressions, transformations and more
• Lang. Detection
• Extensible
• Massive Scale/Fault tolerance
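As a concrete taste of the facet feature above: a faceted request is just a normal Solr query with facet parameters added. The Python sketch below only builds such a request URL; the `products` collection and the `manufacturer`/`category` field names are made up for illustration.

```python
from urllib.parse import urlencode

def facet_query_url(base, collection, q, facet_fields, rows=10):
    """Build a Solr /select URL with field faceting enabled.

    Uses the standard Solr parameters q, rows, facet, and facet.field.
    """
    params = [("q", q), ("rows", rows), ("facet", "true")]
    for f in facet_fields:
        params.append(("facet.field", f))
    return "%s/%s/select?%s" % (base, collection, urlencode(params))

# Hypothetical collection and fields, just to show the shape of the request.
url = facet_query_url("http://localhost:8983/solr", "products",
                      "screen_size:[10 TO *]", ["manufacturer", "category"])
```

Sending this URL to a running Solr would return the matching docs plus a count per `manufacturer` and `category` value for guided navigation.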
Lucene: the Ace in the Hole
• Vector Space or Probabilistic, it’s your choice!
• Killer FST
• Wicked fast
• Pluggable compression, queries, indexing and more
• Advanced Similarity Models
• Lang. Modeling, Divergence from Random, more
• Easy to plug-in ranking
Spark Key Features
• General purpose, high powered cluster computing system
• Modern, faster alternative to MapReduce
• 3x faster w/ 10x less hardware for Terasort
• Great for iterative algorithms
• APIs for Java, Scala, Python and R
• Rich set of add-on libraries for machine learning, graph processing, integrations with SQL and other systems
• Deploys: Standalone, Hadoop YARN, Mesos
Why do data engineering with Solr and Spark?
Solr:
• Data exploration and visualization
• Easy ingestion and feature selection
• Powerful ranking features
• Quick and dirty classification and clustering
• Simple operation and scaling
• Stats and math built in
Spark:
• Advanced machine learning: MLlib, Mahout, Deeplearning4j
• Fast, large-scale iterative algorithms
• General purpose batch/streaming compute engine
• Whole collection analysis!
• Lots of integrations with other big data systems
Why Spark for Solr?
• Spark-shell!
• Build the index in parallel very, very quickly!
• Aggregations
• Boosts, stats, iterative computations
• Offline compute to update index with additional info (e.g. PageRank, popularity)
• Whole corpus analytics and ML: clustering, classification, CF, Rankers
• Joins with other storage (Cassandra, HDFS, DB, HBase)
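The "offline compute to update the index" bullet above can be sketched in miniature: aggregate raw signals into a score per document, then express the scores as Solr atomic updates (the `{"set": ...}` syntax). In practice the aggregation would run in Spark and the updates would be sent back to Solr; the `popularity_i` field name and the toy click log here are hypothetical.

```python
from collections import Counter

def popularity_updates(click_log):
    """Aggregate raw click signals into per-document popularity scores,
    expressed as Solr atomic-update documents ({'set': ...} replaces the
    field value without reindexing the whole document)."""
    counts = Counter(click_log)
    return [{"id": doc_id, "popularity_i": {"set": n}}
            for doc_id, n in counts.items()]

# Toy signal data: each entry is one click on a document.
updates = popularity_updates(["doc1", "doc2", "doc1", "doc1"])
```

These update documents would then be posted to Solr's update handler, and the new `popularity_i` values become available as a boost factor at query time.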
Why Solr for Spark?
• Massive simplification of operations!
• World-class multilingual text analytics:
• No more: tokens = str.toLowerCase().split("\\s+")
• Non “dumb” distributed, resilient storage
• Random access with smart queries
• Table scans
• Advanced filtering, feature selection
• Schemaless when you want, predefined when you don’t
• Spatial, columnar, sparse
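The `toLowerCase().split()` jab above is easy to demonstrate: naive whitespace splitting leaves punctuation glued to tokens, which is exactly what a real analyzer chain fixes. A small Python illustration (a regex is still nowhere near Lucene's multilingual analysis, which also handles stemming, stopwords, CJK segmentation, and more):

```python
import re

def naive_tokens(s):
    # The "dumb" approach called out above: lowercase and split on whitespace.
    return s.lower().split()

def better_tokens(s):
    # Slightly smarter: keep only runs of word characters.
    # Real analyzers (e.g. Lucene's) go far beyond a one-line regex.
    return re.findall(r"\w+", s.lower())

text = "Solr, Spark & data-science!"
naive_tokens(text)   # ['solr,', 'spark', '&', 'data-science!']
better_tokens(text)  # ['solr', 'spark', 'data', 'science']
```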
Spark + Solr in Anger
http://github.com/lucidworks/spark-solr
Map<String, String> options = new HashMap<String, String>();
options.put("zkhost", zkHost);
options.put("collection", "tweets");
DataFrame df = sqlContext.read().format("solr").options(options).load();
long count = df.filter(df.col("type_s").equalTo("echo")).count();
Demo
Spark Shell: run k-Means, LDA, Word2Vec, Random Forest, index clusters
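To show what the k-means part of the demo is doing conceptually, here is a plain-Python sketch of the algorithm on toy 2D points. Spark's MLlib runs the same assign/update loop in parallel over partitions of a distributed data set; this standalone version (with made-up data) is just for intuition.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its assigned points, and repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assignment step: nearest centroid by squared Euclidean distance.
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        for c, members in enumerate(clusters):
            # Update step: move centroid to the mean of its members.
            if members:
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return centroids

# Two well-separated toy clusters, near (0, 0) and (5, 5).
pts = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
       (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids = kmeans(pts, k=2)
```

In the actual demo, the cluster assignments would then be indexed back into Solr as a field, so clusters become facetable and searchable.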
Resources
• This code: published in a few weeks
• Company: http://www.lucidworks.com
• Our blog: http://www.lucidworks.com/blog
• Book: http://www.manning.com/ingersoll
• Solr: http://lucene.apache.org/solr
• Fusion: http://www.lucidworks.com/products/fusion
• Twitter: @gsingers
Appendix A: Deeper Solr Details
• Parallel Execution of SQL across SolrCloud
• Realtime Map-Reduce (“ish”) Functionality
• Parallel Relational Algebra
• Builds on streaming capabilities in 5.x
• JDBC client in the works
Just When You Thought SQL was Dead
Full, Parallelized, SQL Support
• Lots of Functions:
• Search, Merge, Group, Unique, Parallel, Select, Reduce, innerJoin, hashJoin, Top, Rollup, Facet, Stats, Update, JDBC, Intersect, Complement, Logit
• Composable Streams
• Query optimization built in
SQL Guts
Example
select str_s, count(*), sum(field_i), min(field_i), max(field_i), avg(field_i) from collection1 where text='XXXX' group by str_s

rollup(
  search(collection1, q="(text:XXXX)", qt="/export", fl="str_s,field_i", partitionKeys=str_s, sort="str_s asc", zkHost="localhost:9989/solr"),
  over=str_s,
  count(*), sum(field_i), min(field_i), max(field_i), avg(field_i))
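To see why the expression sorts and partitions on `str_s`: because tuples arrive sorted by the group key, a streaming rollup can emit each group's aggregates as soon as the key changes, in constant memory. Here is a pure-Python sketch of that idea (not Solr's implementation); the toy rows mirror the fields in the example above.

```python
def rollup(sorted_tuples, over, field):
    """Streaming rollup over tuples pre-sorted by the group key `over`,
    aggregating the numeric field `field` one group at a time."""
    out, key, vals = [], None, []

    def emit():
        out.append({over: key, "count(*)": len(vals),
                    "sum": sum(vals), "min": min(vals),
                    "max": max(vals), "avg": sum(vals) / len(vals)})

    for t in sorted_tuples:
        if t[over] != key:
            if vals:
                emit()           # key changed: flush the finished group
            key, vals = t[over], []
        vals.append(t[field])
    if vals:
        emit()                   # flush the final group
    return out

rows = [{"str_s": "a", "field_i": 1}, {"str_s": "a", "field_i": 3},
        {"str_s": "b", "field_i": 5}]
result = rollup(rows, over="str_s", field="field_i")
```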
Make Connections, Get Better Results
• Graph Traversal
  • Find all tweets mentioning “Solr” by me or people I follow
  • Find all draft blog posts about “Parallel SQL” written by a developer
  • Find 3-star hotels in NYC my friends stayed in last year
• BM25F Default Similarity
• Geo3D search
Streaming API & Expressions
• API
  • Java API to provide a programming framework
  • Returns tuples as a JSON stream
  • org.apache.solr.client.solrj.io
• Expressions
  • String query language
  • Serialization format
  • Allows non-Java programmers to access the Streaming API
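The composability of expressions (streams wrapping streams, e.g. `unique(search(...))`) can be mimicked with Python generators. This is an in-memory toy analogue, not actual Solr streams: `search` is a source stream and `unique` decorates it, and tuples flow lazily through the pipeline.

```python
def search(docs, fl):
    # Source stream: yields tuples containing only the requested fields,
    # like a search(...) expression with an fl parameter.
    for d in docs:
        yield {k: d[k] for k in fl}

def unique(stream, over):
    # Decorator stream: wraps another stream, emitting only the first
    # tuple seen for each value of the `over` key.
    seen = set()
    for t in stream:
        if t[over] not in seen:
            seen.add(t[over])
            yield t

# Toy in-memory "collection".
docs = [{"id": "1", "cat": "a"}, {"id": "2", "cat": "a"}, {"id": "3", "cat": "b"}]
result = list(unique(search(docs, ["id", "cat"]), over="cat"))
```

The nesting in code mirrors the nesting in the expression string, which is what makes the expression language easy to translate into an executing pipeline.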
DocValues must be enabled for any field to be returned
Streaming Expression Request
curl --data-urlencode 'stream=search(sample, q="*:*", fl="id,field_i", sort="field_i asc")' http://localhost:8901/solr/sample/stream
Streaming Expression Response
{"responseHeader": {"status": 0, "QTime": 1},
 "tuples": {
   "numFound": -1,
   "start": -1,
   "docs": [
     {"id": "doc1", "field_i": 1},
     {"id": "doc2", "field_i": 2},
     {"EOF": true}]}}
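A client consuming this stream has to treat the `{"EOF": true}` tuple as an end-of-stream sentinel rather than a document. A minimal Python sketch of that parsing, using the response above as input:

```python
import json

def read_tuples(response_body):
    """Parse a streaming-expression response, stopping at the EOF tuple
    that marks the end of the stream."""
    docs = []
    for t in json.loads(response_body)["tuples"]["docs"]:
        if t.get("EOF"):
            break  # sentinel tuple: end of stream, not a real document
        docs.append(t)
    return docs

body = '''{"responseHeader": {"status": 0, "QTime": 1},
 "tuples": {"numFound": -1, "start": -1,
  "docs": [{"id": "doc1", "field_i": 1},
           {"id": "doc2", "field_i": 2},
           {"EOF": true}]}}'''
tuples = read_tuples(body)
```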
Architecture
• MapReduce-ish
  • Borrows the shuffling concept from M/R
• Logical tiers for performing the query
  • SQL tier: translates SQL to streaming expressions for a parallel query plan, selects worker nodes, merges results
  • Worker tier: executes the parallel query plan, streams tuples from data tables back
  • Data Table tier: queries SolrCloud collections, performs initial sort and partitioning of results for worker nodes
JDBC Client
• Parallel SQL includes a “thin” JDBC client
• Expanded to include SQL clients such as DbVisualizer (SOLR-8502)
• The client only works with Parallel SQL features
Learning More
Joel Bernstein’s presentation at Lucene Revolution:
• https://www.youtube.com/watch?v=baWQfHWozXc
Apache Solr Reference Guide:
• https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
• https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface
Spark Architecture
[Architecture diagram] A Spark Master daemon keeps track of live workers, schedules tasks, restarts failed tasks, and serves a web UI on port 8080. Losing the master prevents new applications from being executed; HA can be achieved using ZooKeeper and multiple master nodes. Each Spark Worker Node (1...N of these) runs a slave daemon; a Spark Executor runs as a separate JVM process from the slave daemon and executes tasks, each task applying a transformation or action to some partition of a data set, with a cache for intermediate data. The application (my-spark-job.jar, with shaded deps) creates a SparkContext, the driver, which maintains the RDD graph, DAG scheduler, block tracker, and shuffle tracker. When selecting which node to execute a task on, the master takes data locality into account.