TRANSCRIPT
Data Science with Solr and Spark
Grant Ingersoll
@gsingers
CTO, Lucidworks
Get Started
ASF Mailing Lists, lucidworks.com, docs.lucidworks.com
Solr for Data Science
http://www.slideshare.net/gsingers/solr-for-data-science
Fusion for Data Science
https://www.youtube.com/watch?v=oSs5DNTVxv4
Lucidworks Fusion Is Search-Driven Everything
• Drive next generation relevance via Content, Collaboration and Context
• Harness best in class Open Source: Apache Solr + Spark
• Simplify application development and reduce ongoing maintenance
Fusion is built on three core principles:
Fusion Architecture
[Architecture diagram] A REST API fronts the stack. Apache Spark provides the compute layer (cluster manager plus workers); Apache Solr provides storage and search (shards), with HDFS as optional storage underneath. An Apache ZooKeeper ensemble (ZK 1 … ZK N) handles shared config management, leader election, and load balancing. Connectors pull in data from database, web, file, log, Hadoop, and cloud sources. Core services include alerting/messaging, NLP, pipelines, blob storage, scheduling, recommenders/signals, and more, surfaced through an Admin UI, with security built in.
Lucidworks View
Let’s Do This
• Why Search for Data Engineering?
• Quick intro to Solr
• Quick intro to Spark
• Solr + Spark
• Demo!
The Importance of Importance
Search-Driven Everything
Customer Service
Customer Insights
Fraud Surveillance
Research Portal
Online Retail
Digital Content
Solr Key Features
• Full text search (Information Retrieval)
• SQL and JDBC! (v6)
• Facets/Guided Nav galore
• Lots of data types
• Spelling, auto-complete, highlighting
• Cursors
• More Like This
• De-duplication
• Graph Traversal
• Grouping and Joins
• Stats, expressions, transformations and more
• Lang. Detection
• Extensible
• Massive Scale/Fault tolerance
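As a concrete taste of the facet feature above: a faceted request is just a normal Solr query with facet parameters added. The Python sketch below only builds such a request URL; the `products` collection and the `manufacturer`/`category` field names are made up for illustration.

```python
from urllib.parse import urlencode

def facet_query_url(base, collection, q, facet_fields, rows=10):
    """Build a Solr /select URL with field faceting enabled.

    Uses the standard Solr parameters q, rows, facet, and facet.field.
    """
    params = [("q", q), ("rows", rows), ("facet", "true")]
    for f in facet_fields:
        params.append(("facet.field", f))
    return "%s/%s/select?%s" % (base, collection, urlencode(params))

# Hypothetical collection and fields, just to show the shape of the request.
url = facet_query_url("http://localhost:8983/solr", "products",
                      "screen_size:[10 TO *]", ["manufacturer", "category"])
```

Sending this URL to a running Solr would return the matching docs plus a count per `manufacturer` and `category` value for guided navigation.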
Lucene: the Ace in the Hole
• Vector Space or Probabilistic, it’s your choice!
• Killer FST
• Wicked fast
• Pluggable compression, queries, indexing and more
• Advanced Similarity Models
• Lang. Modeling, Divergence from Random, more
• Easy to plug-in ranking
Spark Key Features
• General purpose, high powered cluster computing system
• Modern, faster alternative to MapReduce
• 3x faster w/ 10x less hardware for Terasort
• Great for iterative algorithms
• APIs for Java, Scala, Python and R
• Rich set of add-on libraries for machine learning, graph processing, integrations with SQL and other systems
• Deploys: Standalone, Hadoop YARN, Mesos
Why do data engineering with Solr and Spark?
Solr:
• Data exploration and visualization
• Easy ingestion and feature selection
• Powerful ranking features
• Quick and dirty classification and clustering
• Simple operation and scaling
• Stats and math built in
Spark:
• Advanced machine learning: MLlib, Mahout, Deeplearning4j
• Fast, large-scale iterative algorithms
• General purpose batch/streaming compute engine
• Whole collection analysis!
• Lots of integrations with other big data systems
Why Spark for Solr?
• Spark-shell!
• Build the index in parallel very, very quickly!
• Aggregations
• Boosts, stats, iterative computations
• Offline compute to update index with additional info (e.g. PageRank, popularity)
• Whole corpus analytics and ML: clustering, classification, CF, Rankers
• Joins with other storage (Cassandra, HDFS, DB, HBase)
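The "offline compute to update the index" bullet above can be sketched in miniature: aggregate raw signals into a score per document, then express the scores as Solr atomic updates (the `{"set": ...}` syntax). In practice the aggregation would run in Spark and the updates would be sent back to Solr; the `popularity_i` field name and the toy click log here are hypothetical.

```python
from collections import Counter

def popularity_updates(click_log):
    """Aggregate raw click signals into per-document popularity scores,
    expressed as Solr atomic-update documents ({'set': ...} replaces the
    field value without reindexing the whole document)."""
    counts = Counter(click_log)
    return [{"id": doc_id, "popularity_i": {"set": n}}
            for doc_id, n in counts.items()]

# Toy signal data: each entry is one click on a document.
updates = popularity_updates(["doc1", "doc2", "doc1", "doc1"])
```

These update documents would then be posted to Solr's update handler, and the new `popularity_i` values become available as a boost factor at query time.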
Why Solr for Spark?
• Massive simplification of operations!
• World-class multilingual text analytics:
• No more: tokens = str.toLowerCase().split("\\s+")
• Non “dumb” distributed, resilient storage
• Random access with smart queries
• Table scans
• Advanced filtering, feature selection
• Schemaless when you want, predefined when you don’t
• Spatial, columnar, sparse
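The `toLowerCase().split()` jab above is easy to demonstrate: naive whitespace splitting leaves punctuation glued to tokens, which is exactly what a real analyzer chain fixes. A small Python illustration (a regex is still nowhere near Lucene's multilingual analysis, which also handles stemming, stopwords, CJK segmentation, and more):

```python
import re

def naive_tokens(s):
    # The "dumb" approach called out above: lowercase and split on whitespace.
    return s.lower().split()

def better_tokens(s):
    # Slightly smarter: keep only runs of word characters.
    # Real analyzers (e.g. Lucene's) go far beyond a one-line regex.
    return re.findall(r"\w+", s.lower())

text = "Solr, Spark & data-science!"
naive_tokens(text)   # ['solr,', 'spark', '&', 'data-science!']
better_tokens(text)  # ['solr', 'spark', 'data', 'science']
```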
Spark + Solr in Anger
http://github.com/lucidworks/spark-solr
Map<String, String> options = new HashMap<String, String>();
options.put("zkhost", zkHost);
options.put("collection", "tweets");
DataFrame df = sqlContext.read().format("solr").options(options).load();
long count = df.filter(df.col("type_s").equalTo("echo")).count();
Demo
Spark Shell: run k-Means, LDA, Word2Vec, Random Forest, index clusters
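To show what the k-means part of the demo is doing conceptually, here is a plain-Python sketch of the algorithm on toy 2D points. Spark's MLlib runs the same assign/update loop in parallel over partitions of a distributed data set; this standalone version (with made-up data) is just for intuition.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its assigned points, and repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assignment step: nearest centroid by squared Euclidean distance.
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        for c, members in enumerate(clusters):
            # Update step: move centroid to the mean of its members.
            if members:
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return centroids

# Two well-separated toy clusters, near (0, 0) and (5, 5).
pts = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
       (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids = kmeans(pts, k=2)
```

In the actual demo, the cluster assignments would then be indexed back into Solr as a field, so clusters become facetable and searchable.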
Resources
• This code: published in a few weeks
• Company: http://www.lucidworks.com
• Our blog: http://www.lucidworks.com/blog
• Book: http://www.manning.com/ingersoll
• Solr: http://lucene.apache.org/solr
• Fusion: http://www.lucidworks.com/products/fusion
• Twitter: @gsingers
Appendix A: Deeper Solr Details
• Parallel Execution of SQL across SolrCloud
• Realtime Map-Reduce (“ish”) Functionality
• Parallel Relational Algebra
• Builds on streaming capabilities in 5.x
• JDBC client in the works
Just When You Thought SQL was Dead
Full, Parallelized, SQL Support
• Lots of Functions:
• Search, Merge, Group, Unique, Parallel, Select, Reduce, innerJoin, hashJoin, Top, Rollup, Facet, Stats, Update, JDBC, Intersect, Complement, Logit
• Composable Streams
• Query optimization built in
SQL Guts
Example
select str_s, count(*), sum(field_i), min(field_i), max(field_i), avg(field_i) from collection1 where text='XXXX' group by str_s

rollup(
  search(collection1, q="(text:XXXX)", qt="/export", fl="str_s,field_i", partitionKeys=str_s, sort="str_s asc", zkHost="localhost:9989/solr"),
  over=str_s,
  count(*), sum(field_i), min(field_i), max(field_i), avg(field_i))
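To see why the expression sorts and partitions on `str_s`: because tuples arrive sorted by the group key, a streaming rollup can emit each group's aggregates as soon as the key changes, in constant memory. Here is a pure-Python sketch of that idea (not Solr's implementation); the toy rows mirror the fields in the example above.

```python
def rollup(sorted_tuples, over, field):
    """Streaming rollup over tuples pre-sorted by the group key `over`,
    aggregating the numeric field `field` one group at a time."""
    out, key, vals = [], None, []

    def emit():
        out.append({over: key, "count(*)": len(vals),
                    "sum": sum(vals), "min": min(vals),
                    "max": max(vals), "avg": sum(vals) / len(vals)})

    for t in sorted_tuples:
        if t[over] != key:
            if vals:
                emit()           # key changed: flush the finished group
            key, vals = t[over], []
        vals.append(t[field])
    if vals:
        emit()                   # flush the final group
    return out

rows = [{"str_s": "a", "field_i": 1}, {"str_s": "a", "field_i": 3},
        {"str_s": "b", "field_i": 5}]
result = rollup(rows, over="str_s", field="field_i")
```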
Make Connections, Get Better Results
• Graph Traversal
  • Find all tweets mentioning “Solr” by me or people I follow
  • Find all draft blog posts about “Parallel SQL” written by a developer
  • Find 3-star hotels in NYC my friends stayed in last year
• BM25F Default Similarity
• Geo3D search
Streaming API & Expressions
• API
  • Java API to provide a programming framework
  • Returns tuples as a JSON stream
  • org.apache.solr.client.solrj.io
• Expressions
  • String query language
  • Serialization format
  • Allows non-Java programmers to access the Streaming API
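The composability of expressions (streams wrapping streams, e.g. `unique(search(...))`) can be mimicked with Python generators. This is an in-memory toy analogue, not actual Solr streams: `search` is a source stream and `unique` decorates it, and tuples flow lazily through the pipeline.

```python
def search(docs, fl):
    # Source stream: yields tuples containing only the requested fields,
    # like a search(...) expression with an fl parameter.
    for d in docs:
        yield {k: d[k] for k in fl}

def unique(stream, over):
    # Decorator stream: wraps another stream, emitting only the first
    # tuple seen for each value of the `over` key.
    seen = set()
    for t in stream:
        if t[over] not in seen:
            seen.add(t[over])
            yield t

# Toy in-memory "collection".
docs = [{"id": "1", "cat": "a"}, {"id": "2", "cat": "a"}, {"id": "3", "cat": "b"}]
result = list(unique(search(docs, ["id", "cat"]), over="cat"))
```

The nesting in code mirrors the nesting in the expression string, which is what makes the expression language easy to translate into an executing pipeline.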
DocValues must be enabled for any field to be returned
Streaming Expression Request
curl --data-urlencode 'stream=search(sample, q="*:*", fl="id,field_i", sort="field_i asc")' http://localhost:8901/solr/sample/stream
Streaming Expression Response
{"responseHeader": {"status": 0, "QTime": 1},
 "tuples": {
   "numFound": -1,
   "start": -1,
   "docs": [
     {"id": "doc1", "field_i": 1},
     {"id": "doc2", "field_i": 2},
     {"EOF": true}]}}
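A client consuming this stream has to treat the `{"EOF": true}` tuple as an end-of-stream sentinel rather than a document. A minimal Python sketch of that parsing, using the response above as input:

```python
import json

def read_tuples(response_body):
    """Parse a streaming-expression response, stopping at the EOF tuple
    that marks the end of the stream."""
    docs = []
    for t in json.loads(response_body)["tuples"]["docs"]:
        if t.get("EOF"):
            break  # sentinel tuple: end of stream, not a real document
        docs.append(t)
    return docs

body = '''{"responseHeader": {"status": 0, "QTime": 1},
 "tuples": {"numFound": -1, "start": -1,
  "docs": [{"id": "doc1", "field_i": 1},
           {"id": "doc2", "field_i": 2},
           {"EOF": true}]}}'''
tuples = read_tuples(body)
```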
Architecture
• MapReduce-ish
  • Borrows the shuffling concept from M/R
• Logical tiers for performing the query
  • SQL tier: translates SQL to streaming expressions for a parallel query plan, selects worker nodes, merges results
  • Worker tier: executes the parallel query plan, streams tuples from data tables back
  • Data Table tier: queries SolrCloud collections, performs initial sort and partitioning of results for worker nodes
JDBC Client
• Parallel SQL includes a “thin” JDBC client
• Expanded to include SQL clients such as DbVisualizer (SOLR-8502)
• The client only works with Parallel SQL features
Learning More
Joel Bernstein’s presentation at Lucene Revolution:
• https://www.youtube.com/watch?v=baWQfHWozXc
Apache Solr Reference Guide:
• https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
• https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface
Spark Architecture
[Architecture diagram] A Spark Master daemon keeps track of live workers, schedules tasks, restarts failed tasks, and serves a web UI on port 8080. Losing the master prevents new applications from being executed; HA can be achieved using ZooKeeper and multiple master nodes. Each Spark Worker Node (1...N of these) runs a slave daemon; a Spark Executor runs as a separate JVM process from the slave daemon and executes tasks, each task applying a transformation or action to some partition of a data set, with a cache for intermediate data. The application (my-spark-job.jar, with shaded deps) creates a SparkContext, the driver, which maintains the RDD graph, DAG scheduler, block tracker, and shuffle tracker. When selecting which node to execute a task on, the master takes data locality into account.