solr jdbc - lucene/solr revolution 2016
TRANSCRIPT
O C T O B E R 1 1 - 1 4 , 2 0 1 6 • B O S T O N , M A
Solr JDBCKevin Risden
Apache Lucene/Solr Committer; Avalon Consulting, LLC
3
03About me
• Consultant with Avalon Consulting, LLC• ~4 years working with Hadoop and Search• Contributed patches to Ambari, HBase, Knox, Solr, Storm• Installation, security, performance tuning, development,
administration
• Kevin Risden• Apache Lucene/Solr Committer• YCSB Contributor
4
03Overview• Background• Use Case• Solr JDBC• Demo• Future Development/Improvements
5
01Background - What is JDBC?
The JDBC API is a Java API that can access any kind of tabular data, especially data stored in a Relational Database.
Source: https://docs.oracle.com/javase/tutorial/jdbc/overview/
JDBC drivers convert SQL into a backend query.
6
01Background - Why should you care about Solr JDBC?
• SQL skills are prolific.• JDBC drivers exist for most relational databases.• Existing reporting tools work with JDBC/ODBC
drivers.
Solr 6 works with SQL and existing JDBC tools!
7
01Use Case – Analytics – Utility RatesData set: 2011 Utility RatesQuestions:• How many utility companies serve the state of Maryland?• Which Maryland utility has the cheapest residential rates?• What are the minimum and maximum residential power rates excluding missing data
elements?• What is the state and zip code with the highest residential rate?
How could you answer those questions with Solr?
Inspired By: http://blog.cloudera.com/blog/2015/10/how-to-use-apache-solr-to-query-indexed-data-for-analytics/
• Facets• Filter Queries• Filters• Grouping
• Sorting• Stats• String queries
together
8
01Use Case – Analytics – Utility Rates
Inspired By: http://blog.cloudera.com/blog/2015/10/how-to-use-apache-solr-to-query-indexed-data-for-analytics/
Method: Lucene syntaxQuestions:• How many utility companies serve the state of Maryland?
http://solr:8983/solr/rates/select?q=state%3A%22MD%22&wt=json&indent=true&group=true&group.field=utility_name&rows=10&group.limit=1
• Which Maryland utility has the cheapest residential rates?http://solr:8983/solr/rates/select?q=state%3A%22MD%22&wt=json&indent=true&group=true&group.field=utility_name&rows=1&group.limit=1&sort=res_rate+asc
• What are the minimum and maximum residential power rates excluding missing data elements?http://solr:8983/solr/rates/select?q=*:*&fq=%7b!frange+l%3D0.0+incl%3Dfalse%7dres_rate&wt=json&indent=true&rows=0&stats=true&stats.field=res_rate
• What is the state and zip code with the highest residential rate?http://solr:8983/solr/rates/select?q=res_rate:0.849872773537&wt=json&indent=true&rows=1Is there a better way?
9
01Solr JDBCHighlights• JDBC Driver for Solr• Powered by Streaming Expressions and Parallel SQL
• Thursday - Parallel SQL and Analytics with Solr – Yonik Seeley• Thursday - Creating New Streaming Expressions – Dennis
Gove• Integrates with any* JDBC client * tested with the JDBC clients in this presentation
Usagejdbc:solr://SOLR_ZK_CONNECTION_STRING?collection=COLLECTION_NAME
Apache Solr Reference Guide - Parallel SQL Interface
10
01Solr JDBC - Architecture
11
01DemoProgramming Languages• Java• Python/Jython• R• Apache Spark
Web• Apache Zeppelin• RStudio
GUI – JDBC• DbVisualizer• SQuirreL SQL
GUI – ODBC• Microsoft Excel• Tableau*
https://github.com/risdenk/solrj-jdbc-testing
12
01Demo – Javaimport org.slf4j.Logger;import org.slf4j.LoggerFactory;import java.sql.*;public class SolrJJDBCTestingJava { private static final Logger LOGGER = LoggerFactory.getLogger(SolrJJDBCTestingJava.class); public static void main(String[] args) throws Exception { String sql = args[0]; try (Connection con = DriverManager.getConnection("jdbc:solr://solr:9983?collection=test")) { try (Statement stmt = con.createStatement()) { try (ResultSet rs = stmt.executeQuery(sql)) { ResultSetMetaData rsMetaData = rs.getMetaData(); int columns = rsMetaData.getColumnCount(); StringBuilder header = new StringBuilder(); for(int i = 1; i < columns + 1; i++) { header.append(rsMetaData.getColumnLabel(i)).append(","); } LOGGER.info(header.toString()); while (rs.next()) { StringBuilder row = new StringBuilder(); for(int i = 1; i < columns + 1; i++) { row.append(rs.getObject(i)).append(","); } LOGGER.info(row.toString()); } } } } }} Apache Solr Reference Guide - Generic
13
01Demo – Python#!/usr/bin/env python# https://pypi.python.org/pypi/JayDeBeApi/import jaydebeapiimport sysif __name__ == '__main__': jdbc_url = "jdbc:solr://solr:9983?collection=test” driverName = "org.apache.solr.client.solrj.io.sql.DriverImpl” statement = "select fielda, fieldb, fieldc, fieldd_s, fielde_i from test limit 10” conn = jaydebeapi.connect(driverName, jdbc_url) curs = conn.cursor() curs.execute(statement) print(curs.fetchall()) conn.close()
Apache Solr Reference Guide - Python/Jython
14
01Demo – Jython#!/usr/bin/env jython# http://www.jython.org/jythonbook/en/1.0/DatabasesAndJython.html# https://wiki.python.org/jython/DatabaseExamples#SQLite_using_JDBCimport sys from java.langimport Class from java.sqlimport DriverManager, SQLExceptionif __name__ == '__main__': jdbc_url = "jdbc:solr://solr:9983?collection=test” driverName = "org.apache.solr.client.solrj.io.sql.DriverImpl” statement = "select fielda, fieldb, fieldc, fieldd_s, fielde_i from test limit 10” dbConn = DriverManager.getConnection(jdbc_url) stmt = dbConn.createStatement() resultSet = stmt.executeQuery(statement) while resultSet.next(): print(resultSet.getString("fielda")) resultSet.close() stmt.close() dbConn.close() Apache Solr Reference Guide - Python/Jython
15
01Demo – R# https://www.rforge.net/RJDBC/library("RJDBC")solrCP <- c(list.files('/opt/solr/dist/solrj-lib', full.names=TRUE), list.files('/opt/solr/dist', pattern='solrj', full.names=TRUE, recursive = TRUE))drv <- JDBC("org.apache.solr.client.solrj.io.sql.DriverImpl", solrCP, identifier.quote="`")conn <- dbConnect(drv, "jdbc:solr://solr:9983?collection=test", "user", "pwd")dbGetQuery(conn, "select fielda, fieldb, fieldc, fieldd_s, fielde_i from test limit 10")dbDisconnect(conn)
Apache Solr Reference Guide - R
16
01Demo – Apache Zeppelin
Apache Solr Reference Guide - Apache Zeppelin
17
01Demo – RStudio
18
01Demo – DbVisualizer
Apache Solr Reference Guide - DbVisualizer
19
01Demo – SQuirreL SQL
Apache Solr Reference Guide - SQuirreL SQL
20
01Demo – Microsoft Excel
21
01Use Case – Analytics – Utility Rates
Inspired By: http://blog.cloudera.com/blog/2015/10/how-to-use-apache-solr-to-query-indexed-data-for-analytics/
Method: Lucene syntaxQuestions:• How many utility companies serve the state of Maryland?
http://solr:8983/solr/rates/select?q=state%3A%22MD%22&wt=json&indent=true&group=true&group.field=utility_name&rows=10&group.limit=1
• Which Maryland utility has the cheapest residential rates?http://solr:8983/solr/rates/select?q=state%3A%22MD%22&wt=json&indent=true&group=true&group.field=utility_name&rows=1&group.limit=1&sort=res_rate+asc
• What are the minimum and maximum residential power rates excluding missing data elements?http://solr:8983/solr/rates/select?q=*:*&fq=%7b!frange+l%3D0.0+incl%3Dfalse%7dres_rate&wt=json&indent=true&rows=0&stats=true&stats.field=res_rate
• What is the state and zip code with the highest residential rate?http://solr:8983/solr/rates/select?q=res_rate:0.849872773537&wt=json&indent=true&rows=1Is there a better way?
22
01Use Case – Analytics – Utility RatesMethod: SQLQuestions:• How many utility companies serve the state of Maryland?
select distinct utility_name from rates where state='MD';• Which Maryland utility has the cheapest residential rates?
select utility_name,min(res_rate) from rates where state='MD' group by utility_name order by min(res_rate) asc limit 1;
• What are the minimum and maximum residential power rates excluding missing data elements?
select min(res_rate),max(res_rate) from rates where not res_rate = 0;• What is the state and zip code with the highest residential rate?
select state,zip,max(res_rate) from rates group by state,zip order by max(res_rate) desc limit 1;
How should you answer those questions with Solr? – Using SQL!Inspired By: http://blog.cloudera.com/blog/2015/10/how-to-use-apache-solr-to-query-indexed-data-for-analytics/
23
01Use Case – Analytics – Utility Rates
How should you answer those questions with Solr? – Using SQL!Inspired By: http://blog.cloudera.com/blog/2015/10/how-to-use-apache-solr-to-query-indexed-data-for-analytics/
24
01Future Development/Improvements• Replace Presto with Apache Calcite - SOLR-8593• Improve SQL compatibility• Ability to specify optimization rules (push downs, joins,
etc)• Potentially use Avatica JDBC/ODBC drivers
• Streaming Expressions/Parallel SQL improvements - SOLR-8125
• JDBC driver improvements - SOLR-8659Info on how to get involved
25
01Future Development/ImprovementsSQL Join
Info on how to get involved
SELECT movie_title,character_name,lineFROM movie_dialogs_movie_titles_metadata aJOIN movie_dialogs_movie_lines bONa.movieID=b.movieID;
select( innerJoin( search(movie_dialogs_movie_titles_metadata, q=*:*, fl="movieID,movie_title", sort="movieID asc"), search(movie_dialogs_movie_lines, q=*:*, fl="movieID,character_name,line", sort="movieID asc"), on="movieID” ), movie_title,character_name,line)
Streaming Expression Join
26
01Questions?