Apache Drill Use Cases & Roadmap - Hadoop Meetup NYC - September 28, 2015
TRANSCRIPT
© 2015 Dremio Corporation
Drill Use Cases & Roadmap
Hadoop Meetup NYC
September 28, 2015
Agenda
• Drill Background
• Common Use Cases
• Roadmap
About Dremio
• Currently in Stealth
• Founded in June 2015
• Building on open source technologies including Drill, Parquet, Spark

Jacques Nadeau, Founder & CTO
• Apache Drill PMC Chair
• Recognized SQL & NoSQL expert
• Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)

Tomer Shiran, Founder & CEO
• Apache Drill Founder
• MapR (VP Product); Microsoft; IBM Research
• Carnegie Mellon, Technion

Julien Le Dem, Architect
• Apache Parquet Founder
• Apache Pig PMC Member
• Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)

Top Silicon Valley VCs
Apache Drill
• Apache Foundation Project
• More than 40 contributors from 10 companies
• Open source SQL query engine for relational & non-relational datastores
• Designed for Interactive Queries
• Scales from one laptop to 1000s of servers
Drill Integrates With What You Have
Any Datastore (Relational or Not)
• File systems
– Traditional: Local files and NAS
– Hadoop: HDFS and MapR-FS
– Cloud storage: Amazon S3, Google Cloud Storage, Azure Blob Storage
• NoSQL databases
– MongoDB
– HBase
– MapR-DB
– Hive
• And you can add new datastores:

Any Client
• Multiple interfaces: ODBC, JDBC, REST, C, Java
• BI tools
– Tableau, Qlik, MicroStrategy, TIBCO Spotfire
• Excel
• Command line (Drill shell)
• Web and mobile apps via REST API
– Many JSON-powered chart libraries (see D3.js)
• SAS, R, …
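
The REST route listed above can be sketched in Python. This is a minimal sketch, not a reference client: it assumes a Drillbit web server on `localhost:8047` and uses the `cp.`employee.json`` sample table as a placeholder query. The helper name `build_drill_request` is hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request

# Assumed endpoint: a Drillbit web server on port 8047 of the local machine.
DRILL_URL = "http://localhost:8047/query.json"


def build_drill_request(sql, url=DRILL_URL):
    """Build (but do not send) a POST request for Drill's REST query endpoint."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder query against a sample file on Drill's classpath plugin.
request = build_drill_request("SELECT full_name FROM cp.`employee.json` LIMIT 3")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To run it against a live Drillbit (not executed here):
# with urllib.request.urlopen(request) as resp:
#     rows = json.load(resp)["rows"]
```

Because the response is plain JSON, a web or mobile front end can hand the rows directly to a charting library.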
Drill is Built for Modern Analytical Organizations
Execute Fast
• Standard SQL
• Read data fast
• Leverage columnar encodings and execution
• Execute operations quickly
• Scale out, not up

Iterate Fast
• Work without prep
• Decentralize data management
• In-situ security
• Explore + query
• Access multiple sources
• Avoid the ETL rinse cycle
JSON Model, Columnar Speed
[Diagram: data sources placed on two axes, schema-less vs. fixed schema and flat vs. complex. Labels: JSON, BSON, Mongo, HBase, NoSQL, Parquet, Avro, CSV, TSV]

RDBMS/SQL-on-Hadoop table:
Name      Gender  Age
Michael   M       6
Jennifer  F       3

Apache Drill table:
{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
Drill Provides the Best of Both Worlds
Acts Like a Database
• ANSI SQL: SELECT, FROM, WHERE, JOIN, HAVING, ORDER BY, WITH, CTAS, ALL, EXISTS, ANY, IN, SOME
• VarChar, Int, BigInt, Decimal, VarBinary, Timestamp, Float, Double, etc.
• Subqueries, scalar subqueries, partition pruning, CTE
• Data warehouse offload
• Tableau, ODBC, JDBC
• TPC-H & TPC-DS-like workloads
• Supports Hive SerDes
• Supports Hive UDFs
• Supports Hive Metastore

Even When Your Data Doesn’t
• Path based queries and wildcards
– select * from /my/logs/
– select * from /revenue/*/q2
• Modern data types
– Map, Array, Any
• Complex Functions and Relational Operators
– FLATTEN, kvgen, convert_from, convert_to, repeated_count, etc.
• JSON Sensor analytics
• Complex data analysis
• Alternative DSLs
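
To make the complex-data side concrete, here is a plain-Python emulation of what FLATTEN does to a repeated field: one output row per array element, with the other columns repeated. It illustrates the semantics only and is not Drill's implementation. The `flatten` helper is hypothetical, and the records are the two shown on the JSON model slide.

```python
def flatten(rows, column):
    """Emulate FLATTEN(column): emit one row per element of a list-valued field."""
    out = []
    for row in rows:
        for element in row.get(column, []):
            flat = dict(row)        # copy the other columns unchanged
            flat[column] = element  # replace the array with a single element
            out.append(flat)
    return out


people = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]

flat_rows = flatten(people, "hobbies")
for row in flat_rows:
    print(row["name"]["first"], row["hobbies"])
```

In Drill itself the same reshaping is requested with a FLATTEN call in the SELECT list, so no schema has to be declared for the differing fields (`district`, `preschool`).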
Data Lake, More Like Data Maelstrom
[Diagram labels: HBase & HDFS Cluster; HDFS Cluster; MongoDB Cluster (mongod); Cassandra Cluster; Clustered Servers; Desktop; Windows Desktop; Mac Desktop]
Run Drillbits Wherever; Whatever Your Data
[Diagram: the same environment with Drillbits placed alongside the HDFS, HBase, mongod, and Cassandra nodes and on the Windows and Mac desktops]
EXAMPLE USE CASES
Interactive Query for Hadoop
• Challenge:
– Pre-existing workflow with Hive as data warehouse
– Analysts frustrated with query completion times
• Solution:
– Partition cluster between interactive and batch (MR/Spark) workloads
– Install Drill on interactive nodes
– Utilize Drill’s Hive metastore integration to expose existing datasets at interactive speeds
[Diagram labels: ODBC & JDBC BI Tools; Drill JDBC; Drill ODBC; Hive Metastore]
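
The Hive metastore integration is configured as a Drill storage plugin. The sketch below builds such a configuration in Python; the metastore host, the plugin name `hive`, and the table name `orders` are assumptions for illustration, and a real cluster may need further `configProps` entries.

```python
import json

# Assumed metastore address; replace with the cluster's actual Thrift URI.
METASTORE_URI = "thrift://metastore.example.com:9083"

hive_plugin = {
    "type": "hive",
    "enabled": True,
    "configProps": {
        "hive.metastore.uris": METASTORE_URI,
    },
}

# JSON that could be pasted into the storage plugin page of Drill's web UI.
print(json.dumps(hive_plugin, indent=2))

# With the plugin registered under the (assumed) name "hive", existing Hive
# tables are addressed as hive.<table>; `orders` is a made-up table name.
sample_query = "SELECT COUNT(*) FROM hive.`orders`"
print(sample_query)
```

The point of the design is that table definitions stay in the metastore, so existing Hive datasets need no reload or redefinition.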
SQL for NoSQL
• Challenge:
– Data has been moved from Oracle to MongoDB
– Business Users unable to use Tableau
• Solution:
– Install Drill on each MongoDB Node
– Use Drill’s ODBC driver and powerful parallelization capabilities to provide interactive in-situ query capabilities
[Diagram label: Drill ODBC]
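
A sketch of the MongoDB side, assuming a Drillbit co-located with `mongod` on the default port and a plugin registered under the name `mongo`. The database `sales`, the collection `orders`, its fields, and the `mongo_table` helper are hypothetical.

```python
import json

# Storage plugin pointing at the mongod on the same node (assumed default port).
mongo_plugin = {
    "type": "mongo",
    "connection": "mongodb://localhost:27017/",
    "enabled": True,
}


def mongo_table(database, collection, plugin="mongo"):
    """Return the Drill table path for a collection: plugin.database.`collection`."""
    return f"{plugin}.{database}.`{collection}`"


# The kind of SQL a BI tool would send over ODBC; names are made up.
query = (
    "SELECT t.customer, t.total "
    f"FROM {mongo_table('sales', 'orders')} t "
    "WHERE t.total > 100"
)

print(json.dumps(mongo_plugin, indent=2))
print(query)
```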
Data Warehouse Offload
• Challenge:
– Capacity-based Data Warehouse license constraints
– Data inflow rate too high
– Need broader time horizons
• Solution:
– Export data from traditional data warehouse
– Load data into Hadoop
– Use Drill on top of Hadoop nodes
[Diagram labels: Existing Warehouse (Teradata, Vertica, Netezza); ODBC & JDBC BI Tools; Drill JDBC; Drill ODBC]
Cloud JSON & Sensor Analytics
• Problem:
– Sensors logging into S3
– Complex JSON: map keys have meaning and can change for every record
• Solution:
– Spin Up EC2 Analysis Cluster
– Set up number of S3 Workspaces in Drill
– Leverage Drill’s FLATTEN and KVGEN capabilities to access key-based data
– Expose Data via REST API to custom application
[Diagram labels: S3 (JSON files); EC2 Nodes; REST API; Custom Reporting Application]
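
The "map keys have meaning" problem is what KVGEN addresses: it turns a map with unpredictable keys into a list of key/value pairs, which FLATTEN then expands into rows. The Python below emulates that combination on made-up sensor records. It sketches the semantics under those assumptions and is not Drill code; `kvgen` and `flatten_readings` are hypothetical helpers.

```python
def kvgen(mapping):
    """Emulate KVGEN: turn a map into a list of {"key": ..., "value": ...} entries."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


def flatten_readings(record):
    """Emulate FLATTEN(KVGEN(readings)): one row per key/value pair."""
    return [
        {"device": record["device"], "sensor": kv["key"], "reading": kv["value"]}
        for kv in kvgen(record["readings"])
    ]


# Made-up sensor records: the keys under "readings" differ from record to record.
records = [
    {"device": "d1", "readings": {"temp_c": 21.5, "humidity": 40}},
    {"device": "d2", "readings": {"pressure_kpa": 101.3}},
]

rows = [row for rec in records for row in flatten_readings(rec)]
for row in rows:
    print(row)
```

After this reshaping the sensor name is an ordinary column value, so it can be filtered and grouped like any other field.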
Secure SQL for Everything
• Challenge:
– Data in MongoDB, Hadoop and RDBMS
– Provide a single endpoint for SQL-based access
– Ensure that different users and groups have different access
• Solution:
– Set up Drill leveraging chained security
– Use Drill views to expose row-level access control
– Leverage Drill User, Group and PAM integrations to control column filtering and masking
– Expose data utilizing JDBC, ODBC and REST APIs
[Diagram: chained views]
MaskedSales.view (owner: Cindy)
GrossSales.view (owner: dba)
RawSales.parquet (owner: dba)
Access labels: Frank file view perm; Cindy file view perm; dba delegated read
Other labels: Query by Frank; MySQL; Drill ODBC
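
One way to read the chained-security diagram: the querying user needs permission only on the top view, and each further hop is checked under the identity of the previous view's owner. The toy model below encodes that reading in Python. The `readers` sets and the `can_query` helper are assumptions made for illustration, not Drill's actual permission logic.

```python
# Each object records its owner, who may read the file itself, and what it reads from.
OBJECTS = {
    "MaskedSales.view": {
        "owner": "Cindy", "readers": {"Frank", "Cindy"}, "source": "GrossSales.view",
    },
    "GrossSales.view": {
        "owner": "dba", "readers": {"Cindy", "dba"}, "source": "RawSales.parquet",
    },
    "RawSales.parquet": {
        "owner": "dba", "readers": {"dba"}, "source": None,
    },
}


def can_query(user, name, objects=OBJECTS):
    """Walk the chain: each hop is checked as the owner of the previous view."""
    identity = user
    while name is not None:
        obj = objects[name]
        if identity not in obj["readers"]:
            return False
        identity = obj["owner"]  # the next hop runs as this view's owner
        name = obj["source"]
    return True


print(can_query("Frank", "MaskedSales.view"))  # allowed through the chain
print(can_query("Frank", "RawSales.parquet"))  # no direct access to raw data
```

Under this model Frank sees only what MaskedSales.view exposes, while the raw file stays readable by dba alone.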
ETL Among Modern Systems and Formats
• Challenge:
– Data held in MongoDB & JSON-based log files
– Want to run large-scale machine learning against data
• Solution:
– Do simple CTAS query in Drill
– Converts data into high performance Parquet format
– Large-scale parallel conversion and load into Hadoop
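
The CTAS step can be sketched as two statements: select Parquet as the output format, then create the table from a query. The Python below only assembles the SQL text. The target workspace `dfs.tmp`, the source `mongo.logs.`events``, and the helper `ctas_to_parquet` are assumptions for illustration.

```python
def ctas_to_parquet(target, select_sql):
    """Return the statements for a CTAS that writes Parquet output."""
    return [
        "ALTER SESSION SET `store.format` = 'parquet'",
        f"CREATE TABLE {target} AS {select_sql}",
    ]


statements = ctas_to_parquet(
    "dfs.tmp.`events_parquet`",           # assumed writable workspace
    "SELECT * FROM mongo.logs.`events`",  # hypothetical source collection
)
for statement in statements:
    print(statement + ";")
```

The statements would be submitted through any Drill client (shell, JDBC, ODBC, or REST); the conversion itself runs in parallel across the Drillbits.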
UPCOMING FEATURES
Access Data in More Places
NoSQL
• Available: HBase, MongoDB, MapR-DB
• Soon: Elasticsearch, Cassandra, Solr
RDBMS
• Available: Hive, JDBC, MySQL
• Soon: Phoenix, Postgres, Oracle
File Systems
• Available: HDFS, MapR-FS, S3, NAS, Azure
• Soon: DistDAS
File Formats
• Available: JSON, Parquet, Text & CSV, Avro, Hive SerDes
• Soon: Excel, HTTPD Log, BSON
Enhanced Flexibility
• JSON literals in SQL
• Improved dynamic schema capabilities
• Type tools
• Transformation UDFs
Improved Management
• WebUI Authentication
• Web & SQL Authorization
• Advanced Workload Management
• Enhanced spooling/memory capabilities
Query Data in More Ways
• Spark Dataframe & RDD: Read & Write
– Use Drill to work with NoSQL in your Spark Pipeline
• MapReduce Input and Output Formats
– Use Drill
• Enhanced REST Capabilities
– Better support for Complex Data
Performance
• Faster Parquet and ORC readers
• Parquet enhancements
– More statistics, Bloom filters, indexes, page pruning
• Pin Tables in Memory
• Vectorization Enhancements
Recap
• Drill Background
– SQL on Everything
• Common Use Cases
– Interactive analyst experience, no matter where the data exists
• Roadmap
– More data, more access, more flexibility, easier management and higher performance
Thank You!
• Download at drill.apache.org
• Get in touch:
– [email protected]
• Ask questions:
– [email protected]
• Tweet:
– @DremioHQ, @ApacheDrill