Qubole @ AWS Meetup Bangalore - July 2015
Big Data as a Service
Joydeep Sen Sarma, Hariharan Iyer, Shubham Tagra
Introduction
• Introduction to Qubole
• How Qubole integrates with AWS:
– EC2
– S3
– RedShift
– Kinesis
• Hive vs. Presto vs. RedShift vs. Spark
• Qubole vs. EMR
2
Agenda
Introduction
• Founded ~ 10/2011
• Team:
– Founding crew: initial authors of Apache Hive, ran the Data team @ Facebook
– Notable alumni from Greenplum/Vertica/EngineYard/Oracle/AWS, etc.
– 50 engineers + 20 sales/marketing across Bangalore/Palo Alto
• Financing:
– Completed Series B 10/2014
– LightSpeed, Charles River, Norwest, Anand/Venky
• Product: Qubole Data Service
– Big Data as a Service
– AWS/GCE/Azure
3
About Qubole
Introduction
4
Customers
Introduction
Qubole Data Service
6
Introduction
Qubole Data Service
Introduction
• Self-Serve Big Data Analytics:
– Lack of Hadoop-trained IT/engineers
– Team of Analysts
• Lowest TCO
– Cloud-optimized: takes full advantage of AWS
• Unified Platform for all Tools:
– Hive/Pig/Spark/SQL/Map-Reduce/Cascading/…
– Pick and Choose. Combine and Use
• Awesome Support and Solutions
7
Why Qubole
Self Service Analytics: Direct Access to Big Data
8
Self Service Analytics: Manage Clusters Easily
9
Self Service Analytics: Schedule Jobs
10
Self Service Analytics: Self Serve Dashboards using Notebooks
11
Qubole and EC2
• Custom AMIs for much faster boot-up
• Auto-termination
• Auto-scaling
• Spot Instances
• EBS
13
EC2 Magic Sauce
• Auto-start and auto-termination
– Cluster starts automatically when you need to run a command
– Intelligent: no cluster required for metadata-only commands
– Terminated after a couple of hours of idle time
• Auto-scaling
– Min Size <= Cluster Size <= Max Size
14
Cluster LifeCycle Management
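The auto-scaling bound above can be sketched as a simple clamp. This is an illustrative sketch only, not Qubole's actual algorithm; `desired_nodes` stands in for an engine-specific demand estimate:

```python
def target_cluster_size(desired_nodes: int, min_size: int, max_size: int) -> int:
    """Clamp the engine's desired node count to the configured bounds."""
    return max(min_size, min(desired_nodes, max_size))

# The cluster never shrinks below min_size, even when nearly idle,
# and never grows past max_size, no matter how parallel the workload is.
print(target_cluster_size(2, 4, 10))    # -> 4
print(target_cluster_size(50, 4, 10))   # -> 10
```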
Qubole and EC2
15
[Diagram: slave nodes with map and reduce slots scaling up for a query such as SELECT * FROM FOO JOIN BAR ON BAZ = ...]
Auto-Scaling
• Upscaling
– Engine-specific algorithms
– Cannot just look at expected time (parallelism matters)
• Downscaling
– Decommissioning takes time
– Need to consider the hour boundary
– Stuck on mapper output
• Output offloading
• AWS Integration
– Hour boundary
– Eventual consistency
16
Why is it hard?
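The "hour boundary" consideration comes from EC2's per-hour billing at the time: a node paid for up to the next hour boundary should keep working until shortly before it. A minimal sketch of such a downscaling check; the function name and thresholds are hypothetical, not Qubole's implementation:

```python
def should_decommission(uptime_minutes: float, decommission_minutes: float,
                        window_minutes: float = 10.0) -> bool:
    """Only remove a node in the last few minutes of its billed hour,
    and only if draining it can finish before the boundary."""
    minutes_left = 60 - (uptime_minutes % 60)
    return minutes_left <= window_minutes and decommission_minutes <= minutes_left

# 55 minutes in, 3-minute drain: 5 minutes left in the hour, so remove it.
print(should_decommission(55, 3))   # -> True
# 20 minutes in: the rest of the hour is already paid for, keep working.
print(should_decommission(20, 3))   # -> False
```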
Qubole and EC2
Min Cluster Size: 400
Max Cluster Size: 800
Time for which cluster size < max size: 49%
17
But it pays off!
Qubole and EC2
18
But it pays off!
Expected Compute Hours | Compute Hours Saved | Savings (%)
2902246.2 | 2107311.01 | 72.6
4655815.5 | 2105486.11 | 45.2
1698052.65 | 1658738.375 | 97.6
1776944.4 | 1476547.835 | 83.1
2063127.85 | 838628.7 | 40.6
919721.25 | 613630.955 | 66.7
Qubole and EC2
• Various configurations
– All nodes on-demand
– Minimum nodes on-demand, rest a combination of on-demand & spot
– All nodes spot
• Minimum set has a higher bid price => less likely to lose
– Up to 90% savings compared to the on-demand price
19
Spot instances
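The mixed configuration above — a stable on-demand core plus a spot remainder — can be sketched as follows; the function and parameter names are invented for illustration:

```python
def plan_nodes(cluster_size: int, min_on_demand: int, spot_percentage: int) -> dict:
    """Split a cluster into stable on-demand nodes and cheaper spot nodes.

    The first `min_on_demand` nodes are always on-demand (the stable core,
    e.g. where HDFS replicas should live); the remainder is split according
    to `spot_percentage`."""
    flexible = max(cluster_size - min_on_demand, 0)
    spot = flexible * spot_percentage // 100
    return {"on_demand": cluster_size - spot, "spot": spot}

print(plan_nodes(10, 2, 100))   # -> {'on_demand': 2, 'spot': 8}
print(plan_nodes(10, 2, 50))    # -> {'on_demand': 6, 'spot': 4}
```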
Qubole and EC2
• Not always available
– Fall back to on-demand
• Increases overall cost of the cluster
– Periodically replace extra on-demand instances when spot becomes available
• Can go away at any time
– Hadoop has built-in resiliency
– Place replicas on stable instances
• Auto-scaling
– Must maintain the requested ratio
20
Why is it hard?
Qubole and EC2
• Useful for newer instance types (c3/m3)
– Low ephemeral storage
• Better performance/$
– Compared to older instance types with more storage
• Write path is modified
– Minimize writes to EBS volumes
– Use them only when ephemeral storage is near full
21
Elastic Block Store
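The last bullet — writing to EBS only when ephemeral storage is nearly full — might look like this. A sketch with hypothetical names and an arbitrary 1 GiB headroom threshold, not Qubole's actual policy:

```python
def pick_write_volume(ephemeral_free_bytes: int, write_size_bytes: int,
                      headroom_bytes: int = 1 << 30) -> str:
    """Prefer the instance's fast local ephemeral disk; spill to the EBS
    volume only when the write would leave less than `headroom_bytes` free
    (EBS I/O is billed and slower than local disk for this workload)."""
    if ephemeral_free_bytes - write_size_bytes > headroom_bytes:
        return "ephemeral"
    return "ebs"

GiB = 1 << 30
print(pick_write_volume(10 * GiB, 1 * GiB))   # -> ephemeral
print(pick_write_volume(1 * GiB, 1 * GiB))    # -> ebs
```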
Qubole and EC2
• Spot Fleets
• EBS-only instances
22
What’s coming next
Qubole and EC2
Resources
• Auto-scaling Hadoop Clusters in Qubole
• Spot Instances in Qubole Clusters
• Rebalancing Hadoop Clusters for Better Spot Utilization
• Improved Performance with Low-Ephemeral-Storage Instances
23
Qubole and EC2
Qubole and S3
• No real Rename (aka Move) operation
– Renames are copies, and expensive!
• S3 connection establishment is expensive
– i.e., small calls like getObjectDetails(key) and small reads are expensive!
• S3 has bulk prefix listing
– listObjectsChunked(startKey, maxListing)
• Puts are atomic
– Objects are created only when the upload completes
– Unlike HDFS, where files are created on first write
• MultiPart!
25
S3 != HDFS
Qubole and S3
• Naive
• Smart
• Up to 1000x improvement
26
Prefix Listing
# Naive: one listObject call per path
for path in ['/x/y/a', '/x/y/b', '/x/z/c', …]:
    result << listObject(path)

# Smart: one bulk prefix listing, filtered client-side
pathList = listPrefix('/x')
while (entry = pathList.next()):
    if entry in ['/x/y/a', '/x/y/b', '/x/z/c', …]:
        result << entry
Qubole and S3
• Split Computation
– Divide input files into tasks for Map-Reduce/Spark/Presto
• Recovering Partitions
• List Paths matching regex pattern (‘/x/y/z/*/*’)
• and many more ..
27
Prefix Listing - Use Cases
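The wildcard-path use case can be illustrated with one bulk listing filtered client-side. Since `fnmatch`'s `*` would normally match across `/`, this sketch matches segment by segment; the function name and sample keys are hypothetical:

```python
from fnmatch import fnmatch

def glob_keys(keys, pattern):
    """Expand a wildcard path like '/x/y/z/*/*' against a single bulk
    prefix listing, matching segment by segment so '*' never crosses '/'."""
    pat = pattern.strip('/').split('/')
    out = []
    for key in keys:
        segs = key.strip('/').split('/')
        if len(segs) == len(pat) and all(fnmatch(s, p) for s, p in zip(segs, pat)):
            out.append(key)
    return out

# one listPrefix('/x') call would return all of these; we filter locally
keys = ['/x/y/z/2015/07', '/x/y/z/2015/08', '/x/y/q/2015/07', '/x/other']
print(glob_keys(keys, '/x/y/z/*/*'))   # -> ['/x/y/z/2015/07', '/x/y/z/2015/08']
```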
Qubole and S3
• Normally:
– Write data to a temporary location; atomically rename to the final location
• With S3:
– Write data to the final location
– Atomic puts deal with speculation/retries
– Optional: remove on failure
• By default in Hive, DirectFileOutputCommitter in MR/Spark
• Tricky: retries/speculation must use same path
28
Direct Writes
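The "same path" requirement in the last bullet can be made concrete: the final key must be a deterministic function of the task, never of the attempt, so a retry or speculative attempt simply overwrites its predecessor via S3's atomic put. A hypothetical sketch:

```python
def final_output_path(dest_dir: str, partition: int) -> str:
    """Direct write: every attempt of the same task targets the same final
    key. Including the attempt number in the key would leave duplicate
    output files behind when a task is retried or run speculatively."""
    return f"{dest_dir}/part-{partition:05d}"

# attempt 0 and a later speculative attempt of task 7 collide on purpose:
print(final_output_path("s3://bucket/table", 7))   # -> s3://bucket/table/part-00007
```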
Qubole and S3
• Naive:
29
Pre-Fetching
[Diagram: Client ↔ S3]
• Smart:
[Diagram: Client ↔ S3, with pre-fetching]
S3 Local Disk Cache (Presto)
30
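A local disk cache for S3 data is essentially a read-through cache. Below is a minimal stdlib sketch, not Presto's actual cache (which also handles eviction, splits, and concurrent readers); all names are illustrative:

```python
import hashlib
import os

def cached_read(s3_key: str, cache_dir: str, fetch_from_s3) -> bytes:
    """Read-through cache: serve from local disk on a hit, otherwise fetch
    the object from S3 and keep a copy for later queries on the same data."""
    local = os.path.join(cache_dir, hashlib.md5(s3_key.encode()).hexdigest())
    if os.path.exists(local):          # hit: no S3 round trip at all
        with open(local, "rb") as f:
            return f.read()
    data = fetch_from_s3(s3_key)       # miss: one S3 GET
    with open(local, "wb") as f:       # populate the cache for next time
        f.write(data)
    return data
```

The second query over the same data then reads at local-disk speed, which is where the interactive-latency win comes from.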
Qubole and S3
• Populating Cache while performing Query may cause Slowness
• Large Files are split
– Cache whole files or individual splits?
• Should Caching Combine Small Files?
• Should Caching transform data into Columnar?
Watch out for Table Copies!
32
Why S3 Caching is hard
Qubole and S3
• Handle S3 Timeouts and Exceptions (truncated streams)
• Optimize away seek() operations
• Data Sharing across Organizations using Cross-Account Roles
33
Miscellaneous
Resources
● S3 Optimizations in Hive
○ http://www.qubole.com/optimizing-hadoop-for-s3-part-1/
● Caching in Presto
○ http://www.qubole.com/blog/product/caching-presto/
● Qubole vs. EMR Performance Comparison
○ http://www.qubole.com/a-performance-comparison-of-qubole-and-amazons-elastic-mapreduce-emr/
● Data Sharing via AWS Roles
○ https://qubole-eng.quora.com/Securely-sharing-data-across-Organizations-with-Qubole-2
Qubole and S3
• Introduction to Presto
• Brief Introduction to Kinesis
• Qubole’s Value add to Kinesis
• Brief introduction to Redshift
• Qubole’s Value add to Redshift
• When to use which system
36
Presto, Kinesis and RedShift
Presto, Kinesis, RedShift
• Interactive SQL Query Engine for Big Data
• Open-sourced by Facebook in late 2013
• Follows ANSI-SQL
• Own execution engine
37
Presto, What is it?
Presto, Kinesis, RedShift
● Extensibility
○ Pluggable Datasources
● Performance
○ In-memory Execution
○ Aggressive Pipelining
○ Highly efficient Java code
○ Dynamic Query Compilation
○ Vectorization
38
Why Presto?
Presto, Kinesis, RedShift
● Smooth learning curve as it adheres to ANSI-SQL
● Active open source community
● Proven worth at scale at companies like Facebook, Netflix, Airbnb
39
Why not something else?
Presto, Kinesis, RedShift
● Self managed Presto clusters
● Auto-configured
● Autoscaling
● Data Caching
40
Benefits of Presto @Qubole
Presto, Kinesis, RedShift
● Kinesis Connector
● S3 Optimizations
● Insert Support
● UDF support
41
Qubole’s Contribution
Presto, Kinesis, RedShift
● Average times across performance tests run by MediaMath on a 22 TB text-format table, partitioned on date; queries ran against a partition with ~1.2B rows
http://www.qubole.com/blog/big-data/performance-testing-presto/
42
Comparison with Hive
Presto, Kinesis, RedShift
● High Capacity Pipe for Real-Time Processing
● Key Concepts
○ Record
○ Streams
○ Shards
○ Checkpoints
43
Kinesis
Presto, Kinesis, RedShift
● Streaming use case
○ Spark
● SQL use case
○ Via Hive
○ Via Presto
44
Qubole and Kinesis
Presto, Kinesis, RedShift
● Kinesis Connector
45
Presto-Kinesis Integration
Presto, Kinesis, RedShift
● Example
○ Step 1: Define Schema
46
Presto-Kinesis Integration
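The schema definition in Step 1 is a table-description file. The sketch below follows the general shape used by the presto-kinesis connector, but the stream, table, and field names here are invented; consult the connector's README for the exact format:

```json
{
  "tableName": "orders",
  "schemaName": "default",
  "streamName": "orders-stream",
  "message": {
    "dataFormat": "json",
    "fields": [
      {"name": "order_id", "mapping": "order_id", "type": "BIGINT"},
      {"name": "amount",   "mapping": "amount",   "type": "DOUBLE"}
    ]
  }
}
```

Step 2 is then an ordinary Presto query over this table, e.g. `SELECT COUNT(*) FROM kinesis.default.orders` (connector, schema, and table names here match the hypothetical definition above).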
Presto, Kinesis, RedShift
○ Step 2: Run Query
47
Presto-Kinesis Integration
● Datawarehouse service
● OLAP
● Storage + Compute
48
Redshift
Presto, Kinesis, RedShift
● ETL use case
○ DBImport
○ DBExport
● Ad-hoc query use case
○ Direct Query
○ Hive
○ Presto
49
Qubole and Redshift
Presto, Kinesis, RedShift
Two ways to access Redshift via Presto
● Via Hive JDBC Storage Handler
50
Presto-Redshift integration
Presto, Kinesis, RedShift
Two ways to access Redshift via Presto
● Via JDBC connector
51
Presto-Redshift integration
Presto, Kinesis, RedShift
● Single Platform
● Consistent User Interface
● Cross-source Joins without data consolidation
52
New Opportunities
Presto, Kinesis, RedShift
● Hive: ETL, huge Joins, Group By on high cardinality columns
● Redshift: Interactive Queries when data loading is acceptable
● Presto:
○ Interactive Queries
○ Direct Queries without loading
○ Joining data across Redshift, Kinesis, S3, MySQL, Cassandra, etc.
53
Hive? Presto? RedShift?
Presto, Kinesis, RedShift
● Iterative Machine Learning
● In-Memory Computing
● Spark Streaming
54
Got Spark?
Presto, Kinesis, RedShift
Resources
● Presto vs. Hive
○ http://www.qubole.com/blog/big-data/performance-testing-presto/
● Presto Kinesis Integration
○ http://blogs.aws.amazon.com/bigdata/post/Tx2DDFNHXSAAH2G/Presto-Amazon-Kinesis-Connector-for-Interactively-Querying-Streaming-Data
● Presto Kinesis Connector
○ https://github.com/qubole/presto-kinesis
● Hive Redshift/JDBC Connector (blog)
○ http://www.qubole.com/blog/product/hive-jdbc-storage-handler/
● Hive Redshift/JDBC Connector (GitHub)
○ https://github.com/qubole/Hive-JDBC-Storage-Handler