stratio big data spain
Post on 02-Jul-2015
AN EFFICIENT DATA MINING SOLUTION
Hadoop?
Cassandra?
Spark?
Stratio Deep
An efficient data mining solution
“Two and two are four?
Sometimes… Sometimes they are five.”
G. Orwell
#StratioBD
Goals
• Why do you need Cassandra?
• What is the problem?
• Why do you need Spark?
• How do they work together?
Cassandra
• Based on Amazon's Dynamo…
• Replication, Key/Value, P2P
• And based on Bigtable…
• Column oriented
ROBUST FAST EFFICIENT
NO BOTTLENECK REPLICATED DECENTRALIZED
Another Database?
Why?
One user – Lots of data
Case A
Many users – Little data
Case B
Many users – Lots of data
Case C
Crawler app
Cassandra, I choose you
100M indexed pages
3k reads
Query time < 1s
But…
Marketing walks in
New query
“I need to find all the references to the domain ACME.
I need the answer by Friday.”
Problem
Cassandra is not well suited to resolving this type of query
You need to design the schema with the query in mind
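A minimal sketch of why the new query hurts, using a plain Python dict as a stand-in for a table keyed by page URL. The table layout, the sample rows, and the function names are assumptions made for illustration; this is not Cassandra's API. The lookup by key is a single read, while the "all references to a domain" question has no key to use and must visit every row.

```python
from urllib.parse import urlparse

# Hypothetical crawler table: the partition key is the page URL.
pages = {
    "http://example.org/a": {"links": ["http://acme.com/x", "http://example.org/b"]},
    "http://example.org/b": {"links": ["http://other.net/"]},
    "http://acme.com/x": {"links": ["http://acme.com/y"]},
}

def get_page(url):
    """Read by key: the access pattern the schema was designed for."""
    return pages.get(url)

def pages_referencing(domain):
    """No key answers this question, so every row has to be scanned."""
    hits = []
    for url, row in pages.items():
        if any(urlparse(link).hostname == domain for link in row["links"]):
            hits.append(url)
    return sorted(hits)
```

With 100M indexed pages, the second function is the kind of full scan that calls for a batch engine rather than a key lookup.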
Challenge accepted
What options do we have?
• Run Hive Query on top of C*
• Write an ETL script and load data into another DB
• Clone the cluster
And now… what can we do?
“We can't solve problems by using the same kind
of thinking we used when we created them”
Albert Einstein
Spark
• Alternative to MapReduce
• A low-latency cluster computing system
• For very large datasets
• Created by UC Berkeley AMP Lab in 2010
• May be up to 100 times faster than MapReduce for iterative algorithms and interactive data mining
Logistic regression in Spark vs Hadoop
SOURCE | http://spark.incubator.apache.org/
WHO USES SPARK?
Spark and Cassandra
Integration points
Cassandra’s HDFS abstraction layer
Advantages:
• Easily integrates with legacy systems.
Drawbacks:
• Very high-level: no access to Cassandra’s low-level features.
• Questionable performance.
INTEGRATION POINTS: HDFS OVER CASSANDRA
Cassandra’s Hadoop Interface
• Thrift protocol
• CQL3 (our implementation)
Uses Cassandra’s new CqlPagingInputFormat
INTEGRATION POINTS: HDFS OVER CASSANDRA
• Supports CQL3 features
• Respects data locality
• Good compromise between performance and implementation complexity
CQL3 Integration
INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3
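To make "respects data locality" concrete, here is a simplified sketch of grouping reads by the node that owns each token, so that a task scheduled on a node only touches data stored there. The ring layout, node names, and token values are assumptions for illustration; replication and the real partitioner are deliberately left out, and this is not how Stratio Deep or Cassandra is implemented.

```python
from bisect import bisect_left

# Hypothetical ring: each node owns the tokens up to and including its own token.
RING = [(100, "node-a"), (200, "node-b"), (300, "node-c")]
TOKENS = [token for token, _ in RING]

def owner(token):
    """Return the node owning a token; tokens past the last entry wrap around."""
    index = bisect_left(TOKENS, token)
    return RING[index % len(RING)][1]

def plan_splits(partition_tokens):
    """Group tokens by owning node so each task can run where its data lives."""
    plan = {}
    for token in partition_tokens:
        plan.setdefault(owner(token), []).append(token)
    return plan
```

A scheduler that hands each group to a worker on the matching node avoids shipping rows across the network before processing them.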
CQL3 Integration (II)
Provides a Java-friendly API:
• Developers map Column Families to custom serializable POJOs
• StratioDeep wraps the complexity of performing Spark calculations
directly over the user-provided POJOs.
INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3
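The idea of mapping rows to plain objects and computing over them can be sketched as follows. This is a plain-Python stand-in, not the StratioDeep API: the `PageEntity` class, the row layout, and the function names are all hypothetical, and an ordinary loop takes the place of the distributed Spark computation.

```python
from collections import Counter
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class PageEntity:
    """Hypothetical entity standing in for a user-defined POJO."""
    url: str
    links: list

def to_entity(row):
    """Map one raw row (a dict of column name to value) to an entity."""
    return PageEntity(url=row["url"], links=list(row["links"]))

def count_references_by_domain(entities):
    """Count outgoing links per target domain across all entities."""
    counts = Counter()
    for entity in entities:
        for link in entity.links:
            counts[urlparse(link).hostname] += 1
    return counts
```

The calling code works only with typed objects and never sees column names or serialization details, which is the convenience the slide describes.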
Demo
Drawbacks:
• Still not performing as well as we’d like
Uses Cassandra’s Hadoop Interface
• No analyst-friendly interface:
No SQL-like query features
CQL3 Integration (III)
INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3
Bring the integration to another level:
• Drop Cassandra’s Hadoop Interface
• Direct access to Cassandra’s SSTable files
• Extend Cassandra’s CQL3 to make use of Spark’s distributed
data processing power
Future extensions
What are we currently working on?
Conclusion
THANKS