stratio big data spain

AN EFFICIENT DATA MINING SOLUTION

Hadoop?

Cassandra?

Spark?

Stratio Deep

An efficient data mining solution

“Two and two are four?

Sometimes… Sometimes they are five.”

G. Orwell

#StratioBD

• Why do you need Cassandra?

• What is the problem?

• Why do you need Spark?

• How do they work together?

Cassandra

• Based on Dynamo…

• Replication, Key/Value, P2P

• And based on Bigtable…

• Column-oriented
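
The Dynamo-style traits listed above (replication, key/value, P2P) can be illustrated with a toy hash ring. This is a minimal Python sketch, not Cassandra's implementation: the node names, the `Ring` class, the `replicas_for` helper, the MD5-based token and the replication factor of 3 are all assumptions made for illustration.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is used only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy peer-to-peer ring: every node is equal, there is no master."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # Sort nodes by token so the ring can be walked clockwise.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas_for(self, key: str):
        """Return the nodes storing `key`: the owner plus the next nodes clockwise."""
        start = bisect.bisect_right(self.tokens, token(key)) % len(self.entries)
        count = min(self.replication_factor, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]


nodes = ["node-a", "node-b", "node-c", "node-d"]
ring = Ring(nodes, replication_factor=3)
owners = ring.replicas_for("http://example.com/index.html")
print(owners)
```

Any node can compute `replicas_for` on its own, which is why no central coordinator (and no single bottleneck) is needed in this model.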

ROBUST FAST EFFICIENT

NO BOTTLENECK REPLICATED DECENTRALIZED

Another Database?

One user – Lots of data

Case A

Many users – Little data

Case B

Many users – Lots of data

Case C

Crawler app

Cassandra, I choose you

100M indexed pages

3k reads

Query time

But…

Marketing walks in

New query

“I need to find all the references to the domain ACME.

I need the answer by Friday.”

Problem

Cassandra is not well suited to resolving this type of query

You need to design the schema with the query in mind
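
A toy model can make this concrete. The Python sketch below is not Cassandra code: the table names `pages_by_url` and `links_by_domain`, the sample URLs and the helper `find_references_full_scan` are hypothetical. It contrasts answering the marketing query against a table keyed by URL (every row has to be read) with answering it from a table that was designed for that query.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Table designed for the crawler's original query: "give me the page stored under this URL".
pages_by_url = {
    "http://news.example/a": {"links": ["http://acme.example/home", "http://other.example/x"]},
    "http://blog.example/b": {"links": ["http://other.example/y"]},
    "http://shop.example/c": {"links": ["http://acme.example/products"]},
}


def find_references_full_scan(domain):
    """Answer the new query without a matching table: every row must be read."""
    rows_read = 0
    hits = []
    for url, page in pages_by_url.items():
        rows_read += 1
        if any(urlparse(link).hostname == domain for link in page["links"]):
            hits.append(url)
    return sorted(hits), rows_read


# Second table designed for the new query: keyed by the referenced domain.
links_by_domain = defaultdict(list)
for url, page in pages_by_url.items():
    for link in page["links"]:
        links_by_domain[urlparse(link).hostname].append(url)

scan_hits, rows_read = find_references_full_scan("acme.example")
lookup_hits = sorted(links_by_domain["acme.example"])
print(scan_hits, rows_read, lookup_hits)
```

Both paths return the same pages, but the scan touches all rows while the second table answers with a single key lookup. With 100M rows, that difference is the whole problem.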

Challenge accepted

What options do we have?

• Run Hive Query on top of C*

• Write an ETL script and load data into another DB

• Clone the cluster

What options do we have?

Run Hive Query on top of C*

Write ETL scripts and load into another DB

Clone the cluster

And now… what can we do?

“We can't solve problems by using the same kind

of thinking we used when we created them”

Albert Einstein

• Alternative to MapReduce

• A low-latency cluster computing system

• For very large datasets

• Created by UC Berkeley AMPLab in 2010.

• May be 100 times faster than MapReduce for:

Iterative algorithms. Interactive data mining

Logistic regression in Spark vs Hadoop

SOURCE | http://spark.incubator.apache.org/
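
Logistic regression is a typical iterative workload: every iteration reads the same dataset again. The sketch below is plain Python, not Spark code, and the dataset, learning rate and iteration count are made up for illustration. It only shows the access pattern that benefits from keeping data in memory between iterations instead of reloading it for each pass.

```python
import math

# Tiny, made-up dataset of (feature, label) pairs. Points above 0 are labelled 1.
points = [(-2.0, 0), (-1.5, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.5, 1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(data, iterations=200, learning_rate=0.5):
    """Batch gradient descent: each iteration makes one full pass over `data`."""
    weight, bias = 0.0, 0.0
    passes = 0
    for _ in range(iterations):
        grad_w = grad_b = 0.0
        for x, y in data:  # the same dataset is read again on every iteration
            error = sigmoid(weight * x + bias) - y
            grad_w += error * x
            grad_b += error
        weight -= learning_rate * grad_w / len(data)
        bias -= learning_rate * grad_b / len(data)
        passes += 1
    return weight, bias, passes


weight, bias, passes = train(points)
predictions = [1 if sigmoid(weight * x + bias) >= 0.5 else 0 for x, _ in points]
print(weight, bias, passes, predictions)
```

Here the data is read 200 times. If every pass had to go back to disk, the read cost would be paid 200 times, which is the cost an in-memory engine avoids.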

WHO USES SPARK?

Spark and Cassandra

Integration points

Cassandra’s HDFS abstraction layer

Advantages:

• Easily integrates with legacy systems.

Drawbacks:

• Very high-level: no access to Cassandra’s low-level features.

• Questionable performance.

INTEGRATION POINTS: HDFS OVER CASSANDRA

Cassandra’s Hadoop Interface

• Thrift protocol

• CQL3 (our implementation)

Uses Cassandra’s new CqlPagingInputFormat

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE

• Supports CQL3 features

• Respects data locality

• Good compromise between performance and implementation complexity
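
"Respects data locality" means a task should run on a node that already holds the data it reads. The Python sketch below is a simplified illustration of that idea, not the scheduler used by Spark or by the integration: the `Split` class, the `schedule` function, the token ranges and the node names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """A slice of the table plus the nodes that hold a replica of it."""
    token_range: tuple
    replica_nodes: tuple


def schedule(splits, free_workers):
    """Assign each split to a free worker on a replica node when possible (local read),
    otherwise to any free worker (remote read).
    Assumes there are at least as many free workers as splits."""
    free = list(free_workers)
    plan = []
    for split in splits:
        local = [w for w in free if w in split.replica_nodes]
        worker = local[0] if local else free[0]
        free.remove(worker)
        plan.append((split.token_range, worker, worker in split.replica_nodes))
    return plan


splits = [
    Split((0, 99), ("node-a", "node-b")),
    Split((100, 199), ("node-b", "node-c")),
    Split((200, 299), ("node-c", "node-a")),
]
plan = schedule(splits, ["node-c", "node-b", "node-a"])
for token_range, worker, is_local in plan:
    print(token_range, worker, "local" if is_local else "remote")
```

In this run every split lands on a node that stores one of its replicas, so no rows cross the network before processing starts.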

CQL3 Integration

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3

CQL3 Integration (II)

Provides a Java-friendly API:

• Developers map Column Families to custom serializable POJOs

• Stratio Deep wraps the complexity of performing Spark calculations directly over the user-provided POJOs.
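
The mapping idea can be sketched in a few lines. The example below is Python, not the Java API described on this slide, and every name in it (`Page`, `from_row`, the sample rows) is hypothetical. A dataclass stands in for the POJO, and the computation is written against typed objects rather than raw rows.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """Stand-in for a user-defined entity (a POJO in the Java API); fields mirror table columns."""
    url: str
    domain: str
    status: int

    @staticmethod
    def from_row(row: dict) -> "Page":
        # Hypothetical mapping step: one table row becomes one typed object.
        return Page(url=row["url"], domain=row["domain"], status=int(row["status"]))


# Made-up rows, shaped as a driver might return them from a "pages" table.
rows = [
    {"url": "http://acme.example/", "domain": "acme.example", "status": "200"},
    {"url": "http://acme.example/jobs", "domain": "acme.example", "status": "404"},
    {"url": "http://other.example/", "domain": "other.example", "status": "200"},
]

pages = [Page.from_row(row) for row in rows]

# The computation refers to object fields, not to column names or row positions.
ok_pages_per_domain = Counter(p.domain for p in pages if p.status == 200)
print(dict(ok_pages_per_domain))
```

The conversion from rows to objects happens in one place, so the analysis code never touches the storage format.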

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3

Drawbacks:

• Still not performing as well as we’d like

Uses Cassandra’s Hadoop Interface

• No analyst-friendly interface:

No SQL-like query features

CQL3 Integration (III)

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3

Bring the integration to another level:

• Drop Cassandra’s Hadoop Interface

• Direct access to Cassandra’s SSTable files.

• Extend Cassandra’s CQL3 to make use of Spark’s distributed data processing power

Future extensions

What are we currently working on?

Conclusion

THANKS
