partners in crime: cassandra analytics and etl with hadoop

19

Click here to load reader

Upload: stu-hood

Post on 07-May-2015

7.235 views

Category:

Technology


2 download

DESCRIPTION

Light slides supporting a Hadoop and Cassandra integration talk at the 2010 Cassandra Summit. The code is more interesting: http://github.com/stuhood/cassandra-summit-demo

TRANSCRIPT

Page 1: Partners in Crime: Cassandra Analytics and ETL with Hadoop

Partners in Crime

Date: August 10th, 2010

Cassandra Summit 2010

Cassandra Analytics and ETL with Hadoop

Page 2: Partners in Crime: Cassandra Analytics and ETL with Hadoop

What is Hadoop?

• Distributed processing framework (MapReduce)– Moves processing to the data

• Distributed filesystem– Allows data to move when processing can't

Page 3: Partners in Crime: Cassandra Analytics and ETL with Hadoop

Why use Hadoop with Cassandra?

Perfect partners for big data laundering

• Cassandra optimized for access• Hadoop optimized for processing

– Many analytics frameworks– Existing integrations

• RDBMS → Hadoop → Cassandra

Page 4: Partners in Crime: Cassandra Analytics and ETL with Hadoop

Cluster Layouts

• Existing Hadoop cluster?– Start Hadoop tasktrackers on Cassandra cluster– Processing performed on local nodes

Page 5: Partners in Crime: Cassandra Analytics and ETL with Hadoop

Cluster Layouts

• No Hadoop cluster?– Start all Hadoop daemons on 2-3 nodes

• MapReduce depends lightly on HDFS– Start Hadoop tasktrackers on Cassandra cluster

Page 6: Partners in Crime: Cassandra Analytics and ETL with Hadoop

Hadoop Integration Points

• JVM MapReduce– Keys/values iterated in process

• Hadoop Streaming– Performs IPC on stdin/stdout to arbitrary processes

• Apache Pig– High level relational language (SQL alternative)

• Apache Hive– Forthcoming support for Cassandra storage

Page 7: Partners in Crime: Cassandra Analytics and ETL with Hadoop

Demo

• Code– github.com/stuhood/cassandra-summit-demo

• Flow– Load with Hadoop Streaming– Analyze with Apache Pig– Load/Process with JVM MapReduce

Page 8: Partners in Crime: Cassandra Analytics and ETL with Hadoop

Hadoop Streaming Summary

• Mapper/Reducer scripts– Any language

• Script is moved to the data

cat $input | mapper | sort | reducer > $output

Page 9: Partners in Crime: Cassandra Analytics and ETL with Hadoop

ETL with Streaming

• ETL to Cassandra in ~50 linesLoad!

Page 10: Partners in Crime: Cassandra Analytics and ETL with Hadoop

ETL with Streaming

1)Files in HDFS2)Hadoop Streaming3)bin/load-mapper.py (the code you write)4)Cassandra's Streaming Shim5)Cassandra

Page 11: Partners in Crime: Cassandra Analytics and ETL with Hadoop

Apache Pig Summary

• Declarative relational language

Page 12: Partners in Crime: Cassandra Analytics and ETL with Hadoop

Analytics with Pig

• Analytics from Cassandra in ~20 linesAnalyze!

Page 13: Partners in Crime: Cassandra Analytics and ETL with Hadoop

Analytics with Pig

1)Data stored in Cassandra2)Cassandra's Pig LoadFunc3)bin/analyze.pig (the code you write)4)Files in HDFS

Page 14: Partners in Crime: Cassandra Analytics and ETL with Hadoop

JVM MapReduce Summary

• Extend Mapper/Reducer base classes• Hadoop:

– Transports the Jar to nodes near the data– Efficiently streams data through

Page 15: Partners in Crime: Cassandra Analytics and ETL with Hadoop

Load/Process with MapReduce

• Efficient bulk loading in ~80 linesSummarize!

Page 16: Partners in Crime: Cassandra Analytics and ETL with Hadoop

Load/Process with MapReduce

1)Files in HDFS2)MapReduce3)Mapper/Reducer (the code you write)4)Cassandra's ColumnFamilyOutputFormat5)Cassandra

Page 17: Partners in Crime: Cassandra Analytics and ETL with Hadoop

Future Work

• Pig Output• Hive• Hadoop Streaming Input• Optimizations

Page 18: Partners in Crime: Cassandra Analytics and ETL with Hadoop

Questions?

Page 19: Partners in Crime: Cassandra Analytics and ETL with Hadoop

References

• Code available at– github.com/stuhood/cassandra-summit-demo

• Open issues– CASSANDRA-1315– CASSANDRA-1322– CASSANDRA-1368

• “Hadoop + Cassandra” - Jeremy Hanna– slideshare.net/jeromatron/cassandrahadoop-4399672