partners in crime: cassandra analytics and etl with hadoop

Partners in Crime

Date: August 10th, 2010

Cassandra Summit 2010

Cassandra Analytics and ETL with Hadoop

What is Hadoop?

• Distributed processing framework (MapReduce)– Moves processing to the data

• Distributed filesystem– Allows data to move when processing can't

Why use Hadoop with Cassandra?

Perfect partners for big data laundering

• Cassandra optimized for access• Hadoop optimized for processing

– Many analytics frameworks– Existing integrations

• RDBMS → Hadoop → Cassandra

Cluster Layouts

• Existing Hadoop cluster?– Start Hadoop tasktrackers on Cassandra cluster– Processing performed on local nodes

Cluster Layouts

• No Hadoop cluster?– Start all Hadoop daemons on 2-3 nodes

• MapReduce depends lightly on HDFS– Start Hadoop tasktrackers on Cassandra cluster

Hadoop Integration Points

• JVM MapReduce– Keys/values iterated in process

• Hadoop Streaming– Performs IPC on stdin/stdout to arbitrary processes

• Apache Pig– High level relational language (SQL alternative)

• Apache Hive– Forthcoming support for Cassandra storage

Demo

• Code– github.com/stuhood/cassandra-summit-demo

• Flow– Load with Hadoop Streaming– Analyze with Apache Pig– Load/Process with JVM MapReduce

Hadoop Streaming Summary

• Mapper/Reducer scripts– Any language

• Script is moved to the data

cat $input | mapper | sort | reducer > $output

ETL with Streaming

• ETL to Cassandra in ~50 linesLoad!

ETL with Streaming

1)Files in HDFS2)Hadoop Streaming3)bin/load-mapper.py (the code you write)4)Cassandra's Streaming Shim5)Cassandra

Apache Pig Summary

• Declarative relational language

Analytics with Pig

• Analytics from Cassandra in ~20 linesAnalyze!

Analytics with Pig

1)Data stored in Cassandra2)Cassandra's Pig LoadFunc3)bin/analyze.pig (the code you write)4)Files in HDFS

JVM MapReduce Summary

• Extend Mapper/Reducer base classes• Hadoop:

– Transports the Jar to nodes near the data– Efficiently streams data through

Load/Process with MapReduce

• Efficient bulk loading in ~80 linesSummarize!

Load/Process with MapReduce

1)Files in HDFS2)MapReduce3)Mapper/Reducer (the code you write)4)Cassandra's ColumnFamilyOutputFormat5)Cassandra

Future Work

• Pig Output• Hive• Hadoop Streaming Input• Optimizations

Questions?

References

• Code available at– github.com/stuhood/cassandra-summit-demo

• Open issues– CASSANDRA-1315– CASSANDRA-1322– CASSANDRA-1368

• “Hadoop + Cassandra” - Jeremy Hanna– slideshare.net/jeromatron/cassandrahadoop-4399672

partners in crime: cassandra analytics and etl with hadoop

Technology