partners in crime: cassandra analytics and etl with hadoop
DESCRIPTION
Light slides supporting a Hadoop and Cassandra integration talk at the 2010 Cassandra Summit. The code is more interesting: http://github.com/stuhood/cassandra-summit-demoTRANSCRIPT
Partners in Crime
Date: August 10th, 2010
Cassandra Summit 2010
Cassandra Analytics and ETL with Hadoop
What is Hadoop?
• Distributed processing framework (MapReduce)– Moves processing to the data
• Distributed filesystem– Allows data to move when processing can't
Why use Hadoop with Cassandra?
Perfect partners for big data laundering
• Cassandra optimized for access• Hadoop optimized for processing
– Many analytics frameworks– Existing integrations
• RDBMS → Hadoop → Cassandra
Cluster Layouts
• Existing Hadoop cluster?– Start Hadoop tasktrackers on Cassandra cluster– Processing performed on local nodes
Cluster Layouts
• No Hadoop cluster?– Start all Hadoop daemons on 2-3 nodes
• MapReduce depends lightly on HDFS– Start Hadoop tasktrackers on Cassandra cluster
Hadoop Integration Points
• JVM MapReduce– Keys/values iterated in process
• Hadoop Streaming– Performs IPC on stdin/stdout to arbitrary processes
• Apache Pig– High level relational language (SQL alternative)
• Apache Hive– Forthcoming support for Cassandra storage
Demo
• Code– github.com/stuhood/cassandra-summit-demo
• Flow– Load with Hadoop Streaming– Analyze with Apache Pig– Load/Process with JVM MapReduce
Hadoop Streaming Summary
• Mapper/Reducer scripts– Any language
• Script is moved to the data
cat $input | mapper | sort | reducer > $output
ETL with Streaming
• ETL to Cassandra in ~50 linesLoad!
ETL with Streaming
1)Files in HDFS2)Hadoop Streaming3)bin/load-mapper.py (the code you write)4)Cassandra's Streaming Shim5)Cassandra
Apache Pig Summary
• Declarative relational language
Analytics with Pig
• Analytics from Cassandra in ~20 linesAnalyze!
Analytics with Pig
1)Data stored in Cassandra2)Cassandra's Pig LoadFunc3)bin/analyze.pig (the code you write)4)Files in HDFS
JVM MapReduce Summary
• Extend Mapper/Reducer base classes• Hadoop:
– Transports the Jar to nodes near the data– Efficiently streams data through
Load/Process with MapReduce
• Efficient bulk loading in ~80 linesSummarize!
Load/Process with MapReduce
1)Files in HDFS2)MapReduce3)Mapper/Reducer (the code you write)4)Cassandra's ColumnFamilyOutputFormat5)Cassandra
Future Work
• Pig Output• Hive• Hadoop Streaming Input• Optimizations
Questions?
References
• Code available at– github.com/stuhood/cassandra-summit-demo
• Open issues– CASSANDRA-1315– CASSANDRA-1322– CASSANDRA-1368
• “Hadoop + Cassandra” - Jeremy Hanna– slideshare.net/jeromatron/cassandrahadoop-4399672