Overview of Hadoop for Data Mining
Federal Big Data Group
confidential
Mark Silverman
Treeminer, Inc.
155 Gibbs Street, Suite 514
Rockville, Maryland 20850
(240) [email protected]
TREEMINER, INC. CONFIDENTIAL
Agenda
• Introduction to Hadoop
• Developing and testing a Map/Reduce application
• Auto-Clustering in Hadoop and Interworking with Apache Storm
Introduction to Hadoop
• Hadoop consists of:
  • Clustered, distributed, highly available file system (HDFS)
  • Execution framework (Map/Reduce)
Hadoop File System
• “Rack” aware
• Local storage
• Distributed copies (generally 3)
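The bullets above can be illustrated with a minimal sketch of rack-aware placement for three copies of a block. It follows the commonly documented HDFS default (first copy on the writer's node, second on a node in a different rack, third on another node in that same remote rack), but it is a simplified illustration and not HDFS code: the function `place_replicas`, the rack and node names, and the deterministic "take the first candidate" choice are all assumptions made for clarity, where real HDFS picks candidates randomly and considers load.

```python
def place_replicas(writer, topology, replication=3):
    """Pick target nodes for one block (illustrative only, not HDFS code).

    topology maps a rack name to the list of node names on that rack;
    all names here are hypothetical.
    """
    # Reverse lookup: which rack does each node live on?
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer]

    # Copy 1: the writer's own node (local storage).
    targets = [writer]

    # Copy 2: a node on a different rack, so a rack failure loses at most
    # part of the copies. This sketch simply takes the first other rack.
    remote_racks = [rack for rack in topology if rack != local_rack]
    if replication >= 2 and remote_racks:
        remote_rack = remote_racks[0]
        targets.append(topology[remote_rack][0])

        # Copy 3: a different node on the same remote rack, which keeps
        # cross-rack traffic down while still spreading the data.
        if replication >= 3 and len(topology[remote_rack]) > 1:
            targets.append(topology[remote_rack][1])

    return targets[:replication]


if __name__ == "__main__":
    topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
    print(place_replicas("n1", topology))
```

With two racks of two nodes each, a block written from `n1` ends up on `n1` plus two nodes of the other rack, so either rack can fail without losing the block.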
Sample Hadoop File System
Hadoop “Eco-System”
• Hive: Allows SQL-like querying of data in HDFS
• Pig: Basic scripting language for Hadoop
• Databases: HBase, Accumulo, Cassandra, Neo4j
Map / Reduce
Parallel Execution Framework
WordCount Example
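As a minimal sketch of the Map/Reduce flow behind word counting, the plain-Python example below runs the three phases (map, shuffle, reduce) in a single process. It does not use Hadoop at all: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and the in-memory grouping stands in for the sort-and-shuffle step that the framework performs between mappers and reducers on a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key.

    On a cluster the framework does this between the map and reduce
    stages; here it is just an in-memory dictionary.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: collapse the list of 1s for a word into a single total."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    # Keys are processed in sorted order, as reducers see them sorted.
    return dict(reduce_phase(word, counts) for word, counts in sorted(grouped.items()))


if __name__ == "__main__":
    print(word_count(["the quick fox", "the lazy dog the end"]))
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both stages can be spread across many machines, which is the property the framework exploits.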
Getting Started
• Cloudera and Hortonworks offer sandboxes that are easy to download and are fully contained implementations in a VM. Hadoop can also be downloaded directly from Apache.
http://hortonworks.com/products/hortonworks-sandbox/
http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-3-x.html
http://hadoop.apache.org/releases.html
Developing In Map / Reduce
• Standalone Mode – Hadoop runs as a single process, best for debugging
• Pseudo-Distributed – Separate processes on the same server
• Fully Distributed – Full-blown cluster
Eclipse Framework
• Write code in Eclipse
• PC or Linux
• Options:
  • Run Hadoop on Windows
  • Run Eclipse in Linux with Plugin
  • Run Eclipse in Windows, remote debug and profiling
• Profiling: YourKit
WordCount
• Create a project in Eclipse
• Load wordcount code (widely available and in sandbox downloads)
• Compile jar file
• Execute on Hadoop in standalone mode
$ hadoop jar path/to/file.jar input output
Monitoring Hadoop Jobs
Resources
http://www.cloudera.com
http://www.hortonworks.com
http://hadoop.apache.org
http://web.stanford.edu/class/cs246/homeworks/tutorial.pdf
Hadoop: The Definitive Guide by Tom White
Example: Document AutoClustering using Hadoop and Storm
https://www.youtube.com/watch?v=5X65WV0n4rU