Overview of Hadoop for Data Mining Federal Big Data Group confidential Mark Silverman Treeminer, Inc. 155 Gibbs Street Suite 514 Rockville, Maryland 20850 (240) 389-0750 [email protected]


Post on 22-Dec-2015



Overview of Hadoop for Data Mining
Federal Big Data Group
Confidential

Mark Silverman
Treeminer, Inc.

155 Gibbs Street, Suite 514
Rockville, Maryland 20850

(240) 389-0750 [email protected]


TREEMINER, INC. CONFIDENTIAL

Agenda

• Introduction to Hadoop
• Developing and testing a Map/Reduce application
• Auto-Clustering in Hadoop and Interworking with Apache Storm


Introduction to Hadoop

• Hadoop consists of:
  • Clustered, distributed, highly available file system (HDFS)
  • Execution framework (Map/Reduce)


Hadoop File System

• “Rack” aware
• Local storage
• Distributed copies (generally 3)

[Figure: Rack]
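To make the rack-aware, three-copy idea concrete, here is a minimal sketch in plain Python. It is not Hadoop code. It assumes the commonly documented HDFS default placement (first copy on the writer's node, second on a node in a different rack, third on another node in that same remote rack). The function name `place_replicas`, the `topology` dictionary, and the node and rack names are all hypothetical, chosen only for illustration.

```python
import random


def place_replicas(topology, writer_node, rng=random):
    """Pick three nodes to hold the copies of one block.

    topology: dict mapping rack name -> list of node names (hypothetical layout).
    Assumes at least two racks, and at least two nodes in the remote rack.
    """
    # Reverse lookup: which rack does each node live in?
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]

    # Copy 1: local storage on the node that is writing the block.
    first = writer_node

    # Copy 2: a node in a different rack, so a whole-rack failure loses only one copy.
    remote_rack = rng.choice([rack for rack in topology if rack != local_rack])
    second = rng.choice(topology[remote_rack])

    # Copy 3: a different node in that same remote rack.
    third = rng.choice([node for node in topology[remote_rack] if node != second])

    return [first, second, third]


if __name__ == "__main__":
    # Hypothetical two-rack cluster.
    cluster = {"rack1": ["node1", "node2"], "rack2": ["node3", "node4"]}
    print(place_replicas(cluster, "node1"))
```

With this layout, any single node or single rack can fail and at least one copy of the block survives.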


Sample Hadoop File System


Hadoop “Eco-System”

• Hive: Allows SQL-like querying of data in HDFS

• Pig: Basic scripting language for Hadoop

• Databases: HBase, Accumulo, Cassandra, Neo4j


Map / Reduce: Parallel Execution Framework


Map / Reduce: Parallel Execution Framework
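The parallel execution idea can be sketched in plain Python. This is only an illustration of the map, shuffle, and reduce phases, not the Hadoop API: a thread pool stands in for cluster nodes, and `run_map_reduce`, its parameters, and the sensor readings in the demo are hypothetical names and data.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_map_reduce(splits, mapper, reducer, num_reducers=2):
    """Run mapper over each input split in parallel, then reduce per key."""

    def run_map_task(split):
        # One map task per input split; it emits (key, value) pairs.
        return [pair for record in split for pair in mapper(record)]

    # Map phase: splits are independent, so they can run side by side.
    with ThreadPoolExecutor() as pool:
        map_outputs = list(pool.map(run_map_task, splits))

    # Shuffle phase: send each key to one reducer partition and group its values.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            partitions[hash(key) % num_reducers][key].append(value)

    def run_reduce_task(groups):
        # One reduce task per partition; it sees every value for each of its keys.
        return {key: reducer(key, values) for key, values in groups.items()}

    # Reduce phase: partitions are independent as well.
    with ThreadPoolExecutor() as pool:
        reduce_outputs = list(pool.map(run_reduce_task, partitions))

    merged = {}
    for part in reduce_outputs:
        merged.update(part)
    return merged


if __name__ == "__main__":
    # Hypothetical data: (sensor, reading) records spread over two splits.
    splits = [[("a", 3), ("b", 1)], [("a", 7), ("b", 2)]]
    highest = run_map_reduce(splits, lambda rec: [rec], lambda key, vals: max(vals))
    print(highest)
```

The design point is that the mapper and reducer never share state; all communication goes through the grouped key/value pairs, which is what lets the framework spread the tasks across machines.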


WordCount Example
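As a rough illustration of what the WordCount example computes, the sketch below mimics its logic in plain Python: the map step emits a (word, 1) pair for every whitespace-separated token, and the reduce step sums the ones for each word. It is a stand-in for the Java example shipped with Hadoop, not that code itself, and the function names `map_words`, `reduce_counts`, and `word_count` are hypothetical.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit (word, 1) for every whitespace-separated token."""
    for word in line.split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: total the ones emitted for a single word."""
    return word, sum(counts)


def word_count(lines):
    # Group the mapper output by word, as the shuffle would do on a cluster.
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_words(line):
            grouped[word].append(one)
    # Reduce each word's list of ones to a single count.
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    # Hypothetical input lines.
    print(word_count(["hello hadoop", "hello storm"]))
```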


Getting Started

• Cloudera and Hortonworks offer sandboxes that are easy to download and are fully contained implementations in a VM. Hadoop can also be downloaded directly from Apache.

http://hortonworks.com/products/hortonworks-sandbox/
http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-3-x.html
http://hadoop.apache.org/releases.html


Developing In Map / Reduce

• Standalone Mode – Hadoop runs as a single process, best for debugging

• Pseudo-Distributed – Separate processes on the same server

• Fully Distributed – Full-blown cluster


Eclipse Framework

• Write code in Eclipse
• PC or Linux
• Options:
  • Run Hadoop on Windows
  • Run Eclipse in Linux with Plugin
  • Run Eclipse in Windows, Remote debug and profiling
• Profiling: YourKit


WordCount

• Create a project in Eclipse
• Load wordcount code (widely available and in sandbox downloads)
• Compile jar file
• Execute on Hadoop in standalone mode

$ hadoop jar path/to/file.jar input output


Monitoring Hadoop Jobs


Monitoring Hadoop Jobs


Resources

http://www.cloudera.com

http://www.hortonworks.com

http://hadoop.apache.org

http://web.stanford.edu/class/cs246/homeworks/tutorial.pdf

Hadoop: The Definitive Guide by Tom White


Example: Document AutoClustering using Hadoop and Storm

https://www.youtube.com/watch?v=5X65WV0n4rU