K-means in Hadoop: K-means && Spark && Plan

Upload: tianwei-liu

Post on 06-Dec-2014


TRANSCRIPT

Page 1: Kmeans in-hadoop

K-means in Hadoop: K-means && Spark && Plan

Page 2: Kmeans in-hadoop

Outline

• K-means
• Spark
• Plan

23/4/10 2

Page 3: Kmeans in-hadoop

K-means in Hadoop

• Programs:
  • Kmeans.py: the k-means core algorithm
  • Wrapper.py: local controller for the k-means iterations
  • Generator.py: generates random data within a given range
  • Graph.py: plots the data


Page 4: Kmeans in-hadoop

Flowchart


Page 5: Kmeans in-hadoop

Kmeans.py

• Uses the "in-mapper combining" technique to implement combiner functionality inside every map task. Note: this is not the combiner phase.

• It makes a discrete Combine step between Map and Reduce unnecessary. Ordinarily there is no guarantee that a combiner function will be called on every mapper, or that, if called, it will be called only once.

• With the in-mapper combiner design pattern, combiner-like key aggregation is guaranteed to happen in every mapper, instead of optionally in some mappers.
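A minimal sketch of the idea in Python, assuming a Streaming-style mapper that receives one point per line and already holds the current centroids (the function names here are illustrative, not the actual Kmeans.py code):

```python
def closest_centroid(point, centroids):
    """Index of the centroid nearest to `point` (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def mapper(lines, centroids):
    """In-mapper combining: partial sums are aggregated per centroid in an
    in-memory dict, and exactly one (centroid_id, (sum_vector, count)) pair
    is emitted per centroid at the end of the map task -- instead of one
    pair per input point, as a plain mapper plus optional combiner would do."""
    sums, counts = {}, {}
    for line in lines:
        point = [float(x) for x in line.split()]
        cid = closest_centroid(point, centroids)
        if cid in sums:
            sums[cid] = [s + p for s, p in zip(sums[cid], point)]
            counts[cid] += 1
        else:
            sums[cid] = list(point)
            counts[cid] = 1
    # Single, guaranteed emission per key: this is what makes a separate
    # Combine step unnecessary.
    for cid in sorted(sums):
        yield cid, (sums[cid], counts[cid])
```

In a real Hadoop Streaming job these pairs would be printed to stdout as tab-separated lines; yielding them keeps the logic testable locally.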


Page 6: Kmeans in-hadoop

Kmeans.py

• The aggregation is done entirely in memory, without touching disk, and it happens before any emission code is called.

• However, it does not guard against memory exhaustion: the in-memory buffer can grow without bound. We use Python-side logic to control this condition.

• Results (3.6 GB test dataset):
  • Old: 30+ min
  • Current: 9+ min; the reduce phase now takes only 1-2 seconds, saving significant time.
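One hedged way to bound that in-memory buffer (the cap and helper names are illustrative, not taken from the actual Kmeans.py): merge partial aggregates into a dict and flush early if it grows past a limit. Any partial pairs flushed this way are simply re-merged in the reducer, so correctness is preserved.

```python
def add_with_flush(buffer, key, partial, merge, emit, max_keys=100_000):
    """Merge `partial` into buffer[key]; if the buffer holds more than
    `max_keys` distinct keys, emit everything downstream and clear it.
    `max_keys` is an illustrative cap, tuned to the map task's heap size."""
    if key in buffer:
        buffer[key] = merge(buffer[key], partial)
    else:
        buffer[key] = partial
    if len(buffer) > max_keys:
        for k, v in buffer.items():
            emit(k, v)   # in Hadoop Streaming: a tab-separated print to stdout
        buffer.clear()
```

For k-means itself the buffer holds at most k keys, so the cap rarely triggers; the pattern matters for in-mapper combining over jobs with large key spaces.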


Page 7: Kmeans in-hadoop

Generator.py
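The figure on this slide did not survive the transcript. Based on the Page 3 description ("generate data in random of range"), a plausible sketch of what Generator.py does; the function and parameter names are hypothetical:

```python
import random

def generate(n_points, n_dims, low, high, seed=None):
    """Generate n_points points of n_dims coordinates each, drawn uniformly
    from [low, high] -- the 'random data within a range' role that Page 3
    ascribes to Generator.py."""
    rng = random.Random(seed)
    return [[rng.uniform(low, high) for _ in range(n_dims)]
            for _ in range(n_points)]
```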


Page 8: Kmeans in-hadoop

Wrapper.py

• Main controller for the k-means iterations
• Functions:
  • Start the map-reduce job
  • Ship the centroid data and programs to the mapper phase
  • Check whether the run has converged
• Results:
  • Source: 13 clusters
  • Target: 10 clusters -> 180+ iterations
  • Target: 13 clusters -> 7-8 iterations
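The control loop described above can be sketched as follows. This is a minimal sketch: `run_job` stands in for submitting one Hadoop Streaming pass (e.g. via subprocess) and returning the updated centroids, and the epsilon threshold is an assumption, not Wrapper.py's actual value:

```python
def total_shift(old, new):
    """Total squared movement of all centroids between two iterations;
    the wrapper stops once this falls to (near) zero."""
    return sum((a - b) ** 2
               for oc, nc in zip(old, new)
               for a, b in zip(oc, nc))

def run_iterations(centroids, run_job, epsilon=1e-6, max_iter=200):
    """Wrapper-style control loop: submit one MapReduce pass, read back
    the updated centroids, and repeat until they stop moving or
    `max_iter` passes have run."""
    for it in range(max_iter):
        new_centroids = run_job(centroids)
        if total_shift(centroids, new_centroids) <= epsilon:
            return new_centroids, it + 1
        centroids = new_centroids
    return centroids, max_iter
```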


Page 9: Kmeans in-hadoop

Processing(13-clusters)

• 110331.286264 -> 43648.070121
• 43648.070121 -> 22167.351291
• 22167.351291 -> 5853.008014
• 5853.008014 -> 552.292067
• 552.292067 -> 8.202320
• 8.202320 -> 0.000000
• 0.000000 -> 0.000000


Page 10: Kmeans in-hadoop

Spark

• In-memory, high performance, uses Scala
• Spark introduces in-memory distributed datasets; besides supporting interactive queries, it can also optimize iterative workloads.
• Spark and Scala are tightly integrated: Scala can manipulate distributed datasets as easily as local collection objects.
• Although Spark was created to support iterative jobs on distributed datasets, it is actually a complement to Hadoop and can run in parallel on the Hadoop file system.
• Scala is a multi-paradigm language that supports imperative, functional, and object-oriented language features in a fluent, comfortable way.


Page 11: Kmeans in-hadoop

Spark

• Spark is designed for a specific class of cluster-computing workloads: those that reuse a working dataset across parallel operations, such as machine-learning algorithms.
• Spark introduces the concept of in-memory cluster computing, in which datasets are cached in memory to reduce access latency.


Page 12: Kmeans in-hadoop

Other big-data analysis frameworks

• GraphLab: focuses on parallel implementations of machine-learning algorithms
• Storm: "the Hadoop of real-time processing", focused mainly on stream processing and continuous computation (stream processing can deliver results as they are computed). Storm is written in Clojure (a dialect of Lisp), but it supports applications written in any language, such as Ruby and Python.


Page 13: Kmeans in-hadoop

Plan

• Get the 27 PCs running properly in Hadoop
• Remote management: write shell scripts for power saving, task submission by anyone, etc.
• Build Mesos, Spark, ZooKeeper, and HBase on our platform.


Page 14: Kmeans in-hadoop

Thanks
