K-means in Hadoop: K-means && Spark && Plan

Upload: tianwei-liu

Post on 06-Dec-2014


TRANSCRIPT

Page 1: Kmeans in-hadoop

K-means in Hadoop: K-means && Spark && Plan

Page 2: Kmeans in-hadoop

Outline

• K-means
• Spark
• Plan

23/4/10 2

Page 3: Kmeans in-hadoop

K-means in Hadoop

• Programs:
  • Kmeans.py: the k-means core algorithm
  • Wrapper.py: local controller for the k-means iterations
  • Generator.py: generates random data within a given range
  • Graph.py: plots the data


Page 4: Kmeans in-hadoop

Flowchart


Page 5: Kmeans in-hadoop

Kmeans.py

• Uses the "in-mapper combining" technique to implement combiner functionality inside every map task. Note: this is not the combiner phase.

• It makes a discrete Combine step between Map and Reduce unnecessary. Ordinarily there is no guarantee that a combiner function will be called on every mapper, or that, if called, it will be called only once.

• With the in-mapper combiner design pattern, combiner-like key aggregation is guaranteed to happen in every mapper, instead of optionally in some mappers.
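A minimal sketch of the idea in Python, assuming a Streaming-style mapper that receives one point per line and already holds the current centroids (the function names here are illustrative, not the actual Kmeans.py code):

```python
def closest_centroid(point, centroids):
    """Index of the centroid nearest to `point` (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def mapper(lines, centroids):
    """In-mapper combining: partial sums are aggregated per centroid in an
    in-memory dict, and exactly one (centroid_id, (sum_vector, count)) pair
    is emitted per centroid at the end of the map task -- instead of one
    pair per input point, as a plain mapper plus optional combiner would do."""
    sums, counts = {}, {}
    for line in lines:
        point = [float(x) for x in line.split()]
        cid = closest_centroid(point, centroids)
        if cid in sums:
            sums[cid] = [s + p for s, p in zip(sums[cid], point)]
            counts[cid] += 1
        else:
            sums[cid] = list(point)
            counts[cid] = 1
    # Single, guaranteed emission per key: this is what makes a separate
    # Combine step unnecessary.
    for cid in sorted(sums):
        yield cid, (sums[cid], counts[cid])
```

In a real Hadoop Streaming job these pairs would be printed to stdout as tab-separated lines; yielding them keeps the logic testable locally.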


Page 6: Kmeans in-hadoop

Kmeans.py

• The aggregation is done entirely in memory, without touching disk, and it happens before any emission code is called.

• However, it does not guard against memory exhaustion: the in-memory buffer can grow without bound. We use Python-side logic to control this condition.

• Results (3.6 GB test dataset):
  • Old: 30+ min
  • Current: 9+ min; the reduce phase now takes only 1-2 seconds, saving significant time.
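One hedged way to bound that in-memory buffer (the cap and helper names are illustrative, not taken from the actual Kmeans.py): merge partial aggregates into a dict and flush early if it grows past a limit. Any partial pairs flushed this way are simply re-merged in the reducer, so correctness is preserved.

```python
def add_with_flush(buffer, key, partial, merge, emit, max_keys=100_000):
    """Merge `partial` into buffer[key]; if the buffer holds more than
    `max_keys` distinct keys, emit everything downstream and clear it.
    `max_keys` is an illustrative cap, tuned to the map task's heap size."""
    if key in buffer:
        buffer[key] = merge(buffer[key], partial)
    else:
        buffer[key] = partial
    if len(buffer) > max_keys:
        for k, v in buffer.items():
            emit(k, v)   # in Hadoop Streaming: a tab-separated print to stdout
        buffer.clear()
```

For k-means itself the buffer holds at most k keys, so the cap rarely triggers; the pattern matters for in-mapper combining over jobs with large key spaces.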


Page 7: Kmeans in-hadoop

Generator.py
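The figure on this slide did not survive the transcript. Based on the Page 3 description ("generate data in random of range"), a plausible sketch of what Generator.py does; the function and parameter names are hypothetical:

```python
import random

def generate(n_points, n_dims, low, high, seed=None):
    """Generate n_points points of n_dims coordinates each, drawn uniformly
    from [low, high] -- the 'random data within a range' role that Page 3
    ascribes to Generator.py."""
    rng = random.Random(seed)
    return [[rng.uniform(low, high) for _ in range(n_dims)]
            for _ in range(n_points)]
```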


Page 8: Kmeans in-hadoop

Wrapper.py

• Main controller for the k-means iterations
• Functions:
  • Start the map-reduce job
  • Ship the centroid data and programs to the mapper phase
  • Check whether the run has converged
• Results:
  • Source: 13 clusters
  • Target: 10 clusters -> 180+ iterations
  • Target: 13 clusters -> 7-8 iterations
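The control loop described above can be sketched as follows. This is a minimal sketch: `run_job` stands in for submitting one Hadoop Streaming pass (e.g. via subprocess) and returning the updated centroids, and the epsilon threshold is an assumption, not Wrapper.py's actual value:

```python
def total_shift(old, new):
    """Total squared movement of all centroids between two iterations;
    the wrapper stops once this falls to (near) zero."""
    return sum((a - b) ** 2
               for oc, nc in zip(old, new)
               for a, b in zip(oc, nc))

def run_iterations(centroids, run_job, epsilon=1e-6, max_iter=200):
    """Wrapper-style control loop: submit one MapReduce pass, read back
    the updated centroids, and repeat until they stop moving or
    `max_iter` passes have run."""
    for it in range(max_iter):
        new_centroids = run_job(centroids)
        if total_shift(centroids, new_centroids) <= epsilon:
            return new_centroids, it + 1
        centroids = new_centroids
    return centroids, max_iter
```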


Page 9: Kmeans in-hadoop

Processing(13-clusters)

• 110331.286264 -> 43648.070121
• 43648.070121 -> 22167.351291
• 22167.351291 -> 5853.008014
• 5853.008014 -> 552.292067
• 552.292067 -> 8.202320
• 8.202320 -> 0.000000
• 0.000000 -> 0.000000


Page 10: Kmeans in-hadoop

Spark

• In-memory, high performance, uses Scala
• Spark introduces in-memory distributed datasets; besides supporting interactive queries, it can also optimize iterative workloads.
• Spark and Scala are tightly integrated: Scala can manipulate distributed datasets as easily as local collection objects.
• Although Spark was created to support iterative jobs on distributed datasets, it is actually a complement to Hadoop and can run in parallel on the Hadoop file system.
• Scala is a multi-paradigm language that supports imperative, functional, and object-oriented language features in a fluent, comfortable way.


Page 11: Kmeans in-hadoop

Spark

• Spark is designed for a specific class of cluster-computing workloads: those that reuse a working dataset across parallel operations, such as machine-learning algorithms.
• Spark introduces the concept of in-memory cluster computing, in which datasets are cached in memory to reduce access latency.


Page 12: Kmeans in-hadoop

Other big-data analysis frameworks

• GraphLab: focuses on parallel implementations of machine-learning algorithms
• Storm: "the Hadoop of real-time processing", focused mainly on stream processing and continuous computation (stream processing can deliver results as they are computed). Storm is written in Clojure (a dialect of Lisp), but it supports applications written in any language, such as Ruby and Python.


Page 13: Kmeans in-hadoop

Plan

• Get the 27 PCs running properly in Hadoop
• Remote management: write shell scripts for power saving, task submission by anyone, etc.
• Build Mesos, Spark, ZooKeeper, and HBase on our platform.


Page 14: Kmeans in-hadoop

Thanks
