spark shark

Spark/Shark

@oza_x86

Tuesday, October 22, 13

Who are you?

Tsuyoshi Ozawa @oza_x86

OSS developer

I work on Apache Hadoop development

github : oza

I wrote chapter 22 of the book on the left!

Agenda

• Hadoop/MapReduce recap

• Spark overview

• Shark overview

• Distributed processing platform

• Processes data quickly by using many machines

• Open Source!

Hadoop architecture

[Figure: per-node layout, consisting of a MapReduce processing part and a data storage part]

What MapReduce provides

• MapReduce

• Distributed, parallel execution of processing

• Fault tolerance

• Infrastructure for job monitoring

• An abstracted interface for developers (Map/Reduce)

• Source

• http://www.slideshare.net/shiumachi/impala-15324018

MapReduce overview

[Figure: read input, parallel processing, Shuffle, aggregation (reduce), write results]
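The phases in the overview can be sketched with plain Scala collections. This is only an illustration on a single machine, not Hadoop code; the object name `MapReduceSketch`, its function names, and the sample input are assumptions made for this example.

```scala
object MapReduceSketch {
  // Map phase: turn one input record (a line) into (key, value) pairs.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.split("\\s+").toSeq.filter(_.nonEmpty).map(word => (word, 1))

  // Shuffle phase: bring all values that share a key together.
  def shufflePhase(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (key, kvs) => key -> kvs.map(_._2) }

  // Reduce phase: aggregate the values collected for each key.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (key, values) => key -> values.sum }

  // Read input -> map -> shuffle -> reduce, all in memory.
  def run(input: Seq[String]): Map[String, Int] =
    reducePhase(shufflePhase(input.flatMap(mapPhase)))

  def main(args: Array[String]): Unit = {
    val input = Seq("spark shark", "spark hadoop") // hypothetical input lines
    println(run(input))
  }
}
```

In a real cluster the map and reduce functions run on many machines and the shuffle moves data over the network; here each phase is just a function call.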

By the way...

• Running machine learning on data accumulated in HDFS to perform advanced analytics is being done in many places

• Mahout (a library on Hadoop)

• Jubatus (an online learning platform)

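Problems with MapReduce

• Performance is poor for iterative workloads such as machine learning

• Why?

• Problems caused by resource allocation

• Launching a program takes about 15 sec

• Problems caused by disk writes

• The overhead of writing to HDFS is large

• Shuffle writes its data out to local disk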

Enter Spark

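What is Spark?

• Puts a special cache on top of HDFS to speed up iterative processing

• Provides a DSL for writing machine learning code

• Many APIs are defined beyond Map/Reduce

• Code written in the DSL is distributed automatically

• Thanks to a carefully tuned implementation, startup does not take 15 sec

• Apache Incubator

• http://spark-project.org/

• Written in Scala, about 20k lines of code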

Why a cache?

[Figure: MapReduce flow with the labels "read input", "Shuffle", "reduce"]

[Figure: the same flow with the labels "read input", "reduce"]
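A minimal sketch of why caching helps iterative jobs, in plain Scala rather than Spark. The `Storage` class is a hypothetical stand-in for an expensive HDFS read, and `step` is a toy update rule invented for this example; neither comes from the slides.

```scala
object CacheSketch {
  // Hypothetical stand-in for an expensive HDFS read; counts how often it runs.
  final class Storage(data: Vector[Double]) {
    var reads: Int = 0
    def load(): Vector[Double] = { reads += 1; data }
  }

  // One toy iteration: move the estimate halfway toward the mean of the points.
  def step(points: Vector[Double], estimate: Double): Double =
    estimate + 0.5 * (points.sum / points.size - estimate)

  // MapReduce style: every iteration reads the input from storage again.
  def iterateWithoutCache(storage: Storage, iterations: Int): Double = {
    var estimate = 0.0
    for (_ <- 1 to iterations) {
      val points = storage.load()
      estimate = step(points, estimate)
    }
    estimate
  }

  // Cached style: read once, keep the data in memory, reuse it every iteration.
  def iterateWithCache(storage: Storage, iterations: Int): Double = {
    val cached = storage.load()
    var estimate = 0.0
    for (_ <- 1 to iterations) estimate = step(cached, estimate)
    estimate
  }

  def main(args: Array[String]): Unit = {
    val uncached = new Storage(Vector(1.0, 2.0, 3.0))
    val cached = new Storage(Vector(1.0, 2.0, 3.0))
    iterateWithoutCache(uncached, 100)
    iterateWithCache(cached, 100)
    println(s"reads without cache: ${uncached.reads}, with cache: ${cached.reads}")
  }
}
```

Both variants compute the same result; only the number of reads differs (one per iteration versus one in total).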

Why a special cache?

• A cache is volatile → if the process dies, the cache has to be rebuilt

• You end up redoing the entire computation...

Consider 100 iterations of processing

[Figure sequence: the 99th iteration (HDFS, map, reduce) produces the 99th result; the cache breaks; the input is read again and the computation starts over!!]

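The special cache

• Resilient Distributed Datasets

• A mechanism that looks at the dependencies between processing steps and recovers from a checkpoint with the minimum amount of recomputation

• Naive caching performs poorly, so Java objects are stored as they are

• https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf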
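The recovery idea can be sketched as follows. This is a toy model in plain Scala, not Spark's RDD implementation: the names `Dataset`, `Checkpoint`, `Derived`, and `dropCache` are hypothetical, and real RDDs track dependencies per partition rather than per whole dataset. Each derived dataset remembers its parent and the function that produced it, so a lost cache entry is rebuilt from the nearest surviving ancestor instead of from the beginning.

```scala
object LineageSketch {
  // A dataset either is durably stored (a checkpoint) or is derived from a parent.
  sealed trait Dataset { def compute(): Vector[Int] }

  // Durable data: always available, never recomputed.
  final case class Checkpoint(data: Vector[Int]) extends Dataset {
    def compute(): Vector[Int] = data
  }

  // Derived data: cached in memory, rebuilt from the parent when the cache is gone.
  final class Derived(parent: Dataset, f: Int => Int) extends Dataset {
    private var cache: Option[Vector[Int]] = None
    var recomputations: Int = 0 // how many times this step actually ran

    def compute(): Vector[Int] = cache match {
      case Some(values) => values
      case None =>
        recomputations += 1
        val values = parent.compute().map(f)
        cache = Some(values)
        values
    }

    // Simulates losing the in-memory copy, for example when a process dies.
    def dropCache(): Unit = { cache = None }
  }

  def main(args: Array[String]): Unit = {
    val base = Checkpoint(Vector(1, 2, 3))
    val step1 = new Derived(base, _ + 1)
    val step2 = new Derived(step1, _ * 2)
    step2.compute()
    step2.dropCache() // only step2 is lost; step1 is still cached
    step2.compute()
    println(s"step1 ran ${step1.recomputations} time(s), step2 ran ${step2.recomputations} time(s)")
  }
}
```

When only the last step loses its cache, only that step runs again; the earlier steps are served from their surviving caches.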

• http://spark-project.org/examples/

val file = spark.textFile("hdfs://...")
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

Fast startup

• Starts up within 1 second

• How it works

• Spark avoids this problem by using a fast event-driven RPC library to launch tasks and by reusing its worker processes. It can launch thousands of tasks per second with only about 5 ms of overhead per task, making task lengths of 50–100 ms and MapReduce jobs of 500 ms viable. What surprised us is how much this affected query performance, even in large (multi-minute) queries [2].

• In short: with enough implementation effort, a single task now launches in 5 msec

• [2] Shark: SQL and Rich Analytics at Scale
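A back-of-the-envelope calculation makes the difference concrete. The 15 sec and 5 ms figures come from the slides; treating the 15 sec as a fixed cost paid once per iteration, and the 100 ms task length, are assumptions made only for this sketch.

```scala
object OverheadSketch {
  // Fraction of a task's wall-clock time that is spent on launch overhead.
  def overheadFraction(overheadMs: Double, workMs: Double): Double =
    overheadMs / (overheadMs + workMs)

  // Total launch overhead when the fixed cost is paid once per iteration (assumption).
  def totalOverheadSec(perLaunchSec: Double, iterations: Int): Double =
    perLaunchSec * iterations

  def main(args: Array[String]): Unit = {
    val workMs = 100.0 // hypothetical task length, upper end of the quoted 50-100 ms range
    println(f"15 sec launch, 100 ms task: ${overheadFraction(15000.0, workMs) * 100}%.1f%% overhead")
    println(f"5 ms launch, 100 ms task:   ${overheadFraction(5.0, workMs) * 100}%.1f%% overhead")
    println(s"100 iterations at 15 sec each: ${totalOverheadSec(15.0, 100)} sec of launch overhead")
  }
}
```

Under these assumptions, 100 launches at 15 sec each add up to 1500 sec of pure launch overhead, which is why short tasks only become practical once the per-task overhead drops to milliseconds.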

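Comparing Spark and MapReduce

• Hadoop/MapReduce

• Users write functions called Map and Reduce

• Checkpoints are taken automatically

• Startup takes about 15 sec

• Easy to write all sorts of things

[Figure: MapReduce versus Spark (the cache lives inside Spark!!)]

• Spark

• Users write DSL code

• Users take checkpoints themselves

• Startup takes less than 1 sec

• Easy to write all sorts of things...??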

Benchmark results

• Pretty fast

• https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

Apache Hive

Write something SQL-like and it is compiled into a MapReduce program

Spark DSL...?

• Writing this for every aggregation

val file = spark.textFile("hdfs://...")
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

• Isn't that rather painful...?

Enter Shark

What is Shark?

• SQL, now available on Spark

Shark (SQL)

CREATE TABLE logs_last_month_cached AS SELECT * FROM logs WHERE time > date(...);

SELECT page, count(*) c FROM logs_last_month_cached GROUP BY page ORDER BY c DESC LIMIT 10;
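To give a feel for what the two queries above express, here is a rough equivalent using plain Scala collections. It is not Shark's generated code: the `LogEntry` type, the `Long` timestamp, and the tie-breaking rule on the page name are assumptions made for this sketch.

```scala
object TopPagesSketch {
  // Hypothetical row type; the queries only mention the columns `page` and `time`.
  final case class LogEntry(page: String, time: Long)

  // Roughly: CREATE TABLE ... AS SELECT * FROM logs WHERE time > <cutoff>
  def lastMonth(logs: Seq[LogEntry], cutoff: Long): Seq[LogEntry] =
    logs.filter(_.time > cutoff)

  // Roughly: SELECT page, count(*) c FROM ... GROUP BY page ORDER BY c DESC LIMIT n
  def topPages(logs: Seq[LogEntry], n: Int): Seq[(String, Int)] =
    logs
      .groupBy(_.page)                                       // GROUP BY page
      .map { case (page, entries) => (page, entries.size) }  // count(*)
      .toSeq
      .sortBy { case (page, count) => (-count, page) }       // ORDER BY c DESC; ties by name (assumption)
      .take(n)                                               // LIMIT n

  def main(args: Array[String]): Unit = {
    val logs = Seq(
      LogEntry("/home", 10L), LogEntry("/home", 20L),
      LogEntry("/about", 30L), LogEntry("/old", 1L))
    println(topPages(lastMonth(logs, 5L), 10))
  }
}
```

The SQL version says the same thing in two statements, which is the convenience Shark adds on top of the DSL.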

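Key points of Shark

• Compiles SQL into the Spark DSL

• A Spark version (fork) of Apache Hive

• Manages the cache carefully to get good performance

• Carefully designed storage format

• Carefully chosen cache placement

• Emphasizes close integration with Spark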

A bit more detail

• [2] Shark: SQL and Rich Analytics at Scale

Shark benchmark

selection query:

SELECT pageURL, pageRank FROM rankings WHERE pageRank > X;

Why it is fast on a simple query under the same conditions

• Spark starts up quickly

• Task assignment is fast

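Summary

• Spark, a processing platform for machine learning

• Up to 100x faster than Hadoop on iterative processing

• Shark, the SQL interface for Spark

• Several times faster than Hive thanks to fast caching, a fast underlying runtime, and query optimization

• Faster than Hive even on basic queries such as select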
