NCKU CSIE Talk about Spark


DESCRIPTION

My talk at NCKU CSIE.

TRANSCRIPT

Introduction to Spark

Wisely Chen (thegiive@gmail.com)

Sr. Engineer at Yahoo

Agenda

• Big data will change the world?

• What is Spark?

• Demo (Start a Spark cluster / Word Count)

• Break : 10min

• Spark Concept

• Demo (ETL / MLlib)

• Q&A

Who am I?

• Wisely Chen (thegiive@gmail.com)

• Sr. Engineer on the Yahoo! Taiwan data team

• Loves to promote open source tech; speaker at:

• Hadoop Summit 2013 San Jose

• Jenkins Conf 2013 Palo Alto

• Spark Summit 2014 San Francisco

• COSCUP 2006, 2012, 2013; OSDC 2007; WebConf 2013; PHPConf 2012; RubyConf 2012

Taiwan Data Team: Data Highway, BI Report, Serving API, Data Mart, ETL / Forecast, Machine Learning

Big data will change the world?

Sensor → Data → Machine Learning → Robot → more sensor data, and the loop repeats

Human Action Data → Machine Learning → Robot

What is a sensor?

The Internet of Things

More Sensors

• 2000: Browser; Digital Camera (photo)

• 2000~2014: Browser; Mobile (GPS, more photos, video); Wearable devices (pulse); Google Glass (more video); Internet of Things (………)

More Sensors → Bigger Data → Machine Learning → Robot


Technology Improves

• The Sloan Digital Sky Survey (SDSS) collected more data in its first few weeks than had been amassed in the entire history of astronomy.

• The Large Synoptic Survey Telescope in Chile, due to come on stream in 2016, will acquire that quantity of data every five days.

New Areas

A 30-minute zebrafish experiment = 1 TB (http://research.janelia.org/zebrafish/)

Hadoop handles big data well

• 18M Hadoop-related jobs on the Yahoo grid

• Yahoo handles over 440 PB of data daily

• Most jobs are ETL/SQL/BI

eBay's projected data volume: 130 EB in 2015, 4000 ZB in 2020

– Vadim Kutsyy, "Data Science Empowering Personalization", Big Data Innovation Summit 2014, Boston

Data is not only bigger

• We have more areas of data

• More sensors

• Sensor technology improves

• New areas

More Sensors → Bigger Data → Better Machine Learning → Robot

Word Grammar Check

• MS researchers Michele Banko and Eric Brill tried to improve a grammar-check algorithm

• They took four algorithms and fed in 10M, 100M, and 1B words

• At 10M words, the sophisticated algorithm (86%) worked better than the simpler algorithm (75%)

• At 1B words, the simpler algorithm (95%+) improved a lot, doing even better than the sophisticated algorithm (94%)

"Simple models and a lot of data trump more elaborate models based on less data"

– Google AI guru Peter Norvig, "The Unreasonable Effectiveness of Data"

The same holds in the translation area.

Different types of data

• In a Harvard data mining class, two teams took on the Netflix recommendation challenge

• Team A came up with a very sophisticated algorithm using the Netflix data

• Team B used a very simple algorithm, but added in additional data beyond the Netflix set

• Team B got much better results, close to the best results on the Netflix leaderboard

Taiwan Shopping User Analysis

Men tend to view underwear, but they don't buy it.

Male users' top 5 viewed categories: 1. Computers 2. Cameras 3. ………. 4. ………. 5. Women's underwear

2 types of data

• Traffic data: users' views / clicks; "weak intention"; large amount

• Transaction data: users' checkouts; "strong intention"; small amount

Small data + sophisticated algorithm = OK result

Big data + simple algorithm = better result

Data set 1 + data set 2 + a smart model that leverages more areas of data = best result

More Sensors → Bigger Data → Better Machine Learning → Helpful Robot

Robots

• Foxconn's robots can replace 70% of workers

• Google's driverless car / BigDog

• Amazon's warehouse robots

More Sensors → Bigger Data → Better Machine Learning → Helpful Robot

It is not a movie; it is happening.

The ring at Amazon: Sensor/Data = user behavior; Machine Learning = the recommendation algorithm; Robot = recommendations shown to the user (1/3 of sales come from the recommendation module).

Amazon laid off its editorial team and replaced it with the recommendation algorithm.

The ring in agriculture: Sensor/Data = weather, humidity, sun, …; Robot = give more water to area A.


Google bought 47 companies in 2013–14; 24 of them sit on the ring (Sensor → Data → Machine Learning → Robot; IoT, machine learning, robotics): DNNresearch, Behavio, Wavii, Flutter, Autofuss, DeepMind, spider.io, Adometry, Quest Visual, Jetpac, Talaria, Stackdriver, SCHAFT, Industrial Perception, Redwood Robotics, Meka Robotics, Holomni, Bot & Dolly, Boston Dynamics, Titan Aerospace, Nest Labs, MyEnergy, Skybox Imaging, Dropcam.

Google is the #1 leader in big data.


Be part of it!!!

The ring will change the world, and big data is the core of the ring.

Sensor → Data → Machine Learning → Robot

Hadoop covers the Data stage. At the Machine Learning stage, though, it does not do so well.

Hadoop is not good at machine learning

• Efficiency

• Difficulty: for data engineers, data scientists, and data analysts

• Algorithms: it is not easy to parallelize your algorithm


Hadoop is not good at machine learning

• Efficiency

• Difficulty: data scientists don't know how to do it

• Algorithms: it is not easy to parallelize your algorithm

Spark is 3x–25x faster than the MapReduce framework!

From Matei's paper: http://0rz.tw/VVqgP

[Chart: running time (s), MapReduce vs. Spark. Logistic regression: MR 76, Spark 3. KMeans: MR 106, Spark 33. PageRank: MR 171, Spark 23.]


Three personas: Data Analyst, Data Engineer, Data Scientist

Language Support

• Python: Data Scientist, Data Engineer

• Java: Data Engineer

• Scala: Data Engineer

• SQL: Data Scientist, Data Analyst, Data Engineer

• R: Data Scientist, Data Analyst (official support planned for 1.2)

Python Word Count

file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

Access data via the Spark API; process it in Python.
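For a runnable version outside the shell: in bin/pyspark the context already exists as sc, while a standalone script builds its own. A minimal sketch, assuming a local input file (the path and app name here are made up):

from pyspark import SparkContext

sc = SparkContext("local[2]", "WordCount")              # local mode, 2 threads
counts = (sc.textFile("file:///tmp/input.txt")          # hypothetical input path
            .flatMap(lambda line: line.split(" "))      # line -> words
            .map(lambda word: (word, 1))                # word -> (word, 1)
            .reduceByKey(lambda a, b: a + b))           # sum the 1s per word
counts.saveAsTextFile("file:///tmp/counts")             # one output part per partition
sc.stop()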

Scala Word Count

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Java Word Count

JavaRDD<String> file = spark.textFile("hdfs://...");
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("hdfs://...");

Highly Recommended

• Scala: latest API features, stable

• Python: very familiar language; native libs (NumPy, SciPy)

What is Spark?

• From the UC Berkeley AMP Lab

• Apache Spark™ is a very fast and general engine for large-scale data processing

• The most active big data open source project since Hadoop

Community

Where is Spark?

Hadoop 2.0: HDFS (storage) and YARN (resource management), with MapReduce, Storm, HBase, and others running on top.

Hadoop architecture: HDFS = storage; YARN = resource management; MapReduce = computing engine; Hive = SQL.

Hadoop vs. Spark: on the same HDFS and YARN, Spark replaces MapReduce as the computing engine, and Shark/SparkSQL replaces Hive as the SQL layer.

More than MapReduce

On HDFS and a resource management system (YARN, Mesos), Spark Core plays the role of MapReduce, with SparkSQL (Hive), GraphX (Pregel), MLlib (Mahout), and Spark Streaming (Storm) on top.

How to use it?

• 1. Go to https://spark.apache.org/

• 2. Download and unzip it

• 3. Run ./sbin/start-all.sh or ./bin/spark-shell (spelled out below)
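Those three steps from a terminal, roughly (the release file name is an assumption; use whatever the download page offers):

tar xzf spark-1.1.0-bin-hadoop2.4.tgz      # unpack a prebuilt release (hypothetical name)
cd spark-1.1.0-bin-hadoop2.4
./bin/spark-shell                          # interactive shell, local mode by default
./sbin/start-all.sh                        # or: launch a standalone master and workers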

EC2

• ./ec2/spark-ec2 -k xxx -i xxx -s 3 launch CLUSTERNAME

• http://spark.apache.org/docs/latest/ec2-scripts.html

DEMO

Python Word Count

file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

BREAK

Spark Concept

Why is Spark so fast?

Most machine learning algorithms need iterative computing

[Diagram: PageRank over pages a, b, c, d. All ranks start at 1.0. After the 1st iteration: a=1.85, b=1.0, c=0.58, d=0.58. After the 2nd iteration: a=1.31, b=1.72, c=0.39, d=0.58. Each iteration produces a temporary contribution result and a new rank.]
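For reference, this iteration fits in a few lines of PySpark. A minimal sketch, assuming the link graph arrives as (page, neighbor) pairs and using the usual 0.15/0.85 damping; the edge list below is made up:

def compute_contribs(kv):
    # kv = (page, (neighbors, rank)); send rank/#neighbors to each neighbor
    page, (neighbors, rank) = kv
    neighbors = list(neighbors)
    return [(dest, rank / len(neighbors)) for dest in neighbors]

links = sc.parallelize(
    [("a", "b"), ("b", "a"), ("c", "a"), ("d", "a")]    # hypothetical edges
).groupByKey().cache()                                  # neighbor lists stay in memory

ranks = links.mapValues(lambda _: 1.0)                  # every page starts at 1.0

for i in range(3):                                      # 3 iterations, as in the diagram
    contribs = links.join(ranks).flatMap(compute_contribs)
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)

print(ranks.collect())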

HDFS is 100x slower than memory

MapReduce: Input (HDFS) → Iter 1 → Tmp (HDFS) → Iter 2 → Tmp (HDFS) → … → Iter N

Spark: Input (HDFS) → Iter 1 → Tmp (Mem) → Iter 2 → Tmp (Mem) → … → Iter N

PageRank on 1 billion URL records: the first iteration (HDFS) takes 200 sec; the 2nd iteration (mem) takes 7.4 sec; the 3rd iteration (mem) takes 7.7 sec.

Memory Size Problem

• Cache storage on local disk: 2 sec

• Cache storage in memory: 2 sec

• Network transfer: 30 sec

When the data doesn't fit in RAM, spilling the cache to local disk costs about the same as memory and far less than re-reading over the network.
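Spark exposes this trade-off through storage levels. A minimal sketch (the source path is hypothetical):

from pyspark import StorageLevel

logs = sc.textFile("hdfs://.../logs")          # hypothetical source
logs.persist(StorageLevel.MEMORY_AND_DISK)     # keep in RAM, spill extra partitions to local disk
# logs.cache() is the MEMORY_ONLY shorthand: partitions that don't fit get recomputed instead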

Just memory?

• From Matei's paper: http://0rz.tw/VVqgP

• HBM: stores data in an in-memory HDFS instance

• SP: Spark

• HBM'1, SP'1: first run

• Storage: HDFS with 256 MB blocks

• Nodes: m1.xlarge EC2 instances, 4 cores, 15 GB of RAM

• 100 GB of data on a 100-node cluster

[Chart: running time (s). Logistic regression: HBM'1 139, HBM 62, SP'1 46, SP 3. KMeans: HBM'1 182, HBM 87, SP'1 82, SP 33.]

Map Reduce: Input (HDFS) → map → shuffle → reduce → Output (HDFS)


Narrow (map-side) dependencies: map, filter, union, join with co-partitioned inputs. Wide (reduce-side) dependencies: groupBy on non-partitioned data, join with inputs not co-partitioned.
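The difference is easy to see in code. A sketch with made-up data:

pairs = sc.parallelize(range(1000)).map(lambda x: (x % 10, x))   # key by x mod 10
narrow = pairs.filter(lambda kv: kv[1] > 500)    # narrow: runs partition-by-partition, no shuffle
wide = pairs.groupByKey()                        # wide: moves data across the network to group keys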

DAG Engine

A job made of groupBy, map, union, and join costs Hadoop 4 MapReduce rounds (MR1–MR4). Spark's DAG engine runs the same job as 2 MR rounds plus 1 map (MR1, Map, MR2), because the narrow steps are chained together.

MapReduce: Input (HDFS) → MR1 → Tmp (HDFS) → MR2 → Tmp (HDFS) → MR3 → Tmp (HDFS) → MR4 → Output (HDFS)

Spark: Input (HDFS) → MR1 → Tmp (MEM, CACHE) → Map → Tmp (MEM) → MR4 → Output (HDFS)

Spark cuts the DAG into stages at wide dependencies: here groupBy runs in Stage 1, while map, union, and join pipeline together inside Stage 2.

RDD

• Resilient Distributed Dataset

• An interface to data, stored in RAM or on disk

• Built through parallel transformations

val a = sc.textFile("hdfs://...")                 // RDD a
val b = a.filter(line => line.contains("Spark"))  // transformation: RDD a -> RDD b
val c = b.count()                                 // action: RDD b -> value c

Log mining

a = sc.textFile("hdfs://aaa.com/a.txt")
err = a.filter(lambda t: "ERROR" in t) \
       .filter(lambda t: "2014" in t)
err.cache()                    # keep the filtered RDD in memory
err.count()
m = err.filter(lambda t: "MYSQL" in t).count()
a = err.filter(lambda t: "APACHE" in t).count()

[Diagram: the driver distributes tasks to three workers.]

[Diagram: each worker loads its HDFS block (Block1–3) into a partition of RDD a.]

[Diagram: the two filter transformations define RDD err on each worker.]

[Diagram: err.count() is an action, so the job actually runs and the results return to the driver.]

[Diagram: because of err.cache(), each worker keeps its computed partition in memory (Cache1–3).]

[Diagram: the MYSQL filter and count (RDD m) run against the cached partitions.]

[Diagram: the APACHE filter and count run against the cache as well.]

The 1st pass (no cache) takes the usual time; with the cache, a pass takes 7 sec.

RDD Cache

• Data locality

• A big shuffle (a self-join on 5 billion records) takes 20 min; after caching, it takes only 265 ms
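That pattern looks roughly like this (the path and key extraction are made up):

pairs = sc.textFile("hdfs://.../records") \
          .map(lambda line: (line.split(",")[0], line))   # hypothetical key extraction
pairs.cache()                       # materialized by the first action, reused afterwards
joined = pairs.join(pairs)          # self-join: with the cache, both inputs come from RAM
joined.count()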

DEMO

Log Mining

PageRank


SparkSQL

Recommendation

MLlib

• Data: [(36, 2802, 4.0), (36, 256, 4.0), …], i.e. (user, movie, rating) triples

• rank and numIter are ints; lambda is the regularization parameter

• candidates: [(0, 2), (0, 3), (0, 4), …], i.e. (user, movie) pairs to score

• model = ALS.train(data, rank, numIter, lambda)

• model.predictAll(candidates)
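Put together as a runnable sketch (toy ratings; the rank and iteration values are arbitrary, and for this sketch the candidate ids must appear in the training set):

from pyspark.mllib.recommendation import ALS

# (user, movie, rating) triples; toy data
data = sc.parallelize([(36, 2802, 4.0), (36, 256, 4.0), (12, 2802, 1.0)])
model = ALS.train(data, rank=10, iterations=10, lambda_=0.01)

# (user, movie) pairs to score
candidates = sc.parallelize([(12, 256), (36, 2802)])
for r in model.predictAll(candidates).collect():
    print(r.user, r.product, r.rating)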

Homework

• 1. Install Spark and run word count (50%)

• Data: http://www.gutenberg.org/ebooks/5000

• Output: total word number (a starting point is sketched below)

• 2. Write a movie recommendation (50%)

• Training data: http://arbor.ee.ntu.edu.tw/~wisely/data/lesson.tgz

• Input: 10 ratings (1–5) on 10 movies

• Example: movie 123 rating is 3, movie 45 is 5

• Output: top 10 recommended movies

• Any algorithm is OK
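A possible starting point for part 1 (the local path is wherever you saved the Gutenberg text):

total = (sc.textFile("file:///tmp/5000.txt")   # hypothetical local copy of the ebook
           .flatMap(lambda line: line.split())
           .count())                           # total word number
print(total)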

Spark: BI (SparkSQL), Streaming (Spark Streaming), Machine Learning (MLlib)

Background Knowledge

• Tweet real-time data is stored into a SQL database

• Spark MLlib uses Wikipedia data to train a TF-IDF model

• SparkSQL selects tweets and filters them with the TF-IDF model

• Generate a live BI report

Code

val wiki = sql("select text from wiki")
val model = new TFIDF()
model.train(wiki)
registerFunction("similarity", model.similarity _)
select tweet from tweet where similarity(tweet, "$search") > 0.01

DEMO

http://youtu.be/dJQ5lV5Tldw?t=39m30s

Q & A
