apache spark intro

Post on 24-Jan-2017

Category:

Data & Analytics

Apache Spark Intro workshop

BigData Romania

Apache Spark Intro

★ Apache Spark history
★ RDD
★ Transformations
★ Actions
★ Hands-on session

Where to learn Spark?

http://spark.apache.org/

http://shop.oreilly.com/product/0636920028512.do

Spark architecture

Easy ways to run Spark?
★ your IDE (e.g. Eclipse or IDEA)
★ Standalone Deploy Mode: simplest way to deploy Spark on a single machine
★ Docker & Zeppelin
★ EMR
★ Hadoop vendors (Cloudera, Hortonworks)

Supported languages

Spark basics

★ RDD
★ Operations: Transformations and Actions

RDD

An RDD is simply an immutable distributed collection of objects!

[Figure: the elements a to q of an RDD, spread across the cluster]

Creating RDD (I)

Python
    lines = sc.parallelize(["workshop", "spark"])

Scala
    val lines = sc.parallelize(List("workshop", "spark"))

Java
    JavaRDD<String> lines = sc.parallelize(Arrays.asList("workshop", "spark"));

Creating RDD (II)

Python
    lines = sc.textFile("/path/to/file.txt")

Scala
    val lines = sc.textFile("/path/to/file.txt")

Java
    JavaRDD<String> lines = sc.textFile("/path/to/file.txt");

RDD persistence

MEMORY_ONLY
MEMORY_AND_DISK
MEMORY_ONLY_SER
MEMORY_AND_DISK_SER
DISK_ONLY
MEMORY_ONLY_2
MEMORY_AND_DISK_2
OFF_HEAP

Other data structures in Spark

★ Paired RDD
★ DataFrame
★ Dataset

Paired RDD

Paired RDD = an RDD of key/value pairs

user1 user2 user3 user4 user5

id1/user1 id2/user2 id3/user3 id4/user4 id5/user5
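The step from plain users to id/user pairs can be sketched in PySpark. This is only an illustration, not the workshop's code: `to_pair`, `make_pair_rdd` and the dictionary layout of a user record are hypothetical, and `sc` is assumed to be an already created SparkContext.

```python
def to_pair(user):
    """Turn a user record into a (key, value) tuple keyed by the user's id."""
    return (user["id"], user["name"])


def make_pair_rdd(sc):
    """Build a pair RDD from plain user records.

    `sc` is assumed to be an existing SparkContext.
    """
    users = sc.parallelize([
        {"id": "id1", "name": "user1"},
        {"id": "id2", "name": "user2"},
        {"id": "id3", "name": "user3"},
    ])
    # map() is an ordinary transformation; the result counts as a pair RDD
    # simply because every element is a 2-tuple.
    return users.map(to_pair)
```

Key-based operations such as `join` and `reduceByKey` expect elements of this (key, value) shape.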

Spark operations

[Diagram: RDD 1 to RDD 6, connected by arrows labelled Transformation and Action]

Transformations

Transformations describe how to transform an RDD into another RDD.

[Diagram: RDD 1 -> RDD 2]

RDD {1,2,3,4,5,6}
map x => x + 1: MapRDD {2,3,4,5,6,7}
filter x => x != 4: FilterRDD {1,2,3,5,6}
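The map and filter steps from this slide could look roughly like this in PySpark. The names `add_one`, `not_four` and `build_rdds` are made up for the sketch, and `sc` is assumed to be an existing SparkContext.

```python
def add_one(x):
    """Function passed to map: shifts every element by one."""
    return x + 1


def not_four(x):
    """Predicate passed to filter: keeps everything except 4."""
    return x != 4


def build_rdds(sc):
    """Build the three RDDs from the slide.

    `sc` is assumed to be an existing SparkContext. Nothing is computed
    here: map and filter only describe new RDDs.
    """
    input_rdd = sc.parallelize([1, 2, 3, 4, 5, 6])
    map_rdd = input_rdd.map(add_one)         # describes {2,3,4,5,6,7}
    filter_rdd = input_rdd.filter(not_four)  # describes {1,2,3,5,6}
    return input_rdd, map_rdd, filter_rdd
```

Both transformations start from the input RDD, which itself stays unchanged because RDDs are immutable.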

Actions

Actions compute a result from an RDD!

InputRDD {1,2,3,4,5,6}
map x => x + 1: MapRDD {2,3,4,5,6,7}
filter x => x != 4: FilterRDD {1,2,3,5,6}

count() = 6
take(2) = {1,2}
saveAsTextFile()
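A possible PySpark version of these three actions is sketched below. It assumes they are called on the input RDD, the only RDD on the slide for which both count() = 6 and take(2) = {1,2} hold. `run_actions`, `demo` and `output_dir` are hypothetical names, and `sc` is assumed to be an existing SparkContext.

```python
def run_actions(rdd, n=2):
    """Call two actions on `rdd` and return their results to the driver."""
    total = rdd.count()   # action: number of elements
    first = rdd.take(n)   # action: list with the first n elements
    return total, first


def demo(sc, output_dir):
    """Run the slide's actions on the input RDD.

    `sc` is assumed to be an existing SparkContext; `output_dir` is a
    path that must not exist yet, otherwise saveAsTextFile fails.
    """
    input_rdd = sc.parallelize([1, 2, 3, 4, 5, 6])
    total, first = run_actions(input_rdd)   # (6, [1, 2]) for this input
    input_rdd.saveAsTextFile(output_dir)    # action: writes part files
    return total, first
```

Unlike transformations, each of these calls triggers a Spark job: count() and take() bring a value back to the driver, while saveAsTextFile() writes its output to storage.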

Transformations and Actions

[Diagram, built up over three slides: users -> filter() -> administrators -> take(3), saveAsTextFile(); the last slide adds persist()]
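Why persist() appears once a second action is added can be sketched as follows. This is an assumption-laden illustration: `take_and_save` and `output_dir` are hypothetical names, `administrators` is assumed to be an RDD produced by a filter, and `level` is assumed to be one of the `pyspark.StorageLevel` constants.

```python
def take_and_save(administrators, output_dir, level):
    """Run two actions on the same RDD, persisting it first.

    `level` is assumed to be a pyspark.StorageLevel constant such as
    StorageLevel.MEMORY_ONLY.
    """
    administrators.persist(level)              # only marks the RDD; still lazy
    first_three = administrators.take(3)       # first action
    administrators.saveAsTextFile(output_dir)  # second action
    administrators.unpersist()                 # release the stored partitions
    return first_three
```

With `from pyspark import StorageLevel`, a call could look like `take_and_save(administrators, "/tmp/admins", StorageLevel.MEMORY_ONLY)`. Without persist(), every action would recompute the filter from the source data; with it, partitions computed for one action can be reused by the next.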

Lazy evaluation

users -> filter -> administrators -> take(3)

Nothing is computed until the take(3) action is called.
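The same pipeline, written as a PySpark sketch, shows where work actually happens. It assumes the users file stores one user per line as `user_id|age|gender|occupation|zipcode`; that separator, the names `line_is_administrator` and `first_administrators`, and the `path` argument are assumptions, and `sc` is an existing SparkContext.

```python
def line_is_administrator(line):
    """Check the occupation field of an assumed '|'-separated user line."""
    return line.split("|")[3] == "administrator"


def first_administrators(sc, path, n=3):
    """Return the first n administrator lines from the users file.

    `sc` is assumed to be an existing SparkContext.
    """
    users = sc.textFile(path)                             # lazy: file not read yet
    administrators = users.filter(line_is_administrator)  # lazy: nothing filtered yet
    return administrators.take(n)                         # action: work starts here
```

Because the whole chain is known before anything runs, Spark can plan the job around what the action actually needs; take(n), for instance, does not have to scan every partition if the first ones already yield n results.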

How Spark Executes Your Program

Hands-on session

MovieLens

MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota. This data set consists of:

* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)

Download link: http://grouplens.org/datasets/movielens/

MovieLens dataset

user: user_id, age, gender, occupation, zipcode

user_rating: user_id, movie_id, rating, timestamp

movie: movie_id, title, release_date, video_release, imdb_url, genres...

Exercises already solved!

★ Return only the users with occupation 'administrator'
★ Increase the age of each user by one
★ Join user and rating datasets by user id
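One way these three exercises might be approached in PySpark is sketched below; it is not the workshop's own solution. It assumes the user file is '|'-separated and the rating file is tab-separated. All function names and the two path arguments are hypothetical, and `sc` is an existing SparkContext.

```python
def parse_user(line):
    """Parse 'user_id|age|gender|occupation|zipcode' (assumed separator)."""
    user_id, age, gender, occupation, zipcode = line.split("|")
    return (int(user_id), int(age), gender, occupation, zipcode)


def parse_rating(line):
    """Parse tab-separated 'user_id movie_id rating timestamp' (assumed)."""
    user_id, movie_id, rating, timestamp = line.split("\t")
    return (int(user_id), int(movie_id), int(rating), int(timestamp))


def is_administrator(user):
    """Exercise 1: predicate for filter()."""
    return user[3] == "administrator"


def one_year_older(user):
    """Exercise 2: function for map(); returns a new tuple with age + 1."""
    user_id, age, gender, occupation, zipcode = user
    return (user_id, age + 1, gender, occupation, zipcode)


def solve(sc, users_path, ratings_path):
    """Wire the three exercises together; `sc` is an existing SparkContext."""
    users = sc.textFile(users_path).map(parse_user)
    ratings = sc.textFile(ratings_path).map(parse_rating)

    administrators = users.filter(is_administrator)
    older_users = users.map(one_year_older)

    # Exercise 3: join works on pair RDDs, so key both sides by user_id.
    users_by_id = users.keyBy(lambda user: user[0])
    ratings_by_user = ratings.keyBy(lambda rating: rating[0])
    joined = users_by_id.join(ratings_by_user)  # (user_id, (user, rating))
    return administrators, older_users, joined
```

All three results are still lazy RDDs; calling an action such as `take(3)` on any of them triggers the computation.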

Exercises to solve

★ How many men/women are registered on MovieLens?
★ Distribution of age for males/females registered on MovieLens
★ Which are the names of the movies with rating x?
★ Average rating by movie
★ Sort users by their occupation

Congrats if you reached this slide!
