Apache Spark as Cross-over Hit for Data Science

Upload: sean-owen

Post on 11-Aug-2014

Page 1: Apache Spark as Cross-Over Hit for Data Science

Apache Spark as Cross-over Hit for Data Science
Sean Owen / Director of Data Science / Cloudera

Page 2: Apache Spark as Cross-Over Hit for Data Science

Investigative vs Operational Analytics

Data Scientist
Exploratory Analytics

Predictive Data Products
Operational Analytics

Page 3: Apache Spark as Cross-Over Hit for Data Science

Tools of the Trade

Page 4: Apache Spark as Cross-Over Hit for Data Science

Trade-offs of the Tools

            Investigative                                  Operational
Data        Historical Subset, Sample                      Production Data, Large-Scale
Context     Workstation, Ad Hoc Investigation, Offline     Shared Cluster, Continuous Operation, Online
Metrics     Accuracy                                       Throughput, QPS
Library     Many, Sophisticated                            Few, Simple
Language    Scripting, High Level, Ease of Development     Systems Language, Performance

Page 5: Apache Spark as Cross-Over Hit for Data Science

R

(Trade-offs of the Tools chart from Page 4, repeated for R.)

Page 6: Apache Spark as Cross-Over Hit for Data Science

Python + scikit

(Trade-offs of the Tools chart from Page 4, repeated for Python + scikit.)

Page 7: Apache Spark as Cross-Over Hit for Data Science

MapReduce, Crunch, Mahout

(Trade-offs of the Tools chart from Page 4, repeated for MapReduce, Crunch, Mahout.)

Page 8: Apache Spark as Cross-Over Hit for Data Science

Spark: Something For Everyone

• Now Apache TLP
• From UC Berkeley

• Scala-based
• Expressive, efficient
• JVM-based

• Scala-like API
• Distributed works like local
• Like Crunch, is Collection-like

• REPL
• Interactive

• Distributed
• Hadoop-friendly
• Integrate with where data already is

• ETL no longer separate
• MLlib
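The "distributed works like local" and "collection-like" points can be illustrated with ordinary Scala collections. The sketch below is not code from the talk: it counts tags in a small in-memory Seq using flatMap, map, and groupBy. Spark's RDD API offers flatMap and map operations with the same names, so a distributed version would read much the same with the Seq swapped for an RDD. The sample data and the name `tagCounts` are assumptions made for illustration.

```scala
object CollectionLikeSketch {
  // Hypothetical sample: each string holds the space-separated tags of one question.
  val questions: Seq[String] = Seq("java mysql", "scala java", "java")

  // Count how often each tag occurs, using only standard collection operations.
  def tagCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" ").toList)   // one element per tag
      .map(tag => (tag, 1))           // pair each tag with a count of 1
      .groupBy(_._1)                  // group the pairs by tag
      .map { case (tag, pairs) => (tag, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit =
    tagCounts(questions).toSeq.sortBy(-_._2).foreach(println)
}
```

Because the code runs on a local collection, it can be tried in the Scala REPL without a cluster.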

Page 9: Apache Spark as Cross-Over Hit for Data Science

Spark

(Trade-offs of the Tools chart from Page 4, repeated for Spark.)

Page 10: Apache Spark as Cross-Over Hit for Data Science

Stack Overflow Tag Recommender Demo

• Questions have tags like java or mysql
• Recommend new tags to questions

• Available as data dump
• Jan 20 2014 Posts.xml
• 24.4GB
• 2.1M questions
• 9.3M tags (34K unique)

Page 11: Apache Spark as Cross-Over Hit for Data Science

<row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667" Score="251" ViewCount="15207" Body="&lt;p&gt;I want to use a track-bar to change a form's opacity.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;This is my code:&lt;/p&gt;&#xA;&#xA;&lt;pre&gt;&lt;code&gt;decimal trans = trackBar1.Value / 5000;&#xA;this.Opacity = trans;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&#xA;&lt;p&gt;When I try to build it, I get this error:&lt;/p&gt;&#xA;&#xA;&lt;blockquote&gt;&#xA; &lt;p&gt;Cannot implicitly convert type 'decimal' to 'double'.&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;&#xA;&lt;p&gt;I tried making &lt;strong&gt;trans&lt;/strong&gt; to &lt;strong&gt;double&lt;/strong&gt;, but then the control doesn't work. This code has worked fine for me in VB.NET in the past. &lt;/p&gt;&#xA;" OwnerUserId="8" LastEditorUserId="2648239" LastEditorDisplayName="Rich B" LastEditDate="2014-01-03T02:42:54.963" LastActivityDate="2014-01-03T02:42:54.963" Title="When setting a form's opacity should I use a decimal or double?" Tags="&lt;c#&gt;&lt;winforms&gt;&lt;forms&gt;&lt;type-conversion&gt;&lt;opacity&gt;" AnswerCount="13" CommentCount="25" FavoriteCount="23" CommunityOwnedDate="2012-10-31T16:42:47.213" />

Page 12: Apache Spark as Cross-Over Hit for Data Science

Stack Overflow Tag Recommender Demo

• CDH 5.0.1
• Spark 0.9.0
• Standalone mode
• Install libgfortran

• 1 master
• 5 workers
• 24 cores
• 64GB RAM

Page 13: Apache Spark as Cross-Over Hit for Data Science

Page 14: Apache Spark as Cross-Over Hit for Data Science

val postsXML = sc.textFile(
  "hdfs:///user/srowen/SparkDemo/Posts.xml")

postsXML: org.apache.spark.rdd.RDD[String] = MappedRDD[13] at textFile at <console>:15

postsXML.count
...
res1: Long = 18066983

Page 15: Apache Spark as Cross-Over Hit for Data Science

<row Id="4" ... Tags="...c#...winforms..."/>

(4,"c#")
(4,"winforms")
...

(4,3104,1.0)
(4,2148819,1.0)
...

Page 16: Apache Spark as Cross-Over Hit for Data Science

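val postIDTags = postsXML.flatMap { line =>
  // Matches Id="..." ... Tags="..." within a <row> line
  val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
  // Matches each escaped <tag> inside the Tags attribute
  val tagRegex = "&lt;([^&]+)&gt;".r
  idTagRegex.findFirstMatchIn(line) match {
    // Not a row with an Id and Tags: contributes nothing
    case None => Nil
    case Some(m) => {
      val postID = m.group(1).toInt
      val tagsString = m.group(2)
      val tags = tagRegex.findAllMatchIn(tagsString)
        .map(_.group(1)).toList
      // Keep only questions with at least 4 tags, as (postID, tag) pairs
      if (tags.size >= 4) tags.map((postID,_)) else Nil
    }
  }
}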
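The parsing step can be tried without a cluster. The following is a minimal sketch that applies the same two regular expressions to a single line held in memory; the object name `TagParseSketch`, the `parse` helper, its `minTags` parameter, and the abbreviated sample row are assumptions introduced here for illustration.

```scala
object TagParseSketch {
  // Same two patterns as in the flatMap body on the slide.
  private val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
  private val tagRegex = "&lt;([^&]+)&gt;".r

  // Returns (postID, tag) pairs for a row with at least minTags tags, else an empty list.
  def parse(line: String, minTags: Int = 4): List[(Int, String)] =
    idTagRegex.findFirstMatchIn(line) match {
      case None => Nil
      case Some(m) =>
        val postID = m.group(1).toInt
        val tags = tagRegex.findAllMatchIn(m.group(2)).map(_.group(1)).toList
        if (tags.size >= minTags) tags.map((postID, _)) else Nil
    }

  // Abbreviated form of the sample row: most attributes are dropped for brevity.
  val sampleRow: String =
    """<row Id="4" PostTypeId="1" Tags="&lt;c#&gt;&lt;winforms&gt;&lt;forms&gt;&lt;type-conversion&gt;&lt;opacity&gt;" AnswerCount="13" />"""

  def main(args: Array[String]): Unit =
    parse(sampleRow).foreach(println)
}
```

Rows with fewer than four tags, and lines that are not rows at all, both yield an empty list, which is what lets flatMap drop them.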

Page 17: Apache Spark as Cross-Over Hit for Data Science

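// Non-negative hash: keeps the low 23 bits of the tag's hashCode
def nnHash(tag: String) = tag.hashCode & 0x7FFFFF
var tagHashes = postIDTags.map(_._2).distinct.map(tag => (nnHash(tag),tag))

import org.apache.spark.mllib.recommendation._
// One implicit-feedback rating of 1.0 per (question, tag) pair
val alsInput = postIDTags.map(t => Rating(t._1, nnHash(t._2), 1.0))

// rank = 40, iterations = 10
val model = ALS.trainImplicit(alsInput, 40, 10)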
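Since tags are mapped to integer IDs by hashing, two different tags can in principle receive the same ID. The sketch below reuses the slide's masking expression and adds a hypothetical `collisions` helper (not part of the talk) that lists hash values shared by more than one tag; it uses plain Scala collections so it can be run locally.

```scala
object TagHashSketch {
  // Same masking as nnHash on the slide: keeps the low 23 bits, so the result is never negative.
  def nnHash(tag: String): Int = tag.hashCode & 0x7FFFFF

  // Hypothetical helper: groups tags by hash and keeps only hashes shared by more than one tag.
  def collisions(tags: Seq[String]): Map[Int, Set[String]] =
    tags.distinct
      .groupBy(nnHash)
      .collect { case (hash, group) if group.size > 1 => (hash, group.toSet) }

  def main(args: Array[String]): Unit = {
    println(nnHash("c#"))        // 3104, matching (4,3104,1.0) on Page 15
    println(nnHash("winforms"))  // 2148819, matching (4,2148819,1.0) on Page 15
    // "Aa" and "BB" are a well-known pair of strings with equal hashCode.
    println(collisions(Seq("Aa", "BB", "c#")))
  }
}
```

Running a check like this over the distinct tags would show whether any IDs are ambiguous, which matters wherever a tag is later recovered from its hash.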

Page 18: Apache Spark as Cross-Over Hit for Data Science

Page 19: Apache Spark as Cross-Over Hit for Data Science

def recommend(questionID: Int, howMany: Int = 5): Array[(String, Double)] = {
  // Score every known tag for this question
  val predictions = model.predict(
    tagHashes.map(t => (questionID,t._1)))
  // Keep the howMany highest-rated predictions
  val topN = predictions.top(howMany)(Ordering.by[Rating,Double](_.rating))
  // Translate tag hashes back to tag names
  topN.map(r => (tagHashes.lookup(r.product)(0), r.rating))
}

recommend(7122697).foreach(println)
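What the recommend function does can be sketched locally. A matrix factorization model of the kind ALS produces scores a (question, tag) pair as the dot product of the two factor vectors, and the recommendation is the set of tags with the highest scores. Everything below is an assumption made for illustration: the names `Scored` and `topTags`, the two-dimensional vectors, and the numbers are invented and do not come from the trained model. Sorting in descending order stands in for the `top` call with an Ordering.

```scala
object TopNSketch {
  // Hypothetical stand-in for a (tag, score) prediction.
  final case class Scored(tag: String, score: Double)

  // Dot product of two equally sized factor vectors.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Scores every tag against one question's factors and keeps the howMany highest.
  def topTags(question: Array[Double],
              tagFactors: Map[String, Array[Double]],
              howMany: Int = 5): List[Scored] =
    tagFactors.toList
      .map { case (tag, factors) => Scored(tag, dot(question, factors)) }
      .sortBy(-_.score)
      .take(howMany)

  def main(args: Array[String]): Unit = {
    // Made-up two-dimensional factors, purely for illustration.
    val question = Array(1.0, 0.5)
    val tags = Map(
      "sql" -> Array(0.8, 0.2),
      "database" -> Array(0.5, 0.4),
      "ruby-on-rails" -> Array(0.1, 0.1))
    topTags(question, tags, howMany = 2).foreach(println)
  }
}
```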

Page 20: Apache Spark as Cross-Over Hit for Data Science

(sql,0.1666023080230586)
(database,0.14425980384610013)
(oracle,0.09742911781766687)
(ruby-on-rails,0.06623183702418671)
(sqlite,0.05568507618047555)

I have a large table with a text field, and want to make queries to this table, to find records that contain a given substring, using ILIKE. It works perfectly on small tables, but in my case it is a rather time-consuming operation, and I need it work fast, because I use it in a live-search field in my website. Any ideas would be appreciated...

postgresql query-optimization substring text-search

stackoverflow.com/questions/7122697/how-to-make-substring-matching-query-work-fast-on-a-large-table

Page 21: Apache Spark as Cross-Over Hit for Data Science

blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/

goo.gl/[email protected]