barcelona spain apache spark meetup oct 20, 2015: spark streaming, kafka, mllib, sql, project...

59
Click to edit Master text styles Click to edit Master text styles After Dark Real-time Advanced Analytics, Machine Learning, Graph Analytics, Text NLP, and Recommendations Barcelona Spark Meetup Oct 20 th , 2015 Chris Fregly Principal Data Solutions Engineer IBM Spark Technology Center ** We’re Hiring!! Nice People Only, Please. **

Upload: chris-fregly

Post on 11-Apr-2017

1.353 views

Category:

Software


5 download

TRANSCRIPT

Page 1: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

After Dark Real-time Advanced Analytics, Machine Learning, Graph Analytics, Text NLP, and Recommendations

Barcelona Spark Meetup

Oct 20th, 2015

Chris Fregly Principal Data Solutions Engineer

IBM Spark Technology Center ** We’re Hiring!! Nice People Only, Please. **

Page 2: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Who Am I?

2

Streaming Data Engineer Netflix Open Source Committer

Data Solutions Engineer

Apache Contributor

Principal Data Solutions Engineer IBM Technology Center

Meetup Organizer Advanced Apache Meetup

Book Author Advanced (2016)

Page 3: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Advanced Apache Spark Meetup Total Spark Experts: ~1350+ in 3 mos! #4 most active Spark Meetup in the world! Main Goals Dig deep into the Spark & extended-Spark codebase Study integrations such as Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc

Surface and share the patterns and idioms of these well-designed, distributed, big data components

Page 4: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark 4

Core

Spark Streaming

real-time Spark SQL structured data

MLlib machine learning

GraphX graph

analytics

BlinkDB approx queries

What is Spark?

Page 5: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Spark Deployments In Production

5

Page 6: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Tools of the Talk

6

  Redis   Docker   Cassandra   MLlib, GraphX   Parquet, JSON   Apache Zeppelin   Spark Streaming, Kafka   Spark SQL, DataFrames   Spark JDBC/ODBC Hive ThriftServer   ElasticSearch, Logstash, Kibana (ELK)

and…

Page 7: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

SMACK Stack!

7

S park (Data Processing) M esos (Cluster Manager) A kka (Actors) C assandra (NoSQL) K afka (Streaming)

Page 8: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Themes of this Talk

 Parallelism  Performance  Streaming  Approximations  Similarity Measures  Recommendations

8

and…

Page 9: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Goals of Spark After Dark   Generate high-quality recommendations

  Demonstrate Spark high-level libraries Spark Streaming -> Kafka, Approximates

Spark SQL -> DataFrames, Cassandra

  GraphX -> PageRank, Shortest Path

  MLlib -> Matrix Factor, Word2Vec

9

Page 10: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Popular Dating Sites

10

Page 11: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles Parallelism

11

Page 12: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

My First Experience With Parallelism Brady Bunch circa 1980 Season 5, Episode 18: “Two Pete’s in a Pod”

12

Page 13: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Parallel Algorithm: O(log n)

13

O(log n)

Page 14: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Non-Parallel Algorithm: O(n)

14

O(n)

Page 15: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Spark is Parallel!

15

Page 16: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles Performance

16

Page 17: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Spark Beats Hadoop @ 100 TB GraySort

17

  On-disk only   28,000 partitions   No in-memory caching

(2014) (2013) (2014)

Page 18: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Improved Shuffle and Network Layer   “Sort-based shuffle”

  Minimize OS resources

  Switched to async Netty

  Keep CPUs hot

  Reuse byte buffers to minimize GC

  Use epoll for I/O to stay in kernel space 18

Page 19: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Project Tungsten: CPU and Memory   More JVM bytecode generation, JIT optimize

  CPU-cache-aware data structs and algos -->

  Custom memory management Serializers Performance New HashMap

19

Page 20: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

DataFrames and Catalyst Optimizer

20

20

https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

Please Use DataFrames!

--> -->

JVM bytecode generation

Page 21: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Columnar Storage Format

21

Skip whole chunks with min-max heuristicsstored in each chunk (sorted data only)

Page 22: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Parquet File Format  Based on Google Dremel

 Implemented by Twitter and Cloudera

 Columnar storage format

 Optimized for fast columnar aggregations

 Tight compression

 Supports pushdowns

 Nested, self-describing, evolving schema 22

Page 23: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Types of Compression   Run Length Encoding: Repeated data   Dictionary Encoding: Fixed set of values

  Delta, Prefix Encoding: Sorted data

23

Page 24: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Types of Query Optimizations   Column, Partition Pruning   Row, Predicate Pushdown

SELECT b FROM table WHERE a in [a2,a3]

24

Page 25: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles Streaming

25

Page 26: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Direct Kafka Streaming – KafkaRDD   No single Receiver, no Write Ahead Log (WAL)   Workers pull from Kafka in parallel   Each KafkaRDD partition stores relevant offsets   Upon Worker Node failure, rebuild from offsets   Optimizes happy path by avoiding the WAL

26

At least once delivery guarantee <--

Page 27: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles Approximations

27

Page 28: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Count Min Sketch   Approximate counters

  Better than HashMap

  Low, fixed memory   Known error bounds   Large num of counters   From Twitter’s Algebird   Streaming example in Spark codebase

28

Page 29: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

HyperLogLog   Approximate cardinality Approx count distinct !  From Twitter’s Algebird!

  Low memory 1.5KB @ 2% error, 10^9 elements !

  Streaming example in Spark codebase

  RDD: countApproxDistinctByKey() 29

Page 30: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Monte Carlo Simulations From Manhattan Project (A-bomb) Simulate movement of neutrons Law of Large Numbers (LLN) Average of results of many trials Converge on expected value SparkPi example in Spark codebase Pi ~ (# red dots / # total dots * 4)

30

Page 31: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles Recommendations

31

Page 32: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles Interactive Demo!

32

Page 33: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Audience Participation Needed!

33

  Navigate to sparkafterdark.com

  Click 3 actresses and 3 actors

-> You are here

->

Page 34: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Types of Recommendations Non-personalized Cold Start No preference or behavior data for user, yet Personalized User-Item Similarity Items that others with similar prefs have liked

Item-Item Similarity Items similar to your previously-liked items

34

Page 35: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles Non-Personalized Recommendations

35

Page 36: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Summary Statistics and Aggregations   Top Users by Like Count

“I might like users with the highest sum aggregation of likes overall.”

SparkSQL + DataFrame = Aggregations

36

Page 37: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Graph Analytics   Top Influencers by Like Graph

“I might like users who have the highest probability of me liking them randomly while walking the like graph.”

GraphX: PageRank

37

Page 38: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles Demo!

Spark SQL/DataFrames + GraphX/PageRank

38

Page 39: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles Similarities

39

Page 40: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Types of Similarity Euclidean: linear measure Magnitude bias Cosine: angle measure Adjust for magnitude bias Jaccard: (intersection / union) Popularity bias Log Likelihood Adjust for popularity bias 40

Ali Matei Reynold Patrick AndyKimberly 1 1 1 1Leslie 1 1!Meredith 1 1 1Lisa 1 1 1Holden 1 1 1 1 1

z!

Page 41: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

All-Pairs Similarity Comparison Compare everything to everything aka. “pair-wise similarity” or “similarity join” Naïve shuffle: O(m*n^2); m=rows, n=cols Minimize shuffle through approximations! Reduce m (rows) Sampling and bucketing Reduce n (cols) Remove most frequent value (ie.0) Principle Component Analysis

41

Page 42: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Reduce m: DIMSUM Sampling “Dimension Independent Matrix Square Using MR” Remove rows with low similarity probability MLlib: RowMatrix.columnSimilarities(…) Twitter: 40% efficiency gain over Cosine Similarity 42

Page 43: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Reduce m: LSH Bucketing “Locality Sensitive Hashing” Split m into b buckets Use similarity hash algorithm Requires pre-processing of data Compare bucket contents in parallel Converts O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets

ie. 500k x 500k matrix O(1.25e17) -> O(1.25e13); b=50

github.com/mrsqueeze/spark-hash 43

Page 44: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Reduce n: Remove Most Frequent Value Eliminate most-frequent value Represent other values with (index,value) pairs Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n Note: Choose most frequent value (may not be 0)

44

(index,value)

(index,value)

Page 45: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles Personalized Recommendations

45

Page 46: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Recommendation Terminology User User seeking recommendations Item

Item that has been liked or rated Feedback Explicit: like, rating Implicit: search, click, hover, view, scroll Feature Engineering

Dimension reduction

46

Page 47: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Collaborative Filtering Personalized Recs   Like behavior of similar users

“I like the same people that you like. What other people did you like that I haven’t seen?” MLlib: Matrix Factorization, User-Item Similarity

47

Page 48: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles Demo!

Spark SQL/DataFrames + MLlib/Alternating Least Squares

48

Page 49: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Text-based Personalized Recs (1/3)   Similar profiles to me“Our profiles have similar, unique k-skip n-grams. We might like each other.” MLlib: Word2Vec, TF/IDF, Doc Similarity

49

Page 50: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Text Based Personalized Recs (2/3)

50

 Similar profiles from my past likes“Your profile shares a similar feature vector space to others that I’ve liked. I might like you.” MLlib: Word2Vec, TF/IDF, Doc Similarity

Page 51: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Text-based Personalized Recs (3/3)   Relevant, High-Value Emails “Your initial email has similar named entities to my profile.

I might like you just for making the effort.” MLlib: Word2Vec, TF/IDF, Entity Recognition

51

^ Her Email < My Profile

Page 52: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles The Future of Recommendations!

52

Page 53: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Facial Recognition   Eigenfaces

“Your face looks similar to others that I’ve liked. I might like you.”

MLlib: RowMatrix, PCA, Item-Item Similarity

53 Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

Page 54: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Natural Language Processing: Convo Bot   NLP and DecisionTrees

“If your responses to my trite opening lines are positive, I may read your profile.” MLlib: TF/IDF, DecisionTree, Sentiment Analysis

54

Positive Negative

Page 55: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

55

Maintaining the Spark!

Page 56: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Recommendations for Couples   Pathways of Similarity

“I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.”

MLlib: RowMatrix, Item-Item Similarity GraphX: Nearest Neighbors, Shortest Path similar similar •  plots -> <- actors

56

Page 57: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles Final Recommendation!

57

Page 58: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

 Get Off the Computer & Meet People! Thank you!!

Chris Fregly @cfregly IBM Spark Tech Center San Francisco, CA, USA

Relevant Links advancedspark.com

Signup for the book and meetup! github.com/fluxcapacitor/pipeline

Clone all code used today! hub.docker.com/r/fluxcapacitor/pipeline

Run all demos presented today!

58

Image courtesy of http://www.duchess-france.org/

Page 59: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing

Click to edit Master text styles

Click to edit Master text styles

Power of data. Simplicity of design. Speed of innovation.

IBM Spark