Webinar: Machine Learning with Spark
Everything you want to know about Machine Learning but could not find the place and time to ask

Uploaded by elephantscale on 14-Jan-2017
TRANSCRIPT


Highlights

– Detecting the low-hanging fruit for machine learning
– Balancing business and science on your team
– Choosing the best Machine Learning tools, be it small or Big Data, R, Python, or Spark (and these are not mutually exclusive)

Copyright © 2016 Elephant Scale. All rights reserved.

What does a data scientist need to know?

Familiarity with Java / Scala / Python
– Need to be comfortable programming – there are many labs
– Our platform is Spark; basic familiarity with it is expected
– Our labs are in Scala; basics of Scala will be helpful

Basic understanding of a Linux development environment
– Command-line navigation
– Editing files (e.g. using vi or nano)

This is a Machine Learning with Spark class
– But no previous knowledge of Machine Learning is assumed
– The class will be paced based on the majority of the students

Lots of Labs: Learn By Doing

After The Class…

Where is the ANY key?

Machine Learning

Recommended Books
– "Advanced Analytics with Spark" by Sandy Ryza et al.
– "Data Algorithms" by Mahmoud Parsian
– "Computational Complexity: A Modern Approach" by Sanjeev Arora and Boaz Barak

Why machine learning?

Build a model to detect credit card fraud
– thousands of features
– billions of transactions

Recommend products
– millions of products
– to millions of users

Estimate financial risk
– simulations of portfolios
– with millions of instruments

Genome data manipulation
– thousands of human genomes
– detect genetic associations with disease

Why Spark?

Like Hadoop MapReduce, Spark has linear scalability and fault tolerance for large data sets

However, it adds the following extensions
– A DAG of operations, instead of Map-then-Reduce
– Rich transformations to express solutions in a natural way
– RDDs – in-memory computation

It addresses the major bottleneck:
– Not CPU
– Not disk
– Not network
– But developer productivity
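The "rich transformations" point can be sketched with plain Scala collections, whose API mirrors Spark's RDD operations (the data and the 1% review-fee rule are made up for illustration; a real RDD would come from sc.parallelize and evaluate lazily as a DAG across the cluster):

```scala
// Hypothetical transaction amounts; with Spark this Seq would be an RDD
val transactions = Seq(120.0, 4999.0, 35.0, 18000.0, 250.0)

// A chain of transformations, not a single Map-then-Reduce pass
val flaggedTotal = transactions
  .filter(_ > 1000.0) // keep suspiciously large amounts
  .map(_ * 0.01)      // e.g. a hypothetical 1% review fee
  .sum

println(flaggedTotal)
```

The same filter/map chain, written against an RDD, would run unchanged on billions of transactions.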

The story of Spark

Why Scala?

It reduces performance overhead
– Be certain the performance is adequate

Scala gives you access to the latest and greatest
– Python and R bindings may come much later

Scala helps you understand the Spark approach better
– Spark is written in Scala
– Think in Scala, think in Spark

Just Scala, no other languages needed
– Such as R with SQL

Why NOT Scala?

Python
– Popular, well-known
– Many packages
– Graphing

R
– Very popular, well-known
– Very many packages
– Graphing

About Machine Learning

What is Machine Learning?

It is an algorithm that “learns” from data

– Any algorithm which improves its performance by access to data.

Machine Learning borrows from applied statistics

Also considered a branch of AI (Artificial Intelligence)


A glimpse of history

Sixties
– Commercial computers & mainframes
– Computers play chess

Eighties
– Computational complexity theory
– Artificial intelligence (AI) gets a bad rap

21st century
– Big Data changes it all

P = NP?

Computational complexity is simple:

P – all problems that can be solved fast
– (in polynomial time, like n^p, but not exponential)
– Example: a system of linear equations

NP – all problems whose solutions can be verified fast
– That is, just check if the solution is correct

But folks, it does not matter!

"Big O" notation

Example of polynomial time: O(n^3)

Example of exponential time: O(2^n)
– How much is that?
– Compare to the number of particles in the universe, ~10^80
– To reach that, our n only needs to be log2(10^80) = 80 · log2(10) ≈ 80 × 3.3 ≈ 266

There are also in-between cases, such as n^(log log n)
– But that is still bad enough
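The particle-count comparison can be checked in one line: 2^n reaches 10^80 when n = 80 · log2(10):

```scala
// Solve 2^n = 10^80 for n: n = 80 * log2(10)
val n = 80 * math.log(10) / math.log(2)
println(n) // ≈ 265.75
```

So an O(2^n) algorithm becomes hopeless long before the input is even a few hundred items.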

Why P and NP do not matter

Old reasons
– It is too theoretical, talking only about the worst-case scenario
– There may be new computers, such as quantum computers

New reason
– Big Data
– The Turing machine is inadequate
  • Because we hit the size limitations of one computer
  • And go into clusters
  • And we have other problems than expected

How Big Data changed it all

Old thinking:
– If you can solve any problem (P = NP), you can be creative

New thinking:
– You don't have to solve problems in order to be creative
– Instead, you can pick up the answer from the internet
– Examples:
  • Google Translate
  • IBM Watson (Jeopardy winner)
  • Lesson: re-use the world's data

New thinking:
– Rely on the abundance of data
– Find an approximate solution that is good enough
– "Bad algorithms trained on lots of data can outperform good ones trained on very little" – Deeplearning4j

The other extreme – no thinking at all

The Turing machine might be too theoretical

But developers often tend to "just code"

"To think is not your business to undertake"
– Russian slang (musicians' jargon)

The golden mean

Our approach to Machine Learning is the Golden Mean approach

Avoid over-theorizing

Avoid "just code"
– Know what to expect of the solution
– When to apply it
– The limitations
– The benefits

Sages advocate the golden mean

Types of Machine Learning

Supervised Machine Learning
– A model is "trained" with human-labeled training data
– The model is then tested on other labeled data to see its performance
– The model can then be applied to unknown data
– Classification & regression are usually supervised

Unsupervised Machine Learning
– The model tries to find natural patterns in the data
– No human input except the parameters of the model
– Example: clustering

Semi-Supervised Learning
– The model is trained with a training set which contains a mix of labeled and unlabeled data

Supervised Machine Learning

Input data is split into "training" and "test" data, both labeled

A model is trained using the training data

A prediction is made using model.predict()

The model can be tested by comparing its predictions against the test dataset
– Mean Squared Error: mean((predicted – actual)^2)
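Mean Squared Error can be sketched in a few lines of plain Scala (the numbers below are toy values, purely illustrative):

```scala
// MSE = mean of the squared differences between prediction and truth
def mse(predicted: Seq[Double], actual: Seq[Double]): Double =
  predicted.zip(actual).map { case (p, a) => (p - a) * (p - a) }.sum / predicted.size

val actual    = Seq(3.0, 5.0, 7.0)
val predicted = Seq(2.5, 5.0, 8.0)
println(mse(predicted, actual)) // (0.25 + 0.0 + 1.0) / 3 ≈ 0.4167
```

A lower MSE on the held-out test set means the model generalizes better.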

MLlib Algorithm overview

Model Validation

Models need to be 'verified' / 'validated'

Split the data set into
– Training set: build / train the model
– Test set: validate the model

Initially 70% training, 30% validation
– Tweak the dials to decrease training and increase validation

The training set should represent the data well enough
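A minimal sketch of that 70/30 split on a local collection; with Spark, RDD.randomSplit(Array(0.7, 0.3)) plays the same role (the 100-record dataset is made up):

```scala
import scala.util.Random

// Hypothetical dataset of 100 records
val data = (1 to 100).toVector

// Shuffle, then take the first 70% for training, the rest for validation
val shuffled = Random.shuffle(data)
val (training, test) = shuffled.splitAt((data.size * 0.7).toInt)
println((training.size, test.size)) // (70,30)
```

Shuffling first matters: if the data arrives sorted (say, by date), an unshuffled split gives a training set that does not represent the data.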

Creating Feature Vectors: Feature Extraction

Machine Learning only works with vectors. Feature vectors are n-dimensional points in space.
– Select variables from the data
– Turn the data into numbers (doubles)
– "Normalize" (scale down) high-magnitude data
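The "normalize" step can be sketched as min–max scaling into [0, 1] (one common choice among several; the income figures are hypothetical):

```scala
// Min-max normalization: map each value into [0, 1]
def normalize(xs: Seq[Double]): Seq[Double] = {
  val (lo, hi) = (xs.min, xs.max)
  xs.map(x => (x - lo) / (hi - lo))
}

// Hypothetical high-magnitude feature (annual income in dollars)
val incomes = Seq(20000.0, 50000.0, 80000.0)
println(normalize(incomes)) // List(0.0, 0.5, 1.0)
```

Without this, a feature measured in tens of thousands would dominate one measured in single digits.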

Vectors: Dense versus Sparse

Dense Vectors
– Usually have a nonzero value for each variable
– The "telecom churn" dataset we use in the labs is a dense dataset
– Use Vectors.dense

Sparse Vectors
– Most values are zero (or nonexistent)
– Text data yields sparse vectors
– One-hot encoded factor variables lead to sparse vectors
– Use Vectors.sparse
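The dense/sparse difference is purely one of storage. A plain-Scala sketch of the idea behind Vectors.sparse(size, indices, values), which keeps only the nonzero entries (the six-element vector is made up):

```scala
// Dense form: one slot per variable, zeros included
val dense = Array(0.0, 0.0, 3.0, 0.0, 0.0, 1.0)

// Sparse form: only the nonzero (index -> value) pairs
val sparse = Map(2 -> 3.0, 5 -> 1.0)

// The dense form can be reconstructed from the sparse one
val rebuilt = Array.tabulate(dense.length)(i => sparse.getOrElse(i, 0.0))
println(rebuilt.sameElements(dense)) // true
```

With text data the savings are dramatic: a document vector over a 50,000-word vocabulary typically stores a few hundred entries instead of 50,000.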

Creating Vectors From Text

How to create vectors from text?

– TF/IDF: Term Frequency / Inverse Document Frequency
  • Essentially, the frequency of a term in a document, scaled down by the term's frequency in the larger group of documents (the "corpus")
  • Each word in the corpus is then a "dimension" – you would have thousands of dimensions

– Word2Vec
  • Another vectorization algorithm
  • Uses a neural network
  • Borders on deep learning
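A minimal TF-IDF sketch on a toy corpus (the documents are made up; real pipelines would use MLlib's HashingTF and IDF):

```scala
// TF-IDF: term frequency in a document, scaled down by how
// common the term is across the corpus
def tfIdf(term: String, doc: Seq[String], corpus: Seq[Seq[String]]): Double = {
  val tf  = doc.count(_ == term).toDouble / doc.size
  val df  = corpus.count(_.contains(term)) // documents containing the term
  val idf = math.log(corpus.size.toDouble / df)
  tf * idf
}

val corpus = Seq(
  Seq("spark", "makes", "ml", "fast"),
  Seq("spark", "uses", "rdds"),
  Seq("ml", "needs", "vectors", "ml")
)

// "spark" appears in 2 of 3 documents, so its idf (and score) is low
println(tfIdf("spark", corpus(0), corpus)) // 0.25 * ln(1.5) ≈ 0.101
```

Words that appear everywhere ("the", "and") score near zero, while rare, distinctive words dominate the vector.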

Visualizing Text using WordCloud
State of the Union Speech 2014

Deep learning

What is deep learning?
– "A neural network with more than 1 hidden layer" – Deeplearning4j

But what is a neural network?

Neural networks

A set of algorithms
– Modeled loosely after the human brain
– Designed to recognize patterns

Input comes from sensory data
– machine perception
– labeling
– clustering raw input

Recognized patterns are
– numerical
– contained in vectors
– translated from real-world data
  • Images
  • Sound
  • Text
  • Time series

Basic steps in a neural network

Do I have the data?

Which outputs do I care about?
– Spam – not spam
– Fraud – not fraud

Do I have labeled data from which to learn? (Supervised learning)

Nah, I just need to group things (Unsupervised learning)
– Normal – anomaly
– Group documents

Neural network node

Neural network composition

Deep neural network

Deep learning applications

Google
– ParagraphVectors (implemented as doc2vec)
– Represents the meaning of documents
– Based on word2vec and word context

Facebook

ML in Spark

The Spark stack:
– Spark Core
– Spark SQL
– Spark Streaming
– MLlib
– GraphX
– Runs on: Standalone, YARN, Mesos

Linear algorithms

– SVM
– Logistic regression
– Linear regression

Practical use case for SVM

History of logistic regression

Invented by (Sir) David Cox, UK
– Who wrote 364 books and papers

Best known for
– The proportional hazards model
– Used in the analysis of survival data
– Medical research (cancer)

Classification algorithms

– Naïve Bayes
– Decision Trees
– K-Means

Where Naïve Bayes fits in

There are many classification algorithms in the world

The Naïve Bayes Classifier (NBC) is one of the simplest but most effective

K-means and K-nearest neighbors are for numeric data

But for
– Names
– Symbols
– Emails
– Texts

NBC may be the best choice

Naïve Bayes can do multiclass (and not only binary) classification
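The idea can be sketched in plain Scala: pick the class maximizing log P(class) + Σ log P(word | class), with add-one smoothing (the spam/ham corpus below is a toy example, not lab data):

```scala
// Toy labeled corpus: (class, words)
val training = Seq(
  ("spam", Seq("win", "money", "now")),
  ("spam", Seq("win", "prize")),
  ("ham",  Seq("meeting", "tomorrow", "now"))
)

// Log-probability score of a class for a message, add-one smoothed
def score(cls: String, words: Seq[String]): Double = {
  val docs  = training.filter(_._1 == cls)
  val bag   = docs.flatMap(_._2)
  val vocab = training.flatMap(_._2).distinct.size
  val prior = math.log(docs.size.toDouble / training.size)
  prior + words.map { w =>
    math.log((bag.count(_ == w) + 1.0) / (bag.size + vocab))
  }.sum
}

val msg  = Seq("win", "money")
val best = Seq("spam", "ham").maxBy(score(_, msg))
println(best) // spam
```

"Naïve" refers to the independence assumption between words; despite it, the classifier works remarkably well on text.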

A good candidate for Naïve Bayes (Credit: Sebastian Raschka)

History of Bayes

Discovered by the Reverend Thomas Bayes (1701–1761)

Edited and read at the Royal Society by Richard Price (1763)

Independently reproduced and extended by Laplace (1774)

Naïve Bayes classifiers studied in the 1950s

Clustering use case

Anomaly detection
– Find fraud
– Detect network intrusion attacks
– Discover problems on servers
– Or on any machinery with sensors

Clustering does not necessarily detect fraud
– But it points to unusual data
– And the need for further investigation
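The "points to unusual data" idea can be sketched as distance-to-nearest-center filtering (the cluster centers, points, and the 5.0 threshold are all hypothetical, standing in for K-Means output):

```scala
// Hypothetical cluster centers (in practice, produced by K-Means)
val centers = Seq((0.0, 0.0), (10.0, 10.0))

def dist(a: (Double, Double), b: (Double, Double)): Double =
  math.hypot(a._1 - b._1, a._2 - b._2)

// Points far from every center are flagged for investigation
val points = Seq((0.5, 0.2), (9.8, 10.1), (50.0, 50.0))
val anomalies = points.filter(p => centers.map(c => dist(p, c)).min > 5.0)
println(anomalies) // List((50.0,50.0))
```

The flagged point is not necessarily fraud, only unusual; the threshold tunes how much "unusual" warrants investigation.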

Network intrusion

Known unknowns
– Port scanning
– Number of ports accessed per second
– Number of bytes sent/received

But what about unknown unknowns?
– The biggest threat
– New and as yet unclassified attacks
– Connections that are not known as attacks
– But are out of the ordinary
– Anomalies that are outside clusters

Session 7: GraphX

Shortest Path

You have a graph (map) of cities, with distances between them

Help the mouse find the shortest path to the cheese
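The mouse's problem is the classic shortest-path problem. A Dijkstra sketch over a tiny made-up city graph (GraphX provides a distributed ShortestPaths implementation for graphs too big for one machine):

```scala
// Hypothetical road map: city -> (neighbor -> distance)
val edges = Map(
  "A" -> Map("B" -> 4.0, "C" -> 1.0),
  "B" -> Map("D" -> 1.0),
  "C" -> Map("B" -> 2.0, "D" -> 5.0),
  "D" -> Map.empty[String, Double]
)

// Dijkstra: repeatedly settle the closest unvisited city
def shortest(from: String, to: String): Double = {
  var dist = Map(from -> 0.0)
  def d(k: String): Double = dist.getOrElse(k, Double.PositiveInfinity)
  var visited  = Set.empty[String]
  var frontier = Set(from)
  while (frontier.nonEmpty) {
    val u = frontier.minBy(d)
    frontier -= u
    visited  += u
    for ((v, w) <- edges(u) if !visited(v)) {
      if (d(u) + w < d(v)) dist += v -> (d(u) + w)
      frontier += v
    }
  }
  d(to)
}

println(shortest("A", "D")) // A -> C -> B -> D = 4.0
```

Note the greedy direct route A -> B -> D (cost 5) loses to the detour through C (cost 4), which is exactly why a proper shortest-path algorithm is needed.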

Elephant Scale – Big Data Training done right