Utah Code Camp 2014 - Learning from Data by Thomas Holloway
DESCRIPTION
A fast-paced guide to machine learning, a branch of artificial intelligence, given at Utah Code Camp 2014.
TRANSCRIPT
Learning from Data
A fast-paced guide to machine learning and artificial intelligence
by Thomas Holloway, Co-Founder/Software Engineer @ Nuvi (http://www.nuviapp.com)
Thanks to our Sponsors!
To connect to wireless:
1. Choose Uguest in the wireless list
2. Open a browser. This will open a U of U website
3. Choose Login

“Intelligence is the art of good guesswork” – H. B. Barlow
General Intelligence Goals
• Deduction, Reasoning, Problem Solving
• Knowledge Representation
• Planning
• Learning
• Natural Language Processing
• Motion and Manipulation
• Perception
• Social Intelligence
• Creativity

• Early AI research began in the study of logic itself, leading to algorithms that imitate the step-by-step reasoning used to solve puzzles and problems (heuristics).
• In contrast, methods drawn from economics and probability in the late ’80s/’90s led to very successful approaches for dealing with uncertainty or incompleteness.
• Statistical approaches, neural networks (the probabilistic nature of humans to guess)
Knowledge Representation
• Represent knowledge conceptually about objects, places, situations, events, times, language
• What they look like
• Categorical features
• Properties
• Relationships between each other
• Meta-knowledge (knowledge of what other people know)
• Causes, effects, and many less-known research fields
• “what exists” = Ontology
Knowledge Representation (cont.)
• Difficult problems
• Working assumptions, default reasoning, the qualification problem
• Commonsense knowledge
• A major goal is to acquire this automatically, largely through unsupervised learning
• Ontology engineering
• Subsymbolic form of commonsense knowledge
• Not all knowledge can be represented as facts or statements (e.g. the intuition to avoid a move because a position “feels too exposed” in a chess match)
Planning
• Set goals and achieve them
• (visualize the representation of the world, predict how actions will change it, make choices to maximize utility)
• Requires reasoning under uncertainty (checking whether the world/environment matches its predictions) -> error correction
• Move a chess piece here, the player responds to put me in a seemingly poor position, act accordingly
Learning
• Machine Learning is the study of algorithms that automatically improve through experience.
• It probably plays the most central role in Artificial Intelligence.
• Unsupervised Learning - finding patterns
• Supervised Learning - classifying what category something is/belongs to and producing a function mapping input -> output
• Reinforcement Learning - rewards
• Developmental Learning - self-exploration, active learning, imitation, guidance, entropy
Natural Language Processing
• Read and understand text
• Listen and understand speech
• Information Retrieval
• Machine Translation
• Sentiment Analysis
• Category Theory (quantum logic in information flow theory)
• Common techniques in semantic indexing, parse trees, syntactic and semantic analysis
• A major goal is to automatically build an ontology (for knowledge representation) by scanning books, Wikipedia, dictionaries, etc.
• Recently used Wiktionary and Wikipedia to automatically build a part-of-speech tagger and sentiment analysis engine for multiple languages. *http://www.nuviapp.com/* <— PLUG
Statistical Machine Learning is the art of taking lots of data and turning it into statistically known probabilities.
• Entropic Force (Alex Wissner-Gross’ argument for intelligence)
• Language Discovery
• Automated Trading Systems
• Machine Translation
• Spam Detection
• Self-Driving Cars
• Facial Recognition
• Gesture Recognition
• Speech Recognition
• Nest
• Shazam
• Spotify
• Netflix, Amazon Recommendations
• Duolingo
• Robot Movement
• Fraud Detection
• Intrusion Detection / State Anomaly
• DNA Sequence Alignment
• Siri, Google Voice, Google Now, Kinect
• Sentiment Analysis
• Text/Character Recognition (scanning books)
• Health Monitoring (Healthcare)
• Pandora, iTunes / iGenius
Types of Machine Learning
• Supervised Learning
• Unsupervised Learning
• Recommendation Systems
• Reinforcement Learning
• (rewards for good responses, punishment for bad ones)
• Developmental Learning
• (self-exploration, entropic force, cumulative acquisition of novel skills typical of robot movement - autonomous interaction with environment and “teachers”, imitation, maturation)
Supervised Learning
• Two types that we will discuss within supervised learning:
• Regression analysis (single-valued real output)
• Classification
Linear Regression
Optimization Objectives
• Hypothesis: h_θ(x) = θ₀ + θ₁x
• Parameters: θ₀, θ₁
• Cost Function: J(θ₀, θ₁) = (1/2m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
• Goal: minimize J(θ₀, θ₁) over θ₀, θ₁

m = number of samples, x⁽ⁱ⁾ = x at sample i, y⁽ⁱ⁾ = y at sample i

Our cost function is effectively taking the squared error difference between all predictions from our hypothesis and the actual values y, and finally summing the error up to a total “cost” error. The goal is to minimize the error produced from the cost function by manipulating the parameters theta.
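The cost function above can be sketched in a few lines of Python; the sample data below is made up for illustration:

```python
# A minimal sketch of the squared-error cost J(θ0, θ1) described above.
def cost(theta0, theta1, xs, ys):
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        h = theta0 + theta1 * x  # hypothesis h_theta(x)
        total += (h - y) ** 2    # squared error for this sample
    return total / (2 * m)       # average, with the conventional 1/2 factor

# A perfect fit (the data is exactly y = 1 + 2x) has zero cost:
print(cost(1.0, 2.0, [0, 1, 2], [1, 3, 5]))  # 0.0
```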
Gradient Descent
• First-order optimization algorithm
• Finds a local minimum of a function by taking steps proportional to the negative of the gradient of the function at the current point.
• Popular for large-scale optimization problems
• easy to implement
• works on just about any black-box function
• each iteration is relatively cheap
Gradient Descent
repeat until convergence {
  θ_j := θ_j − α · ∂/∂θ_j J(θ₀, θ₁)   (for j = 0 and j = 1, updated simultaneously)
}

Expanding the partial derivative for our hypothesis and cost function gives:

repeat until convergence {
  θ₀ := θ₀ − α · (1/m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)
  θ₁ := θ₁ − α · (1/m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x⁽ⁱ⁾
}

* note: sometimes referred to as batch gradient descent (given that we iterate over all training examples to perform a single update on our parameters)
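The batch update rule above, sketched in Python; the learning rate, iteration count, and training data are arbitrary choices for illustration:

```python
# A rough sketch of batch gradient descent for univariate linear regression.
def gradient_descent(xs, ys, alpha=0.1, iters=1000):
    m = len(xs)
    t0, t1 = 0.0, 0.0
    for _ in range(iters):
        # gradients averaged over ALL training examples (hence "batch")
        g0 = sum((t0 + t1 * x - y) for x, y in zip(xs, ys)) / m
        g1 = sum((t0 + t1 * x - y) * x for x, y in zip(xs, ys)) / m
        # simultaneous update of both parameters
        t0, t1 = t0 - alpha * g0, t1 - alpha * g1
    return t0, t1

# data drawn from y = 1 + 2x, so the parameters converge toward (1, 2)
t0, t1 = gradient_descent([0, 1, 2, 3], [1, 3, 5, 7])
```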
Multivariate Linear Regression

| TV Budget | Online Ads | Billboards | Sales |
| --- | --- | --- | --- |
| 230.1 | 37.8 | 63.1 | 22.1 |
| 44.5 | 39.9 | 45.1 | 10.4 |
| 17.2 | 45.8 | 69.3 | 9.3 |
| 180.8 | 41.3 | 58.5 | 18.5 |
Multivariate Linear Regression
• Hypothesis: h_θ(x) = θ₀x₀ + θ₁x₁ + … + θₙxₙ = θᵀx
• Think of x as our example with its features in a vector of up to n features, with x₀ = 1.
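With NumPy, the hypothesis θᵀx for every sample at once is a single matrix product; the feature values and parameters below are illustrative only:

```python
import numpy as np

# Each row is one sample; the leading column of ones is the x0 = 1 bias term.
X = np.array([[1.0, 230.1, 37.8, 63.1],
              [1.0, 44.5, 39.9, 45.1]])
theta = np.array([0.5, 0.04, 0.1, 0.02])  # made-up parameters

h = X @ theta  # predictions h_theta(x) for all samples at once
```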
Optimization Objectives (Multivariate)
• Hypothesis: h_θ(x) = θᵀx
• Parameters: θ (a vector θ₀ … θₙ)
• Cost Function: J(θ) = (1/2m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
• Goal: minimize J(θ) over θ

m = number of samples, x⁽ⁱ⁾ = x at sample i, y⁽ⁱ⁾ = y at sample i

As before, the cost function takes the squared error difference between all predictions from our hypothesis and the actual values y, and sums the error up to a total “cost”; the goal is to minimize that cost by manipulating the parameters theta.
Gradient Descent (Multivariate)
repeat until convergence {
  θ_j := θ_j − α · (1/m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x_j⁽ⁱ⁾   (for j = 0…n, updated simultaneously)
}
Techniques in Managing Input
• Mean normalization (make sure all your inputs have similar ranges)
• FFT for audio
• Mean / average / range
• Graph your cost function over the number of iterations (make sure it is decreasing)
• Separate data sets (cross-validation set, test set)
• Train on a given set of data, manipulate regularization / extra features etc., and graph your cost function against the cross-validation set
• Finally test against unseen data with your test set
• Typically this is 60-30-10, or even 70-20-10, depending on how you wish to split things up.
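A quick sketch of mean normalization and a training/cross-validation/test split (60-30-10 here, per the slide); the helper names and data are made up:

```python
import numpy as np

def mean_normalize(X):
    # scale each feature to roughly [-0.5, 0.5] by centering on its mean
    # and dividing by its range
    mu = X.mean(axis=0)
    rng = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / rng

def split(X, train=0.6, cv=0.3):
    m = len(X)
    a, b = round(m * train), round(m * (train + cv))
    return X[:a], X[a:b], X[b:]  # training, cross-validation, test sets

X = np.arange(20, dtype=float).reshape(10, 2)
Xn = mean_normalize(X)
tr, cv, te = split(Xn)  # 6 / 3 / 1 rows
```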
Normal Equation
• Analytically solves for the parameters: θ = (XᵀX)⁻¹Xᵀy
• Useful when n is relatively small (number of features < 5000 or so)
• Uses the entire matrix of input
• Each sample = a vector of features (a row of X)
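The normal equation can be sketched with NumPy; solving the linear system is generally preferred to forming the explicit inverse, and the data below is contrived so the answer is known:

```python
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # column of ones = bias
y = np.array([1.0, 3.0, 5.0])                        # exactly y = 1 + 2x

theta = np.linalg.solve(X.T @ X, X.T @ y)            # solves (XᵀX)θ = Xᵀy
print(theta)  # ≈ [1. 2.]
```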
Supervised Learning - Classification
• Spam/Not Spam
• Benign/Malignant
• Biometric Identification
• Speech Recognition
• Fraudulent Transactions
• Pattern Recognition
• 0 = Negative Class
• 1 = Positive Class
Logistic Regression / Classification
• What we want is a function that will produce a value between 0 and 1 for all weighted input we provide.
• Sigmoid activation unit: g(z) = 1 / (1 + e⁻ᶻ), with hypothesis h_θ(x) = g(θᵀx)
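A minimal sketch of the sigmoid unit:

```python
import math

def sigmoid(z):
    # squashes any weighted input to (0, 1), with g(0) = 0.5
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))   # 0.5
print(sigmoid(6))   # ≈ 0.9975 — large positive input → near 1
print(sigmoid(-6))  # ≈ 0.0025 — large negative input → near 0
```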
Logistic Regression Cost Function
• Hypothesis: h_θ(x) = 1 / (1 + e^(−θᵀx))
• Cost Function: the linear regression cost function J(θ) = (1/2m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² is non-convex for this hypothesis, so we need a different cost function.
Logistic Regression Cost Function
• Hypothesis: h_θ(x) = 1 / (1 + e^(−θᵀx))
• Cost(h_θ(x), y) = −log(h_θ(x)) if y = 1; −log(1 − h_θ(x)) if y = 0
Logistic Regression Cost Function Intuition
• For y = 1, Cost = −log(h_θ(x)): as h_θ(x) → 0 the cost → ∞. In other words, if we predicted 0 when we should have predicted 1, we are going to return a very large cost.
Logistic Regression Cost Function Intuition
• For y = 0, Cost = −log(1 − h_θ(x)): as h_θ(x) → 1 the cost → ∞. In other words, if we predicted 1 when we should have predicted 0, we are going to return a very large cost.
Logistic Regression Cost Function
• The “simplified” formula combines both cases: J(θ) = −(1/m) Σᵢ [ y⁽ⁱ⁾ log(h_θ(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾) log(1 − h_θ(x⁽ⁱ⁾)) ]
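The simplified cost can be sketched directly; the prediction values below are made up:

```python
import math

def logistic_cost(preds, ys):
    # average of -[y*log(h) + (1-y)*log(1-h)] over all samples
    m = len(preds)
    total = 0.0
    for h, y in zip(preds, ys):
        total += -(y * math.log(h) + (1 - y) * math.log(1 - h))
    return total / m

good = logistic_cost([0.9, 0.1], [1, 0])  # confident and correct: small cost
bad = logistic_cost([0.1, 0.9], [1, 0])   # confident and wrong: large cost
```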
Gradient Descent (Logistic Regression)
repeat until convergence {
  θ_j := θ_j − α · (1/m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x_j⁽ⁱ⁾   (for j = 0…n)
}
The update rule looks identical to linear regression; the difference is the hypothesis h_θ(x) = 1 / (1 + e^(−θᵀx)).
Logistic Regression Decision Boundaries
• The threshold or line at which input data favors one class or another. This is usually the same point where our sigmoid function crosses the 0.5 mark.
Multiclass Classification
• Multiclass classification deals with multiple categories of classification (Sunny, Rainy, Cloudy, etc.). It is typically done as one-vs-all classification, where each class is trained as (1 = positive for the given class, 0 for everything else).
• To predict, find the max probability among all classes tested against.
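A sketch of one-vs-all prediction, assuming one already-trained parameter vector per class (the theta values here are invented):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(thetas, x):
    # thetas: one row of parameters per class; x includes the bias term
    probs = sigmoid(thetas @ x)   # probability from each binary classifier
    return int(np.argmax(probs))  # index of the most confident class

thetas = np.array([[1.0, -2.0],   # class 0, e.g. "sunny"
                   [-1.0, 0.5],   # class 1, e.g. "rainy"
                   [0.0, 1.0]])   # class 2, e.g. "cloudy"
x = np.array([1.0, 3.0])          # bias term plus one feature
print(predict_one_vs_all(thetas, x))  # 2
```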
Overfitting and Regularization
• Regularized logistic regression cost: J(θ) = −(1/m) Σᵢ [ y⁽ⁱ⁾ log(h_θ(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾) log(1 − h_θ(x⁽ⁱ⁾)) ] + (λ/2m) Σⱼ θⱼ²  (j = 1…n)
• Regularized gradient descent: θ_j := θ_j − α [ (1/m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) x_j⁽ⁱ⁾ + (λ/m) θ_j ]  (for j ≥ 1; θ₀ is not regularized)
• λ is the regularization parameter.
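One regularized gradient-descent step can be sketched as follows; note the convention that θ₀ (the bias) is not regularized, and all the data here is illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_step(theta, X, y, alpha, lam):
    m = len(y)
    grad = X.T @ (sigmoid(X @ theta) - y) / m  # unregularized gradient
    reg = (lam / m) * theta
    reg[0] = 0.0                               # do not shrink the bias term
    return theta - alpha * (grad + reg)

X = np.array([[1.0, 0.0], [1.0, 1.0]])  # leading column of ones = bias
y = np.array([0.0, 1.0])
theta = regularized_step(np.array([0.5, 0.5]), X, y, alpha=0.1, lam=1.0)
```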
Neural Networks
Sophisticated Neural Networks can do some really amazing things
• Multi-layered (deep) neural networks can be built to identify extremely complex things, with potentially millions of features to train on.
• Neural networks can auto-encode (learn from the input itself / self-learn), classify into many categories at once, and be trained to output real values; they can even be built to retain memory or long-term state (as in the case of hidden Markov models or finite state automatons).
Types of Neural Networks
• Feedforward
• Recurrent
• Echo-State
• Long Short-Term Memory
• Stochastic
• Bidirectional (propagates in both directions)
Feed Forward Network
(diagram: input features / input layer with +1 bias units, hidden layer, output)
What is the value/output of each hidden and output unit in the network?

Answer: the sigmoid activation of the sum of its weighted inputs.

Feed Forward Propagation: apply that rule layer by layer, from the input layer forward to the output.
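Feed forward propagation can be sketched with NumPy; the layer sizes and weights below are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, weights):
    a = x
    for W in weights:
        a = np.insert(a, 0, 1.0)  # prepend the +1 bias unit
        a = sigmoid(W @ a)        # activation of the weighted inputs
    return a

# 2 inputs -> 2 hidden units -> 1 output
W1 = np.array([[0.1, 0.4, -0.2],
               [-0.3, 0.2, 0.5]])   # shape (2, 3): 2 units, bias + 2 inputs
W2 = np.array([[0.2, 0.7, -0.6]])  # shape (1, 3): 1 unit, bias + 2 hidden
out = feed_forward(np.array([1.0, 2.0]), [W1, W2])
```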
Backpropagation
• Gradient computation is done by computing the derivative of the error between our expected output and our actual output, and propagating that error backwards through the network.
• Calculate: δ⁽ᴸ⁾ = a⁽ᴸ⁾ − y for the output layer, then δ⁽ˡ⁾ = (Θ⁽ˡ⁾)ᵀ δ⁽ˡ⁺¹⁾ .* g′(z⁽ˡ⁾) for the hidden layers.
Backpropagation
(worked example on the network diagram above, letting y = 1 for this sample)
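A sketch of backpropagation for a small two-layer network like the one in the diagrams; the shapes and data are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, W1, W2):
    # forward pass, keeping intermediate activations
    a1 = np.insert(x, 0, 1.0)            # input layer with bias
    z2 = W1 @ a1
    a2 = np.insert(sigmoid(z2), 0, 1.0)  # hidden layer with bias
    a3 = sigmoid(W2 @ a2)                # output layer

    # backward pass
    d3 = a3 - y                          # output error (delta)
    # propagate through W2 (dropping the bias row), times sigmoid gradient
    d2 = (W2.T @ d3)[1:] * sigmoid(z2) * (1 - sigmoid(z2))

    grad2 = np.outer(d3, a2)             # gradient for W2
    grad1 = np.outer(d2, a1)             # gradient for W1
    return grad1, grad2

W1 = np.random.randn(2, 3)
W2 = np.random.randn(1, 3)
g1, g2 = backprop(np.array([1.0, 2.0]), np.array([1.0]), W1, W2)
```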
Recurrent Neural Networks
• Connections between units form a directed cycle
• Allows the network to exhibit dynamic temporal behavior
• Useful for maintaining internal memory or state over time
• Ex: unsegmented handwriting recognition
• At any given time step, each non-input unit computes its current activation as a nonlinear function of the weighted sum of the activations of all units from which it receives connections.
• Training is done with backpropagation through time
• vanishing gradient problem (addressed with LSTM networks)
Recurrent Neural Networks
http://www.manoonpong.com/AMOSWD08.html
LSTM Recurrent Neural Networks
• Long Short-Term Memory
• Well suited for classifying, predicting, and processing time-series data with very long-range dependencies.
• Achieves the best known results in unsegmented handwriting recognition
• Traps error within a memory block (often referred to as an error carousel)
• Amazing applications in rhythm learning, grammar learning, music composition, robot control, etc.
Other Classification Techniques
• SVM (support vector machines)
• constructs a hyperplane in a high-/infinite-dimensional space used for training/classification, regression, etc.
• by defining a kernel function (or some function that tells us similarity), SVM allows us to perform simple dot products between high-dimensional features
• high margin (the decision boundary has good separation between training points), which benefits good generalization
• Naive Bayes
Unsupervised Learning
• Categorization
• Clustering (density estimation)
• Selecting k clusters (k-means): assign data points to the nearest cluster, update each cluster’s average centroid, and iterate
• Blind Signal Separation
• Feature Extraction for Dimensionality Reduction
• Hidden Markov Models
• Non-normal & normal distribution analysis (finding the distributions of data)
• Self-Organizing Maps
Autoencoders
Unsupervised learning from neural networks: train the network to reproduce its input at the output through a smaller hidden layer, forcing it to learn a compressed representation.
Knowing What To Do Next
• Build your algorithm quick and dirty; don’t spend a lot of time on it until you have something to use
• Split up your training, cross-validation, and test sets (don’t test on your training data!)
• Move on to PCA or unsupervised pre-training for your supervised algorithms to help improve performance
• Don’t just try to get a lot of data to train on; implement your algorithm quick and dirty, use smaller data sets initially, and determine bias/variance:
• High variance: get more training data
• High variance: try fewer features
• High variance: increase regularization
• High bias: add additional features
• High bias: add polynomial features
• High bias: decrease regularization
Follow me @nyxtom
Thank you!
Questions?
http://ml-class.org/
https://www.coursera.org/course/bluebrain
https://www.coursera.org/course/neuralnets