data science in industry - applying machine learning to real-world challenges

Post on 15-Jul-2015


Data Science in Industry: Applying Machine Learning to Real-world Challenges

About me - Yuchen Zhao

principal data scientist at

obtained a Ph.D. in data mining and machine learning

worked in both academia and industry

Not just a researcher, but a coder & hacker

What is data science?

data is everywhere...

data science helps

extract knowledge from data...

Data scientists investigate complex data problems

find and interpret rich data sources

Visualize the data

get insights from data

from insights….

Questions?

Now is the fun part...

Data Science techniques!

Data science 101

● regression

● classification

● clustering

● ranking (not covered in this lecture)

● recommendation (not covered in this lecture)

Regression

What is regression?

A slightly more formal definition...

models a functional relationship between

an input variable x and

a response variable y

x

y

find the equation
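As a minimal sketch of what "find the equation" can mean, the snippet below fits a straight line to a handful of points. It assumes the relationship is linear (y = slope * x + intercept) and uses ordinary least squares; the slides do not say which model or fitting method was used, and the function name and data are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (assumed model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both without the 1/n factor,
    # which cancels in the ratio).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)

# Use the fitted equation to predict the response for a new input.
prediction = slope * 5.0 + intercept
```

Once the equation is known, predicting y for a new x is just a matter of plugging x in, which is what the last line does.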

What else can regression do?

Predicting who may change jobs!

x

y

Recap - regression

Classification

identify to which of a set of categories a new data point belongs

Spam or Not spam?

Credit approve or not?

Optical character recognition

Document classification

SVM

Decision tree
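To make the idea concrete, here is a rough sketch of the smallest possible decision tree: a single split on one numeric feature, often called a decision stump. The spam scenario, the feature (count of suspicious words per message), and all names and numbers are hypothetical and not taken from the slides. The sketch also assumes that larger feature values indicate the positive class.

```python
def fit_stump(values, labels):
    """Pick the threshold that misclassifies the fewest training points.

    Predicts 1 when value > threshold and 0 otherwise (assumed direction).
    Returns None if all values are identical, since no split is possible.
    """
    candidates = sorted(set(values))
    best_threshold, best_errors = None, len(values) + 1
    # Try the midpoint between each pair of neighbouring distinct values.
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        errors = sum((v > threshold) != bool(y) for v, y in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def predict(threshold, value):
    """Classify a new data point with the learned split."""
    return 1 if value > threshold else 0


# Hypothetical training data: suspicious-word counts and labels (1 = spam).
counts = [0, 1, 1, 2, 6, 7, 9, 10]
is_spam = [0, 0, 0, 0, 1, 1, 1, 1]

threshold = fit_stump(counts, is_spam)
new_message_label = predict(threshold, 8)
```

A full decision tree repeats this kind of split recursively on several features; the stump only shows the core step of learning a rule from labeled examples and applying it to a new data point.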

Use classification to...

find who you are in social networks

classification

Missing data

Outdated data

Non-standard data

Why do we want to classify?

Understanding users’ social roles is crucial to many

social network applications

including advertising targeting,

marketing, personalization,

recommendation, etc.

Finding out who you really are...

manual labeling is time-consuming

and error-prone

Human learning

Machine learning

SVM

Decision tree

How accurate can we get?

Can we further improve?

Clustering

grouping a set of data points

data points in the same group (cluster) are more similar to each other

than to those in other groups (clusters)

k-means clustering algorithm

k clusters

k = 3

step 1: randomly select k points

as centroids

3 random centroids

step 2: assign every data point to

the nearest centroid

step 3: calculate the mean of each cluster

as the new centroid

repeat: assign clusters based on

the new centroids
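The three steps and the repeat can be sketched in plain Python as follows. This is an illustrative version under a few assumptions the slides do not state: squared Euclidean distance, stopping when assignments no longer change, a fixed random seed for repeatability, and keeping the old centroid if a cluster ends up empty. The function names and sample points are invented for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly select k of the data points as initial centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign every data point to the nearest centroid.
        new_assignments = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Step 3: the mean of each cluster becomes the new centroid.
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[i] = mean_point(members)
    return centroids, assignments


# Hypothetical 2-D data with three loose groups, matching k = 3.
points = [
    (1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
    (8.0, 8.0), (8.5, 9.0), (9.0, 8.0),
    (1.0, 9.0), (1.5, 8.0), (2.0, 9.5),
]
centroids, assignments = k_means(points, k=3)
```

Because the starting centroids are random, different seeds can lead to different final clusters; running the algorithm several times and keeping the best result is a common remedy.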

How can clustering help solve big data problems?

Machine data is massive

1 Tb/day is normal

no one has time to read all data...

Clustering comes to the rescue!

clustering algorithm summarizes big data to a few groups

each group represents a number of similar data points

investigating data points one by one

just investigating the clusters!
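A toy illustration of investigating groups instead of individual records: the sketch below buckets log lines that differ only in their numbers and reports one representative pattern per group with a count. This is a deliberately simple stand-in, not a real clustering algorithm and not the method behind the slides; it groups by an exact masked template rather than by a learned similarity, and the log lines and function names are invented.

```python
import re
from collections import defaultdict


def template_of(line):
    """Mask every run of digits so lines differing only in numbers match."""
    return re.sub(r"\d+", "#", line)


def summarize(lines):
    """Group lines by masked template and return (template, count) pairs,
    largest group first."""
    groups = defaultdict(list)
    for line in lines:
        groups[template_of(line)].append(line)
    return sorted(
        ((template, len(members)) for template, members in groups.items()),
        key=lambda item: -item[1],
    )


# Hypothetical machine data.
logs = [
    "user 17 logged in",
    "user 42 logged in",
    "disk 3 usage at 91 percent",
    "user 8 logged in",
    "disk 1 usage at 77 percent",
]
summary = summarize(logs)
```

Five lines collapse into two groups here; on real machine data the same idea lets a person review a short list of patterns rather than every record.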

Things to consider in practice...

scalability

velocity

variety

real-time

What’s next?

Recap

● regression

● classification

● clustering

This presentation was initially created for a guest lecture at Utah State University for teaching and education purposes.

Thanks!
