data science

Post on 11-May-2015

Category:

Education

DESCRIPTION

A quick introduction to the fascinating world of business and data analytics

TRANSCRIPT

prithwis mukerjee, ph.d.

Introduction to Data Science

Prithwis Mukerjee, PhD
Praxis Business School, Calcutta

Agenda

● Why data science ?
● Techniques
  ○ Statistics
  ○ Data Mining
  ○ Visualisation
● Tools & Platforms
  ○ R
  ○ Hadoop / MapReduce
  ○ Real Time Systems
● Business Domains

Volume

Data is being acquired from a variety of sources
● EFT in Banks, Credit card payments
● Cell phones
● Sensors attached to a variety of equipment
● Surveillance cameras, CCTV
● Social Media Updates
● Blogs
● Websites

Variety / Velocity

● Numeric data
● Structured text data
● Unstructured text data
● Images
● Sound and video recordings
● Graph Nodes
  ○ Social Media “friends”
  ○ Websites linked to each other

Data is being generated fast and is becoming obsolete or useless equally fast
● Real-time (or near real-time) data from sensors, cameras
● Website traffic
● Social media “trends”

So what is Big Data ?

● Volume
● Velocity
● Variety ?

A new term coined by IT vendors to push new technology like
● Map Reduce
● Hadoop
● NOSQL

A new way to
● collect
● store
● manage
● analyse
● visualise
data

Big Data is like Crude Oil { not new Oil }

Think of data as crude oil !

Big Data is like extracting the crude oil, transporting it in mega tankers, pumping it through pipelines and storing it in massive silos

But what about refining ?

The Science (and Art ) of Data

Data Science is like refining the crude oil
● Discovering what we do not know about the data
● Obtaining predictive, actionable insight
● Creating data products that have business impact
● Communicating relevant business stories

Two Perspectives

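[Venn diagram of three overlapping circles: Programming or “Hacking” Skills; Mathematics, Statistics Knowledge; Business Domain Knowledge. The overlaps are labelled Machine Learning, Operations Research and RDBMS / ERP / BI, with Data Science at the centre.]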

10 Things {most} Data Scientists do ...

1. Ask good questions
   What is what ? We do not know ! We would like to know
2. Define and test hypotheses, run experiments
3. Scoop, scrape, sample business data
4. Wrestle and tame data
5. Play with data, discover unknowns
6. Create models, algorithms
7. Understand data relationships
8. Tell the machine how to learn from the data
9. Create data products that deliver actionable insights
10. Tell relevant business stories from data

Statistics - World of Data

● Data comes in various types
  ○ Nominal - colour, gender, PIN code
  ○ Ordinal - scale of 1-10, {high, medium, low}
  ○ Interval - dates, temperature (Centigrade)
  ○ Ratio - length, weight, count
● Data comes in various structures
  ○ Structured data - nominal, ordinal, interval, ratio
  ○ Unstructured text - email, tweets, reviews
  ○ Images, voice prints
  ○ Graphs, networks - social media friendships, likes

Descriptive Statistics

● Numeric Description
  ○ Mean, Median, Mode
  ○ Quartile, Percentile
  ○ Variance / Standard Deviation
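
The numeric descriptions listed above can be tried out in a few lines. The following is a minimal sketch in Python using only the standard `statistics` module; the `sales` figures are made-up illustrative values, not data from this presentation.

```python
import statistics

# Hypothetical sample: made-up monthly sales figures, for illustration only.
sales = [12, 15, 15, 18, 21, 24, 30]

mean = statistics.mean(sales)                  # arithmetic average
median = statistics.median(sales)              # middle value of the sorted data
mode = statistics.mode(sales)                  # most frequent value
quartiles = statistics.quantiles(sales, n=4)   # three cut points: Q1, Q2, Q3
variance = statistics.variance(sales)          # sample variance (n - 1 denominator)
std_dev = statistics.stdev(sales)              # sample standard deviation

print("mean:", mean)
print("median:", median)
print("mode:", mode)
print("quartiles:", quartiles)
print("variance:", variance)
print("standard deviation:", std_dev)
```

Note that `variance` and `stdev` compute the sample versions; `pvariance` and `pstdev` in the same module give the population versions.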

Statistics : The Path Ahead

Probability, Distributions

Testing of Hypothesis

Regression, Testing

Predictive Analysis

Data Mining / Machine Learning

Is the process of obtaining
● novel
● valid
● potentially useful
● understandable
patterns in data

Typical tasks are
● classification
● clustering
● association rules
● sequential patterns
● regression
● deviation detection

Some definitions

Instance (an item or record)
● an observation that is characterised by a number of attributes
  ○ person - with attributes like age, salary, qualification
  ○ sale - with product, quantity, price

Attribute
● a measured characteristic of an instance

Class
● grouping of an instance into
  ○ acceptable, not acceptable
  ○ mammal, fish, bird

Nominal
● colour, PIN code, state

Ordinal
● ranking : tall, medium, short or feedback on a scale of 1 - 10

Ratio
● length, price, duration, quantity

Interval
● date, temperature

Data Mining : Classification

Classification
● Which loan applicant will not default on the loan ?
● Which potential customer will respond to a mailer campaign ?

Classification Example

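[Figure: a table of records whose attribute columns are marked categorical, categorical, continuous and class. Remaining figure labels: Training Set, Learn Classifier, Model, Test Set.]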
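
The train-then-test flow named in this example can be sketched in Python. The sketch below uses a deliberately simple "one rule" (1R) learner, chosen here only for brevity and not taken from the presentation. The loan records, the attribute names `owns_home` and `employment`, and the class `defaulted` are all hypothetical, and the continuous attribute is left out to keep the code short.

```python
from collections import Counter, defaultdict

def learn_one_rule(rows, attributes, target):
    """Learn a 1R model: choose the single attribute whose
    value -> majority-class rule makes the fewest training errors."""
    best = None
    for attr in attributes:
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(1 for row in rows if rule[row[attr]] != row[target])
        if best is None or errors < best[0]:
            best = (errors, attr, rule)
    # Fallback class for attribute values never seen during training.
    default = Counter(row[target] for row in rows).most_common(1)[0][0]
    _, attr, rule = best
    return attr, rule, default

def classify(model, row):
    attr, rule, default = model
    return rule.get(row[attr], default)

# Hypothetical training set (made-up records).
training_set = [
    {"owns_home": "yes", "employment": "salaried",   "defaulted": "no"},
    {"owns_home": "yes", "employment": "self",       "defaulted": "no"},
    {"owns_home": "no",  "employment": "salaried",   "defaulted": "no"},
    {"owns_home": "no",  "employment": "self",       "defaulted": "yes"},
    {"owns_home": "no",  "employment": "unemployed", "defaulted": "yes"},
    {"owns_home": "yes", "employment": "unemployed", "defaulted": "yes"},
    {"owns_home": "no",  "employment": "self",       "defaulted": "yes"},
]

# Hypothetical test set, held back from training.
test_set = [
    {"owns_home": "yes", "employment": "salaried",   "defaulted": "no"},
    {"owns_home": "no",  "employment": "unemployed", "defaulted": "yes"},
    {"owns_home": "yes", "employment": "self",       "defaulted": "no"},
]

model = learn_one_rule(training_set, ["owns_home", "employment"], "defaulted")
correct = sum(1 for row in test_set if classify(model, row) == row["defaulted"])
accuracy = correct / len(test_set)
print("attribute chosen:", model[0])
print("test accuracy:", accuracy)
```

The point of the sketch is the separation of roles: the model is learned only from the training set, and its quality is judged only on records it has not seen.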

Data Mining : Clustering

Given a set of unclassified data points, how to find a natural grouping within them

● Can we segment the market in some way that is not yet known ?
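
One common way to look for such a natural grouping is k-means clustering, which the slide itself does not name. The following is a minimal sketch in plain Python; the customer points (visits per month, average spend) and the choice of starting centroids are assumptions made for illustration.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means: repeatedly assign each point to its nearest centroid,
    then move each centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]   # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers: (visits per month, average spend per visit).
customers = [(1, 10), (2, 12), (1.5, 11), (8, 50), (9, 55), (8.5, 52)]

# Assumption: start from two of the data points as initial centroids.
centroids, clusters = k_means(customers, [customers[0], customers[3]])
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

No class labels are supplied anywhere: the two segments emerge only from the distances between the points, which is what distinguishes clustering from classification.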

Example of Document Clustering

Clustering points : 3204 articles from the Los Angeles Times

Similarity Measure : How many words are common in these documents (after excluding some common words)
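
The similarity measure described above can be sketched in Python as a count of shared words. The stop-word list and the three sample sentences below are invented for illustration; they are not taken from the Los Angeles Times collection.

```python
# Assumption: a tiny hand-picked stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "on", "for"}

def content_words(text):
    """Lower-case the text, strip punctuation and drop the common words."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return {w for w in words if w and w not in STOP_WORDS}

def shared_word_count(doc_a, doc_b):
    """Similarity = number of distinct content words the two documents share."""
    return len(content_words(doc_a) & content_words(doc_b))

# Hypothetical documents.
doc_finance_1 = "The central bank raised interest rates to fight inflation."
doc_finance_2 = "Inflation fears grow as the bank signals higher interest rates."
doc_sport = "The home team won the final match in extra time."

print(shared_word_count(doc_finance_1, doc_finance_2))
print(shared_word_count(doc_finance_1, doc_sport))
```

Documents on the same topic share more content words, so a clustering algorithm fed with this measure would tend to place them in the same group.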

Clustering of S&P Stock Data

● Observe Stock Movements every day.

● Clustering points: Stock-{UP/DOWN}

● Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day.

● We used association rules to quantify a similarity measure.

Regression

● Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
  ○ Greatly studied in statistics, neural network fields.
● Examples:
  ○ Predicting sales amounts of new product based on advertising expenditure.
  ○ Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  ○ Time series prediction of stock market indices.
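
The first example, predicting sales from advertising expenditure, can be sketched as a simple linear regression fitted by ordinary least squares. The code below is plain Python; the advertising and sales numbers are made up and deliberately lie on a straight line so that the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure and resulting sales.
advertising = [10, 20, 30, 40, 50]
sales = [25, 45, 65, 85, 105]

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 60 + intercept   # prediction for a new spend of 60
print(slope, intercept, predicted_sales)
```

Real data would not fall exactly on a line; the same formula then gives the line that minimises the sum of squared prediction errors.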

Data Mining : Association Rules Mining

Association Rules
● which products should be kept along with other products
● which two products should never be discounted together
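
Questions like these are usually answered with two standard measures of an association rule, support and confidence. The sketch below computes both in plain Python; the shopping baskets are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "jam"},
]

rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
print("support(bread, butter):", rule_support)
print("confidence(bread -> butter):", rule_confidence)
```

A rule with high support and high confidence, such as bread -> butter in this toy data, suggests keeping the two products near each other.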

Visualisation : The need to tell a story

Definitions

Data Mining
● Is the process of extracting unknown, valid and actionable information from large databases and using this to make business decisions
● Non trivial process of identifying valid, novel, potentially useful and understandable / explainable patterns in data

Data Science is a rare combination of multiple skills that include
● Technology : obviously !
but also
● Curiosity - a desire to go below the surface and discover a hypothesis that can be tested
● Storytelling - create a business story around the data
● Cleverness - again obviously, to look at the problem from different angles

R : Your first step into Data Science

Try out this free interactive tutorial just now

Statistical Tools

http://r4stats.com/articles/popularity/

Some Comparisons

Map Reduce

● Input : A set of (key, value) pairs

● User supplies two functions
  ○ Map (k,v) => List(k1,v1)
  ○ Reduce (k1, list(v1)) => v2

● Output is the set of (k1,v2) pairs
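
The contract above can be illustrated with the classic word-count example. The following is a single-machine sketch in Python that only mimics the data flow; `run_map_reduce` is a hypothetical helper written for this illustration and is not part of Hadoop or any other framework.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map (k, v) => list of (k1, v1): emit (word, 1) for each word in a line."""
    return [(word.lower(), 1) for word in value.split()]

def reduce_fn(key, values):
    """Reduce (k1, list(v1)) => v2: add up the counts for one word."""
    return sum(values)

def run_map_reduce(pairs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map every input pair, group the
    intermediate values by key (the "shuffle"), then reduce each group."""
    grouped = defaultdict(list)
    for k, v in pairs:
        for k1, v1 in map_fn(k, v):
            grouped[k1].append(v1)
    return {k1: reduce_fn(k1, v1s) for k1, v1s in grouped.items()}

# Input: (line number, line text) pairs; the text is made up.
lines = [(1, "big data is big"), (2, "data science is fun")]

word_counts = run_map_reduce(lines, map_fn, reduce_fn)
print(word_counts)
```

In a real cluster the map calls and the reduce calls run in parallel on different machines, and the framework handles the grouping step between them; the user still writes only the two functions.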

Hadoop

A programming framework that allows you to run Map-Reduce jobs on a distributed cluster of low cost machines without having to bother about anything except
● the Map and Reduce functions
● loading data into HDFS

1. HIVE
   a. A plug-in that allows one to use SQL like queries that are converted into map-reduce jobs
2. PIG
   a. A scripting language for writing long queries
3. HBASE
   a. A non-relational DBMS
4. SQOOP
   a. moves data to and from HDFS

Data-in-Flight

JavaScript for Data Visualisation

Business Domain

● Financial Sector
  ○ Risk Management, Credit Scoring
  ○ Predict Customer Spend
  ○ Stock and Investment Analysis
  ○ Loan approval
● Telecom Sector
  ○ Fraud Detection
  ○ Churn Prediction
● Retail and Marketing
  ○ Market segmentation
  ○ Promotional strategy
  ○ Market Basket Analysis
  ○ Trend Analysis
● Healthcare & Insurance
  ○ Fraud Detection
  ○ Drug Development
  ○ Medical Diagnostic Tools

Conclusion

Data Science is a rare combination of multiple skills that include
● Technology : obviously !
but also
● Curiosity - a desire to go below the surface and discover a hypothesis that can be tested
● Storytelling - create a business story around the data
● Cleverness - again obviously, to look at the problem from different angles

● Why data science ?
● Techniques
  ○ Statistics
  ○ Data Mining
  ○ Visualisation
● Tools & Platforms
  ○ R
  ○ Hadoop / MapReduce
  ○ Real Time Systems
● Business Domains

Thank You

Contact

Prithwis Mukerjee
Professor, Praxis Business School
prithwis@praxis.ac.in

This presentation is accessible at the blog http://blog.yantrajaal.com at the following URL

http://bit.ly/pm-datascience
