Presented by:
Mr. D. G. Sancheti
“We are drowning in data, but starving for knowledge!”
TODAY'S SHOW
You will learn a few data analysis topics
Posing a question
Wrangling your data into a format you can use and fixing
any problems with it
Exploring the data, finding patterns in it, and building
your intuition about it
Drawing conclusions and/or making predictions
Communicating your findings
What is Big Data Analytics?
Data analytics is an emerging technique that dives into a
data set without a prior set of hypotheses
Accumulated raw data captured from various sources
(e.g. discussion boards, emails, exam logs, chat logs in
e-learning systems) can be used to identify fruitful
patterns and relationships
Examining large amounts of data
Data Drives Performance
Big Data Analytics Drives Results
Increase Revenue
Decrease Costs
Increase Productivity
Why Big Data Analytics??
Applications of Data analytics
Understanding and targeting Customers
Understanding and optimizing Business Processes
Improving Healthcare and Public Health
Optimizing Machine and Device Performance
Financial Trading
Improving and Optimizing Cities and Countries
Can you think of anything more??
How??
Reference Models
CRISP-DM
Agile methodology: ASD-DM
Cross Industry Standard Process for Data Mining
(CRISP-DM)
The CRISP-DM reference model
The BIG Four
Classification
Cluster Analysis
Association Rules
Prediction
Data Classification
Some Examples:
Separating customers based on gender
Sorting data based on content type, file type, size, etc.
Classifying data into restricted, public or private data
types
"Among all the customers of Zalando, which are likely to respond to a new
offer?"
Will respond / Will not respond
Decision trees (DT)
Build classification or regression models in the form of a tree
structure
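A minimal sketch of how a decision tree might pick its first split. It assumes Gini impurity as the split criterion (the slides do not name one), and the customer table, `gini` and `best_split` are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the group is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must put rows on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, t, score)
    return best

# Hypothetical data: (age, past_purchases) -> did the customer respond?
rows = [(22, 1), (25, 0), (31, 4), (35, 5), (41, 6), (48, 2)]
labels = ["no", "no", "yes", "yes", "yes", "no"]

feature, threshold, score = best_split(rows, labels)
print(feature, threshold, score)
```

On this toy table the chosen split is on the second feature, which reads as the decision rule "if past_purchases > 2 then will respond, else will not respond".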
Classification Methods
Decision Trees to Decision Rules
Classification Methods
Support Vector Machines (SVM)
Each data item is a point in n-dimensional space (n is the
number of features)
Find the hyperplane that differentiates the two classes
Classification Methods
Which do you think are the separating
Hyperplanes?
Classification Methods
Select the hyperplane which
segregates the two classes better
Ans: B
Maximise the distance between the hyperplane
and the nearest data point (the margin)
Ans: C
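The margin idea can be sketched in a few lines. The points and the two candidate lines below are hypothetical and do not correspond to hyperplanes A, B and C on the slide; the `margin` helper is also an illustration, not part of any SVM library.

```python
import math

def margin(w, b, points):
    """Smallest distance from any point to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        abs(sum(wi * xi for wi, xi in zip(w, p)) + b) / norm
        for p in points
    )

# Hypothetical 2-D points: two on the left (class 1), two on the right (class 2)
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]

# Two candidate separating lines; both classify every point correctly
candidates = {
    "x = 2.5": ((1.0, 0.0), -2.5),
    "x = 3.0": ((1.0, 0.0), -3.0),
}

margins = {name: margin(w, b, points) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins, best)
```

Both lines separate the classes, but the line x = 3.0 sits further from its nearest point, so it is the one a maximum-margin classifier would prefer.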
Select the hyperplane which classifies
accurately before maximising the margin
Ans: A
Ignores outliers
Introduce a new feature: z = x² + y²
In the original input space the
hyperplane looks like a circle
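A small sketch of why the extra feature z = x² + y² helps. The points and the threshold of 1.0 are assumptions chosen for illustration: one class sits near the origin, the other on a ring around it, so no straight line in (x, y) separates them, but a single cut on z does.

```python
# Hypothetical points: inner class near the origin, outer class on a ring
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]

def z(point):
    """The added feature z = x^2 + y^2 (squared distance from the origin)."""
    x, y = point
    return x * x + y * y

# Assumed cut-off: the flat boundary z = 1 in the new space corresponds
# to the circle x^2 + y^2 = 1 in the original input space.
threshold = 1.0

inner_ok = all(z(p) < threshold for p in inner)
outer_ok = all(z(p) > threshold for p in outer)
print(inner_ok, outer_ok)
```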
Classification Methods
Dotted lines: Potential Links
Blue box: Additional nodes and links between input
and output
Bayesian Networks
Based on probability theory.
Can mix expert opinion and data to build
models
Backwards reasoning - in addition to
predicting outputs given inputs, we can
use output values to infer inputs.
Support for missing data during learning
and classification
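Backwards reasoning can be sketched with the smallest possible network, a single link Rain → WetGrass. The structure and all probabilities below are made-up values for illustration, not taken from the slides.

```python
# Hypothetical two-node network: Rain -> WetGrass
p_rain = 0.2             # prior P(rain)
p_wet_given_rain = 0.9   # P(wet | rain)
p_wet_given_dry = 0.1    # P(wet | no rain)

# Forward reasoning: predict the output from what we know about the input
p_wet = p_rain * p_wet_given_rain + (1 - p_rain) * p_wet_given_dry

# Backwards reasoning (Bayes' rule): we observe wet grass and infer the input
p_rain_given_wet = p_rain * p_wet_given_rain / p_wet

print(round(p_wet, 4), round(p_rain_given_wet, 4))
```

Observing the output (wet grass) raises the belief in the input (rain) from the prior of 0.2 to about 0.69.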
Classification Methods
Bayesian Network Example
Association Rules
Discovering interesting relations between variables in
large databases
Example Problems
Which products are frequently bought together by
customers? (Basket Analysis)
● DataTable = Receipts x Products
● Results could be used to change the placements of products in the market
Which courses tend to be attended together?
● DataTable = Students x Courses
● Results could be used to avoid scheduling conflicts....
Association Rules
Examples
Bread, Cheese → Red Wine.
Customers that buy bread and cheese, also tend to buy red
wine
Machine Learning → Web Mining, ML Praktikum
Students that take 'Machine Learning' also take 'Web Mining'
and the 'Machine Learning Praktikum'
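How strong a rule such as Bread, Cheese → Red Wine is can be measured with support and confidence. These are the standard measures for association rules, although the slides do not define them here. The five receipts below are hypothetical.

```python
# Hypothetical receipts (one set of products per customer basket)
baskets = [
    {"bread", "cheese", "red wine"},
    {"bread", "cheese", "red wine", "milk"},
    {"bread", "cheese"},
    {"bread", "milk"},
    {"cheese", "red wine"},
]

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"bread", "cheese", "red wine"})
rule_confidence = confidence({"bread", "cheese"}, {"red wine"})
print(rule_support, round(rule_confidence, 3))
```

Here the rule holds in 2 of 5 baskets (support 0.4), and 2 of the 3 baskets with bread and cheese also contain red wine (confidence about 0.67).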
Apriori Principle illustration
If {c,d,e} is frequent then all
subsets of this itemset are
frequent
Support Based pruning illustration
If {a,b} is infrequent then all
supersets of this itemset are
infrequent
Association Rules
Association Rules: Apriori example
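A minimal sketch of level-wise Apriori mining, assuming a support threshold given as an absolute count. The transactions and the `apriori` function are illustrative and are not the example from the slide.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset found in >= min_support transactions."""
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]  # 1-item candidates
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: build (k+1)-item candidates from the frequent k-itemsets
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop a candidate if any of its (k-1)-item subsets is
        # infrequent, since all supersets of an infrequent itemset are infrequent
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent

# Hypothetical transactions
transactions = [
    {"a", "c", "d"},
    {"b", "c", "e"},
    {"a", "b", "c", "e"},
    {"b", "e"},
]

result = apriori(transactions, min_support=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

In this run {a,b,c} and {a,c,e} are pruned without being counted, because {a,b} and {a,e} are infrequent; only {b,c,e} survives as a 3-item candidate.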
Cluster analysis
Task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more
similar (in some sense or another) to each other than to
those in other groups (clusters).
Examples
Biology: What is the taxonomy of the species?
Education: What are student groups that need special
attention?
Business: What are the customer segments?
Clustering workflow
Cluster analysis
Methodologies
K-Means Clustering
Hierarchical Clustering
And many more!!
K-means clustering
k-means clustering aims to partition n observations into k
clusters in which each observation belongs to the cluster
with the nearest mean, serving as a prototype of the
cluster
Unsupervised learning algorithm
Define k centroids, one for each cluster
Take each point in the data set and associate it to the
nearest centroid
Recalculate the centroids
Repeat until the centroids no longer move
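The steps above can be sketched in plain Python. Seeding the centroids with the first k points is an assumption made to keep the example deterministic (random seeding is more common), and the 2-D points are hypothetical.

```python
import math

def kmeans(points, k, max_iter=100):
    """Return (centroids, clusters) after running the k-means loop to convergence."""
    centroids = [points[i] for i in range(k)]  # assumption: seed with first k points
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # centroids stopped moving
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D observations forming two visible groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

Even though both seeds start in the same group, the second centroid is pulled towards the distant points after the first update, and the loop settles on the two groups.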
Hierarchical clustering
Groups data over a variety of scales by creating a cluster
tree or dendrogram.
Find the similarity or dissimilarity between every pair of
objects in the data set.
Group the objects into a binary, hierarchical cluster
tree.
Determine where to cut the hierarchical tree into
clusters
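The three steps can be sketched with a simple agglomerative procedure. Single linkage (distance between the closest members of two clusters) and the 1-D values are assumptions for illustration; the slides do not state which linkage or data they use.

```python
def single_linkage(points, cut):
    """Merge the two closest clusters repeatedly; stop once the closest pair
    is further apart than `cut` (this is where the tree is cut)."""
    clusters = [[p] for p in points]

    def dist(a, b):
        # Single linkage: dissimilarity of the closest pair of members
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > 1:
        d, i, j = min(
            (dist(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > cut:
            break
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters

# Hypothetical 1-D observations
values = [1.0, 1.2, 5.0, 5.1, 5.3, 9.0]
groups = single_linkage(values, cut=1.0)
print(sorted(sorted(g) for g in groups))
```

A lower cut value yields more, smaller clusters; a higher one merges everything into a single cluster.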
Hierarchical clustering
Dissimilarity
measures
Grouped (B,F), less
dissimilarity
Grouped (A,E), less
dissimilarity
Hierarchical clustering
Cutting the Tree
50% similarity = 50% dissimilarity
Take the clusters formed below 0.5 dissimilarity
(B,F), (A,E,C,G), (D)
Creating 3 clusters labelled 1, 2, 3
Clustering workflow
Which algorithm fits my data?
Which parameters fit my data?
How good is the obtained result?
How to improve result quality?
Predictive Analytics
Make predictions about unknown future events based on
past happenings
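As a minimal sketch of predicting from past data, the snippet below fits a straight line by ordinary least squares and extrapolates one step ahead. The choice of technique and the monthly sales figures are assumptions for illustration; the slides do not prescribe a specific model.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: sales observed in months 1 to 5
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # prediction for the not-yet-seen month 6
print(slope, intercept, forecast)
```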
Why now?
Growing volumes and types of data, and more interest in
using data to produce valuable insights.
Faster, cheaper computers.
Easier-to-use software.
Tougher economic conditions and a need for competitive
differentiation.
Predictive Analytics
Improve pattern detection and prevent criminal
behavior.
Determine customer responses or purchases, as well as
promote cross-sell opportunities.
Forecast inventory, manage resources and set ticket
prices.
Credit scores are used to assess a buyer’s likelihood of
default for purchases
Data Visualization
Data visualization is the process of converting raw data
into easily understood pictures of information that
enable fast and effective decisions.
Visualization plays the key role in the efficient
communication of information (especially with large
amounts of information).
Visualization is used as a "check" to verify / falsify
results of automatic data analysis.
Why Data Visualization?
Identify areas that need attention or improvement.
Clarify which factors influence customer behavior.
Help you understand which products to place where.
Predict sales volumes.
Data visualization is a quick, easy way to convey concepts in a
universal manner
Where does Visualization fit in CRISP-DM
Visual Reporting
Visual Analytics Loop
Visual Analytics will foster the constructive evaluation, correction and rapid
improvement of our processes and models and - ultimately - the improvement of our
knowledge and our decisions
Visual Analytics: Human and Machine
Visual Analytics vs Information Visualization
Visual analytics is more than just visualization. It can rather be seen as an
integral approach to decision-making, combining visualization, human
factors and data analysis.
Various Data Visualization Techniques