HSD Hochschule Düsseldorf

University of Applied Sciences

Fachbereich Wirtschaftswissenschaften

Faculty of Business Studies

IT Applications in Business Analytics

Business Analytics (M.Sc.)

IT in Business Analytics

SS2016 / Lecture 05 – Introduction to KNIME

Thomas Zeutschler

HSD Faculty of Business Studies

Thomas Zeutschler

Associate Lecturer

Let’s get started…

SS 2016 - IT Applications in Business Analytics - 5. Introduction to KNIME 2

Introduction

KNIME Analytics Platform

KNIME is an open-source platform for analytical data modelling and processing.

KNIME was developed at the University of Konstanz in 2004-2006 and initially focused on pharmaceutical research.

Today KNIME is a modular, highly scalable data processing platform that allows easy integration of different modules for:

- data loading, processing, transformation
- data analysis
- visual data exploration

www.knime.org

KNIME Analytics Platform – Workflows

An analysis is defined by a graphical workflow.

Interlinked nodes define the individual steps of a workflow.

Hundreds of predefined nodes are available for various purposes:

- data loading, processing, transformation and data delivery
- data analysis and visualization
- interaction with other tools (e.g. run an R script)
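
The idea of a workflow as a chain of interlinked nodes can be illustrated outside KNIME. The following is only a minimal sketch in Python, not KNIME code: each "node" is assumed to be a plain function that takes a table (a list of rows) and returns a new table. All function names and the sample values are made up for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Table = List[Row]
Node = Callable[[Table], Table]


def read_rows() -> Table:
    # Stand-in for a reader node; the values are invented for this sketch.
    return [
        {"name": "a", "value": 3.0},
        {"name": "b", "value": None},
        {"name": "c", "value": 5.0},
    ]


def drop_missing(table: Table) -> Table:
    # Processing step: keep only rows that have a value.
    return [row for row in table if row["value"] is not None]


def add_double(table: Table) -> Table:
    # Transformation step: derive a new column from an existing one.
    return [{**row, "double": row["value"] * 2} for row in table]


def run_workflow(source: Table, nodes: List[Node]) -> Table:
    # Pass the table through every node in order, like following the
    # connections between nodes from left to right.
    table = source
    for node in nodes:
        table = node(table)
    return table


result = run_workflow(read_rows(), [drop_missing, add_double])
print(result)
```

Each step only sees the output of the previous one, which is what makes a workflow easy to rearrange and rerun.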

KNIME Analytics Platform - Frontend

KNIME Analytics Platform – Real World Example

KNIME Analytics Platform – Real World Example

KNIME – Installation

Register, then download and install KNIME from http://knime.org

www.knime.org

KNIME – Let’s get started…

KNIME – First Data Analysis

“Sleep in Mammals: Ecological and Constitutional Correlates”

- Description: https://www.stat.auckland.ac.nz/~stats330/datasets.dir/sleep.txt
- Dataset: https://www.stat.auckland.ac.nz/~stats330/datasets.dir/sleep.csv

“Titanic Survival Status”

- Description: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html
- Dataset: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls

KNIME – Essential Nodes

Problem: Too many nodes…

Solution 1: You can search directly in the Node Repository.

Solution 2: Search https://tech.knime.org/forum for your problem.

Reading Data

KNIME – Essential Nodes

Data Preparation

- The input table is split into two partitions (i.e. row-wise), e.g. train and test data. The two partitions are available at the two output ports.
- This node helps handle missing values found in cells of the input table.
- The node allows for row / column filtering according to certain criteria.
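
To make the three preparation steps concrete, here is a rough Python equivalent. It is a sketch under assumptions, not a description of how the KNIME nodes are implemented: the 80/20 split, the fixed seed, the choice of mean imputation, and all names and values are hypothetical.

```python
import random


def partition(rows, train_fraction=0.8, seed=42):
    # Row-wise split into two partitions (e.g. train and test).
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fill_missing(rows, column):
    # One possible missing-value strategy: replace None with the column mean.
    present = [row[column] for row in rows if row[column] is not None]
    mean = sum(present) / len(present)
    return [
        {**row, column: mean if row[column] is None else row[column]}
        for row in rows
    ]


def filter_rows(rows, predicate):
    # Keep only the rows that satisfy the given criterion.
    return [row for row in rows if predicate(row)]


# Made-up table: rows 0, 4 and 8 have a missing "x" value.
rows = [{"id": i, "x": float(i) if i % 4 else None} for i in range(10)]

cleaned = fill_missing(rows, "x")
large = filter_rows(cleaned, lambda row: row["x"] > 4.0)
train, test = partition(cleaned)
print(len(train), len(test), len(large))
```

Handling missing values before partitioning, as done here, is a simplification; for a real model evaluation the imputation value would normally be computed on the training partition only.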

KNIME – Essential Nodes

First Statistical Data Analysis

- Calculates summary statistics such as minimum, maximum, mean, standard deviation, variance, median, overall sum, number of missing values and row count across all numeric columns, and counts all nominal values together with their occurrences.
- Creates a cross table (also referred to as a contingency table or crosstab). It can be used to analyze the relationship between two columns with categorical data and displays the frequency distribution of the categorical variables in a table.
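
The two analyses described above can be sketched in a few lines of Python using only the standard library. This is an illustration of the calculations, not of the KNIME nodes themselves; the helper names, the use of the sample (n - 1) variance, and the toy data are assumptions.

```python
import statistics
from collections import Counter


def summarize(values):
    # Summary statistics for one numeric column; None marks a missing cell.
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "std_dev": statistics.stdev(present),      # sample standard deviation
        "variance": statistics.variance(present),  # sample variance
        "sum": sum(present),
        "missing": len(values) - len(present),
        "rows": len(values),
    }


def crosstab(rows, row_key, col_key):
    # Count how often each combination of two categorical values occurs.
    return Counter((row[row_key], row[col_key]) for row in rows)


values = [2.0, 4.0, None, 6.0]
stats = summarize(values)

records = [
    {"group": "a", "flag": "yes"},
    {"group": "a", "flag": "no"},
    {"group": "b", "flag": "yes"},
    {"group": "a", "flag": "yes"},
]
table = crosstab(records, "group", "flag")
print(stats)
print(table)
```

Dividing each cell of the cross table by its row total turns the counts into conditional frequencies, which is often the quantity of interest.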

KNIME – Data Mining Cheating…

http://scikit-learn.org/stable/tutorial/machine_learning_map/

KNIME – Data Mining Cheating…

Algorithm: Pros / Cons / Good at

Linear regression

Pros:
- Very fast to train and apply
- Easy to understand the model
- Less prone to overfitting

Cons:
- Unable to model complex relationships
- Unable to capture nonlinear relationships without first transforming the inputs

Good at:
- The first look at a dataset
- Numerical data with lots of features

Decision trees

Pros:
- Fast
- Robust to noise and missing values
- Accurate

Cons:
- Complex trees are hard to interpret
- Duplication within the same sub-tree is possible

Good at:
- Star classification
- Medical diagnosis
- Credit risk analysis

Neural networks

Pros:
- Extremely powerful
- Can model even very complex relationships
- No need to understand the underlying data
- Almost works by “magic”

Cons:
- Prone to overfitting
- Long training time
- Requires significant computing power for large datasets
- Model is essentially unreadable

Good at:
- Images
- Video
- “Human-intelligence” type tasks like driving or flying
- Robotics

Support Vector Machines

Pros:
- Can model complex, nonlinear relationships
- Robust to noise (because they maximize margins)

Cons:
- Need to select a good kernel function
- Model parameters are difficult to interpret
- Sometimes numerical stability problems
- Requires significant memory and processing power

Good at:
- Classifying proteins
- Text classification
- Image classification
- Handwriting recognition

K-Nearest Neighbors

Pros:
- Simple
- Powerful
- No training involved (“lazy”)
- Naturally handles multiclass classification and regression

Cons:
- Expensive and slow to predict new instances
- Must define a meaningful distance function
- Performs poorly on high-dimensionality datasets

Good at:
- Low-dimensional datasets
- Computer security: intrusion detection
- Fault detection in semiconductor manufacturing
- Video content retrieval
- Gene expression
- Protein-protein interaction
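
As an illustration of two entries for K-Nearest Neighbors (“no training involved” and “must define a meaningful distance function”), here is a minimal sketch in Python. It assumes Euclidean distance and majority voting with k = 3; the function names and the toy points are made up.

```python
import math
from collections import Counter


def euclidean(a, b):
    # One possible distance function; others can be passed in instead.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3, distance=euclidean):
    # "Lazy" learning: there is no training step, the stored examples are
    # scanned at prediction time, which is why prediction is slow.
    neighbors = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Made-up labelled points: (features, label)
train = [
    ((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((0.9, 1.1), "low"),
    ((5.0, 5.0), "high"), ((5.2, 4.8), "high"), ((4.9, 5.1), "high"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```

Every prediction sorts the whole training set by distance, which shows directly where the cost listed under “Cons” comes from.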

KNIME – Data Mining Cheating…

http://www.kdnuggets.com/2015/07/good-data-science-machine-learning-cheat-sheets.html

https://github.com/soulmachine/machine-learning-cheat-sheet/raw/master/machine-learning-cheat-sheet.pdf

https://azure.microsoft.com/en-us/documentation/articles/machine-learning-algorithm-cheat-sheet/

Exercise in KNIME

First Exercise in KNIME

"Sleep in Mammals: Ecological and Constitutional Correlates" by Allison, T. and Cicchetti, D. (1976)

https://www.stat.auckland.ac.nz/~stats330/datasets.dir/sleep.txt

…/sleep.csv

Source: https://www.stat.auckland.ac.nz/~stats330/datasets.dir/

Training Video: https://www.youtube.com/watch?v=Uo1C7Iligw0

First Exercise in KNIME

"Sleep in Mammals: Ecological and Constitutional Correlates" by Allison, T. and Cicchetti, D. (1976)

1. What is the average lifespan of the animals?
2. Which species lives the longest?
3. Can we have a histogram of lifespan?
4. What is the correlation between lifespan and size of an animal?
5. Can we have a full correlation matrix of all variables (see figure 1)?
6. Can we have a scatter plot of species size vs. danger factor (see figure 2)?
7. Split the dataset (train, test) and answer the following question: Can we predict “total-sleep”?

[Figure 1: correlation matrix of all variables]

[Figure 2: scatter plot of species size vs. danger factor]

https://www.stat.auckland.ac.nz/~stats330/datasets.dir/sleep.csv
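
For questions 4 and 5 it helps to know what a correlation node computes. The sketch below shows the Pearson correlation coefficient and a full correlation matrix in plain Python. It is an assumption-laden illustration: the column names and values are invented and are not taken from sleep.csv, and rows with a missing cell in either column are simply skipped.

```python
import math


def pearson(xs, ys):
    # Use only rows where both values are present (None marks a missing cell).
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    # Pairwise correlation of every column with every other column.
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


# Hypothetical toy columns, chosen so the result is easy to check by eye.
columns = {
    "size": [1.0, 2.0, 3.0, 4.0],
    "lifespan": [2.0, 4.0, None, 8.0],
    "sleep": [8.0, 6.0, 4.0, 2.0],
}

matrix = correlation_matrix(columns)
for (a, b), r in matrix.items():
    print(f"{a:>8} vs {b:<8} r = {r:+.2f}")
```

A value near +1 or -1 indicates a strong linear relationship, a value near 0 indicates none; it says nothing about nonlinear relationships.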

Lecture Summary & Homework

Lessons Learned

KNIME is an easy path towards analytics.

A workflow-oriented way of working dramatically simplifies the data analysis and modelling process.

Combine CRISP-DM and KNIME and you can solve complex analytical problems in a well-organized and repeatable way.

First, try to understand which algorithm fits which problem, how the algorithms behave, and what influences their behavior.

Second (if you are willing), try to understand how the algorithms work.

Resources

KNIME

- KNIME Forum: https://tech.knime.org/forum
- KNIME Training Videos: https://www.youtube.com/user/KNIMETV

Data Mining Literature

- Data Mining for the Masses: http://docs.rapidminer.com/downloads/DataMiningForTheMasses.pdf
- Machine Learning Cheat Sheet: https://github.com/soulmachine/machine-learning-cheat-sheet/raw/master/machine-learning-cheat-sheet.pdf

Get Prepared (Homework)

Homework: Titanic Survival Status

http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls

Answer the following question: “What was the probability of survival per class (1, 2, 3) and sex (male, female)?”

Create a KNIME workflow that answers the question based on the original Titanic data set.

Submit your results as a KNIME archive file to [email protected].

Hint:

Any Questions?
