business intelligence & process modellingliacs.leidenuniv.nl/~takesfw/bipm/lecture3.pdf ·...

Post on 05-Jul-2020

Business Intelligence & Process Modelling

Frank Takes

Universiteit Leiden

Lecture 3 — BI & Descriptive Analytics

BIPM — Lecture 3 — BI & Descriptive Analytics 1 / 84

IP Viking map

http://map.norsecorp.com

BIPM — Lecture 3 — BI & Descriptive Analytics 2 / 84

Recap

Business Intelligence: anything that aims at providing actionable information that can be used to support business decision making

Business Intelligence

Visual Analytics

Descriptive Analytics

Predictive Analytics

Process Modelling (April and May)

BIPM — Lecture 3 — BI & Descriptive Analytics 3 / 84

Visual Analytics (“last week’s leftovers” or: “how it’s not done”)

BIPM — Lecture 3 — BI & Descriptive Analytics 4 / 84

Visualization

Visualization: mapping data properties to visual attributes

Good visualization: “proper” mapping of data attributes to visual attributes and properly “balancing” the number of data properties and visual attributes used

Bad visualization:

False data input

Misleading visual attributes

Abusing human background knowledge

BIPM — Lecture 3 — BI & Descriptive Analytics 5 / 84

“Unbiased” data

BIPM — Lecture 3 — BI & Descriptive Analytics 6 / 84

Rainbow colors

http://poynter.org/uncategorized/224413

BIPM — Lecture 3 — BI & Descriptive Analytics 7 / 84

Parts and sums

https://hbr.org/2014/12/vision-statement-how-to-lie-with-charts

BIPM — Lecture 3 — BI & Descriptive Analytics 8 / 84

2D bars and icons

BIPM — Lecture 3 — BI & Descriptive Analytics 9 / 84

2D bars explained

http://en.wikipedia.org/wiki/Misleading_graph

BIPM — Lecture 3 — BI & Descriptive Analytics 10 / 84

3D pies

http://en.wikipedia.org/wiki/Misleading_graph

BIPM — Lecture 3 — BI & Descriptive Analytics 11 / 84

Color-coding geographic regions

https://hbr.org/2014/12/vision-statement-how-to-lie-with-charts

BIPM — Lecture 3 — BI & Descriptive Analytics 12 / 84

Color-coding geographic regions

https://hbr.org/2014/12/vision-statement-how-to-lie-with-charts

BIPM — Lecture 3 — BI & Descriptive Analytics 13 / 84

Axis ranges

https://hbr.org/2014/12/vision-statement-how-to-lie-with-charts

BIPM — Lecture 3 — BI & Descriptive Analytics 14 / 84

Axis ranges

https://hbr.org/2014/12/vision-statement-how-to-lie-with-charts

BIPM — Lecture 3 — BI & Descriptive Analytics 15 / 84

Who understands?

http://www.multimension.com/project/upgrading-clinical-infographics/

BIPM — Lecture 3 — BI & Descriptive Analytics 16 / 84

Data Mining in a BI context

BIPM — Lecture 3 — BI & Descriptive Analytics 17 / 84

Overview

Data warehouse

Data preparation

Data mining theory recap

Data mining case studies

Data mining evaluation techniques

BIPM — Lecture 3 — BI & Descriptive Analytics 18 / 84

Data warehouse

Data warehouse: a copy of transaction data specifically structured for query and analysis (R. Kimball)

Data warehouse: a system used for reporting and data analysis (Wikipedia)

Data warehouse: a subject oriented, integrated, nonvolatile, timestamped collection of data designed to support management’s decision support needs (B. Inmon)

BIPM — Lecture 3 — BI & Descriptive Analytics 19 / 84

Data warehouse data

In a data warehouse, data is organized around subjects (whereas information systems are organized around applications)

Data is collected from heterogeneous sources and may already be aggregated (for example from an ERP or CRM system)

Data is timestamped

Data is nonvolatile

BIPM — Lecture 3 — BI & Descriptive Analytics 20 / 84

Data warehouse

http://savis.vn/

BIPM — Lecture 3 — BI & Descriptive Analytics 21 / 84

Transactional system vs. Data warehouse

Transactional system           | Data warehouse
Holds current data             | Current and historic data
Detailed data                  | Detailed and aggregated data
Volatile data                  | Nonvolatile data
High transaction frequency     | Medium-low frequency
Oriented on daily operations   | Oriented on data analysis
Support for daily decisions    | Support for strategic decisions
Many operational users         | Few decision-making users
Availability very important    | Availability not so important
Data storage focus             | Information acquisition focus

https://www.fer.unizg.hr/ (Business Intelligence)

BIPM — Lecture 3 — BI & Descriptive Analytics 22 / 84

Data mining

Data mining: the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems (Wikipedia)

Data mining: the practice of examining large pre-existing databases in order to generate new information (Oxford)

Data mining: knowledge discovery from data (or information) in an automated way (DIKW pyramid)

BIPM — Lecture 3 — BI & Descriptive Analytics 23 / 84

DIKW Pyramid

BIPM — Lecture 3 — BI & Descriptive Analytics 24 / 84

DIKW Gaps

ZPR FER Zagreb - Business Intelligence 20113

BIPM — Lecture 3 — BI & Descriptive Analytics 25 / 84

Data mining . . .

KDD: Knowledge Discovery in Databases

Data archeology

Information harvesting

Knowledge extraction

Machine learning

Big data techniques?

Data science?

Business intelligence?

BIPM — Lecture 3 — BI & Descriptive Analytics 26 / 84

Data mining

http://blogs.sas.com/content/subconsciousmusings/2014/08/22

BIPM — Lecture 3 — BI & Descriptive Analytics 27 / 84

KDD

Knowledge Discovery in Data is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data.

Fayyad et al., Advances in knowledge discovery and data mining, MIT Press, 1996.

BIPM — Lecture 3 — BI & Descriptive Analytics 28 / 84

KDD

BIPM — Lecture 3 — BI & Descriptive Analytics 29 / 84

Why data mining now?

Data flood / data explosion

Cloud computing power

Cheap storage

Algorithms have matured

Software is available

Competition is killing

BIPM — Lecture 3 — BI & Descriptive Analytics 30 / 84

Data mining in businesses

Process management

Market basket analysis

Marketing

Customer loyalty

Fraud detection

Trend analysis

BIPM — Lecture 3 — BI & Descriptive Analytics 31 / 84

Data mining in practice

1 Learn about the problem domain

2 Data selection

3 Data cleaning, preprocessing and reduction

4 Data mining

5 Interpretation of information

6 Apply knowledge in domain

BIPM — Lecture 3 — BI & Descriptive Analytics 32 / 84

Data preprocessing

Sampling

Normalization

Missing data

Data conflicts

Duplicate data

Ambiguity in data
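Three of the preprocessing steps listed above (normalization, missing data, duplicate data) can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the function names are hypothetical, missing values are assumed to be encoded as `None`, rows are assumed to be hashable tuples, and mean imputation is only one of several possible ways to handle missing data.

```python
def min_max_normalize(values):
    """Rescale numbers to [0, 1]; None entries (missing data) stay None."""
    known = [v for v in values if v is not None]
    lo, hi = min(known), max(known)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [None if v is None else (v - lo) / span for v in values]


def fill_missing_with_mean(values):
    """Replace None entries by the mean of the known values (one simple strategy)."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def drop_duplicates(rows):
    """Remove exact duplicate rows, keeping the first occurrence and the order."""
    return list(dict.fromkeys(rows))


normalized = min_max_normalize([10, None, 20, 30])
filled = fill_missing_with_mean([1.0, None, 3.0])
unique_rows = drop_duplicates([("a", 1), ("b", 2), ("a", 1)])
```

Data conflicts and ambiguity usually need domain knowledge and are therefore not covered by this sketch.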

BIPM — Lecture 3 — BI & Descriptive Analytics 33 / 84

Guidelines for successful data mining

The data must be available

The data must be relevant, adequate and clean

There must be a well-defined problem

The problem should not be solvable by means of ordinary query or OLAP tools

The results must be actionable

BIPM — Lecture 3 — BI & Descriptive Analytics 34 / 84

Successful data mining in businesses

Use a small team with strong internal integration and a loose management style

Carry out a small pilot project before a major data mining project

Identify a clear problem owner responsible for the project, e.g., from sales or marketing

Try to realize a positive return on investment within 6 to 12 months

Have top management back the project up

BIPM — Lecture 3 — BI & Descriptive Analytics 35 / 84

Break?

BIPM — Lecture 3 — BI & Descriptive Analytics 36 / 84

Data attribute types

Categorical attributes: discrete

Nominal attribute: has no logical ordering (e.g., colors or names)

Ordinal attribute: has ordering (e.g., bad, OK, good, perfect)

Numerical attributes: continuous (e.g., 4.815 m and EUR 162 342)

BIPM — Lecture 3 — BI & Descriptive Analytics 37 / 84

Data quality

Accuracy

Completeness

Consistency (uniformity)

Validity

Timeliness

Data cleaning, data cleansing, data scrubbing, . . .

BIPM — Lecture 3 — BI & Descriptive Analytics 38 / 84

Data quality

http://www.hicxsolutions.com/supplier-management-programmes/

BIPM — Lecture 3 — BI & Descriptive Analytics 39 / 84

http://halobi.com/wp-content/uploads/data-quality-infographic.png

Example: Corporate data quality

BIPM — Lecture 3 — BI & Descriptive Analytics 42 / 84

Data quality

ORBIS database (Bureau van Dijk, http://orbis.bvdinfo.com)

Aggregates data from Chambers of Commerce across the world

Snapshot from September 2015

Extracted all firms (including meta-data such as operating revenue,employees, assets and market capitalization)

140,087,471 firms found. Is that all?

BIPM — Lecture 3 — BI & Descriptive Analytics 43 / 84

Observed data

Figure: Observed average revenue per country (darker is more)

BIPM — Lecture 3 — BI & Descriptive Analytics 44 / 84

Completeness per size category

Figure: Percentage of companies present, segmented by number of employees (bar chart per country, AT to SK; vertical axis: completeness as % of number of companies, 0 to 200; series: 0-9, 10-19, 20-49, 50-249 and GE250 employees).

BIPM — Lecture 3 — BI & Descriptive Analytics 45 / 84

Assessing completeness

Lognormal distribution for firm revenue in a country

Idea: fix distribution scale based on known high quality countries

Estimate mean revenue for each country using World Bank indicators

Result: GDP per capita ∼ Mean revenue

Mean revenue ∼ Distribution location

Assess completeness by comparing observed average revenue with estimated average revenue
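The link between mean revenue and distribution location follows from a standard property of the lognormal distribution: its mean equals exp(mu + sigma^2 / 2). The sketch below uses that relation to derive the location from an estimated mean once the spread parameter sigma (the "scale" fixed on high-quality countries) is known. The function names are hypothetical, and the ratio of observed to estimated mean is only an assumed way of comparing the two numbers, not necessarily the measure used in the lecture.

```python
import math


def lognormal_location(mean_revenue, sigma):
    """Location mu of a lognormal distribution with the given mean and fixed sigma."""
    # For a lognormal distribution: mean = exp(mu + sigma**2 / 2).
    return math.log(mean_revenue) - sigma ** 2 / 2


def mean_ratio(observed_mean, estimated_mean):
    """Observed over estimated average revenue (hypothetical comparison measure)."""
    return observed_mean / estimated_mean


sigma = 1.0                      # assumed value, fixed from high-quality countries
estimated_mean = math.exp(2.5)   # assumed estimate, e.g. derived from GDP per capita
mu = lognormal_location(estimated_mean, sigma)
ratio = mean_ratio(observed_mean=2 * estimated_mean, estimated_mean=estimated_mean)
```

A ratio far from 1 would suggest that the firms present in the database are not a representative sample of the country's firms.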

BIPM — Lecture 3 — BI & Descriptive Analytics 46 / 84

Mean vs. standard deviation

BIPM — Lecture 3 — BI & Descriptive Analytics 47 / 84

Understanding average revenue

Figure: Observed average revenue

Figure: Estimated average revenue

BIPM — Lecture 3 — BI & Descriptive Analytics 48 / 84

Low average in rich countries

Figure: four panels, each plotting frequency (10^-9 to 10^-2) against company turnover in thousands of dollars (10^0 to 10^5) on logarithmic axes.

BIPM — Lecture 3 — BI & Descriptive Analytics 49 / 84

Real completeness

BIPM — Lecture 3 — BI & Descriptive Analytics 50 / 84

Completeness per country

Figure: completeness per country (panels A to D), with a logarithmic completeness scale from 10^-3.5 to ≥ 10^0. Countries shown: NO, SE, EE, FI, SK, GB, CZ, CL, FR, US, HU, CH, PT, SI, ES, IT, CA, DK, KR, DE, JP, AU, PL, GR, IE, BE, LU, NL, AT, IL, NZ.

BIPM — Lecture 3 — BI & Descriptive Analytics 51 / 84

Categories of techniques

Machine learning

Supervised learning: learning on labeled data

Semi-supervised learning: partially labeled data

Unsupervised learning: learning/mining on unlabeled data

Reinforcement learning: agents learning to act in an environment

BIPM — Lecture 3 — BI & Descriptive Analytics 52 / 84

Unsupervised learning

BIPM — Lecture 3 — BI & Descriptive Analytics 53 / 84

Categories of techniques

Unsupervised learning: learning/mining on unlabeled data

Supervised learning: learning on labeled data

Semi-supervised learning: partially labeled data

Reinforcement learning: agents learning to act in an environment

BIPM — Lecture 3 — BI & Descriptive Analytics 54 / 84

Unsupervised learning

Clustering

Anomaly detection

Pattern recognition

Data summarization

BIPM — Lecture 3 — BI & Descriptive Analytics 55 / 84

Clustering

Data is unlabeled

Label data: grouping

Grouping based on similar attributes: relatively close “neighbors” in n-dimensional space

BIPM — Lecture 3 — BI & Descriptive Analytics 57 / 84

k-means Clustering

1 k means are randomly placed

2 k clusters are created by assigning each observation to the nearest mean (according to some distance notion)

3 the centroid of each cluster becomes the new mean

4 steps 2–3 are repeated until convergence
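The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the initial means are placed by picking k of the observations at random, the distance notion is Euclidean, and convergence is taken to mean that no assignment changes. The function name, the `max_iter` guard and the example points are hypothetical.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points (tuples of floats) into k groups; returns (means, labels)."""
    rng = random.Random(seed)
    # Step 1: place the k means, here by picking k distinct observations at random.
    means = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each observation to the nearest mean (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, means[j])) for p in points
        ]
        # Step 4: stop as soon as no assignment changes (convergence).
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: the centroid of each cluster becomes the new mean.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous mean
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, labels


points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
means, labels = kmeans(points, k=2)
```

On this small example the two groups of three nearby points end up in separate clusters; on real data the result can depend on the initial placement, so the algorithm is often run several times with different seeds.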

BIPM — Lecture 3 — BI & Descriptive Analytics 58 / 84

k-means Clustering

BIPM — Lecture 3 — BI & Descriptive Analytics 59 / 84

Hierarchical clustering

1 Define a distance function between objects

2 Assign each object to its own cluster

3 Merge the two nearest clusters (based on the distance between their objects) into one cluster

4 Until there is only one cluster, go to 3

5 Pick a level in the resulting dendrogram as the preferred clustering
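The five steps above can be sketched as follows. This is an illustrative version that assumes Euclidean distance between objects and single linkage (the distance between two clusters is the smallest distance between any pair of their objects); the slides do not fix either choice. The function names are hypothetical, and the sketch favors readability over speed.

```python
import math


def single_linkage(points):
    """Agglomerative clustering; returns the merges as (cluster_a, cluster_b, distance)."""
    # Step 2: every object starts in its own cluster (a tuple of point indices).
    clusters = [(i,) for i in range(len(points))]
    merges = []

    def cluster_distance(a, b):
        # Step 1: Euclidean distance between objects; single linkage takes the
        # smallest distance between any pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    # Steps 3 and 4: merge the two nearest clusters until only one remains.
    while len(clusters) > 1:
        a, b = min(
            ((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]),
            key=lambda pair: cluster_distance(*pair),
        )
        merges.append((a, b, cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


def cut(points, merges, n_clusters):
    """Step 5: pick a dendrogram level by replaying merges until n_clusters remain."""
    clusters = [(i,) for i in range(len(points))]
    for a, b, _ in merges:
        if len(clusters) <= n_clusters:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
merges = single_linkage(points)
three_clusters = cut(points, merges, n_clusters=3)
```

The list of merges, together with their distances, is exactly the information a dendrogram draws; cutting at a different level only requires calling `cut` again with another `n_clusters`.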

BIPM — Lecture 3 — BI & Descriptive Analytics 60 / 84

Hierarchical clustering

BIPM — Lecture 3 — BI & Descriptive Analytics 61 / 84

Hierarchical clustering

BIPM — Lecture 3 — BI & Descriptive Analytics 62 / 84

Hierarchical clustering

BIPM — Lecture 3 — BI & Descriptive Analytics 63 / 84

Hierarchical clustering

BIPM — Lecture 3 — BI & Descriptive Analytics 64 / 84

Clustering validation

Expectation-Maximization (EM) clustering: https://en.wikipedia.org/wiki/Expectation-maximization_algorithm

BIPM — Lecture 3 — BI & Descriptive Analytics 65 / 84

Hierarchical vs. k-means clustering

Time complexity (k-means: linear in the number of objects per iteration; hierarchical: at least quadratic)

Predefined number of clusters

Influence of outliers

Assumption of the presence of a hierarchical structure

BIPM — Lecture 3 — BI & Descriptive Analytics 66 / 84

Case: Anomalies in energy expenditure

BIPM — Lecture 3 — BI & Descriptive Analytics 67 / 84

Case: anomalies in energy expenditure

BSc project J. Kalmeijer in cooperation with “Rijkswaterstaat”

Total of 254 objects all over the Netherlands

Energy expenditure over 3 years known for each object

Measurements every 15 minutes: 365 days × 24 hours × 4 measurements ≈ 35,000 yearly measurements

BIPM — Lecture 3 — BI & Descriptive Analytics 68 / 84

Objects

Public lighting or traffic control

Office

Tunnel

Radar post

Pumping station

Floodgate or weir

Traffic control center

Bridge or dam

Small building

BIPM — Lecture 3 — BI & Descriptive Analytics 69 / 84

Goal

Clustering on all data:

Public lighting

All other objects

Clustering to detect object groups

Identify regular energy usage pattern of objects

Objects are of different types

Detect anomalies in energy usage per object type

Data-driven!

BIPM — Lecture 3 — BI & Descriptive Analytics 70 / 84

Approach

BIPM — Lecture 3 — BI & Descriptive Analytics 71 / 84

A clustering result

Figure: Public lighting objects

BIPM — Lecture 3 — BI & Descriptive Analytics 72 / 84

Anomaly detection results

Figure: Outlier in seasonal behavior

BIPM — Lecture 3 — BI & Descriptive Analytics 73 / 84

Project results and conclusion

Objects clustered into types based on the data

Some anomalies detected for various types of objects

Correlations between weather and object (types) identified

Data-driven insight!

BIPM — Lecture 3 — BI & Descriptive Analytics 74 / 84

Unsupervised learning

Clustering

Anomaly detection

Pattern recognition

Data summarization

BIPM — Lecture 3 — BI & Descriptive Analytics 75 / 84

Market basket analysis

Han & Kamber, Data mining: Concepts and techniques, 2006

BIPM — Lecture 3 — BI & Descriptive Analytics 76 / 84

Association

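X and Y are variables. There are $N$ instances, of which $N_X$ instances have variable X (likewise $N_Y$ have variable Y, and $N_{X \wedge Y}$ have both)

Derive rules of the form IF (X) THEN Y, written $X \Rightarrow Y$

$\mathrm{support}(X \Rightarrow Y) = N_{X \wedge Y} / N$

$\mathrm{confidence}(X \Rightarrow Y) = N_{X \wedge Y} / N_X$

$\mathrm{lift}(X \Rightarrow Y) = \dfrac{N_{X \wedge Y} \, N}{N_X \, N_Y}$

support: higher is better

confidence: close to 1

lift: higher is better (a value above 1 means X and Y occur together more often than expected if they were independent)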
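The three measures can be computed directly from counts. The sketch below does so for a single rule over a list of transactions; the function name and the example baskets are hypothetical, and the code assumes that both X and Y occur at least once (otherwise a division by zero would occur).

```python
def rule_metrics(transactions, x, y):
    """Return (support, confidence, lift) of the rule x => y over a list of item sets."""
    n = len(transactions)
    n_x = sum(1 for t in transactions if x in t)
    n_y = sum(1 for t in transactions if y in t)
    n_xy = sum(1 for t in transactions if x in t and y in t)
    support = n_xy / n                  # N_{X and Y} / N
    confidence = n_xy / n_x             # N_{X and Y} / N_X
    lift = (n_xy * n) / (n_x * n_y)     # N_{X and Y} * N / (N_X * N_Y)
    return support, confidence, lift


baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
support, confidence, lift = rule_metrics(baskets, "bread", "milk")
```

Here bread and milk each occur in 4 of the 5 baskets and together in 3, so the support is 0.6, the confidence 0.75 and the lift 15/16, slightly below 1: in this toy data, buying bread makes milk marginally less likely than average.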

BIPM — Lecture 3 — BI & Descriptive Analytics 77 / 84

Association rules

http://www.saedsayad.com

BIPM — Lecture 3 — BI & Descriptive Analytics 78 / 84

Unsupervised learning

Clustering

Anomaly detection

Pattern recognition

Data summarization

BIPM — Lecture 3 — BI & Descriptive Analytics 79 / 84

Anomaly detection

Supervised: normal/outlier can be learned as a class attribute

Semi-supervised: train on a labeled dataset, determine outliers in unlabeled data based on likelihood of a deviation

Unsupervised: identify patterns (for example, using clustering) and then select small clusters or instances that do not logically fall in any of the large clusters
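The unsupervised variant can be illustrated with a small sketch: given cluster labels produced by any clustering method, flag the instances that sit in very small clusters. The function name and the `min_size` threshold are hypothetical; choosing the threshold is a judgment call that depends on the data.

```python
from collections import Counter


def small_cluster_outliers(labels, min_size):
    """Return the indices of instances whose cluster has fewer than min_size members."""
    sizes = Counter(labels)
    return [i for i, label in enumerate(labels) if sizes[label] < min_size]


# Example: cluster 2 contains a single instance and is flagged as anomalous.
labels = [0, 0, 0, 1, 1, 1, 1, 2]
outliers = small_cluster_outliers(labels, min_size=2)
```

This only covers the "small clusters" part of the idea; instances that lie far from the center of a large cluster would need an additional distance-based check.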

BIPM — Lecture 3 — BI & Descriptive Analytics 80 / 84

Assignment 1

Gaming industry context

Sales log spanning 4 years of sales

Apply and compare BI techniques

Inspect, visualize, aggregate, segment, score . . .

Deliverables:

1 Web-based BI Dashboard

2 Short assignment report in LaTeX

BIPM — Lecture 3 — BI & Descriptive Analytics 81 / 84

Assignment 1 — Hints

Model: MySQL database containing the data

View: HTML page using Javascript that reads JSON

Controller: PHP outputs relevant data in JSON

BIPM — Lecture 3 — BI & Descriptive Analytics 82 / 84

Lab session February 23

Make serious progress with Assignment 1

Continue with dashboard and data integration

Error reporting in PHP and other handy tricks: http://liacs.leidenuniv.nl/ict

Start thinking about the BI questions

Ask all relevant questions

BIPM — Lecture 3 — BI & Descriptive Analytics 83 / 84

Credits

Lecture partially based on (slides of the (previous edition of the)) course book: W. van der Aalst, Process Mining: Data Science in Action, 2nd edition, Springer, 2016.

Slides partially based on “From Data Mining to Knowledge Discovery: An Introduction” by Gregory Piatetsky-Shapiro (KDnuggets.com)

BIPM — Lecture 3 — BI & Descriptive Analytics 84 / 84
