12 knowledge mining

77
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology Information Service Engineering Lecture 12: Knowledge Mining Prof. Dr. Harald Sack FIZ Karlsruhe - Leibniz Institute for Information Infrastructure AIFB - Karlsruhe Institute of Technology Summer Semester 2017 This file is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0 )

Upload: harald-sack

Post on 21-Jan-2018

358 views

Category:

Education


0 download

TRANSCRIPT

Page 1: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Information Service Engineering Lecture 12: Knowledge Mining

Prof. Dr. Harald SackFIZ Karlsruhe - Leibniz Institute for Information InfrastructureAIFB - Karlsruhe Institute of Technology

Summer Semester 2017

This file is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0)

Page 2: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Information Service EngineeringLast lecture: Information Retrieval - 2

4.1 A Brief History of Libraries and IR

4.2 Fundamental Concepts of IR

4.3 Information Retrieval Models

4.4 Retrieval Evaluation

4.5 Web Information Retrieval

4.6 Indexing

4.7 Query Processing and Ranking

2

● Information Retrieval Models● Average Precision @ rank● Mean Average Precision (MAP)● Web Crawler● Inverted Index● TF-IDF

Page 3: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

● Term Frequency - Inverse Document Frequency is tf multiplied by idf

TF-IDF

Term frequency tf ● +1 to avoid negative results,● log to lower impact of

frequent words

Normalization usually via cosine similarity,But there are more weighting variants...

4. Information Retrieval / 4.7 Query Processing and Ranking

3

Invers document frequency idf

Page 4: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

PageRank

● For Web Search besides the relevance of a search term for a document also the “popularity” of a document can be taken into account

● Google’s PageRank is a graph algorithm that determines the “importance” of a web page via its link graph

● PageRank basic assumptions:

○ Importance of a page is dependent on the number of incoming links, i.e. other web pages referring to the web page

○ If a web page links to another web page, the importance of a link is determined by the importance of the originating web page

○ The more outgoing links, the less important is a single outgoing link

4. Information Retrieval / 4.7 Query Processing and Ranking

4

Page 5: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

PageRank4. Information Retrieval / 4.7 Query Processing and Ranking

5

damping factor

for all incoming links

importance of incoming link j

PRi

PageRank of page iPR

jPageRank of page j that is linking to page i

cj page j is linking to c

j different pages

n number of incoming links to page iN number of all pagesd damping factor d∈[0,1]

(Brin, Page, 1998)

Page 6: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

PageRank4. Information Retrieval / 4.7 Query Processing and Ranking

6

1.0 1.0

1.0 1.0

DC

Iteration r(A) r(B) r(C) r(D)

1 1,0 1,0 1,0 1,0

2 1,0 0,575 2,275 0,15

3 2,083 0,575 1,1912 0,15

… … … … …

n 1,49 0,7833 1,577 0,15

BA

Initial state

1.49 0.78

1.57 0.15

DC

BA

Final stateIteration of PageRank computation

Page 7: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Information Service EngineeringLast Lecture: Information Retrieval - 2

4.1 A Brief History of Libraries and IR

4.2 Fundamental Concepts of IR

4.3 Information Retrieval Models

4.4 Retrieval Evaluation

4.5 Web Information Retrieval

4.6 Indexing

4.7 Query Processing and Ranking

7

Page 8: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Information Service EngineeringLecture Overview

1. Information, Natural Language and the Web

2. Natural Language Processing

3. Linked Data Engineering

4. Information Retrieval

5. Knowledge Mining

6. Exploratory Search and Recommender Systems

8

Page 9: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Information Service Engineering5. Kowledge Mining

9

5.1 From Data to Knowledge

5.2 Linked Data based Information Visualization

5.3 Linked Data based Knowledge Mining

5.4 Linked Data Analytics

Page 10: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

collective application of knowledge in context

understanding principles(why?, what is best?, doing things right)

experience, context, value applied to a message

understanding patterns(principles: how to?)

a message meant to change the receivers perception

understanding relations(description: what?)

discrete objective facts about event

DIKW Pyramid, Ackoff 1989 [1]

Knowledge Mining

evaluated understanding

information enriched with semantics

in usable form

raw characters and symbols

futurenovelty

pastexperience

5. IKnowledge Engineering/ 5.1 From Data to Knowledge

Page 11: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Data

● Data is raw.

● It simply exists and has no significance beyond its existence (in and of itself).

● It can exist in any form, usable or not.

5. Knowledge Engineering/ 5.1 From Data to Knowledge

Page 12: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Information

● Information is data that has been given meaning by way of relational connection.

● This "meaning" can be useful, but does not have to be.

● Information is contained in descriptions, answers to questions that begin with such words as who, what, when, where, and how many

5. Knowledge Engineering/ 5.1 From Data to Knowledge

Page 13: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Knowledge

● Knowledge is the appropriate collection of information, such that it's intent is to be useful.

● Understanding is a continuum that leads from data, through information and knowledge, and ultimately to wisdom

● Data transforms to information by convention, information to knowledge by cognition, and knowledge to wisdom by contemplation

5. Knowledge Engineering/ 5.1 From Data to Knowledge

Page 14: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Knowledge5. Knowledge Engineering/ 5.1 From Data to Knowledge

● How do we transform data into knowledge? 1. Collect Data 2. Organize Data 3. Analyze Data

Page 15: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Information Service Engineering5. Kowledge Mining

15

5.1 From Data to Knowledge

5.2 Linked Data based Information Visualization

5.3 Linked Data based Knowledge Mining

5.4 Linked Data Analytics

Page 16: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology16

Charles Joseph Minard (1781-1870) - Napoleon’s Russian Campaign

http://patrimoine.enpc.fr/document/ENPC01_Fol_10975?image=54#bibnum

Page 17: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology17

https://en.wikipedia.org/wiki/Charles_Joseph_Minard#/media/File:Minard_map_of_napoleon.png

Page 18: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology18

A Picture is worth a 1000 words...

● Pictures have been used to convey information long before the development of writing

● A single picture can be processed (“understood”) much faster than a (linear) text page

● Human perception is processing in parallel, text analysis is limited by the sequential process of reading

https://commons.wikimedia.org/wiki/File%3AA_picture_is_worth_a_thousand_words.jpg

5. Knowledge Engineering/ 5.2 Linked Data based Information Visualization

Page 19: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology19

Information Visualization

● Information Visualization is the study of (interactive) visual representations of abstract data to reinforce human cognition

● Information graphics or infographics are graphic visual representations of information, data or knowledge intended to present information quickly and clearly

○ a static form of information visualization

○ aims to emphasize specific findings gained from the visualized data

○ Mandatory precondition: Data Analysis

5. Knowledge Engineering/ 5.2 Linked Data based Information Visualization

Page 20: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

SPARQL

e.g. via GoogleDoc

Hands On Data Visualization Example

● Dataset: DBpedia

● Task: Draw a map chart which indicates the number of soccer players per country

5. Knowledge Engineering/ 5.2 Linked Data based Information Visualization

Page 21: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Task: Draw a Map Chart...

● Look for a representative (prominent) example

5. Knowledge Engineering/ 5.2 Linked Data based Information Visualization

Page 22: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Task: Draw a Map Chart...

● Dataset: DBpedia

○ Examine a representative example from DBpedia, e.g. http://dbpedia.org/page/Lionel_Messi

5. Knowledge Engineering/ 5.2 Linked Data based Information Visualization

Page 23: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Task: Draw a Map Chart...

● SPARQL Query:

○ Country and Number of soccerplayers per country

■ ?s rdf:type dbo:SoccerPlayer .

■ ?s dbo:birthPlace ?birthplace .

■ ?birthplace dbo:country ?country .

■ ?country rdfs:label ?country_name

■ FILTER (lang(?countryLabel)=”en”)

■ GROUP BY ?countryLabel

■ COUNT(DISTINCT ?s)

5. Knowledge Engineering/ 5.2 Linked Data based Information Visualization

Page 25: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Task: Draw a Map Chart...

● SPARQL Result:

○ Save as CSV

5. Knowledge Engineering/ 5.2 Linked Data based Information Visualization

Page 26: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Task: Draw a Map Chart...

● Import the csv data into a spreadsheet, as e.g. in GoogleDocs

5. Knowledge Engineering/ 5.2 Linked Data based Information Visualization

Page 27: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Task: Draw a Map Chart...

● Display spreadsheet data in some diagram

5. Knowledge Engineering/ 5.2 Linked Data based Information Visualization

Page 28: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Task: Draw a Map Chart...

● Important step: Data Cleansing

○ Is the displayed data correct?○ Outlier detection

○ In our example: are all displayed countries really existing countries?

○ Potential Solutions:■ Adapt your original query (new SPARQL query)■ Remove evidently wrong data

● manually or procedural

5. Knowledge Engineering/ 5.2 Data Mining

Page 29: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Task: Draw a Map Chart...

● Redraw after Data Cleansing

5. Knowledge Engineering/ 5.2 Linked Data based Information Visualization

Page 30: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Information Service Engineering5. Kowledge Mining

30

5.1 From Data to Knowledge

5.2 Linked Data based Information Visualization

5.3 Linked Data based Knowledge Mining

5.4 Linked Data Analytics

Page 31: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Knowledge Mining and Knowledge Discovery5. Knowledge Engineering/ 5.3 Linked Data Based Knowledge Mining

Knowledge Discovery [in Databases] (KDD) is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in (massive) data sources.(Fayyad et al, 1996 [2])

● valid: to a certain degree the discovered patterns should also hold for new, previously unseen problem instances.

● novel: at least to the system and preferable to the user.

● potentially useful: they should lead to some benefit to the user or task.

● ultimately understandable: the end user should be able to interpret the patterns either immediately or after some postprocessing.

Page 32: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Knowledge Mining and Knowledge Discovery5. Knowledge Engineering/ 5.3 Linked Data Based Knowledge Mining

Knowledge Discovery [in Databases] (KDD) is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in (massive) data sources.(Fayyad et al, 1996 [2])

● Goals:

○ Descriptive Modelling: explains the characteristic and the behaviour of the observed data

○ Predictive Modelling: predicts the behaviour of new data based on some model

● Important:○ The extracted model/pattern does not have to apply in 100% of the cases

Page 33: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

The Knowledge Discovery Process5. Knowledge Engineering/ 5.3 Linked Data Based Knowledge Mining

Data

Selection:Select a relevant dataset or focus on a subset of a dataset

Target Data

Preprocessing/Cleaning:Data integration from different sources, Data Cleaning

Preprocessed Data

Transformation:Select useful features, feature transformation, dimensionality reduction

Transformed Data

Data Mining:Search for patterns of interest

Patterns

Evaluation:Evaluate patterns based on interestingness measures, model validation

Knowledge

Page 34: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Data Cleaning5. Knowledge Engineering/ 5.3 Linked Data Based Knowledge Mining

● “Dirty” Data:○ Dummy values, absence of data, contradicting data, etc.

● Steps in Data Cleaning○ Parsing: locates and identifies individual data elements in raw data

○ Correcting: corrects parsed individual data components using sophisticated data algorithms

○ Normalization: applies conversion routines to transform data into standard formats

○ Matching: searching and matching records within and across data based on predefined rules

○ Consolidating: merges data into one representation

Page 35: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Knowledge Mining Functionality5. Knowledge Engineering/ 5.3 Linked Data Based Knowledge Mining

● Characterization: summarizing general features of objects in a target class (concept description)

● Discrimination: comparing general features of objects between a target class and a contrasting class (concept comparison)

● Association: studying the frequency of items occurring together

● Prediction: predicting some unknown or missing attribute values

● Classification: organizing data in given classes based on attribute values (supervised)

● Clustering: organizing data in classes based on attribute values (unsupervised)

● Outlier analysis: identifying and explaining exceptions (surprises)

● Time-series analysis: analyzing trends and deviations

Page 36: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology36

Data Analysis in the Knowledge Mining Process

● Data Analysis is a fundamental iterative process:1. Formulation and execution of a query2. Analysis of the results 3. Formulation of a consecutive query based on the achieved results

● Goals of Data Analysis:○ maximize understanding of analyzed data○ uncover hidden structures/patterns○ extraction of important variables○ detection of anomalies and outliers○ testing of hypotheses○ development of a simple model

5. Knowledge Engineering/ 5.3 Linked Data Based Knowledge Mining

Page 37: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Information Service Engineering5. Kowledge Mining

37

5.1 From Data to Knowledge

5.2 Linked Data based Information Visualization

5.3 Linked Data based Knowledge Mining

5.4 Linked Data Analytics

Page 38: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology38

HandsOn: Linked Data Analytics

● Available Toolset:○ Data 1: DBpedia SPARQL endpoint○ Data 2: Wikidata SPARQL endpoint

● Query language: ○ SPARQL

● simple statistics and visualization: ○ R

https://www.r-project.org/

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 39: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology39

HandsOn: Linked Data Analytics5. Knowledge Engineering/ 5.4 Linked Data Analytics

https://commons.wikimedia.org/wiki/File:Fu%C3%9Fballgeschichte_(1892).jpg

Page 40: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology40

Linked Data Analytics

● First example: Analyse the number of achieved goals of soccer players with DBpedia

● Solution1. Create appropriate SPARQL query2. Save result as csv file3. Analyze data with R

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 42: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology42

Linked Data Analytics

● First example: Analyse the number of achieved goals of soccer players with DBpedia

3. Analyse data with R○ read data:

goals <- read.csv("dbpedia-goals", header=TRUE) ○ summarize data:

summary(goals)

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 43: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology43

Linked Data Analytics

First example: Analyse the number of achieved goals of soccer players 3. Analyse data with R

○ analyze data via boxplot:boxplot(goals)

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 44: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology44

Linked Data Analytics

First example: Analyse the number of achieved goals of soccer players 3. Analyse data with R

○ analyze data via boxplot:boxplot(goals)

○ adapt ranges for visibility:boxplot(goals,ylim=c(-90,200))

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 45: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology45

Linked Data Analytics

First example: Analyse the number of achieved goals of soccer players 3. Analyse data with R

○ analyze data via boxplot:boxplot(goals)

○ adapt ranges for visibility:boxplot(goals,ylim=c(-90,200))

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 46: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology46

Linked Data Analytics

First example: Analyse the number of achieved goals of soccer players 3. Analyse data with R

○ analyze data via boxplot:boxplot(goals)

○ adapt ranges for visibility:boxplot(goals,ylim=c(-90,200))

InterQuartile Range (IQR)

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 47: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology47

Linked Data Analytics

First example: Analyse the number of achieved goals of soccer players 3. Analyse data with R

○ analyze data via boxplot:boxplot(goals)

○ adapt ranges for visibility:boxplot(goals,ylim=c(-90,200))

IQR

Whiskers: IQR x 1.5 = (Q3-Q

1) x 1.5

Q3+ 1.5 IQR

Q1- 1.5 IQR

Outliers

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 48: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology48

Linked Data Analytics

First example: Analyse the number of achieved goals of soccer players 3. Analyse data with R

○ Data Cleaning■ remove negative values■ examine outliers (SPARQL query)

○ New data analysis

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 49: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology49

Linked Data Analytics

Second example: ● Is there a relationship between the number

of achieved goals, the birthdate, or the height of a soccer player (in DBpedia)

https://commons.wikimedia.org/wiki/File:Fu%C3%9Fballgeschichte_(1892).jpg

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 50: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology50

Linked Data Analytics

1. Create appropriate SPARQL query

SPARQL Query

SELECT SAMPLE(?goals) as ?goals SAMPLE(xsd:integer(?height)) as ?height SAMPLE(xsd:date(?bday)) as ?bday WHERE { ?s rdf:type dbo:SoccerPlayer ; <http://dbpedia.org/ontology/Person/height> ?height ; dbo:birthDate ?bday ; dbp:totalgoals ?goals FILTER (DATATYPE(?goals)=xsd:integer).} GROUP BY ?s

2. Save the result as csv file

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 51: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology51

Linked Data Analytics

Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in DBpedia)

3. Analyse data with R

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 52: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology52

Linked Data Analytics

Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in DBpedia)

3. Analyse data with R

> boxplot(goals2$height)

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 53: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology53

Linked Data Analytics

Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in DBpedia)

3. Analyse data with R

> boxplot(goals2$bday)

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 54: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology54

Linked Data Analytics

Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in DBpedia)

3. Analyse data with R○ How are the values distributed?○ > hist(goals2$goals)

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 55: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology55

Linked Data Analytics

Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in DBpedia)

3. Analyse data with R○ How are the values distributed?○ > hist(goals2$goals)○ > hist(goals2$height)

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 56: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology56

Linked Data Analytics

Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in DBpedia)

3. Analyse data with R○ How are the values distributed?○ > hist(goals2$goals)○ > hist(goals2$height)○ > hist(goals2$bday)

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 57: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology57

Linked Data Analytics

Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in DBpedia)

3. Analyse data with R○ Data Cleaning and 2D plot

1. height vs. goals

> plot(goals2$height,goals2$goals)

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 58: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology58

Linked Data Analytics

Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in DBpedia)

3. Analyse data with R○ Data Cleaning and 2D plot

2. birthday vs. goals

> plot(goals2$bday,goals2$goals)

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 59: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology59

Linked Data Analytics

Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in DBpedia)

3. Analyse data with R○ Data Cleaning and 2D plot

3. birthday vs. height

> plot(goals2$bday,goals2$height)

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 60: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology60

Linked Data Analytics

● But, is there really a relationship between the number of achieved goals, the birthdate, or the height of a soccer player?

● Statistics to the Rescue: ○ Correlation Coefficient

determines correlation and dependence of two or more variables:

○ Applied to out problem in R: ■ cor(goals2$height,goals2$goals)= -0.04350283

■ cor(goals2$bday,goals2$goals)= -0.1098645

■ cor(goals2$bday,goals2$height)= 0.2881508

Covariance of X and Y

Standard deviations of X and Y

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 61: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology61

Linked Data Analytics

● Let’s have a closer look at the distribution of heights

looks like a Normal distribution

might probably be an anomaly

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 62: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology62

Linked Data Analytics

● Let’s have a closer look at the distribution of heights○ Use SPARQL for further analysis

SPARQL query

might probably be an anomaly

select ?s ?height WHERE {?s rdf:type dbo:SoccerPlayer ; dbo:height ?height .}ORDER BY ?height

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 63: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology63

Linked Data Analytics

● Let’s have a closer look at the distribution of heights○ Use SPARQL for further analysis ○ For DBpedia, you always have to consider

the extraction process from the originalWikipedia infoboxes...

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 64: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology64

Linked Data Analytics

Third example: ● Use another data source:

● Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in Wikidata)

https://commons.wikimedia.org/wiki/File:Fu%C3%9Fballgeschichte_(1892).jpg

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 65: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology65

● WIKIDATA follows a different data schema

qualifiers

https://www.wikidata.org/wiki/Q39444

statement

Page 66: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology66

Linked Data Analytics

● WIKIDATA uses a different data schema

https://commons.wikimedia.org/wiki/File:Fu%C3%9Fballgeschichte_(1892).jpg https://www.wikidata.org/wiki/Q39444

Access via different namespaces for properties:

● wdt: connects an item to a valuewd:Q39444 wdt:P54 ?team .

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Object=

subject/context for statement

Page 67: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology67

Linked Data Analytics

● WIKIDATA uses a different data schema

https://commons.wikimedia.org/wiki/File:Fu%C3%9Fballgeschichte_(1892).jpg https://www.wikidata.org/wiki/Q39444

statementAccess via different namespaces for properties:

● wdt: connects an item to a valuewd:Q39444 wdt:P54 ?team .

● p: connects a subject to a statementwd:Q39444 p:P54 ?team_statement .

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 68: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology68

Linked Data Analytics

● WIKIDATA uses a different data schema

https://commons.wikimedia.org/wiki/File:Fu%C3%9Fballgeschichte_(1892).jpg https://www.wikidata.org/wiki/Q39444

property and object/value of

statement

Access via different namespaces for properties:

● wdt: connects an item to a valuewd:Q39444 wdt:P54 ?team .

● p: connects a subject to a statementwd:Q39444 p:P54 ?team_statement .

● pq: connects statement to qualifier value?team_statement pq:1351 ?statement_value

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 69: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology69

Linked Data Analytics

https://commons.wikimedia.org/wiki/File:Fu%C3%9Fballgeschichte_(1892).jpg https://www.wikidata.org/wiki/Q39444

pq:P1351 ?goals

Access via different namespaces for properties:

● pq: connects statement to qualifier value

wd:Q39444 p:P54 ?team_statement .?team_statement pq:P1351 ?goals .

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 70: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology70

Linked Data Analytics

1. Create appropriate SPARQL query

SPARQL Query

SELECT (SUM(?goals) as ?total_goals) (SAMPLE(?height) as ?height) (SAMPLE(xsd:date(?birthdate)) as ?bday) WHERE { ?s wdt:P106 wd:Q937857 ; p:P54 ?team_statement ; wdt:P2048 ?height ; wdt:P569 ?birthdate . ?team_statement pq:P1351 ?goals .} GROUP BY ?s

2. Save the result as csv file

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 71: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology71

Linked Data Analytics

Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in WIKIDATA)

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 72: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology72

Linked Data Analytics

Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in WIKIDATA)

● height vs. goals> plot(goals3$height,goals3$goals)

> cor(goals3$height,goals3$goals) [1] -0.046644459

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 73: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology73

Linked Data Analytics

Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in WIKIDATA)

● Distribution of birthdays> hist(goals3$bday)

5. Knowledge Engineering/ 5.4 Linked Data Analytics

Page 74: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology74

Linked Data Analytics

Are there any relationships between the number of goals and other properties?

5. Knowledge Engineering/ 5.4 Linked Data Analytics

SPARQL query

Page 75: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Information Service Engineering5. Kowledge Mining

75

5.1 From Data to Knowledge

5.2 Linked Data based Information Visualization

5.3 Linked Data based Knowledge Mining

5.4 Linked Data Analytics

Page 76: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

5. Knowledge Mining Bibliography

[1] Ackoff, R. L. (1989). From data to wisdom. Journal of Applied Systems Analysis 15: 3-9

[2] Usama M. Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. (1996). From data mining

to knowledge discovery: an overview. In Advances in knowledge discovery and data mining,

American Association for Artificial Intelligence, Menlo Park, CA, USA 1-34.

[3] The R Project for Statistical Computing, https://www.r-project.org/

76

Page 77: 12  Knowledge Mining

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

5. Knowledge Mining Syllabus Questions

● Explain the concept of PageRank. What are its basic assumptions?

● What’s the difference: Data, Information, Knowledge, and Wisdom?

● What is Knowledge Discovery?

● What are the goals of Knowledge Discovery?

● Explain the process of Knowledge Discovery

● Why do we need a “Data Cleaning” step in Knowledge Mining?

77