knime user group meeting 2015 text mining workshop · •subset(s) of the large movie review...

37
Copyright © 2015 KNIME.com AG KNIME User Group Meeting 2015 Text Mining Workshop Kilian Thiel [email protected] @KNIME

Upload: others

Post on 18-Oct-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

KNIME User Group Meeting 2015Text Mining Workshop

Kilian Thiel

[email protected]

@KNIME

Page 2: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Agenda

2

• Introduction

• Sentiment Analysis

– Single word features

– Ngram features

– Dictionary based

Page 3: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

• The KNIME Website (www.knime.org)• Textprocessing (tech.knime.org/knime-text-processing)

• Forum (tech.knime.org/forum/knime-textprocessing)

• Blog for news, tips and tricks(www.knime.org/blog)

• KNIME TV channel on

• KNIME on @KNIME

Resources

3

Page 4: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG 4

Introduction to KNIME Textprocessing

Page 5: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Installation

5

1.) 2.)

Page 6: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Tip

• Increase maximum memory for KNIME

• Edit knime.ini

– Add “-Xmx3G” as last line of knime.ini file

– Or change existing setting

• Useful additional extensions

– Palladian (community extension)• Web crawling, Text Mining

– XML-Processing (KNIME extension)• Parsing and processing of XML documents

6

Page 7: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Philosophy

7

… perhaps your nameis

Rumpelstiltskin[Person] ? …

… perhaps your nameis

Rumpelstiltskin[Person] ? … Visualization

Cluster-ing

Classifi-cation

1 1 1 0 1 0 0 1 10 1 1 0 0 1 0 0 00 0 1 1 1 0 1 1 0

Page 8: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Data Import

• Node Repository: KNIME Labs/Text Processing/IO

• Available Parser Nodes

– Flat File Document Parser

– PDF Parser

– Word Parser

– Document Grabber

– …

8

Page 9: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Enrichment

• Node Repository:

KNIME Labs/Text Processing/Enrichment

• Available Tagger Nodes

– Standford tagger

– Dictionary (& Wildcard) tagger

– OpenNLP tagger

– Abner tagger

– …

9

Page 10: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Preprocessing

• Node Repository:

KNIME Labs/Text Processing/Preprocessing

• Available Preprocessing Nodes

– Stop word Filter

– Snowball Stemmer

– General Tag Filter

– Case converter

– RegEx Filter

– …

10

Page 11: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Transformation

• Node Repository:

KNIME Labs/Text Processing/Transformation

• Available Transformation Nodes

– BoW Creator

– Document Vector

– Strings to Document

– Sentence Extractor

– Document Data Extractor

– …

11

Page 12: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Additional Data Types

• Document Cell

– Encapsulates a document• Title, sentences, terms, words

• Authors, category, source

• Generic meta data (key, value pairs)

• Term Cell

– Encapsulates a term• Words, tags

12

Page 13: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Data Table Structures

• Document table– List of documents

• Bag of words– Tuples of documents

and terms

• Document vectors– Numerical

representations of documents

13

Page 14: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Import Export Workflows

14

Page 15: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Today‘s example

• Classification of free-text documents is a common task in the field of text mining.

• It is used to categorize documents, i.e. assigning pre-defined topics or for sentiment analysis.

• Today we construct a workflow that reads and preprocesses text documents, transforms them into a numerical representation and build a predictive model to assign pre-defined sentiment labels to documents.

15

Page 16: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Today‘s example

• Subset(s) of the Large Movie Review Dataset v1.0

• The LMR Dataset v1.0 contains 50.000 English movie reviews with their associated sentiment label “positive” or “negative”

• http://ai.stanford.edu/~amaas/data/sentiment/

16

Page 17: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Today‘s example

17

Goal:

• Build classifier to distinguish between positive andnegative movie reviews.

• „Absolute masterpiece of a film! Goodnight Mr.Tomhas swiftly become one of my favourite films of all time. Nobody should miss out on seeing this film, it's just too good! …“

Positive or negative movie review?

Page 18: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Hands On

• Adjust Xmx setting in knime.ini

• Start KNIME

• Import Workflows

18

Page 19: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG 19

Sentiment Analysis

Page 20: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Classification (Recap)

20

Training

Set

Test

Set

Original

Data Set

Train

Model

Apply

Model

Score

Model

Data Partitioning Training and

Applying Models

Scoring

Strategies

Page 21: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Classification (Recap)

• All data mining models use a Learner-Predictor motif.

• The Learner node trains the model with its input data.

• The Predictor node applies the model to a different subset of data.

21

Training set

Test set

Trained Model

Page 22: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

From Strings to Numbers

22

Page 23: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Hands On

• Read movie review data

• Convert to documents

• Preprocess documents

• Filter features based on frequency

• Transform documents into numerical representation

• Train and score predictive model

• View decision tree view

23

Page 24: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Decision Tree Model

24

Page 25: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Negations

• What about negations?

– Not good

– Not bad

• Can not be handled with single word features

• Ngram features (1- and 2grams)

25

Page 26: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

NGram Creator

• Allows for creation of

– Character Ngrams

– Word Ngrams

• As

– Bag of words

– With frequencies

26

Page 27: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

NGram Features

27

Page 28: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Hands On

• Read movie review data

• Convert to documents

• Preprocess documents

• Create 1- and 2-gram features

• Filter features based on frequency

• Transform documents into numerical representation

• Train and score predictive model

• View decision tree view

28

Page 29: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Decision Tree Model

29

Page 30: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Dictionaries

• Custom dictionaries for sentiment classification

– Dictionary Tagger

– Wildcard Tagger (regular expressions)

30

Page 31: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

MPQA Subjectivity Lexicon

• Dictionary with words and related subjectivity clues

– 2718 words with positive polarity

– 4901 words with negative polarity

• http://mpqa.cs.pitt.edu/lexicons/subj_lexicon

31

Page 32: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Dictionary Based

32

Page 33: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Hands On

• Read movie review data

• Convert to documents

• Tag documents

• Create pivot features

• Specify score and rule for label assignment

• Score prediction

33

Page 34: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Appendix

• Palladian and XML nodes to extract move titles from IMDB web pages

• First dataset contains IMDB URLs of related movies

34

Page 35: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Web Data Extraction

35

Page 36: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

Hands On

• Read movie review data

• Load webpage content

• Parse HTML

• Extract title field

36

Page 37: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their

Copyright © 2015 KNIME.com AG

The EndKilian Thiel

[email protected]

@KNIME