Improving Upon Semantic Classification of Spoken Diary Entries Using Pragmatic Context Information Daniel J. Rayburn Reeves Curry I. Guinn University of North Carolina Wilmington


Post on 30-Dec-2015


Improving Upon Semantic Classification of Spoken Diary Entries Using Pragmatic Context Information

Daniel J. Rayburn Reeves, Curry I. Guinn

University of North Carolina Wilmington


Overview

Introduction

Problem definition

Hypotheses
– Hypothesis 1: Using Context
– Hypothesis 2: Using Thresholds

Limitations and future work


EPA Chemical Exposure Study

Create models of exposure to various chemicals

Activity/Location/Time/Energy expenditure database

Requires data


Database

Necessary data from study:
– Date/Time
– Location
– Activity

Activity and location representation: CHAD
– Consolidated Human Activity Database
– Designed by EPA
– Single representation for location and activity


Background on Data collection

Recall Data

Real-time Paper Diaries

Direct Observation


Digital voice diaries

Sony Voice Recorder

Subject recorded daily locations/activities

1220 utterances

Transcribed and classified


Database Sample

| Time Recorded | Utterance | CHAD Location | CHAD Activity |
|---|---|---|---|
| 8:57 AM | in the bedroom starting housework | 30125 - Bedroom | 11200 - Indoor chores |
| 8:59 AM | carrying clothes to the laundry room | 30128 - Utility room / Laundry room | 11410 - Wash clothes |
| 9:00 AM | the bedroom getting more clothes | 30125 - Bedroom | 11410 - Wash clothes |
| 9:05 AM | loading the washing machine in the laundry room | 30128 - Utility room / Laundry room | 11410 - Wash clothes |
| 9:06 AM | sitting down going to watch twenty minutes of Regis | 30122 - Living room / family room | 17223 - Watch TV |
| 9:23 AM | I'm going to be brushing the dog in the family room | 30122 - Living room / family room | 11800 - Care for pets/animals |


Problem Definition

Difficulties in human encoding:
– Error prone
– Inefficient
– Expensive

Computer classification assistance

Possible solution:
– Statistical language processing to perform text abstraction


Solution Strategies – Word-only system

Word n-grams at utterance level to identify the most likely semantic categories
– Probabilistic relationship between words


N-grams

Diary entry substrings

Word relationships

These relationships are used in the word-only n-gram model

Example: “I am walking to the store”
– Trigram: “I am walking”
– Bigram: “am walking”
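The n-gram extraction in the example above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `ngrams`, the lowercasing, and the whitespace tokenization are assumptions made here.

```python
def ngrams(utterance: str, n: int) -> list[tuple[str, ...]]:
    """Return every run of n consecutive words in the utterance."""
    # Assumption: words are separated by whitespace and case is ignored.
    words = utterance.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


entry = "I am walking to the store"
bigrams = ngrams(entry, 2)    # includes ("am", "walking")
trigrams = ngrams(entry, 3)   # includes ("i", "am", "walking")
```

A six-word entry yields five bigrams and four trigrams; each one becomes a feature whose co-occurrence with a category can be counted in the training data.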


Leave one out testing

Problems with single data set
– Database small size
– Single test/training set bias
– More data sets with better diversity

Leave-one-out testing
– 1 test set = 1 day of recordings from 1 subject
– 42 training/testing sets in all
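One way to build such folds is to group entries by subject and day, then hold out one group at a time. The sketch below is only an illustration under assumed names: the `subject` and `day` fields and the function `leave_one_day_out` are hypothetical, and the subject labels are invented.

```python
from collections import defaultdict


def leave_one_day_out(entries):
    """Yield (held_out_key, train, test), holding out one subject-day per fold."""
    by_day = defaultdict(list)
    for entry in entries:
        by_day[(entry["subject"], entry["day"])].append(entry)
    for key in sorted(by_day):
        test = by_day[key]
        # Everything recorded on any other subject-day is training data.
        train = [e for k, group in by_day.items() if k != key for e in group]
        yield key, train, test


# Toy data: utterances echo the slides, subject/day labels are invented.
entries = [
    {"subject": "A", "day": 1, "text": "in the bedroom starting housework"},
    {"subject": "A", "day": 1, "text": "loading the washing machine in the laundry room"},
    {"subject": "A", "day": 2, "text": "the bedroom getting more clothes"},
    {"subject": "B", "day": 1, "text": "in the office at the computer"},
]
folds = list(leave_one_day_out(entries))
```

With the study's data this grouping would produce the 42 folds mentioned above; the toy data produces three.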


Word-only system results

Leave-one-out test sets
– Location: 65.5% correct
– Activity: 55.3% correct


Hypothesis 1

Word + context system
– Performing statistical NLP text abstraction using multi-diary entry contextual information will improve the disambiguation of human speech diary entries over the word-only n-gram model applied to single diary entries in the word-only study.


Reasoning for using context

Information that humans use when encoding

Relationship between activities and locations
– Relationship between current location and current activity
– Relationship between current location and previous location


Previous context information

Past context information helps disambiguate

Diary entry: “in the office at the computer”
– Correct location: Study or Home Office
– Previous location: Living room / family room
– Top 3 word-only location choices (with probability):
  0.904 - Office building/bank/post office
  0.217 - Public building/library/museum/theater
  0.053 - Public garage / parking


6 context relationships

Current location given:
– Current activity
– Previous activity
– Previous location

Current activity given:
– Current location
– Previous location
– Previous activity
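Each of these relationships can be read as a conditional probability estimated from the encoded diaries. The sketch below estimates one of them, current location given previous location, by relative frequency. The function name `conditional_probs` is hypothetical, the slides do not say how the authors estimated these values, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def conditional_probs(pairs):
    """Estimate P(target | given) from (given, target) pairs by relative frequency."""
    counts = defaultdict(Counter)
    for given, target in pairs:
        counts[given][target] += 1
    return {
        given: {t: c / sum(ctr.values()) for t, c in ctr.items()}
        for given, ctr in counts.items()
    }


# Location sequence taken from the database sample shown earlier in the slides.
locations = ["Bedroom", "Laundry room", "Bedroom",
             "Laundry room", "Living room", "Living room"]
# Pair each previous location with the location that followed it.
pairs = list(zip(locations, locations[1:]))
p_loc_given_prev = conditional_probs(pairs)
```

The same function covers the other five relationships by changing which two columns are paired.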


Context incorporation

How much do we weight the words in the utterance versus the context information?

We assumed a linear combination of weights

We applied a brute force search of coefficients to achieve the optimal results
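A linear combination with a brute-force coefficient search might look like the sketch below. It is an illustration under stated assumptions: the function names, the grid step of 0.1, the constraint that weights sum to 1, and all toy scores are invented here (only the 0.904 word-only score echoes the slides), and the authors' actual search procedure is not described.

```python
from itertools import product


def combine(sources, weights):
    """Weighted sum of per-category scores from each information source."""
    categories = set().union(*(s.keys() for s in sources))
    return {
        c: sum(w * s.get(c, 0.0) for w, s in zip(weights, sources))
        for c in categories
    }


def accuracy(examples, weights):
    """Fraction of examples whose top combined category matches the truth."""
    hits = 0
    for sources, truth in examples:
        combined = combine(sources, weights)
        hits += max(combined, key=combined.get) == truth
    return hits / len(examples)


def grid_search(examples, n_sources, step=0.1):
    """Try every weight vector on the grid that sums to 1; keep the first best."""
    ticks = round(1 / step)
    best_acc, best_weights = -1.0, None
    for combo in product(range(ticks + 1), repeat=n_sources):
        if sum(combo) != ticks:
            continue
        weights = tuple(k / ticks for k in combo)
        acc = accuracy(examples, weights)
        if acc > best_acc:
            best_acc, best_weights = acc, weights
    return best_acc, best_weights


# Each example: ([word-only scores, previous-location scores], correct category).
examples = [
    ([{"Office building": 0.904, "Study or home office": 0.040},
      {"Study or home office": 0.600, "Office building": 0.050}],
     "Study or home office"),
    ([{"Bedroom": 0.8, "Living room": 0.1},
      {"Bedroom": 0.3, "Living room": 0.4}],
     "Bedroom"),
]
best_acc, best_weights = grid_search(examples, n_sources=2)
```

On this toy data neither source alone classifies both entries correctly, but a mix weighted toward context does, which is the effect the coefficient search is meant to find.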


Average activity results

Word-only
– 55.3%

Word+context
– 66.1%

% improvement
– 19.5%

Weights
– Word-only: 0.354
– Previous Location: 0.177
– Current Activity: 0.201
– Previous Activity: 0.268

[Bar chart: word-only vs. word & context; y-axis 0 to 1; data labels 0.759836 and 0.654918]


Average location results

Word-only
– 65.5%

Word+context
– 76.0%

% improvement
– 16.0%

Weights
– Word-only: 0.294
– Previous Location: 0.146
– Current Activity: 0.286
– Previous Activity: 0.274

[Bar chart: word-only vs. word & context; y-axis 0 to 1; data labels 0.553279 and 0.660656]


Hypothesis 2: Thresholds

Threshold System:
– “Thresholds can be found experimentally in the data to balance trade-offs between precision and recall.”

Threshold
– A level at which the computer can classify diary entries with a certain level of precision
– Level will be computed using precision and recall

Guesses
– Computer can either classify or not classify
– If it classifies, that is considered a guess
– Ex: SAT tests


Threshold Example

Difference of top 2 scores
– “going to lay [sic] in bed for 20 to 30 minutes”
– Correct Location: 30125 – Bedroom
– Top Score: 30122 - Living room / family room: 0.6448
– Second Score: 30125 – Bedroom: 0.6296
– Relative Difference: (0.6448 - 0.6296) / 0.6448 = 0.0235
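The relative difference above can drive a classify-or-abstain decision, sketched below. The rule "abstain when the gap is below the threshold" and the function names are assumptions made for illustration. The slides state only that the difference of the top 2 scores is used.

```python
def relative_difference(scores):
    """(top - second) / top for the two highest-scoring categories."""
    top, second = sorted(scores.values(), reverse=True)[:2]
    return (top - second) / top


def classify_or_abstain(scores, threshold):
    """Return the top category, or None when the top two scores are too close."""
    if relative_difference(scores) < threshold:
        return None  # assumed behavior: leave the entry for a human encoder
    return max(scores, key=scores.get)


# Scores from the slide's example entry.
scores = {
    "30122 - Living room / family room": 0.6448,
    "30125 - Bedroom": 0.6296,
}
gap = relative_difference(scores)
decision = classify_or_abstain(scores, threshold=0.05)
```

With a threshold of 0.05 the system would decline to guess on this entry, avoiding the incorrect "Living room / family room" label.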


Precision & Recall

Precision
– The accuracy of the computer system on the diary entries it chose to encode

Recall
– The number of diary entries the computer guessed correctly, relative to the entire data set

Relationship between precision and recall
– Generally, as precision goes up, recall goes down


Example: Precision and Recall

Student takes 10 question test
– Guesses at 7 questions
– Answers 6 questions right

Precision
– 86%, 6 out of 7 attempted answers correct

Recall
– 60%, 6 answers correct out of all questions
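The arithmetic of the test example can be checked with a short sketch. The function name `precision_recall` and its parameter names are chosen here for illustration.

```python
def precision_recall(attempted, correct, total):
    """Precision = correct / attempted; recall = correct / total."""
    precision = correct / attempted if attempted else 0.0
    recall = correct / total if total else 0.0
    return precision, recall


# The student attempts 7 of 10 questions and gets 6 right.
precision, recall = precision_recall(attempted=7, correct=6, total=10)
```

Here precision is 6/7 (about 86%) and recall is 6/10 (60%), matching the figures on the slide.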


Appropriate threshold levels

Done experimentally
– Step size of 0.05

Attempt to determine tradeoff between precision and recall

Relationship between scores
– Difference between top 2 scores
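A threshold sweep in steps of 0.05 could be organized as in the sketch below. The function name `sweep`, the input format (a relative gap and a correctness flag per entry), and the toy values are all invented for illustration. Only the step size comes from the slides.

```python
def sweep(results, step=0.05):
    """Return (threshold, precision, recall) rows for thresholds from 0 to 1.

    results: list of (relative_gap, top_guess_is_correct) pairs, one per entry.
    """
    rows = []
    for i in range(round(1 / step) + 1):
        threshold = i * step
        # The system only guesses when the gap clears the threshold.
        attempted = [ok for gap, ok in results if gap >= threshold]
        correct = sum(attempted)
        precision = correct / len(attempted) if attempted else None
        recall = correct / len(results)
        rows.append((round(threshold, 2), precision, recall))
    return rows


# Invented toy results: small gaps tend to be wrong guesses.
results = [(0.02, False), (0.04, False), (0.12, True),
           (0.33, True), (0.47, True), (0.62, True)]
rows = sweep(results)
```

On this toy data, raising the threshold from 0 to 0.05 lifts precision from 4/6 to 1.0 with no loss of recall; higher thresholds then start to cut recall, which is the tradeoff the experiment looks for.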


Threshold results

[Line chart: % of precision and recall (y-axis, 0 to 1) vs. threshold (x-axis, 0 to 1); series: Activity Precision, Activity Recall, Location Precision, Location Recall]


Limitations

Optimal Classifier
– Neural Network and Markov modeling

Database
– Increased size

Context Information
– Utilize more information from context


Questions?