InfoMagnets: Making Sense of Corpus Data

Jaime Arguello
Language Technologies Institute
Topic Segmentation: Helping InfoMagnets Make Sense of Corpus Data
Outline
• InfoMagnets
• Applications
• Topic Segmentation
  – Evaluation of 3 Algorithms
• Results
• Conclusions
• Q/A
InfoMagnets
InfoMagnets Applications
• Behavioral Research: 2 publishable results (submitted to CHI)
  – CycleTalk Project, LTI
  – Netscan Group, HCII
• Conversational Interfaces
  – TuTalk (Gweon et al., 2005)
  – Guide authoring using pre-processed human-human sample conversations
  – Corpus organization makes authoring conversational agents less intimidating (Rose, Pai, & Arguello, 2005)
Pre-processing Dialogue
[Figure: transcribed conversations pass through (1) topic segmentation, which splits each conversation into topic "chunks" (A, B, C), and (2) topic clustering, which groups chunks discussing the same topic.]
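This two-stage pre-processing pipeline can be sketched as plain function composition; `toy_segment` and `toy_cluster` below are purely illustrative stand-ins for the real topic-segmentation and topic-clustering components:

```python
def preprocess(conversations, segment, cluster):
    """(1) Topic segmentation splits each transcribed conversation into
    topic "chunks"; (2) topic clustering groups the chunks by topic."""
    chunks = [chunk for conv in conversations for chunk in segment(conv)]
    return cluster(chunks)

# Toy plug-ins: pretend boundaries are pre-marked with " | ",
# and "cluster" chunks by their first word.
def toy_segment(conv):
    return conv.split(" | ")

def toy_cluster(chunks):
    clusters = {}
    for chunk in chunks:
        clusters.setdefault(chunk.split()[0], []).append(chunk)
    return clusters
```

For example, `preprocess(["gas expands | pump works", "pump fails | gas cools"], toy_segment, toy_cluster)` groups the four chunks into a "gas" cluster and a "pump" cluster.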
Topic Segmentation
• Pre-processing step for InfoMagnets
• An important computational linguistics problem!
• Previous work:
  – Marti Hearst's TextTiling (1994)
  – Beeferman, Berger, and Lafferty (1997)
  – Barzilay and Lee (2004): NAACL best paper award!
  – Many others
• But we are segmenting dialogue…
Topic Segmentation of Dialogue
• Dialogue is different:
  – Very little training data
  – Linguistic phenomena:
    • Ellipsis
    • Telegraphic content
  – And, most importantly…

Coherence in dialogue is organized around a shared task, not around a single flow of information!
Coherence Defined Over Shared Task

[Figure: "Correlation Values Across One Dialogue": cosine correlation (y-axis, 0 to 0.5) against utterance # (x-axis, 1 to 137). Multiple topic shifts occur in regions with no intersection of content words.]
Evaluation of 3 Algorithms
• 22 student-tutor pairs
• Domain: thermodynamics
• Conversation via a chat interface
• One coder
• Results reported in terms of Pk (Beeferman, Berger, and Lafferty, 1999)
• Significance tests: 2-tailed t-tests
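For reference, Pk can be computed as in the following minimal sketch: a window of width k slides over the text, counting disagreements between reference and hypothesis about whether the window's two endpoints fall in the same segment (segmentations are given here as lists of segment lengths):

```python
def pk(reference, hypothesis, k=None):
    """Pk (Beeferman, Berger & Lafferty, 1999). Lower is better;
    a score near 0.5 is roughly chance-level."""
    def segment_labels(seg_lengths):
        # e.g. [3, 2] -> [0, 0, 0, 1, 1]
        labels = []
        for seg_id, length in enumerate(seg_lengths):
            labels.extend([seg_id] * length)
        return labels

    ref, hyp = segment_labels(reference), segment_labels(hypothesis)
    assert len(ref) == len(hyp), "segmentations must cover the same text"
    n = len(ref)
    if k is None:
        # conventional choice: half the mean reference segment length
        k = max(1, round(n / len(reference) / 2))
    disagreements = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
        for i in range(n - k)
    )
    return disagreements / (n - k)
```

A perfect segmentation scores 0; merging everything into one segment is penalized only near the missed boundaries, which is why degenerate baselines can score deceptively well.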
3 Baselines
• NONE: no topic boundaries
• ALL: every utterance marks a topic boundary

• EVEN: every 13th utterance marks a topic boundary
  – avg topic length = 13 utterances
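Expressed as code (a sketch; each baseline returns the indices of utterances that start a new topic):

```python
def none_baseline(utterances):
    # NONE: the whole dialogue is one segment, so no internal boundaries
    return []

def all_baseline(utterances):
    # ALL: a topic boundary before every utterance after the first
    return list(range(1, len(utterances)))

def even_baseline(utterances, avg_len=13):
    # EVEN: a boundary every avg_len utterances
    # (13 = average topic length observed in the corpus)
    return list(range(avg_len, len(utterances), avg_len))
```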
1st Attempt: TextTiling
• Slide two adjacent "windows" down the text
• Calculate the cosine correlation at each step
• Use the correlation values to calculate "depth" scores
• "Depth" values higher than a threshold correspond to topic shifts

(Hearst, 1997)
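A minimal sketch of this windowing scheme (not Hearst's full algorithm, which also normalizes term weights and smooths the similarity curve): two adjacent windows of `w` utterances slide over the dialogue, and a similarity valley whose depth relative to the neighboring peaks exceeds a threshold becomes a topic boundary.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def texttiling(utterances, w=2, threshold=0.1):
    tokens = [u.lower().split() for u in utterances]
    # cosine similarity between the w utterances before and after each gap
    sims = []
    for gap in range(w, len(tokens) - w + 1):
        left = Counter(t for u in tokens[gap - w:gap] for t in u)
        right = Counter(t for u in tokens[gap:gap + w] for t in u)
        sims.append(cosine(left, right))
    boundaries = []
    for i in range(1, len(sims) - 1):
        if sims[i] > sims[i - 1] or sims[i] > sims[i + 1]:
            continue  # not a local valley
        # depth: how far the valley dips below the peaks on either side
        depth = (max(sims[:i + 1]) - sims[i]) + (max(sims[i:]) - sims[i])
        if depth > threshold:
            boundaries.append(i + w)  # boundary before utterance i + w
    return boundaries
```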
TextTiling Results
• TextTiling performs worse than the baselines
• The difference is not statistically significant
• Why doesn't it work?
| Algorithm | (avg) Pk |
|---|---|
| NONE | 0.489703303 |
| ALL | 0.518040038 |
| EVEN | 0.511727103 |
| TT | 0.535304902 |

| T-test | p-value |
|---|---|
| TT vs. NONE | 0.160036665 |
| TT vs. ALL | 0.089668491 |
| TT vs. EVEN | 0.221969444 |
TextTiling Results

• Topic boundaries were set heuristically where the correlation is 0
• Bad results, but still valuable!
[Figure: the "Correlation Values Across One Dialogue" plot again: cosine correlation (y-axis, 0 to 0.5) against utterance # (x-axis, 1 to 137).]
2nd Attempt: Barzilay and Lee (2004)

• Cluster utterances
• Treat each cluster as a "state"
• Construct an HMM:
  – Emissions: state-specific language models
  – Transitions: based on location and cluster-membership of the utterances
• Viterbi re-estimation until convergence
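The B&L decode-and-re-estimate loop can be sketched roughly as below. This is a heavy simplification under assumed details: add-alpha smoothed unigram state language models and a fixed self-transition bias, where the real content model induces the clusters and estimates richer transitions; boundaries fall where the decoded state changes.

```python
import math
from collections import Counter

def state_lms(utterances, labels, k, alpha=0.1):
    """Add-alpha smoothed unigram language model for each state."""
    vocab = {t for u in utterances for t in u.split()}
    counts = [Counter() for _ in range(k)]
    for u, lab in zip(utterances, labels):
        counts[lab].update(u.split())
    lms = []
    for c in counts:
        total = sum(c.values()) + alpha * len(vocab)
        lms.append({t: math.log((c[t] + alpha) / total) for t in vocab})
    return lms

def content_model_segment(utterances, labels, k, iters=5, stay=0.8):
    """Alternate Viterbi decoding and LM re-estimation until the state
    sequence stabilizes; return indices where the decoded state changes."""
    n = len(utterances)
    log_stay = math.log(stay)
    log_switch = math.log((1 - stay) / max(k - 1, 1))
    path = labels
    for _ in range(iters):
        lms = state_lms(utterances, path, k)
        emit = [[sum(lms[s][t] for t in u.split()) for s in range(k)]
                for u in utterances]
        score, back = [emit[0][:]], []
        for i in range(1, n):
            row, ptr = [], []
            for s in range(k):
                prev, best = max(
                    ((p, score[-1][p] + (log_stay if p == s else log_switch))
                     for p in range(k)),
                    key=lambda x: x[1])
                row.append(best + emit[i][s])
                ptr.append(prev)
            score.append(row)
            back.append(ptr)
        s = max(range(k), key=lambda x: score[-1][x])
        new_path = [s]
        for ptr in reversed(back):
            s = ptr[s]
            new_path.append(s)
        new_path.reverse()
        if new_path == path:  # converged
            break
        path = new_path
    return [i for i in range(1, n) if path[i] != path[i - 1]]
```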
B&L Results

• B&L is statistically better than TextTiling, but not better than the degenerate baselines

| Algorithm | (avg) Pk |
|---|---|
| NONE | 0.489703303 |
| ALL | 0.518040038 |
| EVEN | 0.511727103 |
| TextTiling | 0.535304902 |
| B&L | 0.5086039 |

| T-test | p-value |
|---|---|
| B&L vs. NONE | 0.586165431 |
| B&L vs. ALL | 0.471531964 |
| B&L vs. EVEN | 0.862335954 |
| B&L vs. TextTiling | 0.038565997 |
B&L Results
• Topic boundaries are too fine-grained
• Fixed expressions ("ok", "yeah", "sure")
• Remember: cohesion is based on the shared task
• Are the state-based language models sufficiently different?
Adding Dialogue Dynamics
• Dialogue Act coding scheme
  – Developed for discourse analysis of human-tutor dialogues
• 4 main dimensions:
  – Action
  – Depth
  – Focus
  – Control
• Dialogue Exchange (Sinclair and Coulthard, 1975)
3rd Attempt: Cross-Dimensional Learning

• Cross-dimensional learning (Donmez et al., 2004)
• Use estimated labels on some dimensions to learn other dimensions
• 3 types of features:
  – Text (discourse cues)
  – Lexical coherence (binary)
  – Dialogue Act labels
• 10-fold cross-validation
• Topic boundaries are learned from estimated labels, not hand-coded ones!
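A toy illustration of the idea (every name and rule here is an invented stand-in, not the actual classifiers or feature set): a first classifier's *predicted* dialogue-act label on one dimension feeds into the feature vector of the topic-boundary classifier, alongside discourse-cue and lexical-coherence features.

```python
def predict_control(utterance):
    # stand-in for a learned classifier on the Control dimension:
    # pretend questions signal tutor control
    return "tutor" if utterance.rstrip().endswith("?") else "student"

def boundary_features(prev, curr):
    return {
        # text feature: a discourse cue opens the utterance
        "cue": curr.lower().startswith(("so ", "now ", "ok ")),
        # binary lexical coherence: any word overlap with the previous turn
        "coherent": bool(set(prev.lower().split()) & set(curr.lower().split())),
        # dialogue-act feature built from *estimated* labels, not gold ones
        "control_shift": predict_control(prev) != predict_control(curr),
    }

def is_boundary(prev, curr):
    # stand-in decision rule for the trained boundary classifier
    f = boundary_features(prev, curr)
    return f["cue"] or (f["control_shift"] and not f["coherent"])
```

In the real setup these features would be fed to a trained classifier under 10-fold cross-validation rather than a hand-written rule.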
X-Dimensional Learning Results
• X-DIM is statistically better than TextTiling, the degenerate baselines, AND B&L!

| Algorithm | (avg) Pk |
|---|---|
| NONE | 0.489703303 |
| ALL | 0.518040038 |
| EVEN | 0.511727103 |
| TextTiling | 0.535304902 |
| B&L | 0.5086039 |
| X-DIM | 0.430374083 |

| T-test | p-value |
|---|---|
| X-DIM vs. NONE | 0.024460654 |
| X-DIM vs. ALL | 0.023027116 |
| X-DIM vs. EVEN | 0.022848002 |
| X-DIM vs. TextTiling | 0.011833934 |
| X-DIM vs. B&L | 0.022591035 |
Statistically Significant Improvement
| vs. | TT | B&L | X-DIM |
|---|---|---|---|
| NONE | non-sig | non-sig | SIG |
| ALL | non-sig | non-sig | SIG |
| EVEN | non-sig | non-sig | SIG |
| TT | | SIG | SIG |
| B&L | | | SIG |
Future Directions
• Merge cross-dimensional learning (with dialogue-act features) with B&L's content-modeling HMM approach
• Explore other work on topic segmentation of dialogue
Summary
• Introduction to InfoMagnets
• Applications
• The need for topic segmentation
• Evaluation of existing algorithms
• A novel algorithm using cross-dimensional learning, with a statistically significant improvement
Q/A
Thank you!