![Page 1: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/1.jpg)
Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)
CeRI (Cornell eRulemaking Initiative), Cornell University
An eRulemaking Corpus: Identifying Substantive Issues in Public Comments
![Page 2: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/2.jpg)
Plan for the Talk
Background
– E-rulemaking
CeRI FTA Grant Circulars Corpus
Text Categorization Experiments
![Page 3: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/3.jpg)
Rulemaking and E-Rulemaking
Rulemaking: one of the principal methods of making regulatory policy in the US
– ~4000+ per year
“Notice and comment” rulemaking: formal public participation phase
– 10 – 500,000 comments per rule
– comment length: 1 sentence to tens of pages
– agency legally bound to respond to all substantive issues
E-rulemaking = e-notice and e-comment
![Page 4: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/4.jpg)
Current Agency Practice
![Page 5: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/5.jpg)
Goals of Our Current Work
Determine the degree to which automatic issue categorization can facilitate analysis of comments by identifying and categorizing “relevant issues”.
Framed as a text categorization task: given a comment set, the automated system should determine, for each sentence in each comment, which of a group of pre-defined issue categories it raises, if any.
Builds on the work of Kwon & Hovy (2007) and Kwon et al. (2006)
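The task framing above can be sketched as a function from comments (lists of sentences) to per-sentence label sets. This is only an illustration of the input and output shapes: the keyword classifier, its cue words, and the example sentences are hypothetical stand-ins for a trained model, and the two label names are borrowed from the issue names shown later in the talk.

```python
from typing import Callable, Dict, List, Set

# A classifier maps one sentence to the set of issues it raises.
# An empty set plays the role of the "none" outcome.
Classifier = Callable[[str], Set[str]]


def categorize_comments(
    comments: Dict[str, List[str]], classify: Classifier
) -> Dict[str, List[Set[str]]]:
    """Apply a per-sentence classifier to every sentence of every comment."""
    return {cid: [classify(s) for s in sents] for cid, sents in comments.items()}


def keyword_classifier(sentence: str) -> Set[str]:
    """Toy stand-in for a trained model (hypothetical cue words)."""
    cues = {"training": "TechAsstTrain", "eligible": "JARC_EligActiv"}
    words = sentence.lower().split()
    return {label for cue, label in cues.items() if cue in words}


comments = {
    "c1": [
        "Rural providers need training",
        "Thank you for the opportunity to comment",
    ]
}
labels = categorize_comments(comments, keyword_classifier)
print(labels)
```

Returning a set per sentence leaves room for sentences that raise several issues at once, or none.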
![Page 6: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/6.jpg)
Plan for the Talk
Background
CeRI FTA Grant Circulars Corpus
– Difficulties
– Interannotator agreement results
Text Categorization Experiments
![Page 7: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/7.jpg)
FTA Grant Circulars Rule
Topic: guidance to public and private transportation providers applying for federal aid for elderly, disabled and low-income persons
267 comments
– shortest: 1 sentence
– longest: 1,420 sentences
11,094 sentences total
![Page 8: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/8.jpg)
FTA Grant Circulars Issue Set
17 top-level issues
39 fine-grained issues
![Page 9: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/9.jpg)
Kwon & Hovy (2007)
vs.
![Page 10: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/10.jpg)
Difficulties for Text Categorization
Large, hierarchical issue set
![Page 11: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/11.jpg)
FTA Grant Circulars Issue Set
17 top-level issues
39 fine-grained issues
![Page 12: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/12.jpg)
Difficulties for Text Categorization
Large, hierarchical issue set
“NONE” category
Skewed distribution across issues
– 87% of the sentences are from 6 categories
– 13% of the sentences are from 33 categories
Potentially multiple issues per sentence
Even long sentences contain few words
Variation in comment quality, scope, vocabulary and form
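A rough back-of-the-envelope calculation shows what this skew implies for training data. It assumes, purely for illustration, that the percentages apply to the full 11,094-sentence corpus and that each sentence counts once; the slides do not state either, and sentences with multiple issues would change the numbers.

```python
# Figures quoted on the slides.
total_sentences = 11_094
frequent_share, frequent_cats = 0.87, 6
rare_share, rare_cats = 0.13, 33

# Average number of sentences per category in each group
# (assumption: shares apply to the whole corpus, one issue per sentence).
avg_frequent = total_sentences * frequent_share / frequent_cats
avg_rare = total_sentences * rare_share / rare_cats

print(round(avg_frequent), round(avg_rare))  # 1609 44
```

Under these assumptions a frequent category has on the order of 1,600 example sentences while a rare one has a few dozen, which is one way to see why the rare categories are hard to learn.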
![Page 13: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/13.jpg)
The Annotators
![Page 14: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/14.jpg)
Interannotator Agreement
146 comments used for the study
6 annotators
2.66 annotators per comment
41.5 sentences per comment
Overlap agreement measure
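The slides do not define the overlap agreement measure. One plausible reading, shown here only as an assumption, counts two annotators as agreeing on a sentence when their label sets share at least one issue, or when both assign no issue. The annotator ids and label assignments in the example are made up.

```python
from itertools import combinations
from typing import Dict, List, Set


def sentence_overlap(a: Set[str], b: Set[str]) -> bool:
    """Assumed rule: agree if the sets share a label, or both are empty."""
    return bool(a & b) or (not a and not b)


def pairwise_overlap_agreement(annotations: Dict[str, List[Set[str]]]) -> float:
    """Fraction of (annotator pair, sentence) cases that agree for one comment."""
    agree = total = 0
    for x, y in combinations(sorted(annotations), 2):
        for a, b in zip(annotations[x], annotations[y]):
            agree += sentence_overlap(a, b)
            total += 1
    return agree / total if total else 0.0


# Hypothetical annotations of a three-sentence comment by two annotators.
example = {
    "ann1": [{"MobilMgt"}, set(), {"DesRecip", "CompSelect"}],
    "ann2": [{"MobilMgt"}, {"TechAsstTrain"}, {"CompSelect"}],
}
score = pairwise_overlap_agreement(example)
print(score)
```

Here the annotators agree on the first and third sentences and disagree on the second, so the score is two thirds.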
![Page 15: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/15.jpg)
Plan for the Talk
Background
– E-rulemaking
– Public comment analysis
CeRI FTA Grant Circulars Corpus
– Difficulties
– Interannotator agreement results
Text Categorization Experiments
![Page 16: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/16.jpg)
Standard Text Categorization Algorithms
Fine-grained issues (39); coarse-grained issues (17)
Standard (flat) text categorization methods
• SVMs (0/1 loss)
• Maxent
• Naïve Bayes
Hierarchical text categorization methods
• cascaded classification (Dumais & Chen, 2000)
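As an illustration of a flat method, the following is a minimal multinomial Naïve Bayes sketch with add-one smoothing, written with the standard library only. It is not the implementation used in the experiments, and the four training sentences are invented for the example; only the label names come from the talk.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def train_nb(data: List[Tuple[str, str]]):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts: Counter = Counter()
    word_counts: Dict[str, Counter] = defaultdict(Counter)
    vocab = set()
    for text, label in data:
        label_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return label_counts, word_counts, vocab


def predict_nb(model, text: str) -> str:
    """Return the label with the highest smoothed log-probability."""
    label_counts, word_counts, vocab = model
    n = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / n)  # log prior
        for w in text.lower().split():
            if w in vocab:  # ignore words never seen in training
                score += math.log(
                    (word_counts[label][w] + 1) / (total + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training data.
train = [
    ("operators need training and technical assistance", "TechAsstTrain"),
    ("training for new grantees", "TechAsstTrain"),
    ("mobility management should be an eligible expense", "MobilMgt"),
    ("fund mobility management staff", "MobilMgt"),
]
model = train_nb(train)
prediction = predict_nb(model, "more training please")
print(prediction)
```

A flat classifier like this treats all 39 fine-grained issues as unrelated labels and ignores the hierarchy.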
![Page 17: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/17.jpg)
Cascaded Categorization
Some
![Page 18: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/18.jpg)
Cascaded Categorization
![Page 19: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/19.jpg)
Cascaded Categorization
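The general idea of a cascade can be sketched as two stages: a top-level classifier picks a coarse issue, then a classifier specific to that coarse issue picks the fine-grained one. This is a generic sketch, not the configuration used in the talk. The hierarchy, the issue names, and the keyword rules below are all hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple

TopClassifier = Callable[[str], Optional[str]]
FineClassifier = Callable[[str], str]


def cascade(
    sentence: str,
    top: TopClassifier,
    fine_by_top: Dict[str, FineClassifier],
) -> Tuple[Optional[str], Optional[str]]:
    """Stage 1 picks a coarse issue; stage 2 refines it within that branch."""
    coarse = top(sentence)
    if coarse is None:  # no issue raised: skip the second stage
        return None, None
    fine = fine_by_top.get(coarse)
    return coarse, (fine(sentence) if fine else None)


def top_level(sentence: str) -> Optional[str]:
    """Hypothetical stage-1 classifier based on keywords."""
    s = sentence.lower()
    if "eligib" in s:
        return "Eligibility"
    if "training" in s:
        return "Assistance"
    return None


# Hypothetical stage-2 classifiers, one per coarse issue.
fine_by_top: Dict[str, FineClassifier] = {
    "Eligibility": lambda s: (
        "Eligibility/Activities" if "activit" in s.lower()
        else "Eligibility/Recipients"
    ),
    "Assistance": lambda s: "Assistance/Training",
}

result = cascade("Which activities are eligible?", top_level, fine_by_top)
print(result)
```

Each second-stage classifier only has to separate the few fine-grained issues under one coarse issue, but a mistake in the first stage cannot be recovered in the second.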
![Page 20: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/20.jpg)
Gold Standard Data Set
Simulate agency comment analysis process
– One analyst / rule
Six data sets
– One data set / annotator
![Page 21: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/21.jpg)
SVM Results with tf.idf Features
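For readers unfamiliar with tf.idf features, here is a minimal sketch that treats each sentence as a document. The exact weighting variant used in the experiments is not given on the slides; this version assumes raw term counts multiplied by log(N / df), and the three sentences are invented.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(sentences: List[str]) -> List[Dict[str, float]]:
    """Return one sparse tf.idf vector (word -> weight) per sentence."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency: number of sentences containing each word.
    df = Counter(w for toks in tokenized for w in set(toks))
    return [
        {w: count * math.log(n / df[w]) for w, count in Counter(toks).items()}
        for toks in tokenized
    ]


vectors = tfidf_vectors(
    [
        "transit funding rules",
        "funding for rural transit",
        "training training needed",
    ]
)
print(vectors[2])
```

Words that occur in many sentences get low weights, while words concentrated in a few sentences get high weights. These sparse vectors are the kind of input a linear classifier such as an SVM would be trained on.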
![Page 22: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/22.jpg)
Best-Performing Fine-Grained Issues (Annotator 1)
[Bar chart. x-axis: rule-specific issue (Other, NFreeElig, JARC_EligActiv, CompSelect, TechAsstTrain, MobilMgt, DesRecip, none). y-axis, 0 to 0.9: % agreement, % correct, or % sentences covered, depending on bar color. Series: categorization accuracy, interannotator agreement, data coverage.]
![Page 23: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/23.jpg)
Progress and Plans
• Promising initial results for rule-specific issue categorization of public comments
– Annotate comments for more rules
– Expert (rulewriter) vs. law student annotation
– Integrate automatic text categorization into annotation interface
• Active learning (Purpura, Cardie & Simons, dg.o 2008)
• Collaboration with HCI colleagues in InfoSci
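One common form of active learning is uncertainty sampling: the annotation interface would show the annotator the sentences the current model is least sure about. The sketch below uses a margin criterion. It is a generic illustration and not the method of the cited paper; the sentence ids and probabilities are made up.

```python
from typing import Dict, List


def margin(probs: Dict[str, float]) -> float:
    """Gap between the two most probable labels; a small gap means uncertain."""
    top = sorted(probs.values(), reverse=True)
    return top[0] - (top[1] if len(top) > 1 else 0.0)


def select_for_annotation(pool: Dict[str, Dict[str, float]], k: int) -> List[str]:
    """Return ids of the k sentences whose predictions are least certain."""
    return sorted(pool, key=lambda sid: (margin(pool[sid]), sid))[:k]


# Hypothetical model outputs for three unlabeled sentences.
pool = {
    "s1": {"MobilMgt": 0.90, "none": 0.10},
    "s2": {"MobilMgt": 0.55, "none": 0.45},
    "s3": {"MobilMgt": 0.30, "none": 0.70},
}
queue = select_for_annotation(pool, 2)
print(queue)
```

The near-tie on s2 puts it first in the queue, followed by s3; the confident prediction on s1 is left for later.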
![Page 24: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/24.jpg)
The End
For more on
– the hierarchical text categorization method
• Cardie et al. (dg.o 2008)
– a new structural learning approach for hierarchical classification
• Purpura et al. (in preparation)
– active learning methods for hierarchical text categorization
• Purpura, Cardie & Simons (dg.o 2008)
![Page 25: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/25.jpg)
![Page 26: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/26.jpg)
![Page 27: Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)](https://reader036.vdocuments.us/reader036/viewer/2022062519/568151b6550346895dbfe567/html5/thumbnails/27.jpg)
Minimizing the Costliest Errors
Underinclusive errors are the most costly
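If an underinclusive error is read as a missed issue (a false negative), one standard way to reflect its higher cost in evaluation is an F-measure that weights recall above precision. The slide does not say how the cost was handled, so this is only an illustration; the counts are invented and beta = 2 is an arbitrary choice.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-measure; beta > 1 weights recall (few misses) above precision."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Two hypothetical systems with the same F1 but different error profiles.
overinclusive = f_beta(tp=8, fp=4, fn=2)   # more false alarms, fewer misses
underinclusive = f_beta(tp=8, fp=2, fn=4)  # fewer false alarms, more misses
print(round(overinclusive, 3), round(underinclusive, 3))
```

With beta = 2 the system that misses fewer issues scores higher, even though the two systems are tied under the balanced F1.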