Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)
CeRI (Cornell eRulemaking Initiative), Cornell University
An eRulemaking Corpus: Identifying Substantive Issues in Public Comments
Plan for the Talk
Background
– E-rulemaking
CeRI FTA Grant Circulars Corpus
Text Categorization Experiments
Rulemaking and E-Rulemaking
Rulemaking: one of the principal methods of making regulatory policy in the US
– ~4,000+ per year
“Notice and comment” rulemaking: formal public participation phase
– 10 to 500,000 comments per rule
– comment length: 1 sentence to 10s of pages
– agency legally bound to respond to all substantive issues
E-rulemaking = e-notice and e-comment
Current Agency Practice
Goals of Our Current Work
Determine the degree to which automatic issue categorization can facilitate analysis of comments by identifying and categorizing “relevant issues”.
Framed as a text categorization task: given a comment set, the automated system should determine, for each sentence in each comment, which of a group of pre-defined issue categories it raises, if any.
Builds on the work of Kwon & Hovy (2007) and Kwon et al. (2006)
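The task framing above can be sketched as a small interface: split each comment into sentences, then map every sentence to a (possibly empty) set of issue categories. This is a minimal illustration, not the CeRI system. The sentence splitter, the `categorize_comments` helper, and the keyword rules are all hypothetical; only the issue labels `MobilMgt` and `TechAsstTrain` come from the talk. An empty set stands for "no issue raised".

```python
import re
from typing import Callable, Dict, List, Set, Tuple


def split_sentences(comment: str) -> List[str]:
    """Naive splitter (an assumption): break after ., ! or ? plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", comment.strip())
    return [p for p in parts if p]


def categorize_comments(
    comments: Dict[str, str],
    classify: Callable[[str], Set[str]],
) -> Dict[str, List[Tuple[str, Set[str]]]]:
    """For each comment, label every sentence with the issues it raises."""
    return {
        cid: [(s, classify(s)) for s in split_sentences(text)]
        for cid, text in comments.items()
    }


# Hypothetical keyword rules, used only as a stand-in for a trained classifier.
HYPOTHETICAL_KEYWORDS = {
    "MobilMgt": ("mobility management",),
    "TechAsstTrain": ("training", "technical assistance"),
}


def keyword_classifier(sentence: str) -> Set[str]:
    """Return every issue whose keywords occur in the sentence (may be empty)."""
    s = sentence.lower()
    return {
        issue
        for issue, keywords in HYPOTHETICAL_KEYWORDS.items()
        if any(k in s for k in keywords)
    }
```

Any function with the same signature as `keyword_classifier`, for example a trained SVM wrapped to return a set of labels, could be passed to `categorize_comments` instead.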
Plan for the Talk
Background
CeRI FTA Grant Circulars Corpus
– Difficulties
– Interannotator agreement results
Text Categorization Experiments
FTA Grant Circulars Rule
Topic: guidance to public and private transportation providers applying for federal aid for elderly, disabled and low income persons
267 comments
– shortest: 1 sentence
– longest: 1420 sentences
11,094 sentences total
FTA Grant Circulars Issue Set
17 top-level issues
39 fine-grained issues
vs. Kwon & Hovy (2007)
Difficulties for Text Categorization
Large, hierarchical issue set
“NONE” category
Skewed distribution across issues
– 87% of the sentences are from 6 categories
– 13% of the sentences are from 33 categories
Potentially multiple issues per sentence
Even long sentences contain few words
Variation in comment quality, scope, vocabulary and form
The Annotators
Interannotator Agreement
146 comments used for the study
6 annotators
2.66 annotators per comment
41.5 sentences per comment
Overlap agreement measure
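The slide names an overlap agreement measure without defining it. One plausible reading, sketched below purely as an assumption, is that two annotators agree on a sentence when their issue sets share at least one issue, or when both assigned no issue at all. The data layout and function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set


def labels_overlap(a: Set[str], b: Set[str]) -> bool:
    """Assumed criterion: both empty (NONE), or at least one shared issue."""
    if not a and not b:
        return True
    return bool(a & b)


def overlap_agreement(annotations: Dict[str, Dict[str, Set[str]]]) -> float:
    """annotations[sentence_id][annotator_id] is that annotator's issue set.

    Returns the fraction of annotator pairs, over all co-annotated
    sentences, whose label sets overlap.
    """
    agree = 0
    total = 0
    for by_annotator in annotations.values():
        for a, b in combinations(by_annotator.values(), 2):
            total += 1
            agree += labels_overlap(a, b)
    return agree / total if total else 0.0
```

Because comments in the study had a varying number of annotators (2.66 on average), the sketch counts every annotator pair per sentence rather than assuming exactly two labels.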
Plan for the Talk
Background
– E-rulemaking
– Public comment analysis
CeRI FTA Grant Circulars Corpus
– Difficulties
– Interannotator agreement results
Text Categorization Experiments
– Fine-grained issues (39)
– Coarse-grained issues (17)
Standard Text Categorization Algorithms
Standard (flat) text categorization methods
• SVMs (0/1 loss)
• Maxent
• Naïve Bayes
Hierarchical text categorization methods
• cascaded classification, Dumais & Chen (2000)
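The cascaded approach listed above can be sketched as two stages: a top-level classifier picks a coarse issue, then a classifier trained only on that issue's sentences picks the fine-grained issue. The sketch below is an illustration under assumptions, not the configuration used in the experiments. It uses a small Naive Bayes learner (one of the flat methods listed) only to stay dependency-free, it predicts a single label per sentence, and the class names and toy labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over whitespace tokens."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = doc.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, doc):
        tokens = doc.lower().split()
        n_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocab)

        def log_score(label):
            counts = self.word_counts[label]
            total = sum(counts.values())
            score = math.log(self.label_counts[label] / n_docs)
            for t in tokens:
                score += math.log((counts[t] + 1) / (total + vocab_size))
            return score

        return max(self.label_counts, key=log_score)


class Cascade:
    """Stage 1 predicts the coarse issue; stage 2 predicts among its children."""

    def fit(self, docs, coarse, fine):
        self.top = NaiveBayes().fit(docs, coarse)
        self.children = {}
        for c in set(coarse):
            # Train each child model only on sentences of its coarse issue.
            sub = [(d, f) for d, cc, f in zip(docs, coarse, fine) if cc == c]
            sub_docs, sub_fine = zip(*sub)
            self.children[c] = NaiveBayes().fit(sub_docs, sub_fine)
        return self

    def predict(self, doc):
        c = self.top.predict(doc)
        return c, self.children[c].predict(doc)


# Toy training data with hypothetical coarse and fine labels.
docs = [
    "vans and buses are eligible expenses",
    "driver wages are eligible expenses",
    "the selection process must be competitive",
    "recipients need training on the process",
]
coarse = ["Eligibility", "Eligibility", "Process", "Process"]
fine = ["Elig.Vehicles", "Elig.Operating", "Proc.Selection", "Proc.Training"]
model = Cascade().fit(docs, coarse, fine)
```

A known trade-off of this design is that a mistake at the top level cannot be recovered at the second stage, since only the chosen branch's classifier is consulted.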
Cascaded Categorization
Gold Standard Data Set
Simulate agency comment analysis process
– One analyst / rule
Six data sets
– One data set / annotator
SVM Results with tf.idf Features
Best-Performing Fine-Grained Issues (Annotator 1)
[Figure: bar chart. x-axis: rule-specific issue (OtherNFreeElig, JARC_EligActiv, CompSelect, TechAsstTrain, MobilMgt, DesRecip, none); y-axis: % agreement, correct, or sentences covered, depending on bar color (scale 0 to 0.9); bars: categorization accuracy, interannotator agreement, data coverage.]
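The tf.idf features named in the SVM results above can be sketched as follows. The slides do not say which tf.idf variant was used, so this assumes the common form of raw term frequency times log(N / document frequency), with each sentence treated as a document. The function name and whitespace tokenization are illustrative choices.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(sentences: List[str]) -> List[Dict[str, float]]:
    """Return one sparse tf.idf vector (token -> weight) per sentence."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency: number of sentences containing each token.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


vectors = tfidf([
    "funding for rural transit",
    "funding for training",
    "rural funding",
])
```

Under this weighting a token that occurs in every sentence, such as "funding" in the toy input, gets weight zero, while a token unique to one sentence gets the largest weight.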
Progress and Plans
• Promising initial results for rule-specific issue categorization of public comments
– Annotate comments for more rules
– Expert (rulewriter) vs. law student annotation
– Integrate automatic text categorization into annotation interface
• Active learning (Purpura, Cardie & Simons, dg.o 2008)
• Collaboration with HCI colleagues in InfoSci
The End
For more on
– the hierarchical text categorization method
• Cardie et al. (dg.o 2008)
– a new structural learning approach for hierarchical classification
• Purpura et al. (in preparation)
– active learning methods for hierarchical text categorization
• Purpura, Cardie & Simons (dg.o 2008)
Minimizing the Costliest Errors**
**Underinclusive errors are the most costly
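One way to make the point about underinclusive errors concrete is to count the two error types separately and weight them differently. The sketch below reads "underinclusive" as an issue the sentence raises but the system misses, and "overinclusive" as an issue the system assigns that the sentence does not raise. That reading, the function name, and the weights are all assumptions made for illustration.

```python
from typing import List, Set, Tuple


def error_cost(
    gold: List[Set[str]],
    predicted: List[Set[str]],
    under_weight: float = 5.0,  # hypothetical penalty per missed issue
    over_weight: float = 1.0,   # hypothetical penalty per spurious issue
) -> Tuple[int, int, float]:
    """Return (underinclusive count, overinclusive count, weighted cost)."""
    under = sum(len(g - p) for g, p in zip(gold, predicted))
    over = sum(len(p - g) for g, p in zip(gold, predicted))
    return under, over, under_weight * under + over_weight * over


gold = [{"MobilMgt"}, {"DesRecip", "CompSelect"}, set()]
predicted = [{"MobilMgt", "DesRecip"}, {"DesRecip"}, set()]
under, over, cost = error_cost(gold, predicted)
```

With asymmetric weights like these, a system tuned to minimize the cost would prefer to flag a borderline sentence for an analyst rather than risk missing a substantive issue.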