text classification in electronic discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdftext...
TRANSCRIPT
2/23/12
1
Text Classification in Electronic Discovery
Dave Lewis, Ph.D. David D. Lewis Consulting, LLC
www.DavidDLewis.com Slides for lecture via Skype on February 23, 2012 in LBSC 708X/INFM 718X (Seminar on E-Discovery, Univ. of Maryland).
1 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
Disclaimer
• I am not a lawyer • Nothing here should be considered legal
advice • If you need legal advice, consult a lawyer
• Repeat as needed
2 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
2/23/12
2
Disclaimer 2
• I have a variety of interests in this area – Co-founder of TREC Legal Track – Publishing and public speaking in attempt to
influence evolving legal standards – Consulting expert and expert witness on legal
cases involving e-discovery – Algorithm designer for an e-discovery vendor
3 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
Outline
• What is e-discovery? • IR technologies in e-discovery
– Text retrieval – Text classification – Effectiveness evaluation
• Summary
4 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
2/23/12
3
E-Discovery in Black and White
• A document is – Preserved or not – Reviewed or not – Listed in privilege log or not – Turned over as responsive or not – Presented in court or not
• As evidence of a particular fact or not
• Law tries to define actions unambiguously – That means classification
5 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
Copyright 2001-2010, David D. Lewis Consulting. All rights reserved. 6
Text Classification
• Deciding which of several predefined classes (groups) a text belongs to
• Crudest form of understanding text • BUT…current technology often can do
well • AND…many tasks can be viewed as
classification – Particularly tasks of organizing information
2/23/12
4
Copyright 2001-2010, David D. Lewis Consulting. All rights reserved. 7
Advantages of Viewing a Task As Classification
• Classifiers are good plug-in modules – “Text in/class out" a simple interface
• Finite set of outputs a good basis for – Conditioning action on output – Counting, correlation, etc.
• Automated applying and learning of classifiers – Software reusable across classification tasks
• Clarifies nature of task, successes, failure modes – Straightforward numerical measures of quality
Copyright 2001-2010, David D. Lewis Consulting. All rights reserved. 8
Organizing Sets of Classes
• Binary classification • Three or more classes
– Multiclass – Multilabel – Ordinal – Hierarchical
• Probabilistic and graded classification
2/23/12
5
CULLING REVIEW
PRESENTATION
Other parties
classification in e-discovery
MANAGE- MENT
9 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
which documents to keep?
MANAGE- MENT
10 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
2/23/12
6
CULLING
which documents to review?
11 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
REVIEW
which documents must we turn over?
12 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
2/23/12
7
REVIEW
what types of documents do we have?
13 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
PRESENTATION
which documents do we show in court?
14 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
2/23/12
8
Arkfeld, Michael. Arkfeld on Electronic Discovery and Evidence, 2nd edition, Feb. 2010. 15 Copyright 2001-2010, David D. Lewis
Consulting. All rights reserved.
Approaches to Classification
• Interactive manual classification – Search and save documents
• Interactive manual classifier creation – Search and save query
• Supervised learning of classifier – Review documents – Computer creates query
16 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
2/23/12
9
Classification "System"
Search and
Save
17 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
Classification "System"
Search and
Save
18 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
2/23/12
10
Search and Save • Strengths:
– See interesting docs first – Leverage search algorithms and interfaces
• Weaknesses – Personnel must have both case knowledge and
search skills – Documents arrive in bursty fashion
• Constantly need these highly trained people – Human variability – Can't show later why doc was or wasn't saved
19 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
Classification System
Query = Classifier
Search and
Save Query
(Manual Classifier
Construction) 20 Copyright 2001-2010, David D. Lewis
Consulting. All rights reserved.
2/23/12
11
Classification System
Classifier
21 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
Manual Classifier Construction in Culling
• 1-2 orders of magnitude reduction typically quoted
• Traditionally simple Booleans • Developed iteratively • Sometimes statistical evaluation of
effectiveness
Things are moving very rapidly in this area 22 Copyright 2001-2010, David D. Lewis
Consulting. All rights reserved.
2/23/12
12
Classification System
Classifier
23 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
Classification System
Classifier
24 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
2/23/12
13
Classification System
Classifier
25 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
Research Questions in Manual Classifier Construction for E-Disco • Culling:
– Optimizing large-grained selection – Detecting redundancy – Summary representations – How good are people at this? – How can we help them?
• Much from distributed search applies!
26 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
2/23/12
14
Manual Classifier Construction
• Strengths – Classifier can be run as new docs arrive – Explicit justification of decisions
• Weaknesses – Must justify why classifier built way it was – New docs may differ from old ones, requiring
ongoing modification of classifier • Haven't escaped need for single search/case
expert
27 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
28 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
2/23/12
15
X
X ✔ ✔
29 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
X
X ✔ ✔
Supervised Learning
30 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
2/23/12
16
Supervised Learning for Text Classification
• People manually classify documents • Computer searches for a rule that
– Agrees with most manual classifications – Is not too "complex"
• Uses that rule to classify future documents – Rules typically associate numeric weights with
words and other document features
31 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
REVIEW
32 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
2/23/12
17
REVIEW
33 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
REVIEW
$ $ $ $
$
$ $
$
34 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
2/23/12
18
REVIEW
?
35 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
REVIEW
?
responsiveness?
privilege?
36 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
2/23/12
19
REVIEW
?
37 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
Other Favorable Factors for Supervised Learning
• Review is a recall-oriented task – Classifier likely to need thousands of words – Very hard to create by hand
• Many, weak, disparate predictors – Text, custodian, time, place, organizational
structure, file type, message headers,... – Again very hard to manually use
• Objective reason the classifier is way it is – "The learning algorithm said so"
38 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
2/23/12
20
X
X ✔ ✔
Classification System
Classifier
X
✔ ✔ ✔ ✔
✔?
✔
✔?
39 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
X
X ✔ ✔
Classification System
Classifier
X
✔ ✔ ✔ ✔
✔?
✔
✔?
40 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
2/23/12
21
X
X ✔ ✔
Classification System
Classifier
X
✔ ✔ ✔ ✔
✔
✔
✔0.98 X X X
✔ ✔ ✔ ✔ ✔ ✔
X
✔
0.93
0.89 41 Copyright 2001-2010, David D. Lewis
Consulting. All rights reserved.
Unusual Aspects of E-Discovery for Supervised Learning
1 Ultra high recall, moderate precision 2 Extremely diverse document sets 3 Duplicate documents 4 Document families 5 Multiple assessors 6 Some categories conditional on others 7 Goal of assessment is review, not training
--May be unwilling to evaluate documents for categories if not responsive
All of these are pains / research opportunities
42 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
2/23/12
22
X
X ✔ ✔
Classification System
Classifier
X
✔
X X X
✔ ✔ ✔ ✔ ✔ ✔
X
0.50 ? 0.50
0.50
?
?
8. active learning
43 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
X
X ✔ ✔
Classification System
Classifier
X
✔
0.98 X X X
✔ ✔ ✔ ✔ ✔ ✔
X
0.80
0.76 BA
C
9. diversity-biased classification
44 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
2/23/12
23
X
X ✔ ✔
Classification System
Classifier
X
✔
✔
✔0.98 X X X
✔ ✔ ✔ ✔ ✔ ✔
X
✔
0.93
0.89
10. transduction
45 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
Outline
• What is e-discovery? • IR technologies in e-discovery
– Text retrieval – Text classification – Effectiveness evaluation
• Why I love e-discovery
46 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
2/23/12
24
Copyright 2001-2010, David D. Lewis Consulting. All rights reserved. 47
The Joy of E-Discovery • All that SIGIR/TREC/etc. IR geek stuff...
– Supervised learning from huge amounts of training data
– High recall search – Careful statistical measurement of
effectiveness – Obsessed tuning to get effectiveness up
• Matters, works, and saves people great drudgery and cost
Summary
• E-discovery is an application where IR really matters – Current technology can help a lot – Advances would help even more – Rigor matters: write, program as if you might
have to testify about it
48 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.
2/23/12
25
Thanks!
• I'm always happy to answer questions:
• Sign up if you'd like to be on list to receive updated copies of this talk
49 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.